Predicting Textual Complexity for Elementary School Students

This thesis investigated the viability of predicting textual complexity from short texts written by primary school students. Linguistic features were extracted from texts in the BasiScript corpus using T-Scan and analyzed using Multiple Linear Regression Analysis and Principal Component Analysis. Although collinearity means the Multiple Linear Regression results cannot be shown to be correct for the individual features, a strong effect size was found both for the full set of features (R² = .68) and for a subset of 50 features (R² = .55). In other words, approximately 68 percent of the variability in textual complexity can be predicted using the full set of features, and approximately 55 percent using the subset. Multiple Linear Regression Analysis using only five selected Principal Components showed a moderate effect size (R² = .43). Additionally, the first few Principal Components showed structure in their highest-contributing features: features related to word complexity, concreteness, relational cohesion and relational coherence contributed relatively strongly. These results suggest that textual complexity can be predicted to a certain extent. A follow-up study should investigate how best to select features so that collinearity is removed while predictive power is retained. The results of this study suggest that this is a feasible and logical next step.
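The comparison described above, regression on all features versus regression on a handful of Principal Components, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic (not BasiScript or T-Scan output), the helper names `r_squared` and `principal_components` are hypothetical, and the first five components are used, whereas the thesis does not state here how its five components were selected. It is not the analysis pipeline used in the thesis.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X, with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def principal_components(X, k):
    """Scores of the rows of X on the first k principal components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each feature
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:k].T


# Synthetic stand-in data (assumption): 50 collinear features driven by
# 8 latent factors, and a "complexity" score that depends on 3 of them.
rng = np.random.default_rng(0)
n_texts, n_latent, n_features = 500, 8, 50
latent = rng.normal(size=(n_texts, n_latent))
loadings = rng.normal(size=(n_latent, n_features))
X = latent @ loadings + 0.3 * rng.normal(size=(n_texts, n_features))
y = latent[:, :3].sum(axis=1) + rng.normal(size=n_texts)

r2_all = r_squared(X, y)                           # all 50 features
r2_pc5 = r_squared(principal_components(X, 5), y)  # first 5 components only

print(f"R^2, all features:     {r2_all:.2f}")
print(f"R^2, first 5 components: {r2_pc5:.2f}")
```

Because the component scores are linear combinations of the original features, the R² of the component model cannot exceed that of the full model. What it gives up in explained variance it gains in uncorrelated predictors, which is the trade-off the abstract points to.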
Faculteit der Sociale Wetenschappen