Lexical Features of Text Complexity: the case of Russian academic texts

The work presented in this paper is part of an ongoing project that investigates the features of academic texts indicative of their complexity at different grade levels. In this study we examine the comparative complexity of Social Science texts used in Russian secondary and high schools. Based on the metrics of ten descriptive and four lexical features assessed for seven classroom textbooks, we claim lexical diversity, frequency, abstractness and the number of terminological units to be statistically significant predictors of text complexity. The Corpus, with a total size of over 160,000 tokens, comprises two sets of textbooks ranging from grade 5 to grade 11, which provides a satisfactory level of representativeness and, as such, a solid foundation for the statistical validity of the results. We employ RusAC, an online text analyzer, to compute the lexical features of texts, and the effect of the four lexical features on text complexity is confirmed with a mixed analysis of variance. The study fills a gap both in corpus linguistics, as regards a systematic approach to Russian academic texts, and in text complexity studies, as regards the description of secondary and high school textbooks.


Introduction
Although it has been the focus of numerous studies for over fifty years, the problem of assessing the linguistic complexity of Russian texts is still viewed as theoretically valuable [1][2][3]. Research in the area is aimed at designing an algorithm that identifies a "target reading audience" and at validating a list of text features which affect text complexity. The latter is especially significant nowadays due to the increased information flow and cognitive density of modern academic texts [4]. The three lexical features with the highest impact on academic text complexity validated in recent studies are lexical diversity, frequency and abstractness [5]. The total count of terms is viewed as an additional predictor of text complexity in studies on reading comprehension [6,7].

Readability
Readability determines the level of reading ease of a text and is measured solely on the basis of quantitative parameters: 1) the number of sentences in the text, from which the average sentence length is derived; 2) the average number of syllables per word. The most popular readability formula, the Flesch-Kincaid Grade Level, ranks the appropriateness of a text for a certain school grade [8]:

$$\mathrm{FKG} = 0.39 \cdot \mathrm{ASL} + 11.8 \cdot \mathrm{ASW} - 15.59, \quad (1)$$

where 'ASL' stands for average sentence length (words per sentence) and 'ASW' stands for average syllables per word. The index obtained corresponds to a grade level.
The equation adapted for the Russian language and validated in numerous studies has proved reliable when applied to academic texts [9]. However, quantitative readability measures, i.e. average sentence and word length, although they enable researchers to compare descriptive text metrics and are applied quite ubiquitously, do not allow texts to be compared in an objective way. They ignore numerous text characteristics which can influence readability. A word, though long, may be used so frequently in the discourse that it presents no difficulty for readers. For example, the words international and morphological have the same ASW, i.e. five syllables, but their frequencies registered in COCA are strikingly different: 158.89 vs 1.54 per million [10]. Thus, due to its frequency the word international presents much less difficulty for an average potential reader than the word morphological. Studies aimed at extending the list of 'qualitative' features affecting text complexity have been going on for over a century. Nowadays researchers integrate text features, estimating not only descriptive metrics but also morphological (parts of speech, distribution, etc.), lexical, syntactic and discourse parameters [3].
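The Flesch-Kincaid Grade Level discussed above can be sketched in a few lines. This is a minimal illustration, assuming the standard English coefficients (0.39, 11.8, 15.59) and a rough vowel-group heuristic for syllable counting; the function names are hypothetical, and neither the heuristic nor the coefficients are meant for Russian text.

```python
import re


def count_syllables(word: str) -> int:
    """Rough English estimate: one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level from ASL and ASW (standard English coefficients)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    asl = len(words) / len(sentences)                          # average sentence length
    asw = sum(count_syllables(w) for w in words) / len(words)  # average syllables per word
    return 0.39 * asl + 11.8 * asw - 15.59


# Both example words get the same syllable count, so a purely
# syllable-based measure cannot tell them apart.
print(count_syllables("international"), count_syllables("morphological"))
```

The last line illustrates the limitation noted above: word length alone says nothing about how familiar a word is to the reader.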
Defining 'progression towards a more 'academic' style', which is in fact progression towards a higher degree of complexity, D. Biber [14] points to higher counts of nouns and noun groups, fewer verbs and verb groups, more nominalisations of verbs and adjectives, and a greater number of abstract nouns and long words. D. Biber also validates a high level of 'informational density' of Social Science texts realized in the above-mentioned morphological and lexical categories [14].

Lexical features
Vocabulary range and vocabulary knowledge appeared on the list of text complexity parameters as early as the 1900s. Researchers then and now have viewed vocabulary features as fundamental to reading comprehension, and the correlation between a person's vocabulary size and reading comprehension is an acknowledged fact [15].
Type-token ratio (TTR) has been widely used to assess the lexical diversity of texts since 1957, when M. Templin introduced it [16]:

$$\mathrm{TTR} = \frac{\text{word types}}{\text{word tokens}},$$

where 'word types' are unique, i.e. not repeated, words in a text and 'word tokens' are the total number of words in a text.
In the early 2000s research showed that the distribution of unique words is not linear across corpora of various sizes, because words tend to repeat themselves: the bigger the corpus, the more repeated words it comprises. Hence, the number of unique words grows only slowly as the corpus is enlarged, which causes extra difficulties when comparing corpora of different sizes. Thus, it was suggested that TTR be assessed per 1000 words [14].
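A minimal sketch of TTR and of one possible size-normalized variant follows. Averaging TTR over consecutive non-overlapping windows of 1000 tokens is an assumption about how "TTR per 1000 words" could be computed, not a procedure confirmed by the source; the function names are hypothetical and tokens are assumed to be already lowercased or lemmatized.

```python
def type_token_ratio(tokens):
    """Share of unique tokens (types) among all tokens."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    return len(set(tokens)) / len(tokens)


def mean_windowed_ttr(tokens, window=1000):
    """Average TTR over consecutive non-overlapping windows of `window` tokens.

    A trailing partial window is dropped so that every window has the same size.
    """
    chunks = [tokens[i:i + window]
              for i in range(0, len(tokens) - window + 1, window)]
    if not chunks:
        raise ValueError("need at least `window` tokens")
    return sum(type_token_ratio(chunk) for chunk in chunks) / len(chunks)


print(type_token_ratio(["a", "b", "a", "c"]))  # 3 types / 4 tokens = 0.75
```

Fixing the window size keeps the measure comparable between a short 5th grade sampling and a much longer 11th grade one.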
One of the first text complexity formulas, the Dale-Chall formula [17], employs vocabulary lists to rate books for grade levels. The ratio of listed words in a text provides the data to measure complexity of a text and correlate it with a grade level. In 1981, Anderson & Freebody also claimed the ratio of difficult words in a reading text to be the best predictor of text complexity [18].
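The word-list idea behind such formulas can be sketched as the share of tokens that fall outside a list of familiar words. The tiny list below is a made-up placeholder, not the actual Dale-Chall list, and the function name is hypothetical.

```python
def unfamiliar_word_ratio(tokens, familiar_words):
    """Share of tokens that do not appear in the familiar-word list."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    unfamiliar = [t for t in tokens if t.lower() not in familiar_words]
    return len(unfamiliar) / len(tokens)


# Placeholder list for illustration only.
familiar = {"the", "cat", "sat"}
print(unfamiliar_word_ratio(["The", "cat", "morphological", "sat"], familiar))  # 0.25
```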
Another feature directly influencing text complexity is lexical frequency: the more high-frequency words are used in a text, the easier it is for the reader (cf. international and morphological above). Research validating frequency as a function of complexity has been integrated into corpus linguistics and implemented in online services such as Lexile [19].
Russian frequency indices are estimated with the help of the Frequency dictionary [20] and are also successfully used to assess the complexity of Russian texts [21][22].
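One simple way to turn a frequency list into a text-level index is to average the per-million frequencies of the tokens. This is only a sketch: the source does not specify how its frequency index is computed, the function name is hypothetical, and treating unknown words as frequency 0 is an assumption. The two sample values are the COCA figures quoted earlier in the text.

```python
def mean_frequency(tokens, freq_per_million, default=0.0):
    """Average per-million frequency of the tokens; unknown words get `default`."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    total = sum(freq_per_million.get(t.lower(), default) for t in tokens)
    return total / len(tokens)


# Frequencies per million words, as cited in the text for COCA.
coca_sample = {"international": 158.89, "morphological": 1.54}

print(mean_frequency(["international"], coca_sample))  # 158.89
print(mean_frequency(["morphological"], coca_sample))  # 1.54
```

Under this scheme a higher index corresponds to more familiar vocabulary, so it is expected to fall as text complexity grows.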
In cases when frequency lists are unavailable or corpora lack frequency annotation, researchers resort to simpler lexical metrics, e.g. the number of terminological units or nomenclature in a text, to assess its complexity. R.V. Mayer (2016) introduces a new notion of text complexity, didactic complexity, which rests on the number of terms in a text along with the count of mathematical symbols and the information density of the text [23].
A significant number of models and ideas have been developed to estimate text abstractness since abstractness, or the degree of abstractness, was validated as a metric of text complexity. Among the most popular are numerical indices or ratings of abstractness and scales of abstractness/concreteness of different ranges: from 1 to 5 [24], 0 to 9 [25], 1 to 7 [26].
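Because the scales cited above use different ranges, ratings taken from different sources are not directly comparable. One simple option, offered here as an assumption and not as a procedure described in the source, is min-max rescaling of every rating to the interval from 0 to 1; the function name is hypothetical.

```python
def rescale_rating(value, low, high):
    """Map a rating from the scale [low, high] onto [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if not low <= value <= high:
        raise ValueError("value lies outside the scale")
    return (value - low) / (high - low)


# The midpoints of the three scales all map to 0.5.
print(rescale_rating(3, 1, 5), rescale_rating(4.5, 0, 9), rescale_rating(4, 1, 7))
```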
In summary, text complexity is a developing notion, not a well-defined concept. Though the features presented above estimate text complexity based on conventional metrics only, they provide a fine-grained assessment of the age, cognitive and linguistic appropriateness of a text.

Methods and Material
In this paper, we aim to identify the effect of a cluster of lexical parameters on academic text complexity. The set of lexical parameters comprises (1) lexical diversity, (2) number of terminological units, (3) word frequency and (4) abstractness of text vocabulary. The research focuses on quantifying differences across grade levels (5-11), thus providing the data to automate text complexity assessment.
The research data are 70 academic texts extracted from the school textbooks "Social science, Grades 5-11" [27] with a total size of over 160,000 tokens. We used the online service RusAC [22] to compute the lexical features of the texts and performed a preliminary mixed analysis of variance (Spearman) to define the effect of the four lexical features.
The research data are Social Science student textbooks written by Bogolyubov (2012-2014) for grades 5-11 and recommended for all Russian schools by the Ministry of Education.

Analysis
At Stage 1 we compiled the Russian Social Science Academic Corpus (RSSAC) as a subcorpus of the Russian Academic Corpus [24] and estimated the following: (1) the size of RSSAC; (2) the size of each textbook in RSSAC; and (3) the size of 10 samplings from each textbook. All the samplings from the textbooks were coded with the grade number, the subject title and the number of the sampling, e.g. '5SS1', where '5' stands for 'the 5th grade', 'SS' for 'Social Science' and '1' for Sampling #1 (see Table 1). The problem of corpus representativeness is associated not only with its size but also with its quality and genre range [14]. The Russian Social Science Academic Corpus used for the current study is defined as representative on the grounds that it represents a certain language variety, i.e. the texts of classroom books used to teach Social Science in Russia (https://4ege.ru/obrazovanie/60190-utverzhden-federalnyj-perechen-uchebnikov-na-5let.html). We also defined the size of the sampling in each textbook based on the formula designed by D. Biber [14], where T(s) is the number of tokens in a sampling and T(t) is the number of tokens in a textbook. The size of samplings varies from 1008 tokens in the 5th grade to 5280 tokens in the 11th grade.
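The coding scheme for the samplings can be reproduced with a short sketch. The function name is hypothetical; the scheme itself (grade number, 'SS', sampling number) and the counts of seven textbooks and 10 samplings per textbook come from the text.

```python
def sampling_code(grade, sample_no, subject="SS"):
    """Build a code such as '5SS1': grade number, subject title, sampling number."""
    return f"{grade}{subject}{sample_no}"


# Seven textbooks (grades 5-11) with 10 samplings each give 70 texts.
codes = [sampling_code(grade, n) for grade in range(5, 12) for n in range(1, 11)]

print(len(codes), codes[0], codes[-1])  # 70 5SS1 11SS10
```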

Research results
The results of the research are presented in Figs. 1-4. Fig. 1 demonstrates a steady growth of FKG from 6.26 (6th grade) to 10.12 (11th grade). However, as we can see, the readability rates of the texts are in many cases below the corresponding proficiency level of the target audience. The frequency level of the texts under study decreases as their complexity grows (see Fig. 2). The frequency variable lies within the range from 143.79, the highest, in the 6th grade to 104.65, the lowest, in the 11th grade.
The type-token ratio for normalized text extracts (1000 words) is within the range from 0.61 (9th grade textbook) to 0.64 (7th and 10th grade textbooks) (see Fig. 3). This means that from 61% to 64% of all words used in the texts are unique, which is viewed as average for this text type [9]. The abstractness indices are also indicative of the overall increasing text complexity (see Fig. 4).
The fluctuations of the graph can be explained by the fact that abstractness of the narration is not always achieved exclusively by abstract lexical units. According to the analysis data, the distribution of parts of speech also changes as the grade rises. The rate of adjectives in the texts rises from 0.128 for the 5th grade to 0.180 for the 11th grade. The average rate of adverbs, on the contrary, decreases from 0.047 for the 5th grade textbook to 0.035 for the 11th grade textbook. Similarly, the average rates of nouns and pronouns increase while the rate of verbs falls, which indicates an increase in the abstractness of the texts. The rate of nouns changes from 0.342 (5th grade textbook) to 0.409 (11th grade textbook) and the rate of pronouns from 0.097 to 0.106. The rate of verbs changes from 0.167 to 0.114 respectively. Thus, abstractness increases from grade to grade not only because of the number of abstract words in the text, but also due to changes in morphological distributions.
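Part-of-speech rates of the kind reported above are simple proportions over tagged tokens. The sketch below assumes the tags are already available from some tagger; the tag labels and function names are hypothetical. The noun-to-verb ratio is not a metric used in the source and is added only to make the reported shift easier to see.

```python
from collections import Counter


def pos_rates(tags):
    """Proportion of each part-of-speech tag among all tagged tokens."""
    if not tags:
        raise ValueError("tags must not be empty")
    counts = Counter(tags)
    return {tag: n / len(tags) for tag, n in counts.items()}


def noun_verb_ratio(noun_rate, verb_rate):
    """How many nouns occur per verb, given the two rates."""
    if verb_rate <= 0:
        raise ValueError("verb_rate must be positive")
    return noun_rate / verb_rate


print(pos_rates(["NOUN", "NOUN", "VERB", "ADJ"]))

# Rates reported in the text for the 5th and 11th grade textbooks.
print(round(noun_verb_ratio(0.342, 0.167), 2))  # about 2.05
print(round(noun_verb_ratio(0.409, 0.114), 2))  # about 3.59
```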
The average number of terms per text also tends to increase across grades, though a fluctuation in the 10th grade may testify to the revising character of the reading material in the book (see Fig. 5). The total number of terms ranges between 243 (6th grade) and 2844 (10th grade). The average number of terms gradually increases from 18 (6th grade) to 167 (11th grade). Though the average number of terms for the 10th grade textbook is lower than in both the 9th and 11th grade textbooks, this can also be explained by the specific topics selected by the authors for each textbook. While texts for the 9th grade are focused on politics and the 11th grade texts are centered on economy and social stratification, texts for the 10th grade are focused mostly on social mechanisms and human activities. The latter are presented mostly with everyday words, not terminological nomenclature.
We also conducted a mixed analysis of variance (Spearman) to confirm the statistical significance of the features estimated (see Table 3). Spearman Rank Order Correlations are confirmed for the following text features: total number of words, total number of syllables, total number of sentences, average number of syllables per word, adjective rate, adverb rate, pronoun rate, noun rate, verb rate, word frequency, FKG, and TTR. The statistically significant metrics have a p value < 0.05.
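A Spearman rank order correlation between grade level and a text feature can be sketched in plain Python as the Pearson correlation of average ranks. The function names are hypothetical and the feature values below are made-up toy numbers, not the study's data; the sketch returns only the coefficient and does not compute p values.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


grades = [5, 6, 7, 8, 9, 10, 11]
toy_feature = [0.30, 0.32, 0.31, 0.35, 0.37, 0.36, 0.41]  # toy values only
print(spearman_rho(grades, toy_feature))
```

Working on ranks makes the coefficient sensitive only to whether a feature rises or falls with the grade, not to the size of the steps between grades.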

Conclusion
In this paper we have presented a multi-factor analysis of seven Russian textbooks on Social Science. The analysis of 14 text features performed with the help of RusAC, an online tool designed to assess conventional, morphological and lexical metrics, indicated a statistically significant correlation of text complexity with lexical diversity, frequency, abstractness and the number of terminological units. The findings lead us to believe that RusAC is a useful tool for researchers, teachers, and test developers. The results of the research can be applied to match academic texts and test materials with potential target readers. We view identifying the syntactic text parameters affecting text complexity as a direction for further research.