Creative Education
2012. Vol.3, No.8, 1320-1325
Published Online December 2012 in SciRes (
Copyright © 2012 SciRes.
Building a Better Mousetrap: Replacing Subjective Writing
Rubrics with More Empirically-Sound Alternatives for EFL
Andrew D. Schenck, Eoin Daly*
English Education, Department of Liberal Arts Education (LAEC), Ju Si-Gyeong College, Pai Chai University,
Daejeon, South Korea
Email:, *
Received October 2nd, 2012; revised November 5th, 2012; accepted November 9th, 2012
Although writing rubrics can provide valuable feedback, the criteria they use are often subjective, which
compels raters to employ their own tacit biases. The purpose of this study is to see if discreet empirical
characteristics of texts can be used in lieu of the rubric to objectively assess the writing quality of EFL
learners. The academic paragraphs of 38 participants were evaluated according to several empirically
calculable criteria related to cohesion, content, and grammar. Values were then compared to scores ob-
tained from holistic scoring by multiple raters using a multiple regression formula. The resulting correla-
tion between variables (R = .873) was highly significant, suggesting that more empirical, impartial means
of writing evaluation can now be used in conjunction with technology to provide student feedback and
teacher training.
Keywords: Writing Rubrics; Writing Evaluation; Cohesion; Grammar; Word Frequency
Several studies recognize the efficacy of the rubric as a
means to score writing and provide feedback (Cope, Kalantzis,
McCarthey, Vojak, & Kline, 2011; Mansilla, Duraisingh, Wolfe,
& Haynes, 2009; Peden & Carroll, 2008). A study by Beyreli
and Ari (2009), for example, found that it could be accurately
used to assess ten properties related to structure, language, and
organization with a fair degree of inter-rater reliability (from
65% to 81%). Another study revealed that it could be used to
evaluate writing holistically, regardless of the participants’ L1
(Sévigny, Savard, & Beaudoin, 2009). Recent adaptations of
the rubric have even discovered the potential to increase forma-
tive feedback through the use of both technology and self-as-
sessment strategies (Cope, Kalantzis, McCarthey, Vojak, & Kline,
2011; Peden & Carroll, 2008).
While rubrics can provide a systematic means to evaluate
student writing, their reliability and validity can be questionable.
This is exemplified by recent studies, which reveal that rater
bias and invalidity of writing assessments are negatively impac-
ting summative student evaluation (Graham, Hebert, & Harris,
2011: p. 10; Johnson & VanBrackle, 2012). To overcome these
shortcomings, educators have advocated the use of more au-
thentic assessment methods such as self-assessment checklists,
writing conferences, and writing portfolios (Schulz, 2009).
Current problems with reliability and validity of the writing
rubric may be caused by the subjectivity of rubric criteria. As
pointed out by Fang and Wang (2011), such criteria contain ex-
pressions such as “exceptionally clear”, “effectively organized”,
“carefully chosen”, and “strong control”, which force teachers to
“rely on their own intuition and discursive knowledge in mak-
ing judgment calls” (Fang & Wang, 2011: p. 148). In reality,
this use of vague, subjective descriptors for different categories
of writing reflect a deficiency in understanding of what consti-
tutes good writing. Exploration of more objective, empirical
measures of writing quality may improve this understanding,
thereby allowing for the development of more effective evalua-
tion techniques (Sévigny, Savard, & Beaudoin, 2009). The pur-
pose of this study, therefore, is to examine multiple empirical
criteria and their influence on overall writing quality.
Disparities between Writing Rubrics
Many educators have attempted to increase the validity and
reliability of writing evaluation through the development of ru-
brics. Although they are a useful step forward, key limitations
remain. One of the largest problems with such rubrics is the
subjectivity and ambiguity of language they contain. Holistic
rubrics, for example, which rely upon general impressions of
quality based upon descriptors contained within each proficien-
cy level, often contain vague language which masks the signifi-
cance of results and lessens the potential for washback (Brown,
2004). Consider the following examples contained within levels
4 and 5 of the Test of English as a Foreign Language (TOEFL)
rubric for academic writing (Educational Testing Service, 2008):
Criteria for Rubric Level 4
1) Addresses the topic and task well, though some points
may not be fully elaborated.
2) Is generally well organized and well developed, using ap-
propriate and sufficient explanations, exemplifications, and/or
3) Displays unity, progression, and coherence, though it may
contain occasional redundancy, digression, or unclear connec-
*Corresponding author.
4) Displays facility in the use of language, demonstrating syn-
tactic variety and range of vocabulary, though it will probably
have occasional noticeable minor errors in structure, word form,
or use of idiomatic language that do not interfere with meaning.
Criteria for Rubric Level 5
1) Effectively addresses the topic and task.
2) Is well organized and well developed, using clearly ap-
propriate explanations, exemplifications, and/or details.
3) Displays unity, progression, and coherence.
4) Displays consistent facility in the use of language, demon-
strating syntactic variety, appropriate word choice, and idio-
maticity, though it may have minor lexical or grammatical er-
As revealed by the words highlighted in bold text, criteria
within levels four and five can be decidedly subjective. Expres-
sions such as “effectively addresses the topic” or “addresses the
topic and task well”, for example, cannot be assigned an em-
pirical value, since they rely primarily on the opinion of an eva-
luator. Raters must use their own intuition to interpret whether
the text is “effective” by using their unique personal experi-
ences and cultural backgrounds. Other terms, such as “appro-
priate”, “sufficient”, “occasional”, and “probably”, are also am-
biguous, and may be interpreted differently depending upon
personal characteristics of the reviewer. It is this ambiguity that
requires extensive training to attain an acceptable level of inter-
rater reliability (Brown, 2004). Due to such problems with ho-
listic writing rubrics, it is imperative that more objective means
of evaluating writing are developed to increase reliability and
decrease the need for extensive training of multiple raters.
In addition to problems with subjectivity, rubric criteria often
evaluate disparate traits, making assertions of validity problem-
atic. The TOEFL IBT rubric, for example, evaluates factors such
as organization, unity, coherence, grammar, and idiomatic lan-
guage of the academic essay genre (Educational Testing Ser-
vice, 2008), while the American Council on the Teaching of Fo-
reign Languages (ACTFL) evaluates multiple genres in con-
junction with factors such as fluency, use of low-frequency
structures, and vocabulary (American Council on the Teaching
of Foreign Languages, 2012). Yet another writing rubric de-
signed by Peregoy & Boyle (2005) includes sentence variety in
addition to elements included in both the TOEFL and ACTFL
In summary, criteria within rubrics have a great deal of am-
biguity and disparity, which make determinations of validity
and reliability more problematic. Raters are compelled to inter-
pret criteria differently, leading to the introduction of personal
biases during the writing evaluation process. This behavior is
exemplified through a study by Hunter and Docherty (2011),
which reveals that raters use their own tacit expectations to
interpret and evaluate aspects of writing structure, content, and
expression. Because of problems with rubrics and associated
rater behaviors, more objective empirical measures of writing
quality are needed. Such measures can increase validity and re-
liability by ensuring that multiple raters assess precisely what is
prescribed within set criteria. Moreover, the use of these meas-
ures can allow for extensive automation of writing evaluation.
While some empirical measures of writing quality have now
been developed, they continue to yield EFL writing scores that
differ significantly from those assigned by human raters (Cho-
dorow & Burstein, 2004; Weigle, 2010). More study is needed
to understand how empirical criteria may be used to evaluate
the overall writing quality of diverse learners. The purpose of
this study, therefore, is to see if multiple empirical measures
can be used to accurately assess the quality of academic writing
composed by EFL learners.
Research Questions
1) Can empirical methods of text evaluation (e.g., cohesion,
content, and grammatical accuracy) be collectively used in lieu
of a holistic scale to accurately rate the writing of EFL learn-
2) How can empirical measures of writing quality be used to
improve evaluation and education in an EFL setting?
The purpose of this quasi-experimental study was to see if
writing quality could be accurately assessed through using em-
pirical measures of EFL writing. Due to the complexity of de-
veloping measures for multiple types of discourse, only one
genre, that of academic writing, was examined within this study.
Traditional evaluation methods, which required multiple raters
and a holistic scoring rubric, were statistically compared to
empirical values of writing quality.
This study included participants from two 6-month in-service
training programs for English teachers in Seoul, South Korea.
There were a total of 38 participants, all of whom spoke Korean
as their L1. All participants were middle and high school Eng-
lish teachers with extensive language training. They ranged in
age from 30 to 56 years old, and had extensive teaching ex-
perience that often surpassed 20 years.
To obtain writing samples, the participants were each asked
to write an academic paragraph about the cell phone’s influence
on society. Following the collection of the writings, they were
evaluated using a holistic writing rubric designed for the Test of
English as a Foreign Language (TOEFL) (Educational Testing
Service, 2008). Two native English-speaking EFL instructors
independently rated each writing using the holistic rubric. Sub-
sequently, scores were averaged to provide a benchmark for
comparison to empirical methods of evaluation. Both instruc-
tors had over 10 years of experience as EFL teachers in South
Korea, and had received extensive training in EFL assessment
at the graduate level.
To empirically determine the quality of each academic para-
graph, quantitative strategies for the analysis of cohesion, con-
tent, and grammatical accuracy were developed. These strate-
gies are further explained within the following sections.
Empirical Evaluation of Cohesion
Cohesion, which refers to relationships within a text that
make it appear unified, was operationally defined through the
work of Halliday and Hasan (1976), which asserts that cohesion
is maintained through lexical repetition, reference, conjunctions,
ellipsis, and substitution. Lexical repetition was evaluated ac-
cording to eight categories (Hasan, 1984: p. 202):
Copyright © 2012 SciRes. 1321
Type Example
1) Repetition leave, leaving, left
2) Synonymy leave, depart
3) Antonymy leave, arrive
4) Hyponymy travel, leave/arrive
(leave and arrive are included in the word, travel)
5) Meronymy hand, finger (finger is part of the hand)
6) Equivalence the doctor was their dad
7) Naming the dolphin was named flipper
8) Semblance the girl looked like an angel
According to these categories, examples within texts that
used versions of the same word, synonyms, and antonyms;
more specific forms of a word, called hyponyms (fork is a hy-
ponym of silverware); constituent parts of a word, called me-
ronyms (finger is a constituent of hand); or seemingly different
words to refer to the same object, as in the examples of equiva-
lence, naming, and semblance, were all tallied for each text.
The information was then entered into a database for analysis.
In addition to examples of lexical cohesion, conjunctions
(e.g., however, and, but, in contrast), which connect different
sentences or clauses; references (pronouns and determiners),
which denote a semantic relationship to other words within a
text; substitution, the insertion of a word or phrase for another;
and ellipsis, the omission of a word or phrase, were all tallied
for each text (Hoey, 1996).
After empirical values of cohesion were collected for a para-
graph, they were divided by the total number of words within
the respective paragraph from which they were taken. This en-
sured that text length did not skew the significance of the em-
pirical values.
Empirical Evaluation of Content, Fluency, and
Content was empirically evaluated by calculating lexical den-
sity, sentence length, the presence of low-frequency vocabulary,
and the presence of hard words in each text. First, lexical den-
sity, which describes the proportion of content words (nouns,
adjectives, verbs, and adverbs) to the total number of words,
may reveal how much information is contained within the text
(Johansson, 2008). Second, texts containing longer sentences
may reveal higher fluency and more sophistication of gramma-
tical features. Third, the frequency of low-frequency vocabu-
lary (vocabulary that appears less often within a corpus) and
hard words (words with three or more syllables) may indicate
that the student is using more sophisticated vocabulary. While
not all long words may be considered difficult (e.g., asparagus),
academic texts with longer words may reveal an overall trend
toward the use of more sophisticated vocabulary.
Free software programs were used to calculate lexical density,
sentence length, vocabulary frequency, and hard words. Sen-
tence length and vocabulary frequency could be determined th-
rough using the Lexile Analyzer freely available at,
while lexical density and hard words were determined by using
the free Text Analyser included at Since nei-
ther of these programs included all of the criteria for evaluating
text content, both programs were used.
Empirical Evaluati on of Grammar
Grammar was empirically assessed by tallying each error
within a text. Errors were further divided into the following ca-
tegories for more detailed analysis: prepositions, verb tense/
agreement, count/non-count, plurals/article, run-on sentence,
sentence fragment, word form, other.
After empirical values of grammar were collected for a para-
graph, they were divided by the total number of words within
the respective paragraph from which they were taken. This en-
sured that text length did not skew the significance of the em-
pirical values.
Data Analysis
After empirical values for cohesion, content, and grammar
were obtained for each academic paragraph, they were used to
predict inter-rater writing scores using the multiple regression
formula. Issues of multicollinearity were also examined, and
texts were qualitatively examined to interpret the significance
of the quantitative results.
Results and Discussion
After rating and averaging writing scores from two raters,
scores were compiled into a chart (Appendix A). Although rat-
ings on the TOEFL rubric range from 0 to 5, inter-rater scores
revealed a range from 1.5 to 4.5. Although writing evaluation
by individual raters was generally similar, there were some
differences, resulting in a moderate Cronbach’s alpha value (α
= .622). There was a normal distribution of scores along a
Gaussian curve, with most of the scores falling between 2.5 to
Comparison of inter-rater scores with empirical values of
cohesion, content, and grammar revealed substantially signifi-
cant results. Using multiple regression, the empirical values of
cohesion, content, and grammar correlated highly (R = .873) to
inter-rater determinations of writing quality (Table 1).
ANOVA results further confirm the significance of relation-
ships between variables, yielding an F score of 4.702 which is
significant to the .001 probability level. The high correlation
between the dependent variable (inter-rater scores) and inde-
pendent variables (empirical values of cohesion, content, and
grammar) suggests that purely quantitative assessments of writ-
ing may have a high degree of predictive validity. The R-square
value further indicates that 76.2% of the scores assigned th-
rough inter-rater evaluation can be explained using the indepen-
dent variables within this study.
Analysis of individual variables in the multiple regression
model yields a more holistic understanding of the results (See
Table 2). None of the variables of cohesion (lexical repetition,
reference words, or conjunctions) show a significant relation-
ship to inter-rater scores. This may mean that the empirical
method for calculating cohesion is problematic. It may also
signify that problems with cohesion were not prominent within
the texts, which were created by EFL middle and high school
English teachers with a great deal of experience. Albeit insig-
nificant, the positive t values for reference words and conjunct-
Table 1.
Multiple regression model summary.
Model R R SquareAdjusted R Square Std. Error of the Estimate
1 .873.762 .600 .48086
Copyright © 2012 SciRes.
Copyright © 2012 SciRes. 1323
Table 2.
Independent variables included in the multiple regression analysis.
Unstandardized Coefficients Std. Coefficient
B Std. Error Beta
t Sig.
(Constant) 16.283 4.048 4.022 .001
Lexical Repetition –.029 .030 –.163 –.941 .357
Reference Words .061 .037 .364 1.639 .115
Conjunctions .019 .048 .060 .391 .700
Sentence Length .061 .031 .271 1.968 .062
Word Frequency –3.216 1.088 –.547 –2.955 .007
Hard Words .003 .032 .019 .108 .915
Lexical Density –.042 .020 –.354 –2.127 .045
Preposition –12.054 16.540 –.109 –.729 .474
Subject Verb Agreement/Verb Tense –76.722 25.051 –.433 –3.063 .006
Count/Noncount –52.925 23.646 –.316 –2.238 .036
Plurals/Article –2.326 5.938 –.055 –.392 .699
Run-on Sentence –28.983 17.608 –.215 –1.646 .114
Sentence Fragment –13.819 37.914 –.057 –.364 .719
Part of Speech –41.029 21.404 –.272 –1.917 .068
Other –40.219 17.763 –.308 –2.264 .034
tions (t = 1.639; t = .391) suggest that these cohesive devices
were used more often as writing quality increased; the negative
t value for lexical repetition –.941, in contrast, suggests that a
larger number of repetition was employed in writings of lower
quality. From a qualitative perspective, learners who scored low
on the holistic writing rubric appeared to employ simple repeti-
tion (e.g., repeating the subject cell phone) much more often
than those with a high rubric score. In writings with higher
scores, sophisticated references (e.g., this form of technology, a
new device) and conjunctions (e.g., Moreover, In contrast, etc.)
were used more often than simple repetition. Unlike texts with
lower scores, the highest quality texts also included examples
of semblance, as in the sentence “A cell phone is a new door to
a new era.”
In the category of content, both word frequency and lexical
density were significantly correlated to inter-rater scores when
other independent variables were controlled for, yielding p
values of .007 an .045, respectively. Sentence length, while
insignificant at the .05 probability level, was nearly significant
(p = .062). In contrast to other variables within the content
category, hard words appear to have a miniscule influence on
inter-rater scoring, yielding an insignificant score of p = .915.
Overall, empirical measures of writing seem to accurately pre-
dict inter-rater scores. Positive t values of sentence length and
hard words (t = .108; t = 1.968) may suggest that both of these
factors increase as writing quality increases, while negative t
values of word frequency and lexical density (t = –2.955; t =
–2.127) suggest that both of these factors decrease in intensity
as writing quality increases. This supports qualitative observa-
tion of texts. Consider the following examples:
1) Participant 6 (Rating 4.5)
Cell phone is a new education tool that leads you to a new
educational process. Mobile learning is a topic that people are
interested in these days. You can get new information or learn
via cell phones from the web as well as from textbooks. Third,
its a new device to get whatever information you need by us-
ing wireless connection to the Internet.
2) Participant 12 (Rating 2)
Cell phone is used for educational purpose for their kids.
Cell phone can be books, dictionary, and teachers that give all
the information children need all the time. Cel l phone can be a
toy for their children.
Writings with higher ratings tended to have longer sentences,
which were lengthened using sophisticated conjunctions and
more difficult vocabulary. In example one, relative clauses,
prepositions, and conjunctions are freely employed (e.g., that
people, that leads, whatever information, by using, or), thereby
lengthening sentences and increasing writing quality. Moreover,
this text includes difficult vocabulary, such as device, wireless
communication, and mobile, increasing the sophistication of the
text. In the second example, conjunctions are hardly used, and
sentence length appears to be limited by the proficiency of the
learner. The example is more lexically dense because less gram-
matical features are used to make sophisticated sentences. Sim-
ple nouns, such as cell phone, are repeatedly used to convey
meaning, without the use of complex grammatical features.
Like content, the grammar category of empirical evaluation
has several independent variables that appear to predict inter-
rater scores of participant writings. The subject agreement/tense,
countable/uncountable, and miscellaneous “other” categories
(this category predominantly contained errors with gerund use),
were the most significant predictors of inter-rater writing scores,
yielding p values of .006, .036, and .034 respectively. Incorrect
use of part of speech was nearly significant at p = .068. When
viewed holistically, trends in grammar use reveal a distinct
pattern. Grammatical errors steadily decrease as scores rise from
1.5 to 4.5. Not only do errors within grammatical error catego-
ries steadily decrease, but the number of error categories de-
crease as inter-rater writing scores increase (See Appendix B).
While variables within each category differ in their degree of
significance, the overall high correlation of combined variables
in the multiple regression model suggests that multiple vari-
ables may be involved in the assignment of inter-rater scores.
Analysis of multicollinearity further suggests that each factor
may independently contribute to the assignment of a holistic
writing score. All independent variables had tolerance levels
above 2 and variance inflation factors (VIF) below 5, suggest-
ing that one factor was not significantly related to another (Ap-
pendix C).
Although more study is needed to confirm and increase the
predictive validity of variables used within this study, the highly
significant results suggest that empirical methods of calculating
EFL writing quality may be both a valid and reliable tool for
education. The use of empirical methods has several advantages
over traditional rubrics. One distinct advantage is that it can re-
duce subjectivity which is now associated with rubric criteria
and rater performance. Empirical methods of writing evaluation,
for example, would eliminate the influence of tacit rater biases
that linguistically discriminate against cultural or linguistic
groups (Johnson & Van Brackle, 2012).
An additional advantage of discreet empirical criteria for
evaluation is the potential for use with automatic grading tech-
nology. The use of such technology would greatly increase the
potential to provide washback to EFL students anywhere, any-
time. Students could use technology to get feedback concerning
vocabulary use, grammatical accuracy, or cohesion without the
classroom constraints now imposed by instructor-evaluated ru-
A final benefit of empirical methods is that they have the po-
tential to provide EFL teacher training. Teachers may obtain
valuable feedback concerning their own personal biases em-
ployed while assessing writing quality. To facilitate the training
process, automatic assessment technology could be used to high-
light criteria of evaluation that need to be further emphasized,
or deemphasized. Teachers could then learn to provide equal
weight to each rubric category being evaluated, regardless of
factors such as language, gender, or culture.
Results of this study reveal that several empirically measur-
able criteria for writing related to cohesion, content, and gram-
mar can be used to predict overall writing quality of EFL learn-
ers. While some of the criteria are more accurate predictors
than others, they all appear to synergistically influence the rat-
ings of holistic scores assessed by human raters.
Empirical evaluation of writing has several advantages over
traditional methods of evaluation. It allows for the automation
of writing assessment, which opens the door to use of the tech-
nology as a means of providing both summative and formative
writing feedback for students or teachers. Not only can students
get more constant and consistent feedback, teachers can receive
valuable pre-service or in-service training to sharpen their wri-
ting evaluation skills.
Although this study is promising, more study is needed to
confirm the validity of empirical measures, as well as to discern
additional relevant criteria for empirical writing evaluation of
EFL learners. Before such methods of assessment can be used
for any summative or formative purpose, they must be thor-
oughly compared to other forms of writing assessment and
examined by a large number of highly trained raters. In addition,
empirical methods must be tested with native and non-native
English speaking populations to ensure that such techniques are
uniformly accurate. Despite the need for further research, the
potential to provide automatic EFL writing feedback is clearly
evident, and should be further explored.
American Council on the Teaching of Foreign Languages (2012).
ACTFL proficiency guidelines. URL (last checked 1 October 2012).
Beyreli, L., & Ari, G. (2009). The use of analytic rubric in the assess-
ment of writing performance: Inter-rater concordance study. Educa-
tional Sciences: Theory & Practice, 9, 105-125.
Brown, H. D. (2004). Language assessment principles and classroom
practices. White Plains, NY: Pearson Education.
Chodorow, M., & Burstein, J. (2004). Beyond essay length: Evaluating
e-raters performance on TOEFL essays. Princeton, NJ: ETS.
Cope, B., Kalantzis, M., McCarthey, S., Vojak, C., & Kline, S. (2011).
Technology-mediated writing assessments: Principles and processes.
Computers & Composition, 28, 79-96.
Educational Testing Service (2008). TOEFL IBT test: Independent wri-
ting rubrics (scoring standards). URL (last checked 1 October 2012).
Fang, Z., & Wang, Z. (2011). Beyond rubrics: Using functional lan-
guage analysis to evaluate student writing. Australian Journal of
Language and Literacy, 34, 147-165.
Graham, S., Hebert, M., & Harris, K. R. (2011). Throw ‘em out or
make ‘em better? State and district high-stakes writing assessments.
Focus on Exceptional Children, 44, 1-12.
Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London:
Hasan, R. (1984). Coherence and cohesive harmony. In J. Flood (Ed.),
Understanding reading comprehension (pp. 181-219). Delaware: In-
ternational Reading Association.
Hoey, M. (1996). Patterns of lexis in text. New York, NY: Oxford Uni-
versity Press.
Hunter, K., & Docherty, P. (2011). Reducing variation in the assess-
ment of student writing. Assessment & Evaluation in Higher Educa-
tion, 36, 109-124. doi:10.1080/02602930903215842
Johansson, V. (2008). Lexical diversity and lexical density in speech
and writing: A developmental perspective. Working Papers, 53, 61-
Johnson, D., & Van Brackle, L. (2012). Linguistic discrimination in
writing assessment: How raters react to African American “errors,”
ESL errors, and standard English errors on a state-mandated writing
exam. Assessing Writing, 17, 35-54. doi:10.1016/j.asw.2011.10.001
Lexile Analyzer [Computer software]. URL (last checked 25 October
Mansilla, V. B., Duraisingh, E. D., Wolfe, C. R., & Haynes, C. (2009).
Targeted assessment rubric: An empirically grounded rubric for in-
terdisciplinary writing. The Journal of Higher Education, 80, 334-
353. doi:10.1353/jhe.0.0044
Peden, B. F., & Carroll, D. W. (2008). Ways of writing: Linguistic ana-
lysis of self-assessment and traditional assignments. Teaching of Psy-
chology, 35, 313-318. doi:10.1080/00986280802374419
Peregoy, S. F., & Boyle, O. F. (2005). Reading, writing, and learning
in ESL: A resource book for K-12 teachers. New York, NY: Pearson
Schulz, M. M. (2009). Effective writing assessment and instruction for
young English language learners. Early Childhood Education Jour-
nal, 37, 57-62. doi:10.1007/s10643-009-0317-0
Sévigny, S., Savard, D., & Beaudoin, I. (2009). Comparability of writ-
ing assessment scores across languages: Searching for evidence of
valid interpretations. International Journal of Testing, 9, 134-150.
Text Analyser [Computer software]. URL (last checked 25 October
Weigle, S. C. (2010). Validation of automated scores of TOEFL iBT
tasks against non-test indicators of writing ability. Language Testing,
27, 335-353. doi:10.1177/0265532210364406
Copyright © 2012 SciRes.
Copyright © 2012 SciRes. 1325
Appendix A Appendix C
Table C.
Collinearity statistics for independent variables.
Collinearity Statistics
Tolerance VIF
Lexical Repetition .572 1.748
Reference Words .315 3.170
Conjunctions .350 2.856
Sentence Length .391 2.559
Word Frequency .362 2.761
Hard Words .219 4.564
Lexical Density .457 2.191
Preposition .486 2.057
Subject Verb Agreement/Verb Tense .542 1.846
Count/Noncount .541 1.847
Plurals/Article .544 1.838
Run-on Sentence .632 1.582
Sentence Fragment .438 2.281
Part of Speech .538 1.858
Other .582 1.718
Figure A.
Inter-rater writing score values.
Appendix B
Figure B.
Grammatical errors within writings (separated by score).