Do ESL Essay Raters' Evaluation Criteria Change With Experience? A Mixed-Methods, Cross-Sectional Study

KHALED BARKAOUI
York University
Toronto, Ontario, Canada

This study adopted a mixed-methods, cross-sectional approach to identify similarities and differences in the English as a second language (ESL) essay holistic scores and evaluation criteria of raters with different levels of experience. Each of 31 novice and 29 experienced raters rated a sample of ESL essays holistically and analytically and provided written explanations for each holistic score they assigned. Score and qualitative data analyses were conducted to identify the criteria that the raters employed to rate the essays holistically. The findings indicated that both groups gave more importance to the communicative quality of the essays than to other aspects of writing. However, the novice raters tended to be more lenient and to give more importance to argumentation than the experienced raters did. The experienced raters tended to be more severe, to give more importance to linguistic accuracy, and to refer to evaluation criteria other than those listed in the rating scale more frequently than the novices did. The article concludes with a call for longitudinal research to investigate to what extent, how, and why rater evaluation criteria change over time and across contexts.

doi: 10.5054/tq.2010.214047

There is ample evidence that raters from different backgrounds may consider and weight evaluation criteria differently when assessing English as a second language (ESL) essays holistically. Holistic rating consists in reading a writing sample and choosing one score to reflect the overall quality of the paper (Goulden, 1994; Weigle, 2002). Holistic scales usually list evaluation criteria without specifying how important each criterion is to the overall score. As a result, raters may use personal judgment to determine the importance of different rating criteria and/or include evaluation criteria not listed in the rating scale when deciding on an overall score (Goulden, 1994). The result is large variability between (and within) raters in terms of rating criteria and scores. This variability is further exacerbated by variability in rater background and experience.

For example, Connor-Linton (1995) and Shi (2001) found that, although native (NES) and nonnative (NNES) speakers of English assigned similar scores to English as a foreign language (EFL) essays, they provided different reasons for assigning the same scores. Studies comparing raters from different academic and educational backgrounds also found that faculty from different departments rated and reacted differently to different aspects of ESL essays and disagreed as to when various criteria were being met (e.g., Mendelsohn & Cumming, 1987; Santos, 1988). Other studies (e.g., Brown, 1991; O'Laughlin, 1994) found no statistically significant differences in the holistic ratings assigned by ESL and English composition teachers to ESL essays, although the two groups gave different reasons for their holistic judgments of the same essays. Finally, Cumming, Kantor, and Powers (2002) found that ESL teachers tended to focus more on language issues than did the English teachers, who focused more on ideas or content.

Although these studies highlight the variability in evaluation criteria between raters from different backgrounds, little is known about variability within raters across time.
Few studies have examined the relationship between teaching and rating experience, on the one hand, and essay evaluation criteria, on the other. These studies adopted a cross-sectional approach, comparing the scores and/or evaluation criteria of raters who differed in terms of their experience teaching and assessing ESL writing. Most of these studies also used think-aloud protocols to compare the evaluation criteria and rating processes of novice and experienced raters (e.g., Cumming, 1990; Delaruelle, 1997; Erdosy, 2004; Sakyi, 2003; Weigle, 1999). Cumming (1990), for example, found that experienced teachers used a large and varied number of criteria and knowledge sources to read and judge ESL essays, whereas novice teachers tended to evaluate essays with only a few of these criteria, using skills that may derive from their general reading abilities or other knowledge they had acquired previously.

Although their primary objective was not the study of change in rater evaluation criteria as a function of experience, other studies have yielded results that are informative concerning this question. Lumley and McNamara (1995) reported evidence indicating that rater severity and self-consistency change over time, whereas Song and Caruso (1996) found that experienced teachers tended to be less severe in their holistic scoring of ESL essays than were less experienced raters. Sweedler-Brown (1985), by contrast, found that experienced raters tended to be more severe, and, as a result, she speculated that rating and teaching experience might give raters the confidence to rate more critically. A variety of factors, such as study context, writing tasks, evaluation criteria, and participants' backgrounds, may explain the discrepancy between the findings of Sweedler-Brown and those of Song and Caruso.

Rinnert and Kobayashi (2001) found a significant interaction effect between rater first-language (L1) background and experience on rater evaluation criteria when assessing EFL essays holistically. Although novice NNES raters (i.e., inexperienced Japanese EFL students) attended mainly to content when judging and commenting on the essays, more experienced NNES raters (i.e., experienced Japanese EFL students and EFL teachers), like NES EFL teachers, attended more to clarity, logical connections, and organization. Rinnert and Kobayashi interpreted this finding as indicating a gradual change in NNES readers' perceptions of EFL essays, from preferring the writing features of their L1 to preferring many of the second-language (L2) writing features. Hamp-Lyons (1989) also observed a shift in NES raters' evaluation criteria: as they gain more experience with other cultures and their languages, NES raters become used to different types of rhetorical patterns and transfer across languages and, consequently, tend to react less unfavorably to the English writing of members of those language communities.

None of the studies reviewed earlier, however, specifically examined the relationship between rater experience and evaluation criteria. The think-aloud studies described earlier focused on differences in the rating processes, particularly decision-making behaviors, between novice and experienced raters, rather than on differences in their evaluation criteria per se.

Methodologically, various approaches have been used to identify the evaluation criteria that raters employ when evaluating ESL essays holistically. The most common approach is to examine the correlations between the holistic scores raters assign and measures of specific essay features.
These measures include analytic ratings of specific aspects of the essays (e.g., language, organization) by research participants (e.g., Tedick & Mathison, 1995), text analysis or coding of essay features by the researcher (e.g., Frase, Faletti, Ginther, & Grant, 1999; Homburg, 1984), and rewriting essays to reflect strengths and weaknesses in specific writing areas (e.g., Kobayashi & Rinnert, 1996; Mendelsohn & Cumming, 1987). Other studies have used self-report data in the form of interviews (e.g., Erdosy, 2004), questionnaires (e.g., Shi, 2001), written score explanations (e.g., Milanovic, Saville, & Shuhong, 1996; Rinnert & Kobayashi, 2001), and think-aloud protocols (e.g., Cumming et al., 2002; Delaruelle, 1997) to identify the evaluation criteria that raters employ when assessing ESL essays holistically.

THE PRESENT STUDY

The current study is part of a larger research project that compared the rating processes and outcomes of experienced and novice raters when using holistic and analytic rating scales to evaluate ESL essays (Barkaoui, 2008). The study described in this article combined both score analysis and self-report data to identify and compare the evaluation criteria that novice and experienced raters attend to when rating ESL essays holistically. Specifically, this study addressed two research questions:

1. What aspects of writing explain the holistic scores that raters assign to ESL essays?
2. To what extent and how do the aspects of writing that explain ESL essay holistic scores vary in relation to rater experience?

Method

Participants

The study included 31 novice and 29 experienced raters. Participants were assigned to groups based on their responses to a background questionnaire. Experienced raters were graduate students and/or ESL instructors who had been teaching and rating ESL writing for several years, had a master of arts or master of education degree, had received specific training in assessment and essay rating, and rated themselves as competent or expert raters. Novice raters were mainly students who were enrolled in or had just completed a preservice or teacher-training program in ESL, had no ESL teaching and rating experience at all at the time of data collection, and rated themselves as novices. The participants were recruited from various universities in southern Ontario, Canada. They varied in terms of their gender, age, and L1 backgrounds, but all were native or highly proficient NNES speakers of English. Table 1 describes the profile of a typical participant in each group.

TABLE 1
Typical Profile of a Novice and an Experienced Rater

                                        Novice          Experienced
Role at time of the research            TESL student    ESL teacher
ESL teaching experience                 None            10 years or more
Rating experience                       None            Several years
Postgraduate study                      None            MA or MEd
Received training in assessment         No              Yes
Self-assessment of rating ability       Novice          Competent or expert

Note. TESL = teaching English as a second language; MA = master of arts; MEd = master of education.

Essays

The study included 180 essays produced under real exam conditions by 180 adult ESL learners from diverse parts of the world and at different levels of proficiency in English. The essays were obtained from the Test of English as a Foreign Language (TOEFL). The 180 test takers who wrote them came from more than 35 different countries and from about 30 different L1 backgrounds; the majority were Japanese, Spanish, and Korean speakers. Their ages ranged between 16 and 45 years (M = 25 years, SD = 6). The great majority (91%) took the test to pursue graduate (58%) or undergraduate (33%) studies. Their TOEFL scores ranged between 90 and 290 (M = 212.56, SD = 44.51). Each essay was written within 30 minutes in response to one of two comparable independent prompts, one on the importance of the study of some academic subjects (study topic) and one on the advantages and disadvantages of practicing sports (sports topic).

Data Collection Procedures

The 180 essays were first randomly compiled into batches of 24 essays, and the batches were then randomly assigned to raters. The essays included no information about the writers; instead, number codes were used to identify them. However, the raters were provided with a general description of the purpose and conditions under which the essays were written. Each rater received a 30-min individual training session on using a holistic and an analytic rating scale and then rated their batch of essays holistically and analytically, with half the participants rating the essays holistically first and the other half rating them analytically first (i.e., a counterbalanced design). The holistic and analytic scales, borrowed from Hamp-Lyons (1991, pp. 247-251), included exactly the same evaluation criteria, wording, and number of score levels (9). The evaluation criteria in the analytic scale were grouped under five categories: communicative quality, argumentation, organization, linguistic accuracy, and linguistic appropriacy. In addition to scoring the essays, the participants were instructed to provide a brief written explanation for each holistic score they assigned. The essays were rated individually, at the rater's home, and there was a span of at least several weeks between the holistic and analytic ratings. Each participant rated the same batch of 24 essays holistically and analytically but in a different random order of essays and prompts.

Data Analysis

Data for this study consisted of the holistic and five analytic scores each rater assigned to each essay and the written score explanations the participants provided for each holistic score they assigned. All the raters were included in the score analyses, but one rater from each group was excluded from the analysis of the score explanation data, because he or she did not provide explanations for more than one-third of the holistic scores they assigned. The final sample consisted of 1,069 score explanations provided by 28 experienced and 30 novice raters; the novices provided 571 (53%) of the score explanations.

The score explanations were typed into word-processing files and then coded using the computer program NVivo (Richards, 1999). Given that these explanations were brief (24 words or less), the unit of analysis adopted was the whole score explanation. A coding scheme was developed based on the criteria in the rating scales, preliminary inspections of the data, and Cumming et al.'s (2002) empirically based schemes of rater decision-making behaviors and aspects of writing to which raters attend. The scheme consisted of 24 codes under seven main categories: five related to the categories in the analytic rating scale, one concerned comments on the overall quality of the essay (e.g., "poor essay"), and one related to comments on aspects of writing other than those included in the holistic scale (e.g., task completion, quantity). A complete list of the codes, with examples from the current study, appears in the Appendix.

Each score explanation was coded in terms of focus (i.e., one of the seven coding categories in the Appendix) and type (i.e., positive, negative, or neutral), as follows. Each score explanation was first classified as being related to one or more of the seven main categories in the Appendix. The score explanation was then further categorized in terms of one or more of the subcategories under each category. For example, a comment on argumentation could be classified as being related to relevance, interest, support, and/or other argumentation aspects. When tallying the number of comments under each category and subcategory, each comment was counted only once for that category or subcategory. For instance, a comment that was coded as being related to interest and relevance, both under argumentation, was counted once under each of these two subcategories but also only once under the main category of argumentation. Finally, each comment was coded as being positive, negative, or neutral. Comments were coded as neutral if they were neither positive nor negative and/or were ambiguous (e.g., "Similar to the previous essay"; "Language fits description of #5"; "Common ESL mistakes"; "Long essay"). Many score explanations included both negative and positive comments; such comments were coded as both negative and positive (i.e., twice). For example, the score explanation "Well structured, but should be longer" was coded as positive for text organization (under organization) and negative in relation to quantity (under other aspects of writing).
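To make the tallying logic concrete, the sketch below shows one way such counts could be turned into per-rater percentages. It is a minimal illustration with hypothetical data structures (the actual coding and tallying were done in NVivo); the article does not state the exact denominator, so each category's share of a rater's coded category mentions is assumed here, which is consistent with the category shares reported later summing to roughly 100%.

    # Minimal sketch, assuming hypothetical data structures: each score
    # explanation is a set of (main_category, subcode) pairs. A main category
    # is counted at most once per explanation, as described above.
    from collections import Counter

    MAIN_CATEGORIES = ["CQ", "ORG", "ARG", "LAC", "LAP", "Overall", "Other"]

    def category_percentages(explanations):
        """Percentage of one rater's coded category mentions per main category."""
        counts = Counter()
        for codes in explanations:
            for category in {main for (main, _sub) in codes}:  # once per category
                counts[category] += 1
        total = sum(counts.values()) or 1
        return {c: 100.0 * counts[c] / total for c in MAIN_CATEGORIES}

    # Example: two argumentation subcodes in one explanation still count
    # ARG only once for that explanation.
    print(category_percentages([
        {("ARG", "relevance"), ("ARG", "interest"), ("LAC", "lexis")},
        {("ORG", "text organization")},
    ]))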
The author coded all the data in this study by assigning each score explanation all relevant codes in the coding scheme. To check the reliability of the coding, the coding scheme was discussed with another researcher, who then independently coded a random sample of 250 written score explanations (1,127 codes). The percentage agreement achieved was 90%, computed for agreement in terms of the main categories: organization, argumentation, linguistic accuracy, linguistic appropriacy, overall impression, and other aspects of writing (see the Appendix). Percentage agreements for main categories and within each category varied, however (e.g., 85% for argumentation; 93% for linguistic accuracy). All the difficult cases were discussed and, for most cases, the codes were reconciled. In the few cases where an agreement was not reached, the author selected the final code to be assigned.
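The article does not spell out how the agreement figure was computed, so the sketch below shows one plausible formula under that assumption: every code either coder assigned counts as one decision, matched when the other coder assigned it as well.

    # One plausible agreement computation (an assumption, not the study's
    # documented formula). coder_a and coder_b each give, per score
    # explanation, the set of main categories assigned.
    def percent_agreement(coder_a, coder_b):
        matched = total = 0
        for codes_a, codes_b in zip(coder_a, coder_b):
            total += len(codes_a | codes_b)    # every code assigned by either coder
            matched += len(codes_a & codes_b)  # codes both coders assigned
        return 100.0 * matched / total if total else 0.0

    # Toy usage: one explanation, coder A assigns {ARG, LAC}, coder B only {ARG}.
    print(percent_agreement([{"ARG", "LAC"}], [{"ARG"}]))  # 50.0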
Because the focus in this study was on comparing the frequency of focus and type of comments across rater groups, the coded data were tallied and percentages were computed for each rater for each code in the coding scheme. These percentages served as the data for comparison across rater groups. Statistical tests were then conducted on the main categories in the Appendix; subcategories were used for descriptive purposes only and to explain significant differences in main categories. Because the coded data did not seem to meet the statistical assumptions of parametric tests, nonparametric tests (the Mann-Whitney test) were used to compare coded data across rater groups. To examine the relationships between the score explanations and the holistic scores that the raters provided, Spearman rho correlations were conducted. Because nonparametric tests rely on ranks, the following descriptive statistics are reported below: mean, median, standard deviation, and range (i.e., the difference between the highest and lowest values; Field, 2005).

To examine the relationships between the analytic and holistic scores and the effects of rater experience on these relationships, multilevel modeling (MLM) was used. MLM is an advanced form of multiple-regression analysis that takes into account the hierarchical structure of data (Hox, 2002; Luke, 2004). Hierarchical data means that observations at lower levels are nested within units at higher levels; in this study, ratings are nested within raters. With nested data, there may be more variation between raters than within raters, a violation of the independence-of-observations assumption that underlies traditional multiple-regression analysis. MLM addresses this problem, because it assumes independence of observations between raters, but not between ratings within a rater (Hox, 2002; Luke, 2004). MLM also allows the examination of the effects of rater variables (e.g., experience) on holistic scores (main effects) and on the relationships between the analytic and holistic scores (called cross-level interaction effects in MLM; Hox, 2002).

The software program HLM 6.0 for Windows (Raudenbush, Bryk, Cheong, & Congdon, 2004) was used to build and test various MLM models, following procedures suggested by Hox (2002), before identifying the final model that fit the data. In addition to the outcome variable, holistic scores, the study included one rater-level (called Level-2 in MLM) predictor, rater experience (coded 0 for novice and 1 for experienced), and seven measures of essay features that constitute the Level-1 predictors. These measures were the five categories in the analytic scale as well as essay length (number of words per essay, measured using the word-count function in Microsoft Word) and essay topic. The prompt was used as a measure of essay topic (what the essay is about), with the study prompt coded 0 and the sports prompt coded 1.
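For readers who want to see the shape of such a model, the sketch below sets up a comparable two-level specification in Python's statsmodels as a rough stand-in for the HLM 6.0 analysis; every column name (holistic, topic, length, cq, arg, lac, experience, rater) and the input file are hypothetical, not taken from the study's materials.

    # A sketch, not the study's actual HLM 6.0 setup. One row per rating;
    # ratings are nested within raters, so raters form the grouping factor.
    import pandas as pd
    import statsmodels.formula.api as smf

    ratings = pd.read_csv("ratings.csv")  # hypothetical file, one row per rating

    # Level-1 predictors are the essay features; experience (0 = novice,
    # 1 = experienced) is the Level-2 predictor, and experience:arg and
    # experience:lac encode the cross-level interactions. re_formula requests
    # random slopes for the predictors whose associations varied across raters
    # (cf. Table 3); arg keeps a fixed slope.
    model = smf.mixedlm(
        "holistic ~ topic + length + cq + arg + lac"
        " + experience + experience:arg + experience:lac",
        data=ratings,
        groups="rater",
        re_formula="~topic + length + cq + lac",
    )
    print(model.fit().summary())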
FINDINGS

Score Analyses

Table 2 reports descriptive statistics and correlations between the holistic and analytic ratings by rater group. It shows that the novice group had slightly higher means and standard deviations than the experienced group did for the holistic scale and each of the analytic scales. In addition, the correlations between the holistic ratings, on the one hand, and communicative quality, organization, and argumentation, on the other hand, were slightly higher for the novice raters, but the correlations between holistic ratings and linguistic accuracy, linguistic appropriacy, and essay length were higher for the experienced group. The following paragraphs examine whether these differences are statistically significant.

TABLE 2
Descriptive Statistics and Pearson r Correlations by Rater Group

            Novice                  Experienced             Total
            M       SD      r       M       SD      r       M       SD      r
Holistic    5.48    1.57    1.00    5.08    1.41    1.00    5.29    1.51    1.00
CQ          5.69    1.61    0.64    5.48    1.34    0.62    5.59    1.49    0.63
ORG         5.77    1.67    0.61    5.33    1.35    0.59    5.56    1.54    0.61
ARG         5.57    1.75    0.63    5.16    1.46    0.56    5.38    1.62    0.61
LAC         5.36    1.52    0.56    5.04    1.30    0.61    5.21    1.43    0.58
LAP         5.54    1.55    0.55    5.29    1.34    0.58    5.42    1.45    0.57
Length      239.61  85.35   0.24    240.05  84.38   0.27    239.83  84.85   0.25

Note. All rating criteria are measured on a nine-point scale. M = mean; SD = standard deviation; r = Pearson correlation with the holistic scores; CQ = communicative quality; ORG = organization; ARG = argumentation; LAC = linguistic accuracy; LAP = linguistic appropriacy; Length = number of words per essay.

MLM was used to (a) examine whether the two rater groups differed significantly in the holistic scores they assigned, (b) identify which essay features account for differences in the holistic scores the raters assigned, and (c) assess whether novice and experienced raters gave different weights to different evaluation criteria in the holistic scores they assigned. Five MLM models were examined before building the final model. These exploratory models indicated, first, that essay length, communicative quality, argumentation, and linguistic accuracy had significant associations with the holistic scores, whereas topic, organization, and linguistic appropriacy did not, at p < 0.05. Second, the within-rater relationships between the holistic scores, on the one hand, and each of topic, essay length, communicative quality, and linguistic accuracy, on the other, varied significantly across raters. Third, on average, the experienced raters assigned significantly lower holistic scores than the novice raters did to essays on the same topic, after statistically accounting for differences in terms of essay length, communicative quality, argumentation, and linguistic accuracy. Fourth, rater experience significantly moderated the relationships between the holistic scores, on the one hand, and the argumentation and linguistic accuracy scores, on the other.

Based on the results of the exploratory models and analyses, a final model was specified that included five measures of essay features: essay length, topic, communicative quality, argumentation, and linguistic accuracy. The exploratory models indicated that (a) these five predictors had significant associations with the holistic scores, (b) their associations with the holistic scores varied significantly across raters, and/or (c) their associations with the holistic scores were significantly influenced by rater experience. Organization and linguistic appropriacy did not meet any of the three criteria and, as a result, were not included in the final model. The results for the final MLM model are presented in Table 3.

TABLE 3
Results for Final MLM Model

Fixed effects (Level 1)     Unstandardized coefficient (SE)     Standardized coefficient
Intercept                   5.53** (0.17)
Topic                       -0.09 (0.07)                        -0.03
Length                      0.003** (0.0003)                    0.17
CQ                          0.28** (0.05)                       0.27
ARG                         0.22** (0.03)                       0.24
LAC                         0.11* (0.04)                        0.10

Level 2: Rater experience effect    Coefficient (SE)
Intercept                           -0.40* (0.19)
ARG slope                           -0.10* (0.05)
LAC slope                           0.20** (0.05)

Random effects      Variance    Chi-square    df
Between-rater       0.63        533.39**      58
Topic slope         0.13        109.26**      59
Length slope        0.00        90.35*        59
CQ slope            0.05        125.17**      59
LAC slope           0.02        78.79*        58
Within-rater        0.91

Note. SE = standard error; CQ = communicative quality; ARG = argumentation; LAC = linguistic accuracy.
* p < 0.05. ** p < 0.01.
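Written out in equation form, a specification consistent with the coefficients and variance components in Table 3 would be as follows; the article does not print these equations, so this is a hedged reconstruction rather than the author's own notation.

    Level 1 (rating i by rater j):
    Holistic_{ij} = \beta_{0j} + \beta_{1j}\,Topic_{ij} + \beta_{2j}\,Length_{ij}
                    + \beta_{3j}\,CQ_{ij} + \beta_{4j}\,ARG_{ij} + \beta_{5j}\,LAC_{ij} + e_{ij}

    Level 2 (rater j, with Exp_j = 0 for novice and 1 for experienced):
    \beta_{0j} = \gamma_{00} + \gamma_{01}\,Exp_j + u_{0j}
    \beta_{1j} = \gamma_{10} + u_{1j}
    \beta_{2j} = \gamma_{20} + u_{2j}
    \beta_{3j} = \gamma_{30} + u_{3j}
    \beta_{4j} = \gamma_{40} + \gamma_{41}\,Exp_j
    \beta_{5j} = \gamma_{50} + \gamma_{51}\,Exp_j + u_{5j}

    where, from Table 3, \gamma_{00} = 5.53, \gamma_{01} = -0.40,
    \gamma_{41} = -0.10, and \gamma_{51} = 0.20.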
Table 3 reports various statistics. The first set of statistics is the Level-1 fixed effects, which refer to (a) the average intercept and (b) the average associations between each of the Level-1 predictors and the holistic scores. First, the intercept represents the average holistic score assigned by the novice raters to essays on the study prompt, adjusted for all four essay features in the final model. The value of the intercept is 5.53 and can be thought of as a baseline against which all other values in Table 3 are interpreted. Second, Table 3 shows that the average associations between essay length, communicative quality, argumentation, and linguistic accuracy, on the one hand, and holistic scores, on the other, are significant. For instance, the association for communicative quality (0.28) is positive and significant at p < 0.01, indicating that, on average, essays with high scores on this criterion obtained higher holistic scores (0.28 points higher), after accounting for the effects of all other predictors in the final model. By contrast, the association is -0.09 for topic, indicating that, on average, the sports topic (coded 1) resulted in lower scores than the study topic (coded 0), but this difference was not statistically significant. The average association of essay length with the holistic scores was 0.003, indicating that, on average, the holistic scores increased by 0.003 points with each additional word (i.e., 0.30 points per 100 words).

Note that the coefficients in the second column of Table 3 are the unstandardized coefficients of the associations. Because the predictors were measured on different scales (e.g., essay length was measured in number of words, whereas communicative quality was measured on a nine-point scale), these coefficients needed to be standardized to allow comparison of the strength of associations across predictors. The standardized coefficients appear in the third column of Table 3. When the coefficients are standardized, communicative quality and argumentation have the highest average associations with holistic scores (0.27 and 0.24, respectively), followed by essay length (0.17). The association between linguistic accuracy and holistic scores is the lowest (0.10). In other words, on average, communicative quality and argumentation played the most prominent roles in the holistic scores the raters assigned, followed by essay length and linguistic accuracy.
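As a worked check on this standardization step: assuming the usual formula, standardized b = b x SD(predictor) / SD(outcome) (the article does not state the formula explicitly), the standard deviations in Table 2 reproduce the standardized column of Table 3.

    # Worked check of the assumed standardization formula, using the
    # unstandardized coefficients from Table 3 and the total-sample SDs from
    # Table 2 (SD of holistic scores = 1.51). Topic is a 0/1 variable, so its
    # SD is taken as roughly 0.5 here. The results reproduce the reported
    # standardized coefficients (-0.03, 0.17, 0.27, 0.24, 0.10) to within
    # about +/-0.01 (the published table presumably used unrounded b values).
    sd_holistic = 1.51
    predictors = {            # name: (unstandardized b, SD of predictor)
        "topic":  (-0.09, 0.50),
        "length": (0.003, 84.85),
        "cq":     (0.28, 1.49),
        "arg":    (0.22, 1.62),
        "lac":    (0.11, 1.43),
    }
    for name, (b, sd_x) in predictors.items():
        print(f"{name:>6}: {b * sd_x / sd_holistic:+.2f}")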
The second set of statistics in Table 3 is the Level-2 fixed effects, which concern the direct effects of rater experience on (a) the holistic scores and (b) the associations between the analytic scores and the holistic scores (i.e., the cross-level interaction effects). As Table 3 shows, the experienced raters assigned, on average, significantly lower holistic scores than the novices did (-0.40), gave significantly less weight to argumentation (-0.10), and gave significantly more weight to linguistic accuracy (0.20).

Turning to the written score explanations, Table 4 reports descriptive statistics for the aspects of writing the raters mentioned, by focus and type, together with Spearman rho correlations between each category and the holistic scores and essay length, for all 58 raters.

TABLE 4
Descriptive Statistics and (Spearman rho) Correlations for Percentages of Writing Aspects Reported in Score Explanations for All Raters (N = 58 raters)

                     M      Mdn     Range    SD      r(holistic)   r(length)
Focus
  CQ                 17.60  16.67   44.44    8.82    -0.20**       -0.01
  ORG                15.42  14.58   43.75    8.65     0.00          0.07*
  ARG                26.29  25.09   34.65    7.76     0.04         -0.02
  LAC                23.76  23.89   38.19    8.12     0.11**        0.03
  LAP                 4.15   2.80   22.30    4.66     0.07*         0.05
  Overall             3.80   1.70   20.29    5.27     0.07*         0.05
  Other               6.48   6.07   19.31    4.23    -0.00         -0.15**
Type
  Positive           34.21  32.76   65.62   13.88     0.55**        0.20**
  Negative           55.76  58.44   87.50   16.43    -0.60**       -0.22**
  Neutral             7.52   2.35  100.00   17.50     0.10**       -0.01
Type by focus: Positive
  CQ                 14.45  13.61   48.61   10.31     0.26**        0.13**
  ORG                17.89  17.36   71.53   13.08     0.17**        0.08**
  ARG                18.80  17.88   52.78   12.79     0.34**        0.15**
  LAC                11.90   7.64   50.35   11.87     0.37**        0.07*
  LAP                 1.59   0.00   11.81    2.68     0.15**        0.06*
  Overall             3.30   1.24   19.44    4.54     0.22**        0.13**
  Other               2.46   0.00   31.60    4.88     0.09**        0.05
Type by focus: Negative
  CQ                 11.14  10.25   29.17    6.66    -0.37**       -0.06
  ORG                 7.94   7.29   40.97    7.17    -0.26**        0.00
  ARG                22.54  22.71   50.14   11.15    -0.24**       -0.14**
  LAC                25.77  27.13   63.89   13.12    -0.11**        0.02
  LAP                 4.19   3.08   25.83    5.15    -0.04          0.03
  Overall            12.43  11.46   29.17    6.72    -0.40**       -0.08*
  Other               7.50   6.49   22.50    5.15    -0.04         -0.18**

Note. M = mean; Mdn = median; SD = standard deviation; r(holistic) and r(length) = Spearman correlations with holistic scores and essay length; CQ = communicative quality; ORG = organization; ARG = argumentation; LAC = linguistic accuracy; LAP = linguistic appropriacy; Overall = overall impression; Other = other aspects of writing.
* p < 0.05; ** p < 0.01.

In terms of type of comments, Table 4 shows that the majority of the comments (M = 56%) were negative, indicating that (a) the essays had numerous problems, (b) raters tended to comment more frequently on weak aspects, and/or (c) it was easier for raters to perceive and/or comment on weak aspects of writing than on positive aspects. Only one-third of the comments were positive (M = 34%), whereas 7% were coded as neutral (i.e., neither positive nor negative, and/or ambiguous). The following paragraphs focus mainly on the positive and negative comments; the neutral comments are discussed only briefly, because (a) their meaning is not always clear and (b) the proportion of such comments is small compared with the negative and positive comments.

Overall, the largest proportions of the positive comments the raters made concerned, in descending order, argumentation (M = 19%), organization (M = 18%), and communicative quality (M = 14%). By far the largest proportions of negative comments related to linguistic accuracy (M = 26%) and argumentation (M = 23%).

Table 4 also presents the correlations between each of the main categories of aspects of writing reported in the score explanations, on the one hand, and essay length and holistic scores, on the other. First, the correlation (Spearman rank-order correlation, rs) between the frequency of positive comments and holistic scores is positive and significant (rs = 0.55), indicating that essays that obtained high scores tended to receive more praise. The correlation between the frequency of negative comments and holistic scores, on the other hand, is negative and significant (rs = -0.60), indicating that essays with lower scores tended to receive more negative comments related to all aspects of writing considered in this study. This is an expected result and provides partial validity evidence for the coding of the comments. Second, as essay scores increased, the raters made fewer comments on communicative quality (rs = -0.20) and more comments on linguistic accuracy (rs = 0.11), suggesting a shift in rater evaluation criteria depending on essay proficiency level (see discussion later). As for essay length, Table 4 shows that longer essays tended to receive more positive comments (rs = 0.20), whereas shorter ones tended to receive more negative comments (rs = -0.22). In addition, the longer the essay, the less likely it was to receive comments on aspects of writing other than those listed in the rating scale (e.g., task completion, quantity; rs = -0.15).

Table 5 reports (a) descriptive statistics concerning the aspects of writing reported in the written score explanations by focus and type of main category and (b) Spearman rho correlations between each of these main categories and essay length and holistic scores across rater groups. It shows that, overall, the two rater groups made about the same proportion of comments on organization and linguistic appropriacy. The experienced raters reported higher proportions for communicative quality, overall impression of the essay, and other aspects of writing than did the novices, who reported more often on argumentation and linguistic accuracy. In addition, the novices made more positive and neutral comments than did the experienced raters, who tended to make more negative comments. Mann-Whitney U tests indicated that only the difference concerning linguistic accuracy was statistically significant at p < 0.05 (Z = -2.23, p = 0.048, r = 0.294), with novices making significantly more comments on this aspect of writing. (Following Field, 2005, Pearson's correlation coefficient r is used as a measure of effect size in this study. This coefficient is constrained to lie between 0, no effect, and 1, a perfect effect. Following Cohen, 1988, Field suggested the following guidelines for interpreting effect sizes: small effect, r = 0.10; medium effect, r = 0.30; and large effect, r = 0.50; Field, p. 32.)
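As a sketch of how such a comparison runs, the snippet below uses scipy as a stand-in for whatever software the study actually used (the article does not name it for this test); the per-rater vectors are illustrative only, not the study's data.

    # Illustrative group comparison plus the effect-size conversion the text
    # describes: r = |Z| / sqrt(N), following Field (2005).
    import math
    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(0)
    lac_novice = rng.uniform(15, 40, size=30)       # 30 novice raters (fake values)
    lac_experienced = rng.uniform(10, 35, size=28)  # 28 experienced raters (fake values)

    u_stat, p_value = mannwhitneyu(lac_novice, lac_experienced,
                                   alternative="two-sided")

    # With the reported Z = -2.23 and N = 58 raters, r = 2.23 / 7.62 = 0.293,
    # matching the reported r = 0.294 up to rounding.
    r_effect = abs(-2.23) / math.sqrt(58)
    print(round(r_effect, 3))  # 0.293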
TABLE 5
Descriptive Statistics and (Spearman rho) Correlations for Percentages of Writing Aspects Reported in Score Explanations by Rater Group

                    Novice (n = 30)         Experienced (n = 28)
                    M        r              M        r
Focus
  CQ                15.64    -0.25**        19.70    -0.13**
  ORG               15.25    -0.05          15.60     0.06
  ARG               27.52     0.04          24.97     0.01
  LAC(a)            25.81     0.19**        21.56    -0.01
  LAP                4.33     0.08           3.97     0.06
  Overall            3.49     0.17**         4.13    -0.04
  Other              5.64     0.05           7.38    -0.04
Type
  Positive          36.57     0.52**        31.69     0.58**
  Negative          51.34    -0.57**        60.49    -0.61**
  Neutral            9.87     0.15**         5.00     0.04

Note. M = mean percentage of comments; r = Spearman correlation with holistic scores; CQ = communicative quality; ORG = organization; ARG = argumentation; LAC = linguistic accuracy; LAP = linguistic appropriacy; Overall = overall impression; Other = other aspects of writing. Medians, ranges, standard deviations, type-by-focus rows, and correlations with essay length are omitted here.
(a) Mann-Whitney U tests indicated that the difference between rater groups was significant at p < 0.05.
* p < 0.05; ** p < 0.01.

The differences between the two rater groups concerning positive comments on argumentation and negative comments on communicative quality were marginally significant (p = 0.06 each). In terms of specific aspects, the novices tended to refer more frequently in their score explanations to the use of writer experience (M = 3%; under argumentation), error frequency (M = 6%), syntax and morphology (M = 10%), and punctuation and spelling (M = 6%; under linguistic accuracy) than did the experienced raters (M = 1%, 4%, 6%, and 3%, respectively). The experienced raters, on the other hand, referred more often to communicative quality (M = 15%; under communicative quality), language overall (M = 3%; under linguistic accuracy), and task completion (M = 3%; under other aspects) than did the novices (M = 11%, 1%, and 1%, respectively). When negative and positive comments were considered separately (under "Type by focus" in Table 5), the trends were similar to the overall pattern (i.e., there was no type-by-focus interaction), except that the experienced raters made more negative comments on organization than did the novices.
The correlations between each of the main categories of aspects of writing reported in the score explanations, on the one hand, and essay length and holistic scores, on the other, did not differ across rater groups. For both groups, essays with higher holistic scores tended to receive more positive and fewer negative comments in relation to all the aspects of writing considered in this study. Similarly, the score explanations for shorter essays tended to be negative, whereas longer essays tended to receive more praise from raters in both groups. In addition, shorter essays tended to receive more comments related to other aspects of writing (e.g., task completion, quantity) from raters in both groups. Finally, both groups tended to make fewer comments on communicative quality when explaining the high scores they assigned than they did for low scores. It was the novice raters, however, who showed a clearer tendency to make more comments on linguistic accuracy (rs = 0.19) and on overall impressions (rs = 0.17) as the scores they assigned increased, compared with the experienced group (rs = -0.01 and rs = -0.04, respectively). Overall, then, communicative quality, organization, argumentation, and linguistic accuracy were referred to frequently, whereas linguistic appropriacy was reported less frequently by raters in both groups. Other aspects of writing and overall impression tended to be reported more frequently by the experienced raters.

SUMMARY AND DISCUSSION

The score analyses indicated that communicative quality had the largest average association with the holistic scores of both rater groups, indicating that raters in both groups gave this criterion more weight than any of the other essay features considered in this study. Communicative quality, as described in the rating scale (see Figure 1), refers to the clarity and comprehensibility of the message and can be seen as an overarching criterion that subsumes and presupposes the other evaluation criteria in the rating scale. The association concerning communicative quality varied significantly across raters; for some raters this association was positive and for others it was negative, suggesting that the raters for whom it was negative did not give much importance to this criterion when assessing the essays holistically. Rater experience, however, did not seem to explain this variance. Other rater factors, such as L1, age, and writing experience, might account for these differences across raters.

Argumentation also had a significant and positive association with the holistic scores the raters assigned. However, this association was significantly stronger for the novices than for the experienced raters. Linguistic accuracy, by contrast, had a significant but weaker association with the holistic scores the participants assigned, and this association was weaker still for the novices than for the experienced group. These results indicate that the two groups differed significantly in terms of the importance they gave to argumentation and linguistic accuracy when rating the essays holistically. The novices seem to have been more influenced in their holistic ratings by the quality of the argument, or the content and ideas, of the essays, whereas the experienced raters seem to have given more weight to linguistic accuracy (syntax, vocabulary, etc.).
The experienced raters' focus on linguistic accuracy might be due to their training and experience as language teachers (cf. Cumming et al., 2002; Erdosy, 2004; McNamara, 1996; Rinnert & Kobayashi, 2001; Song & Caruso, 1996); as noted earlier, all the experienced raters in this study were ESL teachers. The novice raters, by contrast, seem to have taken their assessment criteria from their experience as learners and/or readers, where the focus may have been on content (cf. Cumming, 1990). Note that the association of linguistic accuracy with the holistic scores varied significantly across raters; for some raters this association was positive and for others it was negative, suggesting that the raters for whom it was negative did not assign much importance to linguistic accuracy in their holistic ratings of the essays. As noted earlier, rater experience explained a small proportion of this variance. Again, other rater factors might account for these differences between raters.

Essay length was significantly and positively associated with holistic scores: overall, longer essays obtained higher holistic scores from all raters. In addition, although this association varied significantly across raters, it was not influenced by rater experience. The holistic scores assigned to essays on the sports prompt were generally lower than those assigned to essays on the study prompt, but the difference was not significant. In addition, some raters assigned lower scores to essays on the study prompt. However, the association between topic and holistic scores did not vary as a function of rater experience. Finally, organization and linguistic appropriacy did not seem to have a significant effect on the holistic scores once all other essay features were accounted for. This result might have occurred because they were highly correlated with the other rating dimensions in the final multilevel model.

The experienced raters assigned, on average, lower holistic scores than did the novices, perhaps because of differences in the evaluation criteria that the two groups focused on and in the importance they assigned to these criteria (cf. Sakyi, 2003; Song & Caruso, 1996). In particular, the experienced raters gave more importance to linguistic accuracy in their ratings, an aspect of writing that is inherently difficult for ESL writers. Another possible explanation is that rating and teaching experience might give raters the confidence to rate more critically (Sweedler-Brown, 1985). Sweedler-Brown speculated that less experienced raters may be less certain about how to apply rating criteria, may be afraid that they will make an error in judgment that will punish the student, and/or may be influenced by the notion that lower scores are value judgments of the students themselves. As a result, novice raters may tend to be less critical of an essay's qualities and to express their uncertainty by giving higher scores, offering students the benefit of the doubt (Sweedler-Brown, pp. 54-55). This interpretation finds support in the qualitative data, which indicated that the experienced raters tended to make more negative comments, whereas the novices tended to make more positive and neutral comments when explaining the holistic scores they assigned.

Much of the within- and between-rater variance in the holistic scores, however, remains unexplained. Information on other rater factors (e.g., L1, age, gender, writing experience) is needed to explain the between-rater variance, whereas other within-rater and essay factors (e.g., essay order, content) might account for the remaining within-rater (between-ratings) variance.
Additionally, some of the variance between raters may be due to essay sampling (i.e., different raters rated different essays), whereas some within-rater variance could be due to differences in the raters' interpretations of the analytic criteria (i.e., construct differences).

Findings from the qualitative data both support and contradict findings from the score analyses. First, the novices referred most frequently to argumentation when explaining the scores they assigned, which supports the finding from the score analyses that argumentation played a prominent role in the holistic scores this group assigned. However, contrary to the finding from the score analyses that linguistic accuracy played a more important role in the holistic scores of the experienced group, the qualitative data indicated that the novices referred more frequently than the experienced raters did to aspects related to linguistic accuracy when explaining the holistic scores they assigned. There are three possible explanations for this contradiction. First, because of their lack of experience with ESL writing, the novice raters might have focused on local linguistic features in order to understand the texts before they could evaluate other aspects of them (Sakyi, 2003). Second, because they lacked established criteria for judging writing quality, these novice raters may have based their score explanations on simple or easily discernible (and reportable) aspects of writing, such as lexis, syntax, and punctuation (Sakyi, 2003). Third, because the holistic scale lists several specific linguistic features (grammar, vocabulary, spelling, etc.) without any indication of their importance relative to each other or to other criteria, it is possible that the novices treated these features as multiple categories that needed to be considered (and perhaps weighted and scored) separately (see Barkaoui, 2008). The experienced raters, by contrast, tended to refer to language overall, rather than to specific linguistic aspects, more frequently than the novices did.

Another finding from the qualitative data that contradicts the results of the score analyses is that, although communicative quality had the strongest association with the holistic scores for all raters, this aspect of writing was mentioned less frequently than argumentation and linguistic accuracy in the score explanations of raters from both groups. One explanation for this finding is that communicative quality (i.e., the comprehensibility and clarity of the message) was perhaps felt to subsume all other aspects of writing and, as a result, the raters did not feel they needed to mention it when explaining the scores they assigned, opting instead to focus on other, more specific aspects. In other words, communicative quality as defined in the rating scale (see Figure 1) might have been interpreted by the participants as an overall quality that itself needed to be explained with reference to other evaluation criteria. This is apparent in several score explanations indicating that an essay failed to communicate effectively because it had poor organization, argumentation, and/or linguistic accuracy. In addition, in follow-up discussions several participants reported that they gave more importance to communicative quality when rating the essays holistically because it was listed first in each score descriptor on the holistic rating scale. Unfortunately, the quantification of the score explanation data in this study does not reflect the importance the raters assigned to each aspect of writing they mentioned in their written score explanations.
As Lumley (2005) noted, lack of mention of a particular feature by a rater is no indication that the feature was not observed and noted; likewise, simple mention of a particular feature is not an indication of its importance.

Organization, by contrast, was frequently mentioned by both groups when explaining the scores they assigned, although the score analyses indicated that it did not have a significant association with the holistic scores the raters assigned. It is possible that the high intercorrelations between the different analytic criteria (these correlations ranged between 0.65 and 0.82) statistically suppressed the association between the organization and holistic scores, although the qualitative data indicated that raters in both groups did consider organization when rating the essays holistically. Both the quantitative and the qualitative data indicated that linguistic appropriacy did not play a significant role in the holistic scores the participants assigned.

Furthermore, the score analyses indicated that much of the variance in the holistic scores remains unexplained. One explanation is that the raters employed criteria other than those listed in the rating scale. The qualitative data showed that the raters did rely on criteria other than those in the scale, such as their overall impression of the essay and other aspects of writing (e.g., task completion, essay length). The qualitative data also indicated that the experienced raters tended to refer more frequently to aspects of writing other than those listed in the scoring rubric, suggesting that experienced raters were more likely than the novices to bring supplementary criteria, beyond those listed in the rating scale, to the evaluation task.

Finally, it is worth noting that the raters seemed to employ different evaluation criteria depending on essay proficiency level as well. For example, raters generally made more comments related to communicative quality, and fewer comments related to linguistic accuracy, when evaluating essays that received low holistic scores than when assessing essays that received high holistic scores. This finding is consistent with previous research (e.g., Connor-Linton, 1995; Cumming, 1990; Cumming et al., 2002; Shi, 2001). Cumming et al., for example, found that raters attended more to language features when rating low-proficiency essays but attended to both rhetoric and language when reading high-proficiency essays.

LIMITATIONS AND IMPLICATIONS

As with any research, there were limitations to the present study. Five such limitations are discussed here and used to point out implications and areas for further research. First, the participants were not involved in selecting or developing the rating criteria and scale used in this study, and they did not receive extensive group training on essay rating. As a result, several participants reported that they did not like the rating scale and criteria and/or ignored some of these criteria (e.g., linguistic appropriacy, use of writer's experience). In addition, the ratings were done individually at home, with no control for rating time, which may have affected rater performance.

Second, there were some problems with the data analyses. The terminology used to describe writing qualities and evaluation criteria is open to different interpretations; raters might have meant different things by the same term or the same thing by different terms (Cumming et al., 2002).
Additionally, the qualitative data analysis was limited to comparing the frequency of codes. Although this is a useful strategy given the relatively large number of participants in this study, it cannot detect such qualitative differences as variation in the importance of the aspects of writing mentioned or in their interactions and relationships within and across raters and groups. As noted earlier with reference to the role of communicative quality in the holistic scores, the quantification of the qualitative data in this study does not reflect the importance that the raters reported they had assigned to each aspect of writing. Furthermore, frequency or lack of mention of a particular feature by a rater is not always a good indicator of its importance (Lumley, 2005).

Third, as noted earlier, there were several conflicting findings from the score and qualitative data analyses. This is not unique to the current study: several previous studies have found discrepancies between the aspects of writing that raters reported, in questionnaires or interviews, as being most important and the aspects that influenced their ratings of specific essays, as revealed through analyses of the scores they assigned or the comments they made while rating (e.g., Breland & Jones, 1984; Harris, 1977; Mendelsohn & Cumming, 1987; Sakyi, 2003). Although the mixed findings of the present study may seem to undermine the value of mixed-methods research, these results expand our understanding of raters' evaluation practices and of how these are influenced not only by factors in the rating setting but also by variation in research tools and contexts. Further research is needed to clarify the reasons for these conflicting findings.

Fourth, this study was cross-sectional; it compared the performance of two groups of raters at two points on the expertise continuum. As such, although it provides important insights into differences and similarities between the two groups in terms of the evaluation criteria they use, it says very little about whether, how, and why raters' evaluation criteria change over time. This is also a limitation of all previous studies on rater expertise, however.

Fifth, the study adopted an experimental, rather than a naturalistic, approach to the examination of rater evaluation criteria. As a result, some of the considerations, motivations, and institutional norms that operate in rating for a real exam may have been lacking (cf. Lumley, 2005). In addition, the factors considered in this study are primarily text based (i.e., essay characteristics). Research has shown, however, that several contextual factors and considerations, such as test purpose and the conditions and implications of the assessment for the test-taker as perceived by the rater, may influence the evaluation criteria raters employ and the scores they assign (e.g., Broad, 2003; Davison, 2004; Lumley, 2005). For example, a few participants in the current study reported that they took into account the time pressure that the test-takers had to deal with, whereas others reported that the holistic scores they assigned were based in part on their perceptions as to whether the writer was "university material" or not. Unfortunately, these factors were not considered in this study. As noted earlier, other factors internal to the study, such as essay order, essay sampling, and rating time, might have affected the scores and the explanations the raters provided.
Despite these limitations, the current study suggests implications that can be tested in other specific assessment contexts and points to several areas for further research. First, the findings of this study suggest that, with longer teaching and rating experience, raters become more severe in their holistic evaluations of ESL essays. One apparent reason for this change in severity is that, although the communicative quality of the message remains a prominent criterion, raters' evaluation criteria tend to shift from a focus on content (argumentation) to a focus on form (linguistic accuracy), which is often a weak aspect of ESL essays. The findings also suggest that experienced raters are more likely to refer to evaluation criteria other than those listed in the rating scale. However, given the limitations pointed out earlier, particularly that this study was cross-sectional, these findings represent hypotheses for further research. In particular, future research needs to replicate the current study with raters from different linguistic, cultural, and professional backgrounds, with different writing tasks, and in different assessment systems and contexts. Such research needs to adopt a longitudinal approach to investigate to what extent, how, and why rater evaluation criteria change over time. Another area worth exploring relates to whether L2 learners, as a result of their developing L2 proficiency, experience a similar shift in the evaluation criteria they use to evaluate their own or their peers' L2 writing performance in self- and peer-assessment of writing (cf. Rinnert & Kobayashi, 2001).

It is possible that the evaluation criteria that raters employ vary across contexts. Another area for further research, as a result, is how raters are socialized into new institutional or assessment contexts and whether, how, and to what extent raters from different backgrounds are able to set aside their personal values and criteria and adopt (or not) the evaluation values and criteria of new assessment systems and contexts. This program of research requires focusing on raters making real judgments in specific courses, programs, or institutions and using naturalistic, ethnographic approaches (e.g., observation of ratings, discussion of scores, interviews) as well as score analyses (e.g., using MLM). Such research can enhance our understanding of how the broader sociocultural and institutional contexts within which essay rating occurs affect variability in ESL essay evaluation criteria and scores over time and across contexts.

ACKNOWLEDGMENTS

The author thanks the raters who participated in this study and Alister Cumming, Merrill Swain, Richard Wolfe, and three anonymous TESOL Quarterly reviewers for their comments on earlier versions of this article.

THE AUTHOR

Khaled Barkaoui is an assistant professor in the Faculty of Education, York University, Toronto, Ontario, Canada. His research interests include second-language assessment, second-language writing, language-program evaluation, research methodology, and English for academic purposes.

REFERENCES

Barkaoui, K. (2008). Effects of scoring method and rater experience on ESL essay rating processes and outcomes (Unpublished doctoral thesis). University of Toronto, Canada.
Breland, H. M., & Jones, R. J. (1984). Perceptions of writing skills. Written Communication, 1, 101-119.
Broad, B. (2003). What we really value: Rubrics in teaching and assessing writing. Logan, UT: Utah State University Press.
Brown, J. D. (1991). Do English and ESL faculties rate writing samples differently? TESOL Quarterly, 25, 587-603.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York, NY: Academic Press.
Connor-Linton, J. (1995). Crosscultural comparison of writing standards: American ESL and Japanese EFL. World Englishes, 14, 99-115.
Cumming, A. (1990). Expertise in evaluating second language compositions. Language Testing, 7, 31-51.
Cumming, A., Kantor, R., & Powers, D. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive framework. Modern Language Journal, 86, 67-96.
Davison, C. (2004). The contradictory culture of teacher-based assessment: ESL teacher assessment practices in Australian and Hong Kong secondary schools. Language Testing, 21, 305-334.
Delaruelle, S. (1997). Text type and rater decision-making in the writing module. In G. Brindley & G. Wigglesworth (Eds.), Access: Issues in English language test design and delivery (pp. 215-242). Sydney, Australia: National Center for English Language Teaching and Research, Macquarie University.
Erdosy, M. U. (2004). Exploring variability in judging writing ability in a second language: A study of four experienced raters of ESL compositions (TOEFL Research Report RR-03-17). Princeton, NJ: Educational Testing Service.
Field, A. (2005). Discovering statistics using SPSS (2nd ed.). Thousand Oaks, CA: Sage.
Frase, L. T., Faletti, J., Ginther, A., & Grant, L. (1999). Computer analysis of the TOEFL test of written English (TOEFL Research Report No. 64). Princeton, NJ: Educational Testing Service.
Goulden, N. R. (1994). Relationship of analytic and holistic methods to raters' scores for speeches. The Journal of Research and Development in Education, 27, 73-82.
Hamp-Lyons, L. (1989). Raters respond to rhetoric in writing. In H. W. Dechert & M. Raupach (Eds.), Interlingual processes (pp. 229-244). Tubingen, Germany: Gunter Narr Verlag.
Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 241-276). Norwood, NJ: Ablex.
Harris, W. (1977). Teacher response to student writing: A study of the response patterns of high-school English teachers to determine the basis for teacher judgment of student writing. Research in the Teaching of English, 11, 175-185.
Homburg, T. J. (1984). Holistic evaluation of ESL compositions: Can it be validated objectively? TESOL Quarterly, 18, 87-107.
Hox, J. J. (2002). Multilevel analysis: Techniques and applications. Mahwah, NJ: Lawrence Erlbaum.
Kobayashi, H., & Rinnert, C. (1996). Factors affecting composition evaluation in an EFL context: Cultural rhetorical pattern and readers' background. Language Learning, 46, 397–437.
Luke, D. (2004). Multilevel modeling. Thousand Oaks, CA: Sage.
Lumley, T. (2005). Assessing second language writing: The rater's perspective. New York, NY: Peter Lang.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12, 54–71.
McNamara, T. (1996). Measuring second language performance. London, UK: Longman.
Mendelsohn, D., & Cumming, A. (1987). Professors' ratings of language use and rhetorical organization in ESL compositions. TESL Canada Journal, 5, 9–26.
Milanovic, M., Saville, N., & Shuhong, S. (1996). A study of the decision-making behaviour of composition markers. In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium (LTRC), Cambridge and Arnhem (pp. 92–114). Cambridge, England: Cambridge University Press.
O'Laughlin, K. (1994). The assessment of writing by English and ESL teachers. Australian Review of Applied Linguistics, 17, 23–44.
Raudenbush, S. W., Bryk, A. S., Cheong, Y. F., & Congdon, R. (2004). HLM6: Hierarchical linear and nonlinear modeling. Lincolnwood, IL: Scientific Software International.
Richards, L. (1999). Using NVivo in qualitative research. Melbourne, Australia: Qualitative Solutions and Research.
Rinnert, C., & Kobayashi, H. (2001). Differing perceptions of EFL writing among readers in Japan. The Modern Language Journal, 85, 189–209.
Sakyi, A. A. (2003). A study of the holistic scoring behaviors of experienced and novice ESL instructors (Unpublished doctoral dissertation). University of Toronto, Canada.
Santos, T. (1988). Professors' reactions to the academic writing of non-native-speaking students. TESOL Quarterly, 22, 69–90.
Shi, L. (2001). Native- and nonnative-speaking EFL teachers' evaluation of Chinese students' English writing. Language Testing, 18, 303–325.
Song, C. B., & Caruso, I. (1996). Do English and ESL faculty differ in evaluating the essays of native English-speaking and ESL students? Journal of Second Language Writing, 5, 163–182.
Sweedler-Brown, C. O. (1985). The influence of training and experience on holistic essay evaluation. English Journal, 74, 49–55.
Tedick, D. J., & Mathison, M. A. (1995). Holistic scoring in ESL writing assessment: What does an analysis of rhetorical features reveal? In D. Belcher & G. Braine (Eds.), Academic writing in a second language: Essays on research and pedagogy (pp. 205–230). Norwood, NJ: Ablex.
Weigle, S. C. (1999). Investigating rater/prompt interactions in writing assessment: Quantitative and qualitative approaches. Assessing Writing, 6, 145–178.
Weigle, S. C. (2002). Assessing writing. Cambridge, England: Cambridge University Press.

APPENDIX

Coding Scheme for Written Score Explanation Data

Communicative quality
  Positive: The communicative strength of this essay makes you forget about the inaccuracies (E31, 258)
  Negative: Strain. I reread this or times (E1, 112)

Organization
  Text organization
    Positive: There is a distinct beginning, middle, and end, and there is a clear progression (N28, 146)
    Negative: This writer does have many ideas, but she/he does not know how to organize his/her thoughts (N20, 224)
  Coherence and transition
    Positive: There is a sense of coherence in message (E07, 238)
    Negative: But discourse markers are used inappropriately, creating confusion for the reader (E32, 202)

Argumentation
  Argument quality and content
    Positive: Developed and logical argumentation with clear ideas. Arguments are well presented (E1, 288)
    Negative: It really lacks substance. I really don't think this person is university material (N8, 102)
  Support and examples
    Positive: Some good, relevant examples are provided (E36, 124)
    Negative: I don't think the examples given were convincing enough to reader (N23, 155)
  Presence and quality of main idea
    Positive: A positive aspect of the essay is that the author presents his/her point of view and standing at the very beginning (N25, 121)
    Negative: No real main idea; seems to evolve as the writer puts down ideas (N7, 145)
  Use of writer experience
    Positive: This writer's use of personal experience to support arguments is particularly effective (N1, 215)
    Negative: Lacks specific examples from personal experience (E18, 125)
  Relevance
    Positive: This paper presents a clear lucid argument that sticks to the topic (E32, 184)
    Negative: Argument mainly irrelevant (E3, 276)
  Interest and creativity
    Positive: This writer displays an originality which was refreshing. This writer also showed a sophistication in her writing (E12, 118)
    Negative: The arguments are superficial (E4, 163)

Linguistic accuracy
  Error gravity
    Positive: Reader not troubled by errors (E19, 290)
    Negative: I was primarily aware of "gross inadequacies" of language here (E28, 149)
  Error frequency
    Positive: What is written is relatively error free linguistically (E15, 174)
    Negative: There are enough spelling and grammar errors to make the text incomprehensible at times (N28, 251)
  Lexis
    Positive: Variety and appropriate use of vocabulary (N36, 234)
    Negative: Poor grasp of vocabulary here affecting score (E20, 161)
  Syntax and morphology
    Positive: Fairly good control of grammar (E20, 207)
    Negative: Control of grammar is inadequate (E11, 217)
  Punctuation and spelling
    Positive: Correct punctuation and spelling aided comprehension (N2, 113)
    Negative: Atrocious spelling (N30, 177)
  Language overall
    Positive: They have an excellent grasp of the language (N17, 130)
    Negative: Terrible language use (E6, 109)

Linguistic appropriacy
  Linguistic appropriacy general
    Positive: Good effort at linguistic system manipulation (N30, 168)
    Negative: No sense of linguistic appropriacy (E27, 114)
  Style, register, or genre
    Positive: Very elegant style and comfortable with writing (E3, 246)
    Negative: The writing seems a little florid, but that's not on the rating scale so I shouldn't penalize (N17, 130)

Overall impression
  Positive: I enjoyed reading this essay (N6, 171)
  Negative: Unacceptable (E31, 102)

Other aspects of writing
  Task completion
    Positive: Addresses both sides of the issue (N36, 184)
    Negative: Incomplete. Did not answer the question. Did not plan to discuss disadvantages (N33, 244)
  Redundancy
    Positive: N/A
    Negative: Much irrelevant and repetitive information also included unnecessarily (E8, 224)
  Voice
    Positive: I would like to give this writer 7.5 because he/she has a "voice" and presents good arguments (E15, 128)
    Negative: N/A
  Layout
    Positive: It is well formatted, especially the way the writer is organizing the paragraphs (N16, 264)
    Negative: There is no division into paragraphs, and it is hard to tell when a sentence is ending and when it is starting (N25, 164)
  Quantity
    Positive: The length of the essay compared with the others is quite impressive (N24, 131)
    Negative: So short, had to tell the writing/language ability from or so sentences (E5, 184)
  Fluency
    Positive: I can tell that the writer is fluent in English (N16, 101)
    Negative: N/A

Note. Each quote is followed by a rater code (N = novice; E = experienced) and an essay code (101–190 = essay on the study prompt; 201–290 = essay on the sports prompt).