Language Assessment Quarterly, 7: 54–74, 2010
Copyright © Taylor & Francis Group, LLC
ISSN: 1543-4303 print / 1543-4311 online
DOI: 10.1080/15434300903464418

Variability in ESL Essay Rating Processes: The Role of the Rating Scale and Rater Experience

Khaled Barkaoui
York University

Various factors contribute to variability in English as a second language (ESL) essay scores and rating processes. Most previous research, however, has focused on score variability in relation to task, rater, and essay characteristics. A few studies have examined variability in essay rating processes. The current study used think-aloud protocols to examine the roles of rating scales, rater experience, and interactions between them in variability in raters' decision-making processes and the aspects of writing they attend to when reading and rating ESL essays. The study included 11 novice and 14 experienced raters, who each rated 12 ESL essays, both holistically and analytically, while thinking aloud. The findings indicated that rating scale type had larger effects on the participants' rating processes than did rater experience. With holistic scoring, raters tended to refer more often to the essay (the focus of the assessment), whereas with analytic scoring they tended to refer more frequently to the rating scale (the source of evaluation criteria); analytic scoring drew raters' attention to all evaluation criteria in the rating scale, and novices were influenced by variation in rating scales more than were the experienced raters. The article concludes with implications for essay rating practices and research.
Correspondence should be sent to Khaled Barkaoui, York University, Faculty of Education, 235 Winters College, 4700 Keele Street, Toronto, Ontario, M3J 1P3 Canada. E-mail: kbarkaoui@edu.yorku.ca

This study examined the roles and effects of two sources of variability in the rating context, rating scale and rater experience, on English as a second language (ESL) essay rating processes. It may be useful to think of the rating process as involving a reader/rater interacting with three texts (the writing task, the essay, and the rating scale) within a specific sociocultural context (e.g., an institution) that specifies the criteria, purposes, and possibly processes of reading and interpreting the three texts to arrive at a rating decision (Lumley, 2005; Weigle, 2002). Although various factors can contribute to variability in scores and rater decision-making processes, research on second-language essay rating has tended to focus on such factors as task requirements, rater characteristics, and/or essay features (Barkaoui, 2007a). Other contextual factors, however, such as rating procedures, also influence raters' judgment of student performance and the scores they assign. As Schoonen (2005) argued, "The effects of task and rater are most likely dependent on what has to be scored in a text and how it has to be scored" (p. 5). In addition, the rating scale is an important component of the rating context because it specifies what raters should look for in a written performance and ultimately influences the validity of the inferences and the fairness of the decisions that educators make about individuals and programs based on essay test scores (Weigle, 2002). This aspect of the rating context, however, has received little attention (Barkaoui, 2007a; Hamp-Lyons & Kroll, 1997; Weigle, 2002).

This article focuses on two types of rating scales, holistic and analytic, that are widely used in large-scale and classroom assessments (Hamp-Lyons, 1991; Weigle, 2002). These two types of scales differ in terms of scoring methods and implications for rater decision-making processes (Goulden, 1992, 1994; Weigle, 2002). In terms of scoring method, in analytic scoring raters assign subscores to individual writing traits (e.g., language, content, organization); these subscores may then be summed to arrive at an overall score. In holistic scoring, the rater may also consider individual elements of writing but chooses one score to reflect the overall quality of the paper (Goulden, 1992, 1994). In terms of decision-making processes, with analytic scoring the rater has to evaluate the different writing traits separately. In holistic scoring the rater has to consider different writing traits too, but must also weight and combine assessments of the different traits to arrive at one overall score, which is likely to make the rating task more cognitively demanding. These differences are likely to influence essay rating processes and outcomes. However, although the literature is replete with arguments for and against the two rating methods, little is known about whether and how they affect ESL essay reading and rating processes and scores (Barkaoui, 2007a; Hamp-Lyons & Kroll, 1997; Weigle, 2002).
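The contrast between the two scoring methods can be sketched in a few lines of code. The trait names, values, and equal weighting below are hypothetical illustrations, not the scales used in this study; the point is simply that analytic scoring makes the combination rule explicit, whereas holistic scoring leaves the weighting of traits inside the rater's head.

```python
# Illustrative sketch only: hypothetical traits and weights,
# not the rating scales used in this study.

def analytic_score(subscores, weights=None):
    """Analytic scoring: a separate subscore per trait, combined by an
    explicit rule (here, a weighted sum) into an overall score."""
    weights = weights or {trait: 1.0 for trait in subscores}
    return sum(weights[trait] * score for trait, score in subscores.items())

def holistic_score(rater_impression):
    """Holistic scoring: the rater weighs the traits internally and
    reports a single score; no explicit combination rule exists."""
    return rater_impression

# Example: equal weighting of three hypothetical traits on a 9-point scale.
print(analytic_score({"content": 6, "organization": 5, "language": 4}))  # 15.0
print(holistic_score(5))  # 5
```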
The studies that have been reported in the literature (e.g., Bacha, 2001; O'Loughlin, 1994; Schoonen, 2005; Song & Caruso, 1996) examined the effects of rating scales on rater and score reliability but did not consider the rating process. Furthermore, the findings of some of these studies are mixed. For example, in two studies comparing the holistic and analytic scores assigned by ESL and English teachers to ESL essays, O'Loughlin (1994) found that holistic ratings achieved higher levels of interrater agreement across both rater groups, whereas Song and Caruso (1996) found significant differences in terms of the holistic, but not the analytic, scores across rater groups. Bacha (2001), on the other hand, reported high levels of inter- and intrarater reliability for both types of rating scales.

I am not aware of any study that has examined the effects of different types of rating scales on L2 essay rating processes (but see Barkaoui, 2007b). Most qualitative studies have investigated the decision-making behaviors and aspects of writing that raters attend to when rating essays with no specific rating guidelines (e.g., Cumming, Kantor, & Powers, 2002; Delaruelle, 1997), or when using holistic (e.g., Milanovic, Saville, & Shuhong, 1996; Sakyi, 2003; Vaughan, 1991) or analytic scoring (e.g., Cumming, 1990; Lumley, 2005; Smith, 2000; Weigle, 1999). Lumley and Smith may be two exceptions in that, although they did not specifically compare different rating scales, their findings raise several relevant questions concerning the role of the rating scale in essay rating processes. Smith found that raters attend to textual features other than those mentioned in the rating scale, that raters with different reading strategies interpret and apply the rating criteria differently, and that the rating criteria have different effects on raters with different approaches to essay reading and rating. Lumley found that (a) raters may understand the rating criteria similarly in general but emphasize different components and apply them in different ways, and (b) raters may face problems reconciling their impression of the text, the specific features of the text, and the wordings of the rating scale.

Another limitation of previous research is that the frameworks that describe the essay rating process (e.g., Cumming et al., 2002; Freedman & Calfee, 1983; Homburg, 1984; Milanovic et al., 1996; Ruth & Murphy, 1988; Sakyi, 2003) do not discuss whether and how the content and organization of the rating scale influence rater decision-making behaviors and the aspects of writing raters attend to. For example, Freedman and Calfee seemed to suggest that essay rating is a linear process in which the rater reads the essay, forms a mental representation of it, compares and matches this representation to the rating criteria, and then articulates a rating decision. Other studies of essay rating did not include any rating scales (e.g., Cumming et al., 2002). As a result, these studies do not discuss the role of the rating scale in variation in rater decision-making behaviors. Such information is crucial for designing, selecting, and improving rating scales and rater training, as well as for the validation of ESL writing assessments. To examine rating scales inevitably means examining the individuals who use them, that is, raters. As Lumley (2005) emphasized, the rater is at the center of the rating activity (cf. Cumming, Kantor, & Powers, 2001; Erdosy, 2004).
One of the rater factors that seems to play an important role in the rating process is rater experience (e.g., Cumming, 1990; Lumley, 2005; Schoonen, Vergeer, & Eiting, 1997; Wolfe, 2006). Schoonen et al., for instance, argued that the expertise and knowledge that raters bring to the rating task are essential for a reliable and valid rating (p. 158). There is a relatively extensive literature on the effects of rater expertise on ESL essay rating processes (Cumming, 1990; Delaruelle, 1997; Erdosy, 2004; Sakyi, 2003; Weigle, 1999). This research indicates that experienced and novice raters employ qualitatively different rating processes. Cumming (1990), for example, found that experienced teachers had a much fuller mental representation of the essay assessment task and used a large and varied number of criteria, self-control strategies (i.e., strategies for controlling one's own evaluation behavior, such as defining, assessing, and revising one's own rating criteria, or summarizing one's rating judgment collectively), and knowledge sources to read and judge ESL essays. Novice raters, by contrast, tended to evaluate essays with only a few of these component skills and criteria, using skills that may derive from their general reading abilities or other knowledge they had acquired previously (e.g., editing). However, there is no research on how raters with different levels of experience approach essay rating with different types of rating scales. Cumming (1990) hypothesized that novice raters, unlike experienced raters, may benefit from analytic scoring procedures that direct their attention to specific aspects of writing as well as to appropriate evaluation strategies and criteria, whereas Goulden (1994) hypothesized that analytic scoring is easier for inexperienced raters because fewer unguided decisions (e.g., weighting different evaluation criteria) are required. The aim of the present study was to investigate these empirical issues. Specifically, the current study used think-aloud protocols to examine the roles of rating scale type (holistic vs. analytic), rater experience (novice vs. experienced), and the interaction between them in variability in ESL essay rating processes. Following previous research (e.g., Cumming et al., 2002; Lumley, 2005; Milanovic et al., 1996), rating processes are defined as the decision-making behaviors of the raters and the aspects of writing they attend to while reading and rating ESL essays.

METHOD

Participants

The study included 11 novice and 14 experienced raters randomly selected from among 60 volunteers in a larger study on ESL essay scores and rating processes (Barkaoui, 2008). Experienced raters were graduate students and/or ESL instructors who had been teaching and rating ESL writing for at least 5 years, had an M.A. or M.Ed. degree, had received specific training in assessment and essay rating, and rated themselves as competent or expert raters. Novice raters were mainly teaching English as a second language (TESL) students who were enrolled in or had just completed a preservice teacher training program in ESL, had no ESL teaching or rating experience at the time of data collection, and rated themselves as novice raters. The participants were recruited from various ESL and TESL teacher education programs at universities in southern Ontario.
They varied in terms of gender, age, and first-language background, but all were native or highly proficient nonnative speakers of English. Table 1 describes the profile of a typical participant in each group.

TABLE 1
Typical Profile of a Novice and an Experienced Rater

                                     Novice (n = 11)   Experienced (n = 14)
Role at time of the research         TESL student      ESL teacher
ESL teaching experience              None              10 years or more
Rating experience                    None              5 years or more
Postgraduate study                   None              M.A./M.Ed.
Received training in assessment      No                Yes
Self-assessment of rating ability    Novice            Competent or expert

Note. TESL = teaching English as a second language; ESL = English as a second language.

Data Collection Procedures

The study included 180 essays produced under real-exam conditions by adult ESL learners from diverse parts of the world and with varying levels of proficiency in English. Each essay was written within 30 minutes in response to one of two comparable independent prompts (Study and Sports). Each rater rated a random sample of 24 essays, 12 silently and 12 while thinking aloud. To ensure counterbalancing, half the participants in each group were randomly assigned to start with holistic rating and the other half to start with analytic rating (a sketch of this assignment logic follows Figure 1 below). The holistic and analytic scales, borrowed from Hamp-Lyons (1991, pp. 247–251), included the same evaluation criteria, wording, and number of score levels (9) but differed in whether one overall score (holistic) or multiple scores (analytic) were assigned to each essay. The rating criteria in the analytic scale were grouped under five categories: communicative quality, organization, argumentation, linguistic accuracy, and linguistic appropriacy.

Each participant attended a 30-minute individual training session on one of the rating scales and rated and discussed a sample of four essays. Next, each rated 12 essays silently at home using the first rating scale (these silent ratings are not considered in this paper). Each rater then attended a 30-minute session where they received detailed instructions and careful training on how to think aloud while rating the essays, following the procedures and instructions in Cumming et al. (2001, pp. 83–85). Later, each participant rated the remaining 12 essays while thinking aloud into a tape recorder. At least two weeks later, each participant attended a second training session with the second rating scale and rated 12 essays silently and 12 while thinking aloud. Each participant rated the same 12 think-aloud essays with both scales but in a different random order of essays and prompts. All participants completed the think-aloud protocols individually, at home, to allow them enough time to verbalize and to minimize researcher effects on their performance. Figure 1 summarizes the data collection procedures.

FIGURE 1 Summary of data collection procedures.
Phase 1:
1. Orientation session for rating scale 1 (scales counterbalanced).
2. Rating 12 essays silently using scale 1 (at home).
3. Think-aloud training.
4. Rating 12 essays while thinking aloud using scale 1 (at home).
Phase 2:
5. Orientation session for rating scale 2.
6. Rating 12 essays silently using scale 2 (same essays as in 2 above) (at home).
7. Rating 12 essays while thinking aloud using scale 2 (same essays as in 4 above) (at home).
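As a minimal sketch of the counterbalanced assignment described above (the rater IDs are hypothetical, and the study's actual randomization procedure is not documented beyond this description), the assignment might be implemented as follows:

```python
import random

def counterbalance(rater_ids, seed=42):
    """Randomly assign half of a rater group to the holistic-first
    scale order and the other half to the analytic-first order."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    shuffled = list(rater_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    orders = {}
    for rater in shuffled[:half]:
        orders[rater] = ("holistic", "analytic")
    for rater in shuffled[half:]:
        orders[rater] = ("analytic", "holistic")
    return orders

# Applied separately to each group, e.g.:
# counterbalance([f"novice{i:02d}" for i in range(1, 12)])
```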
Data Analysis

Data for the current study consisted of the participants' think-aloud protocols only. Because some raters did not record their thinking aloud while rating some of the essays, and because of poor recording quality, only 558 protocols (out of 600) were analyzed. The novice raters provided 264 of these protocols (47%). There was an equal number of protocols for each rating scale and for each prompt. The protocols were coded with the assistance of the computer program Observer 5.0 (Noldus Information Technology, 2003), software for the organization, analysis, and management of audio and video data. Using Observer allowed coding to be carried out directly from the protocol audio recordings (instead of transcripts).

The unit of analysis for the think-aloud protocols was a decision-making statement, segmented using the following criteria from Cumming et al. (2002): (a) a pause of five seconds or more, (b) the rater reading aloud a segment of the essay, and/or (c) the end or beginning of the assessment of a single essay. The coding scheme was developed based mainly on Cumming et al.'s (2002) empirically based schemes of rater decision-making behaviors and aspects of writing raters attend to. Cumming et al.'s main model of rater behavior, as it applies to the rating of independent prompts (Cumming et al., 2001, 2002, developed three frameworks based on data from different types of tasks and from both ESL and English teachers), consists of various decision-making behaviors grouped under three foci (rater self-monitoring behavior, ideational and rhetorical elements of the text, control of language within the text) and two strategies (interpretation and judgment). Interpretation strategies consist of reading strategies aimed at comprehending the essay, whereas judgment concerns evaluation strategies for formulating a rating. Cumming et al. also distinguished between three general types of decision-making behavior: a focus on self-monitoring (i.e., a focus on one's own rating behavior, e.g., monitoring for personal bias), a focus on the essay's realization of ideational and rhetorical elements (e.g., rhetorical structure, coherence, relevance), and a focus on the essay's accuracy and fluency in the English language (e.g., syntax, lexis).

Based on preliminary inspections of the data, 36 codes were selected from Cumming et al.'s frameworks and three new ones were added: (a) Read, interpret, refer, or comment on rating scale, to account for the raters' uses of the rating scales; (b) Assess communicative effectiveness or quality, which pertains to text comprehensibility and clarity at both the local and global levels; and (c) Compare scores across rating categories, to account for participants' comparisons of scores assigned to the same essay on different analytic rating categories. The final coding scheme consisted of 39 codes. A complete list of the codes, with examples from the current study, is presented in the appendix.

The author coded all the protocols by assigning each decision-making statement all the relevant codes in the coding scheme. To check the reliability of the coding, the coding scheme was discussed with another researcher, who then independently coded a random sample of 70 protocols (3,083 codes). Percentage agreement, computed in terms of the main categories in the appendix, was 81%. Agreement varied across and within the main categories, however (e.g., 76% for self-monitoring-judgment, 85% for language-judgment).
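For illustration, percentage agreement of this kind can be computed as below. This is a sketch under the assumption that the two coders' main-category assignments have been aligned code by code (the paper does not specify the exact matching procedure); the function name and data layout are hypothetical.

```python
def percent_agreement(coder_a, coder_b):
    """Percentage agreement: the share of aligned coding decisions on
    which two coders assigned the same main category."""
    if len(coder_a) != len(coder_b):
        raise ValueError("code lists must be aligned pairwise")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * matches / len(coder_a)

# e.g., over the 3,083 aligned codes in the reliability sample,
# percent_agreement(author_codes, second_coder_codes) would return ~81.0
```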
In most cases, the coders were able to reconcile their codes. In the few cases where they could not reach agreement, the author decided the final code to be assigned.

As in previous studies (e.g., Cumming, 1990; Cumming et al., 2002; Wolfe, 2006; Wolfe, Kao, & Ranney, 1998), the focus in this study is on comparing the frequency of decision-making behaviors and aspects of writing attended to. Consequently, the coded protocol data were tallied and percentages were computed for each rater for each code in the coding scheme. These percentages served as the data for comparison across rater groups and rating scales. Statistical tests were then conducted on the main categories in the appendix; subcategories were used for descriptive purposes only and to explain significant differences in main categories. Because the coded data did not seem to meet the statistical assumptions of parametric tests, nonparametric tests were used to compare coded data across rating scales (the Wilcoxon Signed-Ranks test, a nonparametric equivalent of the dependent t test) and across rater groups (the Mann-Whitney test, a nonparametric equivalent of the independent t test for comparing two independent groups). Because these tests rely on ranks, the following descriptive statistics are reported: the median (Mdn) and the highest (Max) and lowest (Min) values for each main category.

Finally, because each participant provided 12 protocols for each rating scale, each rater had 24 percentages for each code, one for each essay for each rating scale (i.e., 12 essays × 2 rating scales), for example, for the code "scan whole composition." To analyze the coded data statistically, these percentages had to be aggregated. To compare coded data across rating scales, the protocols were aggregated at the rater level, by type of rating scale, to obtain two average percentages for each code for each rater, one per rating scale. To compare the coded data across rater groups, the protocols were aggregated at the rater level to obtain one proportion per rater. Statistical tests were then conducted on the aggregated data.
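The aggregation and testing logic might look like the following sketch for a single code. The column names and data layout are hypothetical (the paper does not report its analysis software); pandas and SciPy are assumed here.

```python
import pandas as pd
from scipy.stats import wilcoxon, mannwhitneyu

# Assumed layout: one row per rater x essay x scale for a single code,
# with columns 'rater', 'group' ('novice'/'experienced'),
# 'scale' ('holistic'/'analytic'), and 'pct' (the code's percentage).

def compare_scales(protocols: pd.DataFrame):
    """Aggregate to one mean percentage per rater per scale, then run
    the Wilcoxon Signed-Ranks test on the paired rater-level values."""
    per_rater = protocols.groupby(["rater", "scale"])["pct"].mean().unstack()
    return wilcoxon(per_rater["holistic"], per_rater["analytic"])

def compare_groups(protocols: pd.DataFrame):
    """Aggregate to one mean percentage per rater, then run the
    Mann-Whitney test on the two independent rater groups."""
    per_rater = protocols.groupby(["rater", "group"])["pct"].mean().reset_index()
    novices = per_rater.loc[per_rater["group"] == "novice", "pct"]
    experienced = per_rater.loc[per_rater["group"] == "experienced", "pct"]
    return mannwhitneyu(novices, experienced)
```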
FINDINGS

Scale Effects

Table 2 reports descriptive statistics of the percentages of decision-making strategies and aspects of writing reported in the think-aloud protocols by main category across rating scales. Overall, (a) the participants reported more judgment (Mdn = 58% and 63% for holistic and analytic, respectively) than interpretation strategies (Mdn = 42% and 37%) with both rating scales; (b) self-monitoring focus was the most frequently mentioned (Mdn = 44% and 50%) and language focus the least frequently mentioned (Mdn = 23% and 20%) with both rating scales; and (c) Wilcoxon Signed-Ranks tests indicated that the holistic scale elicited significantly (p < .05) more interpretation strategies for the three focuses (self-monitoring, Mdn = 31%; language, Mdn = 6%; and rhetorical and ideational, Mdn = 4%) and more language focus (Mdn = 23%) than did the analytic scale, which elicited significantly more judgment strategies (Mdn = 63%) and self-monitoring focus (Mdn = 50%) than did the holistic scale.

TABLE 2
Descriptive Statistics for Decision-Making Behaviors by Rating Scale

                                  Holistic                  Analytic
                             Mdn     Min     Max       Mdn     Min     Max
Focus
  Self-monitoring*         43.88   36.18   62.30     50.40   39.29   62.53
  Rhetorical               31.00   18.58   44.10     28.10   22.24   36.84
  Language*                22.84   12.12   37.77     20.39   11.99   33.96
Strategy
  Interpretation*          41.70   32.86   51.12     37.38   25.12   43.67
  Judgment*                58.30   48.88   67.14     62.62   56.33   74.88
Strategy × Focus
  Interpretation
    Self-monitoring*       30.96   26.50   36.38     29.71   18.20   35.92
    Rhetorical*             3.67     .35   13.99      3.42     .80    6.31
    Language*               5.75    2.20   11.46      3.76    1.07   11.70
  Judgment
    Self-monitoring*       13.41    7.67   28.84     22.06   10.97   31.98
    Rhetorical             24.83   15.57   36.08     26.09   19.40   33.10
    Language               17.51    9.92   27.57     15.56    9.99   27.51

Note. N = 25 raters. *Wilcoxon Signed-Ranks tests indicated that the differences across rating scales were statistically significant at p < .05.

In terms of subcategories, Table 3 shows the strategies that were reported more frequently with each rating scale.

TABLE 3
Medians for Strategies That Differed by 1% or More Across Rating Scales

Strategies                                            Holistic Mdn   Analytic Mdn
Higher with the holistic scale
  Read or reread essay                                   19.32%         14.33%
  Interpret ambiguous or unclear phrases                  2.36%          1.29%
  Articulate general impression                           2.97%          1.83%
  Rate ideas and/or rhetoric                              3.18%          2.09%
  Classify errors into types                              3.26%          1.82%
  Consider lexis                                          2.28%          1.28%
  Consider syntax and morphology                          3.62%          2.24%
  Consider spelling or punctuation                        3.78%          1.91%
Higher with the analytic scale
  Refer to, read or interpret rating scale                7.78%         11.07%
  Articulate, justify or revise scoring decision          8.55%         16.85%
  Assess text organization                                2.98%          4.54%
  Assess style, register, or linguistic appropriacy       1.10%          3.49%
  Rate language overall                                   1.04%          3.32%

Table 3 shows that there were more references to specific linguistic features (e.g., syntax, lexis, spelling) with the holistic scale, whereas the analytic scale elicited more reference to rating language overall (see the appendix for examples). In addition, with holistic scoring raters tended to read and interpret the essay more frequently, whereas the analytic scale elicited more reference to the rating scale and to articulating and justifying scores. Finally, the analytic scale prompted more references to text organization and linguistic appropriacy.

Rater Experience Effects

Table 4 reports descriptive statistics for the percentages of think-aloud codes by main category across rater groups. It shows that, overall, (a) both groups reported more judgment (Mdn = 59% and 61% for novices and experienced raters, respectively) than interpretation (Mdn = 41% and 39%) strategies; (b) self-monitoring focus was the most frequently mentioned (Mdn = 49% and 45%) and language the least frequently mentioned focus (Mdn = 23% and 20%) for both groups; and (c) the novice raters reported slightly more interpretation strategies (Mdn = 41%) and self-monitoring focus (Mdn = 49%) than the experienced group (Mdn = 39% and 45%, respectively), who reported slightly more judgment strategies (Mdn = 61%) and rhetorical and ideational focus (Mdn = 30%). Mann-Whitney tests indicated that none of these differences was statistically significant at p < .05, however.

TABLE 4
Descriptive Statistics for Decision-Making Behaviors by Rater Group

                                  Novice (n = 11)          Experienced (n = 14)
                             Mdn     Min     Max       Mdn     Min     Max
Focus
  Self-monitoring          49.08   43.29   62.42     45.25   39.91   54.74
  Rhetorical               27.70   22.43   37.38     30.32   21.32   38.83
  Language                 23.04   13.45   27.99     20.43   13.91   34.62
Strategy
  Interpretation           40.88   36.19   45.09     38.51   31.94   45.52
  Judgment                 59.12   54.91   63.81     61.49   54.48   68.06
Strategy × Focus
  Interpretation
    Self-monitoring        30.20   25.91   34.69     29.42   23.77   33.24
    Rhetorical              4.47     .81    9.31      3.33     .90    7.30
    Language                4.81    2.31   11.58      4.58    1.76    9.65
  Judgment
    Self-monitoring        18.30   14.81   27.73     16.79   10.83   26.43
    Rhetorical             23.32   19.01   30.68     27.02   18.68   33.52
    Language               16.28   11.15   20.19     15.75   12.15   24.97

Table 5 shows the subcategories that each rater group reported more frequently than the other group did. Overall, Table 5 shows that the novices tended to refer to the rating scale and to focus on local textual aspects and understanding essay content (e.g., summarize ideas) more frequently than did the experienced raters, who tended to refer more frequently to the essay and to rhetorical aspects of writing such as text organization and ideas, as well as the writer's situation and essay length, two aspects that were not included in the rating scales.

TABLE 5
Medians for Strategies That Differed by 1% or More Across Rater Groups

Strategies                                            Novice Mdn   Experienced Mdn
Higher for the novice group
  Refer to, read or interpret rating scale              10.15%          8.38%
  Articulate or justify score                           13.79%         11.22%
  Interpret ambiguous or unclear phrases                 2.03%          1.02%
  Summarize ideas and propositions                       1.87%          0.71%
  Edit or interpret unclear phrases                      1.69%          0.51%
  Consider spelling and punctuation                      3.78%          2.64%
Higher for the experienced group
  Read or reread essay                                  15.89%         17.35%
  Envision writer's personal situation                   0.66%          1.67%
  Assess text organization                               2.42%          4.19%
  Rate ideas and/or rhetoric                             2.01%          3.13%
  Assess quantity                                        1.01%          2.17%

Interaction Effects

Table 6 reports descriptive statistics of the percentages of think-aloud codes by main category across rating scales and rater groups.
First, comparing across rating scales within rater group, Table 6 shows that both rater groups reported more self-monitoring focus and judgment strategies with the analytic scale and more interpretation strategies and language-interpretation with the holistic scale. Wilcoxon Signed-Ranks tests indicated that these differences across rating scales were statistically significant for both rater groups at p < .05. In addition, the novice raters [...]

[...] influence essay rating processes and outcomes, including the broader sociocultural, institutional, and political contexts within which ESL essay rating occurs. As Torrance (1998) argued, essay rating is a "socially situated process" with a "social meaning" and "social consequences." The social context within which the assessment occurs is central because it provides meaning and purpose for the rating and [...]
[...] number of participants in this study, it cannot detect such qualitative differences as variation in sequences of decision-making behaviors and individual rating styles within and across rating scales, raters, and groups. Rating style refers to how a rater reads the essay, interprets the rating scale, and assigns a score (Lumley, 2005; Sakyi, 2003; Smith, 2000). Both sequencing and rating styles are important [...]

[...] assessment systems and contexts. For instance, the current study included rating scales that are identical in terms of evaluation criteria, wording, and number of score levels. Future studies could compare rater performance across rating scales that vary in terms of wording, focus and number of rating criteria, and number of score levels. In addition, with the growing interest in alternative approaches to assessment, [...]

[...] their attention on the rating task and rating criteria in the rubric, to lessen the cognitive demands of weighting and arbitrating between rating criteria, and to enhance their self-consistency. These are effects that Weigle (1994, 1998) found to be associated with rater training as well. Because it is more complex, holistic scoring may require a higher level of rating expertise (Cumming, 1990; Huot, 1993) [...]

[...] aspects of their rating processes (e.g., rating criteria); and that these effects, as well as the quality and quantity of verbalization, varied across raters and rating scales. These limitations need to be taken into account when interpreting the findings and conclusions of this study. Second, the think-aloud protocols were coded and analyzed quantitatively. As Cumming et al. (2001) noted, the coding framework [...]

[...] Analyses at the individual rater level indicated that there was some individual variability in terms of decision-making behavior and aspects of writing attended to. The findings of this study thus suggest that analytic scoring focuses raters' attention on the criteria listed in the rating scale (Goulden, 1994) and allows raters to reduce the number of conflicts they face in their scoring decisions (cf. [...]

REFERENCES

Barkaoui, K. (2007b). Rating scale impact on EFL essay marking: A mixed-method study. Assessing Writing, 12, 86–107.
Barkaoui, K. (2008). Effects of scoring method and rater experience on ESL essay rating processes and outcomes. Unpublished doctoral thesis, University of Toronto.
Barkaoui, K. (forthcoming). Think-aloud protocols in research on essay rating: An empirical …
Broad, B. (2003). What we really value: Rubrics in teaching and assessing writing. Logan: Utah State University Press.
Cumming, A. (1990). Expertise in evaluating second language compositions. Language Testing, 7, 31–51.
Cumming, A., Kantor, R., & Powers, D. (2001). Scoring TOEFL essays and TOEFL 2000 prototype writing tasks: An investigation into raters' decision making and development of a preliminary …
Song, B., & Caruso, I. (1996). Do English and ESL faculty differ in evaluating the essays of native English-speaking and ESL students? Journal of Second Language Writing, 5, 163–182.
Torrance, H. (1998). Learning from research in assessment: A response to writing assessment-raters' elaboration of the rating task. Assessing Writing, 5, 31–37.
Vaughan, C. (1991). Holistic assessment: What goes on in the rater's mind? In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 111–125). Norwood, NJ: Ablex.
Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11, 197–223.
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15, 263–287.
Weigle, S. C. (1999). Investigating rater/prompt interactions in writing assessment: Quantitative and qualitative approaches. Assessing Writing, 6, …