Coherence and cohesion
Coherence
Coherence, while more challenging to define than cohesion, can be understood through thematic progression, which illustrates how meaning unfolds in a text. Halliday, influenced by the Prague School of Linguistics, conceptualized text as a series of clauses, with each clause centered around a theme that indicates its subject matter.
The theme, as described by Halliday and Matthiessen (2004), is the clause's point of departure; the rheme is the remainder of the clause, which introduces new information related to the theme. The rheme of one clause often becomes the theme of a subsequent clause, enhancing the overall discourse flow within the text. Halliday emphasized that both paragraphs and entire texts exhibit a thematic pattern, contributing to coherent communication.
Rhetorical Structure Analysis, introduced by Mann and Thompson in 1989, offers a method for evaluating coherence in texts by examining the hierarchical relationships between key propositions, known as nuclei, and their supporting elements, referred to as satellites. Mann and Thompson identified 20 distinct relationships between satellites and nuclei, including elaboration, concession, and evidence, which help clarify how supporting information enhances the main ideas.
Propositional coherence has been explored through topic-based analysis, which employs a top-down approach rooted in schemata theory, as noted by Watson Todd (1998). This method organizes content schemata hierarchically, often represented in tabular or tree diagram formats, to track the evolution of topics within a text. In his analysis of spoken discourse, Crow (1983) identified six ways topics can progress: topic maintenance, topic shift, non-coherent topic shift, coherent topic shift, topic renewal, and topic insertion. However, topic-based analysis faces challenges due to the subjectivity involved in defining specific topics and their interrelationships, as well as tracking their progression throughout the text.
Topic Structure Analysis (TSA) is an approach to analysing coherence that builds on the work of Halliday and the Prague School of Linguistics, categorizing the types of thematic progression in a text. The predominant types include sequential progression, where the rheme of one sentence becomes the theme of the next, and parallel progression, where the theme is carried over to subsequent clauses. Extended parallel progression features a return to the initial theme after a sequence of related topics. Notable studies by Connor and Farmer (1990) and Schneider and Connor (1990) have explored these concepts. While thematic progression offers insights into text coherence, it does not encompass all aspects of coherence.
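To make these progression types concrete, here is a minimal sketch that labels the link between consecutive clauses from hand-annotated theme/rheme pairs. It is illustrative only: the `Clause` type, the word-overlap test, and the sample clauses are assumptions for demonstration, and genuine TSA relies on human judgement of semantic links rather than string matching.

```python
# Illustrative sketch of labelling thematic progression between consecutive,
# hand-annotated clauses. Real TSA requires human judgement of semantic
# links; crude word overlap stands in for that judgement here.
from dataclasses import dataclass

@dataclass
class Clause:
    theme: str   # the clause's point of departure
    rheme: str   # the new information developed about the theme

def overlaps(a: str, b: str) -> bool:
    return bool(set(a.lower().split()) & set(b.lower().split()))

def progression_type(prev: Clause, curr: Clause) -> str:
    if overlaps(prev.rheme, curr.theme):
        return "sequential"   # previous rheme becomes the new theme
    if overlaps(prev.theme, curr.theme):
        return "parallel"     # same theme carried forward
    # Extended parallel progression (a return to an earlier theme) would
    # require comparing curr.theme against themes further back in the text.
    return "unrelated"

clauses = [
    Clause("The test", "assesses four criteria"),
    Clause("These criteria", "carry equal weight"),    # sequential
    Clause("These criteria", "were revised in 2005"),  # parallel
]
for prev, curr in zip(clauses, clauses[1:]):
    print(progression_type(prev, curr))
```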
Coherence in the organization of a text is an aspect often overlooked by TSA. Studies of rhetoric indicate that specific text types, such as essays, possess distinct features that aid in interpretation and composition (Paltridge, 2001). The essay, familiar to English educators and examiners, follows a typical introduction-body-conclusion format. Connor (1990) highlighted that the Toulmin measure of logical progression (claim, data, and warrant) was pivotal in evaluating essays by experienced markers. These structural elements are integral to academic writing curricula (Cox & Hill, 2004; Oshima & Hogue, 2006). However, research reveals that essay conventions can vary culturally. A study by Mickan and Slater (2003) found that while native speakers used their opening and closing paragraphs to clearly present and restate their positions, non-native speakers often lacked this clarity, resulting in writing that resembled discussion rather than definitive responses.
Cohesion
Cohesion analysis involves identifying the explicit lexical and grammatical elements that connect a text. Halliday and Hasan (1976) proposed a prominent framework categorizing cohesion into five types: reference, substitution, ellipsis, conjunction, and lexical cohesion. Reference chains utilize personal and demonstrative pronouns, determiners, and comparatives to link elements through anaphoric and, to a lesser extent, cataphoric relations. Conjunctions create logico-semantic ties with conjunctive markers that advance the text (Halliday and Matthiessen, 2004). Ellipsis and substitution enable the omission of parts of a sentence by referring back to earlier elements, while lexical cohesion arises from repetition, synonymy, meronymy, and collocation. Halliday refers to these methods of creating cohesion as 'cohesive devices'.
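As a rough illustration of what surface-level cohesion analysis looks for, the sketch below tags two of Halliday and Hasan's five categories in a toy text. The marker lists are tiny invented samples, not their full inventories, and lexical cohesion, ellipsis, and substitution are omitted because they cannot be detected by simple word lookup.

```python
# A deliberately crude tagger for two of the five cohesion categories.
# Lexical cohesion, ellipsis and substitution need semantic analysis and
# are out of scope for a word-lookup sketch like this.
CONJUNCTIONS = {"however", "therefore", "furthermore", "firstly", "secondly"}
REFERENCE_ITEMS = {"he", "she", "it", "they", "this", "that", "these", "those"}

def tag_cohesive_items(text: str) -> dict[str, list[str]]:
    words = [w.strip(".,;").lower() for w in text.split()]
    return {
        "conjunction": [w for w in words if w in CONJUNCTIONS],
        "reference": [w for w in words if w in REFERENCE_ITEMS],
    }

sample = "Firstly, taxes rose. However, this did not reduce spending."
print(tag_cohesive_items(sample))
# {'conjunction': ['firstly', 'however'], 'reference': ['this']}
```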
Hoey (1991) emphasized the importance of lexical ties in text cohesion, proposing that text is 'organised' rather than 'structured' and that well-connected sentences form 'inter-related packages of information' (p. 48). He argued that the collective meaning of sentences exceeds their individual contributions (p. 13). While building on Halliday's concepts of lexical cohesion, Hoey also introduced 'cohesive breaks', which Watson Todd et al (2007) suggest may indicate points of communication failure, potentially making Hoey's approach more effective for cohesion analysis than that of Halliday and Hasan. Hoey noted that cohesive ties can lead readers to perceive a text as coherent (p. 12), yet cautioned that excessive cohesion might result in a lack of coherence due to over-repetition or weak logical connections.
The role of the band descriptors
Various researchers have suggested that rating variance may relate to the vagueness of the descriptors in different rating scales (Watson Todd, Thienpermpool et al, 2004; Watson Todd, Khongput et al, 2007).
According to Shaw and Falvey (2008), the creation of a rating scale and its descriptors is essential for ensuring the validity of assessments. Researchers, including North and Schneider (1998) and Turner and Upshur (2002), have emphasized the need for further studies to develop rating scales grounded in robust empirical analyses of sample written texts.
Knoch (2007) conducted an empirical study to develop a coherence measurement scale using a TSA approach, analyzing over 600 expository texts. Her scale incorporated variables such as direct and indirect progression, superstructure, coherence breaks, unrelated sequential progression, parallel progression, and extended progression. Trained raters applied this new scale to assess 100 university diagnostic scripts, achieving more consistent ratings compared to the original multi-trait scale, which included nine traits like organization and style. The TSA scale enabled a detailed analysis of thematic progression, leading to a more objective evaluation of coherence. However, it still relies on the reader's perception for recognizing thematic links and does not encompass all aspects of coherence.
The ambiguity in descriptors across various rating scales raises concerns about construct validity. Weigle (2002) and McNamara (1996) emphasize that band descriptors reflect the constructs being assessed, highlighting their theoretical foundations. For a test to maintain construct validity, these descriptors must articulate the assessment criteria clearly and understandably for users. This clarity is crucial for evaluating coherence and cohesion, as Knoch (2007) indicates that challenges in rating may stem from difficulties in operationalizing these constructs.
The revised IELTS descriptors prioritize analytic marking over holistic marking to enhance observation accuracy, minimize rater bias, and discourage norm-referencing (Shaw and Falvey, 2008). This revision was informed by various research studies, notably Kennedy and Thorp's analysis of IELTS sample scripts (2007) and Hawkey's Common Scale for Writing studies (2001). The descriptors were refined through an iterative process involving trialling and redrafting by independent rater teams, with sample scripts evaluated against the updated descriptors and supported by both quantitative and qualitative validation studies (Shaw and Falvey, 2008).
Shaw and Falvey (2004) conducted a quantitative study with only 15 experienced raters, highlighting the need for ongoing validation studies to ensure the reliability of rating scales (Shaw and Falvey, 2008, p. 13).
Examiner characteristics
The reliability of exam raters is influenced by factors such as their background and experience (Hamp-Lyons, 1991; Milanovic, Saville, and Shuhong, 1996; Wolfe, 1997). North and Schneider (1998) emphasize that even the best descriptors, no matter how objectively scaled, are still open to interpretation by raters concerning different groups of learners (p. 243).
Eckes (2008, p. 156) highlights that raters can vary significantly in their interpretation and application of scoring criteria, as well as their adherence to the scoring rubric. This variation includes differences in the severity or leniency of their evaluations and the consistency of their ratings across different examinees, scoring criteria, and performance tasks.
Research highlights variations in examiners' marking styles, with Wolfe (1997) finding that systematic scorers who read essays thoroughly before grading produced more reliable results. A follow-up study by Wolfe, Kao et al (1998) indicated that consistent raters adhered closely to the scoring rubric and maintained a general focus, contrasting with less proficient raters. Detailed analyses by DeRemer (1998) and Lumley (2002, 2005) explored the intricate problem-solving strategies of examiners, revealing differing approaches: one examiner aligned responses with the text and rubric, another quickly assigned grades based on initial impressions, while a third carefully considered the rubric before grading.
DeRemer (1998) identified three primary evaluation approaches: general impression scoring, text-based evaluation, and rubric-based evaluation. Lumley (2002) highlighted the intricate nature of the scoring process, noting that examiners first form a global impression of a script before aligning it with band descriptors for a final score. However, this holistic approach has faced criticism for its reliability and validity, as noted by Allison (1999) and O'Sullivan and Taylor (2002), as referenced in Shaw and Falvey (2008).
Examiner background may also be a factor affecting the reliability of marking written scripts. Eckes (2008) attempted to correlate marking style with examiners' background. In a survey-based study of 64 markers assessing writing tasks in German as a foreign language, examiners were asked to prioritize the text features they considered most important in evaluation. Eckes identified six types of raters, four of which were particularly prominent: the Syntax Type, the Correctness Type, the Structure Type, and the Fluency Type.
Research indicates that examiner characteristics significantly influence marking preferences, with older examiners showing a reduced focus on Fluency. Additionally, raters fluent in multiple foreign languages tend to prioritize Syntax, while those with limited language skills lean towards Fluency. A study by Barkaoui (2007) in Tunisia revealed that raters rely heavily on 'internal criteria' shaped by their teaching backgrounds, despite having undergone extensive training. Factors such as education, teaching experience, and marking experience may further affect how examiners interpret criteria for coherence and their overall marking approach.
Several authors, including Furneaux and Rignall (2007) and Kennedy and Thorp (2007), advocate for the use of think-aloud protocols to investigate subjective marking processes. Numerous studies, such as those by Wolfe (1997) and Brown (2000), have effectively employed this methodology. Milanovic, Saville, and Shuhong (1996) emphasize the importance of these studies in enhancing examiner training.
Verbal protocols, while not fully capturing the complexity of examiners' marking processes, offer valuable insights into their cognitive evaluations during assessments. As highlighted by Lumley (2002) and Brown (2000), examiners may only partially articulate their thought processes and may remain unaware of their deeply internalized reactions to candidates' writing. Nonetheless, these protocols can yield rich data regarding the specific features of text that draw examiners' attention during the evaluation of scripts.
Examiner training
Effective rater training significantly influences the assessment of writing performance, as highlighted by various studies (Weigle, 1994; Wolfe, 1997; Weigle, 1998). To minimize variability among examiners, training is crucial, with evidence showing that it enhances both rater consistency and inter-rater reliability (Weigle, 1994; Weigle, 1998; Knoch, Read et al, 2007; Schaefer, 2008). According to Hamp-Lyons (2007), trainees should emerge from training feeling confident and engaged, fostering a sense of community and a shared language for analyzing scripts. A survey by McDowell (2000) indicated that IELTS examiners generally view the training positively, though some expressed a desire for more problem scripts and felt less prepared for marking Task 2 compared to Task 1. Schaefer (2008) recommends improving training through multi-faceted Rasch analysis to help raters recognize their bias patterns. Additionally, Shaw (2002) calls for further research into the effectiveness of consensus-style training versus top-down approaches.
The literature reviewed above raises questions about the clarity of the CC band descriptors. This study therefore investigates how examiners interpret these descriptors and the impact of training on their implementation.
A mixed methods study was conducted, comprising both qualitative and quantitative phases. The qualitative phase involved 12 examiners and aimed to deeply explore their perceptions and training regarding the assessment of coherence and cohesion (CC). The quantitative phase expanded this investigation through a survey of 55 examiners, focusing on the reliability of their marking. It assessed how examiners' evaluations of coherence and cohesion compared to standardized scores and examined the influence of variables such as qualifications and experience on rater scoring.
Ethics clearance for the study was secured from the University of Canberra's Committee for Ethics in Human Research, with all research personnel signing confidentiality agreements. Access to the IELTS examiner training materials was granted under secure conditions; these materials included an overview of the training, the 'Instructions for Examiners' booklet, and the band descriptors.
A review of these materials relating to CC was undertaken to identify the key concepts underpinning the scoring system for IELTS CC.
Phase 1: Qualitative phase
The qualitative phase of the study used both a think-aloud protocol, recorded as examiners were in the process of marking, and a follow-up semi-guided interview.
Twelve volunteers were selected from two Australian testing centres, including six IELTS examiners with less than two years of experience and six with over five years of experience, comprising three males and nine females. Participants were compensated at standard marking rates, and all examiners and testing centre administrators signed ethics approval forms, adhering to IELTS confidentiality agreements. To maintain anonymity, examiners are identified by initials in this report. Participants received minimal information about the study's purpose to reduce the influence of prior knowledge on their think-aloud reports.
Examiners evaluated a set of 10 standardized Academic Task 2 scripts using established IELTS marking procedures, with scripts provided by Cambridge ESOL representing a diverse range of proficiency levels. All scripts focused on the same Writing Task A, and the first five were assessed traditionally. Following a brief break, the remaining five scripts were marked under 'think-aloud' conditions, where examiners verbalized their thoughts during the assessment process, as described by Shaw and Falvey (2006).
Examiners initially evaluated the first five scripts using standard marking methods to ensure they understood the task and potential response types. Following this, they assessed the next five scripts with the unfamiliar 'think-aloud' procedure.
Ericsson and Simon (1984) and Faerch and Kasper (1987) emphasize the importance of addressing the limitations of introspective research methods. Since cognition is inherently a private activity (Padron and Waxman, 1988), many examiners struggle to articulate their cognitive processes. To facilitate this, the think-aloud method was thoroughly explained and demonstrated, encouraging examiners to verbalize their thoughts while marking scripts, whether they were reading the content, reviewing descriptors, or contemplating a grade.
Participants expressed concerns about researchers monitoring their internal thoughts; however, they were assured that the study was non-evaluative, and all data would be anonymized and kept confidential.
The think-aloud procedure utilized in Phase 1 of this research study diverges from standard IELTS script marking, potentially leading the 12 participating examiners to evaluate scripts differently than they would under typical conditions. Consequently, the reliability of their assessments was not compared to standardized scores. However, think-aloud protocols provide valuable insights into examiner cognition that other methods cannot capture (Falvey and Shaw, 2006, p. 3). To enhance data triangulation, follow-up interviews were conducted, and the qualitative findings were subsequently aligned with the quantitative data in Phase 2 of the study.
Immediately on completion of the think-aloud recording, each examiner participated in a semi-guided interview lasting from 30 minutes to one hour.
The semi-guided interview schedule (Appendix 2) included questions to probe:
- examiners' perceptions of the different criteria
- their views on the band descriptors
- specific features of CC which affect their decision-making
- their views of the training in relation to CC
Examiners were then asked to comment on their assessment of CC in the scripts they had just marked. Both the think-aloud protocols and the interviews were recorded.
To enhance the validity and reliability of the interview schedule and think-aloud protocols during Phase 1, several measures were implemented. These included consultations with experienced IELTS examiners, refining and piloting the interview schedule, and testing both the think-aloud process and follow-up interviews to ensure an organized and timely data collection process.
The think-aloud protocols and semi-guided interviews were transcribed by a research assistant under the close supervision of the researchers. The transcripts underwent thorough verification before being divided into segments for analysis.
The analysis of the think-aloud protocols involved segmenting and coding each transcript, primarily following the methodologies established by Green (1998) and Lumley (2005). Segmentation was conducted at the clause level; if a single idea extended into the next meaning unit, both units were combined into one segment. Each instance of an examiner reading from the script was documented as a single segment.
Segments were coded at four levels:
First, each segment was coded for the examiner's overall behavior during the marking process. This encompassed activities such as managing the marking, reading the script, the criteria or the question, evaluating the script, and interpreting the writer's intended meaning.
Second, segments were coded to identify each examiner's specific behaviors during marking, including evaluating the script, hesitating, assigning grades, and justifying grading decisions.
Third, segments were coded to determine what examiners referred to while forming their judgments. This included evaluations of the text as a whole, the application of individual criteria, namely Task Response (TR), Coherence and Cohesion (CC), Lexical Resource (LR) or Grammatical Range and Accuracy (GRA), as well as occasional assessments of the test takers themselves.
Fourth, segments related to CC were coded to pinpoint the specific features evaluated by examiners. Key features included logical organization, progression, paragraphing, discourse markers, reference, and substitution, as outlined in the band descriptors, along with terms such as 'flow', 'linking words', and 'overall structure' drawn from the examiners' think-aloud recordings.
This study analyzed the think-aloud data to understand the cognitive processes of markers assessing the criteria, focusing specifically on segments related to the assessment of CC. Two researchers independently coded the data, ensuring consistency through careful cross-checking. Some segments received multiple codes, while others were ambiguous, leading to interpretive rather than definitive coding. Consequently, this paper reports only on segments where examiners explicitly referenced major features of CC or provided examples in their assessments (see Table 1).
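A hypothetical data structure for this four-level coding scheme is sketched below. The field names and code values are assumptions based on the description above and the codes in Table 1, not the researchers' actual coding instrument.

```python
# Hypothetical record type for one coded think-aloud segment, mirroring the
# four coding levels described above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Segment:
    examiner: str                     # e.g. "K"
    text: str                         # the clause-level meaning unit
    behaviour: str                    # level 1: e.g. "reading", "evaluating"
    action: str                       # level 2: e.g. "hesitating", "grading"
    focus: str                        # level 3: "whole text", "TR", "CC", "LR", "GRA"
    cc_feature: Optional[str] = None  # level 4: a Table 1 code, CC segments only

segments = [
    Segment("K", "So there's a problem with the coordinator there",
            "evaluating", "judging", "CC", cc_feature="CONJ"),
]
# Only segments explicitly focused on CC feed the feature analysis.
cc_segments = [s for s in segments if s.focus == "CC"]
```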
| # | Feature of CC explicitly discussed | Code | Example |
|---|---|---|---|
| 1 | coherence | COH | Coherence, well they're trying. They're trying. (M/182) |
| 2 | meaning/message/ideas | M | You can certainly see what he's trying to say (B/446); You can get a message there I suppose (D/387) |
| 3 | argument | ARG | this argument is not coherent (A/18) |
| 4 | flow/fluency | FL | but it's the overall flow is OK (F/662) |
| 5 | clarity | CL | it's certainly not as clear as an 8 (L/30) |
| 6 | logic | LOG | what he's got to say is logical (K/79) |
| 7 | logical organisation | ORG | on the whole it's logically organised (J/50) |
| 8 | progression | PRO | and there's no clear progression. (L/218) |
| 9 | logical relationships/semantic links | REL | Um, yep, they [the ideas] are - they relate to each other (E/191) |
| 10 | paragraphing | PARA | Paragraphing doesn't look as good (D/152) |
| 11 | introduction | INTRO | OK introduction's pretty sloppy (M/212) |
| 12 | conclusion | CONCL | and the - probably not complete, incomplete conclusion is open ended (B/449) |
| 13 | cohesion | COHESION | Um, it's fairly high in terms of cohesion I think (S/125) |
| 14 | cohesive devices | CD | yeah, there is certainly a range of cohesive devices (L/41) |
| 15 | coordinating conjunctions | CONJ | So there's a problem with the coordinator there (K/4); He's got some idea of basic conjunctions as well as basic transition signals (S/217) |
| 16 | discourse markers | DM | So automatically I'm drawn to the fact that the discourse markers are way off (K/115) |
| 17 | reference | REF | Reference is OK. Um (S/345) |
| 18 | substitution | SUB | It's more the lack of substitution, um makes it seem very repetitive (K/32) |

Table 1: Features of CC and their codes explicitly referred to in the think-aloud data
Phase 2: Quantitative phase
Fifty-five examiners were recruited from four different testing centres. They comprised 22 males, 28 females, and five who did not identify their gender in the survey data. The examiners were employed under the same conditions as the participants in Phase 1. Their biodata can be seen in Appendix 4.
Examiners evaluated 12 standardized Academic Task 2 scripts from Cambridge ESOL, comprising six scripts for Writing Task A and six for Writing Task B, representing proficiency levels from Band 3 to Band 8. Initially, the same set of 10 scripts was intended for both phases of the study; however, concerns regarding the wording of Task A prompted a revision. To ensure consistent marking in Phase 2, it was decided that examiners would assess six scripts for Task A and six for an alternative Task B, thus minimizing the potential impact of question type on the marking process.
To counter any script order effect on examiner marking, the scripts were sorted into four groups and distributed at random to the examiners (a sketch of this counterbalancing follows the list):
- Task A, Scripts 1-6, followed by Task B, Scripts 1-6
- Task A, Scripts 6-1, followed by Task B, Scripts 6-1
- Task B, Scripts 1-6, followed by Task A, Scripts 1-6
- Task B, Scripts 6-1, followed by Task A, Scripts 6-1
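A minimal sketch of this counterbalancing is given below. The script labels and the assignment function are hypothetical; the report only specifies the four fixed orderings and their random distribution.

```python
# Hypothetical sketch of the counterbalanced script orderings described
# above: four fixed sequences, assigned to examiners at random so that
# any order effect averages out across the pool.
import random

task_a = [f"A{i}" for i in range(1, 7)]   # Task A, Scripts 1-6
task_b = [f"B{i}" for i in range(1, 7)]   # Task B, Scripts 1-6

ORDERINGS = [
    task_a + task_b,               # Task A 1-6, then Task B 1-6
    task_a[::-1] + task_b[::-1],   # Task A 6-1, then Task B 6-1
    task_b + task_a,               # Task B 1-6, then Task A 1-6
    task_b[::-1] + task_a[::-1],   # Task B 6-1, then Task A 6-1
]

def assign_orderings(examiners: list[str]) -> dict[str, list[str]]:
    return {e: random.choice(ORDERINGS) for e in examiners}

print(assign_orderings(["examiner_01"])["examiner_01"][:4])
```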
Although this was an experimental study, standard marking conditions were adhered to as closely as possible during data collection in both Phase 1 and Phase 2, to minimize any potential influence of the research design on the results.
After marking the 12 scripts, examiners were asked to complete a questionnaire comprising three parts:
- Part A sought to investigate examiner perceptions in relation to their assessment of CC
- Part B asked questions in relation to examiner perceptions of the training in CC
- Part C collected information about the background qualifications and experience of the participants
The questionnaire utilized a variety of question types, including five-point Likert scales, yes/no questions, and ranking questions (refer to Appendix 5). To enhance the validity of the measurement tool, the questionnaire was refined through four drafts, piloted, and reviewed by four experienced IELTS examiners, including a senior examiner, as well as a quantitative research consultant from the university.
To examine the reliability of examiners' marking and the influence of variables such as qualifications and experience, Spearman correlations were calculated between examiner scores on each criterion and the total and standardized IELTS scores. A confidence interval around the acceptable correlation of 0.8 suggested by Alderson, Clapham, and Wall (1995) was established using Howell's method. Since Spearman correlations are not normally distributed, the Fisher transformation was applied to prepare the data for parametric hypothesis testing. The reliability of examiners on each criterion was assessed by comparing the mean correlations using a repeated measures Analysis of Variance, with results adjusted via the Bonferroni method to account for multiple comparisons. For clarity, the mean Spearman correlation scores are reported instead of those derived from the Fisher transformation.
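The sketch below illustrates the core of this procedure with SciPy, using invented scores: a Spearman correlation between one examiner's CC bands and the standardized scores, Fisher-transformed so that correlations can be averaged and tested parametrically. The data and the exact confidence-interval formula are assumptions; the report's procedure (Howell's method) may differ in detail.

```python
# Illustrative only: Spearman correlation for one examiner against the
# standardized scores, with a Fisher z-transformation. Scores are invented.
import numpy as np
from scipy import stats

examiner_cc = [5, 6, 4, 7, 8, 5, 6, 3, 7, 6, 5, 4]   # examiner's CC bands
standard_cc = [5, 6, 5, 7, 8, 4, 6, 3, 7, 5, 5, 4]   # standardized CC scores

rho, _ = stats.spearmanr(examiner_cc, standard_cc)

# Fisher z-transformation: correlations are not normally distributed, but
# their z-transforms are approximately normal, so they can be averaged and
# compared with parametric tests (e.g. repeated measures ANOVA).
z = np.arctanh(rho)
se = 1 / np.sqrt(len(examiner_cc) - 3)                # approximate SE
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
print(f"rho = {rho:.2f}, approx. 95% CI = ({lo:.2f}, {hi:.2f})")
```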
To evaluate the harshness or leniency of individual examiners, mean scores across all criteria were compared to standard scores using independent samples t-tests. Factors influencing examiner reliability, including gender and teaching experience, were analyzed through independent samples t-tests on the correlation differences between examiners' scores and standard scores on CC. For groups with fewer than 15 participants, the Mann-Whitney U test was employed. Prior to any mean comparisons, such as ANOVA or t-tests, Levene's test assessed the homogeneity of variance, while the Shapiro-Wilk test evaluated score normality. All analyses were performed using SPSS 13 and Systat 13.
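A sketch of this group-comparison logic, again with invented data, is shown below. The grouping variable, the values, and the group sizes are assumptions for illustration; only the choice of tests follows the description above.

```python
# Illustrative group comparison: differences between examiners' CC
# correlations and the benchmark, split by a background variable.
from scipy import stats

group_a = [0.05, -0.10, 0.02, 0.08, -0.03, 0.01, 0.06, -0.02]
group_b = [-0.04, 0.03, -0.08, 0.02, -0.01, 0.04]

# Assumption checks, as in the report.
_, p_levene = stats.levene(group_a, group_b)   # homogeneity of variance
_, p_norm_a = stats.shapiro(group_a)           # normality, group A
_, p_norm_b = stats.shapiro(group_b)           # normality, group B

if min(len(group_a), len(group_b)) < 15:
    # Small groups: fall back to the non-parametric Mann-Whitney U test.
    stat, p = stats.mannwhitneyu(group_a, group_b)
else:
    stat, p = stats.ttest_ind(group_a, group_b)
print(f"test statistic = {stat:.3f}, p = {p:.3f}")
```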
The findings for each research question are derived from both the qualitative Phase 1 and the quantitative Phase 2 of the study, and will be presented accordingly under each specific research question.
Research question 1: Do examiners find the marking of CC more difficult than other criteria?
The think-aloud protocols
In the think-aloud protocols, the time spent assessing each criterion serves as an initial indicator of marking difficulty, so the distribution of segments dedicated to each criterion was analyzed. Assessing Task Response (TR) might be expected to take longer than assessing Lexical Resource (LR) or Grammatical Range and Accuracy (GRA), as a significant portion of that evaluation involves interpreting the Task itself.
In the analysis of segment allocation, 24% of the total segments focused on the interpretation and assessment of TR, while 22% were dedicated to CC. In comparison, only 16% of the segments addressed LR, and a mere 12.5% pertained to GRA. Notably, for six out of the twelve examiners, a greater number of segments were allocated to CC than to TR.
These findings suggest that examiners devote considerable time to interpreting writers' answers. When the interpretation segments are excluded from TR, eight of the twelve examiners (two-thirds) dedicated more segments to CC than to TR, and the overall proportion of segments devoted to TR drops to 21%, just below the proportion for CC.
The time examiners spend reviewing the band descriptors while assessing scripts may also indicate the level of difficulty they encounter with a criterion.
Examiners allocated 29% of their descriptor-reading segments to the CC band descriptors, slightly more than the 28% for TR, 21% for LR, and 19% for GRA (see Table 2). Only four of the twelve examiners read the CC descriptors more than the others, while another four devoted equal numbers of reading segments to the TR and CC descriptors. The least attention was given to the GRA band descriptors.
| Reading segments | Examiners with 5+ years' experience | Examiners with less than 2 years' experience | Total | % |
|---|---|---|---|---|
| TR band descriptors | 29, 7, 15, 31, 5, 13 | 5, 5, 11, 4, 4, 11 | 140 | 28 |
| CC band descriptors | 29, 7, 15, 12, 0, 12 | 7, 7, 27, 16, 4, 10 | 146 | 29 |
| LR band descriptors | 15, 8, 13, 25, 0, 6 | 5, 1, 12, 4, 11, 4 | 104 | 21 |
| GRA band descriptors | 22, 5, 11, 9, 0, 10 | 0, 3, 19, 3, 12, 2 | 96 | 19 |

Table 2: Number of segments dedicated to reading the band descriptors (one figure per examiner within each experience group)
The analysis of hesitation during the assessment of the different criteria reveals that examiners exhibited the most uncertainty when marking CC, which accounted for 32% of all hesitation segments, compared to 28% for TR, 17% for GRA, and 15% for LR. This trend varies among individual examiners: Examiners P and E hesitated considerably more over CC, Examiner F only slightly more, while the other nine markers tended to hesitate more when evaluating TR.
Individual marking styles varied significantly among examiners, with Examiner D exhibiting notable hesitation and taking nearly twice as long to complete the marking process, in contrast to the confident and decisive approach of Examiner B. Despite these individual differences, there were considerably fewer hesitations noted in the assessment of LR and GRA compared to the other criteria.
Table 3: Number of segments coded as examiner hesitancy (examiners with 5+ years' and with less than 2 years' experience)
Interviews
The data from the interviews yielded some mixed findings on Research Question 1. While seven of the twelve examiners agreed that all assessment criteria pose similar challenges, four noted that the CC criterion is particularly lacking in clarity. Examiner K highlighted that the extensive descriptors for CC diverted focus from the actual script, and Examiner S echoed this sentiment:
I tend to do CC last because that's the one I'm least clear about. There's a fair bit to look at there. It's easier if you look at the others first (S)
Examiner J admitted that at the training course:
I found myself perplexed, as the aspect I had the least control over was the CC, unlike the other criteria. Since I primarily engage in lower-level teaching, high-level teaching is not a focus for me.
One of the examiners found CC easier to mark than the other criteria, saying that she found paragraphing made it easy for her to identify logical progression.
Interviews revealed that some examiners lacked confidence in assessing CC. Half felt reasonably assured about their marking, as in Examiner M's comment, 'I know overall that it will all balance out.' Examiner B showed complete certainty, while four examiners, including Examiner D, expressed doubts about their assessment capabilities:
I constantly struggle with self-doubt, feeling uncertain about my decisions and criteria. This ongoing hesitation makes evaluating student work a nightmare, as I find myself oscillating between different tasks and criteria. Even when I move on to the next assignment, my mind remains preoccupied with my previous evaluations, leading to a relentless cycle of uncertainty.
Surveys
In the Phase 2 survey, examiners assessed the difficulty of marking the four criteria, revealing that 66% identified CC as the most challenging, while 20% found TR to be the most difficult. Only 4% ranked LR as the most difficult criterion, and none of the examiners considered GRA to be the hardest to mark.
According to the survey results, 33% of respondents found LR to be the easiest criterion to mark, followed by GRA at 27% and TR at 20%. Only one examiner identified CC as the easiest, while 13% of examiners believed all criteria were equally difficult or easy to assess.
Table 4: Responses to the question, ‘In general, which criterion do you usually find most difficult to mark?’
Most examiners reported a reasonable level of confidence in assessing all four criteria; however, confidence in marking CC was notably lower. While 84% of examiners felt confident or very confident in their marking of TR, 93% expressed similar confidence for LR, and 94% for GRA, only 60% of examiners felt confident or very confident in marking CC, with 15% indicating they were not very confident, compared to just 4% for TR and LR. Only one examiner reported a lack of confidence in marking GRA.
Table 5: Examiners’ levels of confidence in marking each criterion
These findings indicate that a considerable number of examiners perceive the marking of CC as more challenging than the other assessment criteria. In addressing Research Question 2, we delve deeper into the specific aspects examiners prioritize when evaluating CC, aiming to uncover the reasons behind the increased difficulty associated with its marking.
Research question 2: What features are examiners looking for in marking CC?
Ranking of key features of CC: Phase 2 results
In Phase 2, examiners ranked eight key features of CC based on their perceived importance, on a scale from 1 (most important) to 8 (least important). Additionally, they indicated the frequency of their references to these features during marking. These two survey questions collectively aim to shed light on the significance of these features in the examiners' understanding of CC.
The features 'reference', 'substitution', 'paragraphing', 'message/ideas', and 'logical progression' were included because they are integral to the band descriptors for this criterion. 'Linking words', 'flow/fluency', and 'overall structure' were added because examiners in the qualitative phase of the study identified them as significant features of CC, despite their not being explicitly mentioned in the current band descriptors.
Table 7: Examiners' rankings of features of CC in terms of their importance in the assessment process
The features examiners ranked in the first or second position, and those ranked in the last two positions in terms of their relative importance, are discussed below.
The ranking exercise revealed that examiners have varying perceptions of the significance of certain features in the marking of CC. The strongest consensus concerned logical progression and substitution. 55% of examiners ranked logical progression as either the first or second most important feature, with only two examiners placing it in the lowest rankings.
A significant 46% of examiners ranked 'substitution' as either the lowest or second lowest in importance, aligning with Halliday and Hasan's (1976) observation that substitution is infrequently used. This low ranking may also stem from a lack of understanding among some examiners, as evidenced by the definitions they provided (refer to section 4.2.6).
Almost half of the examiners (49%) ranked 'flow/fluency' among the top two features, despite its absence from the band descriptors and the difficulty of assessing it analytically. Only four examiners placed this feature in the lowest two rankings. The think-aloud data suggest that examiners tend to assess the 'flow' of a response intuitively; there is little evidence that they critically analyze the logical progression of ideas within the text.
The rankings of other CC features revealed significant disagreement among examiners, particularly regarding paragraphing, which was ranked in the top two by nine examiners and in the bottom two by ten. This variance may stem from differing perceptions of the paragraph ceilings. Additionally, the ranking of 'message and ideas' showed a wide range of opinions, with 12 examiners placing it in the top two and 21 in the bottom two. Similarly, 'overall structure', although not explicitly defined in the descriptors, was ranked in the bottom two by 19 examiners while 11 placed it in the top two, highlighting inconsistencies in evaluation criteria.
Participant examiners reported referring to all eight key features of CC in their assessments, with 'logical progression' and 'flow/fluency' the most frequently mentioned. In contrast, 'substitution' and 'reference' were cited the least often, aligning with the findings from the ranking exercise.
Opinions also diverged notably on the significance of 'message/ideas' and 'overall structure' in the frequency-of-use responses. 26% of examiners reported always attending to 'overall structure', 36% very often, 26% sometimes, and 11% seldom. In contrast, 'paragraphing' was referred to more consistently, with 52% of examiners always referencing it and 23% doing so very often.
Table 8: Examiners' perceived frequency of use of features of CC (responses on a never / seldom / sometimes / very often / always scale, reported as n and %)
In the next section, we give more detailed feedback on examiners' perceptions of the key features of CC.
Coherence
Examiners distributed their attention fairly evenly across three main areas of coherence in writing: 23% of their coherence-related evaluations focused on the overall qualities of the text, including flow, fluency, and clarity; 26% concentrated on logical aspects; and another 23% assessed paragraphing.
Typical comments in relation to the general qualities of the text include the following:
- It's just a bit incoherent (A, line 88)
- Not sure that makes sense (T, line 108)
- There is a good level of clarity and fluency in this piece of writing on the whole though
- It's not fluent. There lacks fluency (F, lines 34-35)
- Well the sentences sort of flow on nicely (P, line 243)
- But the fluency and logic flow is not clear (B, line 47)
- So it's quite easy to go through and follow what the person is saying (B, line 74)
- It's pretty good, it flows quite well (S, lines 99-100)
Examiners in the think-aloud data often lacked concrete evidence to substantiate their impressions or intuitions regarding text coherence, although they occasionally referenced paragraphing to support their evaluations.
Experienced examiners A, M, and F, along with less experienced examiners K and L, spent over 40% of their time assessing the coherence of scripts focusing on overall qualities like flow, fluency, and clarity. This indicates a tendency towards an impressionistic approach in their marking.
Some examiners focus on traditional elements of argumentative essays, such as the introduction, body, and conclusion, which are not explicitly mentioned in the CC band descriptors. Examiner B's protocol illustrates this tendency:
209 Well the good thing is that there is presentation of very clear paragraphing
210 albeit the introduction is a single sentence
211 The 2 body paragraphs begin with ‘first’ and ‘second’
212 and the conclusion begins with ‘eventually’, spelt correctly
213 and there is an element of logicality to the paragraphs
When asked what they thought coherence meant, eight of the 12 examiners in the Phase 1 interviews characterised coherence principally as 'flow', 'fluency' and 'clarity':
Coherence is if it sort of flows okay; then I would say it was coherent (T)
Coherence I always think is about my understanding of what you mean. So it's your clarity. It's your, your sort of strategic choice of vocabulary that's going to get the message across.
So it’s about fluency and clarity in your style of writing (E)
Look at how it FLOWS really nicely! (P)
Other examiners associated coherence with rhetorical structure and logical argumentation. While all participants recognized the importance of paragraphing, these examiners specifically emphasized the need for logical organization and clear argumentation. In interviews, they articulated their understanding of coherence in these terms:
Coherence in writing is characterized by a well-developed argument that is clearly organized into paragraphs. The visual layout is crucial, as effective paragraphing enhances the reader's understanding of how the ideas connect. Ultimately, coherence arises from the structured organization and logical progression of the argument presented.
Coherence refers to the overall structure and organization of ideas within a piece of writing. It involves having a clear introduction and conclusion, as well as logically connected points that follow a cohesive flow throughout the text.
Coherence in writing is essential, particularly in paragraph structure. An effective introduction should clearly outline the topic and the writer's objectives. Following this, body paragraphs must be well-defined and aligned with the main points presented in the introduction, ensuring a cohesive and focused discussion throughout the response.
Examiners with a focus on structure often define coherence in terms of 'logic' and 'logical organization'. For instance, Examiner B described it as a 'logical stepping arrangement', while Examiner S elaborated on 'logic' by highlighting examples such as transitioning from general to specific ideas or employing chronological ordering. Additionally, Examiner J noted that logic is a cultural construct, emphasizing the importance of carefully evaluating the logic in candidates' responses:
Understanding logical order is essential for clear communication, yet individuals often have diverse thought processes that influence their reasoning. It's important to recognize that what seems logical to one person may not resonate the same way with another. Therefore, it's crucial to reassess our assumptions and not take information at face value, as constructing a coherent argument requires careful consideration of different perspectives.
Phase 1 think-aloud and interview data suggest that examiners typically relied on either a holistic, intuitive impression of the text for grading coherence or referred to structuralist concepts of coherence based on their knowledge of the traditional essay format.
Results from the Phase 2 surveys revealed a strong consensus among examiners regarding the definition of 'coherence'. Most agreed with Shaw and Falvey's (2008) conceptual definition, with 80% of examiners defining coherence in terms of the clarity, comprehensibility, and intelligibility of the message conveyed by a text. About half of these examiners emphasized the importance of making sense of the ideas presented.
Sixteen respondents emphasized the significance of 'meaning' in the development of ideas, and seven defined coherence in terms of conveying a clear message. 45% of examiners stressed the necessity of 'logic': some specifically mentioned 'logical progression' and 'logical sequencing', suggesting a view of writing as a dynamic process, while 33% of the respondents referred to 'logical organization' or 'logical structure', possibly conveying a more static or structural approach to coherence in text.
Nine percent of examiners explicitly mentioned the deductive essay format, emphasizing the importance of a structured argument that includes an introduction, body, and conclusion, along with paragraphs featuring topic sentences supported by relevant ideas and evidence. Additionally, four examiners (7%) referred to the term 'essay', even though it is not included in the IELTS band descriptors.
Eight examiners (15%) referred to the term 'flow', as in 'the flow of the text' or 'flow of ideas'.
Paragraphing
Paragraphing contributed 20% of total codes and was the most used code in the think-aloud data. It was also one of the first features of text that caught examiners' attention. For example, Examiner M, as she began to look at Script 9, commented as follows:
148 Right, moving on to number 9
149 OK, which is incredibly short
150 OK, this will be minus 2 on the TR
153 there’s just one large paragraph
154 so that’s going to be a problem with coherence OK
Three examiners discussed good paragraphs in terms of topic sentences plus supporting sentences, as in the extract from Examiner S's transcript below.
106 The only problem I find is that his paragraphs are relatively short
107 They might be only two sentences long
108 so effectively you’ve got a topic sentence and then a supporting sentence
109 but the supporting sentence doesn’t really fit with the topic sentence
110 It kind of it sort of jumps from the topic sentence to the supporting sentence
111 and it doesn’t really blend
While the band descriptors highlight the importance of a 'clear topic', they do not specifically reference 'topic sentence'. Additionally, examiners often describe the written work as an 'essay', deviating from the IELTS terminology of 'response'. For further details, see section 4.3.2.
The same underlying understanding of effective paragraphing as a feature of logical organisation appears to underlie Examiner A's thinking in the following excerpt:
206 Well first of all there are paragraphs
207 this thing has four paragraphs – an introduction, two bodies and a conclusion
212 There are clear main topics and there are paragraphs
213 Yeah, I think this is a 5 on that respect
Interviews reveal that the term 'topic sentence' is frequently utilized by examiners, with eight out of twelve referencing it during Phase 1. In contrast, the phrase 'clear central topic' was noted by only three examiners, highlighting a preference for the former terminology in their evaluations.
Effective paragraphing requires a clear topic sentence that guides the content, supported by additional sentences that reinforce the main idea. When paragraphs are skillfully managed, they present a coherent structure that enhances readability and comprehension.
…paragraphing … is there a structure like a, well, the conventional one of a topic sentence, elaboration and examples (L)
The interview data also showed that paragraphing was an aid to navigating the scripts.
Effective writing requires each paragraph to focus on a single central topic, with clear connections to preceding paragraphs. The absence of distinct paragraphing or identifiable topics within paragraphs can hinder readability and comprehension.
I analyze the coherence of a script by examining its paragraphs, assessing their logical flow and structure. I focus on the organization of each paragraph, starting with their introductory sentences to determine if they convey meaningful content. This approach allows me to evaluate the overall effectiveness of the writing.
However, some examiners felt that they were not clear about what was expected in terms of paragraphing:
And I don’t think I’ve ever really got a clear answer to what constitutes a paragraph that’s um consistent in my training (M)
Several examiners pointed out that superficially cohesive paragraphs were not always very meaningful:
And the paragraphing […] it can be beautifully done, you know each paragraph is introduction – firstly – secondly and then finally – in conclusion. It's all very well. Superficially it looks terrific but it's not always worth very much (S)
A number of examiners expressed reservations about the paragraph ceilings:
Paragraphing is crucial for effective communication. While a beautifully written piece may convey a strong message, the absence of proper paragraphs can disrupt readability. Even minor details like missing lines or indentations can impact the overall presentation. However, I adhere to my guidelines to ensure clarity and coherence.
I once encountered a native speaker whose writing lacked any paragraphing, which I found quite extraordinary. Despite recognizing their fluency and command of the language, the absence of structure made it difficult for me to rate their work highly.
Examiners also noted that the absence or faulty use of paragraphing pulled a script down the scale, since the band descriptors place responses without adequate paragraphing at the lower bands.
This script lacks clear paragraphing and transitional phrases, which could enhance the flow and style of the writing. However, it presents a coherent train of thought that is logical and continuous. While it doesn't follow a structured format of one idea per paragraph, this does not diminish its quality or effectiveness.
These examiner observations align with Shaw's (2006) findings, indicating that the revised scales may place excessive emphasis on paragraphing. This could disadvantage candidates who have not received instruction in paragraphing skills during their preparation.
A survey question about the usefulness of the bolded ceilings for paragraphing in Academic Task 2 assessments revealed mixed opinions: 13% of examiners found them not very useful, 45% considered them quite useful, 24% rated them very useful, and 18% deemed them extremely useful. Some examiners, during interviews, voiced concerns about the scoring system that limits writing without paragraphs to a maximum score of 5, despite instances of coherent and cohesive writing without separate paragraphs for each thought. This perspective may explain the relatively low priority some examiners assigned to paragraphing in the ranking exercise, as highlighted in Table 6.
Cohesion
The analysis of interview data and survey responses indicated a significant consensus among examiners regarding the concept of cohesion. Over half of the examiners (58%, n=32) defined cohesion in terms of the techniques used to connect and arrange the parts of a text. A small group of respondents (9%) described cohesion as the flow of the text, while a larger group (15%) emphasized the grammatical and syntactic methods that create linking, such as discourse markers and pronoun substitution. One examiner conflated cohesion with coherence, and another offered a more simplistic definition, referring to cohesion merely as the use of cohesive devices without exploring their purpose.
Of the 55 Phase 2 examiners, 40% noted that cohesive links occur at the word, sentence, or paragraph level. Specifically, 13% highlighted the connection of ideas within sentences, while 22% focused on the relationships between sentences, and four examiners addressed the linking mechanisms within or between paragraphs.
Twenty-five Phase 2 examiners highlighted the connecting of ideas or information within a text. One examiner defined cohesion as the connection of meanings, and two others noted that it concerns how arguments are cohesively structured.
Examiners varied in what they saw as creating cohesion in a text: 25% identified connectors such as linking words and transition signals as key elements for linking parts of the text, 27% emphasized the importance of reference, and nine examiners mentioned substitution as a cohesive tool. The role of ellipsis was noted by two examiners, and there were single mentions of articles, synonyms, and topic sentences as mechanisms for creating cohesion.
In interviews, examiners viewed cohesion primarily as a 'micro' level feature, while coherence was seen as an attribute of the text as a whole. Differences in assessment styles emerged, with some examiners marking analytically and systematically, while others relied on a more intuitive approach. Intuitive markers often described cohesion in broad, general terms:
… the sticking together of paragraphs and sentences and ideas and notions so that it’s not higgledy piggledy absolute gibberish (E)
Cohesion refers to the seamless flow and connection of ideas within a text. While I tend to prioritize coherence in my marking, I acknowledge that cohesion plays a role in the overall readability. However, I often overlook cohesion as a distinct element in my assessment.
In the think-aloud marking process, examiners F and B systematically highlighted various aspects of cohesion, emphasizing the significance of paragraph structure as a key element:
Cohesion primarily operates at the level of paragraphs and sentences, focusing on the overall structure. It is essential to identify a clear topic sentence within each paragraph, which can be effectively introduced using cohesive devices such as 'firstly', 'secondly', or 'on the other hand' or whatever (F)
Cohesion within paragraphs is essential for clarity and understanding. Each paragraph begins with a clear topic sentence that outlines its main idea, followed by supporting sentences that provide evidence and relate directly to the topic. This structure ensures that the ideas flow logically and reinforce one another, enhancing the overall coherence of the text.
Cohesive devices/sequencers/discourse markers
Cohesive devices, as defined by Halliday and Hasan (1976), encompass various linguistic mechanisms that ensure text cohesion, including reference, substitution, ellipsis, conjunction, and lexical cohesion. In contrast, the examiner training materials emphasize textual features such as sequencers and linking devices, along with referencing and substitution. While examiners frequently referred to 'cohesive devices' during think-alouds and interviews, they often equated the term with discourse markers, neglecting the broader range of cohesive devices.
Data from the think-aloud protocols showed that examiners accorded a great deal of prominence to explicit markers of cohesion. Cohesive devices, discourse markers, and coordinating conjunctions together comprised 20% of all coding instances. The less experienced examiners K and T relied heavily on these features, which accounted for 32% of K's codes (16% discourse markers, 6% coordinating conjunctions, and 10% cohesive devices) and 28% of T's codes (18% 'cohesive devices' and 5% each for coordinating conjunctions and discourse markers).
Experienced examiners tended to focus less on explicit markers of cohesion than their less experienced counterparts. In total, the group of experienced examiners referenced cohesive devices, discourse markers, or coordinating conjunctions 47 times. Examiner A, one of the experienced group, mentioned discourse markers as the sole aspect of cohesion during assessment, dedicating the majority of the think-aloud transcript to coherence. In contrast, the less experienced examiners collectively mentioned cohesive devices, discourse markers, and coordinating conjunctions 71 times. Notably, all six less experienced examiners addressed all three features, while only two of the six experienced examiners did so.
The disparity between the more and less experienced examiners may stem from their differing approaches to assessing coherent texts. Experienced examiners understand that meaning can be conveyed without relying solely on explicit grammatical or lexical markers, while less experienced examiners tend to look for these identifiable cues. In addition, the experienced examiners, most of whom have over a decade of experience, may retain habits from the earlier holistic marking scale, whereas newly trained examiners are taught to adopt a more analytical approach, systematically evaluating each feature of the text and each criterion, including CC, in turn. More research is needed to explore these issues further.
In the interviews, the terms ‘sequencing words’, ‘linking words’, ‘transition signals’ and ‘discourse markers’ were used interchangeably. Examiners commented:
I gave it a 5 for coherence and cohesion because of the limited use of linking words – really just 'because' and 'but' – and the lack of effective paragraph sequencing, which affects the overall flow.
Sequencing – I guess the “first”, “second” sort of use, that type of thing (L)
Cohesive devices, but that’s the transition signals and stuff isn’t it? (P)
Cohesive devices ‘although’, ‘but’, ‘so’, that’s how I understand (L)
The cohesive devices? The little words and the little tie ins (S)
One interviewee offered a slightly broader definition of ‘cohesive devices’:
Cohesive devices would be use of link words for example, or um a good topic sentence (D)
Several examiners pointed out that the use of cohesive devices did not necessarily indicate coherent writing:
Students are taught cohesive devices, so they produce 'firstly', 'so' and 'in addition', but often without substantial content behind them. These terms can enhance the flow of the writing, but they achieve little if the underlying ideas lack depth and meaning.
Often candidates have a bit of a handle on cohesive devices. Even basic ones, you know, 'and', 'so', 'therefore' – they attempt them to create coherence but don't always use them correctly, and they use them infrequently and often in the wrong place. They seem to have more of a handle on the simpler features than on these.
All examiners in the Phase 1 think-aloud protocols noted the importance of cohesive devices, regardless of their marking style. In the interviews, the systematic analytic examiners emphasised that they look for discourse markers as a sign of a well-structured text:
The final paragraph typically begins with a phrase like 'In conclusion' or 'In summary', signalling to the reader that the answer to the question posed is about to be presented.
In the Phase 2 survey, respondents were asked to define or list cohesive devices. Their answers were often less comprehensive than the treatment in the examiner training materials, and many equated cohesive devices with discourse markers.
While some examiners offered vague definitions of 'cohesive devices' without examples, 65% identified them as linking words or phrases.
The labels used varied: linking words, transition words or signals, discourse markers, sequencers, logical connectors and conjunctions. Twenty-two of these 36 examiners provided examples of such words.
A further 11% of examiners defined cohesive devices as logical connectors or linking words but included an additional aspect of cohesion in their definitions; four of these mentioned 'reference'.
Other definitions encompassed reference, substitution and logical connectors, and some examiners mentioned synonyms. A small number gave fuller definitions, for example:
! logical connectors, reference, substitution, lexical chains, theme/rheme
! linking words and transition markers, conjunctions, ellipsis, paragraphing, referencing and substitution.
Reference and substitution
In the think-aloud data, 'referencing and substitution' was the least used aspect of the CC descriptors, representing only 5% of the total codes in the transcripts.
Four examiners did not address reference and substitution at all in their assessments. Of the eight who did, five conflated the two terms, consistently using them together; five discussed 'reference' independently, while 'substitution' was mentioned as a distinct feature of cohesion only once in the think-aloud transcripts.
Confusion about the definitions of referencing and substitution is evident in two extracts from the think-aloud data of the same examiner. In the first, substitution is treated as synonymy; in the second, the examiner equates substitution with referencing.
She’s, well, she’s summarising what she said in the first paragraph but not substituting it with synonyms or parallel expressions (E, line 169)
Good examples of substitution ‘It is certainly true that products are bought anywhere on the globe and even manufactured as well but THIS does not make …’ (E, line 60)
The anaphoric use of the definite article was identified as a sub-feature of reference only once in the think-aloud protocols. Otherwise, examiners' coding of reference and substitution focused on anaphoric pronouns, synonyms, and the repetition of key nouns.
Generally in the interviews ‘reference’ was thought to mean the use of pronouns, while ‘substitution’ referred to paraphrasing
The reference um well I think of as ‘it’ ‘this means’ – that sort of thing Pronouns (J)
Substitution means … paraphrasing […] It means saying the same thing in different words (A)
Many examiners in the interviews expressed confusion about the definitions of reference and substitution and their application in marking CC. One examiner commented:
‘Use of reference and substitution’ That doesn’t mean very much to me ‘Can result in some repetition or error’ Umm No that doesn’t mean anything to me (P)
Another examiner conflated the two terms
Reference um, reference and substitution I’m inclined to put together in the sense of using pronouns to refer to something that’s mentioned earlier (L)
Two examiners noted that they tend to look at reference and substitution last in their assessment of scripts
The reference I don’t look at too much unless it’s glaringly standing out I don’t sort of find that as something I think about too much (F)
Referencing and substitution come lower on my list of criteria, but when I come across effective substitution or paraphrasing, it catches my attention and impresses me.
Most examiners defined reference broadly as the use of lexical items to avoid repetition, with 24% giving only this general definition. A clear majority, 70%, specifically mentioned pronouns that replace earlier or later lexical items. A further 38% discussed reference in terms of anaphoric reference, where items refer back in the text. A few addressed forward reference: one mentioned cataphoric reference, where items point forward in the text, while another incorrectly cited exophoric reference.
Of the 39 examiners who defined reference through pronoun usage, 49% cited personal pronouns, 24% demonstrative pronouns, and 15% relative pronouns. Although the training materials highlight the definite article as a feature of reference, only three examiners mentioned articles. One examiner cited 'similar(ly)' as a comparative, in line with Halliday and Hasan's definition of reference. One wrote only 'give examples', leaving it unclear whether this was a definition or a request for more exemplification of 'reference' and 'substitution' in IELTS training. Another equated reference with substitution, one expressed uncertainty ('Pronouns and stuff like that?'), and one provided no definition at all.
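The reference types mentioned in these definitions can be illustrated with invented examples (ours, not drawn from the survey responses or the training materials):

# Toy sentences illustrating the reference types discussed above.
examples = {
    "anaphoric": "Mary arrived late. She had missed the bus.",  # 'She' points back to Mary
    "cataphoric": "Before she spoke, Mary checked her notes.",  # 'she' points forward to Mary
    "exophoric": "Look at that!",  # 'that' points outside the text
}
for kind, sentence in examples.items():
    print(f"{kind}: {sentence}")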
The term 'substitution' appears to be misunderstood by some examiners, leading to confusion between its meaning and that of 'reference.'
Substitution was generally understood as a means of avoiding repetition, but many examiners found it difficult to define clearly: approximately 20% described it only in vague terms, as the replacement of words and ideas, without specific details of its application. Eleven examiners identified the use of synonyms to replace nouns as a way of avoiding repetition, others described substitution as replacing nouns with different nouns, and four equated it with paraphrasing. Almost half of the examiners (47%) viewed substitution as the use of pronouns to replace nouns, with 13 specifically mentioning pronominals and eight giving examples of demonstrative pronouns. Only two examiners illustrated substitution in line with Halliday and Hasan's definition, offering examples of verbal and clausal substitution such as:
John went home and Mary did too
Are you going? I think so
Of the remaining responses, one examiner each identified articles, determiners and quantifiers as examples of substitution.
Further issues in assessing the features of CC
Overlaps in the assessment of the band descriptors
The think-aloud analysis revealed a number of overlapping segments among the criteria, reflecting examiner uncertainty about which features to assess under which criterion. Although some overlap is unavoidable, notable overlaps occurred between the TR and CC criteria, with fewer between LR and CC.
Overlaps between TR and CC
All 12 examiners recorded at least some overlap between TR and CC, which accounted for 134 segments, or 3% of the total:
So I do look for coherence in the argument – and that also comes into task response. I’ve already sort of assessed that they’ve responded to that task because I understood the development of their argument. So I’ve already looked at that. It overlaps doesn’t it? Because TR is good, that indicates that it’s a coherent answer (A)
The greater overlap between Task Response (TR) and Coherence and Cohesion (CC), compared with the other criteria, can be attributed to ambiguity in the wording of the band descriptors. The first source of ambiguity was the difficulty examiners faced in distinguishing between:
! how clearly the position is presented and supported in each script (to be assessed under TR)
! how clearly the message or ideas are conveyed and logically organised (assessed under CC)
Conceptually, it is difficult to see how these two key descriptors can be regarded as distinct, since one would seem to imply the other, at least to some degree
Survey data reveal similar ambiguity. While 35% of examiners ranked 'message/ideas' among the top three features informing their assessment of CC, 46% placed this feature among the least important for their marking decisions. Although 'message' and 'ideas' appear in the band descriptors for CC, some examiners regard these elements as belonging to the marking of task response; one examiner explicitly stated that she does not consider 'message/ideas' part of CC, viewing it instead as a component of TR.
The second related source of ambiguity was examiner difficulty in differentiating between:
! ‘the development of ideas’ or the lack thereof in TR
! ‘the logical progression of ideas’ in CC
The following extract from Examiner T directs attention to this issue:
115 A lack of overall progression, and that’s kind of not really developing it
116 But if we’ve already talked about in TR, clearly I shouldn’t penalise her for that
117 Her, this one I’ve decided it’s a girl,
119 But they’re not firmly linked?
123 Cohesive devices are used to some good effect
124 Definitely – better than some good effect
126 And they’ll always be logical Yeah
127 So clear overall progression, or a lack of progression… increases…
129 because I think I’ve already marked her down for not really developing it in
130 and the reason I want to go down to a 5 for CC is exactly the same thing, that there’s not really much development
132 But yeah, I don’t think I can drag her down twice
The ambiguity of these overlapping TR/CC segments made it difficult for the researchers to determine whether the examiners were evaluating TR and CC independently or in conjunction.
A further possible cause of the overlaps emerges from Phase 2 examiners' definitions of 'coherence', several of which included the relevance of the answer to the question and the degree to which the response remained 'on task':
! Relevance – more to do with the logical presentation of ideas, addressing a question with suitable/appropriate – the understandability of an essay
! making sense Being logical and relevant
! … does it make sense/is it relevant
! readable, clear message/communication, logical organisation, on task
! linking of essay to test topic…
Relevance, however, is arguably assessed under TR rather than CC, so these definitions suggest both that some examiners struggle to distinguish the two criteria and that coherence itself remains difficult to define.
Overlaps between CC and LR
The analysis revealed less overlap between CC and LR, with seven examiners identifying an average of three overlapping segments each. These overlaps were associated with lexical chains, synonym usage, and the repetition of key nouns – what Halliday and Hasan (1976) describe as lexical cohesion. They raise the question of whether such reiteration should be assessed as a cohesive device under CC or as a sign of limited vocabulary flexibility under LR.
The candidate's script features a repetitive phrase, "the accessibility to products," at the beginning of each paragraph, which some examiners viewed as a sophisticated theme progression However, others criticized this repetition, noting that while the writing flowed well, it lacked cohesion For instance, Examiner T questioned the effectiveness of this pattern, highlighting the use of transition words like "although" and "again" to indicate potential weaknesses in coherence.
83 The vocab’s better than the grammar I think
84 and that gives it the real nice flow rather
89 The good noun phrase, ‘the accessibility of many on the globe to products’ which I guess is pretty complex
90 ‘The character of countries’, ‘accessibility to the same product’…
91 although then, yeah, ‘accessibility to products’ again
Examiner J treats reiteration as a negative feature, yet implicitly recognises its role as a cohesive device, as the 'but' in line 62 shows:
56 Funny, ‘that should be celebrated’ he’s got that right
57 but then later on ‘which we need celebrate’
58 he’s been repeating there – same word –
60 I think his spelling is generally very good
61 He has repeated words quite a lot
62 but it’s quite easy to read
64 but there's not much in the way of uncommon words
65 so make that a 7 [for LR]
The overlap between LR and CC in relation to lexical cohesion is also evident in the following extract from Examiner E while assessing CC:
169 she’s, well, she’s summarising what she said in the first paragraph
170 but not substituting it with synonyms or parallel expressions
In the interviews, Examiner K explained this overlap as follows:
The one I’m least confident about is LR because it overlaps so broadly with the other areas
If you don't have the vocabulary, you can't do the transformation the task requires. Word class comes into it too – a lot of the repetition comes from students not being able to nominalise or use more advanced structures.
Similarly, Examiner F, when asked to define ‘substitution’, focused on the overlap between CC and LR:
Substitution – using different vocabulary to express the same idea – to me that's more about lexical choice than about coherence and cohesion.
Future training materials could usefully include longer sample scripts annotated for lexical cohesion, including the effective repetition of key nouns, contrasted with scripts in which excessive repetition of vocabulary indicates a lack of flexibility under LR. Such training would clarify how the repetition of key vocabulary should be assessed in relation to coherence and cohesion.
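As a deliberately simplified illustration of the kind of analysis such annotated materials might support, the sketch below counts repetitions of hand-picked key nouns across paragraphs; a real analysis would require part-of-speech tagging and judgment about which repetitions are cohesive and which betray limited vocabulary:

# Minimal sketch of the repetition that sits on the CC/LR boundary:
# per-paragraph counts of supplied 'key nouns', exposing lexical chains.
def key_noun_repetition(paragraphs, key_nouns):
    counts = []
    for p in paragraphs:
        tokens = [t.strip('.,').lower() for t in p.split()]
        counts.append({n: tokens.count(n) for n in key_nouns})
    return counts

essay = [
    "The accessibility to products is growing everywhere.",
    "The accessibility to products also changes the character of countries.",
]
print(key_noun_repetition(essay, ["accessibility", "products"]))
# [{'accessibility': 1, 'products': 1}, {'accessibility': 1, 'products': 1}]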
The concept of the ‘essay’
The concept of the 'essay' emerged frequently in relation to the assessment of coherence in Academic Writing Task 2, even though the question rubric makes no explicit mention of it, suggesting that candidates are nonetheless expected to command the essay genre. Ten of the 12 Phase 1 examiners referred to the essay and its features during their think-alouds.
Explicit reference was made to the term ‘essay’ in the assessment of CC by four examiners as follows:
This person could be taught how to arrange this into an essay easily (T, line 255)
So they haven’t really answered the question They don’t know how to write a formal essay (D, line 560)
This, look, in terms of mechanics, this essay is better than the 4 that it’s getting for task fulfilment (F, line 263)
And I guess the content and organisation might be a reflection of the fact that it’s only a 40 minute essay and that she’s spent already 20 minutes on a different essay (S, line 199)
Overuse of cohesive devices
Another issue that caused difficulties for some examiners was the interpretation of ‘overuse’ of cohesive devices
There is a dilemma over the 'overuse' of taught techniques such as 'firstly', 'secondly' and 'thirdly': while these can enhance the fluency of the text, they can also render the writing mechanical. Should candidates be penalised for excessive reliance on such devices, or credited for the contribution the devices make to overall coherence?
Differentiating between the band levels for CC
Some examiners in our study complained about the lack of precision in the CC descriptors For example:
‘The paragraphing may be inadequate or missing.’ I just find those ‘may bes’ very wishy washy. It either IS or it’s not (M)
The difference between ‘may be missing’ and ‘may be no paragraphing’ and ‘no paragraphing’ – they’re the same thing (T)
… reference and substitution I know what it means, but I’m just thinking ‘inadequate’?
‘Inaccurate’? What is ‘inadequate’? […] ‘inadequate’ – that doesn’t mean anything to me (P)
Fitting the scripts to the band descriptors
Examiners talked about the difficulty of ‘fitting’ the script to the band descriptors. This metaphor of ‘fitting’ was used by six of the 12 examiners in the Phase 1 interviews, for example:
The criteria can seem complex and hard to interpret. You struggle to fit an individual's writing into the band descriptors when some aspects don't align, and you wonder which sentences of the descriptor to prioritise and which can be disregarded.
We decide where we’re going to fit them [the scripts] rather than where they necessarily fit (A)
One of the most experienced examiners commented that the descriptors did not always match her intuitive impression of the candidate’s level
Sometimes I find that becomes a bit frustrating because I feel like no, no, this one should be a 4, but when I look at the descriptors it’s not (F)
There was evidence in the think-aloud data that the difficulty of fitting scripts to the band descriptors led examiners who set out to mark analytically to fall back on intuition when a script proved challenging. Examiner F, for example, begins with a careful analytical approach:
175 [reading the descriptors] The writing may be repetitive due to inadequate and/or inaccurate use of reference and substitution
176 Well, that’s not the case
177 It is repetitive – the structure
178 because it makes the same point at the beginning of each paragraph more or less
179 But, that’s not due to inadequate or inaccurate use of reference and substitution It’s due to repetition of a single idea
(F, lines 175-179). But later she confessed that she tends to return to marking impressionistically:
296 In the end I look at it and I think what do I think this is?
297 What mark should be given?
298 Does it actually work out to that impressionistic mark?
However, Examiner D explained in the interview that profile marking against the band descriptors took precedence in the end
While I try to avoid sweeping statements, some results may seem unjust but are simply the outcome of the banding system.
The length of the CC band descriptors
The CC band descriptors are longer and more complex than those for the other criteria, comprising four sub-sections: coherence, cohesion, referencing and substitution, and paragraphing. Examiner K observed that the effort of comprehending them diverted her attention from the script:
CC contains a lot more words than the other descriptors, so you can't assume you're familiar with them. I've read these descriptors so many times, and working through them distracts you from the language of the script itself.
Wolfe, in his study of essay reading style and scoring proficiency, made a similar observation:
Evaluating an essay against a scoring rubric imposes significant cognitive demands on scorers. Those who are less familiar with the rubric may struggle to recall its elements and may decompose the scoring process into smaller, more manageable tasks to cope with the mental load.
The length of time devoted to reading the CC band descriptors (see Table 2) may also reflect the cognitive demand placed on examiners.
Interpreting the question
The time examiners devoted to assessing CC relative to the other criteria, and the overlaps between TR and CC, have been interpreted above as reflecting the difficulty of the criterion and ambiguities in the band descriptors. However, the wording of the particular task question used in Phase 1 may also have influenced the time spent on TR and CC, the overlaps between them, and examiners' hesitancy, and this needs to be borne in mind when interpreting the think-aloud results.
Question A (Appendix 1) asked candidates to accept a statement as fact and then discuss its advantages and disadvantages. Many, however, appeared to treat the statement as a debatable claim, expressing agreement or disagreement with it before addressing its advantages and disadvantages.
Several examiners made some insightful comments on this issue As Examiner A put it:
The rubric says countries are becoming more similar and asks whether this is a positive or negative development. Candidates have to engage with that, but on an ambiguous topic like this some of them cannot express a clear position until they reach their conclusion, and that is a real drawback of the exam: they are grappling with their position all the way through the essay.
Examiner L expressed it even more clearly:
The statement is that countries are becoming more similar because people can buy the same products anywhere in the world. Some candidates disagree with it but don't quite stay on topic; others discuss its implications without making their own position clear. One candidate disagrees, but acknowledges the statement and then shifts to whether the trend is a positive or negative development. The wording of questions like this creates difficulties for candidates and examiners alike, and it affects the clarity and coherence of the responses.
Further investigation is needed to determine how the phrasing of questions in Academic Writing Task 2 affects both candidate performance and the assessment processes of examiners.
Research question 3: To what extent do examiners differ in their marking?
To what extent do examiners differ in their marking of coherence and cohesion in Task 2 of the Academic Writing module?
A correlation analysis of examiners' overall scores across the four criteria (TR, CC, LR and GRA) against the standardised scores showed that 53 examiners had correlations of 0.8 or higher, while only two fell below this threshold. One examiner's correlation, at r = 0.55, fell outside the 95% confidence interval of 0.68-0.88 (see Appendix 6).
For the CC scores alone, 44 examiners achieved a reliability correlation above the target of 0.8 with the standardised scores, while 11 fell below it, including two with notably low correlations of r = 0.49 and r = 0.65 (Appendix 6).
All Levene's tests were non-significant, confirming that the transformed variables had equal variances, an assumption of ANOVA and t-tests, and Shapiro-Wilk tests indicated that all variables were normally distributed. Three examiners had mean scores that, without correction for familywise error, differed significantly from the average standard scores across all four marking criteria.
Given that only three of the 55 examiners deviated significantly from the standard mean, there appear to be no major concerns about harshness or leniency in the marking.
Examiner | t value | Significance | Mean score | Harsher or more lenient
Table 9: Significant t tests for difference from the mean standard score of 5.54
When only CC was considered, there were no examiners whose scores were significantly harsher or more lenient than the standard scores (Appendix 11).
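The shape of these reliability and harshness/leniency checks can be sketched as follows (Python with scipy; the marks are invented placeholders, not study data):

import numpy as np
from scipy import stats

examiner_cc = np.array([5, 6, 4, 7, 5, 6, 5, 4, 6, 7, 5, 6])  # one examiner's CC marks (invented)
standard_cc = np.array([5, 6, 5, 7, 5, 6, 4, 4, 6, 7, 5, 6])  # standardised scores (invented)

# Reliability: correlation with the standardised scores (target 0.8 or higher)
r, _ = stats.pearsonr(examiner_cc, standard_cc)
print(f"reliability correlation r = {r:.2f}")

# Normality check, then a one-sample t test against the mean standard
# score of 5.54 to flag harshness or leniency
print(stats.shapiro(examiner_cc))
t, p = stats.ttest_1samp(examiner_cc, popmean=5.54)
print(f"t = {t:.2f}, p = {p:.3f}")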
There was a significant difference in the strength of examiners' correlations with the standard scores across the four criteria, F(3, 216) = 7.22, p < 0.05. Bonferroni-adjusted contrasts showed that CC had the lowest mean correlation, which was not significantly different from TR, but both TR and CC differed significantly from LR and GRA.
Table 10: Mean correlations of all examiners with the standard scores
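We assume from the reported F(3, 216) that the comparison took the following form – a one-way ANOVA over Fisher z-transformed examiner-standard correlations for the four criteria, followed by Bonferroni-adjusted pairwise contrasts; the sketch below uses invented placeholder data, not the study's:

import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(0)
corrs = {  # 55 per-examiner correlations per criterion (invented placeholders)
    "TR": rng.normal(0.84, 0.05, 55), "CC": rng.normal(0.82, 0.06, 55),
    "LR": rng.normal(0.87, 0.04, 55), "GRA": rng.normal(0.88, 0.04, 55),
}
# Fisher z stabilises the variance of correlations; clip to stay inside (-1, 1)
z = {k: np.arctanh(np.clip(v, -0.99, 0.99)) for k, v in corrs.items()}

F, p = stats.f_oneway(*z.values())
print(f"F(3, {4 * 55 - 4}) = {F:.2f}, p = {p:.4f}")

pairs = list(combinations(z, 2))  # six pairwise contrasts
for a, b in pairs:
    t, p = stats.ttest_ind(z[a], z[b])
    print(a, b, "significant" if p < 0.05 / len(pairs) else "ns")  # Bonferroni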
The presentation order of the scripts had no significant effect on the reliability of examiners’ CC scores
Research question 4: What effects do variables such as qualifications have on marking?
What effect do variables such as examiners’ qualifications and experience have on the marking of coherence and cohesion?
A Spearman correlation of the number of years of full-time teaching with the reliability of the examiners' scores on CC revealed no significant relationship, r = 0.091, p > 0.05. Comparisons of the mean correlations of CC with the standard scores by examiner characteristics likewise showed no differences in reliability associated with those characteristics.
Examiner qualifications made no significant difference to the reliability of CC marking. Reliability was comparable for the 22 examiners with a masters degree in ESL or Linguistics and the 26 without (t(46) = 1.440, p > 0.05), and for the 34 examiners without a masters in ESL and the eight with one (Z = 1.025, p > 0.05). Reliability did not vary with whether examiners held Linguistics qualifications (t(53) = 0.397, p > 0.05), and the 13 examiners with both ESL and Linguistics qualifications did not differ from the 42 with only one or no language-specific qualification (Z = 0.198, p > 0.05).
Whether examiners had taught academic writing more or fewer than five times did not significantly influence the reliability of CC marking (t(50) = 1.440, p > 0.05), and nor did length of IELTS marking experience, whether less than two years or more than five (t(37) = 0.726, p > 0.05). These results are consistent with Phase 1 of the study, which found no significant differences between new and experienced examiners.
Mann-Whitney U tests indicated no significant difference in marking reliability between examiners who marked almost every week (n=9) and those who marked less often than every two months (n=8), Z = 0.096, p > 0.05. Examiners who mainly taught lower levels (n=12) were not significantly less reliable than those teaching upper intermediate or advanced levels (n=42), Z = 0.458, p > 0.05. Nor did examiners' prioritisation of flow (nE) versus structure (n=8) significantly affect the marking of CC, Z = 0.174, p > 0.05.
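In sketch form, with scipy calls matching the tests named above (all values are invented placeholders, not study data):

import numpy as np
from scipy import stats

# Spearman correlation of years of full-time teaching with CC reliability
years_teaching = np.array([3, 5, 10, 2, 15, 8, 20, 4])
cc_reliability = np.array([0.81, 0.85, 0.83, 0.79, 0.88, 0.84, 0.86, 0.80])
rho, p = stats.spearmanr(years_teaching, cc_reliability)
print(f"Spearman r = {rho:.3f}, p = {p:.3f}")

# Mann-Whitney U for frequent (n=9) versus infrequent (n=8) markers
weekly = np.array([0.84, 0.86, 0.82, 0.85, 0.83, 0.87, 0.81, 0.88, 0.85])
infrequent = np.array([0.83, 0.85, 0.84, 0.82, 0.86, 0.80, 0.87, 0.84])
U, p = stats.mannwhitneyu(weekly, infrequent)
print(f"Mann-Whitney U = {U:.0f}, p = {p:.3f}")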
Comparison | N in each group | Mean | SD
Any masters degree / no masters degree | | |
Linguistics qualifications / no linguistics qualifications | 20/35 | 0.858/0.847 | 0.072/0.086
Taught academic writing more than five times / fewer than five times | | |
More than five years of marking experience / less than two years | 17/22 | 0.850/0.856 | 0.060/0.096
Table 11: Means and standard deviations of factors which were compared using t tests
Comparison (lower rank means lower reliability) | N in each group | Mean rank | Mann-Whitney U
Masters in ESL / no masters in ESL | 8/34 | 18.06/22.31 | 104
Both ESL and linguistics qualifications / one qualification only | 13/42 | 28.77/27.76 | 263
Mark almost every week / less often than every two months | 9/8 | 8.89/9.12 | 37
Teaching upper intermediate or advanced / intermediate or elementary | 42/12 | 29.98/29.33 | 274
Table 12: Results of non-parametric tests for difference
In summary, examiners' marking of CC in Academic Writing Task 2 was reasonably reliable, and no examiner characteristic had a significant effect on it. The mean correlation for CC was the lowest of the four criteria: the difference from TR was not statistically significant, but both differed significantly from LR and GRA. Despite this overall reliability, concerns about construct validity remain, given the varying emphasis examiners place on different CC features and their attention to several criteria not explicitly mentioned in the band descriptors. The next research question therefore examines how effectively the training materials clarify examiners' understanding of CC.
Research question 5: To what extent do existing training materials clarify perceptions of CC?
To what extent do existing training materials clarify perceptions of CC?
The IELTS Examiner Training Materials 2004 (Writing), 2007 edition, include PowerPoint slides covering the four criteria for Academic Writing Task 2: 28 slides in total, of which seven deal with Task Response (TR), four with Coherence and Cohesion (CC), ten with Lexical Resource (LR) and seven with Grammatical Range and Accuracy (GRA).
The first CC slide explains that coherence involves the logical flow of information and arguments and the organisation of ideas and paragraphs, while cohesion refers to the connections between ideas and the use of cohesive devices. Trainees learn that the CC criterion focuses on ‘the overall organisation and fluency of the message’. The remaining three slides deal with ‘cohesive devices’, including ‘sequencers and linking devices’, and with ‘referencing and substitution’.
The slides present seven brief examples of the overuse and misuse of such devices. Coherence, by contrast, is defined but not illustrated. Trainees are directed to the overarching statement about coherence in each band descriptor, and in particular to the bolded ceiling at Band 5 for CC, which caps the grading of scripts without adequate paragraphing.
Analysis of the training materials thus indicates that the CC criterion receives fewer explanatory slides and examples than the other criteria, and that key terms in the band descriptors go undefined, making it harder for examiners to establish a shared understanding of the criterion.
In the Phase 1 interviews, examiners discussed how well the training materials had prepared them for marking CC. Overall, they spoke appreciatively of the training, although one examiner took a more critical view:
It’s like you’re a doctor and they said here’s the guide to today’s surgery and here’s a scalpel Good luck Don’t hit any of the wrong blood vessels (K)
Most examiners wanted clearer connections between the CC band descriptors and actual scripts, with five highlighting the value of discussing the descriptors in relation to specific examples during training. The new examiners were particularly concerned about feedback after training: three could not remember receiving any feedback on their marking, and three proposed a buddy system or mentoring for further support.
After more than a year of examining for IELTS, I still feel inexperienced because of the limited feedback I receive; it's hard to know whether I'm on the right track. Early on, I would have appreciated more constructive feedback to make sure my interpretations were accurate. I recently had a review of my marking that included comments, but without the corresponding scripts it's difficult to connect the feedback to what I did.
In the Phase 2 survey, examiners rated the training or most recent standardisation they had received for assessing coherence and cohesion on a five-point Likert scale from 'poor' to 'excellent'. Most rated their training as average or above:
! 35% (n) gave a rating of ‘very good’
! two examiners gave a rating of ‘excellent’
! 15% (n=8) of examiners gave a rating of ‘below average’
These ratings are less positive than those reported in McDowell's (2000) survey of IELTS examiner training. Notably, McDowell's respondents valued the 'homework scripts' provided for review at home, a resource no longer part of examiner training; several respondents in the present study expressed a desire for such materials to be reinstated.
Examiners were also asked how much they remembered of their training in relation to CC: 24% reported remembering a significant amount, 45% 'some', and 27% 'not much'. These results echo comments in the qualitative interviews, where several examiners expressed frustration at being expected to retain large amounts of information without tangible resources to support their memory.
It's hard to reflect on the writing when the descriptors stay behind. I often wished I could take the descriptors away with me to work on them and focus on the most critical points.
78% of examiners knew that standardised scripts were available at their test centres for regular revision; however, 22% (n=12) were unaware that they had access to these materials.
The 'Information for Writing Examiners' booklet serves as a valuable resource for revision, featuring rating scales and key assessment criteria A Phase 1 examiner emphasized its importance, stating she reviews it before each examination to enhance her focus on marking However, data from Phase 2 examiners reveal that a notable portion, specifically 25%, rarely consult the booklet, while 14% report never reading it, highlighting a gap in adherence to this essential guideline for effective evaluation.
Table 13: Frequency with which examiners read the writing examiners’ instruction booklet before marking
Given that many examiners struggle to recall their training in assessing coherence and cohesion, the available revision materials appear to be underused. More systematic and regular reminders about these resources, and about effective revision strategies, seem warranted.
A number of suggestions were made by the Phase 1 examiners for improving examiner training with particular reference to the assessment of CC The suggestions were as follows:
1 More analysis and discussion of coherence and cohesion in full-length sample scripts at different band levels, exemplifying the effective use, misuse, overuse and omission of cohesive devices and their effect on the clarity and flow of the text
2 Revision of the CC band descriptors to ensure greater consistency in terminology
3 A glossary of key terms for describing coherence and cohesion with definitions and examples to be included in the instructions for writing examiners
4 More mentoring and feedback in the first year of examining
5 An online question and answer service available for examiners
6 Use of colours to pick out the key features across the bands in the rating scale
7 A list of dos and don’ts for marking scripts (eg don’t compare assessments of one script against another) to be included in the examiners’ instructions booklet
Question 1
Data from the Phase 1 think-aloud protocols and interviews and from the Phase 2 survey show that most examiners find the assessment of CC more challenging than the other three criteria and are less confident in their evaluations, a finding in line with Shaw and Falvey (2008). Examiners also dedicated more time to assessing CC and TR than to LR and GRA, indicating the greater complexity of these criteria, spent more time reviewing the CC band descriptors, and hesitated more when assessing CC than the other criteria.
Question 2
Overall, examiners in the think-aloud protocols paid much more attention to the assessment of coherence than of cohesion, with nearly 75% of the segments focusing on coherence and slightly over 25% on cohesion.
A more detailed analysis of the Phase 1 segments on 'coherence' showed that about one-third concentrated on forming an overall impression of the text through subjective interpretation, slightly over one-third assessed logical aspects such as the organisation, progression and sequencing of ideas, and the remaining third evaluated paragraph structure.
The assessment of cohesion centred on cohesive devices, coordinators, discourse markers and linking words, terms the examiners often used interchangeably. By contrast, reference and substitution received very little attention.
There was considerable variation among the 12 examiners in the emphasis they placed on coherence versus cohesion. Examiner K devoted 39% of her think-aloud transcript to coherence features and 61% to cohesion, whereas 90% of Examiner A's segments referred to coherence and only 10% to cohesion.
The other examiners ranged between these two examiners in the degree to which they referred to either coherence or cohesion
Differences in how examiners interpreted the significance of specific features in the CC band descriptors were evident in their treatment of reference and substitution: eight examiners assessed these features in their think-alouds on the same set of 10 scripts, while four did not assess them at all.
Examiners frequently used the terminology of the CC band descriptors, notably 'overall structure' (synonymous with logical organisation) and 'linking words' (usually meaning discourse markers). The term 'flow' was used more variably: some examiners equated it with 'logical progression', while others applied it more generally, describing their assessment as a 'gut feeling'.
Examiners in the think-aloud protocols also used terms that do not appear in the band descriptors, such as 'introduction', 'conclusion', 'essay', 'argument' and 'topic sentence'. They appeared to give priority to a well-structured essay following the traditional introduction-body-conclusion format, with clearly defined paragraphs built around topic sentences, even though the task rubric does not explicitly require candidates to follow this format.
Although paragraphing plays an essential role in guiding most examiners' marking, some examiners found it difficult to reconcile their marking of coherence with the paragraphing ceiling specified at Band 5. Several also noted that it was difficult to differentiate between the 'relative terms' relating to paragraphing across some of the bands.
Examiners largely agreed on the definitions of key terms from the CC band descriptors, but some showed uncertainty in defining 'coherence', 'cohesion', 'cohesive devices', 'reference' and 'substitution', suggesting a lack of clarity about their meanings.
Some overlaps in the assessment of TR and CC were observed in the think-aloud protocols, primarily because examiners found it difficult to differentiate between specific features of the two band descriptors:
! how clearly a position is presented and supported (TR)
! how clearly the message is logically organised (CC)
! the development of main ideas (TR)
! the logical progression of ideas (CC)
There was also uncertainty over whether 'relevance' should be evaluated under TR or CC, and whether lexical cohesion, the use of synonyms and the repetition of key nouns should be assessed under CC or LR.
Experienced Phase 1 examiners tended to mark CC more intuitively, focusing on the overall flow and fluency of the text, whereas less experienced examiners adopted a more analytical approach, systematically referring to the features named in the band descriptors.
Question 3
Although examiners in the think-aloud protocols interpreted the band descriptors somewhat differently, and some Phase 2 examiners showed uncertainty about linguistic terms such as 'substitution', their marking was nevertheless reliable.
In Phase 2, 55 examiners evaluated 12 scripts against the four criteria, yielding 2,640 observations. The correlations of these scores with the standardised scores demonstrated high reliability: overall correlations exceeded 0.8 for all but two examiners, and eight exceeded 0.9. The reliability of marking for the CC criterion was slightly lower than for the other three criteria but still within acceptable levels.
Question 4
None of the examiner variables examined – marking experience, advanced qualifications, linguistic training and teaching experience – had a significant effect on marking outcomes. This suggests that the IELTS training, certification and re-certification processes maintain examiner reliability regardless of examiners' diverse backgrounds.
Question 5
The examiner training materials offer less guidance on assessing CC than on the other criteria. Most examiners were nonetheless reasonably satisfied with their training in CC, although many would have liked more time for discussion and reflection. Some admitted to remembering little of their CC training, a considerable number said they seldom consult the 'Information for Writing Examiners' booklet, and some were unaware that standardised scripts are available for revision.
In the think-aloud protocols, then, examiners attended much more to coherence than to cohesion, varied in their interpretations of the CC band descriptors, and in some cases showed an incomplete understanding of specific linguistic terms used in them. While their marking remains highly reliable, these findings raise concerns about the construct validity of the test and suggest that improvements could be made in the following areas:
1 Additions or refinements to examiner training for CC
2 A possible re-assessment of and fine tuning of the band descriptors for CC
3 Consideration of the task rubric so that candidates unfamiliar with the essay genre are not disadvantaged
4 Further discourse studies of aspects of coherence and cohesion in sample texts at different levels.
6.1 Suggested additions or refinements to examiner training for CC
The rating scales introduced in 2005 following the IELTS Writing Assessment Revision Project are a significant improvement on the previous scale. The four analytic band scales for distinct criteria and the more precise wording at each band level have produced a more effective instrument, and the new re-training, certification and re-certification processes have improved examiner reliability (Shaw and Falvey, 2008).
Nevertheless, analytic marking can still be impressionistic, as Allison (1999, p 180) notes, and the assessment of coherence relies heavily on the reader's subjective interpretation of the text (Jones, 2007). The think-aloud protocols bear this out: around one-third of the coherence segments centred on examiners' overall impressions of the text's clarity, flow and coherence.
To reduce the impact of subjectivity on the marking of propositional coherence, the training materials could include more exemplar scripts at various levels, clearly annotated for cohesive ties and logical connections. Phase 2 of the IELTS Writing Assessment Revision Project similarly stressed that training materials should be rich in contextualised examples (Shaw and Falvey, 2008, p 44). Studying annotated scripts would allow examiners to discuss how propositional coherence can be identified in texts and to share their insights.
The study of thematic progression in sample texts could also deepen examiners' understanding of how ideas are logically connected. Knoch (2007) implemented a scoring system based on Topic Structure Analysis (TSA), which focuses on thematic progression; although her examiners initially struggled to apply it, they achieved greater reliability than with the previous five-trait scoring descriptors. Identifying coherence breaks, as Knoch notes, can serve as a useful analytical tool for assessing 'flow' or 'logical progression'. While in-depth study may not be feasible within current training, developing or reintroducing a homework pack with sample texts and related exercises would benefit examiners wishing to extend their knowledge in this area.
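A toy sketch of the idea behind such exercises (an assumed simplification for illustration, not Knoch's instrument): given clauses already divided into theme and rheme, each transition can be labelled sequential (the rheme becomes the next theme), parallel (the theme is carried forward) or a candidate coherence break:

def classify_progression(clauses):
    """clauses: list of (theme, rheme) pairs; returns a label per transition."""
    labels = []
    for (t1, r1), (t2, _) in zip(clauses, clauses[1:]):
        if t2 == r1:
            labels.append("sequential")  # rheme taken up as the next theme
        elif t2 == t1:
            labels.append("parallel")  # theme carried forward
        else:
            labels.append("break")  # candidate coherence break
    return labels

clauses = [
    ("globalisation", "cheaper products"),
    ("cheaper products", "similar lifestyles"),
    ("cheaper products", "loss of local character"),
]
print(classify_progression(clauses))  # ['sequential', 'parallel']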
The findings also suggest that many examiners lack a complete understanding of the linguistic terms in the CC band descriptors. Some may meet this terminology for the first time on becoming examiners and may be embarrassed to admit gaps in their knowledge. As Shaw and Falvey (2008) note, the relationship between coherence and cohesion is challenging for examiners who are not well versed in the text linguistics literature, and the IELTS Writing Assessment Revision Project working group acknowledged similar concerns about the terminology.
The working group recommended that examiners be given full explanations of 'reference' and 'substitution' (p 44). A glossary of key terms in the training materials, as Shaw and Falvey (2008, p 158) suggest, together with careful clarification from trainers, would improve examiners' ability to interpret the concepts and linguistic terms in the CC band descriptors.
The working group also emphasised the importance of examiners being aware of the 'grey' areas in the assessment criteria (Shaw and Falvey, 2008, p 44). This study identified specific overlaps between TR and CC that left some examiners confused about which criterion a text feature should be assessed under. Future training materials should therefore clarify the distinction between the development of ideas (assessed under TR) and the logical progression of those ideas (assessed under CC), and illustrate the difference between how clearly a position is presented (under TR) and how clearly the message or ideas are conveyed (under CC).
Examiner training also needs to clarify whether 'relevance' should be assessed under CC or TR, and whether synonymy, lexical chains and the repetition of key nouns should be assessed under CC, as aspects of lexical cohesion, or under LR, as indicators of flexibility in vocabulary.
The interviews pointed to the value of more feedback and mentoring for new examiners. Those who lack confidence could be offered additional mentoring sessions with senior examiners. As Weigle (2002, p 131) suggests, examiners benefit from understanding how much variability in scoring is acceptable and from reassurance that 100% accuracy is not always required; this may particularly help those with lower confidence.
A background reading list on coherence and cohesion might also be appreciated. Examiners could also be reminded to take full advantage of the revision materials already available to them.
6.2 Possible re-assessment of and fine tuning of the band descriptors for CC
The study also points to the possible need to fine-tune some of the wording in the band descriptors for CC. The complexity and length of the CC descriptors relative to the other criteria, and the overlaps with TR and, to a lesser extent, LR, may threaten construct validity. While training can reduce these ambiguities, some could also be addressed by fine-tuning the descriptors themselves; for instance, it might be preferable to assess 'relevance' under TR across all bands rather than including it under CC at Band 2.
The paragraphing ceiling at Band 5 may also warrant reconsideration. Examiners adhered to it, but some expressed discomfort with the limitation. Appropriate paragraphing is clearly desirable in essays; a strict criterion, however, may disadvantage candidates who write coherently but are unfamiliar with academic English paragraphing conventions. Participants in the IELTS Writing Assessment Revision Project likewise perceived the paragraphing descriptors as overly strict and as potentially favouring candidates who had completed IELTS preparation courses; one senior examiner remarked that an emphasis on conventional paragraphing could unfairly reward well-learned or practised responses.
Hoey (1991) asserts the significance of lexical cohesion, and Halliday and Hasan's (1976) analysis of sample texts found that it constituted nearly 50% of all cohesive ties. A greater focus on lexical cohesion might therefore benefit the assessment of Academic Writing Task 2 and could be considered in future revisions of the band descriptors.
6.3 Revision of the task rubric to minimise candidate disadvantage
The IELTS Writing Assessment Revision Project (Shaw and Falvey, 2008) made significant changes to the task rubrics for both Task 1 and Task 2, notably removing references to particular genres such as 'report'.
Academic Task 2 now simply asks candidates to present a point of view or to evaluate the advantages and disadvantages of a statement, without mentioning genres such as 'essay', 'argument' or 'case'. Examiners nevertheless expect responses to follow the conventional essay structure: an introduction, coherently organised body paragraphs, and a clear conclusion that answers the question (Connor 1990; Mickan and Slater 2003).
The absence of the term 'essay' from the task instructions may therefore disadvantage some candidates, since examiners still assess against essay conventions. As Weigle (2002, p 63) points out, whether the 'rhetorical task or pattern of exposition is explicitly stated' in the prompt can have an impact on candidates' performance.
6.4 Further studies of aspects of coherence and cohesion in sample texts at different levels
Recent literature emphasises that writing assessment scales should be based on empirical studies of language use rather than solely on expert judgment, in order to enhance the validity of these tests (Fulcher 1987; North and Schneider 1998; Turner and Upshur 2002). Further studies of coherence and cohesion in text, using a corpus or discourse analysis approach, such as those of Kennedy and Thorp (2007) and Mayor, Hewings, North, Swann and Coffin (2007), would provide valuable reference points for updating the band descriptors. Continued research in this area would enhance understanding of CC across the band levels, and excerpts from such studies could be used in future training materials to illustrate specific characteristics of CC at the different levels.
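As a sketch of what such a corpus-based approach might look like in practice, the Python fragment below computes a crude density of common linking adverbials per 100 words for band-labelled scripts; the marker list, the function name marker_density and the two example scripts are invented for illustration, not taken from the studies cited.

import re

# Illustrative marker inventory; a real study would use a validated list
DISCOURSE_MARKERS = ["however", "therefore", "moreover", "furthermore",
                     "in addition", "on the other hand", "in conclusion"]

def marker_density(text: str) -> float:
    """Surface count of listed markers per 100 words (deliberately crude)."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = sum(text.lower().count(m) for m in DISCOURSE_MARKERS)
    return 100 * hits / max(len(words), 1)

# Invented band-labelled scripts
scripts = {
    5: "However money is important. Therefore people work for salary.",
    7: "Moreover, salary matters; on the other hand, satisfaction counts.",
}
for band, text in sorted(scripts.items()):
    print(f"Band {band}: {marker_density(text):.1f} markers per 100 words")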
The findings of this study of examiner marking of Coherence and Cohesion (CC) in IELTS Academic Writing Task 2 are positive. Although the marking of CC was slightly less reliable than that of Lexical Resource (LR) and Grammatical Range and Accuracy (GRA), it remained within an acceptable range. Notably, the qualifications and experience of examiners did not significantly affect CC marking, suggesting that current training procedures are effective in ensuring examiner consistency. Examiners demonstrated professionalism and a strong adherence to the band descriptors, with one examiner expressing a desire to internalise their wording.
At the same time, examiners engage in a complex marking process that relies on intuitive appreciation of the text, and differing interpretations of the descriptors may threaten the construct validity of the CC criterion. To enhance the assessment of coherence and cohesion, we propose refining the examiner training materials, adjusting the wording of the band descriptors for CC, and revising the task rubric or prompt wording. These improvements could strengthen the construct validity of the test, increase examiner confidence in their evaluations, and foster a shared understanding of coherence and cohesion, ultimately strengthening the community of practice among examiners.
Acknowledgements
We extend our heartfelt gratitude to those who supported our study: IELTS Australia, particularly Jenny Osborne, the former Regional Manager; Cambridge ESOL, for supplying scripts and standardised scores; the IELTS administrators who assisted in three Australian cities; and the examiners who participated with enthusiasm and good humour.
We also thank Dr Grenville Rose for his invaluable statistical insights and his comprehensive report on examiner reliability and the influence of examiner qualifications and experience, and Dr David Pedersen, statistical consultant at the University of Canberra, for his contributions.
References
Alderson, J, Clapham, C, and Wall, D, 1995, Language test construction and evaluation, Cambridge University Press, Cambridge
Allison, D, 1999, Language testing and evaluation, Singapore University Press, Singapore
Barkaoui, K, 2007, 'Rating Scale impact on EFL essay marking: a mixed-method study',
Brown, A, 2000, 'Legibility and the rating of second language writing: the effect of handwriting versus word processing on the rating of IELTS Task Two essays', in IELTS Research Reports Volume 3, ed R Tulloh, IELTS Australia Pty Limited, Canberra
Canagarajah, AS, 2002, Critical academic writing and multilingual students,
University of Michigan Press, Ann Arbor
Canale, M, 1983, 'From communicative competence to language pedagogy', in
Language and communication, eds J Richards and J Schmidt, Longman, London, pp 2-27
Canale, M, 1984, 'A communicative approach to language proficiency assessment in a minority setting', in Communicative competence approaches to language proficiency assessment: research and application, ed C Rivera, Multilingual Matters, Clevedon, pp 107-122
Canale, M, and Swain, M, 1980, 'Theoretical bases of communicative approaches to second language teaching and testing', Applied Linguistics, vol 1(1), pp 1-47
Connor, U, 1990, 'Linguistic/ rhetorical measures for international persuasive student writing',
Research in the Teaching of English, vol 24 (1), pp 67-87
Connor, U, and Farmer, M, 1990, 'The teaching of topical structure analysis as a revision strategy for ESL writers', in Second language writing: research insights for the classroom, ed B Kroll, Cambridge University Press, Cambridge
Cox, K, and Hill, D, 2004, EAP Now, Pearson Education Australia, Frenchs Forest, NSW
Crow, B, 1983, 'Topic shifts in couples' conversations', in Conversation Coherence:
Form, Structure and Strategy, eds RC Craig and K Tracy, Sage Publications,
Cumming, A, Kantor, R, and Powers, D, 2001, Scoring TOEFL essays and TOEFL 2000 prototype writing tasks: an investigation into raters' decision making and development of a preliminary analytic framework, Educational Testing Service, Princeton, NJ
DeRemer, ML, 1998, 'Writing assessment: raters’ elaboration of the rating task',
Eckes, T, 2008, 'Rater types in writing performance assessments: A classification approach to rater variability', Language Testing, vol 25 (2), pp 155-185
Ericsson, KA, and Simon, HA, 1984, Protocol analysis, MIT Press, Cambridge, MA
Faerch, C, and Kasper, G, 1987, 'From process to product: Introspective methods in second language research', in Introspection in second language learning, Multilingual Matters, Clevedon
Falvey, P, and Shaw, S, 2006, 'IELTS writing: revising assessment criteria and scales (Phase 5)'
Fulcher, G, 1987, 'Tests of oral performance: the need for data-based criteria'
Furneaux, C, and Rignall, M, 2007, 'The effect of standardization-training on rater judgements for the IELTS Writing Module', in Studies in Language Testing 19, IELTS collected papers,
Research in speaking and writing assessment, Cambridge University Press, Cambridge, pp 422-445
Green, A, 1998, Verbal protocol analysis in language testing research: a handbook,
Studies in Language Testing 5, Cambridge University Press
Halliday, MAK, and Hasan, R, 1976, Cohesion in English, Longman, Harlow
Halliday, MAK, and Matthiessen, C, 2004, An introduction to functional grammar, Arnold, London
Hamp-Lyons, L, 1991, 'Scoring procedures for ESL contexts', in Assessing second language writing in academic contexts, ed L Hamp-Lyons, Ablex, Norwood, NJ, pp 241-276
Hamp-Lyons, L, 2007, 'Worrying about rating', Assessing Writing vol 12, pp 1-9
Hawkey, R, 2001, 'Towards a common scale to describe ESL writing performance',
Hoey, M, 1991, Patterns of lexis in texts, Oxford University Press, Oxford
Howell, D, 1982, Statistical methods for psychology, PWS Kent, Boston, MA
Jones, J, 2007, 'Losing and finding coherence in academic writing',
University of Sydney Papers in TESOL 2 (2), pp 125-148
Kennedy, C, and Thorp, D, 2007, 'A corpus-based investigation of linguistic responses to an IELTS academic writing task', in Studies in Language Testing 19, IELTS collected papers: research in speaking and writing assessment, eds L Taylor and P Falvey, Cambridge University Press, Cambridge
Knoch, U, 2007, 'Little coherence, considerable strain for reader: a comparison between two rating scales for the assessment of coherence', Assessing Writing, vol 12, pp 108-128
Knoch, U, Read, J, and von Randow, J, 2007, 'Re-training writing raters online: how does it compare with face-to-face training?', Assessing Writing, vol 12 (1), pp 26-43
Lumley, T, 2002, 'Assessment criteria in a large-scale writing test: what do they really mean to raters?'
Lumley, T, 2005, Assessing second language writing: the rater's perspective, P Lang, New York
Mann, W, and Thompson, S, 1989, Rhetorical structure theory: a theory of text organization,
Information Sciences Institute, University of Southern California, Los Angeles
Mayor, B, Hewings, A, North, S, Swann, J, and Coffin, C, 2007, 'A linguistic analysis of Chinese and Greek L1 scripts for IELTS Academic Writing Task 2', in Studies in Language Testing 19, IELTS collected papers: research in speaking and writing assessment, eds L Taylor and P Falvey,
Cambridge University Press, Cambridge, pp 250-314
McDowell, C, 2000, 'Monitoring IELTS examiner training effectiveness', in IELTS Research Reports
Volume 3, ed R Tulloh, IELTS Australia Pty Limited, Canberra, pp 109-141
McNamara, T, 1996, Measuring second language performance, Longman, London
Mickan, P, and Slater, S, 2003, 'Text analysis and the assessment of Academic Writing', in
IELTS Research Reports Volume 4, ed R Tulloh, IELTS Australia Pty Limited, Canberra, pp 59-88
Milanovic, M, Saville, N, and Shuhong, S, 1996, 'A study of the decision-making behaviour of composition markers', in Performance testing, cognition and assessment: selected papers from the 15th Language Testing Research Colloquium (LTRC), Cambridge and Arnhem, Cambridge University Press, Cambridge
North, B, and Schneider, G, 1998, 'Scaling descriptors for language proficiency scales'
Oshima, A, and Hogue, A, 2006, Writing academic English, 4th edn, Pearson Longman, White Plains, NY
Padron, YN, and Waxman, HC, 1988, 'The effect of ESL students' perceptions of their cognitive strategies on reading achievement', TESOL Quarterly vol 22, pp 146-150
Paltridge, B, 2001, Genre and the language learning classroom, University of Michigan Press, Ann Arbor
Schaefer, E, 2008, 'Rater bias patterns in an EFL writing assessment', Language Testing vol 25 (4), pp 465-493
Schneider, M, and Connor, U, 1990, 'Analyzing topical structure in ESL essays', Studies in
Second Language Acquisition vol 12, pp 411-427
Shaw, S, 2002, 'The effect of training and standardisation on rater judgement and inter-rater reliability'
Shaw, S, 2004, 'IELTS Writing: revising assessment criteria and scales (Phase 3)' Research Notes vol 16, pp 3-7
Shaw, S, 2006, 'IELTS Writing: revising assessment criteria and scales (Phase 5)', Research Notes, vol 26, pp 7-12
Shaw, S, and Falvey, P, 2008, 'The IELTS Writing Assessment Revision Project: towards a revised rating scale', Research Reports, vol 1, January 2008
Turner, CE, and Upshur, JA, 2002, 'Rating scales derived from student samples: effects of the scale maker and the student sample on scale content and student scores', TESOL Quarterly, vol 36 (1), pp 49-70
Watson Todd, R, 1998, 'Topic-based analysis of classroom discourse', System, vol 26, pp 303-318
Watson Todd, R, Khongput, S, and Darasawang, P, 2007, 'Coherence, cohesion and comments on students' academic essays', Assessing Writing vol 12, pp 10-25
Watson Todd, R, Thienpermpool, P, and Keyuravong, S, 2004, 'Measuring the coherence of writing using topic-based analysis', Assessing Writing vol 9, pp 85-104
Weigle, SC, 1994, 'Effects of training on raters of ESL compositions', Language Testing vol 11 (2), pp 197-223
Weigle, SC, 1998, 'Using FACETS to model rater training effects' Language Testing, vol 15 (2), pp 263-287
Weigle, SC, 2002, Assessing writing, Cambridge University Press, Cambridge
Wolfe, EW, 1997, 'The relationship between essay reading style and scoring proficiency in a psychometric scoring system', Assessing Writing vol 4 (1), pp 83-106
Wolfe, EW, Kao, C-W, and Ranney, M, 1998, 'Cognitive differences in proficient and non proficient essay scorers', Written Communication vol 15, pp 465-492
Writing tasks
You should spend about 40 minutes on this task
Write about the following topic:
Countries are becoming more and more similar because people are able to buy the same products anywhere in the world
Do you think this is a positive or negative development?
Give reasons for your answer and include any relevant examples from your own knowledge or experience
You should spend about 40 minutes on this task
Write about the following topic:
Some people think that money is the only reason for working. Others, however, believe that there are more important aspects to a job than the salary
Discuss these views and give your own opinion
Give reasons for your answer and include any relevant examples from your own knowledge or experience
Semi-guided interview schedule (Phase 1)
ASSESSMENT OF COHERENCE AND COHESION IN WRITING TASK 2
SEMI-GUIDED INTERVIEWS WITH SAMPLE OF IELTS EXAMINERS
Thank you for completing the script marking. We would like to interview you to gain insights into any challenges you may face during the assessment process, even as an experienced examiner. We are particularly interested in your views on the various criteria and procedures you use when marking the scripts.
Let’s start by looking at the actual band descriptors (Get it out and look at it together)
1 Are all the band descriptors equally easy to understand?
2 If not, which ones do you find more difficult or easier to understand?
When you actually start marking the scripts:
3 In general, which of the four criteria do you find the most difficult and which the easiest to assess and why?
4 Do you sub-vocalise (or mutter quietly aloud) when you are marking the scripts?
If so, in what ways do you think it helps you mark the scripts?
5 How often do you read through the examiners' writing instruction booklet before you start marking the scripts?
3) ASSESSMENT OF COHERENCE AND COHESION
For our study we are looking particularly at coherence and cohesion, so we want to ask you some more detailed questions about this particular criterion
6 What do you think is the difference between coherence and cohesion?
7 What do you look for when you are marking for coherence and cohesion? (Do you separate them out when you are marking?)
8 How difficult do you find it to mark CC compared to the other criteria?
9 The band descriptors refer to various measures of coherence and cohesion, including logical order and sequencing, the use of cohesive devices, reference, substitution and paragraphing. How clear do you find these terms?
10 Which of these measures do you find the most useful in evaluating CC? (Are there any which you tend not to use when assessing scripts?)
11 If you find CC more difficult to mark than the other criteria, or feel less confident marking it, what do you think would make the marking of CC easier?
12 To what extent do you think your confidence in marking CC relates to: your experience in teaching writing; your background as an IELTS examiner; the IELTS training you have received; the clarity of the band descriptors?
Now let's look at the scripts you commented on in the audio recording and compare your marks with the standardised scores.
13 Looking at the scores for script x (y and z), can you comment on your marking process? Which of the measures of coherence and cohesion (logical order and sequencing, cohesive devices, reference, substitution, paragraphing) contributed most to your assessment?
We would also like to ask you some questions about the IELTS training:
14 What IELTS training did you receive in assessing coherence and cohesion?
15 How adequate do you think that training was?
16 Have you any suggestions as to how the training might be improved in order to help you with the marking of CC?
Main codes used in the think-aloud data analysis
1 GB = general behaviours: a M = management strategies b R = reading c I = interpretation d J = judgement
2 SB = specific behaviours: a M: DIR = indicating direction; EXPL = providing explanations; CONCL = commenting on prior content b R: RS = reading script; RC = reading criteria; RQ = referring to question c I: Q = questioning; RE = rephrasing testee's position; IQ = interpreting question; H = hedging d J: EVAL = evaluating writing; GR = grading; JU = grading justification; HES = hesitancy; PERSON = personalising the writer; RP = responding to the writer's position; INT = using intuition
3 Criterion codes: a TR = Task Response b CC = Coherence and Cohesion c LR = Lexical Resources d GRA = Grammatical Range and Accuracy e ALL = all criteria f SELF = the examiner g TXT = the text being assessed
4 Coherence codes: a M = meaning b LOG ORG = logical organisation c LOG PRO = logical progression d ARG = argument e CL = clarity f FL = fluency g REL = relationship of ideas h PARA = paragraph i THEME = theme j MAC TH = macro-theme k INTRO = introduction l CONCL = conclusion m MIC TH = micro-theme/topic sentence
5 Cohesion codes a Cohesion = Cohesion b Cohesive devices = CD c Reference = Ref d Substitution = Subst e Sequencers/discourse markers= DM f Linking = Link g Subordinators = Subord h Coordinators = Coord i Conjunctions = Conj j Lexical Cohesion = Lexico k Unsorted = U
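For researchers wishing to tally coded think-aloud segments automatically, the coding scheme above can be represented as a simple data structure. The Python sketch below is a hypothetical illustration; the dictionary layout and the example segments are our own and are not part of the study's instruments.

from collections import Counter

# Code groups mirror the appendix; labels are abbreviated as listed above
CODE_GROUPS = {
    "general":   ["M", "R", "I", "J"],
    "criterion": ["TR", "CC", "LR", "GRA", "ALL", "SELF", "TXT"],
    "coherence": ["LOG ORG", "LOG PRO", "ARG", "CL", "FL", "REL",
                  "PARA", "THEME", "MAC TH", "INTRO", "CONCL", "MIC TH"],
    "cohesion":  ["CD", "Ref", "Subst", "DM", "Link",
                  "Subord", "Coord", "Conj", "Lexico", "U"],
}

# Invented coded segments: (transcript excerpt, assigned code)
segments = [("reads the script aloud", "R"),
            ("comments on paragraphing", "PARA"),
            ("notes a linking word", "Link"),
            ("assigns a band for CC", "CC")]

tally = Counter(code for _, code in segments)
print(tally)  # Counter({'R': 1, 'PARA': 1, 'Link': 1, 'CC': 1})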
Participant biodata
Table 1: Phase 2 examiners' age distribution
Table 2: Phase 2 examiners' teaching experience
Table 3: Phase 2 examiners’ teaching sector
Table 4: Phase 2 examiners' IELTS experience
Years as IELTS examiner: less than 2 years, n = 21
Phase 2 follow-up questionnaire
1 In general, which criterion do you usually find most difficult to mark? List in order of difficulty
(1=most difficult to 4= least difficult or just tick the box for ‘all the same level of difficulty’)
_ TR _ CC _ LR _ GRA ! all the same level of difficulty
2 Which of the band descriptors is the clearest to understand?
(1 = clearest to 4 = least clear or just tick the box for ‘all equally clear’)
_ TR _ CC _ LR _ GRA ! all equally clear
3 Which features of CC most affect your scoring of this criterion? Rank them in order of significance (1 = most significant to 8 = least significant)
_ Logical progression/sequencing of ideas
4 When making decisions about the CC criterion, how often do you refer to each of the following features? (never / seldom / sometimes / very often / always)
5 How confident do you feel in rating each criterion? Tick one box in each row (not at all confident to very confident)
6 How much do you think your rating of coherence and cohesion is affected by the following?
Tick one box in each row: not at all / to some extent / a great deal / not applicable
Your experience in teaching writing ! ! ! !
Your experience as an IELTS examiner ! ! ! !
Your background in Systemic Functional Linguistics ! ! ! !
Your IELTS training or standardization ! ! ! !
The clarity of the band descriptors ! ! ! !
Discussing band descriptors with other examiners ! ! ! !
7 How do you define coherence? Please give a short explanation
8 How do you define cohesion? Please give a short explanation
9 What do ‘cohesive devices’ refer to? Please give short definition or a list of cohesive devices
10 What does ‘substitution’ refer to? Please give a short definition
11 What does ‘reference’ refer to? Please give a short definition
12 Which of the following statements about coherence and cohesion most closely represents your point of view? Please tick one box:
! A good answer has a clear introduction, body and conclusion. The flow of ideas from one sentence to another is not as important as the overall structure
! A good answer flows smoothly and is easy to read The overall structure is not as important as the flow of ideas from one sentence to another
13 How useful are the bolded ceilings on paragraphing in your decision-making for marking Task 2? Tick the appropriate box
not at all useful / not very useful / quite useful / very useful / very useful indeed
PART B: EXAMINER TRAINING / STANDARDISATION FOR MARKING COHERENCE AND COHESION IN IELTS WRITING TASK 2
1 How would you rate the examiner training or last standardisation in the assessment of coherence and cohesion? (poor / below average / average / very good / excellent)
2 How much do you remember about the training/last standardisation in the assessment of coherence and cohesion? (nothing / not much / some / a great deal / everything)
3 Are you aware that at each testing centre, examiners can access standardised scripts for regular revision? ! Yes ! No
4 Before you mark scripts, how often do you read through the writing examiners’ booklet containing the definitions of each of the criteria?
" every time I mark scripts " often " sometimes " rarely " never
The following suggestions have been made for improving examiner training in the assessment of writing, specifically the marking of coherence and cohesion. Please indicate your level of agreement with each by ticking the appropriate box.
5 A detailed analysis of CC in one or two full-length scripts at each band level, showing all cohesive ties and examples of their correct use as well as misuse, overuse or omission strongly agree agree neither agree nor disagree disagree strongly disagree
6 Revision of the CC band descriptors to ensure greater consistency in terminology strongly agree agree neither agree nor disagree disagree strongly disagree
7 A glossary of key terms for describing coherence and cohesion, with definitions and examples, for writing examiners strongly agree agree neither agree nor disagree disagree strongly disagree
8 More mentoring and feedback in the first year of examining strongly agree agree neither agree nor disagree disagree strongly disagree
9 An online question and answer service available for examiners strongly agree agree neither agree nor disagree disagree strongly disagree
10 Use of colours to pick out the key features across the bands in the rating scale strongly agree agree neither agree nor disagree disagree strongly disagree
11 A list of dos and don'ts when marking scripts (eg marking each script on its own merits without comparing it to others) to be included in the examiners' instructions booklet strongly agree agree neither agree nor disagree disagree strongly disagree
12 A step-by-step guide to the recommended marking process (eg reading the band descriptors before marking) to be included in the examiners' instructions booklet strongly agree agree neither agree nor disagree disagree strongly disagree
13 Online training materials with exercises for revision and reflection strongly agree agree neither agree nor disagree disagree strongly disagree
Which 3 improvements do you consider the most important? Add 1 to 3 beside the chosen items
_ A detailed analysis of CC in one or two full length scripts showing all cohesive ties
_ Revision of the CC band descriptors to ensure greater consistency in terminology
_ A glossary of key terms for describing coherence and cohesion
_ More mentoring and feedback in the first year of examining
_ An online question and answer service available for examiners
_ Use of colours to pick out the key features across the bands
_ A list of dos and don’ts when marking scripts
_ A step-by-step guide to the recommended process to follow
14 Have you any other comments or suggestions to make for improving training in the marking of CC?
Please complete the following. Tick the chosen boxes
1 How many years have you been (ESL) teaching? Tick the box/es
Full-time " Less than 2 years " 2-4 years " 5-9 years " 10+ years
Part-time: " Less than 2 years " 2-4 years " 5-9 years " 10+ years
2 In which ESL/TESOL sector have you mainly worked? Tick one box
" ELICOS " AMES " Senior high school " Other (Please state which sector) _
3 At which level do you have the most TESOL experience? Tick one box
" Elementary " Pre-intermediate " Intermediate " Upper Intermediate " Advanced
4 What are your ESL/TESOL qualifications? Tick the chosen boxes:
Bachelors / Grad Cert / Grad Dip / Masters / PhD
a) ESL/TESOL " " " " "
b) Applied Linguistics " " " " "
c) Other Please state: _
5 Do your qualifications include courses in any of the following?
a) Discourse analysis " Yes " No
b) SFL (Systemic Functional Linguistics) text analysis " Yes " No
c) Formal grammar " Yes " No
d) Teaching academic writing " Yes " No
6 Have you ever taught academic writing? " Yes " No
7 If yes, how often have you taught an academic writing course? Tick the chosen box
" 10+ times " 6-9 times " 4-5 times " 1-3 times " never
8 Have you ever taught a dedicated IELTS preparation course? " Yes " No
9 If yes, how often have you taught IELTS preparation courses?
" once " 2-3 times " 4-5 times " More than 5 times
10 How many years have you been an IELTS writing examiner?
" Less than 2 years " 2-4 years " 5-9 years " 10+ years
11 On average, how often do you mark IELTS writing scripts?
" Almost every week " twice a month " once a month " once every 2 months " less often than every 2 months _
Thank you very much for your help Please return this questionnaire to Fiona Cotton or Kate Wilson
Correlations of scores on criteria with standardised scores
Examiner Overall TR CC LR GRA
Correlations of criteria with examiner variables
                                            TR      CC      LR     GRA   Overall
Level at which most teaching experience   -.099    .010   -.100    .023   -.100
Frequency of teaching academic writing    -.059   -.024   -.263   -.173   -.193
Years of experience as IELTS examiner      .057   -.071   -.073   -.069   -.028
Point biserial correlations of dichotomous factors with criteria
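By way of illustration, the Python sketch below shows how correlations of this kind are typically computed with scipy.stats: a Pearson coefficient for a continuous examiner variable and a point-biserial coefficient for a dichotomous factor. All data values are invented for demonstration and bear no relation to the study's results.

import numpy as np
from scipy import stats

# Invented data: ten examiners' CC scores, years of examining experience,
# and a dichotomous factor (eg has taught an IELTS preparation course: 1/0)
cc_scores   = np.array([6.0, 5.5, 6.5, 6.0, 5.0, 7.0, 6.0, 5.5, 6.5, 6.0])
years_exp   = np.array([1, 3, 8, 2, 1, 10, 4, 2, 6, 5])
taught_prep = np.array([0, 1, 1, 0, 0, 1, 1, 0, 1, 1])

r, p = stats.pearsonr(years_exp, cc_scores)             # continuous-continuous
rpb, p2 = stats.pointbiserialr(taught_prep, cc_scores)  # dichotomous-continuous
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
print(f"Point-biserial r = {rpb:.3f} (p = {p2:.3f})")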
Effect of scripts on the reliability of examiners’ scores
Sum of Squares   df   Mean Square   F   Sig
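As a hedged sketch of the analysis this table summarises, a one-way ANOVA can be run with scipy.stats.f_oneway, which returns the F statistic and the significance (Sig) value; the script-grouped scores below are invented for illustration.

from scipy import stats

# Invented data: examiners' CC marks grouped by script
script_a = [5.0, 5.5, 6.0, 5.5, 5.0]
script_b = [6.5, 7.0, 6.5, 6.0, 6.5]
script_c = [4.5, 5.0, 4.5, 5.5, 5.0]

F, p = stats.f_oneway(script_a, script_b, script_c)
print(f"F = {F:.2f}, Sig = {p:.4f}")  # mirrors the F and Sig columns above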
Independent samples test
T tests for overall harshness or leniency against standard scores
Levene's Test for Equality of Variances; t-test for Equality of Means: Std Error Difference, 95% Confidence Interval of the Difference (Lower, Upper); equal variances assumed / not assumed
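The logic behind this kind of SPSS output can be sketched in a few lines of Python: Levene's test checks the equal-variance assumption, and an independent-samples t-test then compares an examiner's marks with the standardised scores. The data below are invented for illustration.

import numpy as np
from scipy import stats

# Invented data: one examiner's overall marks and the standardised scores
examiner = np.array([5.0, 6.0, 5.5, 6.5, 5.0, 6.0, 5.5, 7.0, 6.0, 5.5])
standard = np.array([5.5, 6.5, 6.0, 6.5, 5.5, 6.5, 6.0, 7.0, 6.5, 6.0])

W, p_lev = stats.levene(examiner, standard)       # equality-of-variances check
t, p = stats.ttest_ind(examiner, standard,
                       equal_var=(p_lev > 0.05))  # choose the matching t-test
print(f"Levene W = {W:.2f} (p = {p_lev:.3f}); t = {t:.2f} (p = {p:.3f})")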
T tests of CC against standard scores for harshness or leniency
Levene's Test for Equality of Variances; t-test for Equality of Means: Std Error Difference, 95% Confidence Interval of the Difference (Lower, Upper); equal variances assumed / not assumed