Previous studies on test repeaters
A key focus in test validation research is understanding the significance of test scores. This inquiry typically involves analyzing the factors that lead to variations in scores at a specific moment. However, there is a scarcity of longitudinal studies that explore changes in scores over time. Most existing research in this area pertains to the IELTS test and can be categorized into two main groups (Green, 2005).
Studies have produced significant insights by comparing the scores of candidates who took the test twice and by examining the performance of L2 learners before and after receiving targeted English language instruction (e.g., Green, 2005; O’Loughlin and Arkoudis, 2009; Rao et al., 2003; Read and Hayes, 2003). Green (2005), for example, combined both approaches to estimate and explain score gains on IELTS writing tasks.
Research indicates that IELTS scores change following instruction, with variations in direction and magnitude based on language skills and learner characteristics (Green, 2005). Typically, learners with lower initial scores demonstrate more significant gains compared to those with higher starting scores. Notably, certain skills, like listening, exhibit greater score improvements than others, such as writing, during the same instructional period. This research supports the validity of IELTS, showing that score changes correlate with shifts in L2 ability. However, as Green (2005) cautions, individual score fluctuations may also result from factors unrelated to L2 proficiency, including practice effects.
Previous studies on the writing performance of repeaters have primarily focused on changes in test scores without considering the linguistic characteristics of candidates' texts. These studies typically utilized data collected at two points, such as pre- and post-tests or different testing occasions. However, to effectively analyze the patterns of change in test performance and individual differences over time, it is essential to have at least three repeated measures of the same variable for each participant.
This study seeks to overcome existing limitations by analyzing the linguistic features of texts produced by candidates who completed the IELTS Academic exam three times, specifically focusing on their responses to IELTS Writing Task 2.
Research on writing features distinguishing L2 proficiency levels
Examining the relationships between L2 writing test scores and the linguistic and discourse characteristics of candidates' responses can provide insights into the meaning of these scores. Research indicates that the quality of test performance, as indicated by scores, can be partially understood by analyzing the specific characteristics of the writing produced. This approach is supported by various studies that highlight the connection between performance attributes and assessment outcomes.
A study conducted in 2005 analyzed the linguistic and discourse features of writing scripts at varying proficiency levels in the New Generation TOEFL. The findings revealed that high-scoring scripts consistently exhibited longer lengths, enhanced grammatical accuracy, and a broader vocabulary. Additionally, these scripts contained more complex sentence structures, stronger claims, and more coherent summaries of supporting evidence compared to their low-scoring counterparts.
Three studies have recently examined the linguistic and discourse characteristics of IELTS Academic Writing scripts.
In their study, Mayor et al. (2007) analyzed Writing Task 2 scripts from candidates with various first-language backgrounds, focusing on errors, complexity, and discourse characteristics. The research specifically compared high-scoring scripts (bands 7 and 8) with low-scoring ones, revealing significant differences in writing quality and proficiency.
Focusing on Chinese and Greek L1 candidates, the study identified several key features that significantly predicted writing quality: text length, formal error rate, sentence complexity, use of the impersonal pronoun "one", thematic structure, argument genre, and interpersonal tenor.
Banerjee et al. (2007) compared the linguistic characteristics of scripts written by Chinese and Spanish L1 candidates in response to IELTS Academic Writing Tasks 1 and 2, scoring between bands 3 and 8. Their research focused on key linguistic features such as cohesive devices, lexical variation and sophistication, syntactic complexity, and grammatical accuracy, and revealed how these elements relate to writing performance in the IELTS examination.
Their findings indicated that: (a) scripts at higher IELTS band levels exhibited greater lexical variation and sophistication; (b) vocabulary improvements were most noticeable at lower band levels, with other evaluation criteria becoming more prominent as proficiency increased; and (c) grammatical accuracy was a good discriminator of proficiency level regardless of task type and test taker L1.
More recently, Riazi and Knox (2013) compared the linguistic and discourse characteristics of IELTS Academic Writing Task 2 scripts from three L1 candidate groups (European, Hindi, and Arabic) across band levels 5, 6, and 7. They found that higher-scoring scripts (bands 6 and 7) were generally longer and featured a greater proportion of low-frequency words, greater lexical diversity, and greater syntactic complexity than lower-scoring scripts.
However, high-scoring scripts were not necessarily more cohesive than low-scoring scripts. The three studies also found significant differences in some linguistic characteristics (e.g., lexical diversity) across L1 groups.
While previous studies have offered valuable insights into L2 proficiency and the impact of various factors on L2 writers' texts, they primarily utilized a cross-sectional approach, analyzing writing samples from different candidates at varying proficiency levels at a single point in time. A longitudinal approach, focusing on individual differences in test performance over time, could greatly enhance this research. By examining the scripts of candidates who retake L2 writing tests, researchers can investigate the nature and extent of changes in characteristics such as linguistic accuracy and vocabulary use, as well as how these changes correlate with variations in writing scores.
Here, ‘difference’ refers to variation across candidates at one point in time, while ‘change’ refers to variation within the same candidate across time.
One of the key challenges in researching the text features of candidates is identifying the optimal combination of measures that can effectively capture variations in writing performance among individuals over time (Banerjee et al., 2007).
This study employs a comprehensive text analysis framework that integrates models of second language (L2) proficiency, insights from prior research, and the established criteria of the IELTS rating scale for Writing Task 2.
This study aimed to examine the patterns of changes over time in the linguistic and discourse characteristics of texts written by IELTS repeaters in response to Writing Task 2.
The study analyzed Writing Task 2 scores and scripts from 78 candidates who took the IELTS Academic exam on three test occasions. Candidates were categorized into three groups based on their Writing Task 2 band scores at the first test occasion.
The IELTS Writing Task 2 challenges candidates to compose a coherent argumentative essay of at least 250 words within 40 minutes. In this task, candidates must address a specific problem by proposing a solution, articulating and justifying their opinion, or comparing and contrasting various evidence and viewpoints; they may also need to evaluate and critique existing ideas or arguments. The task assesses the candidate's ability to construct a clear, relevant, and well-organized argument, supported by appropriate examples and evidence, while demonstrating accurate use of the English language.
The study addressed the following research questions:
1. To what extent and how do the scripts of the three groups of candidates at test occasion 1 differ in terms of their linguistic characteristics?
2. To what extent and how do the linguistic characteristics of the repeaters’ scripts change across test occasions?
3. To what extent and how does test repeaters’ initial L2 writing ability (i.e., initial writing score) relate to changes in the linguistic characteristics of their scripts across test occasions?
4. To what extent and how do the linguistic characteristics of the repeaters’ scripts relate to their writing scores across test occasions?
Sample and dataset
Data for the study were obtained from IELTS and consisted of individual biographical data (age, gender, L1 and country) and the IELTS Writing Task 2 scores and scripts for a purposive sample of 78 candidates who each took IELTS Academic three times. The sample was selected based on candidates' scores on IELTS Writing Task 2 at test occasion 1 (i.e., the first time they took the test). Specifically, three groups of candidates (n = 26 per group) were selected:
• group 1 included candidates whose scripts received a score of 4 at test occasion 1
The sample consisted of 35 females (45%) and 43 males from 27 different countries, with the largest numbers coming from China, India (n = 12) and Saudi Arabia. They ranged in age from 16 to 52 years. They spoke 23 different first languages, the most common being Arabic (n = 16), Chinese (n = 14), Korean (n = 8) and Punjabi (n = 7).
The study included 234 scripts (i.e., 26 candidates x 3 groups x 3 test occasions). As outlined in Table 1, the three groups of candidates each took the test on three separate occasions in 2013. Table 2 presents descriptive statistics on the interval (in days) between test occasions, which varied considerably, spanning from 14 to 219 days between the first and third tests. Each candidate's handwritten script was subsequently typed into a Word document by IELTS staff, preserving the original layout and any errors.
Table 3 presents descriptive statistics for Overall and Writing Task 2 scores by candidate group and test occasion, revealing an increase in mean scores across all three groups over time. Additionally, the inter-correlations among Writing Task 2 scores were notably high, with Pearson r values of .96 between occasions 1 and 2, .94 between occasions 2 and 3, and .90 between occasions 1 and 3.
Candidate group | Occasion 1 | Occasion 2 | Occasion 3 | Total
Table 1: Sample of scripts included in the study
Test 1 to Test 2 | Test 2 to Test 3 | Test 1 to Test 3
Table 2: Descriptive statistics for interval (in days) between test occasions
Group | Overall | Task 2 | Overall | Task 2 | Overall | Task 2
Table 3: Descriptive statistics for Overall and Writing Task 2 scores by occasion and group
Data analyses
Script linguistic characteristics
Fluency in writing, defined as the quantity of production measured by the number of words per script, plays a crucial role in second language (L2) writing assessments. Research indicates that text length is a significant predictor of L2 writing test scores, as demonstrated by multiple studies (Cumming et al., 2005; Frase et al., 1999; Grant and Ginther, 2000; Mayor et al., 2007; Riazi and Knox, 2013).
Linguistic accuracy is a key focus in studies analyzing L2 learners' texts, with researchers measuring accuracy by counting the number of linguistic errors present (Cumming et al., 2005; Polio, 1997; Wolfe-Quintero et al., 1998).
Criterion was utilized to identify, categorize, and quantify linguistic errors in each script, distinguishing four main types of error: grammar, usage, mechanics, and style. Grammar errors include sentence structure issues and verb forms, while usage errors involve article mistakes and incorrect word forms; mechanics errors pertain to spelling and punctuation, and style errors flag excessive passive voice and overly long sentences. An error ratio was calculated for each script and each error category, expressed as the number of errors per 100 words.
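The per-100-words error ratio just described can be sketched as follows. The error counts below are hypothetical; in the study, Criterion produced the per-category counts automatically.

```python
# Sketch of the error-ratio computation (hypothetical counts; in the study,
# Criterion identified and counted the errors in each category).

def error_ratio(error_counts, word_count):
    """Return errors per 100 words, overall and per category."""
    total = sum(error_counts.values())
    per_100 = lambda n: 100.0 * n / word_count
    return {
        "total_per_100": per_100(total),
        **{cat: per_100(n) for cat, n in error_counts.items()},
    }

counts = {"grammar": 6, "usage": 4, "mechanics": 8, "style": 2}  # hypothetical script
ratios = error_ratio(counts, word_count=250)
print(round(ratios["total_per_100"], 1))  # 20 errors over 250 words -> 8.0
```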
Syntactic complexity reflects a writer's ability to convey substantial information within concise grammatical structures (Bardovi-Harlig, 1992; Polio, 2001). According to the Coh-Metrix developers, complex sentences are characterized by their structural density and numerous embedded components (Crossley, Greenfield and McNamara, 2008). The tool evaluates syntactic complexity through several indicators: left embeddedness, which assesses the average number of words preceding the main verb in main clauses; noun-phrase (NP) density, the average number of modifiers (such as determiners and adjectives) per noun phrase; and syntactic similarity, which assesses the uniformity and consistency of syntactic constructions throughout the text.
Competence | IELTS rating criteria / writing feature | Specific measure | Computer program
Grammatical | Fluency | Number of words per script | Coh-Metrix
Grammatical | Accuracy | Number and distribution of four types of errors: grammar, usage, mechanics, and style | Criterion
Grammatical | Syntactic complexity | Left embeddedness; NP density; syntactic similarity | Coh-Metrix
Lexical resource | Lexical features | Lexical density; lexical variation; lexical sophistication | Coh-Metrix
Coherence and cohesion | Cohesion | Connectives density; coreference cohesion; conceptual cohesion | Coh-Metrix
Discourse structure | Organisation | Presence of 5 discourse elements (introductory material, thesis statement, main idea, supporting ideas, and conclusion) | Criterion
Discourse structure | Development | Relative length of each discourse element | Criterion
Sociolinguistic | Register | Contractions, passivisation, and nominalisation | —
Strategic | Metadiscourse | Interactional metadiscourse markers | AntConc
Table 4: List of measures of the linguistic characteristics of repeaters' scripts
Coh-Metrix offers various indices of syntactic similarity, but this study used only the mean sentence syntactic similarity across all paragraph combinations. Sentences featuring complex syntactic structures possess a greater ratio of constituents per noun phrase (NP) than those with simpler syntax (Graesser et al., 2004). Generally, high syntactic similarity indices indicate less complex syntax (Crossley, Greenfield and McNamara, 2008; Crossley et al., 2011).
The study analyzed three key lexical features: lexical density, lexical variation, and lexical sophistication. Lexical density refers to the proportion of lexical words, such as nouns, verbs, adjectives, and adverbs, relative to the total word count in each script (Engber, 1995; Laufer and Nation, 1995; Lu, 2012). It was computed using Coh-Metrix as the ratio of lexical words to the total number of words in each script, excluding function or grammatical words such as articles, prepositions, and pronouns.
Lexical variation (or diversity) is often measured using the Type-Token Ratio (TTR), the ratio of types (the number of different words used) to tokens (the total number of words used) in a text (Engber, 1995; Laufer and Nation, 1995; Lu, 2012; Malvern and Richards, 2002). A high TTR indicates a text with a diverse vocabulary, showcasing a greater variety of different words; a low TTR suggests that the writer relies on a limited set of words, resulting in repetition.
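The TTR definition above amounts to a one-line computation; the following minimal sketch uses naive whitespace tokenisation, which is an assumption (Coh-Metrix tokenises more carefully).

```python
# Minimal type-token ratio (TTR): distinct word forms over total word forms.
# Naive whitespace tokenisation is assumed for illustration.

def ttr(text):
    tokens = text.lower().split()   # tokens: all words used
    types = set(tokens)             # types: different words used
    return len(types) / len(tokens) if tokens else 0.0

print(ttr("the cat sat on the mat"))  # 5 types over 6 tokens
```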
TTRs, however, tend to be affected by text length, which makes them unsuitable measures when there is much variability in text length (Koizumi, 2012; Lu, 2012; Malvern and Richards, 2002; McCarthy and Jarvis, 2010). The Measure of Textual Lexical Diversity (MTLD), calculated with Coh-Metrix, overcomes this limitation: its values are not affected by text length, which allows comparisons between texts of considerably different lengths (Koizumi, 2012).
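The core idea behind MTLD can be illustrated with a simplified, forward-only sketch: count how many word "factors" it takes for the running TTR to fall to the conventional .72 threshold, then divide text length by the factor count. This is an assumption-laden illustration, not the published algorithm, which also averages a backward pass and handles tokenisation more carefully.

```python
# Simplified forward-only MTLD sketch (illustration only; the published
# measure averages forward and backward passes).

def mtld_forward(tokens, threshold=0.72):
    factors, types, count, ttr = 0.0, set(), 0, 1.0
    for tok in tokens:
        count += 1
        types.add(tok)
        ttr = len(types) / count
        if ttr <= threshold:            # segment complete: one full factor
            factors += 1
            types, count = set(), 0
    if count:                           # credit the leftover partial segment
        factors += (1 - ttr) / (1 - threshold)
    return len(tokens) / factors if factors else float("inf")

# Highly repetitive text -> TTR drops quickly -> short factors -> low MTLD
print(mtld_forward("a b a b a b a b a".split()))  # -> 3.0
```

Longer, lexically diverse texts keep the running TTR above the threshold for longer, yielding higher MTLD values regardless of overall text length.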
Lexical sophistication concerns the proportion of relatively unusual, advanced, or low-frequency words to frequent words used in a text (Laufer and Nation, 1995; Meara and Bell, 2001). Two measures, both computed by Coh-Metrix, were used to assess lexical sophistication: average word length (AWL) and word frequency. AWL is computed by dividing the total number of letters by the total number of words for each script (Biber, 1988; Cumming et al., 2005; Engber, 1995; Frase et al., 1999; Grant and Ginther, 2000); higher AWL values indicate more sophisticated vocabulary use.
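The AWL computation described above can be sketched directly; stripping trailing punctuation before counting letters is an assumption for illustration.

```python
# Average word length (AWL): total letters divided by total words.
# Simple punctuation stripping is assumed for illustration.

def awl(text):
    words = [w.strip(".,;:!?") for w in text.split()]
    words = [w for w in words if w]
    return sum(len(w) for w in words) / len(words)

print(awl("The cat sat."))  # (3 + 3 + 3) / 3 -> 3.0
print(round(awl("Candidates demonstrated remarkable perseverance."), 2))
```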
Word frequency, measured using the mean CELEX frequency score for content words, refers to how often particular content words occur in the English language (Graesser et al., 2004). The CELEX frequency score is based on the database from the Centre of Lexical Information (CELEX), which consists of frequencies taken from the early 1991 version of the COBUILD corpus of 17.9 million words (see Crossley et al., 2007, 2008). Research suggests that advanced L2 learners are more likely to comprehend and use lower-frequency words than learners with low L2 proficiency (Bell, 2003; Crossley et al., 2010; Ellis, 2002; Meara and Bell, 2001).
To examine discourse, each script was computer-coded in terms of several coherence and cohesion features and various aspects of discourse structure.
Coherence and cohesion: Using Coh-Metrix, each script was computer-analysed in terms of connectives density, coreference cohesion, and conceptual cohesion.
Connectives serve as indicators of the relationships between ideas in a text, enhancing its cohesion, organization, and overall quality (Crismore et al., 1993; Halliday and Hasan, 1976). Coh-Metrix provides an incidence score (occurrences per 1,000 words) for all connectives (i.e., causal, additive, temporal and clarification connectives) in each script. Coreference cohesion occurs when a noun, pronoun, or noun phrase refers to another constituent in the text (Crossley et al., 2007, 2009, 2011; McNamara et al., 2010). Coh-Metrix provides indices for several types of coreferentiality; these indices, however, were highly inter-correlated (r > .70).
The study focused solely on argument overlap for adjacent sentences, assessing the frequency with which two neighboring sentences share common arguments, including nouns, pronouns, and noun phrases.
Conceptual cohesion refers to the degree of semantic or conceptual similarity within sentences or paragraphs. It is measured using Latent Semantic Analysis (LSA), a statistical technique that evaluates the relatedness of meanings across different parts of a text, including sentences and paragraphs, to determine how well they connect conceptually, both locally and globally (Crossley et al., 2008; Crossley et al., 2009, 2011; Foltz, Kintsch and Landauer, 1998; Graesser et al., 2004; Landauer, Foltz and Laham, 1998; McNamara, Cai and Louwerse, 2007). Unlike lexical markers of coreferentiality (i.e., noun and argument overlap), LSA provides for the tracking of words that are semantically similar but may not be related morphologically (Landauer and Dumais, 1997). Text cohesion and coherence are believed to improve with greater conceptual similarity among text elements (Crossley, Louwerse, et al., 2007; Landauer et al., 2007). Previous research has utilized LSA to assess coherence in first-language (L1) texts and to evaluate L1 essays in English composition (Landauer et al., 2007). Coh-Metrix was employed to calculate two LSA scores for each script: the mean LSA overlap for adjacent sentences, indicating the similarity between sentences, and the mean LSA overlap for adjacent paragraphs, reflecting the similarity between paragraphs.
Discourse structure: To analyze text structure, the web-based program Criterion was utilized to assess the organization and development of the scripts. Criterion automatically detects five key discourse elements: introductory material, thesis statement, main idea, supporting ideas, and conclusion (Ramineni et al., 2012; Weigle, 2011). It evaluates organization by verifying the presence of each element, while development is quantified by the relative length of each discourse element, computed by dividing the word count of each element by the total word count of the script and multiplying by 100.
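The development measure just described reduces to a percentage per element. The element word counts below are hypothetical; in the study, Criterion identified the elements and their lengths automatically.

```python
# Relative length of each discourse element as a percentage of script length
# (hypothetical word counts for a 250-word script).

def development(element_words, total_words):
    return {el: 100.0 * n / total_words for el, n in element_words.items()}

elements = {"introductory material": 40, "thesis statement": 20,
            "main idea": 30, "supporting ideas": 130, "conclusion": 30}
dev = development(elements, total_words=250)
print(dev["supporting ideas"])  # 130 of 250 words -> 52.0
```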
Statistical analyses
This study analyzed Writing Task 2 scores along with various linguistic and discourse features of each candidate's script at each test occasion. Several analyses were performed to address the research questions posed in the study.
First, descriptive statistics (e.g., means, standard deviations) for each linguistic and discourse feature in Table 4 were computed for all candidates and across test occasions and candidate groups.
Second, to address research question 1 concerning differences between the linguistic characteristics of the scripts of the three candidate groups at test occasion 1, a univariate analysis of variance (ANOVA) was performed for each linguistic measure listed in Table 4, with candidate group as the independent variable and the linguistic index as the dependent variable. Significant ANOVA results were followed up with pairwise comparisons of candidate groups, with Bonferroni correction. For the presence of discourse structure elements (e.g., introduction, thesis statement), chi-square (χ²) tests were utilized to assess the association between candidate group and the presence or absence of each discourse element. In the ANOVA analyses, only significant statistics (F, df, effect size) with p < .05 are reported, while non-significant effects are excluded. Partial eta-squared (partial η²) serves as the effect size measure, with partial η² ≥ .01 indicating a small effect and partial η² ≥ .09 a medium effect.
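As a sketch of these analyses, the simulated example below (group means, spreads, and the 26-per-group sizes are hypothetical) runs a one-way ANOVA and computes partial eta-squared, which for a one-way design equals SS_between / (SS_between + SS_within).

```python
# One-way ANOVA across three simulated candidate groups, with partial
# eta-squared as the effect size (hypothetical words-per-script data).

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.normal(250, 30, 26)   # hypothetical band-4 group
g2 = rng.normal(270, 30, 26)
g3 = rng.normal(300, 30, 26)

f, p = stats.f_oneway(g1, g2, g3)

grand = np.concatenate([g1, g2, g3]).mean()
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in (g1, g2, g3))
ss_within = sum(((g - g.mean()) ** 2).sum() for g in (g1, g2, g3))
partial_eta_sq = ss_between / (ss_between + ss_within)

print(f"F(2, 75) = {f:.2f}, p = {p:.4f}, partial eta^2 = {partial_eta_sq:.3f}")
```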
Third, the autocorrelations (Pearson r) of each linguistic feature across test occasions were examined to determine the stability of candidates' rank order over time. Specifically, each measure (e.g., lexical density) was correlated with itself across occasions (e.g., time 1 vs. time 2, time 2 vs. time 3) to assess whether the ordering of candidates on that feature shifted across testing occasions.
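Such an autocorrelation is simply a Pearson r between the same candidates' values at two occasions; the sketch below uses simulated data (the feature values and their dependence structure are hypothetical).

```python
# Autocorrelation of a linguistic feature across occasions: Pearson r between
# candidates' values at occasion 1 and occasion 2 (simulated data).

import numpy as np

rng = np.random.default_rng(1)
time1 = rng.normal(0.52, 0.03, 78)                 # e.g., lexical density at occasion 1
time2 = 0.8 * time1 + rng.normal(0.1, 0.015, 78)   # correlated occasion-2 values

r = np.corrcoef(time1, time2)[0, 1]
print(f"r(time1, time2) = {r:.2f}")  # high r: candidates keep their rank order
```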
To investigate the variations in linguistic and discourse characteristics of scripts across different test occasions and their correlation with candidates' initial writing abilities, multilevel modeling (MLM) was utilized through the HLM6 software (Raudenbush et al., 2004).
Multilevel modeling (MLM) is a statistical approach designed for analyzing data with a nested structure (Barkaoui, 2013, 2014; Hox, 2002; Luke, 2008). It treats repeated-measures observations as nested within individual cases, differentiating between two analytical levels: level-1 observations (test occasions) nested within level-2 units (candidates). For an outcome variable, such as an index of a linguistic feature (e.g., fluency), the level-1 equation describes how the outcome changes over time for each candidate. This equation incorporates two key parameters: initial status, represented by the intercept of the candidate's trajectory, and rate of change, indicated by the slope of that trajectory over time.
MLM can also examine the shape of change in a linguistic feature (i.e., whether change is linear or non-linear) and can include parallel change processes as time-varying predictors (Luke, 2008; Preacher et al., 2008; Ross).
In MLM, predictors are categorized into time-varying (level-1) and time-invariant (level-2) variables. Time-varying predictors, such as candidate age and L2 proficiency, can change over time, while time-invariant predictors, like candidate L1 and gender, remain constant. Individuals' change trajectories can differ in both initial status (intercept) and rate of change (slope). At level 2, candidates' intercepts and slopes are treated as dependent variables, with candidate-level factors such as writing scores and gender included as predictor variables. Consequently, level-2 models assess the influences on the rate and shape of changes in outcomes, such as fluency, over time.
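The two-level growth model described above can be sketched with statsmodels' MixedLM in place of HLM6, on simulated data (all variable names and values here are hypothetical): occasion is the level-1 predictor, candidates are the level-2 units, a random slope on occasion lets the rate of change vary across candidates, and the occasion-by-group interaction is the cross-level term.

```python
# Two-level growth-model sketch using statsmodels MixedLM (simulated data;
# HLM6 was used in the study itself).

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_cand, n_occ = 78, 3
cand = np.repeat(np.arange(n_cand), n_occ)
occ = np.tile(np.arange(n_occ), n_cand)               # occasion 1 coded 0
group = np.repeat(rng.integers(0, 3, n_cand), n_occ)  # initial band group, coded 0-2

# Simulated fluency: individual intercepts and slopes around fixed effects
intercepts = 230 + 20 * group[::n_occ] + rng.normal(0, 15, n_cand)
slopes = 8 + rng.normal(0, 3, n_cand)
fluency = intercepts[cand] + slopes[cand] * occ + rng.normal(0, 10, n_cand * n_occ)

df = pd.DataFrame({"fluency": fluency, "occasion": occ,
                   "group": group, "candidate": cand})

# Level 1: occasion; level 2: group; occasion:group is the cross-level term.
model = smf.mixedlm("fluency ~ occasion * group", df,
                    groups=df["candidate"], re_formula="~occasion")
result = model.fit(reml=True)   # restricted maximum likelihood, as in the study
print(result.summary())
```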
Following Hox (2002), several MLM models were developed and evaluated for each linguistic feature separately before estimating the final model for that feature.
The first model (Model 1) was a null model used to assess how much variance in each linguistic feature lay between candidates and across test occasions. Model 2 added occasion as a level-1 predictor to estimate change over time in the feature, with the slope allowed to vary across candidates to capture individual differences in initial status and rate of change. Model 3 added a level-2 predictor, candidate group, to examine the relationship between candidate group and the linguistic characteristics of the scripts at time 1. Finally, Model 4 added the cross-level interaction between occasion and candidate group to examine the relationship between initial writing ability and change in each linguistic feature over time (research question 3). Given the relatively small sample of candidates, Restricted Maximum Likelihood (RML) was used for estimating all parameters, as recommended by Hox (2002) and Luke (2004).
In all analyses, occasion was uncentered, with occasion 1 coded 0, so that the intercept can be interpreted as the expected (average) outcome at occasion 1. Writing Task 2 score at time 1 was also uncentered (with band 4 coded 0).
Two indices were used to evaluate the models: the deviance statistic, which compares the fit of different models estimated on the same dataset, and significance tests for individual coefficients (Hox, 2002).
Based on the outcomes of these models, a final model was constructed for each linguistic feature. Section 3.2.1 details the steps and decisions made in the development and assessment of the MLM models for fluency. For brevity, the findings section reports only the results of three MLM models, Model 1, Model 2, and the final model, for the other linguistic features. In all cases, the final model is compared to Model 1 in terms of fit statistics.
Finally, to explore the relationship between the linguistic and discourse features of candidates' scripts and their Writing Task 2 scores, correlational analyses and MLM (using HLM6) were conducted. Pearson r correlations between Writing Task 2 scores and the linguistic features were computed for each test occasion. To determine whether the correlation between a given linguistic feature and writing scores differed significantly across test occasions, an interactive calculator created by Lee and Preacher (2013) was employed to test the equality of two correlation coefficients obtained from the same sample.
In addition, correlations (Pearson r) among measures assessing the same linguistic feature were examined for each test occasion to identify highly inter-correlated measures (r ≥ .70). When two or more measures were highly correlated, only one was retained for the MLM analyses, since such measures likely reflect the same construct; this also minimizes the risk of multicollinearity by including only a single index from each set of highly inter-correlated measures.
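This screening step can be sketched as follows on simulated data (the measure names and values are hypothetical): compute the Pearson correlation matrix and drop the later member of any pair with |r| at or above .70.

```python
# Screening highly inter-correlated measures (|r| >= .70) so that only one
# index per construct enters the MLM analyses (simulated data).

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
noun_overlap = rng.normal(0.4, 0.1, 78)
argument_overlap = noun_overlap + rng.normal(0, 0.03, 78)  # near-duplicate index
stem_overlap = rng.normal(0.3, 0.1, 78)                    # distinct index

df = pd.DataFrame({"noun_overlap": noun_overlap,
                   "argument_overlap": argument_overlap,
                   "stem_overlap": stem_overlap})

corr = df.corr()  # Pearson r matrix
to_drop = [col for i, col in enumerate(corr.columns)
           if any(corr.iloc[i, :i].abs() >= 0.70)]  # drop later of each high-r pair
print(to_drop)
```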
Differences in the linguistic characteristics of scripts at different band levels at test occasion 1
Table 6 presents descriptive statistics for the linguistic measures for the three candidate groups (defined by their Writing Task 2 scores at time 1) at test occasion 1. Fluency, quantified by the number of words per script, differed significantly across groups (F[2, 75] = 9.46, p < .05). Model 3, which emerged as the final model for word frequency, showed a significant effect of candidate group on word frequency at time 1; fit statistics indicated that Model 3 fit significantly better than Model 1 (χ² = 22.58, df = 2, p < .01). According to Model 3, candidates with a writing score of 4 at test occasion 1 had an average word frequency of 2.55. The decrease in word frequency across test occasions was not significant, but candidate group was significantly associated with word frequency: for each 1-band increase in initial writing score, word frequency decreased by .10. Model 3 explained 20% of the between-person variance but none of the within-person variance in word frequency.
Table 22: MLM results for word frequency
Coherence and cohesion
Table 23 presents descriptive statistics for the four cohesion and coherence measures by test occasion and candidate group, showing no significant differences among the groups. Additionally, the autocorrelations (Pearson r) in Table 24 were positive and significant over time, suggesting that candidates with higher scores on a coherence or cohesion measure at one test occasion tended to score higher on it at subsequent occasions.
Test occasion | M | SD | M | SD | M | SD | M | SD
Table 23: Descriptive statistics for cohesion and coherence measures by candidate group and test occasion
Table 24: Autocorrelations for coherence and cohesion measures
MLM analyses produced consistent results for connectives density and argument overlap for adjacent sentences: there were no significant changes across test occasions, no significant effect of candidate group at test occasion 1, no significant variation across candidates in the rate of change in the index, and no significant effect of candidate group on the rate of change in the index across test occasions.
The final model for each of the three indices is thus Model 1 in Table 25. The results show that most of the variance in connectives density (235.10, or 61%) lay within candidates, with inter-individual variability accounting for the remaining 39% (148.39) of the variance in connectives incidence. The intercept variance was significant (χ² = 225.69, df = 77, p < .01), indicating significant variation in connectives density across candidates. Model 2 estimated an average increase of .19 in connectives density per test occasion, but this change was not statistically significant, and the rate of change in connectives density did not vary significantly across candidates (χ² = 64.46, df = 77, p > .05). For argument overlap for adjacent sentences, 67% of the variance was within candidates; the small intercept variance (.01) was significant (χ² = 182.30, df = 77, p < .01), indicating notable differences across candidates. Model 2 estimated an average increase of .02 in argument overlap per test occasion, which was not statistically significant, and the rate of change in argument overlap did not vary significantly across candidates (χ² = 90.41, df = 77, p > .05).
As for mean LSA overlap for adjacent sentences, Table 25 shows that most of the variance (.004, or 77%) was within candidates. The intercept variance, though small (.002), was significant (χ² = 180.49, df = 77, p < .01). Model 3 accounted for 25% of the between-person variance in contraction ratios, but it did not explain any within-person variance.
Table 32 shows that 62% of the variance in passivisation lay between candidates. The significant intercept variance of .22 (χ² = 218.68, df = 77, p < .01) indicates that passivisation ratios varied significantly across candidates. Model 2 indicated that the ratio of passivisation increased by .11 passive constructions per 100 words, on average, on each succeeding test occasion; this increase was statistically significant. However, the rate of change in passivisation ratio did not vary significantly across candidates (χ² = 86.01, df = 77, p > .05). Model 3 indicated a significant effect of candidate group on passivisation ratio at test occasion 1.
Consequently, the final model for passivisation ratios included occasion at level 1 and candidate group at level 2
According to Model 3, candidates with a writing score of 4 had an average passivisation ratio of 0.30 passive constructions per 100 words at the first test occasion, with a significant increase of 0.11 passive constructions per 100 words on each subsequent occasion. Candidate group was also significantly associated with passivisation ratio: each one-band increase in initial writing score corresponded to an additional 0.38 passive constructions per 100 words. The rate of change in passivisation ratios did not vary significantly across candidates (χ² = 86.09, df = 77, p > .05). Fit statistics showed that Model 3 fitted the data significantly better than Model 1 (χ² = 23.45, df = 2, p < .01), explaining 57% of the between-person variance but only 6% of the within-person variance in passivisation ratio.
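In conventional two-level growth-model notation, the final model just described can be sketched as follows; the centring of occasion at 1 and of writing score at band 4 is an assumption, inferred from the reported score-4 intercept at the first test occasion:

```latex
\begin{aligned}
\text{Level 1:}\quad & Y_{ti} = \pi_{0i} + \pi_{1i}\,(\mathrm{Occasion}_{ti} - 1) + e_{ti}\\
\text{Level 2:}\quad & \pi_{0i} = \beta_{00} + \beta_{01}\,(\mathrm{Score}_{i} - 4) + r_{0i}\\
& \pi_{1i} = \beta_{10} + r_{1i}
\end{aligned}
```

with the reported estimates β₀₀ ≈ 0.30 (intercept for score-4 candidates), β₁₀ ≈ 0.11 (average per-occasion gain), and β₀₁ ≈ 0.38 (effect of each additional initial band).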
Table 32 reveals that 87% of the variance in nominalisation lay within candidates. The significant intercept variance of .64 (χ² = 111.98, df = 77, p < .01) suggests notable differences in nominalisation ratios across candidates. Model 2 estimated a slight average decrease of .06 nominalisations per 100 words on each test occasion, though this change was not statistically significant. However, the rate of change did vary significantly across candidates (χ² = 98.70, df = 77, p < .05). Model 3 additionally showed a significant effect of candidate group on nominalisation ratios at the first test occasion.
As Table 32 shows, the average nominalisation ratio for candidates with a writing score of 4 at test occasion 1 was 2.93 nominalisations per 100 words, with a non-significant decrease of .06 nominalisations per 100 words on each succeeding test occasion. The rate of change in nominalisation ratios varied significantly across candidates (χ² = 98.92, df = 77, p < .05). Candidate group was significantly associated with nominalisation ratios: each one-band increase in initial writing score corresponded to an additional 0.56 nominalisations per 100 words. Model 3 fitted the data significantly better than Model 1 (χ² = 10.77, df = 2, p < .01), accounting for 18% of the between-person variance and 12% of the within-person variance in nominalisation ratios.
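The "variance explained" figures reported for Models 2 and 3 are proportional reductions in the Model 1 variance components (a pseudo-R²). A minimal sketch for the between-person figure, using the nominalisation intercept variance of .64 from Model 1; the Model 3 value below is hypothetical, back-calculated from the reported 18%:

```python
# Proportional reduction in between-person (intercept) variance:
# R2_between = (tau_M1 - tau_M3) / tau_M1
tau_m1 = 0.64   # intercept variance, unconditional Model 1 (reported above)
tau_m3 = 0.525  # intercept variance after adding candidate group
                # (hypothetical value, implied by the reported 18%)

r2_between = (tau_m1 - tau_m3) / tau_m1
print(f"between-person variance explained: {r2_between:.0%}")  # ~18%
```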
Table 32: MLM results for register measures
Interactional metadiscourse markers
Table 33 presents descriptive statistics for interactional metadiscourse markers and their subcategories across test occasions and candidate groups. There were no significant differences in the overall ratio of interactional metadiscourse markers across candidate groups or test occasions. However, specific markers (hedges, self-mentions, and boosters) varied across groups and occasions. Candidates scoring 6 at test occasion 1 used more hedges (M = .36) and boosters (M = .11) than those scoring 4 (M = .24 hedges, .07 boosters) or 5 (M = .24 hedges, .09 boosters). Candidates scoring 5 showed a decrease in self-mentions from test occasion 1 (M = .40) to test occasion 2 (M = .22) and test occasion 3 (M = .38), while using more self-mentions than candidates scoring 6 (M = .20 and .19, respectively) at both of these occasions. Overall, candidates scoring 6 at test occasion 1 tended to use more hedges and boosters and fewer self-mentions than candidates with lower writing scores.
Table 33: Descriptive statistics for metadiscourse markers by candidate group and test occasion
Table 34 presents the autocorrelations (Pearson r) of interactional metadiscourse markers across different test occasions, revealing that most correlations are positive This suggests that candidates who frequently utilized these markers in one test are likely to continue using them in subsequent tests, with the exception of attitude markers.
Table 34: Autocorrelations for metadiscourse measures
Table 35 presents the MLM results for interactional metadiscourse markers. In Model 1, approximately 74% of the variance was attributable to differences among candidates, and the significant intercept variance (χ² = 162.74, df = 77, p < .01) indicates notable differences in the ratio of these markers across candidates. Model 2 estimated a statistically significant average decrease of .09 markers per T-unit on each test occasion, although the rate of change did not vary significantly across candidates (χ² = 60.74, df = 77, p > .05). Model 3 found no significant effect of candidate group on the ratio of interactional metadiscourse markers at the first test occasion. Model 2 was therefore retained as the final model for these ratios.
In the study, the average ratio of interactional metadiscourse markers for all candidates during the first test occasion was recorded at 1.11 markers per T-unit Notably, there was a significant decline in this ratio, decreasing by 0.09 markers per T-unit, which translates to approximately one marker lost for every ten T-units in subsequent test occasions Model 2 accounted for 4% of the variance observed within individuals but did not explain any variance between different individuals regarding the ratio of interactional metadiscourse markers.
The MLM results for boosters, attitude markers, and engagement markers mirrored those for interactional markers overall, while hedges and self-mentions showed distinct patterns. For hedges, three-quarters of the variance was attributable to differences among candidates, and a small but significant intercept variance indicated notable variation in hedge use. The average ratio of hedges decreased slightly over successive test occasions, but this change was not statistically significant; the rate of change, however, varied significantly across candidates. At the first test occasion, candidates with a score of 4 produced an average of .23 hedges per T-unit, with a minor, non-significant decrease on subsequent occasions. Candidate group was significantly related to hedge ratios, with each one-band increase in initial writing score corresponding to a rise in hedges. The final model accounted for 33% of the within-person variance, though it did not explain between-person variance in hedge use.
Table 35: MLM results for metadiscourse markers
For self-mentions, Table 35 shows that most of the variance (.05, or 63%) was within candidates. The intercept variance (.03) was small but significant (χ² = 192.45, df = 77, p < .05). Model 3 demonstrated a significant effect of candidate group on self-mention ratios at the first test occasion. According to Table 35, candidates who scored 4 at this initial test produced an average of 0.37 self-mentions per T-unit, with a non-significant average decrease of 0.02 self-mentions per T-unit on each subsequent occasion. The rate of change in self-mention ratios did not differ significantly across candidates (χ² = 72.73, df = 77, p > .05), but candidate group was significantly associated with self-mention ratios at the first test occasion.
For each increase of one band in initial writing score, self-mentions decreased by .05 self-mentions per T-unit.
Fit statistics indicated that Model 3 fitted the data significantly better than Model 1 (χ² = 6.54, df = 2, p < .05).