BRIEF REPORTS

TESOL Quarterly invites readers to submit short reports and updates on their work. These summaries may address any areas of interest to TQ readers.

Edited by
ALI SHEHADEH
United Arab Emirates University
JOHN LEVIS
Iowa State University

TESOL QUARTERLY Vol. 43, No. 2, June 2009

Diagnosing the Support Needs of Second Language Writers: Does the Time Allowance Matter?

CATHIE ELDER
University of Melbourne
Carleton, Victoria, Australia
UTE KNOCH
University of Melbourne
Carleton, Victoria, Australia
RONGHUI ZHANG
Shenzhen Polytechnic Institute
Shenzhen, China

This study investigates the impact of changing the time allowance for the writing component of a diagnostic English language assessment administered on a voluntary basis to first-year undergraduates at two universities with large populations of immigrant and international students following their admission to the university. The test is diagnostic in the sense of identifying areas where students may have difficulties and therefore benefit from targeted English language intervention concurrently with their academic studies. A change in the time allocation for the writing component of this assessment (from 55 to 30 minutes) was introduced in 2006 for practical reasons. Those responsible for implementing the assessment believed that a reduced time frame would minimize the problems associated with scheduling the test and accordingly encourage faculties to adopt the assessment tool as a means of identifying their students’ language learning needs. The current study aims to explore how the shorter time allowance would influence the validity, reliability, and overall fairness of an EAP writing assessment as a diagnostic tool.

The impetus for the study arose from anecdotal reports from test raters to the effect that, under the new time limits, students were either planning inadequately in preparation for the task or else failing to meet the word requirements. The absence of planning time was perceived to
have a negative impact on the quality of students’ written discourse. Concerns were also expressed that the limited nature of the writing sample made it difficult to provide an accurate and reliable assessment of students’ ability to cope with the writing demands of the academic situation.

As discussed in Weigle (2002), the time allowed for test administration raises issues of authenticity, validity, reliability, and practicality. Most academic writing tasks in the real world are not generally performed under time limits, and academic essays usually require reflection, planning, and multiple revisions. A writing task within a reduced time frame without access to dictionaries and source materials will inevitably be inauthentic in the sense that it fails to replicate the conditions under which academic writing is normally performed. Moreover, unless a test task is designed expressly to measure the speed at which test takers can answer the question posed, rigid time limits potentially threaten the validity of score inferences about test takers’ writing ability. The limited amount of writing produced under time pressure may also make it difficult for raters to accurately assess the writer’s competence.

On the other hand, constraints on the resources available for any assessment are inevitable. A timed essay test is certainly easier and more economical to administer, and it can be argued that even a limited sample of writing elicited under less than optimal conditions may be better than no assessment at all as a means of flagging potential writing needs. Achieving a balance between what is optimal in terms of validity, authenticity, and reliability and what is institutionally feasible is clearly important in any test situation.

Research investigating the time variable in writing assessment has produced somewhat contradictory findings, perhaps because of the different tasks, participants, contexts, and methodologies involved and also the differing time allocations
investigated. Some studies suggest that allowing more time results in improved writing performance (Biola, 1982; Crone, Wright, & Baron, 1993; Livingston, 1987; Younkin, 1986; Powers & Fowles, 1996), whereas others find that changing the time allowance makes no difference to performance as far as rater reliability and/or rank ordering of students is concerned (Caudery, 1990; Hale, 1992; Kroll, 1990). Not all studies use independent ability measures (such as test scores from a different language test) or a counterbalanced design that controls for extraneous effects such as task difficulty and order of presentation (but see Powers & Fowles, 1996). Investigative methods also differ, with most studies looking only at mean score differences across tasks without considering the validity implications of any differences in the relative standing of learners when the time variable is manipulated (but see Hale, 1992). Moreover, most studies have focused on overall scores, based on holistic scoring or performance aggregates, rather than exploring whether the time condition has a variable impact on different dimensions of performance, such as fluency and accuracy (but see Caudery, 1990). It is particularly important to consider these different dimensions when one is dealing with assessment for diagnostic purposes, where the prime function of the test score is to provide feedback to teachers and learners about future learning needs. If changing the time allocation influences the nature of information yielded about particular dimensions of writing ability, this result may have important validity implications as well as practical consequences.

THE STUDY

This study aims to establish whether altering the time conditions on an academic writing test has an effect on (a) the analytic and overall (average) scores raters assign to students’ writing performance and (b) the level of interrater reliability of the test. If scores differ according to time condition, this
result would have implications for who is identified as needing language support, and if consistent rating is harder to achieve under one or another condition, then decisions made about individual candidates’ ability cannot be relied on.

Thirty students each completed two writing tasks aimed at diagnosing their language support needs. For one of these tasks they were given a maximum of 30 minutes of writing time, and for the other they were given 55 minutes. A fully counterbalanced design was chosen to control for task version and order effect.

RESEARCH QUESTIONS

The study investigated the following research questions:

1. Do students’ scores on the various dimensions of writing ability differ between the long (55-minute) and short (30-minute) time condition?
2. Are raters’ judgments of these dimensions of writing ability equally reliable under each time condition?

METHOD

Context of the Research

The preliminary study reported in this article was conducted in the context of a diagnostic assessment administered in very similar forms at both the University of Melbourne and the University of Auckland. The assessment serves to identify the English language needs of undergraduate students following their admission to one or the other university and to guide them to the appropriate language support offered on campus. The Diagnostic English Language (Needs) Assessment or DELA/DELNA (the name of the testing procedure differs at each university) is a general rather than discipline-specific measure of academic English. The writing subtest, which is the focus of this study, is described in more detail in the Instruments section. The data for the current study were collected at the University of Auckland and analysed at the University of Melbourne.

Participants

Test Takers

The participants in the study were 30 first-year undergraduate students at the University of Auckland ranging in age from 20 to 39 years old. The group comprised 19 females and 11 males. All participants were
English as an additional language (EAL) students from a range of L1 backgrounds, broadly reflecting the diversity of the EAL student population at the University of Auckland. The majority (64%) were Chinese speakers, while other L1 backgrounds included French, Malay, Korean, German, and Hindi. The mean length of residence in New Zealand was 5.3 years.

Raters

Two experienced DELNA raters were recruited to rate the essays collected for the study. DELNA raters are regularly trained and monitored (see, e.g., Elder, Barkhuizen, Knoch, & von Randow, 2007; Elder, Knoch, Barkhuizen, & von Randow, 2005; Knoch, Read, & von Randow, 2007). Both raters had postgraduate qualifications in TESOL as well as rating experience in other assessment contexts (e.g., International English Language Testing System).

Instruments

Tasks

To achieve a counterbalanced design, two prompts were chosen for the study. The topics of the essays were as follows:

Version A: Every citizen has a duty to do some sort of voluntary work.
Version B: Should intellectually gifted children be given special assistance in schools?
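
As an illustration only, the counterbalancing of prompt version (A or B) against time limit (30 or 55 minutes) can be sketched as the enumeration of the four possible orderings of the two essays. The Python sketch below is not the authors' procedure: the names `counterbalanced_orders` and `assign_round_robin` are hypothetical, and the round-robin allocation is an assumption made purely to show how participants could be spread across the orderings.

```python
from itertools import product

PROMPT_VERSIONS = ("A", "B")   # the two essay prompts listed above
TIME_LIMITS_MIN = (30, 55)     # short and long time allowances, in minutes


def counterbalanced_orders():
    """Return every (first essay, second essay) ordering.

    Each essay is a (prompt_version, time_limit) pair. The second essay
    always uses the prompt and the time limit not used for the first one,
    so every participant meets both prompts and both time conditions.
    """
    orders = []
    for first_time, first_prompt in product(TIME_LIMITS_MIN, PROMPT_VERSIONS):
        second_prompt = "B" if first_prompt == "A" else "A"
        second_time = 55 if first_time == 30 else 30
        orders.append(((first_prompt, first_time), (second_prompt, second_time)))
    return orders


def assign_round_robin(participant_ids):
    """Hypothetical helper (an assumption, not the study's method):
    spread participants as evenly as possible over the four orderings."""
    orders = counterbalanced_orders()
    return {pid: orders[i % len(orders)] for i, pid in enumerate(participant_ids)}


if __name__ == "__main__":
    for group, (first, second) in enumerate(counterbalanced_orders(), start=1):
        print(f"Group {group}: essay 1 = {first}, essay 2 = {second}")
```

Because each prompt and each time limit appears equally often in first and second position, differences attributable to prompt difficulty or task order are balanced across the two time conditions rather than confounded with them.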
The task required students to write an argument essay of approximately 300 words in response to these questions. To help students formulate the content of the essays, students were provided with a number of brief supporting or opposing statements, although they were asked not to include the exact wording of these statements in their essays.

To ascertain that the two prompts used were of similar difficulty, overall ratings were compared across the 60 essays. An independent samples t test showed that the two prompts were statistically equivalent with respect to the challenge they presented to test takers, t(58) = 0.415, p = 0.680.

Rating Scale

The rating scale used was an analytic rating scale with three rating categories (fluency, content, and form) rated on six band levels ranging from 1 to 6, where a score of or less indicates a need for English language support. Raters were asked to produce ratings for each of the three categories. These ratings were also averaged to produce an overall score.

Procedures

Data Collection

To obtain an independent measure of the students’ language ability, the students first completed a screening test comprising a vocabulary and speed-reading task (Elder & Von Randow, in press). Based on these scores, the students were divided into four groups of more or less equal ability. Then, to control for prompt and order effect, a fully counterbalanced design was used as outlined in Table 1.

TABLE 1
Research Design

Group   N   Essay 1 version   Essay 1 time limit   Essay 2 version   Essay 2 time limit
1           A                 30 minutes           B                 55 minutes
2           B                 30 minutes           A                 55 minutes
3           A                 55 minutes           B                 30 minutes
4           B                 55 minutes           A                 30 minutes

The writing scripts were presented in random order to the raters, who were given no information about the condition under which the writing was produced, so as to eliminate the possibility of their taking the time allowance into account when assigning the scores. Raters have been found in other studies (e.g., McNamara & Lumley, 1997) to compensate candidates for task
conditions which they feel may have disadvantaged them.

Data Analysis

The scores produced by the two raters were entered into SPSS (2006). T tests and correlational analyses were used to answer the two research questions.

RESULTS

Research Question 1

Do students’ scores on the various dimensions of writing ability differ between the long (55-minute) and short (30-minute) time condition?

Two different types of analyses were used to explore variation in students’ scores under the two time conditions. First, mean scores obtained under each condition were compared (see Table 2). The means for form and fluency were almost identical in each time condition, whereas for content, the long writing task elicited ratings almost half a band higher than those allocated to the short one. Although mean scores for each of the analytic criteria were consistently higher in the 55-minute condition, a paired samples t test (Table 2) showed that none of these mean differences was statistically significant.

TABLE 2
Paired Samples t Tests

Variable                 Mean–short   SD–short   Mean–long   SD–long   t       df   p
Average fluency rating   4.13         0.73       4.15        0.79      0.128   29   0.899
Average content rating   4.18         0.79       4.40        0.86      1.58    29   0.125
Average form rating      3.90         0.78       4.02        0.80      1.07    29   0.293
Average total rating     4.07         0.71       4.19        0.76      1.14    29   0.262

Note. SD = standard deviation.

Second, a Spearman rho correlation was used to ascertain if the ranking of the candidates was different under the two time conditions. Table 3 presents the correlations for the fluency, content, and form scores under the two conditions as well as a correlation for the averaged, overall score.

TABLE 3
Correlations of Scores Under Short and Long Condition

Correlation            r
Fluency (short*long)   0.544
Content (short*long)   0.636
Form (short*long)      0.726
Overall (short*long)   0.735

Note. All results significant at 0.01 level (2-tailed).

Although the correlations in Table 3 are all significant, they vary somewhat in strength. The average scores for writing produced
under the short and long time condition correlate more strongly than the analytic scores assigned to particular writing features. The correlations are lowest for the fluency criterion, although a Fisher r-to-z transformation indicates that the size of this coefficient does not differ significantly from the others.

Research Question 2

Are raters’ judgments of writing ability equally reliable under each time allocation?

It was of further interest to determine if there were any differences in the reliability of rater judgments under the two time conditions. Table 4 presents the correlations between the two raters under the two time conditions. Although the correlation coefficients for the short and long conditions were not significantly different from one another, Table 4 shows that correlations were consistently higher for the short time condition.

TABLE 4
Rater Correlations

30 minutes       55 minutes
Fluency: 0.787   Fluency: 0.755
Content: 0.804   Content: 0.763
Form: 0.836      Form: 0.736
Total: 0.931     Total: 0.891

Note. All results significant at 0.01 level (2-tailed).

DISCUSSION

The current study’s purpose was to determine both the validity and practical implications of reducing the time allocation for the DELA/DELNA writing test from 55 to 30 minutes. Mean score comparisons showed that students performed very similarly across the two task conditions. Although this result accords with those of writing researchers such as Kroll (1990), Caudery (1990), and Powers and Fowles (1996), it is somewhat at odds with Biola (1982), Crone et al. (1993), and Younkin (1986), who showed that students performed significantly better when more time was given for their writing. However, as already suggested in our review of the literature, the differences between these studies’ findings may be partly a function of sample size.

Worthy of note in our study is the greater discrepancy in means for content between the long and short writing conditions. The fact that test takers scored marginally higher on
this category under the 55-minute condition is unsurprising, given that it affords more time for test takers to generate ideas on the given topic. In general, however, the practical impact of the score differences observable from this study is likely to be negligible. One might argue that shortening the task will produce slightly depressed means for the undergraduate population as a whole, with the result that a marginally higher proportion of students receive a recommendation of “needs support.” However, this is hardly of a magnitude that would create significant strain on institutional resources and is, in any case, potentially of benefit in terms of ensuring that a larger number of borderline students are flagged, thereby gaining access to language support classes.

More important is the question of whether the writing construct changes when the time allocation decreases, because this has implications for the validity of inferences drawn from test scores. The cross-test correlational statistics are not strong for any of the rating criteria, and this is particularly true for fluency, implying that opportunities to display coherence and other aspects of writing fluency may differ under the two time conditions. These construct differences have potential implications for EAP support teachers who may use DELA/DELNA writing profiles to determine how to focus their interventions. It cannot, however, be assumed that the writing produced in the short time condition is a less valid indicator of candidates’ academic writing ability than writing produced within the long time frame.

As for interrater reliability, the findings of this study revealed (as in the Hale, 1992, study) that scoring consistency was acceptable and comparable across the two time conditions. In fact, the data reported here suggest that alignment between raters increases slightly in the short writing condition on each of the writing criteria. Because this finding is not statistically significant, it is not appropriate
to speculate further about possible reasons for this outcome, but this issue is certainly worth exploring further with a larger data set. In the meantime we can conclude that shortening the writing task presents no disadvantage as far as reliability of rating is concerned.

The issue investigated in this small-scale preliminary study certainly calls for further investigation, both with a larger sample and using methods not yet applied in research on the impact of timing on writing performance. Procedures such as think-aloud verbal reports and discourse analysis could be used to get a better sense of any construct differences resulting from the time variable than can be gleaned from a quantitative analysis. If writing produced under the 55-minute condition were found to show more of the known and researched characteristics of academic discourse than that produced within the 30-minute condition, this result would have important validity implications with regard to the diagnostic capacity of the procedure and its usefulness for students, teaching staff, and other stakeholders.

A further issue, which is the subject of a subsequent investigation, is how test takers feel about doing the writing task under more stringent time conditions. Although we have shown that enforcing more stringent time conditions does not make a significant difference to test scores, it may be perceived as unfair, making it less likely that students will take their results seriously and act on the advice given. However, we would caution that any decision based on these results will, as is the case with any language testing endeavor, involve a trade-off between what is feasible and what is desirable in the context of concern.

ACKNOWLEDGMENTS

The authors thank Martin von Randow for assistance with aspects of the study design and Janet von Randow and Jeremy Dumble for their efforts in administering the test tasks and recruiting participants and raters for this study.

THE AUTHORS

Cathie Elder is director of
the Language Testing Research Centre at the University of Melbourne, in Carleton, Victoria, Australia. Her major research efforts and output have been in the area of language assessment. She has a particular interest in issues of fairness and bias in language testing and in the challenges posed by the assessment of language proficiency for specific professional and academic purposes.

Ute Knoch is a research fellow at the Language Testing Research Centre, University of Melbourne, in Carleton, Victoria, Australia. Her research interests are in the areas of writing assessment, rating scale development, rater training, and assessing languages for specific purposes.

Ronghui Zhang is a lecturer in the Department of Foreign Languages at Shenzhen Polytechnic Institute, Shenzhen, China. Her research interests are in the area of foreign language pedagogy and writing assessment.

REFERENCES

Biola, H. R. (1982). Time limits and topic assignments for essay tests. Research in the Teaching of English, 16, 97–98.
Caudery, T. (1990). The validity of timed essay tests in the assessment of writing skills. ELT Journal, 44, 122–131.
Crone, C., Wright, D., & Baron, P. (1993). Performance of examinees for whom English is their second language on the spring 1992 SAT II: Writing Test. Unpublished manuscript prepared for ETS, Princeton, NJ.
Elder, C., Barkhuizen, G., Knoch, U., & von Randow, J. (2007). Evaluating rater responses to an online rater training program. Language Testing, 24, 37–64.
Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback to enhance rater training: Does it work? Language Assessment Quarterly, 2, 175–196.
Elder, C., & Von Randow, J. (in press). Exploring the utility of a Web-based English language screening tool. Language Assessment Quarterly.
Ellis, R. (Ed.).
(2005). Planning and task performance in a second language. Oxford: Oxford University Press.
Hale, G. (1992). Effects of amount of time allocated on the Test of Written English (Research Report No. 92-27). Princeton, NJ: Educational Testing Service.
Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: How does it compare with face-to-face training? Assessing Writing, 12, 26–43.
Kroll, B. (1990). What does time buy? ESL student performance on home versus class compositions. In B. Kroll (Ed.), Second language writing: Research insights for the classroom. Cambridge: Cambridge University Press.
Livingston, S. A. (1987, April). The effects of time limits on the quality of student-written essays. Paper presented at the meeting of the American Educational Research Association, Washington, D.C., United States.
McNamara, T., & Lumley, T. (1997). The effect of interlocutor and assessment mode variables in overseas assessments of speaking skills in occupational settings. Language Testing, 14, 140–156.
Powers, D. E., & Fowles, M. E. (1996). Effects of applying different time limits to a proposed GRE writing test. Journal of Educational Measurement, 33, 433–452.
SPSS, Inc. (2006). SPSS (Version 15) [Computer software]. Chicago: Author.
Weigle, S. C. (2002). Assessing Writing. Cambridge: Cambridge University Press.
Younkin, W. F. (1986). Speededness as a source of test bias for non-native English speakers on the College level Academic Skills Test. Dissertation Abstracts International, 47/11-A, 4072.

Effect of Repetition of Exposure and Proficiency Level in L2 Listening Tests

HIDEKI SAKAI
Shinshu University
Nagano, Japan

Second language (L2) listening test developers must take into account a variety of factors such as the characteristics of the input, the task, and