Technology in testing: the present and the future

J. Charles Alderson
Department of Linguistics and Modern English Language, Bowland College, Lancaster University, Lancaster LA1 4YT, UK
E-mail address: c.alderson@lancaster.ac.uk (J.C. Alderson)
System 28 (2000) 593–603

Received 10 January 2000; received in revised form 9 February 2000; accepted 10 February 2000

Abstract

As developments in information technology have moved apace, and both hardware and software have become more powerful and cheaper, the long-prophesied use of IT for language testing is finally coming about. The Test of English as a Foreign Language (TOEFL) is mounted on computer, CD-ROM-based versions of University of Cambridge Local Examinations Syndicate tests are available, and the Internet is beginning to be used to deliver language tests. This paper reviews the advantages and disadvantages of computer-based language tests, explores in detail developments in Internet-based testing using the examples of TOEFL and DIALANG, an innovative on-line suite of diagnostic tests and self-assessment procedures in 14 European languages, and outlines a research agenda for the next decade. © 2000 Elsevier Science Ltd. All rights reserved.

Keywords: Information technology; Computer-based language tests; Internet; Self-assessment

1. Uses of IT in testing

Computers are beginning to be used to deliver language tests in many settings. A computer-based version of the Test of English as a Foreign Language (TOEFL) was introduced on a regional basis in the summer of 1998. More and more tests are available on CD-ROM, and both intranets and the Internet are beginning to be used to deliver tests to users at a distance. For example, the English as a Second Language Placement Examination at UCLA is currently being adapted for delivery over the Web, and Internet-based tests of Chinese, Japanese and Korean are also being developed. Fulcher (1999) reports the use of the Internet to deliver a placement test to students and describes a pilot study investigating potential bias against students who lack computer familiarity or have negative attitudes towards technology. The study also estimates the usefulness of the test as a placement instrument by comparing the accuracy of placement with a pencil-and-paper form of the test.

2. Disadvantages of computer-based testing

There are, of course, dangers in such innovations. Many commentators have remarked that computer-based tests (CBTs) are currently limited in the item types that they allow. The multiple-choice technique is ubiquitous on computer, and indeed has enjoyed something of a resurgence in language testing, where it had previously begun to fall into disfavour (see Alderson, 1986, for an early reference to this problem, and Alderson, 1996, for a more recent discussion). Similarly, cloze and gap-filling techniques are frequently used where other techniques might be more appropriate but are much harder to implement in a setting where responses must be machine-scorable.

Fairly obviously, a degree of computer literacy is required if users are not to be disadvantaged by CBTs relative to paper-and-pencil tests. The ability to use a mouse and a keyboard is an obvious minimal requirement.
Reading text from screen is not the same thing as reading from print, and the need to move to and fro through screens is much more limiting than being able to flick back and forth through print. It is noteworthy that the Educational Testing Service (ETS) conducted a large-scale study of computer literacy in the TOEFL-taking population and of the effect of computer literacy on TOEFL performance (Kirsch et al., 1998; Taylor et al., 1998). Although the study found no difference in TOEFL performance between those who were familiar with computers and those who were not, a significant 16% of the TOEFL population had negligible computer familiarity. ETS has therefore devised a tutorial which all CBT TOEFL takers must complete before they take the CBT TOEFL for real, in an effort to remove any possible suggestion of bias against those who are not computer literate. This tutorial is available on sampler CDs which demonstrate the nature of the new CBT TOEFL, and it is also mandatory, though untimed, in all administrations of CBT TOEFL. The tutorial not only familiarises candidates with the various test techniques used in CBT TOEFL and allows practice in responding to such items; it also gives instruction in how to use a mouse to select, point and scroll text, and how to respond to different item types. There is also an untimed Writing tutorial for those who plan to type their essay (CBT TOEFL makes the formerly optional Test of Written English compulsory, although candidates have the option of writing their essay by hand instead of word-processing it). Whether such efforts will be necessary in the future as more and more users become computer literate is hard to predict, but currently there is considerable concern about the effect of the lack of such literacy on test performance.

Perhaps most importantly, we appear at present to be limited in which language skills can be tested on a computer. The highly valued productive skills of speaking and writing can barely be assessed in any meaningful way right now, although, as we shall see later in this article, developments are moving apace, and it may be sooner than some commentators have expected that we will also be able to test the ability to respond to open-ended productive tasks in meaningful ways.

Despite these drawbacks, there is no doubt that the use of computers in language testing will grow massively in the next few years, and that must in part be because computer-mounted tests offer significant advantages to users and deliverers.

3. Technical advantages of computer-based testing

One obvious advantage of computer-based testing is that it removes the need for the fixed delivery dates and locations normally required by traditional paper-and-pencil testing. Group administrations are unnecessary, and users can take the test at a time (although not necessarily at a place) of their own choosing. CBTs can be available at any time of the day or night, thus freeing users from the constraints of test administration, or even of group administration: users can take the test on their own, rather than being herded into a large room at somebody else's place and convenience.

Another advantage is that results can be available immediately after the test, unlike paper-and-pencil tests, which require time to be collected, marked and reported. As we shall see shortly, this has potential pedagogic advantages, as well as being of obvious benefit to users (receiving institutions, as well as candidates).
Whilst diskette- and CD-ROM-based tests clearly have such advantages, tests delivered over the Internet are even more flexible in this regard: delivery or purchase of disks is not required, and anybody with access to the Internet can take a test. Diskettes and CD-ROMs are fixed in format: somebody has to decide what goes onto the disk, and once it has been pressed and distributed it cannot normally be updated. With tests delivered over the Internet, however, it is possible to allow access to a much larger database of items, which can itself be updated constantly. Indeed, using the Internet, items can be piloted alongside live test items, calibrated on the fly, and then turned into live items as soon as the calibration is complete and the item parameters are known. Use of the Internet also means that results can be sent immediately to designated score users, which is not possible with diskette- and CD-ROM-based tests.

CBTs can make use of specially designed templates for item construction, and indeed some companies market special software to allow test developers to construct tailor-made tests (Questionmark, e.g. at http://www.qmark.com/). Software like Authorware (copyright 1993, Macromedia Inc.) can also easily be used to facilitate test development, without the need for recourse to proprietary testing software. The Internet can also be used for item authoring and reviewing: item writers can input their items into specially prepared templates and review them on screen, and the items can then be stored in a database for later review, editing and piloting. This has the added advantage of allowing test developers to employ any item writer who has access to the Internet, regardless of geographical location, an asset that has proved invaluable in international projects like DIALANG (see below).

CBTs, and especially Internet-delivered tests, can access large databases of items. This means that test security can be greatly enhanced, since tests can be created by randomly accessing items in the database and producing different combinations of items. Any one individual is thus exposed to only a tiny fraction of the available items, and any compromise of items that might occur will have negligible effect. At present there is a degree of risk in delivering high-stakes tests over the Internet (which is why ETS has not yet made the TOEFL available over the Internet, but instead requires users to travel to specially designated test centres). Not only might hackers break into the database and compromise items; the difficulties of payment for registration, and the risk of impersonation, are considerable. However, for tests which are deliberately low stakes, the security risks matter less, as there is little incentive for users to cheat or otherwise fool the system.

4. Computer-adaptive testing

The most frequently mentioned advantage of computer-based testing in the literature is the use of computer adaptivity. In computer-adaptive tests, the computer estimates the user's ability level on the fly. Once it has reached a rough estimate of the candidate's ability, the computer can select the next item to be presented so as to match that emerging ability level: if the user gets an item right, they can be presented with a more difficult item, and if they get it wrong, they can be given an easier one. In this way, users are presented with items as close as possible to their ability level.
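The selection logic behind such adaptivity can be sketched very briefly. The fragment below is a minimal illustration rather than any operational system's algorithm: it assumes an item bank precalibrated under a one-parameter (Rasch) IRT model, uses a deliberately crude step-wise ability update, and simply offers the unused item whose difficulty lies closest to the current estimate. All names and values are invented for illustration.

```python
import math
import random

# Hypothetical item bank: each item carries a difficulty (in logits) assumed to
# come from prior calibration. IDs and values here are invented.
ITEM_BANK = [{"id": i, "difficulty": random.uniform(-3.0, 3.0)} for i in range(200)]

def prob_correct(ability, difficulty):
    """Rasch (one-parameter IRT) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def next_item(ability, administered):
    """Choose the unused item whose difficulty is closest to the current estimate."""
    remaining = [item for item in ITEM_BANK if item["id"] not in administered]
    return min(remaining, key=lambda item: abs(item["difficulty"] - ability))

def update_ability(ability, correct, step=0.7):
    """Crude step-wise update: move the estimate up after a correct response and
    down after an incorrect one (operational systems use maximum-likelihood or
    Bayesian estimation instead)."""
    return ability + step if correct else ability - step

def adaptive_test(get_response, max_items=20):
    """Administer max_items items, adapting difficulty to the emerging estimate."""
    ability, administered = 0.0, set()
    for _ in range(max_items):
        item = next_item(ability, administered)
        administered.add(item["id"])
        correct = get_response(item)   # True/False from the candidate
        ability = update_ability(ability, correct)
    return ability

# Simulated candidate of ability 1.2 logits, for demonstration only.
estimate = adaptive_test(lambda item: random.random() < prob_correct(1.2, item["difficulty"]))
print(round(estimate, 2))
```

Operational adaptive engines replace the step-wise update with maximum-likelihood or Bayesian estimation, and add constraints for content balance and item exposure, issues taken up in the research agenda below.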
Adaptivity makes tests somewhat more efficient, since only items close to the ability of the user are presented. It also means that test security can be enhanced, since different users take essentially different tests. It does, however, require large banks of precalibrated items, and evidence that any one test is equivalent, at least in measurement terms, to any other test that might have been taken by a user of the same ability. For a useful review of computer-adaptive testing, see Chalhoub-Deville and Deville (1999).

5. Pedagogic advantages of CBTs

Computer-adaptive tests are often argued to be more user-friendly, in that they avoid presenting users with frustratingly difficult or easy items. They might thus be argued to be more pedagogically appropriate than fixed-format tests. Indeed, I would go further and argue that the use of computers in general to deliver tests has significant pedagogic advantages and can encourage the incorporation and integration of testing and assessment directly into the teaching and learning process. I also believe that computer-based testing will encourage and facilitate an increased use of tests for low-stakes purposes like diagnosis.

Arguably the major pedagogic advantage of delivering tests by computer is that they can be made more user-friendly than traditional paper-and-pencil tests. In particular, they offer the possibility of giving users immediate feedback once a response has been made. The test designer can allow candidates to receive feedback after each item has been responded to, or this can be delayed until the end of the subtest, the whole test, or even later, although it is unlikely that anybody would wish to delay feedback beyond the immediate end of the test. It is generally thought that feedback given immediately after an activity has been completed is likely to be more meaningful, and to have more impact, than feedback which is substantially delayed. Certainly, in traditional paper-and-pencil tests, feedback (the test results) can be delayed for up to several months, at which point candidates are unlikely to remember what their responses were, and thus the feedback is likely to be much less meaningful.

This is not a problem in settings where the only feedback candidates are interested in is whether they have passed or failed the test, but even then candidates are likely to appreciate knowing how their performance could have been better, where their strengths and weaknesses lie, and what they might do better next time. Interestingly, UK examining boards are currently considering how best to allow candidates to review their own test papers once they have been marked, presumably in the interests of allowing learners to learn from their marked performance. CBTs can offer this facility with ease.

If feedback is given immediately after an item has been attempted, the possibility exists of allowing users to make a second attempt at the item in the light of that feedback, with or without penalties for doing so. One interesting question then arises: if the user gets the item right the second time, which is the true measure of ability, the performance before or after the feedback? One might argue that the second performance is a better indication, since it results from users having learned something from their first performance and is thus closer to current ability. Clearly, research into this area is needed: providing immediate feedback and allowing second attempts raises interesting research and theoretical questions.
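One way to keep that question open empirically is to record both measures rather than choosing between them. The sketch below is a hypothetical scoring policy, not a description of any existing system: the pre-feedback response is scored conventionally, a single retry after feedback earns partial credit, and both figures are reported side by side.

```python
def score_item(first_correct, second_correct=None, retry_penalty=0.5):
    """Return (pre_feedback, post_feedback) scores for a single item.

    first_correct  : result of the initial attempt (True/False)
    second_correct : result of an optional second attempt made after feedback,
                     or None if no retry was offered or taken
    retry_penalty  : partial credit awarded when only the retry succeeds
    """
    pre = 1.0 if first_correct else 0.0
    if first_correct or second_correct is None:
        post = pre
    else:
        post = retry_penalty if second_correct else 0.0
    return pre, post

# Wrong at first, right after feedback: both indications are reported.
print(score_item(False, True))   # (0.0, 0.5)
```

Which of the two figures, or what combination of them, best represents ability is precisely the empirical question raised above.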
Computers can also be user-friendly in offering a range of support to test takers. On-line Help facilities are, of course, common, and could be accessed during test performance in order to clarify instructions, for example, or to allow users to see an example of what they are supposed to do. In addition, clues as to an appropriate performance could be made available; the only limit on such clues would appear to be our ability to devise clues which are helpful without directly revealing the answer or the expected performance. On-line dictionaries can also be made available, with the advantage of being tailor-made to the text and test being taken, rather than being all-purpose dictionaries of the paper-based sort. Not only can such support be made available: its use can be monitored and taken into account in deriving test results or test scores. It is perfectly straightforward for the computer to adjust scores on an item for the use of support, or to report the use of support alongside the test score. The challenge to testers and applied linguists is to decide how this should be reported, and how it affects the meaning of the scores (one possible arrangement is sketched at the end of this section).

In a similar vein, users can be asked how confident they are that the answer they have given is correct, and this confidence rating can be used to adjust the test score, or to help interpret results on particular items (users might unexpectedly get difficult items right, and the associated confidence rating might give insight into guessing or partial knowledge). An obvious extension of this principle of asking users to give insights into their ability is the use of self-assessment: users can be asked to estimate their ability, and they can then be tested on that ability, with the possibility, via the computer, of an immediate comparison between the self-assessment and the actual performance. Such self-assessment can be in general terms, for example about one's ability in the skill or area being tested, or it can be very specific to the text or task about to be attempted. Users can then be encouraged to reflect upon possible reasons why their self-assessed performance on a task did or did not match their actual performance.

I have already discussed one aspect of user-friendliness allowed by CBTs, adaptivity, which allows for an extensive amount of tailoring of the test to the user. Such adaptivity need not, however, be merely psychometrically driven. It is possible to conceive of a test in which the user is given the choice of taking easier or more difficult items, especially in a context where the user is given immediate feedback on their performance. Indeed, the CBT TOEFL already allows users to see an estimated test result immediately after taking the test, and to decide whether they wish their score to be reported to potential receiving institutions, or whether they would prefer the score not to be reported until they have improved it.

This notion of users driving the test can be extended. For example, assuming large banks of items, users can be allowed to choose which skill they wish to be tested on, or the level of difficulty at which they take a test. They could also be allowed to choose the language in which they see test rubrics and examples, and to request the language in which results are presented. They could be allowed to make self-assessments in their own language rather than in the target language, and to get detailed feedback on their results in a language other than the target language. Such learner-centredness already exists in the DIALANG testing system (see below).
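To make the logging and adjustment ideas in this section concrete, the fragment below sketches one possible way of recording support use and a confidence rating with each response and of deriving an adjusted score alongside the raw one. The penalty weights and the confidence weighting are illustrative assumptions only, not recommendations, and all names are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative deductions for each kind of support consulted on an item.
SUPPORT_COST = {"help": 0.1, "clue": 0.3, "dictionary": 0.2}

@dataclass
class ItemRecord:
    correct: bool
    confidence: float                       # candidate's own rating, 0.0 to 1.0
    support_used: List[str] = field(default_factory=list)

    def raw_score(self) -> float:
        return 1.0 if self.correct else 0.0

    def adjusted_score(self) -> float:
        """Deduct a little for support consulted and weight by confidence, so a
        low-confidence correct answer (possibly a guess) counts for less."""
        score = self.raw_score() - sum(SUPPORT_COST.get(s, 0.0) for s in self.support_used)
        return max(0.0, score) * (0.5 + 0.5 * self.confidence)

def report(records: List[ItemRecord]) -> Dict[str, float]:
    """Report raw and adjusted totals side by side, together with support use."""
    return {
        "raw": sum(r.raw_score() for r in records),
        "adjusted": round(sum(r.adjusted_score() for r in records), 2),
        "support_events": sum(len(r.support_used) for r in records),
    }

# Two items: a confident unaided success and a hesitant success after a clue.
print(report([ItemRecord(True, 0.9), ItemRecord(True, 0.4, ["clue"])]))
```

How such adjusted figures should be reported, and what they do to the meaning of the scores, remains the open question identified above.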
6. State of the art

Given this range of possibly innovative features in CBTs, what is the current state of the art? At present, in fact, there is little evidence of innovation. True, there are many applications of adaptive tests, but the test methods used are largely indistinguishable from those used in paper-and-pencil testing, and no evidence has been gathered and presented that testifies to any change in the constructs being measured, or to increased construct validity. Certainly, there is discussion of the increased practicality of the shortened testing time, for example, but no attempt has been made to show either that candidates find this an advantage or that it represents any added value beyond strictly administrative convenience. Indeed, one of the problems of computer-adaptive tests is that users cannot change their minds about what an appropriate response might be once they have been assigned another item. This would seem to be a backward step, and to reduce the validity of measurement. Allowing second thoughts, after all, might well be thought to tap reflective language use, if not spontaneous response.

ETS, the developer of TOEFL, claims, however, that CBT TOEFL does contain innovative features. The listening section, for example, now uses photos and graphics "to create context and support the content of the mini-lectures, producing stimuli that more closely approximate 'real world' situations in which people do more than just listen to voices" (ETS, 1998). But no evidence is presented that this has any impact on candidates or on validity, and in any case many such visuals could easily have been added to the paper-and-pencil TOEFL. However, some innovations in test method are noteworthy. Although the traditional four-option multiple choice predominates, one test method requires candidates to select a visual or part of a visual. In some questions candidates must select two choices, usually out of four, and in others candidates are asked to match or order objects or texts. In addition, in CBT TOEFL, candidates wear headphones, can adjust the volume, and are allowed to control how soon the next question is presented. Moreover, candidates see and hear the test questions before the response options appear: it might be argued that this encourages candidates to construct their own answer before being distracted by irrelevant options, but no studies of this possibility are reported. The Structure section retains the same two item types used in the paper-and-pencil TOEFL, but the Reading section not only uses the traditional multiple choice; it may also require candidates to select a word, phrase, sentence or paragraph in the text itself, and other questions ask candidates to insert a sentence where it fits best. Although these techniques have been used elsewhere in paper-and-pencil tests, one advantage of their computer format is that the candidate can see the result of their choice in context before making a final decision. This may have implications for improved validity, as might the other item types, but again this remains to be demonstrated.
In short, although the CBT TOEFL is computer-adaptive in the Listening and Structure sections, there is little that appears to play to the computer's strengths and possibilities as I have described them. This is not to say that such innovations may not appear later in the TOEFL 2000 project. Indeed, Bennett (1998) claims that the best way to innovate in computer-based testing is first to mount on computer what can already be done in paper-and-pencil format, with possible minor improvements allowed by the medium, in order to ensure that the basic software works well, before innovating in test method and construct. Once the delivery mechanisms work, it is argued, computer-based deliveries can then be developed that incorporate desirable innovations.

A European-Union-funded computer-based diagnostic testing project, DIALANG, has adopted this cautious approach to innovation in test method. At the same time, it has incorporated major innovations in aspects of test design other than test method, which are deliberately intended to experiment with and to demonstrate the possibilities of the medium. Moreover, unlike CBT TOEFL, DIALANG is available over the Internet, and thus capitalises on the advantages of Internet-based delivery referred to earlier. DIALANG incorporates self-assessment as an integral part of diagnosis; it provides immediate feedback, not only on scores but also on the relationship between test results and self-assessment; and it provides for extensive explanatory and advisory feedback to users. The language of administration, of self-assessment and, eventually, of feedback is chosen by the test user from a list of 14 European languages, and users can decide which skill they wish to be tested in, in any one of the 14 languages. Although current test methods consist only of various forms of multiple-choice, gap-filling and short-answer questions, DIALANG has already developed demonstrations of 18 different experimental item types which can be implemented in the future, and it demonstrates the use of help, clue, dictionary and multiple-attempt features, as well as the option to have or to delay immediate feedback, which, I have argued above, makes computer-based testing not only more user-friendly but also more compatible with language pedagogy.

DIALANG currently suffers from the limitations of IT in assessing learners' productive language abilities. However, the experimental item types include an ingenious combination of self-assessment and benchmarking which holds considerable promise. Tasks for the elicitation of speaking and writing performances have been developed and administered to learners (in this case, of Finnish for the writing task and of English for the speaking task). Performances are rated by human judges, and those performances on which raters agree most closely are selected as 'benchmarks'. A DIALANG user is presented with the same task and, in the case of Writing, responds to it via the keyboard. The user's performance is then presented on screen together with the pre-rated benchmarks. The user can view performances at the different DIALANG/Council of Europe levels and compare their own performance with the benchmarks. In addition, since the benchmarks are pre-analysed, the user can choose to see raters' comments on various features of the benchmarks, in hypertext form, and consider whether they could produce features of similar quality.
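The benchmarking arrangement lends itself to a very simple representation. The sketch below is only an illustration of the kind of structure such a system might use; the levels, sample texts and rater comments are invented, not taken from DIALANG.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Benchmark:
    level: str                      # an agreed Council of Europe level, e.g. "B1"
    text: str                       # the benchmarked writing performance
    rater_comments: Dict[str, str]  # feature -> rater comment, shown as hypertext

# Invented examples, for illustration only.
BENCHMARKS: List[Benchmark] = [
    Benchmark("A2", "Dear Sir, I want ask about the course...",
              {"range": "very limited vocabulary", "accuracy": "frequent basic errors"}),
    Benchmark("B2", "I am writing to enquire about the course you advertise...",
              {"range": "good range of structures", "accuracy": "occasional slips only"}),
]

def present_for_comparison(learner_text: str, level: str) -> Dict[str, object]:
    """Show the learner's own response next to benchmarks at a chosen level,
    so the learner can compare the two and then open the raters' comments."""
    matches = [b for b in BENCHMARKS if b.level == level]
    return {"your_response": learner_text, "benchmarks": matches}

print(present_for_comparison("Dear Madam, I would like to ask about...", "B2"))
```

Storing the raters' comments against named features is what makes the hypertext display described above straightforward to generate.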
In the case of Speaking, the candidate is simply asked to imagine how they would respond to the task, rather than actually to record their performance. They are then presented with recorded benchmark performances and are asked to estimate whether they could do better or worse than each performance. Since the performances are graded, once candidates have assessed themselves against a number of performances, the system can tell them roughly what level their own (imagined) performance is likely to be. Whilst this system is clearly entirely dependent on the user's willingness to play the game, the only person cheated if they do not do so is the user. This is so because DIALANG is intended to be low stakes and diagnostic.

Indeed, it is precisely because DIALANG is used for diagnosis, rather than for high-stakes admissions or employment decisions, that it is possible to be so innovative. I believe that DIALANG, and low-stakes assessment in general, offer many possibilities for exciting innovation in computer-based testing.

Other interesting developments include e-rater and PhonePass, both proprietary names. Automated testing of productive language abilities has so far proved difficult, as we have seen. However, e-rater has been developed by ETS in an attempt to build 'intelligent' IT-based systems that will arrive at the same conclusions about users' writing ability as do human raters. In essence, e-rater is trained by being fed samples of open-ended essays that have previously been scored by human raters, and it uses natural language processing techniques to duplicate the performance of those raters. At present, the system is working operationally on GMAT (Graduate Management Admissions Test) essays. E-rater research is ongoing for other programmes, such as the GRE (Graduate Record Examinations), and a challenge has been issued to the language education and assessment community to produce essays that will 'fool' the system by inducing it to give scores that are too high or too low compared with those given by humans. Adapting the system to second/foreign language testing situations will doubtless raise interesting problems, but current progress is extremely promising. For more information on this project, contact Don Powers, Jill Burstein or Karen Kukich at ETS (dpowers@ets.org, jburstein@ets.org, kkukich@ets.org). Further information on e-rater is available at http://www.ets.org/research/erater.html.

A second project, using IT to assess aspects of the speaking ability of second/foreign language learners of English, has already been developed and is being extensively tested and researched. Users of PhonePass are given sample tasks in advance, and then have to respond to similar tasks over the telephone in 'interaction' with a computer. Tasks currently include reading aloud, repeating sentences, saying opposite words, and giving short answers to questions. PhonePass returns a score that reflects a candidate's ability to understand and respond appropriately to decontextualised spoken material, with 40% of the evaluation reflecting the fluency and pronunciation of the responses. The system uses speech recognition technology to rate responses, comparing candidate performance with statistical models of native and non-native performance on the tasks.
From these comparisons, subscores for fluency and pronunciation are derived, and these are combined with IRT-based (Item Response Theory) scores derived from right/wrong judgements based on the exact words recognised in the spoken responses. Studies have shown the test to exhibit a reliability coefficient of 0.91, a correlation with the ETS Test of Spoken English of 0.88, and a correlation with an ILR (Interagency Language Round Table) Oral Proficiency Interview of 0.77 (Bernstein, personal communication, January 2000). These are encouraging results.

The scored sample is retained in a database, classified according to the various scores assigned. Interestingly, this enables potential employers or other users to access the same speech sample, in order to make their own judgements about the performance for their specific purposes, and to compare how their candidate has performed with other speech samples that have been rated the same, higher or lower. Like DIALANG, this system thus incorporates elements of objective assessment with an opportunity for user assessment. Those interested in learning more about the system can contact PhonePass at www.ordinate.com.

7. Conclusions: the need for a research agenda

One of the claimed advantages of computer-based assessment is that computers can store enormous amounts of potential research data, including every keystroke made by candidates and the sequence in which they were made, the time taken to respond to a task, the correctness of the response, the use of help, clue and dictionary facilities, and much more. The challenge to researchers is to make sense of this mass of data and, indeed, to design procedures for gathering useful, meaningful or at least potentially meaningful data, rather than simply trawling through everything that can possibly be gathered in the hope that something interesting might emerge. In other words, a research agenda is needed.

Such a research agenda flows fairly naturally from two sources: issues that arise in the development of CBTs, and the claimed advantages and disadvantages of computer-based testing. To illustrate the former, let us take computer-adaptive testing. When designing such tests, developers have to take a number of decisions. What should the entry point or level be, and how is this best determined for any given population? At what point should testing cease (the so-called exit point), and what criteria should determine this? How can content balance best be assured in tests whose main principle of adaptation is psychometric? What are the consequences of not allowing users to skip items, and can these be ameliorated? How can we ensure that some items are not presented much more frequently than others (the problem of item exposure) because of their facility or their content? The measurement literature has already addressed such questions, but language testing has yet to come to grips with these issues.

Arguably more interesting are the issues surrounding the claimed advantages and disadvantages of CBTs. Throughout this paper I have emphasised the lack of, and the need for, evidence that IT-based testing represents an advance in our ability to assess language performance or proficiency.
Whilst there may be an argument for IT-based assessment that does not improve on current modes of assessment, because of the compensations of the medium in terms of convenience, speed and so on, we need to be certain that IT-based assessment does not reduce the validity of what we do. Thus a minimal research agenda would set out to identify the comparative advantages of each form of assessment, and to establish whether IT-based assessment offers perceived or actual added value. This would apply equally to perceptions of computer-adaptive testing and to the claimed benefits of speed of assessment and immediate availability of results, for example.

A substantive research agenda would investigate the effects of providing immediate feedback on attitudes, on performance, and on the measurement of ability. The effect of the range of support facilities that can be made available needs to be examined: not only what the effects are, but how they can be maximised or minimised. What does the provision of support imply for the validity of the tests, and for the constructs that can be measured? What is the value of allowing learners a second attempt, with or without feedback on the success of their first attempt? What do the resulting scores tell us, should scores be adjusted in the light of such support, and if so, how? What additional information can we learn from the provision of self-assessment and 'normal' assessment? What is the value of confidence testing, and can it throw more light onto the nature of the constructs?

What is needed above all is research that will reveal more about the validity of the tests, that will enable us to estimate the effects of the test method and delivery medium; research that will provide insights into the processes and strategies test-takers use; studies that ... appropriate use and integration of media and multimedia that will allow us to measure those constructs that might have eluded us in more traditional forms of measurement, e.g. latencies in spontaneous language use, planning and execution times in task performance, speed reading and processing time more generally. And we need research into the impact of the use of the technology on learning, learners and ... closely to the learning process, as I have claimed, exactly how does this happen, if it does?

References

Alderson, J.C., 1986. Computers in language testing. In: Leech, G.N., Candlin, C.N. (Eds.), Computers in English Language Education and Research. Longman, London, pp. 99–110.
Alderson, J.C., 1996. Do corpora have a role in language assessment? In: Thomas, J., Short, M.H. (Eds.), Using Corpora for Language Research. Longman, London.
Bennett, R.E., 1998. Reinventing Assessment: Speculations on the Future of Large-scale Educational Testing. Educational Testing Service, Princeton, NJ.
Chalhoub-Deville, M., Deville, C., 1999. Computer adaptive testing in second language contexts. Annual Review of Applied Linguistics 19, 273–299.
ETS, 1998. Computer-Based TOEFL Score User Guide. Educational Testing Service, Princeton, NJ.
Fulcher, G., 1999. Computerising an English language placement test. English Language Teaching Journal 53, 289–299.
Kirsch, I., Jamieson, J., Taylor, C., Eignor, D., 1998. Computer Familiarity Among TOEFL Examinees (TOEFL Research Report 59). Educational Testing Service, Princeton, NJ.
Taylor, C., Jamieson, J., Eignor, D., Kirsch, I., 1998. The Relationship Between Computer Familiarity and Performance on Computer-based TOEFL Tasks (TOEFL Research Report 61). Educational Testing Service, Princeton, NJ.