Module 5119: Language Testing and Assessment
Stages of Test Development and Test Specifications

Review from a previous week: the test usefulness model
• Reliability: consistency of measurement
• Construct validity: appropriateness of score interpretation
• Authenticity: linking test tasks to TLU tasks
• Interactiveness: interaction between language ability and test tasks
• Impact: on individuals, teaching, educational institutions and society
• Practicality: consideration of resources

Quiz
In a small group, decide how you would evaluate each of the 12 given assessment scenarios according to the factors listed. Fill in the chart with 5-4-3-2-1 scores, with 5 indicating that the principle is highly fulfilled and 1 indicating very low or no fulfillment. Use your best intuition to supply these evaluations, even though you don't have complete information on each context. Report your group's findings to the rest of the class and compare.

Stages of Test Development

Thinking stage (initial ground-clearing)
• Establish the major purpose of the test (achievement? placement?)
• Determine appropriate objectives (sample on next slide); since you cannot test every one, choose a feasible subset of the objectives to test
• Theories of language and language use (e.g., linguistic competence, communicative competence)

Thinking stage (cont.)
Deciding:
• What to test (content)
• Layout
• Test items
• Length and time limit
• Instructions to be given
• Method of scoring
• Resources and constraints

Understanding the constraints

Writing stage
Writing test content in light of test specifications:
• What goes into the test, or what the test contains (i.e., tasks or items)
• Test construct: the underlying ability/trait being captured/measured by the test
• A sample of the TLU tasks: choosing the most characteristic tasks from the domains
• Relevance, coverage and authenticity of tasks
• Content selection involves test methods

Test methods
• Presentation of materials or tasks eliciting responses (prompting)
• Response format: the way in which candidates will be required to respond to/engage with the test materials
  – Fixed response format (e.g., MCQ, True/False), cheaper to score
  – Constructed response format (e.g., cloze test, short-answer questions), more expensive to score
  – Authenticity of task format and response
• Scoring method: how candidate responses will be rated or scored (see the sketch below)
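To make the scoring-method contrast concrete, here is a minimal sketch, not taken from the slides, of dichotomous scoring for a fixed-response section: each response is matched against an answer key and scored 1 or 0. The key and candidate responses are invented placeholders.

```python
# Minimal sketch: dichotomous scoring of a fixed-response (MCQ) section.
# The answer key and candidate responses below are invented examples.

ANSWER_KEY = {"q1": "B", "q2": "D", "q3": "A", "q4": "C"}

def score_fixed_response(responses: dict[str, str]) -> int:
    """Score 1 for each response matching the key, 0 otherwise."""
    return sum(
        1 for item, key in ANSWER_KEY.items()
        if responses.get(item, "").strip().upper() == key
    )

candidate = {"q1": "B", "q2": "A", "q3": "A", "q4": "C"}
print(score_fixed_response(candidate), "/", len(ANSWER_KEY))  # 3 / 4
```

Constructed-response formats (cloze, short answers) cannot be scored by simple key-matching like this, which is why they carry the higher scoring cost noted on the slide: they require rater judgement against a scale or scoring guide.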
Issues of authenticity
• Simulation of real-world tasks and settings
• Direct inference from test performance to the likely TLU domain
• Principled compromise

Issues of authenticity (cont.)
Examples:
• IELTS speaking tasks
• Academic listening test: listening to lectures and note-taking
  – The question of interruption
  – What to note and what not to
  – Pre-set questions

Piloting stage
Trying out the test:
• Analysis of the trial data (a minimal item-analysis sketch follows this list)
• Establish reliability and validity
• Understanding the perceptions of test takers
• Revising the test in light of analysis and feedback
• Selection and training of raters
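One common form of trial-data analysis, not spelled out on the slide, is classical item analysis: item facility (the proportion of test takers answering an item correctly) and item discrimination (how well an item separates high scorers from low scorers). A hedged sketch, assuming dichotomously scored pilot data; the response matrix is invented.

```python
# Sketch of classical item analysis on piloted, dichotomously scored items.
# Rows = test takers, columns = items; the data below are invented.

pilot = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]

n_items = len(pilot[0])
totals = [sum(row) for row in pilot]
ranked = sorted(range(len(pilot)), key=lambda i: totals[i], reverse=True)
third = max(1, len(pilot) // 3)
high, low = ranked[:third], ranked[-third:]

for j in range(n_items):
    facility = sum(row[j] for row in pilot) / len(pilot)
    # Discrimination: facility in the top third minus facility in the bottom third.
    disc = (sum(pilot[i][j] for i in high) / len(high)
            - sum(pilot[i][j] for i in low) / len(low))
    print(f"item {j + 1}: facility={facility:.2f}, discrimination={disc:+.2f}")
```

Items with extreme facility values or low or negative discrimination are the natural candidates for the improve-or-abandon decision discussed under item writing and moderation; reliability indices such as Cronbach's alpha would typically be computed from the same response matrix.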
Implementation stage

Test specifications

Test specifications (Hughes, 1989)
• Content
  – Operations
  – Types of text
  – Addressees
  – Topics
• Format and timing
• Criterial levels of performance
• Scoring procedures
• Sampling
• Item writing and moderation
• Pretesting

Content
• Operations: tasks that test takers have to be able to carry out (e.g., in a reading test: scan, guess the meaning of unknown words)
• Types of text: e.g., in a writing test: letters, forms, essays
• Addressees: the kind of people that test takers are expected to write or speak to
• Topics: selected according to suitability for the test takers and the type of test

Format and timing
• Test structure (including time allocated to components)
• Item types and procedures, with examples
• What weighting is assigned to each component?
• How many pages will be presented (for reading) or required (for writing)?
• How many items for each component?

Criterial levels of performance
• The required level(s) of performance for success must be specified

Item writing and moderation
Critical questions that may be asked:
• Is the task perfectly clear?
• Is there more than one possible correct response?
• Can test takers arrive at the correct response without having the skill supposedly being tested?
• Do test takers have enough time to perform the tasks?
The best way to identify items that have to be improved or abandoned is through teamwork/collaborative work.

What should test specifications look like?
• What is the purpose of the test? (placement? achievement?)
• What sort of learner will be taking the test? (age, sex, level?)
• How many sections/papers?
• What target language situation is envisaged for the test?
• What text types should be chosen? (written or spoken?)
• What language skills should be tested? (micro-skills?)
• What language elements? (grammatical structures/features?)
• What sort of tasks are required? (simulated authentic?)
• How many items are required for each section?
• What test methods? (multiple choice, gap filling?)
• What rubrics are used as instructions for candidates? (examples?)
• Which criteria are to be used for assessment by markers? (accuracy, appropriacy?)

Test specifications
The specification is based on Criterion-Referenced Measurement (CRM) as opposed to Norm-Referenced Measurement (NRM).
• NRM is concerned with determining the relative standing or rank order of test takers (scores in percentiles); individual performance is evaluated in terms of its typicality for the population in question (How good was it compared with the performance of others?)
• CRM is concerned with determining the absolute standing of test takers in relation to the criterion that is tested (scores refer to the extent of the domain or criterion mastered): Did it meet what was required?

NRM
• Individual performance is evaluated in terms of its typicality for the population in question (How good was it compared with the performance of others?)
• Uses a comparison between individuals as a frame of reference
• Requires a score distribution
• Typically associated with comprehension tests or tests of grammar and vocabulary

CRM
• Concerned with determining the absolute standing of test takers in relation to the criterion that is tested (scores refer to the extent of the domain or criterion mastered): Did it meet what was required?
• Individual performances are evaluated against a verbal description of a satisfactory performance at a given level

CRM (cont.)
• A series of performance goals is set for individual learners, so learners can reach these at their own rate
• Motivation is maintained: striving is for a 'personal best' rather than against other learners
• Typically involves judgements as to how a performance should be classified
• Includes indices of rater quality (e.g., inter-rater reliability indices, classification analysis)

Sample test specifications
• Handouts

Practice
On the basis of experience or intuition, try to write a specification for a test designed to measure the level of language proficiency of students applying to study an academic subject in the medium of English at an overseas university. Compare your specification with those of tests that have actually been constructed for that purpose; for example, you might look at ELTS and TOEFL. If specifications are not available, you will have to infer them from sample tests or past papers. (A hedged skeleton to start from follows below.)
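As a starting point for the practice task, here is a hedged sketch of what a machine-readable test specification might look like, loosely following Hughes' (1989) headings. Every value below is an invented placeholder, not an actual EAP specification; the sketch also shows how a criterial level of performance supports a criterion-referenced pass/fail classification.

```python
# Hypothetical test-specification skeleton, loosely following Hughes (1989).
# All values are illustrative placeholders, not a real specification.

spec = {
    "purpose": "proficiency (EAP admission)",
    "content": {
        "operations": ["scan for specific information", "guess meaning of unknown words"],
        "text_types": ["journal article extract", "lecture"],
        "addressees": "academic staff and fellow students",
        "topics": "general academic, non-specialist",
    },
    "format_and_timing": {
        "sections": [{"skill": "reading", "items": 30, "minutes": 60, "weight": 0.5},
                     {"skill": "writing", "items": 2, "minutes": 60, "weight": 0.5}],
        "item_types": ["multiple choice", "short answer", "essay"],
    },
    "criterial_level": 0.6,   # proportion of the criterion that must be mastered
    "scoring": "dichotomous for reading; analytic rating scale for writing",
}

def crm_classify(proportion_correct: float) -> str:
    """Criterion-referenced decision: did the performance meet what was required?"""
    if proportion_correct >= spec["criterial_level"]:
        return "meets criterion"
    return "does not meet criterion"

print(crm_classify(0.73))  # meets criterion
print(crm_classify(0.41))  # does not meet criterion
```

Note the contrast with NRM reporting: an NRM report would convert the same raw scores into percentile ranks against the candidate population, whereas the CRM decision above needs no score distribution at all.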