Here is a clear and authoritative discussion of the basic concerns which underlie the development and use of language tests, and an up-to-date synthesis of research on testing. Written primarily for students on teacher education courses, it is also an invaluable resource for all those professionally involved in designing and administering tests, acting as a complement to practical 'how-to' books. Winner of the MLA's Kenneth W. Mildenberger Prize.
Preface
Introduction: The aims of the book; The climate for language testing; Research and development: needs and problems; Research and development: an agenda; Overview of the book; Notes
Measurement: Introduction; Definition of terms: measurement, test, evaluation; Essential measurement qualities; Properties of measurement scales; Characteristics that limit measurement; Steps in measurement; Summary; Notes; Further reading; Discussion questions
Uses of Language Tests: Uses of language tests in educational programs; Research uses of language tests; Features for classifying different types of language test; Summary; Further reading; Discussion questions
Communicative Language Ability: Introduction; Language proficiency and communicative competence; A theoretical framework of communicative language ability; Notes; Further reading; Discussion questions
Test Methods: Introduction; A framework of test method facets; Applications of this framework to language testing; Summary; Notes; Further reading; Discussion questions
Reliability: Introduction; Factors that affect language test scores; Classical true score measurement theory; The standard error of measurement: interpreting individual scores within classical true score theory; Reliability of criterion-referenced test scores; Factors that affect reliability estimates; Measurement error; Summary; Notes; Further reading; Discussion questions
Validation: Introduction; Reliability and validity revisited; Validity as a unitary concept; The evidential basis of validity; Test bias; The consequential or ethical basis of validity; Postmortem: face validity; Notes; Further reading; Discussion questions
Some Persistent Problems and Future Directions: Authentic language tests; A general model for explaining performance on language tests; Summary; Notes; Further reading; Discussion questions
Bibliography

Measurement

Introduction

In developing and using language tests, we must take into account considerations and follow procedures that are characteristic of measurement in the social sciences in general. Likewise, our interpretation and use of the results of
language tests are subject to the same general limitations that characterize measurement in the social sciences. The purpose of this chapter is to introduce the fundamental concepts of measurement, an understanding of which is essential to the development and use of language tests. These include the terms 'measurement', 'test', and 'evaluation', and how these are distinct from each other; different types of measurement scales and their properties; the essential qualities of measures (reliability and validity); and the characteristics of measures that limit our interpretations of test results. The process of measurement is described as a set of steps which, if followed in test development, will provide the basis for both reliable test scores and valid test use.

Definition of terms: measurement, test, evaluation

The terms 'measurement', 'test', and 'evaluation' are often used synonymously; indeed they may, in practice, refer to the same activity. When we ask for an evaluation of an individual's language proficiency, for example, we are frequently given a test score. This attention to superficial similarities among these terms, however, tends to obscure the distinctive characteristics of each, and I believe that an understanding of the distinctions among the terms is vital to the proper development and use of language tests.

Measurement

Measurement in the social sciences is the process of quantifying the characteristics of persons according to explicit procedures and rules. This definition includes three distinguishing features: quantification, characteristics, and explicit rules and procedures. Quantification involves the assigning of numbers, and this distinguishes measures from qualitative descriptions such as verbal accounts or nonverbal, visual representations. Non-numerical categories or rankings such as letter grades or labels (for example, 'excellent', 'good', 'average') may have the characteristics of measurement, and these are discussed below under 'properties of measurement scales'. However, when we actually use categories or rankings such as these,
we frequently assign numbers to them in order to analyze and interpret them, and technically it is not until we do this that they constitute measurement. We can assign numbers to both physical and mental characteristics of persons. Physical attributes such as height and weight can be observed directly. In testing, however, we are almost always interested in quantifying mental attributes and abilities, sometimes called traits or constructs, which can only be observed indirectly. These mental attributes include characteristics such as aptitude, intelligence, motivation, field dependence, attitude, native language, fluency in speaking, and achievement in reading comprehension. The precise definition of 'ability' is a complex undertaking. In a very general sense, 'ability' refers to being able to do something, but the circularity of this general definition provides little help for measurement unless we can clarify what the 'something' is. John Carroll has proposed defining an ability with respect to a particular class of cognitive or mental tasks that an individual is required to perform, and 'mental ability' thus refers to performance on a set of mental tasks (Carroll: 268). We generally assume that there are degrees of ability and that these are associated with tasks or performances of increasing difficulty or complexity (Carroll). Thus, individuals with higher degrees of a given ability could be expected to have a higher probability of correct performance on tasks of lower difficulty or complexity, and a lower probability of correct performance on tasks of greater difficulty or complexity. Whatever attributes or abilities we measure, it is important to understand that it is these attributes or abilities, and not the persons themselves, that we are measuring. That is, we are far from being able to claim that a single measure or even a battery of measures can adequately characterize individual human beings in all their complexity.
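Carroll's point above, that higher degrees of ability go with a higher probability of success on more difficult tasks, is the intuition later formalized in item response theory. The sketch below is purely illustrative: the one-parameter logistic model and the function name are assumptions of this example, not part of Carroll's definition.

```python
import math

def p_correct(ability: float, difficulty: float) -> float:
    """One-parameter logistic (Rasch) model: the probability that a
    person of the given ability performs a task of the given difficulty
    correctly. Both values sit on the same arbitrary logit scale."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

# Higher ability raises the chance of success on every task, and
# greater task difficulty lowers it, as described in the text.
for ability in (-1.0, 0.0, 1.0):
    row = [round(p_correct(ability, d), 2) for d in (-1.0, 0.0, 1.0)]
    print(ability, row)
```

In real item response modeling these parameters are estimated from response data; here they are simply stipulated to show the monotonic relationship between ability, difficulty, and the probability of correct performance.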
The third distinguishing characteristic of measurement is that quantification must be done according to explicit rules and procedures. That is, the 'blind' or haphazard assignment of numbers to characteristics of individuals cannot be regarded as measurement. In order to be considered a measure, an observation of an attribute must be replicable, for other observers, in other contexts and with other individuals. Practically anyone can rate another person's speaking ability, for example. But while one rater may focus on pronunciation accuracy, another may find vocabulary to be the most salient feature. Or one rater may assign a rating as a percentage, while another might rate on a scale from zero to five. Ratings such as these can hardly be considered anything more than numerical summaries of the raters' personal conceptualizations of the individual's speaking ability. This is because the different raters in this case did not follow the same criteria or procedures for arriving at their ratings. Measures, then, are distinguished from such 'pseudo-measures' by the explicit procedures and rules upon which they are based. There are many different types of measures in the social sciences, including rankings, rating scales, and tests.

Test

Carroll (1968) provides the following definition of a test:

a psychological or educational test is a procedure designed to elicit certain behavior from which one can make inferences about certain characteristics of an individual. (Carroll 1968: 46)

From this definition, it follows that a test is a measurement instrument designed to elicit a specific sample of an individual's behavior. As one type of measurement, a test necessarily quantifies characteristics of individuals according to explicit procedures. What distinguishes a test from other types of measurement is that it is designed to obtain a specific sample of behavior. Consider the following example. The Interagency Language Roundtable (ILR) oral interview is a test of speaking consisting of (1) a set of elicitation procedures, including a sequence of activities and sets of question types and topics, and (2) a measurement scale of language proficiency, ranging from a low level of '0' to a high level of '5', on which samples of oral language obtained via the elicitation
procedures are rated. Each of the six scale levels is carefully defined by an extensive verbal description. A qualified ILR interviewer might be able to rate an individual's oral proficiency in a given language according to the ILR rating scale, on the basis of several years' informal contact with that individual, and this could constitute a measure of that individual's oral proficiency. This measure could not be considered a test, however, because the rater did not follow the procedures prescribed by the ILR interview, and consequently may not have based her rating on the specific language performance samples that are obtained in conducting an ILR oral interview. I believe this distinction is an important one, since it reflects the primary justification for the use of language tests and has implications for how we design, develop, and use them. If we could count on being able to measure a given aspect of language ability on the basis of any sample of language use, however obtained, there would be no need to design language tests. However, it is precisely because any given sample of language use will not necessarily enable the test user to make inferences about a given ability that we need language tests. That is, the inferences and uses we make of language test scores depend upon the sample of language use obtained. Language tests can thus provide the means for more carefully focusing on the specific language abilities that are of interest. As such, they could be viewed as supplemental to other methods of measurement. Given the limitations on measurement discussed below, and the potentially large effect of elicitation procedures on test performance, however, language tests can more appropriately be viewed as the best means of assuring that the sample of language obtained is sufficient for the intended measurement purposes, even if we are interested in very general or global abilities. That is, carefully designed elicitation procedures such as those of the ILR oral interview, those for measuring writing ability described by Jacobs et al., or those of multiple-choice tests such as the
Test of English as a Foreign Language (TOEFL), provide the best assurance that scores from language tests will be reliable and valid. While measurement is frequently based on the naturalistic observation of behavior over a period of time, as with teacher grades, such naturalistic observations might not include samples of behavior that manifest specific abilities or attributes. Thus a rating based on a collection of personal letters, for example, might not provide any indication of an individual's ability to write effective argumentative editorials for a news magazine. Likewise, a teacher's rating of a student's language ability based on interactive social language use may not be a very good indicator of how well that student can use language to perform various 'cognitive/academic' language functions (Cummins). This does not imply that other measures are less valuable than tests, but makes the point that the value of tests lies in their capability for eliciting the specific kinds of behavior that the test user can interpret as evidence of the attributes or abilities which are of interest.

Evaluation

Evaluation can be defined as the systematic gathering of information for the purpose of making decisions (Weiss). The probability of making the correct decision in any given situation is a function not only of the ability of the decision maker, but also of the quality of the information upon which the decision is based. Everything else being equal, the more reliable and relevant the information, the better the likelihood of making the correct decision. Few of us, for example, would base educational decisions on hearsay or rumor, since we would not generally consider these to be reliable sources of information. Similarly, we frequently attempt to screen out information, such as sex and ethnicity, that we believe to be irrelevant to a particular decision. One aspect of evaluation, therefore, is the collection of reliable and relevant information. This information need not be, and indeed seldom is, exclusively quantitative. Qualitative descriptions, ranging from performance profiles to letters of
reference, as well as overall impressions, can provide important information for evaluating individuals, as can measures, such as ratings and test scores. Evaluation, therefore, does not necessarily entail testing. By the same token, tests in and of themselves are not evaluative. Tests are often used for pedagogical purposes, either as a means of motivating students to study, or as a means of reviewing material taught, in which case no evaluative decision is made on the basis of the test results. Tests may also be used for purely descriptive purposes. It is only when the results of tests are used as a basis for making a decision that evaluation is involved. Again, this may seem a fine point, but it places the burden for much of the stigma that surrounds testing squarely upon the test user, rather than on the test itself. Since by far the majority of tests are used for the purpose of making decisions about individuals, I believe it is important to distinguish the information-providing function of measurement from the decision-making function of evaluation. The relationships among measurement, tests, and evaluation are illustrated in Figure 2.1.

(Figure 2.1: Relationships among measurement, tests, and evaluation)

An example of an evaluation that does not involve either tests or measures (area '1') is the use of qualitative descriptions of student performance for diagnosing learning problems. An example of a non-test measure used for evaluation (area '2') is a teacher ranking used for assigning grades, while an example of a test used for purposes of evaluation (area '3') is the use of an achievement test to determine student progress. The most common non-evaluative uses of tests and measures are for research purposes. An example of tests that are not used for evaluation (area '4') is the use of a proficiency test as a criterion in second language acquisition research. Finally, assigning code numbers to subjects in second language research according to native language is an example of a non-test measure that is not used for evaluation.

(Figure: program structure, with entry and placement into Instruction Levels A, B, and C)

Remedial work would focus on the students'
deficiencies and employ learning activities and teaching methods that are well suited to individual students' learning styles. An alternative approach would be to have the student simply repeat the regular course of instruction. However, this approach assumes that the failure to achieve the objectives is due largely to the student's inadequacies, rather than to those of the instructional program. The provision of effective remedial instruction, with subsequent progress to the next level, also provides a means for students who are not successful in the regular course to make progress and exit the program. In this program, then, several different types of decisions need to be made, and these are made primarily on the basis of information from tests. One approach to developing tests for this program would be to develop achievement tests. The placement test could be a multi-level achievement test based on the objectives of all three levels, while the tests at the end of each level would focus on the objectives of that level. In some situations, where incoming students may vary widely in their backgrounds, it might be more appropriate to base the placement test on a general theory of language proficiency, while still basing the progress tests on the objectives of the course. This program is not intended as a 'model' program to be emulated, but rather serves to illustrate a wide range of uses of tests in an educational program. This program could be altered in a number of ways to meet different needs and situations. For example, a slightly simpler program, but with the same types of decisions to be made, might have students take the multi-level achievement placement test, or a parallel form, at the end of each level of instruction. In this way, even greater flexibility could be built into the program, since it might be possible for students who make outstanding progress to skip from Level A to Level C, or to exit the program after Level B. Programs such as those described in these examples, then, illustrate the fundamental consideration regarding the use of tests in
educational programs: the amount and type of testing we do depends on the number and kinds of decisions to be made. Considerations regarding the qualities of these tests, that is, reliability and validity, will be discussed in greater detail in the chapters on reliability and validation.

Research uses of language tests

As operational definitions of theoretical constructs, language tests have a potentially important role in virtually all research, both basic and applied, that is related to the nature of language proficiency, language processing, language acquisition, language attrition, and language teaching. The question of whether language proficiency is a single unitary competence or whether it is composed of distinct component traits is one which has been of considerable interest to language testing researchers for several years (Oller 1976; Oller and Hinofotis 1980; Bachman and Palmer; Carroll), and which also has implications for the theory of language acquisition and for language teaching. It is now generally agreed that language proficiency is not a single unitary ability, but that it consists of several distinct but related constructs, in addition to a general construct of language proficiency (Oller). Much current research into the nature of language proficiency has now come to focus on identifying and empirically verifying its various components (for example, Farhady 1980; Bachman and Palmer; Allen et al. 1983; Sang et al. 1986). Of particular interest in this regard are models of communicative competence, which have provided the theoretical definitions for the development of tests of constructs such as sensitivity to cohesive relationships, discourse organization, and differences in register (for example, Cohen and Olshtain 1980; Ontario Ministry of Education 1980; Wesche 1981; Swain 1985). Such tests in turn provide the basis for verifying (or falsifying) these theoretical models. This research involves the construct validation of language tests, which is discussed further in the chapter on validation.
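The unitary-versus-componential question above is typically investigated by examining patterns of correlation among scores on tests of the hypothesized components. A minimal sketch of that logic follows; the data and the subtest names are invented for illustration, not drawn from any of the studies cited.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented scores for ten test takers on three hypothetical subtests.
grammar    = [55, 62, 47, 70, 58, 66, 49, 73, 60, 52]
vocabulary = [58, 60, 50, 72, 55, 68, 51, 75, 63, 50]
discourse  = [40, 71, 45, 52, 66, 49, 58, 61, 47, 69]

# Uniformly high inter-correlations would favor a single general
# factor; a subtest that correlates weakly with the others points
# to a distinct component of proficiency.
print(round(pearson(grammar, vocabulary), 2))
print(round(pearson(grammar, discourse), 2))
```

In practice such questions are addressed with factor-analytic methods over many subtests and large samples, but the underlying evidence is exactly these inter-correlations.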
Language tests can also be used in research into the nature of language processing. Responses to language tests can provide a rich body of data for the identification of processing errors and their explanation, while language testing techniques can serve as elicitation procedures for collecting information on language processing. In the investigation of how individuals process information in a reading passage, for example, the cloze would seem to have a great deal of potential. Through careful observation and analysis of subjects' response patterns, such as the order in which they complete the blanks and the changes they make in their answers as they work through the passage, we may begin to be able to test some of the hypotheses that are suggested by various theories of reading. An area related to this is the investigation of the process of test taking itself (for example, Cohen and Aphek 1979; Cohen 1984, forthcoming; Grotjahn 1986). A third research use of language tests is in the examination of the nature of language acquisition. Studies of language acquisition often require indicators of the amount of language acquired for use as criterion or dependent variables, and these indicators frequently include language tests. Several studies have used tests of different components of communicative language ability as criteria for examining the effect of learner variables such as length of residence in country, age of first exposure to the target language, and motivational orientation on language acquisition (for example, Purcell 1983; Bachman and Palmer; Fouly 1985; Bachman and Mack 1986). Language tests are also sometimes used as indicators of factors related to second language acquisition, such as language aptitude and level of proficiency in the native language. Gardner (1983), for example, used measures of attitudes, motivational intensity, and prior language achievement to examine a model of language acquisition. Although language attrition, or loss, is not simply the reverse of language acquisition, many of the same factors that have been examined with language acquisition are also hypothesized to affect language
attrition, and language tests also have a role to play in this area of research. Oxford (1982) and Clark, for example, both discuss the role of language tests in research on language attrition, as well as considerations for their use in such research. Furthermore, it is clear from both Gardner's (1982) review of the research on social factors in language retention and his own research on attrition that language tests play a vital role in such research. A fifth area of research in which language tests play an important role is in the investigation of the effects of different instructional settings and techniques on language acquisition. Measures of language proficiency were essential components of several large-scale foreign language teaching method evaluation studies that were conducted in the 1960s (for example, Smith 1970; Levin; Lindblad and Levin), as well as of the more recent large-scale study of bilingual proficiency conducted by the Modern Language Centre of the Ontario Institute for Studies in Education (Allen et al. 1982, 1983; Harley et al.). Language tests have also provided criterion indicators of language ability for studies in classroom-centered second language acquisition (for example, the research reviewed and discussed in Chaudron), and for research into the relationship between different language teaching strategies and aspects of second language competence (for example, Sang et al. 1986).

Features for classifying different types of language test

Language test developers and users are frequently faced with questions regarding what type of test would be most appropriate for a given situation, and in discussions of language testing one often hears questions such as, 'Should we use a norm-referenced test or an achievement test?' or 'Should we use both a diagnostic and a proficiency test in our program?' Such uses of labels for describing test types often raise more questions than they answer. How are norm-referenced tests different from achievement tests? Cannot proficiency tests be used for diagnosis?
Questions like these imply comparisons that are like the proverbial question, 'Which are better, apples or oranges?', in which objects that are similar in some ways and different in others are compared according to a criterion they do not share ('appleness' or 'orangeness'). To clarify the use of such terms, language tests can be classified according to five distinctive features: the purpose, or use, for which they are intended; the content upon which they are based; the frame of reference within which their results are to be interpreted; the way in which they are scored; and the specific technique or method they employ.

Intended use

Any given language test is typically developed with a particular primary use in mind, whether it be for an educational program or for research. In research, language tests are used to provide information for comparing the performances of individuals with different characteristics under different conditions of language acquisition or language teaching, and for testing hypotheses about the nature of language proficiency. In educational settings, however, language tests provide information for making a wide variety of decisions. One way of classifying language tests, therefore, is according to the type of decision to be made. Thus we can speak of entrance and selection tests with regard to admission decisions; placement and diagnostic tests with regard to identifying the appropriate instructional level or specific areas in which instruction is needed; and progress and attainment tests with respect to decisions about how individuals should proceed through the program, or how well they are attaining the program's objectives. While it is generally best to develop a specific test for each different type of decision, this is not always necessary, and greater efficiency of testing can sometimes be achieved by developing a test for more than one specific purpose, with the understanding that the validity of each separate use must be adequately demonstrated.

Content

As indicated earlier, the 'content' of language tests can be based on either a theory of language proficiency or a specific
domain of content, generally as provided in a course syllabus. We can refer to theory-based tests as proficiency tests, while syllabus-based tests are generally referred to as achievement tests. Whether or not the specific abilities measured by a given proficiency test actually differ from those measured by a given achievement test will depend, of course, on the extent to which the theory upon which the proficiency test is based differs from that upon which the syllabus is based. For example, a language proficiency test based on a theory of grammatical competence is likely to be quite similar to an achievement test based on a grammar-based syllabus, but quite different from an achievement test based on a notional-functional syllabus. These relationships are illustrated in the figure below.

(Figure: test content based on Theory A or Theory B versus content based on a course syllabus)

Language aptitude tests are also distinguished according to content. Like language proficiency tests, language aptitude tests are theory-based, but the theory upon which they are based includes abilities that are related to the acquisition, rather than the use, of language. The theory of language aptitude, as described by Carroll, hypothesizes that cognitive abilities such as rote memorization, phonetic coding, and the recognition of grammatical analogies are related to an individual's ability to learn a second or foreign language, and together constitute language aptitude. This theory has been operationally defined in measures such as the Modern Language Aptitude Test (Carroll and Sapon) and the Pimsleur Language Aptitude Battery (Pimsleur 1966).

Frame of reference

The results of language tests can be interpreted in two different ways, depending on the frame of reference adopted. When test scores are interpreted in relation to the performance of a particular group of individuals, we speak of a norm-referenced interpretation. If, on the other hand, they are interpreted with respect to a specific level or domain of ability, we speak of a criterion- or domain-referenced interpretation. Tests that are developed to permit these different interpretations are called norm-referenced and criterion-referenced tests, respectively.

Norm-referenced tests

Norm-referenced (NR) tests are
designed to enable the test user to make 'normative' interpretations of test results. That is, test results are interpreted with reference to the performance of a given group, or norm. The 'norm group' is typically a large group of individuals who are similar to the individuals for whom the test is designed. In the development of NR tests the norm group is given the test, and then the characteristics, or norms, of this group's performance are used as reference points for interpreting the performance of other students who take the test. The performance characteristics, or norms, most typically used as reference points are the mean, or average score, of the group, and the standard deviation, which is an indicator of how spread out the scores of the group are. If the NR test is properly designed, the scores attained will typically be distributed in the shape of a 'normal' bell-shaped curve. A perfectly normal distribution of scores has certain statistical characteristics that are known and constant. In a normal distribution, for example, we know that 50 per cent of the scores are below the mean, or average, and 50 per cent are above. We also know that 34 per cent of the scores are between the mean and one standard deviation above or below the mean, that 27 per cent are between one and two standard deviations from the mean (13.5 per cent above and 13.5 per cent below), and that only about 5 per cent of the scores will be as far away as two or more standard deviations from the mean. These characteristics are illustrated in Figure 3.6. We can use these known characteristics in interpreting the scores of individuals on NR tests. For example, the mean of the Test of English as a Foreign Language (TOEFL) is about 512 and the standard deviation is about 66 (Test of English as a Foreign Language 1987). Thus, a score of 578 on the TOEFL is about one standard deviation above the mean (512 + 66 = 578). This score indicates that the individual is well above average with reference to the norm group and, more precisely, that his performance is equal to or greater than that of 84 per cent of the students in the norm group.
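The TOEFL arithmetic above can be checked directly: a score is converted to a z-score and then to a percentile through the standard normal distribution. A small sketch follows; the mean of 512 and standard deviation of 66 are the figures quoted above, while the function name is an assumption of this example.

```python
import math

def percentile_from_score(score: float, mean: float, sd: float) -> float:
    """Percentile rank of a score under a normal distribution,
    computed via the z-score and the standard normal CDF."""
    z = (score - mean) / sd
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return 100.0 * cdf

# 578 is one standard deviation above the mean (512 + 66), which the
# normal curve places at about the 84th percentile of the norm group.
print(round(percentile_from_score(578, mean=512, sd=66)))  # -> 84
```

The same function reproduces the other figures in the text: a score at the mean falls at the 50th percentile, and roughly 95 per cent of scores lie within two standard deviations of the mean.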
In other cases, NR test results are interpreted and reported solely with reference to the actual group taking the test, rather than to a separate norm group. Perhaps the most familiar example of this is what is sometimes called 'grading on the curve', where, say, the top ten per cent of the students receive an 'A' on the test and the bottom ten per cent fail, irrespective of the absolute magnitude of their scores. In order to provide the most easily interpretable results, NR tests are designed to maximize the distinctions among the individuals in a given group.

(Figure 3.6: Characteristics of the normal distribution)

NR tests are also sometimes referred to as 'psychometric' tests, since most theoretical models of psychometrics, or psychological measurement, are based on the assumptions of a normal distribution and of maximizing the variations among individuals' scores (Cziko). The quintessential NR test is the standardized test, which has three characteristics (Gronlund). First, standardized tests are based on a fixed, or standard, content, which does not vary from one form of the test to another. This content may be based on a theory of language proficiency, as is the case with the TOEFL, or it may be based on a specification of language users' expected needs, as is the case with the ELTS test (Criper and Davies). If there are alternate forms of the test, these are carefully examined for content equivalence. Second, there are standard procedures for administering and scoring the test, which do not vary from one administration of the test to the next. Finally, standardized tests have been thoroughly tried out, and through a process of empirical research and development, their characteristics are well known. Specifically, their measurement properties have been examined, so that we know what type of measurement scale they provide; their reliability and validity have been carefully investigated and demonstrated for the intended uses of the test; their score distribution norms have been established with groups of individuals similar to those for whom the test is intended; and if there are alternate forms of the test, these are
equated statistically to assure that reported scores on each test indicate the same level of ability, regardless of the particular form of the test being used.

Criterion-referenced tests

Criterion-referenced (CR) tests are designed to enable the test user to interpret a test score with reference to a criterion level of ability or a domain of content. An example would be the case in which students are evaluated in terms of their relative degree of mastery of course content, rather than with respect to their relative ranking in the class. Thus, all students who master the course content might receive an 'A', irrespective of how many students achieve this grade. The primary concerns in developing a CR test are that it adequately represent the criterion ability level or sample the content domain, and that it be sensitive to levels of ability or degrees of mastery of the different components of that domain. The necessary condition for the development of a CR test is the specification of a level of ability or domain of content (Glaser 1963; Glaser and Nitko 1971; Nitko 1984). It is important to point out that it is this specification of a level of ability or domain of content that constitutes the criterion, and not the setting of a cut-off score for making decisions. The definition of a criterion level or domain and the setting of a cut-off score for a given decision are quite distinct issues. It is quite possible to develop and use a CR test without explicitly setting a cut-off score. A good example of this would be a diagnostic test that is used to identify specific areas within a content domain where individuals might benefit from further instruction, without necessarily evaluating them as masters or non-masters. Here the implicit categories are 'might benefit from further instruction' and 'probably would not benefit from further instruction'. It is equally possible to set a cut-off score for making decisions on the basis of the distribution of scores from a NR test, without reference to a content domain. For example, if we wanted to be highly selective, we could set the cut-off score for entrance into a program at two standard deviations above the mean. From the discussion
above about test content, it is clear that achievement tests based on well-defined domains can satisfy the condition of domain specification for CR tests. Of current research interest is the question of whether language proficiency tests can also be developed so as to provide CR scores (Bachman and Savignon 1986; Bachman and Clark 1987). There has been considerable discussion about the advantages of CR, or 'edumetric', language tests (Cziko 1981), as well as of a 'common metric' for measuring language ability (for example, 1978, 1981; Carroll 1980; Clark 1980; Bachman and Clark 1987); this will be discussed in greater detail in a later chapter.

The two primary distinctions between NR and CR tests are (1) in their design, construction, and development; and (2) in the scales they yield and the interpretation of these scales. NR tests are designed and developed to maximize distinctions among individual test takers, which means that the items or parts of such tests will be selected according to how well they discriminate individuals who do well on the test as a whole from those who do poorly. CR tests, on the other hand, are designed to be representative of specified levels of ability or domains of content, and the items or parts will be selected according to how adequately they represent these ability levels or content domains. And while NR test scores are interpreted with reference to the performance of other individuals on the test, CR test scores are interpreted as indicators of a level of ability or degree of mastery of a content domain. These differences between NR and CR tests have particularly important consequences for how we estimate and interpret the reliability of test scores. This involves technical considerations that are discussed at length in the chapter on reliability (pp. 209-20). Despite these differences, however, it is important to understand that these frames of reference are not necessarily mutually exclusive. It is possible, for example, to develop NR score interpretations, on the basis of the performance of appropriate groups of test takers, for a test that was originally designed with reference to a criterion level of proficiency or content
domain. Similarly, it is sometimes useful to attempt to scale an NR test to an existing CR test (for example, Carroll 1967).

Scoring procedure

As Pilliner (1968) pointed out, 'subjective' tests are distinguished from 'objective' tests entirely in terms of scoring procedure. All other aspects of tests involve subjective decisions. The test developer uses the best information at hand (for example, curriculum content or a theory of language) to subjectively determine the content to be covered, while the test writers make subjective decisions about how best to construct the test items. Tests are also subjective in the taking, since test takers must make subjective decisions about how best to answer the questions, be they essays or multiple-choice. In an objectively scored test, the correctness of the test taker's response is determined entirely by predetermined criteria; no judgment is required on the part of scorers. In a subjectively scored test, on the other hand, the scorer must make a judgment about the correctness of the response based on her subjective interpretation of the scoring criteria. The multiple-choice technique is the most obvious example of an objective test, although other tests can be scored objectively as well. Cloze tests and dictations, for example, can be scored objectively by providing scorers with scoring keys that specify exactly which words are acceptable and which are not. Tests such as the oral interview or the written composition that involve the use of ratings are necessarily subjectively scored, since there is no feasible way to 'objectify' the scoring procedure.

Testing method

The last characteristic to be considered in describing a test is the specific testing method used. Given the variety of methods that have been and continue to be devised, and the ingenuity of test developers, it is not possible to make an exhaustive list of the methods used for language tests. One broad type of test method that has been discussed widely by language testers is the so-called 'performance test', in which the test takers' test performance is expected to replicate their language performance in non-test
situations (for example, Jones, Wesche 1985). The oral interview and the essay are considered examples of performance tests. However, these, as well as virtually all the more commonly used methods, such as the multiple-choice, completion (fill-in), dictation, and cloze, are not themselves single 'methods', but consist of different combinations of features: instructions, types of input, and task types. Test method 'facets' such as these provide a more precise way of describing and distinguishing among different types of tests than do single category labels, and are discussed in detail in a later chapter. I believe the classification scheme described above provides a means for a reasonably complete description of any given language test, and that descriptions of language tests that refer to only a single feature are likely to be misunderstood. By analogy, describing a given speech sound as a 'fricative' does not provide sufficient information for a phonologist to identify it. And even if we describe its other features, calling it the broad grooved post-palatal fricative of English, this does not tell us whether a given instance of this sound is used as an expression of amazement or as a request for silence.

While the above features for classifying language tests are distinct, there are some areas of overlap. First, the terms 'achievement', 'attainment', and 'mastery' may refer both to the type of content upon which the test is based and to the type of decision to be made. This is not inconsistent, since achievement assumes the attainment or mastery of specific objectives. Second, although there is no necessary connection between test content and frame of reference, the results of proficiency and aptitude tests, because of the way in which these constructs have been defined in the past, have been interpreted only in relation to some reference group. Achievement tests, on the other hand, are generally criterion-referenced. I say 'generally' because it is fairly common, particularly in tests of language arts for native speakers, for language tests based on the contents of specific textbooks or
materials to be normed with relevant groups by level, such as year in school. Rigorous definitions such as 'a norm-referenced multiple-choice test of achievement in grammar for the purpose of placement' are unlikely to replace shorter descriptions such as 'a multiple-choice grammar test' or 'a grammar test for placement' in common usage. I believe, however, that an awareness of the different features that can be used to describe tests may help clarify some misconceptions about the ways in which specific tests differ from each other, and facilitate better communication among test users about the characteristics of the tests they use. More importantly, it is these features that determine, to a large extent, the approaches we take in investigating and demonstrating reliability and validity.

Summary

The main point of this chapter is that the most important consideration in the development and use of language tests is the purpose or purposes for which the particular test is intended. By far the most prevalent use of language tests is for purposes of evaluation in educational programs. In order to use language tests for this purpose, we must assume that information regarding educational outcomes is necessary for effective formal education, that appropriate changes or modifications in the program are possible, and that educational outcomes are measurable. The amount and type of testing will depend upon the decisions that need to be made. Since the decisions we make will affect people, we must be concerned about the qualities of reliability and validity of our test results. In general, the more important the decision, in terms of its impact upon individuals and programs, the greater assurance we must have that our test scores are reliable and valid. In educational programs we are generally concerned with two types of decisions. Decisions about individuals include decisions about entrance, placement, diagnosis, progress, and grading. Decisions about programs are concerned with characteristics such as the appropriateness, effectiveness, or efficiency of the program.
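The two ways of arriving at a decision point discussed in this chapter, a cut-off defined on the score distribution (norm-referenced) versus a cut-off defined on the content domain (criterion-referenced), can be illustrated with a small computational sketch. The scores, the one-standard-deviation selection cut-off, and the 80 per cent mastery level below are all invented for illustration:

```python
# A minimal sketch (invented data): the same raw scores interpreted in the
# two frames of reference discussed in this chapter.
from statistics import mean, stdev

scores = [52, 61, 64, 68, 70, 73, 75, 79, 84, 96]  # hypothetical scores out of 100

# Norm-referenced decision: the cut-off is defined on the score
# distribution itself, here one standard deviation above the mean.
nr_cut = mean(scores) + stdev(scores)
selected = [s for s in scores if s >= nr_cut]

# Criterion-referenced decision: the cut-off is defined on the content
# domain, here a hypothetical mastery level of 80 per cent, irrespective
# of how many test takers reach it.
cr_cut = 80
masters = [s for s in scores if s >= cr_cut]

print(round(nr_cut, 1), selected, masters)  # prints: 84.6 [96] [84, 96]
```

Note that the norm-referenced cut-off would move if the group changed, while the criterion-referenced mastery level would not; this is precisely the difference in interpretation described above.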
Language tests also have a potentially important use in several areas of research. The information obtained from language tests can assist in the investigation of the very nature of language proficiency, in examining how individuals process language, in the study of language acquisition and attrition, and in assessing the effects of different instructional methods and settings on language learning.

Different types of language tests can be distinguished according to five features: use (selection, entrance, readiness, placement, diagnosis, progress), the content upon which they are based (achievement, proficiency, aptitude), the frame of reference for interpreting test results (norm, criterion), the scoring procedure (subjective, objective), and the specific testing method (for example, multiple-choice, completion, essay, dictation, cloze).

Further reading

Nitko (1989) provides an excellent discussion of the issues related to the design and uses of tests in educational programs. More extensive discussions of criterion-referenced tests can be found in Popham (1978) and Nitko (1984). A thorough coverage of the technical issues in criterion-referenced testing is provided in Berk (1980). Cziko (1981) discusses the differences between norm-referenced ('psychometric') and criterion-referenced ('edumetric') language tests. Pilliner (1968) describes several different types of tests and discusses these with reference to reliability and validity. Bachman (1981) discusses various types of decisions to be made in formative program evaluation, and makes some suggestions for the types of tests to be used. Bachman discusses some of the advantages of using criterion-referenced proficiency tests for both formative and summative program evaluation, and describes some of the considerations in the development and use of such tests.

Discussion questions

What are the major characteristics of norm-referenced (NR) and criterion-referenced (CR) tests?

How would scores from the following tests typically be interpreted (NR or CR)?
achievement test
diagnostic test
entrance test
placement test
progress test

Under what conditions could scores from these tests be interpreted in both ways?

What uses do achievement tests have for formative and summative program evaluation?

What decisions do we make about individual students? What types of tests are used to provide information as a basis for these decisions? Can we use one test for all these decisions?

Are the decisions that we make in educational programs all of equal importance? Make a list of six different types of decisions that need to be made in educational programs and rank them from most important to least. What factors do you consider in your ranking (for example, number of students affected, impact on students' careers, impact on teacher's career)? How is the relative importance of the decisions to be made related to the qualities of reliability and validity?
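As a small illustration of the objective scoring procedure described in this chapter, the sketch below scores a cloze test against a key that specifies exactly which words are acceptable for each blank, so that no scorer judgment is involved. The blanks, key, and responses are all invented:

```python
# Hypothetical scoring key for a three-blank cloze passage: each blank
# maps to the set of words the key accepts (acceptable-word scoring).
key = {
    1: {"the"},
    2: {"went", "walked"},
    3: {"quickly"},
}

def score_cloze(responses, key):
    """Count the blanks whose response matches the predetermined key."""
    return sum(1 for blank, word in responses.items()
               if word.lower() in key.get(blank, set()))

responses = {1: "the", 2: "walked", 3: "slow"}
print(score_cloze(responses, key))  # prints: 2
```

Two different scorers applying this key must arrive at the same score; a rating scale for an oral interview or a written composition has no comparable key, which is why such tests remain subjectively scored.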