TESTS OF ENGLISH LANGUAGE PROFICIENCY

Language Proficiency

Language proficiency is not defined uniformly. Cummins (1984) notes that some have described language proficiency as consisting of 64 distinct components, while others have reduced it to a single factor. Valdés and Figueroa (1994) point out that knowing a language involves more than mastering pronunciation, grammar, and politeness conventions; it also requires command of a number of interrelated components that interact with one another depending on the communicative context. Oller and Damico (1991) observe that the precise elements of language proficiency have never been settled and remain a matter of debate. Nevertheless, every language proficiency test must rest on an accurate model or definition of language proficiency. The Council of Chief State School Officers (CCSSO) defines a student who is proficient in English as one who can use the language to ask questions, understand the teacher and the reading materials, express ideas, and answer questions in class. Four language skills contribute to proficiency: speaking, reading, listening, and writing.

Canales (1994) grounds his definition of language proficiency in a socio-theoretical view: language is not treated as a collection of separate parts (e.g., pronunciation, vocabulary, and grammar). Language develops within a culture and serves as a medium for transmitting the beliefs and customs of that culture (consider the case of translating idioms). Language proficiency is dynamic and contextual (varying with the situation, the status of the speakers, and the topic of conversation), discursive (requiring connected utterances), and integrative, so that communicative competence can be achieved. In other words, language proficiency is the ability to use discrete elements of language, such as vocabulary, discourse structure, and body language, to convey meaning.

The language skills that underlie a student's academic success include the ability to respond to questions from classmates and teachers about specific information, to ask follow-up questions, and to synthesize reading material. Students must understand routine oral instructions in whole-class settings and comments addressed to peers in small groups. In reading, students are expected to extract information from many types of text. In writing, students are expected to produce short answers, paragraphs, essays, and papers. Successful language learners must also know the social and cultural conventions that govern language use.

The conception of language proficiency sketched above has at least two features. First, the definitions accommodate all four language skills: speaking, listening, reading, and writing. Second, each definition situates language proficiency in a particular context, namely education. As a consequence, a language proficiency test should use procedures that reflect, as far as possible, the contextualized language used in most English-medium classrooms. Valdés and Figueroa (1994) argue that a language proficiency test must identify the level of demand imposed by the context and the kinds of language ability typically used by the monolingual English-speaking students who are, for the most part, successful in that context. On that basis, we can establish criteria for measuring the language skills of
non-native speakers of English in order to decide whether they should be taught in English or in their own language. That recommendation is understandable, because language proficiency tests are meant to help educators judge accurately whether a student needs support in learning. Such decisions become difficult when the tasks on the test bear little resemblance to the tasks typically assigned in most classrooms.

General Nature of Language Proficiency Tests

Oller and Damico (1991) associate language proficiency testing with three traditions. The first is the discrete-point approach, which rests on the assumption that language consists of separable components, such as phonology, morphology, the lexicon, syntax, and so on, and that each component can be divided further into distinct elements (e.g., sounds into sound classes or phonemes, syllables, morphemes, words, idioms, and phrase structures). On this view, a language test is not valid if it mixes several skills or structural domains (Lado, 1961). Under this model, the ideal assessment would evaluate every domain and every skill considered important, and the results could be combined to form an overall picture of language proficiency (p. 82). Discrete-point proficiency tests typically use formats such as phoneme discrimination, in which test takers decide whether two words presented orally are the same or different (e.g., /ten/ versus /den/). Another example is a vocabulary test that asks students to choose the correct answer from a fixed set of options. The weaknesses of the discrete-point model include: the difficulty of confining a language test to a single skill (e.g., writing) without involving other skills (e.g., reading); the difficulty of confining a test to a single linguistic domain (e.g., vocabulary) without involving other domains (e.g., phonology); and the difficulty of testing language without involving context or connecting it to human experience.

According to Damico and Oller (1991), these limitations gave rise to a second trend in testing, the integrative or holistic approach. Such tests require that language proficiency be tested in rich discourse contexts (p. 83). The underlying assumption is that processing or using language involves more than one component of language (e.g., vocabulary, grammar, gesture) and more than one skill (e.g., listening, speaking). Following this logic, an integrative task might ask a test taker to listen to a story and then retell it, or to listen to a story and write it down.

The third trend in language testing described by Damico and Oller (1991) is known as pragmatic language testing. Its fundamental difference from the integrative approach is the attempt to connect the testing situation to the test taker's experience. As Oller and Damico (1991) note, language use in normal situations involves people, places, events, and relationships that implicate a whole range of experience, and that range is constrained by time, or temporal factors. Pragmatic language tests are therefore designed to be as close to "real life," or as authentic, as possible. Unlike integrative tasks, the pragmatic approach asks test takers to carry out
a listening task only under the textual and temporal conditions that characterize that activity. For example, if test takers are to listen to a story and retell it, certain conditions must be met. From a pragmatic standpoint, language learners generally do not listen to recorded stories; they listen to stories read aloud by an adult. A task that asks the learner to listen to a recorded story therefore fails the pragmatic requirement. The pragmatic approach is characterized by the following: normal visual input is provided (e.g., the reader's cues, the print on the page, authentic pictures connected to the story); timing is arranged so that learners have the opportunity to ask questions, draw inferences, and react normally to the content of the story; and the story, its theme, the reader, and the purpose of the activity are part of the student's experience. Oller and Damico (1991) see the strength of pragmatic testing in the fact that all the aims of discretely constructed items (diagnosis, focus, isolation) are better achieved through rich contexts. As a method of linguistic analysis, the discrete-point approach has validity; as a practical method of assessing language skills, however, it is misapplied, counterproductive, and logically impossible. If the aim is to measure proficiency in grammar, vocabulary, or pronunciation, that aim is more likely to be achieved through a pragmatic approach than through a discrete-point approach.

Limitations of Current Language Proficiency Tests

A language proficiency test must be grounded in a theory or model of language proficiency. However, there is no consensus among language scholars about the nature of language proficiency. As a result, a variety of proficiency tests have appeared that differ from one another in fundamental ways. More important still, different proficiency tests produce different language classifications (e.g., non-English speaking, limited English speaking, and fully English proficient) for the same students (Ulibarri, Spencer & Rivas, 1981). Valdés and Figueroa (1994) report that it is not only the quality of the tests that should concern educators but also the design of the language proficiency tests themselves. In particular, there may be a propensity for test developers to use a discrete-point approach to language testing. Valdés and Figueroa (1994) state: "As might be expected, instruments developed to assess the language proficiency of 'bilingual' students borrowed directly from traditions of second and foreign language testing. Rather than integrative and pragmatic, these language assessment instruments tended to resemble discrete-point, paper-and-pencil tests administered orally" (p. 64). Consequently, and to the degree that the above two points are accurate, currently available language proficiency tests not only yield questionable results about students' language abilities; those results are also based on the most impoverished model of language testing. In closing this section of the handbook, consider the advice of Spolsky (1984): those involved with language tests, whether they are developing tests or using their results, have three responsibilities. The first is to avoid certainty: anyone who claims to have a perfect test, or to be prepared to make an important decision on the basis of a single test result, is acting
irresponsibly. The second is to avoid mysticism: whenever we hide behind authority, technical jargon, statistics, or cutely labelled new constructs, we are equally guilty. Thirdly, and this is fundamental, we must always make sure that tests, like dangerous drugs, are accurately labelled and used with considerable care (p. 6). In addition, bear in mind that the above advice applies to any testing situation (e.g., measuring intelligence, academic achievement, self-concept), not only language proficiency testing. Remember also that the use of standardized language proficiency testing, in the context of language minority education, is only about two decades old; much remains to be learned. Finally, there is little doubt that any procedure for assessing a learner's language proficiency must also entail the use of additional, strategically selected measures (e.g., teacher judgments, miscue analysis, writing samples).

The Tests Described

The English language proficiency tests presented in this Guide are the: 1) Basic Inventory of Natural Language (Herbert, 1979); 2) Bilingual Syntax Measure (Burt, Dulay & Hernández-Chávez, 1975); 3) Idea Proficiency Test (Dalton, 1978, 1994); 4) Language Assessment Scales (De Avila & Duncan, 1978, 1991); and 5) Woodcock-Muñoz Language Survey (1993).

Test Descriptions and Publisher Information

Figure 1: Five Standardized English Language Proficiency Tests Included in this Handbook

Basic Inventory of Natural Language (BINL)
CHECpoint Systems, Inc., 1520 North Waterman Ave., San Bernardino, CA 92404; 1-800-635-1235
The BINL (1979) is used to generate a measure of the K-12 student's oral language proficiency. The test must be administered individually and uses large photographs to elicit unstructured, spontaneous language samples from the student, which must be tape-recorded for scoring purposes. The student's language sample is scored for fluency, level of complexity, and average sentence length. The test can be used for more than 32 different languages.

Bilingual Syntax Measure (BSM) I and II
Psychological Corporation, P.O. Box 839954, San Antonio, TX 78283; 1-800-228-0752
The BSM I (1975) is designed to generate a measure of the K-2 student's oral language proficiency; the BSM II (1978) is designed for grades 3 through 12. The oral language sample is elicited using cartoon drawings with specific questions asked by the examiner. The student's score is based on whether or not the student produces the desired grammatical structure in his or her responses. Both the BSM I and BSM II are available in Spanish and English.

Idea Proficiency Tests (IPT)
Ballard & Tighe Publishers, 480 Atlas Street, Brea, CA 92621; 1-800-321-4332
The various forms of the IPT (1978 & 1994) are designed to generate measures of oral proficiency and of reading and writing ability for students in grades K through adult. The oral measure must be individually administered, but the reading and writing tests can be administered in small groups. In general, the tests can be described as discrete-point, measuring content such as vocabulary, syntax, and reading for understanding. All forms of the IPT are available in Spanish and English.

Language Assessment Scales (LAS)
CTB MacMillan McGraw-Hill, 2500 Garden Road, Monterey, CA 93940; 1-800-538-9547
The various forms of the LAS (1978 & 1991) are designed to generate measures of oral proficiency and of reading and writing ability for students in grades K through adult. The oral measure must be individually administered, but the reading and writing tests can be administered in small groups.
In general, the tests can be described as discrete-point and holistic, measuring content such as vocabulary, minimal pairs, listening comprehension, and story retelling. All forms of the LAS are available in Spanish and English.

Woodcock-Muñoz Language Survey
Riverside Publishing Co., 8420 Bryn Mawr Ave., Chicago, IL 60631; 1-800-323-9540
The Language Survey (1993) is designed to generate measures of cognitive aspects of language proficiency, for oral language as well as reading and writing, for individuals 48 months and older. All parts of this test must be individually administered. The test is discrete-point in nature and measures content such as vocabulary, verbal analogies, and letter-word identification. The Language Survey is available in Spanish and English.

Approaches to Assessment

Assessment can be broadly divided into two areas, formal and informal, but as Farr (1991, p. 496) cautions, they really lie on a continuum, because both are based on student performance. Traditional formal assessment looks at what students know at the end of a given period of instruction. Informal assessment looks at how a student knows as well as what he knows. Formal assessments are usually published; informal ones are usually teacher-developed, although there are published measures, including informal reading inventories, checklists, surveys, and interview guides. Obviously, the measure that we as educators choose determines the information that the instrument will yield. Therefore, we must be very clear about our purpose when we choose an assessment instrument. The choice of assessment instrument, from teacher observation to student survey to formal published test, should be informed by the assessor's purpose. Selection of the wrong instrument will not allow inferences appropriate to the assessor's needs.

Traditionally, administrators seeking information about students' success in reading selected published, standardized tests with available normative information, such as the Iowas. This allowed them to compare district performance with statewide and national scores and to comply with Title I requirements. Although the comparisons may have given them confidence in the success of local curriculums, the scores yielded little information that would help guide instruction or curriculum design.

Tests

The preponderance of objective, norm-referenced tests traditionally has offered students little information about themselves as learners. However, the same could be said of the uninformed use of a teacher's pop quiz or the misuse of the portfolio as a mere paper repository. Traditional testing is akin to a behaviorist's view of the learner as the passive recipient of data. Current testing theory is based on the cognitive psychologists' view of the learner as an active construer of meaning from the information available in the environment. We now know, for example, that we should not try to decontextualize test items by using short excerpts in reading that block the reader's use of prior knowledge to construct new information. Short passages prevent skilled readers from using the reading strategies they would employ with a longer passage as they become familiar with the topic and discover the organization of the text. Current theory dictates the use of long passages across a variety of text types and topics to gain a valid indication of reader proficiency. We no longer depend solely on short-answer formats, such as multiple choice, but include open-ended items that permit test takers more latitude to display their reading skills.
Parallel issues arise in the assessment of writing. We no longer assume that students' abilities to revise and edit a given text reflect their abilities to generate, organize, and elaborate original ideas. In short, editing texts is not a complete test of writing proficiency. Current theory holds that any test that purports to be a valid test of writing must include the opportunity for the writer to compose original, well-organized text with varied sentence structures and rich word choice, using the conventions of standard written English.

New Jersey's new 4th-, 8th-, and 11th-grade tests, which are aligned to the language arts literacy standards, reflect much of current theory concerning learning and testing. Not only do they incorporate long reading passages with opportunities for open-ended responses to different text types and theme-based topics, but they also elicit multiple writing samples from students. In addition, they provide opportunities for students to integrate the reading and writing processes through decision making and problem solving in order to compose an original text using information from a reading passage as support. The tests also honor the hallmarks of assessment outlined by Case. They are valid because they measure what they purport to measure; that is, they provide rich contexts for the assessment of meaningful speaking, listening, writing, reading, and viewing behaviors. The new tests are also fair because they are aligned to the language arts literacy standards and indicators that have been published and distributed to educators, who will share them with their students, parents, and the community. Furthermore, this curriculum framework provides the same audiences with vignettes and activities that vividly translate the standards into classroom practices. Teachers can use this material to enhance student attainment of the standards and to foster student success on the new tests.

2.1 Common characteristics across instruments

Bachman's (2000) review of the literature on language tests outlines the development of language testing over the last 20 years. He points out that while testing practice from the mid-1960s and the 1970s tended to be based on a construction of language as skills (listening, speaking, reading, writing) and components (grammar, vocabulary, pronunciation), such constructions were critiqued as new approaches to the study of language emerged. Specifically, in the 1980s, the influence of communicative approaches to language instruction was paramount. Since applied linguists were developing approaches to teaching that focused on the co-construction of meaning and the importance of context-based communication, traditional assessments (such as those developed in the 1970s) were ill-suited to the new approach. In the 1990s, test-makers became concerned with issues such as the development of (a) new research methodologies, such as criterion-referenced measurement, (b) practical advances, such as pragmatics testing, (c) factors that affect test performance, (d) authentic and performance assessment, and (e) ethical considerations of language testing.

2.2 Language constructs represented

The tests reviewed above are based on the assumption that language proficiency can be measured accurately by sampling only discrete aspects such as phonology, syntax, morphology, and lexicon. The tests rarely consider aspects of language that can be crucial to academic success, such as pragmatic competence (Cummins, 2000). In other words, most language proficiency tests limit the construction of
language proficiency to grammatical competence. An important flaw in this construction is that, to assess grammatical competence, tests usually rely on prescriptivist notions of grammar. For instance, if such a test were to assess students' acquisition of the English verb system, an item like (1) below might be presented:

(1) Dad called earlier. He _ (might/ is/ had/ might could) stop by later this evening.

If a student were to fill in the blank with might could, he would probably be penalized, because the Standard English verb system allows one modal verb in that position. However, if that student were a member of the group of native English speakers who make a distinction between (2) and (3) below, such an item would be invalid:

(2) He might stop by later this evening.
(3) He might could stop by later this evening.

While the differences in meaning are subtle and pragmatically determined, in (3) there is less likelihood that "he" will stop by than in (2) (Wolfram and Schilling-Estes, 1998:335). Speakers of the dialect in which sentences like (3) are common need contextual cues in order to distinguish the forcefulness of the assertion. However, a typical language proficiency test would not allow for nuances in meaning made by speakers of so-called non-Standard varieties of English. Furthermore, to limit the construction of language proficiency to a closed set of grammatical categories negates the real need for language learners to master communicative principles that are essential in informal and academic contexts. After all, language learners must develop a range of communicative styles to suit their purposes; a language learner whose repertoire is limited to academic discourse styles cannot be considered fully communicatively competent. Up to this point we have discussed how commonly used tests employ similar constructions of language proficiency, and how this construction of language proficiency is closely linked to prescriptivist notions of standard grammar. In the next section we discuss the criticisms that standardized language proficiency tests have received in test reviews.

2.3 Critiques of the four most commonly-used tests

In addition to the limitation of language proficiency to grammatical competence, other criticisms are revealed in test reviews. These have indicated that some of the common shortcomings are (a) that many test items are not valid (Haber, 1985; Carpenter, 1994; Hedberg, 1995; Kao, 1998), (b) that interrater reliability is low (Crocker, 1998), and (c) that the tests are normed on populations that are not representative of the samples of children to whom these measures are commonly administered (Chesterfield, 1985; Haber, 1985; Shellenberger, 1985; Lopez, 2001). Table 3 includes a summary of the reviews.

PURPOSES OF TESTS

Tests are used to make inferences about individuals' language ability, to make predictions about individuals' ability to use language in contexts outside the test itself, and to make decisions about individuals (Zucker, 2003). Despite this variety, tests generally share some common goals:

• measuring what students know and can do
• improving instruction
• helping students achieve higher standards

The purpose of tests is to provide educators, students, parents, and policy makers with information that is valid, fair, and reliable. Standardized tests provide information that helps support four critically important tasks for educators and the public: Identify the instructional needs of individual students so educators can respond with effective, targeted teaching and appropriate instructional
materials; Judge students' proficiency in essential basic skills and challenging standards and measure their educational growth over time; Evaluate the effectiveness of educational programs; and Monitor schools for educational accountability, including under the NCLB Act. In sum, tests provide information to help students learn more successfully, teachers teach more effectively, and schools be more accountable.

There are limits to testing, however. Tests are a necessary but not the exclusive means to evaluate current achievement and students' growth in skills. What may be tested is not, and cannot be, inclusive of all of the desired outcomes of instruction. Tests should be considered a means to an end and not ends in themselves. Tests should be used in combination with other important types of information, such as teacher judgments of student work and classroom performance plus other individual and group assessments, to measure achievement and growth.

TYPES OF TESTS

High-Stakes Testing

High-stakes testing has consequences attached to the results. For example, high-stakes tests can be used to determine students' promotion from grade to grade or graduation from high school (Resnick, 2004; Cizek, 2001). State testing to document Adequate Yearly Progress (AYP) in accordance with NCLB is called "high-stakes" because of the consequences to schools (and of course to students) that fail to maintain a steady increase in achievement across the subpopulations of the schools (i.e., minority, poor, and special education students).

Low-Stakes Testing

Low-stakes testing has no consequences outside the school, although the results may have classroom consequences, such as contributing to students' grades. Formative assessment is a good example of low-stakes testing.

Formative Assessment

This assessment provides information about learning in process. It consists of the weekly quizzes, tests, and even essays given by teachers to their classes. Teachers and students use the results of formative assessments to understand how students are progressing and to make adjustments in instruction. Rick Stiggins calls it "day-to-day classroom assessment" and claims evidence that it has triggered "remarkable gains in student achievement" (Stiggins, 2004). Summative assessment provokes most of the controversy about testing because it includes the "high-stakes, standardized" testing carried out by the states.

Summative Norm-Referenced Tests

These tests are designed to compare individual students' achievement to that of a "norm group," a representative sample of his or her peers. The design is governed by the normal or bell-shaped curve in the sense that all elements of the test are directed toward spreading out the results on the curve (Monetti, 2003; NASBE, 2001; Zucker, 2003; Popham, 1999). The curve-governed design of norm-referenced tests means that they do not compare students' achievement to standards for what they should know and be able to do; they only compare students to other students who are assumed to be in the same norm group. The Educators' Handbook on Effective Testing (2002) lists the norms frequently used by major testing publishers. For example, the available norms for the Iowa Test of Basic Skills are: districts of similar sizes, regions of the country, socio-economic status, ethnicity, and type of school (e.g., public, Catholic, private non-Catholic), in addition to a representation of students nationally. Purchasers of norm-referenced tests need to ensure that the chosen norm is a useful comparison for their
students. Purchasers should also be sure that the norm has been developed recently, because populations change rapidly: a norm including a small percentage of English language learners can become a norm with almost 50 percent English language learners in less than the ten-year interval before it is revised. Results of norm-referenced tests are frequently reported in terms of percentiles: a score in the 70th percentile means that the student has done better than 70 percent of the others in the norm group (Monetti, 2003). Percentile rankings are often used to identify students for various academic programs, such as gifted and talented, regular, or remedial classes. On a symmetrical bell curve, a score in the 50th percentile is the average. Because norm-referenced tests are designed to spread students' scores along the bell curve, the questions asked in the tests do not necessarily represent the knowledge and skills that all students are expected to have learned. Instead, during the test development process, "test items answered correctly by 80 percent or more of the test takers don't make it past the final cut [into the final test]," writes Popham (1999). Norm-referenced tests lead to frustration on two counts. First, they frustrate the teacher's success in teaching important knowledge and skills, because students are unlikely to face questions about that skill and knowledge on the test (Popham, 1999). Second, no group of students can achieve at higher levels without others achieving at lower levels. Norm-referenced tests make it mathematically impossible for "all the children to be above average" (ERS; Burley, 2002).

Criterion-Referenced Testing (CRT)

Rather than compare a student's test result with the results of a reference group, criterion-referenced tests are intended to measure a level of mastery according to a specific set of performance standards. Hence, the content of a criterion-referenced test often includes more focused subject matter than a norm-referenced test. The test-taker's score corresponds to a performance level, such as basic, proficient, or advanced. NCLB requires each state to design or select an assessment yielding results that can be used to classify students into performance levels for the corresponding academic subject.
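The mechanics of that classification are simple: a scale score is compared against fixed cut points. Here is a minimal sketch in Python; the scores and cut points are invented for illustration, since actual cut scores are set by each state's standard-setting process.

```python
# A minimal sketch of criterion-referenced classification: a scale score is
# mapped to a performance level by fixed cut scores. The cut scores below
# are hypothetical, not any state's actual values.

CUT_SCORES = [(600, "advanced"), (500, "proficient"), (400, "basic")]

def performance_level(scale_score):
    """Return the highest performance level whose cut score is met."""
    for cut, level in CUT_SCORES:
        if scale_score >= cut:
            return level
    return "below basic"

print(performance_level(520))  # proficient
print(performance_level(380))  # below basic
```

Note that nothing in this scheme references other students: in principle, every student can reach "advanced."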
What Is the Difference Between a Criterion-Referenced Test and a Norm-Referenced Test?

All standardized tests now administered to elementary and secondary school students measure student achievement against a set of academic standards or curricular objectives. The standards may be common among the states and major national academic organizations, thus enabling national comparisons. Or the standards may be local standards chosen by the school district or state, which may only allow local comparisons among students in a district or state. There are many ways to report and interpret the results of a standardized test. One way is based on specific criteria, such as academic skills or objectives and academic achievement standards developed at the state or local level. For example, "She has demonstrated mastery of reading at the third-grade level" is a determination made by a criterion-referenced test (CRT). A standardized test also can describe a student's performance compared to other students nationally or locally. For example, "He reads better than 90 percent of fourth-grade students nationally" is a determination made from a norm-referenced test (NRT). A student's score on a CRT using local academic standards is intended to be compared only with those of other students who have taken the same test. In contrast, a student's scores on an NRT can show performance on academic standards and also enable comparisons with students both locally and nationally. When a local CRT is used with a national NRT, the results can be interpreted together to obtain more comprehensive information about a student's performance. For example, "She is 'proficient' on a state-mandated CRT and is performing at an academic level that is better than 70 percent of students nationwide."

Criterion-Referenced Tests

These tests are designed to show how students achieve in comparison to standards, usually state standards (NASBE, 2001; Wilde, 2004; Zucker, 2003). In contrast to norm-referenced tests, it is theoretically possible for all students to achieve the highest, or the lowest, score, because there is no attempt to compare students to each other, only to the standards. Results are reported in levels that are typically basic, proficient, and advanced. The test items are not chosen to sort students but to ascertain whether they have mastered the knowledge and skills contained in the standards. Criterion-referenced tests, sometimes more correctly called standards-based tests, begin from a state's standards, which list the knowledge and skills students are expected to learn. Because standards are usually far more numerous than could ever be included in a test, test designers work with teachers and content specialists to narrow down the standards to essential knowledge and skills at the grades to be tested. These are the basis for the development of test items. The number of criterion-referenced tests in use at the state level has dramatically increased since NCLB was implemented in 2001 (NCES, 2005), because they measure achievement of the knowledge and skills required by state standards. At this writing, 44 states use criterion-referenced assessments: 24 states use only criterion-referenced tests, and the other 20 use both criterion-referenced tests and norm-referenced tests. Thirteen states use "hybrid" tests, single tests that are reported both as norm-referenced tests (in percentiles or stanines, a nine-point scale used for normalized test scores) and as criterion-referenced tests (in basic, proficient, and advanced levels), in an attempt to show at the same time where students score in relation to standards and in relation to a norm group.
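A hybrid report is easy to see in miniature. The sketch below, with an invented norm sample and the hypothetical cut scores from the earlier sketch, prints both readings of the same scale score.

```python
# A minimal sketch of a "hybrid" score report: one invented scale score
# reported both as a percentile rank against a norm group and as a
# performance level against cut scores. All numbers are hypothetical.

def percentile_rank(score, norm_scores):
    """Percent of the norm group scoring below this score."""
    return 100 * sum(s < score for s in norm_scores) / len(norm_scores)

def performance_level(score):
    for cut, level in [(600, "advanced"), (500, "proficient"), (400, "basic")]:
        if score >= cut:
            return level
    return "below basic"

norm_group = [380, 420, 455, 470, 505, 530, 555, 590, 610, 640]  # invented norm sample
score = 520
print(f"{performance_level(score)}; {percentile_rank(score, norm_group):.0f}th percentile")
# -> proficient; 50th percentile
```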
Only one state, Iowa (home of the Iowa Test of Basic Skills and also the only state in the nation without state academic standards), uses a norm-referenced test alone (Education Week, 2006).

STANDARDS-BASED TESTING

Standards-based testing allows states to accomplish both objectives (NRT and CRT) at once by incorporating elements of norm-referenced and criterion-referenced testing. A standards-based test is both normed to a reference group and aligned to a set of performance standards. This framework, also called the augmented NRT model, enables states to report standards-based information (content standards scores), performance levels (cut scores), and percentile rank information for every student. For example, a test publisher can use a state's academic standards to augment an existing norm-referenced test so that the test taker's results can be used both for comparisons to a reference group and for assigning performance levels. Typically, statewide results from the first year that a standards-based test is administered are used to establish the test's reference group. Careful design by the test publisher ensures that the test is valid for measuring student mastery of the academic standards. Because NCLB requires states to report student performance levels while also comparing the results of specified student populations to the results of previous years, properly designed standards-based tests are especially suited to meet NCLB requirements.

Standardized testing means that a test is "administered and scored in a predetermined, standard manner" (Popham, 1999). Students take the same test in the same conditions at the same time, if possible, so results can be attributed to student performance and not to differences in the administration or form of the test (Wilde, 2004). For this reason, the results of standardized tests can be compared across schools, districts, or states. Standardized testing is sometimes used as a shorthand expression for machine-scored multiple-choice tests. As we will see, however, standardized tests can have almost any format. A standardized achievement test is, simply, a test that is developed using standard procedures and is then administered and scored in a consistent manner for all test takers. Students respond to identical or very similar questions under the same conditions and test directions. The standardization of test questions, directions, conditions of testing, and scoring is needed to make test scores comparable and to assure, as much as possible, that test takers have equal, unbiased opportunities to demonstrate what they know and can do. Standardization can apply to any type or format of test. However, some types of educational tests, such as classroom and teacher-developed tests, are not usually considered "standardized" tests because they are given under varying conditions and are scored using variable rules. Standardized tests may be used for a variety of purposes. One purpose of testing is to enable educators to make high-stakes decisions about individual students through measures such as high school graduation tests. In contrast, the annual testing provisions of the NCLB Act are used to inform schools, teachers, and parents about student improvement in the classroom and to hold schools and states accountable for such improvement.

How Are Standardized Tests Used?
Information from standardized tests can be used for many purposes, including:

Supporting instructional decisions for individual students by identifying their instructional needs. A test may be used to diagnose a student's strengths and weaknesses, thus allowing the teacher or school to choose effective instructional programs for the student.

Demonstrating students' proficiency in basic skills and their ability to meet academic standards. Test results are used by states to demonstrate individual student mastery of specified levels of achievement.

Informing parents and the public about school and student performance. States administer standardized assessments and report the results, in part to inform the public about how well the schools and their students are progressing over time and compared to other localities or schools. Many states and districts publish annual report cards on school districts and individual schools. The results of the tests can motivate education reform by informing and influencing parents to take action to improve the quality of local schools.

Holding schools and educators accountable for student performance on tests aligned to high standards of what students should know and be able to do. Consequences are often attached to test results and may include school improvement plans, technical assistance, increased or decreased funding for schools, salary bonuses, promotions, loss of accreditation, and takeovers of local schools by the state. Such consequences are used to leverage change at the school and classroom level.

Evaluating programs. Many federal and state education programs use standardized tests to determine whether public policy objectives are being achieved and whether public funds are well spent.

Determining rewards and sanctions. Tests may be used for high-stakes purposes, with rewards and sanctions, to make decisions about individual students, such as placement in specific programs or classes, graduation from high school, or promotion to the next grade.

TEST FORMATS

Multiple-choice questions: Many standardized tests require students to select a single correct response to each test question (called an "item") from among a small number of specific choices. This format, called "multiple choice" or "selected response," is efficient, practical, and usually produces highly reliable results. Multiple-choice tests offer the advantages of objectivity and uniformity in scoring, ease of administration, and low cost (see the scoring sketch after these format descriptions).

Performance assessment questions: Performance assessments require students to generate a response to a question rather than choose from a set of responses provided to them. Examples include exhibitions, investigations, demonstrations, written or oral responses, journals, and portfolios. Performance assessments can be given and scored according to standard procedures and rules, so a test containing performance assessment questions is still a standardized test. Performance assessments typically focus on the process of problem solving rather than on answers or solutions. Tests including performance assessments, however, are generally less reliable, more difficult to score, and more costly than tests using multiple-choice items.

Constructed-response questions: Constructed-response items may be one type of performance assessment, in which students are given the opportunity to fill in a blank or provide a brief written response to a question rather than select from an array of possible answers. Constructed-response questions are often included, along with multiple-choice questions, on a test to obtain additional and different types of information about what a student knows or can do.
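The cost and objectivity contrast among these formats comes down to scoring. A minimal sketch, with an invented answer key, of why selected-response items are cheap and objective: machine scoring is a key lookup, with no human judgment involved. Constructed responses, by contrast, need the human rubric scoring described in the next section.

```python
# A minimal sketch of machine scoring for selected-response items: the raw
# score is a key lookup. The answer key and responses are invented.

ANSWER_KEY = {"Q1": "B", "Q2": "D", "Q3": "A", "Q4": "C"}

def machine_score(responses):
    """Return the number-correct raw score for one student's answer sheet."""
    return sum(responses.get(item) == key for item, key in ANSWER_KEY.items())

print(machine_score({"Q1": "B", "Q2": "C", "Q3": "A", "Q4": "C"}))  # 3
```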
Test Question Formats

While there is no set format for all questions on standardized tests, the most common standardized test question formats are multiple-choice questions and short-answer questions.

Short-answer Questions

The short-answer question format, also known as the open-ended or constructed-response format, presents the test taker with a question that is answered by filling in a blank or writing a short response. Answers to constructed-response questions are hand-scored using a rubric that allows for a range of acceptable and partially correct answers. Questions and answers in this format provide a more sophisticated evaluation of student performance than selected-response questions. However, the reliability of scores obtained using constructed-response questions depends more heavily on the scoring method. Carefully designed constructed-response questions with a clear scoring rubric can provide important information about student performance and knowledge that cannot be as effectively demonstrated by the selected-response format.

Open-Ended Tests

These test items ask students to respond either by writing a few sentences in short-answer form or by writing an extended essay. Open-ended questions are also known as "constructed response" because test takers must construct their response, as opposed to selecting a correct answer (Zucker, 2003). The advantage of open-ended items is that they allow a student to display knowledge and apply critical thinking skills. It is particularly difficult to assess writing ability, for example, without an essay or writing sample. The disadvantage is that constructed-response items require human readers, although attempts are being made to develop computer programs to score essays (Sireci, 2000; Rudner, 2001; Shermis, 2001). Short-answer questions can be scored by looking for key terms, since they often don't ask for complete sentences, but many state assessments ask for an extended essay, often in separate tests from the one used to report AYP. Companies across the United States assemble groups of qualified people, often retired teachers when they can get them, to read and score essays or long answers using a common rubric for scoring (Stover, 1999). A rubric is a guide to scoring that provides a detailed description of essays that should be given a particular score (frequently one to six points, with six being the best). After extensive training with models of each score, two readers rate an essay independently. If their scores differ, a third reader reads the essay without knowing the two preceding scores. Group scoring of essays has a long history and has proved to be remarkably reliable (Mitchell, 1992). Essays and long answers have the desirable effect of promoting more writing and writing instruction in the classroom, but they are expensive to score. Multiple-choice testing is less expensive because it is scored by machine (ERS; NASBE, 2001). Differences in cost can be gauged from a U.S. General Accounting Office report estimating that from 2002 to 2008, states will spend $1.9 billion on mandated testing if they use only machine-scored multiple-choice tests. States will spend $3.9 billion if they maintain the present mixture of multiple-choice and a few open-ended items. They will spend $5.3 billion if they increase the use of open-ended items, including essays, making the cost of using open-ended items more than 2.5 times the amount of using multiple-choice tests alone (GAO, 2003). Clearly, the difference in cost makes testing choices difficult.
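The two-reader rubric protocol described above can be written down compactly. A minimal sketch, assuming a 1-6 rubric; the resolution rule for disagreements (keep the original score closest to the adjudicator's) is one common convention assumed here, since programs differ in how they resolve splits.

```python
# A minimal sketch of double scoring with blind adjudication: two readers
# score independently on a 1-6 rubric; a third reads only on disagreement,
# without seeing the first two scores. The resolution rule is an assumption.

def score_essay(reader_a, reader_b, adjudicator, essay):
    """Each rater is a function mapping an essay to an integer score, 1-6."""
    a, b = reader_a(essay), reader_b(essay)
    if a == b:
        return a
    c = adjudicator(essay)  # blind third reading
    return min((a, b), key=lambda s: abs(s - c))

# Hypothetical raters standing in for trained human readers.
print(score_essay(lambda e: 4, lambda e: 4, lambda e: 6, "text"))  # 4 (no adjudication)
print(score_essay(lambda e: 3, lambda e: 5, lambda e: 5, "text"))  # 5 (third reader sides with 5)
```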
Performance Assessment

Also called authentic assessment, performance assessment challenges students to perform a task just as it would be performed in the classroom or in life (e.g., a science experiment, a piano recital). Performance assessment was widely promoted in the early 1990s (Mitchell, 1992), but it is time-consuming, difficult to standardize, and expensive.

Portfolios

Portfolios are a type of performance assessment that was also popular before 2001, when state testing in accordance with NCLB came to dominate. Portfolios are collections of student work designed to show growth over a semester or a year. However, they are difficult to evaluate accurately, because their production and contents cannot be standardized (Gearhart, 1993). Both portfolios and performance assessment are now used as formative rather than summative assessment.

QUALITIES OF AN EFFECTIVE TEST

The requirements of NCLB pose a significant challenge to state educational systems: all students must have the same chance to be successful at showing what they know and can do in periodic, high-stakes assessments. Consequently, states must select or design high-quality tests that can be used by the general student population while meeting the special requirements of certain groups and even the needs of individual students. Moreover, the high stakes involved compel states to be certain that the tests accurately measure student achievement. All standardized tests must meet psychometric (test study, design, and administration) standards for reliability, validity, and lack of bias (Zucker, 2003; Bracey, 2002; Joint Committee on Testing Practices, 2004). For a test to solve this combination of challenges effectively, it must be proven to be:

• Reliable – The test must produce consistent results. Reliability means that the test is so internally consistent that a student could take it repeatedly and get approximately the same score (one common index of internal consistency is sketched below).
• Valid – The test must be shown to measure what it is intended to measure.
• Unbiased – The test should not place students at a disadvantage because of gender, ethnicity, language, or disability.
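Internal consistency, the sense of reliability named in the first bullet, is commonly summarized with Cronbach's alpha. A minimal sketch with invented 0/1 item scores; this is one standard index, not the only reliability evidence publishers report.

```python
# A minimal sketch of Cronbach's alpha, a common internal-consistency index:
# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).
# The item scores below are invented for illustration.

from statistics import pvariance

def cronbach_alpha(item_scores):
    """item_scores[i][j] is student j's score on item i."""
    k = len(item_scores)                         # number of items
    n = len(item_scores[0])                      # number of students
    totals = [sum(item[j] for item in item_scores) for j in range(n)]
    return k / (k - 1) * (1 - sum(pvariance(i) for i in item_scores) / pvariance(totals))

items = [  # 4 items answered by 5 students (1 = correct, 0 = incorrect)
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1],
]
print(round(cronbach_alpha(items), 2))  # ~0.7; values nearer 1 mean more consistent
```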
References

ACT, Inc., & The Education Trust. (2004). On course for success: A close look at selected high school courses that prepare all students for college and work. Washington, DC: The Education Trust. Available: http://www.act.org/path/policy/pdf/success_report.pdf
Bracey, G. W. (2002). Put to the test: An educator's and consumer's guide to standardized testing (2nd ed.). Bloomington, IN: Phi Delta Kappa International.
Burley, H. (2002, February). A measure of knowledge. American School Board Journal, 18(2).
Cannell, J. J. (1987). Nationally normed elementary achievement testing in America's public schools: How all fifty states are above the national average. West Virginia: Friends for Education.
Cizek, G. J. (1998). Filling in the blanks: Putting standardized tests to the test. Washington, DC: The Thomas B. Fordham Foundation.
Cizek, G. J. (2001, Winter). More unintended consequences of high-stakes testing. Educational Measurement: Issues and Practice, 20(4), 19-28.
Darling-Hammond, L. (2004, June). Standards, accountability, and school reform. Teachers College Record, 106(6), 1047-1085.
Data connections: Using assessment to improve teaching and learning [CD-ROM]. (2002). Charleston, WV: Edvantia (formerly Appalachian Educational Laboratory).
Dickinson, A. C., Friedman, M. I., Hatch, C. W., Jacobs, J. E., Nickerson, A. B., & Schnepel, K. C. (2002). Educators' handbook on effective testing. Columbia, SC: Institute for Evidence-Based Decision-Making in Education.
Educational Research Service. (n.d.). Focus on high-stakes testing. Arlington, VA: Educational Research Service.
Education Week. (2006). Quality counts at 10. Washington, DC: Editorial Projects in Education.
General Accounting Office. (2003). Characteristics of tests will influence expenses: Information sharing may help states realize efficiencies. Washington, DC: United States General Accounting Office.
Gearhart, M., Herman, J. L., Baker, E. L., & Whittaker, A. K. (1993, July). Whose work is it? A question for the validity of large-scale portfolio assessment (CSE Technical Report 363). Available: http://www.cse.ucla.edu/products/Reports/TECH363.pdf
Goldberg, M. (2005, January). Test mess 2: Are we doing better a year later?
Phi Delta Kappan, 86(5), 389-400.
Herman, J. L., & Baker, E. L. (2005, November). Making benchmark testing work. Educational Leadership, 63(3), 49-53.
Joint Committee on Testing Practices. (2004). Code of fair testing practices in education (Revised). Washington, DC: American Psychological Association.
Lemann, N. (1999). The big test. New York: Farrar, Straus and Giroux.
Linn, R. L. (2005, Summer). Fixing the NCLB accountability system (CRESST Policy Brief). Available: http://www.cse.ucla.edu/products/policy/cresst_policy8.pdf
McIntire, T. (2005, April). Data: Maximize your mining, part one. Technology and Learning, 25(9).
Mitchell, R. (1992). Testing for learning: How new approaches to evaluation can improve American schools. New York: Free Press.
Monetti, D. M., & Hinkle, K. T. (2003). Five important test interpretation skills for school counselors. ERIC Digest ED481472.
National Association of State Boards of Education. (2001). A primer on state accountability and large-scale assessments. Available: http://www.nasbe.org/Educational_Issues/Reports/Assessment.pdf
National Education Goals Panel. (1998). Talking about tests: An idea book for state leaders. Washington, DC: United States Department of Education.
National Center for Education Statistics. (2005). State education reforms: Standards, assessment, and accountability. Table 1.5, Names and types of statewide assessments administered, by state: 2003-4 [Online report]. Retrieved December 7, 2005, from http://nces.ed.gov/programs/statereform/saa_tab5.asp
National Center for Education Statistics. (2005, August). Online assessment in mathematics and writing: Reports for the NAEP technology-based assessment project, research and development series. Washington, DC: United States Department of Education.
Popham, J. W. (1999, March). Why standardized tests don't measure educational quality. Educational Leadership, 56(6), 8-15.
Princeton Review. (2003). Testing the testers 2003: An annual ranking of state accountability systems. Available: http://testprep.princetonreview.com/testingtesters/report.asp
Resnick, B. (2004, April). Majority of districts/schools employ "high-stakes" testing. Successful School Marketer. Retrieved December 9, 2005, from http://www.schooldata.com/ssm-resnick-majority.htm
Resnick, M. (2004). The educated student: Defining and advancing student achievement. Alexandria, VA: National School Boards Association.
Rudner, L., & Gagne, P. (2001). An overview of three approaches to scoring written essays by computer. ERIC Digest ED458290.
Shermis, M. D., Rasmussen, J. L., Rajecki, D. W., Olson, J., & Marsilio, C. (2001). All prompts are created equal, but some prompts are more equal than others. Journal of Applied Measurement, 2(2), 154-70.
Sireci, S. G., & Rizavi, S. (2000). Comparing computerized and human scoring of students' essays (ERIC report number 354). New York: The College Board.
Stiggins, R. (2004, September). New assessment beliefs for a new school mission. Phi Delta Kappan, 88(1), 22-27.
Stokes, V. (2005, October). No longer a year behind. Learning and Leading with Technology, 33(2), 15-17.
Stover, D. (1999, March 23). Who grades the essays on standardized tests?
School Board News, p.
Toch, T. (2006, January). Margins of error: The education testing industry in the No Child Left Behind era. Washington, DC: Education Sector.
Wilde, J. (2004, January). Definitions for the No Child Left Behind Act of 2001: Assessment. Washington, DC: National Clearinghouse for English Language Acquisition (NCELA).
Zucker, S. (2003, December). Fundamentals of standardized testing. San Antonio, TX: Harcourt Assessment, Inc.

References

American Psychological Association. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Amori, B. A., Dalton, E. F., & Tighe, P. L. (1992). IPT Reading & Writing, Grades 2-3, Form 1A, English. Brea, CA: Ballard & Tighe, Publishers.
Anastasi, A. (1988). Psychological testing (6th ed.). New York, NY: Macmillan Publishing Company.
Ballard, W. S., Tighe, P. L., & Dalton, E. F. (1979, 1982, 1984, & 1991). Examiner's manual, IPT I, Oral Grades K-6, Forms A, B, C, and D, English. Brea, CA: Ballard & Tighe, Publishers.
Ballard, W. S., Tighe, P. L., & Dalton, E. F. (1979, 1982, 1984, & 1991). Technical manual, IPT I, Oral Grades K-6, Forms C and D, English. Brea, CA: Ballard & Tighe, Publishers.
Burt, M. K., Dulay, H. C., & Hernández-Chávez, E. (1976). Bilingual Syntax Measure I, technical handbook. San Antonio, TX: Harcourt, Brace, Jovanovich, Inc.
Burt, M. K., Dulay, H. C., Hernández-Chávez, E., & Taleporos, E. (1980). Bilingual Syntax Measure II, technical handbook. San Antonio, TX: Harcourt, Brace, Jovanovich, Inc.
Canale, M. (1984). On some theoretical frameworks for language proficiency. In C. Rivera (Ed.), Language proficiency and academic achievement. Avon, England: Multilingual Matters Ltd.
Canales, J. A. (1994). Linking language assessment to classroom practices. In R. Rodriguez, N. Ramos, & J. A. Ruiz-Escalante (Eds.),
Compendium of readings in bilingual education: Issues and practices. Austin, TX: Texas Association for Bilingual Education.
CHECpoint Systems, Inc. (1987). Basic Inventory of Natural Language: Authentic language testing, technical report. San Bernardino, CA: CHECpoint Systems, Inc.
Council of Chief State School Officers. (1992). Recommendations for improving the assessment and monitoring of students with limited English proficiency. Alexandria, VA: Council of Chief State School Officers, Weber Design.
CTB MacMillan McGraw-Hill. (1991). LAS preview materials: Because every child deserves to understand and be understood. Monterey, CA: CTB MacMillan McGraw-Hill.
Cummins, J. (1984). Wanted: A theoretical framework for relating language proficiency to academic achievement among bilingual students. In C. Rivera (Ed.), Language proficiency and academic achievement. Avon, England: Multilingual Matters Ltd.
Dalton, E. F. (1979, 1982, 1991). IPT Oral Grades K-6 technical manual, IDEA Oral Language Proficiency Test, Forms C and D, English. Brea, CA: Ballard & Tighe, Publishers.
Dalton, E. F., & Barrett, T. J. (1992). Technical manual, IPT 1 & 2, Reading and Writing, Grades 2-6, Forms 1A and 2A, English. Brea, CA: Ballard & Tighe, Publishers.
De Avila, E. A., & Duncan, S. E. (1990). LAS, Language Assessment Scales, oral technical report, English Forms 1C, 1D, 2C, 2D; Spanish Forms 1B, 2B. Monterey, CA: CTB MacMillan McGraw-Hill.
De Avila, E. A., & Duncan, S. E. (1981, 1982). A convergent approach to oral language assessment: Theoretical and technical specifications on the Language Assessment Scales (LAS), Form A. Monterey, CA: CTB McGraw-Hill.
De Avila, E. A., & Duncan, S. E. (1987, 1988, 1989, 1990). LAS, Language Assessment Scales, oral administration manual, English Forms 2C and 2D. Monterey, CA: CTB MacMillan McGraw-Hill.
Duncan, S. E., & De Avila, E. A. (1988). Examiner's manual: Language Assessment Scales Reading/Writing (LAS R/W). Monterey, CA: CTB/McGraw-Hill.
Durán, R. P. (1988). Validity and language skills assessment: Non-English background students. In H. Wainer & H. I. Braun (Eds.), Test validity. Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
National Commission on Testing and Public Policy. (1990). From gatekeeper to gateway: Transforming testing in America. Chestnut Hill, MA: National Commission on Testing and Public Policy.
Oller, J. W., Jr., & Damico, J. S. (1991). Theoretical considerations in the assessment of LEP students. In E. Hamayan & J. S. Damico (Eds.), Limiting bias in the assessment of bilingual students. Austin: Pro-Ed Publications.
Rivera, C. (1995). How can we ensure equity in statewide assessment programs?
Unpublished document, Evaluation Assistance Center-East, George Washington University, Arlington, VA.
Roos, P. (1995). Rights of limited English proficient students under federal law: A guide for school administrators. Unpublished paper presented at the Success for All Students Conference, Weber State University, Ogden, UT.
Spolsky, B. (1984). The uses of language tests: An ethical envoi. In C. Rivera (Ed.), Placement procedures in bilingual education: Education and policy issues. Avon, England: Multilingual Matters Ltd.
Ulibarri, D., Spencer, M., & Rivas, G. (1981). Language proficiency and academic achievement: A study of language proficiency tests and their relationship to school ratings as predictors of academic achievement. NABE Journal, 5(3).
Valdés, G., & Figueroa, R. (1994). Bilingualism and testing: A special case of bias. Norwood, NJ: Ablex Publishing Corporation.
Wheeler, P., & Haertel, G. D. (1993). Resource handbook on performance assessment and measurement: A tool for students, practitioners, and policymakers. Berkeley, CA: The Owl Press.
Woodcock, R. W., & Muñoz-Sandoval, A. F. (1993). Woodcock-Muñoz Language Survey comprehensive manual. Chicago, IL: Riverside Publishing Company.

Table 3: Critiques of the four most commonly used tests

LAS
View of language: Language consists of discrete skills and elements.
Problematic aspects: Hedberg (1995): the LAS-Oral is inadequate for placing language-minority students because of inadequate standardization procedures. Carpenter (1994): the LAS Reading/Writing is inappropriate for making entry and exit decisions; teacher judgement would be just as valid.

IPT
View of language: Language consists of discrete skills and elements.
Problematic aspects: Lopez (2001): norming procedures limit the test's validity for a wide range of U.S. students; there is greater emphasis on discrete aspects of language proficiency and less on pragmatic competence; no studies were conducted to investigate how test content relates to achievement. Ochoa (2001): the standardization sample is not representative of the range of U.S. English speakers, nor is the Spanish version representative of the range of Spanish speakers in the U.S.

WMLS
View of language: Cummins' BICS/CALP distinction.
Problematic aspects: Crocker (1998): to account for construct validity, the test makers rely on intercorrelations, not on an explanation of the underlying traits the test attempts to measure. Kao (1998): the test makers provide insufficient information about validity, and there is little explanation of the Cognitive-Academic Language Proficiency (CALP) construct. Schrank, Fletcher, and Guajardo Alvarado (1996):

LAB
View of language: Language consists of discrete skills and elements.
Problematic aspects: Chesterfield (1985): the LAB is problematic for identification of students for bilingual programs, contains unnecessary items, and is inadequate for predicting success or as a basis for intervention.

References

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
August, D., & Hakuta, K. (Eds.). (1997). Improving schooling for language-minority children: A research agenda. Washington, DC: National Academy Press.
Bachman, L. (2000). Modern language testing at the end of the century: Assuring that what we count counts. Language Testing, 17(1), 1-42.
Bowman, B. T., Donovan, M. S., & Burns, M. (Eds.).
Burt, M. K., Dulay, H. C., Hernández-Chávez, E., & Taleporos, E. (1980). Bilingual Syntax Measure II, Technical Handbook. San Antonio, TX: Harcourt Brace Jovanovich.
Carpenter, C. D. (1994). Review of Language Assessment Scales, Reading and Writing. Supplement to the Eleventh Mental Measurements Yearbook. Lincoln, NE: University of Nebraska Press.
Chesterfield, K. B. (1985). Review of Language Assessment Battery. The Ninth Mental Measurements Yearbook, Volume I. Lincoln, NE: University of Nebraska Press.
Crocker, L. (1998). Review of the Woodcock-Muñoz Language Survey. The Thirteenth Mental Measurements Yearbook. Lincoln, NE: University of Nebraska Press.
Cummins, J. (2000). Language, power, and pedagogy: Bilingual children in the crossfire. Clevedon, UK: Multilingual Matters Ltd.
Cummins, J., Muñoz-Sandoval, A. F., Alvarado, C. G., & Ruef, M. L. (1998). The Bilingual Verbal Ability Tests. Itasca, IL: Riverside.
Dalton, E. F. (1991). IPT Oral Grades K-6 Technical Manual, IDEA Oral Language Proficiency Test, Forms C and D, English. Brea, CA: Ballard & Tighe, Publishers.
De Avila, E. A., & Duncan, S. E. (1990). Language Assessment Scales, Oral Technical Report: English, Forms 1C, 1D, 2C, 2D; Spanish, Forms 1B, 2B. Monterey, CA: CTB Macmillan/McGraw-Hill.
Del Vecchio, A., & Guerrero, M. (1995). Handbook of language proficiency tests. Albuquerque, NM: Evaluation Assistance Center-Western Region, New Mexico Highlands University.
Garcia, E. (1985). Review of Bilingual Syntax Measure II. The Ninth Mental Measurements Yearbook, Volume I. Lincoln, NE: University of Nebraska Press.
Garcia, G. E., & Pearson, P. D. (1994). Assessment and diversity. Review of Research in Education, 20, 337-391.
Gee, J. P. (2003). Opportunity to learn: A language-based perspective on assessment. Assessment in Education, 10, 27-46.
Guyette, T. (1985). Review of Basic Inventory of Natural Language. The Ninth Mental Measurements Yearbook, Volume I. Lincoln, NE: University of Nebraska Press.
Guyette, T. (1994). Review of Language Assessment Scales, Reading and Writing. Supplement to the Eleventh Mental Measurements Yearbook. Lincoln, NE: University of Nebraska Press.
Harris Stefanakis, E. (1998). Whose judgement counts?: Assessing bilingual children, K-3. Portsmouth, NH: Heinemann.
Haber, L. (1985). Review of Language Assessment Scales. The Ninth Mental Measurements Yearbook, Volume I. Lincoln, NE: University of Nebraska Press.
Hedberg, N. L. (1995). Review of Language Assessment Scales Oral. The Twelfth Mental Measurements Yearbook. Lincoln, NE: University of Nebraska Press.
Kao, C. (1998). Review of the Woodcock-Muñoz Language Survey. The Thirteenth Mental Measurements Yearbook. Lincoln, NE: University of Nebraska Press.
Kindler, A. (2002). Survey of the states' limited English proficient students and available educational programs and services: 2000-2001 summary report. Washington, D.C.: National Clearinghouse for English Language Acquisition and Language Instruction Educational Programs.
Lopez, E. A. (2001). Review of the IDEA Oral Language Proficiency Test. The Fourteenth Mental Measurements Yearbook. Lincoln, NE: University of Nebraska Press.
MacSwan, J., Rolstad, K., & Glass, G. V. (2002). Do some school-age children have no language? Some problems of construct validity in the Pre-LAS Español. Bilingual Research Journal, 26, 213-238.
Macías, R. (1998). Summary Report of the Survey of the States' Limited English Proficient Students and Available Educational Programs and Services, 1995-96. Washington, D.C.: National Clearinghouse for Bilingual Education.
McLaughlin, B., Gesi Blanchard, A., & Osanai, Y. (1995). Assessing language development in bilingual preschool children. Washington, D.C.: National Clearinghouse for Bilingual Education.
Messick, S. (1988). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed.). New York: American Council on Education/Macmillan.
No Child Left Behind Act (2001). Retrieved October 2, 2002, from http://www.nochildleftbehind.gov
Ochoa, S. H. (2001). Review of the IDEA Oral Language Proficiency Test. The Fourteenth Mental Measurements Yearbook. Lincoln, NE: University of Nebraska Press.
Rueda, R. (in press). Student learning and assessment: Setting an agenda. In P. Pedraza & M. Rivera (Eds.), National Latino/a Education Research Agenda Project.
Shellenberger, S. (1985). Review of Bilingual Syntax Measure II. The Ninth Mental Measurements Yearbook, Volume I. Lincoln, NE: University of Nebraska Press.
Tidwell, P. S. (1995). Review of Language Assessment Scales Oral. The Twelfth Mental Measurements Yearbook. Lincoln, NE: University of Nebraska Press.
Valdés, G., & Figueroa, R. A. (1994). Bilingualism and testing: A special case of bias. Norwood, NJ: Ablex.
Valdés, G. (2001). Learning and not learning English: Latino students in American schools. New York: Teachers College Press.
Wolfram, W., & Schilling-Estes, N. (1998). American English. Malden, MA: Blackwell.
Woodcock, R. W., & Muñoz-Sandoval, A. F. (1993). Woodcock-Muñoz Language Survey, Comprehensive Manual. Chicago, IL: Riverside Publishing Company.