Estimating Guessing Effects on the Vocabulary Levels Test for Differing Degrees of Word Knowledge

BRIEF REPORTS AND SUMMARIES
TESOL QUARTERLY, Vol. 45, No. 2, June 2011

TESOL Quarterly invites readers to submit short reports and updates on their work. These summaries may address any areas of interest to Quarterly readers.

Edited by
ALI SHEHADEH, United Arab Emirates University
ANNE BURNS, Macquarie University

JEFFREY STEWART
Kyushu Sangyo University
Fukuoka, Japan

DAVID A. WHITE
Harvard University
Cambridge, Massachusetts, United States

doi: 10.5054/tq.2011.254523

The Vocabulary Levels Test (Nation, 1990) has been referred to by Meara as “the nearest thing we have to a standard test in vocabulary” (Meara, 1996, p. 38). Multiple-choice tests such as the Vocabulary Levels Test (VLT) are often viewed as preferable estimators of vocabulary knowledge when compared to yes/no checklists, because self-reporting tests introduce the possibility of students overreporting or underreporting scores. However, multiple-choice tests have their own unique disadvantages. Simply put, if a multiple-choice test lists possible answers, there is a possibility that test takers will guess the correct answer regardless of their knowledge or ability. It has long been acknowledged that guessing on multiple-choice tests affects test reliability and inflates scores (Zimmerman & Williams, 1965; Baldauf, 1982), and scoring formulas such as cfg (Huibregtse, Admiraal, & Meara, 2002) have been proposed to adjust scores for guessing, under the assumption that the probability of a correct guess is consistent among test takers. In item response theory, a family of approaches that link person ability to item difficulty with probabilistic models (Brown & Hudson, 2002), the three-parameter logistic model (Birnbaum, 1968) has been developed to consider the effect of guessing on estimations of ability. However, estimating the effects of guessing on tests such as the VLT is complicated by the fact that distractors
are chosen from the same frequency level of words as the correct answer, and therefore from the tested domain. This introduces the possibility that increases in scores due to guessing could vary depending on the proportion of words in the tested domain known by the test taker. Determining the relationship between proportions of words known and score increases due to guessing is the goal of this study.

BACKGROUND

This study arises from previous research by one of the authors (Stubbe, Stewart, & Pritchard, 2010) on the validity of yes/no vocabulary tests developed by Meara and Buxton (1987), which ask test takers to self-report which words they know on checklists. There are concerns that students may overestimate vocabulary on such checklists; Chall and Dale (1950, cited in Anderson & Freebody, 1981) reported that test takers tended to overestimate vocabulary at a rate of approximately 11%, and Janssens (1999, cited in Beeckmans, Eyckmans, Janssens, Dufranne, & Van de Velde, 2001) found that a majority (69%) of learners studied overestimated their vocabulary knowledge on yes/no tests.

In the recent study by Stubbe, Stewart, and Pritchard (2010), scores on the yes/no tests were compared to subsequent scores on a bilingual vocabulary test of the same words using the format of the VLT, for the purpose of determining the potential effects of vocabulary overestimation by learners using the tests. Interestingly, scores on the VLT-style test were substantially higher, with a mean of 70.9%, compared to 50.7% on the yes/no tests (N = 97). It was concluded from the results that, in contrast to some similar studies (e.g., Mochida & Harrington, 2006), the participants in this experiment (lower level Japanese university students) had a tendency to underestimate their vocabulary sizes on checklists. However, it remained unclear to what degree the difference in scores could be accounted for by guessing effects made possible by the VLT’s multiple-choice format, leading to questions regarding the
degree to which the VLT could inflate reports of words known by test takers, and how these figures could vary depending on the proportion of tested words known by learners. Were we to assume students’ self-estimates were accurate and that they knew 50% of the tested words, what increase in scores could we expect due to guessing? With this parameter known, the extent to which scores differed due to genuine underestimation on the yes/no form could be determined.

Format of the Vocabulary Levels Test

The VLT employs a format with three questions sharing six words as possible choices (see Figure 1). In addition to the original monolingual version, which uses English definitions, there are numerous bilingual versions (see Figure 2).

FIGURE 1. Example of the VLT format, original monolingual version.

The dependencies on the VLT created by clustering items in groups of three have been noted, though previous research has determined that the dependency does not have an overly adverse effect on test reliability (Beglar & Hunt, 1999). However, it should be noted that the principal analysis conducted by Beglar and Hunt, the Rasch model (Rasch, 1960), assumes minimal guessing. Wright (1995) argues that mean square fit statistics, designed to detect inconsistencies between predicted and observed responses by test takers on items, can compensate for the lack of a guessing parameter (c_i) in Rasch measurement. However, underlying this view is the assumption that guesses are entirely random in nature and that therefore the test takers’ level of knowledge does not affect the component of the score due to guessing. Although the fit statistics are useful for examining unpredicted or erratic response patterns in test takers (for example, a test taker of low ability correctly answering a difficult item), they do not account for systemic guessing effects that are consistent throughout the test and predictable at the test takers’ level of ability (Martin, del Pino, & De Boeck, 2006).
Consequently, the extent to which the VLT question format is subject to consistent guessing effects has not been as thoroughly explored.

FIGURE 2. Example of the VLT format, Japanese version.

How does a learner’s vocabulary size (which the VLT attempts to measure) affect the accuracy rate of a learner’s guessing, and the overall score increase from guessing? Assume first that the test taker knows none of the words to be matched and none of the distractors in a given set of three words to be matched. They must then guess three times, each time with a 1/6 probability of choosing the correct answer. Now suppose that the learner still knows none of the words to be matched, but does know two of the distractors within a given set. They still must guess three times, but due to process of elimination, they now have a 1/4 chance of getting each guess right. Clearly, then, increased levels of vocabulary knowledge can lead to increased efficacy of guessing. At the same time, as knowledge levels increase, learners are more likely to know the correct answers and thus less likely to need to guess in the first place. The relationship between levels of knowledge and guessing efficacy is thus somewhat complicated.

FINDING THE EXPECTED SCORE GIVEN A LEVEL OF WORD KNOWLEDGE

For this study, the precise relationship between the proportion of words a student knows and their expected score was determined using elementary probability theory. The assumption is that students will be able to choose the correct L1 translation or English definition given its corresponding English word if and only if they would know that English word given the definition or translation. Applying Bayes’ rule and the properties of a binomial distribution, which gives the probability of each number of successful outcomes in a number of independent trials, the following formula was derived:

$$ S \sum_{i=0}^{3} \sum_{j=0}^{3} \left( i + \frac{3-i}{6-i-j} \right) \binom{3}{i} \binom{3}{j} \, p^{\,i+j} (1-p)^{6-i-j} $$

where S is the number of sets and p
is the proportion of Japanese (L1)–English word pairs that the student actually knows. For the sake of conciseness, the formula contains indeterminate forms (the 0/0 term that arises when all six words in a set are known). These should be evaluated to zero. Algebraic simplification and a passage from the absolute score to the percentage yields the following, more elegant formula:

$$ E = p + \frac{1}{6} - \frac{p^{6}}{6} $$

where E is the expected score as a percentage. Note that p refers not to the proportion of words known in a given test but rather to the proportion of words known in the total set of testable words. Furthermore, it should be noted that prior research has shown that scores of known words can be somewhat lower when the monolingual format of the VLT is used, particularly when higher frequency words are tested (Stewart, 2009). This is due to a change in the construct of what constitutes a known word between test formats. For the purposes of the formula, this means that different formats of the test may correspond to differing values of p. However, the formula itself may be applied to either format.

FIGURE 3. Expected VLT scores given percentage of words known.

Expected scores, adjusted for guessing, given the percentage of words known are shown in Figure 3. When zero words are presumed known on a 99-item test, the expected increase in points due to guessing is 16.7. Interestingly, this 16.7-point discrepancy remains nearly constant as knowledge levels increase, despite the fact that as more words are known, fewer words remain to be guessed. For example, if a student knows zero words on a 100-item test, the predicted 16.7% correctness rate for guessing would on average lead to a 16.7-point increase from guessing. However, if a student knows 50 of the words on the test, there are only 50 remaining words for which guessing is possible. Therefore, a 16.7% correctness rate for guessing the remaining words would only lead to an 8.3-point increase overall. This effect is evident also in the simplified formula for
the expected score. We observe from the formula that for any given level of knowledge p, the expected observed score can be simplified to the sum of the knowledge level p and the term $1/6 - p^{6}/6$, which expresses the expected contribution to the score due to guessing. For low values of p (below 0.6), $p^{6}/6$ is below 0.01, so that the contribution due to guessing stays steady at approximately 1/6, or roughly 16.7%. However, the expected contribution due to guessing falls for higher p, and indeed is zero when p = 1. Qualitatively speaking, as knowledge levels rise, fewer words remain to be guessed, but those that remain can be guessed more accurately due to the process of elimination. For low levels of knowledge these two tendencies cancel out, but as knowledge levels become high, the fact that very few words remain to be guessed strictly limits the expected contribution due to guessing. We note that the derivative of the expected score with respect to p is $1 - p^{5}$. This evaluates to 1 when p = 0 and then strictly decreases to zero when p = 1, confirming that the expected score is monotonically increasing with respect to knowledge level and that, as knowledge levels become higher, additional increments have less of an effect on the observed score.

While the above formula calculates the expected observed score based on the true percentage of words known, a program was written to find the most likely true percentage of words known given an observed score and the number of questions on the test, using maximum likelihood estimation. Results from the program for observed scores in 5-point intervals are listed in Table 1, assuming a test with 99 items. Note that the proportion of words known that will produce a given expected score and the most likely proportion of words known if that score is observed are usually very close, especially for percentages in the middle of the range. These numbers are by no means always identical, but they will become closer as the number of questions on the test increases.
TABLE 1. Estimated Percentage of Words Known Given Observed VLT Score

Observed score    Most likely known (%)
 5                  0
10                  0
15                  0
20                [illegible]
25                [illegible]
30                 14
35                 19
40                 24
45                 29
50                 34
55                 39
60                 44
65                 49
70                 54
75                 60
80                 65
85                 71
90                 78
95                 86
99                100

VALIDATION

For a statistical validation of the above formula, it was necessary to perform multiple guessing simulations on the VLT format. Early attempts at simulations yielded results that varied somewhat trial by trial, and it was clear that large numbers of trials were necessary to obtain reasonably narrow confidence intervals for mean score increases. Therefore, a software program was written in C++ to run multiple simulations. This allowed thousands of simulations to be run in a relatively short period of time. We ran 1000 trials for 100 “known” word rates ranging from 1% presumed known to 100% presumed known, for a total of 100,000 99-item VLT-format test simulations. For the sake of brevity, mean test scores for proportions of known words are listed in intervals of 0.05. It should be noted that Table 2 reports mean score increases from guessing; the standard deviation reveals further information about the nature of the distributions. Furthermore, it should be noted that the proportion of words known is taken to be an overall measurement, of which the test is a sample.

When 0–60% of words were presumed known, the mean score increase was 16.658 points, with a mean standard deviation of 4.93. After this point, the guessing score increases began to decrease, dropping sharply when more than 0.8 of words were considered known, a relationship identical to that demonstrated in Figure 3 using the simplified formula. The lower score increases from guessing for simulations with higher proportions of known words were surprising, as the data confirmed that the likelihood of correctly guessing unknown words did increase as the proportion of known words increased.

TABLE 2. Simulation Mean Test Scores and Increases by Proportion of “Known” Words
(N = 1000)

Proportion   Mean test              Standard    Skewness                 Kurtosis
"known"      score      Increase    deviation   Statistic  Std. error    Statistic  Std. error
0.00          16.756     16.756      3.810        0.179      0.077         0.058      0.155
0.05          21.791     16.791      4.300        0.105      0.077        -0.039      0.155
0.10          26.507     16.507      4.470       -0.092      0.077         0.089      0.155
0.15          31.629     16.629      4.821        0.061      0.077        -0.110      0.155
0.20          36.856     16.856      5.185       -0.037      0.077        -0.016      0.155
0.25          41.743     16.743      5.210        0.110      0.077        -0.098      0.155
0.30          46.797     16.797      5.292       -0.017      0.077        -0.060      0.155
0.35          51.709     16.709      5.288       -0.078      0.077        -0.073      0.155
0.40          56.734     16.734      5.353       -0.073      0.077         0.042      0.155
0.45          61.440     16.440      5.340        0.036      0.077        -0.032      0.155
0.50          66.733     16.733      5.145        0.004      0.077        -0.010      0.155
0.55          71.634     16.634      5.173       -0.080      0.077         0.075      0.155
0.60          76.223     16.223      4.744       -0.106      0.077         0.067      0.155
0.65          80.386     15.386      4.391       -0.184      0.077         0.154      0.155
0.70          84.796     14.796      4.066       -0.185      0.077        -0.133      0.155
0.75          88.877     13.877      3.602       -0.298      0.077         0.027      0.155
0.80          92.426     12.426      3.102       -0.348      0.077        -0.066      0.155
0.85          95.400     10.400      2.427       -0.551      0.077         0.317      0.155
0.90          97.864      7.864      1.662       -0.896      0.077         1.002      0.155
0.95          99.446      4.446      0.837       -1.765      0.077         3.593      0.155
1.00         100.000

However, it appears that the greater proportion of correct guesses is countered by a ceiling effect for guessing; for students who know more than 84% of words on a test, it is impossible to see an increase of more than 16% from guessing. With 95% of words presumed known, the score increase from guessing is under 5, despite a very high proportion of correctly guessed words (0.89).

FIGURE 4. Proportion of correctly guessed words by proportion of words known.

CONCLUSION

As proportions of known words rise, so does the probability of correctly guessing the diminishing numbers of remaining unknown words. This results in a fairly consistent score increase of approximately 16–17 points on a 99-item VLT test until over 60% of words are known, at which point the score increase due to guessing gradually
begins to diminish. It should be noted, however, that although means for guessing effects have been estimated to narrow confidence intervals, the standard deviations for guessing effects at most levels of vocabulary knowledge are around 5 points, meaning that individual students may see increases substantially higher or lower.

Implications for Educators and Researchers

Latent trait theory models such as Rasch measurement have gained increased popularity in vocabulary testing in recent years (Laufer & Goldstein, 2004; Beglar, 2010). But whereas latent trait theory holds great promise in language testing, tests of learners’ vocabulary sizes are arguably an instance in which use of classical statistics is theoretically justified: learners’ knowledge of a sample of words from a given frequency level is polled in order to make inferences regarding the true proportion of words known at that frequency level. However, it should be noted that the multiple-choice test format employed by tests such as the VLT inflates estimates of a learner’s vocabulary size. This problem is compounded by the fact that on tests such as the VLT, distractors are drawn from the same frequency level as the target word, and that therefore the probability of a successful guess cannot simply be determined by the number of distractors used. For this reason the authors recommend that, when possible, tests that do not employ a multiple-choice format be used, such as yes/no vocabulary tests or productive tests of vocabulary knowledge (e.g., Laufer & Nation, 1999), in which learners provide words.

Furthermore, it should be noted that studies that have tested and compared learners’ active and passive vocabulary knowledge (e.g., Laufer & Goldstein, 2004; Laufer, Elder, Hill, & Congdon, 2004) frequently employ a multiple-choice format for tests of passive recognition. Whereas such studies commonly report active knowledge as lagging passive knowledge by large degrees, the extent to which the multiple-choice format
inflates scores on passive measures, thereby enlarging differences in scores, should be considered by researchers when reporting results.

THE AUTHORS

Jeffrey Stewart is a lecturer at Kyushu Sangyo University in Fukuoka, Japan. His research interests include vocabulary acquisition and language testing.

David A. White is a mathematics undergraduate at Harvard University in Cambridge, Massachusetts, United States. His interests include computer programming and Japanese language.

REFERENCES

Anderson, R. C., & Freebody, P. (1981). Vocabulary knowledge. In J. T. Guthrie (Ed.), Comprehension and teaching: Research reviews (pp. 77–117). Newark, DE: International Reading Association.
Baldauf, R. (1982). The effects of guessing and item dependence on the reliability and validity of recognition based cloze tests. Educational and Psychological Measurement, 42, 855–867. doi:10.1177/001316448204200321
Beeckmans, R., Eyckmans, J., Janssens, V., Dufranne, M., & Van de Velde, H. (2001). Examining the yes/no vocabulary test: Some methodological issues in theory and practice. Language Testing, 18, 235–274.
Beglar, D. (2010). A Rasch-based validation of the Vocabulary Size Test. Language Testing, 27, 101–118. doi:10.1177/0265532209340194
Beglar, D., & Hunt, A. (1999). Revising and validating the 2000 word level and university word level vocabulary tests. Language Testing, 16, 131–162.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–472). Reading, MA: Addison-Wesley.
Brown, J. D., & Hudson, T. (2002). Criterion-referenced language testing. Cambridge, England: Cambridge University Press.
Chall, J. S., & Dale, E. (1950). Familiarity of selected health terms. Educational Research Bulletin, 39, 197–206.
Huibregtse, I., Admiraal, W., & Meara, P. (2002). Scores on a yes–no vocabulary test: Correction for guessing and response style. Language Testing, 19, 227–245.
doi:10.1191/0265532202lt229oa
Janssens, V. (1999). Over ‘slapen’ en ‘snurken’ en de hulp van de context hierbij. ANBF-nieuwsbrief, 4, 29–45.
Laufer, B., & Nation, P. (1999). A vocabulary-size test of controlled productive ability. Language Testing, 16, 33–51.
Laufer, B., Elder, C., Hill, K., & Congdon, P. (2004). Size and strength: Do we need both to measure vocabulary knowledge? Language Testing, 21, 202–226. doi:10.1191/0265532204lt277oa
Laufer, B., & Goldstein, Z. (2004). Testing vocabulary knowledge: Size, strength, and computer adaptiveness. Language Learning, 54, 399–436. doi:10.1111/j.0023-8333.2004.00260.x
Martin, E., del Pino, G., & De Boeck, P. (2006). IRT models for ability-based guessing. Applied Psychological Measurement, 30, 183–203. doi:10.1177/0146621605282773
Meara, P. (1996). The dimensions of lexical competence. In G. Brown, K. Malmkjær, & J. Williams (Eds.), Competence and performance in language learning (pp. 35–53). New York, NY: Cambridge University Press.
Meara, P., & Buxton, B. (1987). An alternative to multiple-choice vocabulary tests. Language Testing, 4, 142–145. doi:10.1177/026553228700400202
Mochida, A., & Harrington, M. (2006). The yes/no test as a measure of receptive vocabulary. Language Testing, 23, 73–98. doi:10.1191/0265532206lt321oa
Nation, I. S. P. (1990). Teaching and learning vocabulary. Boston, MA: Heinle and Heinle.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danish Institute for Educational Research.
Stewart, J. (2009). A comparison of test scores between monolingual and bilingual versions of the Vocabulary Size Test: A pilot study. In A. M. Stoke (Ed.), JALT 2008 conference proceedings. Tokyo, Japan: JALT.
Stubbe, R., Stewart, J., & Pritchard, T. (2010). Examining the effects of pseudowords in yes/no vocabulary tests for low level learners. Kyushu Sangyo University Language Education and Research Center Journal, 5, 5–23.
Wright, B. D. (1995). 3PL or Rasch?
Rasch Measurement Transactions, 9, 408–409.
Zimmerman, D. W., & Williams, R. H. (1965). Effect of chance success due to guessing on error of measurement in multiple-choice tests. Psychological Reports, 16, 1193–1196.
