BRIEF REPORTS AND SUMMARIES

TESOL Quarterly invites readers to submit short reports and updates on their work. These summaries may address any areas of interest to Quarterly readers.

Edited by JOHN FLOWERDEW, University of Leeds
ALI SHEHADEH, United Arab Emirates University

Differential Item Functioning on an English Listening Test Across Gender

GI-PYO PARK
Soonchunhyang University
Seoul, South Korea

Differential item functioning (DIF) is present when two groups of equal ability show a differential probability of a correct response to an item (Ellis & Raju, 2003). The two groups are the focal group, the group of primary interest, and the reference group, a standard group against which the focal group is compared. Matching the groups on ability is important because it reveals whether group differences in item performance arise from true differences in ability or from item bias (Elder, 1997). The differential probability may concern item difficulty (uniform DIF) or item discrimination (nonuniform DIF). Studies on DIF are crucial because DIF concerns the fairness of items across groups, say gender or socioeconomic status, beyond group-mean differences (Thissen, Steinberg, & Gerrard, 1986).

Several studies on DIF have investigated whether gender differences in test performance resulted from gender bias (Drasgow, 1987; Ryan & Bachman, 1992; Takala & Kaftandjieva, 2000). Using item response theory, Drasgow investigated whether the ACT Mathematics Usage Test, which consisted of 40 items, functioned differentially across gender and race. He found that five items showed evidence of DIF. However, test-characteristic curves, which are the sum of the item-characteristic curves, identified no group differences in the cumulative effects of DIF in the test as a whole.

Ryan and Bachman (1992) detected DIF in the Test of English as a Foreign Language (TOEFL) and in the First Certificate of English (FCE) using the Mantel-Haenszel procedure across gender and language background (Indo-European/non-Indo-European). In terms of gender, four of the 140 TOEFL items favored males, and two favored females; in the FCE, one of the 38 items favored males, and one favored females. For language background, 32 TOEFL items were easier for Indo-European native speakers, and 33 were easier for non-Indo-European native speakers; in the FCE, 13 items were easier for Indo-European native speakers, and 12 were easier for non-Indo-European native speakers.

Takala and Kaftandjieva (2000) examined whether the vocabulary subtest of the Finnish Foreign Language Certificate Examination (FFLCE) showed evidence of DIF. The FFLCE had 11 items showing DIF, with six items favoring males and five favoring females. Despite these findings, however, excluding the DIF items from the test did not affect the ability parameter estimations for males and females, probably because the DIF items canceled each other out.

Previous studies on DIF across gender have identified DIF using various methods such as the Mantel-Haenszel procedure, item response theory, and confirmatory factor analysis. However, few have further analyzed the sources of DIF in terms of variables such as language type (dialogue and monologue), question type (local, global, and expression), content, and picture presence in the item (Engelhard, Hansche, & Rutledge, 1990; Pae, 2002; Roznowski, 1987).
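The notion the studies above share can be stated compactly. In standard item response notation (not used in the original report), an item i shows DIF when examinees matched on ability do not have equal chances of success:

\[
P(X_i = 1 \mid \theta, \text{focal}) \;\neq\; P(X_i = 1 \mid \theta, \text{reference})
\quad \text{for some ability level } \theta .
\]

Under uniform DIF the inequality points the same way at every θ (a shift in difficulty); under nonuniform DIF the two item-characteristic curves cross (a difference in discrimination).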
The purpose of this study was to identify DIF across gender, using the Mantel-Haenszel procedure, in the English listening part of the 2003 Korea College Scholastic Ability Test (KCSAT). Another purpose was to further articulate the sources of DIF in relation to four important variables considered when the English listening test was developed: language type, question type, content, and picture presence. Two research questions guided this study: (a) Does the English listening test in the 2003 KCSAT include items displaying DIF across gender? (b) If so, what are the sources of DIF? The answers to these questions should sensitize item developers to the issue of DIF and, in turn, help them develop items free from bias based on gender, socioeconomic status, and other factors.

METHOD

Participants

The participants were 20,000 males (half in liberal arts and half in science) and 20,000 females (half in liberal arts and half in science) chosen from the 675,922 examinees who took the 2003 KCSAT. With the cooperation of the Korea Institute of Curriculum and Evaluation (KICE), the participants were chosen based solely on gender and academic background (liberal arts versus sciences) to minimize confounding variables.

The Mantel-Haenszel Procedure

The Mantel-Haenszel procedure was chosen to detect DIF because it is easy to use and because it is widely accepted as a measure of DIF. The procedure was introduced by Mantel and Haenszel (1959), and adapted by Holland and Thayer (1988), to identify items displaying DIF across members of different subgroups. Using the Mantel-Haenszel chi-square statistic with one degree of freedom, the procedure tests the null hypothesis H0: α = 1, where α is the common odds ratio.
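To make the computation concrete, the following minimal sketch forms the common odds ratio and the continuity-corrected Mantel-Haenszel chi-square for a single item from 2 x 2 tables stratified by matched score level. The counts are hypothetical, not drawn from the KCSAT data.

```python
def mantel_haenszel(tables):
    """Common odds ratio and MH chi-square (1 df) for one item.

    `tables` holds one 2 x 2 table per matched score level k:
    ((A_k, B_k),   # reference group: correct, incorrect
     (C_k, D_k))   # focal group:     correct, incorrect
    """
    num = den = a_sum = e_sum = v_sum = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        num += a * d / n                    # numerator of alpha-hat
        den += b * c / n                    # denominator of alpha-hat
        m1, m0 = a + c, b + d               # totals correct / incorrect
        nr, nf = a + b, c + d               # reference / focal group sizes
        a_sum += a
        e_sum += nr * m1 / n                # E(A_k) under H0: alpha = 1
        v_sum += nr * nf * m1 * m0 / (n ** 2 * (n - 1))
    alpha = num / den
    chi_sq = (abs(a_sum - e_sum) - 0.5) ** 2 / v_sum
    return alpha, chi_sq

# Hypothetical counts at three ability strata
strata = [((55, 45), (40, 60)), ((70, 30), (58, 42)), ((88, 12), (80, 20))]
print(mantel_haenszel(strata))  # alpha > 1 here favors the reference group
```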
Instrument

The English listening test from the 2003 KCSAT was used as the instrument because the KCSAT is a high-stakes test that suits the study of DIF; it plays a critical role in deciding admission to college in Korea. The English listening test was developed in about a month by a special testing committee appointed by KICE. The committee consisted of English professors and teachers with expertise in developing test items. KICE asked them to develop varied items, specifically considering language type (dialogue and monologue), question type (local, global, and expression), picture presence, and content (Nunan, 1991; Shohamy & Inbar, 1991). The draft of the test was reviewed twice by two different review committees consisting of high school teachers. The review committees were asked to estimate item difficulty and to screen out any items similar to items already used elsewhere. Pretesting to investigate the psychometric properties of the test was not possible for security reasons.

The English listening test was in multiple-choice format, consisting of 17 items that varied in language type, question type, picture presence, and content. In terms of language type, the test included 14 dialogue items and three monologue items. For question type, it included eight global questions asking for inferential information, four local questions asking for factual information, and five questions asking about appropriate expressions. With regard to picture presence, the test consisted of two picture items and 15 nonpicture items. In addition, the test covered a variety of content, such as exchanging information on sports, a health club, and a city; discussing a customer complaint and problems in class; describing views; visiting a patient; asking a person to record a TV program; asking citizens to help in a festival; advising students to behave properly; planning for the weekend; going out for dinner; and identifying a person in a picture. Each dialogue or monologue text comprised about 85 words, with 9–11 turns between speakers in the dialogues. The texts were recorded by two native speakers of English, one male and one female, at a speed of about 140 words per minute. The reliability of the test as measured by Cronbach's α was 0.802.

FINDINGS

Before DIF across gender was identified with the Mantel-Haenszel procedure, the dimensionality and item difficulty of the test were investigated. Dimensionality, or the number of latent dimensions, was investigated by principal component analysis (see Table 1). The listening test could be judged either unidimensional (measuring only one ability), on the basis of the variance accounted for by the first factor, 24.60%, or multidimensional (measuring more than one ability), on the basis of the eigenvalue of the first factor, 4.18 (Reckase, 1979).¹ This ambiguity indicated that the Mantel-Haenszel procedure, rather than item response theory, which assumes unidimensionality, should be used to investigate whether the test functioned differently across gender with these data (Hambleton, Swaminathan, & Rogers, 1991).

TABLE 1
Results of Principal Component Analysis

Factor   Eigenvalue   % of variance
1          4.18           24.60
2          1.17            6.90
3          0.93            5.48
4          0.91            5.04
5          0.86            4.92

¹ Reckase (1979) argued that for the first factor to control the estimation of the parameters, it should have an eigenvalue of 10 or greater or account for at least 20% of the total variance.
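A dimensionality check of this kind is straightforward to reproduce. The sketch below, run on simulated 0/1 responses rather than the KCSAT data, extracts the eigenvalues of the inter-item correlation matrix, the quantities reported in Table 1, so that Reckase's criteria can be applied.

```python
import numpy as np

# Simulated 0/1 response matrix: rows = examinees, columns = 17 items.
rng = np.random.default_rng(0)
responses = (rng.random((1000, 17)) < 0.6).astype(float)

# Eigenvalues of the inter-item correlation matrix (principal components).
corr = np.corrcoef(responses, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
pct_variance = 100 * eigenvalues / eigenvalues.sum()

# Reckase's (1979) rule: treat the test as unidimensional if the first
# factor has an eigenvalue of 10+ or explains at least 20% of the variance.
print(np.round(eigenvalues[:5], 2), np.round(pct_variance[:5], 2))
```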
To provide a rationale for studying DIF, item difficulty statistics were calculated before ability levels were matched, followed by t tests comparing males and females. Only two of the 17 items (2 and 12) were significantly easier for males, whereas 13 items (3–9, 11, and 13–17) were significantly easier for females. This finding indicates that the female participants had better foreign language listening ability than the male participants (Ryan & Bachman, 1992; 2002 TOEIC Through Data, 2002). However, because these statistics were calculated before ability levels were matched between the two groups, they could not show whether the group differences in item difficulty resulted from true group differences in ability or from item bias (Elder, 1997). Thus, after ability levels were matched, differential item functioning for the two groups was investigated in depth.

Table 2 shows the results of the DIF analysis with the Mantel-Haenszel procedure, which uses total scores as the matching criterion. When the two groups were matched on total scores, 13 of the 17 items showed DIF, with six items (1, 2, 6, 10, 12, and 13) differentially easier for males and seven items (4, 5, 7, 8, 9, 11, and 17) differentially easier for females. These findings suggest that item difficulty statistics should be interpreted with caution because DIF can be present beyond the item difficulty indices (Thissen et al., 1986). Even though the English listening test of the KCSAT had as many as 13 DIF items, the numbers of DIF items favoring males and females were almost equal, indicating that the DIF items might cancel each other out in a test-level analysis (Drasgow, 1987; Takala & Kaftandjieva, 2000).

TABLE 2
Identification of DIF After Matching Ability Levels

Item   Breslow-Day statistic   Uniform DIF
 1        23.51                p < 0.0001*
 2        64.46                p < 0.0001*
 3         0.65                p = 0.4203
 4       143.47                p < 0.0001**
 5        29.33                p < 0.0001**
 6        10.53                p < 0.0012*
 7       109.22                p < 0.0001**
 8       146.91                p < 0.0001**
 9       131.03                p < 0.0001**
10        31.69                p < 0.0001*
11       107.39                p < 0.0001**
12        75.99                p < 0.0001*
13         3.92                p < 0.0476*
14         1.05                p = 0.3059
15         0.73                p = 0.3928
16         1.06                p = 0.3039
17        77.75                p < 0.0001**

** Item favored the focal group (females); * item favored the reference group (males).

The items showing DIF were further analyzed to determine whether language type (dialogue and monologue), question type (local, global, and expression), picture presence, and content were associated with DIF. Table 3 shows that, in general, the DIF items were related to language type and picture presence. This relationship, however, differed from previous findings, in which picture items were easier for females (Pae, 2002). More specifically, in the dialogues, four items (1, 2, 12, and 13) favored males and six items (4, 5, 7, 8, 9, and 11) favored females, whereas in the monologues, two items (6 and 10) favored males and one item (17) favored females. It is interesting that both of the picture items (1 and 13) favored males. In question type, however, gender differences in the number of DIF items were not found. In the local questions asking for factual information, one item (1) was easier for males, whereas two items (5 and 8) were easier for females. In the global questions asking for inferential information, four items (2, 6, 10, and 12) were easier for males and four items (4, 7, 9, and 11) were easier for females. In the expression questions asking for appropriate expressions, one item (13) favored males and one (17) favored females.

TABLE 3
Analysis of DIF by Language Type, Question Type, Content, and Picture Presence

Item   Language type   Question type   Content                                          Picture presence   DIF status
 1     Dialogue        Local           Identifying a person                             Picture            Male
 2     Dialogue        Global          Describing views from a mountain                 No                 Male
 3     Dialogue        Local           Planning for the weekend: going to the theater   No                 No
 4     Dialogue        Global          Discussing a customer complaint                  No                 Female
 5     Dialogue        Local           Asking a person to record a TV program           No                 Female
 6     Monologue       Global          Advising students to behave properly             No                 Male
 7     Dialogue        Global          Going out for dinner                             No                 Female
 8     Dialogue        Local           Buying a mirror                                  No                 Female
 9     Dialogue        Global          Exchanging information about a city              No                 Female
10     Monologue       Global          Asking citizens to help in a festival            No                 Male
11     Dialogue        Global          Discussing problems in class                     No                 Female
12     Dialogue        Global          Exchanging information about sports              No                 Male
13     Dialogue        Expression      Catching a dog running away                      Picture            Male
14     Dialogue        Expression      Exchanging information about a health club       No                 No
15     Dialogue        Expression      Planting a tree                                  No                 No
16     Dialogue        Expression      Visiting a patient                               No                 No
17     Monologue       Expression      Clearing snow off the sidewalks                  No                 Female

Interpreting DIF in relation to content was difficult because the English listening test covered such a wide variety of content. The coverage was so broad because the test makers sought to minimize the effects of examinees' background knowledge on the test and to maximize content validity (Chiang & Dunkel, 1992; Park, 2004). As Roznowski (1987) reported, however, shopping content (items 4 and 8) and theater content (item 3) favored females, whereas sports content (item 12) and travel content (item 2) favored males. However, in this study, farming (item 15) and health (item 14) content did not show DIF, in contrast with Roznowski's study.
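As a quick way to see the patterns just described, the feature columns of Table 3 can be cross-tabulated against DIF direction. A minimal sketch, with the codes transcribed from the table (D/M for dialogue/monologue; L/G/E for local/global/expression):

```python
from collections import Counter

# Item features and DIF status for items 1-17, transcribed from Table 3.
lang = ["D","D","D","D","D","M","D","D","D","M","D","D","D","D","D","D","M"]
qtype = ["L","G","L","G","L","G","G","L","G","G","G","G","E","E","E","E","E"]
dif = ["Male","Male","No","Female","Female","Male","Female","Female",
       "Female","Male","Female","Male","Male","No","No","No","Female"]

# Counts of DIF direction within each feature level.
print(Counter(zip(lang, dif)))   # dialogues: 4 male vs. 6 female, etc.
print(Counter(zip(qtype, dif)))  # global items split 4-4 across gender
```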
CONCLUSION

This study used the Mantel-Haenszel procedure to investigate whether test items were invariant across gender in the English listening test of the 2003 KCSAT. Of the 17 items on the listening test, 13 displayed DIF: six items favored males, and seven favored females. On closer investigation, four important variables in developing the test (content, picture presence in the items, language type, and question type) were all associated with DIF to different degrees.

The findings of this study have several implications. First, test items should be pretested for any problems in psychometric properties, including DIF, before they are used. If any items show DIF, the items should be revised or eliminated after thoughtful evaluation by the selection committee or bias reviewers. It is important to note that even when a subtest shows almost equal numbers of DIF items favoring each group, the result can still be consequential for examinees at the level of the total test score (see Maller, 2001). This problem arises when raw score differences between the focal group and the reference group in a subtest accumulate in the total test score. In that scenario, the total raw score difference can be substantial, making the test unfair across groups.

Second, when test items cannot be pretested for security reasons, as with the KCSAT, the selection committee should carefully choose items free from possible bias across groups by considering many variables, such as content, picture presence, language type, and question type. As discussed earlier, shopping content in the listening test of the 2003 KCSAT favored females, whereas items with pictures favored males. In this case, if item developers combined shopping content and picture presence in a single item, the item might show minimal DIF or might not be flagged for DIF at all. Specific care needs to be taken with content, because different content favors different groups, and excluding content from a test can cause problems with validity. For instance, if shopping and sports content were excluded from a listening test, the test might be free from DIF, as seen in this study; however, the test might also suffer from a lack of content validity because it would fail to cover the universe of items. To tackle these intricate problems, the committee can choose items with varied content that may be flagged for DIF but cancel each other out in the test as a whole (Clauser & Mazor, 1998). What should be noted, however, is that studies of DIF to date have not shown whether accumulated DIF items cancel each other out in a test-level analysis.

Third, item developers should assume professional responsibility for developing items that are as fair as possible by considering as many variables as possible (Carlton & Harris, 1992). Some may argue that it is practically impossible to consider all these variables in developing test items. However, considering the personal and social ramifications of high-stakes tests, every effort should be made to develop items that are free from bias.

This study suggests the following future inquiries. First, this study identified DIF through statistical analyses; a logical next step is to explore whether bias reviewers can identify test items showing DIF without statistical data (Engelhard et al., 1990). Second, it should be investigated whether items showing DIF in a test manifest differential test functioning (DTF). Even though several empirical studies on DTF have been undertaken (Takala & Kaftandjieva, 2000; Zumbo, 2003), we do not yet know whether (a) a test with DIF items shows DTF because DIF accumulates in a test-level analysis, (b) a test with DIF items shows no DTF because DIF cancels out in a test-level analysis, or (c) a test with DIF items shows no DTF because DIF is independent of DTF.
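The cancellation question raised above can at least be illustrated by simulation. In the sketch below, item success probabilities are shifted in opposite directions for two hypothetical groups, echoing the near-balanced six-versus-seven DIF pattern found here; the effect sizes are invented, and the point is only that group mean total scores can stay close even when most items function differentially.

```python
import random

random.seed(1)

def total_scores(n_examinees, deltas):
    """Total scores for a group whose per-item success probabilities are
    shifted by `deltas` (positive = easier for this group)."""
    base = 0.6  # hypothetical baseline probability of a correct response
    return [sum(random.random() < base + d for d in deltas)
            for _ in range(n_examinees)]

# 6 items shifted toward group A, 7 toward group B, 4 with no DIF.
group_a_deltas = [+0.05] * 6 + [-0.05] * 7 + [0.0] * 4
group_b_deltas = [-0.05] * 6 + [+0.05] * 7 + [0.0] * 4

a = total_scores(5000, group_a_deltas)
b = total_scores(5000, group_b_deltas)
# Means differ by only ~0.1 of an item out of 17: 13 DIF items, little DTF.
print(sum(a) / len(a), sum(b) / len(b))
```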
ACKNOWLEDGMENTS

I express my deep gratitude to the anonymous TESOL Quarterly reviewers for their insightful comments on an earlier draft of this study.

THE AUTHOR

Gi-Pyo Park is a professor of teaching English as a foreign language at Soonchunhyang University in Seoul, South Korea. His research interests include testing, learning strategies, and listening and reading comprehension.

REFERENCES

Carlton, S., & Harris, A. (1992). Characteristics associated with differential item functioning on the Scholastic Aptitude Test: Gender and majority/minority group comparisons (Research Report No. 92, pp. 60–70). Princeton, NJ: ETS.
Chiang, C., & Dunkel, P. (1992). The effect of speech modification, prior knowledge, and listening proficiency on EFL lecture learning. TESOL Quarterly, 26, 345–374.
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31–47.
Drasgow, F. (1987). Study of the measurement bias of two standardized psychological tests. The Journal of Applied Psychology, 72, 19–29.
Elder, C. (1997). What does test bias have to do with fairness? Language Testing, 14, 261–277.
Ellis, B., & Raju, N. (2003). Test and item bias: What they are, what they aren't, and how to detect them. Washington, DC: U.S. Department of Education. (ERIC Document Reproduction Service No. ED 480042)
Engelhard, G., Hansche, L., & Rutledge, K. (1990). Accuracy of bias review judges in identifying differential item functioning on teacher certification tests. Applied Measurement in Education, 3, 347–360.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Thousand Oaks, CA: Sage.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum.
Maller, S. (2001). Differential item functioning in the WISC-III: Item parameters for boys and girls in the national standardization sample. Educational and Psychological Measurement, 61, 793–817.
Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719–748.
Nunan, D. (1991). Language teaching methodology: A textbook for teachers. New York: Prentice Hall.
Pae, T.-I. (2002). Gender differential item functioning on a national language test. Unpublished doctoral dissertation, Purdue University, West Lafayette, Indiana, United States.
Park, G.-P. (2004). Comparison of L2 listening and reading comprehension by university students learning English in Korea. Foreign Language Annals, 37, 448–458.
Reckase, M. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, 207–230.
Roznowski, M. (1987). Use of tests manifesting sex differences as measures of intelligence: Implications for measurement bias. The Journal of Applied Psychology, 72, 480–483.
Ryan, K., & Bachman, L. F. (1992). Differential item functioning on two tests of EFL proficiency. Language Testing, 9, 12–29.
Shohamy, E., & Inbar, O. (1991). Validation of listening comprehension tests: The effect of text and question type. Language Testing, 8, 23–40.
Takala, S., & Kaftandjieva, F. (2000). Test fairness: A DIF analysis of an L2 vocabulary test. Language Testing, 17, 323–340.
Thissen, D., Steinberg, L., & Gerrard, M. (1986). Beyond group mean differences: The concept of item bias. Psychological Bulletin, 99, 118–128.
2002 TOEIC through data. (2002). TOEIC Newsletter, 17, 2–7.
Zumbo, B. (2003). Does item-level DIF manifest itself in scale-level analysis? Implications for translating language tests. Language Testing, 20, 136–147.