The Subjective and Objective Interface of Bias Detection on Language Tests

Steven J. Ross and Junko Okabe
Kwansei Gakuin University, Kobe-Sanda, Japan

International Journal of Testing, 6(3), 229–253. Copyright © 2006, Lawrence Erlbaum Associates, Inc.
Correspondence should be addressed to Steven J. Ross, School of Policy Studies, Kwansei Gakuin University, Gakuen 2–1, Sanda, Hyogo 669-1337, Japan. E-mail: sross@ksc.kwansei.ac.jp

Test validity is predicated on there being a lack of bias in tasks, items, or test content. It is well known that factors such as test candidates' mother tongue, life experiences, and the socialization practices of the wider community may serve to inject subtle interactions between individuals' backgrounds and the test content. When the gender of the test candidate interacts further with these factors, the potential for item bias to influence test performances grows. A dilemma faced by test designers concerns how they can proactively screen test content for possible sources of bias. Conventional practices in many contexts rely on the subjective opinion of review panels in detecting sensitive topical content and potentially biased material and items. In the last 2 decades this practice has been rivaled by the increased availability of item bias diagnostic software. Few studies have compared the relative accuracy and cost utility of the two approaches in the domain of language assessment. This study makes just that comparison. A 4-passage, 20-item reading comprehension test was given to a stratified sample of 825 high school students and college undergraduates at 5 Japanese institutions. The sampling included a focus group of 468 female students compared to a reference group of 357 male English as a foreign language (EFL) learners. The test passages and items were also given to a panel of 97 in-service and preservice EFL teachers for subjective ratings of potential gender bias. The results of the actual item responses were then empirically checked for evidence of differential item functioning using Simultaneous Item Bias analysis, the Mantel-Haenszel Delta method, and logistic regression. Concordance analyses of the subjective and objective methods suggest that subjective screening of bias overestimates the extent of actual item bias. Implications for cost-effective approaches to item bias detection are discussed.

The issue of test bias has always been central in the consideration of test validity. Bias has been of concern because inferences about the results of test outcomes often lead to consequences affecting the life-course trajectories of test candidates, such as in the use of tests for employment, admissions, or professional certification. Test results may be considered unambiguously fair to the extent candidates are compared, as in the case of norm-referenced tests, on only the domain-relevant constructs included in the measurement instrument devised for the purpose. In the real world of testing practice, uncontaminated construct-relevant domain coverage is often more an ideal than a reality. This is especially true when the testing construct involves domains of knowledge or ability related to language learning.

ISSUES IN SECOND LANGUAGE ASSESSMENT BIAS

Language learning, particularly second or foreign language learning, is influenced to no small degree by factors that interact with, and that are sometimes even independent of, the direct consequences of formal classroom-based achievement.
Yet in many high stakes contexts, foreign or second language ability is used as a gate-keeping criterion for employment and admissions decisions. Further, inclusion of foreign language ability on selection tests is often predicated on the assumption that candidates' relative standing reflects the cumulative effects of achievement propelled by long-term commitment to diligent scholarship. These assumptions do not often factor in the possibly biasing influences of cross-linguistic transfer and naturalistic acquisition on individual differences in test outcomes. Constructing high stakes measures to be free of these kinds of bias presents a challenging task to language test designers, particularly when the implicit meritocratic intention is to reward scholastic achievement.

Studies of bias on language tests have tended to fall into the three broad categories of transfer, experience, and socialization practices. The first, which accounts for the influence of transfer from a first-learned language to a second or foreign language, addresses the extent of bias occurring when speakers of different first languages are tested on a common second language. Chen and Henning (1985), for instance, noted the transferability of Latin cognates from Spanish to English lexical recognition, which served to bias native speakers of Spanish over native speakers of Chinese. Working in the same vein, Sasaki (1991) corroborated Chen and Henning using a different DIF detection method. Both of these studies suggested that when novel words are encountered by Spanish and Chinese speakers, the cognitive task of lexical inference differs. For instance, consider the following sample sentence:

Residents evacuated their homes during the conflagration.

For Romance language speakers, the deductive task is to parse "conflagration" for its affixation and locate the core free morpheme. Once located, the Romance language speaker can compare the root to similar known free morphemes in the reader's native language, for instance, incendio or conflagración. The Chinese speaker, in contrast, starts at the same deductive step, but must compare the free root morpheme to all other previously learned morphemes (i.e., most probably, "flag"). The resulting difference leads Spanish speakers to follow a semantically based second step, while Chinese speakers are likely to split between a semantic and a phonetic comparison strategy. The item response accuracy in such cases favors the Romance language speakers, even when they are matched with Chinese counterparts for overall proficiency.

The transferability factor applies to orthographic phenomena as well. Brown and Iwashita (1996) detected bias favoring Chinese learners of Japanese over native English speakers, whose native language orthography is typologically most distant from Japanese. Given the fact that modern written Japanese relies on Chinese character compounds for the formation of nominal phrases, as well as the root forms of many verbs, Chinese students of Japanese can transfer their knowledge of semantic roots for many Japanese words and compounds, even without knowledge of their corresponding phonemic representations or exact semantic reference. Here a similar strategic difference emerges for speakers of Chinese versus speakers of an Indo-European language. While the exact compound might not exist in modern written Chinese, the component Chinese characters provide a deductive strategy to Chinese learners of Japanese that is not available to English speakers.
For instance, the compound 新幹線 (shinkansen, "bullet train") does not have a direct counterpart in Chinese. The component characters 新 ("new"), 幹 ("trunk"), and 線 ("line") provide the basis for a lexical inference that the compound refers to a kind of rail transportation system. For an English-speaking learner of Japanese, the cognitive load falls on deducing the meaning of the whole compound from its components. Here, a mixed grapheme-to-phoneme strategy is most likely if 新 ("new") and 線 ("line") are recognized as "shin" and "sen." The lexical inference here might entail filling in the missing component "trunk" with a syllable that matches the surrounding "shin___sen" for successful compound word recognition.

Examining transferability on a macrolevel, Ross (2000), while controlling for biographical and experiential factors such as age, educational background, and hours of ESL learning, found weaker evidence of a language distance factor. The distance factor was composed of canonical syntactic structure, orthography, and typological grouping, which served to influence the relative rates of learning English by 72 different groups of migrants to Australia.

The overall picture of transfer bias suggests that on the microlevel, particularly in studies that triangulate two different native languages against a target language, evidence of transfer bias tends to be identifiable. When many languages are compared and individual differences in experiential and cognitive variables are factored in, transfer bias at the macro or language-typological level appears to be less readily identifiable.

A second type of bias in language assessment arises from the differential exposure to a target language that candidates might experience. Ryan and Bachman (1992), for instance, considered Test of English as a Foreign Language (TOEFL) type items to be more culturally oriented toward the North American context than a British comparison, the First Certificate in English. Language learners with exposure to instruction in American English and TOEFL test preparation courses were thought to have a greater chance on such items than learners whose exposure did not prepare them for the cultural framework TOEFL samples in its reading and listening items. Their findings suggest that high stakes language tests for admissions such as TOEFL may indirectly include knowledge of cultural reference in addition to the core linguistic constructs considered to be the object of measurement. Presumably this phenomenon would be observable on language tests such as the International English Language Testing System (IELTS), which is designed to qualify candidates for admissions to universities in the United Kingdom, New Zealand, or Australia.

Cultural background comparisons in second language performance assessments have demonstrated how speech community norms may transfer into assessment processes like oral proficiency interviews. While not overtly recognized as a source of assessment bias, interlanguage pragmatic transfer has been seen to influence the performances of Asian speakers when compared to European speakers (Young, 1995; Young & Halleck, 1998; Young & Milanovic, 1992). The implication is that if assessments are norm-referenced, speakers from discourse communities favoring verbosity may be advantaged in assessments such as interactive interviews. This observation apparently extends to semi-direct speech tasks such as the SPEAK test.
Kim (2001), for instance, found differential rating functions for pronunciation and grammar ratings for Asians when compared to equal-ability European test candidates. The implication here is that raters apply the rating scale differently.

In considering possible sources of bias in university admissions, Zwick and Sklar (2003) opined that the foreign language component on the SAT II created a "bilingual advantage" for particular candidates for admission to the University of California. If candidates had been raised in bilingual households, for instance, they would be expected to score higher on the foreign language listening comprehension component, which is an optional third subscore on the SAT II. This test is required for undergraduate admissions to all campuses of the University of California. The issue of bias in this case stems from the assumption that the foreign language component was presumably conceptualized as an achievement indicator, when in fact the highest scoring candidates are from bilingual households. The perceived advantage is that such candidates develop their proficiency not through coursework and scholarship, but through naturalistic exposure.

Elder (1997) reported on a similar fairness issue arising from the use of second language tests for access to higher education in Australia. Elder noted that the score weighting policy on the Victoria Certificate of Education, functioning as it does as a qualification for university admission in that state, explicitly profiles the language learning history of the test candidate. This form of candidate profiling aimed to reweight the influence of the foreign language scores on the admissions qualification so as to minimize the preferential bias bilingual candidates enjoyed over conventional foreign language learners. Elder found that interactions between English and the profile categorizations were not symmetric across different foreign language test candidatures and concluded that efforts to adjust for differential exposure profiles are fraught with difficulty.

A third category of bias in language assessment deals with differences in socialization patterns. Socialization patterns might involve academic tracking early in a school student's educational career, usually into either science or humanities academic tracks in high school (Pae, 2004). In some cultural contexts, academic tracking might correspond to gender socialization practices as well.

In contrast to cultural assumptions made about the verbal advantage females have over males, Hyde and Linn (1988) concluded, in a meta-analysis of 165 studies of gender differences on all facets of verbal tests, that the effect size for gender differences was d = .11. To them, this constituted little firm evidence to support the assumed female verbal advantage. Willingham and Cole (1997) and Zwick (2002) concur with this interpretation, noting that gender differences have steadily diminished over the last four decades and now account for no more than 1% of the total variation on ability tests in general. Willingham and Cole (1997, p. 348), however, noted that females tend to frequent the top 10% in standardized tests of reading and writing.

Surveys of gender differences on the Advanced Placement Test, used for university admissions to the more selective American universities, suggest reasons why verbal differences in literacy still tend to persist. Dwyer and Johnson (1997, p.
136) describe considerable effect size differences between college-bound males and females in preference for language studies. This finding would suggest that in the North American context socialization patterns could serve to channel high school students into academic tracks that tend to correlate with gender.

To date, language socialization issues have not been central in foreign or second language test bias analyses in multicultural contexts because of the more immediate and salient influences of exposure and transfer on high stakes tests. In contexts that are not characterized by multiculturalism, a more subtle threat of bias may be related to how socialization practices steer males and females into different academic domains, and in doing so cumulatively serve to make gender in particular knowledge domains differentially salient. When language tests inadvertently sample particular domains more than others, the issue of schematic knowledge interacting with the gender of the test candidate takes on a new level of importance.

In a study of differential item functioning (DIF) on a foreign language vocabulary test for Finnish secondary students, Takala and Kaftandjieva (2000) found that individual vocabulary items showed domain-sampling effects, whereas the total score on the test did not reflect systematic gender bias. Their study identified how words sampled from male activity domains such as mechanics and sports might yield higher scores for male test candidates than for females at the same ability level. Their approach used conventional statistical analyses of DIF, which, according to some current standards of test practices, would serve to identify and eliminate biased items before test scores are interpreted (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). With such practices for bias-free testing, faulty items would be screened through sensitivity review and content moderation prior to test administration, and then subjected to DIF analyses before the final score tally.

The issue of interest we address in this article is how gender bias on foreign language tests devised for high stakes purposes can be diagnosed when accepted cultural practices disfavor the use of empirical analysis of item functioning prior to score interpretation. In this study we address the issue of the accuracy of sensitivity review and bias screening through content moderation prior to test administration by comparing the judgments of both expert and novice moderation groups with the results of three different empirical approaches to DIF.

BACKGROUND TO THE STUDY

Four sample subtests written for a high stakes university admissions test were used in the study. The subtests were all from the fourth section of a six-section English as a foreign language (EFL) test given annually to approximately 630,000 Japanese high school seniors. The results of the exam are norm-referenced and serve to qualify candidates for secondary examinations at specific academic departments at national and public universities (Ingulsrud, 1994). Increasingly, private Japanese universities use the results of the Center examination for admissions decisions, making the test the most influential gate-keeping device in the Japanese educational system.

The format of the EFL test is a "discrete point" type of test of language structure and vocabulary, sampling the high school syllabus mandated by the Japanese Ministry of Education.
It is construed as an achievement test because only vocabulary and grammatical structures occurring in about 40 high school textbooks sanctioned by the Ministry of Education are sampled on the test. The six sections of the examination cover knowledge of segmental pronunciation, tonic word stress, discrete-point grammar, word order, paragraph coherence and cohesion, interpretation of short texts describing graphics and data in tabular format, interactive dialogic discourse in the form of a transcribed conversation, and comprehension of a 400-word reading comprehension passage. All items, usually 50 in all, are in multiple-choice format to facilitate machine scoring.

The test is constructed by a committee of 20 examiners who convene 40 days each year to draft, moderate, and revise the examination before its administration in January each year. On several occasions during the test construction period the draft passages and items are sent out to an external moderation panel for sensitivity and bias review. The external moderation panel, whose membership is not known to the test committee members, is composed of former committee members and examination committee chairpersons. Their task is to critique the draft passages and items and to recommend changes, large and small. On occasion the moderation panel recommends substitution of entire draft test sections. This usually occurs when issues of test sensitivity or bias are raised. The criteria for sensitivity are themselves highly subjective and variable across moderation panels. For some, test content should involve "heart-warming" topics that avoid dark or pessimistic themes. For others, avoiding references to specific social or ethnic groups may be the most important criterion.

The four passages included in the study were originally drafted for the fourth section of the EFL language examination. The specifications for the fourth section call for three or four paragraphs describing charts, figures, or tabular data concerning hypothetical experimental or survey data in a social science domain. This section of the test is known to be the most domain-sensitive, because the content sampling usually sits at the borderline of where male–female differences in experiential schemata begin to emerge in the population.

The four passages were never used in the operational test, but were held in reserve as alternates. All four had at various stages of development undergone external review by the moderation panel and were found to be possibly too gender sensitive, thus ending further investment of committee time in their revision.

The operational test is not screened with DIF statistics prior to score interpretation. The current test policy endorsed by the Japanese testing community is predicated on the assumption that the moderation panel reviews are sufficiently accurate in detecting faulty, insensitive, or biased items before any are used on the operational test. The research issue addressed here thus considers empirical evidence of the accuracy of the subjective approach currently used, and directly examines evidence that subjective interpretation of gender bias in fact concurs with objective analyses using empirical methods common to DIF analysis.

METHOD

The four-passage, 20-item reading comprehension test was given to a stratified sample of 825 high school students and college undergraduates at five institutions.
The sampling included a focus group of 468 female students compared to a reference group of 357 male EFL learners. The aim of the sampling was to approximate the range of scores normally observed in the population of Japanese high school seniors. The 20-item test was given in multiple-choice format with enough time (1 hr) for completion, and was followed with a survey about the age, gender, and language learning experiences of the sample test candidates.

Materials

The test section specifications call for a three- to four-paragraph text describing graphs, figures, or tables written as specimens of social science types of academic writing. In the case of the experimental test, four of these passages were used. Each of the passages had five items that tested readers' comprehension of the passage content. The themes sampled on the test can be seen in Table 1.

TABLE 1
Experimental Passage Order and Thematic Content

Passage   Thematic Content
I         Letter rotation experiment
II        Visual illusions experiment
III       Soccer league tournament
IV        Survey of transportation use changes

The experimental test comprised four short reading passages, which closely approximate the format and content of Section Four of the Center Examination. The sampling of students in this study yielded a mean and variance similar to the operational test. Table 2 lists descriptive statistics for the test.

TABLE 2
Mean, Standard Deviation, Internal Consistency, and Sample Size

M       SD     Reliability   Sample Size   Items
12.36   4.14   .780          825           20

Bias Survey Procedure

A test bias survey was constructed for use by in-service and preservice EFL teachers. The sampling of high school level teachers parallels the normal career path of Japanese members of a typical moderation panel. The actual external moderation panel is composed of university faculty members, most of whom had followed a career path starting with junior and senior high school EFL teaching. The bias survey was thus devised to sample early-, mid-, and late-career EFL teachers who were assumed to represent the larger population of language teaching professionals from whom future test moderation panel members are drafted. In-service teachers (n = 37) were surveyed individually.

In addition to the sampling of in-service teachers, a larger group of preservice EFL teachers in training (n = 60) was also surveyed so as to compare the ratings provided by seasoned professional teachers with those of neophyte teachers. All respondents were asked to examine the four test passages and each of the 20 items on the test before rating the likelihood that each item would favor male or female test candidates. The preservice teachers in training completed the survey during Teaching English as a Foreign Language (TEFL) Methodology course meetings. The rating scale used and the instructions are shown in the Appendix.

ANALYSES: OBJECTIVE DIFFERENTIAL ITEM FUNCTIONING ANALYSIS

A variety of options now exist for detecting DIF. Comparative research suggests that DIF methods tend to differ in the extent of Type I error and power. Whitmore and Schumacker (1999), for instance, found logistic regression more accurate than an analysis of variance approach. A direct comparison of logistic regression and the Mantel-Haenszel procedure (Rogers & Swaminathan, 1993) indicated moderate differences in power.
Swanson, Clauser, Case, Nungester, and Featherman (2002) more recently approached DIF with hierarchical logistic regression and found it to be more accurate than standard logistic regression or Mantel-Haenszel estimates. In this approach, different possible sources of item bias can be dummy-coded and nested in the multilevel design. Recent uses of logistic regression for DIF extend to polytomous rating categories (Lee, Breland, & Muraki, 2005) but still enable an examination of nonuniform DIF through interaction terms between matching scores and group membership.

Although multilevel modeling approaches offer extended opportunities for testing nested sources of potential DIF, the single-level methods, such as logistic regression and Mantel-Haenszel approaches, have tended to prevail in DIF studies. Penfield (2001) compared three variants of Mantel-Haenszel according to differences in the criterion significance level, and concluded that the generalized approach provided the lowest error and most power. Zwick and Thayer (2002) found that modifications of the Mantel-Haenszel procedure involving an empirical Bayes approach showed promise of greater potential for bias detection. A direct comparison of the Mantel-Haenszel procedure with Simultaneous Item Bias (SIB; Narayanan & Swaminathan, 1994) concluded that the Mantel-Haenszel procedure yielded smaller Type I error rates relative to SIB.

In this study, three empirical methods of detecting DIF were used. The choice of bias detection methods was based on their overall frequency of use in empirical DIF studies. The three methods were thought to represent conventional approaches to DIF research, and thus best operationalize the "objective" approaches to be compared with the subjective methods.

Mantel-Haenszel Delta was computed from six sets of equipercentile-matched ability subgroups cross-tabulated by gender. Differences in the observed Deltas for the matched males and females were evaluated against a chi-square distribution. This method matches males and females along the latent ability continuum and detects improbable discontinuities between the expected percentage of success and the observed data.

The second method of detecting bias was a logistic regression performed on the dichotomously scored outcomes for each of the 20 items. The baseline model tested the effects of gender controlling for each student's total score (Camilli & Shepard, 1994). In this binary regression, the probability of success should be solely influenced by the individual's overall ability. In the event of no bias, only the test score will account for systematic covariance with the responses on a particular item. If bias does affect a particular item, the variable encoding gender will covary with the item response independently of the covariance between the score and the outcome. Further, if bias is associated with particular levels of ability on the latent score continuum, a nonuniform DIF can be diagnosed with a Gender × Total Score interaction term:

Item response = constant + gender + score + (gender × score)

In the event nonuniform DIF is confirmed not to exist, the interaction term can be deleted to yield a main effect for gender, controlling for test score. Gender effects are then tested for nonrandomness against a t distribution.
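To make these two procedures concrete, here is a minimal sketch of the Mantel-Haenszel Delta and logistic regression checks described above. The six strata follow the text's equipercentile subgroups; the column names (item, total, female) and the .05 threshold are illustrative assumptions rather than details of the original study, whose computations are simply re-expressed here with pandas and statsmodels.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# df: one row per examinee, with 'item' (0/1 response to the studied item),
# 'total' (the matching total score), and 'female' (1 = focal group).

def mh_delta(df, n_strata=6):
    """Mantel-Haenszel common odds ratio over equipercentile score strata,
    re-expressed on the ETS Delta metric."""
    strata = pd.qcut(df["total"], n_strata, labels=False, duplicates="drop")
    num = den = 0.0
    for _, s in df.groupby(strata):
        n = len(s)
        a = ((s.female == 0) & (s.item == 1)).sum()  # reference correct
        b = ((s.female == 0) & (s.item == 0)).sum()  # reference incorrect
        c = ((s.female == 1) & (s.item == 1)).sum()  # focal correct
        d = ((s.female == 1) & (s.item == 0)).sum()  # focal incorrect
        num += a * d / n
        den += b * c / n
    return -2.35 * np.log(num / den)  # positive Delta favors the focal group

def dif_logistic(df, alpha=.05):
    """Uniform and nonuniform DIF via the logit model given in the text."""
    full = smf.logit("item ~ total + female + total:female", data=df).fit(disp=0)
    if full.pvalues["total:female"] < alpha:  # interaction signals nonuniform DIF
        return "nonuniform DIF"
    # Drop the interaction; the gender main effect, controlling for the
    # total score, tests uniform DIF.
    main = smf.logit("item ~ total + female", data=df).fit(disp=0)
    return "uniform DIF" if main.pvalues["female"] < alpha else "no DIF"
```

In the no-bias case only the total score carries systematic covariance with the item response, so both functions should return unremarkable values; in practice the per-item threshold would also be adjusted for the 20 repeated comparisons.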
The third empirical method was Simultaneous Item Bias (SIB), which draws on item response theory (Shealy & Stout, 1993). The SIB approach was performed on each of the 20 items in turn. The sums of all the other items were used in rotation as ability estimates in matching male and female examinees via a regression approach. This approach employs the matching strategy of the Mantel-Haenszel method, and uses the total score based on k − 1 items as a concurrent covariate for each of the item bias tests. Differences in estimates of DIF were evaluated against a z distribution.

The composite results of the three different approaches to estimating DIF for each of the 20 items are given in Table 3. Each of the three objective measures employs a different test statistic to assess the likelihood of the observed bias statistic. Analogous to meta-analytic methods, the different effects can be assessed as standardized metrics. To this end, each DIF estimate, controlled for overall candidate ability, is presented as a conventional probability (p < .05) of rejecting the null hypothesis.

Table 3 indicates that the Mantel-Haenszel and SIB approaches are equally parsimonious in detecting gender bias on the 20-item test. Both of these methods employ ability matches of men and women along the latent ability continuum. In con… [...]
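The rest-score matching that SIB builds on can be sketched as follows. This is a deliberately simplified illustration, not SIBTEST proper: it omits Shealy and Stout's regression correction for measurement error in the matching score, and the column names follow the hypothetical DataFrame used above.

```python
def sib_sketch(df, item_cols, studied):
    """Weighted difference in proportion correct between reference and focal
    examinees matched on the rest score (k - 1 items), z-tested."""
    rest = df[item_cols].drop(columns=studied).sum(axis=1)  # k - 1 item total
    diffs, var, weight = 0.0, 0.0, 0
    for _, s in df.groupby(rest):
        ref = s[s.female == 0][studied]
        foc = s[s.female == 1][studied]
        if len(ref) < 2 or len(foc) < 2:
            continue  # no stable comparison at this rest-score level
        w = len(s)
        diffs += w * (ref.mean() - foc.mean())
        var += w**2 * (ref.var(ddof=1) / len(ref) + foc.var(ddof=1) / len(foc))
        weight += w
    beta = diffs / weight  # > 0 means the item favors the reference group
    return beta, beta / (var**0.5 / weight)  # DIF estimate and z statistic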
… based on survey responses.

TABLE 4
Biased Item No. 13 From the Soccer Passage

Item 13. If the Fighters defeat the Sharks by a score of 1–0 then:
  1. The Lions will play the Sharks
  2. The Fighters will play the Bears
  3. The Sharks will play the Eagles
  4. The Fighters will play the Lions

The aim of this subdivision of the subjective raters was to explore possible sources of differential … represent biased test items. MH = Mantel-Haenszel; SIB = simultaneous item bias.

    … students on how to solve the tasks and get higher scores rather than on the comprehension of the contents. In this sense, rather than the topic, the form of the tasks may cause a different performance between boys and girls; boys may do better in mathematical and logical types of tasks. Anyway, girls do much better than boys as far as English exams are concerned. So I don't think there is gender difference among different English tests.

Here the male teacher contends that there is a systematic difference in the female students' understanding of science and logical content. Yet assuming that the language test content does not narrowly sample such domains, he claims there is no bias on the test in question. In fact, he suggests that the advantage is on the side of the female students, because the domain … products of differential socialization practices, but apparently tend to over-generalize them as natural categories of gender differences.

CONCORDANCE ANALYSES

After the subjective and objective analyses of bias on the 20 test items were compiled, a direct comparison was undertaken. The objective and subjective probability of bias estimates were converted into effect sizes. This conversion to a standard … and the null hypothesis needed to be tested to provide bias probabilities comparable to those in Table 3. To this end, the mean rating of gender bias on each of the 20 items was tested against the hypothesis that male versus female advantage on each item equaled zero. Table 5 contains the subjectively estimated probabilities that each item is biased. In contrast with the objective measures of bias, the … valence on the effect size estimates indicates bias thought to favor males. As Table 6 suggests, the subjective estimates of bias produce larger effect indicators of bias than do the objective methods. For Soccer 13, the item detected by the objective methods to produce a bias favoring males, the effect size of the bias is in the small effect range (Cohen, 1988). The subjective diagnostics of bias, in contrast, …

COMPARATIVE CONCORDANCE ANALYSES

As a final comparison of the differences between subjective and objective approaches to bias detection, separate concordance analyses were performed on the bias estimate effects. The three objective methods of detecting bias on the test items show a strong concordance in agreement about the lack of systematic gender bias on the 20 test items. Table 8 indicates a Concordance W of … concordant agreement between the subjective and objective approaches, the magnitude of the bias estimation is disproportionately large in the subjective estimation. The subjective judgments of item bias apparently differ according to the experience and gender of the preservice and in-service teachers in this study. To examine whether there is a conditional proclivity to identify bias subjectively, … functioning detection. The subjective ratings of gender bias are strongly in agreement about the existence of bias in the test items, though even the three samples of narrative accounts do not provide much consistent insight into why there is such hypersensitivity to the issue of gender differences.

CONCLUSION

The three conventional empirical methods of estimating item bias via differential item functioning show strong concordance. The three methods used in the analysis, the simultaneous item bias approach, the Mantel-Haenszel Delta, and logistic regression, were largely concordant in identifying that the majority of the 20 items on the four test passages were not biased. A single biased item (Soccer 13) was correctly detected by the three different objective methods. The subjective ratings of bias suggested …
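A Concordance W of the kind reported in Table 8 can be computed as below. This is a generic Kendall's coefficient of concordance over per-item bias effect sizes; the random array stands in for the actual estimates, which this excerpt does not reproduce, and the simple formula ignores the correction for tied ranks.

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(effects):
    """Kendall's W across m methods (columns) each ranking the same
    n objects (rows), here the 20 items' bias effect sizes."""
    ranks = np.apply_along_axis(rankdata, 0, effects)  # rank items per method
    n, m = effects.shape
    # Sum of squared deviations of each item's rank sum from its expectation.
    s = ((ranks.sum(axis=1) - m * (n + 1) / 2) ** 2).sum()
    return 12 * s / (m**2 * (n**3 - n))

# Hypothetical effect sizes: 20 items, three DIF methods.
rng = np.random.default_rng(0)
print(kendalls_w(rng.normal(size=(20, 3))))  # 1.0 = perfect agreement, 0 = none
```

A W near 1 across the three objective methods would mirror the strong agreement the authors report, while a lower W between the subjective and objective columns would reflect the overestimation of bias by the subjective raters.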