18.3 Ensuring Test Fairness Through Item Fairness


It is a well-known finding that boys do better than girls in Math, and the other way round for language. Does this mean the tests have gender bias? More generally, differences in test and examination performance have been found persistently between gender, racial, and socioeconomic groups. Is this evidence that the assessment is biased or lacks test fairness? Maybe, and maybe not; it depends. On what, then?

Imagine that Table 18.1 shows the results of an examination for two different groups. For the subject (whatever it may be), ethnic Group A has 85 % passes and Group B only 55 %. The Yates' chi-square of 13.79, with p < 0.01 (actually 0.0002041 in the computer printout), indicates that the pattern of passing rates is very unlikely to have happened by chance: it's real!

There is no doubt that the two groups perform differently on the examination.

But, could this have happened because the test is biased, favouring Group A and penalizing Group B?

If group differences alone are taken as the evidence of bias, the answer then is Yes (and Messick would agree). Nevertheless, the differences may be true reflections of group differences in the relevant knowledge or ability and not due to bias (and those who disagree with Messick would agree). It is readily appreciated that the differences in weight, strength, and many other physiological characteristics between males and females are truly natural phenomena and not biases of the measuring instruments. Likewise, tests and examinations with different results for groups differing in gender, ethnicity, and socioeconomic condition may in fact be fair (and hence useful in signaling the differences) in showing up real differences, rather than reflecting defects in the tools for assessing them. To think otherwise is like executing the messenger for the bad news. In other words, the bias (if there is one) lies not in the tests and examinations but elsewhere outside them: in the social conditions, the sorting systems, or nature itself.

Table 18.1 Examination results for two different groups (Language/Math/Science, etc.)

                  Pass        Fail        Total
Ethnic group A    85 (85 %)   15 (15 %)   100 (100 %)
Ethnic group B    24 (55 %)   20 (45 %)    44 (100 %)
Total            109          35          144
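For readers who wish to verify such figures, here is a minimal sketch in Python (assuming SciPy is available; the variable names are mine, not from the book):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Pass/Fail counts from Table 18.1 (rows: Group A, Group B)
table = np.array([[85, 15],
                  [24, 20]])

# correction=True applies Yates' continuity correction for a 2 x 2 table
chi2, p, dof, expected = chi2_contingency(table, correction=True)
print(f"Yates' chi-square = {chi2:.2f}, df = {dof}, p = {p:.7f}")
# Prints approximately: Yates' chi-square = 13.79, df = 1, p = 0.0002041
```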


But how do you know whether or not the test is biased? Look at how the groups respond to the items, that is, examine differential item functioning (DIF).

There are a few commonly used statistical methods of DIF for detecting item bias, differing in conceptual or theoretical complexity and computational demands; for instance, Vista and Care (2014) used five methods to evaluate items measuring spatial ability in 187 preparatory children in Melbourne. One of the methods is the chi-square test.

However, using the chi-square test for differential item functioning is not as straightforward as it may appear. If a test has many items and is long enough to cover a wide range of knowledge or ability, then items at different points of the whole scale (in terms of facilities; see Chap. 13, On Items) may function differentially (in terms of discrimination). It is therefore necessary to divide the whole scale into three to five subscales before applying the chi-square test to detect item bias.

Let us say we have 138 students taking a test (in any subject). They are first divided into two sex groups (Boys = 66 and Girls = 72) and then into three ability groups (High 41, Medium 55, and Low 42). For a particular item, the passing rates are shown in Table 18.2.

To check whether the item is fair to the two groups (i.e., whether there is a sex bias), we calculate the chi-square value at each ability level; the full chi-square value is the sum of the three values: 0.195 for High, 0.313 for Medium, and 5.789 for Low. The total or full chi-square is therefore 6.297.

With three (3) degrees of freedom (each ability group contributes 1 degree of freedom for its 2 × 2 table), the full chi-square of 6.297 is greater than 6.251 (the critical value at the 90 % confidence level) but less than 7.815 (at the 95 % confidence level). Incidentally, a chi-square table can be found on the Internet. Thus, this particular item may have a bias, favouring girls of low ability. As shown in Fig. 18.1, the Pass and Fail lines are well separated for the High and Medium groups, but the lines cross over for the Low group.

Table 18.2 Performance of an item by three ability groups of students

                   Pass   Fail   Total
High     Boys        15      6      21
         Girls       12      8      20
         Subtotal    27     14      41
         Yates' chi-square = 0.195, df = 1, p = 0.66
Medium   Boys        20      8      28
         Girls       22      5      27
         Subtotal    42     13      55
         Yates' chi-square = 0.313, df = 1, p = 0.58
Low      Boys         5     12      17
         Girls       18      7      25
         Subtotal    23     19      42
         Yates' chi-square = 5.789, df = 1, p = 0.016
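The full chi-square can be obtained by summing the three group-level Yates chi-squares, as the text describes. A sketch, again assuming SciPy, with the counts taken from Table 18.2:

```python
import numpy as np
from scipy.stats import chi2_contingency
from scipy.stats import chi2 as chi2_dist

# Boys/Girls x Pass/Fail counts within each ability group (Table 18.2)
groups = {
    "High":   [[15, 6], [12, 8]],
    "Medium": [[20, 8], [22, 5]],
    "Low":    [[5, 12], [18, 7]],
}

full_chi2 = 0.0
for name, counts in groups.items():
    stat, p, dof, _ = chi2_contingency(np.array(counts), correction=True)
    print(f"{name}: Yates' chi-square = {stat:.3f}, p = {p:.3f}")
    full_chi2 += stat

df = len(groups)  # one degree of freedom per 2 x 2 table
print(f"Full chi-square = {full_chi2:.3f}, df = {df}, "
      f"p = {chi2_dist.sf(full_chi2, df):.3f}")
# Prints 0.195, 0.313, and 5.789, summing to 6.297 (p just under 0.10)
```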

Once this has been done for all items, if the majority of the items are fair, it may be safe to conclude that the test as a whole is fair. Professional judgment is needed here. Of course, doing this for all items of a test may look tedious, but with a Web-based calculator it takes very little time, and that is a small price worth paying for professionalism and test fairness.

References

Kline, R. B. (2013). Assessing statistical aspects of test fairness with structural equation modelling. Educational Research and Evaluation, 19(2–3), 204–222. http://dx.doi.org/10.1080/13803611.2013.767624

Kunnan, A. J. (2010). Statistical analysis for test fairness. Revue Francaise de Linguistique Appliquee, 15(1), 39–48.

Messick, S. (1998). Consequences of test interpretation and use: The fusion of validity and values in psychological assessment. Princeton, NJ: Educational Testing Service.

Vista, A., & Care, E. (2014). Differential item functioning and its utility in an increasingly diverse classroom: Perspectives from Australia. Journal of Education and Human Development, 3(2), 753–774.

[Fig. 18.1 Item differential functioning curves: Pass and Fail counts for Boys and Girls within the High, Medium, and Low ability groups]


Epilogue

Test scores are important: to students, because their future depends on them to a large extent, especially in competitive systems of education; to parents, because their children's future is at stake; to teachers, because their understanding of students is based on them and their effectiveness is partly reflected by them; and to school leaders, because the school's reputation is often influenced by them.

However, training in the understanding and proper use of test scores has not been given as much time and effort as it deserves in the pre-service preparation of teachers; it is cursory at best. Teachers learn these "tricks" on the job and may pick up improper knowledge and skills, and such inappropriateness gets perpetuated and shared. Proper understanding and use of test scores, with full awareness of the subtlety behind them and of their limitations, is an important body of professional knowledge and skills that teachers and school leaders need to acquire.

This book begins by trying to explain subtle statistical concepts but ends up discussing tests and measurement. This is because of the nature of the two fields and their connectedness: test scores can be properly understood only with reference to relevant statistical concepts. In the process of writing, I always bore in mind teachers and school leaders as my audience and limited myself to the statistical and measurement concepts that are most relevant to them. In this connection, I would like to thank the three anonymous reviewers who read my book proposal and made favourable comments and useful suggestions. If there is any important omission, it is due to my limited experience and knowledge. After all, statistics (educational or otherwise) is a living discipline, with new ideas and techniques emerging every now and then.

As F. M. Lord, a giant of tests and measurement at the Educational Testing Service, USA, once said in his provocative 1953 article, On the Statistical Treatment of Football Numbers, which appeared in the American Psychologist, "the numbers do not know where they came from." Test scores standing alone have apparent, or seeming, but inaccurate meanings. They appear simple and straightforward, but they have contexts and limitations which govern their proper interpretation and hence their proper use. In a sense, test scores are not what they simply look like, as the various chapters of this book try to show, hopefully with some degree of success.

Christmas Eve 2015


Appendix A

A Test Analysis Report

This report demonstrates how a post-hoc analysis of a test/exam can be done, using the statistical and measurement concepts and techniques introduced in this book.

In addition to using test results to make decisions on the students, test analysis can be conducted to study the efficacy of the test as an instrument for collecting information on students' achievement. Looking into the test in this way will enhance teachers' and school leaders' understanding of how their tests work and identify areas for improvement where assessment is concerned.

A.1 Students

Three classes of Secondary 3 students (N = 78) were tested with a language test which comprised 10 multiple-choice items (MCQ; scored 1 for right and 0 for wrong) and 10 essay questions (each carrying a possible maximum score of five).

A.2 Item-Analysis

The first concern of the analysis is how well the 20 items work. Item analysis was run on the scores, and two item indices were calculated: facility (p; proportion of correct answers) and discrimination (r; correlation between item and total scores).

The appropriateness of each item was evaluated by the conventional criteria and is shown in the Comments column of Table A1.1. The following are observed:

• Among the MCQ items, in terms of facility, one item is very easy, three are easy, two adequate, and four difficult. The MCQ subtest as a whole has an adequate facility, indicating that it is appropriate for the students. In terms of discrimination, all items are adequate.


• Among the essays, in terms of facility, six questions are adequate but four are difficult. The subtest as a whole has an adequate facility, indicating that it is suitable for the students. In terms of discrimination, seven questions have strong discrimination, two are adequate, and one is weak.

It is therefore concluded that the test as a whole is well designed and suits the target students.

Table A1.1 Item indices

Item No.     Facility       Discrimination    Discrimination    Comments
                            within subtest    for whole test
Multiple-choice subtest
1            0.60           0.42              0.38              Adequate in both indices
2            0.79           0.56              0.56              Easy. Adequate discrimination
3            0.81           0.41              0.27              Very easy. Adequate discrimination
4            0.76           0.58              0.57              Easy. Adequate discrimination
5            0.33           0.50              0.47              Difficult. Adequate discrimination
6            0.37           0.40              0.22              Difficult. Adequate discrimination
7            0.38           0.56              0.46              Difficult. Adequate discrimination
8            0.71           0.55              0.41              Easy. Adequate discrimination
9            0.40           0.49              0.34              Difficult. Adequate discrimination
10           0.46           0.46              0.26              Adequate in both indices
Subtest      0.56           –                 –                 Adequate facility
Essay subtest
11           2.78 (0.56)    0.39              0.38              Adequate facility. Weak discrimination
12           2.86 (0.57)    0.50              0.52              Adequate in both indices
13           1.17 (0.23)    0.46              0.48              Difficult. Adequate discrimination
14           1.81 (0.36)    0.68              0.69              Difficult. Strong discrimination
15           1.29 (0.26)    0.66              0.62              Difficult. Strong discrimination
16           2.54 (0.51)    0.64              0.66              Adequate facility. Strong discrimination
17           2.67 (0.53)    0.66              0.62              Adequate facility. Strong discrimination
18           2.73 (0.55)    0.79              0.76              Adequate facility. Strong discrimination
19           2.23 (0.45)    0.76              0.73              Adequate facility. Strong discrimination
20           1.63 (0.33)    0.67              0.66              Difficult. Strong discrimination
Subtest      21.71 (0.43)   –                 –                 Adequate facility
Whole test   27.33 (0.46)   –                 –                 Adequate facility

Note: Figures in parentheses are facilities calculated as mean/possible maximum.
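If you want to reproduce this kind of item analysis, the two indices are simple to compute. A sketch with a hypothetical score matrix (the book does not supply the raw data); facility is taken as mean/possible maximum and discrimination as the item-total correlation, as described in Sect. A.2:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 78 students, 10 MCQ items (0/1), 10 essays (0-5)
mcq = rng.integers(0, 2, size=(78, 10))
essay = rng.integers(0, 6, size=(78, 10))
scores = np.hstack([mcq, essay]).astype(float)
max_marks = np.array([1] * 10 + [5] * 10, dtype=float)

total = scores.sum(axis=1)
for i in range(scores.shape[1]):
    facility = scores[:, i].mean() / max_marks[i]     # mean / maximum
    discrim = np.corrcoef(scores[:, i], total)[0, 1]  # item-total r
    print(f"Item {i + 1:2d}: facility = {facility:.2f}, "
          f"discrimination = {discrim:.2f}")
```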


A.3 Reliability

The second concern of the analysis is how reliable the subtests and the whole test are. Reliability was estimated in terms of Cronbach's alpha coefficient, a measure of internal consistency. As shown in Table A1.2, the reliability is a moderate 0.65 for the MCQ subtest and a high 0.82 for the Essay subtest. For the whole test, the reliability of 0.84 is high, close to the 0.90 expected for making decisions on individuals.

Table A1.2 Reliability

Test section   Internal consistency reliability
MCQ            0.65
Essay          0.82
Whole test     0.84
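Cronbach's alpha is easily computed from the item-score matrix; a minimal sketch (the helper function is my own illustration, not from the book):

```python
import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: 2-D array, rows = students, columns = items."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]                          # number of items
    item_vars = x.var(axis=0, ddof=1)       # variance of each item
    total_var = x.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Usage, given a 78 x 20 score matrix like the one above:
# cronbach_alpha(scores[:, :10])   # MCQ subtest
# cronbach_alpha(scores[:, 10:])   # Essay subtest
# cronbach_alpha(scores)           # whole test
```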

A.4 Comparisons

By Gender: The third concern of the analysis is whether there are differences between the boys (N = 34) and girls (N = 44). As Table A1.3 shows, for the MCQ subtest, the girls scored 1.9 points higher than the boys, a large effect size of d = 1.00. For the Essay subtest, the girls scored 3.9 points higher, a medium effect size of d = 0.52. And for the whole test, the girls scored 5.9 points higher (d = 0.66). In sum, the girls generally scored better than the boys.

Table A1.3 Performance by gender

           All    Boys (N = 34)   Girls (N = 44)   Difference   Effect size d
MCQ
Mean       5.6    4.6             6.5              −1.9         −1.00
SD         2.3    1.9             2.2              –            –
Maximum    10     7               10               −3           –
Minimum    0      0               1                −1           –
Essay
Mean       21.7   19.5            23.4             −3.9         −0.52
SD         7.5    7.5             7.0              –            –
Maximum    33     29              33               −4           –
Minimum    0      0               6                −6           –
Whole test
Mean       27.3   24.0            29.9             −5.9         −0.66
SD         9.1    9.0             8.5              –            –
Maximum    40     34              40               −6           –
Minimum    0      0               7                −7           –
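The effect size d quoted above is a mean difference divided by a standard deviation. The book does not spell out which SD it divides by, so the sketch below shows the common pooled-SD version; treat the exact values as convention-dependent:

```python
import numpy as np

def cohens_d(group1, group2):
    """Cohen's d: mean difference divided by the pooled SD."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = len(g1), len(g2)
    pooled_var = ((n1 - 1) * g1.var(ddof=1) +
                  (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
    return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

# Usage (hypothetical arrays of whole-test scores):
# cohens_d(boys_scores, girls_scores)  # negative d: girls score higher
```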

By Class: The three classes are also compared on their performance, using 3E1 as the benchmark. As shown in Table A1.4, 3E1 scored higher than the other two classes on the MCQ subtest, the Essay subtest, and the whole test. The effect sizes of its lead over 3E3 are small (d = 0.17 to 0.38), while those over 3E4 are medium to large (d = 0.55 to 0.83).

Table A1.4 Performance by class

           3E1 (N = 21)   3E3 (N = 27)   3E4 (N = 30)   3E1−3E3          3E1−3E4
MCQ
Mean       6.3            6.0            4.8            0.3 (d = 0.17)   1.5 (d = 0.83)
SD         1.8            1.9            2.7            −0.1             −0.9
Maximum    9              10             9              −1               0
Minimum    2              3              0              −1               2
Essay
Mean       23.9           21.5           20.4           2.4 (d = 0.38)   3.5 (d = 0.55)
SD         6.4            5.7            9.2            0.7              −2.8
Maximum    33             30             31             3                2
Minimum    12             7              0              5                12
Whole test
Mean       30.2           27.4           25.2           2.8 (d = 0.37)   5.0 (d = 0.66)
SD         7.6            6.9            11.3           0.7              −3.7
Maximum    40             39             39             1                1
Minimum    14             11             0              3                14

A.5 Correlations and Multiple Regression

It is of theoretical and practical significance to understand the relations between the two subtests and how they contribute to the total score. As shown in Table A1.5, the two subtests have a moderate correlation coefficient of 0.67, sharing about 45 % common variance (i.e., shared individual differences). Both subtests correlate more highly with the whole test, with high coefficients of 0.79 (MCQ) and 0.98 (Essay). However, the near-perfect correlation between the Essay subtest and the whole test indicates that the total scores for the whole test are almost totally determined by the scores for the Essay subtest. This means the MCQ subtest plays a very limited role in differentiating among the students.

Table A1.6 shows the results of a multiple regression in which the two sets of subtest scores are used to predict the total scores. According to the results, the raw-score equation is:

Total score = 1 × MCQ + 1 × Essay + Intercept

That is exactly how the total score is arrived at for each student. However, as shown in Table A1.3, for all students the MCQ subtest has a standard deviation of only 2.3 while the Essay subtest has 7.5. This difference in spread (see Chap. 7, On Multiple Regression) affects the contributions of the two subtests to the whole test, and the scores therefore have to be standardized. When the standardized scores are used for the multiple regression, the standardized regression coefficients (Betas) are 0.262 for the MCQ subtest and 0.823 for the Essay subtest. Thus, the regression equation using the standardized scores is

Standardized total score = 0.262 × MCQ + 0.823 × Essay

In this equation, the standardized regression weights (0.262 and 0.823) replace the unstandardized ones (1.00 and 1.00), and the intercept is standardized at 0.00. It is important to note that the ratio of the two Beta-weights is 0.823/0.262 = 3.14. This means that students' performance on the test as a whole depends much more on their Essay scores than on their MCQ scores.

Table A1.5 Correlation coefficients

             MCQ    Essay   Whole test
MCQ          1.00   0.67    0.79
Essay        –      1.00    0.98
Whole test   –      –       1.00

Table A1.6 Multiple regression

            b-weight   Beta    p
MCQ         1.000      0.262   0.01
Essay       1.000      0.823   0.01
Intercept   0.000      0.000   1.00

R = 1.00, adjusted R² = 1.00
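The Beta weights can be recovered by regressing standardized total scores on standardized subtest scores. Because the total is the simple sum of the two subtests (raw b-weights of 1), each Beta also equals that subtest's SD divided by the total's SD; with the rounded SDs of Table A1.3, 2.3/9.1 ≈ 0.25 and 7.5/9.1 ≈ 0.82, close to the reported 0.262 and 0.823. A sketch with simulated (not the book's) data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated subtest scores for 78 students (means/SDs echo Table A1.3)
mcq = rng.normal(5.6, 2.3, 78)
essay = rng.normal(21.7, 7.5, 78)
total = mcq + essay                  # raw b-weights are 1, intercept 0

def z(x):
    """Standardize to mean 0, SD 1."""
    return (x - x.mean()) / x.std(ddof=1)

# Least-squares fit of standardized total on standardized predictors
X = np.column_stack([z(mcq), z(essay)])
betas, *_ = np.linalg.lstsq(X, z(total), rcond=None)
print("Betas:", betas)
print("SD ratios:", mcq.std(ddof=1) / total.std(ddof=1),
      essay.std(ddof=1) / total.std(ddof=1))
# The two printed pairs agree, illustrating Beta = b x (SD_x / SD_total)
```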

A.6 Summary and Conclusion

The analysis of the test scores of the 78 Secondary 3 students on the 20-item test shows the following results:

1. The MCQ and Essay subtests and the whole test are suitable for the students in terms of difficulty and have adequate discrimination (i.e., they are able to distinguish between students with differential achievement).

2. Girls do better than boys on both subtests and the whole test. 3E1 scores higher than the other two classes, especially 3E4.

3. The MCQ subtest has a lower reliability when compared with the Essay subtest. However, the test as a whole has high reliability and can be used for making decisions on individual students.

4. The Essay subtest makes about three times the contribution to the total scores that the MCQ subtest does.

A.7 Recommendations

For future development, the following suggestions are to be considered:

1. The effective items can be kept in an item pool for future use. This will enhance the year-to-year comparability of tests and save teachers the time and effort of coming up with new items.

2. The less adequate items (in terms of facility and discrimination) need to be studied for content and item phrasing, so as to inform teachers of the needed instructional changes and of improvements in item-writing skills.

3. The number of MCQ items needs to be increased so that this subtest contributes more to the total scores and the students' performance does not rely so much on the Essay subtest. A balance between MCQ items and essay questions, in terms of relative contributions to the total score, is desirable for assessing skills in different language aspects.


Appendix B

A Note on the Calculation of Statistics

Using statistics to process test and exam results for better understanding inevitably involves calculation. This is a necessary evil.

More than half a century ago, when I started as a primary school teacher, all test results were hand-calculated, and this involved tedious work, rushing for time, boredom and, above all, the risk of inaccuracy. Moreover, calculating to the third or fourth decimal place seemed to be a sign of conscientiousness (or professionalism). Then came the hand-operated but clumsy calculating machine, and later the hand-held but still somewhat clumsy calculator. As time passed, the data got bigger in size but the calculation got easier, although the statistics do not change: a mean is still a mean and does not change its meaning however it is calculated. Now, with this convenience, I can afford to use more statistics which are more complicated to calculate, for example, the SD and the correlation coefficient, even regression and multiple regression. And, not to forget, the chi-square and exact probability.

With the ready availability of computing facilities, teachers and school leaders nowadays can afford the time and energy to use more statistics (and conceptually more complex ones) for a better understanding of test and exam results, to the benefit of the students and the school.

In the school context, sophisticated computing software designed for researchers, who routinely handle large amounts of complicated calculation, is not necessary. As I work more with class and school data, I have realized that Excel is able to do most if not all of the work that needs to be done. Moreover, it is almost omnipresent.

B.1 Using Excel

• Create a master worksheet to store the data for all variables, with the labels across the very first row and the first column kept for students' serial numbers and names. The table is always rows (individuals) by columns (variables).

• For different analyses, create specific worksheets by copying from the master worksheet the data needed for the variables to be analyzed (e.g., correlated).


• Pay attention to the small down arrowhead next to Σ. It leads to the many calculation functions you need: the total (Sum), the mean (Average), the frequency (Count Numbers), the highest (Max), the lowest (Min), and "More Functions…".

• "More Functions…" offers many choices, and the one you need is always Statistical, which leads you to many statistical functions, from AVEDEV to Z.TEST. Once you have used some of the functions, the next time you need only Most Recently Used, which lists just those you have used and may need again.

• Learn to drag: point to the black dot at the bottom-right corner of a selected cell and drag it to the right. This allows you to repeat the calculation across the columns (i.e., for the other variables).

• Learn to use $ (not your money!). This fixes a cell reference that is to be compared constantly, for example, when computing the Var1–Var2, Var1–Var3, etc. correlation coefficients, so that the first variable (Var1) is held constant (e.g., a formula such as =CORREL($B$2:$B$79, C2:C79) can be dragged across columns while the Var1 range stays fixed; the ranges here are illustrative).

