It is a well-known observation that boys do better than girls in Math, and the other way round for language. Does this mean the tests have gender bias? More generally, differences in test and examination performance have persistently been found between gender, racial, and socioeconomic groups. Is this evidence that the assessments are biased or lack fairness? Maybe, and maybe not; it depends. On what, then?
Imagine that Table 18.1 shows the results of an examination for two different groups. For the subject (whatever it may be), ethnic Group A has 85 % passes and Group B only about 55 %. The Yates' chi-square of 13.79, with p < 0.01 (actually 0.0002041 in the computer printout), indicates that the pattern of passing rates is very unlikely to have happened by chance: it's real! (A computational check of this figure follows Table 18.1.)
There is no doubt that the two groups perform differently on the examination.
But, could this have happened because the test is biased, favouring Group A and penalizing Group B?
If group differences alone are taken as evidence of bias, the answer then is Yes (and Messick would agree). Nevertheless, the differences may be true reflections of group differences in the relevant knowledge or ability and not due to bias (and those who disagree with Messick would agree). It is readily appreciated that differences in weight, strength, and many other physiological characteristics between males and females are truly natural phenomena and not biases of the measuring instruments. Likewise, tests and examinations with different results for groups differing in gender, ethnicity, and socioeconomic condition may in fact be fair (and hence useful in signaling the differences) in showing up real differences, rather than reflecting defects in the tools used to assess them. To think otherwise is like executing the messenger for bad news. In other words, the bias (if any) lies not in the tests and examinations but elsewhere: in the social conditions, the sorting systems, or nature.
Table 18.1 Examination results for two different groups

Language/Math/Science, etc.

                   Pass         Fail         Total
Ethnic group A     85 (85 %)    15 (15 %)    100 (100 %)
Ethnic group B     24 (55 %)    20 (45 %)     44 (100 %)
Total             109           35           144
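For readers who want to verify such figures by computer, here is a minimal sketch, assuming Python with the scipy package (any statistics software would serve equally well):

```python
# Reproduces the Yates-corrected chi-square reported for Table 18.1.
from scipy.stats import chi2_contingency

# Rows: Group A, Group B; columns: Pass, Fail (counts from Table 18.1)
table = [[85, 15],
         [24, 20]]

chi2, p, df, expected = chi2_contingency(table, correction=True)  # Yates' correction
print(f"Yates' chi-square = {chi2:.2f}, df = {df}, p = {p:.6f}")
# Prints chi-square = 13.79 with p of about 0.0002, matching the printout above
```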
But how do you know whether the test is biased? Look at how the groups respond to the individual items, that is, at differential item functioning (DIF).
There are a few commonly used statistical methods for detecting item bias through DIF, differing in conceptual or theoretical complexity and computational demands; for instance, Vista and Care (2014) used five methods to evaluate items measuring space, administered to 187 preparatory children in Melbourne. One of the methods is the chi-square test.
However, using the chi-square test for differential item functioning is not as straightforward as it may appear. If a test has many items and is long enough to cover a wide range of knowledge or ability, then items at different points of the whole scale (in terms of facility; see Chap. 13, On Items) may function differentially (in terms of discrimination). It is therefore necessary to divide the whole scale into three to five subscales before applying the chi-square test to detect item bias.
Let us say we have 138 students taking a test (in any subject). They are first divided into two sex groups (Boys = 66 and Girls = 72) and then into three ability groups (High = 41, Medium = 55, and Low = 42). For a particular item, the passing rates are shown in Table 18.2.
To check whether the item is fair to the two groups (i.e., whether there is a sex bias), we calculate the chi-square value at each ability level; the full chi-square is the sum of the three values: 0.195 for High, 0.313 for Medium, and 5.789 for Low, giving 6.297 in total.
With three degrees of freedom (each ability group contributes 1 degree of freedom for its 2 × 2 table), the full chi-square of 6.297 is greater than 6.251 (the critical value at the 90 % confidence level) but less than 7.815 (at the 95 % level). Incidentally, chi-square tables can be found on the Internet. Thus, this particular item may be biased, favouring girls of low ability. As shown in Fig. 18.1, the Pass and Fail lines are well separated for the High and Medium groups but cross over for the Low group. (A computational sketch of this stratified procedure follows Table 18.2.)
Table 18.2 Performance of an item by three ability groups of students

                     Pass   Fail   Total
High     Boys          15      6      21
         Girls         12      8      20
         Subtotal      27     14      41
         Yates' chi-square = 0.195, df = 1, p = 0.66
Medium   Boys          20      8      28
         Girls         22      5      27
         Subtotal      42     13      55
         Yates' chi-square = 0.313, df = 1, p = 0.58
Low      Boys           5     12      17
         Girls         18      7      25
         Subtotal      23     19      42
         Yates' chi-square = 5.789, df = 1, p = 0.02
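To make the stratified procedure concrete, here is a minimal sketch, again assuming Python with scipy, that reproduces the three chi-squares in Table 18.2, sums them into the full chi-square, and looks up the critical values that would otherwise come from a printed chi-square table:

```python
# Stratified (by ability) chi-square DIF check of one item; counts from Table 18.2.
from scipy.stats import chi2, chi2_contingency

# (Pass, Fail) counts for Boys (first row) and Girls (second row) per ability group
strata = {
    "High":   [[15, 6], [12, 8]],
    "Medium": [[20, 8], [22, 5]],
    "Low":    [[5, 12], [18, 7]],
}

full_chi2 = 0.0
for group, counts in strata.items():
    stat, p, df, _ = chi2_contingency(counts, correction=True)  # Yates-corrected 2x2
    full_chi2 += stat
    print(f"{group}: chi-square = {stat:.3f}, p = {p:.2f}")

# Each 2x2 table contributes one degree of freedom, so df = 3 overall
print(f"Full chi-square = {full_chi2:.3f}")               # 6.297
print(f"90 % critical value = {chi2.ppf(0.90, 3):.3f}")   # 6.251
print(f"95 % critical value = {chi2.ppf(0.95, 3):.3f}")   # 7.815
```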
Once this has been done for all items, if the majority of the items are fair, it may be safe to conclude that the test as a whole is fair. Professional judgment is needed here. Of course, doing this for every item of a test may look tedious, but with a Web-based calculator it takes very little time, a small price to pay for professionalism and test fairness.
References
Kline, R. B. (2013). Assessing statistical aspects of test fairness with structural equation modelling. Educational Research and Evaluation, 19(2–3), 204–222. http://dx.doi.org/10.1080/13803611.2013.767624
Kunnan, A. J. (2010). Statistical analysis for test fairness. Revue Française de Linguistique Appliquée, 15(1), 39–48.
Messick, S. (1998). Consequences of test interpretation and use: The fusion of validity and values in psychological assessment. Princeton, NJ: Educational Testing Service.
Vista, A., & Care, E. (2014). Differential item functioning and its utility in an increasingly diverse classroom: Perspectives from Australia. Journal of Education and Human Development, 3(2), 753–774.
[Fig. 18.1 Item differential functioning curves: Pass and Fail counts plotted for Boys and Girls within the High, Medium, and Low ability groups]
Epilogue
Test scores are important: to students, because their future depends on these to a large extent, especially in competitive systems of education; to parents, because their children's future is at stake; to teachers, because their understanding of students is based on these and their effectiveness is partly reflected by these; and to school leaders, because the schools' reputation is more often than not influenced by these.
However, training in the understanding and proper use of test scores has not been given as much time and effort as it deserves in the pre-service preparation of teachers; it is cursory at best. Teachers learn these “tricks” on the job and may pick up improper knowledge and skills, and such inappropriateness gets perpetuated and shared. Proper understanding and use of test scores, together with full awareness of the subtlety behind test scores and their limitations, is an important body of professional knowledge and skill that teachers and school leaders need to acquire.
This book begins by trying to explain subtle statistical concepts and ends up discussing tests and measurement. This is because of the nature of the two fields and their connectedness: test scores can be properly understood only with reference to the relevant statistical concepts. In the process of writing, I always bore in mind teachers and school leaders as my audience and limited myself to the statistical and measurement concepts most relevant to them. In this connection, I would like to thank the three anonymous reviewers who read my book proposal and made favourable comments and useful suggestions. If there is any important omission, it is due to my limited experience and knowledge. After all, statistics (educational or otherwise) is a living discipline, with new ideas and techniques emerging every now and then.
As F. M. Lord, a giant of tests and measurement at the Educational Testing Service, USA, once said in his provocative 1953 article On the Statistical Treatment of Football Numbers, which appeared in the American Psychologist, “the numbers do not know where they came from”.
Test scores standing alone have apparent, or seeming, but inaccurate meanings. They appear simple and straightforward, but they have contexts and limitations which govern their proper interpretation and hence their proper use. In a sense, test scores are not simply what they look like, as the various chapters of this book have tried to show, hopefully with some degree of success.
Christmas Eve 2015
Appendix A
A Test Analysis Report
This report demonstrates how a post hoc analysis of a test/exam can be done, using the statistical and measurement concepts and techniques introduced in this book.
In addition to using test results to make decisions about students, test analysis can be conducted to study the efficacy of the test as an instrument for collecting information on students' achievement. Looking into the test this way will enhance teachers' and school leaders' understanding of how their tests work and help identify areas for improvement where assessment is concerned.
A.1 Students
Three classes of Secondary 3 students (N = 78) were tested with a language test comprising 10 multiple-choice items (MCQ; scored 1 for right and 0 for wrong) and 10 essay questions (each carrying a possible maximum score of five).
A.2 Item-Analysis
The first concern of the analysis is how well the 20 items work. Item analysis was run on the scores, and two item indices were calculated: facility (p; the proportion of correct answers) and discrimination (r; the correlation between item and total scores); a computational sketch of both indices follows the observations below. The appropriateness of each item was evaluated against the conventional criteria and is shown in the Comments column of Table A1.1. The following are observed:
• Among the MCQ items, in terms of facility, one item is very easy, three are easy, two are adequate, and four are difficult. The MCQ subtest as a whole has an adequate facility, indicating that it is appropriate for the students. In terms of discrimination, all items are adequate.
• Among the essay questions, in terms of facility, six are adequate but four are difficult. However, the subtest as a whole has an adequate facility, indicating that it is suitable for the students. In terms of discrimination, seven have strong discrimination, two are adequate, and one is weak.
It is therefore concluded that the test as a whole is well designed and suits the target students.
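The following is a minimal sketch of how these two indices might be computed, assuming Python with numpy; since the actual score file is not reproduced here, random data stand in for it:

```python
# Item analysis sketch: facility and discrimination for each item.
import numpy as np

def facility(item_scores, max_score):
    """Facility p: mean item score as a proportion of the possible maximum."""
    return item_scores.mean() / max_score

def discrimination(item_scores, total_scores):
    """Discrimination r: correlation between item scores and total scores."""
    return np.corrcoef(item_scores, total_scores)[0, 1]

rng = np.random.default_rng(0)               # illustrative data only
mcq = rng.integers(0, 2, size=(78, 10))      # 10 MCQ items scored 0/1
essay = rng.integers(0, 6, size=(78, 10))    # 10 essay questions scored 0-5
total = mcq.sum(axis=1) + essay.sum(axis=1)  # whole-test scores

for i in range(10):
    p = facility(mcq[:, i], max_score=1)
    r = discrimination(mcq[:, i], total)     # discrimination against the whole test
    print(f"MCQ item {i + 1}: facility = {p:.2f}, discrimination = {r:.2f}")
```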
Table A1.1 Item indices

Item No.    Facility       Discrimination    Discrimination    Comments
                           within subtest    for whole test
Multiple-choice subtest
1           0.60           0.42              0.38              Adequate in both indices
2           0.79           0.56              0.56              Easy. Adequate discrimination
3           0.81           0.41              0.27              Very easy. Adequate discrimination
4           0.76           0.58              0.57              Easy. Adequate discrimination
5           0.33           0.50              0.47              Difficult. Adequate discrimination
6           0.37           0.40              0.22              Difficult. Adequate discrimination
7           0.38           0.56              0.46              Difficult. Adequate discrimination
8           0.71           0.55              0.41              Easy. Adequate discrimination
9           0.40           0.49              0.34              Difficult. Adequate discrimination
10          0.46           0.46              0.26              Adequate in both indices
Subtest     0.56           –                 –                 Adequate facility
Essay subtest
11          2.78 (0.56)    0.39              0.38              Adequate facility. Weak discrimination
12          2.86 (0.57)    0.50              0.52              Adequate in both indices
13          1.17 (0.23)    0.46              0.48              Difficult. Adequate discrimination
14          1.81 (0.36)    0.68              0.69              Difficult. Strong discrimination
15          1.29 (0.26)    0.66              0.62              Difficult. Strong discrimination
16          2.54 (0.51)    0.64              0.66              Adequate facility. Strong discrimination
17          2.67 (0.53)    0.66              0.62              Adequate facility. Strong discrimination
18          2.73 (0.55)    0.79              0.76              Adequate facility. Strong discrimination
19          2.23 (0.45)    0.76              0.73              Adequate facility. Strong discrimination
20          1.63 (0.33)    0.67              0.66              Difficult. Strong discrimination
Subtest     21.71 (0.43)   –                 –                 Adequate facility
Whole test  27.33 (0.46)   –                 –                 Adequate facility

Note: Figures in parentheses are facilities calculated as mean/possible maximum
A.3 Reliability
The second concern of the analysis is how reliable the subtests and the whole test are. Reliability was estimated in terms of Cronbach's alpha coefficient, a measure of internal consistency. As shown in Table A1.2, the reliability is a moderate 0.65 for the MCQ subtest and a high 0.82 for the Essay subtest. For the whole test, the reliability of 0.84 is high, close to the 0.90 expected for making decisions on individuals.

Table A1.2 Reliability

Test section   Internal consistency reliability
MCQ            0.65
Essay          0.82
Whole test     0.84
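Cronbach's alpha is simple enough to compute directly. A minimal sketch, assuming Python with numpy and illustrative data in place of the actual score file:

```python
# Cronbach's alpha: internal consistency of a (students x items) score array.
import numpy as np

def cronbach_alpha(scores):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)

rng = np.random.default_rng(1)             # illustrative data only
mcq = rng.integers(0, 2, size=(78, 10))    # 10 MCQ items scored 0/1
print(f"MCQ alpha = {cronbach_alpha(mcq):.2f}")
# With the actual scores this yields 0.65 (MCQ), 0.82 (Essay), 0.84 (whole test)
```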
A.4 Comparisons
By Gender  The third concern of the analysis is whether there are differences between the boys (N = 34) and the girls (N = 44). As Table A1.3 shows, the girls scored 1.9 points higher than the boys on the MCQ subtest, a large effect size of d = 1.00; 3.9 points higher on the Essay subtest, a medium effect size of d = 0.52; and 5.9 points higher on the whole test, a medium effect size of d = 0.66. In sum, the girls generally scored better than the boys. (A computational sketch of d follows Table A1.3.)
Table A1.3 Performance by gender

            All     Boys (N = 34)   Girls (N = 44)   Difference   Effect size d
MCQ
Mean        5.6     4.6             6.5              −1.9         −1.00
SD          2.3     1.9             2.2              –            –
Maximum     10      7               10               −3           –
Minimum     0       0               1                −1           –
Essay
Mean        21.7    19.5            23.4             −3.9         −0.52
SD          7.5     7.5             7.0              –            –
Maximum     33      29              33               −4           –
Minimum     0       0               6                −6           –
Whole test
Mean        27.3    24.0            29.9             −5.9         −0.66
SD          9.1     9.0             8.5              –            –
Maximum     40      34              40               −6           –
Minimum     0       0               7                −7           –
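The d values in Table A1.3 are consistent with dividing the mean difference by the SD of the first (boys') group. A minimal sketch of that calculation, assuming Python:

```python
# Effect size sketch: mean difference divided by the first group's SD,
# which reproduces the d values reported in Table A1.3.
def effect_size(mean_1, mean_2, sd_1):
    """d = (mean_1 - mean_2) / SD of the first group."""
    return (mean_1 - mean_2) / sd_1

# (label, boys' mean, girls' mean, boys' SD) from Table A1.3
for label, boys_mean, girls_mean, boys_sd in [
    ("MCQ", 4.6, 6.5, 1.9),
    ("Essay", 19.5, 23.4, 7.5),
    ("Whole test", 24.0, 29.9, 9.0),
]:
    d = effect_size(boys_mean, girls_mean, boys_sd)
    print(f"{label}: d = {d:.2f}")   # -1.00, -0.52, -0.66
```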
By Class  The three classes are also compared on their performance, using class 3E1 as the benchmark. As shown in Table A1.4, 3E1 scored higher than the other two classes on the MCQ subtest, with a small effect size compared with 3E3 (d = 0.17) but a large one compared with 3E4 (d = 0.83). For the Essay subtest, 3E1 again scored higher than the other two classes, with a small effect size compared with 3E3 (d = 0.38) and a medium one compared with 3E4 (d = 0.55). For the whole test, the effect sizes are small compared with 3E3 (d = 0.37) and medium compared with 3E4 (d = 0.66).

Table A1.4 Performance by class

            3E1 (N = 21)   3E3 (N = 27)   3E4 (N = 30)   3E1 − 3E3        3E1 − 3E4
MCQ
Mean        6.3            6.0            4.8            0.3 (d = 0.17)   1.5 (d = 0.83)
SD          1.8            1.9            2.7            −0.1             −0.9
Maximum     9              10             9              −1               0
Minimum     2              3              0              −1               2
Essay
Mean        23.9           21.5           20.4           2.4 (d = 0.38)   3.5 (d = 0.55)
SD          6.4            5.7            9.2            0.7              −2.8
Maximum     33             30             31             3                2
Minimum     12             7              0              5                12
Whole test
Mean        30.2           27.4           25.2           2.8 (d = 0.37)   5.0 (d = 0.66)
SD          7.6            6.9            11.3           0.7              −3.7
Maximum     40             39             39             1                1
Minimum     14             11             0              3                14

Note: d is calculated as the mean difference divided by the SD of 3E1
A.5 Correlations and Multiple Regression
It is of theoretical and practical significance to understand the relations between the two subtests and how they contribute to the total score. As shown in Table A1.5, the two subtests have a moderate correlation coefficient of 0.67, sharing 45 % common variance (i.e., of the total individual differences). Both subtests have higher correlations with the whole test: the coefficients are a high 0.79 (MCQ) and 0.98 (Essay). The near-perfect correlation between the Essay subtest and the whole test indicates that the total scores for the whole test are almost totally determined by the scores on the Essay subtest, and hence that the MCQ subtest plays a very limited role in differentiating among the students.
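A sketch of how such a correlation matrix and the common variance could be computed, assuming Python with numpy and illustrative data in place of the actual scores:

```python
# Correlation sketch for the two subtests and the whole test.
import numpy as np

rng = np.random.default_rng(2)                            # illustrative data only
mcq_total = rng.integers(0, 11, size=78).astype(float)    # MCQ subtest scores (0-10)
essay_total = rng.integers(0, 51, size=78).astype(float)  # Essay subtest scores (0-50)
whole = mcq_total + essay_total                           # whole-test scores

r = np.corrcoef([mcq_total, essay_total, whole])          # 3 x 3 matrix, cf. Table A1.5
print(np.round(r, 2))

shared = r[0, 1] ** 2                                     # squared r = common variance
print(f"Common variance shared by MCQ and Essay: {shared:.0%}")
# With the real data, r = 0.67 and the common variance is 0.67^2 = 45 %
```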
Table A1.6 shows the results of a multiple regression in which the two sets of subtest scores are used to predict the total scores. According to the results, the raw score equation is:
Total score = 1 × MCQ + 1 × Essay + Intercept
That is exactly how the total score is arrived at for each student. However, as shown in Table A1.3, for all students the MCQ subtest has a standard deviation of only 2.3 while the Essay subtest has 7.5. This difference in spread (see Chap. 7, On Multiple Regression) affects the contributions of the two subtests to the whole test, and the scores therefore have to be standardized. When the standardized scores are used for the multiple regression, the standardized regression coefficients (Betas) are 0.262 for the MCQ subtest and 0.823 for the Essay subtest. Thus, the regression equation using the standardized scores is
Standardized total score = 0.262 × MCQ + 0.823 × Essay
In this equation, the standardized regression weights (0.262 and 0.823) replace the unstandardized ones (1.00 and 1.00), and the intercept is standardized at 0.00. It is important to note that the ratio of the two Beta-weights is 0.823/0.262 = 3.14. This means that students' performance on the test as a whole depends much more on their Essay scores than on their MCQ scores (see the sketch following Table A1.6).
Table A1.5 Correlation coefficients

             MCQ    Essay   Whole test
MCQ          1.00   0.67    0.79
Essay               1.00    0.98
Whole test                  1.00
Table A1.6 Multiple regression

             b-weight   Beta    p
MCQ          1.000      0.262   0.01
Essay        1.000      0.823   0.01
Intercept    0.000      0.00    1.00

Note: R = 1.00, adjusted R² = 1.00
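A sketch of the regression step, assuming Python with numpy; random data stand in for the actual scores, so the printed betas will differ from Table A1.6, whose values (0.262 and 0.823) come from the real data:

```python
# Regression sketch: raw b-weights for Total = MCQ + Essay are 1 by construction;
# fitting on standardized scores yields the beta weights, as in Table A1.6.
import numpy as np

rng = np.random.default_rng(3)             # illustrative data only
mcq = rng.normal(5.6, 2.3, size=78)        # means/SDs as in Table A1.3
essay = rng.normal(21.7, 7.5, size=78)
total = mcq + essay                        # raw equation: 1*MCQ + 1*Essay + 0

def standardize(x):
    """Convert scores to z-scores (mean 0, SD 1)."""
    return (x - x.mean()) / x.std(ddof=1)

# Least-squares fit on standardized scores; no intercept column is needed
# because standardized variables all have mean zero.
X = np.column_stack([standardize(mcq), standardize(essay)])
betas, *_ = np.linalg.lstsq(X, standardize(total), rcond=None)
print(f"Beta(MCQ) = {betas[0]:.3f}, Beta(Essay) = {betas[1]:.3f}")
```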
A.6 Summary and Conclusion
The analysis of the test scores of the 78 Secondary 3 students on the 20-item test shows the following:

1. The MCQ and Essay subtests and the whole test are suitable for the students in terms of difficulty and have adequate discrimination (i.e., they are able to distinguish between students with differential achievement).
2. Girls do better than boys on both subtests and the whole test. 3E1 scores higher than the other two classes, especially compared with 3E4.
3. The MCQ subtest has a lower reliability than the Essay subtest. However, the test as a whole has high reliability and can be used for making decisions on individual students.
4. The Essay subtest contributes about three times as much to the total scores as the MCQ subtest does.
A.7 Recommendations
For future development, the following suggestions should be considered:

1. The effective items can be kept in an item pool for future use. This will enhance the year-to-year comparability of tests and save teachers the time and effort of coming up with new items.
2. The less adequate items (in terms of facility and discrimination) need to be studied for content and phrasing, so as to inform teachers of needed instructional changes and improvements in item-writing skills.
3. The number of MCQ items needs to be increased so that this subtest contributes more to the total scores and students' performance does not rely so heavily on the Essay subtest. A balance between MCQ items and essay questions, in terms of their relative contributions to the total score, is desirable for assessing skills in different language aspects.
Appendix B
A Note on the Calculation of Statistics
Using statistics to process test and exam results for better understanding inevitably involves calculation. This is a necessary evil.
More than half a century ago, when I started as a primary school teacher, all test results were hand-calculated, and this involved tedious work: rushing for time, boredom and, above all, the risk of inaccuracy. Moreover, calculating to the third or fourth decimal place seemed to be a sign of conscientiousness (or professionalism). Then came the hand-operated but clumsy calculating machine, and later the hand-held but still somewhat clumsy calculator. As time passed, the data got bigger in size but the calculation got easier, although the statistics did not change: a mean is still a mean and does not change its meaning however it is calculated. Now, with this convenience, I can afford to use more statistics that are more complicated to calculate, for example, the SD and the correlation coefficient, even regression and multiple regression. And not to forget the chi-square and exact probability.
With the ready availability of computing facilities, teachers and school leaders nowadays can afford the time and energy to use more statistics (including conceptually more complex ones) for a better understanding of test and exam results, to the benefit of the students and the school.
In the school context, sophisticated computing software designed for researchers, who routinely handle large amounts of complicated calculation, is not necessary. As I work more with class and school data, I have realized that Excel is able to do most if not all of the work that needs to be done. Moreover, it is almost omnipresent.
B.1 Using Excel
• Create a master worksheet to store the data for all variables, with the labels across the very first row and the first column kept for students' serial numbers and names. The table is always rows (individuals) by columns (variables).
• For different analyses, create specific worksheets by copying from the master worksheet the data for the variables to be analyzed (e.g., correlated).
• Pay attention to the small down arrowhead next to Σ. It leads to the many calculation functions you need: the total (Sum), the mean (Average), the frequency (Count Numbers), the highest (Max), the lowest (Min), and “More Functions…”.
• “More Functions…” offers many choices, and the one you need is always Statistical, which leads you to many statistical functions, from AVEDEV to Z.TEST. Once you have used some of the functions, the next time you need only Most Recently Used, which lists just those you have used and may need again.
• Learn to drag: point to the black dot at the bottom right corner of a command box and drag it to the right. This allows you to repeat the calculation across the columns (i.e., for the variables).
• Learn to use $ (not your money!). This fixes a reference so that it is constantly compared against; for example, when calculating the Var1–Var2, Var1–Var3, etc. correlation coefficients, the first variable (Var1) is held constant. Assuming, say, Var1 in column B with 78 students and the other variables in the columns to its right, a formula such as =CORREL($B$2:$B$79, C2:C79) can be dragged across the columns while $B$2:$B$79 stays fixed.