Understanding Test and Exam Results Statistically
Instead of sitting back and congratulating themselves on percentages, PISA mean scores, and Olympic medal counts, a Singaporean author compiled this book to alert educators (and, by extension, society as a whole) to the danger that statistics can deceive, mislead, and push us into wrong decisions. It is a very short book, only 158 pages, but thoroughly worth reading. Excerpted from the Facebook page of Namlun Didong.
Springer Texts in Education
Kaycheng Soh
Understanding Test and
Exam Results
Statistically
An Essential Guide for Teachers and School Leaders
More information about this series at http://www.springer.com/series/13812
Kaycheng Soh
Singapore
Singapore
ISSN 2366-7672 ISSN 2366-7980 (electronic)
Springer Texts in Education
ISBN 978-981-10-1580-9 ISBN 978-981-10-1581-6 (eBook)
DOI 10.1007/978-981-10-1581-6
Library of Congress Control Number: 2016943820
© Springer Science+Business Media Singapore 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part …
… or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this …
… the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer Science+Business Media Singapore Pte Ltd.
In Lieu of a Preface
There are three kinds of lies: lies, damned lies, and statistics.
We education people are honest people, but we often use test and examination scores in such a way that the effect is the same as lies, though without the intention but not without the falsity.
We count 10 correctly spelt words as if we count 10 apples. We count 10 correctly chosen words in an MCQ test as if we count 10 oranges. We count 10 correctly corrected sentences as if we count 10 pears. Then, we add 10 + 10 + 10 = 30. We then conclude that Ben has 30 fruits, called Language. We do the same for something we call Math (meat). And, something we call Art, or Music, or PE (snacks). We then add fruits, meat, and snacks and call the total Overall (edible or food). We then make important decisions using the Overall.
When doing this honestly, sincerely, and seriously, we also assume that there is no error in counting, be it done by this or another teacher (in fact, all teachers concerned). We also make the assumption, tacitly though, that one apple is as good as one orange, and one cut of meat as good as one piece of moachee. Right or wrong, life has to go on. After all, this has been done as far back as the long forgotten days of the little red house, and since this is a tradition, there must be nothing wrong. So, why should we begin to worry now?
A few of my class scored high, a few low, and most of them somewhere in between, reported Miss Lim on the recent SA1 performance of her class.
A qualitative description like this one fits almost all normal groups of students. After hearing a few more descriptions similar to this, Miss Lim and her colleagues were not any wiser about their students’ performance.
When dealing with the test or examination scores of a group of students, more specific descriptions are needed. It is here where numbers are more helpful than words. Such numbers, given the high-sounding name statistics, help to summarize the situation and make discussion more focused. Even when looking at one student’s test score, it has to be seen in the context of the scores of other students who have taken the same test, for that score to have any meaning.
Thus, statistics are good. But that is not the whole truth; there are bad statistics. That is why there are such interesting titles as these: Huff, D. (1954) How to Lie with Statistics; Runyon, R. P. (1981) How Numbers Lie; Hooke, R. (1983) How to Tell the Liars from the Statisticians; Holmes, C. B. (1990) The Honest Truth about Lying with Statistics; Zuberi, T. (2001) Thicker than Blood: How Racial Statistics Lie; Joel Best (2001) Damned Lies and Statistics; and Joel Best (2004) More Damned Lies and Statistics: How Numbers Confuse Public Issues.
These interesting and skeptical authors wrote about social statistics, statistics used by proponents and opponents to influence social policies. None deals with educational statistics and how they have misled teachers and school leaders into making irreversible decisions that influence the future of the student, the school, and even the nation.
On the other hand, people also say, “Statistics don’t lie but liars use statistics.” Obviously, there are good statistics and there are bad statistics, and we need to be able to differentiate between them.
Good statistics are the kind of numbers which simplify a messy mass of numbers to surface the hidden trends, help in the understanding of them, and facilitate informed discussion and sound policy-making. Bad statistics, on the other hand, do the opposite and make things even more murky or messy than they already are. This latter case may happen unintentionally, due to lack of correct knowledge of statistics. Bad statistics are those unintentionally misused. A rational approach to statistics, noting that they can be good or bad, is to follow Joel Best’s advice:
Some statistics are bad, but others are pretty good, and we need statistics, good statistics, to talk sensibly about social problems. The solution, then, is not to give up on statistics, but to become better judges of the numbers we encounter. We need to think critically about statistics … (Best 2001, p. 6. Emphasis added)
In the educational context, increasingly more attention is being paid to statistics, using them for planning, evaluation, and research at different levels, from the classroom to the boardroom. However, as the use of statistics has not been part of professional development in traditional programs, many users of educational statistics pick up ideas here and there on the job. This is practical out of necessity, but it leaves too much to chance, and poor understanding and misuse can be highly contagious.
The notes in this collection have one shared purpose: to rectify misconceptions which have already acquired a life of their own and to prevent those that are yet to be born. The problems, issues, and examples are familiar to teachers and school administrators and hence should be found relevant to the daily handling of numbers in the school office as well as the classroom. The notes discuss the uses and misuses of descriptive statistics which school administrators and teachers have to use and interpret in the course of their normal day-to-day work. Inferential statistics are mentioned by the way but not covered extensively because in most cases they are irrelevant to the schools, which very seldom, if ever, have numbers collected through a random process.
The more I wrote, the more I realized that many of the misconceptions and misuses were actually caused by misunderstanding of something more fundamental: educational measurement. Taking test scores too literally, obsession with decimals, and seeing too much meaning in small differences are some cases in point. Because educational statistics is intimately tied up with educational measurement (much more than other social statistics are), misinterpretation of test and examination scores (marks, grades, etc.) may have as its root a lack of awareness of the peculiar nature of educational statistics. The root causes could be one or all of these:
1. Taking test scores literally as absolute when they are in fact relative.
2. Taking test scores as equivalent when they are not.
3. Taking test scores as error-free when error is very much part of them.
(Incidentally, “test score” will mean “test and examination scores” hereafter to avoid the clumsiness.)
These arise from the combination of two conceptual flaws. The first is the lack of understanding of levels of measurement. There is a mix-up of highly fallible educational measurement (e.g., test scores) with highly infallible physical measurement (e.g., weight or height), looking at a test score of 50 as if it is the same as 50 kg or 50 cm. Secondly, there is a blind faith in score reliability and validity, the belief that test scores have perfect consistency and truthfulness.
This indicates a need to clarify the several concepts relevant to reliability, validity, item efficiency, and levels of tests. And, above all these, there is the question of the consequences of how test scores are used, especially for students and curriculum, the two most critical elements in schooling: that is, what happens to them.
Statistics can be learned for its own sake as a branch of mathematics. But that is not the reason for teachers and school leaders to familiarize themselves with it. In the school context, statistics are needed for proper understanding of test and examination results (in the form of scores). Hence, statistics and measurement need to go hand in hand so that statistics are meaningful and measurement is understood. In fact, while statistics can stand alone without educational measurement, educational measurement on which tests and examinations are based cannot do without statistics.
Most books about tests and examinations begin with concepts of measurement and have an appendix on statistics. In this book, statistical understanding of test scores comes first, followed by more exposition of measurement concepts. The reversed order comes with the belief that without knowing how to interpret test scores first, measurement is void of meaning.
Anyway, statistics is a language for effective communication. To build such a common language among educational practitioners calls for willingness to give up non-functioning notions and needs patience to acquire new meanings for old labels.
By the way, as the notes are not meant to be academic discourse, I take the liberty to avoid citing many references to support the arguments (not argumentative statements but just plain statements of ideas) and take for granted the teachers’ and school leaders’ trust in my academic integrity. Of course, I maintain my intellectual honesty as best I can, but I stand to be corrected where I do not intentionally lie.
I would like to record my appreciation for the anonymous reviewers for their perceptive comments on the manuscript and their useful suggestions for its improvement. Beyond this, errors and omissions are mine.
Part I Statistical Interpretation of Test/Exam Results
1 On Average: How Good Are They?
1.1 Average Is Attractive and Powerful
1.2 Is Average a Good Indicator?
1.2.1 Average of Marks
1.2.2 Average of Ratings
1.3 Two Meanings of Average
1.4 Other Averages
1.5 Additional Information Is Needed
1.6 The Painful Truth of Average
2 On Percentage: How Much Are There?
2.1 Predicting with Non-perfect Certainty
2.2 Danger in Combining Percentages
2.3 Watch Out for the Base
2.4 What Is in a Percentage?
2.5 Just Think About This
Reference
3 On Standard Deviation: How Different Are They?
3.1 First, Just Deviation
3.2 Next, Standard
3.3 Discrepancy in Computer Outputs
3.4 Another Use of the SD
3.5 Standardized Scores
3.6 Scores Are not at the Same Type of Measurement
3.7 A Caution
Reference
4 On Difference: Is that Big Enough?
4.1 Meaningless Comparisons
4.2 Meaningful Comparison
4.3 Effect Size: Another Use of the SD
4.4 Substantive Meaning and Spurious Precision
4.5 Multiple Comparison
4.6 Common but Unwarranted Comparisons
References
5 On Correlation: What Is Between Them?
5.1 Correlations: Foundation of Education Systems
5.2 Correlations Among Subjects
5.3 Calculation of Correlation Coefficients
5.4 Interpretation of Correlation
5.5 Causal Direction
5.6 Cautions
5.7 Conclusion
Reference
6 On Regression: How Much Does It Depend?
6.1 Meanings of Regression
6.2 Uses of Regression
6.3 Procedure of Regression
6.4 Cautions
7 On Multiple Regression: What Is the Future?
7.1 One Use of Multiple Regression
7.2 Predictive Power of Predictors
7.3 Another Use of Multiple Regression
7.4 R-Square and Adjusted R-Square
7.5 Cautions
7.6 Concluding Note
References
8 On Ranking: Who Is the Fairest of Them All?
8.1 Where Does Singapore Stand in the World?
8.2 Ranking in Education
8.3 Is There a Real Difference?
8.4 Forced Ranking/Distribution
8.5 Combined Scores for Ranking
8.6 Conclusion
9 On Association: Are They Independent?
9.1 A Simplest Case: 2 × 2 Contingency Table
9.2 A More Complex Case: 2 × 4 Contingency Table
9.3 Even More Complex Case
9.4 If the Worse Come to the Worse
9.5 End Note
References
Part II Measurement Involving Statistics
10 On Measurement Error: How Much Can We Trust Test Scores?
10.1 An Experiment in Marking
10.2 A Score (Mark) Is not a Point
10.3 Minimizing Measurement Error
10.4 Does Banding Help?
Reference
11 On Grades and Marks: How not to Get Confused?
11.1 Same Label, Many Numbers
11.2 Two Kinds of Numbers
11.3 From Labels to Numbers
11.4 Possible Alternatives
11.5 Quantifying Written Answers
11.6 Still Confused?
Reference
12 On Tests: How Well Do They Serve?
12.1 Summative Tests
12.2 Selection Tests
12.3 Formative Tests
12.4 Diagnostic Tests
12.5 Summing up
References
13 On Item-Analysis: How Effective Are the Items?
13.1 Facility
13.2 Discrimination
13.3 Options Analysis
13.4 Follow-up
13.5 Post-assessment Analysis
13.6 Concluding Note
Reference
14 On Reliability: Are the Scores Stable?
14.1 Meaning of Reliability
14.2 Factors Affecting Reliability
14.3 Checking Reliability
14.3.1 Internal Consistency
14.3.2 Split-Half Reliability
14.3.3 Test–Retest Reliability
14.3.4 Parallel-Forms Reliability
14.4 Which Reliability and How Good Should It Be?
15 On Validity: Are the Scores Relevant?
15.1 Meaning of Validity
15.2 Relation Between Reliability and Validity
Reference
16 On Consequences: What Happens to the Students, Teachers, and Curriculum?
16.1 Consequences to Students
16.2 Consequences to Teachers
16.3 Consequences to Curriculum
16.4 Conclusion
References
17 On Above-Level Testing: What’s Right and Wrong with It?
17.1 Above-Level Testing in Singapore
17.2 Assumed Benefits
17.3 Probable (Undesirable) Consequences
17.4 Statistical Perspective
17.5 The Way Ahead
17.6 Conclusion
References
18 On Fairness: Are Your Tests and Examinations Fair?
18.1 Dimensions of Test Fairness
18.2 Ensuring High Qualities
18.3 Ensuring Test Fairness Through Item Fairness
References
Epilogue
Appendix A: A Test Analysis Report
Appendix B: A Note on the Calculation of Statistics
Appendix C: Interesting and Useful Websites
Dr Kaycheng Soh (1934) studied for the Diploma in Educational Guidance (1965) and Master in Education (Psychology) at the University of Manchester, UK (1970) and was conferred the Doctor of Philosophy by the National University of Singapore (1985) for his research on child bilingualism.
Dr Soh started as a primary school teacher and principal, then became a teacher educator of long standing, and later held senior positions in the Ministry of Education and consulted on social surveys with other Ministries in Singapore. He served as a consultant to the Hong Kong SAR Education Bureau to revise its school appraisal indicator systems. After retirement from the National Institute of Education, Nanyang Technological University, Singapore, he actively promoted classroom-based action research and conducted workshops for schools and the ministry. Currently, he is the Research Consultant of the Singapore Centre for Chinese Language.
His research focuses on creativity, language teaching, and world university rankings, and his articles were published in international learned journals. Examples of his recent publications are as follows:
• Soh, Kaycheng (2015) Creativity fostering teacher behavior around the world: Annotations of studies using the CFTIndex. Cogent Education, 1–8.
This summarizes studies using the Creativity Fostering Teacher Behavior Index he crafted and published in the Journal of Creative Behavior. The scale has been translated into several languages and used for Ph.D. dissertations.
• Soh, Kaycheng (2013) Social and Educational Ranking: Problems and Prospects. New York: Untested Ideas Research Centre.
The chapters are based on his journal articles dealing with several methodological and statistical issues in world university rankings and other social rankings.
• Soh, Kaycheng, Ed. (2016) Teaching Chinese Language in Singapore: Retrospect and Challenges. Springer.
This monograph covers many aspects of the teaching of Chinese Language in the Singapore context, including its past, present, and future, and several surveys of teacher perceptions, teaching strategies, and assessment literacy.
Part I
Statistical Interpretation of Test/Exam Results
Chapter 1
On Average: How Good Are They?
At the end of a jetty, there is this signboard:
WARNING
Average depth 5 meters within 50 meters
So, he dived and got a bump on the forehead
1.1 Average Is Attractive and Powerful
Average is so attractively simple and powerfully persuasive that we accept it without much thinking. Average is attractive because it is simple. It is simple because it simplifies.
During the department’s post-examination meeting, performances of classes were to be evaluated. Miss Tan reported, “My class has two 45, four 52, seven 60, ten 68, …” The HOD stopped her at this point, “Miss Tan, can you make it simpler?” “Yes, the average is 74.” The other teachers took turns to report: “My class has an average of 68”; “Mine is 72”; … and “My class scored the highest, the average is 91.” That is the magic of average. It simplifies reporting and makes comparison and the ensuing discussion much more convenient.
The average is of course the total of all scores of the students of a class divided by the number of students in that class. Arithmetically, mathematically, or statistically (depending on whether you like simple or big words), an average presents the general tendency of a set of scores and, at the same time, ignores the differences among them. Laymen call this average, statisticians call it the mean. Average or mean, it is an abstraction of a set of marks to represent the whole set by using just one number. Implicitly, the differences among marks are assumed to be unimportant. It also ignores the fact that it is possible that none of the students has actually obtained that mark called average. The power of average comes from its ability to make life easier and discussion possible. If not for the average (mean), all teachers would report the way Miss Tan first did!
© Springer Science+Business Media Singapore 2016
K. Soh, Understanding Test and Exam Results Statistically, Springer Texts in Education, DOI 10.1007/978-981-10-1581-6_1
1.2 Is Average a Good Indicator?
It depends. Four groups of students took the same test (Table 1.1). All groups have an average of 55. Do you think we can teach the groups the same way simply because they have the same average?
1.2.1 Average of Marks
It is unlikely in the classroom reality that all students get the same scores like in Group A. The point is that if the group is very homogeneous, teach them all in the same way and one size may fit all. Group B has students who are below or around the average but with one who scores extremely high when compared with the rest. Group C, on the other hand, has more students above the average but with one scoring extremely low. Group D has scores spreading evenly over a wide range. Obviously, the average is not a good indicator here because the scores spread around the average in different ways, signaling that the groups are not the same in the ability tested. Such subtle but important differences are masked by the average.
Table 1.1 Test marks and … [table not reproduced]
1.2.2 Average of Ratings
Assessment rubrics have become quite popular with teachers. So, let us take a realistic example of rubric ratings. Two teachers assessed three students on oral presentation. A generic five-point rubric was used. As is commonly done, the marks awarded by the two teachers were averaged for each student (Table 1.2).
Using the rubric independently, both teachers awarded a score of 3 to Student X; the average is 3. Teacher A awarded a score of 2 to Student Y, who got a score of 4 from Teacher B; the average is also 3. Student Z was awarded scores of 1 and 5 by Teacher A and Teacher B, respectively; the average is again 3. Now that all three students scored the same average of 3, do you think they are the same kind of students? Do the differences in the marks awarded to the same student (e.g., 2 and 4 for Student Y) worry you? Obviously, the average is not a good indicator because the two teachers did not see Student Y in the same way. They also did not see Student Z the same way. Incidentally, this is a question of inter-rater consistency or reliability. In this example, the rating for Student X is most trustworthy and that for Student Z cannot be trusted because the two teachers did not see eye to eye in this case.
On the five-point scale, the points are usually labeled as 1 = Poor, 2 = Weak, 3 = Average, 4 = Good, and 5 = Excellent. Thus, all three students were rated as average, but they are different kinds of “average” students.
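The rubric example can be checked with a few lines of Python. This is only a minimal sketch: the scores are the ones quoted in the text for Students X, Y, and Z, while the helper name `summarize` and the idea of reporting the gap between the two raters next to the average are illustrative assumptions, not the book's procedure.

```python
# Rubric scores quoted in the text: (Teacher A, Teacher B) for each student.
ratings = {"X": (3, 3), "Y": (2, 4), "Z": (1, 5)}


def summarize(scores):
    """Return (average rating, absolute gap between the two raters)."""
    teacher_a, teacher_b = scores
    return (teacher_a + teacher_b) / 2, abs(teacher_a - teacher_b)


# Hypothetical summary table: same average for everyone, very different gaps.
summary = {student: summarize(scores) for student, scores in ratings.items()}

for student, (average, gap) in summary.items():
    print(f"Student {student}: average = {average:.1f}, rater gap = {gap}")
```

All three averages come out the same, so only the gap column shows how far apart the two teachers were for each student.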
1.3 Two Meanings of Average
In the rubric assessment example, average is used with two different though related meanings. The first is the usual one when marks are added and then divided by the number of, in this case, teachers. This, of course, is the mean, which is its statistical meaning because it is the outcome of a statistical operation.
Average has a second meaning when, for instance, Mrs Lee says, “Ben is an average student” or when Mr Tan describes his class as an “average class.” Here, they used average to mean neither good nor weak, just like most other students or classes, or nothing outstanding but also nothing worrisome. In short, average here means typical or ordinary. Here, average is a relative label and its meaning depends on the experiences or expectations of Mrs Lee and Mr Tan.
If Mrs Lee has been teaching in a prime school, her expectation is high and Ben is just like many other students in this school. Since Ben is a typical student in that school, he is in fact a good or even excellent student when seen in the context of the student population at the same class level in Singapore, or any other country. Likewise, if there are, say, five classes at the same class level formed by ability grouping in Mr Tan’s school, then his so-called average class is the one in the middle or thereabout, that is, class C among classes A to E. Moreover, Mr Tan’s average class may be an excellent or a poor one in a different school, depending on the academic standing of the school. By the same token, an average teacher in one school may be a good or poor one in another school.

Table 1.2 Assessment marks
Student    Teacher A    Teacher B    Average
X          3            3            3
Y          2            4            3
Z          1            5            3

In short, average is not absolute but relative. Up to this point, we have noticed that classes having the same average may not be the same in ability or achievement. We have also seen that students awarded the same average marks may not have been assessed in the same way by different teachers. The implication is that we should not trust the average alone as an indicator of student ability or performance; we need more information. In short, an average standing alone can misinform and mislead. Obviously, we need other information to help us make sense of an average. And what is this that we need?
1.4 Other Averages
Before answering the question, one more point needs to be made. What we have been talking about as average is only one of the several averages used in educational statistics. The average we have been discussing up to now should strictly be called the arithmetic mean. There is also another average called the mode; it is simply the most frequently appearing mark(s) in a set. For example, 45 appears three out of five times in Group B; since it is the most frequent mark, it is the mode. The mode is a quick and rough indicator of average performance, used for a quick glance.
A more frequently used alternative to the arithmetic mean is the median. When a set of marks is ordered, the middlemost is the median. For example, in Table 1.1, the scores of Group D are sequenced from the lowest to the highest; the middlemost mark is 55 and it is the median of the set of five marks. Statistically, the median is a better average than the arithmetic mean when a set of marks is “lopsided,” or statistically speaking, skewed. This happens when a test is too easy for a group of students, resulting in too many high scores. The other way round is also true when a test is too difficult and there are too many low scores. In either of these situations, the median is a better representation of the scores.
Another situation when the median is a better representation is when there is one or more extremely high (or low) scores and there is a large gap between such scores and the rest. In Table 1.1, Group C has an unusually low score of 15 when the other scores are around 65 (the mean of 60s and 70s). In this case, the mean of 55 is not as good as the median of 60 (the middlemost score) to represent the group since 55 is an underestimation of the performance of the five students. Had Bill Gates joined our teaching profession, the average salary of teachers, in Singapore or any other country, would run into billions!
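The three averages can be compared with Python's standard `statistics` module. This is a sketch under a stated assumption: the marks below are not copied from Table 1.1 (which is not reproduced here) but reconstructed from the figures quoted in the text, namely five marks per group, a mean of 55, a mode of 45 and a top mark of 85 for Group B, and a lowest mark of 15, a top mark of 70, and a median of 60 for Group C.

```python
from statistics import mean, median, mode

# Marks reconstructed from the statistics quoted in the text (an assumption;
# the original table is not reproduced in this document).
group_b = [45, 45, 45, 55, 85]
group_c = [15, 60, 60, 70, 70]

b_mean, b_mode = mean(group_b), mode(group_b)
c_mean, c_median = mean(group_c), median(group_c)

print(f"Group B: mean = {b_mean}, mode = {b_mode}")
print(f"Group C: mean = {c_mean}, median = {c_median}")
```

For Group C, the single mark of 15 pulls the mean below the median, which is the underestimation described above.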
1.5 Additional Information Is Needed
Let us go back to the question of the additional information we need to properly understand and use an average.
What we need is an indication of the spread of the marks so that we know not only what a representative mark (average or mean) is but also how widely or narrowly the marks are spreading around the average. The simplest indicator of the spread (or variability) is the range; it is simply the difference between the highest and the lowest marks. In Table 1.1, the range for Group A is zero since every mark is the same 55 and there are no highest and lowest marks. For Group B, the range is 85 − 45 = 40. For Group C, it is 70 − 15 = 55, and for Group D, 70 − 40 = 30.
What do these ranges tell us? Group A (0) is the most homogeneous, followed by Group D (30), then Group B (40), and finally the most heterogeneous Group C (55). As all teachers know, heterogeneous classes are more challenging to teach because it is more difficult to teach at the correct level that suits most if not all students, since they differ so much in ability or achievement. The opposite is true for homogeneous classes. Thus, if we look only at the averages of the classes, we will misunderstand the different learning capabilities of the students.
While the range is another quick and rough statistic (to be paired with the mode), the standard deviation (SD) is a formal statistic (to be elaborated in Chap. 3, On Standard Deviation). Leaving the tedious calculation to software (in this case, Excel), we can get the SDs for the four groups. We can then create a table (Table 1.3) to facilitate a more meaningful discussion at the post-examination meeting.
Table 1.3 drops the individual marks of the students but presents the essential descriptive statistics useful for discussing examination results. It shows for each group the lowest (Min) and the highest (Max) marks, the range (Max − Min), the mean, and the SD. Thus, the discussion will not be only about the average performance of each class but also about how different the classes and students were in their examination results.
You must have noticed that Group A has the lowest range (0) and the lowest SD (0.00). On the other hand, Group C has the greatest range (55) and the greatest SD (22.9). The other two groups have the “in-between” ranges and the “in-between” SDs. Yes, you are right. In fact, there is a perfect match between ranges and SDs among the groups. Since both the range and the SD are indications of the spread of marks, this high consistency between them is expected. In short, the group with the greatest range also has the greatest SD, and vice versa.
We will discuss this further in Chap. 3.
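The descriptive statistics discussed above can be sketched in Python instead of a spreadsheet. Two assumptions are worth flagging. First, the marks for Groups A, B, and C are reconstructed from the minimum, maximum, mean, median, and mode quoted in the text, not copied from the original table; Group D is left out because the text does not pin down its five marks. Second, `statistics.stdev` computes the sample SD (dividing by n − 1), which is the version that reproduces the 22.9 quoted for Group C.

```python
from statistics import mean, stdev

# Reconstructed marks (an assumption; see the note above).
groups = {
    "A": [55, 55, 55, 55, 55],
    "B": [45, 45, 45, 55, 85],
    "C": [15, 60, 60, 70, 70],
}


def describe(marks):
    """Return Min, Max, range, mean, and sample SD for one group."""
    lowest, highest = min(marks), max(marks)
    return {
        "min": lowest,
        "max": highest,
        "range": highest - lowest,
        "mean": mean(marks),
        "sd": round(stdev(marks), 1),  # sample SD, rounded to one decimal
    }


table = {name: describe(marks) for name, marks in groups.items()}

for name, row in table.items():
    print(name, row)
```

The means are identical across the three groups, while the range and SD columns separate them clearly.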
1.6 The Painful Truth of Average
Before we leave the average to talk more about the SD, one final and important point needs to be mentioned. When Professor Frank Warburton of the University of Manchester (which was commissioned to develop the British Intelligence Scale) was interviewed on BBC about the measurement of intelligence, he did not know that what he said was going to shock the British public, because a newspaper the next day printed something like “Prof Warburton says half of the British population is below average in intelligence.” (We can say the same about our Singapore population.)
Prof Warburton was telling the truth, nothing but the truth. The plain fact to him (and those of us who have learned the basics of statistics) is that, by definition, the average intelligence score (IQ 100) of a large group of unselected people is a point on the intelligence scale that separates the top 50 % who score at or higher than the mean (average) and the bottom 50 % who score below it. He did not mean to shock and said nothing to shock; it was just that the British public (or rather, the newsmen) at that time interpreted average using its layman’s meaning. By the way, when the group is large and the scores are normally distributed, the arithmetic mean and the median coincide at the same point.
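The arithmetic behind "half are below average" can be illustrated with a small sketch. The scores below are hypothetical (a symmetric set invented for illustration, not data from the book), and `share_below_mean` is an illustrative helper name. The second call shows that lifting every score by the same amount moves the mean too, so the share below it does not change.

```python
from statistics import mean

# Hypothetical, symmetric set of scores (invented for illustration only).
scores = [40, 45, 50, 55, 60, 65, 70, 75]


def share_below_mean(values):
    """Return the fraction of values that fall below their own mean."""
    average = mean(values)
    return sum(value < average for value in values) / len(values)


before = share_below_mean(scores)
# Raise every score by the same 10 points and measure again.
after = share_below_mean([score + 10 for score in scores])

print(f"Share below the mean before: {before:.0%}, after the uplift: {after:.0%}")
```

With a symmetric set like this one, the split stays at one half however far the whole set is shifted.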
This takes us to another story. An American governorship candidate of a particular state promised his electorate that if he was returned to the office, he would guarantee that all schools in the state would become above-average. We do not know whether the voters believed him. They should not, because the candidate had no possibility of keeping his promise. The simple reason is that, statistically speaking, when all schools in his state are uplifted, the average (mean) moves up accordingly and there is always half of the schools below the state average! If he did not know this, he made a sincere mistake; otherwise, he lied with an educational statistic.
Chapter 2
On Percentage: How Much Are There?
The principal Mrs Fang asked, “How good is the chance that our Chinese orchestra will get a gold medal in the Central Judging?”
The instructor Mr Zhang replied, “Probably, 95 % chance.”
Mrs Fang said, “That is not good enough, we should have 105 % chance.”
Obviously, there is some confusion of the concepts of percentage and probability in this short conversation. Here, percentage is used to express the expectations of certainty of an upcoming event. Both the principal and instructor spoke about figures figuratively. Statistically, percentage as used here does not make sense. What Mr Zhang said was that there was a very high chance (probability) of success, but Mrs Fang expected more than certain certainty of success (a probability of p = 1.05!).
The percentage is one of the most frequently used statistics in the school and in daily life. It could very well also be the most misunderstood and misused statistic.
2.1 Predicting with Non-perfect Certainty
When 75 of 100 students passed an examination, the passing rate is 75 %. This of course is derived thus:
\[
\begin{aligned}
100\,\% \times \frac{\text{No. of passing students}}{\text{No. of students who sat for the exam}}
&= 100\,\% \times (75/100) \\
&= 100\,\% \times 0.75 \\
&= 75\,\%
\end{aligned}
\]
In a sense, the percentage is a special kind of arithmetic mean (or what is known as the average, in school language) where the scores obtained by students are either 1 (pass) or 0 (fail).
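The point that a percentage is a mean of 0/1 scores can be checked with a minimal Python sketch. The list `scores` below is constructed purely for illustration from the 75-out-of-100 example; it is not data from the book.

```python
# 1 = pass, 0 = fail; 75 passes out of 100 students (illustrative list).
scores = [1] * 75 + [0] * 25

# Passing rate computed the usual way, as a percentage of passes.
pass_rate = 100 * sum(scores) / len(scores)

# Arithmetic mean of the 0/1 scores; it is the same figure as a proportion.
mean_score = sum(scores) / len(scores)

print(pass_rate)   # 75.0
print(mean_score)  # 0.75
```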
Because the student intakes over the years are likely to be about the same, we can say that our students have a 75 % chance, or thereabouts, of passing the same kind of examination. We are here using past experience to predict future happenings. However, our prediction based on one year’s experience may not be exact because there are many uncontrolled factors influencing what will really happen in the following years. If it turns out to be 78 %, we have a fluctuation (statistically called error, though not a mistake) of 3 % in our prediction. The size of such error depends on which year’s percentage we use as the basis of prediction.
Knowing that the percentages vary from year to year, it may be wiser of us to take the average of a few years’ percentages as the basis of prediction instead of just one year’s. Let’s say, over the past five years, the percentages are 73, 78, 75, 76, and 74 %, and their average is 75.2 or 75 % after rounding. We can now say that, “Based on the experience of the past five years, our students will have around 75 % passes in the following year.” When we use the word around, we allow ourselves a margin of error (fluctuation) in the prediction.
But the word around is vague. We need to set the upper and lower limits to that error. We then add to and subtract from the predicted 75 % a margin.
What then is this margin? One way is to use the average deviation of the five percentages, calculated as shown in Table 2.1. First, we find the average percentage (75.2 %). Next, we find for each year its deviation from the five-year average; for example, for the first year, the deviation is (73 % − 75.2 %) = −2.2 %. For all five years, the average deviation is 0.0, and this does not help.
We take the absolute deviation of each year, for example, |−2.2 %| = 2.2 %. The average of the absolute deviations is (7.2 %/5) = 1.44 %. Adding 1.44 % to the predicted 75 %, we get 76.44 % or, after rounding, 76 %. On the other hand, subtracting 1.44 % from 75 %, we get 73.56 % or 74 % (after rounding). Now we say, “Based on the experience of the past five years, our students are likely to have between 74 and 76 % passes next year.” This is a commonsensical way of making a prediction and at the same time allowing for fluctuation.
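The arithmetic of this commonsense margin can be retraced with a short Python sketch. The variable names are illustrative; the five percentages and the rounding of the prediction to 75 % follow the text.

```python
passes = [73, 78, 75, 76, 74]  # percentage passes over the past five years

mean_pct = sum(passes) / len(passes)          # 75.2
deviations = [p - mean_pct for p in passes]   # signed deviations sum to about 0

# Average of the absolute deviations (7.2 / 5).
mean_abs_dev = sum(abs(d) for d in deviations) / len(passes)

# The text rounds the prediction to 75 % before adding the margin.
predicted = round(mean_pct)
lower = predicted - mean_abs_dev
upper = predicted + mean_abs_dev

print(round(mean_abs_dev, 2))            # 1.44
print(round(lower, 2), round(upper, 2))  # 73.56 76.44
```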
Table 2.1 Absolute average deviation

Year | Passes % | Deviation % | Absolute deviation %
1 | 73 | −2.2 | 2.2
2 | 78 | 2.8 | 2.8
3 | 75 | −0.2 | 0.2
4 | 76 | 0.8 | 0.8
5 | 74 | −1.2 | 1.2
Sum | 376 | 0.0 | 7.2
Average | 75.2 | 0.0 | 1.44

A more formal way is to use the standard deviation (SD) in place of the average absolute deviation. Once the SD has been calculated for the five years’ percentages, we use it to allow for fluctuations. If we are happy to be 95 % sure, the limits will be 71 and 79 %. We then can say, “Based on the experience of the past five years, we have 95 % confidence that our students are likely to have between 71 and 79 % passes next year.” (See Chap. 3, On Standard Deviation.) Using the SD is a more formal statistical approach because this is done with reference to the normal distribution curve, assuming that the five years’ percentages together form a good sample of a very large number of percentages of passes of the school’s students. Statistically speaking, the 95 % is a level of confidence, and the limits of 71 to 79 % together form the interval of confidence. Now, for the level of confidence 99 %, what are the limits forming the corresponding interval of confidence?
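The text does not spell out how the limits of 71 and 79 % were computed. The sketch below is one plausible reading, offered as an assumption: it takes the sample SD (dividing by N − 1) of the five percentages and multiplies it by the normal-curve value for the chosen level of confidence. Under that assumption the 95 % limits round to the figures quoted above.

```python
import statistics

passes = [73, 78, 75, 76, 74]

mean_pct = statistics.mean(passes)
sd = statistics.stdev(passes)  # sample SD, divides by N - 1 (an assumption)

def interval(level):
    """Lower and upper limits around the mean for a confidence level in (0, 1)."""
    # z is the normal-curve multiplier, about 1.96 for a 95 % level.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean_pct - z * sd, mean_pct + z * sd

low, high = interval(0.95)
print(round(low), round(high))  # 71 79
```

Calling `interval(0.99)` in the same way gives the wider limits asked about at the end of the paragraph.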
2.2 Danger in Combining Percentages
In the above example, we assumed that the cohorts have the same size or at least very close sizes (which is a more realistic assumption). However, if the group sizes are rather different, then averaging the percentages is misleading. Table 2.2 shows for two groups the numbers of passes and the percent passes for each group. If we add the two percentages and divide the sum by two, (75 % + 50 %)/2, the average is 62.5 %. However, if the total number of passes is divided by the total number of students, (80/120), the average is 66.7 %. This reminds us that when group sizes differ, averaging percentages to get an average percentage is misleading.
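A small Python sketch makes the two ways of averaging concrete. The group sizes are not given in the surviving text; 80 and 40 students with 60 and 20 passes are inferred here as the only sizes consistent with 75 %, 50 %, and the pooled 80/120, so treat them as an assumption.

```python
# Inferred group data (assumption): sizes chosen to match 75 %, 50 %, and 80/120.
groups = [
    {"students": 80, "passes": 60},  # 75 % passes
    {"students": 40, "passes": 20},  # 50 % passes
]

percentages = [100 * g["passes"] / g["students"] for g in groups]

# Simple average of the two percentages ignores the group sizes.
simple_average = sum(percentages) / len(percentages)

# Pooled percentage divides all passes by all students.
total_passes = sum(g["passes"] for g in groups)
total_students = sum(g["students"] for g in groups)
pooled = 100 * total_passes / total_students

print(simple_average)    # 62.5
print(round(pooled, 1))  # 66.7
```

The pooled figure is the same as a weighted average of the two percentages, with the group sizes as weights.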
It is a well-documented fact that generally boys do better in mathematics while girls do better in language. In statistical terms, there is a sex–subject interaction which needs to be taken into account when discussing achievement in such gender-related subjects. In this example, sex is a confounding or lurking variable which cannot be ignored if proper understanding is desired.
Incidentally, Singapore seems to be an exception where mathematics is concerned. In the 1996 Trends in International Mathematics and Science Study (TIMSS), Singapore together with Hong Kong, Japan, and Korea headed the world list in mathematics. However, a secondary analysis (Soh and Quek 2001) found that Singapore girls outperformed their counterparts in the other three Asian nations, while boys of all four countries performed on par with one another. This is another example of Simpson’s paradox. The Singaporean girls’ advantage shows up again in the TIMSS 2007 Report, while boys of Taipei, Hong Kong, and Japan scored higher than Singapore’s boys. By the way, Korea did not take part in the 2007 study.
2.3 Watch Out for the Base
Burger Queen puts up a sign: Come! Try our new chicken-kangaroo burger!!!
So, Mr Brown went in and tried one. It did not taste right, so he asked the manager, “What is the proportion of chicken to kangaroo?”
The manager answered, “50-50.”
Mr Brown protested, “But, it didn’t taste like that. How many chicken to one kangaroo?”
To Mr Brown’s bewilderment, the manager said, “One chicken to one kangaroo.”
While Mr Brown expected Burger Queen to have used one kg of chicken to every kg of kangaroo, the management actually used one whole chicken to one whole kangaroo. Mr Brown and the manager are correct in their own way. They are entitled to their respective expectations, but the different units used as the base for calculation make a world of difference, in taste and in profit.
Mr Han’s action research project for school-based curriculum innovation involved two classes. At the end of the project, the posttest showed the result as in Table 2.3. Looking at the passes, Mr Han concluded that since there were more passes in the project group than in the comparison group, the project was successful as he expected. Mr Dong disagreed with this conclusion. He noticed that there were more fails in the project group than in the comparison group; therefore, the project failed to deliver. What both of them overlooked was the different class sizes.

The question to ask in this situation is not “What is the pass rate?” or “What is the failure rate?” The critical question is “What is the percentage of passes in each class and is there a difference between the two percentages?” To answer this question, we worked out the percentages of passes for the two classes separately.
Table 2.3 Posttest result of the project and comparison
Table 2.4 Posttest result of the project and comparison

As Table 2.4 shows, the passing percentage turned out to be 63 % for the project class and 67 % for the comparison class, and there is a difference of 4 % in favor of the comparison class; this suggests that the intervention did not work as Mr Han expected. Mr Han and Mr Dong should have looked at both passes and fails in both groups and not focused just on either passes or fails, although Mr Dong was correct but for a wrong reason. Of course, whether the difference of 4 % is large enough to be worthy of attention needs to be further evaluated by checking the effect size (more about this later).
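To see how both Mr Han and Mr Dong could be looking at true counts and still miss the point, here is a minimal Python sketch. The counts are hypothetical: the original table values are not available here, so the numbers below were chosen only so that the project class has more passes and more fails and the percentages round to 63 % and 67 %.

```python
# Hypothetical counts (not the book's table): chosen to round to 63 % and 67 %.
classes = {
    "project": {"passes": 19, "fails": 11},
    "comparison": {"passes": 16, "fails": 8},
}

# Percentage of passes within each class, using that class's own size as the base.
pass_rates = {
    name: 100 * c["passes"] / (c["passes"] + c["fails"])
    for name, c in classes.items()
}

for name, rate in pass_rates.items():
    print(name, round(rate))
# project 63
# comparison 67
```

The project class leads on raw passes (19 against 16) and on raw fails (11 against 8), yet its pass percentage is the lower one once class size is taken as the base.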
2.4 What Is in a Percentage?
Miss Siva reported that her Primary 4 students in the project group scored higher than those in the comparison group by 20 % on a posttest. For a similar project, Mrs Hamidah reported the same advantage of 20 %. At a quick glance, we might conclude that the two projects are equally effective; after all, both project groups scored higher by 20 %.
Mr Abdul exercised his critical thinking and asked for the numbers of items in the tests used by Miss Siva and Mrs Hamidah. It turned out that Miss Siva used a 10-item test and Mrs Hamidah a 20-item test. Thus, Miss Siva’s 20 % represents two items out of 10, but Mrs Hamidah’s 20 % is four out of 20. Now the question is, “Is the ability to score four more items the same as the ability to score two more items, given that the two tests are of comparable standard (difficulty)?” Further, what if the two tests are not of the same difficulty?

Not so obvious in such cases is that the base (number of items) is rather small. When the base is small, the percentage based on it is highly exaggerated, thus giving the false impression of being important. This can bias our thinking such that we take the percentage too literally. When a test has a possible maximum mark of 20, a student scoring 95 % gives the impression that he is almost perfect in whatever he has been tested upon. Likewise, deducting one mark for his carelessness means penalizing him by 5 %. Would it be too severe?

Obviously, when using percentages to report on test and exam performance, we need to be careful about such possible distortion, and we need to take the trouble to provide more information, not just report a stand-alone figure like 75 %. Moreover, when people talk about a percentage, you had better ask, “Percentage of what?” to clarify and to avoid unwarranted interpretation and conclusion.
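The effect of the base can be made explicit with two small helper functions. This is a sketch for illustration; the function names are made up here and are not from the book.

```python
def items_for_gain(gain_percent, n_items):
    """Number of test items that a given percentage gain corresponds to."""
    return gain_percent * n_items / 100

def percent_per_item(n_items):
    """Percentage weight carried by a single item (or mark) on the test."""
    return 100 / n_items

print(items_for_gain(20, 10))  # 2.0  (Miss Siva's 10-item test)
print(items_for_gain(20, 20))  # 4.0  (Mrs Hamidah's 20-item test)
print(percent_per_item(20))    # 5.0  (one mark on a 20-mark test)
```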
2.5 Just Think About This
Improving from 1 to 2 % is not 1 % improvement but 100 %
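The difference between the two readings is the difference between a change in percentage points and a relative change. A minimal sketch, with function names chosen here for illustration:

```python
def percentage_point_change(old, new):
    """Plain difference between two percentages."""
    return new - old

def relative_change(old, new):
    """Change expressed as a percentage of the starting value."""
    return 100 * (new - old) / old

print(percentage_point_change(1, 2))  # 1
print(relative_change(1, 2))          # 100.0
```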
Chapter 3
On Standard Deviation: How Different Are They?
In Chap. 1, On Average, we talked about the need to know the spread of a set of marks in addition to knowing the average (mean) of those marks. We also mentioned that the range (i.e., the difference between the highest and the lowest marks) is a quick and rough statistic for this purpose, but the standard deviation (SD) is a formal statistic for this. What then is a standard deviation? And, how is it to be calculated, although nowadays we always leave this tedious job to the computer?
3.1 First, Just Deviation
Standard deviation is, of course, a deviation which has been standardized. But, this does not explain anything. To understand what a SD is, we need to separate the terms, first just talk about deviation and then standard. For illustration, we will use the data from Table 1.1 with which we are familiar by now.
Deviation is a lazy way of saying “deviation from the mean.” Look at Table 3.1. Take Group A. All marks are the same as the average, and therefore, none of them deviate from the mean. So, the sum of deviations is zero, and the SD is therefore zero. Usually, though, things are more complicated than this.
Let us look at Group D. As we know, the average is 55. Now we need to know how much each student’s mark deviates from the average, that is, how far away he is from the mean. Student 1 deviates from the mean by −15 (i.e., 40 − 55). At the other end, Student 5 deviates from the mean by +15 (i.e., 70 − 55). So, if we sum up all deviations, we should be able to tell how far away all five students are from the average. To our surprise, the five deviations sum up to zero, indicating that, in general, the students do not deviate at all from the mean! This of course is not true. Something is wrong somewhere.
3.2 Next, Standard
What has gone wrong? There are negative deviations (for marks below the mean), and there are positive deviations (for marks above the mean). Since the mean (average) is the middle point, summing the negative and positive deviations allows them to cancel each other out (see the third column in Table 3.2). At one time, it was common practice to take the absolute values of the deviations by ignoring the negative sign. In this case, the total deviation for Group D is 40. Since this is contributed by five students (including the one with a zero deviation), we average it and get a value of 8. This is the averaged deviation, and the process of averaging is to standardize the deviation so that every student is assumed to have the same deviation (8), hence the term standard deviation. For some statistical reasons (not mentioned here to avoid complication or confusion), this practice was discontinued, although there are statisticians who still see its usefulness and try to revive its use.

Statisticians generally prefer another way to get rid of the negatives (and not just ignoring the negative sign), that is, to square the deviations. This is shown in the last column of Table 3.2. Now, the sum of squares of deviations from the mean (or simpler, sum of squares) is 500. Since this comes from all five students, the sum is divided by five for an averaged sum of squares, and this average is called the variance, in this case 500/5 = 100. Since the variance is the outcome of squaring, the process of squaring is reversed by taking its square root. This results in a SD of 10.0. Thus, the square root of the variance is the standard deviation, and, the other way round, the square of the standard deviation is the variance. This means, on average, a student deviates from the mean by 10 marks. Now, Group D’s performance in the examination can be reported in a conventional manner as 55 (10.0), with the mark outside the brackets being the mean and the one inside the SD.

Table 3.1 Test marks and
Table 3.2 Calculation of the SD for Group D (columns: Group D, Mark, Deviation from the mean, Square of deviation)
If you take the trouble to go through the following steps for groups B and C, their SDs are 15.5 and 20.5, respectively. The steps are as follows:
1. Find the mean (average) of the set of marks.
2. Find for each mark its deviation from the mean, i.e., subtract the mean from the mark.
3. Square each of the deviations.
4. Sum the squares.
5. Average the sum of squares by dividing it by N, that is, the number of students.
6. Take the positive square root of this average; the result is the SD.
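The six steps above can be sketched in Python for Group D. The individual marks are not listed in the surviving text; the values 40, 50, 55, 60, and 70 are inferred here from the stated mean of 55, the deviations of −15 and +15 at the two ends, the one zero deviation, the total absolute deviation of 40, and the sum of squares of 500. Treat them as an assumption.

```python
import math

# Group D marks (inferred from the figures quoted in the text, an assumption).
marks = [40, 50, 55, 60, 70]

mean = sum(marks) / len(marks)              # step 1: the mean
deviations = [m - mean for m in marks]      # step 2: deviation of each mark
squares = [d ** 2 for d in deviations]      # step 3: square each deviation
sum_of_squares = sum(squares)               # step 4: sum the squares
variance = sum_of_squares / len(marks)      # step 5: divide by N
sd = math.sqrt(variance)                    # step 6: positive square root

print(mean, sum(deviations))                # 55.0 0.0
print(sum(abs(d) for d in deviations) / 5)  # 8.0 (the older averaged deviation)
print(sum_of_squares, variance, sd)         # 500.0 100.0 10.0
```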
3.3 Discrepancy in Computer Outputs
If you use Excel to get the SDs and compare them with those reported in Table 3.1, you may notice that with the exception of Group A, there are discrepancies: Group B, 15.5 versus 17.3; Group C, 20.5 versus 22.9; and Group D, 10.0 versus 11.2. Are these careless inaccuracies? No, both sets of SDs are correct but for different reasons.
If the four groups of students were samples from their respective populations, statisticians will say the SDs we have obtained here by our own calculations are biased. To correct the bias, instead of dividing the sum of squares by the sample size (i.e., number of students, N = 5), you have to divide it by the number of students minus one (N − 1, or 4). When this is done, the SDs obtained will be greater. For example, for the five scores 3, 5, 5, 6, and 7, the sample standard deviation (STDEV.S in Excel) is 1.48. But, if the five scores are from the population (i.e., for all members of your group of interest), the population standard deviation (STDEV.P in Excel) is 1.33. Thus, be careful to choose between STDEV.S and STDEV.P.
Nowadays, we trust the computer to do the calculation, but a word of caution is in order here, because software has its own peculiarities. For example, the Statistical Package for the Social Sciences (SPSS) routinely uses (N − 1) to calculate the SD. Thus, for the same set of marks, different programs may return somewhat different results.

Another point is worth mentioning. In our example, each group has only five students. In this situation, one student more or less makes a lot of difference; for instance, Group B’s 15.5 and 17.3 have a difference of 1.8. However, when the group is large, say 30 or more, the difference between using N and (N − 1) is too small to have any practical importance. Then, either set will serve the same purpose well. There is a theoretical reason for using (N − 1) instead of N, but we need not go into this for practical purposes.
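The same N versus (N − 1) distinction appears in Python's standard `statistics` module, where `stdev` divides by N − 1 and `pstdev` divides by N. The sketch below uses the five scores from the text; the helper `to_sample_sd` is a made-up name that converts a population SD into the corresponding sample SD for a group of size n.

```python
import math
import statistics

scores = [3, 5, 5, 6, 7]

sample_sd = statistics.stdev(scores)       # divides the sum of squares by N - 1
population_sd = statistics.pstdev(scores)  # divides the sum of squares by N

print(round(sample_sd, 2), round(population_sd, 2))  # 1.48 1.33

def to_sample_sd(population_sd, n):
    """Rescale an SD computed with N into one computed with N - 1."""
    return population_sd * math.sqrt(n / (n - 1))

# Group SDs quoted in the text, each group having five students.
for group, sd in {"B": 15.5, "C": 20.5, "D": 10.0}.items():
    print(group, round(to_sample_sd(sd, 5), 1))
# B 17.3
# C 22.9
# D 11.2
```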
3.4 Another Use of the SD
Is describing the spread of marks the only use of the SD? Yes, and No
Yes, because the SD is a statistical method of describing how widely or narrowly the marks are spread around the mean (average). This information is needed for us to understand more appropriately an important attribute of a group of students who have taken a test or examination. Describing a group with only the average is likely to misinform and hence lead to futile or, worse, wrong actions, such as failing to help a group that needs help or, conversely, helping a group that does not need help.
No, because the SD is needed for an important function, that of proper interpretation of marks obtained for different tests or subjects. In an educational system where ranking or relative position is important and often critical, getting a mark that places a student in a higher ranking or position is more desirable than one that does not.

3.5 Standardized Scores
Now let us see how this works
Ben scores 80 for both English Language (EL) and Mother Tongue (MT). It is quite natural to conclude that he has done equally well in the two languages because 80 = 80. Correct? Maybe, and maybe not! When making such a conclusion, the tacit assumption is that one EL mark is equivalent to one MT mark, that is, earning one more mark in the two tests requires the same effort or ability.
This may be true by chance, but it is almost always untrue. It is like saying having one more USD is equivalent to having one more SGD. This will be correct if and only if the exchange rate for the two currencies is one to one. The possibility of this happening is always there, but the probability, at this moment, is practically nil, unless there is a very drastic change in the two nations’ economies. How then can we compare the mark for EL and that for MT? Or, compare USD and SGD?
Well, the solution is simple: Do what has been done for the Primary School Leaving Examination (PSLE) marks and convert the subject marks to T-scores before comparing them. If this sounds challenging or even somewhat mysterious, just see how easily it is done. And, with Excel, you can do it, too.
Converting a raw mark to a T-score is called T-transformation. The formula to use is this:

\[
\text{T-score} = \frac{\text{Mark} - \text{Mean}}{\text{SD}} \times 10 + 50
\]
To do this, you need the mean (average) and the SD of the test scores of students who have taken the same test. Let us assume that, for the two subject tests, the means and the SDs are those shown in Table 3.3 together with Ben’s EL and MT marks. Here, the two tests happen to have the same mean of 70, but the SDs are different, 10 for EL and five for MT. Ben gets 80 for both tests.
When Ben’s raw marks are T-transformed with reference to the means and SDs of the two subject tests, he gets a T-score of 60 for EL, but 70 for MT. This shows that, in the context of his class, Ben has done better in MT than he has in EL. To make this clear, imagine that 100 students (Ben and 99 others) have taken the two tests. Now, visualize that they are asked to line up from the lowest to the highest EL raw mark. Assuming a normal distribution, we find Ben standing at the 84th position (because his EL mark places him one SD above the EL mean, and he is beaten by 16 students). Now, the students are asked to line up again but based on their MT marks. This time, Ben stands at the 98th position (because his MT mark places him two SDs above the MT mean, and he is beaten by only 2 students). Since Ben is farther ahead in MT than in EL when compared with his peers, the logical conclusion is that he has done better in MT than in EL, in spite of the same raw marks. This illustrates the importance of and need for the SD in addition to the mean when considering test performance.
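Ben's case can be retraced with a few lines of Python. The function names are illustrative, and the positions assume, as the text does, that the marks follow a normal distribution; `NormalDist` from the standard library supplies the normal curve.

```python
from statistics import NormalDist

def t_score(mark, mean, sd):
    """T-transformation: standardize the mark, then rescale to mean 50, SD 10."""
    return (mark - mean) / sd * 10 + 50

def percent_below(t):
    """Percentage of students expected to score below a T-score (normal curve)."""
    return 100 * NormalDist(mu=50, sigma=10).cdf(t)

ben_el = t_score(80, mean=70, sd=10)
ben_mt = t_score(80, mean=70, sd=5)

print(ben_el, round(percent_below(ben_el)))  # 60.0 84
print(ben_mt, round(percent_below(ben_mt)))  # 70.0 98
```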
There are reasons why T-transformation is necessary. Firstly, different tests usually do not have the same means and the same SDs. The scores for the two tests are therefore neither comparable nor interchangeable. This means the same scores for two tests obtained by a student will rank him differently or place him on two different points of the two scales, in spite of the same raw marks. Thus, scores for two tests cannot be meaningfully compared directly; they need to be transformed to the same scale before comparison can be meaningfully made. Secondly, raw marks for two tests are automatically weighted by their respective SDs when compared, with undue advantage to the score from the test with the greater SD. This calls for an explanation which is best done by an illustrative case.
Let us do some reversed thinking, working from known T-scores back to raw marks (Table 3.4). Calvin obtained a T-score of 60 for both EL and MT. A T-score of 60 is one SD above the mean. According to the normal distribution, this places him at the 84th position for both subjects. This means Calvin is as good in EL as he is in MT when compared with his peers. However, when reverted to the raw marks, he has 80 for EL and only 75 for MT, leading to the erroneous conclusion that he is better in EL than in MT. You notice that EL has a greater SD, 10, than MT’s five. Thus, when comparison is made on raw marks, scoring high on a test which has a greater SD is unduly advantaged, leading to the false impression that Calvin is better in EL than in MT when in fact he is equally good in both subjects (since his ranks in the two subjects are the same, based on the same T-score of 60).

Table 3.3 Transformation of raw marks to T-scores
Table 3.4 Transformation of T-scores to raw marks

The above examples illustrate the need for the SD to enable meaningful comparison of two raw marks for two different tests obtained by one student. The principle and procedure are equally applicable to comparing two students on two tests. Moreover, T-scores for different tests are supposed to be comparable in that one T-score point on one test is equivalent to one T-score point on another. And, more importantly, T-scores from different tests can then be added for a T-score Aggregate (as has been done for the PSLE subject scores). Sometimes, when a particular subject is for some reason considered more important, the T-score for this subject is given more weight before adding it to the unweighted T-scores of other subjects. Without the SDs to obtain the T-scores, none of this is possible.
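Working backward from T-scores, and adding them up, can be sketched as follows. The function names are illustrative, and the weights in the last line are hypothetical; the text does not say what weights are used in practice.

```python
def raw_mark(t_score, mean, sd):
    """Invert the T-transformation T = (mark - mean) / sd * 10 + 50."""
    return mean + (t_score - 50) / 10 * sd

calvin_el = raw_mark(60, mean=70, sd=10)
calvin_mt = raw_mark(60, mean=70, sd=5)
print(calvin_el, calvin_mt)  # 80.0 75.0

def aggregate(t_scores, weights):
    """Weighted sum of T-scores; a weight of 1 leaves a subject unweighted."""
    return sum(t_scores[subject] * weights[subject] for subject in t_scores)

t_scores = {"EL": 60, "MT": 60}
print(aggregate(t_scores, {"EL": 1, "MT": 1}))  # 120
print(aggregate(t_scores, {"EL": 2, "MT": 1}))  # 180 (hypothetical double weight on EL)
```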
3.6 Scores Are not at the Same Type of Measurement
A more fundamental reason why transformation of raw marks to T-scores is necessary has to do with the types of measurement. Type of measurement is a conceptual scheme for classifying scales according to their characteristics and allowed statistical processes. This is shown in Table 3.5.

Table 3.5 Levels of measurement

Level | Description | Allowed statistics | Example
Nominal | Also called categorical variable. Objects are in mutually exclusive groups, but no ordering, i.e., groups are equally “good” | Frequency count and percentage | Gender, race, class level, home language
Ordinal | Objects are in different groups, and groups are ordered, i.e., one group is “better” than another | Median, percentile | Socioeconomic status, passed/failed achievement, grades, preference
Interval | Different groups are ordered, and the differences have the same distance or interval in between | |
Ratio | … and there is an absolute zero | | Height, weight, money
For nominal or categorical measurement, students can be classified into groups which are exclusive of one another. For example, they can be male or female but cannot be both. Likewise, a Primary 4 student cannot be a Primary 5 student at the same time. The allowed statistics are frequency counts, which can also be expressed as a percentage of a more inclusive grouping. For instance, there can be 12 Primary 3 boys who form 20 % of all 60 Primary 3 students. For a nominal/categorical scale, applying arithmetic operations changes the nature of the measures. For instance, adding 12 boys and 15 girls turns them into 27 students, and they are no more boys and girls. Adding 12 English-speaking students and 15 Mandarin-speaking students makes 27 students, and their respective home language identity is lost. But, it does not make sense to say students who speak one language are better than those who speak another language at home.
For ordinal measurement, students can be arranged in ordered groupings. For instance, 50 students can be grouped according to whether they have passed or failed a test, and those who passed are considered better than those who failed. Likewise, they can be grouped based on home environment; those whose parents earn more are considered “better off” than those whose parents earn less.
We can ask students to indicate how they like learning English (or any other subject) using the computer by endorsing a point on a four-point scale: Like it very much (4), Like it (3), Don’t like it (2), or Don’t like it very much (1). But, we cannot say that those who endorsed “Like it very much” (4) are two steps higher than or twice as positive as those who endorsed “Don’t like it” (2) simply because we have coded the two responses as 4 and 2. This is because the codes (4 and 2) are ordinals and not cardinals. In short, for ordinal measures, we cannot be sure that the distance between categories is equal, and therefore, subtraction and ratio make no sense.

For interval measurement, differences between two groups are supposed to be consistent. Temperature is such a scale, where a 10-degree difference between 30 and 40 degrees Fahrenheit is supposed to be the same as the difference of 10 degrees between 80 and 90 degrees Fahrenheit. But, there is no true zero on this scale. In this case, zero degrees Fahrenheit does not mean no temperature at all; it just means the temperature is relatively low. In the educational context, a score of zero on a test does not mean the student has no relevant knowledge at all, but that he is relatively poor in the subject matter assessed. Incidentally, it is interesting that the Kelvin temperature scale has an absolute zero, and zero on this scale is equivalent to −459.67 on the Fahrenheit scale or −273.15 on the Celsius scale.
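The temperature example lends itself to a quick numerical check: equal differences survive a change of scale, but ratios do not unless the scale has a true zero. The sketch below uses the standard Fahrenheit to Celsius and Kelvin conversions; the function names are illustrative.

```python
def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9

def fahrenheit_to_kelvin(f):
    return fahrenheit_to_celsius(f) + 273.15

# Equal 10-degree differences in Fahrenheit remain equal after conversion.
diff_low = fahrenheit_to_celsius(40) - fahrenheit_to_celsius(30)
diff_high = fahrenheit_to_celsius(90) - fahrenheit_to_celsius(80)
print(round(diff_low, 2), round(diff_high, 2))  # 5.56 5.56

# The ratio "80 is twice 40" depends entirely on where the zero sits.
ratio_f = 80 / 40
ratio_c = fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40)
ratio_k = fahrenheit_to_kelvin(80) / fahrenheit_to_kelvin(40)
print(ratio_f, round(ratio_c, 2), round(ratio_k, 2))  # 2.0 6.0 1.08
```

Only the Kelvin ratio is physically meaningful, because only that scale has an absolute zero.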
Educational tests of achievement or attitudes are usually assumed to yield interval measurement. This is still controversial, and the assumption of an interval scale is just for convenience. Thus, a difference of 10 marks between 50 and 60 on a Science test is assumed to have the same meaning as another difference of 10 marks between 80 and 90 on the same test; that is, the differences are assumed to be uniform at different parts of the scale. However, remember that an interval scale has no true zero. This means that when a student scores zero for the Science test, it does not mean he knows nothing about Science; it means his knowledge of Science places him at the lowest point of the scale. By the same token, a score of 100 on the Science test does not mean the student scoring this has perfect or all the knowledge of Science; he is just at the top of the scale. Since there is no true zero, a student with a Science score of 80 is not twice as knowledgeable as one who scores 40. In short, ratio does not make sense on an interval scale.
For ratio measurement, objects can be counted (nominal scale), ordered (ordinal scale), their differences on different parts of the scale are supposed to be equal or have the same meaning (interval scale), and, above all, there is a true zero point at the lowest end of the scale. The zero is critical as it gives the scale its meaningful interpretation. A man who has zero weight (or height) just does not exist; a man who has zero money is totally broke. On the other hand, a giant of 10 feet is exactly five times as tall as a dwarf of two feet. An obese man of 90 kg is three times as heavy as a 30 kg undernourished man. And, Bill Gates’s monthly income is just a million times that of a typical teacher.
Although we do not think of a student who scores zero on the most recent Math test as having lost all his mathematical skills, we habitually compare students on their scores, tacitly assuming that there is a true zero on the test: Kelly is twice as smart as Ken because they scored 80 and 40, respectively, on the latest Geography test. This kind of thinking is to be avoided, of course.
As is well-documented, educational tests yield scores which are measuring at the ordinal level and at best assumed to be at the interval level, unlike physical measures such as length and weight which are measured at the highest, ratio level. More importantly, educational measures have no objective standards by which students’ performance can be judged, unlike their heights or weights. For this reason, raw marks can only be meaningful when interpreted in terms of deviations from the average performance of all students who have taken the same tests. Thus, the further above the mean, the better the performance, and vice versa. And, to enable meaningful interpretation, raw marks need to be transformed with reference to the mean and the SD, especially when test scores for different tests are to be compared or combined (again, for this, the PSLE is a case in point).
Many efforts have been made to make educational measures interpretable, and different standardized scales have been proposed. These include the Stanine (standardized nine-point scale), Stanten (standardized 10-point scale), and Normal Curve Equivalent (there is a long story for this), which are used for reporting performance on standardized tests widely used in the USA, UK, and Australia and recently some Asian nations as well (e.g., the TOEFL and the SAT). However, a discussion on this is beyond the scope of this note.
3.7 A Caution
On the job, teachers and school leaders cannot run away from having to interpret test scores and, based on the interpretation, make important decisions on the students, the curriculum, and even the school as a whole. We may interpret the test scores correctly or wrongly, and we of course prefer to do it correctly. To interpret test scores correctly for the students, the curriculum, and the school, we need to be aware of the pitfalls in the process of interpreting and using test scores. Here, our professional integrity is at stake.
Understanding the concept of types of measurement is important to ensure correct interpretation of test scores with due caution against misinterpretation and misjudgment. The discussion and example given by Osherson and Lane (n.d.) are well worth the time spent reading them.
Reference
Osherson, D., & Lane, D. M. (n.d.). Online statistical education: An interactive multimedia course of study: Levels of measurement. Rice University, University of Houston, and Tufts University. http://onlinestatbook.com/2/introduction/levels_of_measurement.html
Chapter 4
On Difference: Is that Big Enough?
What is the difference between a physician and a statistician?
A physician makes an analysis of your complex illness,
a statistician makes you ill with a complex analysis!
4.1 Meaningless Comparisons
In the school context, we make comparisons to find the differences between the test performances of students, classes, and schools. We are also interested in the difference between the test performances at two points of time, that is, students’ improvement. Moreover, we are even concerned with the difference between test performances in two subjects, say, Mathematics and Science. While we are busy with differences of all sorts, we tend to forget the commonsense that apples should be compared with apples and oranges with oranges, but not oranges with apples. Making comparisons is so easy that it becomes our second nature, and we then go on without much thinking to make the following comparisons:
Da Ming scored 75 for Mother Tongue and 70 for English. Da Ming is better in Mother Tongue.
For Semestral Assessment 1 Science, Primary 4B obtained a mean of 70; for Semestral Assessment 2 Science, the mean is also 70. The class did not make any improvement.
For last year’s PSLE Math, our school had 55 % A*; this year, we had 57 % A*. We gained 2 %.
The fallacy in the above comparisons is that apples are compared with oranges. The Mother Tongue and English tests Da Ming took were not the same test. The two Science papers for Semesters 1 and 2 covered different content. And, of course,
© Springer Science+Business Media Singapore 2016
K Soh, Understanding Test and Exam Results Statistically,
Springer Texts in Education, DOI 10.1007/978-981-10-1581-6_4
25
Trang 38the two Math papers for the last year and this year are not the same In all thesecomparisons, the two tests are not the same tests.
No doubt, at face value, 75 is greater than 70, 70 is equal to 70, and 55 % is 2 percentage points less than 57 %. Then why do these comparisons make no sense? It boils down to the question of what the basis of comparison is. If an apple costs $0.75 and an orange $0.70, isn’t the apple more expensive than the orange? Yes, if this means that the price of an apple is more than the price of an orange. Here, the basis of comparison is the prices, not the fruits. Because we are busy, we always use shorthand when we talk to other people or to ourselves. Instead of saying “the price of an apple,” we just say “apple” and so confuse ourselves and fall into muddled thinking. Incidentally, another dubious habit is to create and use acronyms, although doing this may have its sociological function of signaling in-groupness.
A difference obtained by comparing two scores, means, or percentages can be meaningfully interpreted if and only if the scores, means, or percentages have been obtained using the same yardstick. In the school context, this is often not the case. It is obvious that the items making up the Semestral Assessment 1 Science test cannot be the same as those making up the Semestral Assessment 2 Science test. Thus, although the numbers (scores or marks of 70) may look the same, they do not denote the same ability qualitatively, and hence 70 for one assessment and 70 for another are not equal. In this case, a Semestral Assessment 2 Science score of 70 is likely to represent a higher degree of ability or knowledge, since Semestral Assessment 2 usually covers more advanced topics, which may even build to some extent on Semestral Assessment 1 topics.
As discussed in Chap. 3, On Standard Deviation, because of this non-equivalence in the content and ability assessed by different tests, comparisons are not made on the original test scores; instead, the scores are transformed (standardized) to, say, T-scores. Then, through reference to the normal curve, we compare the T-scores. When we compare two students on their T-scores for two different tests, we are in fact comparing their relative standings in an idealized situation, taking a T-score of 50 to be the point where an average student stands and assuming that a 50 in one test is as good as a 50 in another. Even here, the ability needed to score T = 50 in one test may not be the same as that needed to score T = 50 in the other. This is a problem of horizontal test equating.
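The transformation to T-scores can be sketched in a few lines of Python. This is a minimal illustration that assumes the usual definition T = 50 + 10 × (score − mean)/SD; the function name `t_scores` and the marks for the five students are hypothetical and are not taken from the book.

```python
from statistics import mean, pstdev

def t_scores(raw_scores):
    """Convert raw scores to T-scores: T = 50 + 10 * (x - mean) / SD."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # SD of this same set of scores
    if sd == 0:
        raise ValueError("all scores are identical; T-scores are undefined")
    return [50 + 10 * (x - m) / sd for x in raw_scores]

# Hypothetical marks for the same five students on two different tests.
mother_tongue = [55, 65, 75, 85, 95]
english = [50, 60, 70, 80, 90]

mt_t = t_scores(mother_tongue)
en_t = t_scores(english)

# The third student has raw marks of 75 and 70, but the same relative standing.
print(round(mt_t[2], 1), round(en_t[2], 1))  # 50.0 50.0
```

In this made-up data set, the raw marks of 75 and 70 both sit exactly at the class mean, so both become T = 50. The T-scores say that the student's relative standing is the same on the two tests, not that the two tests demand the same ability.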
For last year’s PSLE Math, our school had 55 % A* and our neighbor school had 57 % A*. We lose by 2 %.
Compared with the previous three comparisons, these deal with differences in the same measures: that is, the same Mother Tongue test taken by Da Ming and Jason, the same Semestral Assessment 1 Science paper taken by Primary 4B and Primary 4E, and the same PSLE Math paper taken by the two schools last year. This is critical because the students, the classes, and the schools were compared on the same yardsticks: apple with apple and orange with orange.
In so doing, there are the same bases for making the comparisons. The numbers (scores, marks, percentages) are used as units to describe quality quantitatively. In other words, the numbers do not have a life of their own; in each comparison, the meaningfulness comes from the basis of comparison.
A statistics professor was going to his lecture He met a colleague who greeted him in the usual manner, “How are you?” “Very well, thank you.” “And how is your wife?” The professor hesitated for a while and asked, “Compared with what?”
This may be a joke, but it underlines the importance of a meaningful basis of comparison when we ask questions like “How good?”, “How much better?”, and “How large is the difference?” Making meaningless comparisons can only confuse and mislead, leading to inappropriate follow-ups. Having made sure that the students, classes, or schools are compared meaningfully, and having found a difference, the next natural question is, “Is that big enough?” To answer this question, we need to use the standard deviation (SD) to find an effect size. To this we now turn.
4.3 Effect Size: Another Use of the SD
Because educational tests have no absolute zero, the scores (marks, percentages) are relative and are void of meaning when read in isolation. They therefore have to be compared with some standard for meaningful interpretation. This is achieved by first finding the average performance and then using it as a standard against which all other scores are compared and interpreted. When doing this, we have to imagine a typical student who scores the average (although this student may or may not really exist).
Once this typical score (the average of a set of test scores) is identified, a higher score denotes a better performance (or a better student in the context of the test), and vice versa. As such labels as higher and lower are vague, a standard is needed to make the comparison meaningfully interpretable. The most common and convenient way is to use the standard deviation (of the same set of scores) as the yardstick to show how much higher or lower a score is than the mean (average).
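Using the SD as a yardstick amounts to dividing a score's distance from the mean by the SD. The sketch below illustrates this with made-up marks, chosen so that the mean is 60 and the SD is 10; the helper name `sd_units` and the student names are hypothetical.

```python
from statistics import mean, pstdev

def sd_units(score, scores):
    """Return how many SDs `score` lies above (+) or below (-) the mean of `scores`."""
    return (score - mean(scores)) / pstdev(scores)

# Hypothetical marks for one class on one test (mean 60, SD 10).
marks = {"Amy": 45, "Jack": 55, "Mei": 60, "John": 65, "Raj": 75}
all_marks = list(marks.values())

john = sd_units(marks["John"], all_marks)
jack = sd_units(marks["Jack"], all_marks)

print(john)         # 0.5  (half an SD above the mean)
print(jack)         # -0.5 (half an SD below the mean)
print(john - jack)  # 1.0  (the two students are one SD apart)
```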
When using the SD as the yardstick, we are able to say something like “John is half an SD above the mean but Jack is half an SD below the mean, and they are one SD apart.” Likewise, when analyzing the posttest scores of an action research project, we are able to report that the project group’s mean is 4/5 (or 0.8) SD above the comparison group’s mean and that, therefore, there is a large project effect.
The SD is also necessary when comparing the test results of two or more classes. More often than not, classes are compared only on their averages (means). For instance, Miss Tan’s class scores a mean of 74 and Mr Lee’s 70 for the same test. Since 74 > 70, the conclusion is that Miss Tan’s class has done better. Of course, there is a difference between the classes and the conclusion seems reasonable. However, a question can be asked: Is the difference of four marks a large, medium, small, or trivial difference?
As Darrell Huff says in his book How to Lie with Statistics, “a difference is a difference when it makes a difference.” This may sound like playing with words, but there is a lot of truth in it. When an observed difference makes no practical difference, it does not matter and is best ignored. For instance, when looking for a shirt with a 15.5-inch collar, one with a 16.0-inch or 15.0-inch collar can be tolerated, and the difference of 0.5 inch makes no difference.
Is a four-mark difference between Miss Tan’s and Mr Lee’s classes of practical importance? Before drawing a conclusion one way or the other, we need to evaluate the size of the difference with reference to some statistical yardstick. Cohen (1988) offers one that has been widely used by researchers the world over, that of effect size (ES). There are several formulas for the calculation of the ES, and they all look like this:
Effect size = (Mean1 − Mean2)/SD
In this seemingly simple formula, Mean1 and Mean2 are the means of the two groups being compared. As for the SD, there are several choices for different purposes and theoretical considerations. However, using different SDs yields about the same results, with the differences more often than not found in the second or even the third decimal place; for practical purposes, such differences make no difference and can be ignored. Thus, the simplest choice is to use the comparison group’s SD in school-based curriculum innovation or action research projects. For post-examination discussion, either group’s SD will do, as different classes are likely to have SDs close to each other in size (again, a case where a difference makes no practical difference).
Let us go back to the two classes. If Miss Tan’s class has a mean of 74 with an SD of 16.0 and Mr Lee’s has a mean of 70 with an SD of 15.0, the ES is either (74 − 70)/16 = 0.25 or (74 − 70)/15 = 0.27. First of all, the difference (0.02) between the two effect sizes is in the second decimal place and makes no difference. Secondly, the ESs (0.25 and 0.27) fall within the range between 0.2 and 0.5. According to Cohen’s criterion, this is a small ES, and one closer to the trivial category (0.2 and below). Thus, it may be reasonable to ignore the difference of four marks.
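The arithmetic for the two classes can be checked with a short Python sketch. The names `effect_size` and `cohen_label` are illustrative and not from the book. The cut-offs assume Cohen's conventional benchmarks (0.2, 0.5, and 0.8), with 0.2 and below labelled trivial as in the text. The pooled SD is not named in the text; it is shown as one possible alternative choice of SD, and it assumes the two classes are of equal size because class sizes are not given.

```python
from math import sqrt

def effect_size(mean1, mean2, sd):
    """Effect size = (Mean1 - Mean2) / SD."""
    if sd <= 0:
        raise ValueError("SD must be positive")
    return (mean1 - mean2) / sd

def cohen_label(es):
    """Label an effect size using Cohen's conventional cut-offs (assumed here)."""
    size = abs(es)
    if size <= 0.2:
        return "trivial"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

# Figures from the text: Miss Tan's class (mean 74, SD 16) and Mr Lee's class (mean 70, SD 15).
tan_mean, tan_sd = 74, 16.0
lee_mean, lee_sd = 70, 15.0

# Pooled SD for two classes of equal size (an assumption; class sizes are not given).
pooled_sd = sqrt((tan_sd ** 2 + lee_sd ** 2) / 2)

for name, sd in [("Miss Tan's SD", tan_sd), ("Mr Lee's SD", lee_sd), ("pooled SD", pooled_sd)]:
    es = effect_size(tan_mean, lee_mean, sd)
    print(f"{name}: ES = {es:.2f} ({cohen_label(es)})")
```

Whichever SD is used, the ES comes out between about 0.25 and 0.27 and is labelled small, which is consistent with the point that the choice of SD rarely changes the practical conclusion.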