Common but Unwarranted Comparisons

Teachers and school leaders who read research articles may have come across such terms as the t test and ANOVA (analysis of variance). In fact, these are so common in the research literature that teachers and school leaders cannot escape them.

What then are these and why are they unwarranted?

The t test is a technique commonly used for checking whether the mean difference between two groups could have happened due to sampling. For this, the researcher begins with a null hypothesis (i.e., a prediction of no difference), assuming that the obtained mean difference is due to sampling. He then calculates the t-value (of course, he lets the computer do the job). If he obtains a t-value equal to or greater than the critical value for his sample size, he concludes that the null hypothesis is not supported and that the mean difference is too great to be a chance occurrence; such a difference is usually labeled significant. In the past, the critical t-value needed for this purpose was found in the relevant table appended to standard statistics textbooks; nowadays, statistical software does this automatically and shows the result of the comparison. When there are more than two groups, the researcher will run an ANOVA to see, first, whether there is at least one “significant” difference between any two groups; this is then followed by a series of pairwise t tests, each checking two groups at a time.
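The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are hypothetical, the function name `pooled_t` is made up for this sketch, the equal-variance (pooled) form of the t test is assumed, and the critical value is the usual two-tailed table value for the 0.05 level with 18 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Return (t, df) for the equal-variance two-sample t test."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    std_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_error, df


# Hypothetical scores on a 20-item test, 10 students per class.
class_a = [12, 14, 11, 15, 13, 16, 12, 14, 13, 15]
class_b = [10, 12, 11, 13, 9, 12, 11, 10, 12, 11]

t_value, df = pooled_t(class_a, class_b)
CRITICAL_T_DF18 = 2.101  # two-tailed, 0.05 level, df = 18 (standard table value)

print(f"t = {t_value:.2f}, df = {df}")
if abs(t_value) >= CRITICAL_T_DF18:
    print("Null hypothesis not supported: the difference is labeled significant.")
else:
    print("Null hypothesis retained: the difference is labeled non-significant.")
```

Statistical software performs the same comparison but looks up the critical value (or the p-value) for whatever sample size is supplied.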

Many educational researchers routinely run the t test when they have two groups to compare. They also unthinkingly run the ANOVA when they need to compare more than two groups, and then follow it up by comparing two groups at a time using the t test. So, the t test and ANOVA actually serve the same purpose of comparing groups, except that the ANOVA is a more involved procedure.
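To show what the ANOVA step computes, here is a minimal sketch of the F-value for a one-way ANOVA. The scores and the function name `one_way_f` are hypothetical, and the critical value is the usual table value for the 0.05 level with 2 and 12 degrees of freedom.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)
    # Between-groups sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: how far each score is from its own group mean.
    ss_within = sum((score - mean(g)) ** 2 for g in groups for score in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical test scores for three groups of five students each.
group_1 = [8, 9, 10, 11, 12]
group_2 = [12, 13, 14, 15, 16]
group_3 = [13, 14, 15, 16, 17]

f_value, df_between, df_within = one_way_f([group_1, group_2, group_3])
CRITICAL_F_2_12 = 3.89  # 0.05 level, df = (2, 12) (standard table value)

print(f"F({df_between}, {df_within}) = {f_value:.2f}")
print("At least one significant difference:", f_value >= CRITICAL_F_2_12)
```

A significant F-value says only that some difference exists among the groups; it takes the pairwise comparisons to say which groups differ.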

A school leader had doubts about the oft-said “Practice makes perfect.” She got a teacher who taught Math to three Primary 5 classes to give the classes different amounts of practice, say, 10 sums, 20 sums, and 30 sums, after teaching the topic of converting fractions to decimals and vice versa. A week later, the three classes took a 20-item test. To find out whether the amount of practice mattered, the teacher was advised by the consultant to run the ANOVA and then the pairwise t tests. Here, the ANOVA answers the question “Is there at least one significant difference among the three classes?” and the t test answers the question “Which pair of classes has a significant difference?” These are commonly called the omnibus test and the post hoc pairwise comparisons, respectively. In the end, the teacher, under the guidance of the consultant, reported,

The ANOVA results in a significant F-value, and the follow-up pairwise comparisons show a significant difference between students who had 10 and 20 sums to practice, but the difference between students who had 20 and 30 sums is not significant. All in all, the results indicate that practice had an effect up to a point beyond which it made no difference.

This sounds OK, doesn’t it? Why then is it unwarranted?

The use of the t test and ANOVA has some conditions to be satisfied before the results of the analysis can be considered valid. If the assumptions are not valid, the results are not meaningful and cannot be trusted. Now, things seem to be getting complicated, and indeed they are. Here are the assumptions in the education context:

1. The samples are randomly selected from their respective populations.

2. The students are independently sampled.

3. The scores form a normal distribution.

4. Variances are equal in the population.

In the education context, these assumptions are seldom satisfied, if at all. In the first place, when classes are compared, seldom are they random samples of specified populations; rather, they are convenience or purposive groups of students. Strictly speaking, they do not form random samples (or even just samples), as the populations are usually not defined or are nonexistent; and, in the latter case, the students form the populations!

Secondly, such comparisons are made between intact classes, so the students are not independently sampled; this is especially so when students are ability-grouped.

Thirdly, one cannot be sure that the scores follow the normal distribution, partly because the group sizes tend to be small for classroom-based projects and the test may be too easy or too difficult for different purposes. Fourthly, the variances may or may not be equal. In sum, educational data (test scores) are not always suitable for use of the t test and ANOVA due to the nature of the measures (test scores) and the way students are selected. There is research showing that the t test is robust enough to withstand violation of the assumptions of normality and equal variances, but the lack of independence in sampling remains, and this is the most critical problem.

Those are the theoretical aspects of the problem of using the t test and ANOVA, but there is another problem of both a theoretical and a practical nature. Whether a t-value is significant or otherwise depends on the p-value; critically, a result is said to be statistically significant at the 0.05 level (or 95% confidence level) for a specific sample size. Thus, indirectly, whether a t-value is statistically significant depends on the total number of students involved in the comparison. For the same mean difference, a large sample tends to yield a large t-value, and a small sample a smaller t-value, which in turn leads to a large (not small) p-value, so that the difference becomes non-significant (say, p > 0.05). This means that by artificially increasing the sample size, we can get a t-value large enough for the p-value to be small enough for the difference to be significant. If the sample size is small, we tend to get non-significant results and conclude that there is no difference when in fact there is a difference that went undetected. This is technically called a Type II error: failure to reject a null hypothesis that is in fact false. In short, the t test cannot be trusted without consideration of the sample size.
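The dependence on sample size can be seen with a little arithmetic. For two equal-sized groups with a common standard deviation, the t-value equals the standardized mean difference multiplied by the square root of half the group size. The sketch below holds the difference fixed at a hypothetical 0.3 standard deviations and varies only the number of students; the function name is made up for this illustration, and the critical values are the usual two-tailed 0.05 table values for the corresponding degrees of freedom (18, 98 and 398).

```python
from math import sqrt


def t_for_equal_groups(std_diff, n_per_group):
    """t-value for two equal-sized groups, given a standardized mean difference."""
    # With equal n and a common SD: t = (mean difference / SD) * sqrt(n / 2).
    return std_diff * sqrt(n_per_group / 2)


STD_DIFF = 0.3  # hypothetical: the two means differ by 0.3 standard deviations

# Two-tailed 0.05 critical values from a standard t table, keyed by n per group.
CRITICAL_T = {10: 2.101, 50: 1.984, 200: 1.966}

results = {}
for n, critical in CRITICAL_T.items():
    t_value = t_for_equal_groups(STD_DIFF, n)
    results[n] = t_value >= critical
    print(f"n per group = {n:3d}: t = {t_value:.2f}, significant at 0.05: {results[n]}")
```

The difference between the groups is the same in all three cases; only the number of students changes the verdict.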

Another problem of the t test (and ANOVA) is a conceptual one. The word significance (and its adjective form, significant) has an everyday meaning of importance (important). Unfortunately, about a century ago, when Ronald Fisher first used the word significance, he used it to signify the rarity of an observed difference; for instance, “the difference is significant” simply means “the difference is unlikely to have happened by chance.” There is nothing about being important or unimportant. The statistical meaning of significance and its everyday meaning got mixed up, and the confusion has been perpetuated over time.

Of all the above issues, the fundamental one is this: what do a t-value and its corresponding p-value tell us? Let us say a teacher found a “significant difference” (p < 0.01) between the experimental and comparison groups in her teaching experiment. What does this really mean? If we ask Abelson (1995, p. 40), he would reply thus:

When the null hypothesis is rejected at, say, the 0.01 level, a correct way to state what has happened is as follows:

If it were true that there were no systematic difference between the means in the populations from which the samples came, then the probability that the observed means would have been different as they were, or more different, is less than one in a hundred. This being strong grounds for doubting the validity of the null hypothesis, the null hypothesis is rejected.

This is a mouthful of an answer to a seemingly simple question, but that is what it is.
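The logic of that statement can be made concrete with a small sketch. Note the assumptions: this is an exact permutation check, not the t test itself, used here only because it spells out the "if there were no systematic difference" reasoning directly; the scores and the function name are hypothetical. The pooled scores are regrouped in every possible way, and the result is the proportion of regroupings in which the means differ by as much as, or more than, they actually did.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Proportion of regroupings whose mean gap is at least the observed gap."""
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))
    as_extreme = 0
    total = 0
    # Every way of choosing which pooled scores form the first group.
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        first = [pooled[i] for i in chosen_set]
        second = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point rounding in the means.
        if abs(mean(first) - mean(second)) >= observed - 1e-9:
            as_extreme += 1
    return as_extreme / total


# Hypothetical end-of-project scores.
experimental = [15, 17, 14, 18, 16]
comparison = [12, 14, 13, 15, 11]

p_value = permutation_p_value(experimental, comparison)
print(f"Proportion of regroupings at least as different: {p_value:.3f}")
```

The number printed is a probability of chance occurrence under the assumption of no systematic difference; it is not a measure of how large the difference is.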

More importantly, the question now is whether that is the answer sought after by the teacher (who conducted the teaching experiment) and the school leader (who supported the project).

Most likely, they would like to answer the question of whether the experiment has produced the expected effect or, operationally, whether the experimental students score higher than the comparison students at the end of the project. These are the right kind of questions to ask, and they are about the magnitude of an observed difference and not about the probability of its chance occurrence. By analogy, when we are involved in a car collision, we will first be concerned about the magnitude of the damage or injury, not the probability of its occurrence.
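One way to express the magnitude of a difference is sketched below. The text above does not name a particular index; the standardized mean difference (often called Cohen's d) is assumed here purely for illustration, and the scores and function name are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(experimental, comparison):
    """Mean difference expressed in pooled standard deviation units."""
    n_e, n_c = len(experimental), len(comparison)
    pooled_var = (
        (n_e - 1) * variance(experimental) + (n_c - 1) * variance(comparison)
    ) / (n_e + n_c - 2)
    return (mean(experimental) - mean(comparison)) / sqrt(pooled_var)


# Hypothetical end-of-project scores.
experimental = [15, 17, 14, 18, 16]
comparison = [12, 14, 13, 15, 11]

raw_difference = mean(experimental) - mean(comparison)
effect = standardized_mean_difference(experimental, comparison)

print(f"Raw mean difference: {raw_difference} marks")
print(f"Standardized mean difference: {effect:.2f} standard deviations")
```

Unlike the t-value, this index does not grow merely because more students are added; it describes how far apart the two groups are.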

Obviously, as gathered from the discussion above, using the t test (and ANOVA) in the school context not only tends to violate the assumptions but also gives a wrong answer to a right question.

References

Abelson, R. P. (1995). Statistics as principled argument. Hillsdale, NJ: Lawrence Erlbaum.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.


Chapter 5

On Correlation: What Is Between Them?

It is proven that the celebration of birthdays is good for health.

Statistics show that those people who celebrate the most birthdays live the longest.

Calculation of Correlation Coefficients

Ensuring Test Fairness Through Item Fairness