16.2 Alternatives to the t-test: Mann–Whitney and
16.2.4 Alternative to the paired t-test: the Wilcoxon
The Wilcoxon test also transforms scores to ranks before calculating the test statistic (see later in this chapter).
Consider the following study: general nurses were given a questionnaire that measured how sympathetic they were to ME sufferers; for each nurse, a total score (out of 10) was calculated.
They then took part in an hour’s discussion group, which included ME sufferers. Later, a similar questionnaire was given to them. This is obviously a within-participants design, as the same participants are being measured in both the ‘before’ and ‘after’ conditions. We will make a directional hypothesis here. A directional hypothesis should be made when there is evidence to support such a direction (e.g. from past research). Our hypothesis is that there will be a significant difference between the scores before and after the discussion, such that scores after the discussion will be higher. Note that this is a one-tailed hypothesis because we have specified the direction of the difference. The nurses’ scores on the questionnaires are shown in Table 16.3.
With small samples, sometimes data are skewed and the mean may not be appropriate – in which case, report the median. You will need to look at histograms to discover whether this is the case.
Although the histogram for the BEFORE condition does not look too skewed (see Figure 16.1), the AFTER condition shows negative skew (see Figure 16.2). However, the means and medians are very similar. Summary statistics are shown in Table 16.4.
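The summary statistics can be checked outside SPSS. The following is a minimal sketch in Python (standard library only), using the scores from Table 16.3; the helper name `summarise` is our own, and the SD is assumed to be the sample standard deviation (n - 1 in the denominator).

```python
from statistics import mean, median, stdev

# Scores from Table 16.3 (one pair of scores per nurse)
before = [5, 6, 2, 4, 6, 7, 3, 5, 5, 5]
after = [7, 6, 3, 8, 7, 6, 7, 8, 5, 8]


def summarise(scores):
    """Return (mean, sample SD, median), rounded to two decimal places."""
    return round(mean(scores), 2), round(stdev(scores), 2), median(scores)


print("BEFORE:", summarise(before))
print("AFTER: ", summarise(after))
```

In each condition the mean and the median are close to each other, which is the point made in the text.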
SPSS exercise
Exercise 2
A psychology lecturer is carrying out a small pilot study to discover whether students prefer learning her advanced statistics module by means of traditional lectures or a problem-based learning (PBL) approach. There are only 12 people in the group. Six are allocated to the ‘traditional’ group and six to the PBL group. She thus delivers the module twice on different days (she is very keen!). As she wants to know what the students feel about their learning, as well as taking performance measures she asks them to rate their enjoyment of the course (1–7, where 1 is not at all enjoyable and 7 is extremely enjoyable), along with various other measures. She does not make a prediction as to which approach will be the more enjoyable for students. Here are the data:
PBL Traditional
5 4
7 6
4 4
5 4
7 1
6 2
Enter the data into SPSS and perform a Mann–Whitney test. Give a written explanation of the meaning of the results.
Statistics without maths for psychology 528
Before discussion After discussion
5.00 7.00
6.00 6.00
2.00 3.00
4.00 8.00
6.00 7.00
7.00 6.00
3.00 7.00
5.00 8.00
5.00 5.00
5.00 8.00
Table 16.3 Nurses’ sympathy scores before and after discussion
Figure 16.1 Histogram showing frequency distribution for scores in the BEFORE condition (Mean = 4.8, SD = 1.48, N = 10)
Figure 16.2 Histogram showing frequency distribution for scores in the AFTER condition (Mean = 6.5, SD = 1.58, N = 10)
CHAPTER 16 Non-parametric statistics 529
Box and whisker plots, again, can give you a good feel for your data (see Figure 16.3). In this case, the box and whisker plot confirms what we have seen from the histograms: the median of the AFTER condition is higher than the BEFORE condition and the BEFORE condition has a larger spread.
Inspection of the descriptive statistics told us that the data in the AFTER condition were negatively skewed. Given that this is the case and we have only ordinal data and a relatively small sample, we would be wise to use a non-parametric test. The most appropriate for this study would thus be the Wilcoxon test. There are essentially two stages in calculating the Wilcoxon test: (a) differences between each set of scores are obtained, and (b) the differences are ranked from lowest to highest. We could not do this in the case of the independent design because it would not make sense to find differences using different participants. Here we have the same participants in both conditions, and so finding the differences between the conditions before ranking gives us a more sensitive test. If we take the second score from the first score for each participant, some will have a negative difference (their rating was higher in the AFTER condition) and some will have a positive difference (their rating was lower in the AFTER condition). Some will be zero, so we ignore these because they do not give us any information.
If there were no significant differences between the two conditions, there would be a similar number of pluses and minuses, as the differences between one participant and another (some positive, some negative) would tend to cancel each other out.
You can see from the table below that there are far more negative signs than positive. The test also takes into account the strength of the differences – by ranking the differences. Once we have found the difference between scores, we rank the scores in the same way as before, ignoring the signs. (Of course, we keep talking about ‘we’ doing this, and ‘we’ doing that, but
Condition Mean SD Median
BEFORE 4.8 1.48 5
AFTER 6.5 1.58 7
Table 16.4 Summary statistics
Figure 16.3 Box and whisker plot for BEFORE and AFTER conditions (N = 10 in each condition)
in reality it will be the statistical package that is doing it – we are only going into this detail in order to help you conceptualise what is happening when the test is performed.) We ignore ties where the difference between the two scores is zero – these are not ranked at all. The smallest absolute difference is 1, and there are three of them, so each is given the mean of ranks 1, 2 and 3, which is 2 ((1 + 2 + 3) ÷ 3 = 2).
diff   rank
-2     4
0      –
-1     2
-4     7.5
-1     2
+1     2     (least occurring sign (+): this is the only positive difference in this dataset; the rank of the positive sign is 2)
-4     7.5
-3     5.5
0      –
-3     5.5
The sum of the ranks of the least occurring sign (in the above case, the pluses) gives us our test statistic, which we call t. In this case, a positive rank has occurred once only. There are seven negative ranks. Therefore the ‘least occurring sign’ is positive. We then add up the ranks of the positive differences. (If there had been three participants who had scored lower in the AFTER condition, there would have been three positive scores; then we would have added up the ranks of the three pluses.) There is only one plus sign, and this has the rank of 2. Therefore, t=2. What we want now is for our computer package to confirm our hand calculations and to give us the likelihood of t=2 having occurred by sampling error.
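The hand calculation can be sketched in a few lines of Python (standard library only). This is an illustration of the steps described above, not what SPSS runs internally; the function name `signed_rank_sums` is our own. It takes the smaller of the two rank sums as t, which for these data is the same as the sum for the least occurring sign.

```python
# Scores from Table 16.3
before = [5, 6, 2, 4, 6, 7, 3, 5, 5, 5]
after = [7, 6, 3, 8, 7, 6, 7, 8, 5, 8]


def signed_rank_sums(first, second):
    """Difference the pairs, drop zero differences, rank the absolute
    differences (ties share a mean rank) and sum the ranks by sign."""
    diffs = [a - b for a, b in zip(first, second) if a != b]
    ordered = sorted(abs(d) for d in diffs)
    rank_of = {}
    for value in set(ordered):
        # Positions (1-based) that this absolute difference occupies
        positions = [i + 1 for i, v in enumerate(ordered) if v == value]
        rank_of[value] = sum(positions) / len(positions)
    ranks = [rank_of[abs(d)] for d in diffs]
    positive = sum(r for d, r in zip(diffs, ranks) if d > 0)
    negative = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return diffs, ranks, positive, negative


# BEFORE minus AFTER, as in the hand calculation
diffs, ranks, positive_sum, negative_sum = signed_rank_sums(before, after)
t = min(positive_sum, negative_sum)

print("differences:", diffs)
print("ranks:      ", ranks)
print("sum of positive ranks:", positive_sum)
print("sum of negative ranks:", negative_sum)
print("t =", t)
```

The two rank sums (2 and 34) are the same values that appear in the Sum of Ranks column of the SPSS output, with the signs swapped because SPSS subtracts BEFORE from AFTER.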
SPSS: two-sample test for repeated measures — Wilcoxon
Choose Analyze, Nonparametric Tests, Legacy Dialogs and 2 Related Samples:
This gives the Two-Related-Samples Test dialogue box:
Move the two variables of interest from the left-hand side to the Test Pair(s) List on the right-hand side.
Make sure the Wilcoxon option is checked.
If you want descriptive statistics, you will need to press the Options button. Then press OK. This will give you the following SPSS output:
Ranks
After Discussion – Before Discussion
                 N     Mean Rank   Sum of Ranks
Negative Ranks   1ᵃ    2.00        2.00
Positive Ranks   7ᵇ    4.86        34.00
Ties             2ᶜ
Total            10
a. After Discussion < Before Discussion
b. After Discussion > Before Discussion
c. After Discussion = Before Discussion
Notes: There were two ties (we ranked these as zero). 4.86 is the mean rank of the positive cases.
The next part of our output gives the test statistics:
The exact two-tailed ASL is .031 (the asymptotic two-tailed value reported in the Test Statistics table is .024). These are two-tailed probability levels, however, and since we made a definite directional prediction, we use a one-tailed probability. To obtain this, the two-tailed probability level is divided by 2. Also:
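1. The mean rank of the positive and negative ranks can be obtained. In our hand calculation (BEFORE minus AFTER), the mean positive rank = 2, which is also the smallest rank total (t), and the mean negative rank = 4.86. SPSS subtracts the other way round (After Discussion – Before Discussion), so in the output the signs are reversed: the mean negative rank is 2.00 and the mean positive rank is 4.86. Verify this for yourself from the information above.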
2. The t-score is converted into a standardised score (z-score) by SPSS. This enables you to visualise how large the t-score is, relative to the mean of the distribution. In our case (see output above) the t-score is over two standard deviations away from the mean of the sampling distribution (which is always 0). The ASL is given, in this case 0.016 (Exact Sig.) one-tailed.
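The conversion from t to z, and the exact probability, can be sketched as follows (Python, standard library only). The z formula used here is the usual normal approximation for the signed-rank statistic with a correction for tied ranks; that SPSS uses exactly this formula is an assumption on our part, although it does reproduce the z-value in the output. The exact probability is found by listing every possible allocation of plus and minus signs to the eight ranks.

```python
from collections import Counter
from itertools import product
from math import erfc, sqrt

# Ranks of the eight non-zero differences, from the hand calculation
ranks = [4, 2, 7.5, 2, 2, 7.5, 5.5, 5.5]
t = 2  # sum of the ranks for the least occurring sign
n = len(ranks)

# Normal approximation: mean and SD of t when there is no real difference
mean_t = n * (n + 1) / 4
tie_term = sum(c ** 3 - c for c in Counter(ranks).values()) / 48
sd_t = sqrt(n * (n + 1) * (2 * n + 1) / 24 - tie_term)
z = (t - mean_t) / sd_t

# Two-tailed probability from the standard normal distribution
asymp_two_tailed = erfc(abs(z) / sqrt(2))
asymp_one_tailed = asymp_two_tailed / 2

# Exact probability: each rank is equally likely to carry a plus or a minus
as_extreme = sum(
    1
    for signs in product((0, 1), repeat=n)
    if sum(r for s, r in zip(signs, ranks) if s) <= t
)
exact_one_tailed = as_extreme / 2 ** n
exact_two_tailed = 2 * exact_one_tailed

print("z =", round(z, 3))
print("asymptotic two-tailed p =", round(asymp_two_tailed, 3))
print("exact one-tailed p =", round(exact_one_tailed, 3))
print("exact two-tailed p =", round(exact_two_tailed, 3))
```

Only 4 of the 256 possible sign allocations give a rank sum of 2 or less, which is where the exact one-tailed value of .016 comes from.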
The textual part of your results might say:
Since the sample size was small, the appropriate measure of central tendency was the median, and the appropriate statistical test was the Wilcoxon test. From Table X [refer your readers to the table where you give your descriptive statistics] it can be seen that the median of the AFTER condition (7) is higher than that of the BEFORE condition (5). The Wilcoxon test (t = 2) was converted into a z-score of -2.26 with an associated one-tailed probability of 0.01. It can therefore be concluded that the attitude towards ME by nurses is more sympathetic after the nurses have participated in a discussion group.
Test Statisticsᵃ
After Discussion – Before Discussion
Z                        -2.257ᵇ
Asymp. Sig. (2-tailed)   .024
a. Wilcoxon Signed Ranks Test
b. Based on negative ranks.
Notes: The t-score (which you can calculate by hand – see above) is converted into a z-score by SPSS. The Asymp. Sig. value is the two-tailed probability level.
Activity 16.3
Look at the following output, where participants performed in both of the two conditions. The researcher hypothesised that the groups would differ, but did not make a specific prediction of direction of the difference:
Wilcoxon Ranks
COND2 – COND1
                 N     Mean Rank   Sum of Ranks
Negative Ranks   1ᵃ    2.00        2.00
Positive Ranks   6ᵇ    4.33        26.00
Ties             0ᶜ
Total            7
a. COND2 < COND1
b. COND2 > COND1
c. COND1 = COND2
Example from the literature
Help-seeking attitudes and masculine norms in monozygotic male twins
Sánchez, Bocklandt and Vilain (2013) studied MZ male twins who were discordant for sexual orientation.
MZ twins develop from the same egg and share the same genetic code. The authors say that, in general, heterosexual men are less favourable towards asking for help than women and gay men, but we do not know the extent to which such attitudes are due to nature or nurture. One way to study this is to look at twins. The authors recruited 38 pairs of MZ male twins; each pair had one straight twin and one gay twin. When comparing two independent groups, the Mann–Whitney is appropriate. However, MZ twins are not independent, and so it is usual to treat each twin pair as a matched pair and use a within-participants design for MZ twins.
The authors say:
Given the small sample size and that univariate distributions significantly deviated from normality, we employed nonparametric statistical tests . . . Our first aim was to compare the scores within each twin pair using the Wilcoxon signed-rank test. For each test pair, we entered the gay twin’s score first and the heterosexual co-twin’s score second.
The authors gave questionnaires measuring psychological distress, masculine norms (all of these had subscales) and attitude towards help-seeking.
The following (partial) table is taken from their table (p. 54):
Paired differences for SCL-90-R dimensions
(z, p and r are from the Wilcoxon signed-rank test)
Measure                     Paired difference (median)   z       p      r
Somatization                0.00                         -1.13   .259   .13
Obsessive-compulsive        -0.20                        -1.66   .096   .19
Interpersonal sensitivity   -0.11                        -1.49   .138   .17
Depression                  -0.04                        -0.65   .516   .07
Anxiety                     -0.05                        -1.13   .257   .13
Hostility                   -0.16                        -2.60   .009   .30
Phobic anxiety              0.00                         -0.46   .643   .05
Test Statisticsᵇ
COND2 – COND1
Z                       -2.028ᵃ
Exact Sig. (2-tailed)   .043
a. Based on negative ranks.
b. Wilcoxon Signed Ranks Test
What can you conclude from the analysis?
Measure                     Paired difference (median)   z       p      r
Paranoid ideation           -0.17                        -2.25   .025   .26
Psychoticism                -0.10                        -2.06   .040   .24
Note. Each heterosexual twin’s score was subtracted from his gay co-twin’s score. The positive r values (effect size estimate) mean that the heterosexual twins tended to score higher on the measure.
Mdn (median) is the middle value of the 38 paired differences for each measure.
For the median difference: the negative value means that the heterosexual twin scored higher.
In relation to this (partial) table of results, the authors say:
When comparing symptoms of psychological distress, the heterosexual twins scored significantly higher on three dimensions – hostility (Mdn=0.34), paranoid ideation (Mdn=0.42), and psychoticism (Mdn=0.26) – than their gay co-twin (Mdn=0.26, 0.34 and .11, respectively). As a group, the gay twins did not score significantly higher than their heterosexual co-twins on any of the SCL–90–R . . . indices.
Note that they give the median values for each group of twins in the text; in the table, the values are the medians of the 38 paired differences, which are not the same as the differences between the two group medians.
Look at the statistically significant results in the table. Note that the r values are effect sizes.
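The r values in the table appear to be consistent with the common effect size formula r = |z| ÷ √N, where N is the total number of observations (here assumed to be 2 × 38 = 76, i.e. both twins in each pair). The excerpt does not state how the authors computed r, so treat the formula and the value of N in this Python sketch as assumptions; the function name `effect_size_r` is our own.

```python
from math import sqrt

n_pairs = 38
n_observations = 2 * n_pairs  # assumption: N counts both twins in each pair

# (z, reported r) for each measure, taken from the table above
reported = {
    "Somatization": (-1.13, 0.13),
    "Obsessive-compulsive": (-1.66, 0.19),
    "Interpersonal sensitivity": (-1.49, 0.17),
    "Depression": (-0.65, 0.07),
    "Anxiety": (-1.13, 0.13),
    "Hostility": (-2.60, 0.30),
    "Phobic anxiety": (-0.46, 0.05),
    "Paranoid ideation": (-2.25, 0.26),
    "Psychoticism": (-2.06, 0.24),
}


def effect_size_r(z, n):
    """Effect size r obtained from a z-score and the number of observations."""
    return abs(z) / sqrt(n)


for measure, (z, r_reported) in reported.items():
    r = round(effect_size_r(z, n_observations), 2)
    print(f"{measure}: computed r = {r:.2f}, reported r = {r_reported:.2f}")
```

Under these assumptions the computed values agree with the reported ones to two decimal places, so the three statistically significant measures (hostility, paranoid ideation and psychoticism) have r values of roughly .24 to .30.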
Activity 16.4
Look at the following (partial) table of results relating to the study above.
Paired differences for Gender Role Conflict Scale (masculine norms)
(z, p and r are from the Wilcoxon signed-rank test)
Measure                                                  Paired difference (median)   z       p       r
Gender Role Conflict Scale Total                         -9.00                        -2.29   0.022   0.26
Success, power and competition subscale                  1.00                         -0.01   0.922   -0.10
Restrictive emotionality subscale                        -2.00                        -0.74   0.460   0.08
Restrictive affectionate behaviour between men subscale  -0.700                       -4.51   0.000   0.52
Note. Each heterosexual twin’s score was subtracted from his gay co-twin’s score. The positive r values (effect size estimate) mean that the heterosexual twins tended to score higher on the measure. Median is the middle value of the 38 paired differences for each measure.
For the median difference: the negative value means that the heterosexual twin scored higher.
The following paragraph is taken from the article (p. 54). Using the information from the table above, choose the most appropriate word (emboldened text):
The final set of comparisons was on the scores assessing emphasis of traditional masculine roles.
Overall, the heterosexual twins reported greater/lesser emphasis with masculine roles than their gay co-twins. More specifically, heterosexual twins were more comfortable/uncomfortable being emotionally affectionate with other men than their gay co-twins.
SPSS exercise
Exercise 3
Six students who had a phobia about mice rated their fear (1 – no fear; 5 – extreme fear) both before and after a behavioural programme designed to overcome that fear. The hypothesis was one-tailed: fear would be reduced after the programme.
Participant Before After
1 5 3
2 4 4
3 4 2
4 3 1
5 5 3
6 5 4
Enter the data into SPSS and perform a Wilcoxon test. Give a written explanation of the meaning of the results to your friend.