CHAPTER 4

Conditional Inference: Guessing Lengths, Suicides, Gastrointestinal Damage, and Newborn Infants

4.1 Introduction

There are many experimental designs or studies where the subjects are not a random sample from some well-defined population. For example, subjects recruited for a clinical trial are hardly ever a random sample from the set of all people suffering from a certain disease but are a selection of patients showing up for examination in a hospital participating in the trial. Usually, the subjects are randomly assigned to certain groups, for example a control and a treatment group, and the analysis needs to take this randomisation into account. In this chapter, we discuss such test procedures, usually known as (re)-randomisation or permutation tests.

In the room width estimation experiment reported in Chapter 3, 40 of the estimated widths (in feet) of 69 students and 26 of the estimated widths (in metres) of 44 students are tied. This violates one assumption of the unconditional test procedures applied in Chapter 3, namely that the measurements are drawn from a continuous distribution. In this chapter, the data will be reanalysed using conditional test procedures, i.e., statistical tests where the distribution of the test statistics under the null hypothesis is determined conditionally on the data at hand. A number of other data sets will also be considered in this chapter and these will now be described.

Mann (1981) reports a study carried out to investigate the causes of jeering or baiting behaviour by a crowd when a person is threatening to commit suicide by jumping from a high building. A hypothesis is that baiting is more likely to occur in warm weather. Mann (1981) classified 21 accounts of threatened suicide by two factors, the time of year and whether or not baiting occurred. The data are given in Table 4.1 and the question is whether they give any evidence to support the hypothesis.
The data come from the northern hemisphere, so June–September are the warm months.

© 2010 by Taylor and Francis Group, LLC

Table 4.1: suicides data. Crowd behaviour at threatened suicides.
Time of year: June–September, October–May. Crowd behaviour: Baiting, Nonbaiting.
Source: From Mann, L., J. Pers. Soc. Psy., 41, 703–709, 1981. With permission.

The administration of non-steroidal anti-inflammatory drugs for patients suffering from arthritis induces gastrointestinal damage. Lanza (1987) and Lanza et al. (1988a,b, 1989) report the results of placebo-controlled randomised clinical trials investigating the prevention of gastrointestinal damage by the application of Misoprostol. The degree of the damage is determined by endoscopic examinations and the response variable is defined as the classification described in Table 4.2. Further details of the studies as well as the data can be found in Whitehead and Jones (1994). The data of the four studies are given in Tables 4.3, 4.4, 4.5 and 4.6.

Table 4.2: Classification system for the response variable.
Classification   Endoscopy Examination
1                No visible lesions
2                One haemorrhage or erosion
3                2-10 haemorrhages or erosions
4                11-25 haemorrhages or erosions
5                More than 25 haemorrhages or erosions or an invasive ulcer of any size
Source: From Whitehead, A. and Jones, N. M. B., Stat. Med., 13, 2503–2515, 1994. With permission.

Table 4.3: Lanza data. Misoprostol randomised clinical trial from Lanza (1987).
treatment Misoprostol Placebo classification 21 2 13

Table 4.4: Lanza data. Misoprostol randomised clinical trial from Lanza et al. (1988a).
treatment Misoprostol Placebo classification 20 0

Table 4.5: Lanza data. Misoprostol randomised clinical trial from Lanza et al. (1988b).
treatment Misoprostol Placebo classification 20 2 5 17

Table 4.6: Lanza data. Misoprostol randomised clinical trial from Lanza et al. (1989).
treatment Misoprostol Placebo 1 classification 5 0 0

Newborn infants exposed to antiepileptic drugs in utero have a higher risk of major and minor abnormalities of the face and digits. The inter-rater agreement in the assessment of babies with respect to the number of minor physical features was investigated by Carlin et al. (2000). In their paper, the agreement on total number of face anomalies for 395 newborn infants examined by a paediatrician and a research assistant is reported (see Table 4.7). One is interested in investigating whether the paediatrician and the research assistant agree above a chance level.

Table 4.7: anomalies data. Abnormalities of the face and digits of newborn infants exposed to antiepileptic drugs as assessed by a paediatrician (MD) and a research assistant (RA).
MD 235 23 RA 41 35 20 11 11 3
Source: From Carlin, J. B., et al., Teratology, 62, 406-412, 2000. With permission.

4.2 Conditional Test Procedures

The statistical test procedures applied in Chapter 3 are all defined for samples randomly drawn from a well-defined population. In many experiments, however, this model is far from being realistic. For example, in clinical trials it is often impossible to draw a random sample from all patients suffering from a certain disease. Commonly, volunteers and patients are recruited from hospital staff, relatives or people showing up for some examination. The test procedures applied in this chapter make no assumptions about random sampling or a specific model. Instead, the null distribution of the test statistics is computed conditionally on all random permutations of the data. Therefore, the procedures shown in the sequel are known as permutation tests or (re)-randomisation tests. For a general introduction we refer to the textbooks of Edgington (1987) and Pesarin (2001).

4.2.1 Testing Independence of Two Variables

Based on $n$ pairs of measurements $(x_i, y_i)$ recorded for $n$ observational units, we want to test the null hypothesis of the independence of $x$ and $y$. We may distinguish three situations: both variables $x$ and $y$ are continuous, one is continuous and the other one is a factor, or both $x$ and $y$ are factors. The special case of paired observations is treated in Section 4.2.2.

One class of test procedures for the above three situations are randomisation and permutation tests whose basic principles have been described by Fisher (1935) and Pitman (1937) and are best illustrated for the case of continuous measurements $y$ in two groups, i.e., the $x$ variable is a factor that can take values $x = 1$ or $x = 2$. The difference of the means of the $y$ values in both groups is an appropriate statistic for the assessment of the association of $y$ and $x$:

$$T = \frac{\sum_{i=1}^{n} I(x_i = 1)\, y_i}{\sum_{i=1}^{n} I(x_i = 1)} - \frac{\sum_{i=1}^{n} I(x_i = 2)\, y_i}{\sum_{i=1}^{n} I(x_i = 2)}.$$

Here $I(x_i = 1)$ is the indicator function, which is equal to one if the condition $x_i = 1$ is true and zero otherwise. Clearly, under the null hypothesis of independence of $x$ and $y$ we expect the distribution of $T$ to be centred about zero.

Suppose that the group labels $x = 1$ or $x = 2$ have been assigned to the observational units by randomisation. When the result of the randomisation procedure is independent of the $y$ measurements, we are allowed to fix the $x$ values and shuffle the $y$ values randomly over and over again. Thus, we can compute, or at least approximate, the distribution of the test statistic $T$ under the conditions of the null hypothesis directly from the data $(x_i, y_i)$, $i = 1, \dots, n$, by the so-called randomisation principle. The test statistic $T$ is computed for a reasonable number of shuffled $y$ values and we can determine how many of the
shuffled differences are at least as large as the test statistic $T$ obtained from the original data. If this proportion is small, smaller than $\alpha = 0.05$ say, we have good evidence that the assumption of independence of $x$ and $y$ is not realistic and we can therefore reject the null hypothesis. The proportion of shuffled differences at least as large as the observed one is usually referred to as the p-value.

A special approach is based on ranks assigned to the continuous $y$ values. When we replace the raw measurements $y_i$ by their corresponding ranks in the computation of $T$ and compare this test statistic with its null distribution, we end up with the Wilcoxon Mann-Whitney rank sum test. Without ties in the $y$ values, the ranks are simply the integers $1, 2, \dots, n$, and the unconditional view (as introduced in Chapter 3) and the conditional view on the Wilcoxon Mann-Whitney test coincide.

In the case that both variables are nominal, the test statistic can be computed from the corresponding contingency table in which the observations $(x_i, y_i)$ are cross-classified. A general $r \times c$ contingency table may be written in the form of Table 3.6, where each cell $(j, k)$ is the number $n_{jk} = \sum_{i=1}^{n} I(x_i = j)\, I(y_i = k)$; see Chapter 3 for more details. Under the null hypothesis of independence of $x$ and $y$, estimated expected values $E_{jk}$ for cell $(j, k)$ can be computed from the corresponding margin totals, $E_{jk} = n_{j\cdot}\, n_{\cdot k} / n$, which are fixed for each randomisation of the data. The test statistic for assessing independence is

$$X^2 = \sum_{j=1}^{r} \sum_{k=1}^{c} \frac{(n_{jk} - E_{jk})^2}{E_{jk}}.$$

The exact distribution based on all permutations of the $y$ values for a similar test statistic can be computed by means of Fisher's exact test (Freeman and Halton, 1951). This test procedure is
based on the hyper-geometric probability of the observed contingency table. All possible tables can be ordered with respect to this metric and p-values are computed from the fraction of tables more extreme than the observed one.

When both the $x$ and the $y$ measurements are numeric, the test statistic can be formulated as the sum of the products $x_i y_i$, $i = 1, \dots, n$. Again, we can fix the $x$ values and shuffle the $y$ values in order to approximate the distribution of the test statistic under the null hypothesis of independence of $x$ and $y$.

4.2.2 Testing Marginal Homogeneity

In contrast to the independence problem treated above, the data analyst is often confronted with situations where two (or more) measurements of one variable taken from the same observational unit are to be compared. In this case one assumes that the measurements are independent between observations and the test statistics are aggregated over all observations. Where two nominal variables are taken for each observation (for example, see the case of McNemar's test for binary variables as discussed in Chapter 3), the measurement of each observation can be summarised by a $k \times k$ matrix with cell $(i, j)$ being equal to one if the first measurement is the $i$th level and the second measurement is the $j$th level. All other entries are zero. Under the null hypothesis of independence of the first and second measurement, all $k \times k$ matrices with exactly one non-zero element are equally likely. The test statistic is now based on the elementwise sum of all $n$ matrices.

4.3 Analysis Using R

4.3.1 Estimating the Width of a Room Revised

The unconditional analysis of the room width estimated by two groups of students in Chapter 3 led to the conclusion that the estimates in metres are slightly larger than the estimates in feet. Here, we reanalyse these data in a conditional framework. First, we convert metres into feet and store the vector of observations in a variable y:

R> data("roomwidth", package = "HSAUR2")
R> convert <- ifelse(roomwidth$unit == "feet", 1, 3.28)
R> feet <- roomwidth$unit == "feet"
R> metre <- !feet
R> y <- roomwidth$width * convert
R> T <- mean(y[feet]) - mean(y[metre])
R> T
[1] -8.858893

In order to approximate the conditional distribution of the test statistic T we compute 9999 test statistics for shuffled y values. A permutation of the y vector can be obtained from the sample function.

R> meandiffs <- double(9999)
R> for (i in 1:length(meandiffs)) {
+     sy <- sample(y)
+     meandiffs[i] <- mean(sy[feet]) - mean(sy[metre])
+ }

R> hist(meandiffs)
R> abline(v = T, lty = 2)
R> abline(v = -T, lty = 2)

Figure 4.1 An approximation for the conditional distribution of the difference of mean roomwidth estimates in the feet and metres group under the null hypothesis. The vertical lines show the negative and positive absolute value of the test statistic T obtained from the original data. (Histogram of meandiffs; horizontal axis: meandiffs, vertical axis: Frequency.)

4.3.2 Crowds and Threatened Suicide

R> data("suicides", package = "HSAUR2")
R> fisher.test(suicides)

        Fisher's Exact Test for Count Data

data:  suicides
p-value = 0.0805
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  0.7306872 91.0288231
sample estimates:
odds ratio
  6.302622

Figure 4.4 R output of Fisher's exact test for the suicides data.

The resulting p-value obtained from the hypergeometric distribution is 0.08 (the asymptotic p-value associated with the $X^2$ statistic for this table is 0.115). There is no strong evidence of crowd behaviour being associated with time of year of threatened suicide, but the sample size is low and the test lacks power. Fisher's exact test can also be applied to tables larger than 2 × 2, especially when there is concern that the cell frequencies are low (see Exercise 4.1).

4.3.3 Gastrointestinal Damage

Here we are interested in the comparison of two groups of patients, where one group received a placebo and the other one
Misoprostol. In the trials shown here, the response variable is measured on an ordered scale (see Table 4.2). Data from four clinical studies are available and thus the observations are naturally grouped together. From the data.frame Lanza we can construct a three-way table as follows:

R> data("Lanza", package = "HSAUR2")
R> xtabs(~ treatment + classification + study, data = Lanza)
, , study = I
classification treatment Misoprostol 21 Placebo 2 2 13
, , study = II
classification treatment Misoprostol 20 0 0 Placebo
, , study = III
classification treatment Misoprostol 20 Placebo 5 5 17
, , study = IV
treatment Misoprostol Placebo classification 5 0 0 0

We will first analyse each study separately and then show how one can investigate the effect of Misoprostol for all four studies simultaneously. Because the response is ordered, we take this information into account by assigning a score to each level of the response. Since the classifications are defined by the number of haemorrhages or erosions, the midpoint of the interval for each level is a reasonable choice, i.e., 0, 1, 6, 17 and 30; compare those scores to the definitions given in Table 4.2. The corresponding linear-by-linear association tests extending the general Cochran-Mantel-Haenszel statistics (see Agresti, 2002, for further details) are implemented in package coin.

For the first study, the null hypothesis of independence of treatment and gastrointestinal damage, i.e., of no treatment effect of Misoprostol, is tested by

R> library("coin")
R> cmh_test(classification ~ treatment, data = Lanza,
+      scores = list(classification = c(0, 1, 6, 17, 30)),
+      subset = Lanza$study == "I")

        Asymptotic Linear-by-Linear Association Test

data:  classification (ordered) by
         treatment (Misoprostol, Placebo)
chi-squared = 28.8478, df = 1, p-value = 7.83e-08

and, by default, the conditional
distribution is approximated by the corresponding limiting distribution. The p-value indicates a strong treatment effect. For the second study, the asymptotic p-value is a little bit larger:

R> cmh_test(classification ~ treatment, data = Lanza,
+      scores = list(classification = c(0, 1, 6, 17, 30)),
+      subset = Lanza$study == "II")

        Asymptotic Linear-by-Linear Association Test

data:  classification (ordered) by
         treatment (Misoprostol, Placebo)
chi-squared = 12.0641, df = 1, p-value = 0.000514

and we make sure that the implied decision is correct by calculating a confidence interval for the exact p-value:

R> p <- cmh_test(classification ~ treatment, data = Lanza,
+      scores = list(classification = c(0, 1, 6, 17, 30)),
+      subset = Lanza$study == "II",
+      distribution = approximate(B = 19999))
R> pvalue(p)
[1] 5.00025e-05
99 percent confidence interval:
 2.506396e-07 3.714653e-04

The third and fourth study indicate a strong treatment effect as well:

R> cmh_test(classification ~ treatment, data = Lanza,
+      scores = list(classification = c(0, 1, 6, 17, 30)),
+      subset = Lanza$study == "III")

        Asymptotic Linear-by-Linear Association Test

data:  classification (ordered) by
         treatment (Misoprostol, Placebo)
chi-squared = 28.1587, df = 1, p-value = 1.118e-07

R> cmh_test(classification ~ treatment, data = Lanza,
+      scores = list(classification = c(0, 1, 6, 17, 30)),
+      subset = Lanza$study == "IV")

        Asymptotic Linear-by-Linear Association Test

data:  classification (ordered) by
         treatment (Misoprostol, Placebo)
chi-squared = 15.7414, df = 1, p-value = 7.262e-05

In the end, a separate analysis for each study is unsatisfactory. Because the design of the four studies is the same, we can use study as a block variable and perform a global linear-by-linear association test investigating the treatment effect of Misoprostol in all four studies. The block variable can be incorporated into the formula by the | symbol.

R> cmh_test(classification ~ treatment | study, data = Lanza,
+      scores = list(classification = c(0, 1, 6, 17, 30)))

        Asymptotic Linear-by-Linear Association Test

data:  classification (ordered) by
         treatment (Misoprostol, Placebo)
         stratified by study
chi-squared = 83.6188, df = 1, p-value < 2.2e-16

Based on this result, a strong treatment effect can be established.

4.3.4 Teratogenesis

In this example, the medical doctor (MD) and the research assistant (RA) assessed the number of anomalies (0, 1, 2 or 3) for each of 395 babies:

R> anomalies
RA MD 0 235 23 3 41 35 20 11 11 3

We are interested in testing whether the number of anomalies assessed by the medical doctor differs structurally from the number reported by the research assistant. Because we compare paired observations, i.e., one pair of measurements for each newborn, a test of marginal homogeneity (a generalisation of McNemar's test, Chapter 3) needs to be applied:

R> mh_test(anomalies)

        Asymptotic Marginal-Homogeneity Test

data:  response by
         groups (MD, RA)
         stratified by block
chi-squared = 21.2266, df = 3, p-value = 9.446e-05

The p-value indicates a deviation from the null hypothesis. However, the levels of the response are not treated as ordered. Similar to the analysis of the gastrointestinal damage data above, we can take this information into account by the definition of an appropriate score. Here, the number of anomalies is a natural choice:

R> mh_test(anomalies, scores = list(c(0, 1, 2, 3)))

        Asymptotic Marginal-Homogeneity Test for Ordered Data

data:  response (ordered) by
         groups (MD, RA)
         stratified by block
chi-squared = 21.0199, df = 1, p-value = 4.545e-06

In our case, both versions lead to the same conclusion: the assessment of the number of anomalies differs between the medical doctor and the research assistant.

4.4 Summary

The analysis of randomised experiments, for example the analysis of randomised clinical trials such as the Misoprostol trial presented in this
chapter, requires the application of conditional inference procedures. In such experiments, the observations might not have been sampled from well-defined populations but are assigned to treatment groups, say, by a random procedure which is reiterated when randomisation tests are applied.

Exercises

Ex. 4.1 Although in the past Fisher's test has been largely applied to sparse 2 × 2 tables, it can also be applied to larger tables, especially when there is concern about small values in some cells. Using the data displayed in Table 4.8 (taken from Mehta and Patel, 2003), which gives the distribution of the oral lesion site found in house-to-house surveys in three geographic regions of rural India, find the p-value from Fisher's test and the corresponding p-value from applying the usual chi-square test to the data. What are your conclusions?

Table 4.8: orallesions data. Oral lesions found in house-to-house surveys in three geographic regions of rural India.
site of lesion: Buccal mucosa, Commissure, Gingiva, Hard palate, Soft palate, Tongue, Floor of mouth, Alveolar ridge
region: Kerala 0 0 1 Gujarat Andhra 1 1 0 1
Source: From Mehta, C. and Patel, N., StatXact-6: Statistical Software for Exact Nonparametric Inference, Cytel Software Corporation, Cambridge, MA, 2003. With permission.

Ex. 4.2 Use the mosaic and assoc functions from the vcd package (Meyer et al., 2009) to create a graphical representation of the deviations from independence in the 2 × 2 contingency table shown in Table 4.1.

Ex. 4.3 Generate two groups with measurements following a normal distribution having different means. For multiple replications of this experiment (1000, say), compare the p-values of the Wilcoxon Mann-Whitney rank sum test and a permutation test (using independence_test). Where do the differences come from?
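As a possible starting point for Ex. 4.3, the following sketch runs a single replication in base R. It is only an illustration under stated assumptions: the group sizes, the means, the seed and the names n1, n2, g, T_obs, T_perm, p_perm and p_wilcox are hypothetical choices, and the shuffling loop built on sample stands in for independence_test, which the exercise asks for. It mirrors the randomisation principle of Section 4.2.1 by fixing the group labels and shuffling the measurements.

```r
# One replication: unconditional Wilcoxon p-value versus a Monte Carlo
# permutation p-value for the difference of means.
# All sizes, means and the seed are illustrative assumptions.
set.seed(290875)

n1 <- 20
n2 <- 20
y <- c(rnorm(n1, mean = 0), rnorm(n2, mean = 1))
g <- factor(rep(c("A", "B"), times = c(n1, n2)))

# Observed difference of the group means (the statistic T of Section 4.2.1)
T_obs <- mean(y[g == "A"]) - mean(y[g == "B"])

# Approximate the conditional null distribution: keep the labels g fixed
# and shuffle the measurements y
B <- 9999
T_perm <- replicate(B, {
    sy <- sample(y)
    mean(sy[g == "A"]) - mean(sy[g == "B"])
})

# Two-sided permutation p-value: proportion of shuffled differences
# at least as large in absolute value as the observed one
p_perm <- mean(abs(T_perm) >= abs(T_obs))

# Rank-based test on the same data
p_wilcox <- wilcox.test(y ~ g)$p.value

c(permutation = p_perm, wilcoxon = p_wilcox)
```

Wrapping the body in a function and calling it 1000 times would give the replications the exercise asks for. Note that the two tests use different statistics (raw values versus ranks), which is worth keeping in mind when interpreting any differences.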