3.7 The Poisson distribution

… situation in which particles are randomly distributed in space. If the space is one-dimensional (for instance, the length of a cotton thread along which flaws may occur with constant probability at all points), the analogy is immediate. With two-dimensional space (for instance, a microscope slide over which bacteria are distributed at random with perfect mixing technique) the total area of size A may be divided into a large number n of subdivisions each of area A/n; the argument then carries through with A replacing T. Similarly, with three-dimensional space (bacteria well mixed in a fluid suspension), the total volume V is divided into n small volumes of size V/n. In all these situations the model envisages particles distributed at random with density λ per unit length (area or volume). The number of particles found in a length (area or volume) of size l (A or V) will follow the Poisson distribution (3.18), in which the probability of observing x particles is $P_x = e^{-\mu}\mu^x/x!$, with the parameter μ = λl, λA or λV.

The shapes of the distribution for μ = 1, 4 and 15 are shown in Fig. 3.9. Note that for μ = 1 the distribution is very skew, for μ = 4 the skewness is much less, and for μ = 15 it is almost absent.

[Fig. 3.9 Poisson distributions for various values of μ (panels for μ = 1, μ = 4 and μ = 15). The horizontal scale in each diagram shows values of x; the vertical scale shows probability.]

The distribution (3.18) is determined entirely by the one parameter, μ. It follows that all the features of the distribution in which one might be interested are functions only of μ. In particular the mean and variance must be functions of μ. The mean is

$$E(x) = \sum_{x=0}^{\infty} x P_x = \mu,$$

this result following after a little algebraic manipulation. By similar manipulation we find $E(x^2) = \mu^2 + \mu$ and

$$\operatorname{var}(x) = E(x^2) - \mu^2 = \mu. \qquad (3.19)$$

Thus, the variance of x, like the mean, is equal to μ. The standard deviation is therefore √μ.

Much use is made of the Poisson distribution in bacteriology. To estimate the density of live organisms in a suspension, the bacteriologist may dilute the suspension by a factor of, say, 10⁻⁵, take samples of, say, 1 cm³ in a pipette and drop the contents of the pipette on to a plate containing a nutrient medium on which the bacteria grow. After some time each organism dropped on to the plate will have formed a colony and these colonies can be counted. If the original suspension was well mixed, the volumes sampled are accurately determined and the medium is uniformly adequate to sustain growth, the number of colonies in a large series of plates could be expected to follow a Poisson distribution. The mean colony count per plate, x̄, is an estimate of the mean number of bacteria per 10⁻⁵ cm³ of the original suspension, and a knowledge of the theoretical properties of the Poisson distribution permits one to measure the precision of this estimate (see §5.2). Similarly, for total counts of live and dead organisms, repeated samples of constant volume may be examined under the microscope and the organisms counted directly.

Example 3.7
As an example, Table 3.3 shows a distribution observed during a count of the root nodule bacterium (Rhizobium trifolii) in a Petroff–Hausser counting chamber. The 'expected' frequencies are obtained by calculating the mean number of organisms per square, x̄, from the frequency distribution (giving x̄ = 2·50) and calculating the probabilities Px of the Poisson distribution with μ replaced by x̄. The expected frequencies are then given by 400 Px. The observed and expected frequencies agree quite well. This organism normally produces gum and therefore clumps readily. Under these circumstances one would not expect a Poisson distribution, but the data in Table 3.3 were collected to show the effectiveness of a method of overcoming the clumping.

Table 3.3 Distribution of counts of root nodule bacterium (Rhizobium trifolii) in a Petroff–Hausser counting chamber (data from Wilson and Kullman, 1931).

Number of bacteria    Number of squares
per square            Observed    Expected
0                     34          32·8
1                     68          82·1
2                     112         102·6
3                     94          85·5
4                     55          53·4
5                     21          26·7
6                     12          11·1
7+                    4           5·7
Total                 400         399·9
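The expected frequencies in Table 3.3 are easily reproduced by computer. Here is a short Python sketch, assuming SciPy is available; the observed count of 4 in the 7+ cell is inferred from the column total of 400:

```python
from scipy.stats import poisson

# Observed counts of bacteria per square (0, 1, ..., 6, 7+), Table 3.3.
observed = [34, 68, 112, 94, 55, 21, 12, 4]   # sums to 400
n_squares = sum(observed)

# Mean organisms per square, treating the open-ended 7+ cell as 7.
mean = sum(k * f for k, f in enumerate(observed)) / n_squares
print(f"mean per square = {mean:.2f}")        # 2.50

# Expected frequencies: 400 * P(x) for x = 0..6, and 400 * P(x >= 7).
expected = [n_squares * poisson.pmf(k, mean) for k in range(7)]
expected.append(n_squares * poisson.sf(6, mean))
for k, (o, e) in enumerate(zip(observed, expected)):
    label = str(k) if k < 7 else "7+"
    print(f"{label:>2}: observed {o:3d}, expected {e:6.1f}")
```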
In the derivation of the Poisson distribution use was made of the fact that the binomial distribution with large n and small p is an approximation to the Poisson with mean μ = np. Conversely, when the correct distribution is a binomial with large n and small p, one can approximate this by a Poisson with mean np. For example, the number of deaths from a certain disease, in a large population of n individuals subject to a probability of death p, is really binomially distributed but may be taken as approximately a Poisson variable with mean μ = np. Note that the standard deviation on the binomial assumption is √(np(1 − p)), whereas the Poisson standard deviation is √(np). When p is very small these two expressions are almost equal. Table 3.4 shows the probabilities for the Poisson distribution with μ = 5, and those for various binomial distributions with np = 5. The similarity between the binomial and the Poisson improves with increases in n (and corresponding decreases in p).

Table 3.4 Binomial and Poisson distributions with μ = 5.

         p:   0·5       0·10      0·05
         n:   10        50        100       Poisson
r
0             0·0010    0·0052    0·0059    0·0067
1             0·0098    0·0286    0·0312    0·0337
2             0·0439    0·0779    0·0812    0·0842
3             0·1172    0·1386    0·1396    0·1404
4             0·2051    0·1809    0·1781    0·1755
5             0·2461    0·1849    0·1800    0·1755
6             0·2051    0·1541    0·1500    0·1462
7             0·1172    0·1076    0·1060    0·1044
8             0·0439    0·0643    0·0649    0·0653
9             0·0098    0·0333    0·0349    0·0363
10            0·0010    0·0152    0·0167    0·0181
>10           —         0·0094    0·0115    0·0137
              1·0000    1·0000    1·0000    1·0000

Probabilities for the Poisson distribution may be obtained from many statistical packages.
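The columns of Table 3.4 can be regenerated in a few lines; a sketch assuming SciPy:

```python
from scipy.stats import binom, poisson

# Binomial distributions with np = 5, compared with Poisson(5) (Table 3.4).
settings = [(10, 0.5), (50, 0.10), (100, 0.05)]

print(" r  " + "".join(f"B({n},{p})".rjust(11) for n, p in settings) + "   Poisson")
for r in range(11):
    row = "".join(f"{binom.pmf(r, n, p):11.4f}" for n, p in settings)
    print(f"{r:2d} {row} {poisson.pmf(r, 5):9.4f}")
# Tail probability P(r > 10):
row = "".join(f"{binom.sf(10, n, p):11.4f}" for n, p in settings)
print(f">10{row} {poisson.sf(10, 5):9.4f}")
```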
3.8 The normal (or Gaussian) distribution

The binomial and Poisson distributions both relate to a discrete random variable. The most important continuous probability distribution is the Gaussian (C.F. Gauss, 1777–1855, German mathematician) or, as it is frequently called, the normal distribution. Figures 3.10 and 3.11 show two frequency distributions, of height and of blood pressure, which are similar in shape. They are both approximately symmetrical about the middle and exhibit a shape rather like a bell, with a pronounced peak in the middle and a gradual falling off of the frequency in the two tails. The observed frequencies have been approximated by a smooth curve, which is in each case the probability density of a normal distribution.

[Fig. 3.10 A distribution of heights of young adult males, with an approximating normal distribution (Martin, 1949, Table 17 (Grade 1)).]

[Fig. 3.11 A distribution of diastolic blood pressures of schoolboys, with an approximating normal distribution (Rose, 1962, Table 1).]

Frequency distributions resembling the normal probability distribution in shape are often observed, but this form should not be taken as the norm, as the name 'normal' might lead one to suppose. Many observed distributions are undeniably far from 'normal' in shape and yet cannot be said to be abnormal in the ordinary sense of the word. The importance of the normal distribution lies not so much in any claim to represent a wide range of observed frequency distributions but in the central place it occupies in sampling theory, as we shall see in Chapters 4 and 5. For the purposes of the present discussion we shall regard the normal distribution as one of a number of theoretical forms for a continuous random variable, and proceed to describe some of its properties.

The probability density, f(x), of a normally distributed random variable, x, is given by the expression

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad (3.20)$$

where exp(z) is a convenient way of writing the exponential function e^z (e being the base of natural logarithms), μ is the expectation or mean value of x and σ is the standard deviation of x. (Note that π is the mathematical constant 3·14159…, not, as in §3.6, the parameter of a binomial distribution.) The curve (3.20) is shown in Fig. 3.12, on the horizontal axis of which are marked the positions of the mean, μ, and the values of x which differ from μ by ±σ, ±2σ and ±3σ. The symmetry of the distribution about μ may be inferred from (3.20), since changing the sign but not the magnitude of x − μ leaves f(x) unchanged. Figure 3.12 shows that a relatively small proportion of the area under the curve lies outside the pair of values x = μ + 2σ and x = μ − 2σ. The area under the curve between two values of x represents the probability that the random variable x takes values within this range (see §3.4). In fact the probability that x lies within μ ± 2σ is very nearly 0·95, and the probability that x lies outside this range is, correspondingly, 0·05.

It is important for the statistician to be able to find the area under any part of a normal distribution. Now, the density function (3.20) depends on two parameters, μ and σ. It might be thought, therefore, that any relevant probabilities would have to be worked out separately for every pair of values of μ and σ. Fortunately this is not so. In the previous paragraph we made a statement about the probabilities inside and outside the range μ ± 2σ, without any assumption about the particular values taken by μ and σ. In fact the probabilities depend on an expression of the departure of x from μ as a multiple of σ. For example, the points marked on the axis of Fig. 3.12 are characterized by the multiples ±1, ±2 and ±3, as shown on the lower scale. The probabilities under various parts of any normal distribution can therefore be expressed in terms of the standardized deviate (or z-value)

$$z = \frac{x - \mu}{\sigma}.$$

[Fig. 3.12 The probability density function of a normal distribution, showing the scales of the original variable, x, and the standardized variable, z.]

A few important results, relating values of z to single- or double-tail area probabilities, are shown in Table 3.5. More detailed results are given in Appendix Table A1, and are also readily available from programs in computer packages or on statistical calculators.

Table 3.5 Some probabilities associated with the normal distribution.

Standardized deviate    Probability of greater deviation
z = (x − μ)/σ           In either direction    In one direction
0·0                     1·000                  0·500
1·0                     0·317                  0·159
2·0                     0·046                  0·023
3·0                     0·0027                 0·0013
1·645                   0·10                   0·05
1·960                   0·05                   0·025
2·576                   0·01                   0·005

It is convenient to denote by N(μ, σ²) a normal distribution with mean μ and variance σ² (i.e. standard deviation σ). With this notation, the standardized deviate z follows the standard normal distribution, N(0, 1).
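The entries in Table 3.5 can be reproduced directly; a minimal sketch, assuming SciPy is available:

```python
from scipy.stats import norm

# Tail probabilities for selected standardized deviates (Table 3.5).
for z in [0.0, 1.0, 2.0, 3.0, 1.645, 1.960, 2.576]:
    one_sided = norm.sf(z)        # P(Z > z)
    two_sided = 2 * one_sided     # P(|Z| > z), by symmetry
    print(f"z = {z:5.3f}: either direction {two_sided:.4f}, "
          f"one direction {one_sided:.4f}")
```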
The use of tables of the normal distribution may be illustrated by the next example.

Example 3.8
The heights of a large population of men are found to follow closely a normal distribution with a mean of 172·5 cm and a standard deviation of 6·25 cm. We shall use Table A1 to find the proportions of the population corresponding to various ranges of height.

Above 180 cm. If x = 180, the standardized deviate z = (180 − 172·5)/6·25 = 1·20. The required proportion is the probability that z exceeds 1·20, which is found from Table A1 to be 0·115.

Below 170 cm. z = (170 − 172·5)/6·25 = −0·40. The probability that z falls below −0·40 is the same as that of exceeding 0·40, namely 0·345.

Below 185 cm. z = (185 − 172·5)/6·25 = 2·00. The probability that z falls below 2·00 is one minus the probability of exceeding 2·00, namely 1 − 0·023 = 0·977.

Between 165 and 175 cm. For x = 165, z = −1·20; for x = 175, z = 0·40. The probability that z falls between −1·20 and 0·40 is one minus the probability of (i) falling below −1·20 or (ii) exceeding 0·40, namely 1 − (0·115 + 0·345) = 1 − 0·460 = 0·540.

The normal distribution is often useful as an approximation to the binomial and Poisson distributions. The binomial distribution for any particular value of p approaches the shape of a normal distribution as the other parameter n increases indefinitely (see Fig. 3.7); the approach to normality is more rapid for values of p near ½ than for values near 0 or 1, since all binomial distributions with p = ½ have the advantage of symmetry. Thus, provided n is large enough, a binomial variable r (in the notation of §3.6) may be regarded as approximately normally distributed with mean np and standard deviation √(np(1 − p)). The Poisson distribution with mean μ approaches normality as μ increases indefinitely (see Fig. 3.9). A Poisson variable x may, therefore, be regarded as approximately normal with mean μ and standard deviation √μ.

If tables of the normal distribution are to be used to provide approximations to the binomial and Poisson distributions, account must be taken of the fact that these two distributions are discrete whereas the normal distribution is continuous. It is useful to introduce what is known as a continuity correction, whereby the exact probability for, say, the binomial variable r (taking integral values) is approximated by the probability of a normal variable between r − ½ and r + ½. Thus, the probability that a binomial variable took values greater than or equal to r when r > np (or less than or equal to r when r < np) would be approximated by the normal tail area beyond a standardized normal deviate

$$z = \frac{|r - np| - \tfrac{1}{2}}{\sqrt{np(1-p)}},$$

the vertical lines indicating that the 'absolute value', or the numerical value ignoring the sign, is to be used. Tables 3.6 and 3.7 illustrate the normal approximations to some probabilities for binomial and Poisson variables.

Table 3.6 Examples of the approximation to the binomial distribution by the normal distribution with continuity correction.

                Mean    SD             Values    Exact         Normal approximation
p      n        np      √(np(1−p))     of r      probability   z        Probability
0·5    10       5       1·581          ≤2        0·0547        1·581    0·0579
                                       ≥8        0·0547        1·581    0·0579
0·1    50       5       2·121          ≤2        0·1117        1·179    0·1192
                                       ≥8        0·1221        1·179    0·1192
0·5    40       20      3·162          ≤14       0·0403        1·739    0·0410
                                       ≥26       0·0403        1·739    0·0410
0·2    100      20      4·000          ≤14       0·0804        1·375    0·0846
                                       ≥26       0·0875        1·375    0·0846
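The rows of Table 3.6 can be checked in the same way; a sketch assuming SciPy (in each of these rows the lower and upper r values lie equally far from np, so the same z serves for both tails):

```python
from math import sqrt
from scipy.stats import binom, norm

# Normal approximation with continuity correction to binomial tail
# probabilities (checking the rows of Table 3.6).
for p, n, r_lo, r_hi in [(0.5, 10, 2, 8), (0.1, 50, 2, 8),
                         (0.5, 40, 14, 26), (0.2, 100, 14, 26)]:
    mean, sd = n * p, sqrt(n * p * (1 - p))
    z = (abs(r_hi - mean) - 0.5) / sd
    print(f"p={p}, n={n}: exact P(r<={r_lo}) = {binom.cdf(r_lo, n, p):.4f}, "
          f"P(r>={r_hi}) = {binom.sf(r_hi - 1, n, p):.4f}, "
          f"normal approx = {norm.sf(z):.4f} (z = {z:.3f})")
```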
Table 3.7 Examples of the approximation to the Poisson distribution by the normal distribution with continuity correction.

Mean     SD        Values    Exact         Normal approximation
μ        √μ        of x      probability   z = (|x−μ|−½)/√μ    Probability
5        2·236     0         0·0067        2·013               0·0221
                   ≤2        0·1246        1·118               0·1318
                   ≥8        0·1334        1·118               0·1318
                   ≥10       0·0318        2·013               0·0221
20       4·472     ≤10       0·0108        2·124               0·0168
                   ≤15       0·1565        1·006               0·1572
                   ≥25       0·1568        1·006               0·1572
                   ≥30       0·0218        2·124               0·0168
100      10·000    ≤80       0·0226        1·950               0·0256
                   ≤90       0·1714        0·950               0·1711
                   ≥110      0·1706        0·950               0·1711
                   ≥120      0·0282        1·950               0·0256

The importance of the normal distribution extends well beyond its value in modelling certain symmetric frequency distributions or as an approximation to the binomial and Poisson distributions. We shall note in §4.2 a central role in describing the sampling distribution of means of large samples and, more generally, in §5.4, its importance in the large-sample distribution of a wider range of statistics.

The χ²(1) distribution

Many probability distributions of importance in statistics are closely related to the normal distribution, and will be introduced later in the book. We note here one especially important distribution, the χ² ('chi-square' or 'chi-squared') distribution on one degree of freedom, written as χ²(1). It is a member of a wider family of χ² distributions, to be described more fully in §5.1; at present we consider only this one member of the family.

Suppose z denotes a standardized normal deviate, as defined above. That is, z follows the N(0, 1) distribution. The squared deviate, z², is also a random variable, the value of which must be non-negative, ranging from 0 to ∞. Its distribution, the χ²(1) distribution, is depicted in Fig. 3.13. The percentiles (p. 38) of the distribution are tabulated on the first line of Table A2. Thus, the column headed P = 0·050 gives the 95th percentile. Two points may be noted at this stage.

[Fig. 3.13 Probability density function for a variate z² following a χ² distribution on one degree of freedom.]

1 The mean value of the distribution is E(z²) = E[(x − μ)²/σ²] = σ²/σ² = 1.
2 The percentiles may be obtained from those of the normal distribution. From Table A1 we know, for instance, that there is a probability 0·05 that z exceeds 1·960 or falls below −1·960. Whenever either of these events happens, z² exceeds 1·960² = 3·84. Thus, the 0·05 level of the χ²(1) distribution is 3·84, as is confirmed by the entry in Table A2. A similar relationship holds for all the other percentiles.

This equivalence between the standard normal distribution, N(0, 1), and the χ²(1) distribution means that many statements about normally distributed random variables can be equally well expressed in terms of either distribution. It must be remembered, though, that the use of z² removes the information about the sign of z, and so if the direction of the deviation from the mean is important the N(0, 1) distribution must be used.
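The second point is easily verified numerically; a one-line check in Python (assuming SciPy):

```python
from scipy.stats import chi2, norm

# The 0.05 critical value of chi-square on 1 DF is the square of the
# two-sided 0.05 normal deviate: 1.960**2 = 3.84.
z = norm.ppf(0.975)          # 1.960
print(z ** 2)                # 3.841...
print(chi2.ppf(0.95, 1))     # 3.841... (the same value)
```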
4.6 Sample-size determination

The formulae given above assume a knowledge of the standard deviation (SD), σ. In practice, σ will rarely be known in advance, although sometimes the investigator will be able to make use of an estimate of σ from previous data which is felt to be reasonably accurate. The approaches outlined above may be modified in two ways.

The first way would be to recognize that the required values of n can be specified only in terms of the ratio of a critical interval (e, δ₀ or δ₁) to an estimated or true standard deviation (s or σ). For instance, in 1 we might specify the estimated standard error to be a certain multiple of s; in 2, δ₀ might be specified as a certain multiple of s; in 3, the power might be specified against a given ratio of δ₁ to σ. In 2 and 3, because the test will use the t distribution rather than the normal, the required values of n will be rather greater than those given by (4.38) or (4.41), but the adjustments will only be important for small samples, that is, those giving 30 or fewer DF for the estimated SD. In this case an adjusted sample size can be given by a second use of (4.41), replacing z₂α and z₂β by the corresponding values from the t distribution, with DF given by the sample size obtained in the initial application of (4.41).

The specification of critical distances as multiples of an unknown standard deviation is not an attractive suggestion. An alternative approach would be to estimate σ by a relatively small pilot investigation and then to use this value in formulae (4.36)–(4.41), to provide an estimate of the total sample size, n. Again, some adjustment is called for because of the uncertainty introduced by estimating σ, but the effect will be small provided that the initial pilot sample is not too small.

The above discussion has been in terms of two independent samples, where the analysis is the two-sample t test (§4.3). If the two groups are paired, the method of analysis would be the paired t test and (4.40) would be used with SE(d̄) substituted by s/√n, where s is the standard deviation of the differences between paired values and n is the number of pairs. In terms of the methods of analysis of variance, to be introduced in Chapter 8, s is the 'residual' standard deviation.

The comparison of two independent proportions follows a similar approach. In this case the standard error of the difference depends on the values of the proportions (§4.5). Following approach 3, specification would be in terms of requiring detection of a difference between true proportions p₁ and p₂, using a test of the null hypothesis that each proportion is equal to the pooled value p̄. The equation corresponding to (4.41) is

$$n > \left\{ \frac{z_{2\alpha}\sqrt{2\bar{p}(1-\bar{p})} + z_{2\beta}\sqrt{p_1(1-p_1) + p_2(1-p_2)}}{p_1 - p_2} \right\}^2. \qquad (4.42)$$

This equation is slightly more complicated than (4.41) because the standard error of the difference between two observed proportions is different for the null and alternative hypotheses (§4.5). Use of (4.42) gives sample sizes that are appropriate when statistical significance is to be determined using the uncorrected χ² test (4.30). Fleiss et al. (1980) show that for the continuity-corrected test (4.34), 2/|p₁ − p₂| should be added to the solution given by (4.42). Casagrande et al. (1978) give tables for determining sample sizes that are based on the exact test for 2 × 2 tables (§4.5), and Fleiss (1981) gives a detailed set of tables based on a formula that gives a good approximation to the exact values. Appendix Table A8 gives the sample size required in each of the two groups, calculated using (4.42) and including the continuity correction for some common situations. Although in general we favour uncorrected tests, Table A8 refers to the corrected version since, in those cases where the difference is of practical consequence, it would be advisable to aim on the high side.

Example 4.15
A trial of a new treatment is being planned. The success rate of the standard treatment to be used as a control is 0·25. If the new treatment increases the success rate to 0·35, then it is required to detect that there is an improvement with a two-sided 5% significance test and a power of 90%. How many patients should be included in each group?

Using (4.42) with p₁ = 0·25, p₂ = 0·35, p̄ = 0·3, z₂α = 1·96 and z₂β = 1·282 gives

$$n > \left\{ \frac{1.96\sqrt{0.42} + 1.282\sqrt{0.1875 + 0.2275}}{0.1} \right\}^2 = 439.4.$$

Including Fleiss et al.'s (1980) correction gives

n = 439·4 + 2/0·1 = 459·4.

From Table A8 the sample size is given as 459 (the same value is given in both Casagrande et al.'s (1978) and Fleiss's (1981) tables).
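Equation (4.42), with the Fleiss correction, is straightforward to program; a sketch in Python assuming SciPy (the function name is ours):

```python
from math import sqrt
from scipy.stats import norm

def sample_size_two_props(p1, p2, alpha=0.05, power=0.90, corrected=True):
    """Sample size per group for comparing two proportions, as in (4.42),
    optionally adding the Fleiss et al. (1980) continuity correction."""
    z2a = norm.ppf(1 - alpha / 2)     # two-sided alpha
    z2b = norm.ppf(power)
    pbar = (p1 + p2) / 2              # pooled value for equal group sizes
    n = ((z2a * sqrt(2 * pbar * (1 - pbar))
          + z2b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2)) ** 2
    if corrected:
        n += 2 / abs(p1 - p2)
    return n

# Example 4.15: control success rate 0.25, new treatment 0.35, 90% power.
print(sample_size_two_props(0.25, 0.35))   # about 459.2 (the book, using
                                           # rounded deviates, gets 459.4)
```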
Case–control studies

In a case–control study the measure of association is the odds ratio, or approximate relative risk (§4.5 and §19.5), and this measure is used in determining sample size. For hypothesis testing the data are arranged in a 2 × 2 table and the significance test is then a comparison of the proportions of individuals in the case and control groups who are exposed to the risk factor. Then the problem of sample-size determination can be converted to that of comparing two independent proportions. The controls represent the general population and we need to specify the proportion of controls exposed to the risk factor, p. The first step is to find the proportion of cases, p′, that would be exposed for a specified odds ratio, OR₁. This may easily be determined using (4.22). To eliminate this step, and the need to interpolate in tables such as Table A8, tables have been produced that are indexed by the proportion of exposed in the controls and the specified value of the true odds ratio for which it is required to establish statistical significance. Table A9 is a brief table covering commonly required situations. This table is based on (4.22) and (4.42) with the continuity correction. Schlesselman (1982) gives a more extensive set of tables; these tables give smaller sample sizes than Table A9, since Schlesselman did not include Fleiss's addition for the continuity correction.

Example 4.16
A case–control study is being planned to assess the association between a disease and a risk factor. It is estimated that 20% of the general population are exposed to the risk factor. It is required to detect an association if the relative risk is 2 or greater, with 80% power at the 5% significance level. How many cases and controls are required?

Using (4.22) with p = 0·2 and OR₁ = 2·0 gives p′ = 0·3333. Then, using (4.42) with p₁ = 0·3333 and p₂ = 0·2, and using the continuity correction, gives n = 186·5. So the study should consist of at least 187 cases and the same number of controls. The above calculations can be avoided by direct use of Appendix Table A9, which gives n = 187.

Inverse formulation

The sample-size problem is usually expressed as one of determining sample size given the power and magnitude of the specified effect to be detected. But it could be considered as finding any one of these three items for given values of the other two. This is often a useful approach since the sample size may be limited by other considerations. Then the relevant questions are: what effect could be detected; or what would the power be? The next example considers the inverse problem of estimating what can be achieved with a given sample size.

Example 4.17
Consider the situation of Example 4.14 but suppose that resources are available to include only 50 men in each group.
(a) What would be the power of the study? Substituting in (4.41) with z₂β as unknown gives

$$50 = 2\left\{ \frac{(1.96 + z_{2\beta}) \times 0.5}{0.25} \right\}^2,$$

and therefore z₂β = 0·540. From Table A1 this corresponds to a power of 1 − 0·2946 = 0·7054, or 71%.
(b) What size of mean difference between the groups could be detected with 80% power? Substituting in (4.41) with δ₁ as unknown gives

$$50 = 2\left\{ \frac{(1.96 + 0.842) \times 0.5}{\delta_1} \right\}^2,$$

and therefore δ₁ = 0·28. A mean difference of 0·28 l could be detected.

Note that, if the sample size has already been determined for a specified δ₁, the revised value is quickly obtained by noting that the difference that can be detected is proportional to the reciprocal of the square root of the sample size. Thus, since for δ₁ = 0·25 the sample size was calculated as 62·8, then for a sample size of 50 we have δ₁ = 0·25 × √(62·8/50) = 0·28.
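The inverse calculation of Example 4.17(a) is a one-liner once (4.41) is rearranged; a sketch assuming SciPy, with σ = 0·5 and δ₁ = 0·25 as implied by the arithmetic above:

```python
from math import sqrt
from scipy.stats import norm

# Example 4.17(a): with n = 50 per group, solve (4.41) for z_2beta and
# convert the result to a power.
n, sigma, delta, z2a = 50, 0.5, 0.25, norm.ppf(0.975)
z2b = sqrt(n / 2) * delta / sigma - z2a   # rearranged form of (4.41)
print(z2b, norm.cdf(z2b))                 # 0.540, power about 0.71
```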
For the inverse problem of determining the power or the size of effect that could be detected with a given sample size, the formulae (4.41) and (4.42) can be used but the tables are less convenient. Since it is often required to consider what can be achieved with a range of sample sizes, it is convenient to be able to calculate approximate solutions more simply. This can be achieved by noting that, in (4.41), δ₁ is proportional to 1/√n and z₂α + z₂β is proportional to √n. These relationships apply exactly for a continuous variable but also form a reasonable approximation for comparing proportions.

Example 4.18
In Example 4.15 suppose only 600 patients are available, so that n = 300.
(a) What size of difference could be detected with 90% power? Approximately,

δ₁ = 0·10 × √(459·4/300) = 0·124.

Using (4.42) gives δ₁ = 0·126, so the approximation is good. An increase in success rate to about 37·6% could be detected.
(b) What would be the power to detect a difference of 0·1? Approximately,

1·96 + z₂β = (1·96 + 1·282) × √(300/459·4) = 2·620.

Therefore z₂β = 0·660 and the revised power is 74·5%. Using (4.42) gives z₂β = 0·626 and a revised power of 73·4%.

Unequal-sized groups

It is usually optimal when comparing two groups to have equal numbers in each group. But sometimes the number available in one group may be restricted, e.g. for a rare disease in a case–control study (§19.4). In this case the power can be increased to a limited extent by having more in the other group. If one group contains m subjects and the other rm, then the study is approximately equivalent to a study with n in each group, where

$$\frac{2}{n} = \frac{1}{m} + \frac{1}{rm}.$$

That is,

$$m = \frac{(r+1)n}{2r}. \qquad (4.43)$$

This expression is derived by equating the expressions for the standard error of the difference between two means used in a two-sample t test, and is exact for a continuous variable and approximate for the comparison of two proportions. Fleiss et al. (1980) give a formula for the general case of comparing two proportions where the two samples are not of equal size, and their formula is suitable for the inverse problem of estimating power from known sample sizes. The total number of subjects in the study is

$$\frac{(r+1)^2}{4r} \times 2n,$$

which is a minimum for r = 1.
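Equation (4.43) is simple enough to apply directly; a small sketch (the numbers continue the n = 300 of Example 4.18):

```python
# Equivalent per-group size n for groups of m and r*m subjects (4.43):
# 2/n = 1/m + 1/(r*m), so m = (r + 1) * n / (2 * r).
def smaller_group_size(n, r):
    return (r + 1) * n / (2 * r)

# To match a 300-per-group design with twice as many controls as cases
# (r = 2), we need m = 225 cases and 450 controls: 675 subjects in all,
# against 600 with equal groups.
print(smaller_group_size(300, 2))   # 225.0
```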
Other considerations

The situations considered in this section are relatively simple: those of comparing two groups without any complicating features. The determination of sample size is often facilitated by the use of tables and, in addition to those mentioned earlier, the book by Lemeshow et al. (1990) contains a number of sets of tables. Even in these simple situations it is necessary to have a reasonable idea of the likely form of the data to be collected before sample size can be estimated. For example, when comparing means it is necessary to have an estimate of the standard deviation, or in comparing proportions the approximate size of one of the proportions is required at the planning stage. Such information may be available from earlier studies using the same variables or may be obtained from a pilot study. In more complicated situations more information is required, but (4.41) can be used in principle, provided that it is possible to find an expression for σ, the standard deviation relevant to the comparison of interest.

For a comparison of proportions using paired samples, where the significance test is based on the untied pairs (§4.5), it is necessary to have information on the likely effectiveness of the matching, which would determine the proportion of pairs that were tied (Connor, 1987). As such information may be unavailable, the effect of matching is often ignored in the determination of sample size, but this would lead to a larger sample size than necessary if the matching were effective (Parker and Bregman, 1986).

Other more complicated situations are discussed elsewhere in this book. These include Bayesian methods (p. 182), equivalence trials (p. 638) and non-compliance in clinical trials (p. 613). Schlesselman (1982) and Donner (1984) consider sample-size determination in the presence of a confounder (Chapter 19). Tables for sample-size determination in survival analysis (Chapter 17) are given by Lemeshow et al. (1990) and Machin et al. (1997). For a more detailed review, see Lachin (1998).

Many investigations are concerned with more than one variable measured on the same individual. In a morbidity survey, for example, a wide range of symptoms, as well as the results of certain diagnostic tests, may be recorded for each person. Sample sizes deemed adequate for one purpose may, therefore, be inadequate for others. In many investigations the sample size chosen would be the largest of the separate requirements for the different variables; it would not matter too much that for some variables the sample size was unnecessarily high. In other investigations, in contrast, this may be undesirable, because either the cost or the trouble incurred by taking the extra observations is not negligible. A useful device in these circumstances is multiphase sampling. In the first phase certain variables are observed on all the members of the initial sample. In the second phase a subsample of the original sample is then taken, either by simple random sampling or by one of the other methods described in §19.2, and other variables are observed only on the members of the subsample. The process could clearly be extended to more than two phases. Some population censuses have been effectively multiphase samples in which the first phase is a 100% sample to which some questions are put. In the second phase, a subsample (say, 1 in 10 of the population) is asked certain additional questions. The justification here would be that complete enumeration is necessary for certain basic demographic data, but that for certain more specialized purposes (perhaps information about fertility or occupation) a 1-in-10 sample would provide estimates of adequate precision. Material savings in cost are achieved by restricting these latter questions to a relatively small subsample.

5 Analysing variances, counts and other measures

5.1 Inferences from variances

The sampling error of a variance estimate

Suppose that a quantitative random variable x follows a distribution with mean μ and variance σ². In a sample of size n, the estimated variance is

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}.$$

In repeated random sampling from the distribution, s² will vary from one sample to another; it will itself be a random variable. We now consider the nature of the variation in s². An important result is that the mean value, or expectation, of s² is σ². That is,

$$E(s^2) = \sigma^2. \qquad (5.1)$$

Another way of stating the result (5.1) is that s² is an unbiased estimator of σ². It is this property which makes s², with its divisor of n − 1, a satisfactory estimator of the population variance: the statistic (2.1), with a divisor of n, has an expectation (n − 1)σ²/n, which is less than σ². Note that E(s) is not equal to σ; it is in fact less than σ. The reason for paying so much attention to E(s²) rather than E(s) will appear in later chapters.
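A small simulation makes the point; a sketch using NumPy (the sample size, variance and seed are arbitrary choices):

```python
import numpy as np

# The divisor n-1 makes s^2 unbiased for sigma^2 (5.1), while a
# divisor of n underestimates it by the factor (n-1)/n.
rng = np.random.default_rng(1)
n, sigma2, n_sim = 5, 4.0, 200_000
samples = rng.normal(0.0, np.sqrt(sigma2), size=(n_sim, n))

s2 = samples.var(axis=1, ddof=1)     # divisor n-1
v_n = samples.var(axis=1, ddof=0)    # divisor n
print(s2.mean())                     # close to 4.0
print(v_n.mean())                    # close to (n-1)/n * 4.0 = 3.2
```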
What else can be said about the sampling distribution of s²? Let us, for the moment, tighten our requirements about the distribution of x by assuming that it is strictly normal. In this particular instance, the distribution of s² is closely related to one of a family of distributions called the χ² distributions (chi-square or chi-squared), which are of very great importance in statistical work, and it will be useful to introduce these distributions by a short discussion before returning to the distribution of s², which is the main concern of this section.

Denote by X₁ the square of the standardized deviate corresponding to the variable x. That is,

$$X_1 = \frac{(x - \mu)^2}{\sigma^2}.$$

X₁ is a random variable, whose value must be non-negative. As was noted in §3.8, the distribution of X₁ is called the χ² distribution on one degree of freedom (1 DF), and is often called the χ²(1) distribution. It is depicted as the first curve in Fig. 5.1. The percentage points of the χ²(1) distribution are tabulated in Table A2. The values exceeded with probability P are often called the 100(1 − P)th percentiles; thus, the column headed P = 0·050 gives the 95th percentile. Two points may be noted at this stage.

[Fig. 5.1 Probability density functions for χ² distributions with various numbers of degrees of freedom (1 DF up to 10 DF).]

1 The mean value of the distribution is E(X₁) = E[(x − μ)²/σ²] = σ²/σ² = 1.
2 The percentiles may be obtained from those of the normal distribution. From Table A1 we know, for instance, that there is a probability 0·05 that (x − μ)/σ exceeds 1·960 or falls below −1·960. Whenever either of these events happens, (x − μ)²/σ² exceeds 1·960² = 3·84. Thus, the 0·05 level of the χ²(1) distribution is 3·84. A similar relationship holds for all the other percentiles.

Now let x₁ and x₂ be two independent observations on x, and define

$$X_2 = \frac{(x_1 - \mu)^2}{\sigma^2} + \frac{(x_2 - \mu)^2}{\sigma^2}.$$

X₂ follows what is known as the χ² distribution on two degrees of freedom, χ²(2). The variable X₂, like X₁, is necessarily non-negative. Its distribution is shown as the second curve in Fig. 5.1, and is tabulated along the second line of Table A2. Note that X₂ is the sum of two independent observations on X₁. Hence

E(X₂) = 2E(X₁) = 2.

Similarly, in a sample of n independent observations xᵢ, define

$$X_n = \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{\sigma^2}. \qquad (5.2)$$

This follows the χ² distribution on n degrees of freedom, χ²(n), and E(Xₙ) = n. Figure 5.1 and Table A2 show that, as the degrees of freedom increase, the χ² distribution becomes more and more symmetric. Indeed, since it is the sum of n independent χ²(1) variables, the central limit theorem (which applies to sums as well as to means) shows that χ²(n) tends to normality as n increases. The variance of the χ²(n) distribution is 2n.
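The defining sum (5.2) can be simulated directly; a NumPy sketch (arbitrary seed and sizes):

```python
import numpy as np

# A chi-square variable on n DF is a sum of n squared standard normal
# deviates (5.2); its mean is n and its variance 2n.
rng = np.random.default_rng(2)
n, n_sim = 10, 200_000
X = (rng.standard_normal((n_sim, n)) ** 2).sum(axis=1)
print(X.mean(), X.var())    # close to 10 and 20
```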
The result (5.2) enables us to find the distribution of the sum of squared deviations about the population mean μ. In the formula for s², we use the sum of squares about the sample mean x̄, and it can be shown that Σ(xᵢ − x̄)² is somewhat less than Σ(xᵢ − μ)². In fact, Σ(xᵢ − x̄)²/σ² follows the χ²(n−1) distribution. The fact that differences are taken from the sample mean rather than the population mean is compensated for by the subtraction of 1 from the degrees of freedom. Now

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = \frac{\sigma^2}{n-1} \cdot \frac{\sum (x_i - \bar{x})^2}{\sigma^2}.$$

That is, s² behaves as σ²/(n − 1) times a χ²(n−1) variable. It follows that

$$E(s^2) = \frac{\sigma^2}{n-1}\,E(\chi^2_{(n-1)}) = \frac{\sigma^2}{n-1}(n-1) = \sigma^2,$$

as was noted in (5.1), and

$$\operatorname{var}(s^2) = \frac{\sigma^4}{(n-1)^2}\,\operatorname{var}(\chi^2_{(n-1)}) = \frac{\sigma^4}{(n-1)^2} \cdot 2(n-1) = \frac{2\sigma^4}{n-1}. \qquad (5.3)$$

The formula (5.3) for var(s²) is true only for samples from a normal distribution; indeed, the whole sampling theory for s² is more sensitive to non-normality than that for the mean.

Inferences from a variance estimate

For normally distributed variables, the methods follow immediately from previous results. Suppose s² is the usual estimate of variance in a random sample of size n from a normal distribution with variance σ²; the population mean need not be specified. For a test of the null hypothesis that σ² = σ₀², calculate

$$X^2 = \frac{(n-1)s^2}{\sigma_0^2}, \quad \text{or equivalently} \quad \frac{\sum (x - \bar{x})^2}{\sigma_0^2}, \qquad (5.4)$$

and refer this to the χ²(n−1) distribution. For a two-sided test at a significance level α, the critical values for X² will be those corresponding to tabulated probabilities of 1 − ½α and ½α. For a two-sided 5% level, for example, the entries under the headings 0·975 and 0·025 must be used. We may denote these by χ²(n−1), 0·975 and χ²(n−1), 0·025.

For confidence limits for σ² we can argue that the probability is, say, 0·95 that

$$\chi^2_{n-1,\,0.975} < \frac{\sum (x - \bar{x})^2}{\sigma^2} < \chi^2_{n-1,\,0.025}$$

and hence that

$$\frac{\sum (x - \bar{x})^2}{\chi^2_{n-1,\,0.025}} < \sigma^2 < \frac{\sum (x - \bar{x})^2}{\chi^2_{n-1,\,0.975}}.$$

If s₁² and s₂² are independent estimates of variances σ₁² and σ₂², on ν₁ and ν₂ degrees of freedom, the ratio F′ = (s₁²/σ₁²)/(s₂²/σ₂²) follows the F distribution on (ν₁, ν₂) degrees of freedom, and the observed variance ratio is F = s₁²/s₂². Denote by Fα,ν₁,ν₂ the tabulated critical value of F for ν₁ and ν₂ degrees of freedom and a one-sided significance level α, and by Fα,ν₂,ν₁ the corresponding entry with ν₁ and ν₂ interchanged. Then the probability is α that F′ > Fα,ν₁,ν₂, i.e. that F > Fα,ν₁,ν₂ σ₁²/σ₂²; and also α that 1/F′ > Fα,ν₂,ν₁, i.e. that F < (1/Fα,ν₂,ν₁) σ₁²/σ₂². Consequently, the probability is 1 − 2α that

$$(1/F_{\alpha,\nu_2,\nu_1})\,\sigma_1^2/\sigma_2^2 < F < F_{\alpha,\nu_1,\nu_2}\,\sigma_1^2/\sigma_2^2,$$

or that

$$F/F_{\alpha,\nu_1,\nu_2} < \sigma_1^2/\sigma_2^2 < F \times F_{\alpha,\nu_2,\nu_1}.$$

For a 95% confidence interval, therefore, the observed value of F must be divided by the tabulated value F0·025,ν₁,ν₂ and multiplied by the value F0·025,ν₂,ν₁.
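Returning to (5.4), the test and the confidence limits for σ² can be sketched as follows (the numerical values are hypothetical, chosen only for illustration):

```python
from scipy.stats import chi2

# Test of H0: sigma^2 = sigma0^2, and a 95% confidence interval for
# sigma^2, based on (n-1)s^2/sigma^2 following chi-square on n-1 DF (5.4).
def variance_ci(s2, n, level=0.95):
    a = (1 - level) / 2
    df = n - 1
    return df * s2 / chi2.ppf(1 - a, df), df * s2 / chi2.ppf(a, df)

s2, n, sigma0_sq = 1.6, 20, 1.0     # hypothetical data
X2 = (n - 1) * s2 / sigma0_sq
p_upper = chi2.sf(X2, n - 1)        # one-sided tail probability
print(X2, p_upper, variance_ci(s2, n))
```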
The observed ratio is thus just significant at the two-sided 1% level, that is P 0Á010 For 95% confidence limits we need the tabulated values F0Á025, 9, 19 2Á88 and F0Á025, 19, 3Á69 (interpolating in the table where necessary) The confidence limits for the ratio of population variances, s2 =s2 , are therefore 4Á05 1Á24 2Á88 and 4Á05 3Á69 14Á9: Two connections may be noted between the F distributions and other distributions already met When n1 1, the F distribution is that of the square of a quantity following the t distribution on n2 DF For example, for a one-sided significance level 0Á05, the tabulated value of F for n1 1, n2 10 is 4Á96; the two-sided value for t on 10 DF is 2Á228; 2Á2282 4Á96 This relationship follows because a t statistic is essentially the ratio of a normal variable with zero mean to an independent estimate of its standard deviation Squaring the numerator gives a x2 variable (equivalent to an estimate of variance on DF), and squaring the denominator similarly gives an estimate of variance on the appropriate number of degrees of freedom Both the positive and negative tails of the t distribution have to be included because, after squaring, they both give values in the upper tail of the F distribution When n2 I, the F distribution is the same as that of a x2 variable divided n by n1 Thus, for n1 10, n2 I, the tabulated F for a one-sided level 0Á05 is 1Á83; that for x2 is 18Á31; 18Á31 / 10 1Á83 10 5.2 Inferences from counts 153 The reason here is similar to that advanced above An estimate of variance s2 on n2 I DF must be exactly equal to the population variance s2 Thus, F may be written as s2 =s2 x2 =n1 (see p 149) n The F test and the associated confidence limits provide an exact treatment of the comparison of two variance estimates from two independent normal samples Unfortunately, the methods are rather sensitive to the assumption of normalityÐmuch more so than in the corresponding uses of the t distribution to compare two means This defect is called a lack of robustness The methods described in this section are appropriate only for the comparison of two independent estimates of variances Sometimes this condition fails because the observations in the two samples are paired, as in the first situation considered in §4.2 The appropriate method for this case makes use of a technique described in Chapter 7, and is therefore postponed until p 203 A different use of the F distribution has already been noted on p 117 5.2 Inferences from counts Suppose that x is a count, say, of the number of events occurring during a certain period or a number of small objects observed in a biological specimen, which can be assumed to follow the Poisson distribution with mean m (§3.7) What can be said about m? 
5.2 Inferences from counts

Suppose that x is a count, say, of the number of events occurring during a certain period or a number of small objects observed in a biological specimen, which can be assumed to follow the Poisson distribution with mean μ (§3.7). What can be said about μ?

Suppose first that we wish to test a null hypothesis specifying that μ is equal to some value μ₀. On this hypothesis, x would follow a Poisson distribution with expectation μ₀. The departure of x from its expected value, μ₀, is measured by the extent to which x falls into either of the tails of the hypothesized distribution. The situation is similar to that of the binomial (§3.6). Thus, if x > μ₀ and the probabilities in the Poisson distribution are P₀, P₁, …, the P value for a one-sided test will be

P = Pₓ + Pₓ₊₁ + Pₓ₊₂ + … = 1 − P₀ − P₁ − … − Pₓ₋₁.

The possible methods of constructing a two-sided test follow the same principles as for the binomial in §3.6. Again considerable simplification is achieved by approximating the Poisson distribution by the normal (§3.8). On the null hypothesis, and including a continuity correction,

$$z = \frac{|x - \mu_0| - \tfrac{1}{2}}{\sqrt{\mu_0}} \qquad (5.5)$$

is approximately a standardized normal deviate. Excluding the continuity correction corresponds to the mid-P value obtained by including only ½Pₓ in the summation of Poisson probabilities.

Example 5.2
In a study of asbestos workers a large group was followed over several years and 33 died of lung cancer. Making allowance for age, using national death rates, the expected number of deaths due to lung cancer was 20·0. How strong is this evidence that there is an excess risk of death due to lung cancer?

On the null hypothesis that the national death rates applied, the standard error of x is √20·0 = 4·47. The observed deviation is 33 − 20·0 = 13·0. With continuity correction, the standardized normal deviate is (13·0 − 0·5)/4·47 = 2·80, giving a one-sided normal tail area of 0·0026. The exact one-sided value of P, from the Poisson distribution, is 0·0047, so the normal test exaggerated the significance. Two-sided values may be obtained by doubling these values, and both methods show that the evidence of excess mortality due to lung cancer is strong. The exact one-sided mid-P value is 0·0037 and the corresponding standardized normal deviate is 13·0/4·47 = 2·91, giving a one-sided level of 0·0018.

The 95% confidence limits for μ are the two values, μL and μU, for which x is just significant by a one-sided test at the 2½% level. These values may be obtained from tables of the Poisson distribution (e.g. Pearson & Hartley, 1966, Table 7), and Bailar and Ederer (1964) give a table of confidence factors. Table VIII.1 of Fisher and Yates (1963) may also be used. The normal approximation may be used in similar ways to the binomial case.

1 The tail areas could be estimated from (5.5). Thus approximations to the 95% limits are given by

$$\frac{x - \mu_L - \tfrac{1}{2}}{\sqrt{\mu_L}} = 1.96 \quad \text{and} \quad \frac{x - \mu_U + \tfrac{1}{2}}{\sqrt{\mu_U}} = -1.96.$$

2 If x is large, the continuity correction in method 1 may be omitted.
3 Replace √μL and √μU by √x. This is only satisfactory for large values (greater than 100).
4 The exact limits may be obtained by using tables of the χ² distribution. This follows from the mathematical link between the Poisson and the χ² distributions (see Liddell, 1984). The limits are

$$\mu_L = \tfrac{1}{2}\chi^2_{2x,\,0.975} \quad \text{and} \quad \mu_U = \tfrac{1}{2}\chi^2_{2x+2,\,0.025}. \qquad (5.6)$$

Example 5.2, continued
With x = 33, the exact 95% confidence limits are found to be 22·7 and 46·3. Method 1 gives 23·1 and 46·9, method 2 gives 23·5 and 46·3, and method 3 gives 21·7 and 44·3. In this example methods 1 and 2 are adequate. The 95% confidence limits for the relative death rate due to lung cancer, expressed as the ratio of observed to expected, are 22·7/20·0 and 46·3/20·0 = 1·14 and 2·32. The mid-P limits are obtained from method 2 as 23·5/20·0 and 46·3/20·0 = 1·18 and 2·32.
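Method 4 translates directly into code via the χ² percentiles; a sketch assuming SciPy (note that the subscript 0·975 in (5.6) is the book's upper-tail convention, which corresponds to the 2·5th percentile below):

```python
from scipy.stats import chi2

# Exact Poisson confidence limits via the chi-square link (5.6).
def poisson_ci(x, level=0.95):
    a = (1 - level) / 2
    lower = chi2.ppf(a, 2 * x) / 2
    upper = chi2.ppf(1 - a, 2 * x + 2) / 2
    return lower, upper

# Example 5.2: x = 33 observed deaths, 20.0 expected.
lo, hi = poisson_ci(33)
print(lo, hi)                 # about 22.7 and 46.3
print(lo / 20.0, hi / 20.0)   # ratio limits, about 1.14 and 2.32
```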
Example 5.3
As an example where exact limits should be calculated, suppose that, in a similar situation to Example 5.2, there were two deaths compared with an expectation of 0·5. Then

μL = ½χ²4, 0·975 = 0·24 and μU = ½χ²6, 0·025 = 7·22.

The limits for the ratio of observed to expected deaths are 0·24/0·5 and 7·22/0·5 = 0·48 and 14·4. The mid-P limits of μ may be obtained by trial and error on a programmable calculator or personal computer as those values for which

P(x = 0) + P(x = 1) + ½P(x = 2) = 0·975 or 0·025.

This gives μL = 0·335 and μU = 6·61, so that the mid-P limits of the ratio of observed to expected deaths are 0·7 and 13·2. The evidence of excess mortality is weak, but the data do not exclude the possibility of a large excess.

Suppose that in Example 5.3 there had been no deaths; then there is some ambiguity in the calculation of a 95% confidence interval. The point estimate of μ is zero and, since the lower limit cannot exceed the point estimate and also cannot be negative, its only possible value is zero. There is a probability of zero that the lower limit exceeds the true value of μ, instead of the nominal value of 2½%, and a possibility is to calculate the upper limit as μU = ½χ²2, 0·05 = 3·00, rather than as ½χ²2, 0·025 = 3·69, so that the probability that the upper limit is less than the true value is approximately 5%, and the interval has approximately 95% coverage. Whilst this is logical and provides the narrowest 95% confidence interval, it seems preferable that the upper limit corresponds to 2½% in the upper tail, to give a uniform interpretation. It turns out that the former value, μU = 3·00, is the upper mid-P limit. Whilst it is impossible to find a lower limit with this interpretation, this is clear from the fact that the limit equals the point estimate and that both are at the extreme of possible values. This rationale is similar to that in our recommendation that a two-sided significance level should be double the one-sided level.
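The trial-and-error search for the mid-P limits can be automated with a root-finder; a sketch assuming SciPy, reproducing Example 5.3 and the zero-count case just discussed:

```python
from scipy.optimize import brentq
from scipy.stats import poisson

# Mid-P confidence limits for a Poisson mean: the limits are the values
# of mu for which P(X < x) + 0.5 * P(X = x) equals 0.975 (lower limit)
# or 0.025 (upper limit).
def midp_limits(x, level=0.95):
    a = (1 - level) / 2
    def midp(mu):
        return poisson.cdf(x - 1, mu) + 0.5 * poisson.pmf(x, mu)
    lower = brentq(lambda mu: midp(mu) - (1 - a), 1e-9, x + 1) if x > 0 else 0.0
    upper = brentq(lambda mu: midp(mu) - a, 1e-9, 10 * x + 20)
    return lower, upper

print(midp_limits(2))   # about (0.335, 6.61), as in Example 5.3
print(midp_limits(0))   # (0.0, 3.00): the upper mid-P limit for no events
```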