11.4 Regression in groups

Frequently data are classified into groups, and within each group a linear regression of y on x may be postulated. For example, the regression of forced expiratory volume on age may be considered separately for men in different occupational groups. Possible differences between the regression lines are then often of interest. In this section we consider comparisons of the slopes of the regression lines.

If the slopes clearly differ from one group to another, then so, of course, must the mean values of y, at least for some values of x. In Fig. 11.4(a), the slopes of the regression lines differ from group to group. The lines for groups (i) and (ii) cross. Those for (i) and (iii), and for (ii) and (iii), would also cross if extended sufficiently far, but here there is some doubt as to whether the linear regressions would remain valid outside the range of observed values of x. If the slopes do not differ, the lines are parallel, as in Fig. 11.4(b) and (c), and here it becomes interesting to ask whether, as in (b), the lines differ in their height above the x axis (which depends on the coefficient a in the equation E(y) = a + bx), or whether, as in (c), the lines coincide.

[Fig. 11.4 Differences between regression lines fitted to three groups of observations. The lines differ in slope and position in (a), differ only in position in (b), and coincide in (c); groups (i), (ii) and (iii) are drawn with different line styles.]

In practice, the fitted regression lines would rarely have precisely the same slope or position, and the question is to what extent differences between the lines can be attributed to random variation. Differences in position between parallel lines are discussed in §11.5. In this section we concentrate on the question of differences between slopes.

Suppose that there are k groups, with n_i pairs of observations in the ith group.
Denote the mean values of x and y in the ith group by x̄_i and ȳ_i, and the regression line, calculated as in §7.2, by

\[ Y_i = \bar y_i + b_i (x - \bar x_i). \tag{11.12} \]

If all the n_i are reasonably large, a satisfactory approach is to estimate the variance of each b_i by (7.16) and to ignore the imprecision in these estimates of variance. Changing the notation of (7.16) somewhat, we shall denote the residual mean square for the ith group by s_i² and the sum of squares of x about x̄_i by Σ_(i)(x − x̄_i)². Note that the parenthesized suffix i attached to the summation sign indicates summation only over the specified group i; that is,

\[ \sum_{(i)} (x - \bar x_i)^2 = \sum_{j=1}^{n_i} (x_{ij} - \bar x_i)^2. \]

To simplify the notation, denote this sum of squares about the mean of x in the ith group by

\[ S_{xxi}, \tag{11.13} \]

the sum of products of deviations by S_xyi, and so on. Then, following the method of §8.2, we write

\[ w_i = \frac{1}{\mathrm{var}(b_i)} = \frac{S_{xxi}}{s_i^2}, \tag{11.14} \]

and calculate

\[ G = \sum w_i b_i^2 - \left( \sum w_i b_i \right)^2 \Big/ \sum w_i. \tag{11.15} \]

On the null hypothesis that the true slopes β_i are all equal, G follows approximately a χ²_(k−1) distribution. High values of G indicate departures from the null hypothesis, i.e. real differences between the β_i. If G is non-significant, and the null hypothesis is tentatively accepted, the common value β of the β_i is best estimated by the weighted mean

\[ \bar b = \sum w_i b_i \Big/ \sum w_i, \tag{11.16} \]

with an estimated variance

\[ \mathrm{var}(\bar b) = 1 \Big/ \sum w_i. \tag{11.17} \]

The sampling variation of b̄ is approximately normal. It is difficult to say how large the n_i must be for this 'large-sample' approach to be used with safety. There would probably be little risk in adopting it if none of the n_i fell below 20.

A more exact treatment is available provided that an extra assumption is made: that the residual variances σ_i² are all equal. Suppose the common value is σ². We consider first the situation where k = 2, as a comparison of two slopes can be effected by use of the t distribution. For k > 2 an analysis of variance is required.
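As an illustrative sketch (not from the text), the large-sample procedure of (11.14)–(11.17) is straightforward to program. The function name below is a hypothetical choice, and SciPy is assumed for the χ² tail probability:

```python
import numpy as np
from scipy import stats

def slope_homogeneity_test(b, s2, Sxx):
    """Large-sample test that k group slopes are equal, eqns (11.14)-(11.17).

    b   : per-group slopes b_i
    s2  : per-group residual mean squares s_i^2
    Sxx : per-group sums of squares of x about the group mean, S_xxi
    """
    b, s2, Sxx = map(np.asarray, (b, s2, Sxx))
    w = Sxx / s2                                         # w_i = 1/var(b_i), (11.14)
    G = np.sum(w * b**2) - np.sum(w * b)**2 / np.sum(w)  # (11.15)
    p = stats.chi2.sf(G, df=len(b) - 1)                  # G ~ chi-square on k-1 DF
    b_bar = np.sum(w * b) / np.sum(w)                    # weighted mean slope, (11.16)
    var_b_bar = 1.0 / np.sum(w)                          # its variance, (11.17)
    return G, p, b_bar, var_b_bar
```

A high G (small p) points to real differences among the b_i; otherwise b_bar, with variance var_b_bar, estimates the common slope.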
Two groups

The residual variance σ² can be estimated, using (7.5) and (7.6), either as

\[ s_1^2 = \frac{\sum_{(1)} (y - Y_1)^2}{n_1 - 2} = \frac{S_{yy1} - S_{xy1}^2 / S_{xx1}}{n_1 - 2} \]

or by the corresponding mean square for the second group, s_2². A pooled estimate may be obtained (very much as in the two-sample t test (p. 107)), as

\[ s^2 = \frac{\sum_{(1)} (y - Y_1)^2 + \sum_{(2)} (y - Y_2)^2}{n_1 + n_2 - 4}. \tag{11.18} \]

To compare b_1 and b_2 we estimate

\[ \mathrm{var}(b_1 - b_2) = s^2 \left( \frac{1}{S_{xx1}} + \frac{1}{S_{xx2}} \right). \tag{11.19} \]

The difference is tested by

\[ t = \frac{b_1 - b_2}{\mathrm{SE}(b_1 - b_2)} \quad \text{on } n_1 + n_2 - 4 \text{ DF}, \tag{11.20} \]

the DF being the divisor in (11.18).

If a common value is assumed for the regression slope in the two groups, its value β may be estimated by

\[ b = \frac{S_{xy1} + S_{xy2}}{S_{xx1} + S_{xx2}}, \tag{11.21} \]

with a variance estimated as

\[ \mathrm{var}(b) = \frac{s^2}{S_{xx1} + S_{xx2}}. \tag{11.22} \]

Equations (11.21) and (11.22) can easily be seen to be equivalent to (11.16) and (11.17) if, in the calculation of w_i in (11.14), the separate estimates of residual variance s_i² are replaced by the common estimate s². For tests or the calculation of confidence limits for β using (11.22), the t distribution on n_1 + n_2 − 4 DF should be used. Where a common slope is accepted it would be more usual to estimate σ² as the residual mean square about the parallel lines (11.34), which would have n_1 + n_2 − 3 DF.

Example 11.3
Table 11.4 gives age and vital capacity (litres) for each of 84 men working in the cadmium industry. They are divided into three groups: A1, exposed to cadmium fumes for at least 10 years; A2, exposed to fumes for less than 10 years; B, not exposed to fumes. The main purpose of the study was to see whether exposure to fumes was associated with a change in respiratory function. However, those in group A1 must be expected to be older on the average than those in groups A2 or B, and it is well known that respiratory test performance declines with age. A comparison is therefore needed which corrects for discrepancies between the mean ages of the different groups.
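As an illustrative sketch (mine, not the book's), the two-group procedure of (11.18)–(11.20) can be programmed directly from the sums of squares and products; SciPy is assumed for the t tail probability, and the figures quoted below for the amalgamated groups A and B of Example 11.3 serve as a check (agreement with the quoted t up to rounding):

```python
import numpy as np
from scipy import stats

def compare_two_slopes(Sxx1, Sxy1, Syy1, n1, Sxx2, Sxy2, Syy2, n2):
    """t test for equality of two regression slopes, eqns (11.18)-(11.20)."""
    b1, b2 = Sxy1 / Sxx1, Sxy2 / Sxx2
    rss1 = Syy1 - Sxy1**2 / Sxx1              # residual SSq about each line
    rss2 = Syy2 - Sxy2**2 / Sxx2
    df = n1 + n2 - 4
    s2 = (rss1 + rss2) / df                   # pooled residual variance, (11.18)
    se = np.sqrt(s2 * (1 / Sxx1 + 1 / Sxx2))  # SE of b1 - b2, from (11.19)
    t = (b1 - b2) / se                        # (11.20)
    return t, df, 2 * stats.t.sf(abs(t), df)

# Groups A and B of Example 11.3: t is about -1.96 on 80 DF
t, df, p = compare_two_slopes(4397.38, -236.385, 26.5812, 40,
                              6197.16, -189.712, 20.6067, 44)
```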
We shall first illustrate the calculations for two groups by amalgamating groups A1 and A2 (denoting the pooled group by A) and comparing groups A and B. The sums of squares and products of deviations about the mean, and the separate slopes b_i, are as follows:

Group   i   n_i   S_xxi      S_xyi      S_yyi     b_i
A       1   40     4397.38   -236.385   26.5812   -0.0538
B       2   44     6197.16   -189.712   20.6067   -0.0306
Total             10594.54   -426.097   47.1879   (-0.0402)

The SSq about the regressions are

\[ \sum_{(1)} (y - Y_1)^2 = 26.5812 - (-236.385)^2/4397.38 = 13.8741 \]

and

\[ \sum_{(2)} (y - Y_2)^2 = 20.6067 - (-189.712)^2/6197.16 = 14.7991. \]

Thus,

\[ s^2 = (13.8741 + 14.7991)/(40 + 44 - 4) = 0.3584, \]

and, for the difference between b_1 and b_2, using (11.19) and (11.20),

\[ t = \frac{-0.0538 - (-0.0306)}{\sqrt{0.3584\left(\frac{1}{4397.38} + \frac{1}{6197.16}\right)}} = \frac{-0.0232}{0.0118} = -1.97 \text{ on } 80 \text{ DF}. \]

Table 11.4 Ages and vital capacities for three groups of workers in the cadmium industry. x, age last birthday (years); y, vital capacity (litres).

Group A1,            Group A2,            Group B,
exposed ≥ 10 years   exposed < 10 years   not exposed
  x     y              x     y              x     y       x     y
 39   4.62            29   5.21            27   5.29     43   4.02
 40   5.29            29   5.17            25   3.67     41   4.99
 41   5.52            33   4.88            24   5.82     48   3.86
 41   3.71            32   4.50            32   4.77     47   4.68
 45   4.02            31   4.47            23   5.71     53   4.74
 49   5.09            29   5.12            25   4.47     49   3.76
 52   2.70            29   4.51            32   4.55     54   3.98
 47   4.31            30   4.85            18   4.61     48   5.00
 61   2.70            21   5.22            19   5.86     49   3.31
 65   3.03            28   4.62            26   5.20     47   3.11
 58   2.73            23   5.07            33   4.44     52   4.76
 59   3.67            35   3.64            27   5.52     58   3.95
                      38   3.64            33   4.97     62   4.60
                      38   5.09            25   4.99     65   4.83
                      43   4.61            42   4.89     62   3.18
                      39   4.73            35   4.09     59   3.03
                      38   4.58            35   4.24
                      42   5.12            41   3.88
                      43   3.89            38   4.85
                      43   4.62            41   4.79
                      37   4.30            36   4.36
                      50   2.70            36   4.02
                      50   3.50            41   3.77
                      45   5.06            41   4.22
                      48   4.06            37   4.94
                      51   4.51            42   4.04
                      46   4.66            39   4.51
                      58   2.88            41   4.06

Sums   597  47.39    1058  125.21         1751  196.33
n       12             28                   44
Σx²  30 613         42 260               75 879
Σxy   2 280.01       4 624.93             7 623.33
Σy²     198.8903       572.4599             896.6401

The difference is very nearly significant at the 5% level. This example is continued on p. 329.

The scatter diagram in Fig.
11.5 shows the regression lines with slopes b_A and b_B fitted separately to the two groups, and also the two parallel lines with slope b.

[Fig. 11.5 Scatter diagram showing age and vital capacity of 84 men working in the cadmium industry, divided into three groups (Table 11.4); Groups A1, A2 and B are plotted with different symbols, and both the separate-slope lines and the parallel lines are shown for groups A and B.]

The steepness of the slope for group A may be partly or wholly due to a curvature in the regression: there is a suggestion that the mean value of y at high values of x is lower than is predicted by the linear regressions (see p. 330). Alternatively it may be that a linear regression is appropriate for each group, but that for group A the vital capacity declines more rapidly with age than for group B.

More than two groups

With any number of groups the pooled slope b is given by a generalization of (11.21):

\[ b = \frac{\sum_i S_{xyi}}{\sum_i S_{xxi}}. \tag{11.23} \]

Parallel lines may now be drawn through the mean points (x̄_i, ȳ_i), each with the same slope b. That for the ith group will have this equation:

\[ Y_{ci} = \bar y_i + b(x - \bar x_i). \tag{11.24} \]

The subscript c is used to indicate that the predicted value Y_ci is obtained using the common slope, b. The deviation of any observed value y from its group mean ȳ_i may be divided as follows:

\[ y - \bar y_i = (y - Y_i) + (Y_i - Y_{ci}) + (Y_{ci} - \bar y_i). \tag{11.25} \]

Again, it can be shown that the sums of squares of these components can be added in the same way. This means that

Within-Groups SSq = Residual SSq about separate lines
  + SSq due to differences between the b_i and b
  + SSq due to fitting common slope b.   (11.26)

The middle term on the right is the one that particularly concerns us now. It can be obtained by noting that the SSq due to the common slope is

\[ \left( \sum_i S_{xyi} \right)^2 \Big/ \sum_i S_{xxi}; \tag{11.27} \]

this follows directly from (7.5) and (11.23).
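The pooled-slope calculation of (11.23)–(11.24) can be sketched from raw data as follows; the function name and the (x, y)-pairs interface are my own choices, not the book's:

```python
import numpy as np

def parallel_lines(groups):
    """Common slope b (11.23) and the parallel lines Y_ci = ybar_i + b(x - xbar_i) (11.24).

    groups : list of (x, y) array pairs, one per group.
    Returns b and the intercepts a_i, so that Y_ci = a_i + b*x for group i.
    """
    Sxx = sum(np.sum((x - x.mean()) ** 2) for x, _ in groups)
    Sxy = sum(np.sum((x - x.mean()) * (y - y.mean())) for x, y in groups)
    b = Sxy / Sxx                                              # eqn (11.23)
    intercepts = [y.mean() - b * x.mean() for x, y in groups]  # rearranged (11.24)
    return b, intercepts
```

Each fitted line passes through its own group mean point (x̄_i, ȳ_i), but all share the slope b.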
From previous results,

\[ \text{Within-Groups SSq} = \sum_i S_{yyi} \tag{11.28} \]

and

\[ \text{Residual SSq about separate lines} = \sum_i \left( S_{yyi} - S_{xyi}^2 / S_{xxi} \right). \tag{11.29} \]

From (11.26) to (11.29),

\[ \text{SSq due to differences in slope} = \sum_i \frac{S_{xyi}^2}{S_{xxi}} - \frac{\left( \sum_i S_{xyi} \right)^2}{\sum_i S_{xxi}}. \tag{11.30} \]

It should be noted that (11.30) is equivalent to

\[ \sum_i W_i b_i^2 - \frac{\left( \sum_i W_i b_i \right)^2}{\sum_i W_i}, \tag{11.31} \]

where W_i = S_xxi = σ²/var(b_i), and that the pooled slope b equals Σ W_i b_i / Σ W_i, the weighted mean of the b_i. The SSq due to differences in slope is thus essentially a weighted sum of squares of the b_i about their weighted mean b, the weights being (as usual) inversely proportional to the sampling variances (see §8.2).

The analysis is summarized in Table 11.5. There is only one DF for the common slope, since the SSq is proportional to the square of one linear contrast, b. The k − 1 DF for the second line follow because the SSq measures differences between k independent slopes, b_i. The residual DF follow because there are n_i − 2 DF for the ith group and Σ_i (n_i − 2) = n − 2k. The total DF within groups are, correctly, n − k. The F test for differences between slopes follows immediately.

Table 11.5 Analysis of variance for differences between regression slopes.

                                SSq                                                DF      MSq    VR
Due to common slope             (Σ_i S_xyi)² / Σ_i S_xxi                            1
Differences between slopes      Σ_i (S_xyi² / S_xxi) − (Σ_i S_xyi)² / Σ_i S_xxi     k − 1   s_A²   F_A = s_A²/s²
Residual about separate lines   Σ_i S_yyi − Σ_i (S_xyi² / S_xxi)                    n − 2k  s²
Within groups                   Σ_i S_yyi                                           n − k

Example 11.3, continued
We now test the significance of the differences between the three slopes.
The sums of squares and products of deviations about the mean, and the separate slopes, are as follows:

Group   i   n_i   S_xxi     S_xyi      S_yyi     b_i
A1      1   12     912.25    -77.643   11.7393   -0.0851
A2      2   28    2282.71   -106.219   12.5476   -0.0465
B       3   44    6197.16   -189.712   20.6067   -0.0306
Total       84    9392.12   -373.574   44.8936   (-0.0398)

The SSq due to the common slope is

\[ (-373.574)^2 / 9392.12 = 14.8590. \]

The residual SSq about the separate lines, using (11.29), are:

\[ \sum_{(1)} (y - Y_1)^2 = 5.1310; \quad \sum_{(2)} (y - Y_2)^2 = 7.6050; \quad \sum_{(3)} (y - Y_3)^2 = 14.7991. \]

The total residual SSq about separate lines is therefore

5.1310 + 7.6050 + 14.7991 = 27.5351.

The Within-Groups SSq is 44.8936. The SSq for differences between slopes may now be obtained by subtraction, as

44.8936 − 14.8590 − 27.5351 = 2.4995.

Alternatively, it may be calculated directly as

\[ \frac{(-77.643)^2}{912.25} + \frac{(-106.219)^2}{2282.71} + \frac{(-189.712)^2}{6197.16} - \frac{(-373.574)^2}{9392.12} = 2.4995. \]

The analysis of variance may now be completed.

                      SSq       DF   MSq       VR
Common slope          14.8590    1   14.8590   42.09   (P < 0.001)
Between slopes         2.4995    2    1.2498    3.54   (P = 0.034)
Separate residuals    27.5351   78    0.3530
Within groups         44.8936   81

The differences between slopes are more significant than in the two-group analysis. The estimates of the separate slopes, with their standard errors calculated in terms of the residual MSq on 78 DF, are

b_A1 = −0.0851 ± 0.0197, b_A2 = −0.0465 ± 0.0124, b_B = −0.0306 ± 0.0075.

The most highly exposed group, A1, provides the steepest slope. Figure 11.6 shows the separate regressions as well as the three parallel lines. The doubt about linearity suggests further that a curvilinear regression might be more suitable; however, analysis with a quadratic regression line (see §11.1) shows the non-linearity to be quite non-significant.

The analysis of variance test can, of course, be applied even for k = 2. The results will be entirely equivalent to the t test described at the beginning of this section, the value of F being, as usual, the square of the corresponding value of t.
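As an illustrative check (my own sketch, with SciPy assumed for the P value), the whole of Table 11.5 can be computed from the per-group sums of squares and products; applying it to the three groups reproduces the analysis above up to rounding:

```python
import numpy as np
from scipy import stats

def slope_anova(Sxx, Sxy, Syy, n):
    """ANOVA for differences between k regression slopes (Table 11.5)."""
    Sxx, Sxy, Syy, n = map(np.asarray, (Sxx, Sxy, Syy, n))
    k, N = len(n), n.sum()
    ss_common = Sxy.sum() ** 2 / Sxx.sum()      # eqn (11.27), 1 DF
    resid = np.sum(Syy - Sxy**2 / Sxx)          # eqn (11.29), N - 2k DF
    within = Syy.sum()                          # eqn (11.28), N - k DF
    ss_slopes = within - ss_common - resid      # eqn (11.30), k - 1 DF
    F = (ss_slopes / (k - 1)) / (resid / (N - 2 * k))
    p = stats.f.sf(F, k - 1, N - 2 * k)
    return ss_common, ss_slopes, resid, F, p

# Groups A1, A2 and B of Example 11.3
ss_common, ss_slopes, resid, F, p = slope_anova(
    Sxx=[912.25, 2282.71, 6197.16],
    Sxy=[-77.643, -106.219, -189.712],
    Syy=[11.7393, 12.5476, 20.6067],
    n=[12, 28, 44],
)
# ss_common ~ 14.8590, ss_slopes ~ 2.4995, resid ~ 27.5351,
# F ~ 3.54 on 2 and 78 DF, p ~ 0.034
```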
[Fig. 11.6 Parallel regression lines and lines with separate slopes, for the cadmium workers of Fig. 11.5.]

11.5 Analysis of covariance

If, after an analysis of the type described in the last section, there is no strong reason for postulating differences between the slopes of the regression lines in the various groups, the following questions arise. What can be said about the relative position of parallel regression lines? Is there good reason to believe that the true lines differ in position, as in Fig. 11.4(b), or could they coincide, as in Fig. 11.4(c)? What sampling error is to be attached to an estimate of the difference in positions of lines for two particular groups? The set of techniques associated with these questions is called the analysis of covariance.

Before describing technical details, it may be useful to note some important differences in the purposes of the analysis of covariance and in the circumstances in which it may be used.

1 Main purpose.
(a) To correct for bias. If it is known that changes in x affect the mean value of y, and that the groups under comparison differ in their values of x, it will follow that some of the differences between the values of y can be ascribed partly to differences between the xs. We may want to remove this effect as far as possible. For example, if y is forced expiratory volume (FEV) and x is age, a comparison of mean FEVs for men in different occupational groups may be affected by differences in their mean ages. A comparison would be desirable of the mean FEVs at the same age. If the regressions are linear and parallel, this means a comparison of the relative position of the regression lines.
(b) To increase precision.
Even if the groups have very similar values of x, precision in the comparison of values of y can be increased by using the residual variation of y about regression on x rather than by analysing the ys alone.

2 Type of investigation.
(a) Uncontrolled study. In many situations the observations will be made on units which fall naturally into the groups in question, with no element of controlled allocation. Indeed, it will often be this lack of control which leads to the bias discussed in 1(a).
(b) Controlled study. In a planned experiment, in which experimental units are allocated randomly to the different groups, the differences between values of x in the various groups will be no greater in the long run than would be expected by sampling theory. Of course, there will occasionally be large fortuitous differences in the xs; it may then be just as important to correct for their effect as it would be in an uncontrolled study. In any case, even with very similar values of x, the extra precision referred to in 1(b) may well be worth acquiring.

[...]

[Fragment of a later table, too scrambled in this extract to reconstruct: recovery-time data for three groups of patients (Group A, 'minor' non-thoracic; Group B, 'major' non-thoracic; Group C, thoracic), with columns x1, x2 and y and summary values for y of 22.70 (SD 16.29).]

... returned to 100 mmHg. The data shown here relate to one of the two drugs used in the trial. The recovery time is very variable, and a question of interest is the extent to which it [...]
... shows the contribution of each factor in the presence of all others except interactions involving that factor; in Types III and IV, any interactions defined in the model are retained even for the testing of main effects. In the procedure recommended on p. 353 to distinguish between different forms of multiple regression within groups, we noted that the order of introducing the variables is critical. This [...]

[Fragment of the recovery-time table: x1, log quantity of drug; x2, mean level of systolic blood pressure during hypotension (mmHg); y, recovery time (min); with columns for x1, x2, y, the fitted value Y and the residual y − Y. The entries for Group A, 'minor' non-thoracic (n_A = 20), are too scrambled in this extract to reconstruct.]
... interactions in the presence of main effects, and either testing main effects without interactions (if the latter can safely be ignored) or not testing main effects at all (if interactions are regarded as being present), since it would rarely be useful to correct a main effect for an interaction. Type I SSq are also useful since they allow an effect to be corrected for any other effects as required in the [...]

                                                         SSq       DF   MSq      VR
(1) Due to regression on x                               17.4446    1
(2) Due to introduction of z1 and z2                      0.1617    2   0.0808   0.22
(3) Due to regression on x, z1 and z2                    17.6063    3
(4) Residual about regression on x, z1 and z2            30.0347   80   0.3754
(5) Due to introduction of w1 and w2                      2.4994    2   1.2497   3.54
(6) Due to regression on x, z1, z2, w1 and w2            20.1057    5
(7) Residual about regression on x, z1, z2, w1 and w2    27.5352   78   0.3530
(8) Total                                                47.6410   83

Apart from rounding errors, lines (5) and (7) agree with the analysis of variance in Example 11.3 on p. 330. [...] coefficients are (using c to represent estimates of γ)

b0 = 5.6803
d1 = 2.5031 ± 1.0418
d2 = 0.5497 ± 0.5759
b1 = −0.0306 ± 0.0075
c1 = −0.0545 ± 0.0211
c2 = −0.0159 ± 0.0145.

The coefficients c1 and c2 represent estimates of the differences in the slope on age between Groups A1 and A2, respectively, and Group B. As noted earlier, there are significant differences between the slopes (F = 3.54 with 2 and 78 [...] −0.0306 − 0.0545 = −0.0851, −0.0306 − 0.0159 = −0.0465, and −0.0306, respectively, agreeing with the values found earlier from fitting the regressions for each group separately.

11.8 Multiple regression in the analysis of non-orthogonal data

The analysis of variance was used in Chapter 9 to study the separate effects of various factors, for data classified in designs exhibiting some degree of balance. These so-called
4Á62 29 5 21 27 5 29 43 4Á02 40 5 29 29 5 17 25 3Á67 41 4Á99 41 5 52 33 4Á88 24 5 82 48 3Á86 41 3Á71 32 4 50 32 4Á77 47 4Á68 45 4Á02 31 4Á47 23 5 71 53 4Á74 49 5 09 29 5 12 25 4Á47 49 3Á76 52 2Á70. of observations. The lines differ in slope and position in (a), differ only in position in (b), and coincide in (c). (ÐÐ) (i); (- - -) (ii); (- -) (iii). 322 Modelling continuous data Eya