Handbook of Industrial Automation: Chapter 4

Chapter 4.1
Regression

Richard Brook
Off Campus Ltd., Palmerston North, New Zealand

Denny Meyer
Massey University-Albany, Palmerston North, New Zealand

1.1 FITTING A MODEL TO DATA

1.1.1 What is Regression?

1.1.1.1 Historical Note

Regression is, arguably, the most commonly used technique in applied statistics. It can be used with data that are collected in a very structured way, such as sample surveys or experiments, but it can also be applied to observational data. This flexibility is its strength but also its weakness, if used in an unthinking manner.

The history of the method can be traced to Sir Francis Galton, who published in 1885 a paper with the title "Regression toward mediocrity in hereditary stature." In essence, he measured the heights of parents and found the median height of each mother-father pair and compared these medians with the height of their adult offspring. He concluded that those with very tall parents were generally taller than average but were not as tall as the median height of their parents; those with short parents tended to be below average height but were not as short as the median height of their parents. Female offspring were combined with males by multiplying female heights by a factor of 1.08.

Regression can be used to explain relationships or to predict outcomes. In Galton's data, the median height of parents is the explanatory or predictor variable, which we denote by X, while the response or predicted variable is the height of the offspring, denoted by Y. While the individual value of Y cannot be forecast exactly, the average value can be for a given value of the explanatory variable, X.

1.1.1.2 Brief Overview

Uppermost in the minds of the authors of this chapter is the desire to relate some basic theory to the application and practice of regression. In Sec. 1.1, we set out some terminology and basic theory. Section 1.2 examines statistics and graphs to explore how well the regression model fits the data. Section 1.3 concentrates on variables and how to select a small but effective model. Section 1.4 looks to individual data points and seeks out peculiar observations.

We will attempt to relate the discussion to some data sets which are shown in Sec. 1.5. Note that data may have many different forms, and the questions asked of the data will vary considerably from one application to another. The variety of types of data is evident from the description of some of these data sets.

Example 1. Pairs (Triplets, etc.) of Variables (Sec. 1.5.1): The Y-variable in this example is the heat developed in mixing the components of certain cements, which have varying amounts of four X-variables or chemicals in the mixture. There is no information about how the various amounts of the X-variables have been chosen. All variables are continuous variables.

Example 2. Grouping Variables (Sec. 1.5.2): Qualitative variables are introduced to indicate groups allocated to different safety programs. These qualitative variables differ from other variables in that they only take the values of 0 or 1.

Example 3. A Designed Experiment (Sec. 1.5.3): In this example, the values of the X-variables have been set in advance, as the design of the study is structured as a three-factor composite experimental design. The X-variables form a pattern chosen to ensure that they are uncorrelated.

1.1.1.3 What Is a Statistical Model?
A statistical model is an abstraction from the actual data and refers to all possible values of Y in the population and the relationship between Y and the corresponding X in the model. In practice, we only have sample values, y and x, so that we can only check to ascertain whether the model is a reasonable fit to these data values.

In some areas of science there are laws, such as the relationship E = mc², in which it is assumed that the model is an exact relationship. In other words, this law is a deterministic model in which there is no error. In statistical models, we assume that the model is stochastic, by which we mean that there is an error term, e, so that the model can be written as

Y = f(X = x) + e

In a regression model, f(·) indicates a linear function of the X-terms. The error term is assumed to be random with a mean of zero and a variance which is constant, that is, it does not depend on the value taken by the X-term. It may reflect error in the measurement of the Y-variable or the effect of variables or conditions not defined in the model. The X-variable, on the other hand, is assumed to be measured without error.

In Galton's data on heights of parents and offspring, the error term may be due to measurement error in obtaining the heights or the natural variation that is likely to occur in the physical attributes of offspring compared with their parents.

There is a saying that "No model is correct but some are useful." In other words, no model will exactly capture all the peculiarities of a data set, but some models will fit better than others.

1.1.2 How to Fit a Model

1.1.2.1 Least-Squares Method

We consider Example 1, but concentrate on the effect of the first variable, x1, which is tricalcium aluminate, on the response variable, which is the heat generated. The plot of heat on tricalcium aluminate, with the least-squares regression line, is shown in Fig. 1. The least-squares line is shown by the solid line and can be written as

ŷ = f(X = x1) = a + b·x1 = 81.5 + 1.87x1     (1)

where ŷ is the predicted value of y for the given value x1 of the variable X1.

Figure 1  Plot of heat, y, on tricalcium aluminate, x1.

All the points represented by (x1, y) do not fall on the line but are scattered about it. The vertical distance between each observation, y, and its respective predicted value, ŷ, is called the residual, which we denote by e. The residual is positive if the observed value of y falls above the line and negative if below it. Notice in Sec. 1.5.1 that for the fourth row in the table, the fitted value is 102.04 and the residual (shown by e in Fig. 1) is -14.44, which corresponds to one of the four points below the regression line, namely the point (x1, y) = (11, 87.6).
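As a quick numerical check of Eq. (1), the short Python sketch below recomputes the fitted value and residual for the point (x1, y) = (11, 87.6) just quoted, using the more precise coefficients 81.479 and 1.8687 reported later in the printout of Sec. 1.2.1.3; with the rounded values 81.5 and 1.87 the answers agree to within rounding.

# Fitted value and residual for the fourth cement observation (x1 = 11, y = 87.6).
# Coefficients are taken from the MINITAB printout in Sec. 1.2.1.3.
a, b = 81.479, 1.8687

x1, y = 11.0, 87.6
y_hat = a + b * x1            # fitted value
e = y - y_hat                 # residual

print(f"fitted value: {y_hat:.2f}")   # ~102.03, quoted as 102.04
print(f"residual:     {e:.2f}")       # ~-14.43, quoted as -14.44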
At each of the x1 values in the data set we assume that the population values of Y can be written as a linear model, by which we mean that the model is linear in the parameters. For convenience, we drop the subscript in the following discussion:

Y = α + βx + ε     (2)

More correctly, Y should be written as Y | X = x, which is read as "Y given X = x." Notice that a model, in this case a regression model, is a hypothetical device which explains relationships in the population for all possible values of Y for given values of X.

The error (or deviation) term, ε, is assumed to have, for each point in the sample, a population mean of zero and a constant variance of σ², so that for X = a particular value x, Y has the following distribution:

Y | x is distributed with mean α + βx and variance σ²

It is also assumed that for any two points in the sample, i and j, the deviations εi and εj are uncorrelated.

The method of least squares uses the sample of n (= 13 here) values of x and y to find the least-squares estimates, a and b, of the population parameters α and β by minimizing the deviations. More specifically, we seek to minimize the sum of squares of e, which we denote by S², which can be written as

S² = Σe² = Σ[y − f(x)]² = Σ[y − (a + bx)]²     (3)

The symbol Σ indicates the summation over the n = 13 points in the sample.

1.1.2.2 Normal Equations

The values of the coefficients a and b which minimize S² can be found by solving the following, which are called normal equations. We do not prove this statement, but the reader may refer to a textbook on regression, such as Brook and Arnold [1].

Σ(y − a − bx) = 0,   or   na + bΣx = Σy
Σx(y − a − bx) = 0,   or   aΣx + bΣx² = Σxy     (4)

By simple arithmetic, the solutions of these normal equations are

a = ȳ − b·x̄
b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²     (5)

Note:

1. The mean of y is Σy/n, or ȳ. Likewise the mean of x is x̄.
2. b can be written as Sxy/Sxx, which can be called the sum of cross-products of x and y divided by the sum of squares of x.
3. From Sec. 1.5.1, we see that the mean of x is 7.5 and of y is 95.4. The normal equations become

   13a + 97b = 1240.5
   97a + 1139b = 10,032     (6)

   Simple arithmetic gives the solutions as a = 81.5 and b = 1.87.
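The 2 x 2 system of Eq. (6) can also be solved directly; a minimal NumPy sketch using the sums quoted above reproduces the same estimates.

import numpy as np

# Normal equations (6) for heat (y) on tricalcium aluminate (x1):
#   13 a +   97 b = 1240.5
#   97 a + 1139 b = 10032
A = np.array([[13.0,   97.0],
              [97.0, 1139.0]])
rhs = np.array([1240.5, 10032.0])

a, b = np.linalg.solve(A, rhs)
print(f"a = {a:.3f}, b = {b:.3f}")   # approximately a = 81.5, b = 1.87
# Equivalently, b = Sxy / Sxx, the cross-product sum divided by the sum of squares of x.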
1.1.3 Simple Transformations

1.1.3.1 Scaling

The size of the coefficients in a fitted model will depend on the scales of the variables, predicted and predictor. In the cement example, the X variables are measured in grams. Clearly, if these variables were changed to kilograms, the values of the X would be divided by 1000 and, consequently, the sizes of the least-squares coefficients would be multiplied by 1000. In this example, the coefficients would be large and it would be clumsy to use such a transformation.

In some examples, it is not clear what scales should be used. To measure the consumption of petrol (gas), it is usual to quote the number of miles per gallon, but for those countries which use the metric system it is the inverse which is often quoted, namely the number of liters per 100 km travelled.

1.1.3.2 Centering of Data

In some situations, it may be an advantage to change x to its deviation from its mean, that is, x − x̄. The fitted equation becomes

ŷ = a + b(x − x̄)

but these values of a and b may differ from Eq. (1). Notice that the sum of the (x − x̄) terms is zero, as

Σ(x − x̄) = Σx − Σx̄ = n·x̄ − n·x̄ = 0

The normal equations become, following Eq. (4),

na + 0 = Σy
0 + bΣ(x − x̄)² = Σ(x − x̄)y     (7)

Thus, a = Σy/n = ȳ, which differs somewhat from Eq. (5), but

b = Σ(x − x̄)y / Σ(x − x̄)²

which can be shown to be the same as in Eq. (5). The fitted line is

ŷ = 95.42 + 1.87(x − x̄)

If the y variable is also centered and the two centered variables are denoted by y and x, the fitted line is

y = 1.87x

The important point of this section is that the inclusion of a constant term in the model leads to the same coefficient of the X term as transforming X to be centered about its mean. In practice, we do not need to perform this transformation of centering, as the inclusion of a constant term in the model leads to the same estimated coefficient for the X variable.

1.1.4 Correlations

Readers will be familiar with the correlation coefficient between two variables. In particular, the correlation between y and x is given by

r_xy = Sxy / √(Sxx·Syy)     (8)

There is a duality in this formula in that interchanging x and y would not change the value of r. The relationship between correlation and regression is that the coefficient b in the simple regression line above can be written as

b = r·√(Syy/Sxx)     (9)

In regression, the duality of x and y does not hold. A regression line of y on x will differ from a regression line of x on y.

1.1.5 Vectors

1.1.5.1 Vector Notation

The data for the cement example (Sec. 1.5) appear as equal-length columns. This is typical of data sets in regression analysis. Each column could be considered as a column vector with 13 components. We focus on the three variables y (heat generated), ŷ (FITS1 = predicted values of y), and e (RESI1 = residuals). Notice that we represent a vector by bold type: y, ŷ, and e.

The vectors simplify the columns of data to two aspects, the lengths and directions of the vectors and, hence, the angles between them. The length of a vector can be found by the inner, or scalar, product. The reader will recall that the inner product of y is represented as y·y or yᵀy, which is simply the sum of the squares of the individual elements. Of more interest is the inner product of ŷ with e, which can be shown to be zero. These two vectors are said to be orthogonal or "at right angles," as indicated in Fig. 2.

Figure 2  Relationship between y, ŷ, and e.

We will not go into many details about the geometry of the vectors, but it is usual to talk of ŷ being the projection of y in the direction of x. Similarly, e is the projection of y in a direction orthogonal to x, orthogonal being a generalization to many dimensions of "at right angles to," which becomes clear when the angle θ is considered.

Notice that e and ŷ are "at right angles" or "orthogonal." It can be shown that a necessary and sufficient condition for this to be true is that eᵀŷ = 0. In vector terms, the predicted value of y is ŷ = a1 + bx, and the fitted model is

y = a1 + bx + e     (10)

Writing the constant term as a column vector of 1's paves the way for the introduction of matrices in Sec. 1.1.7.

1.1.5.2 Vectors - Centering and Correlations

In this section, we write the vector terms in such a way that the components are deviations from the mean; we have

ŷ = bx

The sums of squares of y, ŷ, and e are

yᵀy = Syy = (78.5 − 95.42)² + (74.3 − 95.42)² + ··· + (109.4 − 95.42)² = 2715.8
ŷᵀŷ = Sŷŷ = 1450.1
eᵀe = See = 1265.7

As we would expect from a right-angled triangle and Pythagoras' theorem,

yᵀy = ŷᵀŷ + eᵀe

We discuss this further in Sec. 1.2.1.5 on ANOVA, the analysis of variance.

The length of the vector y, written as |y|, is the square root of yᵀy, namely 52.11. Similarly, the lengths of ŷ and e are 38.08 and 35.57, respectively. The inner product of y with the vector of fitted values, ŷ, is

yᵀŷ = Σŷᵢyᵢ = 1450.08
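These figures can be verified from the quoted sums of squares alone; a small Python sketch, using the reported values rather than the raw data:

import numpy as np

# Sums of squares quoted in Sec. 1.1.5.2 for the centered cement data
S_yy  = 2715.8   # y'y   (total)
S_fit = 1450.1   # yhat'yhat
S_ee  = 1265.7   # e'e

# Pythagoras: y'y = yhat'yhat + e'e
print(S_fit + S_ee)                    # 2715.8

# Vector lengths |y|, |yhat|, |e|
print(np.sqrt([S_yy, S_fit, S_ee]))    # ~[52.11, 38.08, 35.58]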
The angle θ in Fig. 2 has a cosine given by

cos θ = yᵀŷ / (|y|·|ŷ|) = √(1450.1/2715.8) = 0.73     (11)

As y and x are centered, the correlation coefficient of y on x can be shown to be cos θ.

1.1.6 Residuals and Fits

We return to the actual values of the X and Y variables, not the centered values as above. Figure 2 provides more insight into the normal equations, as the least-squares solution to the normal equations occurs when the vector of residuals is orthogonal to the vector of predicted values. Notice that ŷᵀe = 0 can be expanded to

(a1 + bx)ᵀe = a·1ᵀe + b·xᵀe = 0     (12)

This condition will be true if each of the two parts is equal to zero, which leads to the normal equations, Eq. (4), above. Notice that the last column of Sec. 1.5.1 confirms that the sum of the residuals is zero. It can be shown that the corollary of this is that the sum of the observed y is the same as the sum of the fitted y values; if the sums are equal the means are equal, and Sec. 1.5.1 shows that they are both 95.4. The second normal equation in Eq. (4) could be checked by multiplying the components of the two columns marked x1 and RESI1 and then adding the result.

In Fig. 3, we would expect the residuals to approximately fall into a horizontal band on either side of the zero line. If the data satisfy the assumptions, we would expect that there would not be any systematic trend in the residuals. At times, our eyes may deceive us into thinking there is such a trend when in fact there is not one. We pick this topic up again later.

Figure 3  Plot of residuals against fitted values for y on x1.

1.1.7 Adding a Variable

1.1.7.1 Two-Predictor Model

We consider the effect of adding the second term to the model:

Y = β0·x0 + β1·x1 + β2·x2 + ε

The fitted regression equation becomes

y = b0·x0 + b1·x1 + b2·x2 + e

To distinguish between the variables, subscripts have been reintroduced. The constant term has been written as b0·x0 and, without loss of generality, x0 = 1. The normal equations follow a similar pattern to those indicated by Eq. (4), namely,

Σ(b0 + b1x1 + b2x2) = Σy
Σx1(b0 + b1x1 + b2x2) = Σx1·y
Σx2(b0 + b1x1 + b2x2) = Σx2·y     (13)

These yield

13b0 + 97b1 + 626b2 = 1240.5
97b0 + 1139b1 + 4922b2 = 10,032
626b0 + 4922b1 + 33,050b2 = 62,027.8     (14)

Note that the entries involving only the constant term and x1 are the same as those in the normal equations of the model with one predictor variable. It is clear that the solutions for b0 and b1 will differ from those of a and b in the normal equations, Eq. (6). It can be shown that the solutions are: b0 = 52.6, b1 = 1.47, and b2 = 0.662.
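A minimal NumPy sketch that solves the system in Eq. (14) and reproduces the quoted two-predictor coefficients:

import numpy as np

# Normal equations (14) for the model with x1 and x2
A = np.array([[  13.0,   97.0,   626.0],
              [  97.0, 1139.0,  4922.0],
              [ 626.0, 4922.0, 33050.0]])
rhs = np.array([1240.5, 10032.0, 62027.8])

b0, b1, b2 = np.linalg.solve(A, rhs)
print(f"b0 = {b0:.1f}, b1 = {b1:.2f}, b2 = {b2:.3f}")
# b0 = 52.6, b1 = 1.47, b2 = 0.662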
Note:

1. By adding the second predictor variable x2, the coefficient for the constant term has changed from a = 81.5 to b0 = 52.6. Likewise the coefficient for x1 has changed from 1.87 to 1.47. The structure of the normal equations gives some indication why this is so.
2. The coefficients would not change in value if the variables were orthogonal to each other. For example, if x0 were orthogonal to x2, Σx0·x2 would be zero. This would occur if x2 were in the form of deviations from its mean. Likewise, if x1 and x2 were orthogonal, Σx1·x2 would be zero.
3. What is the meaning of the coefficients, for example b1? From the fitted regression equation, one is tempted to say that "b1 is the increase in y when x1 increases by 1." From 2, we have to add to this the words "in the presence of the other variables in the model." Hence, if you change the variables, the meaning of b1 also changes.

When other variables are added to the model, the formulas for the coefficients become very clumsy and it is much easier to extend the notation of vectors to that of matrices. Matrices provide a clear, generic approach to the problem.

1.1.7.2 Vectors and Matrices

As an illustration, we use the cement data in which there are four predictor variables. The model is

y = β0·x0 + β1·x1 + β2·x2 + β3·x3 + β4·x4 + ε

The fitted regression equation can be written in vector notation,

y = b0·x0 + b1·x1 + b2·x2 + b3·x3 + b4·x4 + e     (15)

The data are displayed in Sec. 1.5.1. Notice that each column vector has n = 13 entries and there are k = 5 vectors. As a block of five vectors, the predictors can be written as an n × k = 13 × 5 matrix, X. The fitted regression equation is

y = Xb + e     (16)

It can be shown that the normal equations are

XᵀXb = Xᵀy     (17)

Expanded in vector terms,

x0ᵀx0·b0 + x0ᵀx1·b1 + ··· + x0ᵀx4·b4 = x0ᵀy
x1ᵀx0·b0 + x1ᵀx1·b1 + ··· + x1ᵀx4·b4 = x1ᵀy
...
x4ᵀx0·b0 + x4ᵀx1·b1 + ··· + x4ᵀx4·b4 = x4ᵀy

These yield the normal equations

13b0 + 97b1 + 626b2 + 153b3 + 390b4 = 1240.5
97b0 + 1139b1 + 4922b2 + 769b3 + 2620b4 = 10,032
626b0 + 4922b1 + 33,050b2 + 7201b3 + 15,739b4 = 62,027.8
153b0 + 769b1 + 7201b2 + 2293b3 + 4628b4 = 13,981.5
390b0 + 2620b1 + 15,739b2 + 4628b3 + 15,062b4 = 34,733.3

Notice the symmetry in the coefficients of the bi. The matrix solution is

b = (XᵀX)⁻¹Xᵀy,   bᵀ = (62.4, 1.55, 0.510, 0.102, −0.144)     (18)

With the solution to the normal equations written as above, it is easy to see that the least-squares estimates of the parameters are weighted means of all the y values in the data. The estimates can be written as

bi = Σ wi·yi

where the weights wi are functions of the x values. The regression coefficients reflect the strengths and weaknesses of means. The strengths are that each point in the data set contributes to each estimate, but the weaknesses are that one or two unusual values in the data set can have a disproportionate effect on the resulting estimates.

1.1.7.3 The Projection Matrix, P

From the matrix solution, the fitted regression equation becomes

ŷ = Xb = X(XᵀX)⁻¹Xᵀy, or Py     (19)

P = X(XᵀX)⁻¹Xᵀ is called the projection matrix and it has some nice properties, namely:

1. Pᵀ = P, that is, it is symmetrical.
2. PᵀP = P, that is, it is idempotent.
3. The residual vector e = y − ŷ = (I − P)y. I is the identity matrix with diagonal elements being 1 and the off-diagonal elements being 0.
4. From the triangle diagram, e is orthogonal to ŷ, which is easy to see as eᵀŷ = yᵀ(I − P)ᵀPy = yᵀ(P − PᵀP)y = 0.
5. P is the projection matrix onto X and ŷ is the projection of y onto X.
6. I − P is the projection matrix orthogonal to X and the residual, e, is the projection of y onto a direction orthogonal to X.

The vector diagram of Fig. 2 becomes Fig. 4.

Figure 4  Projections of y in terms of P.
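These results can be checked numerically. The sketch below first solves the five-term normal equations above, then illustrates the listed properties of P on a small made-up design matrix; the raw 13 x 5 cement matrix of Sec. 1.5.1 is not reproduced in this excerpt, so the random X in the second part is purely illustrative.

import numpy as np

# --- Solve the full normal equations X'X b = X'y (Sec. 1.1.7.2) ---
XtX = np.array([[  13.0,    97.0,   626.0,   153.0,   390.0],
                [  97.0,  1139.0,  4922.0,   769.0,  2620.0],
                [ 626.0,  4922.0, 33050.0,  7201.0, 15739.0],
                [ 153.0,   769.0,  7201.0,  2293.0,  4628.0],
                [ 390.0,  2620.0, 15739.0,  4628.0, 15062.0]])
Xty = np.array([1240.5, 10032.0, 62027.8, 13981.5, 34733.3])

b = np.linalg.solve(XtX, Xty)
print(np.round(b, 3))   # ~[62.4, 1.551, 0.510, 0.102, -0.144], as in Eq. (18)

# --- Projection-matrix properties on a small, made-up design matrix ---
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(6), rng.normal(size=6)])   # constant + one predictor
P = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(P, P.T))        # symmetric:  P' = P
print(np.allclose(P @ P, P))      # idempotent: P P = P
y = rng.normal(size=6)
e = (np.eye(6) - P) @ y           # residual vector, (I - P) y
print(np.allclose(e @ (P @ y), 0.0))   # residuals orthogonal to fitted values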
1.1.8 Normality

1.1.8.1 Assumptions About the Models

In the discussion so far, we have seen some of the relationships and estimates which result from the least-squares method, which are dependent on assumptions about the error, or deviation, term in the model. We now add a further restriction to these assumptions, namely that the error term, ε, is distributed normally. This allows us to find the distribution of the residuals, find confidence intervals for certain estimates, and carry out hypothesis tests on them.

The addition of the assumption of normality adds to the concept of correlation, as a zero correlation coefficient between two variables will then mean that they are statistically independent.

1.1.8.2 Distributions of Statistics

The variance of the constant term is

Var(b0) = σ²(1/n + x̄²/Sxx)

and the variance of the coefficient of the x variable is

Var(b1) = σ²/Sxx     (20)

We are usually more interested in the coefficient of the x term. The confidence interval (CI) for this coefficient (β1) is given by

CI = b1 ± t(n−2)·√(s²/Sxx)     (21)

1.1.8.3 Confidence Interval for the Mean

The 95% confidence interval for the predicted value, ŷ, when x = x0 is given by

ŷ0 ± t(n−2)·s·√(1/n + (x0 − x̄)²/Sxx)     (22)

Note that the width of the confidence interval is smallest when the chosen x0 is close to the mean, x̄, but the width diverges the further the x0 is from the mean. A more important point is the danger of extrapolating outside of the range of values of X, as the model may not be appropriate outside these limits. This confidence interval is illustrated in Fig. 5 using the cement data.

Figure 5  Confidence and prediction intervals.

1.1.8.4 Prediction Interval for a Future Value

At times one wants to forecast the value of y for a given single future value x0 of x. This prediction interval for a future single point is wider than the confidence interval of the mean, as the variance of a single value of y around the mean is σ². In fact, the "1" under the square root symbol may dominate the other terms. The formula is given by

ŷ0 ± t(n−2)·s·√(1 + 1/n + (x0 − x̄)²/Sxx)     (23)
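A Python sketch of these interval formulas, using the summary figures available in this excerpt (n = 13, Σx = 97, Σx² = 1139 from Sec. 1.1.2.2, and s = 10.73 from the printout in Sec. 1.2.1.4); the evaluation point x0 = 8 and the critical value t(0.975, 11) ≈ 2.201 are supplied here purely for illustration.

import numpy as np

n, sum_x, sum_x2 = 13, 97.0, 1139.0
xbar = sum_x / n
Sxx = sum_x2 - sum_x**2 / n          # ~415.2

b1, s = 1.8687, 10.73                # slope and residual standard error
t_crit = 2.201                       # t(0.975, df = n - 2 = 11)

# 95% CI for the slope, Eq. (21)
half = t_crit * s / np.sqrt(Sxx)
print(f"slope CI: {b1 - half:.2f} to {b1 + half:.2f}")        # ~0.71 to 3.03

# 95% CI for the mean response and prediction interval at x0 = 8, Eqs. (22)-(23)
x0 = 8.0
y0 = 81.479 + b1 * x0
ci_half = t_crit * s * np.sqrt(1/n + (x0 - xbar)**2 / Sxx)
pi_half = t_crit * s * np.sqrt(1 + 1/n + (x0 - xbar)**2 / Sxx)
print(f"mean CI:    {y0 - ci_half:.1f} to {y0 + ci_half:.1f}")   # ~89.8 to 103.0
print(f"prediction: {y0 - pi_half:.1f} to {y0 + pi_half:.1f}")   # ~71.9 to 120.9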
1.1.9 Conclusions

Regression is a widely used and flexible tool, applicable to many situations. The method of least squares is the most commonly used in regression. The resulting estimates are weighted means of the response variable at each data point. Means may not be resistant to extreme values of either X or y.

The normal, Gaussian, distribution is closely linked to least squares, which facilitates the use of the standard statistical methods of confidence intervals and hypothesis tests. In fitting a model to data, an important result of the least-squares approach is that the vector of fitted or predicted values is orthogonal to the vector of residuals. With the added assumption of normality, the residuals are statistically independent of the fitted values.

The data appear as columns which can be considered as vectors. Groups of X vectors can be manipulated as a matrix. A projection matrix is a useful tool in understanding the relationships between the observed values of y, the predicted y, and the residuals.

1.2 GOODNESS OF FIT OF THE MODEL

1.2.1 Regression Printout from MINITAB

1.2.1.1 Regression with One or More Predictor Variables

In this section, comments are made on the printout from a MINITAB program on the cement data, using the heat evolved as y and the number of grams of tricalcium aluminate as x. This is extended to two or more variables.

1.2.1.2 Regression Equation

The regression equation is
y = 81.5 + 1.87 x1

In keeping with the terminology we are using in this chapter, the y above should be ŷ. Alternatively, if a residual term e is added to the equation, we have termed this "the fitted regression equation." With one predictor variable, the fitted equation will represent a line.

We have noted in Sec. 1.1.7.1 that the estimated coefficients will vary depending on the other variables in the model. With the first two variables in the model, the fitted regression equation represents a plane and the least-squares solution is

ŷ = 52.6 + 1.47x1 + 0.662x2

In vector terms, it is clear that x1 is not orthogonal to x2.

1.2.1.3 Distribution of the Coefficients

Predictor   Coef     StDev    T       P
Constant    81.479   4.927    16.54   0.000
x1          1.8687   0.5264   3.55    0.005

The formula for the standard deviation (also called the standard error by some authors) of the constant term and of the x1 term is given in Sec. 1.1.8.2. The T is the t-statistic = (estimator − hypothesized parameter)/standard deviation. The hypothesized parameter is its value under the null hypothesis, which is zero in this situation. The degrees of freedom are the same as those for the error or residual term.

One measure of the goodness of fit of the model is whether the values of the estimated coefficients, and hence the values of the respective t-statistics, could have arisen by chance, and this is indicated by the p-values. The p-value is the probability of obtaining a more extreme t-value by chance. As the p-values here are small, we conclude that the large t-values are due to the presence of x1 in the model rather than to chance. In other words, as the probabilities are small (< 0.05, which is the common level used), both the constant and b1 are significant at the 5% level.

1.2.1.4 R-Squared and Standard Error

S = 10.73   R-Sq = 53.4%   R-Sq(adj) = 49.2%

S = 10.73 is the standard error of the residual term. We would prefer to use lower case, s, as it is an estimate of the S in the S² of Eq. (3).

R-Sq (short for R-squared) is the coefficient of determination, R², which indicates the proportion of the variation of Y explained by the regression equation:

R² = Sŷŷ/Syy,   and recall that Syy = Σ(y − ȳ)²

It can be shown that R is the correlation coefficient between ŷ and y, provided that the x and y terms have been centered. In terms of the projection matrices,

R² = Σŷᵢ² / Σyᵢ² = yᵀPy / yᵀy     (24)

R² lies between 0, if the regression equation does not explain any of the variation of Y, and 1, if the regression equation explains all of the variation. Some authors and programs such as MINITAB write R² as a percentage between 0 and 100%.

In this case, R² is only about 50%, which does not indicate a good fit. After all, this means that 50% of the variation of y is unaccounted for. As more variables are added to the model, the value of R² will increase, as shown in the following table, in which the variables x1, x2, x3, and x4 were sequentially added to the model.

Number of predictor variables    1      2      3      4
R²                               53.4   97.9   98.2   98.2
R² (adjusted)                    49.2   97.4   97.6   97.4
Increase in R², ΔR²              --     44.5   0.3    0.0

Some authors and computer programs consider the increase in R², denoted by ΔR². In this example, x2 adds a considerable amount to R², but the next two variables add very little. In fact, x4 appears not to add any prediction power to the model; taken at face value this would suggest that the vector x4 is orthogonal to the others, but it is more likely that some rounding error has occurred.
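Equation (24) can be checked from the sums of squares quoted in Sec. 1.1.5.2; a two-line Python sketch:

# R-squared for the one-predictor model, from the quoted sums of squares
S_yhat, S_yy = 1450.1, 2715.8
print(S_yhat / S_yy)    # ~0.534, i.e. 53.4%

# With centered variables R is the correlation between y and y-hat,
# so R^2 is the square of cos(theta) = 0.73 from Eq. (11).
print(0.73 ** 2)        # ~0.533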
One peculiarity of R² is that it will, by chance, give a value between 0 and 100% even if the X variable is a column of random numbers. To adjust for the random effect of the k variables in the model, the R², as a proportion, is reduced by k/(n − 1) and then adjusted to fall between 0 and 1 to give the adjusted R². It could be multiplied by 100 to become a percentage:

Adjusted R² = [R² − k/(n − 1)]·(n − 1)/(n − k − 1)     (25)

1.2.1.5 Analysis of Variance

Analysis of Variance

Source            DF    SS       MS       F       P
Regression         1    1450.1   1450.1   12.60   0.005
Residual Error    11    1265.7    115.1
Total             12    2715.8

The SS (sums of squares) can best be understood by referring to Fig. 4 (Sec. 1.1.7.3), which showed the relationship between the three vectors, y, ŷ, and e, provided that the Y- and X-variables are centered around their means. By Pythagoras' theorem,

Sum of squares of y = Sum of squares of ŷ + Sum of squares of e

That is,

Sum of squares, total = Sum of squares for regression + Sum of squares for residual     (26)

The ANOVA table is set up to test the hypothesis that the parameter β = 0. If there is more than one predictor variable, the hypothesis would be

H: β1 = β2 = β3 = ··· = 0

If this is the case, it can be shown that the mean, or expected, value of y, ŷ, and e will all be zero. An unbiased estimate of the variance of y, σ², could be obtained from the mean squares of each of the three rows of the table by dividing the sums of squares by their degrees of freedom. From Fig. 4, we are now well aware that the vector of fitted values is orthogonal to the vector of residuals and, hence, we use the first two rows, as their mean squares are independent and their ratio follows a distribution called the F-statistic. The degrees of freedom of the F-test will be 1 and 11 in this example.

The p-value of 0.005 is the probability that by chance the F-statistic will be more extreme than the value of 12.6. This confirms that the predictor variable, x1 = tricalcium aluminate, predicts a significant amount of the heat generated when the cement is mixed.

What are the effects of adding variables to the model? These can be demonstrated by the cement data. The regression sums of squares monotonically increase as variables are added to the model; the residual sums of squares monotonically decrease; the residual mean squares reduce to a minimum and then increase. One method of selecting a best-fit model is to select the one with the minimum residual mean squares.

Number of predictor variables    1      2      3      4
Regression sum of squares        1450   2658   2668   2668
Residual sum of squares          1266   58     48     48
Residual mean squares = s²       115    5.8    5.4    6.0
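The mean squares and the F-ratio follow directly from the sums of squares and their degrees of freedom; a short Python sketch using the rounded figures above (small differences from the tables are due to rounding):

# ANOVA for the one-predictor model (Sec. 1.2.1.5)
SS_reg, SS_res = 1450.1, 1265.7
df_reg, df_res = 1, 11

MS_reg = SS_reg / df_reg          # 1450.1
MS_res = SS_res / df_res          # ~115.1
print(f"F = {MS_reg / MS_res:.1f}")   # ~12.6 on (1, 11) degrees of freedom

# Residual mean squares for the 1- to 4-predictor models, from the table above
for k, ss in zip([1, 2, 3, 4], [1266.0, 58.0, 48.0, 48.0]):
    print(k, round(ss / (13 - k - 1), 1))   # ~115.1, 5.8, 5.3, 6.0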
1.2.1.6 Unusual Observations

Unusual Observations

Obs   x1     y        Fit      StDev Fit   Residual   St Resid
10    21.0   115.90   120.72   7.72        -4.82      -0.65 X

Individual data points may be unusual because the y-values are unusually large or small, which would be measured according to whether they fall within a 95% confidence interval. Alternatively, specific x-values may differ from the others and have an unduly large effect on the regression equation and its coefficients. More will be said on this in Sec. 1.4.

1.2.2 Power Transformations

Two variables, x and y, may be closely related, but the relationship may not be linear. Ideally, theoretical clues would be present which point to a particular relationship, such as an exponential growth model, which is common in biology. Without such clues, we could first examine a scatter plot of y against x. Sometimes we may recognize a mathematical model which fits the data well. Otherwise, we try to choose a simple transformation such as raising the variable to a power p, as in Table 1.

Table 1  Common Power Transformations

p      Name               Effect
--     Exponential        Stretches large values
3      Cube               Stretches large values
2      Square             Stretches large values
1      "Raw"
0.5    Square root        Shrinks large values
0      Logarithmic        Shrinks large values
-0.5   Reciprocal/root    Shrinks large values
-1     Reciprocal         Shrinks large values

A power of 1 leaves the variable unchanged as raw data. As we proceed up or down the table from 1, the strength of the transformation increases; as we move up the table, the transformation stretches larger values relatively more than smaller ones. Although the exponential does not fit in very well, we have included it as it is the inverse of the logarithmic transformation. Other fractional powers could be used, but they may be difficult to interpret.

It would be feasible to transform either y or x, and, indeed, a transformation of y would be equivalent to the inverse transformation of x. For example, squaring y would have similar effects to taking the square root of x. If there are two or more predictor variables, it may be advisable to transform these in different ways rather than y, for if y is transformed to be linearly related to one predictor variable it may then not be linearly related to another.

In general, however, it is usual to transform the y, rather than the x, variable, as this transformation may lead to a better-fitting model and also to a better distribution of the response variable and the residuals.

1.2.3 Resistant Regression

The traditional approach to regression is via the least-squares method, which has close relationships with means and the normal distribution. This is a powerful approach that is widely used. It does have problems in that the fitted regression line can be greatly affected by a few unusually large or small y-values.

Another approach, which is resistant to extreme values, is based on medians rather than means, as medians are not affected so much by strange values. The method is shown in Fig. 6 (data from Sec. 1.5.4). The x-values are divided, as closely as possible, into three groups according to size. In this simple example, there are only nine points, so that each group consists of three points. For the lower third, the median value of x is found and the median value of y, giving the point (2, 12); this is repeated for the middle and upper thirds. The middle and upper median points are (5, 25) and (8, 95). The resistant line is found by joining the lower point, (2, 12), to the upper point, (8, 95), and is shown in Fig. 6 as a solid line. To check whether a curve would be more appropriate than a line, the three pairs of medians are linked by dashed lines (- - -). If the slopes of the dashed lines differ from the slope of the resistant line, it would suggest that a curve should be used or the response variable, y, should be transformed.
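A minimal Python sketch of the resistant-line calculation, using only the three median points quoted above (the raw data of Sec. 1.5.4 are not reproduced in this excerpt):

# Median points from Sec. 1.2.3
x_lo, y_lo = 2.0, 12.0      # lower-third medians
x_mid, y_mid = 5.0, 25.0    # middle-third medians
x_hi, y_hi = 8.0, 95.0      # upper-third medians

# Resistant line joins the lower and upper median points
slope = (y_hi - y_lo) / (x_hi - x_lo)        # ~13.8
intercept = y_lo - slope * x_lo              # ~-15.7
print(f"resistant line: y = {intercept:.1f} + {slope:.1f} x")

# Slopes of the two dashed segments; a large difference suggests curvature
# (or that y should be transformed).
print((y_mid - y_lo) / (x_mid - x_lo))       # ~4.3
print((y_hi - y_mid) / (x_hi - x_mid))       # ~23.3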