CHAPTER 31

Residuals: Standardized, Predictive, "Studentized"

31.1. Three Decisions about Plotting Residuals

After running a regression it is always advisable to look at the residuals. Here one has to make three decisions.

The first decision is whether to look at the ordinary residuals

(31.1.1) $\hat{\varepsilon}_i = y_i - x_i^\top \hat{\beta}$

($x_i^\top$ is the $i$th row of $X$), or the "predictive" residuals, which are the residuals computed using the OLS estimate of $\beta$ gained from all the other data except the data point where the residual is taken. If one writes $\hat{\beta}(i)$ for the OLS estimate without the $i$th observation, the defining equation for the $i$th predictive residual, which we call $\hat{\varepsilon}_i(i)$, is

(31.1.2) $\hat{\varepsilon}_i(i) = y_i - x_i^\top \hat{\beta}(i)$.

The second decision is whether to standardize the residuals or not, i.e., whether to divide them by their estimated standard deviations or not. Since $\hat{\varepsilon} = My$, the variance of the $i$th ordinary residual is

(31.1.3) $\operatorname{var}[\hat{\varepsilon}_i] = \sigma^2 m_{ii} = \sigma^2 (1 - h_{ii})$,

and regarding the predictive residuals it will be shown below, see (31.2.9), that

(31.1.4) $\operatorname{var}[\hat{\varepsilon}_i(i)] = \frac{\sigma^2}{m_{ii}} = \frac{\sigma^2}{1 - h_{ii}}$.

Here

(31.1.5) $h_{ii} = x_i^\top (X^\top X)^{-1} x_i$.

(Note that $x_i$ is the $i$th row of $X$ written as a column vector.) $h_{ii}$ is the $i$th diagonal element of the "hat matrix" $H = X(X^\top X)^{-1} X^\top$, the projector on the column space of $X$. This projector is called "hat matrix" because $\hat{y} = Hy$, i.e., $H$ puts the "hat" on $y$.

Problem 362. 2 points Show that the $i$th diagonal element of the "hat matrix" $H = X(X^\top X)^{-1} X^\top$ is $x_i^\top (X^\top X)^{-1} x_i$, where $x_i$ is the $i$th row of $X$ written as a column vector.

Answer. In terms of $e_i$, the $n$-vector with 1 in the $i$th place and 0 everywhere else, $x_i = X^\top e_i$, and the $i$th diagonal element of the hat matrix is $e_i^\top H e_i = e_i^\top X (X^\top X)^{-1} X^\top e_i = x_i^\top (X^\top X)^{-1} x_i$.

Problem 363. 2 points The variance of the $i$th disturbance is $\sigma^2$. Is the variance of the $i$th residual bigger than $\sigma^2$, smaller than $\sigma^2$, or equal to $\sigma^2$? (Before doing the math, first argue in words what you would expect it to be.) What about the variance of the predictive residual? Prove your answers mathematically. You are allowed to use (31.2.9) without proof.

Answer. Here is only the math part of the answer: $\hat{\varepsilon} = My$. Since $M = I - H$ is idempotent and symmetric, we get $\mathcal{V}[My] = \sigma^2 M$; in particular this means $\operatorname{var}[\hat{\varepsilon}_i] = \sigma^2 m_{ii}$, where $m_{ii}$ is the $i$th diagonal element of $M$. Then $m_{ii} = 1 - h_{ii}$. Since all diagonal elements of projection matrices are between 0 and 1, the answer is: the variances of the ordinary residuals cannot be bigger than $\sigma^2$. Regarding predictive residuals, if we plug $m_{ii} = 1 - h_{ii}$ into (31.2.9) it becomes $\hat{\varepsilon}_i(i) = \frac{1}{m_{ii}} \hat{\varepsilon}_i$, therefore

(31.1.6) $\operatorname{var}[\hat{\varepsilon}_i(i)] = \frac{1}{m_{ii}^2} \sigma^2 m_{ii} = \frac{\sigma^2}{m_{ii}}$,

which is bigger than $\sigma^2$.
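These quantities are easy to compute without any leave-one-out refitting. The following is a minimal R sketch (the mtcars data set and the model formula are illustrative assumptions, not from the text); it computes $h_{ii}$, the ordinary residuals, and the predictive residuals via (31.2.9), and verifies one predictive residual against a direct leave-one-out fit as in (31.1.2):

```r
## Ordinary vs. predictive residuals; illustrative data, base R only.
fit <- lm(mpg ~ wt + hp, data = mtcars)

e.ord  <- residuals(fit)      # ordinary residuals epsilon-hat_i
h      <- hatvalues(fit)      # diagonal elements h_ii of the hat matrix
e.pred <- e.ord / (1 - h)     # predictive residuals via (31.2.9)

## Check (31.1.2) directly for observation i: refit without it.
i     <- 1
fit.i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
y.i   <- mtcars$mpg[i]
x.i   <- model.matrix(fit)[i, ]
stopifnot(isTRUE(all.equal(unname(y.i - sum(x.i * coef(fit.i))),
                           unname(e.pred[i]))))
```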
Problem 364. Decide in the following situations whether you want predictive residuals or ordinary residuals, and whether you want them standardized or not.

• a. 1 point You are looking at the residuals in order to check whether the associated data points are outliers and perhaps do not belong in the model.

Answer. Here one should use the predictive residuals. If the $i$th observation is an outlier which should not be in the regression, then one should not use it when running the regression. Its inclusion may have a strong influence on the regression result, and therefore the residual may not be as conspicuous. One should standardize them.

• b. 1 point You are looking at the residuals in order to assess whether there is heteroskedasticity.

Answer. Here you want them standardized, but there is no reason to use the predictive residuals. Ordinary residuals are a little more precise than predictive residuals because they are based on more observations.

• c. 1 point You are looking at the residuals in order to assess whether the disturbances are autocorrelated.

Answer. Same answer as for b.

• d. 1 point You are looking at the residuals in order to assess whether the disturbances are normally distributed.

Answer. In my view, one should make a normal QQ-plot of standardized residuals, but one should not use the predictive residuals. To see why, let us first look at the distribution of the standardized residuals before division by $s$. Each $\hat{\varepsilon}_i / \sqrt{1 - h_{ii}}$ is normally distributed with mean zero and standard deviation $\sigma$. (But different such residuals are not independent.) If one makes a QQ-plot of those residuals against the normal distribution, one will get in the limit a straight line with slope $\sigma$. If one divides every residual by $s$, the slope will be close to 1, but one will again get something approximating a straight line. The fact that $s$ is random does not affect the relation of the residuals to each other, and this relation is what determines whether or not the QQ-plot approximates a straight line.

But Belsley, Kuh, and Welsch [BKW80, p. 43] draw a normal probability plot of the studentized, not the standardized, residuals. They give no justification for their choice. I think it is the wrong choice.

• e. 1 point Is there any situation in which you do not want to standardize the residuals?

Answer. Standardization is a mathematical procedure which is justified when certain conditions hold. But there is no guarantee that these conditions actually hold, and in order to get a more immediate impression of the fit of the curve one may want to look at the unstandardized residuals.

The third decision is how to plot the residuals. Never do it against $y$. Either do it against the predicted $\hat{y}$, or make several plots against all the columns of the $X$-matrix. In time series, a plot of the residuals against time is also called for. Another option is the partial residual plots; see about this also (30.0.2). Say $\hat{\beta}[h]$ is the estimated parameter vector, estimated with the full model but with its $h$th element dropped after estimation, $X[h]$ is the $X$-matrix without the $h$th column, and $x_h$ is the $h$th column of the $X$-matrix. Then by (30.0.4), the estimate of the $h$th slope parameter is the same as that in the simple regression of $y - X[h]\hat{\beta}[h]$ on $x_h$. The plot of $y - X[h]\hat{\beta}[h]$ against $x_h$ is called the $h$th partial residual plot. To understand this better, start out with a regression $y_i = \alpha + \beta x_i + \gamma z_i + \varepsilon_i$, which gives you $y_i = \hat{\alpha} + \hat{\beta} x_i + \hat{\gamma} z_i + \hat{\varepsilon}_i$. Now if you regress $y_i - \hat{\alpha} - \hat{\beta} x_i$ on $x_i$ and $z_i$, then the intercept will be zero, the estimated coefficient of $x_i$ will be zero, the estimated coefficient of $z_i$ will be $\hat{\gamma}$, and the residuals will be $\hat{\varepsilon}_i$. The plot of $y_i - \hat{\alpha} - \hat{\beta} x_i$ versus $z_i$ is the partial residual plot for $z$; a sketch of this construction follows below.
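Here is a small R sketch of the partial residual plot for $z$ (simulated data; all variable names are illustrative). It also checks the identity that the partial residuals $y_i - \hat{\alpha} - \hat{\beta} x_i$ equal the ordinary residuals plus $\hat{\gamma} z_i$:

```r
## Partial residual plot for z in the regression y ~ x + z (simulated data).
set.seed(1)
n <- 100
x <- rnorm(n); z <- rnorm(n)
y <- 1 + 2*x + 0.5*z + rnorm(n)

fit <- lm(y ~ x + z)
a <- coef(fit)["(Intercept)"]; b <- coef(fit)["x"]

## Partial residuals for z: y - alpha-hat - beta-hat*x,
## which equal the ordinary residuals plus gamma-hat*z.
pr.z <- y - a - b*x
stopifnot(isTRUE(all.equal(unname(pr.z),
                           unname(residuals(fit) + coef(fit)["z"]*z))))

plot(z, pr.z, ylab = "Partial residuals for z")
abline(0, coef(fit)["z"])  # slope is gamma-hat, as claimed in the text
```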
31.2. Relationship between Ordinary and Predictive Residuals

In equation (31.1.2), the $i$th predictive residual was defined in terms of $\hat{\beta}(i)$, the parameter estimate from the regression of $y$ on $X$ with the $i$th observation left out. We will show now that there is a very simple mathematical relationship between the $i$th predictive residual and the $i$th ordinary residual, namely, equation (31.2.9). (It is therefore not necessary to run $n$ different regressions to get the $n$ predictive residuals.)

We will write $y(i)$ for the vector $y$ with the $i$th element deleted, and $X(i)$ for the matrix $X$ with the $i$th row deleted.

Problem 365. 2 points Show that

(31.2.1) $X(i)^\top X(i) = X^\top X - x_i x_i^\top$,
(31.2.2) $X(i)^\top y(i) = X^\top y - x_i y_i$.

Answer. Write (31.2.2) as $X^\top y = X(i)^\top y(i) + x_i y_i$, and observe that, with our definition of the $x_i$ as column vectors representing the rows of $X$, $X^\top = \begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix}$. Therefore

(31.2.3) $X^\top y = \begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = x_1 y_1 + \cdots + x_n y_n$.

An important stepping stone towards the proof of (31.2.9) is equation (31.2.8), which gives a relationship between $h_{ii}$ and

(31.2.4) $h_{ii}(i) = x_i^\top (X(i)^\top X(i))^{-1} x_i$.

$\hat{y}_i(i) = x_i^\top \hat{\beta}(i)$ has variance $\sigma^2 h_{ii}(i)$. The following problems give the steps necessary to prove (31.2.8). We begin with a simplified version of Theorem A.8.2 in the Mathematical Appendix:

Theorem 31.2.1. Let $A$ be a nonsingular $k \times k$ matrix, $\delta \neq 0$ a scalar, and $b$ a $k \times 1$ vector with $b^\top A^{-1} b + \delta \neq 0$. Then

(31.2.5) $\left(A + \frac{b b^\top}{\delta}\right)^{-1} = A^{-1} - \frac{A^{-1} b b^\top A^{-1}}{\delta + b^\top A^{-1} b}$.

Problem 366. Prove (31.2.5) by showing that the product of the matrix with its alleged inverse is the unit matrix.

Problem 367. As an application of (31.2.5) show that

(31.2.6) $(X^\top X)^{-1} + \frac{(X^\top X)^{-1} x_i x_i^\top (X^\top X)^{-1}}{1 - h_{ii}}$

is the inverse of $X(i)^\top X(i)$.

Answer. This is (31.2.5), or (A.8.20), with $A = X^\top X$, $b = x_i$, and $\delta = -1$.

Problem 368. Using (31.2.6) show that

(31.2.7) $(X(i)^\top X(i))^{-1} x_i = \frac{1}{1 - h_{ii}} (X^\top X)^{-1} x_i$,

and using (31.2.7) show that $h_{ii}(i)$ is related to $h_{ii}$ by the equation

(31.2.8) $1 + h_{ii}(i) = \frac{1}{1 - h_{ii}}$.

[Gre97, (9-37) on p. 445] was apparently not aware of this relationship.

Problem 369. Prove the following mathematical relationship between predictive residuals and ordinary residuals:

(31.2.9) $\hat{\varepsilon}_i(i) = \frac{1}{1 - h_{ii}} \hat{\varepsilon}_i$,

which is the same as (28.0.29), only in a different notation.

Answer. For this we have to apply the above mathematical tools. With the help of (31.2.7) (transpose it!) and (31.2.2), equation (31.1.2) becomes

$\hat{\varepsilon}_i(i) = y_i - x_i^\top (X(i)^\top X(i))^{-1} X(i)^\top y(i)$
$\quad = y_i - \frac{1}{1 - h_{ii}} x_i^\top (X^\top X)^{-1} (X^\top y - x_i y_i)$
$\quad = y_i - \frac{1}{1 - h_{ii}} x_i^\top \hat{\beta} + \frac{1}{1 - h_{ii}} x_i^\top (X^\top X)^{-1} x_i y_i$
$\quad = y_i \left(1 + \frac{h_{ii}}{1 - h_{ii}}\right) - \frac{1}{1 - h_{ii}} x_i^\top \hat{\beta}$
$\quad = \frac{1}{1 - h_{ii}} \left(y_i - x_i^\top \hat{\beta}\right)$.

This is a little tedious but simplifies extremely nicely at the end.

The relationship (31.2.9) is so simple because the estimation of $\eta_i = x_i^\top \beta$ can be done in two steps. First collect the information which the $n - 1$ observations other than the $i$th contribute to the estimation of $\eta_i = x_i^\top \beta$; it is contained in $\hat{y}_i(i)$. The information from all observations except the $i$th can be written as

(31.2.10) $\hat{y}_i(i) = \eta_i + \delta_i, \qquad \delta_i \sim (0, \sigma^2 h_{ii}(i))$.

Here $\delta_i$ is the "sampling error" or "estimation error" $\hat{y}_i(i) - \eta_i$ from the regression of $y(i)$ on $X(i)$. If we combine this compound "observation" with the $i$th observation [...]
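The update formulas (31.2.6) and (31.2.8) are easy to check numerically. The following R sketch uses simulated data (all names illustrative) and compares the rank-one update of $(X^\top X)^{-1}$ against a direct leave-one-out computation:

```r
## Numerical check of (31.2.6) and (31.2.8); illustrative simulated X.
set.seed(2)
n <- 20; k <- 3
X  <- cbind(1, matrix(rnorm(n*(k-1)), n, k-1))
i  <- 5
xi <- X[i, ]                        # ith row of X as a column vector
XtX <- crossprod(X)                 # X'X
hii <- c(t(xi) %*% solve(XtX, xi))  # h_ii = x_i'(X'X)^{-1} x_i

Xi  <- X[-i, ]                      # X with the ith row deleted
lhs <- solve(crossprod(Xi))         # (X(i)'X(i))^{-1}, computed directly
rhs <- solve(XtX) +
       solve(XtX) %*% xi %*% t(xi) %*% solve(XtX) / (1 - hii)
stopifnot(isTRUE(all.equal(lhs, rhs)))          # checks (31.2.6)

hii.i <- c(t(xi) %*% lhs %*% xi)                # h_ii(i) as in (31.2.4)
stopifnot(isTRUE(all.equal(1 + hii.i, 1/(1 - hii))))  # checks (31.2.8)
```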
[...] therefore it has only recently become part of the standard procedures.

Problem 373. 1 point Define multicollinearity.

• a. 2 points What are the symptoms of multicollinearity?

• b. 2 points How can one detect multicollinearity?

• c. 2 points How can one remedy multicollinearity?

32.1. Missing Observations

First case: data on $y$ are missing. If you use a least squares predictor [...] this does not give any change in the estimates, and although the computer will think it is more efficient, it isn't. What other schemes are there? Filling in the missing $y$ by the arithmetic mean of the observed $y$ does not give an unbiased estimator. General conclusion: in a single-equation context, filling in missing $y$ is not a good idea.

Now missing values in the $X$-matrix. If there is only one regressor and a constant term, then the zero-order filling in of $\bar{x}$ "results in no changes and is equivalent with dropping the incomplete data." The alternative, filling it in with zeros and adding a dummy for the data with the missing observation, amounts to exactly the same thing. The only case where filling in missing data makes sense is if you have multiple regression and you can predict the missing data in the $X$-matrix from the other data in the $X$-matrix. [...]

[...] in the scatterplot matrix which observation that might be? Answer. In Linux, you first have to give the command x11() in order to make the graphics window available; in Windows, this is not necessary. It is important to display the data in a reasonable order, therefore instead of pairs(longley) you should do something like attach(longley) and then pairs(cbind(Year, Population, Employed, Unemployed, Armed.Forces, [...]

[...] regression results by typing summary(longley.fit). Armed.Forces and Unemployed are significant and have negative signs, as expected. GNP and Population are insignificant and have negative signs too; this is not expected. GNP, Population, and Year are highly collinear.

• c. 3 points Make plots of the ordinary residuals and of the standardized residuals against time. How do they differ? In R, the commands are plot(Year, [...] ylab="Ordinary Residuals in Longley Regression"). In order to get the next plot in a different graphics window, so that you can compare them, do now either x11() in Linux or windows() in Windows, and then plot(Year, rstandard(longley.fit), type="h", ylab="Standardized Residuals in Longley Regression").

Answer. You see that the standardized [...]
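For reference, here is a self-contained version of these Longley commands. The model formula, with Employed as the dependent variable regressed on all remaining columns, is an assumption consistent with the signs discussed in the answer above; par(mfrow=...) is used here instead of opening a second graphics window:

```r
## Longley regression diagnostics; the longley data ship with base R.
data(longley)
longley.fit <- lm(Employed ~ GNP.deflator + GNP + Unemployed +
                    Armed.Forces + Population + Year, data = longley)
summary(longley.fit)

## Ordinary vs. standardized residuals against time, side by side.
op <- par(mfrow = c(1, 2))
plot(longley$Year, residuals(longley.fit), type = "h",
     ylab = "Ordinary Residuals in Longley Regression")
plot(longley$Year, rstandard(longley.fit), type = "h",
     ylab = "Standardized Residuals in Longley Regression")
par(op)
```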
[...] the standardized predictive residual in 1950 is smaller than that in 1962, but the predictive residual in 1950 is very close to that in 1962; the standardized predictive residual in 1951 is smaller than that in 1956, but the predictive residual in 1951 is larger than that in 1956. The largest predictive residual is that of 1951, but the largest standardized predictive residual is that of 1956.

• e. 3 points Make a plot of the leverage, i.e., of the $h_{ii}$-values, using [...]

32.4. Sensitivity of Estimates to Omission of One Observation

[...] divide the change in $t^\top \hat{\beta}$ due to omission of the $i$th observation by the standard deviation of $t^\top \hat{\beta}$, i.e., to look at

(32.4.3) $\frac{t^\top (\hat{\beta} - \hat{\beta}(i))}{\sigma \sqrt{t^\top (X^\top X)^{-1} t}}$.

Such a standardization makes it possible to compare the sensitivity of different linear combinations, and to ask: which linear combination of the elements of $\hat{\beta}$ is affected most if one drops the $i$th observation? Interestingly and, in hindsight, perhaps [...] no other linear combination of the elements of $\hat{\beta}$ will be affected much by the omission of this observation either. The righthand side of (32.4.7), with $\sigma$ estimated by $s(i)$, is called by [BKW80] and many others DFFITS (which stands for DiFference in FIT, Standardized). If one takes its square, divides it by $k$, and estimates $\sigma^2$ by $s^2$ (which is more consistent than using $s^2(i)$, since one standardizes [...]

[...] If the denominator in the fraction on the lefthand side is zero, then $g = o$ and therefore the numerator is necessarily zero as well; in this case, the fraction itself should be considered zero. Proof: As in the derivation of the BLUE with nonspherical covariance matrix, pick a nonsingular $Q$ with $\Omega = Q Q^\top$, and define $P = Q^{-1}$. Then it follows $P \Omega P^\top = I$. Define $y = P x$ and $h = Q^\top g$. Then $h^\top y = g^\top x$, $h^\top h = g^\top \Omega g$, and [...]
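R computes DFFITS directly. The sketch below reuses longley.fit from above and checks the built-in value against the expression in terms of the studentized residual and the leverage, $\text{DFFITS}_i = \text{rstudent}_i \sqrt{h_{ii}/(1-h_{ii})}$; this identity is standard for OLS influence measures, not something derived in this excerpt:

```r
## DFFITS_i: standardized change in the ith fitted value when
## observation i is dropped from the regression.
h  <- hatvalues(longley.fit)
d1 <- dffits(longley.fit)
d2 <- rstudent(longley.fit) * sqrt(h / (1 - h))
stopifnot(isTRUE(all.equal(d1, d2)))
plot(longley$Year, d1, type = "h", ylab = "DFFITS")
```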