Since the residuals from the linear model fit are in the error space, orthogonal to the model space, they contain the information in the data that is not explained by the model. Thus, they are useful for investigating a model’s lack of fit. This section takes a closer look at the residuals, including their moments and ways of plotting them to help check a model. We also present descriptions of the influence that each observation has on the least squares fit, using the residuals and “leverage” values from the hat matrix.
2.5.1 Residuals and Fitted Values Are Uncorrelated
From Section 2.4.6, the normal equation corresponding to the intercept term is $\sum_i y_i = \sum_i \hat{\mu}_i$. Thus, $\sum_i e_i = \sum_i (y_i - \hat{\mu}_i) = 0$, and the residuals have a sample mean of 0.
Also,
$$E(\boldsymbol{e}) = E(\boldsymbol{y} - \hat{\boldsymbol{\mu}}) = \boldsymbol{X}\boldsymbol{\beta} - \boldsymbol{X}E(\hat{\boldsymbol{\beta}}) = \boldsymbol{X}\boldsymbol{\beta} - \boldsymbol{X}\boldsymbol{\beta} = \boldsymbol{0}.$$
For linear models with an intercept, the sample correlation between the residuals $\boldsymbol{e}$ and fitted values $\hat{\boldsymbol{\mu}}$ has numerator $\sum_i e_i \hat{\mu}_i = \boldsymbol{e}^{\mathsf{T}}\hat{\boldsymbol{\mu}}$ (since $\bar{e} = 0$). So, the orthogonality of $\boldsymbol{e}$ and $\hat{\boldsymbol{\mu}}$ implies that $\mathrm{corr}(\boldsymbol{e}, \hat{\boldsymbol{\mu}}) = 0$.
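As a quick numerical check of these identities, here is a minimal sketch on hypothetical simulated data, using NumPy's least squares routine to fit a model with an intercept and confirm that the residuals sum to zero and are orthogonal to (hence uncorrelated with) the fitted values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: one explanatory variable plus an intercept
n = 50
x = rng.uniform(0, 10, size=n)
y = 3 + 2 * x + rng.normal(scale=1.5, size=n)

# Model matrix with an intercept column
X = np.column_stack([np.ones(n), x])

# Least squares fit: beta_hat minimizes ||y - X beta||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
mu_hat = X @ beta_hat          # fitted values
e = y - mu_hat                 # residuals

print(np.sum(e))                      # ~0: residuals have sample mean 0
print(e @ mu_hat)                     # ~0: residuals orthogonal to fitted values
print(np.corrcoef(e, mu_hat)[0, 1])   # ~0: corr(e, mu_hat) = 0
```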
2.5.2 Plots of Residuals
Because $\mathrm{corr}(\boldsymbol{e}, \hat{\boldsymbol{\mu}}) = 0$, the least squares line fitted to a scatterplot of the elements of $\boldsymbol{e} = (\boldsymbol{y} - \hat{\boldsymbol{\mu}})$ versus the corresponding elements of $\hat{\boldsymbol{\mu}}$ has slope 0. A scatterplot of the residuals against the fitted values helps to identify patterns of a model's lack of fit. Examples are nonconstant variance, sometimes referred to as heteroscedasticity, and nonlinearity. Likewise, since the residuals are also orthogonal to $C(\boldsymbol{X})$, they can be plotted against each explanatory variable to detect lack of fit.
Figure 2.8 shows how a plot of $\boldsymbol{e}$ against $\hat{\boldsymbol{\mu}}$ tends to look if (a) the linear model holds, (b) the variance is constant (homoscedasticity) but the mean of $y$ is a quadratic rather than a linear function of the predictor, and (c) the linear trend predictor is correct but the variance increases dramatically as the mean increases.

Figure 2.8 Residuals plotted against linear-model fitted values that reflect (a) model adequacy, (b) quadratic rather than linear relationship, and (c) nonconstant variance.

In practice, plots do not have such a neat appearance, but these illustrate how the plots can highlight model inadequacy. Section 2.6 shows an example.
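The following sketch constructs such a diagnostic plot for hypothetical simulated data in which the true mean is quadratic in the predictor but a straight line is fitted, so the residual pattern resembles panel (b) of Figure 2.8:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Hypothetical data: true mean is quadratic, but we fit a straight line
n = 200
x = rng.uniform(-3, 3, size=n)
y = 1 + x + 0.8 * x**2 + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])             # linear model matrix
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
mu_hat = X @ beta_hat
e = y - mu_hat

# Residuals against fitted values: a U-shaped pattern suggests a missing
# quadratic term, as in panel (b) of Figure 2.8
plt.scatter(mu_hat, e, s=10)
plt.axhline(0, color="gray")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```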
For the normal linear model, the conditional distribution of $y$, given the explanatory variables, is normal. This implies that the residuals, being linear in $\boldsymbol{y}$, also have normal distributions. A histogram of the residuals provides some information about the actual conditional distribution. Another check of the normality assumption is a plot of ordered residual values against expected values of order statistics from a N(0, 1) distribution, called a Q–Q plot. We will discuss this type of plot in Section 3.4.2, in the chapter about the normal linear model.
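A sketch of such a Q–Q plot built directly from its definition, on hypothetical simulated data; approximating the expected normal order statistics by quantiles at the plotting positions $(i - 0.5)/n$ is one common convention, not the only one:

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)

# Hypothetical data satisfying a normal linear model
n = 100
x = rng.uniform(0, 10, size=n)
y = 2 + 0.5 * x + rng.normal(scale=2.0, size=n)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat

# Ordered residuals against N(0, 1) quantiles at positions (i - 0.5)/n;
# an approximately straight plot supports the normality assumption
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = norm.ppf(probs)
plt.scatter(theoretical, np.sort(e), s=10)
plt.xlabel("N(0, 1) quantiles")
plt.ylabel("ordered residuals")
plt.show()
```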
2.5.3 Standardized and Studentized Residuals
For the ordinary linear model, the covariance matrix for the observations is $\boldsymbol{V} = \sigma^2\boldsymbol{I}$. In terms of the hat matrix $\boldsymbol{H} = \boldsymbol{X}(\boldsymbol{X}^{\mathsf{T}}\boldsymbol{X})^{-1}\boldsymbol{X}^{\mathsf{T}}$, this decomposes into
$$\boldsymbol{V} = \sigma^2\boldsymbol{I} = \sigma^2\boldsymbol{H} + \sigma^2(\boldsymbol{I} - \boldsymbol{H}).$$
Since $\hat{\boldsymbol{\mu}} = \boldsymbol{H}\boldsymbol{y}$ and since $\boldsymbol{H}$ is idempotent,
$$\mathrm{var}(\hat{\boldsymbol{\mu}}) = \sigma^2\boldsymbol{H}.$$
So, $\mathrm{var}(\hat{\mu}_i) = \sigma^2 h_{ii}$, where $\{h_{ii}\}$ denote the main diagonal elements of $\boldsymbol{H}$. Since variances are nonnegative, $h_{ii} \ge 0$. Likewise, since $(\boldsymbol{y} - \hat{\boldsymbol{\mu}}) = (\boldsymbol{I} - \boldsymbol{H})\boldsymbol{y}$ and since $(\boldsymbol{I} - \boldsymbol{H})$ is idempotent,
$$\mathrm{var}(\boldsymbol{y} - \hat{\boldsymbol{\mu}}) = \sigma^2(\boldsymbol{I} - \boldsymbol{H}).$$
So, the residuals are correlated, and their variance need not be constant, with
$$\mathrm{var}(e_i) = \mathrm{var}(y_i - \hat{\mu}_i) = \sigma^2(1 - h_{ii}).$$
Again since variances are nonnegative, $0 \le h_{ii} \le 1$. Also, $\mathrm{var}(\hat{\mu}_i) = \sigma^2 h_{ii} \le \sigma^2$ reveals a consequence of model parsimony: if the model holds (or nearly holds), $\hat{\mu}_i$ is better than $y_i$ as an unbiased estimator of $\mu_i$.
A standardized version of $e_i = (y_i - \hat{\mu}_i)$ that divides it by $\sigma\sqrt{1 - h_{ii}}$ has a standard deviation of 1. In practice, $\sigma$ is unknown, so we replace it by the estimate $s$ of $\sigma$ derived in Section 2.4.1. The standardized residual is
$$r_i = \frac{y_i - \hat{\mu}_i}{s\sqrt{1 - h_{ii}}}. \qquad (2.9)$$
This describes the number of estimated standard deviations that $(y_i - \hat{\mu}_i)$ departs from 0. If the normal linear model truly holds, these should nearly all fall between about $-3$ and $+3$. A slightly different residual, called a studentized residual,¹¹ estimates $\sigma$ in the expression for $\mathrm{var}(y_i - \hat{\mu}_i)$ based on the fit of the model to the $n - 1$ observations after excluding observation $i$. Then, that estimate is independent of observation $i$.
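A minimal sketch of computing the standardized residuals (2.9) from the hat matrix, assuming hypothetical simulated data; $s^2$ is the usual error variance estimate with $n - p$ degrees of freedom from Section 2.4.1:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data with two explanatory variables plus an intercept
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
p = X.shape[1]
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix
h = np.diag(H)                          # leverages h_ii
mu_hat = H @ y
e = y - mu_hat

s2 = e @ e / (n - p)                    # estimate of sigma^2
r = e / np.sqrt(s2 * (1 - h))           # standardized residuals (2.9)

print(r.min(), r.max())   # under the normal linear model, nearly all in (-3, 3)
```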
2.5.4 Leverages from Hat Matrix Measure Potential Influence
The element $h_{ii}$ from $\boldsymbol{H}$, on which $\mathrm{var}(e_i)$ depends, is called the leverage of observation $i$. Since $\mathrm{var}(\hat{\mu}_i) = \sigma^2 h_{ii}$ with $0 \le h_{ii} \le 1$, the leverage determines the precision with which $\hat{\mu}_i$ estimates $\mu_i$. For large $h_{ii}$, close to 1, $\mathrm{var}(\hat{\mu}_i) \approx \mathrm{var}(y_i)$ and $\mathrm{var}(e_i) \approx 0$. In this case, $y_i$ may have a large influence on $\hat{\mu}_i$. In the extreme case $h_{ii} = 1$, $\mathrm{var}(e_i) = 0$, and $\hat{\mu}_i = y_i$. By contrast, when $h_{ii}$ is close to 0 and thus $\mathrm{var}(\hat{\mu}_i)$ is relatively small, this suggests that $\hat{\mu}_i$ is based on contributions from many observations.
Here are two other ways to visualize how a relatively large leverage $h_{ii}$ indicates that $y_i$ may have a large influence on $\hat{\mu}_i$. First, since $\hat{\mu}_i = \sum_j h_{ij} y_j$, $\partial\hat{\mu}_i/\partial y_i = h_{ii}$. Second, since $\{y_i\}$ are uncorrelated,¹²
$$\mathrm{cov}(y_i, \hat{\mu}_i) = \mathrm{cov}\Bigl(y_i, \sum_{j=1}^n h_{ij} y_j\Bigr) = \sum_{j=1}^n h_{ij}\,\mathrm{cov}(y_i, y_j) = h_{ii}\,\mathrm{cov}(y_i, y_i) = \sigma^2 h_{ii}.$$
Then, since $\mathrm{var}(\hat{\mu}_i) = \sigma^2 h_{ii}$, it follows that the theoretical correlation is
$$\mathrm{corr}(y_i, \hat{\mu}_i) = \frac{\sigma^2 h_{ii}}{\sqrt{\sigma^2 \cdot \sigma^2 h_{ii}}} = \sqrt{h_{ii}}.$$
When the leverage is relatively large, $y_i$ is highly correlated with $\hat{\mu}_i$.
¹¹ Student is a pseudonym for W. S. Gosset, who discovered the $t$ distribution in 1908. For the normal linear model, each studentized residual has a $t$ distribution with $df = n - p$.
¹² Recall that for matrices of constants $\boldsymbol{A}$ and $\boldsymbol{B}$, $\mathrm{cov}(\boldsymbol{A}\boldsymbol{x}, \boldsymbol{B}\boldsymbol{y}) = \boldsymbol{A}\,\mathrm{cov}(\boldsymbol{x}, \boldsymbol{y})\boldsymbol{B}^{\mathsf{T}}$.
So, what do the leverages look like? For the bivariate linear model $E(y_i) = \beta_0 + \beta x_i$, Section 2.1.3 showed the hat matrix. The leverage for observation $i$ is
$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{k=1}^n (x_k - \bar{x})^2}.$$
The $n$ leverages have a mean of $2/n$. They tend to be smaller with larger datasets.
With multiple explanatory variables and values $\boldsymbol{x}_i$ for observation $i$ and means $\bar{\boldsymbol{x}}$ (as row vectors), let $\tilde{\boldsymbol{X}}$ denote the model matrix using centered variables. Then, the leverage for observation $i$ is
$$h_{ii} = \frac{1}{n} + (\boldsymbol{x}_i - \bar{\boldsymbol{x}})(\tilde{\boldsymbol{X}}^{\mathsf{T}}\tilde{\boldsymbol{X}})^{-1}(\boldsymbol{x}_i - \bar{\boldsymbol{x}})^{\mathsf{T}} \qquad (2.10)$$
(Belsley et al. 1980, Appendix 2A). The leverage increases as $\boldsymbol{x}_i$ is farther from $\bar{\boldsymbol{x}}$. With $p$ explanatory variables, including the intercept, the leverages have a mean of $p/n$. Observations with relatively large leverages, say exceeding about $3p/n$, may be influential in the fitting process.
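A sketch comparing the hat-matrix diagonal with formula (2.10) and the $p/n$ average, again on hypothetical simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)

n, k = 40, 2                        # k explanatory variables besides the intercept
Z = rng.normal(size=(n, k))         # explanatory variable values
X = np.column_stack([np.ones(n), Z])
p = X.shape[1]

# Leverages as the diagonal of the hat matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Leverages from formula (2.10), using the centered explanatory variables
Z_centered = Z - Z.mean(axis=0)
G_inv = np.linalg.inv(Z_centered.T @ Z_centered)
h_formula = 1 / n + np.einsum("ij,jk,ik->i", Z_centered, G_inv, Z_centered)

print(np.allclose(h, h_formula))    # True: the two expressions agree
print(h.mean(), p / n)              # the leverages average p/n
print(np.where(h > 3 * p / n)[0])   # flag potentially influential observations
```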
2.5.5 Influential Points for Least Squares Fits
An observation having small leverage is not influential in its impact on $\{\hat{\mu}_i\}$ and $\{\hat{\beta}_j\}$, even if it is an outlier in the $y$ direction. A point with extremely large leverage can be influential, but need not be so. It is influential when the observation is a "regression outlier," falling far from the least squares line that results using only the other $n - 1$ observations. See the first panel of Figure 2.9. By contrast, when the observation has a large leverage but is consistent with the trend shown by the other observations, it is not influential. See the second panel of Figure 2.9. To be influential, a point needs to have both a large leverage and a large standardized residual.
Figure 2.9 High leverage points in a linear model fit may be influential (first panel) or noninfluential (second panel).

Summary measures that describe an observation's influence combine information from the leverages and the residuals. For any such measure of influence, larger values correspond to greater influence. Cook's distance (Cook 1977) is based on the change in $\hat{\boldsymbol{\beta}}$ when the observation is removed from the dataset. Let $\hat{\boldsymbol{\beta}}_{(i)}$ denote the least squares estimate of $\boldsymbol{\beta}$ for the $n - 1$ observations after excluding observation $i$. Then, Cook's distance for observation $i$ is
$$D_i = \frac{(\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}})^{\mathsf{T}}[\widehat{\mathrm{var}}(\hat{\boldsymbol{\beta}})]^{-1}(\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}})}{p} = \frac{(\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}})^{\mathsf{T}}(\boldsymbol{X}^{\mathsf{T}}\boldsymbol{X})(\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}})}{p s^2}.$$
Incorporating the estimated variance of $\hat{\boldsymbol{\beta}}$ makes the measure free of the units of measurement and approximately free of the sample size. An equivalent expression uses the standardized residual $r_i$ and the leverage $h_{ii}$,
$$D_i = r_i^2\left[\frac{h_{ii}}{p(1 - h_{ii})}\right] = \frac{(y_i - \hat{\mu}_i)^2\, h_{ii}}{p s^2 (1 - h_{ii})^2}. \qquad (2.11)$$
A relatively large $D_i$, usually on the order of 1, occurs when both the standardized residual and the leverage are relatively large.
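A sketch computing Cook's distance through expression (2.11), with a hypothetical high-leverage, outlying observation appended so that a large $D_i$ flags it:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data with one high-leverage, outlying observation appended
n = 30
x = np.append(rng.uniform(0, 5, size=n - 1), 15.0)   # last x is far from the rest
y = 1 + 2 * x + rng.normal(scale=1.0, size=n)
y[-1] += 20.0                                        # and its y falls off the line

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)

r = e / np.sqrt(s2 * (1 - h))               # standardized residuals (2.9)
D = r**2 * h / (p * (1 - h))                # Cook's distance (2.11)

print(np.argmax(D), D.max())   # the contrived observation should have the largest D_i
```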
A measure with a similar purpose, DFFIT, describes the change in $\hat{\mu}_i$ due to deleting observation $i$. A standardized version (DFFITS) equals the studentized residual multiplied by the "leverage factor" $\sqrt{h_{ii}/(1 - h_{ii})}$. A variable-specific measure, DFBETA (with standardized version DFBETAS), is based on the change in $\hat{\beta}_j$ alone when the observation is removed from the dataset. Each observation has a separate DFBETA for each $\hat{\beta}_j$.
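These deletion diagnostics can also be computed by brute force, refitting the model without each observation in turn. A sketch on hypothetical data (for large n one would instead use closed-form leave-one-out updates), computing DFBETA as the coefficient change and Cook's distance from its defining formula:

```python
import numpy as np

rng = np.random.default_rng(6)

n = 30
x = rng.uniform(0, 5, size=n)
y = 1 + 2 * x + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), x])
p = X.shape[1]

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat
s2 = e @ e / (n - p)
XtX = X.T @ X

# For each observation: DFBETA (change in each coefficient when it is deleted)
# and Cook's distance computed directly from its definition
dfbeta = np.empty((n, p))
cooks = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    diff = beta_hat - beta_i          # coefficient change from deleting observation i
    dfbeta[i] = diff
    cooks[i] = diff @ XtX @ diff / (p * s2)

print(dfbeta[:3])
print(cooks[:3])
```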
2.5.6 Adjusting for Explanatory Variables by Regressing Residuals
Residuals are at the heart of what we mean by "adjusting for the other explanatory variables in the model" when describing the partial effect of an explanatory variable $x_j$. Suppose we use least squares to (1) regress $y$ on the explanatory variables other than $x_j$, and (2) regress $x_j$ on those other explanatory variables. When we regress the residuals from (1) on the residuals from (2), Yule (1907) showed that the fit has a slope identical to the estimated partial effect of variable $x_j$ in the multiple regression model. A scatterplot of these two sets of residuals is called a partial regression plot, also sometimes called an added-variable plot. The residuals for the least squares line between these two sets of residuals are identical to the residuals in the multiple regression model that regresses $y$ on all the explanatory variables.
To show Yule's result, we use his notation for linear model coefficients, introduced in Section 1.2.3. To ease formula complexity, we do this for the case of two explanatory variables, with all variables centered to eliminate intercept terms. Consider the models
$$E(y_i) = \beta_{y2} x_{i2}, \qquad E(x_{i1}) = \beta_{12} x_{i2}, \qquad E(y_i) = \beta_{y1\cdot 2}\, x_{i1} + \beta_{y2\cdot 1}\, x_{i2}.$$
The normal equations (2.1) for the bivariate models are
$$\sum_{i=1}^n x_{i2}(y_i - \beta_{y2} x_{i2}) = 0, \qquad \sum_{i=1}^n x_{i2}(x_{i1} - \beta_{12} x_{i2}) = 0.$$
The normal equations for the multiple regression model are
$$\sum_{i=1}^n x_{i1}(y_i - \beta_{y1\cdot 2} x_{i1} - \beta_{y2\cdot 1} x_{i2}) = 0, \qquad \sum_{i=1}^n x_{i2}(y_i - \beta_{y1\cdot 2} x_{i1} - \beta_{y2\cdot 1} x_{i2}) = 0.$$
From these two equations for the multiple regression model,
$$0 = \sum_{i=1}^n (y_i - \beta_{y1\cdot 2} x_{i1} - \beta_{y2\cdot 1} x_{i2})(x_{i1} - \beta_{12} x_{i2}).$$
Using this and the normal equation for the second bivariate model,
$$0 = \sum_{i=1}^n y_i(x_{i1} - \beta_{12} x_{i2}) - \beta_{y1\cdot 2}\sum_{i=1}^n x_{i1}(x_{i1} - \beta_{12} x_{i2})$$
$$= \sum_{i=1}^n (y_i - \beta_{y2} x_{i2})(x_{i1} - \beta_{12} x_{i2}) - \beta_{y1\cdot 2}\sum_{i=1}^n (x_{i1} - \beta_{12} x_{i2})^2.$$
It follows that the estimated partial effect of $x_1$ on $y$, adjusting for $x_2$, is
$$\hat{\beta}_{y1\cdot 2} = \frac{\sum_{i=1}^n (y_i - \hat{\beta}_{y2} x_{i2})(x_{i1} - \hat{\beta}_{12} x_{i2})}{\sum_{i=1}^n (x_{i1} - \hat{\beta}_{12} x_{i2})^2}.$$
But from (2.5) this is precisely the result of regressing the residuals from the regression of $y$ on $x_2$ on the residuals from the regression of $x_1$ on $x_2$.
This result has an interesting consequence. From the regression of residuals just mentioned, the fit for the full model satisfies
$$\hat{\mu}_i - \hat{\beta}_{y2} x_{i2} = \hat{\beta}_{y1\cdot 2}(x_{i1} - \hat{\beta}_{12} x_{i2}),$$
so that
$$\hat{\mu}_i = \hat{\beta}_{y2} x_{i2} + \hat{\beta}_{y1\cdot 2}(x_{i1} - \hat{\beta}_{12} x_{i2}) = \hat{\beta}_{y1\cdot 2} x_{i1} + (\hat{\beta}_{y2} - \hat{\beta}_{y1\cdot 2}\hat{\beta}_{12}) x_{i2}.$$
Therefore, the partial effect of $x_2$ on $y$, adjusting for $x_1$, has the expression
$$\hat{\beta}_{y2\cdot 1} = \hat{\beta}_{y2} - \hat{\beta}_{y1\cdot 2}\hat{\beta}_{12}.$$
In particular, $\hat{\beta}_{y2\cdot 1} = \hat{\beta}_{y2}$ if $\hat{\beta}_{12} = 0$, which is equivalent to $\mathrm{corr}(x_1^*, x_2^*) = 0$. They are also equal if $\hat{\beta}_{y1\cdot 2} = 0$. Likewise, $\hat{\beta}_{y1\cdot 2} = \hat{\beta}_{y1} - \hat{\beta}_{y2\cdot 1}\hat{\beta}_{21}$. An implication is that if the primary interest in a study is the effect of $x_1$ while adjusting for $x_2$ but the model does not include $x_2$, then the difference between the effect of interest and the effect actually estimated is the omitted variable bias, $\beta_{y2\cdot 1}\beta_{21}$.
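A numerical check of Yule's result on hypothetical simulated data: the slope from regressing residuals on residuals matches the multiple regression coefficient, and the identity $\hat{\beta}_{y2\cdot 1} = \hat{\beta}_{y2} - \hat{\beta}_{y1\cdot 2}\hat{\beta}_{12}$ holds.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical correlated explanatory variables, centered to drop intercepts
n = 200
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)
y = 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
x1, x2, y = x1 - x1.mean(), x2 - x2.mean(), y - y.mean()

# Bivariate least squares slopes (centered variables, no intercept)
b_y2 = (x2 @ y) / (x2 @ x2)      # y on x2
b_12 = (x2 @ x1) / (x2 @ x2)     # x1 on x2

# Residuals from the two bivariate fits
res_y = y - b_y2 * x2
res_x1 = x1 - b_12 * x2

# Slope from regressing residuals on residuals
b_partial = (res_x1 @ res_y) / (res_x1 @ res_x1)

# Multiple regression of y on x1 and x2
X = np.column_stack([x1, x2])
b_y12, b_y21 = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.isclose(b_partial, b_y12))             # Yule's result
print(np.isclose(b_y21, b_y2 - b_y12 * b_12))   # beta_{y2.1} = beta_{y2} - beta_{y1.2} beta_{12}
```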
2.5.7 Partial Correlation
A partial correlation describes the association between two variables after adjusting for other variables. Yule (1907) also showed how to formalize this concept using residuals. For example, the partial correlation between $y$ and $x_1$ while adjusting for $x_2$ and $x_3$ is obtained by (1) finding the residuals for predicting $y$ using $x_2$ and $x_3$, (2) finding the residuals for predicting $x_1$ using $x_2$ and $x_3$, and then (3) finding the ordinary correlation between these two sets of residuals.
The squared partial correlation between $y$ and a particular explanatory variable considers the variability unexplained without that variable and evaluates the proportional reduction in variability after adding it. That is, if $R_0^2$ is the proportional reduction in error without it, and $R_1^2$ is the value after adding it to the model, then the squared partial correlation between $y$ and the variable, adjusting for the others, is $(R_1^2 - R_0^2)/(1 - R_0^2)$.
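A sketch on hypothetical simulated data that computes the partial correlation between $y$ and $x_1$, adjusting for $x_2$ and $x_3$, both from the residual-correlation definition and from the $R^2$ comparison, and checks that the two agree:

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical data: y and x1, adjusting for x2 and x3
n = 300
x2, x3 = rng.normal(size=n), rng.normal(size=n)
x1 = 0.5 * x2 - 0.3 * x3 + rng.normal(size=n)
y = 1.0 * x1 + 0.8 * x2 + 0.4 * x3 + rng.normal(size=n)

def resid(target, *predictors):
    """Residuals from the least squares fit of target on an intercept + predictors."""
    Z = np.column_stack([np.ones(len(target)), *predictors])
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    return target - Z @ coef

def r_squared(target, *predictors):
    """Proportional reduction in error (R^2) for the fit of target on predictors."""
    e = resid(target, *predictors)
    return 1 - (e @ e) / np.sum((target - target.mean())**2)

# Steps (1)-(3): correlation between the two sets of residuals
partial_corr = np.corrcoef(resid(y, x2, x3), resid(x1, x2, x3))[0, 1]

# Squared partial correlation via the R^2 comparison
R2_0 = r_squared(y, x2, x3)          # without x1
R2_1 = r_squared(y, x1, x2, x3)      # with x1
print(partial_corr**2, (R2_1 - R2_0) / (1 - R2_0))   # these agree
```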