Hat Matrix. We define the hat matrix to be $H = X(X'X)^{-1}X'$, so that $\hat{y} = Xb = Hy$. From this, the matrix $H$ is said to project the vector of responses $y$ onto the vector of fitted values $\hat{y}$.
Because $H' = H$, the hat matrix is symmetric. Further, it is also an idempotent matrix, due to the property $HH = H$. To see this, we have
$$HH = \left(X(X'X)^{-1}X'\right)\left(X(X'X)^{-1}X'\right) = X(X'X)^{-1}(X'X)(X'X)^{-1}X' = X(X'X)^{-1}X' = H.$$
Similarly, it is easy to check that $I - H$ is idempotent. Because $H$ is idempotent, from some results in matrix algebra, it is straightforward to show that $\sum_{i=1}^{n} h_{ii} = k + 1$. As discussed in Section 5.4.1, we use our bounds and the average leverage, $\bar{h} = (k+1)/n$, to help identify observations with unusually high leverage.
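These properties are easy to confirm numerically. The following is a minimal Python/numpy sketch, using a small simulated design matrix (the dimensions and data are illustrative assumptions, not from the text), that checks the symmetry, idempotence, and trace of the hat matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 3                                                 # illustrative dimensions
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # intercept plus k variables

H = X @ np.linalg.inv(X.T @ X) @ X.T                         # H = X (X'X)^{-1} X'

print(np.allclose(H, H.T))                    # H' = H  (symmetry)
print(np.allclose(H @ H, H))                  # HH = H  (idempotence)
print(np.isclose(H.diagonal().sum(), k + 1))  # sum of leverages h_ii equals k + 1
```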
Variance of Residuals. Using the model equation $y = X\beta + \varepsilon$, we can express the vector of residuals as
$$e = y - \hat{y} = y - Hy = (I - H)(X\beta + \varepsilon) = (I - H)\varepsilon. \qquad (5.14)$$
The last equality is because $(I - H)X = X - HX = X - X = 0$. Using $\mathrm{Var}\,\varepsilon = \sigma^2 I$, we have
$$\mathrm{Var}\, e = \mathrm{Var}\left[(I - H)\varepsilon\right] = (I - H)\,\mathrm{Var}\,\varepsilon\,(I - H)' = \sigma^2 (I - H)I(I - H)' = \sigma^2 (I - H).$$
The last equality comes from the fact that $I - H$ is symmetric and idempotent. Thus, we have that
$$\mathrm{Var}\, e_i = \sigma^2 (1 - h_{ii}) \quad \text{and} \quad \mathrm{Cov}(e_i, e_j) = -\sigma^2 h_{ij}. \qquad (5.15)$$
Thus, although the true errors $\varepsilon$ are uncorrelated, there is a small negative correlation among the residuals $e$.
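Equation (5.15) can also be checked by simulation: generate many error vectors, form the residuals $e = (I - H)\varepsilon$, and compare their sample covariance with $\sigma^2(I - H)$. The sketch below does this on an illustrative simulated design; the dimensions, $\sigma$, and tolerance are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma = 10, 2, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H                                    # I - H

eps = rng.normal(scale=sigma, size=(100_000, n))     # many simulated error vectors
resid = eps @ M.T                                    # e = (I - H) eps for each draw

emp_cov = np.cov(resid, rowvar=False)                # empirical covariance of residuals
print(np.allclose(emp_cov, sigma**2 * M, atol=0.1))  # approximately sigma^2 (I - H)
```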
Dominance of the Error in the Residual. Examining the $i$th row of equation (5.14), we have that the $i$th residual
$$e_i = \varepsilon_i - \sum_{j=1}^{n} h_{ij}\varepsilon_j \qquad (5.16)$$
can be expressed as a linear combination of independent errors. The relation $H = HH$ yields
$$h_{ii} = \sum_{j=1}^{n} h_{ij}^2. \qquad (5.17)$$
Because $h_{ii}$ is, on average, $(k+1)/n$, this indicates that each $h_{ij}$ is small relative to 1. Thus, when interpreting equation (5.16), we say that most of the information in $e_i$ is due to $\varepsilon_i$.
Correlations with Residuals. First define $x_j = (x_{1j}, x_{2j}, \ldots, x_{nj})'$ to be the column representing the $j$th variable. With this notation, we can partition the matrix of explanatory variables as $X = \left(x_0, x_1, \ldots, x_k\right)$. Now, examining the $j$th column of the relation $(I - H)X = 0$, we have $(I - H)x_j = 0$. With $e = (I - H)\varepsilon$, this yields $e'x_j = \varepsilon'(I - H)x_j = 0$, for $j = 0, 1, \ldots, k$. This result has several implications. If the intercept is in the model, then $x_0 = (1, 1, \ldots, 1)'$ is a vector of ones. Here, $e'x_0 = 0$ means that $\sum_{i=1}^{n} e_i = 0$, that is, the average residual is zero.
Further, because $e'x_j = 0$, it is easy to check that the sample correlation between $e$ and $x_j$ is zero. Along the same line, we also have that $e'\hat{y} = \varepsilon'(I - H)Xb = 0$. Thus, using the same argument as above, the sample correlation between $e$ and $\hat{y}$ is zero.
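These orthogonality relations can be verified directly on a fitted model. The following sketch uses illustrative simulated data; the particular coefficients and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -0.3, 2.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)    # least squares coefficients
yhat = X @ b
e = y - yhat

print(np.allclose(e @ X, 0))             # e'x_j = 0 for every column, so mean(e) = 0
print(np.isclose(e @ yhat, 0))           # e'yhat = 0, hence corr(e, yhat) = 0
print(np.allclose([np.corrcoef(e, X[:, j])[0, 1] for j in range(1, k + 1)], 0))
```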
Multiple Correlation Coefficient. For an example of a nonzero correlation, consider $r(y, \hat{y})$, the sample correlation between $y$ and $\hat{y}$. Because $(I - H)x_0 = 0$, we have $x_0 = Hx_0$ and thus $\hat{y}'x_0 = y'Hx_0 = y'x_0$. Assuming $x_0 = (1, 1, \ldots, 1)'$, this means that $\sum_{i=1}^{n} \hat{y}_i = \sum_{i=1}^{n} y_i$, so that the average fitted value is $\bar{y}$. Now,
$$r(y, \hat{y}) = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{y})}{(n-1)s_y s_{\hat{y}}}.$$
Recall that $(n-1)s_y^2 = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \text{Total SS}$ and $(n-1)s_{\hat{y}}^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = \text{Regress SS}$. Further, with $x_0 = (1, 1, \ldots, 1)'$,
$$\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{y}) = (y - \bar{y}x_0)'(\hat{y} - \bar{y}x_0) = y'\hat{y} - \bar{y}^2 x_0'x_0 = y'Xb - n\bar{y}^2 = \text{Regress SS},$$
where the last equality uses $y'Xb = y'Hy = y'HHy = \hat{y}'\hat{y}$. This yields
$$r(y, \hat{y}) = \frac{\text{Regress SS}}{\sqrt{(\text{Total SS})(\text{Regress SS})}} = \sqrt{\frac{\text{Regress SS}}{\text{Total SS}}} = \sqrt{R^2}. \qquad (5.18)$$
That is, the coefficient of determination can be interpreted as the square of the correlation between the observed and fitted responses.
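A short numerical check of equation (5.18) on illustrative simulated data follows: the sample correlation between $y$ and the fitted values should equal the square root of $R^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
yhat = X @ b

total_ss = np.sum((y - y.mean())**2)
regress_ss = np.sum((yhat - y.mean())**2)
r2 = regress_ss / total_ss                 # coefficient of determination

r = np.corrcoef(y, yhat)[0, 1]             # sample correlation r(y, yhat)
print(np.isclose(r, np.sqrt(r2)))          # equation (5.18)
```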
5.10.2 Leave-One-Out Statistics
Notation. To test the sensitivity of regression quantities, there are a number of statistics of interest that are based on the notion of "leaving out," or omitting, an observation. To this end, the subscript notation $(i)$ means to leave out the $i$th observation. For example, omitting the row of explanatory variables $x_i' = (x_{i0}, x_{i1}, \ldots, x_{ik})$ from $X$ yields $X_{(i)}$, an $(n-1) \times (k+1)$ matrix of explanatory variables. Similarly, $y_{(i)}$ is an $(n-1) \times 1$ vector, based on removing the $i$th row from $y$.
Basic Matrix Result. Suppose that $A$ is an invertible $p \times p$ matrix and $z$ is a $p \times 1$ vector. The following result from matrix algebra provides an important tool for understanding leave-one-out statistics in linear regression analysis:
$$\left(A - zz'\right)^{-1} = A^{-1} + \frac{A^{-1}zz'A^{-1}}{1 - z'A^{-1}z}. \qquad (5.19)$$
To check this result, simply multiply $A - zz'$ by the right-hand side of equation (5.19) to get $I$, the identity matrix.
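A quick numerical check of equation (5.19), using an arbitrary positive definite $A$ and vector $z$ (both illustrative assumptions), is sketched below.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
A = rng.normal(size=(p, p))
A = A @ A.T + 10 * np.eye(p)           # an invertible (positive definite) p x p matrix
z = rng.normal(size=p)

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A - np.outer(z, z))                          # (A - zz')^{-1}
rhs = Ainv + (Ainv @ np.outer(z, z) @ Ainv) / (1 - z @ Ainv @ z)  # right-hand side of (5.19)
print(np.allclose(lhs, rhs))
```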
Vector of Regression Coefficients. Omitting the $i$th observation, our new vector of regression coefficients is $b_{(i)} = \left(X_{(i)}'X_{(i)}\right)^{-1}X_{(i)}'y_{(i)}$. An alternative expression for $b_{(i)}$ that is simpler to compute turns out to be
$$b_{(i)} = b - \frac{\left(X'X\right)^{-1}x_i e_i}{1 - h_{ii}}. \qquad (5.20)$$
To verify equation (5.20), first use equation (5.19) with $A = X'X$ and $z = x_i$ to get
$$\left(X_{(i)}'X_{(i)}\right)^{-1} = \left(X'X - x_i x_i'\right)^{-1} = \left(X'X\right)^{-1} + \frac{\left(X'X\right)^{-1}x_i x_i'\left(X'X\right)^{-1}}{1 - h_{ii}},$$
where, from equation (5.3), we have $h_{ii} = x_i'\left(X'X\right)^{-1}x_i$. Multiplying each side by $X_{(i)}'y_{(i)} = X'y - x_i y_i$ yields
$$\begin{aligned}
b_{(i)} = \left(X_{(i)}'X_{(i)}\right)^{-1}X_{(i)}'y_{(i)}
&= \left[\left(X'X\right)^{-1} + \frac{\left(X'X\right)^{-1}x_i x_i'\left(X'X\right)^{-1}}{1 - h_{ii}}\right]\left(X'y - x_i y_i\right) \\
&= b - \left(X'X\right)^{-1}x_i y_i + \frac{\left(X'X\right)^{-1}x_i x_i' b - \left(X'X\right)^{-1}x_i x_i'\left(X'X\right)^{-1}x_i y_i}{1 - h_{ii}} \\
&= b - \frac{(1 - h_{ii})\left(X'X\right)^{-1}x_i y_i - \left(X'X\right)^{-1}x_i x_i' b + \left(X'X\right)^{-1}x_i h_{ii} y_i}{1 - h_{ii}} \\
&= b - \frac{\left(X'X\right)^{-1}x_i y_i - \left(X'X\right)^{-1}x_i x_i' b}{1 - h_{ii}}
= b - \frac{\left(X'X\right)^{-1}x_i e_i}{1 - h_{ii}}.
\end{aligned}$$
This establishes equation (5.20).
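Equation (5.20) is easy to verify numerically: refit the regression with the $i$th row deleted and compare the result with the one-line update. The data and the choice of $i$ below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, i = 30, 3, 7                       # drop the i-th observation (illustrative)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -0.3, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)      # leverages h_ii = x_i'(X'X)^{-1}x_i

# direct refit without row i
mask = np.arange(n) != i
b_refit = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]

# equation (5.20)
b_formula = b - XtX_inv @ X[i] * e[i] / (1 - h[i])
print(np.allclose(b_refit, b_formula))
```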
Cook's Distance. To measure the effect, or influence, of omitting the $i$th observation, Cook examined the difference between fitted values with and without the observation. We define Cook's distance to be
$$D_i = \frac{\left(\hat{y} - \hat{y}_{(i)}\right)'\left(\hat{y} - \hat{y}_{(i)}\right)}{(k+1)s^2},$$
where $\hat{y}_{(i)} = Xb_{(i)}$ is the vector of fitted values calculated omitting the $i$th point. Using equation (5.20) and $\hat{y} = Xb$, an alternative expression for Cook's distance is
$$\begin{aligned}
D_i &= \frac{\left(b - b_{(i)}\right)'\left(X'X\right)\left(b - b_{(i)}\right)}{(k+1)s^2}
= \frac{e_i^2}{(1 - h_{ii})^2}\,\frac{x_i'\left(X'X\right)^{-1}\left(X'X\right)\left(X'X\right)^{-1}x_i}{(k+1)s^2} \\
&= \frac{e_i^2}{(1 - h_{ii})^2}\,\frac{h_{ii}}{(k+1)s^2}
= \left(\frac{e_i}{s\sqrt{1 - h_{ii}}}\right)^2 \frac{h_{ii}}{(k+1)(1 - h_{ii})}.
\end{aligned}$$
This result is not only useful computationally, it also serves to decompose the statistic into the part due to the standardized residual, $\left(e_i/\left(s(1 - h_{ii})^{1/2}\right)\right)^2$, and the part due to the leverage, $h_{ii}/\left((k+1)(1 - h_{ii})\right)$.
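The decomposition can be checked numerically by comparing the definition of $D_i$ (refitting without each point) with the computational form. The simulated data below are an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -0.3, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
s2 = e @ e / (n - (k + 1))

# definition: squared distance between fitted values with and without point i
D_def = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    b_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    D_def[i] = np.sum((X @ b - X @ b_i)**2) / ((k + 1) * s2)

# computational form: (e_i / (s sqrt(1-h_ii)))^2 * h_ii / ((k+1)(1-h_ii))
D_short = (e / np.sqrt(s2 * (1 - h)))**2 * h / ((k + 1) * (1 - h))
print(np.allclose(D_def, D_short))
```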
Leave-One-Out Residual. The leave-one-out residual is defined by $e_{(i)} = y_i - x_i'b_{(i)}$. It is used in computing the PRESS statistic, described in Section 5.6.3. A simple computational expression is $e_{(i)} = e_i/(1 - h_{ii})$. To verify this, use equation (5.20) to get
$$e_{(i)} = y_i - x_i'b_{(i)} = y_i - x_i'\left(b - \frac{\left(X'X\right)^{-1}x_i e_i}{1 - h_{ii}}\right) = e_i + \frac{x_i'\left(X'X\right)^{-1}x_i e_i}{1 - h_{ii}} = e_i + \frac{h_{ii}e_i}{1 - h_{ii}} = \frac{e_i}{1 - h_{ii}}.$$
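The identity $e_{(i)} = e_i/(1 - h_{ii})$ underlies the fast computation of the PRESS statistic, since it avoids $n$ separate refits. A brute-force check on illustrative simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -0.3, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)

# brute force: prediction error for each point when it is left out of the fit
e_loo = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    b_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    e_loo[i] = y[i] - X[i] @ b_i

print(np.allclose(e_loo, e / (1 - h)))      # e_(i) = e_i / (1 - h_ii)
print(np.sum((e / (1 - h))**2))             # PRESS statistic without any refitting
```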
Leave-One-Out Variance Estimate. The leave-one-out estimate of the variance is defined by $s_{(i)}^2 = \left((n-1) - (k+1)\right)^{-1}\sum_{j \neq i}\left(y_j - x_j'b_{(i)}\right)^2$. It is used in the definition of the studentized residual, defined in Section 5.3.1. A simple computational expression is given by
$$s_{(i)}^2 = \frac{\left(n - (k+1)\right)s^2 - \dfrac{e_i^2}{1 - h_{ii}}}{(n-1) - (k+1)}. \qquad (5.21)$$
To see this, first note that from equation (5.14) we have $He = H(I - H)\varepsilon = 0$, because $H = HH$. In particular, from the $i$th row of $He = 0$, we have $\sum_{j=1}^{n} h_{ij}e_j = 0$. Now, using equations (5.17) and (5.20), we have
$$\begin{aligned}
\sum_{j \neq i}\left(y_j - x_j'b_{(i)}\right)^2
&= \sum_{j=1}^{n}\left(y_j - x_j'b_{(i)}\right)^2 - \left(y_i - x_i'b_{(i)}\right)^2 \\
&= \sum_{j=1}^{n}\left(y_j - x_j'b + \frac{x_j'\left(X'X\right)^{-1}x_i e_i}{1 - h_{ii}}\right)^2 - e_{(i)}^2 \\
&= \sum_{j=1}^{n}\left(e_j + \frac{h_{ij}e_i}{1 - h_{ii}}\right)^2 - \frac{e_i^2}{(1 - h_{ii})^2} \\
&= \sum_{j=1}^{n} e_j^2 + 0 + \frac{e_i^2}{(1 - h_{ii})^2}h_{ii} - \frac{e_i^2}{(1 - h_{ii})^2} \\
&= \sum_{j=1}^{n} e_j^2 - \frac{e_i^2}{1 - h_{ii}} = \left(n - (k+1)\right)s^2 - \frac{e_i^2}{1 - h_{ii}}.
\end{aligned}$$
This establishes equation (5.21).
5.10.3 Omitting Variables
Notation. To measure the effect on regression quantities, there are a number of statistics of interest that are based on the notion of omitting an explanatory variable. To this end, the superscript notation $(j)$ means to omit the $j$th variable, where $j = 0, 1, \ldots, k$. First, recall that $x_j = (x_{1j}, x_{2j}, \ldots, x_{nj})'$ is the column representing the $j$th variable. Further, define $X^{(j)}$ to be the $n \times k$ matrix of explanatory variables defined by removing $x_j$ from $X$. For example, taking $j = k$, we often partition $X$ as $X = \left(X^{(k)} : x_k\right)$. Employing the results of Section 4.7.2, we will use $X^{(k)} = X_1$ and $x_k = X_2$.
Variance Inflation Factor. We first would like to establish the relationship between the definition of the standard error of $b_j$, given by
$$se(b_j) = s\sqrt{(j+1)\text{st diagonal element of } \left(X'X\right)^{-1}},$$
and the expression involving the variance inflation factor,
$$se(b_j) = s\,\frac{\sqrt{VIF_j}}{s_{x_j}\sqrt{n-1}}.$$
By symmetry of the independent variables, we need consider only the case where $j = k$. Thus, we would like to establish
$$(k+1)\text{st diagonal element of } \left(X'X\right)^{-1} = VIF_k \big/ \left((n-1)s_{x_k}^2\right). \qquad (5.22)$$
First consider the reparameterized model in equation (4.22). From equation (4.23), we can express the regression coefficient estimate as $b_k = \left(E_2'y\right)/\left(E_2'E_2\right)$. From equation (4.23), we have that $\mathrm{Var}\, b_k = \sigma^2\left(E_2'E_2\right)^{-1}$ and thus
$$se(b_k) = s\left(E_2'E_2\right)^{-1/2}. \qquad (5.23)$$
Thus, $\left(E_2'E_2\right)^{-1}$ is the $(k+1)$st diagonal element of $\left(\left(X_1 \;\; E_2\right)'\left(X_1 \;\; E_2\right)\right)^{-1}$ (because $X_1'E_2 = 0$, this matrix is block diagonal) and is also the $(k+1)$st diagonal element of $\left(X'X\right)^{-1}$. Alternatively, this can be verified directly using the partitioned matrix inverse in equation (4.19).
Now, suppose that we run a regression using $x_k = X_2$ as the response vector and $X^{(k)} = X_1$ as the matrix of explanatory variables. As noted in equation (4.22), $E_2$ represents the "residuals" from this regression, and thus $E_2'E_2$ represents the error sum of squares. For this regression, the total sum of squares is $\sum_{i=1}^{n}(x_{ik} - \bar{x}_k)^2 = (n-1)s_{x_k}^2$ and the coefficient of determination is $R_k^2$. Thus,
$$E_2'E_2 = \text{Error SS} = \text{Total SS}\left(1 - R_k^2\right) = (n-1)s_{x_k}^2\big/ VIF_k.$$
This establishes equation (5.22).
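The following sketch checks equation (5.22) on illustrative simulated data with some induced collinearity: the $(k+1)$st diagonal element of $(X'X)^{-1}$ matches $VIF_k/((n-1)s_{x_k}^2)$, where $VIF_k$ is computed from the regression of $x_k$ on the other columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
X[:, k] = X[:, k] + 0.8 * X[:, 1]             # induce some collinearity (illustrative)

XtX_inv = np.linalg.inv(X.T @ X)

# regress x_k on the other columns to get R_k^2 and VIF_k
xk = X[:, k]
Xmk = np.delete(X, k, axis=1)
fit = Xmk @ np.linalg.lstsq(Xmk, xk, rcond=None)[0]
R2_k = np.sum((fit - xk.mean())**2) / np.sum((xk - xk.mean())**2)
VIF_k = 1 / (1 - R2_k)

sxk2 = np.var(xk, ddof=1)                     # sample variance of x_k
print(np.isclose(XtX_inv[k, k], VIF_k / ((n - 1) * sxk2)))   # equation (5.22)
```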
Establishing $t^2 = F$. For testing the null hypothesis $H_0: \beta_k = 0$, the material in Section 3.4.1 provides a description of a test based on the t-statistic, $t(b_k) = b_k/se(b_k)$. An alternative test procedure, described in Section 4.2.2, uses the test statistic
$$F\text{-ratio} = \frac{(\text{Error SS})_{\text{reduced}} - (\text{Error SS})_{\text{full}}}{p \times (\text{Error MS})_{\text{full}}} = \frac{\left(E_2'y\right)^2}{s^2 E_2'E_2}$$
from equation (4.26). Alternatively, from equations (4.23) and (5.23), we have
$$t(b_k) = \frac{b_k}{se(b_k)} = \frac{E_2'y\big/\left(E_2'E_2\right)}{s\big/\sqrt{E_2'E_2}} = \frac{E_2'y}{s\sqrt{E_2'E_2}}. \qquad (5.24)$$
Thus, $t(b_k)^2 = F\text{-ratio}$.
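A numerical check that the squared t-statistic equals the partial F-ratio for dropping $x_k$ (one restriction, so $p = 1$) follows, again on illustrative simulated data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -0.3, 0.4]) + rng.normal(size=n)

# full model
b = np.linalg.lstsq(X, y, rcond=None)[0]
e_full = y - X @ b
error_ss_full = e_full @ e_full
s2 = error_ss_full / (n - (k + 1))                  # (Error MS)_full

# t-statistic for H0: beta_k = 0
se_bk = np.sqrt(s2 * np.linalg.inv(X.T @ X)[k, k])
t_bk = b[k] / se_bk

# reduced model without x_k, and the partial F-ratio (one restriction, p = 1)
X_red = X[:, :k]
b_red = np.linalg.lstsq(X_red, y, rcond=None)[0]
error_ss_red = np.sum((y - X_red @ b_red)**2)
F = (error_ss_red - error_ss_full) / s2

print(np.isclose(t_bk**2, F))
```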
Partial Correlation Coefficients. From the full regression model $y = X^{(k)}\beta^{(k)} + x_k\beta_k + \varepsilon$, consider two separate regressions. A regression using $x_k$ as the response vector and $X^{(k)}$ as the matrix of explanatory variables yields the residuals $E_2$. Similarly, a regression using $y$ as the response vector and $X^{(k)}$ as the matrix of explanatory variables yields the residuals
$$E_1 = y - X^{(k)}\left(X^{(k)\prime}X^{(k)}\right)^{-1}X^{(k)\prime}y.$$
If $x_0 = (1, 1, \ldots, 1)'$, then the averages of $E_1$ and $E_2$ are zero. In this case, the sample correlation between $E_1$ and $E_2$ is
$$r(E_1, E_2) = \frac{\sum_{i=1}^{n} E_{1i}E_{2i}}{\sqrt{\sum_{i=1}^{n} E_{1i}^2 \sum_{i=1}^{n} E_{2i}^2}} = \frac{E_1'E_2}{\sqrt{E_1'E_1 \, E_2'E_2}}.$$
Because $E_2$ is a vector of residuals using $X^{(k)}$ as the matrix of explanatory variables, we have $E_2'X^{(k)} = 0$. Thus, for the numerator, we have $E_2'E_1 = E_2'\left(y - X^{(k)}\left(X^{(k)\prime}X^{(k)}\right)^{-1}X^{(k)\prime}y\right) = E_2'y$. From equations (4.24) and (4.25), we have
$$\left(n - (k+1)\right)s^2 = (\text{Error SS})_{\text{full}} = E_1'E_1 - \left(E_2'y\right)^2\big/\left(E_2'E_2\right) = E_1'E_1 - \left(E_1'E_2\right)^2\big/\left(E_2'E_2\right).$$
Thus, from equation (5.24),
$$\begin{aligned}
\frac{t(b_k)}{\sqrt{t(b_k)^2 + n - (k+1)}}
&= \frac{E_2'y\big/\left(s\sqrt{E_2'E_2}\right)}{\sqrt{\dfrac{\left(E_2'y\right)^2}{s^2 E_2'E_2} + n - (k+1)}}
= \frac{E_2'y}{\sqrt{\left(E_2'y\right)^2 + E_2'E_2\, s^2\left(n - (k+1)\right)}} \\
&= \frac{E_2'E_1}{\sqrt{\left(E_2'E_1\right)^2 + E_2'E_2\left(E_1'E_1 - \dfrac{\left(E_2'E_1\right)^2}{E_2'E_2}\right)}}
= \frac{E_1'E_2}{\sqrt{\left(E_1'E_1\right)\left(E_2'E_2\right)}} = r(E_1, E_2).
\end{aligned}$$
This establishes the relationship between the partial correlation coefficient and the t-ratio statistic.
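The relationship can be confirmed numerically: compute the t-ratio for $b_k$ from the full fit, compute $E_1$ and $E_2$ by regressing $y$ and $x_k$ on the remaining columns, and compare. The simulated data below are an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -0.3, 0.4]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
s2 = np.sum((y - X @ b)**2) / (n - (k + 1))
t_bk = b[k] / np.sqrt(s2 * XtX_inv[k, k])          # t-ratio for beta_k

# residuals of y and of x_k after regressing each on the remaining columns
X_red = X[:, :k]
proj = X_red @ np.linalg.lstsq(X_red, np.column_stack([y, X[:, k]]), rcond=None)[0]
E1, E2 = y - proj[:, 0], X[:, k] - proj[:, 1]

partial_r = (E1 @ E2) / np.sqrt((E1 @ E1) * (E2 @ E2))
print(np.isclose(t_bk / np.sqrt(t_bk**2 + n - (k + 1)), partial_r))
```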
6 Interpreting Regression Results
Chapter Preview. A regression analyst collects data, selects a model, and then reports on the findings of the study, in that order. This chapter considers these three topics in reverse order, emphasizing how each stage of the study is influenced by preceding steps.
An application, determining a firm’s characteristics that influence its effectiveness in managing risk, illustrates the regression modeling process from start to finish.
Studying a problem using a regression modeling process involves a substantial commitment of time and energy. One must first embrace the concept of statistical thinking, a willingness to use data actively as part of a decision-making process.
Second, one must appreciate the usefulness of a model that is used to approximate a real situation. Having made this substantial commitment, there is a natural tendency to "oversell" the results of statistical methods such as regression analysis. When any set of ideas is oversold, consumers eventually become disappointed when the results do not live up to their expectations. This chapter begins in Section 6.1 by summarizing what we can reasonably expect to learn from regression modeling.
Models are designed to be much simpler than relationships among entities that exist in the real world. A model is merely an approximation of reality. As stated by George Box (1979), “All models are wrong, but some are useful.”
Developing the model, the subject of Chapter 5, is part of the art of statistics.
Although the principles of variable selection are widely accepted, the application of these principles can vary considerably among analysts. The resulting product has certain aesthetic values and is by no means predetermined. Statistics can be thought of as the art of reasoning with data. Section 6.2 will underscore the importance of variable selection.
Model formulation and data collection form the first stage of the modeling process. Students of statistics are usually surprised at the difficulty of relating ideas about relationships to available data. These difficulties include a lack of readily available data and the need to use certain data as proxies for ideal information that is not available numerically. Section 6.3 will describe several types of difficulties that can arise when collecting data. Section 6.4 will describe some models to alleviate these difficulties.