Technical Supplements for Chapter 5


Hat Matrix. We define the hat matrix to be $H = X(X'X)^{-1}X'$, so that $\hat{y} = Xb = Hy$. From this, the matrix $H$ is said to project the vector of responses $y$ onto the vector of fitted values $\hat{y}$.

Because $H' = H$, the hat matrix is symmetric. Further, it is also an idempotent matrix, because $HH = H$. To see this, we have
$$HH = \left( X(X'X)^{-1}X' \right)\left( X(X'X)^{-1}X' \right) = X(X'X)^{-1}(X'X)(X'X)^{-1}X' = X(X'X)^{-1}X' = H.$$

Similarly, it is easy to check that $I - H$ is idempotent. Because $H$ is idempotent, from some results in matrix algebra, it is straightforward to show that $\sum_{i=1}^n h_{ii} = k+1$. As discussed in Section 5.4.1, we use our bounds and the average leverage, $\bar{h} = (k+1)/n$, to help identify observations with unusually high leverage.
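As an illustrative check (not from the text), the following minimal sketch in Python/NumPy, using a small simulated design matrix, confirms the symmetry, idempotence, and trace properties of the hat matrix, as well as equation (5.17) below.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept plus k regressors

H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix H = X (X'X)^{-1} X'

print(np.allclose(H, H.T))                             # symmetric: H' = H
print(np.allclose(H @ H, H))                           # idempotent: HH = H
print(np.isclose(np.trace(H), k + 1))                  # sum of leverages h_ii equals k + 1
print(np.allclose(np.diag(H), np.sum(H**2, axis=1)))   # h_ii = sum_j h_ij^2, equation (5.17)
```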

Variance of Residuals. Using the model equation $y = X\beta + \varepsilon$, we can express the vector of residuals as
$$e = y - \hat{y} = y - Hy = (I - H)(X\beta + \varepsilon) = (I - H)\varepsilon. \qquad (5.14)$$
The last equality holds because $(I - H)X = X - HX = X - X = 0$. Using $\mathrm{Var}\,\varepsilon = \sigma^2 I$, we have

$$\mathrm{Var}\,e = \mathrm{Var}\left[ (I - H)\varepsilon \right] = (I - H)\,\mathrm{Var}\,\varepsilon\,(I - H)' = \sigma^2 (I - H)I(I - H)' = \sigma^2 (I - H).$$
The last equality comes from the fact that $I - H$ is symmetric and idempotent. Thus, we have that
$$\mathrm{Var}\,e_i = \sigma^2(1 - h_{ii}) \quad\text{and}\quad \mathrm{Cov}(e_i, e_j) = -\sigma^2 h_{ij}. \qquad (5.15)$$
Thus, although the true errors $\varepsilon$ are uncorrelated, there is a small negative correlation among the residuals $e$.
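As a purely illustrative check of equation (5.15), one can simulate many error vectors with a known $\sigma$ (both the data and $\sigma$ are assumptions, not from the text) and compare the empirical variance of each residual with $\sigma^2(1 - h_{ii})$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma = 25, 2, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
H = X @ np.linalg.inv(X.T @ X) @ X.T
I_H = np.eye(n) - H

# simulate many error vectors and form the residuals e = (I - H) eps
eps = sigma * rng.normal(size=(100_000, n))
resid = eps @ I_H.T

emp_var = resid.var(axis=0)              # empirical Var(e_i)
theory = sigma**2 * (1 - np.diag(H))     # sigma^2 (1 - h_ii), equation (5.15)
print(np.max(np.abs(emp_var - theory)))  # small simulation error
```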

Dominance of the Error in the Residual. Examining the $i$th row of equation (5.14), we have that the $i$th residual
$$e_i = \varepsilon_i - \sum_{j=1}^n h_{ij}\varepsilon_j \qquad (5.16)$$
can be expressed as a linear combination of independent errors. The relation $H = HH'$ yields
$$h_{ii} = \sum_{j=1}^n h_{ij}^2. \qquad (5.17)$$
Because $h_{ii}$ is, on average, $(k+1)/n$, this indicates that each $h_{ij}$ is small relative to 1. Thus, when interpreting equation (5.16), we say that most of the information in $e_i$ is due to $\varepsilon_i$.

Correlations with Residuals. First define $x_j = (x_{1j}, x_{2j}, \ldots, x_{nj})'$ to be the column representing the $j$th variable. With this notation, we can partition the matrix of explanatory variables as $X = \left( x_0, x_1, \ldots, x_k \right)$. Now, examining the $j$th column of the relation $(I - H)X = 0$, we have $(I - H)x_j = 0$. With $e = (I - H)\varepsilon$, this yields $e'x_j = \varepsilon'(I - H)x_j = 0$, for $j = 0, 1, \ldots, k$. This result has several implications. If the intercept is in the model, then $x_0 = (1, 1, \ldots, 1)'$ is a vector of ones. Here, $e'x_0 = 0$ means that $\sum_{i=1}^n e_i = 0$; that is, the average residual is zero.

Further, because $e'x_j = 0$, it is easy to check that the sample correlation between $e$ and $x_j$ is zero. Along the same line, we also have that $e'\hat{y} = \varepsilon'(I - H)Xb = 0$. Thus, using the same argument as above, the sample correlation between $e$ and $\hat{y}$ is zero.
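A quick numerical sketch of these orthogonality properties, again on illustrative simulated data (coefficients and seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)   # least squares coefficients
yhat = X @ b
e = y - yhat

print(np.allclose(e @ X, 0))      # e'x_j = 0 for every column, so the average residual is 0
print(np.isclose(e @ yhat, 0))    # e'yhat = 0, so corr(e, yhat) = 0
```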

Multiple Correlation Coefficient. For an example of a nonzero correlation, consider $r(y, \hat{y})$, the sample correlation between $y$ and $\hat{y}$. Because $(I - H)x_0 = 0$, we have $x_0 = Hx_0$ and thus $\hat{y}'x_0 = y'Hx_0 = y'x_0$. Assuming $x_0 = (1, 1, \ldots, 1)'$, this means that $\sum_{i=1}^n \hat{y}_i = \sum_{i=1}^n y_i$, so that the average fitted value is $\bar{y}$. Now,
$$r(y, \hat{y}) = \frac{\sum_{i=1}^n (y_i - \bar{y})(\hat{y}_i - \bar{y})}{(n-1)s_y s_{\hat{y}}}.$$
Recall that $(n-1)s_y^2 = \sum_{i=1}^n (y_i - \bar{y})^2 = \text{Total SS}$ and $(n-1)s_{\hat{y}}^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \text{Regress SS}$. Further, with $x_0 = (1, 1, \ldots, 1)'$,
$$\sum_{i=1}^n (y_i - \bar{y})(\hat{y}_i - \bar{y}) = (y - \bar{y}x_0)'(\hat{y} - \bar{y}x_0) = y'\hat{y} - \bar{y}^2 x_0'x_0 = y'Xb - n\bar{y}^2 = \text{Regress SS}.$$
This yields
$$r(y, \hat{y}) = \frac{\text{Regress SS}}{\sqrt{\text{Total SS}\,(\text{Regress SS})}} = \sqrt{\frac{\text{Regress SS}}{\text{Total SS}}} = \sqrt{R^2}. \qquad (5.18)$$
That is, the coefficient of determination can be interpreted as the square of the correlation between the observed and fitted responses.
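As an illustrative check on simulated data (not from the text), the squared correlation between observed and fitted responses matches the usual $R^2$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.5, 1.0, -1.0, 2.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
yhat = X @ b

r = np.corrcoef(y, yhat)[0, 1]                                  # r(y, yhat)
R2 = np.sum((yhat - y.mean())**2) / np.sum((y - y.mean())**2)   # Regress SS / Total SS
print(np.isclose(r**2, R2))                                     # equation (5.18): r(y, yhat)^2 = R^2
```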

5.10.2 Leave-One-Out Statistics

Notation. To test the sensitivity of regression quantities, there are a number of statistics of interest that are based on the notion of "leaving out," or omitting, an observation. To this end, the subscript notation $(i)$ means to leave out the $i$th observation. For example, omitting the row of explanatory variables $x_i' = (x_{i0}, x_{i1}, \ldots, x_{ik})$ from $X$ yields $X_{(i)}$, an $(n-1)\times(k+1)$ matrix of explanatory variables. Similarly, $y_{(i)}$ is an $(n-1)\times 1$ vector, based on removing the $i$th row from $y$.

Basic Matrix Result. Suppose that $A$ is an invertible $p \times p$ matrix and $z$ is a $p \times 1$ vector. The following result from matrix algebra provides an important tool for understanding leave-one-out statistics in linear regression analysis:
$$\left( A - zz' \right)^{-1} = A^{-1} + \frac{A^{-1}zz'A^{-1}}{1 - z'A^{-1}z}. \qquad (5.19)$$
To check this result, simply multiply $A - zz'$ by the right-hand side of equation (5.19) to get $I$, the identity matrix.
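Equation (5.19) is a special case of the Sherman–Morrison formula. A small numerical sketch (with arbitrary illustrative inputs, not from the text) is below; the scaling of $z$ is only to keep the denominator comfortably away from zero.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 4
A = rng.normal(size=(p, p))
A = A @ A.T + 10 * np.eye(p)       # a well-conditioned invertible matrix
z = 0.3 * rng.normal(size=(p, 1))

lhs = np.linalg.inv(A - z @ z.T)
Ainv = np.linalg.inv(A)
rhs = Ainv + (Ainv @ z @ z.T @ Ainv) / (1.0 - (z.T @ Ainv @ z).item())

print(np.allclose(lhs, rhs))       # equation (5.19)
```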

Vector of Regression Coefficients. Omitting the $i$th observation, our new vector of regression coefficients is $b_{(i)} = \left( X_{(i)}'X_{(i)} \right)^{-1} X_{(i)}'y_{(i)}$. An alternative expression for $b_{(i)}$ that is simpler to compute turns out to be
$$b_{(i)} = b - \frac{\left( X'X \right)^{-1} x_i e_i}{1 - h_{ii}}. \qquad (5.20)$$

To verify equation (5.20), first use equation (5.19) with $A = X'X$ and $z = x_i$ to get
$$\left( X_{(i)}'X_{(i)} \right)^{-1} = \left( X'X - x_i x_i' \right)^{-1} = (X'X)^{-1} + \frac{(X'X)^{-1} x_i x_i' (X'X)^{-1}}{1 - h_{ii}},$$
where, from equation (5.3), we have $h_{ii} = x_i'(X'X)^{-1}x_i$. Multiplying each side by $X_{(i)}'y_{(i)} = X'y - x_i y_i$ yields
$$b_{(i)} = \left( X_{(i)}'X_{(i)} \right)^{-1} X_{(i)}'y_{(i)} = \left( (X'X)^{-1} + \frac{(X'X)^{-1} x_i x_i' (X'X)^{-1}}{1 - h_{ii}} \right)\left( X'y - x_i y_i \right)$$
$$= b - (X'X)^{-1} x_i y_i + \frac{(X'X)^{-1} x_i x_i' b - (X'X)^{-1} x_i x_i' (X'X)^{-1} x_i y_i}{1 - h_{ii}}$$
$$= b - \frac{(1 - h_{ii})(X'X)^{-1} x_i y_i - (X'X)^{-1} x_i x_i' b + (X'X)^{-1} x_i h_{ii} y_i}{1 - h_{ii}}$$
$$= b - \frac{(X'X)^{-1} x_i y_i - (X'X)^{-1} x_i x_i' b}{1 - h_{ii}} = b - \frac{(X'X)^{-1} x_i e_i}{1 - h_{ii}}.$$
This establishes equation (5.20).
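The following sketch (simulated data, for illustration only) checks equation (5.20) by comparing the update formula with an explicit refit that drops the $i$th row:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
H = X @ XtX_inv @ X.T
e = y - X @ b

i = 7                                                    # observation to omit
x_i, h_ii = X[i], H[i, i]
b_i_formula = b - XtX_inv @ x_i * e[i] / (1.0 - h_ii)    # equation (5.20)

mask = np.arange(n) != i
X_i, y_i = X[mask], y[mask]
b_i_refit = np.linalg.solve(X_i.T @ X_i, X_i.T @ y_i)    # explicit refit without row i

print(np.allclose(b_i_formula, b_i_refit))
```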

Cook's Distance. To measure the effect, or influence, of omitting the $i$th observation, Cook examined the difference between fitted values with and without the observation. We define Cook's distance to be
$$D_i = \frac{\left( \hat{y} - \hat{y}_{(i)} \right)'\left( \hat{y} - \hat{y}_{(i)} \right)}{(k+1)s^2},$$
where $\hat{y}_{(i)} = Xb_{(i)}$ is the vector of fitted values calculated omitting the $i$th point. Using equation (5.20) and $\hat{y} = Xb$, an alternative expression for Cook's distance is
$$D_i = \frac{\left( b - b_{(i)} \right)'X'X\left( b - b_{(i)} \right)}{(k+1)s^2} = \frac{e_i^2}{(1-h_{ii})^2}\,\frac{x_i'(X'X)^{-1}X'X(X'X)^{-1}x_i}{(k+1)s^2}$$
$$= \frac{e_i^2}{(1-h_{ii})^2}\,\frac{h_{ii}}{(k+1)s^2} = \left( \frac{e_i}{s\sqrt{1-h_{ii}}} \right)^2 \frac{h_{ii}}{(k+1)(1-h_{ii})}.$$
This result is not only useful computationally, it also serves to decompose the statistic into the part due to the standardized residual, $\left( e_i / \left[ s(1-h_{ii})^{1/2} \right] \right)^2$, and the part due to the leverage, $h_{ii}/\left[ (k+1)(1-h_{ii}) \right]$.
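A small illustrative check (simulated data, assumed values) that the closed-form expression matches the definition based on refitting without observation $i$:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([2.0, 1.0, -1.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
H = X @ XtX_inv @ X.T
e = y - X @ b
s2 = (e @ e) / (n - (k + 1))             # s^2 = Error SS / (n - (k + 1))

i = 3
h_ii = H[i, i]
# closed form: squared standardized residual times the leverage factor
D_closed = (e[i]**2 / (s2 * (1 - h_ii))) * h_ii / ((k + 1) * (1 - h_ii))

# definition: compare fitted values with and without observation i
mask = np.arange(n) != i
b_i = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
diff = X @ b - X @ b_i
D_def = (diff @ diff) / ((k + 1) * s2)

print(np.isclose(D_closed, D_def))
```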

Leave-One-Out Residual. The leave-one-out residual is defined by $e_{(i)} = y_i - x_i'b_{(i)}$. It is used in computing the PRESS statistic, described in Section 5.6.3. A simple computational expression is $e_{(i)} = e_i/(1-h_{ii})$. To verify this, use equation (5.20) to get
$$e_{(i)} = y_i - x_i'b_{(i)} = y_i - x_i'\left( b - \frac{(X'X)^{-1}x_i e_i}{1-h_{ii}} \right) = e_i + \frac{x_i'(X'X)^{-1}x_i e_i}{1-h_{ii}} = e_i + \frac{h_{ii}e_i}{1-h_{ii}} = \frac{e_i}{1-h_{ii}}.$$
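With this identity, the PRESS statistic can be computed in a single pass from the ordinary residuals and leverages. The sketch below (illustrative simulated data) compares this shortcut with explicit leave-one-out refits:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
h = np.diag(X @ XtX_inv @ X.T)
e = y - X @ b

press_formula = np.sum((e / (1 - h))**2)    # PRESS from e_(i) = e_i / (1 - h_ii)

press_refit = 0.0
for i in range(n):                          # explicit leave-one-out refits
    mask = np.arange(n) != i
    b_i = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
    press_refit += (y[i] - X[i] @ b_i)**2

print(np.isclose(press_formula, press_refit))
```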

Leave-One-Out Variance Estimate. The leave-one-out estimate of the variance is defined by $s_{(i)}^2 = \left( (n-1)-(k+1) \right)^{-1} \sum_{j \neq i} \left( y_j - x_j'b_{(i)} \right)^2$. It is used in the definition of the studentized residual, defined in Section 5.3.1. A simple computational expression is given by
$$s_{(i)}^2 = \frac{(n-(k+1))s^2 - \dfrac{e_i^2}{1-h_{ii}}}{(n-1)-(k+1)}. \qquad (5.21)$$
To see this, first note that from equation (5.14), we have $He = H(I-H)\varepsilon = 0$, because $H = HH$. In particular, from the $i$th row of $He = 0$, we have $\sum_{j=1}^n h_{ij}e_j = 0$. Now, using equations (5.17) and (5.20), we have
$$\sum_{j \neq i} \left( y_j - x_j'b_{(i)} \right)^2 = \sum_{j=1}^n \left( y_j - x_j'b_{(i)} \right)^2 - \left( y_i - x_i'b_{(i)} \right)^2 = \sum_{j=1}^n \left( y_j - x_j'b + \frac{x_j'(X'X)^{-1}x_i e_i}{1-h_{ii}} \right)^2 - e_{(i)}^2$$
$$= \sum_{j=1}^n \left( e_j + \frac{h_{ij}e_i}{1-h_{ii}} \right)^2 - \frac{e_i^2}{(1-h_{ii})^2} = \sum_{j=1}^n e_j^2 + 0 + \frac{h_{ii}\,e_i^2}{(1-h_{ii})^2} - \frac{e_i^2}{(1-h_{ii})^2}$$
$$= \sum_{j=1}^n e_j^2 - \frac{e_i^2}{1-h_{ii}} = (n-(k+1))s^2 - \frac{e_i^2}{1-h_{ii}}.$$
This establishes equation (5.21).

5.10.3 Omitting Variables

Notation. To measure the effect on regression quantities, there are a number of statistics of interest that are based on the notion of omitting an explanatory variable. To this end, the superscript notation $(j)$ means to omit the $j$th variable, where $j = 0, 1, \ldots, k$. First, recall that $x_j = (x_{1j}, x_{2j}, \ldots, x_{nj})'$ is the column representing the $j$th variable. Further, define $X^{(j)}$ to be the $n \times k$ matrix of explanatory variables defined by removing $x_j$ from $X$. For example, taking $j = k$, we often partition $X$ as $X = \left( X^{(k)} : x_k \right)$. Employing the results of Section 4.7.2, we will use $X^{(k)} = X_1$ and $x_k = X_2$.

Variance Inflation Factor. We first would like to establish the relationship between the definition of the standard error of $b_j$ given by
$$se(b_j) = s\sqrt{(j+1)\text{st diagonal element of } (X'X)^{-1}}$$
and the relationship involving the variance inflation factor,
$$se(b_j) = \frac{s\sqrt{VIF_j}}{s_{x_j}\sqrt{n-1}}.$$
By symmetry of the independent variables, we need consider only the case where $j = k$. Thus, we would like to establish
$$(k+1)\text{st diagonal element of } (X'X)^{-1} = VIF_k / \left( (n-1)s_{x_k}^2 \right). \qquad (5.22)$$
First consider the reparameterized model in equation (4.22). From equation (4.23), we can express the regression coefficient estimate as $b_k = (E_2'y)/(E_2'E_2)$. From equation (4.23), we also have that $\mathrm{Var}\,b_k = \sigma^2 (E_2'E_2)^{-1}$ and thus
$$se(b_k) = s\left( E_2'E_2 \right)^{-1/2}. \qquad (5.23)$$
Thus, $(E_2'E_2)^{-1}$ is the $(k+1)$st diagonal element of
$$\left[ \begin{pmatrix} X_1 & E_2 \end{pmatrix}'\begin{pmatrix} X_1 & E_2 \end{pmatrix} \right]^{-1}$$
and is also the $(k+1)$st diagonal element of $(X'X)^{-1}$. Alternatively, this can be verified directly using the partitioned matrix inverse in equation (4.19).

Now, suppose that we run a regression using $x_k = X_2$ as the response vector and $X^{(k)} = X_1$ as the matrix of explanatory variables. As noted in equation (4.22), $E_2$ represents the "residuals" from this regression, and thus $E_2'E_2$ represents the error sum of squares. For this regression, the total sum of squares is $\sum_{i=1}^n (x_{ik} - \bar{x}_k)^2 = (n-1)s_{x_k}^2$ and the coefficient of determination is $R_k^2$. Thus,
$$E_2'E_2 = \text{Error SS} = \text{Total SS}\,(1 - R_k^2) = (n-1)s_{x_k}^2 / VIF_k.$$
This establishes equation (5.22).
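An illustrative numerical check of equation (5.22) on simulated, deliberately collinear data, comparing the last diagonal element of $(X'X)^{-1}$ with $VIF_k/\left( (n-1)s_{x_k}^2 \right)$ computed from an auxiliary regression:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50
z = rng.normal(size=n)
x1 = z + 0.3 * rng.normal(size=n)
x2 = z + 0.3 * rng.normal(size=n)          # correlated with x1, so VIF > 1
X = np.column_stack([np.ones(n), x1, x2])
k = X.shape[1] - 1

XtX_inv = np.linalg.inv(X.T @ X)
diag_last = XtX_inv[k, k]                  # (k+1)st diagonal element of (X'X)^{-1}

# auxiliary regression of x_k on the remaining columns
X1 = X[:, :k]
g = np.linalg.solve(X1.T @ X1, X1.T @ X[:, k])
resid = X[:, k] - X1 @ g
R2_k = 1 - (resid @ resid) / np.sum((X[:, k] - X[:, k].mean())**2)
VIF_k = 1 / (1 - R2_k)
s2_xk = np.var(X[:, k], ddof=1)

print(np.isclose(diag_last, VIF_k / ((n - 1) * s2_xk)))   # equation (5.22)
```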

Establishing $t^2 = F$. For testing the null hypothesis $H_0: \beta_k = 0$, the material in Section 3.4.1 provides a description of a test based on the $t$-statistic, $t(b_k) = b_k / se(b_k)$. An alternative test procedure, described in Section 4.2.2, uses the test statistic
$$F\text{-ratio} = \frac{(\text{Error SS})_{\text{reduced}} - (\text{Error SS})_{\text{full}}}{p \times (\text{Error MS})_{\text{full}}} = \frac{\left( E_2'y \right)^2}{s^2\,E_2'E_2}$$
from equation (4.26). Alternatively, from equations (4.23) and (5.23), we have
$$t(b_k) = \frac{b_k}{se(b_k)} = \frac{E_2'y / \left( E_2'E_2 \right)}{s / \sqrt{E_2'E_2}} = \frac{E_2'y}{s\sqrt{E_2'E_2}}. \qquad (5.24)$$
Thus, $t(b_k)^2 = F\text{-ratio}$.
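As a quick illustrative check on simulated data, the squared $t$-statistic for the last coefficient equals the partial $F$-ratio comparing the full and reduced models (one restriction, so $p = 1$):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
k = X.shape[1] - 1
y = X @ np.array([1.0, 2.0, 0.7]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e_full = y - X @ b
s2 = (e_full @ e_full) / (n - (k + 1))
t_k = b[k] / np.sqrt(s2 * XtX_inv[k, k])             # t-statistic for beta_k

X_red = X[:, :k]                                     # reduced model drops x_k
b_red = np.linalg.solve(X_red.T @ X_red, X_red.T @ y)
e_red = y - X_red @ b_red
F = ((e_red @ e_red) - (e_full @ e_full)) / s2       # p = 1 restriction

print(np.isclose(t_k**2, F))
```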

Partial Correlation Coefficients. From the full regression model $y = X^{(k)}\beta^{(k)} + x_k\beta_k + \varepsilon$, consider two separate regressions. A regression using $x_k$ as the response vector and $X^{(k)}$ as the matrix of explanatory variables yields the residuals $E_2$. Similarly, a regression using $y$ as the response vector and $X^{(k)}$ as the matrix of explanatory variables yields the residuals
$$E_1 = y - X^{(k)}\left( X^{(k)\prime} X^{(k)} \right)^{-1} X^{(k)\prime} y.$$
If $x_0 = (1, 1, \ldots, 1)'$, then the averages of $E_1$ and $E_2$ are zero. In this case, the sample correlation between $E_1$ and $E_2$ is
$$r(E_1, E_2) = \frac{\sum_{i=1}^n E_{1i}E_{2i}}{\sqrt{\sum_{i=1}^n E_{1i}^2 \sum_{i=1}^n E_{2i}^2}} = \frac{E_1'E_2}{\sqrt{E_1'E_1\;E_2'E_2}}.$$
Because $E_2$ is a vector of residuals using $X^{(k)}$ as the matrix of explanatory variables, we have that $E_2'X^{(k)} = 0$. Thus, for the numerator, we have $E_2'E_1 = E_2'\left( y - X^{(k)}(X^{(k)\prime}X^{(k)})^{-1}X^{(k)\prime}y \right) = E_2'y$. From equations (4.24) and (4.25), we have that
$$(n-(k+1))s^2 = (\text{Error SS})_{\text{full}} = E_1'E_1 - \left( E_2'y \right)^2 / \left( E_2'E_2 \right) = E_1'E_1 - \left( E_2'E_1 \right)^2 / \left( E_2'E_2 \right).$$
Thus, from equation (5.24),
$$\frac{t(b_k)}{\sqrt{t(b_k)^2 + n - (k+1)}} = \frac{E_2'y / \left( s\sqrt{E_2'E_2} \right)}{\sqrt{\dfrac{(E_2'y)^2}{s^2\,E_2'E_2} + n - (k+1)}} = \frac{E_2'y}{\sqrt{(E_2'y)^2 + E_2'E_2\,s^2\,(n-(k+1))}}$$
$$= \frac{E_2'E_1}{\sqrt{(E_2'E_1)^2 + E_2'E_2\left( E_1'E_1 - \dfrac{(E_2'E_1)^2}{E_2'E_2} \right)}} = \frac{E_1'E_2}{\sqrt{(E_1'E_1)(E_2'E_2)}} = r(E_1, E_2).$$
This establishes the relationship between the partial correlation coefficient and the $t$-ratio statistic.
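Finally, an illustrative sketch (simulated data, assumed coefficients) checking that the $t$-based expression reproduces the partial correlation between the two sets of residuals:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 60
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept and two controls
xk = X1 @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)     # variable of interest
y = X1 @ np.array([1.0, 2.0, 0.5]) + 0.8 * xk + rng.normal(size=n)

def residuals(Z, v):
    """Residuals from regressing v on the columns of Z."""
    return v - Z @ np.linalg.solve(Z.T @ Z, Z.T @ v)

E1 = residuals(X1, y)      # y adjusted for X^(k)
E2 = residuals(X1, xk)     # x_k adjusted for X^(k)
r_partial = (E1 @ E2) / np.sqrt((E1 @ E1) * (E2 @ E2))

# full-model t-statistic for the coefficient on x_k
X = np.column_stack([X1, xk])
k = X.shape[1] - 1
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = (e @ e) / (n - (k + 1))
t_k = b[k] / np.sqrt(s2 * XtX_inv[k, k])

print(np.isclose(t_k / np.sqrt(t_k**2 + n - (k + 1)), r_partial))
```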

6 Interpreting Regression Results

Chapter Preview. A regression analyst collects data, selects a model, and then reports on the findings of the study, in that order. This chapter considers these three topics in reverse order, emphasizing how each stage of the study is influenced by preceding steps.

An application, determining a firm’s characteristics that influence its effectiveness in managing risk, illustrates the regression modeling process from start to finish.

Studying a problem using a regression modeling process involves a substantial commitment of time and energy. One must first embrace the concept of statistical thinking, a willingness to use data actively as part of a decision-making process.

Second, one must appreciate the usefulness of a model that is used to approximate a real situation. Having made this substantial commitment, there is a natural tendency to "oversell" the results of statistical methods such as regression analysis. By overselling any set of ideas, consumers eventually become disappointed when the results do not live up to their expectations. This chapter begins in Section 6.1 by summarizing what we can reasonably expect to learn from regression modeling.

Models are designed to be much simpler than relationships among entities that exist in the real world. A model is merely an approximation of reality. As stated by George Box (1979), "All models are wrong, but some are useful."

Developing the model, the subject of Chapter 5, is part of the art of statistics.

Although the principles of variable selection are widely accepted, the application of these principles can vary considerably among analysts. The resulting product has certain aesthetic values and is by no means predetermined. Statistics can be thought of as the art of reasoning with data. Section 6.2 will underscore the importance of variable selection.

Model formulation and data collection form the first stage of the modeling process. Students of statistics are usually surprised at the difficulty of relating ideas about relationships to available data. These difficulties include a lack of readily available data and the need to use certain data as proxies for ideal information that is not available numerically. Section 6.3 will describe several types of difficulties that can arise when collecting data. Section 6.4 will describe some models to alleviate these difficulties.

