We have mentioned that the least squares fit 𝝁̂ is a projection of the data y onto the model space C(X), and that the hat matrix H that projects y to 𝝁̂ is a projection matrix. We now explain more precisely what is meant by projection of a vector y ∈ Rⁿ onto a vector subspace such as C(X).
2.2.1 Projection Matrices
Definition. A square matrix P is a projection matrix onto a vector subspace W if
(1) for all y ∈ W, Py = y;
(2) for all y ∈ W⟂, Py = 0.
For a projection matrix P, since Py is a linear combination of the columns of P, the vector subspace W onto which P projects is the column space C(P). The projection
matrix P onto W projects an arbitrary y ∈ Rⁿ to its component y1 ∈ W for the unique orthogonal decomposition of y into y1 + y2 using W and W⟂ (recall Section 2.1.5).
We next list this and some other properties of a projection matrix:
• If y = y1 + y2 with y1 ∈ W and y2 ∈ W⟂, then Py = P(y1 + y2) = Py1 + Py2 = y1 + 0 = y1. Since the orthogonal decomposition is unique, so too is the projection onto W unique.⁶
• The projection matrix onto a subspace W is unique. To see why, suppose P∗ is another one. Then for the orthogonal decomposition y = y1 + y2 with y1 ∈ W, P∗y = y1 = Py for all y. Hence, P = P∗. (Recall that if Ay = By for all y, then A = B.)
• I − P is the projection matrix onto W⟂. For an arbitrary y = y1 + y2 with y1 ∈ W and y2 ∈ W⟂, we have Py = y1 and (I − P)y = y − y1 = y2. Thus,

y = Py + (I − P)y

provides the orthogonal decomposition of y. Also, P(I − P)y = 0.
• P is a projection matrix if and only if it is symmetric and idempotent (i.e., P² = P).
We will use this last property often, so let us see why it is true. First, we suppose P is symmetric and idempotent and show that this implies P is a projection matrix. For any v ∈ C(P) (the subspace onto which P projects), v = Pb for some b. Then, Pv = P(Pb) = P²b = Pb = v. For any v ∈ C(P)⟂, we have Pᵀv = 0, but this is also Pv by the symmetry of P. So, we have shown that P is a projection matrix onto C(P).
Second, to prove the converse, we suppose P is the projection matrix onto C(P) and show that this implies P is symmetric and idempotent. For any v ∈ Rⁿ, let v = v1 + v2 with v1 ∈ C(P) and v2 ∈ C(P)⟂. Since

P²v = P(P(v1 + v2)) = Pv1 = v1 = Pv,
we have P² = P. To show symmetry, let w = w1 + w2 be any other vector in Rⁿ, with w1 ∈ C(P) and w2 ∈ C(P)⟂. Since I − P is the projection matrix onto C(P)⟂,

wᵀPᵀ(I − P)v = w1ᵀv2 = 0.

Since this is true for any v and w, we have Pᵀ(I − P) = 0, or Pᵀ = PᵀP. Since PᵀP is symmetric, so is Pᵀ and hence P.
Next, here are two useful properties about the eigenvalues and the rank of a projection matrix.
• The eigenvalues of any projection matrix P are all 0 or 1.
⁶ The projection defined here is sometimes called an orthogonal projection, because Py and y − Py are orthogonal vectors. This text considers only orthogonal and not oblique projections, and we take “projection” to be synonymous with “orthogonal projection.”
This follows from the definitions of a projection matrix and an eigenvalue. For an eigenvalue 𝜆 of P with eigenvector 𝝂, P𝝂 = 𝜆𝝂; but either P𝝂 = 𝝂 (if 𝝂 ∈ W) or P𝝂 = 0 (if 𝝂 ∈ W⟂), so 𝜆 = 1 or 0. In fact, this is a property of symmetric, idempotent matrices.
• For any projection matrix P, rank(P) = trace(P), the sum of its main diagonal elements.
This follows because the trace of a square matrix is the sum of its eigenvalues, and for symmetric matrices the rank is the number of nonzero eigenvalues. Since the eigenvalues of P (which is symmetric) are all 0 or 1, the sum of its eigenvalues equals the number of nonzero eigenvalues.
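As a quick numerical check of these properties, here is a minimal NumPy sketch (the 6 × 2 basis matrix below is arbitrary simulated data, not from the text): the projection matrix onto the column space of a full-rank matrix is symmetric and idempotent, its eigenvalues are 0 or 1, and its rank equals its trace.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 2))            # basis for a 2-dimensional subspace of R^6
P = W @ np.linalg.inv(W.T @ W) @ W.T   # projection matrix onto C(W)

print(np.allclose(P, P.T))                    # symmetric: True
print(np.allclose(P @ P, P))                  # idempotent: True
print(np.round(np.linalg.eigvalsh(P), 8))     # eigenvalues: four 0s and two 1s
print(np.linalg.matrix_rank(P), np.trace(P))  # rank 2, trace 2.0
```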
Finally, we state a useful property⁷ about decompositions of the identity matrix into a sum of projection matrices:
• Suppose {Pi} are symmetric n × n matrices such that ∑i Pi = I. Then, the following three conditions are equivalent:
1. Pi is idempotent for each i.
2. PiPj = 0 for all i ≠ j.
3. ∑i rank(Pi) = n.
The aspect we will use is that symmetric idempotent matrices (thus, projection matrices) that satisfy ∑i Pi = I also satisfy PiPj = 0 for all i ≠ j. The proof of this is a by-product of a key result of the next chapter (Cochran’s theorem) about independent chi-squared quadratic forms.
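The following NumPy sketch (using an arbitrary orthonormal basis of R⁵ split into three groups of columns) shows all three conditions holding at once for one such decomposition: the projection matrices sum to I, their pairwise products vanish, and their ranks sum to n.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
# Build an orthonormal basis of R^n and split it into three groups of columns.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
P1 = Q[:, :1] @ Q[:, :1].T     # rank-1 projection
P2 = Q[:, 1:3] @ Q[:, 1:3].T   # rank-2 projection
P3 = Q[:, 3:] @ Q[:, 3:].T     # rank-2 projection

print(np.allclose(P1 + P2 + P3, np.eye(n)))   # the Pi sum to I
print(np.allclose(P1 @ P2, 0), np.allclose(P1 @ P3, 0), np.allclose(P2 @ P3, 0))
print(sum(np.linalg.matrix_rank(P) for P in (P1, P2, P3)))  # 5 = n
```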
2.2.2 Projection Matrices for Linear Model Spaces
Let PX denote the projection matrix onto the model space C(X) corresponding to a model matrix X for a linear model. We next present some properties for this particular case.
• If X has full rank, then PX is the hat matrix, H = X(XᵀX)⁻¹Xᵀ.
This follows by noting that H satisfies the two parts of the definition of a projection matrix for C(X):
• If y ∈ C(X), then y = Xb for some b. So

Hy = X(XᵀX)⁻¹Xᵀy = X(XᵀX)⁻¹XᵀXb = Xb = y.
• Recall from Section 2.1.5 that C(X)⟂ = N(Xᵀ). If y ∈ N(Xᵀ), then Xᵀy = 0, and thus, Hy = X(XᵀX)⁻¹Xᵀy = 0.
⁷ For a proof, see Bapat (2000, p. 60).
We have seen that H projects y to the least squares fit 𝝁̂ = X𝜷̂.
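A small numerical illustration (a NumPy sketch with simulated data; the dimensions are arbitrary) confirms that the hat matrix is symmetric, idempotent, and projects y to the least squares fit:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # full-rank model matrix
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # least squares estimate

print(np.allclose(H @ y, X @ beta_hat))            # H projects y to the least squares fit
print(np.allclose(H, H.T), np.allclose(H @ H, H))  # symmetric and idempotent
```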
• If X does not have full rank, then PX = X(XᵀX)⁻Xᵀ. Moreover, PX is invariant to the choice for the generalized inverse (XᵀX)⁻.
The proof is outlined in Exercise 2.16. Thus, 𝝁̂ is invariant to the choice of the solution 𝜷̂ of the normal equations. In particular, if rank(X) = r < p and if X0 is any matrix having r columns that form a basis for C(X), then PX = X0(X0ᵀX0)⁻¹X0ᵀ. This follows by the same proof just given for the full-rank case.
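The invariance can be checked numerically. In this sketch (NumPy; a hypothetical one-factor model matrix with an intercept and three group indicators, so rank(X) = 3 while X has 4 columns), the projection computed from the Moore–Penrose generalized inverse agrees with the one computed from a full-rank basis X0 of C(X):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 12
# Intercept plus indicators for 3 groups: 4 columns, but rank 3.
groups = np.repeat([0, 1, 2], 4)
X = np.column_stack([np.ones(n), (groups[:, None] == np.arange(3)).astype(float)])
print(np.linalg.matrix_rank(X))            # 3 (less than the 4 columns)

# P_X via the Moore-Penrose generalized inverse of X'X ...
P1 = X @ np.linalg.pinv(X.T @ X) @ X.T
# ... and via a full-rank matrix X0 whose columns form a basis for C(X).
X0 = X[:, :3]                              # intercept and first two indicators span C(X)
P2 = X0 @ np.linalg.inv(X0.T @ X0) @ X0.T
print(np.allclose(P1, P2))                 # same projection matrix
```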
• If X and W are model matrices satisfying C(X) = C(W), then PX = PW. To see why, for an arbitrary y ∈ Rⁿ, we use the orthogonal decompositions y = PXy + (I − PX)y and y = PWy + (I − PW)y. By the uniqueness of the decomposition, PXy = PWy. But y is arbitrary, so PX = PW. It follows that 𝝁̂ and e = y − X𝜷̂ are also the same for both models. For example, projection matrices and model fits are not affected by reparameterization, such as changing the indicator coding for a factor.
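For instance (a NumPy sketch with simulated data), multiplying X on the right by an arbitrary, almost surely nonsingular matrix A gives a reparameterized model matrix W = XA with C(W) = C(X), and the two projection matrices coincide:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 10, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
A = rng.normal(size=(p, p))                # almost surely nonsingular reparameterization
W = X @ A                                  # same column space as X

P_X = X @ np.linalg.inv(X.T @ X) @ X.T
P_W = W @ np.linalg.inv(W.T @ W) @ W.T
print(np.allclose(P_X, P_W))               # identical projection matrices
```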
• Nested model projections: When model a is a special case of model b, with projection matrices Pa and Pb for model matrices Xa and Xb, then PaPb = PbPa = Pa, and Pb − Pa is also a projection matrix.

When one model is a special case of another, we say that the models are nested. To show this result, for an arbitrary y, we use the unique orthogonal decomposition y = y1 + y2, with y1 ∈ C(Xa) and y2 ∈ C(Xa)⟂. Then, Pay = y1, from which Pb(Pay) = Pby1 = y1 = Pay, since the fitted value for the simpler model also satisfies the more complex model. So PbPa = Pa. Taking transposes and using the symmetry of Pa and Pb, we also have PaPb = (PbPa)ᵀ = Paᵀ = Pa. Since Pay = Pa(Pby), we see that Pay is also the projection of Pby onto C(Xa). Since PbPa = PaPb = Pa and Pa and Pb are idempotent,
(Pb − Pa)(Pb − Pa) = Pb − Pa.
So (Pb − Pa) is also a projection matrix. In fact, an extended orthogonal decomposition incorporates such difference projection matrices,
y = Iy = [Pa + (Pb − Pa) + (I − Pb)]y = y1 + y2 + y3.
Here Pa projects y onto C(Xa), (Pb − Pa) projects y to its component in C(Xb) that is orthogonal with C(Xa), and (I − Pb) projects y to its component in C(Xb)⟂.
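The nested-projection properties can be verified directly. In this NumPy sketch with simulated predictors, model a contains an intercept and x1, and model b adds x2:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 15
x1, x2 = rng.normal(size=n), rng.normal(size=n)
Xa = np.column_stack([np.ones(n), x1])          # simpler model a
Xb = np.column_stack([np.ones(n), x1, x2])      # fuller model b containing model a

def proj(X):
    """Projection matrix onto the column space of a full-rank X."""
    return X @ np.linalg.inv(X.T @ X) @ X.T

Pa, Pb = proj(Xa), proj(Xb)
print(np.allclose(Pb @ Pa, Pa), np.allclose(Pa @ Pb, Pa))   # Pb Pa = Pa Pb = Pa
D = Pb - Pa
print(np.allclose(D, D.T), np.allclose(D @ D, D))           # Pb - Pa is a projection matrix
```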
2.2.3 Example: The Geometry of a Linear Model
We next illustrate the geometry that underlies the projections for linear models. We do this for two simple models for which we can easily portray projections graphically.
The first model has a single quantitative explanatory variable,

𝜇i = E(yi) = 𝛽xi,  i = 1, …, n,

but does not contain an intercept. Its model matrix X is the n × 1 vector (x1, …, xn)ᵀ. Figure 2.3 portrays the model, the data, and the fit. The response values y = (y1, …, yn)ᵀ are a point in Rⁿ. The explanatory variable values X are another such point. The linear predictor values X𝛽 for all the possible real values for 𝛽 trace out a line in Rⁿ that passes through the origin. This is the model space C(X). The model fit 𝝁̂ = PXy = 𝛽̂X is the orthogonal projection of y onto the model space line.
Figure 2.3 Portrayal of simple linear model with quantitative predictor x and no intercept, showing the observations y, the model matrix X of predictor values, and the fit 𝝁̂ = PXy = 𝛽̂X.
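For this one-predictor, no-intercept model, the projection matrix has the explicit form PX = xxᵀ/(xᵀx) and 𝛽̂ = xᵀy/(xᵀx), which follow from the hat matrix formula with X an n × 1 vector. A brief NumPy sketch with simulated data (the slope 2.5 is arbitrary) confirms that PXy equals the fitted vector 𝛽̂x and that the residuals are orthogonal to the model space line:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 8
x = rng.normal(size=n)                        # model matrix is the n x 1 vector x
y = 2.5 * x + rng.normal(size=n)

P_X = np.outer(x, x) / (x @ x)                # projection onto the line C(X)
beta_hat = (x @ y) / (x @ x)                  # least squares estimate, no intercept
print(np.allclose(P_X @ y, beta_hat * x))     # fitted values are the projection of y
print(np.isclose((y - beta_hat * x) @ x, 0))  # residuals orthogonal to the model space
```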
Next, we extend the modeling to handle two quantitative explanatory variables.
Consider the models
E(yi) = 𝛽y1xi1,  E(yi) = 𝛽y2xi2,  E(yi) = 𝛽y1⋅2xi1 + 𝛽y2⋅1xi2.
We use Yule’s notation to reflect that 𝛽y1⋅2 and 𝛽y2⋅1 typically differ from 𝛽y1 and 𝛽y2, as discussed in Section 1.2.3. Figure 2.4 portrays the data and the three model fits.
When evaluated for all real 𝛽y1⋅2 and 𝛽y2⋅1, 𝝁 traces out a plane in Rⁿ that passes through the origin. The projection P12y = 𝛽̂y1⋅2X1 + 𝛽̂y2⋅1X2 gives the least squares fit using both predictors together. The projection P1y = 𝛽̂y1X1 onto the model space for X1 = (x11, …, xn1)ᵀ gives the least squares fit when x1 is the sole predictor. The projection P2y = 𝛽̂y2X2 onto the model space for X2 = (x12, …, xn2)ᵀ gives the least squares fit when x2 is the sole predictor.
From the result in Section 2.2.2 that PaPb = Pa when model a is a special case of model b, P1y is also the projection of P12y onto the model space for X1, and P2y is also the projection of P12y onto the model space for X2. These ideas extend directly to models with several explanatory variables as well as an intercept term.
Figure 2.4 Portrayal of linear model with two quantitative explanatory variables, showing the observations y and the fits P1y = 𝛽̂y1X1, P2y = 𝛽̂y2X2, and P12y = 𝛽̂y1⋅2X1 + 𝛽̂y2⋅1X2.

2.2.4 Orthogonal Columns and Parameter Orthogonality

Although 𝛽̂y1 in the reduced model 𝝁 = 𝛽y1X1 is usually not the same as 𝛽̂y1⋅2 in the full model 𝝁 = 𝛽y1⋅2X1 + 𝛽y2⋅1X2, the effects are identical when X1 is orthogonal with X2. We show this for a more general context in which X1 and X2 may each refer to a set of explanatory variables.
We partition the model matrix and parameter vector for the full model as X = (X1 : X2) and 𝜷 = (𝜷1ᵀ, 𝜷2ᵀ)ᵀ, so that

X𝜷 = X1𝜷1 + X2𝜷2.
Then, 𝜷1 and 𝜷2 are said to be orthogonal parameters if each column from X1 is orthogonal with each column from X2, that is, X1ᵀX2 = 0. In this case, XᵀX is block diagonal with diagonal blocks X1ᵀX1 and X2ᵀX2 (and off-diagonal blocks 0), and Xᵀy stacks X1ᵀy on top of X2ᵀy.
Because of this, (XᵀX)⁻¹ also has block diagonal structure, and 𝜷̂1 = (X1ᵀX1)⁻¹X1ᵀy from fitting the reduced model 𝝁 = X1𝜷1 is identical to 𝜷̂1 from fitting 𝝁 = X1𝜷1 + X2𝜷2. The same property holds if each column from X1 is orthogonal with each column from X2 after centering each column of X1 (i.e., subtracting off the mean) or centering each column of X2. In that case, the correlation is zero for each such pair (Exercise 2.19), and the result is a consequence of a property to be presented in Section 2.5.6 showing that the same partial effects occur in regression modeling using two sets of residuals.
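Here is a numerical illustration of parameter orthogonality (a NumPy sketch with simulated data; X1 is given orthonormal columns and X2 is made orthogonal to it by projecting out C(X1)). The estimate 𝜷̂1 is the same whether or not X2 is in the model:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 30
Z = rng.normal(size=(n, 4))
X1, _ = np.linalg.qr(Z[:, :2])               # two orthonormal columns
# Make X2 orthogonal to X1 by removing its projection onto C(X1).
X2 = Z[:, 2:] - X1 @ (X1.T @ Z[:, 2:])
print(np.allclose(X1.T @ X2, 0))             # X1'X2 = 0

y = rng.normal(size=n)
beta1_reduced = np.linalg.lstsq(X1, y, rcond=None)[0]
beta_full = np.linalg.lstsq(np.column_stack([X1, X2]), y, rcond=None)[0]
print(np.allclose(beta1_reduced, beta_full[:2]))   # identical estimates of beta1
```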
2.2.5 Pythagoras’s Theorem Applications for Linear Models
The projection matrix plays a key role for linear models. The first important result is that the projection matrix projects the data vector y to the fitted value vector 𝝁̂ that is the unique point in the model space C(X) that is closest to y.
Data projection gives unique least squares fit: For each y ∈ Rⁿ and its projection PXy = 𝝁̂ onto the model space C(X) for a linear model 𝝁 = X𝜷,

‖y − PXy‖ ≤ ‖y − z‖  for all z ∈ C(X),

with equality if and only if z = PXy.
To show why this is true, for an arbitrary z ∈ C(X) we express

y − z = (y − PXy) + (PXy − z).

Now (y − PXy) = (I − PX)y is in C(X)⟂ = N(Xᵀ), whereas (PXy − z) is in C(X) because each component is in C(X). Since the subspaces C(X) and C(X)⟂ are orthogonal complements,
‖y − z‖² = ‖y − PXy‖² + ‖PXy − z‖²,
because uᵀv = 0 for any u ∈ C(X) and v ∈ C(X)⟂. It follows from this application of Pythagoras’s theorem that ‖y − z‖² ≥ ‖y − PXy‖², with equality if and only if PXy = z.
The fact that the fitted values 𝝁̂ = PXy provide the unique least squares solution for 𝝁 is no surprise, as Section 2.2.2 showed that the projection matrix for a linear model is the hat matrix, which projects the data to the least squares fit. Likewise, (I − PX) is the projection onto C(X)⟂, and the residual vector e = (I − PX)y = y − 𝝁̂ falls in that error space.
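As a numerical sanity check (a NumPy sketch with simulated data; the 1000 random candidate points z = Xb are arbitrary), no point in the model space is closer to y than the projection PXy, and the residual vector lies in N(Xᵀ):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

P_X = X @ np.linalg.inv(X.T @ X) @ X.T
fit = P_X @ y

# No sampled point z = Xb in the model space is closer to y than the projection.
for _ in range(1000):
    z = X @ rng.normal(size=p)
    assert np.linalg.norm(y - fit) <= np.linalg.norm(y - z) + 1e-12
print("no sampled z was closer to y than the projection")
print(np.isclose(X.T @ (y - fit), 0).all())   # residuals lie in N(X') = C(X) perp
```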
Here is another application of Pythagoras’s theorem for linear models.
True and sample residuals: For the fitted values 𝝁̂ of a linear model 𝝁 = X𝜷 obtained by least squares,

‖y − 𝝁‖² = ‖y − 𝝁̂‖² + ‖𝝁̂ − 𝝁‖².
This follows by decomposing (y − 𝝁) = (y − 𝝁̂) + (𝝁̂ − 𝝁) and using the fact that (𝝁̂ − 𝝁), which is in C(X), is orthogonal to (y − 𝝁̂), which is in C(X)⟂. In particular, the data tend to be closer to the model fit than to the true means, and the fitted values vary less than the data. From this result, a plot of ‖y − 𝝁‖² against 𝜷 shows a quadratic function that is minimized at 𝜷̂. Figure 2.5 portrays this for the case of a one-dimensional parameter 𝛽.
Figure 2.5 For a linear model 𝝁 = X𝜷, the sum of squares ‖y − 𝝁‖² is minimized at the least squares estimate 𝜷̂.
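This identity can be verified by simulation. In the sketch below (NumPy; the true coefficient vector is an arbitrary choice), the true mean vector 𝝁 = X𝜷 is known, so both sides of the decomposition can be computed exactly:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, -0.5, 2.0])
mu = X @ beta_true                          # true means, a point in C(X)
y = mu + rng.normal(size=n)                 # simulated data

P_X = X @ np.linalg.inv(X.T @ X) @ X.T
mu_hat = P_X @ y                            # fitted values

lhs = np.sum((y - mu) ** 2)
rhs = np.sum((y - mu_hat) ** 2) + np.sum((mu_hat - mu) ** 2)
print(np.isclose(lhs, rhs))                 # Pythagoras: exact decomposition
```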
Here is a third application of Pythagoras’s theorem for linear models.
Data = fit + residuals: For the fitted values 𝝁̂ of a linear model 𝝁 = X𝜷 obtained by least squares,

‖y‖² = ‖𝝁̂‖² + ‖y − 𝝁̂‖².
This uses the decomposition illustrated in Figure 2.6,
y = 𝝁̂ + (y − 𝝁̂) = PXy + (I − PX)y,  that is,  data = fit + residuals,

with 𝝁̂ in C(X) orthogonal to (y − 𝝁̂) in C(X)⟂.

Figure 2.6 Pythagoras’s theorem for a linear model applies to the data vector, the fitted values, and the residual vector; that is, data = fit + residuals.

It also follows using the symmetry and idempotence of projection matrices, from
‖y‖² = yᵀy = yᵀ[PX + (I − PX)]y = yᵀPXy + yᵀ(I − PX)y = yᵀPXᵀPXy + yᵀ(I − PX)ᵀ(I − PX)y = 𝝁̂ᵀ𝝁̂ + (y − 𝝁̂)ᵀ(y − 𝝁̂).

A consequence of 𝝁̂ being the projection of y onto the model space is that the squared length of y equals the squared length of 𝝁̂ plus the squared length of the residual vector. The orthogonality of the fitted values and the residuals is a key result that we will use often.
Linear model analyses that decompose y into several orthogonal components have a corresponding sum-of-squares decomposition. Let P1, P2, …, Pk be projection matrices satisfying an orthogonal decomposition:

I = P1 + P2 + ⋯ + Pk.
That is, each projection matrix refers to a vector subspace in a decomposition of Rⁿ using orthogonal subspaces. The unique decomposition of y into elements in those orthogonal subspaces is

y = Iy = P1y + P2y + ⋯ + Pky = y1 + ⋯ + yk.

The corresponding sum-of-squares decomposition is

yᵀy = yᵀP1y + ⋯ + yᵀPky.
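The following sketch (NumPy, with data simulated from a hypothetical simple linear regression) illustrates such a decomposition with k = 3: the projection onto the constant vector, the additional projection for the regression on x, and the residual projection sum to I, and the corresponding sums of squares add to yᵀy.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 20
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 3.0 + 1.5 * x + rng.normal(size=n)

ones = np.ones((n, 1))
P1 = ones @ ones.T / n                         # projection onto the constant vector
P_X = X @ np.linalg.inv(X.T @ X) @ X.T         # projection onto the model space C(X)
P2 = P_X - P1                                  # part of C(X) orthogonal to the constant
P3 = np.eye(n) - P_X                           # projection onto the error space

print(np.allclose(P1 + P2 + P3, np.eye(n)))    # orthogonal decomposition of I
ss = [y @ P @ y for P in (P1, P2, P3)]
print(np.isclose(y @ y, sum(ss)))              # y'y = y'P1y + y'P2y + y'P3y
```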