Having formed a model matrix $\mathbf{X}$ and observed $\mathbf{y}$, how do we obtain parameter estimates $\hat{\boldsymbol{\beta}}$ and fitted values $\hat{\boldsymbol{\mu}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ that best satisfy the linear model? The standard approach uses the \emph{least squares} method. This determines the value of $\hat{\boldsymbol{\mu}}$ that minimizes
\[
\|\mathbf{y} - \hat{\boldsymbol{\mu}}\|^2 = \sum_i (y_i - \hat{\mu}_i)^2 = \sum_{i=1}^n \Bigl( y_i - \sum_{j=1}^p \hat{\beta}_j x_{ij} \Bigr)^2 .
\]
That is, the fitted values $\hat{\boldsymbol{\mu}}$ are such that
\[
\|\mathbf{y} - \hat{\boldsymbol{\mu}}\| \le \|\mathbf{y} - \boldsymbol{\mu}\| \quad \text{for all } \boldsymbol{\mu} \in C(\mathbf{X}).
\]
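As a numerical illustration of this definition, here is a minimal Python/NumPy sketch (simulated data; all names are illustrative, not from the text) that fits a linear model by least squares and checks that the resulting fitted values are at least as close to $\mathbf{y}$ as another arbitrary point of $C(\mathbf{X})$.

```python
import numpy as np

# A minimal numerical sketch of least squares (illustrative data; not from the text).
rng = np.random.default_rng(0)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # model matrix with intercept
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# Fitted values mu_hat minimize ||y - mu||^2 over all mu in C(X).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
mu_hat = X @ beta_hat

# Any other point in C(X) is at least as far from y.
other_mu = X @ (beta_hat + np.array([0.1, -0.2, 0.3]))
assert np.linalg.norm(y - mu_hat) <= np.linalg.norm(y - other_mu)
```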
Using least squares corresponds to maximum likelihood when we add a normality assumption to the model. The logarithm¹ of the likelihood for independent observations $y_i \sim N(\mu_i, \sigma^2)$, $i = 1, \ldots, n$, is (in terms of $\{\mu_i\}$)
\[
\log\Biggl[ \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(y_i - \mu_i)^2/2\sigma^2} \Biggr]
= \text{constant} - \Biggl[ \sum_{i=1}^n (y_i - \mu_i)^2 \Biggr] \Big/ 2\sigma^2 .
\]
To maximize the log-likelihood function, we must minimize $\sum_i (y_i - \mu_i)^2$.

2.1.1 The Normal Equations and Least Squares Solution

The expression $L(\boldsymbol{\beta}) = \sum_i (y_i - \mu_i)^2 = \sum_i \bigl(y_i - \sum_j \beta_j x_{ij}\bigr)^2$ is quadratic in $\{\beta_j\}$, so we can minimize it by equating
\[
\frac{\partial L}{\partial \beta_j} = 0, \quad j = 1, \ldots, p.
\]
¹ In this book, we use the natural logarithm throughout.
These partial derivatives yield the equations
\[
\sum_i (y_i - \mu_i) x_{ij} = 0, \quad j = 1, \ldots, p.
\]
Thus, the least squares estimates satisfy
\[
\sum_{i=1}^n y_i x_{ij} = \sum_{i=1}^n \hat{\mu}_i x_{ij}, \quad j = 1, \ldots, p. \tag{2.1}
\]
These are called² the \emph{normal equations}. They occur naturally in more general settings than least squares. Chapter 4 shows that these are the likelihood equations for GLMs that use the canonical link function, such as the normal linear model, the binomial logistic regression model, and the Poisson loglinear model.
Using matrix algebra provides an economical expression for the solution of these equations in terms of the model parameter vector $\boldsymbol{\beta}$ for the linear model $\boldsymbol{\mu} = \mathbf{X}\boldsymbol{\beta}$. In matrix form,
\[
L(\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{y}^T\mathbf{y} - 2\mathbf{y}^T\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}.
\]
We use the results for matrix derivatives that
\[
\partial(\mathbf{a}^T\boldsymbol{\beta})/\partial\boldsymbol{\beta} = \mathbf{a} \quad \text{and} \quad \partial(\boldsymbol{\beta}^T\mathbf{A}\boldsymbol{\beta})/\partial\boldsymbol{\beta} = (\mathbf{A} + \mathbf{A}^T)\boldsymbol{\beta},
\]
which equals $2\mathbf{A}\boldsymbol{\beta}$ for symmetric $\mathbf{A}$. So, $\partial L(\boldsymbol{\beta})/\partial\boldsymbol{\beta} = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$. In terms of $\hat{\boldsymbol{\beta}}$, the normal equations (2.1) are
\[
\mathbf{X}^T\mathbf{y} = \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}. \tag{2.2}
\]
Suppose $\mathbf{X}$ has full rank $p$. Then, the $p \times p$ matrix $\mathbf{X}^T\mathbf{X}$ also has rank $p$ and is nonsingular, its inverse exists, and the least squares estimator of $\boldsymbol{\beta}$ is
\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}. \tag{2.3}
\]
Since $\partial^2 L(\boldsymbol{\beta})/\partial\boldsymbol{\beta}^2 = 2\mathbf{X}^T\mathbf{X}$ is positive definite, the minimum rather than maximum of $L(\boldsymbol{\beta})$ occurs at $\hat{\boldsymbol{\beta}}$.
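The following sketch (again simulated, illustrative data) solves the normal equations (2.2) numerically and confirms that the result agrees with the closed form (2.3). In practice one would normally use a dedicated least squares routine such as np.linalg.lstsq rather than forming $(\mathbf{X}^T\mathbf{X})^{-1}$ explicitly.

```python
import numpy as np

# Sketch: solving the normal equations (2.2) and applying the closed form (2.3).
rng = np.random.default_rng(1)
n, p = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)           # solves X^T X beta = X^T y
beta_hat_explicit = np.linalg.inv(X.T @ X) @ X.T @ y   # formula (2.3)

assert np.allclose(beta_hat, beta_hat_explicit)
# The normal equations say X^T (y - X beta_hat) = 0.
assert np.allclose(X.T @ (y - X @ beta_hat), 0)
```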
2.1.2 Hat Matrix and Moments of Estimators

The fitted values $\hat{\boldsymbol{\mu}}$ are a linear transformation of $\mathbf{y}$,
\[
\hat{\boldsymbol{\mu}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}.
\]
² Here "normal" refers not to the normal distribution but to orthogonality of $(\mathbf{y} - \hat{\boldsymbol{\mu}})$ with each column of $\mathbf{X}$.

The $n \times n$ matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is called³ the \emph{hat matrix} because it linearly transforms $\mathbf{y}$ to $\hat{\boldsymbol{\mu}} = \mathbf{H}\mathbf{y}$. The hat matrix $\mathbf{H}$ is a \emph{projection matrix}, projecting $\mathbf{y}$ to $\hat{\boldsymbol{\mu}}$ in the model space $C(\mathbf{X})$. We define projection matrices and study their properties in Section 2.2.

Recall that for a matrix of constants $\mathbf{A}$, $E(\mathbf{A}\mathbf{y}) = \mathbf{A}E(\mathbf{y})$ and $\mathrm{var}(\mathbf{A}\mathbf{y}) = \mathbf{A}\,\mathrm{var}(\mathbf{y})\mathbf{A}^T$. So, the mean and variance of the least squares estimator are
\[
E(\hat{\boldsymbol{\beta}}) = E[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E(\mathbf{y}) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta},
\]
\[
\mathrm{var}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\sigma^2\mathbf{I})\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}. \tag{2.4}
\]
For the ordinary linear model with normal random component, since $\hat{\boldsymbol{\beta}}$ is a linear function of $\mathbf{y}$, $\hat{\boldsymbol{\beta}}$ has a normal distribution with these two moments.
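A short sketch (simulated, illustrative data) constructs $\mathbf{H}$ numerically and verifies that it maps $\mathbf{y}$ to the fitted values and behaves as a symmetric, idempotent projection matrix.

```python
import numpy as np

# Sketch of the hat matrix H = X (X^T X)^{-1} X^T (illustrative data).
rng = np.random.default_rng(2)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

assert np.allclose(H @ y, X @ beta_hat)   # H maps y to the fitted values
assert np.allclose(H @ H, H)              # H is idempotent (a projection matrix)
assert np.allclose(H, H.T)                # and symmetric
```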
2.1.3 Bivariate Linear Model and Regression Toward the Mean
We illustrate least squares using the linear model with a single explanatory variable for a single response, that is, the "bivariate linear model"
\[
E(y_i) = \beta_0 + \beta_1 x_i .
\]
From (2.1) with $x_{i1} = 1$ and $x_{i2} = x_i$, the normal equations are
\[
\sum_{i=1}^n y_i = n\beta_0 + \beta_1 \sum_{i=1}^n x_i, \qquad
\sum_{i=1}^n x_i y_i = \beta_0 \Bigl( \sum_{i=1}^n x_i \Bigr) + \beta_1 \sum_{i=1}^n x_i^2 .
\]
By straightforward solution of these two equations, you can verify that the least squares estimates are
\[
\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. \tag{2.5}
\]
From the solution for $\hat{\beta}_0$, the least squares fitted equation $\hat{\mu}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ satisfies $\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}$. It passes through the center of gravity of the data, that is, the point $(\bar{x}, \bar{y})$. The analogous result holds for the linear model with multiple explanatory variables and the point $(\bar{x}_1, \ldots, \bar{x}_p, \bar{y})$.
Denote the sample marginal standard deviations of $x$ and $y$ by $s_x$ and $s_y$. From the Pearson product-moment formula, the sample \emph{correlation}
\[
r = \mathrm{corr}(x, y) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\bigl[\sum_{i=1}^n (x_i - \bar{x})^2\bigr]\bigl[\sum_{i=1}^n (y_i - \bar{y})^2\bigr]}} = \hat{\beta}_1 \Bigl( \frac{s_x}{s_y} \Bigr).
\]
³ According to Hoaglin and Welsch (1978), John Tukey proposed the term "hat matrix."
One implication of this is that the correlation equals the slope when both variables are standardized to have $s_x = s_y = 1$. Another implication is that an increase of $s_x$ in $x$ corresponds to a change of $\hat{\beta}_1 s_x = r s_y$ in $\hat{\mu}$. This equation highlights the famous result of Francis Galton (1886) that there is \emph{regression toward the mean}: when $|r| < 1$, a standard deviation change in $x$ corresponds to a predicted change of less than a standard deviation in $y$.
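The slope/correlation identity is easy to check numerically. The sketch below (simulated, illustrative bivariate data) verifies $\hat{\beta}_1 = r\,(s_y/s_x)$, the form of the identity solved for the slope.

```python
import numpy as np

# Sketch: checking beta1_hat = r * (s_y / s_x) on simulated bivariate data.
rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(scale=0.8, size=n)

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]
s_x, s_y = x.std(ddof=1), y.std(ddof=1)

assert np.isclose(beta1_hat, r * s_y / s_x)
# Regression toward the mean: with |r| < 1, a one-s_x increase in x predicts
# a change of r * s_y (< s_y) in y.
```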
In practice, explanatory variables are often centered before entering them in a model by taking $x_i^* = x_i - \bar{x}$. For the centered values, $\bar{x}^* = 0$, so
\[
\hat{\beta}_0 = \bar{y}, \qquad
\hat{\beta}_1 = \Bigl( \sum_{i=1}^n x_i^* y_i \Bigr) \Big/ \sum_{i=1}^n (x_i^*)^2 .
\]
Under centering, $\mathbf{X}^T\mathbf{X}$ is a diagonal matrix with elements $n$ and $\sum_i (x_i^*)^2$. Thus, the covariance matrix for $\hat{\boldsymbol{\beta}}$ is
\[
\mathrm{var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1} = \sigma^2
\begin{pmatrix}
1/n & 0 \\
0 & 1/\bigl[\sum_{i=1}^n (x_i - \bar{x})^2\bigr]
\end{pmatrix}.
\]
Centering the explanatory variable does not affect $\hat{\beta}_1$ and its variance but results in $\mathrm{corr}(\hat{\beta}_0, \hat{\beta}_1) = 0$.
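A brief sketch of this point (simulated, illustrative data): after centering, $\mathbf{X}^T\mathbf{X}$ is diagonal, the intercept estimate is $\bar{y}$, and the off-diagonal covariance of the two estimators vanishes.

```python
import numpy as np

# Sketch: centering makes X^T X diagonal, so beta0_hat = y_bar and the
# two estimators are uncorrelated.
rng = np.random.default_rng(4)
n = 100
x = rng.normal(loc=5.0, size=n)
y = 2.0 + 0.7 * x + rng.normal(size=n)

x_star = x - x.mean()
X = np.column_stack([np.ones(n), x_star])

XtX = X.T @ X
assert np.allclose(XtX, np.diag([n, np.sum(x_star ** 2)]))

beta_hat = np.linalg.solve(XtX, X.T @ y)
assert np.isclose(beta_hat[0], y.mean())   # intercept estimate is y_bar

# The covariance matrix sigma^2 (X^T X)^{-1} is diagonal, so corr(beta0_hat, beta1_hat) = 0.
```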
You can show directly from the expression for the model matrix $\mathbf{X}$ that the hat matrix for the bivariate linear model is
\[
\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T =
\begin{pmatrix}
\dfrac{1}{n} + \dfrac{(x_1 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} & \cdots & \dfrac{1}{n} + \dfrac{(x_1 - \bar{x})(x_n - \bar{x})}{\sum_i (x_i - \bar{x})^2} \\
\vdots & \ddots & \vdots \\
\dfrac{1}{n} + \dfrac{(x_n - \bar{x})(x_1 - \bar{x})}{\sum_i (x_i - \bar{x})^2} & \cdots & \dfrac{1}{n} + \dfrac{(x_n - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}
\end{pmatrix}.
\]
In Section 2.5.4 we will see that each diagonal element of the hat matrix is a measure of the observation’s potential influence on the model fit.
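The following sketch (simulated, illustrative data) checks that the elementwise formula above reproduces $\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ and extracts the diagonal elements, the leverages. The final assertion uses the standard fact that the trace of a projection matrix equals the dimension of the space projected onto, here $p = 2$.

```python
import numpy as np

# Sketch: elementwise hat-matrix formula for the bivariate model vs. X (X^T X)^{-1} X^T.
rng = np.random.default_rng(5)
n = 25
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
d = x - x.mean()
H_formula = 1.0 / n + np.outer(d, d) / np.sum(d ** 2)
assert np.allclose(H, H_formula)

leverages = np.diag(H)                    # h_ii: potential influence of observation i
assert np.isclose(leverages.sum(), 2.0)   # trace(H) = p = 2 here
```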
2.1.4 Least Squares Solutions When X Does Not Have Full Rank

When $\mathbf{X}$ does not have full rank, neither does $\mathbf{X}^T\mathbf{X}$ in the normal equations. A solution $\hat{\boldsymbol{\beta}}$ of the normal equations then uses a \emph{generalized inverse} of $\mathbf{X}^T\mathbf{X}$, denoted by $(\mathbf{X}^T\mathbf{X})^-$. Recall that for a matrix $\mathbf{A}$, $\mathbf{G}$ is a generalized inverse if and only if $\mathbf{A}\mathbf{G}\mathbf{A} = \mathbf{A}$. Generalized inverses always exist but may not be unique. The least squares estimate $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^-\mathbf{X}^T\mathbf{y}$ is not then unique, reflecting that $\boldsymbol{\beta}$ is not identifiable.

With $\mathrm{rank}(\mathbf{X}) < p$, the null space $N(\mathbf{X})$ has nonzero elements. For any solution $\hat{\boldsymbol{\beta}}$ of the normal equations $\mathbf{X}^T\mathbf{y} = \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}$ and any element $\boldsymbol{\gamma} \in N(\mathbf{X})$, $\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}} + \boldsymbol{\gamma}$ is also a solution. This follows because $\mathbf{X}\boldsymbol{\gamma} = \mathbf{0}$ and $\mathbf{X}^T\mathbf{X}(\hat{\boldsymbol{\beta}} + \boldsymbol{\gamma}) = \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}$. Although there are multiple solutions $\tilde{\boldsymbol{\beta}}$ for estimating $\boldsymbol{\beta}$, $\hat{\boldsymbol{\mu}} = \mathbf{X}\tilde{\boldsymbol{\beta}}$ is invariant to the solution (as are estimates of estimable quantities), because $\mathbf{X}\tilde{\boldsymbol{\beta}} = \mathbf{X}(\hat{\boldsymbol{\beta}} + \boldsymbol{\gamma})$ has the same fitted values as given by $\hat{\boldsymbol{\mu}} = \mathbf{X}\hat{\boldsymbol{\beta}}$.

Likewise, if $\boldsymbol{\ell}^T\boldsymbol{\beta}$ is estimable, then $\boldsymbol{\ell}^T\hat{\boldsymbol{\beta}}$ is the same for all solutions to the normal equations. This follows because $\boldsymbol{\ell}^T\hat{\boldsymbol{\beta}}$ can be expressed as $\mathbf{a}^T\mathbf{X}\hat{\boldsymbol{\beta}}$ for some $\mathbf{a}$, and fitted values are identical for all $\hat{\boldsymbol{\beta}}$.
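A numerical sketch of this invariance (simulated, illustrative data): np.linalg.pinv supplies the Moore-Penrose generalized inverse, which is one particular choice of $(\mathbf{X}^T\mathbf{X})^-$; adding a null-space element gives a different solution of the normal equations with identical fitted values.

```python
import numpy as np

# Sketch: with a rank-deficient X, solutions of the normal equations differ,
# but the fitted values X beta are the same.
rng = np.random.default_rng(6)
n = 30
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x, x])        # duplicated column: rank 2 < p = 3
y = rng.normal(size=n)

beta_hat = np.linalg.pinv(X.T @ X) @ X.T @ y   # one solution of the normal equations
gamma = np.array([0.0, 1.0, -1.0])             # element of N(X), since X @ gamma = 0
beta_tilde = beta_hat + gamma                  # another solution

assert np.allclose(X.T @ X @ beta_tilde, X.T @ y)   # still satisfies the normal equations
assert np.allclose(X @ beta_hat, X @ beta_tilde)    # identical fitted values
```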
2.1.5 Orthogonal Subspaces and Residuals
Section 1.3.1 introduced the model space $C(\mathbf{X})$ of $\mathbf{X}\boldsymbol{\beta}$ values for all the possible $\boldsymbol{\beta}$ values. This vector space is a linear subspace of $n$-dimensional Euclidean space, $\mathbb{R}^n$. Many results in this chapter relate to \emph{orthogonality} for this representation, so let us recall a few basic results about orthogonality for vectors and for vector subspaces of $\mathbb{R}^n$:

• Two vectors $\mathbf{u}$ and $\mathbf{v}$ in $\mathbb{R}^n$ are \emph{orthogonal} if $\mathbf{u}^T\mathbf{v} = 0$. Geometrically, orthogonal vectors are perpendicular in $\mathbb{R}^n$.
• For a vector subspace $W$ of $\mathbb{R}^n$, the subspace of vectors $\mathbf{v}$ such that for any $\mathbf{u} \in W$, $\mathbf{u}^T\mathbf{v} = 0$, is the \emph{orthogonal complement} of $W$, denoted by $W^{\perp}$.
• Orthogonal complements $W$ and $W^{\perp}$ in $\mathbb{R}^n$ satisfy $\dim(W) + \dim(W^{\perp}) = n$.
• For orthogonal complements $W$ and $W^{\perp}$, any $\mathbf{y} \in \mathbb{R}^n$ has a unique⁴ orthogonal decomposition $\mathbf{y} = \mathbf{y}_1 + \mathbf{y}_2$ with $\mathbf{y}_1 \in W$ and $\mathbf{y}_2 \in W^{\perp}$.
Figure 2.1 portrays the key result about orthogonal decompositions into components in orthogonal complement subspaces. In the decomposition $\mathbf{y} = \mathbf{y}_1 + \mathbf{y}_2$, we will see in Section 2.2 that $\mathbf{y}_1$ is the \emph{orthogonal projection} of $\mathbf{y}$ onto $W$.

Figure 2.1  Orthogonal decomposition of $\mathbf{y}$ into components $\mathbf{y}_1$ in subspace $W$ plus $\mathbf{y}_2$ in orthogonal complement subspace $W^{\perp}$.
⁴ The proof uses the Gram–Schmidt process on a basis for $\mathbb{R}^n$ that extends one for $W$ to construct an orthogonal basis of vectors in $W$ and vectors in $W^{\perp}$; $\mathbf{y}_1$ and $\mathbf{y}_2$ are then linear combinations of these two sets of vectors. See Christensen (2011, pp. 414–416).
Now, suppose $W = C(\mathbf{X})$, the model space spanned by the columns of a model matrix $\mathbf{X}$. Vectors in its orthogonal complement $C(\mathbf{X})^{\perp}$ in $\mathbb{R}^n$ are orthogonal with any vector in $C(\mathbf{X})$, and hence with each column of $\mathbf{X}$. So any vector $\mathbf{v}$ in $C(\mathbf{X})^{\perp}$ satisfies $\mathbf{X}^T\mathbf{v} = \mathbf{0}$, and $C(\mathbf{X})^{\perp}$ is the null space of $\mathbf{X}^T$, denoted $N(\mathbf{X}^T)$. We will observe next that $C(\mathbf{X})^{\perp}$ is an \emph{error space} that contains differences between possible data vectors and model-fitted values for such data.

The normal equations $\mathbf{X}^T\mathbf{y} = \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}$ that the least squares estimates satisfy can be expressed as
\[
\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{X}^T\mathbf{e} = \mathbf{0},
\]
where $\mathbf{e} = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})$. The elements of $\mathbf{e}$ are prediction errors when we use $\hat{\boldsymbol{\mu}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ to predict $\mathbf{y}$ or $\boldsymbol{\mu}$. They are called \emph{residuals}. The normal equations tell us that the residual vector $\mathbf{e}$ is orthogonal to each column of $\mathbf{X}$. So $\mathbf{e}$ is in the orthogonal complement to the model space $C(\mathbf{X})$, that is, $\mathbf{e}$ is in $C(\mathbf{X})^{\perp} = N(\mathbf{X}^T)$. Figure 2.2 portrays the orthogonality of $\mathbf{e}$ with $C(\mathbf{X})$.
Figure 2.2  Orthogonality of the residual vector $\mathbf{e} = (\mathbf{y} - \hat{\boldsymbol{\mu}})$ with vectors in the model space $C(\mathbf{X})$ for a linear model $\boldsymbol{\mu} = \mathbf{X}\boldsymbol{\beta}$.
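A small sketch of this orthogonality (simulated, illustrative data): the residual vector is orthogonal to every column of $\mathbf{X}$, hence to $\hat{\boldsymbol{\mu}}$, so the Pythagorean identity $\|\mathbf{y}\|^2 = \|\hat{\boldsymbol{\mu}}\|^2 + \|\mathbf{e}\|^2$ holds.

```python
import numpy as np

# Sketch: e = y - X beta_hat lies in C(X)^perp, so y = mu_hat + e is an
# orthogonal decomposition and the Pythagorean identity holds.
rng = np.random.default_rng(7)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
mu_hat = X @ beta_hat
e = y - mu_hat

assert np.allclose(X.T @ e, 0)                      # e orthogonal to each column of X
assert np.allclose(mu_hat @ e, 0)                   # hence orthogonal to mu_hat in C(X)
assert np.isclose(y @ y, mu_hat @ mu_hat + e @ e)   # ||y||^2 = ||mu_hat||^2 + ||e||^2
```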
Some linear model analyses decompose $\mathbf{y}$ into several orthogonal components. An orthogonal decomposition of $\mathbb{R}^n$ into $k$ orthogonal subspaces $\{W_i\}$ is one for which any $\mathbf{u} \in W_i$ and any $\mathbf{v} \in W_j$ have $\mathbf{u}^T\mathbf{v} = 0$ for all $i \ne j$, and any $\mathbf{y} \in \mathbb{R}^n$ can be uniquely expressed as $\mathbf{y} = \mathbf{y}_1 + \cdots + \mathbf{y}_k$ with $\mathbf{y}_i \in W_i$ for $i = 1, \ldots, k$. A sketch of one such decomposition follows.
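The sketch below (simulated, illustrative data; one possible choice of subspaces, not the only one) splits $\mathbf{y}$ into $k = 3$ mutually orthogonal pieces: the component in the span of the constant vector, the remaining component in the model space $C(\mathbf{X})$, and the component in the error space $C(\mathbf{X})^{\perp}$.

```python
import numpy as np

# Sketch: decomposing y into three mutually orthogonal components.
rng = np.random.default_rng(9)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
y1 = np.full(n, y.mean())     # component in span{(1, ..., 1)}
y2 = H @ y - y1               # component in the rest of the model space
y3 = y - H @ y                # component in the error space

assert np.allclose(y1 + y2 + y3, y)
assert np.isclose(y1 @ y2, 0) and np.isclose(y1 @ y3, 0) and np.isclose(y2 @ y3, 0)
```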
2.1.6 Alternatives to Least Squares

In fitting a linear model, why minimize $\sum_i (y_i - \hat{\mu}_i)^2$ rather than some other metric, such as $\sum_i |y_i - \hat{\mu}_i|$? Minimizing a sum of squares is mathematically and computationally much simpler. For this reason, least squares has a long history, dating back to a published article by the French mathematician Adrien-Marie Legendre (1805), followed by the German mathematician Carl Friedrich Gauss's claim in 1809 of priority⁵ in having used it since 1795. Another motivation, seen at the beginning of this section, is that it corresponds to maximum likelihood when we add the normality assumption. Yet another motivation, presented in Section 2.7, is that the least squares estimator is best in the class of estimators that are unbiased and linear in the data.
Recent research has developed alternatives to least squares that give sensible answers in situations that are unstable in some way. For example, instability may be caused by a severe outlier, because in minimizing a sum of squared deviations, a single observation can have substantial influence. Instability could also be caused by an explanatory variable being linearly determined (or nearly so) by the other explanatory variables, a condition called \emph{collinearity} (Section 4.6.5). Finally, instability occurs in using least squares with datasets containing very large numbers of explanatory variables, sometimes even with $p > n$.
\emph{Regularization methods} add an additional term to the function minimized, such as $\lambda \sum_j |\beta_j|$ or $\lambda \sum_j \beta_j^2$ for some constant $\lambda$. The solution then is a smoothing of the least squares estimates that shrinks them toward zero. This is highly effective when we have a large number of explanatory variables but expect few of them to have a substantively important effect. Unless $n$ is extremely large, because of sampling variability the ordinary least squares estimates $\{\hat{\beta}_j\}$ then tend to be much larger in absolute value than the true values $\{\beta_j\}$. Shrinkage toward 0 causes a bias in the estimators but tends to reduce the variance substantially, resulting in their tending to be closer to $\{\beta_j\}$.
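For the $\lambda \sum_j \beta_j^2$ penalty, the penalized criterion $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|^2$ has the well-known closed-form minimizer $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$. The sketch below (simulated, illustrative data) uses that form; for brevity it penalizes all coefficients, whereas in practice the intercept is usually left unpenalized.

```python
import numpy as np

# Sketch: shrinkage with the lambda * sum(beta_j^2) penalty.
# Minimizing ||y - X beta||^2 + lambda * ||beta||^2 gives the closed form below.
rng = np.random.default_rng(8)
n, p = 40, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]            # only a few coefficients really matter
y = X @ beta_true + rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
lam = 5.0
beta_pen = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# The penalized estimates are shrunk toward zero relative to ordinary least squares.
assert np.linalg.norm(beta_pen) < np.linalg.norm(beta_ols)
```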
Regularization methods are increasingly important as more applications involve "big data." Chapter 11, which introduces extensions of the GLM, presents some regularization methods.