In this chapter, we have used least squares to estimate parameters in the ordinary linear model, which assumes independent observations with constant variance. We next show a criterion by which such estimators are optimal. We then generalize least squares to permit observations to be correlated and to have nonconstant variance.
2.7.1 The Gauss–Markov Theorem
For the ordinary linear model, least squares provides the best possible estimator of model parameters, in a certain restricted sense. Like most other results in this chapter, this one does not require an assumption (such as normality) about the distribution of the response variable. We express it here for linear combinations $a^T\boldsymbol{\beta}$ of the parameters, but then we apply it to the individual parameters.
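Gauss–Markov theorem: Suppose $E(y) = X\boldsymbol{\beta}$, where $X$ has full rank, and $\mathrm{var}(y) = \sigma^2 I$. The least squares estimator $\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T y$ is the best linear unbiased estimator (BLUE) of $\boldsymbol{\beta}$, in this sense: For any $a^T\boldsymbol{\beta}$, of the estimators that are linear in $y$ and unbiased, $a^T\hat{\boldsymbol{\beta}}$ has minimum variance.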
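To prove this, we express $a^T\hat{\boldsymbol{\beta}}$ in its linear form in $y$ as
$$a^T\hat{\boldsymbol{\beta}} = a^T (X^T X)^{-1} X^T y = c^T y,$$
where $c^T = a^T (X^T X)^{-1} X^T$. Suppose $b^T y$ is an alternative linear estimator of $a^T\boldsymbol{\beta}$ that is unbiased. Then,
$$E[(b - c)^T y] = E(b^T y) - E(c^T y) = a^T\boldsymbol{\beta} - a^T\boldsymbol{\beta} = 0$$
for all $\boldsymbol{\beta}$. But this also equals $(b - c)^T X \boldsymbol{\beta} = [\boldsymbol{\beta}^T X^T (b - c)]^T$ for all $\boldsymbol{\beta}$. Therefore,¹⁴ $X^T (b - c) = 0$. So, $(b - c)$ is in the error space $C(X)^{\perp} = N(X^T)$ for the model. Now,
$$\mathrm{var}(b^T y) = \mathrm{var}[c^T y + (b - c)^T y] = \mathrm{var}(c^T y) + \lVert b - c \rVert^2 \sigma^2 + 2\,\mathrm{cov}[c^T y, (b - c)^T y].$$
But since $X^T (b - c) = 0$,
$$\mathrm{cov}[c^T y, (b - c)^T y] = c^T \mathrm{var}(y) (b - c) = \sigma^2 a^T (X^T X)^{-1} X^T (b - c) = 0.$$
Thus, $\mathrm{var}(b^T y) \geq \mathrm{var}(c^T y) = \mathrm{var}(a^T\hat{\boldsymbol{\beta}})$, with equality if and only if $b = c$.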
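From the theorem's proof, any other linear unbiased estimator of $a^T\boldsymbol{\beta}$ can be expressed as $a^T\hat{\boldsymbol{\beta}} + d^T y$, where $E(d^T y) = 0$ and $d^T y$ is uncorrelated with $a^T\hat{\boldsymbol{\beta}}$; that is, the variate added to $a^T\hat{\boldsymbol{\beta}}$ is like extra noise. The Gauss–Markov theorem extends to non-full-rank models. Using a generalized inverse of $(X^T X)$ in obtaining $\hat{\boldsymbol{\beta}}$, $a^T\hat{\boldsymbol{\beta}}$ is a BLUE of an estimable function $a^T\boldsymbol{\beta}$.

¹⁴ Recall that if $\boldsymbol{\beta}^T L = \boldsymbol{\beta}^T M$ for all $\boldsymbol{\beta}$, then $L = M$; here we identify $L = X^T (b - c)$ and $M = 0$.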
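The variance decomposition in the proof can be checked numerically. The following is a minimal sketch, not part of the original argument: the model matrix `X`, the vector `a`, and the value `sigma2` are hypothetical values chosen only for illustration. It builds an alternative unbiased estimator $b^T y$ by adding to $c$ a vector $d$ in the error space $N(X^T)$, and confirms that the extra variance is $\sigma^2 \lVert d \rVert^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
X = rng.normal(size=(n, p))          # hypothetical full-rank model matrix
a = np.array([1.0, -2.0, 0.5])       # hypothetical coefficients defining a^T beta
sigma2 = 1.5                         # hypothetical error variance

# Least squares estimator of a^T beta is c^T y with c = X (X^T X)^{-1} a
c = X @ np.linalg.solve(X.T @ X, a)

# Build d in the error space N(X^T): remove from z its projection onto C(X)
z = rng.normal(size=n)
d = z - X @ np.linalg.solve(X.T @ X, X.T @ z)
b = c + d                            # alternative linear estimator b^T y

unbiased = np.allclose(X.T @ b, a)   # b^T X beta = a^T beta for every beta
var_c = sigma2 * (c @ c)             # var(c^T y) = sigma^2 ||c||^2
var_b = sigma2 * (b @ b)             # var(b^T y) = sigma^2 ||b||^2
cross = c @ d                        # proportional to cov(c^T y, d^T y)

print("unbiased:", unbiased)
print("extra variance:", var_b - var_c, "sigma^2 ||d||^2:", sigma2 * (d @ d))
print("cross term:", cross)
```

Under these assumptions the cross term vanishes up to rounding error, so the alternative estimator can only add variance, which is the content of the theorem.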
With the added assumption of normality for the distribution of $y$, $a^T\hat{\boldsymbol{\beta}}$ is the minimum variance unbiased estimator (MVUE) of $a^T\boldsymbol{\beta}$. Here, the restriction is still unbiasedness, but not linearity in $y$. This follows from the Lehmann–Scheffé theorem, which states that a function of a complete, sufficient statistic is the unique MVUE of its expectation.
Let $a$ have 1 in position $j$ and 0 elsewhere. Then the Gauss–Markov theorem implies that, for all $j$, $\mathrm{var}(\hat{\beta}_j)$ is the minimum among all linear unbiased estimators of $\beta_j$.

At first glance, the Gauss–Markov theorem is impressive, the least squares estimator being declared “best.” However, the restriction to estimators that are both linear and unbiased is severe. In later chapters, maximum likelihood (ML) estimators for parameters in non-normal GLMs usually satisfy neither of these properties. Also, in some cases in statistics, the best unbiased estimator is not sensible (e.g., see Exercise 2.41). In multivariate settings, Bayesian-like biased estimators often obtain a marked improvement in mean squared error by shrinking the ML estimate toward a prior mean.¹⁵
2.7.2 Generalized Least Squares
The ordinary linear model, for which $E(y) = X\boldsymbol{\beta}$ with $\mathrm{var}(y) = \sigma^2 I$, assumes that the response observations have identical variances and are uncorrelated. In practice, this is often not plausible. With count data, the variance is typically larger when the mean is larger. With time series data, observations close together in time are often highly correlated. With survey data, sampling designs are usually more complex than simple random sampling, and analysts weight observations so that they receive their appropriate influence.
A linear model with a more general structure for the covariance matrix is
$$E(y) = X\boldsymbol{\beta} \quad \text{with} \quad \mathrm{var}(y) = \sigma^2 V,$$
where $V$ need not be the identity matrix. We next see that ordinary least squares is still relevant for a linear transformation of $y$, and the method then corresponds to a weighted version of least squares on the original scale.
¹⁵ A classic example is Charles Stein’s famous result that, in estimating a vector of normal means, the sample mean vector is inadmissible. See Efron and Morris (1975).

Suppose the model matrix $X$ has full rank and $V$ is a known positive definite matrix. Then, $V$ can be expressed as $V = BB^T$ for a square matrix $B$ that is denoted by $V^{1/2}$. This results from using the spectral decomposition for a symmetric matrix as $V = Q\boldsymbol{\Lambda}Q^T$, where $\boldsymbol{\Lambda}$ is a diagonal matrix of the eigenvalues of $V$ and $Q$ is orthogonal (recall that an orthogonal matrix $Q$ is a square matrix having $QQ^T = Q^TQ = I$) with columns that are its eigenvectors, from which $V^{1/2} = Q\boldsymbol{\Lambda}^{1/2}Q^T$ using the positive square roots of the eigenvalues. Then, $V^{-1}$ exists, as does $V^{-1/2} = Q\boldsymbol{\Lambda}^{-1/2}Q^T$. Let
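$$y^* = V^{-1/2} y, \qquad X^* = V^{-1/2} X.$$
For these linearly transformed values,
$$E(y^*) = V^{-1/2} X \boldsymbol{\beta} = X^* \boldsymbol{\beta}, \qquad \mathrm{var}(y^*) = \sigma^2 V^{-1/2} V (V^{-1/2})^T = \sigma^2 I.$$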
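So $y^*$ satisfies the ordinary linear model, and we can apply least squares to the transformed values. The sum of squared errors comparing $y^*$ and $X^*\boldsymbol{\beta}$ that is minimized is
$$(y^* - X^*\boldsymbol{\beta})^T (y^* - X^*\boldsymbol{\beta}) = (y - X\boldsymbol{\beta})^T V^{-1} (y - X\boldsymbol{\beta}).$$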
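The normal equations $[(X^*)^T X^*]\boldsymbol{\beta} = (X^*)^T y^*$ become
$$(X^T V^{-1/2} V^{-1/2} X)\boldsymbol{\beta} = X^T V^{-1/2} V^{-1/2} y, \quad \text{or} \quad X^T V^{-1} (y - X\boldsymbol{\beta}) = 0.$$
From (2.3), the least squares solution for the transformed values is
$$\hat{\boldsymbol{\beta}}_{GLS} = [(X^*)^T X^*]^{-1} (X^*)^T y^* = (X^T V^{-1} X)^{-1} X^T V^{-1} y. \tag{2.12}$$
The estimator $\hat{\boldsymbol{\beta}}_{GLS}$ is called the generalized least squares estimator of $\boldsymbol{\beta}$. When $V$ is diagonal and $\mathrm{var}(y_i) = \sigma^2 / w_i$ for a known positive weight $w_i$, as in a survey design that gives more weight to some observations than others, $\hat{\boldsymbol{\beta}}_{GLS}$ is also referred to as a weighted least squares estimator. This form of estimator arises in fitting GLMs (Section 4.5.4).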
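The equivalence between ordinary least squares on the transformed values and the closed form (2.12) can be illustrated numerically. The sketch below is only an illustration under assumed inputs: the matrices `X` and `V`, the response `y`, and the weights `w` are simulated, hypothetical values, and NumPy is our choice of tool rather than anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 12
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # hypothetical model matrix
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)          # hypothetical known positive definite V
y = rng.normal(size=n)               # hypothetical response values

# V^{-1/2} = Q Lambda^{-1/2} Q^T from the spectral decomposition of V
lam, Q = np.linalg.eigh(V)
V_inv_half = Q @ np.diag(lam ** -0.5) @ Q.T

# Ordinary least squares on the transformed values y* and X*
y_star = V_inv_half @ y
X_star = V_inv_half @ X
beta_transformed, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

# Closed form (2.12) on the original scale
V_inv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)

# Weighted least squares: V diagonal with var(y_i) = sigma^2 / w_i, so V^{-1} = diag(w)
w = rng.uniform(0.5, 2.0, size=n)    # hypothetical positive weights
beta_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
beta_wls_check, *_ = np.linalg.lstsq(np.sqrt(w)[:, None] * X, np.sqrt(w) * y, rcond=None)

print(beta_transformed, beta_gls)
print(beta_wls, beta_wls_check)
```

In the weighted case the transformation simply multiplies row $i$ of $X$ and $y_i$ by $\sqrt{w_i}$, which is why the last two estimates agree.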
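The generalized least squares estimator has
$$E(\hat{\boldsymbol{\beta}}_{GLS}) = (X^T V^{-1} X)^{-1} X^T V^{-1} E(y) = \boldsymbol{\beta}.$$
Like the OLS estimator, it is unbiased. The covariance matrix is
$$\mathrm{var}(\hat{\boldsymbol{\beta}}_{GLS}) = (X^T V^{-1} X)^{-1} X^T V^{-1} (\sigma^2 V) V^{-1} X (X^T V^{-1} X)^{-1} = \sigma^2 (X^T V^{-1} X)^{-1}.$$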
It shares other properties of the ordinary least squares estimator, such as $\hat{\boldsymbol{\beta}}_{GLS}$ being the BLUE of $\boldsymbol{\beta}$ and also the maximum likelihood estimator under the normality assumption.
The fitted values for this more general model are
$$\hat{\boldsymbol{\mu}} = X\hat{\boldsymbol{\beta}}_{GLS} = X (X^T V^{-1} X)^{-1} X^T V^{-1} y.$$
Here, $H = X (X^T V^{-1} X)^{-1} X^T V^{-1}$ plays the role of a hat matrix. In this case, $H$ is idempotent but need not be symmetric, so it is not a projection matrix as defined in Section 2.2. However, $H$ is a projection matrix in a more general sense if we instead define the inner product to be $(w, z) = w^T V^{-1} z$, as motivated by the normal equations given above. Namely, if $w \in C(X)$, say $w = Xv$, then
$$Hw = X (X^T V^{-1} X)^{-1} X^T V^{-1} w = X (X^T V^{-1} X)^{-1} X^T V^{-1} X v = Xv = w.$$
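Also, if $w$ is orthogonal to $C(X)$ under this inner product, so that $(w, v) = w^T V^{-1} v = 0$ for all $v \in C(X)$ (equivalently, $X^T V^{-1} w = 0$), then $Hw = 0$.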
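These properties of $H$ can be verified on a small example. This is a sketch under assumed inputs: `X` and `V` are hypothetical simulated matrices, and the check that $V^{-1}H$ is symmetric is our way of expressing that $H$ is self-adjoint under the inner product $(w, z) = w^T V^{-1} z$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 2
X = rng.normal(size=(n, p))          # hypothetical full-rank model matrix
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)          # hypothetical positive definite V
V_inv = np.linalg.inv(V)

# Generalized hat matrix H = X (X^T V^{-1} X)^{-1} X^T V^{-1}
H = X @ np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv)

idempotent = np.allclose(H @ H, H)
symmetric = np.allclose(H, H.T)      # generally False when V is not a multiple of I
# Self-adjoint under (w, z) = w^T V^{-1} z  <=>  V^{-1} H is symmetric
self_adjoint = np.allclose(V_inv @ H, (V_inv @ H).T)

# A vector in C(X) is left unchanged by H
w_in = X @ rng.normal(size=p)
fixes_column_space = np.allclose(H @ w_in, w_in)

# A vector with X^T V^{-1} w = 0 is mapped to zero by H
z = rng.normal(size=n)
w_perp = z - H @ z
annihilates_complement = np.allclose(H @ w_perp, 0.0)

print(idempotent, symmetric, self_adjoint, fixes_column_space, annihilates_complement)
```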
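The estimate of $\sigma^2$ in the generalized model with $\mathrm{var}(y) = \sigma^2 V$ uses the usual unbiased estimator for the linearly transformed values. If $\mathrm{rank}(X) = r$, the estimate is
$$s^2 = \frac{(y^* - X^*\hat{\boldsymbol{\beta}}_{GLS})^T (y^* - X^*\hat{\boldsymbol{\beta}}_{GLS})}{n - r} = \frac{(y - \hat{\boldsymbol{\mu}})^T V^{-1} (y - \hat{\boldsymbol{\mu}})}{n - r}.$$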
Statistical inference for the model parameters can be based directly on the regular inferences of the next chapter for the ordinary linear model but using the transformed variables.
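As a rough illustration of how $s^2$ and the resulting standard errors might be computed, the sketch below uses a diagonal $V$ and simulated data. All inputs (the design, the diagonal entries `v`, the true coefficients, and $\sigma^2 = 2$) are hypothetical choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])  # hypothetical design
v = rng.uniform(0.5, 3.0, size=n)    # hypothetical known diagonal of V
V_inv = np.diag(1.0 / v)
# Simulated response with var(y_i) = sigma^2 v_i, taking sigma^2 = 2
y = X @ np.array([1.0, 0.5]) + np.sqrt(2.0 * v) * rng.normal(size=n)

# GLS fit on the original scale
XtVi = X.T @ V_inv
cov_unscaled = np.linalg.inv(XtVi @ X)        # (X^T V^{-1} X)^{-1}
beta_gls = cov_unscaled @ (XtVi @ y)
mu_hat = X @ beta_gls

# s^2 = (y - mu_hat)^T V^{-1} (y - mu_hat) / (n - r)
r = np.linalg.matrix_rank(X)
resid = y - mu_hat
s2 = (resid @ V_inv @ resid) / (n - r)
se = np.sqrt(np.diag(s2 * cov_unscaled))      # estimated standard errors

# Same quantities from ordinary least squares on the transformed values
X_star = X / np.sqrt(v)[:, None]
y_star = y / np.sqrt(v)
beta_star, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
e_star = y_star - X_star @ beta_star
s2_star = (e_star @ e_star) / (n - r)

print(beta_gls, s2, se)
```

Both routes give the same estimates, which is why inference can proceed with the ordinary linear model machinery applied to $y^*$ and $X^*$.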
2.7.3 Adjustment Using Estimated Heteroscedasticity
This generalization of the model seems straightforward, but we have neglected a crucial point: In applications, $V$ itself is also often unknown and must be estimated. Once we have done that, we can use $\hat{\boldsymbol{\beta}}_{GLS}$ in (2.12) with $V$ replaced by $\hat{V}$. But this estimator is no longer unbiased, nor does it have an exact formula for the covariance matrix, which also must be estimated.
Since $\hat{\boldsymbol{\beta}}_{GLS}$ is no longer optimal once we have substituted estimated variances, we could instead use the ordinary least squares estimator, which does not require estimating the variances and is still unbiased and consistent (i.e., converging in probability to $\boldsymbol{\beta}$ as $n \to \infty$). In doing so, however, we should adapt standard errors to adjust for the departure from the ordinary linear model assumptions. An important case (heteroscedasticity) is when $V$ is diagonal. Let $\mathrm{var}(y_i) = \sigma_i^2$. Then, with $x_i$ as row $i$ of $X$,
$$\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T y = (X^T X)^{-1} \left( \sum_{i=1}^{n} x_i^T y_i \right),$$
so
$$\mathrm{var}(\hat{\boldsymbol{\beta}}) = (X^T X)^{-1} \left( \sum_{i=1}^{n} \sigma_i^2 x_i^T x_i \right) (X^T X)^{-1}.$$
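Since $\mathrm{var}(e_i) = \sigma_i^2 (1 - h_{ii})$ holds approximately (and exactly when the variances are equal), we can estimate $\mathrm{var}(\hat{\boldsymbol{\beta}})$ by replacing $\sigma_i^2$ by $e_i^2 / (1 - h_{ii})$, for each $i$.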
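The adjusted covariance estimate can be sketched as follows. The data are simulated under an assumed variance pattern (`sigma_i` increasing in `x`), which is purely hypothetical; the estimator itself follows the formula above, with $\sigma_i^2$ replaced by $e_i^2 / (1 - h_{ii})$. In the econometrics literature this version is often labelled HC2.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])           # hypothetical model matrix
sigma_i = 0.5 + 0.3 * x                        # hypothetical nonconstant standard deviations
y = 1.0 + 2.0 * x + sigma_i * rng.normal(size=n)

# Ordinary least squares fit, residuals e_i, and leverages h_ii
XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ y
e = y - X @ beta_ols
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)    # h_ii = x_i (X^T X)^{-1} x_i^T

# Replace sigma_i^2 by e_i^2 / (1 - h_ii) in the middle term of var(beta_hat)
sigma2_hat = e**2 / (1.0 - h)
middle = X.T @ (sigma2_hat[:, None] * X)       # sum_i sigma2_hat_i x_i^T x_i
cov_adjusted = XtX_inv @ middle @ XtX_inv
se_adjusted = np.sqrt(np.diag(cov_adjusted))

# For comparison: standard errors that assume constant variance
s2 = (e @ e) / (n - X.shape[1])
se_constant = np.sqrt(np.diag(s2 * XtX_inv))

print("adjusted SEs:", se_adjusted)
print("constant-variance SEs:", se_constant)
```

The two sets of standard errors generally differ when the variance changes with the explanatory variable, which is the situation the adjustment is designed for.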
CHAPTER NOTES
Section 2.1: Least Squares Model Fitting