2.4 SUMMARIZING VARIABILITY IN A LINEAR MODEL

For a linear model $E(\boldsymbol{y}) = \boldsymbol{X}\boldsymbol{\beta}$ with model matrix $\boldsymbol{X}$ and covariance matrix $\boldsymbol{V} = \sigma^2\boldsymbol{I}$, in Section 2.2.5 we introduced the "data = fit + residuals" orthogonal decomposition using the projection matrix $\boldsymbol{P}_X = \boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T$ (i.e., the hat matrix $\boldsymbol{H}$),

$$\boldsymbol{y} = \hat{\boldsymbol{\mu}} + (\boldsymbol{y} - \hat{\boldsymbol{\mu}}) = \boldsymbol{P}_X\boldsymbol{y} + (\boldsymbol{I} - \boldsymbol{P}_X)\boldsymbol{y}.$$

This represents the orthogonality of the fitted values $\hat{\boldsymbol{\mu}}$ and the raw residuals $\boldsymbol{e} = (\boldsymbol{y} - \hat{\boldsymbol{\mu}})$. We have used $\boldsymbol{P}_X\boldsymbol{y} = \hat{\boldsymbol{\mu}}$ to estimate $\boldsymbol{\mu}$. The other part of this decomposition, $(\boldsymbol{I} - \boldsymbol{P}_X)\boldsymbol{y} = (\boldsymbol{y} - \hat{\boldsymbol{\mu}})$, falls in the error space $C(\boldsymbol{X})^{\perp}$ orthogonal to the model space $C(\boldsymbol{X})$. We next use it to estimate the variance $\sigma^2$ of the conditional distribution of each $y_i$, given its explanatory variable values. This variance is sometimes called the error variance, from the representation of the model as $\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$ with $\text{var}(\boldsymbol{\epsilon}) = \sigma^2\boldsymbol{I}$.

2.4.1 Estimating the Error Variance for a Linear Model

To obtain an unbiased estimator of $\sigma^2$, we apply a result about $E(\boldsymbol{y}^T\boldsymbol{A}\boldsymbol{y})$ for an $n \times n$ matrix $\boldsymbol{A}$. Since $E(\boldsymbol{y} - \boldsymbol{\mu}) = \boldsymbol{0}$,

$$E[(\boldsymbol{y} - \boldsymbol{\mu})^T\boldsymbol{A}(\boldsymbol{y} - \boldsymbol{\mu})] = E(\boldsymbol{y}^T\boldsymbol{A}\boldsymbol{y}) - \boldsymbol{\mu}^T\boldsymbol{A}\boldsymbol{\mu}.$$

Using the commutative property of the trace of a matrix,

$$E[(\boldsymbol{y} - \boldsymbol{\mu})^T\boldsymbol{A}(\boldsymbol{y} - \boldsymbol{\mu})] = E\{\text{trace}[(\boldsymbol{y} - \boldsymbol{\mu})^T\boldsymbol{A}(\boldsymbol{y} - \boldsymbol{\mu})]\} = E\{\text{trace}[\boldsymbol{A}(\boldsymbol{y} - \boldsymbol{\mu})(\boldsymbol{y} - \boldsymbol{\mu})^T]\}$$
$$= \text{trace}\{\boldsymbol{A}\,E[(\boldsymbol{y} - \boldsymbol{\mu})(\boldsymbol{y} - \boldsymbol{\mu})^T]\} = \text{trace}(\boldsymbol{A}\boldsymbol{V}).$$

It follows that

$$E(\boldsymbol{y}^T\boldsymbol{A}\boldsymbol{y}) = \text{trace}(\boldsymbol{A}\boldsymbol{V}) + \boldsymbol{\mu}^T\boldsymbol{A}\boldsymbol{\mu}. \qquad (2.7)$$

For a linear model with full-rank model matrix $\boldsymbol{X}$ and projection matrix $\boldsymbol{P}_X$, we now apply this result with $\boldsymbol{A} = (\boldsymbol{I} - \boldsymbol{P}_X)$ and $\boldsymbol{V} = \sigma^2\boldsymbol{I}$ for the $n \times n$ identity matrix $\boldsymbol{I}$.

The rank of $\boldsymbol{X}$, which also is the rank of $\boldsymbol{P}_X$, is the number of model parameters $p$.

We have

$$E[\boldsymbol{y}^T(\boldsymbol{I} - \boldsymbol{P}_X)\boldsymbol{y}] = \text{trace}[(\boldsymbol{I} - \boldsymbol{P}_X)\sigma^2\boldsymbol{I}] + \boldsymbol{\mu}^T(\boldsymbol{I} - \boldsymbol{P}_X)\boldsymbol{\mu} = \sigma^2\,\text{trace}(\boldsymbol{I} - \boldsymbol{P}_X),$$

because $(\boldsymbol{I} - \boldsymbol{P}_X)\boldsymbol{\mu} = \boldsymbol{\mu} - \boldsymbol{\mu} = \boldsymbol{0}$. Then, since

$$\text{trace}(\boldsymbol{P}_X) = \text{trace}[\boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T] = \text{trace}[\boldsymbol{X}^T\boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1}] = \text{trace}(\boldsymbol{I}_p),$$

where $\boldsymbol{I}_p$ is the $p \times p$ identity matrix, we have $\text{trace}(\boldsymbol{I} - \boldsymbol{P}_X) = n - p$, and

$$E\left[\frac{\boldsymbol{y}^T(\boldsymbol{I} - \boldsymbol{P}_X)\boldsymbol{y}}{n - p}\right] = \sigma^2.$$

So $s^2 = [\boldsymbol{y}^T(\boldsymbol{I} - \boldsymbol{P}_X)\boldsymbol{y}]/(n - p)$ is an unbiased estimator of $\sigma^2$. Since $\boldsymbol{P}_X$ and $(\boldsymbol{I} - \boldsymbol{P}_X)$ are symmetric and idempotent, the numerator of $s^2$ is

$$\boldsymbol{y}^T(\boldsymbol{I} - \boldsymbol{P}_X)\boldsymbol{y} = \boldsymbol{y}^T(\boldsymbol{I} - \boldsymbol{P}_X)^T(\boldsymbol{I} - \boldsymbol{P}_X)\boldsymbol{y} = (\boldsymbol{y} - \hat{\boldsymbol{\mu}})^T(\boldsymbol{y} - \hat{\boldsymbol{\mu}}) = \sum_{i=1}^{n}(y_i - \hat{\mu}_i)^2.$$

In summary, an unbiased estimator of the error variance $\sigma^2$ in a linear model with full-rank model matrix is

$$s^2 = \frac{\sum_{i=1}^{n}(y_i - \hat{\mu}_i)^2}{n - p},$$

an average of the squared residuals. Here, the average is taken with respect to the dimension of the error space in which these residual components reside. When $\boldsymbol{X}$ has less than full rank, $r < p$, the same argument holds with $\text{trace}(\boldsymbol{P}_X) = r$; then $s^2$ has denominator $n - r$. The estimate $s^2$ is called$^9$ the error mean square, where error = residual, or the residual mean square.

For example, for the null model (Section 2.3.1), the numerator of $s^2$ is $\sum_{i=1}^{n}(y_i - \bar{y})^2$ and the rank of $\boldsymbol{X} = \boldsymbol{1}_n$ is 1. An unbiased estimator of $\sigma^2$ is

$$s^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 1}.$$

This is the sample variance and the usual estimator of the marginal variance of $y$.

There is nothing special about using an unbiased estimator. In fact $s$, which is on a more helpful scale for interpreting variability, is biased. However, $s^2$ occurs naturally in distribution theory for the ordinary linear model, as we will see in the next chapter.

The denominator $(n - p)$ of the estimator occurs as a degrees-of-freedom measure in sampling distributions of relevant statistics.
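
As a numerical check of this estimator, the following minimal NumPy sketch (simulated data; the variable names are invented for illustration) fits a linear model by least squares, computes the error mean square $s^2 = \text{SSE}/(n-p)$, and verifies that $\text{trace}(\boldsymbol{I} - \boldsymbol{P}_X) = n - p$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 3, 2.0                       # n observations, p model parameters

# Model matrix: intercept column plus two explanatory variables
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta = np.array([1.0, 0.5, -0.3])
y = X @ beta + rng.normal(scale=sigma, size=n)  # y = X beta + eps, var(eps) = sigma^2 I

# Least squares fit: mu_hat = P_X y
mu_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Error mean square s^2 = SSE / (n - p), the unbiased estimator derived above
sse = np.sum((y - mu_hat) ** 2)
s2 = sse / (n - p)

# trace(I - P_X) = n - p, matching the degrees of freedom in the denominator
P = X @ np.linalg.inv(X.T @ X) @ X.T
print(s2, sigma**2, np.trace(np.eye(n) - P))    # s2 near 4.0; trace equals 197
```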

$^9$ Not to be confused with the "mean squared error," which is $E(\hat{\theta} - \theta)^2$ for an estimator $\hat{\theta}$ of a parameter $\theta$.

2.4.2 Sums of Squares: Error (SSE) and Regression (SSR)

The sum of squares $\sum_i(y_i - \hat{\mu}_i)^2$ in the numerator of $s^2$ is abbreviated by SSE, for "sum of squared errors." It is also referred to as the residual sum of squares.

The orthogonal decomposition of the data, $\boldsymbol{y} = \boldsymbol{P}_X\boldsymbol{y} + (\boldsymbol{I} - \boldsymbol{P}_X)\boldsymbol{y}$, expresses observation $i$ as $y_i = \hat{\mu}_i + (y_i - \hat{\mu}_i)$. Correcting for the sample mean,

$$(y_i - \bar{y}) = (\hat{\mu}_i - \bar{y}) + (y_i - \hat{\mu}_i).$$

Using $(y_i - \bar{y})$ as the observation corresponds to adjusting $y_i$ by including an intercept term before investigating effects of the explanatory variables. (For the null model $E(y_i) = \beta$, Section 2.3.1 showed that $\hat{\mu}_i = \bar{y}$.) This orthogonal decomposition into the component in the model space and the component in the error space yields the sum-of-squares decomposition:

$$\sum_i(y_i - \bar{y})^2 = \sum_i(\hat{\mu}_i - \bar{y})^2 + \sum_i(y_i - \hat{\mu}_i)^2.$$

We abbreviate this decomposition as

TSS = SSR + SSE,

for the (corrected) total sum of squares TSS, the sum of squares due to the regression model SSR, and the sum of squared errors SSE. Here, TSS summarizes the total variation in the data after fitting the model containing only an intercept. The SSE component represents the variation in $y$ "unexplained" by the full model, that is, a summary of prediction error remaining after fitting that model. The SSR component represents the variation in $y$ "explained" by the full model, that is, the reduction in variation from TSS to SSE resulting from adding explanatory variables to a model that contains only an intercept term. For short, we will refer to SSR as the regression sum of squares. It is also called the model sum of squares.
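
The decomposition can be verified numerically for any least squares fit that includes an intercept; here is a small sketch with simulated data (names invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)

mu_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]

tss = np.sum((y - y.mean()) ** 2)           # total SS, corrected for the mean
ssr = np.sum((mu_hat - y.mean()) ** 2)      # regression (model) SS
sse = np.sum((y - mu_hat) ** 2)             # error (residual) SS
print(np.isclose(tss, ssr + sse))           # True: TSS = SSR + SSE (model has an intercept)
```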

We illustrate with the model for the one-way layout. From Section 2.3.3, TSS partitions into a between-groups SS $= \sum_{i=1}^{c} n_i(\bar{y}_i - \bar{y})^2$ and a within-groups SS $= \sum_{i=1}^{c}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2$. The between-groups SS is the SSR for the model, representing variability explained by adding the indicator predictors to the model. Since the fitted value corresponding to observation $y_{ij}$ is $\hat{\mu}_{ij} = \bar{y}_i$, the within-groups SS is the SSE for the model. For the model for the two-way layout in Section 2.3.4, SSR is the sum of the SS for the treatment effects and the SS for the block effects.

2.4.3 Effect on SSR and SSE of Adding Explanatory Variables

The least squares fit minimizes SSE. When we add an explanatory variable to a model, SSE cannot increase, because we could (at worst) obtain the same SSE value by setting $\hat{\beta}_j = 0$ for the new variable. So, SSE is monotone decreasing as the set of explanatory variables grows. Since TSS depends only on $\{y_i\}$ and is identical for every model fitted to a particular dataset, SSR $=$ TSS $-$ SSE is monotone increasing as variables are added.

Let SSR$(x_1, x_2)$ denote the regression sum of squares for a model with two explanatory variables, and let SSR$(x_1)$ and SSR$(x_2)$ denote it for the two models having only one of those explanatory variables (plus, in each case, the intercept). We can partition SSR$(x_1, x_2)$ into SSR$(x_1)$ and the additional variability explained by adding $x_2$ to the model. Denote that additional variability explained by $x_2$, adjusting for $x_1$, by SSR$(x_2 \mid x_1)$. That is,

$$\text{SSR}(x_1, x_2) = \text{SSR}(x_1) + \text{SSR}(x_2 \mid x_1).$$

Equivalently, SSR$(x_2 \mid x_1)$ is the decrease in SSE from adding $x_2$ to the model.

Let $\{\hat{\mu}_{i1}\}$ denote the fitted values when $x_1$ is the sole explanatory variable, and let $\{\hat{\mu}_{i12}\}$ denote the fitted values when both $x_1$ and $x_2$ are explanatory variables. Then, from the orthogonal decomposition $(\hat{\mu}_{i12} - \bar{y}) = (\hat{\mu}_{i1} - \bar{y}) + (\hat{\mu}_{i12} - \hat{\mu}_{i1})$,

$$\text{SSR}(x_2 \mid x_1) = \sum_{i=1}^{n}(\hat{\mu}_{i12} - \hat{\mu}_{i1})^2.$$

To show that this application of Pythagoras's theorem holds, we need to show that $\sum_i(\hat{\mu}_{i1} - \bar{y})(\hat{\mu}_{i12} - \hat{\mu}_{i1}) = 0$. But denoting the projection matrices by $\boldsymbol{P}_0$ for the model containing only an intercept, $\boldsymbol{P}_1$ for the model that also has $x_1$ as an explanatory variable, and $\boldsymbol{P}_{12}$ for the model that has $x_1$ and $x_2$ as explanatory variables, this sum is

$$(\boldsymbol{P}_1\boldsymbol{y} - \boldsymbol{P}_0\boldsymbol{y})^T(\boldsymbol{P}_{12}\boldsymbol{y} - \boldsymbol{P}_1\boldsymbol{y}) = \boldsymbol{y}^T(\boldsymbol{P}_1 - \boldsymbol{P}_0)(\boldsymbol{P}_{12} - \boldsymbol{P}_1)\boldsymbol{y}.$$

Since $\boldsymbol{P}_a\boldsymbol{P}_b = \boldsymbol{P}_a$ when model $a$ is a special case of model $b$, $(\boldsymbol{P}_1 - \boldsymbol{P}_0)(\boldsymbol{P}_{12} - \boldsymbol{P}_1) = \boldsymbol{0}$, so $\boldsymbol{y}^T(\boldsymbol{P}_1 - \boldsymbol{P}_0)(\boldsymbol{P}_{12} - \boldsymbol{P}_1)\boldsymbol{y} = 0$. This also follows from the result about decompositions of $\boldsymbol{I}$ into sums of projection matrices stated at the end of Section 2.2.1, whereby projection matrices that sum to $\boldsymbol{I}$ have pairwise products of $\boldsymbol{0}$. Here, $\boldsymbol{I} = \boldsymbol{P}_0 + (\boldsymbol{P}_1 - \boldsymbol{P}_0) + (\boldsymbol{P}_{12} - \boldsymbol{P}_1) + (\boldsymbol{I} - \boldsymbol{P}_{12})$.
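
The identity can also be checked directly by fitting the two nested models. In this sketch (invented data; helper names are mine), SSR$(x_2 \mid x_1)$ is computed both as the drop in SSE and as $\sum_i(\hat{\mu}_{i12} - \hat{\mu}_{i1})^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)             # predictors are correlated
y = 1.0 + 0.8 * x1 + 0.5 * x2 + rng.normal(size=n)

def fitted(X):
    """Fitted values mu_hat = P_X y for model matrix X."""
    return X @ np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
mu1 = fitted(np.column_stack([ones, x1]))       # model with intercept and x1
mu12 = fitted(np.column_stack([ones, x1, x2]))  # model with intercept, x1, and x2

sse1 = np.sum((y - mu1) ** 2)
sse12 = np.sum((y - mu12) ** 2)

# SSR(x2 | x1) as the drop in SSE, compared with the sum of squared fitted-value changes
ssr_x2_given_x1 = sse1 - sse12
print(np.isclose(ssr_x2_given_x1, np.sum((mu12 - mu1) ** 2)))   # True
```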

2.4.4 Sequential and Partial Sums of Squares

Next we consider the general case with $p$ explanatory variables, $x_1, x_2, \ldots, x_p$, and an intercept or centered value of $y$. From entering these variables in sequence into the model, we obtain the regression sum of squares and successive increments to it,

$$\text{SSR}(x_1),\ \text{SSR}(x_2 \mid x_1),\ \text{SSR}(x_3 \mid x_1, x_2),\ \ldots,\ \text{SSR}(x_p \mid x_1, x_2, \ldots, x_{p-1}).$$

These components are referred to as sequential sums of squares. They sum to the regression sum of squares for the full model, denoted by SSR$(x_1, \ldots, x_p)$. The sequential sum of squares corresponding to adding a term to the model can depend strongly on which other variables are already in the model, because of correlations among the predictors. For example, SSR$(x_p)$ often tends to be much larger than SSR$(x_p \mid x_1, \ldots, x_{p-1})$ when $x_p$ is highly correlated with the other predictors, as happens in many observational studies. We discuss this further in Section 4.6.5.

An alternative set$^{10}$ of increments to regression sums of squares, called partial sums of squares, uses the same set of $p$ explanatory variables for each:

$$\text{SSR}(x_1 \mid x_2, \ldots, x_p),\ \text{SSR}(x_2 \mid x_1, \ldots, x_p),\ \ldots,\ \text{SSR}(x_p \mid x_1, \ldots, x_{p-1}).$$

Each of these represents the additional variability explained by adding a particular explanatory variable to the model, when all the other explanatory variables are already in the model. Equivalently, it is the drop in SSE when that explanatory variable is added after all the others. These partial SS values may differ from all the corresponding sequential SS values SSR$(x_1)$, SSR$(x_2 \mid x_1)$, $\ldots$, SSR$(x_p \mid x_1, x_2, \ldots, x_{p-1})$, except for the final one.
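
Both sets of increments can be assembled from SSE values of nested fits. The following sketch (three correlated simulated predictors; all names hypothetical) computes sequential and partial SS this way and shows that they differ except for the final component when the predictors are correlated.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 120
# Three correlated predictors, stored as columns of Z
Z = rng.normal(size=(n, 3)) @ np.array([[1.0, 0.5, 0.3],
                                        [0.0, 1.0, 0.4],
                                        [0.0, 0.0, 1.0]])
y = 1.0 + Z @ np.array([0.7, -0.4, 0.2]) + rng.normal(size=n)

def sse(cols):
    """SSE for the model with an intercept plus the listed columns of Z."""
    X = np.column_stack([np.ones(n)] + [Z[:, j] for j in cols])
    mu = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - mu) ** 2)

tss = np.sum((y - y.mean()) ** 2)

# Sequential (Type 1) SS: enter x1, then x2, then x3
seq = [tss - sse([0]), sse([0]) - sse([0, 1]), sse([0, 1]) - sse([0, 1, 2])]

# Partial (Type 3) SS: each variable added last, adjusting for all the others
part = [sse([1, 2]) - sse([0, 1, 2]),
        sse([0, 2]) - sse([0, 1, 2]),
        sse([0, 1]) - sse([0, 1, 2])]

print(seq)
print(part)                                          # differ except in the final component
print(np.isclose(sum(seq), tss - sse([0, 1, 2])))    # sequential SS sum to SSR(x1, x2, x3)
```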

2.4.5 Uncorrelated Predictors: Sequential SS = Partial SS = SSR Component

We have seen that the "data = fit + residuals" orthogonal decomposition $\boldsymbol{y} = \boldsymbol{P}_X\boldsymbol{y} + (\boldsymbol{I} - \boldsymbol{P}_X)\boldsymbol{y}$ implies the corresponding SS decomposition, $\boldsymbol{y}^T\boldsymbol{y} = \boldsymbol{y}^T\boldsymbol{P}_X\boldsymbol{y} + \boldsymbol{y}^T(\boldsymbol{I} - \boldsymbol{P}_X)\boldsymbol{y}$. When the values of $y$ are centered, this is TSS = SSR + SSE. Now, suppose the $p$ parameters are orthogonal (Section 2.2.4). Then $\boldsymbol{X}^T\boldsymbol{X}$ and its inverse are diagonal. With the model matrix partitioned into $\boldsymbol{X} = (\boldsymbol{X}_1 : \boldsymbol{X}_2 : \cdots : \boldsymbol{X}_p)$,

$$\boldsymbol{P}_X = \boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T = \boldsymbol{X}_1(\boldsymbol{X}_1^T\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^T + \cdots + \boldsymbol{X}_p(\boldsymbol{X}_p^T\boldsymbol{X}_p)^{-1}\boldsymbol{X}_p^T.$$

In terms of the projection matrices for separate models, each with only a single explanatory variable, this is $\boldsymbol{P}_{X_1} + \cdots + \boldsymbol{P}_{X_p}$. Therefore,

$$\boldsymbol{y}^T\boldsymbol{y} = \boldsymbol{y}^T\boldsymbol{P}_{X_1}\boldsymbol{y} + \cdots + \boldsymbol{y}^T\boldsymbol{P}_{X_p}\boldsymbol{y} + \boldsymbol{y}^T(\boldsymbol{I} - \boldsymbol{P}_X)\boldsymbol{y}.$$

Each component of SSR equals the SSR for the model with that sole explanatory variable, so that

$$\text{SSR}(x_1, \ldots, x_p) = \text{SSR}(x_1) + \text{SSR}(x_2) + \cdots + \text{SSR}(x_p). \qquad (2.8)$$

When $\boldsymbol{X}_1 = \boldsymbol{1}_n$ is the column for an intercept term, SSR$(x_1) = n\bar{y}^2$ and TSS $= \boldsymbol{y}^T\boldsymbol{y} - \text{SSR}(x_1)$ for the uncentered $\boldsymbol{y}$. The sum of squares that software reports as SSR is then SSR$(x_2) + \cdots + \text{SSR}(x_p)$. Also, with an intercept in the model, orthogonality of the parameters implies that pairs of explanatory variables are uncorrelated (Exercise 2.20).

$^{10}$ Alternative names are Type 1 SS for sequential SS and Type 3 SS for partial SS. Type 2 SS is an alternative partial SS that adjusts only for effects not containing the given effect, such as adjusting $x_1$ for $x_2$ but not for $x_1x_2$ when that interaction term is also in the model.

When the explanatory variables in a linear model are uncorrelated, the sequential SS values do not depend on their order of entry into a model. They are then identical to the corresponding partial SS values, and the regression SS decomposes exactly in terms of them. We would not expect this in observational studies, but some balanced experimental designs have such simplicity.

For instance, consider the main-effects model for the two-way layout with two binary qualitative factors and an equal sample size $n$ in each cell,

$$E(y_{ijk}) = \beta_0 + \beta_i + \gamma_j,$$

for $i = 1, 2$, $j = 1, 2$, and $k = 1, \ldots, n$. With constraints $\beta_1 + \beta_2 = 0$ and $\gamma_1 + \gamma_2 = 0$ for identifiability, and with $\boldsymbol{y}$ listing $(i, j)$ in the order $(1,1), (1,2), (2,1), (2,2)$, we can express the model as $E(\boldsymbol{y}) = \boldsymbol{X}\boldsymbol{\beta}$ with

$$\boldsymbol{X}\boldsymbol{\beta} = \begin{pmatrix} \boldsymbol{1}_n & \boldsymbol{1}_n & \boldsymbol{1}_n \\ \boldsymbol{1}_n & \boldsymbol{1}_n & -\boldsymbol{1}_n \\ \boldsymbol{1}_n & -\boldsymbol{1}_n & \boldsymbol{1}_n \\ \boldsymbol{1}_n & -\boldsymbol{1}_n & -\boldsymbol{1}_n \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \gamma_1 \end{pmatrix}.$$

The scatterplot for the two indicator explanatory variables has $n$ observations at each of the points $(-1, -1)$, $(-1, 1)$, $(1, -1)$, and $(1, 1)$. Thus, those explanatory variables are uncorrelated (and orthogonal), and SSR decomposes into its separate parts for the row effects and for the column effects.
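
A numerical sketch of this balanced $2 \times 2$ layout (hypothetical effect sizes and names) confirms that the sum-to-zero indicator columns are orthogonal and that SSR splits exactly into the two single-variable SSR components, as in (2.8).

```python
import numpy as np

rng = np.random.default_rng(5)
n_cell = 25
# Sum-to-zero coded indicators for cells listed in the order (1,1), (1,2), (2,1), (2,2)
row = np.repeat([1.0, 1.0, -1.0, -1.0], n_cell)
col = np.repeat([1.0, -1.0, 1.0, -1.0], n_cell)
y = 3.0 + 0.8 * row - 0.5 * col + rng.normal(size=4 * n_cell)

def ssr(X):
    """Regression SS about the mean for a model matrix X that includes an intercept."""
    mu = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((mu - y.mean()) ** 2)

ones = np.ones_like(y)
print(row @ col)                                     # 0.0: the indicator columns are orthogonal
ssr_full = ssr(np.column_stack([ones, row, col]))
ssr_row = ssr(np.column_stack([ones, row]))
ssr_col = ssr(np.column_stack([ones, col]))
print(np.isclose(ssr_full, ssr_row + ssr_col))       # True: SSR decomposes as in (2.8)
```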

2.4.6 R-Squared and the Multiple Correlation

For a particular dataset and TSS value, the larger the value of SSR relative to SSE, the more effective the explanatory variables are in predicting the response variable.

A summary of this predictive power is

$$R^2 = \frac{\text{SSR}}{\text{TSS}} = \frac{\text{TSS} - \text{SSE}}{\text{TSS}} = \frac{\sum_i(y_i - \bar{y})^2 - \sum_i(y_i - \hat{\mu}_i)^2}{\sum_i(y_i - \bar{y})^2}.$$

Here SSR = TSS $-$ SSE measures the reduction in the sum of squared prediction errors after adding the explanatory variables to the model containing only an intercept. So $R^2$ measures the proportional reduction in error, and it falls between 0 and 1. Sometimes called the coefficient of determination, it is usually merely referred to as "R-squared."

A related way to measure predictive power is with the sample correlation between the $\{y_i\}$ and $\{\hat{\mu}_i\}$ values. From (2.1), the normal equations solved to find the least squares estimates are $\sum_i y_i x_{ij} = \sum_i \hat{\mu}_i x_{ij}$, $j = 1, \ldots, p$. The equation corresponding to the intercept term is $\sum_i y_i = \sum_i \hat{\mu}_i$, so the sample mean of $\{\hat{\mu}_i\}$ equals $\bar{y}$. Therefore, the sample value of

$$\text{corr}(\boldsymbol{y}, \hat{\boldsymbol{\mu}}) = \frac{\sum_i(y_i - \bar{y})(\hat{\mu}_i - \bar{\hat{\mu}})}{\sqrt{\left[\sum_i(y_i - \bar{y})^2\right]\left[\sum_i(\hat{\mu}_i - \bar{\hat{\mu}})^2\right]}} = \frac{\sum_i(y_i - \hat{\mu}_i + \hat{\mu}_i - \bar{y})(\hat{\mu}_i - \bar{y})}{\sqrt{\left[\sum_i(y_i - \bar{y})^2\right]\left[\sum_i(\hat{\mu}_i - \bar{y})^2\right]}}.$$

The numerator simplifies to $\sum_i(\hat{\mu}_i - \bar{y})^2 = \text{SSR}$, since $\sum_i(y_i - \hat{\mu}_i)\hat{\mu}_i = 0$ by the orthogonality of $(\boldsymbol{y} - \hat{\boldsymbol{\mu}})$ and $\hat{\boldsymbol{\mu}}$ (and $\sum_i(y_i - \hat{\mu}_i) = 0$ from the intercept equation), and the denominator equals $\sqrt{(\text{TSS})(\text{SSR})}$. So, $\text{corr}(\boldsymbol{y}, \hat{\boldsymbol{\mu}}) = \sqrt{\text{SSR}/\text{TSS}} = +\sqrt{R^2}$. This positive square root of $R^2$ is called the multiple correlation. Note that $0 \le R \le 1$. With a single explanatory variable, $R = |\text{corr}(x, y)|$.

Out of all possible linear prediction equations $\tilde{\boldsymbol{\mu}} = \boldsymbol{X}\tilde{\boldsymbol{\beta}}$ that use the given model matrix, the least squares solution $\hat{\boldsymbol{\mu}}$ has the maximum correlation with $\boldsymbol{y}$. To ease notation as we show this, we suppose that all variables have been centered, which does not affect correlations. For an arbitrary $\tilde{\boldsymbol{\beta}}$ and constant $c$, for the least squares fit,

$$\|\boldsymbol{y} - \hat{\boldsymbol{\mu}}\|^2 \le \|\boldsymbol{y} - c\tilde{\boldsymbol{\mu}}\|^2.$$

Expanding both sides, subtracting the common term $\|\boldsymbol{y}\|^2$, and dividing by a common denominator yields

$$\frac{2\boldsymbol{y}^T\hat{\boldsymbol{\mu}}}{\|\boldsymbol{y}\|\|\hat{\boldsymbol{\mu}}\|} - \frac{\|\hat{\boldsymbol{\mu}}\|^2}{\|\boldsymbol{y}\|\|\hat{\boldsymbol{\mu}}\|} \ge \frac{2c\boldsymbol{y}^T\tilde{\boldsymbol{\mu}}}{\|\boldsymbol{y}\|\|\hat{\boldsymbol{\mu}}\|} - \frac{c^2\|\tilde{\boldsymbol{\mu}}\|^2}{\|\boldsymbol{y}\|\|\hat{\boldsymbol{\mu}}\|}.$$

Now, taking $c^2 = \|\hat{\boldsymbol{\mu}}\|^2 / \|\tilde{\boldsymbol{\mu}}\|^2$, we have

$$\frac{\boldsymbol{y}^T\hat{\boldsymbol{\mu}}}{\|\boldsymbol{y}\|\|\hat{\boldsymbol{\mu}}\|} \ge \frac{\boldsymbol{y}^T\tilde{\boldsymbol{\mu}}}{\|\boldsymbol{y}\|\|\tilde{\boldsymbol{\mu}}\|}.$$

But since the variables are centered, this says that $R = \text{corr}(\boldsymbol{y}, \hat{\boldsymbol{\mu}}) \ge \text{corr}(\boldsymbol{y}, \tilde{\boldsymbol{\mu}})$.

When explanatory variables are added to a model, since SSE cannot increase, $R$ and $R^2$ are monotone increasing. For a model matrix $\boldsymbol{X}$, let $\boldsymbol{x}_j$ denote column $j$ for explanatory variable $j$. For the special case in which the sample $\text{corr}(\boldsymbol{x}_j, \boldsymbol{x}_k) = 0$ for each pair of the $p$ explanatory variables, by the decomposition (2.8) of SSR$(x_1, \ldots, x_p)$,

$$R^2 = [\text{corr}(\boldsymbol{y}, \boldsymbol{x}_1)]^2 + [\text{corr}(\boldsymbol{y}, \boldsymbol{x}_2)]^2 + \cdots + [\text{corr}(\boldsymbol{y}, \boldsymbol{x}_p)]^2.$$

When $n$ is small and a model has several explanatory variables, $R^2$ tends to overestimate the corresponding population value. An adjusted R-squared is designed to reduce this bias. It is defined to be the proportional reduction in variance based on the unbiased variance estimates, $s_y^2$ for the marginal distribution and $s^2$ for the conditional distributions; that is,

$$\text{adjusted } R^2 = \frac{s_y^2 - s^2}{s_y^2} = 1 - \frac{\text{SSE}/(n - p)}{\text{TSS}/(n - 1)} = 1 - \frac{n - 1}{n - p}(1 - R^2).$$

It is slightly smaller than ordinary $R^2$, and it need not monotonically increase as we add explanatory variables to a model.
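
A small sketch (made-up data; helper names hypothetical) illustrates this behavior: adding a predictor that is pure noise cannot decrease $R^2$, but it can decrease the adjusted version.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 30
x1 = rng.normal(size=n)
x_noise = rng.normal(size=n)                 # predictor unrelated to y
y = 1.0 + 0.5 * x1 + rng.normal(size=n)

def r2_and_adjusted(X):
    """Return (R^2, adjusted R^2) for model matrix X with p columns."""
    p = X.shape[1]
    mu = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    sse = np.sum((y - mu) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - sse / tss
    adjusted = 1 - (n - 1) / (n - p) * (1 - r2)
    return r2, adjusted

ones = np.ones(n)
print(r2_and_adjusted(np.column_stack([ones, x1])))
print(r2_and_adjusted(np.column_stack([ones, x1, x_noise])))  # R^2 rises; adjusted R^2 may fall
```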
