We learn more from constructing confidence intervals for parameter values than from significance testing. A confidence interval shows us the entire range of plausible values for a parameter, rather than focusing merely on whether a particular value is plausible.
3.3.1 Confidence Interval for a Parameter of a Normal Linear Model
To construct a confidence interval for a parameter $\beta_j$ in a normal linear model, we construct and then invert a $t$ test of $H_0: \beta_j = \beta_{j0}$ about potential values for $\beta_j$. The test statistic is
$$t = \frac{\hat{\beta}_j - \beta_{j0}}{SE_j},$$
the number of standard errors that $\hat{\beta}_j$ falls from $\beta_{j0}$. Recall that $SE_j$ is the square root of the element in row $j$ and column $j$ of the estimated covariance matrix $s^2(X^TX)^{-1}$ of $\hat{\boldsymbol{\beta}}$, where $s^2$ is the error mean square. Just as the residuals are orthogonal to the model space, the residuals are uncorrelated with $\hat{\boldsymbol{\beta}}$. Specifically, the $p \times n$ covariance matrix
$$\mathrm{cov}(\hat{\boldsymbol{\beta}},\, y - \hat{\boldsymbol{\mu}}) = \mathrm{cov}\bigl[(X^TX)^{-1}X^Ty,\, (I - H)y\bigr] = (X^TX)^{-1}X^T\,\sigma^2 I\,(I - H)^T,$$
and this is $\mathbf{0}$ because $HX = X(X^TX)^{-1}X^TX = X$. Being linear functions of $y$, $\hat{\boldsymbol{\beta}}$ and $(y - \hat{\boldsymbol{\mu}})$ are jointly normally distributed, so uncorrelatedness implies independence.
Since $s^2$ is a function of the residuals, $\hat{\boldsymbol{\beta}}$ and $s^2$ are independent, and so are the numerator and denominator of the $t$ statistic, as is required to obtain a $t$ distribution.
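As a quick numerical check of this orthogonality argument, the following Python sketch (simulated data, numpy only; the model matrix and seed are our assumptions, not from the text) verifies that $HX = X$, which is what makes the covariance above vanish:

```python
import numpy as np

# Minimal check that HX = X, the identity that makes cov(beta_hat, y - mu_hat) = 0
rng = np.random.default_rng(seed=0)
X = np.column_stack([np.ones(8), rng.normal(size=8)])   # 8 x 2 model matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T                    # hat matrix
print(np.allclose(H @ X, X))                            # True: HX = X
# (X'X)^{-1} X' (I - H) is the matrix multiplying sigma^2 I in the covariance
print(np.max(np.abs(np.linalg.inv(X.T @ X) @ X.T @ (np.eye(8) - H))))  # ~ 0
```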
The $100(1-\alpha)\%$ confidence interval for $\beta_j$ is the set of all $\beta_{j0}$ values for which the test has $P$-value $> \alpha$, that is, for which $|t| < t_{\alpha/2,\,n-p}$, where $t_{\alpha/2,\,n-p}$ is the $1 - \alpha/2$ quantile of the $t$ distribution having $df = n - p$. For example, the 95% confidence interval is
$$\hat{\beta}_j \pm t_{0.025,\,n-p}(SE_j).$$
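To make the computation concrete, here is a minimal Python sketch (simulated data; the true model, seed, and variable names are our choices for illustration) that forms these intervals directly from the formulas above:

```python
import numpy as np
from scipy import stats

# Simulated data for a normal linear model with intercept and one predictor
rng = np.random.default_rng(seed=1)
n = 30
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])            # n x p model matrix
y = 2.0 + 0.5 * x + rng.normal(0, 1.5, size=n)

n, p = X.shape
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                    # least squares estimates
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)                    # error mean square s^2
SE = np.sqrt(s2 * np.diag(XtX_inv))             # standard errors SE_j

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)   # t_{alpha/2, n-p}
ci = np.column_stack([beta_hat - t_crit * SE, beta_hat + t_crit * SE])
print(ci)                                       # row j: 95% CI for beta_j
```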
3.3.2 Confidence Interval for $E(y) = x_0\boldsymbol{\beta}$
At a fixed setting $x_0$ (a row vector) for the explanatory variables, we can construct a confidence interval for $E(y) = x_0\boldsymbol{\beta}$. We do this by constructing and then inverting a $t$ test about values for that linear predictor.
Let $\hat{\mu} = x_0\hat{\boldsymbol{\beta}}$. Now
$$\mathrm{var}(\hat{\mu}) = \mathrm{var}(x_0\hat{\boldsymbol{\beta}}) = x_0\,\mathrm{var}(\hat{\boldsymbol{\beta}})\,x_0^T = \sigma^2 x_0(X^TX)^{-1}x_0^T.$$
Since $x_0\hat{\boldsymbol{\beta}}$ is a linear function of $y$, it has a normal distribution. Thus,
$$z = \frac{x_0\hat{\boldsymbol{\beta}} - x_0\boldsymbol{\beta}}{\sigma\sqrt{x_0(X^TX)^{-1}x_0^T}} \sim N(0, 1),$$
and
$$t = \frac{x_0\hat{\boldsymbol{\beta}} - x_0\boldsymbol{\beta}}{s\sqrt{x_0(X^TX)^{-1}x_0^T}} = \frac{x_0\hat{\boldsymbol{\beta}} - x_0\boldsymbol{\beta}}{\sigma\sqrt{x_0(X^TX)^{-1}x_0^T}} \Big/ \sqrt{\frac{s^2}{\sigma^2}} \;\sim\; t_{n-p}.$$
This last result follows because $(n-p)s^2/\sigma^2$ has a $\chi^2_{n-p}$ distribution for a normal linear model, by Cochran's theorem, so the $t$ statistic is a $N(0, 1)$ variate divided by the square root of the ratio of a $\chi^2_{n-p}$ variate to its $df$ value. Also, since $s^2$ and $\hat{\boldsymbol{\beta}}$ are independent, so are the numerator and denominator of the $t$ statistic. It follows that a $100(1-\alpha)\%$ confidence interval for $E(y) = x_0\boldsymbol{\beta}$ is
$$x_0\hat{\boldsymbol{\beta}} \pm t_{\alpha/2,\,n-p}\, s\sqrt{x_0(X^TX)^{-1}x_0^T}. \qquad (3.2)$$
When $x_0$ is the explanatory variable value $x_i$ for a particular observation, the term under the square root is the leverage $h_{ii}$ from the model's hat matrix.
The construction for this interval extends directly to confidence intervals for linear combinations $\boldsymbol{\ell}\boldsymbol{\beta}$. An example is a contrast of the parameters, such as $\beta_j - \beta_k$ for a pair of levels of a factor.
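The interval (3.2) is straightforward to compute. The following sketch packages it as a function (a hypothetical helper of our own, under the same simulated-data assumptions as the earlier sketch):

```python
import numpy as np
from scipy import stats

def mean_ci(X, y, x0, alpha=0.05):
    """Confidence interval (3.2) for E(y) = x0 beta at row vector x0."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    s = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - p))  # error SD estimate
    mu_hat = x0 @ beta_hat
    half = stats.t.ppf(1 - alpha / 2, n - p) * s * np.sqrt(x0 @ XtX_inv @ x0)
    return mu_hat - half, mu_hat + half

# Example call, reusing X and y from the earlier sketch:
# mean_ci(X, y, x0=np.array([1.0, 5.0]))   # 95% CI for E(y) at x = 5
```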
3.3.3 Prediction Interval for a Future $y$
At a particular value $x_0$, how can we form an interval that is very likely to contain a future observation $y$ at that value? This is more challenging than forming a confidence interval for the expected response. With lots of data, we can make precise inference about the mean, but not precise prediction about a single future observation.
The normal linear model states that a future value $y$ satisfies $y = x_0\boldsymbol{\beta} + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$.
From the fit of the model, the prediction of the future $y$ value is $\hat{\mu} = x_0\hat{\boldsymbol{\beta}}$. Now the future $y$ also satisfies
$$y = x_0\hat{\boldsymbol{\beta}} + e, \quad \text{where} \quad e = y - \hat{\mu}$$
is the residual for that observation. Since the future $y$ is independent of the observations $y_1, \ldots, y_n$ used to determine $\hat{\boldsymbol{\beta}}$ and then $\hat{\mu}$,
$$\mathrm{var}(e) = \mathrm{var}(y - \hat{\mu}) = \mathrm{var}(y) + \mathrm{var}(\hat{\mu}) = \sigma^2\bigl[1 + x_0(X^TX)^{-1}x_0^T\bigr].$$
It follows that
$$\frac{y - \hat{\mu}}{\sigma\sqrt{1 + x_0(X^TX)^{-1}x_0^T}} \sim N(0, 1) \quad \text{and} \quad \frac{y - \hat{\mu}}{s\sqrt{1 + x_0(X^TX)^{-1}x_0^T}} \sim t_{n-p}.$$
Inverting this yields a $100(1-\alpha)\%$ prediction interval for the future $y$ observation,
$$\hat{\mu} \pm t_{\alpha/2,\,n-p}\, s\sqrt{1 + x_0(X^TX)^{-1}x_0^T}. \qquad (3.3)$$
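A corresponding sketch for (3.3); the only change from the confidence-interval computation is the extra 1 under the square root (again a hypothetical helper of ours, not code from the text):

```python
import numpy as np
from scipy import stats

def prediction_interval(X, y, x0, alpha=0.05):
    """Prediction interval (3.3) for a future y at row vector x0."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    s = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - p))
    mu_hat = x0 @ beta_hat
    # Note the "1.0 +" that distinguishes prediction from estimating the mean
    half = stats.t.ppf(1 - alpha / 2, n - p) * s * np.sqrt(1.0 + x0 @ XtX_inv @ x0)
    return mu_hat - half, mu_hat + half
```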
3.3.4 Example: Confidence Interval and Prediction Interval for Simple Linear Regression

We illustrate the confidence interval for the mean and the prediction interval for a future observation with the bivariate linear model
$$E(y_i) = \beta_0 + \beta_1 x_i.$$
It is simpler to use the explanatory variable in centered form $x_i^* = x_i - \bar{x}$, which (from Section 2.1.3) results in uncorrelated $\hat{\beta}_0$ and $\hat{\beta}_1$. For the centered predictor values, $\hat{\beta}_0$ changes value to $\bar{y}$, but $\hat{\beta}_1$ and $\mathrm{var}(\hat{\beta}_1) = \sigma^2/\bigl[\sum_i (x_i - \bar{x})^2\bigr]$ do not change. So, at a particular value $x_0$ for $x$,
$$\mathrm{var}(\hat{\mu}) = \mathrm{var}\bigl[\hat{\beta}_0 + \hat{\beta}_1(x_0 - \bar{x})\bigr] = \mathrm{var}(\bar{y}) + (x_0 - \bar{x})^2\,\mathrm{var}(\hat{\beta}_1) = \sigma^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right].$$
For a future observation $y$ and its independent prediction $\hat{\mu}$,
$$\mathrm{var}(y - \hat{\mu}) = \sigma^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right].$$
The variances are smallest at $x_0 = \bar{x}$ and increase in a symmetric quadratic manner as $x_0$ moves away from $\bar{x}$. At $x_0 = \bar{x}$, we see that $\mathrm{var}(\hat{\mu}) = \mathrm{var}(\bar{y}) = \sigma^2/n$, whereas $\mathrm{var}(y - \hat{\mu}) = \sigma^2(1 + 1/n)$. As $n$ increases, $\mathrm{var}(\hat{\mu})$ decreases toward 0, but $\mathrm{var}(y - \hat{\mu})$ has $\sigma^2$ as its lower bound. Even if we can estimate the regression line nearly perfectly, we are limited in how accurately we can predict any single future observation.
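The following sketch evaluates both variance formulas on simulated data (the true line, noise level, and seed are our illustrative choices) and shows that both half-widths are smallest near $x_0 = \bar{x}$ and that the prediction half-width always dominates:

```python
import numpy as np
from scipy import stats

# Simulated simple linear regression data
rng = np.random.default_rng(seed=2)
n = 25
x = rng.uniform(0, 10, n)
y = 1.0 + 0.8 * x + rng.normal(0, 1.0, n)

xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx    # least squares slope
b0 = y.mean() - b1 * xbar
s2 = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)     # error mean square (p = 2)
t_crit = stats.t.ppf(0.975, n - 2)

x0 = np.linspace(0, 10, 5)                         # settings at which to form intervals
se_mean = np.sqrt(s2 * (1 / n + (x0 - xbar) ** 2 / Sxx))       # est. SD of mu_hat
se_pred = np.sqrt(s2 * (1 + 1 / n + (x0 - xbar) ** 2 / Sxx))   # est. SD of y - mu_hat
for x0i, hw_ci, hw_pi in zip(x0, t_crit * se_mean, t_crit * se_pred):
    print(f"x0 = {x0i:5.2f}  CI half-width = {hw_ci:.3f}  PI half-width = {hw_pi:.3f}")
```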
Figure 3.1 sketches the confidence interval and the prediction interval as functions of $x_0$. As $n$ increases, the width of a confidence interval for the mean at any $x_0$ decreases toward 0, but the width of the 95% prediction interval decreases only toward $2(1.96)\sigma$.
Figure 3.1 Portrayal of confidence intervals for the mean, $E(y) = \beta_0 + \beta_1 x_0$, and prediction intervals for a future observation $y$, at various $x_0$ values.
3.3.5 Interpretation and Limitations of Prediction Intervals
Interpreting a prediction interval is awkward. With $\alpha = 0.05$, we would like to say that, conditional on the observed data and the model fit, we have 95% confidence that the future $y$ will fall in the interval; that is, that close to 95% of a large number of future observations would fall in the interval. However, the probability distributions in the derivation of Section 3.3.3 treat $\hat{\mu}$ as well as the future $y$ as random, whereas in practice we use the interval after observing the data and hence $\hat{\mu}$. The conditional probability that the prediction interval captures a future $y$, given $\hat{\mu}$, is not 0.95. From the reasoning that led to Equation 3.3, before collecting any data (and hence before finding $\hat{\mu}$ and $s$ and then observing the future $y$),
$$P\left[\frac{|y - \hat{\mu}|}{s\sqrt{1 + x_0(X^TX)^{-1}x_0^T}} \le t_{0.025,\,n-p}\right] = 0.95.$$
Once we observe the data and find $\hat{\mu}$ and $s$, this probability (with $y$ as the only random part) does not equal 0.95. It depends on where $\hat{\mu}$ happened to fall, and it need not be close to 0.95 unless $\mathrm{var}(\hat{\mu})$ is negligible compared with $\mathrm{var}(y)$. The 95% confidence for a prediction interval means the following: if we repeatedly used this method with many such datasets of independent observations satisfying the model (i.e., to construct both the fitted equation and this interval) and each time made a future observation, then in the long run 95% of the intervals formed would contain the future observation.
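A small Monte Carlo sketch of this long-run interpretation (simulated under a true simple linear regression model; the sample size, design, and parameter values are our assumptions): each replicate draws a fresh dataset, fits the line, forms a 95% prediction interval at $x_0$, then draws one future observation.

```python
import numpy as np
from scipy import stats

# Long-run coverage of 95% prediction intervals when the model truly holds
rng = np.random.default_rng(seed=3)
n, x0, beta0, beta1, sigma = 20, 5.0, 1.0, 0.8, 1.0
t_crit = stats.t.ppf(0.975, n - 2)
x = rng.uniform(0, 10, n)                          # fixed design across replicates
xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)
hits, reps = 0, 10000
for _ in range(reps):
    y = beta0 + beta1 * x + rng.normal(0, sigma, n)
    b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * xbar
    s2 = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)
    half = t_crit * np.sqrt(s2 * (1 + 1 / n + (x0 - xbar) ** 2 / Sxx))
    y_future = beta0 + beta1 * x0 + rng.normal(0, sigma)
    hits += abs(y_future - (b0 + b1 * x0)) <= half
print(hits / reps)                                  # close to 0.95
```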
To this interpretation we add the vital qualifier: if the model truly holds. In practice, we should have considerable faith in the model before forming prediction intervals. Even if we do not truly believe the model (the usual situation in practice), a confidence interval for $E(y) = x_0\boldsymbol{\beta}$ at various $x_0$ values is useful for describing the fit of the model in the population of interest. However, if the model fails, either in its description of the population mean as a function of the explanatory variables or in its assumptions of normality with constant variance, then the actual percentage of many future observations that fall within the limits of 95% prediction intervals may be quite different from 95%.