EXAMPLE: SUMMARIZING THE FIT OF A LINEAR MODEL

Table 2.3 Record Time to Complete Race Course (in minutes), by Distance of Race (miles) and Climb (in thousands of feet)

Race                         Distance   Climb   Record Time
Greenmantle New Year Dash       2.5     0.650      16.08
Craig Dunain Hill Race          6       0.900      33.65
Ben Rha Hill Race               7.5     0.800      45.60
Ben Lomond Hill Race            8       3.070      62.27
Bens of Jura Fell Race         16       7.500     204.62
Lairig Ghru Fun Run            28       2.100     192.67

Source: From Atkinson (1986), by permission of the Institute of Mathematical Statistics, with correction by Hoaglin (2012) [13]. The complete data for 35 races are in the file ScotsRaces.dat at the text website, www.stat.ufl.edu/~aa/glm/data.

[13] Thanks to David Hoaglin for showing me his article and this data set.

We suggest that you download all 35 observations from the text website and view some summary statistics and graphics, such as follows:

---

> attach(ScotsRaces) # complete data at www.stat.ufl.edu/~aa/glm/data
> matrix(cbind(mean(time),sd(time),mean(climb),sd(climb),
+        mean(distance),sd(distance)),nrow=2)
        [,1]   [,2]   [,3]   # e.g., time has mean = 56, std.dev. = 50
[1,] 56.0897 1.8153 7.5286
[2,] 50.3926 1.6192 5.5239
> pairs(~time+climb+distance) # scatterplot matrix for variable pairs
> cor(cbind(climb,distance,time)) # correlation matrix
          climb distance   time
climb    1.0000   0.6523 0.8327
distance 0.6523   1.0000 0.9431
time     0.8327   0.9431 1.0000

---

Figure 2.10 is a scatterplot matrix, showing a plot for each pair of variables. It seems natural that longer races would tend to have greater record times per mile, so we might expect the record time to be a convex increasing function of distance.

However, the scatterplot relating these variables reveals a strong linear trend, apart from a single outlier. The scatterplot of record time by climb also shows linearity, apart from a rather severe outlier discussed below.

[Figure 2.10: scatterplot matrix; diagonal panels labeled time, climb, and distance]

Figure 2.10 Scatterplot matrix for record time, climb, and distance, in Scottish hill races.

For the ordinary linear model that uses both explanatory variables, without interaction, here is basic R output, not showing inferential results that assume normality for $y$:

---

> fit.cd <- lm(time ~ climb + distance)
> summary(fit.cd)
Coefficients:
            Estimate Std. Error
(Intercept) -13.1086     2.5608
climb        11.7801     1.2206
distance      6.3510     0.3578
---
Residual standard error: 8.734 on 32 degrees of freedom  # This is s
Multiple R-squared: 0.9717, Adjusted R-squared: 0.970
> cor(time, fitted(fit.cd)) # multiple correlation
[1] 0.9857611

---

The model fit indicates that, adjusted for climb, the predicted record time increases by 6.35 minutes for every additional mile of distance. The “Residual standard error” reported for the model fit is the estimated standard deviation of record times, at fixed values of climb and distance; that is, $s = 8.734$ minutes. From Section 2.4.1, the error variance estimate $s^2 = 76.29$ averages the variability of the residuals, with denominator $n - p$, which here is $df = 35 - 3 = 32$. The sample marginal variance for the record times is $s_y^2 = 2539.42$, considerably larger than $s^2$.

From the output, $R^2 = 0.972$ indicates a reduction of 97.2% in the sum of squared errors from using this prediction equation instead of $\bar{y}$ to predict the record times. The multiple correlation of $R = \sqrt{0.972} = 0.986$ equals the correlation between the 35 observed $y_i$ and fitted $\hat{\mu}_i$ values. The output also reports adjusted $R^2 = 0.970$. We estimate that the conditional variance for record times is only 3% of the marginal variance.
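The 3% figure can be checked from the two variance estimates quoted in the text, using the standard identity that adjusted $R^2$ equals one minus the ratio of the conditional variance estimate to the marginal variance estimate. A short worked version, using only the rounded values reported here:

```latex
\[
R^2_{\mathrm{adj}} \;=\; 1 - \frac{s^2}{s_y^2}
\;=\; 1 - \frac{76.29}{2539.42}
\;\approx\; 1 - 0.030 \;=\; 0.970 .
\]
```

The ratio $s^2 / s_y^2 \approx 0.030$ is the "3%" in the text.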

The standardized residuals (rstandard in R) have an approximate mean of 0 and standard deviation of 1. A histogram (not shown here) of them or of the raw residuals exhibits some skew to the right. From Section 2.5, the residuals are orthogonal to the model fit, and we can check model assumptions by plotting them. Figure 2.11 plots the standardized residuals against the model-fitted values. We suggest you also construct the plots against the explanatory variables. These plots do not suggest lack of fit for this model, but they and the histogram reveal an outlier. This is the record time of 204.62 minutes, with a fitted value of 176.86, for the Bens of Jura Fell Race, the race having the greatest climb. For this race, the standardized residual is 4.175 and Cook’s distance is 4.215, the largest for the 35 observations and 13 times the next largest value. From Figure 2.10, the Lairig Ghru Fun Run is a severe outlier when record time is plotted against climb; yet when considered with both climb and distance as predictors, it has a standardized residual of only 0.66 and a Cook’s distance of 0.32. Its record time of 192.67 minutes seems very large for a climb of 2.1 thousand feet, but not at all unusual when we take into account that it is the longest race (28 miles). Atkinson (1986) presented other diagnostic measures and plots for these data that are beyond the scope of this book.

[Figure 2.11: scatterplot with horizontal axis fitted(fit.cd) and vertical axis rstandard(fit.cd)]

Figure 2.11 Plot of standardized residuals versus fitted values, for linear model predicting record time using climb and distance.
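The fitted value quoted for the Bens of Jura Fell Race can be reproduced by hand from the reported coefficients. The sketch below is only an illustration: it uses the six races listed in Table 2.3 rather than the full 35-race file, and the names races, b0, b.climb, and b.distance are ours, not from the text. Standardized residuals and Cook's distances need the leverages from the full data set, so only fitted values and raw residuals are computed.

```r
# Six races from Table 2.3 (the complete data set has 35 races)
races <- data.frame(
  race     = c("Greenmantle New Year Dash", "Craig Dunain Hill Race",
               "Ben Rha Hill Race", "Ben Lomond Hill Race",
               "Bens of Jura Fell Race", "Lairig Ghru Fun Run"),
  distance = c(2.5, 6, 7.5, 8, 16, 28),                      # miles
  climb    = c(0.650, 0.900, 0.800, 3.070, 7.500, 2.100),    # thousands of feet
  time     = c(16.08, 33.65, 45.60, 62.27, 204.62, 192.67)   # minutes
)

# Coefficients copied from the summary(fit.cd) output (fit on all 35 races)
b0         <- -13.1086
b.climb    <-  11.7801
b.distance <-   6.3510

# Fitted value and raw residual for each listed race
races$fitted   <- b0 + b.climb * races$climb + b.distance * races$distance
races$residual <- races$time - races$fitted

print(races[, c("race", "time", "fitted", "residual")], digits = 5)
```

Among these six races, the Bens of Jura Fell Race has by far the largest raw residual, whereas the Lairig Ghru Fun Run is fitted to within a few minutes once distance is taken into account.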

---

> hist(residuals(fit.cd)) # Histogram display of residuals
> quantile(rstandard(fit.cd), c(0,0.25,0.5,0.75,1))
        0%        25%        50%        75%       100%
-2.0343433 -0.5684549  0.1302666  0.6630338  4.1751367
> cor(fitted(fit.cd),residuals(fit.cd)) # correlation equals zero
[1] -7.070225e-17
> mean(rstandard(fit.cd)); sd(rstandard(fit.cd))
[1] 0.03068615 # Standardized residuals have mean approximately = 0
[1] 1.105608   # and standard deviation approximately = 1
> plot(distance, rstandard(fit.cd)) # scatterplot display
> plot(fitted(fit.cd), rstandard(fit.cd))
> cooks.distance(fit.cd)
> plot(cooks.distance(fit.cd))

---

When we fit the model using the glm function in R, the output states:

---

Null deviance: 86340.1 on 34 degrees of freedom
Residual deviance: 2441.3 on 32 degrees of freedom

---

We introduce the deviance in Chapter 4. For now, we mention that for the normal linear model, the null deviance is the corrected TSS and the residual deviance is the SSE. Thus, $R^2 = (86340.1 - 2441.3)/86340.1 = 0.972$. The difference $86340.1 - 2441.3 = 83898.8$ is the SSR for the model.
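These identities can be checked with a few lines of arithmetic. The sketch below uses only the two deviances and the sample sizes reported in the text; the variable names are illustrative.

```r
# Values copied from the glm output and the model description
null.dev  <- 86340.1   # null deviance = corrected TSS, on n - 1 = 34 df
resid.dev <- 2441.3    # residual deviance = SSE, on n - p = 32 df
n <- 35                # number of races
p <- 3                 # intercept, climb, distance

SSR  <- null.dev - resid.dev   # regression sum of squares
R2   <- SSR / null.dev         # proportional reduction in squared error
s2   <- resid.dev / (n - p)    # error variance estimate s^2
s2.y <- null.dev / (n - 1)     # marginal sample variance of time

round(c(SSR = SSR, R2 = R2, s2 = s2, s2.y = s2.y, s = sqrt(s2)), 4)
```

Up to rounding, the results agree with the values $s = 8.734$, $s^2 = 76.29$, and $s_y^2 = 2539.42$ quoted for the main effects fit.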

Next, we show ANOVA tables that provide SSE and the sequential SS for each explanatory variable in the order in which it enters the model, considering both possible sequences:

---

> anova(lm(time ~ climb + distance)) # climb entered, then distance
Analysis of Variance Table
          Df Sum Sq Mean Sq
climb      1  59861   59861
distance   1  24038   24038
Residuals 32   2441      76
---
> anova(lm(time ~ distance + climb)) # distance entered, then climb
Analysis of Variance Table
          Df Sum Sq Mean Sq
distance   1  76793   76793
climb      1   7106    7106
Residuals 32   2441      76

---

The sequential SS values differ substantially according to the order of entering the explanatory variables into the model, because the correlation is 0.652 between distance and climb. However, the SSE and SSR values for the full model, and hence $R^2$, do not depend on this. For each ANOVA table display, SSE = 2441 and SSR = 59861 + 24038 = 76793 + 7106 = 83899.
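The order dependence of sequential SS, and the order invariance of SSR and SSE, can be seen on any data with correlated predictors. As a small runnable sketch, the code below refits the model to just the six races listed in Table 2.3. This is an assumption made for self-containment: the numbers will not match the 35-race output, only the pattern will.

```r
# Six races from Table 2.3 only (NOT the full 35-race data set)
races <- data.frame(
  distance = c(2.5, 6, 7.5, 8, 16, 28),                      # miles
  climb    = c(0.650, 0.900, 0.800, 3.070, 7.500, 2.100),    # thousands of feet
  time     = c(16.08, 33.65, 45.60, 62.27, 204.62, 192.67)   # minutes
)

# Sequential (Type I) ANOVA tables for the two orders of entry
a1 <- anova(lm(time ~ climb + distance, data = races))  # climb first
a2 <- anova(lm(time ~ distance + climb, data = races))  # distance first

# SSR is the sum of the sequential SS; SSE is the residual row
ssr1 <- sum(a1[c("climb", "distance"), "Sum Sq"])
ssr2 <- sum(a2[c("distance", "climb"), "Sum Sq"])
sse1 <- a1["Residuals", "Sum Sq"]
sse2 <- a2["Residuals", "Sum Sq"]

print(a1)
print(a2)
all.equal(ssr1, ssr2)   # SSR does not depend on the order of entry
all.equal(sse1, sse2)   # neither does SSE
```

The sequential SS for climb differs between the two tables because part of the variability it explains is shared with distance; whichever variable enters first is credited with the shared part.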

One way this model containing only main effects fails is if the effect of distance is greater when the climb is greater, as seems plausible. To allow the effect of distance to depend on the climb, we add an interaction term:

---

> summary(lm(time ~ climb + distance + climb:distance))
Coefficients:
               Estimate Std. Error
(Intercept)     -0.7672     3.9058
climb            3.7133     2.3647
distance         4.9623     0.4742
climb:distance   0.6598     0.1743
---
Residual standard error: 7.338 on 31 degrees of freedom
Multiple R-squared: 0.9807, Adjusted R-squared: 0.9788

---

The estimated effect on record time of a 1-mile increase in distance now depends on the climb: it changes from $4.962 + 0.660(0.3) = 5.16$ minutes at the minimum climb of 0.3 thousand feet to $4.962 + 0.660(7.5) = 9.91$ minutes at the maximum climb of 7.5 thousand feet.
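In the interaction model the slope for distance is itself a linear function of climb. A minimal sketch of that calculation, using the coefficients from the output (the function name distance.effect is ours):

```r
# Coefficients copied from the interaction model output
b.distance    <- 4.9623   # distance main effect (minutes per mile at climb = 0)
b.interaction <- 0.6598   # climb:distance coefficient

# Estimated change in record time (minutes) per extra mile, at a given climb
# (climb measured in thousands of feet)
distance.effect <- function(climb) b.distance + b.interaction * climb

round(distance.effect(c(0.3, 7.5)), 2)
# [1] 5.16 9.91
```

Evaluating the function at the minimum and maximum climbs reproduces the two slopes given in the text.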

As $R^2$ has increased from 0.972 to 0.981 and adjusted $R^2$ from 0.970 to 0.979, this more informative summary explains about a third of the variability that had been unexplained by the main effects model. That is, the squared partial correlation, which summarizes the impact of adding the interaction term, is $(0.981 - 0.972)/(1 - 0.972) = 0.32$.
