ADRIEN LEGENDRE: INTRODUCING THE METHOD OF LEAST SQUARES
DEFINITION 15.2 Standard Error of the Estimate
The standard error of the estimate, s_e, is defined by
$$s_e = \sqrt{\frac{SSE}{n-2}},$$
where SSE is the error sum of squares.
What Does It Mean?
Roughly speaking, the standard error of the estimate indicates how much, on average, the predicted values of the response variable differ from the observed values of the
response variable. In the next example, we illustrate the computation and interpretation of the standard error of the estimate.
EXAMPLE 15.2 Standard Error of the Estimate
Age and Price of Orions Refer to the age and price data for a sample of 11 Orions given in Table 15.1 on page 680.
a. Compute and interpret the standard error of the estimate.
b. Presuming that the variables age and price for Orions satisfy the assumptions for regression inferences, interpret the result from part (a).
Solution
a. On page 664, we found that SST = 9708.5 and SSR = 8285.0. So, by the regression identity, SSE = 9708.5 − 8285.0 = 1423.5. Thus,
$$s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{1423.5}{11-2}} = 12.58.$$
Interpretation Roughly speaking, the predicted price of an Orion in the sample differs, on average, from the observed price by $1258.
b. Presuming that the variables age and price for Orions satisfy the assumptions for regression inferences, the standard error of the estimate, s_e = 12.58, or $1258, provides an estimate for the common population standard deviation, σ, of prices for all Orions of any particular age.
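The arithmetic in part (a) can be sketched in a few lines of Python. This is only an illustration of the definition, using the SST and SSR values quoted in the example; the function name `standard_error_of_estimate` is a hypothetical helper, not something from the text.

```python
import math


def standard_error_of_estimate(sse: float, n: int) -> float:
    """Return s_e = sqrt(SSE / (n - 2)) for a simple linear regression."""
    if n <= 2:
        raise ValueError("need more than two data points")
    return math.sqrt(sse / (n - 2))


# Values quoted in Example 15.2 for the 11 Orions (prices in hundreds of dollars).
sst = 9708.5
ssr = 8285.0
sse = sst - ssr  # regression identity: SST = SSR + SSE

s_e = standard_error_of_estimate(sse, n=11)
print(f"SSE = {sse:.1f}, s_e = {s_e:.2f}")
```

Because price is recorded in hundreds of dollars, the printed value of s_e must be multiplied by 100 to express it in dollars.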
Report 15.1
Exercise 15.29(a)–(b) on page 691
Analysis of Residuals
Next we discuss how to use sample data to decide whether we can reasonably presume that the assumptions for regression inferences are met. We concentrate on Assumptions 1–3; checking Assumption 4 is more involved and is best left for a second course in statistics.
The method for checking Assumptions 1–3 relies on an analysis of the errors made by using the regression equation to predict the observed values of the response variable, that is, on the differences between the observed and predicted values of the response variable. Each such difference is called a residual, generically denoted e. Thus,
$$\text{Residual} = e_i = y_i - \hat{y}_i.$$
Figure 15.5 shows the residual of a single data point.
We can express the standard error of the estimate in terms of the residuals:
$$s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{\sum e_i^2}{n-2}}. \tag{15.1}$$
We can show that the sum of the residuals is always 0, which, in turn, implies that the mean of the residuals is also 0, that is, ē = 0.
Consequently, the standard error of the estimate is essentially the same as the standard deviation of the residuals.† Thus the standard error of the estimate is sometimes called the residual standard deviation.
†The exact standard deviation of the residuals is obtained by dividing by n − 1 instead of n − 2.
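As a worked check of the footnote, the two quantities differ only by the factor $\sqrt{(n-2)/(n-1)}$. The sketch below uses the Orion values from Example 15.2 (n = 11, s_e = 12.58); the symbol $s_{\text{res}}$ for the standard deviation of the residuals is introduced here for illustration only and does not appear in the text.

```latex
s_{\text{res}}
  = \sqrt{\frac{\sum (e_i - \bar{e})^2}{n-1}}
  = \sqrt{\frac{\sum e_i^2}{n-1}}
  = s_e \sqrt{\frac{n-2}{n-1}}
  \approx 12.58 \sqrt{\frac{9}{10}}
  \approx 11.93
```

The second equality uses ē = 0. For larger samples the factor is closer to 1, so the two values agree more closely.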
15.1 The Regression Model; Analysis of Residuals 685
FIGURE 15.5 Residual of a data point
[Figure: a data point (x_i, y_i) and the sample regression line ŷ = b0 + b1x. The observed value of the response variable is y_i, the predicted value of the response variable is ŷ_i, and the residual is e_i = y_i − ŷ_i.]
We can analyze the residuals to decide whether Assumptions 1–3 for regression inferences are met because those assumptions can be translated into conditions on the residuals. To show how, let’s consider a sample of data points obtained from two variables that satisfy the assumptions for regression inferences.
In light of Assumption 1, the data points should be scattered about the (sample) regression line, which means that the residuals should be scattered about the x-axis.
In light of Assumption 2, the variation of the observed values of the response variable should remain approximately constant from one value of the predictor variable to the next, which means the residuals should fall roughly in a horizontal band.
In light of Assumption 3, for each value of the predictor variable, the distribution of the corresponding observed values of the response variable should be approximately bell shaped, which implies that the horizontal band should be centered and symmetric about the x-axis.
Furthermore, considering all four regression assumptions simultaneously, we can regard the residuals as independent observations of a variable having a normal distribution with mean 0 and standard deviation σ. Thus a normal probability plot of the residuals should be roughly linear.
KEY FACT 15.2 Residual Analysis for the Regression Model
If the assumptions for regression inferences are met, the following two conditions should hold:
• A plot of the residuals against the observed values of the predictor variable should fall roughly in a horizontal band centered and symmetric about the x-axis.
• A normal probability plot of the residuals should be roughly linear.
Failure of either of these two conditions casts doubt on the validity of one or more of the assumptions for regression inferences for the variables under consideration.
A plot of the residuals against the observed values of the predictor variable, which for brevity we call a residual plot, provides approximately the same information as does a scatterplot of the data points. However, a residual plot makes spotting patterns such as curvature and nonconstant standard deviation easier.
To illustrate the use of residual plots for regression diagnostics, let’s consider the three plots in Fig. 15.6 on the next page.
• Fig. 15.6(a): In this plot, the residuals are scattered about the x-axis (residuals = 0) and fall roughly in a horizontal band, so Assumptions 1 and 2 appear to be met.
• Fig. 15.6(b): This plot suggests that the relation between the variables is curved, indicating that Assumption 1 may be violated.
• Fig. 15.6(c): This plot suggests that the conditional standard deviations increase as x increases, indicating that Assumption 2 may be violated.
FIGURE 15.6 Residual plots suggesting (a) no violation of linearity or constant standard deviation, (b) violation of linearity, and (c) violation of constant standard deviation
[Figure: three panels, (a), (b), and (c), each plotting Residual against x with a horizontal reference line at 0.]
EXAMPLE 15.3 Analysis of Residuals
Age and Price of Orions The age and price data for a sample of 11 Orions are repeated in the first two columns of Table 15.2. Perform a residual analysis to decide whether we can reasonably consider the assumptions for regression inferences met by the variables age and price of Orions.
Solution We must first determine the residuals. Each residual is the difference of the observed price (y) and the predicted price (ŷ). We find each predicted price by substituting each age in the first column of Table 15.2 into the regression equation, ŷ = 195.47 − 20.26x. The results are shown in the third column of Table 15.2.
TABLE 15.2 Table for obtaining the residuals for the Orion data

Age (yr)   Price ($100)   Predicted price   Residual
   x            y               ŷ               e
   5           85             94.17           −9.17
   4          103            114.43          −11.43
   6           70             73.91           −3.91
   5           82             94.17          −12.17
   5           89             94.17           −5.17
   5           98             94.17            3.83
   6           66             73.91           −7.91
   6           95             73.91           21.09
   2          169            154.95           14.05
   7           70             53.65           16.35
   7           48             53.65           −5.65
Now we obtain the residuals by subtracting the predicted prices in the third column of Table 15.2 from the observed prices in the second column of Table 15.2.
The results are shown in the fourth column of Table 15.2.
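The two steps just described (predict, then subtract) can be sketched in Python as follows. The sketch assumes the rounded regression equation ŷ = 195.47 − 20.26x quoted in the solution and the ages and prices listed in Table 15.2; the names `predict`, `predicted`, and `residuals` are illustrative only.

```python
# Rounded coefficients quoted in the text: y-hat = 195.47 - 20.26 x
B0, B1 = 195.47, -20.26

# Age (years) and price (hundreds of dollars) for the 11 Orions in Table 15.2.
ages = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
prices = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]


def predict(x: float) -> float:
    """Predicted price (in $100s) for an Orion of age x, using the rounded equation."""
    return B0 + B1 * x


predicted = [predict(x) for x in ages]                              # third column
residuals = [y - y_hat for y, y_hat in zip(prices, predicted)]      # fourth column

for x, y, y_hat, e in zip(ages, prices, predicted, residuals):
    print(f"{x:3d} {y:5d} {y_hat:8.2f} {e:8.2f}")
```

Because the coefficients are rounded to two decimal places, these residuals sum to about −0.09 rather than exactly 0.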
We can now perform the required residual analysis by applying the criteria presented in Key Fact 15.2 on page 685. Figure 15.7(a) shows a plot of the residuals against age, and Fig. 15.7(b) shows a normal probability plot for the residuals.
FIGURE 15.7 (a) Residual plot; (b) normal probability plot for residuals
[Figure: panel (a) plots Residual against Age; panel (b) plots Normal score against Residual.]
Taking into account the small sample size, we can say that the residuals fall roughly in a horizontal band that is centered and symmetric about the x-axis. We can also say that the normal probability plot for the residuals is (very) roughly linear, although the departure from linearity is sufficient for some concern.†
Interpretation There are no obvious violations of the assumptions for regression inferences for the variables age and price of 2- to 7-year-old Orions.
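For readers who want to see how the points of a normal probability plot might be produced, here is a minimal Python sketch. It pairs the sorted residuals from Table 15.2 with approximate normal scores. The plotting positions (i − 0.375)/(n + 0.25) are an assumption (one common convention, often attributed to Blom); the text does not say how its normal scores were obtained, so the values may differ slightly from those behind Fig. 15.7(b). The function name `normal_scores` is hypothetical.

```python
from statistics import NormalDist


def normal_scores(n: int) -> list[float]:
    """Approximate normal scores for a sample of size n.

    Assumes plotting positions (i - 0.375) / (n + 0.25); other conventions exist.
    """
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


# Residuals from the fourth column of Table 15.2.
residuals = [-9.17, -11.43, -3.91, -12.17, -5.17, 3.83,
             -7.91, 21.09, 14.05, 16.35, -5.65]

# Pair the i-th smallest residual with the i-th normal score.
pairs = list(zip(sorted(residuals), normal_scores(len(residuals))))

for e, z in pairs:
    print(f"{e:8.2f} {z:7.2f}")
```

Plotting the second value of each pair against the first gives the kind of graph described in part (b) of the figure; if the points fall roughly along a line, the normality condition is plausible.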
Report 15.2
Exercise 15.29(c)–(d) on page 691
THE TECHNOLOGY CENTER
Most statistical technologies provide the standard error of the estimate as part of their regression analysis output. For instance, consider the Minitab and Excel regression analysis in Output 14.2 on page 656 for the age and price data of 11 Orions. The items circled in green give the standard error of the estimate, so s_e = 12.58. As you can see, instead of the notation s_e, Minitab uses S and Excel uses RMSE (root mean square error).
Although the TI-83/84 Plus does not provide the standard error of the estimate in the output of its basic regression procedure, LinReg(a+bx), it can easily be obtained after applying that procedure. Specifically, the TI-83/84 Plus automatically stores the residuals in a list named RESID. In view of the last expression in Equation (15.1) on page 684, we can then determine the standard error of the estimate by using the calculator.
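The same computation from a stored list of residuals can be sketched in Python. This is not calculator syntax; it only mirrors the last expression in Equation (15.1), with the residuals from Table 15.2 standing in for the RESID list. The function name `se_from_residuals` is illustrative.

```python
import math


def se_from_residuals(residuals: list[float]) -> float:
    """Return s_e = sqrt(sum(e_i^2) / (n - 2)), the last expression in Equation (15.1)."""
    n = len(residuals)
    if n <= 2:
        raise ValueError("need more than two residuals")
    return math.sqrt(sum(e * e for e in residuals) / (n - 2))


# Stand-in for the calculator's RESID list: the residuals from Table 15.2.
resid = [-9.17, -11.43, -3.91, -12.17, -5.17, 3.83,
         -7.91, 21.09, 14.05, 16.35, -5.65]

s_e = se_from_residuals(resid)
print(round(s_e, 2))
```

The result agrees, to two decimal places, with the value of s_e found in Example 15.2.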
We can also use statistical technology to obtain a residual plot and a normal probability plot of the residuals. The next example illustrates how this is done.
EXAMPLE 15.4 Using Technology to Obtain Plots of Residuals
Age and Price of Orions Use Minitab, Excel, or the TI-83/84 Plus to obtain a residual plot and a normal probability plot of the residuals for the age and price data of Orions given in Table 15.1 on page 680.
Solution We applied the plots-of-residuals programs to the data, resulting in Output 15.1. Steps for generating that output are presented in Instructions 15.1 on the following page.
OUTPUT 15.1 Residual plots and normal probability plots of the residuals for the age and price data of 11 Orions
[Output: residual plot and normal probability plot of the residuals from MINITAB, EXCEL, and the TI-83/84 PLUS.]
†Recall, though, that the inferential procedures in regression analysis are robust to moderate violations of Assumptions 1–3 for regression inferences.
INSTRUCTIONS 15.1 Steps for generating Output 15.1
MINITAB
1 Store the age and price data from Table 15.1 in columns named AGE and PRICE, respectively
2 Choose Stat ➤ Regression ➤ Fitted Line Plot...
3 Press the F3 key to reset the dialog box
4 Specify PRICE in the Response (Y) text box
5 Specify AGE in the Predictor (X) text box
6 Click the Graphs... button
7 In the Individual plots list, check the Normal plot of residuals check box
8 Click in the Residuals versus the variables text box and specify AGE
9 Click OK twice
EXCEL
1 Store the age and price data from Table 15.1 in columns named AGE and PRICE, respectively
2 Repeat steps 2–9 of Instructions 14.2 on page 656 except that, in step 6, check (only) the Predictions and residuals and X check boxes
3 Go to the Predictions and residuals table at the bottom of the resulting output
4 Choose XLSTAT ➤ Visualizing data ➤ Scatter plots
5 Click the reset button in the lower left corner of the dialog box
6 Click in the X selection box and then select the range of the table that contains the AGE data (including the label)
7 Click in the Y selection box and then select the range of the table that contains the Residual data (including the label)
8 Click OK
9 Click the Continue button in the XLSTAT – Selections dialog box to get the residual plot
10 Return to the Predictions and residuals table
11 Choose XLSTAT ➤ Visualizing data ➤ Univariate plots
12 Click the reset button in the lower left corner of the dialog box
13 Select the range of the table that contains the Residual data (including the label)
14 Click the Options tab and uncheck the Descriptive statistics check box
15 Click the Charts (1) tab, uncheck the Box plots check box, and check the Normal Q-Q plots check box
16 Click OK
17 Click the Continue button in the XLSTAT – Selections dialog box to get the normal probability plot of the residuals
TI-83/84 PLUS
1 Store the age and price data from Table 15.1 in lists named AGE and PRICE, respectively
2 Press STAT, arrow over to CALC, and press 8
FOR THE TI-84 PLUS C
3 Press 2nd ➤ LIST, arrow down to AGE, and press ENTER twice
4 Press 2nd ➤ LIST, arrow down to PRICE, and press ENTER twice
5 Press CLEAR, arrow down to Calculate, and press ENTER
FOR THE TI-83/84 PLUS
3 Press 2nd ➤ LIST, arrow down to AGE, and press ENTER
4 Press , ➤ 2nd ➤ LIST, arrow down to PRICE, and press ENTER
5 Press ENTER

6 Ensure that all stat plots and all Y= functions are off
7 Press 2nd ➤ STAT PLOT and then press ENTER twice
8 Arrow to the first graph icon and press ENTER
9 Press the down-arrow key
10 Press 2nd ➤ LIST, arrow down to AGE, and press ENTER twice
11 Press 2nd ➤ LIST, arrow down to RESID, and press ENTER twice
12 Press ZOOM, then 9, and then TRACE to get the residual plot
13 Press 2nd ➤ STAT PLOT and then press ENTER
14 Arrow to the sixth graph icon and press ENTER
15 Press the down-arrow key
16 Press 2nd ➤ LIST, arrow down to RESID, and press ENTER twice
17 Press ZOOM, then 9, and then TRACE to get the normal probability plot of the residuals
Exercises 15.1
Understanding the Concepts and Skills
15.1 Suppose that x and y are predictor and response variables, respectively, of a population. Consider the population that consists of all members of the original population that have a specified value of the predictor variable. The distribution, mean, and standard deviation of the response variable for this population are called the ________, ________, and ________, respectively, corresponding to the specified value of the predictor variable.
15.2 State the four conditions required for making regression inferences.
In Exercises 15.3–15.6, assume that the variables under consideration satisfy the assumptions for regression inferences.
15.3 Fill in the blanks.
a. The line y = β0 + β1x is called the ________.
b. The common conditional standard deviation of the response variable is denoted ________.
c. For x = 6, the conditional distribution of the response variable is a ________ distribution having mean ________ and standard deviation ________.
15.4 What statistic is used to estimate
a. the y-intercept of the population regression line?
b. the slope of the population regression line?
c. the common conditional standard deviation, σ, of the response variable?
15.5 Based on a sample of data points, what is the best estimate of the population regression line?
15.6 Regarding the standard error of the estimate,
a. give two interpretations of it.
b. identify another name used for it, and explain the rationale for that name.
c. which one of the three sums of squares figures in its computation?
15.7 The difference between an observed value and a predicted value of the response variable is called a ________.
15.8 Identify two graphs used in a residual analysis to check the Assumptions 1–3 for regression inferences, and explain the reasoning behind their use.
15.9 Which graph used in a residual analysis provides roughly the same information as a scatterplot? What advantages does it have over a scatterplot?
15.10 Figure 15.8 shows three residual plots and a normal probability plot of residuals. For each part, decide whether the graph suggests violation of one or more of the assumptions for regression inferences. Explain your answers.
15.11 Figure 15.9 on the next page shows three residual plots and a normal probability plot of residuals. For each part, decide whether the graph suggests violation of one or more of the assumptions for regression inferences. Explain your answers.
FIGURE 15.8 Plots for Exercise 15.10
[Figure: four panels, (a)–(d): three residual plots (Residual against x, with a reference line at 0) and one normal probability plot (Normal score against Residual).]
FIGURE 15.9 Plots for Exercise 15.11
[Figure: panels (a)–(c) are residual plots (Residual against x, with a reference line at 0); panel (d) is a normal probability plot (Normal score against Residual).]
In Exercises 15.12–15.21, we repeat the data and provide the sample regression equations for Exercises 14.48–14.57.
a. Determine the standard error of the estimate.
b. Construct a residual plot.
c. Construct a normal probability plot of the residuals.
15.12
x   2   4   3
y   3   5   7
ŷ = 2 + x

15.13
x    3   1    2
y   −4   0   −5
ŷ = 1 − 2x

15.14
x   0   4   3   1   2
y   1   9   8   4   3
ŷ = 1 + 2x

15.15
x   3   4   1    2
y   4   5   0   −1
ŷ = −3 + 2x

15.16
x    1    3   4   4
y   13   −1   3   5
ŷ = 14 − 3x

15.17
x   2   2   3   4   4
y   3   4   0   2   1
ŷ = 5 − x

15.18
x   1   3   4   4
y   8   0   3   1
ŷ = 9 − 2x

15.19
x   1   2   3
y   4   3   8
ŷ = 1 + 2x

15.20
x   1   1   5   5
y   1   3   2   4
ŷ = 1.75 + 0.25x

15.21
x   0   2   2    5   6
y   4   2   0   −2   1
ŷ = 2.875 − 0.625x
Applying the Concepts and Skills
In Exercises 15.22–15.27, we repeat the information from Exercises 14.58–14.63. For each exercise here, discuss what satisfying Assumptions 1–3 for regression inferences by the variables under consideration would mean.
15.22 Tax Efficiency. Tax efficiency is a measure, ranging from 0 to 100, of how much tax due to capital gains stock or mutual funds investors pay on their investments each year; the higher the tax efficiency, the lower is the tax. The paper “At the Mercy of the Manager” (Financial Planning, Vol. 30(5), pp. 54–56) by C. Israelsen examined the relationship between investments in mutual fund portfolios and their associated tax efficiencies. The following table shows percentage of investments in energy securities (x) and tax efficiency (y) for 10 mutual fund portfolios.
x    3.1    3.2    3.7    4.3    4.0    5.5    6.7    7.4    7.4   10.6
y   98.1   94.7   92.0   89.8   87.5   85.0   82.0   77.8   72.1   53.5
15.23 Corvette Prices. The Kelley Blue Book provides information on wholesale and retail prices of cars. Following are age and price data for 10 randomly selected Corvettes between 1 and 6 years old. Here, x denotes age, in years, and y denotes price, in hundreds of dollars.
x     6     6     6     2     2     5     4     5     1     4
y   290   280   295   425   384   315   355   328   425   325
15.24 Homes for Sale. RE/MAX, an acronym for “Real Estate Maximums”, is a real estate organization in Canada, specializing in buying and selling of properties, with over 5,000 sales associates. A random sample of nine custom homes currently listed for sale provided the following information on size and price. Here, x denotes size in square feet of the living room of the house and y denotes price, in thousands of dollars, rounded to the nearest thousand.
x    2500    1346    5000    1250    1700    1937    2200    2249    2500
y   458.5   447.0   695.0   114.9   158.8   349.9   369.0   425.0   469.0
15.25 Plant Emissions. Plants emit gases that trigger the ripening of fruit, attract pollinators, and cue other physiological responses. N. Agelopolous et al. examined factors that affect the emission of volatile compounds by the potato plant Solanum tuberosum and published their findings in the paper “Factors Affecting Volatile Emissions of Intact Potato Plants, Solanum tuberosum: Variability of Quantities and Stability of Ratios” (Journal of Chemical Ecology, Vol. 26(2), pp. 497–511). The volatile compounds analyzed were hydrocarbons used by other plants and animals. Following are data on plant weight (x), in grams, and quantity of volatile compounds emitted (y), in hundreds of nanograms, for 11 potato plants.
x    57     85     57     65     52     67     62     80     77     53     68
y   8.0   22.0   10.5   22.5   12.0   11.5    7.5   13.0   16.5   21.0   12.0
15.26 Crown-Rump Length. In the article “The Human Vomeronasal Organ. Part II: Prenatal Development” (Journal of Anatomy, Vol. 197, Issue 3, pp. 421–436), T. Smith and K. Bhatnagar examined the controversial issue of the human vomeronasal organ, regarding its structure, function, and identity. The following table shows the age of fetuses (x), in weeks, and length of crown-rump (y), in millimeters.
x   10   10    13    13    18    19    19    23    25    28
y   66   66   108   106   161   166   177   228   235   280
15.27 Study Time and Score. An instructor at Arizona State University asked a random sample of 10 students to record their study times in a beginning statistics course. She then made a table for total hours studied (x) over 2 weeks and test score (y) at the end of the 2 weeks. Here are the results.
x   12   24   10    9   15   17    9   18   16   22
y   84   79   81   90   81   85   82   84   83   75
In Exercises 15.28–15.33,
a. compute the standard error of the estimate and interpret your answer.
b. interpret your result from part (a) if the assumptions for regression inferences hold.
c. obtain a residual plot and a normal probability plot of the residuals.
d. decide whether you can reasonably consider Assumptions 1–3 for regression inferences to be met by the variables under consideration. (The answer here is subjective, especially in view of the extremely small sample sizes.)
15.28 Tax Efficiency. Use the data on percentage of investments in energy securities and tax efficiency from Exercise 15.22.
15.29 Corvette Prices. Use the age and price data for Corvettes from Exercise 15.23.
15.30 Homes for Sale. Use the size and price data for resale homes from Exercise 15.24.
15.31 Plant Emissions. Use the data on plant weight and quantity of volatile emissions from Exercise 15.25.
15.32 Crown-Rump Length. Use the data on age of fetuses and length of crown-rump from Exercise 15.26.
15.33 Study Time and Score. Use the data on total hours studied over 2 weeks and test score at the end of the 2 weeks from Exercise 15.27.
Working with Large Data Sets
In Exercises 15.34–15.43, use the technology of your choice to
a. obtain and interpret the standard error of the estimate.
b. obtain a residual plot and a normal probability plot of the residuals.
c. decide whether you can reasonably consider Assumptions 1–3 for regression inferences met by the two variables under consideration.
15.34 Birdies and Score. How important are birdies (a score of one under par on a given hole) in determining the final total score of a woman golfer? From the U.S. Women’s Open website, we obtained data on number of birdies during a tournament and final score for 63 women golfers. The data are presented on the WeissStats site.
15.35 U.S. Presidents. The Information Please Almanac provides data on the ages at inauguration and of death for the presidents of the United States. We give those data on the WeissStats site for those presidents who are not still living at the time of this writing.
15.36 Movie Grosses. Box Office Mojo collects and posts data on movie grosses. For a random sample of 50 movies, we obtained both the domestic (U.S.) and overseas grosses, in millions of dollars. The data are presented on the WeissStats site.
15.37 Acreage and Value. The document Arizona Residential Property Valuation System, published by the Arizona Department of Revenue, describes how county assessors use computerized systems to value single-family residential properties for property tax purposes. On the WeissStats site are data on lot size (in acres) and assessed value (in thousands of dollars) for a sample of homes in a particular area.
15.38 Home Size and Value. On the WeissStats site are data on home size (in square feet) and assessed value (in thousands of dollars) for the same homes as in Exercise 15.37.
15.39 High and Low Temperature. The National Oceanic and Atmospheric Administration publishes temperature information of cities around the world in Climates of the World. A random sample of 50 cities gave the data on average high and low temperatures in January shown on the WeissStats site.