ADRIEN LEGENDRE: INTRODUCING THE METHOD OF LEAST SQUARES
DEFINITION 15.1 Standard Error of the Estimate
The standard error of the estimate, \(s_e\), is defined by
\[ s_e = \sqrt{\frac{\mathrm{SSE}}{n-2}}, \]
where SSE is the error sum of squares.
In the next example, we illustrate the computation and interpretation of the standard error of the estimate.
EXAMPLE 15.2 Standard Error of the Estimate
Age and Price of Orions Refer to the age and price data for a sample of 11 Orions given in Table 15.1 on page 669.
a. Compute and interpret the standard error of the estimate.
b. Presuming that the variables age and price for Orions satisfy the assumptions for regression inferences, interpret the result from part (a).
Solution
a. On page 651, we found that SSE = 1423.5. So the standard error of the estimate is
\[ s_e = \sqrt{\frac{\mathrm{SSE}}{n-2}} = \sqrt{\frac{1423.5}{11-2}} = 12.58. \]
Interpretation Roughly speaking, the predicted price of an Orion in the sample differs, on average, from the observed price by $1258.
b. Presuming that the variables age and price for Orions satisfy the assumptions for regression inferences, the standard error of the estimate, \(s_e = 12.58\), or $1258, provides an estimate for the common population standard deviation, \(\sigma\), of prices for all Orions of any particular age.
Report 15.1
Exercise 15.23(a)–(b) on page 679
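The arithmetic in part (a) of Example 15.2 can be checked with a few lines of Python. This is a minimal sketch, not part of the original example: the function name `standard_error_of_estimate` is a hypothetical helper, and the inputs are the values SSE = 1423.5 and n = 11 quoted in the solution.

```python
import math

def standard_error_of_estimate(sse: float, n: int) -> float:
    """Return s_e = sqrt(SSE / (n - 2)) for a simple linear regression."""
    if n <= 2:
        raise ValueError("need more than two data points")
    return math.sqrt(sse / (n - 2))

# Values quoted in Example 15.2: SSE = 1423.5 for the n = 11 Orions.
s_e = standard_error_of_estimate(1423.5, 11)
print(round(s_e, 2))     # 12.58 (prices are in hundreds of dollars)
print(round(s_e * 100))  # 1258 (dollars)
```

Dividing by n − 2 rather than n is what the definition requires; the guard on n simply avoids a zero or negative divisor.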
Analysis of Residuals
Next we discuss how to use sample data to decide whether we can reasonably presume that the assumptions for regression inferences are met. We concentrate on Assumptions 1–3; checking Assumption 4 is more involved and is best left for a second course in statistics.
The method for checking Assumptions 1–3 relies on an analysis of the errors made by using the regression equation to predict the observed values of the response variable, that is, on the differences between the observed and predicted values of the response variable. Each such difference is called a residual, generically denoted \(e\). Thus,
\[ \text{Residual} = e_i = y_i - \hat{y}_i. \]
Figure 15.5 shows the residual of a single data point.
FIGURE 15.5 Residual of a data point
[Figure: the sample regression line \(\hat{y} = b_0 + b_1 x\) and a data point \((x_i, y_i)\); the residual \(e_i = y_i - \hat{y}_i\) is the vertical difference between the observed value \(y_i\) and the predicted value \(\hat{y}_i\) of the response variable.]
We can express the standard error of the estimate in terms of the residuals:
\[ s_e = \sqrt{\frac{\mathrm{SSE}}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{\sum e_i^2}{n-2}}. \]
We can show that the sum of the residuals is always 0, which, in turn, implies that the mean of the residuals is 0, that is, \(\bar{e} = 0\). Consequently, the standard error of the estimate is essentially the same as the standard deviation of the residuals.† Thus the standard error of the estimate is sometimes called the residual standard deviation.
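To see how close "essentially the same" is, the following worked relation compares the two quantities. It is a sketch that uses only the fact that the residuals sum to 0; the symbol \(s_{\text{resid}}\) for the standard deviation of the residuals is notation assumed here, and the numerical check uses SSE = 1423.5 and n = 11 from Example 15.2.

```latex
s_{\text{resid}}
  = \sqrt{\frac{\sum (e_i - \bar{e})^2}{n-1}}
  = \sqrt{\frac{\sum e_i^2}{n-1}}
  = s_e \sqrt{\frac{n-2}{n-1}},
\qquad
\text{Orions: } \sqrt{\frac{1423.5}{10}} \approx 11.93
\ \text{versus}\ s_e \approx 12.58 .
```

The factor \(\sqrt{(n-2)/(n-1)}\) approaches 1 as n grows, so the two values differ noticeably only for small samples such as this one.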
We can analyze the residuals to decide whether Assumptions 1–3 for regression inferences are met because those assumptions can be translated into conditions on the residuals. To show how, let’s consider a sample of data points obtained from two variables that satisfy the assumptions for regression inferences.
† The exact standard deviation of the residuals is obtained by dividing by \(n-1\) instead of \(n-2\).
674 CHAPTER 15 Inferential Methods in Regression and Correlation
In light of Assumption 1, the data points should be scattered about the (sample) regression line, which means that the residuals should be scattered about the x-axis.
In light of Assumption 2, the variation of the observed values of the response variable should remain approximately constant from one value of the predictor variable to the next, which means the residuals should fall roughly in a horizontal band. In light of Assumption 3, for each value of the predictor variable, the distribution of the corresponding observed values of the response variable should be approximately bell shaped, which implies that the horizontal band should be centered and symmetric about the x-axis.
Furthermore, considering all four regression assumptions simultaneously, we can regard the residuals as independent observations of a variable having a normal distribution with mean 0 and standard deviation \(\sigma\). Thus a normal probability plot of the residuals should be roughly linear.
KEY FACT 15.2 Residual Analysis for the Regression Model
If the assumptions for regression inferences are met, the following two conditions should hold:
• A plot of the residuals against the values of the predictor variable should fall roughly in a horizontal band centered and symmetric about the x-axis.
• A normal probability plot of the residuals should be roughly linear.
Failure of either of these two conditions casts doubt on the validity of one or more of the assumptions for regression inferences for the variables under consideration.
A plot of the residuals against the values of the predictor variable, called a residual plot, provides approximately the same information as does a scatterplot of the data points. However, a residual plot makes spotting patterns such as curvature and nonconstant standard deviation easier.
To illustrate the use of residual plots for regression diagnostics, let’s consider the three plots in Fig. 15.6.
• Fig. 15.6(a): In this plot, the residuals are scattered about the x-axis (residuals = 0) and fall roughly in a horizontal band, so Assumptions 1 and 2 appear to be met.
• Fig. 15.6(b): This plot suggests that the relation between the variables is curved, indicating that Assumption 1 may be violated.
• Fig. 15.6(c): This plot suggests that the conditional standard deviations increase as x increases, indicating that Assumption 2 may be violated.
FIGURE 15.6 Residual plots suggesting (a) no violation of linearity or constant standard deviation, (b) violation of linearity, and (c) violation of constant standard deviation
[Figure: three panels, (a)–(c), each plotting Residual against x, with 0 marked on the residual axis.]
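The curved pattern described for Fig. 15.6(b) can be reproduced numerically. The sketch below is illustrative only: the data are hypothetical (y is exactly x squared, with no random error), and the helper names `least_squares_line` and `residuals` are assumptions introduced for this example. It fits a least-squares line to curved data and lists the residuals.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (b0, b1) of the least-squares line y-hat = b0 + b1*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

def residuals(xs, ys):
    """Return e_i = y_i - y-hat_i for the least-squares line."""
    b0, b1 = least_squares_line(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Hypothetical data (not from the text): y is exactly x squared.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

for x, e in zip(xs, residuals(xs, ys)):
    print(x, round(e, 2))
```

For these data the fitted line is y-hat = −5 + 6x, and the residuals are 5, 0, −3, −4, −3, 0, 5: positive at both ends and negative in the middle. They still sum to 0, but they trace a U shape rather than a horizontal band, which is the kind of pattern that casts doubt on Assumption 1.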
EXAMPLE 15.3 Analysis of Residuals
Age and Price of Orions Perform a residual analysis to decide whether we can reasonably consider the assumptions for regression inferences to be met by the variables age and price of Orions.
Solution We apply the criteria presented in Key Fact 15.2. The ages and residuals for the Orion data are displayed in the first and fourth columns of Table 14.8 on page 652, respectively. We repeat that information in Table 15.2.
TABLE 15.2 Age and residual data for Orions
Age x        5      4      6      5      5      5      6      6      2      7      7
Residual e   −9.16  −11.42 −3.90  −12.16 −5.16  3.84   −7.90  21.10  14.05  16.36  −5.64
Figure 15.7(a) shows a plot of the residuals against age, and Fig. 15.7(b) shows a normal probability plot for the residuals.
FIGURE 15.7 (a) Residual plot; (b) normal probability plot for residuals
[Figure: (a) Residual versus Age; (b) Normal score versus Residual.]
Taking into account the small sample size, we can say that the residuals fall roughly in a horizontal band that is centered and symmetric about the x-axis. We can also say that the normal probability plot for the residuals is (very) roughly linear, although the departure from linearity is sufficient for some concern.†
Report 15.2
Exercise 15.23(c)–(d) on page 679
Interpretation There are no obvious violations of the assumptions for regression inferences for the variables age and price of 2- to 7-year-old Orions.
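The quantities behind plots like those in Fig. 15.7 can be computed directly from the residuals in Table 15.2. The following is a hedged sketch rather than the procedure used in the text: the normal scores are approximated with Blom plotting positions, (i − 0.375)/(n + 0.25), which is an assumption made here and may differ slightly from tabulated normal scores; `normal_scores` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

# Residuals from Table 15.2 (in hundreds of dollars).
resid = [-9.16, -11.42, -3.90, -12.16, -5.16, 3.84,
         -7.90, 21.10, 14.05, 16.36, -5.64]
n = len(resid)

# The residuals should sum to 0, up to rounding in the table.
print(round(sum(resid), 2))

# Standard error of the estimate computed from the residuals.
s_e = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))
print(round(s_e, 2))

def normal_scores(n):
    """Approximate normal scores via Blom plotting positions (an assumption)."""
    return [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
            for i in range(1, n + 1)]

# Pair each sorted residual with its approximate normal score;
# these pairs are the points of a normal probability plot.
for e, z in zip(sorted(resid), normal_scores(n)):
    print(f"{e:7.2f}  {z:6.2f}")
```

Because the tabulated residuals are rounded, the sum is close to but not exactly 0, and the value of s_e agrees with 12.58 to two decimal places. If the printed pairs fall roughly along a straight line when plotted, the second condition of Key Fact 15.2 is plausible.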
THE TECHNOLOGY CENTER
Most statistical technologies provide the standard error of the estimate as part of their regression analysis output. For instance, consider the Minitab and Excel regression analysis in Output 14.2 on page 643 for the age and price data of 11 Orions. The items circled in green give the standard error of the estimate, so \(s_e = 12.58\). (Note to TI-83/84 Plus users: At the time of this writing, the TI-83/84 Plus does not display the standard error of the estimate. However, it can be found after running the regression procedure. See the TI-83/84 Plus Manual for details.)
We can also use statistical technology to obtain a residual plot and a normal probability plot of the residuals.
†Recall, though, that the inferential procedures in regression analysis are robust to moderate violations of Assumptions 1–3 for regression inferences.
EXAMPLE 15.4 Using Technology to Obtain Plots of Residuals
Age and Price of Orions Use Minitab, Excel, or the TI-83/84 Plus to obtain a residual plot and a normal probability plot of the residuals for the age and price data of Orions given in Table 15.1 on page 669.
Solution We applied the plots-of-residuals programs to the data, resulting in Output 15.1. Steps for generating that output are presented in Instructions 15.1.
OUTPUT 15.1 Residual plots and normal probability plots of the residuals for the age and price data of 11 Orions
[Output: panels for MINITAB, EXCEL, and TI-83/84 PLUS; the Minitab residual plot is titled "Residuals Versus AGE (response is PRICE)" and plots Residual against AGE.]
Note the following:
• Minitab's default normal probability plot uses percents instead of normal scores on the vertical axis.
• Excel plots the residuals against the predicted values of the response variable rather than against the observed values of the predictor variable.
These and similar modifications, however, do not affect the use of the plots as diagnostic tools to help assess the appropriateness of regression inferences.
INSTRUCTIONS 15.1 Steps for generating Output 15.1

MINITAB
1 Store the age and price data from Table 15.1 in columns named AGE and PRICE, respectively
2 Choose Stat ➤ Regression ➤ Regression. . .
3 Specify PRICE in the Response text box
4 Specify AGE in the Predictors text box
5 Click the Graphs. . . button
6 Select the Regular option button from the Residuals for Plots list
7 Select the Individual plots option button from the Residual Plots list
8 Select the Normal plot of residuals check box from the Individual plots list
9 Click in the Residuals versus the variables text box and specify AGE
10 Click OK twice

EXCEL
1 Store the age and price data from Table 15.1 in ranges named AGE and PRICE, respectively
2 Choose DDXL ➤ Regression
3 Select Simple regression from the Function type drop-down list box
4 Specify PRICE in the Response Variable text box
5 Specify AGE in the Explanatory Variable text box
6 Click OK
7 Click the Check the Residuals button

TI-83/84 PLUS
1 Store the age and price data from Table 15.1 in lists named AGE and PRICE, respectively
2 Clear the Y= screen or turn off any equations located there
3 Press STAT, arrow over to CALC, and press 8
4 Press 2nd ➤ LIST, arrow down to AGE, and press ENTER
5 Press , ➤ 2nd ➤ LIST, arrow down to PRICE, and press ENTER twice
6 Press 2nd ➤ STAT PLOT and then press ENTER twice
7 Arrow to the first graph icon and press ENTER
8 Press the down-arrow key
9 Press 2nd ➤ LIST, arrow down to AGE, and press ENTER twice
10 Press 2nd ➤ LIST, arrow down to RESID, and press ENTER twice
11 Press ZOOM and then 9 (and then TRACE, if desired)
12 Press 2nd ➤ STAT PLOT and then press ENTER twice
13 Arrow to the sixth graph icon and press ENTER
14 Press the down-arrow key
15 Press 2nd ➤ LIST, arrow down to RESID, and press ENTER twice
16 Press ZOOM and then 9 (and then TRACE, if desired)
Exercises 15.1
Understanding the Concepts and Skills
15.1 Suppose that x and y are predictor and response variables, respectively, of a population. Consider the population that consists of all members of the original population that have a specified value of the predictor variable. The distribution, mean, and standard deviation of the response variable for this population are called the ____, ____, and ____, respectively, corresponding to the specified value of the predictor variable.
15.2 State the four conditions required for making regression inferences.
In Exercises 15.3–15.6, assume that the variables under consideration satisfy the assumptions for regression inferences.
15.3 Fill in the blanks.
a. The line \(y = \beta_0 + \beta_1 x\) is called the ____.
b. The common conditional standard deviation of the response variable is denoted ____.
c. For \(x = 6\), the conditional distribution of the response variable is a ____ distribution having mean ____ and standard deviation ____.
15.4 What statistic is used to estimate
a. the y-intercept of the population regression line?
b. the slope of the population regression line?
c. the common conditional standard deviation, \(\sigma\), of the response variable?
15.5 Based on a sample of data points, what is the best estimate of the population regression line?
15.6 Regarding the standard error of the estimate,
a. give two interpretations of it.
b. identify another name used for it, and explain the rationale for that name.
c. which one of the three sums of squares figures in its computation?
15.7 The difference between an observed value and a predicted value of the response variable is called a ____.
15.8 Identify two graphs used in a residual analysis to check the Assumptions 1–3 for regression inferences, and explain the reasoning behind their use.
15.9 Which graph used in a residual analysis provides roughly the same information as a scatterplot? What advantages does it have over a scatterplot?
In Exercises 15.10–15.15, we repeat the data and provide the sample regression equations for Exercises 14.44–14.49.
a. Determine the standard error of the estimate.
b. Construct a residual plot.
c. Construct a normal probability plot of the residuals.
15.10
x 2 4 3
y 3 5 7
\(\hat{y} = 2 + x\)

15.11
x 3 1 2
y −4 0 −5
\(\hat{y} = 1 - 2x\)

15.12
x 0 4 3 1 2
y 1 9 8 4 3
\(\hat{y} = 1 + 2x\)

15.13
x 3 4 1 2
y 4 5 0 −1
\(\hat{y} = -3 + 2x\)

15.14
x 1 1 5 5
y 1 3 2 4
\(\hat{y} = 1.75 + 0.25x\)

15.15
x 0 2 2 5 6
y 4 2 0 −2 1
\(\hat{y} = 2.875 - 0.625x\)
In Exercises 15.16–15.21, we repeat the information from Exercises 14.50–14.55. For each exercise here, discuss what satisfying Assumptions 1–3 for regression inferences by the variables under consideration would mean.
15.16 Tax Efficiency. Tax efficiency is a measure, ranging from 0 to 100, of how much tax due to capital gains stock or mutual funds investors pay on their investments each year; the higher the tax efficiency, the lower is the tax. The paper "At the Mercy of the Manager" (Financial Planning, Vol. 30(5), pp. 54–56) by C. Israelsen examined the relationship between investments in mutual fund portfolios and their associated tax efficiencies. The following table shows percentage of investments in energy securities (x) and tax efficiency (y) for 10 mutual fund portfolios.
x 3.1 3.2 3.7 4.3 4.0 5.5 6.7 7.4 7.4 10.6
y 98.1 94.7 92.0 89.8 87.5 85.0 82.0 77.8 72.1 53.5
15.17 Corvette Prices. The Kelley Blue Book provides information on wholesale and retail prices of cars. Following are age and price data for 10 randomly selected Corvettes between 1 and 6 years old. Here, x denotes age, in years, and y denotes price, in hundreds of dollars.
x 6 6 6 2 2 5 4 5 1 4
y 290 280 295 425 384 315 355 328 425 325
15.18 Custom Homes. Hanna Properties specializes in custom-home resales in the Equestrian Estates, an exclusive subdivision in Phoenix, Arizona. A random sample of nine custom homes currently listed for sale provided the following information on size and price. Here, x denotes size, in hundreds of square feet, rounded to the nearest hundred, and y denotes price, in thousands of dollars, rounded to the nearest thousand.
x 26 27 33 29 29 34 30 40 22
y 540 555 575 577 606 661 738 804 496
15.19 Plant Emissions. Plants emit gases that trigger the ripening of fruit, attract pollinators, and cue other physiological responses. N. Agelopolous et al. examined factors that affect the emission of volatile compounds by the potato plant Solanum tuberosum and published their findings in the paper "Factors Affecting Volatile Emissions of Intact Potato Plants, Solanum tuberosum: Variability of Quantities and Stability of Ratios" (Journal of Chemical Ecology, Vol. 26(2), pp. 497–511). The volatile compounds analyzed were hydrocarbons used by other plants and animals. Following are data on plant weight (x), in grams, and quantity of volatile compounds emitted (y), in hundreds of nanograms, for 11 potato plants.
x 57 85 57 65 52 67 62 80 77 53 68
y 8.0 22.0 10.5 22.5 12.0 11.5 7.5 13.0 16.5 21.0 12.0
15.20 Crown-Rump Length. In the article "The Human Vomeronasal Organ. Part II: Prenatal Development" (Journal of Anatomy, Vol. 197, Issue 3, pp. 421–436), T. Smith and K. Bhatnagar examined the controversial issue of the human vomeronasal organ, regarding its structure, function, and identity. The following table shows the age of fetuses (x), in weeks, and length of crown-rump (y), in millimeters.
x 10 10 13 13 18 19 19 23 25 28
y 66 66 108 106 161 166 177 228 235 280
15.21 Study Time and Score. An instructor at Arizona State University asked a random sample of eight students to record their study times in a beginning calculus course. She then made a table for total hours studied (x) over 2 weeks and test score (y) at the end of the 2 weeks. Here are the results.
x 10 15 12 20 8 16 14 22
y 92 81 84 74 85 80 84 80
In Exercises 15.22–15.27,
a. compute the standard error of the estimate and interpret your answer.
b. interpret your result from part (a) if the assumptions for regression inferences hold.
c. obtain a residual plot and a normal probability plot of the residuals.
d. decide whether you can reasonably consider Assumptions 1–3 for regression inferences to be met by the variables under consideration. (The answer here is subjective, especially in view of the extremely small sample sizes.)
15.22 Tax Efficiency. Use the data on percentage of investments in energy securities and tax efficiency from Exercise 15.16.
15.23 Corvette Prices. Use the age and price data for Corvettes from Exercise 15.17.
15.24 Custom Homes. Use the size and price data for custom homes from Exercise 15.18.
15.25 Plant Emissions. Use the data on plant weight and quantity of volatile emissions from Exercise 15.19.
15.26 Crown-Rump Length. Use the data on age of fetuses and length of crown-rump from Exercise 15.20.
15.27 Study Time and Score. Use the data on total hours studied over 2 weeks and test score at the end of the 2 weeks from Exercise 15.21.
15.28 Figure 15.8 shows three residual plots and a normal probability plot of residuals. For each part, decide whether the graph suggests violation of one or more of the assumptions for regression inferences. Explain your answers.
15.29 Figure 15.9 on the next page shows three residual plots and a normal probability plot of residuals. For each part, decide whether the graph suggests violation of one or more of the assumptions for regression inferences. Explain your answers.
Working with Large Data Sets
In Exercises 15.30–15.39, use the technology of your choice to
a. obtain and interpret the standard error of the estimate.
b. obtain a residual plot and a normal probability plot of the residuals.
c. decide whether you can reasonably consider Assumptions 1–3 for regression inferences met by the two variables under consideration.
15.30 Birdies and Score. How important are birdies (a score of one under par on a given hole) in determining the final total score of a woman golfer? From the U.S. Women's Open Web site, we obtained data on number of birdies during a tournament and final score for 63 women golfers. The data are presented on the WeissStats CD.
15.31 U.S. Presidents. The Information Please Almanac provides data on the ages at inauguration and of death for the presidents of the United States. We give those data on the WeissStats CD for those presidents who are not still living at the time of this writing.
FIGURE 15.8 Plots for Exercise 15.28
[Figure: four panels, (a)–(d); three residual plots (Residual versus x) and one normal probability plot (Normal score versus Residual).]
FIGURE 15.9 Plots for Exercise 15.29
[Figure: (a)–(c) residual plots (Residual versus x); (d) normal probability plot (Normal score versus Residual).]
15.32 Health Care. From the Statistical Abstract of the United States, we obtained data on percentage of gross domestic product (GDP) spent on health care and life expectancy, in years, for selected countries. Those data are provided on the WeissStats CD. Do the required parts separately for each gender.
15.33 Acreage and Value. The document Arizona Residential Property Valuation System, published by the Arizona Department of Revenue, describes how county assessors use computerized systems to value single-family residential properties for property tax purposes. On the WeissStats CD are data on lot size (in acres) and assessed value (in thousands of dollars) for a sample of homes in a particular area.
15.34 Home Size and Value. On the WeissStats CD are data on home size (in square feet) and assessed value (in thousands of dollars) for the same homes as in Exercise 15.33.
15.35 High and Low Temperature. The National Oceanic and Atmospheric Administration publishes temperature information of cities around the world in Climates of the World. A random sample of 50 cities gave the data on average high and low temperatures in January shown on the WeissStats CD.
15.36 PCBs and Pelicans. Polychlorinated biphenyls (PCBs), industrial pollutants, are a great danger to natural ecosystems. In a study by R. W. Risebrough titled "Effects of Environmental Pollutants Upon Animals Other Than Man" (Proceedings of the 6th Berkeley Symposium on Mathematics and Statistics, VI, University of California Press, pp. 443–463), 60 Anacapa pelican eggs were collected and measured for their shell thickness, in millimeters (mm), and concentration of PCBs, in parts per million (ppm). The data are presented on the WeissStats CD.
15.37 Gas Guzzlers. The magazine Consumer Reports publishes information on automobile gas mileage and variables that affect gas mileage. In one issue, data on gas mileage (in mpg) and engine displacement (in liters, L) were published for 121 vehicles. Those data are stored on the WeissStats CD.
15.38 Estriol Level and Birth Weight. J. Greene and J. Touchstone conducted a study on the relationship between the estriol levels of pregnant women and the birth weights of their children. Their findings, "Urinary Tract Estriol: An Index of Placental Function," were published in the American Journal of Obstetrics and Gynecology (Vol. 85(1), pp. 1–9). The data points are provided on the WeissStats CD, where estriol levels are in mg/24 hr and birth weights are in hectograms (hg).
15.39 Shortleaf Pines. The ability to estimate the volume of a tree based on a simple measurement, such as the diameter of the tree, is important to the lumber industry, ecologists, and conservationists. Data on volume, in cubic feet, and diameter at breast height, in inches, for 70 shortleaf pines were reported in C. Bruce and F. X. Schumacher's Forest Mensuration (New York: McGraw-Hill, 1935) and analyzed by A. C. Akinson in the article "Transforming Both Sides of a Tree" (The American Statistician, Vol. 48, pp. 307–312). The data are provided on the WeissStats CD.
15.2 Inferences for the Slope of the Population Regression Line
In this section and the next, we examine several inferential procedures used in regression analysis. Strictly speaking, these inferential techniques require that the assumptions given in Key Fact 15.1 on page 670 be satisfied. However, as we noted earlier, these techniques are robust to moderate violations of those assumptions.
The first inferential methods we present concern the slope, \(\beta_1\), of the population regression line. To begin, we consider hypothesis testing.