ADRIEN LEGENDRE: INTRODUCING THE METHOD OF LEAST SQUARES
DEFINITION 15.1 Conditional Distribution, Mean, and Standard Deviation
Suppose that x and y are predictor and response variables, respectively, on a population. Let xp denote a particular value of the predictor variable and consider the subpopulation consisting of all members of the population whose value of the predictor variable is xp.
Conditional distribution of the response variable corresponding to xp: The distribution of all possible values of the response variable on the aforementioned subpopulation.
Conditional mean of the response variable corresponding to xp: The mean of all possible values of the response variable on the aforementioned subpopulation.
Conditional standard deviation of the response variable corresponding to xp: The standard deviation of all possible values of the response variable on the aforementioned subpopulation.
Using the terminology presented in Definition 15.1, we can now state the conditions required for applying inferential methods in regression analysis.
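The three definitions above can be illustrated with a short Python sketch. The population below is hypothetical (the numbers are made up for illustration and are not data from this book), and the function names are our own.

```python
from statistics import mean, pstdev

# Hypothetical population of (x, y) pairs; the values are invented for illustration.
population = [
    (2, 160), (2, 170), (2, 180),
    (3, 140), (3, 150), (3, 160),
    (4, 120), (4, 130), (4, 140),
]

def conditional_values(pop, xp):
    """Response values on the subpopulation whose predictor value equals xp."""
    return [y for x, y in pop if x == xp]

def conditional_mean(pop, xp):
    """Mean of the response variable on the subpopulation for xp."""
    return mean(conditional_values(pop, xp))

def conditional_sd(pop, xp):
    """Population standard deviation of the response on the subpopulation for xp."""
    return pstdev(conditional_values(pop, xp))

print(conditional_values(population, 3))
print(conditional_mean(population, 3), conditional_sd(population, 3))
```

Here the list returned by conditional_values plays the role of the conditional distribution: it holds every response value in the subpopulation, and the other two functions summarize it.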
KEY FACT 15.1 Assumptions (Conditions) for Regression Inferences
1. Population regression line: There are constants β0 and β1 such that, for each value x of the predictor variable, the conditional mean of the response variable is β0 + β1x.
2. Equal standard deviations: The conditional standard deviations of the response variable are the same for all values of the predictor variable. We denote this common standard deviation σ.†
3. Normal populations: For each value of the predictor variable, the conditional distribution of the response variable is a normal distribution.
4. Independent observations: The observations of the response variable are independent of one another.
What Does It Mean?
Assumptions 1–3 require that there are constants β0, β1, and σ so that, for each value x of the predictor variable, the conditional distribution of the response variable, y, is a normal distribution with mean β0 + β1x and standard deviation σ. These assumptions are often referred to as the regression model.
Note: We refer to the line y = β0 + β1x, on which the conditional means of the response variable lie, as the population regression line and to its equation as the population regression equation. Observe that β0 is the y-intercept of the population regression line and β1 is its slope.
The inferential procedures in regression are robust to moderate violations of Assumptions 1–3 for regression inferences. In other words, the inferential procedures work reasonably well provided the variables under consideration don’t violate any of those assumptions too badly.
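As a rough illustration of what the regression model says, the following Python sketch simulates response values under Assumptions 1–4. The parameter values BETA0, BETA1, and SIGMA are hypothetical choices of ours, not estimates from any data in this book.

```python
import random
from statistics import mean, stdev

# Hypothetical parameter values, chosen only for illustration.
BETA0, BETA1, SIGMA = 200.0, -20.0, 12.0

def simulate_response(x, rng):
    """Draw one response value for predictor value x under Assumptions 1-3."""
    conditional_mean = BETA0 + BETA1 * x        # Assumption 1: mean lies on the line
    return rng.gauss(conditional_mean, SIGMA)   # Assumptions 2 and 3: normal, common sigma

def simulate_subpopulation(x, n, rng):
    """Draw n independent response values for predictor value x (Assumption 4)."""
    return [simulate_response(x, rng) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the run is repeatable
for x in (2, 5, 7):
    ys = simulate_subpopulation(x, 20_000, rng)
    # The sample mean should be close to BETA0 + BETA1 * x and the
    # sample standard deviation close to SIGMA, whatever the value of x.
    print(x, round(mean(ys), 1), round(stdev(ys), 1))
```

Changing x shifts only the center of the simulated values; their spread and shape stay the same, which is the content of Assumptions 2 and 3.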
EXAMPLE 15.1 Assumptions for Regression Inferences
Age and Price of Orions For Orions, with age as the predictor variable and price as the response variable, what would it mean for the regression-inference Assumptions 1–3 to be satisfied? Display those assumptions graphically.
†The condition of equal standard deviations is called homoscedasticity. When that condition fails, we have what is called heteroscedasticity.
Solution Satisfying regression-inference Assumptions 1–3 requires that there are constants β0, β1, and σ so that for each age, x, the prices of all Orions of that age are normally distributed with mean β0 + β1x and standard deviation σ. Thus the prices of all 2-year-old Orions must be normally distributed with mean β0 + β1 · 2 and standard deviation σ, the prices of all 3-year-old Orions must be normally distributed with mean β0 + β1 · 3 and standard deviation σ, and so on.
To display the assumptions for regression inferences graphically, let’s first consider Assumption 1. This assumption requires that for each age, the mean price of all Orions of that age lies on the line y = β0 + β1x, as shown in Fig. 15.1.
FIGURE 15.1 Population regression line
[Figure: price ($100) plotted against age (yr), showing the population regression line y = β0 + β1x. The point on the line at x = 3, y = β0 + β1 · 3, is the mean price of all 3-year-old Orions; the point at x = 6, y = β0 + β1 · 6, is the mean price of all 6-year-old Orions.]
Assumptions 2 and 3 require that the price distributions for the various ages of Orions are all normally distributed with the same standard deviation, σ. Figure 15.2 illustrates those two assumptions for the price distributions of 2-, 5-, and 7-year-old Orions. The shapes of the three normal curves in Fig. 15.2 are identical because normal distributions that have the same standard deviation have the same shape.
FIGURE 15.2 Price distributions for 2-, 5-, and 7-year-old Orions under Assumptions 2 and 3 (The means shown for the three normal distributions reflect Assumption 1)
[Figure: three normal curves of identical shape for the prices of 2-year-old, 5-year-old, and 7-year-old Orions, centered at β0 + β1 · 2, β0 + β1 · 5, and β0 + β1 · 7, respectively.]
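The claim that normal distributions with the same standard deviation have the same shape can be checked numerically. The sketch below uses Python's statistics.NormalDist; the means and the value of SIGMA are hypothetical, and same_shape is a helper name of our own.

```python
from statistics import NormalDist

SIGMA = 12.0  # hypothetical common standard deviation

# Two hypothetical conditional distributions with different means.
curve_a = NormalDist(mu=160.0, sigma=SIGMA)
curve_b = NormalDist(mu=100.0, sigma=SIGMA)

def same_shape(d1, d2, offsets, tol=1e-12):
    """True if the two densities agree at every offset measured from their own means."""
    return all(
        abs(d1.pdf(d1.mean + t) - d2.pdf(d2.mean + t)) < tol
        for t in offsets
    )

offsets = [-24.0, -12.0, 0.0, 12.0, 24.0]
print(same_shape(curve_a, curve_b, offsets))
print(same_shape(curve_a, NormalDist(mu=100.0, sigma=20.0), offsets))
```

The first comparison looks at two curves that differ only in their means, so one is a horizontal shift of the other. The second compares curves with different standard deviations, whose heights differ even at their centers.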
Assumptions 1–3 for regression inferences, as they pertain to the variables age and price of Orions, can be portrayed graphically by combining Figs. 15.1 and 15.2 into a three-dimensional graph, as shown in Fig. 15.3. Whether those assumptions actually hold remains to be seen.
Exercise 15.23 on page 690
Estimating the Regression Parameters
Suppose that we are considering two variables, x and y, for which the assumptions for regression inferences are met. Then there are constants β0, β1, and σ so that, for each value x of the predictor variable, the conditional distribution of the response variable is a normal distribution with mean β0 + β1x and standard deviation σ.
FIGURE 15.3 Graphical portrayal of Assumptions 1–3 for regression inferences pertaining to age and price of Orions
[Figure: three-dimensional graph of price ($100) against age (yr), showing the population regression line y = β0 + β1x with the normal distributions of prices for 2-year-old, 5-year-old, and 7-year-old Orions centered on it. The normal distributions all have the same standard deviation, σ.]
Because the parameters β0, β1, and σ are usually unknown, we must estimate them from sample data. We use the y-intercept and slope of a sample regression line as point estimates of the y-intercept and slope, respectively, of the population regression line; that is, we use b0 (the y-intercept of a sample regression line) to estimate β0 (the y-intercept of the population regression line) and we use b1 (the slope of a sample regression line) to estimate β1 (the slope of the population regression line). We note that b0 is an unbiased estimator of β0 and that b1 is an unbiased estimator of β1.
Equivalently, we use a sample regression line to estimate the unknown population regression line. Of course, a sample regression line ordinarily will not be the same as the population regression line, just as a sample mean generally will not equal the population mean. In Fig. 15.4, we illustrate this situation for the Orion example. Although the population regression line is unknown, we have drawn it to illustrate the difference between the population regression line and a sample regression line.
FIGURE 15.4 Population regression line and sample regression line for age and price of Orions
[Figure: price ($100) plotted against age (yr), showing the unknown population regression line y = β0 + β1x (solid) and the sample regression line ŷ = b0 + b1x = 195.47 − 20.26x (dashed), computed from sample data.]
In Fig. 15.4, the sample regression line (the dashed line) is the best approximation that can be made to the population regression line (the solid line) by using the sample data in Table 15.1 on page 680. A different sample of Orions would almost certainly yield a different sample regression line.
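To see how the sample regression line varies from sample to sample, the following Python sketch fits a least-squares line to repeated simulated samples. It is an illustration under stated assumptions, not the computation behind Fig. 15.4: the parameters BETA0, BETA1, and SIGMA and the list AGES are hypothetical, and the samples are simulated rather than taken from Table 15.1.

```python
import random
from statistics import mean

def least_squares_line(xs, ys):
    """Return (b0, b1), the y-intercept and slope of the least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical population parameters, used only to generate sample data.
BETA0, BETA1, SIGMA = 200.0, -20.0, 12.0
AGES = [2, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7]  # hypothetical predictor values

def draw_sample_line(rng):
    """Simulate one sample under the regression model and fit its line."""
    prices = [rng.gauss(BETA0 + BETA1 * x, SIGMA) for x in AGES]
    return least_squares_line(AGES, prices)

rng = random.Random(1)  # fixed seed so the run is repeatable
lines = [draw_sample_line(rng) for _ in range(5_000)]

# Individual sample lines differ from one another and from the population line...
print([(round(b0, 2), round(b1, 2)) for b0, b1 in lines[:3]])
# ...but their averages should land near BETA0 and BETA1 (unbiasedness).
print(round(mean(b0 for b0, _ in lines), 2), round(mean(b1 for _, b1 in lines), 2))
```

Each simulated sample gives its own b0 and b1, while the averages over many samples settle near the population values, which is what it means for b0 and b1 to be unbiased estimators.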
The statistic used to obtain a point estimate for the common conditional standard deviation σ is called the standard error of the estimate.
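The passage above names the statistic without giving a formula. A minimal Python sketch follows, assuming the usual textbook definition, se = sqrt(SSE / (n − 2)), where SSE is the sum of squared residuals about the least-squares line. The function name and the sample data are hypothetical.

```python
from math import sqrt
from statistics import mean

def standard_error_of_estimate(xs, ys):
    """Return sqrt(SSE / (n - 2)) for the least-squares line through the data."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx               # slope of the sample regression line
    b0 = y_bar - b1 * x_bar      # y-intercept of the sample regression line
    # SSE: sum of squared differences between observed and predicted responses.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return sqrt(sse / (n - 2))

# Hypothetical sample, invented for illustration.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
print(standard_error_of_estimate(xs, ys))
```

The divisor is n − 2 rather than n because two parameters, β0 and β1, are estimated from the same data before the residuals are computed.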