The Regression Model; Analysis of Residuals

Excerpt from Introductory Statistics (9th edition), Part 2 (pages 238 - 241)


15.1 The Regression Model; Analysis of Residuals

Before we can perform statistical inferences in regression and correlation, we must know whether the variables under consideration satisfy certain conditions. In this section, we discuss those conditions and examine methods for deciding whether they hold.

The Regression Model

Let’s return to the Orion illustration used throughout Chapter 14. In Table 15.1, we reproduce the data on age and price for a sample of 11 Orions.

TABLE 15.1 Age and price data for a sample of 11 Orions

Age (yr)    Price ($100)
   x             y
   5            85
   4           103
   6            70
   5            82
   5            89
   5            98
   6            66
   6            95
   2           169
   7            70
   7            48

With age as the predictor variable and price as the response variable, the regression equation for these data is ŷ = 195.47 − 20.26x, as we found in Chapter 14 on page 638. Recall that the regression equation can be used to predict the price of an Orion from its age. However, we cannot expect such predictions to be completely accurate because prices vary even for Orions of the same age.
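As a check on the equation quoted above, the following minimal sketch recomputes the least-squares intercept and slope from the Table 15.1 data. The function name `least_squares_line` and the variable names are illustrative choices, not notation from the text.

```python
# Age (yr) and price ($100) for the 11 Orions in Table 15.1.
ages = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
prices = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]


def least_squares_line(x, y):
    """Return (b0, b1), the intercept and slope of the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross-products and of squared deviations about the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares_line(ages, prices)
print(f"y-hat = {b0:.2f} + ({b1:.2f})x")  # y-hat = 195.47 + (-20.26)x

# Predicted price, in hundreds of dollars, of a 3-year-old Orion.
predicted_3yr = b0 + b1 * 3
print(f"Predicted price at age 3: {predicted_3yr:.2f}")
```

The prediction for a 3-year-old Orion is a single number, whereas actual prices at that age will scatter around it.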

For instance, the sample data in Table 15.1 include four 5-year-old Orions. Their prices are $8500, $8200, $8900, and $9800. We expect this variation in price for 5-year-old Orions because such cars generally have different mileages, interior conditions, paint quality, and so forth.

We use the population of all 5-year-old Orions to introduce some important regression terminology. The distribution of their prices is called the conditional distribution of the response variable “price” corresponding to the value 5 of the predictor variable “age.” Likewise, their mean price is called the conditional mean of the response variable “price” corresponding to the value 5 of the predictor variable “age.” Similar terminology applies to the standard deviation and other parameters.

Of course, there is a population of Orions for each age. The distribution, mean, and standard deviation of prices for that population are called the conditional distribution, conditional mean, and conditional standard deviation, respectively, of the response variable “price” corresponding to the value of the predictor variable “age.”
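The conditional mean and conditional standard deviation are population parameters, so they cannot be computed from Table 15.1. As a rough illustration only, the sketch below groups the sample prices by age and reports the sample mean and sample standard deviation within each group. These are sample analogues, and the names `prices_by_age` and `summary_by_age` are illustrative.

```python
from collections import defaultdict
from statistics import mean, stdev

# Age (yr) and price ($100) for the 11 Orions in Table 15.1.
ages = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
prices = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]

# Collect the observed prices for each distinct age.
prices_by_age = defaultdict(list)
for age, price in zip(ages, prices):
    prices_by_age[age].append(price)

# Sample analogues of the conditional mean and standard deviation.
summary_by_age = {}
for age in sorted(prices_by_age):
    group = prices_by_age[age]
    # A standard deviation needs at least two observations.
    sd = stdev(group) if len(group) > 1 else None
    summary_by_age[age] = (len(group), mean(group), sd)
    print(f"age {age}: n={len(group)}, mean={mean(group):.1f}, sd={sd}")
```

For the four 5-year-old Orions, the sample mean price is 88.5 (that is, $8850). Ages with a single car in the sample, such as 2 and 4, give no information about spread, which is one reason the regression model pools information across ages.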

The terminology of conditional distributions, means, and standard deviations is used in general for any predictor variable and response variable. Using that terminology, we now state the conditions required for applying inferential methods in regression analysis.

670 CHAPTER 15 Inferential Methods in Regression and Correlation

KEY FACT 15.1 Assumptions (Conditions) for Regression Inferences

1. Population regression line: There are constants β0 and β1 such that, for each value x of the predictor variable, the conditional mean of the response variable is β0 + β1x.

2. Equal standard deviations: The conditional standard deviations of the response variable are the same for all values of the predictor variable. We denote this common standard deviation σ.†

3. Normal populations: For each value of the predictor variable, the conditional distribution of the response variable is a normal distribution.

4. Independent observations: The observations of the response variable are independent of one another.

What Does It Mean?

Assumptions 1–3 require that there are constants β0, β1, and σ so that, for each value x of the predictor variable, the conditional distribution of the response variable, y, is a normal distribution with mean β0 + β1x and standard deviation σ. These assumptions are often referred to as the regression model.

Note: We refer to the line y = β0 + β1x, on which the conditional means of the response variable lie, as the population regression line and to its equation as the population regression equation.

The inferential procedures in regression are robust to moderate violations of Assumptions 1–3 for regression inferences. In other words, the inferential procedures work reasonably well provided the variables under consideration don’t violate any of those assumptions too badly.

EXAMPLE 15.1 Assumptions for Regression Inferences

Age and Price of Orions For Orions, with age as the predictor variable and price as the response variable, what would it mean for the regression-inference Assumptions 1–3 to be satisfied? Display those assumptions graphically.

Solution Satisfying regression-inference Assumptions 1–3 requires that there are constants β0, β1, and σ so that for each age, x, the prices of all Orions of that age are normally distributed with mean β0 + β1x and standard deviation σ. Thus the prices of all 2-year-old Orions must be normally distributed with mean β0 + β1 · 2 and standard deviation σ, the prices of all 3-year-old Orions must be normally distributed with mean β0 + β1 · 3 and standard deviation σ, and so on.

To display the assumptions for regression inferences graphically, let’s first consider Assumption 1. This assumption requires that for each age, the mean price of all Orions of that age lies on the line y = β0 + β1x, as shown in Fig. 15.1.

Assumptions 2 and 3 require that the price distributions for the various ages of Orions are all normally distributed with the same standard deviation, σ. Figure 15.2 illustrates those two assumptions for the price distributions of 2-, 5-, and 7-year-old Orions. The shapes of the three normal curves in Fig. 15.2 are identical because normal distributions that have the same standard deviation have the same shape.

Assumptions 1–3 for regression inferences, as they pertain to the variables age and price of Orions, can be portrayed graphically by combining Figs. 15.1 and 15.2 into a three-dimensional graph, as shown in Fig. 15.3. Whether those assumptions actually hold remains to be seen.
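To make Assumptions 1–3 concrete, the sketch below simulates prices from a regression model. The values β0 = 195, β1 = −20, and σ = 12 are hypothetical, chosen only for illustration, because the true population parameters for Orions are unknown. For each age, prices are drawn from a normal distribution with mean β0 + β1x and the same standard deviation σ.

```python
import random
from statistics import mean, stdev

# Hypothetical population parameters (assumed for illustration only).
BETA0 = 195.0   # y-intercept of the population regression line
BETA1 = -20.0   # slope of the population regression line
SIGMA = 12.0    # common conditional standard deviation


def simulate_prices(age, n, rng):
    """Draw n prices for cars of the given age under Assumptions 1-3."""
    conditional_mean = BETA0 + BETA1 * age           # Assumption 1
    return [rng.gauss(conditional_mean, SIGMA)       # Assumptions 2 and 3
            for _ in range(n)]


rng = random.Random(0)  # fixed seed so the run is repeatable
simulated = {}
for age in (2, 5, 7):
    sample = simulate_prices(age, 10_000, rng)
    simulated[age] = (mean(sample), stdev(sample))
    print(f"age {age}: mean ~ {simulated[age][0]:.1f}, "
          f"sd ~ {simulated[age][1]:.1f}")
```

With large simulated samples, the mean at each age should fall close to the line β0 + β1x, while the standard deviation stays near σ regardless of age, which mirrors what Figs. 15.1 through 15.3 depict.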

Exercise 15.17 on page 678

†The condition of equal standard deviations is called homoscedasticity. When that condition fails, we have what is called heteroscedasticity.

FIGURE 15.1 Population regression line

[Figure: the line y = β0 + β1x, with age (yr) on the horizontal axis and price ($100) on the vertical axis. Two points on the line are marked: β0 + β1 · 3, the mean price of all 3-year-old Orions, and β0 + β1 · 6, the mean price of all 6-year-old Orions.]

FIGURE 15.2 Price distributions for 2-, 5-, and 7-year-old Orions under Assumptions 2 and 3 (The means shown for the three normal distributions reflect Assumption 1)

[Figure: three normal curves of identical shape, showing the prices of 2-year-old Orions centered at β0 + β1 · 2, the prices of 5-year-old Orions centered at β0 + β1 · 5, and the prices of 7-year-old Orions centered at β0 + β1 · 7.]

FIGURE 15.3 Graphical portrayal of Assumptions 1–3 for regression inferences pertaining to age and price of Orions

[Figure: three-dimensional graph with age (yr) on the x-axis and price ($100) on the y-axis. The population regression line y = β0 + β1x passes through the means of the normal distributions of prices for 2-, 5-, and 7-year-old Orions. The normal distributions all have the same standard deviation, σ.]

Estimating the Regression Parameters

Suppose that we are considering two variables, x and y, for which the assumptions for regression inferences are met. Then there are constants β0, β1, and σ so that, for each value x of the predictor variable, the conditional distribution of the response variable is a normal distribution with mean β0 + β1x and standard deviation σ.


Because the parameters β0, β1, and σ are usually unknown, we must estimate them from sample data. We use the y-intercept and slope of a sample regression line as point estimates of the y-intercept and slope, respectively, of the population regression line; that is, we use b0 and b1 to estimate β0 and β1, respectively. We note that b0 is an unbiased estimator of β0 and that b1 is an unbiased estimator of β1.

Equivalently, we use a sample regression line to estimate the unknown population regression line. Of course, a sample regression line ordinarily will not be the same as the population regression line, just as a sample mean generally will not equal the population mean. In Fig. 15.4, we illustrate this situation for the Orion example. Although the population regression line is unknown, we have drawn it to illustrate the difference between the population regression line and a sample regression line.

FIGURE 15.4 Population regression line and sample regression line for age and price of Orions

[Figure: age (yr) on the horizontal axis and price ($100) on the vertical axis. The solid line is the population regression line y = β0 + β1x (unknown). The dashed line is the sample regression line ŷ = b0 + b1x = 195.47 − 20.26x (computed from sample data).]

In Fig. 15.4, the sample regression line (the dashed line) is the best approximation that can be made to the population regression line (the solid line) by using the sample data in Table 15.1 on page 669. A different sample of Orions would almost certainly yield a different sample regression line.
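The sampling variability of the sample regression line, and the unbiasedness of b0 and b1, can be explored by simulation. The sketch below assumes hypothetical population values β0 = 195, β1 = −20, and σ = 12 (the true values are unknown), reuses the ages from Table 15.1, and repeatedly draws new prices under the regression model. It is an illustration under those assumptions, not a proof.

```python
import random
from statistics import mean

# Hypothetical population parameters (assumed for illustration only).
BETA0, BETA1, SIGMA = 195.0, -20.0, 12.0

# Ages (yr) of the 11 Orions in Table 15.1, held fixed across samples.
ages = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]


def least_squares_line(x, y):
    """Return (b0, b1), the intercept and slope of the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


def draw_sample_line(rng):
    """Simulate one sample of prices and fit its sample regression line."""
    prices = [rng.gauss(BETA0 + BETA1 * age, SIGMA) for age in ages]
    return least_squares_line(ages, prices)


rng = random.Random(1)  # fixed seed so the run is repeatable
lines = [draw_sample_line(rng) for _ in range(5000)]

# Individual sample lines differ from one another...
for b0, b1 in lines[:3]:
    print(f"sample line: y-hat = {b0:.2f} + ({b1:.2f})x")

# ...but their intercepts and slopes average out near BETA0 and BETA1.
mean_b0 = mean(b0 for b0, _ in lines)
mean_b1 = mean(b1 for _, b1 in lines)
print(f"average b0 ~ {mean_b0:.2f}, average b1 ~ {mean_b1:.2f}")
```

Each simulated sample yields its own line, yet the averages of the intercepts and slopes over many samples settle close to the assumed population values, which is what unbiasedness means in practice.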

What Does It Mean?

Roughly speaking, the standard error of the estimate indicates how much, on average, the predicted values of the response variable differ from the observed values of the response variable.

The statistic used to obtain a point estimate for the common conditional standard deviation σ is called the standard error of the estimate.
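This excerpt ends before the formula is stated, so the sketch below assumes the usual definition for simple linear regression: the standard error of the estimate is the square root of SSE / (n − 2), where SSE is the sum of squared differences between observed and predicted values. The variable names are illustrative.

```python
from math import sqrt

# Age (yr) and price ($100) for the 11 Orions in Table 15.1.
ages = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
prices = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n

# Least-squares slope and intercept of the sample regression line.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices))
s_xx = sum((x - x_bar) ** 2 for x in ages)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Error sum of squares: squared gaps between observed and predicted prices.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ages, prices))

# Assumed definition: divide by n - 2 because two parameters were estimated.
s_e = sqrt(sse / (n - 2))
print(f"SSE = {sse:.1f}, s_e = {s_e:.2f}")
```

For the Orion data this gives a value of about 12.58 in units of $100, so predicted prices differ from observed prices by very roughly $1258 on average.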
