Chapter 14
Simple Linear Regression

Learning Objectives

Understand how regression analysis can be used to develop an equation that estimates mathematically how two variables are related.

Understand the differences between the regression model, the regression equation, and the estimated regression equation.

Know how to fit an estimated regression equation to a set of sample data based upon the least squares method.

Be able to determine how good a fit is provided by the estimated regression equation and compute the sample correlation coefficient from the regression analysis output.

Understand the assumptions necessary for statistical inference and be able to test for a significant relationship.

Know how to develop confidence interval estimates of y given a specific value of x in both the case of a mean value of y and an individual value of y.

Learn how to use a residual plot to make a judgement as to the validity of the regression assumptions.

Know the definition of the following terms: independent and dependent variable; simple linear regression; regression model; regression equation and estimated regression equation; scatter diagram; coefficient of determination; standard error of the estimate; confidence interval; prediction interval; residual plot.

© 2010 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.

Solutions:

1. a. [Scatter diagram of y versus x]

b. There appears to be a positive linear relationship between x and y.

c. Many different straight lines can be drawn to provide a linear approximation of the relationship between x and y; in part (d) we will determine the equation of a straight line that “best” represents the relationship according to the least squares criterion.

d. \(\bar{x} = \sum x_i / n = 15/5 = 3\), \(\bar{y} = \sum y_i / n = 40/5 = 8\)

\(\sum (x_i - \bar{x})(y_i - \bar{y}) = 26\), \(\sum (x_i - \bar{x})^2 = 10\)

\(b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \dfrac{26}{10} = 2.6\)

\(b_0 = \bar{y} - b_1 \bar{x} = 8 - (2.6)(3) = 0.2\)

\(\hat{y} = 0.2 + 2.6x\)

e. \(\hat{y} = 0.2 + 2.6(4) = 10.6\)

2. a. [Figure not reproduced]

b. There appears to be a negative linear relationship between x and y.

c. Many different straight lines can be drawn to provide a linear approximation of the relationship between x and y; in part (d) we will determine the equation of a straight line that “best” represents the relationship according to the least squares criterion.

d. \(\bar{x} = \sum x_i / n = 55/5 = 11\), \(\bar{y} = \sum y_i / n = 175/5 = 35\)

\(\sum (x_i - \bar{x})(y_i - \bar{y}) = -540\), \(\sum (x_i - \bar{x})^2 = 180\)

\(b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \dfrac{-540}{180} = -3\)

\(b_0 = \bar{y} - b_1 \bar{x} = 35 - (-3)(11) = 68\)

\(\hat{y} = 68 - 3x\)

e. \(\hat{y} = 68 - 3(10) = 38\)

3. a. [Figure not reproduced]

b. \(\bar{x} = \sum x_i / n = 50/5 = 10\), \(\bar{y} = \sum y_i / n = 83/5 = 16.6\)

\(\sum (x_i - \bar{x})(y_i - \bar{y}) = 171\), \(\sum (x_i - \bar{x})^2 = 190\)

\(b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \dfrac{171}{190} = 0.9\)

\(b_0 = \bar{y} - b_1 \bar{x} = 16.6 - (0.9)(10) = 7.6\)

\(\hat{y} = 7.6 + 0.9x\)

c. \(\hat{y} = 7.6 + 0.9(6) = 13\)

4. a. [Scatter diagram of Weight versus Height]

b. There appears to be a positive linear relationship between x = height and y = weight.

c. Many different straight lines can be drawn to provide a linear approximation of the relationship between x and y; in part (d) we will determine the equation of a straight line that “best” represents the relationship according to the least squares criterion.

d. \(\bar{x} = \sum x_i / n = 325/5 = 65\), \(\bar{y} = \sum y_i / n = 585/5 = 117\)

\(\sum (x_i - \bar{x})(y_i - \bar{y}) = 110\), \(\sum (x_i - \bar{x})^2 = 20\)

\(b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \dfrac{110}{20} = 5.5\)

\(b_0 = \bar{y} - b_1 \bar{x} = 117 - (5.5)(65) = -240.5\)

\(\hat{y} = -240.5 + 5.5x\)

e. \(\hat{y} = -240.5 + 5.5x = -240.5 + 5.5(63) = 106\) pounds

5. a. [Figure not reproduced]

b. There appears to be a positive relationship between price and rating.
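The slope and intercept in each least squares solution above come from the same two formulas, \(b_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\) and \(b_0 = \bar{y} - b_1 \bar{x}\). A minimal Python sketch of that arithmetic, using the sums reported in the first four solutions, is given here. The function name fit_from_sums is hypothetical, and n = 5 is inferred from the reported means rather than read from the data sets.

```python
def fit_from_sums(sum_x, sum_y, n, s_xy, s_xx):
    """Return (b0, b1) for y-hat = b0 + b1*x from summary sums.

    s_xy is the sum of (x_i - x_bar)(y_i - y_bar);
    s_xx is the sum of (x_i - x_bar)^2.
    """
    x_bar = sum_x / n
    y_bar = sum_y / n
    b1 = s_xy / s_xx           # least squares slope
    b0 = y_bar - b1 * x_bar    # least squares intercept
    return b0, b1


# (sum of x, sum of y, n, s_xy, s_xx) as reported in the first four solutions.
# n = 5 is an inference from the reported means, not a value taken from raw data.
reported_sums = [
    (15, 40, 5, 26, 10),
    (55, 175, 5, -540, 180),
    (50, 83, 5, 171, 190),
    (325, 585, 5, 110, 20),
]

for position, sums in enumerate(reported_sums, start=1):
    b0, b1 = fit_from_sums(*sums)
    print(f"Solution {position}: b0 = {b0:.1f}, b1 = {b1:.1f}")
```

The sketch works from summary sums only because the raw data sets are not reproduced in these solutions; with the raw observations, the two sums would be computed first and passed in the same way.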
The sign that says “Quality: You Get What You Pay For” does fairly reflect the price-quality relationship for ellipticals.

c. Let x = price ($) and y = rating.

\(\bar{x} = \sum x_i / n = 15{,}000/8 = 1875\), \(\bar{y} = \sum y_i / n = 592/8 = 74\)

\(\sum (x_i - \bar{x})(y_i - \bar{y}) = 68{,}900\), \(\sum (x_i - \bar{x})^2 = 8{,}155{,}000\)

\(b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \dfrac{68{,}900}{8{,}155{,}000} = .008449\)

\(b_0 = \bar{y} - b_1 \bar{x} = 74 - (.008449)(1875) = 58.158\)

\(\hat{y} = 58.158 + .008449x\)

d. \(\hat{y} = 58.158 + .008449x = 58.158 + .008449(1500) = 70.83\), or approximately 71

6. a. [Figure not reproduced]

b. There appears to be a negative linear relationship between x = miles and y = sales price. If the car has higher miles, the sales price tends to be lower.

c. \(\bar{x} = \sum x_i / n = 874/10 = 87.4\), \(\bar{y} = \sum y_i / n = 66.4/10 = 6.64\)

\(\sum (x_i - \bar{x})(y_i - \bar{y}) = -135.66\), \(\sum (x_i - \bar{x})^2 = 5152.4\)

\(b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \dfrac{-135.66}{5152.40} = -.02633\)

\(b_0 = \bar{y} - b_1 \bar{x} = 6.64 - (-.02633)(87.4) = 8.9412\)

\(\hat{y} = 8.9412 - .02633x\)

d. The slope of the estimated regression equation is -.02633. Thus, a one unit increase in the value of x will result in a decrease in the estimated value of y equal to .02633. Because the data were recorded in thousands, every additional 1000 miles on the car’s odometer will result in a $26.33 decrease in the estimated price.

e. \(\hat{y} = 8.9412 - .02633(100) = 6.3\), or $6300

7. a. [Scatter diagram of Annual Sales ($1000s) versus Years of Experience]

b. Let x = years of experience and y = annual sales ($1000s).

\(\bar{x} = \sum x_i / n = 70/10 = 7\), \(\bar{y} = \sum y_i / n = 1080/10 = 108\)

\(\sum (x_i - \bar{x})(y_i - \bar{y}) = 568\), \(\sum (x_i - \bar{x})^2 = 142\)

\(b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \dfrac{568}{142} = 4\)

\(b_0 = \bar{y} - b_1 \bar{x} = 108 - (4)(7) = 80\)

\(\hat{y} = 80 + 4x\)

c. \(\hat{y} = 80 + 4x = 80 + 4(9) = 116\), or $116,000

8. a. [Figure not reproduced]

b. The scatter diagram and the slope of
the estimated regression equation indicate a negative linear relationship between x = temperature rating and y = price. Thus, it appears that sleeping bags with a lower temperature rating cost more than sleeping bags with a higher temperature rating. In other words, it costs more to stay warmer.

c. \(\bar{x} = \sum x_i / n = 209/11 = 19\), \(\bar{y} = \sum y_i / n = 2849/11 = 259\)

\(\sum (x_i - \bar{x})(y_i - \bar{y}) = -10{,}090\), \(\sum (x_i - \bar{x})^2 = 1912\)

\(b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \dfrac{-10{,}090}{1912} = -5.2772\)

\(b_0 = \bar{y} - b_1 \bar{x} = 259 - (-5.2772)(19) = 359.2668\)

\(\hat{y} = 359.2668 - 5.2772x\)

d. \(\hat{y} = 359.2668 - 5.2772x = 359.2668 - 5.2772(20) = 253.72\)

Thus, the estimate of the price of a sleeping bag with a temperature rating of 20 is approximately $254.

9. a. [Figure not reproduced]

b. There appears to be a positive linear relationship between x = price and y = score.

c. \(\bar{x} = \sum x_i / n = 2638/10 = 263.8\), \(\bar{y} = \sum y_i / n = 672/10 = 67.2\)

\(\sum (x_i - \bar{x})(y_i - \bar{y}) = 14{,}601.40\), \(\sum (x_i - \bar{x})^2 = 258{,}695.60\)

\(b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \dfrac{14{,}601.40}{258{,}695.60} = .05644\)

\(b_0 = \bar{y} - b_1 \bar{x} = 67.2 - (.05644)(263.8) = 52.311\)

\(\hat{y} = 52.311 + .05644x\)

d. The slope is .05644. For a $100 higher price, the score can be expected to increase 100(.05644) = 5.644, or about 6 points.

e. \(\hat{y} = 52.311 + .05644(225) = 65\)

c. The residual plot leads us to question the assumption of a linear relationship between square footage and price. Therefore, even though the relationship is very significant (p-value = .000), using the estimated regression equation to make predictions of the price for a house with square footage beyond the range of the data is not recommended.

50. a. The Minitab output follows:

The regression equation is
Y = 66.1 + 0.402 X

Predictor     Coef   SE Coef     T      p
Constant     66.10     32.06  2.06  0.094
X           0.4023    0.2276  1.77  0.137

S = 12.62    R-sq = 38.5%    R-sq(adj) = 26.1%

Analysis of Variance
SOURCE          DF      SS     MS     F      p
Regression       1   497.2  497.2  3.12  0.137
Residual Error   5   795.7  159.1
Total            6  1292.9

Unusual Observations
Obs.    X       Y     Fit  SEFit  Residual  St.Resid
1     135  145.00  120.42   4.87     24.58     2.11R

R denotes an observation with a large standardized residual.

The standardized residuals are: 2.11, -1.08, .14, -.38, -.78, -.04, -.41

The first observation appears to be an outlier since it has a large standardized residual.

b. [Standardized residual plot: Standardized Residual versus Fitted Value]

The standardized residual plot indicates that the observation x = 135, y = 145 may be an outlier; note that this observation has a standardized residual of 2.11.

c. The scatter diagram is shown below.

[Scatter diagram not reproduced]

The scatter diagram also indicates that the observation x = 135, y = 145 may be an outlier; the implication is that for simple linear regression an outlier can be identified by looking at the scatter diagram.

51. a. The Minitab output is shown below:

The regression equation is
Y = 13.0 + 0.425 X

Predictor     Coef   SE Coef     T      p
Constant    13.002     2.396  5.43  0.002
X           0.4248    0.2116  2.01  0.091

S = 3.181    R-sq = 40.2%    R-sq(adj) = 30.2%

Analysis of Variance
SOURCE          DF      SS     MS     F      p
Regression       1   40.78  40.78  4.03  0.091
Residual Error   6   60.72  10.12
Total            7  101.50

Unusual Observations
Obs.     X      Y    Fit  Stdev.Fit  Residual  St.Resid
7     12.0  24.00  18.10       1.20      5.90     2.00R
8     22.0  19.00  22.35       2.78     -3.35    -2.16RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

The standardized residuals are: 1.00, .41, .01, .48, .25, .65, 2.00, -2.16

The last two observations in the data set appear to be outliers since the standardized residuals for these observations are 2.00 and -2.16, respectively.

b. Using Minitab, we obtained the following leverage values: .28, .24, .16, .14, .13, .14, .14, .76

Minitab identifies an observation as having high leverage if \(h_i > 6/n\); for these data, 6/n = 6/8 = .75. Since the leverage for the observation x = 22, y = 19 is .76, Minitab would identify observation 8 as a high leverage point. Thus, we conclude that observation 8 is an influential observation.

c. The scatter diagram indicates that the observation x = 22, y = 19 is an influential observation.

52. a. The Minitab output is shown below:

The regression equation is
Shipment = 4.09 + 0.196 Media$

Predictor      Coef  SE Coef     T      P
Constant      4.089    2.168  1.89  0.096
MediaExp    0.19552  0.03635  5.38  0.001

S = 5.044    R-Sq = 78.3%    R-Sq(adj) = 75.6%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       1  735.84  735.84  28.93  0.001
Residual Error   8  203.51   25.44
Total            9  939.35

Unusual Observations
Obs  Media$  Shipment    Fit  SE Fit  Residual  St Resid
1       120     36.30  27.55    3.30      8.75     2.30R

R denotes an observation with a large standardized residual.

b. Minitab identifies observation 1 as having a large standardized residual; thus, we would consider observation 1 to be an outlier.

53. a. The Minitab output is shown below:

The regression
equation is Price = 28.0 + 0.173 Volume

Predictor      Coef  SE Coef     T      P
Constant     27.958    4.521  6.18  0.000
Volume      0.17289  0.07804  2.22  0.036

S = 17.53    R-Sq = 17.0%    R-Sq(adj) = 13.5%

Analysis of Variance
Source          DF      SS      MS     F      P
Regression       1  1508.4  1508.4  4.91  0.036
Residual Error  24  7376.1   307.3
Total           25  8884.5

Unusual Observations
Obs  Volume  Price    Fit  SE Fit  Residual  St Resid
22      230  40.00  67.72   15.40    -27.72    -3.31RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

The Minitab output identifies observation 22 as having a large standardized residual and as an observation whose x value gives it large influence. The following residual plot verifies these observations.

b. [Standardized residual plot: Standardized Residual versus Fitted Value]

54. a. The scatter diagram does indicate potential outliers and/or influential observations. For example, the data for the Washington Redskins, New England Patriots, and the Dallas Cowboys not only have the three highest revenues, they also have the highest team values.

b. A portion of the Minitab output follows:

The regression equation is
Value = - 252 + 5.83 Revenue

Predictor     Coef  SE Coef      T      P
Constant    -252.1    130.8  -1.93  0.064
Revenue     5.8317   0.5863   9.95  0.000

S = 87.2441    R-Sq = 76.7%    R-Sq(adj) = 76.0%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       1  753008  753008  98.93  0.000
Residual Error  30  228346    7612
Total           31  981354

Unusual Observations
Obs  Revenue   Value     Fit  SE Fit  Residual  St Resid
         269  1612.0  1316.6    31.8     295.4     3.64R
19       282  1324.0  1392.5    38.6     -68.5    -0.88 X
21       214  1178.0   995.9    16.0     182.1     2.12R
22       213  1170.0   990.1    16.2     179.9     2.10R
32       327  1538.0  1654.9    63.7    -116.9    -1.96 X

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

c. The Minitab output indicates that there are five unusual observations:

The observation for the Dallas Cowboys is an outlier because it has a large standardized residual.
Observation 19 (New England Patriots) is an influential observation because it has high leverage.
Observation 21 (New York Giants) is an outlier because it has a large standardized residual.
Observation 22 (New York Jets) is an outlier because it has a large standardized residual.
Observation 32 (Washington Redskins) is an influential observation because it has high leverage.

55. No. Regression or correlation analysis can never prove that two variables are causally related.

56. The estimate of a mean value is an estimate of the average of all y values associated with the same x. The estimate of an individual y value is an estimate of only one of the y values associated with a particular x.

57. The purpose of testing whether \(\beta_1 = 0\) is to determine whether or not there is a significant relationship between x and y. However, rejecting \(\beta_1 = 0\) does not necessarily imply a good fit.
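The difference between significance and goodness of fit can be checked against the Minitab output for solutions 53 and 54 above. The sketch below recomputes the coefficient of determination and the F statistic from the printed sums of squares. The function name anova_summary is hypothetical, and the regression degrees of freedom are taken to be 1, as in any simple linear regression.

```python
def anova_summary(ssr, sse, df_error, df_regression=1):
    """Return (r_squared, f_statistic) from regression ANOVA sums of squares."""
    sst = ssr + sse              # total sum of squares
    r_squared = ssr / sst        # coefficient of determination
    msr = ssr / df_regression    # mean square regression
    mse = sse / df_error         # mean square error
    return r_squared, msr / mse


# Sums of squares and error degrees of freedom as printed in the Minitab output.
outputs = {
    "Price on Volume (solution 53)": (1508.4, 7376.1, 24),
    "Value on Revenue (solution 54)": (753008, 228346, 30),
}

for label, (ssr, sse, df_error) in outputs.items():
    r_squared, f_statistic = anova_summary(ssr, sse, df_error)
    print(f"{label}: r-squared = {r_squared:.3f}, F = {f_statistic:.2f}")
```

Both relationships are significant at the .05 level according to the p-values in the output (.036 and .000), yet the first explains only about 17% of the variability in the dependent variable while the second explains about 77%.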
For example, if \(\beta_1 = 0\) is rejected and \(r^2\) is low, there is a statistically significant relationship between x and y but the fit is not very good.

58. a. The Minitab output is shown below:

The regression equation is
Price = 9.26 + 0.711 Shares

Predictor     Coef  SE Coef     T      P
Constant     9.265    1.099  8.43  0.000
Shares      0.7105   0.1474  4.82  0.001

S = 1.419    R-Sq = 74.4%    R-Sq(adj) = 71.2%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       1  46.784  46.784  23.22  0.001
Residual Error   8  16.116   2.015
Total            9  62.900

b. Since the p-value corresponding to F = 23.22 is .001 < α = .05, the relationship is significant.

c. \(r^2 = .744\); a good fit. The least squares line explained 74.4% of the variability in Price.

d. \(\hat{y} = 9.26 + .711(6) = 13.53\)

59. a. The Minitab output is shown below:

The regression equation is
Share Price ($) = - 2.99 + 0.911 Fair Value ($)

Predictor          Coef  SE Coef      T      P
Constant         -2.987    5.791  -0.52  0.610
Fair Value ($)  0.91128  0.09783   9.31  0.000

S = 12.0064    R-Sq = 76.9%    R-Sq(adj) = 76.1%

Analysis of Variance
Source          DF     SS     MS      F      P
Regression       1  12507  12507  86.76  0.000
Residual Error  26   3748    144
Total           27  16255

\(\hat{y}\) = -2.987 + .91128 Fair Value ($)

b. Significant relationship: p-value = .000 < α = .05

c. \(\hat{y}\) = -2.987 + .91128 Fair Value ($) = -2.987 + .91128(50) = 42.577, or approximately $42.58

d. The estimated regression equation should provide a good estimate because \(r^2 = 0.769\).

60. a. The scatter diagram indicates a positive linear relationship between the two variables. Online universities with higher retention rates tend to have higher graduation rates.

b. The Minitab output follows:

The regression equation is
GR(%) = 25.4 + 0.285 RR(%)

Predictor      Coef  SE Coef     T      P
Constant     25.423    3.746  6.79  0.000
RR(%)       0.28453  0.06063  4.69  0.000

S = 7.45610    R-Sq = 44.9%    R-Sq(adj) = 42.9%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       1  1224.3  1224.3  22.02  0.000
Residual Error  27  1501.0    55.6
Total           28  2725.3

Unusual Observations
Obs  RR(%)  GR(%)    Fit  SE Fit  Residual  St Resid
        51  25.00  39.93    1.44    -14.93    -2.04R
         4  28.00  26.56    3.52      1.44     0.22 X

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

c. Because the p-value = .000 < α = .05, the relationship is significant.

d. The estimated regression equation is able to explain 44.9% of the variability in the graduation rate based upon the linear relationship with the retention rate. It is not a great fit, but given the type of data, the fit is reasonably good.

e. In the Minitab output in part (b), South University is identified as an observation with a large standardized residual. With a retention rate of 51% it does appear that the graduation rate of 25% is low as compared to the results for other online universities. The president of South University should be concerned after looking at the data. Using the estimated regression equation, we estimate that the graduation rate at South University should be 25.4 + .285(51) = 40%.

f. In the Minitab output in part (b), the University of Phoenix is identified as an observation whose x value gives it large influence. With a retention rate of only 4%, the president of the University of Phoenix should be concerned after looking at the data.

61. The Minitab output is shown below:

The regression equation is
Expense = 10.5 + 0.953 Usage

Predictor     Coef  SE Coef     T      p
Constant    10.528    3.745  2.81  0.023
X           0.9534   0.1382  6.90  0.000

S = 4.250    R-sq = 85.6%    R-sq(adj) = 83.8%

Analysis of Variance
SOURCE          DF       SS      MS      F      p
Regression       1   860.05  860.05  47.62  0.000
Residual Error   8   144.47   18.06
Total            9  1004.53

Fit    Stdev.Fit  95% C.I.          95% P.I.
39.13  1.49       (35.69, 42.57)    (28.74, 49.52)

a. \(\hat{y}\) = 10.528 + .9534 Usage

b. Since the p-value corresponding to F = 47.62 = .000