Business Analytics: Data Analysis and Decision Making
Chapter 11
Regression Analysis: Statistical Inference
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.

Introduction
Two basic problems are discussed in this chapter:
Population regression model
  Inferring its characteristics (that is, its intercept and slope terms) from the corresponding terms estimated by least squares
  Determining which explanatory variables belong in the equation
  Inferring whether there is any population regression equation worth pursuing
Prediction
  Predicting values of the dependent variable for new observations
  Calculating prediction intervals to measure the accuracy of the predictions

The Statistical Model (slide 1 of 7)
To perform statistical inference in a regression context, a statistical model is required; that is, we must first make several assumptions about the population.
These assumptions represent an idealization of reality and are never likely to be entirely satisfied for the population in any real study.
From a practical point of view, all we can ask is that they represent a close approximation to reality.
If the assumptions are grossly violated, statistical inferences that are based on these assumptions should be viewed with suspicion.

The Statistical Model (slide 2 of 7)
Regression assumptions:
1. There is a population regression line. It joins the means of the dependent variable for all values of the explanatory variables. For any fixed values of the explanatory variables, the mean of the errors is zero.
2. For any values of the explanatory variables, the variance (or standard deviation) of the
dependent variable is a constant, the same for all such values.
3. For any values of the explanatory variables, the dependent variable is normally distributed.
4. The errors are probabilistically independent.

The Statistical Model (slide 3 of 7)
The first assumption is probably the most important.
It implies that for some set of explanatory variables, there is an exact linear relationship in the population between the means of the dependent variable and the values of the explanatory variables.
Equation for population regression line joining means: $\mu_{Y \mid X_1, \ldots, X_k} = \alpha + \beta_1 X_1 + \cdots + \beta_k X_k$
α is the intercept term, and the βs are the slope terms. (Greek letters are used to denote that they are unobservable population parameters.)
Most individual Ys do not lie on the population regression line.
The vertical distance from any point to the line is an error.
Equation for population regression line with error: $Y = \alpha + \beta_1 X_1 + \cdots + \beta_k X_k + \varepsilon$

The Statistical Model (slide 4 of 7)
Assumption 2 concerns variation around the population regression line.
It states that the variation of the Ys about the regression line is the same, regardless of the values of the Xs.
The technical term for this property is homoscedasticity. A simpler term is constant error variance.
This assumption is often questionable: the variation in Y often increases as X increases.
Heteroscedasticity means that the variability of Y values is larger for some X values than for others. A simpler term for this is nonconstant error variance.
The easiest way to detect nonconstant error variance is through a visual inspection of a scatterplot.

The Statistical Model (slide 5 of
7)
Assumption 3 is equivalent to stating that the errors are normally distributed.
You can check this by forming a histogram (or a Q-Q plot) of the residuals.
If assumption 3 holds, the histogram should be approximately symmetric and bell-shaped, and the points of a Q-Q plot should be close to a 45-degree line.
If there is an obvious skewness or some other nonnormal property, this indicates a violation of assumption 3.

The Statistical Model (slide 6 of 7)
Assumption 4 requires probabilistic independence of the errors.
This assumption means that information on some of the errors provides no information on the values of the other errors.
For cross-sectional data, this assumption is usually taken for granted.
For time-series data, this assumption is often violated. This is because of a property called autocorrelation.
The Durbin-Watson statistic is one measure of autocorrelation and thus measures the extent to which assumption 4 is violated.

The Statistical Model (slide 7 of 7)
One other assumption is important for numerical calculations: no explanatory variable can be an exact linear combination of any other explanatory variables.
The violation occurs if one of the explanatory variables can be written as a weighted sum of several of the others. This is called exact multicollinearity.
If it exists, there is redundancy in the data.
A more common and serious problem is multicollinearity, where explanatory variables are highly, but not exactly, correlated.

Inferences about the Regression Coefficients
In the equation for the population regression line, α and
the βs are called the regression coefficients.
There is one other unknown constant in the model: the variance of the errors, labeled σ².

The choice of relevant explanatory variables is almost never obvious.
Two guiding principles are relevance and data availability.
One overriding principle is parsimony: to explain the most with the least.
It favors a model with fewer explanatory variables, assuming that this model explains the dependent variable almost as well as a model with additional explanatory variables.

Example 11.3 (Continued): Catalog Marketing.xlsx
Objective: To use StatTools’s Stepwise Regression procedure to analyze the HyTex data.
Solution: Choose either the forward, backward, or stepwise procedure from the Regression Type dropdown list in the Regression dialog box.
Specify Amount Spent as the dependent variable and select all of the other variables (besides Customer) as potential explanatory variables.
A sample of the stepwise output appears to the right.
The variables that enter or exit the equation are listed at the bottom of the output.

Outliers (slide 1 of 2)
An observation can be considered an outlier for one or more of the following reasons:
It has an extreme value for at least one variable.
Its value of the dependent variable is much larger or smaller than predicted by the regression line, and its residual is abnormally large in magnitude. An example of this type of outlier is shown below.

Outliers (slide 2 of 2)
Its residual is not only large in magnitude, but this point “tilts” the regression line toward it. This type of
outlier is called an influential point. An example of this type of outlier is shown below, on the left.
Its values of individual explanatory variables are not extreme, but they fall outside the general pattern of the other observations. An example of this type of outlier is shown below, on the right.
In most cases, the regression output will look “nicer” if you delete the outliers, but this is not necessarily appropriate.

Example 11.4: Bank Salaries.xlsx (slide 1 of 2)
Objective: To locate possible outliers in the bank salary data, and to see to what extent they affect the regression model.
Solution: Examine each variable for outliers, using box plots of the variables or scatterplots of the residuals versus the fitted values.

Example 11.4: Bank Salaries.xlsx (slide 2 of 2)
Then run the regression with and without the outlier.
The output with the outlier included is shown on the top right; the output with the outlier excluded is shown on the bottom right.

Violations of Regression Assumptions
There are three major issues related to violations of regression assumptions:
How to detect violations of the assumptions
  This is usually relatively easy, using scatterplots, histograms, time series graphs, and numerical measures.
What goes wrong if the violations are ignored
  This depends on the type of violation and its severity.
What to do about violations if they are detected
  This issue is the most difficult to resolve.
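
One of the numerical measures for detecting a violation is the Durbin-Watson statistic mentioned with the independence assumption. The sketch below is a minimal illustration, assuming the standard definition of the statistic: the sum of squared changes between successive residuals divided by the sum of squared residuals. The function name `durbin_watson` and both residual series are made up for illustration and do not come from any of the chapter's data files.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order.

    Assumed definition: sum of squared successive differences
    divided by the sum of squared residuals.
    """
    squared_changes = sum(
        (curr - prev) ** 2 for prev, curr in zip(residuals, residuals[1:])
    )
    squared_levels = sum(e ** 2 for e in residuals)
    return squared_changes / squared_levels


# Hypothetical residuals: neighbors tend to share the same sign.
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
# Hypothetical residuals: neighbors always have opposite signs.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

dw_persistent = durbin_watson(persistent)
dw_alternating = durbin_watson(alternating)

print(round(dw_persistent, 3))   # 0.667, well below 2
print(round(dw_alternating, 3))  # 3.333, well above 2
```

Values near 2 suggest little correlation between neighboring residuals. The persistent series falls well below 2, which is the pattern the chapter associates with positive autocorrelation.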

Nonconstant Error Variance
The second regression assumption (that the variance of the errors should be constant for all values of the explanatory variables) is almost always violated to some extent.
Mild violations do not have much effect on the validity of the regression output.
One common form of nonconstant error variance that should be dealt with is the fan-shape phenomenon.
It occurs when increases in a variable result in increases in variability.
It can cause an incorrect value for the standard error of estimate, so that confidence intervals and hypothesis tests for the regression coefficients are not valid.
There are two ways to deal with it:
  Use a different estimation method than least squares, called weighted least squares.
  Use a logarithmic transformation of the dependent variable.

Nonnormality of Residuals
The third regression assumption states that the error terms are normally distributed.
Check this assumption by forming a histogram of the residuals.
Unless the distribution of the residuals is severely nonnormal, the inferences made from the regression output are still approximately valid.
One form of nonnormality often encountered is skewness to the right.
This can often be remedied by the same logarithmic transformation of the dependent variable that remedies nonconstant error variance.

Autocorrelated Residuals
The fourth regression assumption states that the error terms are probabilistically independent, but this assumption is often violated for time series data.
The problem with time series data is that the residuals are often correlated with nearby residuals, a property called autocorrelation of residuals.
The most frequent type of autocorrelation is
positive autocorrelation.
If residuals separated by one time period are correlated, it is called lag 1 autocorrelation.
The Durbin-Watson (DW) statistic is a numerical measure used to check for lag 1 autocorrelation.
A DW statistic below 2 signals that nearby residuals are positively correlated with one another.
When the number of observations is about 30 and the number of explanatory variables is fairly small, then any DW statistic less than 1.2 warrants attention.

Example 11.1 (Continued): Overhead Costs.xlsx
Objective: To use the Durbin-Watson statistic to check whether there is any lag 1 autocorrelation in the residuals from the Bendrix regression model for overhead costs.
Solution: Run the usual multiple regression and check the graph of residuals versus fitted values.
Check for lag 1 autocorrelation in two ways: with the DW statistic and by examining the time series graph of the residuals.

Prediction (slide 1 of 4)
Once you have estimated a regression equation from a set of data, you might want to use it to predict the value of the dependent variable for new observations.
There are two types of prediction problems in regression:
  Predicting the value of the dependent variable for one or more individual members of the population
  Predicting the mean of the dependent variable for all members of the population with certain values of the explanatory variables
The second problem is inherently easier in the sense that the resulting prediction is bound to be more accurate.
When you predict a mean, there is a single source of error: the possibly inaccurate estimates of the regression coefficients.
When you predict an individual value, there are two sources of error: the inaccurate estimates of the
regression coefficients and the inherent variation of individual points around the regression line.

Prediction (slide 2 of 4)
Predictions for values of the Xs close to their means are likely to be more accurate than predictions for Xs far from their means.
Trying to predict for Xs beyond the range of the data set is called extrapolation, and it is quite risky.

Prediction (slide 3 of 4)
The point prediction, or best guess, is found by substituting the given values of the Xs into the estimated regression equation.
To measure the accuracy of the point predictions, calculate standard errors of prediction.
Standard error of prediction for a single Y: this error is approximately equal to the standard error of estimate.
Standard error of prediction for the mean Y: this error is approximately equal to the standard error of estimate divided by the square root of the sample size.

Prediction (slide 4 of 4)
These standard errors can be used to calculate a 95% prediction interval for an individual value and a 95% confidence interval for a mean value.
Go out a t-multiple of the relevant standard error on either side of the point prediction.
The term prediction interval (rather than confidence interval) is used for an individual value because an individual value of Y is not a population parameter. However, the interpretation is basically the same.

Example 11.1 (Continued): Overhead Costs.xlsx
Objective: To
predict Overhead at Bendrix for the next three months, given anticipated values of Machine Hours and Production Runs.
Solution: Suppose Bendrix expects the values of Machine Hours and Production Runs for the next three months to be 1430, 1560, 1520, and 35, 45, 40, respectively.
StatTools has the capability to provide predictions and 95% prediction intervals, but you must set up a second data set to capture the results.
It should have the same variable name headings, and it should include values of the explanatory variables to be used for prediction.
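
The two standard errors described on the Prediction slides can be sketched in code for the simplest case. This is a minimal sketch, not the StatTools procedure. It assumes a single explanatory variable and the usual textbook formulas for simple regression: the standard error for a mean is s_e * sqrt(1/n + (x0 - x̄)² / Σ(x_i - x̄)²), and the standard error for an individual value adds 1 under the square root. The data, the function names `fit_simple_regression` and `predict`, and the new X values are all made up for illustration. They are not the Bendrix data, which involve two explanatory variables. The t-multiple is taken from a t table rather than computed.

```python
import math


def fit_simple_regression(xs, ys):
    """Least-squares fit of y = a + b*x, plus quantities needed for prediction."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))  # standard error of estimate
    return {"a": a, "b": b, "s_e": s_e, "n": n, "x_bar": x_bar, "sxx": sxx}


def predict(fit, x_new, t_multiple):
    """Point prediction with both standard errors and both intervals."""
    point = fit["a"] + fit["b"] * x_new
    # Grows as x_new moves away from the mean of the observed xs.
    spread = 1 / fit["n"] + (x_new - fit["x_bar"]) ** 2 / fit["sxx"]
    se_mean = fit["s_e"] * math.sqrt(spread)            # predicting a mean
    se_individual = fit["s_e"] * math.sqrt(1 + spread)  # predicting one value
    return {
        "point": point,
        "se_mean": se_mean,
        "se_individual": se_individual,
        "ci_mean": (point - t_multiple * se_mean, point + t_multiple * se_mean),
        "pi_individual": (
            point - t_multiple * se_individual,
            point + t_multiple * se_individual,
        ),
    }


# Made-up data: 12 observations of one explanatory variable.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.8, 17.1, 19.2, 20.9, 23.1, 24.8]

fit = fit_simple_regression(xs, ys)
T_MULTIPLE = 2.228  # t table value for 95% intervals with n - 2 = 10 df

near = predict(fit, 6.5, T_MULTIPLE)  # at the mean of the xs
far = predict(fit, 20, T_MULTIPLE)    # extrapolation beyond the data

for label, result in (("near mean", near), ("extrapolated", far)):
    print(label, result["point"], result["ci_mean"], result["pi_individual"])
```

At the mean of the Xs, the standard error for the mean reduces to the standard error of estimate divided by the square root of the sample size, which matches the approximation on the slides. Both standard errors grow for the extrapolated point, and the prediction interval for an individual value is always wider than the confidence interval for the mean.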