© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.

Business Analytics: Data Analysis and Decision Making
Chapter 10
Regression Analysis: Estimating Relationships

Introduction

Regression analysis is the study of relationships between variables. There are two potential objectives of regression analysis: to understand how the world operates and to make predictions.

Two basic types of data are analyzed:
Cross-sectional data are usually gathered from approximately the same period of time from a population.
Time series data involve one or more variables that are observed at several, usually equally spaced, points in time. Time series variables are usually related to their own past values, a property called autocorrelation, which adds complications to the analysis.

In every regression study, there is a single variable that we are trying to explain or predict, called the dependent variable. It is also called the response variable or the target variable. To help explain or predict the dependent variable, we use one or more explanatory variables. They are also called independent or predictor variables.

If there is a single explanatory variable, the analysis is called simple regression. If there are several explanatory variables, it is called multiple regression.

Regression can be linear (straight-line relationships) or nonlinear (curved relationships). Many nonlinear relationships can be linearized mathematically.

Scatterplots: Graphing Relationships

Drawing scatterplots is a good way to begin regression analysis. A scatterplot is a graphical plot of two
variables, an X and a Y. If there is any relationship between the two variables, it is usually apparent from the scatterplot.

Example 10.1: Drugstore Sales.xlsx

Objective: To use a scatterplot to examine the relationship between promotional expenditures and sales at Pharmex.

Solution: Pharmex has collected data from 50 randomly selected metropolitan regions. There are two variables: Pharmex’s promotional expenditures as a percentage of those of the leading competitor (“Promote”) and Pharmex’s sales as a percentage of those of the leading competitor (“Sales”). A partial listing of the data is shown below.

Use Excel’s® Chart Wizard or the StatTools Scatterplot procedure to create a scatterplot. Sales is on the vertical axis and Promote is on the horizontal axis because the store believes that large promotional expenditures tend to “cause” larger values of sales.

Example 10.2: Overhead Costs.xlsx

Objective: To use scatterplots to examine the relationships among overhead, machine hours, and production runs at Bendrix.

Solution: The data file contains observations of overhead costs, machine hours, and number of production runs at Bendrix. Each observation (row) corresponds to a single month.

Examine scatterplots between each explanatory variable (Machine Hours and Production
Runs) and the dependent variable (Overhead).

Check for possible time series patterns by creating a time series graph for any of the variables. Check for relationships among the multiple explanatory variables (Machine Hours versus Production Runs).

Linear versus Nonlinear Relationships

Scatterplots are useful for detecting relationships that may not be obvious otherwise. The typical relationship you hope to see is a straight-line, or linear, relationship. This doesn’t mean that all points lie on a straight line, but that the points tend to cluster around a straight line. The scatterplot below illustrates a relationship that is clearly nonlinear.

Example 10.3 (continued): Bank Salaries.xlsx

The regression equations for Female and Male are shown graphically below.

Nonlinear Transformations

The general linear regression equation has the form:

Predicted Y = a + b1X1 + b2X2 + … + bkXk

It is linear in the sense that the right side of the equation is a constant plus a sum of products of constants and variables. The variables can be transformations of original variables.

Nonlinear transformations of variables are often used because of curvature detected in scatterplots. You can transform the dependent variable Y or any of the explanatory variables, the Xs, or you can do both.

Typical nonlinear transformations include: the natural logarithm,
the square root, the reciprocal, and the square.

Example 10.4: Cost of Power.xlsx

Objective: To see whether the cost of supplying electricity is a nonlinear function of demand, and if it is, what form the nonlinearity takes.

Solution: The data set lists the number of units of electricity produced (Units) and the total cost of producing these (Cost) for a 36-month period. Start with a scatterplot of Cost versus Units.

Next, request a scatterplot of the residuals from the linear fit versus the fitted values. The negative-positive-negative behavior of the residuals suggests a parabola, that is, a quadratic relationship with the square of Units included in the equation. Create a new variable (Units)^2 in the data set and then use multiple regression to estimate the equation for Cost with both Units and (Units)^2 included.

Use Excel’s Trendline option to superimpose a quadratic curve on the scatterplot. This curve is shown below, on the left. Finally, try a logarithmic fit by creating a new variable, Log(Units), and then regressing Cost against this variable. This curve is shown below, on the right.

One reason logarithmic transformations of variables are used so widely in regression analysis is that they are fairly easy to interpret. A logarithmic transformation of Y is often useful when the distribution of Y values is skewed to the right.

Example 10.3 (continued): Bank Salaries.xlsx

Objective: To reanalyze the bank salary data, now using the logarithm of salary as the dependent variable.

Solution: The distribution of salaries of the 208 employees shows some skewness to the right. First, create the Log(Salary) variable. Then run the regression, with Log(Salary) as the dependent variable and Female and Years as the explanatory variables.

The lessons from this example are important in general:

The R² values with Y and Log(Y) as dependent variables are not directly comparable. They are percentages explained of different variables.

The se (standard error of estimate) values with Y and Log(Y) as dependent variables are usually of totally different magnitudes. To make the se from the log equation comparable, you need to go through the procedure described in the example so that the residuals are in original units.

To interpret any term of the form bX in the log equation, you should first express b as a percentage. Then when X increases by one unit, the expected percentage change in Y is approximately this percentage b.

Constant Elasticity Relationships

A particular type of nonlinear relationship that has firm grounding in economic theory is the constant elasticity relationship. This is also called a multiplicative relationship. It has the form shown in the equation below:

Predicted Y = a × X1^b1 × X2^b2 × … × Xk^bk

The effect of a one-unit change in any X on Y depends on the levels of the other Xs in the equation. The dependent variable is expressed as a product of explanatory variables raised to powers. When any explanatory variable X changes by 1%, the predicted value of the dependent variable
changes by a constant percentage, regardless of the value of this X or the values of the other Xs.

Example 10.5: Car Sales.xlsx

Objective: To use logarithms of variables in a multiple regression to estimate a multiplicative relationship for automobile sales as a function of price, income, and interest rate.

Solution: The data set contains annual data on domestic auto sales in the United States. Variables include: Sales (in units), Price Index (consumer price index of transportation), Income (real disposable income), and Interest (prime rate). First, take natural logs of all four variables. Then run a multiple regression with Log(Sales) as the dependent variable and Log(Price Index), Log(Income), and Log(Interest) as the explanatory variables.

The resulting output is shown below.

Learning Curve Models

A learning curve relates the unit production time (or cost) to the cumulative volume of output since the production process first began. Empirical studies indicate that production times tend to decrease by a relatively constant percentage every time cumulative output doubles. This constant is often called the learning rate.

Equation for Learning Rate (where LN refers to the natural logarithm, and b is the exponent of cumulative output in the multiplicative model):

b = LN(learning rate) / LN(2)

Example 10.6: Learning Curve.xlsx

Objective: To use a multiplicative regression equation to estimate the learning rate for
production time.

Solution: The data set contains the times (in hours) to produce each batch of a new product at Presario Company.

First, check whether the multiplicative learning model is reasonable by creating a scatterplot of Log(Time) versus Log(Batch). The multiplicative model implies that it should be approximately linear. The relationship can then be estimated by regressing Log(Time) on Log(Batch).

The resulting equation is: [equation image not available]

The estimated learning rate satisfies the equation: [equation image not available]

Now solve for the learning rate (multiply through by LN(2) and then take antilogs).

Validation of the Fit

The fit from a regression analysis is often overly optimistic. To see if the regression equation will be successful in predicting new values of the dependent variable, split the original data into two subsets: one for estimation and one for validation.

A regression equation is estimated from the first subset. Then the values of the explanatory variables from the second subset are substituted into the equation to obtain predicted values for the dependent variable. Finally, these predicted values are compared to the known values of the dependent variable in the second subset.

If the agreement is good, there is reason to believe that the regression equation will predict well for the new data. This procedure is called validating the fit.

Example 10.2 (continued): Overhead Costs Validation.xlsx

Objective: To validate the original Bendrix regression for making predictions at another plant.

Solution: Bendrix would like
to predict overhead costs for another plant by using data on machine hours and production runs at this second plant. The first step is to see how well the regression from the first plant fits data from the other plant.
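
The validation procedure described above (estimate on one subset, predict on the other, compare predictions with known values) can be sketched in Python. This is a minimal illustration under stated assumptions, not the StatTools procedure used in the example: the data are synthetic stand-ins generated in the code (not the Overhead Costs Validation.xlsx file), the helper names fit_linear, predict, and fit_summary are hypothetical, and the two comparison measures (squared correlation between actual and predicted values, and root mean squared error) are one reasonable choice among several.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit of y = a + b1*x1 + ... + bk*xk; returns [a, b1, ..., bk]."""
    design = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs


def predict(coefs, X):
    """Substitute explanatory values into the estimated equation."""
    return coefs[0] + X @ coefs[1:]


def fit_summary(y, y_hat):
    """Squared correlation between actual and predicted values, and RMSE."""
    r = np.corrcoef(y, y_hat)[0, 1]
    rmse = float(np.sqrt(np.mean((y - y_hat) ** 2)))
    return float(r ** 2), rmse


# Synthetic stand-in data (assumption): 72 months of overhead driven by
# machine hours and production runs, plus random noise.
rng = np.random.default_rng(0)
n = 72
machine_hours = rng.uniform(1000, 2000, n)
production_runs = rng.uniform(20, 60, n)
overhead = (4000 + 40 * machine_hours + 900 * production_runs
            + rng.normal(0, 4000, n))
X = np.column_stack([machine_hours, production_runs])

# Split into an estimation subset and a validation subset.
est, val = slice(0, 36), slice(36, n)

# Estimate the equation from the first subset only.
coefs = fit_linear(X[est], overhead[est])

# Compare the fit on the estimation subset with the fit on the validation subset.
r2_est, rmse_est = fit_summary(overhead[est], predict(coefs, X[est]))
r2_val, rmse_val = fit_summary(overhead[val], predict(coefs, X[val]))

print(f"Estimation subset: R2 = {r2_est:.3f}, RMSE = {rmse_est:.0f}")
print(f"Validation subset: R2 = {r2_val:.3f}, RMSE = {rmse_val:.0f}")
```

If the validation measures are close to the estimation measures, the equation is likely to predict well for new data. A large drop in R² or a large rise in RMSE on the validation subset suggests the original fit was overly optimistic.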