The Model with Two Independent Variables
We begin with some simple examples to show how multiple regression analysis can be used to solve problems that cannot be solved by simple regression.
The first example is a simple variation of the wage equation introduced in Chapter 2 for obtaining the effect of education on hourly wage:
wage = β0 + β1educ + β2exper + u,  (3.1)

where exper is years of labor market experience. Thus, wage is determined by the two explanatory or independent variables, education and experience, and by other unobserved factors, which are contained in u. We are still primarily interested in the effect of educ on wage, holding fixed all other factors affecting wage; that is, we are interested in the parameter β1.
Compared with a simple regression analysis relating wage to educ, equation (3.1) effectively takes exper out of the error term and puts it explicitly in the equation. Because exper appears in the equation, its coefficient, β2, measures the ceteris paribus effect of exper on wage, which is also of some interest.
Not surprisingly, just as with simple regression, we will have to make assumptions about how u in (3.1) is related to the independent variables, educ and exper. However, as we will see in Section 3.2, there is one thing of which we can be confident: because (3.1) contains experience explicitly, we will be able to measure the effect of education on wage, holding experience fixed. In a simple regression analysis, which puts exper in the error term, we would have to assume that experience is uncorrelated with education, a tenuous assumption.
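To make the contrast concrete, here is a minimal sketch of estimating (3.1) by ordinary least squares in Python. The data file, its column names, and the use of the statsmodels package are illustrative assumptions, not part of the text:

    # Sketch: OLS estimation of equation (3.1) on a hypothetical data set.
    # The file name and columns (wage, educ, exper) are assumptions.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("wage_data.csv")

    # Multiple regression: the educ coefficient estimates the effect of
    # education on wage holding experience fixed.
    multiple = smf.ols("wage ~ educ + exper", data=df).fit()

    # Simple regression: exper stays in the error term, so the educ
    # coefficient also absorbs any association between educ and exper.
    simple = smf.ols("wage ~ educ", data=df).fit()

    print(multiple.params)
    print(simple.params)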
As a second example, consider the problem of explaining the effect of per student spending (expend) on the average standardized test score (avgscore) at the high school level. Suppose that the average test score depends on funding, average family income (avginc), and other unobservables:
avgscore = β0 + β1expend + β2avginc + u.  (3.2)

The coefficient of interest for policy purposes is β1, the ceteris paribus effect of expend on avgscore. By including avginc explicitly in the model, we are able to control for its effect on avgscore. This is likely to be important because average family income tends to be correlated with per student spending: spending levels are often determined by both property and local income taxes. In simple regression analysis, avginc would be included in the error term, which would likely be correlated with expend, causing the OLS estimator of β1 in the two-variable model to be biased.
In the two previous examples, we have shown how observable factors other than the variable of primary interest [educ in equation (3.1) and expend in equation (3.2)] can be included in a regression model. Generally, we can write a model with two independent variables as
y = β0 + β1x1 + β2x2 + u,  (3.3)

where β0 is the intercept, β1 measures the change in y with respect to x1, holding other factors fixed, and β2 measures the change in y with respect to x2, holding other factors fixed.
Multiple regression analysis is also useful for generalizing functional relationships between variables. As an example, suppose family consumption (cons) is a quadratic function of family income (inc):
cons = β0 + β1inc + β2inc² + u,  (3.4)

where u contains other factors affecting consumption. In this model, consumption depends on only one observed factor, income; so it might seem that it can be handled in a simple regression framework. But the model falls outside simple regression because it contains two functions of income, inc and inc² (and therefore three parameters, β0, β1, and β2). Nevertheless, the consumption function is easily written as a regression model with two independent variables by letting x1 = inc and x2 = inc².
Mechanically, there will be no difference in using the method of ordinary least squares (introduced in Section 3.2) to estimate equations as different as (3.1) and (3.4). Each equation can be written as (3.3), which is all that matters for computation. There is, however, an important difference in how one interprets the parameters. In equation (3.1), β1 is the ceteris paribus effect of educ on wage. The parameter β1 has no such interpretation in (3.4). In other words, it makes no sense to measure the effect of inc on cons while holding inc² fixed, because if inc changes, then so must inc²! Instead, the change in consumption with respect to the change in income (the marginal propensity to consume) is approximated by

Δcons/Δinc ≈ β1 + 2β2inc.

See Appendix A for the calculus needed to derive this equation. In other words, the marginal effect of income on consumption depends on β2 as well as on β1 and the level of income. This example shows that, in any particular application, the definitions of the independent variables are crucial. But for the theoretical development of multiple regression, we can be vague about such details. We will study examples like this more completely in Chapter 6.
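As a rough illustration of this point, the following sketch fits the quadratic model (3.4) on simulated data and evaluates β1 + 2β2inc at one income level; all numerical values are hypothetical:

    # Sketch: quadratic consumption model (3.4) on simulated data.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    inc = rng.uniform(10, 100, size=500)          # hypothetical income
    cons = 2 + 0.8 * inc - 0.002 * inc**2 + rng.normal(0, 3, size=500)

    # Two independent variables: x1 = inc, x2 = inc squared.
    X = sm.add_constant(np.column_stack([inc, inc**2]))
    b0, b1, b2 = sm.OLS(cons, X).fit().params

    # Marginal propensity to consume at inc = 50: b1 + 2*b2*inc.
    print("MPC at inc = 50:", b1 + 2 * b2 * 50)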
In the model with two independent variables, the key assumption about how u is related to x1 and x2 is
E(u|x1,x2) = 0.  (3.5)
The interpretation of condition (3.5) is similar to the interpretation of Assumption SLR.4 for simple regression analysis. It means that, for any values of x1 and x2 in the population, the average unobservable is equal to zero. As with simple regression, the important part of the assumption is that the expected value of u is the same for all combinations of x1 and x2; that this common value is zero is no assumption at all as long as the intercept β0 is included in the model (see Section 2.1).
How can we interpret the zero conditional mean assumption in the previous examples?
In equation (3.1), the assumption is E(u|educ,exper) = 0. This implies that other factors affecting wage are not related on average to educ and exper. Therefore, if we think innate ability is part of u, then we will need average ability levels to be the same across all combinations of education and experience in the working population. This may or may not be true, but, as we will see in Section 3.3, this is the question we need to ask in order to determine whether the method of ordinary least squares produces unbiased estimators.
The example measuring student performance [equation (3.2)] is similar to the wage equation. The zero conditional mean assumption is E(u|expend,avginc) = 0, which means that other factors affecting test scores (school or student characteristics) are, on average, unrelated to per student funding and average family income.
When applied to the quadratic consumption function in (3.4), the zero conditional mean assumption has a slightly different interpretation. Written literally, equation (3.5) becomes E(u|inc,inc²) = 0. Since inc² is known when inc is known, including inc² in the expectation is redundant: E(u|inc,inc²) = 0 is the same as E(u|inc) = 0. Nothing is wrong with putting inc² along with inc in the expectation when stating the assumption, but E(u|inc) = 0 is more concise.
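One way to see what assumption (3.5) buys us is a small simulation: when u has zero mean given x1 and x2, OLS estimates center on the true parameters across repeated samples. The numbers below are hypothetical choices, not anything estimated in the text:

    # Sketch: Monte Carlo check that OLS centers on the true betas
    # when E(u|x1,x2) = 0. All parameter values are hypothetical.
    import numpy as np

    rng = np.random.default_rng(1)
    beta = np.array([1.0, 0.5, -0.3])             # true beta0, beta1, beta2
    estimates = []
    for _ in range(1000):
        x1 = rng.normal(size=200)
        x2 = 0.6 * x1 + rng.normal(size=200)      # regressors may be correlated
        u = rng.normal(size=200)                  # mean zero given x1, x2
        y = beta[0] + beta[1] * x1 + beta[2] * x2 + u
        X = np.column_stack([np.ones(200), x1, x2])
        estimates.append(np.linalg.lstsq(X, y, rcond=None)[0])

    print(np.mean(estimates, axis=0))             # close to (1.0, 0.5, -0.3)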
The Model with k Independent Variables
Once we are in the context of multiple regression, there is no need to stop with two independent variables. Multiple regression analysis allows many observed factors to affect y.
In the wage example, we might also include amount of job training, years of tenure with the current employer, measures of ability, and even demographic variables like number of siblings or mother’s education. In the school funding example, additional variables might include measures of teacher quality and school size.
The general multiple linear regression model (also called the multiple regression model) can be written in the population as
y = β0 + β1x1 + β2x2 + β3x3 + … + βkxk + u,  (3.6)

where β0 is the intercept, β1 is the parameter associated with x1, β2 is the parameter associated with x2, and so on. Since there are k independent variables and an intercept, equation (3.6) contains k + 1 (unknown) population parameters. For shorthand purposes, we will sometimes refer to the parameters other than the intercept as slope parameters, even though this is not always literally what they are. [See equation (3.4), where neither β1 nor β2 is itself a slope, but together they determine the slope of the relationship between consumption and income.]
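Mechanically, nothing changes as k grows: the design matrix simply gains columns, and there are k + 1 parameters to estimate. A minimal sketch with randomly generated placeholder data:

    # Sketch: the general model (3.6) for arbitrary k. The design matrix
    # holds a column of ones plus k regressors, so OLS estimates k + 1
    # parameters. Data and parameter values are placeholders.
    import numpy as np

    rng = np.random.default_rng(3)
    n, k = 500, 4
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
    beta = np.array([1.0, 0.2, -0.5, 0.8, 0.0])   # k + 1 true parameters
    y = X @ beta + rng.normal(size=n)

    # OLS from the normal equations: solve (X'X) b = X'y.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta_hat)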
The terminology for multiple regression is similar to that for simple regression and is given in Table 3.1. Just as in simple regression, the variable u is the error term or disturbance. It contains factors other than x1, x2, …, xk that affect y. No matter how many explanatory variables we include in our model, there will always be factors we cannot include, and these are collectively contained in u.
TABLE 3.1
Terminology for Multiple Regression

    y                      x1, x2, …, xk
    Dependent Variable     Independent Variables
    Explained Variable     Explanatory Variables
    Response Variable      Control Variables
    Predicted Variable     Predictor Variables
    Regressand             Regressors

QUESTION 3.1
A simple model to explain city murder rates (murdrate) in terms of the probability of conviction (prbconv) and average sentence length (avgsen) is

murdrate = β0 + β1prbconv + β2avgsen + u.

What are some factors contained in u? Do you think the key assumption (3.5) is likely to hold?

When applying the general multiple regression model, we must know how to interpret the parameters. We will get plenty of practice now and in subsequent chapters, but it is useful at this point to be reminded of some things we already know. Suppose that CEO salary (salary) is related to firm sales (sales) and CEO tenure (ceoten) with the firm by
log(salary) = β0 + β1log(sales) + β2ceoten + β3ceoten² + u.  (3.7)

This fits into the multiple regression model (with k = 3) by defining y = log(salary), x1 = log(sales), x2 = ceoten, and x3 = ceoten². As we know from Chapter 2, the parameter β1 is the (ceteris paribus) elasticity of salary with respect to sales. If β3 = 0, then 100β2 is approximately the ceteris paribus percentage increase in salary when ceoten increases by one year. When β3 ≠ 0, the effect of ceoten on salary is more complicated. We will postpone a detailed treatment of general models with quadratics until Chapter 6.
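A sketch of how (3.7) might be estimated and interpreted in Python follows; the data file and its variable names are assumptions for illustration:

    # Sketch: estimating equation (3.7). File and column names
    # (salary, sales, ceoten) are hypothetical.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    ceo = pd.read_csv("ceosal.csv")
    model = smf.ols(
        "np.log(salary) ~ np.log(sales) + ceoten + I(ceoten ** 2)",
        data=ceo,
    ).fit()
    b = model.params

    # beta1: elasticity of salary with respect to sales.
    print(b["np.log(sales)"])

    # Approximate % change in salary from one more year of tenure,
    # evaluated at ceoten = 10: 100*(beta2 + 2*beta3*10).
    print(100 * (b["ceoten"] + 2 * b["I(ceoten ** 2)"] * 10))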
Equation (3.7) provides an important reminder about multiple regression analysis. The term "linear" in multiple linear regression model means that equation (3.6) is linear in the parameters, βj. Equation (3.7) is an example of a multiple regression model that, while linear in the βj, is a nonlinear relationship between salary and the variables sales and ceoten. Many applications of multiple linear regression involve nonlinear relationships among the underlying variables.
The key assumption for the general multiple regression model is easy to state in terms of a conditional expectation:
E(u|x1,x2, …, xk) = 0.  (3.8)

At a minimum, equation (3.8) requires that all factors in the unobserved error term be uncorrelated with the explanatory variables. It also means that we have correctly accounted for the functional relationships between the explained and explanatory variables. Any problem that causes u to be correlated with any of the independent variables causes (3.8) to fail. In Section 3.3, we will show that assumption (3.8) implies that OLS is unbiased and will derive the bias that arises when a key variable has been omitted from the equation. In Chapters 15 and 16, we will study other reasons that might cause (3.8) to fail and show what can be done in cases where it does fail.
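To illustrate the failure case, here is a simulation sketch in which an omitted factor (labeled "ability" by analogy with the wage example) is correlated with the included regressor, so the OLS slope no longer centers on the true value; all numbers are hypothetical:

    # Sketch: simulated failure of (3.8). An omitted factor correlated
    # with x1 ends up in u, and the OLS slope is biased upward here.
    # All parameter values are hypothetical.
    import numpy as np

    rng = np.random.default_rng(2)
    slopes = []
    for _ in range(1000):
        ability = rng.normal(size=300)
        x1 = 0.5 * ability + rng.normal(size=300)  # x1 correlated with ability
        y = 1 + 0.4 * x1 + 0.7 * ability + rng.normal(size=300)
        X = np.column_stack([np.ones(300), x1])    # ability omitted: it joins u
        slopes.append(np.linalg.lstsq(X, y, rcond=None)[0][1])

    print(np.mean(slopes))  # well above the true 0.4 (bias of about +0.28)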