It can be useful to examine several regression coefficients at the same time. For example, when assessing the effect of a categorical variable with c levels, we need to say something jointly about the c − 1 binary variables that enter the regression equation. To do this, Section 4.2.1 introduces a method for handling linear combinations of regression coefficients. Section 4.2.2 shows how to test several linear combinations and Section 4.2.3 presents other inference applications.
4.2.1 Sets of Regression Coefficients
Recall that our regression coefficients are specified by β = (β0, β1, . . . , βk)′, a (k+1)×1 vector. It will be convenient to express linear combinations of the regression coefficients using the notation Cβ, where C is a p×(k+1) matrix that is user-specified and depends on the application. Some applications involve estimating Cβ. Others involve testing whether Cβ equals a specific known value (denoted as d). We call H0: Cβ = d the general linear hypothesis.
To demonstrate the broad variety of applications in which sets of regression coefficients can be used, we now present a series of special cases.
Special Case 1: One Regression Coefficient. In Section 3.4, we investigated the importance of a single coefficient, say, βj. We may express this coefficient as Cβ by choosing p = 1 and C to be a 1×(k+1) vector with a one in the (j+1)st column and zeros otherwise. These choices result in
\[
C\beta = \begin{pmatrix} 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \vdots \\ \beta_k \end{pmatrix}
= \beta_j .
\]
Special Case 2: Regression Function. Here, we choose p = 1 and C to be a 1×(k+1) vector representing the transpose of a set of explanatory variables.
These choices result in
\[
C\beta = (x_0, x_1, \ldots, x_k)
\begin{pmatrix} \beta_0 \\ \vdots \\ \beta_k \end{pmatrix}
= \beta_0 x_0 + \beta_1 x_1 + \cdots + \beta_k x_k = \mathrm{E}\, y,
\]
the regression function.
Special Case 3: Linear Combination of Regression Coefficients. When p = 1, we use the convention that lowercase, bold letters are vectors and let C = c′ = (c0, . . . , ck). In this case, Cβ is a generic linear combination of regression coefficients
\[
C\beta = c'\beta = c_0\beta_0 + \cdots + c_k\beta_k .
\]
Special Case 4: Testing Equality of Regression Coefficients. Suppose that the interest is in testing H0: β1 = β2. For this purpose, let p = 1, c′ = (0, 1, −1, 0, . . . , 0), and d = 0. With these choices, we have
\[
C\beta = c'\beta = (0, 1, -1, 0, \ldots, 0)
\begin{pmatrix} \beta_0 \\ \vdots \\ \beta_k \end{pmatrix}
= \beta_1 - \beta_2 = 0,
\]
so that the general linear hypothesis reduces to H0: β1 = β2.
Special Case 5: Adequacy of the Model. It is customary in regression analysis to present a test of whether any of the explanatory variables are useful for explaining the response. Formally, this is a test of the null hypothesis H0: β1 = β2 = · · · = βk = 0. Note that, as a convention, one does not test whether the intercept is zero. To test this using the general linear hypothesis, we choose p = k, d = (0, . . . , 0)′ to be a k×1 vector of zeros, and C to be a k×(k+1) matrix such that
\[
C\beta =
\begin{pmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \vdots \\ \beta_k \end{pmatrix}
=
\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}
=
\begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}
= d.
\]
Special Case 6: Testing Portions of the Model. Suppose that we are interested in comparing a full regression function
\[
\mathrm{E}\, y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \beta_{k+1} x_{k+1} + \cdots + \beta_{k+p} x_{k+p}
\]
to a reduced regression function,
\[
\mathrm{E}\, y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k .
\]
Beginning with the full regression, we see that if the null hypothesis H0: βk+1 = · · · = βk+p = 0 holds, then we arrive at the reduced regression. To illustrate, the variables xk+1, . . . , xk+p may refer to several binary variables representing a categorical variable, and our interest is in whether the categorical variable is important. To test the importance of the categorical variable, we want to see whether the binary variables xk+1, . . . , xk+p jointly affect the dependent variable.
To test this using the general linear hypothesis, we choose d and C such that
\[
C\beta =
\begin{pmatrix}
0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\
0 & \cdots & 0 & 0 & 1 & \cdots & 0 \\
\vdots &  & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & 0 & 0 & \cdots & 1
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \vdots \\ \beta_k \\ \beta_{k+1} \\ \vdots \\ \beta_{k+p} \end{pmatrix}
=
\begin{pmatrix} \beta_{k+1} \\ \vdots \\ \beta_{k+p} \end{pmatrix}
=
\begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}
= d.
\]
From a list of k+p variables x1, . . . , xk+p, you may drop any p that you deem appropriate. The additional variables do not need to be the last p in the regression specification. Dropping xk+1, . . . , xk+p is for notational convenience only.
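To make the construction of C concrete, here is a brief numerical sketch in Python with NumPy; the coefficient vector beta and the value of k are made up purely for illustration and do not come from the text.

```python
import numpy as np

# Hypothetical coefficient vector beta = (beta_0, ..., beta_k)' with k = 4
beta = np.array([10.0, 2.0, -1.5, 0.3, 0.7])
k = len(beta) - 1

# Special Case 1: pick out a single coefficient, here beta_2 (j = 2)
j = 2
C1 = np.zeros((1, k + 1))
C1[0, j] = 1.0          # one in the (j+1)st column, zeros elsewhere
print(C1 @ beta)        # equals beta_2

# Special Case 4: test beta_1 = beta_2 via c' = (0, 1, -1, 0, ..., 0)
C4 = np.zeros((1, k + 1))
C4[0, 1], C4[0, 2] = 1.0, -1.0
print(C4 @ beta)        # equals beta_1 - beta_2

# Special Case 5: adequacy of the model; C is k x (k+1), a column of zeros
# next to an identity block, so C @ beta = (beta_1, ..., beta_k)'
C5 = np.hstack([np.zeros((k, 1)), np.eye(k)])
print(C5 @ beta)
```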
4.2.2 The General Linear Hypothesis
To recap, the general linear hypothesis can be stated as H0: Cβ = d. Here, C is a p×(k+1) matrix, d is a p×1 vector, and both C and d are user specified and depend on the application at hand. Although k+1 is the number of regression coefficients, p is the number of restrictions under H0 on these coefficients. (For those readers with knowledge of advanced matrix algebra, p is the rank of C.) This null hypothesis is tested against the alternative Ha: Cβ ≠ d. This may be obvious, but we do require p ≤ k+1 because we cannot test more constraints than there are free parameters.
To understand the basis for the testing procedure, we first recall some of the basic properties of the regression coefficient estimators described in Section 3.3.
Now, however, our goal is to understand properties of the linear combinations of regression coefficients specified by Cβ. A natural estimator of this quantity is Cb. It is easy to see that Cb is an unbiased estimator of Cβ, because E(Cb) = C E(b) = Cβ. Moreover, the variance is Var(Cb) = C Var(b) C′ = σ²C(X′X)⁻¹C′. To assess the difference between d, the hypothesized value of Cβ, and its estimated value, Cb, we use the following statistic:
\[
F\text{-ratio} = \frac{(Cb - d)' \left[ C (X'X)^{-1} C' \right]^{-1} (Cb - d)}{p\, s^2_{\text{full}}} . \qquad (4.1)
\]
Here, s²_full is the mean square error from the full regression model. Using the theory of linear models, it can be checked that the statistic F-ratio has an F-distribution with numerator degrees of freedom df1 = p and denominator degrees of freedom df2 = n − (k+1). Both the statistic and the theoretical distribution are named for R. A. Fisher, a renowned scientist and statistician who did much to advance statistics as a science in the early half of the twentieth century.
Like the normal and the t-distribution, the F-distribution is a continuous distribution. The F-distribution is the sampling distribution for the F-ratio and is proportional to the ratio of two sums of squares, each of which is positive or zero. Thus, unlike the normal distribution and the t-distribution, the F-distribution takes on only nonnegative values. Recall that the t-distribution is indexed by a single degree-of-freedom parameter. The F-distribution is indexed by two degree-of-freedom parameters: one for the numerator, df1, and one for the denominator, df2. Appendix A3.4 provides additional details about the F-distribution, including a graph and a distribution table.
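As a sketch of how equation (4.1) can be evaluated directly, the following Python code computes the F-ratio and its p-value on simulated data; the data, the choice of C and d, and all variable names are hypothetical and serve only to illustrate the matrix arithmetic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated data: n observations, k = 3 explanatory variables plus an intercept
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = np.array([1.0, 0.5, 0.0, 0.0])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

# Least squares estimates and mean square error from the full model
b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b
s2_full = resid @ resid / (n - (k + 1))

# General linear hypothesis H0: C beta = d, here beta_2 = beta_3 = 0 (p = 2)
C = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
d = np.zeros(2)
p = C.shape[0]

# Equation (4.1)
XtX_inv = np.linalg.inv(X.T @ X)
diff = C @ b - d
F_ratio = diff @ np.linalg.solve(C @ XtX_inv @ C.T, diff) / (p * s2_full)
p_value = stats.f.sf(F_ratio, p, n - (k + 1))
print(F_ratio, p_value)
```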
The test statistic in equation (4.1) is complex in form. Fortunately, there is an alternative that is simpler to implement and to interpret; this alternative is based on the extra sum of squares principle.
Procedure for Testing the General Linear Hypothesis
(i) Run the full regression and get the error sum of squares and mean square error, which we label as (Error SS)_full and s²_full, respectively.
(ii) Consider the model assuming the null hypothesis is true. Run a regression with this model and get the error sum of squares, which we label (Error SS)_reduced.
(iii) Calculate
\[
F\text{-ratio} = \frac{(\text{Error SS})_{\text{reduced}} - (\text{Error SS})_{\text{full}}}{p\, s^2_{\text{full}}} . \qquad (4.2)
\]
(iv) Reject the null hypothesis in favor of the alternative if the F-ratio exceeds an F-value. The F-value is a percentile from the F-distribution with df1 = p and df2 = n − (k+1) degrees of freedom. The percentile is one minus the significance level of the test. Following our notation with the t-distribution, we denote this percentile as F_{p, n−(k+1), 1−α}, where α is the significance level.
This procedure is commonly known as an F-test.
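The four-step procedure translates directly into code. The sketch below uses Python with statsmodels on simulated data; the data and variable names are made up, and only the two error sums of squares and the full-model mean square error are needed.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 2.0 + 1.0 * x1 + 0.3 * x2 + rng.normal(size=n)

# (i) Full model: intercept plus x1, x2, x3
X_full = sm.add_constant(np.column_stack([x1, x2, x3]))
fit_full = sm.OLS(y, X_full).fit()
error_ss_full = fit_full.ssr          # (Error SS)_full
s2_full = fit_full.mse_resid          # mean square error of the full model

# (ii) Reduced model under H0: coefficients of x2 and x3 are zero
X_reduced = sm.add_constant(x1)
fit_reduced = sm.OLS(y, X_reduced).fit()
error_ss_reduced = fit_reduced.ssr    # (Error SS)_reduced

# (iii) F-ratio from equation (4.2), with p = 2 restrictions
p = 2
f_ratio = (error_ss_reduced - error_ss_full) / (p * s2_full)

# (iv) Compare with the F-distribution percentile at a 5% significance level
df2 = int(fit_full.df_resid)
f_value = stats.f.ppf(0.95, p, df2)
print(f_ratio, f_value, f_ratio > f_value)
```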
Section 4.7.2 provides the mathematical underpinnings. To understand the extra-sum-of-squares principle, recall that the error sum of squares for the full model is determined to be the minimum value of
\[
SS(b_0^*, \ldots, b_k^*) = \sum_{i=1}^{n} \left( y_i - \left( b_0^* + b_1^* x_{i,1} + \cdots + b_k^* x_{i,k} \right) \right)^2 .
\]
Here, SS(b*_0, . . . , b*_k) is a function of b*_0, . . . , b*_k, and (Error SS)_full is the minimum over all possible values of b*_0, . . . , b*_k. Similarly, (Error SS)_reduced is the minimum error sum of squares under the constraints in the null hypothesis. Because there are fewer possibilities under the null hypothesis, we have
\[
(\text{Error SS})_{\text{full}} \le (\text{Error SS})_{\text{reduced}} . \qquad (4.3)
\]
To illustrate, consider our first special case, where H0: βj = 0. In this case, the difference between the full and the reduced models amounts to dropping a variable. A consequence of equation (4.3) is that, when adding variables to a regression model, the error sum of squares never goes up (and, in fact, usually goes down). Thus, adding variables to a regression model increases R², the coefficient of determination.
When adding variables to a regression model, the error sum of squares never goes up. The R² statistic never goes down.
How large a decrease in the error sum of squares is statistically significant?
Intuitively, one can view the F-ratio as the difference in the error sum of squares divided by the number of constraints, ((Error SS)_reduced − (Error SS)_full)/p, and then rescaled by the best estimate of the variance term, the s² from the full model. Under the null hypothesis, this statistic follows an F-distribution, and we can compare the test statistic to this distribution to see whether it is unusually large.
Using the relationship Regression SS = Total SS − Error SS, we can reexpress the difference in the error sum of squares as
\[
(\text{Error SS})_{\text{reduced}} - (\text{Error SS})_{\text{full}} = (\text{Regression SS})_{\text{full}} - (\text{Regression SS})_{\text{reduced}} .
\]
This difference is known as a Type III sum of squares. When testing the importance of a set of explanatory variables, xk+1, . . . , xk+p, in the presence of x1, . . . , xk, you will find that many statistical software packages compute this quantity directly in a single regression run. The advantage of this is that it allows the analyst to perform an F-test using a single regression run instead of two regression runs, as in our four-step procedure described previously.
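For instance, in Python one way to obtain the joint test for a categorical variable from a single fitted model is a Type III analysis-of-variance table in statsmodels. The sketch below is illustrative only; the pandas DataFrame df, with a numeric response y, a numeric predictor x1, and a categorical column group, is entirely hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data frame; in practice df would come from your own data
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "x1": rng.normal(size=150),
    "group": rng.choice(["a", "b", "c"], size=150),
})
df["y"] = 1.0 + 0.5 * df["x1"] + (df["group"] == "b") * 0.8 + rng.normal(size=150)

# One regression run; C(group) expands into binary variables automatically
fit = smf.ols("y ~ x1 + C(group)", data=df).fit()

# Type III sums of squares: the C(group) row gives the joint F-test
print(sm.stats.anova_lm(fit, typ=3))
```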
Example: Term Life Insurance, Continued. Before discussing the logic and the implications of the F-test, let us illustrate its use. In the term life insurance example, suppose that we want to understand the impact of marital status. Table 4.3 presented a mixed message in terms of t-ratios; sometimes they were statistically significant and sometimes not. It would be helpful to have a formal test to give a definitive answer, at least in terms of statistical significance.
Specifically, we consider a regression model using LNINCOME, EDUCATION, NUMHH, MAR0, and MAR2 as explanatory variables. The model equation is
\[
y = \beta_0 + \beta_1 \text{LNINCOME} + \beta_2 \text{EDUCATION} + \beta_3 \text{NUMHH} + \beta_4 \text{MAR0} + \beta_5 \text{MAR2} .
\]
Our goal is to test H0: β4 = β5 = 0.
(i) We begin by running a regression model with all k + p = 5 variables. The results were reported in Table 4.2, where we saw that (Error SS)_full = 615.62 and s²_full = (1.513)² = 2.289.
(ii) The next step is to run the reduced model without MAR0 and MAR2. This was done in Table 3.3 of Chapter 3, where we saw that (Error SS)_reduced = 630.43.
(iii) We then calculate the test statistic
\[
F\text{-ratio} = \frac{(\text{Error SS})_{\text{reduced}} - (\text{Error SS})_{\text{full}}}{p\, s^2_{\text{full}}} = \frac{630.43 - 615.62}{2 \times 2.289} = 3.235 .
\]
(iv) The fourth step compares the test statistic to an F-distribution with df1 = p = 2 and df2 = n − (k + p + 1) = 269 degrees of freedom. Using a 5% level of significance, it turns out that the 95th percentile is F-value ≈ 3.029. The corresponding p-value is Pr(F > 3.235) = 0.0409. At the 5% significance level, we reject the null hypothesis H0: β4 = β5 = 0.
This suggests that it is important to use marital status to understand term life insurance coverage, even in the presence of income, education, and number of household members.
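The arithmetic in steps (iii) and (iv) can be verified with a few lines of Python, using only the quantities quoted above from Tables 4.2 and 3.3; a brief sketch:

```python
from scipy import stats

error_ss_reduced = 630.43   # reduced model, without MAR0 and MAR2 (Table 3.3)
error_ss_full = 615.62      # full model (Table 4.2)
s2_full = 1.513 ** 2        # mean square error of the full model
p, df2 = 2, 269             # restrictions and residual degrees of freedom

f_ratio = (error_ss_reduced - error_ss_full) / (p * s2_full)
f_value = stats.f.ppf(0.95, p, df2)     # 95th percentile, about 3.03
p_value = stats.f.sf(f_ratio, p, df2)   # about 0.04

print(round(f_ratio, 3), round(f_value, 3), round(p_value, 4))
```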
Some Special Cases
The general linear hypothesis test is available when you can express one model as a subset of another. For this reason, it is useful to think of it as a device for comparing “smaller” to “larger” models. However, the smaller model must be a subset of the larger model. For example, the general linear hypothesis test cannot be used to compare the regression functions Ey = β0 + β7x7 versus Ey = β0 + β1x1 + β2x2 + β3x3 + β4x4. This is because the former, smaller function is not a subset of the latter, larger function.
The general linear hypothesis can be used in many instances, although its use is not always necessary. For example, suppose that we wish to test H0: βk = 0.
We have already seen that this null hypothesis can be examined using the t-ratio test. In this special case, it turns out that (t-ratio)² = F-ratio. Thus, these tests are equivalent for testing H0: βk = 0 versus Ha: βk ≠ 0. The F-test has the advantage that it works for more than one predictor, whereas the t-test has the advantage that one can consider one-sided alternatives. Thus, both tests are considered useful.
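The equivalence (t-ratio)² = F-ratio for a single coefficient is easy to check numerically; the following sketch uses statsmodels on simulated data, with made-up variable names.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 120
x1, x2 = rng.normal(size=(2, n))
y = 1.0 + 0.4 * x1 + 0.2 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

# t-ratio for the last coefficient versus the F-test of H0: beta_2 = 0
t_ratio = fit.tvalues[-1]
f_test = fit.f_test(np.array([[0.0, 0.0, 1.0]]))  # C = (0, 0, 1), d = 0
print(t_ratio ** 2, f_test.fvalue)                 # the two agree
```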
Dividing the numerator and denominator of equation (4.2) by Total SS, the test statistic can also be written as
\[
F\text{-ratio} = \frac{\left( R^2_{\text{full}} - R^2_{\text{reduced}} \right) / p}{\left( 1 - R^2_{\text{full}} \right) / \left( n - (k+1) \right)} . \qquad (4.4)
\]
The interpretation of this expression is that the F-ratio measures the drop in the coefficient of determination, R².
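Equation (4.4) says the same test can be computed from the two coefficients of determination alone. A small sketch in Python, where the helper name and all input values are hypothetical:

```python
from scipy import stats

def f_ratio_from_r2(r2_full, r2_reduced, p, n, k):
    """F-ratio of equation (4.4) from the full and reduced R^2 values."""
    return ((r2_full - r2_reduced) / p) / ((1.0 - r2_full) / (n - (k + 1)))

# Hypothetical inputs purely for illustration
f = f_ratio_from_r2(r2_full=0.36, r2_reduced=0.33, p=2, n=275, k=5)
print(f, stats.f.sf(f, 2, 275 - (5 + 1)))
```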
The expression in equation (4.2) is particularly useful for testing the adequacy of the model, our Special Case 5. In this case, p = k, and the regression sum of squares under the reduced model is zero. Thus, we have
\[
F\text{-ratio} = \frac{(\text{Regression SS})_{\text{full}} / k}{s^2_{\text{full}}} = \frac{(\text{Regression MS})_{\text{full}}}{(\text{Error MS})_{\text{full}}} .
\]
This test statistic is a regular feature of the ANOVA table for many statistical packages.
For example, in our term life insurance example, testing the adequacy of the model means evaluating H0: β1 = β2 = β3 = β4 = β5 = 0. From Table 4.2, the F-ratio is 68.66/2.29 = 29.98. With df1 = 5 and df2 = 269, we have that the F-value is approximately 2.248 and the corresponding p-value is Pr(F > 29.98) ≈ 0. This leads us to reject strongly the notion that the explanatory variables are not useful in understanding term life insurance coverage, reaffirming what we learned in the graphical and correlation analysis. Any other result would be surprising.
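Again, the quoted numbers can be reproduced directly; a quick check in Python using the ANOVA quantities cited from Table 4.2:

```python
from scipy import stats

regression_ms_full = 68.66   # (Regression MS)_full from Table 4.2
error_ms_full = 2.29         # (Error MS)_full, i.e., s^2, from Table 4.2
df1, df2 = 5, 269

f_ratio = regression_ms_full / error_ms_full   # about 29.98
f_value = stats.f.ppf(0.95, df1, df2)          # about 2.25
p_value = stats.f.sf(f_ratio, df1, df2)        # essentially zero

print(round(f_ratio, 2), round(f_value, 3), p_value)
```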
For another expression, dividing by Total SS, we may write
\[
F\text{-ratio} = \frac{R^2}{1 - R^2} \cdot \frac{n - (k+1)}{k} .
\]
Because both the F-ratio and R² are measures of model fit, it seems intuitively plausible that they are related in some fashion. A consequence of this relationship is the fact that as R² increases, so does the F-ratio, and vice versa. The F-ratio is used because its sampling distribution is known under a null hypothesis, so we can make statements about statistical significance. The R² measure is used because of the easy interpretations associated with it.
4.2.3 Estimating and Predicting Several Coefficients

Estimating Linear Combinations of Regression Coefficients
In some applications, the main interest is to estimate a linear combination of regression coefficients. To illustrate, recall that, in Section 3.5, we developed a regression function for an individual's charitable contributions (y) in terms of wages (x). In this function, there was an abrupt change in the function at x = 97,500. To model this, we defined the binary variable z to be zero if x < 97,500 and to be one if x ≥ 97,500, and the regression function Ey = β0 + β1x + β2z(x − 97,500). Thus, the marginal expected change in contributions per dollar wage change for wages in excess of 97,500 is ∂(Ey)/∂x = β1 + β2.
To estimate β1 + β2, a reasonable estimator is b1 + b2, which is readily available from standard regression software. In addition, we would also like to compute standard errors for b1 + b2 to be used, for example, in determining a confidence interval for β1 + β2. However, b1 and b2 are typically correlated, so that the calculation of the standard error of b1 + b2 requires estimation of the covariance between b1 and b2.
Estimating β1 + β2 is an example of our Special Case 3, which considers linear combinations of regression coefficients of the form c′β = c0β0 + c1β1 + · · · + ckβk. For our charitable contributions example, we would choose c1 = c2 = 1 and the other c's equal to zero.
To estimate c′β, we replace the vector of parameters by the vector of estimators and use c′b. To assess the reliability of this estimator, as in Section 4.2.2, we have that Var(c′b) = σ²c′(X′X)⁻¹c. Thus, we may define the estimated standard deviation, or standard error, of c′b to be
\[
se(c'b) = s \sqrt{c'(X'X)^{-1}c} .
\]
With this quantity, a 100(1−α)% confidence interval for c′β is
\[
c'b \pm t_{n-(k+1),\,1-\alpha/2}\; se(c'b) . \qquad (4.5)
\]
The confidence interval in equation (4.5) is valid under Assumptions F1–F5.
If we choose c to have a one in the (j+1)st row and zeros otherwise, then c′β = βj, c′b = bj, and
\[
se(b_j) = s \sqrt{(j+1)\text{st diagonal element of } (X'X)^{-1}} .
\]
Thus, (4.5) provides a theoretical basis for the individual regression coefficient confidence intervals introduced in Section 3.4’s equation (3.10) and generalizes it to arbitrary linear combinations of regression coefficients.
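As a sketch of how se(c′b) and the interval (4.5) might be computed, the code below estimates β1 + β2 on simulated data resembling the two-part wage specification; all data and names are hypothetical, and the same quantities could equally be read from a fitted model's coefficient covariance matrix.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 500
wages = rng.uniform(20_000, 200_000, size=n)
z = (wages >= 97_500).astype(float)

# Design matrix for Ey = b0 + b1*x + b2*z*(x - 97500); data are simulated
X = np.column_stack([np.ones(n), wages, z * (wages - 97_500)])
y = 200 + 0.01 * wages + 0.02 * z * (wages - 97_500) + rng.normal(scale=300, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ (X.T @ y)
resid = y - X @ b
s = np.sqrt(resid @ resid / (n - X.shape[1]))

# Linear combination c'beta = beta_1 + beta_2, so c = (0, 1, 1)'
c = np.array([0.0, 1.0, 1.0])
estimate = c @ b
se = s * np.sqrt(c @ XtX_inv @ c)

# 95% confidence interval from equation (4.5)
t_value = stats.t.ppf(0.975, n - X.shape[1])
print(estimate, (estimate - t_value * se, estimate + t_value * se))
```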
Another important application of equation (4.5) is the choice of c corresponding to a set of explanatory variables of interest, say, x∗ = (1, x∗1, x∗2, . . . , x∗k)′. These may correspond to an observation within the dataset or to a point outside the available data. The parameter of interest, c′β = x∗′β, is the expected response or the regression function at that point. Then, x∗′b provides a point estimator and equation (4.5) provides the corresponding confidence interval.
Prediction Intervals
Prediction is an inferential goal that is closely related to estimating the regression function at a point. Suppose that, when considering charitable contributions, we know an individual's wages (and thus whether wages are in excess of $97,500) and want to predict the amount of charitable contributions. In general, we assume that the set of explanatory variables x∗ is known and want to predict the corresponding response, y∗. This new response follows the assumptions as described in Section 3.2. Specifically, the expected response is Ey∗ = x∗′β, x∗ is nonstochastic, Var y∗ = σ², and y∗ is independent of {y1, . . . , yn} and normally distributed.
Under these assumptions, a 100(1−α)% prediction interval for y∗ is
\[
x_*' b \pm t_{n-(k+1),\,1-\alpha/2}\; s \sqrt{1 + x_*'(X'X)^{-1} x_*} . \qquad (4.6)
\]
Equation (4.6) generalizes the prediction interval introduced in Section 2.4.
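A sketch of equation (4.6) in code, again on simulated data; the design matrix, the new point x_star, and all numbers are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Simulated data with two explanatory variables; purely illustrative
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ (X.T @ y)
resid = y - X @ b
k_plus_1 = X.shape[1]
s = np.sqrt(resid @ resid / (n - k_plus_1))

# New point x* = (1, x1*, x2*) at which to predict the response
x_star = np.array([1.0, 0.3, -1.2])
point_prediction = x_star @ b

# 95% prediction interval from equation (4.6)
t_value = stats.t.ppf(0.975, n - k_plus_1)
half_width = t_value * s * np.sqrt(1.0 + x_star @ XtX_inv @ x_star)
print(point_prediction - half_width, point_prediction + half_width)
```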