Table 4.8 Several Models That Represent Combinations of One Factor and One Covariate

Model Description                                                        Notation
One factor ANOVA (no covariate model)                                    yij = μj + εij
Regression with constant intercept and slope (no factor model)           yij = β0 + β1xij + εij
Regression with variable intercept and constant slope                    yij = β0,j + β1xij + εij
  (analysis-of-covariance model)
Regression with constant intercept and variable slope                    yij = β0 + β1,jxij + εij
Regression with variable intercept and slope                             yij = β0,j + β1,jxij + εij
Figure 4.3 Plot of the expected response versus the covariate for the regression model with variable intercept and constant slope. (Parallel lines y = β0,1 + β1x, y = β0,2 + β1x, and y = β0,3 + β1x.)

Figure 4.4 Plot of the expected response versus the covariate for the regression model with constant intercept and variable slope. (Lines y = β0 + β1,1x, y = β0 + β1,2x, and y = β0 + β1,3x sharing a common intercept.)
We use the terminology factor for the categorical variable and covariate for the continuous variable.
Combining a Factor and Covariate
Let us begin with the simplest models that use a factor and a covariate. In Section 4.3, we introduced the one factor model yij = μj + εij. In Chapter 2, we introduced basic linear regression in terms of one continuous variable, or covariate, using yij = β0 + β1xij + εij. Table 4.8 summarizes different approaches that could be used to represent combinations of a factor and covariate.
We can interpret the regression with variable intercept and constant slope to be an additive model, because we are adding the factor effect, β0,j, to the covariate effect, β1xij. Note that we could also use the notation μj in lieu of β0,j to suggest the presence of a factor effect. This is also known as an analysis of covariance (ANCOVA) model. The regression with variable intercept and slope can be thought of as an interaction model. Here, both the intercept, β0,j, and the slope, β1,j, may vary by level of the factor. In this sense, we interpret the factor and covariate to be "interacting." The model with constant intercept and variable slope is typically not used in practice; it is included here for completeness. With this model, the factor and covariate interact only through the variable slope. Figures 4.3, 4.4, and 4.5 illustrate the expected responses of these models.
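To make the notation concrete, the following R sketch shows how each model in Table 4.8 could be fit with lm(). The data frame df and the variable names y, x, and group are hypothetical stand-ins for a response, a covariate, and a factor.

```r
# Hypothetical data frame `df` with response y, covariate x, and factor group
fit.anova  <- lm(y ~ group, data = df)      # one factor ANOVA (no covariate)
fit.slr    <- lm(y ~ x, data = df)          # constant intercept and slope (no factor)
fit.ancova <- lm(y ~ group + x, data = df)  # variable intercept, constant slope (ANCOVA)
fit.slopes <- lm(y ~ x:group, data = df)    # constant intercept, variable slope
fit.full   <- lm(y ~ group * x, data = df)  # variable intercept and slope (interaction)
```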
Figure 4.5 Plot of the expected response versus the covariate for the regression model with variable intercept and variable slope. (Lines y = β0,1 + β1,1x, y = β0,2 + β1,2x, and y = β0,3 + β1,3x.)

For each model presented in Table 4.8, parameter estimates can be calculated using the method of least squares. As usual, this means writing the expected
response, Eyij, as a function of known variables and unknown parameters. For the regression model with variable intercept and constant slope, the least squares estimates can be expressed compactly as

$$b_1 = \frac{\sum_{j=1}^{c}\sum_{i=1}^{n_j} (x_{ij}-\bar{x}_j)(y_{ij}-\bar{y}_j)}{\sum_{j=1}^{c}\sum_{i=1}^{n_j} (x_{ij}-\bar{x}_j)^2}$$

and b0,j = ȳj − b1 x̄j. Similarly, the least squares estimates for the regression model with variable intercept and slope can be expressed as

$$b_{1,j} = \frac{\sum_{i=1}^{n_j} (x_{ij}-\bar{x}_j)(y_{ij}-\bar{y}_j)}{\sum_{i=1}^{n_j} (x_{ij}-\bar{x}_j)^2}$$

and b0,j = ȳj − b1,j x̄j. With these parameter estimates, fitted values may be calculated.
For each model, fitted values are defined as the expected response with the unknown parameters replaced by their least squares estimates. For example, for the regression model with variable intercept and constant slope, the fitted values are ŷij = b0,j + b1xij.
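As an illustration of these formulas, the short R sketch below computes b1, the b0,j, and the fitted values directly for the variable intercept, constant slope model. The data frame df and the names y, x, and group are again hypothetical; the resulting fitted values should agree with fitted(lm(y ~ group + x, data = df)).

```r
# Direct computation of the variable-intercept, constant-slope estimates,
# assuming a data frame `df` with columns y, x, and group (hypothetical names)
xbar <- ave(df$x, df$group)   # group mean of x, aligned with each observation
ybar <- ave(df$y, df$group)   # group mean of y, aligned with each observation
b1   <- sum((df$x - xbar) * (df$y - ybar)) / sum((df$x - xbar)^2)
b0   <- tapply(df$y, df$group, mean) - b1 * tapply(df$x, df$group, mean)
yhat <- b0[as.character(df$group)] + b1 * df$x   # yhat_ij = b0,j + b1 * x_ij
```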
R Empirical Filename is "WiscHospCosts"

Example: Wisconsin Hospital Costs. We now study the impact of various predictors on hospital charges in the state of Wisconsin. Identifying predictors of hospital charges can provide direction for hospitals, government, insurers, and consumers in controlling these variables, which in turn leads to better control of hospital costs. The data for the year 1989 were obtained from the Office of Health Care Information, Wisconsin's Department of Health and Human Services. Cross-sectional data are used, which detail the 20 diagnosis related group (DRG) discharge costs for hospitals in Wisconsin, broken down into nine major health service areas and three types of providers (fee for service, health maintenance organization [HMO], and other). Even though there are 540 potential DRG, area, and payer combinations (20 × 9 × 3 = 540), only 526 combinations were actually realized in the 1989 dataset. Other predictor variables included the
logarithm of the total number of discharges (NO DSCHG) and total number of hospital beds (NUM BEDS) for each combination. The response variable is the logarithm of total hospital charges per number of discharges (CHGNUM). To streamline the presentation, we now consider only costs associated with three DRGs: DRG #209, DRG #391, and DRG #430.

Figure 4.6 Plot of natural logarithm of cost per discharge versus natural logarithm of the number of discharges. This plot suggests a misleading negative relationship. (Axes: Number of Discharges versus CHGNUM.)

Figure 4.7 Letter plot of natural logarithm of cost per discharge versus natural logarithm of the number of discharges, by DRG. Here, A is for DRG #209, B is for DRG #391, and C is for DRG #430.
The covariate, x, is the natural logarithm of the number of discharges. In ideal settings, hospitals with more patients enjoy lower costs because of economies of scale. In nonideal settings, hospitals may not have excess capacity; thus, hospitals with more patients have higher costs. One purpose of this analysis is to investigate the relationship between hospital costs and hospital utilization.

Recall that our measure of hospital charges is the logarithm of costs per discharge (y). The scatter plot in Figure 4.6 gives a preliminary idea of the relationship between y and x. We note that there appears to be a negative relationship between y and x.

The negative relationship between y and x suggested by Figure 4.6 is misleading and is induced by an omitted variable, the category of the cost (DRG). To see the joint effect of the categorical variable DRG and the continuous variable x, Figure 4.7 shows a plot of y versus x where the plotting symbols are codes for the level of the categorical variable. From this plot, we see that the level of cost varies by level of the factor DRG. Moreover, for each level of DRG, the slope between y and x is either zero or positive. The slopes are not negative, as suggested by Figure 4.6.
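Plots like Figures 4.6 and 4.7 might be reproduced along the following lines. This is a sketch only: the file name WiscHospCosts.csv and the column names CHGNUM, NO_DSCHG, and DRG are assumptions about how the data are stored.

```r
# Read the hospital cost data and keep the three DRGs of interest (assumed layout);
# if the stored variables are not already on the log scale, apply log() first
hosp <- read.csv("WiscHospCosts.csv")
hosp <- subset(hosp, DRG %in% c(209, 391, 430))
hosp$DRG <- factor(hosp$DRG)

# Scatter plot of y versus x (as in Figure 4.6)
plot(hosp$NO_DSCHG, hosp$CHGNUM,
     xlab = "Number of Discharges", ylab = "CHGNUM")

# Letter plot by DRG (as in Figure 4.7): A, B, C follow the sorted DRG levels
plot(hosp$NO_DSCHG, hosp$CHGNUM, type = "n",
     xlab = "Number of Discharges", ylab = "CHGNUM")
text(hosp$NO_DSCHG, hosp$CHGNUM, labels = LETTERS[as.integer(hosp$DRG)])
```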
Each of the five models defined in Table 4.8 was fit to this subset of the hospital case study. The summary statistics are in Table 4.9. For this dataset, there are n = 79 observations and c = 3 levels of the DRG factor. For each model, the model degrees of freedom is the number of model parameters minus one. The error degrees of freedom is the number of observations minus the number of model parameters.
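Continuing the hypothetical hosp data frame from the earlier sketch, the quantities in Table 4.9 (shown below) could be reproduced by fitting the five models and extracting the error degrees of freedom, error sum of squares, and R2 from each fit.

```r
# Fit the five models of Table 4.8 to the hospital cost data (assumed column names)
fits <- list(
  anova1 = lm(CHGNUM ~ DRG, data = hosp),            # one factor ANOVA
  slr    = lm(CHGNUM ~ NO_DSCHG, data = hosp),        # constant intercept and slope
  ancova = lm(CHGNUM ~ DRG + NO_DSCHG, data = hosp),  # variable intercept, constant slope
  slopes = lm(CHGNUM ~ NO_DSCHG:DRG, data = hosp),    # constant intercept, variable slope
  full   = lm(CHGNUM ~ DRG * NO_DSCHG, data = hosp)   # variable intercept and slope
)
sapply(fits, function(f) c(error.df = df.residual(f),
                           error.ss = deviance(f),
                           r.squared = summary(f)$r.squared))
```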
Table 4.9 Wisconsin Hospital Cost Models' Goodness of Fit

Model Description                                       Model df   Error df   Error SS   R2 (%)   Error MS
One factor ANOVA                                            2          76        9.396     93.3      0.124
Regression with constant intercept and slope                1          77      115.059     18.2      1.222
Regression with variable intercept and constant slope       3          75        7.482     94.7      0.100
Regression with constant intercept and variable slope       3          75       14.048     90.0      0.187
Regression with variable intercept and slope                5          73        5.458     96.1      0.075

Note: These models represent combinations of one factor and one covariate.

Using binary variables, each of the models in Table 4.8 can be written in a regression format. As we have seen in Section 4.2, when a model can be written as a subset of a larger model, we have formal testing procedures available to decide which model is more appropriate. To illustrate this testing procedure with our DRG example, from Table 4.9 and the associated plots, it seems clear that the DRG factor is important. Further, a t-test, not presented here, shows that the covariate x is important. Thus, let us compare the full model Eyij = β0,j + β1,jxij to the reduced model Eyij = β0,j + β1xij. In other words, is there a different slope for each DRG?
Using the notation from Section 4.2, we call the variable intercept and slope model the full model. Under the null hypothesis, H0: β1,1 = β1,2 = β1,3, we get the variable intercept, constant slope model. Thus, using the F-ratio in equation (4.2), we have

$$F\text{-ratio} = \frac{(\text{Error SS})_{\text{reduced}} - (\text{Error SS})_{\text{full}}}{p\, s^2_{\text{full}}} = \frac{7.482 - 5.458}{2(0.075)} = 13.535.$$

The 95th percentile from the F-distribution with df1 = p = 2 and df2 = (df)full = 73 is approximately 3.13. Thus, this test leads us to reject the null hypothesis and declare the alternative, the regression model with variable intercept and variable slope, to be valid.
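The same comparison can be carried out as a partial F-test with anova(), continuing with the fitted models from the earlier sketch.

```r
# Partial F-test comparing the reduced (common slope) and full (separate slopes) models
anova(fits$ancova, fits$full)   # should reproduce the F-ratio of about 13.5 on (2, 73) df
```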
Combining Two Factors
We have seen how to combine covariates as well as a covariate and factor, both additively and with interactions. In the same fashion, suppose that we have two factors, say, sex (two levels, male/female) and age (three levels, young/middle/old). Let the corresponding binary variables be x1 to indicate whether the observation represents a female, x2 to indicate whether the observation represents a young person, and x3 to indicate whether the observation represents a middle-aged person.

An additive model for these two factors may use the regression function

Ey = β0 + β1x1 + β2x2 + β3x3.
Table 4.10 Regression Function for a Two Factor Model with Interactions

Sex      Age      x1   x2   x3   x4   x5   Regression Function (4.12)
Male     Young     0    1    0    0    0   β0 + β2
Male     Middle    0    0    1    0    0   β0 + β3
Male     Old       0    0    0    0    0   β0
Female   Young     1    1    0    1    0   β0 + β1 + β2 + β4
Female   Middle    1    0    1    0    1   β0 + β1 + β3 + β5
Female   Old       1    0    0    0    0   β0 + β1
As we have seen, this model is simple to interpret. For example, we can interpret β1 as the sex effect, holding age constant.

We can also incorporate two interaction terms, x1x2 and x1x3. Using all five explanatory variables yields the regression function

Ey = β0 + β1x1 + β2x2 + β3x3 + β4x1x2 + β5x1x3.    (4.12)

Here, the variables x1, x2, and x3 are known as the main effects. Table 4.10 helps interpret this equation. Specifically, there are six types of people that we can encounter: men and women who are young, middle aged, or old. We have six parameters in equation (4.12). Table 4.10 provides the link between the parameters and the types of people. By using the interaction terms, we do not impose any prior specifications on the additive effects of each factor. In Table 4.10, we see that the interpretation of the regression coefficients in equation (4.12) is not straightforward. However, using the additive model with interaction terms is equivalent to creating a new categorical variable with six levels, one for each type of person. If the interaction terms are critical in your study, you may wish to create a new factor that incorporates the interaction terms simply for ease of interpretation.
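For readers who want to see equation (4.12) as a design matrix, the following R sketch builds the six rows of Table 4.10. The data values are made up for illustration, and the reference levels (male, old) are chosen so that the columns line up with x1 through x5.

```r
# The six sex-by-age combinations, with Male and Old as the reference levels
sex <- factor(rep(c("Male", "Female"), each = 3), levels = c("Male", "Female"))
age <- factor(rep(c("Young", "Middle", "Old"), times = 2),
              levels = c("Old", "Young", "Middle"))

# Columns: intercept, x1 (female), x2 (young), x3 (middle), x1*x2, x1*x3
model.matrix(~ sex * age)
```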
Extensions to more than two factors follow in a similar fashion. For example, suppose that you are examining the behavior of firms with headquarters in 10 geographic regions, two organizational structures (profit versus nonprofit), and four years of data. If you decide to treat each variable as a factor and want to model all interaction terms, then this is equivalent to a factor with 10 × 2 × 4 = 80 levels.
Models with interaction terms can have a substantial number of parameters and the analyst must be prudent when specifying interactions to be considered.
General Linear Model
The general linear model extends the linear regression model in two ways. First, explanatory variables may be continuous, categorical, or a combination. The only restriction is that they enter linearly such that the resulting regression function
Ey = β0 + β1x1 + · · · + βkxk    (4.13)

is a linear combination of the coefficients. As we have seen, we can square continuous variables or take other nonlinear transforms (e.g., logarithms) and use binary variables to represent categorical variables, so this restriction, as the name suggests, allows for a broad class of general functions to represent data.
The second extension is that, in the general linear model, the explanatory variables may be linear combinations of one another. Because of this, the parameter estimates need not be unique. However, an important feature of the general linear model is that the resulting fitted values, computed by the method of least squares, turn out to be unique.
For example, in Section 4.3, we saw that the one factor ANOVA model could be expressed as a regression model with c indicator variables. However, if we had attempted to estimate the model in equation (4.10), the method of least squares would not have arrived at a unique set of regression coefficient estimates. The reason is that, in equation (4.10), each explanatory variable can be expressed as a linear combination of the others. For example, observe that xc = 1 − (x1 + x2 + · · · + xc−1).
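A small R sketch illustrates the point: if we include an intercept together with all c indicator variables, one coefficient is aliased (reported as NA by lm()), yet the fitted values are still well defined. The toy data here are simulated.

```r
# Toy example: intercept plus all three group indicators is a redundant parameterization
set.seed(1)
g <- factor(rep(c("A", "B", "C"), each = 4))
y <- rnorm(12, mean = c(1, 2, 3)[as.integer(g)])
X <- model.matrix(~ g - 1)   # three indicator columns x1, x2, x3

fit <- lm(y ~ X)             # x1 + x2 + x3 = 1, so the design is rank deficient
coef(fit)                    # one coefficient is NA: estimates are not unique
fitted(fit)                  # fitted values are unique nevertheless
```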
The fact that parameter estimates are not unique is a drawback but not an overwhelming one. In the linear regression model, the assumption that the explanatory variables are not linear combinations of one another is what allows us to compute unique estimates of the regression coefficients using the method of least squares. In terms of matrices, when the explanatory variables are linear combinations of one another, the matrix X′X is not invertible.
Specifically, suppose that we are considering the regression function in equation (4.13) and, using the method of least squares, our regression coefficient estimates are $b_0^o, b_1^o, \ldots, b_k^o$. This set of regression coefficient estimates minimizes our error sum of squares, but there may be other sets of coefficients that also minimize the error sum of squares. The fitted values are computed as $\hat{y}_i = b_0^o + b_1^o x_{i1} + \cdots + b_k^o x_{ik}$. It can be shown that the resulting fitted values are unique, in the sense that any set of coefficients that minimizes the error sum of squares produces the same fitted values (see Section 4.7.3).
Thus, for a set of data and a specified general linear model, fitted values are unique. Because residuals are computed as observed responses minus fitted values, the residuals are also unique. Because residuals are unique, the error sum of squares is unique. Thus, it seems reasonable, and is true, that we can use the general test of hypotheses described in Section 4.2 to decide whether collections of explanatory variables are important.
To summarize, for general linear models, parameter estimates may not be unique and thus may not be meaningful. An important part of regression modeling is the interpretation of regression coefficients; this interpretation is not necessarily available in the general linear model context. However, for general linear models, we may still assess the importance of an individual variable or a collection of variables through partial F-tests. Further, fitted values, and the corresponding exercise of prediction, work in the general linear model context. The advantage of the general linear model is that we need not worry about the type of restrictions to impose on the parameters. Although not the subject of this text, this advantage is particularly important in complicated experimental designs used in the life sciences. You will find that general linear model estimation routines are widely available in statistical software packages.