Econometric Theory and Methods, Russell Davidson and James G. MacKinnon - Chapter 11


Chapter 11

Discrete and Limited Dependent Variables

11.1 Introduction

Although regression models are useful for modeling many types of data, they are not suitable for modeling every type. In particular, they should not be used when the dependent variable is discrete and can therefore take on only a countable number of values, or when it is continuous but limited in the range of values it can take on. Since variables of these two types arise quite often, it is important to be able to deal with them, and a large number of models has been proposed for doing so. In this chapter, we discuss some of the simplest and most commonly used models for discrete and limited dependent variables.

The most commonly encountered type of dependent variable that cannot be handled properly using a regression model is a binary dependent variable. Such a variable can take on only two values, which for practical reasons are almost always coded as 0 and 1. For example, a person may be in or out of the labor force, a commuter may drive to work or take public transit, a household may own or rent the home it resides in, and so on. In each case, the economic agent chooses between two alternatives, one of which is coded as 0 and the other as 1. A binary response model then tries to explain the probability that the agent will choose alternative 1 as a function of some observed explanatory variables. We discuss binary response models at some length in Sections 11.2 and 11.3.

A binary dependent variable is a special case of a discrete dependent variable. In Section 11.4, we briefly discuss several models for dealing with discrete dependent variables that can take on a fixed number of values. We consider two different cases, one in which the values have a natural ordering and one in which they do not. Then, in Section 11.5, we discuss models for count data, in which the dependent variable can, in principle, take on any nonnegative integer value.

Sometimes, a dependent variable is continuous but can take on only a limited range of values. For example, most types of consumer spending can be zero or positive but cannot be negative. If we have a sample that includes some zero observations, we need to use a model that explicitly allows for this. By the same token, if the zero observations are excluded from the sample, we need to take account of this omission. Both types of model are discussed in Section 11.6. The related problem of sample selectivity, in which certain observations are omitted from the sample in a nonrandom way, is dealt with in Section 11.7. Finally, in Section 11.8, we discuss duration models, which attempt to explain how much time elapses before some event occurs or some state changes.

11.2 Binary Response Models: Estimation

In a binary response model, the dependent variable y_t can take on only two values, 0 and 1. Let P_t denote the probability that y_t = 1 conditional on the information set Ω_t, which consists of exogenous and predetermined variables. A binary response model serves to model this conditional probability. Since the only possible values of y_t are 0 and 1, P_t is also the expectation of y_t conditional on Ω_t:

    P_t ≡ Pr(y_t = 1 | Ω_t) = E(y_t | Ω_t).

Thus a binary response model can also be thought of as modeling a conditional expectation.
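As a quick numerical check of this identity, the following minimal sketch simulates a binary variable and confirms that its mean equals the probability that it takes the value 1. The probability 0.37 is arbitrary, chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.37                                   # arbitrary illustrative probability
y = (rng.uniform(size=1_000_000) < p).astype(float)
print(y.mean())   # close to p: for a binary variable, E(y) = Pr(y = 1)
```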
For many types of dependent variable, we can use a regression model to model conditional expectations, but that is not a sensible thing to do in this case. Suppose that X_t denotes a row vector of length k of variables that belong to the information set Ω_t, almost always including a constant term or the equivalent. A linear regression model would then specify E(y_t | Ω_t) as X_tβ. But such a model fails to impose the condition 0 ≤ E(y_t | Ω_t) ≤ 1, which must hold because E(y_t | Ω_t) is a probability. Even if this condition happened to hold for every observation in a particular sample, it would always be easy to find values of X_t for which the estimated probability X_tβ̂ would be less than 0 or greater than 1. Since it makes no sense to have estimated probabilities that are negative or greater than 1, simply regressing y_t on X_t is not an acceptable way to model the conditional expectation of a binary variable. However, as we will see in the next section, such a regression can provide some useful information, and it is therefore not a completely useless thing to do in the early stages of an empirical investigation.

Any reasonable binary response model must ensure that E(y_t | Ω_t) lies in the 0-1 interval. In principle, there are many ways to do this. In practice, however, two very similar models are widely used. Both of these models ensure that 0 < P_t < 1 by specifying that

    P_t ≡ E(y_t | Ω_t) = F(X_tβ).    (11.01)

Here X_tβ is an index function, which maps the vector X_t of explanatory variables and the vector β of parameters to a scalar index, and F(x) is a transformation function, which has the properties

    F(−∞) = 0,  F(∞) = 1,  and  f(x) ≡ dF(x)/dx > 0.    (11.02)

These properties are, in fact, just the defining properties of the CDF of a probability distribution; recall Section 1.2. They ensure that, although the index function X_tβ can take any value on the real line, the value of F(X_tβ) must lie between 0 and 1.

The properties (11.02) also ensure that F(x) is a nonlinear function. Consequently, changes in the values of the X_ti, which are the elements of X_t, necessarily affect E(y_t | Ω_t) in a nonlinear fashion. Specifically, when P_t is given by (11.01), its derivative with respect to X_ti is

    ∂P_t/∂X_ti = ∂F(X_tβ)/∂X_ti = f(X_tβ)β_i,    (11.03)

where β_i is the i-th element of β. Therefore, the magnitude of the derivative is proportional to f(X_tβ). For the transformation functions that are almost always employed, f(X_tβ) achieves its maximum at X_tβ = 0 and falls as |X_tβ| increases; for examples, see Figure 11.1 below. Thus (11.03) tells us that the effect on P_t of a change in one of the independent variables is greatest when P_t = 0.5 and very small when P_t is close to 0 or 1.
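To make (11.03) concrete, the short sketch below evaluates the marginal effect f(X_tβ)β_i under the probit transformation function (introduced next) at a few index values. The coefficient vector is invented purely for illustration.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical coefficients: an intercept and one slope (illustrative only).
beta = np.array([0.5, 1.2])

for index in [-3.0, -1.5, 0.0, 1.5, 3.0]:
    P = norm.cdf(index)                 # P_t = F(X_t beta)
    dP_dx = norm.pdf(index) * beta[1]   # equation (11.03): f(X_t beta) * beta_i
    print(f"index {index:+.1f}:  P_t = {P:.3f},  dP_t/dX_ti = {dP_dx:.4f}")
```

The derivative peaks at an index of 0, where P_t = 0.5, and is tiny once |X_tβ| reaches 3, in line with the discussion above.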
The Probit Model

The first of the two widely used choices for F(x) is the cumulative standard normal distribution function,

    Φ(x) ≡ (1/√(2π)) ∫_{−∞}^{x} exp(−X²/2) dX.

When F(X_tβ) = Φ(X_tβ), (11.01) is called the probit model. Although there exists no closed-form expression for Φ(x), it is easily evaluated numerically, and its first derivative is, of course, simply the standard normal density function φ(x), which was defined in expression (1.06).

One reason for the popularity of the probit model is that it can be derived from a model involving an unobserved, or latent, variable y°_t. Suppose that

    y°_t = X_tβ + u_t,  u_t ~ NID(0, 1).    (11.04)

We observe only the sign of y°_t, which determines the value of the observed binary variable y_t according to the relationship

    y_t = 1 if y°_t > 0;  y_t = 0 if y°_t ≤ 0.    (11.05)

Together, (11.04) and (11.05) define what is called a latent variable model. One way to think of y°_t is as an index of the net utility associated with some action. If the action yields positive net utility, it will be undertaken; otherwise, it will not be undertaken.

Because we observe only the sign of y°_t, we can normalize the variance of u_t to be unity. If the variance of u_t were some other value, say σ², we could divide β, y°_t, and u_t by σ. Then u_t/σ would have variance 1, but the value of y_t would be unchanged. Another way to express this property is to say that the variance of u_t is not identified by the binary response model.

We can now compute P_t, the probability that y_t = 1. It is

    Pr(y_t = 1) = Pr(y°_t > 0) = Pr(X_tβ + u_t > 0)
                = Pr(u_t > −X_tβ) = Pr(u_t ≤ X_tβ) = Φ(X_tβ).    (11.06)

The second-last equality in (11.06) makes use of the fact that the standard normal density function is symmetric around zero. The final result is just what we would get by letting Φ(X_tβ) play the role of the transformation function F(X_tβ) in (11.01). Thus we have derived the probit model from the latent variable model that consists of (11.04) and (11.05).

The Logit Model

The logit model is very similar to the probit model. The only difference is that the function F(x) is now the logistic function

    Λ(x) ≡ 1/(1 + e^{−x}) = e^x/(1 + e^x),    (11.07)

which has first derivative

    λ(x) ≡ e^x/(1 + e^x)² = Λ(x)Λ(−x).    (11.08)

This first derivative is evidently symmetric around zero, which implies that Λ(−x) = 1 − Λ(x). A graph of the logistic function, as well as of the standard normal distribution function, is shown in Figure 11.1 below.

The logit model is most easily derived by assuming that

    log(P_t/(1 − P_t)) = X_tβ,

which says that the logarithm of the odds (that is, the ratio of the two probabilities) is equal to X_tβ. Solving for P_t, we find that

    P_t = exp(X_tβ)/(1 + exp(X_tβ)) = 1/(1 + exp(−X_tβ)) = Λ(X_tβ).

This result is what we would get by letting Λ(X_tβ) play the role of the transformation function F(X_tβ) in (11.01).
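The derivation (11.06) is easy to check by simulation. The sketch below uses made-up parameter values: it generates the latent variable (11.04), applies the observation rule (11.05), and compares the empirical frequency of y_t = 1, among observations whose index X_tβ is close to 0.5, with Φ(0.5).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n = 200_000
beta = np.array([0.3, 0.8])          # hypothetical parameter values

X = np.column_stack([np.ones(n), rng.normal(size=n)])  # constant plus one regressor
index = X @ beta
y_star = index + rng.normal(size=n)  # latent variable (11.04), u_t ~ N(0, 1)
y = (y_star > 0).astype(float)       # observation rule (11.05)

# Among observations with index near 0.5, Pr(y_t = 1) should be about Phi(0.5).
mask = (index > 0.45) & (index < 0.55)
print("empirical:", y[mask].mean(), " probit probability:", norm.cdf(0.5))
```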
Maximum Likelihood Estimation of Binary Response Models

By far the most common way to estimate binary response models is to use the method of maximum likelihood. Because the dependent variable is discrete, the likelihood function cannot be defined as a joint density function, as it was in Chapter 10 for models with a continuously distributed dependent variable. When the dependent variable can take on only discrete values, the likelihood for each value should be defined as the probability that the value is realized, rather than as the probability density at that value. With this redefinition, the likelihoods of the possible values sum to 1, just as a likelihood based on a continuous distribution integrates to 1. If, for observation t, the realized value of the dependent variable is y_t, then the likelihood for that observation is the probability that y_t = 1 if y_t = 1, and the probability that y_t = 0 if y_t = 0. The logarithm of the appropriate probability is then the contribution to the loglikelihood made by observation t.

Since the probability that y_t = 1 is F(X_tβ), the contribution to the loglikelihood function for observation t when y_t = 1 is log F(X_tβ). Similarly, the contribution for observation t when y_t = 0 is log(1 − F(X_tβ)). Therefore, if y is an n-vector with typical element y_t, the loglikelihood function for y can be written as

    ℓ(y, β) = Σ_{t=1}^{n} ( y_t log F(X_tβ) + (1 − y_t) log(1 − F(X_tβ)) ).    (11.09)

For each observation, one of the terms inside the large parentheses is always 0, and the other is always negative. The first term is 0 whenever y_t = 0, and the second term is 0 whenever y_t = 1. When either term is nonzero, it must be negative, because it is equal to the logarithm of a probability, and this probability must be less than 1 whenever X_tβ is finite. For the model to fit perfectly, F(X_tβ) would have to equal 1 when y_t = 1 and 0 when y_t = 0, and the entire expression inside the parentheses would then equal 0. This could happen only if X_tβ = ∞ whenever y_t = 1 and X_tβ = −∞ whenever y_t = 0. Therefore, we see that (11.09) is bounded above by 0.

Maximizing the loglikelihood function (11.09) is quite easy to do. For the logit and probit models, this function is globally concave with respect to β (see Pratt, 1981, and Exercise 11.1). This implies that the first-order conditions, or likelihood equations, uniquely define the ML estimator β̂, except for one special case, which we consider in the subsection on the perfect classifier problem below. These likelihood equations can be written as

    Σ_{t=1}^{n} [ (y_t − F(X_tβ)) f(X_tβ) X_ti ] / [ F(X_tβ)(1 − F(X_tβ)) ] = 0,  i = 1, . . . , k.    (11.10)

There are many ways to find β̂ in practice. Because of the global concavity of the loglikelihood function, Newton's Method generally works very well. Another approach, based on an artificial regression, will be discussed in the next section.

Conditions (11.10) look just like the first-order conditions for weighted least squares estimation of the nonlinear regression model

    y_t = F(X_tβ) + v_t,    (11.11)

where the weight for observation t is

    ( F(X_tβ)(1 − F(X_tβ)) )^{−1/2}.    (11.12)

This weight is one over the square root of the variance of v_t ≡ y_t − F(X_tβ), which is a binary random variable. By construction, v_t has mean 0, and its variance is

    E(v_t²) = E(y_t − F(X_tβ))²
            = F(X_tβ)(1 − F(X_tβ))² + (1 − F(X_tβ))(F(X_tβ))²
            = F(X_tβ)(1 − F(X_tβ)).    (11.13)

Notice how easy it is to take expectations in the case of a binary random variable: there are just two possible outcomes, and the probability of each of them is specified by the model. Because the variance of v_t in regression (11.11) is not constant, applying nonlinear least squares to that regression would yield an inefficient estimator of the parameter vector β. ML estimates could be obtained by applying iteratively reweighted nonlinear least squares. However, Newton's Method, or a method based on the artificial regression to be discussed in the next section, is more direct and usually much faster.
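As a sketch of how Newton's Method applies here, the following code maximizes the logit loglikelihood (11.09) on simulated data with invented true parameters. For the logit model, the left-hand side of the likelihood equations (11.10) collapses to Xᵀ(y − Λ(Xβ)), because f(x) = λ(x) = Λ(x)(1 − Λ(x)) cancels against the denominator.

```python
import numpy as np

def logit_ml(X, y, tol=1e-10, max_iter=50):
    """Newton's Method for the logit model; a bare-bones sketch."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # Lambda(X_t beta)
        grad = X.T @ (y - p)                    # likelihood equations (11.10)
        W = p * (1.0 - p)                       # lambda(X_t beta) on the diagonal
        hess = -(X * W[:, None]).T @ X          # Hessian of the loglikelihood
        beta = beta + np.linalg.solve(hess, -grad)   # Newton step
        if np.max(np.abs(grad)) < tol:
            break
    return beta

# Simulated data with invented true parameters.
rng = np.random.default_rng(1)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)
print(logit_ml(X, y))   # should be close to beta_true
```

Because the loglikelihood is globally concave, the iterations converge from any starting point; a handful of steps usually suffices.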
Since the ML estimator is equivalent to weighted NLS, we can obtain it as an efficient GMM estimator. It is quite easy to construct elementary zero functions for a binary response model. The obvious function for observation t is y_t − F(X_tβ). The covariance matrix of the n-vector of these zero functions is the diagonal matrix with typical element (11.13), and the row vector of derivatives of the zero function for observation t is −f(X_tβ)X_t. With this information, we can set up the efficient estimating equations (9.82). As readers are asked to show in Exercise 11.3, these equations are equivalent to the likelihood equations (11.10). Intuitively, efficient GMM and maximum likelihood give the same estimator because, once it is understood that the y_t are binary variables, the elementary zero functions serve to specify the probabilities Pr(y_t = 1), and they thus constitute a full specification of the model.
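A quick numerical illustration of this equivalence, reusing X and y from the previous sketch: built from the zero functions y_t − F(X_tβ), the variances (11.13), and the derivatives −f(X_tβ)X_t, the efficient estimating equations reproduce the likelihood equations (11.10) exactly. They are evaluated here at an arbitrary β, not the MLE, so the match is not just 0 = 0; the sign convention for the estimating equations is immaterial.

```python
# Reusing X and y from the previous sketch; beta is arbitrary, not the MLE.
beta = np.array([0.1, 0.4])
p = 1.0 / (1.0 + np.exp(-(X @ beta)))         # F(X_t beta) for the logit model
f = p * (1.0 - p)                             # f(X_t beta) = lambda(X_t beta)
omega = p * (1.0 - p)                         # variances (11.13) of the zero functions

score = X.T @ (y - p)                         # left-hand side of (11.10), logit case
gmm = (f[:, None] * X).T @ ((y - p) / omega)  # efficient GMM estimating equations
print(np.allclose(score, gmm))                # True: the two sets of equations coincide
```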
[Figure 11.1: Alternative choices for F(x). The standard normal CDF, the logistic function, and the rescaled logistic function, plotted as F(x) against x.]

Comparing Probit and Logit Models

In practice, the probit and logit models generally yield very similar predicted probabilities, and the maximized values of the loglikelihood function (11.09) for the two models therefore tend to be very close. A formal comparison of these two values is possible. If twice the difference between them is greater than 3.84, the .05 critical value for the χ²(1) distribution, then we can reject whichever model fits less well at the .05 level.¹ Such a procedure was discussed in Section 10.8 in the context of linear and loglinear models. In practice, however, experience shows that this sort of comparison rarely rejects either model unless the sample size is quite large.

¹ This assumes that there exists a comprehensive model, with a single additional parameter, which includes the probit and logit models as special cases. It is not difficult to formulate such a model; see Exercise 11.4.

In most cases, the only real difference between the probit and logit models is the way in which the elements of β are scaled. This difference in scaling occurs because the variance of the distribution for which the logistic function is the CDF can be shown to be π²/3, while that of the standard normal distribution is, of course, unity. The logit estimates therefore all tend to be larger in absolute value than the probit estimates, although usually by a factor that is somewhat less than π/√3. Figure 11.1 plots the standard normal CDF, the logistic function, and the logistic function rescaled to have variance unity. The resemblance between the standard normal CDF and the rescaled logistic function is striking. The main difference is that the rescaled logistic function puts more weight in the extreme tails.
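The scaling relationship is easy to see numerically. The following sketch fits both models by maximum likelihood to the same simulated data (a general-purpose optimizer is used for brevity, and the data-generating coefficients are invented) and compares the ratio of the slope estimates with π/√3 ≈ 1.81.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(7)
n = 20_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (rng.uniform(size=n) < norm.cdf(X @ np.array([0.2, 0.9]))).astype(float)

def negloglik(beta, cdf):
    P = np.clip(cdf(X @ beta), 1e-12, 1 - 1e-12)   # guard the logarithms
    return -np.sum(y * np.log(P) + (1 - y) * np.log(1 - P))

logistic = lambda x: 1.0 / (1.0 + np.exp(-x))
b_probit = minimize(negloglik, np.zeros(2), args=(norm.cdf,)).x
b_logit = minimize(negloglik, np.zeros(2), args=(logistic,)).x

print("slope ratio:", b_logit[1] / b_probit[1], " pi/sqrt(3):", np.pi / 3**0.5)
```

The ratio typically comes out between about 1.6 and 1.8, larger than 1 but somewhat less than π/√3, as the text predicts.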
The Perfect Classifier Problem

We have seen that the loglikelihood function (11.09) is bounded above by 0, and that it achieves this bound if X_tβ = −∞ whenever y_t = 0 and X_tβ = ∞ whenever y_t = 1. Suppose there is some linear combination of the independent variables, say X_tβ•, such that

    y_t = 0 whenever X_tβ• < 0, and y_t = 1 whenever X_tβ• > 0.    (11.14)

When this happens, there is said to be complete separation of the data. In this case, it is possible to make the value of ℓ(y, β) arbitrarily close to 0 by setting β = γβ• and letting γ → ∞. This is precisely what any nonlinear maximization algorithm will attempt to do if there exists a vector β• for which conditions (11.14) are satisfied. Because of the limitations of computer arithmetic, the algorithm will eventually terminate with some sort of numerical error at a value of the loglikelihood function that is slightly less than 0. If conditions (11.14) are satisfied, X_tβ• is said to be a perfect classifier, since it allows us to predict y_t with perfect accuracy for every observation.

The problem of perfect classifiers has a geometrical interpretation. In the k-dimensional space spanned by the columns of the matrix X formed from the row vectors X_t, the vector β• defines a hyperplane that passes through the origin and that separates the observations for which y_t = 1 from those for which y_t = 0. Whenever one column of X is a constant, the separating hyperplane can be represented in the (k − 1)-dimensional space of the other explanatory variables. If we write X_tβ• = α• + X_t2 β•_2, with X_t2 a 1 × (k − 1) vector, then X_tβ• = 0 is equivalent to X_t2 β•_2 = −α•, which is the equation of a hyperplane in the space of the X_t2 that in general does not pass through the origin. This is illustrated in Figure 11.2 for the case k = 3.
[Figure 11.2: A perfect classifier yields a separating hyperplane. In the (x_1, x_2) plane, the line X_tβ• = 0 separates the region where X_tβ• > 0 from the region where X_tβ• < 0.]

The asterisks in the figure, which all lie to the northeast of the separating line X_tβ• = 0, represent the X_t2 for the observations with y_t = 1, and the circles to the southwest of the separating line represent the X_t2 for the observations with y_t = 0.

It is clear from Figure 11.2 that, when a perfect classifier occurs, the separating hyperplane is not, in general, unique. One could move the intercept of the separating line in the figure up or down a little while maintaining the separating property. Likewise, one could swivel the line a little about the point of intersection with the vertical axis. Even if the separating hyperplane were unique, we could not identify all the components of β. This follows from the fact that the equation X_tβ• = 0 is equivalent to the equation X_t(cβ•) = 0 for any nonzero scalar c. The separating hyperplane is therefore defined equally well by any multiple of β•. Although this suggests that we might be able to estimate β• up to a scalar factor by imposing a normalization on it, there is no question of estimating β• in the usual sense, and inference on it would require methods beyond the scope of this book.
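A minimal sketch of this failure, using an invented, completely separated sample and the logit transformation function: as γ increases along a perfect classifier β•, the loglikelihood climbs toward its upper bound of 0, so any maximization algorithm would keep going until numerical trouble stops it.

```python
import numpy as np

# A tiny, completely separated sample: y = 1 exactly when x > 0.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = (x > 0).astype(float)
beta_sep = np.array([0.0, 1.0])   # a perfect classifier: X @ beta_sep = x

def loglik(beta):
    P = np.clip(1.0 / (1.0 + np.exp(-(X @ beta))), 1e-15, 1 - 1e-15)
    return np.sum(y * np.log(P) + (1 - y) * np.log(1 - P))

for gamma in [1, 5, 25, 125]:
    print(f"gamma = {gamma:4d}:  loglik = {loglik(gamma * beta_sep):.6f}")
# The loglikelihood approaches 0 from below, so no finite ML estimator exists.
```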
Even when no parameter vector exists that satisfies the inequalities (11.14), there may exist a β• that satisfies the corresponding nonstrict inequalities. There must then be at least one observation with y_t = 0 and X_tβ• = 0, and at least one other observation with y_t = 1 and X_tβ• = 0. In such a case, we speak of quasi-complete separation of the data. The separating hyperplane is then unique, and the upper bound of the loglikelihood is no longer zero, as readers are invited to verify in Exercise 11.6.

When there is either complete or quasi-complete separation, no finite ML estimator exists. This is likely to occur in practice when the sample is very small, when almost all of the y_t are equal to 0 or almost all of them are equal to 1, or when the model fits extremely well. Exercise 11.5 is designed to give readers a feel for the circumstances in which ML estimation is likely to fail because there is a perfect classifier.

If a perfect classifier exists, the loglikelihood should be close to its upper bound (which may be 0 or a small negative number) when the maximization algorithm quits. Thus, if the model seems to fit extremely well, or if the algorithm terminates in an unusual way, one should always check to see whether the parameter values imply the existence of a perfect classifier. For a detailed discussion of the perfect classifier problem, see Albert and Anderson (1984).

11.3 Binary Response Models: Inference

Inference about the parameters of binary response models is usually based on the standard results for ML estimation that were discussed in Chapter 10. It can be shown that

    Var( plim_{n→∞} n^{1/2}(β̂ − β_0) ) = plim_{n→∞} ( n⁻¹ XᵀΥ(β_0)X )⁻¹,    (11.15)

where X is an n × k matrix with typical row X_t, β_0 is the true value of β, and Υ(β) is an n × n diagonal matrix with typical diagonal element

    Υ_t(β) ≡ f²(X_tβ) / ( F(X_tβ)(1 − F(X_tβ)) ).    (11.16)

Not surprisingly, the right-hand side of expression (11.15) looks like the asymptotic covariance matrix for weighted least squares estimation, with weights (11.12), of the GNR that corresponds to regression (11.11). This GNR is

    y_t − F(X_tβ) = f(X_tβ)X_t b + residual.    (11.17)

The factor of f(X_tβ) that multiplies all the regressors of the GNR accounts for the numerator of (11.16). Its denominator is simply the variance of the error term in regression (11.11). Two ways to obtain the asymptotic covariance matrix (11.15) using general results for ML estimation are explored in Exercises 11.7 and 11.8.

In practice, the asymptotic result (11.15) is used to justify the covariance matrix estimator

    V̂ar(β̂) = ( XᵀΥ(β̂)X )⁻¹,    (11.18)

in which the unknown β_0 is replaced by β̂, and the factor of n⁻¹, which is needed only for asymptotic analysis, is omitted. This approximation may be used to obtain standard errors, t statistics, Wald statistics, and confidence intervals that are asymptotically valid. However, they will not be exact in finite samples.

It is clear from equations (11.15) and (11.18) that the ML estimator for the binary response model gives some observations more weight than others. In fact, the weight given to observation t is proportional to the square root of expression (11.16) evaluated at β = β̂.
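As a sketch, the estimator (11.18) takes only a few lines for the logit model, for which Υ_t(β̂) in (11.16) simplifies to λ(X_tβ̂) = Λ(X_tβ̂)(1 − Λ(X_tβ̂)). The code below reuses the logit_ml function from the Newton's-Method sketch, on freshly simulated data with invented parameters.

```python
import numpy as np

# Reusing logit_ml from the Newton's-Method sketch above.
rng = np.random.default_rng(5)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.0]))))).astype(float)

beta_hat = logit_ml(X, y)
p = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
ups = p * (1.0 - p)                              # Upsilon_t(beta_hat) in (11.16), logit case
cov = np.linalg.inv((X * ups[:, None]).T @ X)    # estimator (11.18)
print("estimates:", beta_hat, " standard errors:", np.sqrt(np.diag(cov)))
```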
It can be shown that, for both the logit and probit models, the maximum weight is given to observations for which X_tβ = 0, which implies that P_t = 0.5, while relatively little weight is given to observations for which P_t is close to 0 or 1; see Exercise 11.9. This makes sense: when P_t is close to 0 or 1, a given change in X_tβ has little effect on P_t, while when P_t is close to 0.5, such a change has a much larger effect. Thus we see that ML estimation, quite sensibly, gives more weight to observations that provide more information about the parameter values.

[...] include Amemiya (1985, Chapter 9) and McFadden (1984). For a more up-to-date survey, but one that is relatively superficial, see Maddala and Flores-Lagunes (2001).

11.5 Models for Count Data

Many economic variables are nonnegative integers. Examples include the number of patents granted to a firm and the number of visits to [...]

[...] b + residual,    (11.20)

where V_t(β) ≡ F(X_tβ)(1 − F(X_tβ)). If the BRMR is evaluated at the vector of ML estimates β̂, it yields the covariance matrix

    s²( XᵀΥ(β̂)X )⁻¹.    (11.21)

² This regression was originally proposed, independently in somewhat different forms, by Engle (1984) and Davidson and MacKinnon (1984b).

[...]    (11.27)

where δ is a scalar parameter, and τ(·) may be any scalar function that is monotonically increasing in its argument and satisfies the conditions

    τ(0) = 0,  τ′(0) = 1,  and  τ″(0) = 0,    (11.28)

where τ′(0) and τ″(0) are the first and second derivatives of τ(x), evaluated at x = 0. The family of models (11.27) allows for a wide range of transformation functions. It was considered by MacKinnon and [...]

[...] Σ_{j=0} ∂Π_tj(θ)/∂θ_i = 0. It follows that the regressand is orthogonal to all the regressors when all the artificial variables are evaluated at the maximum likelihood estimates θ̂. In Exercises 11.18 and 11.19, readers are asked to show that regression (11.42), the DCAR, satisfies the other requirements [...]

[...] estimation, and hypothesis testing. The most intuitive way to think of the BRMR is as a modified version of the GNR. The ordinary GNR for the nonlinear regression model (11.11) is (11.17). However, it is inappropriate to use this GNR, because the error terms are heteroskedastic, with variance given by (11.13). We need to divide the regressand and regressors of (11.17) by the square root of (11.13) in order [...]

[...] Pr(y_t = y) = exp(−exp(X_tβ)) (exp(X_tβ))^y / y!,  y = 0, 1, 2, . . . .    (11.49)

If the observed count value for observation t is y_t, then the contribution to the loglikelihood function is the logarithm of the right-hand side of (11.49), evaluated at y = y_t. Therefore, the entire loglikelihood function is

    ℓ(y, β) = Σ_{t=1}^{n} ( −exp(X_tβ) + y_t X_tβ − log y_t! ).    (11.50)
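The Poisson loglikelihood (11.50) is also globally concave in β, so Newton's Method again works well. A minimal estimation sketch on simulated count data, with invented coefficients and the conditional mean exp(X_tβ) implied by (11.49), is:

```python
import numpy as np

def poisson_ml(X, y, tol=1e-10, max_iter=50):
    """Newton's Method for the Poisson loglikelihood (11.50); a sketch."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = np.exp(X @ beta)              # conditional mean exp(X_t beta)
        grad = X.T @ (y - mu)              # score of (11.50)
        hess = -(X * mu[:, None]).T @ X    # Hessian of (11.50)
        beta = beta + np.linalg.solve(-hess, grad)
        if np.max(np.abs(grad)) < tol:
            break
    return beta

rng = np.random.default_rng(3)
n = 10_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.4])
y = rng.poisson(np.exp(X @ beta_true))
print(poisson_ml(X, y))   # should be close to beta_true
```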
[...] unity, and Ṽ_t, F̃_t, and f̃_t are all constants that do not depend on t. Since neither subtracting a constant from the regressand nor multiplying the regressand and regressors by a constant has any effect on the F statistic for b_2 = 0, regression (11.22) is equivalent to the much simpler regression

    y = c_1 + X_2 c_2 + residuals.    (11.23)

The ordinary F statistic for c_2 = 0 in regression (11.23) is an [...]

[...] strong resemblance between regression (11.30) and the test regression for the RESET test (Ramsey, 1969), in which squared fitted values are added to an OLS regression as a test for functional form. As MacKinnon and Magee (1990) showed, this resemblance is not coincidental.

11.4 Models for More than Two Discrete [...]

[...] values of y_t. It is essential that γ_2 > γ_1. Otherwise, the first and last lines of (11.32) would be incompatible, and we could never observe y_t = 1. If X_t contains a constant term, it is impossible to identify the constant along with γ_1 and γ_2. To see this, suppose that the constant is equal to α [...]

[...] provide more advanced introductions to the topic of count data, and Cameron and Trivedi (1998) provides a detailed treatment of a large number of different models for data of this type.

11.6 Models for Censored and Truncated Data

Continuous dependent variables can sometimes take only a limited range [...]
