CHAPTER 69

Binary Choice Models

69.1. Fisher's Scoring and Iteratively Reweighted Least Squares

This section draws on chapter 55 about Numerical Minimization. Another important "natural" choice for the positive definite matrix $R_i$ in the gradient method is available if one maximizes a likelihood function: then $R_i$ can be the inverse of the information matrix evaluated at the parameter values $\beta_i$. This is called Fisher's scoring method. It is closely related to the Newton-Raphson method: Newton-Raphson uses the Hessian matrix, and the information matrix is minus the expected value of the Hessian. Apparently Fisher first used the information matrix as a computational simplification of the Newton-Raphson method. Today IRLS is used in the GLIM program for generalized linear models.

As in chapter 56 on nonlinear least squares, $\beta$ is the vector of parameters of interest, and we work with an intermediate vector $\eta(\beta)$ of predictors whose dimension is comparable to that of the observations. The likelihood function therefore has the form $L = L\bigl(y, \eta(\beta)\bigr)$. By the chain rule (C.1.23) one can write the Jacobian of the likelihood function as
$$\frac{\partial L}{\partial\beta^\top}(\beta) = u^\top X,$$
where $u^\top = \frac{\partial L}{\partial\eta^\top}\bigl(\eta(\beta)\bigr)$ is the Jacobian of $L$ as a function of $\eta$, evaluated at $\eta(\beta)$, and $X = \frac{\partial\eta}{\partial\beta^\top}(\beta)$ is the Jacobian of $\eta$. This is the same notation as in the discussion of the Gauss-Newton regression.

Define $A = E[uu^\top]$. Since $X$ does not depend on the random variables, the information matrix of $y$ with respect to $\beta$ is $E[X^\top uu^\top X] = X^\top A X$. If one uses the inverse of this information matrix as the $R$-matrix in the gradient algorithm, one gets
(69.1.1)   $\beta_{i+1} = \beta_i + \alpha_i\,(X^\top A X)^{-1} X^\top u.$
The Iteratively Reweighted Least Squares interpretation of this comes from rewriting (69.1.1) as
(69.1.2)   $\beta_{i+1} = \beta_i + \alpha_i\,(X^\top A X)^{-1} X^\top A\,A^{-1}u,$
i.e., one obtains the step direction by regressing $A^{-1}u$ on $X$ with weighting matrix $A$.

Justifications of IRLS are: the information matrix is usually analytically simpler than the Hessian of the likelihood function, therefore it is a convenient approximation, and one needs the information matrix anyway at the end for the covariance matrix of the M.L. estimators.

69.2. Binary Dependent Variable

Assume each individual in the sample makes an independent random choice between two alternatives, which can conveniently be coded as $y_i = 0$ or $1$. The probability distribution of $y_i$ is fully determined by the probability $\pi_i = \Pr[y_i = 1]$ of the event which has $y_i$ as its indicator function. Then $E[y_i] = \pi_i$ and $\operatorname{var}[y_i] = E[y_i^2] - (E[y_i])^2 = E[y_i] - (E[y_i])^2 = \pi_i(1-\pi_i)$. It is usually assumed that the individual choices are stochastically independent of each other, i.e., the distribution of the data is fully characterized by the $\pi_i$.

Each $\pi_i$ is assumed to depend on a vector of explanatory variables $x_i$. There are different approaches to modelling this dependence. The regression model $y_i = x_i^\top\beta + \varepsilon_i$ with $E[\varepsilon_i] = 0$ is inappropriate because $x_i^\top\beta$ can take any value, whereas $0 \le E[y_i] \le 1$. Nevertheless, people have been tinkering with it. The obvious first tinker is based on the observation that the $\varepsilon_i$ are no longer homoskedastic; their variance is a function of $\pi_i$ and can be estimated, so one can correct for this heteroskedasticity. But things get complicated very quickly, and then the main appeal of OLS, its simplicity, is lost. This is a wrongheaded approach, and any smart ideas which one may get when going down this road are simply wasted.
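To see the point concretely, here is a small simulation sketch (not from the text; the coefficients 0.5 and 1.5, the sample size, and all variable names are invented for illustration). It fits the linear model to binary data by OLS, counts how many fitted "probabilities" fall outside the unit interval, and shows how strongly the error variance $\pi_i(1-\pi_i)$ varies across observations.

```python
import numpy as np

# Minimal illustration: OLS applied to a binary dependent variable.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(0.0, 2.0, size=n)
pi = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))       # true P[y_i = 1] (invented coefficients)
y = rng.binomial(1, pi)                            # observed binary choices

X = np.column_stack([np.ones(n), x])               # design matrix with intercept
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS coefficients
fitted = X @ beta_ols                              # "probabilities" from the linear model

print("fitted values outside [0, 1]:", int(np.sum((fitted < 0) | (fitted > 1))))
print("var[y_i] = pi_i(1 - pi_i) ranges from",
      float((pi * (1 - pi)).min()), "to", float((pi * (1 - pi)).max()))
```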
The right way to do this is to set $\pi_i = E[y_i] = \Pr[y_i = 1] = h(x_i^\top\beta)$, where $h$ is some (necessarily nonlinear) function with values between 0 and 1.

69.2.1. Logit Specification (Logistic Regression).

The logit or logistic specification is $\pi_i = e^{x_i^\top\beta}/(1 + e^{x_i^\top\beta})$. Inverting gives $\log\bigl(\pi_i/(1-\pi_i)\bigr) = x_i^\top\beta$, i.e., the logarithm of the odds depends linearly on the predictors. The log odds are a natural re-scaling of probabilities to a scale which goes from $-\infty$ to $+\infty$, and which is symmetric in that the log odds of the complement of an event are just the negative of the log odds of the event itself. (See my remarks about the odds ratio in Question 222.)

Problem 560. 1 point. If $y = \log\frac{p}{1-p}$ (logit function), show that $p = \frac{\exp y}{1+\exp y}$ (logistic function).

Answer. $\exp y = \frac{p}{1-p}$; now multiply by $1-p$ to get $\exp y - p\exp y = p$, collect terms to get $\exp y = p(1+\exp y)$, and divide by $1+\exp y$.

Problem 561. Sometimes one finds the following alternative specification of the logit model: $\pi_i = 1/(1 + e^{x_i^\top\beta})$. What is the difference between it and our formulation of the logit model? Are these two formulations equivalent?

Answer. It is simply a different parametrization: replacing $\beta$ by $-\beta$ turns one form into the other, so the two formulations are equivalent. This form arises when the model is approached from an index-variable formulation with the opposite sign convention.

The logit function is also the canonical link function for the binomial distribution, see Problem 113.

69.2.2. Probit Model.

An important class of functions with values between 0 and 1 is the class of cumulative probability distribution functions. If $h$ is a cumulative distribution function, then one can give this specification an interesting interpretation in terms of an unobserved "index variable." The index-variable model specifies: there is a variable $z_i$ with the property that $y_i = 1$ if and only if $z_i > 0$. For instance, the decision $y_i$ whether or not individual $i$ moves to a different location can be modeled by the calculation whether the net benefit of moving, i.e., the wage differential minus the cost of relocation and finding a new job, is positive or not. This moving example is worked out, with references, in [Gre93, pp. 642/3].

The value of the variable $z_i$ is not observed; one only observes $y_i$, i.e., the only thing one knows about the value of $z_i$ is whether it is positive or not. But it is assumed that $z_i$ is the sum of a deterministic part which is specific to the individual and a random part which has the same distribution for all individuals and is stochastically independent between different individuals. The deterministic part specific to the individual is assumed to depend linearly on individual $i$'s values of the covariates, with coefficients which are common to all individuals. In other words, $z_i = x_i^\top\beta + \varepsilon_i$, where the $\varepsilon_i$ are i.i.d. with cumulative distribution function $F_\varepsilon$. Then it follows
$$\pi_i = \Pr[y_i = 1] = \Pr[z_i > 0] = \Pr[\varepsilon_i > -x_i^\top\beta] = 1 - \Pr[\varepsilon_i \le -x_i^\top\beta] = 1 - F_\varepsilon(-x_i^\top\beta),$$
i.e., in this case $h(\eta) = 1 - F_\varepsilon(-\eta)$. If the distribution of $\varepsilon_i$ is symmetric and has a density, then one gets the simpler formula $h(\eta) = F_\varepsilon(\eta)$.
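As a quick numerical check of this derivation, the following sketch (coefficients, covariate values, and sample size are invented) simulates the index-variable model with standard normal $\varepsilon_i$, i.e., the probit case, and compares the simulated frequency of $y_i = 1$ with $\Phi(x_i^\top\beta)$.

```python
import numpy as np
from scipy.stats import norm

# Simulate z_i = x_i'beta + eps_i with symmetric (standard normal) eps_i and
# check that Pr[y_i = 1] matches h(eta) = F_eps(eta).
rng = np.random.default_rng(1)
beta = np.array([0.5, -1.0])                # invented coefficients
x_i = np.array([1.0, 0.3])                  # one covariate vector (with intercept)
eta = x_i @ beta                            # deterministic part of the index

eps = rng.normal(size=100_000)              # random part, same distribution for everyone
z = eta + eps                               # latent index
y = (z > 0).astype(int)                     # observed binary choice

print("simulated P[y = 1]:", y.mean())
print("Phi(x_i'beta)     :", norm.cdf(eta))  # F_eps(eta), by symmetry of eps
```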
Which cumulative distribution function should be chosen?
• In practice, the probit model, in which $z_i$ is normal, is the only one used.
• The linear model, in which $h$ is the line segment from $(a, 0)$ to $(b, 1)$, can also be considered as generated by an index function $z_i$ which is here uniformly distributed.
• An alternative possible specification, based on the Cauchy distribution, is proposed in [DM93, p. 516]. They note that, curiously, only logit and probit are actually being used.

In practice, the probit model is very similar to the logit model, once one has rescaled the variables to make the variances equal, but the logit model is easier to handle mathematically.

69.2.3. Replicated Data.

Before discussing estimation methods I want to briefly address the issue whether or not to write the data in replicated form [MN89, pp. 99–101]. If there are several observations for every individual, or if there are several individuals with the same values of the covariates (which can happen if all covariates are categorical), then one can write the data more compactly by grouping the observations into so-called "covariate classes," i.e., groups of observations which share the same values of $x_i$, and defining $y_i$ to be the number of times the decision came out positive in this group. Then one needs a second variable, $m_i$, which is assumed nonrandom, indicating how many individual decisions are combined in the respective group. This is an equivalent formulation of the data; the only thing one loses is the order in which the observations were made (which may be relevant if there are training or warm-up effects). The original representation of the data is a special case of the grouped form: in the non-grouped form, all $m_i = 1$. We will from now on write our formulas for the grouped form.

69.2.4. Estimation.

Maximum likelihood is the preferred estimation method. The likelihood function has the form $L = \prod_i \pi_i^{y_i}(1-\pi_i)^{m_i - y_i}$. This likelihood function is not derived from a density but from a probability mass function. For instance, in the case with non-replicated data, all $m_i = 1$: if you have $n$ binary measurements, then you can have only $2^n$ different outcomes, and the probability of the sequence $y_1, \dots, y_n = 0, 1, 0, 0, \dots, 1$ is as given above.

This is a highly nonlinear maximization and must be done numerically. Let us go through the method of scoring in the example of the logit specification. Writing $L$ now for the log likelihood,
(69.2.1)   $L = \sum_i \bigl( y_i \log\pi_i + (m_i - y_i)\log(1-\pi_i) \bigr)$
(69.2.2)   $\dfrac{\partial L}{\partial\pi_i} = \dfrac{y_i}{\pi_i} - \dfrac{m_i - y_i}{1-\pi_i}$
(69.2.3)   $\dfrac{\partial^2 L}{\partial\pi_i^2} = -\Bigl( \dfrac{y_i}{\pi_i^2} + \dfrac{m_i - y_i}{(1-\pi_i)^2} \Bigr)$

Defining $\eta = X\beta$, the logit specification can be written as $\pi_i = e^{\eta_i}/(1 + e^{\eta_i})$. Differentiation gives $\partial\pi_i/\partial\eta_i = \pi_i(1-\pi_i)$. Combine this with (69.2.2) to get
(69.2.4)   $u_i = \dfrac{\partial L}{\partial\eta_i} = \Bigl( \dfrac{y_i}{\pi_i} - \dfrac{m_i - y_i}{1-\pi_i} \Bigr)\pi_i(1-\pi_i) = y_i - m_i\pi_i.$
These are the elements of $u$ in (69.1.1), and they have a very simple meaning: they are just the observations minus their expected values. Therefore one obtains immediately that $A = E[uu^\top]$ is a diagonal matrix with $m_i\pi_i(1-\pi_i)$ in the diagonal.

Problem 562. 6 points. Show that for the maximization of the likelihood function of the logit model, Fisher's scoring method is equivalent to the Newton-Raphson algorithm.

Problem 563. Show that in the logistic model, $m_i\hat\pi_i = y_i$.
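The scoring step for the grouped logit model can be written down directly from (69.1.1), (69.2.4), and the diagonal form of $A$. The following is a minimal sketch, not the author's code; it assumes grouped data ($y_i$ successes out of $m_i$ trials), a constant step size $\alpha_i = 1$, and an invented toy data set.

```python
import numpy as np

def logit_fisher_scoring(X, y, m, iterations=25):
    """Fisher scoring / IRLS for the grouped logit model, following (69.1.1):
    beta_{i+1} = beta_i + (X' A X)^{-1} X' u,  u_i = y_i - m_i pi_i,
    A = diag(m_i pi_i (1 - pi_i)).  Step size alpha_i = 1 is assumed."""
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))           # logit specification
        u = y - m * pi                            # score with respect to eta, (69.2.4)
        A = m * pi * (1.0 - pi)                   # diagonal of the information weights
        step = np.linalg.solve(X.T @ (A[:, None] * X), X.T @ u)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:          # stop when the update is negligible
            break
    return beta

# Tiny invented example: four covariate classes with m_i trials each.
X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
m = np.array([10.0, 12.0, 9.0, 11.0])
y = np.array([2.0, 5.0, 6.0, 10.0])
print(logit_fisher_scoring(X, y, m))
```

Because Fisher scoring coincides with Newton-Raphson for the logit model (Problem 562), the same loop can also be read as a Newton-Raphson iteration.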
69.3. The Generalized Linear Model

The binary choice models show how the linear model can be generalized. [MN89, pp. 27–32] develop a unified theory of many different interesting models, called the "generalized linear model." The following few paragraphs are indebted to the elaborate and useful web site about Generalized Linear Models maintained by Gordon K. Smyth at www.maths.uq.oz.au/~gks/research/glm

In which cases is it necessary to go beyond linear models? The most important and common situation is one in which $y_i$ and $\mu_i = E[y_i]$ are bounded:
• If $y$ represents the amount of some physical substance, then we may have $y \ge 0$ and $\mu \ge 0$.
• If $y$ is binary, i.e., $y = 1$ if an animal survives and $y = 0$ if it does not, then $0 \le \mu \le 1$.
The linear model is inadequate here because complicated and unnatural constraints on $\beta$ would be required to make sure that $\mu$ stays in the feasible range. Generalized linear models instead assume a linear relationship on the link scale,
(69.3.1)   $g(\mu) = X\beta,$
where $g(\cdot)$ is some known monotonic function which acts pointwise on $\mu$. Typically $g(\cdot)$ is used to transform the $\mu_i$ to a scale on which they are unconstrained. For example we might use $g(\mu) = \log(\mu)$ if $\mu_i > 0$, or $g(\mu) = \log\bigl(\mu/(1-\mu)\bigr)$ if $0 < \mu_i < 1$.

The same reasons which force us to abandon the linear model also force us to abandon the assumption of normality. If $y$ is bounded, then the variance of $y$ must depend on its mean: if $\mu$ is close to a boundary for $y$, then $\operatorname{var}(y)$ must be small. For example, if $y > 0$, then we must have $\operatorname{var}(y) \to 0$ as $\mu \to 0$; for this reason strictly positive data almost always show increasing variability with increasing size. If $0 < y < 1$, then $\operatorname{var}(y) \to 0$ as $\mu \to 0$ or $\mu \to 1$. Generalized linear models therefore assume that
(69.3.2)   $\operatorname{var}(y_i) = \phi \cdot V(\mu_i),$
where $\phi$ is an unknown scale factor and $V(\cdot)$ is some known variance function appropriate for the data at hand. We therefore estimate the nonlinear regression equation (69.3.1) weighting the observations inversely according to the variance functions $V(\mu_i)$. This weighting procedure turns out to be exactly equivalent to maximum likelihood estimation when the observations actually come from an exponential family distribution.

Problem 564. Describe estimation situations in which a linear model and Normal distribution are not appropriate.
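To make the link- and variance-function idea concrete beyond the binary case, here is a sketch of the same IRLS weighting for a Poisson model with log link, $g(\mu) = \log\mu$ and $V(\mu) = \mu$. The function name, the starting value, and the count data are invented for illustration; this is one standard way of organizing the computation, not necessarily the one used in the sources cited above.

```python
import numpy as np

def poisson_irls(X, y, iterations=25):
    """IRLS sketch for a Poisson GLM with log link: g(mu) = log(mu) = X beta and
    variance function V(mu) = mu (scale factor phi taken as 1).
    Each step is a weighted least-squares fit of a "working response"."""
    beta = np.linalg.lstsq(X, np.log(y + 0.5), rcond=None)[0]   # rough starting value
    for _ in range(iterations):
        eta = X @ beta
        mu = np.exp(eta)                        # inverse link
        w = mu                                  # weight (dmu/deta)^2 / V(mu) = mu
        z = eta + (y - mu) / mu                 # working response
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < 1e-10:
            return beta_new
        beta = beta_new
    return beta

# Invented count data whose mean grows roughly exponentially in x.
x = np.arange(10.0)
X = np.column_stack([np.ones_like(x), x])
y = np.array([1.0, 1.0, 2.0, 3.0, 4.0, 7.0, 9.0, 14.0, 20.0, 28.0])
print(poisson_irls(X, y))
```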
[...]

The generalized linear model has the following components:
• Random component: instead of being normally distributed, the components of $y$ have a distribution in the exponential family.
• Introduce a new symbol $\eta = X\beta$.
• A monotonic univariate link function $g$ so that $\eta_i = g(\mu_i)$, where $\mu = E[y]$.
The generalized linear model allows for a nonlinear link function $g$ specifying that [...]

[...] linearly independent, and full row rank if its rows are linearly independent. The deficiency matrix provides a "holistic" definition for which it is not necessary to look at single rows and columns: $X$ has full column rank if and only if $X \perp O$, and full row rank if and only if $O \perp X$.

Problem 575. Show that the following three statements are equivalent: (1) $X$ has full column rank, (2) $X^\top X$ is nonsingular, and [...]

[...] The definition of a g-inverse is apparently due to [Rao62]. It is sometimes called the "conditional inverse" [Gra83, p. 129]. This g-inverse, and not the Moore-Penrose generalized inverse or pseudoinverse $A^+$, is needed for the linear model. The Moore-Penrose generalized inverse is a g-inverse that in addition satisfies $A^+AA^+ = A^+$, with $AA^+$ as well as $A^+A$ symmetric. It always exists and is also unique, but [...]

[...] matrix has a g-inverse.

Answer. Simple: a null matrix has its transpose as a g-inverse, and if $A \ne O$ then $RL$ is such a g-inverse.

The g-inverse of a number is its inverse if the number is nonzero, and is arbitrary otherwise. Scalar expressions written as fractions are in many cases the multiplication by a g-inverse. We will use a fraction with a thick horizontal rule to indicate where this is the case. In other [...]

[...] squared spectral norm is $\lambda_{ii}^2$, and therefore the spectral norm itself is $\lambda_{ii}$.

A.3. Inverses and g-Inverses of Matrices

A g-inverse of a matrix $A$ is any matrix $A^-$ satisfying
(A.3.1)   $A = AA^-A.$
It always exists but is not always unique. If $A$ is square and nonsingular, then $A^{-1}$ is its only g-inverse.

Problem 568. Show that a symmetric matrix $\Omega$ has a g-inverse which is also symmetric. [...]

[...] with $\Omega = QQ^\top$ and $P$ with $\Sigma = PP^\top$, and define $A = KQ^\top P$. The independence of the choice of g-inverses follows from theorem A.3.1 together with (A.5.11). The following was apparently first shown in [Alb69] for the special case of the Moore-Penrose pseudoinverse:

Theorem A.5.11. The symmetric partitioned matrix $\Omega = \begin{pmatrix} \Omega_{yy} & \Omega_{yz} \\ \Omega_{yz}^\top & \Omega_{zz} \end{pmatrix}$ is nonnegative definite if and only if the following conditions [...]

[...] that with the ordinary fraction. This idiosyncratic notation allows one to write certain theorems in a more concise form, but it requires more work in the proofs, because one has to consider the additional case that the denominator is zero. Theorems A.5.8 and A.8.2 are examples.

Theorem A.3.1. If $B = AA^-B$ holds for one g-inverse $A^-$ of $A$, then it holds for all g-inverses. If $A$ is symmetric and $B = AA^-B$, [...]

[...] (1b.5.5) on p. 26]. Define $D = A - AA^\top(AA^\top)^-A$ and show, by multiplying out, that $DD^\top = O$.

A.4. Deficiency Matrices

Here is again some idiosyncratic terminology and notation. It gives an explicit algebraic formulation for something that is often done implicitly or in a geometric paradigm. A matrix $G$ will be called a "left deficiency matrix" of $S$, in symbols $G \perp S$, if $GS = O$, and for all $Q$ with $QS = O$ there is an [...]

[...] in (A.9.1). Then $z^\top A^\top A z = z^\top Q\Lambda^2 Q^\top z$. Therefore we can first show: there is a $z$ of the form $z = Qx$ which attains this maximum. Proof: for every $z$ which has a nonzero value in the numerator of (A.2.1), set $x = Q^\top z$. Then $x \ne o$, and $Qx$ attains the same value as $z$ in the numerator of (A.2.1), and a smaller or equal value in the denominator. Therefore one can restrict the search for the maximum argument to vectors [...]

[...] norm is the maximum singular value $\mu_{\max}$, and if $A$ is square and nonsingular, then $\|A^{-1}\| = 1/\mu_{\min}$. It is a true norm, i.e., $\|A\| = 0$ if and only if $A = O$; furthermore $\|\lambda A\| = |\lambda|\cdot\|A\|$, and the triangle inequality $\|A + B\| \le \|A\| + \|B\|$ holds. In addition, it obeys $\|AB\| \le \|A\|\cdot\|B\|$.

Problem 567. Show that the spectral norm is the maximum singular value.

Answer. Use the definition (A.2.1): $\|A\|^2 = \max_{z \ne o} \dfrac{z^\top A^\top A z}{z^\top z}$. Write $A = P\Lambda Q^\top$ as in (A.9.1); then $z^\top A^\top$ [...]
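A small numerical check of two of the claims quoted above (the example matrix and the random directions are invented): numpy's pinv returns the Moore-Penrose pseudoinverse, which in particular satisfies the g-inverse property (A.3.1), and the spectral norm of a matrix equals its largest singular value.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])            # rank-deficient, so A has no ordinary inverse

A_plus = np.linalg.pinv(A)                  # Moore-Penrose pseudoinverse A+
print(np.allclose(A @ A_plus @ A, A))       # g-inverse property A = A A- A of (A.3.1)

sigma_max = np.linalg.svd(A, compute_uv=False)[0]    # largest singular value
z = rng.normal(size=(3, 5000))                        # many random directions z
ratios = np.linalg.norm(A @ z, axis=0) / np.linalg.norm(z, axis=0)
print(sigma_max, ratios.max())              # max ||Az|| / ||z|| approaches sigma_max
```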