CHAPTER 59

Generalized Method of Moments Estimators

This follows mainly [DM93, Chapter 17]. A good and accessible treatment is [Mát99]. The textbook [Hay00] uses GMM as the organizing principle for all estimation methods except maximum likelihood.

A moment µ of a random variable y is the expected value of some function of y. Such a moment is therefore defined by the equation

(59.0.8)    E[g(y) − µ] = 0.

The same parameter-defining function g(y) − µ defines the method of moments estimator µ̂ of µ if one replaces the expected value in (59.0.8) with the sample mean of the elements of an observation vector y consisting of independent observations of y. In other words, µ̂(y) is that value which satisfies

(1/n) Σ_{i=1}^n (g(y_i) − µ̂) = 0.

The generalized method of moments estimator extends this rule in several respects: the y_i no longer have to be i.i.d., the parameter-defining equations may be a system of equations defining more than one parameter at a time, there may be more parameter-defining functions than parameters (overidentification), and not only unconditional but also conditional moments are considered.

Under this definition, the OLS estimator is a GMM estimator. To show this, we will write the linear model y = Xβ + ε row by row as y_i = x_i⊤β + ε_i, where x_i is, as in various earlier cases, the ith row of X written as a column vector. The basic property which makes least squares consistent is that the following conditional expectation is zero:

(59.0.9)    E[y_i − x_i⊤β | x_i] = 0.

This is more information than just knowing that the unconditional expectation is zero. How can this additional information be used to define an estimator? From (59.0.9) it follows that the unconditional expectation of the product x_i(y_i − x_i⊤β) is zero:

(59.0.10)    E[x_i(y_i − x_i⊤β)] = o.

Replacing the expected value by the sample mean gives

(59.0.11)    (1/n) Σ_{i=1}^n x_i(y_i − x_i⊤β̂) = o,

which can also be written as

(59.0.12)    (1/n) [x_1 ⋯ x_n] [y_1 − x_1⊤β̂, …, y_n − x_n⊤β̂]⊤ ≡ (1/n) X⊤(y − Xβ̂) = o.

These are exactly the OLS Normal Equations. This shows that OLS in the linear model is a GMM estimator. Note that the rows of the X-matrix play two different roles in this derivation: they appear in the equation y_i = x_i⊤β + ε_i, and they are also the information set based on which the conditional expectation in (59.0.9) is formed. If this latter role is assumed by the rows of a different matrix of observations W, then the GMM estimator becomes the Instrumental Variables estimator.

Most maximum likelihood estimators are also GMM estimators. As long as the maxima are at the interior of the parameter region, the ML estimators solve the first order conditions, i.e., the Jacobian of the log likelihood function evaluated at these estimators is zero. But it follows from the theory of maximum likelihood estimation that the expected value of the Jacobian of the log likelihood function is zero.
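Before turning to the general definitions, here is a small numerical check of (59.0.11)–(59.0.12), a sketch in Python/NumPy on simulated data (the data-generating values below are invented for illustration): the OLS coefficients make the sample moment conditions exactly zero.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated data for the linear model y = X*beta + epsilon (illustrative values only)
    n, k = 200, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
    beta_true = np.array([1.0, 2.0, -0.5])
    y = X @ beta_true + rng.normal(size=n)

    # OLS via the normal equations X'(y - X*beta_hat) = 0
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Sample moment conditions (59.0.11)/(59.0.12): (1/n) X'(y - X*beta_hat)
    g_bar = X.T @ (y - X @ beta_hat) / n
    print(g_bar)   # numerically zero: OLS is the exactly identified GMM estimator here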
Here are the general definitions and theorems, and, as examples, their application to the textbook example of the Gamma distribution in [Gre97, p. 518] and to the Instrumental Variables estimator.

y is a vector of n observations created by a Data Generating Process (DGP) µ ∈ M. θ is a k-vector of nonrandom parameters. A parameter-defining function F(y, θ) is an n × ℓ matrix function with the following properties (a), (b), and (c):

(a) The ith row of F only depends on the ith observation y_i, i.e.,

(59.0.13)    F(y, θ) = [f_1(y_1, θ), …, f_n(y_n, θ)]⊤,   with ith row f_i⊤(y_i, θ).

Sometimes the f_i have identical functional form and only differ by the values of some exogenous variables, i.e., f_i(y_i, θ) = g(y_i, x_i, θ), but sometimes they have genuinely different functional forms.

In the Gamma-function example, M is the set of all Gamma distributions, θ = [r, λ]⊤ consists of the two parameters of the Gamma distribution, ℓ = k = 2, and the parameter-defining function has the rows

(59.0.14)    f_i⊤(y_i, θ) = [y_i − r/λ,   1/y_i − λ/(r−1)],

so that the ith row of F(y, θ) is [y_i − r/λ,  1/y_i − λ/(r−1)].

In the IV case, θ = β and ℓ is the number of instruments. If we split X and W into their rows,

(59.0.15)    X = [x_1, …, x_n]⊤   and   W = [w_1, …, w_n]⊤,

then f_i(y_i, β) = w_i(y_i − x_i⊤β). This gives

(59.0.16)    F(y, β) = [(y_1 − x_1⊤β)w_1, …, (y_n − x_n⊤β)w_n]⊤ = diag(y − Xβ) W.

(b) The vector functions f_i(y_i, θ) must be such that the true value of the parameter vector θ_µ satisfies

(59.0.17)    E[f_i(y_i, θ_µ)] = o   for all i,

while any other parameter vector θ ≠ θ_µ gives E[f_i(y_i, θ)] ≠ o.

In the Gamma example, (59.0.17) follows from the fact that the moments of the Gamma distribution are E[y_i] = r/λ and E[1/y_i] = λ/(r−1). It is also easy to see that r and λ are characterized by these two relations; given E[y_i] = µ and E[1/y_i] = ν, one can solve for r = µν/(µν − 1) and λ = ν/(µν − 1).

In the IV model, (59.0.17) is satisfied if the ε_i have zero expectation conditionally on w_i, and uniqueness is condition (52.0.3), requiring that plim (1/n) W_n⊤X_n exists, is nonrandom, and has full column rank. (In the 781 handout Winter 1998, (52.0.3) was equation (246) on p. 154.)

Next we need a recipe for how to construct an estimator from this parameter-defining function. Let us first discuss the case k = ℓ (exact identification). The GMM estimator θ̂ defined by F satisfies

(59.0.18)    (1/n) F⊤(y, θ̂) ι = o,

which can also be written in the form

(59.0.19)    (1/n) Σ_{i=1}^n f_i(y_i, θ̂) = o.

Assumption (c) for a parameter-defining function is that there is only one θ̂ satisfying (59.0.18).

For IV,

(59.0.20)    F⊤(y, β̃) ι = W⊤ diag(y − Xβ̃) ι = W⊤(y − Xβ̃).

If there are as many instruments as explanatory variables, setting this zero gives the normal equation for the simple IV estimator, W⊤(y − Xβ̃) = o.

In the case ℓ > k, (59.0.17) still holds, but the system of equations (59.0.18) no longer has a solution: there are ℓ > k relationships for the k parameters. In order to handle this situation, we need to specify what qualifies as a weighting matrix. The symmetric positive definite ℓ × ℓ matrix A(y) is a weighting matrix if it has a nonrandom positive definite plim A₀ = plim_{n→∞} A(y). Instead of (59.0.18), now the following equation serves to define θ̂:

(59.0.21)    θ̂ = argmin_θ  ι⊤ F(y, θ) A(y) F⊤(y, θ) ι.

In this case, condition (c) for a parameter-defining equation reads that there is only one θ̂ which minimizes this criterion function.

For IV, A(y) does not depend on y but is ((1/n) W⊤W)⁻¹. Therefore A₀ = plim ((1/n) W⊤W)⁻¹, and (59.0.21) becomes β̃ = argmin_β (y − Xβ)⊤ W (W⊤W)⁻¹ W⊤ (y − Xβ), which is indeed the quadratic form minimized by the generalized instrumental variables estimator.
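Returning for a moment to the exactly identified Gamma example: here is a small sketch (Python/NumPy, with invented parameter values) of the method of moments estimator (59.0.19), using the closed-form solution for r and λ derived above from E[y] = µ and E[1/y] = ν.

    import numpy as np

    rng = np.random.default_rng(1)

    # Simulated sample from a Gamma(r, lambda) distribution (rate parametrization;
    # the true values below are invented for illustration)
    r_true, lam_true = 3.0, 2.0
    y = rng.gamma(shape=r_true, scale=1.0 / lam_true, size=5000)

    # Exactly identified method of moments: equate the sample analogues of
    # E[y] = r/lambda and E[1/y] = lambda/(r-1) to their theoretical values
    mu = y.mean()            # estimate of E[y]
    nu = (1.0 / y).mean()    # estimate of E[1/y]

    r_hat = mu * nu / (mu * nu - 1.0)
    lam_hat = nu / (mu * nu - 1.0)
    print(r_hat, lam_hat)    # close to (3, 2) in large samples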
In order to convert the Gamma-function example into an overidentified system, we add a third relation, using E[y_i²] = r(r+1)/λ²:

(59.0.22)    F(y, θ) now has ith row [y_i − r/λ,   1/y_i − λ/(r−1),   y_i² − r(r+1)/λ²].

In this case it is possible to compute the asymptotic covariance matrix; but in real-life situations this covariance matrix is estimated using a preliminary consistent estimator of the parameters, as [Gre97] does. Most GMM estimators depend on such a consistent pre-estimator.

The GMM estimator θ̂ defined in this way is a particular kind of M-estimator, and many of its properties follow from the general theory of M-estimators. We need some more definitions. Define D = plim (1/n) ∂F⊤ι/∂θ⊤, the plim of the Jacobian of the parameter-defining mapping, and Ψ = plim (1/n) F⊤F, the plim of the covariance matrix of (1/√n) F⊤ι.

For IV, D = plim (1/n) ∂W⊤(y − Xβ)/∂β⊤ = −plim_{n→∞} (1/n) W⊤X, and Ψ = plim (1/n) W⊤ diag(y − Xβ) diag(y − Xβ) W = plim (1/n) W⊤ΩW, where Ω is the diagonal matrix with typical element E[(y_i − x_i⊤β)²], i.e., Ω = V[ε].

With this notation, the theory of M-estimators gives us the following result: the asymptotic MSE-matrix of the GMM estimator is

(59.0.23)    (D⊤A₀D)⁻¹ D⊤A₀ΨA₀D (D⊤A₀D)⁻¹.

This gives the following expression for the asymptotic MSE matrix of √n times the sampling error of the IV estimator:

(59.0.24)    plim ((1/n)X⊤W ((1/n)W⊤W)⁻¹ (1/n)W⊤X)⁻¹ (1/n)X⊤W ((1/n)W⊤W)⁻¹ (1/n)W⊤ΩW ((1/n)W⊤W)⁻¹ (1/n)W⊤X ((1/n)X⊤W ((1/n)W⊤W)⁻¹ (1/n)W⊤X)⁻¹

(59.0.25)    = plim n (X⊤W(W⊤W)⁻¹W⊤X)⁻¹ X⊤W(W⊤W)⁻¹W⊤ΩW(W⊤W)⁻¹W⊤X (X⊤W(W⊤W)⁻¹W⊤X)⁻¹.

The asymptotic MSE matrix can be obtained from this by dividing by n. An estimate of the asymptotic covariance matrix is therefore

(59.0.26)    (X⊤W(W⊤W)⁻¹W⊤X)⁻¹ X⊤W(W⊤W)⁻¹W⊤ΩW(W⊤W)⁻¹W⊤X (X⊤W(W⊤W)⁻¹W⊤X)⁻¹.

This is [DM93, (17.36) on p. 596].

The best choice of such a weighting matrix is A₀ = Ψ⁻¹, in which case (59.0.23) simplifies to (D⊤Ψ⁻¹D)⁻¹ = (D⊤A₀D)⁻¹. The criterion function which the optimal IV estimator must minimize, in the presence of unknown heteroskedasticity, is therefore

(59.0.27)    (y − Xβ)⊤ W (W⊤ΩW)⁻¹ W⊤ (y − Xβ).

The first-order conditions are

(59.0.28)    X⊤ W (W⊤ΩW)⁻¹ W⊤ (y − Xβ) = o,

and the optimally weighted IV estimator is

(59.0.29)    β̃ = (X⊤W(W⊤ΩW)⁻¹W⊤X)⁻¹ X⊤W(W⊤ΩW)⁻¹W⊤y.

In this, Ω can be replaced by an inconsistent estimate, for instance the diagonal matrix with the squared 2SLS residuals in the diagonal; this is what [DM93] refer to as H2SLS. In the simple IV case, this estimator is the simple IV estimator again. In other words, we need more than the minimum number of instruments to be able to take advantage of the estimated heteroskedasticity. [Cra83] proposes, in the OLS case, i.e., W = X, to use the squares of the regressors etc. as additional instruments.

To show this optimality, take some square nonsingular Q with Ψ = QQ⊤ and define P = Q⁻¹. Then

(59.0.30)    (D⊤A₀D)⁻¹ D⊤A₀ΨA₀D (D⊤A₀D)⁻¹ − (D⊤Ψ⁻¹D)⁻¹ =

(59.0.31)    = (D⊤A₀D)⁻¹ D⊤A₀Q (I − PD(D⊤Ψ⁻¹D)⁻¹D⊤P⊤) Q⊤A₀D (D⊤A₀D)⁻¹,

which is nonnegative definite because the matrix in the middle is idempotent.
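As a computational sketch of the H2SLS recipe around (59.0.29) (Python/NumPy; the function below illustrates the two-step weighting only and is not code from the original notes):

    import numpy as np

    def h2sls(y, X, W):
        """Optimally weighted IV ("H2SLS") as in (59.0.29), with Omega replaced
        by the diagonal matrix of squared 2SLS residuals."""
        # Step 1: generalized IV / 2SLS estimator, i.e. the minimizer of the
        # quadratic form (y - X b)' W (W'W)^{-1} W' (y - X b)
        PW = W @ np.linalg.solve(W.T @ W, W.T)       # projector onto the column space of W
        beta_2sls = np.linalg.solve(X.T @ PW @ X, X.T @ PW @ y)
        e = y - X @ beta_2sls                        # 2SLS residuals

        # Step 2: re-weight with (W' Omega_hat W)^{-1}, Omega_hat = diag(e_i^2)
        WOW = (W * e[:, None] ** 2).T @ W            # = W' diag(e^2) W
        A = np.linalg.solve(WOW, W.T @ X)            # (W' Omega_hat W)^{-1} W'X
        b = np.linalg.solve(WOW, W.T @ y)            # (W' Omega_hat W)^{-1} W'y
        beta_h2sls = np.linalg.solve(X.T @ W @ A, X.T @ W @ b)
        return beta_2sls, beta_h2sls

With exactly as many instruments as regressors the weighting cancels and beta_h2sls coincides with beta_2sls, in line with the remark above that overidentification is needed to take advantage of the estimated heteroskedasticity.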
The advantage of the GMM is that it is valid for many different DGP's. In this respect it is the opposite of the maximum likelihood estimator, which needs a very specific DGP. The more broadly the DGP can be defined, the better the chances are that the GMM estimator is efficient, i.e., in large samples as good as maximum likelihood. …

CHAPTER 60

Bootstrap Estimators

The bootstrap method is an important general estimation principle, which can serve as an alternative to reliance on the asymptotic properties of an estimator. Assume you have an n × k data matrix X, each row of which is an independent observation from the same unknown probability distribution, characterized by the cumulative distribution function F. Using this data set you want to draw conclusions about …

The bootstrap estimation principle is very simple: as your estimate of the distribution of x you use F_n, the empirical distribution of the given sample X, i.e., that probability distribution which assigns probability mass 1/n to each of the k-dimensional observation points x_t (or, if the observation x_t occurred more than once, say j times, then you assign the probability mass j/n to this point). This empirical …

… nonparametric maximum likelihood estimate of F. And your estimate of the distribution of θ(x) is that distribution which derives from this empirical distribution function. Just like the maximum likelihood principle, this principle is deceptively simple but has some deep probability-theoretic foundations. In simple cases this is a widely used principle; the sample mean, for instance, is the expected value of the …

… little more complicated, and one wants more complex measures of its distribution, such as the standard deviation of a complicated function of x, or some confidence intervals, an analytical expression for this bootstrap estimate is prohibitively complex. But with the availability of modern computing power, an alternative to the analytical evaluation is feasible: draw a large random sample from the empirical …

… each x in this artificially generated random sample, and use these datapoints to construct the distribution function of θ(x). A random sample from the empirical distribution is merely a random drawing from the given values with replacement. This requires computing power; usually one has to resample between 1,000 and 10,000 times to get accurate results, but one does not need to do complicated math, and these …
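In outline (a sketch in Python/NumPy; the statistic θ and the data below are placeholders for illustration):

    import numpy as np

    def bootstrap_distribution(data, theta, B=2000, seed=0):
        """Draw B resamples (rows drawn with replacement) and return the B values
        of the statistic theta -- an estimate of theta's sampling distribution."""
        rng = np.random.default_rng(seed)
        n = data.shape[0]
        reps = np.empty(B)
        for b in range(B):
            idx = rng.integers(0, n, size=n)   # drawing from the given values, with replacement
            reps[b] = theta(data[idx])
        return reps

    # Example: bootstrap standard error of the sample median of one variable
    x = np.random.default_rng(1).exponential(size=100).reshape(-1, 1)
    reps = bootstrap_distribution(x, theta=np.median, B=2000)
    print(reps.std(ddof=1))   # bootstrap estimate of the standard deviation of the median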
So far we have been discussing the situation that all observations come from the same population. In the regression context this is not the case. In the OLS model with i.i.d. disturbances, the observations of the dependent variable y_t have different expected values, i.e., they do not come from the same population. On the other hand, the disturbances come from the …

… methods here by first computing the OLS residuals and then drawing from these residuals to get pseudo-datapoints and to run the regression on those. This is a surprising and strong result; but one has to be careful here that the OLS model is correctly specified. For instance, if there is heteroskedasticity which is not corrected for, then the resampling would no longer be uniform, and the bootstrap least squares estimates are inconsistent. …

The jackknife was originally invented and is often still introduced as a device to reduce bias, but [Efr82, p. 10] claims that this motivation is mistaken. It is an alternative to the bootstrap, in which random sampling is replaced by a symmetric systematic “sampling” of datasets which are by 1 observation smaller than the original one: namely, n drawings with one observation left out in each. In certain situations this is as good as bootstrapping, but much cheaper. A third concept is cross-validation.

There is a new book out, [ET93], for which the authors also have written bootstrap and jackknife functions for Splus, to be found if one does attach("/home/econ/eh…
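For comparison with the bootstrap sketch above, here is the leave-one-out scheme just described (Python/NumPy; statistic and data are again placeholders):

    import numpy as np

    def jackknife_se(data, theta):
        """Jackknife standard error: recompute theta on the n leave-one-out
        datasets and combine the results with the usual (n-1)/n scaling."""
        n = data.shape[0]
        loo = np.array([theta(np.delete(data, i, axis=0)) for i in range(n)])
        return np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))

    # Example: for the sample mean the jackknife reproduces s/sqrt(n) exactly
    x = np.random.default_rng(2).normal(size=50).reshape(-1, 1)
    print(jackknife_se(x, theta=np.mean))
    print(x.std(ddof=1) / np.sqrt(len(x)))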