
Class Notes in Statistics and Econometrics, Part 20



CHAPTER 39. Random Regressors

Until now we always assumed that X was nonrandom, i.e., that the hypothetical repetitions of the experiment used the same X matrix. In the nonexperimental sciences, such as economics, this assumption is clearly inappropriate. It is only justified because most results valid for nonrandom regressors can be generalized to the case of random regressors. To indicate that the regressors are random, we will write them as $\mathbf{X}$.

39.1. Strongest Assumption: Error Term Well Behaved Conditionally on Explanatory Variables

The assumption which we will discuss first is that $\mathbf{X}$ is random, but the classical assumptions hold conditionally on $\mathbf{X}$, i.e., the conditional expectation $E[\varepsilon \mid \mathbf{X}] = o$ and the conditional variance-covariance matrix $V[\varepsilon \mid \mathbf{X}] = \sigma^2 I$. In this situation, the least squares estimator has all the classical properties conditionally on $\mathbf{X}$, for instance $E[\hat\beta \mid \mathbf{X}] = \beta$, $V[\hat\beta \mid \mathbf{X}] = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$, $E[s^2 \mid \mathbf{X}] = \sigma^2$, etc.

Moreover, certain properties of the least squares estimator remain valid unconditionally. An application of the law of iterated expectations shows that the least squares estimator $\hat\beta$ is still unbiased. Start with (24.0.7):

(39.1.1)  $\hat\beta - \beta = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\varepsilon$

(39.1.2)  $E[\hat\beta - \beta \mid \mathbf{X}] = E[(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\varepsilon \mid \mathbf{X}] = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top E[\varepsilon \mid \mathbf{X}] = o$

(39.1.3)  $E[\hat\beta - \beta] = E\bigl[E[\hat\beta - \beta \mid \mathbf{X}]\bigr] = o.$

Problem 408. (1 point) In the model with random explanatory variables $\mathbf{X}$ you are considering an estimator $\tilde\beta$ of $\beta$. Which statement is stronger: $E[\tilde\beta] = \beta$, or $E[\tilde\beta \mid \mathbf{X}] = \beta$? Justify your answer.

Answer. The second statement is stronger. The first statement follows from the second by the law of iterated expectations.

Problem 409. (2 points) Assume the regressors $\mathbf{X}$ are random, and the classical assumptions hold conditionally on $\mathbf{X}$, i.e., $E[\varepsilon \mid \mathbf{X}] = o$ and $V[\varepsilon \mid \mathbf{X}] = \sigma^2 I$. Show that $s^2$ is an unbiased estimate of $\sigma^2$.

Answer. From the theory with nonrandom explanatory variables follows $E[s^2 \mid \mathbf{X}] = \sigma^2$. Therefore $E[s^2] = E\bigl[E[s^2 \mid \mathbf{X}]\bigr] = E[\sigma^2] = \sigma^2$. In words: if the expectation conditional on $\mathbf{X}$ does not depend on $\mathbf{X}$, then it is also the unconditional expectation.

The law of iterated expectations can also be used to compute the unconditional MSE matrix of $\hat\beta$:

(39.1.4)  $\mathrm{MSE}[\hat\beta; \beta] = E[(\hat\beta - \beta)(\hat\beta - \beta)^\top]$

(39.1.5)  $= E\bigl[E[(\hat\beta - \beta)(\hat\beta - \beta)^\top \mid \mathbf{X}]\bigr]$

(39.1.6)  $= E[\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}]$

(39.1.7)  $= \sigma^2 E[(\mathbf{X}^\top\mathbf{X})^{-1}].$

Problem 410. (2 points) Show that $s^2(\mathbf{X}^\top\mathbf{X})^{-1}$ is an unbiased estimator of $\mathrm{MSE}[\hat\beta; \beta]$.

Answer.

(39.1.8)  $E[s^2(\mathbf{X}^\top\mathbf{X})^{-1}] = E\bigl[E[s^2(\mathbf{X}^\top\mathbf{X})^{-1} \mid \mathbf{X}]\bigr]$

(39.1.9)  $= E[\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}]$

(39.1.10)  $= \sigma^2 E[(\mathbf{X}^\top\mathbf{X})^{-1}]$

(39.1.11)  $= \mathrm{MSE}[\hat\beta; \beta]$ by (39.1.7).

The Gauss-Markov theorem generalizes in the following way: say $\tilde\beta$ is an estimator, linear in $y$ but not necessarily in $\mathbf{X}$, satisfying $E[\tilde\beta \mid \mathbf{X}] = \beta$ (which is stronger than unbiasedness); then $\mathrm{MSE}[\tilde\beta; \beta] \ge \mathrm{MSE}[\hat\beta; \beta]$. The proof is immediate: we know by the usual Gauss-Markov theorem that $\mathrm{MSE}[\tilde\beta; \beta \mid \mathbf{X}] \ge \mathrm{MSE}[\hat\beta; \beta \mid \mathbf{X}]$, and taking expected values preserves this inequality: $E\bigl[\mathrm{MSE}[\tilde\beta; \beta \mid \mathbf{X}]\bigr] \ge E\bigl[\mathrm{MSE}[\hat\beta; \beta \mid \mathbf{X}]\bigr]$, but these expected values are exactly the unconditional MSE matrices.

The assumption $E[\varepsilon \mid \mathbf{X}] = o$ can also be written $E[y \mid \mathbf{X}] = \mathbf{X}\beta$, and $V[\varepsilon \mid \mathbf{X}] = \sigma^2 I$ can also be written as $V[y \mid \mathbf{X}] = \sigma^2 I$. Both of these are assumptions about the conditional distribution of $y$ given $\mathbf{X} = X$, for all $X$.
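The unconditional unbiasedness of $\hat\beta$ and $s^2$ lends itself to a quick Monte Carlo check. The sketch below is a minimal illustration, not taken from the text: the data-generating process, sample size, and parameter values are arbitrary choices. Because $\mathbf{X}$ is redrawn in every replication while $\varepsilon$ is drawn independently of it, averaging over replications approximates the unconditional expectations $E[\hat\beta]$ and $E[s^2]$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
beta = np.array([1.0, -2.0, 0.5])   # illustrative true coefficients
sigma2 = 4.0                        # illustrative true error variance
reps = 20_000

beta_hats = np.empty((reps, k))
s2s = np.empty(reps)
for r in range(reps):
    # X is redrawn each replication: the regressors are random,
    # and eps is drawn independently of X, so E[eps | X] = 0 holds.
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
    eps = rng.normal(scale=np.sqrt(sigma2), size=n)
    y = X @ beta + eps
    b = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimate
    resid = y - X @ b
    beta_hats[r] = b
    s2s[r] = resid @ resid / (n - k)        # s^2 with n - k degrees of freedom

print("mean of beta_hat:", beta_hats.mean(axis=0))   # close to (1, -2, 0.5)
print("mean of s^2:     ", s2s.mean())               # close to 4
```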
This suggests the following broadening of the regression paradigm: $y$ and $\mathbf{X}$ are jointly distributed random variables, and one is interested in how the conditional distribution $y \mid \mathbf{X} = X$ depends on $X$. If the expected value of this distribution depends linearly on $X$, and the variance of this distribution is constant, then this is the linear regression model discussed above. But the expected value might also depend on $X$ in a nonlinear fashion (nonlinear least squares), and the variance may not be constant, in which case the intuition that $y$ is some function of $X$ plus some error term may no longer be appropriate; $y$ may for instance be the outcome of a binary choice, the probability of which depends on $X$ (see chapter 69.2; the generalized linear model).

39.2. Contemporaneously Uncorrelated Disturbances

In many situations with random regressors, the condition $E[\varepsilon \mid \mathbf{X}] = o$ is not satisfied. Instead, the columns of $\mathbf{X}$ are contemporaneously uncorrelated with $\varepsilon$, but they may be correlated with past values of $\varepsilon$. The main example here is regression with a lagged dependent variable. In this case, OLS is no longer unbiased, but asymptotically it still has all the good properties: it is asymptotically normal with the covariance matrix which one would expect. Asymptotically, the computer printout is still valid. This is a very important result, which is often used in econometrics, but most econometrics textbooks do not even start to prove it. There is a proof in [Kme86, pp. 749–757], and one in [Mal80, pp. 535–539].

Problem 411. Since least squares with random regressors is appropriate whenever the disturbances are contemporaneously uncorrelated with the explanatory variables, a friend of yours proposes to test for random explanatory variables by checking whether the sample correlation coefficients between the residuals and the explanatory variables are significantly different from zero or not. Is this an appropriate statistic?

Answer. No. The sample correlation coefficients are always zero!

39.3. Disturbances Correlated with Regressors in Same Observation

But if $\varepsilon$ is contemporaneously correlated with $\mathbf{X}$, then OLS is inconsistent. This can be the case in some dynamic processes (lagged dependent variable as regressor together with autocorrelated errors, see question 506), when there are, in addition to the relation which one wants to test with the regression, other relations making the right-hand side variables dependent on the left-hand side variable, or when the right-hand side variables are measured with errors. This is usually the case in economics, and econometrics has developed the technique of simultaneous equations estimation to deal with it.

Problem 412. (3 points) What does one have to watch out for if some of the regressors are random?
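The contrast between sections 39.2 and 39.3 can be made concrete with a small simulation. The sketch below uses made-up parameter values, not anything from the text: OLS in an autoregression (regressor uncorrelated with the current disturbance but correlated with past ones) is biased in small samples, yet the bias shrinks as the sample grows; OLS with a regressor measured with error (regressor correlated with the disturbance of the same observation) stays biased no matter how large the sample.

```python
import numpy as np

rng = np.random.default_rng(1)

def ar1_ols(T, rho=0.5, reps=5_000):
    """Average OLS estimate of rho in y_t = rho*y_{t-1} + eps_t."""
    est = np.empty(reps)
    for r in range(reps):
        eps = rng.normal(size=T + 1)
        y = np.empty(T + 1)
        y[0] = eps[0]
        for t in range(1, T + 1):
            y[t] = rho * y[t - 1] + eps[t]
        x, y_next = y[:-1], y[1:]
        est[r] = (x @ y_next) / (x @ x)
    return est.mean()

def eiv_ols(n, beta=1.0, me_var=1.0, reps=5_000):
    """Average OLS slope when the regressor is observed with measurement error."""
    est = np.empty(reps)
    for r in range(reps):
        x_true = rng.normal(size=n)
        y = beta * x_true + rng.normal(size=n)
        x_obs = x_true + rng.normal(scale=np.sqrt(me_var), size=n)
        est[r] = (x_obs @ y) / (x_obs @ x_obs)
    return est.mean()

# Lagged dependent variable (section 39.2): bias shrinks as T grows.
for T in (20, 80, 320):
    print(f"AR(1), T = {T:3d}: mean rho_hat = {ar1_ols(T):.3f}   (true rho = 0.5)")

# Measurement error (section 39.3): the estimate stays near 0.5 instead of 1,
# however large n becomes (attenuation toward beta * var(x)/(var(x)+me_var)).
for n in (20, 80, 320):
    print(f"EIV,   n = {n:3d}: mean beta_hat = {eiv_ols(n):.3f}   (true beta = 1.0)")
```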
CHAPTER 40. The Mahalanobis Distance

Everything in this chapter is unpublished work, presently still in draft form. The aim is to give a motivation for the least squares objective function in terms of an initial measure of precision. The case of prediction is mathematically simpler than that of estimation, therefore this chapter will only discuss prediction. We assume that the joint distribution of $\mathbf{y}$ and $\mathbf{z}$ has the form

(40.0.1)  $\begin{bmatrix}\mathbf{y}\\ \mathbf{z}\end{bmatrix} \sim \left(\begin{bmatrix}X\\ W\end{bmatrix}\beta,\ \sigma^2\begin{bmatrix}\Omega_{yy} & \Omega_{yz}\\ \Omega_{zy} & \Omega_{zz}\end{bmatrix}\right)$, with $\sigma^2 > 0$ but otherwise unknown, and $\beta$ unknown as well.

$\mathbf{y}$ is observed but $\mathbf{z}$ is not and has to be predicted. But assume we are not interested in the MSE since we do the experiment only once. We want to predict $\mathbf{z}$ in such a way that, whatever the true value of $\beta$, the predicted value $z^*$ "blends in" best with the given data $y$.

There is an important conceptual difference between this criterion and the one based on the MSE. The present criterion cannot be applied until after the data are known, therefore it is called a "final" criterion, as opposed to the "initial" criterion of the MSE. See Barnett [Bar82, pp. 157–159] for a good discussion of these issues.

How do we measure the degree to which a given data set "blends in," i.e., is not an outlier, for a given distribution? Hypothesis testing uses this criterion. The most often-used testing principle is: reject the null hypothesis if the observed value of a certain statistic is too much of an outlier for the distribution which this statistic would have under the null hypothesis. If the statistic is a scalar, and if under the null hypothesis this statistic has expected value $\mu$ and standard deviation $\sigma$, then one often uses an estimate of $|x - \mu|/\sigma$, the number of standard deviations the observed value is away from the mean, to measure the "distance" of the observed value $x$ from the distribution $(\mu, \sigma^2)$. The Mahalanobis distance generalizes this concept to the case that the test statistic is a vector random variable.

40.1. Definition of the Mahalanobis Distance

Since it is mathematically more convenient to work with the squared distance than with the distance itself, we will make the following thought experiment to motivate the Mahalanobis distance. How could one generalize the squared scalar distance $(y - \mu)^2/\sigma^2$ to the distance of a vector value $y$ from the distribution of the vector random variable $\mathbf{y} \sim (\mu, \sigma^2\Omega)$? If all $y_i$ have the same variance $\sigma^2$, i.e., if $\Omega = I$, one might measure the squared distance of $y$ from the distribution $(\mu, \sigma^2\Omega)$ by $\frac{1}{\sigma^2}\max_i (y_i - \mu_i)^2$, but since the maximum from two trials is bigger than the value from one trial only, one should perhaps divide this by the expected value of such a maximum. If the variances are different, say $\sigma_i^2$, one might want to look at the number of standard deviations which the "worst" component of $y$ is away from what would be its mean if $y$ were an observation of $\mathbf{y}$; i.e., the squared distance of the observed vector from the distribution would be $\max_i \frac{(y_i - \mu_i)^2}{\sigma_i^2}$, again normalized by its expected value.

The principle actually used by the Mahalanobis distance goes only a small step further than the examples just cited. It is coordinate-free, i.e., any linear combination of the elements of $y$ is considered on an equal footing with these elements themselves. In other words, it does not distinguish between variates and variables. The distance of a given vector value from a certain multivariate distribution is defined to be the distance of the "worst" linear combination of the elements of this vector from the univariate distribution of this linear combination, normalized in such a way that the expected value of this distance is 1.

Definition 40.1.1. Given a random $n$-vector $\mathbf{y}$ which has an expected value and a nonsingular covariance matrix, the squared "Mahalanobis distance" or "statistical distance" of the observed value $y$ from the distribution of $\mathbf{y}$ is defined to be

(40.1.1)  $\mathrm{MHD}[y; \mathbf{y}] = \frac{1}{n}\max_g \frac{\bigl(g^\top y - E[g^\top\mathbf{y}]\bigr)^2}{\mathrm{var}[g^\top\mathbf{y}]}.$

If the denominator $\mathrm{var}[g^\top\mathbf{y}]$ is zero, then $g = o$, and therefore the numerator is zero as well. In this case the fraction is defined to be zero.
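Definition 40.1.1 can be explored by brute force: draw many candidate directions $g$, evaluate the ratio in (40.1.1) for each, and keep the largest value. The sketch below does this for a made-up three-dimensional example (mean, covariance matrix, and observed vector are arbitrary illustrations); the search value stabilizes as more directions are tried, and it approaches the closed form that Theorem 40.1.2 below gives.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up example: distribution of a random 3-vector and one observed value.
mu = np.array([0.0, 1.0, 2.0])
Omega = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
sigma2 = 1.0
Sigma = sigma2 * Omega           # covariance matrix of the random vector
y = np.array([1.5, -0.5, 3.0])   # observed value whose distance we measure
n = len(y)

# Brute-force search over directions g for the maximum in (40.1.1).
best = 0.0
for n_dirs in (100, 10_000, 1_000_000):
    G = rng.normal(size=(n_dirs, n))
    num = (G @ (y - mu)) ** 2                     # (g'y - E[g'y])^2 per direction
    den = np.einsum("ij,jk,ik->i", G, Sigma, G)   # var[g'y] per direction
    best = max(best, (num / den).max())
    print(f"{n_dirs:>9} random directions: MHD estimate = {best / n:.4f}")

# The search creeps up toward the quadratic form stated in Theorem 40.1.2:
closed_form = (y - mu) @ np.linalg.inv(Omega) @ (y - mu) / (n * sigma2)
print("closed form (40.1.2):", round(closed_form, 4))
```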
Theorem 40.1.2. Let $\mathbf{y}$ be a vector random variable with $E[\mathbf{y}] = \mu$ and $V[\mathbf{y}] = \sigma^2\Omega$, where $\sigma^2 > 0$ and $\Omega$ is positive definite. The squared Mahalanobis distance of the value $y$ from the distribution of $\mathbf{y}$ is equal to

(40.1.2)  $\mathrm{MHD}[y; \mathbf{y}] = \frac{1}{n\sigma^2}(y - \mu)^\top\Omega^{-1}(y - \mu).$

Proof. (40.1.2) is a simple consequence of (32.4.4). It is also somewhat intuitive, since the right-hand side of (40.1.2) can be considered a division of the square of $y - \mu$ by the covariance matrix of $\mathbf{y}$.

The Mahalanobis distance is an asymmetric measure; a large value indicates a bad fit of the hypothetical population to the observation, while a value of, say, 0.1 does not necessarily indicate a better fit than a value of 1.

[...] quadratic form in the exponent of the normal density function of $\mathbf{y}$. For a normally distributed $\mathbf{y}$, therefore, all observations located on the same density contour have equal distance from the distribution.

The Mahalanobis distance is also defined if the covariance matrix of $\mathbf{y}$ is singular. In this case, certain nonzero linear combinations of the elements of $\mathbf{y}$ are known with certainty. Certain vectors can [...] the whole $\mathbb{R}^n$.

Problem 414. (2 points) The random vector $\mathbf{y} = \begin{bmatrix}y_1\\ y_2\\ y_3\end{bmatrix}$ has mean $\begin{bmatrix}1\\ 2\\ -3\end{bmatrix}$ and covariance matrix $\frac{1}{3}\begin{bmatrix}2 & -1 & -1\\ -1 & 2 & -1\\ -1 & -1 & 2\end{bmatrix}$. Is this covariance matrix singular? If so, give a linear combination of the elements of $\mathbf{y}$ which is known with certainty, and give a value which can never be a realization of $\mathbf{y}$. Prove everything you state.

Answer. Yes, it is singular;

(40.1.5)  $\frac{1}{3}\begin{bmatrix}2 & -1 & -1\\ -1 & 2 & -1\\ -1 & -1 & 2\end{bmatrix}\begin{bmatrix}1\\ 1\\ 1\end{bmatrix} = \begin{bmatrix}0\\ 0\\ 0\end{bmatrix}.$

I.e., $y_1 + y_2 + y_3 = 0$, because its variance is 0 and its mean is zero as well, since $\begin{bmatrix}1 & 1 & 1\end{bmatrix}\begin{bmatrix}1\\ 2\\ -3\end{bmatrix} = 0$. [...]
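The claims in the answer to Problem 414 are easy to confirm numerically; the short NumPy sketch below simply restates the computation.

```python
import numpy as np

# Covariance matrix and mean vector from Problem 414.
Sigma = np.array([[ 2., -1., -1.],
                  [-1.,  2., -1.],
                  [-1., -1.,  2.]]) / 3.0
mu = np.array([1., 2., -3.])
iota = np.ones(3)

print("determinant:", np.linalg.det(Sigma))      # ~0 up to rounding -> singular
print("var of y1+y2+y3:", iota @ Sigma @ iota)   # ~0 -> known with certainty
print("mean of y1+y2+y3:", iota @ mu)            # 0  -> y1+y2+y3 = 0 always

# Hence any vector whose components do not sum to zero, e.g. (1, 1, 1),
# can never be a realization of y.
```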
Definition 40.1.3. Given a vector random variable $\mathbf{y}$ which has a mean and a covariance matrix, a value $y$ has infinite statistical distance from this random variable, i.e., it cannot possibly be a realization of this random variable, if a vector of coefficients $g$ exists such that $\mathrm{var}[g^\top\mathbf{y}]$ [...]

[...] the squared Mahalanobis distance of $y$ from $\mathbf{y}$ is defined as in (40.1.1), with $n$ replaced by $\mathrm{rank}[\Omega]$. If the denominator in (40.1.1) is zero, then it no longer necessarily follows that $g = o$, but it nevertheless follows that the numerator is zero, and the fraction should in this case again be considered zero. If $\Omega$ is singular, then the inverse $\Omega^{-1}$ in formula (40.1.2) must be replaced by a "g-inverse." A g-inverse of a matrix $A$ is any [...]

40.3. First Scenario: Minimizing Relative Increase in Mahalanobis Distance if Distribution is Known

We start with a situation where the expected values of the random vectors $\mathbf{y}$ and $\mathbf{z}$ are known, and their joint covariance matrix is known up to an unknown scalar factor $\sigma^2 > 0$. We will write this as

(40.3.1)  $\begin{bmatrix}\mathbf{y}\\ \mathbf{z}\end{bmatrix} \sim \left(\begin{bmatrix}\mu\\ \nu\end{bmatrix},\ \sigma^2\begin{bmatrix}\Omega_{yy} & \Omega_{yz}\\ \Omega_{zy} & \Omega_{zz}\end{bmatrix}\right), \qquad \sigma^2 > 0.$

$\Omega_{yy}$ has rank $p$, and $\begin{bmatrix}\Omega_{yy} & \Omega_{yz}\end{bmatrix}$ has rank $r$. Since $\sigma^2$ is not [...]

[...] We already solved this minimization problem in chapter ??. By (??), the minimum value of this relative contribution is zero, and the value of $z^*$ which minimizes this relative contribution is the same as the value of the best linear predictor of $\mathbf{z}$, i.e., the value assumed by the linear predictor which minimizes the MSE among all linear predictors.

40.4. Second Scenario: One Additional IID Observation

In the above situation, we could minimize the relative increase in the Mahalanobis distance (instead of selecting its minimax value) because all parameters of the underlying distribution were known. The simplest situation in which they are not known, and in which we therefore must resort to minimizing the relative increase in the Mahalanobis distance for the most unfavorable parameter, is the following: a vector $y$ of $n$ i.i.d. observations is given, with unknown mean $\mu$ and variance $\sigma^2 > 0$. The squared Mahalanobis distance of these data from their population is $\frac{1}{n\sigma^2}(y - \iota\mu)^\top(y - \iota\mu)$; it depends on the unknown $\mu$ and $\sigma^2$. How can we predict an $(n+1)$st observation in such a way as to minimize the worst possible relative increase in this Mahalanobis distance? Minimizing the maximum possible relative increase in the Mahalanobis distance due [...]

[...] greater than 1. This concludes the proof that $\bar y$ minimaxes the relative increase in the Mahalanobis distance over all values of $\mu$ and $\sigma^2$.

40.5. Third Scenario: One Additional Observation in a Regression Model

Our third scenario starts with an observation $y$ of the random $n$-vector $\mathbf{y} \sim (X\beta, \sigma^2 I)$, where the nonrandom $X$ has full column rank, and the parameters $\beta$ and $\sigma^2 > 0$ are unknown. The squared Mahalanobis [...]

[...] this minimax problem, and that the minimax value of $q$ is $x_{n+1}^\top(X^\top X)^{-1}x_{n+1}$. The proof will proceed in two steps. (1) For $y_{n+1} = x_{n+1}^\top\hat\beta$, $q \le x_{n+1}^\top(X^\top X)^{-1}x_{n+1}$ for all values of $\beta$ and whatever g-inverse was used in (40.5.6), but one can find $\beta$ for which $q$ is arbitrarily close to $x_{n+1}^\top(X^\top X)^{-1}x_{n+1}$. (2) If $y_{n+1} \ne x_{n+1}^\top\hat\beta$, then $q > x_{n+1}^\top(X^\top X)^{-1}x_{n+1}$ for certain values [...]
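The second-scenario claim above, that the sample mean $\bar y$ minimaxes the relative increase in the Mahalanobis distance, can be checked numerically. The sketch below is an illustration under stated assumptions: the data vector and the grids are arbitrary choices, and "relative increase" is taken here as the ratio of the squared Mahalanobis distance of the augmented data $(y, c)$ to that of $y$ alone, from which $\sigma^2$ cancels, so the worst case only needs to be taken over $\mu$.

```python
import numpy as np

# Illustrative data: n i.i.d. observations with unknown mean and variance.
y = np.array([0.3, -1.2, 0.8, 2.1, -0.4, 1.0])
n = len(y)
ybar = y.mean()

def relative_increase(c, mu):
    """Ratio of the squared Mahalanobis distance of the augmented vector
    (y_1, ..., y_n, c) from its population to that of y alone; the
    unknown sigma^2 cancels out of this ratio."""
    ss_old = np.sum((y - mu) ** 2)
    ss_new = ss_old + (c - mu) ** 2
    return (ss_new / (n + 1)) / (ss_old / n)

candidates = ybar + np.linspace(-2.0, 2.0, 81)   # candidate predictions c
mus = np.linspace(-50.0, 50.0, 4001)             # hypothetical values of mu

worst = [max(relative_increase(c, mu) for mu in mus) for c in candidates]
best_c = candidates[int(np.argmin(worst))]

print("sample mean:              ", round(ybar, 4))
print("minimax prediction (grid):", round(best_c, 4))
# The two agree up to grid resolution: predicting ybar minimizes the
# worst-case relative increase, as the text's second scenario claims.
```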
