Class Notes in Statistics and Econometrics, Part 19


CHAPTER 37

OLS With Random Constraint

A Bayesian considers the posterior density the full representation of the information provided by sample and prior information. Frequentists have discovered that one can interpret the parameters of this density as estimators of the key unknown parameters, and that these estimators have good sampling properties. Therefore they have tried to re-derive the Bayesian formulas from frequentist principles.

If $\beta$ satisfies the constraint $R\beta = u$ only approximately or with uncertainty, it has therefore become customary to specify

(37.0.55)  $R\beta = u + \eta$, $\quad \eta \sim (o, \tau^2 \Phi)$, $\quad \eta$ and $\varepsilon$ uncorrelated.

Here it is assumed that $\tau^2 > 0$ and $\Phi$ is positive definite.

Both interpretations are possible here: either $u$ is a constant, which means necessarily that $\beta$ is random, or $\beta$ is as usual a constant and $u$ is random, coming from whoever happened to do the research (this is why it is called "mixed estimation"). The correct procedure in this situation is to do GLS on the model

(37.0.56)  $\begin{bmatrix} y \\ u \end{bmatrix} = \begin{bmatrix} X \\ R \end{bmatrix} \beta + \begin{bmatrix} \varepsilon \\ -\eta \end{bmatrix}$ with $\begin{bmatrix} \varepsilon \\ -\eta \end{bmatrix} \sim \left( \begin{bmatrix} o \\ o \end{bmatrix}, \; \sigma^2 \begin{bmatrix} I & O \\ O & \tfrac{1}{\kappa^2} I \end{bmatrix} \right)$.

Therefore

(37.0.57)  $\hat{\hat{\beta}} = (X^\top X + \kappa^2 R^\top R)^{-1}(X^\top y + \kappa^2 R^\top u)$, where $\kappa^2 = \sigma^2/\tau^2$.

This $\hat{\hat{\beta}}$ is the BLUE if in repeated samples $\beta$ and $u$ are drawn from distributions such that $R\beta - u$ has mean $o$ and variance $\tau^2 I$, but $E[\beta]$ can be anything. If one considers both $\beta$ and $u$ fixed, then $\hat{\hat{\beta}}$ is a biased estimator whose properties depend on how close the true value of $R\beta$ is to $u$. Under the assumption of constant $\beta$ and $u$, the MSE matrix of $\hat{\hat{\beta}}$ is smaller than that of the OLS $\hat{\beta}$ if and only if the true parameter values $\beta$, $u$, and $\sigma^2$ satisfy

(37.0.58)  $(R\beta - u)^\top \left( \tfrac{2}{\kappa^2} I + R (X^\top X)^{-1} R^\top \right)^{-1} (R\beta - u) \le \sigma^2$.

This condition is a simple extension of (29.6.6).

An estimator of the form $\hat{\hat{\beta}} = (X^\top X + \kappa^2 I)^{-1} X^\top y$, where $\kappa^2$ is a constant, is called "ordinary ridge regression." Ridge regression can be considered the imposition of a random constraint even though this constraint does not hold, again in an effort to trade bias for variance; it is similar in spirit to the imposition of an exact constraint which does not hold. An explanation of the term "ridge," given in [VU81, p. 170], is that the ridge solutions lie near a ridge in the likelihood surface (at a point where this ridge comes close to the origin); the ridge is drawn in [VU81, Figures 1.4a and 1.4b].

Problem 402. Derive from (37.0.58) the well-known formula that the MSE matrix of ordinary ridge regression is smaller than that of the OLS estimator if and only if the true parameter vector satisfies

(37.0.59)  $\beta^\top \left( \tfrac{2}{\kappa^2} I + (X^\top X)^{-1} \right)^{-1} \beta \le \sigma^2$.

Answer. In (37.0.58) set $u = o$ and $R = I$.

Whatever the true values of $\beta$ and $\sigma^2$, there is always a $\kappa^2 > 0$ for which (37.0.59) or (37.0.58) holds. The corresponding statement for the trace of the MSE matrix has been one of the main justifications for ridge regression in [HK70b] and [HK70a], and much of the literature about ridge regression has been inspired by the hope that one can estimate $\kappa^2$ in such a way that the MSE is better everywhere. This is indeed done by the Stein rule. Ridge regression is reputed to be a good estimator when there is multicollinearity.
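As a concrete illustration of (37.0.56) and (37.0.57), here is a minimal numerical sketch (not part of the original notes; the data, variable names, and parameter values are all made up for illustration). It checks that least squares on the augmented system reproduces the mixed-estimation formula, and that setting $R = I$, $u = o$ gives ordinary ridge regression.

```python
import numpy as np

# Minimal numerical sketch (made-up data): mixed estimation computed two ways,
# via formula (37.0.57) and via least squares on the augmented system (37.0.56).
rng = np.random.default_rng(0)
n, k = 50, 3
X = rng.normal(size=(n, k))
beta_true = np.array([1.0, -2.0, 0.5])
sigma, tau = 1.0, 0.5
kappa2 = sigma**2 / tau**2                      # kappa^2 = sigma^2 / tau^2

R = np.array([[1.0, 1.0, 0.0]])                 # one stochastic restriction R beta = u + eta
u = R @ beta_true + tau * rng.normal(size=1)
y = X @ beta_true + sigma * rng.normal(size=n)

# Formula (37.0.57)
b_mixed = np.linalg.solve(X.T @ X + kappa2 * (R.T @ R),
                          X.T @ y + kappa2 * (R.T @ u))

# GLS on (37.0.56) = OLS on the data augmented by the weighted restriction rows
X_aug = np.vstack([X, np.sqrt(kappa2) * R])
y_aug = np.concatenate([y, np.sqrt(kappa2) * u])
b_aug = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

# Ordinary ridge regression is the special case R = I, u = o
b_ridge = np.linalg.solve(X.T @ X + kappa2 * np.eye(k), X.T @ y)

print(np.allclose(b_mixed, b_aug))              # True: both computations agree
print(b_ridge)                                  # shrunk toward the origin
```

The weighting of the restriction rows by $\sqrt{\kappa^2} = \sigma/\tau$ is what turns the GLS problem (37.0.56) into an ordinary least squares problem.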
Problem 403. (Not eligible for in-class exams.) Assume $E[y] = \mu$, $\operatorname{var}[y] = \sigma^2$, and you make $n$ independent observations $y_i$. Then the best linear unbiased estimator of $\mu$ on the basis of these observations is the sample mean $\bar{y}$. For which range of values of $\alpha$ is $\operatorname{MSE}[\alpha\bar{y}; \mu] < \operatorname{MSE}[\bar{y}; \mu]$? Unfortunately, this range depends on $\mu$ and can therefore not be used to improve the estimate.

Answer. The requirement is

(37.0.60)  $\operatorname{MSE}[\alpha\bar{y}; \mu] = E\left[(\alpha\bar{y} - \mu)^2\right] = E\left[(\alpha\bar{y} - \alpha\mu + \alpha\mu - \mu)^2\right] < \operatorname{MSE}[\bar{y}; \mu] = \operatorname{var}[\bar{y}]$,

i.e.,

(37.0.61)  $\alpha^2 \sigma^2/n + (1 - \alpha)^2 \mu^2 < \sigma^2/n$.

Now simplify it:

(37.0.62)  $(1 - \alpha)^2 \mu^2 < (1 - \alpha^2)\sigma^2/n = (1 - \alpha)(1 + \alpha)\sigma^2/n$.

This cannot be true for $\alpha \ge 1$, because for $\alpha = 1$ one has equality, and for $\alpha > 1$ the right-hand side is negative. Therefore we are allowed to assume $\alpha < 1$, and can divide by $1 - \alpha$ without disturbing the inequality:

(37.0.63)  $(1 - \alpha)\mu^2 < (1 + \alpha)\sigma^2/n$

(37.0.64)  $\mu^2 - \sigma^2/n < \alpha(\mu^2 + \sigma^2/n)$.

The answer is therefore

(37.0.65)  $\dfrac{n\mu^2 - \sigma^2}{n\mu^2 + \sigma^2} < \alpha < 1$.

Problem 404. (Not eligible for in-class exams.) Assume $y = X\beta + \varepsilon$ with $\varepsilon \sim (o, \sigma^2 I)$. If prior knowledge is available that $P\beta$ lies in an ellipsoid centered around $p$, i.e., $(P\beta - p)^\top \Phi^{-1} (P\beta - p) \le h$ for some known positive definite symmetric matrix $\Phi$ and scalar $h$, then one might argue that the SSE should be minimized only over those $\beta$ inside this ellipsoid. Show that this inequality-constrained minimization gives the same formula as OLS with a random constraint of the form $\kappa^2 (R\beta - u) \sim (o, \sigma^2 I)$, where $R$ and $u$ are appropriately chosen constants, while $\kappa^2$ depends on $y$. (You don't have to compute the precise values; simply indicate how $R$, $u$, and $\kappa^2$ should be determined.)

Answer. Decompose $\Phi^{-1} = C^\top C$ where $C$ is square, and define $R = CP$ and $u = Cp$. The mixed estimator $\beta = \beta^*$ minimizes

(37.0.66)  $(y - X\beta)^\top (y - X\beta) + \kappa^4 (R\beta - u)^\top (R\beta - u)$

(37.0.67)  $= (y - X\beta)^\top (y - X\beta) + \kappa^4 (P\beta - p)^\top \Phi^{-1} (P\beta - p)$.

Choose $\kappa^2$ such that $\beta^* = (X^\top X + \kappa^4 P^\top \Phi^{-1} P)^{-1}(X^\top y + \kappa^4 P^\top \Phi^{-1} p)$ satisfies the inequality constraint with equality, i.e., $(P\beta^* - p)^\top \Phi^{-1} (P\beta^* - p) = h$.

Now take any $\beta$ that satisfies $(P\beta - p)^\top \Phi^{-1} (P\beta - p) \le h$. Then

(37.0.68)  $(y - X\beta^*)^\top (y - X\beta^*) = (y - X\beta^*)^\top (y - X\beta^*) + \kappa^4 (P\beta^* - p)^\top \Phi^{-1} (P\beta^* - p) - \kappa^4 h$

(because $\beta^*$ satisfies the inequality constraint with equality)

(37.0.69)  $\le (y - X\beta)^\top (y - X\beta) + \kappa^4 (P\beta - p)^\top \Phi^{-1} (P\beta - p) - \kappa^4 h$

(because $\beta^*$ minimizes (37.0.67))

(37.0.70)  $\le (y - X\beta)^\top (y - X\beta)$

(because $\beta$ satisfies the inequality constraint). Therefore $\beta = \beta^*$ minimizes the inequality-constrained problem.

CHAPTER 38

Stein Rule Estimators

Problem 405. We will work with the regression model $y = X\beta + \varepsilon$ with $\varepsilon \sim N(o, \sigma^2 I)$, which in addition is "orthonormal," i.e., the $X$-matrix satisfies $X^\top X = I$.

• a. 0 points Write down the simple formula for the OLS estimator $\hat{\beta}$ in this model. Can you think of situations in which such an "orthonormal" model is appropriate?

Answer. $\hat{\beta} = X^\top y$. Sclove [Scl68] gives as examples: if one regresses on orthonormal polynomials, or on principal components. I guess also if one simply needs the means of a random vector. It seems the important fact here is that one can order the regressors; if this is the case, then one can always apply the Gram-Schmidt orthonormalization, which has the advantage that the $j$th orthonormalized regressor is a linear combination of the first $j$ ordered regressors.
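As a small numerical check on part a (an illustrative sketch with made-up data, not part of the original notes), one can orthonormalize an arbitrary full-column-rank design matrix with a QR decomposition, which here plays the role of Gram-Schmidt: the $j$th column of $Q$ is a linear combination of the first $j$ original regressors, OLS on $Q$ reduces to $\hat{\beta} = Q^\top y$, and the fitted values coincide with those from OLS on the original regressors.

```python
import numpy as np

# Illustrative sketch (made-up data): in an "orthonormal" model the OLS estimator
# is simply X'y.  A QR decomposition orthonormalizes the ordered regressors in the
# Gram-Schmidt sense: column j of Q is a combination of the first j columns of Z.
rng = np.random.default_rng(1)
n, k = 30, 4
Z = rng.normal(size=(n, k))                 # original (ordered) regressors
Q, _ = np.linalg.qr(Z)                      # Q'Q = I
y = Z @ rng.normal(size=k) + rng.normal(size=n)

beta_hat = Q.T @ y                          # OLS in the orthonormal model
fitted_orthonormal = Q @ beta_hat
fitted_ols = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]

print(np.allclose(Q.T @ Q, np.eye(k)))              # True
print(np.allclose(fitted_orthonormal, fitted_ols))  # True: same fitted values
```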
• b. 0 points Assume one has Bayesian prior knowledge that $\beta \sim N(o, \tau^2 I)$, with $\beta$ independent of $\varepsilon$. In the general case, if the prior information is $\beta \sim N(\nu, \tau^2 A^{-1})$, the Bayesian posterior mean is $\hat{\beta}_M = (X^\top X + \kappa^2 A)^{-1}(X^\top y + \kappa^2 A\nu)$, where $\kappa^2 = \sigma^2/\tau^2$. Show that in the present case $\hat{\beta}_M$ is proportional to the OLS estimate $\hat{\beta}$ with proportionality factor $1 - \tfrac{\sigma^2}{\tau^2 + \sigma^2}$, i.e.,

(38.0.71)  $\hat{\beta}_M = \hat{\beta}\left(1 - \dfrac{\sigma^2}{\tau^2 + \sigma^2}\right)$.

Answer. The formula given is (36.0.36), and in the present case $A^{-1} = I$ and $\nu = o$. One can also view it as a regression with a random constraint $R\beta \sim (o, \tau^2 I)$ where $R = I$, which is mathematically the same as treating the known mean vector, i.e., the null vector, as additional observations. In either case one gets

(38.0.72)  $\hat{\beta}_M = (X^\top X + \kappa^2 A)^{-1} X^\top y = (X^\top X + \kappa^2 R^\top R)^{-1} X^\top y = \left(I + \tfrac{\sigma^2}{\tau^2} I\right)^{-1} X^\top y = \hat{\beta}\left(1 - \dfrac{\sigma^2}{\tau^2 + \sigma^2}\right)$,

i.e., it shrinks the OLS $\hat{\beta} = X^\top y$.

• c. 0 points Formula (38.0.71) can only be used for estimation if the ratio $\sigma^2/(\tau^2 + \sigma^2)$ is known. This is usually not the case, but it is possible to estimate both $\sigma^2$ and $\tau^2 + \sigma^2$ from the data. The use of such estimates instead of the actual values of $\sigma^2$ and $\tau^2$ in the Bayesian formulas is sometimes called "empirical Bayes." Show that $E[\hat{\beta}^\top \hat{\beta}] = k(\tau^2 + \sigma^2)$ and that $E[y^\top y - \hat{\beta}^\top \hat{\beta}] = (n - k)\sigma^2$, where $n$ is the number of observations and $k$ is the number of regressors.

Answer. Since $y = X\beta + \varepsilon \sim N(o, \tau^2 XX^\top + \sigma^2 I)$, it follows that $\hat{\beta} = X^\top y \sim N(o, (\sigma^2 + \tau^2) I)$ (where the identity matrix is now $k$-dimensional); therefore $E[\hat{\beta}^\top \hat{\beta}] = k(\sigma^2 + \tau^2)$. Furthermore, since $My = M\varepsilon$ regardless of whether $\beta$ is random or not, $\sigma^2$ can be estimated in the usual manner from the SSE: $(n - k)\sigma^2 = E[\hat{\varepsilon}^\top \hat{\varepsilon}] = E[y^\top M y] = E[y^\top y - \hat{\beta}^\top \hat{\beta}]$, because $M = I - XX^\top$.

• d. 0 points If one plugs the unbiased estimates of $\sigma^2$ and $\tau^2 + \sigma^2$ from part (c) into (38.0.71), one obtains a version of the so-called "James and Stein" estimator

(38.0.73)  $\hat{\beta}_{JS} = \hat{\beta}\left(1 - c\,\dfrac{y^\top y - \hat{\beta}^\top \hat{\beta}}{\hat{\beta}^\top \hat{\beta}}\right)$.

What is the value of the constant $c$ if one follows the above instructions? (This estimator has become famous because for $k \ge 3$ and $c$ any number between $0$ and $2(n - k)/(n - k + 2)$, the estimator (38.0.73) has a uniformly lower MSE than the OLS $\hat{\beta}$, where the MSE is measured as the trace of the MSE matrix.)

Answer. $c = \dfrac{k}{n - k}$. I would need a proof that this is in the bounds.

• e. 0 points The existence of the James and Stein estimator proves that the OLS estimator is "inadmissible." What does this mean? Can you explain why the OLS estimator turns out to be deficient exactly where it ostensibly tries to be strong? What are the practical implications of this?

The properties of this estimator were first discussed in James and Stein [JS61], extending the work of Stein [Ste56]. Stein himself did not introduce the estimator as an "empirical Bayes" estimator, and it is not certain that this is indeed the right way to look at it. In particular, this approach does not explain why the OLS cannot be uniformly improved upon if $k \le 2$. But it is a possible and interesting way to look at it. If one pretends one has prior information, although one does not really have it but "steals" it from the data, this "fraud" can still be successful. Another interpretation is that these estimators are shrunk versions of unbiased estimators, and unbiased estimators always get better if one shrinks them a little. The only problem is that one cannot shrink them too much, and in the case of the normal distribution, the amount by which one has to shrink them depends on the unknown parameters. If one estimates the shrinkage factor, one usually does not know whether the noise introduced by this estimated factor is greater or smaller than the savings. But in the case of the Stein rule, the noise is smaller than the savings.
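The claim in part (d), that for $k \ge 3$ the trace-MSE of (38.0.73) stays below that of OLS, is easy to check by simulation. The following Monte Carlo sketch (illustrative only, with assumed parameter values; not part of the original notes) uses the orthonormal model with $c = k/(n - k)$.

```python
import numpy as np

# Monte Carlo sketch (assumed parameter values): trace-MSE of OLS vs. the
# James-Stein estimator (38.0.73) in the orthonormal model, with c = k/(n-k).
rng = np.random.default_rng(2)
n, k, sigma = 20, 5, 1.0
beta = np.array([0.3, -0.2, 0.1, 0.0, 0.4])     # fixed (nonrandom) true beta
X, _ = np.linalg.qr(rng.normal(size=(n, k)))    # any X with X'X = I
c = k / (n - k)

reps = 20000
se_ols = np.empty(reps)
se_js = np.empty(reps)
for r in range(reps):
    y = X @ beta + sigma * rng.normal(size=n)
    b_ols = X.T @ y                             # OLS estimator in this model
    shrink = 1.0 - c * (y @ y - b_ols @ b_ols) / (b_ols @ b_ols)
    b_js = shrink * b_ols                       # James-Stein estimator (38.0.73)
    se_ols[r] = np.sum((b_ols - beta) ** 2)
    se_js[r] = np.sum((b_js - beta) ** 2)

print("trace MSE, OLS:        ", se_ols.mean())   # close to k * sigma^2 = 5
print("trace MSE, James-Stein:", se_js.mean())    # noticeably smaller
```

With a true $\beta$ far from the origin the improvement shrinks toward zero, but for $c$ in the stated bounds it does not turn into a loss, which is exactly the inadmissibility statement of part (e).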
Problem 406. 0 points Return to the "orthonormal" model $y = X\beta + \varepsilon$ with $\varepsilon \sim N(o, \sigma^2 I)$ and $X^\top X = I$. With the usual assumption of nonrandom $\beta$ (and [...])

[...] your estimate of $\beta$ is $o$ if the test statistic has a value of at most 1, and your estimate of $\beta$ is the OLS estimate $\hat{\beta}$ if the test statistic has a value bigger than 1. Mathematically, this estimator can be written in the form

(38.0.74)  $\hat{\beta}_{PT} = I(F)\,\hat{\beta}$,

where $F$ is the $F$ statistic derived in part (1) of this question, and $I(F)$ is the "indicator function" for $F > 1$, i.e., $I(F) = 0$ if $F \le 1$ and $I(F) = 1$ if $F > 1$. Now modify this pre-test estimator by using the following $\tilde{I}(F)$ instead: $\tilde{I}(F) = 0$ if $F \le 1$ and $\tilde{I}(F) = 1 - 1/F$ if $F > 1$. This is no longer an indicator function, but it can be considered a continuous approximation to one. Since the discontinuity is removed, one can expect that it has, under certain circumstances, better properties than the indicator function itself. Write down the formula for this modified pre-test estimator. How does it differ from the Stein rule estimator (with the value for $c$ coming from the empirical Bayes approach)? Which estimator would you expect to be better, and why?

Answer. This modified pre-test estimator has the form

(38.0.75)  $\hat{\beta}_{JS+} = \begin{cases} o & \text{if } 1 - c\,\dfrac{y^\top y - \hat{\beta}^\top \hat{\beta}}{\hat{\beta}^\top \hat{\beta}} \le 0 \\[1ex] \hat{\beta}\left(1 - c\,\dfrac{y^\top y - \hat{\beta}^\top \hat{\beta}}{\hat{\beta}^\top \hat{\beta}}\right) & \text{otherwise.} \end{cases}$

It agrees with the Stein rule estimator as long as the factor $1 - c\,(y^\top y - \hat{\beta}^\top \hat{\beta})/(\hat{\beta}^\top \hat{\beta})$ is positive, but the shrinkage factor is set to 0 instead of turning negative. This is why it is commonly called the "positive part" Stein-rule estimator. Stein conjectured early on, and Baranchik [Bar64] showed, that it is uniformly better than the Stein rule estimator.

• b. 0 points Which lessons can one draw about pre-test estimators in general from this exercise?

Stein rule estimators have not been used very much; they are not equivariant, and the shrinkage seems arbitrary. Discussing them here brings out two things: the formulas for random constraints etc. are a pattern according to which one can build good operational estimators, and some widely used but seemingly ad-hoc procedures like pre-testing may have deeper foundations and better properties than the halfway sophisticated researcher may think.

Problem 407. 6 points Why was it somewhat of a sensation when Charles Stein came up with an estimator which is uniformly better than the OLS? Discuss the Stein rule estimator as an empirical Bayes estimator and as a shrinkage estimator, and discuss the "positive part" Stein rule estimator as a modified pre-test estimator.
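To make the equivalence in the answer to Problem 406 concrete, here is a short sketch (illustrative, with made-up data; not part of the original notes) that computes $\tilde{I}(F)\hat{\beta}$ directly and verifies that it coincides with the positive-part form (38.0.75).

```python
import numpy as np

# Illustrative sketch (made-up data): the modified pre-test estimator I~(F)*beta_hat
# coincides with the positive-part Stein rule (38.0.75).  F is the F statistic for
# the hypothesis beta = o in the orthonormal model, and c = k/(n-k).
rng = np.random.default_rng(3)
n, k, sigma = 20, 5, 1.0
beta = np.zeros(k)                             # the hypothesis beta = o happens to hold here
X, _ = np.linalg.qr(rng.normal(size=(n, k)))   # X'X = I
y = X @ beta + sigma * rng.normal(size=n)

b_ols = X.T @ y
sse = y @ y - b_ols @ b_ols                    # residual sum of squares
F = (b_ols @ b_ols / k) / (sse / (n - k))      # F statistic for beta = o
c = k / (n - k)

# Modified pre-test estimator: shrink by I~(F) = 1 - 1/F when F > 1, estimate o otherwise
b_pretest = (1.0 - 1.0 / F) * b_ols if F > 1 else np.zeros(k)

# Positive-part Stein rule (38.0.75): same shrinkage factor, floored at zero
shrink = 1.0 - c * sse / (b_ols @ b_ols)
b_js_plus = max(shrink, 0.0) * b_ols

print(np.allclose(b_pretest, b_js_plus))       # True: the two forms coincide
```

The identity rests on $1 - 1/F = 1 - c\,(y^\top y - \hat{\beta}^\top \hat{\beta})/(\hat{\beta}^\top \hat{\beta})$ when $c = k/(n - k)$, and on $F \le 1$ being the same event as this factor being nonpositive.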
