Chapter 10

BIASED ESTIMATION

G. G. JUDGE and M. E. BOCK*

University of Illinois and Purdue University

Contents

1. Introduction
2. Conventional statistical models, estimators, tests, and measures of estimator performance
   2.1. Conventional estimators and tests
   2.2. Measures of performance
   2.3. Bayes estimation
3. Some possibly biased alternatives
   3.1. Exact non-sample information
   3.2. Stochastic non-sample information
   3.3. Inequality non-sample information
   3.4. Parameter distribution information (prior)
   3.5. Some remarks
4. Pre-test-variable selection estimators
   4.1. Conventional equality pre-test estimator
   4.2. Stochastic hypothesis pre-test estimator
   4.3. Inequality hypothesis pre-test estimator
   4.4. Bayesian pre-test estimators
   4.5. Variable selection estimators
5. Conventional estimator inadmissibility and the Stein-rule alternatives
   5.1. Estimation under squared error loss
   5.2. Stein-like rules under weighted squared error loss
6. Some biased estimator alternatives for the stochastic regressor case
7. Biased estimation with nearly collinear data
   7.1. A measure of "near" collinearity
   7.2. Ridge-type estimators
   7.3. Minimax ridge-type estimators
   7.4. Generalized ridge estimators
   7.5. A summary comment
8. Some final comments
References

*This work was facilitated by National Science Foundation Grants. Useful comments by Arnold Zellner are gratefully acknowledged.

Handbook of Econometrics, Volume I, Edited by Z. Griliches and M. D. Intriligator
North-Holland Publishing Company, 1983

1. Introduction

Much of the literature concerned with estimation and inference from a sample of data deals with a situation in which the statistical model is correctly specified. Consequently, in econometric practice it is customary to assume that the statistical model employed for purposes of estimation and inference is consistent with the sampling process whereby the sample observations were generated. In this happy event, statistical theory provides techniques for obtaining point and interval estimates of the population parameters and for hypothesis testing. Under this scenario, for the traditional linear statistical model with normal, independent, and identically distributed errors, it is conventional to make use of the maximum likelihood-least squares rule when estimating the unknown location parameters. From the sampling theory point of view this approach is justified since it leads to minimum variance among linear unbiased estimators and, under squared error loss, the least squares estimator is minimax. From the Bayesian point of view, under a uniform non-informative prior for the coefficients, the property of minimum posterior mean squared error is achieved. All in all this is a fairly impressive set of credentials, and doubtless this goes a long way toward explaining the popularity of the least squares estimator, which is really best in a class of one. These results also suggest that if improvement in estimator performance is to be achieved, one must go outside of the traditional sampling theory rules and consider a range of alternative estimators that are biased and possibly nonlinear.

Despite its popularity, the statistical implications of remaining in the linear unbiased family of rules may in many cases be rather severe. One indication of the possibly questionable stature of the least squares rule occurred when Stein (1955) showed, under conditions normally fulfilled in practice, that there were other minimax estimators.
Following Stein's result, James and Stein (1961) exhibited an estimator which, under squared error loss, dominates the least squares estimator and thus demonstrates its inadmissibility. This result means that the unbiased least squares rule may have an inferior mean square error when compared to other biased estimators.

Another trouble spot for the conventional least squares estimator arises in the case of a false statistical model. Just as few economic variables are free of measurement error and few economic relations are non-stochastic, few statistical models are correctly specified, and many of these specification errors imply a biased outcome when the least squares rule is used. For example, consider the problem of an investigator who has a single data set and wants to estimate the parameters of a linear model which are known to lie in a high dimensional parameter space Θ₁. The researcher may suspect the relationship may be characterized by a lower dimensional parameter space Θ₂ ⊂ Θ₁. Under this uncertainty, if the higher dimensional parameter space is estimated by least squares the result, from the possibly overspecified model, will be unbiased but have large variance and thus may make a poor showing in terms of mean square error. Alternatively, the lower dimensional parameter space may incorrectly specify the statistical model and thus, if estimated by least squares, will yield a biased result, and this bias may or may not outweigh the reduction in variance if evaluated in a mean square error context.

Although uncertainty concerning the proper column dimension of the matrix of explanatory variables is the rule, in many cases prior information exists about the individual parameters and/or relationships among the unknown parameters. Ignoring this information and using only sample information and the least squares rule may lead to a loss of precision, while taking the information into account may lead to a more precise though biased estimator. Intuitively, it would seem that any estimator that does not take account of existing non-sample information should lead to suboptimal rules.

Furthermore, since most economic data are passively generated and thus do not come from an experimental design situation where the investigator has a good degree of control, the data may be nearly collinear, and this means that approximate linear relations may hold among the columns of the explanatory variables that appear in the design matrix X. When this happens the least squares estimates are unstable: the X'X matrix is often nearly singular, and small changes in the observations may result in large changes in the estimates of the unknown coefficients. Ridge and minimax general ridge estimators have been suggested as alternatives to the least squares rule when handling data with these characteristics.

In the linear statistical model when the errors are long tailed and the conventional normally distributed constant variance error specification is not appropriate, the least squares rule loses some of its inferential reach. Under this scenario it is necessary to consider biased alternatives which are conceptually different from, for example, the Stein and ridge approaches noted above. In this chapter we do no more than identify the problem, since it will be discussed in full elsewhere in this Handbook.

To cope with some of the problems noted above and to avoid the statistical consequences of remaining with the conventional estimator, researchers have proposed and evaluated a range of alternatives to least squares.
Useful summaries of some of the results to date include papers by Dempster (1973), Mayer and Willke (1973), Gunst and Mason (1977), and Draper and Van Nostrand (1979).

In laying out the statistical implications of a range of biased alternatives to the least squares rule, the chapter is organized as follows. In Section 2 conventional linear statistical models, estimators, and a hypothesis testing framework are presented, and the sampling theory and Bayes bases for gauging estimator performance are specified. In Section 3 sampling theory and Bayes estimators which permit sample information and various types of non-sample information to be jointly considered are specified and appropriately evaluated. In Section 4 testing frameworks are specified for evaluating the compatibility of the sample information and the various types of non-sample information, and the corresponding pre-test estimators are derived, compared, and evaluated. In Section 5 the inadmissibility of the least squares estimator is discussed and a range of Stein-rule estimators are considered for alternative loss functions and design matrices. In Section 6 alternatives to least squares are considered for the stochastic regressor case. In Section 7 the problem of nearly collinear data is discussed, and the ridge-type and general minimax estimators which have been suggested to cope with this age-old problem are compared and evaluated. Finally, in Section 8 some comments are made about the statistical implications of these biased alternatives for econometric theory and practice.

2. Conventional statistical models, estimators, tests, and measures of estimator performance

We are concerned with the sampling performance of a family of biased estimators for the following linear statistical model:

y = Xβ + e,   (2.1)

where y is a (T × 1) vector of observations, X is a known (T × K) design matrix of rank K, β is a (K × 1) fixed vector of unknown parameters, and e is a (T × 1) vector of unobservable normal random variables with mean vector zero and finite covariance matrix E[ee'] = σ²Ψ, with σ² unknown and Ψ a known symmetric positive definite matrix. We assume throughout that the random variables which comprise e are independently and identically distributed, i.e. E[ee'] = σ²I_T, or can be transformed to this specification since Ψ is known. In almost all cases we will assume e is a normal random vector.

2.1. Conventional estimators and tests

Given that y is generated by the linear statistical model (2.1), the least squares basis for estimating the unknown coefficients is given by the linear rule

b = (X'X)^{-1}X'y,   (2.2)

which is best linear unbiased. If it is assumed that e is multivariate normal, then (2.2) is the maximum likelihood estimator and is a minimax estimator no longer limited to the class of linear estimators. Furthermore, if e is normal then b has minimum risk E[(b - β)'(b - β)] among the unbiased (not necessarily linear) estimators of β. The assumption that y is a normally distributed vector implies that the random vector (b - β) is normally distributed with mean vector zero and covariance

E[(b - β)(b - β)'] = σ²(X'X)^{-1}.   (2.3)

Therefore, the quadratic form (b - β)'X'X(b - β)/σ² is distributed as a central chi-square random variable with K degrees of freedom. A best quadratic unbiased estimator of the unknown scalar σ² is given by

σ̂² = (y - Xb)'(y - Xb)/(T - K) = y'(I_T - X(X'X)^{-1}X')y/(T - K) = y'My/(T - K) = e'Me/(T - K),   (2.4)

where M is an idempotent matrix of rank (T - K).
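As a concrete illustration of the estimators in (2.2)-(2.4), the following sketch computes the least squares estimate, its estimated covariance matrix, and the unbiased variance estimate for a simulated data set. It is a minimal NumPy sketch; the simulated X, β, and σ values are hypothetical choices used only for illustration and are not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design: T observations, K regressors (illustrative values only)
T, K = 50, 4
X = rng.normal(size=(T, K))
beta_true = np.array([1.0, -0.5, 0.25, 2.0])
sigma = 1.5
y = X @ beta_true + rng.normal(scale=sigma, size=T)

# Least squares estimator b = (X'X)^{-1} X'y, eq. (2.2)
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y

# Unbiased estimator of sigma^2, eq. (2.4): residual sum of squares / (T - K)
resid = y - X @ b
sigma2_hat = resid @ resid / (T - K)

# Estimated covariance of b, eq. (2.3): sigma^2 (X'X)^{-1}
cov_b = sigma2_hat * XtX_inv

print("b          :", b)
print("sigma2_hat :", sigma2_hat)
print("std errors :", np.sqrt(np.diag(cov_b)))
```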
If we leave the class of unbiased quadratic estimators of σ², the minimum variance quadratic estimator, with smallest mean square error, is σ̃² = y'My/(T - K + 2). Since e is a normally distributed vector with mean vector zero and covariance σ²I_T, the quadratic form

(T - K)σ̂²/σ² = e'Me/σ²   (2.5)

is distributed as a central chi-square random variable with (T - K) degrees of freedom.

Let us represent the hypotheses we have about the K dimensional unknown parameters in the form of the following linear hypotheses:

β = r   or   δ = 0,   (2.6)

where δ = β - r is a (K × 1) vector representing specification errors and r is a K dimensional known vector. Given this formulation it is conventional to use likelihood ratio procedures to test the null hypothesis H₀: β = r against the alternative hypothesis H_A: β ≠ r, by using the test statistic

u = (b - r)'X'X(b - r)/(Kσ̂²).   (2.7)

If the hypotheses are correct and indeed r = β, the test statistic u is a central F random variable with K and (T - K) degrees of freedom, i.e. u ~ F_(K,T-K). If the linear hypotheses are incorrect, u is distributed as a non-central F random variable with K and (T - K) degrees of freedom and non-centrality parameter

λ = (β - r)'X'X(β - r)/2σ² = δ'X'Xδ/2σ².   (2.8)

The traditional test procedure for H₀ against H_A is to reject the linear hypotheses H₀ if the value of the test statistic u is greater than some specified value c. The value of c is determined for a given significance level α by

P[F_(K,T-K) ≥ c] = α.   (2.9)

The above test mechanism leads to an estimator that will be specified and evaluated in Section 4.

For some of the estimators to be discussed in the coming sections, it is convenient for expository purposes to reparameterize the linear statistical model (2.1) in one of the following two forms:

y = Xβ + e = XS^{-1/2}S^{1/2}β + e = Zθ + e,   (2.10a)

where S is a positive definite symmetric matrix with S^{1/2}S^{1/2} = S = X'X, θ = S^{1/2}β, Z = XS^{-1/2}, and Z'Z = I_K. Under this reparameterization a best linear unbiased estimator of θ is θ̂ = Z'y with covariance σ²I_K. Note also that we may write (2.10a) as

z = Z'y = θ + Z'e,   (2.10b)

where z = Z'y has a K variate normal distribution with mean vector θ and covariance σ²I_K. This formulation is equivalent to the K mean statistical model usually analyzed in the statistical literature. Although (2.10b) is a convenient form for analysis purposes, we will remain in this chapter with the linear statistical (regression) form since this is the one most commonly dealt with in econometrics. The common nature of the two problems should be realized in interpreting the results to be developed.

Alternatively, consider the following canonical form:

y = Xβ + e = XTT^{-1}β + e,   (2.11)

where T is a non-singular matrix chosen so that the columns of XT are orthogonal. One choice of T is an orthogonal matrix P whose columns are orthonormal characteristic vectors of X'X. Consequently, PP' = I_K and

y = Xβ + e = XPP'β + e = Hα + e.   (2.12)

The columns of H are orthogonal since H'H = Λ, which is a diagonal matrix with elements λ₁ ≥ λ₂ ≥ ... ≥ λ_K that are the characteristic roots of X'X. The best linear unbiased estimator of α is α̂ = Λ^{-1}H'y, with covariance σ²Λ^{-1}. The variance of α̂_i, i = 1, 2, ..., K, is σ²/λ_i.
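The test mechanism of (2.7)-(2.9) can be made concrete with a short sketch. The code below computes the statistic u for the null hypothesis β = r and compares it with the critical value c at a chosen significance level α; scipy.stats supplies the F distribution, and the simulated data, the choice r = 0, and α = 0.05 are hypothetical illustrative values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated data (hypothetical values, for illustration only)
T, K = 50, 4
X = rng.normal(size=(T, K))
beta_true = np.array([1.0, -0.5, 0.25, 2.0])
y = X @ beta_true + rng.normal(scale=1.5, size=T)

b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b
sigma2_hat = resid @ resid / (T - K)

# Null hypothesis H0: beta = r (here r is the zero vector, so delta = beta)
r = np.zeros(K)

# Test statistic u of eq. (2.7): (b - r)'X'X(b - r) / (K * sigma2_hat)
diff = b - r
u = diff @ (X.T @ X) @ diff / (K * sigma2_hat)

# Critical value c of eq. (2.9) for significance level alpha
alpha = 0.05
c = stats.f.ppf(1.0 - alpha, dfn=K, dfd=T - K)

print(f"u = {u:.3f}, c = {c:.3f}, reject H0: {u > c}")
```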
2.2. Measures of performance

Finally, let us consider the basis for gauging the performance of a range of alternative estimators. We can, as we did with the estimators considered above, require the property of unbiasedness, and in this context b is the only unbiased estimator of β based on sufficient statistics. But why the concept of unbiasedness? If the information from sample observations is to be used for decision purposes, why not make use of statistical decision theory, which is based on the analysis of losses due to incorrect decisions? This is in fact the approach we use in this chapter as a basis for comparing estimators as we go outside of traditional rules and enter the family of non-linear biased estimators. Although there are many forms for representing the loss or risk functions, we will to a large extent be concerned with estimation alternatives under a squared error loss measure. However, the estimators we consider are in general robust under a range of loss functions.

Assume that y is a (T × 1) random vector. If δ(y) is some estimator of the K dimensional parameter vector β, then the weighted squared error or weighted quadratic loss function is

L(β, δ) = (δ(y) - β)'Q(δ(y) - β),   (2.13)

with corresponding risk

ρ(β, δ) = E[(δ(y) - β)'Q(δ(y) - β)],   (2.14)

where Q is a known positive definite weight matrix. If Q = I_K, under this criterion the unbiased estimator with minimum risk is the unbiased estimator with minimum variance. If we make use of the condition that δ(y) be both linear in y and unbiased, this leads to the Gauss-Markov criterion, and the minimum risk or best linear unbiased estimator is δ(y) = y if E[y] = β.

Reparameterizing the statistical model and transforming from one parameter space to another in many cases changes the measure of goodness used to judge performance. For example, if interest centers on statistical model (2.1) and sampling performance in the β space (the estimation problem), specifying an unweighted loss function in the θ space of (2.10) results in a weighted loss function in the β space, i.e.

(θ̂ - θ)'(θ̂ - θ) = (S^{1/2}b - S^{1/2}β)'(S^{1/2}b - S^{1/2}β) = (b - β)'S(b - β) = (b - β)'X'X(b - β).   (2.15)

Therefore, while the reparameterized model (2.10) is appropriate for analyzing the conditional mean forecasting problem of estimating Xβ by Xb, it is not appropriate for analyzing the performance of b as an estimate of β unless one is interested in the particular weight matrix (X'X). Alternatively, an unweighted squared error loss risk in the β space results in a weighted risk function in the θ space, i.e.

E[(b - β)'(b - β)] = E[(S^{-1/2}θ̂ - S^{-1/2}θ)'(S^{-1/2}θ̂ - S^{-1/2}θ)] = E[(θ̂ - θ)'S^{-1}(θ̂ - θ)] = E[(θ̂ - θ)'(X'X)^{-1}(θ̂ - θ)].   (2.16)

In some of the evaluations to follow it will be convenient or analytically more tractable to consider the weighted risk function in the θ space instead of the unweighted counterpart in the β space. Finally, let us note for the canonical form (2.12) that the orthogonal transformation preserves the distance measure, i.e.

(α̂ - α)'(α̂ - α) = (P'b - P'β)'(P'b - P'β) = (b - β)'PP'(b - β) = (b - β)'(b - β).   (2.17)
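The identities in (2.15) and (2.17) can be checked numerically. The sketch below, using hypothetical values, verifies that an unweighted squared error in the θ space equals a squared error in the β space weighted by X'X, while the orthonormal transformation P leaves the unweighted β-space distance unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical design and coefficient values, for illustration only
T, K = 40, 3
X = rng.normal(size=(T, K))
beta = np.array([0.7, -1.2, 0.4])
b = beta + rng.normal(scale=0.1, size=K)       # some estimate of beta

S = X.T @ X                                     # S = X'X
lam, P = np.linalg.eigh(S)                      # characteristic roots and orthonormal vectors
S_half = P @ np.diag(np.sqrt(lam)) @ P.T        # symmetric square root S^{1/2}

theta, theta_hat = S_half @ beta, S_half @ b    # theta = S^{1/2} beta, as in (2.10a)

# Eq. (2.15): unweighted loss in theta space = X'X-weighted loss in beta space
lhs = (theta_hat - theta) @ (theta_hat - theta)
rhs = (b - beta) @ S @ (b - beta)
print(np.isclose(lhs, rhs))                     # True

# Eq. (2.17): the orthonormal transformation alpha = P'beta preserves distance
alpha, alpha_hat = P.T @ beta, P.T @ b
print(np.isclose((alpha_hat - alpha) @ (alpha_hat - alpha),
                 (b - beta) @ (b - beta)))      # True
```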
The minimum mean square error criterion is another basis we will use for comparing the sampling performance of estimators. This generalized mean square error or risk measure for some estimator β̂ of β may be defined as

MSE[β̂, β] = E[(β̂ - β)(β̂ - β)'] = (bias β̂)(bias β̂)' + cov(β̂).   (2.18)

Under this measure the diagonal elements are mean square errors, and the trace of (2.18) is the squared error risk when Q = I_K. In using the mean square error criterion an estimator β̂ is equal or superior to another estimator β̃ if, for all β,

Δ = E[(β̃ - β)(β̃ - β)'] - E[(β̂ - β)(β̂ - β)']   (2.19)

is a positive semidefinite matrix. This implies l'Δl ≥ 0 for any K dimensional real vector l.

2.3. Bayes estimation

The assumption that the vector β is itself a random vector with a known distribution leads, when combined with the previously developed measures of performance, to a well-defined estimator for β. In this approach one chooses, optimally, a Bayes estimator β̂_B which minimizes, over all estimators β̂, the expected value of ρ(β, β̂), where the expectation is taken over β with respect to its known distribution. The Bayes risk for β̂_B is

E_β[ρ(β, β̂_B)] = inf over β̂ of E_β[ρ(β, β̂)].   (2.20)

In particular, for a weighted quadratic loss such as (2.13),

β̂_B = E[β | y],   (2.21)

the mean of the conditional distribution of β given the sample data.

3. Some possibly biased alternatives

Under the standard linear normal statistical model and a sampling theory framework, when only sample information is used, the least squares estimator gives minimum variance among unbiased estimators. In the Bayesian framework for inference, if a non-informative uniform prior is used in conjunction with the sample information, the minimum posterior mean square error property is achieved via the least squares rule. One problem with least squares in either framework is that it does not take into account the often existing prior information or relationships among the coefficients. A Bayesian might even say that the non-informative prior which leads to least squares should be replaced by a proper distribution which reflects in a realistic way the existing non-sample information. To mitigate the impact of ignoring this non-sample information, and to patch up their basis of estimation and inference so that it makes use of all of the information at hand, sampling theorists have developed procedures for combining sample and various types of non-sample information. When the non-sample information is added and certain of these rules are used, although we gain in precision, biased estimators result if the prior information specification is incorrect. In other cases biased estimators result even if the prior specification is correct. Thus we are led, in comparing the estimators, to a bias-variance dichotomy for measuring performance, and some of the sampling theory estimators which make use of non-sample information show, for example, superior mean square error over much of the relevant parameter space. Alternatively, there are other conventionally used biased sampling theory alternatives for which this result does not hold. In the remainder of this section we review the sampling properties of these possibly biased estimators and evaluate their performance under a squared error loss measure.

5.2. Stein-like rules under weighted squared error loss

In economics many situations arise where interest centers on the estimation problem and thus the sampling characteristics of the estimator in the β space. Alternatively, even if we remain in the θ space we may wish to weight certain elements of the coefficient vector differently in the loss function. This means that for the general linear statistical model y = Xβ + e of Section 2 we are concerned with the estimator δ(b) of the unknown vector β, under the risk

ρ(β, δ; Q) = E[(δ(b) - β)'Q(δ(b) - β)]/σ²,   (5.18)

where Q is some known (K × K) positive definite matrix.
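Risk functions such as (5.18) are rarely available in closed form for the rules considered below, but they can be approximated by simulation. The following is a minimal NumPy sketch of such an approximation; the design matrix, coefficient vector, weight matrix, and number of replications are hypothetical choices, not values taken from the text.

```python
import numpy as np

def weighted_risk(estimator, X, beta, sigma, Q, n_reps=2000, seed=3):
    """Monte Carlo approximation of (5.18): E[(d(b)-beta)'Q(d(b)-beta)]/sigma^2."""
    rng = np.random.default_rng(seed)
    T, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    total = 0.0
    for _ in range(n_reps):
        y = X @ beta + rng.normal(scale=sigma, size=T)
        b = XtX_inv @ X.T @ y
        s = np.sum((y - X @ b) ** 2)            # residual sum of squares
        d = estimator(b, s, X)                  # candidate rule delta(b, s)
        err = d - beta
        total += err @ Q @ err
    return total / (n_reps * sigma ** 2)

# Hypothetical setup, for illustration only
rng = np.random.default_rng(4)
T, K = 30, 5
X = rng.normal(size=(T, K))
beta = np.full(K, 0.5)
Q = np.eye(K)

ls = lambda b, s, X: b                          # least squares itself
print("approximate risk of b:", weighted_risk(ls, X, beta, sigma=1.0, Q=Q))
```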
Under the reparameterized model of Section 2, where θ = S^{1/2}β, we may also transform the problem to that of evaluating the risk

ρ(θ, δ_*) = E[(δ_*(θ̂) - θ)'((S^{-1/2})'QS^{-1/2})(δ_*(θ̂) - θ)]/σ²,   (5.19)

where δ_*(θ̂) = S^{1/2}δ(S^{-1/2}θ̂).

For weighted squared error loss reflected by (5.18), where b is a normally distributed K dimensional random vector with mean β and covariance σ²(X'X)^{-1}, Judge and Bock (1978, pp. 231-238) have shown that if Q^{1/2}CQ^{1/2} and Q^{1/2}BQ^{1/2} are positive definite matrices that commute with each other and with Q^{1/2}(X'X)^{-1}Q^{1/2}, and s/σ² has a chi-square distribution with (T - K) degrees of freedom, then the estimator

δ(b, s) = [I_K - (a s/(b'Bb))C] b   (5.20)

is minimax if certain conditions often fulfilled in practice are true. This means there is a family of minimax estimators that are superior to the maximum likelihood estimator under squared error loss. We now turn to some of the estimators in this minimax class.

5.2.1. A James and Stein-type minimax estimator

Out of the class of estimators defined by (5.20), if we let Q = C = I and B = X'X we have the estimator

β̂_S = [1 - a s/(b'X'Xb)] b,   (5.21)

or its generalized counterpart which shrinks toward a known K dimensional vector instead of the conventional null vector. The mean of β̂_S, by a theorem in Judge and Bock (1978, pp. 321-322), is given in (5.22), where λ = β'X'Xβ/2σ². As shown in Judge and Bock (1978, p. 242), ρ(β, σ²; β̂_S) ≤ ρ(β, σ²; b) for all β and σ² if

0 ≤ a ≤ 2(T - K + 2)^{-1}[tr((X'X)^{-1})/λ_L - 2],   (5.24)

where λ_L is the largest characteristic root of (X'X)^{-1}. Note that if X'X = I_K we have the James and Stein estimator discussed in Section 5.1. If tr((X'X)^{-1})/λ_L ≤ 2, there is no value of a > 0 for which β̂_S dominates the least squares estimator. This means that an ill-conditioned X'X matrix could affect whether or not tr((X'X)^{-1})/λ_L > 2 and that the appearance of three or more regressors no longer assures that a minimax estimator of the form (5.21) exists. As Judge and Bock (1978, pp. 245-248) show, a positive rule variant of the estimator (5.21) exists and dominates it.

5.2.2. An alternative estimator

An alternative to (5.21) is an estimator of the form

β̂_A = [I_K - a(X'X)s/(b'(X'X)²b)] b,   (5.25)

which has mean given in (5.26). Judge and Bock evaluate the risk for (5.25) and give the conditions on a under which it dominates the least squares estimator.
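A minimal sketch of the James and Stein-type rule (5.21), together with a positive-rule variant of it, is given below. The data are simulated and hypothetical, and the shrinkage constant a is simply set to half the upper bound of the interval in (5.24) as reconstructed above, so this is an illustration of the mechanics rather than a definitive implementation of any particular estimator from Judge and Bock.

```python
import numpy as np

def james_stein_type(b, s, X, a):
    """Eq. (5.21): shrink the least squares estimate b toward the zero vector."""
    shrink = 1.0 - a * s / (b @ (X.T @ X) @ b)
    return shrink * b

def positive_rule(b, s, X, a):
    """Positive-rule variant: the shrinkage factor is truncated at zero."""
    shrink = max(0.0, 1.0 - a * s / (b @ (X.T @ X) @ b))
    return shrink * b

# Hypothetical example, for illustration only
rng = np.random.default_rng(5)
T, K = 40, 5
X = rng.normal(size=(T, K))
beta = np.array([0.3, 0.1, -0.2, 0.0, 0.4])
y = X @ beta + rng.normal(size=T)

b = np.linalg.solve(X.T @ X, X.T @ y)
s = np.sum((y - X @ b) ** 2)                    # residual sum of squares

# One admissible choice of a, assuming the interval in (5.24) is non-empty
lam = np.linalg.eigvalsh(np.linalg.inv(X.T @ X))
a = (np.sum(lam) / lam.max() - 2.0) / (T - K + 2)   # half the reconstructed upper bound

print("least squares :", b)
print("Stein-type    :", james_stein_type(b, s, X, a))
print("positive rule :", positive_rule(b, s, X, a))
```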
has the smallest risk at the origin However, the risk for this estimator quickly crosses the Berger-Strawderman estimator & and the risk for j3: is superior to that of all estimators compared over much of the parameter space No one estimator is superior over the whole parameter space At least these limited results indicate that the gains from having an admissible estimator are not large Some biased estimator alternatives for the stochastic regressor case In many cases the economist must work with passively generated non-experimental data where the values of the design matrix X are not fixed in repeated samples In this section we consider a sampling process where a sample of size T is drawn from a K + variate normal population with all parameters unknown The t th random vector has the form where x, is a (K x 1) vector Let the sample mean and covariance statistics based on the T independent random vectors be denoted by where the population mean p and covariance _Z are unknown, and the sample correlation coefficient is R* = S~,Sx;‘ Sx,/S,‘ If we assume yCT+ is unknown and ,) we wish to predict yCr+,) from xCr+ ,) b y using a prediction function based on the original sample, the maximum likelihood prediction function is (6.1) 640 G G Judge and M E Bock where B = SX;‘ and &, = jj - Xl) In gauging estimator performance we use the S,, squared error loss measure +(k% 9) = +- Y(r+,))*]/~* (6.2) Given this problem, Baranchik (1973) specified an alternative prediction estimator toy, when K > 2, that is found among the class of minimax estimators in the following theorem [Judge and Bock (1978, p 278)], where h(u) = C/U, where c is a constant and u = R*/(l - R*) A prediction estimator of the form jj=Y’ -h(R*/(l-R2))(xtT+,)-n);6 (6.3) is minimax and dominates the maximum likelihood prediction function y’if (i) (ii) (iii) OO; the derivative of uh (u) is non-negative for u > 0; and h(u)d(T-3)/(K-l)foru~(K-l)/(T-K+2)whenthederivativeof uh (u) is positive If - h(u) is negative the positive rule version of the Baranchik estimator ~,=max[0,1-t(K-2)(1-R2)/{R2(T-K+2)}](~~T+,~-X)~~+~, (6.4) where < t Q 2, should be used An alternative approach to determining a rule that is some function of the maximum likelihood prediction equation is to consider prediction functions of the form Ji==a(X(,+,)-X)‘ B+Y, (6.5) where a is any constant, and find the value of a that minimizes E[(ga - yCT+ ,,)*I King (1972, 1974) has investigated this problem and derived the following estimator: pa= [(T-K+2)p*/{K+(T-2K-2)p*}](x~T+,)-~)tfi+p, which unfortunately contains the unknown parameter p*, the population correlation coefficient If, following King, the unknown parameter p* is by a sample estimate R*, (6.6), the estimator, is of the general form (6.3) dominates the maximum likelihood estimator when the condition (6.6) multiple replaced and thus 4< K < Ch IO: Biased Estimation 641 3)* + on T and K is satisfied Analytically evaluating the risk for this estimator is a difficult task, but empirical risk functions developed by King (1974) indicate that the estimator compares quite favorably with the Baranchik estimator and its positive part for values of T and K normally found in practice In many econometric problems based on the stochastic regressor model, interest centers on the unknown parameter vector /3 and the performance of alternative estimators under the risk measure E[( I- fi)‘ b - p)] Fortunately the ( theorems, lemmas, and risk derivations for the fixed regressor case given by Judge and Bock (1978, pp 229-258) carry over 
7. Biased estimation with nearly collinear data

In the presence of a design matrix X that is nearly collinear, small changes in the values of y, the vector of observations on the dependent variable, may result in dramatic changes in the values for b, the unbiased least squares estimator of β. Because of the negative inferential implications of this instability, estimators that are not subject to such extreme dependence on the value of y are considered in this section. In an attempt to mitigate the problem of imprecision we examine a class of biased estimators, known as ridge-type estimators, for various specifications of the weight matrix Q in the quadratic loss function (7.1) and compare them to the least squares estimator b. For a more complete discussion of the identification and mitigation of the problem of multicollinearity the reader is referred to Judge, Griffiths, Hill and Lee (1980, ch. 12). A general survey of the literature devoted to ridge regression is given by Vinod (1978).

7.1. A measure of "near" collinearity

Certain biased estimators have arisen in attempts to solve problems of near collinearity in the design matrix X. This occurs when one or more columns of X are "nearly" equal to a linear combination of the other columns. One measure of near collinearity is the condition number of the matrix X'X, which is defined as the ratio of the largest over the smallest characteristic roots of X'X, i.e.

λ₁/λ_K.   (7.2)

The condition number does not change if all the independent variables are multiplied by the same scalar. Since

λ₁/λ_K ≥ 1,   (7.3)

severe near collinearity is said to exist when the condition number λ₁/λ_K is very large. In such cases the data contain relatively little information about certain directions in the parameter space as compared to other directions [Thisted (1978a)]. As Thisted notes, if a direction is denoted by a K dimensional vector c whose Euclidean norm ||c|| is one, then the admissible unbiased estimate of c'β under squared error loss is c'b, the corresponding linear combination of the least squares coefficients. Thus, if c₁ and c₂ are directions and ||Xc₁|| is considerably smaller than ||Xc₂||, the variance of c₁'b will be considerably larger than that of c₂'b.

The condition number by itself may not be adequate to define multicollinearity. Instead, perhaps, as Silvey (1969) suggests, it may be preferable to look at all of the characteristic roots and the spectral decomposition to see if multicollinearity exists and, if so, its nature and extent. Building on Silvey's suggestion, Belsley, Kuh and Welsch (1980, ch. 3) provide a set of condition indexes that identify one or more near dependences and adapt the Silvey regression-variance decomposition so that it can be used with the indexes to (i) isolate those variates that are involved and (ii) assess the degree to which the estimates are being distorted by near linear dependencies. Other measures of collinearity are discussed in the chapter of this Handbook by Leamer.

Under near singularity the imprecision that exists in estimating some of the unknown parameters is reflected by the orthonormal statistical model (2.12). In this case the best linear unbiased estimator of α is α̂ = Λ^{-1}H'y, with covariance σ²Λ^{-1}, and the variance of α̂_i is σ²/λ_i, where λ_i is the i-th characteristic root of X'X. Consequently, relatively precise estimation is possible for those parameters corresponding to the large characteristic roots. Alternatively, relatively imprecise estimation exists for those parameters corresponding to the small characteristic roots.
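The condition number in (7.2) and the resulting variance factors 1/λ_i are easy to compute. The sketch below contrasts a well-conditioned design with a nearly collinear one; the constructed data are hypothetical and only meant to show how near linear dependence inflates both quantities.

```python
import numpy as np

def condition_number(X):
    """Eq. (7.2): ratio of the largest to the smallest characteristic root of X'X."""
    roots = np.linalg.eigvalsh(X.T @ X)
    return roots.max() / roots.min(), roots

rng = np.random.default_rng(7)
T, K = 50, 3

# A well-conditioned design (hypothetical)
X_good = rng.normal(size=(T, K))

# A nearly collinear design: third column is almost a linear combination of the first two
X_bad = X_good.copy()
X_bad[:, 2] = X_bad[:, 0] + 0.5 * X_bad[:, 1] + 1e-3 * rng.normal(size=T)

for name, X in [("well conditioned", X_good), ("nearly collinear", X_bad)]:
    kappa, roots = condition_number(X)
    print(f"{name:17s} condition number {kappa:12.1f}  "
          f"variance factors 1/lambda_i {np.round(1.0 / roots, 4)}")
```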
7.2. Ridge-type estimators

As we have seen in previous sections of this chapter, the transformation of an unbiased estimator often results in a biased estimator of the transformed parameter. In this context, and in the face of nearly collinear design matrices, Hoerl and Kennard (1970) suggest biased estimators called "ridge regression estimators". They note that the average squared length of the least squares estimator b is too large, in the sense that E[b'b] > β'β, and Marquardt and Snee (1975) show that

E[b'b] = β'β + σ² tr((X'X)^{-1}) ≥ β'β + σ²/λ_K,   (7.4)

where λ_K is the smallest characteristic root of X'X. Hoerl and Kennard use these results as a motivation for the use of biased estimators where the "shrinkage" factor is non-stochastic. They propose biased estimators of the form

β̂_c = [X'X + cI_K]^{-1}X'y,   (7.5)

where c is a constant. In this family of estimators the replacement matrix [X'X + cI_K], which replaces X'X in the least squares estimator, has a lower condition number than X'X. These estimators have the property that their mean squared error is less than that of b, the least squares estimator of β, for a properly chosen c > 0. Unfortunately, the appropriate value of c depends on the unknown parameters β and σ². For severe misspecification of c, β̂_c would have a mean squared error larger than that of b, and thus these estimators are not minimax for Q = I_K. The estimator β̂_c, though biased, does result in a more stable estimator for positive values of c.

The appropriate choice of c is no problem in the following Bayesian formulation. It is assumed that the prior distribution of β is normal with mean vector 0 and covariance matrix τ²I_K, where τ² is a known positive constant. It is appropriate to choose c = σ²/τ² under squared error loss. In that case β̂_c would be the Bayes estimator and preferable to b under the Bayes risk criterion. Lack of knowledge of the appropriate value of c leads to various specifications which depend on the value of y itself. In the remainder of this section we consider these sorts of estimators.

In passing we note that another way of characterizing the ridge estimator is that of an estimator which results from using the least squares criterion subject to the quadratic constraint β'β = r. Lacking analytical sampling theory results in this area, many of the Monte Carlo experiments for ridge estimators have made use of this type of formulation. The proper choice of r is of course a problem.

The original specification of ridge regression estimators was eventually to lead to the definition of "ridge-type" estimators: estimators of the form

β̃ = AX'y,   (7.6)

where the condition number of the matrix A is less than that of X'X. Consequently, A may, as in the case of Stein-like estimators, be a stochastic matrix dependent on y. Such estimators are more stable than the least squares estimator b.
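As a small illustration of the ridge family in (7.5)-(7.6), the sketch below computes the ordinary ridge estimate for one value of c and shows that the replacement matrix X'X + cI_K has a smaller condition number than X'X. The data and the choice c = 1 are hypothetical; in practice c would have to be chosen by one of the rules discussed in this section.

```python
import numpy as np

def ridge(y, X, c):
    """Ordinary ridge estimator of eq. (7.5): [X'X + c I_K]^{-1} X'y."""
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + c * np.eye(K), X.T @ y)

# Hypothetical, nearly collinear data for illustration only
rng = np.random.default_rng(8)
T, K = 50, 4
X = rng.normal(size=(T, K))
X[:, 3] = X[:, 0] - X[:, 1] + 1e-2 * rng.normal(size=T)    # near linear dependence
y = X @ np.array([1.0, 0.5, -0.5, 0.2]) + rng.normal(size=T)

c = 1.0                                                     # hypothetical shrinkage constant
roots = np.linalg.eigvalsh(X.T @ X)
print("condition number of X'X      :", roots.max() / roots.min())
print("condition number of X'X + cI :", (roots.max() + c) / (roots.min() + c))
print("least squares b              :", np.linalg.solve(X.T @ X, X.T @ y))
print("ridge estimate, c = 1        :", ridge(y, X, c))
```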
7.3. Minimax ridge-type estimators

If a ridge-type estimator dominates the least squares estimator, it is minimax. Such estimators exist only for K ≥ 3, and we will assume that K ≥ 3 for the remainder of this section.

If the loss function weight matrix Q is the identity matrix, there are no known minimax ordinary ridge regression estimators, of the type (7.5) with stochastic c, for extremely unstable design matrices X. Thisted (1978b) notes that such rules have only been specified for design matrices X in the case that the minimaxity index (7.7) of X'X is greater than 2. If the condition number of X'X, λ₁/λ_K, is large, this inequality will not be satisfied. Work by Casella (1977) and Thisted (1977) indicates that it is unlikely that such types of minimax ridge rules exist when the minimaxity index of X'X is less than or equal to two and Q = I_K. Thisted and Morris (1980) show that many of the ordinary ridge estimators that have been proposed in the literature are not minimax when the design matrix X is nearly collinear.

An example [Judge and Bock (1978)] of a minimax ridge-type estimator for Q = I_K is given in (7.8), where a₁ is a constant in the interval [0, 2(T - K + 2)^{-1}(λ_K Σ_{i=1}^{K} λ_i^{-1} - 2)]. Note that the estimator reduces to b if the interval is empty, that is, unless the minimaxity index of X'X is greater than 2. However, even if this estimator is distinct from b, the amount of its risk improvement over b diminishes as the condition number of X'X grows large.

7.4. Generalized ridge estimators

A generalized ridge estimator may be defined as

b* = [X'X + C]^{-1}X'y,   (7.9)

where C is a positive definite matrix such that (X'X)C = C(X'X). Note that C and (X'X) are simultaneously diagonalizable. These are also known as "positive shrinkage estimators". These generalized ridge estimators are not necessarily ridge-type estimators since, under squared error loss with Q = I_K, they do not necessarily improve on stability and may in fact make it worse. The Strawderman (1978) adaptive generalized ridge estimator is an example of an estimator out of this class that is minimax, but the condition number of its replacement matrix is worse than that of X'X in the least squares estimate.

For a loss function weight matrix Q = (X'X)^m, Hill (1979) has noted that an estimator of the form

δ_m(y) = [I_K - a_m s (X'X)^{1-m}/(b'(X'X)^{2-m}b)] b   (7.10)

has the ridge property and is minimax for all X'X provided m > 1 and a_m is a constant in the interval [0, 2(T - K + 2)^{-1}(K - 2)]. For m = 1, we have Q = X'X and the prediction loss function is implied. In this case δ_m is minimax but is not a ridge-type estimator.

7.5. A summary comment

In conclusion, we can say that under the usual specification of Q = I_K in the quadratic loss, it does not appear to be possible to meet simultaneously the requirements of ridge-type and minimaxity when the X matrix is of the ill-conditioned form. In fact, Draper and Van Nostrand (1979) note that the amount of risk improvement for ridge estimators is strongly affected by the ill-conditioning of the X'X matrix. No example of the simultaneous meeting of the requirements is known for Q = X'X. However, for Q = (X'X)^m and m > 1 there are ridge-type estimators which dominate the least squares estimator for all specifications of (T × K) matrices X with rank K. Such loss functions heavily penalize estimation errors in those component parameters which can be estimated rather well. Finally, it should be noted that Smith and Campbell (1980) raise some questions as to the foundations of the ridge technique and discuss critically some current ridge practices. Discussions by Thisted and others of the Smith and Campbell article reflect in some sense the range of knowledge and practice in the ridge area.

8. Some final comments

Advances are made in theoretical econometrics by (i) changing the statistical model, (ii) changing the amount of information used, and (iii) changing the measure of performance. The specification and evaluation of the estimators discussed in this chapter have involved, in various degrees, departures from tradition in each of these areas.
Post data model evaluation procedures constitute, to a large degree, a rejection of the concept of a true statistical model for which statistical theory provides a basis for estimation and inference. In addition, for the traditional regression model, although the maximum likelihood-least squares rule is the only unbiased estimator based on sufficient statistics, it is plagued by the following problems: (i) two decades ago Stein proved that in estimation under squared error loss there is a better estimator, except in the case of one or two parameters, i.e. the estimator is inadmissible; (ii) the least squares rule does not take into account the often existing prior information or relationships among the coordinates; and (iii) when near collinearity is present the least squares rule is unstable, and small changes in the observations result in very large changes in the estimates of the unknown coefficients.

Given these inferential pitfalls it seems natural to question the golden rule of unbiasedness and look at biased estimators as a possibility for improving estimator performance. We seek a rule which yields a "favorable" trade-off between bias and variance and thus accomplishes everywhere in the parameter space an overall reduction in mean square error. In the pursuit of this rule we have considered a range of alternatives concerning variants of traditional sampling theory estimators and the Stein and ridge families of estimators.

Within a sampling theory framework, the biased estimators that combine sample information and non-sample information in the form of equality or stochastic restrictions do not retain, under a squared error loss measure, the minimax property. In addition, if we are uncertain of the prior information and perform a preliminary test of the non-sample information, based on the data at hand, the resulting estimators are inadmissible and are in fact inferior to the maximum likelihood estimator over a large range of the specification error parameter space. These pre-test estimators are commonly used in applied work, although little or no basis exists for choosing the optimum level of the test.

If non-sample information of an inequality form is used and the direction of the inequality information is correct, the resulting biased estimator has risk less than or equal to that of the conventional maximum likelihood estimator. Under the same requirement, the Stein versions of the inequality restricted estimator also dominate their conventional inequality, James and Stein, and Stein positive-rule counterparts. If the direction of the inequality information is incorrect, the estimator is inferior to the maximum likelihood and Stein-rule risks over much of the parameter space.

The Stein-rule family of estimators, which shrink the maximum likelihood estimates toward zero or some predetermined coordinate, enjoy good properties from both the sampling theory and Bayesian points of view. Since the operating characteristics of the Stein rules depend on means and variances of the observations and the unknown coefficients, the estimators are robust relative to the normality assumption. They also appear to be robust over a range of loss functions. Although there are several known minimax admissible rules, none of the rules analyzed herein dominate the positive Stein rule. The Stein family thus provides rules that are simple, efficient, and robust.

Ridge regression procedures which lead to biased estimators have been suggested as one means of "improving the conditioning of the design matrix" and coping with the multicollinearity problem.
Strawderman (1978) and Judge and Bock (1978) have demonstrated a link between the adaptive generalized ridge estimator, the general minimax estimator, and the Stein estimators. To some this might suggest that the ridge estimators should be used even in the absence of collinearity problems. However, as noted in Section 7, under conventional loss functions the general minimax and ridge-type estimators are not the solution to the multicollinearity problem, and in general the mechanical application of ridge-type procedures that seek some improvement in estimator performance should be strongly questioned, since ridge is not always better than its least squares competitor.

Finally, we should note that least squares and maximum likelihood estimation of non-linear statistical models lead in general to biased estimators. Some of these results are discussed elsewhere in this Handbook. Also, in general the Bayesian criterion leads to biased estimators. The Bayesian basis for estimation and inference is discussed in Sections 2.3, 3.4, and 4.4 of this chapter and in the chapter of this Handbook by Zellner.

References

Akaike, H. (1974) "A New Look at the Statistical Model Identification", IEEE Transactions on Automatic Control, 19, 716-723.
Amemiya, T. (1976) "Selection of Regressors", Technical Report no. 225, Stanford University.
Baranchik, A. J. (1964) "Multiple Regression and Estimation of the Mean of a Multivariate Normal Distribution", Technical Report no. 51, Department of Statistics, Stanford University, California.
Baranchik, A. J. (1973) "Inadmissibility of Maximum Likelihood Estimators in Some Multiple Regression Problems with Three or More Independent Variables", Annals of Statistics, 1, 312-321.
Belsley, D. A., E. Kuh and R. E. Welsch (1980) Regression Diagnostics. New York: John Wiley & Sons.
Berger, J. (1976a) "Minimax Estimation of a Multivariate Normal Mean Under Arbitrary Quadratic Loss", Journal of Multivariate Analysis, 6, 256-264.
Berger, J. (1976b) "Admissible Minimax Estimation of a Multivariate Normal Mean with Arbitrary Quadratic Loss", Annals of Statistics, 4, 223-226.
Bock, M. E. (1975) "Minimax Estimators of the Mean of a Multivariate Normal Distribution", Annals of Statistics, 3, 209-218.
Bock, M. E., G. G. Judge and T. A. Yancey (1980) "Inadmissibility of the Inequality Estimator under Squared Error Loss", Working Paper, University of Illinois.
Bock, M. E., T. A. Yancey and G. G. Judge (1973) "The Statistical Consequences of Preliminary Test Estimators in Regression", Journal of the American Statistical Association, 68, 109-116.
Casella, G. (1977) "Minimax Ridge Estimation", Unpublished Ph.D. dissertation, Purdue University.
Cox, D. R. (1961) "Tests of Separate Families of Hypotheses", in: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. Berkeley: University of California Press.
Technometrics, 21, 45 l-466 Efron, B and C Morris (1973) “Stein’ Estimation Rule and its Competitors-An s Empirical Bayes Approach”, Journal of the American Statistical Association, 68, 117- 130 Giles, D E A and A C Rayner (1979) “The Mean Squared Errors of the Maximum Likelihood and Natural-Conjugate Bayes Regression Estimators”, Journal of Econometrics, 11, 319-334 Gunst, R F and R L Mason (1977) “Biased Estimation in Regression: An Evaluation Using Mean Square Error”, Technometrics, 72, 616-628 Hill, R C (1979) “The Sampling Characteristics of General Minimax and Ridge Type Estimators Under Multicollinearity”, Research Paper, University of Georgia Hinde, R (1978) “An Admissible Estimator Which Dominates the James-Stein Estimator”, Research Paper 167, School of Economic and Financial Studies, Macquoue University Hocking, R R (1976) “The Analysis and Selection of Variables in Linear Regression”, Biometrics, 32, l-49 Hoer], A E and R W Kennard (1970) “Ridge Regression: Biased Estimation of Nonorthogonal Problems”, Technometrics, 12, 55-67 James, W and C Stein (1961) “Estimation with Quadratic Loss”, in: Proceedings of the Fourth Berkeley Symposium Mathematical Statistics and Probability, vol Berkeley: University of California Press, pp 36 I-379 Judge, G G and M E Bock (1976) “A Comparison of Traditional and Stein Rule Estimators Under Weighted Squared Error Loss”, International Economic Review, 17, 234-240 Judge, G G and M E Bock (1978) The Statistical Implications of Pre-Test and Stein-Rule Estimators in Econometrics Amsterdam: North-Holland Publishing Co Judge, G G and T A Yancey (1978), “Inequality Restricted Estimation Under Squared Error Loss”, Working Paper Series, University of Georgia Judge, G G., W E Griffiths, R C Hill and T C Lee (1980) The Theory and Practice oj Econometrics New York: John Wiley & Sons King, N (1972) “An Alternative for the Linear Regression Equation When the Predictor Variable is Uncontrolled and the Sample is Small”, Journal of the American Statistical Association, 67, 217-219 King, N (1974) “An Alternative for Multiple Regression when the Prediction Variables are Uncontrolled and the Sample Size is not Too Small”, unpublished manuscript Learner, E E (1974) “Fales Models and Post Data Model Evaluation”, Journal of the American Statistical Association, 69, 122- 131 Learner, E E (1978) Specification Searches New York: John Wiley & Sons Learner, E E and G Chamberlain (1976) “A Bayesian Interpretation of Pre-Testing”, Journal of the Royal Statistical Society, Ser B, 38, 89-94 Mallows, C L (1973) “Some Comments on Cp”, Technometrics, 15, 661-676 Marquardt, D W and R D Snee (1975) “Rtdge Regression in Practice”, American Statistician, 29, 3-19 Mayer, L S and T A Willke (1973) “On Biased Estimations in Linear Models”, Technometrics, 15, 497-508 Rothenberg, T J (1973) Efficient Estimation with A Priori Information New Haven: Yale University Press Sclove, S L., C Morris and R Radhakrishnan (1972) “Non Optimality of Pre-Test Estimators for the Multinormal Mean”, Annals of Mathematical Statistics, 43, 1481- 1490 Silvey, S D (1969) “Multicollinearity and Imprecise Estimation”, Journal of the Royal Statistical Society, B, 1, 539-552 Smith, G and F Campbell (1980) “A Critique of Some Ridge Regression Methods,” Journal of the American Statistical Association, Stein, C (1955) “Inadmissibility 75, 74-103 of the Usual Estimator for the Mean of a Multivariate Normal ... “ ,/ p(e,&ct = 0.05 “‘ \.-.\ ‘ \ /;‘ -I _ h_ ‘ L - ._ _? ?? i;c, /’ p(e,&ol=o.l ‘ = _ “ ‘ , ., -. -. -_ -. 
-_ ‘ _ I _ Y P(P b) 200 -$ $f I 0 I I I I 10 15 20 25 30 x Hypothesis Error X = S’ S/~CT*... as discussed in Chapter of this Handbook by Zellner, we usually compare the posterior probability of the null hypothesis with that of the alternative hypothesis, which may be of the nested variety... previous sections of this chapter, the transformation of an unbiased estimator often results in a biased estimator of the transformed parameter In this context and in the face of nearly collinear