
The linear regression model II — departures from the assumptions underlying the statistical GM

In the previous chapter we discussed the specification of the linear regression model as well as its statistical analysis based on the underlying eight standard assumptions. In the next three chapters several departures from [1]–[8] and their implications will be discussed. The discussion differs somewhat from the usual textbook discussion (see Judge et al. (1982)) because of the differences in emphasis in the specification of the model.

In Section 20.1 the implications of having E(y_t | σ(X_t)) instead of E(y_t | X_t = x_t) as the systematic component are discussed. Such a change gives rise to the stochastic regressors model, which as a statistical model shares some features with the linear regression model, but the statistical inference related to the statistical parameters of interest θ is somewhat different. The statistical parameters of interest and their role in the context of the statistical GM are the subject of Section 20.2. In this section the so-called omitted variables bias problem is reinterpreted as a parameters of interest issue. In Section 20.3 the assumption of exogeneity is briefly considered. The cases where a priori exact linear and non-linear restrictions on θ exist are discussed in Section 20.4. Estimation as well as testing when such information is available are considered. Section 20.5 considers the concept of the rank deficiency of X, known as collinearity, and its implications. The potentially more serious problem of 'near collinearity' is the subject of Section 20.6. Both problems of collinearity and near collinearity are interpreted as insufficient data information for the analysis of the parameters of interest. It is crucially important to emphasise at the outset that the discussion of the various departures from the assumptions underlying the statistical GM which follows assumes that the probability and sampling models remain valid and unchanged. This assumption is needed because when the probability and/or the sampling model change, the whole statistical model requires respecification.

20.1 The stochastic linear regression model

The first assumption underlying the statistical GM is that the systematic component is defined as

    μ_t = E(y_t | X_t = x_t).    (20.1)

An alternative but related form of conditioning is the one with respect to the σ-field generated by X_t, which defines the systematic component to be

    μ_t* = E(y_t | σ(X_t)).    (20.2)

The similarities and differences between (1) and (2) were discussed in Section 7.2. In this section we will consider the meaning and intuition underlying (2) as compared with (1). Let X_1t be the first random variable in X_t, which is assumed to be defined on the probability space (S, F, P(·)). σ(X_1t) represents the σ-field generated by X_1t, i.e. the minimal σ-field with respect to which X_1t is a random variable. By construction σ(X_1t) ⊆ F. The σ-field generated by X_t = (X_1t, X_2t, ..., X_kt)′ is defined to be

    σ(X_t) = ⋁(i=1..k) σ(X_it) ⊆ F.    (20.3)

Let y_t also be defined on (S, F, P(·)). The conditional expectation E(y_t | σ(X_t)): (S, σ(X_t)) → (ℝ, B) is defined via

    ∫_A y_t dP = ∫_A E(y_t | σ(X_t)) dP for all A ∈ σ(X_t).    (20.4)

This shows that E(y_t | σ(X_t)) is a random variable with respect to σ(X_t). Intuitively, conditioning y_t on σ(X_t) amounts to considering the part of the random variable y_t associated with all the events generated by X_t. Conditioning on X_t = x_t can be seen as a special case of this where only the event X_t = x_t is considered. Because of this relationship it should come as no surprise to learn that in the case where D(y_t, X_t; ψ) is jointly normal the conditional expectations take the form

    μ_t = E(y_t | X_t = x_t) = σ₁₂Σ₂₂⁻¹x_t = β′x_t,    (20.5)

    μ_t* = E(y_t | σ(X_t)) = σ₁₂Σ₂₂⁻¹X_t = β′X_t    (20.6)

(see Chapter 15). Using (6) we can define the statistical GM:

    y_t = β′X_t + u_t,  t ∈ T,    (20.7)

where the parameters of interest are θ = (β, σ²), β = Σ₂₂⁻¹σ₂₁, σ² = σ₁₁ − σ₁₂Σ₂₂⁻¹σ₂₁. The random vectors 𝒳 = (X₁, X₂, ..., X_T)′ are assumed to satisfy the rank condition rank(𝒳) = k for any observed value of 𝒳. The error term, defined by

    u_t = y_t − E(y_t | σ(X_t)),  t ∈ T,    (20.8)

satisfies the following properties:

    E(u_t) = E{E(u_t | σ(X_t))} = 0,    (20.9)

    E(u_t μ_t*) = E{E(u_t μ_t* | σ(X_t))} = 0,    (20.10)

    E(u_t u_s) = E{E(u_t u_s | σ(X_t))} = σ² for t = s, and 0 for t ≠ s, t, s ∈ T.    (20.11)

The statistical GM (7) represents a situation where the systematic component of y_t is defined as the part of y_t associated with the events σ(X_t), and the observed value of X_t does not contain all the relevant information. That is, the stochastic structure of X_t is of interest in so far as it is related to y_t. This should be contrasted with the statistical GM of the Gauss linear and linear regression models.
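To make the definitions above concrete, the following sketch simulates a jointly normal (y_t, X_t) and checks numerically that the regression of y_t on X_t recovers β = Σ₂₂⁻¹σ₂₁. All numerical values (the covariance blocks, the sample size) are illustrative choices for this example, not taken from the text.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative covariance blocks of Z_t = (y_t, X_t')', with k = 2:
    # sigma11 = 1, sigma21 = Cov(X_t, y_t), Sigma22 = Cov(X_t).
    Sigma22 = np.array([[1.0, 0.3], [0.3, 1.0]])
    sigma21 = np.array([0.5, 0.2])
    beta = np.linalg.solve(Sigma22, sigma21)      # beta = Sigma22^{-1} sigma21
    sigma2 = 1.0 - sigma21 @ beta                 # sigma11 - sigma12 Sigma22^{-1} sigma21

    # Draw a large sample from the joint normal distribution of Z_t.
    Sigma = np.block([[np.array([[1.0]]), sigma21[None, :]],
                      [sigma21[:, None], Sigma22]])
    Z = rng.multivariate_normal(np.zeros(3), Sigma, size=100_000)
    y, X = Z[:, 0], Z[:, 1:]

    # The coefficients of the regression of y_t on X_t estimate the same beta.
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    print(beta, beta_hat, sigma2)                 # beta_hat is close to beta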

Given that X_t in the statistical GM is a random vector, intuition suggests that the probability model underlying (7) should come in the form of the joint distribution D(y_t, X_t; ψ). We need, however, a form of this distribution which involves the parameters of interest directly. Such a form is readily available using the equality

    D(y_t, X_t; ψ) = D(y_t | X_t; ψ₁)·D(X_t; ψ₂),    (20.12)

with θ = (β, σ²) being a parametrisation of ψ₁. This suggests that the probability model underlying (7) should take the form

    Φ = {D(y_t | X_t; θ)·D(X_t; ψ₂), θ = (β, σ²) ∈ ℝ^k × ℝ₊, t ∈ T},    (20.13)

where

    D(y_t | X_t; θ) = (1/(σ√(2π))) exp{−(1/(2σ²))(y_t − β′X_t)²},    (20.14)

    D(X_t; ψ₂) = ((det Σ₂₂)^(−1/2)/(2π)^(k/2)) exp{−(1/2)X_t′Σ₂₂⁻¹X_t}    (20.15)

(see Chapter 15), and ψ₂ are said to be nuisance parameters (not of interest). The random vectors X₁, ..., X_T become part of the sampling model, which is defined as follows:

respectively, where as usual Z_t = (y_t, X_t′)′.

If we collect all the above components together we can specify the stochastic linear regression model as follows:

The statistical GM: y_t = β′X_t + u_t, t ∈ T.

[1] μ_t* = E(y_t | σ(X_t)) and u_t = y_t − E(y_t | σ(X_t)).
[2] θ = (β, σ²), β = Σ₂₂⁻¹σ₂₁ and σ² = σ₁₁ − σ₁₂Σ₂₂⁻¹σ₂₁ are the parameters of interest.
[3] X_t is assumed to be weakly exogenous with respect to θ, for t = 1, 2, ..., T.
[4] No a priori information on θ.
[5] For 𝒳 = (X₁, X₂, ..., X_T)′, rank(𝒳) = k for all observable values of 𝒳, T > k.

The probability model

    Φ = {D(y_t | X_t; θ)·D(X_t; ψ₂) = (1/(σ√(2π))) exp{−(1/(2σ²))(y_t − β′X_t)²}·((det Σ₂₂)^(−1/2)/(2π)^(k/2)) exp{−(1/2)X_t′Σ₂₂⁻¹X_t}, θ = (β, σ²) ∈ ℝ^k × ℝ₊, t ∈ T}.

[6] (i) D(y_t | X_t; θ) is normal;
    (ii) E(y_t | σ(X_t)) = β′X_t — linear in X_t;
    (iii) Var(y_t | σ(X_t)) = σ² — homoskedastic.
[7] θ = (β, σ²) are time-invariant.

The sampling model

[8] (Z₁, Z₂, ..., Z_T) is a random sample from D(Z_t; ψ), t = 1, 2, ..., T, respectively.


The probability and sampling models taken together imply that for y ≡ (y₁, y₂, ..., y_T)′ and 𝒳 = (X₁, X₂, ..., X_T)′ the likelihood function is

    L(θ; y, 𝒳) = ∏(t=1..T) D(y_t | X_t; θ)·D(X_t; ψ₂).    (20.16)

The log likelihood takes the form

    log L(θ) = const − (T/2) log σ² − (1/(2σ²)) Σ(t=1..T)(y_t − β′X_t)² + Σ(t=1..T) log D(X_t; ψ₂).    (20.17)

The last component in (17) can be treated as a constant as far as differentiation with respect to the parameters of interest θ is concerned. Because of the apparent similarity between (17) and the log likelihood function of the linear regression model (see Section 19.4), it should come as no surprise to learn that maximisation with respect to θ yields

    β* = (Σ(t=1..T) X_t X_t′)⁻¹ Σ(t=1..T) X_t y_t = (𝒳′𝒳)⁻¹𝒳′y    (20.18)

and

    σ*² = (1/T) Σ(t=1..T)(y_t − β*′X_t)² = (1/T)u*′u*,    (20.19)

in an obvious notation. The same results can be derived by defining the likelihood function in terms of D(y_t, X_t; ψ) and, after estimating σ₁₁, σ₁₂ and Σ₂₂, using these estimators and the invariance property of MLE's to construct the corresponding estimators for β = Σ₂₂⁻¹σ₂₁ and σ² = σ₁₁ − σ₁₂Σ₂₂⁻¹σ₂₁. Looking at (18) and (19) we can see that these MLE's of β and σ² differ from the corresponding estimators for the linear regression model, β̂ = (X′X)⁻¹X′y and σ̂² = (1/T)û′û, in so far as the latter include the observed value x_t instead of the random vector X_t, as above. This difference, however, implies that β* and σ*² are no longer a linear and a quadratic function of y, and thus the distributional results in relation to β̂ and σ̂² cannot be extended to β* and σ*². That is, β* and σ*² are no longer normally and chi-square distributed, respectively. In fact, the distributions of these estimators are not analytically tractable at present. As can be seen from (18) and (19), they are very complicated functions of normally distributed random variables. The question which naturally arises is whether we can derive any properties of β* and σ*² without knowing their distributions. Using the properties SCE1–SCE5 (especially SCE3) of conditional expectations with respect to some σ-field (see Section 7.2) we can deduce the following:

    E(β*) = E[E(β* | σ(𝒳))] = β + E[(𝒳′𝒳)⁻¹𝒳′E(u | σ(𝒳))] = β,    (20.21)

if E[(𝒳′𝒳)⁻¹𝒳′] < ∞, since by the construction of the statistical GM (7), E(u | σ(𝒳)) = 0. That is, β* is an unbiased estimator of β. Similarly,

    Cov(β*) = E[(β* − β)(β* − β)′] = E[E((β* − β)(β* − β)′ | σ(𝒳))]
            = E[(𝒳′𝒳)⁻¹𝒳′E(uu′ | σ(𝒳))𝒳(𝒳′𝒳)⁻¹]
            = σ²E(𝒳′𝒳)⁻¹,    (20.22)

if E(𝒳′𝒳)⁻¹ exists, since E(uu′ | σ(𝒳)) = σ²I_T. Using the similarity between the log likelihood function (17) and that of the linear regression model we can deduce that the information matrix in the present case should be of the form

    I_T*(θ) = ( E(𝒳′𝒳)/σ²        0
                0           T/(2σ⁴) ).    (20.23)

This shows that β* is an efficient estimator of β.

Using the same properties of the conditional expectation operator we can show that for the MLE σ*² of σ²

    E(σ*²) = E[E(σ*² | σ(𝒳))] = (1/T)E[E(u*′u* | σ(𝒳))]
           = (1/T)E[E(u′M_𝒳 u | σ(𝒳))], where M_𝒳 = I_T − 𝒳(𝒳′𝒳)⁻¹𝒳′,
           = (1/T)E[E(tr(M_𝒳 uu′) | σ(𝒳))] = (1/T)E[tr(M_𝒳 E(uu′ | σ(𝒳)))]
           = (σ²/T)E(tr M_𝒳) = ((T − k)/T)σ², since tr M_𝒳 = T − tr((𝒳′𝒳)⁻¹𝒳′𝒳) = T − k,

for all observable values of 𝒳.    (20.24)

This implies that although σ*² is a biased estimator of σ², the estimator defined by

    s*² = (1/(T − k))u*′u*    (20.25)

is an unbiased estimator of σ².

Using the Lehmann–Scheffé theorem (see Chapter 12) we can show that τ(y, 𝒳) = (𝒳′y, 𝒳′𝒳) is a minimal sufficient statistic and, as can be seen from (18) and (19), both estimators are functions of this statistic.

Although we were able to derive certain finite sample properties of the MLE's β* and σ*² without having their distributions, no testing or confidence regions are possible without them. For this reason we usually resort to asymptotic theory. Under the assumption

    lim(T→∞) E(𝒳′𝒳/T) = Q_x < ∞ and non-singular,    (20.26)

we can deduce that

    √T(β* − β) ~ N(0, σ²Q_x⁻¹),    (20.27)

    √T(σ*² − σ²) ~ N(0, 2σ⁴).    (20.28)

These asymptotic distributions can be used to test hypotheses and set up confidence regions when T is large.
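A small Monte Carlo sketch of (27): across replications, the empirical covariance of √T(β* − β) should approach σ²Q_x⁻¹. The choice of Q_x and all other constants are assumptions made purely for the illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    T, R = 200, 5000
    beta, sigma = np.array([1.0, -0.5]), 1.0
    Qx = np.array([[1.0, 0.4], [0.4, 1.0]])       # assumed plim of X'X/T

    draws = np.empty((R, 2))
    for i in range(R):
        X = rng.multivariate_normal(np.zeros(2), Qx, size=T)
        y = X @ beta + rng.normal(scale=sigma, size=T)
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        draws[i] = np.sqrt(T) * (b - beta)

    print(np.cov(draws.T))                        # approaches sigma^2 Qx^{-1}
    print(sigma**2 * np.linalg.inv(Qx))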

The above discussion of the stochastic linear regression model as a separate statistical model will be of considerable value in the discussion of the dynamic linear regression model in Chapter 23. In that chapter it is argued that the dynamic linear regression model can be profitably viewed as a hybrid of the linear and stochastic linear regression models.

20.2 The statistical parameters of interest

The statistical parameters which define the statistical GM are said to be the statistical parameters of interest. In the case of the linear regression model these are β = Σ₂₂⁻¹σ₂₁ and σ² = σ₁₁ − σ₁₂Σ₂₂⁻¹σ₂₁. Estimation of these statistical parameters provides us with an estimated data generating mechanism assumed to have given rise to the observed data in question. The notion of the statistical parameters of interest is of paramount importance because the whole statistical analysis 'revolves' around these parameters. A cursory look at assumptions [1]–[8] defining the linear regression model reveals that all the assumptions are directly or indirectly related to the statistical parameters of interest θ. Assumption [1] defines the systematic and non-systematic components in terms of θ. The assumption of weak exogeneity [3] is defined relative to θ. Any a priori information is introduced into the statistical model via θ. Assumption [5], referring to the rank of X, is indirectly related to θ because the condition

    rank(X) = k    (20.29)

is the sample equivalent to the condition

    rank(Σ₂₂) = k,    (20.30)

required to ensure that Σ₂₂ is invertible and thus that the statistical parameters of interest θ can be defined. Note that for T > k, rank(X) = rank(X′X). Assumptions [6] to [8] are directly related to θ in view of the fact that they are all defined in terms of D(y_t | X_t; θ).

The statistical parameters of interest θ do not necessarily coincide with the theoretical parameters of interest, say ξ. The two sets of parameters, however, should be related in such a way as to ensure that ξ is uniquely defined in terms of θ; only then can the theoretical parameters of interest be given statistical meaning. In such a case ξ is said to be identifiable (see Chapter 25). Empirical econometric models represent reparametrised statistical GM's in terms of ξ. Their statistical meaning is derived from θ and their theoretical meaning through ξ. As it stands, the statistical GM

    y_t = β′x_t + u_t,  t ∈ T,    (20.31)

might or might not have any theoretical meaning, depending on the mapping

    G(ξ, θ) = 0    (20.32)

relating the two sets of parameters. It does, however, have statistical meaning irrespective of the mapping (32). Moreover, the statistical parameters of interest θ are not restricted unduly at the outset, in order to enable the modeller to test any such testable restrictions. That is, the statistical GM is not restricted to coincide with any theoretical model at the outset. Before any such restrictions are imposed we need to ensure that the estimated statistical GM is well defined statistically, i.e. that the underlying assumptions [1]–[8] are valid for the data in hand.

The statistical parametrisation θ depends crucially on the choice of Z_t and its underlying probabilistic structure as summarised in D(Z_t; ψ). Any change in Z_t and/or D(Z_t; ψ) changes θ as well as the statistical model in question. Hence, caution should be exercised in postulating arguments which depend on different parametrisations, especially when the parametrisations involved are not directly comparable. In order to illustrate this, let us consider the so-called omitted variables bias problem. The textbook discussion of the omitted variables bias argument can be summarised as follows:

The true specification is

    y = Xβ + Wγ + ε,  ε ~ N(0, σ²I_T),    (20.33)

but instead

    y = Xβ + u,  u ~ N(0, σ²I_T),    (20.34)

was estimated by ordinary least-squares (OLS) (see Chapter 21), the OLS estimators being

    β̂ = (X′X)⁻¹X′y    (20.35)

and

    û = y − Xβ̂.    (20.36)

In view of the fact that a comparison between (33) and (34) reveals that

    u = Wγ + ε,    (20.37)

we can deduce that

    E(u) = Wγ ≠ 0,    (20.38)

and thus

(i) E(β̂) − β = (X′X)⁻¹X′Wγ ≠ 0;    (20.39)

and

(ii) E(σ̂²) − σ² = (1/(T − k))γ′W′M_X Wγ,    (20.40)

where M_X = I − X(X′X)⁻¹X′. That is, β̂ and σ̂² suffer from omitted variables bias unless W′X = 0 and γ = 0, respectively; see Maddala (1977), Johnston (1984), Schmidt (1976), inter alia.

From the textbook specification approach viewpoint, where the statistical model is derived by attaching an error term to the theoretical model, it is impossible to question the validity of the above argument. On the other hand, looking at it from the specification viewpoint proposed in Chapter 19, we can see a number of serious weaknesses in the argument. The most obvious weakness of the argument is that it depends on two statistical models with different parametrisations. In particular, β in (33) and in (34) is very different. If we denote the coefficient of X in (34) by β = Σ₂₂⁻¹σ₂₁, the coefficient of X in (33) takes the form

    β₀ = Σ₂₂.₃⁻¹σ₂₁ − Σ₂₂.₃⁻¹Σ₂₃Σ₃₃⁻¹σ₃₁, where Σ₂₂.₃ = Σ₂₂ − Σ₂₃Σ₃₃⁻¹Σ₃₂,

so that

    E_{y_t|X_t,W_t}(β̂) − β₀ = (X′X)⁻¹X′Wγ ≠ 0,    (20.41)

since

    E_{y_t|X_t,W_t}(u) = Wγ ≠ 0,    (20.42)

where E_{y_t|X_t,W_t}(·) refers to the expectation operator defined in terms of D(y_t | X_t, W_t; θ₀). Looking at (41) we can see that the omitted variables bias arises when we try to estimate β₀ in

    y = Xβ₀ + Wγ + ε    (20.43)

by estimating β in (34), where, by construction, β ≠ β₀. On the other hand, in the context of the same statistical model,

    E(β̂) − β = 0,    (20.44)

since

    E_{y_t|X_t}(u) = 0,    (20.45)

and no omitted variables problem arises. A similar argument can be made for σ². From this viewpoint the question of estimating the statistical parameters of interest θ₀ = (β₀, γ, σ₀²) by estimating θ = (β, σ²) never arises, since the two parameter sets θ₀ and θ depend on different sample information, F₀ = σ(y_t, X_t, W_t, t = 1, 2, ..., T) and F = σ(y_t, X_t, t = 1, 2, ..., T), respectively. This, however, does not imply that the omitted variables argument is useless; quite the opposite. In cases where the sample information is the same (F₀ = F) the argument can be very useful in deriving misspecification tests (see Chapters 21 and 22). For further discussion of this issue see Spanos (1985b).

The above argument illustrates the dangers of not specifying explicitly the underlying probability model and the statistical parameters of interest. By changing the underlying probability distribution and the parametrisation, the results on bias disappear. The two parametrisations are only comparable when they are both derivable from the joint distribution D(Z₁, ..., Z_T; ψ) using alternative 'reduction' arguments.

20.3 Weak exogeneity

θ = (β, σ²) is concerned. That is, although at the outset we postulate D(y_t, X_t; ψ), as far as the parameters of interest are concerned D(y_t | X_t; ψ₁) suffices; note that

    D(y_t, X_t; ψ) = D(y_t | X_t; ψ₁)·D(X_t; ψ₂)    (20.46)

is true for any joint distribution (see Chapter 5). If we want to test the exogeneity assumption we need to specify D(X_t; ψ₂) and consider it in relation to D(y_t | X_t; ψ₁) (see Wu (1973), Engle (1984), inter alia). These exogeneity tests usually test certain implications of the exogeneity assumption, and this can present various problems. The implications of exogeneity tested depend crucially on the other assumptions of the model, as well as on the appropriate specification of the statistical GM giving rise to x_t, t = 1, 2, ..., T; see Engle et al. (1983).

Exogeneity in this context will be treated as a non-directly testable assumption and no exogeneity tests will be considered. It will be argued in Chapter 21 that exogeneity assumptions can be tested indirectly by testing the assumptions [6]–[8]. The argument in a nutshell is that when inappropriate marginalisation and conditioning are used in defining the parameters of interest, the assumptions [6]–[8] are unlikely to be valid (see Engle et al. (1983), Richard (1982)). For example, a way to 'test' the weak exogeneity assumption indirectly is to test for departures from the normality of D(y_t, X_t; ψ) using the implied normality of D(y_t | X_t; θ) and the homoskedasticity of Var(y_t | X_t = x_t). For instance, in the case where D(y_t, X_t; ψ) is multivariate Student's t, the parameters ψ₁ and ψ₂ above are no longer variation free (see Section 21.4). Testing for departures from normality in the directions implied by D(y_t, X_t; ψ) being multivariate t can be viewed as an indirect test of the variation-free assumption underlying weak exogeneity.

20.4 Restrictions on the statistical parameters of interest θ

The statistical inference results on the linear regression model derived in Chapter 19 are based on the assumption that no a priori information on θ = (β, σ²) is available. Such a priori information, when available, can take various forms, such as linear, non-linear, exact, inexact or stochastic. In this section only exact a priori information on β and its implications will be considered; a priori information on σ² is rather scarce.

(1) Linear a priori restrictions on β

Let us assume that a priori information in the form of m linear restrictions

    Rβ = r    (20.47)

is also available at the outset, where R and r are m×k and m×1 known matrices, rank(R) = m. Such restrictions imply that the parameter space where β takes values is no longer ℝ^k but some subset of it, as determined by (47). These restrictions represent information relevant for the statistical analysis of the linear regression model and can be taken into consideration.

In the estimation of θ these restrictions can be taken into consideration by extending the concept of the log likelihood function to include such restrictions. This is achieved by defining the Lagrangian function to be

    l(β, σ², μ; y, X) = const − (T/2) log σ² − (1/(2σ²))(y − Xβ)′(y − Xβ) − μ′(Rβ − r),    (20.48)

where μ represents an m×1 vector of Lagrange multipliers. Optimisation of (48) with respect to β, σ² and μ gives rise to the first-order conditions:

    ∂l/∂β = (1/σ²)(X′y − X′Xβ) − R′μ = 0,    (20.49)

    ∂l/∂σ² = −T/(2σ²) + (1/(2σ⁴))(y − Xβ)′(y − Xβ) = 0,    (20.50)

    ∂l/∂μ = −(Rβ − r) = 0.    (20.51)

Premultiplying (49) by R(X′X)⁻¹ (to make the second term invertible) and solving for μ we get

    μ̃ = [σ̃²R(X′X)⁻¹R′]⁻¹(Rβ̂ − r).    (20.52)
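The constrained MLE of β itself has the closed form quoted in question 5 at the end of the chapter, β̃ = β̂ − (X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − r). A minimal computational sketch, with an invented design and restriction:

    import numpy as np

    def restricted_ls(X, y, R, r):
        """Constrained MLE of beta under R beta = r (see question 5)."""
        XtX_inv = np.linalg.inv(X.T @ X)
        b_hat = XtX_inv @ X.T @ y
        A = R @ XtX_inv @ R.T                     # R (X'X)^{-1} R'
        b_tilde = b_hat - XtX_inv @ R.T @ np.linalg.solve(A, R @ b_hat - r)
        return b_hat, b_tilde

    rng = np.random.default_rng(4)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, 1.0, 0.5]) + rng.normal(size=50)
    R = np.array([[1.0, -1.0, 0.0]])              # restriction: beta_1 = beta_2
    r = np.array([0.0])
    b_hat, b_tilde = restricted_ls(X, y, R, r)
    print(b_hat, b_tilde, R @ b_tilde)            # R b_tilde = r by construction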

Properties of θ̃ = (β̃, μ̃, σ̃²)

Using these formulae we can derive the distributions of the constrained MLE's of β, σ² and μ. β̃ and μ̃, being linear functions of β̂, are jointly normally distributed:

    ( β̃ )      ( β + (X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹(r − Rβ) )   ( C₁₁  C₁₂ )
    ( μ̃ ) ~ N( ( [σ²R(X′X)⁻¹R′]⁻¹(Rβ − r)            ),  ( C₂₁  C₂₂ ) ),    (20.56)

where

    C₁₁ = σ²[(X′X)⁻¹ − (X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹R(X′X)⁻¹] = Cov(β̃),
    C₁₂ = (X′X)⁻¹R′[σ²R(X′X)⁻¹R′]⁻¹ = Cov(β̃, μ̃) = C₂₁′,
    C₂₂ = [σ²R(X′X)⁻¹R′]⁻¹ = Cov(μ̃).

Using (56) we can deduce that:

(i) When Rβ = r, E(β̃) = β and E(μ̃) = 0, i.e. β̃ and μ̃ are unbiased estimators of β and 0, respectively.

(ii) β̃ and μ̃ are fully efficient estimators of β and μ, since their variances achieve the Cramér–Rao lower bounds, as can be verified directly using the extended information matrix:

    I_T(β, μ, σ²) = ( X′X/σ²   R′        0
                      R         0        0
                      0         0   T/(2σ⁴) )    (20.57)

(see exercises 1 and 2).

(iii) [Cov(β̃) − Cov(β̂)] ≤ 0, i.e. the covariance of the constrained MLE β̃ is always less than or equal to the covariance of the unconstrained MLE β̂, irrespective of whether Rβ = r holds or not; but [MSE(β̃) − MSE(β̂)] ≥ 0, where MSE stands for mean square error (see Chapter 12).

Given that

    (Rβ̂ − r)′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − r)/σ² ~ χ²(m; δ),    (20.60)

where

    δ = (Rβ − r)′[σ²R(X′X)⁻¹R′]⁻¹(Rβ − r),    (20.61)

we can deduce that

    ũ′ũ/σ² ~ χ²(T − k + m; δ),    (20.62)

using the reproductive property of the chi-square (see Appendix 6.1) and the independence of the two components in (59). This implies that

    E(ũ′ũ) = σ²(T − k + m) when δ = 0.    (20.63)

But for s̃² = [1/(T + m − k)]ũ′ũ, E(s̃²) = σ² when Rβ = r, since δ = 0.

The F-test revisited

In Section 19.5 above we derived the F-test based on the test statistic

    τ(y) = (Rβ̂ − r)′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − r)/(ms²)    (20.64)

for the null hypothesis

    H₀: Rβ = r against H₁: Rβ ≠ r,

using the intuitive argument that when H₀ is valid, ‖Rβ̂ − r‖ must be close to zero. We can derive the same test using various other intuitive arguments similar to this, in relation to quantities like ‖β̃ − β̂‖ and ‖μ̃‖ being close to zero when H₀ is valid (see question 5). A more formal derivation of the F-test can be based on the likelihood ratio test procedure (see Chapter 14). The above null and alternative hypotheses in the language of Chapter 14 can be written as

    H₀: θ ∈ Θ₀,  H₁: θ ∈ Θ₁ = Θ − Θ₀,

where

    θ = (β, σ²),  Θ = {(β, σ²): β ∈ ℝ^k, σ² ∈ ℝ₊},
    Θ₀ = {(β, σ²): β ∈ ℝ^k, Rβ = r, σ² ∈ ℝ₊}.

The likelihood ratio takes the form

    λ(y) = max_{θ∈Θ₀} L(θ; y)/max_{θ∈Θ} L(θ; y) = L(θ̃; y)/L(θ̂; y)
         = (2πσ̃²)^(−T/2) exp(−T/2)/[(2πσ̂²)^(−T/2) exp(−T/2)] = (σ̂²/σ̃²)^(T/2).    (20.65)

The problem we have to face at this stage is to determine the distribution of λ(y) or some monotonic function of it. Using (58) we can write λ(y) in the form

    λ(y) = [1 + (Rβ̂ − r)′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − r)/(û′û)]^(−T/2).    (20.66)

Looking at (66) we can see that it is directly related to (64), whose distribution we know. Hence λ(y) can be transformed into the F-test using the monotonic transformation

    τ(y) = (λ(y)^(−2/T) − 1)((T − k)/m).    (20.67)

This transformation provides us with an alternative way to calculate the value of the test statistic τ(y), using the estimates of the restricted and unrestricted MLE's of σ². An even simpler operational form of τ(y) can be specified using the equality (58). From this equality we can deduce that

    (Rβ̂ − r)′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − r) = ũ′ũ − û′û    (20.68)

(see exercise 4). This implies that τ(y) can be written in the form

    τ(y) = ((ũ′ũ − û′û)/(û′û))((T − k)/m)    (20.69)

or

    τ(y) = ((RRSS − URSS)/URSS)((T − k)/m),    (20.70)

where RRSS and URSS denote the restricted and unrestricted residual sums of squares, respectively.

Assuming that this is a well-defined estimated statistical GM (a very questionable assumption) we can proceed to consider specification tests related to a priori restrictions on the parameters of interest. One set of such a priori restrictions which is interesting from the economic theory viewpoint is

    H₀: β₂ = 1 and β₃ = 1 against H₁: β₂ ≠ 1 or β₃ ≠ 1.

Interpreting β₂ and β₃ as income and price elasticities, respectively, we can view H₀ as a unit elasticity hypothesis.

In order to use the form of the F-test (linear restrictions) as specified in (70) we need to re-estimate (71) imposing the restrictions. This estimation yielded

    (m_t − p_t − y_t) = −0.529 − 0.219 i_t + û_t,    (20.72)
                       (0.055)  (0.019)  (0.087)

R² = 0.629, R̄² = 0.624, s = 0.0866, log L = 83.18, RSS = 0.58552, T = 80. Given that RRSS = 0.58552 and URSS = 0.11752 we can deduce that

    τ(y) = ((0.58552 − 0.11752)/0.11752)(76/2) ≈ 151.3.    (20.73)

For a size α = 0.05 test the rejection region is

    C₁ = {y: τ(y) ≥ 3.12}.    (20.74)

Hence we can conclude that H₀ is strongly rejected. It must be stressed, however, that this is a specification test and is based on the presupposition that all the assumptions underlying the linear regression model are valid. From the limited analysis of this estimated equation in Chapter 19 there are clear signs, such as its predictive ability and the residuals' time pattern, that some of the underlying assumptions might be invalid. In such a case the above conclusion based on the F-test might be very misleading.

The above form of the F-test will play a very important role in the context of misspecification testing, to be considered in Chapters 21–23.

(2) Exact non-linear restrictions on β

Having considered the estimation and testing of the linear regression model when a priori information in the form of exact linear restrictions on β is available, we now turn to exact non-linear restrictions.

Consider the case where a priori information comes in the form of m non-linear restrictions (e.g. β₁ = β₂/β₃):

    h_i(β) = 0,  i = 1, 2, ..., m,

or, in matrix form,

    H(β) = 0.    (20.75)

In order to ensure independence between the m restrictions we assume that

    rank(∂H(β)/∂β′) = m.    (20.76)

As in the case of the linear restrictions, let us consider first the question of constructing a test for the null hypothesis

    H₀: H(β) = 0 against H₁: H(β) ≠ 0.    (20.77)

Using the same intuitive argument as the one which served us so well in constructing the F-test (see Section 19.5), we expect that when H₀ is valid, H(β̂) ≅ 0. The problem then becomes one of constructing a test statistic based on the distance

    ‖H(β̂) − 0‖.    (20.78)

Following the same arguments in relation to the units of measurement and the absolute value we might transform (78) into

    H(β̂)′[Cov(H(β̂))]⁻¹H(β̂).    (20.79)

Unlike the case of the linear restrictions, however, we do not know the distribution of (79), and thus it is of no use as a test statistic. This is because although we know that β̂ ~ N(β, σ²(X′X)⁻¹), the distribution of H(β̂) is no longer normal, being a non-linear function of β̂. Although for some non-linear functions h_i(β̂) we might be able to derive their distribution, this is of little value because we need general results which can be used for any non-linear restrictions. The construction of the F-test suggests that if we could linearise H(β̂) such a general test could be derived along similar lines to the F-test. Linearisation of H(β̂) can be achieved by taking its first-order Taylor expansion at β, i.e.

    H(β̂) = H(β) + (∂H(β)/∂β′)(β̂ − β) + o_p(1).    (20.80)

asymptotically negligible implies that any result based on (80) can only be justified asymptotically. Hence, any tests based on (80) can only be asymptotic. What we could not get in finite sample theory (linearity), we get by going asymptotic. Given that

    √T(β̂ − β) ~ N(0, σ²Q_x⁻¹),    (20.81)

we can deduce that

    √T(H(β̂) − H(β)) ~ N(0, σ²(∂H(β)/∂β′)Q_x⁻¹(∂H(β)/∂β′)′)    (20.82)

(see Chapter 15). This result implies that if we substitute the asymptotic covariance Cov_a(H(β̂)) into (79) we could get an asymptotic test, because

    T·H(β̂)′[Cov_a(H(β̂))]⁻¹H(β̂) ~ χ²(m; δ),    (20.83)

where

    Cov_a(H(β̂)) = σ²(∂H(β)/∂β′)Q_x⁻¹(∂H(β)/∂β′)′ and δ = H(β)′[Cov_a(H(β̂))]⁻¹H(β).    (20.84)

As it stands, (83) cannot be used as a test statistic because β and σ² are unknown.

where c_α is determined by the size condition ∫(c_α..∞) dχ²(m) = α. This is because under H₀

    W(y) ~ χ²(m).    (20.88)

This test is known as the Wald test, whose general form was discussed in Chapter 16. In the same chapter two other asymptotic test procedures which give rise to asymptotically equivalent tests were also discussed. These are the Lagrange multiplier and the likelihood ratio asymptotic test procedures.
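A sketch of a Wald test for a non-linear restriction, in the spirit of (83)–(88): the user supplies h(·), its Jacobian, and an estimate of Cov(β̂) (e.g. s²(X′X)⁻¹). The example restriction β₁β₂ = 1 and all numbers are hypothetical.

    import numpy as np

    def wald_nonlinear(b_hat, cov_b, h, dh):
        """Wald statistic for H0: h(beta) = 0; dh returns the m x k Jacobian.
        Under H0 the statistic is asymptotically chi-square(m)."""
        Hb = np.atleast_1d(h(b_hat))
        J = np.atleast_2d(dh(b_hat))
        V = J @ cov_b @ J.T                       # asymptotic Cov of h(b_hat)
        return float(Hb @ np.linalg.solve(V, Hb))

    # Hypothetical example: H0: beta_1 * beta_2 = 1.
    h = lambda b: b[0] * b[1] - 1.0
    dh = lambda b: np.array([[b[1], b[0]]])
    W = wald_nonlinear(np.array([2.0, 0.6]), 0.01 * np.eye(2), h, dh)
    print(W)   # compare with the chi-square(1) critical value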

The Lagrange multiplier test procedure is based on the constrained MLE β̃ of β instead of the unconstrained MLE β̂. That is, β̃ is derived from the optimisation of the Lagrangian function

    l(β, μ; y) = log L(β; y) − μ′H(β)    (20.89)

via

    ∂l/∂β = ∂ log L/∂β − (∂H(β)/∂β′)′μ = 0 and ∂l/∂μ = −H(β) = 0,    (20.90)

i.e.

    (∂H(β̃)/∂β′)′μ̃ = ∂ log L(β̃; y)/∂β and H(β̃) = 0,    (20.91)

where θ̃ = (β̃, σ̃²) is the constrained MLE of θ = (β, σ²). In order to understand what is involved in (89)–(91) it is advisable to compare these with (48), (49) and (52) above for the linear restrictions case. In direct analogy to the linear restrictions case, the distance

    ‖μ̃ − 0‖    (20.92)

(see Chapter 16). This suggests that the statistic

    LM(y) = μ̃′(∂H(β̃)/∂β′)[σ̃²(X′X)⁻¹](∂H(β̃)/∂β′)′μ̃ ~ χ²(m; δ)    (20.95)

can be used to construct a size α test with rejection region

    C₁ = {y: LM(y) > c_α}    (20.96)

and power function

    P(β) = Pr(LM(y) > c_α) = ∫(c_α..∞) dχ²(m; δ).    (20.97)

In Chapter 16 it was shown that the Lagrange multiplier test can take an alternative formulation based on the score function ∂ log L(θ)/∂θ. In view of the relationship between the score function and the Lagrange multipliers given in (91), we can deduce that we can construct a test for H₀ against H₁ based on the distance

    ‖∂ log L(θ̃)/∂θ − 0‖.    (20.98)

Following the same argument as in the case of the construction of W(y) and LM(y) we can suggest that the quantity

    (∂ log L(θ̃)/∂θ)′[Cov(∂ log L(θ)/∂θ)]⁻¹(∂ log L(θ̃)/∂θ)    (20.99)

should form the basis of another reasonable test statistic. Given that

    (1/√T)(∂ log L(θ)/∂θ) ~ N(0, I_∞(θ))    (20.100)

(see Chapters 13 and 16), we can deduce the following test statistic:

    ES(y) = (1/T)(∂ log L(θ̃)/∂θ)′[I_∞(θ̃)]⁻¹(∂ log L(θ̃)/∂θ) ~ χ²(m; δ).    (20.101)

This test statistic can be simplified further by noting that H₀ involves only a subset of θ (just β) and that I_∞(θ) is block diagonal. Using the results of Chapter 16 we can deduce that

The test statistic ES(y) constitutes what is sometimes called the efficient score form of the Lagrange multiplier test.

The likelihood ratio test is based on the test statistic (see Chapter 16):

    LR(y) = −2[log L(θ̃; y) − log L(θ̂; y)] ~ χ²(m; δ),    (20.103)

with rejection region

    C₁ = {y: LR(y) > c_α}.    (20.104)

Using a first-order Taylor expansion we can approximate LR(y) by

    LR(y) ≅ T(θ̃ − θ̂)′I_∞(θ̂)(θ̃ − θ̂).    (20.105)

This approximation is very suggestive because it shows clearly an important feature shared by all four test statistics W(y), LM(y), ES(y) and LR(y). All four are based on the intuitive argument that some distance, ‖H(β̂)‖, ‖μ̃‖, ‖∂ log L(β̃, σ̃²)/∂β‖ and ‖θ̃ − θ̂‖ respectively, is 'close to zero' when H₀ is valid.

Another important feature shared by all four test statistics is that their asymptotic distribution (χ²(m) under H₀) depends crucially on the asymptotic normality of a certain quantity involved in defining these distances: √T(β̂ − β), (1/√T)μ̃, (1/√T)∂ log L(β̃, σ̃²)/∂β and √T(θ̃ − θ̂), respectively. All three tests, the Wald (W), Lagrange multiplier (LM) and likelihood ratio (LR), are asymptotically equivalent, in the sense that they have the same asymptotic power characteristics. On practical grounds the only difference between the three test statistics is purely computational: W(y) is based only on the unconstrained MLE β̂, LM(y) is based on the constrained MLE β̃, and LR(y) on both. For a given sample size T, however, the three test statistics can lead to different decisions as far as rejection of H₀ is concerned. For example, if we were to apply the above procedures to the case where H(β) = Rβ − r we could show that W(y) ≥ LR(y) ≥ LM(y) (see exercises 5 and 6).

20.5 Collinearity

As argued in Section 19.3 above, assumption [5], stating that

    rank(X) = k,  T > k,    (20.106)

is directly related to the statistical parameters of interest θ = (β, σ²) (β = Σ₂₂⁻¹σ₂₁, σ² = σ₁₁ − σ₁₂Σ₂₂⁻¹σ₂₁) via Σ₂₂. This is because these parameters are defined only when Σ₂₂ is invertible, and (106) represents the sample equivalent to

    rank(Σ₂₂) = k.    (20.108)

When (108) is invalid and Σ₂₂ is singular, β and σ² cannot even be defined. Condition (108), however, cannot be verified directly, and thus we need to rely on (106), which ensures that the estimators

    β̂ = (X′X)⁻¹X′y and σ̂² = (1/T)y′(I − X(X′X)⁻¹X′)y    (20.109)

of β and σ² can be defined. In the case where X′X is singular, β̂ and σ̂² cannot be derived.

The problem we face is that the singularity of (X′X) does not necessarily imply the singularity of Σ₂₂. This is because the singularity of (X′X) might be a problem with the observed data in hand and not a population problem. For example, in the case where T < k, rank(X′X) < k irrespective of Σ₂₂, because of the inadequacy of the observed data information. The only clear conclusion to be drawn from the failure of condition (106) is that the sample information in X is inadequate for the estimation of the statistical parameters of interest β and σ². The source of the problem is rather more difficult (sometimes impossible) to establish.

In econometric modelling the problem of collinearity is rather rare, and when it occurs the reason is commonly that the modeller has ignored relevant measurement information (see Chapter 26) related to the data chosen, for example in the case where an accounting identity holds among some of the x_it's.

It is important to note that the problem of collinearity is defined relative to a given parametrisation. The presence of collinearity among the columns of X, however, does not preclude the possibility of estimating another parametrisation/restriction of the statistical GM. One such parametrisation is provided by a particular linear combination of the columns of X based on the eigenvalues and eigenvectors of (X′X).

Let

    P′(X′X)P = Λ = diag(λ₁, λ₂, ..., λ_m, 0, 0, ..., 0)    (20.110)

and P′P = PP′ = I_k, where P represents a k×k orthogonal matrix whose columns are the eigenvectors of (X′X) and λ₁, λ₂, ..., λ_m its non-zero eigenvalues (see Householder (1974)). If we define the new observed data matrix to be X* = XP, with β* = P′β the associated coefficient parameters, we can reparametrise the statistical GM into

    y_t = β*′x_t* + u_t,  t = 1, 2, ..., T.    (20.111)

artificial random variables X_i* = Xp_i, i = 1, 2, ..., k. The columns of X*, defined by X_i* = Xp_i, are known as the principal components of X, and in view of

    (X′X)p_i = 0 for i = m + 1, ..., k    (20.112)

(see (110)), where p_i, i = 1, 2, ..., k, are the columns of P, we can deduce that X* = (X₁*: X₂*), X₂* = 0. Decomposing β* = P′β conformably in the form β* = (α′, γ′)′ we can rewrite (111) for t = 1, 2, ..., T as

    y = X₁*α + u,    (20.114)

where rank(X₁*) = m, with α and τ² being the new parameters, which are now data specific. These parameters can be estimated via

    α̂ = (X₁*′X₁*)⁻¹X₁*′y and τ̂² = (1/T)y′(I − X₁*(X₁*′X₁*)⁻¹X₁*′)y.    (20.115)

Moreover, in view of the relationship

    β = Pβ* = P₁α + P₂γ,    (20.116)

any linear combination c′β of β is estimable if c′P₂ = 0, since

    c′β = c′P₁α + c′P₂γ = c′P₁α.    (20.117)

Using the principal components as the new columns of X, however, does not constitute a solution to the original collinearity problem, because the estimators in (115) refer to a new parametrisation. This shows clearly how the collinearity problem is relative to a given parametrisation and not just a problem of data matrices. The same is also true for a potentially more serious problem, that of 'near collinearity', to be considered in the next section.

20.6 ‘Near’ collinearity

If we define collinearity as the situation where the rank of (X′X) is less than k, 'near' collinearity refers to the situation where (X′X) is 'nearly' singular, or ill-conditioned as it is known in numerical analysis. The effect of this near singularity is that the solution of the system

    (X′X)β = X′y    (20.118)

is highly sensitive to small changes in the data. As with exact collinearity, this is a problem of insufficient data information relative to a given parametrisation. In the present context the information in X is not quite adequate for the estimation of the statistical parameters of interest β and σ². This might be due to insufficient sample information or to the choice of the variables involved (Σ₂₂ is nearly singular). For example, in cases where there is not enough variability in some of the observed data series, the sample information is inadequate for the task of determining β̂ and σ̂². In such cases 'near' collinearity is a problem to be tackled. On the other hand, when the problem is inherent in the choice of X_t (i.e. a population problem), then near collinearity is not really a problem to be tackled. In practice, however, there is no way to distinguish between these two sources of 'near' collinearity, because we do not know Σ₂₂ unless we are in a Monte Carlo experimental situation (see Hendry (1984)). This suggests that any assessment of whether 'near' collinearity relative to a given parametrisation is a problem to be tackled will depend on assumptions about the 'true' values of the statistical parameters of interest.

Some of the commonly used criteria for detecting 'near' collinearity suggested by the textbook econometric literature are motivated solely by its effect on the 'accuracy' of β̂ as measured by

    Cov(β̂) = σ²(X′X)⁻¹.    (20.119)

Such criteria include:

(a) Simple correlations

These refer to the transformation of the (X′X) matrix into a correlation matrix by standardising the regressors using

    x̂_it = (x_it − x̄_i)/[Σ(t=1..T)(x_it − x̄_i)²]^(1/2),  i = 1, ..., k.    (20.120)

The standardisation is used in order to eliminate the units-of-measurement problem. High simple correlations between regressors are sometimes interpreted as indicators of near collinearity.

(b) Auxiliary regressions

Auxiliary regressions are estimated between each regressor and all the others, say x_kt on x_1t, x_2t, ..., x_(k−1)t, and a high value of the multiple correlation coefficient from this regression, R_k², is used as a criterion for 'near' collinearity. This is motivated by the following form of the variance of β̂_k:

    Var(β̂_k) = σ²[(X′X)⁻¹]_kk = σ²[(1 − R_k²)x_k′x_k]⁻¹    (20.122)

(see Theil (1971)). A high value for R_k² (everything else assumed fixed) leads to a high value for Var(β̂_k), viewed as the uncertainty related to β̂_k. By the same token, however, a small value for x_k′x_k, interpreted as the variability of the kth regressor, will have the same effect. Note that R_k² refers to

    R_k² = x_k′X_(k)(X_(k)′X_(k))⁻¹X_(k)′x_k/(x_k′x_k),    (20.123)

where X_(k) denotes X with its kth column x_k excluded.

(c) Condition numbers

Using the spectral decomposition of (X′X) in (110) we can express (119) in the form

    Cov(β̂) = σ²(X′X)⁻¹ = σ²PΛ⁻¹P′ = σ² Σ(i=1..k) p_i p_i′/λ_i,    (20.124)

and thus

    Var(β̂_j) = σ² Σ(i=1..k)(p_ji²/λ_i),  j = 1, 2, ..., k    (20.125)

(see Silvey (1969)). This suggests that the variance of each estimated coefficient β̂_j depends on all the eigenvalues of (X′X). The presence of a relatively small eigenvalue λ_i will 'dominate' these variances. For this reason we look at the condition numbers

    κ_i(X′X) = (λ_max/λ_i)^(1/2),  i = 1, 2, ..., k,    (20.126)

large values indicating 'near' collinearity, where λ_max refers to the largest eigenvalue (see Belsley et al. (1980)). How large a condition number must be to indicate the presence of near collinearity is an open question, in view of the fact that the eigenvalues λ₁, ..., λ_k are not invariant to scale changes. For further discussion see Belsley (1984).
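Criteria (b) and (c) are easily computed; the sketch below implements the auxiliary-regression R_k² of (123) and the condition indices, taking the square-root scaling of Belsley et al. (1980) as an assumption. The data, with one 'near' collinear pair, are artificial.

    import numpy as np

    def aux_r2(X, k):
        """R_k^2 from regressing column k of X on the remaining columns, (20.123)."""
        xk = X[:, k]
        Xrest = np.delete(X, k, axis=1)
        fitted = Xrest @ np.linalg.lstsq(Xrest, xk, rcond=None)[0]
        return (fitted @ xk) / (xk @ xk)

    def condition_indices(X):
        """(lambda_max / lambda_i)^(1/2) for the eigenvalues of X'X, cf. (20.126)."""
        lam = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
        return np.sqrt(lam[0] / lam)

    rng = np.random.default_rng(7)
    X = rng.normal(size=(60, 3))
    X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=60)   # 'near' collinear pair
    print(aux_r2(X, 2), condition_indices(X))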

Several other criteria for detecting 'near' collinearity have been suggested in the econometric literature (see Judge et al. (1985) for a survey), but all these criteria, together with (a)–(c) above, suffer from two major weaknesses:

(ii) none of these criteria is invariant to linear transformations of the data (changes of origin and scale).

Ideally, we would like the matrix X′X to be diagonal, reflecting the orthogonality of the regressors, because in such a case the statistical GM takes a form where the effect of each regressor can be assessed separately, given that Σ₂₂ = diag(σ₂₂, σ₃₃, ..., σ_kk) and

    β_i = σ_i1/σ_ii,  σ_i1 = Cov(X_it, y_t),  i = 2, 3, ..., k,    (20.127)

    β₁ = m₁ − Σ(i=2..k)(σ_i1/σ_ii)m_i,  m₁ = E(y_t),  m_i = E(X_it),  i = 2, 3, ..., k.    (20.128)

In such a case

    β̂_i = (x_i′x_i)⁻¹x_i′y and Var(β̂_i) = σ²(x_i′x_i)⁻¹,    (20.129)

with the estimator as well as its variance being affected only by the ith regressor. This represents a very robust estimated statistical model, because changes in the behaviour of one regressor over time will affect no other coefficient estimator but its own. Moreover, by increasing the number of

ensure that the sample information is sufficient for the accurate determination of its parameters. It is important to emphasise at this stage that 'good' empirical econometric models are not, as some econometric and time-series textbooks would have us believe, given to us from outside by 'sophisticated' theories or by observed data regularities, but are constructed by econometric modellers using their ingenuity and craftsmanship.

In view of the above discussion it seems preferable to put more emphasis on constructing robust empirical econometric models with 'nearly' orthogonal regressors, without, however, sacrificing either their statistical or their economic meaning. Hence, instead of worrying about how to detect 'near' collinearity (which cannot be defined precisely anyway) by some battery of suspect criteria, it is preferable to turn the question on its head and consider the problem as one of constructing empirical econometric models with 'nearly' orthogonal regressors among their other desirable properties. To that end we need some criterion which assesses the contribution of each regressor separately and is invariant to linear transformations of the observed data; that is, a criterion whose value remains unchanged when

    y_t is mapped into y_t* = a₁y_t + c₁    (20.130)

and

    X_t is mapped into X_t* = A₂X_t + c₂,    (20.131)

where a₁ ≠ 0 and A₂ is a (k − 1) × (k − 1) non-singular matrix.

We know that under the normality assumption

    Z_t ~ N(m, Σ),  t ∈ T,    (20.132)

where

    Z_t = (y_t, X_t′)′,  m = (m₁, m₂′)′,  Σ = ( σ₁₁  σ₁₂ ; σ₂₁  Σ₂₂ ),    (20.133)

the statistic (Z̄, S), with

    Z̄ = (1/T) Σ(t=1..T) Z_t and S = Σ(t=1..T)(Z_t − Z̄)(Z_t − Z̄)′,    (20.134)

is a sufficient statistic. Under the transformations (130) and (131) the statistics become

    Z̄ → AZ̄ + c,  S → ASA′,

where A and c are k×k and k×1 matrices, respectively. The corresponding transformations on the parameters are:

    m → Am + c,  Σ → AΣA′.    (20.137)

What we are seeking is a criterion which remains unchanged under this group of transformations. The obvious candidates as likely measures for assessing the separate contributions of the regressors involved are the multiple and partial correlation coefficients (see Chapter 15). The sample partial correlation coefficient of y_t with one of the regressors, say X_kt, given the rest, is given in equation (15.48). This represents a measure of the correlation of y_t and x_kt when the effect of all the other variables has been 'partialled out'. On the other hand, if we want the incremental contribution of X_kt to the regression of y_t on X_t, we need to take the difference between the sample multiple correlation coefficient between y_t and X_t and that between y_t and X_(k)t (X_t with X_kt excluded), denoted by R² and R_(k)² respectively, i.e. use

    (R² − R_(k)²)    (20.138)

(see equation (15.39) for R²). It is not very surprising that both of these measures are directly related, via

    (R² − R_(k)²) = ρ̂²_yk·(1 − R_(k)²)    (20.139)

(see Theil (1971)), where ρ̂_yk· denotes the sample partial correlation coefficient of y_t and X_kt given the rest.
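The identity (139) can be verified numerically. The sketch below computes both sides for artificial data, using centred R²'s (regressions including an intercept), which is one convention among several.

    import numpy as np

    rng = np.random.default_rng(6)
    T = 500
    S = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]])
    X = rng.multivariate_normal(np.zeros(3), S, size=T)
    y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=T)

    def resid(v, Xmat):
        # Residuals from regressing v on Xmat plus an intercept.
        Z = np.column_stack([np.ones(len(v)), Xmat])
        return v - Z @ np.linalg.lstsq(Z, v, rcond=None)[0]

    def r2(Xmat, y):
        e = resid(y, Xmat)
        return 1 - (e @ e) / ((y - y.mean()) @ (y - y.mean()))

    R2, R2_k = r2(X, y), r2(X[:, :2], y)          # with/without the 3rd regressor

    ey = resid(y, X[:, :2])                       # partial correlation via residuals
    ex = resid(X[:, 2], X[:, :2])
    rho2 = (ey @ ex) ** 2 / ((ey @ ey) * (ex @ ex))

    print(R2 - R2_k, rho2 * (1 - R2_k))           # the two sides of (20.139)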

Let

    R² = g(Z̄, S) and ρ̂²_yk· = g_k(Z̄, S).

We can verify directly that

    g(Z̄, S) = g(AZ̄ + c, ASA′)    (20.140)

and

    g_k(Z̄, S) = g_k(AZ̄ + c, ASA′).    (20.141)

empirical econometric model) using the incremental contributions

    (R² − R_(i)²),  i = 1, 2, ..., k − 1,    (20.142)

in conjunction with

    R² − Σ(i=1..k−1)(R² − R_(i)²),    (20.143)

which Theil (1971) called the multicollinearity effect. In the present context such an interpretation should be viewed as coincidental to the main aim of constructing robust empirical econometric models. It is important to remember that the computation of the multiple correlation coefficient differs from one computer package to another, and it is rarely the one used above.

To conclude, note that the various so-called 'solutions' to the problem of near collinearity, such as dropping or adding regressors or supplementing the model with a priori information, are simply ways to introduce alternative reparametrisations, not solutions to the original problem.

Important concepts

Stochastic linear regression model, statistical versus theoretical parameters of interest, omitted variables bias, reparametrisation, constrained and unconstrained MLE's, a priori linear and non-linear restrictions, restricted and unrestricted residual sums of squares, collinearity, 'near' collinearity, orthogonal regressors, incremental contributions, partial correlation coefficient, invariance to linear transformations.

Questions

1. Compare and contrast:
   (i) E(y_t | X_t = x_t) and E(y_t | σ(X_t));
   (ii) β̂ = (X′X)⁻¹X′y and β* = (𝒳′𝒳)⁻¹𝒳′y;
   (iii) σ̂² = (1/T)û′û and σ*² = (1/T)u*′u*.

2. Compare and contrast the statistical GM's, and the probability and sampling models, of the linear regression and stochastic linear regression statistical models.

3. 'Let the "true" model be

    y_t = β′x_t + γ′w_t + u_t

and the one used be y_t = β′x_t + u_t. It can be shown that for β̂ = (X′X)⁻¹X′y:

(i) E(β̂) ≠ β, i.e. β̂ suffers from omitted variables bias; and
(ii) E(u_t) = γ′w_t.

Discuss.'

4. Explain informally how you would go about constructing an exogeneity test. Discuss the difficulties associated with such a test.

5. Compare the constrained and unconstrained MLE's of β and σ²:

    β̃ = β̂ − (X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − r).

6. Explain how you would go about constructing a test for

    H₀: Rβ = r against H₁: Rβ ≠ r

based on the intuitive argument that when H₀ is true μ̃ is close to zero. 'Why don't we use the distance ‖Rβ̂ − r‖?' Compare the resulting test with the F-test.

7. Explain intuitively the derivation of the Wald test for

    H₀: H(β) = 0 against H₁: H(β) ≠ 0.

'Why don't we use ‖H(β̂)‖ instead of |H(β̂)| as the basis of the argument for the derivation?'

8. What do we mean by the statement that the Wald, Lagrange multiplier and likelihood ratio test procedures give rise to three asymptotically equivalent tests?

9. When do we prefer an asymptotic to a finite sample test?

10. Explain the role of the assumption that rank(X) = k in the context of the linear regression model.

11. Discuss the concepts of 'collinearity' and 'near' collinearity and their implications as far as the MLE's β̂ and σ̂² are concerned.


Exercises

1. Using the first-order conditions (49)–(51) of Section 20.4 derive the information matrix (57).

2. Using the partitioned inverse formula

    ( A   B )⁻¹   ( A⁻¹ + FE⁻¹F′   −FE⁻¹ )
    ( B′  D )   = ( −E⁻¹F′          E⁻¹  ),   E = D − B′A⁻¹B,  F = A⁻¹B,

derive [I_T(β, μ, σ²)]⁻¹ and compare its various elements with C₁₁, C₁₂ and C₂₂ in (56).

3. Verify the distribution of (β̃′, μ̃′)′ in (56).

4. Verify the equality

    (Rβ̂ − r)′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − r) = ũ′ũ − û′û.

5. For the null hypothesis H₀: Rβ = r against H₁: Rβ ≠ r use the Wald, Lagrange multiplier and likelihood ratio test procedures to derive the test statistics

    W(y) = Tmτ(y)/(T − k),
    LM(y) = Tmτ(y)/(T − k + mτ(y)),
    LR(y) = T log(1 + mτ(y)/(T − k)),

respectively, where τ(y) is the test statistic of the F-test.

6. Using W(y), LM(y) and LR(y) from exercise 5 show that

    W(y) ≥ LR(y) ≥ LM(y).

(Note that log(1 + z) ≥ z/(1 + z) and z ≥ log(1 + z) for z ≥ 0.) (See Evans and Savin (1982).)
