Applied Econometrics, Lecture 7: Multicollinearity

Written by Nguyen Hoang Bao, May 24, 2004


"Doubt whom you will, but never yourself."

1) Introduction

The multiple regression model can be written as follows:

Y_i = b_0 + b_1 X_1 + b_2 X_2 + ... + b_k X_k

Collinearity refers to a linear relationship between two X variables. Multicollinearity encompasses linear relationships among more than two X variables. Multiple regression is impossible in the presence of perfect collinearity or multicollinearity: if X_1 and X_2 have no independent variation, we cannot estimate the effect of X_1 adjusting for X_2, or vice versa. One of the variables must be dropped, and this is no loss, since a perfect relationship implies perfect redundancy. Perfect multicollinearity is, however, rarely a practical problem. Strong (but not perfect) multicollinearity, which permits estimation but makes it less precise, is more common. When multicollinearity is present, the interpretation of the coefficients becomes quite difficult.

2) Practical consequences of multicollinearity

Large standard errors of coefficients
The easiest way to tell whether multicollinearity is causing problems is to examine the standard errors of the coefficients. If several coefficients have high standard errors, and dropping one or more variables from the equation lowers the standard errors of the remaining variables, multicollinearity is likely to be the source of the problem. A more sophisticated analysis would take into account the fact that the covariance between estimated parameters may be sensitive to multicollinearity: a high degree of multicollinearity will be associated with a relatively high covariance between estimated parameters. This suggests that if one estimated parameter b_i overestimates the true parameter β_i, a second estimate b_j is likely to underestimate β_j, and vice versa. Because of the large standard errors, the confidence intervals for the relevant population parameters tend to be wide.

Sensitive coefficients
Another consequence of high correlation between explanatory variables is that the parameter estimates become very sensitive to the addition or deletion of observations.

A high R^2 but few significant t-ratios
Few coefficients are statistically significantly different from zero, and yet the coefficient of determination is high.

3) Detection of multicollinearity

3.1) There is a high R^2 but few significant t-ratios. The F-test rejects the hypothesis that the partial slope coefficients are simultaneously equal to zero, but the individual t-tests show that none, or very few, of the partial slope coefficients are statistically different from zero.

3.2) Multicollinearity can be considered a serious problem only if R^2_y < R^2_i (Klein, 1962), where R^2_y is the squared multiple correlation coefficient between Y and the explanatory variables and R^2_i is the squared multiple correlation coefficient between X_i and the other explanatory variables. Two caveats apply:
- Even if R^2_y < R^2_i, the t-values for the regression coefficients may still be statistically significant.
- Even if R^2_i is very high, the simple correlations among the regressors may be comparatively low.

3.3) In the regression of Y on X_2, X_3 and X_4, if one finds that R^2_{1.234} is very high but r^2_{12.34}, r^2_{13.24} and r^2_{14.23} are comparatively low, this may suggest that X_2, X_3 and X_4 are highly intercorrelated and that at least one of them is superfluous. A code sketch illustrating checks 3.1 and 3.2 follows below.
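As a minimal illustration of these diagnostics (an addition to the original notes, not part of them), the following Python sketch compares R^2_y with each auxiliary R^2_i in the spirit of Klein's rule. The synthetic data and variable names are assumptions; any y vector and X matrix of the same shape would do.

```python
import numpy as np
import statsmodels.api as sm

def klein_rule(y, X):
    """Compare R^2 of y on all regressors with each auxiliary R^2_i
    (X_i regressed on the remaining X's), following Klein's rule."""
    r2_y = sm.OLS(y, sm.add_constant(X)).fit().rsquared
    flags = {}
    for i in range(X.shape[1]):
        others = np.delete(X, i, axis=1)
        r2_i = sm.OLS(X[:, i], sm.add_constant(others)).fit().rsquared
        # Klein: multicollinearity is serious only if R^2_y < R^2_i
        flags[i] = (r2_i, r2_i > r2_y)
    return r2_y, flags

# Artificial data with one nearly collinear pair
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
y = 1 + 2 * x1 - x2 + 0.5 * x3 + rng.normal(size=200)

r2_y, flags = klein_rule(y, X)
print("R^2_y =", round(r2_y, 3))
for i, (r2_i, serious) in flags.items():
    print(f"X_{i+1}: auxiliary R^2 = {r2_i:.3f}, serious by Klein's rule: {serious}")
```

With data like these, x1 and x2 show auxiliary R^2 values near one, flagging them even though the overall fit of y is also good.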
3.4) We may use an overall F-test to check whether there is a relationship between any one explanatory variable and the remaining explanatory variables.

3.5) In the regression of Y on X_1 and X_2, we may calculate the values λ from the following characteristic equation:

(S_11 - λ)(S_22 - λ) - S_12^2 = 0

where, summing over the observations i = 1, ..., n:

S_11 = Σ (X_1i - X̄_1)^2
S_22 = Σ (X_2i - X̄_2)^2
S_12 = Σ (X_1i - X̄_1)(X_2i - X̄_2)

The condition number (Raduchel, 1971; Belsley, Kuh and Welsch, 1980) is defined as:

CN = λ_1 / λ_2, where λ_1 > λ_2

If CN is between 10 and 30, there is moderate to strong multicollinearity; if CN is greater than 30, there is severe multicollinearity. The closer the condition number is to one, the better conditioned the data are.

3.6) Theil's test (1971) [1]. Calculate the measure m, defined as:

m = R^2 - Σ (R^2 - R^2_{-i}), summing over i = 1, ..., k

where R^2 is the squared multiple correlation coefficient between Y and the explanatory variables (X_1, X_2, ..., X_i, ..., X_k), and R^2_{-i} is the squared multiple correlation coefficient between Y and the explanatory variables with X_i omitted (X_1, X_2, ..., X_{i-1}, X_{i+1}, ..., X_k). If (X_1, X_2, ..., X_k) are mutually uncorrelated, then m will be zero.

3.7) Variance-inflation factor (VIF). The VIF is defined as:

VIF_i = 1 / (1 - R^2_i)

where R^2_i is the squared multiple correlation coefficient between X_i and the other explanatory variables. We may calculate it for each explanatory variable separately. The VIF_i measures the degree of multicollinearity among the regressors with reference to the ideal situation in which all explanatory variables are uncorrelated (R^2_i = 0 implies VIF_i = 1) [2]. The VIF_i's will be useful for dropping some variables and imposing parameter constraints only in some very extreme cases (where R^2_i is close to one). A code sketch computing VIFs and the condition number is given below.

[1] Theil, H. (1971), Principles of Econometrics (New York: Wiley), p. 179.
[2] We can interpret VIF_i as the ratio of the actual variance of b_i to what the variance of b_i would have been if X_i were uncorrelated with the remaining X's.
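The following Python sketch (an illustration added here, not from the original notes) computes VIFs via auxiliary regressions and the condition number as the ratio of the largest to the smallest eigenvalue of the centered cross-product matrix, matching the definitions above.

```python
import numpy as np
import statsmodels.api as sm

def vif_and_condition_number(X):
    """VIF_i = 1/(1 - R^2_i) from auxiliary regressions, and
    CN = lambda_max / lambda_min from the centered cross-product matrix S
    (the document's definition; some texts use the square root instead)."""
    n, k = X.shape
    vifs = []
    for i in range(k):
        others = np.delete(X, i, axis=1)
        r2_i = sm.OLS(X[:, i], sm.add_constant(others)).fit().rsquared
        vifs.append(1.0 / (1.0 - r2_i))
    Xc = X - X.mean(axis=0)        # center each regressor
    S = Xc.T @ Xc                  # S_jj = sum of squared deviations, etc.
    eig = np.linalg.eigvalsh(S)    # eigenvalues in ascending order
    cn = eig[-1] / eig[0]
    return np.array(vifs), cn

# Per the text: CN between 10 and 30 suggests moderate to strong
# multicollinearity; CN above 30 suggests severe multicollinearity.
```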
4) Remedial measures

4.1) Getting more data. Increasing the size of the sample may reduce the multicollinearity problem. The variance of a coefficient is:

V(b_i) = σ^2 / (S_ii (1 - R^2_i))

where σ^2 is the variance of the error term, S_ii = Σ (X_i - X̄_i)^2 is the sum of squared deviations of X_i over the n observations, and R^2_i is the squared multiple correlation coefficient between X_i and the other explanatory variables. As the sample size increases, S_ii increases. Therefore, for any given R^2_i, the variance of the coefficient V(b_i) decreases, thus decreasing the standard error, which enables us to estimate β_i more precisely.

4.2) Transforming variables (using ratios or first differences). A regression in ratios or in first differences often reduces the severity of multicollinearity. However, the first-difference regression may generate additional problems: (i) the error terms may be serially correlated; (ii) one observation is lost; (iii) first differencing may not be appropriate for cross-sectional data, where there is no logical ordering of the observations.

4.3) Dropping variables. As discussed in previous lectures, dropping a variable to alleviate multicollinearity may lead to specification bias. Hence, the remedy may be worse than the disease in some situations: while multicollinearity may prevent precise estimation of the parameters of the model, omitting a variable may seriously mislead us as to the true values of the parameters.

4.4) Using extraneous estimates (Tobin, 1950). The equation to be estimated is:

ln Q = α + β_1 ln P + β_2 ln I

where Q, P and I represent the quantity of the product, its price, and income, respectively. In the time-series data, income and price are highly collinear. First, we obtain an estimate b_2 of the income elasticity β_2 from cross-section data; because such data refer to a single point in time, prices do not vary much. This b_2 is known as the extraneous estimate. Second, we regress (ln Q - b_2 ln I) on ln P to obtain estimates of α and β_1. A weakness of the method is its implicit assumption that the income elasticity does not change over time. However, the technique may be worth considering in situations where the cross-sectional estimates do not vary substantially from one cross-section to another.

4.5) Using a priori information. Consider the following equation:

Y_1 = β_1 X_1 + β_2 X_2

We cannot get good estimates of β_1 and β_2 because of the high correlation between X_1 and X_2. Suppose we can obtain an estimate b_1 of β_1 from another data set and another equation:

Y_2 = β_1 X_1 + α_2 Z

where X_1 and Z are not highly correlated, so b_1 is a good estimate of β_1. We then regress (Y_1 - b_1 X_1) on X_2 to obtain an estimate of β_2.

5) Fragility analysis: making sense of slope coefficients

It is a useful exercise to investigate the sensitivity of regression coefficients across plausible neighboring specifications, so as to check the fragility of the inferences we draw from any one specification when there is uncertainty as to which variables to include.

1. If the different regressors are highly correlated with one another, then there is a problem of collinearity or multicollinearity. This means that the parameters we estimate are very sensitive to the model specification we use, and that we may get a high R^2 but insignificant coefficients. (Another indication of multicollinearity is that the R^2's from the simple regressions do not sum to near the R^2 of the multiple regression.)
2. We would much prefer to have robust coefficients, which are not sensitive to small changes in the model specification.

Consider the following model:

Y_i = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3

As there are three explanatory variables, we have seven possible equations to estimate (the number of equations is 2^k - 1, where k is the number of regressors). In some cases, there may be one or more variables we wish to include in all specifications, because we are particularly interested in that variable or are really sure it cannot be omitted. The seven equations are:

1) Y on X_1
2) Y on X_2
3) Y on X_3
4) Y on X_1 and X_2
5) Y on X_1 and X_3
6) Y on X_2 and X_3
7) Y on X_1, X_2 and X_3

To carry out a fragility analysis, we perform the following steps (a code sketch follows the list):

1. Estimate all seven regressions.
2. Construct a table of coefficients (excluding the intercepts and including the R^2).
3. If the coefficients vary widely, there is evidence of multicollinearity (look also at simple versus multiple R^2).
4. To avoid problems of scale, normalize each coefficient by dividing it by the mean of the absolute value of that coefficient across all specifications, and then calculate the maximum, minimum and range of each coefficient.
5. We can then identify which regressors are robust.
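A minimal Python sketch of these steps (an added illustration; the DataFrame df, the variable names, and the helper name fragility_table are assumptions, not part of the original notes):

```python
import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm

def fragility_table(df, yvar, xvars):
    """Estimate all 2^k - 1 regressions of yvar on the non-empty subsets
    of xvars; return raw coefficients (with R^2), normalized coefficients,
    and a max/min/range summary for judging robustness."""
    rows = []
    for r in range(1, len(xvars) + 1):
        for subset in itertools.combinations(xvars, r):
            fit = sm.OLS(df[yvar], sm.add_constant(df[list(subset)])).fit()
            row = {x: fit.params[x] for x in subset}   # skip the intercept
            row["R2"] = fit.rsquared
            rows.append(row)
    coefs = pd.DataFrame(rows)
    # Normalize each coefficient by its mean absolute value across the
    # specifications in which it appears (NaNs are skipped automatically).
    norm = coefs[xvars] / coefs[xvars].abs().mean()
    summary = pd.DataFrame({"max": norm.max(), "min": norm.min()})
    summary["range"] = summary["max"] - summary["min"]  # small range = robust
    return coefs, norm, summary

# Hypothetical usage with FERTILIT-style variables (see Example 5.1):
# coefs, norm, summary = fragility_table(df, "TFR", ["FP", "lnGNP", "FL", "CM"])
```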
Example 5.1: An examination of fertility in developing countries

The data for this example are in the file FERTILIT, which contains comparative cross-section data for 64 countries on fertility and its determinants, as given by the following variables:

TFR: the total fertility rate, 1980-85 (the average number of children born to a woman, using age-specific fertility rates for a given year)
FP: an index of family planning effort
GNP: gross national product per capita, 1980
FL: the female literacy rate, expressed as a percentage
CM: child mortality (the number of deaths of children under age five in a year per 1,000 live births)

The dependent variable is the total fertility rate (TFR). With four explanatory variables, there are 15 possible regressions (2^4 - 1 = 15).

Table 5.1: Coefficients from the 15 possible regressions

      FP       Ln(GNP)   FL        CM        R^2
 1    -0.042                                 0.57
 2             -0.568                        0.16
 3                       -0.036              0.39
 4                                  0.133    0.45
 5    -0.040   -0.453                        0.67
 6    -0.033             -0.022              0.69
 7    -0.031                        0.076    0.67
 8             -0.038    -0.035              0.39
 9              0.237               0.157    0.46
10                       -0.013     0.096    0.47
11    -0.035   -0.234    -0.016              0.70
12    -0.034   -0.288               0.047    0.68
13    -0.031             -0.015     0.033    0.69
14              0.255    -0.014     0.119    0.48
15    -0.034   -0.212    -0.015     0.008    0.70

Some coefficients seem to vary a great deal and others rather less so, but it is difficult to compare them precisely, as the scale of the coefficients varies. Hence we normalize them, as shown in Table 5.2 (a worked check of the normalization is given after the example).

Table 5.2: Normalized coefficients for the TFR regressions

      FP       Ln(GNP)   FL        CM
 1    -1.20
 2             -1.99
 3                       -1.73
 4                                  1.59
 5    -1.14    -1.59
 6    -0.94              -1.06
 7    -0.89                         0.91
 8             -0.13     -1.69
 9              0.83                1.88
10                       -0.63      1.15
11    -1.00    -0.82     -0.77
12    -0.97    -1.01                0.56
13    -0.89              -0.72      0.39
14              0.89     -0.67      1.42
15    -0.97    -0.74     -0.72      0.10

Max   -0.89     0.89     -0.63      1.88
Min   -1.20    -1.99     -1.73      0.10
Range  0.31     2.88      1.10      1.78

From these results, we see that:

1. The income variable is the least robust (range = 2.88): the coefficient from the simple regression is about twice that from the other specifications and, in some cases, the coefficient even becomes positive.
2. The family planning coefficient is the most robust (range = 0.31): it always retains the same negative sign and varies over a comparatively small range.
3. Collinearity seems particularly severe between Ln(GNP) and CM, so it is likely to be necessary to estimate an equation containing only one of these two variables. If both are included, then neither is statistically significantly different from zero.
4. Hence a regression of TFR on FP and FL seems sensible.
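To make the normalization step concrete (this check is added here, using only the figures in Table 5.1): the FP coefficient appears in eight specifications, with mean absolute value (0.042 + 0.040 + 0.033 + 0.031 + 0.035 + 0.034 + 0.031 + 0.034)/8 = 0.035. The simple-regression coefficient then normalizes to -0.042/0.035 = -1.20, and the smallest in magnitude to -0.031/0.035 = -0.89, matching the first column and the Max/Min rows of Table 5.2.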
References

Bao, Nguyen Hoang (1995), 'Applied Econometrics', Lecture Notes and Readings, Vietnam-Netherlands Project for MA Program in Economics of Development.
Belsley, D. A., E. Kuh and R. E. Welsch (1980), Regression Diagnostics (New York: Wiley).
Gujarati, Damodar N. (1988), Basic Econometrics, Second Edition (New York: McGraw-Hill).
Klein, L. R. (1962), An Introduction to Econometrics (Englewood Cliffs, N.J.: Prentice-Hall), p. 101.
Maddala, G. S. (1992), Introduction to Econometrics (New York: Macmillan Publishing Company).
Mukherjee, Chandan, Howard White and Marc Wuyts (1998), Econometrics and Data Analysis for Developing Countries (London: Routledge).
Raduchel, W. J. (1971), 'Multicollinearity Once Again', Harvard Institute of Economic Research, Paper 205, Cambridge, Mass.
Theil, H. (1971), Principles of Econometrics (New York: Wiley), p. 179.

Workshop 7: Multicollinearity

1) In the regression of Y on X_1 and X_2, match up the equivalent statements:
a) There is multicollinearity in the regressors
b) Y has a nearly perfect linear relation to X_1 and X_2
c) The multiple correlation of Y on X_1 and X_2 is nearly one
d) The residual variance after regression is very small compared to the variance of Y without regression
e) X_1 and X_2 have high correlation

2) Using the data file KRISNAIJ, we estimate the following model:

ln M = β_1 + β_2 ln Y + β_3 ln P_f + β_4 ln P_m

This model specification is, in fact, a restricted version of a more elaborate model which includes, apart from the income variable Y and the price variables P_f (the price of cereals) and P_m, two more price variables: P_of, a price index of other food products, and P_s, a price index of consumer services. Including the last two variables in the double-log specification yields a six-variable regression.

2.1) Construct a table with the results of all possible regressions that at least include the income variable (why?).
2.2) Construct comparative box plots of the variation in the slope coefficient of each regressor in the model (a code sketch is given below).
2.3) Judging from your table, check whether there is much evidence of multicollinearity.
2.4) Check whether any variables in any of the specifications appear superfluous.
2.5) How robust is the income elasticity across alternative specifications?
2.6) In your opinion, which price variables appear to be most relevant in the model?
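For question 2.2, a minimal sketch of how one might draw the comparative box plots (an added illustration; the coefficient table coefs and the variable names are assumptions, e.g. output from a fragility routine like the one sketched in Section 5):

```python
import matplotlib.pyplot as plt
import pandas as pd

def coefficient_boxplots(coefs, xvars):
    """Box plot of each regressor's slope coefficient across all
    specifications; wide boxes point to fragile (collinear) regressors."""
    data = [coefs[x].dropna() for x in xvars]   # drop specs omitting x
    plt.boxplot(data, labels=xvars)
    plt.axhline(0.0, linestyle="--")            # sign changes cross this line
    plt.ylabel("Slope coefficient across specifications")
    plt.title("Fragility of slope coefficients")
    plt.show()

# Hypothetical usage:
# coefficient_boxplots(coefs, ["lnY", "lnPf", "lnPm", "lnPof", "lnPs"])
```

Since the regressors are on different scales, it may be preferable to normalize the coefficients first (as in Section 5) so that the boxes are comparable across regressors.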
