Applied Econometrics
Lecture 3: Outliers, Leverage and Influence
‘Life is the art of drawing sufficient conclusions from insufficient premises’
SAMUEL BUTLER
1) Introduction
The estimates of the regression parameters can be strongly influenced by a few extreme observations. A
residual plot lets us pick out which individual data points lie unusually high or low relative to the
fitted line. We may use the residual plot to find outliers, that is, points that are inadequately
captured by the regression model itself.
2) Identification of outliers
- The percentiles that cut the data into four quarters have special names: the 25th and 75th percentiles are called the lower and upper quartiles ($Q_L$ and $Q_U$).
- The lower quartile is the [integer((n+1)/2)+1]/2 value from the bottom of the ordered list, and the upper quartile is the [integer((n+1)/2)+1]/2 value from the top. Here integer(.) denotes the integer part; when the resulting position ends in .5, the quartile is the average of the two adjacent ordered values.
- A data point $Y_0$ is considered to be an outlier if

$$Y_0 < Q_L - 1.5\,\mathrm{IQR} \quad \text{or} \quad Y_0 > Q_U + 1.5\,\mathrm{IQR}$$

where IQR is the interquartile range ($\mathrm{IQR} = Q_U - Q_L$) (Source: Hoaglin, 1983).
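The quartile and fence rules above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample data are made up for this example, and positions ending in .5 are handled by averaging the two neighbouring ordered values, which is an assumption about how the depth rule is meant to be applied.

```python
import math

def quartiles_by_depth(values):
    """Lower and upper quartiles using the position rule
    [integer((n+1)/2) + 1] / 2, counted from the bottom and from the top."""
    ordered = sorted(values)
    n = len(ordered)
    depth = (math.floor((n + 1) / 2) + 1) / 2  # may end in .5
    lo, hi = math.floor(depth), math.ceil(depth)
    # Average the two neighbouring values when the position ends in .5
    q_lower = (ordered[lo - 1] + ordered[hi - 1]) / 2
    q_upper = (ordered[n - lo] + ordered[n - hi]) / 2
    return q_lower, q_upper

def iqr_outliers(values):
    """Points below Q_L - 1.5 IQR or above Q_U + 1.5 IQR."""
    q_lower, q_upper = quartiles_by_depth(values)
    iqr = q_upper - q_lower
    lower_fence = q_lower - 1.5 * iqr
    upper_fence = q_upper + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical sample: nine values, one of them far from the rest
sample = [2, 3, 4, 5, 6, 7, 8, 9, 30]
print(quartiles_by_depth(sample))  # (4.0, 8.0)
print(iqr_outliers(sample))        # [30]
```

With n = 9 the position is 3, so the quartiles are the third values from the bottom and from the top; the fences are then -2 and 14, and only the value 30 falls outside them.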
3) Outliers
An outlier is a point that is far removed from its fitted value (i.e., it has a large residual). 'Large' in this
context does not refer to the absolute size of a residual but to its size relative to most of the other
residuals in the regression.
In univariate analysis, an outlier is defined with reference to the mean of the variable itself. In
bivariate analysis, an outlier is a point with a large residual (i.e., its Y value is far removed from its
fitted value).
Apart from graphical methods, we can also rely on special statistics to detect outliers. To
compare a large residual with the other residuals, we may calculate the standardized residual, which is
simply the residual divided by the standard error of the estimate ($e_i/s$). But an outlier in the data set
will itself inflate the standard error of the regression. Hence we use the studentized residual

$$t_i = \frac{e_i}{s(i)\sqrt{1 - h_i}}$$

where
$e_i$ is the residual ($e_i = y_i - \hat{y}_i$)
$s(i)$ is the standard error of the estimate obtained after dropping the ith observation from the sample
$h_i$ is the hat statistic for the ith observation, which is defined as

$$h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2}$$
The additional term in the denominator, $\sqrt{1 - h_i}$, is necessary because the variance of the residuals is
not constant: even when the errors have constant variance $\sigma^2$, the variance of the ith residual is
$\sigma^2(1 - h_i)$. With this adjustment factor we get a t-statistic, which tests whether the ith residual is
significantly different from zero and hence signals an outlier that does not really fit the overall pattern.
Alternatively, the same test is given by the t-statistic of the coefficient of a dummy variable that picks
out a single observation from the sample.
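As a rough illustration of the studentized residual, the sketch below computes $t_i$ for a simple regression of Y on X by literally refitting the line without observation i to obtain $s(i)$. The function names and the data are hypothetical, and the refit-by-deletion loop is chosen for transparency rather than efficiency.

```python
import math

def fit_line(x, y):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

def regression_se(x, y):
    """Standard error of the regression, with n - 2 degrees of freedom."""
    b0, b1 = fit_line(x, y)
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(rss / (len(x) - 2))

def studentized_residuals(x, y):
    """t_i = e_i / (s(i) * sqrt(1 - h_i)) for every observation (needs n >= 4)."""
    x, y = list(x), list(y)
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    result = []
    for i in range(n):
        e_i = y[i] - b0 - b1 * x[i]                # residual from the full fit
        h_i = 1 / n + (x[i] - x_bar) ** 2 / sxx    # hat statistic
        # s(i): standard error of the regression without observation i
        s_i = regression_se(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        result.append(e_i / (s_i * math.sqrt(1 - h_i)))
    return result

# Hypothetical data: roughly Y = 2X, except for the sixth observation
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 18.0, 14.2, 15.9]
for i, t in enumerate(studentized_residuals(x, y), start=1):
    print(i, round(t, 2))
```

In this made-up sample the sixth observation produces by far the largest $|t_i|$, because dropping it leaves a nearly exact straight line and therefore a very small $s(i)$.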
4) Leverage
A data point has high leverage if it is far removed in the X-direction (i.e., it lies a disproportionate
distance away from the middle of the range of X values) (Myers, 1990).
Points of high leverage can exert undue influence on the outcome of a least squares regression
line. That is, points with high leverage are capable of exerting a strong pull on the slope of the
regression line.
In univariate analysis, the definitions of an outlier and of a point of leverage are the same: a point
that is an outlier also has high leverage with respect to the mean. In bivariate analysis, a point of
high leverage (with respect to the slope coefficient) is one that is far removed in the X-direction
(as opposed to an outlier, which is far removed in the Y-direction).
A test statistic for leverage is the hat statistic:

$$h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2}$$

which serves as a measure of the leverage of the ith data point. It measures leverage because the
numerator is the squared distance of the ith data point from the mean in the X-direction, while the
denominator is a measure of the overall variability of the data points along the X-axis. Therefore, the
greater the distance of $X_i$ from its mean, the higher the value of $h_i$ and the higher the leverage of
the ith data point.
$h_i$ can vary from 1/n (i.e., close to zero in a large sample) for a point with no leverage up to one for
very high leverage. The following guidelines, based on the maximum observed value max($h_i$), have been
suggested (Huber, 1981):
max($h_i$) < 0.2: little to worry about
0.2 < max($h_i$) < 0.5: risky
0.5 < max($h_i$): too much leverage
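The hat statistic and the guidelines above are easy to compute directly for a single regressor. The following is a minimal sketch with hypothetical function names and made-up X values; it only restates the formula and the three bands given in the text.

```python
def hat_values(x):
    """h_i = 1/n + (X_i - mean)^2 / sum_j (X_j - mean)^2 for each observation."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

def leverage_verdict(h_max):
    """Apply the guideline bands to the largest hat value in the sample."""
    if h_max < 0.2:
        return "little to worry about"
    if h_max < 0.5:
        return "risky"
    return "too much leverage"

# Hypothetical regressor values: the last point lies far out in the X-direction
x = [1, 2, 3, 4, 5, 6, 7, 20]
h = hat_values(x)
print([round(h_i, 3) for h_i in h])
print(leverage_verdict(max(h)))  # too much leverage
```

Here the mean of X is 6 and the last point alone contributes 196 of the total squared deviation of 252, so its hat value is about 0.90. A useful check is that the hat values of a simple regression always sum to 2, the number of estimated coefficients.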
5) Influence
A data point is influential if removing it from the sample would markedly change the position of the
least squares regression line (Moore and McCabe, 1989). Hence, influential data points pull the
regression line in their direction.
Influential data points do not necessarily produce large residuals. That is, they are not always
outliers as well, although they can be. Conversely, an outlier is not necessarily an influential point,
particularly when it is a point with little leverage.
In univariate analysis, an outlier has high leverage and will be influential. In bivariate analysis, high
leverage is a necessary condition for influence on the slope, but not a sufficient one. Similarly, an
outlier may not be influential if it has low leverage, and a point of high leverage need not be an
outlier: if its leverage is strong enough, it pulls the line towards itself and leaves only a small residual.
A test for influence is the DFBETA statistic. Supposing that the regression model can be specified as
$Y = \beta_0 + \beta_1 X$, it is defined as

$$\mathrm{DFBETA}_i = \frac{\hat{\beta}_1 - \hat{\beta}_1(i)}{\mathrm{SE}(\hat{\beta}_1(i))}$$

where the bracket (i) refers to the value of the statistic when the ith observation is excluded from the
regression. The DFBETAs measure the sensitivity of the slope coefficient to the deletion of the ith
data point:
if |DFBETA| < $2/\sqrt{n}$, the point has no influence
if |DFBETA| > $3/\sqrt{n}$, the point is influential
if $2/\sqrt{n}$ < |DFBETA| < $3/\sqrt{n}$, the result is inconclusive
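The DFBETA rule can be sketched for a simple regression by refitting the line once per deleted observation. The function names and data below are hypothetical, and the sketch follows the definition used in these notes, in which both the slope and its standard error in the denominator come from the regression without observation i (other texts scale the difference slightly differently).

```python
import math

def slope_and_se(x, y):
    """Least squares slope of y on x and its standard error."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(rss / (n - 2))        # standard error of the regression
    return b1, s / math.sqrt(sxx)

def dfbetas(x, y):
    """DFBETA_i = (b1 - b1(i)) / SE(b1(i)) for the slope (needs n >= 4)."""
    x, y = list(x), list(y)
    b1, _ = slope_and_se(x, y)
    result = []
    for i in range(len(x)):
        # Slope and standard error with observation i excluded
        b1_i, se_i = slope_and_se(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        result.append((b1 - b1_i) / se_i)
    return result

def influence_verdict(dfbeta, n):
    """Compare |DFBETA| with the 2/sqrt(n) and 3/sqrt(n) cut-offs."""
    size = abs(dfbeta)
    if size < 2 / math.sqrt(n):
        return "no influence"
    if size > 3 / math.sqrt(n):
        return "influential"
    return "inconclusive"

# Hypothetical data: roughly Y = 2X, except for a high-leverage last point
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 20.0]
for i, d in enumerate(dfbetas(x, y), start=1):
    print(i, round(d, 2), influence_verdict(d, len(x)))
```

In this made-up sample the last point has both high leverage and a Y value far below the pattern of the other seven points, so deleting it changes the slope sharply and its DFBETA is large and negative.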
Regression analysis should capture the general pattern in the data, and an influential point can prevent
it from doing so. Hence, such points are often best dropped from the regression.
DFBETAs should always be used in conjunction with diagnostic regression graphics. It is always
possible that a cluster of points is exerting influence rather than a single data point.
Table 5: Summary measures of outliers, leverage, and influence

| Statistic | Formula | Use | Critical value |
|---|---|---|---|
| Studentized residual ($t_i$) | $t_i = \dfrac{e_i}{s(i)\sqrt{1 - h_i}}$ | Outliers | Critical values available (higher than for the usual t-test), but it is recommended to use $t_i$ as an exploratory tool |
| Hat statistic ($h_i$) | $h_i = \dfrac{1}{n} + \dfrac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2}$ | Leverage | Bounded by 1/n (no leverage) and 1 (extreme leverage); values above 0.5 indicate excessive leverage and values over 0.2 indicate the observation may give problems |
| DFBETA | $\mathrm{DFBETA}_i = \dfrac{\hat{\beta}_1 - \hat{\beta}_1(i)}{\mathrm{SE}(\hat{\beta}_1(i))}$ | Influence | Under $2/\sqrt{n}$, the point has no influence; over $3/\sqrt{n}$, the point is influential, and strongly so if DFBETA exceeds 2 |

Note: n is the sample size; k is the number of regressors; the subscript (i) (i.e., with parentheses) indicates an estimation from the
sample omitting observation i. In each case you should use the absolute value of the calculated statistic.
Source: Mukherjee Chandan, Howard White and Marc Wuyts (1998), ‘Econometrics and Data Analysis for
Developing Countries’ published by Routledge, London, UK.
References
Bao, Nguyen Hoang (1995), Applied Econometrics, Lecture Notes and Readings, Vietnam-Netherlands Project for MA Program in Economics of Development.
Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (1983), Understanding Robust and Exploratory Data Analysis, New York: John Wiley.
Huber, P.J. (1981), Robust Statistics, New York: John Wiley.
Maddala, G.S. (1992), Introduction to Econometrics, New York: Macmillan Publishing Company.
Moore, D.S. and McCabe, G.P. (1989), Introduction to the Practice of Statistics, New York: Freeman.
Mukherjee, C., White, H. and Wuyts, M. (1998), Econometrics and Data Analysis for Developing Countries, London: Routledge.
Myers, R.H. (1990), Classical and Modern Regression with Applications, Second Edition, Boston, MA: PWS-Kent.
Workshop 3: Outliers, Leverage and Influence
1) Look carefully at the four plots in the attached figure. For each plot, write down whether any of
the points is an outlier, a point of high leverage, an influential point, or some combination of
these. Briefly comment on your findings.
Hint:
Outliers are not necessarily influential (plot 4)
But they can be so (depending on leverage) (plot 3)
Yet high leverage points are not always influential (plot 1)
And influential points are not necessarily outliers (plot 2)
Plot summary

| Plot | Outliers | Leverage | Influence |
|---|---|---|---|
| Plot 1 | ______ | ______ | ______ |
| Plot 2 | ______ | ______ | ______ |
| Plot 3 | ______ | ______ | ______ |
| Plot 4 | ______ | ______ | ______ |

2) An examination of residuals provides a diagnostic check on the model. When the regression
model is inadequately specified, the residuals are not just pure noise. Instead they contain a
message that can help us to specify a better model.
Consider the four different relations between Y and X plotted below (Anscombe, 1973) – a
simplified version of some common phenomena.
2.1) Calculate the regression line (Y against X) for each data set, and graph it in panels 1, 2, 3 and 4
together with the data points.
2.2) State which graph corresponds to each of the following situations:
(i) The relation is really curved, rather than linear
(ii) The positive relation is entirely the result of just one data point
(iii) The residual variance is entirely the result of just one data point – which may very
well be recorded in error
(iv) It makes good sense to use the regression line for prediction
2.3) Briefly, what lesson does this show?
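| Regression 1: Y | Regression 1: $X_1$ | Regression 2: Y | Regression 2: $X_2$ | Regression 3: Y | Regression 3: $X_3$ | Regression 4: Y | Regression 4: $X_4$ |
|---|---|---|---|---|---|---|---|
| 8.04 | 10 | 9.14 | 10 | 7.46 | 10 | 6.58 | 8 |
| 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 | 8 |
| 7.58 | 13 | 8.74 | 13 | 12.74 | 13 | 7.71 | 8 |
| 8.81 | 9 | 8.77 | 9 | 7.11 | 9 | 8 | 8 |
| 8.33 | 11 | 9.26 | 11 | 7.81 | 11 | 8.47 | 8 |
| 9.96 | 14 | 8.1 | 14 | 8.84 | 14 | 7.04 | 8 |
| 7.24 | 6 | 6.13 | 6 | 6.08 | 6 | 5.25 | 8 |
| 4.26 | 4 | 3.1 | 4 | 5.39 | 4 | 12.5 | 19 |
| 10.84 | 12 | 9.13 | 12 | 8.15 | 12 | 5.56 | 8 |
| 4.82 | 7 | 7.26 | 7 | 6.42 | 7 | 7.91 | 8 |
| 5.68 | 5 | 4.74 | 5 | 5.73 | 5 | 6.89 | 8 |
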
3) The identification of outliers in univariate analysis
Using the data file LEACCESS.WK1, identify whether there are any outliers in each of the following
data sets:
3.1) LE
3.2) Y
3.3) ln(Y)
4) The identification of outliers in bivariate analysis
4.1) Using the data file AIDSAV, test whether observation 26 (Lesotho) is:
a) an outlier
b) a point of high leverage
c) an influential point
4.2) Draw the scatter plot of S/Y against A/Y showing the regression line with and without
point 26 in the same graph
4.3) What happens to the $R^2$ when observation 26 is dropped from the data set? Explain.
4.4) Are there any other problematic points in the sample?
4.5) Show algebraically that a point with no leverage cannot have any influence on the slope
coefficient
5) Outliers in bivariate analysis
5.1) Using the data file HOLMQ, which contains the data for EDUEXP and EAID, examine the
figure and test whether any of the points is:
a) an outlier
b) a point of high leverage
c) an influential point
5.2) Draw the scatter plot, showing the fitted line. Briefly comment on your findings.
Written by Nguyen Hoang Bao May 20, 2004