CHAPTER 47

Density Estimation

[Sim96] is a very encompassing text. A more elementary introduction with good explanations is [WJ95]. It also has some plots with datasets relevant to economics, see pp. 1 and 11, and there is an R and S-Plus package called KernSmooth associated with it (but this package does not contain the datasets). A more applied book is [BA97], which goes together with the R and S-Plus package sm.

47.1. How to Measure the Precision of a Density Estimator

Let $\hat f$ be the estimated density and $f$ the true density. Then for every fixed value $u$, the estimation error at $u$ is $\hat f(u) - f(u)$. This is a random variable which depends on $u$ as a nonrandom parameter. Its expected value is the bias at $u$,
\[ E\bigl[\hat f(u) - f(u)\bigr], \]
and the expected value of its square is the MSE at $u$,
\[ E\bigl[\bigl(\hat f(u) - f(u)\bigr)^2\bigr]. \]
This is a measure of the precision of the density estimate at the point $u$ only. The overall deviation of the estimated density from the true density can be measured by the integrated squared error (ISE)
\[ \int_{u=-\infty}^{+\infty} \bigl(\hat f(u) - f(u)\bigr)^2 \, du. \]
This is a random variable; for each observation vector, it gives a different number. The mean integrated squared error (MISE) is the expected value of the ISE, and at the same time (as long as integration and formation of the expected value can be interchanged) it is the integral of the MSE at $u$ over all $u$:
\[ \mathrm{MISE} = E\Bigl[\int_{u=-\infty}^{+\infty} \bigl(\hat f(u) - f(u)\bigr)^2 \, du\Bigr] = \int_{u=-\infty}^{+\infty} E\bigl[\bigl(\hat f(u) - f(u)\bigr)^2\bigr] \, du. \]
The asymptotic value of this is called AMISE. The MSE is the variance plus the squared bias. In density estimation, bias arises if one oversmooths, and variance increases if one undersmooths.

47.2. The Histogram

Histograms are density estimates. They are easy to understand, easy to construct, and do not require advanced graphical tools. Here the number of bins is important. Too few bins lead to oversmoothing, too many to undersmoothing. [Sim96, p. 16] shows how to compute the MISE of a histogram and which bin width is optimal.
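The effect of the number of bins on the ISE of Section 47.1 can be checked in a small simulation. The following is only a sketch under stated assumptions: the true density is taken to be standard Normal so that the ISE can be computed at all, the integral is approximated by a midpoint sum over the interval from -5 to 5, and the function names, sample size, and bin counts are made up for the illustration.

```python
import math
import random


def normal_pdf(u):
    """Standard Normal density, playing the role of the true density f (assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def histogram_density(data, lo, hi, bins):
    """Bin heights scaled as count / (n * bin width), so the total area is 1."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        if lo <= x < hi:  # the rare points outside [lo, hi) are ignored
            counts[min(int((x - lo) / width), bins - 1)] += 1
    return [c / (len(data) * width) for c in counts], width


def integrated_squared_error(data, lo, hi, bins, grid=20000):
    """Midpoint-rule approximation of the ISE, truncated to [lo, hi]."""
    heights, width = histogram_density(data, lo, hi, bins)
    step = (hi - lo) / grid
    total = 0.0
    for j in range(grid):
        u = lo + (j + 0.5) * step
        k = min(int((u - lo) / width), bins - 1)
        total += (heights[k] - normal_pdf(u)) ** 2 * step
    return total


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(2000)]
ise = {bins: integrated_squared_error(sample, -5.0, 5.0, bins) for bins in (2, 20, 2000)}
for bins, value in ise.items():
    print(bins, round(value, 4))
```

With 2 bins the estimate is oversmoothed (large bias), with 2000 bins it is undersmoothed (large variance), and the intermediate choice should give the smallest of the three ISE values.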
If the underlying distribution is Normal, the optimal bin width is
\[ h = 3.491\,\sigma n^{-1/3}. \]
This is often used also for non-Normal distributions, but if these distributions are bimodal, then one needs narrower bins. The R/S function dpih (which stands for Direct Plug-In Histogram) in the library KernSmooth uses more sophisticated methods to select an optimal bin width.

The anchor positions can also have a big impact on the appearance of a histogram. To demonstrate this, cd /usr/share/ecmet/xlispstat/anchor-position, then start xlispstat, then (load "fde"), then (fde-demo), and pick animate anchor-moving.

Regarding the labeling on the vertical axis of a histogram there is a naive and a more sophisticated approach. The naive approach gives the number of data points in each bin. The preferred, more sophisticated approach is to divide the number of points in each bin by the overall size of the dataset and by the bin width. In this way one gets the relative frequency density. With this normalization, the total area under the histogram is 1 and the histogram is directly comparable with other estimates of the probability density function.

47.3. The Frequency Polygon

The frequency polygon is derived from the histogram by connecting the mid-points of each bin. It gives a better approximation to the actual density. Now the optimal bin width for a Normal is
\[ h = 2.15\,\sigma n^{-1/5}. \]
The frequency polygon dominates the histogram, and is not really more difficult to construct. Simonoff argues that one should never draw histograms, only frequency polygons.

47.4. Kernel Densities

For every observation draw a standard Normal density with that point as the mode, and then add them up. An illustration is sm.script(sp_build). It can also be a Normal with variance different from 1; the greater the variance, the smoother the density estimate. Instead of the Normal density one can also take other smoothing kernels, i.e., functions $k$ with
\[ \int_{u=-\infty}^{+\infty} k(u)\,du = 1 \quad\text{and}\quad \int_{u=-\infty}^{+\infty} u\,k(u)\,du = 0. \]
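The construction with Normal kernels can be written out in a few lines. This is a minimal sketch, not the code behind the sm scripts: the four observations are made up, the function name is hypothetical, and the sum of the Normal densities is divided by the number of observations so that the result integrates to 1.

```python
import math


def normal_kernel_density(u, data, sd=1.0):
    """Average of Normal densities with standard deviation sd, one centered at each observation."""
    scale = sd * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((u - x) / sd) ** 2) / scale for x in data) / len(data)


observations = [1.0, 2.0, 2.5, 4.0]  # made-up data for illustration
step = 0.01
grid = [-20.0 + step * j for j in range(5001)]  # evaluation points from -20 to 30

# One curve per kernel standard deviation; a larger value gives a smoother curve.
curves = {sd: [normal_kernel_density(u, observations, sd) for u in grid] for sd in (0.3, 1.0, 2.0)}
areas = {sd: sum(curve) * step for sd, curve in curves.items()}
peaks = {sd: max(curve) for sd, curve in curves.items()}
print(areas)
print(peaks)
```

Each curve should have area close to 1, and the peak height should fall as the kernel standard deviation grows, which is the smoothing effect described in the text.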
An often-used kernel is the Triweight kernel $\frac{35}{32}(1 - x^2)^3$ for $|x| \le 1$ and 0 otherwise, but kernel functions may also assume negative values (in which case they are no longer densities). The choice of the functional form of the kernel is much less important than the bandwidth, i.e., the variance of the kernel (if interpreted as a density function)
\[ \mu_2 = \int_{u=-\infty}^{+\infty} u^2 k(u)\,du. \]

Problem 446. If $u \mapsto k(u)$ is the kernel, and $x = (x_1, \ldots, x_n)$ the data vector, then $\hat f(u) = \frac{1}{n} \sum_{i=1}^{n} k(u - x_i)$ is the kernel estimate of the density at $u$.

• a. 3 points Compute the mean of the kernel estimator at $u$.

Answer. $E[\hat f(u)] = \frac{1}{n} \sum_{i=1}^{n} E[k(u - x_i)]$, but since all $x_i$ are assumed to come from the same distribution, it follows that
\[ E[\hat f(u)] = E[k(u - x)] = \int_{x=-\infty}^{+\infty} k(u - x) f(x)\,dx. \]

• b. 4 points Assuming the $x_i$ are independent, show that
\[ \operatorname{var}[\hat f(u)] = \frac{1}{n} \Bigl( \int_{x=-\infty}^{+\infty} k^2(u - x) f(x)\,dx - \Bigl( \int_{x=-\infty}^{+\infty} k(u - x) f(x)\,dx \Bigr)^2 \Bigr). \tag{47.4.1} \]

Answer.
\begin{align}
\operatorname{var}[\hat f(u)] &= \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{var}[k(u - x_i)] \tag{47.4.2} \\
&= \frac{1}{n} \operatorname{var}[k(u - x)] \tag{47.4.3} \\
&= \frac{1}{n} \Bigl( E\bigl[k(u - x)^2\bigr] - \bigl(E[k(u - x)]\bigr)^2 \Bigr) \tag{47.4.4} \\
&= \frac{1}{n} \Bigl( \int_{x=-\infty}^{+\infty} k^2(u - x) f(x)\,dx - \Bigl( \int_{x=-\infty}^{+\infty} k(u - x) f(x)\,dx \Bigr)^2 \Bigr). \tag{47.4.5}
\end{align}

47.5. Transformational Kernel Density Estimators

This approach transforms the data first, then estimates a density of the transformed data, and then re-transforms this density to the original scale. This can be used, for instance, for the income distribution, see [WJ95, p. 11].

47.6. Confidence Bands

The variance of a density estimate is usually easier to compute than the bias. One method to get a confidence band is to draw additional curves 2 estimated pointwise standard deviations above and below the plot. This makes the assumption that the bias is 0. Therefore it is not really usable for inference, but it may give some idea whether certain features of the plot should be taken seriously; see sm.script(air_band).

Another approach is bootstrapping; see sm.script(air_boot).
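A bootstrap band of this kind can be sketched at a single point $u$. The code below is an illustration under assumptions, not a reproduction of the air_boot script: it uses a Normal kernel, a simulated sample, a hand-picked bandwidth, and hypothetical function names, and it forms the band as the original estimate plus and minus 2 bootstrap standard deviations.

```python
import math
import random


def kde(u, data, bandwidth):
    """Normal-kernel density estimate at u; bandwidth is the kernel's standard deviation."""
    norm = len(data) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((u - x) / bandwidth) ** 2) for x in data) / norm


def bootstrap_kde(u, data, bandwidth, replicates, rng):
    """Re-estimate the density at u on resamples drawn with replacement from data."""
    n = len(data)
    return [
        kde(u, [rng.choice(data) for _ in range(n)], bandwidth)
        for _ in range(replicates)
    ]


rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]  # simulated data (assumption)
u, bandwidth = 0.0, 0.4                             # hand-picked for the illustration

estimate = kde(u, sample, bandwidth)
boot = bootstrap_kde(u, sample, bandwidth, 200, rng)
boot_mean = sum(boot) / len(boot)
boot_sd = math.sqrt(sum((b - boot_mean) ** 2 for b in boot) / (len(boot) - 1))
lower, upper = estimate - 2.0 * boot_sd, estimate + 2.0 * boot_sd
print(estimate, boot_mean, boot_sd, (lower, upper))
```

The average of the bootstrap replicates should sit close to the original estimate rather than to the true density, so the spread of the replicates speaks to the variance only.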
The expected value of the bootstrapped density functions is $\hat f$ (and not $f$); therefore bootstrapping will not reveal the bias, but it does reveal the variance of the density estimate.

47.7. Other Approaches to Density Estimation

• Variable bandwidth methods
• Nearest Neighbor methods
• Orthogonal Series Methods: project the data on an orthogonal base and only use the first few terms. Advantage: here one actually knows the functional form of the estimated density. See [BA97, pp. 19–21].

47.8. Two- and Three-Dimensional Densities

sm.script(air_dens) and sm.script(air_imag) give different representations of a two-dimensional density; sm.script(air_cont) gives the evolution over time (dotted line is first, dashed second, and solid line third).

sm.script(mag_scat) is the plot of a dataset containing 3-dimensional directions (longitude and latitude). Here is a kernel function and a smoothed representation of this dataset: sm.script(mag_dens).

Problem 447. Write a function that translates the latitude and longitude data of the magrem dataset into a 3-dimensional dataset which can be loaded into xgobi.

Here is a 3-dimensional rendering of the geyser data: provide.data(geys3d) and then xgobi(geys3d). The script which draws a 3-dimensional density contour does not work right now: sm.script(geys_td).

47.9. Other Characterizations of Distributions

Instead of the density function one can also give smoothed versions of the empirical cumulative distribution function, or of the hazard function $\frac{f(u)}{1 - F(u)}$.

47.10. Quantile-Quantile Plots

The QQ-plot is a plot of the quantile functions, as defined in (3.4.14), of two different distributions against each other. The graph of a cumulative distribution function is given in Figure 1, and the corresponding quantile function is given in Figure 2.
The bullets at the beginning of the lines in the cumulative distribution function indicate that each line includes its infimum but not its supremum. The quantile function has the bullets at the end of the lines.

The "theoretical QQ plot" of two distributions which have distribution functions $F_1$ and $F_2$ and quantile functions $F_1^{-1}$ and $F_2^{-1}$ is the set of all $(x_1, x_2) \in \mathbb{R}^2$ for which there exists a $p$ such that $x_1 = F_1^{-1}(p)$ and $x_2 = F_2^{-1}(p)$.

If both distribution functions are continuous and strictly increasing, then the theoretical QQ-plot is continuous as well. If the cumulative distribution functions have horizontal straight line segments, then the theoretical QQ-plot has gaps. If one of the two distribution functions is a step function and the other is continuous, then the theoretical QQ-plot is a step function; and if both distribution functions are step functions, then the theoretical QQ-plot consists of isolated points.

Here is a practical instruction how to construct a QQ plot from the given cumulative distribution functions: Draw the cumulative distribution functions of the two distributions which you want to compare into the same diagram. Then, for every value $p$ between 0 and 1, plot the abscissa of the intersection of the horizontal line at height $p$ with the first cumulative distribution function against the abscissa of its intersection with the second. If there are horizontal line segments in these distribution functions, then the suprema of these line segments should be used. If a cumulative distribution function is a step function stepping over $p$, then the value at which the step occurs should be used.

If the QQ-plot is a straight line through the origin, then the two distributions are either identical, or the underlying random variables differ only by a scale factor. The plots have special sensitivity regarding differences in the tail areas of the two distributions.

Problem 448.
Let $F_1$ be the cumulative distribution function of random variable $x_1$, and $F_2$ that of the variable $x_2$ whose distribution is the same as that of $\alpha x_1$, where $\alpha$ is a positive constant. Show that the theoretical QQ plot of these two distributions is contained in the straight line $q_2 = \alpha q_1$.

Answer. $(x_1, x_2) \in$ QQ-plot $\iff$ a $p$ exists with $x_1 = F_1^{-1}(p) = \inf\{u \colon \Pr[x_1 \le u] \ge p\}$ and $x_2 = F_2^{-1}(p) = \inf\{u \colon \Pr[x_2 \le u] \ge p\} = \inf\{u \colon \Pr[\alpha x_1 \le u] \ge p\}$. Write $v = u/\alpha$, i.e., $u = \alpha v$; then
\[ x_2 = \inf\{\alpha v \colon \Pr[\alpha x_1 \le \alpha v] \ge p\} = \inf\{\alpha v \colon \Pr[x_1 \le v] \ge p\} = \alpha \inf\{v \colon \Pr[x_1 \le v] \ge p\} = \alpha x_1. \]

In other words, if one makes a QQ plot of a normal with mean zero and variance 4 (standard deviation 2) on the vertical axis against a normal with mean zero and variance 1 on the horizontal axis, one gets a straight line with slope 2. This is what makes such plots so valuable, since visual inspection can easily discriminate whether a curve is a straight line or not. To repeat, QQ plots have the great advantage that one only needs to know the correct distribution up to a scale factor!

QQ-plots can be used not only to compare two probability measures; an important application is to decide whether a given sample comes from a given distribution, by plotting the quantile function of the empirical distribution of the sample, compare (3.4.17), against the quantile function of the given cumulative distribution function. Since empirical cumulative distribution functions and quantile functions are step functions, the resulting theoretical QQ plot is also a step function. In order to make it easier to compare this QQ plot with a straight line, one usually does not draw the full step function but chooses one point on the face of each step, so that the plot contains one point per observation. This is like plotting the given sample against a very regular sample from the given distribution.

Where on the face of each step should one choose these points? One wants to choose that [...]
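The one-point-per-observation construction and the scale property of Problem 448 can be tried out numerically. This is a sketch under assumptions: since the text breaks off before saying where on each step the point is chosen, the plotting positions $(i - 0.5)/n$ used here are one common convention and not necessarily the author's choice; the sample is simulated, and the function names are hypothetical.

```python
import random
from statistics import NormalDist


def normal_qq_points(sample):
    """Pair the i-th smallest observation with the standard Normal quantile at (i - 0.5) / n.

    The plotting position (i - 0.5) / n is an assumption of this sketch.
    """
    n = len(sample)
    reference = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(reference, sorted(sample)))


def least_squares_slope(points):
    """Slope of the least-squares line through the QQ points."""
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


rng = random.Random(2)
sample = [rng.gauss(0.0, 2.0) for _ in range(2000)]  # Normal with standard deviation 2
slope = least_squares_slope(normal_qq_points(sample))
print(round(slope, 2))

# Scale property: the order statistics of alpha * x are alpha times those of x.
alpha = 3.0
scaled = [alpha * x for x in sample]
print(sorted(scaled) == [alpha * x for x in sorted(sample)])
```

The fitted slope should come out near 2, the standard deviation of the sample, even though the reference distribution is the standard Normal; only the shape of the distribution has to be right.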
[...] to test against, and then draw, between the zero line and the line $p = 1$, $n$ parallel lines which divide the unit strip into $n + 1$ equidistant strips. The intersection points of these $n$ lines with the cdf will roughly give the locations where the smallest, second smallest, etc., of a sample of $n$ normally distributed observations should be found. For a mathematical justification of this, make the following thought [...]

[...] qqnorm(rchisq(25,df=3)) The classic reference which everyone has read and which explains it all is [Gum58, pp. 28–34 and 46/47]. Also [WG68] is useful; it has many examples.

47.11. Testing for Normality

[Vas76] is a test for Normality based on entropy.

CHAPTER 48

Measuring Economic Inequality

48.1. Web Resources about Income Inequality

• UNU/Wider-UNDP World Income Inequality Database (4500 Gini-coefficients www.wider.unu.edu/wiid/wiid.htm [...]
• Luxembourg Income Study (detailed household surveys) http://lissy.ceps.lu
• World Bank site on Inequality, Poverty, and Socio-economic Performance http://www.worldbank.org/poverty/inequal/index.htm
• World Bank Data on Poverty and Inequality http://www.worldbank.org/pove
• World Bank Living Standard Measurement Surveys http://www.worldbank.or
• University of Texas Inequality Project (Theil indexes on [...] on manufacturing and industrial wages for 71 countries) http://utip.gov.utexas.edu/
• MacArthur Network on the Effects of Inequality on Economic Performance, Institute of International Studies, University of California, Berkeley http://gl
• Inter-American Development Bank site on Poverty and Inequality http://www

48.2. Graphical Representations of Inequality

See [...]

[...] her income. Line everybody in the population up in order of height. Then the curve which their heights draw out is "Pen's Parade." It is the empirical quantile function of the sample. It highlights the presence of any extremely large incomes and, to a certain extent, abnormally small incomes. But the information on middle income receivers is not so obvious.

• Histogram or other estimates of the underlying [...] looking down on a field. On one side, there is a long straight fence marked off with income categories: the physical distance between any two points directly corresponds to the income differences they represent. Get the whole population to come into the field and line up in the strip of land marked off by the piece of fence corresponding to their income bracket. The shape you will see is a histogram of the income [...] more clearly what is happening in the middle income ranges. But perhaps it is not so readily apparent what is happening in the upper tail. This can be remedied by taking the distribution of the logarithm of income, i.e., arrange the income markers on the fence in such a way that equal physical distances mean equal income ratios.

• Lorenz curve: Line up everybody in ascending [...]

48.3. Quantitative Measures of Income Inequality

[...] draw into Pen's parade a horizontal line at the average income, and use as measure of inequality the area between the Parade curve and this horizontal line, divided by the total area under the Parade curve.

• Gini coefficient: the area between the Lorenz curve and the diagonal line, times 2 (so that a Gini coefficient of 100% would mean: one person owns everything, and a Gini of 0 means total equality).

• Theil's entropy measure: Say $x_i$ is person $i$'s income, [...] $\bar x$ is the average income, and $n$ the population count. Then the person's income share is $s_i = \frac{x_i}{n \bar x}$. The entropy of this income distribution, according to (3.11.2), but with natural logarithms instead of base 2, is
\[ \sum_{i=1}^{n} s_i \ln \frac{1}{s_i} \tag{48.3.1} \]
and the maximum possible entropy, obtained if income distribution is equal, is
\[ \sum_{i=1}^{n} \frac{1}{n} \ln n = \ln n. \tag{48.3.2} \]

[...] $-1$ cancels in the difference. Interpretation: if $h(s_2) - h(s_1) = h(s_4) - h(s_3)$ then for the purposes of this inequality measure, the distance between 2 and 1 is the same as the distance between 4 and 3. These inequality measures are therefore based on very specific notions of what constitutes inequality.

48.4. Properties of Inequality Measures

[...]nvariance: In order to [...]
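The Gini definition and the entropy formulas (48.3.1) and (48.3.2) from Section 48.3 can be evaluated on small made-up income vectors. This is a hedged sketch: the function names are hypothetical, the Lorenz curve is taken to be piecewise linear through the cumulative income shares, and, because the text breaks off before stating how the two entropies are combined, the difference between the maximum and the actual entropy computed in theil_gap is an assumption based on the usual definition of Theil's measure.

```python
import math


def income_shares(incomes):
    """Shares s_i = x_i / (n * mean), i.e. each income divided by the total."""
    total = sum(incomes)
    return [x / total for x in incomes]


def entropy(incomes):
    """Sum of s_i * ln(1 / s_i) as in (48.3.1); a zero share contributes nothing."""
    return sum(s * math.log(1.0 / s) for s in income_shares(incomes) if s > 0)


def theil_gap(incomes):
    """Maximum entropy ln n from (48.3.2) minus the actual entropy (assumed definition)."""
    return math.log(len(incomes)) - entropy(incomes)


def gini(incomes):
    """Twice the area between the diagonal and a piecewise-linear Lorenz curve."""
    n = len(incomes)
    total = sum(incomes)
    area_under_lorenz = 0.0
    previous = 0.0
    running = 0.0
    for x in sorted(incomes):  # ascending order, as for the Lorenz curve
        running += x
        current = running / total
        area_under_lorenz += (previous + current) / (2.0 * n)  # trapezoid of width 1/n
        previous = current
    return 1.0 - 2.0 * area_under_lorenz


equal = [1.0, 1.0, 1.0, 1.0]
one_owner = [0.0, 0.0, 0.0, 1.0]
print(gini(equal), entropy(equal), theil_gap(equal))
print(gini(one_owner), entropy(one_owner), theil_gap(one_owner))
```

With equal incomes the Gini coefficient is 0 and the entropy reaches its maximum $\ln n$. With one person owning everything, this discrete version of the Gini coefficient gives $1 - 1/n$ (here 0.75), which approaches 100% only as the population grows.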