CHAPTER 8

Density Estimation: Erupting Geysers and Star Clusters

© 2010 by Taylor and Francis Group, LLC

8.1 Introduction

Geysers are natural fountains that shoot a column of heated water and steam into the air at more or less regular intervals. Old Faithful is one such geyser and is the most popular attraction of Yellowstone National Park, although it is not the largest or grandest geyser in the park. Eruptions of Old Faithful vary in height from 100 to 180 feet, with an average near 130 to 140 feet, and normally last between 1.5 and 5 minutes.

From August 1 to August 15, 1985, Old Faithful was observed and the waiting times between successive eruptions noted. There were 300 eruptions observed, so 299 waiting times (in minutes) were recorded; they are shown in Table 8.1.

Table 8.1: faithful data (package datasets). Old Faithful geyser waiting times between two eruptions (in minutes).

waiting
79 54 74 62 85 55 88 85 51 85 54 84 78 47 83 52 62 84 52
79 51 47 78 69 74 83 55 76 78 79 73 77 66 80 74 52 48
80 59 90 80 58 84 58 73 83 64 53 82 59 75 90 54 80 54
83 71 64 77 81 59 84 48 82 60 92 78 78 65 73 82 56 79 71
62 76 60 78 76 83 75 82 70 65 73 88 76 80 48 86 60 90
50 78 63 72 84 75 51 82 62 88 49 83 81 47 84 52 86 81
75 59 89 79 59 81 50 85 59 87 53 69 77 56 88 81 45 82 55
90 45 83 56 89 46 82 51 86 53 79 81 60 82 77 76 59 80
49 96 53 77 77 65 81 71 70 81 93 53 89 45 86 58 78 66
76 63 88 52 93 49 57 77 68 81 81 73 50 85 74 55 77 83 83
51 78 84 46 83 55 81 57 76 84 77 81 87 77 51 78 60 82
91 53 78 46 77 84 49 83 71 80 49 75 64 76 53 94 55 76
50 82 54 75 78 79 78 78 70 79 70 54 86 50 90 54 54 77 79
64 75 47 86 63 85 82 57 82 67 74 54 83 73 73 88 80 71
83 56 79 78 84 58 83 43 60 75 81 46 90 46 74
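As the caption of Table 8.1 indicates, the waiting times are distributed with R in the faithful data frame of package datasets. A minimal sketch of loading them and taking a first look follows; the object name x is our own choice, and no particular output is implied.

```r
## Load the Old Faithful data shipped with R (package datasets)
data("faithful", package = "datasets")

## Extract the waiting times (in minutes) between successive eruptions
x <- faithful$waiting

## Basic numerical summaries of the waiting times
length(x)
summary(x)

## A histogram gives a first, rough impression of the distribution
hist(x, xlab = "Waiting times (in min.)", main = "")
```

The histogram is only a starting point; how it can be improved on is the subject of the remainder of the chapter.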
The Hertzsprung-Russell (H-R) diagram forms the basis of the theory of stellar evolution. The diagram is essentially a plot of the energy output of stars against their surface temperature. Data from the H-R diagram of Star Cluster CYG OB1, calibrated according to Vanisma and De Greve (1972), are shown in Table 8.2 (from Hand et al., 1994).

Table 8.2: CYGOB1 data. Energy output and surface temperature of Star Cluster CYG OB1.

logst  logli    logst  logli    logst  logli
4.37   5.23     4.23   3.94     4.45   5.22
4.56   5.74     4.42   4.18     3.49   6.29
4.26   4.93     4.23   4.18     4.23   4.34
4.56   5.74     3.49   5.89     4.62   5.62
4.30   5.19     4.29   4.38     4.53   5.10
4.46   5.46     4.29   4.22     4.45   5.22
3.84   4.65     4.42   4.42     4.53   5.18
4.57   5.27     4.49   4.85     4.43   5.57
4.26   5.57     4.38   5.02     4.38   4.62
4.37   5.12     4.42   4.66     4.45   5.06
3.49   5.73     4.29   4.66     4.50   5.34
4.43   5.45     4.38   4.90     4.45   5.34
4.48   5.42     4.22   4.39     4.55   5.54
4.01   4.05     3.48   6.05     4.45   4.98
4.29   4.26     4.38   4.42     4.42   4.50
4.42   4.58     4.56   5.10

8.2 Density Estimation

The goal of density estimation is to approximate the probability density function of a random variable (univariate or multivariate) given a sample of observations of the variable. Univariate histograms are a simple example of a density estimate; they are often used for two purposes, counting and displaying the distribution of a variable, but according to Wilkinson (1992), they are effective for neither. For bivariate data, two-dimensional histograms can be constructed, but for small and moderate sized data sets they are of little real use for estimating the bivariate density function: either most of the ‘boxes’ in the histogram will contain too few observations or, if the number of boxes is reduced, the resulting histogram will be too coarse a representation of the density function.

The density estimates provided by one- and two-dimensional histograms can be improved on in a number of ways. If, of course, we are willing to assume a particular
form for the variable’s distribution, for example, Gaussian, density estimation would be reduced to estimating the parameters of the assumed distribution. More commonly, however, we wish to allow the data to speak for themselves, and so one of a variety of non-parametric estimation procedures that are now available might be used. Density estimation is covered in detail in several books, including Silverman (1986), Scott (1992), Wand and Jones (1995) and Simonoff (1996). One of the most popular classes of procedures is the kernel density estimators, which we now briefly describe for univariate and bivariate data.

8.2.1 Kernel Density Estimators

From the definition of a probability density, if the random variable $X$ has a density $f$,

\[ f(x) = \lim_{h \to 0} \frac{1}{2h} P(x - h < X < x + h). \tag{8.1} \]

For any given $h$, a naïve estimator of $P(x - h < X < x + h)$ is the proportion of the observations $x_1, x_2, \ldots, x_n$ falling in the interval $(x - h, x + h)$, which gives the density estimate

\[ \hat{f}(x) = \frac{1}{2hn} \sum_{i=1}^{n} I(x_i \in (x - h, x + h)), \tag{8.2} \]

i.e., the number of $x_1, \ldots, x_n$ falling in the interval $(x - h, x + h)$ divided by $2hn$. If we introduce a weight function $W$ given by

\[ W(x) = \begin{cases} \frac{1}{2} & |x| < 1 \\ 0 & \text{else} \end{cases} \]

then the naïve estimator can be rewritten as

\[ \hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} W\left(\frac{x - x_i}{h}\right). \tag{8.3} \]

Unfortunately, this estimator is not a continuous function and is not particularly satisfactory for practical density estimation. It does, however, lead naturally to the kernel estimator defined by

\[ \hat{f}(x) = \frac{1}{hn} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) \tag{8.4} \]

where $K$ is known as the kernel function and $h$ as the bandwidth or smoothing parameter. The kernel function must satisfy the condition

\[ \int_{-\infty}^{\infty} K(x)\,dx = 1. \]

Usually, but not always, the kernel function will be a symmetric density function, for example, the normal. Three commonly used kernel functions are

rectangular:
\[ K(x) = \begin{cases} \frac{1}{2} & |x| < 1 \\ 0 & \text{else} \end{cases} \]

triangular:
\[ K(x) = \begin{cases} 1 - |x| & |x| < 1 \\ 0 & \text{else} \end{cases} \]

Gaussian:
\[ K(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2} \]

The three kernel functions are implemented in R as shown in lines 1–3 of Figure 8.1. For some grid x, the kernel functions are plotted using the R statements in lines 5–11 (Figure 8.1).

The kernel estimator $\hat{f}$ is a sum of ‘bumps’ placed at the observations. The kernel function determines the shape of the bumps while the window width $h$ determines their width. Figure 8.2 (redrawn from a similar plot in Silverman, 1986) shows the individual bumps $n^{-1} h^{-1} K((x - x_i)/h)$, as well as the estimate $\hat{f}$ obtained by adding them up for an artificial set of data points.

R> x <- [...]
R> n <- [...]
R> xgrid <- [...]
R> h <- [...]
R> bumps <- [...]
R> plot(xgrid, rowSums(bumps), ylab = expression(hat(f)(x)),
+      type = "l", xlab = "x", lwd = 2)
R> rug(x, lwd = 2)
R> out <- [...]

R> logL <- [...]

R> library("KernSmooth")
R> data("CYGOB1", package = "HSAUR2")
R> CYGOB1d <- [...]
R> contour(x = CYGOB1d$x1, y = CYGOB1d$x2, z = CYGOB1d$fhat,
+      xlab = "log surface temperature",
+      ylab = "log light intensity")

Figure 8.5 A contour plot of the bivariate density estimate of the CYGOB1 data, i.e., a two-dimensional graphical display for a three-dimensional problem.

R> persp(x = CYGOB1d$x1, y = CYGOB1d$x2, z = CYGOB1d$fhat,
+      xlab = "log surface temperature",
+      ylab = "log light intensity",
+      zlab = "estimated density",
+      theta = -35, axes = TRUE, box = TRUE)

Figure 8.6 The bivariate density estimate of the CYGOB1 data, here shown in a three-dimensional fashion using the persp function.

R> startparam <- [...]
R> opp <- [...]
R> opp
$par
        p       mu1       sd1       mu2       sd2
 0.360891 54.612125  5.872379 80.093414  5.867288
$value
[1] 1034.002

$counts
function gradient
      55       55

$convergence
[1] 0

Of course, optimising the appropriate likelihood ‘by hand’ is not very convenient. In fact, (at least) two packages offer high-level functionality for estimating mixture models. The first one is package mclust (Fraley et al., 2009), which implements the methodology described in Fraley and Raftery (2002). Here, a Bayesian information criterion (BIC) is applied to choose the form of the mixture model:

R> library("mclust")
R> mc <- [...]
R> mc
best model: equal variance with 2 components

and the estimated means are

R> mc$parameters$mean
54.61911 80.09384

with estimated standard deviation (found to be equal within both groups)

R> sqrt(mc$parameters$variance$sigmasq)
[1] 5.86848

The proportion is $\hat{p} = 0.36$. The second package is called flexmix, whose functionality is described by Leisch (2004). A mixture of two normals can be fitted using

R> library("flexmix")
R> fl <- [...]
R> parameters(fl, component = 1)
                    Comp.1
coef.(Intercept) 54.628701
sigma             5.895234
R> parameters(fl, component = 2)
                    Comp.2
coef.(Intercept) 80.098582
sigma             5.871749

R> opar <- [...]
R> rx <- [...]
R> d1 <- [...]
R> d2 <- [...]
R> f <- [...]
R> hist(x, probability = TRUE, xlab = "Waiting times (in min.)",
+      border = "gray", xlim = range(rx), ylim = c(0, 0.06),
+      main = "")
R> lines(rx, f, lwd = 2)
R> lines(rx, dnorm(rx, mean = mean(x), sd = sd(x)), lty = 2,
+      lwd = 2)
R> legend(50, 0.06, lty = 1:2, bty = "n",
+      legend = c("Fitted two-component mixture density",
+                 "Fitted single normal density"))
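The objects rx, d1, d2 and f used in the plotting code above are not shown in full here. One way they might be constructed is sketched below. It assumes the parameter estimates printed in the optim output, and an evaluation grid from 40 to 110 minutes in steps of 0.1; the grid and the name est are illustrative assumptions, not the original code.

```r
## Waiting times from the faithful data (package datasets)
x <- faithful$waiting

## Parameter estimates copied from the printed optim output
est <- c(p = 0.360891, mu1 = 54.612125, sd1 = 5.872379,
         mu2 = 80.093414, sd2 = 5.867288)

## Evaluation grid (assumption: wide enough to cover all waiting times)
rx <- seq(from = 40, to = 110, by = 0.1)

## Normal densities of the two mixture components on the grid
d1 <- dnorm(rx, mean = est[["mu1"]], sd = est[["sd1"]])
d2 <- dnorm(rx, mean = est[["mu2"]], sd = est[["sd2"]])

## Mixture density: weighted sum with weights p and 1 - p
f <- est[["p"]] * d1 + (1 - est[["p"]]) * d2

## Sanity check: a Riemann sum of f over the grid should be close to one
area <- sum(f) * 0.1

## Negative log-likelihood of the data at these estimates
nll <- -sum(log(est[["p"]] * dnorm(x, est[["mu1"]], est[["sd1"]]) +
                (1 - est[["p"]]) * dnorm(x, est[["mu2"]], est[["sd2"]])))
```

The value of nll can be compared with the $value component reported by optim, which gives a quick check that the estimates were copied correctly.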
Figure 8.7 Fitted normal density and two-component normal mixture for geyser eruption data.

The results are identical for all practical purposes, and we can plot the fitted mixture and a single fitted normal into a histogram of the data using the R code which produces Figure 8.7. The dnorm function can be used to evaluate the normal density with given mean and standard deviation, here as estimated for the two components of our mixture model, which are then combined into our density estimate f. Clearly the two-component mixture is a far better fit than a single normal distribution for these data.

We can get standard errors for the five parameter estimates by using a bootstrap approach (see Efron and Tibshirani, 1993). The original data are slightly perturbed by drawing n out of n observations with replacement, and those artificial replications of the original data are called bootstrap samples. Now, we can fit the mixture for each bootstrap sample and assess the variability of the estimates, for example using confidence intervals. Some suitable R code based on the Mclust function follows. First, we define a function that, for a bootstrap sample indx, fits a two-component mixture model and returns $\hat{p}$ and the estimated means (note that we need to make sure that we always get an estimate of $p$, not $1 - p$):

R> library("boot")
R> fit <- [...]
R> boot.ci(bootpara, type = "bca", index = 2)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = bootpara, type = "bca", index = 2)

Intervals :
Level       BCa
95%   (53.42, 56.07 )
Calculations and Intervals on Original Scale

for $\hat{\mu}_1$ and for $\hat{\mu}_2$ from

R> boot.ci(bootpara, type = "bca", index = 3)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = bootpara, type = "bca", index = 3)

Intervals :
Level       BCa
95%   (79.05, 81.01 )
Calculations and Intervals on Original Scale

Finally, we show a graphical representation of both the bootstrap distribution of the mean estimates and the corresponding confidence intervals. For convenience, we define a function for plotting, namely

R> bootplot