Computational Statistics Handbook with MATLAB, Part 5


Chapter 6: Monte Carlo Methods for Inferential Statistics

Similar to this, the bootstrap standard confidence interval is given by

    $(\hat{\theta} - z^{(1-\alpha/2)} \cdot \widehat{SE}_{\hat{\theta}},\ \hat{\theta} - z^{(\alpha/2)} \cdot \widehat{SE}_{\hat{\theta}})$,    (6.21)

where $\widehat{SE}_{\hat{\theta}}$ is the standard error for the statistic $\hat{\theta}$ obtained using the bootstrap [Mooney and Duval, 1993]. The confidence interval in Equation 6.21 can be used when the distribution for $\hat{\theta}$ is normally distributed or the normality assumption is plausible. This is easily coded in MATLAB using previous results and is left as an exercise for the reader.

The second type of confidence interval using the bootstrap is called the bootstrap-t. We first generate B bootstrap samples, and for each bootstrap sample the following quantity is computed:

    $z^{*b} = \dfrac{\hat{\theta}^{*b} - \hat{\theta}}{\widehat{SE}^{*b}}$.    (6.22)

As before, $\hat{\theta}^{*b}$ is the bootstrap replicate of $\hat{\theta}$, but $\widehat{SE}^{*b}$ is the estimated standard error of $\hat{\theta}^{*b}$ for that bootstrap sample. If a formula exists for the standard error of $\hat{\theta}$, then we can use that to determine the denominator of Equation 6.22. For instance, if $\hat{\theta}$ is the mean, then we can calculate the standard error as explained in Chapter 3. However, in most situations where we have to resort to using the bootstrap, these formulas are not available. One option is to use the bootstrap method of finding the standard error, keeping in mind that you are estimating the standard error of $\hat{\theta}^{*b}$ using the bootstrap sample $x^{*b}$. In other words, one resamples with replacement from the bootstrap sample $x^{*b}$ to get an estimate of $\widehat{SE}^{*b}$.

Once we have the B bootstrapped values $z^{*b}$ from Equation 6.22, the next step is to estimate the quantiles needed for the endpoints of the interval. The $\alpha/2$-th quantile of the $z^{*b}$, denoted by $\hat{t}^{(\alpha/2)}$, is estimated by

    $\alpha/2 = \dfrac{\#\{z^{*b} \le \hat{t}^{(\alpha/2)}\}}{B}$.    (6.23)

This says that the estimated quantile is the $\hat{t}^{(\alpha/2)}$ such that $100 \cdot \alpha/2$ percent of the points $z^{*b}$ are less than this number. For example, if $B = 100$ and $\alpha/2 = 0.05$, then $\hat{t}^{(0.05)}$ could be estimated as the fifth smallest value of the $z^{*b}$, since $B \cdot \alpha/2 = 100 \cdot 0.05 = 5$. One could also use the quantile estimates discussed previously in Chapter 3 or some other suitable estimate.

We are now ready to calculate the bootstrap-t confidence interval. This is given by

    $(\hat{\theta} - \hat{t}^{(1-\alpha/2)} \cdot \widehat{SE}_{\hat{\theta}},\ \hat{\theta} - \hat{t}^{(\alpha/2)} \cdot \widehat{SE}_{\hat{\theta}})$,    (6.24)

where $\widehat{SE}_{\hat{\theta}}$ is an estimate of the standard error of $\hat{\theta}$. The bootstrap-t interval is suitable for location statistics such as the mean or quantiles. However, its accuracy for more general situations is questionable [Efron and Tibshirani, 1993]. The next method, based on the bootstrap percentiles, is more reliable.

PROCEDURE - BOOTSTRAP-T CONFIDENCE INTERVAL

1. Given a random sample, $x = (x_1, \ldots, x_n)$, calculate $\hat{\theta}$.
2. Sample with replacement from the original sample to get $x^{*b} = (x_1^{*b}, \ldots, x_n^{*b})$.
3. Calculate the same statistic using the sample in step 2 to get $\hat{\theta}^{*b}$.
4. Use the bootstrap sample $x^{*b}$ to get the standard error of $\hat{\theta}^{*b}$. This can be calculated using a formula or estimated by the bootstrap.
5. Calculate $z^{*b}$ using the information found in steps 3 and 4.
6. Repeat steps 2 through 5, B times, where $B \ge 1000$.
7. Order the $z^{*b}$ from smallest to largest. Find the quantiles $\hat{t}^{(1-\alpha/2)}$ and $\hat{t}^{(\alpha/2)}$.
8. Estimate the standard error $\widehat{SE}_{\hat{\theta}}$ of $\hat{\theta}$ using the B bootstrap replicates $\hat{\theta}^{*b}$ (from step 3).
9. Use Equation 6.24 to get the confidence interval.

The number of bootstrap replicates that are needed is quite large for confidence intervals. It is recommended that B should be 1000 or more.
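The standard interval of Equation 6.21 was left as an exercise above; as a starting point, here is a minimal sketch for the mean of the forearm data. It assumes the Statistics Toolbox functions bootstrp and norminv are available, and it is an illustration rather than the book's own implementation.

% Minimal sketch of the bootstrap standard interval (Equation
% 6.21), using the mean as the statistic; an illustration only.
load forearm
alpha = 0.1;
bootreps = bootstrp(1000,'mean',forearm);
thetahat = mean(forearm);
SE = std(bootreps);    % bootstrap estimate of the standard error
blo = thetahat - norminv(1-alpha/2)*SE;
bhi = thetahat - norminv(alpha/2)*SE;

Since norminv(alpha/2) is negative, the second endpoint lies above thetahat, matching the form of Equation 6.21.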
If no formula exists for calculating the standard error of $\hat{\theta}^{*b}$ in step 4 of the procedure, then the bootstrap method can be used. This means that there are two levels of bootstrapping: one for finding the $\widehat{SE}^{*b}$ and one for finding the $z^{*b}$, which can greatly increase the computational burden. For example, say that $B = 1000$ and we use 50 bootstrap replicates to find each $\widehat{SE}^{*b}$; then this results in a total of 50,000 resamples.

Example 6.11
Say we are interested in estimating the variance of the forearm data, and we decide to use the following statistic,

    $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$,

which is the sample second central moment. We write our own simple function called mom (included in the Computational Statistics Toolbox) to estimate this.

% This function will calculate the sample 2nd
% central moment for a given sample vector x.
function mr = mom(x)
n = length(x);
mu = mean(x);
mr = (1/n)*sum((x-mu).^2);

We use this function as an input argument to bootstrp to get the bootstrap-t confidence interval. The MATLAB code given below also shows how to get the bootstrap estimate of standard error for each bootstrap sample. First we load the data and get the observed value of the statistic.

load forearm
n = length(forearm);
alpha = 0.1;
B = 1000;
thetahat = mom(forearm);

Now we get the bootstrap replicates using the function bootstrp. One of the optional output arguments from this function is a matrix of indices for the resamples. As shown below, each column of the output bootsam contains the indices to a bootstrap sample. We loop through all of the bootstrap samples to estimate the standard error of the bootstrap replicate using that resample.

% Get the bootstrap replicates and samples.
[bootreps, bootsam] = bootstrp(B,'mom',forearm);
% Set up some storage space for the SE's.
sehats = zeros(size(bootreps));
% Each column of bootsam contains indices
% to a bootstrap sample.
for i = 1:B
   % Extract the sample from the data.
   xstar = forearm(bootsam(:,i));
   bvals(i) = mom(xstar);   % replicate (same value as bootreps(i))
   % Do bootstrap using that sample to estimate SE.
   sehats(i) = std(bootstrp(25,'mom',xstar));
end
zvals = (bootreps - thetahat)./sehats;

Then we get the estimate of the standard error that we need for the endpoints of the interval.

% Estimate the SE using the bootstrap.
SE = std(bootreps);

Now we get the quantiles that we need for the interval given in Equation 6.24 and calculate the interval.

% Get the quantiles.
k = B*alpha/2;
szval = sort(zvals);
tlo = szval(k);
thi = szval(B-k);
% Get the endpoints of the interval.
blo = thetahat - thi*SE;
bhi = thetahat - tlo*SE;

The bootstrap-t interval for the variance of the forearm data is (1.00, 1.57).

An improved bootstrap confidence interval is based on the quantiles of the distribution of the bootstrap replicates. This technique has the benefit of being more stable than the bootstrap-t, and it also enjoys better theoretical coverage properties [Efron and Tibshirani, 1993]. The bootstrap percentile confidence interval is

    $(\hat{\theta}_B^{*(\alpha/2)},\ \hat{\theta}_B^{*(1-\alpha/2)})$,    (6.25)

where $\hat{\theta}_B^{*(\alpha/2)}$ is the $\alpha/2$ quantile in the bootstrap distribution of the $\hat{\theta}^{*b}$. For example, if $\alpha/2 = 0.025$ and $B = 1000$, then $\hat{\theta}_B^{*(0.025)}$ is the $\hat{\theta}^{*b}$ in the 25th position of the ordered bootstrap replicates. Similarly, $\hat{\theta}_B^{*(0.975)}$ is the replicate in position 975.
As discussed previously, some other suitable estimate for the quantile can be used. The procedure is the same as the general bootstrap method, making it easy to understand and to implement. We outline the steps below.

PROCEDURE - BOOTSTRAP PERCENTILE INTERVAL

1. Given a random sample, $x = (x_1, \ldots, x_n)$, calculate $\hat{\theta}$.
2. Sample with replacement from the original sample to get $x^{*b} = (x_1^{*b}, \ldots, x_n^{*b})$.
3. Calculate the same statistic using the sample in step 2 to get the bootstrap replicates, $\hat{\theta}^{*b}$.
4. Repeat steps 2 through 3, B times, where $B \ge 1000$.
5. Order the $\hat{\theta}^{*b}$ from smallest to largest.
6. Calculate $B \cdot \alpha/2$ and $B \cdot (1 - \alpha/2)$.
7. The lower endpoint of the interval is given by the bootstrap replicate that is in the $B \cdot \alpha/2$-th position of the ordered $\hat{\theta}^{*b}$, and the upper endpoint is given by the bootstrap replicate that is in the $B \cdot (1 - \alpha/2)$-th position of the same ordered list. Alternatively, using quantile notation, the lower endpoint is the estimated quantile $\hat{q}_{\alpha/2}$ and the upper endpoint is the estimated quantile $\hat{q}_{1-\alpha/2}$, where the estimates are taken from the bootstrap replicates.

Example 6.12
Let's find the bootstrap percentile interval for the same forearm data. The confidence interval is easily found from the bootstrap replicates, as shown below.

% Use Statistics Toolbox function
% to get the bootstrap replicates.
bvals = bootstrp(B,'mom',forearm);
% Find the upper and lower endpoints.
k = B*alpha/2;
sbval = sort(bvals);
blo = sbval(k);
bhi = sbval(B-k);

This interval is given by (1.03, 1.45), which is slightly narrower than the bootstrap-t interval from Example 6.11.
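The endpoints above are simple order statistics. As noted earlier, other quantile estimates can be substituted; for instance, a minimal sketch using the Statistics Toolbox function prctile, whose interpolated quantiles may differ slightly from the order statistics (this continues the variables of Example 6.12 and is not one of the book's own examples):

% Alternative endpoints using interpolated percentiles; the
% results may differ slightly from the sorted-value approach.
blo2 = prctile(bvals,100*alpha/2);
bhi2 = prctile(bvals,100*(1-alpha/2));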
So far, we discussed three types of bootstrap confidence intervals. The standard interval is the easiest and assumes that $\hat{\theta}$ is normally distributed. The bootstrap-t interval estimates the standardized version of $\hat{\theta}$ from the data, avoiding the normality assumptions used in the standard interval. The percentile interval is simple to calculate and obtains the endpoints directly from the bootstrap estimate of the distribution for $\hat{\theta}$. It has another advantage in that it is range-preserving. This means that if the parameter can take on values in a certain range, then the confidence interval will reflect that. This is not always the case with the other intervals.

According to Efron and Tibshirani [1993], the bootstrap-t interval has good coverage probabilities, but does not perform well in practice. The bootstrap percentile interval is more dependable in most situations, but does not enjoy the good coverage property of the bootstrap-t interval. There is another bootstrap confidence interval, called the $BC_a$ interval, that has both good coverage and is dependable. This interval is described in the next chapter.

The bootstrap estimates of bias and standard error are also random variables, and they have their own error associated with them. So, how accurate are they? In the next chapter, we discuss how one can use the jackknife method to evaluate the error in the bootstrap estimates.

As with any method, the bootstrap is not appropriate in every situation. When analytical methods are available to understand the uncertainty associated with an estimate, then those are more efficient than the bootstrap. In what situations should the analyst use caution in applying the bootstrap? One important assumption that underlies the theory of the bootstrap is the notion that the empirical distribution function is representative of the true population distribution. If this is not the case, then the bootstrap will not yield reliable results. For example, this can happen when the sample size is small or the sample was not gathered using appropriate random sampling techniques. Chernick [1999] describes other examples from the literature where the bootstrap should not be used. We also address a situation in Chapter 7 where the bootstrap fails. This can happen when the statistic is non-smooth, such as the median.

6.5 MATLAB Code

We include several functions with the Computational Statistics Toolbox that implement some of the bootstrap techniques discussed in this chapter. These are listed in Table 6.2. Like bootstrp, these functions have an input argument that specifies a MATLAB function that calculates the statistic.

TABLE 6.2
List of MATLAB Functions for Chapter 6

Purpose                                            MATLAB Function
General bootstrap: resampling, estimates of        csboot
  standard error and bias                          bootstrp
Constructing bootstrap confidence intervals        csbootint
                                                   csbooperint
                                                   csbootbca

As we saw in the examples, the MATLAB Statistics Toolbox has a function called bootstrp that will return the bootstrap replicates from the input argument bootfun (e.g., mean, std, var, etc.). It takes an input data set, finds the bootstrap resamples, applies the bootfun to the resamples, and stores the replicates in the rows of the first output argument. The user can get two outputs from the function: the bootstrap replicates and the indices that correspond to the points selected in the resample.
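A quick sketch of those two outputs (an illustration, not one of the book's examples): the replicate in row i of the first output can be reproduced by applying the statistic to the points indexed by column i of the second output.

% Each column of bootsam indexes one resample, so this value
% should match the corresponding entry of bootstat.
[bootstat,bootsam] = bootstrp(200,'mean',forearm);
check = mean(forearm(bootsam(:,1)));   % should equal bootstat(1)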
There is a Bootstrap MATLAB Toolbox written by Zoubir and Iskander at the Curtin University of Technology. It is available for download at www.atri.curtin.edu.au/csp. It requires the MATLAB Statistics Toolbox and has a postscript version of the reference manual.

Other software exists for Monte Carlo simulation as applied to statistics. The Efron and Tibshirani [1993] book has a description of S code for implementing the bootstrap. This code, written by the authors, can be downloaded from the statistics archive at Carnegie-Mellon University that was mentioned in Chapter 1. Another software package that has some of these capabilities is called Resampling Stats® [Simon, 1999], and information on this can be found at www.resample.com. Routines are available from Resampling Stats for MATLAB [Kaplan, 1999] and Excel.

6.6 Further Reading

Mooney [1997] describes Monte Carlo simulation for inferential statistics in a way that is accessible to most data analysts. It has some excellent examples of using Monte Carlo simulation for hypothesis testing using multiple experiments, assessing the behavior of an estimator, and exploring the distribution of a statistic using graphical techniques. The text by Gentle [1998] has a chapter on performing Monte Carlo studies in statistics. He discusses how simulation can be considered as a scientific experiment and should be held to the same high standards. Hoaglin and Andrews [1975] provide guidelines and standards for reporting the results from computations. Efron and Tibshirani [1991] explain several computational techniques, written at a level accessible to most readers. Other articles describing Monte Carlo inferential methods can be found in Joeckel [1991], Hope [1968], Besag and Diggle [1977], Diggle and Gratton [1984], Efron [1979], Efron and Gong [1983], and Teichroew [1965].

There has been a lot of work in the literature on bootstrap methods. Perhaps the most comprehensive and easy to understand treatment of the topic can be found in Efron and Tibshirani [1993]. Efron's [1982] earlier monograph on resampling techniques describes the jackknife, the bootstrap, and cross-validation. A more recent book by Chernick [1999] gives an updated description of results in this area, and it also has an extensive bibliography (over 1,600 references!) on the bootstrap. Hall [1992] describes the connection between Edgeworth expansions and the bootstrap. A volume of papers on the bootstrap was edited by LePage and Billard [1992], where many applications of the bootstrap are explored. Politis, Romano, and Wolf [1999] present subsampling as an alternative to the bootstrap. A subset of articles that present the theoretical justification for the bootstrap are Efron [1981, 1985, 1987]. The paper by Boos and Zhang [2000] looks at a way to ease the computational burden of Monte Carlo estimation of the power of tests that uses resampling methods. For a nice discussion on the coverage of the bootstrap percentile confidence interval, see Polansky [1999].

Exercises

6.1. Repeat Example 6.1 where the population standard deviation for the travel times to work is $\sigma_X = 5$ minutes. Is $\bar{x} = 47.2$ minutes still consistent with the null hypothesis?

6.2. Using the information in Example 6.3, plot the probability of Type II error as a function of $\mu$. How does this compare with Figure 6.2?

6.3. Would you reject the null hypothesis in Example 6.4 if $\alpha = 0.10$?

6.4. Using the same value for the sample mean, repeat Example 6.3 for different sample sizes of $n = 50, 100, 200$. What happens to the curve showing the power as a function of the true mean as the sample size changes?

6.5. Repeat Example 6.6 using a two-tail test. In other words, test for the alternative hypothesis that the mean is not equal to 454.

6.6. Repeat Example 6.8 for larger M. Does the estimated Type I error get closer to the true value?

6.7. Write MATLAB code that implements the parametric bootstrap. Test it using the forearm data. Assume that the normal distribution is a reasonable model for the data. Use your code to get a bootstrap estimate of the standard error and the bias of the coefficient of skewness and the coefficient of kurtosis. Get a bootstrap percentile interval for the sample central second moment using your parametric bootstrap approach.

6.8. Write MATLAB code that will get the bootstrap standard confidence interval. Use it with the forearm data to get a confidence interval for the sample central second moment. Compare this interval with the ones obtained in the examples and in the previous problem.

6.9. Use your program from problem 6.8 and the forearm data to get a bootstrap confidence interval for the mean. Compare this to the theoretical one.

6.10. The remiss data set contains the remission times for 42 leukemia patients. Some of the patients were treated with the drug called 6-mercaptopurine (mp), and the rest were part of the control group (control).
Use the techniques from Chapter 5 to help determine a suitable model (e.g., Weibull, exponential, etc.) for each group. Devise a Monte Carlo hypothesis test to test for the equality of means between the two groups [Hand, et al., 1994; Gehan, 1965]. Use the p-value approach.

6.11. Load the lawpop data set [Efron and Tibshirani, 1993]. These data contain the average scores on the LSAT (lsat) and the corresponding average undergraduate grade point average (gpa) for the 1973 freshman class at 82 law schools. Note that these data constitute the entire population. The data contained in law comprise a random sample of 15 of these classes. Obtain the true population variances for the lsat and the gpa. Use the sample in law to estimate the population variance using the sample central second moment. Get bootstrap estimates of the standard error and the bias in your estimate of the variance. Make some comparisons between the known population variance and the estimated variance.

6.12. Using the lawpop data, devise a test statistic to test for the significance of the correlation between the LSAT scores and the corresponding grade point averages. Get a random sample from the population, and use that sample to test your hypothesis. Do a Monte Carlo simulation of the Type I and Type II error of the test you devise.

6.13. In 1961, 16 states owned the retail liquor stores. In 26 others, the stores were owned by private citizens. The data contained in whisky reflect the price (in dollars) of a fifth of whisky from these 42 states. Note that this represents the population, not a sample. Use the whisky data to get an appropriate bootstrap confidence interval for the median price of whisky at the state owned stores and the median price of whisky at the privately owned stores. First get the random sample from each of the populations, and then use the bootstrap with that sample to get the confidence intervals. Do a Monte Carlo study where you compare the confidence intervals for different sample sizes. Compare the intervals with the known population medians [Hand, et al., 1994].

6.14. The quakes data [Hand, et al., 1994] give the time in days between successive earthquakes. Use the bootstrap to get an appropriate confidence interval for the average time between earthquakes.

Chapter 7: Data Partitioning

7.1 Introduction

In this book, data partitioning refers to procedures where some observations from the sample are removed as part of the analysis. These techniques are used for the following purposes:

• To evaluate the accuracy of the model or classification scheme;
• To decide what is a reasonable model for the data;
• To find a smoothing parameter in density estimation;
• To estimate the bias and error in parameter estimation;
• And many others.

We start off with an example to motivate the reader. We have a sample where we measured the average atmospheric temperature and the corresponding amount of steam used per month [Draper and Smith, 1981]. Our goal in the analysis is to model the relationship between these variables. Once we have a model, we can use it to predict how much steam is needed for a given average monthly temperature. The model can also be used to gain understanding about the structure of the relationship between the two variables. The problem then is deciding what model to use.
To start off, one should always look at a scatterplot (or scatterplot matrix) of the data as discussed in Chapter 5. The scatterplot for these data is shown in Figure 7.1 and is examined in Example 7.3. We see from the plot that as the temperature increases, the amount of steam used per month decreases. It appears that using a line (i.e., a first degree polynomial) to model the relationship between the variables is not unreasonable. However, other models might provide a better fit. For example, a cubic or some higher degree polynomial might be a better model for the relationship between average temperature and steam usage. So, how can we decide which model is better? To make that decision, we need to assess the accuracy of the various models. We could then choose the model with the best accuracy.

[...] the estimated parameters. We do not go into the derivation of the estimators, since it can be found in most introductory statistics textbooks.

FIGURE 7.1 (scatterplot; average temperature in °F on the horizontal axis, steam per month in pounds on the vertical axis): Scatterplot of a data set where we are interested [...]

FIGURE 7.2 (the same scatterplot with a fitted line): This figure shows a scatterplot of the steam data along with the line obtained using polyfit. The estimate of the slope is $\hat{\beta}_1 = -0.08$, and the estimate of the y-intercept is $\hat{\beta}_0 = 13.62$.

The prediction error is defined [...]
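The Figure 7.2 caption mentions polyfit; a minimal sketch of that straight-line fit follows, where the variable names temp and steam are assumptions, since the excerpt does not show how the data are loaded.

% Hypothetical sketch of the first-degree fit from Figure 7.2;
% temp and steam are assumed variable names for the two columns.
p = polyfit(temp,steam,1);     % p(1) is the slope, p(2) the intercept
steamhat = polyval(p,temp);    % fitted values
plot(temp,steam,'o',temp,steamhat,'-')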
[...] = n, so the size of the testing set is one. Since this requires fitting the model n times, this can be computationally expensive if n is large. We note, however, that there are efficient ways of doing this [Gentle, 1998; Hjorth, 1994]. We outline the steps for cross-validation below and demonstrate this approach in Example 7.3. [...]

[...] looking at the sample data. The jackknife method is similar to cross-validation in that we leave out one observation $x_i$ from our sample to form a jackknife sample as follows: $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$. This says that the i-th jackknife sample is the original sample with the i-th data point removed. We calculate the value [...]

FIGURE 7.3 (four scatterplots, one per data set): This shows the scatterplots of the four data sets discussed in Example 7.5. These data were created to show the importance of looking at scatterplots [...]

Example 7.5
We provide a MATLAB function called csjack that implements the jackknife procedure. This will work with any MATLAB function that takes the random sample as the argument and returns a statistic. This function can be one that comes with MATLAB, such as mean or var, or it can be one written by the user. We illustrate its use with [...]

[...] be used instead. These are applications where the statistic is not smooth. An example of this type of statistic is the median. Here, smoothness refers to statistics where small changes in the data set produce small changes in the value of the statistic. We illustrate this situation in the next example.

Example 7.7
Researchers collected [...]

x = [58, 67, 74, 74, 80, 89, 95, 97, 98, 107];

The median of this data set is $\hat{q}_{0.5} = 84.5$. To see how the median changes with small changes of x, we increment the fourth observation, $x_4 = 74$, by one. The change in the median is zero, because it is still at $\hat{q}_{0.5} = 84.5$. In fact, the median does not change until we increment the fourth observation by 7, at which time the median becomes $\hat{q}_{0.5} = 85$. Let's [...] 0.0065. This data set will be explored further in the exercises.

[...] estimate. This can be computationally intensive, because we would need a new set of bootstrap samples when we leave out each data point $x_i$. There is a shortcut method for obtaining $\widehat{\mathrm{var}}_{jack}(\hat{\gamma}_B)$ where we use the original B bootstrap samples. There will be some bootstrap samples where the i-th data point does not appear [...]

[...] replicates B is large. Otherwise, it overestimates the variance of $\hat{\gamma}_B$.

7.6 MATLAB Code

To our knowledge, MATLAB does not have M-files for either cross-validation or the jackknife. As described earlier, we provide a function (csjack) that will implement the jackknife procedure for estimating the bias and standard error in an estimate [...]
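Since the text above describes csjack only in outline, here is a minimal sketch of the standard jackknife estimates of bias and standard error that it is described as computing; these are the textbook jackknife formulas, not the book's own csjack code, and the mean serves as the example statistic (for instance, with the x from Example 7.7 above).

% Standard jackknife formulas for bias and standard error,
% illustrated with the mean; not the book's csjack function.
n = length(x);
thetaj = zeros(1,n);
for i = 1:n
   xjack = x([1:i-1, i+1:n]);   % leave out the i-th observation
   thetaj(i) = mean(xjack);     % i-th jackknife replicate
end
thetabar = mean(thetaj);
biasjack = (n-1)*(thetabar - mean(x));
sejack = sqrt(((n-1)/n)*sum((thetaj - thetabar).^2));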
