3.4 The Central Limit Theorem

The central limit theorem provides the theoretical foundation for the practice of repeating measurements in order to improve the accuracy of the final result. Given an arbitrary distribution h(x), characterized by its mean µ and standard deviation σ, the central limit theorem says that the mean of N values x_i drawn from that distribution will approximately follow a Gaussian distribution N(µ, σ/√N), with the approximation accuracy improving with N. This is a remarkable result since the details of the distribution h(x) are not specified: we can "average" our measurements (i.e., compute their mean value using eq. 3.31) and expect the 1/√N improvement in accuracy regardless of the details of our measuring apparatus!

The underlying reason why the central limit theorem can make such a far-reaching statement is the strong assumption about h(x): it must have a standard deviation, and thus its tails must fall off faster than 1/x³ for large x (otherwise the integral defining the variance would diverge). As more measurements are combined, the tails are progressively "clipped" and eventually (for large N) the mean follows a Gaussian distribution. It is easy to prove this theorem using standard statistical tools such as characteristic functions (e.g., see Lup93); alternatively, it can be shown that the Gaussian distribution arises as the result of many consecutive convolutions (e.g., see Greg05). An illustration of the central limit theorem in action, using a uniform distribution for h(x), is shown in figure 3.20.

Figure 3.20. An illustration of the central limit theorem. The histogram in each panel shows the distribution of the mean value of N random variables drawn from the (0, 1) range (a uniform distribution with µ = 0.5 and W = 1; see eq. 3.39), for N = 2, 3, and 10. The distribution for N = 2 has a triangular shape, and as N increases it becomes increasingly similar to a Gaussian, in agreement with the central limit theorem. The predicted normal distribution with µ = 0.5 and σ = 1/√(12N) is shown by the line. Already for N = 10, the "observed" distribution is essentially the same as the predicted one.

However, there are cases when the central limit theorem cannot be invoked. We already discussed the Cauchy distribution, which does not have a well-defined mean or standard deviation, and thus the central limit theorem is not applicable (recall figure 3.12). In other words, if we repeatedly draw N values x_i from a Cauchy distribution and compute their mean value, the resulting distribution of these mean values will not follow a Gaussian distribution (it will itself follow a Cauchy distribution, and will have an infinite variance). If we decide to use the mean of the measured values to estimate the location parameter µ, we will not gain the √N improvement in accuracy promised by the central limit theorem. Instead, we need to compute the median and the interquartile range of the x_i, which are unbiased estimators of the location and scale parameters of the Cauchy distribution. Of course, the reason why the central limit theorem is not applicable to the Cauchy distribution is its extended tails, which decrease only as x⁻². We mention in passing the weak law of large numbers (also known as Bernoulli's theorem): the sample mean converges to the distribution mean as the sample size increases. Again, for distributions with ill-defined variance, such as the Cauchy distribution, the weak law of large numbers breaks down.

In another extreme case of tail behavior, we have the uniform distribution, which does not even have tails (cf. §3.3.1). If we repeatedly draw N values x_i from a uniform distribution described by its mean µ and width W, the distribution of their mean value x̄ will be centered on µ, as expected from the central limit theorem. In addition, the uncertainty of our estimate of the location parameter µ will decrease proportionally to 1/√N, again in agreement with the central limit theorem.
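The behavior shown in figure 3.20 is easy to verify numerically. The following short NumPy sketch (an illustrative example written for this discussion, not the book's figure code) draws many samples of size N from the uniform distribution on (0, 1) and compares the spread of their means with the σ/√N prediction of the central limit theorem:

```python
import numpy as np

# Illustrative check of the central limit theorem for a uniform distribution
# on (0, 1): mu = 0.5, W = 1, and sigma = W / sqrt(12) (cf. eq. 3.40).
rng = np.random.default_rng(42)
mu, W = 0.5, 1.0
sigma = W / np.sqrt(12)

for N in (2, 3, 10):
    # Draw 100,000 samples of size N and compute the mean of each sample.
    sample_means = rng.uniform(0.0, 1.0, size=(100_000, N)).mean(axis=1)
    # The CLT predicts these means approximately follow N(mu, sigma / sqrt(N)).
    print(f"N = {N:2d}: mean = {sample_means.mean():.4f} (CLT: {mu:.4f}), "
          f"std = {sample_means.std():.4f} (CLT: {sigma / np.sqrt(N):.4f})")
```

Note that the standard deviation of the sample means matches σ/√N even for N = 2; it is only the Gaussian shape of their distribution that requires larger N, as figure 3.20 illustrates.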
However, using the mean to estimate µ is not the best option here: µ can in fact be estimated with an accuracy that improves as 1/N, that is, faster than 1/√N. How is this seemingly surprising result possible?

Given the uniform distribution described by eq. 3.39, a value x_i that happens to be larger than µ rules out all values µ < x_i − W/2. This strong conclusion is of course a consequence of the sharp edges of the uniform distribution. The strongest constraint on µ comes from the extremal values of x_i: we know that µ > max(x_i) − W/2 and, analogously, that µ < min(x_i) + W/2 (of course, it must be true that max(x_i) ≤ µ + W/2 and min(x_i) ≥ µ − W/2). Therefore, given N values x_i, the allowed range for µ is max(x_i) − W/2 < µ < min(x_i) + W/2, with a uniform probability distribution for µ within that range. The best estimate of µ is then the middle of this range,

    µ̃ = [min(x_i) + max(x_i)] / 2 ,    (3.68)

and the standard deviation of this estimate (note that its scatter around the true value µ is not Gaussian) is the width of the allowed interval, R, divided by √12 (cf. eq. 3.40). In addition, the best estimate of W is given by

    W̃ = [max(x_i) − min(x_i)] N / (N − 2) .    (3.69)

What is the width of the allowed interval, R = W − [max(x_i) − min(x_i)]? By considering the distribution of the extreme values of x_i, it can be shown that the expectation values are E[min(x_i)] = µ − W/2 + W/N and E[max(x_i)] = µ + W/2 − W/N. These results are easily understood: if N values x_i are uniformly scattered within a box of width W, the two extreme points will on average lie ∼ W/N away from the box edges. Therefore, the width of the allowed range for µ is R = 2W/N, and µ̃ is an unbiased estimator of µ with a standard deviation of

    σ_µ̃ = 2W / (√12 N) .    (3.70)

While the mean value of x_i is also an unbiased estimator of µ, µ̃ is a much more efficient estimator: the ratio of the two uncertainties is 2/√N, so µ̃ wins for N > 4. The different behavior of these two estimators is illustrated in figure 3.21.

Figure 3.21. A comparison of the sample-size dependence of two estimators of the location parameter of a uniform distribution, with the sample size ranging from N = 100 to N = 10,000. The estimator in the top panel is the sample mean, µ̄ = mean(x), with σ = (1/√12) W/√N; the estimator in the bottom panel is the mean of the two extreme values, µ̄ = [max(x) + min(x)]/2, with σ = (1/√12) 2W/N. The theoretical 1σ, 2σ, and 3σ contours are shown for comparison. When using the sample mean to estimate the location parameter, the uncertainty decreases proportionally to 1/√N; when using the mean of the two extreme values, it decreases as 1/N. Note the different vertical scales for the two panels.

In summary, while the central limit theorem is of course valid for the uniform distribution, the mean of x_i is not the most efficient estimator of the location parameter.
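The different scalings of the two estimators can also be checked by simulation. The sketch below (again illustrative; the sample sizes and number of trials are arbitrary choices) measures the scatter of the sample mean and of the estimator µ̃ of eq. 3.68: multiplying the former by √N and the latter by N should give roughly constant values, reproducing the trends of figure 3.21:

```python
import numpy as np

# Compare two estimators of the location parameter mu of a uniform
# distribution of width W: the sample mean (error ~ 1/sqrt(N)) and the
# midrange estimator of eq. 3.68, mu_tilde = [min(x) + max(x)] / 2
# (error ~ 1/N).
rng = np.random.default_rng(0)
mu, W = 0.0, 1.0
n_trials = 1_000  # number of independent samples of each size N

for N in (100, 1_000, 10_000):
    x = rng.uniform(mu - W / 2, mu + W / 2, size=(n_trials, N))
    scatter_mean = x.mean(axis=1).std()
    scatter_midrange = (0.5 * (x.min(axis=1) + x.max(axis=1))).std()
    # sqrt(N) * scatter_mean should stay near W / sqrt(12) ~ 0.289, while
    # N * scatter_midrange stays roughly constant, confirming the 1/N scaling.
    print(f"N = {N:6d}: sqrt(N)*std(mean) = {np.sqrt(N) * scatter_mean:.3f}, "
          f"N*std(midrange) = {N * scatter_midrange:.3f}")
```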
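Finally, the earlier claim about the Cauchy distribution (the sample mean does not converge to the location parameter, while the sample median does) can be verified in the same spirit. The sketch below is illustrative; the location parameter and sample sizes are arbitrary choices:

```python
import numpy as np

# For Cauchy draws the sample mean does not settle down as N grows (its
# distribution remains Cauchy, with no 1/sqrt(N) improvement), whereas the
# sample median is a well-behaved estimator of the location parameter mu.
rng = np.random.default_rng(1)
mu = 0.0

for N in (100, 10_000, 1_000_000):
    x = mu + rng.standard_cauchy(size=N)
    print(f"N = {N:9d}: mean = {x.mean():12.3f}, median = {np.median(x):8.4f}")
```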