TRUONG DAI HOC KINH TE QUOC DAN
STATISTICS THE APPLICATION OF SAMPLING DISTRIBUTION AND ESTIMATION
Dương Trung Nguyên 11219367
Nguyễn Trần Quang Vinh — 11216272
Ha Noi, April 2023
100, 250, 500, 1000, 2000, and 4000): results of 100,000 simulations
Figure 8: Simulated net winnings in n = 100 consecutive 4-spot bets
Figure 9: The percentage of people who voted for Donald Trump in the 2020 presidential election at the state level
Table 1: Comparison of 95% confidence limits for payback % from n slot pulls calculated from the CLT-based formula (CLT column) and the same obtained from 100,000 simulations (SIM column)
Table 2: The 95% confidence limits for the true payback percentage computed using the CLT-based formula are quite close to the corresponding values obtained from simulation
Table 3: Sample payout table, gross winnings
Table 4: Net winnings summaries versus number of spots chosen
Table 5: Estimated probabilities of at least breaking even in 100 consecutive s-spot bets
A sampling distribution is a probability distribution of a statistic obtained from a large number of samples drawn from a specific population. The sampling distribution for a given population is the distribution of frequencies of the different values that a statistic of that population could take.
In statistics, a population is the entire pool from which a statistical sample is drawn. A population may refer to an entire group of people, objects, events, hospital visits, or measurements. A population can thus be said to be an aggregate observation of subjects grouped together by a common feature.
There are two ways to create a sampling distribution. The first is to actually draw samples of the same size from the population, calculate the statistic of interest for each, and then use descriptive techniques to learn more about the sampling distribution. The second method relies on the rules of probability and the laws of expected value and variance to derive the sampling distribution.
a Sample means
A sample mean is the average of a set of data. The sample mean measures the central tendency of a data set and is an input to the calculation of its variance and standard deviation. It has a variety of uses, including estimating population averages.
Arithmetic mean from sample:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Where: $x_i$: the value of each item
n: total number of items
b Central limit theorem
The sampling distribution of the mean of a random sample drawn from any population is approximately normal for a sufficiently large sample size. The larger the sample size, the more closely the sampling distribution of $\bar{X}$ will resemble a normal distribution.
The accuracy of the approximation alluded to in the central limit theorem depends on the probability distribution of the population and on the sample size. If the population is normal, then $\bar{X}$ is normally distributed for all values of n. If the population is nonnormal, then $\bar{X}$ is approximately normal only for larger values of n.
For samples of size 30 or more, the sample mean is approximately normally distributed, with mean $\mu_{\bar{X}} = \mu$ and standard deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n}$, where n is the sample size. The larger the sample size, the better the approximation. The Central Limit Theorem is illustrated for several common population distributions in Figure 1:
Figure 1: Distribution of Populations and Sample Means
The dashed vertical lines in the figures locate the population mean. Regardless of the distribution of the population, as the sample size is increased the shape of the sampling distribution of the sample mean becomes increasingly bell-shaped, centered on the population mean. Typically, by the time the sample size is 30, the distribution of the sample mean is practically the same as a normal distribution.
The importance of the Central Limit Theorem is that it allows us to make probability statements about the sample mean, specifically in relation to its value in comparison to the population mean, as we will see in the examples.
The central limit theorem is not a computational formula in itself; it is a statement about the standardized sample mean, so it relies on the sample mean and the standard deviation. As sample means are gathered from the population, the standard deviation $\sigma/\sqrt{n}$ determines how they spread around the population mean.
If X is a random variable with mean $\mu$ and variance $\sigma^2$, then in general:
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \Rightarrow Z \sim N(0, 1) \text{ as } n \to \infty$$
The central limit theorem is useful when analyzing large data sets because it allows one to assume that the sampling distribution of the mean will be approximately normally distributed in most cases. This allows for easier statistical analysis and inference.
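The convergence described above can be checked with a small simulation. The following is a minimal sketch, not part of the original study: it assumes an Exponential(1) population (mean 1, standard deviation 1, strongly right-skewed), and the function names, the sample sizes, and the number of repetitions are illustrative choices. As n grows, the mean of the sample means should stay near 1, their standard deviation should approach $1/\sqrt{n}$, and their skewness should shrink toward 0, the value for a normal distribution.

```python
import random
import statistics


def sample_means(n, reps, rng):
    """Draw `reps` samples of size n from Exponential(1) and return their means."""
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]


def skewness(values):
    """Population skewness: mean of the cubed standardized values."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)


def clt_demo(sizes=(2, 30, 200), reps=4000, seed=0):
    """Return {n: (mean, standard deviation, skewness)} of the simulated sample means."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    summary = {}
    for n in sizes:
        means = sample_means(n, reps, rng)
        summary[n] = (statistics.fmean(means),
                      statistics.stdev(means),
                      skewness(means))
    return summary


results = clt_demo()
for n, (mean, sd, skew) in results.items():
    # Theory for Exponential(1): mean 1, standard error 1/sqrt(n)
    print(f"n={n:4d}  mean={mean:.3f}  sd={sd:.3f}  "
          f"(theory {1 / n ** 0.5:.3f})  skewness={skew:.3f}")
```

Skewness is used here as a simple numerical stand-in for "looks bell-shaped"; a histogram of the simulated means would show the same effect visually.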
c The sampling distribution of sample means
The mean of the sampling distribution of the mean is the mean of the population from which the scores were sampled. Therefore, if a population has a mean $\mu$, then the mean of the sampling distribution of the mean is also $\mu$. The symbol $\mu_{\bar{X}}$ is used to refer to the mean of the sampling distribution of the mean. The formula for the mean of the sampling distribution of the mean can be written as:
$$\mu_{\bar{X}} = \mu$$
The variance of the sampling distribution of the sample mean is the variance of the population divided by the sample size:
$$\sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}$$
The standard deviation of the sampling distribution is called the standard error of the mean, which is written as:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
To illustrate, consider an example: we constructed the probability distribution of the sample mean for samples of size two drawn (with replacement) from the population of four rowers. The probability distribution is:
$\bar{x}$: 152 154 156 158 160 162 164
$P(\bar{x})$: 1/16 2/16 3/16 4/16 3/16 2/16 1/16
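The distribution of the sample mean for this small population can be obtained by listing every possible sample. The sketch below assumes that the four rowers weigh 152, 156, 160 and 164 and that samples of size two are drawn with replacement; these population values are not given in the text and are inferred from the listed values of the sample mean. It also checks the two formulas above, $\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}}^2 = \sigma^2/n$.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Assumed population (inferred, not stated in the text): weights of four rowers
weights = [152, 156, 160, 164]
n = 2  # sample size


def sampling_distribution(population, n):
    """Exact distribution of the sample mean over all ordered samples with replacement."""
    counts = Counter(Fraction(sum(sample), n)
                     for sample in product(population, repeat=n))
    total = len(population) ** n
    return {mean: Fraction(count, total) for mean, count in sorted(counts.items())}


dist = sampling_distribution(weights, n)

# Mean and variance of the sampling distribution
mean_of_means = sum(x * p for x, p in dist.items())
var_of_means = sum((x - mean_of_means) ** 2 * p for x, p in dist.items())

# Mean and variance of the population itself
pop_mean = Fraction(sum(weights), len(weights))
pop_var = sum((w - pop_mean) ** 2 for w in weights) / len(weights)

for x, p in dist.items():
    print(f"x-bar = {x}: probability {p}")
print("mean of sample means:", mean_of_means, "| population mean:", pop_mean)
print("variance of sample means:", var_of_means, "| sigma^2 / n:", pop_var / n)
```

Exact fractions are used instead of floating-point numbers so that the equalities hold exactly rather than approximately.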
Figure 2 shows a side-by-side comparison of a histogram for the original population and a histogram for this distribution. Whereas the distribution of the population is uniform, the sampling distribution of the mean has a shape approaching the shape of the familiar bell curve. This phenomenon of the sampling distribution of the mean taking on a bell shape even though the population distribution is not bell-shaped happens in general. Here is a somewhat more realistic example.
Figure 2: Distribution of a population and of the sample mean
Suppose we take samples of size 1, 5, 10, or 20 from a population that consists entirely of the numbers 0 and 1, half the population 0, half 1, so that the population mean is 0.5. Histograms illustrating the sampling distributions of the sample mean for these sample sizes are shown in Figure 3.
As the sample size increases, the histograms become smoother and more bell-shaped. What we are seeing in these examples does not depend on the particular population distributions involved. In general, one may start with any distribution and the sampling distribution of the sample mean will increasingly resemble the bell-shaped normal curve as the sample size increases.
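For the population of 0s and 1s described above, the sampling distribution of the sample mean can be computed exactly, because the number of 1s in a sample follows a binomial distribution. The sketch below assumes independent draws (a very large population, or sampling with replacement); the function name and the text-based histogram are illustrative only.

```python
from math import comb


def mean_distribution(n, p=0.5):
    """Exact distribution of the sample mean of n independent draws from a 0/1 population.

    The number of 1s in the sample is Binomial(n, p), so the sample mean k/n
    has probability C(n, k) * p**k * (1 - p)**(n - k).
    """
    return {k / n: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}


for n in (1, 5, 10, 20):
    print(f"n = {n}")
    for x_bar, prob in mean_distribution(n).items():
        # One '#' per 1/60 of probability gives a rough sideways histogram
        print(f"  {x_bar:4.2f} {'#' * round(prob * 60)}")
```

With n = 1 the histogram is just the two equal bars of the population; by n = 20 the bars already trace out a bell shape centered on 0.5.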
This means that if we have a large enough sample, we can compute approximate probabilities concerning the sample mean, since it will be approximately normally distributed whatever the original distribution (provided the population has a finite variance).
d The sampling distribution of a proportion
The binomial distribution has a parameter p, the probability of success in any trial. However, in the real world, p is unknown, requiring the statistics practitioner to estimate its value from a sample. The estimator of a population proportion of successes is the sample proportion; that is, we count the number of successes in a sample and compute:
$$\hat{p} = \frac{X}{n}$$
Where: $\hat{p}$: the sample proportion of successes, read as "p hat"
X: the number of successes n: sample size
Using the laws of expected value and variance, we can determine the mean, variance, and standard deviation of $\hat{p}$. The sample proportion $\hat{p}$ is approximately normally distributed provided that np and n(1 - p) are greater than or equal to 5.
The expected value is:
$$E(\hat{p}) = p$$
The variance:
$$V(\hat{p}) = \sigma_{\hat{p}}^2 = \frac{p(1 - p)}{n}$$
The standard deviation (the standard error of the proportion):
$$\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$$
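The three formulas and the rule of thumb for normality can be put together in a few lines. The sketch below is illustrative: the values p = 0.4, n = 100 and the cut-off of 45 successes are hypothetical, and the helper names are not from the text. It compares an exact binomial probability with the value given by the normal approximation.

```python
from math import comb, sqrt
from statistics import NormalDist


def proportion_sampling_distribution(p, n):
    """Return (mean, variance, standard error) of the sample proportion."""
    variance = p * (1 - p) / n
    return p, variance, sqrt(variance)


def normal_approximation_ok(p, n):
    """Rule of thumb from the text: np and n(1 - p) are both at least 5."""
    return n * p >= 5 and n * (1 - p) >= 5


def exact_prob_at_most(p, n, k):
    """Exact binomial probability of at most k successes in n trials."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1))


# Hypothetical values chosen only for illustration
p, n, k = 0.4, 100, 45
mean, variance, se = proportion_sampling_distribution(p, n)

exact = exact_prob_at_most(p, n, k)         # P(p-hat <= 0.45), exact
approx = NormalDist(mean, se).cdf(k / n)    # same probability, normal approximation

print(f"mean = {mean}, variance = {variance:.5f}, standard error = {se:.5f}")
print(f"rule of thumb satisfied: {normal_approximation_ok(p, n)}")
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

The two probabilities differ by a couple of percentage points because the binomial distribution is discrete; a continuity correction, which is not covered in the text, would narrow the gap.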
e Comparison between sampling distributions for sample means and sample proportions
Sampling distributions
Quantitative (example: age) | $\mu$ = population mean; $\sigma$ = population standard deviation | $\bar{x}$ = sample mean
b Purpose of estimation
The objective of estimation is to determine the approximate value of a population parameter on the basis of a sample statistic.
c Types of estimators
We can use sample data to estimate a population parameter in two ways:
• Point estimator
A point estimator draws inferences about a population by estimating the value of an unknown parameter using a single value or point.
Example: The sample mean $\bar{x} = 99$ is a point estimate of the population mean $\mu$.
However, there are three drawbacks when we use point estimators. Firstly, it is virtually certain that the estimate will be wrong. (The probability that a continuous random variable will equal a specific value is 0; that is, the probability that $\bar{x}$ will exactly equal $\mu$ is 0.) Secondly, we often need to know how close the estimator is to the parameter. Thirdly, in drawing inferences about a population, it is intuitively reasonable to expect that a large sample will produce more accurate results because it contains more information than a smaller sample does. But point estimators do not have the capacity to reflect the effects of larger sample sizes.
• Interval estimator
An interval estimator draws inferences about a population by estimating the value of an unknown parameter using an interval (range). This interval corresponds to a probability, called the confidence level, and this probability is never equal to 100%.
Example: We are 95% confident that the unknown mean score lies between 51 and 75.
A 95% confidence interval is a range of values above and below the point estimate within which the true value in the population is likely to lie, with 95% confidence. The other 5% is the possibility that the true value is not within the confidence interval.
Consider a normal distribution. The range of Z scores that would likely capture 95% of the observations would be the 95% of Z scores in the middle of the standard normal distribution, i.e., excluding the 2.5% of the Z scores at the bottom of the standard normal distribution and the 2.5% of Z scores at the top.
In other words, 95% of the observations lie within -1.96 < Z < 1.96.
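The value 1.96 can be recovered from the standard normal distribution itself. The following minimal sketch uses `NormalDist` from Python's standard `statistics` module; the function name `z_critical` is an illustrative choice.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Z value that leaves (1 - confidence) / 2 in each tail of the standard normal."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")

# Area between -1.96 and 1.96 under the standard normal curve
central_area = NormalDist().cdf(1.96) - NormalDist().cdf(-1.96)
print(f"P(-1.96 < Z < 1.96) = {central_area:.4f}")
```

The same function gives the critical values for other confidence levels, for example about 1.645 for 90% and 2.576 for 99%.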
Figure 4: Z lies within -1.96 and 1.96 (95% of the area in the middle, 2.5% in each tail)
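• Interval estimator for $\mu$ when $\sigma$ is known
The central limit theorem states that $\bar{X}$ is normally distributed if X is normally distributed, or approximately normally distributed if X is nonnormal and n is sufficiently large. This means the variable:
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
is standard normally distributed (or approximately so). We developed the following probability statement associated with the sampling distribution of the mean:
$$P\left(\mu - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \bar{X} < \mu + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$
Which was derived from:
$$P\left(-z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) = 1 - \alpha$$
And then, we can express the probability in a slightly different form:
$$P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$
Where: $1 - \alpha$: confidence level
$\bar{x} - z_{\alpha/2}\,\sigma/\sqrt{n}$: lower confidence limit
$\bar{x} + z_{\alpha/2}\,\sigma/\sqrt{n}$: upper confidence limit
In this form, the population mean is in the center of the interval created by adding and subtracting $z_{\alpha/2}$ standard errors to and from the sample mean.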
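As a sketch of how the lower and upper confidence limits are computed in practice, the function below applies $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ to a small data set. The ten observations and the value $\sigma = 5$ are hypothetical and chosen only so that the sample mean is 99; the function name is likewise an assumption.

```python
from math import sqrt
from statistics import NormalDist, fmean


def z_interval(sample, sigma, confidence=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    n = len(sample)
    x_bar = fmean(sample)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    margin = z * sigma / sqrt(n)                        # z_{alpha/2} standard errors
    return x_bar - margin, x_bar + margin


# Hypothetical observations with sample mean 99; sigma = 5 is assumed known
sample = [98, 102, 101, 97, 99, 100, 103, 96, 99, 95]
lcl, ucl = z_interval(sample, sigma=5)
print(f"95% interval: LCL = {lcl:.2f}, UCL = {ucl:.2f}")

lcl99, ucl99 = z_interval(sample, sigma=5, confidence=0.99)
print(f"99% interval: LCL = {lcl99:.2f}, UCL = {ucl99:.2f}")
```

Raising the confidence level from 95% to 99% widens the interval, which is the price paid for being more confident that it captures the population mean.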
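• Interval estimator for $\mu$ when $\sigma$ is unknown:
We cannot simply substitute s for $\sigma$ in the Z statistic, since $\frac{\bar{x} - \mu}{s/\sqrt{n}}$ does not have a standard normal distribution. However, it does follow a known distribution: when the population is normal, it follows a t-distribution with n-1 degrees of freedom. The statistic is called the t-statistic:
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$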
The t-distribution, also known as Student's t-distribution, is a type of probability distribution that is similar to the normal distribution with its bell shape but has heavier tails, meaning that extreme values are more likely than under a normal distribution. It is used for estimating population parameters when the sample size is small or the population variance is unknown.
Figure 5: The t-distribution (comparing the Z distribution and the t-distribution with 10 degrees of freedom)
Degrees of freedom (df) are the maximum number of logically independent values that may vary in a data sample. Degrees of freedom are calculated by subtracting one from the number of items within the data sample:
$$df = n - 1$$
The confidence interval for $\mu$ when $\sigma$ is unknown is then:
$$\bar{x} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$$
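A short worked example may help to see how this interval differs from the one based on Z. The numbers are hypothetical: a sample of n = 10 with $\bar{x} = 99$ and s = 5, at the 95% confidence level, for which the tabulated critical value is $t_{0.025,\,9} = 2.262$.

```latex
\bar{x} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}
  = 99 \pm 2.262 \times \frac{5}{\sqrt{10}}
  \approx 99 \pm 3.58
  \quad\Rightarrow\quad (95.42,\ 102.58)

% For comparison, if the same value 5 were a known \sigma:
\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}
  = 99 \pm 1.96 \times \frac{5}{\sqrt{10}}
  \approx 99 \pm 3.10
  \quad\Rightarrow\quad (95.90,\ 102.10)
```

The t-based interval is wider because the heavier tails of the t-distribution account for the extra uncertainty of estimating $\sigma$ with s from only ten observations.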
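Suppose that before we gather data, we know that we want the sample mean to fall within a certain range of the true population value. We can use the central limit theorem to find the minimum sample size required to meet this condition, if the standard deviation of the population is known. We express the following probability:
$$P\left(-z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) = 1 - \alpha$$
Which can be expressed as:
$$P\left(-z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \bar{X} - \mu < z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$
This means the difference between $\bar{X}$ and $\mu$ lies between $-z_{\alpha/2}\,\sigma/\sqrt{n}$ and $+z_{\alpha/2}\,\sigma/\sqrt{n}$ with probability $1 - \alpha$. Expressed another way, we have $|\bar{X} - \mu| < z_{\alpha/2}\,\sigma/\sqrt{n}$ with probability $1 - \alpha$.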
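The last inequality leads directly to a sample size. If the sampling error $|\bar{X} - \mu|$ must not exceed a bound B, then requiring $z_{\alpha/2}\,\sigma/\sqrt{n} \le B$ and solving for n gives $n \ge (z_{\alpha/2}\,\sigma/B)^2$. The sketch below applies this; the symbol B, the function name, and the example values ($\sigma = 15$, B = 2 or 1) are illustrative assumptions, not figures from the text.

```python
from math import ceil
from statistics import NormalDist


def min_sample_size(sigma, bound, confidence=0.95):
    """Smallest n such that z_{alpha/2} * sigma / sqrt(n) <= bound."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / bound) ** 2)  # round up: n must be a whole number


# Hypothetical inputs: known sigma = 15, required precision of 2 units, then 1 unit
print(min_sample_size(sigma=15, bound=2))  # 217
print(min_sample_size(sigma=15, bound=1))  # 865
```

Halving the bound roughly quadruples the required sample size, because n depends on the square of $\sigma / B$.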
In addition, we want our estimators to be accurate and precise. This means:
- Accurate: on average, our estimator is getting towards the true value
- Precise: our estimates are close together
The sample mean is a precise and accurate estimator of the population mean. (An estimator that is accurate in this sense, that is, correct on average, is called unbiased.)
II Application
1 Election polls
One of the most common applications of the CLT is in election polls, where it is used to calculate the percentage of people supporting a candidate; these estimates are reported in the news as confidence intervals. Even if the population distribution is unknown or not normal, the sampling distribution of the sample proportion can be treated as approximately normal according to the CLT, which is what the confidence interval method requires.
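A poll result with its margin of error can be sketched as follows. The figures (520 supporters among 1,000 respondents) are hypothetical, and the interval uses the usual normal approximation $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1 - \hat{p})/n}$, in which the unknown p in the standard error is replaced by $\hat{p}$; that substitution is an assumption of this sketch rather than something stated in the text.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)  # margin of error
    return p_hat - margin, p_hat + margin


# Hypothetical poll: 520 of 1,000 respondents support the candidate
low, high = proportion_interval(520, 1000)
print(f"support: 52.0%, 95% interval: {low:.1%} to {high:.1%}")
```

In this illustration the interval includes 50%, so a poll of this size could not say with 95% confidence that the candidate has majority support.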
2 Average income
The CLT is also used to estimate the mean family income in a particular region. To estimate the population mean more accurately, we can increase the number of observations sampled from the population, which decreases the standard error of the sample mean.
3 Clinical trial
In clinical research, we define the population as a group of people who share a common characteristic or a condition, usually the disease. If we are conducting a study on patients with ischemic stroke, it will be difficult to include the whole population of ischemic stroke patients all over the world. It is difficult to locate the whole population everywhere and to have access to all of it. Therefore, the practical approach in clinical research is to include a part of this population, called the "sample population". The whole population is sometimes called the "target population", while the sample population is called the "study population". When doing a research study, we should make the sample as representative of the target population as possible, with the least possible error and without substitution or incompleteness.
4 Supply chain
Supply chain analytics is no exception when it comes to applications of probability distributions and the Central Limit Theorem (CLT). Be it capacity planning, inventory management, or deciding the order throughput in fulfillment centers, the CLT finds many applications across the supply chain domains. In this article, let's explore an application of the Central Limit Theorem to a real-time shipment data set of a pharmaceutical company, which has hundreds of thousands of observations.
Andrea Messori, Erminia Caccese and Maria Claudia D'Avella use progression-free survival (PFS). PFS is used when evaluating the effectiveness of a treatment in controlling the progression of the disease, particularly in cancers where the disease can progress rapidly. The goal is to evaluate the effectiveness of a treatment in delaying disease progression. The authors calculate the 95% confidence interval for a population of patients intended to be treated with the experimental or the control treatment.
II Purpose
This article aims to investigate the observations of patients intended to be treated with the experimental treatment or the control. By comparing the outcomes of the experimental group to the control group, researchers can determine whether the experimental treatment is effective in treating the cancer.
Censoring category | Investigational Arm, n (%) | Control Arm, n (%)
Number Censored | 11 (16.7) | 19 (28.4)
No Baseline Tumour Assessments | 1 (1.5) | 2 (3.0)
No Post-Baseline Tumour Assessments | 2 (3.0) | 2 (3.0)
Death or Progression After Two or More Missed Visits | 1 (1.5) | 2 (3.0)
Start of New Anticancer Therapy | 5 (7.6) |
Figure 6: Kaplan-Meier curve for PFS (Investigator assessment) of Investigational Arm versus Control Arm (ITT population) - Study JGDG Phase 2
- Standard deviation(e) = 10.3, standard deviation(c) = 5.4
- Sample size: n(e) = 66, n(c) = 67
- Experimental arm mean = 6.6, control arm mean = 4.1
- We compute the 95% confidence interval with the formula:
$$CI = \bar{x} \pm z\frac{s}{\sqrt{n}}$$
CI = confidence interval; s = sample standard deviation
$\bar{x}$ = sample mean; n = sample size
z = confidence level value
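Applying the formula to the summary statistics listed above gives the two intervals. The sketch below assumes z = 1.96 for the 95% confidence level and uses the means, standard deviations and sample sizes exactly as reported; the function and variable names are illustrative.

```python
from math import sqrt

Z_95 = 1.96  # confidence level value for a 95% interval


def mean_confidence_interval(mean, sd, n, z=Z_95):
    """CI = x-bar +/- z * s / sqrt(n), computed from summary statistics."""
    margin = z * sd / sqrt(n)
    return mean - margin, mean + margin


# Summary statistics as reported for the two arms
experimental = mean_confidence_interval(mean=6.6, sd=10.3, n=66)
control = mean_confidence_interval(mean=4.1, sd=5.4, n=67)

print(f"experimental arm: {experimental[0]:.2f} to {experimental[1]:.2f}")
print(f"control arm:      {control[0]:.2f} to {control[1]:.2f}")
```

Under these assumptions the intervals are roughly 4.12 to 9.08 for the experimental arm and 2.81 to 5.39 for the control arm. The experimental interval is wider because its standard deviation is almost twice as large.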
Overall, PFS is a useful measure in assessing the effectiveness of cancer treatment, but it should be interpreted in conjunction with other clinical and patient factors to make informed treatment decisions.
VI Other articles
1 Article 1
a Article name
Casino Games and the Central Limit Theorem
Ashok Singh, Ph.D., Anthony F. Lucas, Ph.D., Rohan J. Dalpatadu, Ph.D., Dennis J. Murphy, Ph.D.
b Case
In Nevada, for the calendar year ended December 31, 2012, baccarat produced more wins than any other table game. Baccarat served as a perennial revenue juggernaut for Nevada casinos. Because of these contributions, this study
In the standard form of baccarat, the dealer deals two cards each to the Bank hand and the Player hand, and a gambler can bet on either of the two hands to win, or that a tie will occur. Of course, when dictated by the rules, either or both of these two hands may receive a third card before a winning outcome is determined. The payout on the Player hand is 1 to 1, while winning Bank wagers are paid at a rate of 0.95 to 1, assuming the standard 5% commission on winning Bank wagers. Winning tie bets are paid at a rate of 8 to 1. In the event of a tie hand, a bet on either the Bank or the Player will push, i.e., the bet will neither win nor lose.
c Purpose
This study will show that, in the long run, any casino game will result in a positive casino win, provided the game carries a positive expected value for the casino.
d Method
Probability that the Player hand wins = 0.4462466
Probability that the Bank hand wins = 0.4585974
Probability that the two hands are tied = 0.0951560
The expected value of a single unit wager on the Player hand in baccarat is -0.01235, and the per unit standard deviation is 0.9512.
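Both figures follow from the three probabilities listed above. The sketch below assumes the payoffs described earlier for a 1-unit wager on the Player hand: +1 if the Player hand wins, -1 if the Bank hand wins, and 0 on a tie (a push). The variable names are illustrative.

```python
from math import sqrt

# Net result of a 1-unit wager on the Player hand, with its probability
outcomes = {
    +1: 0.4462466,  # Player hand wins, paid 1 to 1
    -1: 0.4585974,  # Bank hand wins, wager lost
    0: 0.0951560,   # tie, wager pushes
}

expected_value = sum(x * p for x, p in outcomes.items())
variance = sum((x - expected_value) ** 2 * p for x, p in outcomes.items())
std_dev = sqrt(variance)

print(f"total probability  = {sum(outcomes.values()):.7f}")
print(f"expected value     = {expected_value:.5f}")
print(f"standard deviation = {std_dev:.4f}")
```

The negative expected value for the gambler is the positive expected value for the casino on which the study's argument rests.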
In order to generate results of a sequence of n hands of baccarat, in which a player wagers 1 unit each time on the Player hand, n random numbers were generated from the above probability distribution and the sample mean was calculated. By repeating the above steps a large number (N) of times, N values of the sample mean, $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_N$, were obtained. The number of iterations N used was 100,000, and the following six values of the sample size n (i.e., hands) were included in this study: 100, 250, 500, 1000, 2000, and 4000. In addition, the CLT-based 95% confidence limits are compared to the non-parametric 95% confidence limits obtained from simulations, to highlight any relevant differences. That is, the limits computed by the CLT-based formula are compared to those produced by the simulations, for each sample size.
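The procedure can be sketched in a few lines. This is a scaled-down illustration, not the authors' code: it uses N = 2,000 iterations instead of 100,000 and only the three smallest sample sizes so that it runs quickly, it assumes the CLT-based limits take the form $\mu \pm 1.96\,\sigma/\sqrt{n}$ (the exact formula is not reproduced in the text), and it takes the 2.5th and 97.5th percentiles of the simulated sample means as the non-parametric limits.

```python
import random
from math import sqrt

OUTCOMES = (1, -1, 0)                       # Player wins, Bank wins, tie
PROBS = (0.4462466, 0.4585974, 0.0951560)   # probabilities from the Method section
MU, SIGMA = -0.01235, 0.9512                # per-unit expected value and standard deviation


def clt_limits(n, z=1.96):
    """Assumed CLT-based 95% limits for the mean result per hand over n hands."""
    half_width = z * SIGMA / sqrt(n)
    return MU - half_width, MU + half_width


def simulated_limits(n, iterations, rng):
    """2.5th and 97.5th percentiles of `iterations` simulated sample means."""
    means = sorted(sum(rng.choices(OUTCOMES, weights=PROBS, k=n)) / n
                   for _ in range(iterations))
    lower = means[int(0.025 * iterations)]
    upper = means[int(0.975 * iterations) - 1]
    return lower, upper


rng = random.Random(2023)  # fixed seed so that the run is repeatable
for n in (100, 250, 500):
    clt_low, clt_high = clt_limits(n)
    sim_low, sim_high = simulated_limits(n, 2000, rng)
    print(f"n={n:4d}  CLT: ({clt_low:+.4f}, {clt_high:+.4f})  "
          f"SIM: ({sim_low:+.4f}, {sim_high:+.4f})")
```

With more iterations the simulated percentiles become more stable, which is presumably why the study uses N = 100,000.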