(Essay) Topic: Sampling Distribution and Estimation

SCHOOL OF ADVANCED EDUCATIONAL PROGRAMS
NATIONAL ECONOMICS UNIVERSITY

SUBJECT: BUSINESS STATISTICS
GROUP MID-TERM ASSIGNMENT
TOPIC: SAMPLING DISTRIBUTION AND ESTIMATION

Group 3:
Nguyễn Diệu Linh: 11213220
Nguyễn Thuỳ Linh: 11213351
Vũ Hoàng Minh: 11213979 (Leader)
Phạm Thị Hương Trà: 11203958
Phạm Phương Thảo: 11215447
Nguyễn Thu Trang: 11215873
Class: Advanced Finance 63D
Assoc. Prof.: Tran Thi Bich

HANOI, 2023

TABLE OF CONTENTS
PART 1: INTRODUCTION
1.1 Definition of Sampling distribution and The Central Limit Theorem
1.1.1 Sampling Distribution
1.1.2 The Central Limit Theorem
1.2 Definition of Estimation
1.3 Applications of Sampling Distribution and Estimation
1.3.1 Sampling Distribution
1.3.2 Estimation
PART 2: ARTICLE AND SOURCES ANALYSIS
2.1 Article Summary
2.1.1 General Info
2.1.2 Objective & Purpose of the article
2.1.3 Methodology
2.1.4 Main Results
2.2 Techniques used in the article
2.2.1 Sample size
2.2.2 Sampling error
2.2.3 Simple random sampling
2.2.4 Confidence level
2.2.5 The Adjusted Wald, Jeffreys and Wilson Interval Methods
2.2.6 Finite Population Correction (FPC)
2.2.7 The Maximum likelihood estimate, Laplace, Jeffreys and Wilson Point Estimators
2.3 Additional Source
2.3.1 Methods of sample size calculation in descriptive retrospective burden of illness studies
2.3.2 COVID-19 prevalence estimation by random sampling in population - optimal sample pooling under varying assumptions about true prevalence
2.4 Conclusions
PART 3: DATA ANALYSIS
3.1 General Info
3.2 Survey Questions
3.2.1 Level of trust in COVID-19 vaccination (TR)
3.2.2 Level of Perceiving COVID-19 Risk (PRC)
3.2.3 Level of COVID-19 Vaccine Perception (PV)
3.2.4 Level of COVID-19 Vaccination Intention (INT)
3.3 Data Analysis
3.3.1 Sampling distribution
3.3.2 Estimation
3.3.3 Correlation
3.4 Conclusion
3.5 Recommendation
3.5.1 The Government
3.5.2 Vaccine import Firms
3.5.3 The Media & Telecommunication
3.5.4 Citizen
REFERENCE

PART 1: INTRODUCTION
The probability distribution of a given statistic is estimated from a random sample. An estimator is a rule, expressed as a mathematical formula, for calculating a statistic from sample data. It is used to calculate statistics for a given sample and helps reduce uncertainty when conducting research or collecting statistical data.

1.1 Definition of Sampling distribution and The Central Limit Theorem
1.1.1 Sampling Distribution
A sampling distribution is the probability distribution of a statistic obtained from repeated samples drawn from a specific population. Since analyzing the entire population is impractical, its main purpose is to let small samples give representative results about a larger population. The sampling distribution of a statistic describes the frequency distribution of the range of values that the statistic could take across samples from that population.
There are primarily three types of sampling distribution:
- Sampling distribution of the mean: The mean of the sampling distribution of the mean equals the mean of the population from which the scores were sampled. For sufficiently large samples the graph is approximately normal, and its center is the mean of the sampling distribution, which is the mean of the entire population.
- Sampling distribution of proportion: The sample proportion measures the proportion of successes, i.e. the chance of occurrence of a certain event, by dividing the number of successes by the sample size n. The mean of all the sample proportions calculated from each sample group equals the proportion of the entire population.
- T-distribution: The t-distribution is used for estimating population parameters when sample sizes are small or variances are unknown. It is used to estimate the mean of the population, confidence intervals, statistical differences, and linear regression coefficients.
The sampling distribution is influenced by several factors, such as the statistic, sample size, sampling process, and the overall population. It is
used to calculate statistics like means, ranges, variances, and standard deviations for the sample at hand.
Sample size and normality: If X has a distribution that is:
+ Normal, then the sample mean has a normal distribution for all sample sizes
+ Close to normal, the approximation is good for small sample sizes
+ Far from normal, the approximation requires larger sample sizes
For a sample size of more than 30, the sampling distribution formula is given by: $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \sigma / \sqrt{n}$
+ The means of the sampling distribution and of the population are represented by $\mu_{\bar{x}}$ and $\mu$
+ The standard deviations of the sampling distribution and of the population are represented by $\sigma_{\bar{x}}$ and $\sigma$
+ The sample size of more than 30 is represented by n

1.1.2 The Central Limit Theorem
The Central Limit Theorem (CLT) explains that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the population's distribution. In practice, sample sizes equal to or greater than 30 are commonly considered sufficient for the CLT to be applicable. A fundamental feature of the CLT is that the average of the sample means equals the population mean, while the standard deviation of the sample means equals the population standard deviation divided by $\sqrt{n}$.

1.2 Definition of Estimation
Estimation in statistics refers to any procedure used to calculate the value of a population parameter from observations in a sample drawn from that population. There are two types of estimation: point estimation and interval estimation. The purpose of estimation is to find the approximate value of a population parameter by calculating statistics based on samples taken from that population.

1.3 Applications of Sampling Distribution and Estimation
1.3.1 Sampling Distribution
Sampling distribution may be used to represent the size of particles produced by grinding, milling, and crushing.
- In Economics, researchers may collect a simple random sample of 50 individuals in a town and use the average annual income of the individuals in the sample to estimate the average annual income of individuals in
the entire town. If the average annual income of the individuals in the sample is $58,000, then the best guess for the actual average annual income in the town will be $58,000.
- In Biology, biologists may measure the height of 30 randomly selected plants and then use the sample mean height to estimate the population mean height. If the biologist finds that the sample mean height of the 30 plants is 10.3 inches, then her best guess for the population mean height will also be 10.3 inches.
- In Surveys, the HR department of a company may randomly select 50 employees to take a survey that assesses their overall satisfaction on a scale of 1 to 10. If the average satisfaction among employees in the survey is 8.5, then the best guess for the average satisfaction rating of all employees at the company is also 8.5.

1.3.2 Estimation
In engineering, for any specified rock slope failure mode (such as plane shear, step path, or wedge), the safety factor (SF) equation can be combined with the point estimation method to provide accurate estimates of the mean and standard deviation of the SF probability distribution.
In calculating labor costs, an estimation model can help to count hours on the right day, which can be particularly challenging for night shift workers.
In the construction and contracting industry, people have designed estimation apps for clients or programmed software to approximately calculate the volume and cost of flooring (concrete, wood, carpet) or drywall.
In practical mathematics, estimation can be applied to time management, future budgeting, and algebra. The CLT is helpful in finance when examining a large portfolio of securities to estimate returns, risks, and correlations.

PART 2: ARTICLE AND SOURCES ANALYSIS
2.1 Article Summary
2.1.1 General Info
The use of random sampling in investigations involving child abuse material
This article by Brian Jones, Syd Pleno and Michael Wilkinson, published in 2012, addresses two ubiquitous problems of
practicing digital forensics in law enforcement: the ever-increasing data volume and staff exposure to disturbing material. It discusses how the New South Wales Police Force, State Electronic Evidence Branch (SEEB) has implemented a “Discovery Process”. By randomly sampling files and applying statistical estimation to the results, the branch has been able to reduce backlogs from three months to 24 hours. The process has the added advantage of reducing staff exposure to child abuse material and providing the courts with an easily interpreted report.
Figure: SEEB Investigation July 2010 to June 2011

2.1.2 Objective & Purpose of the article
In order to maximize its capacity to offer timely support to significant criminal investigations, SEEB was forced to develop and implement a variety of systems in response to the growing demand for digital forensic support. Investigations into the possession of child abuse material (CAM) are one area with high demand. In the past, it was the job of SEEB analysts to detect any images, writings, or films that included CAM.
A crucial limitation of chart review is the requirement to inexpensively obtain a sufficiently large sample. Researchers need to consider how valuable the data is expected to be. Given the absence of validated techniques for predetermined sample size calculation, the goal of this study is to offer a satisfactory methodology for sample size computation.

2.1.3 Methodology
A sample is meant to be an accurate representation of the population. In this situation, quantitative information about that population is expected to be gained from analyzing the sample. “A sample is representative if the statistics computed from it accurately reflect the corresponding population parameters.” (De-Veaux et al. 2009)
The population must be randomly sorted in order to allow for the above-mentioned estimation, and a minimum required number of items, the sample set, must be presented. This can be modeled as "simple random sampling" in statistics. By assigning a
random number to each file, we can sort the files and choose a sample from the first n files to calculate the ratio of files on a digital storage device that include CAM to those that do not.

2.1.4 Main Results
Table: Test results
Constraints: CL - 99%, CI < 5%, population - 52,061, sample - 8388, actual items of interest - 1527
Table: Average test results

2.2 Techniques used in the article
2.2.1 Sample size
Due to the law of large numbers and the Central Limit Theorem, in order to achieve a confidence level of 99%, a maximum of approximately 10,000 files (when using the Discovery statistical constraints) has to be viewed, irrespective of the population size (Yamane, 1967).

2.2.2 Sampling error
In this case we are only concerned with sampling errors, as non-sampling errors are minimal due to the controlled environment and uniformity of the population.

2.2.3 Simple random sampling
Sampling is defined in ISO/IEC 17025 as: “A defined procedure whereby a part of a substance, material or product is taken to provide for testing or calibration of a representative sample of the whole.”
To allow for the estimation, the population must be randomly sorted and a minimum recommended number of items, the sample set, must be presented. In statistics this can be modeled as “simple random sampling”.
“With simple random sampling, each member of the sampling frame has an equal chance of selection and each possible sample of a given size has an equal chance of being selected. Every member of the sampling frame is numbered sequentially and a random selection process is applied to the numbers.” (McLennan, 1999)
For the purposes of determining the ratio of files containing CAM to files not containing CAM on a digital storage device, simple random sampling is relatively straightforward. By assigning a random number to each file, the files can then be sorted and the sample selected from the first n files.

2.2.4 Confidence level
This value indicates the reliability of the estimate. If the confidence level
is 95%, this implies that if 100 samples were drawn, the confidence intervals computed from about 95 of them would contain the true population value (Stewart, 2011).

2.2.5 The Adjusted Wald, Jeffreys and Wilson Interval Methods
To calculate the estimate, the first thing required is the interval estimate of the proportion, or the confidence interval. The standard formula for calculating the confidence interval is known in most introductory statistics textbooks as the Wald interval. As the standard interval has been shown to be inconsistent and unreliable in many circumstances, several sources recommend the Adjusted Wald as a more reliable interval calculation. The Adjusted Wald interval is defined as: [equation not preserved in the source]
The Wilson and Jeffreys Bayesian interval methods were also used in simulations and tests to compare the methods and find the most reliable one; the method that gave acceptably conservative estimates was selected for use in the SEEB sampling process.

2.2.6 Finite Population Correction (FPC)
If the population size is known and the sample is greater than 5% of the population, the finite population correction factor can be used. The FPC is included in the estimate calculation and usually results in a narrower margin for the estimate, without affecting the reliability. The formula for the FPC is: $\mathrm{FPC} = \sqrt{\frac{N - n}{N - 1}}$, where N is the population size and n is the sample size.

2.2.7 The Maximum likelihood estimate, Laplace, Jeffreys and Wilson Point Estimators
The point estimate is the ratio of the selection to the sample, or s/n. It is a value that is used in the final estimate calculation. There are four main binomial point estimators: the Maximum likelihood estimate (MLE), Laplace, Wilson and Jeffreys (Böhning and Viwatwongkasem, 2005).

2.3 Additional Source
2.3.1 Methods of sample size calculation in descriptive retrospective burden of illness studies
- Issue
Studies on the burden of disease can put the existing treatment environment in context and point out significant treatment gaps in pharmacoepidemiology. Large databases are frequently taken
into account while planning observational research due to the growing availability of “big data”. An objective of this study was to propose rigorous methodologies for sample size computation in light of the lack of validated methods for a priori sample size estimation.
- Purpose
This study offers formal guidelines for calculating sample sizes for retrospective burden of illness studies. Pharmacoepidemiology uses observational burden of illness research to achieve a number of goals, including contextualizing the present treatment environment, highlighting significant treatment gaps, and offering estimates to parameterize economic models. The goal of this study was to create suggested sample size formulas for use in such studies.
- Technique
+ Cost estimate: Bottom-up sampling
Estimating work effort in agile projects is fundamentally different from traditional methods of estimation. The traditional approach is to estimate using a “bottom-up” technique: detail out all requirements and estimate each task needed to complete those requirements in hours or days, then use this data to develop the project schedule. This technique is used when the requirements are known at a discrete level, where the smaller work pieces are then aggregated to estimate the entire project. It is usually used when the information is only known in smaller pieces.
+ Retrospective chart review
The retrospective chart review (RCR), also known as a medical record review, is a type of research design in which pre-recorded, patient-centered data are used to answer one or more research questions. The data used in such reviews exist in many forms: electronic databases, results from diagnostic tests, and notes from health service providers, to mention a few. RCR is a popular methodology widely applied in many healthcare-based disciplines such as epidemiology, quality assessment, professional education and residency training, inpatient care, and clinical research (cf. Gearing et al.), and valuable information may be
gathered from study results to direct subsequent prospective studies.

2.3.2 COVID-19 prevalence estimation by random sampling in population - optimal sample pooling under varying assumptions about true prevalence
- Issue
A rough estimate of the illness burden in a population is the number of confirmed COVID-19 cases divided by the size of the population. However, a number of studies indicate that a sizable portion of cases typically go unreported and that this percentage varies greatly depending on the level of sampling and the test criteria employed in different jurisdictions.
- Purpose
This research aims to examine how the number of samples used in the experiment and the degree of sample pooling affect the accuracy of prevalence estimates and the possibility of reducing the number of tests necessary to obtain individual-level diagnostic results.
- Technique
Estimates of the true prevalence of COVID-19 in a population can be made by random sampling and pooling of RT-PCR tests. These estimates do not take into account follow-up testing on sub-pools that enables patient-level diagnosis; they are based only on the initial pooled tests. The precision of the pooled test estimates would be closer to that of individual testing if the results from these follow-up samples were taken into account. The article started by generating a population of 500,000 individuals and then gave each individual a probability of being infected at sampling time. The number of patient samples collected from the population is denoted by n, and the number of patient samples that are pooled into a single well is denoted by k. The total number of pools is thus n/k, hereafter called m. The number of positive pools in an experiment is termed x.

2.4 Conclusions
In conclusion, this study represents a formal guide to sample size calculations for searching for CAM files. SEEB has been able to significantly reduce the exposure of its staff and police investigators to disturbing child abuse material. It has also significantly
reduced backlogs, enabled investigators to establish the extent of their investigation in a short timeframe, and provided the courts with a clear record of the quantity and severity of CAM on a device.

PART 3: DATA ANALYSIS
3.1 General Info
COVID-19 Perceived Risk, Vaccine Perception, Vaccination Intention in VN
The survey data concern perceived COVID-19 risk, vaccine perception and vaccination intention among Vietnamese people. The dataset was written by Nguyen Phi Hung and Nguyen Van Duy and became available online on 11 January 2022. The questionnaire was administered in a bilingual version (English and Vietnamese) and distributed to respondents through Internet platforms (emails, Google Forms and Facebook) from 6/2021 to 7/2021, yielding 329 valid responses. Participation was voluntary. Data analysis was completed using the SPSS 26.0 software package following data cleansing and coding. The data were summarized based on respondents' socioeconomic and demographic characteristics. This data will therefore contribute to the literature on COVID-19 vaccination perceptions and intention to vaccinate among Vietnamese people, and will help them better understand the issue of COVID-19 vaccination.
Table 1: Respondents Characteristics
The table shows that male respondents accounted for 52.6% (173 people), women accounted for 45.6% (150 people), and other genders for 1.8% (6 people). Income mainly ranged from 10 to 15 million VND/month (164 people, 49.8%), followed by more than 20 million VND/month (60 people, 21%). Of note, participants under 35 years old were more heavily represented (234 people, 73.9%). Nearly half of the participants were private office staff (153 people, 46.5%). The majority had bachelor's degrees (242 people, 73.5%) and some had master's degrees (44 people, 13.4%). Vaccination safety was the most common concern about receiving the COVID-19 vaccine (147 people, 44.7%).

3.2 Survey Questions
3.2.1 Level of trust in COVID-19 vaccination (TR)
TR1: Trust in the government's ability to prevent COVID-19
TR2: Trust in the vaccine being used by the Vietnamese government
TR3: Trust in the COVID-19 vaccine storage procedures
TR4: Trust in the medical team during the COVID-19 vaccination process
TR5: Trust in the ability to manage side effects after a COVID-19 vaccine
TR6: Trust that vaccines are the most effective method of COVID-19 prevention and control

3.2.2 Level of Perceiving COVID-19 Risk (PRC)
PRC1: The COVID-19 pandemic has a high mortality rate
PRC2: Worrying about yourself, relatives, and colleagues who may be infected with COVID-19
PRC3: Recognizing the possibility of a COVID-19 pandemic breaking out in the area where you live and work
PRC4: Risk perception of infection during concentrated isolation
PRC5: Risk perception of infection during self-isolation
PRC6: Risk perception of distance guidance during self-isolation

3.2.3 Level of COVID-19 Vaccine Perception (PV)
PV1: Perceive that getting vaccinated against COVID-19 reduces the risk of the disease
PV2: Perceive that getting vaccinated against COVID-19 reduces the severity of the disease
PV3: Perceive that vaccination against COVID-19 is required to prevent disease outbreaks
PV4: Perceive that vaccination against COVID-19 is good for the community
PV5: Perceive that vaccination against COVID-19 helps economic and social activities return to normal soon
PV6: Research on a COVID-19 vaccine is needed in the context of many new variants

3.2.4 Level of COVID-19 Vaccination Intention (INT)
INT1: Registered for the COVID-19 vaccine
INT2: Expect to get a COVID-19 vaccine at any time
INT3: Ready to encourage loved ones to get vaccinated against COVID-19

3.3 Data Analysis
3.3.1 Sampling distribution
According to the Central Limit Theorem, the larger the sample size, the more closely the sampling distribution will follow a normal distribution. When the sample size is small, the sampling distribution of
the mean is sometimes non-normal. That is because the central limit theorem only holds true when the sample size is “sufficiently large”. If X is a random variable with mean μ and variance σ², then in general: $\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$
Where:
- $\bar{X}$ is the sampling distribution of the sample means
- N is the normal distribution
- μ is the mean of the population
- σ is the standard deviation of the population
- n is the sample size
We randomly divide the data into samples of size 50, 100 and 329, which meet the conditions of the central limit theorem:
+ The sample size is sufficiently large. This condition is met if the sample size is n ≥ 30
+ The samples are independent and identically distributed random variables. This condition is met if the sampling is random
+ The population's distribution has finite variance. The central limit theorem does not apply to distributions with infinite variance, such as the Cauchy distribution

With n = 50:
Figure: Level of trust in COVID-19 vaccination (TR) (N = 50)
In the first histogram, we randomly select a sample of size 50, for which the mean Level of trust in COVID-19 vaccination (TR) is 3.75 and the standard deviation is 0.732. Notice how the histogram centers near the sample mean of 3.75. It is also a reasonably symmetric distribution. Those are features of many sampling distributions. This distribution is not particularly smooth because 50 observations is a small number for this purpose.

With n = 100:
Figure: Level of trust in COVID-19 vaccination (TR) (N = 100)
The histogram illustrates the increase of the sample size to 100. The mean is now 4.03 and the standard deviation is 0.578. Just as the central limit theorem predicts, as we increase the sample size, the sampling distributions more closely approximate a normal distribution and have a tighter spread of values.

With n = 329:
Figure: Level of trust in COVID-19 vaccination (TR) (N = 329)
In the final histogram, the sample size is 329, with the mean of Level
of trust in COVID-19 vaccination (TR) of 3.94 and a standard deviation of 0.677. We can notice that, as the sample size grows, the histograms of the statistics (sample mean and sample standard deviation) become more symmetric, more bell-shaped, and narrower, particularly those for the sample mean. Also notice that the estimated standard deviation of the sample mean not only decreases as the sample size increases, but is also approximately the same for equal sample sizes.

3.3.2 Estimation
By using the one-sample t-test we can draw conclusions about the population means of the variables. Method and formula for calculating confidence intervals:
$\bar{x} \pm t_{\alpha/2} \times \left(\frac{s}{\sqrt{n}}\right)$, that is, $\bar{x} - t_{\alpha/2} \times \left(\frac{s}{\sqrt{n}}\right) < \mu < \bar{x} + t_{\alpha/2} \times \left(\frac{s}{\sqrt{n}}\right)$

- With 95% CI of the Difference
For Level of trust in COVID-19 vaccination (TR):
We have n = 329, $\bar{x}$ = 3.94, s = 0.68, df = 328; computing a 95% confidence interval for the population mean μ:
$3.94 - 1.96 \times \left(\frac{0.68}{\sqrt{329}}\right) < \mu < 3.94 + 1.96 \times \left(\frac{0.68}{\sqrt{329}}\right)$ (critical value of the t distribution)
3.8665 < μ < 4.0135
=> This result indicates that the population mean of Trust in COVID-19 vaccination (TR) lies between 3.8665 and 4.0135 with 95% confidence, which illustrates quite high trust in COVID-19 vaccination.

For Level of Perceiving COVID-19 Risk (PRC):
We have n = 329, $\bar{x}$ = 3.81, s = 0.57, df = 328; computing a 95% confidence interval for the population mean μ:
$3.81 - 1.96 \times \left(\frac{0.57}{\sqrt{329}}\right) < \mu < 3.81 + 1.96 \times \left(\frac{0.57}{\sqrt{329}}\right)$ (critical value of the t distribution)
3.7484 < μ < 3.8716
=> This result indicates that the population mean of Perceiving COVID-19 Risk (PRC) lies between 3.7484 and 3.8716 with 95% confidence, which indicates that the average perceived COVID-19 risk is considerably high.

For Level of COVID-19 Vaccination Intention (INT):
We have n = 329, $\bar{x}$ = 3.88, s = 0.72, df = 328; computing a 95% confidence interval for the population mean μ:
$3.88 - 1.96 \times \left(\frac{0.72}{\sqrt{329}}\right) < \mu < 3.88 + 1.96 \times \left(\frac{0.72}{\sqrt{329}}\right)$ (critical value of the t distribution)
3.8021 < μ < 3.9579
=> Similarly, the population mean of COVID-19 Vaccination Intention (INT) lies between 3.8021 and 3.9579 with 95% confidence. This shows that the Level of COVID-19 Vaccination Intention is not complete but is very high.
We can see the result in the tables below (tool: SPSS One-Sample T-test).
Table 2: One-Sample Statistics #1
Table 3: One-Sample Test #1

- With 99% CI of the Difference
By increasing the confidence level of the test, we can be more certain that the interval captures the population mean. The upper limit moves up and the lower limit moves down, so the 99% confidence interval is wider. Therefore, as we increase the confidence level, the width of the interval increases as well. A higher confidence level means more certainty that the interval contains the true mean, at the cost of precision. We can now compare it with the 95% CI case:
We have n = 329, $\bar{x}$ = 3.94, s = 0.68, df = 328; computing a 99% confidence interval for the population mean μ:
$3.94 - 2.576 \times \left(\frac{0.68}{\sqrt{329}}\right)$ [...]

[...] (TR -> INT) shows that the more people trust the Government's COVID-19 vaccination campaign, including the Government's ability to prevent COVID-19, the type of vaccine being used by the Vietnamese government, the COVID-19 vaccine storage procedures, the medical team during the COVID-19 vaccination process, and the ability to manage side effects after a COVID-19 vaccine, the more they are willing and intend to get vaccinated. Moreover, with a correlation of 0.457 (PRC -> INT), people who are aware of and afraid of the high mortality rate of this pandemic, who worry about themselves, relatives, and colleagues, who recognize the possibility of a COVID-19 pandemic breaking out in the area where they live and work, and who have a high risk perception of infection during concentrated isolation, self-isolation and distance guidance during that period, are likely to have registered and to be ready for a dose of vaccine. In addition, the correlation of 0.591 (PV -> 
INT) illustrates the higher vaccination intention of people who perceive that getting vaccinated against COVID-19 reduces the risk, severity, and outbreak of the disease, is good for the community, and helps economic and social activities return to normal soon.

3.4 Conclusion
The results of this analysis show that sampling distribution and estimation have many uses in many fields of life. From the dataset, we can see the level of people's trust in the Government's COVID-19 vaccination campaign, their Vaccination Intention (INT), their perceived COVID-19 risk and their vaccine perception, and so understand the current status of the vaccine campaign and the COVID-19 situation in Vietnam. This result can raise people's awareness of preventing COVID-19 infection by using vaccines. These findings can prompt the government and the health ministry to improve healthcare services for citizens, and hence help economic and social activities return to normal soon. In addition, by using sampling distribution and estimation, we can estimate the mean of the whole population without having to collect information about every single person in Vietnam. Samples are easier to collect data from because they are practical, cost-effective, convenient, and manageable. There are many applications of sampling distribution and estimation: in economics, to estimate the average annual income; in biology, to draw conclusions about organisms; in manufacturing, to see how many products are defective; in companies, to measure employee satisfaction; and so on (as presented above).
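As a rough check of the interval arithmetic in Section 3.3.2, the following Python sketch recomputes the confidence intervals from the reported n, sample mean and sample standard deviation. The function name `mean_confidence_interval` and the dictionary `reported` are illustrative and are not part of the original SPSS analysis. The critical values 1.96 and 2.576 are the large-sample values used in the text; with df = 328 the exact t value is slightly larger, so this is an approximation rather than a reproduction of the SPSS output.

```python
import math


def mean_confidence_interval(mean, sd, n, critical):
    """Return (lower, upper) for mean +/- critical * sd / sqrt(n)."""
    margin = critical * sd / math.sqrt(n)
    return mean - margin, mean + margin


# Summary statistics as reported in Section 3.3.2 (n = 329 for every variable).
# Keys are the survey constructs; values are (sample mean, sample sd).
reported = {
    "TR": (3.94, 0.68),
    "PRC": (3.81, 0.57),
    "INT": (3.88, 0.72),
}

for name, (mean, sd) in reported.items():
    lo95, hi95 = mean_confidence_interval(mean, sd, 329, 1.96)
    lo99, hi99 = mean_confidence_interval(mean, sd, 329, 2.576)
    # The 99% interval is always wider than the 95% interval for the same data.
    print(f"{name}: 95% CI ({lo95:.4f}, {hi95:.4f}), 99% CI ({lo99:.4f}, {hi99:.4f})")
```

Because the inputs here are the rounded values quoted in the text, the last decimal place may differ slightly from the SPSS tables, which presumably work from unrounded means and standard deviations.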
