
DOCUMENT INFORMATION

Basic information

Title: Sampling Distribution and Estimation
Authors: Nguyễn Anh Đức, Nguyễn Sinh Kiên, Tao Hoàng Lộc, Đặng Hữu Phước, Hồ Minh Trang, Nguyễn Phan Vân Trang, Pham Long Vu
University: National Economics University
Major: Statistics
Type: Project Report
Year: 2022
City: Hanoi
Pages: 33
Size: 5.47 MB

Structure

  • 1.1. Definition of Sampling distribution and The Central Limit Theorem
  • 1.2. Definition of Estimation
  • 1.3. The necessity of Sampling distribution and Estimation
  • 1.4. Application of Sampling distribution and Estimation
  • 2. Article and sources analysis
    • 2.1. Article summary
      • 2.1.2. Purpose of the article
      • 2.1.3. Methodology
    • 2.2. Techniques used in the article
      • 2.2.1. Cost estimate - Bottom-up sampling
      • 2.2.2. Retrospective chart review
    • 2.3. Additional sources
      • 2.3.1. The use of sample distribution and Estimation in investigations involving child abuse material
      • 2.3.2. COVID-19 prevalence estimation by random sampling in population
  • 3. Data analysis
    • 3.1. Database
    • 3.2. Data analysis
      • 3.2.1. Sampling distribution

Content

Definition of Sampling distribution and The Central Limit Theorem

A sampling distribution is a probability distribution of a statistic obtained from a large number of samples drawn from a specific population. The sampling distribution of a given population is the distribution of frequencies of a range of different outcomes that could possibly occur for a statistic of that population.

You can measure the sampling distribution's variability either by its standard deviation, also called the "standard error of the mean," or by the population variance, depending on the context and the inferences you are trying to draw. Both are mathematical formulas that measure the spread of data points in relation to the mean.

There are three primary factors that influence the variability of a sampling distribution:

  • The number observed in the population: the symbol for this variable is "N". It is the number of observations in a given group of data.
  • The number observed in the sample: the symbol for this variable is "n". It is the number of observations in a random sample of data that is part of the larger grouping.
  • The method of choosing the sample: how you chose the samples can account for variability in some cases.

Sample size and normality:

  • If X has a normal distribution, then the sample mean has a normal distribution for all sample sizes.
  • If X has a distribution that is close to normal, the approximation is good even for small sample sizes.
  • If X has a distribution that is far from normal, the approximation requires larger sample sizes (commonly n ≥ 30).

For a sample size of more than 30, the sampling distribution formulas are:

μ_x̄ = μ and σ_x̄ = σ / √n

Here:
  • the mean of the sampling distribution and of the population are represented by μ_x̄ and μ;
  • the standard deviation of the sampling distribution and of the population are represented by σ_x̄ and σ;
  • the sample size of more than 30 is represented by n.

b) Central limit theorem:

The central limit theorem states that if you take sufficiently large samples from a population, the samples’ means will be normally distributed, even if the population isn’t normally distributed The central limit theorem says that the sampling distribution of the mean will always be normally distributed, as long as the sample size is large enough Regardless of whether the population has a normal, Poisson, binomial, or any other distribution, the sampling distribution of the mean will be normal.
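The theorem can be checked empirically with a quick simulation. The following is a minimal sketch (the exponential population, sample size, and number of samples are illustrative choices, not figures from the report): even though an exponential population is strongly skewed, the means of repeated samples cluster around the population mean with spread close to σ/√n.

```python
# Sketch: empirical check of the central limit theorem using a decidedly
# non-normal population (Exponential(1), mean 1, sd 1). Illustrative only.
import random
import statistics

random.seed(42)

n, num_samples = 50, 2000

# Draw many samples of size n and record each sample mean
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

# The sampling distribution of the mean should center on the population
# mean (1.0), with standard error close to sigma / sqrt(n) = 1 / sqrt(50)
print(round(statistics.fmean(sample_means), 2))
print(round(statistics.stdev(sample_means), 2))
```

Plotting `sample_means` as a histogram would show the familiar bell shape, despite the skewed population.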

The necessity of Sampling distribution and Estimation

Since populations are typically large in size, it is important to use a sampling distribution so that you can randomly select a subset of the entire population. Doing so helps eliminate variability when you are doing research or gathering statistical data. It also makes the data easier to manage and builds a foundation for statistical inference, which leads to making inferences about the whole population. Understanding statistical inference is important because it helps individuals understand the spread of frequencies and what various outcomes look like within a dataset.

b) Estimation

An ideal estimator has several important properties: unbiasedness, efficiency, consistency and sufficiency.

Estimators that are unbiased are those that have zero bias for all values of the parameter. In statistical language, unbiased estimators are those whose mathematical expectation, or mean, equals the parameter of the target population.

Application of Sampling distribution and Estimation

Suppose you want to find the average height of children at the age of 10 from each continent. You take random samples of 100 children from each continent, and you compute the mean for each sample group. For example, in South America, you randomly select data about the heights of 10-year-old children and calculate the mean for 100 of the children. You also randomly select data from North America and calculate the mean height for one hundred 10-year-old children. As you continue to find the average heights for each sample group of children from each continent, you can calculate the mean of the sampling distribution by finding the mean of all the average heights of the sample groups. The sampling distribution can be computed not only for the mean, but also for other statistics such as the standard deviation and variance.

The following examples show how the central limit theorem is used in different real-life situations

Economists often use the central limit theorem when using sample data to draw conclusions about a population

For example, an economist may collect a simple random sample of 50 individuals in a town and use the average annual income of the individuals in the sample to estimate the average annual income of individuals in the entire town.

If the economist finds that the average annual income of the individuals in the sample is $58,000, then her best guess for the true average annual income of individuals in the entire town will be $58,000

Biologists use the central limit theorem whenever they use data from a sample of organisms to draw conclusions about the overall population of organisms

For example, a biologist may measure the height of 30 randomly selected plants and then use the sample mean height to estimate the population mean height

If the biologist finds that the sample mean height of the 30 plants is 10.3 inches, then her best guess for the population mean height will also be 10.3 inches

Manufacturing plants often use the central limit theorem to estimate how many products produced by the plant are defective

For example, the manager of the plant may randomly select 60 products produced by the plant in a given day and count how many of the products are defective He can use the proportion of defective products in the sample to estimate the proportion of all products that are defective that are produced by the entire plant

If he finds that 2% of products are defective in the sample, then his best guess for the proportion of defective products produced by the entire plant is also 2%

Human Resources departments often use the central limit theorem when using surveys to draw conclusions about overall employee satisfaction at companies

For example, the HR department of some company may randomly select 50 employees to take a survey that assesses their overall satisfaction on a scale of 1 to 10.

If it’s found that the average satisfaction among employees in the survey is 8.5, then the best guess for the average satisfaction rating of all employees at the company is also 8.5.

Agricultural scientists use the central limit theorem whenever they use data from samples to draw conclusions about a larger population

For example, an agricultural scientist may test a new fertilizer on 15 different fields and measure the average crop yield of each field

If it’s found that the average field produces 400 pounds of wheat, then the best guess for the average crop yield for all fields will also be 400 pounds.

b) Estimation:

  • As an example, we estimate the mean of a population using the mean of a sample drawn from that population. That is, the sample mean is an estimator of the population mean.
  • The actual statistic we calculate in respect of the sample is called an estimate of the population parameter. For example, a calculated sample mean is an estimate of the population mean.

Article and sources analysis

Article summary

Studies on the burden of disease can put the existing treatment environment in context and point out significant treatment gaps in pharmacoepidemiology. Large databases are frequently considered when planning observational research due to the growing availability of "big data". Chart reviews continue to be an effective way of getting thorough data straight from de-identified patient charts.

The ability to obtain a sufficiently big sample at a reasonable cost is a significant restriction on chart review. Researchers must take into account the projected value of the information and may already know how many charts are accessible for extraction. An objective of this study was to propose rigorous methodologies for sample size computation, in light of the lack of verified methods for a priori sample size estimation.

This study offers formal guidelines for calculating sample sizes for studies of the retrospective burden of disease. Pharmacoepidemiology uses observational burden of illness research to achieve a number of goals, including contextualizing the present treatment environment, highlighting significant treatment gaps, and offering estimates to parameterize economic models. Methodologies like retrospective chart reviews may be used in situations where there are no existing datasets or where the clinical detail in those datasets is insufficient. There aren't many reliable methods available for calculating sample size in this situation, despite the fact that specifying the number of charts to be extracted, and/or figuring out whether the number that can feasibly be extracted will be clinically meaningful, are important study design considerations. The goal of this study was to create suggested sample size formulas for use in such studies.

Methodologies such as retrospective chart review may be utilized in settings for which existing datasets are not available or do not include sufficient clinical detail While specifying the number of charts to be extracted and/or determining whether the number that can feasibly be extracted will be clinically meaningful is an important study design consideration, there is a lack of rigorous methods available for sample size calculation in this setting

Calculations for identifying the optimal feasible sample size were derived for studies characterizing treatment patterns and medical costs, based on the ability to comprehensively observe treatments and to maximize the precision of the resulting 95% confidence intervals. For cost outcomes, if the standard deviation is not known, the coefficient of variation (cv) can be used as an alternative.

Sample size calculations for categorical outcomes (e.g. treatment patterns)

When considering treatment distributions in a population, assume a binomial distribution (n, p) for receiving a particular treatment, in which n represents the sample size and p represents the probability of receiving the treatment. The following are direct results of the binomial distribution:

The expected number of observed patients receiving the treatment is: n × p (1)

The width of the 95% CI for estimating p is:

2 × 1.96 × √(p(1 − p) / n) (2)

The probability of not receiving the treatment is (1 − p), and therefore the probability of all n patients not receiving the treatment is (1 − p)^n, such that the probability of observing at least one patient receiving the treatment is:

1 − (1 − p)^n (3)
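The three binomial quantities above can be sketched in a few lines of Python. This is a minimal illustration; n = 200 and p = 0.05 are example values taken from the surrounding discussion of treatment patterns.

```python
# Sketch of the three binomial quantities: expected count, CI half-width
# (normal approximation), and probability of at least one observation.
import math

def expected_cases(n: int, p: float) -> float:
    """Expected number of observed patients receiving the treatment: n * p."""
    return n * p

def ci_half_width(n: int, p: float, z: float = 1.96) -> float:
    """Approximate half-width of the 95% CI for p: z * sqrt(p(1-p)/n)."""
    return z * math.sqrt(p * (1 - p) / n)

def prob_at_least_one(n: int, p: float) -> float:
    """Probability of observing at least one treated patient: 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n

n, p = 200, 0.05  # example: 200 charts, treatment given to 5% of patients
print(expected_cases(n, p))
print(round(ci_half_width(n, p), 2))
print(round(prob_at_least_one(n, p), 3))
```

For n = 200 and p = 0.05 the half-width works out to about ±0.03, matching the precision quoted later for Table 1.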

Sample size calculations for continuous outcomes (e.g. costs)

When considering medical costs, assume a normal distribution for costs with mean μ and standard deviation σ. The precision associated with a particular sample size can be characterized by the width W of the 95% CI:

W = 2 × 1.96 × σ / √n

Each cell gives the expected number of individuals receiving the treatment (probability of observing the treatment at least once); ± expected 95% CI precision for the proportion receiving treatment.

            n=50             n=100            n=200             n=300             n=500             n=1000
p = 0.01    1 (0.39); ±0.03  1 (0.63); ±0.02  2 (0.87); ±0.01   3 (0.95); ±0.01   5 (0.99); ±0.01   10 (1.00); ±0.01
p = 0.05    3 (0.92); ±0.06  5 (0.99); ±0.04  10 (1.00); ±0.03  15 (1.00); ±0.02  25 (1.00); ±0.02  50 (1.00); ±0.01

Table 1. Expected number of individuals receiving treatments

Table 1 presents calculated relationships between sample sizes and the expected number of cases to be observed, the probability of observing a treatment in practice, and the expected precision, for a range of treatment probabilities. Across sample sizes, any treatment given with greater than 1% frequency has a high likelihood of being observed. For a sample of size 200 and a treatment given to 5% of the population, the precision of a 95% CI is expected to be ±0.03; i.e. the expected 95% CI would be (0.02–0.08). Generally, with respect to characterizing treatment patterns, sample sizes above 200 are only required for treatments given to 1% of the population or less, or if particularly narrow precision estimates are needed.

The information in Table 1 can be used to identify the optimal sample size based on a treatment pattern-related research question, or, in the case of a fixed sample size, to identify the level of detail that can be described.

Table 2. Observed data from the MELODY study (mean, SD, and cv for each cost category)

Observed data from the MELODY study are presented in Table 2 to describe the range of cv observed in practice. Trends in observed cv values included higher values for hospice and hospital costs relative to outpatient costs, and higher values when considering the full population of included patients versus the subset with non-zero use of a particular category of utilization. Across all categories considered, values for cv ranged from 0.26 to 4.30, with a median value of 0.72. In practice, a range of possible values for cv can be considered, based on any a priori knowledge regarding heterogeneity in the population with respect to health resource utilization and cost outcomes of interest, e.g. the expected range of disease severity, and the anticipated distribution of costs with respect to routine maintenance and care vs high cost acute treatment such as inpatient stays.
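The coefficient of variation is simply the standard deviation divided by the mean. A minimal sketch (the cost values below are made up for illustration, not taken from the MELODY study):

```python
# Sketch: cv = SD / mean for a hypothetical cost sample (illustrative data).
import statistics

costs = [120.0, 85.0, 430.0, 60.0, 250.0, 95.0, 310.0, 40.0]
mean_cost = statistics.fmean(costs)
sd_cost = statistics.stdev(costs)   # sample standard deviation
cv = sd_cost / mean_cost
print(round(cv, 2))
```

Because cv is unitless, it can be specified a priori (as the article suggests) without knowing the cost scale of the target population.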

Figure 1. Required sample size as a function of the coefficient of variation, with the width of the 95% CI expressed as a percentage of the mean

The figure above displays sample size requirements for the full range of values for cv observed in the MELODY study, while Fig 1b only considers values for cv between 0 and 1. For the median cv value of 0.72 from the MELODY study, a sample size of approximately 200 would be required to generate a 95% CI precise to within ±10% of the mean. For a cv of 4.5, more than 8000 individuals would be required to estimate a 95% CI precise to within ±10% of the mean. Thus, in situations where large variability is anticipated, e.g. a sample ranging from zero costs to long and costly hospitalizations, the required sample size may be prohibitively large for a chart review or prospective study, requiring access to an administrative or other large database.

To sum up, this study presents a formal guide to sample size calculations for retrospective disease burden studies. The approach presented here is methodologically rigorous, designed for practical application in real-world retrospective charting studies, and can be used in two distinct ways: when research resources are flexible and the desired accuracy is known, the formulas can guide the selection of the sample size; if the available research resources or sample size are fixed, the formulas can generate predicted values of accuracy.

While specifying the number of charts extracted and determining whether a feasibly extractable quantity would be clinically significant is an important consideration in study design, the rigorous methods available for calculating sample sizes in this setting still have shortcomings and limitations. This research digs deeper through the presentation and development of rigorous approaches to sample size calculation using a specific real-world study. Hopefully, this study can clarify these problems so they can be solved more easily.

Techniques used in the article

2.2.1 Cost estimate - Bottom-up sampling

Estimating work effort in agile projects is fundamentally different from traditional methods of estimation. The traditional approach is to estimate using a "bottom-up" technique: detail out all requirements, estimate each task to complete those requirements in hours/days, then use this data to develop the project schedule.

The traditional method for estimating projects is to spend several weeks or months at the beginning of a project defining the detailed requirements for the product being built Once all the known requirements have been elicited and documented, a Gantt chart can be produced showing all the tasks needed to complete the requirements, along with each task estimate Resources can then be assigned to tasks, and actions such as loading and leveling help to determine the final delivery date and budget This process is known as a bottom-up method, as all details regarding the product must be defined before project schedule and cost can be estimated.

This technique is used when the requirements are known at a discrete level, where the smaller work pieces are then aggregated to estimate the entire project. It is usually used when the information is only known in smaller pieces.

In the software industry, the use of the bottom-up method has severe drawbacks due to today's speed of change

2.2.2 Retrospective chart review

The retrospective chart review (RCR), also known as a medical record review, is a type of research design in which pre-recorded, patient-centered data are used to answer one or more research questions. The data used in such reviews exist in many forms: electronic databases, results from diagnostic tests, and notes from health service providers, to mention a few. RCR is a popular methodology widely applied in many healthcare-based disciplines such as epidemiology, quality assessment, professional education and residency training, inpatient care, and clinical research (cf. Gearing et al.), and valuable information may be gathered from study results to direct subsequent prospective studies.

2.3.1 The use of sample distribution and Estimation in investigation involving child abuse material

This source describes a methodology used to address two ubiquitous problems of practicing digital forensics in law enforcement: the ever increasing data volume, and staff exposure to disturbing material. It discusses how the New South Wales Police Force, State Electronic Evidence Branch (SEEB) has implemented a "Discovery Process". By randomly sampling files and applying statistical estimation to the results, the branch has been able to reduce backlogs from three months to 24 hours. The process has the added advantage of reducing staff exposure to child abuse material and providing the courts with an easily interpreted report.

Due to ever increasing demand for digital forensic support, SEEB has been required to develop and implement a range of processes to maximize its ability to provide timely support to serious major crime investigations. One area of significant demand is investigations relating to the possession of child abuse material.

As the standard interval has been proven to be inconsistent and unreliable in many circumstances, it is recommended by several sources that the Adjusted Wald is a more reliable interval calculation

The Adjusted Wald interval is defined as:

p̃ = (s + z²/2) / (n + z²), interval = p̃ ± z × √( p̃(1 − p̃) / (n + z²) )

Where:
  • z = confidence level (for 99%, z ≈ 2.58)
  • n = sample size
  • s = selection
  • p = the percentage of items in the population that are expected to have the qualities being evaluated. If the proportion is unknown it should be set to 0.5; this results in the most conservative estimate and the largest sample size.

If the population is known and the sample is greater than 5% of the population, the finite population correction factor can be used. The FPC is included in the estimate calculation and usually results in a narrower margin for the estimate, without affecting the reliability. The formula for the FPC is:

FPC = √((N − n) / (N − 1))

The point estimate is the ratio of the selection to the sample, s/n. It is a value that is used in the final estimate calculation. There are four main binomial point estimators: the maximum likelihood estimate (MLE), Laplace, Jeffreys, and Wilson. The point estimators usually result in similar outcomes unless there is an unusually small selection in comparison to the population.
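The interval calculation can be sketched as follows. This assumes the Adjusted Wald takes the common Agresti-Coull form (augment the counts by z²/2 successes out of z² extra trials before applying the Wald formula); the article's exact variant may differ. The figures fed in below are the ones reported in the test results (sample 8,388, selection 1,527, population 52,061).

```python
# Sketch of an Adjusted Wald (Agresti-Coull-style) interval at 99% confidence.
import math

def adjusted_wald(n: int, s: int, z: float = 2.58) -> tuple[float, float]:
    """CI for a proportion: adjust the point estimate, then apply Wald."""
    p_adj = (s + z * z / 2) / (n + z * z)
    margin = z * math.sqrt(p_adj * (1 - p_adj) / (n + z * z))
    return p_adj - margin, p_adj + margin

def fpc(n: int, population: int) -> float:
    """Finite population correction, used when n exceeds 5% of the population."""
    return math.sqrt((population - n) / (population - 1))

low, high = adjusted_wald(8388, 1527)
print(round(low, 3), round(high, 3))
print(round(fpc(8388, 52061), 3))  # could be multiplied into the margin
```

Multiplying the margin by the FPC would narrow the interval slightly, consistent with the "with FPC" rows of the results.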

Test results. Constraints: CL = 99%, CI < 5%, population = 52,061, sample = 8,388, actual items of interest = 1,527.

The results tables compare coverage, average estimate range, and average margin of error for the Adjusted Wald, Jeffreys, and Wilson interval calculations, each combined with the four point estimators (Wilson, Laplace, Jeffreys, MLE), with and without the finite population correction. Without the FPC, coverage ranged from roughly 99.4% to 99.6%, with average estimate ranges around 1283–1795 items and average margins of error near ±0.47%. With the FPC, coverage ranged from roughly 98.9% to 99.2%, with tighter estimate ranges around 1302–1774 items and average margins near ±0.44%.

1. No sexual activity: depictions of children with no sexual activity, including nudity, surreptitious images showing underwear or nakedness, sexually suggestive posing, explicit emphasis on genital areas, solo urination.
2. Child non-penetrative: non-penetrative sexual activity between children, or solo masturbation by a child.
3. Adult non-penetrative: non-penetrative sexual activity between child(ren) and adult(s), including mutual masturbation and other non-penetrative sexual activity.
4. Child/adult penetrative: penetrative sexual activity between child(ren) or between child(ren) and adult(s), including, but not limited to, intercourse, cunnilingus and fellatio.
5. Sadism/bestiality/child abuse: sadism, bestiality or humiliation (urination, defecation, vomit, bondage etc.) or child abuse as per CCA 199.
6. Animated or virtual: anime, cartoons, comics and drawings depicting children engaged in sexual poses or activity.

Table 5. CETS CAM/CEM scale

2.3.2 COVID-19 prevalence estimation by random sampling in population - optimal sample pooling under varying assumptions about true prevalence

Ola Brynildsrud, BMC Medical Research Methodology. Issue:

A rough estimate of the illness burden in a population is the number of confirmed COVID-19 cases divided by the size of the population. However, a number of studies indicate that a sizable portion of cases typically goes unreported, and that this percentage varies greatly with the level of sampling and the test criteria employed in different jurisdictions.

This research aims to examine how the number of samples used in the experiment and the degree of sample pooling affect the accuracy of prevalence estimates, and the possibility of reducing the number of tests necessary to obtain individual-level diagnostic results.

A disease's prevalence is the probability that it is present in a population. It is calculated by dividing the total number of patients by the total population. The sample estimate of prevalence is the observed percentage of affected individuals in a sample.

(Dyches, 2010) Estimates of the true prevalence of COVID-19 in a population can be made by random sampling and pooling of RT-PCR tests. These estimations are based only on the initial pooled tests; they do not take into account follow-up testing on sub-pools that enables patient-level diagnosis. If the results from these follow-up samples were taken into account, the precision of the pooled test estimates would be closer to that of testing individually.
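The basic pooled estimate can be sketched as follows. Assuming pools of size k in which a pool tests positive if any member is positive, the standard maximum-likelihood estimate of individual prevalence from x positive pools out of m is p̂ = 1 − (1 − x/m)^(1/k). This is the generic pooled-testing estimator, not necessarily the exact procedure of the Brynildsrud paper; the example figures are invented.

```python
# Sketch: MLE of individual prevalence from pooled test results,
# p = 1 - (1 - x/m)^(1/k), assuming a pool is positive iff any member is.

def pooled_prevalence(x: int, m: int, k: int) -> float:
    """Prevalence estimate from x positive pools out of m, pool size k."""
    return 1 - (1 - x / m) ** (1 / k)

# Illustrative example: 100 pools of 5 samples each, 20 pools test positive
print(round(pooled_prevalence(20, 100, 5), 3))
```

Only m tests are consumed instead of m × k individual tests, which is the efficiency gain the paper investigates.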

Data analysis

Database

Alcohol, caffeine and nicotine are commonly consumed psychotropic substances, exhibiting varying levels of addictiveness while lacking the social stigma associated with consumption of illicit drugs. Excess alcohol consumption is estimated to lead to 548,000 deaths in Vietnam according to the EOC Vietnam (Emergency Operation Center, 2022), with alcohol abuse costs (2016-adjusted) of 16,372 billion VND annually. Deaths related to nicotine consumption, predominantly associated with cigarette consumption and secondhand smoke exposure, reach more than 40,000 annually (Ministry of Public Health, 2021). Costs from lost productivity and health impacts of tobacco consumption reach approximately 1 billion USD per year according to the Ca Mau Department of Health. Comparable economic data on the impact of caffeine consumption are lacking; coffee consumption is currently around 500 g per person annually (VICOFA). However, consuming a large amount of coffee could cause anxiety, depression, insomnia, headaches, urinary incontinence, and increased risk of miscarriage and premature birth, according to VNExpress.

Questionnaire surveys have been used extensively in Europe to monitor alcohol, nicotine and caffeine consumption, as well as in China, Vietnam, Canada, South Korea, Japan, New Zealand, and Australia. By comparison, the use of questionnaire surveys at the National Economics University is still in its infancy. There is a notable lack of information on baseline levels of parent stimulants and metabolites interpreted in the context of the local geographic and demographic setting and the health challenges faced by local communities.

We believe that survey forms are well suited to institutions of higher education. Self-reported alcohol consumption for Advanced Education Programme (AEP) students at the National Economics University (n = 271), collected by survey in 2022, showed that 57.2% of students consumed alcohol in the past month, with 38% reporting binge drinking (4-5 drinks per event) and 10.5% reporting heavy alcohol use (binge drinking ≥5 days per month). This study also reported that 21.1% of students consumed tobacco products, predominantly cigarettes. More recently, nicotine consumption by vaporizer use has increased, a practice that has been shown to carry less stigma than cigarette smoking and allows for consumption in smoking-prohibited areas (e.g., smoke-free campuses) (McKeganey, 2018). Currently there are concerns about vaporizer use being a gateway to cigarette consumption (Dunbar et al., 2018). Caffeine is the most highly consumed of the psychotropic compounds evaluated in this study. In the college student population, there is concern of overuse (≥400 mg caffeine per day, the FDA recommendation; ≥4 standard cups of coffee), particularly because of the increased popularity of energy drinks (75 to 174 mg caffeine per serving). The goals of the present study were to (i) measure indicators of alcohol, caffeine, and nicotine consumption in the AEP student population over a predetermined time period (four academic years), and (ii) state their impact on users' health, from which to give health advice. In the present study, health advice was given to students based on their preferences over these three products (alcohol, caffeine, and nicotine) at their locations.

3.1.2 Survey questions

Overall, how would you rate your general health? very poor 1 2 3 4 5 6 7 8 9 10 very good

Do you smoke? Yes / No. If yes, how many cigarettes do you smoke per day?

How many standard alcoholic drinks do you consume on an average day?

How many drinks containing caffeine (e.g. coffee, tea or cola) do you drink per day?

Generally, how many hours of sleep do you get on weeknights? ___ hours

How satisfied are you with the amount of sleep you get? very dissatisfied 1 2 3 4 5 6 7 8 9 10 very satisfied

Please rate how stressed you have felt over the last month: not at all 1 2 3 4 5 6 7 8 9 10 extremely stressed

Please rate your anxiety over the past month: not at all 1 2 3 4 5 6 7 8 9 10 extremely anxious

Please rate your depression: not at all 1 2 3 4 5 6 7 8 9 10 extremely depressed

Please rate how fatigued you’ve felt over the past month: not at all 1 2 3 4 5 6 7 8 9 10 to a great extent

Please rate how tired you’ve felt over the past month: not at all 1 2 3 4 5 6 7 8 9 10 to a great extent

Please rate how sleepy you've felt over the past month: not at all 1 2 3 4 5 6 7 8 9 10 to a great extent

Please rate how much you've felt lacking in energy over the past month: not at all 1 2 3 4 5 6 7 8 9 10 to a great extent

Data analysis

Due to the Central limit theorem, the larger the sample size, the more closely the sampling distribution will follow a normal distribution When the sample size is small, the sampling distribution of the mean is sometimes non-normal That’s because the central limit theorem only holds true when the sample size is “sufficiently large.”

If X is a random variable with mean μ and variance σ², then in general:

X̄ ~ N(μ, σ²/n)

Where:
  • X̄ is the sampling distribution of the sample means
  • N is the normal distribution
  • μ is the mean of the population
  • σ is the standard deviation of the population

In our survey, we use sample sizes of 50, 99, 199 and 270, which meet the conditions of the central limit theorem:

  • The sample size is sufficiently large. This condition is usually met if the sample size is n ≥ 30.
  • The samples are independent and identically distributed (i.i.d.) random variables. This condition is usually met if the sampling is random.
  • The population's distribution has finite variance. The central limit theorem doesn't apply to distributions with infinite variance, such as the Cauchy distribution; most distributions have finite variance.

Figure 3. Hours of sleep on average on weekdays (n = 50)

In the first histogram, we randomly select a sample size of 50; the mean hours of sleep per night on weekdays is 7.05 and the standard deviation is 0.981. Notice how the histogram centers on the population mean of 7.05, and sample means become rarer further away. It's also a reasonably symmetric distribution. Those are features of many sampling distributions. This distribution isn't particularly smooth, because 50 samples is a small number for this purpose.

Figure 4. Hours of sleep on average on weekdays (n = 99)

In the second histogram, the sample size increases to 99. The mean hours of sleep per night on weekdays has slightly decreased to 6.89, with a standard deviation of 1.139.

Figure 5. Hours of sleep on average on weekdays (n = 199)

This histogram illustrates the increase of the sample size to 199. The mean is now 7.04, with a standard deviation of 1.046. Just as the central limit theorem predicts, as we increase the sample size, the sampling distribution more closely approximates a normal distribution and has a tighter spread of values.

Figure 6. Hours of sleep on average on weekdays (n = 270)

In the final histogram, the sample size is 270, with a mean of 6.97 hours of sleep per night on weekdays and a standard deviation of 1.067.

In conclusion, the hours of sleep per night of AEP students on weekdays change slightly with each sample size, with means of 7.05, 6.89, 7.04 and 6.97. As the sample size increases, the sampling distributions more closely approximate the normal distribution and become more tightly clustered around the population mean, just as the central limit theorem states.

Notice that the histograms for the statistics (the sample mean and the sample standard deviation) become more and more symmetric and bell-shaped, and also narrower, particularly those for the sample mean. Also notice that the estimated standard deviation of the sample mean not only decreases as the sample size increases, but is also approximately the same for equal sample sizes.

By using the one-sample t-test we can draw conclusions about the population means of the variables. For instance, we can conclude that the population mean of the health rating lies between 7.6 and 7.98 with 95% confidence.

3.2.2.a How to calculate confidence intervals

We have n = 262, x̄ = 2.9427, s = 1.92647, df = 261, and compute a 95% confidence interval for the population mean μ, for the number of cups of caffeine consumed per day:

2.9427 − 1.96 × 1.92647/√262 ≤ μ ≤ 2.9427 + 1.96 × 1.92647/√262

This result indicates that the population mean of caffeine consumption will be within 2.7084 to 3.1771 (cups of caffeine per day) with 95% certainty, which illustrates a quite low consumption rate among AEP students.

We have n = 267, x̄ = 7.79, s = 1.58, df = 266, and compute a 95% confidence interval for the population mean μ:

7.79 − 1.96 × 1.58/√267 ≤ μ ≤ 7.79 + 1.96 × 1.58/√267 (critical values of the t distribution)

This gives the interval 7.6 to 7.98. Similarly, the population mean of anxiety will be within 3.92 to 6.76 with a confidence level of 95%. This shows that the stress rate of AEP students is not very high, but there is still a considerable amount of anxiety.
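The confidence-interval arithmetic used in this section can be sketched as follows, using the general-health summary statistics from Table 6 (n = 267, mean 7.79, SD 1.58) and the normal critical value 1.96. (SPSS itself uses the t distribution, so reported bounds may differ slightly in later decimals.)

```python
# Sketch of the 95% confidence interval for a population mean:
# xbar +/- z * s / sqrt(n), with z = 1.96 for 95% confidence.
import math

def mean_ci(xbar: float, s: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Confidence interval for the population mean."""
    half = z * s / math.sqrt(n)
    return xbar - half, xbar + half

# General-health statistics from Table 6
low, high = mean_ci(7.79, 1.58, 267)
print(round(low, 2), round(high, 2))
```

Rounded to two decimals this reproduces the (7.60, 7.98) interval reported for general health.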

We can see the results in the table below (tool: SPSS one-sample t-test).

Table 6. One-Sample Statistics

Variable                            | N   | Mean   | Std. Deviation | Std. Error Mean
do you smoke                        | 270 | 1.87   | .332           | .020
how many alcoholic drinks per day   | 257 | 1.0019 | 1.37447        | .08574
how many caffeine drinks per day    | 262 | 2.9427 | 1.92647        | .11902
general health                      | 267 | 7.79   | 1.580          | .097
hours sleep/week nights             | 270 | 6.972  | 1.0666         | .0649
satisfied with sleep amount         | 268 | 5.55   | 2.506          | .153

One-Sample Test (95% Confidence Interval of the Difference)

Variable                            | t       | df  | Sig. (2-tailed) | Mean Difference | Lower | Upper
do you smoke                        | 92.647  | 269 | .000            | 1.874           | 1.83  | 1.91
how many alcoholic drinks per day   | 11.686  | 256 | .000            | 1.00195         | .8331 | 1.1708
how many caffeine drinks per day    | 24.725  | 261 | .000            | 2.94275         | 2.7084| 3.1771
general health                      | 80.627  | 266 | .000            | 7.794           | 7.60  | 7.98
hours sleep/week nights             | 107.409 | 269 | .000            | 6.9722          | 6.844 | 7.100
satisfied with sleep amount         | 36.251  | 267 | .000            | 5.549           | 5.25  | 5.85

By increasing the confidence level of the test, we can be more certain that the interval contains the population mean. Both the upper and lower bounds move outward, so the 99% confidence interval is wider than the 95% one. Therefore, as we increase the confidence level, the width of the interval increases as well: greater certainty requires a higher confidence level, at the cost of a wider interval.

In this research, we can also compute a 99% confidence interval for the health rating.

We have n = 267, x̄ = 7.79, s = 1.58, df = 266, and compute a 99% confidence interval for the population mean:

7.79 − 2.576 × 1.58/√267 ≤ μ ≤ 7.79 + 2.576 × 1.58/√267

which gives the interval 7.54 to 8.04.

Date posted: 12/08/2024, 14:35
