Tips for Data Analysis with Stata


REPEATED-MEASURES ANOVA

• Used to assess how a dependent variable changes over the course of an intervention.
• Appropriate when the same subjects are measured on one dependent variable at three or more repeated occasions.
• Assumptions for this test:
  o Homogeneity of variance.
  o Normality of the dependent variable.
  o Sphericity: the correlations between all pairs of measurement occasions are equal. For example, the correlation between treatment occasions 1 and 2 must equal the correlation between occasions 1 and 3. Mauchly's test is used to check sphericity.
• Hypotheses:
  o H0: the independent variable (factor) has no effect on the dependent variable, so all of the means are equal.
  o H1: the independent variable affects the dependent variable, so at least two means differ significantly.
• Interpreting the result:
  o Read the result from the p-value in the usual way.
  o Example write-up: "The number of hours individuals volunteered changed over time, F(2,14) = 24.24, p ≤ .05."
• Once the overall test is statistically significant, meaning that the means of the measurement occasions differ, follow up with paired t-tests for each pair of occasions (a Stata sketch follows this list):
  o t-test for occasion 1 vs. occasion 2; read the result in the usual way.
  o t-test for occasion 1 vs. occasion 3; read the result in the usual way.
  o t-test for occasion 2 vs. occasion 3; read the result in the usual way.
  o Example write-up after the paired t-tests: "Individuals volunteered more hours during the month following the ad (M = 6.25) than they did before seeing the ad (M = 1.875), t(14) = 6.92, p ≤ .05, two-tailed. In addition, even one year after watching the ad, participants volunteered more hours (M = 4.5) than they did before seeing the ad, t(14) = 4.15, p ≤ .05, two-tailed. Notably, the effect of the ad campaign did seem to 'wear off' some over time. Specifically, participants volunteered fewer hours one year after watching the ad compared to the month following the ad, t(14) = 2.77, p ≤ .05, two-tailed."
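A minimal Stata sketch of this workflow, assuming a hypothetical long-format dataset with one row per subject per occasion and made-up variable names id, time (coded 1, 2, 3), and hours:

    * Overall repeated-measures ANOVA, with subject (id) as the blocking factor
    anova hours id time, repeated(time)

    * Follow-up paired t-tests on the wide layout (reshape creates hours1, hours2, hours3)
    reshape wide hours, i(id) j(time)
    ttest hours1 == hours2    // occasion 1 vs. occasion 2
    ttest hours1 == hours3    // occasion 1 vs. occasion 3
    ttest hours2 == hours3    // occasion 2 vs. occasion 3

The repeated(time) option asks Stata to report sphericity-corrected p-values (Huynh-Feldt, Greenhouse-Geisser, and Box's conservative correction) alongside the ordinary F test.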

MIXED-MODEL ANOVA

Simple probability (1 of 2)

What is the probability that a card drawn at random from a deck of cards will be an ace? Since 4 of the 52 cards in the deck are aces, the probability is 4/52. In general, the probability of an event is the number of favorable outcomes divided by the total number of possible outcomes. (This assumes the outcomes are all equally likely.) In this case there are four favorable outcomes: (1) the ace of spades, (2) the ace of hearts, (3) the ace of diamonds, and (4) the ace of clubs. Since each of the 52 cards in the deck represents a possible outcome, there are 52 possible outcomes.
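The counting rule is easy to check by direct arithmetic; a one-line Stata check of the ace example above:

    display 4/52    // 4 favorable outcomes (aces) out of 52 equally likely cards ≈ .0769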
Simple Probability (2 of 2)

The same principle can be applied to the problem of determining the probability of obtaining different totals from a pair of dice. There are 36 possible outcomes when a pair of dice is thrown. To calculate the probability that the sum of the two dice will equal 5, calculate the number of outcomes that sum to 5 and divide by the total number of outcomes (36). Since four of the outcomes have a total of 5 (1,4; 2,3; 3,2; 4,1), the probability of the two dice adding up to 5 is 4/36 = 1/9. In like manner, the probability of obtaining a sum of 12 is computed by dividing the number of favorable outcomes (there is only one) by the total number of outcomes (36). The probability is therefore 1/36.

Conditional Probability

A conditional probability is the probability of an event given that another event has occurred. For example, what is the probability that the total of two dice will be greater than 8 given that the first die is a 6? This can be computed by considering only the outcomes for which the first die is a 6 and then determining the proportion of these outcomes that total more than 8. There are 6 outcomes for which the first die is a 6, and of these, there are four that total more than 8 (6,3; 6,4; 6,5; 6,6). The probability of a total greater than 8 given that the first die is 6 is therefore 4/6 = 2/3. More formally, this probability can be written as: p(total > 8 | Die 1 = 6) = 2/3. In this equation, the expression to the left of the vertical bar represents the event and the expression to the right of the vertical bar represents the condition. Thus it would be read as "The probability that the total is greater than 8 given that Die 1 is 6 is 2/3." In more abstract form, p(A|B) is the probability of event A given that event B occurred.
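Both the dice totals and the conditional probability can be checked by enumerating the 36 equally likely outcomes. A small Stata sketch (the variable names die1, die2, and total are made up for this example):

    * Enumerate all 36 outcomes for two dice
    clear
    set obs 36
    generate die1 = ceil(_n/6)          // 1,1,1,1,1,1,2,2,...,6
    generate die2 = mod(_n - 1, 6) + 1  // 1,2,3,4,5,6,1,2,...,6
    generate total = die1 + die2

    count if total == 5
    display "P(total = 5)  = " r(N)/36        // 4/36 = 1/9

    count if total == 12
    display "P(total = 12) = " r(N)/36        // 1/36

    * Conditional probability: restrict attention to the outcomes with die1 == 6
    count if die1 == 6
    local den = r(N)
    count if die1 == 6 & total > 8
    display "P(total > 8 | die1 = 6) = " r(N)/`den'   // 4/6 = 2/3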
Probability of A and B (1 of 2)

If A and B are independent

A and B are two events. If A and B are independent, then the probability that events A and B both occur is: p(A and B) = p(A) x p(B). In other words, the probability of A and B both occurring is the product of the probability of A and the probability of B. What is the probability that a fair coin will come up with heads twice in a row? Two events must occur: a head on the first toss and a head on the second toss. Since the probability of each event is 1/2, the probability of both events is: 1/2 x 1/2 = 1/4. Now consider a similar problem: someone draws a card at random out of a deck, replaces it, and then draws another card at random. What is the probability that the first card is the ace of clubs and the second card is a club (any club)? Since there is only one ace of clubs in the deck, the probability of the first event is 1/52. Since 13/52 = 1/4 of the deck is composed of clubs, the probability of the second event is 1/4. Therefore, the probability of both events is: 1/52 x 1/4 = 1/208.

Probability of A and B (2 of 2)

If A and B are not independent

If A and B are not independent, then the probability of A and B is: p(A and B) = p(A) x p(B|A), where p(B|A) is the conditional probability of B given A. If someone draws a card at random from a deck and then, without replacing the first card, draws a second card, what is the probability that both cards will be aces? Event A is that the first card is an ace. Since 4 of the 52 cards are aces, p(A) = 4/52 = 1/13. Given that the first card is an ace, what is the probability that the second card will be an ace as well? Of the 51 remaining cards, 3 are aces. Therefore, p(B|A) = 3/51 = 1/17, and the probability of A and B is: 1/13 x 1/17 = 1/221.
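The three worked answers above reduce to simple products, which can be verified directly in Stata:

    * Independent events: multiply the individual probabilities
    display "P(two heads in a row)        = " (1/2)*(1/2)       // 1/4
    display "P(ace of clubs, then a club) = " (1/52)*(1/4)      // 1/208

    * Dependent events: multiply p(A) by the conditional probability p(B|A)
    display "P(two aces, no replacement)  = " (4/52)*(3/51)     // 1/221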
Probability of A or B (1 of 3)

If events A and B are mutually exclusive, then the probability of A or B is simply: p(A or B) = p(A) + p(B). What is the probability of rolling a die and getting either a 1 or a 6? Since it is impossible to get both a 1 and a 6, these two events are mutually exclusive. Therefore, p(1 or 6) = p(1) + p(6) = 1/6 + 1/6 = 1/3. If the events A and B are not mutually exclusive, then p(A or B) = p(A) + p(B) - p(A and B). The logic behind this formula is that when p(A) and p(B) are added, the occasions on which A and B both occur are counted twice; to adjust for this, p(A and B) is subtracted. What is the probability that a card selected from a deck will be either an ace or a spade? The relevant probabilities are: p(ace) = 4/52 and p(spade) = 13/52.

Probability of A or B (2 of 3)

The only way in which an ace and a spade can both be drawn is to draw the ace of spades. There is only one ace of spades, so: p(ace and spade) = 1/52. The probability of an ace or a spade can be computed as: p(ace) + p(spade) - p(ace and spade) = 4/52 + 13/52 - 1/52 = 16/52 = 4/13. Consider the probability of rolling a die twice and getting a 6 on at least one of the rolls. The events are defined in the following way: Event A: 6 on the first roll, p(A) = 1/6. Event B: 6 on the second roll, p(B) = 1/6. p(A and B) = 1/6 x 1/6. p(A or B) = 1/6 + 1/6 - 1/6 x 1/6 = 11/36. The same answer can be computed using the following admittedly convoluted approach: getting a 6 on either roll is the same thing as not getting a number from 1 to 5 on both rolls. This is equal to: 1 - p(1 to 5 on both rolls).

Probability of A or B (3 of 3)

The probability of getting a number from 1 to 5 on the first roll is 5/6. Likewise, the probability of getting a number from 1 to 5 on the second roll is 5/6. Therefore, the probability of getting a number from 1 to 5 on both rolls is: 5/6 x 5/6 = 25/36. This means that the probability of not getting a 1 to 5 on both rolls (getting a 6 on at least one roll) is: 1 - 25/36 = 11/36. Despite the convoluted nature of this method, it has the advantage of being easy to generalize to three or more events. For example, the probability of rolling a die three times and getting a six on at least one of the three rolls is: 1 - 5/6 x 5/6 x 5/6 = 0.421. In general, the probability that at least one of k independent events will occur is: 1 - (1 - α)^k, where each of the events has probability α of occurring.
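The additive rules above can likewise be checked with display:

    * Mutually exclusive events: add the probabilities
    display "P(1 or 6 on one roll)            = " 1/6 + 1/6                   // 1/3

    * Not mutually exclusive: subtract the double-counted overlap
    display "P(ace or spade)                  = " 4/52 + 13/52 - 1/52         // 4/13
    display "P(at least one 6 in two rolls)   = " 1/6 + 1/6 - (1/6)*(1/6)     // 11/36
    display "  ... same via the complement:     " 1 - (5/6)^2                 // 11/36
    display "P(at least one 6 in three rolls) = " 1 - (5/6)^3                 // about 0.421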
Binomial distribution (1 of 3)

When a coin is flipped, the outcome is either a head or a tail; when a magician guesses the card selected from a deck, the magician can either be correct or incorrect; when a baby is born, the baby is either born in the month of March or is not. In each of these examples, an event has two mutually exclusive possible outcomes. For convenience, one of the outcomes can be labeled "success" and the other outcome "failure." If an event occurs N times (for example, a coin is flipped N times), then the binomial distribution can be used to determine the probability of obtaining exactly r successes in the N outcomes. The binomial probability for obtaining r successes in N trials is:

P(r) = [N! / (r! (N - r)!)] π^r (1 - π)^(N - r)

where P(r) is the probability of exactly r successes, N is the number of events, and π is the probability of success on any one trial. This formula for the binomial distribution assumes that the events:
1. are dichotomous (fall into only two categories),
2. are mutually exclusive,
3. are independent, and
4. are randomly selected.
Consider this simple application of the binomial distribution: what is the probability of obtaining exactly 3 heads if a fair coin is flipped 6 times?

Binomial distribution (2 of 3)

For this problem, N = 6, r = 3, and π = 0.5. Therefore, P(3) = [6! / (3! 3!)] (0.5)^3 (0.5)^3 = 20 x 0.015625 = 0.3125. Two binomial distributions illustrate the shape of the distribution (figure omitted): for π = 0.5 the distribution is symmetric, whereas for π = 0.3 the distribution has a positive skew.

Binomial distribution (3 of 3)

Often the cumulative form of the binomial distribution is used. To determine the probability of obtaining 3 or more successes with N = 6 and π = 0.3, you compute P(3) + P(4) + P(5) + P(6), that is, the sum of P(r) over r = 3 through 6, which is equal to 0.1852 + 0.0595 + 0.0102 + 0.0007 = 0.2556. The binomial distribution can be approximated by a normal distribution.
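Stata's built-in binomial functions give the same numbers; a quick check of the two calculations above:

    display comb(6,3) * (0.5)^3 * (0.5)^3   // the formula by hand: 20 x 0.015625 = .3125
    display binomialp(6, 3, 0.5)            // P(exactly 3 successes), same .3125
    display binomialtail(6, 3, 0.3)         // P(3 or more successes) with N = 6, pi = 0.3
                                            // ≈ .2557 (the text's rounded terms sum to .2556)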
Subjective probability (1 of 1)

For some purposes, probability is best thought of as subjective. Questions such as "What is the probability that Boston will defeat New York in an upcoming baseball game?" cannot be calculated by dividing the number of favorable outcomes by the number of possible outcomes. Rather, assigning probability 0.6 (say) to this event seems to reflect the speaker's personal opinion, perhaps his or her willingness to bet according to certain odds. Such an approach to probability, however, seems to lose the objective content of the idea of chance; probability becomes mere opinion. Two people might attach different probabilities to the outcome, yet there would be no criterion for calling one "right" and the other "wrong." We cannot call one of the two people right simply because he or she assigned a higher probability to the outcome that actually occurred. After all, you would be right to attribute probability 1/6 to throwing a six with a fair die, and your friend who attributes 2/3 to this event would be wrong. And you are still right (and your friend is still wrong) even if the die ends up showing a six! The following example illustrates the present approach to probabilities. Suppose you wish to know what the weather will be like next Saturday because you are planning a picnic. You turn on your radio, and the weather person says, "There is a 10% chance of rain." You decide to have the picnic outdoors and, lo and behold, it rains. You are furious with the weather person. But was he or she wrong? No, they did not say it would not rain, only that rain was unlikely. The weather person would have been flatly wrong only if they said that the probability is 0 and it subsequently rained. However, if you kept track of the weather predictions over a long period of time and found that it rained on 50% of the days that the weather person said the probability was 0.10, you could say his or her probability assessments are wrong. So when is it sensible to say that the probability of rain is 0.10? According to a frequency interpretation, it means that it will rain 10% of the days on which rain is forecast with this probability.

Sampling Distribution (1 of 3)

If you compute the mean of a sample of 10 numbers, the value you obtain will not equal the population mean exactly; by chance it will be a little bit higher or a little bit lower. If you sampled sets of 10 numbers over and over again (computing the mean for each set), you would find that some sample means come much closer to the population mean than others. Some would be higher than the population mean and some would be lower. Imagine sampling 10 numbers and computing the mean over and over again, say about 1,000 times, and then constructing a relative frequency distribution of those 1,000 means. This distribution of means is a very good approximation to the sampling distribution of the mean. The sampling distribution of the mean is a theoretical distribution that is approached as the number of samples in the relative frequency distribution increases. With 1,000 samples, the relative frequency distribution is quite close; with 10,000 it is even closer. As the number of samples approaches infinity, the relative frequency distribution approaches the sampling distribution.

Sampling Distribution (2 of 3)

The sampling distribution of the mean for a sample size of 10 was just an example; there is a different sampling distribution for other sample sizes. Also, keep in mind that the relative frequency distribution approaches a sampling distribution as the number of samples increases, not as the sample size increases, since there is a different sampling distribution for each sample size. A sampling distribution can also be defined as the relative frequency distribution that would be obtained if all possible samples of a particular sample size were taken. For example, the sampling distribution of the mean for a sample size of 10 would be constructed by computing the mean for each of the possible ways in which 10 scores could be sampled from the population and creating a relative frequency distribution of these means. Although these two definitions may seem different, they are actually the same: both procedures produce exactly the same sampling distribution.

Sampling Distribution (3 of 3)

Statistics other than the mean have sampling distributions too. The sampling distribution of the median is the distribution that would result if the median instead of the mean were computed in each sample. Students often define "sampling distribution" as the sampling distribution of the mean. That is a serious mistake. Sampling distributions are very important since almost all inferential statistics are based on sampling distributions.

Sampling Distribution of the Mean
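The "sample 10 numbers, compute the mean, repeat about 1,000 times" procedure described above can be run directly as a simulation. A minimal Stata sketch, assuming a hypothetical normal population with mean 50 and standard deviation 10 (the program name onesample is made up for this example):

    clear all
    set seed 12345

    program define onesample, rclass
        drawnorm x, n(10) means(50) sds(10) clear   // draw one sample of 10 scores
        quietly summarize x
        return scalar xbar = r(mean)                // keep that sample's mean
    end

    simulate xbar = r(xbar), reps(1000) nodots: onesample
    summarize xbar     // the 1,000 sample means cluster around the population mean of 50
    histogram xbar     // their relative frequency distribution approximates the
                       // sampling distribution of the mean for samples of size 10

Re-running with reps(10000) gives an even closer approximation, which is the point made in the passage above about increasing the number of samples.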
