12.4 Chi-Square Independence Test
In Section 12.3, you learned how to determine whether an association exists between two variables of a population if you have the bivariate data for the entire population.
In most cases, however, data for an entire population are not available, so you must apply inferential methods to decide whether an association exists between two variables.
One of the most commonly used procedures for making such decisions is the chi-square independence test. In the next example, we introduce and explain the reasoning behind the chi-square independence test.
EXAMPLE 12.9 Introducing the Chi-Square Independence Test
Marital Status and Drinking A national survey was conducted to obtain information on the alcohol consumption patterns of U.S. adults by marital status. A random sample of 1772 residents 18 years old and older yielded the data displayed in Table 12.13.†
TABLE 12.13 Contingency table of marital status and alcohol consumption for 1772 randomly selected U.S. adults

                         Drinks per month
Marital status    Abstain     1–60  Over 60    Total
Single                 67      213       74      354
Married               411      633      129     1173
Widowed                85       51        7      143
Divorced               27       60       15      102
Total                 590      957      225     1772

Suppose we want to use the data in Table 12.13 to decide whether marital status and alcohol consumption are associated.
a. Formulate the problem statistically by posing it as a hypothesis test.
b. Explain the basic idea for carrying out the hypothesis test.
c. Develop a formula for computing the expected frequencies.
d. Construct a table that provides both the observed frequencies in Table 12.13 and the expected frequencies.
e. Discuss the details for making a decision concerning the hypothesis test.
Solution
a. For a chi-square independence test, the null hypothesis is that the two variables are not associated; the alternative hypothesis is that the two variables are associated. Thus, we want to perform the following hypothesis test.
$H_0$: Marital status and alcohol consumption are not associated.
$H_a$: Marital status and alcohol consumption are associated.
b. The idea behind the chi-square independence test is to compare the observed frequencies in Table 12.13 with the frequencies we would expect if the null hypothesis of nonassociation is true. The test statistic for making the comparison has the same form as the one used for the goodness-of-fit test:
$$\chi^2 = \sum (O - E)^2 / E,$$
where $O$ represents observed frequency and $E$ represents expected frequency.
†Adapted from research by W. Clark and L. Midanik. In: National Institute on Alcohol Abuse and Alcoholism, Alcohol Consumption and Related Problems: Alcohol and Health Monograph 1 (DHHS Pub. No. (ADM) 82–1190).
c. To develop a formula for computing the expected frequencies, consider, for instance, the cell of Table 12.13 corresponding to “Married and Abstain,” the cell in the second row and first column. We note that the population proportion of all adults who abstain can be estimated by the sample proportion of the 1772 adults sampled who abstain, that is, by
$$\frac{\text{Number sampled who abstain}}{\text{Total number sampled}} = \frac{590}{1772} = 0.333 \text{ or } 33.3\%.$$
If no association exists between marital status and alcohol consumption (i.e., if $H_0$ is true), then the proportion of married adults who abstain is the same as the proportion of all adults who abstain. Therefore, of the 1173 married adults sampled, we would expect about
$$\frac{590}{1772} \cdot 1173 = 390.6$$
to abstain from alcohol.
Let’s rewrite the left side of this expected-frequency computation in a slightly different way. By using algebra and referring to Table 12.13, we obtain
$$\text{Expected frequency} = \frac{590}{1772} \cdot 1173 = \frac{1173 \cdot 590}{1772} = \frac{(\text{Row total}) \cdot (\text{Column total})}{\text{Sample size}}.$$
If we let $R$ denote “Row total” and $C$ denote “Column total,” we can write this equation as
$$E = \frac{R \cdot C}{n}, \tag{12.1}$$
where, as usual, $E$ denotes expected frequency and $n$ denotes sample size.
d. Using Equation (12.1), we can calculate the expected frequencies for all the cells in Table 12.13. For the cell in the upper right corner of the table, we get
$$E = \frac{R \cdot C}{n} = \frac{354 \cdot 225}{1772} = 44.9.$$
In Table 12.14, we have modified Table 12.13 by including each expected frequency beneath the corresponding observed frequency. Table 12.14 shows, for instance, that of the adults sampled, 74 were observed to be single and consume more than 60 drinks per month, whereas if marital status and alcohol consumption are not associated, the expected frequency is 44.9.

TABLE 12.14 Observed and expected frequencies for marital status and alcohol consumption (expected frequencies printed below observed frequencies)

                         Drinks per month
Marital status    Abstain     1–60  Over 60    Total
Single                 67      213       74      354
                    117.9    191.2     44.9
Married               411      633      129     1173
                    390.6    633.5    148.9
Widowed                85       51        7      143
                     47.6     77.2     18.2
Divorced               27       60       15      102
                     34.0     55.1     13.0
Total                 590      957      225     1772

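The expected frequencies in Table 12.14 can be reproduced with a short script. The following is a minimal sketch, assuming the observed counts of Table 12.13 are typed in by hand; the function name `expected_frequencies` and the dictionary layout are illustrative choices, not part of the text.

```python
# Observed frequencies from Table 12.13.
# Rows: marital status; columns: Abstain, 1-60, Over 60 (drinks per month).
observed = {
    "Single":   [67, 213, 74],
    "Married":  [411, 633, 129],
    "Widowed":  [85, 51, 7],
    "Divorced": [27, 60, 15],
}


def expected_frequencies(table):
    """Return {row label: [E, ...]} using E = R * C / n for every cell."""
    row_totals = {label: sum(counts) for label, counts in table.items()}
    col_totals = [sum(col) for col in zip(*table.values())]
    n = sum(row_totals.values())  # sample size
    return {
        label: [row_totals[label] * c / n for c in col_totals]
        for label in table
    }


expected = expected_frequencies(observed)
for label, row in expected.items():
    # Print each row rounded to one decimal place, as in Table 12.14.
    print(f"{label:<9}" + "".join(f"{e:8.1f}" for e in row))
```

Rounded to one decimal place, the printed values should agree with the second line of each row in Table 12.14; for example, the "Married and Abstain" cell evaluates to about 390.6.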
e. If the null hypothesis of nonassociation is true, the observed and expected frequencies should be approximately equal, which would result in a relatively small value of the test statistic, $\chi^2 = \sum (O - E)^2 / E$. Consequently, if $\chi^2$ is too large, we reject the null hypothesis and conclude that an association exists between marital status and alcohol consumption. From Table 12.14, we find that
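$$
\begin{aligned}
\chi^2 &= \sum (O - E)^2 / E \\
&= (67 - 117.9)^2/117.9 + (213 - 191.2)^2/191.2 + (74 - 44.9)^2/44.9 \\
&\quad + (411 - 390.6)^2/390.6 + (633 - 633.5)^2/633.5 + (129 - 148.9)^2/148.9 \\
&\quad + (85 - 47.6)^2/47.6 + (51 - 77.2)^2/77.2 + (7 - 18.2)^2/18.2 \\
&\quad + (27 - 34.0)^2/34.0 + (60 - 55.1)^2/55.1 + (15 - 13.0)^2/13.0 \\
&= 21.952 + 2.489 + 18.776 + 1.070 + 0.000 + 2.670 \\
&\quad + 29.358 + 8.908 + 6.856 + 1.427 + 0.438 + 0.324 \\
&= 94.269.^{\dagger}
\end{aligned}
$$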
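The computation of the test statistic can be checked numerically. The sketch below is one possible way to do it, assuming the counts of Table 12.13 are entered as a list of rows; the function name `chi_square_statistic` is hypothetical. Expected frequencies are kept at full floating-point accuracy rather than rounded to one decimal place.

```python
# Observed frequencies from Table 12.13 (rows: Single, Married, Widowed, Divorced).
observed = [
    [67, 213, 74],
    [411, 633, 129],
    [85, 51, 7],
    [27, 60, 15],
]


def chi_square_statistic(table):
    """Sum (O - E)^2 / E over all cells, with E = R * C / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # unrounded expected frequency
            total += (o - e) ** 2 / e              # chi-square subtotal for this cell
    return total


chi2 = chi_square_statistic(observed)
print(f"chi-square = {chi2:.3f}")
```

Because the expected frequencies are not rounded before use, the result should match the value 94.269 reported above to three decimal places.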
Can this value be reasonably attributed to sampling error, or is it large enough to indicate that marital status and alcohol consumption are associated? Before we can answer that question, we must know the distribution of the $\chi^2$-statistic.
First we present the formula for expected frequencies in a chi-square independence test, as discussed in the preceding example.
FORMULA 12.2 Expected Frequencies for an Independence Test
In a chi-square independence test, the expected frequency for each cell is found by using the formula
$$E = \frac{R \cdot C}{n},$$
where $R$ is the row total, $C$ is the column total, and $n$ is the sample size.
What Does It Mean?
To obtain an expected frequency, multiply the row total by the column total and divide by the sample size.
Now we provide the distribution of the test statistic for a chi-square independence test.
KEY FACT 12.3 Distribution of the $\chi^2$-Statistic for a Chi-Square Independence Test
For a chi-square independence test, the test statistic
$$\chi^2 = \sum (O - E)^2 / E$$
has approximately a chi-square distribution if the null hypothesis of nonassociation is true. The number of degrees of freedom is $(r - 1)(c - 1)$, where $r$ and $c$ are the number of possible values for the two variables under consideration.
What Does It Mean?
To obtain a chi-square subtotal, square the difference between an observed and expected frequency and divide the result by the expected frequency. Adding the chi-square subtotals gives the $\chi^2$-statistic, which has approximately a chi-square distribution.
†Although we have displayed the expected frequencies to one decimal place and the chi-square subtotals to three decimal places, the calculations were made at full calculator accuracy.
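As an illustration of the degrees-of-freedom rule in Key Fact 12.3, consider Table 12.13, where marital status has four possible values and alcohol consumption has three. Taking $r = 4$ and $c = 3$ (the "Total" row and column are not counted), the rule gives:

```latex
\[
\text{df} = (r - 1)(c - 1) = (4 - 1)(3 - 1) = 3 \cdot 2 = 6 .
\]
```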
Procedure for the Chi-Square Independence Test
In light of Key Fact 12.3, we present, in Procedure 12.2, a step-by-step method for conducting a chi-square independence test by using either the critical-value approach or the P-value approach. Because the null hypothesis is rejected only when the test statistic is too large, a chi-square independence test is always right tailed.
PROCEDURE 12.2 Chi-Square Independence Test
Purpose To perform a hypothesis test to decide whether two variables are associated
Assumptions
1. All expected frequencies are 1 or greater
2. At most 20% of the expected frequencies are less than 5
3. Simple random sample
Step 1 The null and alternative hypotheses are, respectively,
$H_0$: The two variables are not associated.
$H_a$: The two variables are associated.
Step 2 Decide on the significance level, $\alpha$.
Step 3 Compute the value of the test statistic
$$\chi^2 = \sum (O - E)^2 / E,$$
where $O$ and $E$ represent observed and expected frequencies, respectively. Denote the value of the test statistic $\chi^2_0$.
CRITICAL-VALUE APPROACH OR P-VALUE APPROACH
Step 4 The critical value is $\chi^2_\alpha$ with df $= (r - 1) \times (c - 1)$, where $r$ and $c$ are the number of possible values for the two variables. Use Table V to find the critical value.
[Figure: $\chi^2$-curve with the horizontal axis split into a "Do not reject $H_0$" region and a "Reject $H_0$" region.]
Step 5 If the value of the test statistic falls in the rejection region, reject $H_0$; otherwise, do not reject $H_0$.
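The critical-value steps of Procedure 12.2 can be sketched in code. This is an illustrative outline, not a definitive implementation: the function name `independence_test` is hypothetical, the critical value is passed in by the caller (looked up in a chi-square table, as Step 4 directs), and the significance level of 0.05 used in the example is an assumption chosen for illustration.

```python
def independence_test(observed, critical_value):
    """Critical-value approach for an r x c table of observed counts.

    critical_value must be looked up by the caller for the chosen
    significance level and df = (r - 1)(c - 1).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Assumptions 1 and 2. Assumption 3 (simple random sample) concerns
    # how the data were collected and cannot be checked from the counts.
    cells = [e for row in expected for e in row]
    if min(cells) < 1:
        raise ValueError("an expected frequency is less than 1")
    if sum(e < 5 for e in cells) > 0.20 * len(cells):
        raise ValueError("more than 20% of expected frequencies are less than 5")

    # Step 3: test statistic, the sum of (O - E)^2 / E over all cells.
    chi2 = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)

    # Step 5: right-tailed test, so reject H0 when the statistic is at or
    # beyond the critical value.
    return {"chi2": chi2, "df": df, "reject_H0": chi2 >= critical_value}


# Observed frequencies from Table 12.13.
observed = [
    [67, 213, 74],
    [411, 633, 129],
    [85, 51, 7],
    [27, 60, 15],
]

# Assumption for illustration: alpha = 0.05. For df = 6, the tabled
# chi-square critical value at that level is 12.592.
result = independence_test(observed, critical_value=12.592)
print(result)
```

Passing the critical value as an argument keeps the sketch free of third-party dependencies and mirrors the table lookup in Step 4. With the marital-status data and the assumed 5% level, the statistic far exceeds the critical value, so the sketch reports rejection of the null hypothesis.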