Chi-Square Independence Test


12.4 Chi-Square Independence Test

In Section 12.3, you learned how to determine whether an association exists between two variables of a population if you have the bivariate data for the entire population.

However, because data for an entire population are usually not available, you must generally apply inferential methods to decide whether an association exists between two variables.

One of the most commonly used procedures for making such decisions is the chi-square independence test. In the next example, we introduce and explain the reasoning behind the chi-square independence test.

EXAMPLE 12.9 Introducing the Chi-Square Independence Test

Marital Status and Drinking. A national survey was conducted to obtain information on the alcohol consumption patterns of U.S. adults by marital status. A random sample of 1772 residents 18 years old and older yielded the data displayed in Table 12.13.†

TABLE 12.13 Contingency table of marital status and alcohol consumption for 1772 randomly selected U.S. adults

                           Drinks per month
Marital status    Abstain    1–60    Over 60    Total
Single                 67     213         74      354
Married               411     633        129     1173
Widowed                85      51          7      143
Divorced               27      60         15      102
Total                 590     957        225     1772

Suppose we want to use the data in Table 12.13 to decide whether marital status and alcohol consumption are associated.

a. Formulate the problem statistically by posing it as a hypothesis test.

b. Explain the basic idea for carrying out the hypothesis test.

c. Develop a formula for computing the expected frequencies.

d. Construct a table that provides both the observed frequencies in Table 12.13 and the expected frequencies.

e. Discuss the details for making a decision concerning the hypothesis test.

Solution

a. For a chi-square independence test, the null hypothesis is that the two variables are not associated; the alternative hypothesis is that the two variables are associated. Thus, we want to perform the following hypothesis test.

$H_0$: Marital status and alcohol consumption are not associated.

$H_a$: Marital status and alcohol consumption are associated.

b. The idea behind the chi-square independence test is to compare the observed frequencies in Table 12.13 with the frequencies we would expect if the null hypothesis of nonassociation is true. The test statistic for making the comparison has the same form as the one used for the goodness-of-fit test:

$$\chi^2 = \sum \frac{(O - E)^2}{E},$$

where $O$ represents observed frequency, $E$ represents expected frequency, and the sum is taken over all cells of the contingency table.

†Adapted from research by W. Clark and L. Midanik. In: National Institute on Alcohol Abuse and Alcoholism, Alcohol Consumption and Related Problems: Alcohol and Health Monograph 1 (DHHS Pub. No. (ADM) 82–1190).

c. To develop a formula for computing the expected frequencies, consider, for instance, the cell of Table 12.13 corresponding to “Married and Abstain,” the cell in the second row and first column. We note that the population proportion of all adults who abstain can be estimated by the sample proportion of the 1772 adults sampled who abstain, that is, by

$$\frac{\text{Number sampled who abstain}}{\text{Total number sampled}} = \frac{590}{1772} = 0.333 \text{ or } 33.3\%.$$

If no association exists between marital status and alcohol consumption (i.e., if $H_0$ is true), then the proportion of married adults who abstain is the same as the proportion of all adults who abstain. Therefore, of the 1173 married adults sampled, we would expect about

$$\frac{590}{1772} \cdot 1173 = 390.6$$

to abstain from alcohol.

Let’s rewrite the left side of this expected-frequency computation in a slightly different way. By using algebra and referring to Table 12.13, we obtain

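$$
\begin{aligned}
\text{Expected frequency} &= \frac{590}{1772} \cdot 1173 \\
&= \frac{1173 \cdot 590}{1772} \\
&= \frac{(\text{Row total}) \cdot (\text{Column total})}{\text{Sample size}}.
\end{aligned}
$$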

If we let $R$ denote “Row total” and $C$ denote “Column total,” we can write this equation as

$$E = \frac{R \cdot C}{n}, \qquad (12.1)$$

where, as usual, $E$ denotes expected frequency and $n$ denotes sample size.

d. Using Equation (12.1), we can calculate the expected frequencies for all the cells in Table 12.13. For the cell in the upper right corner of the table, we get

$$E = \frac{R \cdot C}{n} = \frac{354 \cdot 225}{1772} = 44.9.$$

In Table 12.14, we have modified Table 12.13 by including each expected frequency beneath the corresponding observed frequency. Table 12.14 shows, for instance, that of the adults sampled, 74 were observed to be single and consume more than 60 drinks per month, whereas if marital status and alcohol consumption are not associated, the expected frequency is 44.9.

TABLE 12.14 Observed and expected frequencies for marital status and alcohol consumption (expected frequencies printed below observed frequencies)

                           Drinks per month
Marital status    Abstain    1–60    Over 60    Total
Single                 67     213         74      354
                    117.9   191.2       44.9
Married               411     633        129     1173
                    390.6   633.5      148.9
Widowed                85      51          7      143
                     47.6    77.2       18.2
Divorced               27      60         15      102
                     34.0    55.1       13.0
Total                 590     957        225     1772
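The expected frequencies in Table 12.14 can be checked with a few lines of code. The following is a minimal sketch in Python that applies Equation (12.1) to every cell of Table 12.13; the function name `expected_frequencies` and the dictionary layout are illustrative choices, not part of the text.

```python
# Observed frequencies from Table 12.13.
# Rows: marital status; columns: Abstain, 1-60, Over 60 drinks per month.
observed = {
    "Single":   [67, 213, 74],
    "Married":  [411, 633, 129],
    "Widowed":  [85, 51, 7],
    "Divorced": [27, 60, 15],
}

def expected_frequencies(table):
    """Apply E = R * C / n (Equation 12.1) to every cell of the table."""
    row_totals = {row: sum(counts) for row, counts in table.items()}
    col_totals = [sum(col) for col in zip(*table.values())]
    n = sum(row_totals.values())  # sample size
    return {
        row: [row_totals[row] * c / n for c in col_totals]
        for row in table
    }

expected = expected_frequencies(observed)
for row, values in expected.items():
    # Rounded to one decimal place, as displayed in Table 12.14.
    print(row, [round(e, 1) for e in values])
```

Rounded to one decimal place, the printed values should agree with the expected frequencies shown in Table 12.14.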

e. If the null hypothesis of nonassociation is true, the observed and expected frequencies should be approximately equal, which would result in a relatively small value of the test statistic, $\chi^2 = \sum (O - E)^2 / E$. Consequently, if $\chi^2$ is too large, we reject the null hypothesis and conclude that an association exists between marital status and alcohol consumption. From Table 12.14, we find that

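$$
\begin{aligned}
\chi^2 &= \sum (O - E)^2 / E \\
&= (67 - 117.9)^2/117.9 + (213 - 191.2)^2/191.2 + (74 - 44.9)^2/44.9 \\
&\quad + (411 - 390.6)^2/390.6 + (633 - 633.5)^2/633.5 + (129 - 148.9)^2/148.9 \\
&\quad + (85 - 47.6)^2/47.6 + (51 - 77.2)^2/77.2 + (7 - 18.2)^2/18.2 \\
&\quad + (27 - 34.0)^2/34.0 + (60 - 55.1)^2/55.1 + (15 - 13.0)^2/13.0 \\
&= 21.952 + 2.489 + 18.776 + 1.070 + 0.000 + 2.670 \\
&\quad + 29.358 + 8.908 + 6.856 + 1.427 + 0.438 + 0.324 \\
&= 94.269.^{\dagger}
\end{aligned}
$$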

Can this value be reasonably attributed to sampling error, or is it large enough to indicate that marital status and alcohol consumption are associated? Before we can answer that question, we must know the distribution of the $\chi^2$-statistic.
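The value of the test statistic can be reproduced at full floating-point accuracy rather than from the rounded table entries. The sketch below, in Python, recomputes each expected frequency from the row and column totals of Table 12.13 and sums the chi-square subtotals; the function name `chi_square_statistic` is an illustrative choice.

```python
# Observed frequencies from Table 12.13 (columns: Abstain, 1-60, Over 60).
observed = [
    [67, 213, 74],    # Single
    [411, 633, 129],  # Married
    [85, 51, 7],      # Widowed
    [27, 60, 15],     # Divorced
]

def chi_square_statistic(table):
    """Sum (O - E)^2 / E over all cells, with E = R * C / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # unrounded expected frequency
            total += (o - e) ** 2 / e              # chi-square subtotal for this cell
    return total

chi2 = chi_square_statistic(observed)
print(round(chi2, 3))
```

Because the expected frequencies are not rounded before use, the result should match the value 94.269 reported above, up to rounding in the last displayed digit.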

First we present the formula for expected frequencies in a chi-square independence test, as discussed in the preceding example.

FORMULA 12.2 Expected Frequencies for an Independence Test

In a chi-square independence test, the expected frequency for each cell is found by using the formula

$$E = \frac{R \cdot C}{n},$$

where $R$ is the row total, $C$ is the column total, and $n$ is the sample size.

What Does It Mean?

To obtain an expected frequency, multiply the row total by the column total and divide by the sample size.

Now we provide the distribution of the test statistic for a chi-square independence test.

KEY FACT 12.3 Distribution of the $\chi^2$-Statistic for a Chi-Square Independence Test

For a chi-square independence test, the test statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

has approximately a chi-square distribution if the null hypothesis of nonassociation is true. The number of degrees of freedom is $(r - 1)(c - 1)$, where $r$ and $c$ are the number of possible values for the two variables under consideration.
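As a quick illustration of the degrees-of-freedom rule, consider the marital status and alcohol consumption data of Table 12.13, where marital status has four possible values and drinks per month has three. The short Python sketch below evaluates $(r - 1)(c - 1)$ for that case; the helper name `degrees_of_freedom` is hypothetical.

```python
def degrees_of_freedom(r, c):
    """Return (r - 1)(c - 1) for variables with r and c possible values."""
    return (r - 1) * (c - 1)

# Table 12.13: 4 marital-status categories and 3 drinking categories.
df = degrees_of_freedom(4, 3)
print(df)  # 6
```

Note that the "Total" row and column of a contingency table are not counted when determining $r$ and $c$.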

What Does It Mean?

To obtain a chi-square subtotal, square the difference between an observed and expected frequency and divide the result by the expected frequency. Adding the chi-square subtotals gives the $\chi^2$-statistic, which has approximately a chi-square distribution.

†Although we have displayed the expected frequencies to one decimal place and the chi-square subtotals to three decimal places, the calculations were made at full calculator accuracy.

Procedure for the Chi-Square Independence Test

In light of Key Fact 12.3, we present, in Procedure 12.2, a step-by-step method for conducting a chi-square independence test by using either the critical-value approach or the P-value approach. Because the null hypothesis is rejected only when the test statistic is too large, a chi-square independence test is always right tailed.

PROCEDURE 12.2 Chi-Square Independence Test

Purpose To perform a hypothesis test to decide whether two variables are associated

Assumptions

1. All expected frequencies are 1 or greater

2. At most 20% of the expected frequencies are less than 5

3. Simple random sample

Step 1 The null and alternative hypotheses are, respectively,

$H_0$: The two variables are not associated.

$H_a$: The two variables are associated.

Step 2 Decide on the significance level, $\alpha$.

Step 3 Compute the value of the test statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E},$$

where $O$ and $E$ represent observed and expected frequencies, respectively. Denote the value of the test statistic $\chi^2_0$.

CRITICAL-VALUE APPROACH OR P-VALUE APPROACH

Step 4 The critical value is $\chi^2_\alpha$ with $df = (r - 1)(c - 1)$, where $r$ and $c$ are the number of possible values for the two variables. Use Table V to find the critical value.

[Figure: chi-square curve showing the "Do not reject $H_0$" region and, in the right tail, the "Reject $H_0$" region]

Step 5 If the value of the test statistic falls in the rejection region, reject $H_0$; otherwise, do not reject $H_0$.
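The critical-value steps above can be sketched in code for the marital status and alcohol consumption data. The following Python sketch is illustrative only and rests on two assumptions that are not in the text: a significance level of 0.05, and the corresponding critical value 12.592 for df = 6, taken from a standard chi-square table. It checks Assumptions 1 and 2, computes the test statistic, and applies the decision rule of Step 5; Assumption 3 (simple random sample) concerns how the data were collected and cannot be checked from the table.

```python
# Observed frequencies from Table 12.13 (columns: Abstain, 1-60, Over 60).
observed = [
    [67, 213, 74],    # Single
    [411, 633, 129],  # Married
    [85, 51, 7],      # Widowed
    [27, 60, 15],     # Divorced
]

ALPHA = 0.05             # assumed significance level (not specified in the text)
CRITICAL_VALUE = 12.592  # assumed: chi-square critical value for df = 6, alpha = 0.05

def independence_test(table, critical_value):
    """Critical-value approach for a chi-square independence test."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Assumptions 1 and 2: all E >= 1, and at most 20% of the E values below 5.
    cells = [e for row in expected for e in row]
    assumptions_ok = (min(cells) >= 1
                      and sum(e < 5 for e in cells) <= 0.20 * len(cells))

    # Step 3: test statistic, summed over all cells.
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

    # Step 4: degrees of freedom that the critical value must correspond to.
    df = (len(table) - 1) * (len(table[0]) - 1)

    # Step 5: right-tailed test, so reject H0 when the statistic reaches the critical value.
    reject = chi2 >= critical_value
    return assumptions_ok, chi2, df, reject

assumptions_ok, chi2, df, reject = independence_test(observed, CRITICAL_VALUE)
print(assumptions_ok, round(chi2, 3), df, reject)
```

Under these assumptions the test statistic far exceeds the critical value, so the sketch rejects the null hypothesis of nonassociation. A different significance level would require a different critical value from the table.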
