Chi-Square Independence Test


13.4 Chi-Square Independence Test

In Section 13.3, you learned how to determine whether an association exists between two variables of a population if you have the bivariate data for the entire population.

However, because data for an entire population are usually not available, you must generally apply inferential methods to decide whether an association exists between two variables.

One of the most commonly used procedures for making such decisions is the chi-square independence test. In the next example, we introduce and explain the reasoning behind the chi-square independence test.

EXAMPLE 13.9 Introducing the Chi-Square Independence Test

Marital Status and Drinking A national survey was conducted to obtain information on the alcohol consumption patterns of U.S. adults by marital status. A random sample of 1772 residents 18 years old and older yielded the data displayed in Table 13.13.†

TABLE 13.13 Contingency table of marital status and alcohol consumption for 1772 randomly selected U.S. adults

                      Drinks per month
Marital status   Abstain    1–60  Over 60   Total
Single                67     213       74     354
Married              411     633      129    1173
Widowed               85      51        7     143
Divorced              27      60       15     102
Total                590     957      225    1772

Suppose we want to use the data in Table 13.13 to decide whether marital status and alcohol consumption are associated.

a. Formulate the problem statistically by posing it as a hypothesis test.

b. Explain the basic idea for carrying out the hypothesis test.

c. Develop a formula for computing the expected frequencies.

d. Construct a table that provides both the observed frequencies in Table 13.13 and the expected frequencies.

e. Discuss the details for making a decision concerning the hypothesis test.

Solution

a. For a chi-square independence test, the null hypothesis is that the two variables are not associated; the alternative hypothesis is that the two variables are associated. Thus, we want to perform the following hypothesis test.

$H_0$: Marital status and alcohol consumption are not associated.

$H_a$: Marital status and alcohol consumption are associated.

b. The idea behind the chi-square independence test is to compare the observed frequencies in Table 13.13 with the frequencies we would expect if the null hypothesis of nonassociation is true. The test statistic for making the comparison has the same form as the one used for the goodness-of-fit test:

$\chi^2 = \sum (O - E)^2 / E$, where $O$ represents observed frequency and $E$ represents expected frequency.

†Adapted from research by W. Clark and L. Midanik. In: National Institute on Alcohol Abuse and Alcoholism, Alcohol Consumption and Related Problems: Alcohol and Health Monograph 1 (DHHS Pub. No. (ADM) 82–1190).

604 CHAPTER 13 Chi-Square Procedures

c. To develop a formula for computing the expected frequencies, consider, for instance, the cell of Table 13.13 corresponding to “Married and Abstain,” the cell in the second row and first column. We note that the population proportion of all adults who abstain can be estimated by the sample proportion of the 1772 adults sampled who abstain, that is, by

$$\frac{\text{Number sampled who abstain}}{\text{Total number sampled}} = \frac{590}{1772} = 0.333 \text{ or } 33.3\%.$$

If no association exists between marital status and alcohol consumption (i.e., if $H_0$ is true), then the proportion of married adults who abstain is the same as the proportion of all adults who abstain. Therefore, of the 1173 married adults sampled, we would expect about

$\frac{590}{1772} \cdot 1173 = 390.6$ to abstain from alcohol.

Let’s rewrite the left side of this expected-frequency computation in a slightly different way. By using algebra and referring to Table 13.13, we obtain

$$\text{Expected frequency} = \frac{590}{1772} \cdot 1173 = \frac{1173 \cdot 590}{1772} = \frac{(\text{Row total}) \cdot (\text{Column total})}{\text{Sample size}}.$$

If we let $R$ denote “Row total” and $C$ denote “Column total,” we can write this equation as

$$E = \frac{R \cdot C}{n}, \tag{13.1}$$

where, as usual, $E$ denotes expected frequency and $n$ denotes sample size.

d. Using Equation (13.1), we can calculate the expected frequencies for all the cells in Table 13.13. For the cell in the upper right corner of the table, we get

$$E = \frac{R \cdot C}{n} = \frac{354 \cdot 225}{1772} = 44.9.$$

In Table 13.14, we have modified Table 13.13 by including each expected frequency beneath the corresponding observed frequency.

TABLE 13.14 Observed and expected frequencies for marital status and alcohol consumption (expected frequencies printed below observed frequencies)

                      Drinks per month
Marital status   Abstain    1–60  Over 60   Total
Single                67     213       74     354
                   117.9   191.2     44.9
Married              411     633      129    1173
                   390.6   633.5    148.9
Widowed               85      51        7     143
                    47.6    77.2     18.2
Divorced              27      60       15     102
                    34.0    55.1     13.0
Total                590     957      225    1772

Table 13.14 shows, for instance, that of the adults sampled, 74 were observed to be single and consume more than 60 drinks per month, whereas if marital status and alcohol consumption are not associated, the expected frequency is 44.9.
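As an illustration of parts (c) and (d), the following minimal Python sketch applies Equation (13.1) to every cell of Table 13.13. The function name `expected_frequencies` and the list-of-lists layout are choices made for this illustration only; they do not come from the text.

```python
# Observed frequencies from Table 13.13.
# Rows: Single, Married, Widowed, Divorced.
# Columns: Abstain, 1-60, Over 60.
observed = [
    [67, 213, 74],
    [411, 633, 129],
    [85, 51, 7],
    [27, 60, 15],
]


def expected_frequencies(table):
    """Return E = R * C / n for every cell of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)

# Print the expected frequencies rounded to one decimal place, as in Table 13.14.
for row in expected:
    print([round(e, 1) for e in row])
```

The first printed row is `[117.9, 191.2, 44.9]`, which agrees with the expected frequencies for single adults. Each row of expected frequencies also sums to its row total, which is a quick check on the arithmetic.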

e. If the null hypothesis of nonassociation is true, the observed and expected frequencies should be approximately equal, which would result in a relatively small value of the test statistic, $\chi^2 = \sum (O - E)^2 / E$. Consequently, if $\chi^2$ is too large, we reject the null hypothesis and conclude that an association exists between marital status and alcohol consumption. From Table 13.14, we find that

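$$
\begin{aligned}
\chi^2 &= \sum (O - E)^2 / E \\
&= (67 - 117.9)^2/117.9 + (213 - 191.2)^2/191.2 + (74 - 44.9)^2/44.9 \\
&\quad + (411 - 390.6)^2/390.6 + (633 - 633.5)^2/633.5 + (129 - 148.9)^2/148.9 \\
&\quad + (85 - 47.6)^2/47.6 + (51 - 77.2)^2/77.2 + (7 - 18.2)^2/18.2 \\
&\quad + (27 - 34.0)^2/34.0 + (60 - 55.1)^2/55.1 + (15 - 13.0)^2/13.0 \\
&= 21.952 + 2.489 + 18.776 + 1.070 + 0.000 + 2.670 \\
&\quad + 29.358 + 8.908 + 6.856 + 1.427 + 0.438 + 0.324 \\
&= 94.269.^{\dagger}
\end{aligned}
$$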

Can this value be reasonably attributed to sampling error, or is it large enough to indicate that marital status and alcohol consumption are associated? Before we can answer that question, we must know the distribution of the $\chi^2$-statistic.
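The computation in part (e) can be reproduced with a short Python sketch. It works from the unrounded expected frequencies, so the result should land very close to the reported 94.269 rather than to the sum of the rounded subtotals. The variable names are illustrative only.

```python
# Observed frequencies from Table 13.13
# (rows: Single, Married, Widowed, Divorced; columns: Abstain, 1-60, Over 60).
observed = [
    [67, 213, 74],
    [411, 633, 129],
    [85, 51, 7],
    [27, 60, 15],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Add up the chi-square subtotals (O - E)^2 / E over all twelve cells.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # Equation (13.1)
        chi_square += (o - e) ** 2 / e

print(round(chi_square, 3))
```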

First we present the formula for expected frequencies in a chi-square independence test, as discussed in the preceding example.

FORMULA 13.2 Expected Frequencies for an Independence Test

In a chi-square independence test, the expected frequency for each cell is found by using the formula

$$E = \frac{R \cdot C}{n},$$

where $R$ is the row total, $C$ is the column total, and $n$ is the sample size.

What Does It Mean?

To obtain an expected frequency, multiply the row total by the column total and divide by the sample size.

Now we provide the distribution of the test statistic for a chi-square independence test.

KEY FACT 13.3 Distribution of the $\chi^2$-Statistic for a Chi-Square Independence Test

For a chi-square independence test, the test statistic

$$\chi^2 = \sum (O - E)^2 / E$$

has approximately a chi-square distribution if the null hypothesis of nonassociation is true. The number of degrees of freedom is $(r - 1)(c - 1)$, where $r$ and $c$ are the number of possible values for the two variables under consideration.

What Does It Mean?

To obtain a chi-square subtotal, square the difference between an observed and expected frequency and divide the result by the expected frequency. Adding the chi-square subtotals gives the $\chi^2$-statistic, which has approximately a chi-square distribution.

†Although we have displayed the expected frequencies to one decimal place and the chi-square subtotals to three decimal places, the calculations were made at full calculator accuracy.
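One informal way to see Key Fact 13.3 at work is to simulate samples in which the two variables really are not associated and look at how the statistic behaves. The sketch below is only an illustration under assumed settings: the category probabilities, the sample size of 400, the 1000 repetitions, and the function names are all hypothetical choices, not values from the text. It relies on two standard facts: a chi-square distribution has mean equal to its degrees of freedom, and for df = 6 the value 12.592 cuts off a right-tail area of 0.05.

```python
import random


def chi_square_statistic(table):
    """Sum of (O - E)^2 / E over all cells, with E = R * C / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            total += (o - e) ** 2 / e
    return total


def simulate_null(row_probs, col_probs, n, reps, seed=0):
    """Draw `reps` samples of size n with the row and column categories
    chosen independently, and return the statistic for each sample.
    Assumes every category shows up at least once in each sample."""
    rng = random.Random(seed)
    r, c = len(row_probs), len(col_probs)
    stats = []
    for _ in range(reps):
        table = [[0] * c for _ in range(r)]
        rows = rng.choices(range(r), weights=row_probs, k=n)
        cols = rng.choices(range(c), weights=col_probs, k=n)
        for i, j in zip(rows, cols):
            table[i][j] += 1
        stats.append(chi_square_statistic(table))
    return stats


# Hypothetical category probabilities for a 4-by-3 table.
stats = simulate_null([0.4, 0.3, 0.2, 0.1], [0.5, 0.3, 0.2], n=400, reps=1000)

df = (4 - 1) * (3 - 1)
mean_stat = sum(stats) / len(stats)
share_above = sum(s > 12.592 for s in stats) / len(stats)
print(df, round(mean_stat, 2), round(share_above, 3))
```

If the approximation is working, the simulated mean should be near 6 and roughly 5% of the simulated statistics should exceed 12.592.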


Procedure for the Chi-Square Independence Test

In light of Key Fact 13.3, we present, in Procedure 13.2, a step-by-step method for conducting a chi-square independence test by using either the critical-value approach or the P-value approach. Because the null hypothesis is rejected only when the test statistic is too large, a chi-square independence test is always right tailed.

PROCEDURE 13.2 Chi-Square Independence Test

Purpose To perform a hypothesis test to decide whether two variables are associated

Assumptions

1. All expected frequencies are 1 or greater

2. At most 20% of the expected frequencies are less than 5

3. Simple random sample

Step 1 The null and alternative hypotheses are, respectively,

$H_0$: The two variables are not associated.

$H_a$: The two variables are associated.

Step 2 Decide on the significance level, $\alpha$.

Step 3 Compute the value of the test statistic

$$\chi^2 = \sum (O - E)^2 / E,$$

where $O$ and $E$ represent observed and expected frequencies, respectively. Denote the value of the test statistic $\chi^2_0$.

CRITICAL-VALUE APPROACH

Step 4 The critical value is $\chi^2_\alpha$ with df $= (r - 1)(c - 1)$, where $r$ and $c$ are the number of possible values for the two variables. Use Table VII to find the critical value.

[Figure: chi-square curve showing the nonrejection region (Do not reject $H_0$) and the rejection region (Reject $H_0$)]

Step 5 If the value of the test statistic falls in the rejection region, reject $H_0$; otherwise, do not reject $H_0$.

OR

P-VALUE APPROACH

Step 4 The $\chi^2$-statistic has df $= (r - 1)(c - 1)$, where $r$ and $c$ are the number of possible values for the two variables. Use Table VII to estimate the P-value, or obtain it exactly by using technology.

[Figure: chi-square curve showing the P-value and the observed value $\chi^2_0$]

Step 5 If $P \le \alpha$, reject $H_0$; otherwise, do not reject $H_0$.

Step 6 Interpret the results of the hypothesis test.

Note: Regarding Assumptions 1 and 2, in many texts the rule given is that all expected frequencies be 5 or greater. However, research by the noted statistician W. G. Cochran shows that the “rule of 5” is too restrictive. See, for instance, W. G. Cochran, “Some Methods for Strengthening the Common $\chi^2$ Tests” (Biometrics, Vol. 10, No. 4, pp. 417–451).
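Assumptions 1 and 2 of Procedure 13.2 are easy to check mechanically once the expected frequencies are known. The following Python sketch shows one way to do so; the function name `expected_frequency_conditions_hold` is hypothetical, and Assumption 3 (simple random sample) concerns how the data were collected, so it cannot be verified from the table itself.

```python
def expected_frequency_conditions_hold(expected):
    """Check Assumptions 1 and 2 of Procedure 13.2 on a table of
    expected frequencies (a list of rows)."""
    cells = [e for row in expected for e in row]
    # Assumption 1: all expected frequencies are 1 or greater.
    all_at_least_1 = all(e >= 1 for e in cells)
    # Assumption 2: at most 20% of the expected frequencies are less than 5.
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return all_at_least_1 and share_below_5 <= 0.20


# Expected frequencies as displayed in Table 13.14.
expected_13_14 = [
    [117.9, 191.2, 44.9],
    [390.6, 633.5, 148.9],
    [47.6, 77.2, 18.2],
    [34.0, 55.1, 13.0],
]

print(expected_frequency_conditions_hold(expected_13_14))
```

For Table 13.14 the check succeeds, because the smallest expected frequency is 13.0.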

EXAMPLE 13.10 The Chi-Square Independence Test

Marital Status and Drinking A random sample of 1772 U.S. adults yielded the data on marital status and alcohol consumption displayed in Table 13.13 on page 603. At the 5% significance level, do the data provide sufficient evidence to conclude that an association exists between marital status and alcohol consumption?

Solution We calculated the expected frequencies earlier and displayed them in Table 13.14 below the observed frequencies. For ease of reference, we repeat that table here.

                      Drinks per month
Marital status   Abstain    1–60  Over 60   Total
Single                67     213       74     354
                   117.9   191.2     44.9
Married              411     633      129    1173
                   390.6   633.5    148.9
Widowed               85      51        7     143
                    47.6    77.2     18.2
Divorced              27      60       15     102
                    34.0    55.1     13.0
Total                590     957      225    1772

From this table, we see that the expected-frequency conditions, Assumptions 1 and 2 of Procedure 13.2, are satisfied because all of the expected frequencies exceed 5. Consequently, we can apply Procedure 13.2 to perform the required hypothesis test.

Step 1 State the null and alternative hypotheses.

The null and alternative hypotheses are, respectively,

$H_0$: Marital status and alcohol consumption are not associated.

$H_a$: Marital status and alcohol consumption are associated.

Step 2 Decide on the significance level, $\alpha$.

The test is to be performed at the 5% significance level, so $\alpha = 0.05$.

Step 3 Compute the value of the test statistic

$$\chi^2 = \sum (O - E)^2 / E,$$

where $O$ and $E$ represent observed and expected frequencies, respectively.

The observed and expected frequencies are displayed in the preceding table. Using them, we compute the value of the test statistic:

$$
\begin{aligned}
\chi^2 &= (67 - 117.9)^2/117.9 + (213 - 191.2)^2/191.2 + \cdots + (15 - 13.0)^2/13.0 \\
&= 21.952 + 2.489 + \cdots + 0.324 \\
&= 94.269.
\end{aligned}
$$


CRITICAL-VALUE APPROACH OR P-VALUE APPROACH

Step 4 The critical value is $\chi^2_\alpha$ with df $= (r - 1)(c - 1)$, where $r$ and $c$ are the number of possible values for the two variables. Use Table VII to find the critical value.

The number of marital status categories is four, and the number of drinks-per-month categories is three. Hence $r = 4$, $c = 3$, and

$$\text{df} = (r - 1)(c - 1) = 3 \cdot 2 = 6.$$

For $\alpha = 0.05$, Table VII reveals that the critical value is $\chi^2_{0.05} = 12.592$, as shown in Fig. 13.5A.

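FIGURE 13.5A [Chi-square curve: the region to the left of the critical value 12.592 is labeled “Do not reject $H_0$”; the region to the right, with area 0.05, is labeled “Reject $H_0$”]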

Step 5 If the value of the test statistic falls in the rejection region, reject $H_0$; otherwise, do not reject $H_0$.

From Step 3, we see that the value of the test statistic is $\chi^2 = 94.269$, which falls in the rejection region, as shown in Fig. 13.5A. Thus we reject $H_0$. The test results are statistically significant at the 5% level.
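The decision in Steps 4 and 5 can be sketched in Python as follows. The critical value 12.592 and the test statistic 94.269 are taken from the example. The P-value part is an addition for illustration: it uses a closed-form expression for the right-tail area of a chi-square distribution that is valid only when the degrees of freedom are even (as here, with df = 6), and the helper name `chi_square_sf_even_df` is hypothetical. For other cases, statistical software or Table VII would be needed.

```python
import math


def chi_square_sf_even_df(x, df):
    """Right-tail area P(X > x) for a chi-square distribution with an
    even number of degrees of freedom, using the closed form
    exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    return math.exp(-half) * sum(
        half ** j / math.factorial(j) for j in range(df // 2)
    )


r, c = 4, 3                # marital status and drinks-per-month categories
df = (r - 1) * (c - 1)     # degrees of freedom
alpha = 0.05
chi_square_0 = 94.269      # value of the test statistic from Step 3
critical_value = 12.592    # from Table VII, as quoted in Step 4

# Critical-value approach: reject H0 if the statistic falls in the
# rejection region to the right of the critical value.
reject_by_critical_value = chi_square_0 >= critical_value

# P-value approach: reject H0 if P <= alpha.
p_value = chi_square_sf_even_df(chi_square_0, df)
reject_by_p_value = p_value <= alpha

print(df, reject_by_critical_value, reject_by_p_value)
```

Both approaches lead to the same decision, as they must. As a sanity check on the helper, the right-tail area beyond 12.592 with df = 6 comes out at approximately 0.05.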
