Humanities Data Analysis — “125-85018_Karsdrop_Humanities_ch01_3p” — 2020/8/19 — 11:03 — page 195 — #31
Statistics Essentials: Who Reads Novels? • 195

readfict          yes    no
reg16
pacific          0.81  0.19
w nor central    0.80  0.20
e nor central    0.74  0.26
new england      0.74  0.26
middle atlantic  0.73  0.27
w sou central    0.70  0.30
mountain         0.68  0.32
foreign          0.67  0.33
e sou central    0.67  0.33
south atlantic   0.64  0.36

A stacked bar plot expressing the information in this table can be made using the same method, plot.bar(stacked=True), that we used before:

pd.crosstab(
    df_subset['reg16'],
    df_subset['readfict'],
    normalize='index').plot.barh(stacked=True)
plt.legend(
    loc="upper center",
    bbox_to_anchor=(0.5, 1.15),
    ncol=2,
    title="Read fiction?")

From the plot in figure 5.11 it is possible to see that the observed density of readfict “yes” responders is lowest in states assigned the south atlantic category (e.g., South Carolina) and highest in the states assigned the pacific category. The differences between regions are noticeable, at least visually. We have respectable sample sizes for many of these regions, so we are justified in suspecting that there may be considerable geographical variation in the self-reporting of fiction reading. With smaller sample sizes, however, we would worry that a difference visible in a stacked bar chart or a contingency table may well be due to chance: for example, if “yes” is a common response to the readfict question and many people grew up in a pacific state, we certainly expect to see people living in the pacific states and reporting reading fiction in the last twelve months, even if we are confident that fiction reading is conditionally independent of the region a respondent grew up in.

5.6.3 Mutual information

This brief section on mutual information assumes the reader is familiar with discrete probability distributions and random variables. Readers who have not encountered probability before may wish to skip this section. Mutual information is a statistic which measures the dependence between two
categorical variables (Cover and Thomas 2006, chap. 2). If two categorical outcomes co-occur no more than random chance would predict, mutual information will tend to be near zero.

[Figure 5.11: Stacked bar plot showing proportion of fiction readers for the regions of the United States.]

Mutual information is defined as follows:

I(X, Y) = Σ_{x ∈ X} Σ_{y ∈ Y} Pr(X = x, Y = y) log [ Pr(X = x, Y = y) / ( Pr(X = x) Pr(Y = y) ) ]   (5.5)

where X is a random variable taking on values in the set X and Y is a random variable taking on values in Y. As we did with entropy, we use the empirical distribution of responses to estimate the joint and marginal distributions needed to calculate mutual information. For example, if we were to associate the response to readfict with X and the response to reg16 with Y, we would estimate Pr(X = yes, Y = pacific) using the relative frequency of that pair of responses among all the responses recorded.

Looking closely at the mutual information equation, it is possible to appreciate why the mutual information between two variables will be zero if the two are statistically independent: each term Pr(X = x, Y = y) / ( Pr(X = x) Pr(Y = y) ) in the summation will be 1, and the mutual information (the sum of the logarithms of these terms, weighted by the joint probabilities) will be zero, as log 1 = 0. When two outcomes co-occur more often than chance would predict, the term Pr(X = x, Y = y) / ( Pr(X = x) Pr(Y = y) ) will be greater than 1.

We will now calculate the mutual information for responses to the reg16 question and answers to the readfict question.

# Strategy:
# 1. Calculate the table of Pr(X=x, Y=y) from empirical frequencies
# 2. Calculate the marginal distributions Pr(X=x) and Pr(Y=y)
...
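The strategy sketched in the comments above can be carried out with pandas and numpy. The following is a minimal sketch, not the book's own implementation: the helper name mutual_information is ours, and it assumes two categorical pandas Series such as df_subset['readfict'] and df_subset['reg16'] from the earlier plot.

```python
import numpy as np
import pandas as pd

def mutual_information(x, y):
    """Estimate mutual information (in nats) between two categorical series.

    Hypothetical helper illustrating the strategy in the text; the name and
    signature are our own, not from the book.
    """
    # Step 1: table of Pr(X=x, Y=y) from empirical frequencies
    joint = pd.crosstab(x, y, normalize='all')
    # Step 2: marginal distributions Pr(X=x) and Pr(Y=y)
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    # Outer product gives Pr(X=x) * Pr(Y=y) for every cell of the table
    independent = np.outer(p_x, p_y)
    # Step 3: sum Pr(x, y) * log(Pr(x, y) / (Pr(x) Pr(y))),
    # skipping empty cells (0 log 0 is taken to be 0)
    mask = joint.values > 0
    return np.sum(joint.values[mask]
                  * np.log(joint.values[mask] / independent[mask]))
```

Applied to the survey data, mutual_information(df_subset['readfict'], df_subset['reg16']) would estimate the dependence between fiction reading and region; for statistically independent variables the result is near zero, as the text predicts.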