“125-85018_Karsdrop_Humanities_ch01_3p” — 2020/8/19 — 11:04 — page 202 — #2
202 • Chapter 6

the negative binomial distribution. Indeed, facility with the negative binomial distribution counts as an essential skill for anyone interested in modeling word frequencies in text documents. The negative binomial distribution is the simplest probability distribution which provides a plausible model of word frequencies observed in English-language documents (Church and Gale 1995). The only kinds of words which can be credibly modeled with more familiar distributions (e.g., normal, binomial, or Poisson) are those which are extremely frequent or vanishingly rare.

6.1 Uncertainty and Thomas Pynchon

To motivate the Bayesian task of learning from evidence, consider the following scenario. It has been suggested that Thomas Pynchon, a well-known American novelist, may have written under a pseudonym during his long career (e.g., Winslow 2015). Suppose the probability of a work of literary fiction published between 1960 and 2010 being written by Thomas Pynchon (under his own name or under a pseudonym) is 0.001 percent, i.e., 1 in 100,000. Suppose, moreover, that a stylometric test exists which correctly identifies a novel written by Pynchon as his 90 percent of the time (i.e., the true positive rate, called “sensitivity” in some fields, equals 0.9). One percent of the time, however, the test mistakenly attributes to Pynchon a work he did not write (i.e., the false positive rate equals 0.01). In this scenario, we assume the test works as described; we might imagine Pynchon himself vouches for the accuracy of the test or that the test invariably exhibits these properties over countless applications. Suppose a novel published in 2010 and credited to someone other than Pynchon tests positive on the stylometric test of Pynchon authorship. What is the probability that the novel was penned by Pynchon?
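One way to get a feel for the numbers in this scenario is to count novels rather than manipulate probabilities. The following is a minimal sketch, not part of the original analysis: the population size `n_novels` is a hypothetical figure chosen only to keep the counts round, and the variable names are illustrative.

```python
# Hypothetical population of novels; the size is an assumption made
# purely so that the resulting counts are easy to read.
n_novels = 10_000_000

pr_pynchon = 0.00001       # 0.001 percent, i.e., 1 in 100,000
pr_positive = 0.90         # true positive rate of the stylometric test
pr_false_positive = 0.01   # false positive rate of the stylometric test

# Split the population into novels by Pynchon and novels by anyone else.
n_pynchon = n_novels * pr_pynchon
n_other = n_novels - n_pynchon

# Expected number of positive tests within each group.
true_positives = n_pynchon * pr_positive
false_positives = n_other * pr_false_positive

# Share of all positive tests that belong to novels by Pynchon.
share = true_positives / (true_positives + false_positives)

print(round(n_pynchon), round(true_positives), round(false_positives))
print(share)
```

Under these assumptions, about 100 of the novels are by Pynchon and about 90 of those test positive, while roughly 99,999 novels by other authors also test positive. Genuine Pynchon novels therefore make up only a small fraction of all positive results.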
One answer to this question is provided by Bayes’s rule, which is given below and whose justification will be addressed shortly:

$$
\Pr(\text{Pynchon} \mid +) = \frac{\Pr(+ \mid \text{Pynchon})\,\Pr(\text{Pynchon})}{\Pr(+ \mid \text{Pynchon})\,\Pr(\text{Pynchon}) + \Pr(+ \mid \neg\text{Pynchon})\,(1 - \Pr(\text{Pynchon}))} \tag{6.1}
$$

where $+$ indicates the event of the novel testing positive and Pynchon indicates the event of the novel having been written by Pynchon. The preceding paragraph provides us with values for all the quantities on the right-hand side of the equation, where we have used the expression Pr(A) to indicate the probability of the event A occurring.

    pr_pynchon = 0.00001       # Pr(Pynchon)
    pr_positive = 0.90         # Pr(+ | Pynchon)
    pr_false_positive = 0.01   # Pr(+ | not Pynchon)
    print(pr_positive * pr_pynchon /
          (pr_positive * pr_pynchon + pr_false_positive * (1 - pr_pynchon)))

    0.000899199712256092

Bayes’s rule produces the following answer: the probability that Pynchon is indeed the author of the novel given a positive test is roughly one tenth of one