Humanities Data Analysis “125-85018_Karsdrop_Humanities_ch01_3p” — 2020/8/19 — 11:05 — page 320 — #36 320 • Chapter 9
K values. In the case of the topic model, the dense matrix contains point estimates of the N document-topic distributions. In the case of PCA, the dense matrix contains N component scores, each of length K. The interpretation of each matrix is different, of course, and the values themselves are different. For example, each element of the document-topic matrix is non-negative and the rows sum to 1. Each matrix, however, seen as a “compressed” version of the original matrix, can be put to work in a similar way. The exercise presented here considers using the dense matrix, rather than the full matrix of counts, as an input to a supervised classifier.

Compare the topic model’s representation of the Old Bailey Corpus with the representation provided by PCA (use as many principal components as there are topics in the topic model). If needed, go back to the chapter that introduces PCA to learn how to do this with scikit-learn’s PCA class, which has essentially the same interface as LatentDirichletAllocation.

Train a classifier provided by scikit-learn, such as KNeighborsClassifier, using 50% of the documents. Have the classifier predict the offence labels in the remaining 50% of opinions. Assess the predictive performance of the classifier based on the document-topic distributions for each offence label separately (scikit-learn’s classification_report in the metrics module might be useful here). Are there types of opinions which appear to be easier to predict than others?
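The workflow described in this exercise can be sketched as follows. This is a minimal illustration under stated assumptions, not a solution: the count matrix and the two labels are synthetic stand-ins (the names `counts`, `labels`, and `evaluate` are hypothetical), and the number of components is chosen arbitrarily. With real data, the count matrix would come from the corpus and the labels from the offence annotations.

```python
import numpy as np
from sklearn.decomposition import PCA, LatentDirichletAllocation
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in data: 200 "documents" over a 50-word vocabulary,
# drawn from two word distributions so that the labels are learnable.
rng = np.random.default_rng(0)
n_docs, n_words, n_components = 200, 50, 5
labels = rng.integers(0, 2, size=n_docs)
word_probs = rng.dirichlet(np.ones(n_words) * 0.3, size=2)
counts = np.vstack([rng.multinomial(100, word_probs[label]) for label in labels])

# Two dense N x K representations of the same N x V count matrix.
lda = LatentDirichletAllocation(n_components=n_components, random_state=0)
doc_topic = lda.fit_transform(counts)  # rows are document-topic distributions
pca = PCA(n_components=n_components, random_state=0)
scores = pca.fit_transform(counts)     # rows are component scores


def evaluate(features, labels):
    """Train k-NN on half of the documents and report on the other half."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.5, random_state=0, stratify=labels)
    clf = KNeighborsClassifier().fit(X_train, y_train)
    return classification_report(y_test, clf.predict(X_test), zero_division=0)


print("Topic model representation")
print(evaluate(doc_topic, labels))
print("PCA representation")
print(evaluate(scores, labels))
```

The report lists precision, recall, and F1 per label, which is the per-label view the exercise asks for. Note that this sketch fits both decompositions on all documents before splitting; a stricter protocol would fit them on the training half only and then transform the held-out half.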
9.6 Appendix: Mapping Between Our Topic Model and Lauderdale and Clark (2014)

# Each tuple records the following in the order used by Lauderdale and Clark:
# (<topic words>, <topic number>, <topic words>)
lauderdale_clark_figure_3_mapping = (
    ('lands, indian, land', 59, 'indian, territory, indians'),
    ('tax, commerce, interstate', 89, 'commerce, interstate, state'),
    ('federal, immunity, law', 2, 'suit, action, states, , immunity'),
    ('military, aliens, aliens', 22, ' , alien, , aliens, , deportation, immigration'),
    ('property, income, tax', 79, 'tax, taxes, property'),
    ('district, habeas, appeal', 43, 'court, federal, district, appeals, review, courts, habeas'),
    ('negligence, maritime, admiralty', 7, 'vessel, ship, admiralty'),
    ('patent, copyright, cable', 86, 'patent, , invention, patents'),
    ('search, fourth, warrant', 37, 'search, warrant, fourth'),
    ('jury, death, penalty', 3, 'sentence, death, sentencing, penalty'),
    ('school, religious, schools', 73, 'religious, funds, government, , establishment'),
    ('trial, counsel, testimony', 13, 'counsel, trial, defendant'),
    ('epa, waste, safety', 95, 'regulations, , agency, , safety, , air, epa'),
    ('speech, ordinance, public', 58, 'speech, amendment, , public'),
    ('antitrust, price, securities', 39, 'market, price, competition, act, antitrust'),
    ('child, abortion, children', 14, 'child, children, medical, , woman, abortion'),
    ('prison, inmates, parole', 67, 'prison, release, custody, parole'),
    ('political, election, party', 23, 'speech, amendment, , political, party'),
    ('title, vii, employment', 55, 'title, discrimination, , vii'),
    ('offense, criminal, jeopardy', 78, 'criminal, , crime, offense'),
    ('union, labor, board', 24, 'board, union, labor'),
    ('damages, fees, attorneys', 87, 'attorney, fees, , costs'),
    ('commission, rates, gas', 97, 'rate, , gas, , rates'),
    ('congress, act, usc', 41, 'federal, congress, act, law'),
)
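One possible way to inspect such a mapping is to check, for each topic number, which words the two labels share. The sketch below is only an illustration: it copies four of the 24 entries above, and the helper names `split_words` and `shared_words` are hypothetical. It makes no assumption about which label belongs to which model.

```python
# Four entries copied from the appendix; the full tuple has 24.
mapping_excerpt = (
    ('lands, indian, land', 59, 'indian, territory, indians'),
    ('tax, commerce, interstate', 89, 'commerce, interstate, state'),
    ('search, fourth, warrant', 37, 'search, warrant, fourth'),
    ('union, labor, board', 24, 'board, union, labor'),
)


def split_words(label):
    """Split a comma-separated label into a set of non-empty words."""
    return {word.strip() for word in label.split(',') if word.strip()}


def shared_words(mapping):
    """Map each topic number to the words that both labels have in common."""
    return {
        number: split_words(first) & split_words(second)
        for first, number, second in mapping
    }


overlap = shared_words(mapping_excerpt)
for number, words in sorted(overlap.items()):
    print(number, sorted(words))
```

Exact string overlap is a crude measure: 'indian' and 'indians' count as different words here, so a stemmed comparison would report more agreement.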