Humanities Data Analysis
“125-85018_Karsdrop_Humanities_ch01_3p” — 2020/8/19 — 11:01 — page 90 — #13 90 • Chapter

Table 3.2 Initial rows and several columns from the document-term matrix representation of the toy corpus of French texts. Rows are numbered using zero-based indexing and column headers display the respective elements of the vocabulary.

    a ah ami amour cléobule d’accord d’hyménée d’où
    0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

While Python’s list is a convenient data type for constructing a document-term matrix, it is less useful when one is interested in accessing and manipulating the matrix. In what follows, we will use Python’s canonical package for scientific computing, NumPy, which enables us to store and analyze document-term matrices using fewer computational resources and with much less effort on our part. In order not to disrupt the narrative flow of the chapter, we shall not introduce this package in detail here: less experienced readers are referred to the introductory overview at the end of this chapter, which discusses the main features of the package at significant length (section 3.5).

3.3 Mapping Genres

Loading the corpus

In what preceded, we have demonstrated how to construct a document-term matrix and which text preprocessing steps are typically involved in creating this representation of text (e.g., text cleansing and string segmentation). This document-term matrix is now ready to be cast into a two-dimensional NumPy array, allowing it to be stored and manipulated more efficiently. The resulting object’s shape attribute can be printed to verify whether the table’s dimensions still correctly correspond to our original vector space model (i.e., 10 rows, one for each document, and 66 columns, one for each term in the vocabulary):

    import numpy as np

    document_term_matrix = np.array(document_term_matrix)
    print(document_term_matrix.shape)

    (10, 66)

Vector space models have
proven to be invaluable for numerous computational approaches to textual data, such as text classification, information retrieval, and stylometry (cf. chapter 8). In the remainder of this chapter, we will use a vector space representation of a real-world corpus. The object of study will be a collection of French plays from the Classical and Enlightenment period (seventeenth to eighteenth century), which includes works by well-known figures in the history of French theatre (e.g., Molière and Pierre Corneille). Using a vector space model, we aim to illustrate how a bag-of-words model is useful in studying the lexical differences between three subgenres in the corpus. Before diving into the details of the genre information in the corpus, let us first load the collection of French plays into memory and, subsequently, transform them into a document-term matrix.

The collection of dramatic texts under scrutiny is part of the larger Théâtre Classique corpus, which is curated by Paul Fièvre at http://www.theatre-classique.fr/. A distinct feature of this data collection, apart from its scope and quality, is the fact that all texts are available in a meticulously encoded XML format (cf. section 2.6 in the previous chapter). An excerpt, slightly edited for space, of one of these XML files (504.xml) is shown below:

    ACTE I
    Le Théâtre représente un salon où il y a plusieurs issues.
    SCÈNE PREMIÈRE.
    FABRICE, seul.
    ARIETTE.
    J'aime l'éclat des Françaises,
    L'air fripon des Milanaises,
    La fraîcheur des Hollandaises,
    Le port noble des Anglaises ;
    Allemandes, Piémontaises,
    Toutes m'enivrent d'amour,
    Et m'enflamment tour à tour !
    Mais mon aimable Jeanette
    Est si belle, si bien faite,
    Qu'elle fait tourner la tête ;
    Elle enchante tous les yeux,
    Elle est l'objet de mes vœux.
    J'aime l'éclat, etc.
    Il sort.
The collection contains plays of different dramatic (sub)genres, three of which will be studied in the present chapter: comédie, tragédie, and tragicomédie. The genre of each play is encoded in the <genre> tag, and, as can be observed from the excerpt above, all spoken text in these plays (i.e., direct speech) is enclosed with the <l> tag. The remaining texts reside inside <p> tags. Both elements can be retrieved using a simple XPath expression (cf. section 2.6 in the previous chapter), as shown in the following code block:

    import os
    import tarfile

    import lxml.etree

    # Unpack the corpus archive into the data directory
    tf = tarfile.open('data/theatre-classique.tar.gz', 'r')
    tf.extractall('data')

    subgenres = ('Comédie', 'Tragédie', 'Tragi-comédie')

    plays, titles, genres = [], [], []
    for fn in os.scandir('data/theatre-classique'):
        # Only include XML files
        if not fn.name.endswith('.xml'):
            continue
        tree = lxml.etree.parse(fn.path)
        genre = tree.find('//genre')
        title = tree.find('//title')
        # Keep only plays labeled with one of the three subgenres
        if genre is not None and genre.text in subgenres:
            lines = []
            # Collect the text of all <l> and <p> elements
            for line in tree.xpath('//l|//p'):
                lines.append(' '.join(line.itertext()))
            text = '\n'.join(lines)
            plays.append(text)
            genres.append(genre.text)
            titles.append(title.text)

Let us inspect the distribution of the dramatic subgenres (henceforth simply “genres”) in this corpus (see figure 3.3):

    import collections

    import matplotlib.pyplot as plt

    # Count how many plays carry each genre label
    counts = collections.Counter(genres)

    fig, ax = plt.subplots()
    ax.bar(counts.keys(), counts.values(), width=0.3)
    ax.set(xlabel="genre", ylabel="count")

We clearly have a relatively skewed distribution: the most common genre of comédies outnumbers the runner-up genre of tragédies almost by two to one. The curious genre of tragi-comédies (the oxymoron in its name suggests a mix of both comédies and tragédies) is much less common as a genre label in the dataset. The apparent straightforwardness with which we have discussed literary genres so far is not entirely justified from the
point of view of literary theory (e.g., Devitt 1993; Stephens and McCallum 2013), and even cultural theory at large (Chandler 1997). Although “genre” seems a (misleadingly) intuitive concept when talking about literature, it is also a highly vexed and controversial notion: genres are mere conventional tags that people use to refer to certain “text varieties” or “textual modes” that are very hard to delineate using explicit, let alone objective, criteria. They are certainly not mutually exclusive (a “detective” can be a “romance” too) and they can overlap in complex hierarchies (a “detective” can be considered a hyponym of “thriller”). Genre properties can moreover be extracted at various levels from texts, including style, themes, and settings, and successful authors often like to blend genres (e.g., a “historical thriller”). Genre classifications therefore rarely go uncontested and their application can be a highly subjective matter, where personal taste or the paradigm a scholar works in will play a significant role.

Figure 3.3 Distribution of dramatic subgenres in Théâtre Classique.

Because of the (inter)subjectivity that is involved in genre studies, quantitative approaches can offer a valuable second opinion on genre classifications, like the one offered by Paul Fièvre. Are there any lexical differences between the texts in this corpus that would seem to correlate with, or perhaps contradict, the classification proposed? Can the textual properties in a bag-of-words model shed new light on the special status of the tragi-comédies?
And so on.

Exploring the corpus

After loading the plays into memory, we can transform the collection into a document-term matrix. In the following code block, we first preprocess each play using the preprocess_text() function defined earlier, which returns a list of lowercase word tokens for each play. Subsequently, we construct the vocabulary with extract_vocabulary(), and prune all words that occur fewer than two times in the collection. The final step, then, is to assemble the document-term matrix by computing the token counts for all remaining words in the vocabulary for each document in the collection:

    plays_tok = [preprocess_text(play, 'french') for play in plays]
    vocabulary = extract_vocabulary(plays_tok, min_count=2)
    document_term_matrix = np.array(corpus2dtm(plays_tok, vocabulary))

    print(f"document-term matrix with "
          f"|D| = {document_term_matrix.shape[0]} documents and "
          f"|V| = {document_term_matrix.shape[1]} words.")

    document-term matrix with |D| = 498 documents and |V| = 48062 words.

We are now ready to start our analysis: we have an efficient bag-of-words representation of a corpus in the form of a NumPy matrix (a two-dimensional array) and a list of labels that unambiguously encodes the genre for each document vector in that table.

Let us start by naively plotting the available documents, as if the frequency counts for two specific words in our bag-of-words model were simple two-dimensional coordinates on a map. In previous work by Schöch (2017), two words that had considerable discriminative power for these genres were “monsieur” (sir) and “sang” (blood), so we will use these as a starting point. We can select the corresponding columns from our document-term matrix by first retrieving their index in the vocabulary. The index of the words in our vocabulary is aligned with the indices of the corresponding columns in the bag-of-words table: we will therefore always use the
index of an item in the vocabulary to retrieve the correct frequency column from the bag-of-words model. (The Pandas library, which is discussed at length in chapter 4, simplifies this process considerably when working with so-called DataFrame objects.)

    # Look up the column index of each word in the vocabulary
    monsieur_idx = vocabulary.index('monsieur')
    sang_idx = vocabulary.index('sang')

    # Select the corresponding frequency columns for all documents
    monsieur_counts = document_term_matrix[:, monsieur_idx]
    sang_counts = document_term_matrix[:, sang_idx]

While NumPy is optimized for dealing with numeric data, lists of strings can also be cast into arrays. This is exactly what we will do to our list of genre labels, too, in order to ease the process of retrieving the locations of specific genre labels in the list later on:

    genres = np.array(genres)

The column vectors, monsieur_counts and sang_counts, both have the same length and include the frequency counts for each of our two words in each document. Using the labels in the corresponding list of genre tags, we can now plot each document as a point in the two-dimensional space defined by the two count vectors. Pay close attention to the first two arguments passed to the scatter() function inside the for loop in which we iterate over the three genres: using the mechanism of “boolean indexing,” we select the frequency counts for the relevant documents and we plot those as a group in each iteration. Figure 3.4 is generated using the following code block:

    fig, ax = plt.subplots()

    for genre in ('Comédie', 'Tragédie', 'Tragi-comédie'):
        # Boolean indexing: keep only the documents of the current genre
        ax.scatter(
            monsieur_counts[genres == genre],
            sang_counts[genres == genre],
            label=genre,
            alpha=0.7)

    ax.set(xlabel='monsieur', ylabel='sang')
    plt.legend()

Figure 3.4 Absolute frequency of “monsieur” and “sang” in individual plays.

What does this initial “textual map” tell us?
As we can glean from this plot, the usage of these two words appears to be remarkably distinctive. Many tragédies seem to use the term “sang” profusely, whereas the term is almost absent from the comédies. Conversely, the term “monsieur” is clearly favored by the authors of comédies, where it is perhaps predominantly used as a vocative, because conversations are often said to be more typical of this particular subgenre (Schöch 2017). Interestingly, the tragi-comédies seem to hold the middle between the other two genres, as these seem to invite much less extreme frequencies for those terms.

Genre vectors

Do we have any more objective methods to verify these impressions? A first option would be to take a more aggregate view and look at the average usage of these terms in the three genres. In the code block below, we calculate the arithmetic mean or “centroid” for each genre subcluster. This is easy to achieve in NumPy, which has a dedicated function for this, numpy.mean(), that we can apply to our entire bag-of-words model at once. By setting the axis parameter to zero, we indicate that we are interested in the column-wise mean (as opposed to, e.g., the row-wise mean, for which we would need to specify axis=1). (If this is all new to you, please study the materials in section 3.5.)
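The difference between the column-wise and the row-wise mean can be checked on a small example. The sketch below is only an illustration: the matrix toy_dtm and its values are hypothetical and are not taken from the corpus.

```python
import numpy as np

# Hypothetical toy document-term matrix: 3 documents (rows), 4 terms (columns).
toy_dtm = np.array([
    [2, 0, 1, 3],
    [4, 0, 1, 1],
    [0, 6, 1, 2],
])

# axis=0 collapses the rows: one mean per column (term).
column_means = toy_dtm.mean(axis=0)

# axis=1 collapses the columns: one mean per row (document).
row_means = toy_dtm.mean(axis=1)

print(column_means.shape)  # one value per term
print(row_means.shape)     # one value per document
```

With axis=0 the result has as many entries as there are terms, which is the shape we want for a genre centroid.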
Note how we again make use of the boolean indexing mechanism to retrieve only the vectors associated with the specific genre in each line below:

    tr_means = document_term_matrix[genres == 'Tragédie'].mean(axis=0)
    co_means = document_term_matrix[genres == 'Comédie'].mean(axis=0)
    tc_means = document_term_matrix[genres == 'Tragi-comédie'].mean(axis=0)

Each resulting mean vector is one-dimensional and holds one mean value for each term in our vocabulary:

    print(tr_means.shape)

    (48062,)

We can still use the precomputed indices to retrieve the mean frequency of individual words from these summary vectors:

    print('Mean absolute frequency of "monsieur"')
    print(f'   in comédies: {co_means[monsieur_idx]:.2f}')
    print(f'   in tragédies: {tr_means[monsieur_idx]:.2f}')
    print(f'   in tragi-comédies: {tc_means[monsieur_idx]:.2f}')

    Mean absolute frequency of "monsieur"
       in comédies: 45.46
       in tragédies: 1.20
       in tragi-comédies: 8.13

The mean frequencies for these words again reveal telling differences across our three genres. This also becomes evident by plotting the mean values in a scatter plot (see figure 3.5):

    fig, ax = plt.subplots()
    ax.scatter(co_means[monsieur_idx], co_means[sang_idx], label='Comédies')
    ax.scatter(tr_means[monsieur_idx], tr_means[sang_idx], label='Tragédie')
    ...
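The combination of boolean indexing, column-wise means, and vocabulary lookups used for the genre centroids can be traced step by step on a toy example. This is a minimal sketch under stated assumptions: the names toy_vocabulary, toy_genres, and toy_dtm, as well as all counts, are hypothetical and do not come from the Théâtre Classique data.

```python
import numpy as np

# Hypothetical toy data, not taken from the Théâtre Classique corpus.
toy_vocabulary = ['amour', 'monsieur', 'sang']
toy_genres = np.array(['Comédie', 'Tragédie', 'Comédie', 'Tragédie'])
toy_dtm = np.array([
    [3, 10, 0],
    [5, 0, 8],
    [1, 20, 2],
    [3, 2, 4],
])

# Comparing the label array to a string yields a boolean mask over the rows.
is_comedy = toy_genres == 'Comédie'

# The mask selects the matching rows; axis=0 averages them column by column.
comedy_centroid = toy_dtm[is_comedy].mean(axis=0)
tragedy_centroid = toy_dtm[toy_genres == 'Tragédie'].mean(axis=0)

# The position of a word in the vocabulary is also its column index.
toy_monsieur_idx = toy_vocabulary.index('monsieur')
print(f'{comedy_centroid[toy_monsieur_idx]:.2f}')
print(f'{tragedy_centroid[toy_monsieur_idx]:.2f}')
```

Because the mask and the matrix rows are aligned by position, this only works as long as the genre labels are kept in the same order as the documents.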