Humanities Data Analysis

“125-85018_Karsdrop_Humanities_ch01_3p”, 2020/8/19, 11:01, page 90, #13

Chapter 3

Table 3.2 Initial rows and several columns from the document term matrix representat[.]

Exploring Texts Using the Vector Space Model

Both elements can be retrieved using a simple XPath expression (cf. section 2.6 in the previous chapter), as shown in the following code block:

    import os
    import tarfile

    import lxml.etree

    # Unpack the corpus archive into the data directory
    tf = tarfile.open('data/theatre-classique.tar.gz', 'r')
    tf.extractall('data')

    subgenres = ('Comédie', 'Tragédie', 'Tragi-comédie')

    plays, titles, genres = [], [], []
    for fn in os.scandir('data/theatre-classique'):
        # Only include XML files
        if not fn.name.endswith('.xml'):
            continue
        tree = lxml.etree.parse(fn.path)
        genre = tree.find('//genre')
        title = tree.find('//title')
        # Keep only plays labeled with one of the three subgenres
        if genre is not None and genre.text in subgenres:
            lines = []
            # Collect the text of all verse lines (<l>) and paragraphs (<p>)
            for line in tree.xpath('//l|//p'):
                lines.append(' '.join(line.itertext()))
            text = '\n'.join(lines)
            plays.append(text)
            genres.append(genre.text)
            titles.append(title.text)

Let us inspect the distribution of the dramatic subgenres (henceforth simply “genres”) in this corpus (see figure 3.3):

    import collections

    import matplotlib.pyplot as plt

    # Count how many plays carry each genre label
    counts = collections.Counter(genres)

    fig, ax = plt.subplots()
    ax.bar(counts.keys(), counts.values(), width=0.3)
    ax.set(xlabel="genre", ylabel="count")

Figure 3.3 Distribution of dramatic subgenres in Théâtre Classique

We clearly have a relatively skewed distribution: the most common genre, the comédies, outnumbers the runner-up genre, the tragédies, by almost two to one. The curious genre of tragi-comédies (the oxymoron in its name suggests a mix of both comédies and tragédies) is much less common as a genre label in the dataset.

The apparent straightforwardness with which we have discussed literary genres so far is not entirely justified from the point of view of literary theory (e.g., Devitt 1993; Stephens and McCallum 2013), and even cultural theory at large (Chandler 1997). Although “genre” seems a (misleadingly) intuitive concept when talking about literature, it is also a highly vexed and controversial notion: genres are mere conventional tags that people use to refer to certain “text varieties” or “textual modes” that are very hard to delineate using explicit, let alone objective, criteria. They are certainly not mutually exclusive (a “detective” can be a “romance” too), and they can overlap in complex hierarchies (a “detective” can be considered a hyponym of “thriller”). Genre properties can moreover be extracted at various levels from texts, including style, themes, and settings, and successful authors often like to blend genres (e.g., a “historical thriller”). Genre classifications therefore rarely go uncontested and their application can be a highly subjective matter, where personal taste or the paradigm a scholar works in will play a significant role.

Because of the (inter)subjectivity that is involved in genre studies, quantitative approaches can offer a valuable second opinion on genre classifications, like the one offered by Paul Fièvre. Are there any lexical differences between the texts in this corpus that would seem to correlate with, or perhaps contradict, the classification proposed? Can the textual properties in a bag-of-words model shed new light on the special status of the tragi-comédies?
And so on.

Exploring the corpus

After loading the plays into memory, we can transform the collection into a document-term matrix. In the following code block, we first preprocess each play using the preprocess_text() function defined earlier, which returns a list of lowercase word tokens for each play. Subsequently, we construct the vocabulary with extract_vocabulary(), and prune all words that occur fewer than two times in the collection. The final step, then, is to assemble the document-term matrix by computing the token counts for all remaining words in the vocabulary for each document in the collection.

    import numpy as np

    # Tokenize each play, build the pruned vocabulary, and count tokens
    plays_tok = [preprocess_text(play, 'french') for play in plays]
    vocabulary = extract_vocabulary(plays_tok, min_count=2)
    document_term_matrix = np.array(corpus2dtm(plays_tok, vocabulary))

    print(f"document-term matrix with "
          f"|D| = {document_term_matrix.shape[0]} documents and "
          f"|V| = {document_term_matrix.shape[1]} words.")

    document-term matrix with |D| = 498 documents and |V| = 48062 words.

We are now ready to start our analysis: we have an efficient bag-of-words representation of a corpus in the form of a NumPy matrix (a two-dimensional array) and a list of labels that unambiguously encodes the genre for each document vector in that table. Let us start by naively plotting the available documents, as if the frequency counts for two specific words in our bag-of-words model were simple two-dimensional coordinates on a map. In previous work by Schöch (2017), two words that had considerable discriminative power for these genres were “monsieur” (sir) and “sang” (blood), so we will use these as a starting point. We can select the corresponding columns from our document-term matrix by first retrieving their index in the vocabulary. The index of the words in our vocabulary is aligned with the indices of the corresponding columns in the bag-of-words table: we will therefore always use the index of an item in the vocabulary to retrieve the correct frequency column from the bag-of-words model. (The Pandas library, which is discussed at length in chapter 4, simplifies this process considerably when working with so-called DataFrame objects.)

    # Position of each word in the vocabulary = column in the matrix
    monsieur_idx = vocabulary.index('monsieur')
    sang_idx = vocabulary.index('sang')

    # Select the full column (all documents) for each word
    monsieur_counts = document_term_matrix[:, monsieur_idx]
    sang_counts = document_term_matrix[:, sang_idx]

While NumPy is optimized for dealing with numeric data, lists of strings can also be cast into arrays. This is exactly what we will do to our list of genre labels, too, in order to ease the process of retrieving the locations of specific genre labels in the list later on:

    genres = np.array(genres)

The column vectors, monsieur_counts and sang_counts, both have the same length and include the frequency counts for each of our two words in each document. Using the labels in the corresponding list of genre tags, we can now plot each document as a point in the two-dimensional space defined by the two count vectors. Pay close attention to the first two arguments passed to the scatter() function inside the for loop in which we iterate over the three genres: using the mechanism of “boolean indexing,” we select the frequency counts for the relevant documents and we plot those as a group in each iteration. Figure 3.4 is generated using the following code block:

    fig, ax = plt.subplots()

    for genre in ('Comédie', 'Tragédie', 'Tragi-comédie'):
        # Boolean indexing: keep only the counts of plays in this genre
        ax.scatter(
            monsieur_counts[genres == genre],
            sang_counts[genres == genre],
            label=genre,
            alpha=0.7)

    ax.set(xlabel='monsieur', ylabel='sang')
    plt.legend()

Figure 3.4 Absolute frequency of “monsieur” and “sang” in individual plays

What does this initial “textual map” tell us?
As we can glean from this plot, the usage of these two words appears to be remarkably distinctive. Many tragédies seem to use the term “sang” profusely, whereas the term is almost absent from the comédies. Conversely, the term “monsieur” is clearly favored by the authors of comédies, where it is perhaps predominantly used as a vocative, because conversations are often said to be more typical of this particular subgenre (Schöch 2017). Interestingly, the tragi-comédies seem to hold the middle between the other two genres, as these seem to invite much less extreme frequencies for those terms.

Genre vectors

Do we have any more objective methods to verify these impressions? A first option would be to take a more aggregate view and look at the average usage of these terms in the three genres. In the code block below, we calculate the arithmetic mean vector, or “centroid,” for each genre subcluster. This is easy to achieve in NumPy, which has a dedicated function for this, numpy.mean() (also available as the array method mean()), that we can apply to our entire bag-of-words model at once. By setting the axis parameter to zero, we indicate that we are interested in the column-wise mean (as opposed to, e.g., the row-wise mean, for which we would need to specify axis=1). (If this is all new to you, please study the materials in section 3.5.)
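Before turning to the corpus itself, the two mechanisms just described can be seen in isolation. The following is only a minimal sketch on made-up data: the counts, the labels, and the names toy_dtm, toy_genres, comedies, centroid and row_means are illustrative assumptions and are not taken from the Théâtre Classique corpus.

```python
import numpy as np

# Hypothetical toy data: 4 documents (rows) by 3 vocabulary terms (columns).
toy_dtm = np.array([
    [10, 0, 2],
    [12, 1, 0],
    [0, 8, 3],
    [1, 6, 5],
])
toy_genres = np.array(['Comédie', 'Comédie', 'Tragédie', 'Tragédie'])

# Boolean indexing: the comparison yields one True/False value per document,
# and indexing with that mask keeps only the rows where it is True.
mask = toy_genres == 'Comédie'
comedies = toy_dtm[mask]

# axis=0 averages over the rows, giving one mean per term (column).
centroid = comedies.mean(axis=0)

# axis=1 averages over the columns, giving one mean per document (row).
row_means = comedies.mean(axis=1)

print(mask)
print(centroid)
print(row_means)
```

The centroid has one entry per vocabulary term, so for a full document-term matrix its length equals |V|, whereas the row-wise result has one entry per selected document.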
Note how we again make use of the boolean indexing mechanism to retrieve only the vectors associated with the specific genre in each line below:

    # Column-wise mean over the plays of each genre
    tr_means = document_term_matrix[genres == 'Tragédie'].mean(axis=0)
    co_means = document_term_matrix[genres == 'Comédie'].mean(axis=0)
    tc_means = document_term_matrix[genres == 'Tragi-comédie'].mean(axis=0)

Each resulting mean vector is a one-dimensional array that holds one mean value for each term in our vocabulary:

    print(tr_means.shape)

    (48062,)

We can still use the precomputed indices to retrieve the mean frequency of individual words from these summary vectors:

    print('Mean absolute frequency of "monsieur"')
    print(f'   in comédies: {co_means[monsieur_idx]:.2f}')
    print(f'   in tragédies: {tr_means[monsieur_idx]:.2f}')
    print(f'   in tragi-comédies: {tc_means[monsieur_idx]:.2f}')

    Mean absolute frequency of "monsieur"
       in comédies: 45.46
       in tragédies: 1.20
       in tragi-comédies: 8.13

The mean frequencies for these words again reveal telling differences across our three genres. This also becomes evident by plotting the mean values in a scatter plot (see figure 3.5):

    fig, ax = plt.subplots()
    ax.scatter(co_means[monsieur_idx], co_means[sang_idx], label='Comédies')
    ax.scatter(tr_means[monsieur_idx], tr_means[sang_idx], label='Tragédie')
    ...

... Vector space models have proven to be invaluable for numerous computational approaches to textual data, such as text classification, information retrieval, and stylometry (cf. chapter 8). In the remainder...

... which is curated by Paul Fièvre at http://www.theatre-classique.fr/. A distinct feature of this data collection, apart from its scope and quality, is the fact that all texts are available in a
