Humanities Data Analysis, “125-85018_Karsdrop_Humanities_ch01_3p”, 2020/8/19, 11:01, page 79, #2: Exploring Texts Using the Vector Space Model
Why would such a spatial perspective on (textual) data in fact be desirable? To illustrate the potential of this approach, this chapter works throughout with a simple but real-world example: a corpus of French-language drama of the Classical Age and the Enlightenment Age (seventeenth to eighteenth century), explored quantitatively by Schöch (2017) in a paper that will function as our baseline in this chapter. We will focus on plays drawn from three well-studied subgenres: tragédie, comédie, and tragi-comédie. Readers, even those not at all familiar with this specific literature, will have clear expectations as to which texts will be represented in these subgenres, because of these generic genre labels. The same readers, however, may lack grounds to explain this similarity formally and to express their natural intuition in more concrete terms. We will explore how the vector space model can help us understand the differences between these subgenres from a quantitative, geometric point of view. What sort of lexical differences, if any, become apparent when we plot these different genres as spatial data? Does the subgenre of tragi-comédies display a lexical mix of the markers of both comédies and tragédies, or is it relatively closer to one of these constituent genres?
The first section of this chapter (section 3.2) elaborates on several low-level preprocessing steps that can be taken when preparing a real-life corpus for numerical processing. It also discusses the ins and outs of converting a corpus into a bag-of-words representation, while critically reflecting on the numerous design choices involved in implementing such a model in actual code. In section 3.3, these new insights are combined with those from chapter 2 in a focused case study about the dramatic texts introduced above. To efficiently represent and work with texts represented as vectors, this chapter uses Python’s main numerical library, NumPy (section 3.5), which is at the heart of many of Python’s data analysis libraries (Walt, Colbert, and Varoquaux 2011). For readers not yet thoroughly familiar with this package, we offer an introductory overview at the end of this chapter, which can safely be skipped by more proficient coders. Finally, this chapter introduces the notion of a “nearest neighbor” (section 3.3.2) and explains how it is useful for studying the collection of French drama texts.

3.2 From Texts to Vectors

As the name suggests, a bag-of-words representation models a text in terms of the individual words it contains, or, put differently, at the lexical level. This representation discards any information available in the sequence in which words appear. In the bag-of-words model, word sequences such as “the boy ate the fish,” “the fish ate the boy,” and “the ate fish boy the” are all equivalent. While linguists agree that syntax is a vital part of human languages, representing texts as bags of words heedlessly ignores this information: it models texts by simply counting how often each unique word occurs in them, regardless of their order or position in the original document (Jurafsky and Martin, in press, chp. 6; Sebastiani 2002). While discarding this information seems, at first glance, limiting in the extreme, representing texts in terms of their word frequencies has proven its usefulness over decades of use in information retrieval and quantitative text analysis.

When using the vector space model, a corpus (a collection of documents, each represented as a bag of words) is typically represented as a matrix, in which each row represents a document from the collection, each column represents a word from the collection’s vocabulary, and each cell represents the frequency with which a particular word occurs in a document. In this tabular setting, each row is interpretable as a vector in a vector space. A matrix arranged in this way is often called a document-term matrix, or a term-document matrix when rows are associated with words and columns with documents. An example of a document-term matrix is shown in table 3.1.

Table 3.1. Example of a vector space representation with four documents (rows) and a vocabulary of four words (columns). For each document the table lists how often each vocabulary item occurs.

roi ange sang perdu
d1 16 21
d2 2 18 19
d3 35 41
d4 39 55

In this table, each document di is represented as a vector, which, essentially, is a list of numbers: word frequencies in our present case. A vector space is nothing more than a collection of numerical vectors, which may, for instance, be added together and multiplied by a number. Documents represented in this manner may be compared in terms of their coordinates (or components). For example, by comparing the four documents on the basis of the second coordinate, we observe that the first two documents (d1 and d2) have similar counts, which might be an indication that these two documents are somehow more similar. To obtain a more accurate and complete picture of document similarity, we would like to be able to compare documents more holistically, using all their components. In our example, each document represents a point in a four-dimensional vector space. We might hypothesize that similar documents use similar words, and hence reside close to each other in this space. To illustrate this, we demonstrate how to visualize the documents in space using the first and third components. The plot in figure 3.1 makes visually clear that documents d1 and d2 occupy neighboring positions in space, both far away from the other two documents. As the number of dimensions increases (collections of real-world documents typically have vocabularies with tens of thousands of unique words), it ...
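
The bag-of-words and document-term matrix ideas from section 3.2 can be made concrete with a minimal sketch. It reuses the three word sequences quoted above and adds a fourth, hypothetical document so that not every row is identical. The whitespace tokenization, the variable names (`documents`, `bags`, `vocabulary`, `dtm`), and the use of Euclidean distance as a measure of closeness are assumptions made for illustration. They are not necessarily the choices made later in this chapter.

```python
from collections import Counter

import numpy as np

# The first three word sequences are quoted in the text; the fourth is a
# hypothetical addition so that the matrix contains a distinct row.
documents = [
    "the boy ate the fish",
    "the fish ate the boy",
    "the ate fish boy the",
    "the boy saw the boy",  # hypothetical example, not from the text
]

# Naive whitespace tokenization (an assumption; real corpora need more care).
bags = [Counter(doc.split()) for doc in documents]

# Word order is discarded, so the first three bags are equal.
print(bags[0] == bags[1] == bags[2])  # True

# Vocabulary: the sorted unique words of the corpus, one column per word.
vocabulary = sorted(set().union(*bags))
print(vocabulary)  # ['ate', 'boy', 'fish', 'saw', 'the']

# Document-term matrix: rows are documents, columns are vocabulary items,
# cells are word frequencies. A Counter returns 0 for absent words.
dtm = np.array([[bag[word] for word in vocabulary] for bag in bags])

# Transposing yields the term-document orientation (rows are words).
tdm = dtm.T

# One possible notion of closeness: Euclidean distance between row vectors.
distance_same = np.linalg.norm(dtm[0] - dtm[1])
distance_far = np.linalg.norm(dtm[0] - dtm[3])
print(distance_same, distance_far)
```

In this toy corpus the first two documents coincide as points in the five-dimensional space, while the hypothetical fourth document lies at some distance from them, which is the geometric intuition the text describes.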