Figure 3.1 Demonstration of four documents (represented as vectors) residing in a two-dimensional space.

As the number of dimensions grows, it becomes unfeasible to visually analyze the similarity between documents. To quantify the distance (or similarity) between two documents in high-dimensional space, we can employ distance functions or metrics, which express the distance between two vectors as a non-negative number. The implementation and application of such distance metrics will be discussed in section 3.3.1.

Having a basic theoretical understanding of the vector space model, we move on to the practical part of implementing a procedure to construct a document-term matrix from plain text. In essence, this involves three consecutive steps. In the first step, we determine the vocabulary of the collection, optionally filtering the vocabulary using information about how often each unique word (type) occurs in the corpus. The second step is to count how often each element of the vocabulary occurs in each individual document. The third and final step takes the bags of words from the second step and builds a document-term matrix. The right-most table in figure 3.2 represents the document-term matrix resulting from this procedure. The next section will illustrate how this works in practice.

Figure 3.2 Extracting a document-term matrix from a collection of texts.

3.2.1 Text preprocessing

A common way to represent text documents is to use strings (associated with Python's str type). Consider the following code block, which represents the ten mini-documents from the figure above as a list of strings:

corpus = [
    "D'où me vient ce désordre, Aufide, et que veut dire",
    "Madame, il était temps qu'il vous vînt du secours:",
    "Ah! Monsieur, c'est donc vous?",
    "Ami, j'ai beau rêver, toute ma rêverie",
    "Ne me parle plus tant de joie et d'hyménée;",
    "Il est vrai, Cléobule, et je veux l'avouer,",
    "Laisse-moi mon chagrin, tout injuste qu'il est;",
    "Ton frère, je l'avoue, a beaucoup de mérite;",
    "J'en demeure d'accord, chacun a sa méthode;",
    'Pour prix de votre amour que vous peignez extrême,'
]

In order to construct a bag-of-words representation of each “text” in this corpus, we must first process the strings into distinct words. This process is called “tokenization” or “word segmentation.” A naive tokenizer might split documents along (contiguous) whitespace. In Python, such a tokenizer can be implemented straightforwardly by using the string method split(). As demonstrated in the following code block, this method employs a tokenization strategy in which tokens are separated by one or more instances of whitespace (e.g., spaces, tabs, newlines):

document = corpus[2]
print(document.split())

['Ah!', 'Monsieur,', "c'est", 'donc', 'vous?']

The tokenization strategy used often has far-reaching consequences for the composition of the final document-term matrix. If, for example, we decide not to lowercase the words, Il (‘he’) is considered to be different from il, whereas we would normally consider them to be instances of the same word type. An equally important question is whether we should incorporate or ignore punctuation marks. And what about contracted word forms? Should qu'il be restored to que and il?
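To make these questions more concrete, the following sketch (our own illustration; the choice of document and the set of stripped punctuation characters are arbitrary) compares three naive variants of whitespace tokenization on the sixth mini-document:

document = corpus[5]

# Variant 1: bare whitespace tokenization
print(document.split())

# Variant 2: lowercase first, so that Il and il collapse into one type
print(document.lower().split())

# Variant 3: additionally strip punctuation attached to the tokens
print([token.strip(',;:!?.') for token in document.lower().split()])

['Il', 'est', 'vrai,', 'Cléobule,', 'et', 'je', 'veux', "l'avouer,"]
['il', 'est', 'vrai,', 'cléobule,', 'et', 'je', 'veux', "l'avouer,"]
['il', 'est', 'vrai', 'cléobule', 'et', 'je', 'veux', "l'avouer"]

Each variant yields a different set of word types, and none of them splits the contraction l'avouer; every such decision changes which columns end up in the document-term matrix.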
Such choices may appear simple, but they may have a strong influence on the final text representation and, subsequently, on the analysis based on this representation. Unfortunately, it is difficult to provide a recommendation here apart from advising that tokenization procedures be carefully documented. To illustrate the complexity, consider the problem of modeling thematic differences between texts. For this problem, certain linguistic markers such as punctuation might not be relevant. However, the same linguistic markers might be of crucial importance to another problem. In authorship attribution, for example, it has been demonstrated that punctuation is one of the strongest predictors of authorial identity (Grieve 2007).

We already spoke about lowercasing texts, which is another common preprocessing step. Here as well, we should be aware that it has certain consequences for the final text representation. For instance, it complicates identifying proper nouns or the beginnings of sentences at a later stage in an analysis. Sometimes reducing the information recorded in a text representation is motivated by necessity: researchers may only have a fixed budget of computational resources available to analyze a corpus. The best recommendation here is to follow established strategies and exhaustively document the preprocessing steps taken. Distributing the code used in preprocessing is an excellent idea.

Many applications employ off-the-shelf tokenizers to preprocess texts. In the example below, we apply a tokenizer optimized for French as provided by the Natural Language ToolKit (NLTK) (Bird, Klein, and Loper 2009), and segment each document in corpus into a list of word tokens:

import nltk
import nltk.tokenize

# download the most recent punkt package
nltk.download('punkt', quiet=True)

document = corpus[3]
print(nltk.tokenize.word_tokenize(document, language='french'))

['Ami', ',', "j'ai", 'beau', 'rêver', ',', 'toute', 'ma', 'rêverie']

It can be observed that this tokenizer correctly splits off sentence-final punctuation such as full stops, but retains contracted forms, such as j'ai. Be aware that the clitic form j' is not restored to je. Such an example illustrates how tokenizers may come with a certain set of assumptions, which should be made explicit through, for instance, properly referring to the exact tokenizer applied in the analysis.

Given the current word segmentation, removing (repetitions of) isolated punctuation marks can be accomplished by filtering non-punctuation tokens. To this end, we implement a simple utility function called is_punct(), which checks whether a given input string is either a single punctuation marker or a sequence thereof:

import re

PUNCT_RE = re.compile(r'[^\w\s]+$')


def is_punct(string):
    """Check if STRING is a punctuation marker or a sequence of
    punctuation markers.

    Arguments:
        string (str): a string to check for punctuation markers.

    Returns:
        bool: True if string is a (sequence of) punctuation marker(s),
            False otherwise.

    Examples:
        >>> is_punct("!")
        True
        >>> is_punct("Bonjour!")
        False
        >>> is_punct("¿Te gusta el verano?")
        False
        >>> is_punct("...")
        True
        >>> is_punct("«»...")
        True

    """
    return PUNCT_RE.match(string) is not None

The function makes use of the regular expression [^\w\s]+$. For those with a rusty memory of regular expressions, allow us to briefly explain its components. \w matches Unicode word characters (including digit characters), and \s matches Unicode whitespace characters. By using the set notation [] and the negation sign ^, i.e., [^\w\s], the regular expression matches any character that is not matched by \w or \s, i.e., is not a word or whitespace character, and is thus a punctuation character. The + indicates that the expression should match one or more punctuation characters, and the $ matches the end of the string, which ensures that a string is only matched if it solely consists of punctuation characters.
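The Examples section in the docstring of is_punct() follows doctest conventions, so one way to verify that the function behaves as documented is the standard library's doctest module. A minimal sketch (our own addition, not part of the chapter's pipeline):

import doctest

# Re-run the examples embedded in the docstring of is_punct();
# with verbose=True each example and its outcome is printed.
doctest.run_docstring_examples(is_punct, {'is_punct': is_punct}, verbose=True)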
Using the function is_punct(), filtering all non-punctuation tokens can be accomplished using a for loop or a list comprehension. The following code block demonstrates the use of both looping mechanisms, which are essentially equivalent:

tokens = nltk.tokenize.word_tokenize(corpus[2], language='french')

# Loop with a standard for-loop
tokenized = []
for token in tokens:
    if not is_punct(token):
        tokenized.append(token)
print(tokenized)

# Loop with a list comprehension
tokenized = [token for token in tokens if not is_punct(token)]
print(tokenized)

['Ah', 'Monsieur', "c'est", 'donc', 'vous']
['Ah', 'Monsieur', "c'est", 'donc', 'vous']

After tokenizing and removing the punctuation, we are left with a sequence of alphanumeric strings (“words” or “word tokens”). Ideally, we would wrap these preprocessing steps in a single function, such as preprocess_text(), which returns a list of word tokens and removes all isolated punctuation markers. Consider the following implementation:

def preprocess_text(text, language, lowercase=True):
    """Preprocess a text.

    Perform a text preprocessing procedure, which transforms a string
    object into a list of word tokens without punctuation markers.

    Arguments:
        text (str): a string representing a text.
        language (str): a string specifying the language of text.
        lowercase (bool, optional): Set to True to lowercase all
            word tokens. Defaults to True.

    Returns:
        list: a list of word tokens extracted from text, excluding
            punctuation.

    Examples:
        >>> preprocess_text("Ah! Monsieur, c'est donc vous?", 'french')
        ['ah', 'monsieur', "c'est", 'donc', 'vous']

    """
    if lowercase:
        text = text.lower()
    tokens = nltk.tokenize.word_tokenize(text, language=language)
    tokens = [token for token in tokens if not is_punct(token)]
    return tokens

The lowercase parameter can be used to transform all word tokens into their lowercased form. To test this new function, we apply it to some of the toy documents in corpus:

for document in corpus[2:4]:
    print('Original:', document)
    print('Tokenized:', preprocess_text(document, 'french'))

Original: Ah! Monsieur, c'est donc vous?
Tokenized: ['ah', 'monsieur', "c'est", 'donc', 'vous']
Original: Ami, j'ai beau rêver, toute ma rêverie
Tokenized: ['ami', "j'ai", 'beau', 'rêver', 'toute', 'ma', 'rêverie']
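For the remaining steps it is convenient to keep the tokenized version of every document around. A minimal sketch (the variable name tokenized_corpus is our own choice, not prescribed by the text):

# Preprocess every document once and store the resulting token lists
tokenized_corpus = [preprocess_text(document, 'french') for document in corpus]
print(tokenized_corpus[2])

['ah', 'monsieur', "c'est", 'donc', 'vous']

Such a list of token lists is exactly the kind of input expected by the vocabulary extraction function defined below.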
Having tackled the problem of preprocessing a corpus of document strings, we can move on to the remaining steps required to create a document-term matrix. By default, the vocabulary of a corpus would comprise the complete set of words in all documents (i.e., all unique word types). However, nothing prevents us from establishing a vocabulary following a different strategy. Here, too, a useful rule of thumb is that we should try to restrict the number of words in the vocabulary as much as possible to arrive at a compact model, while, at the same time, not throwing out potentially useful information. Therefore, it is common to apply a threshold or frequency cutoff, with which less informative lexical items can be ignored. We could, for instance, decide to ignore words that only occur once throughout a corpus (so-called “hapax legomena,” or “hapaxes” for short). To establish such a vocabulary, one would typically scan the entire corpus and count how often each unique word occurs. Subsequently, we remove all words from the vocabulary that occur only once.

Given a sequence of items (e.g., a list or a tuple), counting items is straightforward in Python, especially when using the dedicated Counter object, which was discussed in an earlier chapter. In the example below, we compute the frequency for all tokens in corpus:

import collections

vocabulary = collections.Counter()
for document in corpus:
    vocabulary.update(preprocess_text(document, 'french'))

Counter implements a number of methods specialized for convenient and rapid tallies. For instance, the method Counter.most_common returns the n most frequent items:

print(vocabulary.most_common(n=5))

[('et', 3), ('vous', 3), ('de', 3), ('me', 2), ('que', 2)]

As can be observed, the most common words in the vocabulary are function words (or “stop words” as they are commonly called), such as me (personal pronoun), et (conjunction), and de (preposition). Words residing in the lower ranks of the frequency list are typically content words that have a more specific meaning than function words. This fundamental distinction between word types will re-appear at various places in the book (see, e.g., chapter 8). Using the Counter object constructed above, it is easy to compose a vocabulary which ignores these hapaxes:

print('Original vocabulary size:', len(vocabulary))
pruned_vocabulary = {
    token for token, count in vocabulary.items() if count > 1
}
print(pruned_vocabulary)
print('Pruned vocabulary size:', len(pruned_vocabulary))

Original vocabulary size: 66
{'il', 'est', 'a', 'je', 'me', "qu'il", 'que', 'vous', 'de', 'et'}
Pruned vocabulary size: 10

To refresh your memory, a Python set is a data structure which is well-suited for representing a vocabulary. A Python set, like its namesake in mathematics, is an unordered collection of distinct elements. Because a set only records distinct elements, we are guaranteed that all words appearing in it are unique. Similarly, we could construct a vocabulary which excludes the n most frequent tokens:

n = 5
print('Original vocabulary size:', len(vocabulary))
pruned_vocabulary = {token for token, _ in vocabulary.most_common()[n:]}
print('Pruned vocabulary size:', len(pruned_vocabulary))

Original vocabulary size: 66
Pruned vocabulary size: 61

Note how the size of the pruned vocabulary can indeed be aggressively reduced using such simple frequency thresholds.
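Frequency thresholds are not the only pruning strategy. As an aside (our own illustration, not part of the chapter's pipeline), one could also filter the vocabulary against a predefined stop word list, for instance the French list distributed with NLTK; this assumes the 'stopwords' resource has been downloaded:

import nltk
from nltk.corpus import stopwords

# the stop word lists ship as a separate NLTK resource
nltk.download('stopwords', quiet=True)

french_stopwords = set(stopwords.words('french'))
content_words = {token for token in vocabulary if token not in french_stopwords}
print('Vocabulary size without stop words:', len(content_words))

In what follows, however, we stick to frequency-based pruning.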
Abstracting over these two concrete routines, we can now implement a function extract_vocabulary(), which extracts a vocabulary from a tokenized corpus given a minimum and a maximum frequency count:

def extract_vocabulary(tokenized_corpus, min_count=1, max_count=float('inf')):
    """Extract a vocabulary from a tokenized corpus.

    Arguments:
        tokenized_corpus (list): a tokenized corpus represented as a
            list of lists of strings.
        min_count (int, optional): the minimum occurrence count of a
            vocabulary item in the corpus. Defaults to 1.
        max_count (int, optional): the maximum occurrence count of a
            vocabulary item in the corpus. Defaults to inf.

    Returns:
        list: An alphabetically ordered list of unique words in the
            corpus, of which the frequencies adhere to the specified
            minimum and maximum count.

    Examples:
        >>> corpus = [['the', 'man', 'love', 'man', 'the'],
        ...           ['the', 'love', 'book', 'wise', 'drama'],
        ...           ['a', 'story', 'book', 'drama']]
        >>> extract_vocabulary(corpus, min_count=2)
        ['book', 'drama', 'love', 'man', 'the']

    """
    vocabulary = collections.Counter()
    for document in tokenized_corpus:
        vocabulary.update(document)
    vocabulary = {
        word for word, count in vocabulary.items()
        if count >= min_count and count <= max_count
    }
    return sorted(vocabulary)
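As a quick check (our own usage sketch, relying on the tokenized_corpus list constructed above), applying extract_vocabulary() with min_count=2 to the French mini-corpus should return the same ten non-hapax items identified earlier, now in alphabetical order:

print(extract_vocabulary(tokenized_corpus, min_count=2))

['a', 'de', 'est', 'et', 'il', 'je', 'me', "qu'il", 'que', 'vous']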