Her last secretary, the monk Guibert of Gembloux, is the secondary focus of this analysis. Two letters are extant which are commonly attributed to Hildegard herself, although philologists have noted that these are much closer to Guibert's writings in style and tone. These texts are available under the folder data/hildegard/texts, where their titles have been abbreviated: D_Mart.txt (Visio de Sancto Martino) and D_Missa.txt (Visio ad Guibertum missa). Note that the prefix D_ reflects their dubious authorship. Below, we will apply quantitative text analysis to find out whether we can add quantitative evidence to support the thesis that Guibert, perhaps even more than Hildegard herself, had a hand in authoring these letters. Because our focus involves only two authors, both known to have been at least somewhat involved in the production of these letters, it makes sense to approach this case from the attribution perspective.
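Before turning to the method, we can load the two dubious letters from the folder mentioned above. This is a minimal sketch of our own, not code from the chapter; it assumes the files are UTF-8 encoded plain text and that only the two D_-prefixed files match the glob pattern.

```python
from pathlib import Path

folder = Path("data/hildegard/texts")

# Collect the two letters of dubious (D_) authorship, keyed by file stem
dubious = {fp.stem: fp.read_text(encoding="utf-8")
           for fp in sorted(folder.glob("D_*.txt"))}

print(sorted(dubious))  # expected: ['D_Mart', 'D_Missa']
```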
8.2.1 Burrows's Delta

Burrows's Delta is a technique for authorship attribution, which was introduced by Burrows (2002). John F. Burrows is commonly considered one of the founding fathers of present-day stylometry, not least because of his foundational monograph reporting a computational study of Jane Austen's oeuvre (Burrows 1987; Craig 2004). The technique attempts to attribute an anonymous text to one of a series of candidate authors for whom example texts are available as training material. Although Burrows did not originally frame Delta as such, we will discuss how it can be considered a naive machine learning method for text classification. While the algorithm is simple, intuitive, and straightforward to implement, it has been shown to produce surprisingly strong results in many cases, especially when texts are relatively long (e.g., entire novels).

Framed as a method for text classification, Burrows's Delta consists of two consecutive steps. First, during fitting (i.e., the training stage), the method takes a number of example texts from candidate authors. Subsequently, during testing (i.e., the prediction stage), the method takes a series of new, unseen texts and attempts to attribute each of them to one of the candidate authors encountered during training. Delta's underlying assignment technique is simple and intuitive: it attributes a new text to the author of its most similar training document, i.e., the training document which lies at the minimal stylistic distance from the test document. In the machine learning literature, this is also known as a "nearest neighbor" classifier, since Delta extrapolates the authorship of the nearest neighbor to the test document (cf. section 3.3.2). Other terms frequently used for this kind of learning are instance-based learning and memory-based learning (Aha, Kibler, and Albert 1991; Daelemans and Van den Bosch 2005). In the current context, the word "training" is somewhat misleading, as training is often limited to storing the training examples in memory and calculating some simple statistics. Most of the actual work is done during the prediction stage, when the stylistic distances between documents are computed, as the sketch below illustrates.
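The following sketch makes the fit/predict division concrete. It is our own illustration rather than the chapter's reference implementation; the class name is hypothetical, and the fit/predict method names merely mirror scikit-learn's conventions. Fitting does nothing but memorize the training vectors and their author labels; prediction computes the distance from each test document to every training document and extrapolates the nearest neighbor's authorship.

```python
import numpy as np

class NearestNeighborAttributor:
    """Toy nearest-neighbor classifier for authorship attribution."""

    def __init__(self, distance):
        # distance is a user-supplied function, e.g., Burrows's Delta
        self.distance = distance

    def fit(self, X_train, authors):
        # "Training" amounts to storing the examples in memory
        self.X_train = np.asarray(X_train, dtype=float)
        self.authors = list(authors)
        return self

    def predict(self, X_test):
        predictions = []
        for x in np.asarray(X_test, dtype=float):
            # Stylistic distance to every training document
            dists = [self.distance(x, t) for t in self.X_train]
            # Attribute the test document to its nearest neighbor's author
            predictions.append(self.authors[int(np.argmin(dists))])
        return predictions
```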
In the prediction stage, the metric used to retrieve a document's nearest neighbor among the training examples is of crucial importance. Burrows's Delta essentially is such a metric, which Burrows originally defined as "the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text" (Burrows 2002). Assume that we have a document $d$ with $n$ unique words, which, after vectorization, is represented as a vector $x$ of length $n$. The z-score for each relative word frequency $x_i$ can then be defined as:

$$z(x_i) = \frac{x_i - \mu_i}{\sigma_i} \tag{8.1}$$

Here, $\mu_i$ and $\sigma_i$ respectively stand for the sample mean and sample standard deviation of the word's frequencies in the reference corpus. Suppose that we have a second document available, represented as a vector $y$, and that we would like to calculate the Delta, or stylistic difference, between $x$ and $y$. Following Burrows's definition, $\Delta(x, y)$ is then the mean of the absolute differences for the z-scores of the words in them:

$$\Delta(x, y) = \frac{1}{n} \sum_{i=1}^{n} \left| z(x_i) - z(y_i) \right| \tag{8.2}$$

The vertical bars in the right-most part of the formula indicate that we take the absolute value of the observed differences. In 2008, Argamon observed that this formula can be rewritten in an interesting manner (Argamon 2008). He first spelled out the calculation of the z-scores in full in the previous formula:

$$\Delta(x, y) = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{x_i - \mu_i}{\sigma_i} - \frac{y_i - \mu_i}{\sigma_i} \right| \tag{8.3}$$

Argamon noted that since $n$ (i.e., the size of the vocabulary used for the bag-of-words model) is in fact the same for each document, in practice we can leave $n$ out, because it will not affect the ranking of the distances returned:

$$\Delta(x, y) = \sum_{i=1}^{n} \left| \frac{x_i - \mu_i}{\sigma_i} - \frac{y_i - \mu_i}{\sigma_i} \right| \tag{8.4}$$

Burrows's Delta applies a very common scaling operation to the document vectors. Note that we can take the division by the standard deviation out of the formula and apply it beforehand to $x_{1:n}$ and $y_{1:n}$, as a sort of scaling, during the fitting stage. If $\sigma_{1:n}$ is a vector containing the standard deviation of each word's frequency in the training set, and $\mu_{1:n}$ a vector with the corresponding means for each word in the training data, we could do this as follows (in vectorized notation):

$$\hat{x}_{1:n} = \frac{x_{1:n} - \mu_{1:n}}{\sigma_{1:n}}, \qquad \hat{y}_{1:n} = \frac{y_{1:n} - \mu_{1:n}}{\sigma_{1:n}} \tag{8.5}$$

This is in fact a very common vector scaling strategy in machine learning and text classification. If we perform this scaling operation beforehand, during the fitting stage, computing the Delta between two documents reduces to summing the absolute differences between their scaled entries, i.e., taking the Manhattan distance between the scaled vectors (cf. equation 8.4).
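This equivalence is easy to verify numerically. The sketch below is our own, with illustrative variable names and made-up random data standing in for real word frequencies; it computes Delta once directly from equation 8.2 and once as the Manhattan distance between vectors scaled as in equation 8.5, and checks that the two agree up to the constant factor $n$.

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.random((10, 5))           # 10 training docs, 5 word frequencies
mu = train.mean(axis=0)               # per-word means (mu_1:n)
sigma = train.std(axis=0, ddof=1)     # per-word sample standard deviations (sigma_1:n)

x, y = rng.random(5), rng.random(5)   # two new documents

# Equation 8.2: mean absolute difference of z-scores
delta = np.mean(np.abs((x - mu) / sigma - (y - mu) / sigma))

# Equation 8.5: scale beforehand, then take the Manhattan distance (eq. 8.4)
x_scaled = (x - mu) / sigma
y_scaled = (y - mu) / sigma
manhattan = np.sum(np.abs(x_scaled - y_scaled))

assert np.isclose(delta, manhattan / len(x))  # identical up to the factor n
```

In practice, this means Delta can be implemented as a standard scaling step during fitting, followed by an off-the-shelf Manhattan (city block) distance at prediction time.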