Humanities Data Analysis

“125-85018_Karsdrop_Humanities_ch01_3p” — 2020/8/19 — 11:05 — page 266 — #19

Chapter 8: Stylometry and the Voice of Hildegard

8.4 Principal Component Analysis

In this final section, we will cover a common technique in stylometry called Principal Component Analysis (PCA). PCA stems from multivariate statistics and has been applied regularly to literary corpora in recent years (Binongo and Smith 1999; Hoover 2007). PCA is a useful technique for textual analysis because it enables intuitive visualizations of corpora.

The document-term matrix created earlier is a 36 × 65 matrix: we have 36 documents (by 3 authors), each characterized in terms of 65 word frequencies. The columns in such a matrix are also called dimensions, because our texts can be considered points in a geometric space that has 65 axes. We are used to plotting such points in a geometric space that has only a small number of axes (e.g., 2 or 3), using the coordinates that reflect their score in each dimension. Let us plot, for instance, these texts with respect to two randomly selected dimensions, namely those representing super and propter (see figure 8.4):

    import numpy as np
    import matplotlib.pyplot as plt

    # Assumes `vectorizer` (a scikit-learn vectorizer), `v_documents` and `authors`
    # were created earlier in the chapter. On scikit-learn < 1.0, use get_feature_names().
    words = list(vectorizer.get_feature_names_out())
    authors = np.array(authors)
    # Select the frequency columns for the two chosen words.
    x = v_documents[:, words.index('super')]
    y = v_documents[:, words.index('propter')]

    fig, ax = plt.subplots()
    # One scatter call per author, so each author gets its own color and legend entry.
    for author in sorted(set(authors)):
        ax.scatter(x[authors == author], y[authors == author], label=author)
    ax.set(xlabel='super', ylabel='propter')
    plt.legend()

Figure 8.4. A two-dimensional visualization of the distances between 36 writing samples by three authors (Hildegard of Bingen (H), Guibert of Gembloux (G), and Bernard of Clairvaux (B)) in terms of two word variables.

To make our plot slightly more readable, we could plot the author's name for each text instead of multi-colored dots. For this, we first need to create an empty scatter plot with invisible points, using scatter(). Next, we overlay each point with a string label that is centered on it vertically and horizontally. The resulting plot is shown in figure 8.5.

    fig, ax = plt.subplots()
    # Draw invisible points so that the axes are scaled to the data.
    ax.scatter(x, y, facecolors='none')
    # Overlay each point with the first letter of its author's name.
    for p1, p2, author in zip(x, y, authors):
        ax.text(p1, p2, author[0], fontsize=12, ha='center', va='center')
    ax.set(xlabel='super', ylabel='propter')

Figure 8.5. A two-dimensional visualization of the distances between 36 texts by three authors (Hildegard of Bingen (H), Guibert of Gembloux (G), and Bernard of Clairvaux (B)) in terms of two word variables.

The sad reality remains that we are inspecting only 2 of the 65 variables that we have for each data point. Humans have great difficulty imagining, let alone plotting, data in more than 3 dimensions simultaneously. Many real-life datasets have many more than 65 dimensions, so the problem becomes even more acute in those cases. PCA is one of the many techniques that exist to reduce the dimensionality of datasets. The general idea behind ...
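To give a rough sense of what reducing the dimensionality of a 36 × 65 matrix can look like, here is a minimal sketch that projects such a matrix onto its first two principal components with NumPy's singular value decomposition. This is not the chapter's own code: the function name pca_project and the random matrix X_demo are hypothetical stand-ins, and the random numbers only mimic the shape of the real document-term matrix, not its contents.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project the rows of X onto the first n_components principal components."""
    X = np.asarray(X, dtype=float)
    # Center each column so the components describe variance around the mean.
    X_centered = X - X.mean(axis=0)
    # SVD of the centered matrix: the rows of Vt are the principal axes,
    # ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Coordinates of each row (document) in the reduced space.
    projected = X_centered @ components.T
    # Share of the total variance captured by each retained component.
    explained = (S ** 2) / np.sum(S ** 2)
    return projected, components, explained[:n_components]


# Hypothetical stand-in for the 36 x 65 document-term matrix (random, NOT the real data).
rng = np.random.default_rng(0)
X_demo = rng.random((36, 65))

coords, axes, ratio = pca_project(X_demo, n_components=2)
print(coords.shape)  # (36, 2)
```

With the real data, one would presumably pass v_documents instead of X_demo and then plot the two columns of coords in the same way as x and y above, so that each text is placed using information from all 65 word frequencies rather than just two of them.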

Date posted: 20/11/2022, 11:28