Humanities Data Analysis “125-85018_Karsdrop_Humanities_ch01_3p” — 2020/8/19 — 11:05 — page 268 — #21 268 • Chapter 8
dimensionality reduction is that we seek a new representation of a dataset, which needs far fewer dimensions to characterize our data points, but which still offers a maximally faithful approximation of our data. PCA is a prototypical approach to text modeling in stylometry, because we create a much smaller model which we know beforehand will only be an approximation of our original dataset. This operation is often crucial for scholars who deal with large datasets, where the number of features is much larger than the number of data points. In stylometry, it is very common to reduce the dimensionality of a dataset to as few as 2 or 3 dimensions.

8.4.1 Applying PCA

Before going into the intuition behind this technique, it makes sense to get a rough idea of the kind of output which a PCA can produce. Nowadays, there are many Python libraries which allow you to quickly run a PCA. The aforementioned scikit-learn library has a very efficient, uncluttered object for this. Below, we instantiate such an object and use it to reduce the dimensionality of our original 36 × 65 matrix:

    import sklearn.decomposition

    # Keep only the first two principal components.
    pca = sklearn.decomposition.PCA(n_components=2)
    documents_proj = pca.fit_transform(v_documents)

    print(v_documents.shape)
    print(documents_proj.shape)

    (36, 65)
    (36, 2)

Note that the shape information shows that the dimensionality of our new dataset is indeed restricted to n_components, the parameter which we set at 2 when calling the PCA constructor. Each of these newly created columns is called a “principal component” (PC), and can be expected to describe an important aspect of our data. Apparently, the PCA has managed to provide a two-dimensional “summary” of our dataset, which originally contained 65 columns. Because our dataset is now low-dimensional, we can plot it using the same plotting techniques that were introduced above. The code below produces the visualization in figure 8.6:

    import matplotlib.pyplot as plt

    c1, c2 = documents_proj[:, 0], documents_proj[:, 1]

    fig, ax = plt.subplots()
    # Draw invisible markers so that the axes are scaled to the data.
    ax.scatter(c1, c2, facecolors='none')

    # Label each text with the first letter of its author's name.
    for p1, p2, author in zip(c1, c2, authors):
        ax.text(p1, p2, author[0], fontsize=12, ha='center', va='center')

    ax.set(xlabel='PC1', ylabel='PC2')

Figure 8.6. A scatter plot displaying 36 texts by three authors in the first two components of a PCA.

By convention, we plot the first component on the horizontal axis and the second component on the vertical axis. The resulting plot displays much more authorial structure: the texts now form very neat per-author clusters. Like hierarchical clustering, PCA is an unsupervised method: we never included any provenance information about the texts in the analysis. It is therefore fascinating that the analysis still seems able to distinguish automatically between the writing styles of our three authors. The first component is responsible for the horizontal distribution of our texts in the plot; interestingly, we see that this component primarily manages to separate Hildegard from her two male contemporaries. The second component, on the other hand, is more useful for distinguishing Bernard of Clairvaux from Guibert of Gembloux on the vertical axis. This is also clear from the simpler visualization shown in figure 8.7, which was generated by the following lines of code:

    fig, ax = plt.subplots(figsize=(4, 8))

    # One vertical line per component, with each text placed at its score.
    for idx in range(pca.components_.shape[0]):
        ax.axvline(idx, linewidth=2, color='lightgrey')
        for score, author in zip(documents_proj[:, idx], authors):
            ax.text(idx, score, author[0], fontsize=10, va='center', ha='center')

    ax.axhline(0, ls='dotted', c='black')

Figure 8.7. Displaying 36 writing samples by three authors in the first (PC1) and second (PC2) components of a PCA separately. PC1 and PC2 realize different distinctions between the three authors.
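To get a feeling for what happens when a 36 × 65 matrix is reduced to two principal components, the reduction can be imitated with NumPy alone. The sketch below is an illustration under stated assumptions, not the scikit-learn implementation: the matrix `v_stand_in` holds random numbers as a hypothetical stand-in for `v_documents`, and the helper `pca_project` is a name introduced here for illustration only. It centers each column, takes a singular value decomposition, and projects the data onto the first two principal axes.

```python
import numpy as np

# Hypothetical stand-in for the chapter's v_documents matrix:
# 36 documents described by 65 features (random values, not real data).
rng = np.random.default_rng(0)
v_stand_in = rng.normal(size=(36, 65))


def pca_project(X, n_components=2):
    """Project the rows of X onto its first n_components principal components."""
    # Center each column so that every feature has zero mean.
    X_centered = X - X.mean(axis=0)
    # The rows of Vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance captured by each retained component.
    explained_variance = singular_values[:n_components] ** 2 / (X.shape[0] - 1)
    return X_centered @ components.T, components, explained_variance


projected, components, explained_variance = pca_project(v_stand_in)
print(v_stand_in.shape)   # (36, 65)
print(projected.shape)    # (36, 2)
print(components.shape)   # (2, 65)
```

Each row of `components` holds one weight per original feature, which is why its shape is (2, 65). On the same input, the scores should agree with those of scikit-learn up to the sign of each component, since the direction of a principal axis is arbitrary.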