Humanities Data Analysis (“125-85018_Karsdrop_Humanities_ch01_3p”, 2020/8/19, 11:00, page 65, #34): Parsing and Manipulating Structured Data
    TransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","

Unfortunately, not only text but also some JavaScript leaks through into the extracted text, which is of no interest to us here. To explicitly remove such JavaScript and style-related code from our result, we can first discard the script (and inline style) elements altogether and extract the text again, followed by a cosmetic operation that collapses multiple linebreaks:

    import re

    for script in soup(['script', 'style']):
        script.extract()

    text = soup.get_text()
    # remove multiple linebreaks:
    text = re.sub(r'\s*\n+\s*', '\n', text)
    print(text[:300])

    William Shakespeare - Wikipedia
    William Shakespeare
    From Wikipedia, the free encyclopedia
    Jump to navigation
    Jump to search
    This article is about the poet and playwright. For other persons of the same name, see William Shakespeare (disambiguation). For other uses of "Shakespeare", see Shakespeare (

Following a similar strategy as before, we extract all hyperlinks from the retrieved webpage:

    links = soup.find_all('a')
    print(links[9].prettify())

    Chandos portrait

The extracted links contain both links to external pages and links pointing to other sections on the same page (which lack an href attribute). Such links between webpages are crucial on the world wide web, which should be viewed as an intricate network of linked pages. Networks offer a fascinating way to model information in an innovative fashion and lie at the heart of the next section of this chapter.

2.8 Extracting Character Interaction Networks

The previous sections in this chapter have consisted of a somewhat tedious listing of various common file formats that can be useful in the context of storing and exchanging data for quantitative analyses in the humanities. Now it is
time to move beyond the kind of simple tasks presented above and make clear how we can use such data formats in an actual application. As announced in the introduction, we will work below with the case study of a famous character network analysis of Hamlet.

The relationships between fictional characters in literary works can be conceptualized as social networks. In recent years, the computational analysis of such fictional social networks has steadily gained popularity. Network analysis can contribute to the study of literary fiction by formally mapping character relations in individual works. More interesting, however, is the application of network analysis to larger collections of works, which reveals the abstract and general patterns and structure of character networks.

Studying relations between speakers is of central concern in much research about dramatic works (see, e.g., Ubersfeld et al. 1999). One example which is well known in literary studies and which inspires this chapter is the analysis of Hamlet in Moretti (2011). In the field of computational linguistics, advances have been made in recent years, with research focusing on, for instance, social network analyses of nineteenth-century fiction (Elson, Dames, and McKeown 2010), Alice in Wonderland (Agarwal, Kotalwar, and Rambow 2013), Marvel graphic novels (Alberich, Miro-Julia, and Rossello 2002), or love relationships in French classical drama (Karsdorp, Kestemont, Schöch, and Van den Bosch 2015).

Before describing in more detail what kind of networks we will create from Shakespeare’s plays, we will introduce the general concept of networks in a slightly more formal way.[25] In network theory, networks consist of nodes (sometimes called vertices) and edges connecting pairs of nodes. Consider the following sets of nodes (V) and edges (E):

    V = {1, 2, 3, 4, 5}, E = {1 ↔ 2, 1 ↔ 4, 2 ↔ 5, 3 ↔ 4, 4 ↔ 5}

The notation 1 ↔ 2 means that nodes 1 and 2 are
connected through an edge. A network G, then, is defined as the combination of nodes V and edges E, i.e., G = (V, E). In Python, we can define these sets of vertices and edges as follows:

    V = {1, 2, 3, 4, 5}
    E = {(1, 2), (1, 4), (2, 5), (3, 4), (4, 5)}

In this case, one would speak of an “undirected” network, because the edges lack directionality, and the nodes in such a pair reciprocally point to each other. By contrast, a directed network consists of edges pointing in a single direction, as is often the case with links on webpages.

To construct an actual network from these sets, we will employ the third-party package NetworkX,[26] which is an intuitive Python library for creating, manipulating, visualizing, and studying the structure of networks. Consider the following:

    import networkx as nx

    G = nx.Graph()
    G.add_nodes_from(V)
    G.add_edges_from(E)

[25] See Newman (2010) for an excellent and comprehensive introduction to network theory.
[26] https://networkx.github.io/

After construction, the network G can be visualized with Matplotlib using networkx.draw_networkx() (see figure 2.2):

    import matplotlib.pyplot as plt

    nx.draw_networkx(G, font_color="white")
    plt.axis('off')

Figure 2.2. Visualization of a toy network consisting of five nodes and five edges.

Having a rudimentary understanding of networks, let us now define social networks in the context of literary texts. In the networks we will extract from Shakespeare’s plays, nodes are represented by speakers. What determines a connection (i.e., an edge) between two speakers is less straightforward and strongly dependent on the sort of relationship one wishes to capture. Here, we construct edges between two speakers if they are “in interaction with each other.” Two speakers A and B interact, we claim, if an utterance of A is preceded or followed by an utterance of B.[27] Furthermore, in order to track the frequency of
character interactions, each of the edges in our approach will hold a count representing the number of times two speakers have interacted. This number thus becomes a so-called attribute or property of the edge that has to be explicitly stored. The final result can then be described as a network in which speakers are represented as nodes, and interactions between speakers are represented as weighted edges.

[27] Our approach here diverges from Moretti’s (2011) own approach, in which he manually extracted these interactions, whereas we follow a fully automated approach. For Moretti, “two characters are linked if some words have passed between them: an interaction, is a speech act” (Moretti 2011).

Having defined the type of social network we aim to construct, the real challenge we face is to extract such networks from Shakespeare’s plays in a data format that can be easily exchanged. Fortunately, the Folger Digital Texts of Shakespeare provide annotations for speaker turns, which are a rich source of information for the construction of the character network. Now that we are able to parse XML, we can extract speaker turns from the data files: each speaker turn, together with the text uttered by the speaker, is enclosed within sp tags. The ID of the corresponding speaker is stored in the who attribute. Consider the following fragment:

    ROSALIND What shall be our sport , then ?
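As a rough illustration of how the who attribute can be read, the sketch below parses a small, made-up fragment. It uses Python's standard-library xml.etree.ElementTree rather than lxml so that it runs without extra dependencies. The fragment, the who values, and the helper name speaker_id are assumptions for illustration only and do not reproduce the actual Folger markup.

```python
import xml.etree.ElementTree as ET

NSMAP = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily simplified TEI-like scene (not actual Folger markup).
FRAGMENT = """
<div2 xmlns="http://www.tei-c.org/ns/1.0" type="scene">
  <sp who="#Rosalind_AYL"><speaker>ROSALIND</speaker><ab>first turn</ab></sp>
  <sp who="#Celia_AYL"><speaker>CELIA</speaker><ab>second turn</ab></sp>
  <sp><ab>a turn without a who attribute</ab></sp>
</div2>
"""


def speaker_id(sp):
    """Return a normalized speaker ID, or None if the turn has no who attribute."""
    who = sp.get("who")
    if who is None:
        return None
    # Assumed ID shape '#Name_PLAY': keep the part before '_' and drop the '#'.
    return who.split("_")[0].replace("#", "")


scene = ET.fromstring(FRAGMENT)
turns = [speaker_id(sp) for sp in scene.findall(".//tei:sp", NSMAP)]
print(turns)
```

Returning None for a turn that lacks a who attribute makes the missing speaker explicit, so that the caller can decide whether to skip such turns.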
With this information about speaker turns, implementing a function to extract character interaction networks becomes trivial. Consider the function character_network() below, which takes as argument an lxml ElementTree object and returns a character network represented as a networkx.Graph object:

    NSMAP = {'tei': 'http://www.tei-c.org/ns/1.0'}

    def character_network(tree):
        """Construct a character interaction network.

        Construct a character interaction network for Shakespeare texts in
        the Folger Digital Texts. Character interaction networks are
        constructed on the basis of successive speaker turns in the texts,
        and edges between speakers are created when their utterances follow
        one another.

        Arguments:
            tree: An lxml.ElementTree instance representing one of the XML
                files in the Folger Shakespeare collection.

        Returns:
            A character interaction network represented as a weighted,
            undirected NetworkX Graph.

        """
        G = nx.Graph()
        # extract a list of speaker turns for each scene in a play
        for scene in tree.iterfind('.//tei:div2[@type="scene"]', NSMAP):
            speakers = scene.findall('.//tei:sp', NSMAP)
            # iterate over the sequence of speaker turns
            for i in range(len(speakers) - 1):
                # and extract pairs of adjacent speakers
                try:
                    speaker_i = speakers[i].attrib['who'].split(
                        '_')[0].replace('#', '')
                    speaker_j = speakers[i + 1].attrib['who'].split(
                        '_')[0].replace('#', '')
                    # if the interaction between two speakers has already
                    # been attested, update their interaction count
                    if G.has_edge(speaker_i, speaker_j):
                        G[speaker_i][speaker_j]['weight'] += 1
                    # else add an edge between speaker i and j to the graph
                    else:
                        G.add_edge(speaker_i, speaker_j, weight=1)
                except KeyError:
                    # skip speaker turns that lack a who attribute
                    continue
        return G

Note that this code employs search expressions in the XPath syntax. The expression we pass to tree.iterfind(), for instance, uses a so-called predicate ([@type="scene"]) to select all div2 elements that have
a "type" attribute with a value of "scene". In the returned part of the XML tree, we then select only the speaker elements (sp) and parse their who attribute, which helps us reconstruct, or at least approximate, the conversations going on in this part of the play. Let’s test the function on one of Shakespeare’s plays, Hamlet:

    import lxml.etree

    tree = lxml.etree.parse('data/folger/xml/Ham.xml')
    G = character_network(tree.getroot())

The extracted social network consists of 38 nodes (i.e., unique speakers) and 73 edges (i.e., unique speaker interactions):

    print(f"N nodes = {G.number_of_nodes()}, N edges = {G.number_of_edges()}")

    N nodes = 38, N edges = 73

An attractive feature of network analysis is that the extracted network can be visualized. The visualization will be a graph in which speakers are represented by nodes and interactions between speakers by edges. To make our network graph more insightful, we will have the size of the nodes reflect the count of the interactions. We begin by extracting and computing the node sizes:

    import collections

    interactions = collections.Counter()

    for speaker_i, speaker_j, data in G.edges(data=True):
        interaction_count = data['weight']
        interactions[speaker_i] += interaction_count
        interactions[speaker_j] += interaction_count

    # the scaling factor was lost in this copy; 5 is an assumed value
    nodesizes = [interactions[speaker] * 5 for speaker in G]

In the code block above, we again make use of a Counter, which, as explained before, is a dictionary in which the values represent the counts of the keys. Next, we employ NetworkX’s plotting functionality to create the visualization of the character network shown in figure 2.3:

    # Create an empty figure of size 15x15
    fig = plt.figure(figsize=(15, 15))
    # Compute the positions of the nodes using the spring layout algorithm
    pos = nx.spring_layout(G, k=0.5, iterations=200)
    # Then, add the edges to the visualization
    nx.draw_networkx_edges(G, pos, alpha=0.4)
    # Subsequently, add the weighted nodes to the visualization
    nx.draw_networkx_nodes(G, pos, node_size=nodesizes, alpha=0.4)
    # Finally, add the labels (i.e., the speaker IDs) to the visualization
    nx.draw_networkx_labels(G, pos, font_size=14)
    plt.axis('off')

Figure 2.3. Visualization of the character interaction network in Hamlet.

As becomes clear in the resulting plot, NetworkX is able to come up with an attractive visualization through the use of a so-called layout algorithm (here, we fairly arbitrarily opt for spring_layout, but many alternative layout strategies exist). The resulting plot understandably assigns Hamlet a central position, because of his obvious centrality in the social story-world evoked in the play. Less central characters are likewise pushed towards the boundaries of the graph.

If we want to de-emphasize the frequency of interaction and focus instead on the fact of interaction, we can remove the edge weights from our links altogether; such weights did not play an explicit role in Moretti’s graph. Below we make a copy (G0) of the original graph and set all of its weights to 1, before replotting the network:

    from copy import deepcopy

    G0 = deepcopy(G)
    for u, v, d in G0.edges(data=True):
        d['weight'] = 1

    # the scaling factor was lost in this copy; 5 is an assumed value
    nodesizes = [interactions[speaker] * 5 for speaker in G0]

    fig = plt.figure(figsize=(15, 15))
    pos = nx.spring_layout(G0, k=0.5, iterations=200)
    nx.draw_networkx_edges(G0, pos, alpha=0.4)
    nx.draw_networkx_nodes(G0, pos, node_size=nodesizes, alpha=0.4)
    nx.draw_networkx_labels(G0, pos, font_size=14)
    plt.axis('off')

Figure 2.4. Visualization of the character interaction network in Hamlet without weights.

Note how, for instance, the two gravediggers are pushed much more to the periphery in this unweighted perspective on the data, reflecting the fact that the actual length of the
conversation involving the gravediggers is no longer being considered.

One experiment, suggested in Moretti (2011), is relevant here and involves the manipulation of the graph. Moretti proposes the following challenging intervention:

    Take the protagonist again. For literary critics, [the visualization of the character network] is important because it’s a very meaningful part of the text; there is always a lot to be said about it; we would never think of discussing Hamlet—without Hamlet. But this is exactly what network theory tempts us to do: take the Hamlet-network, and remove Hamlet, to see what happens. (Moretti 2011)

Removing Hamlet from the original text may be challenging, but removing him as a node from our network model is painless:

    G0.remove_node('Hamlet')

We are now ready to plot the character network of Hamlet, without Hamlet (see figure 2.5):

    fig = plt.figure(figsize=(15, 15))
    pos = nx.spring_layout(G0, k=0.5, iterations=200)
    # the scaling factor was lost in this copy; 5 is an assumed value
    nodesizes = [interactions[speaker] * 5 for speaker in G0]
    nx.draw_networkx_edges(G0, pos, alpha=0.4)

Figure 2.5. Visualization of the character interaction network in Hamlet without the character Hamlet.
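One way to quantify "what happens" when a central node is removed is to count how many disconnected groups the network falls into. The sketch below does this for a hypothetical toy network in plain Python, without NetworkX, so that it runs with the standard library alone. The node names and the helper functions are invented for illustration, and nothing here is a result about Hamlet. With NetworkX installed, nx.number_connected_components(G) gives the same count for a Graph.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency mapping from (u, v) pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


def remove_node(adjacency, node):
    """Return a copy of the adjacency mapping without the given node."""
    return {u: nbrs - {node} for u, nbrs in adjacency.items() if u != node}


def connected_components(adjacency):
    """Return the connected components as a list of node sets."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        # depth-first traversal from the first unvisited node
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical toy network: 'hub' talks to everyone, 'a' and 'b' also talk.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
full = build_adjacency(edges)
without_hub = remove_node(full, "hub")

n_before = len(connected_components(full))
n_after = len(connected_components(without_hub))
print(n_before, n_after)
```

In this toy case the network splits into two groups once the hub is gone, because node c was connected to the rest only through the hub.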