Gisting Human Language Data


It’s not much of a leap at this point to think that it would be another major step forward to take into account the verbs and compute triples of the form subject-verb-object so that you know which entities are interacting with which other entities, and the nature of those interactions. Such triples would lend themselves to visualizing object graphs of documents, which we could potentially skim much faster than we could read the documents themselves. Better yet, imagine taking multiple object graphs derived from a set of documents and merging them to get the gist of the larger corpus. This exact technique is an area of active research and has tremendous applicability for virtually any situation suffering from the information-overload problem. But as will be illustrated, it’s an excruciating problem for the general case and not for the faint of heart.

Assuming a part-of-speech tagger has identified the parts of speech from a sentence and emitted output such as [('Mr.', 'NNP'), ('Green', 'NNP'), ('killed', 'VBD'), ('Colonel', 'NNP'), ('Mustard', 'NNP'), ...], an index storing subject-predicate-object tuples of the form ('Mr. Green', 'killed', 'Colonel Mustard') would be easy to compute. However, the reality of the situation is that you’re unlikely to run across actual POS-tagged data with that level of simplicity—unless you’re planning to mine children’s books (not actually a bad starting point for testing toy ideas).
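For the idealized case, a minimal sketch of such a computation might look like the following. This is not code from the book: the naive_triple function and its assumption that a single VBD verb separates the subject from the object are illustrative simplifications that only hold for toy input.

    # A minimal sketch (illustrative only): collapse consecutive proper nouns
    # into entities and treat the verb between them as the predicate. This
    # only works for trivially simple, well-behaved sentences.
    def naive_triple(tagged_tokens):
        subject, verb, obj = [], None, []
        for (token, pos) in tagged_tokens:
            if pos == 'VBD':
                verb = token
            elif pos == 'NNP':
                # Proper nouns before the verb form the subject; after it, the object
                (subject if verb is None else obj).append(token)
        return (' '.join(subject), verb, ' '.join(obj))

    tagged = [('Mr.', 'NNP'), ('Green', 'NNP'), ('killed', 'VBD'),
              ('Colonel', 'NNP'), ('Mustard', 'NNP')]

    print naive_triple(tagged)  # ('Mr. Green', 'killed', 'Colonel Mustard')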

For example, consider the tagging emitted from NLTK for the first sentence from the blog post printed earlier in this chapter as an arbitrary and realistic piece of data you might like to translate into an object graph:

This morning I had the chance to get a tour of The Henry Ford Museum in Dearborn, MI, along with Dale Dougherty, creator of Make: and Makerfaire, and Marc Greuther, the chief curator of the museum.

The simplest possible triple that you might expect to distill from that sentence is ('I', 'get', 'tour'), but even if you got that back, it wouldn’t convey that Dale Dougherty also got the tour, or that Marc Greuther was involved. The POS-tagged data should make it pretty clear that it’s not quite so straightforward to arrive at any of those interpretations, either, because the sentence has a very rich structure:

[(u'This', 'DT'), (u'morning', 'NN'), (u'I', 'PRP'), (u'had', 'VBD'),
 (u'the', 'DT'), (u'chance', 'NN'), (u'to', 'TO'), (u'get', 'VB'),
 (u'a', 'DT'), (u'tour', 'NN'), (u'of', 'IN'), (u'The', 'DT'),
 (u'Henry', 'NNP'), (u'Ford', 'NNP'), (u'Museum', 'NNP'), (u'in', 'IN'),
 (u'Dearborn', 'NNP'), (u',', ','), (u'MI', 'NNP'), (u',', ','),
 (u'along', 'IN'), (u'with', 'IN'), (u'Dale', 'NNP'), (u'Dougherty', 'NNP'),
 (u',', ','), (u'creator', 'NN'), (u'of', 'IN'), (u'Make', 'NNP'), (u':', ':'),
 (u'and', 'CC'), (u'Makerfaire', 'NNP'), (u',', ','), (u'and', 'CC'),
 (u'Marc', 'NNP'), (u'Greuther', 'NNP'), (u',', ','), (u'the', 'DT'),
 (u'chief', 'NN'), (u'curator', 'NN'), (u'of', 'IN'), (u'the', 'DT'),
 (u'museum', 'NN'), (u'.', '.')]

It’s doubtful that a high-quality open source NLP toolkit would be capable of emitting meaningful triples in this case, given the complex nature of the predicate “had a chance to get a tour” and that the other actors involved in the tour are listed in a phrase appended to the end of the sentence.

If you’d like to pursue strategies for constructing these triples, you should be able to use reasonably accurate POS tagging information to take a good initial stab at it. Advanced tasks in manipulating human language data can be a lot of work, but the results are satisfying and have the potential to be quite disruptive (in a good way).
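One speculative first stab, sketched below rather than taken from the book, is to chunk noun phrases with NLTK’s RegexpParser and then scan for NP-verb-NP patterns. The grammar and the hardcoded sentence are illustrative assumptions; real prose demands far more sophistication.

    import nltk

    # Illustrative chunk grammar: an optional determiner, any adjectives,
    # and one or more nouns form a noun phrase (NP).
    grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
    chunker = nltk.RegexpParser(grammar)

    tagged = [('Mr.', 'NNP'), ('Green', 'NNP'), ('killed', 'VBD'),
              ('Colonel', 'NNP'), ('Mustard', 'NNP'), ('in', 'IN'),
              ('the', 'DT'), ('study', 'NN'), ('.', '.')]

    # Flatten the chunked parse into a sequence of NP chunks and verbs
    sequence = []
    for node in chunker.parse(tagged):
        if isinstance(node, nltk.Tree):  # an NP chunk (node.node on older NLTK 2.x)
            sequence.append(('NP', ' '.join(token for (token, pos) in node.leaves())))
        elif node[1].startswith('VB'):
            sequence.append(('VB', node[0]))

    # Emit a (subject, verb, object) triple wherever an NP-VB-NP pattern appears
    for i in range(len(sequence) - 2):
        if [item[0] for item in sequence[i:i + 3]] == ['NP', 'VB', 'NP']:
            print (sequence[i][1], sequence[i + 1][1], sequence[i + 2][1])

For the toy sentence, this prints ('Mr. Green', 'killed', 'Colonel Mustard'); for the museum sentence above, the "had a chance to get a tour" predicate defeats such a simple pattern, which is exactly the point.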

The good news is that you can actually do a lot of fun things by distilling just the entities from text and using them as the basis of analysis, as demonstrated earlier. You can easily produce triples from text on a per-sentence basis, where the “predicate” of each triple is a notion of a generic relationship signifying that the subject and object “interacted” with each other. Example 5-9 is a refactoring of Example 5-8 that collects entities on a per-sentence basis, which could be quite useful for computing the interactions between entities using a sentence as a context window.

Example 5-9. Discovering interactions between entities

import nltk
import json

BLOG_DATA = "resources/ch05-webpages/feed.json"

def extract_interactions(txt):

    sentences = nltk.tokenize.sent_tokenize(txt)
    tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]
    pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]

    entity_interactions = []
    for sentence in pos_tagged_tokens:

        all_entity_chunks = []
        previous_pos = None
        current_entity_chunk = []

        for (token, pos) in sentence:

            if pos == previous_pos and pos.startswith('NN'):
                # Extend the current chunk with consecutive noun tokens
                current_entity_chunk.append(token)
            elif pos.startswith('NN'):
                # A new noun chunk begins; store the previous one, if any
                if current_entity_chunk != []:
                    all_entity_chunks.append((' '.join(current_entity_chunk),
                                              pos))
                current_entity_chunk = [token]

            previous_pos = pos

        # Only sentences with at least two entity chunks describe an "interaction"
        if len(all_entity_chunks) > 1:
            entity_interactions.append(all_entity_chunks)
        else:
            entity_interactions.append([])

    assert len(entity_interactions) == len(sentences)

    return dict(entity_interactions=entity_interactions,
                sentences=sentences)

blog_data = json.loads(open(BLOG_DATA).read())

# Display selected interactions on a per-sentence basis

for post in blog_data:

    post.update(extract_interactions(post['content']))

    print post['title']
    print '-' * len(post['title'])

    for interactions in post['entity_interactions']:
        print '; '.join([i[0] for i in interactions])

    print

The following results from this listing highlight something important about the nature of unstructured data analysis: it’s messy!

The Louvre of the Industrial Age
--------------------------------
morning; chance; tour; Henry Ford Museum; Dearborn; MI; Dale Dougherty; creator;
Make; Makerfaire; Marc Greuther; chief curator
tweet; Louvre
"; Marc; artifact; museum; block; contains; Luther Burbank; shovel; Thomas Edison
Luther Burbank; course; inventor; treasures; nectarine; Santa Rosa
Ford; farm boy; industrialist; Thomas Edison; friend
museum; Ford; homage; transformation; world
machines; steam; engines; coal; generators; houses; lathes; precision; lathes;
makerbot; century; ribbon glass machine; incandescent; lightbulbs; world; combine;
harvesters; railroad; locomotives; cars; airplanes; gas; stations; McDonalds;
restaurant; epiphenomena
Marc; eye; transformation; machines; objects; things
advances; engineering; materials; workmanship; design; years
years; visit; Detroit; museum; visit; Paris; Louvre; Rome; Vatican Museum;
Florence; Uffizi Gallery; St. Petersburg; Hermitage; Berlin
world; museums
Museum; Makerfaire Detroit
reach; Detroit; weekend
day; Makerfaire; day

A certain amount of noise in the results is almost inevitable, but realizing results that are highly intelligible and useful—even if they do contain a manageable amount of noise—is a worthy aim. The amount of effort required to achieve pristine results that are nearly noise-free can be immense. In fact, in most situations, this is downright impossible because of the inherent complexity involved in natural language and the limitations of most currently available toolkits, including NLTK. If you are able to make certain assumptions about the domain of the data or have expert knowledge of the nature of the noise, you may be able to devise heuristics that are effective without risking an unacceptable amount of information loss—but it’s a fairly difficult proposition.
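As one illustration of what such a heuristic could look like (a sketch under assumed conditions, not a recommendation from the book), you might post-process the interactions from Example 5-9 to discard chunks that are very short or that appear on a hand-built stoplist of uninteresting terms:

    # A speculative post-processing heuristic: drop entity chunks that are
    # very short or that appear on a domain-specific stoplist. The stoplist
    # below is a made-up example; a real one depends entirely on your data.
    NOISE_TERMS = set(['morning', 'chance', 'course', 'day', 'years', 'world'])

    def filter_interactions(entity_interactions):
        filtered = []
        for interactions in entity_interactions:
            keep = [(term, pos) for (term, pos) in interactions
                    if len(term) > 2 and term.lower() not in NOISE_TERMS]
            # Only keep sentences that still describe an interaction (2+ entities)
            filtered.append(keep if len(keep) > 1 else [])
        return filtered

    # Example usage with the output of extract_interactions from Example 5-9:
    # post.update(extract_interactions(post['content']))
    # post['entity_interactions'] = filter_interactions(post['entity_interactions'])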

Still, the interactions do provide a certain amount of “gist” that’s valuable. For example, how closely would your interpretation of “morning; chance; tour; Henry Ford Museum; Dearborn; MI; Dale Dougherty; creator; Make; Makerfaire; Marc Greuther; chief curator” align with the meaning in the original sentence?

As was the case with our previous adventure in summarization, displaying markup that can be visually skimmed for inspection is also quite handy. A simple modification to Example 5-9’s output, as shown in Example 5-10, is all that’s necessary to produce the results shown in Figure 5-6.

Example 5-10. Visualizing interactions between entities with HTML output

import os
import json
import nltk
from IPython.display import IFrame
from IPython.core.display import display

BLOG_DATA = "resources/ch05-webpages/feed.json"

HTML_TEMPLATE = """<html>
    <head>
        <title>%s</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    </head>
    <body>%s</body>
</html>"""

blog_data = json.loads(open(BLOG_DATA).read())

for post in blog_data:

    post.update(extract_interactions(post['content']))

    # Display output as markup with entities presented in bold text

    post['markup'] = []

    for sentence_idx in range(len(post['sentences'])):

        s = post['sentences'][sentence_idx]
        for (term, _) in post['entity_interactions'][sentence_idx]:
            s = s.replace(term, '<strong>%s</strong>' % (term, ))

        post['markup'] += [s]

    filename = post['title'].replace("?", "") + '.entity_interactions.html'
    f = open(os.path.join('resources', 'ch05-webpages', filename), 'w')
    html = HTML_TEMPLATE % (post['title'] + ' Interactions',
                            ' '.join(post['markup']),)
    f.write(html.encode('utf-8'))
    f.close()

    print "Data written to", f.name

# Display any of these files with an inline frame. This displays the
# last file processed by using the last value of f.name...

print "Displaying %s:" % f.name

display(IFrame('files/%s' % f.name, '100%', '600px'))

Figure 5-6. Sample HTML output that displays entities identified in the text in bold so that it’s easy to visually skim the content for its key concepts

It could also be fruitful to perform additional analyses to identify the sets of interactions for a larger body of text and to find and visualize co-occurrences in the interactions. The code involving force-directed graphs illustrated in Section 2.3.2.3 on page 83 would make a good starting template for visualization, but even without knowing the specific nature of the interaction, there’s still a lot of value in just knowing the subject and the object. If you’re feeling ambitious, you should attempt to complete the tuples with the missing verbs.
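As a starting point for that kind of co-occurrence analysis, here is a minimal sketch (not from the book) that counts how often pairs of entities appear in the same sentence, based on the entity_interactions structure produced by extract_interactions in Example 5-9; the most frequent pairs would be natural edges for a force-directed graph.

    from itertools import combinations
    from collections import Counter

    # A minimal sketch: tally entity pairs that co-occur within a sentence.
    # Assumes each post has already been updated with extract_interactions.
    def count_cooccurrences(posts):
        pair_counts = Counter()
        for post in posts:
            for interactions in post['entity_interactions']:
                entities = sorted(set(term for (term, pos) in interactions))
                for pair in combinations(entities, 2):
                    pair_counts[pair] += 1
        return pair_counts

    # Example usage:
    # pair_counts = count_cooccurrences(blog_data)
    # for (pair, count) in pair_counts.most_common(10):
    #     print pair, count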


