5.3.2. Sentence Detection in Human Language Data
Given that sentence detection is probably the first task you’ll want to ponder when building an NLP stack, it makes sense to start there. Even if you never complete the remaining tasks in the pipeline, it turns out that EOS detection alone yields some powerful possibilities, such as document summarization, which we’ll be considering as a follow-up exercise in the next section. But first, we’ll need to fetch some clean human language data. Let’s use the tried-and-true feedparser package, along with some utilities introduced in the previous chapter that are based on nltk and BeautifulSoup for cleaning up any HTML formatting that may appear in the content, to fetch some posts from the O’Reilly Radar blog. The listing in Example 5-4 fetches a few posts and saves them to a local file as JSON.
Example 5-4. Harvesting blog data by parsing feeds
import os
import json
import feedparser

from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom'

def cleanHtml(html):
    return BeautifulStoneSoup(clean_html(html),
                convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

fp = feedparser.parse(FEED_URL)

print "Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title)

blog_posts = []

for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': cleanHtml(e.content[0].value),
                       'link': e.links[0].href})

out_file = os.path.join('resources', 'ch05-webpages', 'feed.json')
f = open(out_file, 'w')
f.write(json.dumps(blog_posts, indent=1))
f.close()

print 'Wrote output file to %s' % (f.name, )
Obtaining human language data from a reputable source affords us the luxury of assuming good English grammar; hopefully this also means that one of NLTK’s out-of-the-box sentence detectors will work reasonably well. There’s no better way to find out than hacking some code to see what happens, so go ahead and review the code listing in Example 5-5. It introduces the sent_tokenize and word_tokenize methods, which are aliases for NLTK’s currently recommended sentence detector and word tokenizer. A brief discussion of the listing is provided afterward.
Example 5-5. Using NLTK’s NLP tools to process human language in blog data
import json
import nltk

# Download nltk packages used in this example
nltk.download('stopwords')

BLOG_DATA = "resources/ch05-webpages/feed.json"

blog_data = json.loads(open(BLOG_DATA).read())

# Customize your list of stopwords as needed. Here, we add common
# punctuation and contraction artifacts.
stop_words = nltk.corpus.stopwords.words('english') + [
    '.', ',', '--', '\'s', '?', ')', '(', ':', '\'', '\'re', '"',
    '-', '}', '{', u'—',
    ]

for post in blog_data:

    sentences = nltk.tokenize.sent_tokenize(post['content'])

    words = [w.lower() for sentence in sentences
             for w in nltk.tokenize.word_tokenize(sentence)]

    fdist = nltk.FreqDist(words)

    # Basic stats
    num_words = sum([i[1] for i in fdist.items()])
    num_unique_words = len(fdist.keys())

    # Hapaxes are words that appear only once
    num_hapaxes = len(fdist.hapaxes())

    top_10_words_sans_stop_words = [w for w in fdist.items() if w[0]
                                    not in stop_words][:10]

    print post['title']
    print '\tNum Sentences:'.ljust(25), len(sentences)
    print '\tNum Words:'.ljust(25), num_words
    print '\tNum Unique Words:'.ljust(25), num_unique_words
    print '\tNum Hapaxes:'.ljust(25), num_hapaxes
    print '\tTop 10 Most Frequent Words (sans stop words):\n\t\t', \
          '\n\t\t'.join(['%s (%s)' % (w[0], w[1])
                         for w in top_10_words_sans_stop_words])
    print
3. Treebank is a very specific term that refers to a corpus that’s been specially tagged with advanced linguistic information. In fact, the reason such a corpus is called a treebank is to emphasize that it’s a bank (think:
collection) of sentences that have been parsed into trees adhering to a particular grammar.
The first things you’re probably wondering about are the sent_tokenize and word_tokenize calls. NLTK provides several options for tokenization, but it recommends the best available choices via these aliases. At the time of this writing (you can double-check this with pydoc or a command like nltk.tokenize.sent_tokenize? in IPython or IPython Notebook at any time), the sentence detector is the PunktSentenceTokenizer and the word tokenizer is the TreebankWordTokenizer. Let’s take a brief look at each of these.
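If you’d rather confirm this outside of IPython, here’s a quick sketch using the standard help function; the exact docstrings vary by NLTK version:

import nltk

# Print the documentation for the convenience aliases; the docstrings name
# the tokenizers they currently wrap.
help(nltk.tokenize.sent_tokenize)
help(nltk.tokenize.word_tokenize)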
Internally, the PunktSentenceTokenizer relies heavily on being able to detect abbreviations as part of collocation patterns, and it uses some regular expressions to try to intelligently parse sentences by taking into account common patterns of punctuation usage. A full explanation of the innards of the PunktSentenceTokenizer’s logic is outside the scope of this book, but Tibor Kiss and Jan Strunk’s original paper “Unsupervised Multilingual Sentence Boundary Detection” discusses its approach in a highly readable way, and you should take some time to review it.
As we’ll see in a bit, it is possible to instantiate the PunktSentenceTokenizer with sample text that it trains on to try to improve its accuracy. The type of underlying algorithm that’s used is an unsupervised learning algorithm; it does not require you to explicitly mark up the sample training data in any way. Instead, the algorithm inspects certain features that appear in the text itself, such as the use of capitalization and the co-occurrences of tokens, to derive suitable parameters for breaking the text into sentences.
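Here’s a minimal sketch of that idea, assuming the blog_data list from Example 5-5 is in scope and serves as the training sample; any reasonably large, representative body of prose from your domain would do just as well:

from nltk.tokenize.punkt import PunktSentenceTokenizer

# Train an unsupervised Punkt model on raw sample text. No manual markup
# of sentence boundaries is required; the model learns from the text itself.
training_text = " ".join(post['content'] for post in blog_data)
custom_tokenizer = PunktSentenceTokenizer(training_text)

print custom_tokenizer.tokenize("Mr. Green killed Colonel Mustard in the study. "
                                "He is not a very nice fellow.")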
While NLTK’s WhitespaceTokenizer, which creates tokens by breaking a piece of text on whitespace, would have been the simplest word tokenizer to introduce, you’re already familiar with some of the shortcomings of blindly breaking on whitespace. Instead, NLTK currently recommends the TreebankWordTokenizer, a word tokenizer that operates on sentences and uses the same conventions as the Penn Treebank Project.3 The one thing that may catch you off guard is that the TreebankWordTokenizer’s tokenization does some less-than-obvious things, such as separately tagging components in contractions and nouns having possessive forms. For example, the parsing for the sentence “I’m hungry” would yield separate components for “I” and “’m,” maintaining a distinction between the subject and verb for the two words conjoined in the contraction “I’m.” As you might imagine, finely grained access to this kind of grammatical information can be quite valuable when it’s time to do advanced analysis that scrutinizes relationships between subjects and verbs in sentences.
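A quick sketch of that behavior (the exact token strings may vary slightly across NLTK versions):

from nltk.tokenize import TreebankWordTokenizer

# Contractions are split so that the subject and the verb remain distinct,
# e.g. "I'm hungry" becomes something like ['I', "'m", 'hungry'].
print TreebankWordTokenizer().tokenize("I'm hungry")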
Given a sentence tokenizer and a word tokenizer, we can first parse the text into sentences and then parse each sentence into tokens. While this approach is fairly intuitive, its Achilles’ heel is that errors produced by the sentence detector propagate forward and
can potentially bound the upper limit of the quality that the rest of the NLP stack can produce. For example, if the sentence tokenizer mistakenly breaks a sentence on the period after “Mr.” that appears in a section of text such as “Mr. Green killed Colonel Mustard in the study with the candlestick,” it may not be possible to extract the entity
“Mr. Green” from the text unless specialized repair logic is in place. Again, it all depends on the sophistication of the full NLP stack and how it accounts for error propagation.
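A quick sanity check along those lines is easy to run; this sketch assumes the pretrained punkt model is available (nltk.download('punkt') fetches it), and whether the abbreviation is handled correctly depends on that model:

import nltk

sample = ("Mr. Green killed Colonel Mustard in the study with the candlestick. "
          "Mr. Green is not a very nice fellow.")

# Inspect whether the sentence detector splits on the period after "Mr."
print nltk.tokenize.sent_tokenize(sample)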
The out-of-the-box PunktSentenceTokenizer is trained on the Penn Treebank corpus and performs quite well. The end goal of the parsing is to instantiate an nltk.FreqDist object (which is like a slightly more sophisticated collections.Counter), which expects a list of tokens. The remainder of the code in Example 5-5 is a straightforward usage of a few of the commonly used NLTK APIs.
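For instance, here’s a tiny illustration of what an nltk.FreqDist does with a list of tokens:

import nltk

fdist = nltk.FreqDist(['mr.', 'green', 'killed', 'mr.', 'green'])

print fdist['mr.']     # frequency of a particular token (2 here)
print fdist.hapaxes()  # tokens that appear exactly once (['killed'])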
If you have a lot of trouble with advanced word tokenizers such as NLTK’s TreebankWordTokenizer or PunktWordTokenizer, it’s fine to default to the WhitespaceTokenizer until you decide whether it’s worth the investment to use a more advanced tokenizer. In fact, using a more straightforward tokenizer can often be advantageous. For example, using an advanced tokenizer on data that frequently inlines URLs might be a bad idea.
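To see the trade-off concretely, here’s a small sketch comparing the two tokenizers on text containing a URL (the URL is just a placeholder, and the exact splits depend on your NLTK version):

from nltk.tokenize import TreebankWordTokenizer, WhitespaceTokenizer

text = "See http://example.com/some/path for details."

print WhitespaceTokenizer().tokenize(text)     # the URL stays a single token
print TreebankWordTokenizer().tokenize(text)   # punctuation-aware splitting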
The aim of this section was to familiarize you with the first step involved in building an NLP pipeline. Along the way, we developed a few metrics that make a feeble attempt at characterizing some blog data. Our pipeline doesn’t involve part-of-speech tagging or chunking (yet), but it should give you a basic understanding of some concepts and get you thinking about some of the subtler issues involved. While it’s true that we could have simply split on whitespace, counted terms, tallied the results, and still gained a lot of information from the data, it won’t be long before you’ll be glad that you took these initial steps toward a deeper understanding of the data. To illustrate one possible application for what you’ve just learned, in the next section we’ll look at a simple document summarization algorithm that relies on little more than sentence segmentation and frequency analysis.