

4.4.1. Introducing the Natural Language Toolkit

NLTK is written such that you can explore data easily and begin to form some impressions without a lot of upfront investment. Before skipping ahead, though, consider following along with the interpreter session in Example 4-8 to get a feel for some of the powerful functionality that NLTK provides right out of the box. Since you may not have done much work with NLTK before, don’t forget that you can use the built-in help function to get more information whenever you need it. For example, help(nltk) would provide documentation on the NLTK package in an interpreter session.

Not all of the functionality from NLTK is intended for incorporation into production software, since output is written to the console and not capturable into a data structure such as a list. In that regard, methods such as nltk.text.concordance are considered “demo functionality.” Speaking of which, many of NLTK’s modules have a demo function that you can call to get some idea of how to use the functionality they provide, and the source code for these demos is a great starting point for learning how to use new APIs. For example, you could run nltk.text.demo() in the interpreter to get some additional insight into the capabilities provided by the nltk.text module.
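If you do need concordance-style results in a form you can work with programmatically, one option is nltk.text.ConcordanceIndex, which exposes the token offsets that concordance prints from. The following sketch is only an illustration of that idea and assumes the text variable built later in Example 4-8, following the chapter’s Python 2 interpreter style; it is not something the book prescribes:

from nltk.text import ConcordanceIndex

# Build an index over the same tokens that back the nltk.Text instance
ci = ConcordanceIndex(text.tokens)

# Offsets of every exact occurrence of the token "open"
offsets = ci.offsets("open")

# Reconstruct a small window of context around each occurrence
contexts = [" ".join(text.tokens[max(0, i - 5):i + 6]) for i in offsets]

print len(offsets), "occurrences of 'open'"
for c in contexts[:3]:
    print c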

Example 4-8 demonstrates some good starting points for exploring the data with sample output included as part of an interactive interpreter session, and the same commands to explore the data are included in the IPython Notebook for this chapter. Please follow along with this example and examine the outputs of each step along the way. Are you able to follow along and understand the investigative flow of the interpreter session? Take a look, and we’ll discuss some of the details after the example.

The next example includes stopwords, which, as noted earlier, are words that appear frequently in text but usually relay very little information (e.g., a, an, the, and other determiners).
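If you’re curious about what NLTK considers a stopword, you can peek at the list directly in the interpreter. This brief aside is not part of Example 4-8 and assumes nltk.download('stopwords') has already been run, as in the example:

from nltk.corpus import stopwords

# The first few entries of NLTK's English stopword list
print stopwords.words('english')[:10]

# How many English stopwords NLTK ships with
print len(stopwords.words('english'))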


Example 4-8. Exploring Google+ data with NLTK

# Explore some of NLTK's functionality by exploring the data.
# Here are some suggestions for an interactive interpreter session.

import nltk

# Download ancillary nltk packages if not already installed
nltk.download('stopwords')

all_content = " ".join([ a['object']['content'] for a in activity_results ])

# Approximate bytes of text
print len(all_content)

tokens = all_content.split()
text = nltk.Text(tokens)

# Examples of the appearance of the word "open"
text.concordance("open")

# Frequent collocations in the text (usually meaningful phrases)
text.collocations()

# Frequency analysis for words of interest
fdist = text.vocab()
fdist["open"]
fdist["source"]
fdist["web"]
fdist["2.0"]

# Number of words in the text
len(tokens)

# Number of unique words in the text
len(fdist.keys())

# Common words that aren't stopwords
[w for w in fdist.keys()[:100] \
    if w.lower() not in nltk.corpus.stopwords.words('english')]

# Long words that aren't URLs
[w for w in fdist.keys() if len(w) > 15 and not w.startswith("http")]

# Number of URLs
len([w for w in fdist.keys() if w.startswith("http")])

# Enumerate the frequency distribution
for rank, word in enumerate(fdist):
    print rank, word, fdist[word]



The examples throughout this chapter, including the prior example, use the split method to tokenize text. Tokenization isn’t quite as simple as splitting on whitespace, however, and Chapter 5 introduces more sophisticated approaches for tokenization that work better for the general case.
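To see why whitespace splitting is only a rough approximation, you can compare it against NLTK’s own word_tokenize function. The sentence below and the comparison are purely illustrative and not part of the book’s example; the sketch assumes the punkt tokenizer models are available (nltk.download('punkt')) and follows the chapter’s Python 2 style:

import nltk

sentence = "The future of the social web is open, isn't it?"

# Naive whitespace tokenization leaves punctuation attached to words
print sentence.split()

# NLTK's tokenizer separates punctuation and handles many common edge cases
print nltk.word_tokenize(sentence)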

The last command in the interpreter session lists the words from the frequency distribution, sorted by frequency. Not surprisingly, stopwords like the, to, and of are the most frequently occurring, but there’s a steep decline and the distribution has a very long tail. We’re working with a small sample of text data, but this same property will hold true for any frequency analysis of natural language.
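To put a number on that steep decline and long tail, a couple of quick interpreter commands can show how much of the text the handful of most common terms accounts for, and how many terms occur only once. This is a sketch that assumes the fdist variable from Example 4-8 and isn’t part of the book’s session:

# Fraction of all tokens accounted for by the 10 most common terms
top10_counts = sorted(fdist.values(), reverse=True)[:10]
print sum(top10_counts) / float(fdist.N())

# Terms that occur exactly once make up much of the long tail
print len(fdist.hapaxes()), "of", len(fdist.keys()), "terms appear only once"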

Zipf’s law, a well-known empirical law of natural language, asserts that a word’s frequency within a corpus is inversely proportional to its rank in the frequency table. What this means is that if the most frequently occurring term in a corpus accounts for N% of the total words, the second most frequently occurring term in the corpus should account for (N/2)% of the words, the third most frequent term for (N/3)% of the words, and so on. When graphed, such a distribution (even for a small sample of data) shows a curve that hugs each axis, as you can see in Figure 4-4.

Though perhaps not initially obvious, most of the area in such a distribution lies in its tail, and for a corpus large enough to span a reasonable sample of a language, the tail is always quite long. If you were to plot this kind of distribution on a chart where each axis was scaled by a logarithm, the curve would approach a straight line for a representative sample size.

Zipf’s law gives you insight into what a frequency distribution for words appearing in a corpus should look like, and it provides some rules of thumb that can be useful in estimating frequency. For example, if you know that there are a million (nonunique) words in a corpus, and you assume that the most frequently used word (usually the, in English) accounts for 7% of the words,4 you could derive the total number of logical calculations an algorithm performs if you were to consider a particular slice of the terms from the frequency distribution. Sometimes this kind of simple, back-of-the-napkin arithmetic is all that it takes to sanity-check assumptions about a long-running wall-clock time, or confirm whether certain computations on a large enough data set are even tractable.

4. The word the accounts for 7% of the tokens in the Brown Corpus and provides a reasonable starting point for a corpus if you don’t know anything else about it.
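As a concrete illustration of that kind of back-of-the-napkin arithmetic, the short sketch below estimates term counts for the top-ranked words of a hypothetical one-million-word corpus under Zipf’s law, assuming the top term accounts for 7% of the tokens. The figures and the estimate_counts helper are purely illustrative and are not part of the book’s example:

# Back-of-the-napkin Zipf estimate: if the most frequent word accounts for
# 7% of a million-word corpus, the word at rank r accounts for roughly 7%/r.

TOTAL_WORDS = 1000000
TOP_TERM_SHARE = 0.07  # assumed share for the most frequent word (usually "the")

def estimate_counts(num_ranks):
    return [int(TOTAL_WORDS * TOP_TERM_SHARE / rank)
            for rank in range(1, num_ranks + 1)]

estimates = estimate_counts(10)

for rank, count in enumerate(estimates, start=1):
    print rank, count

# Rough share of all tokens accounted for by the top 10 terms
print sum(estimates) / float(TOTAL_WORDS)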


Figure 4-4. The frequency distribution for terms appearing in a small sample of Google+ data “hugs” each axis closely; plotting it on a log-log scale would render it as something much closer to a straight line with a negative slope

Can you graph the same kind of curve shown in Figure 4-4 for the content from a few hundred Google+ activities using the techniques introduced in this chapter combined with IPython’s plotting functionality, as introduced in Chapter 1?
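If you’d like a nudge in that direction, here is one possible starting point: a minimal sketch that plots the term counts from fdist (as built in Example 4-8) against their ranks on a log-log scale with matplotlib. It assumes matplotlib is installed and that you’re working in an environment such as the IPython Notebook that can render the plot; it is not the book’s solution to the exercise:

import matplotlib.pyplot as plt

# Term counts sorted from most to least frequent
counts = sorted(fdist.values(), reverse=True)
ranks = range(1, len(counts) + 1)

# On a log-log scale, a Zipf-like distribution approaches a straight line
plt.loglog(ranks, counts)
plt.xlabel('Rank of term')
plt.ylabel('Frequency of term')
plt.title('Frequency distribution of terms in Google+ activity content')
plt.show()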
