Excerpt from Mining the Social Web, 2nd edition (pages 218 - 222)

Part I. A Guided Tour of the Social Web

5. Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More

5.3. Discovering Semantics by Decoding Syntax

5.3.1. Natural Language Processing Illustrated Step-by-Step

Let’s prepare to step through a series of examples that illustrate NLP with NLTK. The NLP pipeline we’ll examine involves the following steps:

1. EOS detection

2. Tokenization

3. Part-of-speech tagging

4. Chunking

5. Extraction

We’ll continue to use the following sample text from the previous chapter for the purposes of illustration: “Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.” Remember that even though you have already read the text and understand its underlying grammatical structure, it’s merely an opaque string value to a machine at this point. Let’s look at the steps we need to work through in more detail.

The following NLP pipeline is presented as though it were unfolding in a Python interpreter session, which makes the input and expected output of each step easy to see. However, each step of the pipeline is preloaded into this chapter’s IPython Notebook so that you can follow along, as with all other examples.

The five steps are:

EOS detection

This step, end-of-sentence (EOS) detection, breaks a text into a collection of meaningful sentences. Since sentences generally represent logical units of thought, they tend to have a predictable syntax that lends itself well to further analysis. Most NLP pipelines you’ll see begin with this step because tokenization (the next step) operates on individual sentences.

Breaking the text into paragraphs or sections might add value for certain types of analysis, but it is unlikely to aid in the overall task of EOS detection. In the interpreter, you’d parse out a sentence with NLTK like so:

>>> import nltk

>>> txt = "Mr. Green killed Colonel Mustard in the study with the \
... candlestick. Mr. Green is not a very nice fellow."

>>> sentences = nltk.tokenize.sent_tokenize(txt)

>>> sentences

['Mr. Green killed Colonel Mustard in the study with the candlestick.', 'Mr. Green is not a very nice fellow.']

We’ll talk a little bit more about what is happening under the hood with sent_tokenize in the next section. For now, we’ll accept at face value that proper sentence detection has occurred for arbitrary text, a clear improvement over breaking on characters that are likely to be punctuation marks.
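To see why breaking on punctuation alone falls short, the following minimal sketch splits the sample text after every period and then patches the result with a tiny abbreviation list. The names naive_split, abbreviation_aware_split, and ABBREVIATIONS are hypothetical helpers introduced only for illustration, and this toy logic is not how sent_tokenize works internally.

```python
import re

txt = ("Mr. Green killed Colonel Mustard in the study with the "
       "candlestick. Mr. Green is not a very nice fellow.")

# Hypothetical, hand-picked abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}


def naive_split(text):
    """Split after every period that is followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]


def abbreviation_aware_split(text, abbreviations=ABBREVIATIONS):
    """Re-join pieces that were cut off right after a known abbreviation."""
    sentences, current = [], []
    for piece in naive_split(text):
        current.append(piece)
        # Close the sentence only if the piece does not end in an abbreviation.
        if piece.split()[-1] not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text that ended with an abbreviation
        sentences.append(" ".join(current))
    return sentences


naive_sentences = naive_split(txt)
better_sentences = abbreviation_aware_split(txt)

print(naive_sentences)   # four fragments: "Mr." is treated as a sentence
print(better_sentences)  # two sentences, matching the session above
```

A fixed abbreviation list only covers the cases you thought of in advance, which is why a general-purpose sentence detector is preferable for arbitrary text.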

Tokenization

This step operates on individual sentences, splitting them into tokens. Following along in the example interpreter session, you’d do the following:

>>> tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]

>>> tokens

[['Mr.', 'Green', 'killed', 'Colonel', 'Mustard', 'in', 'the', 'study', 'with', 'the', 'candlestick', '.'],

['Mr.', 'Green', 'is', 'not', 'a', 'very', 'nice', 'fellow', '.']]

Note that for this simple example, tokenization appeared to do the same thing as splitting on whitespace, with the exception that it tokenized out EOS markers (the periods) correctly. As we’ll see in a later section, though, it can do a bit more if we give it the opportunity, and we already know that determining whether a period is an EOS marker or part of an abbreviation isn’t always trivial. As an anecdotal note, some written languages, such as ones that use pictograms as opposed to letters, don’t necessarily even require whitespace to separate the tokens in sentences and require the reader (or machine) to distinguish the boundaries.
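The difference between whitespace splitting and tokenization can be sketched in a few lines of plain Python. The helper simple_tokenize and its ABBREVIATIONS set are assumptions made for illustration; word_tokenize applies many more rules than this.

```python
# Hypothetical abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}


def simple_tokenize(sentence):
    """Split on whitespace, then detach a trailing period unless the
    word is a known abbreviation."""
    tokens = []
    for word in sentence.split():
        if word.endswith(".") and len(word) > 1 and word not in ABBREVIATIONS:
            tokens.extend([word[:-1], "."])
        else:
            tokens.append(word)
    return tokens


sentence = "Mr. Green is not a very nice fellow."

whitespace_tokens = sentence.split()   # last token is 'fellow.'
tokens = simple_tokenize(sentence)     # last two tokens are 'fellow', '.'

print(whitespace_tokens)
print(tokens)
```

Keeping the EOS marker as its own token matters for the later steps, which expect punctuation to be tagged separately from words.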

POS tagging

This step assigns part-of-speech (POS) information to each token. In the example interpreter session, you’d run the tokens through one more step to have them decorated with tags:

>>> pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]

>>> pos_tagged_tokens

[[('Mr.', 'NNP'), ('Green', 'NNP'), ('killed', 'VBD'), ('Colonel', 'NNP'), ('Mustard', 'NNP'), ('in', 'IN'), ('the', 'DT'), ('study', 'NN'), ('with', 'IN'), ('the', 'DT'), ('candlestick', 'NN'), ('.', '.')], [('Mr.', 'NNP'), ('Green', 'NNP'), ('is', 'VBZ'), ('not', 'RB'), ('a', 'DT'), ('very', 'RB'), ('nice', 'JJ'), ('fellow', 'JJ'), ('.', '.')]]

You may not intuitively understand all of these tags, but they do represent POS information. For example, 'NNP' indicates that the token is a singular proper noun, 'VBD' indicates a verb that’s in simple past tense, and 'JJ' indicates an adjective. The Penn Treebank Project provides a full summary of the POS tags that could be returned. With POS tagging completed, it should be getting pretty apparent just how powerful analysis can become. For example, by using the POS tags, we’ll be able to chunk together nouns as part of noun phrases and then try to reason about what types of entities they might be (e.g., people, places, or organizations). If you never thought that you’d need to apply those exercises from elementary school regarding parts of speech, think again: it’s essential to a proper application of natural language processing.
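As a reading aid for the tagged output, here is a small sketch that attaches a plain-English description to each tag. The PENN_TAGS dictionary covers only the tags that appear in this example, and the describe helper is a hypothetical name; neither is part of NLTK.

```python
# Descriptions for the Penn Treebank tags that occur in the sample output.
PENN_TAGS = {
    "NNP": "proper noun, singular",
    "NN": "noun, singular or mass",
    "VBD": "verb, past tense",
    "VBZ": "verb, 3rd person singular present",
    "IN": "preposition or subordinating conjunction",
    "DT": "determiner",
    "RB": "adverb",
    "JJ": "adjective",
    ".": "sentence-final punctuation",
}

# Second sentence of the sample text, copied from the session output.
tagged = [("Mr.", "NNP"), ("Green", "NNP"), ("is", "VBZ"), ("not", "RB"),
          ("a", "DT"), ("very", "RB"), ("nice", "JJ"), ("fellow", "JJ"),
          (".", ".")]


def describe(tagged_sentence):
    """Return (token, tag, description) triples for a tagged sentence."""
    return [(token, tag, PENN_TAGS.get(tag, "unknown tag"))
            for token, tag in tagged_sentence]


for token, tag, meaning in describe(tagged):
    print(f"{token:10} {tag:4} {meaning}")
```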

Chunking

This step involves analyzing each tagged token within a sentence and assembling compound tokens that express logical concepts, quite a different approach from statistically analyzing collocations. It is possible to define a custom grammar through NLTK’s chunk.RegexpParser, but that’s beyond the scope of this chapter; see Chapter 9 (“Building Feature Based Grammars”) of Natural Language Processing with Python (O’Reilly) for full details. Besides, NLTK exposes a function that combines chunking with named entity extraction, which is the next step.
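To make the idea of assembling compound tokens concrete, the sketch below groups runs of adjacent proper nouns into a single chunk. It is a deliberate simplification written in plain Python, not NLTK’s chunker, and chunk_proper_nouns is a hypothetical helper name.

```python
from itertools import groupby

# First sentence of the sample text, copied from the tagged output.
tagged = [("Mr.", "NNP"), ("Green", "NNP"), ("killed", "VBD"),
          ("Colonel", "NNP"), ("Mustard", "NNP"), ("in", "IN"),
          ("the", "DT"), ("study", "NN"), ("with", "IN"), ("the", "DT"),
          ("candlestick", "NN"), (".", ".")]


def chunk_proper_nouns(tagged_sentence):
    """Join each run of consecutive NNP tokens into one compound token."""
    chunks = []
    runs = groupby(tagged_sentence, key=lambda pair: pair[1] == "NNP")
    for is_proper_noun, run in runs:
        if is_proper_noun:
            chunks.append(" ".join(token for token, _ in run))
    return chunks


chunks = chunk_proper_nouns(tagged)
print(chunks)  # ['Mr. Green', 'Colonel Mustard']
```

A rule this simple relies entirely on the tags being correct, so tagging mistakes carry straight through to the chunks.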

Extraction

This step involves analyzing each chunk and further tagging the chunks as named entities, such as people, organizations, locations, etc. The continuing saga of NLP in the interpreter demonstrates:

>>> ne_chunks = nltk.batch_ne_chunk(pos_tagged_tokens) # Renamed nltk.ne_chunk_sents in NLTK 3

>>> print ne_chunks

[Tree('S', [Tree('PERSON', [('Mr.', 'NNP')]),

Tree('PERSON', [('Green', 'NNP')]), ('killed', 'VBD'),

Tree('ORGANIZATION', [('Colonel', 'NNP'), ('Mustard', 'NNP')]), ('in', 'IN'), ('the', 'DT'), ('study', 'NN'), ('with', 'IN'), ('the', 'DT'), ('candlestick', 'NN'), ('.', '.')]),

Tree('S', [Tree('PERSON', [('Mr.', 'NNP')]), Tree('ORGANIZATION',

[('Green', 'NNP')]), ('is', 'VBZ'), ('not', 'RB'), ('a', 'DT'), ('very', 'RB'), ('nice', 'JJ'), ('fellow', 'JJ'), ('.', '.')])]

>>> print ne_chunks[0].pprint() # You can pretty-print each chunk in the tree

(S

(PERSON Mr./NNP) (PERSON Green/NNP) killed/VBD

(ORGANIZATION Colonel/NNP Mustard/NNP) in/IN

the/DT study/NN with/IN the/DT

candlestick/NN ./.)

Don’t get too wrapped up in trying to decipher exactly what the tree output means just yet. In short, it has chunked together some tokens and attempted to classify them as being certain types of entities. (You may be able to discern that it has identified “Mr. Green” as a person, but unfortunately categorized “Colonel Mustard” as an organization.) Figure 5-3 illustrates output in IPython Notebook.
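Once a sentence has been chunked, pulling out the labeled entities amounts to walking the top level of the tree and keeping the labeled subtrees. The sketch below uses a plain-Python stand-in for the first tree shown above (labeled chunks as (label, list) pairs, other tokens as (token, tag) pairs) so that it runs without NLTK; both this representation and the extract_entities helper are assumptions made for illustration.

```python
# Plain-Python stand-in for the first sentence's tree in the output above.
chunked = [
    ("PERSON", [("Mr.", "NNP")]),
    ("PERSON", [("Green", "NNP")]),
    ("killed", "VBD"),
    ("ORGANIZATION", [("Colonel", "NNP"), ("Mustard", "NNP")]),
    ("in", "IN"), ("the", "DT"), ("study", "NN"), ("with", "IN"),
    ("the", "DT"), ("candlestick", "NN"), (".", "."),
]


def extract_entities(chunked_sentence):
    """Collect (entity_type, text) pairs from the labeled chunks."""
    entities = []
    for label, payload in chunked_sentence:
        if isinstance(payload, list):  # a labeled chunk, not a plain token
            text = " ".join(token for token, _ in payload)
            entities.append((label, text))
    return entities


entities = extract_entities(chunked)
print(entities)
```

The result reproduces the classifier’s decisions exactly, including the mislabeling of “Colonel Mustard” as an organization.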

As worthwhile as it would be to continue exploring natural language with NLTK, that level of engagement isn’t really our purpose here. The background in this section is provided to motivate an appreciation for the difficulty of the task and to encourage you to review the NLTK book or one of the many other resources available online if you’d like to pursue the topic further.

Figure 5-3. NLTK can interface with drawing toolkits so that you can inspect the chunked output in a more intuitive visual form than the raw text output you see in the interpreter

Although it’s possible to customize certain aspects of NLTK, the remainder of this chapter assumes you’ll be using NLTK “as is” unless otherwise noted.

With that brief introduction to NLP concluded, let’s get to work mining some blog data.
