Quality of Analytics for Processing Human Language

Part I. A Guided Tour of the Social Web

5. Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More

5.5. Quality of Analytics for Processing Human Language Data

When you’ve done even a modest amount of text mining, you’ll eventually want to start quantifying the quality of your analytics. How accurate is your end-of-sentence detector? How accurate is your part-of-speech tagger? For example, if you began customizing the basic algorithm for extracting the entities from unstructured text, how would you know whether the quality of its results was getting better or worse? While you could manually inspect the results for a small corpus and tune the algorithm until you were satisfied with them, you’d still have a devil of a time determining whether your analytics would perform well on a much larger corpus or a different class of document altogether. Hence the need for a more automated process.

An obvious starting point is to randomly sample some documents and create a “golden set” of entities that you believe are absolutely crucial for a good algorithm to extract from them, and then use this list as the basis of evaluation. Depending on how rigorous you’d like to be, you might even be able to compute the sample error and use a statistical device called a confidence interval to predict the true error with a sufficient degree of confidence for your needs. However, what exactly is the calculation you should be computing based on the results of your extractor and golden set in order to compute accuracy? A very common calculation for measuring accuracy is called the F1 score, which is defined in terms of two concepts called precision and recall as:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

where:

$$\text{precision} = \frac{TP}{TP + FP}$$

and:

$$\text{recall} = \frac{TP}{TP + FN}$$

In the current context, precision is a measure of exactness that reflects false positives, and recall is a measure of completeness that reflects false negatives. The following list clarifies the meaning of these terms in relation to the current discussion, in case they’re unfamiliar or confusing:

True positives (TP)
    Terms that were correctly identified as entities

False positives (FP)
    Terms that were identified as entities but should not have been

True negatives (TN)
    Terms that were not identified as entities and should not have been

False negatives (FN)
    Terms that were not identified as entities but should have been

Given that precision is a measure of exactness that quantifies false positives, it is defined as TP / (TP + FP). Intuitively, if the number of false positives is zero, the exactness of the algorithm is perfect and the precision yields a value of 1.0. Conversely, once the number of false positives matches the number of true positives, precision has already dropped to 0.5, and as false positives come to dominate, the ratio approaches zero. As a measure of completeness, recall is defined as TP / (TP + FN) and yields a value of 1.0, indicating perfect recall, if the number of false negatives is zero. As the number of false negatives grows relative to true positives, recall approaches zero. By definition, F1 yields a value of 1.0 when precision and recall are both perfect, and because it is the harmonic mean of the two, it approaches zero whenever either one is poor.
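The definitions above translate almost directly into code. The following is a minimal sketch rather than a reference implementation: the function name `precision_recall_f1` is made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention that the text does not specify.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) computed from raw counts.

    Assumption: a zero denominator yields 0.0 rather than raising an error.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # No errors of either kind
    print(precision_recall_f1(tp=4, fp=0, fn=0))
    # As many false positives as true positives
    print(precision_recall_f1(tp=4, fp=4, fn=0))
    # Most of the expected entities were missed
    print(precision_recall_f1(tp=1, fp=0, fn=9))
```

Note that true negatives never enter the calculation, so you only need to count what was extracted and what should have been extracted.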

Of course, what you’ll find out in the wild is that it’s a trade-off as to whether you want to boost precision or recall, because it’s difficult to have both. If you think about it, this makes sense because of the trade-offs involved with false positives and false negatives (see Figure 5-7).


Figure 5-7. The intuition behind true positives, false positives, true negatives, and false negatives from the standpoint of predictive analytics
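One way to see the trade-off between precision and recall is to imagine an extractor that attaches a confidence score to each candidate term and keeps only those above a cutoff. The sketch below is purely illustrative: the scores, the cutoff values, and the helper `precision_recall_at` are all hypothetical and are not drawn from any extractor discussed in this chapter.

```python
# Hypothetical extractor output: (candidate term, confidence score, truly an entity?)
candidates = [
    ("Mr. Green", 0.95, True),
    ("Colonel Mustard", 0.90, True),
    ("killed", 0.70, False),
    ("candlestick", 0.60, True),
    ("study", 0.40, True),
    ("with", 0.35, False),
    ("the", 0.30, False),
]


def precision_recall_at(threshold, scored):
    """Treat every candidate scoring at or above `threshold` as an extracted entity."""
    tp = sum(1 for _, score, gold in scored if score >= threshold and gold)
    fp = sum(1 for _, score, gold in scored if score >= threshold and not gold)
    fn = sum(1 for _, score, gold in scored if score < threshold and gold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # A strict cutoff accepts few terms; a lenient one accepts nearly everything.
    for threshold in (0.8, 0.5, 0.3):
        p, r = precision_recall_at(threshold, candidates)
        print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

With this toy data, lowering the cutoff lets more real entities through (recall rises) but also admits more non-entities (precision falls).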

To put all of this into perspective, let’s consider the sentence “Mr. Green killed Colonel Mustard in the study with the candlestick” one last time and assume that an expert has determined that the key entities in the sentence are “Mr. Green,” “Colonel Mustard,” “study,” and “candlestick.” Assuming your algorithm identified these four terms and only these four terms, you’d have four true positives, zero false positives, five true negatives (“killed,” “with,” “the,” “in,” “the”), and zero false negatives. That’s perfect precision and perfect recall, which yields an F1 score of 1.0. Substituting various values into the precision and recall formulas is straightforward and a worthwhile exercise if this is your first time encountering these terms.

What would the precision, recall, and F1 score have been if your algorithm had identified “Mr. Green,” “Colonel,” “Mustard,” and “candlestick”?
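If you would like to check such calculations programmatically, the comparison against a golden set can be sketched with plain set operations. This sketch assumes exact string matching, so a partial match such as “Colonel” earns no credit toward “Colonel Mustard”; the function name `evaluate` is hypothetical, and a different matching policy would give different numbers.

```python
def evaluate(extracted, golden):
    """Score extracted terms against a golden set using exact string matches."""
    extracted, golden = set(extracted), set(golden)
    tp = len(extracted & golden)   # extracted and expected
    fp = len(extracted - golden)   # extracted but not expected
    fn = len(golden - extracted)   # expected but not extracted
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


golden = ["Mr. Green", "Colonel Mustard", "study", "candlestick"]

if __name__ == "__main__":
    # The perfect case: exactly the four golden entities were extracted
    print(evaluate(golden, golden))
    # The case posed in the question above
    print(evaluate(["Mr. Green", "Colonel", "Mustard", "candlestick"], golden))
```

Work the question out by hand first, then run the script to compare your answer with its second line of output.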

Many of the most compelling technology stacks used by commercial businesses in the NLP space use advanced statistical models to process natural language according to supervised learning algorithms. Given our discussion earlier in this chapter, you know that a supervised learning algorithm is essentially an approach in which you provide training samples that comprise inputs and expected outputs such that the model learns to predict the expected output for a given input with reasonable accuracy. The tricky part is ensuring that the trained model generalizes well to inputs that have not yet been encountered. If the model performs well for training data but poorly on unseen samples, it’s usually said to suffer from the problem of overfitting the training data. A common approach for measuring the efficacy of a model is called cross-validation. With this approach, a portion of the training data (say, one-third) is reserved exclusively for the purpose of testing the model, and only the remainder is used for training the model.
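The split described above can be sketched in a few lines of standard-library Python. This is only an illustration of reserving one-third of the labeled samples for testing: the function name `holdout_split`, the fixed random seed, and the placeholder samples are all assumptions made for the example. The code performs a single split; schemes such as k-fold cross-validation repeat the idea with a different reserved portion each time.

```python
import random


def holdout_split(samples, test_fraction=1 / 3, seed=0):
    """Shuffle the samples and reserve `test_fraction` of them for testing.

    Returns (training_samples, test_samples). The seed is fixed only so that
    the example is reproducible.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder (input, expected output) pairs standing in for real training data
samples = [(f"input {i}", f"expected output {i}") for i in range(9)]
train, test = holdout_split(samples)

if __name__ == "__main__":
    print(len(train), "training samples;", len(test), "held out for testing")
```

Train only on `train`, then compute precision, recall, and F1 on `test`. A large gap between the scores on the two portions is a sign of overfitting.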
