Text Mining in Practice with R

Table of Contents

  • Cover
  • Title Page
  • Copyright
  • Dedication
  • Foreword
  • Chapter 1: What is Text Mining?
    • 1.1 What is it?
    • 1.2 Why We Care About Text Mining
    • 1.3 A Basic Workflow – How the Process Works
    • 1.4 What Tools Do I Need to Get Started with This?
    • 1.5 A Simple Example
    • 1.6 A Real World Use Case
    • 1.7 Summary
  • Chapter 2: Basics of Text Mining
    • 2.1 What is Text Mining in a Practical Sense?
    • 2.2 Types of Text Mining: Bag of Words
    • 2.3 The Text Mining Process in Context
    • 2.4 String Manipulation: Number of Characters and Substitutions
    • 2.5 Keyword Scanning
    • 2.6 String Packages stringr and stringi
    • 2.7 Preprocessing Steps for Bag of Words Text Mining
    • 2.10 DeltaAssist Wrap Up
    • 2.11 Summary
  • Chapter 3: Common Text Mining Visualizations
    • 3.1 A Tale of Two (or Three) Cultures
    • 3.2 Simple Exploration: Term Frequency, Associations and Word Networks
    • 3.3 Simple Word Clusters: Hierarchical Dendrograms
    • 3.4 Word Clouds: Overused but Effective
    • 3.5 Summary
  • Chapter 4: Sentiment Scoring
    • 4.1 What is Sentiment Analysis?
    • 4.2 Sentiment Scoring: Parlor Trick or Insightful?
    • 4.3 Polarity: Simple Sentiment Scoring
    • 4.4 Emoticons – Dealing with These Perplexing Clues
    • 4.5 R's Archived Sentiment Scoring Library
    • 4.6 Sentiment the Tidytext Way
    • 4.7 Airbnb.com Boston Wrap Up
    • 4.8 Summary
  • Chapter 5: Hidden Structures: Clustering, String Distance, Text Vectors and Topic Modeling
    • 5.1 What is clustering?
    • 5.2 Calculating and Exploring String Distance
    • 5.3 LDA Topic Modeling Explained
    • 5.4 Text to Vectors using text2vec
    • 5.5 Summary
  • Chapter 6: Document Classification: Finding Clickbait from Headlines
    • 6.1 What is Document Classification?
    • 6.2 Clickbait Case Study
    • 6.3 Summary
  • Chapter 7: Predictive Modeling: Using Text for Classifying and Predicting Outcomes
    • 7.1 Classification vs Prediction
    • 7.2 Case Study I: Will This Patient Come Back to the Hospital?
    • 7.3 Case Study II: Predicting Box Office Success
    • 7.4 Summary
  • Chapter 8: The OpenNLP Project
    • 8.1 What is the OpenNLP project?
    • 8.2 R's OpenNLP Package
    • 8.3 Named Entities in Hillary Clinton's Email
    • 8.4 Analyzing the Named Entities
    • 8.6 Summary
  • Chapter 9: Text Sources
    • 9.1 Sourcing Text
    • 9.2 Web Sources
    • 9.3 Getting Text from File Sources
    • 9.4 Summary
  • Index
  • End User License Agreement

List of Illustrations

  • Chapter 1: What is Text Mining?
    • Figure 1.1 Possible enterprise uses of text
    • Figure 1.2 A gratuitous word cloud for Chapter
    • Figure 1.3 Text mining is the transition from an unstructured state to a structured, understandable state
  • Chapter 2: Basics of Text Mining
    • Figure 2.1 Recall, text mining is the process of taking unorganized sources of text, and applying standardized analytical steps, resulting in a concise insight or recommendation. Essentially, it means going from an unorganized state to a summarized and structured state
    • Figure 2.2 The sentence is parsed using simple part-of-speech tagging. The collected contextual data has been captured as tags, resulting in more information than the bag of words methodology captured
    • Figure 2.3 The section of the term document matrix from the code above
  • Chapter 3: Common Text Mining Visualizations
    • Figure 3.1 The bar plot of individual words has expected words like please, sorry and flight confirmation
    • Figure 3.2 Showing that the most associated word from DeltaAssist's use of apologies is “delay”
    • Figure 3.3 A simple word network, illustrating the node and edge attributes
    • Figure 3.4 The matrix result from an R console of the matrix multiplication operator
    • Figure 3.5 A larger matrix is returned with the answers to each of the multiplication inputs
    • Figure 3.7 Calling the single qdap function yields a similar result, saving coding time
    • Figure 3.8 The word association function's network
    • Figure 3.9 The city rainfall data expressed as a dendrogram
    • Figure 3.10 A reduced term DTM, expressed as a dendrogram for the @DeltaAssist corpus
    • Figure 3.11 A modified dendrogram using a custom visualization. The dendrogram confirms the agent behavior asking for customers to follow and dm (direct message) the team with confirmation numbers
    • Figure 3.12 The circular dendrogram highlighting the agent behavioral insights
    • Figure 3.13 A representation of the three word cloud functions from the wordcloud package
    • Figure 3.14 A simple word cloud with 100 words and two colors based on Delta tweets
    • Figure 3.15 The words in common between Amazon and Delta customer service tweets
    • Figure 3.16 A comparison cloud showing the contrasting words between Delta and Amazon customer service tweets
    • Figure 3.17 An example polarized tag plot showing words in common between corpora. R will plot a larger version for easier viewing
  • Chapter 4: Sentiment Scoring
    • Figure 4.1 Plutchik's wheel of emotion with eight primary emotional states
    • Figure 4.2 Top 50 unique terms from ~2.5 million tweets follow Zipf's distribution
    • Figure 4.3 Qdap's polarity function equals 0.68 on this single sentence
    • Figure 4.4 The original word cloud functions applied to various corpora
    • Figure 4.5 Polarity-based subsections can be used to create different corpora for word clouds
    • Figure 4.6 Histogram created by ggplot code – notice that the polarity distribution is not centered at zero
    • Figure 4.7 The sentiment word cloud based on a scaled polarity score and a TFIDF weighted TDM
    • Figure 4.8 The sentiment word cloud based on a polarity score without scaling and a TFIDF weighted TDM
    • Figure 4.9 Some common image-based emoji used for smartphone messaging
    • Figure 4.10 Smartphone sarcasm emote
    • Figure 4.11 Twitch's kappa emoji used for sarcasm
    • Figure 4.12 The 10k Boston Airbnb reviews skew highly to the emotion joy
    • Figure 4.13 A sentiment-based word cloud based on the 10k Boston Airbnb reviews. Apparently staying in Malden or Somerville leaves people in a state of disgust
    • Figure 4.14 Bar plot of polarity as the Wizard of Oz story unfolds
    • Figure 4.15 Smoothed polarity for the Wizard of Oz
  • Chapter 5: Hidden Structures: Clustering, String Distance, Text Vectors and Topic Modeling
    • Figure 5.1 The five example documents in two-dimensional space
    • Figure 5.2 The added centroids and partitions grouping the documents
    • Figure 5.3 Centroid movement to equalize the “gravitational pull” from the documents, thereby minimizing distances
    • Figure 5.4 The final k-means partition with the correct document clusters
    • Figure 5.5 The k-means clustering with three partitions on work experiences
    • Figure 5.6 The plotcluster visual is overwhelmed by the second cluster and shows that partitioning was not effective
    • Figure 5.7 The k-means clustering silhouette plot dominated by a single cluster
    • Figure 5.8 The comparison clouds based on prototype scores
    • Figure 5.9 A comparison of the k-means and spherical k-means distance measures
    • Figure 5.10 The cluster assignments for the 50 work experiences using spherical k-means
    • Figure 5.11 The spherical k-means cluster plot with some improved document separation
    • Figure 5.12 The spherical k-means silhouette plot with three distinct clusters
    • Figure 5.13 The spherical k-means comparison cloud improves on the original in the previous section
    • Figure 5.14 K-medoid cluster silhouette showing a single cluster with 49 of the documents
    • Figure 5.15 Medoid prototype work experiences 15 & 40 as a comparison cloud
    • Figure 5.16 The six OSA operators needed for the distance measure between “raspberry” and “pear”
    • Figure 5.17 The example fruit dendrogram
    • Figure 5.18 “Highlighters” used to capture three topics in a passage
    • Figure 5.19 The log likelihoods from the 25 sample iterations, showing it improving and then leveling off
    • Figure 5.20 A screenshot portion for the resulting topic model visual
    • Figure 5.21 Illustrating the articles' size, polarity and topic grouping
    • Figure 5.22 The vector space of single word documents and a third document sharing both terms
  • Chapter 6: Document Classification: Finding Clickbait from Headlines
    • Figure 6.1 The training step where labeled data is fed into the algorithm
    • Figure 6.2 The algorithm is applied to new data and the output is a predicted value or class
    • Figure 6.3 Line 42 of the trace window where “Acronym” needs to be changed to “acronym.”
    • Figure 6.4 Training data is split into nine training sections, with one holdout portion used to evaluate the classifier's performance. The process is repeated while shuffling the holdout partitions. The resulting performance measures from each of the models are averaged so the classifier is more reliable
    • Figure 6.5 The interaction between lambda values and misclassification rates for the lasso regression model
    • Figure 6.6 The ROC using the lasso regression predictions applied to the training headlines
    • Figure 6.7 A comparison between training and test sets
    • Figure 6.8 The kernel density plot for word coefficients
    • Figure 6.9 Top and bottom terms impacting clickbait classifications
    • Figure 6.10 The probability scatter plot for the lasso regression
  • Chapter 7: Predictive Modeling: Using Text for Classifying and Predicting Outcomes
    • Figure 7.1 The text-only GLMNet model accuracy results
    • Figure 7.2 The numeric and dummy patient information has an improved AUC compared to the text-only model
    • Figure 7.3 The cross-validated model AUC improves close to 0.8 with all inputs
    • Figure 7.4 The additional lift provided using all available data instead of throwing out the text inputs
    • Figure 7.5 Wikipedia's intuitive explanation for precision and recall
    • Figure 7.6 Showing the minimum and “1 se” lambda values that minimize the model's mse
    • Figure 7.7 The box plot comparing distributions between actual and predicted values
    • Figure 7.8 The scatter plot between actual and predicted values shows a reasonable relationship
    • Figure 7.9 The test set actual and predicted values show a wider dispersion than with the training set
  • Chapter 8: The OpenNLP Project
    • Figure 8.1 A sentence parsed syntactically with various annotations
    • Figure 8.2 The most frequent organizations identified by the named entity model
    • Figure 8.3 The base map with ggplot's basic color and axes
    • Figure 8.4 The worldwide map with locations from 551 Hillary Clinton emails
    • Figure 8.5 The Google map of email locations
    • Figure 8.6 Black and white map of email locations
    • Figure 8.7 A normal distribution alongside a box and whisker plot with three outlier values
    • Figure 8.8 A box and whisker plot of Hillary Clinton emails containing “Russia,” “Senate” and “White House.”
    • Figure 8.9 Quantmod's simple line chart for Microsoft's stock price
    • Figure 8.10 Entity polarity over time
  • Chapter 9: Text Sources
    • Figure 9.1 The methodology breakdown for obtaining text exemplified in this chapter
    • Figure 9.2 An Amazon help forum thread mentioning Prime movies
    • Figure 9.3 A portion of the Amazon forum page with SelectorGadget turned on and the thread text highlighted
    • Figure 9.4 A portion of Amazon's general help forum
    • Figure 9.5 The forum scraping workflow showing two steps requiring scraping information
    • Figure 9.6 A visual representation of the amzn.forum list
    • Figure 9.7 The row bound list resulting in a data table
    • Figure 9.8 A typical Google News feed for Amazon Echo
    • Figure 9.9 A portion of a newspaper image to be sent to the OCR service
    • Figure 9.10 The final OCR text containing the image text

List of Tables

  • Chapter 1: What is Text Mining?
    • Table 1.1 Example use cases and recommendations to use or not use text mining
  • Chapter 2: Basics of Text Mining
    • Table 2.1 An abbreviated document term matrix, showing simple word counts contained in the three-tweet corpus
    • Table 2.2 The term document matrix contains the same information as the document term matrix but is the transposition. The rows and columns have been switched
    • Table 2.3 @DeltaAssist agent workload – the abbreviated table demonstrates a simple text mining analysis that can help with competitive intelligence and benchmarking for customer service workloads
    • Table 2.4 Common text-preprocessing functions from R's tm package with an example of the transformation's impact
    • Table 2.5 In common English writing, these words appear frequently but offer little insight. As a result, they are often removed to prepare a document for text mining
  • Chapter 3: Common Text Mining Visualizations
    • Table 3.1 A small term document matrix, called all, to build an example word network
    • Table 3.2 The adjacency matrix based on the small TDM in Table 3.1
    • Table 3.3 A small data set of annual city rainfall that will be used to create a dendrogram
    • Table 3.4 The ten terms of the Amazon and Delta TDM
    • Table 3.5 The tail of the common.words matrix
  • Chapter 4: Sentiment Scoring
    • Table 4.1 An example subjectivity lexicon from University of Pittsburgh's MPQA Subjectivity Lexicon
    • Table 4.2 The top terms in a word frequency matrix show an expected distribution
    • Table 4.3 The polarity function output
    • Table 4.4 Polarity output with the custom and non-custom subjectivity lexicon
    • Table 4.5 A portion of the TDM using frequency count
    • Table 4.6 A portion of the term document matrix using TFIDF
    • Table 4.7 A small sample of the hundreds of punctuation and symbol based emoticons
    • Table 4.8 Example native emoticons expressed as Unicode and byte strings in R
    • Table 4.9 Common punctuation based emoticons
    • Table 4.10 Pre-constructed punctuation-based emoticon dictionary from qdap
    • Table 4.11 Common emoji with Unicode and R byte representations
    • Table 4.12 The last six emotional words in the sentiment lexicon
    • Table 4.13 The first six Airbnb reviews and associated sentiments data frame
    • Table 4.14 Excerpt from the sentiments data frame
    • Table 4.15 A portion of the tidy text data frame
    • Table 4.16 The first ten Wizard of Oz “Joy” words
    • Table 4.17 The oz.sentiment data frame with key value pairs spread across the polarity term counts
  • Chapter 5: Hidden Structures: Clustering, String Distance, Text Vectors and Topic Modeling
    • Table 5.1 A sample corpus of documents only containing two terms
    • Table 5.2 Simple term frequency for an example corpus
    • Table 5.3 Top five terms for each cluster can help provide useful insights to share
    • Table 5.4 The five string distance measurement methods
    • Table 5.5 The distance matrix from the three fruits
    • Table 5.6 A portion of the Airbnb Reviews vocabulary
    • Table 5.7 A portion of the example sentence target and context words with a window of
    • Table 5.8 A portion of the input and output relationships used in skip gram modeling
    • Table 5.9 The top vector terms from good.walks
    • Table 5.10 The top ten terms demonstrating the cosine distance of dirty and other nouns
  • Chapter 6: Document Classification: Finding Clickbait from Headlines
    • Table 6.1 The confusion table for the training set
    • Table 6.2 The first six rows from headline.preds
    • Table 6.3 The complete top.coef data, illustrating the negative words having a positive probability
  • Chapter 7: Predictive Modeling: Using Text for Classifying and Predicting Outcomes
    • Table 7.1 The AUC values for the three classification models
    • Table 7.2 The best model's confusion matrix for the training patient data
    • Table 7.3 The confusion matrix with example variables
    • Table 7.4 The test set confusion matrix

Preview excerpt from Chapter 9: The read_docx function from qdap has two inputs. The first is the file name to import. The second, “skip =”, is an integer parameter similar to the readLines “n” parameter; it defines the number of lines to skip when parsing the Word document. The resulting object is a text vector for the document. The Word document EOL markers define the length of the R character vector, so you can use paste to collapse the text.
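Below is a minimal sketch of that workflow, assuming the qdap package is installed; the file name "email.docx" is a hypothetical stand-in, while read_docx, library(qdap) and the one.email object come from the preview fragment above.

    # Load qdap, which provides read_docx() for importing Word documents
    library(qdap)

    # "email.docx" is a hypothetical file name used for illustration;
    # skip = 0 keeps every line, while larger values drop leading lines,
    # as described for the "skip =" parameter above
    one.email <- read_docx("email.docx", skip = 0)

    # Each EOL marker in the Word document becomes one element of the
    # character vector, so paste() can collapse it into a single string
    one.email <- paste(one.email, collapse = " ")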

