7.4 Machine Learning and Artificial Intelligence
7.4.3 Natural Language Artificial Intelligence
Let’s get back to our example of a client who wishes to augment machine learning forecasts by leveraging sources of unstructured data such as customer interest expressed in a series of emails, blogs and newspapers collected over the past 12 months.
A key point to understand about Natural Language Processing (NLP) is that these tools often don't just learn by detecting signal in the data, but by associating patterns (e.g. subject-verb-object) and contents (e.g. words) found in new data with rules and meanings previously developed on prior data. Over the years, sets of linguistic rules and meanings known to relate to specific domains (e.g. popular English, medicine, politics) have been consolidated from large collections of texts within each given domain and categorized into publicly available dictionaries called lexical corpora.
For example, the Corpus of Contemporary American English (COCA) contains more than 160,000 texts coming from sources that range from movie transcripts to academic peer-reviewed journals, totaling 450 million words, pulled uniformly between 1990 and 2015 [208]. The corpus is divided into five sub-corpora tailored to different uses: spoken, fiction, popular magazines, newspapers and academic articles. All words are annotated according to their syntactic function (part-of-speech, e.g. noun, verb, adjective), stem/lemma (the root word from which a given word derives, e.g. 'good', 'better', 'best' derive from 'good'), phrase, synonym/homonym, and other types of customized indexing such as time periods and collocates (e.g. words often found together in sections of text).
Annotated corpora make the disambiguation of meanings in new texts tractable: a word can have multiple meanings in different contexts, but when the context is defined in advance, the proper linguistic meaning can be more reliably inferred. For example, the word "apple" can be disambiguated depending on whether it collocates more with "fruit" or "computer", and whether it is found in a gastronomy vs. computer-related article.
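To make the collocate idea concrete, here is a minimal, hypothetical Python sketch that assigns a sense to "apple" by counting sense-indicative collocates in a small window around the word; the sense lexicons, window size and example sentence are illustrative assumptions, not part of any real corpus.

```python
# Illustrative sense lexicons: words that tend to collocate with each sense.
SENSE_COLLOCATES = {
    "fruit": {"fruit", "pie", "orchard", "juice", "tree"},
    "computer": {"computer", "mac", "iphone", "software", "keyboard"},
}

def disambiguate(tokens, target="apple", window=5):
    """Pick the sense whose collocates co-occur most often near `target`."""
    scores = {sense: 0 for sense in SENSE_COLLOCATES}
    for i, tok in enumerate(tokens):
        if tok.lower() != target:
            continue
        context = tokens[max(0, i - window): i + window + 1]
        for sense, collocates in SENSE_COLLOCATES.items():
            scores[sense] += sum(1 for w in context if w.lower() in collocates)
    return max(scores, key=scores.get)

print(disambiguate("I baked an apple pie with fruit from the orchard".split()))
# expected: 'fruit'
```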
Some NLP algorithms simply parse text by removing white space, stop words (e.g. 'and', 'or', 'the') and standard suffixes/prefixes, but for more complex inferences (e.g. associating words with meaningful lemmas, disambiguating synonyms and homonyms, etc.), tailored annotated corpora are needed. The information found in most corpora relates to semantics and syntax, and in particular to sentence parsing (defining valid grammatical constructs), tagging (defining the valid part-of-speech for each word) and lemmatization
(rules that can identify synonyms and relatively complex language morphologies). Together, these may aim at inferring named entities (is "apple" the fruit, the computer or the firm), what action is taken on these entities, by whom or what, with which intensity, intent, etc. Combined with sequence learning (e.g. the recurrent neural networks introduced in Sect. 7.4.1), this enables speech to be followed and emulated. And combined with unsupervised learning (e.g. SVD/PCA to combine words/lemmas based on their correlation), it enables high-level concepts to be derived (an approach named latent semantic analysis [209]). All of this is at the limit of the latest technology of course, but we are approaching a time when most mental constructs that make sense to a human can indeed be encoded, and thereby when artificially intelligent designs may emulate human behavior.
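As an illustration of this kind of pre-processing, the following minimal sketch assumes the NLTK toolkit [210] is installed; 'punkt', 'stopwords' and 'wordnet' are the standard NLTK resources it relies on. It tokenizes a piece of text, drops stop words and short tokens, and lemmatizes what remains.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Fetch the standard NLTK resources (tokenizer model, stop word list, WordNet).
for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def to_lemmas(text):
    """Tokenize, drop stop words and short/non-alphabetic tokens, lemmatize (lower case)."""
    tokens = nltk.word_tokenize(text.lower())
    return [LEMMATIZER.lemmatize(t) for t in tokens
            if t.isalpha() and t not in STOP_WORDS and len(t) >= 3]

print(to_lemmas("The apples were better than the computers we reviewed."))
```

This helper is reused in the classification sketch later in this section.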
Let us take a look at a simple, concrete example and its algorithm in detail. Let us assume we have gathered a series of articles and blogs written by a set of existing and potential customers about a company, and we want to develop a model that identifies the sentiment of the articles toward that company. To keep it simple, let us consider only two outcomes, positive and negative sentiment, and derive a classifier; a code sketch of the full procedure follows the steps below. The same logic would apply to quantifying sentiment numerically using regression, for example on a scale of 0 (negative) to 10 (positive).
1. Create an annotated corpus from the available articles by tagging each article as 1 (positive sentiment) or 0 (negative sentiment)
2. Ensure balanced classes (a 50/50% distribution of positive and negative articles) by sub-sampling the over-represented class
3. For each of the n selected articles:
• Split article into a list of words
• Remove stop words (e.g. spaces, and, or, get, let, the, yet, …) and short words (e.g. any word with fewer than three characters)
• Replace each word by its base word (lemma, all lower case)
• Append the article's list of lemmas to a master/nested list
• Append each individual lemma to an indexed list (e.g. a dictionary in Python) of distinct lemmas, i.e. append a lemma only if it has never been appended before
4. Create an n x (m + 1) matrix where the m + 1 columns correspond to the m distinct lemmas plus the sentiment tag. Starting from a null vector in each of the n rows, represent the n articles by looping over each row and incrementing by 1 the column corresponding to a given lemma every time this lemma is observed in a given article (i.e. observed in the corresponding list of the nested list created above). Each row now represents an article in the form of a frequency vector
5. Normalize weights in each row to sum to 1 to ensure that each article impacts prediction through its word frequency, not its size
6. Add the sentiment label (0 or 1) of each article in the last column
7. Shuffle the rows randomly and hold out 30% for testing
8. Train and test a classifier (any of the ones described in this chapter, e.g. logistic regression) where the input features are all but the last column, and the response label is the last column
9. The classifier can now be used on any new article to indicate whether the article's sentiment for the company is positive or negative, with the level of confidence identified in step 8
10. If a logistic regression was used, the coefficients of the features (i.e. lemmas) can be screened to identify the words that carry the most positive and the most negative sentiment across articles
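The following sketch walks through steps 1-10 on a hypothetical toy dataset, reusing the to_lemmas() helper sketched earlier and scikit-learn's logistic regression; the articles, labels and split parameters are illustrative assumptions, not real customer data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Steps 1-2: hand-annotated, class-balanced toy articles (1 = positive, 0 = negative)
articles = ["Great product and excellent support, we love it",
            "Awful experience, terrible support, we hate it",
            "Wonderful service and fantastic quality overall",
            "Poor quality, disappointing delivery and slow support"]
labels = np.array([1, 0, 1, 0])

# Step 3: split each article into lemmas and index the distinct lemmas
nested = [to_lemmas(a) for a in articles]
vocab = {}
for lemmas in nested:
    for lem in lemmas:
        vocab.setdefault(lem, len(vocab))

# Step 4: n x m lemma-frequency matrix (the sentiment tag is kept separately in `labels`)
X = np.zeros((len(articles), len(vocab)))
for i, lemmas in enumerate(nested):
    for lem in lemmas:
        X[i, vocab[lem]] += 1

# Step 5: normalize each row to sum to 1 so article length does not drive the prediction
X = X / X.sum(axis=1, keepdims=True)

# Steps 6-8: shuffle, hold out 30% (stratified to keep both classes), train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, shuffle=True, stratify=labels)
clf = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# Step 9: score a new, unseen article
new_vec = np.zeros((1, len(vocab)))
for lem in to_lemmas("The support team was great"):
    if lem in vocab:
        new_vec[0, vocab[lem]] += 1
new_vec /= max(new_vec.sum(), 1)
print("positive probability:", clf.predict_proba(new_vec)[0, 1])

# Step 10: inspect coefficients for the most positive / most negative lemmas
inverse_vocab = {j: lem for lem, j in vocab.items()}
order = np.argsort(clf.coef_[0])
print("most negative lemmas:", [inverse_vocab[j] for j in order[:3]])
print("most positive lemmas:", [inverse_vocab[j] for j in order[-3:]])
```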
Note that in the simple NLP exercise above we annotated the training data ourselves by flagging which articles/blogs had positive vs. negative sentiment, yet we still relied on existing corpora in step 3 to remove stop words and, more importantly, to identify lemmas. Lemmatizers can be found easily (e.g. NLTK [210]), hence no project in NLP ever really starts from scratch, to the benefit of everyone working with artificial intelligence.
Naturally, we could have leveraged existing sentiment analysis APIs [211] to flag the training dataset in the first place, instead of doing it manually. But NLP is an emerging science, and existing annotated corpora are often not specific enough for most projects. NLP projects often require manually developing and extending corpora by defining new or alternative rules specifically for the project. Often both the entities to analyze (e.g. customer name, article ID) and the measures on them (e.g. sentiment for a given company) need to be coded into linguistic concepts before analysis. Note that Latent Semantic Analysis (i.e. SVD on sets of lemmas [209]) can be used to code a complex concept as a weighted average of multiple lemmas.
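As a minimal sketch of this idea, assuming X is the n x m lemma-frequency matrix built in the previous sketch, scikit-learn's TruncatedSVD can express each article on a handful of latent concepts and each concept as a weighted mix of lemmas; the number of components (2) is an arbitrary illustrative choice.

```python
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2)
concepts = svd.fit_transform(X)   # n x 2: each article expressed on 2 latent concepts
loadings = svd.components_        # 2 x m: each concept as a weighted mix of lemmas
print(concepts.shape, loadings.shape)
```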
By transforming a corpus of text into linguistic concepts, NLP enables unstructured data to be turned into what is effectively structured data, i.e. a finite set of features that take on numeric values for each customer. As we saw, these features can then be processed by the usual machine learning algorithms.
Similar to the conceptual transition we make from unsupervised to supervised learning when the goal moves from detecting patterns to predicting outcomes, with NLP we make another transition: from a purely data-driven machine learning approach to an approach where both data and pre-defined knowledge drive the learning process. This concept of reinforcement learning [202] is the state of the art and recently met with highly publicized success: when Google DeepMind beat the Go world champion in March 2016 [212], it was through a gigantic reinforcement learning process that took place prior to the tournament. Reinforcement learning consists of defining rules based on prior or new knowledge from the environment and adding rewards on partial outcomes, dynamically as the learning proceeds, when these partial outcomes can be identified as closer to the final goal (e.g. winning the game). A set of rules is referred to as a policy, a partial outcome as a state, and the model as an agent for sequential decision making. The learning process consists in interacting with the environment and exploring as many states as possible to improve the agent's policy. The policies are applied through state-specific terms in the overall loss function, which can be propagated to the weights of the agent's features by standard minimization (e.g. gradient descent). This is like adding religion to science: adding supervision on the go, based on ad-hoc rules coming from the environment, observed states, actions and rewards, to a more formal learning process.
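To make this vocabulary concrete, here is a toy tabular Q-learning sketch (a deliberately simple form of reinforcement learning, not DeepMind's actual method): the agent walks along a short chain of states, receives a reward only at the goal, and improves its policy by interacting with the environment. All constants (chain length, learning rate, etc.) are illustrative.

```python
import random

n_states, goal, actions = 5, 4, (-1, +1)       # move left (-1) or right (+1) on a chain
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2          # learning rate, discount, exploration rate

for episode in range(200):
    state = 0
    while state != goal:
        # epsilon-greedy policy: mostly exploit current knowledge, sometimes explore
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state = min(max(state + action, 0), n_states - 1)
        reward = 1.0 if next_state == goal else 0.0
        # move Q(state, action) toward the reward plus the discounted future value
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# learned policy for the non-goal states: it should point toward the goal (action +1)
print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(goal)})
```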
Google developed its reinforcement learning policies specifically for Go by having the computer play against itself for multiple months before the tournament. When the computer faced the world champion Lee Sedol in March 2016, it won four games to one. Five months earlier, it had beaten Fan Hui, the European champion, five games to zero [213]. A year later, its upgraded version was reported to beat the earlier version by 100 games to none. Go allows for more potential game configurations than there are atoms in the universe and is claimed to be the most complex game known to mankind. Now of course, playing games is not all that makes someone, or something, intelligent.