Models that combine the effects of many variables tend to be much more powerful than models that use only a single variable. In this section, you’ll learn how to build some of the most fundamental multiple-variable models: decision trees, nearest neighbor, and Naive Bayes.
6.3.1 Variable selection
A key part of building many variable models is selecting what variables6 to use and how the variables are to be transformed or treated. We’ve already discussed variable treatment in chapter 4, so we’ll only discuss variable selection here (we’re assuming you’ve discussed with your project sponsors what variables are available for or even legal to use in your model).
When variables are available has a huge impact on model utility. For instance, a variable that’s coincident with (available near or even after) the time that the outcome occurs may make a very accurate model with little utility (as it can’t be used for long-range prediction). The analyst has to watch out for variables that are functions of or “contaminated by” the value to be predicted.
6 We’ll call variables used to build the model variously variables, independent variables, input variables, causal variables, and so on to try and distinguish them from the item to be predicted (which we’ll call outcome or dependent).
Which variables will actually be available in production is something you’ll want to discuss with your project sponsor. And sometimes you may want to improve model utility (at a possible cost of accuracy) by removing variables from the project design. An acceptable prediction one day before an event can be much more useful than a more accurate prediction one hour before the event.
Each variable we use represents a chance of explaining more of the outcome variation (a chance of building a better model) but also represents a possible source of noise and overfitting. To control this effect, we often preselect which subset of variables we’ll use to fit. Variable selection can be an important defensive modeling step even for types of models that “don’t need it” (as seen with decision trees in section 6.3.2). Listing 6.11 shows a hand-rolled variable selection loop where each variable is scored according to an AIC (Akaike information criterion) -inspired score, in which a variable is scored with a bonus proportional to the scaled log likelihood of the calibration data minus a penalty proportional to the complexity of the variable (which in this case is 2^entropy). The score is a bit ad hoc, but tends to work well in selecting variables. Notice we’re using performance on the calibration set (not the training set) to pick variables. Note that we don’t use the test set for calibration; to do so lessens the reliability of the test set for model quality confirmation.
Listing 6.11 Basic variable selection

# Define a convenience function to compute log likelihood.
logLikelyhood <- function(outCol,predCol) {
   sum(ifelse(outCol==pos,log(predCol),log(1-predCol)))
}

selVars <- c()
minStep <- 5
baseRateCheck <- logLikelyhood(dCal[,outcome],
   sum(dCal[,outcome]==pos)/length(dCal[,outcome]))

# Run through categorical variables and pick based on a deviance improvement
# (related to difference in log likelihoods; see chapter 3).
for(v in catVars) {
   pi <- paste('pred',v,sep='')
   liCheck <- 2*((logLikelyhood(dCal[,outcome],dCal[,pi]) -
      baseRateCheck))
   if(liCheck>minStep) {
      print(sprintf("%s, calibrationScore: %g",
         pi,liCheck))
      selVars <- c(selVars,pi)
   }
}

# Run through numeric variables and pick based on a deviance improvement.
for(v in numericVars) {
   pi <- paste('pred',v,sep='')
   liCheck <- 2*((logLikelyhood(dCal[,outcome],dCal[,pi]) -
      baseRateCheck) - 1)
   if(liCheck>=minStep) {
      print(sprintf("%s, calibrationScore: %g",
         pi,liCheck))
      selVars <- c(selVars,pi)
   }
}
In our case, this picks 27 of the 212 possible variables. The categorical and numeric variables selected are shown in the following listing.

Listing 6.12 Selected categorical and numeric variables
## [1] "predVar194, calibrationScore: 5.25759"
## [1] "predVar201, calibrationScore: 5.25521"
## [1] "predVar204, calibrationScore: 5.37414"
## [1] "predVar205, calibrationScore: 24.2323"
## [1] "predVar206, calibrationScore: 34.4434"
## [1] "predVar210, calibrationScore: 10.6681"
## [1] "predVar212, calibrationScore: 6.23409"
## [1] "predVar218, calibrationScore: 13.2455"
## [1] "predVar221, calibrationScore: 12.4098"
## [1] "predVar225, calibrationScore: 22.9074"
## [1] "predVar226, calibrationScore: 6.68931"
## [1] "predVar228, calibrationScore: 15.9644"
## [1] "predVar229, calibrationScore: 24.4946"
## [1] "predVar6, calibrationScore: 11.2431"
## [1] "predVar7, calibrationScore: 16.685"
## [1] "predVar13, calibrationScore: 8.06318"
## [1] "predVar28, calibrationScore: 9.38643"
## [1] "predVar65, calibrationScore: 7.96938"
## [1] "predVar72, calibrationScore: 10.5353"
## [1] "predVar73, calibrationScore: 46.2524"
## [1] "predVar74, calibrationScore: 17.6324"
## [1] "predVar81, calibrationScore: 6.8741"
## [1] "predVar113, calibrationScore: 21.136"
## [1] "predVar126, calibrationScore: 72.9556"
## [1] "predVar140, calibrationScore: 14.1816"
## [1] "predVar144, calibrationScore: 13.9858"
## [1] "predVar189, calibrationScore: 40.3059"
We’ll show in section 6.3.2 the performance of a multiple-variable model with and without using variable selection.
6.3.2 Using decision trees
Decision trees are a simple model type: they make a prediction that is piecewise constant. This is interesting because the null hypothesis that we’re trying to outperform is often a single constant for the whole dataset, so we can view a decision tree as a procedure to split the training data into pieces and use a simple memorized constant on each piece. Decision trees (especially a type called classification and regression trees, or CART) can be used to quickly predict either categorical or numeric outcomes. The best way to grasp the concept of decision trees is to think of them as machine-generated business rules.
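For instance, here is a tiny sketch of that idea on a made-up dataset (not the book’s churn data): the fitted tree finds one split and then predicts a single memorized constant on each side of it.

library('rpart')
d <- data.frame(x=1:10, y=c(rep(5,5),rep(20,5)))
m <- rpart(y~x, data=d,
   control=rpart.control(minsplit=2, minbucket=1, cp=0.01))
print(predict(m, newdata=d))   # only two distinct predictions: one constant per piece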
FITTING A DECISION TREE MODEL
Building a decision tree involves proposing many possible data cuts and then choosing best cuts based on simultaneous competing criteria of predictive power, cross-validation strength, and interaction with other chosen cuts.
One of the advantages of using a canned package for decision tree work is not having to worry about tree construction details. Let’s start by building a decision tree model for churn. The simplest way to call rpart() is to just give it a list of variables and see what happens (rpart(), unlike many R modeling techniques, has built-in code for dealing with missing values).
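The listings that follow reuse the calcAUC() convenience function defined earlier in the chapter. If you’re jumping in here, a version consistent with how it’s called below (a sketch built on the ROCR package, assuming pos holds the positive outcome label) looks like this:

library('ROCR')
calcAUC <- function(predcol,outcol) {
   # area under the ROC curve for a numeric score column versus the true outcomes
   perf <- performance(prediction(predcol,outcol==pos),'auc')
   as.numeric(perf@y.values)
}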
Listing 6.13 Building a bad decision tree

> library('rpart')
> fV <- paste(outcome,'>0 ~ ',
   paste(c(catVars,numericVars),collapse=' + '),sep='')
> tmodel <- rpart(fV,data=dTrain)
> print(calcAUC(predict(tmodel,newdata=dTrain),dTrain[,outcome]))
[1] 0.9241265
> print(calcAUC(predict(tmodel,newdata=dTest),dTest[,outcome]))
[1] 0.5266172
> print(calcAUC(predict(tmodel,newdata=dCal),dCal[,outcome]))
[1] 0.5126917
What we get is pretty much a disaster. The model looks way too good to believe on the training data (which it has merely memorized, negating its usefulness) and not as good as our best single-variable models on withheld calibration and test data. A couple of possible sources of the failure are that we have categorical variables with very many levels, and we have a lot more NAs/missing data than rpart()’s surrogate value strategy was designed for. What we can do to work around this is fit on our reprocessed variables, which hide the categorical levels (replacing them with numeric predictions), and remove NAs (treating them as just another level).
Listing 6.14 Building another bad decision tree

> tVars <- paste('pred',c(catVars,numericVars),sep='')
> fV2 <- paste(outcome,'>0 ~ ',paste(tVars,collapse=' + '),sep='')
> tmodel <- rpart(fV2,data=dTrain)
> print(calcAUC(predict(tmodel,newdata=dTrain),dTrain[,outcome]))
[1] 0.928669
> print(calcAUC(predict(tmodel,newdata=dTest),dTest[,outcome]))
[1] 0.5390648
> print(calcAUC(predict(tmodel,newdata=dCal),dCal[,outcome]))
[1] 0.5384152
This result is about the same (also bad). So our next suspicion is that the overfitting is because our model is too complicated. To control rpart() model complexity, we need to monkey a bit with the controls. We pass in an extra argument, rpart.control (use help('rpart') for some details on this control), that changes the decision tree selection strategy.
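Before we do, here’s a quick gloss of what those control settings mean (paraphrasing help('rpart.control'), which is the authoritative reference):

# cp=0.001        complexity parameter: don't attempt splits that improve the overall
#                 fit by less than this fraction
# minsplit=1000   don't try to split a node containing fewer than 1000 rows
# minbucket=1000  don't create a leaf containing fewer than 1000 rows
# maxdepth=5      allow at most 5 levels of splits below the root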
Listing 6.15 Building yet another bad decision tree

> tmodel <- rpart(fV2,data=dTrain,
   control=rpart.control(cp=0.001,minsplit=1000,
      minbucket=1000,maxdepth=5)
)
> print(calcAUC(predict(tmodel,newdata=dTrain),dTrain[,outcome]))
[1] 0.9421195
> print(calcAUC(predict(tmodel,newdata=dTest),dTest[,outcome]))
[1] 0.5794633
> print(calcAUC(predict(tmodel,newdata=dCal),dCal[,outcome]))
[1] 0.547967
This is a very small improvement. We can waste a lot of time trying variations of the rpart() controls. The best guess is that this dataset is unsuitable for decision trees and a method that deals better with overfitting issues is needed—such as random forests, which we’ll demonstrate in chapter 9. The best result we could get for this dataset using decision trees was from using our selected variables (instead of all transformed variables).
Listing 6.16 Building a better decision tree

> f <- paste(outcome,'>0 ~ ',paste(selVars,collapse=' + '),sep='')
> tmodel <- rpart(f,data=dTrain,
   control=rpart.control(cp=0.001,minsplit=1000,
      minbucket=1000,maxdepth=5)
)
> print(calcAUC(predict(tmodel,newdata=dTrain),dTrain[,outcome]))
[1] 0.6906852
> print(calcAUC(predict(tmodel,newdata=dTest),dTest[,outcome]))
[1] 0.6843595
> print(calcAUC(predict(tmodel,newdata=dCal),dCal[,outcome]))
[1] 0.6669301
These AUCs aren’t great (they’re not near 1.0 or even particularly near the winning team’s 0.76), but they are significantly better than any of the AUCs we saw from single-variable models when checked on non-training data. So we’ve finally built a legitimate multiple-variable model.
To tune rpart we suggest, in addition to trying variable selection (which is an odd thing to combine with decision tree methods), following the rpart documentation in trying different settings of the method argument. But we quickly get better results with KNN and logistic regression, so it doesn’t make sense to spend too long trying to tune decision trees for this particular dataset.
HOW DECISION TREE MODELS WORK
At this point, we can look at the model and use it to explain how decision tree models work.
Listing 6.17 Printing the decision tree

> print(tmodel)
n= 40518

node), split, n, deviance, yval
      * denotes terminal node

 1) root 40518 2769.3550 0.07379436
   2) predVar126< 0.07366888 18188  726.4097 0.04167583
     4) predVar126< 0.04391312 8804  189.7251 0.02203544 *
     5) predVar126>=0.04391312 9384  530.1023 0.06010230
      10) predVar189< 0.08449448 8317  410.4571 0.05206204 *
      11) predVar189>=0.08449448 1067  114.9166 0.12277410 *
   3) predVar126>=0.07366888 22330 2008.9000 0.09995522
     6) predVar212< 0.07944508 8386  484.2499 0.06153112
      12) predVar73< 0.06813291 4084  167.5012 0.04285015 *
      13) predVar73>=0.06813291 4302  313.9705 0.07926546 *
     7) predVar212>=0.07944508 13944 1504.8230 0.12306370
      14) predVar218< 0.07134103 6728  580.7390 0.09542212
        28) predVar126< 0.1015407 3901  271.8426 0.07536529 *
        29) predVar126>=0.1015407 2827  305.1617 0.12309870
          58) predVar73< 0.07804522 1452  110.0826 0.08264463 *
          59) predVar73>=0.07804522 1375  190.1935 0.16581820 *
      15) predVar218>=0.07134103 7216  914.1502 0.14883590
        30) predVar74< 0.0797246 2579  239.3579 0.10352850 *
        31) predVar74>=0.0797246 4637  666.5538 0.17403490
          62) predVar189< 0.06775545 1031  102.9486 0.11251210 *
          63) predVar189>=0.06775545 3606  558.5871 0.19162510 *
Each row in listing 6.17 that starts with #) is called a node of the decision tree. This decision tree has 21 nodes. Node 1 is always called the root. Each node other than the root node has a parent, and the parent of node k is node floor(k/2). The indentation also indicates how deep in the tree a node is. Each node other than the root is named by what condition must be true to move from the parent to the node. You move from node 1 to node 2 if predVar126 < 0.07366888 (and otherwise you move to node 3, which has the complementary condition). So to score a row of data, we navigate from the root of the decision tree by the node conditions until we reach a node with no children, which is called a leaf node. Leaf nodes are marked with stars. The remaining three numbers reported for each node are the number of training items that navigated to the node, the deviance of the set of training items that navigated to the node (a measure of how much uncertainty remains at a given decision tree node), and the fraction of items that were in the positive class at the node (which is the prediction for leaf nodes).
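If you want R to spell out the chain of conditions that leads to a particular node, the rpart package provides path.rpart(); for example (node 10 is an arbitrary choice):

path.rpart(tmodel, nodes=10)   # prints the splits on the path from the root to node 10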
We can get a graphical representation of much of this with the commands in the next listing that produce figure 6.2.
Listing 6.18 Plotting the decision tree

par(cex=0.7)
plot(tmodel)
text(tmodel)

Figure 6.2 Graphical representation of a decision tree
6.3.3 Using nearest neighbor methods
A k-nearest neighbor (KNN) method scores an example by finding the k training examples nearest to the example and then taking the average of their outcomes as the score. The notion of nearness is basic Euclidean distance, so it can be useful to select nonduplicate variables, rescale variables, and orthogonalize variables.
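To see those mechanics on something small, here’s a toy sketch (made-up points, not the churn data) showing how knn() reports its results:

library('class')
trainX <- data.frame(x1=c(0,0,1,1,10,10,11,11),
                     x2=c(0,1,0,1,10,11,10,11))      # two well-separated clusters
trainY <- factor(c(rep('no',4),rep('yes',4)))        # known outcomes for the training points
newX <- data.frame(x1=c(0.5,10.5), x2=c(0.5,10.5))   # two new points to score
d <- knn(trainX, newX, trainY, k=3, prob=TRUE)
print(d)                     # winning class for each new point
print(attributes(d)$prob)    # fraction of the neighbors that voted for that class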
One problem with KNN is the nature of its concept space. For example, if we were to run a 3-nearest neighbor analysis on our data, we have to understand that with three neighbors from the training data, we’ll always see either zero, one, two, or three examples of churn. So the estimated probability of churn is always going to be one of 0%, 33%, 66%, or 100%. This is not going to work on an event as rare as churn, which has a rate of around 7% in our training data. For events with unbalanced outcomes (that is, probabilities not near 50%), we suggest using a large k so KNN can express a useful range of probabilities. For a good k, we suggest trying something such that you have a good chance of seeing 10 positive examples in each neighborhood (allowing your model to express rates smaller than your baseline rate to some precision). In our case, that’s a k around 10/0.07 = 142. You’ll want to try a range of k, and we demonstrate a KNN run with k=200 in the following listing.
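As a quick check of that heuristic, you can compute the suggested k straight from the training data (a one-line sketch; the 10 is the desired count of positive examples per neighborhood):

print(10/mean(dTrain[,outcome]==pos))   # roughly 10 divided by the training churn rate (around 7%)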
Listing 6.19 Running k-nearest neighbors

> library('class')
> nK <- 200
> knnTrain <- dTrain[,selVars]      # Build a data frame with only the variables we wish to use for classification.
> knnCl <- dTrain[,outcome]==pos    # Build a vector with the known outcome classifications.
> knnPred <- function(df) {         # Bind the knn() training function with our data in a new function.
   knnDecision <- knn(knnTrain,df,knnCl,k=nK,prob=T)
   ifelse(knnDecision==TRUE,        # Convert knn's unfortunate convention of reporting the proportion of votes for the winning class into the more useful probability of being a positive example.
      attributes(knnDecision)$prob,
      1-(attributes(knnDecision)$prob))
}
> print(calcAUC(knnPred(dTrain[,selVars]),dTrain[,outcome]))
[1] 0.7443927
> print(calcAUC(knnPred(dCal[,selVars]),dCal[,outcome]))
[1] 0.7119394
> print(calcAUC(knnPred(dTest[,selVars]),dTest[,outcome]))
[1] 0.718256
This is our best result yet. What we’re looking for are the two distributions to be unimodal7 and, if not separated, at least not completely on top of each other. Notice how, under these criteria, the double density performance plot in figure 6.3 is much better looking than figure 6.1.
7 Distributions that are multimodal are often evidence that there are significant effects we haven’t yet explained. Distributions that are unimodal or even look normal are consistent with the unexplained effects being simple noise.
Figure 6.3 Performance of 200-nearest neighbors on calibration data
The code to produce figure 6.3 is shown in the next listing.
Listing 6.20 Plotting 200-nearest neighbor performance

dCal$kpred <- knnPred(dCal[,selVars])
ggplot(data=dCal) +
   geom_density(aes(x=kpred,
      color=as.factor(churn),linetype=as.factor(churn)))
This finally gives us a result good enough to bother plotting the ROC curve for. The code in the next listing produces figure 6.4.
Listing 6.21 Plotting the receiver operating characteristic curve

plotROC <- function(predcol,outcol) {
   perf <- performance(prediction(predcol,outcol==pos),'tpr','fpr')
   pf <- data.frame(
      FalsePositiveRate=perf@x.values[[1]],
      TruePositiveRate=perf@y.values[[1]])
   ggplot() +
      geom_line(data=pf,aes(x=FalsePositiveRate,y=TruePositiveRate)) +
      geom_line(aes(x=c(0,1),y=c(0,1)))
}
print(plotROC(knnPred(dTest[,selVars]),dTest[,outcome]))
Figure 6.4 ROC curve of the 200-nearest neighbor model (TruePositiveRate versus FalsePositiveRate)
The ROC curve shows every possible classifier you can get by using different scoring thresholds on the same model. For example, you can achieve a high recall (high true positive rate, or TPR) at the expense of a high false positive rate (FPR) by selecting a threshold that moves you to the top right of the graph. Conversely, you can achieve high precision (high positive confirmation rate) at the expense of recall by selecting a threshold that moves you to the bottom left of the graph. Notice that score thresholds aren’t plotted, just the resulting FPRs and TPRs.
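If you do want to recover the score threshold behind a particular point on the curve, the ROCR performance object also carries the cutoffs it stepped through; here’s a short sketch (the 0.1 cutoff is an arbitrary example value):

pred <- prediction(knnPred(dTest[,selVars]),dTest[,outcome]==pos)
perf <- performance(pred,'tpr','fpr')
cutoffs <- perf@alpha.values[[1]]        # the score thresholds ROCR evaluated
idx <- which.min(abs(cutoffs-0.1))       # index of the threshold closest to 0.1
print(c(threshold=cutoffs[idx],
   FalsePositiveRate=perf@x.values[[1]][idx],
   TruePositiveRate=perf@y.values[[1]][idx]))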
KNN is expensive both in time and space. Sometimes we can get similar results with more efficient methods such as logistic regression (which we’ll explain in detail in chapter 7). To demonstrate that a fast method can be competitive with KNN, we’ll show the performance of logistic regression in the next listing.
Listing 6.22 Plotting the performance of a logistic regression model

> gmodel <- glm(as.formula(f),data=dTrain,family=binomial(link='logit'))
> print(calcAUC(predict(gmodel,newdata=dTrain),dTrain[,outcome]))
[1] 0.7309537
> print(calcAUC(predict(gmodel,newdata=dTest),dTest[,outcome]))
[1] 0.7234645
> print(calcAUC(predict(gmodel,newdata=dCal),dCal[,outcome]))
[1] 0.7170824
6.3.4 Using Naive Bayes
Naive Bayes is an interesting method that memorizes how each training variable is related to outcome, and then makes predictions by multiplying together the effects of each variable. To demonstrate this, let’s use a scenario in which we’re trying to predict whether somebody is employed based on their level of education, their geographic region, and other variables. Naive Bayes begins by reversing that logic and asking this question: Given that you are employed, what is the probability that you have a high school education? From that data, we can then make our prediction regarding employment.
Let’s call a specific variable (x_1) taking on a specific value (X_1) a piece of evidence: ev_1. For example, suppose we define our evidence (ev_1) as the predicate education=="High School", which is true when the variable x_1 (education) takes on the value X_1 ("High School"). Let’s call the outcome y (taking on values T or True if the person is employed and F otherwise). Then the fraction of all positive examples where ev_1 is true is an approximation to the conditional probability of ev_1, given y==T. This is usually written as P(ev1|y==T). But what we want to estimate is the conditional probability of a subject being employed, given that they have a high school education:
P(y==T|ev1). How do we get from P(ev1|y==T) (the quantities we know from our training data) to an estimate of P(y==T|ev1 ... evN) (what we want to predict)?
Bayes’ law tells us we can expand P(y==T|ev1) and P(y==F|ev1) like this:

   P(y==T | ev1) = P(y==T) × P(ev1 | y==T) / P(ev1)
   P(y==F | ev1) = P(y==F) × P(ev1 | y==F) / P(ev1)
The left-hand side is what you want; the right-hand side is all quantities that can be estimated from the statistics of the training data. For a single feature ev1, this buys us little as we could derive P(y==T|ev1) as easily from our training data as from P(ev1|y==T). For multiple features (ev1 ... evN) this sort of expansion is useful. The Naive Bayes assumption lets us assume that all the evidence is conditionally independent of each other for a given outcome:

   P(ev1 & ... & evN | y==T) ≈ P(ev1 | y==T) × P(ev2 | y==T) × ... × P(evN | y==T)
   P(ev1 & ... & evN | y==F) ≈ P(ev1 | y==F) × P(ev2 | y==F) × ... × P(evN | y==F)
This gives us the following:

   P(y==T | ev1 & ... & evN) ≈ P(y==T) × (P(ev1 | y==T) × ... × P(evN | y==T)) / P(ev1 & ... & evN)
   P(y==F | ev1 & ... & evN) ≈ P(y==F) × (P(ev1 | y==F) × ... × P(evN | y==F)) / P(ev1 & ... & evN)
The numerator terms of the right sides of the final expressions can be calculated efficiently from the training data, while the left sides can’t. We don’t have a direct scheme for estimating the denominators in the Naive Bayes expression (these are called the joint probability of the evidence). However, we can still estimate P(y==T|evidence) and P(y==F|evidence), as we know by the law of total probability that we should have P(y==T|evidence) + P(y==F|evidence) = 1. So it’s enough to pick a denominator such that our estimates add up to 1.
For numerical reasons, it’s better to convert the products into sums, by taking the log of both sides. Since the denominator term is the same in both expressions, we can ignore it; we only want to determine which of the following expressions is greater:

   score(T | ev1 & ... & evN) = log(P(y==T)) + log(P(ev1 | y==T)) + ... + log(P(evN | y==T))
   score(F | ev1 & ... & evN) = log(P(y==F)) + log(P(ev1 | y==F)) + ... + log(P(evN | y==F))
It’s also a good idea to add a smoothing term so that you’re never taking the log of zero.
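To make the scoring procedure concrete, here is a minimal sketch of the calculation for categorical evidence columns. This is not the book’s implementation: the function name naiveBayesScore, its arguments, and the smoothing constant are illustrative choices, and NA handling is deliberately crude.

naiveBayesScore <- function(trainDf,varNames,outcome,newRow,smooth=1.0e-3) {
   # assumes pos (the positive outcome label) is defined as in the earlier listings
   isPos <- trainDf[,outcome]==pos
   scorePos <- log(sum(isPos)/nrow(trainDf))      # log(P(y==T))
   scoreNeg <- log(sum(!isPos)/nrow(trainDf))     # log(P(y==F))
   for(v in varNames) {
      level <- newRow[[v]]
      # smoothed log(P(ev|y==T)) and log(P(ev|y==F)); the smoothing term keeps us from taking log(0)
      scorePos <- scorePos + log((sum(isPos & trainDf[,v]==level,na.rm=TRUE)+smooth)/
         (sum(isPos)+smooth))
      scoreNeg <- scoreNeg + log((sum((!isPos) & trainDf[,v]==level,na.rm=TRUE)+smooth)/
         (sum(!isPos)+smooth))
   }
   # choose the denominator implicitly: renormalize so the two estimates sum to 1
   exp(scorePos)/(exp(scorePos)+exp(scoreNeg))
}

With many evidence variables, you’d compare scorePos and scoreNeg directly (or subtract the larger from both before exponentiating) to avoid floating-point underflow.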