In section 6.3.2, we looked at using decision trees for classification and regression. As we mentioned there, decision trees are an attractive method for a number of reasons:
They take any type of data, numerical or categorical, without any distributional assumptions and without preprocessing.
Most implementations (in particular, R’s) handle missing data; the method is also robust to redundant and nonlinear data.
The algorithm is easy to use, and the output (the tree) is relatively easy to understand.
Once the model is fit, scoring is fast.
On the other hand, decision trees do have some drawbacks:
They have a tendency to overfit, especially without pruning.
They have high training variance: samples drawn from the same population can produce trees with different structures and different prediction accuracy.
Prediction accuracy can be low, compared to other methods.1
For these reasons a technique called bagging is often used to improve decision tree models, and a more specialized approach called random forests directly combines decision trees with bagging. We'll work examples of both techniques.
9.1.1 Using bagging to improve prediction
One way to mitigate the shortcomings of decision tree models is by bootstrap aggregation, or bagging. In bagging, you draw bootstrap samples (random samples with replacement) from your data. From each sample, you build a decision tree model.
The final model is the average of all the individual decision trees.2 To make this concrete, suppose that x is an input datum, y_i(x) is the output of the ith tree, c(y_1(x), y_2(x), ..., y_n(x)) is the vector of individual outputs, and y is the output of the final model:
For regression, or for estimating class probabilities, y(x) is the average of the scores returned by the individual trees: y(x) = mean(c(y_1(x), ... y_n(x))).
For classification, the final model assigns the class that got the most votes from the individual trees.
Bagging decision trees stabilizes the final model by lowering the variance; this improves the accuracy. A bagged ensemble of trees is also less likely to overfit the data.
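To make these two aggregation rules concrete, here's a minimal sketch in R. This isn't the code we'll use for the Spambase example later in this section; the helper names (bag_score, bag_vote) are ours, and we assume the per-tree outputs have already been collected into a matrix with one row per datum and one column per tree.

# Sketch only: assumes the per-tree outputs are already collected into
# a matrix with one row per datum and one column per tree.

# Regression, or class-probability estimation:
# average the per-tree scores for each datum.
bag_score <- function(score_matrix) {
  rowMeans(score_matrix)
}

# Classification with hard labels: assign each datum the class
# that received the most votes across the trees.
bag_vote <- function(label_matrix) {
  apply(label_matrix, 1,
        function(votes) names(which.max(table(votes))))
}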
1 See Lim, Loh, and Shih, "A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms," Machine Learning 40 (2000): 203–229; online at http://mng.bz/rwKM.
2 Bagging and random forests (which we'll describe in the next section) are two variations of a general technique called ensemble learning. An ensemble model is composed of the combination of several smaller simple models (often small decision trees). See Giovanni Seni and John Elder's Ensemble Methods in Data Mining (Morgan & Claypool, 2010) for more about ensemble learning.
Bagging classifiers
The proofs that bagging reduces variance are only valid for regression and for estimating class probabilities, not for classifiers (a model that only returns class membership, not class probabilities). Bagging a bad classifier can make it worse. So you definitely want to work over estimated class probabilities, if they're at all available.
But it can be shown that for CART trees (which is the decision tree implementation in R) under mild assumptions, bagging tends to increase classifier accuracy. See Clifton D. Sutton, "Classification and Regression Trees, Bagging, and Boosting," Handbook of Statistics, Vol. 24 (Elsevier, 2005) for more details.
The Spambase dataset (also used in chapter 5) provides a good example of the bagging technique. The dataset consists of about 4,600 documents and 57 features that describe the frequency of certain key words and characters. First we'll train a decision tree to estimate the probability that a given document is spam, and then we'll evaluate the tree's deviance (which you'll recall from discussions in chapters 5 and 7 is similar to variance) and its prediction accuracy.
First, let's load the data. As we did in section 5.2, let's download a copy of spamD.tsv (https://github.com/WinVector/zmPDSwR/raw/master/Spambase/spamD.tsv).
Then we'll write a few convenience functions and train a decision tree, as in the following listing.
Listing 9.1 Preparing Spambase data and evaluating the performance of decision trees

# Load the data and split into training (90% of the data)
# and test (10% of the data) sets.
spamD <- read.table('spamD.tsv', header=T, sep='\t')
spamTrain <- subset(spamD, spamD$rgroup>=10)
spamTest <- subset(spamD, spamD$rgroup<10)

# Use all the features and do binary classification,
# where TRUE corresponds to spam documents.
spamVars <- setdiff(colnames(spamD), list('rgroup','spam'))
spamFormula <- as.formula(paste('spam=="spam"',
                   paste(spamVars, collapse=' + '), sep=' ~ '))

# A function to calculate log likelihood (for calculating deviance).
loglikelihood <- function(y, py) {
  pysmooth <- ifelse(py==0, 1e-12,
                  ifelse(py==1, 1-1e-12, py))
  sum(y * log(pysmooth) + (1-y)*log(1 - pysmooth))
}

# A function to calculate and return various measures on the model:
# normalized deviance, prediction accuracy, and f1,
# which is the product of precision and recall.
accuracyMeasures <- function(pred, truth, name="model") {
  # Normalize the deviance by the number of data points so that we
  # can compare the deviance across training and test sets.
  dev.norm <- -2*loglikelihood(as.numeric(truth), pred)/length(pred)
  # Convert the class probability estimator into a classifier by
  # labeling documents that score greater than 0.5 as spam.
  ctable <- table(truth=truth,
                  pred=(pred>0.5))
  accuracy <- sum(diag(ctable))/sum(ctable)
  precision <- ctable[2,2]/sum(ctable[,2])
  recall <- ctable[2,2]/sum(ctable[2,])
  f1 <- precision*recall
  data.frame(model=name, accuracy=accuracy, f1=f1, dev.norm)
}

# Load the rpart library and fit a decision tree model.
library(rpart)
treemodel <- rpart(spamFormula, spamTrain)

# Evaluate the decision tree model against the training and test sets.
accuracyMeasures(predict(treemodel, newdata=spamTrain),
                 spamTrain$spam=="spam",
                 name="tree, training")
accuracyMeasures(predict(treemodel, newdata=spamTest),
                 spamTest$spam=="spam",
                 name="tree, test")
The last two calls to accuracyMeasures() produce the following output. As expected, the accuracy and F1 scores both degrade on the test set, and the deviance increases (we want the deviance to be small):
           model  accuracy        f1  dev.norm
  tree, training 0.9104514 0.7809002 0.5618654

           model  accuracy        f1  dev.norm
      tree, test 0.8799127 0.7091151 0.6702857
Now let’s try bagging the decision trees.
Listing 9.2 Bagging decision trees

# Use bootstrap samples the same size as the training set, with 100 trees.
ntrain <- dim(spamTrain)[1]
n <- ntrain
ntree <- 100

# Build the bootstrap samples by sampling the row indices of spamTrain
# with replacement. Each column of the matrix samples represents the
# row indices into spamTrain that comprise the bootstrap sample.
samples <- sapply(1:ntree,
                  FUN = function(iter)
                    { sample(1:ntrain, size=n, replace=T) })

# Train the individual decision trees and return them in a list.
# Note: this step can take a few minutes.
treelist <- lapply(1:ntree,
                   FUN = function(iter) {
                     samp <- samples[, iter]
                     rpart(spamFormula, spamTrain[samp, ])
                   })

# predict.bag assumes the underlying classifier returns
# decision probabilities, not decisions.
predict.bag <- function(treelist, newdata) {
  preds <- sapply(1:length(treelist),
                  FUN = function(iter) {
                    predict(treelist[[iter]], newdata=newdata)
                  })
  predsums <- rowSums(preds)
  predsums/length(treelist)
}

# Evaluate the bagged decision trees against the training and test sets.
accuracyMeasures(predict.bag(treelist, newdata=spamTrain),
                 spamTrain$spam=="spam",
                 name="bagging, training")
accuracyMeasures(predict.bag(treelist, newdata=spamTest),
                 spamTest$spam=="spam",
                 name="bagging, test")
This results in the following:
              model  accuracy        f1  dev.norm
  bagging, training 0.9220372 0.8072953 0.4702707

              model  accuracy        f1  dev.norm
      bagging, test 0.9061135 0.7646497  0.528229
As you see, bagging improves accuracy and F1, and reduces deviance over both the training and test sets when compared to the single decision tree (we’ll see a direct comparison of the scores a little later on). The improvement is more dramatic on the test set: the bagged model has less generalization error3 than the single decision tree.
We can further improve model performance by going from bagging to random forests.
9.1.2 Using random forests to further improve prediction
In bagging, the trees are built using randomized datasets, but each tree is built by considering the exact same set of features. This means that all the individual trees are likely to use very similar sets of features (perhaps in a different order or with different split values). Hence, the individual trees will tend to be overly correlated with each other. If there are regions in feature space where one tree tends to make mistakes, then all the trees are likely to make mistakes there, too, diminishing our opportunity for correction. The random forest approach tries to de-correlate the trees by randomizing the set of variables that each tree is allowed to use. For each individual tree in the ensemble, the random forest method does the following:
1 Draws a bootstrapped sample from the training data
2 For each sample, grows a decision tree, and at each node of the tree
   1 Randomly draws a subset of mtry variables from the p total features that are available
   2 Picks the best variable and the best split from that set of mtry variables
3 Continues until the tree is fully grown
The final ensemble of trees is then bagged to make the random forest predictions.
This is quite involved, but fortunately it's all done for you by a single call to the random forest function.
By default, the randomForest() function in R draws mtry = p/3 variables at each node for regression trees and mtry = sqrt(p) variables for classification trees. In theory, random forests aren't terribly sensitive to the value of mtry. Smaller values will grow the trees faster; but if you have a very large number of variables to choose from, of which only a small fraction are actually useful, then using a larger mtry is better, since
3 Generalization error is the difference in accuracy of the model on data it’s never seen before, as compared to its error on the training set.
with a larger mtry you’re more likely to draw some useful variables at every step of the tree-growing procedure.
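If you want to control this yourself, randomForest() accepts an mtry argument. The following call is only a sketch: it reuses the spamTrain and spamVars objects defined in listing 9.1, and the value mtry=15 is arbitrary, chosen purely for illustration.

library(randomForest)
# For this problem p = 57 features, so the classification default would be
# floor(sqrt(57)) = 7 variables per split; mtry=15 considers more of them.
fmodel_wide <- randomForest(x=spamTrain[, spamVars], y=spamTrain$spam,
                            ntree=100, mtry=15)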
Continuing from the data in section 9.1, let’s build a spam model using random forests.
Listing 9.3 Using random forests

# Load the randomForest package.
library(randomForest)

# Set the pseudo-random seed to a known value to try
# to make the random forest run repeatable.
set.seed(5123512)

# Call the randomForest() function to build the model, with the
# explanatory variables as x and the category to be predicted as y.
# Use 100 trees, to be compatible with our bagging example
# (the default is 500 trees). Specify that each node of a tree must
# have a minimum of 7 elements, to be compatible with the default
# minimum node size that rpart() uses on this training set.
# importance=T tells the algorithm to save information to be used
# for calculating variable importance (we'll see this later).
fmodel <- randomForest(x=spamTrain[, spamVars],
                       y=spamTrain$spam,
                       ntree=100,
                       nodesize=7,
                       importance=T)

# Report the model quality.
accuracyMeasures(predict(fmodel,
                         newdata=spamTrain[, spamVars], type='prob')[, 'spam'],
                 spamTrain$spam=="spam", name="random forest, train")
##                  model  accuracy        f1  dev.norm
## 1 random forest, train 0.9884142 0.9706611 0.1428786

accuracyMeasures(predict(fmodel,
                         newdata=spamTest[, spamVars], type='prob')[, 'spam'],
                 spamTest$spam=="spam", name="random forest, test")
##                 model  accuracy        f1  dev.norm
## 1 random forest, test 0.9541485 0.8845029 0.3972416
Let’s summarize the results for all three of the models we’ve looked at:
# Performance on the training set
model           accuracy   f1         dev.norm
Tree            0.9104514  0.7809002  0.5618654
Bagging         0.9220372  0.8072953  0.4702707
Random Forest   0.9884142  0.9706611  0.1428786

# Performance on the test set
model           accuracy   f1         dev.norm
Tree            0.8799127  0.7091151  0.6702857
Bagging         0.9061135  0.7646497  0.5282290
Random Forest   0.9541485  0.8845029  0.3972416

# Performance change between training and test:
# the decrease in accuracy and f1 in the test set from training,
# and the increase in dev.norm in the test set from training.
# (So in every case, smaller is better.)
model           accuracy    f1          dev.norm
Tree            0.03053870  0.07178505  -0.10842030
Bagging         0.01592363  0.04264557  -0.05795832
Random Forest   0.03426572  0.08615813  -0.254363
The random forest model performed dramatically better than the other two models in both training and test. But the random forest's generalization error was comparable to that of a single decision tree (and almost twice that of the bagged model).4
EXAMINING VARIABLE IMPORTANCE
A useful feature of the randomForest() function is its variable importance calculation. Since the algorithm uses a large number of bootstrap samples, each data point x has a corresponding set of out-of-bag samples: those samples that don't contain the point x. The out-of-bag samples can be used in a way similar to N-fold cross-validation, to estimate the accuracy of each tree in the ensemble.
To estimate the "importance" of a variable v, the variable's values are randomly permuted in the out-of-bag samples, and the corresponding decrease in each tree's accuracy is estimated. If the average decrease over all the trees is large, then the variable is considered important: its value makes a big difference in predicting the outcome. If the average decrease is small, then the variable doesn't make much difference to the outcome. The algorithm also measures the decrease in node purity that occurs from splitting on a permuted variable (how this variable affects the quality of the tree).
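The randomForest package does this calculation for us (as we'll see in listing 9.4), but the idea is simple enough to sketch by hand. The helper below is our own illustration, not part of the package, and it uses the holdout set as a convenient stand-in for the true out-of-bag samples.

# Sketch: estimate the importance of one variable by permuting it and
# measuring the resulting drop in prediction accuracy on holdout data.
permutationImportance <- function(model, data, truth, var) {
  baseAcc <- mean(predict(model, newdata=data) == truth)
  permuted <- data
  permuted[[var]] <- sample(permuted[[var]])   # scramble just this variable
  permAcc <- mean(predict(model, newdata=permuted) == truth)
  baseAcc - permAcc   # a large drop means the variable matters
}

permutationImportance(fmodel, spamTest[, spamVars], spamTest$spam,
                      "word.freq.remove")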
We can calculate the variable importance by setting importance=T in the randomForest() call, and then calling the functions importance() and varImpPlot().
4 When a machine learning algorithm shows an implausibly good fit (like 0.99+ accuracy), it can be a symptom that you don't have enough training data to falsify bad modeling alternatives. Limiting the complexity of the model can cut down on generalization error and overfitting and can be worthwhile, even if it decreases training performance.
Random forests can overfit!
It's lore among random forest proponents that "random forests don't overfit." In fact, they can. Hastie et al. back up this observation in their chapter on random forests in The Elements of Statistical Learning, Second Edition (Springer, 2009). Look for unreasonably good fits on the training data as evidence of useless overfit and memorization. Also, it's important to evaluate your model's performance on a holdout set.
You can also mitigate the overfitting problem by limiting how deep the trees can be grown (using the maxnodes parameter to randomForest()). When you do this, you're deliberately degrading model performance on training data so that you can more usefully distinguish between models and falsify bad training decisions.
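As a sketch, a call like the following would cap the size of each tree; the value maxnodes=50 is arbitrary, chosen only for illustration.

# Limit each tree to at most 50 terminal nodes to restrict its depth.
fmodel_capped <- randomForest(x=spamTrain[, spamVars], y=spamTrain$spam,
                              ntree=100, nodesize=7, maxnodes=50)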
Listing 9.4 randomForest variable importance()

# Call importance() on the spam model. The importance() function returns
# a matrix of importance measures (larger values = more important).
> varImp <- importance(fmodel)
> varImp[1:10, ]
                    non-spam       spam MeanDecreaseAccuracy
word.freq.make      2.096811  3.7304353             4.334207
word.freq.address   3.603167  3.9967031             4.977452
word.freq.all       2.799456  4.9527834             4.924958
word.freq.3d        3.000273  0.4125932             2.917972
word.freq.our       9.037946  7.9421391            10.731509
word.freq.over      5.879377  4.2402613             5.751371
word.freq.remove   16.637390 13.9331691            17.753122
word.freq.internet  7.301055  4.4458342             7.947515
word.freq.order     3.937897  4.3587883             4.866540
word.freq.mail      5.022432  3.4701224             6.103929

# Plot the variable importance as measured by accuracy change.
> varImpPlot(fmodel, type=1)
The result of the varImpPlot() call is shown in figure 9.1.
Figure 9.1 Plot of the most important variables in the spam model, as measured by accuracy change

Knowing which variables are most important (or at least, which variables contribute the most to the structure of the underlying decision trees) can help you with variable reduction. This is useful not only for building smaller, faster trees, but for choosing variables to be used by another modeling algorithm, if that's desired. We can reduce the number of variables in this spam example from 57 to 25 without affecting the quality of the final model.
Listing 9.5 Fitting with fewer variables

# Sort the variables by their importance, as measured by accuracy change,
# and keep the 25 most important.
selVars <- names(sort(varImp[, 1], decreasing=T))[1:25]

# Build a random forest model using only the 25 most important variables.
fsel <- randomForest(x=spamTrain[, selVars], y=spamTrain$spam,
                     ntree=100,
                     nodesize=7,
                     importance=T)

accuracyMeasures(predict(fsel,
                         newdata=spamTrain[, selVars], type='prob')[, 'spam'],
                 spamTrain$spam=="spam", name="RF small, train")
##             model  accuracy        f1  dev.norm
## 1 RF small, train 0.9876901 0.9688546 0.1506817

accuracyMeasures(predict(fsel,
                         newdata=spamTest[, selVars], type='prob')[, 'spam'],
                 spamTest$spam=="spam", name="RF small, test")
##            model  accuracy        f1 dev.norm
## 1 RF small, test 0.9497817 0.8738142 0.400825
The smaller model performs just as well as the random forest model built using all 57 variables.
9.1.3 Bagging and random forest takeaways
Here’s what you should remember about bagging and random forests:
Bagging stabilizes decision trees and improves accuracy by reducing variance.
Bagging reduces generalization error.
Random forests further improve decision tree performance by de-correlating the individual trees in the bagging ensemble.
Random forests’ variable importance measures can help you determine which variables are contributing the most strongly to your model.
Because the trees in a random forest ensemble are unpruned and potentially quite deep, there’s still a danger of overfitting. Be sure to evaluate the model on holdout data to get a better estimate of model performance.
Bagging and random forests are after-the-fact improvements we can try in order to improve model outputs. In our next section, we’ll work with generalized additive models, which work to improve how model inputs are used.