Sampling is the process of selecting a subset of a population to represent the whole during analysis and modeling. In the current era of big datasets, some people argue that computational power and modern algorithms let us analyze the entire large dataset without the need to sample.
We can certainly analyze larger datasets than we could before, but sampling is a necessary task for other reasons. When you’re in the middle of developing or refining a modeling procedure, it’s easier to test and debug the code on small subsamples before training the model on the entire dataset. Visualization can be easier with a subsample of the data; ggplot runs faster on smaller datasets, and too much data will often obscure the patterns in a graph, as we mentioned in chapter 3. And often it’s not feasible to use your entire customer base to train a model.
It’s important that the dataset that you do use is an accurate representation of your population as a whole. For example, your customers might come from all over the United States. When you collect your custdata dataset, it might be tempting to use all the customers from one state, say Connecticut, to train the model. But if you plan to use the model to make predictions about customers all over the country, it’s a good idea to pick customers randomly from all the states, because what predicts health insurance coverage for Texas customers might be different from what predicts health insurance coverage in Connecticut. This might not always be possible (perhaps only your Connecticut and Massachusetts branches currently collect the customer health insurance information), but the shortcomings of using a nonrepresentative dataset should be kept in mind.
The other reason to sample your data is to create test and training splits.
4.2.1 Test and training splits
When you’re building a model to make predictions, like our model to predict the probability of health insurance coverage, you need data to build the model. You also need data to test whether the model makes correct predictions on new data. The first set is called the training set, and the second set is called the test (or hold-out) set.
The training set is the data that you feed to the model-building algorithm—regression, decision trees, and so on—so that the algorithm can set the correct parameters to best predict the outcome variable. The test set is the data that you feed into the resulting model, to verify that the model’s predictions are accurate. We’ll go into
detail about the kinds of modeling issues that you can detect by using hold-out data in chapter 5. For now, we’ll just get our data ready for doing hold-out experiments at a later stage.
Many writers recommend train/calibration/test splits, which is also good advice.
Our philosophy is this: split the data into train/test early, don’t look at test until final evaluation, and if you need calibration data, resplit it from your training subset.
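As a concrete illustration of this philosophy, here is a minimal sketch on synthetic data (the data frame `d`, the 10% and 20% thresholds, and all the column names here are illustrative, not prescribed):

```r
set.seed(2352)                          # fix the seed so the splits are reproducible
d <- data.frame(x = rnorm(1000))        # stand-in for your real dataset

d$gp <- runif(nrow(d))
testSet  <- subset(d, gp <= 0.1)        # hold out ~10%; don't look at it until final evaluation
trainSet <- subset(d, gp > 0.1)

# if a procedure needs calibration data, resplit it from the training subset only
trainSet$gp2 <- runif(nrow(trainSet))
calSet    <- subset(trainSet, gp2 <= 0.2)   # ~20% of the training data for calibration
trainRest <- subset(trainSet, gp2 > 0.2)
```

The test set is touched only once, at final evaluation; everything else, including calibration, comes out of the training rows.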
4.2.2 Creating a sample group column
A convenient way to manage random sampling is to add a sample group column to the data frame. The sample group column contains a number generated uniformly from zero to one, using the runif function. You can draw a random sample of arbitrary size from the data frame by using the appropriate threshold on the sample group column.
For example, once you’ve labeled all the rows of your data frame with your sample group column (let’s call it gp), then the set of all rows such that gp < 0.4 will be about four-tenths, or 40%, of the data. The set of all rows where gp is between 0.55 and 0.70 is about 15% of the data (0.7-0.55=0.15). So you can repeatably generate a random sample of the data of any size by using gp.
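For instance, here is a quick sketch on synthetic data (the data frame and its size are made up for illustration):

```r
set.seed(25643)                          # fixed seed so the proportions below are stable
df <- data.frame(cust_id = 1:10000)
df$gp <- runif(nrow(df))                 # sample group column: uniform on [0, 1)

sample40 <- subset(df, gp < 0.4)                   # roughly 40% of the rows
sample15 <- subset(df, gp >= 0.55 & gp < 0.70)     # roughly 15% (0.70 - 0.55 = 0.15)

nrow(sample40) / nrow(df)   # close to 0.40
nrow(sample15) / nrow(df)   # close to 0.15
```

Because gp is uniform, any interval of width w selects about a fraction w of the rows, and the same rows every time.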
Listing 4.9 Splitting into test and training using a random group mark

> custdata$gp <- runif(dim(custdata)[1])              # dim(custdata) returns the number of rows and columns of the data frame as a vector, so dim(custdata)[1] returns the number of rows
> testSet <- subset(custdata, custdata$gp <= 0.1)     # generate a test set of about 10% of the data
> trainingSet <- subset(custdata, custdata$gp > 0.1)  # generate a training set from the remaining data
> dim(testSet)[1]
[1] 93                                                # 93 customers, a little over 9% of the data, actually
> dim(trainingSet)[1]
[1] 907                                               # train on the remaining 90%

R also has a function called sample that draws a random sample (a uniform random sample, by default) from a data frame. Why not just use sample to draw training and test sets? You could, but using a sample group column guarantees that you’ll draw the same sample group every time. This reproducible sampling is convenient when you’re debugging code. In many cases, code will crash because of a corner case that you forgot to guard against. This corner case might show up in your random sample. If you’re using a different random input sample every time you run the code, you won’t know whether you will tickle the bug again. This makes it hard to track down and fix errors.

You also want repeatable input samples for what software engineers call regression testing (not to be confused with statistical regression). In other words, when you make changes to a model or to your data treatment, you want to make sure you don’t break what was already working. If model version 1 was giving “the right answer” for a certain input set, you want to make sure that model version 2 does so also.
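One caveat: runif itself returns different values on every call, so the gp column is only “the same every time” because you generate it once and keep it with the data. If you need to regenerate it identically, for example in a regression-test script, fix the pseudo-random seed first. A small sketch (the particular seed value is arbitrary):

```r
set.seed(355246)            # any fixed value works; this one is arbitrary
gp1 <- runif(10)

set.seed(355246)            # resetting the same seed...
gp2 <- runif(10)

identical(gp1, gp2)         # TRUE: ...reproduces exactly the same "random" draws
```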
REPRODUCIBLE SAMPLING IS NOT JUST A TRICK FOR R If your data is in a database or other external store, and you only want to pull a subset of the data into R for analysis, you can draw a reproducible random sample by generating a sample group column in an appropriate table in the database, using the SQL command RAND.
4.2.3 Record grouping
One caveat is that the preceding trick works only if every object of interest (every customer, in this case) corresponds to a unique row. But what if you’re interested less in which customers don’t have health insurance, and more about which households have uninsured members? If you’re modeling a question at the household level rather than the customer level, then every member of a household should be in the same group (test or training). In other words, the random sampling also has to be at the household level.
Suppose your customers are marked both by a household ID and a customer ID (so the unique ID for a customer is the combination (household_id, cust_id)). This is shown in figure 4.6. We want to split the households into a training set and a test set.
The next listing shows one way to generate an appropriate sample group column.
Figure 4.6 Example of dataset with customers and households

Listing 4.10 Ensuring test/train split doesn’t split inside a household

hh <- unique(hhdata$household_id)                        # get all unique household IDs from your data frame
households <- data.frame(household_id = hh,
                         gp = runif(length(hh)))         # create a temporary data frame of household IDs and a uniformly random number from 0 to 1
hhdata <- merge(hhdata, households, by="household_id")   # merge the new random sample group column back into the original data frame

The resulting sample group column is shown in figure 4.7. Now we can generate the test and training sets as before. This time, however, the threshold 0.1 doesn’t represent 10% of the data rows, but 10% of the households (which may be more or less than 10% of the data, depending on the sizes of the households).

4.2.4 Data provenance

You’ll also want to add a column (or columns) to record data provenance: when your dataset was collected, perhaps what version of your data cleaning procedure was used
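With the household-level gp column merged back in, generating the split looks the same as before, except that the threshold now applies per household. A sketch on synthetic data (the household sizes and IDs here are made up):

```r
set.seed(2352)
# synthetic data: 100 households of 1 to 4 customers each
sizes  <- sample(1:4, 100, replace = TRUE)
hhdata <- data.frame(household_id = rep(1:100, times = sizes))
hhdata$cust_id <- ave(hhdata$household_id, hhdata$household_id, FUN = seq_along)

hh <- unique(hhdata$household_id)
households <- data.frame(household_id = hh, gp = runif(length(hh)))
hhdata <- merge(hhdata, households, by = "household_id")

testSet  <- subset(hhdata, gp <= 0.1)     # about 10% of households
trainSet <- subset(hhdata, gp > 0.1)      # the remaining households

# no household is split across the two sets:
length(intersect(testSet$household_id, trainSet$household_id))   # 0
```

Because all rows of a household share a single gp value, the whole household falls on one side of the threshold.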
on the data before modeling, and so on. This is akin to version control for data. It’s handy information to have, to make sure that you’re comparing apples to apples when you’re in the process of improving your model, or comparing different models or different versions of a model.
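For example, a minimal sketch of provenance columns (the column names, date, and version string are purely illustrative):

```r
custdata <- data.frame(cust_id = 1:3)             # stand-in for your real data frame
custdata$data_collected   <- as.Date("2011-06-01") # when this snapshot was pulled
custdata$cleaning_version <- "v2"                  # which data-treatment version produced it

# later, before comparing models, confirm the datasets really are comparable:
all(custdata$cleaning_version == "v2")             # TRUE
```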