Compared to statistics, machine learning and data science have an optimistic view of working with data. In data science, you quickly pounce on noncausal relations in the hope that they’ll hold up and help with future prediction. Much of statistics is about how data can lie to you and how such relations can mislead you. We only have space for a couple of examples, so we’ll concentrate on two of the most common issues: sampling bias and missing variable bias.
B.3.1 Sampling bias
Sampling bias is any process that systematically alters the distribution of observed data.5 The data scientist must be aware of the possibility of sampling bias and be prepared to detect it and fix it. The most effective fix is to fix your data collection methodology.
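For example, one quick way to look for sampling bias (a sketch of our own, using a made-up reference figure) is to compare sample summaries of a key variable against known population values:

# referenceMedianIncome is a hypothetical census figure, used only for illustration.
referenceMedianIncome <- 52000
# d is the income data frame used in section B.2.4 and in the listings below.
sampleMedianIncome <- median(d$EarnedIncome)
print(c(reference=referenceMedianIncome, sample=sampleMedianIncome))
# A large discrepancy suggests the collection process over- or under-represented
# part of the population.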
For our sampling bias example, we’ll continue with the income example we started in section B.2.4. Suppose through some happenstance we were studying only a high-earning subset of our original population (perhaps we polled them at some exclusive event). The following listing shows how, when we restrict to a high-earning set, it appears that earned income and capital gains are strongly anticorrelated. We get a correlation of -0.86 (so think of the anticorrelation as explaining about (-0.86)^2 = 0.74 = 74% of the variance; see http://mng.bz/ndYf) and a p-value very near 0 (so it’s unlikely the unknown true correlation of more data produced in this manner is in fact 0). The following listing demonstrates the calculation.
5 We would have liked to use the common term “censored” for this issue, but in statistics the phrase censored observations is reserved for variables that have only been recorded up to a limit or bound. So it would be potentially confusing to try to use the term to describe missing observations.

Listing B.23 Misleading significance result from biased observations
veryHighIncome <- subset(d, EarnedIncome+CapitalGains>=500000)
print(with(veryHighIncome,
   cor.test(EarnedIncome,CapitalGains,method='spearman')))
#
# Spearman's rank correlation rho
#
#data: EarnedIncome and CapitalGains
#S = 1046, p-value < 2.2e-16
#alternative hypothesis: true rho is not equal to 0
#sample estimates:
# rho
#-0.8678571
Some plots help show what’s going on. Figure B.9 shows the original dataset with the best linear relation line run through. Note that the line is nearly flat (indicating change in x doesn’t predict change in y).
Figure B.9 EarnedIncome versus CapitalGains for the full dataset, with the (nearly flat) linear trend line
Figure B.10 shows the best trend line run through the high-income dataset. It also shows how cutting out the points below the line x+y=500000 leaves a smattering of rare high-value events arranged in a direction that crudely approximates the slope of our cut line (-0.8678571 being a crude approximation of -1). It’s also interesting to note that the bits we suppressed aren’t correlated among themselves, so the effect wasn’t a matter of suppressing a correlated group out of an uncorrelated cloud to get a negative correlation.

Figure B.10 Biased earned income versus capital gains

The code to produce figures B.9 and B.10 and calculate the correlation between suppressed points is shown in the following listing.
Listing B.24 Plotting biased view of income and capital gains

library(ggplot2)
# Plot all of the income data with the linear trend line (and uncertainty band).
ggplot(data=d,aes(x=EarnedIncome,y=CapitalGains)) +
   geom_point() + geom_smooth(method='lm') +
   coord_cartesian(xlim=c(0,max(d)),ylim=c(0,max(d)))
# Plot the very high income data and linear trend line
# (also include the cut-off and a portrayal of the suppressed data).
ggplot(data=veryHighIncome,aes(x=EarnedIncome,y=CapitalGains)) +
   geom_point() + geom_smooth(method='lm') +
   geom_point(data=subset(d,EarnedIncome+CapitalGains<500000),
      aes(x=EarnedIncome,y=CapitalGains),
      shape=4,alpha=0.5,color='red') +
   geom_segment(x=0,xend=500000,y=500000,yend=0,
      linetype=2,alpha=0.5,color='red') +
   coord_cartesian(xlim=c(0,max(d)),ylim=c(0,max(d)))
# Compute the correlation of the suppressed data.
print(with(subset(d,EarnedIncome+CapitalGains<500000),
   cor.test(EarnedIncome,CapitalGains,method='spearman')))
#
# Spearman's rank correlation rho
#
#data: EarnedIncome and CapitalGains
#S = 107664, p-value = 0.6357
#alternative hypothesis: true rho is not equal to 0
#sample estimates:
# rho
#-0.05202267
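Nothing about this effect depends on income data specifically. The following small simulation (our own illustration, not part of this appendix’s example) starts with two independent variables and induces a strong negative correlation purely by keeping the rows with the largest sums:

set.seed(123)
x <- rlnorm(10000, meanlog=10)   # two independent, income-like quantities
y <- rlnorm(10000, meanlog=10)
cor(x, y, method='spearman')                 # near 0 on the full data
keep <- (x + y) >= quantile(x + y, 0.95)     # keep only the largest combined values
cor(x[keep], y[keep], method='spearman')     # strongly negative after selection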
B.3.2 Omitted variable bias
Many data science clients expect data science to be a quick process, where every convenient variable is thrown in at once and a best possible result is quickly obtained. Statisticians are rightfully wary of such an approach due to various negative effects such as omitted variable bias, collinear variables, confounding variables, and nuisance variables. In this section, we’ll discuss one of the more general issues: omitted variable bias.
WHAT IS OMITTED VARIABLE BIAS?
In its simplest form, omitted variable bias occurs when a variable that isn’t included in the model is both correlated with what we’re trying to predict and correlated with a variable that’s included in our model. When this effect is strong, it causes problems, as the model-fitting procedure attempts to use the variables in the model to both directly predict the desired outcome and to stand in for the effects of the missing variable.
This can introduce biases, create models that don’t quite make sense, and result in poor generalization performance.
The effect of omitted variable bias is easiest to see in a regression example, but it can affect any type of model.
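Here is a small regression illustration of the effect (our own sketch, separate from the drug-development example that follows): the outcome depends positively on both x and z, but because x and z are negatively correlated, a model that omits z estimates a negative coefficient for x.

set.seed(42)
z <- rnorm(200)
x <- -0.9*z + 0.1*rnorm(200)     # x is negatively correlated with z
y <- x + 2*z + 0.1*rnorm(200)    # y truly increases with both x and z
coef(lm(y ~ x))        # omitting z: the x coefficient comes out negative
coef(lm(y ~ x + z))    # including z: the x coefficient is near its true value, +1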
AN EXAMPLE OF OMITTED VARIABLE BIAS
We’ve prepared a synthetic dataset called synth.RData (download from https://github.com/WinVector/zmPDSwR/tree/master/bioavailability) that has an omitted variable problem typical for a data science project. To start, please download synth.RData and load it into R, as the next listing shows.

Listing B.25 Summarizing our synthetic biological data
> load('synth.RData')
> print(summary(s))
      week         Caco2A2BPapp        FractionHumanAbsorption
 Min.   :  1.00   Min.   :6.994e-08   Min.   :0.09347
 1st Qu.: 25.75   1st Qu.:7.312e-07   1st Qu.:0.50343
 Median : 50.50   Median :1.378e-05   Median :0.86937
 Mean   : 50.50   Mean   :2.006e-05   Mean   :0.71492
 3rd Qu.: 75.25   3rd Qu.:4.238e-05   3rd Qu.:0.93908
 Max.   :100.00   Max.   :6.062e-05   Max.   :0.99170
> head(s)
week Caco2A2BPapp FractionHumanAbsorption
1 1 6.061924e-05 0.11568186
2 2 6.061924e-05 0.11732401
3 3 6.061924e-05 0.09347046
4 4 6.061924e-05 0.12893540
5 5 5.461941e-05 0.19021858
6 6 5.370623e-05 0.14892154
> View(s)   # Displays the data in a spreadsheet-like window; View() is one of the commands with a much better implementation in RStudio than in basic R.
This loads synthetic data that’s supposed to represent a simplified view of the kind of data that might be collected over the history of a pharmaceutical ADME6 or bioavailability project. RStudio’s View() spreadsheet is shown in figure B.11.
The columns of this dataset are described in table B.2.
Figure B.11 View of rows from the bioavailability dataset

Table B.2 Bioavailability columns

week: In this project, we suppose that a research group submits a new drug candidate molecule for assay each week. To keep things simple, we use the week number (in terms of weeks since the start of the project) as the identifier for the molecule and the data row. This is an optimization project, which means each proposed molecule is made using lessons learned from all of the previous molecules. This is typical of many projects, but it means the data rows aren’t mutually exchangeable (an important assumption that we often use to justify statistical and machine learning techniques).

Caco2A2BPapp: This is the first assay run (and the “cheap” one). The Caco2 test measures how fast the candidate molecule passes through a membrane of cells derived from a specific large intestine carcinoma (cancers are often used for tests, as noncancerous human cells usually can’t be cultured indefinitely). The Caco2 test is a stand-in or analogy test. The test is thought to simulate one layer of the small intestine that it’s morphologically similar to (though it lacks a number of forms and mechanisms found in the actual small intestine). Think of Caco2 as a cheap test to evaluate a factor that correlates with bioavailability (the actual goal of the project).

FractionHumanAbsorption: This is the second assay run and is what fraction of the drug candidate is absorbed by human test subjects. Obviously, these tests would be expensive to run and subject to a lot of safety protocols. For this example, optimizing absorption is the actual end goal of the project.

6 ADME stands for absorption, distribution, metabolism, excretion; it helps determine which molecules make it into the body.
We’ve constructed this synthetic data to represent a project that’s trying to optimize human absorption by working through small variations of a candidate drug molecule. At the start of the project, they have a molecule that’s highly optimized for the stand-in criterion Caco2 (which does correlate with human absorption), and through the history of the project, actual human absorption is greatly increased by altering factors that we’re not tracking in this simplistic model. During drug optimization, it’s common to have formerly dominant stand-in criteria revert to ostensibly less desirable values as other inputs start to dominate the outcome. So for our example project, the human absorption rate is rising (as the scientists successfully optimize for it) and the Caco2 rate is falling (as it started high and we’re no longer optimizing for it, even though it is a useful feature).

One of the advantages of using synthetic data for these problem examples is that we can design the data to have a given structure, and then we know the model is correct if it picks this up and incorrect if it misses it. In particular, this dataset was designed such that Caco2 is always a positive contribution to fraction of absorption throughout the entire dataset. This data was generated using a random non-increasing sequence of plausible Caco2 measurements and then generating fictional absorption numbers, as shown next (the data frame d holds the data from the published graph we base our synthetic example on). We produce our synthetic data that’s known to improve over time in the next listing.
Listing B.26 Building data that improves over time

# Build synthetic examples.
set.seed(2535251)
s <- data.frame(week=1:100)
s$Caco2A2BPapp <- sort(sample(d$Caco2A2BPapp,100,replace=T), decreasing=T)
sigmoid <- function(x) {1/(1+exp(-x))}
s$FractionHumanAbsorption <-
   sigmoid(
      # Add in the Caco2 to absorption relation learned from the original
      # dataset; note the relation is positive. (The intercept and the Caco2
      # coefficient on this line are assumed values, chosen to be consistent
      # with the fitted coefficients shown later in this section.)
      7.5 + 0.5*log(s$Caco2A2BPapp) +
      # Add in a mean-0 term that depends on time, to simulate the effects of
      # improvements as the project moves forward, plus a mean-0 noise term.
      s$week/10 - mean(s$week/10) + rnorm(100)/3
   )
write.table(s,'synth.csv',sep=',', quote=F,row.names=F)
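If you want to see the structure described in the text, a quick plot of the synthetic series (our own sketch, not part of the original listing) shows absorption rising with week while Caco2 falls:

library(ggplot2)
# Human absorption improves over the life of the project...
ggplot(s, aes(x=week, y=FractionHumanAbsorption)) + geom_point()
# ...while the Caco2 stand-in measure drifts downward.
ggplot(s, aes(x=week, y=Caco2A2BPapp)) + geom_point() + scale_y_log10()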
The design of this data is as follows: Caco2 always has a positive effect (identical to the source data we started with), but this gets hidden by the week factor (and Caco2 is negatively correlated with week, because week is increasing and Caco2 is sorted in decreasing order). Time is not a variable we at first wish to model (it isn’t something we usefully control), but analyses that omit time suffer from omitted variable bias. For the complete details, consult our GitHub example documentation (https://github.com/WinVector/zmPDSwR/tree/master/bioavailability).
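A quick check (our own addition, run after the listing above has built s) confirms the direction of these relations:

cor(s$week, log(s$Caco2A2BPapp), method='spearman')         # negative: Caco2 falls over time
cor(s$week, s$FractionHumanAbsorption, method='spearman')   # positive: absorption rises over time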
A SPOILED ANALYSIS
In some situations, the true relationship between Caco2 and FractionHumanAbsorption is hidden because the variable week is positively correlated with FractionHumanAbsorption (as the absorption is being improved over time) and negatively correlated with Caco2 (as Caco2 is falling over time). week is a stand-in variable for all the other molecular factors driving human absorption that we’re not recording or modeling. Listing B.27 shows what happens when we try to model the relation between Caco2 and FractionHumanAbsorption without using the week variable or any other factors.
Listing B.27 A bad model (due to omitted variable bias)

print(summary(glm(data=s,
   FractionHumanAbsorption~log(Caco2A2BPapp),
   family=binomial(link='logit'))))
## Warning: non-integer #successes in a binomial glm!
##
## Call:
## glm(formula = FractionHumanAbsorption ~ log(Caco2A2BPapp),
## family = binomial(link = "logit"),
## data = s)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.609 -0.246 -0.118 0.202 0.557
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.003 2.752 -3.64 0.00028 ***
## log(Caco2A2BPapp) -0.969 0.257 -3.77 0.00016 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.7821 on 99 degrees of freedom
## Residual deviance: 9.4621 on 98 degrees of freedom
## AIC: 64.7
##
## Number of Fisher Scoring iterations: 6
For details on how to read the glm() summary, please see section 7.2. Note that the sign of the Caco2 coefficient is negative, not what’s plausible or what we expected going in. This is because the Caco2 coefficient isn’t just recording the relation of Caco2 to FractionHumanAbsorption; it’s also forced to stand in for any relations that come through omitted correlated variables.
WORKING AROUND OMITTED VARIABLE BIAS
There are a number of ways to deal with omitted variable bias, the best ways being better experimental design and more variables. Other methods include the use of fixed-effects models and hierarchical models. We’ll demonstrate one of the simplest methods: adding in possibly important omitted variables. In the following listing, we redo the analysis with week included.
Listing B.28 A better model

print(summary(glm(data=s,
   FractionHumanAbsorption~week+log(Caco2A2BPapp),
   family=binomial(link='logit'))))
## Warning: non-integer #successes in a binomial glm!
##
## Call:
## glm(formula = FractionHumanAbsorption ~ week + log(Caco2A2BPapp),
## family = binomial(link = "logit"), data = s)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3474 -0.0568 -0.0010 0.0709 0.3038
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.1413 4.6837 0.67 0.5024
## week 0.1033 0.0386 2.68 0.0074 **
## log(Caco2A2BPapp) 0.5689 0.5419 1.05 0.2938
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.7821 on 99 degrees of freedom
## Residual deviance: 1.2595 on 97 degrees of freedom
## AIC: 47.82
##
## Number of Fisher Scoring iterations: 6
We recovered decent estimates of both the Caco2 and week coefficients, but we didn’t achieve statistical significance on the effect of Caco2. Note that fixing omitted variable bias requires (even in our synthetic example) some domain knowledge to propose
important omitted variables and the ability to measure the additional variables (and try to remove their impact through the use of an offset; see help('offset')).
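As a sketch of that offset idea (our own illustration, not a listing from this appendix), one could treat the week trend as a known, fixed effect and remove it, so the remaining fit only has to explain what Caco2 adds on top; the 0.1033 multiplier below is the week coefficient estimated in listing B.28:

print(summary(glm(data=s,
   # offset() removes a known effect from the fit rather than estimating it.
   FractionHumanAbsorption ~ log(Caco2A2BPapp) + offset(0.1033*week),
   family=binomial(link='logit'))))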
At this point, you should have a more detailed intentional view of variables. There are, at the least, variables you can control (explanatory variables), important variables you can’t control (nuisance variables), and important variables you don’t know (omitted variables). Your knowledge of all of these variable types should affect your experimental design and your analysis.
appendix C More tools and ideas worth exploring
In data science, you’re betting on the data and the process, not betting on any one magic technique. We advise designing your projects to be the pursuit of quantifiable goals that have already been linked to important business needs. To concretely demonstrate this work style, we emphasize building predictive models using methods that are easily accessible from R. This is a good place to start, but shouldn’t be the end.
There’s always more to do in a data science project. At the least, you can
Recruit new partners
Research more profitable business goals
Design new experiments
Specify new variables
Collect more data
Explore new visualizations
Design new presentations
Test old assumptions
Implement new methods
Try new tools
The point being this: there’s always more to try. Minimize confusion by keeping a running journal of your actual goals and of things you haven’t yet had time to try.
And don’t let tools and techniques distract you away from your goals and data.
Always work with “your hands in the data.” That being said, we close with some useful topics for further research (please see the bibliography for publication details).