Basic Probabilistic Tools and Concepts

Part of the document *Data Driven: An Introduction to Management Consulting in the 21st Century* (pages 96–99)

Statistical inference [166] and probability theory [167] are the realm of forecasts and predictions. When a regression line is constructed and used to make predictions on the dependent variable (Eq. 6.7, or in the general case Eq. 6.9), the purpose shifts from simple description to probabilistic inference. A key underlying assumption in probability theory is that the dataset studied can be seen as a sample drawn from a larger dataset (in time and/or space), and can thus provide information on the larger dataset (in time and/or space).

The Concept of p-Value — or How to Test Hypotheses

One of the most common tools in data analysis is statistical hypothesis testing [166], a ubiquitous approach to deciding whether a given outcome is significant and, if so, with what level of confidence. The two associated concepts are the p-value and the confidence interval.

To a first approximation, a statistical test starts with a hypothesis (e.g. Juliet loves Romeo), defines a relevant alternative hypothesis called the null hypothesis (e.g. Juliet loves Paris …bad news for Romeo), and adopts a conservative approach to decide which is more likely to be true. It starts by assuming that the null hypothesis is true and that, under this assumption, a probability distribution exists for every variable considered (time spent between Romeo and Juliet, time spent between Paris and Juliet, etc.), and this probability is not zero. For example, if one variable is the time spent between Romeo and Juliet, it might be reasonable to assume that this quantity follows a normal distribution with a mean of 2 h per week and a standard deviation of 30 min, even after Paris proposed to Juliet. After all, Romeo and Juliet met at the Capulets' ball; they know each other, and there is no reason to assume that they would never meet again. The test then reckons how likely a set of observations would be if the null hypothesis were true. In our example, if over 3 weeks Romeo and Juliet spent 3 h together per week, how likely is it that Juliet loves Paris?

For a normal distribution, about 68% of the sample lies within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations. Thus in our example, the 95% confidence interval under the assumption that Juliet does love Paris (the null hypothesis) is between 1 h and 3 h. The probability of an observation falling outside this interval is about 0.05 in any given week, so the probability of observing 3 h per week for three consecutive weeks is roughly 0.05 × 0.05 × 0.05 = 1.25 × 10⁻⁴. Thus there is < 0.1% chance of wrongly rejecting the null hypothesis. From there, the test concludes that Juliet could love Romeo, because the null hypothesis has been rejected at the 0.001 level.
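The arithmetic above can be reproduced in a few lines. The sketch below uses only Python's standard library and assumes, as in the example, a two-sided rejection region at ±2 standard deviations:

```python
from statistics import NormalDist

# Null hypothesis: Juliet loves Paris, so weekly time with Romeo
# is assumed to follow N(mu = 2 h, sigma = 0.5 h).
null_dist = NormalDist(mu=2.0, sigma=0.5)

# Probability of a week at least as extreme as 3 h (two-sided, |z| >= 2)
p_one_week = 2 * (1 - null_dist.cdf(3.0))   # ~ 0.046, the ~0.05 in the text
p_three_weeks = p_one_week ** 3             # three independent weeks
```

With exact normal tails the weekly probability is about 0.046 rather than 0.05, so the three-week figure comes out near 9.4 × 10⁻⁴⁄₁₀, i.e. slightly below the 1.25 × 10⁻⁴ quoted above.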

Reasonable threshold levels [168] for accepting/rejecting hypotheses are 0.1 (10%), 0.05 (5%), and 0.01 (1%). Statistical software computes p-values using a less comprehensive approach than the one above, but one that is more efficient when dealing with large datasets. The theoretical concept is identical, so if you grasped it, whichever approach your computer uses will not matter much. Software relies on tabulated ratios that have been developed to fit different sampling distributions. For example, if only one variable is considered and a normal distribution with known standard deviation is given (as in the example above), a z-test is used, which relates the expected theoretical deviations (the standard error⁸) to the observed deviations, rather than computing a probability for every observation as done above. If the standard deviation is not known, a t-test is adequate. Finally, when dealing with multivariate probability distributions containing both numerical and categorical (non-quantitative, non-ordered) variables, the generalized χ-squared test is the right choice. In data analytics packages, χ-squared is thus often set as the default algorithm for computing p-values.
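The z-test logic just described can be sketched as follows (standard library only; the three observed weeks are the hypothetical Romeo-and-Juliet numbers from above):

```python
from statistics import NormalDist, mean

def z_test(sample, mu0, sigma):
    """Two-sided z-test for a mean when the population sigma is known."""
    se = sigma / len(sample) ** 0.5          # standard error of the mean
    z = (mean(sample) - mu0) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
    return z, p

# Three weeks of 3 h against the null N(2 h, 0.5 h)
z, p = z_test([3.0, 3.0, 3.0], mu0=2.0, sigma=0.5)
```

Here z ≈ 3.46 and p ≈ 5 × 10⁻⁴: the test aggregates the three weeks through the standard error instead of multiplying weekly probabilities, which is why the number differs slightly from the 1.25 × 10⁻⁴ above.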

A critical point to remember about the p-value is that it does not prove a hypothesis [169]: it only indicates whether the observed data are likely or unlikely under the null hypothesis H0, given the assumptions made about probability distributions. That a hypothesis H is more likely than H0 does not prove that H is true. More generally, a p-value is only as good as the hypothesis tested [168, 169]. Erroneous conclusions may be reached even though the p-values are excellent, because of ill-posed hypotheses, inadequate statistics (i.e. assumed distribution functions), or sample bias.

Another critical point to remember about the p-value is its dependence on sample size [166]. In the example above, the p-value was conclusive because someone observed Romeo for 3 weeks. But for any single week, the p-value associated with the null hypothesis was 0.05, which would not be enough to reject the null hypothesis with confidence. When a real effect is present, a larger sample size yields a lower p-value.

Statistical hypothesis testing, i.e. inference, should not be mistaken for the related concepts of decision trees (see Table 7.1) and game theory. The latter are also used to decide between events, but they are less granular methods, as they themselves rely on hypothesis testing to assess the significance of their results. In fact, for every type of predictive modeling (not only decision trees and game theory), p-values and confidence intervals are automatically generated by statistical software. For details and illustration, consult the application example of Sect. 6.3.

On Confidence Intervals — or How to Look Credible

Confidence intervals [91] are obtained by taking the mean plus or minus (upper/lower bound) some multiple of the standard deviation. For example, in a normally distributed sample, 95% of points lie within 1.96 standard deviations of the mean, which defines an interval with 95% confidence, as done in Eq. 7.17.
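A minimal sketch of this recipe for the mean of a sample (standard library only; it uses the standard error, i.e. the standard deviation divided by √n, since the interval here is for the mean rather than for a single observation):

```python
from statistics import NormalDist, mean, stdev

def ci95(sample):
    """Approximate 95% confidence interval for the mean of a sample."""
    m = mean(sample)
    se = stdev(sample) / len(sample) ** 0.5   # standard error of the mean
    z = NormalDist().inv_cdf(0.975)           # ~ 1.96
    return m - z * se, m + z * se

low, high = ci95([1.5, 2.0, 2.5, 3.0, 3.5])
```

For a more careful interval on such a small sample, a t-multiplier would replace the 1.96, as noted in the test discussion above.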

They provide a different type of information than the p-value. Suppose that you have developed a great model to predict whether an employee is a high risk-taker or is, in contrast, conservative in his decision making (= the response label of the model). Your model contains a dozen features, each with its own assigned weight, all of which were selected with p-value < 0.01 during the model design phase. Excellent.

But your client informs you that it does not want to keep track of a dozen features on its employees; it just wants about 2–3 features to focus on when meeting prospective employees and quickly evaluating (say, without a computer) their risk assertiveness. The confidence interval can help with this feature selection, because it provides information on the range of magnitude of the weight assigned to each feature. Indeed, if a confidence interval nears or includes the value 0, then the excellent p-value essentially says that even though the feature is a predictor of the response variable, it is insignificant compared to some of the other features. The further away a weight is from 0, the more useful its associated feature is for predicting the response label.

⁸ As mentioned in Sect. 6.1, the standard error is the standard deviation of the means of different sub-samples drawn from the original sample or population.

To summarize, the p-value does not provide any information on how much each feature contributes; it just confirms the hypothesis that these features have a positive impact on the desired prediction, or at least no detectable negative impact. In contrast, or rather as a complement, confidence intervals enable the assessment of the magnitude with which each feature contributes relative to one another.
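This selection rule can be sketched as follows; the feature names, weights, and standard errors are made-up illustrative numbers, not output from any real model:

```python
from statistics import NormalDist

def select_features(fitted, alpha=0.05, top_k=3):
    """fitted maps feature name -> (weight, standard_error).
    Keep features whose CI excludes 0, ranked by |weight|."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # ~ 1.96 for alpha = 0.05
    kept = [(name, w) for name, (w, se) in fitted.items()
            if abs(w) > z * se]               # CI does not contain 0
    kept.sort(key=lambda item: abs(item[1]), reverse=True)
    return [name for name, _ in kept[:top_k]]

# Hypothetical weights from the risk-assertiveness model
model = {
    "age":         (0.02, 0.05),   # CI includes 0 -> dropped
    "tenure":      (0.80, 0.10),
    "past_trades": (1.50, 0.20),
    "salary":      (0.30, 0.12),
}
picks = select_features(model)   # ['past_trades', 'tenure', 'salary']
```

The comparison |weight| > 1.96 × SE is just the statement that the 95% interval weight ± 1.96 × SE does not straddle zero.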

Central Limit Theorem — or Alice in Statistics-land

For a primer in data science, it is worth mentioning a widely applied theorem that is known as the foundation of applied statistics. The Central Limit Theorem [170] states that for almost every variable and almost every type of distribution, the distribution of the sample mean will be approximately normal for large sample sizes. This theorem is by far the most popular theorem in statistics (e.g. the theory behind χ-squared tests and election polls), because many problems that are intractable for lack of knowledge of the underlying distribution of a variable can be partially solved by asking alternative questions about the mean of that variable [91]. Indeed, the theorem says that the probability distribution of the mean of any variable is always perfectly known: it is a normal bell curve. Or almost always…
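The theorem is easy to watch in action. The sketch below averages draws from a deliberately non-normal (exponential) distribution and checks that the means behave like a normal bell curve:

```python
import random
from statistics import mean

random.seed(0)

# Exponential(1) is strongly skewed: mu = 1, sigma = 1.
# By the CLT, means of n draws are ~ N(1, 1/sqrt(n)) for large n.
n, trials = 50, 2000
means = [mean(random.expovariate(1.0) for _ in range(n)) for _ in range(trials)]

# For a normal distribution, ~95% of values fall within 2 sigma of the mean.
se = 1 / n ** 0.5
share_within_2se = sum(abs(m - 1.0) < 2 * se for m in means) / trials
```

Even though individual draws are far from normal, the share of means within 2 standard errors comes out in the mid-90% range, as the bell curve predicts.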

Bayesian Inference

Another key concept in statistics is that of conditional probability and Bayesian modeling [167]. The intuitive notion of the probability of an event is a simple marginal probability function [155]. But as with the concepts of marginal vs. conditional correlations described in Sect. 6.1, the behavior of a variable x might be influenced by its association with another variable y, and because y has its own probability function, it does not have a static but a probabilistic influence on x. The actual probability of x knowing y is called the conditional probability of x given y:

p(x|y) = p(y|x) p(x) / p(y)  (6.10)

where the difference between p(x) and p(x|y) is null only when the two variables are completely independent (which in particular implies that their correlation ρ(x,y) is null).

Without going into further details, it is useful to know that most predictive modeling software allows the user to introduce dependencies and prior knowledge (the prior probability of event x) when developing a model. In doing so, you will be able to compute probabilities (the posterior probability of event x) taking into account the effect of what you already know (for example, if you know that a customer buys the Wall Street Journal every day, the probability that this customer also buys The Economist is not the world average; it is close to 1) and mutual dependencies (correlations) between variables.
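Eq. 6.10 in action, with made-up newsstand probabilities purely for illustration:

```python
def posterior(prior, likelihood, evidence):
    """Bayes' rule: p(x|y) = p(y|x) * p(x) / p(y)  (Eq. 6.10)."""
    return likelihood * prior / evidence

# Hypothetical numbers: p(Economist) = 0.05, p(WSJ) = 0.10,
# and p(WSJ | Economist) = 0.90 (most Economist readers also buy the WSJ).
p_economist_given_wsj = posterior(0.05, 0.90, 0.10)   # -> 0.45
```

Knowing that the customer buys the WSJ lifts the Economist probability from the 5% "world average" prior to a 45% posterior.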
