7.4 Machine Learning and Artificial Intelligence

7.4.2 Model Design and Validation

Once an algorithm has been chosen and its parameters optimized, the next step in building a predictive model is to address the complexity tradeoff introduced in Sect. 6.1 between under-fitting and over-fitting. The best regime between signal and noise is found by cross-validation, also introduced in Sect. 6.1: a training set is defined to develop the model and a testing set is defined to assess its performance.

Three options are available [165]; a minimal sketch in code follows the list:

1. Hold-out: A part of the dataset (typically between 60% and 80%) is randomly chosen to represent the training set and the remaining subset is used for testing

2. K-folds: The dataset is divided into k subsets; k-1 of them are used for training and the remaining one for testing. The process is repeated k times, so that each fold gets to be the testing fold. The final performance is the average over the k folds

3. Leave-1-out: Ultimately k-folding may be reduced to leave-one-out by taking k to be the number of data points. This takes full advantage of all information available in the entire dataset but may be computationally too expensive
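
The sketch below illustrates the three options in Python. scikit-learn is our library choice (the book names none), the dataset is synthetic, and the split sizes and value of k are illustrative assumptions:

```python
# A minimal sketch of the three validation options, assuming scikit-learn.
# The dataset is synthetic; split sizes and k are illustrative choices.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import (KFold, LeaveOneOut, cross_val_score,
                                     train_test_split)

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)
model = LinearRegression()

# 1. Hold-out: 70% of the data for training, the remaining 30% for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_r2 = model.fit(X_tr, y_tr).score(X_te, y_te)

# 2. K-folds: average performance over k = 5 train/test rotations
kfold_r2 = cross_val_score(model, X, y, cv=KFold(n_splits=5)).mean()

# 3. Leave-1-out: k equals the number of data points (costly but exhaustive);
#    R-squared is undefined on a single test point, so score by squared error
loo_mse = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                           scoring="neg_mean_squared_error").mean()

print(holdout_r2, kfold_r2, loo_mse)
```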

Model performance in machine learning corresponds to how well the hypothesis h in Eq. 7.7 predicts the response variable(s) for a given set of features. This is quantified by the error measure. For classification models, this measure is the rate of success and failure (e.g. confusion matrix, ROC curve [206]). For regression models, this measure is the loss function introduced in Sect. 6.1 between predicted and observed responses, e.g. the Euclidean distance (Eq. 6.6). A sketch of these measures in code follows the list of options below. To change the performance of a model, three options are available [165]:

Option 1: Add or remove some features by variance threshold or recursive feature selection

Option 2: Change the hypothesis function by introducing regularization, non-linear terms, or cross-terms between features

Option 3: Transform some features e.g. by PCA or clustering
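
Before discussing these options, here is the promised sketch of the error measures. scikit-learn is again our assumed library, and the labels, probabilities, and responses below are made-up placeholders:

```python
# Error measures for the two model families, assuming scikit-learn.
# All values below are made-up placeholders.
import numpy as np
from sklearn.metrics import confusion_matrix, mean_squared_error, roc_auc_score

# Classification: rates of success and failure
y_true = np.array([0, 1, 1, 0, 1])            # observed classes
y_pred = np.array([0, 1, 0, 0, 1])            # predicted classes
y_prob = np.array([0.2, 0.9, 0.4, 0.3, 0.8])  # predicted probabilities
print(confusion_matrix(y_true, y_pred))       # counts of hits and misses
print(roc_auc_score(y_true, y_prob))          # area under the ROC curve

# Regression: loss between predicted and observed responses
obs = np.array([1.0, 2.0, 3.0])
pred = np.array([1.1, 1.9, 3.4])
print(mean_squared_error(obs, pred))          # mean squared (Euclidean) distance
```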

These options are discussed below, except for the addition of non-linear terms: this option requires deep human expertise and is not recommended given that algorithms exist which handle non-linear functions automatically (e.g. SVM, and in particular deep learning for non-linear modeling).

Feature Selection

Predictive models provide an understanding of which variables influence the response variable(s) by measuring the strength of the relationship between features and response(s). With this knowledge, it becomes possible to add/remove features one at a time and see whether predictions performed by the model get more accurate and/or more efficient. Adding features one at a time is called forward wrapping, removing features one at a time is called backward wrapping, and both together are called ablative analysis [165]. For example, stepwise linear regression evaluates the impact of adding a feature (or removing one in backward wrapping mode) based on the p-value threshold 0.05 for a χ-squared test of the following hypothesis:

“Does it affect the value of the error measure?”, where H1 = yes and H0 = no. All these tests are done automatically at every step of the stepwise regression algorithm. The algorithm may also add/remove cross-terms in the exact same way. Ultimately, stepwise regression indicates which subsets of features contain redundant information and which features exhibit partial correlations. It selects features appropriately …and automatically!
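
A minimal sketch of forward and backward wrapping follows. One assumption to flag: scikit-learn's SequentialFeatureSelector scores candidate features by cross-validation rather than by the p-value test described above, so it is a stand-in for, not a replica of, stepwise regression. The dataset and the number of features to keep are illustrative:

```python
# Forward and backward wrapping, assuming scikit-learn. The selector scores
# candidates by cross-validation (a stand-in for p-value-based stepwise
# regression); dataset and number of kept features are illustrative.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5, random_state=0)

forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="forward").fit(X, y)
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="backward").fit(X, y)

print("forward wrapping kept:", forward.get_support(indices=True))
print("backward wrapping kept:", backward.get_support(indices=True))
```

Note that the forward and backward runs may keep different subsets, which foreshadows the sensitivity to starting conditions discussed next.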

Wrappers are perfect in theory, but in practice they are challenged by butterfly effects when searching for the optimal weights of the features. That is, it is impossible to exhaustively assess all combinations of features. Where the heuristic “starts somewhere” impacts the subsequent decisions made during the stepwise search: certain features that would be selected in one search may be rejected in another where the algorithm starts somewhere else, and vice versa.

Thus, for very large datasets, a second class of feature selection algorithms may be used, referred to as filtering. Filters are less accurate than wrappers but more computationally efficient, and thus might lead to a better result when working with large datasets that prevent wrappers from evaluating all possible combinations. Filters are based on computing the matrix of correlations (Eq. 6.2) or associations (Eq. 6.4 or Eq. 6.5) between features, which is indeed faster than a wrapping step where the entire model (Eq. 7.7) is used to make an actual prediction and evaluate the change in the error measure. A larger number of combinations can thus be tested. The main drawback with filters is that the presence of partial correlations may mislead results. Thus, when computationally feasible, a direct wrapping is preferable to filtering [165].

As recommended in Sect. 6.3, a smart tactic may be to use a filter at the onset of the project to detect and eliminate variables that are exceedingly redundant (too high ρ) or noisy (too low ρ), and then move on to a more rigorous wrapper. Note another straightforward tactic here: when working with a regression model, the strength of the relationship between features relative to one another can be directly assessed by comparing the magnitude of their respective weights (provided the features have been standardized to a common scale). This offers a solution for the consultant to expedite the feature selection process.
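
A minimal sketch of this filter-then-wrapper tactic follows. The filter below is a deliberately simple correlation screen, and its thresholds and data are illustrative assumptions, not values from the text:

```python
# A simple correlation filter: drop features too weakly correlated with the
# response (noisy) or too strongly correlated with a kept feature (redundant).
# Thresholds lo and hi are illustrative assumptions, not values from the text.
import numpy as np

def correlation_filter(X, y, lo=0.1, hi=0.95):
    kept = []
    for j in range(X.shape[1]):
        rho_y = abs(np.corrcoef(X[:, j], y)[0, 1])
        if rho_y < lo:          # too noisy: weak link to the response
            continue
        if any(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) > hi for k in kept):
            continue            # too redundant: near-duplicate of a kept one
        kept.append(j)
    return kept

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
X[:, 5] = X[:, 0] + 0.01 * rng.normal(size=100)  # near-duplicate of feature 0
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=100)
print("kept by the filter:", correlation_filter(X, y))  # hand these to a wrapper
```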

Finally, feature transformation and regularization are two other options that may be leveraged to improve model performance. Feature transformation builds upon the singular value decomposition (e.g. PCA) and harmonic analysis (e.g. FFT) frameworks described in Sect. 7.1. The goal is to project the space of features into a new space where variables may be ordered by decreasing level of importance (please go back to Sect. 7.1 for details), and from there a set of variables with high influence on the model’s predictions may be selected.
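
A minimal sketch of feature transformation with PCA, again assuming scikit-learn; the data are synthetic and the number of retained components is an illustrative choice:

```python
# Feature transformation by PCA, assuming scikit-learn: project the features
# into a new space ordered by decreasing explained variance, keep the leaders.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))           # placeholder feature matrix

pca = PCA(n_components=3).fit(X)        # keep the 3 most important directions
X_new = pca.transform(X)                # features in the new, ordered space
print(pca.explained_variance_ratio_)    # importance of each kept component
```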

Regularization consists in constraining the magnitude of the model parameters (e.g. forcing weights to not exceed a threshold, forcing some features to drop out, etc.) by introducing additional terms in the loss function used when training the model, or in imposing prior knowledge on the probability distribution of some features by introducing Bayes’ rule in the model.
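
A minimal sketch of regularization via penalized loss functions (scikit-learn assumed; the alpha values are illustrative, not tuned). Ridge adds an L2 penalty that shrinks weight magnitudes and corresponds to a Gaussian prior on the weights in the Bayesian view; lasso adds an L1 penalty that makes some features drop out:

```python
# Regularization by adding penalty terms to the training loss, assuming
# scikit-learn. The alpha values are illustrative, not tuned.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                       noise=5, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks all weights
lasso = Lasso(alpha=5.0).fit(X, y)  # L1 penalty: zeroes out weak features

print("ridge weights:", ridge.coef_.round(2))
print("lasso weights:", lasso.coef_.round(2))
```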


The big picture: agile and emergent design

The sections above, including the ones on signal processing and computer simulations, described a number of options for developing and refining a predictive model in the context of machine learning. If the data scientist, or consultant of the twenty-first century, were to wait for a model to be theoretically optimally designed before applying it, he could spend his entire lifetime working on this achievement!

Some academics do. And this is not just an anecdote: anyone may well spend several weeks reading through an analytics software package’s documentation before even starting to test his or her model. So here is something to remember: unexpected surprises may always happen, for anyone and any model, when that model is finally used on real-world applications.

For this reason, data scientists recommend an alternative approach to extensive model design: emergent design [207]. Emergent design does include data preparation phases such as exploration, cleaning, and filtering, but switches early to building a working model and applying it to real-world data. It cares less about understanding factors that might play a role during model design and more about the insights gathered from assessing real-world performance and pitfalls.

Real-world feedback brings unique value in orienting efforts toward, for example, choosing the algorithm in the first place. Try one that looks reasonable, and see what the outputs look like: not to make predictions, but to make decisions about refining and improving performance (Fig. 7.3).

Fig. 7.3 Workflow of agile, emergent model design when developing supervised machine learning models


In other words, emergent design recommends applying a viable model as soon as possible rather than spending time defining the range of theoretically possible options. Build a model quickly, apply it to learn from real-world data, get back to model design, re-apply to real-world data, learn again, re-design, and so forth. This process should generate feedback quickly, with as little risk and cost as possible for the client, and in turn enable the consultant to come up with a satisfactory model in the shortest amount of time. The 80/20 rule always prevails.
