This second example presents a data science project that was also carried out within a 2-week time frame, for a consulting engagement at a top-tier management consulting firm. It applies machine learning to identify customers who will churn, and aims at extracting both quantitative and qualitative recommendations from the data so the client can make proper strategic and tactical decisions to reduce churn in the future. This example was chosen because it delves into many of the typical subtleties of data science: lack of clear marginal correlation for any of the chosen features, a highly imbalanced dataset (90% of customers in the dataset do not churn), probabilistic prediction, adjustment of predictions to minimize false positives at the expense of false negatives, etc.
The Challenge
In this project, the client is a company providing gas and electricity that has recently seen an increase in customer defection (a.k.a. churn) to competitors. The dataset contains hundreds of customers with different attributes measured over the past couple of months; some of these customers have churned and some have not. The client also provided a list of specific customers for whom we are to predict whether each is forecasted to churn, and with which probability.
The Questions
Can we build a predictive model of customer churn?
What are the most explicative variables for churn?
What are potential strategic or tactical levers to decrease churn?
The Executive Summary
Using a 4-step protocol (1. Exploration, 2. Model design, 3. Performance analysis, 4. Sensitivity analysis/interpretations), we designed a model that enables our client to identify 30% of customers who will churn while limiting fall-out (false positives) to 10%. This study supports short-term tactics based on discounts and longer-term contracts, and a long-term strategy based on building synergy between services and sales channels.

Fig. 7.7 Comparison of the confusion matrices for different machine learning classification algorithms using 20 features (left) and 3 features (right)

7 Principles of Data Science: Advanced
Exploration of the Dataset
The exploration of the dataset included feature engineering (deriving 'dynamic' attributes such as weekly/monthly rates of different metrics), scatter plots, covariance matrices, marginal correlations, and Hamming/Jaccard distances, which are dissimilarity measures designed specifically for binary outcomes (see Table 7.5). Some key issues to be solved were the presence of many empty entries, outliers, and collinear and low-variance features. The empty entries were replaced by the median of each feature (except for features with more than 40% of entries missing, in which case the entire feature was deleted). Customers with outlier values beyond six standard deviations from the mean were also deleted.
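The cleaning steps above can be sketched in a few lines of pandas. This is a minimal illustration on a toy frame, not the client dataset; the function name `clean_features` and the thresholds passed as defaults are taken from the text but the data are invented.

```python
import numpy as np
import pandas as pd

def clean_features(df, max_missing=0.40, outlier_sd=6.0):
    """Drop overly sparse features, impute the rest with medians,
    and remove customers with extreme outlier values."""
    # Delete features with more than 40% missing entries
    keep = df.columns[df.isna().mean() <= max_missing]
    df = df[keep]
    # Replace remaining empty entries by the median of each feature
    df = df.fillna(df.median(numeric_only=True))
    # Drop customers with any value beyond six standard deviations from the mean
    num = df.select_dtypes(include=np.number)
    z = (num - num.mean()) / num.std(ddof=0)
    return df[(z.abs() <= outlier_sd).all(axis=1)]

# Toy example: one overly sparse column, one imputed entry
df = pd.DataFrame({
    "margin": [1.0, 2.0, np.nan, 2.0, 1000.0],
    "sparse": [np.nan, np.nan, np.nan, np.nan, 1.0],
})
cleaned = clean_features(df)
```

Note that on a frame this small no row can be six standard deviations out; the outlier rule only bites on realistically sized data.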
Some features, such as prices and some forecasted metrics, were collinear with ρ > 0.95, see Fig. 7.8. Only one feature of each collinear pair was kept for designing the machine learning models.
Fig. 7.8 Cross-correlation between all 48 features in this project (left) and after filtering collinear features out (right)
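A common way to implement this collinearity filter is to scan the upper triangle of the correlation matrix and drop one member of every pair exceeding the threshold. The sketch below uses synthetic columns; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_collinear(df, threshold=0.95):
    """Keep only one feature out of every collinear pair (|rho| > threshold)."""
    corr = df.corr().abs()
    # Inspect only the upper triangle so each pair is seen once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(0)
price = rng.normal(size=200)
df = pd.DataFrame({
    "price_energy": price,
    "price_power": price + rng.normal(scale=0.01, size=200),  # near-duplicate
    "margin": rng.normal(size=200),
})
filtered = drop_collinear(df)  # price_power is removed, margin is kept
```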
Table 7.5 Top correlations and binary dissimilarities between top features and churn
Top Pearson correlations with churn
Feature Original Filtered
Margins 0.06 0.1
Forecasted meter rent 0.03 0.04
Prices 0.03 0.04
Forecasted discount 0.01 0.01
Subscription to power 0.01 0.03
Forecasted consumption 0.01 0.01
Number of products −0.02 −0.02
Antiquity of customer −0.07 −0.07
Top binary dissimilarities with churn
Feature Hamming Jaccard
Sales channel 1 0.15 0.97
Sales channel 2 0.21 0.96
Sales channel 3 0.45 0.89
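The Hamming and Jaccard distances reported in Table 7.5 are both available in SciPy. A small sketch with invented binary vectors (a hypothetical sales-channel flag vs. the churn outcome) shows the difference: Hamming counts disagreements over all customers, while Jaccard counts disagreements only over customers where at least one of the two flags is set.

```python
import numpy as np
from scipy.spatial.distance import hamming, jaccard

# Hypothetical binary vectors: churn outcome and a sales-channel membership flag
churn   = np.array([1, 0, 0, 1, 0, 0, 0, 0, 1, 0], dtype=bool)
channel = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0], dtype=bool)

# Hamming: fraction of all positions where the two vectors disagree
ham = hamming(churn, channel)   # 2 disagreements / 10 customers = 0.2

# Jaccard: disagreements as a fraction of positions where either flag is set
jac = jaccard(churn, channel)   # 2 disagreements / 5 active positions = 0.4
```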
Model Design
A first feature selection was carried out before modeling using a variance filter (i.e. features with no variance in more than 95% of customers were removed), and later on during modeling by stepwise regression. Performance of the learners was assessed by both a 30% hold-out set and ten-fold cross-validation. A logistic regression, a support vector machine and a random forest algorithm were trained both separately and in ensemble, as depicted in Fig. 7.9. The latter is referred to as 'soft voting' in the literature because it takes the average probability of the different models to classify customer churn.
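A soft-voting ensemble of these three learners can be assembled with scikit-learn's `VotingClassifier`. The sketch below trains on synthetic imbalanced data standing in for the client dataset; all hyperparameters are illustrative defaults, not the project's tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: ~90% no-churn, ~10% churn
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Soft voting averages the predicted churn probabilities of the three learners;
# SVC needs probability=True to expose predict_proba
ensemble = VotingClassifier(
    estimators=[
        ("logit", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True, random_state=0)),
        ("rf", RandomForestClassifier(random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_tr, y_tr)
proba = ensemble.predict_proba(X_te)[:, 1]  # churn probability per customer
```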
Clients who churned represented only 10% of the training data. Two approaches were tested to deal with this class imbalance. Note that for both, performance was evaluated on a test set where the class imbalance was maintained, to account for real-world circumstances.
• Approach 1 (baseline): Training based on a random sample of the abundant class (i.e. clients who did not churn) sized to match the rare class (i.e. clients who did churn)
• Approach 2: Training based on an ensemble of nine models, each using one of nine random samples of the abundant class (drawn without replacement) together with the full rare class. A nine-fold ensemble was chosen because the class imbalance was 1:9; it is designed to improve generalization, see Fig. 7.10.
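Approach 2 can be sketched as follows: the abundant class is shuffled and split into nine disjoint chunks, each chunk is paired with the full rare class, and one model is trained per pairing. The data and the choice of logistic regression as the base learner here are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced stand-in: ~90% no-churn, ~10% churn
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

rng = np.random.default_rng(0)
rare = np.where(y == 1)[0]
abundant = rng.permutation(np.where(y == 0)[0])

# Nine disjoint samples of the abundant class (without replacement),
# each paired with the full rare class, one model per pairing
models = []
for chunk in np.array_split(abundant, 9):
    idx = np.concatenate([chunk, rare])
    models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))

# Ensemble prediction: average the nine churn probabilities
proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```

Every no-churn client is used exactly once across the nine models, so no data are discarded despite each individual model seeing a balanced training set.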
Performance Analysis
The so-called receiver operating characteristic (ROC [206]) curve is often used to complement contingency tables in machine learning because it provides a measure of accuracy that is invariant under changes in the probability threshold used to infer the positive and negative classes (in this project, churn vs. no churn respectively). It consists in plotting the proportion of predictions accurately classified as positive (a.k.a. true positives) vs. the proportion misclassified as positive (a.k.a. false positives, or fall-out). The probabilities predicted by the different machine learning models are by default calibrated so that p > 0.5 corresponds to one class and p < 0.5 to the other. But of course this threshold can be fine-tuned to minimize misclassified instances of one class at the expense of increasing misclassifications in the other. The area under the ROC curve is an elegant measure of overall performance that remains the same for any threshold in use.

Fig. 7.9 Ensemble model with soft-voting probabilistic prediction

Fig. 7.10 Distribution of churn and no-churn customers, and the nine-fold ensemble model designed to use all data with balanced samples and improve generalization

From Table 7.6, we observe that random forest has the best performance, and that the nine-fold ensemble does generalize a little better, with a ROC AUC score of 0.68. This is our 'best performer' model. It predicts 69% of the customers who will churn, although with a significant fall-out of 42%, when using a probability threshold of 0.5, see Fig. 7.11. Looking at the ROC curve, we see that the same model can still predict 30% of these customers, a significant proportion for our client, albeit this time with a highly minimized fall-out of just 10%. Through grid search, I found this threshold to be p = 0.56. This is the model that was recommended to the stakeholders (Table 7.6).
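This kind of threshold tuning can be sketched with scikit-learn's `roc_curve`, which exposes the fall-out/recall trade-off directly. The example below uses synthetic data and a plain random forest, so the numbers will not match the project's; the 10% fall-out budget is the only value taken from the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for the churn dataset
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

auc = roc_auc_score(y_te, proba)          # threshold-free overall performance
fpr, tpr, thresholds = roc_curve(y_te, proba)

# Pick the threshold with the highest true-positive rate subject to
# a fall-out (false-positive rate) of at most 10%
ok = fpr <= 0.10
best = np.argmax(tpr[ok])
threshold = thresholds[ok][best]
```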
Fig. 7.11 Performance of the overall best learner (nine-fold ensemble of random forest learners) and optimization of precision/fall-out
Sensitivity Analysis and Interpretations
A recursive feature selection carried out by stepwise logistic regression led to the identification of 12 key predictors, see Fig. 7.12. Variables that tend to induce churn were high margins, forecasted consumption and meter rent. Variables that tend to reduce churn were forecasted discount, number of active products, subscription to power, antiquity (i.e. loyalty) of the customer, and also three of the sales channels.

Fig. 7.12 Explicative variables of churn identified by stepwise logistic regression (close-up) and ranked by their relative contribution to predicting churn

Table 7.6 Performance of the learning classifiers, with random sampling of the abundant class or nine-fold ensemble of learners, based on accuracy, Brier and ROC measures
Performance measures
Algorithm Accuracy Brier ROC
Logit 56% 0.24 0.64
Stepwise 56% 0.24 0.64
SVM 57% 0.24 0.63
RF 65% 0.24 0.67
Ensemble 61% 0.23 0.66
9-logit ensemble 55% 0.26 0.64
9-SVM ensemble 61% 0.25 0.63
9-RF ensemble 70% 0.24 0.68
9-ensemble ensemble 61% 0.25 0.65
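The stepwise selection described above can be approximated in scikit-learn with sequential forward selection, a close cousin of classical stepwise regression (this is an assumption about tooling, not the author's exact procedure). The data below are synthetic; only the count of 12 retained predictors comes from the text.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in with 20 candidate features
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Forward selection: greedily add the feature that most improves CV score
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=12,
    direction="forward",
    cv=5,
)
selector.fit(X, y)
selected = selector.get_support(indices=True)  # indices of retained predictors

# Signed coefficients then indicate whether each predictor induces
# churn (positive) or reduces it (negative)
model = LogisticRegression(max_iter=1000).fit(X[:, selected], y)
```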
Providing a discount to clients with a high propensity to churn thus seems a good tactical measure, and other strategic levers could also be pulled: synergies, channels and long-term contracts.
Finally, a sensitivity analysis was carried out by applying an incremental discount to the clients identified by the learners, and then re-running the models to evaluate how many clients were still predicted to churn at that discount level. If we consider the model with minimized fall-out (because the client had expressed interest in minimizing fall-out), our analysis predicts that a 20% discount will reduce churn significantly (i.e. by 25%) with minimal fall-out (10%). Given that the true positive rate at this threshold is about 30%, we can safely forecast that the discount approach will eliminate at least 8% of churn. See Fig. 7.13 for details. This is the first tactical step that was recommended to the stakeholders.
Fig. 7.13 Sensitivity analysis of discount strategy: proportion of customer churn predicted by logit model after adding a discount for clients predicted to churn
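The sensitivity loop can be sketched as follows: flag the clients predicted to churn, add an increasing discount to their discount feature, and re-score them with the same model. Everything here is hypothetical: synthetic data, feature 0 standing in for "forecasted discount", and a 0.5 threshold rather than the project's tuned one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: feature 0 plays the role of "forecasted discount"
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

threshold = 0.5
flagged = clf.predict_proba(X)[:, 1] > threshold  # clients predicted to churn

churn_after = []
for discount in [0.0, 0.1, 0.2, 0.3]:
    X_new = X.copy()
    # Apply the incremental discount only to flagged clients, then re-score
    X_new[flagged, 0] += discount
    still = (clf.predict_proba(X_new)[:, 1] > threshold)[flagged].mean()
    churn_after.append(float(still))
```

With a zero discount the re-scored predictions are unchanged, so the first entry is 1.0 by construction; the later entries show how the predicted churn among flagged clients responds to each discount level.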
© Springer International Publishing AG, part of Springer Nature 2018
J. D. Curuksu, Data Driven, Management for Professionals, https://doi.org/10.1007/978-3-319-70229-2_8
8 Principles of Strategy: Primer
A competitive strategy augments a corporate organization with inherent capabilities to sustain superior performance on a long-term basis [74]. Many strategy concepts exist and will be described in this chapter, with a special focus on practical matters such as key challenges and "programs" that can be used as roadmaps for implementation. Popular strategy frameworks include the five forces, the value chain, the product life cycle, disruptive innovation and blue ocean, to name a few. The reader is invited to consider these models as a simple aid to thinking about reality, since neither these theoretical concepts nor their authors ever claim to describe reality for any one particular circumstance. They claim to facilitate discussion and creativity over a wide range of concrete business issues. Armed with such tools, the consultant may examine whether his/her client does indeed enjoy a competitive advantage, and develop a winning strategy.