32 Data Mining Model Comparison

Paolo Giudici
University of Pavia

Summary. The aim of this contribution is to illustrate the role of statistical models and, more generally, of statistics, in choosing a Data Mining model. After a preliminary introduction on the distinction between Data Mining and statistics, we focus on the issue of how to choose a Data Mining methodology. This issue well illustrates how statistical thinking can bring real added value to a Data Mining analysis, as otherwise it becomes rather difficult to make a reasoned choice. In the third part of the paper we present, by means of a case study in credit risk management, how Data Mining and statistics can profitably interact.

Key words: Model choice, statistical hypotheses testing, cross-validation, loss functions, credit risk management, logistic regression models.

32.1 Data Mining and Statistics

Statistics has always been involved with creating methods to analyse data. The main difference compared to the methods developed in Data Mining is that statistical methods are usually developed in relation to the data being analyzed but also according to a conceptual reference paradigm. Although this has made the various statistical methods coherent and rigorous at the same time, it has also limited their ability to adapt quickly to the methodological requirements put forward by developments in the field of information technology.

There are at least four aspects that distinguish the statistical analysis of data from Data Mining.

First, while statistical analysis traditionally concerns itself with analyzing primary data that has been collected to check specific research hypotheses, Data Mining can also concern itself with secondary data collected for other reasons. This is the norm, for example, when analyzing company data that comes from a data warehouse. Furthermore, while in the statistical field the data can be of an experimental nature (the data could be the result of an experiment which randomly allocates all the statistical units to different kinds of treatment), in Data Mining the data is typically of an observational nature.

Second, Data Mining is concerned with analyzing great masses of data. This implies new considerations for statistical analysis. For example, for many applications it is impossible to analyse, or even access, the whole database for reasons of computational efficiency. Therefore it becomes necessary to draw a sample of the data from the database being examined. This sampling must be carried out bearing in mind the Data Mining aims and, therefore, it cannot be performed with the traditional tools of statistical sampling theory alone.

Third, many databases do not lead to the classic forms of statistical data organization. This is true, for example, of data that comes from the Internet. This creates the need for appropriate analytical methods to be developed, which are not available in the statistics field.

One last, but very important, difference that we have already mentioned is that Data Mining results must be of some consequence. This means that constant attention must be given to the business results achieved with the data analysis models.

32.2 Data Mining Model Comparison

Several classes of computational and statistical methods for data mining are available.
Once a class of models has been established, the problem is to choose the "best" model from it. In this chapter, summarized from Chapter 6 of (Giudici, 2003), we present a systematic comparison of Data Mining models. Comparison criteria for Data Mining models can be classified schematically into: criteria based on statistical tests, criteria based on scoring functions, Bayesian criteria, computational criteria, and business criteria.

The first class is based on the theory of statistical hypothesis testing and, therefore, there is a lot of detailed literature related to this topic; see, for example, a text on statistical inference such as (Mood et al., 1991) or (Bickel and Doksum, 1977). A statistical model can be specified by a discrete probability function or by a probability density function, f(x). Such a model is usually left unspecified, up to unknown quantities that have to be estimated on the basis of the data at hand. Typically, the observed sample is not sufficient to reconstruct each detail of f(x), but it can indeed be used to approximate f(x) with a certain accuracy. Often a density function is parametric, so that it is defined by a vector of parameters Θ = (θ_1, ..., θ_I), such that each value θ of Θ corresponds to a particular density function, p_θ(x). In order to measure the accuracy of a parametric model, one can resort to the notion of distance between a model f, which underlies the data, and an approximating model g (see, for instance, (Zucchini, 2000)).

Notable examples of distance functions are, for categorical variables: the entropic distance, which describes the proportional reduction of the heterogeneity of the dependent variable; the chi-squared distance, based on the distance from the case of independence; and the 0-1 distance, which leads to misclassification rates. The entropic distance of a distribution g from a target distribution f is:

    d_E(f, g) = \sum_i f_i \log \frac{f_i}{g_i}                    (32.1)

The chi-squared distance of a distribution g from a target distribution f is instead:

    d_{\chi^2}(f, g) = \sum_i \frac{(f_i - g_i)^2}{g_i}            (32.2)

The 0-1 distance between a vector of predicted values, X_g, and a vector of observed values, X_f, is:

    d_{0-1}(X_f, X_g) = \sum_{r=1}^{n} 1(X_{fr} \neq X_{gr})       (32.3)

where 1(w ≠ z) = 1 if w ≠ z and 0 otherwise, so that the distance counts the number of misclassified observations.

For quantitative variables, the typical choice is the Euclidean distance, representing the distance between two vectors in the Cartesian plane. Another possible choice is the uniform distance, applied when nonparametric models are being used. The Euclidean distance between a vector X_g and a target vector X_f is expressed by the equation:

    d_2(X_f, X_g) = \sqrt{\sum_{r=1}^{n} (X_{fr} - X_{gr})^2}      (32.4)

Given two distribution functions F and G with values in [0, 1], the uniform distance is defined as the quantity:

    \sup_{0 \le t \le 1} | F(t) - G(t) |                           (32.5)
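To make these distance functions concrete, the following minimal sketch implements them in Python using only NumPy. The function names and the small numerical example are purely illustrative and are not part of the original text.

import numpy as np

def entropic_distance(f, g):
    # Entropic distance (32.1) between a target distribution f and an
    # approximating distribution g, both given as probability vectors.
    f, g = np.asarray(f, float), np.asarray(g, float)
    mask = f > 0                      # terms with f_i = 0 contribute nothing
    return float(np.sum(f[mask] * np.log(f[mask] / g[mask])))

def chi_squared_distance(f, g):
    # Chi-squared distance (32.2) between distributions f and g.
    f, g = np.asarray(f, float), np.asarray(g, float)
    return float(np.sum((f - g) ** 2 / g))

def zero_one_distance(x_f, x_g):
    # 0-1 distance (32.3): number of positions where the observed and
    # predicted categorical vectors disagree.
    return int(np.sum(np.asarray(x_f) != np.asarray(x_g)))

def euclidean_distance(x_f, x_g):
    # Euclidean distance (32.4) between two quantitative vectors.
    diff = np.asarray(x_f, float) - np.asarray(x_g, float)
    return float(np.sqrt(np.sum(diff ** 2)))

# Illustrative example: a target distribution and an approximation of it.
f = [0.95, 0.05]                      # e.g. observed proportions of two classes
g = [0.90, 0.10]                      # proportions implied by a model
print(entropic_distance(f, g), chi_squared_distance(f, g))
print(zero_one_distance([1, 0, 1, 0], [1, 1, 1, 0]),
      euclidean_distance([1.0, 2.0], [1.5, 1.0]))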
Any of the previous distances can be employed to define the notion of discrepancy of a statistical model. The discrepancy of a model, g, can be obtained as the discrepancy between the unknown probabilistic model, f, and the best (closest) parametric statistical model. Since f is unknown, closeness can be measured with respect to a sample estimate of the unknown density f. Assume that f represents the unknown density of the population, and let g = p_θ be a family of density functions (indexed by a vector of I parameters, θ) that approximates it. Using, to exemplify, the Euclidean distance, the discrepancy of a model g with respect to a target model f is:

    \Delta(f, p_\theta) = \sum_{i=1}^{n} \left( f(x_i) - p_\theta(x_i) \right)^2          (32.6)

A common choice of discrepancy function is the Kullback-Leibler divergence, which derives from the entropic distance and can be applied to any type of observations. In this context, the best model can be interpreted as the one with the minimal loss of information with respect to the true unknown distribution. The Kullback-Leibler divergence of a parametric model p_θ with respect to an unknown density f is defined by:

    \Delta_{K-L}(f, p_\theta) = \sum_i f(x_i) \log \frac{f(x_i)}{p_{\hat{\theta}}(x_i)}    (32.7)

where the parametric density in the denominator has been evaluated at the values of the parameters which minimize the distance with respect to f.

It can be shown that the statistical tests used for model comparison are generally based on estimators of the total Kullback-Leibler discrepancy. The most used of such estimators is the log-likelihood score. Statistical hypothesis testing is based on subsequent pairwise comparisons between pairs of alternative models: the idea is to compare the log-likelihood scores of two alternative models. The log-likelihood score is defined by:

    -2 \sum_{i=1}^{n} \log p_{\hat{\theta}}(x_i)                                           (32.8)

Hypothesis testing theory allows one to derive a threshold below which the difference between two models is not significant and, therefore, the simpler model can be chosen. To summarize, using statistical tests it is possible to make an accurate choice among the models, based on the observed data. The defect of this procedure is that it allows only a partial ordering of models, requiring a comparison between model pairs; therefore, with a large number of alternatives it is necessary to make heuristic choices regarding the comparison strategy (such as choosing among forward, backward and stepwise criteria, whose results may diverge). Furthermore, a probabilistic model must be assumed to hold, and this may not always be a valid assumption.
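As an illustration of this pairwise testing logic, the sketch below compares two nested logistic regression models through the difference of their -2 log-likelihood scores (a likelihood ratio test), judged against a chi-squared threshold. The simulated data frame, the column names and the 0.05 level are illustrative assumptions, not taken from the chapter.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2

# Hypothetical data: a binary target and two candidate predictors.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=500), "x2": rng.normal(size=500)})
df["bad"] = (rng.random(500) < 1 / (1 + np.exp(-df["x1"]))).astype(int)

# Simpler model (x1 only) versus larger model (x1 and x2).
m_small = sm.Logit(df["bad"], sm.add_constant(df[["x1"]])).fit(disp=0)
m_large = sm.Logit(df["bad"], sm.add_constant(df[["x1", "x2"]])).fit(disp=0)

# Difference of the -2 log-likelihood scores, compared with a chi-squared
# threshold whose degrees of freedom equal the number of extra parameters.
lr_stat = 2 * (m_large.llf - m_small.llf)
df_diff = m_large.df_model - m_small.df_model
p_value = chi2.sf(lr_stat, df_diff)
print("LR statistic:", lr_stat, "p-value:", p_value)
# If p_value exceeds the chosen level (say 0.05), the difference is not
# significant and the simpler model is kept.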
A less structured approach has been developed in the field of information theory, giving rise to criteria based on score functions. These criteria give each model a score, which puts the models into some kind of complete order. We have seen how the Kullback-Leibler discrepancy can be used to derive statistical tests to compare models. In many cases, however, a formal test cannot be derived. For this reason, it is important to develop scoring functions that attach a score to each model. The Kullback-Leibler discrepancy estimator is an example of such a scoring function which, for complex models, can often be approximated asymptotically. A problem with the Kullback-Leibler score is that it depends on the complexity of a model, as described, for instance, by the number of parameters: more complex models will typically achieve better scores. It is thus necessary to employ score functions that penalise model complexity. The most important of such functions is the AIC (Akaike Information Criterion, see (Akaike, 1974)). The AIC criterion is defined by the following equation:

    AIC = -2 \log L(\hat{\theta}; x_1, \ldots, x_n) + 2q                                   (32.9)

where the first term is minus twice the logarithm of the likelihood function, calculated at the maximum likelihood parameter estimate, and q is the number of parameters of the model. From its definition, notice that the AIC score essentially penalises the log-likelihood score with a term that increases linearly with model complexity.

The AIC criterion is based on the implicit assumption that q remains constant when the size of the sample increases. However, this assumption is not always valid and therefore the AIC criterion does not lead to a consistent estimate of the dimension of the unknown model. An alternative, and consistent, scoring function is the BIC criterion (Bayesian Information Criterion), also called SBC, formulated in (Schwarz, 1978). The BIC criterion is defined by the following expression:

    BIC = -2 \log L(\hat{\theta}; x_1, \ldots, x_n) + q \log(n)                            (32.10)

As can be seen from its definition, the BIC differs from the AIC only in the second term, which now also depends on the sample size n. Compared to the AIC, when n increases the BIC favours simpler models. As n gets large, the first term (linear in n) will dominate the second term (logarithmic in n). This corresponds to the fact that, for large n, the variance term in the mean squared error expression tends to be negligible. We also point out that, despite the superficial similarity between the AIC and the BIC, the first is usually justified by resorting to classical asymptotic arguments, while the second by appealing to the Bayesian framework.

To conclude, the scoring function criteria for selecting models are easy to calculate and lead to a total ordering of the models. Most statistical packages provide the AIC and BIC scores for all the models considered. A further advantage of these criteria is that they can also be used to compare non-nested models and, more generally, models that do not belong to the same class (for instance, a probabilistic neural network and a linear regression model). However, the limit of these criteria is the lack of a threshold, as well as the difficult interpretability of their measurement scale. In other words, it is not easy to determine whether the difference between two models is significant or not, and how it compares to another difference. These criteria are therefore most useful in a preliminary exploration phase. To examine these criteria and compare them with the previous ones see, for instance, (Zucchini, 2000) or (Hand et al., 2001).
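The scores in (32.9) and (32.10) are straightforward to compute once the maximized log-likelihood and the number of parameters of each model are known. The following self-contained sketch illustrates the two criteria; the log-likelihood values, parameter counts and sample size are purely hypothetical.

import numpy as np

def aic(loglik, q):
    # Akaike Information Criterion (32.9): q is the number of parameters.
    return -2.0 * loglik + 2.0 * q

def bic(loglik, q, n):
    # Bayesian Information Criterion (32.10): n is the sample size.
    return -2.0 * loglik + q * np.log(n)

# Illustrative comparison: two fitted models with hypothetical maximized
# log-likelihoods and numbers of parameters, on a sample of n observations.
n = 5000
candidates = {"model A": (-1180.4, 25), "model B": (-1195.7, 12)}
for name, (loglik, q) in candidates.items():
    print(name, "AIC:", aic(loglik, q), "BIC:", bic(loglik, q, n))
# The model with the lowest score is preferred; note how the log(n) penalty
# of the BIC favours the simpler model more strongly than the AIC does.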
A possible "compromise" between the previous two classes of criteria is given by the Bayesian criteria, which can be developed in a rather coherent way (see, e.g., (Bernardo and Smith, 1994)). They appear to combine the advantages of the two previous approaches: a coherent decision threshold and a complete ordering. One of the problems that may arise is connected to the absence of general purpose software. For Data Mining work using Bayesian criteria the reader can see, for instance, (Giudici, 2003) and (Giudici and Castelo, 2001).

The intensive and widespread use of computational methods has led to the development of computationally intensive model comparison criteria. These criteria are usually based on using a dataset different from the one being analyzed (external validation) and are applicable to all the models considered, even when they belong to different classes (for example in the comparison between logistic regression, decision trees and neural networks, even when the latter two are not probabilistic). A possible problem with these criteria is that they take a long time to be designed and implemented, although general purpose software has made this task easier.

The most common such criterion is based on cross-validation. The idea of the cross-validation method is to divide the sample into two sub-samples: a "training" sample, with n - m observations, and a "validation" sample, with m observations. The first sample is used to fit a model and the second is used to estimate the expected discrepancy or to assess a distance. Using this criterion, the choice between two or more models is made by evaluating an appropriate discrepancy function on the validation sample. Notice that the cross-validation idea can be applied to the calculation of any distance function.

One problem regarding the cross-validation criterion is deciding how to select m, that is, the number of observations contained in the validation sample. For example, if we select m = n/2, then only n/2 observations would be available to fit a model. We could reduce m, but this would mean having few observations in the validation sample and, therefore, reducing the accuracy with which the choice between models is made. In practice, proportions of 75% and 25% are usually used, respectively, for the training and the validation samples.

To summarize, these criteria have the advantage of being generally applicable but have the disadvantage of taking a long time to be calculated and of being sensitive to the characteristics of the data being examined. A way to overcome the latter problem is to consider model combination methods, such as bagging and boosting. For a thorough description of these methodologies, see (Hastie et al., 2001).
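A minimal sketch of the validation scheme just described, with the usual 75%/25% split, is given below. It uses scikit-learn for convenience; the simulated data, the column names and the choice of log-loss as the discrepancy on the validation sample are illustrative assumptions.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

# Hypothetical data: two predictors and a binary target.
rng = np.random.default_rng(1)
X = pd.DataFrame({"x1": rng.normal(size=2000), "x2": rng.normal(size=2000)})
y = (rng.random(2000) < 1 / (1 + np.exp(-X["x1"]))).astype(int)

# 75% training / 25% validation split, stratified on the target.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Fit the candidate models on the training sample only.
candidates = {
    "x1 only": LogisticRegression().fit(X_tr[["x1"]], y_tr),
    "x1 and x2": LogisticRegression().fit(X_tr, y_tr),
}

# Evaluate a discrepancy (here the log-loss, an entropic-type measure)
# on the validation sample; the model with the smallest value is chosen.
for name, model in candidates.items():
    cols = ["x1"] if name == "x1 only" else ["x1", "x2"]
    print(name, log_loss(y_val, model.predict_proba(X_val[cols])[:, 1]))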
One last group of criteria seems specifically tailored to the Data Mining field. These are criteria that compare the performance of the models in terms of their relative losses, connected to the errors of approximation made by fitting Data Mining models. Criteria based on loss functions have appeared recently, although related ideas have long been known in Bayesian decision theory (see, for instance, (Bernardo and Smith, 1994)). They are of great interest and have great application potential, although at present they are mainly concerned with solving classification problems. For a more detailed examination of these criteria the reader can see, for example, (Hand, 1997, Hand et al., 2001) or the reference manuals of Data Mining software, such as that of SAS Enterprise Miner.

The idea behind these methods is that, in the choice among alternative models, it is important to compare the utility of the results obtained from the models, and not just to look exclusively at the statistical comparison between the models themselves. Since the main problem dealt with by data analysis is to reduce uncertainties on the risk or "loss" factors, reference is often made to developing criteria that minimize the loss connected to the problem being examined. In other words, the best model is the one that leads to the least loss.

Most of the loss function based criteria apply to predictive classification problems, where the concept of a confusion matrix arises. The confusion matrix is used as an indication of the properties of a classification (discriminant) rule. It contains the number of elements that have been correctly or incorrectly classified for each class. On its main diagonal we can see the number of observations that have been correctly classified for each class, while the off-diagonal elements indicate the number of observations that have been incorrectly classified. If it is (explicitly or implicitly) assumed that each incorrect classification has the same cost, the proportion of incorrect classifications over the total number of classifications is called the rate of error, or misclassification error, and it is the quantity to be minimized. Of course, the assumption of equal costs can be replaced by weighting errors with their relative costs.

The confusion matrix gives rise to a number of graphs that can be used to assess the relative utility of a model, such as the lift chart and the ROC curve. For a detailed illustration of these graphs we refer to (Hand, 1997) or (Giudici, 2003). The lift chart puts the validation set observations, in increasing or decreasing order, on the basis of their score, which is the probability of the response event (success), as estimated on the basis of the training set. It then subdivides such scores into deciles, and calculates and graphs the observed probability of success for each of the decile classes in the validation set. A model is valid if the observed success probabilities follow the same order (increasing or decreasing) as the estimated ones. Notice that, in order to be better interpreted, the lift chart of a model is usually compared with a baseline curve, for which the probability estimates are drawn in the absence of a model, that is, taking the mean of the observed success probabilities.

The ROC (Receiver Operating Characteristic) curve is a graph that also measures the predictive accuracy of a model. It is based on four conditional frequencies that can be derived from a model and the choice of a cut-off point for its scores:

• the observations predicted as events and effectively such (sensitivity);
• the observations predicted as events but effectively non-events;
• the observations predicted as non-events but effectively events;
• the observations predicted as non-events and effectively such (specificity).

The ROC curve is obtained by representing, for any fixed cut-off value, a point in the Cartesian plane having as x-value the false positive rate (1 - specificity) and as y-value the sensitivity. Each point on the curve therefore corresponds to a particular cut-off. In terms of model comparison, the best curve is the one that is leftmost, the ideal one coinciding with the y-axis. To summarize, criteria based on loss functions have the advantage of being easy to interpret and, therefore, well suited for Data Mining applications but, on the other hand, they still need formal improvements and mathematical refinements. In the next section we give an example of how this can be done, and show that statistics and Data Mining applications can fruitfully interact.
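Before turning to the case study, the following minimal sketch shows how a confusion matrix, the misclassification rate and a cost-weighted expected loss can be computed from observed and predicted class labels. The labels and cost values are illustrative; the 20:1 cost ratio simply anticipates the kind of loss function used in the application below.

import numpy as np

def confusion_matrix(actual, predicted, labels=(0, 1)):
    # Rows index the actual class, columns the predicted class.
    m = np.zeros((len(labels), len(labels)), dtype=int)
    for a, p in zip(actual, predicted):
        m[labels.index(a), labels.index(p)] += 1
    return m

# Illustrative labels: 1 = "bad" (event), 0 = "good".
actual    = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 0])
predicted = np.array([1, 0, 1, 0, 0, 0, 1, 0, 0, 1])
cm = confusion_matrix(actual, predicted)

# Misclassification rate: off-diagonal counts over the total.
error_rate = (cm.sum() - np.trace(cm)) / cm.sum()

# Cost-weighted loss: each cell of the confusion matrix is multiplied by a
# hypothetical cost before averaging.
costs = np.array([[0, 1],     # actual good: rejecting a good costs 1
                  [20, 0]])   # actual bad: accepting a bad costs 20
expected_loss = (cm * costs).sum() / cm.sum()
print(cm, error_rate, expected_loss)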
32.3 Application to Credit Risk Management

We now apply the previous considerations to a case study that concerns credit risk management. The objective of the analysis is the evaluation of the credit reliability of small and medium enterprises (SMEs) that ask for financing for their development. In order to assess credit reliability, each applicant for credit is associated with a score, usually expressed in terms of a probability of repayment (or default probability). Data Mining methods are used to estimate such a score and, on the basis of it, to classify applicants as reliable (worthy of credit) or not.

Data Mining models for credit scoring are of the predictive (or supervised) kind: they use explanatory variables obtained from information available on the applicant in order to estimate the probability of repayment (the target or response variable). The methods most used in practical credit scoring applications are linear and logistic regression models, neural networks and classification trees. Often, in banking practice, the resulting scores are called "statistical" and are supplemented with subjective, judgemental evaluations.

In this section we consider the analysis of a database that includes 7134 SMEs belonging to the retail segment of an important Italian bank. The retail segment contains companies with total annual sales of less than 2.5 million. On each of these companies the bank has calculated a score, in order to evaluate its financing (or refinancing) in the period from April 1st, 1999 to April 30th, 2000. After data cleaning, 13 variables are included in the analysis database, of which one binary variable expressing credit reliability (BAD = 0 for the reliable companies, BAD = 1 for the non-reliable ones) can be considered as the response or target variable. The sample contains 361 companies with BAD = 1 (about 5%) and 6773 with BAD = 0 (about 95%). The objective of the analysis is to build a statistical rule that explains the target variable as a function of the explanatory ones. Once built on the observed data, such a rule will be extrapolated to assess and predict future applicants for credit. Notice the unbalancedness of the distribution of the target response: this situation, typical of predictive Data Mining problems, poses serious challenges to the performance of a model.

The remaining 12 available variables are retained to influence reliability and can be considered as explanatory predictors. Among them are: the age of the company, its legal status, the number of employees, the total sales and the variation of the sales in the last period, the region of residence, the specific business, and the duration of the relationship of the managers of the company with the bank. Most of them can be considered "demographic" information on the company, stable in time but indeed not very powerful for building a statistical model. However, it must be said that, since the companies considered are all SMEs, it is rather difficult to rely on other information, such as balance sheets.

A preliminary exploratory analysis can give indications on how to code the explanatory variables, in order to maximize their predictive power. In order to reach this objective we have employed statistical measures of association between pairs of variables, such as chi-squared based measures, and statistical measures of dependence, such as Goodman and Kruskal's (see (Giudici, 2003) for a systematic comparison of such measures). We remark that the use of such tools is very beneficial for the analysis, and can considerably improve the final performance results. As a result of our analysis, all explanatory variables have been discretised, with a number of levels ranging from 2 to 26.
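As an illustration of this kind of exploratory screening, the sketch below computes a chi-squared based association measure (Cramér's V) between a discretised explanatory variable and the binary target. The data are simulated and the variable names are illustrative; they are not taken from the bank's database.

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Simulated example: a discretised predictor (an age class of the company)
# and the binary reliability indicator BAD (1 = non reliable).
rng = np.random.default_rng(2)
age_class = rng.choice(["<5y", "5-15y", ">15y"], size=3000, p=[0.3, 0.4, 0.3])
p_bad = np.where(age_class == "<5y", 0.09, 0.04)   # younger firms assumed riskier
bad = (rng.random(3000) < p_bad).astype(int)

table = pd.crosstab(age_class, bad)                # contingency table
chi2_stat, p_value, dof, _ = chi2_contingency(table)

# Cramer's V: a chi-squared based association measure in [0, 1].
n = table.values.sum()
cramers_v = np.sqrt(chi2_stat / (n * (min(table.shape) - 1)))
print(table, chi2_stat, p_value, cramers_v)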
In order to focus on the issue of model comparison, we now concentrate on the comparison of three different logistic regression models on the data. The logistic regression model is the most used in credit scoring applications; other models that are employed are classification trees, linear discriminant analysis and neural networks. Here we prefer to compare models belonging to the same class, to better illustrate our point; for a detailed comparison of credit scoring methods, on a different data set, see (Giudici, 2003). Our analysis has been conducted using the SAS and SAS Enterprise Miner software available at the bank subject of the analysis.

We have chosen, in agreement with the bank's experts, three logistic regression models: a saturated model, which contains all explanatory variables, with the levels obtained from the exploratory analysis; a statistically selected model, obtained using pairwise statistical hypothesis testing; and a model that minimizes the loss function. In the following, the saturated model will be named "RegA (model A)"; the model chosen according to a statistical selection strategy "RegB (model B)"; and the model chosen by minimizing the loss function "RegC (model C)". Statistical model comparison has been carried out using a stepwise model selection approach, with a reference value of 0.05 against which to compare p-values. The loss function, on the other hand, has been specified by the bank's experts as a function of the classification errors. Table 32.1 below describes such a loss function.

Table 32.1. The chosen loss function

                        Predicted
Actual            BAD         GOOD
BAD                 0           20
GOOD               -1            0

The table contains the estimated losses (on a scale-free basis) corresponding to the combinations of actual and predicted values of the target variable. The specified loss function means that giving credit to a non-reliable (bad) enterprise is retained to be 20 times more costly than not giving credit to a reliable (good) enterprise. In statistical terms, the type I error costs 20 times the type II error. As each of the four scenarios in Table 32.1 has an occurrence probability, it is possible to calculate the expected loss of each considered statistical model. The best one will be the model minimizing such expected loss.

In the SAS Enterprise Miner tool, the Assessment node provides a common framework to compare models in terms of their predictions. This requires that the data has been partitioned into two or more datasets, according to the computational criteria of model comparison. The Assessment node produces a table view of the model results that lists relevant statistics of model adequacy, as well as several different charts/reports, depending on whether the target variable is continuous or categorical and whether a profit/loss function has been specified. In the case under examination, the initial dataset (5351 observations) has been split in two, using a sampling mechanism stratified with respect to the target variable. The training dataset contains about 70% of the observations (about 3712) and the validation dataset the remaining 30% (about 1639 observations). As the samples are stratified, in both the resulting datasets the percentages of "bad" and "good" enterprises remain the same as those in the combined dataset (5% and 95%).

The first model comparison tool we consider is the lift chart. For a binary target, the lift chart (also called gains chart) is built as follows. The scored data set is sorted by the probabilities of the target event in descending order, and the observations are then grouped into deciles. For each decile, the lift chart reports either the percentage of captured target responses (bad repayers here) or the ratio between this percentage and the corresponding one for the baseline (random) model, called the lift.
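A minimal sketch of how such a decile-based lift table can be computed from validation-set scores is given below. The scores are simulated and the function name is hypothetical; the construction (sort by descending score, take the top 10%, 20%, ..., and compare the captured percentage of bad enterprises with the roughly 5% baseline) follows the layout of Table 32.2.

import numpy as np
import pandas as pd

def lift_table(y_true, score, n_groups=10):
    # For each decile cut-off of the descending score, report the observed
    # percentage of target responses among the observations above the
    # cut-off and the corresponding lift over the baseline response rate.
    df = pd.DataFrame({"y": y_true, "score": score})
    df = df.sort_values("score", ascending=False).reset_index(drop=True)
    n, baseline = len(df), df["y"].mean()
    rows = []
    for k in range(1, n_groups + 1):
        top = df.iloc[: int(round(k * n / n_groups))]
        rows.append({"percentile": int(100 * k / n_groups),
                     "pct_captured": 100 * top["y"].mean(),
                     "lift": top["y"].mean() / baseline})
    return pd.DataFrame(rows)

# Simulated validation set: about 5% "bad" enterprises and a mildly
# informative score (both purely illustrative).
rng = np.random.default_rng(3)
y = (rng.random(1639) < 0.05).astype(int)
score = 0.05 + 0.10 * y + rng.normal(scale=0.03, size=1639)
print(lift_table(y, score))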
Lift charts show the percentage of positive responses, or the lift value, on the vertical axis. Table 32.2 shows the calculations that give rise to the lift chart for the credit scoring problem considered here, and Figure 32.1 shows the corresponding curves.

Table 32.2. Calculations for the lift chart

Obs. per group   Percentile   % captured (BASELINE)   % captured (REG A)   % captured (REG B)   % captured (REG C)
163.90           10           5.064                   20.134               22.575               22.679
163.90           20           5.064                   12.813               12.813               14.033
163.90           30           5.064                   9.762                10.103               10.293
163.90           40           5.064                   8.237                8.237                8.542
163.90           50           5.064                   7.322                7.383                7.445
163.90           60           5.064                   6.508                6.913                6.624
163.90           70           5.064                   5.753                6.237                6.096
163.90           80           5.064                   5.567                5.567                5.644
163.90           90           5.064                   5.288                5.220                5.185
163.90           100          5.064                   5.064                5.064                5.064

Fig. 32.1. Lift charts for the best model

Comparing the results in Table 32.2 and Figure 32.1, it emerges that the performances of the three models being compared are rather similar; however, the best model seems to be model C (the model that minimises the losses), as it is the model that, in the first deciles, is able to capture more bad enterprises, a difficult task in the given problem. Recalling that the actual percentage of bad enterprises observed is equal to 5%, the previous graph can be normalized by dividing the percentage of bads in each decile by the overall 5% percentage. The result is the actual lift of a model, that is, the actual improvement with respect to the baseline situation of absence of a model (as if each company were estimated good/bad according to a purely random mechanism). In terms of model C, in the first decile (with about 164 enterprises) the lift is equal to 4.46 (i.e. 22.7% / 5.1%); this means that, using model C, it is expected to obtain, in the first decile, a number of bad enterprises about 4.5 times higher than with a random sample of the considered enterprises.

The second Assessment tool we consider is the threshold chart. Threshold-based charts display the agreement between the predicted and actual target values across a range of threshold levels. The threshold level is the cut-off that is used to classify an observation on the basis of the event-level posterior probabilities; the default threshold level is 0.50. For the credit scoring case, the calculations leading to the threshold chart are reported in Table 32.3 and the corresponding figure in Figure 32.3 below. In order to interpret correctly the previous table and figure, let us consider some numerical examples. First we remark that the results refer to the validation dataset, with 1639 enterprises.
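As a complement to the threshold chart description, here is a minimal sketch of how agreement measures (sensitivity and specificity) can be computed across a range of cut-off values applied to the posterior probabilities. The probability values are simulated and the helper name is hypothetical; this is not the bank's data.

import numpy as np
import pandas as pd

def threshold_table(y_true, prob_event, thresholds):
    # For each cut-off, classify an observation as an event when its
    # posterior probability reaches the cut-off, then report sensitivity
    # (events correctly predicted) and specificity (non-events correctly
    # predicted).
    rows = []
    y_true = np.asarray(y_true)
    prob_event = np.asarray(prob_event)
    for t in thresholds:
        pred = (prob_event >= t).astype(int)
        tp = np.sum((pred == 1) & (y_true == 1))
        tn = np.sum((pred == 0) & (y_true == 0))
        rows.append({"threshold": round(float(t), 2),
                     "sensitivity": tp / max(y_true.sum(), 1),
                     "specificity": tn / max((y_true == 0).sum(), 1)})
    return pd.DataFrame(rows)

# Simulated posterior probabilities for a roughly 5% event rate.
rng = np.random.default_rng(4)
y = (rng.random(1639) < 0.05).astype(int)
prob = np.clip(0.05 + 0.10 * y + rng.normal(scale=0.03, size=1639), 0, 1)
print(threshold_table(y, prob, thresholds=np.arange(0.05, 0.55, 0.05)))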