Obviously, if these (numerical) estimates have a small variance and bias, the corresponding classifier is stable with respect to random variations of the learning set and close to the Bayes classifier. Thus, a complementary approach to studying the bias and variance of a classification algorithm is to connect, in a quantitative way, the bias and variance terms of these estimates to the mean misclassification error of the resulting classification rule. Friedman (1997) made this connection in the particular case of a two-class problem, assuming that the distribution of I_c(S)(x) with respect to S is close to Gaussian. In this case, the mean misclassification error at some point x may be written (see (Friedman, 1997)):

\[
E_S[\mathrm{Error}(I(S)(x))] \;=\; \sigma_C(x) \;+\; \Phi\!\left(\frac{E_S[I_{c_b}(S)(x)] - 0.5}{\mathrm{var}_S[I_{c_b}(S)(x)]}\right)\,\bigl(2\,P(y = c_b \mid x) - 1\bigr),
\]

where c_b is the Bayes class at x and Φ(·) is the upper tail of the standard normal distribution, a positive and monotonically decreasing function of its argument such that Φ(+∞) = 0. The numerator of the argument of Φ is called the "boundary bias" and its denominator is exactly the variance (37.4) of the regression model I_{c_b}(S)(x). There are two possible situations depending on the sign of the boundary bias:

• When the average probability estimate of the Bayes prediction is greater than 0.5 (a majority of models are right), a decrease of the variance of these estimates will decrease the error.
• On the other hand, when the average probability estimate is lower than 0.5 (a majority of models are wrong), a decrease of variance will increase the error.

Hence, the conclusions are similar to what we found in our illustrative problem above: in classification, more variance is beneficial for biased points and detrimental for unbiased ones. Another important conclusion can be drawn from this decomposition: whatever the regression bias on the approximation of I_c(S)(x), the classification error can be driven to its minimum value by reducing the variance alone, as long as E_S[I_{c_b}(S)(x)] remains greater than 0.5. This means that perfect classification rules can be induced from very bad (rather biased, but of small variance) probability estimators. In this sense, we can say that reducing variance is more important than reducing bias in classification.

This decomposition is certainly complementary to the decompositions of the previous section. One of its main advantages over them is that the behavior of the bias and variance of probability estimates is predictable in the usual (squared loss) way, while some of the bias and variance terms introduced above are less interpretable.

37.3 Estimation of Bias and Variance

Bias and variance are useful tools to understand the behavior of learning algorithms, so it is very desirable to be able to compute them in practice for a given learning algorithm and problem. However, the definitions of bias and variance make intensive use of the distribution of learning samples and, usually, the only knowledge we have about this distribution is a data set of randomly drawn samples. So, in practice, the true values of the bias and variance terms have to be replaced by estimates. One way to obtain these estimates, assuming that we have enough data, is to split the available dataset into two disjoint parts, PS (P for "pool") and TS (T for "test"), and to use them in the following way:

• PS is used to approximate the learning set generation mechanism.
A good candidate for this is to replace sampling from P_D(x, y) by sampling with replacement from PS. This is called "bootstrap" sampling in the statistical literature (Efron and Tibshirani, 1993) and the idea behind it is to use the empirical distribution of the finite sample as an approximation of the true distribution. The bigger PS is with respect to the learning set size, the better the approximation will be. For example, denoting PS by (<x_1, y_1>, ..., <x_{M_p}, y_{M_p}>), we can estimate the average regression model by the following procedure: (i) draw T learning sets of size m (with m ≤ M_p) with replacement from PS, (S_1, ..., S_T), (ii) build a model from each S_i using the inducer I, and (iii) compute:

\[
f_{avg}(x) \cong \frac{1}{T}\sum_{i=1}^{T} I(S_i)(x). \qquad (37.8)
\]

• The set TS, which is independent of PS, is used as a test sample to estimate the errors and the bias and variance terms. For example, denoting this set by (<x_1, y_1>, ..., <x_{M_v}, y_{M_v}>), the mean error of a model is estimated by:

\[
\mathrm{Error}(f) = E_{x,y}[L(f(x), y)] \cong \frac{1}{M_v}\sum_{i=1}^{M_v} L(f(x_i), y_i).
\]

However, some of the previously defined terms are difficult to estimate without any knowledge of the problem other than a dataset TS. Indeed, the estimation of the Bayes model f_b(x) from data is nothing but the final goal of supervised learning. So, the bias and variance terms that make explicit use of it will be mostly impossible to estimate for real datasets (for which we do not have any knowledge of the underlying distribution). For example, the regression noise and bias terms depend on the Bayes model and thus are impossible to estimate separately from data alone. A common solution to circumvent this problem is to assume that there is no noise, i.e. f_b(x_i) = y_i, i = 1, ..., M_v, and to estimate the bias term from TS by:

\[
E_x[\mathrm{bias}^2_R(x)] = E_x[(f_b(x) - f_{avg}(x))^2] \cong \frac{1}{M_v}\sum_{i=1}^{M_v} (y_i - f_{avg}(x_i))^2, \qquad (37.9)
\]

using in this latter expression the estimate of the average model given by (37.8). If there actually is noise, this expression is an estimate of the error of the average model and hence amounts to estimating the sum of the residual error and bias terms. The fact that we cannot distinguish errors due to noise from errors due to bias is not very dramatic, since we are usually interested in studying relative variations of bias and variance rather than their absolute values, and the part of (37.9) which is due to the residual error is constant. The regression variance may then be estimated by:

\[
E_x[\mathrm{var}_R(x)] \cong \frac{1}{M_v}\sum_{i=1}^{M_v} \frac{1}{T}\sum_{j=1}^{T} (I(S_j)(x_i) - f_{avg}(x_i))^2,
\]

or, equivalently, by the difference between the mean error and the sum of noise and bias as estimated by (37.9). Of course, this estimation procedure also works for estimating the different bias and variance terms for the 0-1 loss function, and the discussion about the estimation of the Bayes model still applies.

The preceding procedure may yield very unstable estimates (suffering from a high variance), especially if the available data set is not sufficiently large. Several techniques are possible to stabilize the estimates. For example, a simple method is to use several random divisions of the data set into PS and TS and to average the estimates found for each split. More complex estimators may be constructed to further reduce the variance of the estimation (see for example (Tibshirani, 1996), (Wolpert, 1997) or (Webb, 2000)).
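As an illustration of this procedure, the sketch below estimates the error, bias and variance of regression trees with scikit-learn, following steps (i)-(iii) and the estimates (37.8) and (37.9). The synthetic data-generating function, the pool and test sizes, and the values of T and m are arbitrary choices made for this example, not part of the text's protocol.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

def make_data(n):
    """Synthetic regression problem (arbitrary choice for the illustration)."""
    X = rng.uniform(0, 1, size=(n, 5))
    y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] + rng.normal(0, 0.1, size=n)
    return X, y

# PS approximates the learning-set generation mechanism; TS is the test sample.
X_ps, y_ps = make_data(2000)
X_ts, y_ts = make_data(500)

T, m = 50, 300                        # number of bootstrap learning sets and their size
preds = np.empty((T, len(y_ts)))      # I(S_j)(x_i) for every test point

for j in range(T):
    idx = rng.randint(0, len(y_ps), size=m)          # bootstrap sampling from PS
    model = DecisionTreeRegressor().fit(X_ps[idx], y_ps[idx])
    preds[j] = model.predict(X_ts)

f_avg = preds.mean(axis=0)                           # average model, as in (37.8)
error = np.mean((preds - y_ts) ** 2)                 # mean squared error over models and test points
bias2_plus_noise = np.mean((y_ts - f_avg) ** 2)      # (37.9): bias^2 plus residual error
variance = np.mean((preds - f_avg) ** 2)             # average regression variance

print(f"error={error:.3f}  bias^2+noise={bias2_plus_noise:.3f}  variance={variance:.3f}")
```

Up to rounding, the printed error equals the sum of the two other terms, which is a useful sanity check of the estimation.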
37.4 Experiments and Applications

In this section, we carry out some experiments to illustrate the interest of a bias/variance analysis. These experiments are restricted to a regression problem, but most of the discussion applies directly to classification problems as well. The illustrative problem is an artificial problem introduced in (Friedman, 1997), which has 10 input attributes, all independently and uniformly distributed in [0,1]. The regression output is given by y = f_b(x) + ε, where f_b(x) = 10 sin(π x_1 x_2) + 20 (x_3 − 0.5)^2 + 10 x_4 + 5 x_5 depends only on the first five inputs and ε is a noise term distributed according to a Gaussian distribution of zero mean and unit variance. Bias and variance are estimated using the protocol of the previous section, with PS and TS of respectively 8000 and 2000 cases. The learning set size m is 300 and T = 50 models are constructed.

37.4.1 Bias/variance tradeoff

As bias and variance are both positive and contribute directly to the error, they should both be minimized as much as possible. Unfortunately, there is a compromise, called the bias/variance tradeoff, between these two types of error. Indeed, usually, the more you fit your model to the data, the lower the bias, but at the same time the higher the variance, since the dependence of the model on the learning sample increases. Conversely, if you reduce the dependence of the model on the learning sample, you will usually increase the bias. The goodness of the fit depends mainly on the model complexity (the size of the hypothesis space), but also on the amount of optimization carried out by the machine learning method.

Fig. 37.3. Evolution of bias and variance with respect to the number of perceptrons in the hidden layer of a neural network (top) and the number of tests in a regression tree (bottom)

Figure 37.3 shows, on our illustrative problem, the evolution of bias and variance when we increase the number of hidden neurons in a one-hidden-layer neural network (Bishop, 1995) (top) and when we increase the number of test nodes in a regression tree (Breiman et al., 1984) (bottom). These curves clearly show the expected evolution of bias and variance. In both cases, the mean squared error goes through a minimum that corresponds to the best tradeoff between bias and variance. In the context of a particular machine learning method, there are often several parameters, each of which regulates a different bias/variance tradeoff. It is thus necessary to control these parameters in order to find the optimal tradeoff. There exist many techniques to do this in the context of a given learning algorithm. Examples of such techniques are early stopping or weight decay for neural networks (Bishop, 1995) and pruning in the context of decision/regression trees (Breiman et al., 1984).
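The sketch below is one possible way to reproduce curves of the kind shown in Figure 37.3 for regression trees, using scikit-learn's make_friedman1 generator (which implements the artificial problem described above) and the estimation protocol of Section 37.3. It varies the maximum tree depth rather than the number of test nodes directly, so the numbers will only qualitatively match the figure; the other settings are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

# Friedman's artificial problem: 10 uniform inputs, Gaussian noise with unit variance.
X_ps, y_ps = make_friedman1(n_samples=8000, noise=1.0, random_state=0)
X_ts, y_ts = make_friedman1(n_samples=2000, noise=1.0, random_state=1)

T, m = 50, 300
for depth in (1, 2, 4, 8, None):                      # proxy for the number of test nodes
    preds = np.empty((T, len(y_ts)))
    for j in range(T):
        idx = rng.randint(0, len(y_ps), size=m)       # bootstrap learning set of size m
        tree = DecisionTreeRegressor(max_depth=depth).fit(X_ps[idx], y_ps[idx])
        preds[j] = tree.predict(X_ts)
    f_avg = preds.mean(axis=0)
    bias2 = np.mean((y_ts - f_avg) ** 2)              # includes the residual error (about 1.0)
    var = np.mean((preds - f_avg) ** 2)
    print(f"max_depth={depth}: bias^2+noise={bias2:.2f}  variance={var:.2f}")
```

Shallow trees should exhibit a large bias term and a small variance, deep trees the opposite, with the smallest total error at an intermediate depth.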
37.4.2 Comparison of some learning algorithms

The bias/variance tradeoff differs from one learning algorithm to another. Some algorithms intrinsically present high variance but low bias, and others present high bias but low variance. For example, linear models and the Naive Bayes method, because of their strong hypotheses, often suffer from a high bias. On the other hand, because of their small number of parameters, their variance is small and, on some problems, they may therefore be competitive with more complex algorithms, even if their underlying hypotheses are clearly violated. While the biases of the nearest neighbor method (1-NN), neural networks, and regression/decision trees are generally smaller, the increased flexibility of their models is paid for by an increase in variance.

Table 37.1 provides a comparison of various algorithms on our illustrative problem.

Table 37.1. A comparison of bias/variance decompositions for several algorithms

Method              Mean squared error   Noise   Bias   Variance
Linear regression    7.0                 1.0     5.8     0.2
k-NN (k=1)          15.4                 1.0     4.0    10.4
k-NN (k=10)          8.5                 1.0     6.2     1.3
MLP (10)             2.0                 1.0     0.2     0.8
MLP (10-10)          4.6                 1.0     0.4     3.2
Regression tree     10.2                 1.0     2.5     6.7
Tree bagging         5.3                 1.0     2.8     1.5

The variance of the linear model is negligible and hence its error is mainly due to bias. For the k nearest neighbors method, the smaller value of k gives a very high variance and also a rather high bias. Increasing k to 10 reduces the variance significantly, but at the price of a higher bias. All in all, this method is less accurate than linear regression, despite the fact that it can in principle handle the non-linearity. Multilayer perceptrons provide overall the best results on this problem, having both negligible bias and small variance. The two simulations correspond respectively to one hidden layer with 10 neurons (10) and two hidden layers of 10 neurons each (10-10). The variance of the more complex structure is significantly larger. Finally, (un-pruned) regression trees present a very high variance on this problem. So, although their bias is small, this method is less accurate than linear regression and the 10-NN method.

37.4.3 Ensemble methods: bagging

Bias/variance analyses are especially useful to understand the behavior of ensemble methods. The idea of ensemble methods is to generate by some means a set of models using any learning algorithm and then to aggregate the predictions of these models to yield a more stable final prediction. By doing so, ensemble methods usually change the intrinsic bias/variance tradeoff of the learning algorithm they are applied to. For example, Breiman (1996b) introduced bagging as a procedure to approximate the average model f_avg or the majority vote classifier f_maj, which both have zero variance by definition. To this end, bagging replaces sampling from the distribution P_D(x, y) by bootstrap sampling from the learning sample. Usually, it mainly reduces the variance and only slightly increases the bias (Bauer and Kohavi, 1999). For example, on our illustrative problem, bagging with 25 bootstrap samples reduces the variance of regression trees to 1.5 and slightly increases their bias to 2.8. All in all, it reduces the mean squared error from 10.2 to 5.3. Another well-known ensemble method is boosting, which has been shown to reduce both bias and variance when applied to decision trees (Bauer and Kohavi, 1999).
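To make the effect of bagging concrete, the sketch below compares, under the protocol of Section 37.3, a single regression tree with a bagged ensemble of 25 trees (the number of bootstrap samples mentioned in the text). Treating the whole bagging procedure, here scikit-learn's BaggingRegressor, as the inducer I is a choice made for this illustration, and the exact figures will differ from Table 37.1.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_ps, y_ps = make_friedman1(n_samples=8000, noise=1.0, random_state=0)
X_ts, y_ts = make_friedman1(n_samples=2000, noise=1.0, random_state=1)

T, m = 50, 300
inducers = [
    ("regression tree", lambda: DecisionTreeRegressor()),
    ("tree bagging", lambda: BaggingRegressor(DecisionTreeRegressor(), n_estimators=25)),
]
for name, make_model in inducers:
    preds = np.empty((T, len(y_ts)))
    for j in range(T):
        idx = rng.randint(0, len(y_ps), size=m)        # bootstrap learning set from PS
        preds[j] = make_model().fit(X_ps[idx], y_ps[idx]).predict(X_ts)
    f_avg = preds.mean(axis=0)
    print(f"{name}: bias^2+noise={np.mean((y_ts - f_avg) ** 2):.2f}  "
          f"variance={np.mean((preds - f_avg) ** 2):.2f}")
```

In line with the discussion above, the bagged trees should show a much smaller variance than the single tree, at the cost of a slightly higher bias.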
37.5 Discussion

In automatic learning, bias and variance both contribute to prediction error, whatever the learning problem, algorithm and sample size. The error decomposition allows us to better understand how an automatic learning algorithm will respond to changing conditions, and to compare different methods in terms of their weaknesses. This understanding can then be exploited to select methods in practice, to study their performance in research, and to guide the search for appropriate ways to improve automatic learning methods.

References

Bauer, E., Kohavi, R. An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine Learning 1999; 36:105-139.
Bishop, C.M. Neural Networks for Pattern Recognition. Oxford: Oxford University Press, 1995.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. Classification and Regression Trees. California: Wadsworth International, 1984.
Breiman, L. Bias, Variance, and Arcing Classifiers. Technical Report 460, Statistics Department, University of California Berkeley, 1996.
Breiman, L. Bagging Predictors. Machine Learning 1996; 24(2):123-140.
Breiman, L. Randomizing Outputs to Increase Prediction Accuracy. Machine Learning 2000; 40(3):229-242.
Dietterich, T.G. and Kong, E.B. Machine Learning Bias, Statistical Bias, and Statistical Variance of Decision Tree Algorithms. Technical Report. Department of Computer Science, Oregon State University, 1995.
Domingos, P. A Unified Bias-Variance Decomposition for Zero-One and Squared Loss. Proceedings of the 17th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, 2000.
Efron, B., Tibshirani, R.J. An Introduction to the Bootstrap. Chapman & Hall, 1993.
Freund, Y., Schapire, R.E. A Decision-Theoretic Generalization of Online Learning and an Application to Boosting. Proceedings of the Second European Conference on Computational Learning Theory, 1995.
Friedman, J.H. On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality. Data Mining and Knowledge Discovery 1997; 1:55-77.
Geman, S., Bienenstock, E. and Doursat, R. Neural Networks and the Bias/Variance Dilemma. Neural Computation 1992; 4:1-58.
Geurts, P. Contribution to Decision Tree Induction: Bias/Variance Tradeoff and Time Series Classification. PhD thesis. Department of Electrical Engineering and Computer Science, University of Liège, 2002.
Hansen, J.V. Combining Predictors: Meta Machine Learning Methods and Bias/Variance & Ambiguity Decompositions. PhD thesis. Department of Computer Science, University of Aarhus, 2000.
Heskes, T. Bias/Variance Decompositions for Likelihood-Based Estimators. Neural Computation 1998; 10(6):1425-1433.
James, G.M. Variance and Bias for General Loss Functions. Machine Learning 2003; 51:115-135.
Kohavi, R. and Wolpert, D.H. Bias Plus Variance Decomposition for Zero-One Loss Functions. Proceedings of the 13th International Conference on Machine Learning, Morgan Kaufmann, 1996.
Tibshirani, R. Bias, Variance and Prediction Error for Classification Rules. Technical Report, Department of Statistics, University of Toronto, 1996.
Wolpert, D.H. On Bias plus Variance. Neural Computation 1997; 1211-1243.
Webb, G. MultiBoosting: A Technique for Combining Boosting and Wagging. Machine Learning 2000; 40(2):159-196.

38 Mining with Rare Cases

Gary M. Weiss
Department of Computer and Information Science, Fordham University
441 East Fordham Road, Bronx, NY 10458
gweiss@cis.fordham.edu

Summary. Rare cases are often the most interesting cases. For example, in medical diagnosis one is typically interested in identifying relatively rare diseases, such as cancer, rather than more frequently occurring ones, such as the common cold. In this chapter we discuss the role of rare cases in Data Mining.
Specific problems associated with mining rare cases are discussed, followed by a description of methods for addressing these problems.

Key words: Rare cases, small disjuncts, inductive bias, sampling

38.1 Introduction

Rare cases are often of special interest. This is especially true in the context of Data Mining, where one often wants to uncover subtle patterns that may be hidden in massive amounts of data. Examples of mining rare cases include learning word pronunciations (Van den Bosch et al., 1997), detecting oil spills from satellite images (Kubat et al., 1998), predicting telecommunication equipment failures (Weiss and Hirsh, 1998) and finding associations between infrequently purchased supermarket items (Liu et al., 1999). Rare cases warrant special attention because they pose significant problems for Data Mining algorithms.

We begin by discussing what is meant by a rare case. Informally, a case corresponds to a region in the instance space that is meaningful with respect to the domain under study, and a rare case is a case that covers a small region of the instance space and relatively few training examples. As a concrete example, with respect to the class bird, non-flying bird is a rare case since very few birds (e.g., ostriches) do not fly. Figure 38.1 shows rare cases and common cases for unlabeled data (Figure 38.1A) and for labeled data (Figure 38.1B). In each situation the regions associated with each case are outlined. Unfortunately, except for artificial domains, the borders of rare and common cases are not known and can only be approximated.

Fig. 38.1. Rare and common cases in unlabeled (A) and labeled (B) data

One important Data Mining task associated with unsupervised learning is clustering, which involves the grouping of entities into categories. Based on the data in Figure 38.1A, a clustering algorithm might identify four clusters. In this situation we could say that the algorithm has identified one common case and three rare cases. The three rare cases will be more difficult to detect and generalize from because they contain fewer data points. A second important unsupervised learning task is association rule mining, which looks for associations between items (Agrawal et al., 1993). Groupings of items that co-occur frequently, such as milk and cookies, will be considered common cases, while other associations may be extremely rare. For example, mop and broom will be a rare association (i.e., case) in the context of supermarket sales, not because the items are unlikely to be purchased together, but because neither item is frequently purchased in a supermarket (Liu et al., 1999).

Figure 38.1B shows a classification problem with two classes: a positive class P and a negative class N. The positive class contains one common case, P1, and two rare cases, P2 and P3. For classification tasks the rare cases may manifest themselves as small disjuncts. Small disjuncts are those disjuncts in the learned classifier that cover few training examples (Holte et al., 1989). If a decision tree learner were to form a leaf node to cover case P2, the disjunct (i.e., leaf node) would be a small disjunct because it covers only two training examples.
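As a hypothetical illustration of how small disjuncts can be located in practice, the sketch below fits a scikit-learn decision tree to an imbalanced toy dataset and reports the leaves that cover very few training examples. The dataset, the class proportions and the size threshold are arbitrary assumptions made for the example, not part of the chapter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data standing in for a domain with rare cases.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
leaf_of_sample = clf.apply(X)                 # leaf (disjunct) covering each training example
leaves, coverage = np.unique(leaf_of_sample, return_counts=True)

SMALL = 5                                     # arbitrary threshold on disjunct size
for leaf, n in zip(leaves, coverage):
    if n <= SMALL:
        print(f"leaf {leaf} is a small disjunct: covers {n} training example(s)")
```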
Because rare cases are not easily identified, most research focuses on their learned counterparts, small disjuncts. Existing research indicates that rare cases and small disjuncts pose difficulties for Data Mining. Experiments using artificial domains show that rare cases have a much higher misclassification rate than common cases (Weiss, 1995, Japkowicz, 2001), a problem we refer to as the problem with rare cases. A large number of studies demonstrate a similar problem with small disjuncts. These studies show that small disjuncts consistently have a much higher error rate than large disjuncts (Ali and Pazzani, 1995, Weiss, 1995, Holte et al., 1989, Ting, 1994, Weiss and Hirsh, 2000). Most of these studies also show that small disjuncts collectively cover a substantial fraction of all examples and cannot simply be eliminated, since doing so would substantially degrade the performance of a classifier. The most thorough empirical study of small disjuncts showed that, in the classifiers induced from thirty real-world data sets, most errors are contributed by the smaller disjuncts (Weiss and Hirsh, 2000).

One important question to consider is whether the rarity of a case should be determined with respect to some absolute threshold on the number of training examples it covers ("absolute rarity") or with respect to its relative frequency of occurrence in the underlying distribution of the data ("relative rarity"). Under absolute rarity, a case that covers only three examples in a training set is considered rare; but if additional training data are obtained so that the training set grows by a factor of 100 and the case now covers 300 examples, it is no longer rare. Under relative rarity, if the case covers only 1% of the training data in both situations, it is rare in both. From a practical perspective we are concerned with both absolute and relative rarity since, as we shall see, both forms of rarity pose problems for virtually all Data Mining systems.

This chapter focuses on rare cases. In the remainder of the chapter we discuss problems associated with mining rare cases and techniques to address these problems. Rare classes pose problems similar to those posed by rare cases, and for this reason we comment on the connection between the two at the end of the chapter.

38.2 Why Rare Cases are Problematic

Rare cases pose difficulties for Data Mining systems for a variety of reasons. The most obvious and fundamental problem is the associated lack of data: rare cases tend to cover only a few training examples (i.e., absolute rarity). This lack of data makes it difficult to detect rare cases and, even if a rare case is detected, makes generalization difficult, since it is hard to identify regularities from only a few data points. To see this, consider the classification task shown in Figure 38.2, which focuses on the rare case P3 from Figure 38.1B. Figure 38.2A reproduces the region from Figure 38.1B surrounding P3. Figure 38.2B shows what happens when the training data is augmented with only positive examples, while Figure 38.2C shows the result of adding examples from the underlying distribution.

Fig. 38.2. The problem with absolute rarity
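The following small simulation, which is not taken from the chapter, illustrates the problem with absolute rarity: the relative frequency of a rare positive case is held fixed at roughly 1% while the absolute number of training examples grows, and the accuracy measured on the rare region typically improves only as its absolute coverage increases. All names and settings are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

def sample(n):
    """Two-feature problem in which a rare positive case occupies 1% of the instance space."""
    X = rng.uniform(0, 1, size=(n, 2))
    rare = (X[:, 0] < 0.1) & (X[:, 1] < 0.1)          # the rare positive case
    return X, rare.astype(int), rare

X_test, y_test, rare_test = sample(20000)

for n_train in (100, 1000, 10000):                    # relative rarity fixed, absolute rarity varies
    X_tr, y_tr, _ = sample(n_train)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    pred = clf.predict(X_test)
    acc_rare = (pred[rare_test] == y_test[rare_test]).mean()
    print(f"n_train={n_train}: accuracy on the rare case = {acc_rare:.2f}")
```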