37 Bias vs Variance Decomposition For Regression and Classification

Pierre Geurts
Department of Electrical Engineering and Computer Science, University of Liège, Belgium. Postdoctoral Researcher, F.N.R.S., Belgium

Summary. In this chapter, the important concepts of bias and variance are introduced. After an intuitive introduction to the bias/variance tradeoff, we discuss the bias/variance decompositions of the mean squared error (in the context of regression problems) and of the mean misclassification error (in the context of classification problems). Then, we carry out a small empirical study providing some insight into how the parameters of a learning algorithm influence bias and variance.

Key words: bias, variance, supervised learning, overfitting

37.1 Introduction

The general problem of supervised learning is often formulated as an optimization problem: an error measure is defined that evaluates the quality of a model, and the goal of learning is to find, in a family of models (the hypothesis space), a model that minimizes this error estimated on the learning sample (or dataset) S. So, at first sight, if no good enough model is found in this family, it should be sufficient to extend the family or to exchange it for a more powerful one in terms of model flexibility. However, we are usually interested in a model that generalizes well to unseen data rather than in a model that perfectly predicts the outputs of the learning sample cases. And, unfortunately, in practice, good results on the learning set do not necessarily imply good generalization performance on unseen data, especially if the "size" of the hypothesis space is large in comparison to the sample size.

Let us use a simple one-dimensional regression problem to explain intuitively why larger hypothesis spaces do not necessarily lead to better models (a small simulation sketch of this experiment is given at the end of this section). In this synthetic problem, learning outputs are generated according to $y = f_b(x) + \varepsilon$, where $f_b$ is represented by the dashed curves in Figure 37.1 and $\varepsilon$ is distributed according to a Gaussian $N(0, \sigma)$ distribution. With squared error loss, we will see below that the best possible model for this problem is $f_b$ and that its average squared error is $\sigma^2$. Let us consider two extreme situations of a bad model structure choice.
• A too simple model: using a linear model $y = w \cdot x + b$ and minimizing the squared error on the learning set, we obtain the estimations given in the left part of Figure 37.1 for two
different learning set choices. These models are not very good, neither on their learning sets nor in generalization: whatever the learning set, there will always remain an error due to the fact that the model is too simple with respect to the complexity of $f_b$.
• A too complex model: using a very complex model, such as a neural network with two hidden layers of ten neurons each, we get the functions shown in the right part of Figure 37.1 for the same learning sets. This time, the models achieve an almost perfect score on the learning set. However, their generalization errors are still not very good, because of two phenomena. First, the learning algorithm is able to match the learning set perfectly, and hence also the noise term; we say in this case that the learning algorithm "overfits" the data. Second, even if there is no noise, some error remains because of the high complexity of the model. Indeed, the learning algorithm has many different models at its disposal, and if the learning set is relatively small, several of them will match the learning set perfectly. Since at most one of them is a perfect image of the best model, any other choice by the learning algorithm will result in suboptimality.

Fig. 37.1. Left: a linear model fitted to two learning samples $S_1$ and $S_2$ (too simple model). Right: a neural network fitted to the same samples (too complex model).

The main source of error is very different in the two cases. In the first case, the error is essentially independent of the particular learning set and must be attributed to the lack of complexity of the model. This source of error is called bias. In the second case, on the other hand, the error may be attributed to the variability of the model from one learning set to another (which is due, on one hand, to overfitting and, on the other hand, to the sparseness of the learning set with respect to the complexity of the model). This source of error is called variance. Note that in the first case there is also a dependence of the model on the learning set, and thus some variability of the predictions; however, the resulting variance is negligible with respect to the bias. In general, bias and variance both depend on the complexity of the model, but in opposite directions, and thus there must exist an optimal tradeoff between these two sources of error. This optimal tradeoff also depends on the smoothness of the best model and on the sample size. An important consequence is that, because of variance, we should be careful not to increase the complexity of the model structure too much with respect to the complexity of the problem and the size of the learning sample.

In the next section, we give a formal additive decomposition of the mean (over all learning set choices) squared error into two terms that represent the bias and the variance effects. Some similar decompositions proposed in the context of 0-1 loss functions are also discussed; they reveal some fundamental differences between the two types of problems, although the bias and variance concepts remain useful in both. Section 3 discusses procedures to estimate the bias and variance terms on practical problems. In Section 4, we give some experiments and applications of bias/variance decompositions.
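To make the intuition above reproducible, here is a small simulation sketch of the one-dimensional experiment. The sine target, the noise level, and the use of polynomial models of degree 1 and 10 as stand-ins for the linear model and the neural network are illustrative assumptions, not choices made in the chapter:

```python
import numpy as np

rng = np.random.default_rng(1)

def f_b(x):                       # the best model (unknown in practice)
    return np.sin(2 * np.pi * x)

SIGMA, M = 0.3, 30                # noise level, learning sample size
x_test = rng.uniform(0, 1, 10_000)
y_test = f_b(x_test) + rng.normal(0, SIGMA, x_test.size)

for degree in (1, 10):            # too simple vs. too complex model
    train_err, test_err = [], []
    for _ in range(200):          # many learning samples S
        x = rng.uniform(0, 1, M)
        y = f_b(x) + rng.normal(0, SIGMA, M)          # y = f_b(x) + eps
        c = np.polyfit(x, y, degree)                  # fit the model to S
        train_err.append(np.mean((y - np.polyval(c, x))**2))
        test_err.append(np.mean((y_test - np.polyval(c, x_test))**2))
    print(f"degree={degree:2d}  train MSE={np.mean(train_err):.3f}  "
          f"test MSE={np.mean(test_err):.3f}  (sigma^2={SIGMA**2:.3f})")
```

With these settings, the degree-1 model shows similarly large errors on both sets (dominated by bias), while the degree-10 model drives the learning-set error below $\sigma^2$ yet keeps a high test error (overfitting, i.e. variance).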
37.2 Bias/Variance Decompositions

Let us introduce some notation. A learning sample S is a collection of m input/output pairs $(\langle x_1, y_1\rangle, \ldots, \langle x_m, y_m\rangle)$, each one randomly and independently drawn from a probability distribution $P_D(x, y)$. A learning algorithm I produces a model I(S) from S, i.e. a function mapping inputs x to the domain of y. The error of this model is computed as the expectation:
$$\mathrm{Error}(I(S)) = E_{x,y}[L(y, I(S)(x))],$$
where L is some loss function that measures the discrepancy between its two arguments. Since the learning sample S is randomly drawn from the distribution D, the model I(S) and its prediction I(S)(x) at a point x are also random. Hence, Error(I(S)) is again a random variable, and we are interested in studying its expected value over the set of all learning sets of size m, $E_S[\mathrm{Error}(I(S))]$. This error can be decomposed as:
$$E_S[\mathrm{Error}(I(S))] = E_x[E_S[E_{y|x}[L(y, I(S)(x))]]] = E_x[E_S[\mathrm{Error}(I(S)(x))]],$$
where Error(I(S)(x)) denotes the local error at the point x. Bias/variance decompositions usually try to decompose this local error into three terms: the residual or minimal attainable error, the systematic error, and the effect of the variance. The exact decomposition depends on the loss function L. The next two subsections are devoted to the most common loss functions, i.e. the squared loss for regression problems and the 0-1 loss for classification problems. Notice, however, that these are not the only plausible loss functions, and several authors have studied bias/variance decompositions for other losses (Wolpert, 1997; Hansen, 2000). Actually, several of the decompositions for the 0-1 loss presented below are derived as special cases of more general bias/variance decompositions (Tibshirani, 1996; Wolpert, 1997; Heskes, 1998; Domingos, 1996; James, 2003). The interested reader may refer to these references for more details.

37.2.1 Bias/Variance Decomposition of the Squared Loss

When the output y is numerical, the usual loss function is the squared loss $L_2(y_1, y_2) = (y_1 - y_2)^2$. With this loss function, it is easy to show that the best possible model is $f_b(x) = E_{y|x}[y]$, which takes the expectation of the target y at each point x. The best model according to a given loss function is often called the Bayes model in statistical pattern recognition. Introducing this model into the mean local error, some elementary calculations give:
$$E_S[\mathrm{Error}(I(S)(x))] = E_{y|x}[(y - f_b(x))^2] + E_S[(f_b(x) - I(S)(x))^2]. \qquad (37.1)$$
Symmetrically to the Bayes model, let us define the average model, $f_{avg}(x) = E_S[I(S)(x)]$, which outputs the average prediction over all learning sets. Introducing this model into the second term of Equation (37.1), we obtain:
$$E_S[(f_b(x) - I(S)(x))^2] = (f_b(x) - f_{avg}(x))^2 + E_S[(I(S)(x) - f_{avg}(x))^2].$$
In summary, we have the following well-known decomposition of the mean squared error at a point x:
$$E_S[\mathrm{Error}(I(S)(x))] = \sigma_R^2(x) + \mathrm{bias}_R^2(x) + \mathrm{var}_R(x),$$
by defining:
$$\sigma_R^2(x) = E_{y|x}[(y - f_b(x))^2], \qquad (37.2)$$
$$\mathrm{bias}_R^2(x) = (f_b(x) - f_{avg}(x))^2, \qquad (37.3)$$
$$\mathrm{var}_R(x) = E_S[(I(S)(x) - f_{avg}(x))^2]. \qquad (37.4)$$
This error decomposition is well known in estimation theory and was introduced into the machine learning community by (Geman et al., 1995). The residual squared error, $\sigma_R^2(x)$, is the error of the best possible model. It provides a theoretical lower bound that is independent of the learning algorithm. Thus, the suboptimality of a particular learning algorithm is composed of two terms: the (squared) bias measures the discrepancy between the best and the average model, i.e. how accurate the predictions are on average, while the variance measures the variability of the predictions with respect to the randomness of the learning set.
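As a concrete illustration of Equations (37.2)-(37.4), the three terms can be estimated by simulation on a synthetic problem where $f_b$ and the noise level are known by construction. In the following minimal sketch, the sine target, the noise level, the polynomial learner, and all parameter values are illustrative assumptions; it draws many learning sets, fits a model to each, and estimates the residual error, the squared bias, and the variance at a fixed point $x_0$:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_b(x):                        # Bayes model (known here by construction)
    return np.sin(2 * np.pi * x)

SIGMA, M, N_SETS, DEGREE, X0 = 0.3, 30, 2000, 5, 0.25

preds = np.empty(N_SETS)
for i in range(N_SETS):
    x = rng.uniform(0, 1, M)                 # draw a learning sample S
    y = f_b(x) + rng.normal(0, SIGMA, M)     # y = f_b(x) + eps
    coefs = np.polyfit(x, y, DEGREE)         # the learning algorithm I
    preds[i] = np.polyval(coefs, X0)         # prediction I(S)(x0)

f_avg = preds.mean()                         # average model at x0
residual = SIGMA**2                          # sigma_R^2(x0), Eq. (37.2)
bias2 = (f_b(X0) - f_avg)**2                 # bias_R^2(x0),  Eq. (37.3)
variance = preds.var()                       # var_R(x0),     Eq. (37.4)
print(f"residual={residual:.4f}  bias^2={bias2:.4f}  var={variance:.4f}")
print(f"sum={residual + bias2 + variance:.4f}")  # ~ E_S[Error(I(S)(x0))]
```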
Fig. 37.2. Top: the average model $f_{avg}$ versus the Bayes model $f_b$ for the too simple (left) and too complex (right) models; bottom: the residual error $\sigma_R^2(x)$, squared bias $\mathrm{bias}_R^2(x)$, and variance $\mathrm{var}_R(x)$ as a function of $x$.

To explain why these two terms are indeed the consequences of the two phenomena discussed in the introduction of this chapter, let us come back to our simple regression problem. The average model is depicted at the top of Figure 37.2 for the two cases of bad model choice. The residual error, bias, and variance at each position x are drawn at the bottom of the same figure. The residual error is entirely specified by the problem and the loss criterion, and is hence independent of the algorithm and learning set used. When the model is too simple, the average model is far from the Bayes model almost everywhere, and thus the bias is large. On the other hand, the variance is small, as the model does not match the learning set very strongly, and thus the prediction at each point does not vary much from one learning set to another. Bias is thus the dominant term of the error. When the model is too complex, the distribution of predictions matches the distribution of outputs at each point very strongly. The average prediction is thus close to the Bayes model and the bias is small. However, because of the noise and the small learning set size, predictions are highly variable at each point. In this case, variance is the dominant term of the error.

37.2.2 Bias/Variance Decompositions of the 0-1 Loss

The usual loss function for classification problems (i.e. a discrete target variable) is the 0-1 loss function, $L_c(y_1, y_2) = 1$ if $y_1 \neq y_2$ and 0 otherwise, which yields the mean misclassification error at x:
$$E_S[\mathrm{Error}(I(S)(x))] = E_S[E_{y|x}[L_c(y, I(S)(x))]] = P_{D,S}(y \neq I(S)(x) \,|\, x).$$
The Bayes model in this case is the model that outputs the most probable class at x, i.e. $f_b(x) = \arg\max_c P_D(y = c \,|\, x)$. The corresponding residual error is:
$$\sigma_C(x) = 1 - P_D(y = f_b(x) \,|\, x). \qquad (37.5)$$
By analogy with the decomposition of the squared error, it is possible to define what we call "natural" bias and variance terms for the 0-1 loss function. First, by symmetry with the Bayes model and by analogy with the squared loss decomposition, the equivalent in classification of the average model is the majority vote classifier defined by:
$$f_{maj}(x) = \arg\max_c P_S(I(S)(x) = c),$$
which outputs at each point the class receiving the majority of votes among the distribution of classifiers induced from the distribution of learning sets. In regression, the squared bias is the error of the average model with respect to the best possible model; transposing this definition here yields:
$$\mathrm{bias}_C(x) = L_c(f_b(x), f_{maj}(x)).$$
So, biased points are those for which the majority vote classifier disagrees with the Bayes classifier. On the other hand, variance can be naturally defined as:
$$\mathrm{var}_C(x) = E_S[L_c(I(S)(x), f_{maj}(x))] = P_S(I(S)(x) \neq f_{maj}(x)),$$
which is the average error of the models induced from random learning samples S with respect to the majority vote classifier. This definition is indeed a measure of the variability of the predictions at x: when $\mathrm{var}_C(x) = 0$, every model outputs the same class whatever the learning set it is induced from, and $\mathrm{var}_C(x)$ is maximal when the probability of the class given by the majority vote classifier equals 1/z (with z the number of classes), which corresponds to the most uncertain distribution of predictions.
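In practice, these quantities can only be estimated from a finite collection of induced models. The following sketch is a minimal estimator of the natural terms; the helper name and the convention that classes are coded as integers 0..z-1 are assumptions made for illustration. Given the predictions $I(S)(x)$ collected over many simulated learning sets and the true conditional distribution $P_D(y \,|\, x)$, it returns the residual error, the natural bias, the natural variance, and the mean misclassification error at $x$:

```python
import numpy as np

def natural_bias_variance(preds, p_true):
    """preds: array of class labels I(S)(x), one per learning set S.
    p_true: true conditional distribution P_D(y=c|x), indexed by class."""
    p_true = np.asarray(p_true)
    classes, counts = np.unique(preds, return_counts=True)
    p_pred = counts / counts.sum()              # P_S(I(S)(x) = c)
    f_b = int(np.argmax(p_true))                # Bayes prediction at x
    f_maj = int(classes[np.argmax(p_pred)])     # majority vote prediction
    sigma_c = 1.0 - p_true[f_b]                 # residual error, Eq. (37.5)
    bias_c = float(f_b != f_maj)                # natural bias term
    var_c = 1.0 - p_pred.max()                  # P_S(I(S)(x) != f_maj(x))
    mean_error = sum(p * (1.0 - p_true[c]) for c, p in zip(classes, p_pred))
    return sigma_c, bias_c, var_c, mean_error

# Example: preds collected over many simulated learning sets, e.g.
# natural_bias_variance(np.array([1, 1, 0, 2, 1]), [0.7, 0.2, 0.1])
```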
Unfortunately, these natural bias and variance terms do not sum up with the residual error to give the local misclassification error. In other words:
$$E_S[\mathrm{Error}(I(S)(x))] \neq \sigma_C(x) + \mathrm{bias}_C(x) + \mathrm{var}_C(x).$$
Let us illustrate with a simple example how increased variance may decrease the average classification error in some situations. Suppose that we have a 3-class problem such that the true class probability distribution at x is given by
$$(P_D(y=c_1|x),\, P_D(y=c_2|x),\, P_D(y=c_3|x)) = (0.7, 0.2, 0.1).$$
The best possible prediction at x is thus the class $c_1$, and the corresponding minimal error is 0.3. Suppose further that we have two learning algorithms $I_1$ and $I_2$ and that the distributions of predictions of the models built by these algorithms are given by:
$$(P_S(I_1(S)(x)=c_1),\, P_S(I_1(S)(x)=c_2),\, P_S(I_1(S)(x)=c_3)) = (0.1, 0.8, 0.1),$$
$$(P_S(I_2(S)(x)=c_1),\, P_S(I_2(S)(x)=c_2),\, P_S(I_2(S)(x)=c_3)) = (0.4, 0.5, 0.1).$$
So, both algorithms most probably produce models that predict class $c_2$ (with probability 0.8 and 0.5 respectively); thus, the two methods are biased ($\mathrm{bias}_C(x) = 1$). On the other hand, their variances are $\mathrm{var}_C^1(x) = 1 - 0.8 = 0.2$ and $\mathrm{var}_C^2(x) = 1 - 0.5 = 0.5$, and their mean misclassification errors are $E_S[\mathrm{Error}(I_1(S)(x))] = 0.76$ and $E_S[\mathrm{Error}(I_2(S)(x))] = 0.61$. Thus, of these two methods with identical bias, the one with the larger variance has the smaller average error rate. It is easy to see that this happens here because of the existence of a bias. Indeed, with 0-1 loss, an algorithm with small variance and high bias systematically (i.e. whatever the learning sample) produces a wrong answer, whereas an algorithm with high bias but also high variance is wrong only for a majority of learning samples, not necessarily systematically. So the latter may be better than the former. In other words, with 0-1 loss, additional variance can be beneficial because it can bring the predictions closer to the Bayes classification. (The short sketch below checks the numbers of this example.)
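The numbers of this example are easy to check mechanically. Using the fact that $E_S[\mathrm{Error}(I(S)(x))] = \sum_c P_S(I(S)(x)=c)\,(1 - P_D(y=c\,|\,x))$, a few lines of code reproduce them:

```python
import numpy as np

p_true = np.array([0.7, 0.2, 0.1])             # P_D(y=c|x) for c1, c2, c3
for name, p_pred in [("I1", np.array([0.1, 0.8, 0.1])),
                     ("I2", np.array([0.4, 0.5, 0.1]))]:
    var_c = 1.0 - p_pred.max()                 # P_S(I(S)(x) != f_maj(x))
    mean_error = np.sum(p_pred * (1.0 - p_true))
    print(f"{name}: bias_C=1, var_C={var_c:.1f}, mean error={mean_error:.2f}")
# prints var_C=0.2, error=0.76 for I1 and var_C=0.5, error=0.61 for I2
```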
As a result of this counterintuitive interaction between bias and variance terms under 0-1 loss, several authors have proposed their own decompositions. We briefly describe the most representative of them below. For a more detailed discussion of these decompositions, see for example (Geurts, 2002) or (James, 2003). In the following sections, we also present a very different approach to the study of bias and variance under 0-1 loss, due to Friedman (1997), which relates the mean error to the squared bias and variance terms of the class probability estimates.

Some decompositions

Tibshirani (1996) defines the bias as the difference between the probability of the Bayes class and the probability of the majority vote class:
$$\mathrm{bias}_T(x) = P_D(y = f_b(x) \,|\, x) - P_D(y = f_{maj}(x) \,|\, x). \qquad (37.6)$$
Thus, the sum of this bias and the residual error is exactly the misclassification error of the majority vote classifier:
$$\sigma_C(x) + \mathrm{bias}_T(x) = 1 - P_D(y = f_{maj}(x) \,|\, x) = \mathrm{Error}(f_{maj}(x)).$$
This is exactly the part of the error that would remain if we could completely cancel the variability of the predictions. The variance is then defined as the difference between the mean misclassification error and the error of the majority vote classifier:
$$\mathrm{var}_T(x) = E_S[\mathrm{Error}(I(S)(x))] - \mathrm{Error}(f_{maj}(x)). \qquad (37.7)$$
Tibshirani (1996) calls this variance term the aggregation effect: it is the variation of error that results from aggregating the predictions over all learning sets. Note that this variance term is not necessarily positive. Starting from different considerations, James (2003) has proposed exactly the same decomposition; to distinguish (37.6) and (37.7) from the natural bias and variance terms, he calls them the systematic effect and the variance effect respectively.

Dietterich and Kong (1995) have proposed a decomposition that applies only to the noise-free case, but that reduces exactly to Tibshirani's decomposition in that case. Domingos (2000) adopts the natural definitions of bias and variance given at the beginning of this section and combines them into a non-additive expression:
$$E_S[\mathrm{Error}(I(S)(x))] = b_1(x)\,\sigma_C(x) + \mathrm{bias}_C(x) + b_2(x)\,\mathrm{var}_C(x),$$
where $b_1$ and $b_2$ are factors that are functions of the true class distribution and of the distribution of predictions. Kohavi and Wolpert (1996) have proposed a very different decomposition, closer in spirit to the decomposition of the squared loss; it makes use of quadratic functions of the probabilities $P_S(I(S)(x) \,|\, x)$ and $P_D(y \,|\, x)$. Heskes (1998) adopts the natural variance term $\mathrm{var}_C$ and, ignoring the residual error, defines bias as the difference between the mean misclassification error and this variance; as a consequence, his bias can be smaller than the residual error. Breiman (1996a, 2000) has successively proposed two decompositions. In the first one, bias and variance are defined globally instead of locally: bias is the part of the error due to biased points (i.e. points such that $\mathrm{bias}_C(x) = 1$), and variance is the part of the error due to unbiased points.

This multitude of decompositions reflects the complexity of the interaction between bias and variance in classification. Each decomposition has its pros and cons. Notably, we may in some cases observe counterintuitive behavior with respect to what would be observed with the classical decomposition of the squared error (e.g., a negative variance). This makes the choice of a particular decomposition difficult, both in theoretical and in empirical studies. Nevertheless, all decompositions have proven useful for analyzing classification algorithms, each one at least in the context of its introduction.

Bias and variance of class probability estimates

Many classification algorithms work by first computing an estimate $I_c(S)(x)$ of the conditional probability of each class c at x and then deriving their classification model by:
$$I(S)(x) = \arg\max_c I_c(S)(x).$$
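As a minimal illustration of this two-step scheme, the sketch below uses a hypothetical nearest-neighbour frequency estimator for $I_c(S)(x)$; the chapter does not prescribe any particular estimator, and the one-dimensional inputs and integer-coded classes are assumptions made for brevity:

```python
import numpy as np

def knn_class_probs(S_x, S_y, x, n_classes, k=5):
    """I_c(S)(x): estimate P(y=c|x) by the class frequencies among the
    k nearest neighbours of x in the learning sample S."""
    d = np.abs(S_x - x)                       # 1-D inputs for simplicity
    nearest = np.argsort(d)[:k]
    return np.bincount(S_y[nearest], minlength=n_classes) / k

def classify(S_x, S_y, x, n_classes, k=5):
    """I(S)(x) = argmax_c I_c(S)(x)."""
    return int(np.argmax(knn_class_probs(S_x, S_y, x, n_classes, k)))

# Example with a toy learning sample S of 50 points and 3 classes:
# rng = np.random.default_rng(2)
# S_x, S_y = rng.uniform(0, 1, 50), rng.integers(0, 3, 50)
# classify(S_x, S_y, 0.5, n_classes=3)
```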