Statistics, Data Mining, and Machine Learning in Astronomy — 394 • Chapter 9. Classification
The effect of updating w(x_i) is to increase the weight of the misclassified data. After K iterations, the final classification is given by the weighted votes of each classifier, given by eq. 9.51. As the total error, e_m, decreases, the weight of that iteration in the final classification increases.

A fundamental limitation of the boosted decision tree is the computation time for large data sets. Unlike random forests, which can be trivially parallelized, boosted decision trees rely on a chain of classifiers, each of which depends on the last. This may limit their usefulness on very large data sets. Other methods for boosting have been developed, such as gradient boosting; see [5]. Gradient boosting involves approximating a steepest-descent criterion after each simple evaluation, such that an additional weak classifier can improve the classification score; it may scale better to larger data sets.

Scikit-learn contains several flavors of boosted decision trees, which can be used for classification or regression. For example, boosted classification tasks can be approached as follows:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    X = np.random.random((100, 2))  # 100 pts in 2 dims
    y = (X[:, 0] + X[:, 1] > 1).astype(int)  # simple division

    model = GradientBoostingClassifier()
    model.fit(X, y)
    y_pred = model.predict(X)

For more details see the Scikit-learn documentation, or the source code of figure 9.16.

Figure 9.16 shows the results for a gradient-boosted decision tree applied to the SDSS photometric redshift data. For the weak estimator, we use a decision tree with a limited maximum depth. The cross-validation results are shown as a function of the number of boosting iterations. By 500 steps, the cross-validation error is beginning to level out, but there are still no signs of overfitting. The fact that the training error and cross-validation error remain very close indicates that a more complicated model (i.e.,
deeper trees or more boostings) would likely allow improved errors. Even so, the rms error recovered with these suboptimal parameters is comparable to that of the random forest classifier.

Figure 9.16. Photometric redshift estimation using gradient-boosted decision trees, with 100 boosting steps. (Left panel: rms error for the cross-validation and training sets as a function of the number of boosts; right panel: z_fit vs. z_true, with N = 500 and rms = 0.018.) As with random forests (figure 9.15), boosting allows for improved results over the single tree case (figure 9.14). Note, however, that the computational cost of boosted decision trees is such that it is computationally prohibitive to use very deep trees. By stringing together a large number of very naive estimators, boosted trees improve on the underfitting of each individual estimator.

9.8 Evaluating Classifiers: ROC Curves

Comparing the performance of classifiers is an important part of choosing the best classifier for a given task. "Best" in this case can be highly subjective: for some problems, one might wish for high completeness at the expense of contamination; at other times, one might wish to minimize contamination at the expense of completeness. One way to visualize this is to plot receiver operating characteristic (ROC) curves (see §4.6.1). An ROC curve usually shows the true-positive rate as a function of the false-positive rate as the discriminant function is varied. How the function is varied depends on the model: in the example of Gaussian naive Bayes, the curve is drawn by classifying data using relative probabilities between 0 and 1.

A set of ROC curves for a selection of the classifiers explored in this chapter is shown in the left panel of figure 9.17. The curves closest to the upper left of the plot are the best classifiers: for the RR Lyrae data set, the ROC curves indicate that GMM Bayes and K-nearest-neighbor classification outperform the rest. For such
an unbalanced data set, however, ROC curves can be misleading. Because there are fewer than five sources for every 1000 background objects, a false-positive rate of even 0.05 means that false positives outnumber true positives ten to one! When sources are rare, it is often more informative to plot the efficiency (equal to one minus the contamination, eq. 9.5) vs. the completeness (eq. 9.5). This can give a better idea of how well a classifier is recovering rare data from the background.

The right panel of figure 9.17 shows the completeness vs. efficiency for the same set of classifiers. A striking feature is that the simpler classifiers reach a maximum efficiency of about 0.25: this means that, at their best, only 25% of objects identified as RR Lyrae are actual RR Lyrae. By the completeness–efficiency measure, the GMM Bayes model outperforms all others, allowing for higher completeness at virtually any efficiency level. We stress that this is not a general result, and that the best classifier for any task depends on the precise nature of the data.

Figure 9.17. ROC curves (left panel) and completeness–efficiency curves (right panel) for the four-color RR Lyrae data using several of the classifiers explored in this chapter: Gaussian naive Bayes (GNB), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), logistic regression (LR), K-nearest-neighbor classification (KNN), decision tree classification (DT), and GMM Bayes classification (GMMB). See color plate.

As an example where the ROC curve is a more useful diagnostic, figure 9.18 shows ROC curves for the classification of stars and quasars from four-color photometry (see the description of the data set in §9.1). The stars and quasars in this sample are selected with differing selection functions: for this reason, the data set does not reflect a realistic sample, and we use it for purposes of illustration only. The stars outnumber the quasars by only a factor of 3, meaning that a false-positive rate of 0.3 corresponds to a contamination of ∼50%. Here we see that the best-performing classifiers are the neighbors-based and tree-based classifiers, both of which approach 100% true positives with a very small number of false positives. An interesting feature is that classifiers with linear discriminant functions (LDA and logistic regression) plateau at a true-positive rate of 0.9. These simple classifiers, while useful in some situations, do not adequately explain these photometric data.

Figure 9.18. The left panel shows the data used in color-based photometric classification of stars and quasars, in the g−r vs. u−g plane; stars are indicated by gray points, while quasars are indicated by black points. The right panel shows ROC curves for quasar identification based on u−g, g−r, r−i, and i−z colors. Labels are the same as those in figure 9.17. See color plate.
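The boosting idea above (reweight misclassified points, then combine the weak classifiers by weighted votes) can be made concrete with a short sketch. This is a schematic AdaBoost-style update on synthetic data, not the book's exact eq. 9.51; the toy data set, the choice of K = 10 iterations, and the depth-1 stumps are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: labels in {-1, +1} from a simple linear boundary
rng = np.random.RandomState(0)
X = rng.random_sample((200, 2))
y = np.where(X[:, 0] + X[:, 1] > 1, 1, -1)

w = np.ones(len(X)) / len(X)        # start with uniform weights
estimators, alphas = [], []
for m in range(10):                 # K = 10 boosting iterations
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y))   # weighted training error e_m
    alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))
    w *= np.exp(-alpha * y * pred)  # misclassified points grow in weight
    w /= w.sum()                    # renormalize
    estimators.append(stump)
    alphas.append(alpha)

# Final classification: weighted vote of the weak classifiers;
# low-error iterations (large alpha) count more, as in the text
F = sum(a * est.predict(X) for a, est in zip(alphas, estimators))
y_pred = np.sign(F)
```

Each pass fits a stump to the current weights, so points the ensemble keeps getting wrong dominate later fits; this chain structure is also why, as noted above, boosting is hard to parallelize.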
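An ROC curve like those in figures 9.17 and 9.18 can be traced for any classifier that returns a score, by sweeping the decision threshold. A minimal sketch using scikit-learn's roc_curve with Gaussian naive Bayes, on synthetic stand-in data rather than the actual RR Lyrae sample:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Two overlapping 4-feature classes standing in for the color data
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0.0, 1.0, (500, 4)),
               rng.normal(1.0, 1.0, (500, 4))])
y = np.array([0] * 500 + [1] * 500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # posterior P(class 1 | x)

# Sweeping a threshold over the posterior probabilities traces
# out the (false-positive rate, true-positive rate) curve
fpr, tpr, thresholds = roc_curve(y_test, scores)
```

Curves hugging the upper-left corner (high true-positive rate at low false-positive rate) correspond to the better classifiers.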
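For rare sources, the completeness–efficiency curve of figure 9.17 (right panel) is the more informative diagnostic. Since efficiency (one minus the contamination) is what the information-retrieval literature calls precision, and completeness is recall, scikit-learn's precision_recall_curve computes it directly. A sketch with a deliberately unbalanced synthetic sample; the 100:1 imbalance and class separations are illustrative choices, not the RR Lyrae data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# 50 rare "sources" hidden among 5000 "background" points
rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0.0, 1.0, (5000, 4)),   # background
               rng.normal(1.5, 1.0, (50, 4))])    # rare sources
y = np.array([0] * 5000 + [1] * 50)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]

# efficiency = 1 - contamination (i.e., precision),
# completeness = recall, as the score threshold is varied
efficiency, completeness, thresholds = precision_recall_curve(y_test, scores)
```

Plotting completeness against efficiency for each classifier then reproduces the style of the right panel of figure 9.17, making the cost of contamination explicit in a way the ROC curve hides for unbalanced data.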