ERROR ESTIMATION FOR PATTERN RECOGNITION

IEEE Press
445 Hoes Lane
Piscataway, NJ 08854

IEEE Press Editorial Board
Tariq Samad, Editor in Chief

George W. Arnold
Dmitry Goldgof
Ekram Hossain
Mary Lanzerotti
Pui-In Mak
Ray Perez
Linda Shafer
MengChu Zhou
George Zobrist

Kenneth Moore, Director of IEEE Book and Information Services (BIS)

Technical Reviewer
Frank Alexander, Los Alamos National Laboratory

ERROR ESTIMATION FOR PATTERN RECOGNITION

ULISSES M. BRAGA-NETO
EDWARD R. DOUGHERTY

Copyright © 2015 by The Institute of Electrical and Electronics Engineers, Inc.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. All rights reserved.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data is available.

ISBN: 978-1-118-99973-8

Printed in the United States of America

To our parents (in memoriam)
Jacinto and Consuêlo
and
Russell and Ann

And to our wives and children
Flávia, Maria Clara, and Ulisses
and
Terry, Russell, John, and Sean

CONTENTS

PREFACE  xiii
ACKNOWLEDGMENTS  xix
LIST OF SYMBOLS  xxi

1 CLASSIFICATION
  1.1 Classifiers
  1.2 Population-Based Discriminants
  1.3 Classification Rules
  1.4 Sample-Based Discriminants / 13
    1.4.1 Quadratic Discriminants / 14
    1.4.2 Linear Discriminants / 15
    1.4.3 Kernel Discriminants / 16
  1.5 Histogram Rule / 16
  1.6 Other Classification Rules / 20
    1.6.1 k-Nearest-Neighbor Rules / 20
    1.6.2 Support Vector Machines / 21
    1.6.3 Neural Networks / 22
    1.6.4 Classification Trees / 23
    1.6.5 Rank-Based Rules / 24
  1.7 Feature Selection / 25
  Exercises / 28

2 ERROR ESTIMATION  35
  2.1 Error Estimation Rules / 35
  2.2 Performance Metrics / 38
    2.2.1 Deviation Distribution / 39
    2.2.2 Consistency / 41
    2.2.3 Conditional Expectation / 41
    2.2.4 Linear Regression / 42
    2.2.5 Confidence Intervals / 42
  2.3 Test-Set Error Estimation / 43
  2.4 Resubstitution / 46
  2.5 Cross-Validation / 48
  2.6 Bootstrap / 55
  2.7 Convex Error Estimation / 57
  2.8 Smoothed Error Estimation / 61
  2.9 Bolstered Error Estimation / 63
    2.9.1 Gaussian-Bolstered Error Estimation / 67
    2.9.2 Choosing the Amount of Bolstering / 68
    2.9.3 Calibrating the Amount of Bolstering / 71
  Exercises / 73

3 PERFORMANCE ANALYSIS  77
  3.1 Empirical Deviation Distribution / 77
  3.2 Regression / 79
  3.3 Impact on Feature Selection / 82
  3.4 Multiple-Data-Set Reporting Bias / 84
  3.5 Multiple-Rule Bias / 86
  3.6 Performance Reproducibility / 92
  Exercises / 94

4 ERROR ESTIMATION FOR DISCRETE CLASSIFICATION  97
  4.1 Error Estimators / 98
    4.1.1 Resubstitution Error / 98
    4.1.2 Leave-One-Out Error / 98
    4.1.3 Cross-Validation Error / 99
    4.1.4 Bootstrap Error / 99
  4.2 Small-Sample Performance / 101
    4.2.1 Bias / 101
    4.2.2 Variance / 103

BIBLIOGRAPHY

Schiavo, R. and Hand, D. (2000). Ten more years of error rate research. International Statistical Review, 68(3):295–310.
Serdobolskii, V. (2000). Multivariate Statistical Analysis: A High-Dimensional Approach. Springer.
Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88:486–494.
Sima, C., Attoor, S., Braga-Neto, U., Lowey, J., Suh, E., and Dougherty, E. (2005a). Impact of error estimation on feature-selection algorithms. Pattern Recognition, 38(12):2472–2482.
Sima, C., Braga-Neto, U., and Dougherty, E. (2005b). Bolstered error estimation provides superior feature-set ranking for small samples. Bioinformatics, 21(7):1046–1054.
Sima, C., Braga-Neto, U., and Dougherty, E. (2011). High-dimensional bolstered error estimation. Bioinformatics, 27(21):3056–3064.
Sima, C. and Dougherty, E. (2006a). Optimal convex error estimators for classification. Pattern Recognition, 39(6):1763–1780.
Sima, C. and Dougherty, E. (2006b). What should be expected from feature selection in small-sample settings. Bioinformatics, 22(19):2430–2436.
Sima, C. and Dougherty, E. (2008). The peaking phenomenon in the presence of feature selection. Pattern Recognition Letters, 29:1667–1674.
Sitgreaves, R. (1951). On the distribution of two random matrices used in classification procedures. Annals of Mathematical Statistics, 23:263–270.
Sitgreaves, R. (1961). Some results on the distribution of the W-classification. In: Studies in Item Analysis and Prediction, Solomon, H. (editor). Stanford University Press, pp. 241–251.
Smith, C. (1947). Some examples of discrimination. Annals of Eugenics, 18:272–282.
Snapinn, S. and Knoke, J. (1985). An evaluation of smoothed classification error-rate estimators. Technometrics, 27(2):199–206.
Snapinn, S. and Knoke, J. (1989). Estimation of error rates in discriminant analysis with selection of variables. Biometrics, 45:289–299.
Stone, C. (1977). Consistent nonparametric regression. Annals of Statistics, 5:595–645.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society Series B (Methodological), 36:111–147.
Teichroew, D. and Sitgreaves, R. (1961). Computation of an empirical sampling distribution for the W-classification statistic.
In: Studies in Item Analysis and Prediction, Solomon, H. (editor). Stanford University Press, pp. 285–292.
Toussaint, G. (1974). Bibliography on estimation of misclassification. IEEE Transactions on Information Theory, IT-20(4):472–479.
Toussaint, G. and Donaldson, R. (1970). Algorithms for recognizing contour-traced handprinted characters. IEEE Transactions on Computers, 19:541–546.
Toussaint, G. and Sharpe, P. (1974). An efficient method for estimating the probability of misclassification applied to a problem in medical diagnosis. IEEE Transactions on Information Theory, IT-20(4):472–479.
Tutz, G. (1985). Smoothed additive estimators for non-error rates in multiple discriminant analysis. Pattern Recognition, 18(2):151–159.
Vapnik, V. (1998). Statistical Learning Theory. John Wiley & Sons, New York.
Verbeek, A. (1985). A survey of algorithms for exact distributions of test statistics in r×c contingency tables with fixed margins. Computational Statistics and Data Analysis, 3:159–185.
Vu, T. (2011). The bootstrap in supervised learning and its applications in genomics/proteomics. PhD thesis, Texas A&M University, College Station, TX.
Vu, T. and Braga-Neto, U. (2008). Preliminary study on bolstered error estimation in high-dimensional spaces. In: GENSIPS'2008 – IEEE International Workshop on Genomic Signal Processing and Statistics, Phoenix, AZ.
Vu, T., Sima, C., Braga-Neto, U., and Dougherty, E. (2014). Unbiased bootstrap error estimation for linear discriminant analysis. EURASIP Journal on Bioinformatics and Systems Biology, 2014:15.
Wald, A. (1944). On a statistical problem arising in the classification of an individual into one of two groups. Annals of Mathematical Statistics, 15:145–162.
Wald, P. and Kronmal, R. (1977). Discriminant functions when covariances are unequal and sample sizes are moderate. Biometrics, 33:479–484.
Webb, A. (2002). Statistical Pattern Recognition, 2nd edition. John Wiley & Sons, New York.
Wyman, F., Young, D., and Turner, D. (1990). A comparison of asymptotic error rate expansions for the sample linear discriminant function. Pattern Recognition, 23(7):775–783.
Xiao, Y., Hua, J., and Dougherty, E. (2007). Quantification of the impact of feature selection on cross-validation error estimation precision. EURASIP Journal on Bioinformatics and Systems Biology, 2007:16354.
Xu, Q., Hua, J., Braga-Neto, U., Xiong, Z., Suh, E., and Dougherty, E. (2006). Confidence intervals for the true classification error conditioned on the estimated error. Technology in Cancer Research and Treatment, 5(6):579–590.
Yousefi, M. and Dougherty, E. (2012). Performance reproducibility index for classification. Bioinformatics, 28(21):2824–2833.
Yousefi, M., Hua, J., and Dougherty, E. (2011). Multiple-rule bias in the comparison of classification rules. Bioinformatics, 27(12):1675–1683.
Yousefi, M., Hua, J., Sima, C., and Dougherty, E. (2010). Reporting bias when using real data sets to analyze classification performance. Bioinformatics, 26(1):68–76.
Zhou, X. and Mao, K. (2006). The ties problem resulting from counting-based error estimators and its impact on gene selection algorithms. Bioinformatics, 22:2507–2515.
Zollanvari, A. (2010). Analytical study of performance of error estimators for linear discriminant analysis with applications in genomics. PhD thesis, Texas A&M University, College Station, TX.
Zollanvari, A., Braga-Neto, U., and Dougherty, E. (2009). On the sampling distribution of resubstitution and leave-one-out error estimators for linear classifiers. Pattern Recognition, 42(11):2705–2723.
Zollanvari, A., Braga-Neto, U., and Dougherty, E. (2010). Joint sampling distribution between actual and estimated classification errors for linear discriminant analysis. IEEE Transactions on Information Theory, 56(2):784–804.
Zollanvari, A., Braga-Neto, U., and Dougherty, E. (2011). Analytic study of performance of error estimators for linear discriminant analysis. IEEE Transactions on Signal Processing, 59(9):1–18.
Zollanvari, A., Braga-Neto, U., and Dougherty, E. (2012). Exact representation of the second-order moments for resubstitution and leave-one-out error estimation for linear discriminant analysis in the univariate heteroskedastic Gaussian model. Pattern Recognition, 45(2):908–917.
Zollanvari, A. and Dougherty, E. (2014). Moments and root-mean-square error of the Bayesian MMSE estimator of classification error in the Gaussian model. Pattern Recognition, 47(6):2178–2192.
Zolman, J. (1993). Biostatistics: Experimental Design and Statistical Inference. Oxford University Press, New York.

AUTHOR INDEX

Afra, S., 26
Afsari, B., 25
Agresti, A., 109
Ambroise, C., 50
Anderson, T., xvii, 3, 4, 15, 145, 146, 180–183, 188, 195, 200, 201
Arnold, S.F., 243
Bartlett, P., xv
Bishop, C., 23
Bittner, M., 51, 52, 86
Boulesteix, A., 85
Bowker, xvii, 146, 182, 183
Braga-Neto, U.M., xv, 26, 38, 51, 54, 64–66, 69–71, 79, 97, 99, 101, 108–110, 118, 123
Bretz, F., 149, 168, 173
Butler, K., 199
Chandrasekaran, B., 11, 25
Chen, T., 97
Chernick, M., 56
Chung, K.L., 274, 275
Cochran, W., 97, 102, 111
Cover, T., xv, xvi, 20, 21, 28, 37, 49, 145, 280
Cramér, H., xiv, 146
Dalton, L., xv, xix, 38, 221, 225–229, 233–235, 237, 238, 241–244, 247, 249–252, 254, 255, 257, 258
Deev, A., 199, 200
DeGroot, M., 239
Devroye, L., xiii, 2, 9, 13, 16, 20, 21, 25, 28, 51–54, 63, 97, 108, 111, 113, 235, 277, 280–283
Diaconis, P., 249
Donaldson, R., 37, 49
Dougherty, E.R., xv, 4, 5, 8, 13, 28, 35, 38, 51, 52, 57–61, 64–66, 69, 71, 77, 79, 86, 93, 94, 97, 99, 101, 108–110, 116–118, 147, 221, 225–229, 233–235, 237, 238, 241–244, 247, 249–252, 254, 255, 257, 258
Duda, R., 6, 15, 16, 115
Dudoit, S., 15, 70
Efron, B., 38, 55, 56, 69, 135
Esfahani, M., xix, 4, 5, 8, 13, 116–118, 257
Evans, M., 69
Farago, A., 22
Feller, W., 111
Feynman, R., xiv
Fisher, R., 109, 146
Foley, D., 147, 199
Forster, J., 241
Freedman, D.A., 249
Fujikoshi, Y., 203
Gelman, A., 249
Geman, D., xix, 24
Genz, A., 149, 168, 173
Glick, N., 20, 38, 51, 62, 97, 111, 112
Gordon, L., 17
Hanczar, B., xv, xix, 40, 79, 108
Hand, D., xv
Hart, P., 6, 15, 16, 20, 21, 115, 280
Harter, H., 146
Haubold, H., 243
Haussler, H., 199
Hills, M., 97, 102, 146–152
Hirji, K., 109
Hirst, D., 62
Hopkins, C., 97, 102, 111
Hua, J., xix, 25, 26, 28, 86, 194, 200
Hughes, G., 25
Imhof, J., 195, 196, 198
Jain, A., 25, 28, 58, 147, 182, 200
Jiang, X., 70, 71
John, S., xvii, 1, 15, 35, 77, 97, 115, 145–147, 149, 179–182, 184, 185, 188, 191, 193, 195, 196, 200–202, 209, 218, 220, 221, 259, 277, 285
Johnson, N., 195
Kaariainen, M., xv
Kabe, D., 146
Kan, R., 11, 209
Kanal, L., 11
Kim, S., 64
Klotz, J., 109
Knight, J., 257
Knoke, J., 38, 62
Kohavi, R., 50
Kronmal, R., 16
Kudo, M., 28
Lachenbruch, P., 37, 49, 147
Langford, J., xv
Lugosi, G., 22, 38, 63
Mao, K., xv
Mathai, A.M., 243
McFarland, H., xvii, 147, 181, 193
McLachlan, G., xiii, xv, 47, 50, 145–147, 182, 199
Meshalkin, L., 199
Mickey, M., 37, 49
Molinaro, A., 79
Moore, E., 285
Moran, M., xvii, 15, 147, 184, 188, 190, 195
Muller, K.E., 244
Nijenhuis, A., 109, 124
O'Hagan, A., 241
Okamoto, M., 146, 199
Olshen, R., 17
Pawlak, M., 38, 63
Pikelis, V., 199, 200
Price, R., 195
Pudil, P., 28
Raiffa, H., 239
Raudys, S., 11, 58, 145–147, 199, 200, 204, 206, 207
Richards, D., xvii, 147, 193
Sayre, J., 146
Schiavo, R., xv
Schlaifer, R., 239
Serdobolskii, V., 199
Shao, J., 99
Sharpe, P., 58
Sima, C., xv, xix, 27, 28, 38, 57–61, 71–73, 83, 84
Sitgreaves, R., 146, 147
Sklansky, J., 28
Smith, C., 36, 285
Snapinn, S., 38, 62
Stephens, M., 199
Stewart, P.W., 244
Stone, C., 21, 37, 49
Teichroew, D., 146
Tibshirani, R., 55, 56
Toussaint, G., xv, 28, 37, 49, 51, 58, 147
Tutz, G., 62, 63
van Campenhout, J., 28
Vapnik, V., 10–12, 21, 22, 47, 277, 278, 280, 282, 283
Verbeek, A., 109
Vu, T., xix, 71, 147, 153, 191
Wagner, T., 54
Wald, A., 16, 146, 201
Waller, W., 147, 182, 200
Webb, A., 22, 28
Wilf, H., 109, 124
Wyman, F., 145, 146, 199, 200
Xiao, Y., xv, 79
Xu, Q., xv, 42, 97, 110
Young, D., 145–147, 199, 204, 206
Yousefi, M., xix, 85, 86, 89, 90, 93, 94
Zhou, X., xv
Zollanvari, A., xv, xix, 147, 149, 151, 156, 157, 160, 166, 174, 176, 188, 195, 196, 198, 200, 204–206, 208, 212, 215, 257
Zolman, J., 116
Zongker, D., 28

SUBJECT INDEX

632 bootstrap error estimator, 56–58, 60, 74, 77–79, 94, 95, 101, 106, 107, 125, 142, 154, 198
632+ bootstrap error estimator, 57, 60, 77–79, 83, 294
admissible procedure
Anderson's LDA discriminant, xvii, 15, 182, 188, 195, 201
anti-symmetric discriminant, 120, 127, 129, 132, 181
apparent error, 10–12, 47, 49, 112, 296
Appell hypergeometric function, 242
average bias, 88, 90
average comparative bias, 89, 90
average RMS, 88, 90
average variance, 88, 90
balanced bootstrap, 56
balanced design, 127, 129, 132, 152, 154, 155, 165, 183, 188, 190, 194, 196, 197, 199, 218, 220
balanced sampling, see balanced design
Bayes classifier, 2–4, 10, 17, 18, 22, 23, 28, 29, 32, 221
Bayes error, 2, 3, 5–8, 10, 17, 20, 25, 28–31, 33, 41, 51, 52, 60, 93, 98, 102, 108–110, 112, 117, 154, 155, 165, 166, 175, 196–198, 221, 232–234, 280
Bayes procedure
Bayesian error estimator, xv, xvii, 38, 223, 232, 235, 237, 238, 244, 246, 247, 249–252, 255, 256, 258
Bernoulli random variable, 31, 265
beta-binomial model, 225
bias, xv, xvi, 39–41, 44, 46–48, 50–52, 57–59, 62, 66, 68, 69, 73–75, 78, 79, 84–91, 99, 101–103, 106, 107, 109, 112–114, 116, 118, 119, 135, 146, 152, 174, 176, 199, 205, 228, 244, 245, 291, 292, 296, 299
binomial random variable, 8, 23, 44, 102, 199, 224, 265, 292
bin, 17, 97, 102, 111, 227–229, 232, 233, 237, 248
bin size, 108–110, 232, 237, 252, 257
bin counts, 17, 19, 98, 101, 103, 109, 112
biomedicine, 116
bivariate Gaussian distribution, 147, 149–151, 196, 197, 207, 271, 272
bolstered empirical distribution, 63
bolstered leave-one-out error estimator, 66, 69
bolstered resubstitution error estimator, 64–66, 68–71, 75, 77–79, 83, 84, 95
bolstering kernel, 63, 69
bootstrap
  632 error estimator, 56–58, 60, 74, 77–79, 94, 95, 101, 106, 107, 125, 142, 154, 198
  632+ error estimator, 57, 60, 77–79, 83, 294
  balanced, 56
  complete zero error estimator, 56, 100, 101, 124, 125
  convex error estimator, 57, 134, 154, 155, 193, 196, 197
  optimized error estimator, 60, 77, 79
  zero error estimator, 56, 57, 61, 100, 101, 124, 125, 133–135, 152, 154, 190, 196
  zero error estimation rule, 55, 99
bootstrap resampling, 56, 79
bootstrap error estimation rule, 55, 99, 299
bootstrap sample, 55, 56, 99, 100, 124, 134, 152, 154, 190–192
bootstrap sample means, 153, 191
Borel 𝜎-algebra, 248, 260
Borel measurable function, 222, 260, 263, 266
Borel-Cantelli Lemmas, 12, 41, 47, 261
branch-and-bound algorithm, 28
BSP, 280
calibrated bolstering, 71, 77
CART, 23, 24, 32, 53, 54, 71, 72, 279, 280
Cauchy distribution, 29
Cauchy-Schwarz Inequality, 267
cells, see bins
censored sampling, 226
Central Limit Theorem, see CLT
Chebyshev's Inequality, 209, 210, 269
chi random variable, 69
chi-square random variable, 109, 183–185, 187, 189, 191, 193, 195, 201, 210
class-conditional density, 2, 5, 6, 28, 29, 77, 227, 258
classification rule, xiii–xvi, 8–13, 16, 17, 20, 21, 23–28, 30, 31, 35–39, 41, 43, 44, 46–56, 58, 60, 62–64, 67, 71, 74, 78–83, 86–89, 93–95, 102, 111, 115–119, 142, 143, 148, 150, 176, 179, 188, 200, 218, 220, 226, 232, 248, 252, 257, 277, 279, 280, 283, 295, 296, 299
classifier, xiii–xvi, 1–4, 6–13, 15, 17–19, 21–25, 27–30, 32, 35–38, 43, 45–48, 52–56, 60, 62–64, 66, 73, 84–87, 89, 90, 92, 93, 97–99, 102, 107, 111, 114–116, 119, 120, 122, 124, 134, 140, 221, 222, 225, 228, 229, 237, 240, 248, 251–253, 257, 258, 277–280, 282, 292–297, 299
CLT, 200, 201, 203, 276, 288, 289
comparative bias, 88–90
complete enumeration, 97, 108–110, 114, 140
complete k-fold cross-validation, 50, 123
complete zero bootstrap error estimator, 56, 100, 101, 124, 125
conditional bound, 42
conditional error, 29
conditional MSE convergence, 250
conditional variance formula, 37, 226, 269
confidence bound, 39, 43, 95, 110, 111
confidence interval, 42, 43, 79–81, 109, 110, 174–176
consistency, xvii, 9, 16, 17, 21, 22, 30, 39, 41, 47, 55, 73, 112, 150, 246–250, 251, 252, 270, 293, 294, 298
consistent classification rule, 9, 150
consistent error estimator, 41, 47
constrained classifier, 10, 37
continuous mapping theorem, 289
convergence of double sequences, 285–287
convex bootstrap error estimator, 57, 134, 154, 155, 193, 196, 197
convex error estimator, 57–61, 298
correlation coefficient, 40, 42, 82, 101, 105, 108, 146, 207, 269, 294
cost of constraint, 11, 27
Cover-Hart Theorem, 21
Cover-Van Campenhout Theorem, 28
cross-validation, xv, xvi, 37, 38, 48–53, 55, 56, 58, 59, 71, 74, 75, 77–79, 83, 89, 93–95, 99, 101, 106, 107, 113, 115, 118, 119, 122–124, 130–133, 142, 221, 233, 246, 247, 255, 256, 292, 294, 296, 298, 299
curse of dimensionality, 25
decision boundary, 6, 24, 27, 62–64
decision region, 62, 64, 66, 279
deleted discriminant, 123, 215
deleted sample, 49, 50, 54, 66, 123
design error, 8–12, 20, 27, 277, 283
deviation distribution, xvi, 39, 41, 43, 77, 78, 174
deviation variance, xvi, 39–41, 78, 105, 108, 109, 118, 139, 174
diagonal LDA discriminant, 15
discrete histogram rule, see histogram rule
discrete-data plug-in rule, 18
discriminant, xv–xvii, 3–7, 13–16, 30, 32, 36, 62, 63, 67, 78, 115, 116, 119, 120, 122, 123, 125, 127, 129, 132, 135, 136, 140, 141, 145–148, 152, 153, 170, 179–182, 184, 185, 188–191, 193–196, 198–202, 204, 205, 209, 215, 240, 244, 252, 291, 293–300
discriminant analysis, 6, 36, 78, 116, 120, 145, 146, 179, 244, 293, 295–299
double asymptotics, 285–289
double limit, 285
double sequence, 285–287, 289
empirical distribution, 46, 55, 63, 77, 99, 194
empirical risk minimization, see ERM principle
entropy impurity, 23
Epanechnikov kernel, 16
ERM principle, 12, 277
error estimation rule, xiv, xvi, 27, 35, 36–38, 43, 46, 49, 55, 57, 70, 86, 87, 89, 99, 254
error estimator, xv–xvii, 20, 35, 36–42, 44–47, 50, 51, 55–59, 61, 63, 64, 66–68, 71, 73, 77–79, 82, 83, 86, 87, 89–93, 95, 98–103, 107–111, 113–116, 121, 123–125, 129, 131, 135, 140, 145, 148, 152, 157, 159, 160, 163, 165, 166, 169, 174–176, 179, 198, 221–223, 225–227, 232, 233, 235, 237, 238, 241, 244, 246, 247, 249–258, 292, 293, 298, 299
error rate, xiii, xiv, xvi, xvii, 3, 4, 7, 10, 11, 18, 25, 26, 35, 49, 50, 56, 85, 92, 98–101, 112, 115, 120–137, 139–142, 145, 148, 149, 151–155, 157, 159, 161, 163, 165–167, 169, 171, 173–176, 179, 181, 183, 188, 190, 195, 196, 198–200, 205–207, 212, 215, 274, 280, 293, 295–299
exchangeability property, 127, 182, 183, 188, 190, 194
exhaustive feature selection, 27, 28, 83
expected design error, 8, 11, 12, 20, 27
exponential rates of convergence, 20, 112
external cross-validation, 50
family of Bayes procedures
feature, xiii–xvii, 1, 2, 8, 9, 13, 15–17, 20, 24–28, 31–33, 35–39, 41–47, 50, 55, 56, 58, 60, 65, 66, 68, 70–72, 77–79, 82–87, 89, 90, 93, 97, 108, 113, 115, 116, 118–120, 127, 129, 145, 200, 201, 221, 222, 226, 244, 246, 249, 252, 254, 255, 257, 277, 283, 293–299
feature selection rule, 26, 27, 38, 50
feature selection transformation, 25, 27
feature vector, 1, 8, 25, 32, 116
feature–label distribution, xiii, xiv, xvii, 1, 2, 8, 9, 13, 20, 28, 35, 36, 38, 39, 41–44, 46, 47, 55, 56, 58, 60, 79, 83, 84, 86, 87, 89, 90, 93, 108, 113, 115, 119, 120, 127, 129, 221, 222, 226, 249, 255, 257, 277, 283
filter feature selection rule, 27, 28
First Borel-Cantelli Lemma, see Borel-Cantelli Lemmas
Gaussian kernel, 16, 65, 70, 71, 281
Gaussian-bolstered resubstitution error estimator, 67
general covariance model, 241, 244
general linear discriminant, 30, 32
general quadratic discriminant, 14
generalization error, 10, 296
Gini impurity, 23
greedy feature selection, see nonexhaustive feature selection
heteroskedastic model, 7, 118, 119, 145, 147–149, 151, 153, 156, 157, 159, 160, 163, 166, 168, 169, 173, 174, 176, 179–181, 193, 194, 300
higher-order moments, xvii, 136, 137, 139, 146, 154, 155, 157, 159, 161, 163, 165, 181, 198
histogram rule, 12, 16–19, 23, 98, 111, 112, 114, 235, 237, 281
Hoeffding's Inequality, 74, 276, 282
Hölder's Inequality, 267, 275
holdout error estimator, see test-set error estimator
homoskedastic model, 7, 117, 145, 147, 149, 151, 152, 154, 155, 165, 181, 182, 184, 188, 191, 194, 196, 197
ideal regression, 254
Imhof-Pearson approximation, 195, 196, 198, 219
impurity decrement, 24
impurity function, 23
increasing dimension limit, see thermodynamic limit
instability of classification rule, 52–54
internal variance, 37, 44, 49, 50, 51, 56, 73
inverse-gamma distribution, 242, 244
Isserlis' formula, 209
Jensen's Inequality, 3, 91, 102
John's LDA discriminant, xvii, 15, 181, 182, 184, 185, 188, 191, 193, 195, 196, 200–202, 209, 218, 220
k-fold cross-validation, 49, 50, 53, 58, 71, 74, 75, 77, 79, 86, 89, 93–95, 106, 107, 118, 119, 123, 132, 142
k-fold cross-validation error estimation rule, 49, 89
k-nearest-neighbor classification rule, see KNN classification rule
kernel bandwidth, 16
kernel density, 69
KNN classification rule, 21, 30, 31, 52–56, 60, 61, 63, 71, 72, 74, 78, 83–85, 89, 94, 95, 142, 143, 280
label, xiii, xiv, xvii, 1, 2, 8, 9, 13, 16, 17, 19, 20, 21, 23, 28, 35, 36, 38, 39, 41–47, 51, 53, 55, 56, 58, 60, 63, 64, 68–70, 79, 83, 84, 86, 87, 89, 90, 93, 102, 108, 113, 115, 116, 118–120, 127, 129, 133, 181, 183, 188, 190, 192, 221, 222, 226, 249, 255, 257, 277, 283, 296
Lagrange multipliers, 60
Law of Large Numbers, 19, 44, 74, 111, 112, 116, 275
LDA, see linear discriminant analysis
LDA-resubstitution PR rule, 37
leave-(1,1)-out error estimator, 133
leave-(k0, k1)-out error estimator, 123, 132, 133
leave-one-out, xvi, xvii, 49–55, 66, 69, 74, 77, 83, 89, 94, 95, 98, 101, 103, 105–111, 113–115, 122, 127, 130–133, 136, 139–142, 145, 152, 160, 163, 165, 166, 169, 174, 175, 200, 206, 215, 217, 232, 235, 299, 300
leave-one-out error estimation rule, 49
leave-one-out error estimator, xvii, 50, 51, 103, 108, 109, 113, 114, 131, 145, 152, 160, 163, 165, 169, 174, 235, 299
likelihood function, 224, 229
linear discriminant analysis (LDA)
  Anderson's, xvii, 15, 182, 188, 195, 201
  diagonal, 15
  John's, xvii, 15, 181, 182, 184, 185, 188, 191, 193, 195, 196, 200–202, 209, 218, 220
  population-based
  sample-based, 15, 32, 147, 179
linearly separable data, 21
log-likelihood ratio discriminant
machine learning, xiii, 199, 291, 296
Mahalanobis distance, 7, 30, 62, 154, 183, 188, 190, 192, 194, 196, 200, 201, 203, 205
Mahalanobis transformation, 270
majority voting, 17
margin (SVM), 21, 22, 32
Markov's inequality, 40, 267, 269, 275
maximal-margin hyperplane, 21
maximum a posteriori (MAP) classifier
maximum-likelihood estimator (MLE), 100
mean absolute deviation (MAD)
mean absolute rank deviation, 83
mean estimated comparative advantage, 88
mean-square error (MSE)
  predictor, 2, 268, 269
  sample-conditioned, 226, 227, 232, 233, 235–237, 244, 248, 250–254, 293
  unconditional, 40, 41, 45, 58–60, 222, 226
minimax, 4–7, 13
misclassification impurity, 23
mixed moments, 136, 139, 140, 159, 163, 200, 207
mixture sampling, xvi, 31, 115–117, 120–122, 124, 126, 128, 129, 131–133, 139, 141, 142, 145, 148, 152, 154, 166, 169, 179, 180, 196, 198, 222
MMSE calibrated error estimate, 254
moving-window rule, 16
multinomial distribution, 103
multiple-data-set reporting bias, 84, 85
multiple-rule bias, 94
multivariate Gaussian, 6, 67, 68, 118, 149, 181, 199, 200, 271, 273, 295
Naive Bayes principle, 15, 70
Naive-Bayes bolstering, 70, 71, 75, 295
"near-unbiasedness" theorem of cross-validation, 130–133, 142
nearest-mean classifier, see NMC discriminant
nearest-neighbor (NN) classification rule, 20, 21, 30, 56, 63
neural network, xvi, 12, 22, 23, 33, 48, 281, 291, 294, 297
NEXCOM routine, 109, 124
NMC discriminant, 15, 89, 279
no-free-lunch theorem for feature selection, 28
nonexhaustive feature selection, 27, 28
non-informative prior, 221, 246
nonlinear SVM, 16, 281
non-parametric maximum-likelihood estimator (NPMLE), 98
nonrandomized error estimation rule, 36, 37, 49–51, 56
nonrandomized error estimator, 37, 73, 98, 123
OBC, 258
optimal Bayesian classifier, see OBC
optimal convex estimator, 59
optimal MMSE calibration function, 254
optimistically biased, 40, 46, 50, 56–58, 66, 68, 69, 87, 97, 107
optimized bootstrap error estimator, 60, 77, 79
overfitting, 9, 11, 24, 25, 46, 47, 52, 56, 57, 66, 102
Parzen-window classification rule, 16
pattern recognition model, 35
pattern recognition rule, 35
peaking phenomenon, 25, 26, 291, 298
perceptron, 12, 16, 67, 279
pessimistically biased, 40, 50, 56, 58, 73, 92, 107
plug-in discriminant, 13
plug-in rule, 18, 30
polynomial kernel, 281
pooled sample covariance matrix, 15, 180
population, xiii–xv, 1, 2–8, 13–16, 25, 30, 31, 51, 62, 69, 74, 115, 116, 120–130, 132, 134, 135, 140, 141, 145–148, 151, 152, 176, 179–181, 183, 185, 188, 190, 192, 194, 199, 200–202, 209, 218, 294
population-specific error rates, 3, 7, 120–123, 125–130, 134, 135, 145, 148, 180
posterior distribution, 223–230, 236, 238–240, 242, 248–251, 257, 258
posterior probability, 2, 3, 8, 63
posterior probability error estimator, 38, 63, 296
PR model, see pattern recognition model
PR rule, see pattern recognition rule
power-law distribution, see Zipf distribution
predictor, 1, 2, 268, 273
prior distribution, 221–225, 227–229, 232–239, 242, 244, 245–247, 249, 254–258, 294, 297
prior probability, 2, 7, 13, 115, 116, 118, 121–123, 125, 127–131, 133, 135, 137, 139, 140, 165, 183, 188, 190, 196, 203, 205, 227, 246, 249
quadratic discriminant analysis (QDA)
  population-based
  sample-based, xvi, xvii, 14, 16, 31, 74, 79–83, 94, 95, 120, 127, 142, 143, 145, 179–181, 193, 194
quasi-binomial distribution, 199
randomized error estimation rule, 36, 37, 43, 49, 55
randomized error estimator, 37, 44, 73, 79, 106, 123
rank deviation, 83
rank-based rules, 24
Raudys-Kolmogorov asymptotic conditions, 200
Rayleigh distribution, 70
reducible bolstered resubstitution, 70
reducible error estimation rule, 38, 46, 50, 70
reducible error estimator, 45, 56, 66
regularized incomplete beta function, 242, 244
repeated k-fold cross-validation, 50, 77
reproducibility index, 93, 94, 299
reproducible with accuracy, 93
resampling, xv, xvi, 48, 55, 56, 79, 297
resubstitution, xvi, xvii, 10, 36, 37, 38, 46–48, 51, 52, 55–58, 61–71, 74, 75, 77–79, 83, 84, 95, 97, 98, 101–103, 105–115, 121, 122, 125, 127–129, 131, 136, 137, 139–141, 145, 147, 151, 152, 157, 159, 165, 166, 174–176, 190, 195, 199, 200, 205, 207, 212, 214, 215, 232, 277, 299, 300
resubstitution error estimation rule, 36, 37, 46
resubstitution error estimator, 37, 46, 64, 67, 71, 95, 98, 103, 129, 157, 159
retrospective studies, 116
root-mean-square error (RMS), 39
rule, xiii–xvi, 7–13, 16–21, 23–28, 30, 31, 35–39, 41, 43, 44, 46–60, 62–64, 66, 67, 70, 71, 74, 78–84, 86–91, 93–95, 97–99, 102, 110–112, 114–119, 142, 143, 148, 150, 176, 179, 183, 188, 194, 200, 218, 220, 221, 224, 226, 232, 235, 237, 248, 251, 252, 254, 257, 277, 279–281, 283, 293, 295, 296, 299
sample data, xiii, 1, 8–10, 18, 21, 24–26, 32, 36, 38, 42, 66, 68, 111, 116, 121, 122
sample-based discriminant, xv, 13, 15, 116, 119, 122, 123
sample-based LDA discriminant, 15, 147, 179
sample-based QDA discriminant, 14, 180
scissors plot, 11
second-order moments, xvi, 154, 157, 200, 207, 257
selection bias, 50, 291
semi-bolstered resubstitution, 66, 77, 78, 84
separate sampling, 31, 115–133, 135–137, 139, 140, 142, 145, 148, 152, 165, 179, 181, 182, 184, 188, 191, 294
separate-sampling cross-validation estimators, 123
sequential backward selection, 28
sequential floating forward selection (SFFS), 28
sequential forward selection (SFS), 28
shatter coefficients, 11, 30, 277–282
Slutsky's theorem, 202, 288
smoothed error estimator, xvi, 61, 62
spherical bolstering kernel, 65, 68–71, 75, 77
SRM, see structural risk minimization
statistical representation, xvii, 120, 146, 179, 181, 182, 193–195
stratified cross-validation, 49, 50
strong consistency, 9, 19, 41, 47, 112, 250–252
structural risk minimization (SRM), 47, 48
Student's t random variable, 183
support vector machine (SVM), xvi, 16, 21, 26, 31, 32, 67, 74, 89, 95, 116, 142, 143, 279, 281
surrogate classifier, 52, 53
SVM, see support vector machine
symmetric classification rule, 53
synthetic data, 31, 71, 89, 218
t-test statistic for feature selection, 28, 79, 86, 89, 92
tail probability, 39, 40, 41, 43, 55, 73, 74, 109
test data, 35, 43, 46
test-set (holdout) error estimator, xvi, 43–46, 73, 82, 83, 235, 237, 238, 282
thermodynamic limit, 199
tie-breaking rule, 53
top-scoring median classification rule (TSM), 25
top-scoring pairs classification rule (TSP), 24, 25
Toussaint's counterexample, 28
training data, 43, 44, 46, 62, 69, 74, 92, 98, 102, 166
true comparative advantage, 88
true error, xvi, xvii, 10, 35–44, 51–53, 60, 62, 74, 79–81, 83–86, 90, 93, 95, 97, 98, 101, 102, 108–112, 114, 115, 118–123, 125–127, 131, 135, 136, 139, 140, 145–148, 153, 154, 159, 163, 166, 174, 175, 190, 195, 200, 205, 207, 222, 225, 230, 232–235, 237, 238, 240, 245, 246, 249–251, 254–257, 277
TSM, see top-scoring median classification rule
TSP, see top-scoring pairs classification rule
unbiased, xv, 40, 44, 50, 51, 58–60, 68, 75, 82, 90–92, 99, 112, 113, 130–135, 142, 143, 152, 154, 155, 176, 193, 196, 197, 222, 223, 244, 254, 270, 299
unbiased convex error estimation, 58
unbiased estimator, 135, 222
uniform kernel, 16, 54, 55
uniformly bounded double sequence, 289
uniformly bounded random sequence, 9, 274, 275
univariate heteroskedastic Gaussian LDA model, 159, 163, 166, 168, 169, 173, 176
univariate homoskedastic Gaussian LDA model, 149, 151, 165
universal consistency, 9, 16, 17, 21, 22, 41, 55
universal strong consistency, 9, 19, 47, 112, 294
Vapnik-Chervonenkis dimension, 11, 17, 25, 30, 47, 112, 200, 277–282
Vapnik-Chervonenkis theorem, 11, 47, 277, 282
VC confidence, 48
weak consistency, 246, 247
weak* consistency, 248–251
weak* topology, 248
whitening transformation, 270, 273
Wishart distribution, 182, 239, 241–244, 255
wrapper feature selection rule, 27, 28, 83
zero bootstrap error estimator, 56, 57, 61, 100, 101, 124, 125, 133–135, 152, 154, 190, 196
zero bootstrap error estimation rule, 55, 99
Zipf distribution, 106, 108–110, 232

WILEY END USER LICENSE AGREEMENT

Go to www.wiley.com/go/eula to access Wiley's ebook EULA.

[...]

Chapter 2. Error Estimation.
This chapter covers the basics of error estimation: definitions, performance metrics for estimation rules, test-set error estimation, and training-set error estimation. It also includes a discussion of pattern recognition models. The test-set error estimator is straightforward and well understood, there being efficient distribution-free bounds on performance. However, it assumes large sample sizes, ... book is therefore on training-set error estimation, which is necessary for small-sample classifier design. We cover in this chapter the following error estimation rules: resubstitution, cross-validation, bootstrap, optimal convex estimation, smoothed error estimation, and bolstered error estimation.

Chapter 3. Performance Analysis.
The main focus of the book is on performance characterization for error estimators, ...

... Discriminant / 147
  6.3 Expected Error Rates / 148
    6.3.1 True Error / 148
    6.3.2 Resubstitution Error / 151
    6.3.3 Leave-One-Out Error / 152
    6.3.4 Bootstrap Error / 152
  6.4 Higher-Order Moments of Error Rates / 154
    6.4.1 True Error / 154
    6.4.2 Resubstitution Error / 157
    6.4.3 Leave-One-Out Error / 160
    6.4.4 Numerical Example / 165
  6.5 Sampling Distributions of Error Rates / 166
    6.5.1 ...
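The Chapter 2 excerpt above notes that the test-set (holdout) estimator admits efficient distribution-free bounds on performance. As a minimal illustration only, not code from the book, the following Python sketch computes the holdout error estimate for a nearest-mean classifier and attaches the standard Hoeffding-type distribution-free confidence interval; the two-Gaussian sampling model, the classifier choice, and the sample sizes are assumptions made purely for demonstration.

```python
# Illustrative sketch (not from the book): the test-set (holdout) error
# estimate and its distribution-free Hoeffding confidence interval.
import numpy as np

rng = np.random.default_rng(0)

def nearest_mean_classifier(X, y):
    """Train a nearest-mean (NMC) discriminant; returns a predict function."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    def predict(Z):
        d0 = ((Z - m0) ** 2).sum(axis=1)
        d1 = ((Z - m1) ** 2).sum(axis=1)
        return (d1 < d0).astype(int)
    return predict

def sample(n):
    """Two equally likely Gaussian classes in R^2 (hypothetical model)."""
    y = rng.integers(0, 2, n)
    X = rng.normal(0, 1, (n, 2)) + np.where(y[:, None] == 1, 1.0, 0.0)
    return X, y

# Train on n points, hold out m independent points for testing.
n, m, alpha = 50, 400, 0.05
Xtr, ytr = sample(n)
Xte, yte = sample(m)
predict = nearest_mean_classifier(Xtr, ytr)

eps_hat = np.mean(predict(Xte) != yte)  # holdout error estimate
# Hoeffding: P(|eps_hat - eps| >= t) <= 2 exp(-2 m t^2), for any distribution
half_width = np.sqrt(np.log(2 / alpha) / (2 * m))
print(f"holdout estimate = {eps_hat:.3f} +/- {half_width:.3f} "
      f"({100 * (1 - alpha):.0f}% distribution-free)")
```

The interval half-width shrinks only like m^(-1/2) regardless of the distribution, which is why, as the excerpt says, holdout is attractive at large sample sizes but unusable in the small-sample settings the book targets.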
  4.3 Large-Sample Performance / 110
  Exercises / 114

5 DISTRIBUTION THEORY  115
  5.1 Mixture Sampling Versus Separate Sampling / 115
  5.2 Sample-Based Discriminants Revisited / 119
  5.3 True Error / 120
  5.4 Error Estimators / 121
    5.4.1 Resubstitution Error / 121
    5.4.2 Leave-One-Out Error / 122
    5.4.3 Cross-Validation Error / 122
    5.4.4 Bootstrap Error / 124
  5.5 Expected Error Rates / 125
    5.5.1 True Error / 125
    5.5.2 Resubstitution Error / 128
    5.5.3 Leave-One-Out Error / 130
    5.5.4 Cross-Validation Error / 132
    5.5.5 Bootstrap Error / 133
  5.6 Higher-Order Moments of Error Rates / 136
    5.6.1 True Error / 136
    5.6.2 Resubstitution Error / 137
    5.6.3 Leave-One-Out Error / 139
  5.7 Sampling Distribution of Error Rates / 140
    5.7.1 Resubstitution Error / 140
    5.7.2 Leave-One-Out Error / 141
  Exercises / 142

6 ...

[...]

... inadequate in most pattern recognition texts. The fundamental entity in error analysis is the joint distribution between the true error and the error estimate. Of special importance is the regression between them. This chapter discusses the deviation distribution between the true and estimated error, with particular attention to the root-mean-square (RMS) error as the key metric of error estimation performance. ...

... John's LDA discriminants
population-specific bin probabilities and counts
feature selection rule
error estimation rule
error estimator
population-specific estimated error rates
error estimator bias
error estimator deviation variance
error estimator root mean square error
error estimator internal variance
error estimator correlation coefficient
classical resubstitution estimator
resubstitution estimator ...

... calibration of non-Bayesian error estimators relative to a prior distribution governing model uncertainty. We close by noting that for researchers in applied mathematics, statistics, engineering, and computer science, error estimation for pattern recognition offers a wide open field with fundamental and practically important problems around every turn. Despite this fact, error estimation has been a neglected ...

... impact of error estimation on feature selection, bias that can arise when considering multiple data sets or multiple rules, and measurement of performance reproducibility.

Chapter 4. Error Estimation for Discrete Classification.
For discrete classification, the moments of resubstitution and the basic resampling error estimators, cross-validation and bootstrap, can be represented in finite-sample closed forms, ...

... our joint work on error estimation for pattern recognition. In this regard, we would like to acknowledge the students whose excellent work has contributed to the content: Chao Sima, Jianping Hua, Thang Vu, Lori Dalton, Mohammadmahdi Yousefi, Mohammad Esfahani, Amin Zollanvari, and Blaise Hanczar. We also would like to thank Don Geman for many hours of interesting conversation about error estimation. We appreciate ...
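The excerpts above describe the deviation distribution between the true and estimated errors, summarized by the bias, deviation variance, and RMS metrics listed among the symbols. As a self-contained illustration, again not code from the book, the following Python sketch approximates these quantities by Monte Carlo for resubstitution and leave-one-out applied to a nearest-mean classifier; the two-Gaussian model, the balanced sampling design, the class separation, and all sample sizes are arbitrary assumptions chosen for demonstration.

```python
# Illustrative sketch (not from the book): empirical deviation distribution of
# resubstitution and leave-one-out versus the true error, with the bias,
# deviation-variance, and RMS summaries described in the preface excerpts.
import numpy as np

rng = np.random.default_rng(1)
DIM, DELTA = 2, 1.0  # assumed model: N(0, I) vs N(DELTA * 1, I)

def sample(n):
    """Balanced design: exactly n/2 points per class (assumed here)."""
    y = rng.permutation(np.repeat([0, 1], n // 2))
    X = rng.normal(0, 1, (n, DIM)) + DELTA * y[:, None]
    return X, y

def train(X, y):
    """Nearest-mean discriminant; returns a predict function."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    return lambda Z: (((Z - m1) ** 2).sum(1) < ((Z - m0) ** 2).sum(1)).astype(int)

def error(predict, X, y):
    return np.mean(predict(X) != y)

n, n_test, n_mc = 20, 20000, 500
Xbig, ybig = sample(n_test)  # large test set approximates the true error
records = []
for _ in range(n_mc):
    X, y = sample(n)
    predict = train(X, y)
    true_err = error(predict, Xbig, ybig)
    resub = error(predict, X, y)  # resubstitution: error on the training data
    # leave-one-out: hold out each point, retrain, count errors on held-out points
    loo = np.mean([train(np.delete(X, i, 0), np.delete(y, i))(X[i:i + 1])[0] != y[i]
                   for i in range(n)])
    records.append((true_err, resub, loo))

true_err, resub, loo = np.array(records).T
for name, est in [("resub", resub), ("loo", loo)]:
    dev = est - true_err  # deviation from the true error
    rms = np.sqrt(np.mean(dev ** 2))
    print(f"{name}: bias={dev.mean():+.4f}  dev-var={dev.var():.5f}  RMS={rms:.4f}")
```

On models of this kind one typically sees resubstitution come out optimistically biased while leave-one-out is nearly unbiased but exhibits a larger deviation variance, which is the qualitative picture whose exact small-sample moments are the subject of Chapters 4 and 5.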