Springer Series in Statistics

Advisors: P. Bickel, P. Diggle, S. Fienberg, U. Gather, I. Olkin, S. Zeger

For other titles published in this series, go to http://www.springer.com/series/692

Bertrand Clarke · Ernest Fokoué · Hao Helen Zhang

Principles and Theory for Data Mining and Machine Learning

Bertrand Clarke
University of Miami
120 NW 14th Street
CRB 1055 (C-213)
Miami, FL 33136
bclarke2@med.miami.edu

Ernest Fokoué
Center for Quality and Applied Statistics
Rochester Institute of Technology
98 Lomb Memorial Drive
Rochester, NY 14623
ernest.fokoue@gmail.com

Hao Helen Zhang
Department of Statistics
North Carolina State University
Genetics P.O. Box 8203
Raleigh, NC 27695-8203, USA
hzhang2@stat.ncsu.edu

ISSN 0172-7397
ISBN 978-0-387-98134-5
e-ISBN 978-0-387-98135-2
DOI 10.1007/978-0-387-98135-2
Springer Dordrecht Heidelberg London New York
Library of Congress Control Number: 2009930499

© Springer Science+Business Media, LLC 2009

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper.

Springer is part of Springer Science+Business Media (www.springer.com)

Preface

The idea for this book came from the time the authors spent at the Statistics and Applied Mathematical Sciences Institute (SAMSI) in Research Triangle Park in North Carolina, starting in fall 2003. The first author was there for a total of two years, the first year as a Duke/SAMSI Research Fellow. The second author was there for a year as a Post-Doctoral Scholar. The third author has the great fortune to be in RTP permanently.

SAMSI was – and remains – an incredibly rich intellectual environment with a general atmosphere of free-wheeling inquiry that cuts across established fields. SAMSI encourages creativity: it is the kind of place where researchers can be found at work in the small hours of the morning – computing, interpreting computations, and developing methodology. Visiting SAMSI is a unique and wonderful experience. The people most responsible for making SAMSI the great success it is include Jim Berger, Alan Karr, and Steve Marron. We would also like to express our gratitude to Dalene Stangl and all the others from Duke, UNC-Chapel Hill, and NC State, as well as to the visitors (short and long term) who were involved in the SAMSI programs. It was a magical time we remember with ongoing appreciation.

While we were there, we participated most in two groups: Data Mining and Machine Learning, for which Clarke was the group leader, and a General Methods group run by David Banks. We thank David for being a continual source of enthusiasm and inspiration. The first chapter of this book is based on the outline of the first part of his short course on Data Mining and Machine Learning. Moreover, David graciously contributed many of his figures to us. Specifically, we gratefully acknowledge that Figs. 1.1–1.6; Figs. 2.1, 2.3, 2.4, 2.5, and 2.7; Fig. 4.2; Figs. 8.3 and 8.6; and Figs. 9.1 and 9.2 were
either done by him or prepared under his guidance.

On the other side of the pond, the Newton Institute at Cambridge University provided invaluable support and stimulation to Clarke when he visited for three months in 2008. While there, he completed the final versions of Chapters and. Like SAMSI, the Newton Institute was an amazing, wonderful, and intense experience. This work was also partially supported by Clarke's NSERC Operating Grant 2004–2008. In the USA, Zhang's research has been supported over the years by two grants from the National Science Foundation. Some of the research those grants supported is in Chapter 10.

We hope that this book will be of value as a graduate text for a PhD-level course on data mining and machine learning (DMML). However, we have tried to make it comprehensive enough that it can be used as a reference or for independent reading. Our paradigm reader is someone in statistics, computer science, or electrical or computer engineering who has taken advanced calculus and linear algebra, a strong undergraduate probability course, and basic undergraduate mathematical statistics. Someone whose expertise is in one of the topics covered here will likely find that chapter routine, but will hopefully find the other chapters are at a comfortable level.

The book roughly separates into three parts. Part I consists of Chapters 1 through 4: this is mostly a treatment of nonparametric regression, assuming a mastery of linear regression. Part II consists of Chapters 5, 6, and 7: this is a mix of classification, recent nonparametric methods, and computational comparisons. Part III consists of Chapters 8 through 11; these focus on high-dimensional problems, including clustering, dimension reduction, variable selection, and multiple comparisons. We suggest that a selection of topics from the first two parts would be a good one-semester course and that a selection of topics from Part III would be a good follow-up course.

There are many topics left out: proper treatments of information theory, VC dimension, PAC learning, Oracle inequalities, hidden Markov models, graphical models, frames, and wavelets are the main absences. We regret this, but no book can be everything.

The main perspective undergirding this work is that DMML is a fusion of large sectors of statistics, computer science, and electrical and computer engineering. The DMML fusion rests on good prediction and a complete assessment of modeling uncertainty as its main organizing principles. The assessment of modeling uncertainty ideally includes all of the contributing factors, including those commonly neglected, in order to be valid. Given this, other aspects of inference – model identification, parameter estimation, hypothesis testing, and so forth – can largely be regarded as a consequence of good prediction. We suggest that the development and analysis of good predictors is the paradigm problem for DMML.

Overall, for students and practitioners alike, DMML is an exciting context in which whole new worlds of reasoning can be productively explored and applied to important problems.

Bertrand Clarke, University of Miami, Miami, FL
Ernest Fokoué, Kettering University, Flint, MI
Hao Helen Zhang, North Carolina State University, Raleigh, NC

Contents

Preface   v

1 Variability, Information, and Prediction
  1.0.1 The Curse of Dimensionality
  1.0.2 The Two Extremes
1.1 Perspectives on the Curse
  1.1.1 Sparsity
  1.1.2 Exploding Numbers of Models
  1.1.3 Multicollinearity and Concurvity
  1.1.4 The Effect of Noise   10
1.2 Coping with the Curse   11
  1.2.1 Selecting Design Points   11
  1.2.2 Local Dimension   12
  1.2.3 Parsimony   17
1.3 Two Techniques   18
  1.3.1 The Bootstrap   18
  1.3.2 Cross-Validation   27
1.4 Optimization and Search   32
  1.4.1 Univariate Search   32
  1.4.2 Multivariate Search   33
  1.4.3 General Searches   34
  1.4.4 Constraint Satisfaction and Combinatorial Search   35
1.5 Notes   38
  1.5.1 Hammersley Points   38
  1.5.2 Edgeworth Expansions for the Mean   39
  1.5.3 Bootstrap Asymptotics for the Studentized Mean   41
1.6 Exercises   43

2 Local Smoothers   53
2.1 Early Smoothers   55
2.2 Transition to Classical Smoothers   59
  2.2.1 Global Versus Local Approximations   60
  2.2.2 LOESS   64
2.3 Kernel Smoothers   67
  2.3.1 Statistical Function Approximation   68
  2.3.2 The Concept of Kernel Methods and the Discrete Case   73
  2.3.3 Kernels and Stochastic Designs: Density Estimation   78
  2.3.4 Stochastic Designs: Asymptotics for Kernel Smoothers   81
  2.3.5 Convergence Theorems and Rates for Kernel Smoothers   86
  2.3.6 Kernel and Bandwidth Selection   90
  2.3.7 Linear Smoothers   95
2.4 Nearest Neighbors   96
2.5 Applications of Kernel Regression   100
  2.5.1 A Simulated Example   100
  2.5.2 Ethanol Data   102
2.6 Exercises   107

3 Spline Smoothing   117
3.1 Interpolating Splines   117
3.2 Natural Cubic Splines   123
3.3 Smoothing Splines for Regression   126
  3.3.1 Model Selection for Spline Smoothing   129
  3.3.2 Spline Smoothing Meets Kernel Smoothing   130
3.4 Asymptotic Bias, Variance, and MISE for Spline Smoothers   131
  3.4.1 Ethanol Data Example – Continued   133
3.5 Splines Redux: Hilbert Space Formulation   136
  3.5.1 Reproducing Kernels   138
  3.5.2 Constructing an RKHS   141
  3.5.3 Direct Sum Construction for Splines   146
  3.5.4 Explicit Forms   149
  3.5.5 Nonparametrics in Data Mining and Machine Learning   152
3.6 Simulated Comparisons   154
  3.6.1 What Happens with Dependent Noise Models?   157
  3.6.2 Higher Dimensions and the Curse of Dimensionality   159
3.7 Notes   163
  3.7.1 Sobolev Spaces: Definition   163
3.8 Exercises   164

4 New Wave Nonparametrics   171
4.1 Additive Models   172
  4.1.1 The Backfitting Algorithm   173
  4.1.2 Concurvity and Inference   177
  4.1.3 Nonparametric Optimality   180
4.2 Generalized Additive Models   181
4.3 Projection Pursuit Regression   184
4.4 Neural Networks   189
  4.4.1 Backpropagation and Inference   192
  4.4.2 Barron's Result and the Curse   197
  4.4.3 Approximation Properties   198
  4.4.4 Barron's Theorem: Formal Statement   200
4.5 Recursive Partitioning Regression   202
  4.5.1 Growing Trees   204
  4.5.2 Pruning and Selection   207
  4.5.3 Regression   208
  4.5.4 Bayesian Additive Regression Trees: BART   210
4.6 MARS   210
4.7 Sliced Inverse Regression   215
4.8 ACE and AVAS   218
4.9 Notes   220
  4.9.1 Proof of Barron's Theorem   220
4.10 Exercises   224

5 Supervised Learning: Partition Methods   231
5.1 Multiclass Learning   233
5.2 Discriminant Analysis   235
  5.2.1 Distance-Based Discriminant Analysis   236
  5.2.2 Bayes Rules   241
  5.2.3 Probability-Based Discriminant Analysis   245
5.3 Tree-Based Classifiers   249
  5.3.1 Splitting Rules   249
  5.3.2 Logic Trees   253
  5.3.3 Random Forests   254
5.4 Support Vector Machines   262
  5.4.1 Margins and Distances   262
  5.4.2 Binary Classification and Risk   265
  5.4.3 Prediction Bounds for Function Classes   268
  5.4.4 Constructing SVM Classifiers   271
  5.4.5 SVM Classification for Nonlinearly Separable Populations   279
  5.4.6 SVMs in the General Nonlinear Case   282
  5.4.7 Some Kernels Used in SVM Classification   288
  5.4.8 Kernel Choice, SVMs and Model Selection   289
  5.4.9 Support Vector Regression   290
  5.4.10 Multiclass Support Vector Machines   293
5.5 Neural Networks   294
5.6 Notes   296
  5.6.1 Hoeffding's Inequality   296
  5.6.2 VC Dimension   297
5.7 Exercises   300

6 Alternative Nonparametrics   307
6.1 Ensemble Methods   308
  6.1.1 Bayes Model Averaging   310
  6.1.2 Bagging   312
  6.1.3 Stacking   316
  6.1.4 Boosting   318
  6.1.5 Other Averaging Methods   326
  6.1.6 Oracle Inequalities   328
6.2 Bayes Nonparametrics   334
Index

additive models, 172 hypothesis test for terms, 178 optimality, 181 additivity and variance stabilization, 218 Akaike information criterion, see information criteria alternating conditional expectations, 218 Australian crabs self-organizing maps, 557 backfitting, 184 backfitting algorithm, 173–177 Bayes, see Bayes variable selection, information criteria, see Bayes testing model average dilution, 311 cluster validation, 482 cross-validation, 598 extended formulation, 400–402 model average, 310–312 nonparametrics, 334 Dirichlet process, 334, 460 Polya tree priors, 336 Occam’s window, 312 Zellner g-prior, 311 Bayes clustering, 458 general case, 460 hierarchical, 458, 461 hypothesis testing, 461 Bayes rules classification, 241 high dimension, 248 normal case, 242 risk, 243 Idiot’s, 241 multiclass SVM, 294 relevance vector classification, 350 risk, 331 splines, 343 Bayes testing, 727 decision theoretic, 731 alternative losses, 734 linear loss, 736 proportion of false positives, 732 zero one loss, 733 hierarchical, 728 paradigm example, 728–729 pFDR, 719, 729 step down procedure, 730 Bayes variable selection and information criteria, 652 Bayes factors, 648 variants, 649 choice of prior, 635 on parameters, 638–643 on the model space, 636–638 dilution priors on the model space, 637 hierarchical formulation, 633 independence priors on the model space, 636 Markov chain Monte Carlo, 643 closed form posterior, 643 Gibbs sampler, 645 Metropolis-Hastings, 645 normal-normal on parameters, 641 point-normal on parameters, 639 prior as penalty, 650 scenarios, 632 spike and slab on parameters, 638 stochastic search, 646 Zellner’s g-prior on parameters, 640 Bayesian testing step down procedure parallel to BH, 731 big-O, 23 blind source separation model, see independent component analysis boosting and LASSO, 621 as logistic regression, 320 properties, 325 training error, 323 bootstrap, 18 asymptotics, 41–43 in parameter estimation, 21 with a pivot, 23 without a pivot, 25 branch and bound, 573 classification, see Bayes rules, see discriminant function, see logic trees, see neural networks, see random forests, see relevance vector classification, see support vector machines, see tree based, see discriminant function, see Bayes rules, see logistic regression classification neural networks, 294–295 clustering, 405 Bayes, see Bayes clustering dendrogram, 415 dissimilarity choices, 415 definition, 410 matrix, 418 monothetic vs polythetic, 419 selecting one, 421 EM algorithm, see EM algorithm graph-theoretic, see graph-theoretic clustering hierarchical agglomerative, 414 conditions, 427–428 convergence, 429 definition, 406 divisive, 422
partitional criteria, 431 definition, 406 objective functions, 431 procedures, 432 problems chaining, 416 clumping, 407 scaling up, 418 techniques centroid-based, 408–413 choice of linkage, 415 divisive K-means, 424 hierarchical, 413–426 implementations, 417, 423 K-means, 407, 409–411 K-medians, 412 K-medoids, 412 model based, 432–447 principal direction divisive partitioning, 425 Ward’s method, 413 Index validation, 480 association rules, 483 Bayesian, 482 choice of dissimilarity, 480 external, 481 internal, 482 relative, 482 silhouette index, 482 concurvity, 5, 10, 176–178, 188, 213 convex optimization dual problem, 276 Lagrangian, 275 slack variables, 280 convex optimzation primal form, 274 cross-validation, 27–29 choices for K, 593 consistency, 594 generalized, 591 in model selection, 587 K-fold cross-validation, 590 leave one out, 588 inconsistency as overfit, 595 leave-one-out equivalent to C p , 592 median error, 598 unifying theorem, 596 variations, 598 Curse, 3–4, 39 Barron’s Theorem, 197 statement, 200 descriptions, error growth, 12 experimental design, 11 feature vector dimension, 284 Friedman function comparisons, 394 instability of fit, kernel estimators, 159 kernel methods, 89 kernel smoothers convergence, 89 kernel smoothing, 152 linearity assumption, LOESS, 65 nearest neighbors, 100 neural networks, 192 parsimony principle, 17 projection pursuit regression, 188, 189 ranking methods in regression, 575 reproducing kernel Hilbert spaces, 152–154 scatterplot smooths, 55 sliced inverse regression, 215 smooth models, 172 splines, 117 superexponential growth, Index support vector machines, 233, 290 derivatives notation, 484 design, 11 A-optimality, 11 D-optimality, 11 G-optimality, 11 Hammersley points, 12 sequentially, 11 dilution, 637 dimension average local, 14 high, 248 local, 13, 15 locally low, 12 Vapnik-Chervonenkis, 269 dimension reduction, see feature selection, see variable selection variance bias tradeoff, 494 discriminant function, 232, 235, 239 Bayes, 239 distance based, 236 ratio of variances, 237 Fisher’s LDA, 239 decision boundary, 244 regression, 241 Mahalanobis, 239 quadratic, 243 early smoothers, 55 Edgeworth expansion, 23, 26 EM algorithm properties, 445 exponential family, 444 general derivation, 438 K components, 440 two normal components, 436 empirical distribution function convergence, 20 estimates from, 20 Glivenko-Cantelli Theorem, 19 large deviations, 19 empirical risk, 267, 332 ensemble methods bagging, 312 indicator functions, 315 stability vs bias, 313 Bayes model averaging, 310 boosting, 318 relation to SVMs, 321 classification, 326 definition, 308 functional aggregation, 326 stacking, 316 ε -insensitive loss, 290 775 example, see visualization Australian crab data self-organizing maps, 557 Australian crabs projections, 542 Boston housing data LARS, LASSO, forward stagewise, 619 Canadian expenditures Chernoff faces, 547 profiles and stars, 535 Ethanol data Nadaraya-Watson, 102 splines, 135 Fisher’s Iris Data, 366 centroid based clustering, 477 EM algorithm, 478 hierarchical agglomerative clustering, 475 hierarchical divisive clustering, 477 neural networks, 367 spectral clustering, 479 support vector machines, 368 tree models, 367 Friedman function generalized additive model, 392 six models compared, 390 high D cubes multidimensional scaling, 551 mtcars data heat map, 539 Ripley’s data, 369 centroid based clustering, 468 EM algorithm, 471 hierarchical agglomerative clustering, 465 hierarchical divisive clustering, 468 neural networks, 372 relevance 
vector machines, 375 spectral clustering, 472 support vector machines, 373 tree models, 370 simulated LOESS, 102 Nadaraya-Watson, 100 simulated, linear model AIC, BIC, GCV, 654–657 Bayes, 659–665 Enet, AEnet, LASSO, ALASSO, SCAD, 658–659 screening for large p, 665–667 sinc, 2D LOESS, NW, Splines, 160 sinc, dependent data LOESS, NW, Splines, 157 sinc, IID data Gaussian processes, 387 generalized additive model, 389 neural networks, 379 776 relevance vector machines, 385 support vector machines, 383 tree models, 378 sinc, IID data, LOESS, NW, Splines, 155 sunspot data time dependence, 544 two clusters in regression, 532 factor analysis, 502–508 choosing factors, 506 estimating factor scores, 507 indeterminates, 504 large sample inference for K, 506 ML factors, 504 model, 502 principal factors, 505 reduction to PCs, 503 false discovery proportion, see false discovery rate false discovery rate, 707 Benjamini-Hochberg procedure, 710 dependent data, 712 theorem, 711 Benjamini-Yekutieli procedure theorem, 712 false discovery proportion, 717 step down test, 718 false non-discovery rate, 714 asymptotic threshold, 714 asymptotics, 715 classification risk, 715 optimization, 716 Simes’ inequality, 713 variants, 709 false nondiscovery rate, see false discovery rate familywise error rate, 690 Bonferroni, 690 permutation test stepdown minP, maxT, 694 permutation tests, 692 maxT, 692 Sidak, 691 stepwise adjustments stepdown Bonferroni, Sidak, 693 stepdown minP, maxT, 694 Westfall and Young minP, maxT, 691 feature selection, see factor analysis, see independent components, see partial least squares, see principal components, see projection pursuit, see sufficient dimensions, see supervised dimension reduction, see variable selection, see visualization, 493 linear vs nonlinear, 494 nonlinear Index distance, 519 geometric, 518–522 independent components, 518 principal components, 517 principal curve, 520 Gaussian processes, 338 and splines, 340 generalized additive models, 182 backfitting, 183 generalized cross-validation, 30, 591 generalized linear model, 181 Gini index, 205, 252 graph-theoretic clustering, 447 cluster tree, 448 k-degree algorithm, 450 Kruskal’s algorithm, 449, 484 minimal spanning tree, 448 Prim’s algorithm, 449, 485 region, 451 spectral graph Laplacian, 452, 456 minimizing cuts, 453 mininizing cuts divisively, 455 other cut criteria, 455 properties, 456 Green’s Functions, 150 Hadamard, ill-posed, 138 Hammersley points, 38–39 hidden Markov models definition, 352 problems, 354 Hoeffding’s inequality, 296 Holder continuous, 75 independent component analysis computational approach FastICA, 515 definitions, 511–513 form of model, 511 properties, 513 independent components analysis, 516 information criteria Akaike, 580 corrected, 586 justification, 580 risk bound, 582 Akaike vs Bayes, 585 and Bayes variable selection, 652 basic inequality, 579 Bayes, 583 consistency, 584 definition, 578 deviance information, 586 Hannan-Quin, 586 Index Mallows’, 572, 578 risk inflation, 586 Karush-Kuhn-Tucker conditions, 274, 277 kernel, 284 choices, 74, 288 definition, 73 Mercer, 285 kernel trick, 284 leave-one-out property, 589 linearly separable, 234 Lipschitz continuous, 74 little-o, 23 local dimension, 12 logic trees, 253 logistic regression classification, 232, 246–247, 349 median cross-validation, 598 Mercer conditions, 285 model selection procedure, 31 multiclass classification, 234 reduction to binary, 234 multicollinearity, 5, multidimensional scaling, 547–553 implementation in SMACOF, 550 minimization 
problem, 548 representativity, 549 variations, 551 multiple comparisons, 679, see Bayes testing, see false discovery rate, see familywise error rate, see per comparison error rate, see positive false discovery rate ANOVA Bonferroni correction, 680 Scheffe’s method, 680 Tukey’s method, 680 criteria Bayes decision theory, 731 fully Bayes, 728 FWER, PCER/PFER, FDR, pFDR, 685 family error rate, 683 table for repeated testing, 684 terminology adjusted p-values, 689 stepwise vs single step, 688 types of control, 688 two normal means example, 681 multivariate adaptive regression splines, 210 fitting, 212 model, 211 properties, 213 Nadaraya-Watson, 78 as smoother, see smoothers, classical 777 variable bandwidth, 131 nearest neighbors, 96–100 neural networks architecture, 191 approximation, 199 backpropagation, 192 backpropogation, 196 bias-variance, 199 definition, 189 feedforward, 190 interpretation, 200 no free lunch, 365, 397 statement, 400 nonparametric optimality, 180 occupancy, 255 oracle inequality, 328 classification LeCue, 333 generic, 329 regression Yang, 331 oracle property, 600 orthogonal design matrix, 601 parsimony, 17 partial least squares properties, 526 simple case, 523 Wold’s NIPALS, 524 per comparison error rate, 695 adjusted p-values single step, 706 asymptotic level, 703 basic inequality, 696 common cutoffs single step, 700 adjusted p-values, 701 common quantiles adjusted p-values, 699 single step, 698 constructing the null, 704 generic strategy, 697 per family error rate, see per comparison error rate polynomial interpolation, 60–61 Lagrange, 62 positive false discovery rate, 719 estimation, 723 number of true nulls, 726 parameter selection, 725 q-value, 726 posterior interpretation, 720 q-value, 721 rejection region, 722 rejection region, 721 778 positive regression dependence on a subset, 711 prediction, 309, 647 bagging, 312 Bayes model averaging, 311 Bayes nonparametrics Dirichlet process prior, 335 Gaussian process priors, 339 Polya trees, 337 boosting, 318 stacking, 316 principal component analysis, 511 principal components, 16, 495 canonical correlation, 500 empirical PCs, 501 main theorem, 496 Lagrange multipliers, 497 quadratic maximization, 498 normal case, 499 properties, 498 techniques for selecting, 501 using correlation matrix, 500 projection pursuit, 508–511 choices for the index, 510 projection index, 509 projection pursuit regression, 184 non-uniqueness, 186 properties, 188 q-value, 721, see positive false discovery rate random forests, 254 asymptotics, 258 out of bag error, 255 random feature selection, 256 recursive partitioning, see tree models regluarization representer theorem, 153 regression, see additive models, see additivity and variance stabilization, see alternating conditional expectations, see generalized additive models, see multivariate adaptive regression splines, see neural networks, see projection pursuit, see relevance vector regression, see support vector regression, see tree based models systematic simulation study, 397 regularization cubic splines, 121 empirical risk, 122 in multiclass SVMs, 293 in reproducing kernel Hilbert spaces, 147 in smoothing splines, 122, 137 neural networks, 199, 379 relevance vector machines, 345 tree models, 207, 370 regularized risk, 121, 137, 286, 341 Index relevance vector, 345 relevance vector classification, 349 Laplace’s method, 350 relevance vector machines Bayesian derivation, 346–348 parameter estimation, 348 relevance vector regression, 345 relevance vectors definition, 347 interpretation, 348 
reproducing kernel Hilbert space, 122 construction, 141–143 decomposition theorem, 150 definition, 140 direct sum construction, 146 example, 144 Gaussian process prior, 343 general case, 147 general example, 149 kernel function, 140 spline case, 143 risk, 265–270, 328 confidence interval, 269 hinge loss, 332 hypothesis testing, 715 zero-one loss, 266 search binary, 35 bracketing, 33 graph, 37 list, 35 Nelder-Mead, 33 Newton-Raphson, 32 simulated annealing, 34 tree, 35 univariate, 32 self-organizing maps, 553–560 contrast with MDS, 559 definition, 554 implementation, 556 interpretation, 554 procedure, 554 relation to clustering, PCs, NNs, 556 shattering, 269 shrinkage Adaptive LASSO, 610 properties, 611 Bridge asymptotic distribution, 609 consistency, 608 Bridge penalties, 607 choice of penalty, 601 definition, 599 elastic net, 616 GLMs, 623 Index LASSO, 604 and boosting, 621 grouped variables, 616 properties, 605–606 least angle regression, 617 non-negative garrotte, 603 limitations, 604 nonlinear models adaptive COSSO, 627 basis pursuit, 626 COSSO, 626 optimization problem, 601 oracle, 600 orthogonal design matrix, 601, 603 penalty as prior, 650 ridge regression, 602 SCAD difference of convex functions, 614 local linear approximation, 615 local quadratic approximation, 613 majorize-minimize procedure, 613 SCAD penalty, 611 properties, 612 SS-ANOVA framework, 624–625 support vector machines absolute error, binary case, 629 absolute error, multiclass, 630 adaptive supnorm, multiclass, 631 double sparsity, 628 SCAD, binary case, 629 supnorm, multiclass, 630 tree models, 623 singular value decomposition, 250, 602 sliced inverse regression, 215 and sufficient dimensions, 528 elliptical symmetry, 215, 339 properties, 217 smooth, 55 smoothers classical B-splines, 127 kernel bandwidth, 92–94 kernel selection, 90 LOESS, 64–67 Nadaraya-Watson, 78–81 nearest neighbors classification, 96 nearest neighbors regression, 99 NW AMISE, 85 NW asymptotic normality, 85 NW consistency, 81 Parzen-Rosenblatt, 78 Priestly-Chao, 75–77 rates for Kernel smoothers, 86 rates for kernel smoothers, 90 smoothing parameter selection, 129 smoothing spline parameter, 121 779 spline asymptotic normality, 133 spline bias, 131 spline MISE, 132 spline penalty, 121 spline variance, 131 spline vs kernel, 130 splines, 124, see spline, 126–131 early smoothers bin, 56 moving average, 57 running line, 57 bin, 118 running line, 118 linear, 95 Sobolev space, 123, 163 sparse, 5, data, 594 matrices, 175 posterior, 347 principal components, 499 relevance vector machine, 345 relevance vector machines, 344 RVM on Ripley’s data, 375 RVM vs SVM, 348 similarity matrix, 417 SS-ANOVA, 627 support vector machine, 277 SVM and Ripley’s data, 373 tree model sinc function, 391 sparsity, 6–8, 11 and basis pursuit, 626 and LASSO, 606 and oracle inequalities, 632 of data, 481, 541 of graphs, 490 of local linear approximation, 615 relevance vector machine, 346 RVM vs SVM, 375, 396 shrinkage methods, 599 support vector machines, 628 through thresholding, 636 spline, 117 as Bayes rule, 344 as smoother, see smoothers, classical B-spline basis, 128 band matrix, 124 Cox-de Boor recursion, 127 cubic, 118 definition, 117 first order, 118 Hilbert space formulation, 136 interpolating, 117–120 natural cubic, 120, 123 optimal, 125, 127 uniqueness, 125 780 thin plate, 152 zero-th order, 118 sufficient dimensions, 527 and sliced inverse regression, 528 estimating the central subspace, 528 quadratic discrepancy function, 529 testing for dimension, 530 
superexponentially, 5, supervised dimension reduciton sufficient dimensions, 527 supervised dimension reduction partial least squares, 523 support vector machines, 262 distance, 264 general case, 282 linearization by kernels, 283 Mercer kernels, 285 linearly separable, 271–279 margin, 262 maximization, 271 margin expression, 273 multiclass, 293 not linearly separable, 279–282 dual problem, 281 primal form, 281 optimization, 274 dual form, 278 primal form, 276 regularized optimization, 286 separating hyperplane, 262, 270 support vector regression, 290–292 template Friedman PPR, 185 ACE, 219 average local dimension, 15 backfitting, 173 Bayes BH procedure, 731 Bonferroni, 690 boosted LASSO, 622 boosting, 318 Chen’s PPR, 187 constructing the null, 706 dense regions in a graph, 451 divisive K-means clustering, 424 EM algorithm, 437 estimating FDR and pFDR, 724 FastICA, 516 hierarchical agglomerative, 414 hierarchical divisive clustering, 422 least angle regression, 617 MARS, 212 maxT, 692 Metroplis-Hastings, 645 NIPALS deflating the matrix, 525 Index finding a factor, 525 partitional clustering, 430 PCER/PFER generic strategy, 697 principal curves, 521 principal direction divisive partitioning clustering, 425 projection pursuit, 509 self-organizing map, 554 shotgun stochastic search, 646 Sidak adjustment, 691 single step, common cutoffs, 700 single step, common quantiles, 698 SIR, 217 stepdown permutation minP, 694 theorem Aronszajn, 141 Barron, 201 Benjamini-Hochberg, 711, 712 Benjamini-Yekutieli, 712 Breiman, 258, 261 Buhlman-Yu, 315 calculating q-values, 726 Chen, 188 Cook-Ni, 530 Devroye et al., 100 Devroye-Wagner, 89 Duan and Li, 216 Dudoit et al., 703, 705 Dudoit, et al., 699 Eriksson and Koivunen, 513 Fan and Jiang, 179 Friedman et al., 320 Gasser-Muller, 76 Genovese and Wasserman, 714, 715 Green-Silverman, 124 Hoeffding, 297 Kagan et al., 513 Kleinberg, 428 Knight-Fu, 608, 609 LeCue, 333 Luxburg, 457 Mercer-Hilbert-Schmidt, 142 Muller, 597 Rahman and Rivals, 256 Representer, 151, 153 Riesz representation, 139 Romano and Shaikh, 719 Schapire et al., 323 semiparametric Representer, 154 Shi and Malik, 454 Silverman, 130 Storey, 720, 722 Vapnik, 271 Vapnik and Chervonenkis, 269, 324 White, 195, 196 Wu, 446 Index Yang, 331 Yang’s, 582 Zhou et al., 131 Zou, 611 Zou and Hastie, 617 tree based classifiers, 249 splitting rules, 249 Gini index, 252 impurity, 251 principal components, 251 twoing, 252 tree models Bayesian, 210 benefits, 203 pruning, 207 cost-complexity, 207 regression, 202 selecting splits, 204 Gini, 205 twoing, 205 twin data, 30 twoing, 205, 252 variable selection, 569, see Bayes variable selection, see cross-validation, see information criteria, see shrinkage, see variable ranking classification BW ratio, 575 781 SVMs, 628 in linear regression, 570 linear regression forward, backward, stepwise, 573 leaps and bounds, 573 subset selection, 572 ranking Dantzig selector, 576 sure independence screening, 576 VC dimension, 269, 297 indicator functions on the real line, 298 Levin-Denker example, 300 planes through the origin, 298 shattering, 269 visualization, see multidimensional scaling, see self-organizing maps, 532 Chernoff faces, 546 elementary techniques, 534 graphs and trees, 538 heat map, 539 profiles and stars, 535 projections, 541 time dependence, 543 GGobi, 534 Shepard plot, 549 using up data, 533 Wronskian, 149 ... 