INTRODUCTION TO DATA MINING
SECOND EDITION, GLOBAL EDITION

PANG-NING TAN, Michigan State University
MICHAEL STEINBACH, University of Minnesota
ANUJ KARPATNE, University of Minnesota
VIPIN KUMAR, University of Minnesota

330 Hudson Street, NY NY 10013

Director, Portfolio Management: Engineering, Computer Science & Global Editions: Julian Partridge
Specialist, Higher Ed Portfolio Management: Matt Goldstein
Portfolio Management Assistant: Meghan Jacoby
Acquisitions Editor, Global Edition: Sourabh Maheshwari
Managing Content Producer: Scott Disanno
Content Producer: Carole Snyder
Senior Project Editor, Global Edition: K.K. Neelakantan
Web Developer: Steve Wright
Manager, Media Production, Global Edition: Vikram Kumar
Rights and Permissions Manager: Ben Ferrini
Manufacturing Buyer, Higher Ed, Lake Side Communications Inc. (LSC): Maura Zaldivar-Garcia
Senior Manufacturing Controller, Global Edition: Caterina Pellegrino
Inventory Manager: Ann Lam
Product Marketing Manager: Yvonne Vannatta
Field Marketing Manager: Demetrius Hall
Marketing Assistant: Jon Bryant
Cover Designer: Lumina Datamatics
Full-Service Project Management: Ramya Radhakrishnan, Integra Software Services

Pearson Education Limited, KAO Two, KAO Park, Harlow CM17 9NA, United Kingdom, and Associated Companies throughout the world.
Visit us on the World Wide Web at: www.pearsonglobaleditions.com

© Pearson Education Limited, 2019

The rights of Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, and Vipin Kumar to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Authorized adaptation from the United States edition, entitled Introduction to Data Mining, 2nd Edition, ISBN 978-0-13-312890-1, by Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, and Vipin Kumar, published by Pearson Education © 2019.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without either the prior written permission of the publisher or a license permitting restricted copying in the United Kingdom issued by the Copyright Licensing Agency Ltd, Saffron House, 6–10 Kirby Street, London EC1N 8TS.

All trademarks used herein are the property of their respective owners. The use of any trademark in this text does not vest in the author or publisher any trademark ownership rights in such trademarks, nor does the use of such trademarks imply any affiliation with or endorsement of this book by such owners. For information regarding permissions, request forms, and the appropriate contacts within the Pearson Education Global Rights and Permissions department, please visit www.pearsoned.com/permissions.

This eBook is a standalone product and may or may not include all assets that were part of the print version. It also does not provide access to other Pearson digital products like MyLab and Mastering. The publisher reserves the right to remove any material in this eBook at any time.

British Library Cataloguing-in-Publication Data: A catalogue record for this book is available from the British Library.

ISBN 10: 0-273-76922-7
ISBN 13: 978-0-273-76922-4
eBook ISBN 13: 978-0-273-77532-4

eBook formatted by Integra Software Services.

To our families.

Preface to the Second Edition

Since the first edition, roughly 12 years ago, much has changed in the field of data analysis. The volume and variety of data being collected continues to increase, as has the rate
(velocity) at which it is being collected and used to make decisions. Indeed, the term Big Data has been used to refer to the massive and diverse data sets now available. In addition, the term data science has been coined to describe an emerging area that applies tools and techniques from various fields, such as data mining, machine learning, statistics, and many others, to extract actionable insights from data, often big data.

The growth in data has created numerous opportunities for all areas of data analysis. The most dramatic developments have been in the area of predictive modeling, across a wide range of application domains. For instance, recent advances in neural networks, known as deep learning, have shown impressive results in a number of challenging areas, such as image classification, speech recognition, as well as text categorization and understanding. While not as dramatic, other areas, e.g., clustering, association analysis, and anomaly detection, have also continued to advance. This new edition is in response to those advances.

Overview

As with the first edition, the second edition of the book provides a comprehensive introduction to data mining and is designed to be accessible and useful to students, instructors, researchers, and professionals. Areas covered include data preprocessing, predictive modeling, association analysis, cluster analysis, anomaly detection, and avoiding false discoveries. The goal is to present fundamental concepts and algorithms for each topic, thus providing the reader with the necessary background for the application of data mining to real problems. As before, classification, association analysis, and cluster analysis are each covered in a pair of chapters. The introductory chapter covers basic concepts, representative algorithms, and evaluation techniques, while the following chapter discusses more advanced concepts and algorithms. As before, our objective is to provide the reader with a sound understanding of the foundations of data mining, while still covering many important advanced topics. Because of this approach, the book is useful both as a learning tool and as a reference.

To help readers better understand the concepts that have been presented, we provide an extensive set of examples, figures, and exercises. The solutions to the original exercises, which are already circulating on the web, will be made public. The exercises are mostly unchanged from the last edition, with the exception of new exercises in the chapter on avoiding false discoveries. New exercises for the other chapters and their solutions will be available to instructors via the web. Bibliographic notes are included at the end of each chapter for readers who are interested in more advanced topics, historically important papers, and recent trends. These have also been significantly updated. The book also contains a comprehensive subject and author index.

What is New in the Second Edition?
Some of the most significant improvements in the text have been in the two chapters on classification. The introductory chapter uses the decision tree classifier for illustration, but the discussion on many topics—those that apply across all classification approaches—has been greatly expanded and clarified, including topics such as overfitting, underfitting, the impact of training size, model complexity, model selection, and common pitfalls in model evaluation. Almost every section of the advanced classification chapter has been significantly updated. The material on Bayesian networks, support vector machines, and artificial neural networks has been significantly expanded. We have added a separate section on deep networks to address the current developments in this area. The discussion of evaluation, which occurs in the section on imbalanced classes, has also been updated and improved.

The changes in association analysis are more localized. We have completely reworked the section on the evaluation of association patterns (introductory chapter), as well as the sections on sequence and graph mining (advanced chapter). Changes to cluster analysis are also localized. The introductory chapter adds a new K-means initialization technique and an updated discussion of cluster evaluation. The advanced clustering chapter adds a new section on spectral graph clustering. Anomaly detection has been greatly revised and expanded. Existing approaches—statistical, nearest neighbor/density-based, and clustering-based—have been retained and updated, while new approaches have been added: reconstruction-based, one-class classification, and information-theoretic. The reconstruction-based approach is illustrated using autoencoder networks that are part of the deep learning paradigm. The data chapter has been updated to include discussions of mutual information and kernel-based techniques.

The last chapter, which discusses how to avoid false discoveries and produce valid results, is completely new, and is novel among other contemporary textbooks on data mining. It supplements the discussions in the other chapters with a discussion of the statistical concepts (statistical significance, p-values, false discovery rate, permutation testing, etc.) relevant to avoiding spurious results, and then illustrates these concepts in the context of data mining techniques. This chapter addresses the increasing concern over the validity and reproducibility of results obtained from data analysis. The addition of this last chapter is a recognition of the importance of this topic and an acknowledgment that a deeper understanding of this area is needed for those analyzing data.
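To give a concrete sense of the kind of statistical check that chapter is concerned with, the following is a minimal, illustrative sketch of a two-sample permutation test in Python. It is not taken from the book; the function name, the variable names, and the simple difference-of-means test statistic are our own assumptions for the example.

```python
import numpy as np

def permutation_test(x, y, num_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means between samples x and y.

    Returns an estimated p-value: the fraction of random relabelings whose
    difference in means is at least as extreme as the observed one.
    """
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = abs(pooled[:len(x)].mean() - pooled[len(x):].mean())
        if diff >= observed:
            count += 1
    # Add 1 to numerator and denominator so the estimate is never exactly zero.
    return (count + 1) / (num_permutations + 1)

# Example: are these two small samples plausibly from the same distribution?
print(permutation_test([5.1, 4.8, 5.5, 5.0], [6.2, 6.0, 5.9, 6.4]))
```

A small p-value suggests the observed difference is unlikely under random relabeling; when many such tests are run, a correction such as the Benjamini–Hochberg procedure (also covered in that chapter) is needed to control the false discovery rate.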
The data exploration chapter has been deleted, as have the appendices, from the print edition of the book, but they will remain available on the web. A new appendix provides a brief discussion of scalability in the context of big data.

To the Instructor

As a textbook, this book is suitable for a wide range of students at the advanced undergraduate or graduate level. Since students come to this subject with diverse backgrounds that may not include extensive knowledge of statistics or databases, our book requires minimal prerequisites. No database knowledge is needed, and we assume only a modest background in statistics or mathematics, although such a background will make for easier going in some sections. As before, the book, and more specifically, the chapters covering major data mining topics, are designed to be as self-contained as possible. Thus, the order in which topics can be covered is quite flexible. The core material is covered in the chapters on data (2), classification (3), association analysis (4), clustering (5), and anomaly detection (9). We recommend at least a cursory coverage of Chapter 10 (Avoiding False Discoveries) to instill in students some caution when interpreting the results of their data analysis. Although the introductory data chapter (2) should be covered first, the basic classification (3), association analysis (4), and clustering (5) chapters can be covered in any order. Because of the relationship of anomaly detection (9) to classification (3) and clustering (5), these chapters should precede Chapter 9. Various topics can be selected from the advanced classification, association analysis, and clustering chapters (6, 7, and 8, respectively) to fit the schedule and interests of the instructor and students. We also advise that the lectures be augmented by projects or practical exercises in data mining. Although they are time consuming, such hands-on assignments greatly enhance the value of the course.

Support Materials

Support materials available to all readers of this book are available on the book's website:
• PowerPoint lecture slides
• Suggestions for student projects
• Data mining resources, such as algorithms and data sets
• Online tutorials that give step-by-step examples for selected data mining techniques described in the book using actual data sets and data analysis software

Additional support materials, including solutions to exercises, are available only to instructors adopting this textbook for classroom use.

Acknowledgments

Many people contributed to the first and second editions of the book. We begin by acknowledging our families, to whom this book is dedicated. Without their patience and support, this project would have been impossible. We would like to thank the current and former students of our data mining groups at the University of Minnesota and Michigan State for their contributions. Eui-Hong (Sam) Han and Mahesh Joshi helped with the initial data mining classes. Some of the exercises and presentation slides that they created can be found in the book and its accompanying slides. Students in our data mining groups who provided comments on drafts
of the book or who contributed in other ways include Shyam Boriah, Haibin Cheng, Varun Chandola, Eric Eilertson, Levent Ertöz, Jing Gao, Rohit Gupta, Sridhar Iyer, Jung-Eun Lee, Benjamin Mayer, Aysel Ozgur, Uygar Oztekin, Gaurav Pandey, Kashif Riaz, Jerry Scripps, Gyorgy Simon, Hui Xiong, Jieping Ye, and Pusheng Zhang. We would also like to thank the students of our data mining classes at the University of Minnesota and Michigan State University who worked with early drafts of the book and provided invaluable feedback. We specifically note the helpful suggestions of Bernardo Craemer, Arifin Ruslim, Jamshid Vayghan, and Yu Wei.

Joydeep Ghosh (University of Texas) and Sanjay Ranka (University of Florida) class-tested early versions of the book. We also received many useful suggestions directly from the following UT students: Pankaj Adhikari, Rajiv Bhatia, Frederic Bosche, Arindam Chakraborty, Meghana Deodhar, Chris Everson, David Gardner, Saad Godil, Todd Hay, Clint Jones, Ajay Joshi, Joonsoo Lee, Yue Luo, Anuj Nanavati, Tyler Olsen, Sunyoung Park, Aashish Phansalkar, Geoff Prewett, Michael Ryoo, Daryl Shannon, and Mei Yang.

Ronald Kostoff (ONR) read an early version of the clustering chapter and offered numerous suggestions. George Karypis provided invaluable LaTeX assistance in creating an author index. Irene Moulitsas also provided assistance with LaTeX and reviewed some of the appendices. Musetta Steinbach was very helpful in finding errors in the figures.

We would like to acknowledge our colleagues at the University of Minnesota and Michigan State who have helped create a positive environment for data mining research. They include Arindam Banerjee, Dan Boley, Joyce Chai, Anil Jain, Ravi Janardan, Rong Jin, George Karypis, Claudia Neuhauser, Haesun Park, William F. Punch, György Simon, Shashi Shekhar, and Jaideep Srivastava. The collaborators on our many data mining projects, who also have our gratitude, include Ramesh Agrawal, Maneesh Bhargava, Steve Cannon, Alok Choudhary, Imme Ebert-Uphoff, Auroop Ganguly, Piet C. de Groen, Fran Hill, Yongdae Kim, Steve Klooster, Kerry Long, Nihar Mahapatra, Rama Nemani, Nikunj Oza, Chris Potter, Lisiane Pruinelli, Nagiza Samatova, Jonathan Shapiro, Kevin Silverstein, Brian Van Ness, Bonnie Westra, Nevin Young, and Zhi-Li Zhang.

The departments of Computer Science and Engineering at the University of Minnesota and Michigan State University provided computing resources and a supportive environment for this project. ARDA, ARL, ARO, DOE, NASA, NOAA, and NSF provided research support for Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, and Vipin Kumar. In particular, Kamal Abdali, Mitra Basu, Dick Brackney, Jagdish Chandra, Joe Coughlan, Michael Coyle, Stephen Davis, Frederica Darema, Richard Hirsch, Chandrika Kamath, Tsengdar Lee, Raju Namburu, N. Radhakrishnan, James Sidoran, Sylvia Spengler, Bhavani Thuraisingham, Walt Tiernin, Maria Zemankova, Aidong Zhang, and Xiaodong Zhang have been supportive of our research in data mining and high-performance computing.

It was a pleasure working with the helpful staff at Pearson Education. In particular, we would like to thank Matt Goldstein, Kathy Smith, Carole Snyder, and Joyce Wells. We would also like to thank George Nichols, who helped with the art work, and Paul Anagnostopoulos, who provided LaTeX support.

We are grateful to the following Pearson reviewers: Leman Akoglu (Carnegie Mellon University), Chien-Chung Chan (University of Akron), Zhengxin Chen (University of Nebraska at Omaha), Chris
Clifton (Purdue University), Joydeep Ghosh (University of Texas, Austin), Nazli Goharian (Illinois Institute of Technology), J. Michael Hardin (University of Alabama), Jingrui He (Arizona State University), ...
Copyright Permissions

Some figures and part of the text of Chapter originally appeared in the article “Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data,” Levent Ertöz, Michael Steinbach, and Vipin Kumar, Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, May 1–3, 2003, SIAM. © 2003, SIAM.

Some figures and part of the text of Chapter appeared in the article “Selecting the Right Objective Measure for Association Analysis,” Pang-Ning Tan, Vipin Kumar, and Jaideep Srivastava, Information Systems, 29(4), 293–313, 2004, Elsevier. © 2004, Elsevier.

Some of the figures and text of Chapters appeared in the article “Discovery of Climate Indices Using Clustering,” Michael Steinbach, Pang-Ning Tan, Vipin Kumar, Steven Klooster, and Christopher Potter, KDD ’03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 446–455, Washington, DC, August 2003, ACM. © 2003, ACM, Inc. DOI = http://doi.acm.org/10.1145/956750.956801

Some of the figures (1–7, 13) and text of Chapter originally appeared in the chapter “The Challenge of Clustering High-Dimensional Data,” Levent Ertöz, Michael Steinbach, and Vipin Kumar, in New Directions in Statistical Physics, Econophysics, Bioinformatics, and Pattern Recognition, 273–312, Editor, Luc Wille, Springer, ISBN 3-540-43182-9. © 2004, Springer-Verlag.

Some of the figures and text of Chapter originally appeared in the article “Chameleon: Hierarchical Clustering Using Dynamic Modeling,” by George Karypis, Eui-Hong (Sam) Han, and Vipin Kumar, IEEE Computer, Volume 32(8), 68–75, August 1999, IEEE. © 1999, IEEE.

GLOBAL EDITION

This is a special edition of an established title widely used by colleges and universities throughout the world. Pearson published this exclusive edition for the benefit of students outside the United States and Canada. If you purchased this book within the United States or Canada, you should be aware that it has been imported without the approval of the Publisher or Author.

Features
• New – A chapter on avoiding false discoveries
  o Discusses statistical concepts relevant to avoiding spurious results, novel among contemporary textbooks on data mining
  o Addresses the increasing concern over the validity and reproducibility of results obtained from data analysis
• New – A separate section on deep networks to address the
current developments in this area
• Classification, association analysis, and cluster analysis covered in a pair of chapters each – the first covering introductory concepts and the second covering more advanced concepts and algorithms
• Coverage of classification significantly improved, including topics such as overfitting, underfitting, and model complexity and selection
• Coverage of anomaly detection greatly revised and expanded – existing approaches updated and new ones like reconstruction-based detection added
• Over 100 examples, 250 figures, and 150 exercises to help readers better understand concepts

... What Is Data Mining? Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are deployed to scour large data sets in order to...

... research challenges in data mining. Wu et al. [59] discuss how developments in data mining research can be turned into practical tools. Data mining standards are the subject of a paper by Grossman et...

... discusses how data mining algorithms can be scaled to large data sets. The emergence of new data mining applications has produced new challenges that need to be addressed. For instance, concerns