DATA MINING AND ANALYSIS

The fundamental algorithms in data mining and analysis form the basis for the emerging field of data science, which includes automated methods to analyze patterns and models for all kinds of data, with applications ranging from scientific discovery to business intelligence and analytics. This textbook for senior undergraduate and graduate data mining courses provides a broad yet in-depth overview of data mining, integrating related concepts from machine learning and statistics. The main parts of the book include exploratory data analysis, pattern mining, clustering, and classification. The book lays the basic foundations of these tasks and also covers cutting-edge topics such as kernel methods, high-dimensional data analysis, and complex graphs and networks. With its comprehensive coverage, algorithmic perspective, and wealth of examples, this book offers solid guidance in data mining for students, researchers, and practitioners alike.

Key Features:
• Covers both core methods and cutting-edge research
• Algorithmic approach with open-source implementations
• Minimal prerequisites, as all key mathematical concepts are presented, as is the intuition behind the formulas
• Short, self-contained chapters with class-tested examples and exercises that allow for flexibility in designing a course and for easy reference
• Supplementary online resource containing lecture slides, videos, project ideas, and more

Mohammed J. Zaki is a Professor of Computer Science at Rensselaer Polytechnic Institute, Troy, New York. Wagner Meira Jr. is a Professor of Computer Science at Universidade Federal de Minas Gerais, Brazil.

DATA MINING AND ANALYSIS
Fundamental Concepts and Algorithms

MOHAMMED J. ZAKI, Rensselaer Polytechnic Institute, Troy, New York
WAGNER MEIRA JR., Universidade Federal de Minas Gerais, Brazil

32 Avenue of the Americas, New York, NY 10013-2473, USA. Cambridge University Press is part of the University of Cambridge. It furthers the University's mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9780521766333

© Mohammed J. Zaki and Wagner Meira Jr. 2014

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press. First published 2014. Printed in the United States of America.

A catalog record for this publication is available from the British Library.

Library of Congress Cataloging in Publication Data
Zaki, Mohammed J., 1971–
Data mining and analysis: fundamental concepts and algorithms / Mohammed J. Zaki, Rensselaer Polytechnic Institute, Troy, New York, Wagner Meira Jr., Universidade Federal de Minas Gerais, Brazil.
pages cm. Includes bibliographical references and index.
ISBN 978-0-521-76633-3 (hardback)
Data mining. I. Meira, Wagner, 1967– II. Title.
QA76.9.D343Z36 2014  006.3′12–dc23  2013037544
ISBN 978-0-521-76633-3 Hardback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party Internet Web sites referred to in this publication and does not guarantee that any content on such Web sites is, or will remain, accurate or appropriate.

Contents

Preface (ix)

1 Data Mining and Analysis (1)
  1.1 Data Matrix; 1.2 Attributes; 1.3 Data: Algebraic and Geometric View; 1.4 Data: Probabilistic View; 1.5 Data Mining; 1.6 Further Reading; 1.7 Exercises

PART ONE: DATA ANALYSIS FOUNDATIONS

2 Numeric Attributes (33)
  2.1 Univariate Analysis; 2.2 Bivariate Analysis; 2.3 Multivariate Analysis; 2.4 Data Normalization; 2.5 Normal Distribution; 2.6 Further Reading; 2.7 Exercises

3 Categorical Attributes (63)
  3.1 Univariate Analysis; 3.2 Bivariate Analysis; 3.3 Multivariate Analysis; 3.4 Distance and Angle; 3.5 Discretization; 3.6 Further Reading; 3.7 Exercises

4 Graph Data (93)
  4.1 Graph Concepts; 4.2 Topological Attributes; 4.3 Centrality Analysis; 4.4 Graph Models; 4.5 Further Reading; 4.6 Exercises

5 Kernel Methods (134)
  5.1 Kernel Matrix; 5.2 Vector Kernels; 5.3 Basic Kernel Operations in Feature Space; 5.4 Kernels for Complex Objects; 5.5 Further Reading; 5.6 Exercises

6 High-dimensional Data (163)
  6.1 High-dimensional Objects; 6.2 High-dimensional Volumes; 6.3 Hypersphere Inscribed within Hypercube; 6.4 Volume of Thin Hypersphere Shell; 6.5 Diagonals in Hyperspace; 6.6 Density of the Multivariate Normal; 6.7 Appendix: Derivation of Hypersphere Volume; 6.8 Further Reading; 6.9 Exercises

7 Dimensionality Reduction (183)
  7.1 Background; 7.2 Principal Component Analysis; 7.3 Kernel Principal Component Analysis; 7.4 Singular Value Decomposition; 7.5 Further Reading; 7.6 Exercises

PART TWO: FREQUENT PATTERN MINING

8 Itemset Mining (217)
  8.1 Frequent Itemsets and Association Rules; 8.2 Itemset Mining Algorithms; 8.3 Generating Association Rules; 8.4 Further Reading; 8.5 Exercises

9 Summarizing Itemsets (242)
  9.1 Maximal and Closed Frequent Itemsets; 9.2 Mining Maximal Frequent Itemsets: GenMax Algorithm; 9.3 Mining Closed Frequent Itemsets: Charm Algorithm; 9.4 Nonderivable Itemsets; 9.5 Further Reading; 9.6 Exercises

10 Sequence Mining (259)
  10.1 Frequent Sequences; 10.2 Mining Frequent Sequences; 10.3 Substring Mining via Suffix Trees; 10.4 Further Reading; 10.5 Exercises

11 Graph Pattern Mining (280)
  11.1 Isomorphism and Support; 11.2 Candidate Generation; 11.3 The gSpan Algorithm; 11.4 Further Reading; 11.5 Exercises

12 Pattern and Rule Assessment (301)
  12.1 Rule and Pattern Assessment Measures; 12.2 Significance Testing and Confidence Intervals; 12.3 Further Reading; 12.4 Exercises

PART THREE: CLUSTERING

13 Representative-based Clustering (333)
  13.1 K-means Algorithm; 13.2 Kernel K-means; 13.3 Expectation-Maximization Clustering; 13.4 Further Reading; 13.5 Exercises

14 Hierarchical Clustering (364)
  14.1 Preliminaries; 14.2 Agglomerative Hierarchical Clustering; 14.3 Further Reading; 14.4 Exercises and Projects

15 Density-based Clustering (375)
  15.1 The DBSCAN Algorithm; 15.2 Kernel Density Estimation; 15.3 Density-based Clustering: DENCLUE; 15.4 Further Reading; 15.5 Exercises

16 Spectral and Graph Clustering (394)
  16.1 Graphs and Matrices; 16.2 Clustering as Graph Cuts; 16.3 Markov Clustering; 16.4 Further Reading; 16.5 Exercises

17 Clustering Validation (425)
  17.1 External Measures; 17.2 Internal Measures; 17.3 Relative Measures; 17.4 Further Reading; 17.5 Exercises

PART FOUR: CLASSIFICATION

18 Probabilistic Classification (467)
  18.1 Bayes Classifier; 18.2 Naive Bayes Classifier; 18.3 K Nearest Neighbors Classifier; 18.4 Further Reading; 18.5 Exercises

19 Decision Tree Classifier (481)
  19.1 Decision Trees; 19.2 Decision Tree Algorithm; 19.3 Further Reading; 19.4 Exercises

20 Linear Discriminant Analysis (498)
  20.1 Optimal Linear Discriminant; 20.2 Kernel Discriminant Analysis; 20.3 Further Reading; 20.4 Exercises

21 Support Vector Machines (514)
  21.1 Support Vectors and Margins; 21.2 SVM: Linear and Separable Case; 21.3 Soft Margin SVM: Linear and Nonseparable Case; 21.4 Kernel SVM: Nonlinear Case; 21.5 SVM Training Algorithms; 21.6 Further Reading; 21.7 Exercises

22 Classification Assessment (548)
  22.1 Classification Performance Measures; 22.2 Classifier Evaluation; 22.3 Bias-Variance Decomposition; 22.4 Further Reading; 22.5 Exercises

Index (585)

Preface

This book is an outgrowth of data mining courses at Rensselaer Polytechnic Institute (RPI) and Universidade Federal de Minas Gerais (UFMG); the RPI course has been offered every Fall since 1998, whereas the UFMG course has been offered since 2002. Although there are several good books on data mining and related topics, we felt that many of them are either too high-level or too advanced. Our goal was to write an introductory text that focuses on the fundamental algorithms in data mining and analysis. It lays the mathematical foundations for the core data mining methods, with key concepts explained when first encountered; the book also tries to build the intuition behind the formulas to aid understanding.

The main parts of the book include exploratory data analysis, frequent pattern mining, clustering, and classification. The book lays the basic foundations of these tasks, and it also covers cutting-edge topics such as kernel methods, high-dimensional data analysis, and complex graphs and networks. It integrates concepts from related disciplines such as machine learning and statistics and is also ideal for a course on data analysis. Most of the prerequisite material is covered in the text, especially on linear algebra, and probability and statistics.

The book includes many examples to illustrate the main technical concepts. It also has end-of-chapter exercises, which have been used in class. All of the algorithms in the book have been implemented by the authors. We suggest that readers use their favorite data analysis and mining software to work through our examples and to implement the algorithms we describe in text; we recommend the R software or the Python language with its NumPy package. The datasets used and other supplementary material such as project ideas and slides are available online at the book's companion site and its mirrors at RPI and UFMG:
• http://dataminingbook.info
• http://www.cs.rpi.edu/~zaki/dataminingbook
• http://www.dcc.ufmg.br/dataminingbook

Having understood the basic principles and algorithms in data mining and data analysis, readers will be well equipped to develop their own methods or use more advanced techniques.

[Figure 0.1: Chapter dependencies]

Suggested Roadmaps

The chapter dependency graph is shown in Figure 0.1. We suggest some typical roadmaps for courses and readings based on this book. For an undergraduate-level course, we suggest the following chapters: 1–3, 8, 10, 12–15, 17–19, and 21–22. For an undergraduate course without exploratory data analysis, we recommend Chapters 1, 8–15, 17–19, and 21–22. For a graduate course, one possibility is to quickly go over the material in Part I or to assume it as background reading and to directly cover Chapters 9–22; the other parts of the book, namely frequent pattern mining (Part II), clustering (Part III), and classification (Part IV), can be covered in any order.
For a course on data analysis the chapters covered must include 1–7, 13–14, 15 (Section 15.2), and 20. Finally, for a course with an emphasis on graphs and kernels we suggest Chapters 4, 5, 7 (Sections 7.1–7.3), 11–12, 13 (Sections 13.1–13.2), 16–17, and 20–22.

Acknowledgments

Initial drafts of this book have been used in several data mining courses. We received many valuable comments and corrections from both the faculty and students. Our thanks go to
• Muhammad Abulaish, Jamia Millia Islamia, India
• Mohammad Al Hasan, Indiana University Purdue University at Indianapolis
• Marcio Luiz Bunte de Carvalho, Universidade Federal de Minas Gerais, Brazil
• Loïc Cerf, Universidade Federal de Minas Gerais, Brazil
• Ayhan Demiriz, Sakarya University, Turkey
• Murat Dundar, Indiana University Purdue University at Indianapolis
• Jun Luke Huan, University of Kansas
• Ruoming Jin, Kent State University
• Latifur Khan, University of Texas, Dallas

[...]

In the case of binary classification, with classes {+1, −1}, the combined classifier M_K can be expressed more simply as

    M_K(x) = sign( Σ_{t=1}^{K} α_t M_t(x) )

Example 22.14. Figure 22.9a illustrates the boosting approach on the Iris principal components dataset, using linear SVMs as the base classifiers. The regularization constant was set to C = . The hyperplane learned in iteration t is denoted h_t; thus, the classifier model is given as M_t(x) = sign(h_t(x)). As such, no individual linear hyperplane can discriminate between the classes very well, as seen from their error rates on the training set:

    M_t    h1      h2      h3      h4
    ε_t    0.280   0.305   0.174   0.282
    α_t    0.944   0.826   1.559   0.935

However, when we combine the decisions from successive hyperplanes weighted by α_t, we observe a marked drop in the error rate for the combined classifier M_K(x) as K increases:

    Combined model           M1      M2      M3      M4
    Training error rate      0.280   0.253   0.073   0.047

We can see, for example, that the combined classifier M3, comprising h1, h2, and h3, has already captured the essential features of the nonlinear decision boundary between the two classes, yielding an error rate of 7.3%. Further reduction in the training error is obtained by increasing the number of boosting steps.

To assess the performance of the combined classifier on independent testing data, we employ 5-fold cross-validation, and plot the average testing and training error rates as a function of K in Figure 22.9b.

[Figure 22.9: (a) Boosting SVMs with linear kernel. (b) Average testing and training error: 5-fold cross-validation.]

We can see that as the number of base classifiers K increases, both the training and testing error rates reduce. However, while the training error essentially goes to 0, the testing error does not reduce beyond 0.02, which happens at K = 110. This example illustrates the effectiveness of boosting in reducing the bias.

Bagging as a Special Case of AdaBoost: Bagging can be considered as a special case of AdaBoost, where w_t = (1/n)·1 and α_t = 1 for all K iterations. In this case, the weighted resampling defaults to regular resampling with replacement, and the predicted class for a test case also defaults to simple majority voting.
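The α_t values in the table of Example 22.14 are consistent with the AdaBoost weight rule α_t = ln((1 − ε_t)/ε_t), and the combined decision is the sign of the α-weighted vote of the base classifiers. The following is a minimal NumPy sketch of just these two steps, not the book's implementation: the error rates are taken from the example, while the function combined_predict and the toy base_preds array (standing in for the ±1 outputs of the four linear SVMs h1–h4 on a few points) are illustrative assumptions.

```python
import numpy as np

# Training error rates of the base classifiers h1..h4 from Example 22.14.
eps = np.array([0.280, 0.305, 0.174, 0.282])

# Classifier weights; ln((1 - eps)/eps) reproduces the alpha row of the table
# (approximately 0.94, 0.82, 1.56, 0.93).
alpha = np.log((1.0 - eps) / eps)

def combined_predict(base_preds, alpha):
    """Weighted sign-vote M_K(x) = sign(sum_t alpha_t * M_t(x)).

    base_preds: (K, n) array of +1/-1 predictions, one row per base classifier.
    alpha:      (K,) array of classifier weights.
    """
    scores = alpha @ base_preds            # weighted sum over the K base classifiers
    return np.where(scores >= 0, +1, -1)   # sign, breaking ties toward +1

# Hypothetical +1/-1 predictions of h1..h4 on three test points.
base_preds = np.array([
    [+1, -1, +1],
    [+1, +1, -1],
    [-1, +1, +1],
    [+1, +1, -1],
])
print(combined_predict(base_preds, alpha))
```

Setting α_t = 1 for every iteration and resampling uniformly reduces this weighted vote to the simple majority vote of bagging, as noted above.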
22.4 FURTHER READING

The application of ROC analysis to classifier performance was introduced in Provost and Fawcett (1997), with an excellent introduction to ROC analysis given in Fawcett (2006). For an in-depth description of the bootstrap, cross-validation, and other methods for assessing classification accuracy see Efron and Tibshirani (1993). For many datasets simple rules, like one-level decision trees, can yield good classification performance; see Holte (1993) for details. For a recent review and comparison of classifiers over multiple datasets see Demšar (2006). A discussion of bias, variance, and zero–one loss for classification appears in Friedman (1997), with a unified decomposition of bias and variance for both squared and zero–one loss given in Domingos (2000). The concept of bagging was proposed in Breiman (1996), and that of adaptive boosting in Freund and Schapire (1997). Random forests is a tree-based ensemble approach that can be very effective; see Breiman (2001) for details. For a comprehensive overview on the evaluation of classification algorithms see Japkowicz and Shah (2011).

Breiman, L. (1996). "Bagging predictors." Machine Learning, 24 (2): 123–140.
Breiman, L. (2001). "Random forests." Machine Learning, 45 (1): 5–32.
Demšar, J. (2006). "Statistical comparisons of classifiers over multiple data sets." The Journal of Machine Learning Research, 7: 1–30.
Domingos, P. (2000). "A unified bias-variance decomposition for zero-one and squared loss." In Proceedings of the National Conference on Artificial Intelligence, 564–569.
Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap, vol. 57. Boca Raton, FL: Chapman & Hall/CRC.
Fawcett, T. (2006). "An introduction to ROC analysis." Pattern Recognition Letters, 27 (8): 861–874.
Freund, Y. and Schapire, R. E. (1997). "A decision-theoretic generalization of on-line learning and an application to boosting." Journal of Computer and System Sciences, 55 (1): 119–139.
Friedman, J. H. (1997). "On bias, variance, 0/1-loss, and the curse-of-dimensionality." Data Mining and Knowledge Discovery, 1 (1): 55–77.
Holte, R. C. (1993). "Very simple classification rules perform well on most commonly used datasets." Machine Learning, 11 (1): 63–90.
Japkowicz, N. and Shah, M. (2011). Evaluating Learning Algorithms: A Classification Perspective. New York: Cambridge University Press.
Provost, F. and Fawcett, T. (1997). "Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions." In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Menlo Park, CA, 43–48.

22.5 EXERCISES

Q1. True or False:
(a) A classification model must have 100% accuracy (overall) on the training dataset.
(b) A classification model must have 100% coverage (overall) on the training dataset.

Q2. Given the training database in Table 22.6a and the testing data in Table 22.6b, answer the following questions:
(a) Build the complete decision tree using binary splits and Gini index as the evaluation measure (see Chapter 19).
(b) Compute the accuracy of the classifier on the test data. Also show the per class accuracy and coverage.

Table 22.6. Data for Q2
(a) Training:  X Y Z Class  15 20 25 30 35 25 15 20 4 A B A A B A B B 1 1 2 2
(b) Testing:   X Y Z Class  10 20 30 40 15 A B A B B 2

Q3. Show that for binary classification the majority voting for the combined classifier decision in boosting can be expressed as

    M_K(x) = sign( Σ_{t=1}^{K} α_t M_t(x) )

Q4. Consider the 2-dimensional dataset shown in Figure 22.10, with the labeled points belonging to two classes: c1 (triangles) and c2 (circles). Assume that the six hyperplanes were learned from different bootstrap samples. Find the error rate for each of the six hyperplanes on the entire dataset. Then, compute the 95% confidence interval for the expected error rate, using the t-distribution critical values for different degrees of freedom (dof) given in Table 22.7.

[Figure 22.10: For Q4; six hyperplanes h1–h6 over the labeled 2-dimensional dataset.]

Table 22.7. Critical values for t-test
    dof:    1         2        3        4        5        6
    tα/2:   12.7065   4.3026   3.1824   2.7764   2.5706   2.4469

Q5. Consider the probabilities P(+1|x_i) for the positive class obtained for some classifier, and given the true class labels y_i:

                x1     x2     x3     x4     x5     x6     x7     x8     x9     x10
    y_i         +1     −1     +1     +1     −1     +1     −1     +1     −1     −1
    P(+1|x_i)   0.53   0.86   0.25   0.95   0.87   0.86   0.76   0.94   0.44   0.86

Plot the ROC curve for this classifier. (One way to compute the curve's points is sketched below.)
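For Q5, the ROC curve is traced by sweeping a decision threshold over the scores P(+1|x_i): for each threshold, points scoring at least that value are predicted positive, and the resulting true positive rate is plotted against the false positive rate. The following minimal NumPy sketch, one possible approach rather than the book's code, computes these points for the scores and labels given in Q5; the function name roc_points is an illustrative choice.

```python
import numpy as np

# True class labels y_i and classifier scores P(+1 | x_i) from Q5.
y = np.array([+1, -1, +1, +1, -1, +1, -1, +1, -1, -1])
p = np.array([0.53, 0.86, 0.25, 0.95, 0.87, 0.86, 0.76, 0.94, 0.44, 0.86])

def roc_points(y, p):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    n_pos = np.sum(y == +1)   # number of actual positive examples
    n_neg = np.sum(y == -1)   # number of actual negative examples
    points = [(0.0, 0.0)]     # highest threshold: everything predicted negative
    # Each distinct score, visited in decreasing order, is a candidate threshold.
    for thr in sorted(set(p), reverse=True):
        pred = np.where(p >= thr, +1, -1)
        tp = np.sum((pred == +1) & (y == +1))
        fp = np.sum((pred == +1) & (y == -1))
        points.append((fp / n_neg, tp / n_pos))
    return points

for fpr, tpr in roc_points(y, p):
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

If scikit-learn is available, sklearn.metrics.roc_curve(y, p) yields an equivalent set of points, and the area under the resulting curve corresponds to the AUC measure discussed in Section 22.1.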
Index

accuracy, 549 Apriori algorithm, 223 association rule, 220, 301 antecedent, 301 assessment measures, 301 Bonferroni correction, 320 bootstrap sampling, 325 confidence, 220, 302 confidence interval, 325 consequent, 301 conviction, 306 Fisher exact test, 316 general, 315 improvement, 315 Jaccard coefficient, 305 leverage, 304 lift, 303 mining algorithm, 234 multiple hypothesis testing, 320 nonredundant, 315 odds ratio, 306 permutation test, 320 swap randomization, 321 productive, 315 randomization test, 320 redundant, 315 relative support, 220 significance, 320 specific, 315 support, 220, 302 relative, 302 swap randomization, 321 unproductive, 315 association rule mining, 234 attribute binary, categorical, nominal, ordinal, continuous, discrete, numeric, interval-scaled, ratio-scaled, bagging, 576 Bayes classifier, 467 categorical attributes, 471 numeric attributes, 468 Bayes theorem, 467, 492 Bernoulli distribution mean, 64 sample mean, 64 sample variance, 64 variance, 64 Bernoulli variable, 63 BetaCV measure, 441 bias-variance decomposition, 572 binary database, 218 vertical representation, 218 Binomial distribution, 65 bivariate analysis categorical, 72 numeric, 42 Bonferroni correction, 320 boosting, 577 AdaBoost, 577 combined classifier, 579 bootstrap sampling, 325, 563 C-index, 441 Calinski–Harabasz index, 450 categorical attributes angle, 87 cosine similarity, 88 covariance matrix, 68, 83 distance, 87 Euclidean distance, 87 Hamming distance, 88 categorical attributes (cont.)
Jaccard coefficient, 88 mean, 67, 83 bivariate, 74 norm, 87 sample covariance matrix, 69 bivariate, 75 sample mean, 67 bivariate, 74 Cauchy–Schwartz inequality, central limit theorem, 565 centroid, 333 Charm algorithm, 248 properties, 248 χ distribution, 80 chi-squared statistic, 80 χ statistic, 80, 85 classification, 29 accuracy, 549, 550, 553 area under ROC curve, 557 assessment measures, 548 contingency table based, 550 AUC, 557 bagging, 576 Bayes classifier, 467 bias, 573 bias-variance decomposition, 572 binary classes, 553 boosting, 577 AdaBoost, 577 classifier evaluation, 562 confidence interval, 565 confusion matrix, 550 coverage, 551 cross-validation, 562 decision trees, 481 ensemble classifiers, 575 error rate, 549, 553 expected loss, 572 F-measure, 551 false negative, 553 false negative rate, 554 false positive, 553 false positive rate, 554 K nearest neighbors classifier, 477 KNN classifier, 477 loss function, 572 naive Bayes classifier, 473 overfitting, 574 paired t-test, 569 precision, 550, 554 recall, 551 sensitivity, 554 specificity, 554 true negative, 553 true negative rate, 554 true positive, 553 true positive rate, 554 unstable, 575 Index variance, 573 classifier evaluation, 562 bootstrap resampling, 563 confidence interval, 565 cross-validation, 562 paired t-test, 569 closed itemsets, 243 Charm algorithm, 248 equivalence class, 244 cluster stability, 454 clusterability, 457 clustering, 28 centroid, 333 curse of dimensionality, 388 DBSCAN, 375 border point, 375 core point, 375 density connected, 376 density-based cluster, 376 directly density reachable, 375 ǫ-neighborhood, 375 noise point, 375 DENCLUE density attractor, 385 dendrogram, 364 density-based DBSCAN, 375 DENCLUE, 385 EM, see expectation maximization EM algorithm, see expectation maximization algorithm evaluation, 425 expectation maximization, 342, 343 expectation step, 344, 348 initialization, 344, 348 maximization step, 345, 348 multivariate data, 346 univariate data, 344 expectation maximization algorithm, 349 external validation, 425 Gaussian mixture model, 342 graph cuts, 401 internal validation, 425 K-means, 334 specialization of EM, 353 kernel density estimation, 379 kernel K-means, 338 Markov chain, 416 Markov clustering, 416 Markov matrix, 416 relative validation, 425 spectral clustering computational complexity, 407 stability, 425 sum of squared errors, 333 tendency, 425 validation external, 425 587 Index internal, 425 relative, 425 clustering evaluation, 425 clustering stability, 425 clustering tendency, 425, 457 distance distribution, 459 Hopkins statistic, 459 spatial histogram, 457 clustering validation BetaCV measure, 441 C-index, 441 Calinski–Harabasz index, 450 clustering tendency, 457 conditional entropy, 430 contingency table, 426 correlation measures, 436 Davies–Bouldin index, 444 distance distribution, 459 Dunn index, 443 entropy-based measures, 430 external measures, 425 F-measure, 427 Fowlkes–Mallows measure, 435 gap statistic, 452 Hopkins statistic, 459 Hubert statistic, 437, 445 discretized, 438 internal measures, 440 Jaccard coefficient, 435 matching based measures, 426 maximum matching, 427 modularity, 443 mutual information, 431 normalized, 431 normalized cut, 442 pairwise measures, 433 purity, 426 Rand statistic, 435 relative measures, 448 silhouette coefficient, 444, 448 spatial histogram, 457 stability, 454 variation of information, 432 conditional entropy, 430 confidence interval, 325, 565 small sample, 567 unknown variance, 566 confusion matrix, 550 contingency table, 78 χ test, 
85 clustering validation, 426 multiway, 84 correlation, 45 cosine similarity, covariance, 43 covariance matrix, 46, 49 bivariate, 74 determinant, 46 eigen-decomposition, 57 eigenvalues, 49 inner product, 50 outer product, 50 positive semidefinite, 49 trace, 46 cross-validation, 562 leave-one-out, 562 cumulative distribution binomial, 18 cumulative distribution function, 18 empirical CDF, 33 empirical inverse CDF, 34 inverse CDF, 34 joint CDF, 22, 23 quantile function, 34 curse of dimensionality clustering, 388 data dimensionality, extrinsic, 13 intrinsic, 13 data matrix, centering, 10 column space, 12 mean, rank, 13 row space, 12 symbolic, 63 total variance, data mining, 25 data normalization range normalization, 52 standard score normalization, 52 Davies–Bouldin index, 444 DBSCAN algorithm, 375 decision tree algorithm, 485 decision trees, 481 axis-parallel hyperplane, 483 categorical attributes, 485 data partition, 483 decision rules, 485 entropy, 486 Gini index, 487 information gain, 487 purity, 484 split point, 483 split point evaluation, 488 categorical attributes, 492 measures, 486 numeric attributes, 488 DENCLUE center-defined cluster, 386 density attractor, 385 density reachable, 387 density-based cluster, 387 DENCLUE algorithm, 385 dendrogram, 364 density attractor, 385 588 density estimation, 379 nearest neighbors based, 384 density-based cluster, 387 density-based clustering DBSCAN, 375 DENCLUE, 385 dimensionality reduction, 183 discrete random variable, 14 discretization, 89 equal-frequency intervals, 89 equal-width intervals, 89 dominant eigenvector, 105 power iteration method, 105 Dunn index, 443 Eclat algorithm, 225 computational complexity, 228 dEclat, 229 diffsets, 228 equivalence class, 226 empirical joint probability mass function, 457 ensemble classifiers, 575 bagging, 576 boosting, 577 entropy, 486 split, 487 EPMF, see empirical joint probability mass function error rate, 549 Euclidean distance, expectation maximization, 342, 343, 357 expectation step, 358 maximization step, 359 expected value, 34 exploratory data analysis, 26 F-measure, 427 false negative, 553 false positive, 553 Fisher exact test, 316, 318 Fowlkes–Mallows measure, 435 FPGrowth algorithm, 231 frequent itemset, 219 frequent itemsets mining, 221 frequent pattern mining, 27 gamma function, 80, 166 gap statistic, 452 Gauss error function, 55 Gaussian mixture model, 342 generalized itemset, 250 GenMax algorithm, 245 maximality checks, 245 Gini index, 487 graph, 280 adjacency matrix, 96 weighted, 96 authority score, 110 Index average degree, 98 average path length, 98 ´ Barabasi–Albert model, 124 clustering coefficient, 131 degree distribution, 125 diameter, 131 centrality authority score, 110 betwenness, 103 closeness, 103 degree, 102 eccentricity, 102 eigenvector centrality, 104 hub score, 110 pagerank, 108 prestige, 104 clustering coefficient, 100 clustering effect, 114 degree, 97 degree distribution, 94 degree sequence, 94 diameter, 98 eccentricity, 98 effective diameter, 99 efficiency, 101 ¨ ´ Erdos–R enyi model, 116 HITS, 110 hub score, 110 labeled, 280 PageRank, 108 preferential attachment, 124 radius, 98 random graphs, 116 scale-free property, 113 shortest path, 95 small-world property, 112 transitivity, 101 Watts–Strogatz model, 118 clustering coefficient, 119 degree distribution, 121 diameter, 119, 122 graph clustering average weight, 409 degree matrix, 395 graph cut, 402 k-way cut, 401 Laplacian matrix, 398 Markov chain, 416 Markov clustering, 416 MCL algorithm, 418 modularity, 411 normalized 
adjacency matrix, 395 normalized asymmetric Laplacian, 400 normalized cut, 404 normalized modularity, 415 normalized symmetric Laplacian, 399 objective functions, 403, 409 ratio cut, 403 weighted adjacency matrix, 394 589 Index graph cut, 402 graph isomorphism, 281 graph kernel, 156 exponential, 157 power kernel, 157 von Neumann, 158 graph mining canonical DFS code, 287 canonical graph, 286 canonical representative, 285 DFS code, 286 edge growth, 283 extended edge, 280 graph isomorphism, 281 gSpan algorithm, 288 rightmost path extension, 284 rightmost vertex, 285 search space, 283 subgraph isomorphism, 282 graph models, 112 ´ Barabasi–Albert model, 124 ¨ ´ Erdos–R enyi model, 116 Watts–Strogatz model, 118 graphs degree matrix, 395 Laplacian matrix, 398 normalized adjacency matrix, 395 normalized asymmetric Laplacian, 400 normalized symmetric Laplacian, 399 weighted adjacency matrix, 394 GSP algorithm, 261 gSpan algorithm, 288 candidate extension, 291 canonicality checking, 295 subgraph isomorphisms, 293 support computation, 291 hierarchical clustering, 364 agglomerative, 364 complete link, 367 dendrogram, 364, 365 distance measures, 367 divisive, 364 group average, 368 Lance–Williams formula, 370 mean distance, 368 minimum variance, 368 single link, 367 update distance matrix, 370 Ward’s method, 368 Hopkins statistic, 459 Hubert statistic, 437, 445 hyper-rectangle, 163 hyperball, 164 volume, 165 hypercube, 164 volume, 165 hyperspace, 163 density of multivariate normal, 172 diagonals, 171 angle, 171 hypersphere, 164 asymptotic volume, 167 closed, 164 inscribed within hypercube, 168 surface area, 167 volume of thin shell, 169 hypersphere volume, 175 Jacobian, 176–178 Jacobian matrix, 176–178 IID, see independent and identically distributed inclusion–exclusion principle, 251 independent and identically distributed, 24 information gain, 487 interquartile range, 38 itemset, 217 itemset mining, 217, 221 Apriori algorithm, 223 level-wise approach, 223 candidate generation, 221 Charm algorithm, 248 computational complexity, 222 Eclat algorithm, 225 tidset intersection, 225 FPGrowth algorithm, 231 frequent pattern tree, 231 frequent pattern tree, 231 GenMax algorithm, 245 level-wise approach, 223 negative border, 240 partition algorithm, 238 prefix search tree, 221, 223 support computation, 221 tidset intersection, 225 itemsets assessment measures, 309 closed, 313 maximal, 312 minimal generator, 313 minimum support threshold, 219 productive, 314 support, 309 relative, 309 closed, 243, 248 closure operator, 243 properties, 243 generalized, 250 maximal, 242, 245 minimal generators, 244 nonderivable, 250, 254 relative support, 219 rule-based assessment measures, 310 support, 219 Jaccard coefficient, 435 Jacobian matrix, 176–178 590 K nearest neighbors classifier, 477 K-means algorithm, 334 kernel method, 338 k-way cut, 401 kernel density estimation, 379 discrete kernel, 380, 382 Gaussian kernel, 380, 383 multivariate, 382 univariate, 379 kernel discriminant analysis, 505 kernel K-means, 338 kernel matrix, 135 centered, 151 normalized, 153 kernel methods data-specific kernel map, 142 diffusion kernel, 156 exponential, 157 power kernel, 157 von Neumann, 158 empirical kernel map, 140 Gaussian kernel, 147 graph kernel, 156 Hilbert space, 140 kernel matrix, 135 kernel operations centering, 151 distance, 149 mean, 149 norm, 148 normalization, 153 total variance, 150 kernel trick, 137 Mercer kernel map, 143 polynomial kernel homogeneous, 144 inhomogeneous, 144 positive semidefinite kernel, 138 pre-Hilbert 
space, 140 reproducing kernel Hilbert space, 140 reproducing kernel map, 139 reproducing property, 140 spectrum kernel, 155 string kernel, 155 vector kernel, 144 kernel PCA, see kernel principal component analysis kernel principal component analysis, 202 kernel trick, 338 KL divergence, see Kullback–Leibler divergence KNN classifier, 477 Kullback–Leibler divergence, 457 linear discriminant analysis, 498 between-class scatter matrix, 501 Fisher objective, 500 optimal linear discriminant, 501 within-class scatter matrix, 501 Index loss function, 572 squared loss, 572 zero-one loss, 572 Mahalanobis distance, 56 Markov chain, 416 Markov clustering, 416 maximal itemsets, 242 GenMax algorithm, 245 maximum likelihood estimation, 343, 353 covariance matrix, 355 mean, 354 mixture parameters, 356 maximum matching, 427 mean, 34 median, 35 minimal generator, 244 mode, 36 modularity, 412, 443 multinomial distribution, 71 covariance, 72 mean, 72 sample covariance, 72 sample mean, 72 multiple hypothesis testing, 320 multivariate analysis categorical, 82 numeric, 48 multivariate Bernoulli variable, 66, 82 covariance matrix, 68, 83 empirical PMF, 69 joint PMF, 73 mean, 67, 83 probability mass function, 66, 73 sample covariance matrix, 69 sample mean, 67 multivariate variable Bernoulli, 66 mutual information, 431 normalized, 431 naive Bayes classifier, 473 categorical attributes, 476 numeric attributes, 473 nearest neighbors density estimation, 384 nonderivable itemsets, 250, 254 inclusion–exclusion principle, 251 support bounds, 252 normal distribution Gauss error function, 55 normalized cut, 442 orthogonal complement, 186 orthogonal projection matrix, 186 error vector, 186 orthogonal subspaces, 186 591 Index pagerank, 108 paired t-test, 569 pattern assessment, 309 PCA, see principal component analysis permutation test, 320 swap randomization, 321 population, 24 power iteration method, 105 PrefixSpan algorithm, 265 principal component, 187 kernel PCA, 202 principal component analysis, 187 choosing the dimensionality, 197 connection with SVD, 211 mean squared error, 193, 197 minimum squared error, 189 total projected variance, 192, 196 probability density function, 16 joint PDF, 20, 23 probability distribution Bernoulli, 15, 63 binomial, 15 bivariate normal, 21 Gaussian, 17 multivariate normal, 56 normal, 17, 54 probability mass function, 15 empirical joint PMF, 43 empirical PMF, 34 joint PMF, 20, 23 purity, 426 quantile function, 34 quartile, 38 Rand statistic, 435 random graphs, 116 average degree, 116 clustering coefficient, 117 degree distribution, 116 diameter, 118 random sample, 24 multivariate, 24 statistic, 25 univariate, 24 random variable, 14 Bernoulli, 63 bivariate, 19 continuous, 14 correlation, 45 covariance, 43 covariance matrix, 46, 49 discrete, 14 empirical joint PMF, 43 expectation, 34 expected value, 34 generalized variance, 46, 49 independent and identically distributed, 24 interquartile range, 38 mean, 34 bivariate, 43 multivariate, 48 median, 35 mode, 36 moments about the mean, 39 multivariate, 23 standard deviation, 39 standardized covariance, 45 total variance, 43, 46, 49 value range, 38 variance, 38 vector, 23 receiver operating characteristic curve, 556 ROC curve, see receiver operating characteristic curve rule assessment, 301 sample covariance matrix bivariate, 75 sample mean, 25 sample space, 14 sample variance geometric interpretation, 40 sequence, 259 closed, 260 maximal, 260 sequence mining alphabet, 259 GSP algorithm, 261 prefix, 259 PrefixSpan algorithm, 265 relative 
support, 260 search space, 260 sequence, 259 SPADE algorithm, 263 subsequence, 259 consecutive, 259 substring, 259 substring mining, 267 suffix, 259 suffix tree, 267 support, 260 silhouette coefficient, 444, 448 singular value decomposition, 208 connection with PCA, 211 Frobenius norm, 210 left singular vector, 209 reduced SVD, 209 right singular vector, 209 singular value, 209 spectral decomposition, 210 Spade algorithm sequential joins, 263 spectral clustering average weight, 409 computational complexity, 407 degree matrix, 395 k-way cut, 401 592 spectral clustering (cont.) Laplacian matrix, 398 modularity, 411 normalized adjacency matrix, 395 normalized asymmetric Laplacian, 400 normalized cut, 404 normalized modularity, 415 normalized symmetric Laplacian, 399 objective functions, 403, 409 ratio cut, 403 weighted adjacency matrix, 394 spectral clustering algorithm, 406 standard deviation, 39 standard score, 39 statistic, 25 robustness, 35 sample correlation, 45 sample covariance, 44 sample covariance matrix, 46, 50 sample interquartile range, 38 sample mean, 25, 35 bivariate, 43 multivariate, 48 sample median, 36 sample mode, 36 sample range, 38 sample standard deviation, 39 sample total variance, 43 sample variance, 39 standard score, 39 trimmed mean, 35 unbiased estimator, 35 z-score, 39 statistical independence, 22 Stirling numbers second kind, 333 string, see sequence string kernel spectrum kernel, 155 subgraph, 281 connected, 281 support, 283 subgraph isomorphism, 282 substring mining, 267 suffix tree, 267 Ukkonen’s algorithm, 270 suffix tree, 267 Ukkonen’s algorithm, 270 support vector machines, 514 bias, 514 canonical hyperplane, 518 classifier, 522 directed distance, 515 dual algorithm, 535 dual objective, 521 hinge loss, 525, 532 hyperplane, 514 Karush–Kuhn–Tucker conditions, 521 kernel SVM, 530 Index linearly separable, 515 margin, 518 maximum margin hyperplane, 520 newton optimization algorithm, 539 nonseparable case, 524 nonlinear case, 530 primal algorithm, 539 primal kernel SVM algorithm, 541 primal objective, 520 quadratic loss, 529, 532 regularization constant, 525 separable case, 520 separating hyperplane, 515 slack variables, 525 soft margin, 525 stochastic gradient ascent algorithm, 535 support vectors, 518 training algorithms, 534 weight vector, 514 SVD, see singular value decomposition SVM, see support vector machines swap randomization, 321 tidset, 218 transaction identifiers, 218 tids, 218 total variance, 9, 43 transaction, 218 transaction database, 218 true negative, 553 true positive, 553 Ukkonen’s algorithm computational cost, 271 implicit extensions, 272 implicit suffixes, 271 skip/count trick, 272 space requirement, 270 suffix links, 273 time complexity, 276 univariate analysis categorical, 63 numeric, 33 variance, 38 variation of information, 432 vector dot product, Euclidean norm, length, linear combination, Lp -norm, normalization, orthogonal decomposition, 10 orthogonal projection, 11 orthogonality, perpendicular distance, 11 standard basis, unit vector, 593 Index vector kernel, 144 Gaussian, 147 polynomial, 144 vector random variable, 23 vector space basis, 13 column space, 12 dimension, 13 linear combination, 12 linear dependence, 13 linear independence, 13 orthogonal basis, 13 orthonormal basis, 13 row space, 12 span, 12 spanning set, 12 standard basis, 13 Watts–Strogatz model clustering coefficient, 122 z-score, 39 [...]... 
our editor at Cambridge University Press, for her guidance and patience in realizing this book. Finally, on a more personal front, MJZ dedicates the book to his wife, Amina, for her love, patience and support over all these years, and to his children, Abrar and Afsah, and his parents. WMJ gratefully dedicates the book to his wife Patricia; to his children, Gabriel and Marina; and to his parents, Wagner...

... probabilistic interpretation of data. We then discuss the main data mining tasks, which span exploratory data analysis, frequent pattern mining, clustering, and classification, laying out the roadmap for the book.

1.1 DATA MATRIX

Data can often be represented or abstracted as an n × d data matrix, with n rows and d columns, where rows correspond to entities in the dataset, and columns represent attributes or