Computer Science and Data Analysis Series

Clustering for Data Mining
A Data Recovery Approach

Boris Mirkin

Boca Raton London New York Singapore

Published in 2005 by
Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2005 by Taylor & Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group

No claim to original U.S. Government works
Printed in the United States of America on acid-free paper

International Standard Book Number-10: 1-58488-534-3 (Hardcover)
International Standard Book Number-13: 978-1-58488-534-4 (Hardcover)
Library of Congress Card Number 2005041421

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered
trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Mirkin, B. G. (Boris Grigorévich)
Clustering for data mining : a data recovery approach / Boris Mirkin.
p. cm. (Computer science and data analysis series ; 3)
Includes bibliographical references and index.
ISBN 1-58488-534-3
Data mining. Cluster analysis. I. Title. II. Series.
QA76.9.D343M57 2005
006.3'12 dc22 2005041421

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com

Taylor & Francis Group is the Academic Division of T&F Informa plc.

Chapman & Hall/CRC Computer Science and Data Analysis Series

The interface between the computer and statistical sciences is increasing, as each discipline seeks to harness the power and resources of the other. This series aims to foster the integration between the computer sciences and statistical, numerical, and probabilistic methods by publishing a broad range of reference works, textbooks, and handbooks.

SERIES EDITORS
John Lafferty, Carnegie Mellon University
David Madigan, Rutgers University
Fionn Murtagh, Royal Holloway, University of London
Padhraic Smyth, University of California, Irvine

Proposals for the series should be sent directly to one of the series editors above, or submitted to:
Chapman & Hall/CRC
23-25 Blades Court
London SW15 2NU
UK

Published Titles

Bayesian Artificial Intelligence
Kevin B. Korb and Ann E. Nicholson

Pattern Recognition Algorithms for Data Mining
Sankar K. Pal and Pabitra Mitra

Exploratory Data Analysis with MATLAB®
Wendy L. Martinez and Angel R. Martinez

Clustering for Data Mining: A Data Recovery Approach
Boris Mirkin

Correspondence Analysis and Data Coding with JAVA and R
Fionn Murtagh

R Graphics
Paul Murrell

Contents

Preface
List of Denotations
Introduction: Historical Remarks

1 What Is Clustering
  Base words
  1.1 Exemplary problems
    1.1.1 Structuring
    1.1.2 Description
    1.1.3 Association
    1.1.4 Generalization
    1.1.5 Visualization of data structure
  1.2 Bird's-eye view
    1.2.1 Definition: data and cluster structure
    1.2.2 Criteria for revealing a cluster structure
    1.2.3 Three types of cluster description
    1.2.4 Stages of a clustering application
    1.2.5 Clustering and other disciplines
    1.2.6 Different perspectives of clustering

2 What Is Data
  Base words
  2.1 Feature characteristics
    2.1.1 Feature scale types
    2.1.2 Quantitative case
    2.1.3 Categorical case
  2.2 Bivariate analysis
    2.2.1 Two quantitative variables
    2.2.2 Nominal and quantitative variables
    2.2.3 Two nominal variables cross-classified
    2.2.4 Relation between correlation and contingency
    2.2.5 Meaning of correlation
  2.3 Feature space and data scatter
    2.3.1 Data matrix
    2.3.2 Feature space: distance and inner product
    2.3.3 Data scatter
  2.4 Pre-processing and standardizing mixed data
  2.5 Other table data types
    2.5.1 Dissimilarity and similarity data
    2.5.2 Contingency and flow data

3 K-Means Clustering
  Base words
  3.1 Conventional K-Means
    3.1.1 Straight K-Means
    3.1.2 Square error criterion
    3.1.3 Incremental versions of K-Means
  3.2 Initialization of K-Means
    3.2.1 Traditional approaches to initial setting
    3.2.2 MaxMin for producing deviate centroids
    3.2.3 Deviate centroids with Anomalous pattern
  3.3 Intelligent K-Means
    3.3.1 Iterated Anomalous pattern for iK-Means
    3.3.2 Cross validation of iK-Means results
  3.4 Interpretation aids
    3.4.1 Conventional interpretation aids
    3.4.2 Contribution and relative contribution tables
    3.4.3 Cluster representatives
    3.4.4 Measures of association from ScaD tables
  3.5 Overall assessment

4 Ward Hierarchical Clustering
  Base words
  4.1 Agglomeration: Ward algorithm
  4.2 Divisive clustering with Ward criterion
    4.2.1 2-Means splitting
    4.2.2 Splitting by separating
    4.2.3 Interpretation aids for upper cluster hierarchies
  4.3 Conceptual clustering
  4.4 Extensions of Ward clustering
    4.4.1 Agglomerative clustering with dissimilarity data
    4.4.2 Hierarchical clustering for contingency and flow data
  4.5 Overall assessment

5 Data Recovery Models
  Base words
  5.1 Statistics modeling as data recovery
    5.1.1 Averaging
    5.1.2 Linear regression
    5.1.3 Principal component analysis
    5.1.4 Correspondence factor analysis
  5.2 Data recovery model for K-Means
    5.2.1 Equation and data scatter decomposition
    5.2.2 Contributions of clusters, features, and individual entities
    5.2.3 Correlation ratio as contribution
    5.2.4 Partition contingency coefficients
  5.3 Data recovery models for Ward criterion
    5.3.1 Data recovery models with cluster hierarchies
    5.3.2 Covariances, variances and data scatter decomposed
    5.3.3 Direct proof of the equivalence between 2-Means and Ward criteria
    5.3.4 Gower's controversy
  5.4 Extensions to other data types
    5.4.1 Similarity and attraction measures compatible with K-Means and Ward criteria
    5.4.2 Application to binary data
    5.4.3 Agglomeration and aggregation of contingency data
    5.4.4 Extension to multiple data
  5.5 One-by-one clustering
    5.5.1 PCA and data recovery clustering
    5.5.2 Divisive Ward-like clustering
    5.5.3 Iterated Anomalous pattern
    5.5.4 Anomalous pattern versus Splitting
    5.5.5 One-by-one clusters for similarity data
  5.6 Overall assessment

6 Different Clustering Approaches
  Base words
  6.1 Extensions of K-Means clustering
    6.1.1 Clustering criteria and implementation
    6.1.2 Partitioning around medoids PAM
    6.1.3 Fuzzy clustering
    6.1.4 Regression-wise clustering
    6.1.5 Mixture of distributions and EM algorithm
    6.1.6 Kohonen self-organizing maps SOM
  6.2 Graph-theoretic approaches
    6.2.1 Single linkage, minimum spanning tree and connected components
    6.2.2 Finding a core
  6.3 Conceptual description of clusters
    6.3.1 False positives and negatives
    6.3.2 Conceptually describing a partition
    6.3.3 Describing a cluster with production rules
    6.3.4 Comprehensive conjunctive description of a cluster
  6.4 Overall assessment

7 General Issues
  Base words
  7.1 Feature selection and extraction
    7.1.1 A review
    7.1.2 Comprehensive description as a feature selector
    7.1.3 Comprehensive description as a feature extractor
  7.2 Data pre-processing and standardization
    7.2.1 Dis/similarity between entities
    7.2.2 Pre-processing feature based data
    7.2.3 Data standardization
  7.3 Similarity on subsets and partitions
    7.3.1 Dis/similarity between binary entities or subsets
    7.3.2 Dis/similarity between partitions
  7.4 Dealing with missing data
    7.4.1 Imputation as part of pre-processing
    7.4.2 Conditional mean
    7.4.3 Maximum likelihood
    7.4.4 Least-squares approximation
  7.5 Validity and reliability
    7.5.1 Index based validation
    7.5.2 Resampling for validation and selection
    7.5.3 Model selection with resampling
  7.6 Overall assessment

Conclusion: Data Recovery Approach in Clustering
Bibliography

Preface

Clustering is a discipline devoted to finding and describing cohesive or homogeneous chunks in data, the clusters.

Some exemplary clustering problems are:

- Finding common surf patterns in the set of web users
- Automatically revealing meaningful parts in a digitalized image
- Partition of a set of documents in groups by similarity of their contents
- Visual display of the environmental similarity between regions on a country map
- Monitoring socio-economic development of a system of settlements via a small number of representative settlements
- Finding protein sequences in a database that are homologous to a query protein sequence
- Finding anomalous patterns of gene expression data for diagnostic purposes
- Producing a decision rule for separating potentially bad-debt credit applicants
- Given a set of preferred vacation places, finding out what features of the places and vacationers attract each other
- Classifying households according to their furniture purchasing patterns and finding groups' key characteristics to optimize furniture marketing and production

Clustering is a key area in data mining and knowledge discovery, which are activities oriented towards finding non-trivial or hidden patterns in data collected in databases.

Earlier developments of clustering techniques have been associated, primarily, with three areas of research: factor analysis in psychology [55], numerical taxonomy in biology [122], and unsupervised learning in pattern recognition [21].

Technically speaking, the idea behind clustering is rather simple: introduce a measure of similarity between entities under consideration and combine similar entities into the same clusters while keeping dissimilar entities in different clusters. However, implementing this idea is less than straightforward.

First, too many similarity measures and clustering techniques have been invented with virtually no support for a non-specialist user in selecting among them. The trouble with this is that different similarity measures and/or clustering techniques may, and frequently do, lead to different results. Moreover, the same technique may also lead to different cluster solutions depending on the choice of parameters such as the initial setting or the number of clusters specified. On the other hand, some common data types, such as questionnaires with both quantitative and categorical features, have been left virtually without any substantiated similarity measure.

Second, use and interpretation of cluster structures may become an issue, especially when available data features are not straightforwardly related to the phenomenon under consideration. For instance, certain data on customers available at a bank, such as age and gender, typically are not very helpful in deciding whether to grant a customer a loan or not.

Specialists acknowledge peculiarities of the discipline of
clustering. They understand that the clusters to be found in data may very well depend not only on the data but also on the user's goals and degree of granulation. They frequently consider clustering as art rather than science. Indeed, clustering has been dominated by learning from examples rather than theory-based instructions. This is especially visible in texts written for inexperienced readers, such as [4], [28] and [115].

The general opinion among specialists is that clustering is a tool to be applied at the very beginning of investigation into the nature of a phenomenon under consideration, to view the data structure and then decide upon applying better suited methodologies. Another opinion of specialists is that methods for finding clusters as such should constitute the core of the discipline; related questions of data pre-processing, such as feature quantization and standardization, definition and computation of similarity, and post-processing, such as interpretation and association with other aspects of the phenomenon, should be left beyond the scope of the discipline because they are motivated by external considerations related to the substance of the phenomenon under investigation. I share the former opinion and dispute the latter because it is at odds with the former: in the very first steps of knowledge discovery, substantive considerations are quite shaky, and it is unrealistic to expect that they alone could lead to properly solving the issues of pre- and post-processing.

Such a dissimilar opinion has led me to believe that the discovered clusters must be treated as an "ideal" representation of the data that could be used for recovering the original data back from the ideal format. This is the idea of the data recovery approach: not only use data for finding clusters but also use clusters for recovering the data. In a general situation, the data recovered from aggregate clusters cannot fit the original data exactly, which can be used for evaluation of the quality of
clusters: the better the fit, the better the clusters. This perspective would also lead to the addressing of issues in pre- and post- ... the data that are explained by clusters can be separated from those that are not. The data recovery approach is common in more traditional data mining and statistics areas such as regression, analysis...
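
The data recovery idea stated in the Preface ("the better the fit, the better the clusters") can be illustrated with a small sketch. The fragment below is not taken from the book. It assumes that each entity is recovered as the centroid of its cluster and that fit is measured by the share of the data scatter (the sum of squared deviations from the grand mean) that this recovery leaves unexplained. The function names, the toy data and the deliberately poor partition are hypothetical and serve only as an illustration.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def centroid(points):
    """Feature-wise mean of a non-empty list of points."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def k_means(data, initial_centroids, max_iter=100):
    """Alternate nearest-centroid assignment and centroid update
    until the labels stop changing (illustrative, not the book's code)."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(x, centroids[k]))
            for x in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [x for x, label in zip(data, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = centroid(members)
    return labels, centroids


def unexplained_share(data, labels):
    """Share of the data scatter left unexplained when every entity
    is recovered as the centroid of its own cluster."""
    grand_mean = centroid(data)
    scatter = sum(squared_distance(x, grand_mean) for x in data)
    residual = 0.0
    for k in set(labels):
        members = [x for x, label in zip(data, labels) if label == k]
        c = centroid(members)
        residual += sum(squared_distance(x, c) for x in members)
    return residual / scatter


# Hypothetical toy data: two well separated groups of four entities each.
data = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]

# Partition found by K-Means started from one entity of each group.
good_labels, _ = k_means(data, [data[0], data[4]])
# A deliberately poor partition that mixes the two groups.
poor_labels = [i % 2 for i in range(len(data))]

good_share = unexplained_share(data, good_labels)
poor_share = unexplained_share(data, poor_labels)
print(f"{good_share:.2f}")  # 0.02
print(f"{poor_share:.2f}")  # 0.99
```

Under these assumptions the partition that separates the two groups leaves about 2% of the data scatter unexplained, whereas the mixed partition leaves about 99%, so the recovery fit ranks the two partitions as one would expect.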