
Clustering Methodology for Symbolic Data




DOCUMENT INFORMATION

Pages: 345
Size: 3.94 MB

Contents

Clustering Methodology for Symbolic Data

Wiley Series in Computational Statistics

Consulting Editors: Paolo Giudici, University of Pavia, Italy; Geof H. Givens, Colorado State University, USA

The Wiley Series in Computational Statistics comprises practical guides and cutting-edge research books on new developments in computational statistics. It features quality authors with a strong applications focus. The texts in the series provide detailed coverage of statistical concepts, methods, and case studies in areas at the interface of statistics, computing, and numerics. With sound motivation and a wealth of practical examples, the books show in concrete terms how to select and to use appropriate ranges of statistical computing techniques in particular fields of study. Readers are assumed to have a basic understanding of introductory terminology. The series concentrates on applications of computational methods in statistics to the fields of bioinformatics, genomics, epidemiology, business, engineering, finance, and applied statistics.

Clustering Methodology for Symbolic Data
Lynne Billard, University of Georgia, USA
Edwin Diday, CEREMADE, Université Paris-Dauphine, Université PSL, Paris, France

This edition first published 2020. © 2020 John Wiley & Sons Ltd.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Lynne Billard and Edwin Diday to be identified as the authors of this work has been asserted in accordance with law.

Registered Offices: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA; John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office: 9600 Garsington Road, Oxford, OX4 2DQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com. Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty: While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data
Names: Billard, L. (Lynne), 1943– author. | Diday, E., author.
Title: Clustering methodology for symbolic data / Lynne Billard (University of Georgia), Edwin Diday (CEREMADE, Université Paris-Dauphine, Université PSL, Paris, France).
Description: Hoboken, NJ : Wiley, 2020. | Includes bibliographical references and index.
Identifiers: LCCN 2019011642 (print) | LCCN 2019018340 (ebook) | ISBN 9781119010388 (Adobe PDF) | ISBN 9781119010395 (ePub) | ISBN 9780470713938 (hardcover)
Subjects: LCSH: Cluster analysis. | Multivariate analysis.
Classification: LCC QA278.55 (ebook) | LCC QA278.55 .B55 2019 (print) | DDC 519.5/3–dc23
LC record available at https://lccn.loc.gov/2019011642

Cover Design: Wiley
Cover Image: © Lynne Billard; Background: © Iuliia_Syrotina_28/Getty Images

Set in 10/12pt WarnockPro by SPi Global, Chennai, India
Index

a

Adaptive leader 156–159, 178–179
Agglomerative clustering (see Cluster(s)/clustering, agglomerative)
Aggregation 1–5, 7–8, 13, 17–24, 27, 56, 65, 84, 86, 95–96, 131, 139, 144, 181, 197, 239, 246, 254, 282
Assertion 290, 302
Association measures (see Divisive clustering, association)
Average weighted dissimilarity 237–238, 243 (see also Divisive clustering, polythetic)
  maximum average weighted dissimilarity 238, 243 (see also Divisive clustering, polythetic)

b

Between-cluster variation 54, 127, 199–200, 250, 252
Between observation variance (see Variation, between observations)
Bi-partition of cluster (see Partitions (partitioning))
Bi-plots 140, 142, 163, 164, 169, 312

c

Capacity 10, 16, 17, 38, 84
Cartesian product 1, 17
Categories 2, 8–11, 15–16, 18, 21–22, 24–26, 55–56, 58, 65, 79, 83–87, 91, 119, 153, 155–157, 200–203, 205–206, 214, 267, 290 (see also Symbolic data, types of, multi-valued)
Census 2, 18, 24, 56, 86–87, 154, 232, 246, 281
Centroid(s) 1, 121, 122, 152–153, 161–163, 167–172, 189, 225, 264–265, 275–276, 278 (see also Cluster(s)/clustering, seeds; Mean, symbolic)
Centroid histogram 172 (see also Centroid(s))
Chebychev's distance 52, 54, 71, 73, 75, 190 (see also Dissimilarity)
Choquet capacities 39, 84
City block distance (see Dissimilarity, city block)
Classes (see Categories; Cluster(s)/clustering)
Classification(s) (see Cluster(s)/clustering, types of)
Cluster(s)/clustering (see also Divisive clustering, monothetic; Divisive clustering, polythetic)
  agglomerative 5, 126–129, 261–316 (see also Pyramid clustering)
    adaptive 285, 288
    algorithm 266
    average-link 128, 139, 140, 263, 273, 280–282
      group average 128, 263, 268–269
      weighted average 128, 263
    centroid-link 128, 265
    complete-link 128–129, 139, 140, 142, 262, 265, 268–273, 285, 310
    farthest neighbor (see complete-link)
    flexible 128
    for histogram data 277–288
    for interval data 269–277
    median-link 128
    minimum variance (see Ward's method)
    for mixed data 281–282
    for modal multi-valued data 266–269
    for multi-valued data (see modal multi-valued data)
    nearest neighbor (see single-link)
    pyramids (see Pyramid clustering)
    single-link 128, 129, 156, 262, 265, 267–270, 273, 280, 285
    Ward's method 4, 128, 129, 156, 263–266, 276–279, 282, 285
  between-cluster sum of squares (see between-cluster variation)
  between-cluster variation 54, 127, 199–200, 250, 252 (see also Variation, between-cluster)
  bottom-up (see agglomerative)
  classical 1, 4–5, 119–130, 131, 139, 141, 144–145, 151, 180, 257, 262, 264, 289, 309
  construction algorithms 120, 151, 184, 204, 232, 238–239
  divisive 5, 126 (see also Divisive clustering)
  dynamical (see Partitions (partitioning), dynamical partitioning)
  hierarchical (see Hierarchical clustering)
  initial clusters (see seeds)
  monothetic 120, 125, 131–137, 144, 197, 200, 203–249, 254–256
  non-hierarchical (see Partitions (partitioning))
  numbers of 120, 122, 146, 163, 181, 190, 198, 239, 250, 253, 255, 257 (see also Davis and Bouldin index; Dunn index)
  overview 118–125
  partitioning (see Partitions (partitioning))
  polythetic 5, 120, 125, 137, 197, 200, 236–250, 255–257, 266
  pyramid (see Pyramid clustering)
  representations 5, 122–124, 129, 149, 186–189
  seeds 121, 150, 152, 155–156, 161–162, 164, 166–167, 190
  ties 128
  top-down (see Divisive clustering)
  total 159, 198, 200, 204, 206, 219–221, 225–226, 229–231, 250–251, 264, 276
  types of (see also agglomerative; Divisive clustering; polythetic; Pyramid clustering)
    hierarchy 4–5, 48, 119, 121, 125–130, 146, 186, 189, 197–259, 261–316
    nested 125, 129 (see Divisive clustering)
    non-nested 120 (see Partitions (partitioning))
    pyramid (see Pyramid clustering)
  main cluster 238, 243
  partition (see Partitions (partitioning))
  splinter cluster 238, 243–245, 257
  within-cluster sum of squares (see within-cluster variation)
  within-cluster variation 54, 127, 159, 198–200, 219, 229, 232, 237, 245, 250, 251, 253–254, 264
Complete-link 128–129, 139, 140, 142, 262, 265, 268–273, 285, 310
Complete object (see Object(s))
Content component 58, 69, 87–88 (see Gowda–Diday dissimilarity, content component)
Copula 182, 184–186
  Archimedean 184
  Frank copula 185
  function 184
Correlation coefficient, symbolic 30–31, 257
Covariance, symbolic 3, 24, 28–31, 39, 52, 72, 117
Credibility 16–17, 38, 94
Cumulative density function 115–116, 170, 246, 249, 279, 281
Cumulative density function dissimilarity (see Dissimilarity, cumulative density function)
Cumulative distribution 117–118, 131, 170–172, 180 (see also Dissimilarity; Symbolic data, types of, cumulative)
Cut points (see Divisive clustering)

d

Davis and Bouldin index 146, 253–255, 257
de Carvalho distance 76–78, 112–114, 170
  agreement–disagreement indices 76, 112
  comparison functions 76–77, 112–113
Dendrogram(s) (see Tree(s))
Density function (see Probability density function)
Dependence
  hierarchical dependency (see Hierarchical clustering)
  logical rule (see Logical dependency)
  product-moment (see Covariance, symbolic)
Dependency (see Dependence)
Description(s)
  of objects (see Observations)
  of observations 8–17, 84, 237, 289–290
  virtual 282–284
Descriptive statistics 24–38, 101–104 (see also Covariance, symbolic; Mean, symbolic)
Difference 237–238 (see also Divisive clustering, polythetic)
Dissimilarity 47–118 (see also de Carvalho distance; Euclidean distance; Extended Gowda–Diday dissimilarity; Extended Ichino–Yaguchi dissimilarity; Gowda–Diday dissimilarity; Hausdorff distance; Mallows' distance)
  categories (see multi-valued)
  city block 51–52, 60–61, 66, 73–75, 106–107, 119, 207, 212, 251–252, 255
  classical 47–54
  cumulative density function 115–116, 170, 246, 254, 279, 281
  cumulative distribution function (see cumulative density function)
  cumulative histogram function (see cumulative density function)
  earth movers (see Mallows' distance)
  Hausdorff (see Hausdorff distance)
  histograms 83, 93–114, 117, 170–171, 225–236
  intervals 62–78, 216–225
  L2 distance 189
  lists (see multi-valued)
  Mahalanobis (see Mahalanobis distance)
  Mallows' (see Mallows' distance)
  matrix 48, 50, 54, 58, 60, 66, 68, 71–72, 75, 81, 86, 91–92, 108, 111, 170, 207–208, 212, 215, 219, 226, 229, 232, 243, 245–246, 250, 254–255, 267–268, 270–275, 278, 280, 309–310
  Minkowski (see Minkowski distance)
  mixed 47, 72, 246
  modal multi-valued 83–93, 205–214, 309–311 (see also multi-valued)
  multi-valued 55–62, 214–215
  normalized cumulative density function 115, 116, 131, 138–139, 254, 279–281
  Pearson's correlation 79, 260
  pyramidal 48
  Robinson 48–50, 54, 129, 310
  ultrametric 48–50, 128–130
  Ward 264 (see also Cluster(s)/clustering, agglomerative, Ward's method)
  Wasserstein (see Mallows' distance)
Distance measure (matrix) (see Dissimilarity)
Distribution(s)
  of distributions 181
  function 179
  inverse distribution function 117
  joint, distribution of distributions 182
  mixture of 181–186
Divisive clustering 5, 125–126, 197–260
  association measures 5, 200–203, 205, 207–209
  between-cluster variation 199, 251–254
  for histogram data 131, 225–248
    bi-partition(s) 226–230, 235–236, 244–245
    cut points 205, 226–230, 250
  for interval data 216–225
    bi-partition(s) 219–224
    cut points 205, 219–224
  for mixed data 246, 249–250
  for modal multi-valued data 205–215
  monothetic 203–236, 255
    algorithm 204, 231
    double algorithm 231–236
  for multi-valued data 214–215 (see also modal multi-valued data)
    bi-partition(s) 208–213
    cut points 202, 205, 211–212, 215
  partitioning criteria 197–200
  polythetic 236–250, 255
    algorithm 238–239
  stopping rule 250–257
  sub-partition(s) (see bi-partition(s))
  within-cluster variation 151, 152, 156, 159, 166, 198, 219–220, 232, 252
    total 152, 199–200, 204, 206, 212, 219, 221, 225, 226, 229, 230–231, 232, 235, 251
Dunn index 146, 253–255, 257
Dynamical partitioning 4, 122–124, 149–150, 152–153, 167–169, 184

e

Empirical covariance (see Covariance, symbolic)
Empirical density function (see Density function)
Empirical distributions (see Distribution(s))
Euclidean distance 51, 59, 61, 70, 74, 91–92, 107–108, 113, 142–143, 150, 157, 161–166, 265, 275–276, 278
  matrix 75, 108, 275
  normalized 52, 70, 74, 75, 77, 107–108, 110–111
  weighted 51, 52, 73, 77, 108, 110–111
Extended Gowda–Diday dissimilarity (see Gowda–Diday dissimilarity)
Extended Ichino–Yaguchi dissimilarity (see Ichino–Yaguchi dissimilarity)

f

Frequency, observed (see Histograms, relative frequency)
Frequency distribution, relative (see Histograms, relative frequency)
Frequency histogram (see Histograms, relative frequency)
Fuzzy data 180

g

Galois field (see Galois lattice)
Galois lattice 39, 84, 312
Generality degree (index) (see Pyramid clustering)
Gowda–Diday dissimilarity 51, 58–60, 68–72, 87–90, 104–108, 164–166, 250, 254, 255, 258, 281–282
  content component 58, 69, 87–88
  extended Gowda–Diday dissimilarity
    for histograms 104–108, 131, 136–137, 250, 254, 281
    for modal multi-valued data 87–90, 250
    for modal-valued lists (see modal multi-valued data)
  for intervals 68–72, 164–166, 269–271, 281
  for lists (see multi-valued data)
  for multi-valued data 58–60, 215, 249, 281
  normalization 58, 59, 89, 107
  position component 68–69, 105
  relative content component (see content component)
  relative location component (see position component)
  relative measure component (see position component)
  relative size component (see span component)
  span component 58, 68, 87–88

h

Hausdorff distance 63–68, 168–169, 189, 216, 219, 221, 224–225, 258 (see also Divisive clustering, for interval data)
  Euclidean 63–64, 67–68 (see also Euclidean distance)
    normalized 64, 67–68, 225
    span Euclidean 64, 67–68, 225
  generalized Minkowski Hausdorff distance 64
  matrix (see Dissimilarity, matrix)
Hierarchical clustering
  agglomerative (see Cluster(s)/clustering, agglomerative)
  divisive (see Divisive clustering)
  pyramidal (see Pyramid clustering)
  tree (see Tree(s))
Hierarchy (see Hierarchical clustering)
Histograms 31
  construction 31, 38
  data (see Symbolic data, types of, histogram-valued)
  joint 30–37, 40, 182–183, 188
  relative frequency 3, 16, 30, 38, 55, 83, 95–99, 115, 188–189, 200, 250, 267
Histogram-valued data (see Symbolic data, types of, histogram-valued)
Huygens Theorem 54, 199
Hypercube(s) 1, 7, 122, 163, 282–283
Hyperrectangle(s) (see Hypercube(s))

i

Ichino–Yaguchi dissimilarity (see Ichino–Yaguchi distance)
Ichino–Yaguchi distance 60–62, 73–75, 258, 271, 273–276
  for categories (see multi-valued data)
  de Carvalho extensions (see de Carvalho distance)
  extended Ichino–Yaguchi distance 90–93, 108–111, 139, 140
    for histogram data 108–111, 155–172, 178, 195–196, 225–229, 234, 239–242, 245–246, 254, 284–286
    for modal multi-valued data 90–93, 155–156, 195–196, 207, 213, 252, 267, 271–272
    for modal-valued lists (see modal multi-valued data)
    normalized 51, 155–156, 229, 239, 257, 267, 284–288, 309
  for interval data 73–75, 271–274, 276
  for lists (see multi-valued data)
  for multi-valued data 60–62
  normalized 109, 110, 178, 195–196, 273
Indexed pyramid (see Pyramid clustering)
Individual description (see Description(s), of observations)
Inertia 264
Internal variation (see Variation, internal)
Intersection (operator) 55, 63, 69, 84–85, 87–88, 94, 98–105, 109, 112–113, 178, 228, 277, 284
Interval data (see Symbolic data, types of, interval-valued)
Interval-valued (see Symbolic data, types of, interval-valued)

j

Jaccard dissimilarity 79
Join 55, 62, 73, 84
Joint distributions (see Histograms, joint)
Joint histograms (see Histograms, joint)
Joint probabilities 31 (see also Histograms, joint)

k

k-means 4, 120, 122, 124, 142, 144–146, 149–153, 160, 163, 167, 170, 172, 180, 184, 190
  adaptive 122, 152, 168, 285, 288
  non-adaptive 122, 152, 285
k-medoids 4, 120, 124, 149–150, 152–156, 164–167, 170, 178

l

Linear order (see Pyramid clustering)
Lists (see Modal multi-valued data; Symbolic data, types of, multi-valued)
Logical dependency rule(s) 5, 282–285
Logical rule (see Logical dependency rule(s))

m

Mahalanobis distance 51–52, 71–72, 165–166, 170–172 (see also Dissimilarity)
Mallows' distance 117–118, 164–166, 171–172, 190 (see also Dissimilarity)
Manhattan distance (see City block distance)
Maximum difference 237 (see also Divisive clustering, polythetic)
Maximum likelihood estimators 39, 180, 184
Maximum likelihood methods 125, 146
Mean, symbolic
  of categories (see multi-valued data)
  of histogram data 25, 235
    intersection of two histograms 102, 178, 193–194
    union of two histograms 102, 178, 193–194
  of interval data 25
  of lists (see multi-valued data)
  of modal multi-valued data (see multi-valued data)
  of multi-valued data 25
Median, symbolic 168
Medoids (see k-medoids)
Meet 55, 62, 73, 84
Metric 48 (see also Dissimilarity)
Minkowski distance 51–52, 69, 75 (see also Dissimilarity)
  generalized 64, 73
  order one (see Dissimilarity, city block)
  order two (see Euclidean distance)
  weighted 51, 92 (see also Dissimilarity; Euclidean distance)
Mixed data (see Symbolic data, types of, mixed)
Modal data (see Symbolic data, types of, modal multi-valued)
Modal multi-valued data 7, 10–11, 15–21, 47, 83–93, 98, 153–159, 177, 200–203, 205–214, 250–251, 266–269, 281, 309
Moment estimators 39
Monothetic algorithm 5, 120, 125, 131, 137, 144, 197, 200, 203–236, 251, 254, 294
Multi-valued data (see Symbolic data, types of, multi-valued)

n

Necessity 10, 16–17, 38, 94
Nodes 125, 127, 172, 186, 261 (see also Hierarchical clustering)

o

Object(s) 8, 47–56, 58–60, 62–65, 68–70, 73, 75–77, 79, 83–85, 87–93, 98, 100–105, 108–109, 112, 115, 117, 129, 197, 290, 292, 297–298, 300, 302
Observations 8–17, 84, 237, 289–290
Observed frequency (see Frequency, observed)
Outlier(s) 24, 122, 128, 190

p

Partitions (partitioning) 5, 119–125, 127, 129, 142–144, 146, 149–196
  adaptive 122, 152, 168
  adaptive leader 156–159, 178–179
  adaptive Ward (see adaptive leader)
  of categories (see multi-valued data)
  classical 120–125
  classification criteria 120, 150–151
  convergence 120–121, 123, 153
  dynamical partitioning 4, 122–124, 149–150, 152–153, 167–169, 184
  of histogram data 169–172
  of interval data 159–169
  iterated minimum partitioning (see dynamical partitioning)
  of lists (see multi-valued data)
  of mixed data 172, 179
  mixture distribution method 179–186
  of modal multi-valued data 153
  monothetic 120
  of multi-valued data 153–159
  nearest centroid sorting (see dynamical partitioning)
  polythetic 120 (see also Polythetic algorithm)
  reallocation 5, 120–123, 125, 151–153, 159, 161, 164, 169, 179, 183
  representative function 122–123, 124
  squared error partitioning (see dynamical partitioning)
  within-cluster distance 151, 156, 159
Patterns 283
Polythetic algorithm 5, 120, 125, 131, 137, 197, 200, 236–250, 255–266
Position component 68–69, 105
Possibility 10, 16–17, 38, 94
Principal component 5, 147
Probability density function 153, 179–180, 183
Proximity measure (see Pyramid clustering)
Pyramid clustering 4–5, 48–50, 129, 144, 261, 289–312
  classical 129, 289
  code 290–298, 301–305, 307
  complete(ness) 289–293, 300–302
  construction
    from dissimilarities 309–312
    from generality degrees 297–309
  generality degree 289–309, 315–316
  linear order 129
  spatial 129, 312
  strongly indexed 312
  weakly indexed 312

q

Qualitative variable(s) (see Symbolic data, types of, multi-valued)
Quantile 117, 142, 144, 171–172

r

Random variables, types of (see Variables, random)
Rectangles (see Hypercube(s))
Relative frequency (see Histograms, relative frequency)
Representation (see Cluster(s)/clustering)
Robinson matrix (see Dissimilarity, Robinson)
Rules (see Logical dependency rule(s))

s

Seeds 121, 150, 152, 155–156, 161–162, 164, 166–167, 190
Self-organizing maps 124, 190
Similarity 47 (see also Dissimilarity)
Single-link 128, 129, 156, 262, 265, 267–270, 273, 280, 285
Sklar's theorem 182
SODAS2 software
Span component 58, 68, 87–88
Spearman's rho (see Covariance, symbolic)
Standard deviation, symbolic 26, 28, 30–31, 103–105, 109, 112–113, 117, 141, 178, 228, 231–236, 280, 284
Stirling number 1, 121
Symbolic correlation coefficient (see Correlation coefficient, symbolic)
Symbolic covariance (see Covariance, symbolic)
Symbolic data, types of
  categorical (see multi-valued)
  as classical 2–5, 8–9, 12, 22–24, 27, 39, 50, 52, 62, 71, 95–98, 117, 119, 127, 130–131, 144, 150, 153, 160, 186, 190, 198, 253, 262–264, 289, 309
  cumulative distributions 15, 115–117, 170–172, 180 (see also Dissimilarity, cumulative density function; Distribution(s), function)
  distributions (see Distribution(s), function)
  histogram-valued 1–5, 7, 10, 13–18, 21–22, 24–25, 28, 30–31, 38, 47, 83, 93–116, 131, 140, 144–146, 149, 169–172, 176–179, 186–188, 190, 198, 203, 205, 225–236, 238–239, 246–250, 254, 261, 264–265, 277–282, 284–285 (see also Modal multi-valued data)
    intersection of two histograms 98–101
    span 95, 109, 115
    transformation(s) of histograms 94–98, 100–103, 106, 110, 112, 115–116, 225
    union of two histograms 98–101
  interval-valued 1–5, 7, 12–15, 18, 22–29, 31, 38–39, 47, 62–78, 94–95, 104–105, 108–109, 112, 122, 142, 144, 146, 149–151, 159–169, 172, 186–187, 189–190, 198, 200, 203, 205, 215–225, 250, 261, 264–265, 269–277, 281–290
  list(s) (see multi-valued)
  mixed 4–5, 14, 47, 114, 172–179, 189, 246, 261, 281–282
  modal categorical (see modal multi-valued)
  modal interval-valued 4 (see also histogram-valued)
  modal list (see modal multi-valued)
  modal multi-valued 7, 10–11, 15–21, 47, 83–93, 98, 153–159, 172, 200–203, 205–214, 250–251, 266–269, 281, 309
  models 3, 15
  multi-categorical (see multi-valued)
  multi-valued 1–5, 7, 9–10, 15–18, 21–22, 24–26, 47, 55–63, 69, 79, 83, 149, 153–159, 172, 186–187, 198, 200–203, 214–215, 250–251, 261, 266–269, 281, 289–290
  non-modal multi-valued (see multi-valued)
  probability density functions 15
Symbolic mean (see Mean, symbolic)
Symbolic median (see Median, symbolic)
Symbolic object(s) (see Object(s))
Symbolic standard deviation (see Standard deviation, symbolic)

t

Time series 15
Transformations (see Symbolic data, types of, histogram-valued)
Tree(s) 4–5, 125–131, 139, 142, 144, 146, 252, 254, 257–258, 261, 270–273, 275, 277–278, 280–281, 285, 288, 310–312 (see also Hierarchical clustering)
  height 5, 128, 130, 212, 252, 274, 278–279
Triangle inequality (see Triangle property)
Triangle property 48, 197
Triangles 48, 282
Triangular distribution 39

u

Ultrametric (see Dissimilarity, ultrametric)
Union (operator) 55, 84–85, 87–88, 94, 98–104, 109, 112–113, 178, 187, 225–227, 230, 284

v

Variables, random (see also Symbolic data, types of)
Variance 3, 24, 122, 231, 264
  of classical 27
  of histogram data 28
  of intersection of two histograms 94, 101–104, 228
  of union of two histograms 94, 101–104, 228
  internal (see Variation, internal)
  of interval data 24, 26
  of lists (see multi-valued data)
  of modal interval-valued data (see Variance, of histogram data)
  of multi-valued data 26
Variation 5, 149, 206, 212, 264
  between-class (see between-cluster)
  between-cluster 54, 127, 159–160,
198–200 (see also Between-cluster variation; Cluster(s)/clustering, agglomerative; Divisive clustering, between-cluster variation) explained 212, 251–253 internal 3, 27–30, 39, 103, 120–121, 127, 131, 139–140, 144–145, 167, 204, 206 between observations 27–30 between sum of products (see Variation, between observations) between sum of squares (see Variation, between observations) total variation 130, 204, 219 (see also Cluster(s)/clustering, within-cluster variation) within-class (see Within-cluster) within-cluster 54, 198–200, 206, 219, 243 (see also Cluster(s)/clustering, within-cluster variation) within observations (see Variation, internal) Virtual data 283–284 w Ward’s method 4, 128, 129, 156, 263–266, 276–279, 282, 285 Weight(s) 3, 10, 15–17, 38, 51–52, 65, 67, 75, 128, 157–158, 171, 178, 180, 190, 196, 198, 201–203, 206, 219, 237–238, 243, 263, 265 Within-class variation (see Divisive clustering, within-cluster variation) Within-cluster variation (see Divisive clustering, within-cluster variation) ... methods in statistics to fields of bioinformatics, genomics, epidemiology, business, engineering, finance and applied statistics Clustering Methodology for Symbolic Data Lynne Billard University of... Cataloging-in-Publication Data Names: Billard, L (Lynne), 1943- author | Diday, E., author Title: Clustering methodology for symbolic data / Lynne Billard (University of Georgia), Edwin Diday (CEREMADE, Université... PO19 8SQ, UK Editorial Office 9600 Garsington Road, Oxford, OX4 2DQ, UK For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com
