Clustering methodology for symbolic data

Wiley Series in Computational Statistics

Consulting Editors: Paolo Giudici, University of Pavia, Italy; Geof H. Givens, Colorado State University, USA.

The Wiley Series in Computational Statistics comprises practical guides and cutting-edge research books on new developments in computational statistics. It features quality authors with a strong applications focus. The texts in the series provide detailed coverage of statistical concepts, methods, and case studies in areas at the interface of statistics, computing, and numerics. With sound motivation and a wealth of practical examples, the books show in concrete terms how to select and to use appropriate ranges of statistical computing techniques in particular fields of study. Readers are assumed to have a basic understanding of introductory terminology. The series concentrates on applications of computational methods in statistics to fields of bioinformatics, genomics, epidemiology, business, engineering, finance, and applied statistics.

Clustering Methodology for Symbolic Data

Lynne Billard, University of Georgia, USA
Edwin Diday, CEREMADE, Université Paris-Dauphine, Université PSL, Paris, France

This edition first published 2020.
© 2020 John Wiley & Sons Ltd

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Lynne Billard and Edwin Diday to be identified as the authors of this work has been asserted in accordance with law.

Registered Offices: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA; John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK.

Editorial Office: 9600 Garsington Road, Oxford, OX4 2DQ, UK. For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty: While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data

Names: Billard, L. (Lynne), 1943– author. | Diday, E., author.
Title: Clustering methodology for symbolic data / Lynne Billard (University of Georgia), Edwin Diday (CEREMADE, Université Paris-Dauphine, Université PSL, Paris, France).
Description: Hoboken, NJ : Wiley, 2020. | Includes bibliographical references and index.
Identifiers: LCCN 2019011642 (print) | LCCN 2019018340 (ebook) | ISBN 9781119010388 (Adobe PDF) | ISBN 9781119010395 (ePub) | ISBN 9780470713938 (hardcover).
Subjects: LCSH: Cluster analysis. | Multivariate analysis.
Classification: LCC QA278.55 (ebook) | LCC QA278.55 B55 2019 (print) | DDC 519.5/3–dc23.
LC record available at https://lccn.loc.gov/2019011642

Cover Design: Wiley
Cover Image: © Lynne Billard; background © Iuliia_Syrotina_28/Getty Images
Set in 10/12pt WarnockPro by SPi Global, Chennai, India

Contents

1 Introduction

2 Symbolic Data: Basics
  2.1 Individuals, Classes, Observations, and Descriptions
  2.2 Types of Symbolic Data
    2.2.1 Multi-valued or Lists of Categorical Data
    2.2.2 Modal Multi-valued Data
    2.2.3 Interval Data
    2.2.4 Histogram Data
    2.2.5 Other Types of Symbolic Data
  2.3 How Symbolic Data Arise?
  2.4 Descriptive Statistics
    2.4.1 Sample Means
    2.4.2 Sample Variances
    2.4.3 Sample Covariance and Correlation
    2.4.4 Histograms
  2.5 Other Issues
  Exercises
  Appendix

3 Dissimilarity, Similarity, and Distance Measures
  3.1 Some General Basic Definitions
  3.2 Distance Measures: List or Multi-valued Data
    3.2.1 Join and Meet Operators for Multi-valued List Data
    3.2.2 A Simple Multi-valued Distance
    3.2.3 Gowda–Diday Dissimilarity
    3.2.4 Ichino–Yaguchi Distance
  3.3 Distance Measures: Interval Data
    3.3.1 Join and Meet Operators for Interval Data
    3.3.2 Hausdorff Distance
    3.3.3 Gowda–Diday Dissimilarity
    3.3.4 Ichino–Yaguchi Distance
    3.3.5 de Carvalho Extensions of Ichino–Yaguchi Distances
  3.4 Other Measures
  Exercises
  Appendix

4 Dissimilarity, Similarity, and Distance Measures: Modal Data
  4.1 Dissimilarity/Distance Measures: Modal Multi-valued List Data
    4.1.1 Union and Intersection Operators for Modal Multi-valued List Data
    4.1.2 A Simple Modal Multi-valued List Distance
    4.1.3 Extended Multi-valued List Gowda–Diday Dissimilarity
    4.1.4 Extended Multi-valued List Ichino–Yaguchi Dissimilarity
  4.2 Dissimilarity/Distance Measures: Histogram Data
    4.2.1 Transformation of Histograms
    4.2.2 Union and Intersection Operators for Histograms
    4.2.3 Descriptive Statistics for Unions and Intersections
    4.2.4 Extended Gowda–Diday Dissimilarity
    4.2.5 Extended Ichino–Yaguchi Distance
    4.2.6 Extended de Carvalho Distances
    4.2.7 Cumulative Density Function Dissimilarities
    4.2.8 Mallows' Distance
  Exercises

5 General Clustering Techniques
  5.1 Brief Overview of Clustering
  5.2 Partitioning
  5.3 Hierarchies
  5.4 Illustration
  5.5 Other Issues

6 Partitioning Techniques
  6.1 Basic Partitioning Concepts
  6.2 Multi-valued List Observations
  6.3 Interval-valued Data
  6.4 Histogram Observations
  6.5 Mixed-valued Observations
  6.6 Mixture Distribution Methods
  6.7 Cluster Representation
  6.8 Other Issues
  Exercises
  Appendix

7 Divisive Hierarchical Clustering
  7.1 Some Basics
    7.1.1 Partitioning Criteria
    7.1.2 Association Measures
  7.2 Monothetic Methods
    7.2.1 Modal Multi-valued Observations
    7.2.2 Non-modal Multi-valued Observations
    7.2.3 Interval-valued Observations
    7.2.4 Histogram-valued Observations
  7.3 Polythetic Methods
  7.4 Stopping Rule R
  7.5 Other Issues
  Exercises

8 Agglomerative Hierarchical Clustering
  8.1 Agglomerative Hierarchical Clustering
    8.1.1 Some Basic Definitions
    8.1.2 Multi-valued List Observations
    8.1.3 Interval-valued Observations
    8.1.4 Histogram-valued Observations
    8.1.5 Mixed-valued Observations
    8.1.6 Interval Observations with Rules
  8.2 Pyramidal Clustering
    8.2.1 Generality Degree
    8.2.2 Pyramid Construction Based on Generality Degree
    8.2.3 Pyramids from Dissimilarity Matrix
    8.2.4 Other Issues
  Exercises
  Appendix

References
Index
1 Introduction

The theme of this volume centers on clustering methodologies for data which allow observations to be described by lists, intervals, histograms, and the like (referred to as "symbolic" data), instead of single point values (traditional "classical" data). Clustering techniques are frequent participants in exploratory data analyses when the goal is to elicit identifying classes in a data set. Often these classes are in and of themselves the goal of an analysis, but they can also become the starting point(s) of subsequent analyses. There are many texts available which focus on clustering for classically valued observations. This volume aims to provide one such outlet for symbolic data.

With the capabilities of the modern computer, large and extremely large data sets are becoming more routine. What is less routine is how to analyze these data. Data sets are becoming so large that, even with the increased computational power of today, direct analyses through the myriad of classical procedures developed over the past century alone are not possible; for example, from Stirling's formula, the number of partitions of a data set of only 50 units is approximately 1.85 × 10^47. As a consequence, subsets of aggregated data are determined for subsequent analyses. Criteria for how and the directions taken in these aggregations would typically be driven by the underlying scientific questions pertaining to the nature and formation of the data sets at hand. Examples abound. Data streams may be aggregated into blocks of data, communications networks may have different patterns in phone usage across age groups and/or regions, studies of network traffic across different networks will inevitably involve symbolic data, satellite observations are aggregated into (smaller) sub-regional measurements, and so on. The list is endless. There are many different approaches and motivations behind the aggregations. The aggregated observations are perforce lists, intervals, histograms, etc., and as such are examples of symbolic data. Indeed, Schweizer (1984) anticipated this progress with his claim that "distributions are the numbers of the future."
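The scale of that count can be checked directly. The number of ways to partition a set of n distinct units into non-empty clusters is the Bell number, that is, the sum over all possible numbers of clusters of the Stirling numbers of the second kind. The short sketch below is our illustration (it is not code from this volume); it evaluates the count with the Bell-triangle recurrence.

```python
def bell(n: int) -> int:
    """Number of ways to partition a set of n distinct units into
    non-empty clusters (the Bell number), via the Bell triangle."""
    row = [1]                     # triangle row for n = 1
    for _ in range(n - 1):
        new_row = [row[-1]]       # each row starts with the previous row's last entry
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]                # B(n) is the last entry of the nth row

print(f"{bell(50):.3e}")          # 1.857e+47, matching the figure quoted above
```

Even enumerating, let alone scoring, every candidate partition of so few units is out of the question, which is precisely what motivates working with aggregated observations.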
In its purest, simplest form, symbolic data can be defined as taking values as hypercubes or as Cartesian products of distributions in p-dimensional space ℝ^p, in contrast to classical observations whose values are points in ℝ^p. Classical data are well known, being the currency of statistical analyses since the subject began. Symbolic data and their analyses are, however, relatively new and owe their origin to the seminal work of Diday (1987).

More specifically, observations may be multi-valued or lists (of categorical values). To illustrate, consider a text-mining document. The original database may consist of thousands or millions of text files characterized by a number (e.g., 6000) of key words. These words or sets of words can be aggregated into categories of words such as "themes" (e.g., telephone enquiries may be aggregated under categories of accounts, new accounts, discontinued service, broken lines, and so forth, with each of these consisting of its own sub-categories). Thus, a particular text message may contain the specific key words Y = {home-phone, monthly contract, …} from the list of possible key words 𝒴 = {two-party line, billing, local service, international calls, connections, home, monthly contract, …}. Or, the color of the bird species rainbow lorikeet is Y = {green, yellow, red, blue}, with Y taking values from the list of colors 𝒴 = {black, blue, brown, green, white, red, yellow, …, (possible colors), …}. An aggregation of drivers by city census tract may produce a list of automobile ownership for one particular residential tract as Y = {Ford, Renault, Volvo, Jeep} from 𝒴 = {…, (possible car models), …}. As written, these are examples of non-modal observations. If the end user also wants to know proportional car ownership, say, then aggregation of the census tract classical observations might produce the modal list-valued observation Y = {Holden, 0.2; Falcon, 0.25; Renault, 0.5; Volvo, 0.05}, indicating that 20% of the drivers own a Holden car, 50% own a Renault, and so forth.

Interval-valued observations, as the name suggests, are characterized as taking values across an interval Y = [a, b] from 𝒴 ≡ ℝ. There are endless examples. Stock prices have daily low and high values; temperatures have daily (or monthly, or yearly, …) minimum and maximum values. Observations within (or even between adjacent) pixels in a functional magnetic resonance imaging (fMRI) data set (from measurements of p different stimuli, say) are aggregated to produce a range of values across the separate pixels. In their study of face recognition features, Leroy et al. (1996) aggregated pixel values to obtain interval measurements. At the current time, more methodology is available for interval-valued data sets than for other types of symbolic observations, so special attention will be paid to these data.

Another frequently occurring type of symbolic data is the histogram-valued observation. These observations correspond to the traditional histogram that pertains when classical observations are summarized into a histogram format. For example, consider the height (Y) of high-school students. Rather than retain the values for each individual, a histogram is calculated to make an analysis of height characteristics of school students across the 1000 schools in the state. Thus, at a particular school, it may be that the heights, in inches, are Y = {[50, 60), 0.12; [60, 65), 0.33; [65, 72), 0.45; [72, 80], 0.1}, where the relative frequency of students being 60–65 inches tall is 0.33 or 33%. More generally, rather than the sub-interval having a relative frequency, as in this example, other weights may pertain.
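Before proceeding, it may help to see the four example observations written down as concrete data structures. The sketch below is purely illustrative; the container choices are ours and are not a format prescribed by this volume or by any symbolic data software.

```python
# Non-modal multi-valued (list) observation: colors of the rainbow lorikeet.
lorikeet_colors = {"green", "yellow", "red", "blue"}

# Modal multi-valued observation: proportional car ownership in one census
# tract; the weights are relative frequencies and must sum to one.
car_ownership = {"Holden": 0.20, "Falcon": 0.25, "Renault": 0.50, "Volvo": 0.05}
assert abs(sum(car_ownership.values()) - 1.0) < 1e-12

# Interval-valued observation Y = [a, b]: a daily (low, high) stock price.
stock_price = (101.50, 110.25)

# Histogram-valued observation: school heights in inches, recorded as
# (sub-interval, relative frequency) pairs; 33% of students are 60-65 inches.
heights = [((50, 60), 0.12), ((60, 65), 0.33), ((65, 72), 0.45), ((72, 80), 0.10)]
assert abs(sum(p for _, p in heights) - 1.0) < 1e-12
```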
These lists, intervals, and histograms are just some of the many possible formats for symbolic data. Chapter 2 provides an introduction to symbolic data. A key question relates to how these data arrive in practice. Clearly, many symbolic data sets arise naturally, especially species data sets, such as the bird colors illustrated herein. However, most symbolic data sets will emerge from the aggregation of the massively large data sets generated by the modern computer. Accordingly, Chapter 2 looks briefly at this generation process. This chapter also considers the calculations of basic descriptive statistics, such as sample means, variances, covariances, and histograms, for symbolic data. It is noted that classical observations are special cases. However, it is also noted that symbolic data have internal variations, unlike classical data (for which this internal variation is zero). Bock and Diday (2000a), Billard and Diday (2003, 2006a), Diday and Noirhomme-Fraiture (2008), the reviews of Noirhomme-Fraiture and Brito (2011) and Diday (2016), and the non-technical introduction in Billard (2011) provide a wide coverage of symbolic data and some of the current methodologies.

As for classical statistics since the subject began, observations are realizations of some underlying random variable. Symbolic observations are also realizations of those same (standard, so to speak) random variables, the difference being that realizations are symbolic-valued instead of numerical or categorical point-valued. Thus, for example, the parameters of a distribution of the random variable, such as Y ∼ N(𝝁, 𝚺), are still points, e.g., 𝝁 = (0, …, 0) and 𝚺 = I. This feature is especially evident when calculating descriptive statistics, e.g., the sample mean of interval observations (see Section 2.4). That is, the output sample mean of intervals is a point, and is not an interval such as might be the case when interval arithmetic is employed. Indeed, as for classical statistics, standard classical arithmetic is in force (i.e., we do not use interval or histogram or related arithmetics). In that same vein, aggregated observations are still distributed according to that underlying distribution (e.g., normally distributed); however, it is assumed that those normally distributed observations are uniformly spread across the interval, or sub-intervals for histogram-valued data. Indeed, this is akin to the "grouped" data histogram problems of elementary applied statistics courses. While this uniform spread assumption exists in almost all symbolic data analytic procedures, relaxation to some other form of spread could be possible.
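To make the uniform-spread point concrete, the following minimal sketch computes the sample mean and sample variance of a set of interval observations under that assumption; the formulas are the standard uniform-within-interval descriptive statistics of the kind developed in Chapter 2, while the function names are ours. The mean of intervals [a_u, b_u], u = 1, …, n, is the average of the interval midpoints, a single point; for degenerate intervals with a_u = b_u, both quantities reduce to the classical sample mean and (population-style) variance.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (a, b) with a <= b

def interval_mean(data: List[Interval]) -> float:
    """Symbolic sample mean of interval data: a point, not an interval."""
    return sum(a + b for a, b in data) / (2 * len(data))

def interval_variance(data: List[Interval]) -> float:
    """Symbolic sample variance assuming values are uniformly spread within
    each interval, so each interval's internal variation contributes."""
    second_moment = sum(a * a + a * b + b * b for a, b in data) / (3 * len(data))
    return second_moment - interval_mean(data) ** 2

daily_temps = [(3.0, 10.0), (5.5, 12.5), (-1.0, 4.0)]  # (min, max) temperatures
print(interval_mean(daily_temps), interval_variance(daily_temps))  # two points
```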
The starting premise of the clustering methodologies presupposes the data are already in a symbolic format; therefore, the philosophical concepts involved behind the formation of symbolic data are by and large not included in this volume. The reader should be aware, however, that there are many issues that …

References

Kaufman, L. and Rousseeuw, P.J. (1986). Clustering large data sets (with discussion). In: Pattern Recognition in Practice II (eds E.S. Gelsema and L.N. Kanal). North-Holland, Amsterdam, 425–437.
Kaufman, L. and Rousseeuw, P.J. (1987). Clustering by means of medoids. In: Statistical Data Analysis Based on the L1-Norm and Related Methods (ed. Y. Dodge). North-Holland, Berlin, 405–416.
Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, New York.
Kim, J. (2009). Dissimilarity Measures for Histogram-valued Data and Divisive Clustering of Symbolic Objects. Doctoral dissertation, University of Georgia.
Kim, J. and Billard, L. (2011). A polythetic clustering process for symbolic observations and cluster validity indexes. Computational Statistics and Data Analysis 55, 2250–2262.
Kim, J. and Billard, L. (2012). Dissimilarity measures and divisive clustering for symbolic multimodal-valued data. Computational Statistics and Data Analysis 56, 2795–2808.
Kim, J. and Billard, L. (2013). Dissimilarity measures for histogram-valued observations. Communications in Statistics: Theory and Methods 42, 283–303.
Kim, J. and Billard, L. (2018). Double monothetic clustering for histogram-valued data. Communications for Statistical Applications and Methods 25, 263–274.
Kohonen, T. (2001). Self-Organizing Maps. Springer, Berlin, Heidelberg.
Korenjak-Černe, S., Batagelj, V. and Pavešić, B.J. (2011). Clustering large data sets described with discrete distributions and its application on TIMSS data set. Statistical Analysis and Data Mining 4, 199–215.
Košmelj, K. and Billard, L. (2011). Clustering of population pyramids using Mallows' L2 distance. Metodološki Zvezki 8, 1–15.
Košmelj, K. and Billard, L. (2012). Mallows' L2 distance in some multivariate methods and its application to histogram-type data. Metodološki Zvezki 9, 107–118.
Kuo, R.J., Ho, L.M. and Hu, C.M. (2002). Cluster analysis in industrial market segmentation through artificial neural network. Computers and Industrial Engineering 42, 391–399.
Lance, G.N. and Williams, W.T. (1967a). A general theory of classificatory sorting strategies II. Clustering systems. The Computer Journal 10, 271–277.
Lance, G.N. and Williams, W.T. (1967b). A general theory of classificatory sorting strategies I. Hierarchical systems. The Computer Journal 9, 373–380.
Lechevallier, Y., de Carvalho, F.A.T., Despeyroux, T. and de Melo, F. (2010). Clustering of multiple dissimilarity data tables for documents categorization. In: Proceedings COMPSTAT (eds Y. Lechevallier and G. Saporta) 19, 1263–1270.
Le-Rademacher, J. and Billard, L. (2011). Likelihood functions and some maximum likelihood estimators for symbolic data. Journal of Statistical Planning and Inference 141, 1593–1602.
Leroy, B., Chouakria, A., Herlin, I. and Diday, E. (1996). Approche géométrique et classification pour la reconnaissance de visage. In: Reconnaissance des Formes et Intelligence Artificielle, INRIA and IRISA and CNRS, France, 548–557.
Levina, E. and Bickel, P. (2001). The earth mover's distance is the Mallows' distance: some insights from statistics. In: Proceedings IEEE International Conference on Computer Vision. IEEE Computer Society, 251–256.
Limam, M. (2005). Méthodes de Description de Classes Combinant Classification et Discrimination en Analyse de Données Symboliques. Thèse de Doctorat, Université de Paris, Dauphine.
Limam, M.M., Diday, E. and Winsberg, S. (2004). Probabilist allocation of aggregated statistical units in classification trees for symbolic class description. In: Classification, Clustering and Data Mining Applications (eds D. Banks, L. House, F.R. McMorris, P. Arabie and W. Gaul). Springer, Heidelberg, 371–379.
Lisboa, P.J.G., Etchells, T.A., Jarman, I.H. and Chambers, S.J. (2013). Finding reproducible cluster partitions for the k-means algorithm. BMC Bioinformatics 14, 1–19.
Liu, F. (2016). Cluster Analysis for Symbolic Interval Data Using Linear Regression Method. Doctoral dissertation, University of Georgia.
MacNaughton-Smith, P., Williams, W.T., Dale, M.B. and Mockett, L.G. (1964). Dissimilarity analysis: a new technique of hierarchical division. Nature 202, 1034–1035.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability (eds L.M. LeCam and J. Neyman). University of California Press, Berkeley, 1, 281–299.
Mahalanobis, P.C. (1936). On the generalized distance in statistics. Proceedings of the National Institute of Science India 2, 49–55.
Malerba, D., Esposito, F., Gioviale, V. and Tamma, V. (2001). Comparing dissimilarity measures for symbolic data analysis. In: Proceedings of Exchange of Technology and Know-how and New Techniques and Technologies for Statistics, Crete (eds P. Nanopoulos and D. Wilkinson). European Communities, Rome, 473–481.
Mallows, C.L. (1972). A note on asymptotic joint normality. Annals of Mathematical Statistics 43, 508–515.
McLachlan, G. and Basford, K.E. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.
McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley, New York.
McQuitty, L.L. (1967). Expansion of similarity analysis by reciprocal pairs for discrete and continuous data. Educational and Psychological Measurement 27, 253–255.
Milligan, G.W. and Cooper, M. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika 50, 159–179.
Mirkin, B. (1996). Mathematical Classification and Clustering. Kluwer, Dordrecht.
Murtagh, F. and Legendre, P. (2014). Ward's hierarchical clustering method: Clustering criterion and agglomerative algorithm. Journal of Classification 31, 274–295.
Nelsen, R.B. (2007). An Introduction to Copulas. Springer, New York.
Noirhomme-Fraiture, M. and Brito, M.P. (2011). Far beyond the classical data models: Symbolic data analysis. Statistical Analysis and Data Mining 4, 157–170.
Pak, K.K. (2005). Classifications Hiérarchique et Pyramidale Spatiale. Thèse de Doctorat, Université de Paris, Dauphine.
Parks, J.M. (1966). Cluster analysis applied to multivariate geologic problems. Journal of Geology 74, 703–715.
Pearson, K. (1895). Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London 58, 240–242.
Périnel, E. (1996). Segmentation et Analyse des Données Symboliques: Applications à des Données Probabilistes Imprécises. Thèse de Doctorat, Université de Paris, Dauphine.
Polaillon, G. (1998). Organisation et Interprétation par les Treillis de Galois de Données de Type Multivalué, Intervalle ou Histogramme. Thèse de Doctorat, Université de Paris, Dauphine.
Polaillon, G. (2000). Pyramidal classification for interval data using Galois lattice reduction. In: Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data (eds H.-H. Bock and E. Diday). Springer, Berlin, 324–341.
Polaillon, G. and Diday, E. (1996). Galois lattices construction and application in symbolic data analysis. Report CEREMADE 9631, Université de Paris, Dauphine.
Punj, G. and Stewart, D.W. (1983). Cluster analysis in marketing research: Review and suggestions for application. Journal of Marketing Research 20, 134–148.
Quantin, C., Billard, L., Touati, M., Andreu, N., Cottin, Y., Zeller, M., Afonso, F., Battaglia, G., Seck, D., LeTeuff, G. and Diday, E. (2011). Classification and regression trees on aggregate data modeling: An application in acute myocardial infarction. Journal of Probability and Statistics, ID 523937, doi:10.1155/2011/523937.
Rahal, M.C. (2010). Classification Pyramidale Spatiale: Nouveaux Algorithmes et Aide à l'Interprétation. Thèse de Doctorat, Université de Paris, Dauphine.
Ralambondrainy, H. (1995). A conceptual version of the K-means algorithm. Pattern Recognition Letters 16, 1147–1157.
Reynolds, A.P., Richards, G., de la Iglesia, B. and Rayward-Smith, V.J. (2006). Clustering rules: A comparison of partitioning and hierarchical clustering algorithms. Journal of Mathematical Modelling and Algorithms 5, 475–504.
Robinson, W.S. (1951). A method for chronologically ordering archaeological deposits. American Antiquity 16, 293–301.
Rüschendorf, L. (2001). Wasserstein metric. In: Encyclopedia of Mathematics (ed. M. Hazewinkel). Springer, Dordrecht, 631.
Schroeder, A. (1976). Analyse d'un mélange de distributions de probabilité de même type. Revue de Statistique Appliquée 24, 39–62.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461–464.
Schweizer, B. (1984). Distributions are the numbers of the future. In: Proceedings of The Mathematics of Fuzzy Systems Meeting (eds A. di Nola and A. Ventre). University of Naples, Naples, Italy, 137–149.
Scott, A.J. and Symons, M.J. (1971). Clustering methods based on likelihood ratio criteria. Biometrics 27, 387–397.
Seck, D.A.N. (2012). Arbres de Décision Symboliques, Outils de Validation et d'Aide à l'Interprétation. Thèse de Doctorat, Université de Paris, Dauphine.
Seck, D., Billard, L., Diday, E. and Afonso, F. (2010). A decision tree for interval-valued data with modal dependent variable. In: Proceedings COMPSTAT 2010 (eds Y. Lechevallier and G. Saporta). Springer, 19, 1621–1628.
Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris 8, 229–231.
Sneath, P.H. and Sokal, R.R. (1973). Numerical Taxonomy. Freeman, San Francisco.
Sokal, R.R. (1963). The principles and practice of numerical taxonomy. Taxon 12, 190–199.
Sokal, R.R. and Michener, C.D. (1958). A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin 38, 1409–1438.
Sokal, R.R. and Sneath, P.H. (1963). Principles of Numerical Taxonomy. Freeman, San Francisco.
Steinley, D. (2006). K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology 59, 1–34.
Stéphan, V. (1998). Construction d'Objets Symboliques par Synthèse des Résultats de Requêtes SQL. Thèse de Doctorat, Université de Paris, Dauphine.
Stéphan, V., Hébrail, G. and Lechevallier, Y. (2000). Generation of symbolic objects from relational databases. In: Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data (eds H.-H. Bock and E. Diday). Springer, Berlin, 78–105.
Symons, M.J. (1981). Clustering criteria and multivariate normal mixtures. Biometrics 37, 35–43.
Tibshirani, R. and Walther, G. (2005). Cluster validation by prediction strength. Journal of Computational and Graphical Statistics 14, 511–528.
Vanessa, A. and Vanessa, L. (2004). La meilleure équipe de baseball. Report CEREMADE, Université de Paris, Dauphine.
Verde, R. and Irpino, A. (2007). Dynamic clustering of histogram data: Using the right metric. In: Selected Contributions in Data Analysis and Classification (eds P. Brito, P. Bertrand, G. Cucumel and F. de Carvalho). Springer, Berlin, 123–134.
Vrac, M. (2002). Analyse et Modélisation de Données Probabilistes par Décomposition de Mélange de Copules et Application à une Base de Données Climatologiques. Thèse de Doctorat, Université de Paris, Dauphine.
Vrac, M., Billard, L., Diday, E. and Chédin, A. (2012). Copula analysis of mixture models. Computational Statistics 27, 427–457.
Ward, J.H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58, 236–244.
Wei, G.C.G. and Tanner, M.A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association 85, 699–704.
Winsberg, S., Diday, E. and Limam, M.M. (2006). A tree structured classifier for symbolic class description. In: Proceedings COMPSTAT (eds A. Rizzi and M. Vichi). Springer, 927–936.
Wishart, D. (1969). An algorithm for hierarchical classification. Biometrics 25, 165–170.
Xu, W. (2010). Symbolic Data Analysis: Interval-valued Data Regression. Doctoral dissertation, University of Georgia.

Index

a

Adaptive leader 156–159, 178–179
Agglomerative clustering (see Cluster(s)/clustering, agglomerative)
Aggregation 1–5, 7–8, 13, 17–24, 27, 56, 65, 84, 86, 95–96, 131, 139, 144, 181, 197, 239, 246, 254, 282
Assertion 290, 302
Association measures (see Divisive clustering, association measures)
Average weighted dissimilarity 237–238, 243 (see also Divisive clustering, polythetic)
  maximum average weighted dissimilarity 238, 243 (see also Divisive clustering, polythetic)

b

Between-cluster variation 54, 127, 199–200, 250, 252
Between observation variance (see Variation, between observations)
Bi-partition of cluster (see Partitions (partitioning))
Bi-plots 140, 142, 163, 164, 169, 312

c

Capacity 10, 16, 17, 38, 84
Cartesian product 1, 17
Categories 2, 8–11, 15–16, 18, 21–22, 24–26, 55–56, 58, 65, 79, 83–87, 91, 119, 153, 155–157, 200–203, 205–206, 214, 267, 290 (see also Symbolic data, types of, multi-valued)
Census 2, 18, 24, 56, 86–87, 154, 232, 246, 281
Centroid(s) 1, 121, 122, 152–153, 161–163, 167–172, 189, 225, 264–265, 275–276, 278 (see also Cluster(s)/clustering, seeds; Mean, symbolic)
Centroid histogram 172 (see also Centroid(s))
Chebychev's distance 52, 54, 71, 73, 75, 190 (see also Dissimilarity)
Choquet capacities 39, 84
City block distance (see Dissimilarity, city block)
Classes (see Categories; Cluster(s)/clustering)
Classification(s) (see Cluster(s)/clustering, types of)
Cluster(s)/clustering (see also Divisive clustering, monothetic; Divisive clustering, polythetic)
  agglomerative 5, 126–129, 261–316 (see also Pyramid clustering)
    adaptive 285, 288
    algorithm 266
    average-link 128, 139, 140, 263, 273, 280–282
      group average 128, 263, 268–269
      weighted average 128, 263
    centroid-link 128, 265
    complete-link 128–129, 139, 140, 142, 262, 265, 268–273, 285, 310
    farthest neighbor (see Complete-link)
    flexible 128
    for histogram data 277–288
    for interval data 269–277
    median-link 128
    minimum variance (see Ward's method)
    for mixed data 281–282
    for modal multi-valued data 266–269
    for multi-valued data (see Modal multi-valued data)
    nearest neighbor (see Single-link)
    pyramids (see Pyramid clustering)
    single-link 128, 129, 156, 262, 265, 267–270, 273, 280, 285
    Ward's method 4, 128, 129, 156, 263–266, 276–279, 282, 285
  between-cluster sum of squares (see Between-cluster variation)
  between-cluster variation 54, 127, 199–200, 250, 252 (see also Variation, between-cluster)
  bottom-up (see Agglomerative clustering)
  classical 1, 4–5, 119–130, 131, 139, 141, 144–145, 151, 180, 257, 262, 264, 289, 309
  construction algorithms 120, 151, 184, 204, 232, 238–239
  divisive 5, 126 (see also Divisive clustering)
  dynamical (see Partitions (partitioning), dynamical partitioning)
  hierarchical (see Hierarchical clustering)
  initial clusters (see Seeds)
  main cluster 238, 243
  monothetic 120, 125, 131–137, 144, 197, 200, 203–249, 254–256
  non-hierarchical (see Partitions (partitioning))
  numbers of 120, 122, 146, 163, 181, 190, 198, 239, 250, 253, 255, 257 (see also Davis and Bouldin index; Dunn index)
  overview 118–125
  partition (see Partitions (partitioning))
  partitioning (see Partitions (partitioning))
  polythetic 5, 120, 125, 137, 197, 200, 236–250, 255–257, 266
  pyramid (see Pyramid clustering)
  representations 5, 122–124, 129, 149, 186–189
  seeds 121, 150, 152, 155–156, 161–162, 164, 166–167, 190
  splinter cluster 238, 243–245, 257
  ties 128
  top-down (see Divisive clustering)
  total 159, 198, 200, 204, 206, 219–221, 225–226, 229–231, 250–251, 264, 276
  types of
    hierarchy 4–5, 48, 119, 121, 125–130, 146, 186, 189, 197–259, 261–316
    nested 125, 129 (see Divisive clustering)
    non-nested 120 (see Partitions (partitioning))
    pyramid (see Pyramid clustering)
  within-cluster sum of squares (see Within-cluster variation)
  within-cluster variation 54, 127, 159, 198–200, 219, 229, 232, 237, 245, 250, 251, 253–254, 264
Complete-link 128–129, 139, 140, 142, 262, 265, 268–273, 285, 310
Complete object (see Object(s))
Content component 58, 69, 87–88 (see Gowda–Diday dissimilarity, content component)
Copula 182, 184–186
  Archimedean 184
  Frank copula 185
  function 184
Correlation coefficient, symbolic 30–31, 257
Covariance, symbolic 3, 24, 28–31, 39, 52, 72, 117
Credibility 16–17, 38, 94
Cumulative density function 115–116, 170, 246, 249, 279, 281
Cumulative density function dissimilarity (see Dissimilarity, cumulative density function)
Cumulative distribution 117–118, 131, 170–172, 180 (see also Dissimilarity; Symbolic data, types of, cumulative)
Cut points (see Divisive clustering)

d

Davis and Bouldin index 146, 253–255, 257
de Carvalho distance 76–78, 112–114, 170
  agreement–disagreement indices 76, 112
  comparison functions 76–77, 112–113
Dendrogram(s) (see Tree(s))
Density function (see Probability density function)
Dependence
  hierarchical dependency (see Hierarchical clustering)
  logical rule (see Logical dependency rule(s))
  product-moment (see Covariance, symbolic)
Dependency (see Dependence)
Description(s)
  of objects (see Observations)
  of observations 8–17, 84, 237, 289–290
  virtual 282–284
Descriptive statistics 24–38, 101–104 (see also Covariance, symbolic; Mean, symbolic)
Difference 237–238 (see also Divisive clustering, polythetic)
Dissimilarity 47–118
  categories (see multi-valued)
  city block 51–52, 60–61, 66, 73–75, 106–107, 119, 207, 212, 251–252, 255
  classical 47–54
  cumulative density function 115–116, 170, 246, 254, 279, 281
  cumulative distribution function (see cumulative density function)
  cumulative histogram function (see cumulative density function)
  earth movers (see Mallows' distance)
  Hausdorff (see Hausdorff distance)
  histograms 83, 93–114, 117, 170–171, 225–236
  intervals 62–78, 216–225
  L2 distance 189
  lists (see multi-valued)
  Mahalanobis (see Mahalanobis distance)
  Mallows' (see Mallows' distance)
  matrix 48, 50, 54, 58, 60, 66, 68, 71–72, 75, 81, 86, 91–92, 108, 111, 170, 207–208, 212, 215, 219, 226, 229, 232, 243, 245–246, 250, 254–255, 267–268, 270–275, 278, 280, 309–310
  Minkowski (see Minkowski distance)
  mixed 47, 72, 246
  modal multi-valued 83–93, 205–214, 309–311 (see also multi-valued)
  multi-valued 55–62, 214–215
  normalized cumulative density function 115, 116, 131, 138–139, 254, 279–281
  Pearson's correlation 79, 260
  pyramidal 48
  Robinson 48–50, 54, 129, 310
  ultrametric 48–50, 128–130
  Ward 264 (see also Cluster(s)/clustering, agglomerative, Ward's method)
  Wasserstein (see Mallows' distance)
  (see also de Carvalho distance; Euclidean distance; Extended Gowda–Diday dissimilarity; Extended Ichino–Yaguchi dissimilarity; Gowda–Diday dissimilarity; Hausdorff distance; Mallows' distance)
Distance measure (matrix) (see Dissimilarity)
Distribution(s)
  of distributions 181
  function 179
  inverse distribution function 117
  joint, distribution of distributions 182
  mixture of 181–186
Divisive clustering 5, 125–126, 197–260
  association measures 5, 200–203, 205, 207–209
  between-cluster variation 199, 251–254
  for histogram data 131, 225–248
    bi-partition(s) 226–230, 235–236, 244–245
    cut points 205, 226–230, 250
  for interval data 216–225
    bi-partition(s) 219–224
    cut points 205, 219–224
  for mixed data 246, 249–250
  for modal multi-valued data 205–215
  monothetic 203–236, 255
    algorithm 204, 231
    double algorithm 231–236
  for multi-valued data 214–215 (see also Modal multi-valued data)
    bi-partition(s) 208–213
    cut points 202, 205, 211–212, 215
  partitioning criteria 197–200
  polythetic 236–250, 255
    algorithm 238–239
  stopping rule 250–257
  sub-partition(s) (see bi-partition(s))
  within-cluster variation 151, 152, 156, 159, 166, 198, 219–220, 232, 252
    total 152, 199–200, 204, 206, 212, 219, 221, 225, 226, 229, 230–231, 232, 235, 251
Dunn index 146, 253–255, 257
Dynamical partitioning 4, 122–124, 149–150, 152–153, 167–169, 184

e

Empirical covariance (see Covariance, symbolic)
Empirical density function (see Density function)
Empirical distributions (see Distribution(s))
Euclidean distance 51, 59, 61, 70, 74, 91–92, 107–108, 113, 142–143, 150, 157, 161–166, 265, 275–276, 278
  matrix 75, 108, 275
  normalized 52, 70, 74, 75, 77, 107–108, 110–111
  weighted 51, 52, 73, 77, 108, 110–111
Extended Gowda–Diday dissimilarity (see Gowda–Diday dissimilarity)
Extended Ichino–Yaguchi dissimilarity (see Ichino–Yaguchi dissimilarity)

f

Frequency, observed (see Histograms, relative frequency)
Frequency distribution, relative (see Histograms, relative frequency)
Frequency histogram (see Histograms, relative frequency)
Fuzzy data 180

g

Galois field (see Galois lattice)
Galois lattice 39, 84, 312
Generality degree (index) (see Pyramid clustering)
Gowda–Diday dissimilarity 51, 58–60, 68–72, 87–90, 104–108, 164–166, 250, 254, 255, 258, 281–282
  content component 58, 69, 87–88
  extended Gowda–Diday dissimilarity
    for histograms 104–108, 131, 136–137, 250, 254, 281
    for modal multi-valued data 87–90, 250
    for modal-valued lists (see Modal multi-valued data)
  for intervals 68–72, 164–166, 269–271, 281
  for lists (see Multi-valued data)
  for multi-valued data 58–60, 215, 249, 281
  normalization 58, 59, 89, 107
  position component 68–69, 105
  relative content component (see Content component)
  relative location component (see Position component)
  relative measure component (see Position component)
  relative size component (see Span component)
  span component 58, 68, 87–88

h

Hausdorff distance 63–68, 168–169, 189, 216, 219, 221, 224–225, 258 (see also Divisive clustering, for interval data)
  Euclidean 63–64, 67–68 (see also Euclidean distance)
  normalized 64, 67–68, 225
  span Euclidean 64, 67–68, 225
  generalized Minkowski Hausdorff distance 64
  matrix (see Dissimilarity, matrix)
Hierarchical clustering
  agglomerative (see Cluster(s)/clustering, agglomerative)
  divisive (see Divisive clustering)
  pyramidal (see Pyramid clustering)
  tree (see Tree(s))
Hierarchy (see Hierarchical clustering)
Histograms 31
  construction 31, 38
  data (see Symbolic data, types of, histogram-valued)
  joint 30–37, 40, 182–183, 188
  relative frequency 3, 16, 30, 38, 55, 83, 95–99, 115, 188–189, 200, 250, 267
Histogram-valued data (see Symbolic data, types of, histogram-valued)
Huygens Theorem 54, 199
Hypercube(s) 1, 7, 122, 163, 282–283
Hyperrectangle(s) (see Hypercube(s))

i

Ichino–Yaguchi dissimilarity (see Ichino–Yaguchi distance)
Ichino–Yaguchi distance 60–62, 73–75, 258, 271, 273–276
  for categories (see Multi-valued data)
  de Carvalho extensions (see de Carvalho distance)
  extended Ichino–Yaguchi distance 90–93, 108–111, 139, 140
    for histogram data 108–111, 155–172, 178, 195–196, 225–229, 234, 239–242, 245–246, 254, 284–286
    for modal multi-valued data 90–93, 155–156, 195–196, 207, 213, 252, 267, 271–272
    for modal-valued lists (see Modal multi-valued data)
    normalized 51, 155–156, 229, 239, 257, 267, 284–288, 309
  for interval data 73–75, 271–274, 276
  for lists (see Multi-valued data)
  for multi-valued data 60–62
  normalized 109, 110, 178, 195–196, 273
Indexed pyramid (see Pyramid clustering)
Individual description (see Description(s), of observations)
Inertia 264
Internal variation (see Variation, internal)
Intersection (operator) 55, 63, 69, 84–85, 87–88, 94, 98–105, 109, 112–113, 178, 228, 277, 284
Interval data (see Symbolic data, types of, interval-valued)
Interval-valued (see Symbolic data, types of, interval-valued)

j

Jaccard dissimilarity 79
Join 55, 62, 73, 84
Joint distributions (see Histograms, joint)
Joint histograms (see Histograms, joint)
Joint probabilities 31 (see also Histograms, joint)

k

k-means 4, 120, 122, 124, 142, 144–146, 149–153, 160, 163, 167, 170, 172, 180, 184, 190
  adaptive 122, 152, 168, 285, 288
  non-adaptive 122, 152, 285
k-medoids 4, 120, 124, 149–150, 152–156, 164–167, 170, 178

l

Linear order (see Pyramid clustering)
Lists (see Modal multi-valued data; Symbolic data, types of, multi-valued)
Logical dependency rule(s) 5, 282–285
Logical rule (see Logical dependency rule(s))

m

Mahalanobis distance 51–52, 71–72, 165–166, 170–172 (see also Dissimilarity)
Mallows' distance 117–118, 164–166, 171–172, 190 (see also Dissimilarity)
Manhattan distance (see City block distance)
Maximum difference 237 (see also Divisive clustering, polythetic)
Maximum likelihood estimators 39, 180, 184
Maximum likelihood methods 125, 146
Mean, symbolic
  of categories (see Multi-valued data)
  of histogram data 25, 235
    intersection of two histograms 102, 178, 193–194
    union of two histograms 102, 178, 193–194
  of interval data 25
  of lists (see Multi-valued data)
  of modal multi-valued data (see Multi-valued data)
  of multi-valued data 25
Median, symbolic 168
Medoids (see k-medoids)
Meet 55, 62, 73, 84
Metric 48 (see also Dissimilarity)
Minkowski distance 51–52, 69, 75 (see also Dissimilarity)
  generalized 64, 73
  order one (see Dissimilarity, city block)
  order two (see Euclidean distance)
  weighted 51, 92 (see also Dissimilarity; Euclidean distance)
Mixed data (see Symbolic data, types of, mixed)
Modal data (see Symbolic data, types of, modal multi-valued)
Modal multi-valued data 7, 10–11, 15–21, 47, 83–93, 98, 153–159, 177, 200–203, 205–214, 250–251, 266–269, 281, 309
Moment estimators 39
Monothetic algorithm 5, 120, 125, 131, 137, 144, 197, 200, 203–236, 251, 254, 294
Multi-valued data (see Symbolic data, types of, multi-valued)

n

Necessity 10, 16–17, 38, 94
Nodes 125, 127, 172, 186, 261 (see also Hierarchical clustering)

o

Object(s) 8, 47–56, 58–60, 62–65, 68–70, 73, 75–77, 79, 83–85, 87–93, 98, 100–105, 108–109, 112, 115, 117, 129, 197, 290, 292, 297–298, 300, 302
Observations 8–17, 84, 237, 289–290
Observed frequency (see Frequency, observed)
Outlier(s) 24, 122, 128, 190

p

Partitions (partitioning) 5, 119–125, 127, 129, 142–144, 146, 149–196
  adaptive 122, 152, 168
  adaptive leader 156–159, 178–179
  adaptive Ward (see Adaptive leader)
  of categories (see Multi-valued data)
  classical 120–125
  classification criteria 120, 150–151
  convergence 120–121, 123, 153
  dynamical partitioning 4, 122–124, 149–150, 152–153, 167–169, 184
  of histogram data 169–172
  of interval data 159–169
  iterated minimum partitioning (see Dynamical partitioning)
  of lists (see Multi-valued data)
  of mixed data 172, 179
  mixture distribution method 179–186
  of modal multi-valued data 153
  monothetic 120
  of multi-valued data 153–159
  nearest centroid sorting (see Dynamical partitioning)
  polythetic 120 (see also Polythetic algorithm)
  reallocation 5, 120–123, 125, 151–153, 159, 161, 164, 169, 179, 183
  representative function 122–123, 124
  squared error partitioning (see Dynamical partitioning)
  within-cluster distance 151, 156, 159
Patterns 283
Polythetic algorithm 5, 120, 125, 131, 137, 197, 200, 236–250, 255–266
Position component 68–69, 105
Possibility 10, 16–17, 38, 94
Principal component 5, 147
Probability density function 153, 179–180, 183
Proximity measure (see Pyramid clustering)
Pyramid clustering 4–5, 48–50, 129, 144, 261, 289–312
  classical 129, 289
  code 290–298, 301–305, 307
  complete(ness) 289–293, 300–302
  construction
    from dissimilarities 309–312
    from generality degrees 297–309
  generality degree 289–309, 315–316
  linear order 129
  spatial 129, 312
  strongly indexed 312
  weakly indexed 312

q

Qualitative variable(s) (see Symbolic data, types of, multi-valued)
Quantile 117, 142, 144, 171–172

r

Random variables, types of (see Variables, random)
Rectangles (see Hypercube(s))
Relative frequency (see Histograms, relative frequency)
Representation (see Cluster(s)/clustering)
Robinson matrix (see Dissimilarity, Robinson)
Rules (see Logical dependency rule(s))

s

Seeds 121, 150, 152, 155–156, 161–162, 164, 166–167, 190
Self-organizing maps 124, 190
between sum of products (see Variation, between observations) between sum of squares (see Variation, between observations) total variation 130, 204, 219 (see also Cluster(s)/clustering, within-cluster variation) within-class (see Within-cluster) within-cluster 54, 198–200, 206, 219, 243 (see also Cluster(s)/clustering, within-cluster variation) within observations (see Variation, internal) Virtual data 283–284 w Ward’s method 4, 128, 129, 156, 263–266, 276–279, 282, 285 Weight(s) 3, 10, 15–17, 38, 51–52, 65, 67, 75, 128, 157–158, 171, 178, 180, 190, 196, 198, 201–203, 206, 219, 237–238, 243, 263, 265 Within-class variation (see Divisive clustering, within-cluster variation) Within-cluster variation (see Divisive clustering, within-cluster variation) ... perforce are described by symbolic data Species data are examples of naturally occurring symbolic data Data with minimum and maximum values, such as the temperature data of 23 24 Symbolic Data: ... dissimilarity/distance measures for non-modal symbolic data, i.e., for non-modal list multi-valued data and for interval-valued data Chapter considers such measures for modal observations, i.e., for modal list... necessary for clarification purposes, we will write “interval-valued data as “interval data for simplicity; likewise, for the other types of symbolic data It is important to remember that symbolic data,
