DATA SCIENCE FOUNDATIONS
Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics

Chapman & Hall/CRC Computer Science and Data Analysis Series

The interface between the computer and statistical sciences is increasing, as each discipline seeks to harness the power and resources of the other. This series aims to foster the integration between the computer sciences and statistical, numerical, and probabilistic methods by publishing a broad range of reference works, textbooks, and handbooks.

SERIES EDITORS
David Blei, Princeton University
David Madigan, Rutgers University
Marina Meila, University of Washington
Fionn Murtagh, Royal Holloway, University of London

Proposals for the series should be sent directly to one of the series editors above, or submitted to:

Chapman & Hall/CRC
Taylor and Francis Group
Park Square, Milton Park
Abingdon, OX14 4RN, UK

Published Titles

Semisupervised Learning for Computational Linguistics
  Steven Abney
Visualization and Verbalization of Data
  Jörg Blasius and Michael Greenacre
Design and Modeling for Computer Experiments
  Kai-Tai Fang, Runze Li, and Agus Sudjianto
Microarray Image Analysis: An Algorithmic Approach
  Karl Fraser, Zidong Wang, and Xiaohui Liu
R Programming for Bioinformatics
  Robert Gentleman
Exploratory Multivariate Analysis by Example Using R
  François Husson, Sébastien Lê, and Jérôme Pagès
Bayesian Artificial Intelligence, Second Edition
  Kevin B. Korb and Ann E. Nicholson
Computational Statistics Handbook with MATLAB®, Third Edition
  Wendy L. Martinez and Angel R. Martinez
Exploratory Data Analysis with MATLAB®, Third Edition
  Wendy L. Martinez, Angel R. Martinez, and Jeffrey L. Solka
Statistics in MATLAB®: A Primer
  Wendy L. Martinez and MoonJung Cho
Clustering for Data Mining: A Data Recovery Approach, Second Edition
  Boris Mirkin
Introduction to Machine Learning and Bioinformatics
  Sushmita Mitra, Sujay Datta, Theodore Perkins, and George Michailidis
Introduction to Data Technologies
  Paul Murrell
R Graphics
  Paul Murrell
Correspondence Analysis and Data Coding with Java and R
  Fionn Murtagh
Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics
  Fionn Murtagh
Pattern Recognition Algorithms for Data Mining
  Sankar K. Pal and Pabitra Mitra
Statistical Computing with R
  Maria L. Rizzo
Statistical Learning and Data Science
  Mireille Gettler Summa, Léon Bottou, Bernard Goldfarb, Fionn Murtagh, Catherine Pardoux, and Myriam Touati
Music Data Analysis: Foundations and Applications
  Claus Weihs, Dietmar Jannach, Igor Vatolkin, and Günter Rudolph
Foundations of Statistical Algorithms: With References to R Packages
  Claus Weihs, Olaf Mersmann, and Uwe Ligges

Chapman & Hall/CRC Computer Science and Data Analysis Series

DATA SCIENCE FOUNDATIONS
Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics

Fionn Murtagh

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2018 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper
Version Date: 20170823

International Standard Book Number-13: 978-1-4987-6393-6 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com

Contents

Preface

I Narratives from Film and Literature, from Social Media and Contemporary Life

1 The Correspondence Analysis Platform for Mapping Semantics
  1.1 The Visualization and Verbalization of Data
  1.2 Analysis of Narrative from Film and Drama
    1.2.1 Introduction
    1.2.2 The Changing Nature of Movie and Drama
    1.2.3 Correspondence Analysis as a Semantic Analysis Platform
    1.2.4 Casablanca Narrative: Illustrative Analysis
    1.2.5 Modelling Semantics via the Geometry and Topology of Information
    1.2.6 Casablanca Narrative: Illustrative Analysis Continued
    1.2.7 Platform for Analysis of Semantics
    1.2.8 Deeper Look at Semantics of Casablanca: Text Mining
    1.2.9 Analysis of a Pivotal Scene
  1.3 Application of Narrative Analysis to Science and Engineering Research
    1.3.1 Assessing Coverage and Completeness
    1.3.2 Change over Time
    1.3.3 Conclusion on the Policy Case Studies
  1.4 Human Resources Multivariate Performance Grading
  1.5 Data Analytics as the Narrative of the Analysis Processing
  1.6 Annex: The Correspondence Analysis and Hierarchical Clustering Platform
    1.6.1 Analysis Chain
    1.6.2 Correspondence Analysis: Mapping χ² Distances into Euclidean Distances
    1.6.3 Input: Cloud of Points Endowed with the Chi-Squared Metric
    1.6.4 Output: Cloud of Points Endowed with the Euclidean Metric in Factor Space
    1.6.5 Supplementary Elements: Information Space Fusion
    1.6.6 Hierarchical Clustering: Sequence-Constrained

2 Analysis and Synthesis of Narrative: Semantics of Interactivity
  2.1 Impact and Effect in Narrative: A Shock Occurrence in Social Media
    2.1.1 Analysis
    2.1.2 Two Critical Tweets in Terms of Their Words
    2.1.3 Two Critical Tweets in Terms of Twitter Sub-narratives
  2.2 Analysis and Synthesis, Episodization and Narrativization
  2.3 Storytelling as Narrative Synthesis and Generation
  2.4 Machine Learning and Data Mining in Film Script Analysis
  2.5 Style Analytics: Statistical Significance of Style Features
  2.6 Typicality and Atypicality for Narrative Summarization and Transcoding
  2.7 Integration and Assembling of Narrative

II Foundations of Analytics through the Geometry and Topology of Complex Systems

3 Symmetry in Data Mining and Analysis through Hierarchy
  3.1 Analytics as the Discovery of Hierarchical Symmetries in Data
  3.2 Introduction to Hierarchical Clustering, p-Adic and m-Adic Numbers
    3.2.1 Structure in Observed or Measured Data
    3.2.2 Brief Look Again at Hierarchical Clustering
    3.2.3 Brief Introduction to p-Adic Numbers
    3.2.4 Brief Discussion of p-Adic and m-Adic Numbers
  3.3 Ultrametric Topology
    3.3.1 Ultrametric Space for Representing Hierarchy
    3.3.2 Geometrical Properties of Ultrametric Spaces
    3.3.3 Ultrametric Matrices and Their Properties
    3.3.4 Clustering through Matrix Row and Column Permutation
    3.3.5 Other Data Symmetries
  3.4 Generalized Ultrametric and Formal Concept Analysis
    3.4.1 Link with Formal Concept Analysis
    3.4.2 Applications of Generalized Ultrametrics
  3.5 Hierarchy in a p-Adic Number System
    3.5.1 p-Adic Encoding of a Dendrogram
    3.5.2 p-Adic Distance on a Dendrogram
    3.5.3 Scale-Related Symmetry
  3.6 Tree Symmetries through the Wreath Product Group
    3.6.1 Wreath Product Group for Hierarchical Clustering
    3.6.2 Wreath Product Invariance
    3.6.3 Wreath Product Invariance: Haar Wavelet Transform of Dendrogram
  3.7 Tree and Data Stream Symmetries from Permutation Groups
    3.7.1 Permutation Representation of a Data Stream
    3.7.2 Permutation Representation of a Hierarchy
  3.8 Remarkable Symmetries in Very High-Dimensional Spaces
  3.9 Short Commentary on This Chapter

4 Geometry and Topology of Data Analysis: in p-Adic Terms
  4.1 Numbers and Their Representations
    4.1.1 Series Representations of Numbers
    4.1.2 Field
  4.2 p-Adic Valuation, p-Adic Absolute Value, p-Adic Norm
  4.3 p-Adic Numbers as Series Expansions
  4.4 Canonical p-Adic Expansion; p-Adic Integer or Unit Ball
  4.5 Non-Archimedean Norms as p-Adic Integer Norms in the Unit Ball
    4.5.1 Archimedean and Non-Archimedean Absolute Value Properties
    4.5.2 A Non-Archimedean Absolute Value, or Norm, is Less Than or Equal to One, and an Archimedean Absolute Value, or Norm, is Unbounded
  4.6 Going Further: Negative p-Adic Numbers, and p-Adic Fractions
  4.7 Number Systems in the Physical and Natural Sciences
  4.8 p-Adic Numbers in Computational Biology and Computer Hardware
  4.9 Measurement Requires a Norm, Implying Distance and Topology
  4.10 Ultrametric Topology
  4.11 Short Review of p-Adic Cosmology
  4.12 Unbounded Increase in Mass or Other Measured Quantity
  4.13 Scale-Free Partial Order or Hierarchical Systems
  4.14 p-Adic Indexing of the Sphere
  4.15 Diffusion and Other Dynamic Processes in Ultrametric Spaces

III New Challenges and New Solutions for Information Search and Discovery

5 Fast, Linear Time, m-Adic Hierarchical Clustering
  5.1 Pervasive Ultrametricity: Computational Consequences
    5.1.1 Ultrametrics in Data Analytics
    5.1.2 Quantifying Ultrametricity
    5.1.3 Pervasive Ultrametricity
    5.1.4 Computational Implications
  5.2 Applications in Search and Discovery using the Baire Metric
    5.2.1 Baire Metric
    5.2.2 Large Numbers of Observables
    5.2.3 High-Dimensional Data
    5.2.4 First Approach Based on Reduced Precision of Measurement
    5.2.5 Random Projections in High-Dimensional Spaces, Followed by the Baire Distance
    5.2.6 Summary Comments on Search and Discovery
  5.3 m-Adic Hierarchy and Construction
  5.4 The Baire Metric, the Baire Ultrametric
    5.4.1 Metric and Ultrametric Spaces
    5.4.2 Ultrametric Baire Space and Distance
  5.5 Multidimensional Use of the Baire Metric through Random Projections
  5.6 Hierarchical Tree Defined from m-Adic Encoding
  5.7 Longest Common Prefix and Hashing
    5.7.1 From Random Projection to Hashing
  5.8 Enhancing Ultrametricity through Precision of Measurement
    5.8.1 Quantifying Ultrametricity
    5.8.2 Pervasiveness of Ultrametricity
  5.9 Generalized Ultrametric and Formal Concept Analysis
    5.9.1 Generalized Ultrametric
    5.9.2 Formal Concept Analysis
  5.10 Linear Time and Direct Reading Hierarchical Clustering
    5.10.1 Linear Time, or O(N) Computational Complexity, Hierarchical Clustering
    5.10.2 Grid-Based Clustering Algorithms
  5.11 Summary: Many Viewpoints, Various Implementations

6 Big Data Scaling through Metric Mapping
  6.1 Mean Random Projection, Marginal Sum, Seriation
    6.1.1 Mean of Random Projections as A Seriation
    6.1.2 Normalization of the Random Projections
  6.2 Ultrametric and Ordering of Rows, Columns
  6.3 Power Iteration Clustering
  6.4 Input Data for Eigenreduction
    6.4.1 Implementation: Equivalence of Iterative Approximation and Batch Calculation
  6.5 Inducing a Hierarchical Clustering from Seriation
  6.6 Short Summary of All These Methodological Underpinnings
    6.6.1 Trivial First Eigenvalue, Eigenvector in Correspondence Analysis
  6.7 Very High-Dimensional Data Spaces: Data Piling
  6.8 Recap on Correspondence Analysis for Following Applications
    6.8.1 Clouds of Points, Masses and Inertia
    6.8.2 Relative and Absolute Contributions
  6.9 Evaluation 1: Uniformly Distributed Data Cloud Points
    6.9.1 Computation Time Requirements
  6.10 Evaluation 2: Time Series of Financial Futures
  6.11 Evaluation 3: Chemistry Data, Power Law Distributed
    6.11.1 Data and Determining Power Law Properties
    6.11.2 Randomly Generating Power Law Distributed Data in Varying Embedding Dimensions
  6.12 Application 1: Quantifying Effectiveness through Aggregate Outcome
    6.12.1 Computational Requirements, from Original Space and Factor Space Identities
  6.13 Application 2: Data Piling as Seriation of Dual Space
  6.14 Brief Concluding Summary
  6.15 Annex: R Software Used in Simulations and Evaluations
    6.15.1 Evaluation 1: Dense, Uniformly Distributed Data
    6.15.2 Evaluation 2: Financial Futures
    6.15.3 Evaluation 3: Chemicals of Specified Marginal Distribution

IV New Frontiers: New Vistas on Information, Cognition and the Human Mind

7 On Ultrametric Algorithmic Information
  7.1 Introduction to Information Measures