Exploratory Data Analysis with MATLAB®
Computer Science and Data Analysis Series
© 2005 by CRC Press LLC

Chapman & Hall/CRC Series in Computer Science and Data Analysis

The interface between the computer and statistical sciences is increasing, as each discipline seeks to harness the power and resources of the other. This series aims to foster the integration between the computer sciences and statistical, numerical and probabilistic methods by publishing a broad range of reference works, textbooks and handbooks.

SERIES EDITORS
John Lafferty, Carnegie Mellon University
David Madigan, Rutgers University
Fionn Murtagh, Queen’s University Belfast
Padhraic Smyth, University of California Irvine

Proposals for the series should be sent directly to one of the series editors above, or submitted to:
Chapman & Hall/CRC Press UK
23-25 Blades Court
London SW15 2NU
UK

Published Titles
Bayesian Artificial Intelligence, Kevin B. Korb and Ann E. Nicholson
Exploratory Data Analysis with MATLAB®, Wendy L. Martinez and Angel R. Martinez

Forthcoming Titles
Correspondence Analysis and Data Coding with JAVA and R, Fionn Murtagh
R Graphics, Paul Murrell
Nonlinear Dimensionality Reduction, Vin de Silva and Carrie Grimes

CHAPMAN & HALL/CRC
A CRC Press Company
Boca Raton London New York Washington, D.C.

Wendy L. Martinez
Angel R. Martinez

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.
Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage or retrieval system, without prior permission in writing from the publisher.

The consent of CRC Press does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Specific permission must be obtained in writing from CRC Press for such copying.

Direct all inquiries to CRC Press, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation, without intent to infringe.

Visit the CRC Press Web site at www.crcpress.com

© 2005 by Chapman & Hall/CRC Press
No claim to original U.S. Government works
International Standard Book Number 1-58488-366-9
Library of Congress Card Number 2004058245
Printed in the United States of America
Printed on acid-free paper

Library of Congress Cataloging-in-Publication Data
Martinez, Wendy L.
Exploratory data analysis with MATLAB / Wendy L. Martinez, Angel R. Martinez.
p. cm.
Includes bibliographical references and index.
ISBN 1-58488-366-9 (alk. paper)
1. Multivariate analysis. 2. MATLAB. 3. Mathematical statistics. I. Martinez, Angel R. II. Title.
QA278.M3735 2004 519.5'35 dc22 2004058245

This book is dedicated to our children:
Angel and Ochida
Deborah and Nataniel
Jeff and Lynn
and Lisa (Principessa)

Table of Contents

Preface

Part I Introduction to Exploratory Data Analysis

Chapter 1 Introduction to Exploratory Data Analysis
1.1 What is Exploratory Data Analysis
1.2 Overview of the Text
1.3 A Few Words About Notation
1.4 Data Sets Used in the Book
    1.4.1 Unstructured Text Documents
    1.4.2 Gene Expression Data
    1.4.3 Oronsay Data Set
    1.4.4 Software Inspection
1.5 Transforming Data
    1.5.1 Power Transformations
    1.5.2 Standardization
    1.5.3 Sphering the Data
1.6 Further Reading
Exercises

Part II EDA as Pattern Discovery

Chapter 2 Dimensionality Reduction - Linear Methods
2.1 Introduction
2.2 Principal Component Analysis - PCA
    2.2.1 PCA Using the Sample Covariance Matrix
    2.2.2 PCA Using the Sample Correlation Matrix
    2.2.3 How Many Dimensions Should We Keep?
2.3 Singular Value Decomposition - SVD
2.4 Factor Analysis
2.5 Intrinsic Dimensionality
2.6 Summary and Further Reading
Exercises

Chapter 3 Dimensionality Reduction - Nonlinear Methods
3.1 Multidimensional Scaling - MDS
    3.1.1 Metric MDS
    3.1.2 Nonmetric MDS
3.2 Manifold Learning
    3.2.1 Locally Linear Embedding
    3.2.2 Isometric Feature Mapping - ISOMAP
    3.2.3 Hessian Eigenmaps
3.3 Artificial Neural Network Approaches
    3.3.1 Self-Organizing Maps - SOM
    3.3.2 Generative Topographic Maps - GTM
3.4 Summary and Further Reading
Exercises

Chapter 4 Data Tours
4.1 Grand Tour
    4.1.1 Torus Winding Method
    4.1.2 Pseudo Grand Tour
4.2 Interpolation Tours
4.3 Projection Pursuit
4.4 Projection Pursuit Indexes
    4.4.1 Posse Chi-Square Index
    4.4.2 Moment Index
4.5 Summary and Further Reading
Exercises

Chapter 5 Finding Clusters
5.1 Introduction
5.2 Hierarchical Methods
5.3 Optimization Methods - k-Means
5.4 Evaluating the Clusters
    5.4.1 Rand Index
    5.4.2 Cophenetic Correlation
    5.4.3 Upper Tail Rule
    5.4.4 Silhouette Plot
    5.4.5 Gap Statistic
5.5 Summary and Further Reading
Exercises

Chapter 6 Model-Based Clustering
6.1 Overview of Model-Based Clustering
6.2 Finite Mixtures
    6.2.1 Multivariate Finite Mixtures
    6.2.2 Component Models - Constraining the Covariances
6.3 Expectation-Maximization Algorithm
6.4 Hierarchical Agglomerative Model-Based Clustering
6.5 Model-Based Clustering
6.6 Generating Random Variables from a Mixture Model
6.7 Summary and Further Reading
Exercises

Chapter 7 Smoothing Scatterplots
7.1 Introduction
7.2 Loess
7.3 Robust Loess
7.4 Residuals and Diagnostics
    7.4.1 Residual Plots
    7.4.2 Spread Smooth
    7.4.3 Loess Envelopes - Upper and Lower Smooths
7.5 Bivariate Distribution Smooths
    7.5.1 Pairs of Middle Smoothings
    7.5.2 Polar Smoothing
7.6 Curve Fitting Toolbox
7.7 Summary and Further Reading
Exercises

Part III Graphical Methods for EDA

Chapter 8 Visualizing Clusters
8.1 Dendrogram
8.2 Treemaps
8.3 Rectangle Plots
8.4 ReClus Plots
8.5 Data Image
8.6 Summary and Further Reading
Exercises

Chapter 9 Distribution Shapes
9.1 Histograms
    9.1.1 Univariate Histograms
    9.1.2 Bivariate Histograms
9.2 Boxplots
    9.2.1 The Basic Boxplot
    9.2.2 Variations of the Basic Boxplot
9.3 Quantile Plots
    9.3.1 Probability Plots
    9.3.2 Quantile-quantile Plot
    9.3.3 Quantile Plot
9.4 Bagplots
9.5 Summary and Further Reading
Exercises

Chapter 10 Multivariate Visualization
10.1 Glyph Plots
10.2 Scatterplots
    10.2.1 2-D and 3-D Scatterplots
    10.2.2 Scatterplot Matrices
    10.2.3 Scatterplots with Hexagonal Binning
10.3 Dynamic Graphics
    10.3.1 Identification of Data
    10.3.2 Linking
    10.3.3 Brushing
10.4 Coplots
10.5 Dot Charts
    10.5.1 Basic Dot Chart
    10.5.2 Multiway Dot Chart
10.6 Plotting Points as Curves
    10.6.1 Parallel Coordinate Plots
    10.6.2 Andrews’ Curves
    10.6.3 More Plot Matrices
10.7 Data Tours Revisited
    10.7.1 Grand Tour
    10.7.2 Permutation Tour
10.8 Summary and Further Reading
Exercises

Appendix A Proximity Measures
A.1 Definitions
    A.1.1 Dissimilarities
    A.1.2 Similarity Measures
    A.1.3 Similarity Measures for Binary Data
    A.1.4 Dissimilarities for Probability Density Functions
A.2 Transformations
A.3 Further Reading

Appendix B Software Resources for EDA
B.1 MATLAB Programs
B.2 Other Programs for EDA
B.3 EDA Toolbox

Appendix C Description of Data Sets

Appendix D Introduction to MATLAB
D.1 What Is MATLAB?
D.2 Getting Help in MATLAB
D.3 File and Workspace Management
D.4 Punctuation in MATLAB
D.5 Arithmetic Operators
D.6 Data Constructs in MATLAB
    Basic Data Constructs
    Building Arrays
    Cell Arrays
    Structures
D.7 Script Files and Functions
D.8 Control Flow
    for Loop
    while Loop
    if-else Statements
    switch Statement
D.9 Simple Plotting
D.10 Where to get MATLAB Information

Appendix E MATLAB Functions
E.1 MATLAB
E.2 Statistics Toolbox - Versions 4 and 5
E.3 Exploratory Data Analysis Toolbox

... known as exploratory data analysis or EDA. Thus, we see this book as a complement to the first one with similar goals: to make exploratory data analysis techniques available to a wide range of users.

Exploratory data analysis is an area of statistics and data analysis, where the idea is to first explore the data set, often using methods from descriptive statistics, scientific visualization, data tours, ...

... counting detective work - or graphical detective work.” [Tukey, 1977, page 1] It is mostly a philosophy of data analysis where the researcher examines the data without any pre-conceived ideas in order to discover what the data can tell him about the phenomena being studied. Tukey contrasts this with confirmatory data analysis (CDA), an area of data analysis that is mostly concerned with statistical hypothesis ...
... the data looking for patterns and structure that lead to hypotheses and models.

Tukey’s book on EDA was written at a time when computers were not widely available and the data sets tended to be somewhat small, especially by today’s standards. So, Tukey developed methods that could be accomplished using pencil and paper, such as the familiar box-and-whisker plots (also known as boxplots) and the stem-and-leaf ...

... information.

MATLAB code in the form of an Exploratory Data Analysis Toolbox is provided with the text. This includes the functions, GUIs, and data sets that are described in the book. This is available for download at http://lib.stat.cmu.edu and http://www.infinityassociates.com. Please review the readme file for installation instructions and information on any changes. M-files that contain the MATLAB commands ...

... is to provide some introductory and background information. First, we cover the philosophy of exploratory data analysis and discuss how this fits in with other data analysis techniques and objectives. This is followed by an overview of the text, which includes the software that will be used and the background necessary to understand the methods. We then present several data sets that will be employed ...

... 3 Apple Hill Drive
Natick, MA, 01760-2098 USA
Tel: 508-647-7000
Fax: 508-647-7101
E-mail: info@mathworks.com
Web: www.mathworks.com

It is important for the reader to understand what versions of the software or what toolboxes are used with this text. The book was written using MATLAB Versions 6.5 and 7. We made some use of the MATLAB Statistics Toolbox, Versions 4 and 5. We will refer to the Curve Fitting ...
... lexicon and the pre-processing of the documents before proceeding with more information on the BPM and the data provided with this book. All punctuation within a sentence, such as commas, semi-colons, colons, etc., was removed. All end-of-sentence punctuation, other than a period, such as question marks and exclamation ...

... the data available for download. We include three gene expression data sets with this book, and we describe them below.

Yeast Data Set

This data set was originally described in Cho et al. [1998], and it showed the gene expression levels of around 6000 genes over two cell cycles and ...

... EDA methods. Implementation of the methods is secondary, but where feasible, we show students and practitioners the implementation through algorithms, procedures, and MATLAB code. Many of the methods are complicated, and the details of the MATLAB implementation are not important. In these instances, we show how to use the functions and techniques. The ...

... variants of the grand tour and projection pursuit that try to look at the data set in many 2-D or 3-D views in the hope of discovering something interesting and informative. Clustering or unsupervised learning is a standard tool in EDA and data mining. These methods look for groups or clusters, and some of the issues that must be addressed involve determining the number of clusters and the validity or ...