Graphing Data with R
AN INTRODUCTION
John Jay Hilfiger

It’s much easier to grasp complex data relationships with a graph than by scanning numbers in a spreadsheet. This introductory guide shows you how to use the R language to create a variety of useful graphs for visualizing and analyzing complex data for science, business, media, and many other fields. You’ll learn methods for highlighting important relationships and trends, reducing data to simpler forms, and emphasizing key numbers at a glance.

Anyone who wants to analyze data will find something useful here, even if you don’t have a background in mathematics, statistics, or computer programming. If you want to examine data related to your work, this book is the ideal way to start.

■ Get started with R by learning basic commands
■ Build single-variable graphs, such as dot and pie charts, box plots, and histograms
■ Explore the relationship between two quantitative variables with scatter plots, high-density plots, and other techniques
■ Use scatterplot matrices, 3D plots, clustering, heat maps, and other graphs to visualize relationships among three or more variables

John Jay Hilfiger has an MS in biostatistics, as well as master’s and PhD degrees in music. His unique career as data analyst, music professor, and college administrator has included analyzing data in subjects from music, medicine, agriculture, business, education, and more.

DATA / DATA SCIENCE
US $39.99, CAN $45.99
ISBN: 978-1-491-92261-3
Twitter: @oreillymedia
facebook.com/oreilly

Graphing Data with R
by John Jay Hilfiger

Copyright © 2016 John Jay Hilfiger. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Laurel Ruma and Shannon Cutt
Production Editor: Shiny Kalapurakkel
Copyeditor: Bob Russell, Octal Publishing, Inc.
Proofreader: Rachel Head
Indexer: Ellen Troutman
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

November 2015: First Edition

Revision History for the First Edition
2015-10-16: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491922613 for release details.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

Table of Contents

Preface

Part I. Getting Started with R

1. R Basics
    Downloading the Software
    Try Some Simple Tasks
    User Interface
    Installing a Package: A GUI Interface
    Data Structures
    Sample Datasets
    The Working Directory
    Putting Data into R
    Sourcing a Script
    User-Written Functions
    A Taste of Things to Come

2. An Overview of R Graphics
    Exporting a Graph
    Exploratory Graphs and Presentation Graphs
    Graphics Systems in R

Part II. Single-Variable Graphs

3. Strip Charts
    A Simple Graph
    Data Can Be Beautiful

4. Dot Charts
    Basic Dot Chart

5. Box Plots
    The Box Plot
    Nimrod Again
    Making the Data Beautiful

6. Stem-and-Leaf Plots
    Basic Stem-and-Leaf Plot

7. Histograms
    Simple Histograms
    Histograms with a Second Variable
8. Kernel Density Plots
    Density Estimation
    The Cumulative Distribution Function

9. Bar Plots (Bar Charts)
    Basic Bar Plot
    Spine Plot
    Bar Spacing and Orientation

10. Pie Charts
    Ordinary Pie Chart
    Fan Plot

11. Rug Plots
    The Rug Plot

Part III. Two-Variable Graphs

12. Scatter Plots and Line Charts
    Basic Scatter Plots
    Line Charts
    Templates
    Enhanced Scatter Plots

13. High-Density Plots
    Working with Large Datasets

14. The Bland-Altman Plot
    Assessing Measurement Reliability

15. QQ Plots
    Comparing Sets of Numbers

Part IV. Multivariable Graphs

16. Scatter plot Matrices and Corrgrams
    Scatter plot Matrix
    Corrgram
    Generalized Pairs Matrix with Mixed Quantitative and Categorical Variables

17. Three-Dimensional Plots
    3D Scatter plots
    False Color Plots
    Bubble Plots

18. Coplots (Conditioning Plots)
    The Coplot

19. Clustering: Dendrograms and Heat Maps
    Clustering
    Heat Maps

20. Mosaic Plots
    Graphing Categorical Data

Part V. What Now?

21. Resources for Extending Your Knowledge of Things Graphical and R Fluency
    R Graphics
    General Principles of Graphics
    Learning More About R
    Statistics with R

A. References
B. R Colors
C. The R Commander Graphical User Interface
D. Packages Used/Referenced
E. Importing Data from Outside of R
F. Solutions to Chapter Exercises
G. Troubleshooting: Why Doesn’t My Code Work?
H. R Functions Introduced in This Book

Index

Preface

“A picture is worth a thousand words,” says the proverb. Sometimes, a picture is worth a lot of numbers, too!
Complex relationships are often more easily grasped by looking at a picture or a graph than they might be if one tried to absorb the nuances in a verbal description or discern the relationships in columns of numbers. This book is about using graphical methods to understand complex data by highlighting important relationships and trends, reducing the data to simpler forms, and making it possible to take in a lot of numbers at a glance.

Who Is This Book For?

Just about anyone who needs to visualize and analyze data will find something useful here. My primary aim, however, is to make graphical data analysis accessible to a wide range of people, especially those who do not have much (or any) previous experience with R but who need or want to create various types of graphs to help them understand data important to them. This will likely include people working in business, media, graphic arts, social sciences, and health sciences who have real needs for data analysis but might not have backgrounds in advanced mathematics and computer programming. Although this book is designed for self-study, it might also find a place as a supplemental text for courses in elementary and intermediate statistics or research methods.

The vehicle for this book is R, but this is not a comprehensive course on R. Many computer classes and computer books attempt to show you every possible thing one can do with a language or tool. For many of us who have attempted to learn this way, it gets to be quite confusing and boring. This book will focus on understanding the elements of graphics for data analysis and how to use R to produce the kinds of graphs discussed here; it will show you how to use some of R’s built-in resources for finding help, and leave a lot of the other stuff for you to pursue elsewhere.

You should have access to a computer and feel comfortable using it for some task(s), such as sending email, browsing the Internet, or perhaps using applications such as a word processor or spreadsheet.
Familiarity with basic statistics will be helpful for some of the topics covered here, but it is not necessary for most of them.

Why R?

It is possible to make useful graphs of small datasets by hand. It is much more efficient, however, to take advantage of computer technology to produce accurate and appealing visual data analyses. For large datasets, hand work is effectively impossible. Computer software, conversely, makes producing complex graphs of even very large datasets practical.

This technology is now readily available through open source software to virtually anyone who has access to a computer. “Open source” refers to programs for which the source code is made available to all: to examine, to use, or to make one’s own modifications or additions. Open source software products are offered as free downloads to anyone who wants them. Perhaps you suspect that stuff given away for free cannot be of high quality. Let me assure you that some of this free software conforms to the highest professional standards.

The particular software chosen for this book, R, is a programming language and collection of statistical, mathematical, and graphing programs used by literally millions of people around the world, including many leading professionals in science, business, and media. You have likely seen graphics produced by R on websites, in major newspapers, and in other publications. You will be able to produce this kind of professional data visualization, too, because R works on computers running Windows, Macintosh, or Linux operating systems. This covers just about all the desktop and laptop computers out there today!
Appendix H: R Functions Introduced in This Book

hclust()      Perform a hierarchical cluster analysis
lm()          Compute a linear model (e.g., a regression)
max()         Compute the maximum value of a vector
mean()        Compute the mean of a vector
median()      Compute the median of a vector
min()         Compute the minimum value of a vector
quantile()    Find quantiles of a vector
scale()       Center and/or scale columns of a matrix
sd()          Compute the standard deviation of a vector
summary()     Compute several summary statistics of a vector
table()       Compute one-way or two-way frequencies
var()         Compute the variance of a vector

User-Defined Functions and Scripts

function() {}    Create a user-defined function
source()         Execute a script

Workspace and Directories

ls()       Determine what objects are in the current workspace
getwd()    Find the current working directory
setwd()    Change the working directory

Index

Symbols
# (octothorpe), in comments
% (percent sign)
    %% remainder operator
    %/% (divide and round down) operator
() (parentheses)
    grouping with
    in functions
    troubleshooting in R code
* (multiplication) operator
** (exponent) operator
+ (addition) operator
, (comma) in R code
- (subtraction) operator
/ (division) operator
3D (see three-dimensional plots)
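
As a rough illustration, a few of the functions and operators listed above can be tried at the R console along these lines. This is only a sketch: the vector x holds made-up numbers and spread() is a hypothetical helper written for this example; neither comes from the book.

```r
# Made-up example data (an assumption for illustration, not from the book)
x <- c(2, 4, 4, 5, 7, 9)

# Summary functions from the Appendix H list
max(x)        # largest value
min(x)        # smallest value
mean(x)       # arithmetic mean
median(x)     # middle value
sd(x)         # standard deviation
var(x)        # variance
quantile(x)   # quantiles (0%, 25%, 50%, 75%, 100% by default)
summary(x)    # several summary statistics at once
table(x)      # one-way frequency table

# A user-defined function created with function() {}
# (spread is a hypothetical name chosen for this sketch)
spread <- function(v) {
  max(v) - min(v)
}
spread(x)     # 7

# Arithmetic operators from the Symbols section of the index
17 %% 5       # remainder: 2
17 %/% 5      # divide and round down: 3
2 ** 3        # exponent: 8

# Workspace and directory helpers
ls()          # objects in the current workspace
getwd()       # current working directory
```

Typing each line at the prompt one at a time, rather than running everything at once, makes it easier to see which result belongs to which command.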