Data Mashups in R

Jeremy Leipzig and Xiao-Yi Li

Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo

Copyright © 2011 Jeremy Leipzig and Xiao-Yi Li. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.

Editor: Mike Loukides
Production Editor: Kristen Borg
Proofreader: Kristen Borg
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Robert Romano

Printing History:
    March 2011: First Edition

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Data Mashups in R, the image of a black-billed Australian bustard, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-30353-2
[LSI] 1299253461

Table of Contents

Introduction

Mapping Foreclosures
    Messy Address Parsing
    Exploring “streets”
    Obtaining Latitude and Longitude Using Yahoo
    Shaking the XML Tree
    The Many Ways to Philly (Latitude)
    Using Data Structures
    Using Helper Methods
    Using Internal Class Methods
    Exceptional Circumstances
    The Unmappable Fake Street
    No Connection
    Taking Shape
    Finding a Usable Map
    PBSmapping
    Developing the Plot
    Preparing to Add Points to Our Map
    Exploring R Data Structures: geoTable
    Making Events of Our Foreclosures
    Turning Up the Heat
    Factors When You Need Them
    Filling with Color Gradients

Statistics of Foreclosure
    Importing Census Data
    Descriptive Statistics
    Descriptive Plots
    Correlation
    Final Thoughts

Appendix: Getting Started

Introduction

Programmers may spend a good part of their careers scripting code to conform to commercial statistics packages, visualization tools, and domain-specific third-party software. The same tasks can force end users to spend countless hours in copy-paste purgatory, each minor change necessitating another grueling round of formatting tabs and screenshots. Luckily, R scripting offers some reprieve. Because this open source project garners the support of a large community of package developers, the R statistical programming environment provides an amazing level of extensibility. Data from a multitude of sources can be imported into R and processed using R packages to aid statistical analysis and visualization. R scripts can also be configured to produce high-quality reports in an automated fashion, saving time, energy, and frustration.

This book will demonstrate how real-world data is imported, managed, visualized, and analyzed within R. Spatial mashups provide an excellent way to explore the capabilities of R, encompassing R packages, R syntax, and data structures. Instead of canned sample data, we will be plotting and analyzing actual current home foreclosure auctions. Through this exercise, we hope to provide a general idea of how the R environment works with R packages as well as its own capabilities in statistical analysis. We will be accessing spatial data in
several formats (HTML, XML, shapefiles, and text), both locally and over the web, to produce a map of home foreclosures and perform statistical analysis on these events.

After pasting the above geocodeAddresses function into your R console, enter the following (make sure you still have a streets vector from the parsing chapter):

    > geoTable

    > names(geoTable)
    [1] "address" "Y"       "X"       "EID"
    > nrow(geoTable)
    [1] 1264

The first row:

    > geoTable[1,]
    address             Y          X           EID
    6321 Farnsworth St  40.032400  -75.067243

X and Y from the first five rows:

    > geoTable[1:5,c("X","Y")]
    X           Y
    -75.067243  40.032400
    -75.159509  40.051511
    -75.183899  39.937076
    -75.188141  39.933655
    -75.177794  39.966036

The cell in the 4th column, 4th row:

    > geoTable[4,4]
    [1]

The second column, also known as “Y”:

    > geoTable[,2]
    #or#
    > geoTable$Y
    [1] 40.032400 40.051511 39.937076 39.933655 39.966036 39.948570 40.003219 40.011250
    [9] 39.975206 39.999268 39.997490 39.993409 39.978768 39.991603 39.987332 39.992144

Making Events of Our Foreclosures

Our geoTable is similar in structure to an EventData object, but we need to use the as.EventData function to complete the conversion.

    > addressEvents
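
The indexing idioms shown for geoTable (a single row, columns picked by name, a single cell, a whole column) can be rehearsed on a small stand-in table without geocoding anything. What follows is only a sketch in base R, not the book's code: the name toy, the placeholder addresses, and the EID values 1 to 4 are assumptions made for illustration, and the coordinates are simply the first four pairs from the output above. The final check reflects the PBSmapping requirement that EventData carry EID, X, and Y columns.

```r
# Hypothetical stand-in for geoTable: same column names as in the text.
# Addresses are placeholders and EID = 1:4 is an assumption.
toy <- data.frame(
  address = c("addr A", "addr B", "addr C", "addr D"),
  Y = c(40.032400, 40.051511, 39.937076, 39.933655),
  X = c(-75.067243, -75.159509, -75.183899, -75.188141),
  EID = 1:4,
  stringsAsFactors = FALSE
)

first_row <- toy[1, ]           # one row, all columns (still a data frame)
xy        <- toy[1:3, c("X", "Y")]  # rows 1 to 3, columns selected by name
cell      <- toy[4, 4]          # row 4, column 4 (the EID column here)
y_by_pos  <- toy[, 2]           # second column, returned as a plain vector
y_by_name <- toy$Y              # the same column, selected by name

# EventData needs at least these three columns to be present
has_event_cols <- all(c("EID", "X", "Y") %in% names(toy))

print(identical(y_by_pos, y_by_name))
print(has_event_cols)
```

Selecting by name (toy$Y or c("X","Y")) is usually the safer habit, since positional indexes such as toy[,2] silently point at a different column if the column order ever changes.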