www.it-ebooks.info
R IN A NUTSHELL
Second Edition
Joseph Adler
Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo
R in a Nutshell, Second Edition
by Joseph Adler
Copyright © 2012 Joseph Adler. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editors: Mike Loukides and Meghan Blanchette
Production Editor: Holly Bauer
Proofreader: Julie Van Keuren
Indexer: Fred Brown
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrators: Robert Romano and Rebecca Demarest
September 2009: First Edition
October 2012: Second Edition
Revision History for the Second Edition:
2012-09-25 First release
See http://oreilly.com/catalog/errata.csp?isbn=9781449312084 for release details.
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. R in a Nutshell, the image of a harpy eagle, and related trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
ISBN: 978-1-449-31208-4
[LSI]
1348585490
Table of Contents
Preface xiii Part I R Basics Getting and Installing R R Versions Getting and
Installing Interactive R Binaries Windows Mac OS X Linux and Unix Systems 3 5 The R User Interface The R Graphical User Interface Windows Mac OS X Linux and Unix The R Console Command-Line Editing Batch Mode Using R Inside Microsoft Excel RStudio Other Ways to Run R 8 11 13 13 14 15 17 A Short R Tutorial 19 Basic Operations in R Functions Variables 19 21 22 iii www.it-ebooks.info Introduction to Data Structures Objects and Classes Models and Formulas Charts and Graphics Getting Help 24 27 28 30 35 R Packages 37 An Overview of Packages Listing Packages in Local Libraries Loading Packages Loading Packages on Windows and Linux Loading Packages on Mac OS X Exploring Package Repositories Exploring R Package Repositories on the Web Finding and Installing Packages Inside R Installing Packages From Other Repositories Custom Packages Creating a Package Directory Building the Package 37 38 40 40 40 41 42 42 45 45 45 47 Part II The R Language An Overview of the R Language 51 Expressions Objects Symbols Functions Objects Are Copied in Assignment Statements Everything in R Is an Object Special Values NA Inf and -Inf NaN NULL Coercion The R Interpreter Seeing How R Works 51 52 52 52 54 55 55 55 56 56 56 56 57 59 R Syntax 63 Constants Numeric Vectors Character Vectors Symbols Operators Order of Operations 63 63 64 65 66 67 iv | Table of Contents www.it-ebooks.info Assignments Expressions Separating Expressions Parentheses Curly Braces Control Structures Conditional Statements Loops Accessing Data Structures Data Structure Operators Indexing by Integer Vector Indexing by Logical Vector Indexing by Name R Code Style Standards 69 69 69 70 70 71 71 72 75 75 76 78 79 80 R Objects 83 Primitive Object Types Vectors Lists Other Objects Matrices Arrays Factors Data Frames Formulas Time Series Shingles Dates and Times Connections Attributes Class 83 86 87 88 88 89 89 91 92 94 95 95 96 96 99 Symbols and Environments 101 Symbols Working with Environments The Global Environment Environments 
and Functions Working with the Call Stack Evaluating Functions in Different Environments Adding Objects to an Environment Exceptions Signaling Errors Catching Errors 101 102 103 104 104 105 107 108 108 109 Functions 111 The Function Keyword 111 Table of Contents | v www.it-ebooks.info Arguments Return Values Functions as Arguments Anonymous Functions Properties of Functions Argument Order and Named Arguments Side Effects Changes to Other Environments Input/Output Graphics 111 113 113 114 115 117 118 118 119 119 10 Object-Oriented Programming 121 Overview of Object-Oriented Programming in R Key Ideas Implementation Example Object-Oriented Programming in R: S4 Classes Defining Classes New Objects Accessing Slots Working with Objects Creating Coercion Methods Methods Managing Methods Basic Classes More Help Old-School OOP in R: S3 S3 Classes S3 Methods Using S3 Classes in S4 Classes Finding Hidden S3 Methods 122 122 123 129 129 130 130 131 131 132 133 134 135 135 135 136 137 137 Part III Working with Data 11 Saving, Loading, and Editing Data 141 Entering Data Within R Entering Data Using R Commands Using the Edit GUI Saving and Loading R Objects Saving Objects with save Importing Data from External Files Text Files Other Software Exporting Data Importing Data From Databases Export Then Import vi | Table of Contents www.it-ebooks.info 141 141 142 145 145 146 146 154 155 156 156 Database Connection Packages RODBC DBI TSDBI Getting Data from Hadoop 156 157 167 172 172 12 Preparing Data 173 Combining Data Sets Pasting Together Data Structures Merging Data by Common Fields Transformations Reassigning Variables The Transform Function Applying a Function to Each Element of an Object Binning Data Shingles Cut Combining Objects with a Grouping Variable Subsets Bracket Notation subset Function Random Sampling Summarizing Functions tapply, aggregate Aggregating Tables with rowsum Counting Values Reshaping Data Data Cleaning Finding and Removing Duplicates Sorting 173 174 177 179 
179 179 180 185 185 186 187 187 188 188 189 190 190 193 194 196 205 205 206 Part IV Data Visualization 13 Graphics 213 An Overview of R Graphics Scatter Plots Plotting Time Series Bar Charts Pie Charts Plotting Categorical Data Three-Dimensional Data Plotting Distributions Box Plots Graphics Devices Customizing Charts 213 214 220 222 226 227 232 239 242 246 247 Table of Contents | vii www.it-ebooks.info Common Arguments to Chart Functions Graphical Parameters Basic Graphics Functions 247 247 257 14 Lattice Graphics 267 History An Overview of the Lattice Package How Lattice Works A Simple Example Using Lattice Functions Custom Panel Functions High-Level Lattice Plotting Functions Univariate Trellis Plots Bivariate Trellis Plots Trivariate Plots Other Plots Customizing Lattice Graphics Common Arguments to Lattice Functions trellis.skeleton Controlling How Axes Are Drawn Parameters plot.trellis strip.default simpleKey Low-Level Functions Low-Level Graphics Functions Panel Functions 267 268 268 268 270 272 272 273 297 305 310 312 312 313 314 315 319 320 321 322 322 323 15 ggplot2 325 A Short Introduction The Grammar of Graphics A More Complex Example: Medicare Data Quick Plot Creating Graphics with ggplot2 Learning More 325 328 333 342 343 347 Part V Statistics with R 16 Analyzing Data 351 Summary Statistics Correlation and Covariance Principal Components Analysis Factor Analysis Bootstrap Resampling viii | Table of Contents www.it-ebooks.info 351 354 357 360 361 logistic probability distribution function, 368 regression, 467–472 loglm function, 476 lookup performance, 509–515 environment objects in place of vectors, 515 R objects, 510–515 loops, 72–75 lqs, 415 M Mac OS X data editor, 142 finding and installing packages, 43 high-performance R binaries, 524 installing R, loading packages, 40 R GUI, SQLite ODBC example, 159 machine learning algorithms clustering, 490–494 algorithms, 491 distance measures, 490 market basket analysis, 485–490 machine learning algorithms 
for classification, 477–484 k nearest neighbors, 477 neural networks, 482 random forests, 483 SVMs, 483 tree models, 478–482 bagging, 480 boosting, 481 machine learning algorithms for regression, 437–465 generalized additive models, 462 MARS, 450–455 neural networks, 455–459 projection pursuit regression, 459–461 regression tree models, 439–450 bagging, 446 boosting, 447 patient rule induction method, 446 random forests, 448 recursive partitioning trees, 439–446 SVMs, 464 main, trellis.skeleton argument, 313 make.groups function, 187, 402 manual compilation, 518 Map/Reduce, Hadoop, 550 mappings, 330 mapply function, 183 mapreduce function, 561, 566 maptree library, 479 margins, 248 market basket analysis, 485–490 MARS, 450–455 MASS package, 621–630 about, 39 data sets, 624–630 functions, 621–624 matplot function, 218 matrices about, 88 reshaping, 197–202 scatter plot matrices, 304 transposing, 197 max function, 351 MDA (mixture discriminant analysis), 475 mean function, 351 means comparing, 372–376 comparing means across more than two groups (ANOVA), 378–381 comparing two means, 385 Medicare data example, 333–342 medoids, 493 melt about, 202 using, 203 melt.data.frame function, 203 mem.limits function, 517 memory usage cleaning up, 516 measuring R performance, 505–507 preallocating, 516 memory.profile function, 506 merging data by common data fields, 177 message function, 109 metadata, 533 methods, 131–134 about, 27 classes, 132 coercion, 131 managing, 133 old-school OOP (S3), 136, 137 package, 39, 630–637 mgcv package, 39, 637 MIAME (Minimum Information About a Microarray Experiment), 544 Microsoft Excel about, 14 charts, 30 text files, 149 Microsoft Windows data editor, 142 devices, 246 finding and installing packages, 42 high-performance R binaries, 521 installing R, loading packages, 40 R GUI, SQLite ODBC example, 161 Microsoft, ODBC drivers, 158 min function, 351 Minimum Information About a Microarray Experiment (MIAME), 544
Minitab file format, 155 minlength, lattice axes argument, 315 mixture discriminant analysis (MDA), 475 models, 28–30 (see also classification models; linear classification models; linear models; non-linear models; regression model; regression tree models) log-linear models, 476 survival models, 428–433 time series models, 496–500 mosaicplot function, 229 mtext function, 264 multicollinearity, 417 multinom function, 470 multinomial distribution probability distribution function, 368 multiple inheritance, 123 MySQL ODBC drivers, 158 text files, 150 N NA value, 55 named arguments, 117 names attribute, 97 indexing by element name, 79 lists, 87 namespaces, trellis, 315 NaN value, 56 ncvTest function, 413 negative binomial probability distribution function, 368 neural networks, 455–459, 482 new.env function, 103 new.packages command, 44 next command, 72 NHTSA (National Highway Traffic Safety Administration), 280 nlme package, 39, 637 nls function, 427 nnet function, 457 package, 39, 637 non-parametric tests, 385–388 comparing more than two means, 387 two means, 385 variances, 387 difference in scale parameters, 388 tabular data, 396 nonlinear models, 420–428 glmnet package, 424 GLMs, 421–424 nonlinear least squares, 427 normal distribution, 363 normal distribution-based tests, 372–385 comparing means, 372–376 means across more than two groups (ANOVA), 378–381 paired data, 376 two populations, 377 correlation tests, 384 pairwise t-tests between multiple groups, 381 testing Index | 687 www.it-ebooks.info for normality, 382 if a data vector came from an arbitrary distribution, 382 if two data vectors came from the same distribution, 383 normal probability distribution function, 368 normality, testing for, 382 NULL object type, 85 value, 56 numeric vectors, 63 O object.size function, 506 objects, 83–100 (see also classes) about, 27, 52, 55 adding to environments, 107 AnnotatedDataFrame, 543 applying a function to each element, 180–184 arrays, 89 AssayData, 543 assignment 
statements, 54 attributes, 96–100 combining with grouping variable, 187 connections, 96 data frames, 91 dates and times, 95 environment objects in place of vectors, 515 factors, 89 formulas, 92 function, 102 geometric objects, 330 lists, 87 lookup performance, 510–515 matrices, 88 primitive object types, 83–86 saving and loading, 145 shingles, 95 time series, 94 vectors, 86 Octave file format, 155 ODBC (see RODBC) odbcClose function, 167 odbcCloseAll function, 167 odbcColumns function, 167 odbcConnect function, 162 odbcFetchResults function, 167 odbcGetErrMsg function, 167 odbcGetInfo function, 163 odbcPrimaryKeys function, 167 odbcQuery function, 167 odbcTables function, 167 old.packages command, 44 OOP (object-oriented programming), 121–137 about, 122 classes, 129–135 basic classes, 134 coercion, 131 defining, 129 methods, 131–134 objects, 130, 131 slots, 130 example, 123–128 old-school OOP (S3), 135–137 classes, 135, 137 methods, 136, 137 OpenLink Software, ODBC drivers, 159 operations, 19 (see also functions) operators, 66–69 assignments, 69 data structures, 75 examples of, 21 order of operations, 67 optimization, 503–524 high-performance R binaries, 520–524 Linux and Unix, 523 Mac OS X, 524 Revolution R, 520 Windows, 521 measuring R performance, 503–507 memory usage, 505–507 profiling, 504, 506 timing, 503 R byte code compiler, 518–520 inspecting byte code, 519 just-in-time compilation, 520 manual compilation, 518 R code, 507–517 688 | Index www.it-ebooks.info cleaning up memory, 516 databases to query large data sets, 516 functions for big data sets, 517 lookup performance, 509–515 preallocating memory, 516 vector operations, 507 Oracle, ODBC drivers, 158 order arguments, 117 function, 207 of operations, 67 ordinary least squares (OLS) regression, 412 OS X (see Mac OS X) outer, lattice function argument, 312 P pacf function, 495 package management systems, installing R, package.skeleton function, 45 packages, 37–47 about, 37 base package, 573–596 data sets, 
596 functions, 573–596 Bioconductor, 537 boot package, 596–605 data sets, 598–605 functions, 596–598 class package, 605 cluster package, 606 data sets, 607 functions, 606 codetools package, 607 custom packages, 45–47 building, 47 creating, 45 DBI, 167–171 foreign package, 607 graphics package, 612–615 grDevices package, 608–612 data sets, 612 functions, 608–612 grid package, 615 importing databases, 156 KernSmooth package, 615 lattice package, 616–621 data sets, 621 functions, 616–621 listings of, 38 loading, 40 MASS package, 621–630 data sets, 624–630 functions, 621–624 methods package, 630–637 mgcv package, 637 nlme package, 637 nnet package, 637 repositories, 41–45 finding and installing packages, 42– 45 Web, 42 rpart package, 638 data sets, 639 functions, 638 spatial package, 639 splines package, 640 stats package, 641–658 data set, 658 functions, 641–658 stats4 package, 658 survival package, 658–662 data sets, 660 functions, 659 tcltk package, 662 tools package, 662 data sets, 664 functions, 662 utils package, 664–671 page, trellis.skeleton argument, 313 pairlists object type, 84 pairs function, 271 pairwise t-tests between multiple groups, 381 pam function, 493 panel functions about, 268 custom, 272 panel, lattice function argument, 312 par function, 248, 253, 315 par.main.text trellis parameter group, 318 par.settings, trellis.skeleton argument, 314 Index | 689 www.it-ebooks.info par.strip.text, trellis.skeleton argument, 313 par.sub.text trellis parameter group, 318 par.xlab.text trellis parameter group, 318 par.ylab.text trellis parameter group, 318 par.zlab.text trellis parameter group, 318 parallel computing, 549 package, 39 parameters difference in scale parameters, 388 graphical, 253–257 lattice graphics, 315–319 parent.env function, 103 parent.frame function, 105 parentheses (), expressions, 70 partial least square regression, 420 paste function, 174 patient rule induction method (PRIM), 446 pcr function, 420 Pearson correlation statistic, 354 
performance (see optimization) Perl reprocessing data files, 151 translating files, 274 persp function, 234, 271 pie charts, 226 piecewise linear functions, 450 plot function, 214, 220, 409 plot.args, trellis.skeleton argument, 314 plot.density function, 271 plot.earth function, 454 plot.glmnet function, 427 plot.polygon trellis parameter group, 317 plot.symbol trellis parameter group, 317 plot.trellis, 319 plotcp function, 444 plotmo function, 454 plotting functions, 272–311 bivariate trellis plots, 297–305 box plots, 300 quantile-quantile plots, 305 scatter plot matrices, 304 scatter plots, 297 rfs function, 310 trivariate trellis plots, 305–310 cloud plots, 308 contour plots, 307 level plots, 305 wire-frame plots, 310 univariate trellis plots, 273–296 bar charts, 276–279 density plots, 285 dot plots, 280 histograms, 282 quantile-quantile plots, 288–296 strip plots, 286 plsr function, 420 plus sign (+) formulas, 93 incomplete line, 12 operator, 28 plyr library, 183 png function, 247 pnorm function, 363 points function, 258 in graphics, 252 Poisson probability distribution function, 368 polr function, 472 poly function, 404 polygon function, 261 polymorphism, 122 polynomial surfaces, fitting, 435 position functions, 347 positional adjustments, 331 POSIXct, 95 POSIXlt, 95 PostgreSQL, ODBC drivers, 158 pound sign (#), comments, 21 power tests, 397–400 ANOVA test design, 400 experimental design example, 397 proportion test design, 398 t-test design, 398 power.anova.test function, 400 power.prop.test function, 398 power.t.test function, 398 ppr function, 459 prcomp function, 357 preallocating memory, 516 predict function, 406, 498 predictive models, 28 prepanel, lattice function argument, 312 690 | Index www.it-ebooks.info PRIM (patient rule induction method), 446 primitive functions, 59 object types, 83–86 principal components analysis, 357–360 regression, 420 princomp function, 358 printcp function, 479 probability distributions, 363–369 common distribution-type 
arguments, 366 distribution function families, 366 normal distribution, 363 profiling measuring R performance, 504 memory usage, 506 projection pursuit regression, 459–461 promise object type, 85 prompt, 12 prop.test function, 388 properties functions as arguments, 115 text, 251 proportion tests about, 388 design, 398 Q QDA (quadratic discriminant analysis), 473 qnorm function, 364 qplot function, 342 qq function, 271 qqmath function, 271, 290 qqnorm function, 241, 271 qqplot function, 242, 271 quantile function, 352 quantile-quantile plots about, 241 bivariate, 305 univariate, 288–296 quasi function, 423 querying, DBI, 170 R R console about, 11 finding and installing packages, 43 R data editor versus spreadsheets, 144 R GUIs, 7–11 Linux and Unix, Mac OS X, Windows, R language, 51–61 built-in types, 134 coercion, 56 example, 59 expressions, 51 functions, 52 interpreter, 57 objects about, 52, 55, 83 assignment statements, 54 special values, 55 symbols, 52 R Productivity Environment, 11 R Studio, 11 R-Forge, 42 random forests classification, 483 function, 449 regression, 448 sampling, 189 range function, 352 rApache, 17 raw object type, 84 rbind function, 174 rbsurv function, 534 Rcmdr, 10 read.csv function, 148 read.delim function, 148 read.fwf function, 150 read.table function, 146 ReadAffy function, 526 readLines function, 152 reassigning variables, 179 recursive partitioning trees, 439–446 regions trellis parameter group, 317 regression model, 401–465 linear model example, 401–410 Index | 691 www.it-ebooks.info analyzing the fit, 407 fitting a model, 403 helper functions for specifying the model, 404 predicting values, 406 refining the model, 410 viewing, 404 lm function, 410–415 assumptions of least squares regression, 412 lm, lqs and rim, 415 resistant regression, 414 robust regression, 414 logistic regression, 467–472 machine learning algorithms for regression, 437–465 generalized additive models, 462 MARS, 450–455 neural networks, 455–459 projection pursuit 
regression, 459– 461 regression tree models, 439–450 SVMs, 464 nonlinear models, 420–428 glmnet package, 424 GLMs, 421–424 nonlinear least squares, 427 smoothing, 433–437 fitting polynomial surfaces, 435 kernel smoothing, 436 splines, 433 subset selection and shrinkage methods, 416–420 elasticnet, 419 lasso and least angle regression, 418 principal components regression and partial least square regression, 420 ridge regression, 417 stepwise variable selection, 416 survival models, 428–433 regression tree models, 439–450 bagging, 446 boosting, 447 patient rule induction method, 446 random forests, 448 recursive partitioning trees, 439–446 relation, lattice axes argument, 314 relationships, variables, 28 remove function, 102 remove.packages command, 44 removeGeneric function, 133 removeMethods function, 133 repositories, 41–45 finding and installing packages, 42–45 Web, 42 reshaping data, 196–205 residuals about, 410 function, 406 resistant regression, 414 retracemen function, 507 return values, functions, 113 Revolution R, 520 RExcel, 14 rfs function, 310 RGB (red/green/blue) components, 252 RHadoop, 554–568 example application, 559–566 installing locally, 555–559 rmr function, 566 ridge functions, 459 ridge regression, 417 rim, 415 Rkward, 10 rlm function, 414 rm function, 103, 517 rmr function, 566 rnorm function, 365 robust regression, 414 RODBC, 157–167 about, 156 installation, 157–161 library, 166 using, 162–167 rot, lattice axes argument, 315 rotation, in graphics, 252 row.names attribute, 97 rowsum function, 193 rpart function, 440 model, 480 package about, 39, 638 data sets, 639 692 | Index www.it-ebooks.info functions, 638 Rprof function, 504 Rprofmem function, 506 RScript command, 14 Rserve, 17 Rtools, 521 S S3 binary file format, 155 S4 object type, 84 sampling function, 438 random sampling, 189 San Francisco real estate prices data set, 294 sapply function, 113, 182 SAS Permanent Dataset file format, 155 SAS XPORT File file format, 155 save function, 
107, 145 save.image function, 145 saving objects, 145 scale functions, 346 parameters, difference in, 388 scales about, 331 lattice function argument, 312 scan function, 153 scatter plots bivariate trellis plots, 297 lattice, 269 matrices, 304 using, 214–220 scripts, executing, 14 search function, 102 searchpaths function, 102 segments function, 261 segue package, 571 selectMethod function, 133 seq function, 87 servers Hadoop, 553 Rserve, 17 setAs function, 131 setClass function, 129 setClassUnion function, 130 setCompilerOptions function, 519 setGeneric function, 125, 132, 133 setIs function, 130 setMethod function, 133 setOldClass function, 137 setRepositories command, 44 setValidity function, 124, 130 shade.colors trellis parameter group, 317 Shapiro-Wilk test, 382 shingles, 95, 185 show.settings function, 316 showMethods function, 134 shrinkage methods (see subset selection and shrinkage methods) side effects, 118 sigmoid function, 457 signaling errors, 108 simpleKey function, 321 skip, trellis.skeleton argument, 313 slots, classes, 130 smooth.spline function, 433 smoothing, 433–437 fitting polynomial surfaces, 435 kernel smoothing, 436 splines, 433 smoothScatter function, 219, 355 sorting data, 206–208 source code, building R from, spatial package, 39, 639 Spearman correlation statistic, 354 special built-in types, 134 object type, 85 objects, 83 values, 55 splinefun function, 434 splines about, 433 function, 433 package, 39, 640 splom function, 271, 304 spreadsheets versus the R data editor, 144 SPSS file format, 155 sqlColumns function, 163 sqlFetch function, 164 sqlGetResults function, 166 SQLite ODBC on Mac OS X example, 159 ODBC on Windows example, 161 Index | 693 www.it-ebooks.info sqlPrimaryKeys function, 164 sqlQuery function, 166 sqlTables function, 163 sqrt function, 64 square brackets [], bracket notation for subsets, 188 standards, code style, 80 Stata file format, 155 statistical tests, 371–396 continuous data, 371–388 non-parametric tests, 
385–388 normal distribution-based tests, 372–385 discrete data, 388–396 binomial tests, 389 non-parametric tabular data tests, 396 proportion tests, 388 tabular data tests, 390–395 statistical transformations, 331 stats package, 641–658 about, 39 data set, 658 functions, 641–658 stats4 package, 39, 658 stem function, 353 step function, 417 stepAIC function, 417 stepwise variable selection, 416 stop function, 108 str function, 353 strata function, 474 streaming, Hadoop, 568–570 strings, 21 strip, lattice function argument, 312 strip.background trellis parameter group, 317 strip.border trellis parameter group, 317 strip.default, 320 strip.left, trellis.skeleton argument, 313 strip.shingle trellis parameter group, 317 stripchart function, 271 stripplot function, 271, 286 studentized range distribution probability distribution function, 369 student’s t-distribution probability distribution function, 368 sub, trellis.skeleton argument, 313 subscripts, lattice function argument, 313 subset selection and shrinkage methods, 416–420 elasticnet, 419 lasso and least angle regression, 418 principal components regression and partial least square regression, 420 ridge regression, 417 stepwise variable selection, 416 subset, lattice function argument, 313 subsets, 187–190 bracket notation, 188 function, 353 random sampling, 189 subset function, 188 summary function, 29, 352, 405, 489 statistics, 351–353 summaryRprof function, 504 superpose.line trellis parameter group, 317 superpose.polygon trellis parameter group, 317 superpose.symbol trellis parameter group, 317 surrogate variables, 440 survexp function, 431 survfit function, 428 survival models, 428–433 package, 658–662 about, 40 data sets, 660 functions, 659 survreg function, 430 SVMs (support vector machines) machine learning algorithms for classification, 483 machine learning algorithms for regression, 464 symbols about, 52, 65, 101 object type, 85 syntactic sugar, 60 syntax, 63–81 code style standards, 80 constants, 63–66 
694 | Index www.it-ebooks.info character vectors, 64 numeric vectors, 63 symbols, 65 control structures, 71–75 conditional statements, 71 loops, 72–75 data structures, 75–79 indexing by integer vector, 76 indexing by logical vector, 78 indexing by name, 79 operators, 75 expressions, 69 operators, 66–69 assignments, 69 order of operations, 67 sys.call function, 105 sys.calls function, 105 sys.frame function, 105 sys.frames function, 105 sys.function function, 105 sys.nframe function, 105 sys.on.exit function, 105 sys.parent function, 105 sys.parents function, 105 sys.status function, 105 Systat file format, 155 T t function, 197 t-tests function, 372 pairwise t-tests between multiple groups, 381 power tests design, 398 tables aggregating, 193 data tests, 390–396 function, 194 tabulate function, 194 tapply function, 190–192, 306 tck, lattice axes argument, 314 tcltk package, 40, 662 tests, 371–396 (see also power tests; statistical tests) binomial tests, 389 non-parametric tests, 385–388 comparing more than two means, 387 comparing two means, 385 comparing variances, 387 difference in scale parameters, 388 tabular data, 396 normal distribution-based tests, 372– 385 comparing means, 372–376 comparing means across more than two groups (ANOVA), 378–381 comparing paired data, 376 comparing two populations, 377 correlation tests, 384 pairwise t-tests between multiple groups, 381 testing for normality, 382 testing if a data vector came from an arbitrary distribution, 382 testing if two data vectors came from the same distribution, 383 power tests, 397–400 ANOVA test design, 400 experimental design example, 397 proportion test design, 398 t-test design, 398 proportion tests, 388 tabular data tests, 390–395 text function, 217, 259 importing, 146–154 delimited files, 146–150 fixed-width files, 150 other functions to parse data, 152 properties, 251 three-dimensional data, 232 tick.number, lattice axes argument, 314 tilde (~), formulas, 93 time series, 495–500 about, 94 
autocorrelation functions, 495 models, 496–500 OOP, 122 plotting, 220 times and dates, 95 timing, measuring R performance, 503 title function, 263 tools package about, 40, 662 data sets, 664 functions, 662 tracemem function, 507 trans3d function, 265 transformations, 179–184 applying a function to each element of an object, 180–184 reassigning variables, 179 transform function, 179 transposing matrices and data frames, 197 tree models, 439 (see also regression tree models) classification, 478–482 bagging, 480 boosting, 481 regression, 439–450 bagging, 446 boosting, 447 patient rule induction method, 446 random forests, 448 recursive partitioning trees, 439–446 Trellis graphics, 267 trellis.par.get function, 315, 316 trellis.par.set function, 315 trellis.skeleton, 313 trivariate trellis plots, 305–310 cloud plots, 308 contour plots, 307 level plots, 305 wire-frame plots, 310 try function, 109 tryCatch function, 110 ts function, 94 ts.plot function, 498 TSDBI, 172 tsp attribute, 97 type, defined, 83 typeface, 252 typeof function, 60, 99 U unary operators, 67 uniform distribution probability distribution function, 369 unique function, 206 univariate trellis plots, 273–296 bar charts, 276–279 density plots, 285 dot plots, 280 histograms, 282 quantile-quantile plots, 288–296 strip plots, 286 Unix high-performance R binaries, 523 installing R, R GUI, unstack function, 198 untracemem function, 507 update.packages command, 44 UseMethod function, 136 user interfaces, 7–17 batch mode, 13 edit GUI, 142 Microsoft Excel, 14 other ways to run R, 17 R console, 11 R GUIs, 7–11 Linux and Unix, Mac OS X, Windows, RStudio, 15 utils package, 40, 664–671 V validObject function, 124 values counting, 194–196 predicting with linear regression models, 406 return values, 113 special values, 55 var.test function, 377 variables about, 22 combining objects with grouping variable, 187 reassigning, 179 relationships, 28 stepwise variable selection, 416
variance-covariance matrix, 408 vcov function, 408 vectors 696 | Index www.it-ebooks.info about, 19, 83 applying functions to, 182 built-in types, 134 character vectors, 64 correlation, 384 elements, 22 environment objects in place of, 515 indexing by integer vector, 76 by logical vector, 78 indices, 22 length attribute, 87 logical vectors, 23 modifying, 55 numeric vectors, 63 objects, 86 operations for optimizing R code, 507 testing if a data vector came from an arbitrary distribution, 382 if two data vectors came from the same distribution, 383 vector operations versus conditional statements, 71 versions, vertical bar (|), formulas, 93 vignettes about, 36 Bioconductor, 546 visualization (see graphics; lattice graphics) W warning function, 109 weakref object type, 86 Web applications, 17 repositories, 42 Weibull probability distribution function, 369 while loops, 73 Wilcoxon rank sum probability distribution function, 369 signed rank probability distribution function, 369 test, 385 Windows data editor, 142 devices, 246 finding and installing packages, 42 high-performance R binaries, 521 installing R, loading packages, 40 R GUI, SQLite ODBC example, 161 wireframe function, 271, 272, 310 with function, 107 within function, 107 write.table function, 155 X X Windows data editor, 144 x, lattice function argument, 312 xlab, lattice function argument, 312 xlab.default, trellis.skeleton argument, 313 xlim, lattice function argument, 313 xplot function, 271 xscale.components, trellis.skeleton argument, 314 xtabs function, 196 xyplot function, 271, 297, 402 Y ylab, lattice function argument, 312 ylab.default, trellis.skeleton argument, 313 ylim, lattice function argument, 313 yscale.components, trellis.skeleton argument, 314 Index | 697 www.it-ebooks.info www.it-ebooks.info About the Author Joseph Adler has many years of experience in data mining and data analysis at companies including DoubleClick, American Express, and VeriSign He graduated from MIT with an Sc.B and M.Eng 
in Computer Science and Electrical Engineering. He is the inventor of several patents for computer security and cryptography, and the author of Baseball Hacks. Currently, he is a senior data scientist at LinkedIn.
Colophon
The animal on the cover of R in a Nutshell is a harpy eagle (Harpia harpyja). Black feathers line the top half of the bird, while white feathers mostly make up the balance, although the underside of its wings may be striped black and white. Unlike other species of birds, male and female harpy eagles appear virtually identical.
These eagles—the most powerful, carnivorous raptors in the Americas—typically inhabit tropical rain forests. They prey upon animals that live in trees: sloths, monkeys, opossums, and even other birds, such as macaws.
The eagle is named after the harpies of ancient Greek mythology, female wind spirits who were said to be human from the chest to their ankles and eagle from the neck up. Mythological harpies tormented people as they carried them to the underworld with their clawed feet; perhaps similarly, harpy eagles’ talons violently pierce and subdue their prey before the eagles carry them back to their nests.
Harpy eagles also inspire modern-day life: the eagle is the national bird of Panama and is pictured on the country’s coat of arms. The bird also inspired the design of Fawkes the Phoenix in the Harry Potter film series.
The cover image is from Cassell’s Natural History. The cover font is Adobe ITC Garamond. The text font is Linotype Birka; the heading font is Adobe Myriad Condensed; and the code font is LucasFont’s TheSansMonoCondensed.
... leaves off, describing the R language in detail
• Part III, Working with Data, covers data processing in R: loading data into R, transforming data, and summarizing data
• Part IV, Data Visualization,...