Modern Applied Statistics with S
Fourth edition
by W. N. Venables and B. D. Ripley
Springer (mid 2002)
Final 15 March 2002

Preface

S is a language and environment for data analysis originally developed at Bell Laboratories (of AT&T and now Lucent Technologies). It became the statistician’s calculator for the 1990s, allowing easy access to the computing power and graphical capabilities of modern workstations and personal computers. Various implementations have been available, currently S-PLUS, a commercial system from the Insightful Corporation¹ in Seattle, and R,² an Open Source system written by a team of volunteers. Both can be run on Windows and a range of UNIX / Linux operating systems: R also runs on Macintoshes.

This is the fourth edition of a book which first appeared in 1994, and the S environment has grown rapidly since. This book concentrates on using the current systems to do statistics; there is a companion volume (Venables and Ripley, 2000) which discusses programming in the S language in much greater depth. Some of the more specialized functionality of the S environment is covered in on-line complements, additional sections and chapters which are available on the World Wide Web. The datasets and S functions that we use are supplied with most S environments and are also available on-line.

This is not a text in statistical theory, but does cover modern statistical methodology. Each chapter summarizes the methods discussed, in order to set out the notation and the precise method implemented in S. (It will help if the reader has a basic knowledge of the topic of the chapter, but several chapters have been successfully used for specialized courses in statistical methods.) Our aim is rather to show how we analyse datasets using S. In doing so we aim to show both how S can be used and how the availability of a powerful and graphical system has altered the way we approach data analysis and allows penetrating analyses to be performed routinely.
Once calculation became easy, the statistician’s energies could be devoted to understanding his or her dataset.

The core S language is not very large, but it is quite different from most other statistics systems. We describe the language in some detail in the first three chapters, but these are probably best skimmed at first reading. Once the philosophy of the language is grasped, its consistency and logical design will be appreciated.

The chapters on applying S to statistical problems are largely self-contained, although Chapter 6 describes the language used for linear models that is used in several later chapters. We expect that most readers will want to pick and choose among the later chapters.

This book is intended both for would-be users of S as an introductory guide and for class use. The level of course for which it is suitable differs from country to country, but would generally range from the upper years of an undergraduate course (especially the early chapters) to Masters’ level. (For example, almost all the material is covered in the M.Sc. in Applied Statistics at Oxford.) On-line exercises (and selected answers) are provided, but these should not detract from the best exercise of all, using S to study datasets with which the reader is familiar. Our library provides many datasets, some of which are not used in the text but are there to provide source material for exercises. Nolan and Speed (2000) and Ramsey and Schafer (1997, 2002) are also good sources of exercise material.

¹ http://www.insightful.com
² http://www.r-project.org

The authors may be contacted by electronic mail at MASS@stats.ox.ac.uk and would appreciate being informed of errors and improvements to the contents of this book. Errata and updates are available from our World Wide Web pages (see page 461 for sites).
Acknowledgements:

This book would not be possible without the S environment which has been principally developed by John Chambers, with substantial input from Doug Bates, Rick Becker, Bill Cleveland, Trevor Hastie, Daryl Pregibon and Allan Wilks. The code for survival analysis is the work of Terry Therneau. The S-PLUS and R implementations are the work of much larger teams acknowledged in their manuals.

We are grateful to the many people who have read and commented on draft material and who have helped us test the software, as well as to those whose problems have contributed to our understanding and indirectly to examples and exercises. We cannot name them all, but in particular we would like to thank Doug Bates, Adrian Bowman, Bill Dunlap, Kurt Hornik, Stephen Kaluzny, José Pinheiro, Brett Presnell, Ruth Ripley, Charles Roosen, David Smith, Patty Solomon and Terry Therneau. We thank Insightful Inc. for early access to versions of S-PLUS.

Bill Venables
Brian Ripley
January 2002

Contents

Preface
Typographical Conventions
1 Introduction
  1.1 A Quick Overview of S
  1.2 Using S
  1.3 An Introductory Session
  1.4 What Next?
2 Data Manipulation
  2.1 Objects
  2.2 Connections
  2.3 Data Manipulation
  2.4 Tables and Cross-Classification
3 The S Language
  3.1 Language Layout
  3.2 More on S Objects
  3.3 Arithmetical Expressions
  3.4 Character Vector Operations
  3.5 Formatting and Printing
  3.6 Calling Conventions for Functions
  3.7 Model Formulae
  3.8 Control Structures
  3.9 Array and Matrix Operations
  3.10 Introduction to Classes and Methods
4 Graphics
  4.1 Graphics Devices
  4.2 Basic Plotting Functions
  4.3 Enhancing Plots
  4.4 Fine Control of Graphics
  4.5 Trellis Graphics
5 Univariate Statistics
  5.1 Probability Distributions
  5.2 Generating Random Data
  5.3 Data Summaries
  5.4 Classical Univariate Statistics
  5.5 Robust Summaries
  5.6 Density Estimation
  5.7 Bootstrap and Permutation Methods
6 Linear Statistical Models
  6.1 An Analysis of Covariance Example
  6.2 Model Formulae and Model Matrices
  6.3 Regression Diagnostics
  6.4 Safe Prediction
  6.5 Robust and Resistant Regression
  6.6 Bootstrapping Linear Models
  6.7 Factorial Designs and Designed Experiments
  6.8 An Unbalanced Four-Way Layout
  6.9 Predicting Computer Performance
  6.10 Multiple Comparisons
7 Generalized Linear Models
  7.1 Functions for Generalized Linear Modelling
  7.2 Binomial Data
  7.3 Poisson and Multinomial Models
  7.4 A Negative Binomial Family
  7.5 Over-Dispersion in Binomial and Poisson GLMs
8 Non-Linear and Smooth Regression
  8.1 An Introductory Example
  8.2 Fitting Non-Linear Regression Models
  8.3 Non-Linear Fitted Model Objects and Method Functions
  8.4 Confidence Intervals for Parameters
  8.5 Profiles
  8.6 Constrained Non-Linear Regression
  8.7 One-Dimensional Curve-Fitting
  8.8 Additive Models
  8.9 Projection-Pursuit Regression
  8.10 Neural Networks
  8.11 Conclusions
9 Tree-Based Methods
  9.1 Partitioning Methods
  9.2 Implementation in rpart
  9.3 Implementation in tree
10 Random and Mixed Effects
  10.1 Linear Models
  10.2 Classic Nested Designs
  10.3 Non-Linear Mixed Effects Models
  10.4 Generalized Linear Mixed Models
  10.5 GEE Models
11 Exploratory Multivariate Analysis
  11.1 Visualization Methods
  11.2 Cluster Analysis
  11.3 Factor Analysis
  11.4 Discrete Multivariate Analysis
12 Classification
  12.1 Discriminant Analysis
  12.2 Classification Theory
  12.3 Non-Parametric Rules
  12.4 Neural Networks
  12.5 Support Vector Machines
  12.6 Forensic Glass Example
  12.7 Calibration Plots
13 Survival Analysis
  13.1 Estimators of Survivor Curves
  13.2 Parametric Models
  13.3 Cox Proportional Hazards Model
  13.4 Further Examples
14 Time Series Analysis
  14.1 Second-Order Summaries
  14.2 ARIMA Models
  14.3 Seasonality
  14.4 Nottingham Temperature Data
  14.5 Regression with Autocorrelated Errors
  14.6 Models for Financial Series
15 Spatial Statistics
  15.1 Spatial Interpolation and Smoothing
  15.2 Kriging
  15.3 Point Process Analysis
16 Optimization
  16.1 Univariate Functions
  16.2 Special-Purpose Optimization Functions
  16.3 General Optimization
Appendices
A Implementation-Specific Details
  A.1 Using S-PLUS under Unix / Linux
  A.2 Using S-PLUS under Windows
  A.3 Using R under Unix / Linux
  A.4 Using R under Windows
  A.5 For Emacs Users
B The S-PLUS GUI
C Datasets, Software and Libraries
  C.1 Our Software
  C.2 Using Libraries
References
Index

Typographical Conventions

Throughout this book S language constructs and commands to the operating system are set in a monospaced typewriter font like this. The character ~ may appear as ˜ on your keyboard, screen or printer.

We often use the prompts $ for the operating system (it is the standard prompt for the UNIX Bourne shell) and > for S. However, we do not use prompts for continuation lines, which are indicated by indentation. One reason for this is that the length of line available to use in a book column is less than that of a standard terminal window, so we have had to break lines that were not broken at the terminal.

Paragraphs or comments that apply to only one S environment are signalled by a marginal mark:

• This is specific to S-PLUS (version 6 or later). (marginal mark: S+)
• This is specific to S-PLUS under Windows. (marginal mark: S+Win)
• This is specific to R. (marginal mark: R)

Some of the S output has been edited.
Where complete lines are omitted, these are usually indicated by .... in listings; however most blank lines have been silently removed. Much of the S output was generated with the options settings

    options(width = 65, digits = 5)

in effect, whereas the defaults are around 80 and 7. Not all functions consult these settings, so on occasion we have had to manually reduce the precision to more sensible values.

Chapter 1
Introduction

Statistics is fundamentally concerned with the understanding of structure in data. One of the effects of the information-technology era has been to make it much easier to collect extensive datasets with minimal human intervention. Fortunately, the same technological advances allow the users of statistics access to much more powerful ‘calculators’ to manipulate and display data. This book is about the modern developments in applied statistics that have been made possible by the widespread availability of workstations with high-resolution graphics and ample computational power.

Workstations need software, and the S¹ system developed at Bell Laboratories (Lucent Technologies, formerly AT&T) provides a very flexible and powerful environment in which to implement new statistical ideas. Lucent’s current implementation of S is exclusively licensed to the Insightful Corporation², which distributes an enhanced system called S-PLUS.

An Open Source system called R³ has emerged that provides an independent implementation of the S language. It is similar enough that almost all the examples in this book can be run under R.

An S environment is an integrated suite of software facilities for data analysis and graphical display.
Among other things it offers

• an extensive and coherent collection of tools for statistics and data analysis,
• a language for expressing statistical models and tools for using linear and non-linear statistical models,
• graphical facilities for data analysis and display either at a workstation or as hardcopy,
• an effective object-oriented programming language that can easily be extended by the user community.

The term environment is intended to characterize it as a planned and coherent system built around a language and a collection of low-level facilities, rather than the ‘package’ model of an incremental accretion of very specific, high-level and [...]

¹ The name S arose long ago as a compromise name (Becker, 1994), in the spirit of the programming language C (also from Bell Laboratories).
² http://www.insightful.com
³ http://www.r-project.org

[...] vector-like class is much used in S. Factors are sets of labelled observations with a pre-defined set of labels, not all of which need occur. For example,

    > citizen <- factor(c("uk", "us", "no", "au", "uk", "us", "us"))
    > citizen
    [1] uk us no au uk us us

Although this is entered as a character vector, it is printed without quotes. Internally the factor is stored as a set of codes, and an attribute giving ...

[...] complex objects will have printed a short summary instead of full details. This is achieved by an object-oriented programming mechanism; complex objects have classes assigned to them that determine how they are printed, summarized and plotted. This process is taken further in S-PLUS in which all objects have classes. S can be extended by writing new functions, which then can be used in the same way as built-in ...

[...] menu

    plot(fitted(fm), resid(fm),
        xlab = "Fitted Values", ylab = "Residuals")

A standard regression diagnostic plot to check for heteroscedasticity, that is, for unequal variances. The data are generated from a heteroscedastic process, so can you see this from this plot?
    qqnorm(resid(fm))
    qqline(resid(fm))

A normal scores plot to check for skewness, kurtosis and outliers. (Note that the heteroscedasticity ...

[...] Quotes within strings are not treated specially. (marginal mark: S+)

In R character strings are quoted by default, this being suppressed by quote = FALSE, or selectively by giving a numeric vector for quote. Embedded quotes are escaped, either as \" or doubled (Excel-style, set by qmethod = "double").

(f) Precision. The precision to which real (and complex) numbers are output is controlled by the ...

[...] frames, where S-PLUS looks for objects required for calculations.

Compare the five experiments with simple boxplots. The result is shown in Figure 1.5.
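
The two regression diagnostic plots described in the introductory-session fragments above (residuals against fitted values, then a normal scores plot) can be tried out with the following minimal sketch in R. The excerpt does not show how fm was created, so the simulated data, the variable names x, w, y and dum, and the seed are all assumptions made for illustration; only the three plotting calls are taken from the text.

```r
# Hypothetical data, not taken from the excerpt: the error standard
# deviation w grows with x, so the process is heteroscedastic by design.
set.seed(1)
x <- seq(1, 20, by = 0.5)
w <- 1 + x / 2
y <- x + w * rnorm(length(x))
dum <- data.frame(x = x, y = y, w = w)

# An ordinary (unweighted) least-squares fit; 'fm' is the name used in the text.
fm <- lm(y ~ x, data = dum)

# Residuals against fitted values: a fan shape suggests unequal variances.
plot(fitted(fm), resid(fm),
     xlab = "Fitted Values", ylab = "Residuals")

# Normal scores plot of the residuals, with a reference line.
qqnorm(resid(fm))
qqline(resid(fm))
```

With data simulated this way the spread of the residuals would be expected to widen as the fitted values increase, although with only 39 points the pattern may be faint in any one run.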