Up and Running with R
Ann Arbor Chapter, American Statistical Association (ASA)
Instructor: Brady T. West (bwest@umich.edu)
September 28, 2010

Up and Running with R: A Workshop Presented by the Ann Arbor Chapter of ASA (B. West)

Introduction: What is R?

The software package known as R is an interactive computing language and environment for statistical analysis, computing, and graphics.

R is an open source software package: the source code behind the software is free for all to look at, modify, and play around with, and R in fact grows by leaps and bounds as people from all fields develop new functions for use within R’s computing environment. This is part of what makes R extremely useful! Several extremely complex statistical routines not available in other software packages have been programmed in R, and these routines are freely available for use by anyone.

R is completely interactive; users type commands and program functions as they go. The software is similar in many ways to the commercial software package S-Plus, and offers many of the same features. R, however, can be downloaded for free, while S-Plus is a commercial package that costs money. S-Plus may be slightly easier to use than R, but after this workshop, you should be familiar enough with R and how it functions to do pretty much anything that you would like to without a hassle!
The software provides users with a wide array of powerful and enlightening graphical techniques, and this is why many researchers love using R; the graphical capabilities are tremendous and easy to implement. Once you are able to grasp how to work with R’s graphical facilities, you will have a limitless supply of graphical tools at your fingertips that will enhance the appearance of your research presentations in many ways.

We strongly encourage you to visit the central web site behind the R project, which I will frequently refer to throughout this workshop:

http://www.r-project.org/

Here you will find links for downloading R, downloading additional packages for R, and everything else that you would like to know about the software or the people behind it.

How to Obtain R

The R Project Web Page

At the R Project Web Page, you will find a variety of information about the R Project, which you can peruse at your leisure. The most important link will appear at the left-hand side of the screen, under the “Download” heading. Click on the CRAN link (Comprehensive R Archive Network), and after you choose one of the U.S. mirrors (http://cran.stat.ucla.edu/ is recommended), you will be taken to the page that you will use to download everything R-related. Once you find the CRAN web page, take the following steps to obtain R:

1. Click on the “R Binaries” link on the left-hand side of the page under the “Software” heading.
2. Click on the folder that best describes your operating system.
3. When using Windows, click on the “base” subdirectory. This will allow you to download the base R package.
4. Click the “Download R 2.X.X for Windows” link. R is updated quite frequently, and the version number is always changing (at the time of this writing, Version 2.11.1 is available).
5. Save the exe file somewhere on your computer.
6. Double-click on the exe file once it has been downloaded. A wizard will appear that will guide you
through the setup of the R software on your machine. Once you are finished, you should have an R icon on your desktop that gives you a shortcut to the R system. Double-click on this icon, and you are ready to go!

Adding Packages to R

At the step above where you click on the “base” subdirectory, you also have the option of clicking on the “contrib” subdirectory. Doing this will allow you to download additional contributed packages in R.

So what exactly are “additional contributed packages”? As mentioned in the introduction, R is an open source software package, meaning users of R are free to explore the code behind the software and write their own new code. Several statisticians and researchers have written additional packages for R that perform complex analyses that are not very common, and in order to use these packages and the functions within them, you need to first download them. The base R package comes with several additional packages, but odds are that you will discover an uncommon analysis technique in your research that requires you to install an additional package that is not available with the base package. There are many additional contributed packages. Don’t hesitate to explore the contributed packages to see if someone has developed a package that will allow you to implement a technique that you are interested in!
To download contributed packages, follow the first two steps above (click on the “R Binaries” link, then on the folder for your operating system), and then click on the “contrib” link. Then, follow these steps:

1. Select the version of R that you are using (the newest version for Windows at the time of this workshop is Version 2.11.1).
2. Scroll through the list of contributed packages (in zip format), and click on the package that you would like to download. You can find descriptions of all of these contributed packages and the techniques implemented within them by clicking on the “Packages” link under the “Software” heading on the CRAN web page. This page will also have links to help manuals for the packages.
3. Save the zip file in a directory on your machine that you can remember.
4. When using R, select Install package(s) from local zip files… from the Packages menu.
5. Locate the zip file for the package that you downloaded onto your machine, click on Open, and R will install that package so that it is ready for use.

The package will now be ready to use when you start R!

FAQs on the CRAN Web Page

Under the “Documentation” heading on the left-hand side of the CRAN web page, click on the “FAQs” link. This will allow you to see an FAQ page that will answer many of the most commonly asked questions about R. You will find that this section will provide answers to many of your questions, whether they are simple or difficult.

Searching on the CRAN Web Page

Under the “CRAN” heading on the left-hand side of the CRAN web page, you can click on the “Search” link. Although there is no formal search engine on the CRAN web page, this will take you to a set of links allowing you to search the R archives (manuals, mail, help files, etc.)
for anything that you would like. This is often useful if you are faced with a tough analysis question, and you want to see if another R user has addressed the question before.

Starting R / Loading Contributed Packages

At this point (if you haven’t already), you should be able to start R! If you asked for a shortcut to R to be created on your desktop, simply double-click on the R icon to start R. This will open the RGui (Graphical User Interface). You should see a window inside the RGui containing the R Console. This is where you will specify all of your commands and programs interactively, at the red command prompt.

For an example command, we’ll load a contributed package into R for use. Let’s download the “quantreg” package from the CRAN mirror and save it to the desktop, and then install the package as described above. After the package has been installed, simply type library(quantreg) at the command prompt:

> library(quantreg)

Press enter after you type this command to submit the command to R. If you don’t see anything aside from another command prompt, the library was loaded successfully, and you can use all of the functions associated with it!
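As a hedged illustration of the loading step described above, the sketch below checks whether a package is installed before attaching it, so that a missing package produces a short message instead of an error. The helper name load.if.available is hypothetical (it is not part of R or of this workshop), and requireNamespace() assumes a reasonably recent version of R.

```r
# Hypothetical helper (not part of R): attach a package only if it is installed.
load.if.available <- function(pkg) {
  # requireNamespace() returns TRUE or FALSE instead of stopping with an error
  if (requireNamespace(pkg, quietly = TRUE)) {
    library(pkg, character.only = TRUE)  # pkg holds the package name as a string
    return(TRUE)
  }
  message("Package '", pkg, "' is not installed; install it first.")
  return(FALSE)
}

load.if.available("stats")     # a package that ships with R
load.if.available("quantreg")  # TRUE only if quantreg has been installed
```

The function returns TRUE or FALSE, so a longer script could use the result to decide whether to continue.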
If you see the error message

Error in library(quantreg) : There is no package called 'quantreg'

you did not extract the quantreg package correctly (see the “Adding Packages to R” section above). A contributed package must be downloaded and extracted into the R library folder correctly in order for you to use it. This is how you load contributed packages into R for your personal use.

When you submit a command to R, you will either see nothing but another command prompt (good), a result (good), or an error message (bad).

An even quicker way to install packages is to simply select “Install package(s)…” from the Packages menu. You can pick a CRAN mirror, and then directly install a package and all of its related components. This is probably the quickest way to install a package. You would still need to load the package in order to use its functions.

At any point in a given R session, you can submit the command

> installed.packages()

to view packages that have been installed. You are now ready to use R!

Help Tools

In most well-written statistical software packages, help is never far away. This holds true for R. Although the help is somewhat technical in nature and requires a good understanding of the R language, it is very easy to access.

Once you’ve gained some experience in working with the R language, most of your help questions will be directed at how particular functions in R work, what arguments they take, etc. For help on ANY function in R that is a part of a package that has already been loaded, simply type and submit

> help(function.name)

in the R console, where function.name is the name of the function that you would like to see a help window for. Try typing help(lm)!
If you have typed the correct name of a function which belongs to a package that has already been loaded into R, you will see a help window pop up that describes the function, its arguments, what the function returns, and also presents some examples of using the function. Oftentimes, there will be contact information for the person or people who wrote the function.

If you would like to see a list of all of the functions that come with the base R package, including brief descriptions of each, you can simply type

> library(help = "base")

to generate the list.

Hint: Don’t forget, R is an open source language! If you want to see exactly how a given function has been written, simply type

> fix(function.name)

to see the code in a program editor. You can copy it, update it, and do whatever else you would like with it. Just make sure not to save any changes to the code behind a function unless you know that they will work!

Another easy way of obtaining help, this time in a web browser, is to type and submit

> help.start()

Doing so will open up a web-based help system that is very easy to navigate.

A third and obvious way to obtain help is via the Help menu when you are working in the R Console. Here you will find FAQs, help with navigating the console, and most importantly the official R manuals from the authors of R themselves. Again, these are somewhat technical in nature, but very useful once you have been working with R a lot. I would recommend the “An Introduction to R” manual very highly.

Finally, don’t hesitate to contact the Center for Statistical Consultation and Research, or CSCAR (734.764.7828; cscar@umich.edu, online.stat@umich.edu) if you need further assistance with performing your analyses in R!
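The help tools above open windows or a browser, which is not always convenient. As a rough sketch, a few base R functions return similar information directly in the console. The object names hits and lm.args are illustrative choices, not part of the workshop materials.

```r
# Find objects on the search path whose names contain "anova"
hits <- apropos("anova")
print(hits)

# List the argument names of lm() without opening a help window
lm.args <- names(formals(lm))
print(lm.args)

# Printing a function (or typing its name without parentheses) shows its
# source code in the console; sd() from the stats package is used here
print(sd)
```

Unlike fix(), printing a function only displays the code, so there is no risk of saving an accidental change.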
Importing / Exporting Data Sets

The “bank2” Data Set

All of the following examples of using R will be demonstrated using a data set that appears in a variety of formats in the archive at www.umich.edu/~bwest. The “bank2” data set contains a variety of information on each of the 474 employees that work for a large bank. The most important first step in using R for statistical analysis is of course to import a data set!

Objects in R

Before you can successfully import a data set, you need to know about objects in R. The entire R computing environment is based on objects. What exactly is an object? Objects take numerous forms:

  - Numbers
  - Vectors of numbers
  - Matrices of numbers
  - Results of analyses
  - Data sets
  - Many others!

The operator <- assigns a value to an object, and typing the name of an object displays its value:

> nine <- 9
> nine
[1] 9

R, in this case, returns the value of the object (a number). Many objects (such as results of analyses) are much more complex, and there are ways to look at specific aspects of objects. Fields within objects (e.g., variables within data set objects, or parts of result objects, such as the estimated regression coefficients that come from a regression analysis) can be accessed using the object$field command.

Suppose we run a regression analysis, and then want to investigate the resulting coefficients. Submitting the commands

> fit <- lm(mo.fail ~ lc, data = data)
> fit

stores the results in an object named fit and then prints that object:

Call:
lm(formula = mo.fail ~ lc, data = data)

Coefficients:
(Intercept)           lc
     64.139       -9.195

This is not very exciting on the surface. But there is more to this object than meets the eye. For example, suppose we wanted to see the regression coefficients that are a part of this “results” object:

> fit$coef
(Intercept)          lc
  64.139079   -9.195036

Notice how the object$field command was submitted. The “coef” result is a field located in the “fit” object. If you want to see a list of all possible fields located within an object (for example, object.name), simply
type

> names(object.name)

If we want to see a summary of the regression results, we can type

> summary(fit)

Call:
lm(formula = mo.fail ~ lc, data = data)

Multivariate Statistical Analyses / Modeling

Before you dive into a more complex analysis of your data, make sure to formulate the problem that you are investigating in statistical terms! Understanding the primary objective of the research that you are engaged in is extremely important, and this understanding often leads to simpler analyses than you were initially considering. Don’t try complex models or analyses when they aren’t necessary! A response/predictor mindset is usually helpful, as many statistical analyses address uncovering relationships in data.

Multiple Regression / Linear Models

Multiple linear regression is a popular statistical technique that aims to assess the relationships of several predictor variables with a primary dependent or response variable. Analysts aim to build linear regression models, where the parameters (or regression coefficients) associated with the predictors enter the model in a linear fashion to predict the response.

Using the lm() Function

The general purpose linear modeling function lm() is quite handy for several standard multivariate statistical analyses. We’ll use the bank2 data set for a simple illustration of how to fit a multiple regression model in R. Suppose we desire to build a linear regression model assessing the relationship of education and minority status with current salary. We use lm() to fit the model, and store the results in an object named fit1:

> fit1 <- lm(SALARY ~ MINORITY + EDUC, data = bank2)
> summary(fit1)

Call:
lm(formula = SALARY ~ MINORITY + EDUC, data = bank2)

Residuals:
   Min     1Q Median     3Q    Max
-22284  -8182  -2215   5472  78613

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -16539.2     2885.9  -5.731 1.78e-08 ***
MINORITY     -3757.6     1428.2  -2.631  0.00879 **
EDUC          3838.2      205.1  18.714  < 2e-16 ***
Signif. codes:
0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 12750 on 471 degrees of freedom
Multiple R-Squared: 0.4445, Adjusted R-squared: 0.4421
F-statistic: 188.4 on 2 and 471 DF, p-value: < 2.2e-16

We can see that both of the predictors are statistically significant, and collectively explain about 44% of the total variation in SALARY.

The usual regression diagnostics should ALWAYS be checked after fitting a multiple regression model. You can use the graphical techniques available in R for a visual investigation of the standard diagnostic plots. A common diagnostic plot is the residual-fitted plot, where you can check for non-constant variance in the residuals. The fitted values on the SALARY variable and associated residuals from the model fit are stored as vectors in the fit1 object:

> plot(fit1$fit, fit1$res, main="Residual-Fitted Plot", xlab="Fitted Values", ylab="Residuals")

There is clearly a problem with non-constant variance in the residuals across the fitted values, and constant variance in the residuals is one of the assumptions underlying the validity of the results from fitting a linear regression model. A transformation of the SALARY response would probably be recommended here, in an effort to stabilize the variance in the residuals.

An Analysis of Variance (ANOVA) table comparing two nested multiple regression models (essentially testing the null hypothesis that one or more of the coefficients for the predictors, or parameters in the model, are equal to zero) can be investigated using the anova() function on the results objects from the two fits. For example, if we want to test the null hypothesis that the minority status predictor does not need to be in the model (or that the coefficient is equal to zero), we can submit the following commands:

> fit1 <- lm(SALARY ~ MINORITY + EDUC, data = bank2)
> fit2 <- lm(SALARY ~ EDUC, data = bank2)
> anova(fit2, fit1)

Analysis of
Variance Table

Model 1: SALARY ~ EDUC
Model 2: SALARY ~ MINORITY + EDUC
  Res.Df        RSS Df  Sum of Sq      F  Pr(>F)
1    472 7.7738e+10
2    471 7.6612e+10  1 1.1260e+09 6.9226 0.00879 **
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

We see from the significant F-statistic comparing the two models that the null hypothesis of the coefficient for minority status being equal to zero is rejected.

The anova() function can also be used on a single model fit object to produce a standard ANOVA table for the variables in the model:

> anova(fit1)
Analysis of Variance Table

Response: SALARY
           Df     Sum Sq    Mean Sq F value    Pr(>F)
MINORITY    1 4.3373e+09 4.3373e+09  26.665 3.577e-07 ***
EDUC        1 5.6967e+10 5.6967e+10 350.224 < 2.2e-16 ***
Residuals 471 7.6612e+10 1.6266e+08
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Other Useful Modeling Functions

Once you understand the basic ideas behind using a multivariate statistical analysis function in R and looking up help on the functions, you have access to all of the powerful statistical functions that R offers. Nearly all of these functions work in a manner similar to the lm() function. Provided below are tables of the more common functions (and corresponding packages) used for multivariate statistical analyses in R.

LINEAR MODELING

Analysis                                                 Function             Package
WLS Regression                                           lm( , weights = )    BASE
Regression Splines                                       splineDesign()       splines
Ridge Regression                                         lm.ridge()           MASS
Stepwise Regression                                      step(lm())           BASE
Robust Regression                                        rlm()                MASS
LTS Regression                                           ltsreg()             MASS
Critical Values from the Studentized Range Distribution  qtukey()             BASE

GENERALIZED LINEAR MODELING

Analysis                      Function                                                   Package
Logistic Regression           glm(,fam=binomial)                                         BASE
Poisson Regression            glm(,fam=poisson)                                          BASE
Poisson Rate Models           same as above, with offset() on the RHS of the model       BASE
Log-linear Models             same as Poisson, with a count var on the LHS of the model  BASE
Ordinal
Regression                    cumulative()                                               VGAM
Multinomial Regression        multinomial()                                              VGAM
Gamma Regression              glm(,fam=Gamma)                                            BASE
Negative Binomial Regression  glm(,fam=negative.binomial(k))                             MASS

NOTE: When comparing nested GLMs, make sure to use anova(,test="Chi") in order to construct an analysis of deviance table. Use anova(,test="F") if there is a free dispersion parameter.

LINEAR MIXED MODELING
**see help(lme) for specification of random effects

Analysis                           Function                          Package
Mixed Effects, ML                  lme(,method="ML")                 nlme / lme4
Mixed Effects, REML                lme()                             nlme / lme4
Predicting Random Effects (BLUPs)  random.effects(lme())             nlme
Nested Effects                     lme(,random=~1|lev1/lev2/…/)      nlme / lme4
Repeated Measures                  lme(,corr=) **consult with CSCAR  nlme / lme4

NOTE: When comparing nested Mixed Models with different fixed effects using anova(), make sure to fit both models using method="ML".

NOTE: The new lmer() function in the lme4 package is considered to be an improved update of the original lme() function, and allows users to fit models with crossed random effects. Consult with CSCAR for more details!
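The note above on comparing nested GLMs can be illustrated with a small sketch. It uses simulated data rather than the bank2 data set, so every variable name here (x1, x2, y, dat, small.fit, large.fit, dev.table) is an assumption made for illustration. The test is requested with test = "Chisq", the full spelling of the "Chi" option mentioned in the note.

```r
# Simulate a binary response that depends on x1 but not on x2
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.0 * x1))
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# Fit two nested logistic regression models
small.fit <- glm(y ~ x1,      family = binomial, data = dat)
large.fit <- glm(y ~ x1 + x2, family = binomial, data = dat)

# Analysis of deviance table: tests whether x2 is needed in the model
dev.table <- anova(small.fit, large.fit, test = "Chisq")
print(dev.table)
```

The second row of the table reports the change in deviance and its chi-square p-value for adding x2 to the model.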
MULTIVARIATE ANALYSIS / CLUSTERING

Analysis                       Function    Package
Principal Components Analysis  princomp()  stats
Linear Discriminant Analysis   lda()       MASS
Classification Trees           rpart()     rpart
K-Nearest Neighbors            knn()       class
Hierarchical Cluster Analysis  hclust()    stats
K-Means Clustering             kmeans()    stats
MDS                            cmdscale()  stats
Factor Analysis                factanal()  stats
MCA / HA                       mca()       MASS

NON-PARAMETRIC REGRESSION

Analysis           Function         Package
Kernel Estimation  ksmooth()        stats (formerly modreg)
Spline Smoothing   smooth.spline()  stats
Lowess Smoothing   lowess()         BASE

SURVIVAL ANALYSIS

Analysis                                              Function                Package
Create Survival Object                                Surv(time,event)        survival
Cox Proportional Hazards Modeling                     coxph()                 survival
Plot Kaplan-Meier Survival Estimates                  plot(survfit(Surv()))   survival
Plot Predicted Survival Function(s) Based on coxph()  plot(survfit(coxph()))  survival

The primary advantage of R when it comes to any kind of statistical analysis is the tremendous range of powerful functions at your disposal. There are several special functions that are simply not offered in other statistical software packages!
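As a brief, hedged illustration of how three of the functions listed in the multivariate table are called, the sketch below applies princomp(), kmeans(), and hclust() to the iris data set that ships with R. The choice of data set, the number of clusters, and the object names are assumptions made only for this example.

```r
# Use the four numeric measurement columns of the built-in iris data
measurements <- iris[, 1:4]

# Principal components analysis on the correlation matrix
pca <- princomp(measurements, cor = TRUE)
print(summary(pca))

# K-means clustering with 3 clusters on standardized variables
set.seed(123)  # kmeans() uses random starting centers
km <- kmeans(scale(measurements), centers = 3, nstart = 10)
print(table(km$cluster))

# Hierarchical clustering, then cut the tree into 3 groups
tree   <- hclust(dist(scale(measurements)), method = "average")
groups <- cutree(tree, k = 3)
print(table(groups))
```

Each call stores its results in an object, so the object$field approach used with lm() applies here as well (for example, km$centers).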
Click on the Packages link on the CRAN web page to get an idea of the functions that are available. Please note that new functions and updated versions of existing functions are being added to the R software daily. I recommend that you frequently visit a CRAN web site for updates.

Creating Functions / Programming

Another very nice interactive feature of R is the ability to easily write your own programs and functions, and update existing functions. R is not only a comprehensive statistical computing language, but also a programming language that can be used to write new statistical software. Researchers write new software for R every day, and are always updating existing functions.

Writing functions in R takes a while to get used to, just like any other programming language. New functions can be written in the R console, but it is much easier to submit the following command:

> fix(new.function.name)

where new.function.name represents a name that you choose to assign to your function. Submitting this command will open a Notepad-based text editor from within R, and you won’t be able to do anything else in R until you close this window. After opening this window, you will see a skeleton for the basic form of a function in R:

function ()
{
}

In this window, you will enter the names that will be assigned to objects that are arguments to the function in the parentheses, and build the function body in the brackets. Try creating the following simple addition machine to get the basic idea of how a function in R can be programmed, saved, and run:

> fix(add.machine)

After submitting that command, enter the following text into the function editor, save the file, and then close the editor window:

function (number1, number2)
{
result <- number1 + number2
return(result)
}

Then run the function at the command prompt:

> add.machine(3,5)
[1] 8

R returns the result of running the function!
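The fix() editor only works in an interactive session. In a script file, a function of the same kind can be defined directly with the assignment operator. The sketch below is a minimal example under the assumption that the addition machine simply adds its two arguments; the object name total is illustrative.

```r
# Define the function by assigning it to a name (no editor window needed)
add.machine <- function(number1, number2) {
  result <- number1 + number2  # add the two arguments
  return(result)               # hand the result back to the caller
}

# Call the function and store the returned value in a new object
total <- add.machine(3, 5)
print(total)
```

Defining functions this way also makes it easy to keep them in a text file and reload them with source().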
More commonly, the results of a function are stored in new objects:

> sum <- add.machine(3,5)
> sum
[1] 8

To see other examples of how functions are written in R, you can type fix(function.name) for any of the functions that you are working with.

Programming in R is basically the same as programming in any other language; the core program is controlled by a series of if-then statements, loops, print statements, calls to other programs, and return statements. The only differences are minor syntax conventions that just take a little while to get used to.

To begin with, let’s look at an example of how the if-then statements work in R. Try submitting this command at the prompt to see how it works:

> if (1 > 0) print("I Like Binary")

Note how R prints “I Like Binary,” because the condition in the “if” statement is true. The print() function is very useful for programming purposes, in that it prints simple strings in the R Console. No “then” statements are needed after an “if” statement; you simply type what you would like R to do if the condition is true. In general, if you want to do more than one thing if the “if” condition is true, you use this bracketed structure:

if (logical condition) {
do this
and this
and this
}

The statements in the brackets usually refer to function calls and object assignments, and simply need to be on separate lines (no punctuation necessary!). To add else-if conditions, simply use this structure:

if (logical condition) {
do this
and this
and this
} else if (logical condition) {
do this
and this
and this
} else {
do this
and this
and this
}

Similar to many other programming languages, && is the and operator for logical conditions, and || is the or operator for logical conditions. Remember to make appropriate use of parentheses when working with logical and/or conditions!
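The if / else if / else structure and the && and || operators described above can be combined in a short, hedged sketch. The function name describe.number and its labels are hypothetical and chosen only for illustration.

```r
# Hypothetical example function: label a single number using if / else if / else
describe.number <- function(x) {
  if ((x > 0) && (x %% 2 == 0)) {      # both conditions must be true
    label <- "positive and even"
  } else if ((x > 0) || (x == 0)) {    # at least one condition must be true
    label <- "non-negative"
  } else {
    label <- "negative"
  }
  return(label)
}

print(describe.number(4))
print(describe.number(3))
print(describe.number(-2))
```

Note that each else sits on the same line as the closing bracket before it; when code is typed at the command prompt, R treats an if statement as finished if the else starts on a new line.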
Now let’s take a look at how a “for” loop works in R. Try submitting the following syntax at the command prompt:

> for(i in 1:5) print(i)

R prints 1, 2, 3, 4, and 5. For loops use the same bracket syntax as if conditions: if you want R to do more than one thing in a “for” loop, use brackets around the commands:

for(i in a:b) {
do this
and this
and this
}

Keep in mind that the i argument in the for loop is a numerical object for each repetition of the for loop (in the example above, it takes the value of 1 for the first repetition, 2 for the second, and so forth), and can be used as an index variable if you so desire. We’ll see an example of this soon.

Now we’ll take a look at how “while” loops work in R. Try creating a function called while.loop() by first using the fix() function:

> fix(while.loop)

Then, in the function editor window, type the following lines, and save the function before closing the editor window (note that there are no arguments):

function ()
{
x

> while.loop()
[1] Counter is
[1] Counter is
[1] Counter is
[1] Counter is
[1] Counter is

This function proves to be quite handy for more complicated programming.

NOTE: If you would like to run a simple loop at the command prompt and not create a new function in the function editor window, you can define the loop one line at a time in the format introduced above by pressing enter after each line. Give it a try!
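As a sketch of the two loop types discussed above, the example below uses i as an index variable in a “for” loop and then runs a “while” loop that prints a counter five times. The loop bodies and the object names squares and counter are assumptions made for illustration; they are not the workshop's own while.loop() function.

```r
# "for" loop: use i as an index to fill a vector with squared values
squares <- numeric(5)          # vector of five zeros to hold the results
for (i in 1:5) {
  squares[i] <- i^2
}
print(squares)

# "while" loop: repeat until the condition becomes false
counter <- 1
while (counter <= 5) {
  print(paste("Counter is", counter))
  counter <- counter + 1       # without this line the loop would never end
}
```

The while loop stops once counter reaches 6, because the condition counter <= 5 is checked before every repetition.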
The purpose of most R functions is to return an object, whether it be an object containing the results of a model fit, a single number, a vector of numbers, etc. Therefore, several functions will end with a

return(object.name)

line, which tells R to either print the resulting object in the R console (if you do not assign the result of the function to a new object in the console), or to store the resulting object in the object that you have defined. This is the basic ammunition necessary for programming new functions in R!

Let’s take a look at a more complicated example of programming a function in R for statistical analysis purposes.

Example: Bootstrapping

R is often used to perform simulations. Suppose we have a random sample of size 50 from a standard normal distribution, and we want to use the resampling technique known as bootstrapping to check whether or not the actual coverage of a 95% confidence interval is near 95%. In other words, what percentage of the time does a 95% confidence interval (as it is commonly calculated) cover the true sample mean when using bootstrapping? We want to create a function that will perform bootstrapping to simulate repeated samples from this original sample, and see how many of the simulated confidence intervals contain the true sample mean. In this example, I will use the R convention of commenting a program using # signs.

> fix(boot.cicheck) # call the function editor window

function(n,rep) # n = sample size, rep = no. of
                # bootstrap samples
{
samp