Other Matrix-Like Operations

Various matrix operations also apply to data frames. Most notably and usefully, we can do filtering to extract various subdata frames of interest.

5.2.1 Extracting Subdata Frames

As mentioned, a data frame can be viewed in row-and-column terms. In particular, we can extract subdata frames by rows or columns. Here’s an example:

> examsquiz[2:5,]
  Exam.1 Exam.2 Quiz
2    3.3      2  3.7
3    4.0      4  4.0
4    2.3      0  3.3
5    2.3      1  3.3
> examsquiz[2:5,2]
[1] 2 4 0 1
> class(examsquiz[2:5,2])
[1] "numeric"
> examsquiz[2:5,2,drop=FALSE]
  Exam.2
2      2
3      4
4      0
5      1
> class(examsquiz[2:5,2,drop=FALSE])
[1] "data.frame"

Note that in that second call, since examsquiz[2:5,2] is a vector, R created a vector instead of another data frame. By specifying drop=FALSE, as described for the matrix case in Section 3.6, we can keep it as a (one-column) data frame.
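The effect of drop can be checked on a small stand-in data frame. The sketch below uses a hypothetical three-row data frame named toy (its values are illustrative, not the real examsquiz data).

```r
# Hypothetical stand-in for examsquiz; values are illustrative only
toy <- data.frame(Exam.1 = c(2.0, 3.3, 4.0),
                  Exam.2 = c(3.3, 2.0, 4.0),
                  Quiz   = c(4.0, 3.7, 4.0))

v  <- toy[2:3, 2]                # one column selected: result drops to a vector
df <- toy[2:3, 2, drop = FALSE]  # drop=FALSE keeps a one-column data frame

is.data.frame(v)    # FALSE
is.data.frame(df)   # TRUE
dim(df)             # 2 rows, 1 column
```

Keeping the data frame form is mainly useful when later code expects column names or calls data frame functions on the result.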

We can also do filtering. Here’s how to extract the subframe of all students whose first exam score was at least 3.8:

> examsquiz[examsquiz$Exam.1 >= 3.8,]
   Exam.1 Exam.2 Quiz
3       4    4.0  4.0
9       4    3.3  4.0
11      4    4.0  4.0
14      4    0.0  4.0
16      4    3.7  4.0
19      4    4.0  4.0
22      4    4.0  4.0
25      4    4.0  3.3
29      4    3.0  3.7
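It may help to see the filtering step split in two: the comparison produces a logical vector with one entry per row, and that vector is then used as the row index. Here is a minimal sketch on a hypothetical four-row data frame, toy, whose values are made up for illustration.

```r
# Hypothetical stand-in for examsquiz; values are illustrative only
toy <- data.frame(Exam.1 = c(2.0, 3.3, 4.0, 3.8),
                  Exam.2 = c(3.3, 2.0, 4.0, 3.0),
                  Quiz   = c(4.0, 3.7, 4.0, 3.3))

keep <- toy$Exam.1 >= 3.8   # logical vector: one TRUE/FALSE per row
keep

high <- toy[keep, ]         # rows where keep is TRUE; original row names survive
high
```

Note that the selected rows keep their original row names (here 3 and 4), just as in the examsquiz output above.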

5.2.2 More on Treatment of NA Values

Suppose the second exam score for the first student had been missing. Then we would have typed the following into that line when we were preparing the data file:

2.0 NA 4.0

In any subsequent statistical analyses, R would do its best to cope with the missing data. However, in some situations, we need to set the option na.rm=TRUE, explicitly telling R to ignore NA values. For instance, with the missing exam score, calculating the mean score on exam 2 by calling R’s mean() function with na.rm=TRUE would skip that first student in finding the mean. Without that option, R would just report NA for the mean.

Here’s a little example:

> x <- c(2,NA,4)
> mean(x)
[1] NA
> mean(x,na.rm=TRUE)
[1] 3

In Section 2.8.2, you were introduced to the subset() function, which saves you the trouble of handling NA values explicitly: rows for which the condition evaluates to NA are simply excluded. You can apply it in data frames for row selection. The column names are taken in the context of the given data frame. In our example, instead of typing this:

> examsquiz[examsquiz$Exam.1 >= 3.8,]

we could run this:

> subset(examsquiz,Exam.1 >= 3.8)

Note that we do not need to write this:

> subset(examsquiz,examsquiz$Exam.1 >= 3.8)
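The difference between bracket indexing and subset() shows up only when the condition involves an NA. The following sketch uses a hypothetical three-row data frame, toy, with one missing exam score, to compare the two.

```r
# Hypothetical data with one missing Exam.1 score
toy <- data.frame(Exam.1 = c(4.0, NA, 3.3),
                  Quiz   = c(4.0, 3.7, 3.3))

# Bracket indexing: an NA in the logical index produces a row of NAs
by_bracket <- toy[toy$Exam.1 >= 3.8, ]

# subset(): an NA condition is treated as "do not select"
by_subset <- subset(toy, Exam.1 >= 3.8)

nrow(by_bracket)   # 2 (the real match plus an all-NA row)
nrow(by_subset)    # 1
```

If we prefer brackets, wrapping the condition in which() gives the same rows as subset() here, since which() ignores NA entries.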

In some cases, we may wish to rid our data frame of any observation that has at least one NA value. A handy function for this purpose is complete.cases().

> d4
     kids states
1    Jack     CA
2    <NA>     MA
3 Jillian     MA
4    John   <NA>
> complete.cases(d4)
[1]  TRUE FALSE  TRUE FALSE
> d5 <- d4[complete.cases(d4),]
> d5
     kids states
1    Jack     CA
3 Jillian     MA

Cases 2 and 4 were incomplete; hence the FALSE values in the output of complete.cases(d4). We then use that output to select the intact rows.
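For readers who want to reproduce this, here is a sketch that rebuilds d4 from the printed output above (the construction itself is our assumption; the text does not show how d4 was created) and compares the complete.cases() approach with base R's na.omit(), which removes incomplete rows in one call.

```r
# d4 reconstructed from the printed output; how it was originally built is assumed
d4 <- data.frame(kids   = c("Jack", NA, "Jillian", "John"),
                 states = c("CA", "MA", "MA", NA),
                 stringsAsFactors = FALSE)

ok <- complete.cases(d4)   # TRUE for rows with no NA in any column
d5 <- d4[ok, ]

# na.omit() drops the same rows directly
d5_alt <- na.omit(d4)

rownames(d5)       # "1" "3"
rownames(d5_alt)   # "1" "3"
```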

5.2.3 Using the rbind() and cbind() Functions and Alternatives

The rbind() and cbind() matrix functions introduced in Section 3.4 work with data frames, too, provided that you have compatible sizes, of course. For instance, you can use cbind() to add a new column that has the same length as the existing columns.

In using rbind() to add a row, the added row is typically in the form of another data frame or list.

> d
  kids ages
1 Jack   12
2 Jill   10
> rbind(d,list("Laura",19))
   kids ages
1  Jack   12
2  Jill   10
3 Laura   19
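The example above adds the row as a list. As a sketch of the other form mentioned, the code below adds the same row as a one-row data frame with matching column names. The data frame d is rebuilt here from its printed form, with stringsAsFactors=FALSE as an assumption so that the names stay character strings.

```r
# d rebuilt from the printed output; stringsAsFactors=FALSE is an assumption
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                stringsAsFactors = FALSE)

# Added row given as a list (matched to columns by position)
d_list <- rbind(d, list("Laura", 19))

# Added row given as a one-row data frame (matched to columns by name)
new_row <- data.frame(kids = "Laura", ages = 19, stringsAsFactors = FALSE)
d_df <- rbind(d, new_row)

d_df$kids   # "Jack" "Jill" "Laura"
```

The data frame form is a little more verbose, but because columns are matched by name, it is less sensitive to column order.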

You can also create new columns from old ones. For instance, we can add a variable that is the difference between exams 1 and 2:

> eq <- cbind(examsquiz,examsquiz$Exam.2-examsquiz$Exam.1)
> class(eq)
[1] "data.frame"
> head(eq)
  Exam.1 Exam.2 Quiz examsquiz$Exam.2 - examsquiz$Exam.1
1    2.0    3.3  4.0                                 1.3
2    3.3    2.0  3.7                                -1.3
3    4.0    4.0  4.0                                 0.0
4    2.3    0.0  3.3                                -2.3
5    2.3    1.0  3.3                                -1.3
6    3.3    3.7  4.0                                 0.4

The new name is rather unwieldy: It’s long, and it has embedded blanks. We could change it, using the names() function, but it would be better to exploit the list basis of data frames and add a column (of the same length) to the data frame for this result:

> examsquiz$ExamDiff <- examsquiz$Exam.2 - examsquiz$Exam.1
> head(examsquiz)
  Exam.1 Exam.2 Quiz ExamDiff
1    2.0    3.3  4.0      1.3
2    3.3    2.0  3.7     -1.3
3    4.0    4.0  4.0      0.0
4    2.3    0.0  3.3     -2.3
5    2.3    1.0  3.3     -1.3
6    3.3    3.7  4.0      0.4

What happened here? Since one can add a new component to an already existing list at any time, we did so: We added a component ExamDiff to the list/data frame examsquiz.
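If we do want to stay with cbind(), one way to avoid the unwieldy name is to pass the new column as a named argument. The sketch below compares that with the list-style assignment, using a hypothetical two-row data frame, toy.

```r
# Hypothetical two-row stand-in for examsquiz
toy <- data.frame(Exam.1 = c(2.0, 3.3),
                  Exam.2 = c(3.3, 2.0))

# cbind() with a named argument: the argument name becomes the column name
eq <- cbind(toy, ExamDiff = toy$Exam.2 - toy$Exam.1)

# List-style assignment, as in the text
toy$ExamDiff <- toy$Exam.2 - toy$Exam.1

names(eq)    # "Exam.1" "Exam.2" "ExamDiff"
names(toy)   # "Exam.1" "Exam.2" "ExamDiff"
```

The practical difference is that cbind() returns a new data frame, while the $ assignment modifies the existing one.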

We can even exploit recycling to add a column that is of a different length than those in the data frame:

> d
  kids ages
1 Jack   12
2 Jill   10
> d$one <- 1
> d
  kids ages one
1 Jack   12   1
2 Jill   10   1
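Recycling here has a limit worth knowing: the number of rows must be a multiple of the length of the new column. The following sketch illustrates this on a hypothetical four-row version of d (the extra names and ages are made up).

```r
# Hypothetical four-row data frame; the added rows are illustrative only
d <- data.frame(kids = c("Jack", "Jill", "Laura", "Ann"),
                ages = c(12, 10, 19, 11),
                stringsAsFactors = FALSE)

d$one  <- 1             # length 1: recycled to all 4 rows
d$team <- c("a", "b")   # length 2 divides 4: recycled as a, b, a, b

# Length 3 does not divide 4, so the assignment raises an error
ok <- tryCatch({ d$bad <- 1:3; TRUE }, error = function(e) FALSE)
ok   # FALSE
```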

5.2.4 Applying apply()

You can use apply() on data frames, if the columns are all of the same type. For instance, we can find the maximum grade for each student, as follows:

> apply(examsquiz,1,max)

 [1] 4.0 3.7 4.0 3.3 3.3 4.0 3.7 3.3 4.0 4.0 4.0 3.3 4.0 4.0 3.7 4.0 3.3 3.7 4.0
[20] 3.7 4.0 4.0 3.3 3.3 4.0 4.0 3.3 3.3 4.0 3.7 3.3 3.3 3.7 2.7 3.3 4.0 3.7 3.7
[39] 3.7
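The same-type requirement comes from the fact that apply() first converts the data frame to a matrix, and a matrix can hold only one type. Here is a short sketch with two hypothetical data frames, toy (all numeric) and mixed (character plus numeric).

```r
# All-numeric data frame: apply() works row by row as expected
toy <- data.frame(Exam.1 = c(2.0, 3.3, 4.0),
                  Exam.2 = c(3.3, 2.0, 4.0),
                  Quiz   = c(4.0, 3.7, 4.0))
row_max <- apply(toy, 1, max)
row_max

# Mixed-type data frame: the matrix conversion turns everything into strings
mixed <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                    stringsAsFactors = FALSE)
m <- as.matrix(mixed)
is.character(m)   # TRUE: the ages are now character strings too
```

With mixed columns, it is usually safer to select the numeric columns first and call apply() on just those.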

5.2.5 Extended Example: A Salary Study

In a study of engineers and programmers, I considered the question, “How many of these workers are the best and the brightest—that is, people of extraordinary ability?” (Some of the details have been changed here.)

The government data I had available was limited. One (admittedly imperfect) way to determine whether a worker is of extraordinary ability is to look at the ratio of actual salary to the government prevailing wage for that job and location. If that ratio is substantially higher than 1.0, you can reasonably assume that this worker has a high level of talent.

I used R to prepare and analyze the data and will present excerpts of my preparation code here. First, I read in the data file:

all2006 <- read.csv("2006.csv",header=TRUE,as.is=TRUE)

The function read.csv() is essentially identical to read.table() except that the input data is in the CSV format exported by spreadsheets, which is the way the data set was prepared by the US Department of Labor (DOL).

The as.is argument is the negation of stringsAsFactors, which you saw earlier in Section 5.1. So, setting as.is to TRUE here is simply an alternate way to achieve stringsAsFactors=FALSE.
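To see the equivalence without the real data file, we can read a few lines of CSV from a string. In this sketch, the CSV content and the column names name and wage are made up; only the read.csv() arguments mirror the call above.

```r
# Made-up CSV content, supplied via the text argument instead of a file
csv_text <- "name,wage\nJack,50000\nJill,62000"

a <- read.csv(text = csv_text, header = TRUE, as.is = TRUE)
b <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

class(a$name)   # "character"
class(b$name)   # "character"
```

In R 4.0.0 and later, stringsAsFactors defaults to FALSE, so strings stay as character vectors even without either argument; on older versions the explicit setting matters.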

At this point, I had a data frame, all2006, consisting of all the data for the year 2006. I then did some filtering:

all2006 <- all2006[all2006$Wage_Per=="Year",] # exclude hourly-wagers
all2006 <- all2006[all2006$Wage_Offered_From > 20000,] # exclude weird cases
all2006 <- all2006[all2006$Prevailing_Wage_Amount > 200,] # exclude hrly prv wg

These operations are typical data cleaning. Most large data sets contain some outlandish values: some are obvious errors, others use different measurement systems, and so on. I needed to remedy this situation before doing any analysis.

I also needed to create a new column for the ratio between actual wage and prevailing wage:

all2006$rat <- all2006$Wage_Offered_From / all2006$Prevailing_Wage_Amount

Since I knew I would be calculating the median in this new column for many subsets of the data, I defined a function to do the work:

medrat <- function(dataframe) {
   return(median(dataframe$rat,na.rm=TRUE))
}

Note the need to exclude NA values, which are common in government data sets.
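Here is a small check of medrat() on made-up numbers. The data frame toy and its wage values are hypothetical; the column names are the ones used in the code above.

```r
# medrat() as defined in the text
medrat <- function(dataframe) {
  return(median(dataframe$rat, na.rm = TRUE))
}

# Made-up wages, with one missing offer
toy <- data.frame(Wage_Offered_From      = c(60000, 90000, NA, 50000),
                  Prevailing_Wage_Amount = c(50000, 60000, 70000, 50000))
toy$rat <- toy$Wage_Offered_From / toy$Prevailing_Wage_Amount

medrat(toy)       # median of the three non-missing ratios
median(toy$rat)   # NA, since na.rm defaults to FALSE
```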

I was particularly interested in three occupations and thus extracted subdata frames for them to make their analyses more convenient:

se2006 <- all2006[grep("Software Engineer",all2006),]

prg2006 <- all2006[grep("Programmer",all2006),]

ee2006 <- all2006[grep("Electronics Engineer",all2006),]

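Here, I used R’s grep() function to identify the rows containing the given job title. Note that grep() searches a character vector and returns the indices of the matching elements, so in practice its second argument should be the column holding the job titles rather than the whole data frame. Details on this function are in Chapter 11.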
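As a sketch of row selection with grep(), the code below applies it to a single character column. The column name Job_Title and all the values are hypothetical, since the excerpt does not show which column of all2006 holds the job titles.

```r
# Hypothetical data; Job_Title is an assumed column name
toy <- data.frame(Job_Title = c("Software Engineer II", "Programmer Analyst",
                                "Senior Software Engineer", "Accountant"),
                  rat = c(1.2, 1.0, 1.4, 1.1),
                  stringsAsFactors = FALSE)

# grep() on the column returns the indices of the matching rows
idx <- grep("Software Engineer", toy$Job_Title)
idx

se <- toy[idx, ]   # subdata frame of the matching rows
se
```

Passing the column rather than the whole data frame matters: given a data frame, grep() would treat each column as one element and return column positions, not row positions.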

Another aspect of interest was analysis by firm. I wrote this function to extract the subdata frame for a given firm:

makecorp <- function(corpname) {
   t <- all2006[all2006$Employer_Name == corpname,]
   return(t)
}

I then created subdata frames for a number of firms (only some are shown here).

corplist <- c("MICROSOFT CORPORATION","ms","INTEL CORPORATION","intel",
   "SUN MICROSYSTEMS, INC.","sun","GOOGLE INC.","google")
for (i in 1:(length(corplist)/2)) {
   corp <- corplist[2*i-1]
   newdtf <- paste(corplist[2*i],"2006",sep="")
   assign(newdtf,makecorp(corp),pos=.GlobalEnv)
}

There’s quite a bit to discuss in the above code. First, note that I want the variables I’m creating to be at the top (that is, global) level, which is the usual place one does interactive analysis. Also, I’m creating my new variable names from character strings, such as “intel2006”. For these reasons, the assign() function is wonderful. It allows me to assign a variable by its name as a string and enables me to specify top level (as discussed in Section 7.8.2).

The paste() function allows me to concatenate strings, with sep="" specifying that I don’t want any characters between strings in my concatenation.
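The same pattern can be tried on a tiny example. In this sketch, the data frame toy, the firm names, and the short names acme and beta are all made up; only the use of paste() and assign() follows the code above.

```r
# Made-up data and firm names, for illustration only
toy <- data.frame(Employer_Name = c("ACME CORP", "ACME CORP", "BETA INC"),
                  rat = c(1.1, 1.3, 0.9),
                  stringsAsFactors = FALSE)

# Full firm name followed by the short name used in the variable name
corp_pairs <- c("ACME CORP", "acme", "BETA INC", "beta")

for (i in seq_len(length(corp_pairs) / 2)) {
  corp    <- corp_pairs[2 * i - 1]
  varname <- paste(corp_pairs[2 * i], "2006", sep = "")   # e.g. "acme2006"
  # Create a global variable whose name is the string in varname
  assign(varname, toy[toy$Employer_Name == corp, ], pos = .GlobalEnv)
}

nrow(acme2006)   # 2
nrow(beta2006)   # 1
```

A common alternative, when global variables are not needed, is to store the subdata frames in a named list and look them up by name.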
