Derived variables and data manipulation

This section describes the creation of new variables as a function of existing variables in a dataset.


2.2.1 Add derived variable to a dataset

Example: 6.6

library(dplyr)
ds = mutate(ds, newvar=myfunction(oldvar1, oldvar2, ...))

or

ds$newvar = with(ds, myfunction(oldvar1, oldvar2, ...))

Note: The routines in the dplyr package have been highly optimized, and often run dramatically faster than other options. In these equivalent examples, the new variable is added to the original dataframe. While care should be taken whenever dataframes are overwritten, this may be less risky because the addition of the variables is not connected with other changes.
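As a concrete sketch of the second form, the base R code below adds a derived variable to a small made-up dataframe. The dataframe ds, its columns, and the function bmi() are hypothetical and serve only as illustration.

```r
# Hypothetical example data: weight in kg, height in m
ds = data.frame(weight = c(70, 82, 55), height = c(1.75, 1.80, 1.60))

# Hypothetical derivation function (body mass index)
bmi = function(weight, height) weight / height^2

# Add the derived variable to the original dataframe
ds$bmi = with(ds, bmi(weight, height))
ds
```

The original columns are left unchanged; only the new column bmi is appended.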

2.2.2 Rename variables in a dataset

library(dplyr)
ds = rename(ds, new1=old1, new2=old2)

or

names(ds)[names(ds)=="old1"] = "new1"
names(ds)[names(ds)=="old2"] = "new2"

or

ds = within(ds, {new1 = old1; new2 = old2; rm(old1, old2)})

Note: The rename() function within the dplyr package provides a simple and efficient interface to rename variables in a dataframe. Alternatively, the names() function provides a list of names associated with an object (see A.4.6). The edit() function can be used to view names and edit values.
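The names() approach can be tried on a toy dataframe. This is a minimal sketch; the dataframe and its contents are invented for illustration.

```r
# Hypothetical dataframe with two columns to rename
ds = data.frame(old1 = 1:3, old2 = c("a", "b", "c"))

# Rename each column by matching on its current name
names(ds)[names(ds) == "old1"] = "new1"
names(ds)[names(ds) == "old2"] = "new2"

names(ds)
```

Matching on the name rather than on the column position keeps the code valid if the column order changes.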

2.2.3 Create string variables from numeric variables

stringx = as.character(numericx)
typeof(stringx)
typeof(numericx)

Note: The typeof() function can be used to verify the type of an object; possible values include logical, integer, double, complex, character, raw, list, NULL, closure (function), special, and builtin (see A.4.7).

2.2.4 Create categorical variables from continuous variables

Examples: 2.6.3 and 7.10.6

newcat1 = (x >= cutpoint1) + ... + (x >= cutpointn)

or

newcat = cut(x, breaks=c(minval, cutpoint1, ..., cutpointn),
   labels=c("Cut1", "Cut2", ..., "Cutn"), right=FALSE)

Note: In the first implementation, each expression within parentheses is a logical test returning 1 if the expression is true, 0 if not true, and NA if x is missing. More information about missing value coding can be found in 11.4.4.1. The cut() function provides a more general framework (see also cut_number() from the ggplot2 package).
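Both implementations can be compared on a small example. The values of x and the cutpoints (10, 20, 40, with limits 0 and 100) are assumptions chosen for illustration.

```r
x = c(2, 15, 30, NA, 45)     # hypothetical values, one missing

# Sum of logical tests: counts how many cutpoints are at or below x
newcat1 = (x >= 10) + (x >= 20) + (x >= 40)
newcat1

# cut() with left-closed intervals [0,10), [10,20), [20,40), [40,100)
newcat = cut(x, breaks = c(0, 10, 20, 40, 100),
             labels = c("Cut1", "Cut2", "Cut3", "Cut4"), right = FALSE)
newcat
```

Here newcat1 holds the codes 0, 1, 2, NA, 3, while newcat is a factor with the corresponding labels Cut1, Cut2, Cut3, NA, Cut4.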

2.2.5 Recode a categorical variable

A categorical variable may need to be recoded to have fewer levels (see also 6.1.3, changing reference category).

library(memisc)
newcat1 = cases(
   "newval1" = oldcat==val1 | oldcat==val2,
   "newval2" = oldcat==valn)

or

tmpcat = oldcat
tmpcat[oldcat==val1] = newval1
tmpcat[oldcat==val2] = newval1
...
tmpcat[oldcat==valn] = newvaln
newcat = as.factor(tmpcat)

Note: The cases() function from the memisc package can be used to create the factor vector in one operation, by specifying the Boolean conditions. Alternatively, creating the variable can be undertaken in multiple steps. A copy of the old variable is first made, then multiple assignments are made for each of the new levels, for observations matching the condition inside the index (see A.4.2). In the final step, the categorical variable is coerced into a factor (class) variable.
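The multiple-step approach might look as follows on a small vector. The state codes and the region labels are made up for illustration.

```r
oldcat = c("MA", "CT", "NY", "NJ", "MA")   # hypothetical four-level variable

tmpcat = oldcat                            # copy of the old variable
tmpcat[oldcat == "MA"] = "NewEngland"      # one assignment per old level
tmpcat[oldcat == "CT"] = "NewEngland"
tmpcat[oldcat == "NY"] = "MidAtlantic"
tmpcat[oldcat == "NJ"] = "MidAtlantic"

newcat = as.factor(tmpcat)                 # coerce to a factor
table(newcat)
```

The four original levels collapse into two, and table() gives a quick check that no old values were left unassigned.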

2.2.6 Create a categorical variable using logic

Example: 2.6.3

Here we create a trichotomous variable newvar, which takes on a missing value if the continuous non-negative variable oldvar is less than 0, 0 if the continuous variable is 0, value 1 for subjects in group A with values greater than 0 but less than 50 and for subjects in group B with values greater than 0 but less than 60, or value 2 with values above those thresholds (more information about missing value coding can be found in 11.4.4.1).

library(memisc)
tmpvar = cases(
   "0" = oldvar==0,
   "1" = (oldvar>0 & oldvar<50 & group=="A") |
         (oldvar>0 & oldvar<60 & group=="B"),
   "2" = (oldvar>=50 & group=="A") |
         (oldvar>=60 & group=="B"))

or

tmpvar = rep(NA, length(oldvar))
tmpvar[oldvar==0] = 0
tmpvar[oldvar>0 & oldvar<50 & group=="A"] = 1
tmpvar[oldvar>0 & oldvar<60 & group=="B"] = 1
tmpvar[oldvar>=50 & group=="A"] = 2
tmpvar[oldvar>=60 & group=="B"] = 2
newvar = as.factor(tmpvar)

Note: Creating the variable is undertaken in multiple steps in the second approach. A vector of the correct length is first created containing missing values. Values are updated if they match the conditions inside the vector index (see A.4.2). Care needs to be taken in the comparison of oldvar==0 if noninteger values are present (see 3.2.5).

The cases() function from the memisc package provides a straightforward syntax for derivations of this sort. The %in% operator can also be used to test whether a string is included in a larger set of possible values (see 2.2.11 and help("%in%")).
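The second approach can be checked against a handful of test cases. The vectors oldvar and group below are invented so that each branch of the logic is exercised at least once.

```r
# Hypothetical data covering every branch, including a negative value
oldvar = c(-1, 0, 25, 55, 55, 70, 0)
group  = c("A", "A", "A", "A", "B", "B", "B")

tmpvar = rep(NA, length(oldvar))           # start with all missing
tmpvar[oldvar == 0] = 0
tmpvar[oldvar > 0 & oldvar < 50 & group == "A"] = 1
tmpvar[oldvar > 0 & oldvar < 60 & group == "B"] = 1
tmpvar[oldvar >= 50 & group == "A"] = 2
tmpvar[oldvar >= 60 & group == "B"] = 2
newvar = as.factor(tmpvar)

data.frame(oldvar, group, newvar)
```

The value 55 is coded 2 in group A but 1 in group B, and the negative value stays missing because no condition matches it.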

2.2.7 Create numeric variables from string variables

numericx = as.numeric(stringx)
typeof(stringx)
typeof(numericx)

or

stringf = factor(stringx)
numericx = as.numeric(stringf)

Note: The first set of code can be used when the string variable records numbers as character strings, and the code converts the storage type for these values. The second set of code can be used when the values in the string variable are arbitrary and may be awkward to enumerate for coding based on logical operations. The typeof() function can be used to verify the type of an object (see 2.2.3 and A.4.7).
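The difference between the two sets of code is easiest to see side by side. The input vectors here are hypothetical.

```r
# Case 1: numbers stored as character strings
stringx = c("10", "2.5", "300")
numericx = as.numeric(stringx)
numericx                      # the numeric values themselves

# Case 2: arbitrary strings, converted via a factor
labels = c("low", "high", "medium", "low")
stringf = factor(labels)      # levels are sorted: high, low, medium
codes = as.numeric(stringf)
codes                         # integer codes of the levels, not values
```

In the second case the codes reflect the sorted order of the levels (high=1, low=2, medium=3), so they carry no meaning beyond distinguishing the categories.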

2.2.8 Extract characters from string variables

get2through4 = substr(x, start=2, stop=4)

Note: The arguments to substr() specify the input vector, start character position, and end character position. The stringr package provides additional support for operations on character strings.

2.2.9 Length of string variables

len = nchar(stringx)

Note: The nchar() function returns a vector of lengths of each of the elements of the string vector given as argument, as opposed to the length() function (2.3.4) that returns the number of elements in a vector. The stringr package provides additional support for operations on character strings.
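A short sketch contrasting substr(), nchar(), and length() on the same hypothetical character vector:

```r
x = c("alpha", "be", "gamma ray")   # hypothetical strings

substr(x, start = 2, stop = 4)      # characters 2 through 4 of each element
nchar(x)                            # number of characters in each element
length(x)                           # number of elements in the vector
```

When a string is shorter than the requested stop position, substr() returns whatever characters are available, so the second element yields "e".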

2.2.10 Concatenate string variables

newcharvar = paste(x1, " VAR2 ", x2, sep="")

Note: The above R code creates a character variable newcharvar containing the character vector x1 (which may be coerced from a numeric object) followed by the string " VAR2 " then the character vector x2. The sep="" option leaves no additional separation character between these three strings.
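For instance, with a numeric x1 and a character x2 (both made up for illustration), the coercion and the effect of sep can be seen directly:

```r
x1 = c(1, 2)            # numeric; coerced to character by paste()
x2 = c("a", "b")

newcharvar = paste(x1, " VAR2 ", x2, sep = "")
newcharvar

# With the default separator, an extra space is inserted between pieces
paste(x1, "VAR2", x2)
```

Both calls produce the same strings here, because the default sep=" " supplies the spaces that the first call embeds in the middle string.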

2.2.11 Set operations

newengland = c("MA", "CT", "RI", "VT", "ME", "NH")

"NY" %in% newengland

"MA" %in% newengland

Note: The first statement would return FALSE, while the second one would return TRUE. The %in% operator also works with numeric vectors (see help("%in%")). Vector functions for set-like operations include union(), setdiff(), setequal(), intersect(), unique(), duplicated(), and match().
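Several of these set-like functions are illustrated below. The vector visited is a hypothetical second set, introduced only for the example.

```r
newengland = c("MA", "CT", "RI", "VT", "ME", "NH")
visited = c("NY", "MA", "VT", "MA")      # hypothetical, with a duplicate

union(newengland, visited)       # distinct values found in either vector
intersect(newengland, visited)   # values found in both
setdiff(visited, newengland)     # values in visited but not in newengland
unique(visited)                  # distinct values, in order of appearance
duplicated(visited)              # TRUE for repeats of an earlier element
match(visited, newengland)       # position of each value, NA if absent
```

Note that setdiff() is not symmetric: reversing its arguments would return the New England states that were not visited.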

2.2.12 Find strings within string variables

Example: 7.10.9

matches = grep("pat", stringx)
positions = regexpr("pat", stringx)

> x = c("abc", "def", "abcdef", "defabc")
> grep("abc", x)
[1] 1 3 4
> regexpr("abc", x)
[1]  1 -1  1  4
attr(,"match.length")
[1]  3 -1  3  3
attr(,"useBytes")
[1] TRUE
> regexpr("abc", x) < 0
[1] FALSE  TRUE FALSE FALSE

Note: The function grep() returns the indices of the elements in the vector given by stringx that match the given pattern, while the regexpr() function returns a numeric vector giving the starting position of the match within each string (with -1 if there was no match). Testing positions < 0 generates a logical vector indicating non-matches (TRUE=no match, FALSE=a match).

The regular expressions available within grep and other related routines are quite powerful. As an example, Boolean OR expressions can be specified using the | operator. A comprehensive description of these operators can be found using help(regex). Additional support for operations on character vectors can be found in the stringr package.
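A minimal sketch of the | operator and of anchoring, using a made-up vector. The base R function grepl(), not shown above, returns a logical vector of matches directly.

```r
x = c("abc", "def", "abcdef", "defabc", "xyz")   # hypothetical strings

grepl("abc|xyz", x)              # TRUE where either pattern matches
regexpr("abc", x) > 0            # TRUE where "abc" matches
grep("^abc", x, value = TRUE)    # elements that start with "abc"
```

With value=TRUE, grep() returns the matching elements themselves rather than their indices.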

2.2.13 Find approximate strings

agrep(pattern, string, max.distance=n)

Note: The support within the agrep() function is more rudimentary: it calculates the Levenshtein edit distance (total number of insertions, deletions, and substitutions required to transform one string into another) and it returns the indices of the elements of the second argument that are within n edits of pattern (see 2.2.12). By default, the threshold is 10% of the pattern length.

> x = c("I ask a favour", "Pardon my error", "You are my favorite")
> agrep("favor", x, max.distance=1)
[1] 1 3


2.2.14 Replace strings within string variables

Example: 12.2

newstring = gsub("oldpat", "newpat", oldstring)

or

x = "oldpat123"
substr(x, start=1, stop=6) = "newpat"
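The sketch below contrasts gsub(), which replaces every occurrence of the pattern, with the related base R function sub(), which replaces only the first occurrence in each string. The input vector is made up for illustration.

```r
oldstring = c("oldpat and oldpat", "no match here")   # hypothetical input

all_replaced   = gsub("oldpat", "newpat", oldstring)  # every occurrence
first_replaced = sub("oldpat", "newpat", oldstring)   # first occurrence only

all_replaced
first_replaced
```

Strings that do not contain the pattern are returned unchanged by both functions.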

2.2.15 Split strings into multiple strings

strsplit(string, splitchar)

Note: The function strsplit() returns a list, each element of which is a vector containing the parts of the input, split at each occurrence of splitchar. If the input is a single character string, this is a list of one vector. If splitchar is the null string, then the function returns a list of vectors of single characters.

> x = "this is a test"
> strsplit(x, " ")
[[1]]
[1] "this" "is"   "a"    "test"

> strsplit(x, "")
[[1]]
[1] "t" "h" "i" "s" " " "i" "s" " " "a" " " "t" "e" "s" "t"
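Because the result is a list, extracting a particular piece from each input string takes one more step. A minimal sketch with a hypothetical two-element input:

```r
x = c("this is a test", "another example")   # hypothetical input

parts = strsplit(x, " ")                     # list with one vector per string

firstword = sapply(parts, function(p) p[1])  # first piece of each string
nwords = lengths(parts)                      # number of pieces per string

firstword
nwords
```

Here sapply() applies the extraction to each list element and simplifies the result back to a character vector.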

2.2.16 Remove spaces around string variables

noleadortrail = sub(' +$', '', sub('^ +', '', stringx))

Note: The arguments to sub() consist of a regular expression, a substitution value, and a vector. In the first step, leading spaces are removed, then a separate call to sub() is used to remove trailing spaces (in both cases replacing the spaces with the null string). If instead of spaces all trailing whitespace (e.g., tabs, space characters) should be removed, the regular expression ' +$' should be replaced by '[[:space:]]+$'.
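The effect of the two regular expressions can be compared on strings padded with spaces and tabs. The input is hypothetical; trimws() is a base R function (available in recent versions of R) that is not mentioned above and is shown only as a convenient alternative.

```r
stringx = c("  padded  ", "\ttabbed\t ", "clean")   # hypothetical input

# Spaces only: tabs at either end are left in place
spaces_only = sub(' +$', '', sub('^ +', '', stringx))

# All whitespace, using the [[:space:]] character class at both ends
all_space = sub('[[:space:]]+$', '', sub('^[[:space:]]+', '', stringx))

# Base R shortcut for trimming spaces, tabs, and newlines
trimmed = trimws(stringx)

spaces_only
all_space
```

In the second element only the final space is removed by the spaces-only version, since the string begins with a tab and the remaining trailing character is also a tab.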

2.2.17 Convert strings from upper to lower case

lowercasex = tolower(x)

or

lowercasex = chartr("ABCDEFGHIJKLMNOPQRSTUVWXYZ",
   "abcdefghijklmnopqrstuvwxyz", x)

Note: The toupper() function can be used to convert to upper case. Arbitrary translations from sets of characters can be made using the chartr() function. The iconv() function supports more complex encodings (e.g., from ASCII to other languages).
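A brief sketch of case conversion and of an arbitrary character translation with chartr(). The inputs are made up for illustration.

```r
x = c("Hello World", "R")        # hypothetical input

tolower(x)                       # all lower case
toupper(x)                       # all upper case

# chartr() maps each character of the first string to the character
# in the same position of the second string: a->x, b->y, c->z
translated = chartr("abc", "xyz", "aabbcc")
translated
```

The two translation strings passed to chartr() must line up position by position, which is why a transposed pair of letters in the second string silently swaps those letters in the output.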

2.2.18 Create lagged variable

A lagged variable has the value of that variable in a previous row (typically the immediately previous one) within that dataset. The value of lag for the first observation will be missing (see 11.4.4.1).

lag1 = c(NA, x[1:(length(x)-1)])

Note: This expression creates a one-observation lag, with a missing value in the first position, and the first through second-to-last observation for the remaining entries (see lag()). Here we demonstrate how to write a function to create lags of more than one observation.

lagk = function(x, k) {
   len = length(x)
   if (!floor(k)==k) {
      cat("k must be an integer")
   } else if (k<1 | k>(len-1)) {
      cat("k must be between 1 and length(x)-1")
   } else {
      return(c(rep(NA, k), x[1:(len-k)]))
   }
}

> lagk(1:10, 5)

[1] NA NA NA NA NA 1 2 3 4 5
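A common use of a one-observation lag is to compute the change between successive rows. The following is a minimal sketch with a hypothetical numeric vector; head(x, -1) is a base R alternative for dropping the last element.

```r
x = c(5, 7, 10, 14)                      # hypothetical series

lag1 = c(NA, x[1:(length(x)-1)])         # one-observation lag
change = x - lag1                        # change from the previous row

lag1b = c(NA, head(x, -1))               # same lag, written with head()

lag1
change
```

The first entry of change is missing because there is no previous observation to compare against.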

2.2.19 Formatting values of variables

Example: 6.6.2

See also 2.1.4 (labelling variables).

Sometimes it is useful to display category names that are more descriptive than variable names. In general, we do not recommend using this feature (except potentially for graphical output), as it tends to complicate communication between data analysts and other readers of output. In this example, character labels are associated with a numeric variable (0=Control, 1=Low Dose, and 2=High Dose).

> x = c(0, 0, 1, 1, 2); x
[1] 0 0 1 1 2
> x = factor(x, 0:2, labels=c("Control", "Low Dose", "High Dose")); x
[1] Control   Control   Low Dose  Low Dose  High Dose
Levels: Control Low Dose High Dose

Note: The rownames() function can be used to associate a variable with an identifier (which is by default the observation number). As an example, this can be used to display the name of a region with the value taken by a particular variable measured in that region. The setNames() function can also be used to set the names on an object.
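The region example might be sketched as follows. The dataframe, region names, and counts are all hypothetical.

```r
# Hypothetical measurements for three regions
ds = data.frame(rate = c(4.1, 5.3, 3.8))
rownames(ds) = c("North", "South", "West")   # identifiers replace 1, 2, 3

ds                        # rows are displayed with region names
ds["South", "rate"]       # rows can also be selected by name

# setNames() attaches names to a vector in a single step
counts = setNames(c(12, 7), c("Control", "Treated"))
counts["Treated"]
```

Named rows and named vector elements can be indexed by name, which tends to be more readable than indexing by position.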

2.2.20 Perl interface

Perl is a high-level general-purpose programming language [154]. The RSPerl package provides a bidirectional interface between Perl and R.

2.2.21 Accessing databases using SQL

Example: 12.7

Structured Query Language (SQL) is a flexible language for accessing and modifying databases, data warehouses, and distributed systems. These interfaces are particularly useful when an-
