Data management and related tasks

12.1.1 Finding two closest values in a vector

Suppose we need to find the closest pair of observations for some variable. This might arise if we were concerned that some data had been accidentally duplicated. In this case study, we return the IDs of the two closest observations and their distance from each other. We’ll first create some sample data and sort it, recognizing that the smallest difference must come between two observations that are adjacent after sorting.

We begin by generating data (3.1.6), along with some subject identifiers (2.3.4).

> options(digits=3)

> ds = data.frame(x=rnorm(8), id=1:8)

Then, we sort the data. The order() function (2.3.10) is used to keep track of the sorted random variables.

> options(digits=3)

> ds = ds[order(ds$x),]

> ds

       x id
5 -0.893  5
4 -0.729  4
3 -0.609  3
2 -0.518  2
7 -0.436  7
1  0.180  1
6  1.222  6
8  1.387  8

We can use the diff() function to get the differences between observations. The which.min() function extracts the index (location within the vector) of the smallest value. We apply this function to the diffx vector to find the location and extract that location from the id vector.

> diffx = diff(ds$x)

> min(diffx)
[1] 0.0822

> with(ds, id[which.min(diffx)])      # first val
[1] 2

> with(ds, id[which.min(diffx) + 1])  # second val
[1] 7
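The sort, difference, and lookup steps can be collected into a single function. The following is a minimal sketch, not part of the original case study: the name closest_pair() and its arguments are hypothetical, and the input values are copied (rounded to three digits) from the sorted listing above.

```r
# Hypothetical helper (not from the text): returns the IDs of the two
# closest values in x along with the distance between them.
closest_pair = function(x, id = seq_along(x)) {
  ord = order(x)        # permutation that sorts x
  gaps = diff(x[ord])   # differences between sorted neighbors
  k = which.min(gaps)   # position of the smallest gap
  list(ids = id[ord][c(k, k + 1)], distance = gaps[k])
}

# Rounded values from the listing above, arranged in id order 1 to 8
x = c(0.180, -0.518, -0.609, -0.729, -0.893, 1.222, -0.436, 1.387)
res = closest_pair(x)
res$ids        # 2 7, matching the result found step by step
res$distance   # about 0.082 (inputs are rounded to three digits)
```

Because the function works on a copy of the sorted values, the original data frame does not need to be reordered.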

12.1.2 Tabulate binomial probabilities

Suppose we wanted to assess the probability P(X=x) for a binomial random variate with n = 10 and with p = .81, .84, ..., .99. This could be helpful, for example, in various game settings.

We make a vector of the binomial probabilities, using the : operator (2.3.4) to generate a sequence of integers. After creating an empty matrix (3.3) to hold the table results, we loop (4.1.1) through the binomial probabilities, calling the dbinom() function (3.1.1) to find the probability that the random variable takes on that particular value. This calculation is nested within the round() function (3.2.4) to reduce the digits displayed. Finally, we include the vector of binomial probabilities with the results using cbind().

> p = .78 + (3 * 1:7)/100

> allprobs = matrix(nrow=length(p), ncol=11)

> for (i in 1:length(p)) {
    allprobs[i,] = round(dbinom(0:10, 10, p[i]),2)
  }

> table = cbind(p, allprobs)

> table
        p
[1,] 0.81 0 0 0 0 0 0.02 0.08 0.19 0.30 0.29 0.12
[2,] 0.84 0 0 0 0 0 0.01 0.05 0.15 0.29 0.33 0.17
[3,] 0.87 0 0 0 0 0 0.00 0.03 0.10 0.25 0.37 0.25
[4,] 0.90 0 0 0 0 0 0.00 0.01 0.06 0.19 0.39 0.35
[5,] 0.93 0 0 0 0 0 0.00 0.00 0.02 0.12 0.36 0.48
[6,] 0.96 0 0 0 0 0 0.00 0.00 0.01 0.05 0.28 0.66
[7,] 0.99 0 0 0 0 0 0.00 0.00 0.00 0.00 0.09 0.90
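The same table can be built without an explicit loop. The sketch below is one possible alternative rather than the approach used in the text: it applies dbinom() to each probability with sapply() and transposes the result. The names probs and binomtab, and the column labels, are introduced here for illustration only.

```r
# Probabilities .81, .84, ..., .99, generated as in the text
p = .78 + (3 * 1:7)/100

# sapply() returns one column per value of p, so transpose to get
# one row per value of p and one column per outcome x = 0, ..., 10
probs = t(sapply(p, function(prob) round(dbinom(0:10, 10, prob), 2)))
colnames(probs) = paste0("x=", 0:10)   # illustrative column labels

binomtab = cbind(p, probs)
binomtab
```

Naming the result binomtab rather than table avoids masking the base R function table().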

12.1.3 Calculate and plot a running average

The Law of Large Numbers concerns the convergence of the arithmetic average to the expected value, as sample sizes increase. This is an important topic in mathematical statistics.

The convergence (or lack thereof, for certain distributions) can easily be visualized [68].

We define a function (4.2) to calculate the running average for a given vector, allowing for variates from many distributions to be generated.


> runave = function(n, gendist, ...) {
    x = gendist(n, ...)
    avex = numeric(n)
    for (k in 1:n) {
      avex[k] = mean(x[1:k])
    }
    return(data.frame(x, avex))
  }

The runave() function takes, at a minimum, two arguments: a sample size n and a function (4.2) denoted by gendist that is used to generate samples from a distribution (3.1). In addition, other options for the function can be specified, using the ... syntax (see 4.2).

This is used, for example, to specify the degrees of freedom for the samples generated for the t distribution in the next code block. The loop in the runave() function could be eliminated through use of the cumsum() function applied to the vector given as an argument, and then divided by a vector of observation numbers.
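A loop-free version along the lines just described might look as follows. This is a sketch under the assumption that the same arguments are kept; the name runave_cumsum() is hypothetical and does not appear in the text.

```r
# Hypothetical loop-free variant of runave(): the running average at
# position k is the cumulative sum up to k divided by k.
runave_cumsum = function(n, gendist, ...) {
  x = gendist(n, ...)             # draw n variates; ... is passed through
  avex = cumsum(x) / seq_len(n)   # running sums divided by 1, 2, ..., n
  data.frame(x, avex)
}

# Usage: running average of 1000 draws from a t distribution with 4 df
set.seed(1984)
t4check = runave_cumsum(1000, rt, 4)
head(t4check)
```

The loop version recomputes a mean over a growing prefix at every step, so its cost grows quadratically in n, while the cumsum() version makes a single pass over the vector.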

Next, we generate the data, using our new function. To make sure we have a nice example, we first set a fixed seed (3.1.3). Recall that because the expectation of a Cauchy random variable is undefined [137], the sample average does not converge to the center, while that of a t distribution with more than 1 degree of freedom does.

> vals = 1000

> set.seed(1984)

> cauchy = runave(vals, rcauchy)

> t4 = runave(vals, rt, 4)

Now we can plot the results. We begin with an empty plot with the correct axis limits, using the type="n" specification (8.3.1). We add the running average using the lines() function (9.1.1) and varying the line style (9.2.11) and thickness (9.2.12) with the lty and lwd specifications, respectively. Finally, we specify a title (9.1.9) and a legend (9.1.15). The results are displayed in Figure 12.1.

12.1.4 Create a Fibonacci sequence

The Fibonacci numbers have many mathematical relationships and have been discovered repeatedly in nature. They are constructed as the sum of the previous two values, initialized with the values 1 and 1. It’s convenient to use a for loop, though other approaches (e.g., recursion) could be used.

> len = 10

> fibvals = numeric(len)

> fibvals[1] = 1

> fibvals[2] = 1

> for (i in 3:len) {

fibvals[i] = fibvals[i-1] + fibvals[i-2]

}

> fibvals

[1] 1 1 2 3 5 8 13 21 34 55
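For comparison, the recursive approach mentioned above can be sketched as follows. The function name fib() is introduced here for illustration and is not part of the original example.

```r
# Hypothetical recursive definition: fib(1) = fib(2) = 1, and each later
# value is the sum of the previous two.
fib = function(n) {
  if (n <= 2) {
    return(1)
  }
  fib(n - 1) + fib(n - 2)
}

sapply(1:10, fib)   # 1 1 2 3 5 8 13 21 34 55
```

This version mirrors the definition directly, but it recomputes the same values many times, so the loop is the better choice for anything beyond small n.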

The following code produces Figure 12.1, the running average plot described in 12.1.3.

> plot(c(cauchy$avex, t4$avex), xlim=c(1, vals), type="n")

> lines(1:vals, cauchy$avex, lty=1, lwd=2)

> lines(1:vals, t4$avex, lty=2, lwd=2)

> abline(0, 0)

> legend(vals*.6, -1, legend=c("Cauchy", "t with 4 df"), lwd=2, lty=c(1, 2))

Figure 12.1: Running average for Cauchy and t distributions (x-axis: Index; y-axis: c(cauchy$avex, t4$avex); legend: Cauchy, t with 4 df)

Derived variables and data manipulation

Merging, combining, and subsetting datasets