14.2 The Dreaded for Loop
14.2.2 Extended Example: Achieving Better Speed in a Monte Carlo
In some applications, simulation code can run for hours, days, or even months, so speedup methods are of high interest. Here, we’ll look at two simulation examples.
To begin, let’s consider the following code from Section 8.6:
sum <- 0
nreps <- 100000
for (i in 1:nreps) {
   xy <- rnorm(2)  # generate 2 N(0,1)s
   sum <- sum + max(xy)
}
print(sum/nreps)
Here’s a revision (hopefully faster):
nreps <- 100000
xymat <- matrix(rnorm(2*nreps),ncol=2)
maxs <- pmax(xymat[,1],xymat[,2])
print(mean(maxs))
In this code, we generate all the random variates at once, storing them in a matrix xymat, with one (X,Y) pair per row:
xymat <- matrix(rnorm(2*nreps),ncol=2)
Next, we find all the max(X,Y) values, storing those values in maxs, and then simply call mean().
It’s easier to program, and we believe it will be faster. Let’s check that.
I had the original code in the file MaxNorm.R and the improved version in MaxNorm2.R.
> system.time(source("MaxNorm.R"))
[1] 0.5667599
   user  system elapsed
  1.700   0.004   1.722
> system.time(source("MaxNorm2.R"))
[1] 0.5649281
   user  system elapsed
  0.132   0.008   0.143
The speedup is dramatic, once again.
NOTE We achieved an increase in speed, at the expense of using more memory, by keeping our random numbers in an array instead of generating and discarding them one pair at a time. As mentioned earlier, the time/space trade-off is a common one in the computing world and in the R world in particular.
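The trade-off need not be all-or-nothing. As a sketch (this chunked variant and the name maxnorm.chunked are our own illustration, not code from the text, and the default chunk size is arbitrary), we can generate the pairs in bounded-size chunks, capping memory use while keeping most of the vectorization benefit:

```r
# Middle ground between the looping and fully vectorized versions:
# generate random pairs in chunks of bounded size, applying pmax()
# to each chunk. Memory use is limited to one chunk at a time.
maxnorm.chunked <- function(nreps, chunksize=10000) {
   total <- 0
   done <- 0
   while (done < nreps) {
      m <- min(chunksize, nreps - done)           # size of this chunk
      xymat <- matrix(rnorm(2*m), ncol=2)         # one (X,Y) pair per row
      total <- total + sum(pmax(xymat[,1], xymat[,2]))
      done <- done + m
   }
   total/nreps
}
```

Larger chunks run faster but use more memory; chunksize is the tuning knob for the time/space trade-off.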
We attained an excellent speedup in this example, but it was misleadingly easy. Let's look at a slightly more complicated example.
Our next example is a classic exercise from elementary probability courses. Urn 1 contains ten blue marbles and eight yellow ones. In urn 2, the mixture is six blue and six yellow. We draw a marble at random from urn 1, transfer it to urn 2, and then draw a marble at random from urn 2.
What is the probability that that second marble is blue? This is easy to find analytically, but we’ll use simulation. Here is the straightforward way:
# perform nreps repetitions of the marble experiment, to estimate
# P(pick blue from Urn 2)
sim1 <- function(nreps) {
   nb1 <- 10  # 10 blue marbles in Urn 1
   n1 <- 18  # number of marbles in Urn 1 at 1st pick
   n2 <- 13  # number of marbles in Urn 2 at 2nd pick
   count <- 0  # number of repetitions in which get blue from Urn 2
   for (i in 1:nreps) {
      nb2 <- 6  # 6 blue marbles initially in Urn 2
      # pick from Urn 1 and put in Urn 2; is it blue?
      if (runif(1) < nb1/n1) nb2 <- nb2 + 1
      # pick from Urn 2; is it blue?
      if (runif(1) < nb2/n2) count <- count + 1
   }
   return(count/nreps)  # est. P(pick blue from Urn 2)
}
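As an aside, the analytical answer mentioned above follows from the law of total probability, conditioning on the color of the transferred marble; a one-line check (our own addition, for comparison with the simulation output):

```r
# P(2nd blue) = P(transfer blue) P(blue | 7 of 13 blue)
#             + P(transfer yellow) P(blue | 6 of 13 blue)
p.exact <- (10/18) * (7/13) + (8/18) * (6/13)
p.exact  # 118/234, about 0.5043
```

The simulation estimates below should hover near this value.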
Here is how we can do it without loops, using apply():
sim2 <- function(nreps) {
   nb1 <- 10
   nb2 <- 6
   n1 <- 18
   n2 <- 13
   # pre-generate all our random numbers, one row per repetition
   u <- matrix(runif(2*nreps),nrow=nreps,ncol=2)
   # define simfun for use in apply(); simulates one repetition
   simfun <- function(rw) {
      # rw ("row") is a pair of random numbers
      # choose from Urn 1
      if (rw[1] < nb1/n1) nb2 <- nb2 + 1
      # choose from Urn 2, and return boolean on choosing blue
      return(rw[2] < nb2/n2)
   }
   z <- apply(u,1,simfun)
   # z is a vector of booleans, but they can be treated as 1s and 0s
   return(mean(z))
}
Here, we set up a matrix u with two columns of U(0,1) random variates.
The first column is used for our simulation of drawing from urn 1, and the second for drawing from urn 2. This way, we generate all our random numbers at once, which might save a bit of time, but the main point is to set up for using apply(). Toward that goal, our function simfun() works on one repetition of the experiment, that is, one row of u. We set up the call to apply() to go through all of the nreps repetitions.
Note that since the function simfun() is declared within sim2(), the locals of sim2() (n1, n2, nb1, and nb2) are available as globals of simfun(). Also, since a Boolean vector will automatically be changed by R to 1s and 0s, we can find the fraction of TRUE values in the vector by simply calling mean().
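That coercion of Booleans is easy to verify at the interactive prompt; this small snippet (our own illustration) shows the idea:

```r
# Logical vectors are coerced to 1s and 0s in arithmetic contexts,
# so mean() gives the proportion of TRUE values and sum() the count
z <- c(TRUE, FALSE, TRUE, TRUE)
mean(z)  # 0.75, the fraction of TRUEs
sum(z)   # 3, the number of TRUEs
```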
Now, let’s compare performance.
> system.time(print(sim1(100000)))
[1] 0.5086
   user  system elapsed
  2.465   0.028   2.586
> system.time(print(sim2(100000)))
[1] 0.5031
   user  system elapsed
  2.936   0.004   3.027
In spite of the many benefits of functional programming, this approach using apply() didn't help. Instead, things got worse. Since this could be simply due to random sampling variation, I ran the code several times again, with similar results.
So, let’s look at vectorizing this simulation.
sim3 <- function(nreps) {
   nb1 <- 10
   nb2 <- 6
   n1 <- 18
   n2 <- 13
   u <- matrix(runif(2*nreps),nrow=nreps,ncol=2)
   # set up the condition vector
   cndtn <- u[,1] <= nb1/n1 & u[,2] <= (nb2+1)/n2 |
            u[,1] > nb1/n1 & u[,2] <= nb2/n2
   return(mean(cndtn))
}
The main work is done in this statement:
cndtn <- u[,1] <= nb1/n1 & u[,2] <= (nb2+1)/n2 | u[,1] > nb1/n1 & u[,2] <= nb2/n2
To get that, we reasoned out which conditions would lead to choosing a blue marble on the second pick, coded them, and then assigned them to cndtn.
Remember that <= and & are functions; in fact, they are vector functions, so they should be fast. Sure enough, this brings quite an improvement:
> system.time(print(sim3(100000)))
[1] 0.4987
   user  system elapsed
  0.060   0.016   0.076
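The point that <= and & really are functions, operating elementwise on whole vectors, can be seen by calling them in ordinary prefix form; this snippet (our own illustration) makes it concrete:

```r
# comparison and logical operators are vectorized functions in R;
# each call below processes the whole vector in one internal pass
x <- c(0.2, 0.7, 0.5)
x <= 0.5                 # TRUE FALSE TRUE, elementwise comparison
`<=`(x, 0.5)             # the same operator, called in prefix form
(x <= 0.5) & (x > 0.1)   # vectorized AND of two logical vectors
```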
In principle, the approach we took to speed up the code here could be applied to many other Monte Carlo simulations. However, it's clear that the analog of the statement that computes cndtn would quickly become quite complex, even for seemingly simple applications.
Moreover, the approach would not work in "infinite-stage" situations, meaning those with an unlimited number of time steps. Here, we are considering the marble example as being two-stage, with two columns in the matrix u.