Extended Example: A Procedure for Polynomial Regression


As another example, consider a statistical regression setting with one predictor variable. Since any statistical model is merely an approximation, in principle, you can get better and better models by fitting polynomials of higher and higher degrees. However, at some point, this becomes overfitting, so that the prediction of new, future data actually deteriorates for degrees higher than some value.

The class"polyreg"aims to deal with this issue. It fits polynomials of var- ious degrees but assesses fits via cross-validation to reduce the risk of over- fitting. In this form of cross-validation, known as theleaving-one-out method, for each point we fit the regression to all the dataexceptthis observation, and then we predict that observation from the fit. An object of this class consists of outputs from the various regression models, plus the original data.

The following is the code for the "polyreg" class.

1 # "polyreg," S3 class for polynomial regression in one predictor variable

2

3 # polyfit(y,x,maxdeg) fits all polynomials up to degree maxdeg; y is

4 # vector for response variable, x for predictor; creates an object of

5 # class "polyreg"

6 polyfit <- function(y,x,maxdeg) {

7 # form powers of predictor variable, ith power in ith column

8 pwrs <- powers(x,maxdeg) # could use orthog polys for greater accuracy

9 lmout <- list() # start to build class

11 for (i in 1:maxdeg) {

12 lmo <- lm(y ~ pwrs[,1:i])

13 # extend the lm class here, with the cross-validated predictions

14 lmo$fitted.cvvalues <- lvoneout(y,pwrs[,1:i,drop=F])

15 lmout[[i]] <- lmo

16 }

17 lmout$x <- x

18 lmout$y <- y

19 return(lmout)

20 }

21
22 # print() for an object fits of class "polyreg": print
23 # cross-validated mean-squared prediction errors
24 print.polyreg <- function(fits) {
25    maxdeg <- length(fits) - 2  # fits holds maxdeg lm objects plus $x and $y
26    n <- length(fits$y)
27    tbl <- matrix(nrow=maxdeg,ncol=1)
28    colnames(tbl) <- "MSPE"
29    for (i in 1:maxdeg) {
30       fi <- fits[[i]]
31       errs <- fits$y - fi$fitted.cvvalues
32       spe <- crossprod(errs,errs)  # sum of squared prediction errors
33       tbl[i,1] <- spe/n
34    }
35    cat("mean squared prediction errors, by degree\n")
36    print(tbl)
37 }

38
39 # forms matrix of powers of the vector x, through degree dg
40 powers <- function(x,dg) {
41    pw <- matrix(x,nrow=length(x))
42    prod <- x  # running power of x
43    for (i in 2:dg) {  # assumes dg >= 2
44       prod <- prod * x
45       pw <- cbind(pw,prod)
46    }
47    return(pw)
48 }

49
50 # finds cross-validated predicted values; could be made much faster via
51 # matrix-update methods
52 lvoneout <- function(y,xmat) {
53    n <- length(y)
54    predy <- vector(length=n)
55    for (i in 1:n) {
56       # regress, leaving out ith observation
57       lmo <- lm(y[-i] ~ xmat[-i,])
58       betahat <- as.vector(lmo$coef)
59       # the 1 accommodates the constant term
60       predy[i] <- betahat %*% c(1,xmat[i,])
61    }
62    return(predy)
63 }

64
65 # polynomial function of x, coefficients cfs
66 poly <- function(x,cfs) {
67    val <- cfs[1]
68    prod <- 1
69    dg <- length(cfs) - 1
70    for (i in 1:dg) {
71       prod <- prod * x
72       val <- val + cfs[i+1] * prod
73    }
74    return(val)  # explicit return; a bare for loop would yield NULL
75 }

As you can see, "polyreg" consists of polyfit(), the constructor function, and print.polyreg(), a print function tailored to this class. It also contains several utility functions to evaluate powers and polynomials and to perform cross-validation. (Note that in some cases here, efficiency has been sacrificed for clarity.)
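One of those trade-offs is flagged in the comment on line 8: raw powers of x become numerically collinear at high degrees, while orthogonal polynomials do not. Here is a hedged sketch of a drop-in replacement for powers(), assuming only the column span of the basis matters to the fits. (powers_orth is our name, not the book's; stats::poly must be fully qualified because the listing's own poly() on line 66 masks it.)

   # orthogonal-polynomial basis, one column per degree 1..dg
   powers_orth <- function(x,dg) {
      unclass(stats::poly(x,dg))  # strip the "poly" class, keep the matrix
   }

With this basis, the NA fits at high degrees in the example below would likely not occur.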

As an example of using the class, we'll generate some artificial data and create an object of class "polyreg" from it, printing out the results.

> n <- 60
> x <- (1:n)/n
> y <- vector(length=n)
> for (i in 1:n) y[i] <- sin((3*pi/2)*x[i]) + x[i]^2 + rnorm(1,mean=0,sd=0.5)
> dg <- 15
> (lmo <- polyfit(y,x,dg))
mean squared prediction errors, by degree
           MSPE
 [1,] 0.4200127
 [2,] 0.3212241
 [3,] 0.2977433
 [4,] 0.2998716
 [5,] 0.3102032
 [6,] 0.3247325
 [7,] 0.3120066
 [8,] 0.3246087
 [9,] 0.3463628
[10,] 0.4502341
[11,] 0.6089814
[12,] 0.4499055
[13,]        NA
[14,]        NA
[15,]        NA

Note first that we used a common R trick in this command:

> (lmo <- polyfit(y,x,dg))

By surrounding the entire assignment statement in parentheses, we get the printout and form lmo at the same time, in case we need the latter for other things.
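The trick works for any assignment:

> u <- 5
> (v <- 5)
[1] 5
> v
[1] 5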

The function polyfit() fits polynomial models up through a specified degree, in this case 15, calculating the cross-validated mean squared prediction error for each model. The last few values in the output were NA, because roundoff error considerations led R to refuse to fit polynomials of degrees that high.
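You can see the mechanism directly: lm() flags numerically collinear columns as aliased and reports NA for their coefficients, and those NAs then propagate through the betahat %*% c(1,xmat[i,]) step of lvoneout() into the MSPE table. A quick check on the same data (results vary from run to run):

> pwrs <- powers(x,15)
> tail(coef(lm(y ~ pwrs)))   # expect NAs for the highest-degree terms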

So, how is it all done? The main work is handled by the function polyfit(), which creates an object of class "polyreg". That object consists mainly of the objects returned by the R regression fitter lm() for each degree.
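Continuing the session above, you can pull the pieces out directly; the 15 fits are unnamed list components, followed by the data:

> class(lmo[[3]])
[1] "lm"
> names(lmo)[13:17]
[1] ""  ""  ""  "x" "y"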

In forming those objects, note line 14:

lmo$fitted.cvvalues <- lvoneout(y,pwrs[,1:i,drop=F])

Here, lmo is an object returned by lm(), but we are adding an extra component to it: fitted.cvvalues. Since we can add a new component to a list at any time, and since S3 classes are lists, this is possible.
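A toy illustration of the same point:

> w <- list(a=1)
> w$b <- "new component"   # lists, and hence S3 objects, grow on demand
> names(w)
[1] "a" "b"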

We also have a method for the generic function print(), print.polyreg(), in line 24. In Section 12.1.5, we will add a method for the plot() generic function, plot.polyreg().
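Dispatch is automatic: because the class attribute set in line 10 is "polyreg", the generic routes to our method.

> class(lmo)
[1] "polyreg"
> print(lmo)   # same as typing lmo alone; prints the MSPE table shown earlier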

In computing prediction errors, we used cross-validation, or the leaving-one-out method, in a form that predicts each observation from all the others. To implement this, we take advantage of R's use of negative subscripts in line 57:

lmo <- lm(y[-i] ~ xmat[-i,])

So, we are fitting the model with the ith observation deleted from our data set.
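For readers new to the idiom, negative subscripts exclude rather than select:

> z <- c(5,12,13)
> z[-1]      # all but the first element
[1] 12 13
> z[-(2:3)]  # all but the second and third
[1] 5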

NOTE As mentioned in the comment in the code, we could make a much faster implementation by using a matrix-inverse update method, known as the Sherman-Morrison-Woodbury formula. For more information, see J. H. Venter and J. L. J. Snyman, "A Note on the Generalised Cross-Validation Criterion in Linear Model Selection," Biometrika, Vol. 82, no. 1, pp. 215–219.
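The speedup rests on a standard linear-model identity: the leave-one-out residual equals the ordinary residual divided by one minus that observation's leverage, a direct consequence of the Sherman-Morrison-Woodbury update. A hedged sketch (fast_loocv_mspe is our name, not the book's):

   # leave-one-out MSPE for a single lm() fit, without refitting:
   # e_(i) = e_i / (1 - h_ii), with h_ii the ith diagonal of the hat matrix
   fast_loocv_mspe <- function(fit) {
      loo <- resid(fit) / (1 - hatvalues(fit))
      mean(loo^2)
   }

Replacing the loop in lvoneout() with this identity would turn n regressions per degree into one.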
