Often your available variables aren’t quite good enough to meet your modeling goals.
The most powerful way to get new variables is to get new, better measurements from the domain expert. But acquiring new measurements may not be practical, so you'd also use methods that create new variables from combinations of the measurements you already have at hand. We call these new variables synthetic to emphasize that they're synthesized from combinations of existing variables and don't represent actual new measurements. Kernel methods are one way to produce new variables from old and to increase the power of machine learning methods.9 With enough synthetic variables, data where points from different classes are mixed together can often be lifted to a space where the points from each class are grouped together and separated from out-of-class points.
One misconception about kernel methods is that they're automatic or self-adjusting. They're not; beyond a few "automatic bandwidth adjustments," it's up to the data scientist to specify a useful kernel instead of the kernel being automatically found from the data. But many of the standard kernels (inner-product, Gaussian, and cosine) are so useful that it's often profitable to try a few kernels to see what improvements they offer.

9 The standard method to create synthetic variables is to add interaction terms. An interaction between variables occurs when the change in outcome due to two (or more) variables acting together is more than the sum of the changes due to each variable alone. For example, too high a sodium intake will increase the risk of hypertension, but this increase is disproportionately higher for people with a genetic susceptibility to hypertension. The probability of becoming hypertensive is a function of the interaction of the two factors (diet and genetics). For details on using interaction terms in R, see help('formula'). In models such as lm(), you can introduce an interaction between variables x and y by adding the term x:y (or x*y, which also includes both main effects) to the model formula.
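As a minimal sketch (not from the book) of the footnote's point, here is an interaction term in an lm() formula using the sodium/hypertension example; the data frame, column names, and simulated coefficients are made-up assumptions for illustration only.

set.seed(355)
d <- data.frame(sodium = runif(200, 1, 5),
                genetic_risk = rbinom(200, 1, 0.3))
# Simulate an outcome where the effect of sodium is larger for the
# genetically susceptible group (an interaction).
d$hypertension_score <- 0.10*d$sodium + 0.20*d$genetic_risk +
   0.30*d$sodium*d$genetic_risk + rnorm(200, sd = 0.25)
# sodium:genetic_risk adds just the interaction term;
# sodium*genetic_risk would add both main effects plus the interaction.
mInt <- lm(hypertension_score ~ sodium + genetic_risk + sodium:genetic_risk,
           data = d)
summary(mInt)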
THE WORD KERNEL IS USED IN MANY DIFFERENT SENSES The word kernel has many different, incompatible definitions in mathematics and statistics. The machine learning sense of the word used here is taken from operator theory and the sense used in Mercer's theorem. The kernels we want are two-argument functions that behave a lot like an inner product. The other common (incompatible) statistical use of kernel is in density estimation, where kernels are single-argument functions that represent probability density functions or distributions.
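A tiny sketch (not from the book) of the contrast; dnorm() stands in for a density-estimation kernel, and the two-argument function is the inner-product-like kind used in this section.

densityKernel <- function(x) { dnorm(x) }             # one argument: a probability density
mlKernel      <- function(u, v) { as.numeric(u %*% v) }  # two arguments: behaves like an inner product
densityKernel(0)          # [1] 0.3989423
mlKernel(c(1,2), c(3,4))  # [1] 11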
In the next few sections, we'll work through the definition of a kernel function, give a few examples of transformations that can be implemented by kernels (and a few that can't), and then work through some concrete uses.
9.3.1 Understanding kernel functions
To understand kernel functions, we’ll work through the definition, why they’re useful, and some examples of important kernel functions.
FORMAL DEFINITION OF A KERNEL FUNCTION
In our application, a kernel is a function with a very specific definition. Let u and v be any pair of variables. u and v are typically vectors of input or independent variables (possibly taken from two rows of a dataset). A function k(,) that maps pairs (u,v) to numbers is called a kernel function if and only if there is some function phi() mapping u and v into a (possibly higher-dimensional) vector space such that k(u,v) = phi(u) %*% phi(v) for all u,v.10 We'll informally call the expression k(u,v) = phi(u) %*% phi(v) the Mercer expansion of the kernel (in reference to Mercer's theorem; see http://mng.bz/xFD2) and consider phi() the certificate that tells us k(,) is a good kernel. This is much easier to understand from a concrete example. In listing 9.16, we'll develop an example function k(,) and the matching phi() that demonstrates that k(,) is in fact a kernel over two-dimensional vectors.
Listing 9.16 An artificial kernel example

> u <- c(1,2)
> v <- c(3,4)
> # Define a function of two vector variables (both two-dimensional) as
> # the sum of various products of terms.
> k <- function(u,v) {
    u[1]*v[1] + u[2]*v[2] +
      u[1]*u[1]*v[1]*v[1] + u[2]*u[2]*v[2]*v[2] +
      u[1]*u[2]*v[1]*v[2]
  }
> # Define a function of a single vector variable that returns a vector
> # containing the original entries plus all products of entries.
> phi <- function(x) {
    x <- as.numeric(x)
    c(x,x*x,combn(x,2,FUN=prod))
  }
> print(k(u,v))      # Example evaluation of k(,).
[1] 108
> print(phi(u))
[1] 1 2 1 4 2
> print(phi(v))
[1]  3  4  9 16 12
> # Confirm phi() agrees with k(,). phi() is the certificate that shows
> # k(,) is in fact a kernel.
> print(as.numeric(phi(u) %*% phi(v)))
[1] 108

10 %*% is R's notation for dot product or inner product; see help('%*%') for details. Note that phi() is allowed to map to very large (and even infinite) vector spaces.
Figure 9.7 illustrates11 what we hope for from a good kernel: our data being pushed around so it's easier to sort or classify. By using a kernel transformation, we move to a situation where the distinction we're trying to learn is representable by a linear operator in our transformed data.

11 See Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.
[Figure 9.7 Notional illustration of a kernel transform (based on Cristianini and Shawe-Taylor, 2000). The left panel, "Linearly Inseparable Data," shows an "o" inside a triangle of "x"s, meaning there is no way to separate the x's from the o's using a single straight line (a linear separator). After the kernel transform, the right panel, "Linearly Separable Data," shows the mapped points (x) and (o) grouped so that a straight line does separate the classes.]
Most kernel methods use the function k(,) directly and only use properties of k(,) guaranteed by the matching phi() to ensure method correctness. The k(,) function is usually quicker to compute than the notional function phi(). A simple example of this is what we'll call the dot-product similarity of documents. The dot-product document similarity is defined as the dot product of two vectors where each vector is derived from a document by building a huge vector of indicators, one for each possible feature. For instance, if the features you're considering are word pairs, then for every pair of words in a given dictionary, the document gets a feature of 1 if the pair occurs consecutively in the document and 0 if not. This method is the phi(), but in practice we never use the phi() procedure. Instead, for one document each consecutive pair of words is generated and a bit is added to the score if this pair is both in the dictionary and found consecutively in the other document. For moderate-sized documents and large dictionaries, this direct k(,) implementation is vastly more efficient than the phi() implementation.
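As an illustration (not from the book), here is a minimal sketch of that dot-product document similarity computed directly as k(,), without ever materializing the huge indicator vector that phi() describes. The pairKernel() function, the tiny dictionary, and the example sentences are all assumptions made up for this sketch.

# Score two documents by counting dictionary word pairs that occur
# consecutively in both documents (the dot product of the 0/1 indicator
# vectors phi() would have built).
pairKernel <- function(docA, docB, dictionary) {
  toPairs <- function(doc) {
    words <- strsplit(tolower(doc), '[^a-z]+')[[1]]
    words <- words[nchar(words) > 0]
    if(length(words) < 2) return(character(0))
    paste(words[-length(words)], words[-1])   # consecutive word pairs
  }
  pairsA <- unique(toPairs(docA))
  pairsB <- unique(toPairs(docB))
  length(intersect(intersect(pairsA, pairsB), dictionary))
}

dictionary <- c('machine learning', 'kernel methods', 'data science')
pairKernel('we use kernel methods in machine learning',
           'kernel methods power much of machine learning', dictionary)
# [1] 2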
WHY ARE KERNEL FUNCTIONS USEFUL?
Kernel functions are useful for a number of reasons:
Inseparable datasets (data where examples from multiple training classes appear to be intermixed when plotted) become separable (and hence we can build a good classifier) under common nonlinear transforms. This is known as Cover's theorem (see the sketch after this list). Nonlinear kernels are a good complement to many linear machine learning techniques.
Many phi()s can be directly implemented during data preparation. Never be too proud to try some interaction variables in a model.
Some very powerful and expensive phi()s that can’t be directly implemented during data preparation have very efficient matching kernel functions k(,) that can be used directly in select machine learning algorithms without needing access to the highly complex phi().
All symmetric positive semidefinite functions k(,) mapping pairs of variables to the reals can be represented as k(u,v)=phi(u)%*%phi(v) for some function phi(). This is a consequence of Mercer’s theorem. So by restricting to functions with a Mercer expansion, we’re not giving up much.
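As a quick illustration of the separability point above (a sketch not taken from the book, with made-up data): points inside a circle can't be split from points outside it by any straight line in the original (x, y) coordinates, but adding one synthetic variable x^2 + y^2 makes a simple linear rule work.

set.seed(2014)
d <- data.frame(x = runif(200, -1, 1), y = runif(200, -1, 1))
d$class <- ifelse(d$x^2 + d$y^2 < 0.25, 'inside', 'outside')  # radially defined classes
d$r2 <- d$x^2 + d$y^2   # one synthetic (kernel-style) variable
# A threshold that is linear in the lifted variable r2 (though circular in
# x and y) recovers the classes exactly.
table(actual = d$class,
      predicted = ifelse(d$r2 < 0.25, 'inside', 'outside'))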
Our next goal is to demonstrate some useful kernels and some machine learning algorithms that use them efficiently to solve problems. The most famous kernelized machine learning algorithm is the support vector machine, which we'll demonstrate in section 9.4. But first it helps to demonstrate some useful kernels.
SOME IMPORTANT KERNEL FUNCTIONS
Let’s look at some practical uses for some important kernels in table 9.1.
Table 9.1 Some important kernels and their uses (kernel name: informal description and use)

Definitional (or explicit) kernels: Any method that explicitly adds additional variables (such as interactions) can be expressed as a kernel over the original data. These are kernels where you explicitly implement and use phi().

Linear transformation kernels: Any positive semidefinite linear operation (like projecting to principal components) can also be expressed as a kernel.

Gaussian or radial kernel: Many decreasing non-negative functions of distance can be expressed as kernels. This is also an example of a kernel where phi() maps into an infinite-dimensional vector space (essentially the Taylor series of exp()) and therefore phi(u) doesn't have an easy-to-implement representation (you must instead use k(,)).

Cosine similarity kernel: Many similarity measures (measures that are large for identical items and small for dissimilar items) can be expressed as kernels.

Polynomial kernel: Much is made of the fact that positive integer powers of kernels are also kernels. The derived kernel does have many more terms derived from powers and products of terms from the original kernel, but the modeling technique isn't able to independently pick coefficients for all of these terms simultaneously. Polynomial kernels do introduce some extra options, but they're not magic.

At this point, it's important to mention that not everything is a kernel. For example, the common squared distance function (k=function(u,v){(u-v)%*%(u-v)}) isn't a kernel. So kernels can express similarities, but can't directly express distances.12

Only now that we've touched on why some common kernels are useful is it appropriate to look at the formal mathematical definitions. Remember, we pick kernels for their utility, not because the mathematical form is exciting. Now let's take a look at six important kernel definitions.

Mathematical definitions of common kernels

A definitional kernel is any kernel that is an explicit inner product of two applications of a vector function:

   k(u, v) = phi(u) . phi(v)

The dot product or identity kernel is just the inner product applied to actual vectors of data:

   k(u, v) = u . v

A linear transformation kernel is a matrix form like the following:

   k(u, v) = u^T L^T L v

The Gaussian or radial kernel has the following form:

   k(u, v) = exp(-c ||u - v||^2)

The cosine similarity kernel is a rescaled dot product kernel:

   k(u, v) = (u . v) / sqrt((u . u) (v . v))

A polynomial kernel is a dot product with a transform (shift and power) applied as shown here:

   k(u, v) = (s (u . v) + c)^d

12 Some more examples of kernels (and how to build new kernels from old) can be found at
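As a companion to these definitions, here is a minimal sketch (not from the book) writing each kernel as a plain R function; the parameter defaults c, s, d, and the matrix L passed in are arbitrary assumptions chosen only for illustration.

dotKernel      <- function(u, v) { as.numeric(u %*% v) }
linTransKernel <- function(u, v, L) { as.numeric(t(L %*% u) %*% (L %*% v)) }  # u^T L^T L v
gaussKernel    <- function(u, v, c = 1) { exp(-c * sum((u - v)^2)) }
cosineKernel   <- function(u, v) { as.numeric(u %*% v) / sqrt(sum(u^2) * sum(v^2)) }
polyKernel     <- function(u, v, s = 1, c = 0, d = 2) { (s * as.numeric(u %*% v) + c)^d }

u <- c(1, 2); v <- c(3, 4)
dotKernel(u, v)                    # [1] 11
cosineKernel(u, v)                 # [1] 0.9838699
polyKernel(u, v)                   # [1] 121
linTransKernel(u, v, L = diag(2))  # [1] 11 (the identity L recovers the dot product)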
9.3.2 Using an explicit kernel on a problem
Let’s demonstrate explicitly choosing a kernel function on a problem we’ve already worked with.
REVISITING THE PUMS LINEAR REGRESSION MODEL
To demonstrate using a kernel on an actual problem, we'll reprepare the data used in section 7.1.3 to again build a model predicting the logarithm of income from a few other factors. We'll resume this analysis by using load() to reload the data from a copy of the file https://github.com/WinVector/zmPDSwR/raw/master/PUMS/psub.RData. Recall that the basic model (for purposes of demonstration) used only a few variables; we'll redo producing a stepwise improved linear regression model for log(PINCP).
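A minimal sketch (not one of the book's listings) of that reload step; the local file name psub.RData is an assumption.

download.file(
  'https://github.com/WinVector/zmPDSwR/raw/master/PUMS/psub.RData',
  destfile = 'psub.RData', mode = 'wb')
load('psub.RData')   # defines the data frame psub used in the listings below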
Listing 9.17 Applying stepwise linear regression to PUMS data

# Split data into test and training.
dtrain <- subset(psub,ORIGRANDGROUP >= 500)
dtest <- subset(psub,ORIGRANDGROUP < 500)
# Ask that the linear regression model we're building be stepwise improved,
# which is a powerful automated procedure for removing variables that don't
# seem to have significant impacts (can improve generalization performance).
m1 <- step(
   # Build the basic linear regression model.
   lm(log(PINCP,base=10) ~ AGEP + SEX + COW + SCHL,
      data=dtrain),
   direction='both')
# Define the RMSE function.
rmse <- function(y, f) { sqrt(mean( (y-f)^2 )) }
# Calculate the RMSE between the prediction and the actuals.
print(rmse(log(dtest$PINCP,base=10),
   predict(m1,newdata=dtest)))
# [1] 0.2752171
The quality of prediction was middling (the RMSE isn't that small), but the model exposed some of the important relationships. In a real project, you'd do your utmost to find new explanatory variables. But you'd also be interested to see if any combination of variables you were already using would help with prediction. We'll work through finding some of these combinations using an explicit phi().
INTRODUCING AN EXPLICIT TRANSFORM
Explicit kernel transforms are a formal way to unify ideas like reshaping variables and adding interaction terms.13
In listing 9.18, we’ll set up a phi() function and use it to build a new larger data frame with new modeling variables.
Listing 9.18 Applying an example explicit kernel transform

# Define our primal kernel function: map a vector to a copy of itself
# plus all square terms and cross-multiplied terms.
phi <- function(x) {
   x <- as.numeric(x)
   c(x,x*x,combn(x,2,FUN=prod))
}
# Define a function similar to our primal kernel, but working on variable
# names instead of values.
phiNames <- function(n) {
   c(n,paste(n,n,sep=':'),
     combn(n,2,FUN=function(x) {paste(x,collapse=':')}))
}
# Convert data to a matrix where all categorical variables are encoded as
# multiple numeric indicators.
modelMatrix <- model.matrix(~ 0 + AGEP + SEX + COW + SCHL,psub)
# Remove problematic characters from matrix column names.
colnames(modelMatrix) <- gsub('[^a-zA-Z0-9]+','_',
   colnames(modelMatrix))
# Apply the primal kernel function to every row of the matrix and transpose
# results so they're written as rows (not as a list as returned by apply()).
pM <- t(apply(modelMatrix,1,phi))
# Extend names from the original matrix to names for compound variables in
# the new matrix.
vars <- phiNames(colnames(modelMatrix))
vars <- gsub('[^a-zA-Z0-9]+','_',vars)
colnames(pM) <- vars
pM <- as.data.frame(pM)
# Add in outcomes, test/train split columns, and prepare new data for modeling.
pM$PINCP <- psub$PINCP
pM$ORIGRANDGROUP <- psub$ORIGRANDGROUP
pMtrain <- subset(pM,ORIGRANDGROUP >= 500)
pMtest <- subset(pM,ORIGRANDGROUP < 500)
The steps to use this new expanded data frame to build a model are shown in the following listing.
Listing 9.19 Modeling using the explicit kernel transform

formulaStr2 <- paste('log(PINCP,base=10)',
   paste(vars,collapse=' + '),
   sep=' ~ ')
m2 <- lm(as.formula(formulaStr2),data=pMtrain)
coef2 <- summary(m2)$coefficients
# Select a set of interesting variables by building an initial model using
# all of the new variables and retaining an interesting subset. This is an
# ad hoc move to speed up the stepwise regression by trying to quickly
# dispose of many useless derived variables. By introducing many new
# variables, the primal kernel method also introduces many new degrees of
# freedom, which can invite overfitting.
interestingVars <- setdiff(rownames(coef2)[coef2[,'Pr(>|t|)']<0.01],
   '(Intercept)')
interestingVars <- union(colnames(modelMatrix),interestingVars)
# Stepwise regress on the subset of variables to get the new model.
formulaStr3 <- paste('log(PINCP,base=10)',
   paste(interestingVars,collapse=' + '),
   sep=' ~ ')
m3 <- step(lm(as.formula(formulaStr3),data=pMtrain),direction='both')
# Calculate the RMSE between the prediction and the actuals.
print(rmse(log(pMtest$PINCP,base=10),predict(m3,newdata=pMtest)))
# [1] 0.2735955
We see RMSE is improved by a small amount on the test data. With such a small improvement, we have extra reason to confirm its statistical significance using a cross-validation procedure as demonstrated in section 6.2.3. Leaving these issues aside, let's look at the summary of the new model to see what new variables the phi() procedure introduced. The next listing shows the structure of the new model.
Listing 9.20 Inspecting the results of the explicit kernel model

> print(summary(m3))

Call:
lm(formula = log(PINCP, base = 10) ~ AGEP + SEXM +
    COWPrivate_not_for_profit_employee +
    SCHLAssociate_s_degree + SCHLBachelor_s_degree + SCHLDoctorate_degree +
    SCHLGED_or_alternative_credential + SCHLMaster_s_degree +
    SCHLProfessional_degree + SCHLRegular_high_school_diploma +
    SCHLsome_college_credit_no_degree + AGEP_AGEP, data = pMtrain)

Residuals:
     Min       1Q   Median       3Q      Max
-1.29264 -0.14925  0.01343  0.17021  0.61968

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)
(Intercept)                    2.9400460  0.2219310  13.248  < 2e-16 ***
AGEP                           0.0663537  0.0124905   5.312 1.54e-07 ***
SEXM                           0.0934876  0.0224236   4.169 3.52e-05 ***
COWPrivate_not_for_profit_em  -0.1187914  0.0379944  -3.127  0.00186 **
SCHLAssociate_s_degree         0.2317211  0.0509509   4.548 6.60e-06 ***
SCHLBachelor_s_degree          0.3844459  0.0417445   9.210  < 2e-16 ***
SCHLDoctorate_degree           0.3190572  0.1569356   2.033  0.04250 *
SCHLGED_or_alternative_creden  0.1405157  0.0766743   1.833  0.06737 .
SCHLMaster_s_degree            0.4553550  0.0485609   9.377  < 2e-16 ***
SCHLProfessional_degree        0.6525921  0.0845052   7.723 5.01e-14 ***
SCHLRegular_high_school_diplo  0.1016590  0.0415834   2.445  0.01479 *
SCHLsome_college_credit_no_de  0.1655906  0.0416345   3.977 7.85e-05 ***
AGEP_AGEP                     -0.0007547  0.0001704  -4.428 1.14e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2649 on 582 degrees of freedom
Multiple R-squared: 0.3541,    Adjusted R-squared: 0.3408
F-statistic: 26.59 on 12 and 582 DF,  p-value: < 2.2e-16
In this case, the only new variable is AGEP_AGEP. The model is using AGEP*AGEP to build a non-monotone relation between age and log income.14
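For comparison, here is a minimal sketch (not from the book) of adding the same quadratic age term directly to the original formula from listing 9.17, without the explicit phi() machinery; I(AGEP^2) is standard R formula notation for a literal squared term, and dtrain/dtest are the frames built in listing 9.17.

m1b <- lm(log(PINCP,base=10) ~ AGEP + I(AGEP^2) + SEX + COW + SCHL,
          data=dtrain)
print(rmse(log(dtest$PINCP,base=10), predict(m1b,newdata=dtest)))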
The phi() method is automatic and can therefore be applied in many modeling situations. In our example, we can think of the crude function that multiplies all pairs of variables as our phi() or think of the implied function that took the original set of variables to the new set called interestingVars as the actual training data-dependent phi(). Explicit phi() kernel notation adds some capabilities, but algorithms that are designed to work directly with implicit kernel definitions in k(,) notation can be much more powerful. The most famous such method is the support vector machine, which we’ll use in the next section.
9.3.3 Kernel takeaways
Here’s what you should remember about kernel methods:
Kernels provide a systematic way of creating interactions and other synthetic variables that are combinations of individual variables.
The goal of kernelizing is to lift the data into a space where the data is separable, or where linear methods can be used directly.
14 Of course, this sort of relation could be handled quickly by introducing an AGEP*AGEP term directly in the model or by using a generalized additive model to discover the optimal (possibly nonlinear) shape of the relation between age and log income.
Now we’re ready to work with the most well-known use of kernel methods: support vector machines.