27.3 Variants of Support Vector Machines
27.3.3 Least Squares Support Vector Machine
The least squares support vector machine (Suykens and Vandewalle 1999) replaces the inequality constraints by equality constraints, which turns the classification problem into a least squares problem:
$$
\begin{aligned}
\min_{(w,b,\xi)\in\mathbb{R}^{n+1+m}}\quad & \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i^2\\
\text{s.t.}\quad & \xi_i = 1 - y_i(w^\top x_i + b)\,, \quad i = 1, 2, \dots, m\,.
\end{aligned} \tag{27.32}
$$
The same idea, called the proximal support vector machine, was proposed independently at about the same time by Fung and Mangasarian (2001), with the square of the bias term $b$ added to the objective function. In the least squares form, the solution of the classification problem can be obtained by solving a system of linear equations. Consider the Lagrangian function of (27.32):
$$
L(w, b, \xi; \alpha) = \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i^2 - \sum_{i=1}^{m}\alpha_i\bigl[y_i(w^\top x_i + b) - 1 + \xi_i\bigr]\,, \tag{27.33}
$$
where $\alpha_i \in \mathbb{R}$ are the Lagrange multipliers. Setting the gradient of $L$ to zero gives the following Karush-Kuhn-Tucker optimality conditions:
$$
w = \sum_{i=1}^{m}\alpha_i y_i x_i \tag{27.34}
$$
$$
\sum_{i=1}^{m}\alpha_i y_i = 0
$$
$$
\alpha_i = C\xi_i\,, \quad i = 1, \dots, m
$$
$$
y_i(w^\top x_i + b) - 1 + \xi_i = 0\,, \quad i = 1, \dots, m\,,
$$
which are equivalent to the following linear equations:
$$
\begin{bmatrix}
I & 0 & 0 & -\hat{A}^\top \\
0 & 0 & 0 & y^\top \\
0 & 0 & CI & -I \\
\hat{A} & y & I & 0
\end{bmatrix}
\begin{bmatrix} w \\ b \\ \xi \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \\ \mathbf{1} \end{bmatrix}, \tag{27.35}
$$
or, equivalently,
$$
\begin{bmatrix}
0 & y^\top \\
y & \hat{A}\hat{A}^\top + C^{-1}I
\end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix}, \tag{27.36}
$$
where $\hat{A} = [y_1 x_1^\top; y_2 x_2^\top; \dots; y_m x_m^\top]$ is the matrix whose $i$-th row is $y_i x_i^\top$, $y = [y_1; y_2; \dots; y_m]$, and $\mathbf{1} = [1; 1; \dots; 1]$. Since the data enter (27.36) only through inner products, the least squares SVM can also be extended to the nonlinear case. That is, the nonlinear least squares SVM solves the following linear equations:
$$
\begin{bmatrix}
0 & y^\top \\
y & K(A, A) + C^{-1}I
\end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix}, \tag{27.37}
$$
where $K(A, A)$ is the kernel matrix. Equation (27.36) or (27.37) yields the least squares SVM classifier by solving a single system of linear equations, which is computationally cheaper than solving the quadratic program of a conventional SVM.
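As a concrete illustration, the following sketch trains a linear least squares SVM by assembling and solving the system (27.36) with NumPy. The toy data, the value of C, and the helper name lssvm_train are illustrative choices rather than part of the original formulation.

```python
import numpy as np

def lssvm_train(X, y, C):
    """Train a linear least squares SVM by solving the linear system (27.36)."""
    m = X.shape[0]
    A_hat = y[:, None] * X                    # i-th row is y_i * x_i^T
    M = np.zeros((m + 1, m + 1))              # [[0, y^T], [y, A_hat A_hat^T + I/C]]
    M[0, 1:] = y
    M[1:, 0] = y
    M[1:, 1:] = A_hat @ A_hat.T + np.eye(m) / C
    rhs = np.concatenate(([0.0], np.ones(m)))
    sol = np.linalg.solve(M, rhs)
    b, alpha = sol[0], sol[1:]
    w = A_hat.T @ alpha                       # w = sum_i alpha_i y_i x_i, cf. (27.34)
    return w, b

# Illustrative usage on a toy two-class problem
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (20, 2)), rng.normal(1.0, 1.0, (20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])
w, b = lssvm_train(X, y, C=10.0)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```

For the nonlinear case, one would replace $\hat{A}\hat{A}^\top$ by the kernel matrix as in (27.37) and evaluate the classifier through the kernel expansion instead of the explicit weight vector $w$.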
27.3.4 1-norm Support Vector Machine
The 1-norm support vector machine replaces the regularization term $\|w\|_2^2$ in (27.6) by the $\ell_1$-norm of $w$. The $\ell_1$-norm regularization term is also called the LASSO penalty (Tibshirani 1996). It tends to shrink the coefficients of $w$ towards zero, in particular those coefficients corresponding to redundant noise features (Zhu et al. 2004). This feature leads to a natural way of selecting the important attributes in our prediction model. The 1-norm SVM is formulated as follows:
$$
\begin{aligned}
\min_{(w,b,\xi)\in\mathbb{R}^{n+1+m}}\quad & \|w\|_1 + C\sum_{i=1}^{m}\xi_i\\
\text{s.t.}\quad & y_i(w^\top x_i + b) + \xi_i \ge 1\,, \quad \xi_i \ge 0\,, \quad \text{for } i = 1, 2, \dots, m\,.
\end{aligned} \tag{27.38}
$$
The objective function of (27.38) is piecewise linear and convex, so (27.38) can be reformulated as the following linear programming problem:
$$
\begin{aligned}
\min_{(w,s,b,\xi)\in\mathbb{R}^{n+n+1+m}}\quad & \sum_{j=1}^{n} s_j + C\sum_{i=1}^{m}\xi_i\\
\text{s.t.}\quad & y_i(w^\top x_i + b) + \xi_i \ge 1\,, \quad \xi_i \ge 0\,, && \text{for } i = 1, 2, \dots, m\,,\\
& -s_j \le w_j \le s_j\,, && \text{for } j = 1, 2, \dots, n\,,
\end{aligned} \tag{27.39}
$$
where $s_j$ is an upper bound on the absolute value of $w_j$. At the optimal solution of (27.39), the sum of the $s_j$ is equal to $\|w\|_1$.
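Since (27.39) is an ordinary linear program, it can be handed to any off-the-shelf LP solver. The sketch below does this with scipy.optimize.linprog on synthetic data; the variable ordering, the toy data, and the parameter value are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

def svm_1norm(X, y, C):
    """Solve the 1-norm SVM linear program (27.39).

    Decision variables are stacked as z = [w (n), s (n), b (1), xi (m)].
    """
    m, n = X.shape
    # Objective: sum_j s_j + C * sum_i xi_i
    c = np.concatenate([np.zeros(n), np.ones(n), [0.0], C * np.ones(m)])

    # Margin constraints: -y_i (w^T x_i + b) - xi_i <= -1
    A_margin = np.hstack([-y[:, None] * X, np.zeros((m, n)), -y[:, None], -np.eye(m)])
    # |w_j| <= s_j, written as w_j - s_j <= 0 and -w_j - s_j <= 0
    A_abs = np.vstack([
        np.hstack([np.eye(n), -np.eye(n), np.zeros((n, 1)), np.zeros((n, m))]),
        np.hstack([-np.eye(n), -np.eye(n), np.zeros((n, 1)), np.zeros((n, m))]),
    ])
    A_ub = np.vstack([A_margin, A_abs])
    b_ub = np.concatenate([-np.ones(m), np.zeros(2 * n)])

    # w and b are free; s and xi are nonnegative
    bounds = [(None, None)] * n + [(0, None)] * n + [(None, None)] + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n], res.x[2 * n]            # (w, b)

# Toy data with two informative and three pure-noise features
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])
w, b = svm_1norm(X, y, C=1.0)
print("w =", np.round(w, 3))
```

Inspecting $w$ on such data typically shows the coefficients of the noise features shrunk to (near) zero, which is exactly the variable-selection behaviour discussed next.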
The 1-norm SVM can generate a very sparse solution $w$ and thus leads to a parsimonious model. In a linear SVM classifier, sparsity of the solution means that the separating function $f(x) = w^\top x + b$ depends on very few input attributes. This characteristic can significantly reduce the number of nonzero coefficients of $w$, especially when there are many redundant noise features (Fung and Mangasarian 2004; Zhu et al. 2004). Therefore the 1-norm SVM is a very promising tool for variable selection. In Sect. 27.6, we will use it to choose the important financial indices for our bankruptcy prognosis model.
27.3.5 ε-Support Vector Regression
In regression problems, the response $y$ is a real number. We would like to find a linear or nonlinear regression function $f(x)$ that tolerates a small error in fitting the given dataset. This can be achieved by using the ε-insensitive loss function, which sets an ε-insensitive "tube" around the data within which errors are discarded.
We start with the linear case, that is, the regression function $f(x) = w^\top x + b$. The SVM minimization can be formulated as the unconstrained problem
$$
\min_{(w,b)\in\mathbb{R}^{n+1}}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}|\xi_i|_\varepsilon\,, \tag{27.40}
$$
where $|\xi_i|_\varepsilon = \max\{0, |w^\top x_i + b - y_i| - \varepsilon\}$ represents the fitting error, and the positive control parameter $C$ weights the tradeoff between the fitting errors and the flatness of the linear regression function $f(x)$. As in the SVM for classification, the regularization term $\|w\|_2^2$ in (27.40) improves the generalization ability. To handle the ε-insensitive loss function in the objective of the minimization problem above, it is conventionally reformulated as the following constrained minimization problem:
$$
\begin{aligned}
\min_{(w,b,\xi,\xi^*)\in\mathbb{R}^{n+1+2m}}\quad & \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}(\xi_i + \xi_i^*)\\
\text{s.t.}\quad & w^\top x_i + b - y_i \le \varepsilon + \xi_i\,,\\
& -w^\top x_i - b + y_i \le \varepsilon + \xi_i^*\,,\\
& \xi_i,\ \xi_i^* \ge 0 \quad \text{for } i = 1, 2, \dots, m\,.
\end{aligned} \tag{27.41}
$$
The formulation (27.41) is equivalent to the formulation (27.40), and its corresponding dual form is
$$
\begin{aligned}
\max_{\alpha,\hat\alpha\in\mathbb{R}^m}\quad & \sum_{i=1}^{m}(\hat\alpha_i - \alpha_i)y_i - \varepsilon\sum_{i=1}^{m}(\hat\alpha_i + \alpha_i) - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}(\hat\alpha_i - \alpha_i)(\hat\alpha_j - \alpha_j)\langle x_i, x_j\rangle\\
\text{s.t.}\quad & \sum_{i=1}^{m}(\hat\alpha_i - \alpha_i) = 0\,,\\
& 0 \le \alpha_i,\ \hat\alpha_i \le C\,, \quad \text{for } i = 1, \dots, m\,.
\end{aligned} \tag{27.42}
$$
Starting from (27.42), one can also apply the kernel trick to this dual form of ε-SVR for the nonlinear extension. That is, $\langle x_i, x_j\rangle$ is directly replaced by a kernel function $k(x_i, x_j)$:
$$
\begin{aligned}
\max_{\alpha,\hat\alpha\in\mathbb{R}^m}\quad & \sum_{i=1}^{m}(\hat\alpha_i - \alpha_i)y_i - \varepsilon\sum_{i=1}^{m}(\hat\alpha_i + \alpha_i) - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}(\hat\alpha_i - \alpha_i)(\hat\alpha_j - \alpha_j)k(x_i, x_j)\\
\text{s.t.}\quad & \sum_{i=1}^{m}(\hat\alpha_i - \alpha_i) = 0\,,\\
& 0 \le \alpha_i,\ \hat\alpha_i \le C\,, \quad \text{for } i = 1, \dots, m\,,
\end{aligned} \tag{27.43}
$$
with the decision function $f(x) = \sum_{i=1}^{m}(\hat\alpha_i - \alpha_i)k(x_i, x) + b$.
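In practice one rarely solves (27.43) by hand; standard SVM packages implement exactly this dual. As an illustration, the snippet below fits an ε-SVR with a Gaussian kernel using scikit-learn's SVR class; the toy data and the values of C, ε, and the kernel width are arbitrary choices for the example.

```python
import numpy as np
from sklearn.svm import SVR

# Toy nonlinear regression data (illustrative)
rng = np.random.default_rng(2)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

# epsilon-SVR with a Gaussian (RBF) kernel solves the dual (27.43); the fitted
# model keeps only the support vectors, i.e. points whose dual coefficients
# (alpha_hat_i - alpha_i) are nonzero.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0)
model.fit(X, y)

print("number of support vectors:", len(model.support_))
print("predictions at new points:", model.predict([[0.0], [1.5]]))
```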
Similar to the smoothing approach in SSVM, the formulation (27.40) can be slightly modified into a smooth unconstrained minimization problem. Before we derive the smooth approximation function, we note some useful observations:
$$
|x|_\varepsilon = (x - \varepsilon)_+ + (-x - \varepsilon)_+ \tag{27.44}
$$
and
$$
(x - \varepsilon)_+ \cdot (-x - \varepsilon)_+ = 0 \quad \text{for all } x \in \mathbb{R} \text{ and } \varepsilon > 0\,. \tag{27.45}
$$
Thus we have
$$
|x|_\varepsilon^2 = (x - \varepsilon)_+^2 + (-x - \varepsilon)_+^2\,. \tag{27.46}
$$
It is straightforward to replace $|x|_\varepsilon^2$ by the very accurate smooth approximation
$$
p_\varepsilon^2(x, \beta) = \bigl(p(x - \varepsilon, \beta)\bigr)^2 + \bigl(p(-x - \varepsilon, \beta)\bigr)^2\,. \tag{27.47}
$$
We use this $p_\varepsilon^2$-function with smoothing parameter $\beta$ to obtain the smooth ε-support vector regression (ε-SSVR) (Lee et al. 2005):
$$
\min_{(w,b)\in\mathbb{R}^{n+1}}\ \frac{1}{2}\bigl(\|w\|_2^2 + b^2\bigr) + \frac{C}{2}\sum_{i=1}^{m}p_\varepsilon^2(w^\top x_i + b - y_i,\ \beta)\,, \tag{27.48}
$$
where $p_\varepsilon^2(w^\top x_i + b - y_i, \beta) \in \mathbb{R}$. For the nonlinear case, this formulation can be extended to the nonlinear ε-SSVR by using the kernel trick:
$$
\min_{(u,b)\in\mathbb{R}^{m+1}}\ \frac{1}{2}\bigl(\|u\|_2^2 + b^2\bigr) + \frac{C}{2}\sum_{i=1}^{m}p_\varepsilon^2\Bigl(\sum_{j=1}^{m}u_j K(x_j, x_i) + b - y_i,\ \beta\Bigr)\,, \tag{27.49}
$$
where $K(x_i, x_j)$ is a kernel function. The nonlinear ε-SSVR decision function $f(x)$ can be expressed as
$$
f(x) = \sum_{i=1}^{m}u_i K(x_i, x) + b\,. \tag{27.50}
$$
Note that the reduced kernel technique can also be applied to ε-SSVR when dealing with a large-scale regression problem.
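Because (27.48) and (27.49) are smooth and unconstrained, they can be minimized with a standard gradient-based solver. The following sketch minimizes the nonlinear ε-SSVR objective (27.49) with scipy.optimize.minimize, assuming the SSVM smooth plus function p(x, β) = x + (1/β) log(1 + exp(−βx)) and a Gaussian kernel; the data, parameter values, and helper names are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def p_plus(x, beta):
    """Smooth approximation of the plus function (x)_+ (SSVM p-function, assumed form)."""
    return x + np.logaddexp(0.0, -beta * x) / beta    # stable log(1 + exp(-beta*x))

def p2_eps(x, eps, beta):
    """Smooth approximation of |x|_eps^2, cf. (27.47)."""
    return p_plus(x - eps, beta) ** 2 + p_plus(-x - eps, beta) ** 2

def gaussian_kernel(A, B, gamma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def essvr_fit(X, y, C=10.0, eps=0.1, beta=100.0, gamma=1.0):
    """Nonlinear eps-SSVR: minimize (27.49) over (u, b)."""
    m = X.shape[0]
    K = gaussian_kernel(X, X, gamma)

    def objective(z):
        u, b = z[:m], z[m]
        r = K @ u + b - y                  # residuals sum_j u_j K(x_j, x_i) + b - y_i
        return 0.5 * (u @ u + b * b) + 0.5 * C * p2_eps(r, eps, beta).sum()

    res = minimize(objective, np.zeros(m + 1), method="L-BFGS-B")
    u, b = res.x[:m], res.x[m]

    def predict(X_new):
        return gaussian_kernel(X_new, X, gamma) @ u + b    # f(x) in (27.50)
    return predict

# Illustrative usage
rng = np.random.default_rng(3)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)
f = essvr_fit(X, y)
print("mean absolute fit error:", np.mean(np.abs(f(X) - y)))
```

For a large-scale problem, the reduced kernel technique mentioned above would replace the full kernel matrix by a rectangular kernel built on a random subset of the training points, shrinking the number of variables u accordingly.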