27.2 Support Vector Machine Formulations
27.2.1 The Formulation of Conventional Support Vector Machine
In this article, we mainly confine ourselves to binary classification problems, which focus on classifying data into two classes. Given a dataset consisting of $m$ points in the $n$-dimensional real space $\mathbb{R}^n$, each with a class label $y \in \{+1, -1\}$ indicating to which of the two classes, $A_+ \subset \mathbb{R}^n$ or $A_- \subset \mathbb{R}^n$, the point belongs, we want to find the decision boundary between the two classes. For the multi-class case, many strategies have been proposed. They either decompose the problem into a series of binary classification problems or formulate it as a single optimization problem. We will discuss this issue in Sect. 27.5.

In notation, we use capital boldface letters to denote matrices, lower case boldface letters to denote column vectors, and lower case lightface letters to denote scalars. The data points are denoted by an $m \times n$ matrix $\mathbf{A}$, where the $i$th row of the matrix corresponds to the $i$th data point. We use a column vector $\mathbf{x}_i$ to denote the $i$th data point. All vectors are column vectors unless otherwise specified. The transpose of a matrix $\mathbf{M}$ is denoted by $\mathbf{M}^\top$.
27.2.1.1 Primal Form of Conventional SVM
We start with a strictly linearly separable case, i.e. there exists a hyperplane which can separate the data $A_+$ and $A_-$. In this case we can separate the two classes by a pair of parallel bounding planes:
$$
\begin{aligned}
\mathbf{w}^\top \mathbf{x} + b &= +1\,, \\
\mathbf{w}^\top \mathbf{x} + b &= -1\,,
\end{aligned}
\tag{27.1}
$$

where $\mathbf{w}$ is the normal vector to these planes and $b$ determines their location relative to the origin. The first plane of (27.1) bounds the class $A_+$ and the second plane bounds the class $A_-$. That is,

$$
\begin{aligned}
\mathbf{w}^\top \mathbf{x} + b &\geq +1\,, \quad \forall\, \mathbf{x} \in A_+\,, \\
\mathbf{w}^\top \mathbf{x} + b &\leq -1\,, \quad \forall\, \mathbf{x} \in A_-\,.
\end{aligned}
\tag{27.2}
$$
According to the statistical learning theory (Vapnik 2000), SVM achieves a better prediction ability via maximizing the margin between the two bounding planes. Hence, the "hard margin" SVM searches for a separating hyperplane by maximizing $\frac{2}{\|\mathbf{w}\|_2}$. This can be done by minimizing $\frac{1}{2}\|\mathbf{w}\|_2^2$, and the formulation leads to a quadratic program as follows:
$$
\begin{aligned}
\min_{(\mathbf{w},\, b)\, \in\, \mathbb{R}^{n+1}} \quad & \frac{1}{2}\|\mathbf{w}\|_2^2 \\
\text{s.t.} \quad & y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1\,, \quad \text{for } i = 1, 2, \ldots, m\,.
\end{aligned}
\tag{27.3}
$$

The linear separating hyperplane is the plane

$$
\mathbf{w}^\top \mathbf{x} + b = 0\,,
\tag{27.4}
$$
midway between the bounding planes (27.1), as shown in Fig. 27.1a. For the linearly separable case, the feasible region of the above minimization problem (27.3) is nonempty and the objective function is a quadratic convex function; therefore, there exists an optimal solution, denoted by $(\mathbf{w}, b)$. The data points on the bounding planes, $\mathbf{w}^\top \mathbf{x} + b = \pm 1$, are called support vectors. It is not difficult to see that, if we remove any point that is not a support vector, the training result will remain the same. This is a nice property of SVM learning algorithms. For the purpose of data compression, once we have the training result, all we need to keep in our database are the support vectors.
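To make the quadratic program (27.3) concrete, the following is a minimal sketch that solves the hard-margin primal with the generic convex solver cvxpy on hypothetical, linearly separable toy data; the package choice and the data are illustrative assumptions, not part of this chapter.

```python
# A minimal sketch of the hard-margin primal (27.3); cvxpy and the toy data
# are assumptions for illustration only.
import numpy as np
import cvxpy as cp

# Two linearly separable point clouds A+ (label +1) and A- (label -1).
rng = np.random.default_rng(0)
A_plus = rng.normal(loc=+2.0, scale=0.5, size=(20, 2))
A_minus = rng.normal(loc=-2.0, scale=0.5, size=(20, 2))
X = np.vstack([A_plus, A_minus])            # m x n data matrix A
y = np.hstack([np.ones(20), -np.ones(20)])  # class labels y_i

w = cp.Variable(2)   # normal vector of the separating hyperplane
b = cp.Variable()    # offset

# min (1/2)||w||_2^2   s.t.   y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
# Support vectors lie on the bounding planes w^T x + b = +/- 1,
# i.e. the points whose constraints are (numerically) active.
margins = y * (X @ w.value + b.value)
print("support vectors:", np.where(np.isclose(margins, 1.0, atol=1e-3))[0])
```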
If the classes are not linearly separable, in some cases two planes may still bound the two classes with a "soft margin". That is, given a nonnegative slack vector variable $\boldsymbol{\xi} := (\xi_1, \ldots, \xi_m)$, we would like to have:
Fig. 27.1 The illustration of linearly separable (a) and non-linearly separable (b) SVMs, showing the bounding planes $\mathbf{x}^\top\mathbf{w} + b = \pm 1$, the classes $A_+$ and $A_-$, the normal vector $\mathbf{w}$, the margin $\frac{2}{\|\mathbf{w}\|_2}$, and, in (b), the slack variables $\xi_i$ and $\xi_j$
$$
\begin{aligned}
\mathbf{w}^\top \mathbf{x}_i + b + \xi_i &\geq +1\,, \quad \forall\, \mathbf{x}_i \in A_+\,, \\
\mathbf{w}^\top \mathbf{x}_i + b - \xi_i &\leq -1\,, \quad \forall\, \mathbf{x}_i \in A_-\,.
\end{aligned}
\tag{27.5}
$$

The 1-norm of the slack vector variable, $\sum_{i=1}^{m} \xi_i$, is called the penalty term. In principle, we are going to determine a separating hyperplane that not only correctly classifies the training data, but also performs well on test data. We depict the geometric property in Fig. 27.1b. With a soft margin, we can extend (27.3) and produce the conventional SVM (Vapnik 2000) as the following formulation:
$$
\begin{aligned}
\min_{(\mathbf{w},\, b,\, \boldsymbol{\xi})\, \in\, \mathbb{R}^{n+1+m}} \quad & \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{i=1}^{m} \xi_i \\
\text{s.t.} \quad & y_i(\mathbf{w}^\top \mathbf{x}_i + b) + \xi_i \geq 1\,, \quad \xi_i \geq 0\,, \quad \text{for } i = 1, 2, \ldots, m\,,
\end{aligned}
\tag{27.6}
$$
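As an illustration of (27.6), the sketch below encodes the soft-margin primal in cvxpy; the helper name soft_margin_svm and the default value of $C$ are hypothetical choices, not taken from the chapter.

```python
# A sketch of the soft-margin primal (27.6); X, y and C are placeholders.
import cvxpy as cp

def soft_margin_svm(X, y, C=1.0):
    m, n = X.shape
    w = cp.Variable(n)
    b = cp.Variable()
    xi = cp.Variable(m, nonneg=True)   # slack variables xi_i >= 0

    # min (1/2)||w||_2^2 + C * sum_i xi_i
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    # s.t. y_i (w^T x_i + b) + xi_i >= 1
    constraints = [cp.multiply(y, X @ w + b) + xi >= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value
```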
where $C > 0$ is a positive parameter that balances the weight of the penalty term $\sum_{i=1}^{m} \xi_i$ and the margin maximization term $\frac{1}{2}\|\mathbf{w}\|_2^2$. Alternatively, we can replace the penalty term by the 2-norm measure as follows:

$$
\begin{aligned}
\min_{(\mathbf{w},\, b,\, \boldsymbol{\xi})\, \in\, \mathbb{R}^{n+1+m}} \quad & \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{i=1}^{m} \xi_i^2 \\
\text{s.t.} \quad & y_i(\mathbf{w}^\top \mathbf{x}_i + b) + \xi_i \geq 1\,, \quad \text{for } i = 1, 2, \ldots, m\,.
\end{aligned}
\tag{27.7}
$$
The 1-norm penalty is considered less sensitive to outliers than the 2-norm penalty, and therefore receives more attention in real applications. However, mathematically the 1-norm is more difficult to manipulate, for example when we need to compute derivatives.
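As a hedged, practical illustration of this choice (not from the chapter): in scikit-learn's LinearSVC, the 1-norm and 2-norm penalties of (27.6) and (27.7) roughly correspond to the hinge and squared-hinge losses; note that LinearSVC also regularizes the intercept, so the correspondence is approximate.

```python
# Approximate counterparts of (27.6) and (27.7) via scikit-learn's LinearSVC;
# the toy data are hypothetical, and the match to (27.6)/(27.7) is only
# approximate because LinearSVC also penalizes the intercept.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1, (50, 2)), rng.normal(-1, 1, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

svm_l1 = LinearSVC(loss="hinge", C=1.0, dual=True).fit(X, y)          # ~ 1-norm penalty (27.6)
svm_l2 = LinearSVC(loss="squared_hinge", C=1.0, dual=True).fit(X, y)  # ~ 2-norm penalty (27.7)
print(svm_l1.coef_, svm_l1.intercept_)
print(svm_l2.coef_, svm_l2.intercept_)
```

Incidentally, in (27.7) the explicit constraint $\xi_i \geq 0$ can be dropped: with the squared penalty, any optimal $\xi_i$ is automatically nonnegative.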
27.2.1.2 Dual Form of Conventional SVM
The conventional support vector machine formulation (27.6) is a standard convex quadratic program (Bertsekas 1999; Mangasarian 1994; Nocedal and Wright 2006). The Wolfe dual problem of (27.6) is expressed as follows:
$$
\begin{aligned}
\max_{\boldsymbol{\alpha}\, \in\, \mathbb{R}^{m}} \quad & \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle \\
\text{s.t.} \quad & \sum_{i=1}^{m} y_i \alpha_i = 0\,, \\
& 0 \leq \alpha_i \leq C \quad \text{for } i = 1, 2, \ldots, m\,,
\end{aligned}
\tag{27.8}
$$
where $\langle \mathbf{x}_i, \mathbf{x}_j \rangle$ is the inner product of $\mathbf{x}_i$ and $\mathbf{x}_j$. The primal variable $\mathbf{w}$ is given by:

$$
\mathbf{w} = \sum_{\alpha_i > 0} y_i \alpha_i \mathbf{x}_i\,.
\tag{27.9}
$$
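The sketch below solves the Wolfe dual (27.8) in cvxpy and recovers $\mathbf{w}$ via (27.9); the quadratic term is written as the squared norm of $\sum_i \alpha_i y_i \mathbf{x}_i$, which is algebraically identical to the double sum in (27.8). The function name and data are assumptions carried over from the earlier sketches.

```python
# A sketch of the Wolfe dual (27.8) with w recovered via (27.9);
# X, y, C are assumed to come from the earlier soft-margin setup.
import numpy as np
import cvxpy as cp

def svm_dual(X, y, C=1.0):
    m = X.shape[0]
    alpha = cp.Variable(m)
    Yx = y[:, None] * X   # rows are y_i * x_i
    # sum_i alpha_i - (1/2) sum_ij y_i y_j alpha_i alpha_j <x_i, x_j>
    # = sum(alpha) - (1/2) || sum_i alpha_i y_i x_i ||_2^2
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(Yx.T @ alpha))
    constraints = [y @ alpha == 0, alpha >= 0, alpha <= C]
    cp.Problem(objective, constraints).solve()

    a = alpha.value
    w = Yx.T @ a          # (27.9): w = sum_{alpha_i > 0} y_i alpha_i x_i
    return a, w
```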
Each dual variable $\alpha_i$ corresponds to a training point $\mathbf{x}_i$. The normal vector $\mathbf{w}$ can be expressed as a linear combination of the training data points which have positive dual variables $\alpha_i$ (namely, the support vectors). By the Karush-Kuhn-Tucker complementarity conditions (Bertsekas 1999; Mangasarian 1994):
$$
\begin{aligned}
0 \leq \alpha_i \ \perp\ y_i(\mathbf{w}^\top \mathbf{x}_i + b) + \xi_i - 1 &\geq 0\,, \\
0 \leq C - \alpha_i \ \perp\ \xi_i &\geq 0\,, \quad \text{for } i = 1, 2, \ldots, m\,,
\end{aligned}
\tag{27.10}
$$

we can determine $b$ simply by taking any training point $\mathbf{x}_i$ such that $i \in I := \{k \mid 0 < \alpha_k < C\}$ and obtain:

$$
b = y_i - \mathbf{w}^\top \mathbf{x}_i = y_i - \sum_{j=1}^{m} y_j \alpha_j \langle \mathbf{x}_j, \mathbf{x}_i \rangle\,.
\tag{27.11}
$$
In the dual form, SVMs can be expressed entirely in terms of inner products. This implies that we only need the inner products of the data points when expressing the formulation and the decision function of an SVM. This important characteristic carries SVMs to their nonlinear extension in a simple way.
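As a sketch of (27.11) and of the inner-product form of the decision function, the hypothetical helper below computes $b$ from any index $i$ with $0 < \alpha_i < C$ and classifies new points using only inner products; replacing the Gram matrix with a kernel matrix would give the nonlinear extension alluded to above.

```python
# Computing b as in (27.11) and evaluating the decision function using only
# inner products; the helper name and tolerance are assumptions.
import numpy as np

def svm_predict(X_train, y_train, alpha, C, X_test, tol=1e-6):
    # pick an index i in I = {k : 0 < alpha_k < C}
    i = np.where((alpha > tol) & (alpha < C - tol))[0][0]
    K_train = X_train @ X_train.T                              # <x_j, x_i> for all pairs
    b = y_train[i] - np.sum(y_train * alpha * K_train[:, i])   # (27.11)

    K_test = X_test @ X_train.T                                # <x, x_j> for test points
    scores = K_test @ (y_train * alpha) + b                    # w^T x + b via inner products
    return np.sign(scores)
```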