Hindawi Publishing Corporation
EURASIP Journal on Applied Signal Processing
Volume 2006, Article ID 24185, Pages 1–13
DOI 10.1155/ASP/2006/24185

The Optimal Design of Weighted Order Statistics Filters by Using Support Vector Machines

Chih-Chia Yao and Pao-Ta Yu
Department of Computer Science and Information Engineering, College of Engineering, National Chung Cheng University, Chia-yi 62107, Taiwan

Received 10 January 2005; Revised 13 September 2005; Accepted November 2005
Recommended for Publication by Moon Gi Kang

Support vector machines (SVMs), a classification algorithm for the machine learning community, have been shown to provide higher performance than traditional learning machines. In this paper, the technique of SVMs is introduced into the design of weighted order statistics (WOS) filters. WOS filters are highly effective in processing digital signals because they have a simple window structure. However, because of threshold decomposition and the stacking property, existing approaches to designing WOS filters cannot significantly improve both the design complexity and the estimation error. This paper proposes a new design technique which improves the learning speed and reduces the complexity of designing WOS filters. The technique uses a dichotomous approach to reduce the Boolean functions from 255 levels to two levels, which are separated by an optimal hyperplane. Furthermore, the optimal hyperplane is obtained by using the technique of SVMs. Our proposed method approximates the optimal weighted order statistics filters more rapidly than the adaptive neural filters.

Copyright © 2006 Hindawi Publishing Corporation. All rights reserved.

1. INTRODUCTION

Support vector machines (SVMs), a classification algorithm for the machine learning community, have attracted much attention in recent years [1–5]. In many applications, SVMs have been shown to provide higher performance than traditional learning machines [6–8]. The principle of SVMs is based on the approximate implementation of structural risk minimization, which shows that the generalization error is bounded by the sum of the training set error and a term depending on the Vapnik-Chervonenkis dimension of the learning machine [2]. The idea of SVMs originates from finding an optimal separating hyperplane which separates the largest possible fraction of training points of each class while maximizing the distance from either class to the hyperplane. According to Vapnik [9], this hyperplane minimizes the risk of misclassifying not only the examples in the training set, but also the unseen examples of the test set. The performance of SVMs relative to traditional learning machines suggests that redesigning existing filter classes with SVMs could overcome significant problems under study [10–15].

In this paper, a new dichotomous technique for designing WOS filters by SVMs is proposed. WOS filters are a special subset of stack filters and are used in many applications, including noise cancellation, image restoration, and texture analysis [16–21]. Each stack filter, based on a positive Boolean function, can be characterized by two properties: threshold decomposition and the stacking property [11, 22]. The Boolean function on which each WOS filter is based is a threshold logic which needs an n-dimensional weight vector and a threshold value. The representation of WOS filters based on threshold decomposition involves K − 1 Boolean functions, since the input data are decomposed into K − 1 levels; note that K is the number of gray levels of the input data. This architecture has been realized in multilayer neural networks [20]. However, based on the stacking property, the Boolean functions can be reduced from K − 1 levels to two levels without loss of accuracy.
Several research studies into WOS filters have also been proposed recently [23–27]. Because of threshold decomposition and the stacking property, these studies cannot significantly improve the design complexity and estimation error of WOS filters. This task can be accomplished, however, when the concept of SVMs is introduced to reduce the Boolean functions.

This paper compares our algorithm with the adaptive neural filters, first proposed by Yin et al. [20], in approximating the solution of minimum estimation error. Yin et al. applied a backpropagation algorithm to develop adaptive neural filters with sigmoidal neuron functions as their nonlinear threshold functions [20]. The learning process of adaptive neural filters has a long computational time since the learning structure is based on the architecture of threshold decomposition; that is, the learning data at each level of threshold decomposition must be manipulated. One contribution of this paper is to design an efficient algorithm for approximating an optimal WOS filter. In this algorithm, the total computational time is only 2T (time units), whereas the adaptive neural filter has a computational time of 255T (time units), given training data with 256 gray levels. Our experimental results are superior to those obtained using adaptive neural filters. We believe that the design methodology in our algorithm will reinvigorate research into stack filters, including morphological filters, which has languished for a decade.

This paper is organized as follows. In Section 2, the basic concepts of SVMs, WOS filters, and adaptive neural filters are reviewed. In Section 3, the concept of dichotomous WOS filters is described. In Section 4, a fast algorithm for generating an optimal WOS filter by SVMs is proposed. Finally, some experimental results are presented in Section 5 and our conclusions are offered in Section 6.

2. BASIC CONCEPTS

This section reviews three concepts: the basic concept of SVMs, the definition of WOS filters with reference to both the multivalued domain and the binary domain, and finally the adaptive neural filters proposed by Yin et al. [2, 20].

2.1. Linear support vector machines

Consider the training samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{L}$, where $\mathbf{x}_i$ is the input pattern for the $i$th sample and $y_i$ is the corresponding desired response; $\mathbf{x}_i \in R^m$ and $y_i \in \{-1, 1\}$. The objective is to define a separating hyperplane which divides the set of samples such that all the points with the same class are on the same side of the hyperplane. Let $\mathbf{w}_o$ and $b_o$ denote the optimum values of the weight vector and bias, respectively. The optimal separating hyperplane, representing a multidimensional linear decision surface in the input space, is given by

$\mathbf{w}_o^T \mathbf{x} + b_o = 0.$  (1)

The set of vectors is said to be optimally separated by the hyperplane if it is separated without error and the margin of separation is maximal. The separating hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ must satisfy the following constraints:

$y_i(\mathbf{w}^T\mathbf{x}_i + b) > 0, \quad i = 1, 2, \ldots, L.$  (2)

Equation (2) can be redefined without losing accuracy as

$y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1, \quad i = 1, 2, \ldots, L.$  (3)

When the nonseparable case is considered, a slack variable $\xi_i$ is introduced to measure the deviation of a data point from the ideal condition of pattern separability. Hence, the constraint of (3) is modified to

$y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, L,$  (4)
$\xi_i \ge 0.$  (5)

Two support hyperplanes, $\mathbf{w}^T\mathbf{x}_i + b = 1$ and $\mathbf{w}^T\mathbf{x}_i + b = -1$, which define the two borders of the margin of separation, are specified by (4). According to (4), the optimal separating hyperplane is the maximal margin hyperplane with geometric margin $2/\|\mathbf{w}\|$. Hence, the optimal separating hyperplane is the one that satisfies (4) and minimizes the cost function

$\Phi(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{L}\xi_i.$  (6)

The parameter C, selected by the user, controls the tradeoff between the complexity of the machine and the number of nonseparable points; a larger C assigns a higher penalty to errors.
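To make the soft-margin formulation (1)–(6) concrete, the following is a minimal sketch (not part of the original paper) of training a linear SVM; it assumes the scikit-learn library, and the toy data are invented purely for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data (invented for illustration).
X = np.array([[2.0, 2.0], [1.5, 2.5], [0.5, 0.5], [1.0, 0.2]])
y = np.array([1, 1, -1, -1])

# C is the tradeoff parameter of (6): a larger C penalizes slack variables more heavily.
clf = SVC(kernel='linear', C=10.0).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]   # optimal hyperplane w^T x + b = 0 of (1)
print(w, b)
print(2.0 / np.linalg.norm(w))           # geometric margin 2 / ||w||
print(clf.support_vectors_)              # points lying on the two support hyperplanes
```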
Since the cost function is a convex function and the constraints are linear, a Lagrangian function can be used to solve the constrained optimization problem:

$L(\mathbf{w}, b, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{L}\xi_i - \sum_{i=1}^{L}\alpha_i\big[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\big] - \sum_{i=1}^{L}\beta_i\xi_i,$  (7)

where $\alpha_i$, $\beta_i$, $i = 1, 2, \ldots, L$, are the Lagrange multipliers. Once the solution $\boldsymbol{\alpha}_o = (\alpha_{1,o}, \alpha_{2,o}, \ldots, \alpha_{L,o})$ of (7) has been found, the optimal weight vector is given by

$\mathbf{w}_o = \sum_{i=1}^{L}\alpha_{i,o}\, y_i\, \mathbf{x}_i.$  (8)

Classical Lagrangian duality enables the primal problem to be transformed into its dual problem. The dual problem of (7) is reformulated as

$Q(\boldsymbol{\alpha}) = \sum_{i=1}^{L}\alpha_i - \frac{1}{2}\sum_{i=1}^{L}\sum_{j=1}^{L}\alpha_i\alpha_j\, y_i y_j\, \mathbf{x}_i^T\mathbf{x}_j,$  (9)

with constraints

$\sum_{i=1}^{L}\alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, L.$  (10)

2.2. Nonlinear support vector machines

Input data can be mapped onto an alternative, higher-dimensional space, called the feature space, through the replacement

$\mathbf{x}_i \cdot \mathbf{x}_j \longrightarrow \varphi(\mathbf{x}_i)^T\varphi(\mathbf{x}_j).$  (11)

The functional form of the mapping $\varphi(\cdot)$ does not need to be known since it is implicitly defined by the selected kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^T\varphi(\mathbf{x}_j)$, such as polynomials, splines, radial basis function networks, or multilayer perceptrons. A suitable choice of kernel can make the data separable in the feature space despite being nonseparable in the original input space. For example, the XOR problem is nonseparable by a hyperplane in the input space, but it can be separated in the feature space defined by the polynomial kernel

$K(\mathbf{x}, \mathbf{x}_i) = (\mathbf{x}^T\mathbf{x}_i + 1)^p.$  (12)

When $\mathbf{x}_i$ is replaced by its mapping $\varphi(\mathbf{x}_i)$ in the feature space, (9) becomes

$Q(\boldsymbol{\alpha}) = \sum_{i=1}^{L}\alpha_i - \frac{1}{2}\sum_{i=1}^{L}\sum_{j=1}^{L}\alpha_i\alpha_j\, y_i y_j\, K(\mathbf{x}_i, \mathbf{x}_j).$  (13)

2.3. WOS filters

In the multivalued domain $\{0, 1, \ldots, K-1\}$, the output of a WOS filter can be easily obtained by a sorting operation. Let the K-valued input sequence be $\check{X} = (X_1, X_2, \ldots, X_L)$ and let the K-valued output sequence be $\check{Y} = (Y_1, Y_2, \ldots, Y_L)$, where $X_i, Y_i \in \{0, 1, \ldots, K-1\}$, $i \in \{1, 2, \ldots, L\}$. Then the output $Y_i = F_W(\mathbf{X}_i)$ can be obtained according to the following equation, where $\mathbf{X}_i = (X_{i-N}, \ldots, X_i, \ldots, X_{i+N})$ and $F_W(\cdot)$ denotes the filtering operation of the WOS filter associated with the corresponding vector W consisting of the weights and the threshold:

$Y_i = F_W(\mathbf{X}_i) = \text{the } t\text{th largest value of the samples } (\underbrace{X_{i-N}, \ldots, X_{i-N}}_{w_1 \text{ times}}, \underbrace{X_{i-N+1}, \ldots, X_{i-N+1}}_{w_2 \text{ times}}, \ldots, \underbrace{X_{i+N}, \ldots, X_{i+N}}_{w_{2N+1} \text{ times}}),$  (14)

where $W = [w_1, w_2, \ldots, w_{2N+1}; t]^T$ and T denotes transpose. The terms $w_1, w_2, \ldots, w_{2N+1}$ and t are all nonnegative integers. A necessary and sufficient condition for $X_k$, $i - N \le k \le i + N$, being the output of the WOS filter is that

$\sum_{\{j \mid X_j \ge X_k\}} w_j \ge t.$  (15)

The WOS filter can also be defined using (15). In such a definition, the weights and threshold value need not be nonnegative integers; they can be any nonnegative real numbers [15, 28].
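As an illustration of the sorting definition (14), the following sketch (not from the paper) computes one WOS filter output by duplicating each sample according to its weight and taking the t-th largest value. The window, threshold, and weights reproduce the numerical example used for Figure 1 in Section 3; the last weight is inferred from the multiset listed there.

```python
import numpy as np

def wos_filter_output(window, weights, t):
    """Output of a WOS filter for one window position (cf. (14)):
    each sample X_j is duplicated w_j times and the t-th largest
    value of the expanded multiset is returned."""
    expanded = np.repeat(np.asarray(window), np.asarray(weights))
    # t-th largest = element at index t-1 after sorting in descending order
    return np.sort(expanded)[::-1][t - 1]

# Window-width-9 example discussed with Figure 1.
window  = [100, 58, 78, 120, 113, 98, 105, 110, 95]   # samples X_{i-N}, ..., X_{i+N}
weights = [1, 1, 2, 1, 2, 5, 3, 2, 1]                 # w_1, ..., w_9 (last weight inferred)
print(wos_filter_output(window, weights, t=12))       # -> 98
```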
Using (15), the output $f(\mathbf{x})$ of a WOS filter for a binary input vector $\mathbf{x} = (x_{i-N}, x_{i-N+1}, \ldots, x_i, \ldots, x_{i+N})$ is written as

$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \sum_{j=i-N}^{i+N} w_j x_j \ge t, \\ 0 & \text{otherwise.} \end{cases}$  (16)

The function $f(\mathbf{x})$ is a special case of Boolean functions and is called a threshold function. Since WOS filters have nonnegative weights and threshold, they are stack filters. As a subclass of stack filters, WOS filters have representations in the threshold decomposition architecture. Assuming $X_i \in \{0, 1, \ldots, K-1\}$ for all i, the signal can be decomposed into K − 1 binary sequences $\{X_i^m\}_{m=1}^{K-1}$ by thresholding. This thresholding operation is called $T_m$ and is defined as

$X_i^m = T_m(X_i) = U(X_i - m) = \begin{cases} 1 & \text{if } X_i \ge m, \\ 0 & \text{otherwise,} \end{cases}$  (17)

where $U(\cdot)$ is the unit step function: $U(x) = 1$ if $x \ge 0$ and $U(x) = 0$ if $x < 0$. Note that

$X_i = \sum_{m=1}^{K-1} T_m(X_i) = \sum_{m=1}^{K-1} X_i^m.$  (18)

By using the threshold decomposition architecture, WOS filters can be implemented by threshold logic. That is, the output of WOS filters is defined as

$Y_i = \sum_{m=1}^{K-1} U(W^T X_i^m), \quad i = 1, 2, \ldots, L,$  (19)

where $X_i^m = [X_{i-N}^m, X_{i-N+1}^m, \ldots, X_i^m, \ldots, X_{i+N}^m, -1]^T$.

2.4. Adaptive neural filters

Let $\check{X} = (X_1, X_2, \ldots, X_L)$ and $\check{Z} = (Z_1, Z_2, \ldots, Z_L) \in \{0, 1, \ldots, K-1\}^L$ be the input and the desired output of the adaptive neural filter, respectively. If $X_i$ and $Z_i$ are jointly stationary, then the MSE to be minimized is

$J(W) = E\big[(Z_i - F_W(\mathbf{X}_i))^2\big] = E\bigg[\Big(\sum_{n=1}^{K-1}\big(T_n(Z_i) - \sigma(W^T X_i^n)\big)\Big)^2\bigg].$  (20)

Note that $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function, used instead of the unit step function $U(\cdot)$. Analogous to the backpropagation algorithm, the optimal adaptive neural filter can be derived by applying the following update rule [20]:

$W \longleftarrow W + \mu\Delta W = W + 2\mu\big(Z_i - F_W(\mathbf{X}_i)\big)\sum_{n=1}^{K-1} s_i^n (1 - s_i^n)\, X_i^n,$  (21)

where $\mu$ is a learning rate and $s_i^n = \sigma(W^T X_i^n) \in [0, 1]$; that is, $s_i^n$ is the approximate output of $F_W(\mathbf{X}_i)$ at level n. The learning process can be repeated from i = 1 to L, or with more iterations. These filters use a sigmoid function as the neuron activation function, which can approximate both linear functions and unit step functions; therefore, they can approximate both FIR filters and WOS filters. However, the above algorithm takes much computational time to sum up the K − 1 binary signals, and it is difficult to understand the correlated behaviors among signals. This motivates the development of another approach, presented in the next section, which reduces the computational cost and clarifies the correlated behaviors of signals from the viewpoint of support vector machines.

[Figure 1: The filtering behavior of WOS filters when $X_i = 113$, with thresholding at levels 1, 2, ..., 255.]
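The same output can be obtained through the threshold decomposition architecture of (17)–(19). The sketch below (again not from the paper) decomposes the window into 255 binary vectors, applies the threshold logic $U(W^T X_i^m)$ at every level, and sums the results; by the stacking property the sum equals the sorting-based output computed above.

```python
import numpy as np

def wos_by_threshold_decomposition(window, weights, t, K=256):
    """WOS output via threshold decomposition and stacking (cf. (17)-(19)):
    decompose the window into K-1 binary vectors, apply the threshold
    logic U(W^T X_i^m) at every level m, and sum the binary outputs."""
    window = np.asarray(window)
    W = np.append(weights, t)                            # W = [w_1, ..., w_{2N+1}; t]
    total = 0
    for m in range(1, K):
        x_m = np.append((window >= m).astype(int), -1)   # X_i^m = [T_m(X_{i-N}), ..., T_m(X_{i+N}), -1]
        total += int(np.dot(W, x_m) >= 0)                # U(W^T X_i^m)
    return total

window  = [100, 58, 78, 120, 113, 98, 105, 110, 95]
weights = [1, 1, 2, 1, 2, 5, 3, 2, 1]
print(wos_by_threshold_decomposition(window, weights, t=12))  # -> 98, same as the sorting form
```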
3. A NEW DICHOTOMOUS TECHNIQUE FOR DESIGNING WOS FILTERS

This section proposes a new approach which adopts the concept of dichotomy and reduces Boolean functions with K − 1 levels to Boolean functions with only two levels, thus saving considerable computational time.

Recall the definition of WOS filters from the previous section. Let $X_i^n = [x_{i-N}, x_{i-N+1}, \ldots, x_i, \ldots, x_{i+N}, -1]^T$, where $x_i = 1$ if $X_i \ge n$ and $x_i = 0$ if $X_i < n$, and let $W^T = [w_{i-N}, w_{i-N+1}, \ldots, w_i, \ldots, w_{i+N}, t]$. Using (16), the output of a WOS filter for a binary input vector $(x_{i-N}, x_{i-N+1}, \ldots, x_i, \ldots, x_{i+N})$ is written as

$U(W^T X_i^n) = \begin{cases} 1 & \text{if } \sum_{k=i-N}^{i+N} w_k x_k \ge t, \\ 0 & \text{if } \sum_{k=i-N}^{i+N} w_k x_k < t. \end{cases}$  (22)

In the multivalued domain $\{0, 1, \ldots, K-1\}$, the architecture of threshold decomposition has K − 1 unit step functions. Suppose the output value $Y_i$ is m; then $Y_i$ can be decomposed by threshold decomposition as

$Y_i = m \Longrightarrow \text{decomposition of } Y_i = (\underbrace{1, \ldots, 1}_{m \text{ times}}, \underbrace{0, \ldots, 0}_{K-1-m \text{ times}}).$  (23)

Besides, $\mathbf{X}_i$ is also decomposed into K − 1 binary vectors $X_i^1, X_i^2, \ldots, X_i^{K-1}$, and the corresponding K − 1 outputs of the unit step function are $U(W^T X_i^1), U(W^T X_i^2), \ldots, U(W^T X_i^{K-1})$. According to the stacking property [22],

$X_i^1 \ge X_i^2 \ge \cdots \ge X_i^{K-1} \Longrightarrow U(W^T X_i^1) \ge U(W^T X_i^2) \ge \cdots \ge U(W^T X_i^{K-1}).$  (24)

This implies $U(W^T X_i^1) = 1, \ldots, U(W^T X_i^m) = 1$, and $U(W^T X_i^{m+1}) = 0, \ldots, U(W^T X_i^{K-1}) = 0$. Then two conclusions are formulated: (a) for all $j \le m$, $U(W^T X_i^j) = 1$, and (b) for all $j \ge m + 1$, $U(W^T X_i^j) = 0$. Consequently, if the output $Y_i$ equals m, the definition of the WOS filter can be rewritten as

$Y_i = m = \sum_{n=1}^{K-1} U(W^T X_i^n) = \sum_{n=1}^{m} U(W^T X_i^n).$  (25)

Figure 1 illustrates this concept. It shows the filtering behavior of a window-width-9 WOS filter, based on the architecture of threshold decomposition. The data in the upper left are the input signals and the data in the upper right are the output after WOS filtering. The 256-valued input signals are decomposed into a set of 255 binary signals. After thresholding, each binary signal is independently processed according to (22). Finally, the outputs of the unit step functions are summed. In Figure 1, the threshold value t is 12; this means that the 12th largest value from the multiset {100, 58, 78, 78, 120, 113, 113, 98, 98, 98, 98, 98, 105, 105, 105, 110, 110, 95} is chosen. The physical output of the WOS filter is then 98. Figure 1 indicates that

(i) for all integers $n \le 98$, $X_i^n \ge X_i^{98}$ and $W^T X_i^n \ge W^T X_i^{98}$; when $U(W^T X_i^{98}) = 1$, then $U(W^T X_i^n)$ must equal one;
(ii) for all integers $n \ge 99$, $X_i^n \le X_i^{99}$ and $W^T X_i^n \le W^T X_i^{99}$; when $U(W^T X_i^{99}) = 0$, then $U(W^T X_i^n)$ must equal zero.

In the supervised learning mode, if the desired output is m, then the goal in designing a WOS filter is to adjust the weight vector such that it satisfies $U(W^T X_i^{m+1}) = 0$ and $U(W^T X_i^m) = 1$, implying that the input signal need not be considered at levels other than $X_i^{m+1}$ and $X_i^m$. This concept is referred to as dichotomy. Accordingly, the binary input signals $X_i^k$, $k \in \{1, 2, \ldots, 255\}$, are classified into 1-vector and 0-vector signals. The input signals $X_i^k$ are 1-vectors if they satisfy $U(W^T X_i^k) = 1$; they are 0-vectors if they satisfy $U(W^T X_i^k) = 0$. In vector space, these two classes are separated by an optimal hyperplane, which is bounded by $W^T X_i^m \ge 0$ and $W^T X_i^{m+1} < 0$ when the output value is m. Hence, the vector $X_i^m$ is called the 1-support vector and the vector $X_i^{m+1}$ is called the 0-support vector, because $X_i^m$ and $X_i^{m+1}$ are helpful in determining the optimal hyperplane.
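A minimal sketch of the dichotomy described above (not from the paper): given a window and a desired output m, only the two thresholded vectors $X_i^m$ (the 1-support vector) and $X_i^{m+1}$ (the 0-support vector) need to be generated. The window-width-5 example anticipates the training-data example given in Section 5.

```python
import numpy as np

def dichotomous_pair(window, m):
    """Build the 1-support vector X_i^m and the 0-support vector X_i^{m+1}
    for a window and desired output m (thresholding operations T_m and T_{m+1})."""
    window = np.asarray(window)
    x1 = (window >= m).astype(int)       # 1-support vector, desired output U(.) = 1
    x2 = (window >= m + 1).astype(int)   # 0-support vector, desired output U(.) = 0
    return x1, x2

# Window-width-5 example of Section 5: desired output 210.
x1, x2 = dichotomous_pair([240, 200, 90, 210, 180], m=210)
print(x1)  # [1 0 0 1 0]
print(x2)  # [1 0 0 0 0]
```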
4. SUPPORT VECTOR MACHINES FOR DICHOTOMOUS WOS FILTERS

In the previous section, the new approach for designing WOS filters reduced the Boolean functions with K − 1 levels to two levels. In this section, support vector machines are introduced into the design of dichotomous WOS filters. The new technique is illustrated as follows.

If the input vector is $X_i^n = [x_{i-N}, x_{i-N+1}, \ldots, x_i, \ldots, x_{i+N}, -1]^T$, $n = 0, 1, \ldots, 255$, and the desired output is m, then an appropriate $W^T$ can be found such that two constraints are satisfied: $W^T X_i^m \ge 0$ and $W^T X_i^{m+1} < 0$. For increasing tolerance, $W^T X_i^m \ge 0$ and $W^T X_i^{m+1} < 0$ are redefined as follows:

$\sum_{k=i-N}^{i+N} w_k x_{1k} - t \ge 1, \qquad \sum_{k=i-N}^{i+N} w_k x_{2k} - t \le -1,$  (26)

where $x_{1k}$ is the kth component of $X_i^m$ and $x_{2k}$ is the kth component of $X_i^{m+1}$. The corresponding outputs $y_{1i}$, $y_{2i}$ of (26) are $y_{1i} = U(W^T X_i^m) = 1$ and $y_{2i} = U(W^T X_i^{m+1}) = 0$. When $y_{1i}$ and $y_{2i}$ are considered, (27) is obtained as follows:

$(2y_{1i} - 1)\Big(\sum_{k=i-N}^{i+N} w_k x_{1k} - t\Big) \ge 1, \qquad (2y_{2i} - 1)\Big(\sum_{k=i-N}^{i+N} w_k x_{2k} - t\Big) \ge 1.$  (27)

Let $\mathbf{x}_{1i} = [x_{1(i-N)}, x_{1(i-N+1)}, \ldots, x_{1i}, \ldots, x_{1(i+N)}]^T$ and $\mathbf{x}_{2i} = [x_{2(i-N)}, x_{2(i-N+1)}, \ldots, x_{2i}, \ldots, x_{2(i+N)}]^T$. Then (27) can be expressed in vector form as follows:

$(2y_{1i} - 1)(\mathbf{w}^T\mathbf{x}_{1i} - t) \ge 1, \qquad (2y_{2i} - 1)(\mathbf{w}^T\mathbf{x}_{2i} - t) \ge 1,$  (28)

where $\mathbf{w}^T = [w_{i-N}, w_{i-N+1}, \ldots, w_i, \ldots, w_{i+N}]$. Equation (28) is similar to the constraint used in SVMs. Moreover, when misclassified data are considered, (28) is modified as follows:

$(2y_{1i} - 1)(\mathbf{w}^T\mathbf{x}_{1i} - t) + \xi_{1i} \ge 1, \qquad (2y_{2i} - 1)(\mathbf{w}^T\mathbf{x}_{2i} - t) + \xi_{2i} \ge 1, \qquad \xi_{1i}, \xi_{2i} \ge 0.$  (29)

Now, we formulate the optimal design of WOS filters as the following constrained optimization problem.

4.1. Linear support vector machines for dichotomous WOS filters

Given the training samples $\{(\mathbf{X}_i, m_i)\}_{i=1}^{L}$, find optimal values of the weight vector $\mathbf{w}$ and threshold t such that they satisfy the constraints

$(2y_{1i} - 1)(\mathbf{w}^T\mathbf{x}_{1i} - t) + \xi_{1i} \ge 1, \quad \text{for } i = 1, 2, \ldots, L,$  (30)
$(2y_{2i} - 1)(\mathbf{w}^T\mathbf{x}_{2i} - t) + \xi_{2i} \ge 1, \quad \text{for } i = 1, 2, \ldots, L,$  (31)
$\mathbf{w} \ge 0,$  (32)
$t \ge 0,$  (33)
$\xi_{1i}, \xi_{2i} \ge 0, \quad \text{for } i = 1, 2, \ldots, L,$  (34)

and such that the weight vector $\mathbf{w}$ and the slack variables $\xi_{1i}$, $\xi_{2i}$ minimize the cost function

$\Phi(\mathbf{w}, \xi_1, \xi_2) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{L}(\xi_{1i} + \xi_{2i}),$  (35)

where C is a user-specified positive parameter, $\mathbf{x}_{1i} = [X_{i-N}^{m_i}, X_{i-N+1}^{m_i}, \ldots, X_i^{m_i}, \ldots, X_{i+N}^{m_i}]^T$, and $\mathbf{x}_{2i} = [X_{i-N}^{m_i+1}, X_{i-N+1}^{m_i+1}, \ldots, X_i^{m_i+1}, \ldots, X_{i+N}^{m_i+1}]^T$. Note that the inequality constraint $\mathbf{w} \ge 0$ requires every element of the weight vector to be greater than or equal to zero.
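For illustration only, the primal problem (30)–(35) can be handed directly to a general-purpose solver. The sketch below uses SciPy's SLSQP routine rather than the Lagrangian-duality/SMO machinery the paper develops; the helper names and toy data are assumptions, not part of the original method.

```python
import numpy as np
from scipy.optimize import minimize

def fit_wos_svm_primal(X1, X2, C=10.0):
    """Solve the primal (30)-(35) with a generic solver. X1/X2 hold the
    1-support and 0-support vectors (y1=1, y2=0), so the factors (2y-1)
    are +1 and -1, respectively. Sketch only, not the paper's algorithm."""
    L, n = X1.shape
    # variable layout: z = [w (n entries), t, xi1 (L entries), xi2 (L entries)]
    def unpack(z):
        return z[:n], z[n], z[n + 1:n + 1 + L], z[n + 1 + L:]

    def objective(z):
        w, t, xi1, xi2 = unpack(z)
        return 0.5 * w @ w + C * (xi1.sum() + xi2.sum())   # cost function (35)

    cons = [
        # (30):  (w^T x1i - t) + xi1i - 1 >= 0
        {'type': 'ineq', 'fun': lambda z: (unpack(z)[0] @ X1.T - unpack(z)[1]) + unpack(z)[2] - 1},
        # (31): -(w^T x2i - t) + xi2i - 1 >= 0
        {'type': 'ineq', 'fun': lambda z: -(unpack(z)[0] @ X2.T - unpack(z)[1]) + unpack(z)[3] - 1},
    ]
    bounds = [(0, None)] * (n + 1 + 2 * L)   # (32)-(34): w, t, and slacks nonnegative
    res = minimize(objective, np.zeros(n + 1 + 2 * L),
                   bounds=bounds, constraints=cons, method='SLSQP')
    w, t, _, _ = unpack(res.x)
    return w, t

# Toy usage with two dichotomous pairs built as in Section 3 (invented data).
X1 = np.array([[1, 0, 0, 1, 0], [0, 1, 0, 1, 0]])
X2 = np.array([[1, 0, 0, 0, 0], [0, 1, 0, 0, 0]])
w, t = fit_wos_svm_primal(X1, X2)
print(np.round(w, 3), round(float(t), 3))
```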
Since the cost function $\Phi(\mathbf{w}, \xi_1, \xi_2)$ is a convex function of $\mathbf{w}$ and the constraints are linear in $\mathbf{w}$, the above constrained optimization problem can be solved by using the method of Lagrange multipliers [29]. The Lagrangian function is

$L(\mathbf{w}, t, \xi_1, \xi_2) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{L}(\xi_{1i} + \xi_{2i}) - \sum_{i=1}^{L}\alpha_i\big[(2y_{1i}-1)(\mathbf{w}^T\mathbf{x}_{1i} - t) + \xi_{1i} - 1\big] - \sum_{i=1}^{L}\beta_i\big[(2y_{2i}-1)(\mathbf{w}^T\mathbf{x}_{2i} - t) + \xi_{2i} - 1\big] - \boldsymbol{\gamma}^T\mathbf{w} - \eta t - \sum_{i=1}^{L}\mu_{1i}\xi_{1i} - \sum_{i=1}^{L}\mu_{2i}\xi_{2i},$  (36)

where the auxiliary nonnegative variables $\alpha_i$, $\beta_i$, $\boldsymbol{\gamma}$, $\eta$, $\mu_{1i}$, and $\mu_{2i}$ are called Lagrange multipliers, with $\boldsymbol{\gamma} \in R^{2N+1}$. The saddle point of the Lagrangian function $L(\mathbf{w}, t, \xi_1, \xi_2)$ determines the solution to the constrained optimization problem. Differentiating $L(\mathbf{w}, t, \xi_1, \xi_2)$ with respect to $\mathbf{w}$, t, $\xi_{1i}$, $\xi_{2i}$ yields the following four equations:

$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \boldsymbol{\gamma} - \sum_{i=1}^{L}\alpha_i(2y_{1i}-1)\mathbf{x}_{1i} - \sum_{i=1}^{L}\beta_i(2y_{2i}-1)\mathbf{x}_{2i},$
$\frac{\partial L}{\partial t} = \sum_{i=1}^{L}\alpha_i(2y_{1i}-1) + \sum_{i=1}^{L}\beta_i(2y_{2i}-1) - \eta,$
$\frac{\partial L}{\partial \xi_{1i}} = C - \alpha_i - \mu_{1i},$
$\frac{\partial L}{\partial \xi_{2i}} = C - \beta_i - \mu_{2i}.$  (37)

The optimal value is obtained by setting these derivatives equal to zero. Thus,

$\mathbf{w} = \boldsymbol{\gamma} + \sum_{i=1}^{L}\alpha_i(2y_{1i}-1)\mathbf{x}_{1i} + \sum_{i=1}^{L}\beta_i(2y_{2i}-1)\mathbf{x}_{2i},$  (38)
$0 = \sum_{i=1}^{L}\alpha_i(2y_{1i}-1) + \sum_{i=1}^{L}\beta_i(2y_{2i}-1) - \eta,$  (39)
$C = \alpha_i + \mu_{1i},$  (40)
$C = \beta_i + \mu_{2i}.$  (41)

At the saddle point, the product of each Lagrange multiplier with its corresponding constraint vanishes, as shown by

$\alpha_i\big[(2y_{1i}-1)(\mathbf{w}^T\mathbf{x}_{1i} - t) + \xi_{1i} - 1\big] = 0, \quad \text{for } i = 1, 2, \ldots, L,$  (42)
$\beta_i\big[(2y_{2i}-1)(\mathbf{w}^T\mathbf{x}_{2i} - t) + \xi_{2i} - 1\big] = 0, \quad \text{for } i = 1, 2, \ldots, L,$  (43)
$\mu_{1i}\xi_{1i} = 0, \quad \text{for } i = 1, 2, \ldots, L,$  (44)
$\mu_{2i}\xi_{2i} = 0, \quad \text{for } i = 1, 2, \ldots, L.$  (45)

By combining (40), (41), (44), and (45), (46) is obtained:

$\xi_{1i} = 0 \text{ if } \alpha_i < C, \qquad \xi_{2i} = 0 \text{ if } \beta_i < C.$  (46)

The corresponding dual problem is generated by introducing (38)–(41) into (36). Accordingly, the dual problem is formulated as follows. Given the training samples $\{(\mathbf{X}_i, m_i)\}_{i=1}^{L}$, find the Lagrange multipliers $\{\alpha_i\}_{i=1}^{L}$ and $\{\beta_i\}_{i=1}^{L}$ that maximize the objective function

$Q(\alpha, \beta) = \sum_{i=1}^{L}(\alpha_i + \beta_i) - \frac{1}{2}\boldsymbol{\gamma}^T\boldsymbol{\gamma} - \frac{1}{2}\sum_{i=1}^{L}\sum_{j=1}^{L}\alpha_i\alpha_j(2y_{1i}-1)(2y_{1j}-1)\mathbf{x}_{1i}^T\mathbf{x}_{1j} - \frac{1}{2}\sum_{i=1}^{L}\sum_{j=1}^{L}\beta_i\beta_j(2y_{2i}-1)(2y_{2j}-1)\mathbf{x}_{2i}^T\mathbf{x}_{2j} - \sum_{i=1}^{L}\sum_{j=1}^{L}\alpha_i\beta_j(2y_{1i}-1)(2y_{2j}-1)\mathbf{x}_{1i}^T\mathbf{x}_{2j} - \boldsymbol{\gamma}^T\sum_{i=1}^{L}\alpha_i(2y_{1i}-1)\mathbf{x}_{1i} - \boldsymbol{\gamma}^T\sum_{i=1}^{L}\beta_i(2y_{2i}-1)\mathbf{x}_{2i},$  (47)

subject to the constraints

$\sum_{i=1}^{L}\alpha_i(2y_{1i}-1) + \sum_{i=1}^{L}\beta_i(2y_{2i}-1) - \eta = 0, \quad 0 \le \alpha_i \le C, \quad 0 \le \beta_i \le C, \quad \text{for } i = 1, 2, \ldots, L, \quad \eta \ge 0, \quad \boldsymbol{\gamma} \ge 0,$  (48)

where C is a user-specified positive parameter, $\mathbf{x}_{1i} = [X_{i-N}^{m_i}, \ldots, X_{i+N}^{m_i}]^T$, and $\mathbf{x}_{2i} = [X_{i-N}^{m_i+1}, \ldots, X_{i+N}^{m_i+1}]^T$.

4.2. Nonlinear support vector machines for dichotomous WOS filters

When the number of training samples is large enough, (32) can be replaced by $\mathbf{w}^T\mathbf{x}_{1i} \ge 0$ because (1) $\mathbf{x}_{1i}$ is a binary vector and (2) all possible cases of $\mathbf{x}_{1i}$ are included among the training samples. Then the problem is reformulated as follows. Given the training samples $\{(\mathbf{X}_i, m_i)\}_{i=1}^{L}$, find optimal values of the weight vector $\mathbf{w}$ and threshold t such that they satisfy the constraints

$(2y_{1i}-1)(\mathbf{w}^T\mathbf{x}_{1i} - t) + \xi_{1i} \ge 1, \quad (2y_{2i}-1)(\mathbf{w}^T\mathbf{x}_{2i} - t) + \xi_{2i} \ge 1, \quad \mathbf{w}^T\mathbf{x}_{1i} \ge 0, \quad t \ge 0, \quad \xi_{1i}, \xi_{2i} \ge 0, \quad \text{for } i = 1, 2, \ldots, L,$  (49)

and such that the weight vector $\mathbf{w}$ and the slack variables $\xi_{1i}$, $\xi_{2i}$ minimize the cost function

$\Phi(\mathbf{w}, \xi_1, \xi_2) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{L}(\xi_{1i} + \xi_{2i}),$  (50)

where C is a user-specified positive parameter and $\mathbf{x}_{1i}$, $\mathbf{x}_{2i}$ are defined as above. Using the method of Lagrange multipliers and proceeding in a manner similar to that described in Section 4.1, the solution is obtained as follows:

$\mathbf{w} = \sum_{i=1}^{L}\gamma_i\mathbf{x}_{1i} + \sum_{i=1}^{L}\alpha_i(2y_{1i}-1)\mathbf{x}_{1i} + \sum_{i=1}^{L}\beta_i(2y_{2i}-1)\mathbf{x}_{2i}, \quad 0 = \sum_{i=1}^{L}\alpha_i(2y_{1i}-1) + \sum_{i=1}^{L}\beta_i(2y_{2i}-1) - \eta, \quad C = \alpha_i + \mu_{1i}, \quad C = \beta_i + \mu_{2i}.$  (51)

The dual problem is then generated by introducing (51):

$Q(\alpha, \beta, \gamma) = \sum_{i=1}^{L}(\alpha_i + \beta_i) - \frac{1}{2}\sum_{i=1}^{L}\sum_{j=1}^{L}\gamma_i\gamma_j\mathbf{x}_{1i}^T\mathbf{x}_{1j} - \frac{1}{2}\sum_{i=1}^{L}\sum_{j=1}^{L}\alpha_i\alpha_j(2y_{1i}-1)(2y_{1j}-1)\mathbf{x}_{1i}^T\mathbf{x}_{1j} - \frac{1}{2}\sum_{i=1}^{L}\sum_{j=1}^{L}\beta_i\beta_j(2y_{2i}-1)(2y_{2j}-1)\mathbf{x}_{2i}^T\mathbf{x}_{2j} - \sum_{i=1}^{L}\sum_{j=1}^{L}\gamma_i\alpha_j(2y_{1j}-1)\mathbf{x}_{1i}^T\mathbf{x}_{1j} - \sum_{i=1}^{L}\sum_{j=1}^{L}\gamma_i\beta_j(2y_{2j}-1)\mathbf{x}_{1i}^T\mathbf{x}_{2j} - \sum_{i=1}^{L}\sum_{j=1}^{L}\alpha_i\beta_j(2y_{1i}-1)(2y_{2j}-1)\mathbf{x}_{1i}^T\mathbf{x}_{2j}.$  (52)

The input data are mapped into a high-dimensional feature space by some nonlinear mapping chosen a priori. Let $\varphi$ denote a set of nonlinear transformations from the input space $R^m$ to a higher-dimensional feature space; then (52) becomes the same expression with every inner product $\mathbf{x}^T\mathbf{x}'$ replaced by $\varphi^T(\mathbf{x})\varphi(\mathbf{x}')$ (53). The inner product of two vectors induced in the feature space can be replaced by the inner-product kernel denoted by $K(\mathbf{x}, \mathbf{x}_i)$ and defined by

$K(\mathbf{x}, \mathbf{x}_i) = \varphi(\mathbf{x})\cdot\varphi(\mathbf{x}_i).$  (54)

Once a kernel $K(\mathbf{x}, \mathbf{x}_i)$ which satisfies Mercer's condition has been selected, the nonlinear model is stated as follows. Given the training samples $\{(\mathbf{X}_i, m_i)\}_{i=1}^{L}$, find the Lagrange multipliers $\{\alpha_i\}_{i=1}^{L}$, $\{\beta_i\}_{i=1}^{L}$, and $\{\gamma_i\}_{i=1}^{L}$ that maximize the objective function

$Q(\alpha, \beta, \gamma) = \sum_{i=1}^{L}(\alpha_i + \beta_i) - \frac{1}{2}\sum_{i=1}^{L}\sum_{j=1}^{L}\gamma_i\gamma_j K(\mathbf{x}_{1i}, \mathbf{x}_{1j}) - \frac{1}{2}\sum_{i=1}^{L}\sum_{j=1}^{L}\alpha_i\alpha_j(2y_{1i}-1)(2y_{1j}-1)K(\mathbf{x}_{1i}, \mathbf{x}_{1j}) - \frac{1}{2}\sum_{i=1}^{L}\sum_{j=1}^{L}\beta_i\beta_j(2y_{2i}-1)(2y_{2j}-1)K(\mathbf{x}_{2i}, \mathbf{x}_{2j}) - \sum_{i=1}^{L}\sum_{j=1}^{L}\gamma_i\alpha_j(2y_{1j}-1)K(\mathbf{x}_{1i}, \mathbf{x}_{1j}) - \sum_{i=1}^{L}\sum_{j=1}^{L}\gamma_i\beta_j(2y_{2j}-1)K(\mathbf{x}_{1i}, \mathbf{x}_{2j}) - \sum_{i=1}^{L}\sum_{j=1}^{L}\alpha_i\beta_j(2y_{1i}-1)(2y_{2j}-1)K(\mathbf{x}_{1i}, \mathbf{x}_{2j}),$  (55)

subject to the constraints

$\sum_{i=1}^{L}\alpha_i(2y_{1i}-1) + \sum_{i=1}^{L}\beta_i(2y_{2i}-1) - \eta = 0, \quad 0 \le \alpha_i \le C, \quad 0 \le \beta_i \le C, \quad 0 \le \gamma_i, \quad \text{for } i = 1, 2, \ldots, L,$  (56)

where C is a user-specified positive parameter, $\mathbf{x}_{1i} = [X_{i-N}^{m_i}, \ldots, X_{i+N}^{m_i}]^T$, and $\mathbf{x}_{2i} = [X_{i-N}^{m_i+1}, \ldots, X_{i+N}^{m_i+1}]^T$.
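A small sketch (not from the paper) of the kernel substitution (54): building the RBF Gram matrices of the kind needed by the kernelized dual (55). The value gamma = 1.0 is an assumption made for illustration; the experiments in Section 5 only report that values above 0.5 behaved well.

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    """K(u, v) = exp(-gamma * ||u - v||^2), cf. (54); gamma = 1.0 assumed here."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return np.exp(-gamma * np.sum((u - v) ** 2))

def gram(A, B, gamma=1.0):
    """Gram matrix [K(a_i, b_j)] over two pattern sets, as required in (55)."""
    return np.array([[rbf_kernel(a, b, gamma) for b in B] for a in A])

# 1-support and 0-support vectors from two training windows (binary patterns, invented).
X1 = np.array([[1, 0, 0, 1, 0], [0, 1, 0, 1, 0]])
X2 = np.array([[1, 0, 0, 0, 0], [0, 1, 0, 0, 0]])
print(gram(X1, X1))   # K(x1i, x1j) terms
print(gram(X1, X2))   # K(x1i, x2j) cross terms
print(gram(X2, X2))   # K(x2i, x2j) terms
```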
5. EXPERIMENTAL RESULTS

The "Lenna" and "Boat" images were used as training samples in a simulation. Dichotomous WOS filters were compared with adaptive neural filters, the rank-order filter, and the $L_p$ norm WOS filter for the restoration of noisy images [20, 30, 31]. In the simulation, the proposed dichotomous WOS filters were used to restore images corrupted by impulse noise, and the training results were used to filter the noisy images. For image restoration, the objective function was modified in order to obtain an optimal solution. The learning steps are illustrated as follows.

Step 1. In the ith training step, choose the input signal $\mathbf{X}_i$ from a corrupted image and the compared signal $D_i$ from an uncorrupted image, where $D_i \in \{0, 1, \ldots, K-1\}$. The desired output $Y_i$ is selected from the input signal vector $\mathbf{X}_i$ as $Y_i = \{X_j \mid |X_j - D_i| \le |X_k - D_i|,\ X_j, X_k \in \mathbf{X}_i\}$.

Step 2. The training patterns $\mathbf{x}_{1i}$ and $\mathbf{x}_{2i}$ are obtained from the input signal vector $\mathbf{X}_i$ by using the desired output $Y_i$.

Step 3. Calculate the distances $S_{pi}$ and $S_{qi}$, where $S_{pi}$ and $S_{qi}$ are the distances between $X_p$ and $Y_i$ and between $X_q$ and $Y_i$, respectively. Note that $X_p = \{X_j \mid Y_i - X_j \le Y_i - X_k,\ X_j, X_k \in \mathbf{X}_i, \text{ and } X_j, X_k < Y_i\}$ and $X_q = \{X_j \mid X_j - Y_i \le X_k - Y_i,\ X_j, X_k \in \mathbf{X}_i, \text{ and } X_j, X_k > Y_i\}$.

Step 4. The objective function is modified by replacing $\xi_{1i}$ and $\xi_{2i}$ with $S_{pi}\xi_{1i}$ and $S_{qi}\xi_{2i}$, where $S_{pi}$ and $S_{qi}$ are taken as the weights of the errors.

Step 5. Apply the SVM model stated in Section 4 to obtain the optimal solution.

A large dataset is generated when training data are obtained from a 256 × 256 image, and nonlinear SVMs then create unwieldy storage problems. There are various ways to overcome this, including sequential minimal optimization (SMO), projected conjugate gradient chunking (PCGC), reduced support vector machines (RSVMs), and so forth [32–34]. In this paper, SMO was adopted because it has demonstrated outstanding performance.

Consider an example to illustrate how to generate the training data from the input signal. Let the input signal inside the window of width 5 be $\mathbf{X}_i = [240, 200, 90, 210, 180]^T$. Suppose that the compared signal $D_i$, selected from the uncorrupted image, is 208. The desired output $Y_i$ is selected from the input signal $\mathbf{X}_i$; according to the principle of WOS filters, the desired output is 210. Then,

$\mathbf{x}_{1i} = [T_{210}(240), T_{210}(200), T_{210}(90), T_{210}(210), T_{210}(180)]^T = [1, 0, 0, 1, 0]^T,$
$\mathbf{x}_{2i} = [T_{211}(240), T_{211}(200), T_{211}(90), T_{211}(210), T_{211}(180)]^T = [1, 0, 0, 0, 0]^T,$  (57)

and $y_{1i} = 1$, $y_{2i} = 0$. The balance of the training data is generated in the same way.

[Figure 2: (a) Original "Lenna" image; (b) "Lenna" image corrupted by 5% impulse noise; (c) "Lenna" image corrupted by 10% impulse noise; (d) "Lenna" image corrupted by 15% impulse noise.]

This section compares the dichotomous WOS filters with the adaptive neural filters in terms of three properties: time complexity, MSE, and convergence speed. Figures 2 and 3 present the training pairs, Figures 4 and 6 present the images restored by the dichotomous WOS filters, and Figures 5 and 7 show the images restored by the adaptive neural filters. Using SVMs on the dichotomous WOS filters with a 3 × 3 window, the best near-optimal weight values for the test images corrupted by 5% impulse noise are listed as follows:

$\text{"Lenna"} \Longrightarrow \begin{pmatrix} 0.1968 & 0.2585 & 0.1646 \\ 0.1436 & 0.5066 & 0.1322 \\ 0.2069 & 0.2586 & 0.1453 \end{pmatrix}, \qquad \text{"Boat"} \Longrightarrow \begin{pmatrix} 0.1611 & 0.2937 & 0.1344 \\ 0.0910 & 0.5280 & 0.2838 \\ 0.1988 & 0.1887 & 0.1255 \end{pmatrix}.$  (58)

Notably, the weight matrix was translated row-wise in the simulation; that is, $w_1 = w_{11}$, $w_2 = w_{12}$, $w_3 = w_{13}$, $w_4 = w_{21}$, $w_5 = w_{22}$, $w_6 = w_{23}$, $w_7 = w_{31}$, $w_8 = w_{32}$, $w_9 = w_{33}$.
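To show how the learned weights in (58) would be applied, the sketch below (not from the paper) flattens the 3 × 3 "Lenna" matrix row-wise into $w_1, \ldots, w_9$ and filters an image with the real-weight counterpart of the sorting rule (14). The paper does not report the threshold accompanying (58), so t equal to half the total weight, a weighted-median-like choice, is assumed purely for illustration.

```python
import numpy as np

W_lenna = np.array([[0.1968, 0.2585, 0.1646],
                    [0.1436, 0.5066, 0.1322],
                    [0.2069, 0.2586, 0.1453]])
w = W_lenna.flatten()            # row-wise translation: w1, ..., w9

def wos_real_weights(window, w, t):
    """WOS output for real-valued weights: the largest sample value v such that
    the total weight of samples >= v reaches t (real-valued analogue of (14))."""
    window = np.asarray(window, float)
    candidates = [v for v in np.unique(window) if w[window >= v].sum() >= t]
    return max(candidates)

def restore(image, w, t):
    """Filter every interior pixel of a grayscale image with the 3x3 WOS filter."""
    out = image.copy()
    for r in range(1, image.shape[0] - 1):
        for c in range(1, image.shape[1] - 1):
            out[r, c] = wos_real_weights(image[r - 1:r + 2, c - 1:c + 2].flatten(), w, t)
    return out

# The threshold below is an assumption (half the total weight), not a value from the paper.
t = w.sum() / 2
noisy = np.random.randint(0, 256, size=(16, 16))   # stand-in for a noisy test image
print(restore(noisy, w, t).shape)
```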
Three different kernel functions were considered in our experiments: the polynomial function $(\text{gamma}\cdot\mathbf{u}^T\mathbf{v} + \text{coef})^{\text{degree}}$, the radial basis function $\exp(-\text{gamma}\cdot\|\mathbf{u} - \mathbf{v}\|^2)$, and the sigmoid function $\tanh(\text{gamma}\cdot\mathbf{u}^T\mathbf{v} + \text{coef})$. In our experiments, each element of a training pattern is either 0 or 1. Suppose that three training patterns are $\mathbf{x}_{k1} = [0, 0, 0, 0, 0, 0, 0, 0, 0]$, $\mathbf{x}_{k2} = [0, 1, 0, 0, 0, 0, 0, 0, 0]$, and $\mathbf{x}_{k3} = [0, 0, 0, 1, 0, 0, 0, 0, 0]$. Obviously, the difference between the pairs $(\mathbf{x}_{k1}, \mathbf{x}_{k2})$ and $(\mathbf{x}_{k1}, \mathbf{x}_{k3})$ cannot be distinguished when the polynomial function or the sigmoid function is adopted as the kernel function. So, in our experiments, only the radial basis function is considered. Besides, after testing with different values of gamma, a fixed value was adopted for this experiment; better classification ability and filtering performance are obtained when the value of gamma is bigger than 0.5.

Time. If the computational time is T (time units) on each level, then the dichotomous WOS filters took only 2T (time units) to filter 256 gray levels of data, whereas the adaptive neural filters took 255T (time units).

[Figure 3: (a) Original "Boat" image; (b) "Boat" image corrupted by 5% impulse noise; (c) "Boat" image corrupted by 10% impulse noise; (d) "Boat" image corrupted by 15% impulse noise.]
[Figure 4: Using the dichotomous WOS filter to restore (a) the 5% impulse noise image; (b) the 10% impulse noise image; (c) the 15% impulse noise image.]
[Figure 5: Using the adaptive neural filter to restore (a) the 5% impulse noise image; (b) the 10% impulse noise image; (c) the 15% impulse noise image.]
[Figure 6: Using the dichotomous WOS filter to restore (a) the 5% impulse noise image; (b) the 10% impulse noise image; (c) the 15% impulse noise image.]
[Figure 7: Using the adaptive neural filter to restore (a) the 5% impulse noise image; (b) the 10% impulse noise image; (c) the 15% impulse noise image.]

Table 1: Comparison of the performance (MSE) of different filters on impulse-noise images.

MSE errors            WOS filter by SVMs   Adaptive neural filter   Lp norm WOS filter   Rank-order filter
"Lenna" 5% noise            45                     45                     50.8                 67.1
"Lenna" 10% noise           80.2                   80                     82.8                 90.7
"Lenna" 15% noise          120.8                  119                    125.6                139.9
"Boat" 5% noise             95                     95.6                  105.1                155.7
"Boat" 10% noise           150.2                  149                    160.5                192.8
"Boat" 15% noise           208.8                  206                    218.4                256.9

MSE. Table 1 lists the MSE values of the images restored with the different filters. In this experiment, the adaptive neural filters used 256 levels to filter the data. In the simulation, ninefold cross-validation was performed on the dataset to evaluate how well the algorithm generalizes to future data [35]. The ninefold cross-validation method extracts a certain proportion, typically 11%, of the training set as the tuning set, which is a surrogate of the testing set. For each training, the proposed method was applied to the rest of the training data to obtain a filter, and the tuning-set correctness of this filter was computed.
Table 1 indicates that the dichotomous WOS filters performed as well as the adaptive neural filters; both outperformed the rank-order filters and the $L_p$ norm WOS filter.

[Figure 8: Converging speed of the dichotomous WOS filter and the adaptive neural filter: "-" indicates the adaptive neural filter; "x" indicates the dichotomous WOS filter.]

Figure 8 compares convergence speeds. In Figure 8, the vertical axis represents the MSE, while the horizontal axis represents the number of training epochs; each unit of the horizontal axis represents 10 training epochs. Figure 8 reveals that the dichotomous WOS filter converged steadily and more quickly than the adaptive neural filter. In summary, the above comparisons reveal that dichotomous WOS filters outperformed adaptive neural filters, rank-order filters, and the $L_p$ norm WOS filter.

6. CONCLUSION

Support vector machines (SVMs), a classification algorithm for the machine learning community, have been shown to provide excellent performance in many applications. In this paper, SVMs are introduced into the design of WOS filters in order to improve performance. WOS filters are a special subset of stack filters. Each stack filter is based on a positive Boolean function and needs much computation time to carry out its Boolean computing, which makes stack filters difficult to use in applications. Until now, the computation time had been only marginally improved by the conventional design approaches for stack filters or neural networks. Although the adaptive neural filter can effectively remove noise of various kinds, including Gaussian noise and impulsive noise, its learning process involves a great deal of computational time.

This work has proposed a new design technique to approximate optimal WOS filters. The proposed technique, based on threshold decomposition, uses a dichotomous approach to reduce the Boolean computing from 255 levels to two levels; the technique of SVMs is then used to obtain an optimal hyperplane separating those two levels. The advantage of SVMs is that the risk of misclassification is minimized not only for the examples in the training set, but also for the unseen examples of the test set. Our experimental results have shown that images were processed more efficiently than with an adaptive neural filter.

The proposed algorithm is designed to handle impulse noise and provided excellent performance on images containing impulse noise. We have experimented with images containing Gaussian noise, but the experimental results were unsatisfactory. This suggests that a universal adaptive filter which can deal with any kind of noise simultaneously does not yet exist in the field of rank-ordered filters. This experimental result is consistent with the conclusion proposed in [36].

ACKNOWLEDGMENT

This work is supported by the National Science Council of Taiwan under Grant NSC93-2213-E-194-020.

REFERENCES

[1] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, UK, 2000.
[2] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[3] Y.-J. Lee and O. L. Mangasarian, "SSVM: a smooth support vector machine for classification," Computational Optimization and Applications, vol. 20, no. 1, pp. 5–22, 2001.
[4] O. L. Mangasarian, "Generalized support vector machines," in Advances in Large Margin Classifiers, A. J. Smola, P. Bartlett, B. Schölkopf, and C. Schuurmans, Eds., pp. 135–146, MIT Press, Cambridge, Mass, USA, 2000.
[5] O. L. Mangasarian and D. R. Musicant, "Successive overrelaxation for support vector machines," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1032–1037, 1999.
[6] O. Chapelle, P. Haffner, and V. N. Vapnik, "Support vector machines for histogram-based image classification," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1055–1064, 1999.
[7] G. Guo, S. Z. Li, and K. L. Chan, "Support vector machines for face recognition," Image and Vision Computing,
vol. 19, no. 9-10, pp. 631–638, 2001.
[8] H. Drucker, D. Wu, and V. N. Vapnik, "Support vector machines for spam categorization," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1048–1054, 1999.
[9] V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, NY, USA, 1998.
[10] R. Yang, M. Gabbouj, and P.-T. Yu, "Parametric analysis of weighted order statistics filters," IEEE Signal Processing Letters, vol. 1, no. 6, pp. 95–98, 1994.
[11] P.-T. Yu, "Some representation properties of stack filters," IEEE Transactions on Signal Processing, vol. 40, no. 9, pp. 2261–2266, 1992.
[12] P.-T. Yu and R.-C. Chen, "Fuzzy stack filters—their definitions, fundamental properties, and application in image processing," IEEE Transactions on Image Processing, vol. 5, no. 6, pp. 838–854, 1996.
[13] P.-T. Yu and E. J. Coyle, "The classification and associative memory capability of stack filters," IEEE Transactions on Signal Processing, vol. 40, no. 10, pp. 2483–2497, 1992.
[14] P.-T. Yu and E. J. Coyle, "Convergence behavior and N-roots of stack filters," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 9, pp. 1529–1544, 1990.
[15] P.-T. Yu and W.-H. Liao, "Weighted order statistics filters—their classification, some properties, and conversion algorithm," IEEE Transactions on Signal Processing, vol. 42, no. 10, pp. 2678–2691, 1994.
[16] C. Chakrabarti and L. E. Lucke, "VLSI architectures for weighted order statistic (WOS) filters," Signal Processing, vol. 80, no. 8, pp. 1419–1433, 2000.
[17] S. W. Perry and L. Guan, "Weight assignment for adaptive image restoration by neural networks," IEEE Transactions on Neural Networks, vol. 11, no. 1, pp. 156–170, 2000.
[18] H.-S. Wong and L. Guan, "A neural learning approach for adaptive image restoration using a fuzzy model-based network architecture," IEEE Transactions on Neural Networks, vol. 12, no. 3, pp. 516–531, 2001.
[19] L. Yin, J. Astola, and Y. Neuvo, "Optimal weighted order statistic filters under the mean absolute error criterion," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '91), vol. 4, pp. 2529–2532, Toronto, Ontario, Canada, April 1991.
[20] L. Yin, J. Astola, and Y. Neuvo, "A new class of nonlinear filters—neural filters," IEEE Transactions on Signal Processing, vol. 41, no. 3, pp. 1201–1222, 1993.
[21] L. Yin, J. Astola, and Y. Neuvo, "Adaptive multistage weighted order statistic filters based on the backpropagation algorithm," IEEE Transactions on Signal Processing, vol. 42, no. 2, pp. 419–422, 1994.
[22] P. D. Wendt, E. J. Coyle, and N. C. Gallagher, "Stack filters," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 4, pp. 898–911, 1986.
[23] M. J. Avedillo, J. M. Quintana, and E. Rodriguez-Villegas, "Simple parallel weighted order statistic filter implementations," in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS '02), vol. 4, pp. 607–610, May 2002.
[24] A. Gasteratos and I. Andreadis, "A new algorithm for weighted order statistics operations," IEEE Signal Processing Letters, vol. 6, no. 4, pp. 84–86, 1999.
[25] H. Huttunen and P. Koivisto, "Training based optimization of weighted order statistic filters under breakdown criteria," in Proceedings of the International Conference on Image Processing (ICIP '99), vol. 4, pp. 172–176, Kobe, Japan, October 1999.
[26] P. Koivisto and H. Huttunen, "Design of weighted order statistic filters by training-based optimization," in Proceedings of the 6th International Symposium on Signal Processing and Its Applications (ISSPA '01), vol.
1, pp. 40–43, Kuala Lumpur, Malaysia, August 2001.
[27] S. Marshall, "New direct design method for weighted order statistic filters," IEE Proceedings - Vision, Image, and Signal Processing, vol. 151, no. 1, pp. 1–8, 2001.
[28] O. Yli-Harja, J. Astola, and Y. Neuvo, "Analysis of the properties of median and weighted median filters using threshold logic and stack filter representation," IEEE Transactions on Signal Processing, vol. 39, no. 2, pp. 395–410, 1991.
[29] D. P. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, Mass, USA, 1999.
[30] J. Poikonen and A. Paasio, "A ranked order filter implementation for parallel analog processing," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 51, no. 5, pp. 974–987, 2004.
[31] C. E. Savin, M. O. Ahmad, and M. N. S. Swamy, "Lp norm design of stack filters," IEEE Transactions on Image Processing, vol. 8, no. 12, pp. 1730–1743, 1999.
[32] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, Mass, USA, 2002.
[33] P. E. Gill, W. Murray, and M. H. Wright, Practical Optimization, Academic Press, London, UK, 1981.
[34] Y.-J. Lee and O. L. Mangasarian, "RSVM: reduced support vector machines," in Proceedings of the 1st SIAM International Conference on Data Mining, Chicago, Ill, USA, April 2001.
[35] M. Stone, "Cross-validatory choice and assessment of statistical predictions," Journal of the Royal Statistical Society, Series B, vol. 36, pp. 111–147, 1974.
[36] L. Yin, R. Yang, M. Gabbouj, and Y. Neuvo, "Weighted median filters: a tutorial," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 43, no. 3, pp. 157–192, 1996.

Chih-Chia Yao received his B.S. degree in computer science and information engineering from National Chiao Tung University in 1992 and his M.S. degree in computer science and information engineering from National Cheng Kung University, Tainan, Taiwan, in 1994. He is a Lecturer in the Department of Information Management, Nankai College, Nantou, Taiwan, and is currently a Ph.D. candidate in the Department of Computer Science and Information Engineering, National Chung Cheng University. His research interests include possibility reasoning, machine learning, data mining, and fuzzy inference systems.

Pao-Ta Yu received the B.S. degree in mathematics from National Taiwan Normal University in 1979, the M.S. degree in computer science from National Taiwan University, Taipei, Taiwan, in 1985, and the Ph.D. degree in electrical engineering from Purdue University, West Lafayette, Ind, USA, in 1989. Since 1990, he has been with the Department of Computer Science and Information Engineering at National Chung Cheng University, Chiayi, Taiwan, where he is currently a Professor. His research interests include e-learning, neural networks and fuzzy systems, nonlinear filter design, intelligent networks, and XML technology.