
Linear Support Vector Machine with Linear Complexity (Support vector machine tuyến tính với độ phức tạp tuyến tính)





MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF PEDAGOGY

FINAL REPORT OF THE UNIVERSITY-LEVEL SCIENCE AND TECHNOLOGY PROJECT

LINEAR SUPPORT VECTOR MACHINE WITH LINEAR COMPLEXITY
Project code: CS2014.19.39

Host institution: Faculty of Information Technology
Principal investigator: Dr. Lê Minh Trung
Confirmation of the host institution (signature, full name) — Principal investigator (signature, full name)

HO CHI MINH CITY – 12/2015

Form 1.10 CS
INFORMATION ON THE RESEARCH RESULTS OF THE UNIVERSITY-LEVEL SCIENCE AND TECHNOLOGY PROJECT

Project title: Linear Support Vector Machine with Linear Complexity (Support vector machine tuyến tính với độ phức tạp tuyến tính)
Code number: CS2014.19.39
Principal investigator: Dr. Lê Minh Trung. Tel: 0937856369. E-mail: trunglm@hcmup.edu.vn
Host institution: Faculty of Information Technology, Ho Chi Minh City University of Pedagogy
Cooperating organizations and individuals: (none listed)
Duration: from 11/2014 to 12/2015
Objectives: to investigate and explore linear Support Vector Machines with linear complexity in the contexts of online learning and distributed computing.
Main contents: develop linear SVM algorithms for the online learning setting; develop a linear SVM algorithm that runs on the Spark distributed system.
Results obtained (scientific, applied, training, socio-economic):
1) Trung Le, Dinh Phung, Khanh Nguyen, Svetha Venkatesh: Fast One-Class Support Vector Machine for Novelty Detection. PAKDD 2015: 189-200.
2) Khanh Nguyen, Trung Le, Vinh Lai, Duy Nguyen, Dat Tran, Wanli Ma: Least Square Support Vector Machine for Large-scale Dataset. IJCNN 2015: 1-8.
3) Trung Le, Van Nguyen, Anh Nguyen, Khanh Nguyen: Adaptable Linear Support Vector Machine. NICS 2015.

Form 1.11 CS
SUMMARY

Project title: Linear Support Vector Machine with Linear Complexity
Code number: CS2014.19.39
Coordinator: Dr. Lê Minh Trung
Implementing institution: Faculty of Information Technology, HCMc University of Pedagogy
Cooperating institution(s): (none listed)
Duration: from 11/2014 to 12/2015
Objectives: investigating linear SVM with linear complexity in the context of online learning and distributed computing.
Main contents: develop linear SVM in the context of online learning; develop linear SVM in the context of distributed computing.
Results obtained:
1) Trung Le, Dinh Phung, Khanh Nguyen, Svetha Venkatesh: Fast One-Class Support Vector Machine for Novelty Detection. PAKDD 2015: 189-200.
2) Khanh Nguyen, Trung Le, Vinh Lai, Duy Nguyen, Dat Tran, Wanli Ma: Least Square Support Vector Machine for Large-scale Dataset. IJCNN 2015: 1-8.
3) Trung Le, Van Nguyen, Anh Nguyen, Khanh Nguyen: Adaptable Linear Support Vector Machine. NICS 2015.

Adaptable Linear Support Vector Machine
Anonymous authors

Abstract—Linear Support Vector Machine (LSVM) has recently become one of the most prominent learning methods for solving classification and regression problems because of its applications in text classification, word-sense disambiguation, and drug design. However, LSVM and its variations can neither adapt to a dynamic dataset nor learn in online mode. In this paper, we introduce an Adaptable Linear Support Vector Machine (ALSVM) which scales linearly with the size of the training set. The most notable feature of ALSVM is that its decision boundary is adapted in closed form when data are added or removed. Index
Terms—Linear Support Vector Machine, adaptable method, linear method I I NTRODUCTION Support Vector Machine (SVM) [1] is a very well-known tool for classification and regression problems There have been plenty of variations of SVM proposed and applied to real-world problems, e.g., Least Square Support Vector Machine (LS-SVM) [11] and Proximal Support Vector Machine (PSVM) [4] SVM is associated with non-linear or linear kernel Although non-linear kernels often offer higher classification accuracies for real-world datasets, in many applications, e.g., text classification, word-sense disambiguation, and drug design, linear kernel is preferred because of its much shorter training time [7] Many methods have been proposed for Linear Support Vector Machine (LSVM) Active Support Vector Machine (ASVM) [8] and Lagrangian Support Vector Machine (LaSVM) [9] based on the formulation of Proximal Support Vector Machine and can linearly scale with the training set size N However, both ASVM and La-SVM require to compute inverse of a matrix of size (d + 1) × (d + 1), which costs over O d2.373 and scale roughly in the best case O N d2 [7] SVM-Perf [7], which is regarded as a variation of Structural SVM, uses cutting-plane technique to solve its optimization problem SVM-Perf was proven much faster than decomposition methods, e.g, SVM-Light [6] and LIBSVM [2] and scales with O (N s), where s is the sparsity of the training set In Pegasos [10], stochastic gradient descent method has been applied to solve the optimization problem of LSVM in primal Pegasos was proven to scale with the complexity O (N s); but it is really challenging to determine the appropriate values for the number of iterations T and the sub-sample size k Other methods concerned the unconstrained or constrained form of SVM and employed optimization techniques to find their solutions, e.g., [3, 5] Among these methods, LibLinear [5] emerges as the most popular tool, which has been widely used by the research community The computational complexity of LibLinear to gain −precision is known as O N dlog All the above-mentioned LSVM and its variations cannot adapt accordingly to a dynamic dataset in a close form In this paper, we depart from the formulation of Least Square Support Vector Machine (LS-SVM) [11] to propose Adaptable Linear Support Vector Machine (ALSVM) whose decision boundary can be adapted in a close form when data are added or removed accordingly Concretely, when one data sample is added to or removed from the current training set, the decision boundary is adapted with the computational complexity O d2 This feature enables ALSVM to run in online mode For the batch mode, by applying inductive strategy, besides ALSVM we also propose Fast Adaptable Linear Support Vector Machine (FALSVM) where data samples are subsequently added to the training set and the adaptations only occur for incorrect classifications In the batch mode, the computational complexity of ALSVM and FALSVM are O N d2 The experiments established in this paper show that in batch mode, the proposed methods are superior in computation time to other methods for dataset of thousands of dimensions II VARIATIONS OF S UPPORT V ECTOR M ACHINE A Least Square Support Vector Machine (LS-SVM) Least Square Support Vector Machine (LS-SVM) [11] is a variation of SVM LSVM uses square loss function in its formulation and replaces the inequality constrains by the equality ones: w,b,ξ w N ξi2 +C i=1 T s.t : ∀N i=1 : yi w φ (xi ) + b = − ξi where φ is a transformation from input space to 
feature space, ξ = [ξ1 , ξ2 , , ξN ] is vector of slack variables B Proximal Support Vector Machine (PSVM) Proximal Support Vector Machine (PSVM) [4] is another variation of SVM PSVM employs square loss function and keeps the samples as closest as possible to the margins: w,b,ξ ( w N + b2 ) + C ξi2 i=1 T s.t : ∀N i=1 : yi w φ (xi ) + b = − ξi (1) III L INEAR L EAST S QUARE S UPPORT V ECTOR M ACHINE (LLS-SVM) To find the formula for updating the normal vector, we derive as follows: A Formulation To eliminate the bias parameter b, we use the transformaT T tion: x ←− xT and w ←− wT b to the formulation of LS-SVM The formulation of linear LS-SVM is as follows: w,ξ s.t : w ∀N i=1 N i=1 T : yi w xi = − ξi (2) where xi ∈ Rd+1 is assumed to append a constant and w ∈ Rd+1 Actually, the optimization problem of LLS-SVM in Eq (2) is strictly equivalent to that of PSVM in linear case as shown in Eq (1) B Solution Substituting the constrainst to the objective function in Eq (2), we need to minimize the following optimization problem: L (w) = w N − yi wT xi +C i=1 Setting the derivative to 0, we gain: δL = → w + 2C δw N yi xi yi wT xi − = i=1 N yi xi N T → Gw = AA + νI w = yi xi = Ay i=1 N where G = i=1 xi xTi + νI ∈ R [(d + 1) × (d + 1)] , A = N T [xi ]i=1 ∈ R [(d + 1) × N ], y = [yi ]i=1 N and ν = 2C IV A DAPTABLE L INEAR S UPPORT V ECTOR M ACHINE A Adding New Data Assume that the new data samples {(xN +1 , yN +1 ) , , (xN +n , yN +n )} are being added to the current training set Let us denote N +n T A = [xi ]i=N +1 ∈ R [(d + 1) × n] and y = [yi ]i=N +1 N +n The current and new normal vectors w, w are solutions of the following systems of equations: Gw = AAT + νI w = Ay Let us denote P = G−1 and P = G −1 Using Kailath Variant formula, we can evaluate P as follows: = G−1 − G−1 A I + A T G−1 A = P − PA I + A TPA −1 −1 −1 A T G−1 A TP We now suppose that only one data sample x ∈ Rd+1 with label y is added to the current training set Eq (3) becomes: −1 T T P x (P x) P = P − P x + xT P x x P =P − 1+a (4) where a = xT P x Eq (4) is reasonable since P = G−1 is symmetric by the symmetry of G Moreover, we show how to efficiently evaluate a and P x Assume that the sparsity of the training set is s, the computational complexity of evaluation P x is O ((d + 1) s) Therefore, a is calculated with the cost of O (s) The cost to update P is O (d + 1) To update w, we first calculate P x as follows: a P x = Px − P xxT P x = P x − Px = Px 1+a 1+a 1+a (5) According to Eq (5), the cost to calculate P x is O(d + 1) Finally, w is updated with the cost of O (d + + s) as follows: (6) B Removing Current Data Assume that the current data samples {(xN −n+1 , yN −n+1 ) , , (xN , yN )} are being removed from N the current training set Let us denote A = [xi ]i=N −n+1 ∈ T R [(d + 1) × n] and y = [yi ]i=N −n+1 N The current and new normal vectors w, w are solutions of the following systems of equations: Gw = AAT + νI w = Ay G w = AAT − A A T + νI w = Ay − A y Let us denote P = G−1 and P = G −1 Using Kailath Variant formula, we can evaluate P as follows: P = G −1 = G − A A T = G−1 + G−1 A I − A T G−1 A = P + PA I − A TPA G w = AAT + A A T + νI w = Ay + A y P = G −1 = G + A A T A y − A A Tw Totally, the computational complexity of adding one data sample is O d2 i=1 i=1 →w −w =P w = w + P x y − wT x N xi xTi w − → w = −2C = G − A A T w = G w − A A Tw → w = w + P A y − A Tw ξi2 +C G w − A y = Ay = Gw −1 −1 −1 A T G−1 A TP To find the formula for updating the normal vector, we derive as follows: G w + A y = Ay = Gw (3) (7) = G + A A T w = G 
w + A A Tw →w −w =P −A y + A A T w → w = w + P A −y + A T w We now suppose that only one data sample x ∈ Rd+1 with label y is removed from the current training set Eq (7) becomes: P = P + P x − xT P x −1 xT P = P + T P x (P x) 1−a (8) where a = xT P x Eq (8) is reasonable since P = G−1 is symmetric by the symmetry of G Moreover, we show how to efficiently evaluate a and P x Assume that the sparsity of the training set is s, the computational complexity of evaluation P x is O ((d + 1) s) Therefore, a is calculated with the cost of O (s) The cost to update P is O (d + 1) To update w, we first calculate P x as follows: a 1 P xxT P x = P x + Px = Px 1−a 1−a 1−a (9) According to Eq (9), the cost to calculate P x is O(d + 1) Finally, w is updated with the cost of O (d + + s) as follows: P x = Px + w = w + P x −y + wT x Totally, the computational complexity of removing one data sample is O d2 Algorithm Algorithm for ALSVM −1 a = + ν −1 xT1 x1 P = ν −1 I − ν −2 ax1 xT1 w = y1 P x1 for k = to N Q = P xk a = xTk Q QQT P = P − 1+a R = 1+a Q w = w − R yk − wT xk endfor The algorithm for Adaptable Linear Support Vector Machine (ALSVM) is shown in Algorithm It is obvious that the computational complexity of ALSVM is O N d2 To propose another version of ALSVM, namely Fast ALSVM (FALSVM), which can efficiently reduce the training time of ALSVM, we first randomize the training set and when the data samples are added, we only update P and w for the data sample that causes the error The detail of the training process of FALSVM is given in Algorithm as follows: Algorithm Algorithm for FALSVM −1 C Inductive Algorithm for Training Data In this section, we present how to use the above adaptive process to learn the optimal hyperplane from the training set We start with the empty training set and the labeled N data samples {(xi , yi )}i=1 are subsequently added to the training set The hyperplane is gradually adapted along with the insertion of data At first, (x1 , y1 ) is added to the training set We have: T P1 = G−1 = νI + x1 x1 −1 = ν −1 I − ν −1 Ix1 + ν −1 xT1 x1 −1 a = + ν −1 xT1 x1 P = ν −1 I − ν −2 ax1 xT1 w = y1 P x1 for k = to N if yk wT xk < then Q = P xk a = xTk Q QQT P = P − 1+a R = 1+a Q w = w − R yk − w T xk endif endfor ν −1 IxT1 = ν −1 I − ν −2 x1 a1 xT1 = ν −1 I − ν −2 a1 x1 xT1 where a1 = + ν −1 xT1 x1 −1 V E XPERIMENTS w = y1 P x1 According to Eqs (3) and (6), when the labeled data sample (xk+1 , yk+1 ) is added, P and w are updated as follows: Pk+1 = Pk − T Pk xk+1 (Pk xk+1 ) + ak+1 where ak+1 = xTk+1 Pk xk+1 Pk+1 xk+1 = Pk xk+1 + ak+1 wk+1 = wk − Pk+1 xk+1 yk+1 − wkT xk+1 A Experimental Settings We conducted experiments on 11 datasets The details of the datasets are given in Table I We compared ALSVM and FALSVM with the two other well-known methods Pegasos and L1 LibLinear All experiments were run on a personal computer with Core I5 and GB of RAM All methods are implemented in C# The trade-off parameter C was searched in the grid 2−15 , 2−3 , , 23 , 25 To make the parameters consistent, in Pegasos, the parameter λ is computed from C as λ = N1C In Pegasos, the sub-sample size k is set to and the maximum number of iterations T is set to 10, 000 In LibLinear, the precision was set to its default value 0.01 The crossvalidation with folds was carried out Table I E XPERIMENTAL DATASETS Dataset a9a Cod-Rna Covertype Mushroom w8a SvmGuide3 German Poker Shuttle IJCNN SensIT Vehicle Size 32,561 271,617 581,012 8,124 49,749 1,243 1,000 548,831 34,145 49,990 57,650 Dimension 123 54 112 300 21 24 10 22 100 
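As a companion to the ALSVM and FALSVM algorithms above, the following is a minimal NumPy sketch of the inductive training loop, reconstructed from Eqs. (4)–(6) and (8)–(9). It is not the authors' C# implementation: the choice of the regularization constant as nu = 1/(2C), the FALSVM margin test y_k w^T x_k < 1, and all function names are assumptions made for illustration.

import numpy as np

def alsvm_fit(X, y, C=1.0, fast=False):
    """X: (N, d) data matrix, y: (N,) labels in {-1, +1}.
    Returns (w, P): the augmented weight vector in R^(d+1) and P = G^{-1}."""
    N, d = X.shape
    Xa = np.hstack([X, np.ones((N, 1))])        # bias absorbed as an appended constant feature
    nu = 1.0 / (2.0 * C)                        # regularization constant (assumed nu = 1/(2C))
    P = np.eye(d + 1) / nu                      # P = G^{-1} with G = nu*I for the empty training set
    w = np.zeros(d + 1)
    for xk, yk in zip(Xa, y):
        if fast and yk * w.dot(xk) >= 1.0:      # FALSVM: adapt only on "errors"
            continue                            # (the exact margin test is an assumption)
        Q = P.dot(xk)                           # P x: O(d^2), or O(d*s) for a sparse x
        a = xk.dot(Q)                           # a = x^T P x
        P -= np.outer(Q, Q) / (1.0 + a)         # rank-one update of P (Kailath variant), Eq. (4)
        w += (Q / (1.0 + a)) * (yk - w.dot(xk)) # w <- w + P'x (y - w^T x), Eqs. (5)-(6)
    return w, P

def alsvm_remove(P, w, x_aug, y):
    """Closed-form downdate when an augmented sample (x_aug, y) leaves the training set, Eqs. (8)-(9)."""
    Q = P.dot(x_aug)
    a = x_aug.dot(Q)
    P = P + np.outer(Q, Q) / (1.0 - a)
    w = w + (Q / (1.0 - a)) * (-y + w.dot(x_aug))
    return P, w

Each insertion or removal touches P x and performs one rank-one correction of P, which is where the O(d^2) per-sample cost, and hence the O(N d^2) batch cost, comes from; FALSVM simply skips the update for samples already classified with sufficient margin.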
B How Fast are ALSVM and Fast ALSVM Compared to Existing Methods? The experimental results including the training times and accuracies are reported in Table II To highlight the conclusion, for each dataset, we boldfaced the method that yields the shortest training time and italicized the method that generates the highest accuracy As shown in Table II, regarding the accuracies, ALSVM and Fast ALSVM are comparable with others and always appeared in italic for all datasets With reference to the training time, FALSVM is always fastest over all datasets, ALSVM is faster than LibLinear and Pegasos over all datasets except for w8a It may be surprise how our proposed methods which scale exactly O N d2 are faster than both LibLinear O N dlog and Pegasos (O (N s)) This is partially explained as follows The computational complexities of ALSVM and FALSVM are exactly N d2 + ds + (d + s) and m d2 + ds + (d + s) , where m is the number of adaptations and may be very small as compared to N , i.e m ≈ N , while the computational complexities of Pegasos and LibLinear are exactly const × N s and const × N dlog It may happen that for the experimental datasets, the factors const in Pegasos and LibLinear cause the actually greater number of unit operations than our proposed methods Figure Training times of ALSVM and FALSVM on Covertype (left) and Cod-Rna (right) ALSVM and FALSVM scale really well with the number of training examples, but not really well with the dimension It is crucial to answer the question: “What is the threshold of dimension under which ALSVM or FALSVM is faster than others?” To this end, we performed the experiment on the datasets Real-Sim, whose dimension is 20, 958 and news20, whose dimension is 1, 355, 191 We made comparison ALSVM and FALSVM with LibLinear The dimension is varied with the step of 50 or 100 and the training times are recorded for consideration As shown in Figures and 3, FALSVM scales with the dimension better than ALSVM As also indicated from this experiment, FALSVM is faster than LibLinear if the dimension is less than around one thousand, otherwise it becomes gradually slower than LibLinear Figure Training times of ALSVM, FALSVM, and LibLinear on Real-Sim when dimension is varied C How Training Times of ALSVM and Fast ALSVM Scale with the Number of Training Examples and Dimension? 
To visually demonstrate how ALSVM and FALSVM scale with the number of training examples, we established the experiment over the datasets Covertype and Cod-Rna We measured the training times of ALSVM and FALSVM for 50%, 60%, 70%, 80%, 90%, and 100% of each dataset As shown in Figure 1, ALSVM and FALSVM possess asymptotically linear complexities and FALSVM is approximately tens of times faster than ALSVM Figure Training times of ALSVM, FALSVM, and LibLinear on news20 when dimension is varied VI C ONCLUSION In this paper, we have proposed Adaptable Linear Support Vector Machine (ALSVM) and Fast Adaptable Linear Support Vector Machine (FALSVM), which both scale linearly with the Table II T HE TRAINING TIMES ( IN SECOND ) AND THE ACCURACIES Dataset a9a Cod-Rna Covertype Mushroom w8a SvmGuide3 German Poker Shuttle IJCNN SensIT Vehicle ALSVM Time Acc 82 82% 93% 271 64% 14 92% 403 97% 0.27 82% 0.26 75% 35 91% 0.7 100% 10 93% 629 88% Fast ALSVM Time Acc 4.5 79% 0.7 72% 5.7 60% 0.1 94% 3.9 97% 0.04 82% 0.05 75% 3.2 91% 0.06 100% 0.21 93% 8.7 89% number of data examples of training set The most brilliant feature of ALSVM is that its decision boundary can be adapted in a close form when data are added to or removed from the training set Our experiments have indicated that ALSVM and FALSVM are efficient for the large-scale dataset of thousands of dimensions R EFERENCES [1] B E Boser, I M Guyon, and V Vapnik A training algorithm for optimal margin classifiers In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152 ACM Press, 1992 [2] C.-C Chang and C.-J Lin Libsvm: A library for support vector machines ACM Trans Intell Syst Technol., 2(3): 27:1–27:27, May 2011 ISSN 2157-6904 [3] K.-W Chang, C.-J Hsieh, and C.-J Lin Coordinate descent method for large-scale l2-loss linear support vector machines J Mach Learn Res., 9:1369–1398, June 2008 ISSN 1532-4435 [4] G Fung and O L Mangasarian Proximal support vector machine classifiers In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 77–86, 2001 [5] C.-J Hsieh, K.-W Chang, C.-J Lin, S S Keerthi, and S Sundararajan A dual coordinate descent method for large-scale linear svm In Proceedings of the 25th international conference on Machine learning, ICML ’08, pages 408–415 ACM, 2008 ISBN 978-1-60558205-4 [6] T Joachims Advances in kernel methods chapter Making Large-scale Support Vector Machine Learning Practical, pages 169–184 1999 ISBN 0-262-19416-3 [7] T Joachims Training linear svms in linear time In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’06, pages 217–226, 2006 ISBN 1-59593-339-5 [8] O L Mangasarian and David R Musicant Active support vector machine classification In NIPS, pages 577–583, 2000 [9] O L Mangasarian and David R Musicant Lagrangian support vector machines J Mach Learn Res., 1:161– 177, September 2001 ISSN 1532-4435 LibLinear Time Acc 193 82% 1,683 91% 1,592 62% 16 92% 316 97% 81% 0.54 74% 4,811 91% 52 100% 252 92% 4,695 89% Pegasos Time Acc 197 82% 8,332 89% 1,826 61% 92 91% 505 97% 81% 75% 12,490 91% 17 100% 213 93% 4,452 86% [10] S Shalev-Shwartz, Y Singer, and N Srebro Pegasos: Primal estimated sub-gradient solver for svm In Proceedings of the 24th international conference on Machine learning, ICML ’07, pages 807–814, 2007 ISBN 9781-59593-793-3 [11] J A K Suykens and J Vandewalle Least squares support vector machine classifiers Neural Process Lett., 9(3):293–300, June 1999 ISSN 
1370-4621 Least Square Support Vector Machine for Large-scale Dataset Khanh Nguyen, Trung Le, Vinh Lai, Duy Nguyen, Dat Tran, Wanli Ma Abstract—Support Vector Machine (SVM) is a very wellknown tool for classification and regression problems Many applications require SVMs with non-linear kernels for accurate classification Training time complexity for SVMs with non-linear kernels is typically quadratic in the size of the training dataset In this paper, we depart from the very well-known variation of SVM, the so-called Least Square Support Vector Machine, and apply Steepest Sub-gradient Descent method to propose Steepest Subgradient Descent Least Square Support Vector Machine (SGDLSSVM) It is theoretically proven that the convergent rate of the proposed method to gain ε − precision solution is O log 1ε The experiments established on the large-scale datasets indicate that the proposed method offers the comparable classification accuracies while being faster than the baselines training set size under some specific conditions, e.g., the dataset is separated with margin m∗ in the feature space Actually, this condition is not rare to happen since data tend to be compacted and clustered in the feature space Furthermore, we also theoretically prove that the convergent rate to gain ε − precision solution is O log 1ε The experiments show that our proposed method is able to efficiently handle the largescale datasets and usually offers the comparable classification accuracy while being faster than the baselines Index Terms—Support Vector Machine, solver, kernel method, steepest gradient descent Solving a quadratic program is the main component of all SVMs There has been extensive research on how to efficiently solve the quadratic program in terms of both computational and memory complexities Interior-point methods, which have been proved effective on convex quadratic programs on other domains, have been applied to this quadratic program [6], [7], [9] However, the density, size, and ill-conditioning of the kernel matrix make them very expensive for the large-scale datasets Yet another conventional approach is decomposition method like SMO [15], SVM-Light [12], LIBSVM [3], and SVMTorch [4] By using the caching technique, the decomposition methods can handle the memory used in the training of the large-scale datasets very well Nonetheless, their super-linear scaling behavior with the dataset size [12], [15], [11] makes their use inefficient or even intractable on large-scale datasets An alternative to reducing the training time is to take advantage of the sparsity Some methods select a set of basic vectors a priori This includes sampling randomly from the training set in the Nystrom method [23], greedily minimizing reconstruction error [18], and variants of the Incomplete Cholesky factorization [7], [1] However, these selection methods are not part of the optimization process, which makes a goal-directed choice of basis vectors difficult [14] Another method is to yield a linear system instead of quadratic program problems [20] Hestenes-Stiefel conjugate gradient method was then applied to solve linear system Ax = B where A ∈ Rn×n and B ∈ Rn It converges in at most r + steps where r = rank (C) and A = I + C However, the rate of convergence depends on the condition number of matrix In the case using RBF kernel, it is influenced by the choice (C, γ) where C is trade-off parameter and γ is kernel width parameter [21] Cutting-Plane Subspace Pursuit (CPSP) [14] converts its quadratic optimization problem to an 
equivalent 1-slack one The cutting-plane technique is applied to subsequently add the constraints to a set of active constraints To utilize the sparsity, I I NTRODUCTION Support Vector Machine (SVM) [2], [5] is a very popular learning method for classification and regression problems There have been lots of variations of SVM proposed and applied to real-world problems, e.g., Least Square Support Vector Machine (LS-SVM) [19], Proximal Support Vector Machine (PSVM) [8], and Structural Support Vector Machine [22] Although SVMs are known to give excellent classification results, their application to problems with large-scale datasets is encumbered by the cumbersome training time requirements Recently, much research has been concentrated on design fast learning algorithms that can efficiently handle the large-scale datasets for linear SVMs [13], [10], [16] Even though linear SVMs are preferable for the sparse datasets, e.g., text classification, word-sense disambiguation, and drug design [13], many real-world applications require SVMs with non-linear kernels for accurate classification Training time complexity for SVMs with non-linear kernels is typically quadratic in the size of the training dataset [17] In this paper, we depart from the formulation of Least Square Support Vector Machine (LS-SVM) [19] and apply the steepest sub-gradient descent method to propose Steepest Sub-gradient Descent Least Square Support Vector Machine (SGD-LSSVM) Unlike the original steepest gradient descent method, in the proposed method, we use a sub-gradient rather than the entire gradient for updating at each iteration As a consequence, the computational complexity in each iteration is economic and around O(N (t + ln(t) + 1) + 4t + 2) ≈ O (N ), where t is usually a small number, e.g., t = 1, The computational complexity of the proposed method is at worst quadratic in general as others Nonetheless, it is theoretically proven that the proposed method linearly scales with the II R ELATED W ORK Table II T HE TRAINING TIMES ( IN SECOND ) AND THE ACCURACIES ( IN %) Dataset a9a Cod-Rna Rcv1 Mushroom SvmGuide1 Shuttle IJCNN1 Splice SGD-LSSVM Time Acc 88 83 169 91 239 97 100 0.4 97 25 100 169 99 0.13 89 SVM-Light Time Acc 2,089 85 1,949 93 1,001 97 18 100 97 50 100 1,949 99 0.62 87 SVM-Perf Time Acc 21,343 83 1,392 90 3,097 97 29 100 15 97 229 100 1,391 99 75 88 LIBSVM Time Acc 149 85 303 94 982 97 1.5 100 1.6 97 100 303 99 0.15 87 Nystrom Time Acc 162 85 809 89 66 95 14 100 97 18 100 809 98 85 ICF Time Acc 648 80 1,786 89 1,180 94 157 100 97 197 100 1,786 98 6.5 85 C How does the Parameter t Affect to the Training Time and Number of Iterations? 
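Before turning to the measurements, the role of t can be made concrete with a hedged sketch of a single SGD-LSSVM iteration, reconstructed from the per-iteration complexity discussion in the introduction and the appendix lemmas (Q_ij = y_i y_j K(x_i, x_j) + nu * delta_ij). The right-hand side b, the stopping test, and the working-set selection below are our assumptions for illustration, not the paper's implementation.

import numpy as np

def sgd_lssvm(K, y, nu, t=2, eps=1e-2, max_iter=10000):
    """K: (N, N) kernel matrix, y: (N,) labels in {-1, +1}.
    Minimizes f(alpha) = 0.5*alpha^T Q alpha - b^T alpha with Q = (y y^T) * K + nu*I,
    b = ones (assumed), taking steepest sub-gradient steps on t coordinates at a time."""
    N = K.shape[0]
    Q = (np.outer(y, y) * K) + nu * np.eye(N)
    b = np.ones(N)                              # assumed right-hand side of the LS-SVM dual
    alpha = np.zeros(N)
    g = Q.dot(alpha) - b                        # full gradient, maintained incrementally below
    for _ in range(max_iter):
        B = np.argpartition(np.abs(g), -t)[-t:] # t steepest coordinates: O(N + t log t)
        gB = g[B]
        denom = gB.dot(Q[np.ix_(B, B)].dot(gB)) # g_B^T Q_BB g_B (cf. Lemma 1)
        if denom <= 0 or np.linalg.norm(gB) < eps:
            break                               # eps-precision reached (stopping test assumed)
        eta = gB.dot(gB) / denom                # exact line search along the restricted direction
        alpha[B] -= eta * gB
        g -= eta * Q[:, B].dot(gB)              # incremental gradient update: O(N*t) per iteration
    return alpha

Each iteration spends O(N) on picking the t steepest coordinates and O(N t) on the incremental gradient update, so increasing t raises the per-iteration cost while, as shown below, reducing the number of iterations.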
To investigate the affection of parameter t to the training time, number of iterations, and accuracies, we established the experiment on the datasets Rcv1 and IJCNN1 The optimal values for C, γ were used in the experiment and the precision ε was set to 0.01 We varied the parameter t in the grid {1, 2, 4, 8, 16, 32, 64, 128, 256, 512} and recorded the training time, accuracy, and number of iterations in each case As observed from Figure 3, when t increases, the number of iterations decreases This fact can be explained from Corollaries and which say when t becomes larger, the objective function is reduced with a larger amount across the iterations Figure shows the training times when t is varied As shown from Figure 2, when t becomes larger, the training times tend to increase The reason is that when t is increased although the number of iterations is decreased, the computation cost at each iteration is trade-off increased Figure indicates the accuracies when t is varied in its domain As seen from Figure 4, the accuracies slightly vary and are stable along with variation of t This experiment also recommends that the optimal value for t is either or With these choices for t, the computation cost in each iteration is very cheap and it results in the reduced training time Figure The number of iterations when t is varied Figure The accuracies when t is varied D How does the Proposed Method scale with the Number of Training Examples? Figure The training times when t is varied To visually show how the training time of the proposed method scales with the number of training examples, we established the experiment on the datasets Rcv1 and IJCNN1 We selected 10%, 20%, , 100% of each dataset and measured the training time for each case The optimal parameters C, γ were used in this experiment, the parameter t was set to 1, and the precision ε was set to 0.01 As shown in Figure 5, the training time approximately scales with the linear complexity It may be partially explained from the fact that the optimal values |f (α∗ )| of the sub-datasets only a slight increase and thereby leading to a slight increase in the number of iterations 2(tR2 +υ )|f (α∗ )| since n ≤ ε2 1) λmax ≤ Q for all kinds of matrix norm m 2) If Q = [K (xi , xj ) + νδij ]i,j=1 max K (xi , xi ) 1/2 = i ≤ max φ (xi ) i and R then λmax ≤ mR2 + ν Proof: 1) Let us denote the eigenvector of λmax by x It follows that: λmax x = λmax x = Qx ≤ Q x 2) From previously, we have λmax ≤ Q = max i ≤ ν + +max i Qij = ν + max j φ (xi ) i K (xi , xj ) j φ (xj ) ≤ ν + mR2 j Figure The training times of the percentages on the datasets R EFERENCES VII C ONCLUSION We have departed from Least Square Support Vector Machine and propose Steepest Sub-gradient Least Square Support Vector Machine (SGD-LSSVM) SGD-LSSVM shows some good properties about convergent rate and computational complexity which are: 1) its convergent rate to gain ε − precision solution is O log 1ε ; and 2) its computational complexity is O N and O (N ) under some specific conditions that are not rare to happen This makes the proposed method efficient and tractable to handle the large-scale datasets A PPENDIX Lemma T gB QBB gB T = νgB gB + gyxT gyx Proof: We have T gB QBB gB = gi gj qij i∈B j∈B = gi gj (yi yj K(xi , xj ) + νδij ) i∈B j∈B = gi2 gi gj yi yj K(xi , xj ) + i∈B j∈B i∈B T = νgB gB + gyxT gyx Lemma gyxT gyx = i∈B gi yi Proof: We have gyxT gyx = gyxT gi yi φ(xi ) i∈B gi yi gyxT φ(xi ) = = i∈B gi yi i∈B Lemma Q is a m by m positive definite matrix and λmax = λmax (Q) is the maximal 
eigenvalue of Q, we have the following: [1] F R Bach and M I Jordan Predictive low-rank decomposition for kernel methods In Proceedings of the 22Nd International Conference on Machine Learning, ICML ’05, pages 33–40, New York, NY, USA, 2005 ACM [2] B E Boser, I M Guyon, and V Vapnik A training algorithm for optimal margin classifiers In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152 ACM Press, 1992 [3] C.-C Chang and C.-J Lin Libsvm: A library for support vector machines ACM Trans Intell Syst Technol., 2(3):27:1–27:27, May 2011 [4] R Collobert and S Bengio Svmtorch: Support vector machines for large-scale regression problems J Mach Learn Res., 1:143–160, September 2001 [5] C Cortes and V Vapnik Support-vector networks In Machine Learning, pages 273–297, 1995 [6] M C Ferris and T S Munson Interior-point methods for massive support vector machines SIAM J on Optimization, 13(3):783–804, August 2002 [7] S Fine and K Scheinberg Efficient svm training using low-rank kernel representations J Mach Learn Res., 2:243–264, March 2002 [8] G Fung and O L Mangasarian Proximal support vector machine classifiers In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 77–86, 2001 [9] E M Gertz and S J Wright Object-oriented software for quadratic programming ACM Trans Math Softw., 29(1):58–81, March 2003 [10] C.-J Hsieh, K.-W Chang, C.-J Lin, S S Keerthi, and S Sundararajan A dual coordinate descent method for large-scale linear svm In Proceedings of the 25th international conference on Machine learning, ICML ’08, pages 408–415 ACM, 2008 [11] D Hush and C Scovel Polynomial-time decomposition algorithms for support vector machines Machine Learning, 51(1):51–71, 2003 [12] T Joachims Advances in kernel methods chapter Making Large-scale Support Vector Machine Learning Practical, pages 169–184 MIT Press, Cambridge, MA, USA, 1999 [13] T Joachims Training linear svms in linear time In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’06, pages 217–226, 2006 [14] T Joachims and C.-N Yu Sparse kernel svms via cutting-plane training Machine Learning, 76(2-3):179–193, 2009 European Conference on Machine Learning (ECML) Special Issue [15] J C Platt Advances in kernel methods chapter Fast Training of Support Vector Machines Using Sequential Minimal Optimization, pages 185– 208 1999 [16] S Shalev-Shwartz, Y Singer, and N Srebro Pegasos: Primal estimated sub-gradient solver for svm In Proceedings of the 24th international conference on Machine learning, ICML ’07, pages 807–814, 2007 [17] S Shalev-Shwartz and N Srebro In William W Cohen, Andrew McCallum, and Sam T Roweis, editors, ICML, volume 307, 2008 [18] A J Smola and B Schăokopf Sparse greedy matrix approximation for machine learning In Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, pages 911–918, 2000 [19] J A K Suykens and J Vandewalle Least squares support vector machine classifiers Neural Process Lett., 9(3):293–300, June 1999 [20] JAK Suykens, Lukas Lukas, Paul Van Dooren, Bart De Moor, Joos Vandewalle, et al Least squares support vector machine classifiers: a large scale algorithm Citeseer [21] Johan AK Suykens, Jos De Brabanter, Lukas Lukas, and Joos Vandewalle Weighted least squares support vector machines: robustness and sparse approximation Neurocomputing, 48(1):85–105, 2002 [22] I Tsochantaridis Support Vector Machine Learning for Interdependent and Structured 
Output Spaces PhD thesis, Providence, RI, USA, 2005 AAI3174684 [23] C Williams and M Seeger Using the nystrom method to speed up kernel machines In Advances in Neural Information Processing Systems 13, pages 682–688 MIT Press, 2001 Fast One-Class Support Vector Machine for Novelty Detection Trung Le1 , Dinh Phung2 , Khanh Nguyen1 , Svetha Venkatesh2 Department of Information Technology, HCMc University of Pedagogy, Vietnam Center for Pattern Recognition and Data Analytics, Deakin University, Australia Abstract Novelty detection arises as an important learning task in several applications Kernel-based approach to novelty detection has been widely used due to its theoretical rigor and elegance of geometric interpretation However, computational complexity is a major obstacle in this approach In this paper, leveraging on the cutting-plane framework with the well-known One-Class Support Vector Machine, we present a new solution that can scale up seamlessly with data The first solution is exact and linear when viewed through the cutting-plane; the second employed a sampling strategy that remarkably has a constant computational complexity defined relatively to the probability of approximation accuracy Several datasets are benchmarked to demonstrate the credibility of our framework Keywords: One-class Support Vector Machine, Novelty Detection, Largescale dataset Introduction Real data rarely conform to regular patterns Finding subtle deviation from the norm, or novelty detection (ND), is an important research topics in many data analytics and machine learning tasks ranging from video security surveillance, network abnormality detection to detection of abnormal gene expression sequence Novelty detection refers to the task of of finding patterns in data that not conform to expected behaviors [4] These anomaly patterns are interesting because they reveal actionable information, the known unknowns and unknown unknowns Novelty detection methods have been used in many real-world applications, e.g., intrusion detection [8], fraud detection (credit card fraud detection, mobile phone fraud detection, and insurance claim fraud detection) [7] There are various existing approaches applied to novelty detection, including neural network based approach [2], Bayesian network based approach [1], rule based approach [6], kernel based approach [12,14] However, these approaches are either ad hoc (rule-based) or lacking a principled approach to scale up with data Moreover, none of these methods has been experimented with large datasets In this paper, we depart from a kernel-based approach, which has a established theoretical foundation, and propose a new novelty detection machine that scales up seamlessly with data The idea of using kernel-based methods for novelty detection is certainly not new At its crux, geometry of the normal data is learned from data to define the domain of novelty One-Class Support Vector Machine (OCSVM) [12] aims at constructing an optimal hyperplane that separates the origin and the data samples such that the margin, the distance from the origin to the hyperplane, is maximized OCSVM is used in case data of only one class is available and we wish to learn the description of this class In another approach, Support Vector Data Description (SVDD) [14], the novelty domain of the normal data is defined as an optimal hypersphere in the feature space, which becomes as a set of contours tightly covering the normal data when mapped back the input space There are also noticeable research efforts in 
scaling up kernel-based methods to linear complexity including Pegasos [13] (kernel case), SV M P erf [10] However, to our best of knowledge, none has attempted at novelty detection problem Starting with the formulation of cutting plane method [11], we propose in this paper a new approach for novelty detection Our approach is a combination of cutting plane method applied to the well-known One-Class Support Vector Machine [12] – we term our model Fast One-Class SVM (FOSVM) We propose two complementary solutions for training and detection The first is an exact solution that has linear complexity The number of iterations to reach a very close accuracy, termed as θ−precision (defined later in the paper) is O θ1 , making it attractive in dealing with large datasets The second solution employed a sampling strategy that remarkably has constant computational complexity defined relatively to probability of approximation accuracy In all cases, rigorous proofs are given to the convergence analysis as well as complexity analysis Related Work Most closely to our present work is the Core Vector Machine (CVM) [18] and its simplified version Ball Vector Machine (BVM) [17] CVM is based on the achievement in computational geometry [3] to reformulate a variation of L2-SVM as a problem of finding minimal enclosing ball (MEB) CVM shares with ours in using the principle of extremity in developing the algorithm Nonetheless, CVM does not solve directly its optimization problem and does not offer the insight view of cutting plane method as ours Cutting plane method [11] has been applied to building solvers for kernel methods in the work [9,10,15,16] In [10], the cutting plane method is employed to train SV M struct ; the N − slack optimization problem with N constrains is converted to an equivalent − slack optimization problem with 2N constrains (binary classification); the constrains are subsequently added to the optimization problem In [15,16], the bundle method is used for solving the regularized risk minimization problem Though those work have been proven that bundle method is very efficient for the regularized risk minimization problem and the number of iterations to gain −precision solution are O for non-smooth problem and O log for smooth problem, they are implausible to extend to a non-linear kernel case 3.1 Proposed Fast One-Class Support Vector Machine (FOSVM1) One-Class Support Vector Machine We depart from the One-Class Support Vector Machine proposed by [12] Given a training set with only positive samples D = {x1 , , xN }, One-Class Support Vector Machine (OCSVM) learns the optimal hyperplane that can separate the origin and all samples such that the margin, the distance from the origin to the hyperplane, is maximized The optimization problem of soft model of OCSVM is as follows: w,ρ w 2 −ρ+ vN N ξi (1) i=1 T s.t : ∀N i=1 : w Φ(xi ) ≥ ρ − ξi ∀N i=1 : ξi ≥ If the slacks variables are not used, i.e., all samples in the training set must be completely stayed on the positive side of the optimal hyperplane, the soft model becomes the hard model associated with the following optimization problem: w,ρ w 2 −ρ (2) T s.t : ∀N i=1 : w Φ(xi ) ≥ ρ 3.2 Cutting Plane Method for One-Class SVM To apply the cutting plane method for the optimization problem of OCSVM, we rewrite the OCSVM optimization problem as follows: w,ρ s.t : ∀N i=1 : Φ (xi ) −1 w T −ρ w ≥ (constrainst Ci ) ρ (3) The feasible set of the OCSVM optimization problem is composed by intersection of N half-hyperplanes corresponding to the constrains Ci (1 
≤ i ≤ N ) (see Figure 1) Inspired by the cutting plane method, the following algorithm is proposed: Algorithm Cutting method to find the θ-approximate solution for n = 2, 3, (wn , ρn ) = solve (n); n + = in+1 = argmin i>n Φ (xi ) −1 T wn ρn = argmin wnT Φ (xi ) − ρn ; i>n on+1 = wnT Φ (xn+1 ) − ρn ; if (on+1 ≥ −θρn ) return (wn , ρn ); endfor The procedure solve(n) stands for solving the optimization problem in Eq (3) where first n constrains Ci (1 ≤ i ≤ n) are activated By convenience, we assume that the chosen index in+1 = n + The satisfaction of condition on+1 ≥ −θρn where θ ∈ (0, 1) means that the current solution (wn , ρn ) is that of the relaxation of the optimization problem in Eq (3) while the constrains Ci (i > n) are replaced by its relaxation Ci (i > n) as follows: w,ρ s.t : ∀ni=1 : ∀N i=n+1 : Φ (xi ) −1 Φ (xi ) −1 T w T −ρ w ≥ (constrainst Ci ) ρ w ≥ −θρn (constrainst Ci ) ρ Otherwise, the hyperplane corresponding with the constrains Cn+1 separates the current solution sn = (wn , ρn ) and the feasible set of the full optimizaT Φ (xn+1 ) wn tion problem in Eq (1) since < −ρn θ < ; the constrains −1 ρn Cn+1 is added to the active constrains set (see Figure 1) Furthermore, the distance from the current solution as shown in Eq (4) to the chosen hyperplane is maximized so that the current feasible set rapidly approaches the destined one (see Figure 1) dn+1 = d (sn , Cn+1 ) = 3.3 ρn − wnT Φ (xn+1 ) wn (4) Approximate Hyperplane We present an interpretation of the proposed Algorithm under the framework of approximate hyperplane Let us start with clarifying the approximate hyperplane notion Let A ⊆ [N ] be a subset of the set including first N positive integer numbers and DA be a subset of the training set D including samples whose indices are in A Fig The constrains Cn+1 is chosen to add to the current active constrains set (red is active and green is inactive) T Φ (x) − ρA = Denote the optimal hyperplane induced by DA by (HA ) : wA Given a sample x, let us define the membership of x with regard to the positive side of the hyperplane (HA ) by : mA (x) = 1− d(Φ(x),HA ) d(0,HA ) =1+ T wA Φ(x)−ρA ρA = T wA Φ(x) ρA T Φ (x) − ρA < if wA otherwise Intuitively, the membership of x is exactly if Φ (x) lies on the positive side of HA ; otherwise this membership decreases when Φ (x) is moved further the hyperplane on the negative side We say that hyperplane HA is an θ − approximate hyperplane if it is the optimal hard-margin hyperplane induced by DA and the memberships with respect to HA of data samples in D \ DA are all greater than or equal − θ, i.e., ∀x ∈ D \ DA : mA (x) ≥ − θ The visualization of θ − approximate hyperplane is shown in Figure Using the θ − approximate hyperplane, Algorithm can now be rewritten as: Fig θ − approximate hyperplane (green is active, black is inactive), the red line stands for the optimal hard-margin hyperplane induced by DA Algorithm Algorithm to find θ − approximate hyperplane A = {1, 2} ; n = 2; (wn , ρn ) = OCSVM ({xi : i ∈ A}); n + = in+1 = argmin (wnT Φ (xi ) − ρn ); i∈Ac on+1 = wnT Φ (xn+1 ) − ρn ; if (on+1 ≥ −θρn ) return (wn , ρn ); else A = [n + 1]; while (true) Algorithm aims at finding a θ − approximate hyperplane in which θ − approximate notion can be interpreted as slack variables of soft model It is worthwhile to note that the stopping criterion in Algorithm means that ∀N i=n+1 : mA (xi ) ≥ − θ In the next section, we prove that Algorithm terminates after a finite number of iterations independent of the training size N To simplify the derivation, 
we assume that isometric kernel, e.g., Gaussian kernel, is used which means that Φ (x) = K(x, x)1/2 = 1, ∀x Furthermore, let us define w∗ = w∞ = lim wn n→∞ and d∗ = d∞ = lim dn where dn = d (0, Hn ) = wρnn is the margin of the n→∞ current active set D[n] and d∗ is the margin of the general dataset Theorem Algorithm terminates after at most n0 iterations where n0 depends only on θ and the margin of the general data set d∗ = d∞ Proof We sketch the proof as follows The main task is to prove that: ∀N j=1 1+n/2 : mn (xj ) ≥ − + − d∗ 1+n/2 d∗ = − g(n) It is easy to see that lim g(n) = Therefore, there exists n0 where n0 is a n→∞ constant which is independent with N such that mn (xj ) > − θ for all n ≥ n0 Theorem The number of iterations in Algorithm is O (1/θ) Proof Let us denote θ = − g(n) ≥ − θ ⇔ θd∗2 1−d∗2 We have: n + ≤ θ ⇐⇒ + ≥ + n/2 + n/2 √ √ 2θ + + 2θ It is obvious that: √ √ 2θ + + const ≤ ∼ O (1/θ ) = O (1/θ) 2θ 2θ Therefore, we gain the conclusion From Theorem 2, we also gain the number of iterations in Algorithm is O (1/θ) Theorem reveals that the complexity of Algorithm is O(N ) since the complexity of each iteration is O(N ) Sampling Fast One-Class Support Vector Machine (FOSVM2) In this section, we develop a second solution to train the proposed model using sampling strategy We now show how to use sampling technique to predict this furthest vector with the accuracy can be made as close as we like to the exact solution specified via a bounded probability similar to concept of p-value Let < < and m vectors are sampled from N vectors of the training set D We denote the set of -top furthest vectors in D by B We estimate the probability in order that at least one of m sampled vectors belonging to B We call the probability of this event as Pr( -top) First, we note from our definition that n = |B| = N Hence, the probability of interest becomes Pr( -top) = N −n N 1− / With some effort of manipulation, this reduces to: m m −1 Pr( -top) =1 − [(N − n)!(N − m)!] [N !(N − n − m)!] 
N −m =1 − i i=N −n−m+1 −1 N i i=N −n+1 Algorithm Algorithm to find θ − approximate hyperplane A = {1, 2} ; n = 2; (wn , ρn ) = OCSVM ({xi : i ∈ A}); B = Sampling(Ac , m); n + = in+1 = argmin wnT Φ (xi ) − ρn ; i∈B on+1 = wnT Φ xin+1 − ρn ; if (on+1 ≥ −θρn ) return (wn , ρn ); else A = [n + 1]; while (true) Taking the logarithm yields N −m ln [1 − Pr( -top)] = ln i=N −n−m+1 i i+m N −m ln − = i=N −n−m+1 m i+m Now applying the inequality ln(1 − x) ≤ −x, ∀x ∈ [0, 1), we have: N −m ln [1 − Pr( -top)] ≤ − i=N −n−m+1 m ≤− i+m N i=N −n+1 m mn ≤− i N −mn Therefore, Pr( -top) ≥ − e N = − e− m If we wish to have at least 95% probability in approximate accuracy (to the exact solution) with = 5%, we can calculate the number of sampling points m from this bound: Pr( -top) ≥ − e− m > 0.95 With = 0.05 we have m ≥ 20 ln 20 ≈ 59 Remarkably, this is a constant complexity w.r.t to N We note that the disappearance of N is due to the fact that the accuracy of our approximation to the exact solution is defined in terms of the probability What remarkable is that with a very tight bound required on a accuracy (e.g., within 95 %), the number of points required to be sampled is remarkably small (e.g., 59 points regardless of the data size N ) Hence the complexity of Algorithm is O(const) Experimental Results In the first set of experiments, we demonstrate that our proposed FOSVM is comparable with both OCSVM and SVDD in terms of the learning capacity but are much faster OCSVM and SVDD are implemented by using the state-of-theart LIBSVM solver [5] All comparing methods are coded in C# and run on a computer with GB of memory The datasets in the experiment were constructed by choosing one class as the normal class, appointing the remaining classes as the abnormal class, and randomly removing the samples from the abnormal class such that the ratio of data Datasets #Features #Pos #Neg Satimage 36 1,072 21 Usps 256 1,194 23 News20 1,355,191 3,986 79 a9a 123 7,841 150 Porker 10 10,599 211 Acoustic 407 18,261 366 Seismic 60 18,320 366 Real-sim 20,958 22,213 444 Shuttle 34,108 682 Connect-4 126 44,473 889 CodRna 90,539 1,810 IJCNN 22 91,701 1,834 Covertype 54 297,711 5,954 Rvc1 47,236 355,460 7,190 Table The details of the experimental datasets in the normal and abnormal classes are 50 : The details of the experimental datasets are given in Table We used three-fold cross validation For the training folds, we only used the positive samples to train the model and we used all samples of the testing folds for testing Gaussian kernel, which is given by K (x, x ) = e−γ x−x , was employed The width of kernel γ was searched in the grid 2−15 , 2−13 , , 25 The tradeoff parameter C was varied in the grid 2−15 , 2−13 , , 25 For FOSVM2, we set the parameter θ to 0.05 and m = 100 corresponding to the confidence of 99.33% for obtaining the top = 5% furthest samples To measure the accuracy, + − where acc+ , acc− are the we applied the measurement given by acc = acc +acc accuracies on the positive and negative classes, respectively This measurement is appropriate for ND because it encourages the high accuracies for both two classes Table shows the performance of our proposed models versus baselines on a diverse set of datasets As can be seen from Table 2, FOSVM1 and FOSVM2 are comparable with others in terms of the accuracies but much faster Particularly, FOSVM2 is superior in its time complexity as expected from our theoretical result Surprisingly, FOSVM2 also gains the highest accuracies on out of 13 datasets This may partially be 
explained from the fact that the furthest sample found by FOSVM1 may be noise or outlier Hence, FOSVM1 and others are more sensitive to noises and outliers than FOSVM2 In Table 2, we boldfaced the datasets at which the training times are sped up more than 10 times As seen from Table 2, for the large-scale datasets including CodRna, Covertype, and Rvc1 whose sizes amount greater than 90, 000, the training time speed-up ratios are all greater than 10 and are approximately equal to 40, 12, 28, respectively Especially, for the dataset Seismic, our proposed method FOSVM2 are 808 times faster than the baseline SVDD OCSVM FOSVM1 FOSVM2 Acc Time Acc Time Acc Time Acc Time Satimage 94% 63s 94% 45s 94% 35s 94% 20s Usps 96% 113s 96% 85s 91% 117s 91% 85s News20 60% 6,970s 60% 7,354s 59% 4,673s 61% 3,236s a9a 65% 821s 58% 548s 65% 822s 69% 122s Porker 50% 551s 50% 562s 50% 554s 53% 82s Acoustic 66% 6,211s 67% 6,138s 60% 1,066s 70% 59s Seismic 71% 54,177s 71% 42,244s 68% 924s 73% 67s Real-sim 70% 8,690s 69% 8,885s 61% 3,566s 64% 529s Shuttle 92% 723s 91% 726s 94% 167s 94% 43s Connect-4 53% 1,772s 53% 1,022s 55% 460s 55% 157s CodRna 62% 13,224s 62% 13,468s 60% 655s 64% 315s Covertype 55% 105,157s 54% 107,760s 52% 78,942s 54% 8,151s Rvc1 63% 204,331s 63% 204,298s 56% 9,236s 54% 7,439s Table The accuracies on the experimental datasets Datasets In the second experiment, we aim to investigate the behaviors of our proposed methods For each dataset, we randomly chose 10%, 20%, , 100% of data and evaluated both the training accuracies and times for each sub-dataset To ensure the stability of the proposed methods, we ran each case ten times To explicitly observe the complexities of FOSVM1 and FOSVM 2, we took average of the training times of ten times for each percentage and then plotted them As seen in Figure (left), the training times of FOSVM2 is increased at first and eventually does slightly change when the training size is sufficiently large As shown in Figure (right), the training time of FOSVM1 is approximately linear Fig The averages of the training times of FOSVM1 (right) and FOSVM2 (left) Conclusion In this paper, we integrate the formulation of cutting-plane method and the wellknown One-Class Support Vector Machine to propose Fast One-Class Support Vector Machine (FOSCM) for novelty detection We actually propose two complementary solutions FOSVM1 and FOSVM2 for training and detection The first is an exact solution thas has linear complexity The number of iterations to reach a very close accuracy, termed as θ-precision is O θ1 , making it attractive in dealing with large datasets The second solution employed a sampling strategy that remarkably has constant computational complexity defined relative to probability of approximation accuracy The experiment results indicate that our proposed methods are comparable with OCSVM and SVDD in terms of accuracy but much faster than them The speed-up in training time could reach 808 times for some dataset A Appendix Because of the limit space, in this appendix, we only sketch out the theoretical results used in the paper Lemma The following holds: ρn = wn , dn = wn > w1 ≥ w2 ≥ ≥ wn ≥ ≥ w∗ where w∗ = w∞ = lim wn n→∞ Lemma There exists xi (1 ≤ i ≤ n) such that Φ (xi ) − wn T and (Φ (xi ) − wn ) u ≤ for any vector u = = − wn Lemma The following holds: 2 wj − Φ (xi ) ≤ − wj (1 ≤ i ≤ j) 2 2 wn − wj ≥ wn − wj (j ≥ n + 1) 2 xn+1 − wn ≥ − w∗ Lemma Let us denote λn = 1− wn 1− w∗ 2 We have λn ≥ − 1+n/2 References A S Abbey, O Temitope, and S Lionel Active platform security 
through intrusion detection using naive Bayesian network for anomaly detection. In Proceedings of the London Communications Symposium, 2002.
2. M. F. Augusteijn and B. A. Folkert. Neural network classification and novelty detection. International Journal of Remote Sensing, 23:2891–2902, 2002.
3. M. Badoiu and K. L. Clarkson. Optimal core-sets for balls. In Proc. of DIMACS Workshop on Computational Geometry, 2002.
4. V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3):1–58, 2009.
5. C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1–27:27, 2011.
6. W. Fan, M. Miller, S. J. Stolfo, and W. Lee. Using artificial anomalies to detect unknown and known network intrusions. In Proceedings of the First IEEE International Conference on Data Mining, pages 123–130. IEEE Computer Society, 2001.
7. T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62, 1999.
8. R. Gwadera, M. J. Atallah, and W. Szpankowski. Reliable detection of episodes in event sequences. Knowl. Inf. Syst., 7(4):415–437, 2005.
9. T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.
10. T. Joachims and C.-N. Yu. Sparse kernel SVMs via cutting-plane training. Machine Learning, 76(2-3):179–193, 2009. European Conference on Machine Learning (ECML) Special Issue.
11. J. E. Kelley. The cutting plane method for solving convex programs. Journal of the SIAM, 8:703–712, 1960.
12. B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Comput., 13(7):1443–1471, July 2001.
13. S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
14. D. M. J. Tax and R. P. W. Duin. Support vector data description. Machine Learning, 54(1):45–66, 2004.
15. C. H. Teo, A. Smola, S. V. N. Vishwanathan, and Q. V. Le. A scalable modular convex solver for regularized risk minimization. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 727–736. ACM, 2007.
16. C. H. Teo, S. V. N. Vishwanathan, A. J. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. The Journal of Machine Learning Research, 11:311–365, 2010.
17. I. W. Tsang, A. Kocsor, and J. T. Kwok. Simpler core vector machines with enclosing balls. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 911–918, 2007.
18. I. W. Tsang, J. T. Kwok, P. Cheung, and N. Cristianini. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363–392, 2005.
