Improved kernel methods for classification


IMPROVED KERNEL METHODS FOR CLASSIFICATION

DUAN KAIBO (M. Eng, NUAA)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF MECHANICAL ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2003

Acknowledgements

I would like to express my deepest gratitude to my supervisors, Professor Aun Neow Poo and Professor S. Sathiya Keerthi, for their continuous guidance, help and encouragement. Professor Poo introduced me to this field and then introduced me to Professor Keerthi. Although he was usually very busy, he did manage to meet with his students from time to time and he was always available whenever his help was needed. Professor Keerthi guided me closely through every stage of the thesis work. He was always patient in explaining hard things in an easy way and always ready for discussions. There was an enormous amount of communication between us, and his feedback always came with enlightening comments, thoughtful suggestions and warm encouragement. "Think positively" is one of his sayings that I will always remember, although I might have sometimes over-practiced it.

It was very fortunate that I had the opportunity to work on some problems together with my colleagues Shirish Shevade and Wei Chu. I also learned a lot from the collaborative work and its many discussions. I really appreciate the great time we had together. Dr Chih-Jen Lin kept up a frequent interaction with us. His careful and critical reading of our publications and his prompt feedback also greatly helped us in improving our work. We also received valuable comments from Dr Olivier Chapelle and Dr Bernhard Schölkopf on some of our work. I sincerely thank these great researchers for their communication with us.

I also thank my old and new friends here in Singapore. Their friendship helped me out in many ways and made life here warm and colorful. The technical support from the Control and Mechatronics Lab as well as the Research Scholarship from the National University of Singapore are also gratefully acknowledged here.

I am grateful for the constant love and support of my parents. My brother Kaitao Duan and his family have always taken care of our parents; I really appreciate it, especially while I was away from home. Besides, their support and pushing from behind always gave me extra power to venture ahead. Last but not least, I thank Xiuquan Zhang, my wife, for her unselfish support and caring companionship.

Table of Contents

Acknowledgements
Summary
List of Tables
List of Figures
Nomenclature

1 Introduction
  1.1 Classification Learning
  1.2 Statistical Learning Theory
  1.3 Regularization
  1.4 Kernel Technique
    1.4.1 Kernel Trick
    1.4.2 Mercer's Kernels and Reproducing Kernel Hilbert Space
  1.5 Support Vector Machines
    1.5.1 Hard-Margin Formulation
    1.5.2 Soft-Margin Formulation
    1.5.3 Optimization Techniques for Support Vector Machines
  1.6 Multi-Category Classification
    1.6.1 One-Versus-All Methods
    1.6.2 One-Versus-One Methods
    1.6.3 Pairwise Probability Coupling Methods
    1.6.4 Error-Correcting Output Coding Methods
    1.6.5 Single Multi-Category Classification Methods
  1.7 Motivation and Outline of the Thesis
    1.7.1 Hyperparameter Tuning
    1.7.2 Posteriori Probabilities for Binary Classification
    1.7.3 Posteriori Probabilities for Multi-category Classification
    1.7.4 Comparison of Multiclass Methods

2 Hyperparameter Tuning
  2.1 Introduction
  2.2 Performance Measures
    2.2.1 K-fold Cross-Validation and Leave-One-Out
    2.2.2 Xi-Alpha Bound
    2.2.3 Generalized Approximate Cross-Validation
    2.2.4 Approximate Span Bound
    2.2.5 VC Bound
    2.2.6 Radius-Margin Bound
  2.3 Computational Experiments
  2.4 Analysis and Discussion
    2.4.1 K-fold Cross-Validation
    2.4.2 Xi-Alpha Bound
    2.4.3 Generalized Approximate Cross-Validation
    2.4.4 Approximate Span Bound
    2.4.5 VC Bound
    2.4.6 D^2 ||w||^2 for the L1 Soft-Margin Formulation
    2.4.7 D^2 ||w||^2 for the L2 Soft-Margin Formulation
  2.5 Conclusions

3 A Fast Dual Algorithm for Kernel Logistic Regression
  3.1 Introduction
  3.2 Dual Formulation
  3.3 Optimality Conditions for Dual
  3.4 SMO Algorithm for KLR
  3.5 Practical Aspects
  3.6 Numerical Experiments
  3.7 Conclusions

4 A Decomposition Algorithm for Multiclass KLR
  4.1 Multiclass KLR
  4.2 Dual Formulation
  4.3 Problem Decomposition
    4.3.1 Optimality Conditions
    4.3.2 A Basic Updating Step
    4.3.3 Practical Aspects: Caching and Updating H_i^k
    4.3.4 Solving the Whole Dual Problem
    4.3.5 Handling the Ill-Conditioned Situations
  4.4 Numerical Experiments
  4.5 Discussions and Conclusions

5 Soft-Max Combination of Binary Classifiers
  5.1 Introduction
  5.2 Soft-Max Combination of Binary Classifiers
    5.2.1 Soft-Max Combination of One-Versus-All Classifiers
    5.2.2 Soft-Max Combination of One-Versus-One Classifiers
    5.2.3 Relation to Previous Work
  5.3 Practical Issues in the Soft-Max Function Design
    5.3.1 Training Examples for the Soft-Max Function Design
    5.3.2 Regularization Parameter C
    5.3.3 Simplified Soft-Max Function Design
  5.4 Numerical Study
  5.5 Results and Conclusions

6 Comparison of Multiclass Kernel Methods
  6.1 Introduction
  6.2 Pairwise Coupling with Support Vector Machines
    6.2.1 Pairwise Probability Coupling
    6.2.2 Posteriori Probability for Support Vector Machines
  6.3 Numerical Experiments
  6.4 Results and Conclusions

7 Conclusion

Bibliography

Appendices
  A Plots of Variation of Performance Measures wrt. Hyperparameters
  B Pseudo Code of the Dual Algorithm for Kernel Logistic Regression
  C Pseudo Code of the Decomposition Algorithm for Multiclass KLR
  D A Second Formulation for Multiclass KLR
    D.1 Primal Formulation
    D.2 Dual Formulation
    D.3 Problem Decomposition
    D.4 Optimal Condition of the Subproblem
    D.5 SMO Algorithm for the Sub Problem
    D.6 Practical Issues
      D.6.1 Caching and Updating of H_i^k
      D.6.2 Handling the Ill-Conditioned Situations
    D.7 Conclusions
Summary

Support vector machines (SVMs) and related kernel methods have become popular in the machine learning community for solving classification problems. Improving these kernel methods for classification, with special interest in posteriori probability estimation, and providing clearer guidelines for practical designers are the main focus of this thesis.

Chapter 1 gives a brief review of some background knowledge of classification learning, support vector machines and multi-category classification methods, and motivates the thesis. In Chapter 2 we empirically study the usefulness of some easy-to-compute simple performance measures for SVM hyperparameter tuning. The results clearly point out that 5-fold cross-validation gives the best estimation of optimal hyperparameter values. Cross-validation can also be used with arbitrary learning methods other than SVMs. In Chapter 3 we develop a new dual algorithm for kernel logistic regression (KLR) which also produces a natural posteriori probability estimate as part of its solution. This algorithm is similar in spirit to the popular Sequential Minimal Optimization (SMO) algorithm for SVMs. It is fast, robust and scales well to large problems. Then, in Chapter 4, we generalize KLR to the multi-category case and develop a decomposition algorithm for it. Although the idea is very interesting, solving multi-category classification as a single optimization problem turns out to be slow. This agrees with the observations of other researchers made in the context of SVMs. Binary classification based multiclass methods are more suitable for practical use. In Chapter 5 we develop a binary classification based multiclass method that combines binary classifiers through a systematically designed soft-max function. Posteriori probabilities are also obtained from the combination. The numerical study shows that the new method is competitive with other good schemes, in both classification performance and posteriori probability estimation. There exist a range of multiclass kernel methods. In Chapter 6 we conduct an empirical study comparing these methods and find that pairwise coupling with Platt's posteriori probabilities for SVMs performs the best among the commonly used kernel classification methods included in the study, and thus it is recommended as the best multiclass kernel method. Thus, this thesis contributes, theoretically and practically, to improving kernel methods for classification, especially in posteriori probability estimation. In Chapter 7 we conclude the thesis work and make recommendations for future research.
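To make the cross-validation recommendation above concrete, the following is a minimal sketch (not part of the thesis) of tuning the Gaussian-kernel SVM hyperparameters (C, σ) by 5-fold cross-validation. The dataset, the grid values and the use of scikit-learn are assumptions of this illustration only; note that scikit-learn parameterizes the RBF kernel by gamma = 1/(2σ²).

    # Hedged illustration: grid search over (C, sigma) scored by 5-fold cross-validation.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=400, n_features=10, random_state=0)  # stand-in dataset

    sigmas = 2.0 ** np.arange(-3, 4)              # candidate Gaussian widths
    Cs = 10.0 ** np.arange(-2, 3)                 # candidate regularization values
    param_grid = {"C": Cs, "gamma": 1.0 / (2.0 * sigmas ** 2)}

    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # 5-fold CV accuracy as criterion
    search.fit(X, y)
    print("best (C, gamma):", search.best_params_, "CV accuracy:", search.best_score_)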
List of Tables

2.1 General information about the datasets
2.2 The value of Test Err at the minima of different criteria for fixed C values
2.3 The value of Test Err at the minima of different criteria for fixed σ values
2.4 The value of Test Err at the minima of different criteria for fixed C values
2.5 The value of Test Err at the minima of different criteria for fixed σ values
3.1 Properties of datasets
3.2 Computational costs for SMO and BFGS algorithm
3.3 NLL of the test set and test set error
3.4 Generalization performance comparison of KLR and SVM
4.1 Basic information of the datasets
4.2 Classification error rate of the methods, on datasets
5.1 Basic information about the datasets and training sizes
5.2 Mean and standard deviation of test error rate of one-versus-all methods
5.3 Mean and standard deviation of test error rate of one-versus-one methods
5.4 Mean and standard deviation of test NLL, of one-versus-all methods
5.5 Mean and standard deviation of test NLL, of one-versus-one methods
5.6 P-values from t-test of (test set) error of PWC PSVM against the rest of the methods
5.7 P-values from t-test of (test set) error of PWC KLR against the rest of the methods
6.1 Basic information and training set sizes of the datasets
6.2 Mean and standard deviation of test set error on datasets at different training set sizes
6.3 P-values from the pairwise t-test of the test set error, of PWC PSVM against the remaining methods, on datasets, at different training set sizes
6.4 P-values from the pairwise t-test of the test set error, of PWC KLR against WTA SVM and MWV SVM, on datasets at different training set sizes
6.5 P-values from the pairwise t-test of the test set error, of MWV SVM against WTA SVM, on datasets at different training set sizes

List of Figures

1.1 An intuitive toy example of kernel mapping
2.1 Variation of performance measures of L1 SVM, wrt. σ, on Image dataset
2.2 Variation of performance measures of L1 SVM, wrt. C, on Image dataset
2.3 Variation of performance measures of L2 SVM, wrt. σ, on Image dataset
2.4 Variation of performance measures of L2 SVM, wrt. C, on Image dataset
2.5 Performance of various measures for different training sizes
2.6 Correlation of 5-fold cross-validation, Xi-Alpha bound and GACV with test error
3.1 Loss functions of KLR and SVMs
4.1 Class distribution of G5 dataset
4.2 Winner-class posteriori probability contour plot of Bayes optimal classifier
4.3 Winner-class posteriori probability contour plot of multiclass KLR
4.4 Classification boundary of Bayes optimal classifier
4.5 Classification boundary of multiclass KLR
6.1 Boxplots of the four methods for the five datasets, at the three training set sizes
A.1 Variation of performance measures of L1 SVM, wrt. σ, on Banana dataset
A.2 Variation of performance measures of L1 SVM, wrt. C, on Banana dataset
A.3 Variation of performance measures of L2 SVM, wrt. σ, on Banana dataset
A.4 Variation of performance measures of L2 SVM, wrt. C, on Banana dataset
A.5 Variation of performance measures of L1 SVM, wrt. σ, on Splice dataset
A.6 Variation of performance measures of L1 SVM, wrt. C, on Splice dataset
A.7 Variation of performance measures of L2 SVM, wrt. σ, on Splice dataset
A.8 Variation of performance measures of L2 SVM, wrt. C, on Splice dataset
A.9 Variation of performance measures of L1 SVM, wrt. σ, on Waveform dataset
A.10 Variation of performance measures of L1 SVM, wrt. C, on Waveform dataset
A.11 Variation of performance measures of L2 SVM, wrt. σ, on Waveform dataset
A.12 Variation of performance measures of L2 SVM, wrt. C, on Waveform dataset
A.13 Variation of performance measures of L1 SVM, wrt. σ, on Tree dataset
A.14 Variation of performance measures of L1 SVM, wrt. C, on Tree dataset
A.15 Variation of performance measures of L2 SVM, wrt. σ, on Tree dataset
A.16 Variation of performance measures of L2 SVM, wrt. C, on Tree dataset
Appendix C

Pseudo Code of the Decomposition Algorithm for Multiclass KLR

Some definitions:

    M            : number of classes
    N            : total number of training examples
    Nk           : number of training examples from the k-th class
    point[i]     : i-th input example
    target[i]    : class label of index i
    H[i][k]      : cached value of H_i^k
    alpha[i][k]  : Lagrange multiplier alpha_i^k
    I_low[k]     : i_low^k        B_low[k] : b_low^k
    I_up[k]      : i_up^k         B_up[k]  : b_up^k
    //           : words after "//" are comments

Method 1:

    decomp_Main( )
        initialize alpha[i][k] to  C/Nk      if target[i] == k
        initialize alpha[i][k] to -C/(N-Nk)  if target[i] != k
        initialize all H[i][k]
        initialize I_low, B_low, I_up and B_up
        do {
            do {
                numChange = numChange1 = numChange2 = 0
                for k = 1:M {
                    i_low = I_low[k], b_low = B_low[k]
                    i_up  = I_up[k],  b_up  = B_up[k]
                    if (b_low                        // [condition truncated in the source]
                    // [missing in source: violation check for class k (numChange1 update) and
                    //  the loop over example indices i]
                        if ( ... >= |Hi - b_up|) {   // left-hand side of the comparison missing in source
                            takestepFlag = takeStep(k, i, i_low)
                            if (takestepFlag == 0) {
                                takestepFlag = takeStep(k, i, i_up)
                            }
                        } else {
                            takestepFlag = takeStep(k, i, i_up)
                            if (takestepFlag == 0) {
                                takestepFlag = takeStep(k, i, i_low)
                            }
                        }
                        if (bound property of index i has changed) {
                            numChange2++
                        }
                    }
                }
                if (numChange != 0 && numChange1 == 0 && numChange2 == 0) {
                    break
                }
            } while (numChange1 != 0 || numChange2 != 0)
        // [missing in source: closing condition of the outer loop]
    end decomp_Main

Method 2:

    decomp_Main( )
        initialize alpha[i][k] to  C/Nk      if yi == k
        initialize alpha[i][k] to -C/(N-Nk)  if yi != k
        initialize all H[i][k]
        initialize I_low, B_low, I_up and B_up
        do {
            numChange = 0
            for k = 1:M {
                b_low = B_low[k]
                b_up  = B_up[k]
                if (b_low                            // [condition truncated in the source]
                // [missing in source: violation check for class k and choice of example index i]
                    if ( ... >= |Hi - b_up|) {       // left-hand side of the comparison missing in source
                        takestepFlag = takeStep(k, i, i_low)
                        if (takestepFlag == 0) {
                            takestepFlag = takeStep(k, i, i_up)
                        }
                    } else {
                        takestepFlag = takeStep(k, i, i_up)
                        if (takestepFlag == 0) {
                            takestepFlag = takeStep(k, i, i_low)
                        }
                    }
                    if (boundary characteristics of index i have changed) {
                        numChange++
                    }
                }
            }
        } while (numChange != 0 || B_low[k] ... )    // [rest of the condition missing in source]
    end decomp_Main

takeStep procedure (recovered fragments):

    takeStep(k, i, j)
        // [missing in source: setup of aio, ajo, Hio, Hjo, kernel entries, the maximum feasible
        //  step t_max, and the values dPhi, d2Phi used below]
            if ( ... 0) {                            // comparison partially missing in source
                t_right = t_max
                t_left = 0, dPhi_left = dPhi, d2Phi_left = d2Phi
            } else {
                t = t_max
                takestepFlag = ...                   // value missing in source
            }
        } else {
            return
        }
        if (takestepFlag == 1) {
            // Choose a better start point
            if ( |dPhi_left| ...                     // [missing in source: start-point choice and the
            do {                                     //  safeguarded Newton-Raphson update of t]
                if (dPhi ... 0) {                    // comparison missing in source
                    t_left = t
                    dPhi_left = dPhi
                } else {
                    t_right = t
                    dPhi_right = dPhi
                }
                t0 = t
            } while (|dPhi| > 0.1*tol)
        }
        if (t == 0) {
            return 0
        }
        ai = aio + t
        aj = ajo - t
        Hi = Hio + Gio - log((C*dik - aio - t)/(C*diM + sum_ai + t)) + t*(kii - kij)
        Hj = Hjo + Gjo - log((C*djk - ajo + t)/(C*djM + sum_aj - t)) + t*(kij - kjj)
        // Save ai, aj, Hi, Hj
        alpha[i][k] = ai
        alpha[j][k] = aj
        H[i][k] = Hi
        H[j][k] = Hj
        Update the boundary property of indices i and j
        for all p != i, j {
            Kip = kernel( point[i], point[p] )
            Kjp = kernel( point[j], point[p] )
            Hcache[p][k] += t * (Kip - Kjp)
        }
        Update i_low, i_up, b_low and b_up over indices of NG
        return 1
    end takeStep procedure
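As a companion to the pseudocode above, the following Python sketch shows the kind of bookkeeping the cache H[i][k] supports: recomputing H from scratch and measuring, per class, how far the current dual point is from approximate optimality. It is an illustration only, written with the g_i^k of equation (D.15) from Appendix D; the thesis's index-set restrictions (normal and near-boundary groups) and the exact i_low/i_up selection are deliberately omitted, and the toy data are placeholders.

    # Hedged sketch (not the thesis's code) of the cached quantities H[i][k] = F_i^k - g_i^k.
    import numpy as np

    def g_value(alpha_ik, is_own_class, C):
        # g_i^k = log(1 - alpha/C) for the example's own class, log(-alpha/C) otherwise
        return np.log(1.0 - alpha_ik / C) if is_own_class else np.log(-alpha_ik / C)

    def recompute_H(alpha, y, K, C):
        N, M = alpha.shape
        F = K @ alpha                    # F[i, k] = sum_j alpha[j, k] * k(x_j, x_i) (K symmetric)
        H = np.empty_like(F)
        for i in range(N):
            for k in range(M):
                H[i, k] = F[i, k] - g_value(alpha[i, k], y[i] == k, C)
        return H

    def class_violations(H):
        # Per class k: spread of H[:, k]; approximate optimality means the spread is <= 2*tau
        return H.max(axis=0) - H.min(axis=0)

    # Tiny synthetic check, using the same feasible initialization as decomp_Main above
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 2)); y = np.array([0, 0, 1, 1, 2, 2]); M, C = 3, 1.0
    K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)   # Gaussian kernel, sigma = 1
    Nk = np.bincount(y, minlength=M)
    alpha = np.where(y[:, None] == np.arange(M)[None, :], C / Nk, -C / (len(y) - Nk))
    print(class_violations(recompute_H(alpha, y, K, C)))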
Appendix D

A Second Formulation for Multiclass KLR

In this appendix, we give another formulation for multiclass KLR and propose a decomposition scheme to solve it. This formulation differs from the formulation in Chapter 4 in that it avoids using an arbitrary "reference" class. This appendix is organized as follows. In Section D.1 we give the primal formulation. In Section D.2 we give a dual formulation and derive the Wolfe dual. In Section D.3 we decompose the Wolfe dual into small sub problems, and in Section D.4 we derive the optimality conditions for the sub problems. In Section D.5 we give an SMO-like algorithm for solving the sub problems. Practical implementation issues are discussed in Section D.6. Finally, concluding remarks are made in Section D.7.

D.1 Primal Formulation

As usual, we assume that the multiclass classification problem has M classes and ℓ training examples (x_1, y_1), ..., (x_ℓ, y_ℓ), where the class labels y_i ∈ Y = {1, ..., M}. M functions, f_1(x), ..., f_M(x), are to be estimated, one function for each class. Then the following soft-max function is used to estimate the posterior probabilities:

    P(\omega_i \mid x) = \frac{e^{f_i(x)}}{e^{f_1(x)} + \cdots + e^{f_M(x)}}, \qquad i = 1, \ldots, M.        (D.1)

The negative log likelihood of the training examples can thus be computed as

    \mathrm{NLL} = -\sum_{i=1}^{\ell} \log P(\omega_{y_i} \mid x_i)
                 = \sum_{i=1}^{\ell} \left[ \log\left(e^{f_1(x_i)} + \cdots + e^{f_M(x_i)}\right) - f_{y_i}(x_i) \right],        (D.2)

where

    f_k(x) = w_k \cdot z - b_k, \qquad k = 1, \ldots, M,        (D.3)

and z = Φ(x), k(x_i, x_j) = Φ(x_i) · Φ(x_j). The f_k's can be determined by minimizing a penalized negative log likelihood function, i.e.

    \min_{w_k, b_k} \; \frac{1}{2}\sum_{k=1}^{M} \|w_k\|^2 + C \sum_{i=1}^{\ell} \left[ \log\left(e^{f_1(x_i)} + \cdots + e^{f_M(x_i)}\right) - f_{y_i}(x_i) \right],        (D.4)

where C is a positive regularization parameter. In the next section we will give a dual formulation of the above problem.
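The following short sketch (not from the thesis) evaluates the quantities in (D.1), (D.2) and (D.4) directly, with the feature map Φ taken as the identity (i.e. a linear kernel) purely to keep the example small; W, b and the toy data are placeholders of this illustration.

    # Hedged illustration of the soft-max posterior and the penalized negative log likelihood.
    import numpy as np

    def softmax_posteriors(F):
        # Rows of F are [f_1(x_i), ..., f_M(x_i)]; subtracting the row max is only numerical safety.
        E = np.exp(F - F.max(axis=1, keepdims=True))
        return E / E.sum(axis=1, keepdims=True)                      # (D.1)

    def penalized_nll(W, b, X, y, C):
        F = X @ W.T - b                                              # f_k(x_i) = w_k . x_i - b_k  (D.3)
        log_sum = np.log(np.exp(F - F.max(1, keepdims=True)).sum(1)) + F.max(1)
        nll = np.sum(log_sum - F[np.arange(len(y)), y])              # (D.2)
        return 0.5 * np.sum(W ** 2) + C * nll                        # (D.4)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3)); y = np.array([0, 1, 2, 1, 0])
    W = rng.normal(size=(3, 3)); b = np.zeros(3); C = 1.0
    print(softmax_posteriors(X @ W.T - b))
    print(penalized_nll(W, b, X, y, C))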
D.2 Dual Formulation

The problem (D.4) can be rewritten in the following equivalent constrained form, from which we derive a dual formulation:

    \min_{w_k, b_k, \xi_i^k} \; \frac{1}{2}\sum_{k=1}^{M}\|w_k\|^2 + C\sum_{i=1}^{\ell}\left[\log\left(e^{\xi_i^1}+\cdots+e^{\xi_i^M}\right)-\xi_i^{y_i}\right]        (D.5)

    subject to \quad \xi_i^k = w_k \cdot z_i - b_k, \qquad i = 1,\ldots,\ell, \; k = 1,\ldots,M.        (D.6)

The Lagrangian for this problem is

    L = \frac{1}{2}\sum_{k=1}^{M}\|w_k\|^2 + C\sum_{i=1}^{\ell}\left[\log\left(e^{\xi_i^1}+\cdots+e^{\xi_i^M}\right)-\xi_i^{y_i}\right] + \sum_{k=1}^{M}\sum_{i=1}^{\ell}\alpha_i^k\left(\xi_i^k-(w_k\cdot z_i-b_k)\right),        (D.7)

where the α_i^k are Lagrange multipliers. The optimization problem (D.5)-(D.6) is convex and the KKT conditions are

    \nabla_{w_k} L = w_k - \sum_{i=1}^{\ell}\alpha_i^k z_i = 0        (D.8)

    \frac{\partial L}{\partial b_k} = \sum_{i=1}^{\ell}\alpha_i^k = 0        (D.9)

    \frac{\partial L}{\partial \xi_i^k} = C\left(\frac{e^{\xi_i^k}}{e^{\xi_i^1}+\cdots+e^{\xi_i^M}} - 1\right) + \alpha_i^k = 0 \quad \text{if } k = y_i        (D.10)

    \frac{\partial L}{\partial \xi_i^k} = C\,\frac{e^{\xi_i^k}}{e^{\xi_i^1}+\cdots+e^{\xi_i^M}} + \alpha_i^k = 0 \quad \text{if } k \neq y_i        (D.11)

From (D.10) and (D.11) we get

    \sum_{k=1}^{M}\frac{\partial L}{\partial \xi_i^k} = 0 \;\Rightarrow\; \sum_{k=1}^{M}\alpha_i^k = 0, \quad \forall i,        (D.12)

and from there we also get

    0 < \alpha_i^k < C \;\text{ if } k = y_i, \qquad -C < \alpha_i^k < 0 \;\text{ if } k \neq y_i,        (D.13)

and

    \xi_i^k - \log\left(e^{\xi_i^1}+\cdots+e^{\xi_i^M}\right) = g_i^k,        (D.14)

where we define

    g_i^k = g_i^k(\alpha_i^k) = \begin{cases} \log(1 - \alpha_i^k/C) & \text{if } k = y_i \\ \log(-\alpha_i^k/C) & \text{if } k \neq y_i \end{cases}        (D.15)

We can also write (D.14) as

    \xi_i^k = \log\left(e^{\xi_i^1}+\cdots+e^{\xi_i^M}\right) + g_i^k.        (D.16)

Substituting (D.8), (D.9), (D.12), (D.14) and (D.16) into (D.7), the b_k terms vanish by (D.9), the w_k · z_i terms combine with the quadratic term to give -(1/2)Σ_k ||w_k||², and (D.12) removes the log-sum terms, so that

    L = -\frac{1}{2}\sum_{k=1}^{M}\|w_k\|^2 - C\sum_{i=1}^{\ell} g_i^{y_i} + \sum_{k=1}^{M}\sum_{i=1}^{\ell}\alpha_i^k g_i^k, \qquad w_k = \sum_{i=1}^{\ell}\alpha_i^k z_i.

Maximizing L over α is therefore equivalent to minimizing its negative, and the simplified Wolfe dual of (D.5)-(D.6) can be written as

    \min_{\alpha}\; \frac{1}{2}\sum_{k=1}^{M}\|w_k\|^2 + C\sum_{i=1}^{\ell} g_i^{y_i} - \sum_{k=1}^{M}\sum_{i=1}^{\ell}\alpha_i^k g_i^k        (D.17)

subject to

    \sum_{i=1}^{\ell}\alpha_i^k = 0, \qquad k = 1,\ldots,M,        (D.18)

    \sum_{k=1}^{M}\alpha_i^k = 0, \qquad i = 1,\ldots,\ell.        (D.19)

Note that the Wolfe dual has two types of constraints, and it is not easy to solve the problem with both present. Constraints (D.18) result from the bias term b_k used in the expression of the function f_k. Dropping the term b_k from f_k, i.e. taking f_k(x) = w_k · z, gets rid of constraints (D.18). The effect of eliminating the bias term b_k can be somewhat compensated by employing a modified kernel function k̃(x_i, x_j) = k(x_i, x_j) + 1, where k(x_i, x_j) is the normal kernel function. If this modified kernel function is used, we can simply write f_k(x) = w_k · z and an implicit bias term b_k is incorporated by the modified kernel function. (Conceptually, this amounts to adding one dimension to the original feature space with the corresponding entry of all feature vectors set to 1; it may be argued why this entry is not set to some other constant value.) No matter whether we simply drop the bias terms b_k or employ a modified kernel function to incorporate implicit bias terms, the corresponding Wolfe dual becomes

    \min_{\alpha}\; D(\alpha) = \frac{1}{2}\sum_{k=1}^{M}\|w_k\|^2 + C\sum_{i=1}^{\ell} g_i^{y_i} - \sum_{k=1}^{M}\sum_{i=1}^{\ell}\alpha_i^k g_i^k        (D.20)

    subject to \quad \sum_{k=1}^{M}\alpha_i^k = 0, \qquad i = 1,\ldots,\ell.

Note that now the f_k are expressed as

    f_k(x) = w_k \cdot z, \qquad k = 1,\ldots,M.        (D.21)

From now on, we will consider the simplified Wolfe dual (D.20).

D.3 Problem Decomposition

We can decompose the dual problem (D.20) by considering one constraint at a time, i.e., we select the M dual variables associated with one example to optimize while keeping all other dual variables fixed. The same strategy has been used for working set selection by Crammer and Singer (2000) in solving the optimization problem of a single multiclass SVM. Suppose we select the dual variables associated with example x_p and define α_p = (α_p^1, ..., α_p^M)^T. Then the corresponding sub dual problem is

    \min_{\alpha_p}\; D_p(\alpha_p) = \sum_{k=1}^{M}\alpha_p^k\sum_{i=1}^{\ell}\alpha_i^k k_{pi} - \frac{1}{2}\sum_{k=1}^{M}(\alpha_p^k)^2 k_{pp} + C g_p^{y_p} - \sum_{k=1}^{M}\alpha_p^k g_p^k        (D.22)

    subject to \quad \sum_{k=1}^{M}\alpha_p^k = 0,

where k_{pi} = k(x_p, x_i). This sub dual problem is still convex and we derive its optimality condition in the next section.

D.4 Optimal Condition of the Subproblem

In order to derive the optimality condition for the sub problem, let us write down the Lagrangian for the sub problem (D.22):

    L_p = \sum_{k=1}^{M}\alpha_p^k\sum_{i=1}^{\ell}\alpha_i^k k_{pi} - \frac{1}{2}\sum_{k=1}^{M}(\alpha_p^k)^2 k_{pp} + C g_p^{y_p} - \sum_{k=1}^{M}\alpha_p^k g_p^k + \beta_p\sum_{k=1}^{M}\alpha_p^k.        (D.23)

Now let us derive the first-order derivative of L_p with respect to α_p^k. If k = y_p, we have

    \frac{\partial L_p}{\partial\alpha_p^k} = \sum_{i=1}^{\ell}\alpha_i^k k_{ip} + \alpha_p^k k_{pp} - \alpha_p^k k_{pp} + C\left(-\frac{1}{C-\alpha_p^k}\right) - g_p^k - \alpha_p^k\left(-\frac{1}{C-\alpha_p^k}\right) + \beta_p = \sum_{i=1}^{\ell}\alpha_i^k k_{ip} - g_p^k - 1 + \beta_p.        (D.24)

If k ≠ y_p, we have

    \frac{\partial L_p}{\partial\alpha_p^k} = \sum_{i=1}^{\ell}\alpha_i^k k_{ip} - g_p^k - \alpha_p^k\left(\frac{1}{\alpha_p^k}\right) + \beta_p = \sum_{i=1}^{\ell}\alpha_i^k k_{ip} - g_p^k - 1 + \beta_p.        (D.25)

Thus, overall, we can write the KKT condition as

    \frac{\partial L_p}{\partial\alpha_p^k} = \sum_{i=1}^{\ell}\alpha_i^k k_{ip} - g_p^k - 1 + \beta_p = F_p^k - g_p^k - \pi_p = H_p^k - \pi_p = 0,        (D.26)

where

    F_p^k = \sum_{i=1}^{\ell}\alpha_i^k k_{ip},        (D.27)

    H_p^k = F_p^k - g_p^k,        (D.28)

    \pi_p = 1 - \beta_p.        (D.29)

Define

    k_p^{low} = \arg\min_k H_p^k, \quad \pi_p^{low} = \min_k H_p^k, \qquad k_p^{up} = \arg\max_k H_p^k, \quad \pi_p^{up} = \max_k H_p^k.        (D.30)

Optimality of the subproblem holds at a given α_p iff

    \pi_p^{low} = \pi_p^{up}.        (D.31)

A class-index pair (k_1, k_2) defines a violation of the optimality condition of the subproblem at α_p if

    H_p^{k_1} \neq H_p^{k_2}.        (D.32)

The optimality condition of the subproblem holds at a given α_p iff there is no class-index pair that defines a violation. In a numerical solution it is not possible to achieve optimality exactly, and an approximate optimality condition is defined as

    \pi_p^{low} \geq \pi_p^{up} - 2\tau,        (D.33)

where τ is a positive tolerance parameter.
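A minimal sketch (not from the thesis) of the subproblem optimality check in (D.27), (D.28), (D.30) and (D.33) for a single example p; the dual variables, kernel matrix and tolerance below are toy placeholders, and the near-boundary bookkeeping of Section D.6.2 is ignored.

    # Hedged sketch of the quantities F_p^k, H_p^k and the 2*tau optimality test.
    import numpy as np

    def g(alpha_pk, own_class, C):
        return np.log(1.0 - alpha_pk / C) if own_class else np.log(-alpha_pk / C)      # (D.15)

    def subproblem_check(alpha, y, K, p, C, tau):
        M = alpha.shape[1]
        F = K[:, p] @ alpha                                  # F_p^k = sum_i alpha_i^k k_ip   (D.27)
        H = np.array([F[k] - g(alpha[p, k], y[p] == k, C) for k in range(M)])            # (D.28)
        k_low, k_up = int(np.argmin(H)), int(np.argmax(H))   # (D.30)
        pi_low, pi_up = H[k_low], H[k_up]
        return H, (k_up, k_low), pi_low >= pi_up - 2 * tau   # candidate pair and test (D.33)

    # Tiny example: 4 points, 3 classes; alpha satisfies sum_k alpha_i^k = 0 and the bounds (D.13)
    y = np.array([0, 1, 2, 0]); C, tau = 1.0, 1e-3
    alpha = np.array([[0.40, -0.10, -0.30],
                      [-0.20, 0.50, -0.30],
                      [-0.25, -0.25, 0.50],
                      [0.20, -0.10, -0.10]])
    K = np.eye(4) + 0.1                                      # a valid (positive definite) toy kernel matrix
    H, pair, optimal = subproblem_check(alpha, y, K, p=0, C=C, tau=tau)
    print(H, pair, optimal)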
D.5 SMO Algorithm for the Sub Problem

Suppose that, for sub problem D_p, the class-index pair (k_1, k_2) defines a violation of the optimality condition. It is then possible to achieve a decrease in the subproblem objective D_p by adjusting α_p^{k_1} and α_p^{k_2} only, with the equality constraint Σ_{k=1}^{M} α_p^k = 0 being maintained. Let us make the following definitions:

    \tilde\alpha_p^{k_1}(t) = \alpha_p^{k_1} + t        (D.34)

    \tilde\alpha_p^{k_2}(t) = \alpha_p^{k_2} - t        (D.35)

    \tilde\alpha_p^{k}(t) = \alpha_p^{k} \quad \forall\, k \neq k_1, k_2        (D.36)

    \tilde\alpha_i^{k}(t) = \alpha_i^{k} \quad \forall\, i \neq p \text{ and } \forall\, k        (D.37)

Let

    \varphi(t) = D_p(\tilde\alpha_p(t)),        (D.38)

and we have

    \varphi'(t) = \left[\sum_{i=1}^{\ell}\tilde\alpha_i^{k_1} K_{pi} - g_p^{k_1}(\tilde\alpha_p^{k_1})\right] - \left[\sum_{i=1}^{\ell}\tilde\alpha_i^{k_2} K_{pi} - g_p^{k_2}(\tilde\alpha_p^{k_2})\right] = H_p^{k_1}(t) - H_p^{k_2}(t),        (D.39)

    \varphi''(t) = 2 K_{pp} - g_p^{k_1\,\prime}(\tilde\alpha_p^{k_1}) - g_p^{k_2\,\prime}(\tilde\alpha_p^{k_2}),        (D.40)

where K_{ij} = k(x_i, x_j) and

    g_p^{k\,\prime}(\alpha_p^k) = \begin{cases} -\dfrac{1}{C-\alpha_p^k} & \text{if } k = y_p \\ -\dfrac{1}{-\alpha_p^k} & \text{if } k \neq y_p \end{cases}        (D.41)

Since K_{pp} is always non-negative and g_p^{k'} is always negative, we have φ''(t) > 0. In deriving (D.39), we have used the fact that

    C\delta_{k_1, y_p}\, g_p^{y_p\,\prime}(\tilde\alpha_p^{y_p}) - \tilde\alpha_p^{k_1} g_p^{k_1\,\prime}(\tilde\alpha_p^{k_1}) = -1 \quad \text{and} \quad C\delta_{k_2, y_p}\, g_p^{y_p\,\prime}(\tilde\alpha_p^{y_p}) - \tilde\alpha_p^{k_2} g_p^{k_2\,\prime}(\tilde\alpha_p^{k_2}) = -1,        (D.42)

which can easily be proved using (D.41).

The basic step, which takes one violating class-index pair (k_1, k_2) and optimizes the sub problem D_p with respect to the variables α_p^{k_1} and α_p^{k_2}, can be written as

    t^* = \arg\min_t \varphi(t), \qquad (\alpha_p)^{\mathrm{new}} = \tilde\alpha_p(t^*).        (D.43)

With the simple formulas for φ'(t) and φ''(t), we can solve the univariate problem (D.43) using Newton-Raphson iterations

    t^{r+1} = t^r - \left[\varphi''(t^r)\right]^{-1}\varphi'(t^r),        (D.44)

starting from t^0 = 0 and continuing until a certain accuracy is reached, as we solve the univariate problems in Chapters 3 and 4. As before, with the required accuracy tolerance τ in mind, we can terminate the iteration (D.44) when we find a t^r satisfying a tighter accuracy criterion, say |φ'(t^r)| < 0.1τ. In order to make |φ'(t)| as large as possible, it is natural to choose k_1 = k_p^{up} and k_2 = k_p^{low}.

D.6 Practical Issues

D.6.1 Caching and Updating of H_i^k

As the quantity H_i^k plays an important role in the algorithm, it is better to maintain a cache for the H_i^k's. In particular, for k = k_1, k_2, H_p^k is needed at the various t^r in the Newton-Raphson iterations involving the violating class-index pair (k_1, k_2). For k = k_1, k_2, we can use the following formula to update H_p^k in the Newton-Raphson iterations:

    H_p^k(\tilde\alpha_p(t^{r+1})) = H_p^k(\tilde\alpha_p(t^r)) + \left[\tilde\alpha_p^k(t^{r+1}) - \tilde\alpha_p^k(t^r)\right] K_{pp} - \left[g(\tilde\alpha_p^k(t^{r+1})) - g(\tilde\alpha_p^k(t^r))\right].        (D.45)

For k ≠ k_1, k_2, H_p^k is not affected by the changes of α̃_p^{k_1}(t) and α̃_p^{k_2}(t) and no updating is needed. For i ≠ p and all k, after the p-th sub problem D_p is temporarily solved, H_i^k can be updated using the following formula:

    (H_i^k)_{\mathrm{new}} = (H_i^k)_{\mathrm{old}} + \left[(\alpha_p^k)_{\mathrm{new}} - (\alpha_p^k)_{\mathrm{old}}\right] K_{ip},        (D.46)

where the subscripts new and old denote the values of the corresponding variables after and before solving the p-th subproblem. (We say "temporarily" because the optimality of D_p may not hold anymore once we turn to solve another sub problem, say D_{p+1}.)
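The following sketch implements the unguarded Newton-Raphson iteration (D.44) for one violating pair, using φ'(t) and φ''(t) from (D.39)-(D.41). It is an illustration under simplifying assumptions (no caching, no safeguarding against steps that approach the bounds in (D.13), toy data as placeholders), not the thesis's implementation.

    # Hedged sketch of the basic SMO step (D.34)-(D.44) on a toy subproblem.
    import numpy as np

    def g_val(a, own, C):  return np.log(1.0 - a / C) if own else np.log(-a / C)        # (D.15)
    def g_der(a, own, C):  return -1.0 / (C - a) if own else -1.0 / (-a)                # (D.41)

    def smo_step(alpha, y, K, p, k1, k2, C, tau):
        """Return a step t that (approximately) minimizes phi(t) for the pair (k1, k2)."""
        F = lambda k: K[:, p] @ alpha[:, k]                  # F_p^k at the current alpha  (D.27)
        t = 0.0
        for _ in range(50):
            a1, a2 = alpha[p, k1] + t, alpha[p, k2] - t      # (D.34), (D.35)
            H1 = F(k1) + t * K[p, p] - g_val(a1, y[p] == k1, C)
            H2 = F(k2) - t * K[p, p] - g_val(a2, y[p] == k2, C)
            dphi = H1 - H2                                   # phi'(t)   (D.39)
            if abs(dphi) < 0.1 * tau:
                break
            d2phi = 2 * K[p, p] - g_der(a1, y[p] == k1, C) - g_der(a2, y[p] == k2, C)   # (D.40)
            t -= dphi / d2phi                                # Newton-Raphson update (D.44)
        return t

    y = np.array([0, 1, 2, 0]); C, tau = 1.0, 1e-3
    alpha = np.array([[0.40, -0.10, -0.30], [-0.20, 0.50, -0.30],
                      [-0.25, -0.25, 0.50], [0.20, -0.10, -0.10]])
    K = np.eye(4) + 0.1
    # For this toy point, (k1, k2) = (1, 2) happens to be the (k_up, k_low) pair recommended in D.5.
    t_star = smo_step(alpha, y, K, p=0, k1=1, k2=2, C=C, tau=tau)
    alpha[0, 1] += t_star; alpha[0, 2] -= t_star             # apply the step; the constraint is preserved
    print(t_star, alpha[0])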
D.6.2 Handling the Ill-Conditioned Situations

Solving the univariate problem (D.43) may run into ill-conditioned situations that require special handling. From (D.39) and (D.28) we have

    0 = \varphi'(t^*) = H_p^{k_1}(t^*) - H_p^{k_2}(t^*) = F_p^{k_1}(t^*) - F_p^{k_2}(t^*) - g_p^{k_1}(\tilde\alpha_p^{k_1}(t^*)) + g_p^{k_2}(\tilde\alpha_p^{k_2}(t^*)).

If the value of F_p^{k_1}(t^*) - F_p^{k_2}(t^*) is of the order of 10^5, then for H_p^{k_1}(t^*) - H_p^{k_2}(t^*) = 0 to occur, the value of g_p^{k_1}(α̃_p^{k_1}(t^*)) and/or g_p^{k_2}(α̃_p^{k_2}(t^*)) must be of the order of 10^5. Looking at (D.15), this is possible only if Cδ_{y_p,k_1} - α̃_p^{k_1}(t^*) and/or Cδ_{y_p,k_2} - α̃_p^{k_2}(t^*) are extremely small. In such a case, a precise determination of t^* only pushes α̃_p^{k_1} (α̃_p^{k_2}) to a value very close to its upper bound Cδ_{y_p,k_1} (Cδ_{y_p,k_2}). Avoiding a precise determination of t^* may affect the precise setting of g_p^{k_1} or g_p^{k_2} while having little effect on the values of F_p^{k_1} and F_p^{k_2}. Consequently, avoiding a precise determination of t^* makes the values of H_p^{k_1} and H_p^{k_2} unreliable, and such class-indices of example x_p have to be treated specially when checking for optimality.

Let us define

    L_i^k = \begin{cases} 0 & \text{if } y_i = k \\ -C & \text{if } y_i \neq k \end{cases} \qquad U_i^k = \begin{cases} C & \text{if } y_i = k \\ 0 & \text{if } y_i \neq k \end{cases}        (D.47)

From (D.13) we have

    L_i^k < \alpha_i^k < U_i^k \qquad \forall\, i, k.        (D.48)

We can handle the ill-conditioned situation by defining a special class-index group for each example index. Define I_i^k = (L_i^k, U_i^k) and Ĩ_i^k = (L_i^k, U_i^k - µC), where µ is a small number, say 10^3 × machine precision. During the solution of (D.43) for the i-th sub problem, if we come across a situation in which, for a class-index, say k, we have α_i^k(t^*) ∈ I_i^k \ Ĩ_i^k, then we set t^* to an approximate end value and terminate the solution of (D.43). In this case H_i^k becomes unreliable and we put the class-index into a special "near boundary group" NBG_i, a class-index set for example index i defined as NBG_i = {k : α_i^k ∈ I_i^k \ Ĩ_i^k}. Correspondingly, for example-index i we have the complementary class-index set, the "normal group", NG_i = {k : α_i^k ∈ Ĩ_i^k}. Once a class-index gets into NBG_i, it is not involved in the further optimization of the i-th sub problem D_i. So, for the i-th sub problem D_i, its k_i^up, k_i^low, b_i^up and b_i^low should be computed using class-indices from NG_i. At the end of the solution of the whole dual problem, a check must be conducted on each example-index to make sure that moving any of its NBG indices to the NG group does not lead to an improvement in the objective function. Again, a two-loop approach is needed to solve the whole dual problem (D.20).

D.7 Conclusions

The formulation given in this appendix avoids using a "reference" class, as is done in the commonly used formulation of multiclass KLR. However, our preliminary experiments show that solving the optimization problem arising from this formulation with a decomposition algorithm turns out to be very slow and, as a result, we could not afford to conduct detailed numerical experiments to study the performance of this formulation. A more sophisticated working set selection strategy should be investigated. A parallel implementation of the decomposition algorithm may very likely improve the convergence speed.

[...] ...some gaps in the existing kernel methods for classification, with special interest in classification methods of support vector machines and kernel logistic regression (KLR). We look at a variety of problems related to kernel methods, from binary classification to multi-category classification. On the theoretical side, we develop new fast algorithms for existing methods and new methods for classification; on the...
...of Herbrich (2002).

1.4 Kernel Technique

The term kernels here refers to positive-definite kernels, namely reproducing kernels, which are functions K : X × X → R that, for all pattern sets {x_1, ..., x_r}, give rise to positive matrices (K)_{ij} := k(x_i, x_j) (Saitoh, 1998). In the support vector (SV) learning community, positive definite kernels are often referred to as Mercer kernels. Kernels can be regarded...

...and Lin, 2002b). Binary classification based multiclass methods are more suitable for practical use.

1.7 Motivation and Outline of the Thesis

The broad aim of this thesis is to fill some gaps in the existing kernel methods for classification and provide some clearer guidelines for practical users. We look at a variety of problems related to kernel methods and tackle 5 problems in this thesis. 1.7.1 Hyperparameter...

...Comparison of Multiclass Methods. There exist a range of multiclass kernel methods, and practical designers may need some clearer guidelines to choose one particular method for use. Thus, in Chapter 6 we conduct a specially designed numerical study to compare the commonly used multiclass kernel methods, besides the pairwise coupling methods with Platt's posteriori probabilities for binary SVMs. Based on...

...input space through a nonlinear mapping induced by a kernel function. In this section we briefly review two basic formulations of support vector machines and the optimization techniques for them. 1.5.1 Hard-Margin Formulation. The support vector machine hard-margin formulation is for perfect classification without training error. In feature space, the conditions for perfect classification are written as y_i(w · z_i...

...philosophy is referred to as a "kernel trick" in the literature and has been followed in the so-called kernel methods: by formulating or reformulating linear, dot-product based algorithms that are simple in feature space, one is able to generate powerful nonlinear algorithms, which use rich function classes in input space. The kernel trick had been used in the literature for quite some time (Aizerman et...

...commonly used Mercer's kernels:

    Linear Kernel:          k(x_i, x_j) = x_i \cdot x_j        (1.12)
    Polynomial Kernel:      k(x_i, x_j) = (x_i \cdot x_j + 1)^p        (1.13)
    Gaussian (RBF) Kernel:  k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)        (1.14)

For a given kernel, there are different ways of constructing the feature space; these different feature spaces can even differ in their dimensionality (Schölkopf and Smola, 2002). Reproducing Kernel Hilbert Space (RKHS)...

...examples than positive examples. 1.6.2 One-Versus-One Methods. One-versus-one (1v1) methods are another possible way of combining binary classifiers for multi-category classification. As the name indicates, one-versus-one methods construct a classifier for every possible pair of classes (Knerr et al., 1990; Friedman, 1996; Schmidt and Gish, 1996; Kreßl, 1999). For M classes, this results in M(M - 1)/2 binary...

...Recommendations are also made regarding how to design good codes for margin classifiers, such as SVMs. 1.6.5 Single Multi-Category Classification Methods. The above reviewed multi-category classification methods can be applied with any binary classification method, including support vector machines. Nevertheless, these methods are all based on binary classification methods, which either combine with, couple from, or decode...
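To accompany the Mercer kernels quoted in (1.12)-(1.14) above, here is a small sketch (not from the thesis) that builds the three Gram matrices on random data and checks the positive semi-definiteness property that defines a Mercer kernel; the parameter values p and σ and the data are placeholders of this illustration.

    # Hedged illustration of the linear, polynomial and Gaussian (RBF) kernels.
    import numpy as np

    def linear_kernel(X1, X2):
        return X1 @ X2.T                                             # (1.12)

    def polynomial_kernel(X1, X2, p=3):
        return (X1 @ X2.T + 1.0) ** p                                # (1.13)

    def gaussian_kernel(X1, X2, sigma=1.0):
        sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)        # ||x_i - x_j||^2
        return np.exp(-sq / (2.0 * sigma ** 2))                      # (1.14)

    X = np.random.default_rng(0).normal(size=(5, 2))
    for name, K in [("linear", linear_kernel(X, X)),
                    ("polynomial", polynomial_kernel(X, X)),
                    ("gaussian", gaussian_kernel(X, X))]:
        eigvals = np.linalg.eigvalsh(K)                              # Gram matrix is symmetric
        print(name, "positive semi-definite:", eigvals.min() >= -1e-10)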
...popular for solving classification problems. The success of SVMs has given rise to more kernel-based learning algorithms, such as Kernel Fisher Discriminant (KFD) (Mika et al., 1999a, 2000) and Kernel Principal Component Analysis (KPCA) (Schölkopf et al., 1998; Mika et al., 1999b; Schölkopf et al., 1999b). Successful applications of kernel-based algorithms have been reported in various fields, for instance...
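Since both the Summary and the Chapter 6 excerpt above recommend pairwise coupling with Platt's posteriori probabilities for SVMs, the following hedged sketch shows the core of Platt's step: fitting the sigmoid P(y = 1 | f) = 1/(1 + exp(Af + B)) to SVM decision values f by minimizing the negative log likelihood. The synthetic decision values, the Nelder-Mead optimizer and the omission of Platt's regularized targets are simplifications of this illustration, not the thesis's procedure.

    # Hedged sketch of fitting Platt's sigmoid to (held-out) SVM decision values.
    import numpy as np
    from scipy.optimize import minimize

    def fit_platt(f, y):
        # y in {0, 1}; minimize the negative log likelihood of the sigmoid model over (A, B).
        def nll(params):
            A, B = params
            z = A * f + B
            log1pexp = np.logaddexp(0.0, z)                  # log(1 + exp(z)), computed stably
            return np.sum(y * log1pexp + (1 - y) * (log1pexp - z))
        return minimize(nll, x0=np.array([-1.0, 0.0]), method="Nelder-Mead").x

    def platt_prob(f, A, B):
        return 1.0 / (1.0 + np.exp(A * f + B))

    rng = np.random.default_rng(0)
    f = np.concatenate([rng.normal(+1.5, 1.0, 100), rng.normal(-1.5, 1.0, 100)])  # fake SVM outputs
    y = np.concatenate([np.ones(100), np.zeros(100)])
    A, B = fit_platt(f, y)
    print("A, B =", A, B, "  P(y=1 | f=1) ~", platt_prob(1.0, A, B))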
