
Experimental analysis of new algorithms for learning ternary classifiers




The 2015 IEEE RIVF International Conference on Computing & Communication Technologies, Research, Innovation, and Vision for the Future (RIVF)

Jean-Daniel Zucker (IRD, UMI 209 UMMISCO, IRD France Nord, F-93143 Bondy, France; INSERM, UMR-S 872, les Cordeliers, Nutriomique (Eq 7), F-75006 Paris, France; Sorbonne Universités, Univ Paris 06, UMI 209 UMMISCO, F-75005 Paris, France; Equipe MSI, IFI, Vietnam National University, 144 Xuan Thuy, Hanoi, Vietnam; jean-daniel.zucker@ird.fr), Yann Chevaleyre (LIPN, CNRS UMR 7030, Université Paris Nord, 93430 Villetaneuse, France; chevaleyre@lipn.univ-paris13.fr), Dao Van Sang (IFI, Equipe MSI, IRD, UMI 209 UMMISCO, Vietnam National University, Hanoi, Vietnam; clairsang@gmail.com)

Abstract—Discrete linear classifiers are a very sparse class of decision models that have proved useful to reduce overfitting in very high-dimension learning problems. However, learning a discrete linear classifier is known to be a difficult problem: it requires finding a discrete linear model minimizing the classification error over a given sample. A ternary classifier is a classifier defined by a pair (w, r), where w is a vector in {-1, 0, +1}^n and r is a nonnegative real capturing the threshold or offset. The goal of the learning algorithm is to find a vector of weights in {-1, 0, +1}^n that minimizes the hinge loss of the linear model on the training data. This problem is NP-hard, and one approach consists in exactly solving the relaxed continuous problem and then heuristically deriving discrete solutions. A recent paper by the authors introduced a randomized rounding algorithm [1]; in this paper we propose more sophisticated algorithms that improve the generalization error. These algorithms are presented and their performances are experimentally analyzed. Our results show that this kind of compact model can address the complex problem of learning predictors from bioinformatics data, such as metagenomics data, where the number of samples is much smaller than the number of attributes. The new algorithms improve on the state-of-the-art algorithm for learning ternary classifiers; this improvement comes at the expense of time complexity.

Index Terms—Ternary Classifier, Randomized Rounding, Metagenomics data
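To make the decision model described in the abstract concrete, the short sketch below (in R, with illustrative weights and an illustrative example that are not taken from the paper) shows how a ternary classifier (w, r) scores an instance: features with weight +1 are added, features with weight -1 are subtracted, and features with weight 0 are ignored.

```r
# Minimal sketch of the decision rule of a ternary classifier (w, r).
# The weights, threshold and example below are illustrative only.
ternary_predict <- function(w, r, x) {
  score <- sum(w * x)             # adds features with weight +1, subtracts those with -1
  if (score - r >= 0) +1 else -1  # predicted label in {-1, +1}
}

w <- c(+1, 0, -1, +1)             # ternary weights: add, ignore, subtract, add
r <- 0.5                          # threshold / offset
x <- c(0.2, 3.1, 0.7, 1.0)        # one example
ternary_predict(w, r, x)          # 0.2 - 0.7 + 1.0 - 0.5 = 0 >= 0, so returns +1
```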
I. INTRODUCTION AND MOTIVATION

Learning classifiers on very high-dimensional data has received more and more attention over the past ten years, both theoretically [2], [3], [4], [5], [6], [7] and in practice through data mining applications [8], [9], [10], [11], [12], [13], [14], [15], [16], [17]. In Biology in particular there is a recent paradigmatic shift towards predictive tools [18]. In the meantime, Omics data (Genomics, Transcriptomics, Proteomics, Metabolomics, etc.) [19], [20], [21] have grown exponentially, and in the future medicine will rely more and more on such data to provide personalized medicine [20].

When the number of dimensions p is greater than the number of examples N (p >> N), the problem of overfitting becomes more and more acute. In Omics data in particular, the number of dimensions can reach a few million. This is the case, for example, in metagenomics of the gut microflora, where a recently published catalog counts about ten million entries [22], [23]. In such a setting, the ratio N/p is so small that both feature selection and sparse learning are required to diminish the risk of overfitting [24].

Fig. 1. A linear classifier vs. a ternary classifier. The latter corresponds to adding, subtracting or ignoring features to build a decision function.

The goal of this research is to find sparse models that support learning classifiers that scale to high-dimensional data such as metagenomics data. Following a recent paper by the authors [1], we explore models called ternary classifiers. They are extremely sparse models, simpler than linear combinations of real weights: such a model is represented as a weighted Boolean combination (with weights in {-1, 0, +1} instead of R) compared against a given threshold. We explore algorithms that minimize the hinge loss of such models induced from the training data and that improve on the generalization error of the first rounding algorithm proposed in [1].

II. TERNARY CLASSIFIER

Let us first introduce more formally the concept of a ternary classifier. An example is a pair (x, y), where x is an instance in R^n and y is a label in {-1, +1}. A training set is a collection (x_t, y_t), t = 1, ..., m, of examples. A ternary-weighted linear threshold concept c is a pair (w, r), where r is a real capturing the threshold or offset and w is a vector in {-1, 0, +1}^n.

A loss function is a map Θ : {-1, 0, +1}^n × R × R^n × {-1, +1} → R+ such that Θ(c; x, y) measures the discrepancy between the value predicted by c on x and the true label y. For a concept c and a training set S = (x_t, y_t), t = 1, ..., m, the cumulative loss of c with respect to S is given by L(c; S) = Σ_{t=1}^{m} Θ(c; x_t, y_t). Based on this performance measure, the combinatorial optimization problem investigated in this study is described as follows.

Loss Minimization over {-1, 0, +1}^n. Given 1) a target concept class C ⊆ {-1, 0, +1}^n, 2) a training set S = {(x_t, y_t)}_{t=1}^{m}, and 3) a loss function Θ, find a concept c ∈ C that minimizes L(c; S).

Recall that the zero-one loss is the loss function defined by Θ_01(c; x, y) = 0 if sgn(⟨w, x⟩ - r) = y and Θ_01(c; x, y) = 1 otherwise. Because this loss function is known to be hard to optimize, we shall concentrate on the hinge loss, a well-known surrogate of the zero-one loss, defined as follows: Θ_γ(c; x, y) = (1/γ) max[0, γ - y(⟨w, x⟩ - r)], where γ ∈ R+.

Finally, let us represent this optimization problem as a linear classification problem. In addition to the variables w_1, ..., w_n and b, we need m slack variables ξ_1, ..., ξ_m. The linear formulation is close to the standard linear SVM formulation, except that the w_i's are bounded:

  minimize   Σ_{i∈[m]} ξ_i
  subject to ∀i: -1 ≤ w_i ≤ 1,  y_i(w · x_i + b) ≥ 1 - ξ_i,  ξ_i ≥ 0.
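As a concrete illustration of these definitions, the sketch below (in R, the language the authors use for their implementation; the function names are ours, not those of the terDA package) computes the empirical zero-one error and the cumulative hinge loss L(c; S) of a concept c = (w, r) on a training set given as a matrix X and a label vector y.

```r
# Cumulative hinge loss L(c; S) of a ternary concept c = (w, r), as defined above.
# X is an m x n matrix of examples, y a vector of labels in {-1, +1}, gamma > 0.
hinge_loss <- function(w, r, X, y, gamma = 1) {
  margins <- y * (X %*% w - r)            # y_t * (<w, x_t> - r) for every example
  sum(pmax(0, gamma - margins)) / gamma   # (1/gamma) * sum_t max(0, gamma - margin_t)
}

# Empirical error rate under the zero-one loss.
zero_one_loss <- function(w, r, X, y) {
  mean(sign(X %*% w - r) != y)            # fraction of misclassified examples
}                                         # (sign(0) is treated as an error here)
```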
Randomized Rounding Method (RR method). A heuristic approach to the optimization problem described above consists in solving the relaxed problem, where the coefficients are real (in [-1, 1]), and then performing a rounding on randomly selected real coefficients. It is finally possible to compute the error rate of the obtained linear model. The basic idea of using the probabilistic method to convert an optimal solution of a relaxation of a problem into an approximately optimal solution of the original problem is standard in computer science and operations research [25].

Let us first recall the simple algorithm, called Randomized Rounding, used to round a real number belonging to [-1, 1] (see Fig. 2):

  INPUT: a real number α ∈ [-1, 1]
  OUTPUT: an integer in {-1, 0, 1}
  if α ≥ 0 then
    draw randomly u ∈ {0, 1} such that Pr(u = 1) = α
    return u
  else
    return -Round(-α)
  end if

Fig. 2. Randomized Rounding Method (RR algorithm).

By repeating the rounding algorithm (see Fig. 2) a number of times corresponding to the number of dimensions, a model with all its coefficients in the set {-1, 0, +1} is obtained. The algorithm that randomly rounds k weights is called "Round-k-weights" (see Fig. 3):

  INPUT: a vector w ∈ [-1, 1]^n, an integer k
  OPTIONAL INPUT: a permutation σ over {1, ..., n} (by default, σ is the identity)
  OUTPUT: a vector in {-1, 0, 1}^n
  for j = 1 to k
    w_σ(j) ← Round(w_σ(j))
  end for
  return w

Fig. 3. Randomized Rounding of k weights (Round-k-weights).
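A direct R transcription of the two routines in Fig. 2 and Fig. 3 could look as follows; this is a sketch of ours rather than the authors' terDA code.

```r
# Randomized Rounding of a single coefficient alpha in [-1, 1] (Fig. 2).
round_ternary <- function(alpha) {
  if (alpha >= 0) {
    rbinom(1, size = 1, prob = alpha)   # 1 with probability alpha, 0 otherwise
  } else {
    -round_ternary(-alpha)              # negative weights handled symmetrically
  }
}

# Round-k-weights (Fig. 3): round the first k coefficients in the order given by sigma.
round_k_weights <- function(w, k, sigma = seq_along(w)) {
  for (j in seq_len(k)) {
    w[sigma[j]] <- round_ternary(w[sigma[j]])
  }
  w
}
```

With k = n and sigma left at the identity, a single call turns a relaxed real vector into a candidate ternary vector, which is exactly how the algorithms of the next section use it.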
III. NEW ALGORITHMS TO LEARN TERNARY CLASSIFIERS

In this section we present three rounding algorithms, more elaborate than the initial one provided by [1] and built on Round-k-weights:

1) RR-all-weights: this algorithm explores different randomized roundings of all weights and returns the best one (see Fig. 4).
2) RRS-k-rand-weights: this algorithm selects k random weights and, after rounding them, calls the solver again on the problem where the k rounded weights are forced to their integer values (see Fig. 5).
3) RRS-k-best-weights: this algorithm selects the k "best" weights and, after rounding them, calls the solver again on the problem where the k rounded weights are forced to their integer values (see Fig. 6).

A. Rounding over all weights (RR-all-weights)

As described above, we solve the problem in the case of real coefficients in the interval [-1, 1] and thus obtain a vector of real weights w. Next, we run the Round-k-weights algorithm with k = n (i.e. on all weights) M times; this number M is a meta-parameter of all the algorithms presented, and we used M = 100 in our experiments. Each run of the Randomized Rounding algorithm on the vector w produces a vector of discrete coefficients in the set {-1, 0, +1}. We then compute the error rate of each model and finally select, out of the M vectors, the vector of coefficients w which gives the best error rate (see Fig. 4):

  INPUT: a real vector of coefficients w ∈ [-1, +1]^n, a number of iterations M
  OUTPUT: an integer vector w ∈ {-1, 0, +1}^n minimizing the hinge loss
  List ← {}
  for t ← 1 to M
    w̄ ← Round-k-weights(w, k = n)
    add w̄ to List
  end for
  compute the hinge loss of all vectors in List
  return the best vector w

Fig. 4. Rounding over all weights (RR-all-weights).
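Reusing the helpers sketched above, RR-all-weights reduces to a short loop. In this sketch the relaxed vector w_relaxed and the offset r are assumed to come from the LP/SVM-style solve of Section II, and the function name is ours, not terDA's.

```r
# Sketch of RR-all-weights (Fig. 4): round all n weights M times and keep the
# rounding with the smallest hinge loss on the training data (X, y).
rr_all_weights <- function(w_relaxed, r, X, y, M = 100, gamma = 1) {
  best_w <- NULL
  best_loss <- Inf
  for (t in seq_len(M)) {
    w_bar <- round_k_weights(w_relaxed, k = length(w_relaxed))  # round every weight
    loss <- hinge_loss(w_bar, r, X, y, gamma)
    if (loss < best_loss) {
      best_loss <- loss
      best_w <- w_bar
    }
  }
  best_w
}
```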
B. Combine Rounding and Solving over k random weights (RRS-k-rand-weights)

In this algorithm, the linear model solver is called many times instead of only once as in the previous algorithm. Let us first consider the real coefficient vector w obtained from the linear model. The real coefficients that happen to be already integer (-1, 0 or 1) are stored in the object wInt and are not changed in any subsequent step of the algorithm; this set of integer weights wInt may be empty, depending on the problem. The coefficients that are not yet integer are stored in a set wNInt, which is processed in the following way. Given a permutation σ over the set wNInt and a parameter k, we run a loop with i ranging over {σ(j), j ∈ wNInt}. In this loop, for each i taken from 1 to |wNInt|, the value wNInt[σ(i)] of w is rounded by Randomized Rounding (RR), and the index of the element wNInt[σ(i)] together with the value RR(wNInt[σ(i)]) is added to the object wInt. Then, whenever the index i of the loop is a multiple of k, the solver is called to find real solutions to the initial linear model with the added constraints that all coefficients indexed in wInt are assigned their integer values. At the end of the loop, all the weights in w have been rounded by RR and we can compute the hinge loss. This sequence is then repeated with M different permutations σ. The output of the algorithm is the vector of coefficients that best minimizes the hinge loss (see Fig. 5):

  INPUT: a real vector of coefficients w ∈ [-1, +1]^n, an integer k, a number of iterations M
  OUTPUT: an integer vector w ∈ {-1, 0, 1}^n minimizing the hinge loss
  call Solver to initialize the vector w
  S ← {1, ..., n}
  for t ← 1 to M
    while S ≠ ∅
      let T be a set of k integers drawn from S without replacement
      S ← S - T
      for j ∈ T
        w_j ← Round(w_j)
      end for
      call Solver again to recompute the real coefficients of the remaining attributes (those that have not yet been forced)
    end while
  end for
  return w, the best among the M vectors

Fig. 5. Repeat Round and Solve over k random weights (RRS-k-rand-weights).
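The skeleton below sketches RRS-k-rand-weights. Here solve_relaxed is a hypothetical placeholder for the relaxed linear-program solve (the authors use lpSolve): it is assumed to return a real weight vector in [-1, 1]^n in which the entries already forced (non-NA in `fixed`) are clamped to their integer values. The offset is treated as a fixed parameter r for brevity.

```r
# Sketch of RRS-k-rand-weights (Fig. 5): repeatedly round k randomly chosen weights,
# then re-solve the relaxed problem with the rounded weights forced to their values.
rrs_k_rand_weights <- function(X, y, k, solve_relaxed, M = 100, r = 0, gamma = 1) {
  n <- ncol(X)
  best_w <- NULL
  best_loss <- Inf
  for (t in seq_len(M)) {
    fixed <- rep(NA_real_, n)                  # NA marks a weight not yet forced
    w <- solve_relaxed(X, y, fixed)            # initial relaxed solution
    remaining <- seq_len(n)
    while (length(remaining) > 0) {
      picked <- remaining[sample.int(length(remaining), min(k, length(remaining)))]
      remaining <- setdiff(remaining, picked)
      fixed[picked] <- vapply(w[picked], round_ternary, numeric(1))
      w <- solve_relaxed(X, y, fixed)          # re-solve with forced weights clamped
    }
    loss <- hinge_loss(fixed, r, X, y, gamma)
    if (loss < best_loss) { best_loss <- loss; best_w <- fixed }
  }
  best_w
}
```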
C. Repeat Round and Solve over Best k Weights (RRS-k-best-weights)

In this third algorithm, instead of choosing sets of k random weights to round before solving the problem again, we search at each step for the k best coefficients on which to apply the rounding. Here, a "best" coefficient is one whose minimum distance to -1, 0 or 1 is the smallest possible. The intuition behind this algorithm is that the weights closest to integer values ought to be chosen first, as they have a priori (this is just a heuristic) a greater chance of belonging to the optimal solution (see Fig. 6):

  INPUT: a real vector of coefficients w ∈ [-1, +1]^n, an integer k, a number of iterations M
  OUTPUT: an integer vector w ∈ {-1, 0, +1}^n minimizing the hinge loss
  call Solver to obtain the initial vector w
  find in w all the coefficients that are not yet integer and save them in the set wNInt
  for t ← 1 to M
    take a random permutation σ on wNInt
    for i ← σ(1) to σ(|wNInt|)
      find the k "best" attributes for rounding
      force rounding on these k best attributes
      call Solver again to recalculate the real coefficients of the remaining attributes (not yet forced)
    end for
  end for
  return w, the best vector among the M

Fig. 6. Repeat Round and Solve over Best k Weights (RRS-k-best-weights).
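The only ingredient that changes in RRS-k-best-weights is the choice of the coordinates to round; the rest of the loop follows the same shape as the RRS-k-rand-weights skeleton above. A sketch of that selection step (ours, with illustrative numbers) is shown below.

```r
# Select the k "best" coefficients: those whose distance to the nearest value
# in {-1, 0, +1} is smallest, i.e. the ones already closest to an integer.
best_k_indices <- function(w, k, candidates = seq_along(w)) {
  dist_to_int <- vapply(w[candidates],
                        function(a) min(abs(a - c(-1, 0, 1))),
                        numeric(1))
  candidates[order(dist_to_int)][seq_len(min(k, length(candidates)))]
}

w <- c(0.95, 0.40, -0.05, -0.60)   # illustrative relaxed weights
best_k_indices(w, k = 2)           # returns indices 1 and 3: closest to +1 and to 0
```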
IV. EXPERIMENTS

We tested the empirical performance of our new algorithms against RR-all-weights by conducting experiments on several metagenomics datasets used as benchmarks. All algorithms were written in R, and the package used to solve linear problems is lpSolve [26]. The performance of lpSolve cannot compete with the CPLEX solver from IBM, but we chose lpSolve because it is open source and integrates smoothly with R. All the algorithms in this paper have been integrated into an R package called terDA, which will be made available to the community in the near future. All tests were executed on an Intel Core i5 PC. The goal of these experiments is to evaluate whether the different ideas for improving the RR method actually improve the generalization error when learning ternary classifiers, and at what cost in terms of CPU time.

A. Metagenomics Datasets

In this paper we used public metagenomics data from the European project METAHIT. These data correspond to next-generation sequencing of the gut microbiota. The microbiota is the community of microorganisms (viruses, bacteria, fungi, archaea) living in our gut. Two problems are considered: the first is to distinguish the status of patients, obese vs. lean, based on their microbiota; the second is to identify which microbiota signatures support predicting whether an individual will have a high vs. low gene count. In both cases there is an epidemiological interest in being able to devise microbiota signatures that predict patient phenotypes. The data are described in Table I. In both cases we consider 172 patients and thousands of dimensions corresponding to counts of particular metaspecies of the microbiota that have been identified. The complete data include 3.3 million counts per patient, but we have not used such raw data.

TABLE I. Metagenomics datasets used, respectively with N/p greater and lower than 1. The numbers in parentheses give the number of examples for each class.

  Datasets     | Examples (N)          | Features (p) | # Examples | N/p
  DatasetMeta1 | HIGH (113) / LOW (41) | 30           | 154        | 5.13
  DatasetMeta2 | OBE (99) / LEAN (55)  | 30           | 154        | 5.13
  DatasetMeta3 | HIGH (225) / LOW (67) | 1104         | 292        | 0.26
  DatasetMeta4 | OBE (96) / LEAN (196) | 1104         | 292        | 0.26

B. Results

We performed our tests following 1-fold and 10-fold cross-validation procedures repeated ten times. Here 1-fold corresponds to the case where the data are used both to learn and to test the model; the measured error is thus the empirical error. Table II shows all the results in terms of empirical error obtained for the three algorithms. In the case of learning ternary classifiers it is interesting to look at the empirical error because the model is so sparse that it is unlikely to reach a 0% empirical error. The generalization errors of all algorithms on the four datasets (10-fold cross-validation) are given in Table III. To further compare the algorithms with the RR method, we also explored the impact on the error rate, when learning from DatasetMeta1, of varying two parameters: the number of roundings and the parameter k. The results are shown in Figure 7 and Figure 8, respectively. Figure 9 shows the evolution, as a function of k, of the generalization error rate estimated using 10-fold cross-validation.

Fig. 7. Empirical error rate vs. number of roundings on DatasetMeta1; the number of roundings takes values from 1 to M = 100, with k fixed for RRS-k-rand-weights and for RRS-k-best-weights.

Fig. 8. Empirical error rate vs. parameter k.

Fig. 9. Generalization error rate vs. parameter k in 10-fold cross-validation.

TABLE II. Empirical error of each algorithm for each dataset, averaged over ten executions.

               | RR-all-weights     | RRS-k-rand-weights | RRS-k-best-weights
  Datasets     | Error (%) | Time   | Error (%) | Time   | Error (%) | Time
  DatasetMeta1 | 14.2857   | 11.04  | 12.9870   | 31.12  | 12.9870   | 90.60
  DatasetMeta2 | 18.8312   | 10.63  | 17.5325   | 39.22  | 16.8836   | 109.6
  DatasetMeta3 | 5.8219    | 500    | 7.7867    | 3141   | 8.9041    | 2734
  DatasetMeta4 | 18.1132   | 370    | 21.1312   | 1797   | 33.9623   | 1419

TABLE III. Error in generalization (10-fold) of each algorithm for each dataset, averaged over ten executions.

  Datasets     | RR-all-weights   | RRS-k-rand-weights | RRS-k-best-weights
  DatasetMeta1 | 23.4600 ± 0.1101 | 17.5000 ± 0.0969   | 17.8232 ± 0.0519
  DatasetMeta2 | 28.5417 ± 0.0665 | 25.3750 ± 0.1196   | 27.0167 ± 0.0598
  DatasetMeta3 | 18.1149 ± 0.0732 | 16.9500 ± 0.0652   | 16.1251 ± 0.0130
  DatasetMeta4 | 41.8376 ± 0.0712 | 39.2040 ± 0.1038   | 38.1529 ± 0.0782

V. DISCUSSION AND CONCLUSIONS

The results of Table II show that RRS-k-rand-weights and RRS-k-best-weights decrease the empirical error significantly; this could also mean that RR-all-weights overfits less, which would explain the result. Table III shows that on the four datasets RRS-k-rand-weights and RRS-k-best-weights improve the generalization error as well. The improvement is very significant on DatasetMeta1, DatasetMeta2 and DatasetMeta4, and less significant on DatasetMeta3. It is also clear that RRS-k-best-weights outperforms RRS-k-rand-weights when N/p is smaller than 1, which is the case for DatasetMeta3 and DatasetMeta4. These improvements come at the expense of CPU time, since the solver is called several times; the time complexity is thus an order of magnitude higher than that of the RR-all-weights algorithm. The figures showing the influence of the number of roundings are not surprising: the empirical error rate stabilizes as M, the number of iterations, increases. The parameter k also has an impact on the empirical error rate: as it grows, RRS-k-rand-weights becomes worse whereas RRS-k-best-weights is less affected.

Overall, these experiments suggest that there is room for improving the original RR-all-weights algorithm that supports learning ternary classifiers. This paper is experimental, and its main results are, first, to propose two original algorithms to learn ternary classifiers and, second, to suggest that RRS-k-best-weights is the most promising improvement so far made to the original RR-all-weights algorithm, as it gives better empirical and generalization errors and is robust with respect to various values of k and M. Future work includes a theoretical analysis of the algorithms and a thorough analysis of RRS-k-best-weights on other benchmark data.

ACKNOWLEDGMENTS

The authors would like to thank Dr. Edi Prifti (INRA, France) for his help in preparing the benchmark datasets, as well as the anonymous reviewers. The first author is partially supported by the METACARDIS project, funded by the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement HEALTH-F-2012-305312.

REFERENCES

[1] Y. Chevaleyre, F. Koriche, and J.-D. Zucker, "Rounding methods for discrete linear classification," JMLR W&CP, vol. 28, no. 1, pp. 651-659, 2013.
[2] J. Fan and R. Samworth, "Ultrahigh dimensional feature selection: beyond the linear model," Journal of Machine Learning Research, 2009.
[3] T. Hastie and R. Tibshirani, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2009.
[4] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning, with Applications in R. Springer Science & Business Media, Jun. 2013.
[5] J. Kogan, "Introduction to Clustering Large and High-Dimensional Data," pp. 1-222, Dec. 2007.
[6] A. Kalousis, J. Prados, and M. Hilario, "Stability of feature selection algorithms: a study on high-dimensional spaces," Knowledge and Information Systems, vol. 12, no. 1, pp. 95-116, 2007.
[7] L. Parsons, E. Haque, and H. Liu, "Subspace clustering for high dimensional data: a review," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 90-105, 2004.
[8] J. Oh and J. Gao, "A kernel-based approach for detecting outliers of high-dimensional biological data," BMC Bioinformatics, vol. 10, no. Suppl 4, p. S7, 2009.
[9] B. Hanczar, J. Hua, and E. R. Dougherty, "Decorrelation of the true and estimated classifier errors in high-dimensional settings," EURASIP Journal on Bioinformatics and Systems Biology, vol. 2007, pp. 1-13, 2007.
[10] S. Rojas-Galeano, E. Hsieh, D. Agranoff, S. Krishna, and D. Fernandez-Reyes, "Estimation of relevant variables on high-dimensional biological patterns using iterated weighted kernel functions," PLoS ONE, vol. 3, no. 3, p. e1806, Mar. 2008.
[11] J. Quackenbush, "Extracting biology from high-dimensional biological data," Journal of Experimental Biology, vol. 210, no. 9, pp. 1507-1517, May 2007.
[12] L. Yu and H. Liu, "Feature selection for high-dimensional data: a fast correlation-based filter solution," in Machine Learning, International Workshop then Conference, vol. 20, no. 2, p. 856, 2003.
[13] S. Lee, B. Schowe, and V. Sivakumar, "Feature selection for high-dimensional data with RapidMiner," 2012.
[14] N. Bouguila and D. Ziou, "High-dimensional unsupervised selection and estimation of a finite generalized Dirichlet mixture model based on minimum message length," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 10, pp. 1716-1731.
[15] F. Petitjean, G. I. Webb, and A. E. Nicholson, "Scaling log-linear analysis to high-dimensional data," in 2013 IEEE International Conference on Data Mining (ICDM), IEEE, pp. 597-606.
[16] A.-C. Haury, P. Gestraud, and J.-P. Vert, "The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures," PLoS ONE, vol. 6, no. 12, p. e28210, Dec. 2011.
[17] R. M. Simon, J. Subramanian, M. C. Li, and S. Menezes, "Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data," Briefings in Bioinformatics, vol. 12, no. 3, pp. 203-214, May 2011.
[18] L. Kelley and M. Scott, "The evolution of biology. A shift towards the engineering of prediction-generating tools and away from traditional research practice," EMBO Reports, vol. 9, no. 12, pp. 1163-1167, 2008.
[19] J. Wooley, A. Godzik, and I. Friedberg, "A primer on metagenomics," PLoS Computational Biology, vol. 6, no. 2, p. e1000667, 2010.
[20] H. W. Virgin and J. A. Todd, "Metagenomics and personalized medicine," Cell, vol. 147, no. 1, pp. 44-56, Sep. 2011.
[21] K. E. Nelson, Metagenomics of the Human Body. Springer Science & Business Media, Nov. 2010.
[22] J. Li, H. Jia, X. Cai, H. Zhong, et al., P. Bork, and J. Wang, "An integrated catalog of reference genes in the human gut microbiome," Nature Biotechnology, vol. 32, no. 8, pp. 834-841, 2014.
[23] J. Qin, R. Li, J. Raes, M. Arumugam, K. Burgdorf, C. Manichanh, T. Nielsen, N. Pons, F. Levenez, and T. Yamada, "A human gut microbial gene catalogue established by metagenomic sequencing," Nature, vol. 464, no. 7285, pp. 59-65, 2010.
[24] A. L. Tarca, V. J. Carey, X.-w. Chen, R. Romero, and S. Drăghici, "Machine learning and its applications to biology," PLoS Computational Biology, vol. 3, no. 6, p. e116, 2007.
[25] P. Raghavan and C. D. Thompson, "Randomized rounding: a technique for provably good algorithms and algorithmic proofs," Combinatorica, vol. 7, no. 4, pp. 365-374, 1987.
[26] M. Berkelaar, K. Eikland, P. Notebaert, et al., "lpSolve: open source (mixed-integer) linear programming system," Eindhoven University of Technology, 2004.


