Procedia - Social and Behavioral Sciences 147 (2014) 307–312

ICININFO

Combining Probabilistic Classifiers for Text Classification

Kostas Fragos, Petros Belsis, Christos Skourlas*

Department of Informatics, TEI of Athens, Ag. Spyridonos, 12210 Athens, Greece

Abstract

Probabilistic classifiers are among the most popular classifiers in the machine learning community and are used in many applications. Although popular probabilistic classifiers exhibit very good performance when used individually in a specific classification task, very little work has been done on assessing the performance of two or more classifiers used in combination in the same classification task. In this work, we classify documents using two probabilistic approaches: the Naive Bayes classifier and the Maximum Entropy classification model. We then combine the results of the two classifiers, using two merging operators, Max and Harmonic Mean, to improve the classification performance. The proposed method was evaluated using the "ModApte" split of the Reuters-21578 dataset, and the evaluation results show a measurable improvement in classification accuracy.

© 2014 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/). Selection and peer-review under responsibility of the 3rd International Conference on Integrated Information.

* Corresponding author. Tel.: +30-2105910974; fax: +30-2105910975. E-mail address: cskourlas@teiath.gr

1. Introduction

Text classification can be seen as the task of applying a learning model to extract the categories of the documents in a collection. The model is then applied to each new document, which is eventually assigned to some (one or more) categories. Text classification is important for many applications, e.g., spam filtering, e-mail routing, web directory maintenance, and news filtering. Over the years, efficient training and application, performance tuning, and the building of understandable classifiers have been common topics in text classification research. Statistical classification and machine learning techniques have been applied to text categorization, including multivariate regression models, nearest neighbor classifiers, probabilistic Bayesian models, decision trees, and neural networks (Dumais et al., 1998). The use of Support Vector Machines (SVMs) for text classification has also been explored (Dumais et al., 1998; Galathiya et al., 2012). Techniques for text classification fall into two main approaches: first, discriminative methods such as Logistic Regression (LR) and Support Vector Machines (SVMs); second, probabilistic methods related to the aspect model (Hofmann, 1999), the maximum entropy model (Fragos et al., 2005), latent Dirichlet allocation (Blei et al., 2002), and Bayesian classification (Hamad, 2007; Grossman and Domingos, 2005).
Although popular classifiers exhibit very good performance when used individually in a specific classification task, very little work has been done on assessing the performance of two or more classifiers used in combination in the same classification task. In this work, we classify documents using two probabilistic approaches based on the Naive Bayes classifier and the Maximum Entropy classification model, respectively. To improve classification performance, we propose two merging operators, Max and Harmonic Mean, to combine the results of the two classifiers.

2. Two probabilistic approaches for document classification

2.1 Naïve Bayes classifier

A text classifier can be defined as a function that maps a document d consisting of the words (features) x1, x2, x3, ..., xn, d = (x1, x2, x3, ..., xn), to a confidence that the document d belongs to a text category. If the features x1, ..., xn are conditionally independent given the category variable c, the Naïve Bayes classifier (Al-Aidaroos et al., 2010) is often used to estimate the probability of each category. The Bayes theorem can be used to estimate the probabilities:

Pr(c | d) = Pr(d | c) Pr(c) / Pr(d)   (1)

Fragos et al. (2005) used training data to estimate the model parameters in order to find the best class, argmax_c Pr(c) Pr(d | c), for the documents of the test set. This technique was based on the technique proposed by McCallum and Nigam (1998).

2.2 Maximum Entropy Classification

Entropy was introduced by Shannon in communication theory. The entropy H measures the average uncertainty of a single random variable X:

H(p) = H(X) = − Σ_{x ∈ X} p(x) log p(x)   (2)

where p(x) is the probability mass function of the random variable X. In a different context, entropy has also been used in natural language processing tasks. Della Pietra et al. (1997) showed that there is always a unique distribution with maximum entropy and that this distribution has an exponential form. Fragos et al. (2005) used the improved iterative scaling (IIS) algorithm, a hill-climbing algorithm for estimating the parameters of the maximum entropy model, specially adjusted for text classification.
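To make the two models concrete, the sketch below shows a minimal Naive Bayes decision rule and the exponential form of the maximum entropy model evaluated with already-estimated feature weights. It is an illustration written for this paper's setting, not the implementation used in the experiments: the function names, the add-one smoothing, and the binary word/category features are assumptions made for the sketch and do not reproduce the exact feature design of Fragos et al. (2005).

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Estimate class priors Pr(c) and per-class term counts from
    tokenized training documents (docs is a list of token lists)."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, c in zip(docs, labels):
        term_counts[c].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    return priors, term_counts, vocab


def naive_bayes_classify(tokens, priors, term_counts, vocab):
    """Return argmax_c [log Pr(c) + sum_i log Pr(x_i | c)].
    Add-one smoothing is an illustrative choice, not necessarily the
    smoothing used in the reported experiments."""
    best, best_score = None, float("-inf")
    for c, prior in priors.items():
        total = sum(term_counts[c].values())
        score = math.log(prior)
        for w in tokens:
            score += math.log((term_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best


def maxent_probability(tokens, category, weights, categories):
    """Exponential form of the maximum entropy model,
    Pr(c | d) = exp(sum_i w_i f_i(d, c)) / Z(d), with binary word/category
    features; the weights are assumed to have been estimated beforehand
    (e.g. with IIS)."""
    def unnormalized(c):
        return math.exp(sum(weights.get((w, c), 0.0) for w in set(tokens)))
    z = sum(unnormalized(c) for c in categories)
    return unnormalized(category) / z
```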
In Section 3, we explain how the chi-square goodness-of-fit statistical test can be used as an alternative relatedness measure for text classification purposes. Section 4 describes how two merging operators applied to the classification results can be used to improve classification performance. In Section 5, we present the data used in the experiments and discuss the evaluation results. Finally, in Section 6 our conclusions are given, followed by some directions for future work.

3. X Square Test for Feature Selection

The chi-square test has been used in the past for feature selection in the text classification field. Yang and Pedersen (1997) compared five measures for term selection and found that chi-square and information gain gave the best performance. Fragos et al. (2005) proposed a new method to apply Maximum Entropy modeling to text classification, using weights for the selection of the features of the model and for the evaluation of the importance of each feature in the classification task. Instead of using Maximum Entropy modeling in the classical way, they used X² values to weight the features of the model and their importance. Their method was evaluated on the Reuters-21578 dataset for text classification tasks.

Example. Given the distinct categories c1 = 'Acq' and c2 ≠ 'Acq' from the Reuters-21578 'ModApte' split training dataset, we want to decide whether the word 'usa' is a good feature for classification in the category 'Acq'. All the stopwords are removed, and we then calculate the frequency of the word 'usa' in the category c1 = 'Acq', equal to 1,238, and in the other categories (c2 ≠ 'Acq'), equal to 4,464, so the total frequency of 'usa' is 5,702. In the class 'Acq' there are 125,907 terms (words) and in the other classes there are 664,241, for a total of 790,148 terms (words); the count of all terms other than 'usa' is therefore 784,446. The null hypothesis is that the word 'usa' and the class label 'Acq' occur independently. We can compute the expected frequencies:

w = 'usa' and c1 = 'Acq': E11 = (5,702 × 125,907) / 790,148 = 908.59
w = 'usa' and c1 ≠ 'Acq': E12 = (5,702 × 664,241) / 790,148 = 4,793.4
w ≠ 'usa' and c1 = 'Acq': E21 = (784,446 × 125,907) / 790,148 = 124,998.4
w ≠ 'usa' and c1 ≠ 'Acq': E22 = (784,446 × 664,241) / 790,148 = 659,447.6

Then we calculate the X² value:

X² = (1,238 − 908.59)² / 908.59 + (4,464 − 4,793.4)² / 4,793.4 + (124,669 − 124,998.4)² / 124,998.4 + (659,777 − 659,447.6)² / 659,447.6 = 143.096

Looking up the X² distribution for significance level a = 0.05 and one degree of freedom, if the calculated value is greater than the critical value (3.84) we can reject the null hypothesis. So, a large calculated X² value gives strong evidence for the pair ('usa', 'Acq'), and the word 'usa' is a good feature for classification in the category 'Acq'.
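The arithmetic of this example can be reproduced with a short sketch. The function below is our illustration (its name and interface are not from the original paper): it builds the 2 × 2 contingency table for a (word, category) pair from the four observed counts and returns the X² statistic.

```python
def chi_square(word_in_class, word_in_rest, class_total, rest_total):
    """X^2 statistic for a 2x2 word/category contingency table.

    word_in_class: occurrences of the word in the target category
    word_in_rest:  occurrences of the word in all other categories
    class_total:   total number of terms in the target category
    rest_total:    total number of terms in the other categories
    """
    grand_total = class_total + rest_total
    word_total = word_in_class + word_in_rest
    other_total = grand_total - word_total
    observed = [word_in_class, word_in_rest,
                class_total - word_in_class, rest_total - word_in_rest]
    expected = [word_total * class_total / grand_total,
                word_total * rest_total / grand_total,
                other_total * class_total / grand_total,
                other_total * rest_total / grand_total]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# The counts of the ('usa', 'Acq') example: the result is about 143.1,
# well above the 0.05 critical value of 3.84 for one degree of freedom.
print(chi_square(1238, 4464, 125907, 664241))
```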
4. Merging Operators for the Naïve Bayes and Maximum Entropy classifiers

We use two operators to combine the results of the Naïve Bayes Classifier (NBC) and the Maximum Entropy Classifier (MEC), in order to compensate for the errors of each classifier and to improve the classification performance:

MaxC(d) = Max {NBC(d), MEC(d)}   (3)

HarmonicC(d) = 2.0 × NBC(d) × MEC(d) / (NBC(d) + MEC(d))   (4)

Equation 3 shows that the MaxC(d) operator chooses the maximum value among the results of the Naïve Bayes (NBC(d)) and Maximum Entropy (MEC(d)) classifiers for an input document d. In Equation 4, the HarmonicC(d) operator estimates the harmonic mean of the results of the two classifiers. Jongwoo, Daniel, and George (Jongwoo et al., 2010) used these merging operators to classify sentences containing Databank Accession Numbers, a key piece of bibliographic information, from online biomedical articles.

5. Evaluation

The proposed classification technique was evaluated using the "ModApte" split of the Reuters-21578 dataset. The corpus includes 9,603 training documents and 3,299 test documents. Ten categories out of 135 potential categories were chosen (see Table 1). If a document belongs to the specific category, it is placed in the "Yes" group; otherwise it is in the "No" group. The 10 categories, with the number of documents for the training and the test phase, are shown in Table 1.

Table 1. 10 categories from the "ModApte" split of the Reuters-21578 dataset with the number of documents for the Training phase and the Test phase.

Category    Training Set (YES)   Training Set (NO)   Test Set (YES)   Test Set (NO)
Acq         1615                 7988                719              2580
Corn        175                  9428                56               3243
Crude       383                  9220                189              3110
Earn        2817                 6786                1087             2212
Grain       422                  9181                149              3150
Interest    343                  9260                131              3168
Money-fx    518                  9085                179              3120
Ship        187                  9416                89               3210
Trade       356                  9247                117              3182
Wheat       206                  9397                71               3228

In the training phase and in the test phase all the documents were parsed and a list of stopwords was used. Eventually, a list of 32,412 discrete words (terms), out of a total of 790,148 words, was defined. Then, the X square test was applied on the corpus and the 2,000 highest ranked words were selected for each category to be used in the maximum entropy model. Table 2 presents the 10 top ranked word terms calculated by the X square test for three categories.

Table 2. 10 top ranked words calculated by the X square test for three categories of the ModApte Reuters-21578 training dataset.

Corn         Crude         Earn
values       crude         earn
july         comment       usa
egypt        spoke         convertible
agreed       stabilizing   moody
shipment     cancel        produce
belgium      shipowners    former
oilseeds     foresee       borrowings
finding      sites         caesars
february     techniques    widespread
permitted    stayed        honduras

The features of the maximum entropy model were instantiated using the 2,000 highest ranked words (terms) for each category. To evaluate the classification performance of the classifiers we used the following measures: micro-Recall (μRe), micro-Precision (μPr) and the micro-averaged F1 measure (micro-F1). Let a_c denote the number of documents correctly classified in class c by the system, let b_c denote the overall number of documents classified in class c, and let d_c denote the overall number of documents that belong to class c. We define μPr and μRe as

μPr = (Σ_c a_c) / (Σ_c b_c)   and   μRe = (Σ_c a_c) / (Σ_c d_c)

where the sums are over all the classes. The micro-F1 measure is then computed as the harmonic mean of μPr and μRe:

micro-F1 = 2 × μPr × μRe / (μPr + μRe)
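As a complement to these definitions, the sketch below implements the two merging operators of Section 4 (Equations 3 and 4) and the micro-averaged measures. The function names and the (a_c, b_c, d_c) tuples mirror the notation in the text but are otherwise illustrative assumptions; this is not the code used to produce the reported results.

```python
def max_c(nbc_score, mec_score):
    """MaxC operator (Equation 3): the larger of the two classifier
    outputs for a document d."""
    return max(nbc_score, mec_score)


def harmonic_c(nbc_score, mec_score):
    """HarmonicC operator (Equation 4): harmonic mean of the two outputs."""
    if nbc_score + mec_score == 0:
        return 0.0
    return 2.0 * nbc_score * mec_score / (nbc_score + mec_score)


def micro_averaged_scores(per_class_counts):
    """Micro-averaged precision, recall and F1.

    per_class_counts: iterable of (a_c, b_c, d_c) tuples, where a_c is the
    number of documents correctly assigned to class c, b_c the number of
    documents assigned to class c, and d_c the number of documents that
    actually belong to class c, as defined in the text.
    """
    a = sum(ac for ac, _, _ in per_class_counts)
    b = sum(bc for _, bc, _ in per_class_counts)
    d = sum(dc for _, _, dc in per_class_counts)
    micro_pr = a / b
    micro_re = a / d
    return micro_pr, micro_re, 2 * micro_pr * micro_re / (micro_pr + micro_re)


# Example of merging the two classifiers' confidence for one document
# (the scores 0.62 and 0.81 are made-up values for illustration):
print(max_c(0.62, 0.81), harmonic_c(0.62, 0.81))
```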
Table 3 shows the micro-averaged F1 performance for the Naive Bayes and Maximum Entropy classifiers and for our Max and Harmonic merging operators.

Table 3. Micro-averaged F1 measure performance for the Naive Bayes and Maximum Entropy classifiers and the MaxC and HarmonicC operators.

Algorithm          Performance
Naive Bayes        0.81
Maximum Entropy    0.88
MaxC               0.90
HarmonicC          0.91

It appears that the Maximum Entropy classifier performs better than Naïve Bayes, exhibiting a micro-averaged F1 performance of 0.88. Both the MaxC and HarmonicC operators increase the micro-averaged F1 performance over that obtained with the Naïve Bayes and Maximum Entropy classifiers individually.

6. Conclusion

In this paper we describe a technique for combining two classifiers, based on Naïve Bayes and Maximum Entropy respectively, to classify documents of the "ModApte" split of the Reuters-21578 dataset. We use a chi-square feature selection strategy to select the most representative words (features), as proposed by Fragos et al. (2005). The Maximum Entropy model appears to perform better than the Naive Bayes classifier. Two merging operators are used to combine the results of the Naïve Bayes and Maximum Entropy classifiers to improve performance, especially for the recall rate. The merging operators improve the performance, as seen in the results for the micro-averaged F1 measure (0.90 and 0.91 for the MaxC and HarmonicC operators, respectively). As future work, we intend to investigate additional methods of collecting sets of words (features) and different merging operators to further improve the performance.

Acknowledgements

This research has been co-funded by the European Union (Social Fund) and Greek national resources under the framework of the "Archimedes III: Funding of Research Groups in TEI of Athens" project of the "Education & Lifelong Learning" Operational Programme.

References

Al-Aidaroos, K. M., Bakar, A. A., and Othman, Z., 2010. Naive Bayes variants in classification learning. In Proceedings of the International Conference on Information Retrieval and Knowledge Management, March 17-18, 2010, Shah Alam, Selangor, pp. 276-281.

Blei, D., Ng, A., and Jordan, M., 2002. Latent Dirichlet allocation. In Proceedings of NIPS 14.

Della Pietra, S., Della Pietra, V., and Lafferty, J., 1997. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4).

Dumais, S. T., Platt, J., Heckerman, D., and Sahami, M., 1998. Inductive learning algorithms and representations for text categorization. In Proceedings of the Seventh International Conference on Information and Knowledge Management, pp. 148-155. ACM Press.

Fragos, K., Maistros, I., and Skourlas, C., 2005. A X2-weighted Maximum Entropy model for text classification. In Proceedings of the 2nd International Conference on Natural Language Understanding and Cognitive Science, Miami, Florida, pp. 22-23.

Galathiya, A. S., Ganatra, A. P., and Bhensdadia, C. K., 2012. An improved decision tree induction algorithm with feature selection, cross validation, model complexity & reduced error pruning. IJSCIT, March 2012.

Grossman, D., and Domingos, P., 2005. Learning Bayesian network classifiers by maximizing conditional likelihood. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 361-368. ACM Press.

Hamad, A., 2007. Weighted Naive Bayesian classifier. IEEE/ACS International Conference on Computer Systems and Applications, AICCSA '07, Volume 1, Issue 1, pp. 437-441.

Hofmann, T., 1999. Probabilistic latent semantic analysis. In Proceedings of UAI.

Jongwoo, K., Daniel, X. L., and George, R. T., 2010. Naïve Bayes and SVM classifiers for classifying Databank Accession Number sentences from online biomedical articles. National Library of Medicine.

McCallum, A., and Nigam, K., 1998. A comparison of event models for naive Bayes text classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization.

Reuters-21578. http://www.daviddlewis.com/resources/testcollections/reuters21578/

Yang, Y., and Pedersen, J., 1997. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97), pp. 412-420.