An Improved Term Weighting Scheme for Text Categorization: M.A. Thesis, Information Technology: 60 48 01


Table of Contents

  • Introduction

    • Motivation

    • Structure of this Thesis

  • Overview of Text Categorization

    • Introduction

    • Text Representation

    • Text Categorization tasks

      • Single-label and Multi-label Text Categorization

      • Flat and Hierarchical Text Categorization

    • Applications of Text Categorization

      • Automatic Document Indexing for IR Systems

      • Documentation Organization

      • Word Sense Disambiguation

      • Text Filtering System

      • Hierarchical Categorization of Web Pages

    • Machine learning approaches to Text Categorization

      • k Nearest Neighbor

      • Decision Tree

      • Support Vector Machines

    • Performance Measures

  • Term Weighting Schemes

    • Introduction

    • Previous Term Weighting Schemes

      • Unsupervised Term Weighting Schemes

      • Supervised Term Weighting Schemes

    • Our New Term Weighting Scheme

  • Experiments

    • Term Weighting Methods

    • Machine Learning Algorithm

    • Corpora

      • Reuters News Corpus

      • 20 Newsgroups Corpus

    • Evaluation Measures

    • Results and Discussion

      • Results on the 20 Newsgroups corpus

      • Results on the Reuters News corpus

      • Discussion

      • Further Analysis

  • Conclusion


An Improved Term Weighting Scheme for Text Categorization

Pham Xuan Nguyen

Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi

Supervised by Dr. Le Quang Hieu

A thesis submitted in fulfillment of the requirements for the degree of Master of Science in Computer Science

August 2014

ORIGINALITY STATEMENT

'I hereby declare that this submission is my own work. To the best of my knowledge, it contains no materials previously published by another person, or substantial proportions of material which have been accepted for the award of any other degrees or diplomas at the University of Engineering and Technology (UET/Coltech) or any other educational institutions, except where due acknowledgement is made in the thesis. Any contributions made to the research by others are explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.'

Hanoi, August 24th, 2014

Signed
Supervisor: Signed
Judge: Signed

ABSTRACT

In text categorization, term weighting is the task of assigning weights to terms during the document representation phase. Thus, it affects the classification performance. In addition to yielding high text categorization performance, an effective term weighting scheme should be easy to use. Term weighting methods can be divided into two categories, namely supervised and unsupervised [27]. The traditional term weighting schemes such as binary, tf and tf.idf [38] belong to the unsupervised methods. Other schemes (for example, tf.χ2 [12]) that make use of prior information about the category membership of training documents belong to the supervised methods. The supervised term weighting method tf.rf [27] is one of the most effective schemes to date and has shown better performance than many others [27]. However, tf.rf is not the best in some cases. Moreover, tf.rf requires many rf values for each term. In this thesis, we present a term weighting scheme improved from tf.rf, called logtf.rfmax. Our new scheme uses logtf = log2(1.0 + tf) instead of tf. Furthermore, our scheme is simpler than tf.rf because it only uses the maximum value of rf for each term. Our experimental results show that our scheme is consistently better than tf.rf and the other schemes tested.

To my family ♥

ACKNOWLEDGEMENTS

First, I would like to express my gratitude to my supervisor, Dr. Le Quang Hieu. He guided me throughout the years and gave me much useful advice about study methods. He was very patient with me, and his words strongly influenced me. I also would like to give my honest appreciation to my colleagues at Hoalu University and the University of Engineering and Technology (UET/Coltech) for their great support. Thank you all!
List of Figures

2.1 An example of the vector space model
2.2 An example of transforming a multi-label problem into binary classification problems
2.3 A hierarchy with two top-level categories
2.4 Text Categorization using machine learning techniques
2.5 An example of a decision tree [source: [27]]
4.1 Linear Support Vector Machine [source: [14]]
4.2 The micro-F1 measure of eight term weighting schemes on the 20 Newsgroups corpus with different numbers of features
4.3 The macro-F1 measure of eight term weighting schemes on the 20 Newsgroups corpus with different numbers of features
4.4 The micro-F1 measure of eight term weighting schemes on the Reuters News corpus with different numbers of features
4.5 The macro-F1 measure of eight term weighting schemes on the Reuters News corpus with different numbers of features
4.6 The F1 measure of four methods on each category of the Reuters News corpus using the SVM algorithm at the full vocabulary
4.7 The F1 measure of four methods on each category of the 20 Newsgroups corpus using the SVM algorithm at the full vocabulary, categories 1 to 10
4.8 The F1 measure of four methods on each category of the 20 Newsgroups corpus using the SVM algorithm at the full vocabulary, categories 11 to 20

List of Tables

3.1 Traditional Term Weighting Schemes
3.2 Examples of two terms having different tf and log2(1 + tf)
4.1 Experimental Term Weighting Schemes
5.1 Examples of two term weights when using rf and rfmax

List of Abbreviations

TC: Text Categorization
TWS: Term Weighting Scheme
IR: Information Retrieval
ML: Machine Learning
F1: F-measure
SVM: Support Vector Machine

4.2 Machine Learning Algorithm

A linear SVM separates the training examples into two classes with the widest margin (see Figure 4.1). This is done by solving the optimization problem below [8]:

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\xi_i$$

$$\text{subject to}\quad y_i(w \cdot x_i - b) + \xi_i \ge 1,\quad \xi_i \ge 0,\quad i = 1,\dots,l$$

where C is the trade-off between minimizing the training error and maximizing the margin, and l is the number of training samples. The results are the vector w and the scalar b, which determine the orientation of the separating plane and its offset from the origin. The classification function (the learned model) is $y^* = \operatorname{sign}(w \cdot x^* - b)$, where $x^*$ is a testing sample. We applied the default C in this thesis. The linear SVM library used is LIBLINEAR 1.93 [16].
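As an illustration of this setup, the following minimal sketch trains a linear SVM text classifier with scikit-learn's LinearSVC, which wraps LIBLINEAR; C is left at its default, as in the thesis. The toy documents and category labels are hypothetical, and plain term counts stand in for the term weighting schemes studied here. This is a sketch, not the thesis's actual pipeline.

```python
# Minimal sketch: linear SVM text classifier in the spirit of Section 4.2.
# LinearSVC wraps the LIBLINEAR library named above; C is the default value.
# The toy documents and category labels are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_docs = [
    "wheat crop exports rise",      # category 0: grain
    "wheat harvest begins early",   # category 0: grain
    "gold price falls sharply",     # category 1: gold
    "gold miners expand output",    # category 1: gold
]
train_labels = [0, 0, 1, 1]

vectorizer = CountVectorizer()            # raw term frequencies as features
X_train = vectorizer.fit_transform(train_docs)

clf = LinearSVC()                         # linear SVM, default C
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(["gold output rises again"])
print(clf.predict(X_test))                # predicted category id
```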
4.3 Corpora

We used the Reuters News corpus and the 20 Newsgroups corpus, two common benchmark data sets. We applied these data sets so that our results can be compared with others, especially those reported in [27].

4.3.1 Reuters News Corpus

The Reuters-21578 corpus contains 10794 news stories, including 7775 documents in the training set and 3019 documents in the test set. There are 115 categories that have at least one training document. We conducted experiments on the Reuters top ten (the 10 largest categories in this corpus), and each document may be assigned to more than one category. The corpus can be downloaded from http://www.daviddlewis.com/resources/testcollections/reuters21578/.

In the text preprocessing phase, 513 stop words, numbers, single-character words, and words occurring too few times in the training set were removed. The resulting vocabulary has 9744 unique words (features). Using CHImax for feature selection, the top p ∈ {500, 2000, 4000, 6000, 8000, 10000, 12000} features are tried. Besides, we also used all words in the vocabulary.

The categories in the Reuters News corpus have a skewed distribution. In the training set, the most common category (earn) accounts for 29% of the total number of samples, but 98% of the other categories each account for less than 5% of the samples.

4.3.2 20 Newsgroups Corpus

The 20 Newsgroups (20NG) corpus is a collection of roughly 20000 newsgroup documents, divided into 20 newsgroups. This corpus is balanced, as each category has approximately 1000 samples. We treat this data set as a multi-labeled data set; each newsgroup corresponds to a different topic. After removing duplicates and headers, the remaining documents are sorted by date. The training set contains 11314 documents (60%) and the test set contains the remaining 7532 documents (40%). The corpus can be downloaded from http://people.csail.mit.edu/jrennie/20Newsgroups/.

In the text preprocessing phase, 513 stop words, single-character words, and words occurring too few times in the training set were removed. There are 37172 unique words in the vocabulary. We used CHImax for feature selection; the top p ∈ {500, 2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000} features were selected.

The 20 categories in the 20 Newsgroups corpus have a roughly uniform distribution, which differs from the distribution in the Reuters News corpus.
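Both corpora are reduced with CHImax feature selection before training. As a hedged sketch, one plausible realization scores each term with a one-vs-rest chi-square statistic per category, keeps the maximum over categories, and selects the top-p terms; scikit-learn's chi2 is used below. The thesis's exact chi-square formulation is not part of this extract, so the details should be read as assumptions.

```python
# One plausible realization of CHImax feature selection: score every term by
# a one-vs-rest chi-square statistic for each category, keep the maximum
# over categories, and select the top-p terms.
import numpy as np
from sklearn.feature_selection import chi2

def chi_max_select(X, y, p):
    """X: (n_docs, n_terms) non-negative term-count matrix; y: category ids."""
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        chi_c, _ = chi2(X, (y == c).astype(int))   # chi-square vs. category c
        scores = np.maximum(scores, np.nan_to_num(chi_c))
    top = np.argsort(scores)[::-1][:p]             # indices of the top-p terms
    return X[:, top], top

# Toy usage: four documents, three terms, two categories (hypothetical data).
X = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
y = np.array([0, 0, 1, 1])
X_sel, kept = chi_max_select(X, y, p=2)
print(kept)   # the two most category-discriminative terms
```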
4.4 Evaluation Measures

In this thesis, we use two averaging methods for F1, namely micro-F1 and macro-F1. Micro-F1 is dependent on the large categories, while macro-F1 is influenced by the small categories [39]. By using these measures, our results are comparable with other results, including those in [27].
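To make the difference between the two averages concrete, the following sketch computes both with scikit-learn on hypothetical predictions over a skewed label distribution: micro-F1 pools all decisions, so the large category dominates, while macro-F1 averages per-category F1 values equally.

```python
# Hypothetical single-label predictions over a skewed label distribution,
# illustrating why micro-F1 tracks the large category while macro-F1 is
# pulled down by errors on the small ones.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]   # category 0 dominates
y_pred = [0, 0, 0, 0, 0, 0, 1, 2, 1]   # all errors fall on categories 1 and 2

print(f1_score(y_true, y_pred, average="micro"))  # ~0.78, driven by class 0
print(f1_score(y_true, y_pred, average="macro"))  # 0.50, dragged down by small classes
```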
4.5 Results and Discussion

In this section, we describe previous results, our experimental results, and a discussion. Previous results for 20NG were reported in [3] and [18]; the best result for 20NG is 88.6% micro-BEP, obtained by [3]. The best result for the Reuters top ten is 92.3% micro-BEP [18]. The experimental results of the eight term weighting methods with respect to the micro-F1 and macro-F1 measures on the Reuters News corpus and the 20NG corpus are reported in Figures 4.2 to 4.5. Each line in the figures shows the performance of a term weighting method at different feature selection levels.

4.5.1 Results on the 20 Newsgroups corpus

Figure 4.2 shows the results in terms of micro-F1 on the 20NG corpus. Generally, the micro-F1 values of all methods increase as the number of selected features increases. logtf.rfmax and rfmax are consistently better than the others at all feature selection levels. Almost all term weighting methods reach their peak at a feature size of around 16000, and the best three micro-F1 values, 81.27%, 81.23% and 80.27%, are reached by logtf.rfmax, rfmax and logtf.rf, respectively. tf.rf and rf reach peaks of 79.46% and 79.94%.

Figure 4.2: The micro-F1 measure of eight term weighting schemes on the 20 Newsgroups corpus with different numbers of features

Figure 4.3 depicts the results in terms of macro-F1 on the 20NG corpus. The trends of the lines are similar to those in Figure 4.2. logtf.rfmax and rfmax are still better than the other schemes at all numbers of selected features.

Figure 4.3: The macro-F1 measure of eight term weighting schemes on the 20 Newsgroups corpus with different numbers of features

4.5.2 Results on the Reuters News corpus

Figure 4.4 shows the results with respect to micro-F1 on the Reuters News corpus. From 6000 features onwards, the micro-F1 values generally increase. logtf.rfmax and tf.rfmax are consistently better than the others once the feature selection level exceeds 8000. Almost all term weighting methods reach their peak at the full vocabulary. The best three micro-F1 values, 94.23%, 94.20% and 94.03%, are achieved by tf.rfmax, logtf.rfmax and tf.rf; the schemes rfmax and rf account for 93.50% and 93.10% at the full vocabulary.

Figure 4.4: The micro-F1 measure of eight term weighting schemes on the Reuters News corpus with different numbers of features

Figure 4.5 depicts the results in terms of macro-F1 on the Reuters News corpus. The performances of the eight schemes fluctuate while the number of selected features is below 8000. From this point onwards, logtf.rfmax and logtf.rf are the schemes that are consistently better than the others.

Figure 4.5: The macro-F1 measure of eight term weighting schemes on the Reuters News corpus with different numbers of features

Our experimental results confirm the classification results of tf.rf and rf (the peaks and trends) reported in [27]. First, tf.rf is consistently better than rf, tf and binary on the Reuters News corpus (Figure 4.4). Moreover, rf performs better than tf.rf, tf and binary on the 20NG corpus.
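Before the discussion, a small sketch may help make the compared factors concrete. The logtf and rfmax definitions below follow this thesis (logtf = log2(1.0 + tf), and rfmax keeps only the maximum rf over all categories); the rf formula itself, rf = log2(2 + a/max(1, b)), is the definition from Lan et al. [27], with a and b the numbers of positive- and negative-category training documents containing the term, and should be checked against Chapter 3, which is not part of this extract. The example counts are hypothetical.

```python
# Illustrative sketch of the weighting factors compared in this section.
# logtf and rf_max follow the thesis; the rf formula is Lan et al.'s [27].
import math

def rf(a: int, b: int) -> float:
    """Relevance frequency of a term for one category.
    a: positive-category docs containing the term; b: negative-category docs."""
    return math.log2(2.0 + a / max(1.0, b))

def logtf(tf: int) -> float:
    """Log-scaled term frequency used in place of raw tf."""
    return math.log2(1.0 + tf)

def logtf_rfmax(tf, per_category_counts):
    """Proposed weight: logtf times the maximum rf over all categories."""
    rf_max = max(rf(a, b) for a, b in per_category_counts)
    return logtf(tf) * rf_max

# A term occurring 3 times in a document, with hypothetical (a, b) document
# counts for three categories; the middle category supplies rf_max.
print(logtf_rfmax(3, [(10, 90), (60, 40), (5, 95)]))
```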
4.5.3 Discussion

Some observations on the schemes with our proposed improvements follow:

• The schemes using the rfmax factor are better than those with the rf factor. Specifically, tf.rfmax, logtf.rfmax and rfmax are better than tf.rf, logtf.rf and rf, respectively, in all figures.

• The schemes applying the logtf factor yield better performance than those using the tf factor on the 20NG corpus (see Figures 4.2 and 4.3). On the Reuters News corpus, the schemes using the logtf factor perform comparably to those applying the tf factor (see Figures 4.4 and 4.5).

• logtf.rfmax, a combination of the two improvements, performs comparably to tf.rfmax and rfmax, the two best schemes on the Reuters News corpus and the 20NG corpus, respectively.

• logtf.rfmax is significantly better than tf.rf on the 20NG corpus, and consistently better than tf.rf on the Reuters News corpus once the feature selection level exceeds 6000.

In brief, logtf.rfmax steadily achieves higher performance than the other schemes in our experiments.

4.5.4 Further Analysis

To further investigate these methods, we explore their performance on each category. We choose four representative methods, namely binary, tf, tf.rf and logtf.rfmax, with respect to the F1 measure. The results are shown in Figures 4.6 to 4.8; the maximum value in each column is shown in bold font. We only analyze the performances of the term weighting schemes at a feature set size where most of the methods achieve their best performance. Even though not all the schemes achieve their best performance there, it is still valuable to compare their performance with respect to each other.

Reuters Corpus and Linear SVM Algorithm

Figure 4.6: The F1 measure of four methods on each category of the Reuters News corpus using the SVM algorithm at the full vocabulary

| TWS \ Category | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| logtf.rfmax | **98.72** | **97.05** | 83.06 | **95.27** | 89.95 | 78.33 | **82.59** | **88.73** | **88.89** | **91.89** |
| binary | 98.62 | 95.63 | 82.64 | 89.42 | 86.72 | 72.87 | 79.34 | 87.58 | 85.71 | 84.68 |
| tf | 98.63 | 96.24 | 83.80 | 90.71 | 90.31 | **82.55** | 80.33 | 83.33 | 82.72 | 83.02 |
| tf.rf | 98.45 | 96.57 | **84.62** | 94.24 | **90.86** | 79.66 | **82.59** | 84.67 | 88.10 | 88.89 |

Figure 4.6 depicts the F1 measure of the four term weighting schemes on each of the 10 largest categories of the Reuters News corpus using the SVM-based classifier at the full vocabulary. All four schemes yield almost the same F1 on the two largest categories (category 1 with 37% and category 2 with 21.22% of the samples), while for the remaining categories there are significant differences among the methods. For instance, the largest difference in F1 between logtf.rfmax and tf, on category 10, is 8.87. logtf.rfmax has the best performance on 7 of the 10 categories, and thus it has contributed the best performance for the whole corpus. This finding shows that logtf.rfmax is quite effective for the skewed category distribution in the Reuters News corpus.

20 Newsgroups Corpus and Linear SVM Algorithm

Figure 4.7: The F1 measure of four methods on each category of the 20 Newsgroups corpus using the SVM algorithm at the full vocabulary, categories 1 to 10

| TWS \ Category | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| logtf.rfmax | **72.37** | **74.45** | **70.34** | **81.69** | **76.74** | **84.89** | **84.29** | 92.33 | **88.40** | **95.60** |
| binary | 60.50 | 64.63 | 62.48 | 73.45 | 70.45 | 78.24 | 74.08 | 87.65 | 78.07 | 87.42 |
| tf | 66.15 | 66.77 | 63.17 | 75.41 | 70.25 | 82.74 | 78.11 | 89.37 | 82.35 | 91.17 |
| tf.rf | 71.26 | 69.98 | 67.86 | 78.30 | 76.37 | 84.12 | 79.94 | **92.37** | 86.87 | 94.93 |

Figure 4.8: The F1 measure of four methods on each category of the 20 Newsgroups corpus using the SVM algorithm at the full vocabulary, categories 11 to 20

Figures 4.7 and 4.8 show the F1 measure of the four term weighting schemes on each category of the 20 Newsgroups data set using the SVM classifier at the full vocabulary. Unlike the results on the Reuters corpus, which has a skewed category distribution, there are significant differences among the four term weighting methods on each of the 20 categories of the 20 Newsgroups corpus. However, logtf.rfmax has been shown to perform very well on each category. Furthermore, both logtf.rfmax and tf.rf are consistently better than the other two methods.
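The per-category breakdowns in Figures 4.6 to 4.8 correspond to per-class F1 scores. A minimal sketch of how such a breakdown can be computed with scikit-learn follows, with hypothetical labels.

```python
# With average=None, scikit-learn returns one F1 value per category, which
# is the kind of per-category breakdown reported in Figures 4.6-4.8.
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

for cat, score in enumerate(f1_score(y_true, y_pred, average=None)):
    print(f"category {cat}: F1 = {score:.2f}")
```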
