Incorporating linguistically motivated knowledge sources into document classification





INCORPORATING LINGUISTICALLY MOTIVATED KNOWLEDGE SOURCES INTO DOCUMENT CLASSIFICATION

GOH JIE MEIN
BSc (Hons 1, NUS)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF INFORMATION SYSTEMS
NATIONAL UNIVERSITY OF SINGAPORE
2004

ACKNOWLEDGEMENTS

This thesis could not have been completed without the constant guidance and assistance of many people, whom I must acknowledge here. Firstly, I am deeply grateful to my advisor, Associate Professor Danny Poo, for his constant guidance, encouragement and understanding. He was instrumental to the development of this thesis, and I sincerely thank him for providing valuable advice, direction and insights for my research. My deep appreciation also goes out to all my friends, peers and colleagues who have helped me in one way or another: I wish to thank Klarissa Chang, Koh Wei Chern, Cheo Puay Ling and Wong Foong Yin for their listening ears and uplifting remarks. Colleagues and friends such as Michelle Gwee, Wang Xinwei, Liu Xiao, Koh Chung Haur, Li Yan, Li Huixian, Santosa Paulus, Wan Wen, Bryan Low, Chua Teng Chwan, Tok Wee Hyong, Indriyati Atmosukarto, Colin Tan, Julian Lin, and the pioneering batch of Schemers kept life pleasurable in the office. I would also like to express my sincere thanks to A/P Chan Hock Chuan and A/P Stanislaw Jarzabek for evaluating my research, and to all professors, teaching staff, administrative staff, friends and students. Last but not least, I would like to thank my family, especially my parents, sisters and Melvin Lye, for their relentless moral support, motivation, advice, love and understanding.

TABLE OF CONTENTS

Acknowledgements
Contents
List of Tables
List of Figures

1.
Introduction
   1.1 Background & Motivation
   1.2 Aims and Objectives
   1.3 Thesis Plan
2. Literature Review
   2.1 Document Classification
   2.2 Feature Selection Methods
   2.3 Machine Learning Algorithms
      2.3.1 Naïve Bayes
      2.3.2 Support Vector Machines (SVM)
      2.3.3 Alternating Decision Trees
      2.3.4 C4.5
      2.3.5 Ripper
      2.3.6 Instance Based Learning – k-Nearest Neighbour
   2.4 Natural Language Processing (NLP) in Document Classification
   2.5 Conclusion
3. Linguistically Motivated Classification
   3.1 Considerations
   3.2 Linguistically Motivated Knowledge Sources
      3.2.1 Phrase
      3.2.2 Word Sense (Part of Speech Tags)
      3.2.3 Nouns
      3.2.4 Verbs
      3.2.5 Adjectives
      3.2.6 Combination of Sources
   3.3 Obtaining Linguistically Motivated Classifiers
4. Experiment
   4.1 Evaluation Data Sets
      4.1.1 Reuters-21578
      4.1.2 WebKB
   4.2 Dimensionality Reduction
      4.2.1 Feature Selection
      4.2.2 Document Frequency Thresholding
      4.2.3 Stop Words Removal
   4.3 Experiment Setup
      4.3.1 Handling Multiple Categories Problems
      4.3.2 Handling Multiple Categories Training Examples
   4.4 Evaluation Measures
      4.4.1 Loss Based Measures
      4.4.2 Recall & Precision
      4.4.3 Precision Recall Breakeven Point
      4.4.4 Micro- and Macro-Averaging
   4.5 Tools
5.
Results & Evaluation
   5.1 Results
   5.2 Contribution of Different Linguistically Motivated Knowledge Sources to Classification of Reuters-21578 Corpus
      5.2.1 Words
      5.2.2 Phrase
      5.2.3 Word Sense (Part of Speech Tags)
      5.2.4 Nouns
      5.2.5 Verbs
      5.2.6 Adjectives
      5.2.7 Combination of Sources
      5.2.8 Analysis of Reuters-21578 Results
   5.3 Contribution of Linguistically Motivated Knowledge Sources to Classification Accuracy of WebKB Corpus
      5.3.1 Words
      5.3.2 Phrase
      5.3.3 Word Sense (Part of Speech Tags)
      5.3.4 Nouns
      5.3.5 Verbs
      5.3.6 Adjectives
      5.3.7 Nouns & Words
      5.3.8 Phrase & Words
      5.3.9 Adjectives & Words
      5.3.10 Analysis of WebKB Results
   5.4 Summary of Results
6. Conclusion
   6.1 Summary
   6.2 Contributions
   6.3 Limitations
   6.4 Conclusion

SUMMARY

This thesis describes an empirical study of the effects of linguistically motivated knowledge sources combined with different learning algorithms. Using up to nine different linguistically motivated knowledge sources and six different learning algorithms, we examined classification accuracy on two benchmark data corpora: Reuters-21578 and WebKB. One linguistically motivated knowledge source, nouns, outperformed the traditional bag-of-words classifiers on Reuters-21578; the best results for this corpus were obtained using nouns with support vector machines. On the other hand, experiments with WebKB showed that classifiers built using the novel combinations of linguistically motivated knowledge sources were as competitive as those built using the conventional bag-of-words technique.
LIST OF TABLES

Table 1  Summary of Related Studies
Table 2  Distribution of Categories in Reuters-21578
Table 3  Top 20 Categories in ModApte Split
Table 4  Breakdown of Documents in WebKB Categories
Table 5  Contingency Table
Table 6  Results using Words
Table 7  Results using Phrases
Table 8  Results using Tags
Table 9  Results using Nouns
Table 10 Results using Verbs
Table 11 Results using Adjectives
Table 12 Results using both Linguistically Motivated Knowledge Sources and Words
Table 13 Contribution of Knowledge Sources on Reuters-21578 (F1 measures)
Table 14 Contribution of Knowledge Sources on Reuters-21578 (Precision)
Table 15 Contribution of Knowledge Sources on Reuters-21578 (Recall)
Table 16 Results using Words
Table 17 Results using Phrases
Table 18 Results using Tags
Table 19 Results using Nouns
Table 20 Results using Verbs
Table 21 Results using Adjectives
Table 22 Results using Nouns & Words
Table 23 Results using Phrases & Words
Table 24 Results using Adjectives & Words
Table 25 Consolidated Results of WebKB

LIST OF FIGURES

Figure 1  Document Classification Process
Figure 2  Linear SVM
Figure 3  An Example of an ADTree
Figure 4  k-NN Algorithm
Figure 5  Extracting Linguistically Motivated Knowledge Sources
Figure 6  An Example of a Reuters-21578 Document
Figure 7  Design of System
Figure 8  Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Micro F1 values)
Figure 9  Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Macro F1 values)
Figure 10 Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Precision)
Figure 11 Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Recall)
Figure 12 Comparison of Different Linguistically Motivated Knowledge Sources on WebKB (Micro F1
values)
Figure 13 Comparison of Different Linguistically Motivated Knowledge Sources on WebKB (Macro F1 values)

CHAPTER 1 INTRODUCTION

1.1 Background & Motivation

With the emerging importance of knowledge management, research in areas such as document classification, information retrieval and information extraction plays a critical role in the success of knowledge management initiatives. Studies have shown that perceived output quality is an essential factor for the successful implementation and adoption of knowledge management technologies (Kankanhalli et al., 2001). Large document archives such as electronic knowledge repositories offer a wealth of information from which methods in the fields of information retrieval and document classification are used to derive knowledge. Coupled with the accessibility of the voluminous amounts of information available on the World Wide Web, this information explosion has brought about other problems. Users are often overwhelmed by the deluge of information and suffer from a decreased ability to assimilate it. Research has suggested that users feel bored or frustrated when they receive too much information (Roussinov and Chen, 1999), which can lead to a state where an individual is no longer able to effectively process the amount of information he is exposed to, giving rise to lower decision quality in a given time frame. This problem is exacerbated by the proliferation of information in electronic repositories in organizations and on the World Wide Web (Farhoomand and Drury, 2002).

Document classification has been applied to the categorization of search results and has been shown to alleviate the problem of information overload. The presentation of documents in categories works better than a list of results because it enables users to disambiguate the categories quickly and then focus on the relevant category (Dumais, Cutrell & Chen, 2001).
This is also useful for distinguishing documents containing words with multiple meanings (polysemy), a characteristic predominant in English. Experiments on supervised document classification techniques have predominantly used the bag-of-words technique, whereby the words of the documents are used as features. Alternate formulations of a meaning can also be introduced through linguistic variation such as syntax, which determines the association of words. Although some studies have employed alternate features such as linguistic sources, they have covered only a subset of linguistic sources and learning algorithms (Lewis, 1992; Arampatzis et al., 2000; Kongovi et al., 2002). Thus, this study extends previous work on document classification and aims to find ways to improve the classification accuracy of documents.

1.2 Aims and Objectives

Differences in previous empirical studies could be introduced by differences in the tagging tools used, the learning algorithms, the parameters tuned for each learning algorithm, the feature selection methods employed and the datasets involved (Yang, 1999). Thus, it is difficult to offer a sound conclusion based on previous works. Since previous works documenting results based on linguistically motivated features with learning algorithms produced inconsistent and sometimes conflicting results, we propose to conduct a systematic study of multiple learning algorithms and linguistically motivated knowledge sources as features. Some of these features are novel combinations of linguistically motivated knowledge sources that were not explored in previous studies. Through a systematic and controlled study, we can resolve some of these ambiguities and offer a sound conclusion.
In our study, consistency in the dataset, learning algorithms, tagging tools and feature selection was maintained so that we could make a valid assessment of the effectiveness of linguistically motivated features. The aim of this thesis is also to provide a solid foundation for research on feature representations in text classification and to study the factors affecting the machine learning algorithms used in document classification. One of the factors we want to look into is the effect of feature engineering, by utilizing linguistically motivated knowledge sources as features. Thus, the objectives of this thesis are listed below:

1. To examine the approach of using linguistically motivated knowledge sources, based on concepts derived from natural language processing, as features with popular learning algorithms, systematically varying both the learning algorithms and the feature representations. We based our evaluation on the Reuters-21578 and WebKB corpora, benchmark corpora that have been widely used in previous research.

2. To examine the feasibility of applying novel combinations of linguistically motivated knowledge sources and explore the effect of these combinations as features on the accuracy of document classification.

1.3 Thesis Plan

This thesis is composed of the following chapters: Chapter 1 provides the background, motivation and objectives of this research. Chapter 2 provides a literature review of document classification research. Here we bring together literature from different fields (document classification, machine learning techniques and natural language processing) and give detailed coverage of the algorithms chosen for this study. In addition, this chapter overviews the rudimentary knowledge required in later chapters. Chapter 3 describes the types of linguistically motivated knowledge sources and the novel combinations used in our study.
Chapter 4 provides a description of the experiment setup. It also briefly describes the performance measures used to evaluate the classifiers and the tools employed to conduct the study. Chapter 5 provides an analysis of the results and suggests implications for practice. Chapter 6 concludes with the contributions, findings and limitations of our study. Suggestions for future research that could extend this work are also provided.

CHAPTER 2 LITERATURE REVIEW

Document classification has traditionally been carried out using the bag-of-words paradigm. Research on natural language processing has shown fruitful results that can be applied to document classification, introducing an avenue for improving accuracy through a different set of features. This chapter reviews related literature underpinning this research. Section 2.1 gives an overview of document classification and the focus of previous research. Section 2.2 overviews common feature selection techniques used in previous studies. Section 2.3 introduces the machine learning algorithms that will be adopted in our study. Section 2.4 presents the concepts of natural language processing employed to derive linguistically motivated knowledge sources.

2.1 Document Classification

Document classification, the focus of this work, refers to the task of assigning a document to categories based on its underlying content. Although this task was carried out effectively using knowledge engineering approaches in the 1980s, machine learning approaches have superseded knowledge engineering approaches for document classification in recent years.
While knowledge engineering approaches have produced effective classification accuracy, the machine learning approach offers many advantages such as cost effectiveness, portability and accuracy competitive with that of human experts, while being considerably more efficient (Sebastiani, 2002). Thus, supervised machine learning techniques are employed in this study. A supervised approach involves three main phases: feature extraction, training and testing. The entire process of document classification using machine learning methods is illustrated in Figure 1.

[Figure 1: Document Classification Process. In the training phase, the training corpus passes through feature selection to produce labeled training feature vectors that build a classifier model; in the test phase, unlabeled test feature vectors are passed to the model, which outputs a category.]

There are two phases involved in learning-based document classification: the training phase and the testing phase. In the training phase, pre-labeled documents are collected. This set of pre-labeled documents is called the training corpus, training data or training set; these terms are used interchangeably throughout this thesis. Each document in the training corpus is transformed into a feature vector. These feature vectors are used to train a learning classifier, which builds a model based on the training set. The model built by the learning algorithm is then used in the testing phase to label a set of unlabeled documents that are new to the classifier, called the test corpus, test data or test set.
The classification problem can be formally represented as follows:

f_c(d) → {true, false}, where d ∈ D

Given a set of training documents D and a set of categories C, the classification problem is defined as a function f_c, for each category c ∈ C, that maps documents in the test set T to a boolean value, where 'true' indicates that the document is categorized under c and 'false' indicates that it is not, based on the characteristics of the training documents D.

2.2 Feature Selection Methods

With a large number of documents and features, the document classification process usually involves a feature selection step. This is done to reduce the feature dimension. Feature selection methods retain only the important features derived from the original set of features. Traditional feature selection methods that have been commonly employed in previous studies include document frequency (Dumais et al., 1998; Yang and Pedersen, 1997), chi-square (Schutze et al., 1997; Yang and Pedersen, 1997), information gain (Lewis and Ringuette, 1994; Yang and Pedersen, 1997) and mutual information (Dumais et al., 1998; Larkey and Croft, 1996).

After the feature reduction step, many supervised techniques can be employed in document classification. The subsequent section presents a review of the techniques that were employed in this study.

2.3 Machine Learning Algorithms

This section reviews state-of-the-art learning algorithms for text classification, giving background on the different methods and an analysis of the advantages and disadvantages of each learning method used in our empirical study.
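The per-category binary formulation above can be sketched as follows. This is a minimal illustration only: the cue-word classifiers and category names are hypothetical, standing in for the learned classifiers f_c described later.

```python
# One boolean classifier f_c per category c: f_c(d) -> True/False.
# Hypothetical example: each f_c simply checks for a single cue word.
def make_classifier(cue_word):
    def f_c(document):
        return cue_word in document.lower().split()
    return f_c

# Hypothetical category set C with one classifier per category.
categories = {"earn": make_classifier("profit"), "grain": make_classifier("wheat")}

d = "Wheat exports rose sharply this quarter"
labels = {c: f_c(d) for c, f_c in categories.items()}
print(labels)  # {'earn': False, 'grain': True}
```

A document may thus be assigned to zero, one or several categories, which is how the multi-category problem is reduced to a set of independent binary decisions.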
Past research in the field of automatic document classification has focused on improving document classification through various learning algorithms (Yang and Liu, 1999) such as support vector machines (Joachims, 1998) and various feature selection methods (Yang and Pedersen, 1997; Luigi et al., 2000). To make the study worthwhile, we used popular learning algorithms that have reported significant improvements in classification accuracy in previous studies. These included a wide variety of supervised learning algorithms: naïve Bayes, support vector machines, k-nearest neighbours, C4.5, RIPPER, AdaBoost with decision stumps and alternating decision trees.

2.3.1 Naïve Bayes (NB)

Bayesian classification has been a popular technique in recent years. The simplest Bayesian classifier is the widely used Naïve Bayes classifier, which assumes that features are independent. Despite this inaccurate assumption of feature independence, Naïve Bayes is surprisingly successful in practice and has proven effective in text classification, medical diagnosis and computer performance management, among other applications.

The Naïve Bayes classifier uses a probabilistic model of text to estimate the probability Pr(y|d) that a document d is in class y. This model assumes conditional independence of features, i.e. words are assumed to occur independently of the other words in the document given its class. Despite this assumption, Naïve Bayes has performed well. Bayes' rule says that to achieve the highest classification accuracy, d should be assigned to the class y ∈ {−1, +1} for which Pr(y|d) is highest:

h_BAYES(d) = argmax_{y ∈ {−1,+1}} Pr(y|d)    (1)

Pr(y|d) can be calculated by conditioning on the document length l:

Pr(y|d) = Σ_{l=1}^{∞} Pr(y|d, l) · Pr(l|d)    (2)

Pr(l|d) equals one for the length l′ of document d and is zero otherwise.
In other words, when we apply Bayes' theorem to Pr(y|d, l′) we obtain the following equation:

Pr(y|d) = Pr(d|y, l′) · Pr(y|l′) / Σ_{y′ ∈ {−1,+1}} Pr(d|y′, l′) · Pr(y′|l′)    (3)

Pr(d|y, l′) is the probability of observing document d in class y given its length l′. Pr(y|l′) is the prior probability that a document of length l′ is in class y. In the following we will assume that the category of a document does not depend on its length, so Pr(y|l′) = Pr(y). An estimate of Pr(y) is as follows:

Pr′(y) = |y| / Σ_{y′ ∈ {−1,+1}} |y′| = |y| / |D|    (4)

|y| denotes the number of training documents in class y ∈ {−1,+1} and |D| is the total number of documents. Despite the unrealistic independence assumption, the Naïve Bayes classifier is remarkably successful in practice. Researchers have shown that the Naïve Bayes classifier is competitive with other learning algorithms such as decision tree and neural network algorithms. Experimental results for Naïve Bayes classifiers can be found in several studies (Lewis, 1992; Lewis and Ringuette, 1994; Lang, 1995; Pazzani, 1996; Joachims, 1998; McCallum & Nigam, 1998; Sahami, 1998). These studies have shown that Bayesian classifiers can produce reasonable results with high precision and recall values. Hence, we have chosen this learning algorithm for learning to classify text documents. A second reason that this Bayesian method is important to our study of machine learning is that it provides a useful perspective for understanding many learning algorithms that do not explicitly manipulate probabilities.
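The prior estimate in equation (4) and the argmax rule in equation (1) can be sketched as a toy multinomial Naïve Bayes. This is an illustrative sketch only, with hypothetical documents and add-one smoothing for the word likelihoods; it is not the exact model or tool used in the thesis experiments.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (word_list, y) with y in {-1, +1}."""
    priors = Counter(y for _, y in docs)          # |y| counts, as in eq. (4)
    words = {-1: Counter(), +1: Counter()}        # per-class word counts
    for tokens, y in docs:
        words[y].update(tokens)
    vocab = set(words[-1]) | set(words[+1])
    return priors, words, vocab

def classify_nb(d, priors, words, vocab):
    """h_BAYES(d) = argmax_y Pr'(y) * prod_w Pr(w|y), eq. (1), in log space."""
    n_docs = sum(priors.values())
    best, best_lp = None, float("-inf")
    for y in (-1, +1):
        lp = math.log(priors[y] / n_docs)         # log Pr'(y) = log(|y| / |D|)
        total = sum(words[y].values())
        for w in d:                               # add-one (Laplace) smoothing
            lp += math.log((words[y][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

# Hypothetical training documents.
docs = [(["wheat", "exports"], +1), (["profit", "rose"], -1), (["wheat", "harvest"], +1)]
model = train_nb(docs)
print(classify_nb(["wheat"], *model))   # prints 1
print(classify_nb(["profit"], *model))  # prints -1
```

Working in log space avoids numerical underflow when a document contains many words, which matters at the vocabulary sizes typical of text classification.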
Equation (5) gives the upper bound that connects the true error of a hypothesis h with the error E_train(h) of h on the training set and the complexity of h, reflecting the well-known tradeoff between the complexity of the hypothesis space and the training error:

E(h) ≤ E_train(h) + O( √( (d · ln(n/d) − ln(η)) / n ) )    (5)

A simple hypothesis space will most likely not contain good approximation functions and will lead to high training and true error. On the other hand, a large hypothesis space will lead to a small training error, but the second term on the right-hand side of equation (5) will be large. This reflects the fact that for a hypothesis space with high VC-dimension, a hypothesis with low training error may result in overfitting. Thus it is crucial to find the right hypothesis space.

The simplest representation of a support vector machine, a linear SVM, is a hyperplane that separates a set of positive examples from a set of negative examples with maximum distance from the hyperplane to the nearest of the positive and negative examples. Figure 2 shows the graphical representation of a linear SVM.

[Figure 2: Linear SVM. Positive and negative examples are separated by a hyperplane placed at maximum distance from the nearest examples of either class.]

Joachims (1998) developed a model of learning text classifiers using support vector machines and linked the statistical properties of text with the generalization performance of the learner. Unlike conventional generative models, SVM does not involve unreasonable parametric or independence assumptions. The discriminative model focuses on those properties of the text classification task that are sufficient for good generalization performance, avoiding much of the complexity of natural language. This makes SVM suitable for achieving good classification performance despite the high-dimensional feature spaces in text classification.
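The decision rule and the margin maximized by a linear SVM, as depicted in Figure 2, can be made concrete with a small sketch. The weight vector w and bias b below are hypothetical values, standing in for the output of a trainer such as SMO; the example only illustrates the quantities involved, not a training procedure.

```python
import math

# Decision rule of a linear SVM: f(x) = sign(w . x + b).
# w and b are hypothetical, as if produced by training.
w, b = [2.0, -1.0], 0.5

def decision(x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

def geometric_margin(x, y):
    """Signed distance of labeled point (x, y) to the hyperplane;
    positive if and only if the point is correctly classified."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w

points = [([1.0, 0.0], 1), ([0.0, 2.0], -1)]
margins = [geometric_margin(x, y) for x, y in points]
# SVM training chooses w and b to maximize the smallest such margin.
print(decision([1.0, 0.0]), min(margins))
```

The points achieving the smallest margin are the support vectors; all other examples could be removed without changing the learned hyperplane.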
High redundancy, high discriminative power of term sets, and discriminative features in the high-frequency range are sufficient conditions for good generalization. SVM was therefore chosen as one of the learning algorithms in this study. We used Platt's (1999) sequential minimal optimization (SMO) algorithm to train the linear SVM more efficiently. This algorithm decomposes the large quadratic programming problem into smaller sub-problems. Document classification using support vector machines can be done through either binary or multi-class classification; we adopted the binary approach, which is discussed in a later chapter.

2.3.3 Alternating Decision Tree (ADTree)

Although a variety of decision tree learning methods have been developed with somewhat differing capabilities and requirements, we have chosen one of the more recent methods, the alternating decision tree (Freund & Mason, 1999), because it has often been applied to classification problems such as learning to classify text or documents.

The alternating decision tree learning algorithm is a combination of decision trees with boosting that generates classification rules that are small and often easy to interpret. A general alternating tree defines a classification rule through a set of paths in the tree. As in standard decision trees, when a path reaches a decision node it continues with the child that corresponds to the outcome of the decision associated with that node. When a prediction node is reached, however, the path continues with all of the children of that node: the path splits into a set of paths, one for each child of the prediction node. The difference between an ADTree and a conventional decision tree is that classification is based on the predictions accumulated along the traversed paths of the tree, rather than on a single final leaf node.
There are several key features of alternating decision trees. Firstly, compared to C5.0 with boosting, ADTree produces classifiers that are smaller and easier to interpret. In addition, ADTree gives a measure of confidence, called the classification margin, that can be used to improve accuracy at the cost of abstaining from predicting examples that are hard to classify, instead of always predicting a class. However, the disadvantage of ADTree is its susceptibility to overfitting on small data sets.

[Figure 3: An Example of an ADTree (Freund & Mason, 1999)]

2.3.4 C4.5

A decision tree text classifier is a tree in which internal nodes are labeled by terms, branches departing from them are labeled by tests on the weight that the term has in the test document, and leaves are labeled by categories. In this classification scheme, a text document d is categorized by recursively testing the weights that the terms labeling the internal nodes have in the vector of d, until a leaf node is reached. The label of this node is then assigned to d. Most of these classifiers use binary document representations and are represented as binary trees. A number of decision tree learners exist; the most popular, which has shown good results on a variety of problems, is the C4.5 algorithm (Quinlan, 1993; Cohen and Hirsh, 1998), and we have therefore chosen this learning method. Previous works based on this technique are reported in Lewis and Ringuette (1994), Moulinier et al. (1996), Apte and Damerau (1994), Cohen (1995) and Cohen (1996). C4.5 learns decision trees by constructing them top-down, from the root of the tree. Each instance feature is evaluated using a statistical test, such as information gain, to determine how well it alone classifies the training examples. Information gain is based on the notion of entropy from information theory.
The entropy of a collection S is measured as follows:

Entropy(S) ≡ −p₊ log₂ p₊ − p₋ log₂ p₋    (6)

where p₊ is the proportion of positive instances in the collection S and p₋ is the proportion of negative instances. The best feature is selected and employed as the root node of the tree. For each possible value of this attribute, a descendant of the root node is created, and the training examples are sorted to the appropriate descendant node. C4.5 performs a greedy search for a suitable decision tree in which no backtracking is allowed.

2.3.5 Ripper

This learning algorithm is a propositional rule learner, RIPPER (Repeated Incremental Pruning to Produce Error Reduction), proposed by Cohen (1995). The algorithm has a few major phases that characterize it: grow, prune and optimization. RIPPER was developed from the repeated application of Fürnkranz and Widmer's (1994) IREP algorithm, followed by two new global optimization procedures. Like other rule-based learners, RIPPER grows rules in a greedy fashion guided by information-theoretic measures.

Firstly, rules are grown by a greedy process which adds conditions to a rule until the rule is 100% accurate. The algorithm tries every possible value of each attribute and selects the condition with the highest information gain. The rules are incrementally pruned, and finally, in the optimization stage, two variants of each rule are generated: one variant is grown from an empty rule, while the other is generated by greedily adding antecedents to the original rule. The smallest possible description length for each variant and the original rule is computed, and the version with the minimal description length is selected as the final representative of the rule in the rule set. Rules that would increase the description length of the rule set if they were included are deleted. The resultant rules are then added to the rule set.
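The entropy measure in equation (6), used by C4.5 above to score candidate split features, can be computed directly. The counts below are hypothetical, chosen only to show the two extreme cases.

```python
import math

def entropy(n_pos, n_neg):
    """Entropy(S) = -p+ log2(p+) - p- log2(p-), with 0 * log2(0) taken as 0."""
    total = n_pos + n_neg
    e = 0.0
    for n in (n_pos, n_neg):
        p = n / total
        if p > 0:
            e -= p * math.log2(p)
    return e

print(entropy(5, 5))   # 1.0: an evenly split collection is maximally impure
print(entropy(10, 0))  # 0.0: a pure collection has zero entropy
# C4.5's information gain for a candidate split is Entropy(S) minus the
# size-weighted entropies of the subsets the split produces.
```

The feature whose split yields the largest such gain becomes the decision node, which is exactly the greedy selection step described above.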
RIPPER has already been applied to a number of standard problems in text classification with rather promising results (Cohen, 1995). Thus, it was chosen as one of the candidate learning algorithms in our empirical study.

2.3.6 Instance Based Learning – k-Nearest Neighbour

The basic idea behind the k-Nearest Neighbour (k-NN) classifier is the assumption that examples located close to each other, according to a user-defined similarity metric, are highly likely to belong to the same class. This algorithm can also be derived from Bayes' rule. The technique has shown good performance on text categorization (Yang & Pedersen, 1997; Yang & Liu, 1999; Masand, 1992). The algorithm assumes that all instances correspond to points in n-dimensional space. The nearest neighbours of an instance are defined in terms of the standard Euclidean distance. An arbitrary instance x is described by a feature vector (a₁(x), a₂(x), …, aₙ(x)), where aᵢ(x) denotes the value of the i-th attribute of the instance x. The distance between two instances xᵢ and xⱼ is then defined as:

d(xᵢ, xⱼ) = √( Σ_{r=1}^{n} (aᵣ(xᵢ) − aᵣ(xⱼ))² )

The target function can be either discrete or real-valued. In our study, we assume that the target function is discrete-valued, so we have a binary classifier for each category. The pseudocode is as follows:

Train(training examples):
    For each training example (x, f(x)), add the example to the list of training examples.

Classify(query instance x_q):
    Let x₁ … x_k denote the k instances from the training examples that are nearest to x_q.
    Return f̂(x_q) ← argmax_{v ∈ V} Σ_{i=1}^{k} δ(v, f(xᵢ))
    where δ(a, b) = 1 if a = b, and δ(a, b) = 0 otherwise.

Figure 4: k-NN Algorithm (Mitchell, 1997)

The key advantage of instance-based learning is that instead of estimating the target function once for the entire instance space, it can estimate it locally and differently for each new instance to be classified.
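The pseudocode in Figure 4 translates directly into runnable form. The two-dimensional training points below are hypothetical, chosen only to illustrate the majority vote among the k nearest instances under Euclidean distance.

```python
import math
from collections import Counter

def knn_classify(train, x_q, k=3):
    """train: list of (x, label) pairs. Returns the majority label among
    the k training instances nearest to x_q under Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(train, key=lambda xy: dist(xy[0], x_q))[:k]
    votes = Counter(label for _, label in nearest)   # sum of delta(v, f(x_i))
    return votes.most_common(1)[0][0]                # argmax over labels v

# Hypothetical 2-D training examples with labels in {-1, +1}.
train = [((0.0, 0.0), -1), ((0.1, 0.2), -1), ((1.0, 1.0), +1),
         ((0.9, 1.1), +1), ((1.2, 0.8), +1)]
print(knn_classify(train, (1.0, 0.9)))  # prints 1
```

Note that "training" stores the examples and defers all distance computation to classification time, which is precisely the source of the classification-time cost discussed next.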
This method is a conceptually straightforward approach to approximating real-valued or discrete-valued target functions. In general, one disadvantage of instance-based approaches is that the cost of classifying new instances can be high, because nearly all computation takes place at classification time rather than when the training examples are first encountered. A second disadvantage is that they consider all attributes of the instances when attempting to retrieve similar training examples from memory. If the target concept depends on only a few of the many available attributes, then the instances that are truly most "similar" may be a large distance apart. However, as previous attempts to classify text with this approach have shown it to be effective (Yang, 1999), we decided to include it in our experiments.

2.4 Natural Language Processing (NLP) in Document Classification

Most information retrieval (IR) systems are not linguistically motivated. Similarly, in document classification research, most experiments are not linguistically motivated (Cullingford, 1986). Closely related to research on document classification is research on natural language processing and cognitive science. Traditionally, document classification techniques have been directed primarily at performance accuracy and hold little regard for linguistic phenomena. Much of the current document classification work is built upon techniques that represent text as a collection of terms such as words. This has been done successfully using quantitative methods based on word or character counts. However, it has been emphasized that vector space models cannot capture critical semantic aspects of document content. In this case, the representation is only superficially related to content, since language is more than simply a collection of words.
Thus, natural language processing is a key technology for building the information retrieval systems of the future (Strzalkowski, 1999). In order to study the effects of linguistically motivated knowledge sources on document classification, it is imperative to study grammar through natural language processing, so as to apply concepts from cognitive science to document classification techniques. Natural language processing research attempts to enhance the ability of the computer to analyze, understand and generate natural languages. This is performed by some form of computational or conceptual analysis that derives a meaningful structure, or semantics, from a document. The inherently ambiguous nature of natural language makes this difficult, and a variety of research disciplines are involved in the successful development of NLP systems. The mapping of words into meaningful representations is driven by morphological, syntactic, semantic, and contextual cues available in words (Cullingford, 1986). With the advancement of NLP techniques, we hope to incorporate such linguistic cues into document classification. This can be done by using NLP techniques to extract different representations of the documents, which are then used in the classification process. Medin (2000) identifies concepts such as verbs, count nouns, mass nouns, and isolated and interrelated concepts. We define such concepts as linguistically motivated knowledge sources. They can be used to derive more complex linguistically motivated features in the process of classification. The centrality of such linguistic knowledge suggests that using these sources as features can serve as an important step towards a good classification scheme.
For example, besides the individual words, the relationships between words within a sentence and a document, together with the context of what is already known of the world, help to deliver the actual meaning of a text. Research has focused on using nouns in the process of categorization, modeling the process of categorization in the real world (Chen et al., 1992; Lewis, 1992; Arampatzis et al., 2000; Basili, 2001). However, the significant differences among these results have led us to examine these features with alternate representations.

2.5 Conclusion

The bag-of-words paradigm has been the dominant feature representation in supervised classification studies. This could be due to early attempts at richer representations (Lewis, 1992) which showed negative results. With the advent of NLP techniques, there is now a compelling reason to re-examine the use of linguistically motivated knowledge sources. Although separate attempts have been made to study the effects of linguistically motivated knowledge sources on supervised document classification techniques, it is difficult to generalize a conclusion from them because of the variations introduced across studies; in some cases, conflicting results were also reported. Thus, there is a need for a systematic study that covers an extensive variety of linguistically motivated knowledge sources.

CHAPTER 3 LINGUISTICALLY MOTIVATED CLASSIFICATION

Despite the existence of extensive research on document classification, the relationship between different linguistic knowledge sources and the classification model has not been sufficiently or systematically explored. By bringing together the two streams of research, document classification and natural language processing, we hope to shed light on the effects of linguistically motivated knowledge sources with different learning algorithms. Section 3.1 discusses the shortcomings of previous research.
Section 3.2 explores the linguistically motivated knowledge sources employed to resolve these issues. Finally, Section 3.3 presents the technique used to derive the features.

3.1 Considerations

Much research in the area of document classification has focused mainly on developing techniques or on improving the accuracy of such techniques. While the underlying algorithm is an essential factor in classification accuracy, the way in which texts are represented is also an important factor that should be examined. However, attempts to produce text representations that improve effectiveness have shown inconsistent results. The classic work of Lewis (1992) showed low effectiveness of syntactic phrase indexing as a text representation, but recent works by Kongovi (2002) and Basili (2001) have shown improvements using the same representation. Table 1 shows the conclusions made by some related works. For example, noun phrases seem to behave differently with different learning algorithms. The results differ due to the inconsistencies introduced in these studies through the various datasets, taggers, learning algorithms, parameters of the learning algorithms and feature selection methods used.
Features       Algorithm/Method                  Corpus         Work                                  Results
Noun Phrase    Statistical clustering algorithm  Reuters-22173  Lewis (1992)                          Worse performance than words
Noun Phrase    RIPPER                            Reuters-21578  Scott & Matwin (1999)                 Worse performance than words
Noun Phrase    Clustering                        Reuters-21578  Kongovi (2002)                        Better performance than words
Noun Phrase    SOM                               CANCERLIT      Tolle & Chen (2000)                   Better performance than words
Nouns          Rocchio                           Reuters-21578  Basili, Moschitti & Pazienza (2001)   Better performance than words
Proper Nouns   Rocchio                           Reuters-21578  Basili, Moschitti & Pazienza (2001)   Better performance than words
Tags           Rocchio                           Reuters-21578  Basili, Moschitti & Pazienza (2001)   Better performance than words

Table 1: Summary of Previous Studies

To address the issues discussed in the previous section and the limitations of previous work, a systematic study of the effects of linguistically motivated knowledge sources with various machine learning approaches to automatic document classification is necessary. In contrast to previous work, this research conducts a comparative study and analysis of learning methods, among which are some of the most effective and popular techniques available, and reports on the accuracies of linguistically motivated knowledge sources, and novel combinations of them, using a systematic methodology that resolves the issues identified in previous work. Additionally, we try to see whether we can break away from the traditional bag-of-words paradigm. Bag-of-words refers to representing a document using words, the smallest meaningful units of a document with little ambiguity. Word-based representations have been the most common representation in previous work related to document classification, and they are the basis for most work in text classification. The obvious advantage of words lies in their simplicity and the straightforward process of obtaining the representation. However, the problem with bag-of-words is that the logical structure, layout and sequence of words are usually ignored.
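The order-discarding nature of the bag-of-words representation can be made concrete. The following minimal sketch (toy vocabulary and documents, purely illustrative) maps a document to a vector of term counts; any two documents containing the same words, in any order, receive identical vectors:

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    # Map a document to a vector of term counts over a fixed vocabulary;
    # all ordering and structural information is discarded.
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

vocab = ["cocoa", "week", "grain"]
d1 = "showers continued in the cocoa zone this week"
d2 = "week this zone cocoa the in continued showers"  # same words, scrambled
v1 = bag_of_words(d1, vocab)
v2 = bag_of_words(d2, vocab)  # identical vector: word order is lost
```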
A basic observation about using bag-of-words representations for classification is that a great deal of information from the original document, associated with its logical structure and sequence, is discarded. The major limitation is the implicit assumption that the order of words in a sentence is not relevant: paragraph, sentence and word orderings are disrupted, and syntactic structures are ignored. However, this assumption may not always hold, as words alone do not always represent true atomic units of meaning. For example, the phrase "learning algorithm" could be interpreted differently when broken up into the two separate words "learning" and "algorithm". Thus, we use linguistically motivated knowledge sources as features, to see whether we can overcome these limitations of the bag-of-words paradigm. Novel combinations of linguistically motivated knowledge sources are also proposed and presented in the next section.

3.1.1 Linguistically Motivated Knowledge Sources

Machine learning methods require each example in the corpus to be described by a vector of fixed dimensionality, where each component of the vector represents the value of one feature of the example. As a linguistic knowledge source may provide contextual cues about a document that are useful, as a feature representation, for distinguishing the category of the document, we are interested in whether the choice of different feature representations, using different linguistic knowledge sources as the input vectors to the learning algorithm, has a significant impact on document classification. We consider the following linguistic knowledge sources in our research:

1. Words, which will be used as the baseline for a comparative analysis with the other linguistically motivated knowledge sources;
2. Phrases;
3. Word sense (part-of-speech tagging);
4. Nouns;
5. Verbs;
6. Adjectives;
7. Combinations of sources with words.
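Several of these sources can be read directly off part-of-speech-tagged text. As a minimal sketch (assuming a simplified tagset in which A marks adjectives and N marks nouns, rather than a real tagger's full tag inventory), the noun-phrase pattern NP = {A, N}*N defined in Section 3.1.2 can be extracted as follows:

```python
def extract_noun_phrases(tagged):
    """Extract maximal noun phrases matching NP = {A, N}*N from a list of
    (word, tag) pairs, where 'A' marks adjectives and 'N' marks nouns."""
    phrases, run = [], []
    for word, tag in tagged + [("", "<END>")]:  # sentinel flushes the last run
        if tag in ("A", "N"):
            run.append((word, tag))
        else:
            while run and run[-1][1] != "N":    # a phrase must end with a noun
                run.pop()
            if run:
                phrases.append(" ".join(w for w, _ in run))
            run = []
    return phrases

# The example sentence from Section 3.1.2, hand-tagged for illustration.
sent = [("The", "D"), ("limping", "A"), ("old", "A"), ("man", "N"),
        ("walks", "V"), ("across", "P"), ("the", "D"),
        ("long", "A"), ("bridge", "N")]
nps = extract_noun_phrases(sent)  # ["limping old man", "long bridge"]
```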
The description of the above features, and an analysis of the advantages and disadvantages of each feature representation, are discussed below.

3.1.2 Phrase

Phrases have been found to be useful indexing units in previous research. Kongovi, Guzman & Dasigi (2002) have shown that phrases were salient features when used with category profiles. We consider one class of phrases, namely syntactic phrases: sets of words that satisfy certain syntactic relations or constitute specified syntactic structures. Here, "phrase" refers to the noun phrases identified by our parser. The data set is first parsed into the appropriate format before the phrases are extracted and segmented. A noun phrase is defined as a sequence of words that terminates with a noun. More specifically, noun phrases are defined as NP = {A, N}*N, where NP stands for noun phrase, A for adjective and N for noun. For example, in the sentence "The limping old man walks across the long bridge", the noun phrases identified are "limping old man" and "long bridge". In our work, we do not attempt to separate a noun phrase into its component sub-phrases. The advantage of phrases is that the ordering of words is not discarded: the logical structure, layout and sequence of words are retained, thus keeping some information from the original document. On the other hand, the major limitation is the greater degree of complexity involved in processing and extracting phrases as features. Although phrase-based representation has been used in information retrieval, conclusions from studies reporting the retrieval effectiveness of linguistic phrase-based representations have been inconsistent. Linguistic phrase identification was noted as improving retrieval effectiveness by Fagan (1987), but on the other hand, Mitra et al.
(1997) reported little benefit from phrase-based representations, and Smeaton (1999) reported that the benefit of phrase-based representation varied with users. Lewis (1992) undertook a major study of the use of noun phrases for statistical classification and found that the phrase representation did not produce any improvement on the Reuters-22173 corpus. As we are using a different corpus in our work, we decided to continue with phrase-based representations in our experiment, as they have not been studied before with some of the learning algorithms that we have chosen.

3.2.3 Word Sense (Part of Speech Tagging)

Word sense refers to the incorporation of part-of-speech tags with the word so that the exact word sense within a document is identified. The part of speech of a word gives its syntactic category, such as adjective, adverb, determiner, noun, verb, preposition, pronoun or conjunction. As this feature incorporates both the tag and the word, it provides the word class, or lexical tag, to the classifier. The intuition for using word sense is to capture additional information that helps to distinguish homographs that can be differentiated by the syntactic role of the word. Homographs are words with more than one meaning. For example, the word "patient" has different meanings in different syntactic roles, such as noun or adjective: used as a noun, a patient refers to an individual who is unwell, but used as an adjective, it describes a person's character as being tolerant.

3.2.4 Nouns

Gentner (1981) explored the differences between nouns and verbs and suggested that nouns differ from verbs in the relational density of their representations. The semantic components of noun meanings are more strongly interconnected than those of verbs and other parts of speech.
Hence, the meanings of nouns seem less mutable than the meanings of verbs. Nouns have been a common candidate for distinguishing among different concepts; they are often called "substantive words" in computational linguistics and "content words" in information science.

3.2.5 Verbs

Verbs are associated with motion and with relations between objects (Kersten, 1998). From an information seeking perspective, verbs do not appear to contribute to classification accuracy. In order to validate this hypothesis, verbs are included as one of the linguistically motivated knowledge sources examined in our study.

3.2.6 Adjectives

Bruce and Wiebe's (1999) work established a positive correlation between the presence of adjectives and subjectivity: the presence of one or more adjectives is useful for predicting that a sentence is subjective. Subjectivity tagging refers to distinguishing sentences used to present opinions and evaluations from sentences used to objectively present factual information. There are numerous applications for which subjectivity tagging is relevant, including information retrieval and information extraction, and the task is essential to forums and news reporting. For a complete study of the use of linguistically motivated knowledge sources, we have included adjectives as one source of linguistic knowledge in our experiment.

3.2.7 Combination of Sources

Each linguistic knowledge source generates a feature vector from the content of the document. We also examine the combination of two linguistic knowledge sources, which is a novel technique. When sources are combined, the features generated from each knowledge source are concatenated, with each source contributing half of the total number of features, and a dataset with all these features is generated.
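The concatenation of two sources described above can be sketched as follows. The vocabularies and counts are hypothetical; with equally sized vocabularies, each source contributes half of the combined feature vector, as in our setup:

```python
def combine_sources(features_a, features_b):
    # Concatenate the feature vectors produced by two knowledge sources;
    # each source contributes its own block of the combined representation.
    return list(features_a) + list(features_b)

# Hypothetical small vocabularies for illustration.
word_vector = [2, 0, 1]  # counts over a 3-term word vocabulary
noun_vector = [1, 1, 0]  # counts over a 3-term noun vocabulary
combined = combine_sources(word_vector, noun_vector)
```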
Here we combine words with the linguistically motivated knowledge sources nouns, noun phrases and adjectives, as novel combinations, to see whether there are any improvements. Since the original features are retained while some syntactic structure is also captured, there appears to be an advantage in using a combination of sources.

3.3 Obtaining Linguistically Motivated Classifiers

Our technique consists of the following steps (Figure 5). The input to the technique is a document, D. Below is an outline of the generic process proposed and employed to use the linguistically motivated knowledge sources as features:

1. The document is broken up into sentences.
2. Morphological descriptions, or tags, are assigned to each term. This NLP component performs linguistic processing on the contents and attaches a tag to every term.
3. The processed terms are parsed.
4. Linguistically motivated knowledge sources are then extracted based on the tagging requirements discussed earlier.
5. Features are combined in the binder phase if combinations of features are required.

As a final step, the set of linguistically motivated knowledge sources obtained is used as the input feature set for the training or testing phase of the documents.

[Figure 5 depicts the pipeline: Document -> Sentence Boundary Detection -> Assignment of Morphological Descriptions -> Parsing -> Extractor (Phrase, Noun, Adjective, Verb); if tags are to be combined, a Binder produces combined features such as Phrase and Words.]

Figure 5: Extracting Linguistically Motivated Knowledge Sources

CHAPTER 4 EXPERIMENT

A controlled experimental study was conducted to validate the effectiveness of linguistically motivated knowledge sources. This chapter describes the experimental setup employed throughout the study. Section 4.1 describes the evaluation data sets used in the study.
Section 4.2 presents the preprocessing methods required in the study. Section 4.3 provides details on the handling of multiple categories. Section 4.4 presents the evaluation measures used. The last section, 4.5, describes the tools utilized in the study.

4.1 Evaluation Data Sets

Standard benchmark collections are available for experimental purposes. We tested the linguistically motivated knowledge sources and their combinations presented in Chapter 3 on two widely used corpora, Reuters-21578 and WebKB, which vary in many characteristics. This section describes the characteristics of the data sets used in our experiment.

4.1.1 Reuters-21578

The first dataset we have chosen is Reuters-21578. This is a widely used collection, and it accounts for most of the experimental work in classification (Sebastiani, 2002). The dataset can be obtained from http://www.daviddlewis.com/resources/testcollections/reuters21578. The document corpus was originally collected from Reuters newswire stories manually indexed by human indexers; the data was collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system. The Reuters corpus consists of 21,578 documents classified into 135 categories, of which 5 are major categories and 118 are subcategories. More than 20,000 unique terms can be found in this corpus. The documents are in SGML format, and the majority of them carry category labels obtained through a manual process. An example of a single document in the Reuters corpus is shown in Figure 6. In our empirical study, we chose the "Modified Apte Split" (ModApte) version of Reuters-21578 because of the abundance of recent work using this configuration.
The ModApte split places all documents written on or before 7 April 1987 into the training set and the rest of the documents into the test set. Of these documents, 9,603 are used as training examples and 3,299 as test examples, some of which have no class labels attached. Only the 90 classes for which at least one training example and one test example exist are included. An analysis of the corpus reveals that the category distribution in the Reuters dataset is highly skewed. Table 2 gives an overview of the category sets and the detailed breakdown of the categories, which fall into 5 groups:

1. Exchanges: stock and commodity exchanges.
2. Organizations: economically important regulatory, financial, and political organizations.
3. People: important political and economic leaders.
4. Places: countries of the world.
5. Topics: subject categories of economic interest.

Category Set   Subcategories   Subcategories with >1 occurrence   Subcategories with >20 occurrences
EXCHANGES      39              32                                 7
ORGS           56              32                                 9
PEOPLE         267             114                                15
PLACES         175             147                                60
TOPICS         135             120                                57

Table 2: Distribution of Categories in Reuters-21578

The training set has some categories with a high number of documents and some with extremely few. On average, each document belongs to 1.3 categories. We have chosen a random subset of documents from the top 20 subcategories in our evaluation, as shown in Table 3.
Category       Train   Test
Acq            1650    719
Bop            75      30
Corn           182     56
Cpi            69      28
Crude          389     189
Dlr            131     44
Earn           2877    1087
Gnp            101     35
Grain          433     149
Interest       347     131
Livestock      75      24
Money-fx       538     179
Money-supply   140     34
Nat-gas        75      30
Oilseed        124     47
Ship           197     89
Soybean        78      33
Sugar          126     36
Trade          369     118
Wheat          212     71

Table 3: Top 20 categories in the ModApte Split

[Figure 6 reproduces a sample Reuters-21578 document in SGML: the "BAHIA COCOA REVIEW" newswire story of 26 February 1987, tagged with the topic "cocoa" and the places "el-salvador", "usa" and "uruguay".]

Figure 6: An Example of a Reuters-21578 Document

4.1.2 WebKB

The other corpus we used is the World Wide Knowledge Base (WebKB) dataset collected by Craven et al. (1999). WebKB was collated by a crawler as part of an effort to build a data corpus mirroring the World Wide Web. The primary reason for choosing this corpus is that it is from a different domain and has dissimilar characteristics from the Reuters corpus: WebKB is a web-based corpus, containing 8,282 web pages from four academic domains, contributed by different authors with very different styles of writing and presenting information on the web. The corpus is divided into two different polychotomies, topic and web domain. We have decided to use the first polychotomy, as in the experiments carried out by Nigam et al. (1998) and Bekkerman et al.
(2003), in which four categories were used: course, faculty, project and student. This gives a total of 4,199 documents. Our experiment also used these four categories, with the pages from Cornell as the test pages. A detailed breakdown of the corpus is shown in Table 4.

Category   Number   Proportion of positive examples (%)
Course     930      22.1
Faculty    1124     26.8
Project    504      12.0
Student    1641     39.1

Table 4: Breakdown of documents in WebKB Categories

4.2 Dimensionality Reduction

In a large document collection, the dimensionality of the feature space is very high, leading to greater time and memory requirements for processing the feature vectors. This calls for a feature selection phase that addresses the issue by reducing the number of features. The first section gives an overview of feature selection methods used in document classification studies, followed by a detailed explanation of the techniques adopted for our study.

4.2.1 Feature Selection

Feature selection is typically performed by assigning a weight to each term and keeping a certain number of terms with the highest scores, while discarding the rest. Experiments then evaluate classification performance on the feature vectors built from the features retained after the feature selection phase. Some learning algorithms cannot scale to high dimensionality; the LLSF algorithm, for instance, cannot. It has also been shown that some classifiers, e.g. k-NN, Rocchio and C4.5, perform worse when using all features, and that some classifiers, like C4.5, are too inefficient to use all features. An advantage of dimensionality reduction is that it tends to reduce overfitting, the phenomenon by which a classifier is tuned to the contingent characteristics of the training data rather than just the constitutive characteristics of the categories.
Classifiers that overfit are extremely accurate on the training data, but their performance on the test data is worse. To avoid overfitting, some research has suggested that 50–100 training examples per term may be needed in document classification tasks (Fuhr & Buckley, 1991); this implies that overfitting may be avoided by keeping the number of terms small relative to the number of training examples. On the other hand, removing features risks removing potentially useful information about the meaning of the documents, so the dimensionality reduction step must be done carefully. Various dimensionality reduction methods have been proposed. Methods such as document frequency thresholding and empirical mutual information are used to select a set of features to represent the documents; other methods used for feature selection include the odds ratio, the chi-square score, stop word removal and stemming. Document frequency thresholding and stop word removal are employed in this study.

4.2.1.2 Document Frequency Thresholding

Among the many feature selection methods, document frequency thresholding is a simple and effective method popular in the conventional document classification literature. Document frequency thresholding refers to keeping only features that occur in at least t training documents; in other words, each feature is weighted by its document frequency, the number of documents in which the feature occurs. This is one of the simplest but most effective techniques for vocabulary reduction (Yang & Pederson, 1999), and it scales easily to very large corpora, with a computational complexity approximately linear in the number of training documents. The document frequency of each feature in the training corpus is computed, and features whose document frequency falls below the threshold are removed from the feature space.
It is assumed that rare terms are not informative for global performance, so removing them greatly reduces the dimensionality of the feature space. Although this assumption is not always accurate, an improvement in classification accuracy is possible if the removal of rare terms in fact removes noise terms. As the emphasis of our study is on the effects of different learning algorithms with different linguistic knowledge sources, we assume that feature selection is not a major factor in our study, since the same feature selection technique, document frequency thresholding, is used in all our experimental setups across the different learning algorithms and linguistic knowledge sources. Furthermore, Yang & Pederson's (1999) work has established that the use of document frequency with the Reuters corpus is not just an ad hoc approach to improving classification accuracy but a reliable metric for feature selection; they suggest that document frequency is a good alternative when the computation of information gain or the χ2 test proves too costly.

4.2.2 Stop Words Removal

Stop words are words that occur very frequently, such as "the", and do not add any discriminative value to the classifier. Stop word elimination is therefore carried out to remove irrelevant features. Whereas document frequency thresholding removes particularly infrequent words, stop word elimination removes mostly high-frequency words from the attribute set. All words that occur in the list of stop words are not considered as features. Note that feature selection based on rankings is a greedy process that does not account for dependencies between words. A list of stop words is provided in Appendix A. We removed stop words for documents in Reuters-21578, but not for documents in WebKB, as the number of features derived from the web pages was much smaller than for Reuters-21578.
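The two selection steps above, document frequency thresholding and stop word removal, can be sketched together as follows (toy documents and threshold, purely illustrative):

```python
def select_features(documents, stop_words, min_df):
    """documents: list of token lists. Keep terms that are not stop words and
    that occur in at least min_df documents (document frequency thresholding)."""
    df = {}
    for doc in documents:
        for term in set(doc):  # count each document at most once per term
            df[term] = df.get(term, 0) + 1
    return {t for t, n in df.items() if n >= min_df and t not in stop_words}

docs = [["the", "cocoa", "price"],
        ["the", "cocoa", "review"],
        ["the", "grain", "price"]]
kept = select_features(docs, stop_words={"the"}, min_df=2)  # {"cocoa", "price"}
```

Note that "the" has the highest document frequency but is removed as a stop word, while "review" and "grain" fall below the threshold.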
4.3 Experiment Setup

4.3.1 Handling Multiple Categories Problems

For all learning algorithms, we used the default settings established in WEKA to train the classifiers. We used binary classification for every learning algorithm, with one classifier created for each category. If a document instance belongs to more than one category, it is treated as a separate instance for each of those categories.

4.3.2 Handling Multiple Categories Training Examples

As some documents in Reuters-21578 are labeled with more than one category, we created one training instance for each category that a training example in the corpus has, that is, one positive training instance for each example labeled as belonging to that category. Thus the number of training instances generated for the normal variant of the learning algorithms can be more than the number of training examples provided, whereas the number of training instances generated for the 1-per-category variant is the same as the number of training examples.

4.4 Evaluation Measures

4.4.1 Loss-based Measures

A commonly used performance measure in the machine learning community is the error rate, defined as the probability of the classification rule predicting the wrong class:

E(h) = Pr(h(X) != Y | h)

Estimators for these measures can be defined using a contingency table of predictions on an independent test set.
Each cell of the contingency table (see Table 5) represents one of the four possible outcomes of the prediction h(x) for an example:

f++ is the number of instances belonging to category c that the classifier correctly predicts as belonging to c;
f+- is the number of instances not belonging to category c that the classifier incorrectly predicts as belonging to c;
f-+ is the number of instances belonging to category c that the classifier incorrectly predicts as not belonging to c;
f-- is the number of instances not belonging to category c that the classifier correctly predicts as not belonging to c.

                          Predicted Category +1   Predicted Category -1
Actual Category y = +1    f++                     f-+
Actual Category y = -1    f+-                     f--

Table 5: Contingency Table

Formally, the conventional estimator for the error rate, in terms of the contingency table and rooted in probabilistic theory, is

E(h) = (f+- + f-+) / (f++ + f+- + f-+ + f--)     (7)

Sometimes a cost matrix may be introduced, where the utility of predicting a positive example correctly is higher or lower than that of predicting a negative example correctly, and vice versa.

4.4.2 Recall and Precision

To avoid the problems of using accuracy as a measure in text classification experiments, we consider two other performance measures, recall and precision. Recall is the probability that a document with label y = 1 is classified correctly:

Recall(h) = Pr(h(x) = 1 | y = 1, h)     (8)

Recall(h)_estimate = f++ / (f++ + f-+)     (9)

Precision is the probability that a document classified as h(x) = 1 is classified correctly:

Precision(h) = Pr(y = 1 | h(x) = 1, h)     (10)

Precision(h)_estimate = f++ / (f++ + f+-)     (11)

4.4.3 Precision and Recall Breakeven Point

While precision and recall accurately describe classification performance, it is difficult to compare different learning algorithms on two disparate scores.
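The estimators in equations (7), (9) and (11) can be computed directly from the contingency counts. In this minimal sketch, f++ is written as fpp, f+- as fpm, f-+ as fmp and f-- as fmm, and the counts themselves are hypothetical:

```python
def error_rate(fpp, fpm, fmp, fmm):
    # Equation (7): E(h) = (f+- + f-+) / (f++ + f+- + f-+ + f--)
    return (fpm + fmp) / (fpp + fpm + fmp + fmm)

def recall(fpp, fmp):
    # Equation (9): f++ / (f++ + f-+)
    return fpp / (fpp + fmp)

def precision(fpp, fpm):
    # Equation (11): f++ / (f++ + f+-)
    return fpp / (fpp + fpm)

# Hypothetical contingency counts for one category.
fpp, fpm, fmp, fmm = 8, 2, 4, 86
e = error_rate(fpp, fpm, fmp, fmm)  # (2 + 4) / 100 = 0.06
r = recall(fpp, fmp)                # 8 / 12
p = precision(fpp, fpm)             # 8 / 10 = 0.8
```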
One popular evaluation method for balancing precision and recall is the F-measure proposed by van Rijsbergen (1979). This measure combines precision and recall with a parameter β specifying the importance of recall relative to precision, and is defined by equation 12:

F_β(h) = ((1 + β²) · Precision(h) · Recall(h)) / (β² · Precision(h) + Recall(h))     (12)

If β = 1, precision and recall are equally important and have the same weight. The performance measure we employ is the F1 measure, given in equation 13:

F1(h) = (2 · Precision(h) · Recall(h)) / (Precision(h) + Recall(h))     (13)

4.4.4 Micro- and Macro-Averaging

The F-measure measures the effectiveness of a classifier on a single class. However, as most text classification tasks contain many categories, each category has its own F-measure value. Two further measures, the macro-average and the micro-average, were therefore used in this study. The macro-average takes the average of the F1 values over all categories:

F1_Macro = (1/m) · Σ_{i=1}^{m} F1(hi)     (14)

The micro-average, on the other hand, uses the contingency table values of every category: a new averaged contingency table is obtained by component-wise addition, giving averages for f++, f+-, f-+ and f--. The micro F1 value is then defined as:

F1_Micro = 2·f++^Average / (2·f++^Average + f+-^Average + f-+^Average)     (15)

4.5 Tools

Finally, to facilitate the study of the effects of linguistically motivated knowledge sources with learning algorithms, a generic system design was conceptualized, as shown in Figure 7. The system consists of four main modules:

1. The document management module;
2. The feature preprocessor module;
3. The learning classifier module;
4. The Graphical User Interface (GUI) module.
Tools that were readily available were tested and integrated into the system where applicable, such as in the learning classifier module. The other modules had to be implemented.

[Figure 7: Design of system, showing the document management, feature engine, learning classifier and user interface modules, together with the document database, the learned model and the category output for a new document]

The details of each module follow. The document management module contains functions to parse and process the documents in the document repository. It consists of a document handler and a filter sub-component. The document handler parses each document based on its existing format, and the filter performs additional pre-processing, such as removing HTML tags for the WebKB documents. Having been preprocessed, the strings of text are passed to the feature engine module, where the documents are converted into the linguistic units, or tokens, that serve as sources of linguistic knowledge for the learning classifier. The strings are tokenized and an appropriate feature selection is performed on them. In our empirical study, linguistically motivated tokens have been used. This entails the use of a part-of-speech tagger that tags the tokenized text. In our experiments, we have employed the SNOW tagger (http://l2r.cs.uiuc.edu/~cogcomp/cc-software.html). The classifier module contains several learning algorithms; in our system, WEKA 3-6 has been integrated. Finally, the GUI module contains the code to render the appropriate user interface for the user to execute the classifier. Currently, the UI module consists of a raw text-based interface.
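The flow through the modules just described, from raw document to tagged tokens, can be sketched in outline. This is a hypothetical Python sketch for illustration only: the actual system used the SNOW tagger and WEKA, neither of which is reproduced here, and `toy_pos_tag` is a crude stand-in stub, not a real tagger:

```python
import re

def strip_html(raw):
    # Filter sub-component: remove HTML tags (as for the WebKB pages)
    return re.sub(r"<[^>]+>", " ", raw)

def tokenize(text):
    # Feature engine: split the cleaned text into lower-cased word tokens
    return re.findall(r"[A-Za-z']+", text.lower())

def toy_pos_tag(tokens):
    # Stand-in for the SNOW tagger: a naive suffix heuristic, illustration only
    def tag(tok):
        if tok.endswith("ing") or tok.endswith("ed"):
            return "VB"
        if tok.endswith("ous") or tok.endswith("ful"):
            return "JJ"
        return "NN"
    return [(tok, tag(tok)) for tok in tokens]

doc = "<html><body>Standard oil company posted rising profits</body></html>"
tagged = toy_pos_tag(tokenize(strip_html(doc)))
print(tagged)
```

The tagged token stream produced at this stage is what the feature engine turns into the word, phrase, noun, verb and adjective representations evaluated in Chapter 5.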
Based on the setup described in this chapter, a series of experiments was conducted using 6 learning algorithms – Naïve Bayes (NB), Support Vector Machines (SVM), the k-NN instance-based learner (k-NN), C4.5, RIPPER (JRip) and Alternating Decision Trees (ADTree) – with different linguistically motivated sources, including novel combinations of these sources using the approach proposed in Chapter 3. Results from the experiments conducted with this system are discussed in detail in the next chapter.

CHAPTER 5 RESULTS & ANALYSIS

In this chapter, the results of our experiments are presented. The results for each type of linguistically motivated knowledge source are discussed in detail, and we then give an overall assessment and analysis of the interaction effects of using different linguistically motivated feature representations with different learning algorithms. Section 5.1 presents an extensive set of experimental results based on the Reuters-21578 corpus, together with comparisons of each of the linguistically motivated classifiers. Section 5.2 presents the experimental results for the WebKB corpus.

5.1 Results

Experiments were conducted on the Reuters-21578 and WebKB corpora. This section describes the series of experiments carried out on these corpora using the different linguistically motivated knowledge sources and learning algorithms mentioned in Chapter 2.

5.2 Contribution of Different Linguistic Knowledge Sources to Classification of the Reuters-21578 Corpus

5.2.1 Words

Table 6 shows the results of using the word-based representation. This set of results will be used as a baseline for comparison with the other linguistic knowledge sources.
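The word-based baseline just referred to is the standard bag-of-words representation: each document becomes a vector of term counts over a fixed vocabulary. A minimal sketch of the idea (our own illustration with hypothetical data, not the thesis system's code):

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    # Map a token list to term-frequency features over a fixed vocabulary;
    # tokens outside the vocabulary are simply ignored
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vocab = ["oil", "company", "profit", "election"]
doc = ["oil", "company", "posted", "oil", "profit"]
print(bag_of_words(doc, vocab))  # [2, 1, 1, 0]
```

Every other representation evaluated below (phrases, tags, nouns, verbs, adjectives and their combinations) keeps this vector form and changes only which tokens populate the vocabulary.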
Since the F1 value lies between 0 and 1, with higher values indicating a better classifier, it can be seen that relatively good performance was achieved using words as the representation for each of the learning algorithms tested. The results in Table 6 serve as the baseline against which the linguistically motivated classifiers are compared.

Algorithm   Micro F1   Macro F1
NB          0.860      0.860
SVM         0.884      0.882
k-NN        0.868      0.867
C4.5        0.870      0.866
Ripper      0.864      0.860
ADTree      0.887      0.885

Table 6: Results using Words

5.2.2 Phrase

The results of using the noun phrase representation are compared against the word representation in Table 7. Surprisingly, the majority of the results show that the word-based representations do better than the noun phrases. This corresponds to the work of Lewis (1992) and Scott (1999), and indicates that some of the informational content may have been lost.

From our observations, some of the noun phrases were not identified correctly by SNOW, which could have led to a decrease in the micro and macro values. It is worth noting, however, that support vector machines gave the best performance with phrase-based representations and showed a minimal improvement in the macro value. The poor performance of the new linguistic classifiers built using phrases as the linguistically motivated knowledge source could be due to:
i) The sparse distribution of noun phrases.
ii) Synonymous phrases, which dilute the contribution of features that essentially have the same meaning. For example, the noun phrases "Standard oil company" and "oil company" become two separate features.
iii) Noise introduced by the separation of synonymous phrases into multiple phrases and by the inaccurate identification of noun phrases by the tagging tool.
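Noun phrases of the kind discussed above are produced by chunking over POS-tagged text. The following is a deliberately crude sketch of the idea, collecting maximal runs of adjective/noun tags; it is our own illustration of the principle, not the chunking actually performed by the SNOW-based tooling:

```python
def noun_phrases(tagged):
    # Group maximal runs of adjective/noun tokens into candidate phrases,
    # flushing the current run whenever another part of speech appears
    phrases, run = [], []
    for tok, tag in tagged + [("", "END")]:  # sentinel forces a final flush
        if tag in ("JJ", "NN"):
            run.append(tok)
        else:
            if run:
                phrases.append(" ".join(run))
                run = []
    return phrases

tagged = [("standard", "JJ"), ("oil", "NN"), ("company", "NN"),
          ("posted", "VB"), ("rising", "VB"), ("profits", "NN")]
print(noun_phrases(tagged))  # ['standard oil company', 'profits']
```

The example also illustrates failure mode ii) above: "standard oil company" and a separately occurring "oil company" would become two distinct features even though they denote the same entity.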
            Word                  Phrase
Algorithm   Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
NB          0.860      0.860      0.833      -3.140       0.830      -3.488
SVM         0.884      0.882      0.852      -3.620       0.884       0.227
k-NN        0.868      0.867      0.835      -3.802       0.834      -3.806
C4.5        0.870      0.866      0.842      -3.218       0.837      -3.349
Ripper      0.864      0.860      0.828      -4.167       0.822      -4.419
ADTree      0.887      0.885      0.848      -4.397       0.847      -4.294

Table 7: Results using Phrases

5.2.3 Word Sense (Part of Speech Tags)

There was a drop in the micro and macro averages for the word sense representation, which indicates that there is no clear advantage in using word sense as a representation for document classification. Although word senses have been reported to produce better results in the word sense disambiguation literature (Ng & Lee, 1996), this is not the case for document classification. Adding part-of-speech tags adds little improvement because few terms have very different meanings under different word senses. The only learning algorithm that worked well with word senses, or part-of-speech tags, was C4.5. Our findings indicate that C4.5 actually does better on both micro and macro F1 with this linguistically motivated feature than with words.

            Word                  Tags
Algorithm   Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
NB          0.860      0.860      0.815      -5.233       0.812      -5.581
SVM         0.884      0.882      0.864      -2.262       0.863      -2.154
k-NN        0.868      0.867      0.843      -2.880       0.842      -2.884
C4.5        0.870      0.866      0.879       1.034       0.880       1.617
Ripper      0.864      0.860      0.856      -0.926       0.854      -0.698
ADTree      0.887      0.885      0.881      -0.676       0.881      -0.453

Table 8: Results using Word Senses (Tags)

However, it is probable that the word sense representation would be very effective in data corpora containing terms that carry multiple meanings under different word senses. For example, the word "sweet" has different word senses with different meanings: as a noun it has the approximate meaning of "candy", while as an adjective it may mean "lovable".
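Distinguishing such senses amounts to fusing each token with its part-of-speech tag, so that "sweet" as a noun and "sweet" as an adjective become distinct features. A sketch of the idea (our own illustration, not the system's code):

```python
def sense_features(tagged):
    # Append the POS tag to the token so the same surface word yields
    # distinct features under different parts of speech
    return [f"{tok}_{tag}" for tok, tag in tagged]

print(sense_features([("sweet", "NN")]))  # ['sweet_NN']
print(sense_features([("sweet", "JJ")]))  # ['sweet_JJ']
```

When few words in a corpus are genuinely sense-ambiguous, as appears to be the case for Reuters-21578, this mainly fragments the feature space without adding discriminating power, which is consistent with the drops seen in Table 8.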
Since the Reuters-21578 documents do not appear to exhibit this characteristic, using this linguistically motivated knowledge source does not improve on the baseline classifier for most of the learning algorithms, C4.5 excepted.

5.2.4 Nouns

The linguistic classifiers built using a bag-of-nouns representation actually produced superior results, with improvements ranging from 3.05% to 4.76%. Except for Naïve Bayes, the results indicate that nouns are able to capture most of the informational content of the documents, showing improved micro and macro values. The reason for this improvement could be that the semantic components of noun meanings are more strongly interconnected than those of verbs and word senses. Nouns appear to have captured features with the salient concepts required for classification accuracy.

            Word                  NN
Algorithm   Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
NB          0.860      0.860      0.861       0.116       0.857      -0.349
SVM         0.884      0.882      0.924       4.525       0.924       4.762
k-NN        0.868      0.867      0.901       3.802       0.899       3.691
C4.5        0.870      0.866      0.908       4.368       0.906       4.619
Ripper      0.864      0.860      0.891       3.125       0.888       3.256
ADTree      0.887      0.885      0.913       2.931       0.912       3.051

Table 9: Results using Nouns

5.2.5 Verbs

The results in Table 10 show a large drop in the micro and macro F1 values. Since verbs are more mutable than nouns, it follows that the micro and macro values should fare worse than those of nouns, as reflected in the results. Thus, verbs are not a good choice of feature representation for classification tasks. This may also indicate that the removal of verbs from the documents may not affect classification accuracy for the learning algorithms adopted in this study.
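The noun-only and verb-only representations compared above amount to filtering the tagged token stream by part of speech before building the feature vectors. An illustrative sketch (hypothetical tags and data, our own naming):

```python
def filter_by_pos(tagged, keep):
    # Keep only tokens whose POS tag is in the requested set,
    # e.g. keep={"NN"} for the bag-of-nouns representation
    return [tok for tok, tag in tagged if tag in keep]

tagged = [("oil", "NN"), ("company", "NN"), ("posted", "VB"), ("profits", "NN")]
print(filter_by_pos(tagged, {"NN"}))  # ['oil', 'company', 'profits']
print(filter_by_pos(tagged, {"VB"}))  # ['posted']
```

The same filter with keep={"JJ"} yields the bag-of-adjectives representation evaluated in the next subsection.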
            Word                  Verb
Algorithm   Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
NB          0.860      0.860      0.636      -26.047      0.625      -27.326
SVM         0.884      0.882      0.637      -27.941      0.625      -29.138
k-NN        0.868      0.867      0.636      -26.728      0.628      -27.566
C4.5        0.870      0.866      0.619      -28.851      0.601      -30.600
Ripper      0.864      0.860      0.594      -31.250      0.566      -34.186
ADTree      0.887      0.885      0.628      -29.199      0.611      -30.960

Table 10: Results using Verbs

5.2.6 Adjectives

The results (see Table 11) give a clear indication that there is no advantage in using adjectives as feature representations, owing to the drop in performance compared to using words alone. Compared to verbs, however, the performance is slightly better.

            Word                  Adjective
Algorithm   Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
NB          0.860      0.860      0.739      -14.070      0.730      -15.116
SVM         0.884      0.882      0.738      -16.516      0.732      -17.007
k-NN        0.868      0.867      0.750      -13.594      0.747      -13.841
C4.5        0.870      0.866      0.726      -16.552      0.717      -17.206
Ripper      0.864      0.860      0.725      -16.088      0.713      -17.093
ADTree      0.887      0.885      0.740      -16.573      0.733      -17.175

Table 11: Results using Adjectives

5.2.7 Combination of Sources

The combination of word and noun phrase representations led to surprising results: improvements were shown across several learning algorithms over the word representations, and these improvements were consistent across the different learning algorithms. This shows that the use of noun phrases together with the bag-of-words representation could confer higher discriminating power.

            Word                  Combine
Algorithm   Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
NB          0.860      0.860      0.864       0.465       0.861       0.116
SVM         0.884      0.882      0.892       0.905       0.892       1.134
k-NN        0.868      0.867      0.880       1.382       0.879       1.384
C4.5        0.870      0.866      0.878       0.920       0.877       1.270
Ripper      0.864      0.860      0.864       0.000       0.862       0.233
ADTree      0.887      0.885      0.892       0.564       0.891       0.678

Table 12: Results using both the Linguistically Motivated Knowledge Source and Words
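Combining two knowledge sources, as in Table 12, amounts to concatenating their feature spaces so that each document is represented by both its word features and its phrase features. A sketch under our own naming (the prefixes are one illustrative way to keep the two spaces disjoint; the thesis does not prescribe a particular encoding):

```python
def combined_features(words, phrases):
    # Concatenate the two feature spaces; the "w:"/"p:" prefixes keep a word
    # and an identically spelled phrase from colliding as a single feature
    return [f"w:{w}" for w in words] + [f"p:{p}" for p in phrases]

words = ["oil", "company", "profits"]
phrases = ["standard oil company"]
print(combined_features(words, phrases))
```

The combined vector retains the coverage of the bag of words while adding the more specific phrase features, which is consistent with the small but uniform gains reported in Table 12.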
5.2.8 Analysis of Reuters-21578 Results

The results were analysed from two perspectives: that of the features and that of the learning algorithm used. The former allows us to identify the features that will improve the classification accuracy of the learning algorithm. The latter allows us to select the appropriate type of learning algorithm for other applications based on the characteristics of the data; for example, if a data corpus consists of many adjectives, we can tell which learning algorithm is suitable for classifying its documents.

            Word            Phrase          Combine         Tag             NN              Verb            Adjective
Algorithm   Micro  Macro    Micro  Macro    Micro  Macro    Micro  Macro    Micro  Macro    Micro  Macro    Micro  Macro
NB          0.860  0.860    0.833  0.830    0.864  0.861    0.815  0.812    0.861  0.857    0.636  0.625    0.739  0.730
SVM         0.884  0.882    0.852  0.884    0.892  0.892    0.864  0.863    0.924  0.924    0.637  0.625    0.738  0.732
k-NN        0.868  0.867    0.835  0.834    0.880  0.879    0.843  0.842    0.901  0.899    0.636  0.628    0.750  0.747
C4.5        0.870  0.866    0.842  0.837    0.878  0.877    0.879  0.880    0.908  0.906    0.619  0.601    0.726  0.717
Ripper      0.864  0.860    0.828  0.822    0.864  0.862    0.856  0.854    0.891  0.888    0.594  0.566    0.725  0.713
ADTree      0.887  0.885    0.848  0.847    0.892  0.891    0.881  0.881    0.913  0.912    0.628  0.611    0.740  0.733

Table 13: Contribution of Knowledge Sources on the Reuters-21578 data set (Micro-averaged F1, Macro-averaged F1)

Table 13 shows the micro-averaged and macro-averaged F1 measures for the different knowledge sources and learning algorithms on the Reuters-21578 data set. The seven column groups in the table correspond to: (i) using only words; (ii) using only phrases; (iii) using both words and phrases; (iv) using part-of-speech tags with words; (v) using only nouns; (vi) using only verbs; and (vii) using only adjectives.
Each of these knowledge sources was used with document frequency as the feature selection technique.

[Figure 8: Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Micro F1 values)]

The best micro-averaged F1 value for Reuters-21578 is 92.4% (Figure 8), obtained using nouns as the linguistically motivated knowledge source and SVM as the learning algorithm. The results indicate that SVM performs best with nouns, then combined words and phrases, words and phrases, relatively worse with adjectives, and worst with verbs. The next best performing setup by micro-averaged F1 value, at 91.3%, is ADTree with nouns. Verbs were consistently the worst performing linguistically motivated knowledge source when employed as the sole feature representation, across all the learning algorithms. It is worth noting that for each of the learning algorithms tested, the best performing linguistically motivated knowledge source was nouns, except for Naïve Bayes, for which combining linguistically motivated knowledge sources generated the best performer.

[Figure 9: Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Macro F1 values)]

The best macro-averaged F1 value turned out to be the same as for the micro-averaged F1 values.
SVM, together with nouns as the linguistically motivated knowledge source, outperformed all other classifiers. Similar to the micro-averaged F1 results, the order of performance of the linguistically motivated knowledge sources with SVM is: nouns, combined words and phrases, words, phrases, tags, adjectives and verbs. As in the previous set of results, the next best performing classifier was ADTree with nouns as the linguistically motivated knowledge source, with a macro-averaged F1 value of 91.2%. The macro-averaged data also reveal the same trend in the order of performance of each of these linguistically motivated features.

Algorithm   Word    Phrase   Combine   Tag     NN      Verb    Adjective
NB          0.898   0.872    0.884     0.845   0.891   0.687   0.750
SVM         0.902   0.871    0.886     0.867   0.930   0.677   0.734
k-NN        0.874   0.842    0.871     0.814   0.916   0.638   0.708
C4.5        0.898   0.873    0.861     0.889   0.920   0.678   0.724
Ripper      0.904   0.872    0.862     0.888   0.912   0.646   0.718
ADTree      0.902   0.871    0.880     0.884   0.918   0.670   0.730

Table 14: Contribution of knowledge sources on Reuters-21578 (Precision)

Table 14 tabulates the precision values for each of the learning algorithms using the different feature representations as knowledge sources; Figure 10 shows the results graphically. The precision values consolidate our previous observation that using nouns alone gives the best results for SVM, k-NN, C4.5, Ripper and ADTree. However, unlike the previous case, where Naïve Bayes worked best with combined words and phrases as features, precision is higher with words than with the combined representations most of the time. The worst performing feature representation is clearly verbs, consistently across all the learning algorithms.
[Figure 10: Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Precision)]

Algorithm   Word    Phrase   Combine   Tag     NN      Verb    Adjective
NB          0.830   0.806    0.852     0.793   0.836   0.600   0.746
SVM         0.874   0.847    0.906     0.871   0.923   0.620   0.766
k-NN        0.872   0.841    0.895     0.883   0.890   0.657   0.805
C4.5        0.851   0.827    0.904     0.883   0.903   0.589   0.755
Ripper      0.834   0.807    0.877     0.836   0.879   0.616   0.782
ADTree      0.881   0.841    0.911     0.888   0.912   0.614   0.774

Table 15: Contribution of knowledge sources on Reuters-21578 (Recall)

From Table 15 we see that the combined feature representations actually outperform the nouns-only representation on three of the learning algorithms; Figure 11 shows the results graphically. The combined representation works best with Naïve Bayes, k-NN and C4.5, while noun features outperform the other feature representations, including the combined representations, with SVM, Ripper and ADTree. As reflected in the micro- and macro-averaged values, the worst performing representation is verbs. From our analysis of these preliminary results, we see potential in further combining words, phrases and nouns to produce even better results. These findings led us to investigate the contribution of novel combinations of linguistically motivated features in another experiment, discussed in a later section.
[Figure 11: Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Recall)]

Based on the results, it appears that extracting nouns as features gives the best results for all the learning algorithms tested, while adjectives and verbs are not as effective for document classification. This could be because nouns capture the informative terms in this data set better than the other knowledge sources; the removal of non-informative terms improves the results by reducing noise in the category prediction process. The findings also pointed towards the use of novel combinations of linguistically motivated knowledge sources with the bag of words, which is explored using the WebKB corpus.

5.3 Contribution of Linguistic Knowledge Sources to Classification Accuracy on the WebKB Corpus

Following the interesting findings of the previous experiment, we now attempt to reproduce the set of linguistically motivated features on a different data corpus. In addition, we also use novel combinations of linguistically motivated features with the bag-of-words representation to determine the effects of such combinations on document classification. A similar experiment was conducted on the WebKB collection.

5.3.1 Words

Table 16 shows the results of using the word-based representation, which will also serve as the baseline classifier for the WebKB corpus. As with the Reuters-21578 corpus, good performance was achieved using words as the representation for each of the learning algorithms tested on WebKB.
Algorithm     Micro F1   Macro F1
Naïve Bayes   0.92       0.87
SVM           0.98       0.97
k-NN          0.73       0.65
AdaBoost      0.96       0.93
ADTree        1.00       0.99
Ripper        1.00       0.99
C4.5          0.77       0.99

Table 16: Results using Words

5.3.2 Phrase

Table 17 shows the results of using phrase-based features to build linguistically motivated classifiers. Similar to the Reuters corpus, the results give no indication that phrases help to improve the F1 measure; in fact, they show a substantial drop in both micro and macro values for every learning algorithm. The drop in performance of the phrase-based linguistically motivated classifiers could be due to the sparseness of the features, especially noun phrases: the content of the HTML pages in the WebKB corpus yields far fewer features than that of Reuters. Potentially, this also reduces the effectiveness of the part-of-speech tagger used in the preprocessing phase.

              Word                  Phrase
Algorithm     Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
Naïve Bayes   0.92       0.87       0.73       -20.7        0.71       -18.4
SVM           0.98       0.97       0.72       -26.5        0.69       -28.9
k-NN          0.73       0.65       0.50       -31.5        0.40       -38.5
AdaBoost      0.96       0.93       0.92        -4.2        0.70       -24.7
ADTree        1.00       0.99       0.71       -29.0        0.67       -32.3
Ripper        1.00       0.99       0.80       -20.0        0.72       -27.3
C4.5          0.771      0.994      0.75        -2.7        0.58       -41.6

Table 17: Results using Phrases

5.3.3 Word Sense (Part of Speech Tags)

Using tagged terms as a linguistically motivated knowledge source on WebKB showed mixed results across the learning algorithms. As shown in Table 18, the micro F1 values for SVM and C4.5 show that the linguistically motivated classifiers are as competitive as the bag-of-words classifier; for the other learning algorithms, however, these features showed a drop in performance.
              Word                  Tag
Algorithm     Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
Naïve Bayes   0.92       0.87       0.75       -18.5        0.69       -20.7
SVM           0.98       0.97       0.98         0.0        0.72       -25.8
k-NN          0.73       0.65       0.57       -21.9        0.50       -23.1
AdaBoost      0.96       0.93       0.89        -7.3        0.68       -26.9
ADTree        1.00       0.99       0.56       -44.0        0.63       -36.4
Ripper        1.00       0.99       0.79       -21.0        0.72       -27.3
C4.5          0.77       0.99       0.77         0.0        0.69       -30.3

Table 18: Results using Word Senses (Tags)

5.3.4 Nouns

The results for nouns were surprising: they showed a drop in performance on both micro and macro F1 values, even though noun-based linguistically motivated knowledge sources gave the best performance on Reuters-21578. An analysis of the results and features shows that there were very few noun features in WebKB that could give the learning algorithms discriminating power. This finding may also imply interaction effects between the characteristics of the data and the learning algorithms (Yang, 2001).

              Word                  Nouns
Algorithm     Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
Naïve Bayes   0.92       0.87       0.88        -4.3        0.78       -10.3
SVM           0.98       0.97       0.38       -61.2        0.76       -21.6
k-NN          0.73       0.65       0.40       -45.2        0.43       -33.8
AdaBoost      0.96       0.93       0.89        -7.3        0.77       -17.2
ADTree        1.00       0.99       0.76       -24.0        0.74       -25.3
Ripper        1.00       0.99       0.69       -31.0        0.66       -33.3
C4.5          0.771      0.994      0.53       -31.3        0.54       -45.7

Table 19: Results using Nouns

5.3.5 Verbs

Just as we had expected, performance dropped sharply when only verb features were used. First, the micro and macro values, expected to fare worse than those of nouns, did so because of the sparse distribution of verbs in the web pages. Second, the verb features obtained from the feature engineering phase included many words that do not contribute to the semantic content of the web pages.
For example, words such as "do" and "go" confer no distinguishing value on the linguistically motivated classifiers because they can occur in any of the four categories tested.

              Word                  Verb
Algorithm     Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
Naïve Bayes   0.92       0.87       0.48       -47.8        0.45       -48.3
SVM           0.98       0.97       0.49       -50.0        0.44       -54.6
k-NN          0.73       0.65       0.41       -43.8        0.32       -50.8
AdaBoost      0.96       0.93       0.56       -41.7        0.49       -47.3
ADTree        1.00       0.99       0.43       -57.0        0.32       -67.7
Ripper        1.00       0.99       0.41       -59.0        0.30       -69.7
C4.5          0.771      0.994      0.51       -33.9        0.35       -64.8

Table 20: Results using Verbs

5.3.6 Adjectives

The results for adjectives are given in Table 21. The use of a bag of adjectives also caused a large drop in performance. Unlike the results obtained on Reuters, the adjective-based linguistically motivated knowledge sources fared worse than the verb-based ones. A close examination of the resulting features reveals that the distribution of adjectives in the WebKB corpus is even sparser than that of verbs, suggesting that it was too sparse to confer any distinguishing value on the classifiers.

              Word                  Adjectives
Algorithm     Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
Naïve Bayes   0.92       0.87       0.63       -31.5        0.48       -44.8
SVM           0.98       0.97       0.76       -22.4        0.35       -63.9
k-NN          0.73       0.65       0.38       -47.9        0.35       -46.2
AdaBoost      0.96       0.93       0.61       -36.5        0.37       -60.2
ADTree        1.00       0.99       0.29       -71.0        0.33       -66.7
Ripper        1.00       0.99       0.26       -74.0        0.28       -71.7
C4.5          0.771      0.994      0.36       -53.3        0.48       -51.7

Table 21: Results using Adjectives

5.3.7 Nouns & Words

Next, we combined nouns and words and compared the results to those of using the bag of words. The resulting linguistically motivated classifiers were as competitive as the baseline classifier and even performed better for some of the learning algorithms. Competitive classification accuracies were found with ADTree, Ripper and AdaBoost.
Improvements were found with Naïve Bayes, SVM and C4.5.

              Word                  Nouns + Words
Algorithm     Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
Naïve Bayes   0.92       0.87       0.93         1.1        0.90         3.4
SVM           0.98       0.97       0.99         1.0        0.98         1.0
k-NN          0.73       0.65       0.68        -6.8        0.59        -9.2
AdaBoost      0.96       0.93       0.95        -1.0        0.83       -10.8
ADTree        1.00       0.99       1.00         0.0        0.99         0.0
Ripper        1.00       0.99       1.00         0.0        0.99         0.0
C4.5          0.771      0.994      0.99        28.4        0.99        -0.4

Table 22: Results using Nouns & Words

5.3.8 Phrase & Words

Linguistically motivated classifiers based on phrases and words were also tested; the results are shown in Table 23. On micro values, improvements were shown for SVM and C4.5, while Naïve Bayes, AdaBoost, ADTree and Ripper with these linguistically motivated knowledge sources also showed competitive micro results. However, a slight decrease in accuracy was observed with k-NN.

              Word                  Phrases + Words
Algorithm     Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
Naïve Bayes   0.92       0.87       0.92         0.0        0.87         0.0
SVM           0.98       0.97       0.99         1.0        0.92        -5.2
k-NN          0.73       0.65       0.71        -2.7        0.51       -21.5
AdaBoost      0.96       0.93       0.96         0.0        0.82       -11.8
ADTree        1.00       0.99       1.00         0.0        0.99         0.0
Ripper        1.00       0.99       0.99        -1.0        0.99         0.0
C4.5          0.771      0.994      0.98        27.1        0.84       -15.5

Table 23: Results using Phrases & Words

5.3.9 Adjectives & Words

Lastly, a combination of adjectives and the bag of words was tested; the results are shown in Table 24. The micro F1 values were encouraging, with improvements shown for SVM, k-NN and C4.5, and competitive micro and macro results for Naïve Bayes, ADTree and AdaBoost.
              Word                  Adjectives + Words
Algorithm     Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
Naïve Bayes   0.92       0.87       0.92         0.0        0.88         1.1
SVM           0.98       0.97       0.99         1.0        0.96        -1.0
k-NN          0.73       0.65       0.81        11.0        0.52       -20.0
AdaBoost      0.96       0.93       0.90        -6.2        0.81       -12.9
ADTree        1.00       0.99       1.00         0.0        0.99         0.0
Ripper        1.00       0.99       1.00         0.0        0.99         0.0
C4.5          0.771      0.994      0.99        28.4        0.98        -1.4

Table 24: Results using Adjectives & Words

5.3.10 Analysis of WebKB Results

From the results, it appears that the bag of words does well with most of the learning algorithms. However, it is worth highlighting that the combinations of words with adjectives, nouns and phrases were as competitive as, and with some learning algorithms performed better than, the bag of words. Unlike the results obtained for Reuters-21578, however, the WebKB results were less conclusive because of the differing trends among them. The best performing classifier differs across the learning algorithms, but was mainly either the bag of words or a linguistically motivated knowledge source combined with words. In the case of SVM, the combination of nouns and words performed better than words alone. Similar to the previous set of results, verbs and adjectives did not seem to add much discriminative value to the classification.

Observations based on the micro-averaged F1 values (Figure 12) show that the best performing classifiers were found with ADTree and Ripper. ADTree works well with words, the combination of nouns and words, the combination of phrases and words, and the combination of adjectives and words. Ripper also works well with words, the combination of nouns and words, and the combination of adjectives and words. This suggests that combinations of linguistically motivated classifiers can be as competitive as using words alone on a web-based corpus. The best performing classifiers on the macro-averaged F1 values were ADTree, Ripper and C4.5 (see Figure 13).
Combinations of linguistically motivated knowledge sources with words showed high macro-averaged scores with these learning algorithms.

              Word             Phrase           Tag              NN               Verb
Algorithm     Micro   Macro    Micro   Macro    Micro   Macro    Micro   Macro    Micro   Macro
Naïve Bayes   0.92    0.87     0.73    0.71     0.75    0.69     0.88    0.78     0.48    0.45
SVM           0.98    0.97     0.72    0.69     0.98    0.72     0.38    0.76     0.49    0.44
k-NN          0.73    0.65     0.50    0.40     0.57    0.50     0.40    0.43     0.41    0.32
AdaBoost      0.96    0.93     0.92    0.70     0.89    0.68     0.89    0.77     0.56    0.49
ADTree        1.00    0.99     0.71    0.67     0.56    0.63     0.76    0.74     0.43    0.32
Ripper        1.00    0.99     0.80    0.72     0.79    0.72     0.69    0.66     0.41    0.30
C4.5          0.77    0.99     0.75    0.58     0.77    0.69     0.53    0.54     0.51    0.35

              Adjectives       Nouns & Words    Phrase & Words   Adjectives & Words
Algorithm     Micro   Macro    Micro   Macro    Micro   Macro    Micro   Macro
Naïve Bayes   0.63    0.48     0.93    0.90     0.92    0.87     0.92    0.88
SVM           0.76    0.35     0.99    0.98     0.99    0.92     0.99    0.96
k-NN          0.38    0.35     0.68    0.59     0.71    0.51     0.81    0.52
AdaBoost      0.61    0.37     0.95    0.83     0.96    0.82     0.90    0.81
ADTree        0.29    0.33     1.00    0.99     1.00    0.99     1.00    0.99
Ripper        0.26    0.28     1.00    0.99     0.99    0.99     1.00    0.99
C4.5          0.36    0.48     0.99    0.99     0.98    0.84     0.99    0.98

Table 25: Consolidated Results of WebKB

[Figure 12: Comparison of Different Linguistically Motivated Knowledge Sources on WebKB (Micro F1 values)]

[Figure 13: Comparison of Different Linguistically Motivated Knowledge Sources on WebKB (Macro F1 values)]

5.4 Summary
of Results

The conclusions from the first part of our study suggest that document classification techniques work well with the noun-based linguistic source on Reuters-21578. The later part of our study, on WebKB, indicated that linguistically motivated classification using combined representations was as competitive as the baseline classifier using words. Previous research has highlighted the interaction effects of the data corpus with the feature selection methods used (Yang, 1999); our results suggest that the usefulness of linguistically motivated knowledge sources likewise depends on the characteristics of the data corpus. Reuters-21578 comes from the financial domain and comprises financial news containing a substantial number of terms, whereas WebKB comes from a web-based domain and consists mainly of HTML pages. This suggests that supervised document classification using nouns may be suitable for corpora with a large number of features, while documents with fewer features, such as HTML documents, may work well with a combination of words and linguistically motivated knowledge sources. Based on our findings, we therefore see potential in employing linguistically motivated knowledge sources, especially nouns and combinations of these sources with words, to improve classification accuracy.

CHAPTER 6 CONCLUSION

In this final chapter, we conclude with the contributions of this project and some final thoughts.

6.1 Summary

This research builds on previous empirical studies of document classification. The use of several linguistically motivated knowledge sources, including novel combinations of these sources as features, has been explored in this study.
This study extends previous research on the use of natural language processing in document classification, covering several learning algorithms with different linguistically motivated knowledge sources not explored in previous studies. Linguistically motivated knowledge sources such as nouns were found to have significant interaction effects on the Reuters-21578 corpus. The use of novel knowledge sources combining part of speech with bag-of-words also showed improvements in the performance measures on the WebKB corpus. This suggests that linguistically motivated classifiers can help improve classification performance on these two corpora.

6.2 Contributions

This thesis makes several contributions. Firstly, a thorough literature review has been conducted across several streams of research, ranging from text classification and human cognition to knowledge management and personalization, and an integrative technique drawing ideas from these fields has been proposed. Previous works focused mainly on improving techniques within the bag-of-words paradigm; very few examined, in a single experiment, the effects of linguistically motivated knowledge sources as feature representations on document classification performance across several learning algorithms. In our experiments, we evaluated several combinations of linguistically motivated knowledge sources with various learning algorithms not covered in previous studies. This research adds a relatively new dimension to current work on document classification by shedding light on the use of natural language processing techniques to derive linguistically motivated knowledge sources as features. The potential of using such features with various popular learning algorithms to improve document classification was also investigated.
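One way to realize such a combination of part of speech with bag-of-words is to keep the noun-derived features in a separate namespace alongside the ordinary word counts. The sketch below assumes POS-tagged input (in the thesis a tagger produced such tags upstream); the `NOUN=` prefix is an illustrative convention, not the thesis's actual implementation:

```python
def combined_features(tagged_tokens):
    """Bag-of-features from POS-tagged tokens: ordinary word counts plus
    noun-only features kept under a "NOUN=" prefix, so the two knowledge
    sources combine without colliding.

    tagged_tokens: list of (word, tag) pairs with Penn Treebank-style
    tags (NN, NNS, NNP, ... for nouns), assumed to come from a tagger.
    """
    feats = {}
    for word, tag in tagged_tokens:
        w = word.lower()
        feats[w] = feats.get(w, 0) + 1            # bag-of-words term
        if tag.startswith("NN"):                  # noun knowledge source
            feats["NOUN=" + w] = feats.get("NOUN=" + w, 0) + 1
    return feats

# A toy, hand-tagged document (tags assumed, not produced by a tagger here):
doc = [("learning", "NN"), ("algorithms", "NNS"),
       ("classify", "VBP"), ("documents", "NNS")]
```

A feature dictionary of this shape can then be vectorized and fed to any of the learning algorithms compared above.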
Even though previous research studies provide answers and interpretations of their results, several questions were left open, such as whether different feature representations combined with different learning algorithms play a role in document classification results; we attempt to answer this question in our research. In addition, we have proposed novel combinations of linguistically motivated knowledge sources with words whose performance appeared competitive.

With the increasing pressure from the exponential growth of available information, the importance of research on document classification cannot be overemphasized. The results of our experiments can contribute to studies that employ learning algorithms to categorize information so as to help people find and locate information, thereby alleviating the information overload problem. This thesis has examined and evaluated a series of classifiers, each incorporating different linguistically motivated knowledge sources and representations. We find that certain linguistic knowledge sources, such as nouns and noun phrases combined with words, can offer substantial improvements over classifiers using the bag-of-words technique. In contrast with earlier works, this thesis presents a comparative empirical evaluation of learning algorithms paired with different linguistic knowledge sources. We evaluated seven learning algorithms (Naïve Bayes, Support Vector Machine, k-Nearest Neighbours, AdaBoost, ADTree, RIPPER and C4.5) and several linguistic knowledge sources (word, phrase, word sense, nouns, verbs and adjectives), together with combinations of some of these sources with words. We systematically evaluated the effectiveness of each learning algorithm using these knowledge sources on the Reuters-21578 and WebKB data corpora.
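The systematic evaluation described here amounts to crossing every learning algorithm with every knowledge source and recording the micro/macro F1 pair for each cell, as laid out in Table 25. A minimal driver for such a grid, with a hypothetical `evaluate` callable standing in for the WEKA training/evaluation runs actually used, might look like:

```python
from itertools import product

def run_experiment(algorithms, representations, evaluate):
    """Cross every learning algorithm with every knowledge source and
    record the (micro F1, macro F1) pair for each cell, mirroring the
    layout of Table 25. `evaluate` is caller-supplied."""
    return {(algo, rep): evaluate(algo, rep)
            for algo, rep in product(algorithms, representations)}

def best_by_micro(results):
    # The (algorithm, representation) cell with the highest micro F1.
    return max(results, key=lambda cell: results[cell][0])

# Illustration using a few WebKB figures from Table 25 as canned scores:
scores = {("SVM", "word"): (0.98, 0.97), ("SVM", "nouns"): (0.38, 0.76),
          ("K-NN", "word"): (0.73, 0.65), ("K-NN", "nouns"): (0.40, 0.43)}
results = run_experiment(["SVM", "K-NN"], ["word", "nouns"],
                         lambda a, r: scores[(a, r)])
```

Keeping the cross-product explicit is what makes the comparison systematic: every algorithm sees every representation under identical conditions.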
It was found that learning algorithms can improve classification accuracy with an appropriate choice of knowledge sources. We have shown that the effectiveness of slightly different linguistic knowledge sources can vary substantially across classification algorithms. From our experiments, a support vector machine with nouns as the knowledge source gave the best accuracy on Reuters-21578. For web-based classification, on the other hand, combinations of linguistically motivated knowledge sources with bag-of-words, using learning algorithms such as ADTree, Ripper and C4.5, worked as well as or better than bag-of-words alone.

6.3 Limitations

This study employed several conventional yet popular learning algorithms. With ongoing research in machine learning, however, the learning algorithms used in our study are only a subset of the many available, so the empirical study should not stop at the current set of algorithms.

6.4 Conclusion

Document classification systems are integral to the success of knowledge management systems, and there is increasing interest in applying supervised document classification techniques within them. The findings of this study help to enhance the understanding of incorporating linguistically motivated knowledge sources into document classification. It is hoped that this research will prove valuable as an extension to similar studies.

REFERENCES

Arampatzis, A., van der Weide, Th.P., Koster, C.H.A. and van Bommel, P. (2000). An Evaluation of Linguistically-motivated Indexing Schemes. In Proceedings of the BCS-IRSG 2000 Colloquium on IR Research, 5th-7th April 2000, Sidney Sussex College, Cambridge, England.

Arampatzis, A., van der Weide, Th.P., Koster, C.H.A. and van Bommel, P. (2000).
Linguistically-motivated Information Retrieval. Encyclopedia of Library and Information Science, Volume 69, pp. 201-222, Marcel Dekker, Inc., New York, Basel, 2000.

Basili, R., Moschitti, A. and Pazienza, M.T. (2001). NLP-driven IR: Evaluating Performances over a Text Classification Task. In Proceedings of the 10th International Joint Conference on Artificial Intelligence (IJCAI-2001), August 4th, Seattle, Washington, USA.

Basili, R., Moschitti, A. and Pazienza, M.T. (2002). Empirical investigation of fast text classification over linguistic features. In Proceedings of the 15th European Conference on Artificial Intelligence (ECAI-2002), July 21-26, 2002, Lyon, France.

Basu, Chumki, Hirsh, Haym and Cohen, William W. (1998). Recommendation as Classification: Using Social and Content-Based Information in Recommendation. In Proceedings of the Fifteenth National Conference on Artificial Intelligence and Tenth Innovative Applications of Artificial Intelligence Conference, AAAI 98, IAAI 98, July 26-30, 1998, Madison, Wisconsin, USA, pp. 714-720.

Bekkerman, R., El-Yaniv, R., Tishby, N. and Winter, Y. (2003). Distributional Word Clusters vs. Words for Text Categorization. Journal of Machine Learning Research, 3, pp. 1183-1208.

Bruce, Rebecca F. and Wiebe, Janyce M. (1999). Recognizing subjectivity: a case study in manual tagging. Natural Language Engineering, 5(2).

Cardie, C., Ng, V., Pierce, D. and Buckley, C. (2000). Examining the Role of Statistical and Linguistic Knowledge Sources in a General-Knowledge Question-Answering System. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), pp. 180-187, ACL/Morgan Kaufmann.

Chen, H. (1992). Knowledge-based document retrieval: framework and design. Journal of Information Science, 18(1992), pp. 293-314. Elsevier Science Publishers, B.V.

Cohen, William W. (1996). Learning Trees and Rules with Set-Valued Features.
In Proceedings of the Thirteenth National Conference on Artificial Intelligence and Eighth Innovative Applications of Artificial Intelligence Conference, AAAI 96, IAAI 96, August 4-8, 1996, Portland, Oregon - Volume 1. AAAI Press / The MIT Press, pp. 709-716.

Cohen, William W. (1995). Fast Effective Rule Induction. In Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, California, USA, July 9-12, 1995. Armand Prieditis, Stuart J. Russell (Eds.). Morgan Kaufmann, pp. 115-123.

Cohen, William W. and Hirsh, H. (1998). Joins That Generalize: Text Classification Using WHIRL. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD'98).

Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K. and Slattery, S. (1999). Learning to Construct Knowledge Bases from the World Wide Web. Artificial Intelligence, Elsevier.

Cullingford, Richard E. (1986). Natural Language Processing: A Knowledge-Engineering Approach. Rowman & Littlefield, New Jersey.

Dumais, S.T. (1994). Latent Semantic Indexing (LSI) and TREC-2. In: D. Harman (Ed.), The Second Text REtrieval Conference (TREC-2), National Institute of Standards and Technology Special Publication, pp. 105-116.

Dumais, S.T., Platt, J., Heckerman, D. and Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Proceedings of ACM-CIKM98, Nov. 1998, pp. 148-155.

Dumais, S., Cutrell, E. and Chen, H. (2001). Optimizing Search by Showing Results in Context. In Proceedings of the SIG-CHI Conference on Human Factors in Computing, March 31 - April 4, 2001, Seattle, WA, USA, ACM, 2001.

Dumais, Susan T. and Chen, H. (2000). Hierarchical Classification of Web Content. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 24-28, 2000, Athens, Greece, pp. 256-263.

Fagan, Joel L. (1987).
Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods. PhD thesis, Cornell University, 1987.

Farhoomand, A.F. and Drury, D.H. (2002). Managerial Information Overload. Communications of the ACM, 45, pp. 127-131.

Freund, Yoav and Schapire, Robert E. (1996). Experiments with a New Boosting Algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML '96), July 3-6, Bari, Italy, pp. 148-156.

Freund, Yoav and Mason, Llew (1999). The Alternating Decision Tree Learning Algorithm. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), June 27-30, Bled, Slovenia, pp. 124-133.

Fuhr, N. and Buckley, C. (1991). A Probabilistic Learning Approach for Document Indexing. ACM Transactions on Information Systems, 9(3), pp. 223-248.

Furnkranz, J., Mitchell, T. and Riloff, E. (1998). A Case Study in Using Linguistic Phrases for Text Categorization on the WWW. In Proceedings of the 1st AAAI Workshop on Learning for Text Categorization, pp. 5-12, Madison, US.

Furnkranz, J. and Widmer, G. (1994). Incremental Reduced Error Pruning. In Proceedings of the 11th International Conference on Machine Learning (ML-94), pp. 70-77, New Brunswick, NJ, Morgan Kaufmann.

Gentner, Dedre (1981). Some interesting differences between nouns and verbs. Cognition and Brain Theory, 4, pp. 161-178.

Joachims, T. (1997). A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML '97).

Joachims, T. (1998). Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of ECML-98, 10th European Conference on Machine Learning, April 21-23, Chemnitz, Germany, Springer, pp. 137-142.

Joachims, T. (2002). Learning to Classify Text Using Support Vector Machines. Kluwer Academic Publishers, Boston, Hardbound.

Kankanhalli, A., Tan, B.C.Y. and Wei, K.K. (2001).
Seeking Knowledge in Electronic Knowledge Repositories: An Exploratory Study. In Proceedings of the Twenty-Second International Conference on Information Systems, New York, Association for Computing Machinery, pp. 123-133.

Kongovi, Madhusudhan, Guzman, Juan Carlow and Dasigi, Venu (2002). Text Categorization: An Experiment Using Phrases. In Advances in Information Retrieval, Proceedings of the 24th BCS-IRSG European Colloquium on IR Research, Glasgow, UK, March 25-27, 2002 (ECIR 2002), Springer-Verlag Berlin Heidelberg, Lecture Notes in Computer Science 2291, pp. 213-228.

Moulinier, I., Raskinis, G. and Ganascia, J.G. (1996). Text Categorization: A Symbolic Approach. In Proceedings of the 5th Annual Symposium on Document Analysis and Information Retrieval (SDAIR'96), pp. 87-99.

Lang, K. (1995). NewsWeeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, July 9-12, Tahoe City, California, USA, Morgan Kaufmann, pp. 331-339.

Larkey, L.S. and Croft, W.B. (1996). Combining Classifiers in Text Categorization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96), pp. 179-187.

Lewis, D. (1992). Representation and Learning in Information Retrieval. PhD Thesis, University of Massachusetts, Amherst.

Lewis, D.D. and Ringuette, M. (1994). A Comparison of Two Learning Algorithms for Text Categorization. Third Annual Symposium on Document Analysis and Information Retrieval, pp. 81-93.

Luigi, G., Fabrizio, S. and Maria, S. (2000). Feature Selection and Negative Evidence in Automated Text Categorization. In Proceedings of the ACM KDD-00 Workshop on Text Mining, Boston, US.

Masand, Brij M., Linoff, Gordon and Waltz, David L. (1992). Classifying News Stories using Memory Based Reasoning.
In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark, June 21-24, 1992. ACM, pp. 59-65.

McCallum, A. and Nigam, K. (1998). A Comparison of Event Models for Naïve Bayes Text Classification. AAAI-98 Workshop on Learning for Text Categorization, Technical Report WS-98-05, AAAI Press.

Mitchell, T.M. (1997). Machine Learning. McGraw Hill, New York, NY.

Medin, D.L., Lynch, E.B. and Solomon, K.O. (2000). Are There Kinds of Concepts? Annual Review of Psychology, 51, pp. 121-147.

Mitra, M., Buckley, C., Singhal, A. and Cardie, C. (1997). An analysis of statistical and syntactic phrases. In Proceedings of RIAO '97, Montreal, Canada, June 25-27, 1997.

Mladenic, D. (1998). Feature Subset Selection in Text Learning. In Proceedings of the 10th European Conference on Machine Learning (ECML'98), pp. 95-100.

Ng, H.T. and Lee, H.B. (1996). Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, June 24-27, 1996, University of California, Santa Cruz, California, USA. Morgan Kaufmann Publishers / ACL 1996, pp. 40-47.

Nigam, K., McCallum, A., Thrun, S. and Mitchell, T. (1998). Learning to Classify Text from Labeled and Unlabeled Documents. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), pp. 792-799.

Domingos, Pedro and Pazzani, Michael J. (1996). Simple Bayesian Classifiers Do Not Assume Independence. In Proceedings of the Thirteenth National Conference on Artificial Intelligence and Eighth Innovative Applications of Artificial Intelligence Conference, AAAI 96, IAAI 96, August 4-8, 1996, Portland, Oregon - Volume 2. AAAI Press / The MIT Press, 1996.

Platt, J. (1999). Using Sparseness and Analytic QP to Speed Training of Support Vector Machines. Advances in Neural Information Processing Systems, 11, M. S.
Kearns, S. A. Solla, D. A. Cohn, eds., MIT Press.

Platt, J. (1998). Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Microsoft Research Technical Report MSR-TR-98-14.

Quinlan, J. Ross (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.

Reuters-21578 data set [online] http://www.research.att.com/~lewis/reuters21578.html

Richardson, R. and Smeaton, A.F. (1995). Using WordNet in a knowledge-based approach to information retrieval. [online]

Rogati, M. and Yang, Y. (2002). High-performing feature selection for text classification. In Proceedings of the 10th Conference for Information and Knowledge Management (CIKM-2002).

Roussinov, D.G. and Chen, H.C. (1999). Document clustering for electronic meetings: an experimental comparison of two techniques. Decision Support Systems, 27(1-2), pp. 67-79.

Sahami, M. (1998). Using Machine Learning to Improve Information Access. PhD Thesis, Stanford University, Computer Science Department. STAN-CS-TR-98-1615.

Schapire, R.E., Singer, Y. and Singhal, A. (1998). Boosting and Rocchio Applied to Text Filtering. In Proceedings of the 21st Annual International Conference on Research and Development in Information Retrieval (SIGIR'98), pp. 215-223.

Schutze, H. and Silverstein, H. (1997). Projections for efficient document clustering. In Proceedings of the 20th Annual International Conference on Research and Development in Information Retrieval (SIGIR'97), pp. 74-81, Philadelphia, PA, July 1997.

Scott, S. and Matwin, S. (1998). Text Classification Using WordNet Hypernyms. In Proceedings of the Coling-ACL'98 Workshop on the Usage of WordNet in Natural Language Processing Systems, Montreal, Canada.

Scott, S. and Matwin, S. (1999). Feature Engineering for Text Classification. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML'99), Bled, Slovenia, June 27-30, Morgan Kaufmann.

Sebastiani, Fabrizio (2002). Machine Learning in Automated Text Categorization.
ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 1-47.

Siegel, E.V. and McKeown, K.R. (2001). Learning Methods to Combine Linguistic Indicators: Improving Aspectual Classification and Revealing Linguistic Insights. Association for Computational Linguistics.

SNOW. URL: http://l2r.cs.uiuc.edu/~cogcomp/cc-software.html

Smeaton, A. (1999). Using NLP or NLP resources for Information Retrieval Tasks. In Natural Language Information Retrieval, Kluwer, Boston, MA.

Strzalkowski, T., Carballo, J.P., Karlgren, J., Hulth, A., Tapanainen, P. and Lahtinen, T. (1999). Natural Language Information Retrieval: TREC-8 Report. In Proceedings of the Eighth Text REtrieval Conference (TREC-8), Gaithersburg, Maryland, November 17-19, NIST.

Tolle, K.M. and Chen, H. (2000). Comparing Noun Phrasing Techniques for Use with Medical Digital Library Tools. Journal of the American Society for Information Science, 51(4), pp. 352-370.

Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer-Verlag.

WEKA. URL: http://www.cs.waikato.ac.nz/ml/weka/

Yang, Y. and Pedersen, J.O. (1997). A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML 1997), Nashville, Tennessee, USA, July 8-12, 1997. Morgan Kaufmann, pp. 412-420.

Yang, Y. and Liu, X. (1999). A Re-examination of Text Categorization Methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), August 15-19, 1999, Berkeley, USA. ACM, pp. 42-49.

Yang, Y. (2001). A Study on Thresholding Strategies for Text Categorization. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'01), pp. 137-145.

Zelikovitz, S. and Hirsh, H. (2000). Improving Short Text Classification Using Unlabeled Background Knowledge to Assess Document Similarity.
In Proceedings of the 17th International Conference on Machine Learning, pp. 1183-1190.

Zelikovitz, S. and Hirsh, H. (2001). Using LSI for Text Classification in the Presence of Background Text. In Proceedings of the 10th Conference for Information and Knowledge Management, 2001.