INCORPORATING LINGUISTICALLY MOTIVATED
KNOWLEDGE SOURCES INTO DOCUMENT
CLASSIFICATION
GOH JIE MEIN
BSc(Hons 1, NUS)
A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF INFORMATION SYSTEMS
NATIONAL UNIVERSITY OF SINGAPORE
2004
ACKNOWLEDGEMENTS
This thesis could not have been completed without the constant guidance and assistance of many people, whom I must acknowledge here.
Firstly, I am deeply grateful to my advisor, Associate Professor Danny Poo for his
constant guidance, encouragement and understanding. He is instrumental to the
development of this thesis and I sincerely thank him for providing valuable advice,
direction and insights for my research.
My deep appreciation also goes out to all my friends, peers and colleagues who have helped me in one way or another:
I wish to thank Klarissa Chang, Koh Wei Chern, Cheo Puay Ling and Wong Foong Yin
for their listening ears and uplifting remarks.
I also thank colleagues and friends, including Michelle Gwee, Wang Xinwei, Liu Xiao, Koh Chung Haur, Li Yan, Li Huixian, Santosa Paulus, Wan Wen, Bryan Low, Chua Teng Chwan, Tok Wee Hyong, Indriyati Atmosukarto, Colin Tan, Julian Lin and the pioneering batch of Schemers, for keeping life pleasurable in the office.
I would also like to express my sincere thanks to A/P Chan Hock Chuan and A/P
Stanislaw Jarzabek for evaluating my research.
My thanks also go to all professors, teaching staff, administrative staff, friends and students.
Last but not least, I would like to thank my family especially my parents, sisters and
Melvin Lye for their relentless moral support, motivation, advice, love and
understanding.
TABLE OF CONTENTS

Acknowledgements
Contents
List of Tables
List of Figures

1. Introduction
   1.1 Background & Motivation
   1.2 Aims and Objectives
   1.3 Thesis Plan

2. Literature Review
   2.1 Document Classification
   2.2 Feature Selection Methods
   2.3 Machine Learning Algorithms
       2.3.1 Naïve Bayes
       2.3.2 Support Vector Machines (SVM)
       2.3.3 Alternating Decision Trees
       2.3.4 C4.5
       2.3.5 Ripper
       2.3.6 Instance Based Learning – k-Nearest Neighbour
   2.4 Natural Language Processing (NLP) in Document Classification
   2.5 Conclusion

3. Linguistically Motivated Classification
   3.1 Considerations
   3.2 Linguistically Motivated Knowledge Sources
       3.2.1 Phrase
       3.2.2 Word Sense (Part of Speech Tags)
       3.2.3 Nouns
       3.2.4 Verbs
       3.2.5 Adjectives
       3.2.6 Combination of Sources
   3.3 Obtaining Linguistically Motivated Classifiers

4. Experiment
   4.1 Evaluation Data Sets
       4.1.1 Reuters-21578
       4.1.2 WebKB
   4.2 Dimensionality Reduction
       4.2.1 Feature Selection
       4.2.2 Document Frequency Thresholding
       4.2.3 Stop Words Removal
   4.3 Experiment Setup
       4.3.1 Handling Multiple Categories Problems
       4.3.2 Handling Multiple Categories Training Examples
   4.4 Evaluation Measures
       4.4.1 Loss Based Measures
       4.4.2 Recall & Precision
       4.4.3 Precision Recall Breakeven Point
       4.4.4 Micro- and Macro-Averaging
   4.5 Tools

5. Results & Evaluation
   5.1 Results
   5.2 Contribution of Different Linguistically Motivated Knowledge Sources to Classification of Reuters-21578 Corpus
       5.2.1 Words
       5.2.2 Phrase
       5.2.3 Word Sense (Part of Speech Tags)
       5.2.4 Nouns
       5.2.5 Verbs
       5.2.6 Adjectives
       5.2.7 Combination of Sources
       5.2.8 Analysis of Reuters-21578 Results
   5.3 Contribution of Linguistically Motivated Knowledge Sources to Classification Accuracy of WebKB Corpus
       5.3.1 Words
       5.3.2 Phrase
       5.3.3 Word Sense (Part of Speech Tags)
       5.3.4 Nouns
       5.3.5 Verbs
       5.3.6 Adjectives
       5.3.7 Nouns & Words
       5.3.8 Phrase & Words
       5.3.9 Adjectives & Words
       5.3.10 Analysis of WebKB Results
   5.4 Summary of Results

6. Conclusion
   6.1 Summary
   6.2 Contributions
   6.3 Limitations
   6.4 Conclusion
SUMMARY
This thesis describes an empirical study of the effects of linguistically motivated knowledge sources used with different learning algorithms. Using up to nine different linguistically motivated knowledge sources and six different learning algorithms, we examined classification accuracy on two benchmark corpora: Reuters-21578 and WebKB. Classifiers built on one linguistically motivated knowledge source, nouns, outperformed the traditional bag-of-words classifiers on Reuters-21578, and the best results for this corpus were obtained using nouns with support vector machines. On the other hand, experiments with WebKB showed that classifiers built using novel combinations of linguistically motivated knowledge sources were as competitive as those built using the conventional bag-of-words technique.
LIST OF TABLES

Table 1: Summary of Related Studies
Table 2: Distribution of Categories in Reuters-21578
Table 3: Top 20 Categories in ModApte Split
Table 4: Breakdown of Documents in WebKB Categories
Table 5: Contingency Table
Table 6: Results using Words
Table 7: Results using Phrases
Table 8: Results using Tags
Table 9: Results using Nouns
Table 10: Results using Verbs
Table 11: Results using Adjectives
Table 12: Results using both Linguistically Motivated Knowledge Sources and Words
Table 13: Contribution of Knowledge Sources on Reuters-21578 (F1 Measures)
Table 14: Contribution of Knowledge Sources on Reuters-21578 (Precision)
Table 15: Contribution of Knowledge Sources on Reuters-21578 (Recall)
Table 16: Results using Words
Table 17: Results using Phrases
Table 18: Results using Tags
Table 19: Results using Nouns
Table 20: Results using Verbs
Table 21: Results using Adjectives
Table 22: Results using Nouns & Words
Table 23: Results using Phrases & Words
Table 24: Results using Adjectives & Words
Table 25: Consolidated Results of WebKB
LIST OF FIGURES

Figure 1: Document Classification Process
Figure 2: Linear SVM
Figure 3: An Example of an ADTree
Figure 4: k-NN Algorithm
Figure 5: Extracting Linguistically Motivated Knowledge Sources
Figure 6: An Example of a Reuters-21578 Document
Figure 7: Design of System
Figure 8: Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Micro F1 Values)
Figure 9: Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Macro F1 Values)
Figure 10: Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Precision)
Figure 11: Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Recall)
Figure 12: Comparison of Different Linguistically Motivated Knowledge Sources on WebKB (Micro F1 Values)
Figure 13: Comparison of Different Linguistically Motivated Knowledge Sources on WebKB (Macro F1 Values)
CHAPTER 1
INTRODUCTION
1.1 Background & Motivation
With the emerging importance of knowledge management, research areas such as document classification, information retrieval and information extraction each play a critical role in the success of knowledge management initiatives. Studies have shown that perceived output quality is an essential factor in the successful implementation and adoption of knowledge management technologies (Kankanhalli et al., 2001). Large document archives such as electronic knowledge repositories offer a wealth of information, from which methods in the fields of information retrieval and document classification are used to derive knowledge.
Coupled with the accessibility of voluminous amounts of information on the World Wide Web, this information explosion has brought about other problems. Users are often overwhelmed by the deluge of information and suffer from a decreased ability to assimilate it. Research has suggested that users feel bored or frustrated when they receive too much information (Roussinov and Chen, 1999), which can lead to a state where an individual is no longer able to effectively process the amount of information he is exposed to, lowering decision quality in a given amount of time. This problem is exacerbated by the proliferation of information in organizational electronic repositories and on the World Wide Web (Farhoomand and Drury, 2002).
Document classification has been applied to the categorization of search results and has been shown to alleviate the problem of information overload. Presenting documents in categories works better than a flat list of results because it enables users to disambiguate the categories quickly and then focus on the relevant one (Dumais, Cutrell and Chen, 2001). It is also useful for distinguishing documents containing words with multiple meanings (polysemy), a characteristic prevalent among English words.
Experiments on supervised document classification techniques have predominantly used the bag-of-words technique, whereby the words of a document are used as features. Alternate formulations of a meaning can, however, be introduced through linguistic variation, such as the syntax that determines how words are associated. Although some studies have employed alternate features such as linguistic sources, they have employed only a subset of linguistic sources and learning algorithms (Lewis, 1992; Arampatzis et al., 2000; Kongovi et al., 2002). Thus, this study extends previous work on document classification and aims to find ways to improve the classification accuracy of documents.
1.2 Aims and Objectives
Differences among previous empirical studies could be introduced by differences in the tagging tools used, the learning algorithms, the parameters tuned for each learning algorithm, the feature selection methods employed and the datasets involved (Yang, 1999). Thus, it is difficult to draw a sound conclusion from previous work. Since previous works documenting results based on linguistically motivated features with learning algorithms produced inconsistent and sometimes conflicting results, we propose a systematic study of multiple learning algorithms and linguistically motivated knowledge sources as features. Some of these features are novel combinations of linguistically motivated knowledge sources that were not explored in previous studies. With a systematic and controlled study, we can resolve some of these ambiguities and offer a sound conclusion. In our study, consistency in the dataset, learning algorithms, tagging tools and feature selection was maintained so that we could obtain a valid assessment of the effectiveness of linguistically motivated features.
The aim of this thesis is also to provide a solid foundation for research on feature representations in text classification and to study the factors affecting the machine learning algorithms used in document classification. One such factor is the effect of feature engineering, which we study by utilizing linguistically motivated knowledge sources as features.
Thus, the objectives of this thesis are listed below:

1. To examine the approach of using linguistically motivated knowledge sources, based on concepts derived from natural language processing, as features with popular learning algorithms, systematically varying both the learning algorithms and the feature representations. We based our evaluation on the Reuters-21578 and WebKB corpora, benchmark corpora that have been widely used in previous research.

2. To examine the feasibility of applying novel combinations of linguistically motivated knowledge sources, and to explore the performance of these combinations as features on the accuracy of document classification.
1.3 Thesis Plan
This thesis is composed of the following chapters:
Chapter 1 provides the background, motivation and objectives of this research.
Chapter 2 provides a literature review of document classification research. Here we bring together literature from different fields, namely document classification, machine learning techniques and natural language processing, and give detailed coverage of the algorithms chosen for this study. In addition, this chapter overviews the rudimentary knowledge required in later chapters.
Chapter 3 describes the types of linguistically motivated knowledge sources and the
novel combinations used in our study.
Chapter 4 provides a description of the experimental setup. It also briefly describes the performance measures used to evaluate the classifiers and the tools employed to conduct the study.
Chapter 5 provides an analysis of the results and suggests implications for practice.
Chapter 6 concludes with the contributions, findings and limitations of our study. Suggestions for future research that could extend this work are also provided.
CHAPTER 2
LITERATURE REVIEW
Document classification has traditionally been carried out using the bag-of-words paradigm. Research on natural language processing has produced fruitful results that can be applied to document classification, offering an avenue for improving accuracy through a different set of features. This chapter reviews the literature underpinning this research. Section 2.1 gives an overview of document classification and the focus of previous research. Section 2.2 overviews common feature selection techniques used in previous studies. Section 2.3 introduces the machine learning algorithms adopted in our study. Section 2.4 presents the concepts of natural language processing employed to derive linguistically motivated knowledge sources.
2.1 Document Classification
Document classification, the focus of this work, refers to the task of assigning a document to categories based on its underlying content. Although this task was carried out effectively using knowledge engineering approaches in the 1980s, machine learning approaches have superseded knowledge engineering for document classification in recent years. While knowledge engineering approaches produced effective classification accuracy, the machine learning approach offers advantages such as cost effectiveness, portability and accuracy competitive with that of human experts, together with considerable efficiency (Sebastiani, 2002). Thus, supervised machine learning techniques are employed in this study.
A supervised approach involves three main phases: feature extraction, training and testing. The entire process of document classification using machine learning methods is illustrated in Figure 1.
Figure 1: Document Classification Process
(The figure shows a training phase, in which feature selection over the training corpus produces labeled training feature vectors that are used to build a classifier model, and a test phase, in which unlabeled test feature vectors from the test corpus are assigned a category by that model.)
There are two phases involved in learning-based document classification: the training phase and the testing phase. In the training phase, pre-labeled documents are collected. This set of pre-labeled documents is called the training corpus, training data or training set; these terms are used interchangeably throughout this thesis. Each document in the training corpus is transformed into a feature vector, and these feature vectors are fed to a learning algorithm, which builds a classifier model from the training set. The model is then used in the testing phase to label a set of documents that are new to the classifier, called the test corpus, test data or test set.
The classification problem can be formally represented as follows:

f_c(d) -> {true, false}, where d ∈ D

Given a set of training documents D and a set of categories C, the classification problem is defined as a function f_c that maps each document d in the test set T to a boolean value, where 'true' indicates that the document is categorized under c ∈ C and 'false' indicates that it is not, based on the characteristics of the training documents in D.
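To make the formulation concrete, the per-category boolean functions f_c can be sketched as follows (a toy illustration: the word-overlap "learner", the category names and the documents are invented and merely stand in for the real learning algorithms of Section 2.3):

```python
# One boolean classifier f_c per category: f_c(d) -> True/False.
# Toy training data; the thesis uses Reuters-21578 and WebKB documents.
training_docs = [
    ("oil prices rise on supply fears", "crude"),
    ("wheat harvest exceeds forecast", "grain"),
    ("crude oil exports fall", "crude"),
]
categories = {"crude", "grain"}

def train_binary(category):
    """Return a trivial f_c: d -> bool, based on word overlap with the
    category's training documents (a stand-in for a real learner)."""
    vocab = set()
    for text, label in training_docs:
        if label == category:
            vocab.update(text.split())
    return lambda d: len(vocab & set(d.split())) >= 2

classifiers = {c: train_binary(c) for c in categories}

d = "oil supply fears push crude exports lower"
labels = {c for c, f in classifiers.items() if f(d)}
print(labels)  # the test document is assigned only to "crude"
```

Note that a document may be assigned to several categories (or none), which is exactly why the multi-label problem decomposes into one binary decision per category.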
2.2 Feature Selection Methods
With a large number of documents and features, the document classification process usually involves a feature selection step to reduce the dimensionality of the feature space. Feature selection methods retain only the most informative features from the original set. Feature selection methods commonly employed in previous studies include document frequency (Dumais et al., 1998; Yang and Pedersen, 1997), chi-square (Schutze et al., 1997; Yang and Pedersen, 1997), information gain (Lewis and Ringuette, 1994; Yang and Pedersen, 1997) and mutual information (Dumais et al., 1998; Larkey and Croft, 1996).
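Document frequency thresholding, the simplest of these methods, can be sketched in a few lines (a toy illustration; the corpus and the min_df threshold are invented):

```python
from collections import Counter

# Toy corpus; real studies use Reuters-21578 / WebKB documents.
docs = [
    "oil prices rise",
    "oil exports fall",
    "wheat harvest rises",
    "rare singleton term",
]

def df_threshold(documents, min_df=2):
    """Keep only features (words) whose document frequency is >= min_df.
    Document frequency = number of documents containing the word."""
    df = Counter()
    for text in documents:
        df.update(set(text.split()))  # count each word once per document
    return {w for w, n in df.items() if n >= min_df}

print(df_threshold(docs))  # only 'oil' occurs in at least two documents
```

The same skeleton generalizes to the scored methods (chi-square, information gain, mutual information): replace the document-frequency count with the chosen score and keep the top-scoring features.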
After the feature reduction step, many supervised techniques can be employed in document classification. The following section reviews the techniques that were employed in this study.
2.3 Machine Learning Algorithms
This section reviews state-of-the-art learning algorithms for text classification, giving background on each method used in our empirical study together with an analysis of its advantages and disadvantages. Past research in automatic document classification has focused on improving classification through various learning algorithms (Yang and Liu, 1999), such as support vector machines (Joachims, 1998), and through various feature selection methods (Yang and Pedersen, 1997; Luigi et al., 2000). To make the study worthwhile, we used popular learning algorithms for which significant improvements in classification accuracy were reported in previous studies. These include a wide variety of supervised learning algorithms: naïve Bayes, support vector machines, k-nearest neighbours, C4.5, RIPPER, AdaBoost with decision stumps and alternating decision trees.
2.3.1 Naïve Bayes (NB)
Bayesian classification has been a popular technique in recent years. The simplest Bayesian classifier is the widely used naïve Bayes classifier, which assumes that features are independent. Despite this inaccurate assumption of feature independence, naïve Bayes is surprisingly successful in practice and has proven effective in text classification, medical diagnosis and computer performance management, among other applications.

The naïve Bayes classifier uses a probabilistic model of text to estimate Pr(y|d), the probability that a document d is in class y. The model assumes conditional independence of features, i.e. words are assumed to occur independently of the other words in the document given its class. Bayes' rule says that to achieve the highest classification accuracy, d should be assigned to the class y ∈ {−1, +1} for which Pr(y|d) is highest:
$h_{BAYES}(d) = \arg\max_{y \in \{-1,+1\}} \Pr(y \mid d)$  (1)

Pr(y|d) can be calculated by considering each document according to its length l:

$\Pr(y \mid d) = \sum_{l=1}^{\infty} \Pr(y \mid d, l) \cdot \Pr(l \mid d)$  (2)

Pr(l|d) equals one for the length l′ of document d and is zero otherwise. In other words, when we apply Bayes' theorem to Pr(y|d, l) we obtain the following equation:

$\Pr(y \mid d, l') = \frac{\Pr(d \mid y, l') \cdot \Pr(y \mid l')}{\sum_{y' \in \{-1,+1\}} \Pr(d \mid y', l') \cdot \Pr(y' \mid l')}$  (3)

Pr(d|y, l′) is the probability of observing document d in class y given its length l′. Pr(y|l′) is the prior probability that a document of length l′ is in class y. In the following we will assume that the category of a document does not depend on its length, so Pr(y|l′) = Pr(y). An estimate of Pr(y) is as follows:

$\Pr'(y) = \frac{|y|}{\sum_{y' \in \{-1,+1\}} |y'|} = \frac{|y|}{|D|}$  (4)
|y| denotes the number of training documents in class y∈{-1,+1} and |D| is the total
number of documents.
Despite the unrealistic independence assumption, the naïve Bayes classifier is remarkably successful in practice: researchers have shown that it is competitive with other learning algorithms such as decision tree and neural network learners. Experimental results for naïve Bayes classifiers can be found in several studies (Lewis, 1992; Lewis and Ringuette, 1994; Lang, 1995; Pazzani, 1996; Joachims, 1998; McCallum and Nigam, 1998; Sahami, 1998), which have shown that Bayesian classifiers can produce reasonable results with high precision and recall values. Hence, we have chosen this learning algorithm for learning to classify text documents. A second reason the Bayesian method is important to our study of machine learning is that it provides a useful perspective for understanding many learning algorithms that do not explicitly manipulate probabilities.
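The class-prior estimate of equation (4), combined with per-word likelihoods, yields a working classifier. Below is a minimal multinomial naïve Bayes sketch; the add-one (Laplace) smoothing is a common choice the thesis does not specify, and the toy documents are invented:

```python
import math
from collections import Counter, defaultdict

# Toy labeled corpus: (document, class) with classes +1 / -1.
train = [
    ("oil prices rise", +1),
    ("crude oil exports", +1),
    ("wheat harvest report", -1),
    ("grain prices fall", -1),
]

def fit(train):
    priors = Counter(y for _, y in train)   # |y| per class, as in Eq. (4)
    counts = defaultdict(Counter)           # word counts per class
    vocab = set()
    for text, y in train:
        counts[y].update(text.split())
        vocab.update(text.split())
    return priors, counts, vocab, len(train)

def predict(d, priors, counts, vocab, n):
    scores = {}
    for y in priors:
        s = math.log(priors[y] / n)         # log Pr'(y) = log(|y|/|D|), Eq. (4)
        total = sum(counts[y].values())
        for w in d.split():
            # add-one smoothed word likelihood under class y
            s += math.log((counts[y][w] + 1) / (total + len(vocab)))
        scores[y] = s
    return max(scores, key=scores.get)      # the argmax of Eq. (1)

model = fit(train)
print(predict("oil exports rise", *model))
```

Working in log space avoids underflow from multiplying many small word probabilities, which is the standard implementation trick for this model.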
2.3.2 Support Vector Machines (SVM)
Support vector machines were developed by Vapnik et al. (1995) based on the structural risk minimization principle from statistical learning theory. The idea of structural risk minimization is to find a hypothesis h from a space H that guarantees the lowest probability of error E(h) for a given training sample S consisting of n examples. Equation (5) gives an upper bound that connects the true error of a hypothesis h with the training error E_train(h) of h and the complexity of h, reflecting the well-known trade-off between the complexity of the hypothesis space and the training error.
$E(h) \le E_{train}(h) + O\left(\sqrt{\frac{d \ln\frac{n}{d} - \ln\eta}{n}}\right)$  (5)

where d denotes the VC-dimension of the hypothesis space H.
A simple hypothesis space will most likely not contain good approximating functions and will lead to high training and true error. On the other hand, a large hypothesis space will lead to a small training error, but the second term on the right-hand side of equation (5) will be large. This reflects the fact that for a hypothesis space with high VC-dimension, a hypothesis with low training error may be the result of overfitting. Thus it is crucial to find the right hypothesis space.
The simplest representation of a support vector machine, a linear SVM, is a hyperplane
that separates a set of positive examples from a set of negative examples with maximum
distance from the hyperplane to the nearest of the positive and negative examples. Figure
2 shows the graphical representation of a linear SVM.
+
+
-
-
+
+
+
+
+
-
Maximum
distance
Figure 2. Linear SVM
Joachims (1998) developed a model of learning text classifiers with support vector machines and linked the statistical properties of text to the generalization performance of the learner. Unlike conventional generative models, SVMs do not involve unreasonable parametric or independence assumptions. The discriminative model focuses on those properties of text classification tasks that are sufficient for good generalization performance, avoiding much of the complexity of natural language. This makes SVMs suitable for achieving good classification performance despite the high-dimensional feature spaces of text classification. High redundancy, high discriminative power of term sets and discriminative features in the high-frequency range are sufficient conditions for good generalization. SVM was therefore chosen as one of the learning algorithms in this study.

We used Platt's (1999) sequential minimal optimization (SMO) algorithm to train the linear SVM more efficiently; it decomposes the large quadratic programming problem into smaller sub-problems. Document classification with support vector machines can be formulated as either binary or multi-class classification; we adopted the binary approach, which is discussed in a later chapter.
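Platt's SMO algorithm itself is involved; the idea of a maximum-margin linear separator can be conveyed more simply by stochastic subgradient descent on the regularized hinge loss (a Pegasos-style sketch, not the SMO procedure referenced above; the toy data and hyperparameters are invented):

```python
import random

# Toy linearly separable data: (features, label in {-1, +1}).
data = [
    ([2.0, 1.0], +1), ([1.5, 2.0], +1), ([3.0, 1.5], +1),
    ([-1.0, -2.0], -1), ([-2.0, -1.0], -1), ([-1.5, -1.5], -1),
]

def train_linear_svm(data, lam=0.01, epochs=200, seed=0):
    """Minimize lam/2 * ||w||^2 + mean hinge loss by stochastic subgradient."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    t = 0
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # subgradient step: always shrink w (regularizer), and
            # push w toward y*x when the margin constraint is violated
            w = [wi * (1 - eta * lam) for wi in w]
            if margin < 1:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

w = train_linear_svm(data)
predict = lambda x: +1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
print([predict(x) for x, _ in data])  # the learned w separates the toy data
```

In text classification x would be a sparse, high-dimensional term vector rather than a 2-D point, but the update rule is unchanged.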
2.3.3 Alternating Decision Tree (ADTree)
Although a variety of decision tree learning methods have been developed, with somewhat differing capabilities and requirements, we have chosen one of the more recent methods, the alternating decision tree (Freund and Mason, 1999), because it has often been applied to classification problems, including learning to classify text and documents.
The alternating decision tree learning algorithm is a combination of decision trees with boosting that generates classification rules that are small and often easy to interpret. A general alternating tree defines a classification rule through a set of paths in the tree. As in standard decision trees, when a path reaches a decision node it continues with the child that corresponds to the outcome of the decision associated with that node. When a prediction node is reached, however, the path continues with all of the children of that node: the path splits into a set of paths, one for each child of the prediction node.

The difference between an ADTree and a conventional decision tree is that classification is based on the predictions accumulated along all traversed paths of the tree, rather than on the final leaf node alone.

Alternating decision trees have several key features. Firstly, compared to C5.0 with boosting, an ADTree provides classifiers that are smaller and easier to interpret. In addition, an ADTree gives a measure of confidence, called the classification margin, which can be used to improve accuracy at the cost of abstaining from predictions on examples that are hard to classify. However, a disadvantage of ADTrees is their susceptibility to overfitting on small data sets.
Figure 3. An Example of an ADTree (Freund & Mason, 1999)
2.3.4 C4.5
A decision tree text classifier is a tree in which internal nodes are labeled by terms, branches departing from them are labeled by tests on the weight the term has in the test document, and leaves are labeled by categories. In this classification scheme, a text document d is categorized by recursively testing the weights that the terms labeling the internal nodes have in the vector of d, until a leaf node is reached; the label of this node is then assigned to d. Most of these classifiers use binary document representations and therefore take the form of binary trees. Of the many decision tree learners, among the most popular is C4.5 (Cohen and Hirsh, 1998), which is why we chose this learning method.
C4.5 (Quinlan, 1993) is the most popular decision tree algorithm and has shown good results on a variety of problems. Previous work based on this technique is reported in Lewis and Ringuette (1994), Moulinier et al. (1996), Apte and Damerau (1994) and Cohen (1995, 1996). C4.5 learns decision trees by constructing them top-down from the root of the tree. Each instance feature is evaluated using a statistical test, such as information gain, to determine how well it alone classifies the training examples. Information gain is defined in terms of entropy from information theory. The entropy of a collection S is measured as follows:
$Entropy(S) \equiv -p_+ \log_2 p_+ - p_- \log_2 p_-$  (6)

where p+ is the proportion of positive instances in the collection S and p− is the proportion of negative instances. The best feature is selected and employed as the root node of the tree. For each possible value of this attribute, a descendant of the root node is created, and the training examples are sorted to the appropriate descendant node. C4.5 performs a greedy search for a suitable decision tree, in which no backtracking is allowed.
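Equation (6) is straightforward to compute (a small sketch; by convention the term 0 · log₂ 0 is taken as 0):

```python
import math

def entropy(positives, negatives):
    """Entropy(S) = -p+ log2 p+ - p- log2 p-  (Eq. 6), with 0*log2(0) = 0."""
    total = positives + negatives
    result = 0.0
    for count in (positives, negatives):
        p = count / total
        if p > 0:  # skip the 0*log2(0) term
            result -= p * math.log2(p)
    return result

print(entropy(5, 5))   # maximally mixed collection -> 1.0
print(entropy(10, 0))  # pure collection -> 0.0
```

Information gain for a candidate split is then the entropy of S minus the weighted entropies of the subsets the split induces.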
2.3.5 Ripper
RIPPER (Repeated Incremental Pruning to Produce Error Reduction) is a propositional rule learner proposed by Cohen (1995). The algorithm is characterized by three major phases: grow, prune and optimize. RIPPER was developed from repeated application of Furnkranz and Widmer's (1994) IREP algorithm, followed by two new global optimization procedures. Like other rule-based learners, RIPPER grows rules in a greedy fashion guided by information-theoretic measures.
First, each rule is grown by a greedy process that adds conditions to the rule until it is 100% accurate; the algorithm tries every possible value of each attribute and selects the condition with the highest information gain. The rules are then incrementally pruned. Finally, in the optimization stage, two variants of each rule in the pruned rule set are generated: one grown from an empty rule and the other formed by greedily adding antecedents to the original rule. The smallest possible description length of each variant and of the original rule is computed, and the variant with the minimal description length is selected as the final representative of the rule in the rule set. Rules whose inclusion would increase the description length of the rule set are deleted, and the resulting rules are added to the rule set.
RIPPER has already been applied to a number of standard problems in text classification
with rather promising results (Cohen, 1995). Thus, it is chosen as one of the candidate
learning algorithms in our empirical study.
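The grow phase can be caricatured in a few lines: conditions are added greedily until the rule is 100% accurate on the training data. This toy sketch uses rule precision as the selection criterion rather than RIPPER's information-gain criterion, and the data and attribute names are invented:

```python
# Toy dataset: each example is (attribute dict, boolean label).
data = [
    ({"color": "red", "shape": "round"}, True),
    ({"color": "red", "shape": "square"}, True),
    ({"color": "blue", "shape": "round"}, False),
    ({"color": "red", "shape": "round"}, True),
    ({"color": "blue", "shape": "square"}, False),
]

def covers(rule, x):
    """A rule is a conjunction of (attribute, value) conditions."""
    return all(x.get(a) == v for a, v in rule)

def precision(rule, data):
    covered = [y for x, y in data if covers(rule, x)]
    return (sum(covered) / len(covered)) if covered else 0.0

def grow_rule(data):
    """Greedily add conditions until the rule covers only positive examples."""
    rule = []
    while precision(rule, data) < 1.0:
        candidates = {(a, v) for x, _ in data for a, v in x.items()}
        best = max(candidates - set(rule),
                   key=lambda c: precision(rule + [c], data))
        rule.append(best)
    return rule

print(grow_rule(data))  # a single condition suffices on this toy data
```

Pruning and the description-length-based optimization described above would then simplify such grown rules against a held-out split, which this sketch omits.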
2.3.6 Instance Based Learning – k-Nearest Neighbour
The basic idea behind the k-Nearest Neighbour (k-NN) classifier is the assumption that examples located close to each other, according to a user-defined similarity metric, are highly likely to belong to the same class. This algorithm can also be derived from Bayes' rule. The technique has shown good performance on text categorization (Yang and Liu, 1999; Yang and Pedersen, 1997; Masand, 1992). The algorithm assumes that all instances correspond to points in n-dimensional space, and the nearest neighbours of an instance are defined in terms of the standard Euclidean distance.
An arbitrary instance x is described by a feature vector (a_1(x), a_2(x), …, a_n(x)), where a_i(x) denotes the value of the i-th attribute of instance x. The distance between two instances x_i and x_j is then defined to be d(x_i, x_j), where

$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left(a_r(x_i) - a_r(x_j)\right)^2}$
The target function can be either discrete or real-valued. In our study we assume that the target function is discrete-valued, so we have a binary classifier for each category. The pseudocode is as follows:

Train(training examples):
    For each training example (x, f(x)), add the example to the list of training examples.

Classify(query instance x_q):
    Let x_1, …, x_k denote the k instances from the training examples that are nearest to x_q.
    Return f̂(x_q) ← argmax_{v ∈ V} Σ_{i=1}^{k} δ(v, f(x_i)),
    where δ(a, b) = 1 if a = b and δ(a, b) = 0 otherwise.

Figure 4: k-NN Algorithm (Mitchell, 1997)
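The pseudocode in Figure 4 translates directly into a runnable sketch (toy two-dimensional data, invented for illustration; text classification would use high-dimensional term vectors instead):

```python
import math
from collections import Counter

# Training list of (instance, label) pairs, as built by Train().
training_examples = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

def distance(xi, xj):
    """Standard Euclidean distance between two instances."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def classify(xq, k=3):
    """Majority vote over the k nearest training examples (Figure 4)."""
    nearest = sorted(training_examples, key=lambda ex: distance(ex[0], xq))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(classify((1.1, 1.0)))  # near the "a" cluster
print(classify((5.1, 5.0)))  # near the "b" cluster
```

The Counter-based vote implements the argmax over δ(v, f(x_i)) in the pseudocode: each of the k neighbours contributes one vote for its own label.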
The key advantage of instance-based learning is that instead of estimating the target
function once for the entire instance space, it can estimate it locally and differently for
each new instance to be classified. This method is a conceptually straightforward
approach to approximating real-valued or discrete-valued target functions. In general, one
disadvantage of instance-based approaches is that the cost of classifying new instances
can be high due to the fact that nearly all computation takes place at classification time
rather than when the training examples are first encountered. A second disadvantage is
they consider all attributes of the instances when attempting to retrieve similar training
examples from memory. If the target concept depends on only a few of the many
available attributes, then the instances that are truly most "similar" may nevertheless be
far apart. However, as previous attempts to classify text with this approach have shown
it to be effective (Yang, 1999), we have decided to include it in our experiment.
2.4 Natural Language Processing (NLP) in Document Classification
Most information retrieval (IR) systems are not linguistically motivated. Similarly, in
document classification research, most experiments are not linguistically motivated
(Cullingford, 1986). Closely related to the research on document classification is the
research on natural language processing and cognitive science. Traditionally, document
classification techniques have been directed primarily at improving classification
accuracy, with little regard for linguistic phenomena. Most current document
classification systems are built upon techniques that represent text as a collection of
terms such as words. This has been done successfully using quantitative methods based
on word or character counts. However, it has been emphasized that vector space models
cannot capture critical semantic aspects of document content. In this case, the
representation is only superficially related to content, since language is more than simply
a collection of words.
Thus, natural language processing is a key technology for building information retrieval
systems of the future (Strzalkowski, 1999).
In order to study the effects of linguistically motivated knowledge sources with document
classification, it is imperative to learn about the grammar through natural language
processing so as to apply concepts in cognitive science on document classification
techniques. Natural language processing research attempts to enhance the ability of the
computer to analyze, understand and generate natural language. This is performed by
some form of computational or conceptual analysis to derive meaningful structure or
semantics from a document. The inherently ambiguous nature of natural language makes
this even more difficult. A variety of research disciplines are involved in the successful
development of NLP systems. The mapping of words into meaningful representations is
driven by morphological, syntactic, semantic, and contextual cues available in words
(Cullingford, 1986). With the advancement of NLP techniques, we hope to incorporate
linguistic cues into document classification. This can be done by using NLP techniques
to extract different representations of the documents, which are then used in the
classification process.
Medin (2000) identifies further concepts such as verbs, count nouns, mass nouns, and
isolated and interrelated concepts. We define such concepts as linguistically motivated
knowledge sources. They can be used to derive more complex linguistically motivated
features in the process of classification. The use of linguistic knowledge sources as
features thus appears to be an important step towards a good classification scheme. For
example, besides individual words, the relationships between words within a sentence
and a document, together with the context of what is already known about the world,
help to deliver the actual meaning of a text.
Research has focused on using nouns in modeling the process of categorization in the
real world (Chen et al., 1992; Lewis, 1992; Arampatzis et al., 2000; Basili, 2001).
However, the significant differences among these results have led us to examine these
features with alternative representations.
2.5 Conclusion

The bag-of-words paradigm has been the dominant feature representation in supervised
classification studies. This could be due to early attempts (Lewis, 1992) which showed
negative results. With the advent of NLP techniques, there is a compelling reason to
examine the use of linguistically motivated knowledge sources. Although separate
attempts have been made to study the effects of linguistically motivated knowledge
sources on supervised document classification techniques, it is difficult to generalize a
conclusion from these separate attempts because of the variations introduced across
studies. In some cases, conflicting results were also reported. Thus, there is a need to fill
the gap with a systematic study that covers an extensive variety of linguistically
motivated knowledge sources.
CHAPTER 3
LINGUISTICALLY MOTIVATED CLASSIFICATION
Despite the existence of extensive research on document classification, the relationship
between different linguistic knowledge sources and classification models has not been
sufficiently or systematically explored. By bringing together the two streams of research,
document classification and natural language processing, we hope to shed light on the
effects of linguistically motivated knowledge sources with different learning algorithms.
Section 3.1 discusses the shortcomings of previous research. Section 3.2 explores the
linguistically motivated knowledge sources employed to resolve these issues. Finally,
Section 3.3 presents the technique used to derive the features.
3.1 Considerations
Much research in the area of document classification has been focused mainly on
developing techniques or on improving the accuracy of such techniques. While the
underlying algorithm is an essential factor for classification accuracy, the way in which
texts are represented is also an important factor that should be examined. However,
attempts to produce text representations to improve effectiveness have shown
inconsistent results.
The classic work of Lewis (1992) showed low effectiveness of syntactic phrase indexing
as a text representation, but recent works by Kongovi (2002) and Basili (2001) have
shown improvements using the same representation. Table 1 shows the conclusions made
by some related works. For example, the noun phrase seems to behave differently with
different learning algorithms. The results differ due to the inconsistencies introduced in
these studies through the various datasets, taggers, learning algorithms, parameters of the
learning algorithms and feature selection methods used.
Features     | Algorithm/Method                 | Corpus        | Work                                | Results
Noun Phrase  | Statistical clustering algorithm | Reuters-22173 | Lewis (1992)                        | Worse performance than words
Noun Phrase  | RIPPER                           | Reuters-21578 | Scott & Matwin (1999)               | Worse performance than words
Noun Phrase  | Clustering                       | Reuters-21578 | Kongovi (2002)                      | Better performance than words
Noun Phrase  | SOM                              | CANCERLIT     | Tolle & Chen (2000)                 | Better performance than words
Nouns        | Rocchio                          | Reuters-21578 | Basili, Moschitti & Pazienza (2001) | Better performance than words
Proper Nouns | Rocchio                          | Reuters-21578 | Basili, Moschitti & Pazienza (2001) | Better performance than words
Tags         | Rocchio                          | Reuters-21578 | Basili, Moschitti & Pazienza (2001) | Better performance than words

Table 1: Summary of Previous Studies
To address these issues and the limitations of previous work, a systematic study of the
effects of linguistically motivated knowledge sources with various machine learning
approaches to automatic document classification is necessary. In contrast to previous
work, this research conducts a comparative study and analysis of learning methods,
among which are some of the most effective and popular techniques available, and
reports the accuracies of linguistically motivated knowledge sources and novel
combinations of them, using a systematic methodology to resolve the issues identified in
previous work.
Additionally, we try to see if we can break away from the traditional bag-of-words
paradigm. Bag-of-words refers to representing a document using words, the smallest
meaningful units of a document with little ambiguity. Word-based representations have
been the most common representation in previous work related to document
classification and are the basis for most work in text classification. The obvious
advantage of words lies in their simplicity and the straightforward process of obtaining
the representation. However, the problem with using bag-of-words is that the logical
structure, layout and sequence of words are usually ignored.
A basic observation about using bag-of-words representations for classification is that a
great deal of information associated with the logical structure and sequence of the
original document is discarded. The major limitation is the implicit assumption that the
order of words in a sentence is not relevant. In fact, paragraph, sentence and word
orderings are disrupted, and syntactic structures are ignored. However, this assumption
may not always hold, as words alone do not always represent true atomic units of
meaning. For example, the phrase "learning algorithm" could be interpreted in another
manner when broken up into two separate words, "learning" and "algorithm". Thus, we
utilize linguistically motivated knowledge sources as features to see if we can resolve
these limitations associated with the bag-of-words paradigm. Novel combinations of
linguistically motivated knowledge sources are also proposed and presented in the next
section.
3.2 Linguistically Motivated Knowledge Sources
Machine learning methods require each example in the corpus to be described by a vector
of fixed dimensionality, in which each component represents the value of one feature of
the example. As a linguistic knowledge source may provide contextual cues about a
document that are useful for distinguishing its category, we are interested in studying
whether the choice of feature representation, using different linguistic knowledge sources
as the input vectors to the learning algorithm, has a significant impact on document
classification. We consider the following linguistic knowledge sources in our research:
1. Words, which will be used as the baseline for a comparative analysis with the other
linguistically motivated knowledge sources;
2. Phrase;
3. Word sense or part of speech tagging;
4. Nouns;
5. Verbs;
6. Adjectives;
7. Combinations of sources with words.
The description of the above features and an analysis of the advantages and
disadvantages of each feature representation are discussed below.
3.2.2 Phrase
Phrases have been found to be useful indexing units in previous research. Kongovi,
Guzman & Dasigi (2002) have shown that phrases were salient features when used with
category profiles. We consider one class of phrases, namely syntactic phrases: any set of
words that satisfies certain syntactic relations or constitutes specified syntactic
structures.
Phrase refers to the noun phrases identified by our parser. The data set is first parsed into
the appropriate format before the phrases are extracted and segmented. A noun phrase is
defined as a sequence of words that terminates with a noun. More specifically,

NP = {A, N}* N, where NP stands for noun phrase, A for adjectives and N for nouns.

For example, in the sentence "The limping old man walks across the long bridge", the
noun phrases identified are "limping old man" and "long bridge". In our work, we do not
attempt to separate noun phrases into their component noun phrases.
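The pattern NP = {A, N}*N can be sketched over part-of-speech-tagged tokens as follows. The single-letter tag set (A for adjective, N for noun, other letters for other classes) and the pre-tagged example sentence are assumptions for illustration; in our experiments the tags come from the parser.

```python
def extract_noun_phrases(tagged):
    """tagged: list of (word, tag) pairs, with tag 'A' = adjective, 'N' = noun.
    Returns maximal runs of adjectives/nouns, trimmed so that each run
    terminates with a noun, i.e. matches of {A, N}*N."""
    phrases, run = [], []
    for word, tag in tagged + [("", "")]:      # sentinel flushes the final run
        if tag in ("A", "N"):
            run.append((word, tag))
        else:
            while run and run[-1][1] != "N":   # phrase must terminate with a noun
                run.pop()
            if run:
                phrases.append(" ".join(w for w, _ in run))
            run = []
    return phrases

sentence = [("The", "D"), ("limping", "A"), ("old", "A"), ("man", "N"),
            ("walks", "V"), ("across", "P"), ("the", "D"),
            ("long", "A"), ("bridge", "N")]
print(extract_noun_phrases(sentence))  # ['limping old man', 'long bridge']
```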
The advantage of phrases is that they do not assume word order to be irrelevant: the
logical structure, layout and sequence of words are retained, thus keeping some
information from the original document. On the other hand, the major limitation is the
greater degree of complexity in processing and extracting phrases as features.

Although phrase-based representation has been used in information retrieval, conclusions
from studies reporting the retrieval effectiveness of linguistic phrase-based
representations have been inconsistent. Linguistic phrase identification was found to
improve retrieval effectiveness by Fagan (1987), but Mitra et al. (1997) reported little
benefit in using phrase-based representations. Smeaton (1999) reported that the benefit
of phrase-based representation varied with users. Lewis (1992) undertook a major study
of the use of noun phrases for statistical classification and found that phrase
representation did not produce any improvement on the Reuters-22173 corpus.

As we are using a different corpus in our work, we decided to continue with the use of
phrase-based representations in our experiment, as they have not been studied before
with some of the learning algorithms that we have chosen.
3.2.3 Word Sense (Part of Speech Tagging)
Word sense refers to the incorporation of part-of-speech tags with the word so that the
exact word sense within a document is identified. The part of speech of a word indicates
its syntactic category, such as adjective, adverb, determiner, noun, verb, preposition,
pronoun or conjunction. As this feature incorporates both the tag and the word, it
provides the word class, or lexical tag, to the classifier.
The intuition for using word sense is to capture additional information that helps to
distinguish homographs, i.e. words with more than one meaning, which can be
differentiated based on their syntactic role. For example, the word "patient" has
different meanings in different syntactic roles such as noun or adjective. When used as a
noun, a patient refers to an individual who is sick or unwell, but when used as an
adjective, it refers to the character of a person as being tolerant.
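As a minimal sketch, this feature can be formed by concatenating each token with its part-of-speech tag, so that the two uses of "patient" yield distinct features. The tag names and the underscore separator are illustrative assumptions, not the tagger's actual output format.

```python
def word_sense_features(tagged):
    """Attach the part-of-speech tag to each word, so that homographs used in
    different syntactic roles become distinct features for the classifier."""
    return [f"{word.lower()}_{tag}" for word, tag in tagged]

print(word_sense_features([("patient", "NOUN"), ("patient", "ADJ")]))
# ['patient_NOUN', 'patient_ADJ'] - the classifier sees two different features
```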
3.2.4 Nouns
Gentner (1981) explored the differences between nouns and verbs and suggested that
nouns differ from verbs in the relational density of their representations. The semantic
components of noun meanings are more strongly interconnected than those of verbs and
other parts of speech. Hence, the meanings of nouns seem less mutable than the meanings
of verbs. Nouns have been used as a common candidate for distinguishing among
different concepts. Nouns are often called “substantive words” in the field of
Computational Linguistics and “content words” in Information Science.
3.2.5 Verbs
Verbs are associated with motions involving relations between objects (Kersten, 1998).
From an information-seeking perspective, verbs do not appear to contribute to
classification accuracy. In order to validate this hypothesis, verbs are included as one of
the linguistically motivated knowledge sources examined in our study.
3.2.6 Adjectives
Bruce and Wiebe's (1999) work established a positive correlation between the presence
of adjectives and subjectivity: the presence of one or more adjectives is predictive of a
sentence being subjective. Subjectivity tagging refers to distinguishing sentences that
present opinions and evaluations from sentences that objectively present factual
information. There are numerous applications for which subjectivity tagging is relevant,
including information retrieval and information extraction; the task is essential for
forums and news reporting. For a complete study of the use of linguistically motivated
knowledge sources, we have included adjectives as one source of linguistic knowledge in
our experiment.
3.2.7 Combination of Sources
Each linguistic knowledge source generates a feature vector from the context of the
document. We also examine the combination of two linguistic knowledge sources, which
is a novel technique. When sources are combined, the features generated from each
knowledge source are concatenated, with each source contributing half of the total
number of features, and the dataset with all these features is generated. Here we combine
words with the linguistically motivated knowledge sources nouns, noun phrases and
adjectives as novel combinations, to see if there are any improvements. As the original
features are retained while some syntactic structure is captured, there appears to be an
advantage in using a combination of techniques.
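The concatenation described above can be sketched as follows; the two toy vocabularies and the simple term-frequency weighting are assumptions for illustration, not the weighting scheme of our experiments.

```python
def vectorize(tokens, vocabulary):
    # Term-frequency vector over a fixed vocabulary.
    return [tokens.count(term) for term in vocabulary]

def combined_vector(words, phrases, word_vocab, phrase_vocab):
    """Concatenate the vectors from two knowledge sources; each source
    contributes its own block of components to the final feature vector."""
    return vectorize(words, word_vocab) + vectorize(phrases, phrase_vocab)

v = combined_vector(["interest", "rate", "rate"], ["interest rate"],
                    ["interest", "rate"], ["interest rate", "money supply"])
print(v)  # [1, 2, 1, 0] - word counts followed by phrase counts
```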
3.3 Obtaining Linguistically Motivated Classifiers
Our technique has five steps (Figure 5). The input to the technique is a document, D.
Below is an outline of the generic process proposed and employed to use the
linguistically motivated knowledge sources as features:
1. Firstly, the document is broken up into sentences.
2. Morphological descriptions or tags are assigned to each term. This NLP
component does linguistic processing on the contents and attaches a tag to every
term.
3. Processed terms are parsed.
4. Linguistically motivated knowledge sources are then extracted based on the
tagging requirements as discussed earlier.
5. Features are combined at the binder phase if combinations of features are
required.
As a final step, the set of linguistically motivated knowledge sources obtained is used as
the input feature set for the training or testing phase of the documents.
[Figure 5 depicts the extraction pipeline: a document passes through sentence boundary
detection to yield words; morphological descriptions are assigned to produce tags; the
tagged terms are parsed, and the extractor produces phrases, nouns, adjectives or verbs;
if tags are to be combined, the binder produces a combined feature (e.g. phrase and
words), otherwise a single feature type (e.g. phrase) is output.]

Figure 5: Extracting Linguistically Motivated Knowledge Sources
CHAPTER 4
EXPERIMENT
A controlled experimental study was conducted to validate the effectiveness of
linguistically motivated knowledge sources. This chapter describes the experimental
setup employed throughout the study. Section 4.1 describes the evaluation data sets used
in the study. Section 4.2 presents the preprocessing methods required in the study.
Section 4.3 provides more details on the handling of multiple categories. Section 4.4
presents the evaluation measures used. The last section, 4.5, describes the tools utilized in
the study.
4.1 Evaluation Data Sets
There are standard benchmark collections available for experimental purposes. We have
tested the numerous linguistically motivated knowledge sources and their combinations
presented in Chapter 3 on two widely used corpora: Reuters-21578 and WebKB. These
corpora vary in many characteristics. This section articulates the characteristics of the
data sets used in our experiment.
4.1.1 Reuters-21578
The dataset that we have chosen is the Reuters-21578 dataset. This is a widely used
collection which accounts for most of the experimental work in classification (Sebastiani,
2002). The dataset can be obtained from
http://www.daviddlewis.com/resources/testcollections/reuters21578. This document
corpus was originally collected from Reuters newswire stories manually indexed by
human indexers. The data was originally collected and labeled by Carnegie Group, Inc.
and Reuters, Ltd. in the course of developing the CONSTRUE text categorization
system.
The Reuters corpus consists of 21,578 documents classified into 135 categories, of which
5 are major categories and 118 are subcategories. A total of more than 20,000 unique
terms can be found in this corpus. The documents are in SGML format, and the majority
of them carry category labels obtained through a manual process. An example of a single
document in the Reuters corpus is shown in Figure 6. In our empirical study, we chose
the "Modified Apte Split" (ModApte) version of Reuters-21578 because of the abundance
of recent work using this configuration. The ModApte version places all documents
written on or before 7 April 1987 into the training set and the rest of the documents into
the test set. Of these documents, 9603 are used as training examples and 3299 as test
examples, some of which have no class labels attached. Only the 90 classes for which at
least one training example and one test example exist are included.
An analysis of the corpus reveals that the category distribution in the Reuters dataset is
highly skewed. Table 2 shows the overview of the category sets and the detailed
breakdown of the categories. The 674 categories are broken down into 5 groups:
1. Exchanges: Stock and commodity exchanges
2. Organizations: Economically important regulatory, financial, and political
organizations.
3. People: Important political and economic leaders.
4. Places: Countries of the world.
5. Topics: These are subject categories of economic interest.
Categorization | Subcategories | Subcategories with >1 occurrence | Subcategories with >20 occurrences
EXCHANGES      | 39            | 32                               | 7
ORGS           | 56            | 32                               | 9
PEOPLE         | 267           | 114                              | 15
PLACES         | 175           | 147                              | 60
TOPICS         | 135           | 120                              | 57

Table 2: Distribution of Categories in Reuters-21578
The training set has some categories with large numbers of documents and some with
extremely few. On average, each document belongs to 1.3 categories. We have chosen a
random subset of documents from the top 20 subcategories in our evaluation, as shown
in Table 3.
Category     | Train | Test
Acq          | 1650  | 719
Bop          | 75    | 30
Corn         | 182   | 56
Cpi          | 69    | 28
Crude        | 389   | 189
Dlr          | 131   | 44
Earn         | 2877  | 1087
Gnp          | 101   | 35
Grain        | 433   | 149
Interest     | 347   | 131
Livestock    | 75    | 24
Money-fx     | 538   | 179
Money-supply | 140   | 34
Nat-gas      | 75    | 30
Oilseed      | 124   | 47
Ship         | 197   | 89
Soybean      | 78    | 33
Sugar        | 126   | 36
Trade        | 369   | 118
Wheat        | 212   | 71

Table 3: Top 20 categories in ModApte Split
26-FEB-1987 15:01:01.79
Topics: cocoa
Places: el-salvador, usa, uruguay
BAHIA COCOA REVIEW
SALVADOR, Feb 26 - Showers continued throughout the week in
the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao,
although normal humidity levels have not been restored,
Comissaria Smith said in its weekly review.
The dry period means the temporao will be late this year.
Arrivals for the week ended February 22 were 155,221 bags
of 60 kilos making a cumulative total for the season of 5.93
mln against 5.81 at the same stage last year. Again it seems
that cocoa delivered earlier on consignment was included in …

Figure 6: An Example of a Reuters-21578 Document
4.1.2 WebKB
The other corpus that we used is the World Wide Knowledge Base (WebKB) dataset
collected by Craven et al. (1999). WebKB was collated by a crawler as part of an effort to
build a data corpus that mirrors the World Wide Web.

The primary reason for choosing this corpus is that it is of a different domain and has
dissimilar characteristics from the Reuters corpus. The WebKB corpus is from a
web-based domain. It contains 8282 web pages from four academic domains, contributed
by different authors with very different styles of writing and presenting information on
the web.

The corpus is divided into two different polychotomies, topic and web domain. We
decided to use the first polychotomy, as in the experiments carried out by Nigam et al.
(1998) and Bekkerman et al. (2003), in which four categories were used: course, faculty,
project and student. This contributes a total of 4199 documents. Our experiment also
used these four categories and used the pages from Cornell as the test pages. A detailed
breakdown of the data corpus is shown in Table 4.
Category | Number | Proportion of +ve examples (%)
Course   | 930    | 22.1
Faculty  | 1124   | 26.8
Project  | 504    | 12.0
Student  | 1641   | 39.1

Table 4: Breakdown of documents in WebKB Categories
4.2 Dimensionality Reduction
In a large document collection, the dimensionality of the feature space is very high,
leading to greater time and memory requirements in processing the feature vectors. This
calls for a feature selection phase to address the issue by reducing the size of the feature
space. The first section gives an overview of feature selection methods used in document
classification studies, followed by a detailed explanation of the techniques adopted for
our study.
4.2.1 Feature Selection
Feature selection is typically performed by assigning a weight to each term and keeping a
certain number of terms with the highest scores, while discarding the rest. Experiments
then evaluate the classification performance on the feature vectors built from the features
retained after the feature selection phase. Some learning algorithms, such as LLSF,
cannot scale to high dimensionality. It has also been shown that some classifiers (e.g.
k-NN, Rocchio, C4.5) perform worse when using all features, and some, like C4.5, are
too inefficient to use all features.
An advantage of dimensionality reduction is that it tends to reduce overfitting, the
phenomenon by which a classifier is tuned to the contingent characteristics of the
training data rather than just the constitutive characteristics of the categories. Classifiers
that overfit are extremely accurate on the training data at the expense of performance on
the test data. To avoid overfitting, some research suggests that 50-100 training examples
per term may be needed in document classification tasks (Fuhr & Buckley, 1991),
implying that overfitting may be avoided by using a smaller number of terms. On the
other hand, removing features risks removing potentially useful information about the
meaning of documents. Hence, the dimensionality reduction step must be done
appropriately.
Various dimensionality reduction methods have been proposed. Methods such as
document frequency thresholding and empirical mutual information are used to select a
set of features to represent the documents; others, such as the odds ratio, the chi-square
score, stopword removal and stemming, are also used for feature selection. Document
frequency thresholding and stop word removal are employed in this study.
4.2.1.2 Document Frequency Thresholding
Of the many feature selection methods, document frequency is a simple and effective
method popular in the conventional document classification literature. Document
frequency thresholding refers to the use of features that occur at least t times in the
training documents. In other words, it weights each feature by its document frequency,
i.e. the number of documents in which the feature occurs. This is one of the simplest yet
effective techniques (Yang & Pederson, 1999) for vocabulary reduction. It scales easily
to very large corpora, with a computational complexity approximately linear in the
number of training documents.
The document frequency of each feature in the training corpus is computed, and features
whose document frequency is below the chosen threshold are removed from the feature
space. The assumption is that rare terms are not informative for global performance.
This greatly reduces the dimensionality of the feature space. Although the assumption is
not always accurate, an improvement in classification accuracy is possible if the removal
of rare terms actually removes noise terms.
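Document frequency thresholding can be sketched as follows; the toy corpus and the threshold t = 2 are assumptions for illustration.

```python
from collections import Counter

def df_threshold(documents, t):
    """documents: list of token lists. Keep only terms whose document frequency
    (the number of documents containing the term) is at least t."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))          # count each term at most once per document
    return {term for term, count in df.items() if count >= t}

docs = [["oil", "price", "rise"], ["oil", "output"], ["price", "oil"]]
print(sorted(df_threshold(docs, 2)))  # ['oil', 'price'] - 'rise' and 'output' are too rare
```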
As the emphasis of our study is on the effects of different learning algorithms with
different linguistic knowledge sources, we assume that feature selection is not a major
factor in our study, since the same feature selection technique, document frequency
thresholding, is used in all our experimental setups across learning algorithms and
linguistic knowledge sources. Furthermore, Yang & Pederson's (1999) work has
established that the use of document frequency with the Reuters corpus is not just an ad
hoc approach to improving classification accuracy but a reliable metric for feature
selection. They have suggested that document frequency is a better alternative when the
computation of information gain or the χ2 test proves too costly.
4.2.2 Stop Words Removal
Stop words are words that occur very frequently, such as "the", and add no
discriminative value to the classifier. Stop word elimination is thus carried out to remove
irrelevant features. While document frequency thresholding removes particularly
infrequent words, stop word elimination removes mostly high-frequency words from the
attribute set. All words that occur in the list of stop words are not considered as features.

Feature selection based on rankings implies that the process is greedy and does not
account for dependencies between words. A list of stop words is provided in Appendix
A. We removed stop words for documents in Reuters-21578, but this was not carried out
for documents in WebKB, as the number of features derived from the web pages was
much smaller than for Reuters-21578.
4.3 Experiment Setup
4.3.1 Handling Multiple Categories Problems
For all learning algorithms, we used the default settings established in WEKA to train the
classifiers. We used binary classification for every learning algorithm, with one learning
classifier created for each category. If a document instance has more than one category,
it is treated as a separate instance for each of its categories. Thus there are twice as many
classifiers as there are categories.
4.3.2 Handling Multiple Categories Training Examples
As some documents in Reuters-21578 are labeled with more than one category, we
created one training instance for each category that a training example in the corpus
carries, i.e. one positive training instance for each example labeled as belonging to that
category. Thus the number of training instances generated for the normal variant of the
learning algorithms can be more than the number of training examples provided, whereas
the number of training instances generated for the 1-per-category variant is the same as
the number of training examples.
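The construction of one binary training set per category, as described above, can be sketched like this; the document representation and category names are illustrative, not the actual Reuters features.

```python
def binary_training_sets(documents, categories):
    """documents: list of (features, label_set) pairs.
    For each category c, build a binary training set of (features, 1/0) pairs:
    1 if the document carries label c, 0 otherwise (one-vs-rest)."""
    return {
        c: [(feats, 1 if c in labels else 0) for feats, labels in documents]
        for c in categories
    }

docs = [("doc1", {"grain", "wheat"}), ("doc2", {"crude"})]
sets = binary_training_sets(docs, ["grain", "wheat", "crude"])
print(sets["grain"])  # [('doc1', 1), ('doc2', 0)]
```

A multi-labeled document such as doc1 thus acts as a positive example in several of the per-category training sets at once.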
4.4 Evaluation Measures
4.4.1 Loss-based Measures
A commonly used performance measure in the machine learning community is the error
rate, defined as the probability that the classification rule predicts the wrong class:

$$E(h) = \Pr(h(X) \neq Y \mid h)$$
Estimators for these measures can be defined using a contingency table of predictions on
an independent test set. Each cell of the contingency table (see Table 5) represents one of
the four possible outcomes of the prediction h(x) for an example:
f++ is the number of instances belonging to category c that the classifier correctly predicts as belonging to c;
f+- is the number of instances not belonging to category c that the classifier incorrectly predicts as belonging to c;
f-+ is the number of instances belonging to category c that the classifier incorrectly predicts as not belonging to c;
f-- is the number of instances not belonging to category c that the classifier correctly predicts as not belonging to c.
                          Predicted Category +1   Predicted Category -1
Actual Category y = +1    f++                     f-+
Actual Category y = -1    f+-                     f--

Table 5: Contingency Table
Formally, the conventional estimator for the error rate using the contingency table is

E(h) = (f+- + f-+) / (f++ + f+- + f-+ + f--)    (7)
Sometimes a cost matrix may be introduced, where the utility of predicting a positive example correctly is higher or lower than that of predicting a negative example correctly, and vice versa.
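Equation (7), and a cost-weighted variant of it, are straightforward to compute from the four contingency counts. The sketch below is ours; the variable names mirror the f++, f+-, f-+, f-- notation, and the cost weights are illustrative.

```python
def error_rate(f_pp, f_pm, f_mp, f_mm):
    """Estimate E(h) as in equation (7): the fraction of test instances
    falling in the two off-diagonal cells (f+- and f-+)."""
    return (f_pm + f_mp) / (f_pp + f_pm + f_mp + f_mm)

def expected_cost(f_pp, f_pm, f_mp, f_mm, c_fp=1.0, c_fn=1.0):
    """Cost-matrix generalization: weight false positives and false
    negatives differently before normalizing by the test-set size."""
    return (c_fp * f_pm + c_fn * f_mp) / (f_pp + f_pm + f_mp + f_mm)

rate = error_rate(40, 5, 5, 50)                 # 10 errors out of 100 -> 0.1
cost = expected_cost(40, 5, 5, 50, c_fp=2.0)    # false positives count double -> 0.15
```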
4.4.2 Recall and Precision
To avoid the problems of using accuracy as a measure in text classification experiments, we considered two other performance measures: recall and precision.
Recall is the probability that a document with label y = 1 is classified correctly.
Recall(h) = Pr(h(x) = 1 | y = 1, h)    (8)

Recall(h)_estimate = f++ / (f++ + f-+)    (9)
Precision is the probability that a document classified as h(x) = 1 is classified correctly.
Precision(h) = Pr(y = 1 | h(x) = 1, h)    (10)

Precision(h)_estimate = f++ / (f++ + f+-)    (11)
4.4.3 Precision and Recall Breakeven Point
While precision and recall accurately describe classification performance, it is difficult to compare learning algorithms using two disparate scores. One popular method for balancing precision and recall is the F-measure proposed by van Rijsbergen (1979). This measure combines precision and recall with a parameter β specifying the importance of recall relative to precision, and is defined by equation 12.
Fβ(h) = (1 + β²) · Precision(h) · Recall(h) / (β² · Precision(h) + Recall(h))    (12)
If β = 1, precision and recall are equally important and have the same weight. The performance measure that we have employed is the F1 measure, given in equation 13.
F1(h) = 2 · Precision(h) · Recall(h) / (Precision(h) + Recall(h))    (13)
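Equations (9), (11), (12) and (13) can be sketched directly in code. This is an illustrative sketch of ours, using the same f++, f+-, f-+ notation as the contingency table.

```python
def precision(f_pp, f_pm):
    # Equation (11): fraction of predicted positives that are truly positive.
    return f_pp / (f_pp + f_pm)

def recall(f_pp, f_mp):
    # Equation (9): fraction of true positives that are predicted positive.
    return f_pp / (f_pp + f_mp)

def f_measure(p, r, beta=1.0):
    # Equation (12); beta = 1 reduces to the F1 measure of equation (13).
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

p, r = precision(8, 2), recall(8, 4)   # 0.8 and 2/3
f1 = f_measure(p, r)                    # 2pr/(p+r) = 8/11
```

Setting beta above 1 weights recall more heavily; below 1, precision.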
4.4.4 Micro- and Macro- Averaging
The F-measure gauges the effectiveness of a classifier on a single class. However, as most text classification tasks contain many categories, each category will have its own F-measure value. Two further measures, the macro-average and the micro-average, were therefore used in this study.
The macro-average takes the mean of the F1 values over all categories, as defined by:

F1_Macro = (1/m) · Σ_{i=1..m} F1(h_i)    (14)
On the other hand, the micro-average combines the contingency tables of all categories by component-wise addition, giving averaged counts f++Average, f+-Average, f-+Average and f--Average. The micro F1 value is then defined as:

F1_Micro = 2 · f++Average / (2 · f++Average + f+-Average + f-+Average)    (15)
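The two averaging schemes of equations (14) and (15) can be sketched as follows. Note that summing the per-category counts gives the same micro F1 as averaging them, since the factor 1/m cancels in the ratio. The sketch and sample counts are illustrative.

```python
def per_category_f1(f_pp, f_pm, f_mp, f_mm):
    # Equation (13) rewritten directly in contingency counts.
    return 2 * f_pp / (2 * f_pp + f_pm + f_mp)

def macro_f1(tables):
    """Equation (14): mean of the per-category F1 values."""
    return sum(per_category_f1(*t) for t in tables) / len(tables)

def micro_f1(tables):
    """Equation (15): F1 of the component-wise combined contingency table."""
    f_pp = sum(t[0] for t in tables)
    f_pm = sum(t[1] for t in tables)
    f_mp = sum(t[2] for t in tables)
    return 2 * f_pp / (2 * f_pp + f_pm + f_mp)

# (f++, f+-, f-+, f--) for two categories of different sizes.
tables = [(8, 2, 2, 88), (1, 1, 1, 97)]
macro = macro_f1(tables)   # (0.8 + 0.5) / 2 = 0.65
micro = micro_f1(tables)   # 18 / 24 = 0.75
```

The example shows the practical difference: micro-averaging weights categories by size, while macro-averaging treats all categories equally.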
4.5 Tools
Finally, to facilitate the study of the effects of linguistically motivated knowledge sources
with learning algorithms, a generic design of our system was conceptualized. The design
of our system is shown in Figure 7.
The system consists of four main modules:
1. The document management module;
2. The feature preprocessor module;
3. The learning classifier module;
4. The Graphical User Interface (GUI) module.
Tools that were readily available were tested and integrated into the system where applicable, such as for the learning classifier module; the other modules had to be implemented.
[Figure: block diagram of the four modules. The document database feeds the document management module, which preprocesses the documents; the feature engine module extracts features; the learning classifier module builds the model used to assign a category; new documents enter through the user interface module.]

Figure 7: Design of system
The details of each module follow:
The document management module contains functions to parse and process the
documents in the document repository. This module consists of a document handler and a
filter sub-component. The document handler parses the document based on the existing
format and the filter does additional pre-processing such as removing HTML tags for the
WebKB documents.
Having preprocessed the documents, the strings of text are passed to the feature engine module, where the documents are converted into the linguistic units, or tokens, that serve as sources of linguistic knowledge for the learning classifier. The strings are tokenized and an appropriate feature selection is performed on them. In our empirical study, linguistically motivated tokens have been used. This entails the use of a Part of Speech Tagger that tags the documents. In our experiments, we have employed the SNoW tagger (http://l2r.cs.uiuc.edu/~cogcomp/cc-software.html).
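Extracting a particular linguistically motivated knowledge source from tagger output amounts to filtering tokens by their part-of-speech tag. The sketch below is ours; the pre-tagged token list stands in for whatever the SNoW tagger actually returns, and assumes Penn-Treebank-style tags.

```python
def filter_by_pos(tagged_tokens, prefixes=("NN",)):
    """Keep tokens whose tag starts with one of the given prefixes
    ("NN" for nouns, "VB" for verbs, "JJ" for adjectives)."""
    return [tok for tok, tag in tagged_tokens if tag.startswith(tuple(prefixes))]

# Hypothetical tagger output: (token, tag) pairs.
tagged = [("oil", "NN"), ("prices", "NNS"), ("rose", "VBD"), ("sharply", "RB")]
nouns = filter_by_pos(tagged)                      # noun features
verbs = filter_by_pos(tagged, prefixes=("VB",))    # verb features
```

The same filter, with different prefixes, yields each of the bag-of-nouns, bag-of-verbs and bag-of-adjectives representations studied in Chapter 5.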
The learning classifier module contains several learning algorithms; in our system, WEKA 3-6 has been integrated for this purpose.
Finally, the GUI module contains the code to render the appropriate user interface for the user to execute the classifier. Currently, the UI module consists of a plain text-based interface.
Based on the setup described in this chapter, a series of experiments was conducted using six learning algorithms, Naïve Bayes (NB), Support Vector Machines (SVM), k-NN Instance-Based Learner (k-NN), C4.5, RIPPER (JRip) and Alternating Decision Trees (ADTree), with different linguistically motivated sources, including novel combinations of these sources using the approach proposed in Chapter 3. Results from the experiments with this system are discussed in detail in the next chapter.
CHAPTER 5
RESULTS & ANALYSIS
In this chapter, the results of our experiments are presented. The results for each type of linguistically motivated knowledge source are discussed in detail, followed by an overall assessment and analysis of the interaction effects of using different linguistically motivated feature representations with different learning algorithms. Section 5.1 describes the experimental setup. Section 5.2 presents an extensive set of results based on the Reuters-21578 corpus, with comparisons of each of the linguistically motivated classifiers. Section 5.3 then presents the experimental results for the WebKB corpus.
5.1 Results
Experiments were conducted on Reuters-21578 and WebKB corpora. This section
describes the series of experiments that have been carried out using different
linguistically motivated knowledge sources and learning algorithms mentioned in
Chapter 2 on these corpora.
5.2 Contribution of Different Linguistic Knowledge Sources to Classification of Reuters-21578 Corpus
5.2.1 Words
Table 6 shows the results of using the word-based representation. Since the F1 value lies between 0 and 1, with higher values indicating a better classifier, it can be seen that relatively good performance was achieved with words as the representation for each of the learning algorithms tested. The results in Table 6 will be used as the baseline against which the linguistically motivated classifiers are compared.
Algorithm   Micro F1   Macro F1
NB          0.860      0.860
SVM         0.884      0.882
k-NN        0.868      0.867
C4.5        0.870      0.866
Ripper      0.864      0.860
ADTree      0.887      0.885

Table 6: Results using Words
5.2.2 Phrase
The results of using the noun phrase representation are compared against the word representation in Table 7. Surprisingly, the majority of the results show that the word-based representation does better than noun phrases. This corresponds to the findings of Lewis (1992) and Scott (1999), and indicates that some of the informational content may have been lost.
From our observations, some of the noun phrases were not identified correctly by SNoW, which could have led to a decrease in the micro and macro values. However, it is worth noting that support vector machines gave the best performance with phrase-based representations and showed a minimal improvement in the macro value.

The poor performance of the new linguistic classifiers built using phrases as the linguistically motivated knowledge source could be due to:
i) the sparse distribution of noun phrases;
ii) synonymous phrases, which dilute the contribution of features that essentially have the same meaning. For example, the noun phrases "Standard oil company" and "oil company" become two separate features;
iii) noise introduced by the separation of synonymous phrases into multiple phrases, and the inaccurate identification of noun phrases by the tagging tool.
Algorithm   Word                  Phrase
            Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
NB          0.860      0.860      0.833      -3.140       0.830      -3.488
SVM         0.884      0.882      0.852      -3.620       0.884      0.227
k-NN        0.868      0.867      0.835      -3.802       0.834      -3.806
C4.5        0.870      0.866      0.842      -3.218       0.837      -3.349
Ripper      0.864      0.860      0.828      -4.167       0.822      -4.419
ADTree      0.887      0.885      0.848      -4.397       0.847      -4.294

Table 7: Results using Phrases
5.2.3 Word Sense (Part of Speech Tags)
There was a drop in the micro and macro averages for the word sense representation, which indicates that there is no clear advantage in using word senses as a representation for document classification. Although word senses have been reported to produce better results in the word sense disambiguation literature (Ng & Lee, 1996), this is not the case for document classification. Adding part of speech tags brings little improvement because few terms have very different meanings under different word senses. The only learning algorithm that worked well with word senses, or part of speech tags, was C4.5. Our findings indicate that C4.5 actually does better on both micro and macro F1 with word senses than with words.
Algorithm   Word                  Tags
            Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
NB          0.860      0.860      0.815      -5.233       0.812      -5.581
SVM         0.884      0.882      0.864      -2.262       0.863      -2.154
k-NN        0.868      0.867      0.843      -2.880       0.842      -2.884
C4.5        0.870      0.866      0.879      1.034        0.880      1.617
Ripper      0.864      0.860      0.856      -0.926       0.854      -0.698
ADTree      0.887      0.885      0.881      -0.676       0.881      -0.453

Table 8: Results using Word Senses (Tags)
However, word sense representation may well be effective in data corpora containing terms that take on multiple meanings under different word senses. For example, the word "sweet" has different senses with different meanings: as a noun it means approximately "candy", while as an adjective it may mean "lovable". Since Reuters-21578 documents do not appear to exhibit this characteristic, using this linguistically motivated knowledge source does not improve on the baseline classifier for most of the learning algorithms, with the exception of C4.5.
5.2.4 Nouns
The linguistic classifiers built using the bag-of-nouns representation produced superior results, with improvements ranging from 3.05% to 4.76%. Except for Naïve Bayes, the results indicate that nouns are able to capture most of the informational content of the documents, as reflected in the improved micro and macro values. The improvement could be because the semantic components of noun meanings are more strongly interconnected than those of verbs and word senses; nouns appear to capture the salient concepts required for classification accuracy.
Algorithm   Word                  NN
            Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
NB          0.860      0.860      0.861      0.116        0.857      -0.349
SVM         0.884      0.882      0.924      4.525        0.924      4.762
k-NN        0.868      0.867      0.901      3.802        0.899      3.691
C4.5        0.870      0.866      0.908      4.368        0.906      4.619
Ripper      0.864      0.860      0.891      3.125        0.888      3.256
ADTree      0.887      0.885      0.913      2.931        0.912      3.051

Table 9: Results using Nouns
5.2.5 Verbs
The results in Table 10 show a large drop in the micro and macro F1 values. Since verbs are more mutable than nouns, it follows that the micro and macro values should fare worse than those of nouns, as reflected in the results. Thus, verbs are not a good choice of feature representation for classification tasks. This may also indicate that removing verbs from the documents would not affect the classification accuracy of the learning algorithms adopted in this study.
Algorithm   Word                  Verb
            Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
NB          0.860      0.860      0.636      -26.047      0.625      -27.326
SVM         0.884      0.882      0.637      -27.941      0.625      -29.138
k-NN        0.868      0.867      0.636      -26.728      0.628      -27.566
C4.5        0.870      0.866      0.619      -28.851      0.601      -30.60
Ripper      0.864      0.860      0.594      -31.25       0.566      -34.186
ADTree      0.887      0.885      0.628      -29.199      0.611      -30.960

Table 10: Results using Verbs
5.2.6 Adjectives
From the results (see Table 11), there is no clear advantage in using adjectives as feature representations, given the drop in performance compared to using words alone. Compared to verbs, however, the performance is slightly better.
Algorithm   Word                  Adjective
            Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
NB          0.860      0.860      0.739      -14.070      0.730      -15.116
SVM         0.884      0.882      0.738      -16.516      0.732      -17.007
k-NN        0.868      0.867      0.750      -13.594      0.747      -13.841
C4.5        0.870      0.866      0.726      -16.552      0.717      -17.206
Ripper      0.864      0.860      0.725      -16.088      0.713      -17.093
ADTree      0.887      0.885      0.740      -16.573      0.733      -17.175

Table 11: Results using Adjectives
5.2.7 Combination of Sources
The combination of word and noun phrase representations led to surprising results: improvements over the word representation were shown across several learning algorithms, and the improvements were consistent across the different learning algorithms. This suggests that noun phrases add discriminating power to the bag-of-words representation.
Algorithm   Word                  Combine
            Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
NB          0.860      0.860      0.864      0.465        0.861      0.116
SVM         0.884      0.882      0.892      0.905        0.892      1.134
k-NN        0.868      0.867      0.880      1.382        0.879      1.384
C4.5        0.870      0.866      0.878      0.920        0.877      1.270
Ripper      0.864      0.860      0.864      0            0.862      0.233
ADTree      0.887      0.885      0.892      0.564        0.891      0.678

Table 12: Results using both Linguistically Knowledge Source and Words
5.2.8 Analysis of Reuters-21578 Results
The results were analysed from two perspectives: the features and the learning algorithm used. The former allows us to identify the features that improve the classification accuracy of a learning algorithm. The latter allows us to select the appropriate type of learning algorithm for other applications based on the characteristics of the data; for example, if a data corpus contains many adjectives, we can tell which kind of learning algorithm is suitable for classifying its documents.
Algorithm   Word            Phrase          Combine         Tag             NN              Verb            Adjective
            Micro   Macro   Micro   Macro   Micro   Macro   Micro   Macro   Micro   Macro   Micro   Macro   Micro   Macro
            F1      F1      F1      F1      F1      F1      F1      F1      F1      F1      F1      F1      F1      F1
NB          0.860   0.860   0.833   0.830   0.864   0.861   0.815   0.812   0.861   0.857   0.636   0.625   0.739   0.730
SVM         0.884   0.882   0.852   0.884   0.892   0.892   0.864   0.863   0.924   0.924   0.637   0.625   0.738   0.732
k-NN        0.868   0.867   0.835   0.834   0.880   0.879   0.843   0.842   0.901   0.899   0.636   0.628   0.750   0.747
C4.5        0.870   0.866   0.842   0.837   0.878   0.877   0.879   0.880   0.908   0.906   0.619   0.601   0.726   0.717
Ripper      0.864   0.860   0.828   0.822   0.864   0.862   0.856   0.854   0.891   0.888   0.594   0.566   0.725   0.713
ADTree      0.887   0.885   0.848   0.847   0.892   0.891   0.881   0.881   0.913   0.912   0.628   0.611   0.740   0.733

Table 13: Contribution of Knowledge Sources on Reuters-21578 data set (Micro-averaged F1, Macro-averaged F1)
Table 13 shows the micro-averaged and macro-averaged F1 measures for the different knowledge sources and learning algorithms on the Reuters-21578 data set. The seven columns in the table correspond to:
(i) using only words;
(ii) using only phrases;
(iii) using both words and phrases;
(iv) using part of speech tags with words;
(v) using only nouns;
(vi) using only verbs; and
(vii) using only adjectives.

Each of these knowledge sources was used with document frequency as the feature selection technique.
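Document frequency feature selection simply retains features that appear in enough distinct documents. The sketch below is illustrative; the thesis does not state the cutoff used, so the `min_df` threshold and sample data here are assumptions.

```python
from collections import Counter

def select_by_document_frequency(docs, min_df=2):
    """Keep features occurring in at least min_df documents; rare features
    are discarded, which also shrinks the feature space."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each feature at most once per document
    return {feat for feat, count in df.items() if count >= min_df}

docs = [["oil", "price"], ["oil", "wheat"], ["wheat", "corn"]]
selected = select_by_document_frequency(docs)   # {'oil', 'wheat'}
```

The same selector applies unchanged whether the tokens are words, phrases, or any of the part-of-speech-filtered representations.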
Figure 8: Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Micro F1 values)
The best micro-averaged F1 value for Reuters-21578 is 92.4% (Figure 8), obtained by using nouns as the linguistically motivated knowledge source and SVM as the learning algorithm. The results indicate that SVM performs best with nouns, followed by combined words and phrases, words and phrases; it does relatively worse on adjectives and worst on verbs. The next best setup by micro-averaged F1, at 91.3%, is ADTree with nouns. Verbs were consistently the worst performing linguistically motivated knowledge source when employed as the sole feature representation, across all the learning algorithms. It is worth noting that for each of the learning algorithms tested, the best performing knowledge source was nouns, except for Naïve Bayes, for which the combined linguistically motivated knowledge sources performed best.
Figure 9: Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Macro F1 values)
The best macro-averaged F1 value turned out to mirror the micro-averaged results: SVM, together with nouns as the linguistically motivated knowledge source, outperformed all other classifiers. As with the micro-averaged F1 results, the order of performance of the linguistically motivated knowledge sources with SVM is: nouns, combined words and nouns, words, phrases, tags, adjectives and verbs. As in the previous set of results, the next best performing classifier was AdTree with nouns, with a macro-averaged F1 value of 91.2%. The macro-averaged data also reveal the same trend in the order of performance of each of the linguistically motivated features.
Algorithm   Word    Phrase   Combine   Tag     NN      Verb    Adjective
NB          0.898   0.872    0.884     0.845   0.891   0.687   0.750
SVM         0.902   0.871    0.886     0.867   0.930   0.677   0.734
k-NN        0.874   0.842    0.871     0.814   0.916   0.638   0.708
C4.5        0.898   0.873    0.861     0.889   0.920   0.678   0.724
Ripper      0.904   0.872    0.862     0.888   0.912   0.646   0.718
ADTree      0.902   0.871    0.880     0.884   0.918   0.670   0.730

Table 14: Contribution of knowledge sources on Reuters-21578 (Precision)
Table 14 tabulates the precision values for each learning algorithm using the different feature representations as knowledge sources, and Figure 10 shows a graphical representation of the results. The precision values corroborate our previous observation that using nouns alone gives the best results for SVM, k-NN, C4.5, Ripper and AdTree; unlike the previous case, however, Naïve Bayes works best with combined words and phrases as features. It is worth noting that precision is usually higher with words than with the combined representations. The worst performing feature representation is clearly verbs, consistently across all the learning algorithms.
Figure 10: Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Precision)
Algorithm   Word    Phrase   Combine   Tag     NN      Verb    Adjective
NB          0.830   0.806    0.852     0.793   0.836   0.600   0.746
SVM         0.874   0.847    0.906     0.871   0.923   0.620   0.766
k-NN        0.872   0.841    0.895     0.883   0.890   0.657   0.805
C4.5        0.851   0.827    0.904     0.883   0.903   0.589   0.755
Ripper      0.834   0.807    0.877     0.836   0.879   0.616   0.782
ADTree      0.881   0.841    0.911     0.888   0.912   0.614   0.774

Table 15: Contribution of knowledge sources on Reuters-21578 (Recall)
From Table 15 we see that the combined feature representation actually outperforms nouns alone for three of the learning algorithms. Figure 11 shows a graphical representation of the results. The combined representation works best with Naïve Bayes, k-NN and C4.5, while noun features outperform the other representations, including the combined one, with SVM, Ripper and ADTree. As reflected in the micro- and macro-averaged values, the worst performing representation is verbs. From our analysis of these preliminary results, we felt there was potential to combine words, phrases and nouns further to produce even better results. These findings led us to investigate the contribution of novel combinations of linguistically motivated features in another experiment, discussed in a later section.
Figure 11: Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Recall)
Based on the results, extracting nouns as features gives the best results for all the learning algorithms tested, while adjectives and verbs were not as effective for document classification. This could be because nouns capture the informative terms in this data set better than the other knowledge sources; removing non-informative terms improves the results by reducing noise in the category prediction process. The findings also pointed towards the use of novel combinations of linguistically motivated knowledge sources with bag-of-words, which was carried out on the WebKB corpus.
5.3 Contribution of Linguistic Knowledge Sources to Classification Accuracy of WebKB Corpus
Following the interesting findings from our previous experiment, we now attempt to
reproduce the set of linguistically motivated features with a different data corpus. In
addition, we also attempt to use novel combinations of linguistically motivated features
with the bag-of-word representations to determine the effects of such novel combinations
on document classification. A similar experiment was conducted on the WebKB
collection.
5.3.1 Words
Table 16 shows the results of using the word-based representation, which will also serve as the baseline classifier for the WebKB corpus. As with the Reuters-21578 corpus, good performance was achieved with words as the representation for each of the learning algorithms tested on WebKB.
Algorithm     Micro F1   Macro F1
Naïve Bayes   0.92       0.87
SVM           0.98       0.97
k-NN          0.73       0.65
AdaBoost      0.96       0.93
ADTree        1.00       0.99
Ripper        1.00       0.99
C4.5          0.77       0.99

Table 16: Results using Words
5.3.2 Phrase
Table 17 shows the results of using phrase-based features to build linguistically motivated classifiers. As with the Reuters corpus, the results show no indication that phrases help to improve the F1 measure; in fact, they indicate a substantial drop in both micro and macro values for each learning algorithm.

The drop in performance of the phrase-based linguistically motivated classifiers could be due to the sparseness of the features, especially noun phrases. The HTML pages of the WebKB corpus contain far fewer features than the Reuters documents, which potentially also reduces the effectiveness of the part of speech tagger introduced in the preprocessing phase.
Algorithm     Word                  Phrase
              Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
Naïve Bayes   0.92       0.87       0.73       -20.7        0.71       -18.4
SVM           0.98       0.97       0.72       -26.5        0.69       -28.9
k-NN          0.73       0.65       0.5        -31.5        0.4        -38.5
AdaBoost      0.96       0.93       0.92       -4.2         0.7        -24.7
ADTree        1          0.99       0.71       -29.0        0.67       -32.3
Ripper        1          0.99       0.8        -20.0        0.72       -27.3
C4.5          0.771      0.994      0.75       -2.7         0.58       -41.6

Table 17: Results using Phrases
5.3.3 Word Sense (Part of Speech Tags)
Using tagged terms as a linguistically motivated knowledge source on WebKB showed mixed results across the learning algorithms. As shown in Table 18, the micro F1 values for SVM and C4.5 show that the linguistically motivated classifiers are as competitive as the bag-of-words classifier, but the same features showed a drop in performance for the other learning algorithms.
Algorithm     Word                  Tag
              Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
Naïve Bayes   0.92       0.87       0.75       -18.5        0.69       -20.7
SVM           0.98       0.97       0.98       0.0          0.72       -25.8
k-NN          0.73       0.65       0.57       -21.9        0.5        -23.1
AdaBoost      0.96       0.93       0.89       -7.3         0.68       -26.9
ADTree        1          0.99       0.56       -44.0        0.63       -36.4
Ripper        1          0.99       0.79       -21.0        0.72       -27.3
C4.5          0.77       0.99       0.77       0.0          0.69       -30.3

Table 18: Results using Word Senses (Tags)
5.3.4 Nouns
The results for nouns were surprising: they showed a drop in both micro and macro F1 values, even though noun-based linguistically motivated knowledge sources gave the best performance on Reuters-21578. An analysis of the results and features shows that there were very few noun features in WebKB that could give discriminating power to the learning algorithms. This finding may also point to interaction effects between the characteristics of the data and the learning algorithms (Yang, 2001).
Algorithm     Word                  Nouns
              Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
Naïve Bayes   0.92       0.87       0.88       -4.3         0.78       -10.3
SVM           0.98       0.97       0.38       -61.2        0.76       -21.6
k-NN          0.73       0.65       0.4        -45.2        0.43       -33.8
AdaBoost      0.96       0.93       0.89       -7.3         0.77       -17.2
ADTree        1          0.99       0.76       -24.0        0.74       -25.3
Ripper        1          0.99       0.69       -31.0        0.66       -33.3
C4.5          0.771      0.994      0.53       -31.3        0.54       -45.7

Table 19: Results using Nouns
5.3.5 Verbs
Just as we had expected, performance dropped sharply when only verbs were used. First, the micro and macro values fared worse than those of nouns, as anticipated, because of the sparse distribution of verbs in the web pages. Second, the verb features obtained from the feature engineering phase included many words that do not contribute to the semantic content of the web pages. For example, words such as "do" and "go" confer no distinguishing value on the linguistically motivated classifiers because they can occur in any of the four categories tested.
Algorithm     Word                  Verb
              Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
Naïve Bayes   0.92       0.87       0.48       -47.8        0.45       -48.3
SVM           0.98       0.97       0.49       -50.0        0.44       -54.6
k-NN          0.73       0.65       0.41       -43.8        0.32       -50.8
AdaBoost      0.96       0.93       0.56       -41.7        0.49       -47.3
ADTree        1          0.99       0.43       -57.0        0.32       -67.7
Ripper        1          0.99       0.41       -59.0        0.3        -69.7
C4.5          0.771      0.994      0.51       -33.9        0.35       -64.8

Table 20: Results using Verbs
5.3.6 Adjectives
The results for adjectives are given in Table 21. The use of a bag of adjectives also caused a large drop in performance. Unlike the Reuters results, the adjective-based linguistically motivated knowledge sources fared worse than their verb counterparts. A close examination of the resulting knowledge sources reveals that the distribution of adjectives in the WebKB corpus is even sparser than that of verbs, suggesting that the distribution was too sparse to confer any distinguishing value on the classifiers.
Algorithm     Word                  Adjectives
              Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
Naïve Bayes   0.92       0.87       0.63       -31.5        0.48       -44.8
SVM           0.98       0.97       0.76       -22.4        0.35       -63.9
k-NN          0.73       0.65       0.38       -47.9        0.35       -46.2
AdaBoost      0.96       0.93       0.61       -36.5        0.37       -60.2
ADTree        1          0.99       0.29       -71.0        0.33       -66.7
Ripper        1          0.99       0.26       -74.0        0.28       -71.7
C4.5          0.771      0.994      0.36       -53.3        0.48       -51.7

Table 21: Results using Adjectives
Table 21: Results using Adjectives
5.3.7 Nouns & Words
Next, we combined nouns and words and compared the results with those of the bag of words. The resulting linguistically motivated classifiers were as competitive as the baseline classifier, and even performed better for some of the learning algorithms: competitive classification accuracies were found with ADTree, Ripper and AdaBoost, while improvements were found with Naïve Bayes, SVM and C4.5.
Algorithm     Word                  Nouns + Words
              Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
Naïve Bayes   0.92       0.87       0.93       1.1          0.9        3.4
SVM           0.98       0.97       0.99       1.0          0.98       1.0
k-NN          0.73       0.65       0.68       -6.8         0.59       -9.2
AdaBoost      0.96       0.93       0.95       -1.0         0.83       -10.8
ADTree        1          0.99       1          0.0          0.99       0.0
Ripper        1          0.99       1          0.0          0.99       0.0
C4.5          0.771      0.994      0.99       28.4         0.99       -0.4

Table 22: Results using Nouns & Words
5.3.8 Phrase & Words
Linguistically motivated classifiers based on phrases and words were also tested; the results are shown in Table 23. For the micro values, improvements were shown for SVM and C4.5, while Naïve Bayes, AdaBoost, ADTree and Ripper showed competitive results with the linguistically motivated knowledge sources. A slight decrease in accuracy was observed with k-NN.
Algorithm     Word                  Phrases + Words
              Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
Naïve Bayes   0.92       0.87       0.92       0.0          0.87       0.0
SVM           0.98       0.97       0.99       1.0          0.92       -5.2
k-NN          0.73       0.65       0.71       -2.7         0.51       -21.5
AdaBoost      0.96       0.93       0.96       0.0          0.82       -11.8
ADTree        1          0.99       1          0.0          0.99       0.0
Ripper        1          0.99       0.99       -1.0         0.99       0.0
C4.5          0.771      0.994      0.98       27.1         0.84       -15.5

Table 23: Results using Phrases & Words
5.3.9 Adjectives & Words
Lastly, a combination of adjectives and the bag of words was tested; the results are shown in Table 24. The micro F1 values were encouraging, with improvements shown for SVM, k-NN and C4.5, and competitive micro and macro results for Naïve Bayes, ADTree and AdaBoost.
Algorithm     Word                  Adjectives + Words
              Micro F1   Macro F1   Micro F1   Change (%)   Macro F1   Change (%)
Naïve Bayes   0.92       0.87       0.92       0.0          0.88       1.1
SVM           0.98       0.97       0.99       1.0          0.96       -1.0
k-NN          0.73       0.65       0.81       11.0         0.52       -20.0
AdaBoost      0.96       0.93       0.9        -6.2         0.81       -12.9
ADTree        1          0.99       1          0.0          0.99       0.0
Ripper        1          0.99       1          0.0          0.99       0.0
C4.5          0.771      0.994      0.99       28.4         0.98       -1.4

Table 24: Results using Adjectives & Words
5.3.10 Analysis of WebKB Results
From the results, it appears that the bag of words does well with most of the learning algorithms. It is worth highlighting, however, that the combinations of words with adjectives, nouns and phrases were as competitive as, and with some learning algorithms performed better than, the bag of words. Unlike the Reuters-21578 results, the WebKB results were less conclusive because of the differing trends: the best performing classifier differs across learning algorithms, but was mainly either the bag of words or a combination of linguistically motivated knowledge sources with words. In the case of SVM, the combination of nouns and words performed better than words alone. As in the previous set of results, verbs and adjectives did not seem to add much discriminative value to the classification.
Observations based on the micro-averaged F1 values (Figure 12) show that the
best-performing classifiers were AdTree and Ripper. AdTree works well with words and
with the combinations of nouns and words, of phrases and words, and of adjectives and
words. Ripper also works well with words, with the combination of nouns and words, and
with the combination of adjectives and words. This suggests that combined linguistically
motivated classifiers can be as competitive as words alone on a web-based corpus. The
best-performing classifiers on the macro-averaged F1 values were AdTree, Ripper and
C4.5 (see Figure 13); combinations of linguistically motivated knowledge sources with
words achieved high macro-averaged scores with these learning algorithms.
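For reference, micro-averaging pools the per-category contingency counts before computing a single F1, so frequent categories dominate, whereas macro-averaging computes F1 per category and takes the unweighted mean, so rare categories count equally. A minimal sketch of the two averages from per-category true-positive/false-positive/false-negative counts (the counts below are illustrative, not taken from the experiments):

```python
def f1(tp, fp, fn):
    """F1 score from a single contingency count."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one per category."""
    # Micro: sum the counts over categories, then compute one F1.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = f1(tp, fp, fn)
    # Macro: compute F1 per category, then average.
    macro = sum(f1(*c) for c in counts) / len(counts)
    return micro, macro

# One frequent category classified well, one rare category classified poorly:
# micro stays high (~0.91) while macro drops (~0.57), showing why the two
# averages can rank classifiers differently.
micro, macro = micro_macro_f1([(90, 5, 5), (1, 4, 4)])
```

This is why a representation can gain on the micro average yet lose on the macro average in Tables 23 and 24.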
Algorithm     Word           Phrase         Tag            Nouns          Verbs
              Micro  Macro   Micro  Macro   Micro  Macro   Micro  Macro   Micro  Macro
              F1     F1      F1     F1      F1     F1      F1     F1      F1     F1
Naïve Bayes   0.92   0.87    0.73   0.71    0.75   0.69    0.88   0.78    0.48   0.45
SVM           0.98   0.97    0.72   0.69    0.98   0.72    0.38   0.76    0.49   0.44
K-NN          0.73   0.65    0.5    0.4     0.57   0.5     0.4    0.43    0.41   0.32
AdaBoost      0.96   0.93    0.92   0.7     0.89   0.68    0.89   0.77    0.56   0.49
Adtree        1      0.99    0.71   0.67    0.56   0.63    0.76   0.74    0.43   0.32
Ripper        1      0.99    0.8    0.72    0.79   0.72    0.69   0.66    0.41   0.3
C4.5          0.77   0.99    0.75   0.58    0.77   0.69    0.53   0.54    0.51   0.35

Algorithm     Adjectives     Nouns & Words  Phrase & Words Adjectives & Words
              Micro  Macro   Micro  Macro   Micro  Macro   Micro  Macro
              F1     F1      F1     F1      F1     F1      F1     F1
Naïve Bayes   0.63   0.48    0.93   0.9     0.92   0.87    0.92   0.88
SVM           0.76   0.35    0.99   0.98    0.99   0.92    0.99   0.96
K-NN          0.38   0.35    0.68   0.59    0.71   0.51    0.81   0.52
AdaBoost      0.61   0.37    0.95   0.83    0.96   0.82    0.9    0.81
Adtree        0.29   0.33    1      0.99    1      0.99    1      0.99
Ripper        0.26   0.28    1      0.99    0.99   0.99    1      0.99
C4.5          0.36   0.48    0.99   0.99    0.98   0.84    0.99   0.98

Table 25: Consolidated Results of WebKB
[Chart omitted: bar chart of micro-averaged F1 for the Word, Phrase, Tag, Nouns, Verbs,
Adjectives, Nouns & Words, Phrase & Words and Adjectives & Words representations across
the learning algorithms Naïve Bayes, SVM, K-NN, AdaBoost, Adtree, Ripper and C4.5.]
Figure 12: Comparison of Different Linguistically Motivated Knowledge Sources on
WebKB (Micro F1 values)
[Chart omitted: bar chart of macro-averaged F1 for the same nine representations across
the same seven learning algorithms.]
Figure 13: Comparison of Different Linguistically Motivated Knowledge Sources on
WebKB (Macro F1 values)
5.4 Summary of Results
The conclusions from the first part of our study suggest that document classification
techniques work well with a noun-based linguistic source on Reuters-21578. The latter
part of our study, on WebKB, indicated that linguistically motivated classification using
combined representations was as competitive as the baseline word classifier.

Previous research has highlighted interaction effects between the data corpus and the
feature selection methods used (Yang, 1999). Our results suggest that the usefulness of
linguistically motivated knowledge sources depends on the characteristics of the corpus.
Reuters-21578 comes from the financial domain and comprises financial news containing a
substantial number of terms. WebKB, on the other hand, comes from a web-based domain
and consists mainly of HTML pages. This suggests that supervised document classification
using nouns may be suitable for corpora with a large number of features, whereas
documents with fewer features, such as HTML documents, may work well with a
combination of words and linguistically motivated knowledge sources.

Thus, based on our findings, we see potential in employing linguistically motivated
knowledge sources, especially nouns and combinations of these sources with words, to
improve classification accuracy.
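To make the combined representations concrete: a combined nouns-and-words classifier keeps the ordinary bag of words and adds part-of-speech-filtered noun tokens as extra features. The sketch below is illustrative only, using a tiny hand-written tag lexicon in place of a trained POS tagger; the `NOUN:` prefix is an assumed convention that keeps the two feature spaces distinct:

```python
# Toy POS lexicon used purely for illustration; a real pipeline would run a
# trained part-of-speech tagger over the document instead.
TOY_POS = {"market": "NN", "stock": "NN", "prices": "NN",
           "rose": "VB", "sharply": "RB", "the": "DT"}

def combined_features(text):
    """Bag of words plus prefixed noun features, as one sparse count vector."""
    features = {}
    for token in text.lower().split():
        # Every token contributes an ordinary word feature.
        features[token] = features.get(token, 0) + 1
        # Tokens tagged as nouns additionally contribute a noun feature.
        if TOY_POS.get(token, "").startswith("NN"):
            key = "NOUN:" + token
            features[key] = features.get(key, 0) + 1
    return features

feats = combined_features("The stock market prices rose sharply")
```

Here `feats` contains both the plain word counts and separate `NOUN:`-prefixed counts for "stock", "market" and "prices", so a learner can weight the noun evidence independently of the word evidence.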
CHAPTER 6
CONCLUSION
In this final chapter, we summarize the contributions of this research and offer some
concluding thoughts.
6.1 Summary
This research builds on previous empirical studies of document classification. The use of
several linguistically motivated knowledge sources, including novel combinations of these
sources as features, has been explored. The study extends previous research on the use of
natural language processing in document classification and covers several learning
algorithms with linguistically motivated knowledge sources not examined in previous
studies.

Linguistically motivated knowledge sources such as nouns were found to have significant
interaction effects on the Reuters-21578 corpus. On the other hand, novel combinations of
part-of-speech features with the bag of words also showed improvements in the
performance measures on the WebKB corpus. This suggests that linguistically motivated
classifiers can help to improve classification performance on these two corpora.
6.2 Contributions
This thesis makes several contributions. Firstly, a thorough literature review was
conducted across several streams of research, ranging from text classification and human
cognition to knowledge management and personalization, and an integrative technique
drawing ideas from these fields has been proposed.
Previous works focused mainly on improving techniques within the bag-of-words
paradigm; very few examined, in a single experiment, the effects of linguistically
motivated knowledge sources as feature representations on document classification
performance across several learning algorithms. We have tested several combinations of
linguistically motivated knowledge sources with various learning algorithms not covered
in previous studies.
This research adds a relatively new dimension to current work on document classification
by shedding light on the use of natural language processing techniques to employ
linguistically motivated knowledge sources as features. The potential of such features to
improve document classification with various popular learning algorithms was also
investigated. Although earlier studies provide answers and interpretations of their
results, several questions were left open, such as whether different feature
representations interact with different learning algorithms to affect classification
results; we attempt to answer this question in our research. In addition, we have
proposed novel combinations of linguistically motivated knowledge sources with words
whose performance appeared to be competitive.
With the mounting pressure of the exponential growth of available information, the
importance of research on document classification can hardly be overstated. The results
of our experiments can contribute to studies that employ learning algorithms to
categorize information so as to help people find and locate information, thereby
alleviating the information overload problem.
This thesis has examined and evaluated a series of classifiers, each incorporating
different linguistically motivated knowledge sources and representations. We find that
certain linguistic knowledge sources, such as nouns and noun phrases combined with
words, can offer substantial improvements over classifiers using the bag-of-words
technique.
In contrast with earlier works, this thesis presents a comparative empirical evaluation
of learning algorithms paired with different linguistic knowledge sources. We evaluated
seven learning algorithms (Naïve Bayes, Support Vector Machine, k-Nearest Neighbours,
C4.5 decision trees, AdaBoost, RIPPER and ADTree) and six linguistic knowledge sources
(word, phrase, word sense, nouns, verbs and adjectives), together with combinations of
some of these sources with words. We systematically evaluated the effectiveness of each
learning algorithm using these knowledge sources on the Reuters-21578 and WebKB data
corpora.
It was found that learning algorithms can improve classification accuracy with an
appropriate choice of knowledge sources. We have shown that the effectiveness of even
slightly different linguistic knowledge sources can vary substantially across
classification algorithms. From our experiments, the support vector machine with nouns
as the knowledge source achieved the best accuracy on Reuters-21578. For web-based
classification, on the other hand, combinations of linguistically motivated knowledge
sources with the bag of words worked as well as or better than the bag of words alone
with learning algorithms such as AdTree, Ripper and C4.5.
6.3 Limitations
This study employed several conventional yet popular learning algorithms. With ongoing
research in machine learning, however, the algorithms used in our study are only a
subset of the many available, so this empirical study should not be regarded as
complete; extending it to further learning algorithms remains worthwhile.
6.4 Conclusion
Document classification systems are integral to the success of knowledge management
systems, and there is increasing interest in applying supervised document classification
techniques within them. The findings of this study help to enhance our understanding of
incorporating linguistically motivated knowledge sources into document classification.
It is hoped that this research will prove valuable as an extension to similar studies.
REFERENCES
Arampatzis, A., van der Weide, Th.P., Koster, C.H.A., van Bommel, P. (2000). An
Evaluation of Linguistically-motivated Indexing Schemes. Proceedings of BCS-IRSG
2000 Colloquium on IR Research, 5th-7th April 2000, Sidney Sussex College,
Cambridge, England.
Arampatzis, A., van der Weide, Th.P., Koster, C.H.A., van Bommel, P. (2000).
Linguistically-motivated Information Retrieval, Encyclopedia of Library and Information
Science, Volume 69, pages 201-222, Marcel Dekker, Inc., New York, Basel, 2000.
Basili, R, Moschitti, A. and Pazienza, M.T. (2001). NLP-driven IR: Evaluating
Performances over a Text Classification task. In Proceedings of the 10th International
Joint Conference on Artificial Intelligence (IJCAI-2001), August 4th, Seattle, Washington,
USA.
Basili, R, Moschitti, A. and Pazienza, M.T. (2002). Empirical investigation of fast text
classification over linguistic features. In Proceedings of the 15th European Conference on
Artificial Intelligence (ECAI-2002), July 21-26, 2002, Lyon, France.
Basu, Chumki, Hirsh, Haym, Cohen, William W. (1998). Recommendation as
Classification: Using Social and Content-Based Information in Recommendation. In
Proceedings of the Fifteenth National Conference on Artificial Intelligence and Tenth
Innovative Applications of Artificial Intelligence Conference, AAAI 98, IAAI 98, July
26-30, 1998, Madison, Wisconsin, USA, pp. 714-720.
Bekkerman R., El-Yaniv, R., Tishby, N. and Winter, Y. (2003). Distributional Word
Clusters vs Words for Text Categorization. Journal of Machine Learning Research, 3, pp.
1183-1208.
Bruce, Rebecca F. & Wiebe, Janyce M. (1999). Recognizing subjectivity: a case study in
manual tagging. Natural Language Engineering, 5 (2).
Cardie C., Ng V., Pierce D., and Buckley C. (2000). Examining the Role of Statistical and
Linguistic Knowledge Sources in a General-Knowledge Question-Answering System.
Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000),
pp. 180-187, ACL/Morgan Kaufmann.
Chen H. (1992). Knowledge-based document retrieval: framework and design. Journal of
Information Science, 18(1992) 293-314. Elsevier Science Publishers, B.V.
Cohen, William W. (1996). Learning Trees and Rules with Set-Valued Features. In
Proceedings of the Thirteenth National Conference on Artificial Intelligence and Eighth
Innovative Applications of Artificial Intelligence Conference, AAAI 96, IAAI 96, August
4-8, 1996, Portland, Oregon - Volume 1. AAAI Press / The MIT Press, 709-716.
Cohen, William W. (1995). Fast Effective Rule Induction. In Proceedings of the Twelfth
International Conference on Machine Learning, Tahoe City,
California, USA, July 9-12, 1995. Armand Prieditis, Stuart J. Russell (Eds.). Morgan
Kaufmann, 115-123.
Cohen, William W. and Hirsh, H. (1998). Joins That Generalize: Text Classification
Using WHIRL. In Proceedings of the 4th International Conference on Knowledge
Discovery and Data Mining (KDD’98).
Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., and
Slattery, S. (1999). Learning to Construct Knowledge Bases from the World Wide Web,
Artificial Intelligence, Elsevier.
Cullingford, Richard E. (1986). Natural Language Processing: A Knowledge-Engineering
Approach. Rowman & Littlefield. New Jersey.
Dumais, S.T. (1994). Latent Semantic Indexing (LSI) and TREC-2. In: D. Harman (Ed.),
The Second Text Retrieval Conference (TREC2), National Institute of Standards and
Technology Special Publication, pp. 105-116.
Dumais, S. T., Platt, J., Heckerman D., and Sahami, M. (1998). Inductive learning
algorithms and representations for text categorization. In Proceedings of ACM-CIKM98,
Nov. 1998, pp. 148-155.
Dumais, S., Cutrell, E. and Chen, H. (2001). Optimizing Search by Showing Results in
Context. Proceedings of the SIG-CHI on Human Factors in Computing, March 31- April
4, 2001, Seattle, WA, USA, ACM, 2001.
Dumais, Susan T. and Chen, H. (2000). Hierarchical Classification of Web Content. In
Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, July 24-28, 2000, Athens, Greece, pp. 256-263.
Fagan Joel L. (1987). Experiments in Automatic Phrase Indexing for Document
Retrieval: A Comparison of Syntactic and Non-Syntactic Methods. PhD thesis, Cornell
University, 1987.
Farhoomand, A. F. and Drury, D. H. (2002). Managerial Information Overload.
Communications of the ACM, 45, 127-131.
Freund Yoav, Schapire Robert E. (1996). Experiments with a New Boosting Algorithm.
In Proceedings of the Thirteenth International Conference (ICML '96), July 3-6, Bari,
Italy, pp.148-156.
Freund Yoav, Mason Llew. (1999). The Alternating Decision Tree Learning Algorithm.
In Proceedings of the Sixteenth International Conference on Machine Learning (ICML
1999), June 27 - 30, Bled, Slovenia, pp. 124-133.
Fuhr, N., and Buckley, C. (1991). A Probabilistic Learning Approach for Document
Indexing, ACM Transactions on Information Systems, 9(3), pp. 223-248.
Furnkranz, J., Mitchell, T., Riloff, E. (1998). A Case Study in Using Linguistic Phrases
for Text Categorization on the WWW, In Proceedings of the 1st AAAI Workshop on
Learning for Text Categorization, pp. 5-12, Madison, US.
Furnkranz, J., and Widmer, G. (1994). Incremental Reduced Error Pruning. In
Proceedings of the 11th International Conference on Machine Learning (ML-94), pp. 70-77,
New Brunswick, NJ, Morgan Kaufmann.
Gentner, Dedre (1981). Some interesting differences between nouns and verbs. Cognition
and Brain Theory 4. 161-178.
Joachims, T. (1997). A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for
Text Categorization. In Proceedings of the 14th International Conference on Machine
Learning (ICML' 97).
Joachims, T. (1998). Text Categorization with Support Vector Machines: Learning with
Many Relevant Features. Proceedings of Machine Learning: ECML-98, 10th European
Conference on Machine Learning, April 21-23, Chemnitz, Germany, Springer, pp. 137-142.
Joachims, T. (2002). Learning to Classify Text Using Support Vector Machines. Kluwer
Academic Publishers, Boston Hardbound.
Kankanhalli A, Tan B C Y and Wei K K. (2001). Seeking Knowledge in Electronic
Knowledge Repositories: An Exploratory Study, Proceedings of the Twenty-Second
International Conference on Information Systems, New York, Association for Computing
Machinery, pp. 123-133.
Kongovi Madhusudhan, Guzman Juan Carlos, and Dasigi Venu. (2002). Text
Categorization: An Experiment Using Phrases. In Advances in Information Retrieval,
Proceedings of the 24th BCS-IRSG European Colloquium on IR Research Glasgow, UK,
March 25-27, 2002. ECIR, 2002, Springer Verlag Berlin Heidelberg, Lecture Notes in
Computer Science 2291, Springer, pp. 213-228.
Moulinier, I., Raskinis, G., and Ganascia, J.G. (1996). Text Categorization: A Symbolic
Approach. In Proceedings of the 5th Annual Symposium on Document Analysis and
Information Retrieval (SDAIR'96), pp. 87-99.
Lang K. (1995). NewsWeeder: Learning to filter netnews, Machine Learning,
Proceedings of the Twelfth International Conference on Machine Learning, July 9-12,
Tahoe City, California, USA, Morgan Kaufmann, pp.331-339.
Larkey, L.S. and Croft, W.B. (1996). Combining Classifiers in Text Categorization. In
Proceedings of the 19th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR'98), pp. 179-187.
Lewis, D. (1992). Representation and Learning in Information Retrieval. PhD Thesis,
University of Massachusetts, Amherst.
Lewis, D. D. and Ringuette, M. (1994). A Comparison of Two Learning Algorithms for
Text Categorization. Third Annual Symposium on Document Analysis and Information
Retrieval. pp 81-93.
Galavotti, L., Sebastiani, F. and Simi, M. (2000). Feature Selection and Negative Evidence in
Automated Text Categorization. Proceedings of the ACM KDD-00 Workshop on Text
Mining, Boston, US.
Masand Brij M., Linoff Gordon, Waltz David L. (1992). Classifying News Stories using
Memory Based Reasoning. In Proceedings of the 15th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, Copenhagen,
Denmark, June 21-24, 1992. ACM, pp. 59-65.
McCallum, A., and Nigam, K. (1998). A Comparison of Event Models for Naïve Bayes
Text Classification. AAAI-98, Workshop on Learning for Text Categorization, Technical
Report, WS-98-05, AAAI Press.
Mitchell, T.M. (1997). Machine Learning. McGraw Hill, New York, NY.
Medin, D. L., Lynch, E. B. and Solomon, K. O. (2000). Are There Kinds of Concepts?
Annual Review Psychology, 51, 121-147.
Mitra M., Buckley C., Singhal A., and Cardie C. (1997). An analysis of statistical and
syntactic phrases. Proceedings of RIAO '97, Montreal, Canada, June 25-27, 1997.
Mladenic, D. (1998). Feature Subset Selection in Text Learning. In Proceedings of the
10th European Conference on Machine Learning (ECML'98), pp. 95-100.
Ng, H. T. and Lee, H. B. (1996). Integrating Multiple Knowledge Sources to
Disambiguate Word Sense: An Exemplar-Based Approach. Proceedings of the 34th
Annual Meeting of the Association for Computational Linguistics, June 24-27, 1996,
University of California, Santa Cruz, California, USA, Proceedings. Morgan Kaufmann
Publishers / ACL 1996, pp. 40-47.
Nigam, K, McCallum, A., Thrun, S. and Mitchell, T. (1998). Learning to Classify Text
from Labeled and Unlabeled Documents. In Proceedings of the Fifteenth National
Conference on Artificial Intelligence (AAAI-98), pp. 792-799.
Domingos, Pedro, Pazzani, Michael J. (1996). Simple Bayesian Classifiers Do Not
Assume Independence. In Proceedings of the Thirteenth National Conference on
Artificial Intelligence and Eighth Innovative Applications of Artificial Intelligence
Conference, AAAI 96, IAAI 96, August 4-8, 1996, Portland, Oregon - Volume 2. AAAI
Press / The MIT Press, 1996.
Platt, J. (1999). Using Sparseness and Analytic QP to Speed Training of Support Vector
Machines, Advances in Neural Information Processing Systems, 11, M. S. Kearns, S. A.
Solla, D. A. Cohn, eds., MIT Press.
Platt, J. (1998). Sequential Minimal Optimization: A Fast Algorithm for Training Support
Vector Machines, Microsoft Research Technical Report MSR-TR-98-14.
Quinlan J. Ross. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Reuters-21578 data set [online] http://www.research.att.com/~lewis/reuters21578.html
Richardson R. and Smeaton A.F. (1995). Using WordNet in a knowledge-based approach
to information retrieval. [online]
Rogati, M. and Yang, Y. (2002). High-performing feature selection for text
classification, In Proceedings of the 10th Conference for Information and Knowledge
Management (CIKM-2002).
Roussinov, D.G., Chen, H.C. (1999). Document clustering for electronic meetings: an
experimental comparison of two techniques, Decision Support Systems, 27 (1-2): 67-79.
Sahami, M. (1998). Using Machine Learning to Improve Information Access. PhD
Thesis, Stanford University, Computer Science Department. STAN-CS-TR-98-1615.
Schapire, R.E., Singer, Y. and Singhal, A. (1998). Boosting and Rocchio Applied to Text
Filtering. In Proceedings of the 21st Annual International Conference on Research and
Development in Information Retrieval (SIGIR'98), pp. 215-223.
Schutze, H. and Silverstein H. (1997). Projections for efficient document clustering. In
Proceedings of the 20th Annual International Conference on Research and Development
in Information Retrieval (SIGIR'97), pp. 74–81, Philadelphia, PA, July 1997.
Scott S. and Matwin S. (1998). Text Classification Using WordNet Hypernyms. In
Proceedings of Coling-ACL'98 Workshop on the Usage of WordNet in Natural Language
Processing Systems, Montreal, Canada.
Scott S. and Matwin S. (1999). Feature Engineering for Text Classification. In
Proceedings of the Sixteenth International Conference on Machine Learning (ICML'99),
Bled, Slovenia, June 27-30, Morgan Kaufmann.
Sebastiani Fabrizio, (2002). Machine Learning in Automated Text Categorization. ACM
Computing Surveys, Vol 34, No. 1, March 2002, pp 1-47.
Siegel E.V. and McKeown K.R. (2001). Learning Methods to Combine Linguistic
Indicators: Improving Aspectual Classification and Revealing Linguistic Insights.
Association for Computational Linguistics.
SNOW URL:http://l2r.cs.uiuc.edu/~cogcomp/cc-software.html
Smeaton, A. (1999). Using NLP or NLP resources for Information Retrieval Tasks. In
Natural Language Information Retrieval, Kluwer, Boston, MA.
Strzalkowski, T., Carballo, J.P., Karlgren, J., Hulth, A., Tapanainen, P., Lahtinen, T.
(1999). Natural Language Information Retrieval: TREC-8 Report. In Proceedings of the
Eighth Text REtrieval Conference (TREC-8), Gaithersburg, Maryland, November 17-19,
NIST.
Tolle K.M. and Chen H. (2000). Comparing Noun Phrasing Techniques for Use with
Medical Digital Library Tools. Journal of the American Society for Information Science,
51(4):352-370, 2000.
Vapnik, V. (1995). The Nature of Statistical Learning Theory, Springer-Verlag.
WEKA URL: http://www.cs.waikato.ac.nz/ml/weka/
Yang, Y. and Pedersen, J. O. (1997). A Comparative Study on Feature Selection in Text
Categorization, In Proceedings of the Fourteenth International Conference on Machine
Learning (ICML 1997), Nashville, Tennessee, USA, July 8-12, 1997. Morgan Kaufmann
1997, pp. 412-420.
Yang, Y. and Liu, X. (1999). A Re-examination of Text Categorization Methods. In
Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR'99), August 15-19, 1999, Berkeley, USA.
ACM, pp. 42-49.
Yang, Y. (2001). A Study on Thresholding Strategies for Text Categorization. In
Proceedings of the 24th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR'01), pp. 137-145.
Zelikovitz S., and Hirsh, H. (2000). Improving Short Text Classification Using Unlabeled
Background Knowledge to Assess Document Similarity. In Proceedings of the 17th
International Conference on Machine Learning, pp. 1183-1190.
Zelikovitz S., and Hirsh, H. (2001). Using LSI for Text Classification in the Presence of
Background Text. In Proceedings of the 10th Conference for Information and Knowledge
Management, 2001.