Discovering Corpus-Specific Word Senses
Beate Dorow
Institut für Maschinelle Sprachverarbeitung
Universität Stuttgart, Germany
beate.dorow@ims.uni-stuttgart.de
Dominic Widdows
Center for the Study of Language and Information
Stanford University, California
dwiddows@csli.stanford.edu
Abstract
This paper presents an unsupervised algorithm which automatically discovers word senses from text. The algorithm is based on a graph model representing words and relationships between them. Sense clusters are iteratively computed by clustering the local graph of similar words around an ambiguous word. Discrimination against previously extracted sense clusters enables us to discover new senses. We use the same data for both recognising and resolving ambiguity.
1 Introduction
This paper describes an algorithm which automatically discovers word senses from free text and maps them to the appropriate entries of existing dictionaries or taxonomies.

Automatic word sense discovery has applications of many kinds. It can greatly facilitate a lexicographer's work and can be used to automatically construct corpus-based taxonomies or to tune existing ones. The same corpus evidence which supports a clustering of an ambiguous word into distinct senses can be used to decide which sense is referred to in a given context (Schütze, 1998).
This paper is organised as follows. In section 2, we present the graph model from which we discover word senses. Section 3 describes the way we divide graphs surrounding ambiguous words into different areas corresponding to different senses, using Markov clustering (van Dongen, 2000). The quality of the Markov clustering depends strongly on several parameters such as a granularity factor and the size of the local graph. In section 4, we outline a word sense discovery algorithm which bypasses the problem of parameter tuning. We conducted a pilot experiment to examine the performance of our algorithm on a set of words with varying degrees of ambiguity. Section 5 describes the experiment and presents a sample of the results. Finally, section 6 sketches applications of the algorithm and discusses future work.
2 Building a Graph of Similar Words
The model from which we discover distinct word senses is built automatically from the British National Corpus, which is tagged for parts of speech. Based on the intuition that nouns which co-occur in a list are often semantically related, we extract contexts of the form

Noun, Noun, and/or Noun,

e.g. "genomic DNA from rat, mouse and dog".
Following the method in (Widdows and Dorow, 2002), we build a graph in which each node represents a noun and two nodes have an edge between them if they co-occur in lists more than a given number of times.¹
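As a minimal illustration of this construction (our sketch, not the exact implementation used here), the graph could be built as follows in Python; extract_noun_lists, which yields the nouns of each matched "Noun, Noun, and/or Noun" pattern, is a hypothetical helper:

from collections import Counter, defaultdict
from itertools import combinations

import networkx as nx

def build_cooccurrence_graph(corpus, top_n=20):
    """Build the noun graph from list contexts. extract_noun_lists is
    a hypothetical helper yielding the nouns of each matched pattern."""
    counts = Counter()
    for nouns in extract_noun_lists(corpus):
        # Count each unordered pair of nouns co-occurring in a list.
        for a, b in combinations(sorted(set(nouns)), 2):
            counts[(a, b)] += 1
    # Collect each word's neighbours with their co-occurrence counts.
    neighbours = defaultdict(list)
    for (a, b), c in counts.items():
        neighbours[a].append((c, b))
        neighbours[b].append((c, a))
    # Link each word to its top-n neighbours (cf. footnote 1), which
    # avoids the frequency bias of a raw cutoff on pair counts.
    graph = nx.Graph()
    for word, cands in neighbours.items():
        for c, other in sorted(cands, reverse=True)[:top_n]:
            graph.add_edge(word, other, weight=c)
    return graph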
Following Lin's work (1998), we are currently investigating a graph with verb-object, verb-subject and modifier-noun collocations, from which it is possible to infer more about the senses of systematically polysemous words. The word sense clustering algorithm as outlined below can be applied to any kind of similarity measure based on any set of features.
¹ Simple cutoff functions proved unsatisfactory because of the bias they give to more frequent words. Instead we link each word to its top n neighbours, where n can be determined by the user (cf. section 4).
Figure 1: Local graph of the word mouse

Figure 2: Local graph of the word wing
3 Markov Clustering
Ambiguous words link otherwise unrelated areas of meaning. E.g. rat and printer are very different in meaning, but they are both closely related to different meanings of mouse. However, if we remove the mouse node from its local graph illustrated in figure 1, the graph decomposes into two parts, one representing the electronic device meaning of mouse and the other one representing its animal sense. There are, of course, many more types of polysemy (cf. e.g. (Kilgarriff, 1992)). As can be seen in figure 2, wing "part of a bird" is closely related to tail, as is wing "part of a plane". Therefore, even after removal of the wing node, the two areas of meaning are still linked via tail. The same happens with wing "part of a building" and wing "political group", which are linked via policy.

However, whereas there are many edges within an area of meaning, there is only a small number of (weak) links between different areas of meaning. To detect the different areas of meaning in our local graphs, we use a cluster algorithm for graphs (Markov clustering, MCL) developed by van Dongen (2000). The idea underlying the MCL algorithm is that random walks within the graph will tend to stay in the same cluster rather than jump between clusters.
The following notation and description of the MCL algorithm borrows heavily from van Dongen (2000). Let $G_w$ denote the local graph around the ambiguous word $w$. The adjacency matrix $M_{G_w}$ of a graph $G_w$ is defined by setting $(M_{G_w})_{pq}$ equal to the weight of the edge between nodes $v_p$ and $v_q$. Normalising the columns of $M_{G_w}$ results in the Markov matrix $T_{G_w}$ whose entries $(T_{G_w})_{pq}$ can be interpreted as the transition probability from $v_q$ to $v_p$. It can easily be shown that the $k$-th power of $T_{G_w}$ lists the probabilities $(T_{G_w}^k)_{pq}$ of a path of length $k$ starting at node $v_q$ and ending at node $v_p$.
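In symbols, the definitions above read:

$$(M_{G_w})_{pq} = \text{weight of the edge } \{v_p, v_q\}, \qquad (T_{G_w})_{pq} = \frac{(M_{G_w})_{pq}}{\sum_{p'} (M_{G_w})_{p'q}},$$
$$(T_{G_w}^{k})_{pq} = \Pr\bigl[\text{a walk of length } k \text{ starting at } v_q \text{ ends at } v_p\bigr].$$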
The MCL algorithm simulates flow in $G_w$ by iteratively recomputing the set of transition probabilities via two steps, expansion and inflation. The expansion step corresponds to taking the $k$-th power of $T_{G_w}$ as outlined above and allows nodes to see new neighbours. The inflation step takes each matrix entry to the $r$-th power and then rescales each column so that the entries sum to 1. Via inflation, popular neighbours are further supported at the expense of less popular ones.

Flow within dense regions in the graph is concentrated by both expansion and inflation. Eventually, flow between dense regions will disappear, the matrix of transition probabilities $T_{G_w}$ will converge, and the limiting matrix can be interpreted as a clustering of the graph.
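A minimal dense-matrix sketch of these two steps (an illustration of van Dongen's scheme, not his reference implementation; the common trick of adding self-loops is omitted for brevity):

import numpy as np

def mcl(adjacency, k=2, r=2.0, max_iter=100, tol=1e-6):
    """Markov clustering on a dense adjacency matrix (van Dongen, 2000).
    Assumes every node has at least one edge, so no column sums to zero."""
    # Normalise columns to obtain the Markov matrix T_{G_w}.
    T = adjacency / adjacency.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        previous = T.copy()
        # Expansion: k-th matrix power; flow reaches new neighbours
        # along paths of length k.
        T = np.linalg.matrix_power(T, k)
        # Inflation: entrywise r-th power, then column rescaling;
        # popular neighbours are supported at the expense of others.
        T = T ** r
        T = T / T.sum(axis=0, keepdims=True)
        if np.abs(T - previous).max() < tol:
            break
    # In the limiting matrix, the non-zero entries of each column
    # identify the cluster that the corresponding node belongs to.
    return T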
4 Word Sense Clustering Algorithm
The output of the MCL algorithm strongly depends on the inflation and expansion parameters $r$ and $k$, as well as the size of the local graph which serves as input to MCL.
An appropriate choice of the inflation parameter $r$ can depend on the ambiguous word $w$ to be clustered. In case of homonymy, a small inflation parameter $r$ would be appropriate. However, there are ambiguous words with more closely related senses which are metaphorical or metonymic variations of one another. In that case, the different regions of meaning are more strongly interlinked and a small power coefficient $r$ would lump different meanings together.
Usually, one sense of an ambiguous word $w$ is much more frequent than its other senses present in the corpus. If the local graph handed over to the MCL process is small, we might miss some of $w$'s meanings in the corpus. On the other hand, if the local graph is too big, we will get a lot of noise.
Below, we outline an algorithm which circumvents the problem of choosing the right parameters. In contrast to pure Markov clustering, we don't try to find a complete clustering of $G_w$ into senses at once. Instead, in each step of the iterative process, we try to find only the most distinctive cluster $c$ of $G_w$ (i.e. the most distinctive meaning of $w$). We then recompute the local graph $G_w$ by discriminating against $c$'s features. This is achieved, in a manner similar to Pantel and Lin's (2002) sense clustering approach, by removing $c$'s features from the set of features used for finding similar words. The process is stopped if the similarity between $w$ and its best neighbour under the reduced set of features is below a fixed threshold.
Let $F$ be the set of $w$'s features, and let $L$ be the output of the algorithm, i.e. a list of sense clusters, initially empty. The algorithm consists of the following steps (a code sketch follows the list):

1. Compute a small local graph $G_w$ around $w$ using the set of features $F$. If the similarity between $w$ and its closest neighbour is below a fixed threshold, go to 6.

2. Recursively remove all nodes of degree one. Then remove the node corresponding to $w$ from $G_w$.

3. Apply MCL to $G_w$ with a fairly big inflation parameter $r$ which is fixed.

4. Take the "best" cluster (the one that is most strongly connected to $w$ in $G_w$ before removal of $w$), add it to the final list of clusters $L$, and remove/devalue its features from $F$.

5. Go back to 1 with the reduced/devalued set of features $F$.

6. Go through the final list of clusters $L$ and assign a name to each cluster using a broad-coverage taxonomy (see below). Merge semantically close clusters using a taxonomy-based semantic distance measure (Budanitsky and Hirst, 2001) and assign a class-label to the newly formed cluster.

7. Output the list of class-labels which best represent the different senses of $w$ in the corpus.
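A schematic rendering of steps 1-5 in Python follows. Since no implementation is given here, build_local_graph, similarity and run_mcl are assumed callables supplied by the caller, and the graph is assumed to follow the networkx API; this is a sketch of the control flow only:

def discover_senses(w, features, build_local_graph, similarity, run_mcl,
                    threshold):
    """Iteratively extract the most distinctive sense cluster of w,
    then discriminate against its features and repeat (steps 1-5)."""
    clusters = []  # the output list L of sense clusters
    while True:
        g = build_local_graph(w, features)  # step 1
        others = [v for v in g if v != w]
        if not others or max(similarity(w, v, features)
                             for v in others) < threshold:
            break  # no distinctive sense left; continue with step 6
        # Step 2: recursively prune degree-one chains, then remove w.
        leaves = [v for v in g if v != w and g.degree(v) == 1]
        while leaves:
            g.remove_nodes_from(leaves)
            leaves = [v for v in g if v != w and g.degree(v) == 1]
        w_weights = {v: g[w][v].get("weight", 1.0) for v in g.neighbors(w)}
        g.remove_node(w)
        # Step 3: Markov clustering with a fixed, fairly big inflation r.
        partition = run_mcl(g)
        # Step 4: keep the cluster most strongly connected to w.
        best = max(partition,
                   key=lambda c: sum(w_weights.get(v, 0.0) for v in c))
        clusters.append(best)
        # Step 5: discriminate against the extracted cluster's features
        # (in the list model, the features are the cluster members).
        features = features - set(best)
    return clusters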
The local graph in step 1 consists of $w$, the $n_1$ neighbours of $w$ and the $n_2$ neighbours of the neighbours of $w$. Since in each iteration we only attempt to find the "best" cluster, it suffices to build a relatively small graph in 1. Step 2 removes noisy strings of nodes pointing away from $G_w$. The removal of $w$ from $G_w$ might already separate the different areas of meaning, but will at least significantly loosen the ties between them.

In our simple model based on noun co-occurrences in lists, step 5 corresponds to rebuilding the graph under the restriction that the nodes in the new graph not co-occur (or at least not very often) with any of the cluster members already extracted.
The class-labelling (step 6) is accomplished using the taxonomic structure of WordNet, using a robust algorithm developed specially for this purpose. The hypernym which subsumes as many cluster members as possible and does so as closely as possible in the taxonomic tree is chosen as class-label. The family of such algorithms is described in (Widdows, 2003).
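As an illustration of the idea, using NLTK's WordNet interface (the decaying credit 1/(1 + distance) below is an assumption for the sketch, not the exact scoring function of Widdows (2003)):

from collections import defaultdict
from nltk.corpus import wordnet as wn

def class_label(cluster_members):
    """Score candidate hypernyms: each member credits every hypernym on
    its paths to the WordNet root, with credit decaying with distance."""
    scores = defaultdict(float)
    for word in cluster_members:
        for synset in wn.synsets(word, pos=wn.NOUN):
            for path in synset.hypernym_paths():
                # reversed(path) walks from the synset up to the root.
                for distance, hypernym in enumerate(reversed(path)):
                    scores[hypernym] += 1.0 / (1.0 + distance)
    # The winner subsumes many cluster members, and does so closely.
    return max(scores, key=scores.get) if scores else None

print(class_label(["cedar", "larch", "oak", "pine", "beech"]))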
5 Experimental Results
In this section, we describe an initial evaluation experiment and present the results. We will soon carry out and report on a more thorough analysis of our algorithm.
We used the simple graph model based on co-occurrences of nouns in lists (cf. section 2) for our experiment. We gathered a list of nouns with varying degrees of ambiguity, from homonymy (e.g. arms) to systematic polysemy (e.g. cherry). Our algorithm was applied to each word in the list (with parameters $n_1 = 20$, $n_2 = 10$, $r = 2.0$, $k = 2.0$) in order to extract the top two sense clusters only. We then determined the WordNet synsets which most adequately characterised the sense clusters. An extract of the results is listed in table 1.
Word | Sense cluster | Class-label
arms | knees trousers feet biceps hips elbows backs wings breasts shoulders thighs bones buttocks ankles legs inches wrists shoes necks | body part
arms | horses muskets charges weapons methods firearms knives explosives bombs bases mines projectiles drugs missiles uniforms | weapon
jersey | israel colombo guernsey luxembourg denmark malta greece belgium sweden turkey gibraltar portugal ireland mauritius britain cyprus netherlands norway australia italy japan canada kingdom spain austria zealand england france germany switzerland finland poland america usa iceland holland scotland uk | European country
jersey | crucifix bow apron sweater tie anorak hose bracelet helmet waistcoat jacket pullover equipment cap collar suit fleece tunic shirt scarf belt | garment
head | voice torso back chest face abdomen side belly groin spine breast bill rump midhair hat collar waist tail stomach skin throat neck speculum | body part
head | ceo treasurer justice chancellor principal founder president commander deputy administrator constable librarian secretary governor captain premier executive chief curator assistant committee patron ruler | person
oil | heat coal power water gas food wood fuel steam tax heating kerosene fire petroleum dust sand light steel telephone timber supply drainage diesel electricity acid air insurance petrol | object
oil | tempera gouache watercolour poster pastel collage acrylic | paint
lemon | bread cheese mint butter jam cream pudding yogurt sprinkling honey jelly toast ham chocolate pie syrup milk meat beef cake yoghurt grain | foodstuff
lemon | hazel elder holly family virgin hawthorn | shrub
cherry | cedar larch mahogany water sycamore lime teak ash hornbeam oak walnut hazel pine beech alder thorn poplar birch chestnut blackthorn spruce holly yew laurel maple elm fir hawthorn willow | wood
cherry | bacon cream honey pie grape blackcurrant cake banana | foodstuff
Table 1: Output of word sense clustering.
6 Applications and future research
The benefits of automatic, data-driven word sense discovery for natural language processing and lexicography would be very great. Here we only mention a few direct results of our work.
Our algorithm not only recognises ambiguity, but can also be used to resolve it, because the features shared by the members of each sense cluster provide a strong indication of which reading of an ambiguous word is appropriate in a given context. This gives rise to an automatic, unsupervised word sense disambiguation algorithm which is trained on the data to be disambiguated.
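In the simple list model, for instance, such a disambiguator could reduce to measuring the overlap between the context and each sense cluster (a hypothetical sketch, with invented cluster members):

def choose_sense(context_nouns, sense_clusters):
    """Pick the sense cluster that shares the most members with the
    nouns found in the ambiguous word's context."""
    return max(sense_clusters,
               key=lambda c: len(set(context_nouns) & set(c)))

# E.g. for "mouse", with clusters along the lines of figure 1:
senses = [{"rat", "hamster", "cat"}, {"keyboard", "printer", "monitor"}]
print(choose_sense({"keyboard", "screen", "click"}, senses))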
The ability to map senses into a taxonomy using the class-labelling algorithm can be used to ensure that the sense-distinctions discovered correspond to recognised differences in meaning. This approach to disambiguation combines the benefits of both Yarowsky's (1995) and Schütze's (1998) approaches. Preliminary observations show that the different neighbours in Table 1 can be used to indicate with great accuracy which of the senses is being used.
Off-the-shelf lexical resources are rarely adequate for NLP tasks without being adapted. They often contain many rare senses, but not the same ones that are relevant for specific domains or corpora. The problem can be addressed by using word sense clustering to attune an existing resource to accurately describe the meanings used in a particular corpus.
We are preparing an evaluation of our algorithm as applied to the collocation relationships (cf. section 2), and we plan to evaluate the uses of our clustering algorithm for unsupervised disambiguation more thoroughly.
References
A. Budanitsky and G. Hirst. 2001. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Workshop on WordNet and Other Lexical Resources, Second Meeting of the North American Chapter of the Association for Computational Linguistics, Pittsburgh, June.

S. van Dongen. 2000. A cluster algorithm for graphs. Technical Report INS-R0010, National Research Institute for Mathematics and Computer Science, Amsterdam, The Netherlands, May.

A. Kilgarriff. 1992. Polysemy. Ph.D. thesis, University of Sussex, December.

D. Lin. 1998. Automatic retrieval and clustering of similar words. In COLING-ACL98, Montreal, Canada, August.

P. Pantel and D. Lin. 2002. Discovering word senses from text. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Edmonton, Canada, May.

H. Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–123.

D. Widdows and B. Dorow. 2002. A graph model for unsupervised lexical acquisition. In COLING, Taiwan, August.

D. Widdows. 2003. Unsupervised methods for developing taxonomies using syntactic and statistical information. In HLT-NAACL (to appear), Edmonton, Canada.

D. Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196, Cambridge, MA.