Memory-Based Learning: Using Similarity for Smoothing
Jakub Zavrel and Walter Daelemans
Computational Linguistics
Tilburg University
PO Box 90153
5000 LE Tilburg
The Netherlands
{zavrel, walter}@kub.nl
Abstract
This paper analyses the relation between
the use of similarity in Memory-Based
Learning and the notion of backed-off
smoothing in statistical language model-
ing. We show that the two approaches are
closely related, and we argue that feature
weighting methods in the Memory-Based
paradigm can offer the advantage of au-
tomatically specifying a suitable domain-
specific hierarchy between most specific
and most general conditioning information
without the need for a large number of pa-
rameters. We report two applications of
this approach: PP-attachment and POS-
tagging. Our method achieves state-of-the-
art performance in both domains, and al-
lows the easy integration of diverse infor-
mation sources, such as rich lexical repre-
sentations.
1 Introduction
Statistical approaches to disambiguation offer the
advantage of making the most likely decision on the
basis of available evidence. For this purpose a large
number of probabilities has to be estimated from a
training corpus. However, many possible condition-
ing events are not present in the training data, yield-
ing zero Maximum Likelihood (ML) estimates. This
motivates the need for
smoothing
methods, which re-
estimate the probabilities of low-count events from
more reliable estimates.
Inductive generalization from observed to new
data lies at the heart of machine-learning approaches
to disambiguation. In Memory-Based Learning¹
(MBL) induction is based on the use of similarity
(Stanfill & Waltz, 1986; Aha
et al.,
1991; Cardie,
1994; Daelemans, 1995). In this paper we describe
how the use of similarity between patterns embod-
ies a solution to the sparse data problem, how it
¹The approach is also referred to as Case-based,
Instance-based or Exemplar-based.
relates to backed-off smoothing methods and what
advantages it offers when combining diverse and rich
information sources.
We illustrate the analysis by applying MBL to
two tasks where combination of information sources
promises to bring improved performance: PP-
attachment disambiguation and Part of Speech tag-
ging.
2 Memory-Based Language Processing
The basic idea in Memory-Based language process-
ing is that processing and learning are fundamen-
tally interwoven. Each language experience leaves a
memory trace which can be used to guide later pro-
cessing. When a new instance of a task is processed,
a set of relevant instances is selected from memory,
and the output is produced by analogy to that set.
The techniques that are used are variants and
extensions of the classic k-nearest neighbor (k-
NN) classifier algorithm. The instances of a task
are stored in a table as patterns of feature-value
pairs, together with the associated "correct" out-
put. When a new pattern is processed, the k nearest
neighbors of the pattern are retrieved from memory
using some similarity metric. The output is then de-
termined by extrapolation from the k nearest neigh-
bors, i.e. the output is chosen that has the highest
relative frequency among the nearest neighbors.
Note that no abstractions, such as grammatical
rules, stochastic automata, or decision trees are ex-
tracted from the examples. Rule-like behavior re-
sults from the linguistic regularities that are present
in the patterns of usage in memory in combination
with the use of an appropriate similarity metric.
It is our experience that even limited forms of ab-
straction can harm performance on linguistic tasks,
which often contain many subregularities and excep-
tions (Daelemans, 1996).
2.1 Similarity metrics
The most basic metric for patterns with symbolic
features is the Overlap metric given in equations 1 and 2, where Δ(X, Y) is the distance between patterns X and Y, represented by n features, w_i is a weight for feature i, and δ is the distance per feature. The k-NN algorithm with this metric, and equal weighting for all features, is called IB1 (Aha et al., 1991). Usually k is set to 1.

$$\Delta(X, Y) = \sum_{i=1}^{n} w_i \, \delta(x_i, y_i) \qquad (1)$$

where:

$$\delta(x_i, y_i) = 0 \ \text{if} \ x_i = y_i, \ \text{else} \ 1 \qquad (2)$$
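As an illustration, the following is a minimal sketch (not from the paper) of this metric and an IB1-style classifier. The memory contents and the query are hypothetical; with unit weights the distance is the plain Overlap metric, and passing feature weights such as the IG weights of section 2.1 gives the weighted variant.

```python
from collections import Counter

def overlap_distance(x, y, weights):
    # Equations 1 and 2: weighted count of mismatching feature values.
    return sum(w * (0 if xi == yi else 1) for xi, yi, w in zip(x, y, weights))

def ib1_classify(memory, query, weights, k=1):
    # memory: list of (feature_tuple, class_label) pairs.
    # Keep every pattern in the k nearest distance "buckets" (a bucket of
    # equal similarity can contain more than k patterns) and extrapolate
    # by taking the most frequent class among them.
    dists = [(overlap_distance(x, query, weights), c) for x, c in memory]
    kept = sorted(set(d for d, _ in dists))[:k]
    votes = Counter(c for d, c in dists if d in kept)
    return votes.most_common(1)[0][0]

# Hypothetical usage with PP-attachment style cases (verb, noun, prep, noun).
memory = [(("ate", "pizza", "with", "fork"), "V"),
          (("ate", "pizza", "with", "cheese"), "N")]
print(ib1_classify(memory, ("ate", "spaghetti", "with", "fork"), [1, 1, 1, 1]))
```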
This metric simply counts the number of
(mis)matching feature values in both patterns. If
we do not have information about the importance
of features, this is a reasonable choice. But if we
do have some information about feature relevance
one possibility would be to add linguistic bias to
weight or select different features (Cardie, 1996). An
alternative, more empiricist approach is to look
at the behavior of features in the set of examples
used for training. We can compute statistics about
the relevance of features by looking at which fea-
tures are good predictors of the class labels. Infor-
mation Theory gives us a useful tool for measuring
feature relevance in this way (Quinlan, 1986; Quin-
lan, 1993).
Information Gain (IG) weighting looks at each
feature in isolation, and measures how much infor-
mation it contributes to our knowledge of the cor-
rect class label. The Information Gain of feature f
is measured by computing the difference in uncer-
tainty (i.e. entropy) between the situations with-
out and with knowledge of the value of that feature
(Equation 3).
$$w_f = \frac{H(C) - \sum_{v \in V_f} P(v) \times H(C|v)}{si(f)} \qquad (3)$$

$$si(f) = -\sum_{v \in V_f} P(v) \log_2 P(v) \qquad (4)$$

where C is the set of class labels, V_f is the set of values for feature f, and H(C) = -Σ_{c∈C} P(c) log₂ P(c) is the entropy of the class labels. The probabilities are estimated from relative
frequencies in the training set. The normalizing fac-
tor si(f) (split info) is included to avoid a bias in
favor of features with more values. It represents the
amount of information needed to represent all val-
ues of the feature (Equation 4). The resulting IG
values can then be used as weights in equation 1.
The k-NN algorithm with this metric is called IB1-IG (Daelemans & Van den Bosch, 1992).
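A minimal sketch of the gain-ratio computation of equations 3 and 4, assuming probabilities are estimated as relative frequencies over a list of (feature_tuple, class_label) training cases; the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    # H over a list of symbols, with relative frequencies as probabilities.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def ig_weight(cases, f):
    # cases: list of (feature_tuple, class_label) pairs; f: feature index.
    classes = [c for _, c in cases]
    h_c = entropy(classes)                                    # H(C)
    total = len(cases)
    values = Counter(x[f] for x, _ in cases)
    remainder = sum((n / total) * entropy([c for x, c in cases if x[f] == v])
                    for v, n in values.items())               # sum_v P(v) H(C|v)
    split_info = entropy([x[f] for x, _ in cases])            # si(f), equation 4
    return (h_c - remainder) / split_info if split_info else 0.0
```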
The possibility of automatically determining the
relevance of features implies that many different and
possibly irrelevant features can be added to the fea-
ture set. This is a very convenient methodology if
theory does not constrain the choice enough before-
hand, or if we wish to measure the importance of
various information sources experimentally.
Finally, it should be mentioned that MB-
classifiers, despite their description as table-lookup
algorithms here, can be implemented to work
fast, using e.g. tree-based indexing into the case-
base (Daelemans et al., 1997).
3 Smoothing of Estimates
The commonly used method for probabilistic clas-
sification (the Bayesian classifier) chooses a class
for a pattern X by picking the class that has the
maximum conditional probability P(class|X). This
probability is estimated from the data set by looking
at the relative joint frequency of occurrence of the
classes and pattern X. If pattern X is described by
a number of feature-values x_1, ..., x_n, we can write the conditional probability as P(class|x_1, ..., x_n). If a particular pattern x_1, ..., x_n is not literally present
among the examples, all classes have zero ML prob-
ability estimates. Smoothing methods are needed to
avoid zeroes on events that could occur in the test
material.
There are two main approaches to smoothing:
count re-estimation smoothing such as the Add-One
or Good-Turing methods (Church & Gale, 1991),
and Back-off type methods (Bahl et al., 1983; Katz,
1987; Chen & Goodman, 1996; Samuelsson, 1996).
We will focus here on a comparison with Back-off
type methods, because an experimental comparison
in Chen & Goodman (1996) shows the superiority
of Back-off based methods over count re-estimation
smoothing methods. With the Back-off method the
probabilities of complex conditioning events are ap-
proximated by (a linear interpolation of) the proba-
bilities of more general events:
$$\hat{P}(class|X) = \lambda_X \tilde{P}(class|X) + \lambda_{X'} \tilde{P}(class|X') + \ldots + \lambda_{X^n} \tilde{P}(class|X^n) \qquad (5)$$

where P̂ stands for the smoothed estimate, P̃ for the relative frequency estimate, the λ are interpolation weights with Σ_{i=0}^{n} λ_{X^i} = 1, and X ≺ X^i for all i, where ≺ is a (partial) ordering from most specific to most general feature-sets² (e.g. the probabilities of trigrams (X) can be approximated by bigrams (X') and unigrams (X'')). The weights of the linear interpolation are estimated by maximizing the probability of held-out data (deleted interpolation)
with the forward-backward algorithm. An alterna-
tive method to determine the interpolation weights
without iterative training on held-out data is given
in Samuelsson (1996).
²X ≺ X' can be read as: X is more specific than X'.
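A sketch of the interpolation in equation 5 for an n-gram style ordering, with hand-fixed λ's rather than weights estimated on held-out data; the counts and names below are hypothetical.

```python
def rel_freq(joint, marginal, context, cls):
    # Relative frequency estimate of P(class | context); zero if unseen.
    total = marginal.get(context, 0)
    return joint.get((context, cls), 0) / total if total else 0.0

def smoothed_estimate(joint, marginal, contexts, cls, lambdas):
    # Equation 5: interpolate estimates conditioned on events ordered from
    # most specific (X) to most general (X^n); the lambdas must sum to 1.
    return sum(lam * rel_freq(joint, marginal, ctx, cls)
               for lam, ctx in zip(lambdas, contexts))

# Back off from a two-word context to one word and to the empty context.
joint = {(("the", "cat"), "NN"): 3, (("cat",), "NN"): 5, ((), "NN"): 40}
marginal = {("the", "cat"): 3, ("cat",): 6, (): 100}
print(smoothed_estimate(joint, marginal,
                        [("the", "cat"), ("cat",), ()], "NN", [0.7, 0.2, 0.1]))
```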
We can assume for simplicity's sake that the λ_{X^i} do not depend on the value of X^i, but only on i. In this case, if F is the number of features, there are 2^F − 1 more general terms, and we need to estimate λ's for all of these. In most applications the interpolation method is used for tasks with clear orderings of feature-sets (e.g. n-gram language modeling), so that many of the 2^F − 1 terms can be omitted
beforehand. More recently, the integration of infor-
mation sources, and the modeling of more complex
language processing tasks in the statistical frame-
work has increased the interest in smoothing meth-
ods (Collins & Brooks, 1995; Ratnaparkhi, 1996;
Magerman, 1994; Ng & Lee, 1996; Collins, 1996).
For such applications with a diverse set of features
it is not necessarily the case that terms can be ex-
cluded beforehand.
If we let the λ_{X^i} depend on the value of X^i, the number of parameters explodes even faster. A practical solution for this is to make a smaller number of buckets for the X^i, e.g. by clustering (see e.g. Magerman (1994)).
Note that linear interpolation (equation 5) actu-
ally performs two functions. In the first place, if the
most specific terms have non-zero frequency, it still
interpolates them with the more general terms. Be-
cause the more general terms should never overrule
the more specific ones, the λ_{X^i} for the more general
terms should be quite small. Therefore the inter-
polation effect is usually small or negligible. The
second function is the pure back-off function: if the
more specific terms have zero frequency, the proba-
bilities of the more general terms are used instead.
Only if terms are of similar specificity do the λ's truly serve to weight the relevance of the interpolation terms.
If we isolate the pure back-off function of the in-
terpolation equation we get an algorithm similar to
the one used in Collins & Brooks (1995). It is given
in a schematic form in Table 1. Each step consists
of a back-off to a lower level of specificity. There
are as many steps as features, and there are a total of 2^F terms, divided over all the steps. Because all features are considered of equal importance, we call this the Naive Back-off algorithm.
Usually, not all features x are equally important,
so that not all back-off terms are equally relevant
for the re-estimation. Hence, the problem of fitting
the λ_{X^i} parameters is replaced by a term selection
task. To optimize the term selection, an evaluation
of the up to 2^F terms on held-out data is still neces-
sary. In summary, the Back-off method does not pro-
vide a principled and practical domain-independent
method to adapt to the structure of a particular do-
main by determining a suitable ordering -< between
events. In the next section, we will argue that a for-
mal operationalization of similarity between events,
as provided by MBL, can be used for this purpose.
In MBL the similarity metric and feature weighting
scheme automatically determine the implicit back-
off ordering using a domain independent heuristic, with only a few parameters, in which there is no need for held-out data.

If f(x_1, ..., x_n) > 0:
    p̂(c|x_1, ..., x_n) = f(c, x_1, ..., x_n) / f(x_1, ..., x_n)
Else if f(x_1, ..., x_{n-1}, *) + ... + f(*, x_2, ..., x_n) > 0:
    p̂(c|x_1, ..., x_n) = [f(c, x_1, ..., x_{n-1}, *) + ... + f(c, *, x_2, ..., x_n)] / [f(x_1, ..., x_{n-1}, *) + ... + f(*, x_2, ..., x_n)]
Else if ... :
    p̂(c|x_1, ..., x_n) = ...
Else if f(x_1, *, ..., *) + ... + f(*, ..., *, x_n) > 0:
    p̂(c|x_1, ..., x_n) = [f(c, x_1, *, ..., *) + ... + f(c, *, ..., *, x_n)] / [f(x_1, *, ..., *) + ... + f(*, ..., *, x_n)]

Table 1: The Naive Back-off smoothing algorithm. f(X) stands for the frequency of pattern X in the training set. An asterisk (*) stands for a wildcard in a pattern. The terms at a higher level in the back-off sequence are more specific (≺) than the lower levels.
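A sketch of the Naive Back-off algorithm of Table 1 over a small in-memory case list; each step pools all wildcard patterns (schemata) with the same number of wildcarded features, and the example cases and query are hypothetical.

```python
from itertools import combinations

def schemata(pattern, n_wild):
    # All variants of the pattern with exactly n_wild features wildcarded.
    for idx in combinations(range(len(pattern)), n_wild):
        yield tuple('*' if i in idx else v for i, v in enumerate(pattern))

def matches(case, schema):
    return all(s == '*' or s == c for c, s in zip(case, schema))

def naive_backoff(cases, query):
    # cases: list of (feature_tuple, class_label) pairs.
    classes = sorted(set(c for _, c in cases))
    for n_wild in range(len(query) + 1):        # one back-off step per level
        terms = list(schemata(query, n_wild))
        denom = sum(1 for x, _ in cases for s in terms if matches(x, s))
        if denom > 0:                           # first level with evidence
            return {cls: sum(1 for x, c in cases for s in terms
                             if c == cls and matches(x, s)) / denom
                    for cls in classes}
    return {cls: 1.0 / len(classes) for cls in classes}

cases = [(("ate", "pizza", "with", "fork"), "V"),
         (("ate", "pizza", "with", "spoon"), "V"),
         (("ate", "pizza", "with", "cheese"), "N")]
print(naive_backoff(cases, ("ate", "pasta", "with", "spoon")))
```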
4 A Comparison
If we classify pattern X by looking at its nearest neighbors, we are in fact estimating the probability P(class|X) by looking at the relative frequency of the class in the set defined by sim_k(X), where sim_k(X) is a function from X to the set of most similar patterns present in the training data³. Although
the name "k-nearest neighbor" might mislead us by
suggesting that classification is based on exactly k
training patterns, the
sima(X)
fimction given by the
Overlap metric groups varying numbers of patterns
into
buckets
of equal similarity. A bucket is defined
by a particular number of mismatches with respect
to pattern X. Each bucket can further be decom-
posed into a number of
schemata
characterized by
the position of a wildcard (i.e. a mismatch). Thus
simk(X)
specifies a ~ ordering in a Collins 8z Brooks
style back-off sequence, where each bucket is a step
in the sequence, and each schema is a term in the
estimation formula at that step. In fact, the un-
weighted overlap metric specifies exactly the same
ordering as the Naive Back-off algorithm (table 1).
In Figure 1 this is shown for a four-featured pat-
tern. The most specific schema is the schema with
zero mismatches, which corresponds to the retrieval
of an identical pattern from memory, the most gen-
eral schema (not shown in the Figure) has a mis-
match on every feature, which corresponds to the
³Note that MBL is not limited to choosing the best
class. It can also return the conditional distribution of
all the classes.
entire memory being the best neighbor.

[Figure 1 appears here: two panels, "Overlap" and "Overlap IG", showing the exact-match schema and the schemata with 1, 2, and 3 mismatches.]

Figure 1: An analysis of nearest neighbor sets into buckets (from left to right) and schemata (stacked). IG weights reorder the schemata. The grey schemata are not used if the third feature has a very high weight (see section 5.1).
If Information Gain weights are used in combina-
tion with the Overlap metric, individual schemata
instead of buckets become the steps of the back-off
sequence⁴. The ≺ ordering becomes slightly more complicated now, as it depends on the number of wildcards and on the magnitude of the weights at-
tached to those wildcards. Let S be the most specific
(zero mismatches) schema. We can then define the
ordering between schemata in the following equa-
tion, where A(X,Y) is the distance as defined in
equation 1.
s' -< s" ~ ~,(s, s') < a(s, s") (6)
Note that this approach represents a type of im-
plicit parallelism. The importance of the 2^F back-off terms is specified using only F parameters (the IG weights), where F is the number of features. This
advantage is not restricted to the use of IG weights;
many other weighting schemes exist in the machine
learning literature (see Wettschereck et al. (1997) for an overview).
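The ordering of equation 6 can be made concrete with a short sketch: identify each schema with the set of features it wildcards, take its distance to the exact-match schema S as the sum of the weights of those wildcards, and sort. With uniform weights this reproduces the bucket ordering of the Naive Back-off algorithm; the weights below are the IG values reported for the PP-attachment data in section 5.1.

```python
from itertools import combinations

def backoff_sequence(weights):
    # Enumerate all 2^F schemata as sets of wildcarded feature positions and
    # order them by their distance to the exact-match schema (equation 6):
    # Delta(S, S') = sum of the weights of the wildcarded features.
    n = len(weights)
    schemas = [idx for r in range(n + 1) for idx in combinations(range(n), r)]
    return sorted(schemas, key=lambda idx: sum(weights[i] for i in idx))

# IG weights in the order (verb, noun1, preposition, noun2); every schema
# that wildcards the preposition (index 2) ends up after every schema that
# keeps it, because 0.10 > 0.03 + 0.03 + 0.03.
weights = [0.03, 0.03, 0.10, 0.03]
for schema in backoff_sequence(weights):
    print(schema, round(sum(weights[i] for i in schema), 2))
```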
Using the IG weights causes the algorithm to rely
on the most specific schema only. Although in most
applications this leads to a higher accuracy, because
it rejects schemata which do not match the most
important features, sometimes this constraint needs
4Unless two schemata are exactly tied in their IG
values.
to be weakened. This is desirable when: (i) there
are a number of schemata which are almost equally
relevant, (ii) the top ranked schema selects too few
cases to make a reliable estimate, or (iii) the chance
that the few items instantiating the schema are mis-
labeled in the training material is high. In such
cases we wish to include some of the lower-ranked
schemata. For case (i) this can be done by discretiz-
ing the IG weights into bins, so that minor differ-
ences will lose their significance, in effect merging
some schemata back into buckets. For (ii) and (iii),
and for continuous metrics (Stanfill & Waltz, 1986;
Cost & Salzberg, 1993) which extrapolate from ex-
actly k neighbors⁵, it might be necessary to choose a
k parameter larger than 1. This introduces one addi-
tional parameter, which has to be tuned on held-out
data. We can then use the distance between a pat-
tern and a schema to weight its vote in the nearest
neighbor extrapolation. This results in a back-off
sequence in which the terms at each step in the se-
quence are weighted with respect to each other, but
without the introduction of any additional weight-
ing parameters. A weighted voting function that
was found to work well is due to Dudani (1976): the
nearest neighbor schema receives a weight of 1.0, the
furthest schema a weight of 0.0, and the other neigh-
bors are scaled linearly to the line between these two
points.
⁵Note that the schema analysis does not apply to
these metrics.
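A sketch of Dudani's distance-weighted voting as described above; the neighbor list in the usage line is hypothetical.

```python
from collections import defaultdict

def dudani_vote(neighbors):
    # neighbors: (distance, class_label) pairs for the k nearest neighbors,
    # sorted by distance. The nearest gets weight 1.0, the furthest 0.0, and
    # the others are scaled linearly between these two points.
    d_near, d_far = neighbors[0][0], neighbors[-1][0]
    votes = defaultdict(float)
    for dist, cls in neighbors:
        weight = 1.0 if d_far == d_near else (d_far - dist) / (d_far - d_near)
        votes[cls] += weight
    return max(votes, key=votes.get)

print(dudani_vote([(0.1, "V"), (0.4, "N"), (0.5, "N")]))   # -> "V"
```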
Method                                % Accuracy
IB1 (= Naive Back-off)                 83.7%
IB1-IG                                 84.1%
LexSpace IG                            84.4%
Back-off model (Collins & Brooks)      84.1%
C4.5 (Ratnaparkhi et al.)              79.9%
Max Entropy (Ratnaparkhi et al.)       81.6%
Brill's rules (Collins & Brooks)       81.9%

Table 2: Accuracy on the PP-attachment test set.
5 Applications
5.1 PP-attachment
In this section we describe experiments with MBL
on a data-set of Prepositional Phrase (PP) attach-
ment disambiguation cases. The problem in this
data-set is to disambiguate whether a PP attaches
to the verb (as in
I ate pizza with a fork)
or to the
noun (as in
I ate pizza with cheese).
This is a dif-
ficult and important problem, because the semantic
knowledge needed to solve the problem is very diffi-
cult to model, and the ambiguity can lead to a very
large number of interpretations for sentences.
We used a data-set extracted from the Penn
Treebank WSJ corpus by Ratnaparkhi
et al.
(1994).
It consists of sentences containing the possibly
ambiguous sequence
verb noun-phrase PP.
Cases
were constructed from these sentences by record-
ing the features: verb, head noun of the first noun
phrase, preposition, and head noun of the noun
phrase contained in the PP. The cases were la-
beled with the attachment decision as made by
the parse annotator of the corpus. So, for the
two example sentences given above we would get
the feature vectors ate, pizza, with, fork, V and ate, pizza, with, cheese, N. The data-set contains
20801 training cases and 3097 separate test cases,
and was also used in Collins & Brooks (1995).
The IG weights for the four features (V,N,P,N)
were respectively 0.03, 0.03, 0.10, 0.03. This identi-
fies the preposition as the most important feature:
its weight is higher than the sum of the other three
weights. The composition of the back-off sequence
following from this can be seen in the lower part
of Figure 1. The grey-colored schemata were effec-
tively left out, because they include a mismatch on
the preposition.
Table 2 shows a comparison of accuracy on the
test-set of 3097 cases. We can see that IB1, which implicitly uses the same specificity ordering as the Naive Back-off algorithm, already performs quite well
in relation to other methods used in the literature.
Collins & Brooks' (1995) Back-off model is more so-
phisticated than the naive model, because they per-
formed a number of validation experiments on held-
out data to determine which terms to include and,
more importantly, which to exclude from the back-
off sequence. They excluded all terms which did
not match in the preposition! Not surprisingly, the
84.1% accuracy they achieve is matched by the per-
formance of IB1-IG. The two methods exactly mimic each other's behavior, in spite of their huge differ-
ence in design. It should however be noted that the
computation of IG-weights is many orders of mag-
nitude faster than the laborious evaluation of terms
on held-out data.
We also experimented with rich lexical represen-
tations obtained in an unsupervised way from word
co-occurrences in raw WSJ text (Zavrel & Veenstra,
1995; Schütze, 1994). We call these representations Lexical Space vectors. Each word has a numeric 25-dimensional vector representation. Using these vec-
tors, in combination with the IG weights mentioned
above and a cosine metric, we got even slightly bet-
ter results. Because the cosine metric fails to group
the patterns into discrete schemata, it is necessary
to use a larger number of neighbors (k = 50). The
result in Table 2 is obtained using Dudani's weighted
voting method.
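One plausible way to combine the Lexical Space vectors with the IG weights (a sketch, not necessarily the exact formulation used here) is to replace the symbolic overlap of equation 2 by a per-feature cosine distance while keeping the IG-weighted sum of equation 1; the dictionary of 25-dimensional word vectors is assumed to be given. Nearest neighbors under this distance can then be combined with Dudani's weighted vote as above.

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity between two dense word vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return (1.0 - dot / norm) if norm else 1.0

def lexspace_distance(x, y, ig_weights, vectors):
    # x, y: word tuples (verb, noun1, prep, noun2); vectors maps each word
    # to its Lexical Space representation.
    return sum(w * cosine_distance(vectors[a], vectors[b])
               for a, b, w in zip(x, y, ig_weights))
```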
Note that to devise a back-off scheme on the basis
of these high-dimensional representations (each pat-
tern has 4 × 25 features) one would need to consider up to 2^100 smoothing terms. The MBL framework
is a convenient way to further experiment with even
more complex conditioning events, e.g. with seman-
tic labels added as features.
5.2 POS-tagging
Another NLP problem where combination of differ-
ent sources of statistical information is an impor-
tant issue, is POS-tagging, especially for the guess-
ing of the POS-tag of words not present in the lex-
icon. Relevant information for guessing the tag of
an unknown word includes contextual information
(the words and tags in the context of the word), and
word form information (prefixes and suffixes, first
and last letters of the word as an approximation of
affix information, presence or absence of capitaliza-
tion, numbers, special characters etc.). There is a
large number of potentially informative features that
could play a role in correctly predicting the tag of
an unknown word (Ratnaparkhi, 1996; Weischedel
et al.,
1993; Daelemans
et al.,
1996). A priori, it
is not clear what the relative importance is of these
features.
We compared Naive Back-off estimation and MBL
with two sets of features:
• PDASS:
the first letter of the unknown word (p),
the tag of the word to the left of the unknown
word (d), a tag representing the set of possible
lexical categories of the word to the right of the
unknown word (a), and the two last letters (s).
The first letter provides information about cap-
italisation and the prefix, the two last letters
about suffixes (a sketch of this case construction follows the list).
• PDDDAAASSS: more left and right context fea-
tures, and more suffix information.
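A sketch of the PDASS case construction described in the first bullet above; the encoding of the right neighbor's ambiguity class as a joined tag string, and the function name, are illustrative assumptions.

```python
def pdass_case(word, left_tag, right_possible_tags):
    # p: first letter (capitalisation / prefix cue), d: disambiguated tag of
    # the word to the left, a: a tag standing for the set of possible
    # categories of the word to the right, s s: the last two letters.
    return (word[0],
            left_tag,
            "-".join(sorted(right_possible_tags)),
            word[-2],
            word[-1])

# Hypothetical unknown word preceded by a determiner and followed by a word
# that can be a noun or a verb.
print(pdass_case("Brokerage", "DT", {"NN", "VB"}))   # ('B', 'DT', 'NN-VB', 'g', 'e')
```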
The data set consisted of 100,000 feature value
patterns taken from the Wall Street Journal corpus.
Only open-class words were used during construc-
tion of the training set. For both IB1-IG and Naive
Back-off, a 10-fold cross-validation experiment was
run using both PDASS and PDDDAAASSS patterns.
The results are in Table 3. The IG values for the
features are given in Figure 2.
The results show that for Naive Back-off (and IB1)
the addition of more, possibly irrelevant, features
quickly becomes detrimental (decrease from 88.5 to
85.9), even if these added features do make a gener-
alisation performance increase possible (witness the
increase with IB1-IG from 88.3 to 89.8). Notice that we did not actually compute the 2^10 terms of Naive
Back-off in the PDDDAAASSS condition, as IB1 is
guaranteed to provide statistically the same results.
Contrary to Naive Back-off and IB1, memory-based
learning with feature weighting (IB1-IG) manages
to integrate diverse information sources by differ-
entially assigning relevance to the different features.
Since noisy features will receive low IG weights, this
also implies that it is much more noise-tolerant.
[Figure 2 appears here: a bar chart of IG values (vertical axis, 0.0 to 0.30) for the features p d d d a a a s s s (horizontal axis).]
Figure 2: IG values for features used in predicting
the tag of unknown words.
              IB1, Naive Back-off   IB1-IG
PDASS         88.5 (0.4)            88.3 (0.4)
PDDDAAASSS    85.9 (0.4)            89.8 (0.4)
Table 3: Comparison of generalization accuracy of
Back-off and Memory-Based Learning on prediction
of category of unknown words. All differences are
statistically significant (two-tailed paired t-test, p <
0.05). Standard deviations on the 10 experiments
are between brackets.
6 Conclusion
We have analysed the relationship between Back-
off smoothing and Memory-Based Learning and es-
tablished a close correspondence between these two
frameworks which were hitherto mostly seen as un-
related. An exception is the use of similarity for al-
leviating the sparse data problem in language mod-
eling (Essen & Steinbiss, 1992; Brown
et al.,
1992;
Dagan et
al.,
1994). However, these works differ in
their focus from our analysis in that the emphasis
is put on similarity between
values
of a feature (e.g.
words), instead of similarity between patterns that
are a (possibly complex) combination of many fea-
tures.
The comparison of MBL and Back-off shows that
the two approaches perform smoothing in a very sim-
ilar way, i.e. by using estimates from more general
patterns if specific patterns are absent in the train-
ing data. The analysis shows that MBL and Back-off
use exactly the same type of data and counts, and
this implies that MBL can safely be incorporated
into a system that is explicitly probabilistic. Since
the underlying k-NN classifier is a method that does
not necessitate any of the common independence or
distribution assumptions, this promises to be a fruit-
ful approach.
A serious advantage of the described approach is that in MBL the back-off sequence is specified by the similarity metric used, without manual in-
tervention or the estimation of smoothing parame-
ters on held-out data, and requires only one param-
eter for each feature instead of an exponential num-
ber of parameters. With a feature-weighting met-
ric such as Information Gain, MBL is particularly
at an advantage for NLP tasks where conditioning
events are complex, where they consist of the fusion
of different information sources, or when the data is
noisy. This was illustrated by the experiments on
PP-attachment and POS-tagging data-sets.
Acknowledgements
This research was done in the context of the "Induc-
tion of Linguistic Knowledge" research programme,
partially supported by the Foundation for Language, Speech and Logic (TSL), which is funded by
the Netherlands Organization for Scientific Research
(NWO). We would like to thank Peter Berck and
Anders Green for their help with software for the
experiments.
References
D. Aha, D. Kibler, and M. Albert. 1991. Instance-
based Learning Algorithms.
Machine Learning,
Vol. 6, pp. 37-66.
L.R. Bahl, F. Jelinek and R.L. Mercer. 1983.
A Maximum Likelihood Approach to Continu-
ous Speech Recognition.
IEEE Transactions on
Pattern Analysis and Machine Intelligence,
Vol.
PAMI-5 (2), pp. 179-190.
Peter F. Brown, Vincent J. Della Pietra, Peter
V. deSouza, Jennifer C. Lai, and Robert L. Mer-
cer. 1992. Class-based N-gram Models of Natural
Language.
Computational Linguistics,
Vol. 18(4),
pp. 467-479.
Claire Cardie. 1994.
Domain Specific Knowl-
edge Acquisition for Conceptual Sentence Anal-
ysis,
PhD Thesis, University of Massachusets,
Amherst, MA.
Claire Cardie. 1996. Automatic Feature Set Selec-
tion for Case-Based Learning of Linguistic Knowl-
edge. In
Proc. of the Conference on Empirical
Methods in Natural Language Processing,
May 17-
18, 1996, University of Pennsylvania.
Stanley F. Chen and Joshua Goodman. 1996. An
Empirical Study of Smoothing Techniques for
Language Modelling. In
Proc. of the 34th Annual
Meeting of the ACL,
June 1996, Santa Cruz, CA,
ACL.
Kenneth W. Church and William A. Gale. 1991.
A comparison of the enhanced Good-Turing and
deleted estimation methods for estimating proba-
bilities of English bigrams.
Computer Speech and
Language,
Vol. 5(1), pp. 19-54.
M. Collins. 1996. A New Statistical Parser Based on
Bigram Lexical Dependencies. In
Proc. of the 34th
Annual Meeting of the ACL,
June 1996, Santa
Cruz, CA, ACL.
M. Collins and J. Brooks. 1995. Prepositional
Phrase Attachment through a Backed-Off Model.
In
Proceedings of the Third Workshop on Very
Large Corpora,
Cambridge, MA.
S. Cost and S. Salzberg. 1993. A weighted near-
est neighbour algorithm for learning with symbolic
features.
Machine Learning,
Vol. 10, pp. 57-78.
Walter Daelemans and Antal van den Bosch.
1992. Generalisation Performance of Backprop-
agation Learning on a Syllabification Task. In
M. F. J. Drossaers & A. Nijholt (eds.), TWLT3:
Connectionism and Natural Language Processing.
Enschede: Twente University. pp. 27-37.
Walter Daelemans. 1995. Memory-based lexical
acquisition and processing. In P. Steffens (ed.),
Machine Translation and the Lexicon.
Springer
Lecture Notes in Artificial Intelligence, no. 898.
Berlin: Springer Verlag. pp. 85-98
Walter Daelemans. 1996. Abstraction Considered
Harmful: Lazy Learning of Language Process-
ing. In J. van den Herik and T. Weijters (eds.),
Benelearn-96. Proceedings of the 6th Belgian-
Dutch Conference on Machine Learning.
MA-
TRIKS: Maastricht, The Netherlands, pp. 3-12.
Walter Daelemans, Jakub Zavrel, Peter Berck, and
Steven Gillis. 1996. MBT: A Memory-Based Part
of Speech Tagger Generator. In E. Ejerhed and
I. Dagan (eds.)
Proc. of the Fourth Workshop on
Very Large Corpora,
Copenhagen: ACL SIGDAT,
pp. 14-27.
Walter Daelemans, Antal van den Bosch, and Ton
Weijters. 1997. IGTree: Using Trees for Com-
pression and Classification in Lazy Learning Al-
gorithms. In D. Aha (ed.)
Artificial Intelligence
Review,
special issue on Lazy Learning, Vol. 11(1-
5).
Ido Dagan, Fernando Pereira, and Lillian Lee. 1994.
Similarity-Based Estimation of Word Cooccur-
rence Probabilities. In
Proc. of the 32nd Annual
Meeting of the ACL,
June 1994, Las Cruces, New
Mexico, ACL.
S.A. Dudani. 1976. The Distance-Weighted k-Nearest Neighbor Rule.
IEEE Transactions on
Systems, Man, and Cybernetics,
Vol. SMC-6, pp.
325-327.
Ute Essen and Volker Steinbiss. 1992. Cooccurrence
Smoothing for Stochastic Language Modeling. In
Proc. of ICASSP,
Vol. 1, pp. 161-164, IEEE.
Slava M. Katz. 1987. Estimation of Probabilities
from Sparse Data for the Language Model Com-
ponent of a Speech Recognizer.
IEEE Transac-
tions on Acoustics, Speech and Signal Processing,
Vol. ASSP-35, pp. 400-401, March 1987.
David M. Magerman. 1994.
Natural Language Pars-
ing as Statistical Pattern Recognition,
PhD The-
sis, Department of Computer Science, Stanford
University.
Hwee Tou Ng and Hian Beng Lee. 1996. Integrat-
ing Multiple Knowledge Sources to Disambiguate
Word Sense: An Exemplar-Based Approach. In
Proc. of the 34th Annual Meeting of the ACL,
June 1996, Santa Cruz, CA, ACL.
J R. Quinlan. 1986. Induction of Decision Trees.
Machine Learning,
Vol. 1, pp. 81-106.
J R. Quinlan. 1993.
C4.5: Programs for Machine
Learning.
San Mateo, CA: Morgan Kaufmann.
Adwait Ratnaparkhi. 1996. A Maximum Entropy
Part-Of-Speech Tagger. In
Proc. of the Confer-
ence on Empirical Methods in Natural Language
Processing,
May 17-18, 1996, University of Penn-
sylvania.
A. Ratnaparkhi, J. Reynar and S. Roukos. 1994. A
maximum entropy model for Prepositional Phrase
Attachment. In
ARPA Workshop on Human Lan-
guage Technology,
Plainsboro, NJ.
Christer Samuelsson. 1996. Handling Sparse Data
by Successive Abstraction. In Proc. of the International Conference on Computational Linguistics
(COLING'96), August 1996, Copenhagen, Den-
mark.
Hinrich Schütze. 1994. Distributional Part-of-
Speech Tagging. In
Proc. of the 7th Conference of
the European Chapter of the Association for Com-
putational Linguistics
(EACL'95), Dublin, Ire-
land.
C. Stanfill and D. Waltz. 1986. Toward memory-
based reasoning.
Communications of the ACM,
Vol. 29, pp. 1213-1228.
Ralph Weischedel, Marie Meteer, Richard Schwartz,
Lance Ramshaw, and Jeff Palmucci. 1993. Cop-
ing with Ambiguity and Unknown Words through
Probabilistic Models.
Computational Linguistics,
Vol. 19(2). pp. 359-382.
D. Wettschereck, D. W. Aha, and T. Mohri.
1997. A Review and Comparative Evaluation of
Feature-Weighting Methods for Lazy Learning Al-
gorithms. In D. Aha (ed.)
Artificial Intelligence
Review,
special issue on Lazy Learning, Vol. 11(1-
5).
Jakub Zavrel and Jorn B. Veenstra. 1995. The Lan-
guage Environment and Syntactic Word-Class Ac-
quisition. In C. Koster and F. Wijnen (eds.)
Proc.
of the Groningen Assembly on Language Acquisi-
tion (GALA95).
Center for Language and Cogni-
tion, Groningen, pp. 365-374.