ALIGNING SENTENCES IN PARALLEL CORPORA
Peter F. Brown, Jennifer C. Lai, and Robert L. Mercer
IBM Thomas J. Watson Research Center
P.O. Box 704
Yorktown Heights, NY 10598
ABSTRACT
In this paper we describe a statistical technique for aligning
sentences with their translations in two parallel corpora. In addition
to certain anchor points that are available in our data, the only
information about the sentences that we use for calculating alignments
is the number of tokens that they contain. Because we make no use of
the lexical details of the sentence, the alignment computation is fast
and therefore practical for application to very large collections of
text. We have used this technique to align several million sentences
in the English-French Hansard corpora and have achieved an accuracy in
excess of 99% in a randomly selected set of 1000 sentence pairs that
we checked by hand. We show that even without the benefit of anchor
points the correlation between the lengths of aligned sentences is
strong enough that we should expect to achieve an accuracy of between
96% and 97%. Thus, the technique may be applicable to a wider variety
of texts than we have yet tried.
INTRODUCTION
Recent work by Brown et al. [Brown et al., 1988, Brown et al., 1990]
has quickened anew the long dormant idea of using statistical
techniques to carry out machine translation from one natural language
to another. The lynchpin of their approach is a large collection of
pairs of sentences that are mutual translations. Beyond providing
grist to the statistical mill, such pairs of sentences are valuable to
researchers in bilingual lexicography [Klavans and Tzoukermann, 1990,
Warwick and Russell, 1990] and may be useful in other approaches to
machine translation [Sadler, 1989].
In this paper, we consider the problem of extracting from parallel
French and English corpora pairs of sentences that are translations of
one another. The task is not trivial because at times a single
sentence in one language is translated as two or more sentences in the
other language. At other times a sentence, or even a whole passage,
may be missing from one or the other of the corpora.
If a person is given two parallel texts and asked to match up the
sentences in them, it is natural for him to look at the words in the
sentences. Elaborating this intuitively appealing insight, researchers
at Xerox and at ISSCO [Kay, 1991, Catizone et al., 1989] have
developed alignment algorithms that pair sentences according to the
words that they contain. Any such algorithm is necessarily slow and,
despite the potential for highly accurate alignment, may be unsuitable
for very large collections of text. Our algorithm makes no use of the
lexical details of the corpora, but deals only with the number of
words in each sentence. Although we have used it only to align
parallel French and English corpora from the proceedings of the
Canadian Parliament, we expect that our technique would work on other
French and English corpora and even on other pairs of languages. The
work of Gale and Church [Gale and Church, 1991], who use a very
similar method but measure sentence lengths in characters rather than
in words, supports this promise of wider applicability.
THE HANSARD CORPORA
Brown et al. [Brown et al., 1990] describe the process by which the
proceedings of the Canadian Parliament are recorded. In Canada, these
proceedings are referred to as Hansards. Our Hansard corpora consist
of the Hansards from 1973 through 1986. There are two files for each
session of parliament: one English and one French. After converting
the obscure text markup language of the raw data to TeX, we combined
all of the English files into a single, large English corpus and all
of the French files into a single, large French corpus. We then
segmented the text of each corpus into tokens and combined the tokens
into groups that we call sentences. Generally, these conform to the
grade-school notion of a sentence: they begin with a capital letter,
contain a verb, and end with some type of sentence-final punctuation.
Occasionally, they fall short of this ideal and so each corpus
contains a number of sentence fragments and other groupings of words
that we nonetheless refer to as sentences. With this broad
interpretation, the English corpus contains 85,016,286 tokens in
3,510,744 sentences, and the French corpus contains 97,857,452 tokens
in 3,690,425 sentences. The average English sentence has 24.2 tokens,
while the average French sentence is about 9.5% longer with 26.5
tokens.
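The averages quoted here follow directly from the corpus totals. The short Python check below uses only the counts reported in this section; the variable names are ours and purely illustrative.

```python
# Token and sentence counts reported for the two Hansard corpora.
english_tokens, english_sentences = 85_016_286, 3_510_744
french_tokens, french_sentences = 97_857_452, 3_690_425

# Mean sentence length in tokens for each language.
english_mean = english_tokens / english_sentences
french_mean = french_tokens / french_sentences

# Relative excess of the mean French sentence length over the English one.
excess_percent = 100.0 * (french_mean / english_mean - 1.0)

print(f"English mean: {english_mean:.1f} tokens")
print(f"French mean:  {french_mean:.1f} tokens")
print(f"French excess: {excess_percent:.1f}%")
```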
The left-hand side of Figure 1 shows the raw data for a portion of the
English corpus, and the right-hand side shows the same portion after
we converted it to TeX and divided it up into sentences. The sentence
numbers do not advance regularly because we have edited the sample in
order to display a variety of phenomena.
In addition to a verbatim record of the proceedings and its
translation, the Hansards include session numbers, names of speakers,
time stamps, question numbers, and indications of the original
language in which each speech was delivered. We retain this auxiliary
information in the form of comments sprinkled throughout the text.
Each comment has the form \SCM{} \ECM{} as shown on the right-hand
side of Figure 1. In addition to these comments, which encode
information explicitly present in the data, we inserted Paragraph
comments as suggested by the space command, of which we see an example
in the eighth line on the left-hand side of Figure 1.
We mark the beginning of a parliamentary session with a Document
comment as shown in Sentence 1 on the right-hand side of Figure 1.
Usually, when a member addresses the parliament, his name is recorded
and we encode it in an Author comment. We see an example of this in
Sentence 4. If the president speaks, he is referred to in the English
corpus as Mr. Speaker and in the French corpus as M. le Président. If
several members speak at once, a shockingly regular occurrence, they
are referred to as Some Hon. Members in the English and as Des Voix in
the French. Times are recorded either as exact times on a 24-hour
basis, as in Sentence 81, or as inexact times of which there are two
forms: Time = Later, and Time = Recess. These are rendered in French
as Time = Plus Tard and Time = Recess. Other types of comments that
appear are shown in Table 1.
ALIGNING ANCHOR POINTS
After examining the Hansard corpora, we realized that the comments
laced throughout would serve as useful anchor points in any alignment
process. We divide the comments into major and minor anchors as
follows. The comments Author = Mr. Speaker, Author = M. le Président,
Author = Some Hon. Members, and Author = Des Voix are called minor
anchors. All other comments are called major anchors with the
exception of the Paragraph comment, which is not treated as an anchor
at all. The minor anchors are much more common than any particular
major anchor, making an alignment based on them less robust against
deletions than one based on the major anchors. Therefore, we have
carried out the alignment of anchor points in two passes, first
aligning the major anchors and then the minor anchors.
Raw data (left-hand side):

/*START_COMMENT* Beginning file = 048
101 H002-108 script A *END_COMMENT*/
.TB 029 060 090 099
.PL 060
.LL 120
.NF
The House met at 2 p.m.
.SP
*boMr. Donald MacInnis (Cape Breton
-East Richmond):*ro Mr. Speaker,
I rise on a question of privilege af-
fecting the rights and prerogatives
of parliamentary committees and one
which reflects on the word of two
ministers.
.SP
*boMr. Speaker: *roThe hon. member's
motion is proposed to the
House under the terms of Standing
Order 43. Is there unanimous consent?
.SP
*boSome hon. Members: *roAgreed.
s*itText*ro)
Question No. 17 *boMr. Mazankowski:
*ro
1. For the period April 1, 1973 to
January 31, 1974, what amount of
money was expended on the operation
and maintenance of the Prime
Minister's residence at Harrington
Lake, Quebec?
.SP
(1415)
s*itLater:*ro)
.SP
*boMr. Cossitt:*ro Mr. Speaker, I rise
on a point of order to ask for
clarification by the parliamentary
secretary.

After cleanup (right-hand side):

1. \SCM{} Document = 048 101 H002-108 script A \ECM{}
2. The House met at 2 p.m.
3. \SCM{} Paragraph \ECM{}
4. \SCM{} Author = Mr. Donald MacInnis
   (Cape Breton-East Richmond) \ECM{}
5. Mr. Speaker, I rise on a question of privilege affecting the
   rights and prerogatives of parliamentary committees and one
   which reflects on the word of two ministers.
21. \SCM{} Paragraph \ECM{}
22. \SCM{} Author = Mr. Speaker \ECM{}
23. The hon. member's motion is proposed to the House under the
    terms of Standing Order 43.
44. Is there unanimous consent?
45. \SCM{} Paragraph \ECM{}
46. \SCM{} Author = Some hon. Members \ECM{}
47. Agreed.
61. \SCM{} Source = Text \ECM{}
62. \SCM{} Question = 17 \ECM{}
63. \SCM{} Author = Mr. Mazankowski \ECM{}
64. 1.
65. For the period April 1, 1973 to January 31, 1974, what amount of
    money was expended on the operation and maintenance of the Prime
    Minister's residence at Harrington Lake, Quebec?
66. \SCM{} Paragraph \ECM{}
81. \SCM{} Time = (1415) \ECM{}
82. \SCM{} Time = Later \ECM{}
83. \SCM{} Paragraph \ECM{}
84. \SCM{} Author = Mr. Cossitt \ECM{}
85. Mr. Speaker, I rise on a point of order to ask for clarification
    by the parliamentary secretary.

Figure 1: A sample of text before and after cleanup

Usually, the major anchors appear in both languages. Sometimes,
however, through inattention on the part of the translator or other
misadventure, the name of a speaker may be garbled or a comment may be
omitted. In the first alignment pass, we assign to alignments a cost
that favors exact matches and penalizes omissions or garbled matches.
Thus, for example, we assign a cost of 0 to the pair Time = Later and
Time = Plus Tard, but a cost of 10 to the pair Time = Later and
Author = Mr. Bateman. We set the cost of a deletion at 5. For two
names, we set the cost by counting the number of insertions,
deletions, and substitutions necessary to transform one name, letter
by letter, into the other. This value is then reduced to the range 0
to 10.
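A minimal Python sketch of such a cost assignment is given here for concreteness. The costs 0, 5, and 10 and the letter-by-letter edit count come from the description above. Everything else is an assumption of the sketch: the EQUIVALENT table of fixed translations, the choice to clip the edit count at 10 as the way of reducing it to the range 0 to 10, and the function names.

```python
DELETION_COST = 5    # cost of an anchor with no counterpart (used by the aligner)
MISMATCH_COST = 10   # cost of pairing two unrelated comments

# Assumed table of fixed comments that are translations of one another.
EQUIVALENT = {
    ("Time = Later", "Time = Plus Tard"),
    ("Time = Recess", "Time = Recess"),
}


def edit_distance(a: str, b: str) -> int:
    """Number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def anchor_cost(english: str, french: str) -> int:
    """Cost of pairing an English comment with a French comment."""
    if english == french or (english, french) in EQUIVALENT:
        return 0
    e_kind, _, e_value = english.partition(" = ")
    f_kind, _, f_value = french.partition(" = ")
    if e_kind == f_kind == "Author":
        # Assumption: the edit count is clipped to the range 0..10.
        return min(MISMATCH_COST, edit_distance(e_value, f_value))
    return MISMATCH_COST


print(anchor_cost("Time = Later", "Time = Plus Tard"))
print(anchor_cost("Time = Later", "Author = Mr. Bateman"))
print(anchor_cost("Author = Mr. Bateman", "Author = M. Bateman"))
```

Other ways of mapping the edit count into the range 0 to 10, such as scaling by the length of the longer name, would fit the description equally well.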
Given the costs described above, it is a standard problem in dynamic
programming to find that alignment of the major anchors in the two
corpora with the least total cost [Bellman, 1957]. In theory, the time
and space required to find this alignment grow as the product of the
lengths of the two sequences to be aligned. In practice, however, by
using thresholds and the partial traceback technique described by
Brown, Spohrer, Hochschild, and Baker [Brown et al., 1982], the time
required can be made linear in the length of the sequences, and the
space can be made constant. Even so, the computational demand is
severe since, in places, the two corpora are out of alignment by as
many as 90,000 sentences owing to mislabelled or missing files.
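The textbook form of this dynamic program can be sketched in Python as follows. This is the plain quadratic-time, quadratic-space version, not the thresholded partial-traceback variant used for the full corpora. The function name align_anchors, its signature, and the toy cost function in the usage lines are illustrative assumptions.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[Optional[int], Optional[int]]


def align_anchors(english: List[str], french: List[str],
                  pair_cost: Callable[[str, str], float],
                  deletion_cost: float = 5.0) -> Tuple[float, List[Pair]]:
    """Least-cost alignment of two anchor sequences.

    Returns the total cost and a list of (english_index, french_index)
    pairs, where None marks an anchor that has no counterpart.
    """
    n, m = len(english), len(french)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            candidates = []
            if i > 0 and j > 0:   # pair english[i-1] with french[j-1]
                candidates.append((cost[i - 1][j - 1]
                                   + pair_cost(english[i - 1], french[j - 1]),
                                   (i - 1, j - 1)))
            if i > 0:             # english[i-1] has no counterpart
                candidates.append((cost[i - 1][j] + deletion_cost, (i - 1, j)))
            if j > 0:             # french[j-1] has no counterpart
                candidates.append((cost[i][j - 1] + deletion_cost, (i, j - 1)))
            for c, prev in candidates:
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, prev
    # Trace back from the end of both sequences to recover the alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((i - 1 if pi < i else None, j - 1 if pj < j else None))
        i, j = pi, pj
    pairs.reverse()
    return cost[n][m], pairs


# Toy usage: the middle English anchor is missing from the French side.
total, pairs = align_anchors(["A", "B", "C"], ["A", "C"],
                             lambda e, f: 0 if e == f else 10)
print(total, pairs)
```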
This first pass renders the data as a sequence of sections between
aligned major anchors. In the second pass, we accept or reject each
section in turn according to the population of minor anchors that it
contains. Specifically, we accept a section provided that, within the
section, both corpora contain the same number of minor anchors in the
same order. Otherwise, we reject the section. Altogether, we reject
about 10% of the data in each corpus. The minor anchors serve to
divide the remaining sections into subsections that range in size from
one sentence to several thousand sentences and average about ten
sentences.
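The acceptance test of the second pass can be sketched in a few lines of Python. The sketch assumes that each section is held as a list of strings in which comments appear as plain entries such as "Author = Mr. Speaker"; this representation, the MINOR_ANCHORS table, and the function names are assumptions made for illustration.

```python
from typing import Collection, List

# English minor anchors mapped to their French counterparts.
MINOR_ANCHORS = {
    "Author = Mr. Speaker": "Author = M. le Président",
    "Author = Some Hon. Members": "Author = Des Voix",
}
FRENCH_MINOR = set(MINOR_ANCHORS.values())


def accept_section(english_items: List[str], french_items: List[str]) -> bool:
    """Accept a section only if both sides contain the same minor anchors
    in the same order."""
    english_minor = [MINOR_ANCHORS[x] for x in english_items if x in MINOR_ANCHORS]
    french_minor = [x for x in french_items if x in FRENCH_MINOR]
    return english_minor == french_minor


def split_at_anchors(items: List[str], anchors: Collection[str]) -> List[List[str]]:
    """Divide a section into the subsections that lie between minor anchors."""
    subsections, current = [], []
    for item in items:
        if item in anchors:
            subsections.append(current)
            current = []
        else:
            current.append(item)
    subsections.append(current)
    return subsections


english = ["Author = Mr. Speaker", "Order.", "Author = Some Hon. Members", "Agreed."]
french = ["Author = M. le Président", "À l'ordre.", "Author = Des Voix", "D'accord."]
print(accept_section(english, french))
print(split_at_anchors(english, MINOR_ANCHORS))
```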
ALIGNING SENTENCES AND
PARAGRAPH BOUNDARIES
We turn now to the question of aligning the individual sentences in a
subsection between minor anchors. Since the number of sentences in the
French corpus differs from the number in the English corpus, it is
clear that they cannot be in one-to-one correspondence throughout.
Visual inspection of the two corpora quickly reveals that although
roughly 90% of the English sentences correspond to single French
sentences, there are many instances where a single sentence in one
corpus is represented by two consecutive sentences in the other.
Rarer, but still present, are examples of sentences that appear in one
corpus but leave no trace in the other. If one is moderately well
acquainted with both English and French, it is a simple matter to
decide how the sentences should be aligned. Unfortunately, the sizes
of our corpora make it impractical for us to obtain a complete set of
alignments by hand. Rather, we must necessarily employ some automatic
scheme.

English                   French
Source = English          Source = Traduction
Source = Translation      Source = Francais
Source = Text             Source = Texte
Source = List Item        Source = List Item
Source = Question         Source = Question
Source = Answer           Source = Reponse
Table 1: Examples of comments
It is not surprising, and further inspection verifies, that the
numbers of tokens in sentences that are translations of one another
are correlated. We looked, therefore, at the possibility of obtaining
alignments solely on the basis of sentence lengths in tokens. From
this point of view, each corpus is simply a sequence of sentence
lengths punctuated by occasional paragraph markers. Figure 2 shows the
initial portion of such a pair of corpora. We have circled groups of
sentence lengths to show the correct alignment. We call each of the
groupings a bead. In this example, we have an ef-bead followed by an
eff-bead followed by an e-bead followed by a ¶e¶f-bead. An alignment,
then, is simply a sequence of beads that accounts for the observed
sequences of sentence lengths and paragraph markers. We imagine the
sentences in a subsection to have been generated by a pair of random
processes, the first producing a sequence of beads and the second
choosing the lengths of the sentences in each bead.

Figure 2: Division of aligned corpora into beads

Bead      Text
e         one English sentence
f         one French sentence
ef        one English and one French sentence
eef       two English and one French sentence
eff       one English and two French sentences
¶e        one English paragraph
¶f        one French paragraph
¶e¶f      one English and one French paragraph
Table 2: Alignment Beads
Figure 3 shows the two-state Markov model that we use for generating
beads. We assume that a single sentence in one language lines up with
zero, one, or two sentences in the other and that paragraph markers
may be deleted. Thus, we allow any of the eight beads shown in
Table 2. We also assume that Pr(e) = Pr(f), Pr(eff) = Pr(eef), and
Pr(¶e) = Pr(¶f).

Figure 3: Finite state model for generating beads
Given a bead, we determine the lengths of the sentences it contains as
follows. We assume the probability of an English sentence of length
$\ell_e$ given an e-bead to be the same as the probability of an
English sentence of length $\ell_e$ in the text as a whole. We denote
this probability by $\Pr(\ell_e)$. Similarly, we assume the
probability of a French sentence of length $\ell_f$ given an f-bead to
be $\Pr(\ell_f)$. For an ef-bead, we assume that the English sentence
has length $\ell_e$ with probability $\Pr(\ell_e)$ and that the log of
the ratio of the length of the French sentence to the length of the
English sentence is normally distributed with mean $\mu$ and variance
$\sigma^2$. Thus, if $r = \log(\ell_f/\ell_e)$, we assume that

$$\Pr(\ell_f \mid \ell_e) = \alpha \exp\left[-\frac{(r - \mu)^2}{2\sigma^2}\right] \qquad (1)$$

with $\alpha$ chosen so that the sum of $\Pr(\ell_f \mid \ell_e)$ over
positive values of $\ell_f$ is equal to unity. For an eef-bead, we
assume that each of the English sentences is drawn independently from
the distribution $\Pr(\ell_e)$ and that the log of the ratio of the
length of the French sentence to the sum of the lengths of the English
sentences is normally distributed with the same mean and variance as
for an ef-bead. Finally, for an eff-bead, we assume that the length of
the English sentence is drawn from the distribution $\Pr(\ell_e)$ and
that the log of the ratio of the sum of the lengths of the French
sentences to the length of the English sentence is normally
distributed as before. Then, given the sum of the lengths of the
French sentences, we assume that the probability of a particular pair
of lengths, $\ell_{f_1}$ and $\ell_{f_2}$, is proportional to
$\Pr(\ell_{f_1}) \Pr(\ell_{f_2})$.
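Equation (1) can be evaluated numerically with a short Python sketch. The values of the mean and variance are the estimates reported in Table 3. The normalizing sum runs over all positive French lengths, so the sketch truncates it at MAX_LENGTH; that cutoff and the function names are assumptions of the sketch, not part of the model.

```python
import math
from functools import lru_cache

MU = 0.072         # mean of the log length ratio (Table 3)
SIGMA2 = 0.043     # variance of the log length ratio (Table 3)
MAX_LENGTH = 1000  # assumed truncation point for the normalizing sum


def kernel(l_f: int, l_e: int) -> float:
    """Unnormalized value exp[-(r - mu)^2 / (2 sigma^2)], r = log(l_f / l_e)."""
    r = math.log(l_f / l_e)
    return math.exp(-(r - MU) ** 2 / (2.0 * SIGMA2))


@lru_cache(maxsize=None)
def alpha(l_e: int) -> float:
    """Normalizing constant for a given English length (truncated sum)."""
    return 1.0 / sum(kernel(k, l_e) for k in range(1, MAX_LENGTH + 1))


def french_given_english(l_f: int, l_e: int) -> float:
    """Pr(l_f | l_e) as in Equation (1)."""
    return alpha(l_e) * kernel(l_f, l_e)


# The most probable French length for a 10-token English sentence.
best = max(range(1, 60), key=lambda k: french_given_english(k, 10))
print(best)
```

Because the kernel decays faster than any power of the French length, the truncated sum differs little from the full one for sentence lengths of practical interest.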
Together, these two random processes form a hidden Markov model [Baum,
1972] for the generation of aligned pairs of corpora. We determined
the distributions, $\Pr(\ell_e)$ and $\Pr(\ell_f)$, from the relative
frequencies of various sentence lengths in our data. Figure 4 shows
for each language a histogram of these for sentences with fewer than
81 tokens. Except for lengths 2 and 4, which include a large number of
formulaic sentences in both the French and the English, the
distributions are very smooth.

For short sentences, the relative frequency is a reliable estimate of
the corresponding probability since for both French and English we
have more than 100 sentences of each length less than 81. We estimated
the probabilities of greater lengths by fitting the observed
frequencies of longer sentences to the tail of a Poisson distribution.

Figure 4: Histograms of French (top) and English (bottom) sentence lengths
(horizontal axis: sentence length, from 1 to 80)
We determined all of the other parameters by applying the EM algorithm
to a large sample of text [Baum, 1972, Dempster et al., 1977]. The
resulting values are shown in Table 3. From these parameters, we can
see that 91% of the English sentences and 98% of the English paragraph
markers line up one-to-one with their French counterparts. A random
variable z, the log of which is normally distributed with mean $\mu$
and variance $\sigma^2$, has mean value $\exp(\mu + \sigma^2/2)$. We
can also see, therefore, that the total length of the French text in
an ef-, eef-, or eff-bead should be about 9.8% greater on average than
the total length of the corresponding English text. Since most
sentences belong to ef-beads, this is close to the value of 9.5% given
in Section 2 for the amount by which the length of the average French
sentence exceeds that of the average English sentence.
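Two of these figures can be reproduced from the Table 3 estimates alone, as the following Python check illustrates. It weights each bead type by the number of English sentences it contains; the variable names are ours.

```python
import math

# Parameter estimates from Table 3.
pr_e = pr_f = 0.007
pr_ef = 0.690
pr_eef = pr_eff = 0.020
mu, sigma2 = 0.072, 0.043

# Expected number of English sentences per bead, by bead type:
# e, ef, and eff beads hold one English sentence, eef beads hold two.
english_total = pr_e + pr_ef + 2 * pr_eef + pr_eff
share_one_to_one = pr_ef / english_total

# Mean of a variable whose log is normal: exp(mu + sigma^2 / 2).
mean_ratio = math.exp(mu + sigma2 / 2.0)

print(f"English sentences aligned one-to-one: {100 * share_one_to_one:.0f}%")
print(f"Expected French excess: {100 * (mean_ratio - 1):.1f}%")
```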
Parameter             Estimate
Pr(e), Pr(f)          .007
Pr(ef)                .690
Pr(eef), Pr(eff)      .020
Pr(¶e), Pr(¶f)        .005
$\mu$                 .072
$\sigma^2$            .043
Table 3: Parameter estimates

We can compute from the parameters in Table 3 that the entropy of the
bead production process is 1.26 bits per sentence. Using the
parameters $\mu$ and $\sigma^2$, we can combine the observed
distribution of English sentence lengths shown in Figure 4 with the
conditional distribution of French sentence lengths given English
sentence lengths in Equation (1) to obtain the joint distribution of
French and English sentence lengths in ef-, eef-, and eff-beads. From
this joint distribution, we can compute that the mutual information
between French and English sentence lengths in these beads is 1.85
bits per sentence. We see therefore that, even in the absence of the
anchor points produced by the first two passes, the correlation in
sentence lengths is strong enough to allow alignment with an error
rate that is asymptotically less than 100%. Heartening though such a
result may be to the theoretician, this is a sufficiently coarse bound
on the error rate to warrant further study. Accordingly, we wrote a
program to simulate the alignment process that we had in mind. Using
$\Pr(\ell_e)$, $\Pr(\ell_f)$, and the parameters from Table 3, we
generated an artificial pair of aligned corpora. We then determined
the most probable alignment for the data. We recorded the fraction of
ef-beads in the most probable alignment that did not correspond to
ef-beads in the true alignment as the error rate for the process. We
repeated this process many thousands of times and found that we could
expect an error rate of about 0.9% given the frequency of anchor
points from the first two passes.
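The search for the most probable alignment of one subsection can be illustrated with a simplified Python sketch. It is not the program described above. It assumes that beads are drawn independently with the Table 3 probabilities rather than from the two-state model of Figure 3, it omits paragraph beads, it replaces the empirical length distributions with a uniform stand-in (length_prior), it truncates the normalizing sum at MAX_LEN, and it expects every sentence length to lie between 1 and MAX_LEN. All names are illustrative.

```python
import math
from functools import lru_cache

MU, SIGMA2 = 0.072, 0.043   # Table 3
BEAD_PROB = {"e": 0.007, "f": 0.007, "ef": 0.690, "eef": 0.020, "eff": 0.020}
# (English sentences, French sentences) consumed by each bead type.
BEAD_SHAPE = {"e": (1, 0), "f": (0, 1), "ef": (1, 1), "eef": (2, 1), "eff": (1, 2)}
MAX_LEN = 400               # assumed bound on sentence length


def length_prior(n):
    """Assumed stand-in for Pr(l_e) and Pr(l_f): uniform on 1..MAX_LEN."""
    return 1.0 / MAX_LEN if 1 <= n <= MAX_LEN else 0.0


def kernel(l_f, l_e):
    return math.exp(-(math.log(l_f / l_e) - MU) ** 2 / (2 * SIGMA2))


@lru_cache(maxsize=None)
def norm(l_e):
    return sum(kernel(k, l_e) for k in range(1, MAX_LEN + 1))


def bead_log_prob(bead, es, fs):
    """Log probability of a bead together with its sentence lengths."""
    p = BEAD_PROB[bead]
    for n in es:
        p *= length_prior(n)
    if bead == "f":
        p *= length_prior(fs[0])
    elif bead != "e":
        # Log ratio of total French to total English length, Equation (1).
        p *= kernel(sum(fs), sum(es)) / norm(sum(es))
        if bead == "eff":
            # Split of the French total is proportional to the product of priors.
            total = sum(fs)
            split = sum(length_prior(a) * length_prior(total - a)
                        for a in range(1, total))
            p *= length_prior(fs[0]) * length_prior(fs[1]) / split
    return math.log(p) if p > 0 else -math.inf


def align(english, french):
    """Most probable bead sequence for two lists of sentence lengths."""
    n, m = len(english), len(french)
    best = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == -math.inf:
                continue
            for bead, (de, df) in BEAD_SHAPE.items():
                ni, nj = i + de, j + df
                if ni > n or nj > m:
                    continue
                score = best[i][j] + bead_log_prob(bead, english[i:ni], french[j:nj])
                if score > best[ni][nj]:
                    best[ni][nj], back[ni][nj] = score, (bead, i, j)
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        bead, i, j = back[i][j]
        beads.append(bead)
    return beads[::-1]


print(align([10, 30, 12], [11, 15, 17, 13]))
```

In the toy call, the 30-token English sentence is paired with the two French sentences of 15 and 17 tokens, which is the kind of two-for-one grouping that an eff-bead is meant to capture.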
By varying the parameters of the hidden Markov model, we explored the
effect of anchor points and paragraph markers on the accuracy of
alignment. We found that with paragraph markers but no anchor points,
we could expect an error rate of 2.0%, with anchor points but no
paragraph markers, we could expect an error rate of 2.3%, and with
neither anchor points nor paragraph markers, we could expect an error
rate of 3.2%. Thus, while anchor points and paragraph markers are
important, alignment is still feasible without them. This is promising
since it suggests that one may be able to apply the same technique to
data where frequent anchor points are not available.
RESULTS
We applied the alignment algorithm of Sections 3 and 4 to the Canadian
Hansard data described in Section 2. The job ran for 10 days on an IBM
Model 3090 mainframe under an operating system that permitted access
to 16 megabytes of virtual memory. The most probable alignment
contained 2,869,041 ef-beads. Some of our colleagues helped us examine
a random sample of 1000 of these beads, and we found 6 in which
sentences were not translations of one another. This is consistent
with the expected error rate of 0.9% mentioned above. In some cases,
the algorithm correctly aligns sentences with very different lengths.
Table 4 shows some interesting examples of this.

English: And love and kisses to you, too.
French:  Pareillement.

English: mugwumps who sit on the fence with their mugs on one side and
         their wumps on the other side and do not know which side to
         come down on.
French:  en voulant ménager la chèvre et le choux ils n'arrivent pas à
         prendre parti.

English: At first reading, she may have.
French:  Elle semble en effet avoir un grief tout à fait valable, du
         moins au premier abord.

Table 4: Unusual but correct alignments
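As a rough check on the claim that 6 errors in 1000 sampled beads is consistent with a 0.9% error rate, one can ask how likely such a small count would be under a binomial model. The Python sketch below does this. Treating the 1000 beads as independent trials is an assumption of the sketch and is not stated in the text.

```python
from math import comb

n, p = 1000, 0.009   # sample size and simulated error rate
observed = 6         # misaligned pairs found by hand

# Expected number of errors in the sample if the true rate is p.
expected = n * p

# Probability of seeing `observed` or fewer errors under the binomial model.
tail = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(observed + 1))

print(f"expected errors: {expected:.1f}")
print(f"P(errors <= {observed}) = {tail:.3f}")
```

A tail probability that is not small indicates that the hand-checked sample gives no reason to doubt the simulated rate.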
REFERENCES
[Baum, 1972] Baum, L. (1972). An inequality
and associated maximization technique in
statistical estimation of probabilistic func-
tions of a Markov process. Inequalities, 3:1-
8.
[Bellman, 1957] Bellman, R. (1957). Dy-
namic Programming. Princeton University
Press, Princeton N.J.
[Brown et al., 1982] Brown, P., Spohrer, J.,
Hochschild, P., and Baker, J. (1982). Par-
tial traceback and dynamic programming.
In Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal
Processing, pages 1629-1632, Paris, France.
[Brown et al., 1990] Brown, P. F., Cocke, J.,
DellaPietra, S. A., DellaPietra, V. J., Jelinek, F.,
Lafferty, J. D., Mercer, R. L., and Roossin, P. S.
(1990). A statistical approach to machine
translation. Computational Linguistics, 16(2):79-85.
[Brown et al., 1988] Brown, P. F., Cocke, J.,
DellaPietra, S. A., DellaPietra, V. J., Jelinek, F.,
Mercer, R. L., and Roossin, P. S. (1988). A
statistical approach to language translation. In
Proceedings of the 12th International Conference on
Computational Linguistics, Budapest, Hungary.
[Catizone et al., 1989] Catizone, R., Russell,
G., and Warwick, S. (1989). Deriving translation
data from bilingual texts. In Proceedings of the
First International Acquisition Workshop, Detroit,
Michigan.
[Dempster et al., 1977] Dempster, A., Laird,
N., and Rubin, D. (1977). Maximum likelihood from
incomplete data via the EM algorithm. Journal of the
Royal Statistical Society, 39(B):1-38.
[Gale and Church, 1991] Gale, W. A. and
Church, K. W. (1991). A program for aligning
sentences in bilingual corpora. In Proceedings of
the 29th Annual Meeting of the Association for
Computational Linguistics, Berkeley, California.
[Kay, 1991] Kay, M. (1991). Text-translation
alignment. In ACH/ALLC '91: "Making Connections"
Conference Handbook, Tempe, Arizona.
[Klavans and Tzoukermann, 1990] Klavans, J. and
Tzoukermann, E. (1990). The bicord system. In
COLING-90, pages 174-179, Helsinki, Finland.
[Sadler, 1989] Sadler, V. (1989). The Bilingual
Knowledge Bank - A New Conceptual Basis for MT.
BSO/Research, Utrecht.
[Warwick and Russell, 1990] Warwick, S. and
Russell, G. (1990). Bilingual concordancing and
bilingual lexicography. In EURALEX 4th International
Congress, Málaga, Spain.