DSpace at VNU: Improving word alignment for statistical machine translation based on constraints

2012 International Conference on Asian Language Processing

Improving Word Alignment for Statistical Machine Translation based on Constraints

Le Quang Hung
Faculty of Information Technology, Quynhon University, Vietnam
Email: hungqnu@gmail.com

Le Anh Cuong
University of Engineering and Technology, Vietnam National University, Hanoi
Email: cuongla@vnu.edu.vn

Abstract: Word alignment is a fundamental task in building a statistical machine translation (SMT) system. However, obtaining highly accurate word-level alignments in parallel corpora is still a challenge. In this paper, we propose a new constraint-based method to improve the quality of word alignment. Our experiments show that using constraints in the parameter estimation of the IBM models reduces the alignment error rate by 7.26% (relative) and increases the BLEU score by 5% (relative) for translation from English to Vietnamese.

I. INTRODUCTION

Word alignment is a core component of every SMT system. In fact, the initial quality of statistical word alignment dominates the quality of SMT [1]. Most current SMT systems [2], [3] use statistical models for word alignment, typically through GIZA++, which implements the IBM models [4]. However, alignment quality is typically quite low for language pairs that differ greatly in syntactic structure, such as English-Vietnamese and English-Chinese. It is therefore necessary to incorporate auxiliary information to alleviate this problem. In our opinion, two factors could reduce the alignment error rate: (1) adding more training data (parallel corpora); (2) developing more efficient methods for exploiting existing training data. In this work, we take the second approach.

We address the problem of efficiently exploiting existing parallel corpora by adding constraints directly to the Expectation Maximization (EM) parameter estimation procedure of the IBM models. From our survey, the original IBM models contain no mechanism to prevent undesirable alignments, so each word in the source sentence may align to every word in the target sentence. To guide the model toward correct alignments, we employ constraints that limit the range of words with which a word can align (in a parallel sentence pair).

Some previous works have incorporated auxiliary information into the estimation of the IBM models' parameters, such as [5], [6]. Och and Ney [5] used a bilingual dictionary as an additional knowledge source for extending the training corpus. They assign a high weight to the dictionary entries that actually co-occur in the training corpus and a very low weight to the remaining entries. Talbot [6], on the other hand, proposed a method that uses external knowledge sources to constrain the procedure directly by restricting the set of alignments explored during parameter estimation. Knowledge sources such as cognate relations, a bilingual dictionary, and numeric pattern matching generate anchor constraints. These anchor constraints are then used to help decide which word pairings are permissible during parameter estimation. We differ from these two approaches in that we do not use a bilingual dictionary to generate anchor constraints. Instead, we use lexical pairs and cognates [7] extracted from the training data as anchor points. Our method therefore has the advantage of needing no extra resources, compared with the approaches in [5], [6].

In this paper, we first propose: (1) a new constraint type that relies on the distance between the position of the source word and the position of the target word in a parallel sentence pair; (2) a novel method to generate anchor constraints. We then incorporate the constraints into the parameter estimation of the IBM models to improve the quality of word alignments.

The remainder of the paper is organized as follows. Section II presents IBM model-1 and the EM algorithm. Section III describes our method to improve word alignment models. Experimental results are shown in Section IV. Finally, conclusions are drawn in Section V.
978-0-7695-4886-9/12 $26.00 © 2012 IEEE
DOI 10.1109/IALP.2012.45

II. IBM MODEL-1 AND EM ALGORITHM

Suppose that we are working with the two languages English and Vietnamese. Given an English sentence $\mathbf{e}$ consisting of $I$ words $e_1, \ldots, e_I$ and a Vietnamese sentence $\mathbf{f}$ consisting of $J$ words $f_1, \ldots, f_J$, we define the alignment $a$ between $\mathbf{e}$ and $\mathbf{f}$ as a subset of the Cartesian product of the word positions:

$$a \subseteq \{(j, i) : j = 1, \ldots, J;\ i = 1, \ldots, I\} \tag{1}$$

In statistical alignment models, a hidden alignment $a$ connects words in the target language sentence to words in the source language sentence. The set of alignments is defined as the set of all possible connections from each word position $j$ in the target language sentence to exactly one word position $i$ in the source language sentence. The translation probability $\Pr(\mathbf{f} \mid \mathbf{e})$ can be calculated as the sum of $\Pr(\mathbf{f}, a \mid \mathbf{e})$ over all possible alignments, where $\Pr(\mathbf{f}, a \mid \mathbf{e})$ is the joint probability of the target language sentence $\mathbf{f}$ and an alignment $a$ given the source language sentence $\mathbf{e}$:

$$\Pr(\mathbf{f} \mid \mathbf{e}) = \sum_{a} \Pr(\mathbf{f}, a \mid \mathbf{e}) \tag{2}$$

The joint probability $\Pr(\mathbf{f}, a \mid \mathbf{e})$ depends only on the parameter $t(f_j \mid e_{a_j})$, the probability that the target word $f_j$, aligned to source position $a_j$, is a translation of the source word $e_{a_j}$:

$$\Pr(\mathbf{f}, a \mid \mathbf{e}) = \frac{\epsilon}{(I + 1)^{J}} \prod_{j=1}^{J} t(f_j \mid e_{a_j}) \tag{3}$$

where $\epsilon$ is a normalization constant, as in Brown et al. [4].

The EM algorithm [8] is used to iteratively estimate the alignment model probabilities according to the likelihood of the model on a parallel corpus. This algorithm consists of two steps:
1) Expectation step (E-step): apply the model to the data; alignment probabilities are computed from the model parameters.
2) Maximization step (M-step): estimate the model from the data; parameter values are re-estimated based on the alignment probabilities and the corpus.

IBM model-1 [4] was originally developed to provide reasonable initial parameter estimates for more complex word-alignment models [9]. The next section presents how to use constraints in the parameter estimation of this model.
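The two EM steps described above can be illustrated with a minimal, unconstrained IBM model-1 trainer. This is a sketch under stated assumptions rather than the GIZA++ implementation: the function name `train_model1`, the `"NULL"` token used for source position 0, the uniform initialization, and the default iteration count are all choices made here for illustration.

```python
from collections import defaultdict


def train_model1(corpus, iterations=5):
    """Estimate t(f|e) with plain EM (no constraints).

    corpus: list of (source_tokens, target_tokens) sentence pairs.
    Returns a mapping from (f, e) to the translation probability t(f|e).
    """
    # Assumption: start from a uniform distribution over the target vocabulary.
    target_vocab = {f for _, fs in corpus for f in fs}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f|e)
        total = defaultdict(float)  # sum of c(f|e) over f, per source word e

        # E-step: distribute each target word over the source words
        # in proportion to the current t(f|e).
        for es, fs in corpus:
            es = ["NULL"] + list(es)  # position 0 is the empty word (assumed name)
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c

        # M-step: renormalize the expected counts per source word.
        t = defaultdict(float)
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t
```

Calling `train_model1` on a handful of sentence pairs that share words returns a table in which, for each source word, the probabilities over co-occurring target words sum to one. The constraints of Section III act on the E-step of exactly this loop.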
III. THE PROPOSED APPROACH

In this work, we add constraints directly to the EM algorithm and modify the standard parameter estimation procedure for IBM model-1. Note that the constrained alignment parameter is $t(f_j \mid e_i)$ (in the following sections, we write $t(f_j \mid e_i)$ instead of $t(f_j \mid e_{a_j})$). We use two types of constraint: anchor and distance. The constraints are formulated as boolean functions. These are then used in the standard forward-backward recursions to directly restrict the posterior distribution inferred in the E-step. Both types of constraint are described below.

A. Anchor constraint

Anchor constraints are exclusive constraints that force a confident alignment between two words. The alignment between the words of an anchor point is forced by setting the translation probabilities at that position to zero for all other words in the E-step [6]. Given a sentence pair $(\mathbf{f}, \mathbf{e})$, if the word pair $(f_j, e_i)$ is an anchor point, then we assign translation probabilities $t(f_j \mid e_k) = 0, \forall k \neq i$, and $t(f_l \mid e_i) = 0, \forall l \neq j$. Figure 1 shows an example of anchor constraints. As can be seen in the figure, the word pair (tôi, me) is an anchor point; thus, the translation probabilities between tôi and other words, such as (tôi, a), (tôi, car), (tôi, passed), are set to zero.

Fig. 1. An example of anchor constraints (black), setting translation probabilities to zero for all other word pairs (dark grey).

In this work, we used the following knowledge sources to generate anchor constraints:

1) Cognate: According to Kondrak [10], the term cognates denotes words in different languages that are similar in their orthographic or phonetic form and are possible translations of each other. Cognates are particularly useful when machine-readable bilingual dictionaries are not available. Our approach differs from Kondrak's method [10], which used three word similarity measures (Simard's condition, Dice's coefficient, and LCSR) to extract cognates. Here, we select words that are left untranslated and co-occur in an aligned sentence pair (e.g., abbreviations, numbers, punctuation). Note that in our method, these cognates are extracted directly from the corpus during training.

2) Lexical pairs: From our observation, the most frequent source words are likely to be translated into words that are also frequent on the target side. To extract anchor points of this kind, we combine the translation probability $t(f_j \mid e_i)$ and the frequency $\text{count}(f_j, e_i)$ of word pairs in the training data. We can thus select highly accurate word pairs to generate anchor points. We define a lexical list $L$ as a set of entries:

$$L = \{(f_j, e_i) \mid t(f_j \mid e_i) > \alpha,\ \text{count}(f_j, e_i) > \beta\} \tag{4}$$

Here, $e_i$ is a source language word, $f_j$ is a target language word, and $\alpha$, $\beta$ are predefined thresholds (in the experiments, we used $\alpha = 0.5$ and $\beta = 10$).

We now formulate the constraint based on anchor points as a boolean function anchor_constraint$(f_j, e_i)$:

$$\text{anchor\_constraint}(f_j, e_i) = \begin{cases} \text{true} & \text{if } (f_j = e_i) \lor (f_j, e_i) \in L \\ \text{false} & \text{otherwise} \end{cases} \tag{5}$$

B. Distance constraint

From our survey, words in the source language sentence usually have a distance relationship with words in the target language sentence (this is confirmed in the experiment section). From this point of view, we propose a new constraint type that relies on the distance between the positions of the source word and the target word in a parallel sentence pair. We formulate this with the boolean function distance_constraint$(i, j)$:

$$\text{distance\_constraint}(i, j) = \begin{cases} \text{true} & \text{if } \text{abs}(i - j) \leq \delta \\ \text{false} & \text{otherwise} \end{cases} \tag{6}$$

Here, $\text{abs}(i - j)$ is the distance from source position $i$ to target position $j$, and $\delta$ is a predefined threshold (in the experiments, we set $\delta = 2$). This means that, given a sentence pair $(\mathbf{f}, \mathbf{e})$, each target position $j$ is only aligned with source positions in the range $[j - \delta, j + \delta]$. As can be seen in Figure 2, the target word at position 5 (the word vợ) only aligns with the source words at positions in the range [3, 7] (the words a, wife, and, kid, to).

Fig. 2. An example of distance constraint with threshold $\delta = 2$: each target position $j$ (black) only aligns with source positions in the range $[j - \delta, j + \delta]$ (dark grey).

It is worth emphasizing that this differs from the anchor constraints, which set translation probabilities to zero for all unconstrained words. We estimate the translation probability $t(f_j \mid e_i)$ as a mixture of constrained and unconstrained contributions for each word pair $(f_j, e_i)$. To amplify the contribution of constraints of this kind, we weight the statistics collected in the EM algorithm. We use a single parameter $\lambda$ that assigns a high weight if a word pair $(f_j, e_i)$ is constrained and a very low weight otherwise. That is, the translation probability $t(f_j \mid e_i)$ is multiplied by $\lambda$ when constrained and by $(1 - \lambda)$ otherwise (in the experiments, $\lambda$ is set to 0.99). Here, the translation probability $t(f_j \mid e_i)$ is used to collect counts in the EM algorithm. Similar to Brown et al. [4], we call the expected number of times that $e$ connects to $f$ in the translation $(\mathbf{f} \mid \mathbf{e})$ the count of $f$ given $e$ for $(\mathbf{f} \mid \mathbf{e})$ and denote it by $c(f \mid e; \mathbf{f}, \mathbf{e})$. Let $E_1$ and $E_2$ denote the sets of English words for which the distance constraint is satisfied and unsatisfied, respectively. We collect counts $c(f \mid e; \mathbf{f}, \mathbf{e})$ from a sentence pair $(\mathbf{f}, \mathbf{e})$ as follows:

$$c(f \mid e; \mathbf{f}, \mathbf{e}) = \left( \frac{\lambda\, t(f \mid e)}{\sum_{e_k \in E_1} t(f \mid e_k)} + \frac{(1 - \lambda)\, t(f \mid e)}{\sum_{e_l \in E_2} t(f \mid e_l)} \right) \sum_{j=1}^{J} \delta(f, f_j) \sum_{i=0}^{I} \delta(e, e_i) \tag{7}$$

In equation (7), $\delta(\cdot, \cdot)$ is the Kronecker delta (not the distance threshold): $\sum_{j=1}^{J} \delta(f, f_j)$ is the count of $f$ in $\mathbf{f}$, and $\sum_{i=0}^{I} \delta(e, e_i)$ is the count of $e$ in $\mathbf{e}$. After collecting these counts over the corpus, we estimate the translation probability $t(f \mid e; \mathbf{f}, \mathbf{e})$ by equation (8):

$$t(f \mid e; \mathbf{f}, \mathbf{e}) = \frac{\sum_{(\mathbf{f}, \mathbf{e})} c(f \mid e; \mathbf{f}, \mathbf{e})}{\sum_{f} \sum_{(\mathbf{f}, \mathbf{e})} c(f \mid e; \mathbf{f}, \mathbf{e})} \tag{8}$$

IV. EXPERIMENTAL SETUP

In this section, we present experimental results on an English-Vietnamese parallel corpus. The experiments were carried out using the English-Vietnamese data sets credited to [11]. We designed four training data sets consisting of 60k, 70k, 80k, and 90k sentence pairs. Our main experiments fell into the following categories:
1) Verifying that the use of constraints (as proposed) has an impact on the quality of alignments.
2) Evaluating whether improved parameter estimates of alignment quality lead to improved translation quality.

To choose the thresholds $\alpha$, $\beta$, $\delta$ and the parameter $\lambda$, we trained on the 60k sentence pairs and selected $\alpha = 0.5$, $\beta = 10$, $\delta = 2$, and $\lambda = 0.99$, which achieved high performance. Our baseline is a phrase-based SMT system that uses the Moses toolkit [2] for translation model training and decoding, and GIZA++ [5] for word alignment.

A. Word alignment experiments

We used the alignment error rate (AER) metric as defined by Och and Ney [5] to measure the quality of alignments:

$$\text{precision} = \frac{|A \cap P|}{|A|} \tag{9}$$

$$\text{recall} = \frac{|A \cap S|}{|S|} \tag{10}$$

$$\text{AER} = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|} \tag{11}$$

where $S$ denotes the annotated set of sure alignments, $P$ denotes the annotated set of possible alignments, and $A$ denotes the set of alignments produced by the model under test [9].

In all the experiments below, we use the same training scheme, in the following order: iterations of Model 1, iterations of Model 2, and iterations of Model [...]. We used 150 sentence pairs as a held-out hand-aligned set to measure word alignment quality. Table I gives the alignment quality of the IBM models when training with GIZA++ on the four training data sets. We obtained better results when incorporating the constraints into the parameter estimation of the IBM models; Table II shows the results for the different corpus sizes. The best-performing GIZA++ model was trained on 90k sentence pairs and had an alignment error rate of 24.48%. With our modified IBM models, the best-performing model, also trained on 90k sentence pairs, had an alignment error rate of 22.53%. We obtain a relative AER reduction of about 7.26% over all training data sets. In the baseline, increasing the size of the training data reduces the alignment error rate, but only minimally: the average relative improvement is 0.91% per additional 10k sentence pairs (see Figure 3).

TABLE I
QUALITY OF ALIGNMENTS FOR IBM MODELS (BASELINE)

Size of training data    Precision (%)    Recall (%)    AER (%)
60k                      67.79            83.49         25.16
70k                      68.22            83.68         24.83
80k                      68.66            83.49         24.64
90k                      68.63            83.93         24.48

TABLE II
QUALITY OF ALIGNMENTS FOR IBM MODELS TRAINED WITH CONSTRAINTS

Size of training data    Precision (%)    Recall (%)    AER (%)
60k                      71.82            81.44         23.66
70k                      71.82            82.87         23.04
80k                      72.21            83.18         22.68
90k                      72.57            83.06         22.53

Fig. 3. Comparison of word alignment quality between the baseline and our method.

B. Machine translation experiments

To test whether our improved parameter estimates lead to better translation quality, we used a phrase-based decoder [2] to translate a set of English sentences into Vietnamese. The phrase-based decoder extracts phrases from the word alignments produced by the experiments in the previous section (word alignment experiments). We trained a language model on the 90k Vietnamese sentences from the training set. To evaluate translation quality, we used the BLEU metric [12]. A test set of 5k sentence pairs was used to evaluate SMT quality. Table III shows that our method leads to better translation quality than the baseline [2]. We achieve a higher BLEU score on all training data sets. The average improvement is 1.04 BLEU points absolute (5.0% relative) compared to the baseline.

TABLE III
COMPARISON OF SMT QUALITY (BLEU) BETWEEN THE BASELINE AND OUR METHOD

Size of training data    Baseline    Our method    Δ (%)
60k                      18.46       19.13         +3.63
70k                      20.19       20.95         +3.76
80k                      20.96       22.20         +5.92
90k                      21.90       23.40         +6.85

V. CONCLUSION

In this paper, we have proposed a novel method to improve word alignments by incorporating constraints into the parameter estimation of the IBM models. These constraints are used to prevent undesirable alignments that cannot be ruled out in the standard IBM models. Experimental results show that our proposed method improves alignment accuracy (a relative AER reduction of about 7.26%) and translation quality (1.04 BLEU points on average) for translation from English to Vietnamese. When IBM model-1 is improved, it provides better initial parameters for the higher IBM models, and therefore improves overall quality. We believe that our method can be applied to other language pairs because the constraints used in the proposed method are language independent. In the future, we will extend our work with more advanced constraints to improve the quality of alignments.

ACKNOWLEDGEMENT

This work is supported by the project "Studying Methods for Analyzing and Summarizing Opinions from Internet and Building an Application", which is funded by Vietnam National University of Hanoi.

REFERENCES

[1] J.-H. Lee, S.-W. Lee, G. Hong, Y.-S. Hwang, S.-B. Kim, and H.-C. Rim, “A post-processing approach to statistical word alignment reflecting alignment tendency between part-of-speeches,” in Coling 2010: Posters. Beijing, China: Coling 2010 Organizing Committee, August 2010, pp. 623–629.
[2] P. Koehn, F. J. Och, and D. Marcu, “Statistical phrase-based translation,” in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, ser. NAACL ’03. Stroudsburg, PA, USA: Association for Computational Linguistics, 2003, pp. 48–54.
[3] F. J. Och and H. Ney, “The alignment template approach to statistical machine translation,” Comput. Linguist., vol. 30, no.
[4] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, “The mathematics of statistical machine translation: parameter estimation,” Comput. Linguist., vol. 19, no. 2, pp. 263–311, Jun. 1993.
[5] F. J. Och and H. Ney, “A systematic comparison of various statistical alignment models,” Computational Linguistics, vol. 29, 2003.
[6] D. Talbot, “Constrained EM for parallel text alignment,” Nat. Lang. Eng., vol. 11, no. 3, pp. 263–277, Sep. 2005.
[7] M. Simard, G. F. Foster, and P. Isabelle, “Using cognates to align sentences in bilingual corpora,” in Proceedings of the 1993 Conference of the Centre for Advanced Studies on Collaborative Research: Distributed Computing - Volume 2, ser. CASCON ’93. IBM Press, 1993, pp. 1071–1082.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.
[9] R. C. Moore, “Improving IBM word-alignment model 1,” in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ser. ACL ’04. Stroudsburg, PA, USA: Association for Computational Linguistics, 2004.
[10] G. Kondrak, D. Marcu, and K. Knight, “Cognates can improve statistical translation models,” in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003, short papers - Volume 2, ser. NAACL-Short ’03. Stroudsburg, PA, USA: Association for Computational Linguistics, 2003, pp. 46–48.
[11] C. Hoang, A.-C. Le, P.-T. Nguyen, and T.-B. Ho, “Exploiting nonparallel corpora for statistical machine translation,” in RIVF. IEEE, 2012, pp. 1–6.
[12] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ser. ACL ’02, 2002, pp. 311–318.
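As a supplementary illustration of Section III, the sketch below collects expected counts for one sentence pair under both constraint types. It is one possible reading of the paper, not the authors' code. Assumptions made here: the first term of equation (7) is applied to source words that satisfy the distance constraint and the second term to those that do not; several anchors in the same row or column are allowed to coexist; all weight goes to one group when the other is empty; source position 0 holds a `"NULL"` word that is never an anchor. The name `collect_counts` and its parameters are hypothetical.

```python
from collections import defaultdict


def anchor_constraint(f, e, lexical_list):
    """Eq. (5): true for cognates (identical tokens) or pairs in the list L."""
    return f == e or (f, e) in lexical_list


def distance_constraint(i, j, delta=2):
    """Eq. (6): true when source position i is within delta of target position j."""
    return abs(i - j) <= delta


def collect_counts(es, fs, t, lexical_list=frozenset(), delta=2, lam=0.99):
    """Expected counts c(f|e) for one sentence pair (hypothetical helper).

    es, fs: source and target token lists; t: mapping (f, e) -> t(f|e).
    """
    es = ["NULL"] + list(es)  # source position 0 is the empty word (assumed)
    anchors = {(j, i)
               for j, f in enumerate(fs, 1)
               for i, e in enumerate(es)
               if i > 0 and anchor_constraint(f, e, lexical_list)}
    anchored_rows = {j for j, _ in anchors}
    anchored_cols = {i for _, i in anchors}

    counts = defaultdict(float)
    for j, f in enumerate(fs, 1):
        # Anchor constraint: a non-anchor pair sharing a row or column
        # with an anchor gets probability zero in this E-step.
        probs = {}
        for i, e in enumerate(es):
            blocked = (j, i) not in anchors and (
                j in anchored_rows or i in anchored_cols)
            probs[i] = 0.0 if blocked else t[(f, e)]

        # Distance constraint: split source positions into E1 (near) and E2 (far).
        near = [i for i in probs if distance_constraint(i, j, delta)]
        far = [i for i in probs if not distance_constraint(i, j, delta)]
        z_near = sum(probs[i] for i in near)
        z_far = sum(probs[i] for i in far)

        # Weight lam for E1 and (1 - lam) for E2; if one group has no mass,
        # the other receives all of it (an assumption of this sketch).
        w_near = lam if z_far > 0 else 1.0
        w_far = (1.0 - lam) if z_near > 0 else 1.0
        if z_near > 0:
            for i in near:
                counts[(f, es[i])] += w_near * probs[i] / z_near
        if z_far > 0:
            for i in far:
                counts[(f, es[i])] += w_far * probs[i] / z_far
    return counts
```

Summing these counts over all sentence pairs and renormalizing per source word, as in equation (8), would give the re-estimated translation table. Normalizing separately within each group keeps the total expected count of every target word at one, which is why this reading was chosen.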
