2012 International Conference on Asian Language Processing

Improving the Quality of Word Alignment by Integrating Pearson's Chi-square Test Information

Cuong Hoang (1), Cuong Anh Le (1), Son Bao Pham (1,2)
(1) University of Engineering and Technology, Vietnam National University, Hanoi
(2) Information Technology Institute, Vietnam National University, Hanoi
{cuongh.mi10, cuongla, sonpb}@vnu.edu.vn

978-0-7695-4886-9/12 $26.00 © 2012 IEEE. DOI 10.1109/IALP.2012.44

Abstract: Previous research has mainly focused on approaches inspired by the log-linear model background in machine learning, or on other adaptations. However, few studies concentrate on improving word-alignment models in order to enhance the quality of the phrase translation table. This research follows that second approach. The experiments show that this scheme improves the quality of the word-alignment component. As a result, the overall quality of the translation system improves by around 1% on the BLEU score metric.

Keywords: Machine Translation, Pearson's Chi-Square Test, Word-Alignment Model, Log-Linear Model

I. INTRODUCTION

Modern Statistical Machine Translation (SMT) systems are usually built from log-linear models [1][2]. In addition, the best performing systems are based in some way on phrases (groups of words) [2][3]. The basic idea of phrase-based translation is to learn to break a given source sentence into phrases, translate each phrase, and finally compose the target sentence from these phrase translations.

Phrase learning in a statistical phrase-based translation system usually relies on the alignments between words. To find the best alignments between phrases, we first generate word alignments; phrase alignments are then heuristically “extracted” from them [3]. Indeed, previous experiments point out that the automatic word-alignment process is a vital component of an SMT system [3].

Many studies, inspired by the log-linear model background in machine learning or by other adaptations [4][5][6], focus on improving the quality of the translation system. However, few works examine how to enhance the quality of the word-alignment component in order to improve the accuracy of the phrase translation table and hence the system as a whole [7].

Following the second scheme, in this paper we focus on improving the accuracy of word-alignment modelling. With the gains obtained by improving the quality of word translation modelling, we found that the quality of a phrase-based SMT system improves noticeably. In addition, this research focuses on the fact that improving lexical translation modelling allows us to “boost” the “hidden” merit of the higher IBM alignment models and therefore improves the quality of statistical phrase-based translation even further¹. We found this to be an important aspect, yet only a few works concentrate on it [8].

¹ For convenience, we use the term “lexical models” for IBM Models 1-2 and the term “higher models” for IBM Models 3-5.

II. IBM MODELS AND THEIR EFFECTS ON HIGHER ALIGNMENT MODELS

A. IBM Model 1

Model 1 is a probabilistic generative model within a framework that assumes a source sentence $f_1^J$ of length $J$ translates as a target sentence $e_1^I$ of length $I$. It is defined as a particularly simple instance of this framework, by assuming that all possible lengths for $f_1^J$ (less than some arbitrary upper bound) have a uniform probability $\epsilon$. [9] yields the following summarization equation:

$$\Pr(f \mid e) = \frac{\epsilon}{(I+1)^{J}} \prod_{j=1}^{J} \sum_{i=0}^{I} t(f_j \mid e_i) \tag{1}$$

The parameters of Model 1 for a given pair of languages are normally estimated using EM [10]. We call the expected number of times that word $e_i$ connects to $f_j$ in the translation pair $(f_1^J \mid e_1^I)$ the count of $f_j$ given $e_i$, and denote it by $c(f_j \mid e_i; f_1^J, e_1^I)$. Following the derivation in [9], this count can be calculated as:

$$c(f_j \mid e_i; f_1^J, e_1^I) = \frac{t(f_j \mid e_i)}{\sum_{i'=0}^{I} t(f_j \mid e_{i'})} \sum_{j'=1}^{J} \sigma(f_j, f_{j'}) \sum_{i'=0}^{I} \sigma(e_i, e_{i'}) \tag{2}$$

where $\sigma(x, y)$ is 1 when the two words are identical and 0 otherwise, so the two sums count the occurrences of $f_j$ in $f_1^J$ and of $e_i$ in $e_1^I$.

In addition, we set $\lambda_e$ as the normalization factor and then repeatedly re-estimate the translation probability of a word $f_j$ in $f_1^J$ given a word $e_i$ in $e_1^I$ as:

$$t(f_j \mid e_i) = \lambda_e^{-1}\, c(f_j \mid e_i; f_1^J, e_1^I) \tag{3}$$

B. The Effect of IBM Model 1 on Higher Models

For Model 2, we make the same assumptions as in Model 1, except that we assume $\Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, J, e)$ depends on $j$, $a_j$, and $J$, as well as on $I$. The following equation gives the Model 2 estimate of the probability of a target sentence given a source sentence:

$$\Pr(f \mid e) = \epsilon \prod_{j=1}^{J} \sum_{i=0}^{I} t(f_j \mid e_i)\, a(i \mid j, J, I) \tag{4}$$

IBM Models 3-5 yield more accurate results than Models 1-2, mainly because of their fertility-based scheme. However, consider the original, general equation for IBM Models 3-5 proposed in [9], which describes the “joint likelihood” for a tableau $\tau$ and a permutation $\pi$:

$$\begin{aligned}
\Pr(\tau, \pi \mid e) ={} & \prod_{i=1}^{I} \Pr(\phi_i \mid \phi_1^{i-1}, e)\; \Pr(\phi_0 \mid \phi_1^{I}, e) \\
& \times \prod_{i=0}^{I} \prod_{k=1}^{\phi_i} \Pr(\tau_{ik} \mid \tau_{i1}^{k-1}, \tau_0^{i-1}, \phi_0^{I}, e) \\
& \times \prod_{i=1}^{I} \prod_{k=1}^{\phi_i} \Pr(\pi_{ik} \mid \pi_{i1}^{k-1}, \pi_1^{i-1}, \tau_0^{I}, \phi_0^{I}, e) \\
& \times \prod_{k=1}^{\phi_0} \Pr(\pi_{0k} \mid \pi_{01}^{k-1}, \pi_1^{I}, \tau_0^{I}, \phi_0^{I}, e)
\end{aligned} \tag{5}$$

The alignment position information and the fertility information are clearly great ideas. However, the problem is quite intuitive: for Models 1 and 2, we restrict ourselves to alignments that are “lexical” connections, in which each “cept” is either a single source word or the NULL word. In contrast, as Equations (4) and (5) show, we use these probabilities as the initial result for parameterizing $\phi$ and for calculating $\Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, J, e)$ and the other translation probabilities. This greatly reduces the value of the higher models, because the lexical translation probabilities are quite error-prone.

III. IMPROVING IBM MODELS

A. The Problem

Basically, in order to estimate the word translation probability $t(f_j \mid e_i)$, we consider the translation probabilities of all the possible equivalent words $e_k$ ($k \neq i$) in $e_1^I$ of $f_j$. More specifically, our focus is on the case in which two words $f_j$ and $e_i$ co-occur many times. The lexical translation probability $t(f_j \mid e_i)$ can then take a high value even though the two words have no actual “meaning relationship” in linguistic terms; they simply appear together often by “chance”.

This case leads to an important consequence: a “correct” translation of $e_i$, for example $f_k$, never gains the translation probability it should have. The translation probability $t(f_k \mid e_i)$ is small for two reasons. First, the two words do not necessarily co-occur many times. Second, it suffers from the impact of the “noisy” probability $t(f_j \mid e_i)$. This is quite important. In addition, [8] points out that $t(f_k \mid e_i)$ is usually smaller than $t(f_j \mid e_i)$ when $f_j$ simply occurs less often than $f_k$.

Following [8], a purely statistical method based on co-occurrence can hardly abate the “wrong” translation probabilities $t(f_j \mid e_i)$. Previous works mainly focus on integrating syntactic knowledge to improve the quality of alignment [11]. In this work, we propose a new approach: we combine the traditional IBM models with another statistical method, the Pearson's Chi-square test information.

B. Adding Mutual Information

The essence of Pearson's Chi-square test is to compare the observed frequencies in a table with the frequencies expected under independence. If the difference between the observed and expected frequencies is large, we can reject the null hypothesis of independence. In the simplest case, the $X^2$ test is applied to 2-by-2 tables. The $X^2$ statistic sums the squared differences between observed and expected values over all cells of the table, scaled by the magnitude of the expected values, as follows:

$$X^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \tag{6}$$

where $i$ ranges over the rows of the table (sometimes called a “contingency table”), $j$ ranges over the columns, $O_{ij}$ is the observed value for cell $(i, j)$, and $E_{ij}$ is the expected value.

[12] realized that this “independence” information is a particularly good choice. They used a measure which they call $\phi^2$, an $X^2$-like statistic. The value of $\phi^2$ is bounded between 0 and 1. For details on how to calculate $\phi^2$, please refer to [12]. In practice, identifying word correspondences with the $\phi^2$ method does not perform as well as IBM Model 1 together with the EM training scheme [13]. However, we believe this information is quite valuable. We adjust the equation for each IBM model to obtain more accurate lexical translation results. In more detail, for convenience, let $\varphi_{ij}$ denote the value $\phi^2(e_i, f_j)$. The count $c(f_j \mid e_i; f_1^J, e_1^I)$ can then be calculated as:

$$c(f_j \mid e_i; f_1^J, e_1^I) = \frac{t(f_j \mid e_i)\,\bigl(\lambda + (1-\lambda)\,\varphi_{ij}\bigr)}{\sum_{i'=0}^{I} t(f_j \mid e_{i'})\,\bigl(\lambda + (1-\lambda)\,\varphi_{i'j}\bigr)} \sum_{j'=1}^{J} \sigma(f_j, f_{j'}) \sum_{i'=0}^{I} \sigma(e_i, e_{i'}) \tag{7}$$

The parameter $\lambda$ defines the weight of the Pearson's Chi-square test information. Good values for this parameter are around 0.3.

IV. EXPERIMENT

A. The Preparation

The experiment is carried out on two language pairs, English-Vietnamese (E-V) and English-French (E-F), in order to obtain accurate and reliable results. The E-V training data, of size 60,000, was provided by [14]. Similarly, the E-F training corpus, of size 60,000, was taken from the Hansards corpus [13]. In this work, we directly test our improved method on the overall phrase-based translation system. We learn phrase alignments from a corpus that has been word-aligned by the alignment toolkit. Our phrase learning component uses the best “Viterbi” sequences trained from each IBM model.

We use LGIZA², a lightweight statistical machine translation toolkit, to train IBM Models 1-3. More information on LGIZA can be found in [15]. For the pair E-V we use the training scheme $1^5 2^3 3^3$, which [15] suggests for the best performance on that pair. For the pair E-F we use the training scheme $1^5 2^5 3^3$, which was suggested by [13]. In addition, we use MOSES [16] as the phrase-based SMT framework. We measure performance using the BLEU metric [17], which estimates the accuracy of translation output with respect to a reference translation.

² LGIZA is available at: http://code.google.com/p/lgiza/

B. The “Boosting” Performance

We concentrate our evaluation on two aspects. The first is the “boosting” performance on each word alignment model. The second is the “boosting” performance of a higher alignment model, based on its own improvement plus the improvement from the lower alignment models. Table I describes the “boosting” results of our proposed method for the E-V translation system. Similarly, Table II describes the improvements for the pair E-F.

Table I. The “boosting” performance for the English-Vietnamese translation system

IBM Model                        Baseline (BLEU)   Improved (BLEU)   Delta
Model 1                          19.00             19.51             0.51
Model 2                          19.58             20.05             0.47
Model 3                          18.88             19.72             0.84
Model 2 (+Improved Model 1)      19.58             20.31             0.73
Model 3 (+Improved Model 1-2)    18.88             20.12             1.24

Table II. The “boosting” performance for the English-French translation system

IBM Model                        Baseline (BLEU)   Improved (BLEU)   Delta
Model 1                          25.75             26.19             0.44
Model 2                          26.51             26.79             0.28
Model 3                          26.30             26.91             0.61
Model 2 (+Improved Model 1)      26.51             27.05             0.54
Model 3 (+Improved Model 1-2)    26.30             27.44             1.14

1) Improving Word Alignment Quality: From Tables I and II, we can see that adding the Pearson's Chi-square information improves the quality of our systems. When IBM Model 1 is used as the word-alignment component, performance increases by around 0.51% for the pair E-V and 0.44% for the pair E-F. We obtain similar improvements for the other IBM models. We found two interesting things here. First, we obtain a larger improvement for the fertility-based alignment models (0.84% for the first pair and 0.61% for the second pair). Second, we obtain a larger improvement for the pair E-V than for the pair E-F. This is consistent with the previous work in [15], which points out that modelling word alignment for a pair whose grammars are quite different, such as E-V, is more difficult than in other cases. It means that, for that pair, the translation parameter $t$ is considerably less accurate.

2) The Total Performance: This evaluation focuses on the other aspect we concentrate on. For a training scheme such as $1^5 2^3 3^3$ for the pair E-V or $1^5 2^5 3^3$ for the pair E-F, we examine whether simultaneously improving not only Model 3 but also Models 1-2 gives a better improvement than improving Model 3 alone. The experimental result is quite impressive (1.24% vs. 0.84% and 1.14% vs. 0.61%). This firmly confirms our analysis in Section II. Improving lexical translation modelling allows us to “boost” all the “hidden” power of the higher IBM models and therefore deeply improves the quality of statistical phrase-based translation.

V. CONCLUSION

Phrase-based models represent the current state of the art in statistical machine translation, and the phrase learning step in statistical phrase-based translation is extremely important. This work focuses on an approach in which we integrate the Pearson's Chi-square test information into IBM Model 1 to obtain better performance. We directly test our improvement on the overall system to obtain a more reliable assessment. In addition, we point out that improving lexical translation modelling allows us to “boost” all the “hidden” power of the higher IBM models and therefore deeply improves the quality of statistical phrase-based translation.

In summary, we believe that attacking lexical translation is a good way to improve the overall quality of statistical phrase-based translation. Hence, statistical phrase-based systems that combine improved phrase learning with integrated linguistic information could come close to the state of the art.

VI. ACKNOWLEDGEMENT

This work is supported by the project “Studying Methods for Analyzing and Summarizing Opinions from Internet and Building an Application”, which is funded by Vietnam National University of Hanoi. It is also supported by the project KC.01.TN04/11-15.

REFERENCES

[1] F. J. Och and H. Ney, “Discriminative training and maximum entropy models for statistical machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ser. ACL '02. Stroudsburg, PA, USA: Association for Computational Linguistics, 2002, pp. 295–302. [Online]. Available: http://dx.doi.org/10.3115/1073083.1073133

[2] F. J. Och and H. Ney, “The alignment template approach to statistical machine translation,” Comput. Linguist., vol. 30, pp. 19–51, June 2004. [Online]. Available: http://dx.doi.org/10.1162/089120103321337421

[3] P. Koehn, F. J. Och, and D. Marcu, “Statistical phrase-based translation,” in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, ser. NAACL '03. Stroudsburg, PA, USA: Association for Computational Linguistics, 2003, pp. 48–54.

[4] D. Chiang, “A hierarchical phrase-based model for statistical machine translation,” in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ser. ACL '05. Stroudsburg, PA, USA: Association for Computational Linguistics, 2005, pp. 263–270. [Online]. Available: http://dx.doi.org/10.3115/1219840.1219873

[5] G. Sanchis-Trilles and F. Casacuberta, “Log-linear weight optimisation via Bayesian adaptation in statistical machine translation,” in Proceedings of the 23rd International Conference on Computational Linguistics: Posters, ser. COLING '10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 1077–1085. [Online]. Available: http://dl.acm.org/citation.cfm?id=1944566.1944690

[6] J. B. Mariño, R. E. Banchs, J. M. Crego, A. de Gispert, P. Lambert, J. A. R. Fonollosa, and M. R. Costa-jussà, “N-gram-based machine translation,” Comput. Linguist., vol. 32, no. 4, pp. 527–549, Dec. 2006. [Online]. Available: http://dx.doi.org/10.1162/coli.2006.32.4.527

[7] D. Vilar, M. Popović, and H. Ney, “AER: Do we need to “improve” our alignments?” in International Workshop on Spoken Language Translation, Kyoto, Japan, Nov. 2006, pp. 205–212.

[8] C. Hoang, A. Le, and B. Pham, “Refining lexical translation training scheme for improving the quality of statistical phrase-based translation (to appear),” in Proceedings of The 3rd International Symposium on Information and Communication Technology. ACM Digital Library, 2012.

[9] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, “The mathematics of statistical machine translation: parameter estimation,” Comput. Linguist., vol. 19, pp. 263–311, June 1993. [Online]. Available: http://dl.acm.org/citation.cfm?id=972470.972474

[10] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.

[11] H. Wu, H. Wang, and Z.-Y. Liu, “Alignment model adaptation for domain-specific word alignment,” in ACL, 2005.

[12] W. A. Gale and K. W. Church, “Identifying word correspondence in parallel texts,” in Proceedings of the Workshop on Speech and Natural Language, ser. HLT '91. Stroudsburg, PA, USA: Association for Computational Linguistics, 1991, pp. 152–157. [Online]. Available: http://dx.doi.org/10.3115/112405.112428

[13] F. J. Och and H. Ney, “A systematic comparison of various statistical alignment models,” Comput. Linguist., vol. 29, pp. 19–51, March 2003. [Online]. Available: http://dx.doi.org/10.1162/089120103321337421

[14] C. Hoang, A. Le, P. Nguyen, and T. Ho, “Exploiting non-parallel corpora for statistical machine translation,” in Proceedings of The 9th IEEE-RIVF International Conference on Computing and Communication Technologies. IEEE Computer Society, 2012, pp. 97–102.

[15] C. Hoang, A. Le, and B. Pham, “A systematic comparison of various statistical alignment models for statistical English-Vietnamese phrase-based translation (to appear),” in Proceedings of The 4th International Conference on Knowledge and Systems Engineering. IEEE Computer Society, 2012.

[16] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, “Moses: open source toolkit for statistical machine translation,” in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ser. ACL '07. Stroudsburg, PA, USA: Association for Computational Linguistics, 2007, pp. 177–180.

[17] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ser. ACL '02. Stroudsburg, PA, USA: Association for Computational Linguistics, 2002, pp. 311–318.
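
Supplementary sketch for Section II-A. The EM procedure behind Equations (1) to (3) may be easier to follow in code. The following is a minimal illustration, not the LGIZA implementation: the function name `train_model1`, the `NULL` token and the three-sentence toy corpus are all hypothetical, and the expected counts are accumulated over every sentence pair of the corpus before normalizing, which is the usual reading of Equation (3).

```python
from collections import defaultdict

NULL = "<NULL>"  # hypothetical token standing in for the empty word e_0

# Hypothetical toy corpus of (f_1^J, e_1^I) sentence pairs.
toy_corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]


def train_model1(corpus, iterations=5):
    """Estimate t(f|e) with EM; returns a dict keyed by (f, e)."""
    # Prepend the NULL word so that i ranges over 0..I as in Eq. (1).
    pairs = [(f_sent, [NULL] + e_sent) for f_sent, e_sent in corpus]
    f_vocab = {f for f_sent, _ in pairs for f in f_sent}
    # Uniform initialisation of t(f|e).
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f|e), Eq. (2)
        total = defaultdict(float)  # normalisation factor lambda_e, Eq. (3)
        for f_sent, e_sent in pairs:
            for f in f_sent:
                # Denominator of Eq. (2): sum over all e_i in the sentence.
                denom = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    share = t[(f, e)] / denom
                    count[(f, e)] += share
                    total[e] += share
        # M-step: t(f|e) = c(f|e) / lambda_e.
        t = defaultdict(float, {fe: c / total[fe[1]] for fe, c in count.items()})
    return t
```

Looping over word tokens rather than word types plays the role of the two occurrence sums in Equation (2). On the toy corpus, a few iterations are enough for "haus" to overtake "das" as the preferred translation of "house".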
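
Supplementary sketch for Section III-B. The relation between the $X^2$ statistic of Equation (6) and a $\phi^2$ score can be checked numerically on a 2-by-2 table. The sketch below makes two assumptions that the paper does not state: the table is built from sentence-level co-occurrence counts, and $\phi^2$ is the standard 2-by-2 coefficient $(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d))$, which is taken here to be the measure of [12]. All function names are hypothetical.

```python
def contingency(corpus, e_word, f_word):
    """2-by-2 table over sentence pairs.

    a: both words present, b: only e_word, c: only f_word, d: neither.
    """
    a = b = c = d = 0
    for f_sent, e_sent in corpus:
        has_e = e_word in e_sent
        has_f = f_word in f_sent
        if has_e and has_f:
            a += 1
        elif has_e:
            b += 1
        elif has_f:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Pearson's X^2 of Eq. (6): sum of (O - E)^2 / E over the four cells."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    x2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                x2 += (observed[i][j] - expected) ** 2 / expected
    return x2


def phi_square(a, b, c, d):
    """Assumed phi^2 coefficient; lies between 0 and 1."""
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    return 0.0 if denom == 0 else (a * d - b * c) ** 2 / denom
```

Under these assumptions, `phi_square` equals `chi_square` divided by the number of sentence pairs, which is why the score stays between 0 (independence) and 1 (the two words always appear together and never apart).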
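
Supplementary sketch for Equation (7). The modified count only changes how the mass of one word $f_j$ is shared among the candidate words $e_i$ of a sentence: each $t(f_j \mid e_i)$ is rescaled by $\lambda + (1-\lambda)\varphi_{ij}$ before normalizing. The helper below is a hypothetical illustration of that single step; the dictionaries `t` and `phi` and their toy values are assumptions, not data from the experiments.

```python
def weighted_shares(f_word, e_sent, t, phi, lam=0.3):
    """Share of f_word's count credited to each word of e_sent, after Eq. (7).

    t maps (f, e) to t(f|e); phi maps (e, f) to the phi^2 score of the pair.
    """
    scores = [
        t.get((f_word, e), 0.0) * (lam + (1.0 - lam) * phi.get((e, f_word), 0.0))
        for e in e_sent
    ]
    z = sum(scores)  # denominator of Eq. (7)
    return [s / z if z > 0 else 0.0 for s in scores]


# Toy values: "the" and "house" look equally good to plain Model 1,
# but only "house" has a high phi^2 score with "haus".
t = {("haus", "the"): 0.4, ("haus", "house"): 0.4}
phi = {("the", "haus"): 0.05, ("house", "haus"): 0.9}

plain = weighted_shares("haus", ["the", "house"], t, phi, lam=1.0)
mixed = weighted_shares("haus", ["the", "house"], t, phi, lam=0.3)
```

With `lam=1.0` the rescaling factor is 1 for every pair and the shares reduce to the plain Model 1 ratio of Equation (2). With `lam=0.3`, the value the paper reports as a good setting, most of the count moves to the pair with the higher $\phi^2$ score.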