A Study of English-Vietnamese Statistical Machine Translation

A Study of English-Vietnamese Statistical Machine Translation

Hoang Cuong
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi

Supervised by Prof. Pham Bao Son

A thesis submitted in fulfillment of the requirements for the degree of Master of Computer Science

December, 2012

ORIGINALITY STATEMENT

"I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at Vietnam National University, Hanoi or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at the Institute for Infocomm Research, Singapore (I2R), the Vietnam Institute for Advanced Study in Mathematics, Hanoi (VIASM) or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged."

Signed

APPROVAL

I, the supervisor, hereby approve that the Thesis in its current form is ready as the final version at the University of Engineering and Technology, Vietnam National University, Hanoi.

Prof. Pham Bao Son

ABSTRACT

Previous work from the Vietnamese statistical machine translation (SMT) research community has focused on only a few prominent topics of the field, and some of it is based on very simple ideas. We lack fundamental work on the core of the SMT system, which is needed to put statistical English-Vietnamese translation on a solid footing. We also lack large, high-quality bilingual corpora. This work addresses those problems. We present a fundamental study of English-Vietnamese statistical machine translation. We examine the core components of any SMT system: exploiting bilingual corpora, improving word alignment, and improving the quality of phrase translation modeling. We also develop a better evaluation metric for tuning SMT systems. Above all, we aim to provide fundamental and solid work on building and improving the overall performance of English-Vietnamese SMT systems. Although we focus on the English-Vietnamese pair, in every aspect we also apply our methods to the English-French pair and compare the results, to gain a deeper view. We hope this work will serve as a solid basis for other studies on deploying and improving SMT for English-Vietnamese machine translation systems.

Publications:

• Cuong Hoang, Cuong-Anh Le, Thai-Phuong Nguyen, Bao-Tu Ho. Exploiting Non-Parallel Corpora for Statistical Machine Translation. In Proceedings of the International Conference on Information and Communication Technologies (RIVF 2012). [1]
• Cuong Hoang, Cuong-Anh Le, Son-Bao Pham. A Systematic Comparison Between Various Statistical Alignment Models for Statistical English-Vietnamese Phrase-Based Translation. In Proceedings of the 4th International Conference on Knowledge and Systems Engineering (KSE 2012).
• Cuong Hoang, Cuong-Anh Le, Son-Bao Pham. Refining Lexical Translation Training Scheme for Improving the Quality of Statistical Phrase-Based Translation. In Proceedings of the 3rd International Symposium on Information and Communication Technology (SoICT 2012).
• Cuong Hoang, Cuong-Anh Le, Son-Bao Pham. Improving the Quality of Word Alignment by Integrating Pearson's Chi-square Test Information. In Proceedings of the International Conference on Asian Language Processing (IALP 2012).

[1] Best Student Paper Award

ACKNOWLEDGEMENTS

Life is so valuable when we find something that is worth chasing.

First, I would like to express my deep gratitude to my supervisor, Prof. Pham Bao Son, who has been my model of a researcher in Vietnam since I was a freshman. I also want to thank Prof. Le Anh Cuong for his very careful supervision, even though I was not registered with the school as his student. I owe both of them for their patient guidance and support throughout the years.

I would like to give my sincere appreciation to my other unofficial supervisors, Prof. Ho Tu Bao (JAIST, Japan), Prof. Zhang Min (I2R, Singapore) and Prof. Nguyen Xuan Long (Michigan, USA), for their great support. Each of them, from a different perspective, has helped my passion for Computer Science grow stronger.

I sincerely acknowledge Vietnam National University, Hanoi. I want to thank some of my best teachers, Dr. Nguyen Van Vinh, Dr. Nguyen Phuong Thai and Prof. Nguyen Le Minh (JAIST, Japan), for many useful discussions. In particular, I want to give my sincere appreciation to the Statistical Language Processing Laboratory where I work at the Institute for Infocomm Research, Singapore (I2R), for the infrastructure and countless other support.

I would like to thank my Chinese friends here: Yun Huang, Prof. Yue Zhang, Jun SUN and Yanxia Qin. I wish I could work with them for as long as possible. I also want to thank some of my best friends, Dinh Xuan Nhat and Nguyen Dao Thai, for their help with my work.

Finally, this thesis would not have been possible without the support and love of my family: Dad, Mum, and my sister Lan Ni and her small family. Without their support in so many ways, I am sure that I could not have finished my Master's degree in this way!

And to love, my Mimosa ♥!!!
Table of Contents

1 Introduction
1.1 Statistical Machine Translation - An Overview
1.2 Literature Survey on English-Vietnamese Machine Translation
1.3 Our Work
1.4 Thesis Contents
1.4.1 Exploiting non-parallel corpora for statistical machine translation
1.4.2 Systematic comparison between various statistical alignment models for statistical English-Vietnamese phrase-based translation
1.4.3 Improving word alignment models
1.4.4 Improving phrase translation modeling
1.4.5 Developing an evaluation metric for SMT

List of Figures

1.1 The architecture of the translation approach based on source-channel models
1.2 An example of word alignments for the English-French pair
1.3 An example of phrase alignments for the English-German pair
1.4 The architecture of the translation approach based on log-linear models

Chapter 1 Introduction

... directly integrates word-to-word translation parameters into phrase translation weight estimation. This method greatly reduces the effect of the noise reduction phenomenon. We evaluate our approach on the WMT10 French-to-English task and show significant improvements on parallel data sets of different scales. Building on these advantages, we also show a much larger improvement for the upgraded systems trained on these tasks when word alignment quality is improved.

1.4.5 Developing an evaluation metric for SMT

Tuning the parameters of a log-linear model is an important step for finding the best-fitting weights for the model. Modern tuning techniques directly use automatic evaluation metrics as the training criterion for optimizing the system. This makes finding good evaluation metrics essential. Many machine translation evaluation metrics have been proposed since the seminal BLEU metric. They have been found to outperform BLEU, as demonstrated by their better correlation with human judgments. We hope that training machine translation systems using these new metrics can lead directly to
advances in automatic machine translation. To our knowledge, however, there has been no unambiguous report of a state-of-the-art machine translation system being improved over its BLEU-tuned baseline. In this work, we present a novel automatic evaluation metric called SEMI. It is based on a phrasal overlap measurement scheme and, in particular, favours a grading scheme that rewards longer n-gram matches. We evaluate our metric on the WMT10 French-to-English task. We will show that it is the first evaluation metric to lead directly to significant advances in automatic machine translation on parallel data sets of different scales.
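Section 1.4.5 describes evaluation metrics that score a translation by its n-gram matches against a reference. As a rough illustration of that idea, the sketch below computes the clipped n-gram precision used as a building block of BLEU. It is not the SEMI metric, whose definition is not given in this excerpt; the function names `ngrams` and `clipped_precision` and the example sentences are assumptions made for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference, so repeating a matching word does not help.
    """
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        # Candidate is shorter than n tokens: no n-grams to score.
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Hypothetical example sentences, whitespace-tokenized.
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()

for n in (1, 2, 3):
    print(n, clipped_precision(candidate, reference, n))
```

In this example the precision falls as n grows, because longer matches are harder to obtain. A metric that favours longer n-gram matches would weight the higher-order precisions more heavily; how SEMI does so is not specified here.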
... remained a key application in the field of natural language processing (NLP). Statistical machine translation (SMT) is a machine translation approach in which we treat the translation complication... of statistical models.

1.1 Statistical Machine Translation - An Overview

SMT treats translation as a machine learning problem. This means that we apply a learning algorithm to a large body of previously...
