VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

VU HUY HIEN

BOOTSTRAPPING SMT USING UNANNOTATED CORPORA OF THE SOURCE LANGUAGE

Major: Computer Science
Code: 60 48 01

MASTER THESIS OF INFORMATION TECHNOLOGY

Supervisor: PhD Nguyen Phuong Thai

Hanoi - 2014

ORIGINALITY STATEMENT

'I hereby declare that this submission is my own work and, to the best of my knowledge, it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at the University of Engineering and Technology (UET/Coltech) or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UET/Coltech or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.'

Hanoi, December 6th, 2014
Signed

ABSTRACT

Nowadays, statistical machine translation attracts wide interest from researchers thanks to its advantages. However, statistical approaches constantly confront a shortage of parallel and domain-specific corpora. Generating these corpora requires intensive human effort and the availability of experts. Unfortunately, only a few popular languages in the world receive continuous financial support and research interest for the development of machine translation systems; for most remaining languages, very little funding is available. This makes it an immense obstacle to apply statistical approaches to such languages. The purpose of this thesis is to propose a method for utilizing unannotated corpora to address this impediment.

Publications:
Hien Vu Huy, Phuong-Thai Nguyen, Tung-Lam Nguyen and M.L. Nguyen. Bootstrapping Phrase-based Statistical Machine Translation via WSD Integration. In Proceedings of the Sixth International Joint Conference on Natural Language Processing (IJCNLP 2013), pp. 1042-1046.

ACKNOWLEDGEMENTS

First and foremost, I would like to express my deepest gratitude to my supervisor, Dr. Nguyen Phuong Thai, for his patient guidance and continuous support throughout the years. He always appears when I need help, and responds to queries helpfully and promptly. I would like to give my honest appreciation to my best friends in my home town for whatever they did for me. I sincerely acknowledge the Vietnam National University, Hanoi, and especially the QG.12.49 project, for financially supporting my master study. Finally, this thesis would not have been possible without the support and love of my parents. Thank you!
To my family ♥

Table of Contents

1 Introduction
2 Literature review
  2.1 Machine Translation
    2.1.1 The history
    2.1.2 Approaches
    2.1.3 Evaluation
    2.1.4 Moses - an Open Statistical Machine Translation System
  2.2 Word Sense Disambiguation
    2.2.1 Introduction
    2.2.2 WSD tasks
3 Utilizing WSD for SMT
  3.1 Utilizing WSD
    3.1.1 WSD task
    3.1.2 WSD Training Data Generation
    3.1.3 WSD Features
    3.1.4 Integration
  3.2 Using Unlabelled Data
    3.2.1 Basic Algorithm
    3.2.2 A new Algorithm with Sense Distribution Control
    3.2.3 Using Clustering Context Information
4 Evaluation
  4.1 Corpora and Tools
    4.1.1 Corpora
      4.1.1.1 Bilingual Corpus
      4.1.1.2 Monolingual Corpus
    4.1.2 Tools
  4.2 Results
    4.2.1 Extend Labelled Data
    4.2.2 WSD clustering task
    4.2.3 The impact of context on WSD
    4.2.4 Impact of WSD system on SMT translation system
5 Conclusion

List of Figures

1.1 Integrating WSD into phrase-based SMT system
2.1 Integrating WSD into phrase-based SMT system
3.1 Sense distribution of interest

Chapter 4

Evaluation

In this chapter, we describe our experiments.

4.1 Corpora and Tools

4.1.1 Corpora

4.1.1.1 Bilingual Corpus

The bilingual corpus in our experiments is an English-Vietnamese bilingual corpus drawn from several different fields, which includes approximately 135,000 sentence pairs. It is divided into three parts: training, development and test, as shown in Table 4.1. This is the proportion we used for the experiments in (Huy et al., 2013); however, we increased the number of sentences in the development and test sets in the experiment of Table 4.2 to ensure that the system is tuned with a sufficient number of sentences. We used the development set for the MERT tuning of the SMT system in all experiments. In addition to the test set extracted from the bilingual corpus, we used an additional corpus consisting of ambiguous words labelled by evaluators to test on an external domain. The rate of out-of-vocabulary words in the test sets is roughly 2%.

Table 4.1: Statistics for training, testing and developing corpora

                                 Number of sentences   Average length of sentences   Number of words
Training corpus
  English                        131,118               15.9                          2,096,073
  Vietnamese                     131,118               17.0                          2,236,847
Developing corpus
  English                        218                   15.4                          3,367
  Vietnamese                     218                   16.5                          3,609
Testing corpus
  English                        2,000                 17.8                          35,797
  Vietnamese                     2,000                 19.4                          38,814
External-domain testing corpus
  English                        123                   18.7                          2,308

Table 4.2: Statistics for training, testing and developing corpora for using clustering context information

                                 Number of sentences   Average length of sentences   Number of words
Training corpus
  English                        116,094               17.9                          2,023,018
  Vietnamese                     116,094               17.0                          2,047,563
Developing corpus
  English                        6,829                 20.0                          139,965
  Vietnamese                     6,829                 21.0                          143,796
Testing corpus
  English                        6,829                 20.0                          138,697
  Vietnamese                     6,829                 20.0                          141,814

4.1.1.2 Monolingual Corpus

We use the British National Corpus (BNC) to exploit unannotated information for extending labelled data and clustering text. The BNC is a 100-million-word collection of written and spoken samples of the English language from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time. To exploit information from the BNC, we keep only the text data and remove all tags and other information attached in the BNC.
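Stripping the BNC down to plain text is a small preprocessing step. The sketch below is only an illustration of that step, not the exact script used in this thesis; it assumes the XML edition of the corpus, in which word and punctuation tokens appear as <w> and <c> elements, and the file path is hypothetical.

```python
# Minimal sketch: extract plain running text from one BNC XML file.
import xml.etree.ElementTree as ET

def bnc_file_to_text(path):
    """Return the text of one BNC document with all markup removed."""
    tree = ET.parse(path)
    tokens = []
    for elem in tree.iter():
        tag = elem.tag.rsplit("}", 1)[-1]    # drop any XML namespace prefix
        if tag in ("w", "c") and elem.text:  # <w> word tokens, <c> punctuation
            tokens.append(elem.text.strip())
    return " ".join(tokens)

if __name__ == "__main__":
    # Hypothetical path into an unpacked BNC-XML distribution.
    print(bnc_file_to_text("BNC/Texts/A/A0/A00.xml")[:200])
```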
4.1.2 Tools

We used a word-segmentation program (P. et al., 2003), Moses (Koehn et al., 2007), GIZA++ (Och and Ney, 2003), SRILM (Stolcke, 2002), a rule-based morphological analyser (Pham et al., 2003) and the Natural Language Toolkit (Bird et al., 2009) for segmenting Vietnamese sentences, learning phrase translations, creating word alignments, learning language models, analysing morphology and exploiting the BNC, respectively. For a clustering tool, we use the implementation of Percy Liang from his thesis (https://github.com/percyliang/brown-cluster).

4.2 Results

4.2.1 Extend Labelled Data

WSD Training Statistics

Table 4.3: Statistics for samples and features before extending and after extending

                                               Min      Max       Average
Number of samples before extending             14       29,512    153
Number of samples after extending              18       30,471    851
Number of features before extending            14       258,559   2,637
Number of features after extending             110      265,179   11,164
Average number of senses of ambiguous words             211       3.94
Percentage of utilizing BNC                    1.5%     94.5%     11.94%

According to the statistics in Table 4.3, the number of samples and features increased over fivefold; however, these increases are not balanced. After extending, the number of samples in the minimum and maximum cases increased by 4 and 959 to 18 and 30,471 respectively, whereas the average number increased by 698 to 851. The remaining numbers in Table 4.3 also point out the imbalance in the percentage of utilizing BNC and in the number of features in the minimum, maximum and average cases.

Table 4.4 shows that the imbalance also appears when extending the senses of one word. After extending, the quantity of the sense "sở thích" remains unchanged, while the quantities of other senses such as "lợi ích" or "lãi" increase rapidly or slightly. The imbalance in Table 4.3 and Table 4.4 can be explained by the quality of our training data: it is not a big corpus and cannot cover all senses of one word or all words in the corpus, so only high-frequency words and senses are extended.

Table 4.4: Expansion result with the word interest

                               Labeled Data            Labeled and Extended Data
Sense                          Quantity   Rate (%)     Quantity   Rate (%)
tiền lãi (earnings)            164        24.44%       196        7.86%
quan tâm (regard)              108        16.10%       538        21.57%
mối quan tâm (concentration)   34         5.07%        36         1.44%
sở thích (hobby)               6          0.89%        6          0.24%
lợi nhuận (profit)             30         4.47%        81         3.25%
quyền lợi (right)              44         6.56%        219        8.78%
lãi suất (surplus)             21         3.13%        104        4.17%
lợi ích (benefit)              129        19.23%       643        25.78%
hứng thú (pleasant)            12         1.79%        59         2.37%
quan tâm (attention)           26         3.87%        129        5.17%
lãi (gain)                     97         14.44%       483        19.37%
Total                          671                     2,494
Accuracy                       35%                     51%

Kullback-Leibler distance: 0.17682
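The Kullback-Leibler distance quoted under Table 4.4 presumably measures how far the sense distribution after extension drifted from the distribution in the original labelled data. Taking P as the labelled-data distribution over senses and Q as the distribution after extension (the direction is an assumption; the table does not state it), the quantity has the usual form

D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{s} P(s) \log \frac{P(s)}{Q(s)}

where the sum runs over the eleven senses of interest listed in the table, and the reported value of 0.17682 quantifies this drift.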
Translation Results

Clearly, the ambiguous words in the examples of Tables 4.5 and 4.6 were translated precisely into the target language when utilizing WSD and the BNC. In the first example, in Table 4.5, the word "hard" in "hard water" is translated to "cứng" (a type of water), which is more accurate than "chăm_chỉ" (a personality) and "khó" (a feature of things). The results in the remaining example are similar: the word "maturity" is translated to "trưởng_thành", which is more correct than "hạn" (a deadline) and "đáo hạn" (the time when a bank pays money to investors).

As indicated in Table 4.7, the SMT system that utilizes WSD integration with the expanded information of the BNC corpus achieves higher translation quality. The differences are explicit, with growths of 1.04 and 1.54 BLEU points in comparison with the non-extended WSD-integrated SMT system and the baseline SMT system.

Table 4.5: Example translation of the test for hard

Input: hard water is water that has high mineral content ( in contrast with soft water )
SMT: chăm_chỉ nước nước cao nội_dung khoáng_sản trái với nước mềm
SMT + WSD: khó nước nước có hàm_lượng khoáng_sản cao mềm ngược_lại với nước
SMT + WSD + BNC: nước cứng nước cao hàm_lượng khoáng_sản trái với mềm nước
REF: nước cứng nước có hàm_lượng khoáng_sản cao ( trái với nước mềm )

In this example: hard is translated to cứng, chăm_chỉ or khó; water is translated to nước; high is translated to cao; content is translated to nội_dung or hàm_lượng; mineral is translated to khoáng_sản; in contrast is translated to trái or ngược_lại; soft is translated to mềm; with is translated to với.

Table 4.6: Example translation of the test for maturity

Input: sexual maturity , the stage when an organism can reproduce , though it is distinct from adulthood
SMT: tình_dục hạn, sân_khấu tổ_chức có_thể lặp lại , mặc_dù người_ta khác với người_lớn
SMT + WSD: sinh_hoạt tình_dục đáo hạn, tổ_chức có_thể lặp lại , mặc_dù khác với người_lớn
SMT + WSD + BNC: trưởng_thành tình_dục , sân_khấu tổ_chức có_thể lặp lại , mặc_dù điều khác với trưởng_thành
REF: trưởng_thành giới_tính, giai đoạn sinh vật sinh sản, cho dù giai đoạn khác biệt với tuổi trưởng thành

In this example: sexual is translated to tình_dục, giới_tính or sinh_hoạt tình_dục; maturity is translated to hạn, đáo_hạn or trưởng_thành; the and an are both translated to một; stage is translated to sân_khấu, tổ_chức or giai_đoạn; when is translated to khi; can is translated to có_thể; reproduce is translated to lặp_lại or sinh_sản; though is translated to mặc_dù or cho_dù; it is translated to điều_đó; distinct from is translated to khác với or khác biệt; adulthood is translated to người_lớn, trưởng_thành or tuổi trưởng_thành.

Table 4.7: BLEU scores of phrase-based SMT systems with WSD and BNC-extended WSD

        Without WSD   WSD integration   WSD integration with BNC corpus
BLEU    34.93         35.43             36.47
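The BLEU metric (Papineni et al., 2002) behind Table 4.7 (and Table 4.10 below) scores n-gram overlap between system output and a reference translation. The thesis does not state which scoring script produced these numbers, so the snippet below is only an illustration of the computation, using the NLTK toolkit that is already part of the tool chain; the toy sentences are invented, and smoothing is added because they are short.

```python
# Illustration only: corpus-level BLEU with NLTK (not the script used for Table 4.7).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each hypothesis is paired with a list of tokenized reference translations.
references = [[["hard", "water", "is", "water", "with", "high", "mineral", "content"]]]
hypotheses = [["hard", "water", "is", "water", "that", "has", "high", "mineral", "content"]]

smooth = SmoothingFunction().method1          # avoid zero scores on tiny samples
score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(f"BLEU = {100 * score:.2f}")
```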
4.2.2 WSD clustering task

Impact of clustering feature

We set the number of clusters to 1,000, the default number used by Percy Liang in his thesis. In this experiment, we divide the labelled data extracted from the bilingual corpus into two parts, with 90% for the training set and 10% for the test set. Table 4.8 shows the accuracy of WSD for four words when the clustering feature is used. Clearly, the WSD system using the clustering feature reaches higher accuracy than the original WSD system. The reason is that the clustering feature helps the WSD system capture more context information to disambiguate the senses of words.

Table 4.8: Accuracy of the WSD system with and without the clustering feature for the words hard, good, maturity and grow

            Accuracy without the clustering feature   Accuracy with the clustering feature
hard        37.5%                                     41.5%
good        41%                                       45%
maturity    65%                                       65%
grow        56%                                       58%

Translation Results

In Table 4.9, the word late in the phrase late at night was translated to khuya (a literary rendering of muộn for the phrase late at night), which is more precise than muộn (an expression of something occurring after the proper time). As indicated in Table 4.10, the SMT system that utilizes WSD integration with the clustering feature achieves higher translation quality. The differences are explicit, with growths of 0.3 and 0.8 BLEU points in comparison with the non-extended WSD-integrated SMT system and the baseline SMT system.

Table 4.9: Example translation of the test for late

Input: imagine that you are at a party , it 's quite late at night , you are tired and you have to go to work the next day
SMT: tưởng_tượng bạn bữa tiệc , muộn đêm bạn mệt_mỏi bạn đi_làm ngày hôm_sau
SMT + WSD: tưởng_tượng bạn bữa tiệc , muộn vào đêm bạn mệt_mỏi bạn có đi_làm ngày hôm_sau
SMT + WSD + the clustering feature: tưởng_tượng bạn bữa tiệc , khuya , bạn mệt_mỏi bạn có đi_làm ngày hôm_sau
REF: tưởng_tượng bạn dự buổi tiệc , khuya , bạn mệt_mỏi hôm_sau bạn phải đi_làm

In this example: late is translated to muộn or khuya; imagine is translated to tưởng_tượng; you is translated to bạn; quite is translated to rất; tired is translated to mệt_mỏi; party is translated to bữa tiệc or buổi tiệc; go to work is translated to đi_làm; have to is translated to phải; the next day is translated to hôm_sau; night is translated to đêm.

Table 4.10: BLEU scores of phrase-based SMT systems with WSD and WSD with the clustering feature

        Without WSD   WSD integration   WSD integration with the clustering feature
BLEU    34.69         35.19             35.49
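The clustering feature evaluated above comes from the Brown-clustering tool of Percy Liang listed in Section 4.1.2, run over the monolingual text with 1,000 clusters. As a rough sketch of how its output can be turned into extra WSD features (an illustration, not the thesis's actual code): the tool writes a paths file mapping every word to a hierarchical bit-string cluster id, and a fixed-length prefix of that bit string can be emitted as one feature per context word of the ambiguous token. The file name, prefix length and feature naming below are assumptions.

```python
# Sketch: derive cluster-based context features for WSD from brown-cluster output.
# Assumed input format (github.com/percyliang/brown-cluster 'paths' file):
#   <cluster-bit-string> \t <word> \t <count>
def load_clusters(paths_file, prefix_len=8):
    word2cluster = {}
    with open(paths_file, encoding="utf-8") as f:
        for line in f:
            bits, word, _count = line.rstrip("\n").split("\t")
            word2cluster[word] = bits[:prefix_len]   # coarser id via bit-string prefix
    return word2cluster

def cluster_features(context_words, word2cluster):
    """One 'cluster=<id>' feature for every context word with a known cluster."""
    return [f"cluster={word2cluster[w]}" for w in context_words if w in word2cluster]

if __name__ == "__main__":
    clusters = load_clusters("bnc_c1000/paths")                # hypothetical output path
    context = ["the", "bank", "charged", "a", "high", "rate"]  # context of "interest"
    print(cluster_features(context, clusters))
```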
4.2.3 The impact of context on WSD

In many cases, the WSD results are incorrect, which affects the translation output of the SMT system. There are two main reasons for this phenomenon. First, even after the BNC expansion, the training contexts cannot embrace all possible cases. The number of contexts in the BNC is limited; therefore, in some cases, sentences still contain ambiguous words in contexts that are not included in the training set, leading to incorrect results. Second, the system translates sentence by sentence, so the scope of the context of an ambiguous word is limited to a single sentence, while features such as bag-of-words are not in themselves limited to that scope. Therefore, in some situations, context information from the surrounding sentences should be taken into consideration when assigning ambiguous words to labelled groups. Consider this example:

"Yesterday, in the meeting of shareholders, the chairman asked me about the interest. I couldn't tell the truth because the value of my shares is decreasing. After that, they complained about the share price and required an explanation from the chairman."

Input sentence: Yesterday, in the meeting of shareholders, the chairman asked me about the interest.

Taken as an independent sentence, the word "interest" could be understood according to various meanings such as "quan tâm" (a regard), "lợi ích" (a benefit) or "lãi suất" (earnings), and so on, as shown in Table 4.4. In such cases, the WSD system chooses the "lợi ích" meaning, which has the highest probability. Alternatively, if we use the context information of the surrounding sentences, the WSD system will bring out the "lãi suất" meaning. A WSD system can in principle take advantage of this broader scope of context information, whereas the inputs of the SMT system are confined to separate sentences. This fact impacts the quality of the WSD system.

4.2.4 Impact of WSD system on SMT translation system

In the integration of the WSD system into the SMT system, the WSD system occupies a certain weight; thus it only partly affects the quality of the translation. In some cases, the WSD system alone produces more accurate results than the SMT system plus WSD. The major reason is that the translation result of the SMT system depends primarily on the language model, the translation model and so on. Consider this example:

Input sentence: since viet nam is small, it can more easily find market niches growing faster than overall exports.

The word "since" is translated into "kể từ khi" by the SMT system. Meanwhile, for this sentence, the WSD system lays out the probability distribution in Table 4.11, and the exact sense of the word "since" is "bởi". Accordingly, although the WSD model brings out a result that is more appropriate than that of the SMT translation system, due to the effect of the other models in the SMT system, the final result turns out as above.

Table 4.11: Sense distribution of since

Sense                    Probability distribution
từ (from)                4.64315e-07
(because)                0.0124489
kể từ (previously to)    2.65233e-16
kể từ (ago)              9.09226e-16
bởi (due to)             0.986962
từ (henceforth)          0.000588169
(cause)                  5.67261e-08
(when)                   8.76621e-21
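This behaviour follows from the shape of the model: a Moses-style phrase-based decoder chooses the translation that maximizes a weighted log-linear combination of feature functions, so the WSD score is only one vote among the translation model, the language model, reordering and the rest, with weights tuned by MERT. Schematically (the precise feature set and its integration are the subject of Chapter 3, and the form of the WSD feature shown here is only indicative):

\hat{e} = \arg\max_{e} \sum_{i} \lambda_i \, h_i(e, f), \qquad h_{\mathrm{wsd}}(e, f) = \log P_{\mathrm{wsd}}(\mathrm{sense}(e) \mid \mathrm{context}(f))

Under such a combination, even a sense prediction as confident as the 0.986962 assigned to the correct sense of "since" above can be outweighed by the other features.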
Chapter 5

Conclusion

In this thesis, we demonstrated the significant effect of bootstrapped WSD on an SMT system and showed the impact of the clustering feature on WSD. The analyses and experimental results also point out that enhancing the quality of the WSD model contributes to the improvement of translation quality. According to the assessment based on the bilingual data and the open-source Moses SMT system, the translation quality improved by about one BLEU point. Addressing the impact of sparse data on the training set of the WSD model contributes positively to this increase in BLEU score. The BNC corpus is abundant and diverse; therefore, we could take advantage of this source to expand the WSD training data and deal with problems related to sparse data in the training set. The expansion of the training data thereby not only increases the accuracy of the WSD system but also improves the quality of translation. In the future, we would like to continue to experiment with expanding the training set from other sources of information such as the Internet, WordNet and so forth, with the aim of enhancing the quality of machine translation.

Bibliography

Vamshi Ambati, Stephan Vogel, and Jaime Carbonell. Multi-strategy approaches to active learning for statistical machine translation. Proceedings of the 13th Machine Translation Summit, 2011.

Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly, 2009. ISBN 978-0-596-51649-9.

Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM, 1998.

Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, pages 144–152, New York, NY, USA, 1992. ACM. ISBN 0-89791-497-X. doi: 10.1145/130385.130401. URL http://doi.acm.org/10.1145/130385.130401.

Peter F. Brown, Peter V. Desouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.

R. Bruce and J. Wiebe. Word-sense disambiguation using decomposable models. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 139–145, Las Cruces, NM, 1994.

Marine Carpuat and Dekai Wu. Improving statistical machine translation using word sense disambiguation. In EMNLP-CoNLL, pages 61–72, 2007.

Yee Seng Chan, Hwee Tou Ng, and David Chiang. Word sense disambiguation improves statistical machine translation. In ACL, 2007.

Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale, and Mark Johnson. BLLIP 1987-89 WSJ corpus release. Linguistic Data Consortium, Philadelphia, 2000.

T. Chklovski and R. Mihalcea. Building a sense tagged corpus with Open Mind Word Expert. In Proceedings of the ACL 2002 Workshop on WSD: Recent Successes and Future Directions, Philadelphia, PA, 2003.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (2nd ed.). Wiley, 2006. ISBN 978-0-471-24195-9.

Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. IRSTLM: an open source toolkit for handling large scale language models. In Interspeech, pages 1618–1621, 2008.

W.A. Gale, K. Church, and D. Yarowsky. A method for disambiguating word senses in a corpus. In Computers and the Humanities, 26, pages 415–439, 1992.

Ismael García-Varea, Franz Josef Och, Hermann Ney, and Francisco Casacuberta. Refined lexicon models for statistical machine translation using a maximum entropy approach. In ACL, pages 204–211, 2001.

David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. English Gigaword. Linguistic Data Consortium, Philadelphia, 2003.

Zellig Harris. Distributional structure. In: Katz, J. J. (ed.), The Philosophy of Linguistics. New York: Oxford University Press, 1985.

Kenneth Heafield. KenLM: faster and smaller language model queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland, United Kingdom, July 2011. URL http://kheafield.com/professional/avenue/kenlm.pdf.

E.H. Hovy. Toward finely differentiated evaluation metrics for machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 2002.

W. John Hutchins and Harold L. Somers. An Introduction to Machine Translation. London: Academic Press, 1992. ISBN 0-12-362830-X.

Hien Vu Huy, Phuong-Thai Nguyen, Tung-Lam Nguyen, and M.L. Nguyen. Bootstrapping phrase-based statistical machine translation via WSD integration. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 1042–1046, Nagoya, Japan, 2013.

N. Ide and K. Suderman. Integrating linguistic resources: The American National Corpus model. In Proceedings of the 5th Language Resources and Evaluation Conference (LREC), Genoa, Italy, 2006.

Jeremy H. Clear. The British National Corpus. In The Digital Word, pages 163–187, 1993.

Thorsten Joachims. Transductive learning via spectral graph partitioning. In ICML, volume 3, pages 290–297, 2003.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In HLT-NAACL, 2003.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In ACL, 2007.

H. Kucera and W.N. Francis. Computational analysis of present-day American English. 1967.

Cuong Anh Le and Akira Shimazu. High WSD accuracy using naive Bayesian classifier with rich features. In Language, Information and Computation: Proceedings of the 18th Pacific Asia Conference, 8–10 December 2004, Waseda University, Tokyo, Japan, pages 105–114, 2004.

C. Leacock, G. Towell, and E. Voorhees. Corpus-based statistical sense resolution. In Proceedings of the ARPA Workshop on Human Language Technology, pages 260–265, Princeton, NJ, 1993.

John C. Mallery. Thinking about foreign policy: Finding an appropriate role for artificial intelligence computers. 1988.

R. Mihalcea and E. Faruque. SenseLearner: Minimally supervised word sense disambiguation for all words in open text. In Proceedings of the 3rd International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 155–158, Barcelona, Spain, 2004.
Rada Mihalcea. Co-training and self-training for word sense disambiguation. In Proceedings of the Conference on Natural Language Learning, pages 33–40, 2004.

G.A. Miller, C. Leacock, R. Tengi, and R.T. Bunker. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology, pages 303–308, 1993.

Makoto Nagao. A framework of a mechanical translation between Japanese and English by analogy principle. In Proc. of the International NATO Symposium on Artificial and Human Intelligence, pages 173–180, New York, NY, USA, 1984. Elsevier North-Holland, Inc. ISBN 0-444-86545-4.

R. Navigli and P. Velardi. Structural semantic interconnections: A knowledge-based approach to word sense disambiguation. pages 1075–1088, 2005.

Roberto Navigli. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2):10:1–10:69, February 2009. ISSN 0360-0300. doi: 10.1145/1459352.1459355. URL http://doi.acm.org/10.1145/1459352.1459355.

H.T. Ng and H.B. Lee. Integrating multiple knowledge sources to disambiguate word senses: An exemplar-based approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 40–47, Santa Cruz, CA, 1996.

T.H. Ng. Getting serious about word sense disambiguation. In Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What and How?, pages 1–7, Washington, D.C., USA, 1997.

Zheng-Yu Niu, Dong-Hong Ji, and Chew Lim Tan. Word sense disambiguation using label propagation based semi-supervised learning. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 395–402. Association for Computational Linguistics, 2005.

Franz Josef Och. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL '03, pages 160–167, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics. doi: 10.3115/1075096.1075117. URL http://dx.doi.org/10.3115/1075096.1075117.

Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, 2003.

Nguyen T. P., Nguyen V. V., and Le A. C. Vietnamese word segmentation using hidden Markov model. In Proceedings of the International Workshop for Computer, Information, and Communication Technologies in Korea and Vietnam, 2003.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL http://dx.doi.org/10.3115/1073083.1073135.

N.H. Pham, Nguyen L. M., Le A. C., Nguyen P. T., and Nguyen V. V. LVT: An English-Vietnamese machine translation system. In Proceedings of FAIR, 2003.

Thanh Phong Pham, Hwee Tou Ng, and Wee Sun Lee. Word sense disambiguation with semi-supervised learning. In Proceedings, The Twentieth National Conference on Artificial Intelligence and the Seventeenth Innovative Applications of Artificial Intelligence Conference, July 9–13, 2005, Pittsburgh, Pennsylvania, USA, pages 1093–1098, 2005.

E. Pianta, L. Bentivogli, and C. Girardi. MultiWordNet: Developing an aligned multilingual database. In Proceedings of the 1st International Conference on Global WordNet, pages 21–25, Mysore, India, 2002.

John R. Pierce and John B. Carroll. Language and machines: computers in translation and linguistics. ALPAC report, National Academy of Sciences, National Research Council, Washington, DC, 1966.
Andreas Stolcke. SRILM - an extensible language modeling toolkit. In Proc. Intl. Conf. on Spoken Language Processing, Denver, Colorado, pages 901–904, 2002.

J. Véronis. HyperLex: Lexical cartography for information retrieval. pages 223–252, 2004.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2, pages 836–841. Association for Computational Linguistics, 1996.

W. Weaver. Translation (1949). Machine Translation of Languages, MIT Press, Cambridge, MA, 1955.

Mei Yang and Katrin Kirchhoff. Contextual modeling for meeting translation using unsupervised word sense disambiguation. In COLING, pages 1227–1235, 2010.

Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.

Copyright © 2014 by Vu Huy Hien.
Printed and bound by Vu Huy Hien.
