Automatically creating multilingual lexical resources

by Khang Nhut Lam
MSc., Ewha Womans University, Seoul, Korea, 2009

A dissertation submitted to the Graduate Faculty of the University of Colorado at Colorado Springs in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Computer Science
2015

© Copyright by Khang Nhut Lam 2015. All Rights Reserved.

This thesis for the Ph.D. degree in Computer Science by Khang Nhut Lam has been approved for the Department of Computer Science by
Dr. Jugal Kalita, Chair
Dr. Edward Chow
Dr. Rory Lewis
Dr. Martha Palmer
Dr. Jia Rao

Khang Nhut Lam, Ph.D., Computer Science
Title: Automatically creating multilingual lexical resources
Supervisor: Dr. Jugal Kalita

Bilingual dictionaries and WordNets are important resources for natural language processing tasks such as information retrieval and machine translation. However, lexical resources are usually available only for resource-rich languages, e.g., English, Spanish and French. Resource-poor languages, e.g., Cherokee, Dimasa and Karbi, have very few resources with limited numbers of entries. Current approaches for creating new lexical resources work with languages that have good quality resources already available in sufficient quantities. This thesis proposes novel approaches to generate bilingual dictionaries, translate phrases and construct WordNets for several natural languages, including some languages in the UNESCO Endangered Languages List (viz., Cherokee, Cheyenne, Dimasa and Karbi), by bootstrapping from just a few existing resources and publicly available resources in resource-rich languages such as the Princeton WordNet, the Japanese WordNet and the Microsoft Translator. This thesis not only constructs new lexical resources but also supports communities using languages with limited resources.

Dedication

I would like to express deep love to my parents. Without your love and your support, I could not have brought this dissertation this far. Thank you for everything you have done for me. Even though I have been alone half a world away from you, I have never felt lonely because you are always with me.

I would like to thank the Le family, and my best friends, Vicky Collier and Janet Gardner, who are always on my side, take care of me as their daughter, and have given me a real family during the time I have been in the United States.

Acknowledgments

I would like to take this opportunity to express my warm thanks to my advisor, Dr. Jugal Kalita, who has supported and guided me with patience and encouragement, and has provided me with a professional environment for studying and doing research since my first day in the PhD Program at UCCS.

I also owe my gratitude to my dissertation committee members, Dr. Edward Chow, Dr. Jia Rao, Dr. Martha Palmer and Dr. Rory Lewis, for their enthusiasm, insightful comments, constructive suggestions and critical evaluations of my research.

Special thanks are due to Feras Al Tarouti, my lab mate and co-author, for his stimulating contributions and discussions, help in programming and evaluating results, and excellent company during stressful days when we worked together to meet crucial paper deadlines. Many thanks to all of my lab mates for their help, questions, suggestions and all the fun we have had in our lab.

Many thanks to Dubari Borah, Francisco Torres Reyes, Conner Clark, Tri Doan, Morningkeey Phangcho, Dharamsing Teron, Navanath Saharia, Arnab Phonglosa, Faris Kateb, Abhijit Bendale, Lalit Prithviraj Jain and Svati Dhamija for helping me evaluate lexical resources. I also thank all my friends in the Xobdo, Microsoft and PanLex projects who provided me with dictionaries and translations.

This research was supported by Vietnam International Education Development, Ministry of Education and Training of Vietnam (VIED). I gratefully acknowledge VIED's financial support. I also thank the Graduate School at UCCS for fellowships and the Computer Science department at UCCS for my grader and teaching jobs.
TABLE OF CONTENTS

1 Introduction
1.1 Overview
1.2 Types of lexical resources
1.3 Research focus and contribution
1.4 Intellectual and scientific merit
1.5 Broader impact
1.6 Organization of the dissertation

2 Related work
2.1 Introduction
2.2 Structure of lexical resources
2.2.1 Structure of a bilingual dictionary
2.2.2 Structure of the Princeton WordNet
2.3 Language codes
2.4 Creating new bilingual dictionaries
2.4.1 Generating bilingual dictionaries using one intermediate language
2.4.2 Generating bilingual dictionaries using many intermediate languages
2.4.3 Extracting bilingual dictionaries from corpora
2.4.4 Generating dictionaries from multiple linguistic resources
2.5 Generating translations for phrases
2.6 Constructing WordNets
2.6.1 Constructing WordNets using the merge approach
2.6.2 Constructing WordNets using the expand approach
2.7 Chapter summary

3 Input resources and evaluation methods
3.1 Introduction
3.2 Input bilingual dictionaries
3.3 Input WordNets
3.4 Evaluation method
3.5 Chapter summary

4 Creating reverse bilingual dictionaries
4.1 Introduction
4.2 Related work
4.3 Proposed approaches
4.3.1 Direct reversal (DR)
4.3.2 Direct reversal with distance (DRwD)
4.3.3 Direct reversal with similarity (DRwS)
4.3.4 Direct reversal with similarity and distance (DRwSD)
4.4 Experimental results
4.4.1 Preprocessing entries in the existing dictionaries
4.4.2 Results
4.5 Future work
4.6 Chapter summary

5 Creating new bilingual dictionaries
5.1 Introduction
5.2 Related work
5.3 Proposed approaches
5.3.1 Direct translation approach (DT)
5.3.2 Using publicly available WordNets as intermediate resources (IW)
5.4 Experimental results
5.4.1 Results and human evaluation
5.4.2 Comparing with existing approaches
5.4.3 Comparing with Google Translator
5.5 Future work
5.6 Chapter summary

6 Creating WordNets
6.1 Introduction
6.2 Related work
6.3 Proposed approaches
6.3.1 Generating synset candidates
6.3.1.1 The direct translation (DT) approach
6.3.1.2 Approach using intermediate WordNets (IW)
6.3.1.3 Approach using intermediate WordNets and a dictionary (IWND)
6.3.2 Ranking method
6.3.3 Selecting candidates based on ranks
6.4 Experiments
6.5 Future work
6.6 Chapter summary

7 Generating translations for phrases using a bilingual dictionary and n-gram data
7.1 Introduction
7.2 Vietnamese morphology
7.3 Related work
7.4 Proposed approach
7.4.1 Segmenting Vietnamese words
7.4.2 Filtering segmentations
7.4.3 Generating ad hoc translations
7.4.4 Selecting the best ad hoc translation
7.4.5 Finding and ranking translation candidates
7.5 Experiments
7.6 Future work
7.7 Conclusion

8 Conclusions

References
Appendix A: Reverse dictionaries generated
Appendix B: New bilingual dictionaries created
Appendix C: New WordNets constructed

REFERENCES

[25] Rudi L. Cilibrasi and Paul M. B. Vitanyi. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering, 19(3):370–383, 2007.
[26] Marin Dantchev. WordNet 2.1 overview. ECS 595/SI 661 & 761/LING 541 Natural Language Processing, Fall.
[27] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.
[28] Ho Ngoc Duc and Nguyen Thi Thao. Towards building a WordNet for Vietnamese. In Proceedings of the 1st International Workshop for Computer, Information and Communication Technologies, Hanoi, Vietnam, 2003.
[29] Christiane Fellbaum. WordNet. Blackwell Publishing Ltd, United Kingdom, 1999.
[30] Bryan A. Garner and Henry Campbell Black. Black’s law dictionary. Thomson/West, St. Paul, MN, USA, 2004.
[31] Chooi-Ling Goh, Masayuki Asahara, and Yuji Matsumoto. Building a Japanese-Chinese dictionary using Kanji/Hanzi conversion. In Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP), pages 670–681, Jeju Island, Korea, October, 2005.
[32] Tim Gollins and Mark Sanderson. Improving cross language information retrieval with triangulated translation. In Proceedings of the 24th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 90–95, New York, USA, September, 2001.
[33] Gunawan and Andy Saputra. Building synsets for Indonesian WordNet with monolingual lexical resources. In Proceedings of the International Conference on Asian Language Processing (IALP), pages 297–300, Harbin, China, December, 2010.
[34] Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. Learning bilingual lexicons from monolingual corpora. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), volume 2008, pages 771–779, Ohio, USA, June, 2008.
[35] Le Manh Hai, Asanee Kawtrakul, and Yuen Poovorawan. Phrasal transfer model for Vietnamese-English machine translation. In Proceedings of the conference on Natural Language Processing Pacific Rim Symposium (NLPRS).
[36] Patrick Hanks, Flavia Hodges, and David L. Gold. A dictionary of surnames, volume 92. Oxford University Press, Oxford, United Kingdom, 1988.
[37] William L. Hays and Robert L. Winkler. Statistics: Probability, Inference and Decision. Holt, Rinehart and Winston Inc., New York, USA, 1971.
[38] Enikő Héja. Dictionary building based on parallel corpora and word alignment. In Proceedings of the XIV Euralex International Congress, pages 6–10, Leeuwarden/Ljouwert, Netherlands, July, 2010.
[39] Graeme Hirst and David St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. WordNet: An electronic lexical database, pages 305–332, 1998.
[40] Iftikar Hussain, Navanath Saharia, and Utpal Sharma. Development of Assamese WordNet. Machine Intelligence: Recent Advances, Narosa Publishing House, Editors B. Nath, U. Sharma and D. K. Bhattacharyya, ISBN 978-81-8487-140-1, 2011.
[41] William Peter Hyde. A new Vietnamese–English dictionary. Dunwoody Press, Hyattsville, Maryland, USA, 2008.
[42] Satoru Ikehara, Masahiro Miyazaki, Akio Yokoo, Satoshi Shirai, Hiromi Nakaiwa, Kentaro Ogura, Yoshifumi Ooyama, and Yoshihiko Hayashi. Nihongo Goi Taikei - A Japanese lexicon. Iwanami Shoten 5, 1997.
[43] Hitoshi Isahara, Francis Bond, Kiyotaka Uchimoto, Masao Utiyama, and Kyoko Kanzaki. Development of the Japanese WordNet. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), pages 2420–2423, Marrakech, Morocco, May, 2008.
[44] Ray Jackendoff. The architecture of the linguistic-spatial interface. Language and space, pages 1–30, 1996.
[45] Hiroyuki Kaji, Shin’ichi Tamamura, and Dashtseren Erdenebat. Automatic construction of a Japanese-Chinese dictionary via English. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), volume 2008, pages 699–706, Marrakech, Morocco, May, 2008.
[46] Hiroyuki Kaji and Mariko Watanabe. Automatic construction of Japanese WordNet. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), Genoa, Italy, May, 2006.
[47] Adam Kilgarriff. Thesauruses for natural language processing. In Proceedings of IEEE International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE), pages 5–13, Beijing, China, October, 2003.
[48] Adam Kilgarriff and David Tugwell. WASP-Bench: an MT lexicographers’ workstation supporting state-of-the-art lexical disambiguation. In Proceedings of the 8th Machine Translation Summit, pages 187–190, Santiago de Compostela, Spain, September, 2001.
[49] Kevin Knight and Vasileios Hatzivassiloglou. Two-level, many-paths generation. In Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics, pages 252–260, Massachusetts, USA, June, 1995.
[50] Kevin Knight and Steve K. Luk. Building a large-scale knowledge base for machine translation. In Proceedings of the 12th National Conference on Artificial Intelligence, pages 773–778, Seattle, Washington, August, 1994.
[51] Gary G. Koch. Intraclass correlation coefficient. Encyclopedia of statistical sciences, 1982.
[52] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation.
[53] Philipp Koehn and Kevin Knight. Learning a translation lexicon from monolingual corpora. In Proceedings of the Workshop on Unsupervised Lexical Acquisition, volume 9, pages 9–16, Philadelphia, USA, July, 2002. Association for Computational Linguistics (ACL).
[54] Philipp Koehn and Kevin Knight. Feature-rich statistical translation of noun phrases. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics (ACL), pages 311–318, Sapporo, Japan, 2003.
[55] Grzegorz Kondrak and Bonnie Dorr. Identification of confusable drug names: A new approach and evaluation methodology. In Proceedings of the 20th International Conference on Computational Linguistics (COLING), volume 26, pages 952–958, Switzerland, 2000.
[56] Hideki Kozima and Teiji Furugori. Similarity between words computed by spreading activation on an English dictionary. In Proceedings of the 6th conference on European chapter of the Association for Computational Linguistics, pages 232–239, 1993.
[57] Khang Nhut Lam and Jugal Kalita. Creating reverse bilingual dictionaries. In Proceedings of the International Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL-HLT), pages 524–528, Atlanta, USA, June, 2013.
[58] Sidney Landau. Dictionaries: the art and craft of lexicography. Cambridge University Press, Cambridge, United Kingdom, 2001.
[59] Tuong Le, Trong Hai Duong, Bay Vo, and Sanggil Kang. Consensus for collaborative ontology-based Vietnamese WordNet building. Intelligent Information and Database Systems, pages 499–508, 2013.
[60] Dhanon Leenoi, Thepchai Supnithi, and Wirote Aroonmanakun. Building Thai WordNet with a bi-directional translation method. In Proceedings of the International Conference on Asian Language Processing (IALP), pages 48–52, Singapore, December, 2009.
[61] Dhanon Leenoi, Thepchai Supnithi, and Wirote Aroonmanakun. Building a gold standard for Thai WordNet. In Proceedings of the International Conference on Asian Language Processing 2008 (IALP), pages 78–82, Chiang Mai, Thailand, November, 2008.
[62] Omer Levy and Yoav Goldberg. Linguistic regularities in sparse and explicit word representations. In Proceedings of the 18th conference on Computational Natural Language Learning, pages 171–180, Baltimore, Maryland, USA, June, 2014.
[63] Paul M. Lewis, Gary F. Simons, and Charles D. Fennig (eds.). Ethnologue: Languages of the world, 7th edition. Dallas, Texas: SIL International.
[64] William D. Lewis. Measuring conceptual distance using WordNet: the design of a metric for measuring semantic similarity. In The University of Arizona working papers in linguistics, 2002.
[65] Hang Li, Yunbo Cao, and Cong Li. Using bilingual web data to mine and rank translations. IEEE Intelligent Systems.
[66] Krister Linden and Lauri Carlson. FinnWordNet-WordNet på finska via översättning. LexicoNordica, 17:119–140, 2010.
[67] Nikola Ljubešić and Darja Fišer. Bootstrapping bilingual lexicons from comparable corpora for closely related languages. In Proceedings of the 14th International Conference on Text, Speech and Dialogue (TSD), pages 91–98, Plzeň, Czech Republic, September, 2011.
[68] José B. Marino, Rafael E. Banchs, Josep M. Crego, Adrià de Gispert, Patrik Lambert, José A. R. Fonollosa, and Marta R. Costa-Jussà. N-gram-based machine translation. Computational Linguistics.
[69] Mausam, Stephen Soderland, Oren Etzioni, Daniel S. Weld, Kobi Reiter, Michael Skinner, Marcus Sammer, and Jeff Bilmes. Panlingual lexical translation via probabilistic inference. Artificial Intelligence, 174:619–637, 2010.
[70] Dan Melamed. Empirical methods for MT lexicon constructions. Machine Translation and the Information Soup, Springer-Verlag, 1998.
[71] I. Dan Melamed. Bitext maps and alignment via pattern recognition. Computational Linguistics, 25(10):107–130, 1999.
[72] I. Dan Melamed. Models of translational equivalence among words. Computational Linguistics, 26(2):221–249, 2000.
[73] Merriam-Webster. Merriam-Webster’s dictionary of synonyms. Merriam-Webster, Springfield, Mass., USA, 1984.
[74] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[75] G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
[76] M. Rube Molina. Cognate Linguistics (Cognates Book 1), Kindle Ed. Cognates.org, 2011.
[77] Mortaza Montazery and Heshaam Faili. Automatic Persian WordNet construction. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 846–850, Beijing, China, August, 2010.
[78] Jorge Morato, Miguel Angel Marzal, Juan Lloréns, and José Moreiro. WordNet applications. In Proceedings of the 2nd Global WordNet Conference, pages 270–278, Brno, Czech Republic, January, 2004.
[79] Jane Morris and Graeme Hirst. Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17(1):21–48, 1991.
[80] Preslav Nakov and Hwee Tou Ng. Improved statistical machine translation for resource-poor languages using related resource-rich languages. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 3, pages 1358–1367, Singapore, August, 2009. Association for Computational Linguistics (ACL).
[81] Luka Nerima and Eric Wehrli. Generating bilingual dictionaries by transitivity. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), pages 2584–2587, Marrakech, Morocco, May, 2008.
[82] Binh N. Ngo. The Vietnamese language learning framework. Journal of Southeast Asian Language Teaching, 10:1–24, 2001.
[83] Q. H. Ngo, W. Winiwarter, and Bartholomaus Wloka. EVBCorpus - a multi-layer English-Vietnamese bilingual corpus for studying tasks in comparative linguistics. In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP), pages 1–92, Nagoya, Japan, 2013.
[84] Sandro Nielsen. A functional approach to user guides. Dictionaries: Journal of the Dictionary Society of North America, 27(1):1–20, 2006.
[85] Rolf Noyer. Vietnamese ‘morphology’ and the definition of word. In University of Pennsylvania Working Papers in Linguistics, Ed. Alexis Dimitriadis, Hikyoung Lee, Christine Moisset, and Alexander Williams, pages 65–89, Philadelphia: U. Penn Department of Linguistics, 1998.
[86] Franz Josef Och and Hermann Ney. Giza++: Training of statistical translation models. 2000.
[87] Kumiko Ohmori and Masanobu Higashida. Extracting bilingual collocations from non-aligned parallel corpora. In Proceedings of the 8th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI99), pages 88–97, 1999.
[88] Antoni Oliver and Salvador Climent. Parallel corpora for wordnet construction: machine translation vs. automatic sense tagging. In Proceedings of the 13th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), pages 110–121, 2012.
[89] Noam Ordan and Shuly Wintner. Hebrew WordNet: a test case of aligning lexical databases across languages. International Journal of Translation, 19(1):39–58, 2007.
[90] Pablo G. Otero and Jose R. P. Campos. Automatic generation of bilingual dictionaries using intermediate languages and comparable corpora. In Proceedings of the 11th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), pages 473–483, Iași, Romania, March, 2010.
[91] Kyonghee Paik, Francis Bond, and Shirai Satoshi. Using multiple pivots to align Korean and Japanese lexical resources. In Proceedings of the 6th Natural Language Processing Pacific Rim Symposium (NLPRS), pages 63–70, Tokyo, Japan, November, 2001.
[92] Kyonghee Paik, Satoshi Shirai, and Hiromi Nakaiwa. Automatic construction of a transfer dictionary considering directionality. In Proceedings of the Workshop on Multilingual Linguistic Resources, pages 31–38, Geneva, Switzerland, August, 2004. Association for Computational Linguistics (ACL).
[93] Pavel Pecina. A machine learning approach to multiword expression extraction. In Proceedings of the Workshop Towards a Shared Task for Multiword Expressions, pages 54–61, Marrakech, Morocco, 2008. Conference on Language Resources and Evaluation (LREC).
[94] Martin F. Porter. An algorithm for suffix stripping. Program: Electronic library and information systems, 40(3):211–218, 2006.
[95] Kergrit Robkop, Sareewan Thoongsup, Thatsanee Charoenporn, Virach Sornlertlamvanich, and Hitoshi Isahara. WNMS: connecting the distributed WordNet in the case of Asian WordNet. In Proceedings of the 5th International Conference of the Global WordNet Association (GWC), Mumbai, India, 2010.
[96] Horacio Rodriguez, Salvador Climent, Piek Vossen, Laura Bloksma, Wim Peters, Antonietta Alonge, Francesca Bertagna, and Adriana Roventini. The top-down strategy for building EuroWordNet: Vocabulary coverage, base concepts and top ontology. Computers and the Humanities, 32(2-3), 1998.
[97] Peter Mark Roget. Roget’s Thesaurus of English Words and Phrases. TY Crowell Company, USA, 1911.
[98] Peter Mark Roget. Roget’s International Thesaurus, 3/E. Oxford and IBH Publishing, 2008.
[99] Sheldon M. Ross. Introductory statistics. Academic Press, 2010.
[100] Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. Multiword expressions: A pain in the neck for NLP. In Computational Linguistics and Intelligent Text Processing, pages 1–15, 2002.
[101] Benoit Sagot and Darja Fišer. Building a free French WordNet from multilingual resources. In Proceedings of Ontolex, Marrakech, Morocco, 2008.
[102] Antonio Sanfilippo and Ralf Steinberger. Automatic selection and ranking of translation candidates. In Proceedings of the 7th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI), volume 97, pages 200–207, Santa Fe, USA, 1997.
[103] Patanakul Sathapornrungkij and Charnyote Pluempitiwiriyawej. Construction of Thai WordNet lexical database from machine readable dictionaries. In Proceedings of the 10th Machine Translation Summit, pages 78–82, Phuket, Thailand, 2005.
[104] Martin Saveski and Igor Trajkovski. Automatic construction of wordnets by using machine translation and language modeling. In Proceedings of the 13th International MultiConference Information Society, volume C, Ljubljana, Slovenia, 2010.
[105] Thomas Schmidt and Kai Wörner. Multilingual corpora and multilingual corpus analysis. John Benjamins Publishing, Amsterdam, Netherlands, 2012.
[106] Li Shao and Hwee Tou Ng. Mining new word translations from comparable corpora. In Proceedings of the 20th International Conference on Computational Linguistics, pages 618–624, Geneva, Switzerland, August, 2004.
[107] Ryan Shaw, Anindya Datta, Debra VanderMeer, and Kaushik Dutta. Building a scalable database-driven reverse dictionary. IEEE Transactions on Knowledge and Data Engineering, 25(3):528–540, 2013.
[108] Satoshi Shirai and Kazuhide Yamamoto. Linking English words in two bilingual dictionaries to generate another language pair dictionary. In Proceedings of the 19th International Conference on Computer Processing of Oriental Languages (ICCPOL), pages 174–179, Seoul, Korea, May, 2001.
[109] Satoshi Shirai, Kazuhide Yamamoto, and Kyonghee Paik. Overlapping constraints of two step selection to generate a transfer dictionary. In Proceedings of the International Conference on Stochastic Programming (ICSP), Berlin, Germany, August, 2001.
[110] Jonas Sjöbergh. Creating a free digital Japanese-Swedish lexicon. In Proceedings of the Conference Pacific Association for Computational Linguistics (PACLING), pages 296–300, Tokyo, Japan, August, 2005.
[111] Dagobert Soergel. Indexing languages and thesauri: construction and maintenance. Melville Pub. Co., New York, USA, 1974.
[112] Sofia Stamou, Kemal Oflazer, Karel Pala, Dimitris Christoudoulakis, Dan Cristea, Dan Tufis, Svetla Koeva, George Totkov, Dominique Dutoit, and Maria Grigoriadou. Balkanet: A multilingual semantic network for the Balkan languages. In Proceedings of the International Wordnet Conference, pages 21–25, Mysore, India, 2002.
[113] Thomas Lathrop Stedman. Stedman’s Medical Dictionary, Volume 1, ed. 28. Lippincott Williams & Wilkins, Philadelphia, USA, 2006.
[114] Kumiko Tanaka and Hideya Iwasaki. Extraction of lexical translations from non-aligned corpora. In Proceedings of the 16th Conference on Computational Linguistics, volume 2, pages 580–585, Netherlands, 1996.
[115] Kumiko Tanaka and Kyoji Umemura. Construction of a bilingual dictionary intermediated by a third language. In Proceedings of the 15th Conference on Computational Linguistics (COLING), volume 1, pages 297–303, Kyoto, Japan, August, 1994.
[116] Takaaki Tanaka. Measuring the similarity between compound nouns in different languages using non-parallel corpora. In Proceedings of the 19th International Conference on Computational Linguistics (COLING), volume 1, pages 1–7, Taipei, Taiwan, August, 2002.
[117] Takaaki Tanaka and Timothy Baldwin. Translation selection for Japanese-English noun-noun compounds. In Proceedings of Machine Translation Summit IX, pages 378–385, Marrakech, Morocco, 2003.
[118] Laurence C. Thompson. The problem of the word in Vietnamese. Word: Journal of the International Linguistic Association, 19(1):39–52, 2003.
[119] Istvan Varga and Shoichi Yokoyama. Japanese-Hungarian dictionary generation using ontology resources. In Proceedings of the Machine Translation Summit XI, pages 483–490, Copenhagen, Denmark, September, 2007.
[120] Istvan Varga and Shoichi Yokoyama. Bilingual dictionary generation for low-resourced language pairs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 2, pages 862–870, Singapore, August, 2009.
[121] Sornlertlamvanich Virach, Thatsanee Charoenporn, Chumpol Mokarat, Hitoshi Isahara, Hammam Riza, and Purev Jaimai. Synset assignment for bi-lingual dictionary with limited resource. In Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP), 2008.
[122] Piek Vossen. A multilingual database with lexical semantic networks. Kluwer Academic Publishers, Dordrecht, Netherlands, 1998.
[123] Piek Vossen. Building Wordnets. http://www.globalwordnet.org/gwa/BuildingWordnets.ppt, 2005.
[124] Zhibiao Wu and Martha Palmer. Verbs semantics and lexical selection. In Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics (ACL), pages 133–138, New Mexico, USA, June, 1994.
[125] Federico Zanettin. Bilingual comparable corpora and the training of translators. Meta, 43(4):616–630, 1998.
[126] Yujie Zhang, Qing Ma, and Hitoshi Isahara. Automatic acquisition of a Japanese-Chinese bilingual lexicon using English as an intermediary. In Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering (NLPKE), pages 471–476, 2003.
[127] Mădălina Zurini. Word sense disambiguation using aggregated similarity based on WordNet graph representation. Informatica Economica.

Appendix A: Reverse dictionaries generated

Table 1: Sample entries in the English-Assamese reverse dictionary
Table 2: Sample entries in the English-Vietnamese reverse dictionary
Table 3: Sample entries in the English-Dimasa reverse dictionary
Table 4: Sample entries in the English-Karbi reverse dictionary

Appendix B: New bilingual dictionaries created
Table 5: Sample entries in the Assamese-Vietnamese and Assamese-Arabic dictionaries
Table 6: Sample entries in the Assamese-German and Assamese-Spanish dictionaries
Table 7: Sample entries in the Arabic-German and Arabic-Spanish dictionaries
Table 8: Sample entries in the Vietnamese-German and Vietnamese-Spanish dictionaries

Appendix C: New WordNets constructed

Table 9: Sample entries in the Assamese WordNet synsets
Table 10: Sample entries in the Arabic WordNet synsets
Table 11: Sample entries in the Vietnamese WordNet synsets

... lexical resources for creating a new one may cause ambiguity in the lexical resource created. The approaches we propose have the potential not only to create new lexical resources from just a few existing ones, which reduces the cost and time consumed, but also to improve the quality of the lexical resources we create. Briefly, to be able to automatically create many lexical resources for languages,...

entries. WordNets are among the most heavily used lexical resources. We develop algorithms and models to automatically build WordNets for languages using available resources, but also by bootstrapping with resources we create ourselves. If we can create a number of WordNets of acceptable quality, we believe it will contribute significantly to the repository of resources for languages that lack them. A problem...

translations for phrases that occur within these resources or even outside. We believe that using approaches that are not language-specific to create computational lexical resources, some of which may be adapted to produce printed resources as well, may work in concert with other similar efforts to invigorate speakers, learners and users of these languages.

1.2 Types of lexical resources

According to Landau [58], a...
two languages, such as the English-Vietnamese Bilingual Corpus (EVBCorpus) [83], while a multilingual corpus consists of three or more languages, such as the International Cambridge Language Survey.

1.3 Research focus and contribution

The dissertation concentrates on automatically constructing multilingual lexical resources, especially bilingual dictionaries and WordNets, for several natural languages...

complex for automatically generating bilingual dictionaries and WordNets. We will also compare our proposed methods against existing methods to find positive and negative points of difference, and the reasons for the drawbacks. In addition, most existing research works with languages that have some available lexical resources, each of which is expensive to construct. Using many intermediate lexical resources...

of the Arab League. First, we focus on creating reverse bilingual dictionaries. Published methods for automatically creating new dictionaries from existing dictionaries use intermediate dictionaries. Unfortunately, for many resource-poor languages we are lucky to find even a single bilingual dictionary online or in software form. So, our first effort, to increase lexical resources for a language under consideration,...

processes that do not require many resources to begin with, presenting challenging problems for the computational linguist. Our research will make substantial progress on these problems by bootstrapping and leveraging WordNets and dictionaries for resource-rich languages.

1.5 Broader impact

The goal of this dissertation is to study the feasibility of creating multilingual lexical resources for languages by bootstrapping...
Vietnamese to English. Future work is discussed at the end of each chapter. Chapter 8 concludes the thesis.

Acknowledgment

A synopsis of this dissertation is presented in the paper "Automatically creating multilingual lexical resources" in the Proceedings of the Doctoral Consortium at the 28th Conference on Artificial Intelligence (AAAI), pages 3077-3078, Quebec, Canada, July 2014.

CHAPTER 2 RELATED...

discuss related work to build relevant lexical resources. The remainder of this chapter is organized as follows. In Section 2.2, we describe the structure of lexical resources. Section 2.3 gives the ISO 639-3 codes of languages mentioned in this dissertation. Specific approaches to generate dictionaries, translations for phrases and WordNets from different linguistic resources are presented in Section...

A dictionary entry, called LexicalEntry, is a 2-tuple <LexicalUnit, Definition>. Here LexicalUnit is a word or a phrase being defined, also called the definiendum more formally, based on Aristotle’s analysis [58]. Usually, a LexicalUnit is lemmatized (i.e., reduced to a representative or citation form, such as the infinitive for verbs), but not always. A list of entries sorted by the LexicalUnit is called a lexicon.
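
The entry structure described above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the dissertation's implementation: it takes the 2-tuple to pair a LexicalUnit with its definition text, and the field names, the helper functions build_lexicon and look_up, the case-insensitive sort order, and the sample entries are all hypothetical.

```python
from typing import List, NamedTuple


class LexicalEntry(NamedTuple):
    """A dictionary entry as a 2-tuple (field names are assumptions)."""
    lexical_unit: str  # the word or phrase being defined (the definiendum)
    definition: str    # assumed second component: the definition text


def build_lexicon(entries: List[LexicalEntry]) -> List[LexicalEntry]:
    """Return the entries sorted by LexicalUnit, i.e. a lexicon.

    Sorting case-insensitively is an illustrative choice, not something
    the source specifies.
    """
    return sorted(entries, key=lambda entry: entry.lexical_unit.casefold())


def look_up(lexicon: List[LexicalEntry], lexical_unit: str) -> List[str]:
    """Return every definition recorded for the given LexicalUnit."""
    key = lexical_unit.casefold()
    return [e.definition for e in lexicon if e.lexical_unit.casefold() == key]


if __name__ == "__main__":
    # Toy entries invented for this example.
    lexicon = build_lexicon([
        LexicalEntry("run", "to move quickly on foot"),
        LexicalEntry("Apple", "a round fruit"),
        LexicalEntry("dictionary", "a reference work listing words"),
    ])
    for entry in lexicon:
        print(f"{entry.lexical_unit}: {entry.definition}")
    print(look_up(lexicon, "apple"))
```

A plain sorted list keeps the sketch close to the definition of a lexicon in the text; a real system would more likely index entries by LexicalUnit, and would lemmatize each unit before sorting, which this sketch leaves out.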
