Portable Language Technology: A Resource-Light Approach to Morpho-Syntactic Tagging

PORTABLE LANGUAGE TECHNOLOGY: A RESOURCE-LIGHT APPROACH TO MORPHO-SYNTACTIC TAGGING

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Anna Feldman

The Ohio State University 2006

Dissertation Committee: Professor Christopher H. Brew, Advisor; Professor Brian D. Joseph; Professor W. Detmar Meurers

Approved by: Advisor, Graduate Program in Linguistics

UMI Number: 3226393

UMI Microform 3226393. Copyright 2006 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. ProQuest Information and Learning Company, 300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346

Copyright by Anna Feldman 2006

ABSTRACT

Morpho-syntactic tagging is the process of assigning part of speech (POS), case, number, gender, and other morphological information to each word in a corpus. Morpho-syntactic tagging is an important step in natural language processing. Corpora that have been morphologically tagged are very useful both for linguistic research, e.g. finding instances or frequencies of particular constructions in large corpora, and for further computational processing, such as syntactic parsing, speech recognition, stemming, and word-sense disambiguation, among others. Despite the importance of morphological tagging, there are many languages that lack annotated resources. This is almost inevitable because these resources are costly to create. But, as described in this thesis, it is possible to avoid this expense. This thesis describes a method for transferring annotation from a morphologically annotated corpus of a source language to a corpus of a related target language.
Unlike unsupervised approaches that do not require annotated data at all and, as a consequence, lack precision, the approach proposed in this dissertation relies on linguistic knowledge, but avoids large-scale grammar engineering. The approach needs neither a parallel corpus nor a bilingual lexicon, and requires much less linguistic labor than the standard technology. This dissertation describes experiments with Russian, Czech, Polish, Spanish, Portuguese, and Catalan. However, the general method proposed can be applied to any fusional language.

To Batsheva Barenfeld, Mira Barenfeld, and Ilia Feldman who made me who I am, and Gera and Naomi who like me this way.

ACKNOWLEDGMENTS

Even though my name is the only author on this work, many have contributed to its development and completion: those who provided insights, comments, and suggestions, and those who provided friendship, love, and support.

First I want to thank Chris Brew for (surprisingly easily) agreeing to take me as his advisee and for being a terrific advisor. Always with bright insights (mostly in the form of interrogation), always knowledgeable, always generous with his time, always with anecdotal stories and jokes, always with good advice, Chris has become an object of appreciation. It was his seminar on Corpora and Multilingual Verb Classification where I realized that Czech is rather useful for processing Russian verbs.

Special thanks go to my friend and colleague, Jirka Hana, who contributed an incredible amount of work and ideas to this project. He developed a resource-light portable morphological analyzer which became the basis for the cross-language system described in this thesis. This work started as a joint project and many ideas developed in this thesis were inspired by discussions with Jirka.

I also want to thank Detmar Meurers, another member of my dissertation committee. He was the first to introduce me to the field of Computational Linguistics and got me excited about it.
Detmar gave me a lot of good advice and support over the years. He managed to keep me always in mind, pointing to the relevant literature and tools, and making me believe that I actually can write a dissertation! Throughout the years, I took several seminars with Detmar, and that’s where I acquired most of the skills for working on my thesis.

I also thank Brian Joseph for always extremely insightful comments and timely feedback. What can I say? Brian knows. I am so lucky he agreed to be my committee member.

I also want to thank people who helped me with corpora used in the experiments: Sandra Maria Aluísio, Gemma Boleda, Toni Badia, Lukasz Debowski, Maria das Graças Volpe Nunes, Jan Hajic, Ricardo Hasegawa, Vicente López, Lluís Padró, Carlos Rodríguez Penagos, Adam Przepiórkowski, and Martí Quixal.

Special thanks go to Stacey Bailey for being such a nice office mate, for the hot chocolate with marshmallows, and for being ready to proofread the entire draft of this dissertation.

I would like to thank my parents for giving me the freedom of choice and always trusting and supporting me. Linguistics is definitely not a profession that runs in our family.

Then, there is a long list of people who deserve a word of thanks because of one or more of the following things: their teaching, their willingness to discuss whatever linguistic or non-linguistic topic, their collegiality, and their friendship. These are (in the alphabetical order) Luiz Amaral (a friend and an expert in Romance languages), Mary Beckman, Ilana Bromberg, Donna Byron, Angelo Costanzo (for the Catalan-Spanish false cognates), Peter Culicover, Mike Daniels (another person who knows practically everything and is always ready to help), Eric Fosler-Lussier, Kordula De Kuthy, Markus Dickinson, Edit Doron, David Dowty, Stefan Dyła, Yakov Feldman (for making my life full of art), Zhenya Gabrilovich (There are computer scientists who can actually understand linguists.
Well, at least to some extent.), Anna Ghazaryan (a friend and a mathematician!), Jonathan Ginzburg, Jan Hajic, Hanka Hanova (swimming, hiking, cooking, reminding me that there is life beyond Oxley), Jirka Hana (a friend and colleague, whose contribution rates a second mention), Jim Harmon, Erhard Hinrichs (for always good advice, encouragement, and the subtaggers idea), Beth Hume, Martin Jansche, Dimitra Kolliakou, Greg Kondrak, Soyoung Kang, Chandana and Rupan Kundu (I wouldn’t have finished this thesis without you, guys!), Bob Levine (for making me believe I can do it and for making me want to know physics), Xiaofei Lu (my former office mate, my current lab mate, full of good ideas and jokes), Vahagn Manukian (for friendship, math, and grill), Arantxa Martin-Lozano (my dear friend), Dennis Mehay, Vanessa Metcalf (for the devoted friendship and mental support), Marcela Michalkova, Martin Michalek, Rick Nouwen (Utrecht, Utrecht), Carl Pollard (for making me love syntax, logic, and math), Craige Roberts, Anton Rytting (for being such a friendly office mate and for being always ready to discuss Arabic vowels, Lettish dialects, and entropy), Andrea Sims (an expert in Slavic languages), Shari Speer, Soundar Srinivasan, Maya Schwekher, Nathan Vaillette, Shravan Vasishth (strict, but fair), Pauline Welby (who always has some interesting story to tell), Don Winford, Mike White, Yael Ziv, and many, many other people. Thank you all!

Last but not the least, I thank Gera, who asked me not to include his name here. So, read the dedication.

VITA

1997 . . . . . B.A. English Language and Literature, B.A. East-Asian Studies, Hebrew University of Jerusalem, Israel
1997–1998 . . . . . Research Assistant, Hebrew University of Jerusalem
1999 . . . . . M.A. English Linguistics, Hebrew University of Jerusalem
1999–2000 . . . . . Research Assistant, The Ohio State University
2001 . . . . . Marie-Curie Fellow, Utrecht Institute of Linguistics, The Netherlands
2000–2005 . . . . . Teaching Assistant, The Ohio State University
2005 . . . . . M.A. Linguistics, The Ohio State University
2005–present . . . . . Language Consultant, Zi Corporation, Canada
2005–2006 . . . . . Presidential Fellow, The Ohio State University

PUBLICATIONS

1. Anna Feldman, Jirka Hana, and Chris Brew (2006). A Cross-language Approach to Rapid Creation of New Morpho-syntactically Annotated Resources. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), Genoa, Italy.
2. Jirka Hana, Anna Feldman, Luiz Amaral, and Chris Brew (2006). Tagging Portuguese with a Spanish Tagger Using Cognates. In Proceedings of the Workshop on Cross-language Knowledge Induction hosted in conjunction with the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Trento, Italy, pp. 33–40.
3. Anna Feldman, Jirka Hana, and Chris Brew (2006). Experiments in Morphological Annotation Transfer. In Proceedings of Computational Linguistics and Intelligent Text Processing (CICLing), A. Gelbukh (editor), Lecture Notes in Computer Science, Mexico City, Mexico, pp. 41–50. Springer-Verlag.
4. Anna Feldman, Jirka Hana, and Chris Brew (2005). Buy One, Get One Free or What to Do When Your Linguistic Resources are Limited. In Proceedings of the Third International Seminar on Computer Treatment of Slavic and East-European Languages (Slovko), Bratislava, Slovakia.
5. Jirka Hana, Anna Feldman, and Chris Brew (2004). A Resource-light Approach to Russian Morphology: Tagging Russian Using Czech Resources. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain, pp. 222–229.
6. Jirka Hana and Anna Feldman (2004). Portable Language Technology: Russian via Czech. In Proceedings of the First Midwest Computational Linguistics Colloquium, Bloomington, Indiana.
7. Stefan Dyła and Anna Feldman (2003). On Comitative Constructions in Polish and Russian. In Proceedings of the Fifth European Conference on Formal Description of Slavic Languages, Leipzig, Germany.
8. Anna Feldman (2003). On S-Coordination and Plural Pronoun Constructions. In Balkan and Slavic Linguistics, vol. 2, ed. Daniel E. Collins and Andrea D. Sims, The Ohio State University, Columbus, Ohio, USA, pp. 49–75.
9. Anna Feldman (2002). Kim and Sandy, Kim with Sandy, Just Me or Both of Us? In Proceedings of European Summer School of Logic, Language, and Information (ESSLLI), Trento, Italy, pp. 41–52.
10. Anna Feldman (2002). On NP-coordination. The UiL OTS 2002 Yearbook, Utrecht, The Netherlands, pp. 39–66.
11. Anna Feldman (2001). Comitative and Plural Pronoun Constructions. In Proceedings of the 17th Annual Meeting of the Israel Association of Theoretical Linguistics (IATL), Jerusalem, Israel.

[...]

... Slavic family, and three (Catalan, Portuguese, and Spanish) belong to the Romance group of languages. Since the goal of the task is to project morpho-syntactic information from a source language to a target language, the discussion concentrates mainly on characterizing the morpho-syntactic properties of these languages.

2.1 Slavic languages

Slavic (Slavonic) languages are a group of Indo-European languages...
ways to adapt a tagger which was trained on another language with similar linguistic properties has potential to become the standard way of tagging languages for which large, labeled corpora are not available.

1.4 Related work

The idea of “information transfer” is not new, especially in areas such as the study of Second Language Acquisition (SLA). As the name suggests, SLA research focuses on how humans...

resources. Standard tagging techniques are accurate, but they rely heavily on high-quality annotated training data. The training data also has to be statistically representative of the data on which the system will be tested. In order to adapt a tagger to new kinds of data, it has to be trained on new data that is similar in style and genre. However, the creation of such data is time-consuming and labor-intensive...

– Polish
– Slovak
– Sorbian
South
– Western
  ∗ Slovenian
  ∗ Serbian
  ∗ Bosnian
  ∗ Croatian
– Eastern
  ∗ Bulgarian
  ∗ Macedonian
East
– Belarusian
– Russian
– Ukrainian

Figure 2.1: Slavic languages

related dialects, Kashubian, Polabian, Obodrits). Russian, Ukrainian and Belarusian belong to the East Slavic branch. Below, a description of three Slavic languages (Czech, Polish, and Russian) is provided. These are the...

dissertation is on the portability of technology to new languages and on rapid language technology development. This dissertation addresses the development of taggers for languages with extremely scarce resources. With respect to tagging, languages with “scarce resources” are those that lack a large annotated corpus in electronic form and/or large lexicons. This dissertation takes a novel approach to rapid,...
mental organization of such knowledge. What is of interest here is how the idea of L1 and L2 morpho-syntactic and lexical interaction can carry over to a machine-learning setting. In particular, the goal is to use annotated data in one language to aid the automatic learning of morpho-syntactic information in another language.

1.5 Dissertation structure

The structure of the dissertation is as follows. Part...

of information from one language to another. This dissertation describes a knowledge- and resource-light system for automatic morphological analysis and tagging of inflected languages with scarce resources. The method avoids the use of labor-intensive resources; instead, it relies on the following:
1. an annotated corpus of a source language
2. an unannotated corpus of a related target language
3. a description...

described and compared, and the final sections of the chapter are devoted to the question of the appropriateness of these methods for inflected languages in general, and for Romance and Slavic languages in particular. Chapter 5 summarizes previous resource-light approaches to Natural Language Processing (NLP) tasks. This chapter takes a closer look at two bootstrapping solutions, both because they are fairly...

is a richly inflected language like other Slavic languages. Czech nouns and adjectives distinguish gender, number, and case, and in some cases, animacy. There are seven cases: nominative, accusative, genitive, dative, instrumental, locative, and vocative. About half the singular noun paradigms have a distinctive vocative form shared by no other case; no adjectival, pronominal, numeral or plural noun paradigms...
morphological annotation transfer. The experiments are divided into three types: those that deal with approximating the word order of the target language using the source language; those that deal with approximating the lexical information of the target language from the lexicon of the source language; and those that deal with data sparsity problems. The first two experiments are reported for all language pairs.
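To make the first experiment type concrete, the following Python sketch shows one way in which word-order information could be taken from a source language: tag-to-tag transition probabilities are estimated from a tagged source corpus, so that they can later be reused when tagging a related target language. The excerpt above does not specify the model, so the bigram tag model, the add-one smoothing, the function name `transition_probs`, the simplified tags, and the two toy Czech sentences are all assumptions made for illustration, not the dissertation's actual setup.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers


def transition_probs(tagged_sentences, smoothing=1.0):
    """Estimate P(tag | previous tag) from (word, tag) sentences.

    Uses add-k smoothing so that unseen tag pairs keep a small probability.
    """
    bigrams = defaultdict(Counter)
    tagset = {END}
    for sentence in tagged_sentences:
        prev = START
        for _word, tag in sentence:
            bigrams[prev][tag] += 1
            tagset.add(tag)
            prev = tag
        bigrams[prev][END] += 1

    probs = {}
    for prev, counts in bigrams.items():
        total = sum(counts.values()) + smoothing * len(tagset)
        probs[prev] = {t: (counts[t] + smoothing) / total for t in tagset}
    return probs


# Toy source-language corpus (invented sentences, simplified invented tags).
source_corpus = [
    [("žena", "NOUN.nom"), ("čte", "VERB"), ("knihu", "NOUN.acc")],
    [("muž", "NOUN.nom"), ("vidí", "VERB"), ("ženu", "NOUN.acc")],
]

probs = transition_probs(source_corpus)
print(probs["NOUN.nom"]["VERB"])      # a frequent transition in the toy data
print(probs["NOUN.nom"]["NOUN.acc"])  # unseen, kept non-zero by smoothing
```

In a transfer setting of this kind, a table such as `probs` would be combined with lexical (word-given-tag) information approximated for the target language, which corresponds to the second experiment type described above.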

Format
Pages	299
File size	1.27 MB