Vietnamese Language Processing: Issues and Challenges
Ho Tu Bao
Institute of Information Technology, Vietnamese Academy of Science and Technology
Japan Advanced Institute of Science and Technology
(Keynote talk at the international conference IEEE RIVF 2009)
IEEE RIVF’09, 16 July 2009

Outline
• Problems and progress in natural language processing
• Issues and challenges in Vietnamese language processing
• Our VLSP project (Vietnamese Language and Speech Processing)

Natural language processing?
• Psychological view: understand human language processing
• Alan Turing: proposed to consider the question “Can machines think?”
• Engineering view: build systems to process language

More languages than you might have thought
6912 distinct languages (230 spoken in Europe, 2197 in Asia)
• We meet here today to talk about processing of Vietnamese language and speech
• Aujourd'hui nous nous réunissons ici pour discuter le traitement de langue et de parole vietnamienne
• Сегодня мы встречаемся здесь, чтобы говорить об обработке вьетнамского языка и речи
• (two East Asian-script versions of the same sentence; characters lost in extraction)
• أننا نجتمع هنا اليوم لنتحدث عن اللغة الفيتنامية و لغة الخطاب
• Hôm nay chúng ta gặp nhau ở đây để nói về xử lý ngôn ngữ và tiếng nói tiếng Việt

54 ethnic groups in Vietnam
Language groups: Mon-Khmer, Tay-Thai, Tibeto-Burman, Malayo-Polynesian, Kadai, Mong-Dao, Han

English websites and Vietnamese?

Translation and machine translation
Translate the following sentence into English: “Ông già đi nhanh quá”?
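Part of what makes the sentence “Ông già đi nhanh quá” hard is that Vietnamese writes a space between syllables rather than between words, so the same syllable string can be grouped into words in more than one way. The following is a minimal sketch of that effect, not a method from the talk: the function name `segmentations` and the tiny `LEXICON` are hypothetical and chosen only for illustration.

```python
def segmentations(syllables, lexicon, max_len=2):
    """Enumerate every way to group syllables into words found in the lexicon."""
    if not syllables:
        return [[]]  # one way to segment nothing: the empty segmentation
    results = []
    for n in range(1, min(max_len, len(syllables)) + 1):
        word = " ".join(syllables[:n])
        if word in lexicon:
            for rest in segmentations(syllables[n:], lexicon, max_len):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon (an assumption for this example only).
LEXICON = {"ông", "ông già", "đi", "già đi", "nhanh quá"}

sentence = "ông già đi nhanh quá"
for seg in segmentations(sentence.split(), LEXICON):
    print(" | ".join(seg))
# ông | già đi | nhanh quá
# ông già | đi | nhanh quá
```

Even with this five-entry lexicon there are two valid groupings, and each word can still carry several senses, so segmentation alone does not settle the translation.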
Many possible translations
• [Ông già] [đi] [nhanh quá]: The old man walks too fast / My father walks too fast
• [Ông già] [đi] [nhanh quá]: The old man died too fast / My father died too fast
• [Ông] [già đi] [nhanh quá]: You get old too fast / Grandfather gets old too fast
Ambiguity of language

Two approaches to machine translation
• Linguistic rule-based machine translation: words are translated using linguistic rules about the two languages and the correspondence transfer between them (morphology, syntax, etc.). Requires understanding natural language.
• Statistical machine translation: translations are generated using statistical learning methods based on bilingual text corpora (statistically similar). Requires large and qualified bilingual text corpora. DOMINATING!

From text to the meaning: Natural Language Processing (NLP)
Stages: text → Lexical / Morphological Analysis → Syntactic Analysis → Semantic Analysis → Discourse Analysis → meaning
Tasks: tagging, shallow parsing, POS tagging, chunking, grammatical relation finding, named entity recognition, word sense disambiguation, reference resolution
Example: The woman will give Mary a book
• POS tagging: The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN
• Chunking: [The/Det woman/NN]NP [will/MD give/VB]VP [Mary/NNP]NP [a/Det book/NN]NP
• Relation finding: [The woman] (subject) [will give] [Mary] (i-object) [a book] (object)

Google: English-Vietnamese translation, 26.9.08 (translate.google.com, 35 languages)

Machine translation issues and challenges
• SMT major difficulties: word choice, word order, tense and aspect, pronouns, idioms
• Target: improve phrase-based SMT in two aspects, word order and word choice
• Combination of tree-to-string SMT and phrase-based SMT (N.P. Thai et al., Machine Translation, Vol. 20, No. … (2006); IJCPOL, Vol. 20, No. … (2007))
• Focus on translating long and complex sentences by introducing CRF-based clause splitting and chunking parsing (N.V. Vinh, IJCPOL, 2009)

IREST: Support for exploiting the Internet (Information Retrieval, Extraction, Summarization, Translation)
[Flow diagram; labels recovered from the slide]
• Different types of query entries
• Search the Internet for web pages having information related to the query
• List of websites in English → translate the list of retrieved web pages into Vietnamese → list of websites in Vietnamese
• Check each website
• Selected website in English → translate the selected website into Vietnamese → web page translated into Vietnamese
• Extract information related to the query; extract news related to the query → text related to the query
• Summarize the text → summarized text in English
• Summarize the text for its gist → translate the gist into Vietnamese → summarized news translated into Vietnamese

Vietnamese named-entities on the web
Ho Chi Minh University of Technology

Sentence reduction by SVM
[Flow diagram; labels recovered from the slide]
• Training: input corpus → parsing all sentences → tree set → generating training data → set of contexts and actions → SVM learning → rules
• Reduction: long sentence {a, b, c, d, e} → parsing → large parsed tree → transforming → small parsed tree → generating → short sentence {b, e, a}
• Structures: list, CSTACK, RSTACK
• Actions: SHIFT, REDUCE, DROP, ASSIGN TYPE, RESTORE
• Transforming a tree is a sequence of actions
• A rule shows the relation of a context and an action
(Minh et al., COLING 2004, J CPOL’05, ACM Trans ALP’05, IEICE’06)

Emerging trend detection (Le Minh Hoang, KSS journal, 2006)
ETD: Detecting topics that are growing in interest and utility over time from a corpus
M = (D, E, T, TR, TI, TV, f, g)
• Topic representation: which features are necessary to characterize topics (interest and utility over time)?
• Topic identification: how to extract these features from the corpus for each topic?
• Topic verification: how to define interest and utility functions and evaluate their increase over time?
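The question of how to evaluate an increase over time can be made concrete with discrete differences of a topic's time series: the first difference approximates how fast the topic grows, the second difference whether that growth is itself speeding up. The sketch below is only an illustration under stated assumptions: the function names, the yearly counts, and the zero thresholds are hypothetical and are not taken from the talk.

```python
def differences(series):
    """Return the first discrete difference of a numeric sequence."""
    return [b - a for a, b in zip(series, series[1:])]


def growth_profile(counts):
    """Return (speed, acceleration) as first and second differences of counts."""
    speed = differences(counts)
    acceleration = differences(speed)
    return speed, acceleration


def is_emerging(counts):
    """Assumed rule: a topic is emerging if its latest speed and
    latest acceleration are both positive (zero thresholds are an assumption)."""
    speed, acceleration = growth_profile(counts)
    return bool(acceleration) and speed[-1] > 0 and acceleration[-1] > 0


# Hypothetical yearly mention counts for two topics.
rising = [2, 3, 5, 9, 16]
flat = [10, 12, 13, 13, 12]

print(growth_profile(rising))  # ([1, 2, 4, 7], [1, 2, 3])
print(is_emerging(rising))     # True
print(is_emerging(flat))       # False
```

A series needs at least three points before a second difference exists, which is why `is_emerging` returns False for shorter inputs.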
ETD: Topic representation
Which features are necessary to characterize topics (interest and utility over time)?
• Define types of citation
• neural network

ETD: Topic identification
How to extract these features from the corpus for each topic?
• Build models corresponding to types of citation
• Use HMM, MEMM, and CRF to extract features

P_O(i) = \frac{P(O \mid i)}{\sum_j P(O \mid j)}, \qquad P(O \mid i) = \max_s P(s, O \mid i)

H(O) = -\sum_i P_O(i) \log P_O(i)

ETD: Topic verification
How to define interest and utility functions and evaluate their increase over time?
Growth(t_i, j): growth of the time series \{t_{ik}(j)\}_k along the time axis

Interest: f(t_i) = \sum_{j \in \{1,3,5,6\}} \mathrm{Growth}(t_i, j)

Utility: g(t_i) = \sum_{j \in \{2,4,5,6\}} \mathrm{Growth}(t_i, j)

f'(k) = \left.\frac{\partial f}{\partial x}\right|_{x=k}: the speed of growing at x = k

f''(k) = \left.\frac{\partial^2 f}{\partial x^2}\right|_{x=k}: the acceleration of growing at x = k

Speed > …, Acceleration > …

ETD: Evaluation
[evaluation figure not recoverable]

Conclusion
• Complete the first phase in VLSP infrastructure
• Advanced technologies and experience from processing of other languages, especially statistical learning from large corpora
• Work in collaboration and sharing
• Look for investment from the government and industry for the next phase, and for collaboration

Acknowledgements
• The national project KC01.01.05/06-10
• Project members: Luong Chi Mai, Ngo Cao Son, Ho Bao Quoc, Dinh Dien, Cao Hoang Tru, Nguyen Thi Minh Huyen, Vu Luong, Le Thanh Huong, Nguyen Phuong Thai, Nguyen Le Minh, Le Minh Hoang, Phan Xuan Hieu, Pham Ngoc Khanh, Ha Thanh Le, Nguyen Phuong Thao, Nguyen Viet Cuong, VLSP forum, among others
VLSP meeting, 21-25 Nov 2005, JAIST

Korean and Arabic
• (Korean script lost in extraction)
• Ơnưr wulinân iơkiê mơiơshơ Vietnamơwa balântsơriê têhaiô ưi rônhagếtsưnnita
• أننا نجتمع هنا اليوم لنتحدث عن اللغة الفيتنامية و لغة الخطاب
• enna ngtma hena alyom lenthds an alloga alvitnamya wa logh alkhytab
• Inana nagtama huna alyom linatahades an: alloga alvitnemâyơ ôe loga alkhytab
• Kyou, wareware wa kokoni atsumari, Betonamu-go to speech shori ni tsuite Giron shimasu

Search for parallel document
Observation: many Vietnamese news articles are translated from English sources on other websites.
Methodology
• Make use of a search engine to find English candidates
• Queries are created from:
  • Posted date, news source’s URL
  • Translation-independent data (TID): text data unchanged during the translation process, e.g. terms, named entities (NE), numbers

Framework
• Queries are generated and executed from high ranks to low ranks
• Filtering
  • Length-based
  • TID-based
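
The two filters named in the framework can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the length filter compares word counts, it treats only numbers as translation-independent data (terms and named entities are left out for brevity), and every function name, threshold, and sample text is hypothetical.

```python
import re


def tid_tokens(text):
    """Extract tokens assumed to survive translation unchanged: numbers only."""
    return set(re.findall(r"\d+(?:[.,]\d+)*", text))


def length_ok(source, candidate, low=0.5, high=2.0):
    """Length-based filter: keep candidates with a plausible word-count ratio."""
    ratio = len(candidate.split()) / max(len(source.split()), 1)
    return low <= ratio <= high


def tid_overlap(source, candidate):
    """TID-based score: Jaccard overlap of the translation-independent tokens."""
    a, b = tid_tokens(source), tid_tokens(candidate)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_candidates(source, candidates, min_overlap=0.5):
    """Keep candidates that pass both the length filter and the TID filter."""
    return [
        c for c in candidates
        if length_ok(source, c) and tid_overlap(source, c) >= min_overlap
    ]


# Hypothetical Vietnamese article text and English candidates.
source = "Ngày 16 tháng 7 năm 2009, hội nghị RIVF có 120 đại biểu tham dự."
candidates = [
    "On 16 July 2009, the RIVF conference welcomed 120 participants.",
    "The 2008 meeting had 95 participants.",
    "2009",
]
print(filter_candidates(source, candidates))
# ['On 16 July 2009, the RIVF conference welcomed 120 participants.']
```

Numbers are a convenient first choice of translation-independent data because they can be matched with a regular expression alone; matching terms and named entities would need a dictionary or a recognizer.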