
Vietnamese-English Cross Language Search Information Retrieval (CLIR) - Discovering Noun Phrases for Translation


Structure

  • Slide 1

  • Outline

  • Motivations – Unknown Translations

  • Examples

  • Slide 5

  • Searching the web for translation?

  • Slide 7

  • Slide 8

  • Our Approach

  • Crosslingual Query to Obtain Mixed Languages WebPages

  • How to Find This VEX ?

  • Original Source Query

  • Crosslingual Query

  • Our Approach: Noun Phrase Translation Extraction

  • Yahoo Search API - XML Data Returning

  • Proper name recognition & Transliteration

  • Preprocessing (Query: Thuật toán genetic)

  • Frequency-Distance Model

  • Contextual Ordering Model & Result Ranking

  • Sample Program Output # 1 (dân ca -> folk or traditional music)

  • Sample Program Output # 2 (Quang Dũng -> Quang Dung)

  • Sample of Translation Results

  • Conclusion and Next Steps

Content

Vietnamese-English Cross Language Search Information Retrieval (CLIR)
Discovering Noun Phrases for Translation
CSC 177 Presentation
Nguyen Doan H, Ph.D

Outline
  • Motivations
  • Crosslingual Query
  • Noun phrase translation extraction
  • Experiments and results
  • Conclusion and next steps

Motivations – Unknown Translations
  • Words that are outside the scope of a bilingual dictionary
    – Brand names, place names, personal names
    – Titles (music, book, video)
    – Terminology (science, computer, medical, space, farming, etc.)
    – Most of these words are Out-Of-Vocabulary (OOV)
  • Compound nouns
    – Meaning might not be inferable from the individual components
    – Might require expert knowledge for translation
    – Might have multiple correct translations
  • Applicability
    – Cross-language Information Retrieval (CLIR)
    – Machine Translation (MT)
    – Machine-Readable Dictionary (MRD)

Examples
  • Example 1: Computer Terminology (phần mềm -> software)
  • Example 2: Personal Name (ca sĩ Quang Dũng -> Singer Quang Dung)

Searching the web for translation?
  • Parallel data on the web: Vietnamese to English translation
  • Comparable corpus on the web
  • Mixed-language web pages: English translation

Our Approach
  • Extensions to CMU's Ying Zhang 2005 paper (Credit)
  • Addresses issues specific to Vietnamese-English OOV translations
  • Proper name translation uses a pattern recognition technique rather than phonetic similarity and string alignment
  • Detection of borrowed English words
  • Improves translation suggestions by utilizing contextual information

Crosslingual Query to Obtain Mixed Languages WebPages
  • Extend the source query, V_S, with extended words/phrases V_EX that tend to co-occur with it frequently:
    – V_S: phần mềm → ?
    – V_S V_EX: phần mềm miễn phí
  • Translate the extended words/phrases, V_EX, to English, E_EX:
    – V_EX: miễn phí → E_EX: free
  • Submit both the source query and the translated words/phrases to a search engine:
    – V_S E_EX: phần mềm free

How to Find This V_EX?
  • Find co-occurring terms in a web log
  • Use co-occurring terms in the search query (in CLIR)
  • Search Google with V_S and select Vietnamese words, V_EX, with high frequency
  [Figure: Overture Search Log]

Original Source Query

Crosslingual Query

Our Approach: Noun Phrase Translation Extraction
  • Proper noun recognition & Transliteration
  • Preprocessing
  • Frequency-Distance Model
  • Contextual Ordering Model & Result Ranking

Yahoo Search API - XML Data Returning
  [Figure: Snippet]

Proper name recognition & Transliteration
  • Extract and concatenate Title, Summary, and URL
  • Recognize that a proper name is likely to appear with the first letter of each word capitalized
  • Compute the likelihood that the query text is a proper name:

    \[ P(V_S) = \frac{\text{occurrences of First\_Letter\_In\_Cap}(V_S) \text{ in snippet text}}{\text{all occurrences of } V_S \text{ in snippet text}} \]

  • Once recognized, map Vietnamese vowels to English vowels:
    – e.g. accented forms of a → a, …, ũ → u, …
  • Suggest a translation candidate, VN: Quang Dũng → Eng: Quang Dung
  • Compute and assign a weight to the translation candidate

Preprocessing (Query: Thuật toán genetic)
  – Extract and concatenate Title, Summary, and URL:
      Thuật toán-Cấu trúc liệu (Reserve Polish Notation – RPN), thuật toán "kinh điển" lĩnh vực trình biên dịch THUẬT GIẢI DI TRUYỀN – GENETIC ALGORITHM Kỳ ity.vnuit.edu.vn/thuattoan/index.htm
  – Mark the query, normalize the text, and remove noise text:
      ~123456789 cấu trúc liệu reserve polish notation – rpn ~123456789 kinh điển lĩnh vực trình biên dịch thuẬt giẢi di truyỀn – ~987654321 algorithm kỳ ity vnuit edu thuattoan index htm
  – Mark recognized Vietnamese text with the VNW tag:
      ~123456789 VNW VNW VNW VNW reserve polish notation VNW rpn ~123456789 VNW VNW VNW VNW VNW VNW VNW VNW di VNW VNW ~987654321 algorithm VNW ity vnuit edu thuattoan index htm
  – Group contiguous English words and build the word list:
      ['~123456789', 'VNW', 'VNW', 'VNW', 'VNW', '', '', 'reserve_polish_notation', 'VNW', 'rpn', '~123456789', 'VNW', 'VNW', 'trong', 'VNW', 'VNW', 'VNW', 'VNW', 'VNW', 'VNW', 'di', 'VNW', 'VNW', '~987654321', 'algorithm', 'VNW', 'ity', 'vnuit', 'edu', 'vn', 'thuattoan', 'index', 'htm']

Frequency-Distance Model
  • The weight of a candidate e combines:
    – Frequency of co-occurrence
    – Distance from either V_S or E_EX within a snippet text
    – Summed over the summaries of all returned documents

    \[ w(e) = \sum_{s_i} \left( \sum_{V_S^i} \frac{1}{d(V_S^i, e)} + \sum_{E_{EX}^i} \frac{1}{d(E_{EX}^i, e)} \right) \]

    where s_i ranges over the returned snippets, V_S^i and E_EX^i are the occurrences of the source query and of the translated extension in s_i, and d(·, e) is the distance between that occurrence and e.
  • Example: Thuật toán genetic

Contextual Ordering Model & Result Ranking
  • Estimate the closeness probability:

    \[ P(V_S\, e) = \frac{ADJ(V_S\, e)}{\sum_{e} c(e) + \sum_{V_S} c(V_S)} \]

    \[ P(e\, E_{EX}) = \frac{ADJ(e\, E_{EX})}{\sum_{e} c(e) + \sum_{E_{EX}} c(E_{EX})} \]

  • Overall score for each candidate:

    \[ RankScore(e) = w(e) \cdot P(V_S\, e) \cdot P(e\, E_{EX}) \]

  • Sort by score and present the top suggestions

Sample Program Output # 1 (dân ca -> folk or traditional music)

Sample Program Output # 2 (Quang Dũng -> Quang Dung)

Sample of Translation Results

Category | Vietnamese Phrase/Word | Vietnamese-English Web-mining Translation | Vdict (Machine Translation) | Vietdict (Online Dictionary)
Organization Name | WTO gì? | What is world trade organization? | What is WTO? | No definition found
Science & Tech | thuật toán di truyền | Genetic algorithms | Heredity algorism | No definition found
Location Name | Thừa Thiên Huế | Thua Thien Hue | Partial Excess Hue | No definition found
Person Name | ca sĩ Quang Dũng | Singer Quang Dung | Optical singer Dũng | N/A
Medical Term | viêm màng não | Meningitis brain infection | meningitis | No definition found
Geographical name | Đại dương Bắc Băng Dương | Arctic ocean | Đạtôi glacial ocean Boreal Yang | No definition found
Education | học vị Tiến sỹ | Phd degree | Advance academical degree sỹ | No definition found
Music | dân ca | Folk music | folk-song | folk-song
Music | nhạc hip hop | Hip hop music or Rap music | music hu-blông hông | No definition found
Space | phi hành gia Sally Ride | Former astronaut Sally Ride | air-man Phá vây cưỡi | No definition found
Plant | kiểng vườn Nhật | Bonsai Japanese garden | Japanese garden plant kiểng | No definition found
Farming | nghề cá thủy sản | Aquaculture fisheries | seafood fisheries | No definition found
Laws | cư trú thường trực | permanent resident | populate permanent | No definition found
Astrological | Thuật chiêm tinh phong thủy | feng shui astrology | Geomancy astrology | Geomancy

Conclusion and Next Steps
  • Contributions
    – Recognize and translate important phrases
    – Translate persons, locations, and concepts
    – Low implementation cost with reasonable performance
  • Future work
    – Experiment with a larger set of test data
    – Integrate with Vietnamese-English CLIR work
    – Automate the generation of extended words/phrases and of their derived English extended words
    – Experiment with the “Refine Result” concept for search engines
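The proper-name step and the frequency-distance model from the slides can be sketched roughly as follows. This is a minimal illustration, not the presenter's program: the function names are hypothetical, "first letter in cap" is assumed to mean that every word of the query starts with a capital, and d(·, e) is assumed to be the token distance within the word list. The marker strings and the VNW tag are the ones shown in the preprocessing example.

```python
import re
import unicodedata
from collections import defaultdict

VS_MARK = "~123456789"   # marker the slides use for the source query V_S
EEX_MARK = "~987654321"  # marker the slides use for the translated extension E_EX
VN_TAG = "VNW"           # tag the slides use for recognized Vietnamese words


def strip_diacritics(text):
    """Map Vietnamese letters to unaccented Latin letters (Dũng -> Dung)."""
    # đ/Đ have no Unicode decomposition, so they are replaced explicitly.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def proper_name_likelihood(query, snippet_text):
    """Share of the occurrences of `query` whose words all start with a capital.

    Assumption: this is one reading of First_Letter_In_Cap(V_S) on the slide.
    """
    matches = re.findall(re.escape(query), snippet_text, flags=re.IGNORECASE)
    if not matches:
        return 0.0
    capitalized = sum(all(w[0].isupper() for w in m.split()) for m in matches)
    return capitalized / len(matches)


def frequency_distance_weights(snippets):
    """Compute w(e): sum of 1/d(anchor, e) over every V_S / E_EX marker.

    `snippets` is a list of word lists like the one built in preprocessing.
    Assumption: d is the absolute difference of token positions.
    """
    weights = defaultdict(float)
    for tokens in snippets:
        anchors = [i for i, t in enumerate(tokens) if t in (VS_MARK, EEX_MARK)]
        for j, token in enumerate(tokens):
            if not token or token in (VS_MARK, EEX_MARK, VN_TAG):
                continue  # only English tokens are translation candidates
            for i in anchors:
                weights[token] += 1.0 / abs(i - j)
    return dict(weights)


if __name__ == "__main__":
    print(strip_diacritics("Quang Dũng"))
    print(proper_name_likelihood("Quang Dũng", "ca sĩ Quang Dũng, quang dũng live"))
    print(frequency_distance_weights(
        [[VS_MARK, VN_TAG, "genetic_algorithm", EEX_MARK, "index"]]
    ))
```

Candidates that appear often and close to either marker accumulate the largest weights; in the full approach these weights would then be multiplied by the two closeness probabilities to obtain RankScore(e).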

Posted: 27/08/2017, 00:19
