Parallel Texts Extraction from the Web

Lê Quang Hùng
University of Engineering and Technology (Trường Đại học Công nghệ)
Master's thesis, major: Information Technology
Supervisor: Dr. Lê Anh Cường
Year of defense: 2010
Keywords: Computer networks; Website; Bilingual texts; Text copying

Contents
Originality Statement
Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
1.1 Parallel corpus and its role
1.2 Current studies on automatically extracting parallel corpus
1.3 Objectives of the thesis
1.4 Contributions
1.5 Thesis structure
2 Related works
2.1 The general framework
2.2 Structure-based methods
2.3 Content-based methods
2.4 Hybrid methods
2.5 Summary
3 The proposed approach
3.1 The proposed model
3.1.1 Host crawling
3.1.2 Content-based filtering module
3.1.2.1 The method based on cognation
3.1.2.2 The method based on identifying translation segments
3.1.3 Structure analysis module
3.1.4 Classification modeling
3.2 Summary
4 Experiment
4.1 Evaluation measures
4.2 Experimental setup
4.3 Experimental results
4.4 Discussion
5 Conclusion and Future Works
5.1 Conclusion
5.2 Future works
A LIBSVM tool
B Relevant publications
Bibliography

Bibliography
[1] P. Resnik. Parallel strands: A preliminary investigation into mining the web for bilingual text. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA), Langhorne, PA, pages 28-31, 1998.
[2] J. Chen and J.-Y. Nie. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. In Proceedings ANLP, Seattle, pages 21-28, 2000.
[3] Xiaoyi Ma and Mark Liberman. BITS: A method for bilingual text search over the web. Machine Translation Summit VII, 1999.
[4] J. Chen, R. Chau, and C.-H. Yeh. Discovering parallel text from the world wide web. In Proceedings Australasian Workshop on Data Mining and Web Intelligence (DMWI), pages 157-161, 2004.
[5] Dan Tufis. Cross-lingual knowledge induction from parallel corpora.
Southern Journal of Linguistics, USA, pages 214-223, 2007.
[6] E. N. Westerhout. A corpus of Dutch aphasic speech: Sketching the design and performing a pilot study. 2005.
[7] A. Frankenberg-Garcia and D. Santos. Introducing COMPARA: the Portuguese-English parallel corpus. Corpora in translator education, pages 71-87, 2003.
[8] Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In MT Summit, 2005.
[9] P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, R. Mercer, and P. Roossin. A statistical approach to machine translation. Computational Linguistics, pages 79-85, 1990.
[10] I. Dan Melamed. Word-to-word models of translation equivalence. IRCS technical report, University of Pennsylvania, 1998.
[11] M. Davis and T. Dunning. A TREC evaluation of query translation methods for multi-lingual text retrieval. Fourth Text Retrieval Conference (TREC-4), NIST, 1995.
[12] Martin Volk, Spela Vintar, and Paul Buitelaar. Ontologies in cross-language information retrieval. In Proceedings of WOW2003, pages 43-50, 2003.
[13] D. W. Oard. Cross-language text retrieval research in the USA. Third DELOS Workshop, European Research Consortium for Informatics and Mathematics, 1997.
[14] Akira Kumano and Hideki Hirakawa. Building an MT dictionary from parallel texts based on linguistic and statistical information. In Proceedings 15th COLING, pages 76-81, 1994.
[15] C. McEwan, I. Ounis, and I. Ruthven. Advances in information retrieval. Springer, pages 365-368, 2002.
[16] I. Dan Melamed. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Somerset, New Jersey, pages 97-108, 1997.
[17] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. Word-sense disambiguation using statistical methods. In Proceedings of 29th Annual Meeting of the ACL, Berkeley, pages 264-270, 1991.
[18] P. Resnik. Mining the web for
bilingual text. In Proceedings of the 37th Annual Meeting of the ACL, College Park, MD, pages 527-534, 1999.
[19] Christopher C. Yang and Kar Wing Li. Building parallel corpora by automatic title alignment. 5th International Conference on Asian Digital Libraries, ICADL 2002, pages 328-339, 2002.
[20] P. Resnik and N. A. Smith. The web as a parallel corpus. Computational Linguistics, pages 349-380, 2003.
[21] Ying Zhang, Ke Wu, Jianfeng Gao, and P. Vines. Automatic acquisition of Chinese-English parallel corpus from the web. In Proceedings of ECIR-06, 2006.
[22] Van B. Dang and Ho Bao-Quoc. Automatic construction of English-Vietnamese parallel corpus through web mining. In Proceedings of 5th IEEE International Conference on Computer Science - Research, Innovation and Vision of the Future (RIVF), Hanoi, Vietnam, 2007.
[23] Jörg Tiedemann, Lars Nygaard, and Tekstlaboratoriet Hf. The OPUS corpus - parallel and free. In Proceedings of the 4th International Conference on Language Resources and Evaluation, pages 1183-1186, 2004.
[24] Jan Pomikalek. Building parallel corpora from the web. 2007.
[25] Dragos Munteanu and Daniel Marcu. Extracting parallel sub-sentential fragments from nonparallel corpora. ACL, pages 81-88, 2006.
[26] Bing Zhao and Stephan Vogel. Adaptive parallel sentences mining from web bilingual news collection. In Proceedings of the IEEE Workshop on Data Mining, 2002.
[27] Pascale Fung and Percy Cheung. Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus. In Proceedings of Coling, pages 1051-1057, 2004.
[28] Jesus Tomas, Enrique Sanchez-Villamil, Jaime Lloret, and Francisco Casacuberta. Webmining: An unsupervised parallel corpora web retrieval. In The Corpus Linguistics Conference, 2005.
[29] B. Barla Cambazoglu, Evren Karaca, Tayfun Kucukyilmaz, Ata Turk, and Cevdet Aykanat. Architecture of a grid-enabled web search engine. Information Processing and Management, pages 609-623, 2007.
[30] Michel Simard, George F. Foster, and Pierre Isabelle. Using cognates to align
sentences in bilingual corpora. In Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation, Montreal, Canada, 1992.
[31] Dragos Munteanu and Daniel Marcu. Improving machine translation performance by exploiting comparable corpora. Computational Linguistics, pages 477-504, 2005.
[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. ACL, Philadelphia, pages 311-318, 2002.
[33] G. Salton. Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley Publishing Company, 1989.