
Parallel Texts Extraction from the Web

by Le Quang Hung

Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi

Supervised by Dr. Le Anh Cuong

A thesis submitted in fulfillment of the requirements for the degree of Master of Information Technology

December, 2010

Contents

ORIGINALITY STATEMENT
Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
  1.1 Parallel corpus and its role
  1.2 Current studies on automatically extracting parallel corpus
  1.3 Objectives of the thesis
  1.4 Contributions
  1.5 Thesis’ structure
2 Related works
  2.1 The general framework
  2.2 Structure-based methods
  2.3 Content-based methods
  2.4 Hybrid methods
  2.5 Summary
3 The proposed approach
  3.1 The proposed model
    3.1.1 Host crawling
    3.1.2 Content-based filtering module
      3.1.2.1 The method based on cognation
      3.1.2.2 The method based on identifying translation segments
    3.1.3 Structure analysis module
    3.1.4 Classification modeling
  3.2 Summary
4 Experiment
  4.1 Evaluation measures
  4.2 Experimental setup
  4.3 Experimental results
  4.4 Discussion
5 Conclusion and Future Works
  5.1 Conclusion
  5.2 Future works
A LIBSVM tool
B Relevant publications
Bibliography

List of Figures

1.1 An example of English-Vietnamese parallel texts
2.1 General architecture in building parallel corpus
2.2 The STRAND architecture [1]
2.3 An example of aligning two documents
2.4 The workflow of the PTMiner system [2]
2.5 The algorithm of translation pairs finder [3]
2.6 Architecture of the PTI system [4]
2.7 An example of the two links in the text
3.1 Architecture of the Parallel Text Mining system
3.2 Architecture of a standard Web crawler
3.3 An example of a candidate pair
3.4 Description of the process of the content-based filtering module
3.5 An example of two corresponding texts in English and Vietnamese
3.6 The algorithm that measures the similarity of cognates between a text pair (Etext, Vtext)
3.7 Relationships between bilingual web pages
3.8 Paragraphs can be identified in HTML pages based on the <p> tag
3.9 Identifying translation paragraphs
3.10 Sample code written in Java to perform translation from English into Vietnamese via the Google AJAX API
3.11 Web documents and the source HTML code for two parallel translated texts
3.12 An example of the publication date feature extracted from an HTML page
3.13 Classification model
4.1 Figure for precision and recall measures
4.2 The format of training and testing data
4.3 Performance of identifying translation segments method
4.4 Comparison of the methods

List of Tables

1.1 Europarl parallel corpus: 10 aligned language pairs, all of which include English
3.1 Symbols and descriptions
4.1 URLs from three sites: BBC, VOA News and VietnamPlus
4.2 No. of pages downloaded and No. of candidate pairs
4.3 Structure-based method
4.4 Content-based method
4.5 Method based on cognation
4.6 Combining structural features and cognate information
4.7 Identifying translation at document level
4.8 Identifying translation at paragraph level
4.9 Identifying translation at sentence level
4.10 Overall results of each method (P-Precision, R-Recall, F-F-Score)

Chapter 1. Introduction

In this chapter, we first introduce the parallel corpus and its role in NLP applications. Current studies, the objectives of the thesis, and its contributions are then presented. Finally, the structure of the thesis is briefly described.

1.1 Parallel corpus and its role

Parallel text. Different definitions of the term “parallel text” (also known as bitext) can be found in the literature. As commonly understood, a parallel text is a text
in one language together with its translation in another language. Dan Tufis [5] gives the following definition: “parallel text is an association between two texts in different languages that represent translations of each other”. Figure 1.1 shows an example of English-Vietnamese parallel texts.

Figure 1.1: An example of English-Vietnamese parallel texts

Parallel corpus. A parallel corpus is a collection of parallel texts. According to [6], the simplest case is where only two languages are involved and one of the corpora is an exact translation of the other (e.g., the COMPARA corpus [7]). However, some parallel corpora exist in several languages, for instance the Europarl parallel corpus [8], which includes versions in 11 European languages, as reported in Table 1.1. In addition, the direction of the translation need not be constant, so some texts in a parallel corpus may have been translated from language L1 to language L2 and others the other way around. The direction of the translation may not even be known.

Parallel corpora exist in several formats. They can be raw parallel texts or aligned texts. The texts can be aligned at the paragraph level, the sentence level, or even the phrase and word levels. The alignment of the texts is useful for different NLP tasks. Statistical machine translation [9, 10] uses parallel sentences as the input for the alignment module, which produces word translation probabilities. Cross-language information retrieval [11–13] uses parallel texts for determining corresponding information in both questioning and answering. Extracting semantically equivalent components of the parallel texts, such as words, phrases, and sentences, is useful for bilingual dictionary construction [14, 15]. Parallel texts are also used for the acquisition of lexical translations [16] and for word sense disambiguation [17]. For most of these tasks, parallel corpora currently play a crucial role in NLP applications.

Table 1.1: Europarl parallel corpus: 10 aligned language pairs, all of which include English

Parallel Corpus (L1-L2)   Sentences    L1 Words      English Words
Danish-English            1,684,664    43,692,760    46,282,519
German-English            1,581,107    41,587,670    43,848,958
Greek-English             960,356      -             27,468,389
Spanish-English           1,689,850    48,860,242    46,843,295
Finnish-English           1,646,143    32,355,142    45,136,552
French-English            1,723,705    51,708,806    47,915,991
Italian-English           1,635,140    46,380,851    47,236,441
Dutch-English             1,715,710    47,477,378    47,166,762
Portuguese-English        1,681,991    47,621,552    47,000,805
Swedish-English           1,570,411    38,537,243    42,810,628

1.2 Current studies on automatically extracting parallel corpus

Nowadays, with the development of the Internet, the Web has become a huge database of multilingual documents and is therefore useful for bilingual text processing. For that reason, many studies [1–4, 18–22] have focused on mining parallel corpora from the Web. Basically, we can classify these studies into three groups: content-based (CB) [3, 4, 22], structure-based (SB) [1, 2, 18], and hybrid (a combination of both methods) [19–21].

The CB approach uses the textual content of the document pairs being evaluated. This approach usually uses lexical translations obtained from a bilingual dictionary to measure the similarity of the content of the two texts. When a bilingual dictionary is available, documents are translated word by word into the target language. The translated documents are then used to find the best matching parallel documents by applying similarity score functions such as cosine, Jaccard, Dice, etc. However, using a bilingual dictionary may be difficult because a word usually has many translations.

Meanwhile, the SB approach relies on analyzing the HTML structure of pages. This approach uses the hypothesis that parallel web pages are presented in similar structures. The similarity of the web pages is estimated
based on their HTML structure. Note that this approach does not require linguistic knowledge. In addition, this approach is very effective in filtering out a large number of unmatched documents, as it is quite fast and accurate. Nevertheless, it has a drawback: it requires that two pages with similar content be presented in the same structure. From our observation, many sites use the same template for all of their pages, so the structure of the pages is similar while their content is different. For that reason, the HTML structure-based approach is not applicable in some cases.

1.3 Objectives of the thesis

As we have introduced, the parallel corpus is a valuable resource for different NLP tasks. Unfortunately, the available parallel corpora are not only relatively small in size but also unbalanced, even for the major languages [3]. Some resources are available, such as for English-French, but the data are usually restricted to government documents (e.g., the Hansard corpus) or newswire texts. Others have limited availability due to licensing restrictions, as in [23]. According to [24], there are now some reliable parallel corpora: the Hansard Corpus (http://www.isi.edu/natural-language/download/hansard/), the JRC-Acquis Parallel Corpus (http://langtech.jrc.it/JRC-Acquis.html), Europarl (http://www.statmt.org/europarl/), and COMPARA (http://www.linguateca.pt/COMPARA/). However, these resources only exist for some language pairs.

In Vietnam, NLP is at an early stage, and the lack of parallel corpora is more severe. The lack of this kind of resource has been an obstacle to the development of data-driven NLP technologies. There are a few studies on mining parallel corpora from the Web; one of them is presented in [22] (for the English-Vietnamese language pair). On the other hand, the current studies [1–4, 18–21], while extremely useful, have a few drawbacks, as mentioned in Section 1.2. So obtaining a high-quality parallel corpus is still a challenge, which is why it remains a strong motivation for many studies.

The objective of this research is to extract parallel texts from bilingual web sites for the English-Vietnamese language pair. We first propose two new methods of designing content-based features: (1) based on cognation, and (2) based on identifying translation segments. Then, we combine content-based features with structural features under a machine learning framework.

1.4 Contributions

In our work, we aim to automatically extract English-Vietnamese parallel texts. Encouraged by [20], we formulate this problem as a classification problem in order to use as much as possible of the knowledge from structural information and content similarity. The most important contribution of our work is that we propose two new methods of designing content-based features and combine them with structure-based features to extract parallel texts from bilingual web sites.

• The first method is based on cognation. It is worth emphasizing that, unlike previous studies [2, 20], we use cognate information in place of word-by-word translation. From our observation, when a text is translated from one language to another, some special parts are kept or changed only slightly. These parts are usually abbreviations, proper nouns, and numbers. We also use other content-based features, such as the length of tokens and the length of paragraphs, which also do not require any linguistic analysis. It is worth noting that this approach does not need any dictionary, so we think it can be applied to other language pairs.
• The second method is based on identifying translation segments, which are used to match translated paragraphs. This helps us to extract proper translation units in bilingual web pages. Previous studies usually use lexical translations obtained from a bilingual dictionary to measure the similarity of the content of the two texts, such as in [4, 20].
This approach may be difficult because a word usually has many translations. In contrast, we use the Google translator, because doing so lets us exploit the advantages of statistical machine translation: it helps with resolving lexical ambiguity, translating phrases, and reordering.

1.5 Thesis’ structure

Given below is a brief outline of the topics discussed in the following chapters of this thesis:

Chapter 2 - Related works. The studies that are closely related to our work are introduced in this chapter.

Chapter 4. Experiment

Table 4.2: No. of pages downloaded and No. of candidate pairs

Web site      No. of pages downloaded   No. of candidate pairs
BBC           37,665                    721
VOA News      14,105                    129
VietnamPlus   12,553                    320

As a result, we have excluded over 90% of the pairs, which are not considered as candidates. Consequently, we obtain 1,170 candidate pairs, and for each of them we must determine whether it is parallel or not.

Next, all data obtained from the content filter module go into the structure module to extract the designed features. We then assigned one of two labels to each candidate pair: one if the pair is parallel and the other if it is not. Of these 1,170 candidate pairs, 433 received the first label and 737 the second. After that, we store the data in the format <label> <index1>:<value1> <index2>:<value2> ... (illustrated in Figure 4.2), which is suitable for the LIBSVM tool (http://www.csie.ntu.edu.tw/~cjlin/libsvm/).

Figure 4.2: The format of training and testing data

We conduct a 5-fold cross-validation experiment; each fold has 234 test items and 936 training items. To investigate the effectiveness of the different kinds of features, we design three feature sets (F1, F2, and F3) as follows: structural features, F1 = {f1, f2, f3, f4, f5}; content-based features, F2 = {f6, f7, f8} (based on cognate
information), and F3 = {f9, f10} (based on identifying translation paragraphs).

It is worth noting that, to compare our approach with previous approaches that use content-based features, we also conduct an experiment like the one in [3]. That study measures the similarity of content based on aligning word translations of the two texts. Here, we use a bilingual English-Vietnamese dictionary to compute a content-based similarity score. For each pair of texts (or web pages), the similarity score is defined as follows:

\[
\mathrm{sim}(A, B) = \frac{\text{Number of translation token pairs}}{\text{Number of tokens in text } A} \tag{4.4}
\]

4.3 Experimental results

We conducted four experiments corresponding to four data sets. To evaluate the effectiveness of our proposed methods, we compare their performance with two previous approaches: structure-based [1] (the first experiment) and content-based [3] (the second experiment).

- Data set 1. We only use structural features (the F1 feature set), as in the original STRAND system [1]. The experimental results are shown in Table 4.3.

Table 4.3: Structure-based method

            Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Average
Precision   0.409    0.518    0.397    0.451    0.444    0.444
Recall      0.620    0.614    0.614    0.763    0.654    0.653
F-Score     0.493    0.562    0.482    0.567    0.529    0.529

- Data set 2. The content-based method uses the bilingual dictionary to match pairs word by word, as in the BITS system [3]. The experimental results are shown in Table 4.4.

Table 4.4: Content-based method

            Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Average
Precision   0.688    0.647    0.643    0.601    0.682    0.652
Recall      0.484    0.478    0.548    0.569    0.528    0.521
F-Score     0.568    0.550    0.591    0.584    0.595    0.578

- Data set 3. First, we only use the cognate information (the F2 feature set); the experimental results are shown in Table 4.5. Then, we combine structural features and cognate information (the F1 ∪ F2 feature set); the experimental results are shown in Table 4.6.

Table 4.5: Method based on cognation

            Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Average
Precision   0.831    0.823    0.810    0.878    0.931    0.855
Recall      0.907    0.864    0.836    0.765    0.803    0.835
F-Score     0.867    0.843    0.823    0.818    0.862    0.843

Table 4.6: Combining structural features and cognate information

            Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Average
Precision   0.873    0.862    0.869    0.904    0.904    0.882
Recall      0.817    0.842    0.879    0.817    0.733    0.817
F-Score     0.844    0.852    0.874    0.858    0.810    0.848

- Data set 4. We combine structural features and identifying translation segments. The experimental results are shown in Tables 4.7-4.9.

Table 4.7: Identifying translation at document level

            Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Average
Precision   0.618    0.666    0.580    0.690    0.654    0.642
Recall      0.671    0.634    0.718    0.545    0.534    0.620
F-Score     0.643    0.650    0.642    0.609    0.588    0.626

Table 4.8: Identifying translation at paragraph level

            Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Average
Precision   0.920    0.911    0.897    0.904    0.866    0.900
Recall      0.816    0.753    0.777    0.802    0.775    0.785
F-Score     0.865    0.824    0.833    0.850    0.818    0.838

Table 4.9: Identifying translation at sentence level

            Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Average
Precision   0.896    0.920    0.877    0.909    0.889    0.898
Recall      0.828    0.758    0.775    0.888    0.795    0.809
F-Score     0.861    0.831    0.823    0.898    0.839    0.850

Figure 4.3: Performance of identifying translation segments method

The experiment is conducted on different levels of translation segments, including document, paragraph, sentence, and word. According to the results shown in the tables above, identifying translation segments at the paragraph level (90% precision) is more effective than at the document level (64.2% precision) or the word level (65.2% precision; the word-level result is shown in Table 4.4). The results also show that the difference between the paragraph level and the sentence level is not large (90% and 89.8% precision, respectively).

Table 4.10: Overall results of each method (P-Precision, R-Recall, F-F-Score)

Method                                                      P       R       F
M1 - Structure-based (SB)                                   0.444   0.653   0.529
M2 - Using bilingual dictionary                             0.652   0.521   0.578
M3 - Using cognate information                              0.855   0.835   0.843
M4 - Combining SB and cognate information                   0.882   0.817   0.848
M5 - Combining SB and identifying translation paragraphs    0.900   0.785   0.838

Figure 4.4: Comparison of the methods

Table 4.10 shows the overall results of each approach in our experiments. It is worth noting that in this task (extracting a parallel corpus), precision is the most important criterion for evaluating the effectiveness of the system. We can see that the precision of the content-based method using cognate information (85.5%) is much higher than that of the structure-based method (44.4%). Our approach of extracting content-based features (based on cognation, and based on identifying translation segments) is also much better than the approach in [3], which obtains only 65.2% precision. The combination of both kinds of features (the content-based features we proposed, together with structural features) gives the best result, with a precision of 88.2% (based on cognation) and 90% (based on identifying translation paragraphs).

4.4 Discussion

According to the experiments conducted, the proposed model is quite successful in extracting parallel texts from the Web. These results show that the content-based features we proposed are effective. They also suggest that, if we are not sure about the structural correspondence between two web pages, we can use content-based features only.

Chapter 5. Conclusion and Future Works

5.1 Conclusion

In this work, we focus on extracting parallel texts from bilingual web sites. The most important contribution of our work is that we propose two new methods of designing content-based features. We also combine these content-based features with structural features under a machine learning framework. This work will help to build the
parallel corpus, which is a crucial resource for much NLP research. According to the experiments conducted, the proposed model is quite successful in extracting parallel texts from the Web. The performance of our method is better than using a bilingual dictionary [3] or using structural information [1]. Our approach has some advantages:

• We use both structural features and content-based features to improve the performance of the system.
• Using the Google translator, we can exploit the advantages of statistical machine translation: it helps with resolving lexical ambiguity, translating phrases, and reordering.
• Our approach (using cognate information) does not require any linguistic analysis. It is worth noting that this approach does not need any dictionary.
• Our approach can be applied to other language pairs because the features used in the proposed model are language-independent.

Although we have shown promising results with our approach, our current system is not yet robust enough to work on every web site, because the HTML structure may be completely different on different web sites.

5.2 Future works

In the future, we will extend our work to extracting smaller parallel components such as paragraphs, sentences, or phrases. This work will also be interesting in the case where the quality of translation between bilingual web pages is not good. We also plan to use this system to collect large corpora for the English-Vietnamese language pair. Although further work remains to be done, we can conclude that it is possible to automatically construct an English-Vietnamese parallel corpus from the Web.

Appendix A. LIBSVM tool

LIBSVM is integrated software for support vector classification, regression, and distribution estimation. It supports multi-class classification. The main features of LIBSVM include:

• Different SVM formulations
• Efficient multi-class classification
• Cross validation for model selection
• Probability estimates
• Weighted SVM for unbalanced data
• Both C++ and Java sources
• GUI demonstrating SVM classification and regression
• Python, R (also Splus), MATLAB, Perl, Ruby, Weka, Common LISP, CLISP, Haskell and LabVIEW interfaces. C# .NET code is available. It is also included in some data mining environments: RapidMiner and PCP.
• Automatic model selection which can generate a contour of cross validation accuracy

Appendix B. Relevant publications

• Le Quang Hung and Le Anh Cuong. Extracting Parallel Texts from the Web. In Proceedings of the 2nd International Conference on Knowledge and Systems Engineering (KSE), pp. 147-151, Hanoi, October 2010.

Bibliography

[1] P. Resnik. Parallel strands: A preliminary investigation into mining the web for bilingual text. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA), Langhorne, PA, pages 28–31, 1998.
[2] J. Chen and J.-Y. Nie. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. In Proceedings of ANLP, Seattle, pages 21–28, 2000.
[3] Xiaoyi Ma and Mark Liberman. BITS: A method for bilingual text search over the web. Machine Translation Summit VII, 1999.
[4] J. Chen, R. Chau, and C.-H. Yeh. Discovering parallel text from the World Wide Web. In Proceedings of the Australasian Workshop on Data Mining and Web Intelligence (DMWI), pages 157–161, 2004.
[5] Dan Tufis. Cross-lingual knowledge induction from parallel corpora. Southern Journal of Linguistics, USA, pages 214–223, 2007.
[6] E. N. Westerhout. A corpus of Dutch aphasic speech: Sketching the design and performing a pilot study. 2005.
[7] A. Frankenberg-Garcia and D. Santos. Introducing COMPARA: the Portuguese-English parallel corpus. Corpora in Translator Education, pages 71–87, 2003.
[8] Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In MT Summit, 2005.
[9] P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, R. Mercer, and P. Roossin. A statistical approach to machine translation. Computational Linguistics, pages 79–85, 1990.
[10] I. Dan Melamed. Word-to-word models of translation equivalence. IRCS technical report, University of Pennsylvania, 1998.
[11] M. Davis and T. Dunning. A TREC evaluation of query translation methods for multi-lingual text retrieval. Fourth Text Retrieval Conference (TREC-4), NIST, 1995.
[12] Martin Volk, Spela Vintar, and Paul Buitelaar. Ontologies in cross-language information retrieval. In Proceedings of WOW2003, pages 43–50, 2003.
[13] D. W. Oard. Cross-language text retrieval research in the USA. Third DELOS Workshop, European Research Consortium for Informatics and Mathematics, 1997.
[14] Akira Kumano and Hideki Hirakawa. Building an MT dictionary from parallel texts based on linguistic and statistical information. In Proceedings of the 15th COLING, pages 76–81, 1994.
[15] C. McEwan, I. Ounis, and I. Ruthven. Advances in information retrieval. Springer, pages 365–368, 2002.
[16] I. Dan Melamed. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Somerset, New Jersey, pages 97–108, 1997.
[17] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. Word-sense disambiguation using statistical methods. In Proceedings of the 29th Annual Meeting of the ACL, Berkeley, pages 264–270, 1991.
[18] P. Resnik. Mining the web for bilingual text. In Proceedings of the 37th Annual Meeting of the ACL, College Park, MD, pages 527–534, 1999.
[19] Christopher C. Yang and Kar Wing Li. Building parallel corpora by automatic title alignment. 5th International Conference on Asian Digital Libraries, ICADL 2002, pages 328–339, 2002.
[20] P. Resnik and N. A. Smith. The web as a parallel corpus. Computational Linguistics, pages 349–380, 2003.
[21] Ying Zhang, Ke Wu, Jianfeng Gao, and P. Vines. Automatic acquisition of Chinese-English parallel corpus from the web. In Proceedings of ECIR-06, 2006.
[22] Van B. Dang and Ho Bao-Quoc. Automatic construction of English-Vietnamese parallel corpus through web mining. In Proceedings of the 5th IEEE International Conference on Computer Science - Research, Innovation and Vision of the Future (RIVF), Hanoi, Vietnam, 2007.
[23] Jörg Tiedemann and Lars Nygaard. The OPUS corpus - parallel and free. In Proceedings of the 4th International Conference on Language Resources and Evaluation, pages 1183–1186, 2004.
[24] Jan Pomikalek. Building parallel corpora from the web. 2007.
[25] Dragos Munteanu and Daniel Marcu. Extracting parallel sub-sentential fragments from non-parallel corpora. ACL, pages 81–88, 2006.
[26] Bing Zhao and Stephan Vogel. Adaptive parallel sentences mining from web bilingual news collection. In Proceedings of the IEEE Workshop on Data Mining, 2002.
[27] Pascale Fung and Percy Cheung. Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus. In Proceedings of COLING, pages 1051–1057, 2004.
[28] Jesus Tomas, Enrique Sanchez-Villamil, Jaime Lloret, and Francisco Casacuberta. WebMining: An unsupervised parallel corpora web retrieval. In The Corpus Linguistics Conference, 2005.
[29] B. Barla Cambazoglu, Evren Karaca, Tayfun Kucukyilmaz, Ata Turk, and Cevdet Aykanat. Architecture of a grid-enabled web search engine. Information Processing and Management, pages 609–623, 2007.
[30] Michel Simard, George F. Foster, and Pierre Isabelle. Using cognates to align sentences in bilingual corpora. In Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation, Montreal, Canada, 1992.
[31] Dragos Munteanu and Daniel Marcu. Improving machine translation performance by exploiting comparable corpora. Computational Linguistics, pages 477–504, 2005.
[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. ACL, Philadelphia, pages 311–318, 2002.
[33] G. Salton. Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley Publishing Company, 1989.
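
As a rough illustration of the data preparation described in Chapter 4, the sketch below computes a dictionary-based similarity score in the spirit of Equation 4.4 and writes one candidate pair in the LIBSVM input format. It is only a sketch under stated assumptions: the toy dictionary, the rule that a token of text A counts as a translation pair when any of its dictionary translations occurs in text B, the label value, and the function names `similarity_score` and `to_libsvm_line` are all illustrative and are not taken from the thesis.

```python
def similarity_score(tokens_a, tokens_b, dictionary):
    """Equation 4.4: translation token pairs / number of tokens in text A.

    Assumption (not from the thesis): a token of A forms a translation
    pair when at least one of its dictionary translations occurs in B.
    """
    if not tokens_a:
        return 0.0
    vocab_b = set(tokens_b)
    pairs = sum(
        1 for tok in tokens_a
        if any(t in vocab_b for t in dictionary.get(tok, ()))
    )
    return pairs / len(tokens_a)


def to_libsvm_line(label, features):
    """Format one example as '<label> <index>:<value> ...' with 1-based indices."""
    parts = [f"{i}:{v:g}" for i, v in enumerate(features, start=1)]
    return " ".join([str(label)] + parts)


# Hypothetical toy data, not from the thesis.
toy_dictionary = {"house": ["nhà"], "new": ["mới"], "big": ["lớn", "to"]}
english = ["the", "new", "big", "house"]
vietnamese = ["ngôi", "nhà", "mới", "lớn"]

score = similarity_score(english, vietnamese, toy_dictionary)
print(score)                            # 0.75
print(to_libsvm_line(1, [score, 0.5]))  # 1 1:0.75 2:0.5
```

Three of the four English tokens have a dictionary translation in the Vietnamese text, so the score is 3/4. In a real run, each candidate pair would contribute one such line, with one index per feature in the chosen feature set.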

Title	Parallel Texts Extraction from the Web
Author	Le Quang Hung
Supervisor	Dr. Le Anh Cuong
Institution	University of Engineering and Technology, Vietnam National University
Major	Information Technology
Type	thesis
Year of publication	2010
City	Hanoi

Format
Number of pages	53
File size	1.46 MB