2010 Second International Conference on Knowledge and Systems Engineering

Extracting Parallel Texts from the Web

Le Quang Hung
Faculty of Information Technology, Quynhon University, Vietnam
Email: hungqnu@gmail.com

Le Anh Cuong
University of Engineering and Technology, Vietnam National University, Hanoi
Email: cuongla@vnu.edu.vn

Abstract— A parallel corpus is a valuable resource for several important applications of natural language processing, such as statistical machine translation, dictionary construction, and cross-language information retrieval. The Web is a huge resource of knowledge which partly contains bilingual information in various kinds of web pages, and it currently attracts many studies on building parallel corpora from Internet resources. However, obtaining a parallel corpus with high accuracy is still a challenge. This paper focuses on extracting parallel texts from bilingual web sites for the English-Vietnamese language pair. We first propose a new way of designing content-based features, and then combine them with structural features under a machine learning framework. In our experiment we obtain 88.2% precision for the extracted parallel texts.

I. INTRODUCTION

Parallel corpora have been used in many research areas of natural language processing. For example, parallel texts are used to connect vocabularies in cross-language information retrieval [5], [6], [9]. Moreover, extracting semantically equivalent components of parallel texts, such as words, phrases, and sentences, is useful for bilingual dictionary construction [1] and statistical machine translation [2], [7]. However, the available parallel corpora are not only relatively small in size but also unbalanced [15], even for the major languages. Along with the development of the Internet, the World Wide Web has become a huge database containing multi-language documents, and it is therefore useful for bilingual text processing.

Up to now, several systems have been built for mining parallel corpora. These studies can be divided into three main kinds: content-based (CB), structure-based (SB), and combinations of both methods. In the CB approach, [3], [13], [14] use a bilingual dictionary to match word-word pairs in the two languages. Meanwhile, the SB approach [11], [12] relies on analyzing the HTML structure of pages. Other studies such as [4], [10] have combined the two methods to improve the performance of their systems.

Parallel web pages on a site, generally speaking, have comparable structures and contents. Therefore, a large number of these studies focus on characteristics of HTML structures such as URL links, filenames, and HTML tags. PTMiner [3] extracts bilingual English-Chinese documents. This system uses a search engine to locate hosts containing parallel web pages. To generate candidate pairs, PTMiner uses a URL-matching process (e.g., the Chinese translation of a URL such as "http://www.foo.ca/english-index.html" might be "http://www.foo.ca/chinese-index.html") and other features such as size, date, etc. Note that this criterion does not appear in most bilingual English-Vietnamese web sites. STRAND [10] takes a similar approach to PTMiner, except that it handles the case where URL-matching requires multiple substitutions. This system also proposes a method that combines content-based and structural matching by using a cross-language similarity score as an additional parameter of the structure-based method. To our knowledge, there are few studies in this
field related to Vietnamese. [14] built an English-Vietnamese parallel corpus based on content-based matching. Firstly, candidate web page pairs are found using features of sentence length and date. Then, the similarity of content is measured using a bilingual English-Vietnamese dictionary, and the decision whether two pages are parallel is made based on thresholds of this measure. Note that this system only searches for parallel pages that are good translations of each other, and they are required to be written in the same style. Moreover, using word-word translation causes much ambiguity. Therefore, this approach is difficult to extend when the data increases, as well as when applying it to bilingual web sites with various styles.

In this paper, we aim to automatically extract English-Vietnamese parallel texts from bilingual news web sites. Encouraged by [10], we formulate this problem as a classification problem so as to utilize as much as possible the knowledge from structural information and the similarity of content. It is worth emphasizing that, unlike previous studies [3], [10], we use cognate information in place of word-word translation. From our observation, when a text is translated from one language to another, some special parts are kept unchanged or changed only a little. These parts are usually abbreviations, proper nouns, and numbers. In addition, we also use other content-based features, such as the number of tokens and the number of paragraphs, which likewise do not require any linguistic analysis. It is worth noting that this approach does not need any dictionary, so we believe it can be applied to other language pairs. Our experiment is conducted on web sites containing English-Vietnamese documents, including BBC (http://www.bbc.co.uk), VietnamPlus (http://www.vietnamplus.vn), and VOA (http://www.voanews.com).

The rest of this paper is organized as follows. Section II presents our proposed model, including the general architecture of the model and how the structural features and content-based features are designed and computed. Section III presents the experiment, in which we implement different feature sets. Finally, the conclusion is drawn in Section IV.

II. THE PROPOSED MODEL

In this paper, we follow the approach that combines content-based features and structure-based features of the HTML pages to extract parallel texts from the Web using machine learning [10]. The machine learning algorithm used here is the Support Vector Machine (SVM). Figure 1 illustrates the general architecture of our proposed model. As shown in the model, it includes the following tasks:

• Firstly, we use a crawler on the specified domains to extract bilingual English-Vietnamese pages, which are called the raw data.
• Secondly, from the raw data, we create candidate parallel web page pairs by applying thresholds on extracted features (content-based features and the publication date feature).
• Thirdly, we manually label these candidates to obtain the training data. That is, some pairs of web pages are assigned the label 1 (parallel) and the others the label 0 (not parallel); the details of this task are presented in the experiment section.
• Fourthly, we extract structural features and content-based features so that each web page pair can be represented as a vector of these features. This representation is required to fit a classification model.
• Finally, we use an SVM tool to train a classification system on this training data, so that, given a pair of English-Vietnamese web pages for testing, the obtained classifier decides whether it is parallel or not.
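The paper defines no code-level API, so the following minimal Python skeleton is only our illustration of how the five steps fit together; every function is a hypothetical placeholder for the corresponding module described in Sections II-A through II-D.

```python
# Illustrative skeleton of the five-step pipeline above; all names are
# hypothetical placeholders, not an API defined in the paper.

def crawl(domains):
    """Step 1: gather raw English-Vietnamese pages (Section II-A)."""
    raise NotImplementedError

def filter_candidates(pages):
    """Step 2: keep pairs passing the content-based and date thresholds."""
    raise NotImplementedError

def label_manually(candidates):
    """Step 3: assign label 1 (parallel) or 0 (not parallel)."""
    raise NotImplementedError

def feature_vector(pair):
    """Step 4: content-based and structural features (Sections II-B, II-C)."""
    raise NotImplementedError

def train_svm(vectors, labels):
    """Step 5: train the SVM classifier (Section II-D)."""
    raise NotImplementedError

def build_classifier(domains):
    pages = crawl(domains)
    candidates = filter_candidates(pages)
    labeled = label_manually(candidates)
    vectors = [feature_vector(pair) for pair, _ in labeled]
    labels = [lab for _, lab in labeled]
    return train_svm(vectors, labels)
```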
A. Host Crawling

Bilingual English-Vietnamese web pages are collected by crawling the Web using a Web spider, as in [4]. A Web spider is a software tool that traverses a site to gather web pages by following the hyperlinks that appear in them. For this process, our system uses Teleport Pro to retrieve web pages from remote web sites. Teleport Pro is a tool designed to download documents from the Web via the HTTP and FTP protocols and store the extracted data on disk [15]. Note that we select the URLs on the specified hosts from the three news sites: BBC, VietnamPlus, and VOA News. For example, on the BBC site the URL for English is "http://news.bbc.co.uk/english/" and the URL for Vietnamese is "http://www.bbc.co.uk/vietnamese/". We then use Teleport Pro to download the HTML pages and obtain the candidate web pages.

B. Content-based Filtering Module

Using content-based features, we want to determine whether two pages are mutual translations. However, as [15] pointed out, not all translators create translated pages that look like the original page. Moreover, SB matching is applicable only to corpora that include markup, and there are certainly multilingual collections on the Web and elsewhere that contain parallel text without structural tags [10]. Many studies have used the content-based approach to build a parallel corpus from the Web, such as [4], [14]. They use a bilingual dictionary to measure the similarity of the contents of two texts. However, this method can cause much ambiguity because a word has many possible translations; for English-Vietnamese, one word in English can correspond to multiple words in Vietnamese.

In this paper, we propose a different approach which provides a cheap and reasonably reliable resource. This proposal is based on the observation that a document usually contains some cognates (in linguistics, cognates are words that have a common etymological origin; see http://en.wikipedia.org/wiki/Cognate), and if two documents are mutual translations then the cognates are usually kept the same in both of them. Note that [8] also use cognates, but for sentence alignment. From our observation, we divide the tokens considered as cognates into the following three kinds:

1) Abbreviations (e.g., "EU", "WTO").
2) Proper nouns in English (e.g., "Vietnam", "Paris").
3) Numbers (e.g., "10", "2010").

Now we can design a feature for measuring content similarity based on cognates. This feature is computed as the ratio between the number of corresponding cognates in the two texts and the number of tokens in one text (e.g., the English text). Given a pair of texts (A, B), where A stands for the English text and B for the Vietnamese text, we obtain the cognate token sets T1 and T2 from A and B respectively. For a robust matching between cognates, we make some modifications to the original tokens:

• A number written as a sequence of letters in the English alphabet is converted into a real number. According to our observations, the units of numbers in English are often retained when translated into Vietnamese, so we do not consider the case where the units differ (e.g., inch vs. cm, pound vs. kg, USD vs. VND, etc.).
• We use a list containing the corresponding names between English and Vietnamese, including names of countries, continents, dates, etc. However, the names of countries in English can be translated into Vietnamese in different ways. Therefore, we only consider those names in English whose corresponding Vietnamese names have been published on the Wikipedia site (http://vi.wikipedia.org/wiki/Danh_sách_quốc_gia). For example, "Mexico" in English and "Mêhicô" or "Mễ Tây Cơ" in Vietnamese are treated as the same.

The following is an example of two corresponding texts in English and Vietnamese:

• Vietnam and Italy through three cooperation programmes beginning in 1998 have so far signed more than 60 projects on joint scientific research. Of the figure, 40 projects have been carried out and brought good results.
• Từ 1998 đến nay, Việt Nam và Italy đã ký kết hơn 60 dự án hợp tác nghiên cứu chung, trong đó có khoảng 40 dự án đã triển khai thực hiện và đạt kết quả tích cực.

Scanning these texts, we obtain T1 = {"Vietnam", "Italy", "1998", "60", "40"} and T2 = {"1998", "Việt Nam", "Italy", "60", "40"}.

We measure the similarity of cognates between A and B using the algorithm presented in Figure 2. If sim_cognates(A, B) is greater than a threshold, then the pair (A, B) is a candidate. The value sim_cognates(A, B) is calculated as in formula (1):

sim_cognates(A, B) = count / (number of tokens in T1)   (1)

where count is the number of matching cognate pairs between T1 and T2.
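The matching algorithm itself is given in the paper only as a figure. Below is a minimal Python sketch of how formula (1) could be computed, under our own simplifying assumptions: a small hypothetical name-mapping table stands in for the published Wikipedia list, a regex approximates the three cognate kinds, and the conversion of spelled-out numbers is omitted.

```python
import re

# Hypothetical mapping table standing in for the Wikipedia list of country
# names; the real system uses the published English-Vietnamese name list.
NAME_MAP = {"Việt Nam": "Vietnam", "Mêhicô": "Mexico", "Mễ Tây Cơ": "Mexico"}

def cognate_tokens(text):
    """Collect cognate tokens: abbreviations, proper nouns, and numbers."""
    # Map known Vietnamese names to their English form first, so that
    # multi-word names such as "Việt Nam" match their English counterparts.
    for vi, en in NAME_MAP.items():
        text = text.replace(vi, en)
    # Abbreviations (all caps), capitalized words, and numbers. A fuller
    # implementation would also convert spelled-out numbers and filter
    # sentence-initial capitalized words.
    return set(re.findall(r"[A-Z]{2,}|[A-Z][a-z]+|\d+(?:[.,]\d+)?", text))

def sim_cognates(text_a, text_b):
    """Formula (1): matched cognates over the size of the English set T1."""
    t1 = cognate_tokens(text_a)   # English text A
    t2 = cognate_tokens(text_b)   # Vietnamese text B
    return len(t1 & t2) / len(t1) if t1 else 0.0
```

On the example pair above, this sketch recovers the shared set {"Vietnam", "Italy", "1998", "60", "40"} and yields a similarity above the 0.5 threshold used in Section III.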
In addition to cognates, we observe that the text length and the number of paragraphs also provide evidence for measuring the content similarity between two texts: parallel texts usually have similar text lengths and numbers of paragraphs. Therefore, given a pair of texts, we design three features as follows:

• The first feature estimates the cognate-based similarity. It is computed by formula (1).
• The second feature estimates the similarity of text lengths. A way to filter out wrong pairs is to compare the lengths of the two texts in characters; we set a reasonable threshold on the length ratio between the two texts so that it keeps the potential candidates.
• The third feature estimates the ratio of the numbers of paragraphs of the two texts. In our opinion, two parallel texts often have similar numbers of paragraphs, so a feature representing this criterion is necessary.

C. Structure Analysis Module

Besides finding candidate pairs based on the content of the texts, the structural similarity of the HTML pages also provides useful information for determining whether a pair of web pages is a mutual translation. This method relies on the hypothesis that parallel web pages are presented with similar structures. Note that this approach does not require linguistic knowledge. For the structural features, we follow the approach presented in [10]. The structural analysis module is implemented in two steps.

In the first step, both documents of a candidate pair are run through a markup analyzer that acts as a transducer, producing a linear sequence containing three kinds of token [10]: [START:element_label], [END:element_label], and [Chunk:length] (a minimal sketch of such a linearizer is given at the end of this subsection).

In the second step, we align the linearized sequences using a dynamic programming algorithm. After alignment, we compute four scalar values that characterize the quality of the alignment [10]:

• dp: the difference percentage, indicating non-shared material.
• n: the number of aligned non-markup text chunks of unequal length.
• r: the correlation of the lengths of the aligned non-markup chunks.
• p: the significance level of the correlation r.

In addition, we observe that on a bilingual news site, the page that is the translation of an original page is usually created within a short period after the original is published. Therefore, this feature can eliminate many pairs which are not parallel. For example, on bilingual news sites such as BBC and VOA, the Vietnamese pages are published on the same day or one day later than the corresponding English pages [14]. To extract this information, we analyze the HTML tags and then group this feature (publication date) into the structural feature set. Note that this information is extracted from different HTML formats (e.g., a META tag on the BBC site, a SPAN tag containing a date such as 10/04/2009 on the VietnamPlus site, etc.).
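As an illustration of the first step, here is a minimal Python sketch of a markup linearizer in the spirit of [10], built on the standard library's HTMLParser; the exact token spelling and the treatment of whitespace-only text are our own simplifications.

```python
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Turn an HTML document into the [START:...], [END:...], [Chunk:length]
    token sequence used by the structure analysis module."""
    def __init__(self):
        super().__init__()
        self.tokens = []
    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"[START:{tag}]")
    def handle_endtag(self, tag):
        self.tokens.append(f"[END:{tag}]")
    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only chunks
            self.tokens.append(f"[Chunk:{len(text)}]")

def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    return parser.tokens

# e.g. linearize("<p>Hello <b>world</b></p>") ->
# ['[START:p]', '[Chunk:5]', '[START:b]', '[Chunk:5]', '[END:b]', '[END:p]']
```

The two token sequences produced this way are then aligned by dynamic programming to derive dp, n, r, and p.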
D. Classification Modeling

From the content-based and structure analysis of each pair of web pages, we obtain features divided into two categories: content features and structural features. The content features include sim_cognates(A, B), the text length, and the number of paragraphs. The structural features include dp, n, r, p, and the publication date. It is now easy to formulate the task as a classification problem. Each candidate pair of web pages is represented by a vector of these features. We label each pair with 1 or 0 according to whether it is parallel or not; in this way we obtain the training data. In our system, we use a support vector machine algorithm to train the classification system. For a new pair of web pages, we first extract the features to represent it as a vector. This vector goes through the classification system, and the result, 1 or 0, answers whether the pair is parallel or not.

III. EXPERIMENT

A. Data Preparation

We have explored several bilingual English-Vietnamese news sites on the World Wide Web; there are only a few sites with high translation quality. In this system, we experiment with 94,323 pages downloaded from three web sites: 37,665 pages from BBC (http://news.bbc.co.uk), 12,553 pages from VietnamPlus (http://en.vietnamplus.vn, http://www.vietnamplus.vn), and 44,105 pages from VOA News (http://www.voanews.com). Firstly, we perform host crawling on the specified domains and download the HTML pages. All the collected data is analyzed by the CB module to filter candidate pairs. We used the following thresholds: sim_cognates(A, B) > 0.5, and a publication date difference of at most one day. As a result, we excluded over 90% of the pairs, which are not considered candidates, and obtained 1,170 candidate pairs for which it must be determined whether each pair is parallel or not. Next, all data obtained from the content filtering module goes into the structure module to extract the designed features. We then labeled each candidate pair with 1 or 0: a pair is labeled 1 if it is parallel and 0 otherwise. Among these 1,170 candidate pairs, there are 433 pairs labeled 1 and 737 pairs labeled 0. After that, we convert this data into the format <label> <index1>:<value1> <index2>:<value2> ..., which is suitable for the LIBSVM tool (http://www.csie.ntu.edu.tw/~cjlin/libsvm/).
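As a concrete illustration, the following Python sketch writes one candidate pair in the LIBSVM input format. The eight-feature layout mirrors the lists in Sections II-B and II-C, but the exact feature ordering is our assumption and is not specified in the paper.

```python
def to_libsvm_line(label, features):
    """Render one candidate pair as a LIBSVM training line:
    '<label> 1:<v1> 2:<v2> ...'."""
    parts = [str(label)]
    parts += [f"{i}:{v}" for i, v in enumerate(features, start=1)]
    return " ".join(parts)

# Assumed feature order: sim_cognates, length ratio, paragraph ratio,
# dp, n, r, p, publication-date difference in days.
example = to_libsvm_line(1, [0.83, 0.95, 1.0, 0.12, 3, 0.98, 0.001, 0])
# -> "1 1:0.83 2:0.95 3:1.0 4:0.12 5:3 6:0.98 7:0.001 8:0"
```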
B. Experimental Results

We conduct a 5-fold cross-validation experiment; each fold has 234 test items and 936 training items. To investigate the effectiveness of the different kinds of features, we design three feature sets: the feature set containing only content-based (CB) features, the feature set containing only structure-based (SB) features, and the feature set that includes both kinds of features (i.e., CB and SB features). We use the three well-known measures for evaluation, as follows:

Precision = (number of pairs labeled 1 that are true) / (total number of pairs labeled 1 in the output data)   (2)

Recall = (number of pairs labeled 1 that are true) / (total number of pairs labeled 1 in the test data)   (3)

F-Score = 2 * Precision * Recall / (Precision + Recall)   (4)
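For reference, a small Python sketch of measures (2)-(4), computed from 0/1 gold labels and predictions (our own helper, not code from the paper):

```python
def precision_recall_f(gold, predicted):
    """Compute formulas (2)-(4) from parallel lists of 0/1 labels."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    pred_pos = sum(predicted)   # pairs labeled 1 in the output data
    gold_pos = sum(gold)        # pairs labeled 1 in the test data
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```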
It is worth noting that, to compare our approach with previous approaches to content-based features, we also conduct an experiment like that in [15]. That study measures the similarity of content based on aligning word translations between the two texts. Here, we use a bilingual English-Vietnamese dictionary to compute a content-based similarity score. For each pair of texts (or web pages), the similarity score is defined as follows:

sim(A, B) = (number of translation token pairs) / (number of tokens in text A)   (5)

With this experiment we obtained the results shown in Table I.

TABLE I. EVALUATING CONTENT-BASED MATCHING (USING THE BILINGUAL DICTIONARY TO MATCH WORD-WORD PAIRS)

          Precision   Recall   F-Score
Fold 1      0.688      0.484    0.568
Fold 2      0.647      0.478    0.550
Fold 3      0.643      0.548    0.591
Fold 4      0.601      0.569    0.584
Fold 5      0.682      0.528    0.595
Average     0.652      0.521    0.578

Table II shows our experimental results using the content-based features, Table III shows our results using the structure-based features, and Table IV contains the results obtained by combining both kinds of features.

TABLE II. EVALUATING CONTENT-BASED MATCHING (USING THE EXTRACTED FEATURES FROM THE CONTENT-BASED FILTERING MODULE)

          Precision   Recall   F-Score
Fold 1      0.831      0.907    0.867
Fold 2      0.823      0.864    0.843
Fold 3      0.810      0.836    0.823
Fold 4      0.878      0.765    0.818
Fold 5      0.931      0.803    0.862
Average     0.855      0.835    0.843

TABLE III. EVALUATING STRUCTURE-BASED MATCHING

          Precision   Recall   F-Score
Fold 1      0.409      0.620    0.493
Fold 2      0.518      0.614    0.562
Fold 3      0.397      0.614    0.482
Fold 4      0.451      0.763    0.567
Fold 5      0.444      0.654    0.529
Average     0.444      0.653    0.529

TABLE IV. COMBINING STRUCTURAL AND CONTENT-BASED MATCHING

          Precision   Recall   F-Score
Fold 1      0.873      0.817    0.844
Fold 2      0.862      0.842    0.852
Fold 3      0.869      0.879    0.874
Fold 4      0.904      0.817    0.858
Fold 5      0.904      0.733    0.810
Average     0.882      0.817    0.848

It is worth noting that in a task such as this (extracting a parallel corpus), precision is the most important criterion for evaluating the effectiveness of the system. From the results shown in the tables above, we can see that the precision when using the CB feature set (85.5%) is much higher than when using the SB feature set (44.4%), and our approach to extracting CB features is also much better than the approach of [15], which obtains only 65.2% precision. The combination of both kinds of features gives the best result, with a precision of 88.2%. These results show that the CB features we propose are highly effective. They also suggest that if we are not sure about the structural correspondence between two web pages, we can rely on the content-based features alone.

IV. CONCLUSION

This paper has presented our work on extracting a parallel corpus from the Web for the English-Vietnamese language pair. We have proposed a new approach for measuring the content similarity of two pages which does not require deep linguistic analysis, and we have utilized both structural features and content-based features under a machine learning framework. The obtained results show that the content-based features we propose are the major information for determining whether a pair of web pages is parallel. In addition, our approach can be applied to other pairs of languages, because the features used in the proposed model are independent of language. In the future we will extend our work to extracting smaller parallel components such as paragraphs, sentences, or phrases. This will also be interesting in cases where the translation quality between the bilingual web pages is not good.

Acknowledgment

This work is supported by NAFOSTED (Vietnam's National Foundation for Science and Technology Development).

REFERENCES

[1] Akira Kumano and Hideki Hirakawa. 1994. Building an MT dictionary from parallel texts based on linguistic and statistical information. In Proc. 15th COLING, pages 76-81.
[2] Brown, P., Cocke, J., Della Pietra, S., Della Pietra, V., Jelinek, F., Mercer, R., and Roosin, P. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79-85.
[3] Chen, J. and Nie, J.Y. 2000. Automatic construction of a parallel English-Chinese corpus for cross-language information retrieval. In Proc. ANLP, pages 21-28, Seattle.
[4] Chen, J., Chau, R., and Yeh, C.-H. 2004. Discovering parallel text from the World Wide Web. In Proc. Australasian Workshop on Data Mining and Web Intelligence (DMWI2004).
[5] Davis, M. and Dunning, T. 1995. A TREC evaluation of query translation methods for multi-lingual text retrieval. In Fourth Text Retrieval Conference (TREC-4). NIST.
[6] Martin Volk, Spela Vintar, and Paul Buitelaar. 2003. Ontologies in cross-language information retrieval. Wissensmanagement 2003, pages 43-50.
[7] Melamed, I. D. 1998. Word-to-word models of translation equivalence. IRCS Technical Report 98-08, University of Pennsylvania.
[8] Michel Simard, George F. Foster, and Pierre Isabelle. Using cognates to align sentences in bilingual corpora.
[9] Oard, D. W. 1997. Cross-language text retrieval research in the USA. In Third DELOS Workshop. European Research Consortium for Informatics and Mathematics.
[10] P. Resnik and N. A. Smith. 2003. The Web as a parallel corpus. Computational Linguistics, 29(3):349-380.
[11] Resnik, Philip. 1998. Parallel strands: A preliminary investigation into mining the Web for bilingual text. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98), Langhorne, PA, October 28-31.
[12] Resnik, Philip. 1999. Mining the Web for bilingual text. In Proceedings of the 37th Annual Meeting of the ACL, pages 527-534, College Park, MD, June.
[13] Takehito Utsuro, Hiroshi Ikeda, Masaya Yamane, Yuji Matsumoto, and Makoto Nagao. 1994. Bilingual text matching using bilingual dictionary and statistics. In Proc. 15th COLING, pages 1076-1082.
[14] Van B. Dang and Ho Bao-Quoc. 2007. Automatic construction of an English-Vietnamese parallel corpus through web mining. In Proceedings of the 5th IEEE International Conference on Computer Science - Research, Innovation and Vision of the Future (RIVF'2007), Hanoi, Vietnam.
[15] Xiaoyi Ma and Mark Liberman. 1999. BITS: A method for bilingual text search over the Web. In Machine Translation Summit VII, September 1999.