VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

HAI-LONG TRIEU

BILINGUAL SENTENCE ALIGNMENT BASED ON SENTENCE LENGTH AND WORD TRANSLATION

Major: Computer science
Code: 60 48 01

MASTER THESIS OF INFORMATION TECHNOLOGY

SUPERVISOR: Dr. Phuong-Thai Nguyen

Hanoi - 2014

ORIGINALITY STATEMENT

"I hereby declare that this submission is my own work and that, to the best of my knowledge, it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at the University of Engineering and Technology (UET) or any other educational institution, except where due acknowledgement is made in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged."

Signed

Acknowledgements

I would like to thank my advisor, Dr. Phuong-Thai Nguyen, not only for his supervision but also for his enthusiastic encouragement, sound suggestions, and the knowledge he has given me throughout my Master's course. I would also like to express my deep gratitude to M.A. Phuong-Thao Thi Nguyen from the Institute of Information Technology, Vietnam Academy of Science and Technology, who provided valuable data for my evaluation process. I would like to thank Dr. Van-Vinh Nguyen for examining my work and offering advice, and M.A. Kim-Anh Nguyen and M.A. Truong Van Nguyen for their help and comments, especially M.A. Kim-Anh Nguyen for supporting and checking several issues in my research. In addition, I would like to thank the lecturers and professors of the Faculty of Information Technology, University of Engineering and Technology (UET), Vietnam National University, Hanoi, who taught and helped me throughout my time at UET. Finally, I would like to thank my family and friends for their support, sharing, and confidence throughout my study.

Abstract

Sentence alignment plays an important role in machine translation. It is an essential task in processing parallel corpora, which are ample and substantial resources for natural language processing. In order to turn these abundant materials into useful applications, parallel corpora first have to be aligned at the sentence level. This process maps sentences in texts of the source language to their corresponding units in texts of the target language. Parallel corpora aligned at the sentence level become a useful resource for a number of applications in natural language processing, including statistical machine translation, word sense disambiguation, and cross-language information retrieval. This task also helps to extract structural information and derive statistical parameters from bilingual corpora.

A number of algorithms with different approaches have been proposed for sentence alignment, and they may be classified into a few major categories. First, there are methods based on the similarity of sentence lengths, which can be measured in words or characters. These methods are simple but effective for language pairs with a high similarity in sentence lengths. The second group of methods is based on word correspondences or the lexicon.
These methods take into account lexical information about the texts, based on matching content in the texts or on using cognates. An external dictionary may be used in these methods, so they are more accurate but slower than the first group. There are also hybrid methods that combine the advantages of the first two approaches, and they therefore obtain alignments of quite high quality.

In this thesis, I summarize general issues related to sentence alignment, evaluate the approaches proposed for this task, and focus on the hybrid approach, especially the proposal of Moore (2002), an effective method with high performance in terms of precision. From an analysis of the limits of this method, I propose an algorithm that uses a new feature, bilingual word clustering, to improve the quality of Moore's method. The baseline method (Moore, 2002) is introduced through an analysis of its framework, and I describe the advantages as well as the weaknesses of this approach. In addition, I describe the background knowledge, the algorithm of bilingual word clustering, and the new feature used in sentence alignment. Finally, the experiments performed in this research are presented, together with evaluations that demonstrate the benefits of the proposed method.

Keywords: sentence alignment, parallel corpora, natural language processing, word clustering

Table of Contents

ORIGINALITY STATEMENT
Acknowledgements
Abstract
Table of Contents
List of Figures
List of Tables
CHAPTER ONE Introduction
  1.1 Background
  1.2 Parallel Corpora
    1.2.1 Definitions
    1.2.2 Applications
    1.2.3 Aligned Parallel Corpora
  1.3 Sentence Alignment
    1.3.1 Definition
    1.3.2 Types of Alignments
    1.3.3 Applications
    1.3.4 Challenges
    1.3.5 Algorithms
  1.4 Thesis Contents
    1.4.1 Objectives of the Thesis
    1.4.2 Contributions
    1.4.3 Outline
  1.5 Summary
CHAPTER TWO Related Works
  2.1 Overview
  2.2 Overview of Approaches
    2.2.1 Classification
    2.2.2 Length-based Methods
    2.2.3 Word Correspondences Methods
    2.2.4 Hybrid Methods
  2.3 Some Important Problems
    2.3.1 Noise of Texts
    2.3.2 Linguistic Distances
    2.3.3 Searching
    2.3.4 Resources
  2.4 Length-based Proposals
    2.4.1 Brown et al., 1991
    2.4.2 Vanilla: Gale and Church, 1993
    2.4.3 Wu, 1994
  2.5 Word-based Proposals
    2.5.1 Kay and Roscheisen, 1993
    2.5.2 Chen, 1993
    2.5.3 Melamed, 1996
    2.5.4 Champollion: Ma, 2006
  2.6 Hybrid Proposals
    2.6.1 Microsoft's Bilingual Sentence Aligner: Moore, 2002
    2.6.2 Hunalign: Varga et al., 2005
    2.6.3 Deng et al., 2007
    2.6.4 Gargantua: Braune and Fraser, 2010
    2.6.5 Fast-Champollion: Li et al., 2010
  2.7 Other Proposals
    2.7.1 Bleu-align: Sennrich and Volk, 2010
    2.7.2 MSVM and HMM: Fattah, 2012
  2.8 Summary
CHAPTER THREE Our Approach
  3.1 Overview
  3.2 Moore's Approach
    3.2.1 Description
    3.2.2 The Algorithm
  3.3 Evaluation of Moore's Approach
  3.4 Our Approach
    3.4.1 Framework
    3.4.2 Word Clustering
    3.4.3 Proposed Algorithm
    3.4.4 An Example
  3.5 Summary
CHAPTER FOUR Experiments
  4.1 Overview
  4.2 Data
    4.2.1 Bilingual Corpora
    4.2.2 Word Clustering Data
  4.3 Metrics
  4.4 Discussion of Results
  4.5 Summary
CHAPTER FIVE Conclusion and Future Work
  5.1 Overview
  5.2 Summary
  5.3 Contributions
  5.4 Future Work
    5.4.1 Better Word Translation Models
    5.4.2 Word-Phrase
Bibliography

List of Figures

Figure 1.1 A sequence of beads (Brown et al., 1991)
Figure 2.1 Paragraph length (Gale and Church, 1993)
Figure 2.2 Equation in dynamic programming (Gale and Church, 1993)
Figure 2.3 A bitext space in Melamed's method (Melamed, 1996)
Figure 2.4 The method of Varga et al., 2005
Figure 2.5 The method of Braune and Fraser, 2010
Figure 2.6 Sentence Alignment Approaches Review
Figure 3.1 Framework of sentence alignment in our algorithm
Figure 3.2 An example of Brown's cluster algorithm
Figure 3.3 English word clustering data
Figure 3.4 Vietnamese word clustering data
Figure 3.5 Bilingual dictionary
Figure 3.6 Looking up the probability of a word pair
Figure 3.7 Looking up in a word cluster
Figure 3.8 Handling the case in which one word is contained in the dictionary
Figure 4.1 Comparison in Precision
Figure 4.2 Comparison in Recall
Figure 4.3 Comparison in F-measure

List of Tables

Table 1.1 Frequency of alignments (Gale and Church, 1993)
Table 1.2 Frequency of beads (Ma, 2006)
Table 1.3 Frequency of beads (Moore, 2002)
Table 1.4 An entry in a probabilistic dictionary (Gale and Church, 1993)
Table 2.1 Alignment pairs (Sennrich and Volk, 2010)
Table 4.1 Training data-1
Table 4.2 Topics in Training data-1
Table 4.3 Training data-2
Table 4.4 Topics in Training data-2
Table 4.5 Input data for training clusters
Table 4.6 Topics for Vietnamese input data to train clusters
Table 4.7 Word clustering data sets

If a word pair (e1, v1) is contained in the dictionary, its probability is specified simply by looking it up in the dictionary (Figure 3.6 Looking up the probability of a word pair). Consider the case in which (e1, v1) is not contained in the dictionary but either e1 or v1 is contained in it, and suppose that e1 is contained in the dictionary. The algorithm looks for the words v2, v3, ... in the cluster that contains v1 (Figure 3.7 Looking up in a word cluster) and then searches for the pairs (e1, v2), (e1, v3), ... in the dictionary (Figure 3.8 Handling the case in which one word is contained in the dictionary). The probability of the word pair is then calculated from these references as the average of the probabilities found:

Pr(e1, v1) = avg(Pr(e1, v2), Pr(e1, v3), ...)

In terms of computational cost, because of the references to clusters in our algorithm, our method is about ten times slower than Moore's method.

3.4.4 An Example

Consider an English-Vietnamese sentence pair:

damodaran 's solution is gelatin hydrolysate , a protein known to act as a natural antifreeze
giải_pháp damodaran chất thủy_phân gelatin , loại protein có chức_năng chất chống đông tự_nhiên

Several of the word pairs in the dictionary created by training IBM Model 1 involve the English words damodaran, 's, solution, is, a, as, and natural. However, not all word pairs are in the dictionary; for instance, the pair (act, chức_năng) is missing. Thus, the algorithm first finds the cluster of each word in this pair. The clusters containing the words "act" and "chức_năng" are as follows:

Cluster 0110001111: act, society, show, departments, helps
Cluster 11111110: chức_năng, hành_vi, phạt, hoạt_động

The bit strings "0110001111" and "11111110" are the names of the clusters. The algorithm then looks up word pairs formed from these clusters in the dictionary and finds four entries (involving the English words "departments" and "act") with the probabilities 9.146747911957206E-4, 0.4258088124678187, 7.407735728457372E-4, and 0.009579429801707052. The next step of the algorithm is to calculate the average of these probabilities, so the probability of the word pair (act, chức_năng) is:

Pr(act, chức_năng) = avg(9.146747911957206E-4, 0.4258088124678187, 7.407735728457372E-4, 0.009579429801707052) = 0.1092609226583920

This word pair, with the probability just calculated, can be used as a new entry in the dictionary.
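To make this cluster-based fallback concrete, here is a minimal Java sketch of the lookup, written under stated assumptions rather than taken from EV-Aligner itself: the dictionary is assumed to be a nested map of translation probabilities (for example, produced by IBM Model 1 training), the Brown cluster output is assumed to have been loaded into word-to-cluster and cluster-to-words maps, and the sketch generalizes the cases above by searching members of both clusters and averaging whatever dictionary entries it finds, as in the worked example. All identifiers are illustrative.

```java
import java.util.*;

/** Minimal sketch of the cluster-based fallback lookup described above (illustrative only). */
public class ClusterLookup {
    // dict.get(english).get(vietnamese) = translation probability, e.g. from IBM Model 1
    private final Map<String, Map<String, Double>> dict;
    // word -> cluster bit string, and cluster bit string -> member words (Brown clustering output)
    private final Map<String, String> wordToCluster;
    private final Map<String, List<String>> clusterToWords;

    public ClusterLookup(Map<String, Map<String, Double>> dict,
                         Map<String, String> wordToCluster,
                         Map<String, List<String>> clusterToWords) {
        this.dict = dict;
        this.wordToCluster = wordToCluster;
        this.clusterToWords = clusterToWords;
    }

    /** Direct lookup of (e, v) if present; otherwise the average over dictionary entries
     *  formed from members of the clusters of e and v. */
    public double probability(String e, String v) {
        Map<String, Double> row = dict.get(e);
        if (row != null && row.containsKey(v)) {
            return row.get(v);                          // the pair is in the dictionary
        }
        double sum = 0.0;
        int found = 0;
        for (String e2 : candidates(e)) {
            Map<String, Double> row2 = dict.get(e2);
            if (row2 == null) continue;
            for (String v2 : candidates(v)) {
                Double p = row2.get(v2);
                if (p != null) { sum += p; found++; }   // e.g. pairs with "departments", "act"
            }
        }
        return found > 0 ? sum / found : 0.0;           // average of the probabilities found
    }

    /** Members of the cluster containing the word, or the word itself if it is unclustered. */
    private List<String> candidates(String word) {
        String cluster = wordToCluster.get(word);
        List<String> members = cluster == null ? null : clusterToWords.get(cluster);
        return members == null ? List.of(word) : members;
    }
}
```

With the clusters and dictionary entries of the example above, probability("act", "chức_năng") would average the four probabilities that are found and return approximately 0.109.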
3.5 Summary

This chapter has described the algorithm I propose to improve Moore's method. The algorithm has been described using pseudo-code and explained with an example. A description of Moore's method has also been given in this chapter. The next chapter presents the experimental results relevant to our approach.

CHAPTER FOUR
Experiments

4.1 Overview

This chapter presents the experimental results for our algorithm (EV-Aligner). Testing has been carried out to validate and evaluate EV-Aligner. EV-Aligner has been tested on two different corpora, which have been used to train and test it in all our experiments. We also use two data sets, in English and Vietnamese, to train word clustering using the algorithm of Brown et al., 1992. We have compared our method with the baseline approach (Moore, 2002). The experiments indicate that EV-Aligner performs better than the baseline and that it significantly outperforms it in terms of recall and overall performance.

Section 4.2 introduces the bilingual corpora used to train and test EV-Aligner as well as the word clustering data sets used in the experiments. The metrics used for evaluating the output are discussed in Section 4.3. Section 4.4 discusses the performance results, and Section 4.5 concludes this chapter.

4.2 Data

4.2.1 Bilingual Corpora

We perform experiments on 66 English-Vietnamese bilingual file pairs extracted from the websites of the World Bank, Science, WHO, and Vietnamtourism, which consist of 1,800 English sentences with 39,526 words (6,309 different words) and 1,828 Vietnamese sentences with 40,491 words (5,721 different words). We align this corpus at the sentence level by hand and obtain 846 sentence pairs. These data are described in Table 4.1 and Table 4.2.

Table 4.1 Training data-1
Table 4.2 Topics in Training data-1

Moreover, to achieve a better result in the experiments, we use an additional 100,000 English-Vietnamese sentence pairs, with 1,743,040 English words (36,149 different words) and 1,681,915 Vietnamese words (25,523 different words), which are available at the VLSP website (http://vlsp.vietlp.org:8080/demo/?page=resources). They are presented in Table 4.3.

Table 4.3 Training data-2

This data set consists of 80,000 sentence pairs on Economics-Social topics and 20,000 sentence pairs on the information technology topic, which are shown in Table 4.4.

Table 4.4 Topics in Training data-2

In order to make the alignment results more accurate, we remove the distinction between lowercase and uppercase. This is reasonable because a word has essentially the same meaning whether it is written in lowercase or uppercase, so we convert all words in these corpora to their lowercase form. In addition, Vietnamese has many compound words, and the accuracy of word translation can be increased if compound words are recognized rather than kept as single words. All words in the Vietnamese data set are therefore tokenized into compound words; the tool for this task is also available at the same website.
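As a rough, illustrative sketch of this preprocessing (not the actual tool used here), the Java snippet below lowercases each line of a corpus file and, for the Vietnamese side, passes it to a word segmenter; segmentCompounds is a hypothetical placeholder for whatever Vietnamese segmentation tool is available.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Illustrative corpus preprocessing: lowercasing and (for Vietnamese) compound-word tokenization. */
public class Preprocess {

    /** Hypothetical hook for a real Vietnamese word segmenter; this stub leaves the line unchanged. */
    static String segmentCompounds(String line) {
        return line; // a real implementation would join compounds, e.g. "thủy phân" -> "thủy_phân"
    }

    static void preprocess(Path in, Path out, boolean vietnamese) throws IOException {
        List<String> cleaned = Files.readAllLines(in, StandardCharsets.UTF_8).stream()
                .map(String::trim)
                .map(s -> s.toLowerCase(Locale.ROOT))            // remove the case distinction
                .map(s -> vietnamese ? segmentCompounds(s) : s)  // tokenize Vietnamese compounds
                .collect(Collectors.toList());
        Files.write(out, cleaned, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        preprocess(Path.of("corpus.en"), Path.of("corpus.lc.en"), false);
        preprocess(Path.of("corpus.vi"), Path.of("corpus.lc.vi"), true);
    }
}
```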
4.2.2 Word Clustering Data

To apply the word clustering feature in our approach, we use two word clustering data sets, one for English and one for Vietnamese, in the experiments. We create these data sets with Brown's word clustering algorithm, run on the two input data sets described in Table 4.5. The input English data set is extracted from a part of the British National Corpus and contains 1,044,285 sentences (approximately 22 million words).

Table 4.5 Input data for training clusters

The input Vietnamese data set, meanwhile, is the Viettreebank data set, consisting of 700,000 sentences with somewhere in the vicinity of 15 million words: Political-Social topics from the 70,000 sentences of the Vietnamese treebank, and the rest from topics of the websites laodong, tuoitre, and PC World. With 700 clusters in each data set, the word items of these data sets cover more than 81 percent of those of the input corpora. These data are described in Table 4.6 and Table 4.7.

Table 4.6 Topics for Vietnamese input data to train clusters
Table 4.7 Word clustering data sets

4.3 Metrics

We use the following metrics to evaluate sentence aligners: Precision, Recall, and F-measure. Precision is defined as the fraction of retrieved documents that are in fact relevant, and Recall as the fraction of relevant documents that are retrieved by the algorithm. The F-measure characterizes the combined performance of Recall and Precision [10]:

Precision = C / A
Recall = C / G
F-measure = 2 * Precision * Recall / (Precision + Recall)

where C is the number of sentence pairs created by the aligner that match those aligned by hand, A is the number of sentence pairs created by the aligner, and G is the number of sentence pairs aligned by hand.
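As a small, self-contained illustration of these metrics (a sketch, not code from EV-Aligner), the following Java class computes Precision, Recall, and F-measure from a set of aligner-produced sentence pairs and a set of hand-aligned reference pairs.

```java
import java.util.HashSet;
import java.util.Set;

/** Precision, Recall, and F-measure for sentence alignment output (illustrative only). */
public class AlignmentMetrics {

    /** Pairs are encoded as strings such as "srcIndex-tgtIndex"; any stable encoding works. */
    public static double[] evaluate(Set<String> produced, Set<String> reference) {
        Set<String> correct = new HashSet<>(produced);
        correct.retainAll(reference);               // pairs that match the hand alignment

        double precision = produced.isEmpty() ? 0.0 : (double) correct.size() / produced.size();
        double recall = reference.isEmpty() ? 0.0 : (double) correct.size() / reference.size();
        double f = (precision + recall) == 0.0 ? 0.0
                 : 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f };
    }

    public static void main(String[] args) {
        // Toy example: 3 of 4 produced pairs are correct, out of 5 reference pairs.
        Set<String> produced = Set.of("0-0", "1-1", "2-3", "4-4");
        Set<String> reference = Set.of("0-0", "1-1", "2-2", "3-3", "4-4");
        double[] prf = evaluate(produced, reference);
        System.out.printf("P=%.3f R=%.3f F=%.3f%n", prf[0], prf[1], prf[2]);
        // Expected: P=0.750 R=0.600 F=0.667
    }
}
```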
4.4 Discussion of Results

We conduct experiments and compare our approach, implemented in Java, with the baseline algorithm M-Align (Bilingual Sentence Aligner, Moore 2002). We evaluate the approaches over a range of thresholds from 0.5 to 0.99 in the initial alignment. The threshold used in the final alignment is 0.9 to ensure high reliability.

Figure 4.1 Comparison in Precision (EV-Aligner: our approach; M-Align: Bilingual Sentence Aligner, Moore 2002)

Figure 4.1 illustrates the precision of the two approaches with the threshold of the length-based phase spread from 0.5 to 0.99. M-Align obtains a precision approximately 9% higher than that of our approach. At the threshold of 0.5 in the length-based phase, the precision of the two approaches is 60.99% and 69.30%; these results are 61.13% and 70.61% when the threshold is set to the highest rate of 0.99. In general, precision increases gradually as the threshold of the initial alignment is raised. The approaches reach their highest precision, 62.55% for our approach and 72.46% for M-Align, at the threshold of 0.9.

The recall of the two approaches is described in Figure 4.2. Our approach obtains a recall significantly higher than that of M-Align, by up to more than 30%. At the threshold of 0.5, the recall is 75.77% for EV-Aligner and 51.77% for M-Align, while at the threshold of 0.99 it is 74.35% (EV-Aligner) and 43.74% (M-Align). In our approach, the recall fluctuates only slightly, in the range of about 73.64% to 75.77%, because word clustering compensates for the lack of lexical information. Meanwhile, when the threshold is increased, the recall of M-Align drops rather considerably.

Figure 4.2 Comparison in Recall (EV-Aligner: our approach; M-Align: Bilingual Sentence Aligner, Moore 2002)

We perform the experiment by decreasing the threshold of the length-based phase of the initial alignment from 0.99 to 0.5 in order to evaluate the impact of the dictionary on the quality of the alignment. When a lower threshold is used, the number of word items in the dictionary increases, which leads to a growth in recall. M-Align usually obtains a high precision rate; however, the weakness of this method is its quite low recall, particularly when facing sparse data. Such data result in a low-accuracy dictionary, which is the key factor behind the poor recall of the approach of Moore (2002), since it relies on the word translation model (IBM Model 1) alone. Our approach, meanwhile, deals with this issue more flexibly. If the quality of the dictionary is good enough, a reference to IBM Model 1 already gives a rather accurate output. Moreover, the word clustering data sets, which provide additional translation word pairs by mapping words through their clusters, resolve the sparse data problem rather thoroughly. Mapping words not found in the dictionary to a single common token, as in Moore (2002), leads to quite low accuracy in the lexical phase, so many sentence pairs are not found by the aligner. Using the word clustering feature instead improves the quality of the lexical phase, and the performance therefore increases significantly.
experiments indicated a better performance of our approach compare to other hybrid approaches Word clustering is a useful application and could be utilized in sentence alignment that is able to improve the performance of the aligner 5.3 Contributions The contributions of our work are as follows: 58 Proposed a new algorithm called EV-Aligner, which is based on Moore‟s method and uses a new feature, word clustering Evaluated a number of sentence alignment methods to have a general view of research in this field Found in our experiments that better results can be achieved by using word clustering in case of sparse data Found in our experiments that EV-Aligner performs better than other sentence alignment tools and that it significantly outperforms the baseline one (M-Align) Showed in our experiments that the word clustering are useful for sentence alignment and can therefore serve as a useful feature 5.4 Future Work This section discusses some ways to improve the quality of the output: 5.4.1 Better Word Translation Models I intend to improve alignment quality by using better word translation model to make bilingual word dictionary such as IBM Model 5.4.2 Word-Phrase It is very useful when sentence aligners effectively carry out and gain a high performance In near future, we try not only to improve the quality of sentence alignment by using more new features such as assessing the correlation of sentence pairs based on word phrases 59 Bibliography Anil Kumar Singh and Samar Husain 2005 Comparison, selection and use of sentence alignment algorithms for new language pairs In Proceedings of ACL 2005 Workshop on Parallel Text, Ann Arbor, Michigan Braune, Fabienne and Fraser, Alexander (2010), Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora In Proceedings of the 23rd International Conference on Computational Linguistics: Posters 81-89 Brown, Peter F and Lai, Jennifer C and Mercer, Robert L (1991), Aligning sentences in parallel corpora In Proceedings of the 29th annual meeting on Association for Computational Linguistics 169-176 Berkeley, California Brown, Peter F and Desouza, Peter V and Mercer, Robert L and Pietra, Vincent J Della and Lai, Jenifer C (1992), Class-based n-gram models of natural language Computational linguistics vol 18, 4, 467-479 Chen, Stanley F.(1993), Aligning sentences in bilingual corpora using lexical information In Proceedings of the 31st annual meeting on Association for Computational Linguistics 9-16 Deng, Y., Kumar, S., & Byrne, W (2007) Segmentation and alignment of parallel text for statistical machine translation Natural Language Engineering,13(3), 235-260 Fattah, Mohamed Abdel (2012), The Use of MSVM and HMM for Sentence Alignment JIPS vol 8, 2, 301-314 Gale, William A and Church, Kenneth W (1993), A program for aligning sentences in bilingual corpora Computational linguistics vol 19, 1, 75-102 Güngör, A M., & Taşcı, Ş (2006) TURKISH-ENGLISH SENTENCE ALIGNMENT 10 Huang, Yuanpeng J., Robert Powers, and Gaetano T Montelione "Protein NMR recall, precision, and F-measure scores (RPF scores): structure quality assessment measures based on information retrieval statistics." 
Journal of the American Chemical Society 127.6 (2005): 1665-1674 11 Kay, Martin, and Martin Röscheisen (1993), Text-translation alignment computational Linguistics vol 19, 1, 121-142 60 12 Li, P., Sun, M., & Xue, P (2010, August) Fast-Champollion: a fast and robust sentence alignment algorithm In Proceedings of the 23rd International Conference on Computational Linguistics: Posters (pp 710-718) Association for Computational Linguistics 13 Ma, Xiaoyi (2006), Champollion: a robust parallel text sentence aligner In LREC 2006: Fifth International Conference on Language Resources and Evaluation 489492 14 Melamed, I D (1996), A geometric approach to mapping bitext correspondence In Proceedings of the Conference on Empirical Methods in Natural Language Processing 1-12 15 Moore, Robert C (2002), Fast and Accurate Sentence Alignment of Bilingual Corpora In Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users 135-144 16 Sadaf Abdul-Rauf, Mark Fishel, Patrik Lambert, Sandra Noubours, Rico Sennrich (2012), Extrinsic Evaluation of Sentence Alignment Systems In Proc of the LREC Workshop on Creating Cross-language Resources for Disconnected Languages and Styles (CREDISLAS) Istanbul, Turkey, pages 6-10 17 Sennrich, R., and Volk, M (2010), MT-based sentence alignment for OCRgenerated parallel texts In The Ninth Conference of the Association for Machine Translation in the Americas (AMTA 2010), Denver, Colorado 18 Trieu, H L., Nguyen, P T., & Nguyen, K A (2014) Improving Moore’s Sentence Alignment Method Using Bilingual Word Clustering In Knowledge and Systems Engineering Springer International Publishing 149-160 19 Varga, Dániel and Németh, László and Halácsy, Péter and Kornai, András and Trón, Viktor and Nagy, Viktor (2005), Parallel corpora for medium density languages In Proceedings of the RANLP 2005 590-596 20 Wu, Dekai (1994), Aligning a parallel English-Chinese corpus statistically with lexical criteria In Proceedings of the 32nd annual meeting on Association for Computational Linguistics 80-87 61 ...VIETNAM NATIONAL UNIVERSITY, HANOI UNIVERSITY OF ENGINEERING AND TECHNOLOGY HAI-LONG TRIEU BILINGUAL SENTENCE ALIGNMENT BASED ON SENTENCE LENGTH AND WORD TRANSLATION Major: Computer science... with length -based methods Because they use the lexical information from source and translation lexicons rather than only sentence length to determine the translation relationship between sentences... language information retrieval, word disambiguation, sense disambiguation, bilingual lexicography, automatic translation verification, automatic acquisition of knowledge about translation, and cross-language