AUTOMATIC EVALUATION OF MACHINE TRANSLATION, PARAPHRASE GENERATION, AND SUMMARIZATION: A LINEAR-PROGRAMMING-BASED ANALYSIS

LIU CHANG
Bachelor of Computing (Honours), NUS

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY (SCHOOL OF COMPUTING)
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2013

DECLARATION

I hereby declare that this thesis is my original work and it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in this thesis. This thesis has also not been submitted for any degree in any university previously.

Liu Chang
April 2014

ACKNOWLEDGEMENTS

This thesis would not have been possible without the generous support of the kind people around me, to whom I will be ever so grateful. Above all, I would like to thank my wife Xiaoqing for her love, patience, and sacrifices, and my parents for their support and encouragement. I promise to be a much more engaging husband, son, and father from now on.

I would like to thank my supervisor, Professor Ng Hwee Tou, for his continuous guidance. His high standards for research and writing shaped this thesis more than anyone else.

My sincere thanks also go to my friends and colleagues from the Computational Linguistics Lab, with whom I co-authored many papers: Daniel Dahlmeier, Lin Ziheng, Preslav Nakov, and Lu Wei. I hope our paths will cross again in the future.

Contents

Summary
List of Tables
List of Figures

1 Introduction

2 Literature Review
  2.1 Machine Translation Evaluation
    2.1.1 BLEU
    2.1.2 TER
    2.1.3 METEOR
    2.1.4 MaxSim
    2.1.5 RTE
    2.1.6 Discussion
  2.2 Machine Translation Tuning
  2.3 Paraphrase Evaluation
  2.4 Summarization Evaluation
    2.4.1 ROUGE
    2.4.2 Basic Elements

3 Machine Translation Evaluation
  3.1 TESLA-M
    3.1.1 Similarity Functions
    3.1.2 Matching Bags of N-grams
    3.1.3 Scoring
    3.1.4 Reduction
  3.2 TESLA-B
    3.2.1 Phrase Level Semantic Representation
    3.2.2 Segmenting a Sentence into Phrases
    3.2.3 Bags of Pivot Language N-grams at Sentence Level
    3.2.4 Scoring
  3.3 TESLA-F
  3.4 Experiments
    3.4.1 Pre-processing
    3.4.2 WMT 2009 Into-English Task
    3.4.3 WMT 2009 Out-of-English Task
    3.4.4 WMT 2010 Official Scores
    3.4.5 WMT 2011 Official Scores
  3.5 Analysis
    3.5.1 Effect of function word discounting
    3.5.2 Effect of various other features
  3.6 Summary

4 Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
  4.1 Introduction
  4.2 Motivation
  4.3 The Algorithm
    4.3.1 Basic Matching
    4.3.2 Phrase Matching
    4.3.3 Covered Matching
    4.3.4 The Objective Function
  4.4 Experiments
    4.4.1 IWSLT 2008 English-Chinese Challenge Task
    4.4.2 NIST 2008 English-Chinese Machine Translation Task
    4.4.3 Baseline Metrics
    4.4.4 TESLA-CELAB Correlations
    4.4.5 Sample Sentences
  4.5 Discussion
    4.5.1 Other Languages with Ambiguous Word Boundaries
    4.5.2 Fractional Similarity Measures
    4.5.3 Fractional Weights for N-grams
  4.6 Summary

5 Machine Translation Tuning
  5.1 Introduction
  5.2 Machine Translation Tuning Algorithms
  5.3 Experimental Setup
  5.4 Automatic and Manual Evaluations
  5.5 Discussion
  5.6 Summary

6 Paraphrase Evaluation
  6.1 Introduction
  6.2 Task Definition
  6.3 Paraphrase Evaluation Metric
  6.4 Human Evaluation
    6.4.1 Evaluation Setup
    6.4.2 Inter-judge Correlation
    6.4.3 Adequacy, Fluency, and Dissimilarity
  6.5 TESLA-PEM vs Human Evaluation
    6.5.1 Experimental Setup
    6.5.2 Results
  6.6 Discussion
  6.7 Summary
7 Summarization Evaluation
  7.1 Task Description
  7.2 Adapting TESLA-M for Summarization Evaluation
  7.3 Experiments
  7.4 Summary

8 Conclusion
  8.1 Contributions
  8.2 Software
  8.3 Future Work

Bibliography

Appendix A: A Proof that TESLA with Unit Weight N-grams Reduces to Weighted Bipartite Matching

Summary

Automatic evaluations form an important part of Natural Language Processing (NLP) research. Designing automatic evaluation metrics is not only an interesting research problem in itself; the evaluation metrics also help guide and evaluate algorithms in the underlying NLP task. More interestingly, one approach to tackling an NLP task is to maximize the automatic evaluation score of the NLP task, further strengthening the link between the evaluation metric and the solver for the underlying NLP problem.

Despite their success, the mathematical foundations of most current metrics are capable of modeling only simple features of n-gram matching, such as exact matches (possibly after pre-processing) and single word synonyms. We choose instead to base our proposal on the very versatile linear programming formulation, which allows fractional n-gram weights and fractional similarity measures and is efficiently solvable. We show that this flexibility allows us to model additional linguistic phenomena and to exploit additional linguistic resources.

In this thesis, we introduce TESLA, a family of linear programming-based metrics for various automatic evaluation tasks. TESLA builds on the basic n-gram matching method of the dominant machine translation evaluation metric BLEU, with several features that target the semantics of natural languages. In particular, we use synonym dictionaries to model word level semantics and bitext phrase tables to model phrase level semantics. We also differentiate function words from content words by giving them different weights.

Variants of TESLA are devised for many different evaluation tasks: TESLA-M, TESLA-B, and TESLA-F for the machine translation evaluation of European languages, TESLA-CELAB for the machine translation evaluation of languages with ambiguous word boundaries such as Chinese, TESLA-PEM for paraphrase evaluation, and TESLA-S for summarization evaluation. Experiments show that they are very competitive on the standard test sets in their respective tasks, as measured by correlations with human judgments.
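As a toy illustration of two of the ideas just mentioned, synonym-based fractional similarity and function word discounting, consider the sketch below. It is not the thesis's actual definition of the TESLA similarity functions (those are developed in Chapter 3); the synonym pairs, function word list, and discount value are placeholders assumed purely for the example.

```python
# Toy sketch only: the real metrics draw synonyms from WordNet and
# identify function words by part-of-speech; the sets below are stand-ins.
SYNONYMS = {frozenset(p) for p in [("buy", "purchase"), ("movie", "film")]}
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in"}
FUNCTION_WORD_DISCOUNT = 0.1  # assumed value, not the thesis's setting

def word_similarity(w1: str, w2: str) -> float:
    """Fractional word similarity: exact matches and dictionary synonyms
    both score 1.0; everything else scores 0.0."""
    if w1 == w2 or frozenset((w1, w2)) in SYNONYMS:
        return 1.0
    return 0.0

def word_weight(w: str) -> float:
    """Function words receive a discounted weight relative to content words."""
    return FUNCTION_WORD_DISCOUNT if w in FUNCTION_WORDS else 1.0
```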
List of Tables

3.1 Into-English task on WMT 2009 data.
3.2 Out-of-English task system-level correlation on WMT 2009 data.
3.3 Out-of-English task sentence-level consistency on WMT 2009 data.
3.4 Into-English task on WMT 2010 data. All scores other than TESLA-B are official.
3.5 Out-of-English task system-level correlation on WMT 2010 data. All scores other than TESLA-B are official.
3.6 Out-of-English task sentence-level correlation on WMT 2010 data. All scores other than TESLA-B are official.
3.7 Into-English task on WMT 2011 data.
3.8 Out-of-English task system-level correlation on WMT 2011 data.
3.9 Out-of-English task sentence-level correlation on WMT 2011 data.
3.10 Effect of function word discounting for TESLA-M on the WMT 2009 into-English task.
3.11 Contributions of various features in the WMT 2009 into-English task.
3.12 Contributions of various features in the WMT 2009 out-of-English task.
4.1 Inter-judge Kappa values for the NIST 2008 English-Chinese MT task.
4.2 Correlations with human judgment on the IWSLT 2008 English-Chinese Challenge Task. * denotes better than the BLEU baseline at the 5% significance level; ** denotes better than the BLEU baseline at the 1% significance level.
4.3 Correlations with human judgment on the NIST 2008 English-Chinese MT Task. ** denotes better than the BLEU baseline at the 1% significance level.
4.4 Sample sentences from the IWSLT 2008 test set.
5.1 Z-MERT training times in hours:minutes and the number of iterations.
5.2 Automatic evaluation scores for the French-English task.
5.3 Automatic evaluation scores for the Spanish-English task.
5.4 Automatic evaluation scores for the German-English task.
5.5 Inter-annotator agreement.
5.6 Percentage of times each system produces the best translation.
5.7 Pairwise system comparison for the French-English task. All pairwise differences are significant at the 1% level, except those struck out.
5.8 Pairwise system comparison for the Spanish-English task. All pairwise differences are significant at the 1% level, except those struck out.
5.9 Pairwise system comparison for the German-English task. All pairwise differences are significant at the 1% level, except those struck out.
6.1 Inter-judge correlation for overall paraphrase score.
6.2 Correlation of paraphrase criteria with overall score.
6.3 Correlation of TESLA-PEM with human judgment (overall score).
7.1 Content correlation with human judgment at the summarizer level. Top three scores among AESOP metrics are bolded. A TESLA-S score is bolded when it outperforms all others.

The significance test proceeds as follows:

1. Randomly sample n topics with replacement from the original set of topics.
2. Summarize the sampled topics with the list of machine summarizers.
3. Evaluate the list of summaries from Step 2 with the two evaluation metrics under comparison.
4. Determine which metric gives the higher correlation score.
5. Repeat Steps 1–4 1,000 times.

As we have 44 topics in the TAC 2011 summarization track, n = 44. The percentage of times metric a gives a higher correlation than metric b is said to be the significance level at which a outperforms b.

The findings between TESLA-S and ROUGE-2/ROUGE-SU4 are:

• Initial task: TESLA-S is better than ROUGE-2 at the 99% significance level as measured by Pearson's r.
• Update task: TESLA-S is better than ROUGE-SU4 at the 95% significance level as measured by Pearson's r.
• All other differences are statistically insignificant, including all correlations on Spearman's ρ and Kendall's τ.

The last point can be explained by the fact that Spearman's ρ and Kendall's τ are sensitive only to the system rankings, whereas Pearson's r is sensitive to the magnitude of the differences as well; hence Pearson's r is in general a more sensitive measure.
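The bootstrap comparison above is straightforward to implement. The sketch below is an illustration, not the thesis's code; the per-topic score array layout, the aggregation to summarizer level by averaging over topics, and the function name bootstrap_compare are assumptions made for the example.

```python
import numpy as np
from scipy.stats import pearsonr

def bootstrap_compare(metric_a, metric_b, human, trials=1000, seed=0):
    """Estimate how often metric A correlates better with human judgment
    than metric B over bootstrap resamples of the topics.

    metric_a, metric_b, human: arrays of shape (n_topics, n_summarizers)
    holding per-topic scores for each summarizer (an assumed layout).
    """
    n_topics = human.shape[0]          # n = 44 for TAC 2011
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(trials):
        # Resample topics with replacement.
        idx = rng.integers(0, n_topics, size=n_topics)
        # Summarizer-level scores: average each column over the resampled topics.
        h = human[idx].mean(axis=0)
        r_a, _ = pearsonr(metric_a[idx].mean(axis=0), h)
        r_b, _ = pearsonr(metric_b[idx].mean(axis=0), h)
        # Record which metric correlates better on this resample.
        wins += r_a > r_b
    # The win rate is the significance level at which A outperforms B,
    # e.g. 0.99 corresponds to the 99% level reported above.
    return wins / trials
```

With 44 topics and 1,000 trials this runs in well under a second, so the number of trials is not a computational constraint.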
7.4 Summary

We proposed TESLA-S by adapting TESLA-M for machine translation evaluation to measure summary content coverage. Experimental results on AESOP 2011 showed that TESLA-S is very competitive on both the initial and update tasks.

Chapter 8
Conclusion

8.1 Contributions

In this thesis, we presented a versatile linear programming-based framework for a variety of automatic evaluation tasks in natural language processing, focusing on the semantic aspect of evaluation. Based on this framework, we made a variety of enhancements to the standard n-gram matching procedure in machine translation evaluation, specifically:

• support for fractional n-gram similarity measures and the discounting of function words (TESLA-M);
• the use of parallel texts as a source of phrase synonyms (TESLA-B and TESLA-F); and
• proper handling of multi-character synonyms in machine translation evaluation for Chinese (TESLA-CELAB).

We showed for the first time that practical new generation machine translation evaluation metrics (TESLA-M and TESLA-F) can significantly improve the quality of automatic machine translation compared to BLEU, as measured by human judgment. We hope this will motivate the use of these new generation metrics in the tuning and evaluation of future statistical MT systems.

We also codified the paraphrase evaluation task, proposed its first automatic evaluation metric (TESLA-PEM), and derived a summarization evaluation metric (TESLA-S) which showed good performance in a shared task. Both metrics are based on the same linear programming-based framework proposed for machine translation evaluation.

8.2 Software

All software produced as part of this thesis is available for download from http://www.comp.nus.edu.sg/~nlp/software.html, including:

• TESLA-M, identical implementations in Python and in Java;
• TESLA-B, implemented in Python;
• TESLA-F, implemented in Python;
• Joshua tuning with TESLA-M/TESLA-F;
• TESLA-CELAB, implemented in Python;
• TESLA-PEM, implemented in Python; and
• TESLA-S, implemented in Python.

8.3 Future Work

The thesis leaves open some worthy questions for future work.

• Compared to TESLA-M, TESLA-F often achieves much better system-level correlation for the into-English task. However, its performance in the out-of-English task is not very robust, likely due to poorer linguistic resources such as the language model. Therefore, we have to recommend TESLA-F for the into-English tasks and TESLA-M for the out-of-English tasks. Future work can attempt to redesign TESLA-F so that a single metric can be recommended for all machine translation evaluation tasks.

• Semantic Textual Similarity (STS) is a pilot shared task in SemEval 2012 (http://www.cs.york.ac.uk/semeval-2012/task6) and a shared task in SemEval 2013 (http://ixa2.si.ehu.es/sts/), where participants submit systems that examine the degree of semantic equivalence between two sentences. The TESLA family of metrics can be adapted for this task.

• As discussed in Section 4.5.1, it is interesting to apply TESLA-CELAB to languages such as Japanese, Thai, and German. This will shed light on the extent to which the problem of ambiguous word boundaries in NLP is specific to Chinese.
Appendix A

A Proof that TESLA with Unit Weight N-grams Reduces to Weighted Bipartite Matching

At the heart of the TESLA machine translation evaluation metric is a linear programming problem. We call this problem LP and define it as follows: find the weights w(x_i, y_j) that maximize

    S_{lp} = \sum_{i,j} s(x_i, y_j) \, w(x_i, y_j)    (A.1)

subject to

    w(x_i, y_j) \ge 0                 \forall i, j    (A.2)
    \sum_j w(x_i, y_j) \le x_i^W      \forall i       (A.3)
    \sum_i w(x_i, y_j) \le y_j^W      \forall j       (A.4)

where

• x_i^W and y_j^W are the weights of n-grams x_i and y_j,
• s(x_i, y_j) is the similarity measure between n-grams x_i and y_j, and
• w(x_i, y_j) is the weight assigned to the link between n-grams x_i and y_j.

We now show that when all the x_i^W and y_j^W are 1, LP reduces to a weighted bipartite matching problem BP. Mathematically, BP is defined as follows: find the match M between the sets {x_i} and {y_j} that maximizes

    S_{bp} = \sum_{(x_i, y_j) \in M} s(x_i, y_j)    (A.5)

Each x_i and y_j can appear in M at most once.

Lemma A.1. max(S_lp) ≥ max(S_bp).

Proof. We observe that every match M can be described by an equivalent set of w(x_i, y_j)
such that S_lp = S_bp:

    w(x_i, y_j) = \begin{cases} 1 & \text{if } (x_i, y_j) \in M \\ 0 & \text{otherwise} \end{cases}

Conditions A.3 and A.4 are satisfied because each x_i and y_j can appear in M at most once. This implies that every solution of BP is also a solution of LP; max(S_lp) ≥ max(S_bp) follows as a direct result.

Lemma A.2. max(S_lp) ≤ max(S_bp).

Proof. The constraints of LP (A.2–A.4) can be seen as those of a maximum flow problem; therefore the constraint matrix of LP is totally unimodular. As the bounds (0, x_i^W, y_j^W) are all integers, it follows that LP has an all-integer optimal solution. Given the constraints, an all-integer solution implies that every w(x_i, y_j) is either 0 or 1, and that at most one w(x_i, y_j) can be 1 for each x_i and y_j. Such a solution can be mapped to an equivalent solution of BP such that S_lp = S_bp. Specifically,

    (x_i, y_j) \in M \text{ if and only if } w(x_i, y_j) = 1    (A.6)

Hence there exists an optimal solution of LP which is also a solution of BP; max(S_lp) ≤ max(S_bp) follows as a direct result.

We conclude from Lemmas A.1 and A.2 that max(S_lp) = max(S_bp), and that the problem LP is equivalent to the problem BP, i.e., that TESLA with unit weight n-grams reduces to a weighted bipartite matching problem.
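The equivalence can also be checked numerically. The following sketch is not part of the thesis: it solves a small random instance of LP with all n-gram weights set to 1 using scipy.optimize.linprog and compares the optimum against a maximum-weight bipartite matching computed by scipy.optimize.linear_sum_assignment. The matrix sizes and random similarity scores are arbitrary choices for the demonstration.

```python
import numpy as np
from scipy.optimize import linprog, linear_sum_assignment

rng = np.random.default_rng(42)
m, n = 4, 5                        # |{x_i}| and |{y_j}|
s = rng.random((m, n))             # similarity scores s(x_i, y_j) in [0, 1]

# LP with unit n-gram weights: maximize sum_ij s_ij * w_ij subject to
# w_ij >= 0 (A.2), row sums <= 1 (A.3), and column sums <= 1 (A.4).
# linprog minimizes, so the objective is negated.
row_constraints = np.kron(np.eye(m), np.ones((1, n)))  # sum_j w_ij <= 1
col_constraints = np.kron(np.ones((1, m)), np.eye(n))  # sum_i w_ij <= 1
res = linprog(
    c=-s.ravel(),
    A_ub=np.vstack([row_constraints, col_constraints]),
    b_ub=np.ones(m + n),
    bounds=(0, None),
)
lp_optimum = -res.fun

# Maximum-weight bipartite matching on the same similarity matrix.
rows, cols = linear_sum_assignment(s, maximize=True)
bp_optimum = s[rows, cols].sum()

assert np.isclose(lp_optimum, bp_optimum)
print(f"LP optimum = {lp_optimum:.6f}, matching optimum = {bp_optimum:.6f}")
```

Because the similarity scores are nonnegative, a maximum-weight assignment over the smaller side of the bipartite graph attains the maximum-weight matching value, which is why linear_sum_assignment suffices for the comparison.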
