acl acl2011 acl2011-264 acl2011-264-reference knowledge-graph by maker-knowledge-mining

264 acl-2011-Reordering Metrics for MT


Source: pdf

Authors: Alexandra Birch; Miles Osborne

Abstract: One of the major challenges facing statistical machine translation is how to model differences in word order between languages. Although a great deal of research has focussed on this problem, progress is hampered by the lack of reliable metrics. Most current metrics are based on matching lexical items in the translation and the reference, and their ability to measure the quality of word order has not been demonstrated. This paper presents a novel metric, the LRscore, which explicitly measures the quality of word order by using permutation distance metrics. We show that the metric is more consistent with human judgements than other metrics, including the BLEU score. We also show that the LRscore can successfully be used as the objective function when training translation model parameters. Training with the LRscore leads to output which is preferred by humans. Moreover, the translations incur no penalty in terms of BLEU scores.
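
The LRscore builds on permutation distance metrics such as Kendall's tau (Kendall, 1938) and Hamming distance (Hamming, 1950) to score word order directly. Below is a minimal illustrative sketch of such distances and of interpolating a reordering score with a lexical score; the interpolation weight alpha and the lexical score value are hypothetical placeholders, not the paper's exact LRscore formulation or tuned parameters.

    from itertools import combinations

    def kendall_tau_distance(perm):
        """Normalized Kendall's tau distance between a permutation and the
        identity (reference) order: the fraction of word pairs whose relative
        order is inverted. 0 = identical order, 1 = fully reversed."""
        n = len(perm)
        if n < 2:
            return 0.0
        discordant = sum(1 for i, j in combinations(range(n), 2) if perm[i] > perm[j])
        return discordant / (n * (n - 1) / 2)

    def hamming_distance(perm):
        """Normalized Hamming distance: the fraction of words that are not
        in their reference position (absolute disorder)."""
        return sum(1 for i, p in enumerate(perm) if i != p) / len(perm)

    # Example: the hypothesis realizes the reference words in this order.
    perm = [1, 0, 2, 4, 3]                               # two adjacent swaps
    reordering_score = 1.0 - kendall_tau_distance(perm)  # higher is better

    # Illustrative combination with a lexical score (e.g. a smoothed BLEU value);
    # alpha is an assumed interpolation weight for demonstration only.
    alpha, lexical_score = 0.5, 0.40
    lr_like_score = alpha * reordering_score + (1 - alpha) * lexical_score
    print(round(reordering_score, 3), round(lr_like_score, 3))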


reference text

Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for MT evaluation with improved correlation with human judgments. In Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization.

Nicola Bertoldi, Barry Haddow, and Jean-Baptiste Fouet. 2009. Improved Minimum Error Rate Training in Moses. The Prague Bulletin of Mathematical Linguistics, 91:7–16.

Alexandra Birch, Phil Blunsom, and Miles Osborne. 2010. Metrics for MT Evaluation: Evaluating Reordering. Machine Translation, 24(1):15–26.

Ondřej Bojar and Zdeněk Žabokrtský. 2009. CzEng 0.9: Large Parallel Treebank with Rich Annotation. Prague Bulletin of Mathematical Linguistics, 92:63–84.

Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 Workshop on Statistical Machine Translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 1–28, Athens, Greece, March. Association for Computational Linguistics.

Chris Callison-Burch. 2009. Fast, cheap, and creative: evaluating translation quality using Amazon's Mechanical Turk. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 286–295, Singapore, August. Association for Computational Linguistics.

Daniel Cer, Christopher D. Manning, and Daniel Jurafsky. 2010. The best lexical metric for phrase-based statistical MT system optimization. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 555–563, Los Angeles, California, June.

Richard Hamming. 1950. Error detecting and error correcting codes. Bell System Technical Journal, 26(2):147–160.

Maurice Kendall. 1938. A new measure of rank correlation. Biometrika, 30:81–89.

A. Kittur, E. H. Chi, and B. Suh. 2008. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems, pages 453–456. ACM.

Philipp Koehn, Franz Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proceedings of the Human Language Technology and North American Association for Computational Linguistics Conference, pages 127–133, Edmonton, Canada. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of MT Summit.

Alon Lavie and Abhaya Agarwal. 2008. Meteor, m-BLEU and m-TER: Evaluation metrics for high correlation with human rankings of machine translation output. In Proceedings of the Workshop on Statistical Machine Translation at the Meeting of the Association for Computational Linguistics (ACL-2008), pages 115–118.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 104–111, New York City, USA, June. Association for Computational Linguistics.

Chin-Yew Lin and Franz Och. 2004. ORANGE: a method for evaluating automatic evaluation metrics for machine translation. In Proceedings of the Conference on Computational Linguistics, pages 501–507.

Franz J. Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan.

Sebastian Padó, Daniel Cer, Michel Galley, Dan Jurafsky, and Christopher D. Manning. 2009. Measuring machine translation quality as semantic equivalence: A metric based on entailment features. Machine Translation, pages 181–193.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the Association for Computational Linguistics, pages 311–318, Philadelphia, USA.

Matthew Snover, Bonnie Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, pages 223–231.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast—but is it good?: Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263. Association for Computational Linguistics.

Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of Spoken Language Processing, pages 901–904.

Billy Wong and Chunyu Kit. 2009. ATEC: automatic evaluation of machine translation via word choice and word order. Machine Translation, 23(2-3):141–155.