acl acl2012 acl2012-158 acl2012-158-reference knowledge-graph by maker-knowledge-mining

158 acl-2012-PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning

Source: pdf

Author: Boxing Chen ; Roland Kuhn ; Samuel Larkin

Abstract: Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU. In principle, tuning on these metrics should yield better systems than tuning on BLEU. However, due to issues such as speed, requirements for linguistic resources, and optimization difficulty, they have not been widely adopted for tuning. This paper presents PORT , a new MT evaluation metric which combines precision, recall and an ordering metric and which is primarily designed for tuning MT systems. PORT does not require external resources and is quick to compute. It has a better correlation with human judgment than BLEU. We compare PORT-tuned MT systems to BLEU-tuned baselines in five experimental conditions involving four language pairs. PORT tuning achieves 1 consistently better performance than BLEU tuning, according to four automated metrics (including BLEU) and to human evaluation: in comparisons of outputs from 300 source sentences, human judges preferred the PORT-tuned output 45.3% of the time (vs. 32.7% BLEU tuning preferences and 22.0% ties). 1

reference text

S. Banerjee and A. Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of ACL Workshop on Intrinsic & Extrinsic Evaluation Measures for Machine Translation and/or Summarization. A. Birch and M. Osborne. 2011. Reordering Metrics for MT. In Proceedings of ACL. C. Callison-Burch, C. Fordyce, P. Koehn, C. Monz and J. Schroeder. 2008. Further Meta-Evaluation of Machine Translation. In Proceedings of WMT. C. Callison-Burch, M. Osborne, and P. Koehn. 2006. Re-evaluating the role of BLEU in machine translation research. In Proceedings of EACL. C. Callison-Burch, P. Koehn, C. Monz, K. Peterson, M. Przybocki and O. Zaidan. 2010. Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation. In Proceedings of WMT. C. Callison-Burch, P. Koehn, C. Monz and O. Zaidan. 2011. Findings of the 2011 Workshop on Statistical Machine Translation. In Proceedings of WMT. D. Cer, D. Jurafsky and C. Manning. 2010. The Best Lexical Metric for Phrase-Based Statistical MT System Optimization. In Proceedings of NAACL. Y. S. Chan and H. T. Ng. 2008. MAXSIM: A maximum similarity metric for machine translation evaluation. In Proceedings of ACL. B. Chen and R. Kuhn. 2011. AMBER: A Modified BLEU, Enhanced Ranking Metric. In: Proceedings of WMT. Edinburgh, UK. July. D. Chiang, S. DeNeefe, Y. S. Chan, and H. T. Ng. 2008. Decomposability of translation metrics for improved evaluation and efficient algorithms. In Proceedings of EMNLP, pages 610–619. M. Denkowski and A. Lavie. 2010. Meteor-next and the meteor paraphrase tables: Improved evaluation support for five target languages. In Proceedings of the Joint Fifth Workshop on SMT and MetricsMATR, pages 314–3 17. G. Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of HLT. J. L. Fleiss. 1971. Measuring nominal scale agreement among many raters. In Psychological Bulletin, Vol. 76, No. 5 pp. 378–382. 938 Y. He, J. Du, A. Way and J. van Genabith. 2010. The DCU dependency-based metric in WMTMetricsMATR 2010. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 324–328. H. Isozaki, T. Hirao, K. Duh, K. Sudoh, H. Tsukada. 2010. Automatic Evaluation of Translation Quality for Distant Language Pairs. In Proceedings of EMNLP. M. Kendall. 1938. A New Measure of Rank Correlation. In Biometrika, 30 (1–2), pp. 81–89. P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin and E. Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of ACL, pp. 177-180, Prague, Czech Republic. A. Lavie and M. J. Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation, 23. C. Liu, D. Dahlmeier, and H. T. Ng. 2010. TESLA: Translation evaluation of sentences with linearprogramming-based analysis. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 329–334. C. Liu, D. Dahlmeier, and H. T. Ng. 2011. Better evaluation metrics lead to better machine translation. In Proceedings of EMNLP. C. Lo and D. Wu. 2011. MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles. In Proceedings of ACL. F. J. Och. 2003. Minimum error rate training statistical machine translation. In Proceedings ACL-2003. Sapporo, Japan. in of F. J. Och and H. Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. In Computational Linguistics, 29, pp. 19–5 1. S. Pado, M. Galley, D. Jurafsky, and C.D. Manning. 2009. Robust machine translation evaluation with entailment features. In Proceedings of ACL-IJCNLP. K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL. K. Parton, J. Tetreault, N. Madnani and M. Chodorow. 2011. E-rating Machine Translation. In Proceedings of WMT. M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of Association for Machine Translation in the Americas. M. Snover, N. Madnani, B. Dorr, and R. Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, Athens, Greece. C. Spearman. 1904. The proof and measurement of association between two things. In American Journal of Psychology, 15, pp. 72–101. S. Vogel, H. Ney, and C. Tillmann. 1996. HMM based word alignment in statistical translation. In Proceedings of COLING. 939