acl acl2013 acl2013-264 acl2013-264-reference knowledge-graph by maker-knowledge-mining

264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation

Source: pdf

Author: Vladimir Eidelman ; Yuval Marton ; Philip Resnik

Abstract: Recent advances in large-margin learning have shown that better generalization can be achieved by incorporating higher order information into the optimization, such as the spread of the data. However, these solutions are impractical in complex structured prediction problems such as statistical machine translation. We present an online gradient-based algorithm for relative margin maximization, which bounds the spread ofthe projected data while maximizing the margin. We evaluate our optimizer on Chinese-English and ArabicEnglish translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant im- provements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.

reference text

Abishek Arun and Philipp Koehn. 2007. Online learning methods for discriminative training of phrase based statistical machine translation. In MT Summit XI. Peter L. Bartlett and Shahar Mendelson. 2003. Rademacher and gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res., 3:463– 482, March. Phil Blunsom, Trevor Cohn, and Miles Osborne. 2008. A discriminative latent variable model for statistical machine translation. In Proceedings of ACL-08: HLT, Columbus, Ohio, June. 7We and other researchers often use combined SMT quality metric. 21(TER−BLEU) as a 1124 Nicol `o Cesa-Bianchi, Alex Conconi, and Claudio Gentile. 2005. A second-order perceptron algorithm. SIAM J. Comput., 34(3):640–668, March. Stanley F. Chen and Joshua Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 3 10–3 18. Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of NAACL. David Chiang, Yuval Marton, and Philip Resnik. 2008. Online large-margin training of syntactic and structural translation features. In Proceedings ofthe Conference on Empirical Methods in Natural Language Processing (EMNLP), Waikiki, Honolulu, Hawaii. David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine translation. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL ’09, pages 218–226. David Chiang. 2012. Hope and fear for discriminative training of statistical translation models. J. Machine Learning Research. Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. Online passive-aggressive algorithms. J. Mach. Learn. Res., 7:551–585. Koby Crammer, Alex Kulesza, and Mark Dredze. 2009a. Adaptive regularization of weight vectors. In Advances in Neural Information Processing Sys- tems 22, pages 414–422. Koby Crammer, Mehryar Mohri, and Fernando Pereira. 2009b. Gaussian margin machines. Journal of Machine Learning Research - Proceedings Track, 5: 105–1 12. Koby Crammer, Mark Dredze, and Fernando Pereira. 2012. Confidence-weighted linear classification for text categorization. J. Mach. Learn. Res., 98888: 1891–1926, June. Mark Dredze and Koby Crammer. 2008. Confidenceweighted linear classification. In In ICML 08: Proceedings of the 25th international conference on Machine learning, pages 264–271. ACM. Chris Dyer, Adam Lopez, Juri Ganitkevitch, Jonathan Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. 2010. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proceedings of ACL System Demonstrations. Vladimir Eidelman. 2012. Optimization strategies for online large-margin learning in machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation. George Foster and Roland Kuhn. 2009. Stabilizing minimum error rate training. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 242–249, Athens, Greece, March. Association for Computational Linguistics. Kevin Gimpel and Noah A. Smith. 2012. Structured ramp loss minimization for machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics. Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1352–1362, Edinburgh, Scotland, UK., July. Association for Computational Linguistics. Thorsten Joachims. 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Claire N ´edellec and C ´eline Rouveirol, editors, European Conference on Machine Learning, pages 137–142, Berlin. Springer. Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, Stroudsburg, PA, USA. Shankar Kumar, Wolfgang Macherey, Chris Dyer, and Franz Och. 2009. Efficient minimum error rate training and minimum bayes-risk decoding for translation hypergraphs and lattices. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 163–171. Young-Suk Lee, Kishore Papineni, Salim Roukos, Ossama Emam, and Hany Hassan. 2003. Language model based Arabic word segmentation. In Proceedings of the 41st Annual Meeting on Associationfor Computational Linguistics - Volume 1, pages 399–406. Percy Liang, Alexandre Bouchard-C oˆt´ e, Dan Klein, and Ben Taskar. 2006a. An end-to-end discriminative approach to machine translation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, ACL-44, pages 761–768. Percy Liang, Alexandre Bouchard-C oˆt´ e, Dan Klein, and Ben Taskar. 2006b. An end-to-end discriminative approach to machine translation. In Proceedings of the 2006 International Conference on Computational Linguistics (COLING) - the Association for Computational Linguistics (ACL). David Mcallester and Joseph Keshet. 2011. Generalization bounds and consistency for latent structural probit and ramp loss. In J. Shawe-Taylor, 1125 R.S. Zemel, P. Bartlett, F.C.N. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2205–2212. Ryan McDonald, Keith Hall, and Gideon Mann. 2010. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 456–464, Los Angeles, California. Franz Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. In Computational Linguistics, volume 29(21), pages 19–51. Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167. Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311–3 18. Pannagadatta Shivaswamy and Tony Jebara. 2009a. Structured prediction with relative margin. In In International Conference on Machine Learning and Applications. Pannagadatta K Shivaswamy and Tony Jebara. 2009b. Relative margin machines. In In Advances in Neural Information Processing Systems 21. MIT Press. Patrick Simianer, Stefan Riezler, and Chris Dyer. 2012. Joint feature selection in distributed stochastic learning for large-scale discriminative training in smt. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jeju Island, Korea, July. David A. Smith and Jason Eisner. 2006. Minimum risk annealing for training log-linear models. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, Sydney, Australia, July. Association for Computational Linguistics. Ben Taskar, Simon Lacoste-Julien, and Michael I. Jordan. 2006. Structured prediction, dual extragradi- ent and bregman projections. J. Mach. Learn. Res., 7:1627–1653, December. Christoph Tillmann and Tong Zhang. 2006. A discriminative global training algorithm for statistical MT. In Proceedings ofthe 2006International Conference on Computational Linguistics (COLING) - the Association for Computational Linguistics (ACL). Huihsin Tseng, Pi-Chuan Chang, Galen Andrew, Daniel Jurafsky, and Christopher Manning. 2005. A conditional random field word segmenter. In Fourth SIGHAN Workshop on Chinese Language Processing. Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. 2004. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the twenty-first international conference on Machine learning, ICML ’04. Vladimir N. Vapnik. 1995. The nature of statistical learning theory. Springer-Verlag New York, Inc., New York, NY, USA. Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLPCoNLL), Prague, Czech Republic, June. Association for Computational Linguistics. 1126