
156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models


Author: Spence Green ; Sida Wang ; Daniel Cer ; Christopher D. Manning

Abstract: We present a fast and scalable online method for tuning statistical machine translation models with large feature sets. The standard tuning algorithm, MERT, only scales to tens of features. Recent discriminative algorithms that accommodate sparse features have produced smaller than expected translation quality gains in large systems. Our method, which is based on stochastic gradient descent with an adaptive learning rate, scales to millions of features and tuning sets with tens of thousands of sentences, while still converging after only a few epochs. Large-scale experiments on Arabic-English and Chinese-English show that our method produces significant translation quality gains by exploiting sparse features. Equally important is our analysis, which suggests techniques for mitigating overfitting and domain mismatch, and applies to other recent discriminative methods for machine translation.
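The abstract names stochastic gradient descent with an adaptive learning rate but does not give the update rule. As a rough illustration only, the sketch below assumes an AdaGrad-style per-feature rate in the spirit of Duchi et al. (2011); the function name `adagrad_step`, the feature names, the gradient values, and the constants `eta` and `eps` are all hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def adagrad_step(weights, grad_sq_sum, gradient, eta=0.1, eps=1e-8):
    """Apply one AdaGrad-style update in place for a sparse gradient.

    weights, grad_sq_sum: dicts mapping feature name -> float
    gradient: dict holding only the features that fired on this example
    eta, eps: assumed base step size and numerical-stability constant
    """
    for feat, g in gradient.items():
        grad_sq_sum[feat] += g * g
        # The per-feature step size shrinks as squared gradients accumulate,
        # so rarely seen sparse features keep a comparatively large rate.
        weights[feat] -= eta * g / (math.sqrt(grad_sq_sum[feat]) + eps)


weights = defaultdict(float)
grad_sq_sum = defaultdict(float)

# Hypothetical sparse gradients from three tuning examples.
for gradient in [{"lm": 0.5, "rule:a->b": 1.0}, {"lm": 0.5}, {"lm": -0.25}]:
    adagrad_step(weights, grad_sq_sum, gradient)
```

Only features present in an example's gradient are touched, which is what makes this kind of update practical when the feature set is very large and each example activates only a small part of it.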

reference text

A. Arun and P. Koehn. 2007. Online learning methods for discriminative training of phrase based statistical machine translation. In MT Summit XI.
L. Bottou and O. Bousquet. 2011. The tradeoffs of large scale learning. In Optimization for Machine Learning, pages 351–368. MIT Press.
D. Cer, D. Jurafsky, and C. D. Manning. 2008. Regularization and search for minimum error rate training. In WMT.
D. Cer, M. Galley, D. Jurafsky, and C. D. Manning. 2010. Phrasal: A statistical machine translation toolkit for exploring new model features. In HLT-NAACL, Demonstration Session.
P-C. Chang, M. Galley, and C. D. Manning. 2008. Optimizing Chinese word segmentation for machine translation performance. In WMT.
C. Cherry and G. Foster. 2012. Batch tuning strategies for statistical machine translation. In HLT-NAACL.
D. Chiang, Y. Marton, and P. Resnik. 2008. Online large-margin training of syntactic and structural translation features. In EMNLP.
D. Chiang, K. Knight, and W. Wang. 2009. 11,001 new features for statistical machine translation. In HLT-NAACL.
D. Chiang. 2012. Hope and fear for discriminative training of statistical translation models. JMLR, 13:1159–1187.
K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. 2006. Online passive-aggressive algorithms. JMLR, 7:551–585.
K. Crammer, A. Kulesza, and M. Dredze. 2009. Adaptive regularization of weight vectors. In NIPS.
J. Duchi and Y. Singer. 2009. Efficient online and batch learning using forward backward splitting. JMLR, 10:2899–2934.
J. Duchi, E. Hazan, and Y. Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159.
M. Galley and C. D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In EMNLP.
K. Gimpel and N. A. Smith. 2012. Structured ramp loss minimization for machine translation. In HLT-NAACL.
K. Gimpel, D. Das, and N. A. Smith. 2010. Distributed asynchronous online learning for natural language processing. In CoNLL.
K. Gimpel. 2012. Discriminative Feature-Rich Modeling for Syntax-Based Machine Translation. Ph.D. thesis, Language Technologies Institute, Carnegie Mellon University.
S. Green and J. DeNero. 2012. A class-based agreement model for generating accurately inflected translations. In ACL.
B. Haddow and P. Koehn. 2012. Analysing the effect of out-of-domain data on SMT systems. In WMT.
B. Haddow, A. Arun, and P. Koehn. 2011. SampleRank training for phrase-based machine translation. In WMT.
E. Hasler, P. Bell, A. Ghoshal, B. Haddow, P. Koehn, F. McInnes, et al. 2012a. The UEDIN systems for the IWSLT 2012 evaluation. In IWSLT.
E. Hasler, B. Haddow, and P. Koehn. 2012b. Sparse lexicalised features and topic adaptation for SMT. In IWSLT.
X. He and L. Deng. 2012. Maximum expected BLEU training of phrase and lexicon translation models. In ACL.
R. Herbrich, T. Graepel, and K. Obermayer. 1999. Support vector learning for ordinal regression. In ICANN.
M. Hopkins and J. May. 2011. Tuning as ranking. In EMNLP.
A. Ittycheriah and S. Roukos. 2007. Direct translation model 2. In HLT-NAACL.
D. Klein and C. D. Manning. 2003. Accurate unlexicalized parsing. In ACL.
P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, et al. 2007. Moses: Open source toolkit for statistical machine translation. In ACL, Demonstration Session.
J. Langford, A. J. Smola, and M. Zinkevich. 2009. Slow learners are fast. In NIPS.
P. Liang and D. Klein. 2009. Online EM for unsupervised models. In HLT-NAACL.
P. Liang, A. Bouchard-Côté, D. Klein, and B. Taskar. 2006a. An end-to-end discriminative approach to machine translation. In ACL.
P. Liang, B. Taskar, and D. Klein. 2006b. Alignment by agreement. In NAACL.
C.-Y. Lin and F. J. Och. 2004. ORANGE: a method for evaluating automatic evaluation metrics for machine translation. In COLING.
M. Maamouri, A. Bies, and S. Kulick. 2008. Enhancing the Arabic Treebank: A collaborative effort toward new annotation guidelines. In LREC.
M. Marcus, M. A. Marcinkiewicz, and B. Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330.
R. McDonald, K. Hall, and G. Mann. 2010. Distributed training strategies for the structured perceptron. In NAACL-HLT.
A. Y. Ng. 2004. Feature selection, L1 vs. L2 regularization, and rotational invariance. In ICML.
F. J. Och and H. Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In ACL.
F. J. Och and H. Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449.
F. J. Och. 2003. Minimum error rate training for statistical machine translation. In ACL.
K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
S. Riezler and J. T. Maxwell. 2005. On some pitfalls in automatic evaluation and significance testing in MT. In ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (MTSE).
P. Simianer, S. Riezler, and C. Dyer. 2012. Joint feature selection in distributed stochastic learning for large-scale discriminative training in SMT. In ACL.
A. Stolcke. 2002. SRILM - an extensible language modeling toolkit. In ICSLP.
C. Tillmann and T. Zhang. 2006. A discriminative global training algorithm for statistical MT. In ACL-COLING.
T. Watanabe, J. Suzuki, H. Tsukada, and H. Isozaki. 2007. Online large-margin training for statistical machine translation. In EMNLP-CoNLL.
T. Watanabe. 2012. Optimized online rank learning for machine translation. In HLT-NAACL. Association for Computational Linguistics.
B. Xiang and A. Ittycheriah. 2011. Discriminative feature-tied mixture modeling for statistical machine translation. In ACL-HLT.
N. Xue, F. Xia, F. Chiou, and M. Palmer. 2005. The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207–238.