emnlp emnlp2010 emnlp2010-35 emnlp2010-35-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sankaranarayanan Ananthakrishnan ; Rohit Prasad ; David Stallard ; Prem Natarajan
Abstract: Production of parallel training corpora for the development of statistical machine translation (SMT) systems for resource-poor languages usually requires extensive manual effort. Active sample selection aims to reduce the labor, time, and expense incurred in producing such resources, attaining a given performance benchmark with the smallest possible training corpus by choosing informative, nonredundant source sentences from an available candidate pool for manual translation. We present a novel, discriminative sample selection strategy that preferentially selects batches of candidate sentences with constructs that lead to erroneous translations on a held-out development set. The proposed strategy supports a built-in diversity mechanism that reduces redundancy in the selected batches. Simulation experiments on English-to-Pashto and Spanish-to-English translation tasks demonstrate the superiority of the proposed approach to a number of competing techniques, such as random selection, dissimilarity-based selection, and a recently proposed semisupervised active learning strategy.
Nir Ailon and Mehryar Mohri. 2008. An efficient reduction of ranking to classification. In COLT ’08: Proceedings of the 21st Annual Conference on Learning Theory, pages 87–98.
Sankaranarayanan Ananthakrishnan, Rohit Prasad, David Stallard, and Prem Natarajan. 2010. A semisupervised batch-mode active learning strategy for improved statistical machine translation. In CoNLL ’10: Proceedings of the 14th International Conference on Computational Natural Language Learning, pages 126–134, July.
David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan. 1996. Active learning with statistical models. Journal of Artificial Intelligence Research, 4(1):129–145.
Matthias Eck, Stephan Vogel, and Alex Waibel. 2005. Low cost portability for statistical machine translation based on N-gram frequency and TF-IDF. In Proceedings of IWSLT, Pittsburgh, PA, October.
Gholamreza Haffari, Maxim Roy, and Anoop Sarkar. 2009. Active learning for statistical phrase-based machine translation. In NAACL ’09: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 415–423, Morristown, NJ, USA. Association for Computational Linguistics.
Rebecca Hwa. 2004. Sample selection for statistical parsing. Computational Linguistics, 30:253–276.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In NAACL ’03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54, Morristown, NJ, USA. Association for Computational Linguistics.
Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit X: Proceedings of the 10th Machine Translation Summit, pages 79–86.
D. C. Liu and J. Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Math. Program., 45(3):503–528.
Qinggang Meng and Mark Lee. 2008. Error-driven active learning in growing radial basis function networks for early robot learning. Neurocomputing, 71(7–9):1449–1461.
Dan Shen, Jie Zhang, Jian Su, Guodong Zhou, and Chew-Lim Tan. 2004. Multi-criteria-based active learning for named entity recognition. In ACL ’04: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, pages 589–596, Morristown, NJ, USA. Association for Computational Linguistics.
Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of AMTA, pages 223–231, August.
Min Tang, Xiaoqiang Luo, and Salim Roukos. 2002. Active learning for statistical natural language parsing. In ACL ’02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 120–127, Morristown, NJ, USA. Association for Computational Linguistics.