
5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

Source: pdf

Author: Minh-Thang Luong ; Preslav Nakov ; Min-Yen Kan

Abstract: We propose a language-independent approach for improving statistical machine translation for morphologically rich languages using a hybrid morpheme-word representation where the basic unit of translation is the morpheme, but word boundaries are respected at all stages of the translation process. Our model extends the classic phrase-based model by means of (1) word boundary-aware morpheme-level phrase extraction, (2) minimum error-rate training for a morpheme-level translation model using word-level BLEU, and (3) joint scoring with morpheme- and word-level language models. Further improvements are achieved by combining our model with the classic one. The evaluation on English to Finnish using Europarl (714K sentence pairs; 15.5M English words) shows statistically significant improvements over the classic model based on BLEU and human judgments.
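Point (2) of the abstract implies that a morpheme-level hypothesis must be mapped back to words before word-level BLEU can be computed. The following is a minimal sketch of that mapping, not the authors' implementation. It assumes a hypothetical marking convention in which a trailing `+` on a token means the morpheme attaches to the following token; the function name `morphemes_to_words` and the example segmentation are made up for illustration.

```python
def morphemes_to_words(tokens, marker="+"):
    """Join a morpheme sequence into words.

    Assumption (not from the paper): a token ending in `marker`
    continues into the next token; any other token ends a word.
    """
    words = []
    current = ""
    for tok in tokens:
        if tok.endswith(marker) and len(tok) > len(marker):
            # Continuation morpheme: strip the marker and keep building the word.
            current += tok[: -len(marker)]
        else:
            # Word-final morpheme (or a plain word): close the current word.
            words.append(current + tok)
            current = ""
    if current:
        # Dangling continuation at the end of the sentence: emit what we have.
        words.append(current)
    return words


# Illustrative, hand-made segmentation of a Finnish phrase.
hypothesis = "auto+ ssa on kissa".split()
print(" ".join(morphemes_to_words(hypothesis)))  # prints "autossa on kissa"
```

The joined word sequence, rather than the morpheme sequence, would then be passed to whatever BLEU scorer is used during tuning, so that the optimized metric matches the one reported at evaluation time.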

reference text

Eleftherios Avramidis and Philipp Koehn. 2008. Enriching morphologically poor languages for statistical machine translation. In ACL-HLT.
Ibrahim Badr, Rabih Zbib, and James Glass. 2008. Segmentation for English-to-Arabic statistical machine translation. In ACL-HLT.
Tim Buckwalter. 2004. Buckwalter Arabic Morphological Analyzer Version 2.0. Linguistic Data Consortium, Philadelphia.
Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 Workshop on Statistical Machine Translation. In EACL.
Boxing Chen, Min Zhang, Haizhou Li, and Aiti Aw. 2009a. A comparative study of hypothesis alignment and its improvement for machine translation system combination. In ACL-IJCNLP.
Yu Chen, Michael Jellinghaus, Andreas Eisele, Yi Zhang, Sabine Hunsicker, Silke Theison, Christian Federmann, and Hans Uszkoreit. 2009b. Combining multiengine translations with Moses. In EACL.
Michael Collins, Philipp Koehn, and Ivona Kučerová. 2005. Clause restructuring for statistical machine translation. In ACL.
Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Trans. Speech Lang. Process., 4(1):3.
Thi Ngoc Diep Do, Viet Bac Le, Brigitte Bigi, Laurent Besacier, and Eric Castelli. 2009. Mining a comparable text corpus for a Vietnamese-French statistical machine translation system. In EACL.
Sharon Goldwater and David McClosky. 2005. Improving statistical MT through morphological analysis. In HLT.
Philipp Koehn and Barry Haddow. 2009. Edinburgh’s submission to all tracks of the WMT2009 shared task with reordering and speed improvements to Moses. In EACL.
Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In EMNLP-CoNLL.
Philipp Koehn and Christof Monz. 2005. Shared task: Statistical machine translation between European languages. In WPT.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In NAACL.
Philipp Koehn, Hieu Hoang, Alexandra Birch Mayne, Christopher Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL, Demonstration Session.
Philipp Koehn. 2003. Noun phrase translation. Ph.D. thesis, University of Southern California, Los Angeles, CA, USA.
Young-Suk Lee. 2004. Morphological analysis for statistical machine translation. In HLT-NAACL.
Preslav Nakov and Hwee Tou Ng. 2009. Improved statistical machine translation for resource-poor languages using related resource-rich languages. In EMNLP.
Jan Niehues, Teresa Herrmann, Muntsin Kolss, and Alex Waibel. 2009. The Universität Karlsruhe translation system for the EACL-WMT 2009. In EACL.
Attila Novák. 2009. MorphoLogic’s submission for the WMT 2009 shared task. In EACL.
Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In ACL.
Kemal Oflazer and Ilknur El-Kahlout. 2007. Exploring different representational units in English-to-Turkish statistical machine translation. In StatMT.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. Bleu: a method for automatic evaluation of machine translation. In ACL.
Fatiha Sadat and Nizar Habash. 2006. Combination of Arabic preprocessing schemes for statistical machine translation. In ACL.
Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing.
Kristina Toutanova, Hisami Suzuki, and Achim Ruopp. 2008. Applying morphology generation models to machine translation. In ACL-HLT.
Sami Virpioja, Jaakko J. Väyrynen, Mathias Creutz, and Markus Sadeniemi. 2007. Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner. In Machine Translation Summit XI.
Hua Wu and Haifeng Wang. 2007. Pivot language approach for phrase-based statistical machine translation. Machine Translation, 21(3):165–181.
Mei Yang and Katrin Kirchhoff. 2006. Phrase-based backoff models for machine translation of highly inflected languages. In EACL.