140 acl-2012-Machine Translation without Words through Substring Alignment

Source: pdf

Author: Graham Neubig ; Taro Watanabe ; Shinsuke Mori ; Tatsuya Kawahara

Abstract: In this paper, we demonstrate that accurate machine translation is possible without the concept of “words,” treating MT as a problem of transformation between character strings. We achieve this result by applying phrasal inversion transduction grammar alignment techniques to character strings to train a character-based translation model, and using this in the phrase-based MT framework. We also propose a look-ahead parsing algorithm and substring-informed prior probabilities to achieve more effective and efficient alignment. In an evaluation, we demonstrate that character-based translation can achieve results that compare to word-based systems while effectively translating unknown and uncommon words over several language pairs.
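
As a rough illustration of what "treating MT as a problem of transformation between character strings" could look like at the data level, the sketch below splits a sentence into single-character tokens and enumerates its substrings, which are the units a substring aligner would consider. This is only a minimal sketch under stated assumptions: the `SPACE_MARK` symbol, the function names, and the `max_len` cap are hypothetical and are not taken from the paper, whose actual preprocessing is not described in this section.

```python
from typing import List, Tuple

# Hypothetical marker that stands in for a space; not taken from the paper.
SPACE_MARK = "_"


def to_char_tokens(sentence: str) -> List[str]:
    """Turn a sentence into a sequence of single-character tokens."""
    return [SPACE_MARK if ch == " " else ch for ch in sentence.strip()]


def from_char_tokens(tokens: List[str]) -> str:
    """Invert to_char_tokens (assumes the input never contained SPACE_MARK)."""
    return "".join(" " if tok == SPACE_MARK else tok for tok in tokens)


def substrings(tokens: List[str], max_len: int) -> List[Tuple[str, ...]]:
    """List every contiguous span of at most max_len characters."""
    spans = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            spans.append(tuple(tokens[start:end]))
    return spans
```

For example, `to_char_tokens("a cat")` returns `['a', '_', 'c', 'a', 't']`, and `from_char_tokens` restores the original string. Keeping the space as an explicit token is one simple way to let a character-level model still see where the original word boundaries were.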

reference text

Mohamed I. Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. 2004. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2(1).
Yaser Al-Onaizan and Kevin Knight. 2002. Translating named entities using monolingual and bilingual resources. In Proc. ACL.
Ming-Hong Bai, Keh-Jiann Chen, and Jason S. Chang. 2008. Improving word alignment by adjusting Chinese word segmentation. In Proc. IJCNLP.
Phil Blunsom and Trevor Cohn. 2010. Inducing synchronous grammars with slice sampling. In Proc. HLT-NAACL, pages 238–241.
Phil Blunsom, Trevor Cohn, Chris Dyer, and Miles Osborne. 2009. A Gibbs sampler for phrasal synchronous grammar induction. In Proc. ACL.
Ondřej Bojar. 2007. English-to-Czech factored machine translation. In Proc. WMT.
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19.
Ralf D. Brown. 2002. Corpus-driven splitting of compound words. In Proc. TMI.
Pi-Chuan Chang, Michel Galley, and Christopher D. Manning. 2008. Optimizing Chinese word segmentation for machine translation performance. In Proc. WMT.
Tagyoung Chung and Daniel Gildea. 2009. Unsupervised tokenization for machine translation. In Proc. EMNLP.
Simon Corston-Oliver and Michael Gamon. 2004. Normalizing German and English inflectional morphology to improve statistical word alignment. Machine Translation: From Real Users to Research.
Fabien Cromieres. 2006. Sub-sentential alignment using substring co-occurrence counts. In Proc. COLING/ACL 2006 Student Research Workshop.
John DeNero, Alex Bouchard-Côté, and Dan Klein. 2008. Sampling alignment structure under a Bayesian translation model. In Proc. EMNLP.
Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems. In Proc. WMT.
Andrew Finch and Eiichiro Sumita. 2007. Phrase-based machine transliteration. In Proc. TCAST.
Sharon Goldwater and David McClosky. 2005. Improving statistical MT through morphological analysis. In Proc. EMNLP.
Aria Haghighi, John Blitzer, John DeNero, and Dan Klein. 2009. Better word alignments with supervised ITG models. In Proc. ACL.
Mark Johnson, Thomas Griffiths, and Sharon Goldwater. 2007. Bayesian inference for PCFGs via Markov chain Monte Carlo. In Proc. NAACL.
Dan Klein and Christopher D. Manning. 2003. A* parsing: fast exact Viterbi parse selection. In Proc. HLT.
Kevin Knight and Jonathan Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4).
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proc. HLT, pages 48–54.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proc. EMNLP.
Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit.
Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. 2003. Cognates can improve statistical translation models. In Proc. HLT.
Young-Suk Lee. 2004. Morphological analysis for statistical machine translation. In Proc. HLT.
Klaus Macherey, Andrew Dai, David Talbot, Ashok Popat, and Franz Och. 2011. Language-independent compound splitting with morphological operations. In Proc. ACL.
Daniel Marcu and William Wong. 2002. A phrase-based, joint probability model for statistical machine translation. In Proc. EMNLP.
Coşkun Mermer and Ahmet Afşın Akın. 2010. Unsupervised search for the optimal segmentation for statistical machine translation. In Proc. ACL Student Research Workshop.
Jason Naradowsky and Kristina Toutanova. 2011. Unsupervised bilingual morpheme segmentation and alignment with context-rich hidden semi-Markov models. In Proc. ACL.
Graham Neubig, Taro Watanabe, Eiichiro Sumita, Shinsuke Mori, and Tatsuya Kawahara. 2011. An unsupervised model for joint phrase alignment and extraction. In Proc. ACL, pages 632–641, Portland, USA, June.
Graham Neubig. 2011. The Kyoto free translation task. http://www.phontron.com/kftt.
ThuyLinh Nguyen, Stephan Vogel, and Noah A. Smith. 2010. Nonparametric word segmentation for machine translation. In Proc. COLING.
Sonja Nießen and Hermann Ney. 2000. Improving SMT quality with morpho-syntactic analysis. In Proc. COLING.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. COLING.
Markus Saers, Joakim Nivre, and Dekai Wu. 2009. Learning stochastic bracketing inversion transduction grammars with a cubic time biparsing algorithm. In Proc. IWPT, pages 29–32.
Benjamin Snyder and Regina Barzilay. 2008. Unsupervised multilingual learning for morphological segmentation. In Proc. ACL.
Virach Sornlertlamvanich, Chumpol Mokarat, and Hitoshi Isahara. 2008. Thai-Lao machine translation based on phoneme transfer. In Proc. 14th Annual Meeting of the Association for Natural Language Processing.
Michael Subotin. 2011. An exponential translation model for target language morphology. In Proc. ACL.
David Talbot and Miles Osborne. 2006. Modelling lexical redundancy for machine translation. In Proc. ACL.
Jörg Tiedemann. 2009. Character-based PSMT for closely related languages. In Proc. 13th Annual Conference of the European Association for Machine Translation.
David Vilar, Jan-T. Peter, and Hermann Ney. 2007. Can we translate letters? In Proc. WMT.
Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proc. COLING.
Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3).
Hao Zhang and Daniel Gildea. 2005. Stochastic lexicalized inversion transduction grammar for alignment. In Proc. ACL.
Hao Zhang, Chris Quirk, Robert C. Moore, and Daniel Gildea. 2008a. Bayesian learning of non-compositional phrases with synchronous parsing. In Proc. ACL.
Ruiqiang Zhang, Keiji Yasuda, and Eiichiro Sumita. 2008b. Improved statistical machine translation by multiple Chinese word segmentation. In Proc. WMT.