201 acl-2010-Pseudo-Word for Phrase-Based Machine Translation

Source: pdf

Author: Xiangyu Duan ; Min Zhang ; Haizhou Li

Abstract: The pipeline of most Phrase-Based Statistical Machine Translation (PB-SMT) systems starts from an automatically word-aligned parallel corpus. But the word appears to be too fine-grained a unit in some cases, such as non-compositional phrasal equivalences, where no clear word alignments exist. Using words as inputs to the PB-SMT pipeline therefore has an inherent deficiency. This paper proposes the pseudo-word as a new starting point for the PB-SMT pipeline. A pseudo-word is a kind of basic multi-word expression that characterizes a minimal sequence of consecutive words in the sense of translation. By casting the pseudo-word search problem into a parsing framework, we search for pseudo-words in both a monolingual way and a bilingual synchronous way. Experiments show that pseudo-words significantly outperform words as the basic unit for the PB-SMT model in both the travel translation domain and the news translation domain.

reference text

S. Banerjee and A. Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (ACL’05), 65–72.
P. Blunsom, T. Cohn, C. Dyer, and M. Osborne. 2009. A Gibbs Sampler for Phrasal Synchronous Grammar Induction. In Proceedings of ACL-IJCNLP, Singapore.
P. Blunsom, T. Cohn, and M. Osborne. 2008. Bayesian synchronous grammar induction. In Proceedings of NIPS 21, Vancouver, Canada.
P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer. 1993. The mathematics of machine translation: Parameter estimation. Computational Linguistics, 19:263–311.
P.-C. Chang, M. Galley, and C. D. Manning. 2008. Optimizing Chinese word segmentation for machine translation performance. In Proceedings of the 3rd Workshop on Statistical Machine Translation (SMT’08), 224–232.
S. F. Chen and J. Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University Center for Research in Computing Technology.
C. Cherry and D. Lin. 2007. Inversion transduction grammar for joint phrasal translation modeling. In Proc. of the HLT-NAACL Workshop on Syntax and Structure in Statistical Translation (SSST 2007), Rochester, USA.
D. Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.
Y. Deng and W. Byrne. 2005. HMM word and phrase alignment for statistical machine translation. In Proc. of HLT-EMNLP, pages 169–176.
G. Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the 2nd International Conference on Human Language Technology (HLT’02), 138–145.
R. Kneser and H. Ney. 1995. Improved backing-off for M-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 181–184, Detroit, MI.
P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of the 45th Annual Meeting of the ACL (ACL-2007), Prague.
P. Koehn, F. J. Och, and D. Marcu. 2003. Statistical phrase-based translation. In Proc. of the 3rd International Conference on Human Language Technology Research and 4th Annual Meeting of the NAACL (HLT-NAACL 2003), 81–88, Edmonton, Canada.
P. Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. In Proceedings of EMNLP.
P. Lambert and R. Banchs. 2005. Data Inferred Multi-word Expressions for Statistical Machine Translation. In Proceedings of MT Summit X.
Y. Ma, N. Stroppa, and A. Way. 2007. Bootstrapping word alignment via word packing. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL’07), 304–311.
Y. Ma and A. Way. 2009. Bilingually Motivated Word Segmentation for Statistical Machine Translation. In ACM Transactions on Asian Language Information Processing, 8(2).
D. Marcu and W. Wong. 2002. A phrase-based, joint probability model for statistical machine translation. In Proc. of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP-2002), 133–139, Philadelphia. Association for Computational Linguistics.
F. J. Och. 2003. Minimum error rate training in statistical machine translation. In Proc. of ACL, pages 160–167.
F. J. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
X.-H. Phan, L.-M. Nguyen, and C.-T. Nguyen. 2005. FlexCRFs: Flexible Conditional Random Field Toolkit. http://flexcrfs.sourceforge.net
K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2001. Bleu: a method for automatic evaluation of machine translation.
M. Paul. 2008. Overview of the IWSLT 2008 evaluation campaign. In Proc. of International Workshop on Spoken Language Translation, 20-21 October 2008.
A. Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of ICSLP, Denver, Colorado.
D. Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.
J. Xu, R. Zens, and H. Ney. 2004. Do we need Chinese word segmentation for statistical machine translation? In Proceedings of the ACL Workshop on Chinese Language Processing (SIGHAN’04), 122–128.
J. Xu, J. Gao, K. Toutanova, and H. Ney. 2008. Bayesian semi-supervised Chinese word segmentation for statistical machine translation. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING’08), 1017–1024.
H. Zhang, C. Quirk, R. C. Moore, and D. Gildea. 2008. Bayesian learning of non-compositional phrases with synchronous parsing. In Proc. of the 46th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (ACL-08:HLT), 97–105, Columbus, Ohio.
R. Zhang, K. Yasuda, and E. Sumita. 2008. Improved statistical machine translation by multiple Chinese word segmentation. In Proceedings of the 3rd Workshop on Statistical Machine Translation (SMT’08), 216–223.