135 acl-2010-Hindi-to-Urdu Machine Translation through Transliteration

Source: pdf

Author: Nadir Durrani ; Hassan Sajjad ; Alexander Fraser ; Helmut Schmid

Abstract: We present a novel approach to integrate transliteration into Hindi-to-Urdu statistical machine translation. We propose two probabilistic models, based on conditional and joint probability formulations, that are novel solutions to the problem. Our models consider both transliteration and translation when translating a particular Hindi word given the context, whereas in previous work transliteration is only used for translating OOV (out-of-vocabulary) words. We use transliteration as a tool for disambiguating Hindi homonyms, which can be either translated or transliterated, or transliterated differently, depending on the context. We obtain final BLEU scores of 19.35 (conditional probability model) and 19.00 (joint probability model), compared to 14.30 for a baseline phrase-based system and 16.25 for a system which transliterates OOV words in the baseline system. This indicates that transliteration is useful for more than only translating OOV words for language pairs like Hindi-Urdu.
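
As a rough reading of the scores reported in the abstract, the absolute BLEU gains of the two proposed models over the two comparison systems can be tabulated in a few lines of Python. Only the four BLEU values come from the abstract; the dictionary keys and the helper `gain` are illustrative names, not taken from the paper.

```python
# BLEU scores as reported in the abstract.
# The keys are illustrative labels (an assumption), not names used by the authors.
bleu = {
    "baseline_phrase_based": 14.30,
    "baseline_plus_oov_transliteration": 16.25,
    "conditional_model": 19.35,
    "joint_model": 19.00,
}


def gain(system: str, reference: str) -> float:
    """Absolute BLEU difference between two systems, rounded to two decimals."""
    return round(bleu[system] - bleu[reference], 2)


# Gain of each proposed model over each comparison system.
gains = {
    (model, ref): gain(model, ref)
    for model in ("conditional_model", "joint_model")
    for ref in ("baseline_phrase_based", "baseline_plus_oov_transliteration")
}

for (model, ref), delta in gains.items():
    print(f"{model} vs {ref}: +{delta:.2f} BLEU")
```

The differences are absolute BLEU points; the abstract does not report relative improvements or significance levels, so none are computed here.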

reference text

Nasreen AbdulJaleel and Leah S. Larkey. 2003. Statistical transliteration for English-Arabic cross language information retrieval. In CIKM ’03: Proceedings of the twelfth international conference on Information and knowledge management, pages 139–146.
Yaser Al-Onaizan and Kevin Knight. 2002. Translating named entities using monolingual and bilingual resources. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 400–408.
Asif Ekbal, Sudip Kumar Naskar, and Sivaji Bandyopadhyay. 2006. A modified joint source-channel model for transliteration. In Proceedings of the COLING/ACL poster sessions, pages 191–198, Sydney, Australia. Association for Computational Linguistics.
Swati Gupta. 2004. Aligning Hindi and Urdu bilingual corpora for robust projection. Masters project dissertation, Department of Computer Science, University of Sheffield.
Ulf Hermjakob, Kevin Knight, and Hal Daumé III. 2008. Name translation in statistical machine translation - learning when to transliterate. In Proceedings of ACL-08: HLT, pages 389–397, Columbus, Ohio. Association for Computational Linguistics.
Bushra Jawaid and Tafseer Ahmed. 2009. Hindi to Urdu conversion: beyond simple transliteration. In Conference on Language and Technology 2009, Lahore, Pakistan.
Mehdi M. Kashani, Eric Joanis, Roland Kuhn, George Foster, and Fred Popowich. 2007. Integration of an Arabic transliteration module into a statistical machine translation system. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 17–24, Prague, Czech Republic. Association for Computational Linguistics.
Kevin Knight and Jonathan Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4):599–612.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Demonstration Program, Prague, Czech Republic.
Philipp Koehn. 2004a. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In AMTA, pages 115–124.
Philipp Koehn. 2004b. Statistical significance tests for machine translation evaluation. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 388–395, Barcelona, Spain, July. Association for Computational Linguistics.
Haizhou Li, Zhang Min, and Su Jian. 2004. A joint source-channel model for machine transliteration. In ACL ’04: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, pages 159–166, Barcelona, Spain. Association for Computational Linguistics.
M G Abbas Malik, Christian Boitet, and Pushpak Bhattacharyya. 2008. Hindi Urdu machine transliteration using finite-state transducers. In Proceedings of the 22nd International Conference on Computational Linguistics, Manchester, UK.
Robert C. Moore. 2002. Fast and accurate sentence alignment of bilingual corpora. In Conference of the Association for Machine Translation in the Americas (AMTA).
Franz J. Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Kishore A. Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. BLEU: a method for automatic evaluation of machine translation. Technical Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY.
Ari Pirkola, Jarmo Toivonen, Heikki Keskustalo, Kari Visala, and Kalervo Järvelin. 2003. Fuzzy translation of cross-lingual spelling variants. In SIGIR ’03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 345–352, New York, NY, USA. ACM.
R. Mahesh K. Sinha. 2009. Developing English-Urdu machine translation via Hindi. In Third Workshop on Computational Approaches to Arabic Script-based Languages (CAASL3), MT Summit XII, Ottawa, Canada.
Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Intl. Conf. Spoken Language Processing, Denver, Colorado.
Paola Virga and Sanjeev Khudanpur. 2003. Transliteration of proper names in cross-lingual information retrieval. In Proceedings of the ACL 2003 workshop on Multilingual and mixed-language named entity recognition, pages 57–64, Morristown, NJ, USA. Association for Computational Linguistics.
Bing Zhao, Nguyen Bach, Ian Lane, and Stephan Vogel. 2007. A log-linear block transliteration model based on bi-stream HMMs. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 364–371, Rochester, New York. Association for Computational Linguistics.