240 acl-2013-Microblogs as Parallel Corpora


Source: pdf

Author: Wang Ling ; Guang Xiang ; Chris Dyer ; Alan Black ; Isabel Trancoso

Abstract: In the ever-expanding sea of microblog data, there is a surprising amount of naturally occurring parallel text: some users post multilingual messages targeting international audiences, while others “retweet” translations. We present an efficient method for detecting these messages and extracting parallel segments from them. We have been able to extract over 1M Chinese-English parallel segments from Sina Weibo (the Chinese counterpart of Twitter) using only their public APIs. As a supplement to existing parallel training data, our automatically extracted parallel data yields substantial translation quality improvements in translating microblog text and modest improvements in translating edited news commentary. The resources described in this paper are available at http://www.cs.cmu.edu/~lingwang/utopia.


reference text

[Axelrod et al. 2005] Amittai Axelrod, Ra Birch Mayne, Chris Callison-Burch, Miles Osborne, and David Talbot. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT).
[Blei et al. 2003] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March.
[Braune and Fraser 2010] Fabienne Braune and Alexander Fraser. 2010. Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING ’10, pages 81–89, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Brown et al. 1993] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Comput. Linguist., 19:263–311, June.
[Fukushima et al. 2006] Ken’ichi Fukushima, Kenjiro Taura, and Takashi Chikayama. 2006. A fast and accurate method for detecting English-Japanese parallel texts. In Proceedings of the Workshop on Multilingual Language Resources and Interoperability, pages 60–67, Sydney, Australia, July. Association for Computational Linguistics.
[Gimpel et al. 2011] Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, HLT ’11, pages 42–47, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Jelh et al. 2012] Laura Jelh, Felix Hiebel, and Stefan Riezler. 2012. Twitter translation using translation-based cross-lingual retrieval. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 410–421, Montréal, Canada, June. Association for Computational Linguistics.
[Koehn et al. 2003] Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, pages 48–54, Morristown, NJ, USA. Association for Computational Linguistics.
[Koehn 2005] Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.
[Li and Liu 2008] Bo Li and Juan Liu. 2008. Mining Chinese-English parallel corpora from the web. In Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP).
[Lin et al. 2008] Dekang Lin, Shaojun Zhao, Benjamin Van Durme, and Marius Paşca. 2008. Mining parenthetical translations from the web by word alignment. In Proceedings of ACL-08: HLT, pages 994–1002, Columbus, Ohio, June. Association for Computational Linguistics.
[Och 2003] Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL ’03, pages 160–167, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Papineni et al. 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Post et al. 2012] Matt Post, Chris Callison-Burch, and Miles Osborne. 2012. Constructing parallel corpora for six Indian languages via crowdsourcing. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 401–409, Montréal, Canada, June. Association for Computational Linguistics.
[Resnik and Smith 2003] Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Computational Linguistics, 29:349–380.
[Smith et al. 2010] Jason R. Smith, Chris Quirk, and Kristina Toutanova. 2010. Extracting parallel sentences from comparable corpora using document level alignment. In Proceedings of the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics.
[Ture and Lin 2012] Ferhan Ture and Jimmy Lin. 2012. Why not grab a free lunch? Mining large corpora for parallel sentences to improve translation modeling. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 626–630, Montréal, Canada, June. Association for Computational Linguistics.
[Uszkoreit et al. 2010] Jakob Uszkoreit, Jay Ponte, Ashok C. Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 1101–1109.
[Vogel et al. 1996] Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the 16th conference on Computational linguistics - Volume 2, COLING ’96, pages 836–841, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Xu et al. 2001] Jinxi Xu, Ralph Weischedel, and Chanh Nguyen. 2001. Evaluating a probabilistic model for cross-lingual information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’01, pages 105–110, New York, NY, USA. ACM.
[Xu et al. 2005] Jia Xu, Richard Zens, and Hermann Ney. 2005. Sentence segmentation using IBM word alignment model 1. In Proceedings of EAMT 2005 (10th Annual Conference of the European Association for Machine Translation), pages 280–287.
[Zbib et al. 2012] Rabih Zbib, Erika Malchiodi, Jacob Devlin, David Stallard, Spyros Matsoukas, Richard Schwarz, John Makhoul, Omar F. Zaidan, and Chris Callison-Burch. 2012. Machine translation of Arabic dialects. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.