
120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl


Source: pdf

Author: Jason R. Smith ; Herve Saint-Amand ; Magdalena Plamada ; Philipp Koehn ; Chris Callison-Burch ; Adam Lopez

Abstract: Parallel text is the fuel that drives modern machine translation systems. The Web is a comprehensive source of preexisting parallel text, but crawling the entire web is impossible for all but the largest companies. We bring web-scale parallel text to the masses by mining the Common Crawl, a public Web crawl hosted on Amazon’s Elastic Cloud. Starting from nothing more than a set of common two-letter language codes, our open-source extension of the STRAND algorithm mined 32 terabytes of the crawl in just under a day, at a cost of about $500. Our large-scale experiment uncovers large amounts of parallel text in dozens of language pairs across a variety of domains and genres, some previously unavailable in curated datasets. Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU. We make our code and data available for other researchers seeking to mine this rich new data resource.
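The abstract says the mining starts from nothing more than two-letter language codes. One plausible reading is that candidate document pairs are proposed by finding URLs that differ only in such a code. The Python sketch below illustrates that idea only; it is not the authors' released code. The code list, the delimiter rule, and the function names `url_keys` and `candidate_pairs` are assumptions made for illustration, and the structural alignment that STRAND applies to candidate pairs is not shown.

```python
import re
from collections import defaultdict

# Assumed, illustrative set of two-letter language codes (not the authors' list).
LANG_CODES = {"en", "fr", "de", "es"}

# A code counts only when it is not adjacent to other letters or digits,
# so "en" is not matched inside a word such as "enter".
_CODE_RE = re.compile(
    r"(?<![A-Za-z0-9])(" + "|".join(sorted(LANG_CODES)) + r")(?![A-Za-z0-9])"
)


def url_keys(url):
    """Yield (key, lang) pairs: the URL with one language code replaced by '*'."""
    for m in _CODE_RE.finditer(url):
        yield url[:m.start()] + "*" + url[m.end():], m.group(1)


def candidate_pairs(urls):
    """Return sorted URL pairs that share a key but carry different codes."""
    buckets = defaultdict(list)
    for url in urls:
        for key, lang in url_keys(url):
            buckets[key].append((lang, url))
    pairs = set()
    for entries in buckets.values():
        for i, (lang_a, url_a) in enumerate(entries):
            for lang_b, url_b in entries[i + 1:]:
                if lang_a != lang_b:
                    pairs.add(tuple(sorted((url_a, url_b))))
    return sorted(pairs)


if __name__ == "__main__":
    demo = [
        "http://example.com/en/about.html",
        "http://example.com/fr/about.html",
        "http://example.com/enter/index.html",
    ]
    for pair in candidate_pairs(demo):
        print(pair)
```

Grouping by the masked URL means each URL is processed once and only URLs in the same bucket are compared, which is the kind of key-based grouping that fits a MapReduce-style job over a large crawl.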


reference text

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March.
Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar F. Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, WMT ’10, pages 17–53. Association for Computational Linguistics.
Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar F. Zaidan. 2011. Findings of the 2011 workshop on statistical machine translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, WMT ’11, pages 22–64. Association for Computational Linguistics.
Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 workshop on statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 10–51, Montréal, Canada, June. Association for Computational Linguistics.
Jiang Chen and Jian-Yun Nie. 2000. Parallel web text mining for cross-language IR. In Proc. of RIAO, pages 62–77.
J. Dean and S. Ghemawat. 2004. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6, pages 10–10. USENIX Association.
William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Comput. Linguist., 19:75–102, March.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL ’07, pages 177–180. Association for Computational Linguistics.
P. Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT summit, volume 5.
Edward Loper and Steven Bird. 2002. NLTK: the natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics - Volume 1, ETMTNLP ’02, pages 63–70. Association for Computational Linguistics.
Adam Lopez, Matt Post, and Chris Callison-Burch. 2013. Parallel speech, transcription, and translation: The Fisher and Callhome Spanish-English speech translation corpora. Technical Report 11, Johns Hopkins University Human Language Technology Center of Excellence.
Marco Lui and Timothy Baldwin. 2012. langid.py: an off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, ACL ’12, pages 25–30. Association for Computational Linguistics.
Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.
Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving Machine Translation Performance by Exploiting Non-Parallel Corpora. Comput. Linguist., 31:477–504, December.
Jian-Yun Nie, Michel Simard, Pierre Isabelle, and Richard Durand. 1999. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the web. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’99, pages 74–81, New York, NY, USA. ACM.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In ACL, pages 160–167, Sapporo, Japan.
P. Resnik and N. A. Smith. 2003. The web as a parallel corpus. Computational Linguistics, 29(3):349–380.
Philip Resnik. 1999. Mining the web for bilingual text. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, ACL ’99, pages 527–534. Association for Computational Linguistics.
Jason R. Smith, Chris Quirk, and Kristina Toutanova. 2010. Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment. In NAACL 2010.
Jörg Tiedemann. 2009. News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing, volume V, pages 237–248. John Benjamins, Amsterdam/Philadelphia, Borovets, Bulgaria.
Ferhan Ture and Jimmy Lin. 2012. Why not grab a free lunch? Mining large corpora for parallel sentences to improve translation modeling. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 626–630, Montréal, Canada, June. Association for Computational Linguistics.
Jakob Uszkoreit, Jay M. Ponte, Ashok C. Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, pages 1101–1109. Association for Computational Linguistics.
Ashish Venugopal, Jakob Uszkoreit, David Talbot, Franz J. Och, and Juri Ganitkevitch. 2011. Watermarking the outputs of structured prediction with an application in statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, pages 1363–1372. Association for Computational Linguistics.
Omar F. Zaidan and Chris Callison-Burch. 2011. Crowdsourcing translation: Professional quality from non-professionals. In Proc. of ACL.