
Faster and Smaller N-Gram Language Models (ACL 2011, paper 135)


Source: pdf

Authors: Adam Pauls; Dan Klein

Abstract: N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.
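The caching technique mentioned in the abstract can be illustrated with a short sketch: a fixed-size, direct-mapped cache that memoizes recent n-gram probability queries in front of a slower language model lookup. This is a hypothetical illustration of the general idea only, not the paper's exact scheme; the CachedLM wrapper and the underlying lm.log_prob interface are assumptions made for the example.

    # A minimal sketch (not the authors' implementation): a fixed-size,
    # direct-mapped cache memoizing recent n-gram probability queries in
    # front of a slower language model lookup.

    class CachedLM:
        def __init__(self, lm, cache_size=1 << 20):
            self.lm = lm                     # underlying model exposing log_prob()
            self.size = cache_size
            self.keys = [None] * cache_size  # cached n-gram tuples
            self.vals = [0.0] * cache_size   # cached log-probabilities

        def log_prob(self, ngram):
            """Return log P(w | context) for an n-gram given as a tuple of word ids."""
            slot = hash(ngram) % self.size   # direct-mapped: one slot per hash value
            if self.keys[slot] == ngram:     # hit: answer without touching the model
                return self.vals[slot]
            p = self.lm.log_prob(ngram)      # miss: do the slow lookup once
            self.keys[slot] = ngram          # evict whatever previously held the slot
            self.vals[slot] = p
            return p

During decoding the same contexts recur constantly, so under this kind of scheme most queries hit the cache and skip the full lookup; a direct-mapped table keeps the cost of a hit to one hash and one comparison.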


Reference text

Paolo Boldi and Sebastiano Vigna. 2005. Codes for the World Wide Web. Internet Mathematics, 2.

Thorsten Brants and Alex Franz. 2006. Google Web 1T 5-gram corpus, version 1. Linguistic Data Consortium, Philadelphia, Catalog Number LDC2006T13.

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Bernard Chazelle, Joe Kilian, Ronitt Rubinfeld, and Ayellet Tal. 2004. The Bloomier filter: an efficient data structure for static support lookup tables. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the Annual Conference of the Association for Computational Linguistics.

Kenneth Church, Ted Hart, and Jianfeng Gao. 2007. Compressing trigram language models with Golomb coding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Marcello Federico and Mauro Cettolo. 2007. Efficient handling of n-gram language models for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation.

Edward Fredkin. 1960. Trie memory. Communications of the ACM, 3:490–499, September.

Ulrich Germann, Eric Joanis, and Samuel Larkin. 2009. Tightly packed tries: how to fit large models into memory, and make them load fast, too. In Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing.

S. W. Golomb. 1966. Run-length encodings. IEEE Transactions on Information Theory, 12.

David Guthrie and Mark Hepple. 2010. Storing the web in memory: space efficient language models with constant time retrieval. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Boulos Harb, Ciprian Chelba, Jeffrey Dean, and Sanjay Ghemawat. 2009. Back-off language model compression. In Proceedings of Interspeech.

Bo-June Hsu and James Glass. 2008. Iterative language model estimation: Efficient data structure and algorithms. In Proceedings of Interspeech.

Abby Levenberg and Miles Osborne. 2009. Stream-based randomised language models for SMT. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Zhifei Li and Sanjeev Khudanpur. 2008. A scalable decoder for parsing-based machine translation with equivalent language model state maintenance. In Proceedings of the Second Workshop on Syntax and Structure in Statistical Translation.

Zhifei Li, Chris Callison-Burch, Chris Dyer, Juri Ganitkevitch, Sanjeev Khudanpur, Lane Schwartz, Wren N. G. Thornton, Jonathan Weese, and Omar F. Zaidan. 2009. Joshua: an open source toolkit for parsing-based machine translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation.

Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30:417–449, December.

Andreas Stolcke. 2002. SRILM: An extensible language modeling toolkit. In Proceedings of Interspeech.

E. W. D. Whittaker and B. Raj. 2001. Quantization-based language model compression. In Proceedings of Eurospeech.