NIPS 2008, paper 4 — reference knowledge graph (maker-knowledge-mining)
Source: pdf
Authors: Andriy Mnih, Geoffrey E. Hinton
Abstract: Neural probabilistic language models (NPLMs) have been shown to be competitive with, and occasionally superior to, the widely used n-gram language models. The main drawback of NPLMs is their extremely long training and testing times. Morin and Bengio have proposed a hierarchical language model built around a binary tree of words, which was two orders of magnitude faster than the non-hierarchical model it was based on. However, it performed considerably worse than its non-hierarchical counterpart, despite using a word tree created with expert knowledge. We introduce a fast hierarchical language model along with a simple feature-based algorithm for automatic construction of word trees from the data. We then show that the resulting models can outperform non-hierarchical neural models as well as the best n-gram models.
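The speed-up the abstract refers to comes from replacing a softmax over the whole vocabulary with a product of binary decisions along a word's path in the tree, so computing one word's probability costs O(log V) rather than O(V) vector products. A minimal sketch of that path-product idea, in plain Python (the function and variable names here are illustrative, not the paper's notation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def word_probability(context_vec, path):
    """Probability of one word under a binary-tree (hierarchical) softmax.

    `context_vec` is the predicted representation of the next word;
    `path` lists the word's route from the root as (node_vec, go_left)
    pairs, one per internal node on the way to the word's leaf.
    """
    p = 1.0
    for node_vec, go_left in path:
        # Each internal node makes a logistic left/right decision.
        score = sum(c * n for c, n in zip(context_vec, node_vec))
        p_left = sigmoid(score)
        p *= p_left if go_left else (1.0 - p_left)
    return p
```

Because each node's two branch probabilities sum to one, the leaf probabilities automatically sum to one over the vocabulary, which is what lets the tree model skip the expensive normalization of a flat softmax.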
[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.
[2] Yoshua Bengio and Jean-Sébastien Senécal. Quick training of probabilistic neural nets by importance sampling. In AISTATS’03, 2003.
[3] P.F. Brown, R.L. Mercer, V.J. Della Pietra, and J.C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
[4] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the Thirty-Fourth Annual Meeting of the Association for Computational Linguistics, pages 310–318, San Francisco, 1996.
[5] Ahmad Emami, Peng Xu, and Frederick Jelinek. Using a connectionist model in a syntactical based language model. In Proceedings of ICASSP, volume 1, pages 372–375, 2003.
[6] C. Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 1998.
[7] J. Goodman. A bit of progress in language modeling. Technical report, Microsoft Research, 2000.
[8] John G. McMahon and Francis J. Smith. Improving statistical language model performance with automatically generated word hierarchies. Computational Linguistics, 22(2):217–247, 1996.
[9] A. Mnih and G. Hinton. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, pages 641–648, 2007.
[10] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Robert G. Cowell and Zoubin Ghahramani, editors, AISTATS’05, pages 246–252, 2005.
[11] F. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 183–190, 1993.
[12] Holger Schwenk and Jean-Luc Gauvain. Connectionist language modeling for large vocabulary continuous speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 765–768, 2002.