nips nips2000 nips2000-6 nips2000-6-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yoshua Bengio, Réjean Ducharme, Pascal Vincent
Abstract: A goal of statistical language modeling is to learn the joint probability function of sequences of words. This is intrinsically difficult because of the curse of dimensionality: we propose to fight it with its own weapons. In the proposed approach one learns simultaneously (1) a distributed representation for each word (i.e. a similarity between words) along with (2) the probability function for word sequences, expressed with these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar to words forming an already seen sentence. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach very significantly improves on a state-of-the-art trigram model.
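The abstract describes learning (1) a feature vector per word and (2) a neural network that maps the concatenated feature vectors of the preceding words to a probability distribution over the next word, trained jointly. Below is a minimal sketch of such a model in PyTorch; it is not the authors' exact architecture (the hidden layer size, embedding dimension, context length, and absence of direct input-to-output connections are illustrative assumptions).

```python
import torch
import torch.nn as nn

class NeuralLanguageModel(nn.Module):
    """Sketch of a neural probabilistic language model: a learned feature
    vector per word (the distributed representation) plus a neural network
    mapping the concatenated context features to next-word probabilities.
    All sizes below are illustrative, not taken from the paper."""

    def __init__(self, vocab_size, embed_dim=30, context_size=3, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # (1) distributed word representation
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)        # (2) probability function over the next word

    def forward(self, context_ids):
        # context_ids: (batch, context_size) indices of the preceding words
        x = self.embed(context_ids).flatten(start_dim=1)       # concatenate context feature vectors
        h = torch.tanh(self.hidden(x))
        return torch.log_softmax(self.output(h), dim=-1)       # log P(w_t | preceding words)

# Training would minimize the negative log-likelihood of the observed next word,
# so the word features and the probability function are learned simultaneously.
model = NeuralLanguageModel(vocab_size=10000)
log_probs = model(torch.randint(0, 10000, (4, 3)))             # batch of 4 three-word contexts
```

Generalization then comes from the shared feature vectors: an unseen word sequence receives non-negligible probability when its words have feature vectors close to those of words seen in similar contexts.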
[1] D. Baker and A. McCallum. Distributional clustering of words for text classification. In SIGIR '98, 1998.
[2] S. Bengio and Y. Bengio. Taking on the curse of dimensionality in joint distributions using neural networks. IEEE Transactions on Neural Networks, special issue on Data Mining and Knowledge Discovery, 11(3):550-557, 2000.
[3] Y. Bengio and S. Bengio. Modeling high-dimensional discrete data with multi-layer neural networks. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 400-406. MIT Press, 2000.
[4] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[5] G. E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1-12, Amherst, 1986. Lawrence Erlbaum, Hillsdale.
[6] F. Jelinek and R. L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in Practice. North-Holland, Amsterdam, 1980.
[7] S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-35(3):400-401, March 1987.
[8] R. Miikkulainen and M. G. Dyer. Natural language processing with modular neural networks and distributed lexicon. Cognitive Science, 15:343-399, 1991.
[9] A. Paccanaro and G. E. Hinton. Extracting distributed representations of concepts and relations from positive and negative propositions. In Proceedings of the International Joint Conference on Neural Networks, IJCNN'2000, Como, Italy, 2000. IEEE, New York.
[10] F. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. In 30th Annual Meeting of the Association for Computational Linguistics, pages 183-190, Columbus, Ohio, 1993.
[11] J. Schmidhuber. Sequential neural text compression. IEEE Transactions on Neural Networks, 7(1):142-146, 1996.
[12] H. Schütze. Word space. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 895-902, San Mateo, CA, 1993. Morgan Kaufmann.