nips nips2004 nips2004-78 nips2004-78-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul
Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1
[1] A. Y. Alfakih, A. Khandani, and H. Wolkowicz. Solving Euclidean distance matrix completion problems via semidefinite programming. Computational Optimization Applications, 12(13):13–30, 1999.
[2] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.
[3] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13, Cambridge, MA, 2001. MIT Press.
[4] D. B. Borchers. CSDP, a C library for semidefinite programming. Optimization Methods and Software, 11(1):613–623, 1999.
[5] P. Brown, S. D. Pietra, V. D. Pietra, and R. Mercer. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263–311, 1991.
[6] P. F. Brown, V. J. D. Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
[7] S. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the ACL, pages 310–318, 1996.
[8] M. Collins. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, 1997.
[9] J. Ham, D. D. Lee, S. Mika, and B. Sch¨ lkopf. A kernel view of the dimensionality reduction of o manifolds. In Proceedings of the Twenty First International Conference on Machine Learning (ICML-04), Banff, Canada, 2004.
[10] T. Hofmann and J. Puzicha. Statistical models for co-occurrence and histogram data. In Proceedings of the International Conference Pattern Recognition, pages 192–194, 1998.
[11] F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, 1997.
[12] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181–214, 1994.
[13] L. K. Saul and F. C. N. Pereira. Aggregate and mixed-order Markov models for statistical language processing. In C. Cardie and R. Weischedel, editors, Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP-97), pages 81–89, New Providence, RI, 1997.
[14] K. Q. Weinberger, F. Sha, and L. K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the Twenty First International Confernence on Machine Learning (ICML-04), Banff, Canada, 2004.
[15] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2):179–214, 2004.