nips nips2013 nips2013-172 nips2013-172-reference knowledge-graph by maker-knowledge-mining

172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation


Source: pdf

Author: Andriy Mnih, Koray Kavukcuoglu

Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-the-art method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones.
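
To make the abstract's central idea concrete, the sketch below shows how noise-contrastive estimation reduces learning an unnormalised log-bilinear word model to binary classification between data words and noise samples. It is a minimal illustration, not the paper's implementation: the sizes, the uniform noise distribution (the paper uses a unigram one), the plain averaging of context vectors, and names such as `nce_loss`, `R`, and `Q` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
V, D, K = 10_000, 100, 5                 # vocabulary size, embedding dim, noise samples per token

R = 0.01 * rng.standard_normal((V, D))   # target-word embeddings r_w
Q = 0.01 * rng.standard_normal((V, D))   # context-word embeddings q_w
b = np.zeros(V)                          # per-word biases b_w
noise_probs = np.full(V, 1.0 / V)        # noise distribution P_n (uniform here; the paper uses a unigram distribution)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(context_ids, target_id):
    """NCE loss for one (context, target) pair with K noise samples.

    The model scores word w against context h as s(w, h) = q_hat(h) . r_w + b_w,
    where q_hat(h) is taken here to be the plain average of the context-word
    embeddings (the paper also considers position-dependent weights).
    """
    q_hat = Q[context_ids].mean(axis=0)

    noise_ids = rng.choice(V, size=K, p=noise_probs)
    ids = np.concatenate(([target_id], noise_ids))

    # Shifted scores: s(w, h) - log(K * P_n(w)). Passing these through the
    # sigmoid gives the estimated probability that w came from the data
    # rather than from the noise distribution.
    delta = R[ids] @ q_hat + b[ids] - np.log(K * noise_probs[ids])

    # The data sample (index 0) should be classified as data and the K noise
    # samples as noise; the loss is the negative log-likelihood of that
    # binary classification problem.
    return -(np.log(sigmoid(delta[0])) + np.log(1.0 - sigmoid(delta[1:])).sum())

# Example: loss for predicting word 42 from a three-word context window.
print(nce_loss([7, 19, 3], 42))
```

Because the score never needs to be normalised over the vocabulary, each update costs O(K) score evaluations instead of O(V), which is what makes the approach scalable.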


reference text

[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.

[2] Yoshua Bengio and Jean-Sébastien Senécal. Quick training of probabilistic neural nets by importance sampling. In AISTATS’03, 2003.

[3] Yoshua Bengio and Jean-Sébastien Senécal. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19(4):713–722, 2008.

[4] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, 2008.

[5] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[6] M.U. Gutmann and A. Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13:307–361, 2012.

[7] Zellig S Harris. Distributional structure. Word, 1954.

[8] Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 873–882, 2012.

[9] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.

[10] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. International Conference on Learning Representations 2013, 2013.

[11] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. Proceedings of NAACL-HLT, 2013.

[12] A. Mnih and G. Hinton. Three new graphical models for statistical language modelling. Proceedings of the 24th International Conference on Machine Learning, pages 641–648, 2007.

[13] Andriy Mnih and Geoffrey Hinton. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, volume 21, 2009.

[14] Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, pages 1751–1758, 2012.

[15] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In AISTATS’05, pages 246–252, 2005.

[16] Magnus Sahlgren. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. PhD thesis, Stockholm, 2006.

[17] R. Socher, C.C. Lin, A.Y. Ng, and C.D. Manning. Parsing natural scenes and natural language with recursive neural networks. In International Conference on Machine Learning (ICML), 2011.

[18] J. Turian, L. Ratinov, and Y. Bengio. Word representations: A simple and general method for semisupervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394, 2010.

[19] G. Zweig and C.J.C. Burges. The Microsoft Research Sentence Completion Challenge. Technical Report MSR-TR-2011-129, Microsoft Research, 2011.

[20] Geoffrey Zweig and Chris J.C. Burges. A challenge set for advancing language modeling. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, pages 29–36, 2012.