
81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model


Source: pdf

Author: Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, Tomas Mikolov

Abstract: Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data and semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions, achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model.
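The abstract compresses three technical ideas from the DeViSE paper: the visual model's features are mapped into a pretrained word-embedding space by a learned transformation, labels are predicted by ranking all label word vectors against the mapped image vector (which is what permits zero-shot predictions for labels absent from the image training set), and the mapping is trained with a hinge rank loss. The minimal NumPy sketch below illustrates these mechanics; the dimensions, the random stand-ins for a trained visual model and pretrained word vectors, and the helper names (embed_image, predict_labels, hinge_rank_loss) are illustrative assumptions, not the authors' implementation.

# Minimal sketch of DeViSE-style zero-shot prediction. All vectors here are
# random stand-ins; in the paper the word vectors come from a skip-gram
# language model and the visual features from a deep convnet.
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 500          # dimensionality of the word-embedding space (assumption)
N_TRAIN_LABELS = 1000  # labels seen during visual-model training
N_ALL_LABELS = 21000   # full label vocabulary, including never-seen labels
VIS_DIM = 4096         # visual feature dimensionality (assumption)

# Pretrained word embeddings for every label, L2-normalized (stand-ins here).
label_embeddings = rng.normal(size=(N_ALL_LABELS, EMB_DIM))
label_embeddings /= np.linalg.norm(label_embeddings, axis=1, keepdims=True)

def embed_image(image_features: np.ndarray, M: np.ndarray) -> np.ndarray:
    """Map visual features into the word-embedding space via the learned
    linear transformation M, then normalize."""
    v = M @ image_features
    return v / np.linalg.norm(v)

def predict_labels(image_embedding: np.ndarray, k: int = 5) -> np.ndarray:
    """Rank ALL labels (seen and unseen) by similarity to the image's
    embedding; the top k are the predictions. Because unseen labels have
    word vectors too, this yields zero-shot predictions."""
    scores = label_embeddings @ image_embedding
    return np.argsort(-scores)[:k]

def hinge_rank_loss(image_embedding: np.ndarray, true_idx: int,
                    margin: float = 0.1) -> float:
    """Simplified hinge rank loss over the training labels: penalize any
    wrong label scoring within `margin` of the true label's score."""
    scores = label_embeddings[:N_TRAIN_LABELS] @ image_embedding
    violations = np.maximum(0.0, margin - scores[true_idx] + scores)
    violations[true_idx] = 0.0  # exclude the true label from the sum
    return float(violations.sum())

# Toy usage: a random "image" pushed through a random mapping M.
M = rng.normal(size=(EMB_DIM, VIS_DIM)) * 0.01
img = embed_image(rng.normal(size=VIS_DIM), M)
print(predict_labels(img), hinge_rank_loss(img, true_idx=3))

Because label_embeddings covers far more labels than the visual model was trained on, predict_labels can rank never-seen labels alongside familiar ones, which is the zero-shot setting whose hit rates the abstract reports.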


Reference text

[1] Samy Bengio, Jason Weston, and David Grangier. Label embedding trees for large multi-class tasks. In Advances in Neural Information Processing Systems (NIPS), 2010.

[2] Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.

[3] Adam Coates and Andrew Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In International Conference on Machine Learning (ICML), 2011.

[4] Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

[5] Thomas Dean, Mark Ruzon, Mark Segal, Jonathon Shlens, Sudheendra Vijayanarasimhan, and Jay Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[6] Jia Deng, Alex Berg, Sanjeev Satheesh, Hao Su, Aditya Khosla, and Fei-Fei Li. ImageNet large scale visual recognition challenge 2012.

[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[8] Thomas Deselaers and Vittorio Ferrari. Visual and semantic similarity in ImageNet. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.

[9] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[10] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

[12] Thomas Mensink, Jakob Verbeek, Florent Perronnin, and Gabriela Csurka. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In European Conference on Computer Vision (ECCV), 2012.

[13] Tomas Mikolov, Kai Chen, Greg S. Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations (ICLR), Scottsdale, Arizona, USA, 2013.

[14] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), 2013.

[15] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Jonathon Shlens, Andrea Frome, Greg S. Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv (to be submitted), 2013.

[16] Mark Palatucci, Dean Pomerleau, Geoffrey E. Hinton, and Tom M. Mitchell. Zero-shot learning with semantic output codes. In Advances in Neural Information Processing Systems (NIPS), 2009.

[17] Marcus Rohrbach, Michael Stark, and Bernt Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.

[18] Richard Socher, Milind Ganjoo, Hamsa Sridhar, Osbert Bastani, Christopher D. Manning, and Andrew Y. Ng. Zero-shot learning through cross-modal transfer. In International Conference on Learning Representations (ICLR), Scottsdale, Arizona, USA, 2013.

[19] Laurens van der Maaten and Geoffrey E. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

[20] Jason Weston, Samy Bengio, and Nicolas Usunier. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning, 81(1):21–35, 2010.