92 nips-2012-Deep Representations and Codes for Image Auto-Annotation


Author: Ryan Kiros, Csaba Szepesvári

Abstract: The task of image auto-annotation, namely assigning a set of relevant tags to an image, is challenging due to the size and variability of tag vocabularies. Consequently, most existing algorithms focus on tag assignment and fix an often large number of hand-crafted features to describe image characteristics. In this paper we introduce a hierarchical model for learning representations of standard-sized color images from the pixel level, removing the need for engineered feature representations and subsequent feature selection for annotation. We benchmark our model on the STL-10 recognition dataset, achieving state-of-the-art performance. When our features are combined with TagProp (Guillaumin et al. [9]), we compete with or outperform existing annotation approaches that use over a dozen distinct hand-crafted image descriptors. Furthermore, using 256-bit codes and Hamming distance for training TagProp, we trade only a small reduction in performance for efficient storage and fast comparisons. Self-taught learning is used in all of our experiments, and deeper architectures always outperform shallow ones.
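
The last two sentences of the abstract describe the retrieval mechanics: each image is reduced to a 256-bit binary code, and nearest neighbors for TagProp are found by Hamming distance over those codes. Below is a minimal sketch of that comparison, assuming each code is packed into 32 bytes (32 × 8 = 256 bits); the database size, query, and neighbor count k are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

# Hypothetical database of 256-bit codes, one code per row,
# packed into 32 uint8 bytes (32 * 8 = 256 bits).
rng = np.random.default_rng(0)
database = rng.integers(0, 256, size=(10_000, 32), dtype=np.uint8)
query = rng.integers(0, 256, size=32, dtype=np.uint8)

# XOR marks the bit positions where two codes disagree; unpacking the
# bytes into individual bits and summing counts them, which is exactly
# the Hamming distance.
diff = np.bitwise_xor(database, query)               # shape (10000, 32)
distances = np.unpackbits(diff, axis=1).sum(axis=1)  # one distance per code

# Indices of the k codes closest to the query in Hamming distance.
k = 5
neighbors = np.argsort(distances)[:k]
```

The appeal of this scheme, as the abstract notes, is storage and speed: 32 bytes per image rather than a dozen real-valued descriptors, with distances computed by XOR and bit counting instead of floating-point arithmetic.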


References

[1] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, pages 1794–1801, 2009.

[2] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, pages 3360–3367, 2010.

[3] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, pages 1–8, 2009.

[4] K. Yu, Y. Lin, and J. Lafferty. Learning image representations from the pixel level via hierarchical sparse coding. In CVPR, pages 1713–1720, 2011.

[5] L. Bo, X. Ren, and D. Fox. Hierarchical matching pursuit for image classification: Architecture and fast algorithms. In NIPS, 2011.

[6] R. Raina, A. Battle, H. Lee, B. Packer, and A.Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In ICML, pages 759–766, 2007.

[7] A. Makadia, V. Pavlovic, and S. Kumar. A new baseline for image annotation. In ECCV, volume 8, pages 316–329, 2008.

[8] H. Nakayama. Linear distance metric learning for large-scale generic image recognition. PhD thesis, The University of Tokyo.

[9] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In ICCV, pages 309–316, 2009.

[10] D. Tsai, Y. Jing, Y. Liu, H.A. Rowley, S. Ioffe, and J.M. Rehg. Large-scale image annotation using visual synset. In ICCV, pages 611–618, 2011.

[11] J. Weston, S. Bengio, and N. Usunier. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning, 81(1):21–35, 2010.

[12] S. Zhang, J. Huang, Y. Huang, Y. Yu, H. Li, and D.N. Metaxas. Automatic image annotation using group sparsity. In CVPR, pages 3312–3319, 2010.

[13] A. Coates and A.Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In ICML, 2011.

[14] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR, pages 1–8, 2008.

[15] R. Rubinstein, M. Zibulevsky, and M. Elad. Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit. Technical report, CS Technion, 2008.

[16] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV, pages 2146–2153, 2009.

[17] G. Hinton and R. Salakhutdinov. Discovering binary codes for documents by learning deep generative models. Topics in Cognitive Science, 3(1):74–91, 2011.

[18] J. Ngiam, P.W. Koh, Z. Chen, S. Bhaskar, and A.Y. Ng. Sparse filtering. In NIPS, 2011.

[19] A. Coates and A.Y. Ng. Selecting receptive fields in deep networks. In NIPS, 2011.

[20] W. Zou, A.Y. Ng, and K. Yu. Unsupervised learning of visual invariance with temporal coherence. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[21] L. Bo, X. Ren, and D. Fox. Unsupervised feature learning for RGB-D based object recognition. In ISER, June 2012.

[22] Y. Jia, C. Huang, and T. Darrell. Beyond spatial pyramids: Receptive field learning for pooled image features. In CVPR, 2012.

[23] M.L. Zhang and Z.H. Zhou. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048, 2007.

[24] Z. Wang, Y. Hu, and L.T. Chia. Multi-label learning by image-to-class distance for scene classification and image annotation. In CIVR, pages 105–112, 2010.

[25] M.L. Zhang and Z.H. Zhou. Multi-label learning by instance differentiation. In AAAI, pages 669–674, 2007.

[26] S.L. Feng, R. Manmatha, and V. Lavrenko. Multiple Bernoulli relevance models for image and video annotation. In CVPR, pages 1002–1009, 2004.

[27] A. Krizhevsky and G.E. Hinton. Using very deep autoencoders for content-based image retrieval. In ESANN, 2011.