
NIPS 2010, Paper 143: Learning Convolutional Feature Hierarchies for Visual Recognition


Source: pdf

Author: Koray Kavukcuoglu, Pierre Sermanet, Y-Lan Boureau, Karol Gregor, Michael Mathieu, Yann LeCun

Abstract: We propose an unsupervised method for learning multi-stage hierarchies of sparse convolutional features. While sparse coding has become an increasingly popular method for learning visual features, it is most often trained at the patch level. Applying the resulting filters convolutionally produces highly redundant codes because overlapping patches are encoded in isolation. By training convolutionally over large image windows, our method reduces the redundancy between feature vectors at neighboring locations and improves the efficiency of the overall representation. In addition to a linear decoder that reconstructs the image from sparse features, our method trains an efficient feed-forward encoder that predicts quasi-sparse features from the input. While patch-based training rarely produces anything but oriented edge detectors, we show that convolutional training produces highly diverse filters, including center-surround filters, corner detectors, cross detectors, and oriented grating detectors. We show that using these filters in a multi-stage convolutional network architecture improves performance on a number of visual recognition and detection tasks.
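The inference step the abstract alludes to (finding sparse feature maps whose convolution with the dictionary filters reconstructs the image) can be sketched with an ISTA-style update, as in reference [11]. This is a minimal illustrative sketch, not the authors' code: the function name `conv_sparse_code`, the filter-bank layout, and the fixed step size `lr` are all assumptions made here for clarity.

```python
import numpy as np
from scipy.signal import convolve2d

def soft_threshold(x, t):
    """Proximal operator of the l1 penalty: shrink toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def conv_sparse_code(image, filters, lam=0.1, lr=0.05, n_iter=50):
    """ISTA-style convolutional sparse coding sketch (hypothetical helper).

    image   : (H, W) array to reconstruct
    filters : (K, kh, kw) dictionary of convolutional filters
    lam     : l1 sparsity weight; lr : gradient step size (assumed fixed)
    Returns the K feature maps z_k minimizing, approximately,
        0.5 * || sum_k d_k * z_k - image ||^2 + lam * sum_k |z_k|_1
    """
    H, W = image.shape
    K, kh, kw = filters.shape
    # 'full' convolution of a (H-kh+1, W-kw+1) map with a (kh, kw)
    # filter gives back an (H, W) reconstruction
    codes = np.zeros((K, H - kh + 1, W - kw + 1))
    for _ in range(n_iter):
        recon = sum(convolve2d(z, d, mode='full')
                    for z, d in zip(codes, filters))
        err = recon - image
        for k in range(K):
            # gradient wrt z_k is the correlation of the residual with d_k,
            # i.e. convolution with the flipped filter, valid region only
            grad = convolve2d(err, filters[k][::-1, ::-1], mode='valid')
            codes[k] = soft_threshold(codes[k] - lr * grad, lr * lam)
    return codes
```

Because the reconstruction sums overlapping filter responses, neighboring code units compete to explain the same pixels, which is the mechanism the abstract credits for reducing redundancy relative to isolated patch-wise encoding.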


References

[1] LeCun, Y, Bottou, L, Bengio, Y, and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

[2] Serre, T, Wolf, L, and Poggio, T. Object recognition with features inspired by visual cortex. In CVPR’05 - Volume 2, pages 994–1000, Washington, DC, USA, 2005. IEEE Computer Society.

[3] Lee, H, Grosse, R, Ranganath, R, and Ng, A. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML’09, pages 609–616. ACM, 2009.

[4] Ranzato, M, Poultney, C, Chopra, S, and LeCun, Y. Efficient learning of sparse representations with an energy-based model. In NIPS’07. MIT Press, 2007.

[5] Kavukcuoglu, K, Ranzato, M, Fergus, R, and LeCun, Y. Learning invariant features through topographic filter maps. In CVPR’09. IEEE, 2009.

[6] Zeiler, M, Krishnan, D, Taylor, G, and Fergus, R. Deconvolutional Networks. In CVPR’10. IEEE, 2010.

[7] Aharon, M, Elad, M, and Bruckstein, A. M. K-SVD and its non-negative variant for dictionary design. In Papadakis, M, Laine, A. F, and Unser, M. A, editors, Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, volume 5914, pages 327–339, August 2005.

[8] Mairal, J, Bach, F, Ponce, J, and Sapiro, G. Online dictionary learning for sparse coding. In ICML’09, pages 689–696. ACM, 2009.

[9] Li, Y and Osher, S. Coordinate descent optimization for l1 minimization with application to compressed sensing; a greedy algorithm. CAM Report 09-17, UCLA, 2009.

[10] Olshausen, B. A and Field, D. J. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.

[11] Beck, A and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci., 2(1):183–202, 2009.

[12] Mallat, S and Zhang, Z. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

[13] Martin, D, Fowlkes, C, Tal, D, and Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV’01, volume 2, pages 416–423, July 2001.

[14] Jarrett, K, Kavukcuoglu, K, Ranzato, M, and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV’09. IEEE, 2009.

[15] Gregor, K and LeCun, Y. Learning fast approximations of sparse coding. In ICML’10, 2010.

[16] LeCun, Y, Bottou, L, Orr, G, and Muller, K. Efficient backprop. In Orr, G and K., M, editors, Neural Networks: Tricks of the trade. Springer, 1998.

[17] Schwartz, O and Simoncelli, E. P. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825, August 2001.

[18] Lyu, S and Simoncelli, E. P. Nonlinear image representation using divisive normalization. In CVPR’08. IEEE Computer Society, Jun 23-28 2008.

[19] Fei-Fei, L, Fergus, R, and Perona, P. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In Workshop on Generative-Model Based Vision, 2004.

[20] Pinto, N, Cox, D. D, and DiCarlo, J. J. Why is real-world visual object recognition hard? PLoS Comput Biol, 4(1):e27, 01 2008.

[21] Lazebnik, S, Schmid, C, and Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR’06, 2:2169–2178, 2006.

[22] Boureau, Y, Bach, F, LeCun, Y, and Ponce, J. Learning mid-level features for recognition. In CVPR’10. IEEE, 2010.

[23] Dalal, N and Triggs, B. Histograms of oriented gradients for human detection. In Schmid, C, Soatto, S, and Tomasi, C, editors, CVPR’05, volume 2, pages 886–893, June 2005.

[24] Walk, S, Majer, N, Schindler, K, and Schiele, B. New features and insights for pedestrian detection. In CVPR’10, San Francisco, CA, 2010.

[25] Dollár, P, Wojek, C, Schiele, B, and Perona, P. Pedestrian detection: A benchmark. In CVPR’09. IEEE, June 2009.

[26] Dollár, P, Tu, Z, Perona, P, and Belongie, S. Integral channel features. In BMVC 2009, London, England.

[27] Dollár, P, Belongie, S, and Perona, P. The fastest pedestrian detector in the west. In BMVC 2010, Aberystwyth, UK.

[28] Felzenszwalb, P, Girshick, R, McAllester, D, and Ramanan, D. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2010.