NIPS 2013, Paper 166: Learning invariant representations and applications to face verification
Authors: Qianli Liao, Joel Z. Leibo, Tomaso Poggio
Abstract: One approach to computer object recognition and modeling the brain’s ventral stream involves unsupervised learning of representations that are invariant to common transformations. However, applications of these ideas have usually been limited to 2D affine transformations, e.g., translation and scaling, since they are easiest to solve via convolution. In accord with a recent theory of transformation invariance [1], we propose a model that, while capturing other common convolutional networks as special cases, can also be used with arbitrary identity-preserving transformations. The model’s wiring can be learned from videos of transforming objects—or any other grouping of images into sets by their depicted object. Through a series of successively more complex empirical tests, we study the invariance/discriminability properties of this model with respect to different transformations. First, we empirically confirm theoretical predictions (from [1]) for the case of 2D affine transformations. Next, we apply the model to non-affine transformations; as expected, it performs well on face verification tasks requiring invariance to the relatively smooth transformations of 3D rotation-in-depth and changes in illumination direction. Surprisingly, it can also tolerate clutter “transformations” which map an image of a face on one background to an image of the same face on a different background. Motivated by these empirical findings, we tested the same model on face verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig [2, 3, 4] and a new dataset we gathered—achieving strong performance in these highly unconstrained cases as well.
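To make the approach concrete, below is a minimal sketch of the pooled dot-product signature described in [1]: each stored template is represented by its “orbit” (the set of its transformed views, e.g., frames from a video of a transforming object), and an input image is encoded by pooling its normalized dot products with each orbit. The function name, the orbit layout, and the choice of histogram pooling are illustrative assumptions for this sketch, not the authors’ exact implementation.

    import numpy as np

    def invariant_signature(x, template_orbits, n_bins=10):
        # x: flattened input image, shape (dim,)
        # template_orbits: list of arrays, each of shape (n_transforms, dim),
        # holding the transformed views ("orbit") of one stored template.
        # This layout is a hypothetical convenience, not a prescribed format.
        x = x / np.linalg.norm(x)
        parts = []
        for orbit in template_orbits:
            t = orbit / np.linalg.norm(orbit, axis=1, keepdims=True)
            projections = t @ x  # <x, g t_k> for every stored transform g
            # Pool over the orbit: a histogram of the projections is invariant
            # to reordering, hence to the transformations that generated the
            # orbit (projections of unit vectors lie in [-1, 1]).
            hist, _ = np.histogram(projections, bins=n_bins, range=(-1.0, 1.0))
            parts.append(hist / len(projections))
        return np.concatenate(parts)

    # Illustrative usage with random data standing in for grouped images:
    rng = np.random.default_rng(0)
    orbits = [rng.standard_normal((20, 256)) for _ in range(5)]  # 5 templates, 20 views each
    signature = invariant_signature(rng.standard_normal(256), orbits)

Replacing the histogram with a mean or a max recovers the pooling used in standard convolutional networks, which is the sense in which the model captures those networks as special cases.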
[1] T. Poggio, J. Mutch, F. Anselmi, J. Z. Leibo, L. Rosasco, and A. Tacchetti, “The computational magic of the ventral stream: sketch of a theory (and why some deep architectures work),” MIT-CSAIL-TR-2012-035, 2012.
[2] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” in Workshop on faces in real-life images: Detection, alignment and recognition (ECCV), (Marseille, Fr), 2008.
[3] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, “Attribute and Simile Classifiers for Face Verification,” in IEEE International Conference on Computer Vision (ICCV), (Kyoto, JP), pp. 365–372, Oct. 2009.
[4] N. Pinto, Z. Stone, T. Zickler, and D. D. Cox, “Scaling-up Biologically-Inspired Computer Vision: A Case-Study on Facebook,” in IEEE Computer Vision and Pattern Recognition, Workshop on Biologically Consistent Vision, 2011.
[5] S. Dalí, “The persistence of memory (1931).” Museum of Modern Art, New York, NY.
[6] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, vol. 25, (Lake Tahoe, CA), 2012.

Footnote 9: Our method of testing does not strictly conform to the protocol recommended by the creators of LFW [2]: we re-aligned the faces (with a worse alignment), and we also used the identities of the individuals during training.

Footnote 10: The original PubFig dataset was provided only as a list of URLs from which the images could be downloaded; now only half of the images remain available. On the original dataset, the strongest reported performance is 78.7% [3]. The authors of that study also made their features available, so we estimated the performance of their features on the available subset of images. An SVM classifier using their features and our cross-validation splits achieves 78.4% correct, 3.3% lower than our best model.
[7] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4277–4280, 2012.
[8] C. F. Cadieu, H. Hong, D. Yamins, N. Pinto, N. J. Majaj, and J. J. DiCarlo, “The neural representation benchmark and its evaluation on brain and machine,” arXiv preprint arXiv:1301.3530, 2013.
[9] P. Földiák, “Learning invariance from transformation sequences,” Neural Computation, vol. 3, no. 2, pp. 194–200, 1991.
[10] L. Wiskott and T. Sejnowski, “Slow feature analysis: Unsupervised learning of invariances,” Neural computation, vol. 14, no. 4, pp. 715–770, 2002.
[11] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?,” IEEE International Conference on Computer Vision, pp. 2146–2153, 2009.
[12] J. Z. Leibo, J. Mutch, L. Rosasco, S. Ullman, and T. Poggio, “Learning Generic Invariances in Object Recognition: Translation and Scale,” MIT-CSAIL-TR-2010-061, CBCL-294, 2010.
[13] A. Saxe, P. W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Y. Ng, “On random weights and unsupervised feature learning,” Proceedings of the International Conference on Machine Learning (ICML), 2011.
[14] M. Riesenhuber and T. Poggio, “Hierarchical models of object recognition in cortex,” Nature Neuroscience, vol. 2, pp. 1019–1025, Nov. 1999.
[15] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, “Robust Object Recognition with Cortex-Like Mechanisms,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 3, pp. 411–426, 2007.
[16] K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological Cybernetics, vol. 36, pp. 193–202, Apr. 1980.
[17] Y. LeCun, F. J. Huang, and L. Bottou, “Learning methods for generic object recognition with invariance to pose and lighting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 90–97, IEEE, 2004.
[18] E. Bart and S. Ullman, “Class-based feature matching across unrestricted transformations,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 30, no. 9, pp. 1618–1631, 2008.
[19] N. Pinto, Y. Barhomi, D. Cox, and J. J. DiCarlo, “Comparing state-of-the-art visual features on invariant object recognition tasks,” in Applications of Computer Vision (WACV), 2011 IEEE Workshop on, 2011.
[20] T. Vetter, A. Hurlbert, and T. Poggio, “View-based models of 3D object recognition: invariance to imaging transformations,” Cerebral Cortex, vol. 5, no. 3, p. 261, 1995.
[21] J. Z. Leibo, J. Mutch, and T. Poggio, “Why The Brain Separates Face Recognition From Object Recognition,” in Advances in Neural Information Processing Systems (NIPS), (Granada, Spain), 2011.
[22] H. Kim, J. Wohlwend, J. Z. Leibo, and T. Poggio, “Body-form and body-pose recognition with a hierarchical model of the ventral stream,” MIT-CSAIL-TR-2013-013, CBCL-312, 2013.
[23] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 886–893, 2005.
[24] E. Oja, “Simplified neuron model as a principal component analyzer,” Journal of mathematical biology, vol. 15, no. 3, pp. 267–273, 1982.
[25] A. Afraz, M. V. Pashkam, and P. Cavanagh, “Spatial heterogeneity in the perception of face and form attributes,” Current Biology, vol. 20, no. 23, pp. 2112–2116, 2010.
[26] J. Z. Leibo, Q. Liao, and T. Poggio, “Subtasks of Unconstrained Face Recognition,” in International Joint Conference on Computer Vision, Imaging and Computer Graphics, VISIGRAPP, (Lisbon), 2014.
[27] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 7, pp. 971–987, 2002.
[28] X. Tan and B. Triggs, “Enhanced local texture feature sets for face recognition under difficult lighting conditions,” in Analysis and Modeling of Faces and Gestures, pp. 168–182, Springer, 2007.
[29] V. Ojansivu and J. Heikkilä, “Blur insensitive texture classification using local phase quantization,” in Image and Signal Processing, pp. 236–243, Springer, 2008.
[30] X. Zhu and D. Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[31] S. u. Hussain, T. Napoleon, and F. Jurie, “Face recognition using local quantized patterns,” in Proc. British Machine Vision Conference (BMCV), vol. 1, (Guildford, UK), pp. 52–61, 2012.
[32] M. Kouh and T. Poggio, “A canonical neural circuit for cortical nonlinear operations,” Neural computation, vol. 20, no. 6, pp. 1427–1451, 2008.
[33] D. Hubel and T. Wiesel, “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex,” The Journal of Physiology, vol. 160, no. 1, p. 106, 1962.