Pose-Sensitive Embedding by Nonlinear NCA Regression (NIPS 2010, paper 209)

Source: pdf

Author: Rob Fergus, George Williams, Ian Spiro, Christoph Bregler, Graham W. Taylor

Abstract: This paper tackles the complex problem of visually matching people in similar pose but with different clothes, background, and other appearance changes. We achieve this with a novel method for learning a nonlinear embedding based on several extensions to the Neighborhood Component Analysis (NCA) framework. Our method is convolutional, enabling it to scale to realistically sized images. By cheaply labeling the head and hands in large video databases through Amazon Mechanical Turk (a crowd-sourcing service), we can use the task of localizing the head and hands as a proxy for determining body pose. We apply our method to challenging real-world data and show that it can generalize beyond hand localization to infer a more general notion of body pose. We evaluate our method quantitatively against other embedding methods. We also demonstrate that real-world performance can be improved through the use of synthetic data.

reference text

[1] A. Agarwal and B. Triggs. Recovering 3D human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1):44–58, 2006.

[2] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, pages 459–468, 2006.

[3] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In CVPR, 2009.

[4] V. Athitsos, J. Alon, S. Sclaroff, and G. Kollios. BoostMap: A method for efficient approximate similarity rankings. In CVPR, 2004.

[5] S. Becker and G. Hinton. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(6356):161–163, 1992.

[6] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3D human pose annotations. In ICCV, 2009.

[7] J. Bouvrie. Notes on convolutional neural networks. Unpublished, 2006.

[8] P. Buehler, A. Zisserman, and M. Everingham. Learning sign language by watching TV (using weakly aligned subtitles). In CVPR, 2009.

[9] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, 2006.

[10] A. Farhadi, D. Forsyth, and R. White. Transfer learning in sign language. In CVPR, 2007.

[11] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.

[12] V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Pose search: Retrieving people using their pose. In CVPR, 2009.

[13] A. Frome, G. Cheung, A. Abdulkader, M. Zennaro, B. Wu, A. Bissacco, H. Adam, H. Neven, and L. Vincent. Large-scale privacy protection in Google Street View. In ICCV, 2009.

[14] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS, 2004.

[15] K. Grauman, G. Shakhnarovich, and T. Darrell. Inferring 3D structure with a statistical image-based shape model. In ICCV, pages 641–648, 2003.

[16] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, pages 1735–1742, 2006.

[17] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[18] K. Jarrett, K. Kavukcuoglu, M.-A. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV, 2009.

[19] K. Kavukcuoglu, M.-A. Ranzato, and Y. LeCun. Fast inference in sparse coding algorithms with applications to object recognition. Technical report, NYU, 2008. CBLL-TR-2008-12-01.

[20] P. Keller, S. Mannor, and D. Precup. Automatic basis function construction for approximate dynamic programming and reinforcement learning. In ICML, pages 449–456, 2006.

[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.

[22] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, pages 609–616, 2009.

[23] R. Memisevic and G. Hinton. Unsupervised learning of image transformations. In CVPR, 2007.

[24] H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in video. In ICML, pages 737–744, 2009.

[25] G. Mori and J. Malik. Estimating human body configurations using shape context matching. In ECCV, 2002.

[26] M. Nechyba, L. Brandy, and H. Schneiderman. Pittpatt face detection and tracking for the CLEAR 2007 evaluation. Multimodal Technologies for Perception of Humans, 2008.

[27] M. Norouzi, M. Ranjbar, and G. Mori. Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning. In CVPR, 2009.

[28] S.J. Nowlan and J.C. Platt. A convolutional neural network hand tracker. In NIPS, 1995.

[29] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.

[30] N. Pinto, D. Cox, and J. DiCarlo. Why is real-world visual object recognition hard? PLoS Comput Biol, 4(1), 2008.

[31] N. Pinto, D. Doukhan, J. DiCarlo, and D. D. Cox. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Comput Biol, 5(11), 2009.

[32] R. Poppe. Vision-based human motion analysis: An overview. Computer Vision and Image Understanding, 108(1-2):4–18, 2007.

[33] D. Ramanan, D. Forsyth, and A. Zisserman. Strike a pose: Tracking people by finding stylized poses. In CVPR, 2005.

[34] R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In AISTATS, volume 11, 2007.

[35] B. Sapp, C. Jordan, and B. Taskar. Adaptive pose priors for pictorial structures. In CVPR, 2010.

[36] G. Shakhnarovich, P. Viola, and T. Darrell. Fast pose estimation with parameter-sensitive hashing. In ICCV, pages 750–759, 2003.

[37] L. Sigal, A. Balan, and M. J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. IJCV, 87(1/2):4–27, 2010.

[38] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR, 2008.

[39] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780–785, 1997.