277 cvpr-2013-MODEC: Multimodal Decomposable Models for Human Pose Estimation

Source: pdf

Author: Ben Sapp, Ben Taskar

Abstract: We propose a multimodal, decomposable model for articulated human pose estimation in monocular images. A typical approach to this problem is to use a linear structured model, which struggles to capture the wide range of appearance present in realistic, unconstrained images. In this paper, we instead propose a model of human pose that explicitly captures a variety of pose modes. Unlike other multimodal models, our approach includes both global and local pose cues and uses a convex objective and joint training for mode selection and pose estimation. We also employ a cascaded mode selection step that controls the trade-off between speed and accuracy, yielding a 5x speedup in inference and learning. Our model outperforms state-of-the-art approaches across the accuracy-speed trade-off curve for several pose datasets. This includes our newly collected dataset of people in movies, FLIC, which contains an order of magnitude more labeled data for training and testing than existing datasets. The new dataset and code are available online.

reference text

[1] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In Proc. CVPR, 2009.

[2] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In Proc. ICCV, 2009.

[3] K. Duan, D. Batra, and D. Crandall. A multi-layer composite model for human pose estimation. In Proc. BMVC, 2012.

[4] M. Eichner and V. Ferrari. Better appearance models for pictorial structures. In Proc. BMVC, 2009.

[5] M. Eichner, M. Marin-Jimenez, A. Zisserman, and V. Ferrari. Articulated human pose estimation and search in (almost) unconstrained still images. Technical report, ETH Zurich, D-ITET, BIWI, 2010.

[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2009 (VOC2009) Results, 2009.

[7] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. A library for large linear classification. JMLR, 2008.

[8] P. Felzenszwalb, R. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 4, 2011.

[9] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. IJCV, 2005.

[10] A. Frome, Y. Singer, F. Sha, and J. Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In Proc. ICCV, 2007.

[11] S. Johnson and M. Everingham. Learning effective human pose estimation from inaccurate annotation. In Proc. CVPR, 2011.

[12] L. Ladicky and P. H. Torr. Locally linear support vector machines. In Proc. ICML, 2011.

[13] T. Malisiewicz, A. Gupta, and A. Efros. Ensemble of exemplar-svms for object detection and beyond. In Proc. ICCV, 2011.

[14] D. Ramanan and C. Sminchisescu. Training deformable models for localization. In Proc. CVPR, 2006.

[15] B. Sapp, A. Toshev, and B. Taskar. Cascaded models for articulated pose estimation. In Proc. ECCV, 2010.

[16] B. Sapp, D. Weiss, and B. Taskar. Parsing human motion with stretchable models. In Proc. CVPR, 2011.

[17] M. Sun and S. Savarese. Articulated part-based model for joint object detection and pose estimation. In Proc. ICCV, 2011.

[18] Y. Tian, C. Zitnick, and S. Narasimhan. Exploring the spatial hierarchy of mixture models for human pose estimation. In Proc. ECCV, 2012.

[19] D. Tran and D. Forsyth. Improved human parsing with a full relational model. In Proc. ECCV, 2010.

[20] Y. Wang and G. Mori. Multiple tree models for occlusion and spatial constraints in human pose estimation. In Proc. ECCV, 2008.

[21] Y. Wang, D. Tran, and Z. Liao. Learning hierarchical poselets for human parsing. In Proc. CVPR, 2011.

[22] D. Weiss, B. Sapp, and B. Taskar. Structured prediction cascades (under review). In JMLR, 2012.

[23] D. Weiss and B. Taskar. Structured prediction cascades. In Proc. AISTATS, 2010.

[24] Y. Yang and D. Ramanan. Articulated pose estimation using flexible mixtures of parts. In Proc. CVPR, 2011.

[25] X. Zhu and D. Ramanan. Face detection, pose estimation and landmark localization in the wild. In Proc. CVPR, 2012.