
225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation

Source: pdf

Author: Brandon Rothrock, Seyoung Park, Song-Chun Zhu

Abstract: In this paper we present a compositional and-or graph grammar model for human pose estimation. Our model has three distinguishing features: (i) large appearance differences between people are handled compositionally by allowing parts or collections of parts to be substituted with alternative variants, (ii) each variant is a sub-model that can define its own articulated geometry and context-sensitive compatibility with neighboring part variants, and (iii) background region segmentation is incorporated into the part appearance models to better estimate the contrast of a part region from its surroundings, and improve resilience to background clutter. The resulting integrated framework is trained discriminatively in a max-margin framework using an efficient and exact inference algorithm. We present experimental evaluation of our model on two popular datasets, and show performance improvements over the state of the art on both benchmarks.

reference text

[1] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In CVPR, pages 1014–1021, 2009.

[2] L. D. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In ICCV, pages 1365–1372, 2009.

[3] H. Chen, Z. Xu, Z. Liu, and S. C. Zhu. Composite templates for cloth modeling and sketching. In CVPR (1), pages 943–950, 2006.

[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, pages I: 886–893, 2005.

[5] C. Desai and D. Ramanan. Detecting actions, poses, and objects with relational phraselets. In ECCV (4), pages 158–172, 2012.

[6] M. Eichner and V. Ferrari. Better appearance models for pictorial structures. In BMVC, 2009.

[7] P. F. Felzenszwalb and D. P. Huttenlocher. Distance transforms of sampled functions. In Cornell University, 2004.

[8] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. International Journal of Computer Vision, 61(1):55–79, Jan. 2005.

Figure 5. Example results (panels: PARSE, Leeds, PARSE failures, Leeds failures): Bounding boxes are drawn for each of the 10 parts considered for evaluation. Parts localized correctly are shown in red, and incorrectly in blue. The top two rows are examples of challenging poses with perfect localization scores from the PARSE and Leeds datasets respectively. The bottom row illustrates some of the failure modes on both datasets. Common failures are due to double counting, occlusion, and background confounders, which is compounded by extreme perspective, foreshortening, and crumpled poses.

[9] V. Ferrari, M. M. Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In CVPR, pages 1–8, 2008.

[10] K. S. Fu. A step towards unification of syntactic and statistical pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell., 8(3):398–404, 1986.

[11] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester. Object detection with grammar models. In NIPS, 2011.

[12] S. Johnson and M. Everingham. Combining discriminative appearance and segmentation cues for articulated human pose estimation. In ICCV Workshop on Machine Learning for Vision-based Motion Analysis, 2009.

[13] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In BMVC, pages 1–11, 2010.

[14] S. Johnson and M. Everingham. Learning effective human pose estimation from inaccurate annotation. In CVPR, pages 1465–1472, 2011.

[15] D. Ramanan. Learning to parse images of articulated bodies. In NIPS, pages 1129–1136, 2006.

[16] B. Rothrock and S.-C. Zhu. Human parsing using stochastic and-or grammars and rich appearances. In ICCV Workshops, pages 640–647, 2011.

[17] B. Sapp, C. Jordan, and B. Taskar. Adaptive pose priors for pictorial structures. In CVPR, pages 422–429, 2010.

[18] P. Srinivasan and J. B. Shi. Bottom-up recognition and parsing of the human body. In CVPR, pages 1–8, 2007.

[19] M. Sun and S. Savarese. Articulated part-based model for joint object detection and pose estimation. In ICCV, pages 723–730, 2011.

[20] B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. In NIPS, 2003.

[21] Y. Tian, C. L. Zitnick, and S. G. Narasimhan. Exploring the spatial hierarchy of mixture models for human pose estimation. In ECCV (5), pages 256–269, 2012.

[22] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.

[23] Z. Tu and S. C. Zhu. Image segmentation by data-driven markov chain monte carlo. IEEE Trans. Pattern Anal. Mach. Intell., 24(5):657–673, 2002.

[24] Y. Wang, D. Tran, and Z. Liao. Learning hierarchical poselets for human parsing. In CVPR, pages 1705–1712, 2011.

[25] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR, pages 1385–1392, 2011.

[26] L. Zhu, Y. Chen, Y. Lu, C. Lin, and A. L. Yuille. Max margin and/or graph learning for parsing the human body. In CVPR, 2008.

[27] S. C. Zhu and D. Mumford. A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, 2(4):259–362, 2006.