
311 iccv-2013-Pedestrian Parsing via Deep Decompositional Network

Source: pdf

Author: Ping Luo, Xiaogang Wang, Xiaoou Tang

Abstract: We propose a new Deep Decompositional Network (DDN) for parsing pedestrian images into semantic regions, such as hair, head, body, arms, and legs, where the pedestrians can be heavily occluded. Unlike existing methods based on template matching or Bayesian inference, our approach directly maps low-level visual features to the label maps of body parts with DDN, which is able to accurately estimate complex pose variations with good robustness to occlusions and background clutter. DDN jointly estimates occluded regions and segments body parts by stacking three types of hidden layers: occlusion estimation layers, completion layers, and decomposition layers. The occlusion estimation layers estimate a binary mask, indicating which part of a pedestrian is invisible. The completion layers synthesize low-level features of the invisible part from the original features and the occlusion mask. The decomposition layers directly transform the synthesized visual features to label maps. We devise a new strategy to pre-train these hidden layers, and then fine-tune the entire network using stochastic gradient descent. Experimental results show that our approach achieves better segmentation accuracy than the state-of-the-art methods on pedestrian images with or without occlusions. Another important contribution of this paper is that it provides a large-scale benchmark human parsing dataset that includes 3,673 annotated samples collected from 171 surveillance videos. It is 20 times larger than existing public datasets.
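The three layer types named in the abstract can be read as a simple feed-forward pipeline. The following is a minimal sketch of that data flow only, not the authors' implementation: the layer sizes, the sigmoid activations, the random placeholder weights, the convention that mask values near 1 mean "visible", and the elementwise product used to combine features with the mask are all assumptions made for illustration. The names `dense`, `make_params`, and `ddn_forward` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(rng, n_in, n_out):
    """Random weights and zero biases, standing in for trained parameters."""
    return rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out)

def make_params(rng, n_feat, n_hidden, n_pixels, n_parts):
    # Layer counts and widths are assumptions, not the paper's configuration.
    return {
        "occlusion": dense(rng, n_feat, n_feat),
        "completion": [dense(rng, n_feat, n_hidden), dense(rng, n_hidden, n_feat)],
        "decomposition": dense(rng, n_feat, n_pixels * n_parts),
    }

def ddn_forward(x, params, n_pixels, n_parts):
    # 1) Occlusion estimation: predict a soft mask over the feature vector
    #    (assumed convention: values near 1 mark visible entries).
    W, b = params["occlusion"]
    mask = sigmoid(x @ W + b)

    # 2) Completion: synthesize features for the invisible part from the
    #    original features and the mask (elementwise gating is an assumption).
    h = x * mask
    for W, b in params["completion"]:
        h = sigmoid(h @ W + b)
    completed = h

    # 3) Decomposition: map completed features to one label map per body part.
    W, b = params["decomposition"]
    label_maps = sigmoid(completed @ W + b).reshape(-1, n_parts, n_pixels)
    return mask, completed, label_maps

rng = np.random.default_rng(0)
n_feat, n_hidden, n_pixels, n_parts = 64, 32, 20 * 10, 5  # illustrative sizes
params = make_params(rng, n_feat, n_hidden, n_pixels, n_parts)
x = rng.random((3, n_feat))  # three fake feature vectors
mask, completed, label_maps = ddn_forward(x, params, n_pixels, n_parts)
print(mask.shape, completed.shape, label_maps.shape)
```

With trained weights, thresholding each of the `n_parts` maps would give a per-part segmentation. The pre-training strategy and the fine-tuning with stochastic gradient descent described in the abstract are not shown here.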

reference text

[1] Y. Bo and C. C. Fowlkes. Shape-based pedestrian parsing. CVPR, 2011.

[2] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. ICCV, 2009.

[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR, 2005.

[4] M. Enzweiler, A. Eigenstetter, B. Schiele, and D. M. Gavrila. Multi-cue pedestrian classification with partial occlusion handling. CVPR, 2010.

[5] S. Eslami, N. Heess, and J. Winn. The shape boltzmann machine: a strong model of object shape. CVPR, 2012.

[6] S. Eslami and C. Williams. A generative model for parts-based object segmentation. NIPS, 2012.

[7] T. Gao, B. Packer, and D. Koller. A segmentation-aware object detection model with occlusion handling. CVPR, 2011.

[8] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006.

[9] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.

[10] V. Jain and H. S. Seung. Natural image denoising with convolutional networks. NIPS, 2008.

[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

[12] P. Luo, X. Wang, and X. Tang. Hierarchical face parsing via deep learning. CVPR, 2012.

[13] P. Luo, X. Wang, and X. Tang. A deep sum-product architecture for robust facial attributes analysis. ICCV, 2013.

[14] V. Mnih and G. E. Hinton. Learning to label aerial images from noisy data. ICML, 2012.

[15] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. ICML, 2010.

[16] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. ICML, 2011.

[17] W. Ouyang and X. Wang. A discriminative deep model for pedestrian detection with occlusion handling. CVPR, 2012.

[18] W. Ouyang and X. Wang. Modeling mutual visibility relationship with a deep model in pedestrian detection. CVPR, 2013.

[19] N. Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 1999.

[20] I. Rauschert and R. T. Collins. A generative model for simultaneous estimation of human body shape and pixel-level segmentation. ECCV, 2012.

[21] R. Salakhutdinov and G. Hinton. Deep boltzmann machines. AISTATS, 2009.

[22] L. Sigal and M. J. Black. Humaneva: Synchronized video and motion capture dataset for evaluation of articulated human motion. Technical Report CS-06-08, Brown University, 2006.

[23] K. Skretting and K. Engan. Recursive least squares dictionary learning algorithm. IEEE Trans. Signal Process, 2010.

[24] Y. Tang, R. Salakhutdinov, and G. Hinton. Robust boltzmann machines for recognition and denoising. CVPR, 2012.

[25] M. Turk and A. Pentland. Eigenfaces for recognition. J. Cognitive Neuroscience, 1991.

[26] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 2010.

[27] L. Wang, J. Shi, G. Song, and I. fan Shen. Object detection combining recognition and segmentation. ACCV, 2007.

[28] X. Wang, T. X. Han, and S. Yan. An hog-lbp human detector with partial occlusion handling. ICCV, 2009.

[29] Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning identity preserving face space. ICCV, 2013.