iccv iccv2013 iccv2013-322 iccv2013-322-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Karteek Alahari, Guillaume Seguin, Josef Sivic, Ivan Laptev
Abstract: We seek to obtain a pixel-wise segmentation and pose estimation of multiple people in a stereoscopic video. This involves challenges such as dealing with unconstrained stereoscopic video, non-stationary cameras, and complex indoor and outdoor dynamic scenes. The contributions of our work are two-fold: First, we develop a segmentation model incorporating person detection, pose estimation, as well as colour, motion, and disparity cues. Our new model explicitly represents depth ordering and occlusion. Second, we introduce a stereoscopic dataset with frames extracted from feature-length movies “StreetDance 3D” and “Pina”. The dataset contains 2727 realistic stereo pairs and includes annotation of human poses, person bounding boxes, and pixel-wise segmentations for hundreds of people. The dataset is composed of indoor and outdoor scenes depicting multiple people with frequent occlusions. We demonstrate results on our new challenging dataset, as well as on the H2view dataset from (Sheasby et al. ACCV 2012).
Figure 5. Qualitative results on images from the movie “StreetDance”: (a) original image, (b) segmentation result. Each row shows the original image and the corresponding segmentation. Rows 1 and 2 demonstrate successful handling of occlusion between several people. The method can also handle non-trivial poses, as shown by Rows 3 and 4. The segmentation results are generally accurate, although some inaccuracies still remain on difficult examples. For instance, in Row 1, the segmentation is leaking into background for persons 3 and 5, due to the weak disparity cue for these people far away from the camera. The numbers denote the front (low values) to back (high values) ordering of people. (Best viewed in colour.)
[1] http://pascallin.ecs.soton.ac.uk/challenges/voc/voc2011, 2011.
[2] http://vision.middlebury.edu/stereo/, 2013.
[3] http://www.di.ens.fr/willow/research/stereoseg, 2013.
[4] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. PAMI, 2011.
[5] A. Ayvaci, M. Raptis, and S. Soatto. Sparse occlusion detection with optical flow. IJCV, 97(3), 2012.
[6] E. Boros and P. L. Hammer. Pseudo-boolean optimization. Discrete Applied Mathematics, 2002.
[7] Y. Boykov and M.-P. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In ICCV, 2001.
[8] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. PAMI, 23(11):1222–1239, 2001.
[9] I. Budvytis, V. Badrinarayanan, and R. Cipolla. Semi-supervised video segmentation. In CVPR, 2011.
[10] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[11] C. Desai and D. Ramanan. Detecting actions, poses, and objects with relational phraselets. In ECCV, 2012.
[12] M. Eichner, M. Marin-Jimenez, A. Zisserman, and V. Ferrari. 2d articulated human pose estimation and retrieval in (almost) unconstrained still images. IJCV, 2012.
[13] M. Everingham, J. Sivic, and A. Zisserman. “Hello! My name is... Buffy” automatic naming of characters in TV video. In BMVC, 2006.
[14] P. F. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 32(9), 2010.
[15] K. Fragkiadaki, H. Hu, and J. Shi. Pose from flow and flow from pose. In CVPR, 2013.
[16] D. Goldman, C. Gonterman, B. Curless, D. Salesin, and S. Seitz. Video annotation, navigation, and composition. In UIST, 2008.
[17] V. Gulshan, V. Lempitsky, and A. Zisserman. Humanising grabcut: Learning to segment humans using the kinect. In IEEE Workshop on Consumer Depth Cameras for Computer Vision, ICCV, 2011.
[18] S. Johnson and M. Everingham. Learning effective human pose estimation from inaccurate annotation. In CVPR, 2011.
[19] C. Keller, M. Enzweiler, M. Rohrbach, D. Llorca, C. Schnorr, and D. Gavrila. The benefits of dense stereo for pedestrian detection. IEEE Trans. Intelligent Transportation Systems, 2011.
[20] P. Kohli, J. Rihan, M. Bray, and P. Torr. Simultaneous segmentation and pose estimation of humans using dynamic graph cuts. IJCV, 2008.
[21] V. Kolmogorov, A. Criminisi, A. Blake, G. Cross, and C. Rother. Bi-layer segmentation of binocular stereo video. In CVPR, 2005.
[22] S. Koppal, C. Zitnick, M. Cohen, S. Kang, B. Ressler, and A. Colburn. A viewer-centric editor for 3d movies. Computer Graphics and Applications, 2011.
[23] M. P. Kumar, P. H. S. Torr, and A. Zisserman. Learning layered motion segmentations of video. In ICCV, 2005.
[24] L. Ladicky, P. H. S. Torr, and A. Zisserman. Human pose estimation using a joint pixel-wise and part-wise formulation. In CVPR, 2013.
[25] C. Liu. Beyond Pixels: Exploring New Representations and Applications for Motion Analysis. PhD thesis, MIT, 2009.
[26] J. C. Niebles, B. Han, and L. Fei-Fei. Efficient extraction of human motion volumes by tracking. In CVPR, 2010.
[27] X. Ren, L. Bo, and D. Fox. Rgb-(d) scene labeling: Features and algorithms. In CVPR, 2012.
[28] B. Sapp, D. Weiss, and B. Taskar. Parsing human motion with stretchable models. In CVPR, 2011.
[29] K. Schindler, A. Ess, B. Leibe, and L. Van Gool. Automatic detection and tracking of pedestrians from a moving stereo rig. ISPRS Journal of Photogrammetry and Remote Sensing, 65(6):523–537, 2010.
[30] G. Sheasby, J. Valentin, N. Crook, and P. H. S. Torr. A robust stereo prior for human segmentation. In ACCV, 2012.
[31] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In CVPR, 2011.
[32] L. Spinello and K. O. Arras. People detection in rgb-d data. In IROS, 2011.
[33] D. Sun, E. Sudderth, and M. Black. Layered image motion with explicit occlusions, temporal consistency, and depth ordering. In NIPS, 2010.
[34] P. H. S. Torr, R. Szeliski, and P. Anandan. An integrated bayesian approach to layer extraction from image sequences. PAMI, 2001.
[35] S. Walk, K. Schindler, and B. Schiele. Disparity statistics for pedestrian detection: Combining appearance, motion and stereo. In ECCV, 2010.
[36] H. Wang and D. Koller. Multi-level inference by relaxed dual decomposition for human pose segmentation. In CVPR, 2011.
[37] J. Y. A. Wang and E. H. Adelson. Representing moving images with layers. IEEE Trans. Image Processing, 3(5):625–638, 1994.
[38] A. Wedel, C. Rabe, T. Vaudrey, T. Brox, U. Franke, and D. Cremers. Efficient dense scene flow from sparse or dense stereo data. In ECCV, 2008.
[39] J. Xiao and M. Shah. Motion layer extraction in the presence of occlusion using graph cuts. PAMI, 2005.
[40] Y. Yang, S. Hallman, D. Ramanan, and C. Fowlkes. Layered object models for image segmentation. PAMI, 2011.
[41] Y. Yang and D. Ramanan. Articulated pose estimation using flexible mixtures of parts. In CVPR, 2011.
[42] B. Yao and L. Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In CVPR, 2010.