iccv iccv2013 iccv2013-2 iccv2013-2-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Byung-Soo Kim, Pushmeet Kohli, Silvio Savarese
Abstract: Scene understanding is an important yet very challenging problem in computer vision. In the past few years, researchers have taken advantage of the growing availability of RGB plus depth (RGB-D) cameras to help simplify the problem of inferring scene semantics. However, while the added 3D geometry is certainly useful for segmenting out objects with different depth values, it also adds complications: the 3D geometry is often incorrect because of noisy depth measurements, and the actual 3D extent of the objects is usually unknown because of occlusions. In this paper we propose a new method that jointly refines the 3D reconstruction of the scene (raw depth values) while accurately segmenting out the objects or scene elements from the 3D reconstruction. This is achieved by introducing a new model which we call Voxel-CRF. The Voxel-CRF model is based on the idea of constructing a conditional random field over a 3D volume of interest which captures the semantic and 3D geometric relationships among different elements (voxels) of the scene. This model allows us to jointly estimate (1) a dense voxel-based 3D reconstruction and (2) the semantic labels associated with each voxel, even in the presence of partial occlusions, using an approximate yet efficient inference strategy. We evaluated our method on the challenging NYU Depth dataset (versions 1 and 2). Experimental results show that our method achieves competitive accuracy in inferring scene semantics and visually appealing results in improving the quality of the 3D reconstruction. We also demonstrate an interesting application of object removal and scene completion from RGB-D images.
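As a reader's aid, the energy such a CRF minimizes can be sketched as follows. This is a minimal, hedged reconstruction from the abstract alone: the joint occupancy-plus-class label per voxel is stated above, while the specific potentials, the higher-order clique term, and the choice of minimizer are assumptions suggested by the cited works on robust higher-order potentials [1] and graph-cut energy minimization [14], not the paper's exact formulation.

E(\mathbf{y}) = \sum_{i \in \mathcal{V}} \psi_i(y_i) + \sum_{(i,j) \in \mathcal{N}} \psi_{ij}(y_i, y_j) + \sum_{c \in \mathcal{C}} \psi_c(\mathbf{y}_c)

Here \mathcal{V} is the set of voxels in the 3D volume of interest; each label y_i jointly encodes occupancy (free space vs. occupied) and a semantic class, so minimizing E yields the dense voxel-based reconstruction and the per-voxel semantic labels together; \mathcal{N} links neighboring voxels to enforce geometric and semantic consistency; and \mathcal{C} is a set of higher-order cliques (e.g., voxels grouped by a detected 3D plane) with robust label-consistency potentials in the spirit of [1]. Approximate minimization by graph-cut move making [14] would match the "approximate yet efficient inference strategy" mentioned above.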
[1] P. Kohli, L. Ladicky, and P. H. S. Torr, “Robust higher order potentials for enforcing label consistency,” in CVPR, 2008. 1, 4, 5
[2] L. Ladicky, C. Russell, P. Kohli, and P. Torr, “Associative hierarchical crfs for object class image segmentation,” in ICCV, 2009. 1, 5, 6, 7
[3] J. Shotton, J. Winn, C. Rother, and A. Criminisi, “Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation,” in ECCV, 2006. 1
[4] B. Kim, M. Sun, P. Kohli, and S. Savarese, "Relating things and stuff by high-order potential modeling," in ECCV Workshop (HiPot), 2012. 1, 4, 5
[5] A. Gupta, A. Efros, and M. Hebert, "Blocks world revisited: Image understanding using qualitative geometry and mechanics," in ECCV, 2010. 1, 2
[6] D. Hoiem, A. A. Efros, and M. Hebert, "Geometric context from a single image," in ICCV, 2005. 1
[7] V. Hedau, D. Hoiem, and D. Forsyth, "Recovering free space of indoor scenes from a single image," in CVPR, 2012. 1, 2, 6
[8] W. Choi, Y.-W. Chao, C. Pantofaru, and S. Savarese, "Understanding indoor scenes using 3d geometric phrases," in CVPR, 2013. 1
[9] S. Y. Bao and S. Savarese, "Semantic structure from motion," in CVPR, 2011. 1
[10] H. S. Koppula, A. Anand, T. Joachims, and A. Saxena, "Semantic labeling of 3d point clouds for indoor scenes," in NIPS, 2011. 1, 2, 3
[11] D. Munoz, J. A. Bagnell, and M. Hebert, "Co-inference machines for multi-modal scene analysis," in ECCV, 2012. 1, 2, 3
[12] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from rgbd images," in ECCV, 2012. 1, 2, 3, 5, 7
[13] Microsoft Kinect, http://www.xbox.com/en-US/kinect. 1
[14] Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," PAMI, 2001. 2, 5
[15] Z. Jia, A. Gallagher, A. Saxena, and T. Chen, "3d-based reasoning with blocks, support, and stability," in CVPR, 2013. 2
[16] S. Satkin, J. Lin, and M. Hebert, "Data-driven scene understanding from 3d models," 2012. 2, 6
[17] A. Gupta, S. Satkin, A. A. Efros, and M. Hebert, "From 3d scene geometry to human workspace," in CVPR, 2011. 2
[18] S. Gupta, P. Arbelaez, and J. Malik, "Perceptual organization and recognition of indoor scenes from rgb-d images," in CVPR, 2013. 2
[19] L. Ladický, P. Sturgess, C. Russell, S. Sengupta, Y. Bastanlar, W. Clocksin, and P. Torr, "Joint optimization for object class segmentation and dense stereo reconstruction," IJCV, 2012. 2
[20] M. Bleyer, C. Rother, P. Kohli, D. Scharstein, and S. Sinha, "Object stereo: joint stereo matching and object segmentation," in CVPR, 2011. 2
[21] R. Guo and D. Hoiem, “Beyond the line of sight: labeling the underlying surfaces,” in ECCV, 2012. 2
[22] J. Gonfaus, X. Boix, J. Van De Weijer, A. Bagdanov, J. Serrat, and J. Gonzalez, "Harmony potentials for joint classification and segmentation," in CVPR, 2010. 4
[23] D. Borrmann, J. Elseberg, K. Lingemann, and A. Nüchter, "The 3d hough transform for plane detection in point clouds: A review and a new accumulator design," 3D Research, 2011. 4, 5
[24] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” PAMI, 2010. 4, 5
[25] L. Ladický, P. Sturgess, K. Alahari, C. Russell, and P. Torr, "What, where and how many? combining object detectors and crfs," in ECCV, 2010. 4
[26] T. Joachims, T. Finley, and C. Yu, “Cutting-plane training of structural svms,” Machine Learning, 2009. 5
[27] X. Ren, L. Bo, and D. Fox, “Rgb-(d) scene labeling: Features and algorithms,” in CVPR, 2012. 5, 6, 7
[28] N. Silberman and R. Fergus, “Indoor scene segmentation using a structured light sensor,” in ICCV Workshop (3DRR), 2011. 5, 6, 7
[29] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester, "Discriminatively trained deformable part models, release 4," http://people.cs.uchicago.edu/~pff/latent-release4/. 5
[30] Project page, http://cvgl.stanford.edu/projects/vcrf/. 7, 8
[31] D. Van Krevelen and R. Poelman, "A survey of augmented reality technologies, applications and limitations," International Journal of Virtual Reality, 2010. 7