138 nips-2011-Joint 3D Estimation of Objects and Scene Layout


Author: Andreas Geiger, Christian Wojek, Raquel Urtasun

Abstract: We propose a novel generative model that is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene. In particular, we infer the scene topology, geometry as well as traffic activities from a short video sequence acquired with a single camera mounted on a moving car. Our generative model takes advantage of dynamic information in the form of vehicle tracklets as well as static information coming from semantic labels and geometry (i.e., vanishing points). Experiments show that our approach outperforms a discriminative baseline based on multiple kernel learning (MKL) which has access to the same image information. Furthermore, as we reason about objects in 3D, we are able to significantly increase the performance of state-of-the-art object detectors in their ability to estimate object orientation.


reference text

[1] S. Bao, M. Sun, and S. Savarese. Toward coherent object detection and scene layout understanding. In CVPR, 2010.

[2] O. Barinova, V. Lempitsky, E. Tretyak, and P. Kohli. Geometric image parsing in man-made environments. In ECCV, 2010.

[3] W. Choi and S. Savarese. Multiple target tracking in world coordinate with single, minimally calibrated camera. In ECCV, 2010.

[4] A. Ess, B. Leibe, K. Schindler, and L. Van Gool. Robust multi-person tracking from a mobile platform. PAMI, 31:1831–1846, 2009.

[5] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 32:1627–1645, 2010.

[6] D. Gavrila and S. Munder. Multi-cue pedestrian detection and tracking from a moving vehicle. IJCV, 73:41–59, 2007.

[7] A. Geiger, M. Lauer, and R. Urtasun. A generative model for 3D urban scene understanding from movable platforms. In CVPR, 2011.

[8] A. Geiger, M. Roser, and R. Urtasun. Efficient large-scale stereo matching. In ACCV, 2010.

[9] W. Gilks and S. Richardson, editors. Markov Chain Monte Carlo in Practice. Chapman & Hall, 1995.

[10] S. Gould, T. Gao, and D. Koller. Region-based segmentation and object detection. In NIPS, 2009.

[11] A. Gupta, A. Efros, and M. Hebert. Blocks world revisited: Image understanding using qualitative geometry and mechanics. In ECCV, 2010.

[12] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge, 2004.

[13] V. Hedau, D. Hoiem, and D.A. Forsyth. Recovering the spatial layout of cluttered rooms. In ICCV, 2009.

[14] D. Hoiem, A. Efros, and M. Hebert. Recovering surface layout from an image. IJCV, 75:151–172, 2007.

[15] D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. IJCV, 80:3–15, 2008.

[16] C. Huang, B. Wu, and R. Nevatia. Robust object tracking by hierarchical association of detection responses. In ECCV, 2008.

[17] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell. Gaussian processes for object categorization. IJCV, 88:169–188, 2010.

[18] R. Kaucic, A. Perera, G. Brooksby, J. Kaufhold, and A. Hoogs. A unified framework for tracking through occlusions and across sensor gaps. In CVPR, 2005.

[19] J. Kosecka and W. Zhang. Video compass. In ECCV, 2002.

[20] D. Kuettel, M. Breitenstein, L. Van Gool, and V. Ferrari. What’s going on?: Discovering spatio-temporal dependencies in dynamic scenes. In CVPR, 2010.

[21] S. Kumar and M. Hebert. Man-made structure detection in natural images using a causal multiscale random field. In CVPR, 2003.

[22] D. Lee, A. Gupta, M. Hebert, and T. Kanade. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In NIPS, 2010.

[23] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42:145–175, 2001.

[24] A. Saxena, S. H. Chung, and A. Y. Ng. 3-D depth reconstruction from a single still image. IJCV, 76:53–69, 2008.

[25] G. Schindler and F. Dellaert. Atlanta world: An expectation maximization framework for simultaneous low-level edge grouping and camera calibration in complex man-made environments. In CVPR, 2004.

[26] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81:2–23, 2009.

[27] H. Wang, S. Gould, and D. Koller. Discriminative learning with latent variables for cluttered indoor scene understanding. In ECCV, 2010.

[28] X. Wang, X. Ma, and W. Grimson. Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models. PAMI, 2009.

[29] C. Wojek, S. Roth, K. Schindler, and B. Schiele. Monocular 3D Scene Modeling and Inference: Understanding Multi-Object Traffic Scenes. In ECCV, 2010.

[30] C. Wojek and B. Schiele. A dynamic CRF model for joint labeling of object and scene classes. In ECCV, 2008.