nips nips2011 nips2011-127 nips2011-127-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yibiao Zhao, Song-Chun Zhu
Abstract: This paper proposes a parsing algorithm for scene understanding that covers four tasks: computing the 3D scene layout, detecting 3D objects (e.g. furniture), detecting 2D faces (windows, doors, etc.), and segmenting the background. In contrast to previous scene labeling work that applied discriminative classifiers to pixels (or super-pixels), we use a generative Stochastic Scene Grammar (SSG). This grammar represents the compositional structures of visual entities, from scene categories and 3D foreground/background through 2D faces down to 1D lines. The grammar includes three types of production rules and two types of contextual relations. Production rules: (i) AND rules represent the decomposition of an entity into sub-parts; (ii) OR rules represent the switching among sub-types of an entity; (iii) SET rules represent an ensemble of visual entities. Contextual relations: (i) Cooperative “+” relations represent positive links between binding entities, such as hinged faces of an object or aligned boxes; (ii) Competitive “-” relations represent negative links between competing entities, such as mutually exclusive boxes. We design an efficient MCMC inference algorithm, namely Hierarchical cluster sampling, to search the large solution space of scene configurations. The algorithm has two stages: (i) Clustering: it forms all possible higher-level structures (clusters) from lower-level entities using the production rules and contextual relations. (ii) Sampling: it jumps between alternative structures (clusters) in each layer of the hierarchy to find the most probable configuration (represented by a parse tree). In our experiments, we demonstrate the superiority of our algorithm over existing methods on a public dataset. In addition, our approach produces richer structures in the parse tree.
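As a rough illustration of the three production rule types named in the abstract, the sketch below samples a parse tree top-down from a toy grammar. It is not the authors' implementation: the symbol names, branching probabilities, and the maximum SET size are all hypothetical, and it covers neither the contextual relations nor the hierarchical cluster sampling inference.

```python
import random

# Hypothetical toy grammar: symbols, probabilities and counts are illustrative
# assumptions, not values from the paper.
GRAMMAR = {
    "Scene":      ("OR",  [("Bedroom", 0.5), ("Office", 0.5)]),  # switch among sub-types
    "Bedroom":    ("AND", ["Background", "Foreground"]),          # decompose into sub-parts
    "Office":     ("AND", ["Background", "Foreground"]),
    "Foreground": ("SET", ("Box", 3)),                            # ensemble of 0..3 boxes
    "Background": ("AND", ["Face", "Face", "Face"]),
    "Box":        ("AND", ["Face", "Face", "Face"]),
}


def sample(symbol, rng):
    """Expand `symbol` top-down and return a nested (symbol, children) tree."""
    if symbol not in GRAMMAR:  # terminal symbol, e.g. "Face"
        return (symbol, [])
    kind, body = GRAMMAR[symbol]
    if kind == "AND":
        # AND rule: every listed sub-part is present.
        children = [sample(s, rng) for s in body]
    elif kind == "OR":
        # OR rule: pick exactly one sub-type according to its probability.
        names = [name for name, _ in body]
        weights = [weight for _, weight in body]
        choice = rng.choices(names, weights=weights, k=1)[0]
        children = [sample(choice, rng)]
    elif kind == "SET":
        # SET rule: a variable-size collection of the same kind of entity.
        child, max_count = body
        count = rng.randint(0, max_count)
        children = [sample(child, rng) for _ in range(count)]
    else:
        raise ValueError(f"unknown rule type: {kind}")
    return (symbol, children)


def terminals(tree):
    """Collect the terminal symbols at the leaves of a parse tree."""
    symbol, children = tree
    if symbol not in GRAMMAR:
        return [symbol]
    return [t for child in children for t in terminals(child)]


if __name__ == "__main__":
    tree = sample("Scene", random.Random(0))
    print(tree[1][0][0], "with", len(terminals(tree)), "faces")
```

Each sampled tree corresponds to one candidate scene configuration; the inference described in the abstract searches over such configurations rather than drawing them blindly from the prior.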
[1] Hoiem, D., Efros, A., & Hebert, M. (2007) Recovering Surface Layout from an Image. IJCV 75(1).
[2] Hedau, V., Hoiem, D., & Forsyth, D. (2009) Recovering the spatial layout of cluttered rooms. In ICCV.
[3] Wang, H., Gould, S. & Koller, D. (2010) Discriminative Learning with Latent Variables for Cluttered Indoor Scene Understanding. ECCV.
[4] Lee, D., Gupta, A., Hebert, M., & Kanade, T. (2010) Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces. In Advances in Neural Information Processing Systems.
[5] Shotton, J., & Winn, J. (2007) TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context. IJCV.
[6] Tu, Z., & Bai, X. (2009) Auto-context and Its Application to High-level Vision Tasks and 3D Brain Image Segmentation. PAMI.
[7] Lafferty, J. D., McCallum, A., & Pereira, F. C. N. (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data. In ICML (pp. 282-289).
[8] Saxena, A., Sun, M. & Ng, A. (2008) Make3d: Learning 3D scene structure from a single image. PAMI.
[9] Gupta, A., Efros, A., & Hebert, M. (2010) Blocks World Revisited: Image Understanding using Qualitative Geometry and Mechanics. ECCV.
[10] Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005) Large Margin Methods for Structured and Interdependent Output Variables. JMLR, Vol. 6, pp. 1453-1484.
[11] Manning, C., & Schuetze, H. (1999) Foundations of statistical natural language processing. Cambridge: MIT Press.
[12] Chen, H., Xu, Z., Liu, Z., & Zhu, S. C. (2006) Composite templates for cloth modeling and sketching. In CVPR (1) pp. 943-950.
[13] Jin, Y., & Geman, S. (2006) Context and hierarchy in a probabilistic image model. In CVPR (2) pp. 2145-2152.
[14] Zhu, L., & Yuille, A. L. (2005) A hierarchical compositional system for rapid object detection. In Advances in Neural Information Processing Systems.
[15] Fidler, S., & Leonardis, A. (2007) Towards Scalable Representations of Object Categories: Learning a Hierarchy of Parts. In CVPR.
[16] Zhu, S. C., & Mumford, D. (2006) A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, 2(4), 259-362.
[17] Johnson, M., Griffiths, T. L., & Goldwater, S. (2007) Adaptor Grammars: A Framework for Specifying Compositional Nonparametric Bayesian Models. In Advances in Neural Information Processing Systems.
[18] Han, F., & Zhu, S. C. (2009) Bottom-Up/Top-Down Image Parsing with Attribute Grammar. PAMI.
[19] Porway, J., & Zhu, S. C. (2010) Hierarchical and Contextual Model for Aerial Image Understanding. Int’l Journal of Computer Vision, vol.88, no.2, pp 254-283.
[20] Porway, J., & Zhu, S. C. (2011) C4: Computing Multiple Solutions in Graphical Models by Cluster Sampling. PAMI, vol.33, no.9, pp 1713-1727.
[21] Lee, D., Hebert, M., & Kanade, T. (2009) Geometric Reasoning for Single Image Structure Recovery. In CVPR.
[22] Hedau, V., Hoiem, D., & Forsyth, D. (2010). Thinking Inside the Box: Using Appearance Models and Context Based on Room Geometry. In ECCV.
[23] Felzenszwalb, P.F. (2010) Cascade Object Detection with Deformable Part Models. In CVPR.
[24] Pero, L. D., Guan, J., Brau, E., Schlecht, J., & Barnard, K. (2011) Sampling Bedrooms. In CVPR.
[25] Zhu, S. C., Wu, Y., & Mumford, D. (1997) Minimax Entropy Principle and Its Application to Texture Modeling. Neural Computation 9(8): 1627-1660.
[26] Yu, L. F., Yeung, S. K., Tang, C. K., Terzopoulos, D., Chan, T. F., & Osher, S. (2011) Make it home: automatic optimization of furniture arrangement. ACM Transactions on Graphics 30(4): 86.