cvpr cvpr2013 cvpr2013-446 knowledge-graph by maker-knowledge-mining

446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases


Source: pdf

Author: Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, Silvio Savarese

Abstract: Visual scene understanding is a difficult problem interleaving object detection, geometric reasoning and scene classification. We present a hierarchical scene model for learning and reasoning about complex indoor scenes which is computationally tractable, can be learned from a reasonable amount of training data, and avoids oversimplification. At the core of this approach is the 3D Geometric Phrase Model which captures the semantic and geometric relationships between objects which frequently co-occur in the same 3D spatial configuration. Experiments show that this model effectively explains scene semantics, geometry and object groupings from a single image, while also improving individual object detections.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Visual scene understanding is a difficult problem interleaving object detection, geometric reasoning and scene classification. [sent-3, score-0.591]

2 We present a hierarchical scene model for learning and reasoning about complex indoor scenes which is computationally tractable, can be learned from a reasonable amount of training data, and avoids oversimplification. [sent-4, score-0.377]

3 At the core of this approach is the 3D Geometric Phrase Model which captures the semantic and geometric relationships between objects which frequently co-occur in the same 3D spatial configuration. [sent-5, score-0.379]

4 Experiments show that this model effectively explains scene semantics, geometry and object groupings from a single image, while also improving individual object detections. [sent-6, score-0.371]

5 A scene classifier will tell you, with some uncertainty, that this is a dining room [21, 23, 15, 7]. [sent-11, score-0.553]

6 A layout estimator [12, 16, 27, 2] will tell you, with different uncertainty, how to fit a box to the room. [sent-12, score-0.54]

7 This is because the scene is cluttered with objects which tend to occlude each other: the dining table occludes the chairs, the chairs occlude the dining table; all of these occlude the room layout components (i.e. [sent-15, score-1.43]

8 It is clear that truly understanding a scene involves integrating information at multiple levels as well as studying the interactions between scene elements. [sent-18, score-0.463]

9 A scene-object interaction describes the way a scene type (e.g. [sent-19, score-0.329]

10 An object-layout interaction describes the way the layout (e.g. [sent-22, score-0.52]

11 Our unified model combines object detection, layout estimation and scene classification. [sent-28, score-0.853]

12 A single input image (a) is described by a scene model (b), with the scene type and layout at the root, and objects as leaves. [sent-29, score-1.08]

13 The middle nodes are latent 3D Geometric Phrases, such as (c), describing the 3D relationships among objects (d). [sent-30, score-0.346]

14 Scene understanding means finding the correct parse graph, producing a final labeling (e) of the objects in 3D (bounding cubes), the object groups (dashed white lines), the room layout, and the scene type. [sent-31, score-0.869]
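
To make the hierarchy concrete, here is a minimal Python sketch of such a parse graph; the class and field names are illustrative, not from the paper's code:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ObjectNode:
    """Leaf: one positive object detection with its 3D bounding cube."""
    category: str                  # e.g. "sofa", "table", "chair"
    cuboid: Tuple[float, ...]      # 3D bounding-cube parameters
    confidence: float              # detector score (e.g. from a DPM)

@dataclass
class GPNode:
    """Latent middle node: a 3D Geometric Phrase grouping co-occurring objects."""
    children: List[ObjectNode]

@dataclass
class ParseGraph:
    """Root holds the scene type and room layout; leaves are detections."""
    scene_type: str                # e.g. "diningroom"
    layout: Tuple[float, ...]      # parameters of the 3D room-box hypothesis
    phrases: List[GPNode]          # selected 3DGP nodes
    singletons: List[ObjectNode]   # detections not covered by any phrase

chair = ObjectNode("chair", (1.0, 0.0, 2.0), 0.7)
table = ObjectNode("table", (1.5, 0.0, 2.0), 1.4)
g = ParseGraph("diningroom", (4.0, 3.0, 5.0), [GPNode([chair, table])], [])
```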

15 As part of a larger system, understanding a scene semantically and functionally will allow us to make predictions about the presence and locations of unseen objects within the space. [sent-36, score-0.338]

16 We propose a method that can automatically learn the interactions among scene elements and apply them to the holistic understanding of indoor scenes. [sent-37, score-0.46]

17 The model fuses together object detection, layout estimation and scene classification to obtain a unified estimate of the scene composition. [sent-39, score-1.043]

18 The problem is formulated as image parsing in which a parse graph must be constructed for an image as in Fig. [sent-40, score-0.391]

19 At the root of the parse graph is the scene type and layout while the leaves are the individual detections of objects. [sent-43, score-1.187]

20 A 3DGP encodes geometric and semantic relationships between groups of objects which frequently co-occur in spatially consistent configurations. [sent-47, score-0.372]

21 Grouping objects together provides contextual support to boost weak object detections, such as the chair that is occluded by the dining table. [sent-49, score-0.493]

22 To explain a new image, a parse graph must estimate the scene semantics, layout, objects and 3DGPs, making the space of possible graphs quite large and of variable dimension. [sent-56, score-0.645]

23 Experiments show our hierarchical scene model constructed upon 3DGPs improves object detection, layout estimation and semantic classification accuracy in challenging scenarios which include occlusions, clutter and intra-class variation. [sent-63, score-0.876]

24 In contrast to these approaches, our 3DGPs are capable of encoding both 3D geometric and contextual interactions among objects and can be automatically learned from training data. [sent-77, score-0.341]

25 [2, 1] utilized geometric relationships to help object detection and scene structure estimation. [sent-84, score-0.422]

26 Several methods attempted to specifically solve indoor layout estimation [12, 13, 27, 30, 22, 26, 25]. [sent-85, score-0.663]

27 proposed a formulation using a cubic room representation [12] and showed that layout estimation can improve object detection [13]. [sent-87, score-0.803]

28 This initial attempt demonstrated promising results; however, experiments were limited to a single object type (bed) and a single room type (bedroom). [sent-88, score-0.407]

29 Other methods [16, 30] proposed to improve layout estimation by analyzing the consistency between layout and the geometric properties of objects without accounting for the specific categorical nature of such objects. [sent-89, score-1.177]

30 [9] incorporated human pose estimation into indoor scene layout understanding. [sent-91, score-0.853]

31 However, [9] does not capture relationships between objects or between an object and the scene type. [sent-92, score-0.442]

32 [19] proposed an approach called object bank to model the correlation between objects and the scene by encoding object detection responses as features in an SPM and predicting the scene type. [sent-95, score-0.774]

33 They did not, however, explicitly reason about the relationship between the scene and its constituent objects, nor the geometric correlation among objects. [sent-96, score-0.399]

34 [21] used a latent DPM model to capture the spatial configuration of objects in a scene type. [sent-98, score-0.47]

35 Finally, the latent DPM model assumes that the number of objects per scene is fixed, whereas our scene model allows an arbitrary number of 3DGPs per scene. [sent-101, score-0.603]

36 Scene Model using 3D Geometric Phrases The high-level goal of our system is to take a single image of an indoor scene and classify its scene semantics (such as room type), spatial layout, constituent objects and object relationships in a unified manner. [sent-103, score-1.137]

37 The root node S describes the scene type s1 s3 (bedroom or livingroom) and layout hypotheses l3, l5 (red lines), while other white and sky-blue round nodes represent objects and 3DGPs, respectively. [sent-107, score-1.188]

38 (o1, . . . , o10) are detection hypotheses obtained by object detectors such as [8] (black boxes). [sent-111, score-0.353]

39 1), which attempts to identify the parse graph that best fits the image observations. [sent-116, score-0.353]

40 At the core of this formulation is our novel 3D Geometric Phrase (3DGP), which is the key ingredient in parse graph construction (Sec. [sent-117, score-0.353]

41 The 3DGP model facilitates the transfer of contextual information from a strong object hypothesis to a weaker one when the configuration of the two objects agrees with a learned geometric phrase (Fig. [sent-120, score-0.599]

42 The model parameter θ includes the observation weights α, β, γ, the semantic and geometric context model weights η, ν, the pair-wise interaction model μ, and the parameters λ associated with the 3DGP (see eq. [sent-133, score-0.35]

43 We define a parse graph G = {S, V} as a collection of nodes describing the geometric and semantic properties of the scene. [sent-135, score-0.445]

44 S = (C, H) is the root node containing the scene semantic class variable C and the layout of the room H, and V = {V1, . [sent-136, score-0.991]

45 Given the scene model M, our goal is to identify the parse graph G = {S, V} that best fits the image. [sent-157, score-0.353]

46 A parse graph G is selected by i) choosing a scene type among the [sent-158, score-0.448]

47 hypotheses Os, ii) choosing the scene layout from the layout hypotheses Ol, iii) selecting positive detections (shown as o1, o3, and o10 in Fig. [sent-159, score-1.604]

48 2) among the detection hypotheses Oo, and iv) selecting compatible 3DGPs (Sec. [sent-160, score-0.349]

49 Let VT be the set of nodes associated with a set of detection hypotheses (objects) and VI be the set of nodes corresponding to 3DGP hypotheses, with V = VT ∪ VI. [sent-165, score-0.442]

50 Then, the energy of parse graph G given an image I is: EΠ,θ(G, I) = α⊤φ(C, Os) + β⊤φ(H, Ol) + Σ V∈VT γ⊤φ(oV) + (context, pair-wise and 3DGP terms). [sent-166, score-0.396]

51 Observation Features: The observation features φ and corresponding model parameters α, γ capture the compatibility of a scene type, layout and object hypothesis with the image, respectively. [sent-223, score-0.961]

52 For instance, one can use the spatial pyramid matching (SPM) classifier [15] to estimate the scene type, the indoor layout estimator [12] for determining layout and Deformable Part Model (DPM) [8] for detecting objects. [sent-224, score-1.36]

53 In practice, rather than learning the parameters for the feature vectors of the observation model, we use the confidence values given by SPM [15] for scene classification, from [12] for layout estimation, and from the DPM [8] for object detection. [sent-225, score-0.806]
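
As a rough sketch of how these observation terms can combine, assuming each module simply exposes a scalar confidence (the weight keys and attribute names below are hypothetical):

```python
from types import SimpleNamespace

def observation_energy(graph, weights):
    """Weighted sum of per-module confidences, standing in for the
    alpha/beta/gamma observation terms of the energy described above."""
    e = weights["scene"] * graph.scene_confidence      # SPM scene score
    e += weights["layout"] * graph.layout_confidence   # layout estimator score
    for obj in graph.objects:                          # one DPM score per detection
        e += weights["object"] * obj.confidence
    return e

g = SimpleNamespace(scene_confidence=0.8, layout_confidence=0.6,
                    objects=[SimpleNamespace(confidence=1.2),
                             SimpleNamespace(confidence=-0.3)])
print(observation_energy(g, {"scene": 1.0, "layout": 1.0, "object": 1.0}))  # 2.3
```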

54 3, a scene layout hypothesis li is expressed using a 3D box representation and an object detection hypothesis pi is expressed using a 3D cuboid representation. [sent-231, score-1.123]

55 The compatibility between an object and the scene layout (the ν term) penalizes object cuboids that penetrate the walls. [sent-232, score-0.062]

56 For each wall, we measure the object-wall penetration by identifying which (if any) of the object cuboid bottom corners intersects with the wall and computing the (discretized) distance to the wall surface. [sent-234, score-0.328]
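
A minimal sketch of this penetration feature, assuming each wall is a plane with an inward-pointing normal and a one-hot discretization (the bin width and count are illustrative, not the paper's values):

```python
import numpy as np

def wall_penetration_feature(bottom_corners, wall_normal, wall_offset,
                             bin_width=0.1, n_bins=5):
    """One-hot feature for how deeply an object cuboid penetrates one wall.

    bottom_corners: (4, 3) bottom corners of the cuboid in room coordinates.
    The wall is the plane wall_normal . x + wall_offset = 0, with the normal
    pointing into the room, so a negative signed distance means penetration.
    """
    signed = bottom_corners @ wall_normal + wall_offset
    depth = max(0.0, -float(signed.min()))         # deepest penetrating corner
    feat = np.zeros(n_bins)
    feat[min(int(depth / bin_width), n_bins - 1)] = 1.0
    return feat

corners = np.array([[0.0, 0, 0], [1, 0, 0], [1, 0, 2], [0, 0, 2]])
print(wall_penetration_feature(corners, np.array([1.0, 0, 0]), -0.5))
```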

57 Given a 3DGP node V , the spatial deformation (dxi, dzi) of a constituent object is a function of the difference between the object instance location oi and the learned expected location ci with respect to the centroid of the 3DGP (the mean location of all constituent objects mV). [sent-251, score-0.735]
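
A small sketch of this deformation computation on the floor plane (array shapes and names are our own illustration):

```python
import numpy as np

def gp_deformation(object_xz, expected_xz):
    """Per-object spatial deformation (dx_i, dz_i) inside a 3DGP.

    object_xz:   (n, 2) observed floor-plane locations o_i of the members
    expected_xz: (n, 2) learned expected locations c_i, expressed relative
                 to the phrase centroid m_V (the mean member location)
    """
    centroid = object_xz.mean(axis=0)              # m_V
    return (object_xz - centroid) - expected_xz    # (dx_i, dz_i) per object

obs = np.array([[0.0, 0.0], [2.0, 0.0]])           # e.g. sofa and table
exp = np.array([[-1.0, 0.0], [1.0, 0.0]])          # learned relative layout
print(gp_deformation(obs, exp))                    # all zeros: perfect match
```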

58 Each object class is associated with a different prototypical bounding cuboid which we call the cuboid model (which was acquired from the commercial website www. [sent-280, score-0.427]

59 Similarly to [12, 16, 27], we represent the indoor space by the 3D layout of 5 orthogonal faces (floor, ceiling, left, center, and right wall), as in Fig. [sent-284, score-0.624]

60 For each set of layout faces, we obtain the corresponding 3D layout by back-projecting the intersecting corners of walls. [sent-287, score-0.944]

61 An object’s cuboid can be estimated from a single image given a set of known object cuboid models and an object detector that estimates the 2D bounding box and pose (Sec. [sent-288, score-0.465]

62 In order to obtain robust 3D localization of each object and disambiguate the size of the room space given a layout hypothesis, we estimate the camera height (ground plane location) by assuming all objects are lying on a common ground plane. [sent-292, score-0.799]
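
The text does not spell out the estimator, so the sketch below shows only the standard single-view relation that the common-ground-plane assumption makes available, with a median over objects for robustness (the function and variable names are ours):

```python
import numpy as np

def camera_height(detections, horizon_v):
    """Camera height assuming every object rests on one ground plane.

    detections: iterable of (v_top, v_bottom, real_height) per object, with
    image row v increasing downward and v_bottom below the horizon. Uses the
    relation h_cam = H * (v_bottom - horizon) / (v_bottom - v_top).
    """
    h = [H * (vb - horizon_v) / (vb - vt) for vt, vb, H in detections]
    return float(np.median(h))

# e.g. a 0.8 m chair and a 0.75 m table standing on the same floor
print(camera_height([(300, 420, 0.8), (310, 415, 0.75)], horizon_v=240))
```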

63 Inference In our formulation, performing inference is equivalent to finding the best parse graph specifying the scene type C, layout estimation H, positive object hypotheses V ∈ VT and 3DGP hypotheses V ∈ VI. [sent-296, score-1.68]

64 Inference is performed for each scene type separately, so scene type is considered given in the remainder of this section. [sent-299, score-0.562]

65 The procedure starts by assigning one node Vt to each detection hypothesis ot, creating a set of candidate terminal nodes (leaves) VT = {V1T, . [sent-302, score-0.359]

66 To illustrate, suppose we have a parse graph G that contains the constituent objects of Vi but not Vi itself. [sent-313, score-0.537]

67 To find candidates VkI for πk, we search over all possible configurations of selecting one terminal node among the sofa hypotheses VTsofa and one among the table hypotheses VTtable. [sent-320, score-0.737]
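
A minimal sketch of this bottom-up enumeration; the phrase specification format is our own illustration:

```python
from itertools import product

def gp_candidates(detections, phrase_spec):
    """Bottom-up enumeration of candidate 3DGP instantiations.

    detections:  dict mapping category -> list of terminal detection hypotheses
    phrase_spec: the categories a phrase model expects, e.g. ("sofa", "table")
    Yields every way of picking one hypothesis per required category.
    """
    pools = [detections.get(cat, []) for cat in phrase_spec]
    for combo in product(*pools):
        yield combo

dets = {"sofa": ["s1", "s2"], "table": ["t1"]}
print(list(gp_candidates(dets, ("sofa", "table"))))  # [('s1', 't1'), ('s2', 't1')]
```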

68 In practice, this bottom-up search can be performed very efficiently (less than a minute per image) since there are typically few detection hypotheses per object type. [sent-322, score-0.353]

69 Top-down: Given all possible sets of nodes Vcand, the optimal parse graph G is found via Reversible Jump Markov Chain Monte Carlo (RJ-MCMC) sampling (Fig. [sent-323, score-0.434]

70 To efficiently explore the space of parse graphs, we propose 4 reversible jump moves: layout selection, add, delete and switch. [sent-325, score-0.865]

71 Starting from an initial parse graph G0, the RJ-MCMC sampling draws a new parse graph by sampling a random jump move, and the new sample is either accepted or rejected following the Metropolis-Hastings rule. [sent-326, score-0.774]
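
A bare-bones sketch of such a sampler; the actual jump moves (layout selection, add, delete, switch) are abstracted into a single propose callback, and any proposal-ratio terms of the Metropolis-Hastings rule are omitted:

```python
import math
import random

def rjmcmc(g0, propose, energy, n_iters=1000, temperature=1.0):
    """Minimal Metropolis-Hastings search over parse graphs.

    propose(g) -> candidate graph produced by one random jump move
    energy(g)  -> scalar score, higher is better (posterior ~ exp(E))
    Returns the best sample visited by the chain.
    """
    g, e = g0, energy(g0)
    best, best_e = g, e
    for _ in range(n_iters):
        g_new = propose(g)
        e_new = energy(g_new)
        # accept with probability min(1, exp((E_new - E_old) / T))
        if random.random() < math.exp(min(0.0, (e_new - e) / temperature)):
            g, e = g_new, e_new
            if e > best_e:
                best, best_e = g, e
    return best

# toy example: the "graph" is a number and the energy peaks at 0
print(rjmcmc(5.0, lambda x: x + random.uniform(-1, 1), lambda x: -x * x))
```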

72 The initial parse graph is obtained by 1) selecting the layout with highest observation likelihood [12] and 2) greedily adding positive object detections.

73 Top-down: the Markov chain is defined by 3 RJ-MCMC moves on the parse graph Gk. [sent-376, score-0.393]

74 The RJ-MCMC jump moves used with a parse graph at inference step k are defined as follows. [sent-382, score-0.497]

75 Layout selection: This move generates a new parse graph Gk+1 by changing the layout hypothesis. [sent-383, score-0.873]

76 This formulation accommodates the uncertainty in associating an image to a parse graph G similarly to [8, 28]; i.e. [sent-405, score-0.353]

77 given a label y, the root node and terminal nodes of G can be uniquely identified, but the 3DGP nodes in the middle are hidden. [sent-407, score-0.355]

78 In latent completion, the most compatible parse graph G is found for an image with ground truth labels y by finding compatible 3DGP nodes VI. [sent-419, score-0.553]

79 We use the layout estimation loss proposed by [12] as the loss δl(H, Hi). [sent-431, score-1.057]

80 Although there exist datasets for layout estimation evaluation [12], object detection [6] and scene classification [23] in isolation, there is no dataset on which we can evaluate all three problems simultaneously. [sent-445, score-0.841]

81 The indoor-scene-object dataset includes three scene types: living room, bedroom, and dining room, with ∼300 images per room type. [sent-446, score-0.518]

82 The score for each scene type is the observation feature for scene type in our model (φ(C, Os)). [sent-455, score-0.635]

83 Indoor layout estimation: The indoor layout estimator as trained in [12] is used to generate layout hypotheses with confidence scores for Ol and the associated feature φ(H, Ol). [sent-458, score-1.847]

84 Pixel accuracy is defined as the percentage of pixels on layout faces with correct labels. [sent-463, score-0.472]
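
A one-line sketch of this metric, assuming per-pixel face labels stored as integer arrays:

```python
import numpy as np

def layout_pixel_accuracy(pred_faces, gt_faces):
    """Percentage of pixels whose layout-face label (floor, ceiling,
    left/center/right wall) agrees with the ground truth."""
    return 100.0 * float(np.mean(pred_faces == gt_faces))

pred = np.array([[0, 0, 1], [2, 2, 1]])
gt = np.array([[0, 1, 1], [2, 2, 1]])
print(layout_pixel_accuracy(pred, gt))  # 83.33...
```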

85 To further analyze the layout estimation, we also evaluated per-face estimation accuracy. [sent-464, score-0.511]

86 The detection bounding boxes and associated confidence scores from the baseline detectors are used to generate a discrete set of detection hypotheses Oo for our model. [sent-514, score-0.523]

87 EΠ(Gˆ, I) − EΠ(Gˆ\oi, I) if oi ∈ Gˆ; EΠ(Gˆ+oi, I) − EΠ(Gˆ, I) if oi ∉ Gˆ (7) where Gˆ is the solution of our inference, Gˆ\oi is the graph Gˆ without oi, and Gˆ+oi is the graph Gˆ augmented with oi. [sent-518, score-0.338]
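
A sketch of eq. 7 as code; remove and add are hypothetical helpers that rebuild the parse graph without or with oi, and the toy energy simply counts selected objects:

```python
from types import SimpleNamespace

def object_confidence(energy, g_hat, o_i, remove, add):
    """Marginal-energy confidence of a detection o_i, following eq. 7:
    the energy drop from removing o_i if it is in the solution graph,
    otherwise the energy gain from adding it."""
    if o_i in g_hat.objects:
        return energy(g_hat) - energy(remove(g_hat, o_i))
    return energy(add(g_hat, o_i)) - energy(g_hat)

E = lambda g: float(len(g.objects))                        # toy energy
rm = lambda g, o: SimpleNamespace(objects=g.objects - {o})
ad = lambda g, o: SimpleNamespace(objects=g.objects | {o})
g = SimpleNamespace(objects={"chair1"})
print(object_confidence(E, g, "chair1", rm, ad))           # 1.0
print(object_confidence(E, g, "table1", rm, ad))           # 1.0
```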

88 To better understand the effect of the 3DGP, we employ two different strategies for building the augmented parse graph Gˆ+oi. [sent-520, score-0.353]

89 The second scheme M2 attempts to also add a parent 3DGP into Gˆ+oi if 1) the other constituent objects in the 3DGP (other than oi) already exist in Gˆ and 2) the score is higher than the first scheme (adding oi as an individual object). [sent-522, score-0.36]

90 The first scheme ignores possible 3DGPs when evaluating object hypotheses that are not included in Gˆ due to low detection score, whereas the second scheme also incorporates 3DGP contexts while measuring the confidence of those object hypotheses. [sent-523, score-0.459]

91 To evaluate the contribution of the 3DGP to the scene model, we compared three versions of the algorithm: 1) the baseline methods, 2) our model without 3DGPs (but including geometric and semantic context features), and 3) our model with 3DGPs. [sent-525, score-0.438]

92 This shows that the 3DGP can transfer contextual information from strong object detection hypotheses to weaker detection hypotheses. [sent-548, score-0.495]

93 The scene model (with or without 3DGPs) significantly improves scene classification accuracy over the baseline (+7. [sent-549, score-0.469]

94 2%) by encoding the semantic relationship between scene type and objects (Table. [sent-550, score-0.45]

95 Finally, we demonstrate that our model provides more accurate layout estimation (Table. [sent-554, score-0.546]

96 We argue that the floor is the most important layout component since its extent directly provides information about the free space in the scene; the intersection lines between floor and walls uniquely specify the 3D extent of the free space. [sent-561, score-0.683]

97 Conclusion In this paper, we proposed a novel unified framework that can reason about the semantic class of an indoor scene, its spatial layout, and the identity and layout of objects within the space. [sent-579, score-0.878]

98 As a result of our unified framework, we showed that our model is capable of improving the accuracy of each scene understanding component and provides a cohesive interpretation of an indoor image. [sent-581, score-0.467]

99 Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. [sent-696, score-0.615]

100 Discriminative learning with latent variables for cluttered indoor scene understanding. [sent-767, score-0.393]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('layout', 0.472), ('parse', 0.272), ('hypotheses', 0.213), ('scene', 0.19), ('oi', 0.176), ('dining', 0.176), ('gk', 0.165), ('indoor', 0.152), ('room', 0.152), ('vt', 0.144), ('cuboid', 0.139), ('objects', 0.102), ('vi', 0.1), ('geometric', 0.092), ('phrases', 0.092), ('type', 0.091), ('hypothesis', 0.091), ('ol', 0.085), ('bed', 0.083), ('constituent', 0.082), ('graph', 0.081), ('nodes', 0.081), ('phrase', 0.08), ('sofa', 0.078), ('relationships', 0.077), ('vcand', 0.077), ('dpm', 0.076), ('contextual', 0.075), ('node', 0.073), ('object', 0.073), ('spm', 0.071), ('jump', 0.068), ('bedroom', 0.068), ('chair', 0.067), ('detection', 0.067), ('oo', 0.067), ('semantic', 0.067), ('compatibility', 0.062), ('floor', 0.06), ('wall', 0.058), ('os', 0.057), ('walls', 0.055), ('baseline', 0.054), ('biases', 0.053), ('reversible', 0.053), ('hi', 0.051), ('bkp', 0.051), ('skyblue', 0.051), ('configuration', 0.051), ('latent', 0.051), ('move', 0.048), ('boxes', 0.048), ('interaction', 0.048), ('hedau', 0.047), ('terminal', 0.047), ('understanding', 0.046), ('objec', 0.046), ('vti', 0.046), ('opg', 0.046), ('penetrate', 0.046), ('sdpm', 0.046), ('detections', 0.044), ('bank', 0.044), ('unified', 0.044), ('configurations', 0.043), ('energy', 0.043), ('chairs', 0.042), ('encode', 0.042), ('bounding', 0.041), ('spatial', 0.041), ('bao', 0.041), ('occlude', 0.04), ('hyper', 0.04), ('dik', 0.04), ('desai', 0.04), ('moves', 0.04), ('estimation', 0.039), ('silvio', 0.038), ('fouhey', 0.038), ('parsing', 0.038), ('observation', 0.038), ('root', 0.037), ('interactions', 0.037), ('simplex', 0.036), ('inference', 0.036), ('uniquely', 0.036), ('model', 0.035), ('recall', 0.035), ('tell', 0.035), ('among', 0.035), ('compatible', 0.034), ('gupta', 0.034), ('groups', 0.034), ('semantics', 0.034), ('pandey', 0.034), ('tables', 0.033), ('confidence', 0.033), ('estimator', 0.033), ('completion', 0.033), ('centroid', 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000023 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

Author: Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, Silvio Savarese

Abstract: Visual scene understanding is a difficult problem interleaving object detection, geometric reasoning and scene classification. We present a hierarchical scene model for learning and reasoning about complex indoor scenes which is computationally tractable, can be learned from a reasonable amount of training data, and avoids oversimplification. At the core of this approach is the 3D Geometric Phrase Model which captures the semantic and geometric relationships between objects which frequently co-occur in the same 3D spatial configuration. Experiments show that this model effectively explains scene semantics, geometry and object groupings from a single image, while also improving individual object detections.

2 0.28115904 445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models

Author: Luca Del_Pero, Joshua Bowdish, Bonnie Kermgard, Emily Hartley, Kobus Barnard

Abstract: We develop a comprehensive Bayesian generative model for understanding indoor scenes. While it is common in this domain to approximate objects with 3D bounding boxes, we propose using strong representations with finer granularity. For example, we model a chair as a set of four legs, a seat and a backrest. We find that modeling detailed geometry improves recognition and reconstruction, and enables more refined use of appearance for scene understanding. We demonstrate this with a new likelihood function that rewards 3D object hypotheses whose 2D projection is more uniform in color distribution. Such a measure would be confused by background pixels if we used a bounding box to represent a concave object like a chair. Complex objects are modeled using a set of re-usable 3D parts, and we show that this representation captures much of the variation among object instances with relatively few parameters. We also designed specific data-driven inference mechanisms for each part that are shared by all objects containing that part, which helps make inference transparent to the modeler. Further, we show how to exploit contextual relationships to detect more objects, by, for example, proposing chairs around and underneath tables. We present results showing the benefits of each of these innovations. The performance of our approach often exceeds that of state-of-the-art methods on the two tasks of room layout estimation and object recognition, as evaluated on two benchmark data sets used in this domain. 1) Detailed geometric models, such as tables with legs and top (bottom left), provide better reconstructions than plain boxes (top right), when supported by image features such as geometric context [5] (top middle), or an approach to using color introduced here. 2) Non-convex models allow for complex configurations, such as a chair under a table (bottom middle). 3) 3D contextual relationships, such as chairs being around a table, allow identifying objects supported by little image evidence, like the chair behind the table (bottom right). Best viewed in color.

3 0.2317937 381 cvpr-2013-Scene Parsing by Integrating Function, Geometry and Appearance Models

Author: Yibiao Zhao, Song-Chun Zhu

Abstract: Indoor functional objects exhibit large view and appearance variations, and are thus difficult to recognize with the traditional appearance-based classification paradigm. In this paper, we present an algorithm to parse indoor images based on two observations: i) The functionality is the most essential property to define an indoor object, e.g. “a chair to sit on”; ii) The geometry (3D shape) of an object is designed to serve its function. We formulate the nature of the object function into a stochastic grammar model. This model characterizes a joint distribution over the function-geometry-appearance (FGA) hierarchy. The hierarchical structure includes a scene category, functional groups, functional objects, functional parts and 3D geometric shapes. We use a simulated annealing MCMC algorithm to find the maximum a posteriori (MAP) solution, i.e. a parse tree. We design four data-driven steps to accelerate the search in the FGA space: i) group the line segments into 3D primitive shapes, ii) assign functional labels to these 3D primitive shapes, iii) fill in missing objects/parts according to the functional labels, and iv) synthesize 2D segmentation maps and verify the current parse tree by the Metropolis-Hastings acceptance probability. The experimental results on several challenging indoor datasets demonstrate the proposed approach not only significantly widens the scope of indoor scene parsing algorithms from segmentation and 3D recovery to functional object recognition, but also yields improved overall performance.

4 0.17738155 278 cvpr-2013-Manhattan Junction Catalogue for Spatial Reasoning of Indoor Scenes

Author: Srikumar Ramalingam, Jaishanker K. Pillai, Arpit Jain, Yuichi Taguchi

Abstract: Junctions are strong cues for understanding the geometry of a scene. In this paper, we consider the problem of detecting junctions and using them for recovering the spatial layout of an indoor scene. Junction detection has always been challenging due to missing and spurious lines. We work in a constrained Manhattan world setting where the junctions are formed by only line segments along the three principal orthogonal directions. Junctions can be classified into several categories based on the number and orientations of the incident line segments. We provide a simple and efficient voting scheme to detect and classify these junctions in real images. Indoor scenes are typically modeled as cuboids and we formulate the problem of the cuboid layout estimation as an inference problem in a conditional random field. Our formulation allows the incorporation of junction features and the training is done using structured prediction techniques. We outperform other single view geometry estimation methods on standard datasets.

5 0.1638166 30 cvpr-2013-Accurate Localization of 3D Objects from RGB-D Data Using Segmentation Hypotheses

Author: Byung-soo Kim, Shili Xu, Silvio Savarese

Abstract: In this paper we focus on the problem of detecting objects in 3D from RGB-D images. We propose a novel framework that explores the compatibility between segmentation hypotheses of the object in the image and the corresponding 3D map. Our framework allows to discover the optimal location of the object using a generalization of the structural latent SVM formulation in 3D as well as the definition of a new loss function defined over the 3D space in training. We evaluate our method using two existing RGB-D datasets. Extensive quantitative and qualitative experimental results show that our proposed approach outperforms state-of-theart as methods well as a number of baseline approaches for both 3D and 2D object recognition tasks.

6 0.15099713 329 cvpr-2013-Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images

7 0.15055676 461 cvpr-2013-Weakly Supervised Learning for Attribute Localization in Outdoor Scenes

8 0.14256312 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection

9 0.14200412 284 cvpr-2013-Mesh Based Semantic Modelling for Indoor and Outdoor Scenes

10 0.14049532 247 cvpr-2013-Learning Class-to-Image Distance with Object Matchings

11 0.13690457 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation

12 0.13611192 136 cvpr-2013-Discriminatively Trained And-Or Tree Models for Object Detection

13 0.13400568 16 cvpr-2013-A Linear Approach to Matching Cuboids in RGBD Images

14 0.13239501 207 cvpr-2013-Human Pose Estimation Using a Joint Pixel-wise and Part-wise Formulation

15 0.1303045 61 cvpr-2013-Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics

16 0.12403704 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

17 0.12228693 67 cvpr-2013-Blocks That Shout: Distinctive Parts for Scene Classification

18 0.11868453 168 cvpr-2013-Fast Object Detection with Entropy-Driven Evaluation

19 0.11775494 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs

20 0.11665951 154 cvpr-2013-Explicit Occlusion Modeling for 3D Object Class Representations


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.268), (1, -0.023), (2, 0.044), (3, -0.084), (4, 0.112), (5, 0.017), (6, 0.061), (7, 0.169), (8, -0.039), (9, -0.009), (10, -0.025), (11, -0.009), (12, 0.002), (13, -0.001), (14, -0.035), (15, -0.076), (16, 0.066), (17, 0.176), (18, -0.07), (19, -0.014), (20, 0.015), (21, 0.053), (22, 0.142), (23, 0.04), (24, 0.09), (25, -0.044), (26, 0.108), (27, -0.079), (28, -0.108), (29, 0.053), (30, -0.177), (31, 0.073), (32, 0.047), (33, 0.012), (34, -0.128), (35, -0.041), (36, -0.017), (37, 0.101), (38, 0.045), (39, -0.077), (40, -0.03), (41, 0.022), (42, 0.044), (43, -0.047), (44, 0.065), (45, 0.049), (46, 0.064), (47, 0.01), (48, 0.014), (49, -0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94218606 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

Author: Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, Silvio Savarese

Abstract: Visual scene understanding is a difficult problem interleaving object detection, geometric reasoning and scene classification. We present a hierarchical scene model for learning and reasoning about complex indoor scenes which is computationally tractable, can be learned from a reasonable amount of training data, and avoids oversimplification. At the core of this approach is the 3D Geometric Phrase Model which captures the semantic and geometric relationships between objects which frequently co-occur in the same 3D spatial configuration. Experiments show that this model effectively explains scene semantics, geometry and object groupings from a single image, while also improving individual object detections.

2 0.91711724 381 cvpr-2013-Scene Parsing by Integrating Function, Geometry and Appearance Models

Author: Yibiao Zhao, Song-Chun Zhu

Abstract: Indoor functional objects exhibit large view and appearance variations, and are thus difficult to recognize with the traditional appearance-based classification paradigm. In this paper, we present an algorithm to parse indoor images based on two observations: i) The functionality is the most essential property to define an indoor object, e.g. “a chair to sit on”; ii) The geometry (3D shape) of an object is designed to serve its function. We formulate the nature of the object function into a stochastic grammar model. This model characterizes a joint distribution over the function-geometry-appearance (FGA) hierarchy. The hierarchical structure includes a scene category, functional groups, functional objects, functional parts and 3D geometric shapes. We use a simulated annealing MCMC algorithm to find the maximum a posteriori (MAP) solution, i.e. a parse tree. We design four data-driven steps to accelerate the search in the FGA space: i) group the line segments into 3D primitive shapes, ii) assign functional labels to these 3D primitive shapes, iii) fill in missing objects/parts according to the functional labels, and iv) synthesize 2D segmentation maps and verify the current parse tree by the Metropolis-Hastings acceptance probability. The experimental results on several challenging indoor datasets demonstrate the proposed approach not only significantly widens the scope of indoor scene parsing algorithms from segmentation and 3D recovery to functional object recognition, but also yields improved overall performance.

3 0.84984785 445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models

Author: Luca Del_Pero, Joshua Bowdish, Bonnie Kermgard, Emily Hartley, Kobus Barnard

Abstract: We develop a comprehensive Bayesian generative model for understanding indoor scenes. While it is common in this domain to approximate objects with 3D bounding boxes, we propose using strong representations with finer granularity. For example, we model a chair as a set of four legs, a seat and a backrest. We find that modeling detailed geometry improves recognition and reconstruction, and enables more refined use of appearance for scene understanding. We demonstrate this with a new likelihood function that rewards 3D object hypotheses whose 2D projection is more uniform in color distribution. Such a measure would be confused by background pixels if we used a bounding box to represent a concave object like a chair. Complex objects are modeled using a set of re-usable 3D parts, and we show that this representation captures much of the variation among object instances with relatively few parameters. We also designed specific data-driven inference mechanisms for each part that are shared by all objects containing that part, which helps make inference transparent to the modeler. Further, we show how to exploit contextual relationships to detect more objects, by, for example, proposing chairs around and underneath tables. We present results showing the benefits of each of these innovations. The performance of our approach often exceeds that of state-of-the-art methods on the two tasks of room layout estimation and object recognition, as evaluated on two benchmark data sets used in this domain. 1) Detailed geometric models, such as tables with legs and top (bottom left), provide better reconstructions than plain boxes (top right), when supported by image features such as geometric context [5] (top middle), or an approach to using color introduced here. 2) Non-convex models allow for complex configurations, such as a chair under a table (bottom middle). 3) 3D contextual relationships, such as chairs being around a table, allow identifying objects supported by little image evidence, like the chair behind the table (bottom right). Best viewed in color.

4 0.75139689 61 cvpr-2013-Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics

Author: Bo Zheng, Yibiao Zhao, Joey C. Yu, Katsushi Ikeuchi, Song-Chun Zhu

Abstract: In this paper, we present an approach for scene understanding by reasoning physical stability of objects from point cloud. We utilize a simple observation that, by human design, objects in static scenes should be stable with respect to gravity. This assumption is applicable to all scene categories and poses useful constraints for the plausible interpretations (parses) in scene understanding. Our method consists of two major steps: 1) geometric reasoning: recovering solid 3D volumetric primitives from defective point cloud; and 2) physical reasoning: grouping the unstable primitives to physically stable objects by optimizing the stability and the scene prior. We propose to use a novel disconnectivity graph (DG) to represent the energy landscape and use a Swendsen-Wang Cut (MCMC) method for optimization. In experiments, we demonstrate that the algorithm achieves substantially better performance for i) object segmentation, ii) 3D volumetric recovery of the scene, and iii) better parsing result for scene understanding in comparison to state-of-the-art methods in both public dataset and our own new dataset.

5 0.73452497 197 cvpr-2013-Hallucinated Humans as the Hidden Context for Labeling 3D Scenes

Author: Yun Jiang, Hema Koppula, Ashutosh Saxena

Abstract: For scene understanding, one popular approach has been to model the object-object relationships. In this paper, we hypothesize that such relationships are only an artifact of certain hidden factors, such as humans. For example, the objects, monitor and keyboard, are strongly spatially correlated only because a human types on the keyboard while watching the monitor. Our goal is to learn this hidden human context (i.e., the human-object relationships), and also use it as a cue for labeling the scenes. We present Infinite Factored Topic Model (IFTM), where we consider a scene as being generated from two types of topics: human configurations and human-object relationships. This enables our algorithm to hallucinate the possible configurations of the humans in the scene parsimoniously. Given only a dataset of scenes containing objects but not humans, we show that our algorithm can recover the human object relationships. We then test our algorithm on the task ofattribute and object labeling in 3D scenes and show consistent improvements over the state-of-the-art.

6 0.68875682 278 cvpr-2013-Manhattan Junction Catalogue for Spatial Reasoning of Indoor Scenes

7 0.66474843 247 cvpr-2013-Learning Class-to-Image Distance with Object Matchings

8 0.6600306 16 cvpr-2013-A Linear Approach to Matching Cuboids in RGBD Images

9 0.65847439 1 cvpr-2013-3D-Based Reasoning with Blocks, Support, and Stability

10 0.62987465 136 cvpr-2013-Discriminatively Trained And-Or Tree Models for Object Detection

11 0.62430751 30 cvpr-2013-Accurate Localization of 3D Objects from RGB-D Data Using Segmentation Hypotheses

12 0.60458696 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection

13 0.59462661 25 cvpr-2013-A Sentence Is Worth a Thousand Pixels

14 0.57390064 284 cvpr-2013-Mesh Based Semantic Modelling for Indoor and Outdoor Scenes

15 0.5712781 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation

16 0.56816983 440 cvpr-2013-Tracking People and Their Objects

17 0.56530201 256 cvpr-2013-Learning Structured Hough Voting for Joint Object Detection and Occlusion Reasoning

18 0.5562768 417 cvpr-2013-Subcategory-Aware Object Classification

19 0.55494225 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision

20 0.55047321 80 cvpr-2013-Category Modeling from Just a Single Labeling: Use Depth Information to Guide the Learning of 2D Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.123), (16, 0.029), (26, 0.062), (28, 0.013), (33, 0.265), (39, 0.049), (55, 0.017), (56, 0.095), (67, 0.096), (69, 0.082), (80, 0.016), (87, 0.091)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93629587 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

Author: Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, Silvio Savarese

Abstract: Visual scene understanding is a difficult problem interleaving object detection, geometric reasoning and scene classification. We present a hierarchical scene model for learning and reasoning about complex indoor scenes which is computationally tractable, can be learned from a reasonable amount of training data, and avoids oversimplification. At the core of this approach is the 3D Geometric Phrase Model which captures the semantic and geometric relationships between objects which frequently co-occur in the same 3D spatial configuration. Experiments show that this model effectively explains scene semantics, geometry and object groupings from a single image, while also improving individual object detections.

2 0.93577439 381 cvpr-2013-Scene Parsing by Integrating Function, Geometry and Appearance Models

Author: Yibiao Zhao, Song-Chun Zhu

Abstract: Indoor functional objects exhibit large view and appearance variations, and are thus difficult to recognize with the traditional appearance-based classification paradigm. In this paper, we present an algorithm to parse indoor images based on two observations: i) The functionality is the most essential property to define an indoor object, e.g. “a chair to sit on”; ii) The geometry (3D shape) of an object is designed to serve its function. We formulate the nature of the object function into a stochastic grammar model. This model characterizes a joint distribution over the function-geometry-appearance (FGA) hierarchy. The hierarchical structure includes a scene category, functional groups, functional objects, functional parts and 3D geometric shapes. We use a simulated annealing MCMC algorithm to find the maximum a posteriori (MAP) solution, i.e. a parse tree. We design four data-driven steps to accelerate the search in the FGA space: i) group the line segments into 3D primitive shapes, ii) assign functional labels to these 3D primitive shapes, iii) fill in missing objects/parts according to the functional labels, and iv) synthesize 2D segmentation maps and verify the current parse tree by the Metropolis-Hastings acceptance probability. The experimental results on several challenging indoor datasets demonstrate the proposed approach not only significantly widens the scope of indoor scene parsing algorithms from segmentation and 3D recovery to functional object recognition, but also yields improved overall performance.

3 0.93352377 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

Author: Ian Endres, Kevin J. Shih, Johnston Jiaa, Derek Hoiem

Abstract: We propose a method to learn a diverse collection of discriminative parts from object bounding box annotations. Part detectors can be trained and applied individually, which simplifies learning and extension to new features or categories. We apply the parts to object category detection, pooling part detections within bottom-up proposed regions and using a boosted classifier with proposed sigmoid weak learners for scoring. On PASCAL VOC 2010, we evaluate the part detectors ’ ability to discriminate and localize annotated keypoints. Our detection system is competitive with the best-existing systems, outperforming other HOG-based detectors on the more deformable categories.

4 0.93221104 30 cvpr-2013-Accurate Localization of 3D Objects from RGB-D Data Using Segmentation Hypotheses

Author: Byung-soo Kim, Shili Xu, Silvio Savarese

Abstract: In this paper we focus on the problem of detecting objects in 3D from RGB-D images. We propose a novel framework that explores the compatibility between segmentation hypotheses of the object in the image and the corresponding 3D map. Our framework allows to discover the optimal location of the object using a generalization of the structural latent SVM formulation in 3D as well as the definition of a new loss function defined over the 3D space in training. We evaluate our method using two existing RGB-D datasets. Extensive quantitative and qualitative experimental results show that our proposed approach outperforms state-of-theart as methods well as a number of baseline approaches for both 3D and 2D object recognition tasks.

5 0.92616552 68 cvpr-2013-Blur Processing Using Double Discrete Wavelet Transform

Author: Yi Zhang, Keigo Hirakawa

Abstract: We propose a notion of double discrete wavelet transform (DDWT) that is designed to sparsify the blurred image and the blur kernel simultaneously. DDWT greatly enhances our ability to analyze, detect, and process blur kernels and blurry images—the proposed framework handles both global and spatially varying blur kernels seamlessly, and unifies the treatment of blur caused by object motion, optical defocus, and camera shake. To illustrate the potential of DDWT in computer vision and image processing, we develop example applications in blur kernel estimation, deblurring, and near-blur-invariant image feature extraction.

6 0.92612994 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation

7 0.92548925 240 cvpr-2013-Keypoints from Symmetries by Wave Propagation

8 0.92542553 136 cvpr-2013-Discriminatively Trained And-Or Tree Models for Object Detection

9 0.9246729 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities

10 0.92320412 61 cvpr-2013-Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics

11 0.92182738 445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models

12 0.92170066 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval

13 0.92031848 414 cvpr-2013-Structure Preserving Object Tracking

14 0.91888976 325 cvpr-2013-Part Discovery from Partial Correspondence

15 0.91770989 292 cvpr-2013-Multi-agent Event Detection: Localization and Role Assignment

16 0.9167363 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image

17 0.91662925 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection

18 0.91660893 372 cvpr-2013-SLAM++: Simultaneous Localisation and Mapping at the Level of Objects

19 0.91620278 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection

20 0.91609204 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection