CVPR 2013 (cvpr2013-445) knowledge graph by maker-knowledge-mining

445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models


Source: pdf

Author: Luca Del_Pero, Joshua Bowdish, Bonnie Kermgard, Emily Hartley, Kobus Barnard

Abstract: We develop a comprehensive Bayesian generative model for understanding indoor scenes. While it is common in this domain to approximate objects with 3D bounding boxes, we propose using strong representations with finer granularity. For example, we model a chair as a set of four legs, a seat and a backrest. We find that modeling detailed geometry improves recognition and reconstruction, and enables more refined use of appearance for scene understanding. We demonstrate this with a new likelihood function that rewards 3D object hypotheses whose 2D projection is more uniform in color distribution. Such a measure would be confused by background pixels if we used a bounding box to represent a concave object like a chair. Complex objects are modeled using a set of re-usable 3D parts, and we show that this representation captures much of the variation among object instances with relatively few parameters. We also designed specific data-driven inference mechanisms for each part that are shared by all objects containing that part, which helps make inference transparent to the modeler. Further, we show how to exploit contextual relationships to detect more objects, by, for example, proposing chairs around and underneath tables. We present results showing the benefits of each of these innovations. The performance of our approach often exceeds that of state-of-the-art methods on the two tasks of room layout estimation and object recognition, as evaluated on two benchmark data sets used in this domain.

Figure 1 caption: 1) Detailed geometric models, such as tables with legs and top (bottom left), provide better reconstructions than plain boxes (top right), when supported by image features such as geometric context [5] (top middle), or an approach to using color introduced here. 2) Non-convex models allow for complex configurations, such as a chair under a table (bottom middle). 3) 3D contextual relationships, such as chairs being around a table, allow identifying objects supported by little image evidence, like the chair behind the table (bottom right). Best viewed in color.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The use of more specific and detailed geometric models as proposed in this paper enables better understanding of scenes, illustrated here by localizing chairs tucked under the table in 3D. [sent-6, score-0.432]

2 For example, we model a chair as a set of four legs, a seat and a backrest. [sent-9, score-0.248]

3 Such a measure would be confused by background pixels if we used a bounding box to represent a concave object like a chair. [sent-12, score-0.368]

4 We also designed specific data-driven inference mechanisms for each part that are shared by all objects containing that part, which helps make inference transparent to the modeler. [sent-14, score-0.281]

5 Further, we show how to exploit contextual relationships to detect more objects, by, for example, proposing chairs around and underneath tables. [sent-15, score-0.442]

6 The performance of our approach often exceeds that of state-of-the-art methods on the two tasks of room layout estimation and object recognition, as evaluated on two benchmark data sets used in this domain. [sent-17, score-0.43]

7 1) Detailed geometric models, such as tables with legs and top (bottom left), provide better reconstructions than plain boxes (top right), when supported by image features such as geometric context [5] (top middle), or an approach to using color introduced here. [sent-19, score-0.725]

8 3) 3D contextual relationships, such as chairs being around a table, allow identifying objects supported by little image evidence, like the chair behind the table (bottom right). [sent-21, score-0.682]

9 Karsch et al. [6] exploited the inferred 3D geometry to insert realistic computer graphics objects into indoor images. [sent-29, score-0.264]

10 A single box is used to approximate the walls, floor, and ceiling enclosing the scene (room box), and also to represent objects inside it, such as beds and tables. [sent-31, score-0.384]

11 First, bounding boxes of concave objects projected into images tend to include much background, which is confusing evidence for inference. [sent-35, score-0.315]

12 For example, the middle-top image shows the output of an appearance-based 2D classifier (geometric context [5]), where pixels with higher probability of being part of an object instead of the wall or the floor are colored gray. [sent-36, score-0.305]

13 Fitting a single 3D block to this feature map will be hampered by the confusing evidence, whereas a more articulated table model with legs and top explains the classification results for pixels between the legs of the table. [sent-37, score-0.785]

14 Second, even if a single-bounding-box representation succeeded in discovering the presence of an object in the image, the parameters of a single fitted block have only modest power to distinguish objects. [sent-38, score-0.24]

15 We previously showed that it is possible to classify furniture objects based only on 3D bounding box dimensions [10], but with much confusion when objects are similar in size. [sent-39, score-0.665]

16 Third, plain blocks cannot capture complex spatial configurations, and would not allow sliding chairs under the table (Figure 2, bottom row). [sent-41, score-0.577]

17 These observations lead us to propose a principled framework for modeling indoor scenes with representations for articulated objects, such as the table and the chairs in Figure 2. [sent-44, score-0.482]

18 As in our previous work [10], we set out to simultaneously infer the 3D room box, the objects in it, their identity, and the camera parameters, all from a single image. [sent-45, score-0.461]

19 Importantly, inference strategies designed for a specific part are naturally shared by all the objects containing that part, and the modeler can create models using the available parts without having to worry about the inference. [sent-51, score-0.444]

20 A third contribution is showing how to exploit contextual relationships between objects to help inference if there is significant occlusion or weak image evidence. [sent-52, score-0.227]

21 For example, we show how to improve recognition of chairs by looking around tables. [sent-53, score-0.341]

22 There is often little image evidence supporting the identification of a chair, perhaps just a leg or the top of a backrest (Figure 2, bottom right), but this can be addressed using top-down information, by looking for chairs in places that are likely based on the current model hypothesis. [sent-54, score-0.609]

23 Second, we advocate a stronger 3D representation, where geometric variations within an object category are modeled in 3D, for example using priors on 3D size, instead of learning orientations and distances among the parts of an object in 2D [16]. [sent-65, score-0.278]

24 The model parameters (r, o1, . . . , on) include the room box and objects in it, where the number of objects n is not known a priori. [sent-75, score-0.641]

25 We model the room as a right-angled parallelepiped [7, 3, 10, 11], defined in terms of its 3D center, width, height and length, r = (xr, yr, zr, wr, hr, lr, γr), where γr is the amount of rotation around the room y axis (yaw) [11]. [sent-76, score-0.768]
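As a reading aid, here is a minimal sketch of this room parameterization in Python; the class and field names are ours, not from the authors' code.

```python
from dataclasses import dataclass

@dataclass
class RoomBox:
    """Right-angled parallelepiped: 3D center, size, and yaw."""
    xr: float       # center x
    yr: float       # center y
    zr: float       # center z
    wr: float       # width
    hr: float       # height
    lr: float       # length
    gamma_r: float  # rotation around the room y axis (yaw)
```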

26 A key contribution of this work is representing object models by assemblages of re-usable geometric primitives (parts), as opposed to simple bounding boxes. [sent-81, score-0.253]

27 The last six parameters are the 3D center and size of a bounding box containing the entire object model. [sent-89, score-0.381]

28 We define the size and position of object parts relative to the object bounding box, as one does not have access to absolute sizes when reconstructing from single images. [sent-90, score-0.32]

29 We also assume that objects are aligned with the room walls [7, 11]. [sent-91, score-0.462]

30 We constrain the modeler to stack the parts vertically, although extensions are possible. [sent-93, score-0.278]

31 Variable hri denotes the height of the ith part expressed as a ratio of the total object height, with the ratios summing to one. [sent-105, score-0.302]

32 Each part pi comprises a set of internal parameters pθi, which are defined relative to the bounding box occupied by that part. [sent-113, score-0.446]

33 The height of a chair’s seat is an example of an internal part parameter, and Figure 3 (bottom right) shows changing it while keeping the part bounding box fixed. [sent-114, score-0.647]

34 The object size and position in the room are specified by its bounding box, while part heights (hr1, . . . , hrn) and internal parameters (pθ1, . . . , pθn) determine the geometry of the individual parts. [sent-116, score-0.644]

35 Changes in the object bounding box propagate to the parts (Figure 3, top right); the object shown is composed of four legs and an L-shaped component. [sent-123, score-0.524]

36 Top right: Changes in the object bounding box propagate to each part. [sent-124, score-0.328]

37 Bottom left: Parts are stacked vertically, with their height defined as a ratio of the total object height (two different ratios shown). [sent-126, score-0.325]

38 Bottom right: Changing the internal parameters of a part, here the L shaped one, while keeping the part bounding box fixed. [sent-127, score-0.446]

39 Changes in the part heights propagate to the parameters for the affected parts (Figure 3, bottom left), and changing the internal parameters results in changes local to the specific part (Figure 3, bottom right). [sent-128, score-0.461]
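A rough illustration of this composite parameterization (all names are ours; the real system also models non-box parts such as cylindrical legs): the object stores its bounding box plus per-part height ratios and internal parameters, so bounding-box edits propagate to every part while internal edits stay local.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Part:
    height_ratio: float                 # h_ri, fraction of total object height
    internal: Dict[str, float] = field(default_factory=dict)  # p_theta_i

@dataclass
class CompositeObject:
    # Object bounding box: 3D center and size (relative to the room).
    x: float; y: float; z: float
    w: float; h: float; l: float
    parts: List[Part] = field(default_factory=list)  # stacked bottom to top

    def part_boxes(self) -> List[Tuple[float, ...]]:
        """Bounding box of each part; editing the object box or a height
        ratio re-derives these, while `internal` edits stay local to a part."""
        boxes, y0 = [], self.y - self.h / 2.0   # start at the object's base
        for p in self.parts:                    # height ratios sum to one
            ph = p.height_ratio * self.h
            boxes.append((self.x, y0 + ph / 2.0, self.z, self.w, ph, self.l))
            y0 += ph
        return boxes
```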

40 We emphasize that precise geometry enables configurations that bounding boxes would not, for example sliding a chair under a table (Figure 2). [sent-130, score-0.479]

41 For efficiency, during inference we first check if the objects’ bounding boxes collide, and only if that is the case do we check collisions using the geometry of the individual parts. [sent-131, score-0.356]
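A sketch of this coarse-to-fine collision test, reusing the hypothetical CompositeObject above and approximating every part by its bounding box for brevity:

```python
def aabb_overlap(a, b):
    """Overlap test for axis-aligned boxes given as (x, y, z, w, h, l)."""
    return all(2.0 * abs(a[i] - b[i]) < a[i + 3] + b[i + 3] for i in range(3))

def objects_collide(obj_a, obj_b):
    """Coarse-to-fine: cheap object-box reject, then per-part checks."""
    box = lambda o: (o.x, o.y, o.z, o.w, o.h, o.l)
    if not aabb_overlap(box(obj_a), box(obj_b)):
        return False                       # bounding boxes do not even touch
    return any(aabb_overlap(pa, pb)        # otherwise test part against part
               for pa in obj_a.part_boxes()
               for pb in obj_b.part_boxes())
```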

42 Prior distributions Priors on the room box, objects, and camera parameters help constrain the search over parameter space, and also allow for recognition based on size and position [10]. [sent-134, score-0.505]

43 An object prior is defined over the size and position of its bounding box. [sent-144, score-0.227]

44 We consider the ratios between (see the sketch below):
• height and largest dimension: or1 = h / max(w, l)
• width and length: or2 = max(w, l) / min(w, l)
• room height and object height: or3 = hr / h
In our previous work [10], we showed how these quantities help distinguish between object classes. [sent-145, score-0.889]
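A direct transcription of these three ratios (function name ours):

```python
def prior_ratios(w, h, l, room_h):
    """Scale-free cues o_r1, o_r2, o_r3 used by the object priors [10]."""
    o_r1 = h / max(w, l)           # height vs. largest footprint dimension
    o_r2 = max(w, l) / min(w, l)   # footprint elongation
    o_r3 = room_h / h              # room height vs. object height
    return o_r1, o_r2, o_r3
```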

45 We set the parameters of object priors from text data available from online furniture and appliance catalogs [10]. [sent-156, score-0.326]

46 Building object models As part of this work, we implemented a modeling framework that allows any object to be assembled from the palette of geometric parts. [sent-159, score-0.236]

47 To create a new object, the modeler specifies how these parts are arranged in the vertical stack, and provides parameters for the prior distributions as described above. [sent-160, score-0.31]

48 Rather than include the internal parameters of an object as part of its prior, which leads to additional model selection problems, we simply set them as part of the model. [sent-170, score-0.283]

49 Designing object parts We designed object parts to be modular so that they can be re-used in the modeling phase. [sent-174, score-0.3]

50 The inference strategies defined for a part are shared by all objects using that part, and are transparent to the modeler. [sent-176, score-0.236]

51 E.g., the inference for the four legs of a table is the same module as for the four legs of a chair. [sent-179, score-0.797]

52 Bottom: the free parameters of the L-shaped component and of the set of legs are shown by the double arrows. [sent-186, score-0.379]

53 The L component is parametrized in terms of the height of the horizontal block, and the width of the vertical block relative to the part bounding box. [sent-190, score-0.583]

54 Since we assume that all objects are aligned with the room walls, only four configurations are possible (Figure 4, top left). [sent-192, score-0.493]

55 Within this context, we provide three kinds of L-shaped parts (Figure 4, top): L1, where the vertical block can be along any side, L2, where the vertical block is restricted to a long side of the horizontal block, and L3, where it is on a short side. [sent-193, score-0.451]

56 The set of cylindrical legs is parametrized in terms of the leg radius and the offset between the leg position and the corner, both of which are shared among all legs. [sent-194, score-0.478]

57 Finally, the simple block part does not require any parameters, as we assume that the block is as big as the part bounding box, which is encoded at the next level up. [sent-195, score-0.487]
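The internal parameters described for these three part types could be organized as below; this is our illustrative structuring, not the authors' code:

```python
from dataclasses import dataclass

@dataclass
class LegSet:
    """Four cylindrical legs; one radius and one corner offset shared by
    all legs, both expressed relative to the part bounding box."""
    radius: float
    corner_offset: float

@dataclass
class LComponent:
    """L-shaped part: a horizontal block plus a vertical block on one side."""
    horiz_height: float   # height of the horizontal block (relative units)
    vert_width: float     # width of the vertical block (relative units)
    side: int             # which wall-aligned configuration (up to four)

@dataclass
class SimpleBlock:
    """No internal parameters: fills the whole part bounding box."""
    pass
```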

58 Having parts that capture some of the complexities of the objects, while inheriting their bounding box, simplifies the work of the modeler, and proves effective for inference (see Sec. [sent-197, score-0.314]

59 In this work, we modeled 6 different furniture types: simple beds (a single block), beds with headrests (an L3 component), couches (an L2 component), tables (a stack of four legs and a single block for the top), chairs (a stack of four legs and an L1), and cabinets (a single block). [sent-199, score-2.065]

60 Lastly, we use thin blocks attached to a room wall to model frame categories (Figure 4, middle right), which include doors, picture frames and windows [10]. [sent-200, score-0.542]

61 Following [2], we also consider geometric context labels, which estimate the geometric class of each pixel, choosing between: object, floor, ceiling, left, middle and right wall. [sent-212, score-0.226]

62 Since the available classifier was trained against data where only furniture was labeled as objects, and not frames, we consider frames as part of the wall they are attached to, and not as objects when we evaluate on geometric context. [sent-221, score-0.531]

63 In this scope, detailed geometry and 3D reasoning play an important role, as shown in Figure 2, where structures with legs provide a much better grouping than a plain block. [sent-225, score-0.479]

64 We consider two pixels in the same group if they are both part of the projection of the same object, or of the same room surface, but we consider walls as a single group, as they tend to be of the same color. [sent-235, score-0.464]

65 In this work, we experiment with color and use dij = χ(CHi , CHj), where χ(CHi, CHj) is the chi-square distance between the color histograms computed at pixels i and j over a window of size n = 15. [sent-243, score-0.226]
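A sketch of this pairwise color term, assuming the two histograms are already normalized (numpy-based; the 1/2 factor is one common convention for the chi-square distance):

```python
import numpy as np

def chi_square(ch_i, ch_j, eps=1e-10):
    """Chi-square distance between two color histograms CH_i and CH_j,
    each computed over an n x n window around its pixel."""
    a, b = np.asarray(ch_i, float), np.asarray(ch_j, float)
    return 0.5 * np.sum((a - b) ** 2 / (a + b + eps))
```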

66 Here we used a coarse grid search over β and a second parameter, with a step of 2, using the room box layout error (defined in Section 4) as an objective function. [sent-252, score-0.519]
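A coarse grid search of this kind might look as follows; the name and range of the second searched weight are not recoverable from the text, so both are placeholders, and layout_error is assumed to evaluate the Section 4 metric for a given weight pair:

```python
def tune_weights(layout_error, lo=0, hi=20, step=2):
    """Grid search over two weights, minimizing room-box layout error."""
    grid = range(lo, hi + 1, step)
    # min over (error, beta, gamma) tuples picks the lowest-error pair
    return min((layout_error(b, g), b, g) for b in grid for g in grid)[1:]
```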

67 Inference We use MCMC sampling to search the parameter space, defined by camera and room box parameters, the unknown number of objects, their type, and the parameters of each object and its parts. [sent-256, score-0.628]

68 As in our previous work [10, 11], we combine two sampling methods—reversible jump Metropolis-Hastings for discrete parameters (how many objects, what type they are), and stochastic dynamics [9, 10] for continuous parameters (camera, room box, and object parameters). [sent-257, score-0.576]

69 We initialize the parameters of the room box by generating candidates from orthogonal corners detected on the image plane [10]. [sent-261, score-0.556]

70 We sample over the continuous parameters of each candidate and use the one with the best posterior to initialize the room box parameters. [sent-262, score-0.516]

71 Then, we randomly alternate the following moves (sketched below):
• sample over room box and camera parameters
• jump move: add/remove an object, change the category of an object
• pick a random object and sample over the parameters of its bounding box, or over (hr1, . . . , hrn) and (pθ1, . . . , pθn). [sent-263, score-0.906]
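A high-level sketch of how these moves could be alternated; the move implementations are passed in as callables since they depend on the model, and all names are ours:

```python
import random

def mcmc_step(state, sample_room_camera, jump_move, sample_object):
    """One sampler step: randomly alternate the three move types above."""
    move = random.choice(["continuous", "jump", "object"])
    if move == "jump":
        return jump_move(state)             # reversible-jump add/remove/retype
    if move == "object" and state.objects:
        obj = random.choice(state.objects)  # resample one object's bbox or
        return sample_object(state, obj)    # its (h_r, p_theta) parameters
    return sample_room_camera(state)        # stochastic dynamics [9, 10]
```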

72 Proposing a table from two pegs (middle) requires estimating the width/length and the height of the table; proposing it from three leaves only the height as a free parameter. [sent-282, score-0.47]

73 To increase the acceptance ratio of jump moves, we propose objects from image corners in a data-driven fashion [11], and briefly sample their continuous parameters before evaluating Metropolis-Hastings (delayed acceptance [11]). [sent-285, score-0.398]

74 Efficient inference of complex structure, such as chairs with legs, seat and backrest, is more exacting than that of simple blocks. [sent-287, score-0.506]

75 We designed specific inference moves for the different parts, which are re-used by all objects containing that part. [sent-288, score-0.235]

76 While all objects share the data-driven proposal mechanism from corners, we use specific inference for the L-component and the set of four legs. [sent-289, score-0.228]

77 For the former, we have to keep in mind that two to four configurations are possible (Figure 4, top row), and we try them all whenever a jump move involves an object containing an L component. [sent-290, score-0.24]

78 We thus detect peg structures, which are likely candidates for being legs (Figure 6, left), as suggested by Hedau et al. [sent-292, score-0.334]

79 Note that this is different from Hedau’s work [4] where objects are modeled with bounding boxes, and pegs are part of the likelihood, as a way to explain the missing edges between the legs of a table. [sent-297, score-0.657]

80 Tables and chairs are an example, since they often occlude each other, like the chair behind the table in Figure 2. [sent-301, score-0.516]

81 Here, we bias the sampler to propose for chairs around detected tables, as shown in Figure 7. [sent-303, score-0.341]

82 Given a table hypothesis (seen from above in blue, top left), we propose chairs around it: we look for chairs in the red areas in the figure, whose size and position are defined relative to the table. [sent-304, score-0.87]
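One simple way to realize such context-driven proposals, assuming a hypothetical table object with center (x, z) and footprint (w, l); the gap value is illustrative:

```python
def chair_proposal_regions(table, gap=0.05):
    """Centers of chair search regions on the four sides of a table
    hypothesis, placed just outside its footprint (room-relative units)."""
    cx, cz, w, l = table.x, table.z, table.w, table.l
    return [(cx - w / 2 - gap, cz),   # left of the table
            (cx + w / 2 + gap, cz),   # right of the table
            (cx, cz - l / 2 - gap),   # near side
            (cx, cz + l / 2 + gap)]   # far side
```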

83 This allowed us to find the chairs drawn in yellow in Figure 7, that were missed by inference without context cues. [sent-318, score-0.486]

84 We first evaluate the quality of the room box estimation [2, 7, 10, 11], by comparing the projection of the estimated room against the ground truth, where each pixel was labeled according to the surface of the room box it belongs to. [sent-322, score-1.243]
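Assuming per-pixel surface labels for the prediction and the ground truth, this layout error reduces to a pixel disagreement rate; a minimal sketch:

```python
import numpy as np

def room_layout_error(predicted, ground_truth):
    """Fraction of pixels whose predicted room-surface label (floor,
    ceiling, left/middle/right wall) differs from the ground truth."""
    predicted, ground_truth = np.asarray(predicted), np.asarray(ground_truth)
    return float(np.mean(predicted != ground_truth))
```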

85 We are trying to identify eight object classes that belong to two very distinct categories: frames (doors, windows and picture frames), and furniture (beds, cabinets, chairs, couches and tables). [sent-327, score-0.448]

86 We first compare with our previous results on object recognition [10], where the same furniture and frames categories are used, except for chairs. [sent-350, score-0.355]

87 For proper comparison, we do not include chairs when computing precision and recall, and evaluate with chairs separately. [sent-351, score-0.715]

88 Also, we consider beds with headrest and beds without headrest as both part of the category “bed”. [sent-352, score-0.453]

89 In general, there is a trend showing better precision for furniture than for frames. [sent-355, score-0.249]

90 We explain this difference by considering that frames are supported by edges and color only, whereas furniture is detected using a more robust set of features including geometric context and orientation maps. [sent-356, score-0.438]

91 Detailed geometry also allows us to improve on subcategory classification for furniture, as precise topology is a strong hint for distinguishing among categories such as couches and tables. [sent-357, score-0.313]

92 Our color model improves precision and recall for both furniture and frames, as it helps segment objects from the background and from each other. [sent-358, score-0.395]

93 For furniture, color also improves subcategory recognition indirectly by improving object geometry fitting. [sent-359, score-0.26]

94 However, recall suffers, as chairs are relatively small and often heavily occluded. [sent-362, score-0.341]

95 This shows the benefits of using all measures, and of using context for proposing, which improves results where there is heavy occlusion and scarce information. [sent-364, score-0.358]

96 This is a promising step towards dealing with scarce image evidence using top-down information: in the case of the Hedau dataset, context allowed us to identify seven more chairs at the cost of one false positive. [sent-365, score-0.478]

97 Qualitative results on using context to find chairs are shown in Figure 7, while full scene reconstructions are shown in Figure 1, 2 (bottom right), and 8, which also includes some typical failures. [sent-366, score-0.39]

98 Consider for example the table model, where we do not require tables to be of any particular height, but the relative amount for the legs part versus the top part is kept within a small learned range (roughly 92% for the legs). [sent-370, score-0.513]

99 Hence proposing a complex alternative to a bounding box acts as an independent, local part of the inference, unless it changes what occludes what, e.g., switching a block into a table so that chairs can be tucked underneath. [sent-380, score-0.943]

100 Thinking inside the box: Using appearance models and context based on room geometry. [sent-403, score-0.366]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('chairs', 0.341), ('room', 0.317), ('legs', 0.288), ('hedau', 0.221), ('furniture', 0.216), ('beds', 0.149), ('box', 0.146), ('chair', 0.136), ('height', 0.134), ('block', 0.13), ('couches', 0.13), ('bounding', 0.125), ('modeler', 0.115), ('backrest', 0.104), ('pegs', 0.104), ('indoor', 0.102), ('jump', 0.096), ('inference', 0.096), ('parts', 0.093), ('objects', 0.089), ('ucb', 0.085), ('tables', 0.084), ('gck', 0.078), ('grouping', 0.075), ('subcategory', 0.073), ('geometry', 0.073), ('dij', 0.072), ('internal', 0.071), ('geometric', 0.071), ('bottom', 0.07), ('stack', 0.07), ('hrn', 0.069), ('seat', 0.069), ('schlecht', 0.069), ('couch', 0.069), ('cabinets', 0.064), ('boxes', 0.062), ('pk', 0.061), ('acceptance', 0.06), ('hri', 0.06), ('kermgard', 0.06), ('proposing', 0.059), ('wall', 0.059), ('sti', 0.057), ('pero', 0.057), ('object', 0.057), ('color', 0.057), ('layout', 0.056), ('walls', 0.056), ('width', 0.056), ('leg', 0.055), ('camera', 0.055), ('composite', 0.053), ('parameters', 0.053), ('chj', 0.052), ('headrest', 0.052), ('tucked', 0.052), ('part', 0.051), ('moves', 0.05), ('blocks', 0.049), ('floor', 0.049), ('vertical', 0.049), ('heights', 0.049), ('context', 0.049), ('likelihood', 0.047), ('rooms', 0.047), ('satkin', 0.047), ('kobus', 0.046), ('peg', 0.046), ('frames', 0.045), ('position', 0.045), ('vertically', 0.044), ('configurations', 0.044), ('four', 0.043), ('plain', 0.043), ('karsch', 0.043), ('contextual', 0.042), ('bed', 0.042), ('benefits', 0.041), ('ig', 0.041), ('pixels', 0.04), ('bowdish', 0.04), ('corners', 0.04), ('predicted', 0.04), ('table', 0.039), ('evidence', 0.039), ('component', 0.038), ('categories', 0.037), ('hoiem', 0.036), ('oi', 0.036), ('middle', 0.035), ('barnard', 0.035), ('cylindrical', 0.035), ('edge', 0.035), ('od', 0.035), ('allow', 0.035), ('chi', 0.034), ('doors', 0.034), ('yaw', 0.034), ('precision', 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models

Author: Luca Del_Pero, Joshua Bowdish, Bonnie Kermgard, Emily Hartley, Kobus Barnard

Abstract: We develop a comprehensive Bayesian generative model for understanding indoor scenes. While it is common in this domain to approximate objects with 3D bounding boxes, we propose using strong representations with finer granularity. For example, we model a chair as a set of four legs, a seat and a backrest. We find that modeling detailed geometry improves recognition and reconstruction, and enables more refined use of appearance for scene understanding. We demonstrate this with a new likelihood function that rewards 3D object hypotheses whose 2D projection is more uniform in color distribution. Such a measure would be confused by background pixels if we used a bounding box to represent a concave object like a chair. Complex objects are modeled using a set of re-usable 3D parts, and we show that this representation captures much of the variation among object instances with relatively few parameters. We also designed specific data-driven inference mechanisms for each part that are shared by all objects containing that part, which helps make inference transparent to the modeler. Further, we show how to exploit contextual relationships to detect more objects, by, for example, proposing chairs around and underneath tables. We present results showing the benefits of each of these innovations. The performance of our approach often exceeds that of state-of-the-art methods on the two tasks of room layout estimation and object recognition, as evaluated on two benchmark data sets used in this domain.

2 0.28115904 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

Author: Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, Silvio Savarese

Abstract: Visual scene understanding is a difficult problem interleaving object detection, geometric reasoning and scene classification. We present a hierarchical scene model for learning and reasoning about complex indoor scenes which is computationally tractable, can be learned from a reasonable amount of training data, and avoids oversimplification. At the core of this approach is the 3D Geometric Phrase Model which captures the semantic and geometric relationships between objects whichfrequently co-occur in the same 3D spatial configuration. Experiments show that this model effectively explains scene semantics, geometry and object groupings from a single image, while also improving individual object detections.

3 0.21396735 381 cvpr-2013-Scene Parsing by Integrating Function, Geometry and Appearance Models

Author: Yibiao Zhao, Song-Chun Zhu

Abstract: Indoor functional objects exhibit large view and appearance variations, and are thus difficult to recognize with the traditional appearance-based classification paradigm. In this paper, we present an algorithm to parse indoor images based on two observations: i) The functionality is the most essential property to define an indoor object, e.g. "a chair to sit on"; ii) The geometry (3D shape) of an object is designed to serve its function. We formulate the nature of the object function into a stochastic grammar model. This model characterizes a joint distribution over the function-geometry-appearance (FGA) hierarchy. The hierarchical structure includes a scene category, functional groups, functional objects, functional parts and 3D geometric shapes. We use a simulated annealing MCMC algorithm to find the maximum a posteriori (MAP) solution, i.e. a parse tree. We design four data-driven steps to accelerate the search in the FGA space: i) group the line segments into 3D primitive shapes, ii) assign functional labels to these 3D primitive shapes, iii) fill in missing objects/parts according to the functional labels, and iv) synthesize 2D segmentation maps and verify the current parse tree by the Metropolis-Hastings acceptance probability. The experimental results on several challenging indoor datasets demonstrate that the proposed approach not only significantly widens the scope of indoor scene parsing algorithms from segmentation and 3D recovery to functional object recognition, but also yields improved overall performance.

4 0.17641062 1 cvpr-2013-3D-Based Reasoning with Blocks, Support, and Stability

Author: Zhaoyin Jia, Andrew Gallagher, Ashutosh Saxena, Tsuhan Chen

Abstract: 3D volumetric reasoning is important for truly understanding a scene. Humans are able to both segment each object in an image, and perceive a rich 3D interpretation of the scene, e.g., the space an object occupies, which objects support other objects, and which objects would, if moved, cause other objects to fall. We propose a new approach for parsing RGB-D images using 3D block units for volumetric reasoning. The algorithm fits image segments with 3D blocks, and iteratively evaluates the scene based on block interaction properties. We produce a 3D representation of the scene based on jointly optimizing over segmentations, block fitting, supporting relations, and object stability. Our algorithm incorporates the intuition that a good 3D representation of the scene is the one that fits the data well, and is a stable, self-supporting (i.e., one that does not topple) arrangement of objects. We experiment on several datasets including controlled and real indoor scenarios. Results show that our stability-reasoning framework improves RGB-D segmentation and scene volumetric representation.

5 0.15986128 329 cvpr-2013-Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images

Author: Saurabh Gupta, Pablo Arbeláez, Jitendra Malik

Abstract: We address the problems of contour detection, bottom-up grouping and semantic segmentation using RGB-D data. We focus on the challenging setting of cluttered indoor scenes, and evaluate our approach on the recently introduced NYU-Depth V2 (NYUD2) dataset [27]. We propose algorithms for object boundary detection and hierarchical segmentation that generalize the gPb−ucm approach of [2] by making effective use of depth information. We show that our system can label each contour with its type (depth, normal or albedo). We also propose a generic method for long-range amodal completion of surfaces and show its effectiveness in grouping. We then turn to the problem of semantic segmentation and propose a simple approach that classifies superpixels into the 40 dominant object categories in NYUD2. We use both generic and class-specific features to encode the appearance and geometry of objects. We also show how our approach can be used for scene classification, and how this contextual information in turn improves object recognition. In all of these tasks, we report significant improvements over the state-of-the-art.

6 0.14663823 364 cvpr-2013-Robust Object Co-detection

7 0.14190623 67 cvpr-2013-Blocks That Shout: Distinctive Parts for Scene Classification

8 0.1392667 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

9 0.13058567 278 cvpr-2013-Manhattan Junction Catalogue for Spatial Reasoning of Indoor Scenes

10 0.13030928 417 cvpr-2013-Subcategory-Aware Object Classification

11 0.12709795 154 cvpr-2013-Explicit Occlusion Modeling for 3D Object Class Representations

12 0.12373218 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection

13 0.10925777 127 cvpr-2013-Discovering the Structure of a Planar Mirror System from Multiple Observations of a Single Point

14 0.10883401 61 cvpr-2013-Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics

15 0.10577883 325 cvpr-2013-Part Discovery from Partial Correspondence

16 0.10574833 284 cvpr-2013-Mesh Based Semantic Modelling for Indoor and Outdoor Scenes

17 0.1049605 30 cvpr-2013-Accurate Localization of 3D Objects from RGB-D Data Using Segmentation Hypotheses

18 0.10421123 197 cvpr-2013-Hallucinated Humans as the Hidden Context for Labeling 3D Scenes

19 0.097695909 60 cvpr-2013-Beyond Physical Connections: Tree Models in Human Pose Estimation

20 0.097485386 86 cvpr-2013-Composite Statistical Inference for Semantic Segmentation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.232), (1, 0.021), (2, 0.041), (3, -0.052), (4, 0.089), (5, -0.004), (6, 0.071), (7, 0.132), (8, -0.002), (9, 0.009), (10, -0.048), (11, -0.031), (12, 0.015), (13, -0.028), (14, -0.027), (15, -0.043), (16, 0.1), (17, 0.13), (18, -0.095), (19, 0.032), (20, 0.013), (21, 0.05), (22, 0.173), (23, -0.018), (24, 0.095), (25, -0.034), (26, 0.072), (27, -0.069), (28, -0.081), (29, -0.014), (30, -0.096), (31, 0.077), (32, 0.016), (33, 0.037), (34, -0.075), (35, -0.047), (36, 0.021), (37, 0.063), (38, 0.014), (39, -0.079), (40, -0.005), (41, 0.01), (42, 0.067), (43, 0.054), (44, 0.037), (45, 0.073), (46, 0.077), (47, 0.008), (48, -0.062), (49, -0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93844324 445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models

Author: Luca Del_Pero, Joshua Bowdish, Bonnie Kermgard, Emily Hartley, Kobus Barnard

Abstract: We develop a comprehensive Bayesian generative model for understanding indoor scenes. While it is common in this domain to approximate objects with 3D bounding boxes, we propose using strong representations with finer granularity. For example, we model a chair as a set of four legs, a seat and a backrest. We find that modeling detailed geometry improves recognition and reconstruction, and enables more refined use of appearance for scene understanding. We demonstrate this with a new likelihood function that rewards 3D object hypotheses whose 2D projection is more uniform in color distribution. Such a measure would be confused by background pixels if we used a bounding box to represent a concave object like a chair. Complex objects are modeled using a set of re-usable 3D parts, and we show that this representation captures much of the variation among object instances with relatively few parameters. We also designed specific data-driven inference mechanisms for each part that are shared by all objects containing that part, which helps make inference transparent to the modeler. Further, we show how to exploit contextual relationships to detect more objects, by, for example, proposing chairs around and underneath tables. We present results showing the benefits of each of these innovations. The performance of our approach often exceeds that of state-of-the-art methods on the two tasks of room layout estimation and object recognition, as evaluated on two benchmark data sets used in this domain.

2 0.87360692 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

Author: Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, Silvio Savarese

Abstract: Visual scene understanding is a difficult problem interleaving object detection, geometric reasoning and scene classification. We present a hierarchical scene model for learning and reasoning about complex indoor scenes which is computationally tractable, can be learned from a reasonable amount of training data, and avoids oversimplification. At the core of this approach is the 3D Geometric Phrase Model which captures the semantic and geometric relationships between objects which frequently co-occur in the same 3D spatial configuration. Experiments show that this model effectively explains scene semantics, geometry and object groupings from a single image, while also improving individual object detections.

3 0.85873514 381 cvpr-2013-Scene Parsing by Integrating Function, Geometry and Appearance Models

Author: Yibiao Zhao, Song-Chun Zhu

Abstract: Indoor functional objects exhibit large view and appearance variations, and are thus difficult to recognize with the traditional appearance-based classification paradigm. In this paper, we present an algorithm to parse indoor images based on two observations: i) The functionality is the most essential property to define an indoor object, e.g. "a chair to sit on"; ii) The geometry (3D shape) of an object is designed to serve its function. We formulate the nature of the object function into a stochastic grammar model. This model characterizes a joint distribution over the function-geometry-appearance (FGA) hierarchy. The hierarchical structure includes a scene category, functional groups, functional objects, functional parts and 3D geometric shapes. We use a simulated annealing MCMC algorithm to find the maximum a posteriori (MAP) solution, i.e. a parse tree. We design four data-driven steps to accelerate the search in the FGA space: i) group the line segments into 3D primitive shapes, ii) assign functional labels to these 3D primitive shapes, iii) fill in missing objects/parts according to the functional labels, and iv) synthesize 2D segmentation maps and verify the current parse tree by the Metropolis-Hastings acceptance probability. The experimental results on several challenging indoor datasets demonstrate that the proposed approach not only significantly widens the scope of indoor scene parsing algorithms from segmentation and 3D recovery to functional object recognition, but also yields improved overall performance.

4 0.82287878 1 cvpr-2013-3D-Based Reasoning with Blocks, Support, and Stability

Author: Zhaoyin Jia, Andrew Gallagher, Ashutosh Saxena, Tsuhan Chen

Abstract: 3D volumetric reasoning is important for truly understanding a scene. Humans are able to both segment each object in an image, and perceive a rich 3D interpretation of the scene, e.g., the space an object occupies, which objects support other objects, and which objects would, if moved, cause other objects to fall. We propose a new approach for parsing RGB-D images using 3D block units for volumetric reasoning. The algorithm fits image segments with 3D blocks, and iteratively evaluates the scene based on block interaction properties. We produce a 3D representation of the scene based on jointly optimizing over segmentations, block fitting, supporting relations, and object stability. Our algorithm incorporates the intuition that a good 3D representation of the scene is the one that fits the data well, and is a stable, self-supporting (i.e., one that does not topple) arrangement of objects. We experiment on several datasets including controlled and real indoor scenarios. Results show that our stability-reasoning framework improves RGB-D segmentation and scene volumetric representation.

5 0.76372117 61 cvpr-2013-Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics

Author: Bo Zheng, Yibiao Zhao, Joey C. Yu, Katsushi Ikeuchi, Song-Chun Zhu

Abstract: In this paper, we present an approach for scene understanding by reasoning physical stability of objects from point cloud. We utilize a simple observation that, by human design, objects in static scenes should be stable with respect to gravity. This assumption is applicable to all scene categories and poses useful constraints for the plausible interpretations (parses) in scene understanding. Our method consists of two major steps: 1) geometric reasoning: recovering solid 3D volumetric primitives from defective point cloud; and 2) physical reasoning: grouping the unstable primitives to physically stable objects by optimizing the stability and the scene prior. We propose to use a novel disconnectivity graph (DG) to represent the energy landscape and use a Swendsen-Wang Cut (MCMC) method for optimization. In experiments, we demonstrate that the algorithm achieves substantially better performance for i) object segmentation, ii) 3D volumetric recovery of the scene, and iii) better parsing result for scene understanding in comparison to state-of-the-art methods in both public dataset and our own new dataset.

6 0.72529697 197 cvpr-2013-Hallucinated Humans as the Hidden Context for Labeling 3D Scenes

7 0.69034451 16 cvpr-2013-A Linear Approach to Matching Cuboids in RGBD Images

8 0.68709731 278 cvpr-2013-Manhattan Junction Catalogue for Spatial Reasoning of Indoor Scenes

9 0.68008709 30 cvpr-2013-Accurate Localization of 3D Objects from RGB-D Data Using Segmentation Hypotheses

10 0.66748452 247 cvpr-2013-Learning Class-to-Image Distance with Object Matchings

11 0.62859303 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision

12 0.60460049 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection

13 0.60439032 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection

14 0.60393828 364 cvpr-2013-Robust Object Co-detection

15 0.58207381 256 cvpr-2013-Learning Structured Hough Voting for Joint Object Detection and Occlusion Reasoning

16 0.58056575 417 cvpr-2013-Subcategory-Aware Object Classification

17 0.57889318 440 cvpr-2013-Tracking People and Their Objects

18 0.57328123 136 cvpr-2013-Discriminatively Trained And-Or Tree Models for Object Detection

19 0.57014787 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

20 0.56999481 67 cvpr-2013-Blocks That Shout: Distinctive Parts for Scene Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.119), (16, 0.022), (26, 0.067), (28, 0.011), (33, 0.223), (39, 0.028), (55, 0.225), (67, 0.052), (69, 0.099), (87, 0.078)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.84361136 159 cvpr-2013-Expressive Visual Text-to-Speech Using Active Appearance Models

Author: Robert Anderson, Björn Stenger, Vincent Wan, Roberto Cipolla

Abstract: This paper presents a complete system for expressive visual text-to-speech (VTTS), which is capable of producing expressive output, in the form of a ‘talking head’, given an input text and a set of continuous expression weights. The face is modeled using an active appearance model (AAM), and several extensions are proposed which make it more applicable to the task of VTTS. The model allows for normalization with respect to both pose and blink state which significantly reduces artifacts in the resulting synthesized sequences. We demonstrate quantitative improvements in terms of reconstruction error over a million frames, as well as in large-scale user studies, comparing the output of different systems.

same-paper 2 0.83641958 445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models

Author: Luca Del_Pero, Joshua Bowdish, Bonnie Kermgard, Emily Hartley, Kobus Barnard

Abstract: We develop a comprehensive Bayesian generative model for understanding indoor scenes. While it is common in this domain to approximate objects with 3D bounding boxes, we propose using strong representations with finer granularity. For example, we model a chair as a set of four legs, a seat and a backrest. We find that modeling detailed geometry improves recognition and reconstruction, and enables more refined use of appearance for scene understanding. We demonstrate this with a new likelihood function that rewards 3D object hypotheses whose 2D projection is more uniform in color distribution. Such a measure would be confused by background pixels if we used a bounding box to represent a concave object like a chair. Complex objects are modeled using a set of re-usable 3D parts, and we show that this representation captures much of the variation among object instances with relatively few parameters. We also designed specific data-driven inference mechanisms for each part that are shared by all objects containing that part, which helps make inference transparent to the modeler. Further, we show how to exploit contextual relationships to detect more objects, by, for example, proposing chairs around and underneath tables. We present results showing the benefits of each of these innovations. The performance of our approach often exceeds that of state-of-the-art methods on the two tasks of room layout estimation and object recognition, as evaluated on two benchmark data sets used in this domain.

3 0.83434618 26 cvpr-2013-A Statistical Model for Recreational Trails in Aerial Images

Author: Andrew Predoehl, Scott Morris, Kobus Barnard

Abstract: (no abstract available)

4 0.78651494 420 cvpr-2013-Supervised Descent Method and Its Applications to Face Alignment

Author: Xuehan Xiong, Fernando De_la_Torre

Abstract: Many computer vision problems (e.g., camera calibration, image alignment, structure from motion) are solved through a nonlinear optimization method. It is generally accepted that 2nd order descent methods are the most robust, fast and reliable approaches for nonlinear optimization of a general smooth function. However, in the context of computer vision, 2nd order descent methods have two main drawbacks: (1) The function might not be analytically differentiable and numerical approximations are impractical. (2) The Hessian might be large and not positive definite. To address these issues, this paper proposes a Supervised Descent Method (SDM) for minimizing a Non-linear Least Squares (NLS) function. During training, the SDM learns a sequence of descent directions that minimizes the mean of NLS functions sampled at different points. In testing, SDM minimizes the NLS objective using the learned descent directions without computing the Jacobian nor the Hessian. We illustrate the benefits of our approach in synthetic and real examples, and show how SDM achieves state-of-the-art performance in the problem of facial feature detection. The code is available at www.humansensing.cs.cmu.edu/intraface.

5 0.77502245 381 cvpr-2013-Scene Parsing by Integrating Function, Geometry and Appearance Models

Author: Yibiao Zhao, Song-Chun Zhu

Abstract: Indoor functional objects exhibit large view and appearance variations, and are thus difficult to recognize with the traditional appearance-based classification paradigm. In this paper, we present an algorithm to parse indoor images based on two observations: i) The functionality is the most essential property to define an indoor object, e.g. "a chair to sit on"; ii) The geometry (3D shape) of an object is designed to serve its function. We formulate the nature of the object function into a stochastic grammar model. This model characterizes a joint distribution over the function-geometry-appearance (FGA) hierarchy. The hierarchical structure includes a scene category, functional groups, functional objects, functional parts and 3D geometric shapes. We use a simulated annealing MCMC algorithm to find the maximum a posteriori (MAP) solution, i.e. a parse tree. We design four data-driven steps to accelerate the search in the FGA space: i) group the line segments into 3D primitive shapes, ii) assign functional labels to these 3D primitive shapes, iii) fill in missing objects/parts according to the functional labels, and iv) synthesize 2D segmentation maps and verify the current parse tree by the Metropolis-Hastings acceptance probability. The experimental results on several challenging indoor datasets demonstrate that the proposed approach not only significantly widens the scope of indoor scene parsing algorithms from segmentation and 3D recovery to functional object recognition, but also yields improved overall performance.

6 0.76757389 61 cvpr-2013-Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics

7 0.76486015 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

8 0.76448429 172 cvpr-2013-Finding Group Interactions in Social Clutter

9 0.76230806 231 cvpr-2013-Joint Detection, Tracking and Mapping by Semantic Bundle Adjustment

10 0.76104861 292 cvpr-2013-Multi-agent Event Detection: Localization and Role Assignment

11 0.76048523 86 cvpr-2013-Composite Statistical Inference for Semantic Segmentation

12 0.7588284 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities

13 0.7573598 331 cvpr-2013-Physically Plausible 3D Scene Tracking: The Single Actor Hypothesis

14 0.75710559 30 cvpr-2013-Accurate Localization of 3D Objects from RGB-D Data Using Segmentation Hypotheses

15 0.75689662 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

16 0.75625736 311 cvpr-2013-Occlusion Patterns for Object Class Detection

17 0.7560941 19 cvpr-2013-A Minimum Error Vanishing Point Detection Approach for Uncalibrated Monocular Images of Man-Made Environments

18 0.75489312 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection

19 0.75389123 372 cvpr-2013-SLAM++: Simultaneous Localisation and Mapping at the Level of Objects

20 0.75354791 1 cvpr-2013-3D-Based Reasoning with Blocks, Support, and Stability