iccv iccv2013 iccv2013-410 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ruiqi Guo, Derek Hoiem
Abstract: In this paper, we present an approach to predict the extent and height of supporting surfaces such as tables, chairs, and cabinet tops from a single RGBD image. We define support surfaces to be horizontal, planar surfaces that can physically support objects and humans. Given an RGBD image, our goal is to localize the height and full extent of such surfaces in 3D space. To achieve this, we created a labeling tool and annotated 1449 images with rich, complete 3D scene models in the NYU dataset. We extract ground truth from the annotated dataset and develop a pipeline for predicting floor space, walls, and the height and full extent of support surfaces. Finally, we match the predicted extent with annotated training scenes and transfer the support surface configuration from those scenes. We evaluate the proposed approach on our dataset and demonstrate its effectiveness in understanding scenes in 3D space.
Reference: text
sentIndex sentText sentNum sentScore
1 Support surface prediction for indoor scene understanding. Abstract: In this paper, we present an approach to predict the extent and height of supporting surfaces such as tables, chairs, and cabinet tops from a single RGBD image. [sent-1, score-1.657]
2 We define support surfaces to be horizontal, planar surfaces that can physically support objects and humans. [sent-2, score-1.528]
3 Given an RGBD image, our goal is to localize the height and full extent of such surfaces in 3D space. [sent-3, score-0.787]
4 We extract ground truth from the annotated dataset and develop a pipeline for predicting floor space, walls, and the height and full extent of support surfaces. [sent-5, score-1.058]
5 Finally, we match the predicted extent with annotated training scenes and transfer the support surface configuration from those scenes. [sent-6, score-0.9]
6 Introduction Knowledge of support surfaces is crucial to understand or interact with a scene. [sent-9, score-0.73]
7 Our goal is to infer the heights and extents of support surfaces in the scene from a single RGBD image. [sent-11, score-1.02]
8 The main challenge is that support surfaces have complex multi-layer structures and that much of the scene is hidden from view (Fig. [sent-12, score-0.905]
9 Often, surfaces such as tables are littered with objects, which limits the effectiveness of simple plane-fitting strategies. [sent-14, score-0.586]
10 Some support surfaces are not visible at all because they are above eye level or obstructed by other objects. [sent-15, score-0.81]
11 Our approach is to label visible portions of the scene, project these labels into an overhead view, and refine estimates and infer occluded portions based on scene priors and context. [sent-16, score-0.795]
12 We undertook an extensive effort to create full 3D models that correspond to scenes in the NYU (v2) dataset [14], which we expect will be of interest to many. (Figure caption fragment: "... extent of all support surfaces, including occluded portions, from one RGBD image.") [sent-19, score-0.683]
13 As shown on the left, these surfaces are often covered with objects and are nearly invisible when they occur near eye-level. [sent-20, score-0.48]
14 Our annotations complement the existing detailed 2D object and support relation annotations and can be used for experiments on inferring free space, support surfaces, and 3D object layout. [sent-23, score-0.856]
15 When initially viewing a scene, we do not know how many surfaces there are or at which heights. [sent-26, score-0.431]
16 Our solution is to infer a set of overhead support maps that indicate the extent of supports at various heights on the floor plan. [sent-27, score-1.3]
17 We need to specify the heights and extents of support surfaces, which is difficult due to clutter and occlusion. [sent-28, score-0.499]
18 Our approach is analogous to that of Guo and Hoiem [3] who first label an RGB image according to the visible surfaces at each pixel and then infer occluded background labels. [sent-29, score-0.607]
19 [14] and then project these pixels into an overhead view using the depth signal. [sent-31, score-0.556]
20 Before projection, the scene is rotated so that walls and floor are axis-aligned, using the code from [14]. [sent-32, score-0.462]
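As a concrete illustration of this projection step, the minimal Python sketch below bins rotated 3D points into an overhead label grid. The function name, cell size, and grid size are illustrative assumptions, not values from the paper.

    import numpy as np

    def project_to_overhead(points, labels, R, cell=0.05, grid_size=200):
        # points: N x 3 camera-frame 3D points (meters); labels: N integer labels;
        # R: rotation aligning walls and floor with the axes.  After rotation,
        # y is up and the x-z plane is the floor.
        pts = points @ R.T
        ix = np.floor(pts[:, 0] / cell).astype(int) + grid_size // 2
        iz = np.floor(pts[:, 2] / cell).astype(int) + grid_size // 2
        grid = np.zeros((grid_size, grid_size), dtype=int)
        ok = (ix >= 0) & (ix < grid_size) & (iz >= 0) & (iz < grid_size)
        grid[iz[ok], ix[ok]] = labels[ok]        # last label written wins within a cell
        return grid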
21 We then predict which heights are likely to contain a support surface based on the normals of visible surfaces and straight lines in the image. [sent-33, score-1.219]
22 One height could contain multiple surfaces such as table and counter tops or the seats of several chairs. [sent-34, score-0.687]
23 For each height, we then predict the extent of support on the floor map using a variety of 2D and 3D features. [sent-35, score-0.846]
24 These estimates are refined using an autocontext [16] approach and shape priors incorporated by matching whole surfaces from the training set, similarly to [3]. [sent-36, score-0.635]
25 Our experiments investigate the accuracy of our estimates of support height and extent, the effects of occlusion due to foreground objects or surfaces above eye-level, and the effectiveness of our contextual inference and shape matching. [sent-37, score-1.042]
26 However, the depth signal does not trivialize the problem, since many support surfaces are obscured by clutter or completely hidden. [sent-44, score-0.797]
27 We differ in that we predict the full 3D extent of support surfaces, and we extend their dataset with full 3D annotations of objects using a tool that enables users to mark up scenes based on both image data and point clouds from the depth sensor. [sent-50, score-0.926]
28 We adopt their basic pipeline of labeling visible portions of the scene and inferring occluded portions based on autocontext and shape priors. [sent-52, score-0.623]
29 infer support in outdoor scenes from RGB images to aid in geometric labeling [4], and Silberman et al. [sent-56, score-0.494]
30 Our work differs in its aim to infer the full 3D extent of support surfaces. [sent-60, score-0.559]
31 Our work on estimating underlying support surfaces would facilitate object placement and allow more complex interactions. [sent-63, score-0.73]
32 Taylor and Cowley [15] estimate the layout of walls from an RGBD image based on plane fitting and inference in an overhead view but do not address the problem of support surfaces. [sent-64, score-1.138]
33 We differ in that we are provided with only one viewpoint, and our goal is to recover extent of underlying support surfaces rather than a full model of visible objects and structures. [sent-68, score-1.039]
34 they are modeled as line segments in the overhead view and polygons in the horizontal view. [sent-82, score-0.582]
35 Similarly, ceilings and floors are line segments in the horizontal view and polygons in the overhead view. [sent-83, score-0.677]
36 For example, chairs and sofas cannot be modeled using cubic blocks, and their support surfaces are often not the top part. [sent-86, score-0.779]
37 The support surface of each object is labeled by hand on the 3D model. [sent-89, score-0.457]
38 Preprocessing The RGBD scenes are first preprocessed to facilitate annotation as well as support surface inference. [sent-97, score-0.589]
39 Then the surface normal is computed at each pixel by fitting a local planar surface in its neighborhood. [sent-100, score-0.662]
40 These local planar surfaces take into account color information as well, in a procedure similar to bilateral filtering, which improves robustness of plane fits compared to using only depth information. [sent-101, score-0.601]
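The paper does not give the exact fitting procedure, but a color-and-depth-weighted local plane fit in this spirit might look like the sketch below; the window size and bandwidth parameters are illustrative.

    import numpy as np

    def bilateral_normal(points, colors, v, u, win=5, sigma_d=0.05, sigma_c=20.0):
        # points, colors: H x W x 3 arrays (3D points in meters, RGB values).
        # Fits a plane to the (2*win+1)^2 neighborhood of pixel (v, u) with
        # weights that fall off with depth and color difference, in the spirit
        # of bilateral filtering.  Assumes the window stays inside the image.
        p0 = points[v, u]
        c0 = colors[v, u].astype(float)
        nb_p = points[v - win:v + win + 1, u - win:u + win + 1].reshape(-1, 3)
        nb_c = colors[v - win:v + win + 1, u - win:u + win + 1].reshape(-1, 3).astype(float)
        w = np.exp(-np.sum((nb_p - p0) ** 2, axis=1) / (2 * sigma_d ** 2)) \
          * np.exp(-np.sum((nb_c - c0) ** 2, axis=1) / (2 * sigma_c ** 2))
        centered = nb_p - np.average(nb_p, axis=0, weights=w)
        cov = (centered * w[:, None]).T @ centered
        normal = np.linalg.eigh(cov)[1][:, 0]    # eigenvector of the smallest eigenvalue
        return normal if normal[1] >= 0 else -normal   # orient toward +y (up)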
41 Annotation procedure If the floor is visible, our system estimates the floor height from the depth information and object labels (from [14]). [sent-108, score-0.795]
42 If necessary, the annotator can correct or specify the floor height by clicking on a scene point and indicating its height above the floor. [sent-109, score-0.84]
43 The annotator then alternates between labeling in an overhead view (from above the scene looking down at the floor) and horizontal view (from the camera looking at the most frontal plane) to model the scene: 1. [sent-110, score-0.849]
44 In the overhead view, the annotator is shown highlighted 3D points and an estimated bounding box that correspond to the object. [sent-113, score-0.463]
45 The horizontal view is then shown, and the annotator specifies the vertical height of the object by drawing a line segment at the object’s height. [sent-116, score-0.539]
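For illustration, an annotated object of this kind could be stored as an extruded polygon: a footprint drawn in the overhead view plus a vertical extent drawn in the horizontal view. The field names below are hypothetical, not the authors' actual annotation format.

    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    @dataclass
    class ExtrudedObject:
        # One annotated object: a polygon footprint in the overhead view
        # plus a vertical extent from the horizontal view.
        footprint: List[Tuple[float, float]]     # (x, z) vertices on the floor plane, meters
        y_bottom: float                          # base height above the floor
        y_top: float                             # top height, from the drawn line segment
        support_height: Optional[float] = None   # hand-labeled support surface height, if any

        def height(self) -> float:
            return self.y_top - self.y_bottom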
46 4) to predict the vertical height and horizontal extent of support surfaces in a scene. [sent-129, score-1.313]
47 Scene parsing in an overhead view is very different from parsing in the image plane. [sent-130, score-0.483]
48 On the other hand, since the room directions have been rectified, objects tend to have rectangular shapes in an overhead view. [sent-132, score-0.46]
49 We design features that apply to the overhead view and use spatial context to improve parsing results. [sent-133, score-0.51]
50 We then project the labeled pixels into the overhead view using the associated depth signal. [sent-138, score-0.556]
51 All voxels with depth values smaller than the observed scene surface are observed free space, and voxels with larger depth values are not directly observed. [sent-144, score-0.488]
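A simple way to compute such an observed free-space labeling is to project each voxel center into the depth image and test whether it lies in front of the measured surface. The sketch below makes that concrete under assumed conventions (camera-frame voxel centers, pinhole intrinsics K).

    import numpy as np

    def carve_free_space(depth, K, voxel_centers):
        # depth: H x W depth map (meters); K: 3x3 pinhole intrinsics;
        # voxel_centers: N x 3 voxel centers in the camera frame.
        # Returns True for voxels that were seen through (observed free space);
        # everything else is occupied or unobserved.
        H, W = depth.shape
        free = np.zeros(len(voxel_centers), dtype=bool)
        front = voxel_centers[:, 2] > 0.1                # ignore voxels behind / at the camera
        proj = (K @ voxel_centers[front].T).T
        u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
        v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
        inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        seen = np.zeros(inside.shape, dtype=bool)
        seen[inside] = voxel_centers[front, 2][inside] < depth[v[inside], u[inside]]
        free[np.flatnonzero(front)] = seen
        return free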
52 Observed 3D points with upward normals are indicative of support surfaces at that height. [sent-151, score-0.835]
53 The edge map is the set of detected straight line segments projected onto the overhead view. [sent-155, score-0.412]
54 Voxel occupancy is the number of observed free-space voxels near the surface height. [sent-158, score-0.455]
55 If a voxel is observed to be free space, it cannot be a part of a support surface. [sent-159, score-0.494]
56 Volumetric difference is the number of free-space voxels above the height at this location minus the number of free-space voxels below it. [sent-161, score-0.553]
57 Support surfaces often have more air above them than below. [sent-162, score-0.403]
58 Location prior is the spatial prior of where support surfaces are in the training scenes, normalized by the scale of the scene. [sent-164, score-0.821]
59 Viewpoint prior is the spatial prior of support surfaces in training scenes with respect to the viewer. [sent-166, score-0.909]
60 Support height prior is the spatial distribution of the vertical height of support planes. [sent-168, score-0.837]
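To make the volumetric difference and voxel occupancy features concrete, a sketch over an assumed X x Z x Y boolean free-space grid follows; the axis layout and the window around the plane height are assumptions for illustration.

    import numpy as np

    def volumetric_features(free_space, h, near=2):
        # free_space: X x Z x Y boolean grid of observed free-space voxels,
        # with the last axis indexing height bins; h: height bin of a candidate
        # support plane.  Returns two overhead maps: the volumetric difference
        # (free voxels above minus free voxels below h) and the free-space
        # occupancy within a few bins of the plane height.
        diff = free_space[:, :, h + 1:].sum(axis=2) - free_space[:, :, :h].sum(axis=2)
        occupancy = free_space[:, :, max(h - near, 0):h + near + 1].sum(axis=2)
        return diff, occupancy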
61 Predicting support height We first predict the support heights. [sent-171, score-0.915]
62 Our intuition is that at the height of a support plane, we are likely to see: (1) observed surfaces with upward normals; (2) a difference in the voxel occupancy above and below the plane; and (3) observed 3D points near the height. [sent-172, score-1.183]
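A simplified stand-in for this height prediction, using only the first cue (points with upward normals), is sketched below; the paper's actual classifier combines all three cues, and the bin size and thresholds here are illustrative.

    import numpy as np

    def candidate_support_heights(points, normals, bin_size=0.03,
                                  up_thresh=0.9, min_count=500):
        # points, normals: N x 3 arrays in the rectified frame (y is up).
        # Score each height bin by the number of observed points whose normals
        # point upward, and keep bins that are local maxima above a count threshold.
        up = normals[:, 1] > up_thresh
        if not np.any(up):
            return []
        bins = np.floor(points[up, 1] / bin_size).astype(int)
        lo = bins.min()
        counts = np.bincount(bins - lo)
        peaks = [i for i in range(1, len(counts) - 1)
                 if counts[i] >= min_count
                 and counts[i] >= counts[i - 1] and counts[i] >= counts[i + 1]]
        return [(i + lo + 0.5) * bin_size for i in peaks]   # candidate heights in meters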
63 These features are projected into an overhead view and used to estimate the locations of walls (or floor free space). [sent-185, score-0.949]
64 The horizontal extent of a supporting surface is then estimated for each support plane based on the features at each point and surrounding predictions. [sent-187, score-0.852]
65 In the rightmost image, green pixels are estimated walls, blue pixels are floor, and red pixels are support surfaces, with lighter colors corresponding to greater height. [sent-189, score-0.426]
66 Volumetric difference, occupancy and location/view prior are directly computed on the overhead grid. [sent-193, score-0.444]
67 Autocontext inference is applied at each predicted support height. [sent-194, score-0.429]
68 To predict support extent, we divide the overhead view into a grid. For each cell of the floor plane, all features are aggregated over a 3D vertical scene column and classified as “floor” or “wall” using a linear SVM. [sent-200, score-1.217]
69 Likewise, for each support plane, we classify the grid cells in the overhead view at that height as being part of a support surface or not. [sent-201, score-1.471]
70 We use the following set of features: (1) observed points pointing up; (2) volumetric difference at each grid cell (3 levels); (3) projected surface map near the predicted support height; (4) view-dependent and view-independent spatial priors in the overhead view; and (5) floor space prediction. [sent-202, score-1.415]
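A minimal sketch of this per-cell classification using scikit-learn's LinearSVC follows; the paper only specifies a linear SVM, so the library choice and the data layout are assumptions.

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_extent_classifier(feature_maps, gt_extents):
        # feature_maps: list of (F, X, Z) arrays, one per (training scene, support height);
        # gt_extents: matching list of (X, Z) binary support-extent maps.
        # Stacks per-cell feature vectors and fits a linear SVM.
        X = np.concatenate([f.reshape(f.shape[0], -1).T for f in feature_maps], axis=0)
        y = np.concatenate([g.ravel() for g in gt_extents])
        clf = LinearSVC(C=1.0)
        clf.fit(X, y)
        return clf

    def score_extent(clf, feature_map):
        # Returns a per-cell support score (SVM decision value) in overhead layout.
        flat = feature_map.reshape(feature_map.shape[0], -1).T
        return clf.decision_function(flat).reshape(feature_map.shape[1:])

At test time, the per-cell scores can be thresholded or passed on to the contextual refinement described next.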
71 Integrating spatial context The support surfaces have spatial contexts. [sent-205, score-0.784]
72 We first compute features at each cell of the overhead view as in the previous subsection and then apply autocontext. [sent-211, score-0.846]
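One simple way to realize the autocontext step is to append the current support probability, pooled over windows of several radii, to the base features and retrain the classifier; the sketch below follows that idea, with illustrative radii.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def autocontext_features(base_features, prob_map, radii=(3, 7, 15)):
        # base_features: (F, X, Z) per-cell features; prob_map: (X, Z) current
        # support probability.  Appends the probability pooled over square
        # windows of several radii, so a retrained classifier can use the
        # surrounding predictions as context.
        pooled = [uniform_filter(prob_map, size=2 * r + 1) for r in radii]
        return np.concatenate([base_features, np.stack(pooled)], axis=0)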
73 Shape prior transfer using template matching Although spatial context helps to fill in occluded parts of the overhead view, it does not fully exploit shape priors, such as objects often being rectangular. [sent-216, score-0.812]
74 We further harvest shape priors by matching the support extent probability map against support surface layout templates from training scenes. [sent-217, score-1.238]
75 We assign a positive score to locations that are support surfaces on both the prediction and the template; we assign a negative score to penalize locations where the template overlaps the free space in the probability map. [sent-219, score-0.975]
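This scoring can be implemented as a pair of cross-correlations over the overhead grid, as in the hedged sketch below; the weights are illustrative, and mode='valid' assumes the template is no larger than the map.

    import numpy as np
    from scipy.signal import correlate2d

    def template_match_score(prob_map, free_map, template, w_pos=1.0, w_neg=2.0):
        # prob_map: (X, Z) predicted support probability; free_map: (X, Z) binary
        # observed free space; template: binary support layout from a training scene.
        # Rewards overlap of the template with high support probability and
        # penalizes overlap with observed free space, for every valid offset.
        pos = correlate2d(prob_map, template.astype(float), mode='valid')
        neg = correlate2d(free_map.astype(float), template.astype(float), mode='valid')
        return w_pos * pos - w_neg * neg

Retrieving the best-scoring templates per support height, as described later, then amounts to taking the maximum of each template's score map and keeping the top matches.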
76 For scenes with complicated support surface layout, it is often the case that there is not a single perfect match, but there exist multiple good partial matches. [sent-225, score-0.545]
77 Experiments To verify the effectiveness of our proposed method, we evaluate our support surface prediction by comparing it to the ground truth support surfaces extracted from our annotated dataset. [sent-232, score-1.333]
78 Ground truth support surfaces are defined to be the top part of an object from the “supporter” category unless they are within the 0. [sent-239, score-0.76]
79 For prediction, we project the depth points to the overhead view and quantize the projected area into a grid with a spacing of 0. [sent-241, score-0.659]
80 The ground truth floor space is the union of the annotated area and the observed area, minus the area occluded by walls from the viewpoint. [sent-243, score-0.68]
81 The ground truth support plane is just the top surface of an object or, in the case of the SketchUp model objects, the annotated functional surface (e. [sent-245, score-0.772]
82 To make evaluation less sensitive to noise in localization, we make the area around the boundary of a support surface within a thickness of 0. [sent-248, score-0.489]
83 For training the support height classifier, we scan the vertical extent of the scene with a spacing of 0. [sent-254, score-0.886]
84 To do template matching, we first aggregate the support surface configurations from training scenes, and obtain a total of 1372 templates. [sent-271, score-0.559]
85 For the extent probability map we predict at each support height, we retrieve the top 10 templates with the highest matching scores. [sent-272, score-0.62]
86 In Fig. 6, we display the visualization of support surface prediction in the overhead view. [sent-278, score-0.877]
87 Although the scenes are cluttered and challenging, we are able to predict most of the support surface extents even if they are severely occluded (the kitchen counter in the first image on the top left) or partly out of view (the chair in the first image on the top right). [sent-279, score-0.96]
88 Furthermore, the transferred support planes can give us a rough estimation of individual support objects. [sent-280, score-0.705]
89 However, there are cases where the support surfaces are hard to find because they are not obvious or not directly observed. [sent-281, score-0.73]
90 Quantitative results We evaluate the accuracy of support extent prediction with precision-recall curves. [sent-286, score-0.58]
91 The ground truth consists of a set of known support pixels at some floor position and vertical height and a set of “don’t care” pixels at the edges of annotated surfaces or out of the field of view. [sent-287, score-1.398]
92 A prediction within 0.15m of the height of a ground truth support pixel is labeled positive (or “don’t care” if that is the ground truth label). [sent-291, score-0.583]
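The evaluation with "don't care" pixels can be summarized by the following sketch, which sweeps a threshold over the per-cell scores while excluding "don't care" cells from both precision and recall; the encoding of the ground truth map is an assumption.

    import numpy as np

    def precision_recall(score_map, gt_map, thresholds):
        # gt_map: 1 = support, 0 = not support, -1 = "don't care" (ignored).
        care = gt_map >= 0
        scores, positive = score_map[care], gt_map[care] == 1
        curve = []
        for t in thresholds:
            pred = scores >= t
            tp = np.sum(pred & positive)
            precision = tp / max(pred.sum(), 1)
            recall = tp / max(positive.sum(), 1)
            curve.append((precision, recall))
        return curve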
93 [14] code for plane segmentation and selecting approximately horizontal surfaces as support surfaces. [sent-298, score-0.909]
94 In Fig. 7(a), we see that our method outperforms the baseline by 12% precision at the same recall level (Figure 7: precision-recall curves of support surface extent prediction). [sent-301, score-0.645]
95 In Fig. 7(b), we compare performance for occluded support surfaces to unoccluded (visible) ones. [sent-306, score-0.81]
96 Conclusions We propose 3D annotations for the NYU v2 dataset and an algorithm to find support planes and determine their extent in an overhead view. [sent-310, score-0.983]
97 Our quantitative and qualitative results show that our prediction is accurate for nearby and visible support surfaces, but surfaces that are distant, or near or above eye-level, still present a major challenge. [sent-311, score-0.941]
98 Methods to improve detection of support planes above eye-level (which are not directly visible) and to recognize objects and use categorical information are two fruitful directions for future work. [sent-313, score-0.419]
99 Green, blue, and red areas are estimated walls, floor, and support surfaces, respectively. [sent-449, score-0.996]
100 The brighter colors of support surfaces indicate higher vertical heights relative to the floor. [sent-450, score-0.909]
wordName wordTfidf (topN-words)
[('surfaces', 0.403), ('overhead', 0.355), ('support', 0.327), ('floor', 0.266), ('height', 0.196), ('extent', 0.188), ('autocontext', 0.154), ('rgbd', 0.143), ('surface', 0.13), ('walls', 0.122), ('heights', 0.12), ('silberman', 0.115), ('annotator', 0.108), ('plane', 0.104), ('template', 0.102), ('view', 0.101), ('sketchup', 0.099), ('nyu', 0.096), ('scenes', 0.088), ('voxels', 0.084), ('portions', 0.081), ('visible', 0.08), ('occluded', 0.08), ('wall', 0.078), ('free', 0.078), ('indoor', 0.077), ('hoiem', 0.075), ('horizontal', 0.075), ('cabinet', 0.074), ('scene', 0.074), ('depth', 0.067), ('prediction', 0.065), ('predict', 0.065), ('normals', 0.064), ('room', 0.064), ('annotations', 0.062), ('guo', 0.059), ('care', 0.059), ('vertical', 0.059), ('occupancy', 0.057), ('ceiling', 0.057), ('voxel', 0.055), ('tops', 0.055), ('tool', 0.054), ('layout', 0.054), ('chair', 0.054), ('furniture', 0.052), ('extents', 0.052), ('planes', 0.051), ('annotated', 0.051), ('polygons', 0.051), ('extruded', 0.049), ('openings', 0.049), ('rematch', 0.049), ('seat', 0.049), ('chairs', 0.049), ('annotation', 0.044), ('hedau', 0.044), ('satkin', 0.044), ('blocked', 0.044), ('infer', 0.044), ('volumetric', 0.042), ('spacing', 0.042), ('objects', 0.041), ('upward', 0.041), ('matching', 0.04), ('floors', 0.038), ('shape', 0.038), ('fitting', 0.038), ('contexts', 0.037), ('inference', 0.037), ('desk', 0.036), ('near', 0.036), ('cell', 0.035), ('grid', 0.035), ('labeling', 0.035), ('users', 0.034), ('observed', 0.034), ('pixels', 0.033), ('polygon', 0.033), ('subtracted', 0.033), ('counter', 0.033), ('area', 0.032), ('prior', 0.032), ('qualitative', 0.03), ('tables', 0.03), ('straight', 0.03), ('relations', 0.03), ('partly', 0.03), ('truth', 0.03), ('understanding', 0.029), ('supporting', 0.028), ('preprocessing', 0.028), ('transfer', 0.028), ('viewing', 0.028), ('spatial', 0.027), ('projected', 0.027), ('planar', 0.027), ('scope', 0.027), ('parsing', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 410 iccv-2013-Support Surface Prediction in Indoor Scenes
Author: Ruiqi Guo, Derek Hoiem
Abstract: In this paper, we present an approach to predict the extent and height of supporting surfaces such as tables, chairs, and cabinet tops from a single RGBD image. We define support surfaces to be horizontal, planar surfaces that can physically support objects and humans. Given an RGBD image, our goal is to localize the height and full extent of such surfaces in 3D space. To achieve this, we created a labeling tool and annotated 1449 images with rich, complete 3D scene models in the NYU dataset. We extract ground truth from the annotated dataset and develop a pipeline for predicting floor space, walls, and the height and full extent of support surfaces. Finally, we match the predicted extent with annotated training scenes and transfer the support surface configuration from those scenes. We evaluate the proposed approach on our dataset and demonstrate its effectiveness in understanding scenes in 3D space.
2 0.21822372 2 iccv-2013-3D Scene Understanding by Voxel-CRF
Author: Byung-Soo Kim, Pushmeet Kohli, Silvio Savarese
Abstract: Scene understanding is an important yet very challenging problem in computer vision. In the past few years, researchers have taken advantage of the recent diffusion of depth-RGB (RGB-D) cameras to help simplify the problem of inferring scene semantics. However, while the added 3D geometry is certainly useful to segment out objects with different depth values, it also adds complications in that the 3D geometry is often incorrect because of noisy depth measurements and the actual 3D extent of the objects is usually unknown because of occlusions. In this paper we propose a new method that allows us to jointly refine the 3D reconstruction of the scene (raw depth values) while accurately segmenting out the objects or scene elements from the 3D reconstruction. This is achieved by introducing a new model which we called Voxel-CRF. The Voxel-CRF model is based on the idea of constructing a conditional random field over a 3D volume of interest which captures the semantic and 3D geometric relationships among different elements (voxels) of the scene. Such model allows to jointly estimate (1) a dense voxel-based 3D reconstruction and (2) the semantic labels associated with each voxel even in presence of partial occlusions using an approximate yet efficient inference strategy. We evaluated our method on the challenging NYU Depth dataset (Version 1 and 2). Experimental results show that our method achieves competitive accuracy in inferring scene semantics and visually appealing results in improving the quality of the 3D reconstruction. We also demonstrate an interesting application of object removal and scene completion from RGB-D images.
3 0.19612083 1 iccv-2013-3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding
Author: Scott Satkin, Martial Hebert
Abstract: We present a new algorithm 3DNN (3D Nearest-Neighbor), which is capable of matching an image with 3D data, independently of the viewpoint from which the image was captured. By leveraging rich annotations associated with each image, our algorithm can automatically produce precise and detailed 3D models of a scene from a single image. Moreover, we can transfer information across images to accurately label and segment objects in a scene. The true benefit of 3DNN compared to a traditional 2D nearest-neighbor approach is that by generalizing across viewpoints, we free ourselves from the need to have training examples captured from all possible viewpoints. Thus, we are able to achieve comparable results using orders of magnitude less data, and recognize objects from never-before-seen viewpoints. In this work, we describe the 3DNN algorithm and rigorously evaluate its performance for the tasks of geometry estimation and object detection/segmentation. By decoupling the viewpoint and the geometry of an image, we develop a scene matching approach which is truly 100% viewpoint invariant, yielding state-of-the-art performance on challenging data.
4 0.18154447 144 iccv-2013-Estimating the 3D Layout of Indoor Scenes and Its Clutter from Depth Sensors
Author: Jian Zhang, Chen Kan, Alexander G. Schwing, Raquel Urtasun
Abstract: In this paper we propose an approach to jointly estimate the layout of rooms as well as the clutter present in the scene using RGB-D data. Towards this goal, we propose an effective model that is able to exploit both depth and appearance features, which are complementary. Furthermore, our approach is efficient as we exploit the inherent decomposition of additive potentials. We demonstrate the effectiveness of our approach on the challenging NYU v2 dataset and show that employing depth reduces the layout error by 6% and the clutter estimation by 13%.
5 0.16120425 201 iccv-2013-Holistic Scene Understanding for 3D Object Detection with RGBD Cameras
Author: Dahua Lin, Sanja Fidler, Raquel Urtasun
Abstract: In this paper, we tackle the problem of indoor scene understanding using RGBD data. Towards this goal, we propose a holistic approach that exploits 2D segmentation, 3D geometry, as well as contextual relations between scenes and objects. Specifically, we extend the CPMC [3] framework to 3D in order to generate candidate cuboids, and develop a conditional random field to integrate information from different sources to classify the cuboids. With this formulation, scene classification and 3D object recognition are coupled and can be jointly solved through probabilistic inference. We test the effectiveness of our approach on the challenging NYU v2 dataset. The experimental results demonstrate that through effective evidence integration and holistic reasoning, our approach achieves substantial improvement over the state-of-the-art.
6 0.16039595 79 iccv-2013-Coherent Object Detection with 3D Geometric Context from a Single Image
7 0.15763541 319 iccv-2013-Point-Based 3D Reconstruction of Thin Objects
8 0.15398115 367 iccv-2013-SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels
9 0.1454123 343 iccv-2013-Real-World Normal Map Capture for Nearly Flat Reflective Surfaces
10 0.14300214 64 iccv-2013-Box in the Box: Joint 3D Layout and Object Reasoning from Single Images
11 0.13673584 132 iccv-2013-Efficient 3D Scene Labeling Using Fields of Trees
12 0.13521416 424 iccv-2013-Tracking Revisited Using RGBD Camera: Unified Benchmark and Baselines
13 0.1244153 102 iccv-2013-Data-Driven 3D Primitives for Single Image Understanding
14 0.11616784 281 iccv-2013-Multi-view Normal Field Integration for 3D Reconstruction of Mirroring Objects
15 0.1094335 9 iccv-2013-A Flexible Scene Representation for 3D Reconstruction Using an RGB-D Camera
16 0.10652144 228 iccv-2013-Large-Scale Multi-resolution Surface Reconstruction from RGB-D Sequences
17 0.10286738 56 iccv-2013-Automatic Registration of RGB-D Scans via Salient Directions
18 0.10056254 111 iccv-2013-Detecting Dynamic Objects with Multi-view Background Subtraction
19 0.099238053 366 iccv-2013-STAR3D: Simultaneous Tracking and Reconstruction of 3D Objects Using RGB-D Data
20 0.098656312 128 iccv-2013-Dynamic Probabilistic Volumetric Models
topicId topicWeight
[(0, 0.211), (1, -0.148), (2, -0.019), (3, 0.001), (4, 0.088), (5, -0.029), (6, -0.045), (7, -0.118), (8, -0.075), (9, -0.074), (10, 0.039), (11, 0.067), (12, -0.081), (13, 0.07), (14, 0.033), (15, -0.131), (16, -0.035), (17, 0.05), (18, -0.035), (19, -0.125), (20, -0.157), (21, 0.028), (22, 0.139), (23, -0.026), (24, 0.03), (25, -0.058), (26, -0.015), (27, 0.076), (28, -0.013), (29, 0.012), (30, -0.035), (31, -0.014), (32, 0.03), (33, -0.04), (34, -0.029), (35, 0.041), (36, -0.044), (37, -0.013), (38, 0.061), (39, -0.023), (40, -0.047), (41, -0.012), (42, -0.004), (43, 0.004), (44, 0.085), (45, 0.005), (46, -0.067), (47, -0.006), (48, -0.01), (49, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.97139847 410 iccv-2013-Support Surface Prediction in Indoor Scenes
Author: Ruiqi Guo, Derek Hoiem
Abstract: In this paper, we present an approach to predict the extent and height of supporting surfaces such as tables, chairs, and cabinet tops from a single RGBD image. We define support surfaces to be horizontal, planar surfaces that can physically support objects and humans. Given an RGBD image, our goal is to localize the height and full extent of such surfaces in 3D space. To achieve this, we created a labeling tool and annotated 1449 images with rich, complete 3D scene models in the NYU dataset. We extract ground truth from the annotated dataset and develop a pipeline for predicting floor space, walls, and the height and full extent of support surfaces. Finally, we match the predicted extent with annotated training scenes and transfer the support surface configuration from those scenes. We evaluate the proposed approach on our dataset and demonstrate its effectiveness in understanding scenes in 3D space.
2 0.84452492 2 iccv-2013-3D Scene Understanding by Voxel-CRF
Author: Byung-Soo Kim, Pushmeet Kohli, Silvio Savarese
Abstract: Scene understanding is an important yet very challenging problem in computer vision. In the past few years, researchers have taken advantage of the recent diffusion of depth-RGB (RGB-D) cameras to help simplify the problem of inferring scene semantics. However, while the added 3D geometry is certainly useful to segment out objects with different depth values, it also adds complications in that the 3D geometry is often incorrect because of noisy depth measurements and the actual 3D extent of the objects is usually unknown because of occlusions. In this paper we propose a new method that allows us to jointly refine the 3D reconstruction of the scene (raw depth values) while accurately segmenting out the objects or scene elements from the 3D reconstruction. This is achieved by introducing a new model which we called Voxel-CRF. The Voxel-CRF model is based on the idea of constructing a conditional random field over a 3D volume of interest which captures the semantic and 3D geometric relationships among different elements (voxels) of the scene. Such model allows to jointly estimate (1) a dense voxel-based 3D reconstruction and (2) the semantic labels associated with each voxel even in presence of partial occlusions using an approximate yet efficient inference strategy. We evaluated our method on the challenging NYU Depth dataset (Version 1 and 2). Experimental results show that our method achieves competitive accuracy in inferring scene semantics and visually appealing results in improving the quality of the 3D reconstruction. We also demonstrate an interesting application of object removal and scene completion from RGB-D images.
3 0.83932048 1 iccv-2013-3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding
Author: Scott Satkin, Martial Hebert
Abstract: We present a new algorithm 3DNN (3D Nearest-Neighbor), which is capable of matching an image with 3D data, independently of the viewpoint from which the image was captured. By leveraging rich annotations associated with each image, our algorithm can automatically produce precise and detailed 3D models of a scene from a single image. Moreover, we can transfer information across images to accurately label and segment objects in a scene. The true benefit of 3DNN compared to a traditional 2D nearest-neighbor approach is that by generalizing across viewpoints, we free ourselves from the need to have training examples captured from all possible viewpoints. Thus, we are able to achieve comparable results using orders of magnitude less data, and recognize objects from never-before-seen viewpoints. In this work, we describe the 3DNN algorithm and rigorously evaluate its performance for the tasks of geometry estimation and object detection/segmentation. By decoupling the viewpoint and the geometry of an image, we develop a scene matching approach which is truly 100% viewpoint invariant, yielding state-of-the-art performance on challenging data.
4 0.77371699 102 iccv-2013-Data-Driven 3D Primitives for Single Image Understanding
Author: David F. Fouhey, Abhinav Gupta, Martial Hebert
Abstract: What primitives should we use to infer the rich 3D world behind an image? We argue that these primitives should be both visually discriminative and geometrically informative and we present a technique for discovering such primitives. We demonstrate the utility of our primitives by using them to infer 3D surface normals given a single image. Our technique substantially outperforms the state-of-the-art and shows improved cross-dataset performance.
5 0.76542085 79 iccv-2013-Coherent Object Detection with 3D Geometric Context from a Single Image
Author: Jiyan Pan, Takeo Kanade
Abstract: Objects in a real world image cannot have arbitrary appearance, sizes and locations due to geometric constraints in 3D space. Such a 3D geometric context plays an important role in resolving visual ambiguities and achieving coherent object detection. In this paper, we develop a RANSAC-CRF framework to detect objects that are geometrically coherent in the 3D world. Different from existing methods, we propose a novel generalized RANSAC algorithm to generate global 3D geometry hypotheses from local entities such that outlier suppression and noise reduction is achieved simultaneously. In addition, we evaluate those hypotheses using a CRF which considers both the compatibility of individual objects under global 3D geometric context and the compatibility between adjacent objects under local 3D geometric context. Experiment results show that our approach compares favorably with the state of the art.
6 0.73367518 201 iccv-2013-Holistic Scene Understanding for 3D Object Detection with RGBD Cameras
7 0.71143669 132 iccv-2013-Efficient 3D Scene Labeling Using Fields of Trees
8 0.7023046 9 iccv-2013-A Flexible Scene Representation for 3D Reconstruction Using an RGB-D Camera
9 0.69220173 64 iccv-2013-Box in the Box: Joint 3D Layout and Object Reasoning from Single Images
10 0.6822232 144 iccv-2013-Estimating the 3D Layout of Indoor Scenes and Its Clutter from Depth Sensors
11 0.65440923 319 iccv-2013-Point-Based 3D Reconstruction of Thin Objects
12 0.6326378 228 iccv-2013-Large-Scale Multi-resolution Surface Reconstruction from RGB-D Sequences
13 0.62817955 56 iccv-2013-Automatic Registration of RGB-D Scans via Salient Directions
14 0.62753373 281 iccv-2013-Multi-view Normal Field Integration for 3D Reconstruction of Mirroring Objects
15 0.61435372 375 iccv-2013-Scene Collaging: Analysis and Synthesis of Natural Images with Semantic Layers
16 0.5914157 367 iccv-2013-SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels
17 0.58974326 250 iccv-2013-Lifting 3D Manhattan Lines from a Single Image
18 0.57802832 387 iccv-2013-Shape Anchors for Data-Driven Multi-view Reconstruction
19 0.57316571 343 iccv-2013-Real-World Normal Map Capture for Nearly Flat Reflective Surfaces
20 0.57096225 139 iccv-2013-Elastic Fragments for Dense Scene Reconstruction
topicId topicWeight
[(2, 0.054), (7, 0.012), (12, 0.034), (26, 0.086), (31, 0.052), (42, 0.08), (49, 0.13), (55, 0.024), (64, 0.074), (73, 0.043), (89, 0.285), (98, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.93936777 410 iccv-2013-Support Surface Prediction in Indoor Scenes
Author: Ruiqi Guo, Derek Hoiem
Abstract: In this paper, we present an approach to predict the extent and height of supporting surfaces such as tables, chairs, and cabinet tops from a single RGBD image. We define support surfaces to be horizontal, planar surfaces that can physically support objects and humans. Given an RGBD image, our goal is to localize the height and full extent of such surfaces in 3D space. To achieve this, we created a labeling tool and annotated 1449 images with rich, complete 3D scene models in the NYU dataset. We extract ground truth from the annotated dataset and develop a pipeline for predicting floor space, walls, and the height and full extent of support surfaces. Finally, we match the predicted extent with annotated training scenes and transfer the support surface configuration from those scenes. We evaluate the proposed approach on our dataset and demonstrate its effectiveness in understanding scenes in 3D space.
2 0.93143773 130 iccv-2013-Dynamic Structured Model Selection
Author: David Weiss, Benjamin Sapp, Ben Taskar
Abstract: In many cases, the predictive power of structured models for complex vision tasks is limited by a trade-off between the expressiveness and the computational tractability of the model. However, choosing this trade-off statically a priori is suboptimal, as images and videos in different settings vary tremendously in complexity. On the other hand, choosing the trade-off dynamically requires knowledge about the accuracy of different structured models on any given example. In this work, we propose a novel two-tier architecture that provides dynamic speed/accuracy trade-offs through a simple type of introspection. Our approach, which we call dynamic structured model selection (DMS), leverages typically intractable features in structured learning problems in order to automatically determine which of several models should be used at test-time in order to maximize accuracy under a fixed budgetary constraint. We demonstrate DMS on two sequential modeling vision tasks, and we establish a new state-of-the-art in human pose estimation in video with an implementation that is roughly 23× faster than the previous standard implementation.
3 0.91427034 317 iccv-2013-Piecewise Rigid Scene Flow
Author: Christoph Vogel, Konrad Schindler, Stefan Roth
Abstract: Estimating dense 3D scene flow from stereo sequences remains a challenging task, despite much progress in both classical disparity and 2D optical flow estimation. To overcome the limitations of existing techniques, we introduce a novel model that represents the dynamic 3D scene by a collection of planar, rigidly moving, local segments. Scene flow estimation then amounts to jointly estimating the pixel-to-segment assignment, and the 3D position, normal vector, and rigid motion parameters of a plane for each segment. The proposed energy combines an occlusion-sensitive data term with appropriate shape, motion, and segmentation regularizers. Optimization proceeds in two stages: Starting from an initial superpixelization, we estimate the shape and motion parameters of all segments by assigning a proposal from a set of moving planes. Then the pixel-to-segment assignment is updated, while holding the shape and motion parameters of the moving planes fixed. We demonstrate the benefits of our model on different real-world image sets, including the challenging KITTI benchmark. We achieve leading performance levels, exceeding competing 3D scene flow methods, and even yielding better 2D motion estimates than all tested dedicated optical flow techniques.
4 0.91239744 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition
Author: Mohamed R. Amer, Sinisa Todorovic, Alan Fern, Song-Chun Zhu
Abstract: This paper presents an efficient approach to video parsing. Our videos show a number of co-occurring individual and group activities. To address challenges of the domain, we use an expressive spatiotemporal AND-OR graph (ST-AOG) that jointly models activity parts, their spatiotemporal relations, and context, as well as enables multitarget tracking. The standard ST-AOG inference is prohibitively expensive in our setting, since it would require running a multitude of detectors, and tracking their detections in a long video footage. This problem is addressed by formulating a cost-sensitive inference of ST-AOG as Monte Carlo Tree Search (MCTS). For querying an activity in the video, MCTS optimally schedules a sequence of detectors and trackers to be run, and where they should be applied in the space-time volume. Evaluation on the benchmark datasets demonstrates that MCTS enables two-magnitude speed-ups without compromising accuracy relative to the standard cost-insensitive inference.
5 0.912027 433 iccv-2013-Understanding High-Level Semantics by Modeling Traffic Patterns
Author: Hongyi Zhang, Andreas Geiger, Raquel Urtasun
Abstract: In this paper, we are interested in understanding the semantics of outdoor scenes in the context of autonomous driving. Towards this goal, we propose a generative model of 3D urban scenes which is able to reason not only about the geometry and objects present in the scene, but also about the high-level semantics in the form of traffic patterns. We found that a small number of patterns is sufficient to model the vast majority of traffic scenes and show how these patterns can be learned. As evidenced by our experiments, this high-level reasoning significantly improves the overall scene estimation as well as the vehicle-to-lane association when compared to state-of-the-art approaches [10].
6 0.91171575 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
7 0.91160738 129 iccv-2013-Dynamic Scene Deblurring
8 0.91061032 144 iccv-2013-Estimating the 3D Layout of Indoor Scenes and Its Clutter from Depth Sensors
9 0.91025555 89 iccv-2013-Constructing Adaptive Complex Cells for Robust Visual Tracking
10 0.90993178 226 iccv-2013-Joint Subspace Stabilization for Stereoscopic Video
11 0.90882385 78 iccv-2013-Coherent Motion Segmentation in Moving Camera Videos Using Optical Flow Orientations
12 0.90801311 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints
13 0.90726274 387 iccv-2013-Shape Anchors for Data-Driven Multi-view Reconstruction
14 0.90717578 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition
15 0.9071641 190 iccv-2013-Handling Occlusions with Franken-Classifiers
16 0.90683311 299 iccv-2013-Online Video SEEDS for Temporal Window Objectness
17 0.90638256 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary
18 0.90628582 143 iccv-2013-Estimating Human Pose with Flowing Puppets
19 0.90626812 196 iccv-2013-Hierarchical Data-Driven Descent for Efficient Optimal Deformation Estimation
20 0.90625119 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction