iccv iccv2013 iccv2013-2 knowledge-graph by maker-knowledge-mining

2 iccv-2013-3D Scene Understanding by Voxel-CRF


Source: pdf

Author: Byung-Soo Kim, Pushmeet Kohli, Silvio Savarese

Abstract: Scene understanding is an important yet very challenging problem in computer vision. In the past few years, researchers have taken advantage of the recent diffusion of depth-RGB (RGB-D) cameras to help simplify the problem of inferring scene semantics. However, while the added 3D geometry is certainly useful to segment out objects with different depth values, it also adds complications in that the 3D geometry is often incorrect because of noisy depth measurements and the actual 3D extent of the objects is usually unknown because of occlusions. In this paper we propose a new method that allows us to jointly refine the 3D reconstruction of the scene (raw depth values) while accurately segmenting out the objects or scene elements from the 3D reconstruction. This is achieved by introducing a new model which we called Voxel-CRF. The Voxel-CRF model is based on the idea of constructing a conditional random field over a 3D volume of interest which captures the semantic and 3D geometric relationships among different elements (voxels) of the scene. Such a model allows us to jointly estimate (1) a dense voxel-based 3D reconstruction and (2) the semantic labels associated with each voxel even in the presence of partial occlusions, using an approximate yet efficient inference strategy. We evaluated our method on the challenging NYU Depth dataset (Versions 1 and 2). Experimental results show that our method achieves competitive accuracy in inferring scene semantics and visually appealing results in improving the quality of the 3D reconstruction. We also demonstrate an interesting application of object removal and scene completion from RGB-D images.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper we propose a new method that allows us to jointly refine the 3D reconstruction of the scene (raw depth values) while accurately segmenting out the objects or scene elements from the 3D reconstruction. [sent-7, score-0.371]

2 The Voxel-CRF model is based on the idea of constructing a conditional random field over a 3D volume of interest which captures the semantic and 3D geometric relationships among different elements (voxels) of the scene. [sent-9, score-0.395]

3 Such a model allows us to jointly estimate (1) a dense voxel-based 3D reconstruction and (2) the semantic labels associated with each voxel even in the presence of partial occlusions, using an approximate yet efficient inference strategy. [sent-10, score-0.778]

4 Introduction Understanding the geometric and semantic structure of a scene (scene understanding) is a critical problem in various research fields including computer vision, robotics, and augmented reality. [sent-15, score-0.425]

5 Several methods have been proposed to solve the problem of scene understanding using a single RGB (2D) image. [Figure caption: the Voxel-CRF (V-CRF) model jointly estimates (1) a dense voxel-based 3D reconstruction of the scene and (2) the semantic labels associated with each voxel.] [sent-22, score-0.582]

6 Instead of labeling local 2D image regions, these methods provide a semantic description of 3D elements (point clouds) acquired by an RGB-D camera [13]. [sent-30, score-0.351]

7 In this work, we propose a method to jointly estimate the semantic and geometric structure of a scene given a single RGB-D image. [sent-34, score-0.424]

8 We jointly estimate both the semantic labeling [Figure 2: (a) reconstructed point cloud taken from the corner of the room (top view); (b) camera-visible vs. incomplete regions] [sent-36, score-0.385]

9 and 3D geometry of voxels of the scene given a noisy set of inputs. [sent-41, score-0.859]

10 This allows us to i) correct noisy geometric estimates in the input data and ii) provide an interpretation of non-visible geometric elements (such as the wall occluded by the table in Fig. [sent-42, score-0.484]

11 Our method is based on a voxel conditional random field model which we call Voxel-CRF (V-CRF). [sent-44, score-0.403]

12 In our V-CRF model, each node represents a voxel in the space of interest. [sent-45, score-0.403]

13 A voxel may or may not include one or multiple points acquired by the RGB-D sensor. [sent-46, score-0.403]

14 The state of each voxel is summarized by two variables, occupancy and semantic label. [sent-47, score-0.686]

15 An auxiliary visibility variable is introduced to help relate voxels and the 2D RGB or depth observations (Sec. [sent-48, score-0.847]

16 The configuration of variables in the V-CRF model needs to be consistent with certain important geometric and semantic rules that ensure stable and more accurate 3D reconstruction and classification of the elements in the scene. [sent-51, score-0.384]

17 Geometric and semantic relationships based on higher-level elements such as certain groups of voxels which belong to the same plane (or object) are encoded using interactions between groups of voxels. [sent-55, score-1.122]

18 These relationships are especially useful for consistent labeling of voxels in an occluded space (Sec. [sent-56, score-0.953]

19 Instead of assuming that the true 3D geometry is given, we jointly estimate the geometric and semantic structure of the scene by finding the best configuration of all occupancy and semantic label variables of all voxels in the space. [sent-61, score-1.449]

20 Our inference algorithm iterates between 1) deciding voxels to be associated with observations and 2) reasoning about the geometric and semantic description of voxels. [sent-62, score-1.084]

21 1) We propose a new voxel-based model for 3D scene understanding with RGB-D data that jointly infers the geometric and semantic structure of the scene (Sec. [sent-65, score-0.997]

22 5) We demonstrate (through qualitative and quantitative results and comparisons on benchmarks) that V-CRF produces accurate geometric and semantic scene understanding results (Sec. [sent-72, score-0.475]

23 However, our model can produce a more fine-grained labelling of geometric and semantic structure, which is important for cluttered scenes. [sent-82, score-0.371]

24 Our work is also closely related to [19, 20] in the use of a random field model for joint semantic labeling and geometric interpretation. [sent-88, score-0.422]

25 [19] encouraged consistent semantic and geometric labelling of pixels by penalizing sudden changes in depth or semantic labeling results. [sent-89, score-0.772]

26 The problem of labelling occluded regions is also discussed in [21], where relative relationships between objects and background are used to infer labels of the occluded region. [sent-93, score-0.345]

27 However, the lack of a voxel representation restricts [21] to reconstruction of the foreground and background layers. [sent-94, score-0.445]

28 Figure 3: Ambiguity of assigning image observations to the voxels in a view ray. [sent-96, score-0.712]

29 Five voxels with a green outline are the ground-truth voxels in the correct place. [sent-97, score-1.392]

30 (a) For the successful cases, the voxel can be reconstructed from the depth data. [sent-98, score-0.536]

31 (b) Unfortunately, due to noisy depth data, incorrect voxels are reconstructed in many cases. [sent-99, score-0.849]

32 We represent the semantic and geometric structure of the scene with a 3D lattice where each cell of the lattice is a voxel. [sent-102, score-0.463]

33 e.g., voxels on the same 3D plane, or voxels that are believed to belong to an object (through an object detection bounding box). [sent-105, score-0.72]

34 The state of each voxel i is described with a structured label comprising its occupancy and semantic class (defined below). [sent-106, score-0.403]

35 The first variable, the occupancy oi, indicates whether voxel i is empty (oi = 0) or occupied (oi = 1). [sent-110, score-0.58]

36 The second variable si indicates the index of the semantic class the voxel belongs to. [sent-111, score-0.671]

37 That is, si ∈ {1, · · · , |S|} if the voxel is occupied (oi = 1), or si = ø if oi = 0, where |S| is the number of semantic classes. [sent-113, score-0.883]

38 The variable vi encodes the visibility of a voxel i, where vi = 1 and vi = 0 indicate that the voxel is visible or non-visible, respectively. [sent-128, score-1.276]
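A minimal sketch (not the authors' code) of how the per-voxel state described above could be represented; the class and field names are hypothetical, and the consistency check simply encodes the rule that an empty voxel carries the null semantic label:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VoxelState:
    occupied: bool            # o_i: True if occupied (o_i = 1), False if empty (o_i = 0)
    semantic: Optional[int]   # s_i in {1, ..., |S|} if occupied, None (the null label) otherwise
    visible: bool = False     # v_i: filled in during inference from the occupancies along the ray

    def is_consistent(self) -> bool:
        # An occupied voxel must carry a valid class index; an empty one must not.
        return (self.semantic is not None) if self.occupied else (self.semantic is None)
```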

39 Due to the high amount of noise in the RGB-D sensor, it is difficult to unambiguously assign 2D observations (image appearance and texture) to voxels in 3D space (see Fig. [sent-130, score-0.685]

40 Provided that we know which single voxel is visible on the viewing-ray, we can assign the 2D image observation to the corresponding voxel. [sent-133, score-0.519]

41 In contrast, the V-CRF model is more flexible by having oi and vi as random variables, and this enables richer scene interpretation by i) estimating occluded regions. [sent-136, score-0.432]

42 The energy function can be written as a sum of potential functions defined over individual voxels, pairs of voxels, and groups of voxels: E(V, L, O) = Σi φi(li) + Σi,j φij(li, lj) + Σc φc(lc). (1) [sent-143, score-0.735]

43 Here i and j are indices of voxels and c is the index of a higher-order clique in the graph. [sent-153, score-0.706]
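As a rough illustration of Eq. (1), the sketch below sums user-supplied unary, pairwise, and higher-order potentials over a labeling; the function names and signatures are assumptions, not the authors' implementation:

```python
def total_energy(labels, unary, edges, pairwise, cliques, higher_order):
    """labels: dict voxel_id -> (o_i, s_i); edges: list of (i, j) voxel pairs;
    cliques: list of voxel-id tuples, e.g. all voxels on one detected plane."""
    e = sum(unary(i, labels[i]) for i in labels)                            # sum_i  phi_i
    e += sum(pairwise(i, j, labels[i], labels[j]) for (i, j) in edges)      # sum_ij phi_ij
    e += sum(higher_order(c, [labels[i] for i in c]) for c in cliques)      # sum_c  phi_c
    return e
```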

44 The first term models the observation cost for individual voxels, while the second and third terms model semantic and geometric consistency among pairs and groups of voxels, respectively. [sent-154, score-0.364]

45 We model the term for two different cases, when voxel i is occupied (oi = 1) and when it is empty (oi = 0). [sent-159, score-0.58]

46 When the voxel i is occupied (oi = 1), the unary term (Eq. 4) is composed of three terms. [sent-164, score-0.58]

47 The larger the disparity (modeled with a Gaussian N(0, σ)) between the depth map value drm(i), which is the value associated with the ray r(i) for a voxel i, and the voxel i's depth di, the more likely the voxel is to be labeled as empty. [sent-167, score-1.17]

48 The third term models the occupancy based on density of 3D points in a voxel i. [sent-168, score-0.489]

49 We measure the ratio |Pi|/|Pimax|, the ratio between the number of detected 3D points in the cubic volume associated with a voxel i and the maximum number of 3D points possible in voxel i, i.e. [sent-170, score-0.806]

50 , the number of rays penetrating through a voxel i. [sent-172, score-0.403]

51 If there is an object at voxel i, and the surface is perpendicular to the camera ray, the number of points is the largest. [sent-173, score-0.452]

52 In the case the voxel i is empty (oi = 0), the energy models the sensitivity of the sensor (first term) and the density of point clouds (second term). [sent-177, score-0.605]
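A hedged sketch of the unary term as described above: a depth-disparity term modeled with a zero-mean Gaussian N(0, σ) and a point-density term |Pi|/|Pimax| for the occupied case, and a cheap cost for the empty case when few points fall in the voxel and the observed surface is far away. The weights, the exact functional forms, and the omitted appearance term are assumptions:

```python
import math

def unary_cost(occupied, depth_map_value, voxel_depth, n_points, n_rays,
               sigma=0.05, w_depth=1.0, w_density=1.0):
    density = n_points / max(n_rays, 1)               # |P_i| / |P_i^max|
    disparity = depth_map_value - voxel_depth         # d_rm(i) - d_i
    depth_mismatch = 0.5 * (disparity / sigma) ** 2   # -log of N(0, sigma), up to a constant
    if occupied:
        # Occupied is expensive when the voxel disagrees with the depth map
        # or when it contains few of the points it could contain.
        return w_depth * depth_mismatch + w_density * (1.0 - density)
    # Empty is expensive when the observed surface lies in this voxel
    # or when many 3D points fall inside it.
    return w_depth * math.exp(-depth_mismatch) + w_density * density
```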

53 Relating Pairs of Voxels The pairwise energy terms penalize labellings of pairs of voxels that are geometrically or semantically inconsistent. [sent-182, score-0.814]

54 Two different types of neighborhoods are considered to define pairwise relationships between voxels: i) voxels that are adjacent in the 3D lattice structure, and ii) voxels that are adjacent in the 2D projection. [sent-183, score-1.594]

55 The appearance (color) is represented by cij, which is a discretized color difference between voxels i and j, similar to [22]. [sent-187, score-0.741]

56 In this case the cost is a function of the color of the visible voxel i. [sent-190, score-0.519]

57 The pairwise cost on the labelling of voxels also depends on their visibility and is defined case by case, e.g., φp(vi = 1, ·). [sent-191, score-0.861]

58 The exact penalty for inconsistent assignments depends on the relative spatial location sij and colors cij of the voxel pairs. [sent-210, score-0.556]
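A minimal contrast-sensitive Potts-style sketch of such a pairwise cost: disagreeing labels are penalized, but less so across strong appearance changes, in the spirit of the cij term; the exact form, the handling of sij, and the parameter values are assumptions:

```python
import math

def pairwise_cost(label_i, label_j, color_i, color_j, w=1.0, beta=0.5):
    if label_i == label_j:
        return 0.0
    # Color difference c_ij between voxels i and j (a discretized version is used in the paper).
    c_ij = math.sqrt(sum((a - b) ** 2 for a, b in zip(color_i, color_j)))
    return w * math.exp(-beta * c_ij)   # cheaper to place a label boundary where appearance changes
```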

59 In addition to adjacent voxels in 3D, the adjacency between a pair of voxels in the projected 2D image is also formulated as a pairwise cost. [sent-217, score-1.455]

60 For example, occlusion boundaries are useful cues to distinguish voxels that belong to different objects; if two voxels are across a detected occlusion boundary (when projected in the view of the camera), they are likely to have different semantic labels. [sent-218, score-1.652]

61 On the other hand, if two voxels across the boundary are still close in 3D, they are likely to have the same semantic label. [sent-219, score-0.882]

62 The relationships between voxels are automatically indexed as follows. [sent-220, score-0.685]

63 From the training data, we collect the relative surface features between voxels i and j1 and cluster them to represent different types of corners (1the surface feature for adjacent regions i and j is composed of surface normal, color, and height). [sent-223, score-0.956]

64 (b) A group of voxels associated with the detected planar surface (top) and a group of voxels associated with the convex hull (bottom). [sent-226, score-1.531]

65 The voxels in the convex hull not only enforce consistency for visible voxels, but also for occluded voxels. [sent-227, score-0.94]

66 (c) V-CRF result: our model not only allows the labeling of visible voxels for the TV (top), but also the labeling of the occluded region corresponding to the 'wall'. [sent-228, score-1.133]

67 For visibility, we removed the voxels corresponding to the TV. [sent-229, score-0.685]

68 The potentials encode semantic and geometric consistency among voxels in a clique c ∈ VC, whose voxels can be quite far from each other. [sent-237, score-1.675]

69 The relationships for a group of voxels can be represented using the Robust Potts model [1]. [sent-238, score-0.738]
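For concreteness, a small sketch of a Robust Potts (P^n-style) higher-order term over one clique of voxels, e.g. all voxels assigned to a detected plane: the cost grows with the number of voxels that disagree with the clique's dominant label and saturates at gamma_max. The truncation fraction and the weight are assumptions:

```python
from collections import Counter

def robust_potts(clique_labels, gamma_max=1.0, truncation=0.3):
    if not clique_labels:
        return 0.0
    n = len(clique_labels)
    _, dominant_count = Counter(clique_labels).most_common(1)[0]
    n_disagree = n - dominant_count            # voxels disagreeing with the dominant label
    q = max(truncation * n, 1.0)               # outlier count at which the penalty saturates
    return gamma_max * min(n_disagree / q, 1.0)
```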

70 Groups could be formed via, e.g., surface detection, object detection, or room layout estimation; however, in this work, we consider two types of voxel groups VC: 1) 3D surfaces that are detected using a Hough-voting-based method [23] and 2) categorical object detections [24], as follows2. [sent-241, score-0.49]

71 The first type is the group of voxels that belong to a 3D surface (walls, tables, etc.). [sent-243, score-0.769]

72 First, a surface is likely to belong to an object or a facet of the indoor room layout, and there is consistency among labels of voxels for a detected plane. [sent-245, score-0.899]

73 Object detection methods provide a cue to define groups of voxels (bounding boxes) that take the same label, as used for 2D scene understanding in [25, 4]; we grouped the set of visible voxels which fall inside the detected 2D bounding boxes (2room layout estimation is not used due to heavy clutter in the evaluated dataset). [sent-250, score-1.694]

74 We use the detector proposed in [24] to find 2D object bounding boxes and then find the corresponding voxels in 3D to form a clique. [sent-255, score-0.685]

75 Relating Voxels in a Camera Ray: the V-CRF model enforces that there is only one visible voxel for each ray from the camera. [sent-258, score-0.568]

76 Σi∈cr vi = 1, (9) where cr is the set of indices of voxels in a single ray. [sent-262, score-0.685]

77 Efficiency of the inference step is a key requirement for us as V-CRF is defined over a voxel space which can be much larger than the number of pixels in the image. [sent-277, score-0.452]

78 In the t-th iteration, we estimate the value of the visibility variables Vt from Lt−1 by finding the first occupied voxel in each ray from the camera. [sent-279, score-0.61]
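A rough sketch of this alternation, under the assumption that a generic CRF solver (stand-in name solve_crf) is available for re-estimating occupancy and semantics once the visibilities are fixed; the first occupied voxel along each ray is marked visible, matching the one-visible-voxel-per-ray constraint:

```python
def visibility_from_occupancy(rays, occupied):
    """rays: list of voxel-id lists, each ordered from the camera outward."""
    visible = set()
    for ray in rays:
        for vid in ray:
            if occupied[vid]:
                visible.add(vid)     # the first occupied voxel on the ray is the visible one
                break
    return visible

def iterate_inference(rays, labels, solve_crf, n_iters=3):
    # labels: dict voxel_id -> (o_i, s_i); solve_crf is a hypothetical solver.
    for _ in range(n_iters):
        occupied = {i: o for i, (o, s) in labels.items()}
        visible = visibility_from_occupancy(rays, occupied)
        labels = solve_crf(labels, visible)   # V_t from L_{t-1}, then L_t given V_t
    return labels
```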

79 For the appearance term P(si |O) for visible voxels in Sec. [sent-305, score-0.801]

80 We find groups of voxels composing 3D surfaces using an off-the-shelf plane detector [23], which detects a number of planes from point clouds by Hough voting in a parameterized space. [sent-309, score-0.854]

81 To build the V-CRF model, the 3D space of interest is divided into voxels of size (4cm)3 for testing. [sent-316, score-0.685]

82 For training, the space is divided into (8cm)3 voxels for efficiency. [sent-317, score-0.685]
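A small illustrative sketch of binning an RGB-D point cloud into such a regular voxel grid (4 cm at test time, 8 cm for training); this is not the authors' implementation, just one straightforward way to do the voxelization:

```python
import numpy as np

def voxelize(points_xyz, voxel_size=0.04):
    """points_xyz: (N, 3) array in metres; returns dict voxel index -> list of point indices."""
    idx = np.floor(points_xyz / voxel_size).astype(int)
    bins = {}
    for n, key in enumerate(map(tuple, idx)):
        bins.setdefault(key, []).append(n)
    return bins

# Example: bins = voxelize(np.random.rand(1000, 3) * 4.0)  # a toy 4m x 4m x 4m volume
```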

83 Given (a) an RGB image and (b) a depth map, (c) the reconstructed 3D geometry (top view) suffers from noise and may not produce realistic scene understanding results. [sent-403, score-0.36]

84 Even with the error due to the reflection of the mirror in the third example, V-CRF is capable of reconstructing realistic scenes along with accurate semantic labeling results. [sent-410, score-0.34]

85 We draw a grid to visualize voxels from the top view for the first example only. [sent-411, score-0.712]

86 We visualize joint geometric and semantic scene understanding results from its top view. [sent-415, score-0.475]

87 Clearly, as the number of iterations increases, both geometry estimation accuracy and semantic labeling accuracy are improved, as highlighted with blue circles and green circles, respectively. [sent-418, score-0.417]

88 Metric 1 (Top-view): The proposed framework solves semantic and geometric scene understanding jointly. [sent-422, score-0.475]

89 5 (d), where free space and object occupancy as well as semantic labeling can be evaluated. [sent-427, score-0.432]

90 Note that our model reduces reconstruction errors in the depth map as well as improving semantic understanding, compared against a benchmark method. [sent-455, score-0.41]

91 Still, our method achieves the highest accuracy for both geometry estimation and semantic labeling tasks. [sent-507, score-0.4]

92 3 show the performance of geometry estimation and semantic labeling from the top view, respectively. [sent-524, score-0.371]

93 This is not possible in most conventional augmented reality methods [31], where one can put a new object in a scene (5this number is equivalent to 2D semantic labeling accuracy 76.) [sent-538, score-0.49]

94 2D super-pixel-based evaluation cannot address the accuracy of 3D scene labeling and tends to penalize less for inaccurate labeling for distant 3D regions. [sent-540, score-0.343]

95 (d) Associated voxels for the detected 'television' are removed. [sent-544, score-0.708]

96 Note that the region behind the TV is labeled as wall by modeling energy terms for pairwise voxels and planes. [sent-545, score-0.879]

97 (e) As an augmented reality application, the TV is removed and voxels are colored with the same color as the adjacent voxels with label 'wall'. [sent-546, score-1.51]

98 but cannot remove the existing objects, since this requires a model to i) identify semantic and geometric properties of the objects, and ii) estimate the occluded region behind the object. [sent-550, score-0.437]

99 Note that the occluded region behind the TV is reconstructed using pairwise relationships among voxels as discussed in Sec. [sent-554, score-0.951]

100 Conclusion: We have presented the V-CRF model for jointly solving the problems of semantic scene understanding and geometry estimation, incorporating 3D geometric and semantic relationships between scene elements in a coherent fashion. [sent-560, score-0.938]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('voxels', 0.685), ('voxel', 0.403), ('semantic', 0.197), ('oi', 0.128), ('labeling', 0.117), ('visible', 0.116), ('geometric', 0.108), ('occluded', 0.098), ('sij', 0.097), ('vi', 0.093), ('depth', 0.087), ('occupancy', 0.086), ('scene', 0.086), ('understanding', 0.084), ('occupied', 0.083), ('indoor', 0.077), ('visibility', 0.075), ('wall', 0.075), ('pw', 0.074), ('empty', 0.068), ('labelling', 0.066), ('nyu', 0.062), ('clouds', 0.058), ('geometry', 0.057), ('cij', 0.056), ('reality', 0.056), ('pwu', 0.056), ('relationships', 0.053), ('energy', 0.05), ('adjacent', 0.05), ('ray', 0.049), ('drm', 0.049), ('surface', 0.049), ('inference', 0.049), ('reconstructed', 0.046), ('hebert', 0.046), ('savarese', 0.045), ('kohli', 0.043), ('tv', 0.042), ('reconstruction', 0.042), ('hull', 0.041), ('relating', 0.04), ('plane', 0.039), ('rgb', 0.039), ('cloud', 0.038), ('groups', 0.038), ('unary', 0.037), ('impai', 0.037), ('vcrf', 0.037), ('wpw', 0.037), ('elements', 0.037), ('torr', 0.037), ('si', 0.036), ('vj', 0.036), ('potentials', 0.036), ('lattice', 0.036), ('uu', 0.036), ('belong', 0.035), ('pairwise', 0.035), ('hough', 0.034), ('augmented', 0.034), ('behind', 0.034), ('jointly', 0.033), ('free', 0.032), ('noisy', 0.031), ('ladick', 0.03), ('labels', 0.03), ('achieves', 0.029), ('vt', 0.029), ('sturgess', 0.029), ('hoiem', 0.028), ('geo', 0.027), ('interpretation', 0.027), ('view', 0.027), ('blocks', 0.026), ('scenes', 0.026), ('iis', 0.026), ('robot', 0.025), ('circles', 0.024), ('associated', 0.024), ('vc', 0.024), ('ln', 0.024), ('iterative', 0.023), ('saxena', 0.023), ('sofa', 0.023), ('penalize', 0.023), ('detected', 0.023), ('truth', 0.022), ('stereo', 0.022), ('russell', 0.022), ('satkin', 0.022), ('ladicky', 0.022), ('perceive', 0.022), ('highlighted', 0.022), ('silberman', 0.022), ('pairs', 0.021), ('efros', 0.021), ('cliques', 0.021), ('reasoning', 0.021), ('diversity', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 2 iccv-2013-3D Scene Understanding by Voxel-CRF

Author: Byung-Soo Kim, Pushmeet Kohli, Silvio Savarese

Abstract: Scene understanding is an important yet very challenging problem in computer vision. In the past few years, researchers have taken advantage of the recent diffusion of depth-RGB (RGB-D) cameras to help simplify the problem of inferring scene semantics. However, while the added 3D geometry is certainly useful to segment out objects with different depth values, it also adds complications in that the 3D geometry is often incorrect because of noisy depth measurements and the actual 3D extent of the objects is usually unknown because of occlusions. In this paper we propose a new method that allows us to jointly refine the 3D reconstruction of the scene (raw depth values) while accurately segmenting out the objects or scene elements from the 3D reconstruction. This is achieved by introducing a new model which we called Voxel-CRF. The Voxel-CRF model is based on the idea of constructing a conditional random field over a 3D volume of interest which captures the semantic and 3D geometric relationships among different elements (voxels) of the scene. Such a model allows us to jointly estimate (1) a dense voxel-based 3D reconstruction and (2) the semantic labels associated with each voxel even in the presence of partial occlusions, using an approximate yet efficient inference strategy. We evaluated our method on the challenging NYU Depth dataset (Versions 1 and 2). Experimental results show that our method achieves competitive accuracy in inferring scene semantics and visually appealing results in improving the quality of the 3D reconstruction. We also demonstrate an interesting application of object removal and scene completion from RGB-D images.

2 0.27364206 366 iccv-2013-STAR3D: Simultaneous Tracking and Reconstruction of 3D Objects Using RGB-D Data

Author: Carl Yuheng Ren, Victor Prisacariu, David Murray, Ian Reid

Abstract: We introduce a probabilistic framework for simultaneous tracking and reconstruction of 3D rigid objects using an RGB-D camera. The tracking problem is handled using a bag-of-pixels representation and a back-projection scheme. Surface and background appearance models are learned online, leading to robust tracking in the presence of heavy occlusion and outliers. In both our tracking and reconstruction modules, the 3D object is implicitly embedded using a 3D level-set function. The framework is initialized with a simple shape primitive model (e.g. a sphere or a cube), and the real 3D object shape is tracked and reconstructed online. Unlike existing depth-based 3D reconstruction works, which either rely on calibrated/fixed camera set up or use the observed world map to track the depth camera, our framework can simultaneously track and reconstruct small moving objects. We use both qualitative and quantitative results to demonstrate the superior performance of both tracking and reconstruction of our method.

3 0.21822372 410 iccv-2013-Support Surface Prediction in Indoor Scenes

Author: Ruiqi Guo, Derek Hoiem

Abstract: In this paper, we present an approach to predict the extent and height of supporting surfaces such as tables, chairs, and cabinet tops from a single RGBD image. We define support surfaces to be horizontal, planar surfaces that can physically support objects and humans. Given a RGBD image, our goal is to localize the height and full extent of such surfaces in 3D space. To achieve this, we created a labeling tool and annotated 1449 images with rich, complete 3D scene models in NYU dataset. We extract ground truth from the annotated dataset and developed a pipeline for predicting floor space, walls, the height and full extent of support surfaces. Finally we match the predicted extent with annotated scenes in training scenes and transfer the the support surface configuration from training scenes. We evaluate the proposed approach in our dataset and demonstrate its effectiveness in understanding scenes in 3D space.

4 0.1977405 228 iccv-2013-Large-Scale Multi-resolution Surface Reconstruction from RGB-D Sequences

Author: Frank Steinbrücker, Christian Kerl, Daniel Cremers

Abstract: We propose a method to generate highly detailed, textured 3D models of large environments from RGB-D sequences. Our system runs in real-time on a standard desktop PC with a state-of-the-art graphics card. To reduce the memory consumption, we fuse the acquired depth maps and colors in a multi-scale octree representation of a signed distance function. To estimate the camera poses, we construct a pose graph and use dense image alignment to determine the relative pose between pairs of frames. We add edges between nodes when we detect loop-closures and optimize the pose graph to correct for long-term drift. Our implementation is highly parallelized on graphics hardware to achieve real-time performance. More specifically, we can reconstruct, store, and continuously update a colored 3D model of an entire corridor of nine rooms at high levels of detail in real-time on a single GPU with 2.5GB.

5 0.17085178 447 iccv-2013-Volumetric Semantic Segmentation Using Pyramid Context Features

Author: Jonathan T. Barron, Mark D. Biggin, Pablo Arbeláez, David W. Knowles, Soile V.E. Keranen, Jitendra Malik

Abstract: We present an algorithm for the per-voxel semantic segmentation of a three-dimensional volume. At the core of our algorithm is a novel “pyramid context” feature, a descriptive representation designed such that exact per-voxel linear classification can be made extremely efficient. This feature not only allows for efficient semantic segmentation but enables other aspects of our algorithm, such as novel learned features and a stacked architecture that can reason about self-consistency. We demonstrate our technique on 3Dfluorescence microscopy data ofDrosophila embryosfor which we are able to produce extremely accurate semantic segmentations in a matter of minutes, and for which other algorithms fail due to the size and high-dimensionality of the data, or due to the difficulty of the task.

6 0.15106149 144 iccv-2013-Estimating the 3D Layout of Indoor Scenes and Its Clutter from Depth Sensors

7 0.12772198 201 iccv-2013-Holistic Scene Understanding for 3D Object Detection with RGBD Cameras

8 0.12423185 132 iccv-2013-Efficient 3D Scene Labeling Using Fields of Trees

9 0.11613007 319 iccv-2013-Point-Based 3D Reconstruction of Thin Objects

10 0.11601159 367 iccv-2013-SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels

11 0.10907232 1 iccv-2013-3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding

12 0.10449314 42 iccv-2013-Active MAP Inference in CRFs for Efficient Semantic Segmentation

13 0.098730408 128 iccv-2013-Dynamic Probabilistic Volumetric Models

14 0.097481668 79 iccv-2013-Coherent Object Detection with 3D Geometric Context from a Single Image

15 0.094700009 281 iccv-2013-Multi-view Normal Field Integration for 3D Reconstruction of Mirroring Objects

16 0.092302151 9 iccv-2013-A Flexible Scene Representation for 3D Reconstruction Using an RGB-D Camera

17 0.088266678 64 iccv-2013-Box in the Box: Joint 3D Layout and Object Reasoning from Single Images

18 0.086577982 424 iccv-2013-Tracking Revisited Using RGBD Camera: Unified Benchmark and Baselines

19 0.082306035 317 iccv-2013-Piecewise Rigid Scene Flow

20 0.080957018 111 iccv-2013-Detecting Dynamic Objects with Multi-view Background Subtraction


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.197), (1, -0.124), (2, -0.018), (3, 0.007), (4, 0.087), (5, -0.018), (6, -0.07), (7, -0.094), (8, -0.056), (9, -0.065), (10, 0.02), (11, 0.055), (12, -0.092), (13, 0.069), (14, -0.014), (15, -0.089), (16, -0.055), (17, -0.029), (18, -0.072), (19, -0.055), (20, -0.164), (21, 0.028), (22, 0.097), (23, -0.026), (24, 0.002), (25, -0.05), (26, 0.007), (27, 0.043), (28, 0.007), (29, 0.074), (30, 0.018), (31, -0.046), (32, 0.008), (33, -0.064), (34, -0.045), (35, 0.03), (36, -0.004), (37, 0.051), (38, 0.09), (39, 0.086), (40, -0.045), (41, -0.145), (42, -0.052), (43, 0.051), (44, 0.05), (45, 0.03), (46, -0.063), (47, 0.008), (48, -0.053), (49, -0.068)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94098359 2 iccv-2013-3D Scene Understanding by Voxel-CRF

Author: Byung-Soo Kim, Pushmeet Kohli, Silvio Savarese

Abstract: Scene understanding is an important yet very challenging problem in computer vision. In the past few years, researchers have taken advantage of the recent diffusion of depth-RGB (RGB-D) cameras to help simplify the problem of inferring scene semantics. However, while the added 3D geometry is certainly useful to segment out objects with different depth values, it also adds complications in that the 3D geometry is often incorrect because of noisy depth measurements and the actual 3D extent of the objects is usually unknown because of occlusions. In this paper we propose a new method that allows us to jointly refine the 3D reconstruction of the scene (raw depth values) while accurately segmenting out the objects or scene elements from the 3D reconstruction. This is achieved by introducing a new model which we called Voxel-CRF. The Voxel-CRF model is based on the idea of constructing a conditional random field over a 3D volume of interest which captures the semantic and 3D geometric relationships among different elements (voxels) of the scene. Such model allows to jointly estimate (1) a dense voxel-based 3D reconstruction and (2) the semantic labels associated with each voxel even in presence of par- tial occlusions using an approximate yet efficient inference strategy. We evaluated our method on the challenging NYU Depth dataset (Version 1and 2). Experimental results show that our method achieves competitive accuracy in inferring scene semantics and visually appealing results in improving the quality of the 3D reconstruction. We also demonstrate an interesting application of object removal and scene completion from RGB-D images.

2 0.80096006 410 iccv-2013-Support Surface Prediction in Indoor Scenes

Author: Ruiqi Guo, Derek Hoiem

Abstract: In this paper, we present an approach to predict the extent and height of supporting surfaces such as tables, chairs, and cabinet tops from a single RGBD image. We define support surfaces to be horizontal, planar surfaces that can physically support objects and humans. Given a RGBD image, our goal is to localize the height and full extent of such surfaces in 3D space. To achieve this, we created a labeling tool and annotated 1449 images with rich, complete 3D scene models in NYU dataset. We extract ground truth from the annotated dataset and developed a pipeline for predicting floor space, walls, the height and full extent of support surfaces. Finally we match the predicted extent with annotated scenes in training scenes and transfer the the support surface configuration from training scenes. We evaluate the proposed approach in our dataset and demonstrate its effectiveness in understanding scenes in 3D space.

3 0.72992378 228 iccv-2013-Large-Scale Multi-resolution Surface Reconstruction from RGB-D Sequences

Author: Frank Steinbrücker, Christian Kerl, Daniel Cremers

Abstract: We propose a method to generate highly detailed, textured 3D models of large environments from RGB-D sequences. Our system runs in real-time on a standard desktop PC with a state-of-the-art graphics card. To reduce the memory consumption, we fuse the acquired depth maps and colors in a multi-scale octree representation of a signed distance function. To estimate the camera poses, we construct a pose graph and use dense image alignment to determine the relative pose between pairs of frames. We add edges between nodes when we detect loop-closures and optimize the pose graph to correct for long-term drift. Our implementation is highly parallelized on graphics hardware to achieve real-time performance. More specifically, we can reconstruct, store, and continuously update a colored 3D model of an entire corridor of nine rooms at high levels of detail in real-time on a single GPU with 2.5GB.

4 0.71719593 9 iccv-2013-A Flexible Scene Representation for 3D Reconstruction Using an RGB-D Camera

Author: Diego Thomas, Akihiro Sugimoto

Abstract: Updating a global 3D model with live RGB-D measurements has proven to be successful for 3D reconstruction of indoor scenes. Recently, a Truncated Signed Distance Function (TSDF) volumetric model and a fusion algorithm have been introduced (KinectFusion), showing significant advantages such as computational speed and accuracy of the reconstructed scene. This algorithm, however, is expensive in memory when constructing and updating the global model. As a consequence, the method is not well scalable to large scenes. We propose a new flexible 3D scene representation using a set of planes that is cheap in memory use and, nevertheless, achieves accurate reconstruction of indoor scenes from RGB-D image sequences. Projecting the scene onto different planes reduces significantly the size of the scene representation and thus it allows us to generate a global textured 3D model with lower memory requirement while keeping accuracy and easiness to update with live RGB-D measurements. Experimental results demonstrate that our proposed flexible 3D scene representation achieves accurate reconstruction, while keeping the scalability for large indoor scenes.

5 0.66521257 132 iccv-2013-Efficient 3D Scene Labeling Using Fields of Trees

Author: Olaf Kähler, Ian Reid

Abstract: We address the problem of 3D scene labeling in a structured learning framework. Unlike previous work which uses structured Support VectorMachines, we employ the recently described Decision Tree Field and Regression Tree Field frameworks, which learn the unary and binary terms of a Conditional Random Field from training data. We show this has significant advantages in terms of inference speed, while maintaining similar accuracy. We also demonstrate empirically the importance for overall labeling accuracy of features that make use of prior knowledge about the coarse scene layout such as the location of the ground plane. We show how this coarse layout can be estimated by our framework automatically, and that this information can be used to bootstrap improved accuracy in the detailed labeling.

6 0.65226674 144 iccv-2013-Estimating the 3D Layout of Indoor Scenes and Its Clutter from Depth Sensors

7 0.65053993 201 iccv-2013-Holistic Scene Understanding for 3D Object Detection with RGBD Cameras

8 0.64420128 319 iccv-2013-Point-Based 3D Reconstruction of Thin Objects

9 0.62262464 366 iccv-2013-STAR3D: Simultaneous Tracking and Reconstruction of 3D Objects Using RGB-D Data

10 0.61497813 1 iccv-2013-3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding

11 0.60197526 367 iccv-2013-SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels

12 0.58847016 102 iccv-2013-Data-Driven 3D Primitives for Single Image Understanding

13 0.58436781 447 iccv-2013-Volumetric Semantic Segmentation Using Pyramid Context Features

14 0.56836706 215 iccv-2013-Incorporating Cloud Distribution in Sky Representation

15 0.56308943 139 iccv-2013-Elastic Fragments for Dense Scene Reconstruction

16 0.54349464 64 iccv-2013-Box in the Box: Joint 3D Layout and Object Reasoning from Single Images

17 0.54253715 375 iccv-2013-Scene Collaging: Analysis and Synthesis of Natural Images with Semantic Layers

18 0.529679 79 iccv-2013-Coherent Object Detection with 3D Geometric Context from a Single Image

19 0.50343496 128 iccv-2013-Dynamic Probabilistic Volumetric Models

20 0.50244343 42 iccv-2013-Active MAP Inference in CRFs for Efficient Semantic Segmentation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.058), (12, 0.01), (16, 0.013), (26, 0.067), (31, 0.052), (40, 0.017), (42, 0.089), (64, 0.038), (73, 0.015), (89, 0.528), (98, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99720734 216 iccv-2013-Inferring "Dark Matter" and "Dark Energy" from Videos

Author: Dan Xie, Sinisa Todorovic, Song-Chun Zhu

Abstract: This paper presents an approach to localizing functional objects in surveillance videos without domain knowledge about semantic object classes that may appear in the scene. Functional objects do not have discriminative appearance and shape, but they affect behavior of people in the scene. For example, they “attract” people to approach them for satisfying certain needs (e.g., vending machines could quench thirst), or “repel” people to avoid them (e.g., grass lawns). Therefore, functional objects can be viewed as “dark matter”, emanating “dark energy ” that affects people ’s trajectories in the video. To detect “dark matter” and infer their “dark energy ” field, we extend the Lagrangian mechanics. People are treated as particle-agents with latent intents to approach “dark matter” and thus satisfy their needs, where their motions are subject to a composite “dark energy ” field of all functional objects in the scene. We make the assumption that people take globally optimal paths toward the intended “dark matter” while avoiding latent obstacles. A Bayesian framework is used to probabilistically model: people ’s trajectories and intents, constraint map of the scene, and locations of functional objects. A data-driven Markov Chain Monte Carlo (MCMC) process is used for inference. Our evaluation on videos of public squares and courtyards demonstrates our effectiveness in localizing functional objects and predicting people ’s trajectories in unobserved parts of the video footage.

2 0.99575073 103 iccv-2013-Deblurring by Example Using Dense Correspondence

Author: Yoav Hacohen, Eli Shechtman, Dani Lischinski

Abstract: This paper presents a new method for deblurring photos using a sharp reference example that contains some shared content with the blurry photo. Most previous deblurring methods that exploit information from other photos require an accurately registered photo of the same static scene. In contrast, our method aims to exploit reference images where the shared content may have undergone substantial photometric and non-rigid geometric transformations, as these are the kind of reference images most likely to be found in personal photo albums. Our approach builds upon a recent method for examplebased deblurring using non-rigid dense correspondence (NRDC) [11] and extends it in two ways. First, we suggest exploiting information from the reference image not only for blur kernel estimation, but also as a powerful local prior for the non-blind deconvolution step. Second, we introduce a simple yet robust technique for spatially varying blur estimation, rather than assuming spatially uniform blur. Unlike the aboveprevious method, which hasproven successful only with simple deblurring scenarios, we demonstrate that our method succeeds on a variety of real-world examples. We provide quantitative and qualitative evaluation of our method and show that it outperforms the state-of-the-art.

3 0.99550492 56 iccv-2013-Automatic Registration of RGB-D Scans via Salient Directions

Author: Bernhard Zeisl, Kevin Köser, Marc Pollefeys

Abstract: We address the problem of wide-baseline registration of RGB-D data, such as photo-textured laser scans without any artificial targets or prediction on the relative motion. Our approach allows to fully automatically register scans taken in GPS-denied environments such as urban canyon, industrial facilities or even indoors. We build upon image features which are plenty, localized well and much more discriminative than geometry features; however, they suffer from viewpoint distortions and request for normalization. We utilize the principle of salient directions present in the geometry and propose to extract (several) directions from the distribution of surface normals or other cues such as observable symmetries. Compared to previous work we pose no requirements on the scanned scene (like containing large textured planes) and can handle arbitrary surface shapes. Rendering the whole scene from these repeatable directions using an orthographic camera generates textures which are identical up to 2D similarity transformations. This ambiguity is naturally handled by 2D features and allows to find stable correspondences among scans. For geometric pose estimation from tentative matches we propose a fast and robust 2 point sample consensus scheme integrating an early rejection phase. We evaluate our approach on different challenging real world scenes.

4 0.99512351 81 iccv-2013-Combining the Right Features for Complex Event Recognition

Author: Kevin Tang, Bangpeng Yao, Li Fei-Fei, Daphne Koller

Abstract: In this paper, we tackle the problem of combining features extracted from video for complex event recognition. Feature combination is an especially relevant task in video data, as there are many features we can extract, ranging from image features computed from individual frames to video features that take temporal information into account. To combine features effectively, we propose a method that is able to be selective of different subsets of features, as some features or feature combinations may be uninformative for certain classes. We introduce a hierarchical method for combining features based on the AND/OR graph structure, where nodes in the graph represent combinations of different sets of features. Our method automatically learns the structure of the AND/OR graph using score-based structure learning, and we introduce an inference procedure that is able to efficiently compute structure scores. We present promising results and analysis on the difficult and large-scale 2011 TRECVID Multimedia Event Detection dataset [17].

5 0.99417984 139 iccv-2013-Elastic Fragments for Dense Scene Reconstruction

Author: Qian-Yi Zhou, Stephen Miller, Vladlen Koltun

Abstract: We present an approach to reconstruction of detailed scene geometry from range video. Range data produced by commodity handheld cameras suffers from high-frequency errors and low-frequency distortion. Our approach deals with both sources of error by reconstructing locally smooth scene fragments and letting these fragments deform in order to align to each other. We develop a volumetric registration formulation that leverages the smoothness of the deformation to make optimization practical for large scenes. Experimental results demonstrate that our approach substantially increases the fidelity of complex scene geometry reconstructed with commodity handheld cameras.

same-paper 6 0.99396694 2 iccv-2013-3D Scene Understanding by Voxel-CRF

7 0.99361652 302 iccv-2013-Optimization Problems for Fast AAM Fitting in-the-Wild

8 0.99310803 39 iccv-2013-Action Recognition with Improved Trajectories

9 0.9915399 337 iccv-2013-Random Grids: Fast Approximate Nearest Neighbors and Range Searching for Image Search

10 0.98991448 390 iccv-2013-Shufflets: Shared Mid-level Parts for Fast Object Detection

11 0.98580462 319 iccv-2013-Point-Based 3D Reconstruction of Thin Objects

12 0.98038465 129 iccv-2013-Dynamic Scene Deblurring

13 0.97640347 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set

14 0.97626579 9 iccv-2013-A Flexible Scene Representation for 3D Reconstruction Using an RGB-D Camera

15 0.97608614 228 iccv-2013-Large-Scale Multi-resolution Surface Reconstruction from RGB-D Sequences

16 0.97575223 317 iccv-2013-Piecewise Rigid Scene Flow

17 0.97462368 343 iccv-2013-Real-World Normal Map Capture for Nearly Flat Reflective Surfaces

18 0.97397768 42 iccv-2013-Active MAP Inference in CRFs for Efficient Semantic Segmentation

19 0.97256315 256 iccv-2013-Locally Affine Sparse-to-Dense Matching for Motion and Occlusion Estimation

20 0.97049481 105 iccv-2013-DeepFlow: Large Displacement Optical Flow with Deep Matching