iccv iccv2013 iccv2013-132 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Olaf Kähler, Ian Reid
Abstract: We address the problem of 3D scene labeling in a structured learning framework. Unlike previous work which uses structured Support Vector Machines, we employ the recently described Decision Tree Field and Regression Tree Field frameworks, which learn the unary and binary terms of a Conditional Random Field from training data. We show this has significant advantages in terms of inference speed, while maintaining similar accuracy. We also demonstrate empirically the importance for overall labeling accuracy of features that make use of prior knowledge about the coarse scene layout such as the location of the ground plane. We show how this coarse layout can be estimated by our framework automatically, and that this information can be used to bootstrap improved accuracy in the detailed labeling.
Reference: text
sentIndex sentText sentNum sentScore
1 We address the problem of 3D scene labeling in a structured learning framework. [sent-5, score-0.346]
2 Unlike previous work which uses structured Support Vector Machines, we employ the recently described Decision Tree Field and Regression Tree Field frameworks, which learn the unary and binary terms of a Conditional Random Field from training data. [sent-6, score-0.165]
3 We also demonstrate empirically the importance for overall labeling accuracy of features that make use of prior knowledge about the coarse scene layout such as the location of the ground plane. [sent-8, score-0.69]
4 We show how this coarse layout can be estimated by our framework automatically, and that this information can be used to bootstrap improved accuracy in the detailed labeling. [sent-9, score-0.251]
5 Introduction Interacting with the world requires both an understanding of the 3D geometry of a scene and of its semantic meaning. [sent-11, score-0.158]
6 Learning and inference models such as Support Vector Machines and Random Forests [3] have been used in combination with Conditional Random Fields [2, 11] to infer semantic labels in a range of problems. [sent-13, score-0.209]
7 For a successful integration of the two parts, the 3D representation has to be rich enough to provide an actual benefit for the task of scene labeling, and the scene labeling system has to be fast enough to allow interaction with the inferred labels. [sent-15, score-0.434]
8 First, we present a framework to employ Decision Tree Fields [11] and Regression Tree Fields [6] for the 3D scene labeling task, where the dependency structure is dynamically determined from the scene instead of using grid structures. [sent-17, score-0.434]
9 Third, we compare the two classifiers in the context of scene labeling. [sent-22, score-0.148]
10 And finally we use an adaptation of the framework to estimate the coarse scene layout and ground plane, which are a prerequisite for high level features commonly used for 3D scene labeling. [sent-23, score-0.519]
11 Related Work With the introduction of the Kinect sensor a significant number of works has been published on semantic labeling of RGB images with additional depth information [15, 13]. [sent-26, score-0.268]
12 The downside of their SVM-based approach is the processing time, which is far from real-time for their full scale classification model and an accelerated, approximate solution only comes at the expense of reduced labeling performance. [sent-33, score-0.227]
13 We show that this enables us to discover consistent scene labels for a full 3D model at interactive rates while still maintaining the high precision and recall of the current state-of-the-art methods. [sent-35, score-0.305]
14 Like many scene labeling systems, the performance of [sent-36, score-0.314]
15 our framework crucially relies on knowledge of the coarse scene layout comprising a ground plane and the walls, which allows us to exploit rich geometric constraints that place other objects within the context of the overall scene. [sent-39, score-0.611]
16 A separate stream of work has explicitly considered inferring the coarse scene layout, usually by restricting it to some simple form such as a cuboid [16] or an indoor Manhattan world [5]. [sent-42, score-0.253]
17 In contrast we show that the very same method we use for scene labeling can also be adapted to label the floor and walls in indoor environments, which then allows us to bootstrap a fine grained labeling of the scene. [sent-43, score-0.861]
18 Starting from given depth images, we compute a dense, volumetric representation of the scene using our own implementation of the KinectFusion algorithm [10] that in addition takes the RGB information into account in aligning the RGB-D images. [sent-47, score-0.193]
19 We then compute an oversegmentation of this dense representation and instantiate a graph over the segments as explained in Section 2. [sent-48, score-0.328]
20 For each of the segments and for each neighborhood relation between the segments we then extract a feature vector which we will detail in Section 3. [sent-49, score-0.341]
21 Finally we compute labels for the segments using adaptations of Decision Tree Fields [11] and Regression Tree Fields [6]. [sent-50, score-0.218]
22 Inspired by the very successful SLIC superpixels [1] for 2D images, we develop a method for the oversegmentation of a 3D volume. [sent-57, score-0.184]
23 Once each point has a label, the centroids $p_{C_i}$, mean colors $c_{C_i}$ and average normals $n_{C_i}$ of each seed $C_i$ are recomputed from the points $x$ that have been assigned the label $l_x = C_i$. [sent-65, score-0.168]
24 These two steps are iterated until convergence to a stable oversegmentation of the scene. [sent-66, score-0.184]
25 As this approach does not enforce connectivity of the segments a post-processing step is applied after the iterations finish [1], in which individual stray points or small connected parts of the scene are merged with their most closely matching neighboring segment. [sent-67, score-0.341]
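The two alternating steps described above (assign each point to a seed, then recompute each seed from its members) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the combined distance, its weights, the dictionary layout and the function names are all assumptions, the search is global rather than restricted to a local neighborhood as in SLIC, and the connectivity post-processing is omitted.

```python
import math

def combined_distance(point, seed, w_color=1.0, w_normal=1.0):
    """Hypothetical distance mixing position, color and normal differences.

    `point` and `seed` are dicts with keys 'p' (position), 'c' (color)
    and 'n' (unit normal). The weights are illustrative only.
    """
    d_pos = math.dist(point["p"], seed["p"])
    d_col = math.dist(point["c"], seed["c"])
    # 1 - cos(angle) between the two unit normals
    d_nrm = 1.0 - sum(a * b for a, b in zip(point["n"], seed["n"]))
    return d_pos + w_color * d_col + w_normal * d_nrm

def mean_vector(vectors):
    count = len(vectors)
    return tuple(sum(v[k] for v in vectors) / count for k in range(len(vectors[0])))

def unit(v):
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v) if length > 0 else tuple(v)

def oversegment(points, seeds, max_iters=20):
    """Alternate assignment and seed update until the labels stop changing."""
    seeds = [dict(s) for s in seeds]  # do not modify the caller's seeds
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point takes the label of its closest seed.
        new_labels = [
            min(range(len(seeds)), key=lambda i: combined_distance(pt, seeds[i]))
            for pt in points
        ]
        if new_labels == labels:
            break  # converged to a stable oversegmentation
        labels = new_labels
        # Update step: recompute centroid, mean color and average normal.
        for i in range(len(seeds)):
            members = [pt for pt, lab in zip(points, labels) if lab == i]
            if members:
                seeds[i] = {
                    "p": mean_vector([m["p"] for m in members]),
                    "c": mean_vector([m["c"] for m in members]),
                    "n": unit(mean_vector([m["n"] for m in members])),
                }
    return labels, seeds
```

A real system would additionally merge stray points and small disconnected parts into neighboring segments, as the text describes.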
26 Given the oversegmentation we define a graph $G = (V, E)$ over the segments. [sent-68, score-0.184]
27 Similarly, we extract a feature vector for each pair of segments that have an edge ei,j linking them, which will describe the contextual relation of the two segments. [sent-75, score-0.244]
28 If no prior knowledge on the coarse layout of the scene is given, distances and the overall scale can only be measured in terms of voxels and the arrangement of segments within the scene context is unknown. [sent-77, score-0.699]
29 A measure of the flatness, that to our knowledge has not been used before, is given by a histogram of angles between the average normal of a segment and the surface [...] for splitting decision trees in the labeling phase. [sent-90, score-0.658]
30 This gives a rough estimate of the relevance of features, but it does not take the depth within the trees into account where the features are used. [sent-92, score-0.232]
31 Finally we compute the mean, median and maximum plane fit errors as in [15] and spin images [7], that have not been used in either of [2, 15]. [sent-94, score-0.172]
32 To describe the relation between two segments, a first obvious feature is the angle between the surface normals and the distance between the two centroids. [sent-95, score-0.157]
33 Given the centroids and surface normals we can also check whether the centroid of one segment is in front of the other segment and vice versa, which provides an indication of convexity as used in [2]. [sent-96, score-0.328]
34 We also compute the mean, median and maximum distances of the points of one segment from a plane fit to the other segment, which gives a measure of coplanarity of the two [15], and we check whether two segments are connected as neighbors in the oversegmentation. [sent-97, score-0.377]
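The pairwise features listed above (normal angle, centroid distance, the "in front of" checks, and coplanarity) could be computed along the following lines. This is a sketch under assumptions: each segment is summarized by a centroid and a unit average normal, the plane of the other segment is taken to pass through its centroid with that normal (the paper fits a plane, whose details are not given here), and all function names are hypothetical.

```python
import math
from statistics import mean, median

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def pairwise_features(c_i, n_i, c_j, n_j):
    """Relation features between segments i and j (unit normals assumed)."""
    cos_angle = max(-1.0, min(1.0, dot(n_i, n_j)))  # clamp against rounding
    offset = sub(c_j, c_i)
    return {
        "normal_angle": math.acos(cos_angle),
        "centroid_distance": math.sqrt(dot(offset, offset)),
        # Is the centroid of j on the side of i that i's normal points to?
        "j_in_front_of_i": dot(n_i, offset) > 0.0,
        "i_in_front_of_j": dot(n_j, sub(c_i, c_j)) > 0.0,
    }

def coplanarity(points_i, c_j, n_j):
    """Mean, median and maximum distance of the points of segment i from
    the plane through c_j with unit normal n_j (a stand-in for a plane fit)."""
    distances = [abs(dot(n_j, sub(p, c_j))) for p in points_i]
    return mean(distances), median(distances), max(distances)
```

The two boolean flags together give the convexity indication mentioned in the text; how they are combined into one feature is left open here.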
35 4 that it can be estimated automatically from the data using the very same system that we propose for fine grained scene labeling. [sent-102, score-0.263]
36 the angle of the surface normal with the ground plane, the height above ground and the horizontal and vertical displacements between two segments, all of which have also been used in [2]. [sent-105, score-0.207]
37 We also compute a projection of each segment onto the ground plane and use the area of this footprint and the percentage of overlap of two footprints as additional features [15]. [sent-106, score-0.25]
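Once a ground plane is known, the layout-dependent features mentioned above become simple projections. The sketch below assumes the ground plane is given by a point on it and a unit normal, and it measures the angle between the surface normal and the ground normal; the function names and this exact parameterization are illustrative assumptions. The footprint area and overlap features are not covered.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def ground_features(centroid, normal, ground_point, ground_normal):
    """Height above the ground plane and angle (radians) between the
    segment normal and the ground normal. Unit normals are assumed."""
    height = dot(sub(centroid, ground_point), ground_normal)
    cos_angle = max(-1.0, min(1.0, dot(normal, ground_normal)))
    return height, math.acos(cos_angle)

def displacements(c_i, c_j, ground_normal):
    """Horizontal and vertical displacement from segment i to segment j."""
    offset = sub(c_j, c_i)
    vertical = dot(offset, ground_normal)
    # Remove the vertical component to obtain the in-plane part.
    in_plane = tuple(o - vertical * g for o, g in zip(offset, ground_normal))
    horizontal = math.sqrt(dot(in_plane, in_plane))
    return horizontal, vertical
```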
38 The histogram of normal angles and spin images appear to be the most important descriptors of shape, and convexity and coplanarity are highly relevant to describe the relation of segments. [sent-109, score-0.184]
39 evaluation of the features incorporating knowledge of the scene layout follows in Section 5. [sent-112, score-0.299]
40 Semantic Labeling We model the relation between the collection of feature vectors x extracted for the nodes and edges and the label vector y for the scene as a Conditional Random Field. [sent-115, score-0.263]
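41 In our current setup we only consider unary terms $E_N$ and binary terms $E_E$, leading to the overall relation: $P(y|x,w) = \frac{1}{Z(x,w)} \exp\left(-\sum_{i \in V} E_N(y_i, x, w) - \sum_{(i,j) \in E} E_E(y_i, y_j, x, w)\right)$ (1). [sent-116, score-0.176]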
42 Given that we have multiple classes and the labels for individual nodes in the graph are related to each other, this is a classical problem for structured learning and inference techniques. [sent-123, score-0.289]
43 As illustrated in Figure 3, the input data x is passed down a set of trees and eventually selects a single leaf from each tree. [sent-126, score-0.161]
44 The leaves then determine the energies required to assign the individual labels yi to nodes vi or pairs of labels yi and yj to nodes vi and vj . [sent-127, score-0.668]
45 , K], where K is the number of distinct classes, and the parameter vector w stores tables of energy values for each leaf. [sent-132, score-0.164]
46 The label yi selects a single entry from the table selected by the input data x, and this entry represents the energy for assigning the label. [sent-133, score-0.239]
47 As there are multiple trees in the forest and hence multiple leaves, the overall energies $E_E$ and $E_N$ are defined as the sums over the energies in the individual leaves: $E_N^{DTF}(y_i, x, w) = \sum_{q \in L(i,x)} w_{q,y_i}$ and $E_E^{DTF}(y_i, y_j, x, w) = \sum_{q \in L(i,j,x)} w_{q,y_i,y_j}$, [sent-134, score-0.501]
48 where the functions $L(i, x)$ and $L(i, j, x)$ return the set of leaves reached in the respective forests and $w_{q,y_i}$ and $w_{q,y_i,y_j}$ are the individual energy values. [sent-138, score-0.234]
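The table lookup behind the DTF energies can be illustrated with a few lines of code. This is a toy sketch: the leaf identifiers, the dictionary-of-tables layout and the function names are assumptions made for illustration, and the step that routes the input x down each tree to find the reached leaves is taken as given.

```python
def unary_energy_dtf(label, reached_leaves, tables):
    """Sum, over all reached leaves, the table entry selected by `label`.

    `tables[q]` is a list of K energies for leaf q (hypothetical layout).
    """
    return sum(tables[q][label] for q in reached_leaves)

def pairwise_energy_dtf(label_i, label_j, reached_leaves, tables):
    """Same idea for edges: `tables[q]` is a K x K table of energies."""
    return sum(tables[q][label_i][label_j] for q in reached_leaves)

# Toy example with K = 2 classes and two trees, one reached leaf per tree.
unary_tables = {"tree0_leaf3": [0.5, 2.0], "tree1_leaf7": [1.0, 0.25]}
leaves = ["tree0_leaf3", "tree1_leaf7"]
energy_label_0 = unary_energy_dtf(0, leaves, unary_tables)  # 0.5 + 1.0
energy_label_1 = unary_energy_dtf(1, leaves, unary_tables)  # 2.0 + 0.25
```

Lower energy means a more likely label, so in this toy example label 0 would be preferred by the unary term alone.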
49 For the unary terms the parameter vector w stores a symmetric, positive definite matrix $\Theta_{u,q} \in S^K$ and a vector $\theta_{u,q} \in \mathbb{R}^K$ for each leaf q, and the energy required to assign a label $y_i$ to node $v_i$ is determined by the quadratic energy function defined by $\Theta_{u,q}$ and $\theta_{u,q}$. [sent-141, score-0.425]
50 For the binary terms, the quadratic energy functions stored in the leaves are 2K dimensional and defined by $\Theta_{b,q} \in S^{2K}$ and $\theta_{b,q} \in \mathbb{R}^{2K}$. [sent-142, score-0.206]
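51 The overall energy terms for the ensemble of trees in the forest are again sums over the individual contributions, resulting in: $E_N^{RTF}(y_i, x, w) = \sum_{q \in L(i,x)} \left( \frac{1}{2} y_i^T \Theta_{u,q} y_i - \theta_{u,q}^T y_i \right)$ and $E_E^{RTF}(y_i, y_j, x, w) = \sum_{q \in L(i,j,x)} \left( \frac{1}{2} y_{ij}^T \Theta_{b,q} y_{ij} - \theta_{b,q}^T y_{ij} \right)$, where $y_{ij}$ denotes the concatenation of $y_i$ and $y_j$. [sent-144, score-0.404]
52 In that sense the leaves in the unary regression forests store K-dimensional Gaussian distributions over the label vectors $y_i$, and the binary forests store 2K-dimensional distributions over the concatenations of $y_i$ and $y_j$. [sent-151, score-0.446]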
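To make the quadratic leaf energies concrete, the sketch below evaluates them for a label vector, assuming the standard form ½ yᵀΘy − θᵀy per leaf (this form is an assumption here, chosen to match the negative log-likelihood stated later in the text). The parameter layout and names are hypothetical, and label vectors are taken to be one-hot encodings of the K classes.

```python
def quadratic_energy(y, theta_matrix, theta_vector):
    """Evaluate 0.5 * y^T Theta y - theta^T y for plain Python lists."""
    size = len(y)
    quad = sum(y[a] * theta_matrix[a][b] * y[b] for a in range(size) for b in range(size))
    lin = sum(t * v for t, v in zip(theta_vector, y))
    return 0.5 * quad - lin

def energy_rtf(y, reached_leaves, params):
    """Sum the quadratic energies of all reached leaves.

    `params[q]` is a (Theta, theta) pair for leaf q (hypothetical layout).
    For a binary term, pass the concatenation of y_i and y_j as `y`.
    """
    return sum(quadratic_energy(y, *params[q]) for q in reached_leaves)

# Toy example with K = 2 and two trees.
leaf_params = {
    "q1": ([[2.0, 0.0], [0.0, 4.0]], [1.0, 1.0]),
    "q2": ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.5]),
}
energy_class_0 = energy_rtf([1.0, 0.0], ["q1", "q2"], leaf_params)
energy_class_1 = energy_rtf([0.0, 1.0], ["q1", "q2"], leaf_params)
```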
53 As in [11] we determine the tree structures in a first step and then optimize the parameters w in a separate, second step. [sent-156, score-0.15]
54 A two step approach is necessary as the parameters w are continuous whereas the tree structures form a large, combinatorial space, and a simultaneous optimization of both is intractable. [sent-157, score-0.15]
55 In the first stage the tree structures are determined using standard methods [3]. [sent-158, score-0.15]
56 We pick a random subset of the training data to train each tree in the forest; the binary decision rules at the internal nodes select a random element of the feature vector and split it at a random value. [sent-159, score-0.316]
57 Decision rules are selected to maximize the information gain, and tree splitting is stopped once a certain depth is reached or the entropy of the remaining labels is below a threshold. [sent-160, score-0.26]
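The split selection described above (random feature, random threshold, keep the candidate with the highest information gain) can be sketched as follows. The number of candidates, the seeding and the function names are illustrative assumptions; stopping criteria and recursion into child nodes are left out.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, left, right):
    """Entropy reduction achieved by splitting `labels` into `left` and `right`."""
    total = len(labels)
    weighted = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(labels) - weighted

def best_random_split(features, labels, n_candidates=50, seed=0):
    """Try random (dimension, threshold) pairs and return the best one
    as a tuple (gain, dimension, threshold)."""
    rng = random.Random(seed)
    n_dims = len(features[0])
    best = (-1.0, None, None)
    for _ in range(n_candidates):
        dim = rng.randrange(n_dims)
        values = [f[dim] for f in features]
        threshold = rng.uniform(min(values), max(values))
        left = [lab for f, lab in zip(features, labels) if f[dim] < threshold]
        right = [lab for f, lab in zip(features, labels) if f[dim] >= threshold]
        gain = information_gain(labels, left, right)
        if gain > best[0]:
            best = (gain, dim, threshold)
    return best
```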
58 Given that all the individual contributions to the likelihood from Equation (1) are Gaussian, the overall likelihood is Gaussian as well and the negative log likelihood takes the form $-\ln P(y|x,w) = \frac{1}{2} y^T \tilde{\Theta} y - \tilde{\vartheta}^T y + \ln Z$. [sent-196, score-0.209]
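Because this negative log-likelihood is quadratic in y, the most likely label vector follows from setting its gradient to zero. The short derivation below is a standard result rather than a quotation from the paper; it assumes that the aggregated matrix is symmetric positive definite (as it would be if assembled from the positive definite leaf matrices) and that Z does not depend on y.

```latex
\begin{aligned}
\nabla_y \left( \tfrac{1}{2}\, y^T \tilde{\Theta}\, y - \tilde{\vartheta}^T y + \ln Z \right)
  &= \tilde{\Theta}\, y - \tilde{\vartheta} = 0 \\
\Rightarrow \quad y^{*} &= \tilde{\Theta}^{-1} \tilde{\vartheta}
\end{aligned}
```

In practice a sparse linear system of this kind would typically be solved iteratively rather than by explicit inversion; which solver the authors use is not stated in the extracted text.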
59 Both of these contain image sequences recorded with a Kinect and pose multi-class labeling problems. [sent-205, score-0.194]
60 We then evaluate the effects of varying the number of trees in Section 5. [sent-210, score-0.161]
61 2 showing that a forest with more trees increases the performance of DTFs and RTFs at the expense of higher computational effort. [sent-211, score-0.246]
62 3 we evaluate the performance gained by using global knowledge of a ground plane, and finally we investigate the performance of the presented methods at finding such a ground plane in Section 5. [sent-213, score-0.291]
63 Compared to State-of-the-Art In [2] a labeling method based on structured Support Vector Machines is presented. [sent-217, score-0.226]
64 Note that the micro-averaged precision and recall are identical if a label has to be assigned to each of the segments, but the fast and approximate inference method is allowed to reject segments hence leading to different values for micro-averaged precision and recall. [sent-220, score-0.483]
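The remark about micro-averaged precision and recall is easy to check numerically. In the sketch below, a prediction of `None` stands for a rejected segment; that encoding, the function name and the toy labels are assumptions for illustration.

```python
def micro_precision_recall(truth, predicted):
    """Micro-averaged precision and recall over all segments.

    `predicted` may contain None for segments the classifier rejected.
    """
    true_positives = sum(1 for t, p in zip(truth, predicted) if p is not None and p == t)
    n_predicted = sum(1 for p in predicted if p is not None)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

truth = ["wall", "floor", "floor", "table"]
# Every segment receives a label: both denominators equal the segment count.
all_labeled = micro_precision_recall(truth, ["wall", "floor", "table", "table"])
# One uncertain segment is rejected: precision rises, recall does not.
with_reject = micro_precision_recall(truth, ["wall", "floor", None, "table"])
```

When every segment is labeled, the number of predictions equals the number of ground truth items, so the two scores coincide; rejecting segments shrinks only the precision denominator.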
65 As a baseline we also include a Random Forest classifier (RF) on [...] [Figure caption fragment: ... show the ground truth and prediction results for fine grained scene labeling and the right two columns show the same for coarse scene layout estimation.] [sent-222, score-0.856]
66 However, the inference algorithms for the DTF and RTF methods are orders of magnitude faster and comparable to the approximate, fast method based on SVMs, which only achieves a very low recall score (38. [sent-233, score-0.154]
67 These approaches are therefore highly relevant for scene labeling, particularly if predicted labels are required at interactive rates. [sent-236, score-0.194]
68 Note that the CornellRGBD-Dataset only comes with annotated 3D point clouds and the ground truth labels for these point clouds were originally created by annotating the oversegmentations from [2]. [sent-238, score-0.193]
69 For our experiments we therefore try to find suitable ground truth labels by reprojecting the ground truth point cloud and our oversegmentation into the original camera images, and we reject segments where the label is not clear. [sent-239, score-0.492]
70 The results achieved with our proposed pipeline are shown in Table 3, and in this case the RTFs appear to perform better than DTFs both in labeling performance and inference time. [sent-240, score-0.291]
71 As mentioned, the ground truth for this dataset was obtained by annotating the specific oversegmentations from [2] and differences in the segment boundaries will invariably degrade the performance. [sent-242, score-0.193]
72 The oversegmentation of [2] also results in less complex CRFs with about 50-100 nodes per scene, whereas ours have about 1000-3000 segments of much smaller and much more regular size. [sent-243, score-0.368]
73 Number of Trees In Section 4 we presented formulations of DTFs and RTFs using multiple trees per term. [sent-247, score-0.197]
74 As expected the performance increases with the number of trees in both the DTF and RTF formulations and saturates at about 15 trees. [sent-257, score-0.197]
75 In this experiment we use the same number of trees for the unary and binary terms. [sent-259, score-0.294]
76 We have also investigated varying the numbers of trees independently and found that the number of binary trees impacts the results more significantly than the number of unary trees. [sent-260, score-0.455]
77 We attribute this to the greater diversity in the binary terms, where pairs of labels have to be predicted instead of a single label per segment. [sent-261, score-0.16]
78 In the remaining experiments, we therefore typically use 10 unary trees and 15 binary trees, which appears to saturate the performance for most of our tasks. [sent-262, score-0.294]
79 Knowledge of Scene Layout Prior context information such as knowledge of a ground plane and the absolute scale of a scene are important hints for the labeling task, and they are thus heavily used in the feature set we presented in Section 3. [sent-265, score-0.544]
80 To assess their importance we re-run our scene labeling methods without using these features and compare the impact. [sent-266, score-0.314]
81 The resulting precision and recall scores are shown in the rows entitled w/o ground plane in Tables 2 and 3. [sent-267, score-0.287]
82 The significant drop in precision and recall scores underlines the relevance of such global knowledge for scene labeling and we next aim to find this knowledge automatically. [sent-268, score-0.568]
83 Errors in the estimation of the ground plane normal. [sent-270, score-0.176]
84 Experimental evaluation of precision and recall for estimating the coarse scene layout using DTFs and RTFs. [sent-275, score-0.449]
85 Estimating Scene Layout For finding the scene layout we try to infer one of the labels {floor, wall, tableTop, clutter} for each of the segments in the scene, and achieve this using the very same scene labeling approach as before. [sent-278, score-0.633]
86 We reduce the set of labels to the given four classes and re-run the training and inference steps. [sent-279, score-0.171]
87 Sample results of this labeling task are shown on the right hand side of Figure 4. [sent-280, score-0.194]
88 As a second criterion we compute robust plane fits to the segments labeled as floor by our system and in the ground-truth data and compute the angle between the two recovered normals. [sent-282, score-0.299]
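The second evaluation criterion boils down to an angle between two plane normals. The sketch below covers only that last step and treats the normals as already recovered; the robust plane fitting itself is not shown. The sign-invariant comparison (a plane normal and its negation describe the same plane), the helper for summarizing results, and all names are assumptions.

```python
import math

def angle_between_normals_deg(n_a, n_b):
    """Angle in degrees between two plane normals, ignoring their sign."""
    dot = sum(a * b for a, b in zip(n_a, n_b))
    norm = math.sqrt(sum(a * a for a in n_a)) * math.sqrt(sum(b * b for b in n_b))
    cos_angle = min(1.0, abs(dot) / norm)  # clamp against rounding
    return math.degrees(math.acos(cos_angle))

def fraction_within(angles_deg, limit_deg):
    """Share of scenes whose ground plane error stays within `limit_deg`."""
    return sum(1 for a in angles_deg if a <= limit_deg) / len(angles_deg)
```

A call such as `fraction_within(errors, 20.0)` would yield the kind of summary statistic quoted in the evaluation, given a list of per-scene angular errors.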
89 In Table 4 we present the labeling precision and recall thus achieved with our system. [sent-283, score-0.305]
90 From both evaluations it appears that RTFs perform better in this task than DTFs and the proposed method estimates the ground plane to within 20° in 80% of the cases. [sent-285, score-0.176]
91 First, the coarse prediction is used to estimate the ground plane, and second, this ground plane is used in the computation of a fine grained scene labeling. [sent-287, score-0.593]
92 Conclusions We have introduced a structured learning approach to 3D scene labeling that takes advantage of the recently described Decision Tree Field [11] and Regression Tree Field [6] classifiers. [sent-294, score-0.346]
93 In our current implementation the oversegmentation typically takes 2-3s and the feature extraction 5-10s. [sent-297, score-0.184]
94 Our oversegmentation step from Section 2 computes a dense set of small segments. [sent-300, score-0.184]
95 While this leads to a fine grained segmentation of object boundaries and while the CRF formulation does an excellent job at grouping the small segments into semantically consistent units, their sheer number poses a high computational burden on the CRF. [sent-301, score-0.287]
96 Finally our current system relies on a Kinect sensor and the KinectFusion system, but volumetric 3D scene representations can recently also be acquired with standard RGB cameras and DTAM [9]. [sent-308, score-0.157]
97 Contextually guided semantic labeling and search for three-dimensional point clouds. [sent-326, score-0.232]
98 Regression tree fields: an efficient, non-parametric approach to image labeling problems. [sent-347, score-0.443]
99 A generative framework for fast urban labeling using spatial and temporal context. [sent-393, score-0.194]
100 Discriminative learning with latent variables for cluttered indoor scene understanding. [sent-417, score-0.16]
wordName wordTfidf (topN-words)
[('dtfs', 0.46), ('rtfs', 0.403), ('labeling', 0.194), ('oversegmentation', 0.184), ('rtf', 0.173), ('trees', 0.161), ('tree', 0.15), ('segments', 0.144), ('dtf', 0.142), ('layout', 0.125), ('scene', 0.12), ('kinectfusion', 0.118), ('plane', 0.115), ('fields', 0.099), ('grained', 0.098), ('inference', 0.097), ('unary', 0.097), ('coarse', 0.093), ('decision', 0.09), ('pci', 0.089), ('leaves', 0.081), ('energies', 0.079), ('segment', 0.074), ('labels', 0.074), ('cci', 0.071), ('regression', 0.07), ('tables', 0.07), ('yi', 0.066), ('temperature', 0.064), ('energy', 0.061), ('dtam', 0.061), ('ground', 0.061), ('oversegmentations', 0.058), ('spin', 0.057), ('recall', 0.057), ('surface', 0.055), ('precision', 0.054), ('knowledge', 0.054), ('relation', 0.053), ('forest', 0.052), ('emax', 0.051), ('vi', 0.051), ('label', 0.05), ('normals', 0.049), ('emin', 0.047), ('walls', 0.047), ('contextual', 0.047), ('forests', 0.046), ('individual', 0.046), ('fine', 0.045), ('ee', 0.045), ('coplanarity', 0.044), ('nci', 0.044), ('overall', 0.043), ('sums', 0.041), ('pages', 0.041), ('centroid', 0.041), ('likelihood', 0.04), ('floor', 0.04), ('arrive', 0.04), ('nodes', 0.04), ('indoor', 0.04), ('ismar', 0.039), ('semantic', 0.038), ('volumetric', 0.037), ('crf', 0.037), ('depth', 0.036), ('eigenvalues', 0.036), ('formulations', 0.036), ('binary', 0.036), ('cloud', 0.035), ('centroids', 0.035), ('relevance', 0.035), ('ci', 0.034), ('kinect', 0.034), ('seeds', 0.034), ('lx', 0.034), ('stores', 0.033), ('newcombe', 0.033), ('expense', 0.033), ('nowozin', 0.033), ('manhattan', 0.033), ('bootstrap', 0.033), ('slic', 0.033), ('structured', 0.032), ('wp', 0.032), ('wn', 0.031), ('closely', 0.031), ('wc', 0.031), ('entry', 0.031), ('slow', 0.03), ('conditioned', 0.03), ('normal', 0.03), ('en', 0.029), ('definite', 0.028), ('quadratic', 0.028), ('field', 0.028), ('classifiers', 0.028), ('dy', 0.028), ('reject', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 132 iccv-2013-Efficient 3D Scene Labeling Using Fields of Trees
Author: Olaf Kähler, Ian Reid
Abstract: We address the problem of 3D scene labeling in a structured learning framework. Unlike previous work which uses structured Support Vector Machines, we employ the recently described Decision Tree Field and Regression Tree Field frameworks, which learn the unary and binary terms of a Conditional Random Field from training data. We show this has significant advantages in terms of inference speed, while maintaining similar accuracy. We also demonstrate empirically the importance for overall labeling accuracy of features that make use of prior knowledge about the coarse scene layout such as the location of the ground plane. We show how this coarse layout can be estimated by our framework automatically, and that this information can be used to bootstrap improved accuracy in the detailed labeling.
2 0.21946824 144 iccv-2013-Estimating the 3D Layout of Indoor Scenes and Its Clutter from Depth Sensors
Author: Jian Zhang, Chen Kan, Alexander G. Schwing, Raquel Urtasun
Abstract: In this paper we propose an approach to jointly estimate the layout of rooms as well as the clutter present in the scene using RGB-D data. Towards this goal, we propose an effective model that is able to exploit both depth and appearance features, which are complementary. Furthermore, our approach is efficient as we exploit the inherent decomposition of additive potentials. We demonstrate the effectiveness of our approach on the challenging NYU v2 dataset and show that employing depth reduces the layout error by 6% and the clutter estimation by 13%.
3 0.16526553 448 iccv-2013-Weakly Supervised Learning of Image Partitioning Using Decision Trees with Structured Split Criteria
Author: Christoph Straehle, Ullrich Koethe, Fred A. Hamprecht
Abstract: We propose a scheme that allows an image to be partitioned into a previously unknown number of segments, using only minimal supervision in terms of a few must-link and cannot-link annotations. We make no use of regional data terms, learning instead what constitutes a likely boundary between segments. Since boundaries are only implicitly specified through cannot-link constraints, this is a hard and nonconvex latent variable problem. We address this problem in a greedy fashion using a randomized decision tree on features associated with interpixel edges. We use a structured purity criterion during tree construction and also show how a backtracking strategy can be used to prevent the greedy search from ending up in poor local optima. The proposed strategy is compared with prior art on natural images.
4 0.16046248 404 iccv-2013-Structured Forests for Fast Edge Detection
Author: Piotr Dollár, C. Lawrence Zitnick
Abstract: Edge detection is a critical component of many vision systems, including object detectors and image segmentation algorithms. Patches of edges exhibit well-known forms of local structure, such as straight lines or T-junctions. In this paper we take advantage of the structure present in local image patches to learn both an accurate and computationally efficient edge detector. We formulate the problem of predicting local edge masks in a structured learning framework applied to random decision forests. Our novel approach to learning decision trees robustly maps the structured labels to a discrete space on which standard information gain measures may be evaluated. The result is an approach that obtains real-time performance that is orders of magnitude faster than many competing state-of-the-art approaches, while also achieving state-of-the-art edge detection results on the BSDS500 Segmentation dataset and NYU Depth dataset. Finally, we show the potential of our approach as a general purpose edge detector by showing our learned edge models generalize well across datasets.
5 0.14066175 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
Author: Shugao Ma, Jianming Zhang, Nazli Ikizler-Cinbis, Stan Sclaroff
Abstract: We propose Hierarchical Space-Time Segments as a new representation for action recognition and localization. This representation has a two-level hierarchy. The first level comprises the root space-time segments that may contain a human body. The second level comprises multi-grained space-time segments that contain parts of the root. We present an unsupervised method to generate this representation from video, which extracts both static and non-static relevant space-time segments, and also preserves their hierarchical and temporal relationships. Using simple linear SVM on the resultant bag of hierarchical space-time segments representation, we attain better than, or comparable to, state-of-the-art action recognition performance on two challenging benchmark datasets and at the same time produce good action localization results.
6 0.13673584 410 iccv-2013-Support Surface Prediction in Indoor Scenes
7 0.13514645 64 iccv-2013-Box in the Box: Joint 3D Layout and Object Reasoning from Single Images
8 0.13255036 1 iccv-2013-3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding
9 0.13181169 201 iccv-2013-Holistic Scene Understanding for 3D Object Detection with RGBD Cameras
10 0.12423185 2 iccv-2013-3D Scene Understanding by Voxel-CRF
11 0.12411442 42 iccv-2013-Active MAP Inference in CRFs for Efficient Semantic Segmentation
12 0.12108966 442 iccv-2013-Video Segmentation by Tracking Many Figure-Ground Segments
13 0.11864164 72 iccv-2013-Characterizing Layouts of Outdoor Scenes Using Spatial Topic Processes
14 0.11582498 9 iccv-2013-A Flexible Scene Representation for 3D Reconstruction Using an RGB-D Camera
15 0.11576169 386 iccv-2013-Sequential Bayesian Model Update under Structured Scene Prior for Semantic Road Scenes Labeling
16 0.11217988 317 iccv-2013-Piecewise Rigid Scene Flow
17 0.11121269 366 iccv-2013-STAR3D: Simultaneous Tracking and Reconstruction of 3D Objects Using RGB-D Data
18 0.10573545 367 iccv-2013-SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels
19 0.10000877 128 iccv-2013-Dynamic Probabilistic Volumetric Models
20 0.098031603 47 iccv-2013-Alternating Regression Forests for Object Detection and Pose Estimation
topicId topicWeight
[(0, 0.247), (1, -0.07), (2, -0.001), (3, 0.008), (4, 0.108), (5, 0.024), (6, -0.08), (7, -0.034), (8, -0.028), (9, -0.125), (10, -0.015), (11, 0.074), (12, -0.046), (13, 0.059), (14, 0.041), (15, -0.012), (16, -0.086), (17, -0.088), (18, -0.077), (19, -0.04), (20, -0.121), (21, -0.043), (22, 0.048), (23, -0.001), (24, -0.012), (25, -0.111), (26, 0.029), (27, 0.055), (28, -0.015), (29, -0.025), (30, 0.004), (31, 0.057), (32, -0.072), (33, 0.036), (34, -0.074), (35, 0.072), (36, -0.079), (37, -0.061), (38, 0.052), (39, -0.025), (40, -0.052), (41, 0.027), (42, -0.028), (43, -0.013), (44, 0.028), (45, 0.057), (46, -0.113), (47, -0.015), (48, -0.035), (49, -0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.95321584 132 iccv-2013-Efficient 3D Scene Labeling Using Fields of Trees
Author: Olaf Kähler, Ian Reid
Abstract: We address the problem of 3D scene labeling in a structured learning framework. Unlike previous work which uses structured Support Vector Machines, we employ the recently described Decision Tree Field and Regression Tree Field frameworks, which learn the unary and binary terms of a Conditional Random Field from training data. We show this has significant advantages in terms of inference speed, while maintaining similar accuracy. We also demonstrate empirically the importance for overall labeling accuracy of features that make use of prior knowledge about the coarse scene layout such as the location of the ground plane. We show how this coarse layout can be estimated by our framework automatically, and that this information can be used to bootstrap improved accuracy in the detailed labeling.
2 0.73890394 2 iccv-2013-3D Scene Understanding by Voxel-CRF
Author: Byung-Soo Kim, Pushmeet Kohli, Silvio Savarese
Abstract: Scene understanding is an important yet very challenging problem in computer vision. In the past few years, researchers have taken advantage of the recent diffusion of depth-RGB (RGB-D) cameras to help simplify the problem of inferring scene semantics. However, while the added 3D geometry is certainly useful to segment out objects with different depth values, it also adds complications in that the 3D geometry is often incorrect because of noisy depth measurements and the actual 3D extent of the objects is usually unknown because of occlusions. In this paper we propose a new method that allows us to jointly refine the 3D reconstruction of the scene (raw depth values) while accurately segmenting out the objects or scene elements from the 3D reconstruction. This is achieved by introducing a new model which we call Voxel-CRF. The Voxel-CRF model is based on the idea of constructing a conditional random field over a 3D volume of interest which captures the semantic and 3D geometric relationships among different elements (voxels) of the scene. Such a model allows us to jointly estimate (1) a dense voxel-based 3D reconstruction and (2) the semantic labels associated with each voxel even in the presence of partial occlusions using an approximate yet efficient inference strategy. We evaluated our method on the challenging NYU Depth dataset (Versions 1 and 2). Experimental results show that our method achieves competitive accuracy in inferring scene semantics and visually appealing results in improving the quality of the 3D reconstruction. We also demonstrate an interesting application of object removal and scene completion from RGB-D images.
3 0.73445708 144 iccv-2013-Estimating the 3D Layout of Indoor Scenes and Its Clutter from Depth Sensors
Author: Jian Zhang, Chen Kan, Alexander G. Schwing, Raquel Urtasun
Abstract: In this paper we propose an approach to jointly estimate the layout ofrooms as well as the clutterpresent in the scene using RGB-D data. Towards this goal, we propose an effective model that is able to exploit both depth and appearance features, which are complementary. Furthermore, our approach is efficient as we exploit the inherent decomposition of additive potentials. We demonstrate the effectiveness of our approach on the challenging NYU v2 dataset and show that employing depth reduces the layout error by 6% and the clutter estimation by 13%.
4 0.73120493 410 iccv-2013-Support Surface Prediction in Indoor Scenes
Author: Ruiqi Guo, Derek Hoiem
Abstract: In this paper, we present an approach to predict the extent and height of supporting surfaces such as tables, chairs, and cabinet tops from a single RGBD image. We define support surfaces to be horizontal, planar surfaces that can physically support objects and humans. Given an RGBD image, our goal is to localize the height and full extent of such surfaces in 3D space. To achieve this, we created a labeling tool and annotated 1449 images with rich, complete 3D scene models in the NYU dataset. We extract ground truth from the annotated dataset and develop a pipeline for predicting floor space, walls, and the height and full extent of support surfaces. Finally, we match the predicted extent with annotated scenes in the training set and transfer the support surface configuration from the training scenes. We evaluate the proposed approach on our dataset and demonstrate its effectiveness in understanding scenes in 3D space.
5 0.70503569 201 iccv-2013-Holistic Scene Understanding for 3D Object Detection with RGBD Cameras
Author: Dahua Lin, Sanja Fidler, Raquel Urtasun
Abstract: In this paper, we tackle the problem of indoor scene understanding using RGBD data. Towards this goal, we propose a holistic approach that exploits 2D segmentation, 3D geometry, as well as contextual relations between scenes and objects. Specifically, we extend the CPMC [3] framework to 3D in order to generate candidate cuboids, and develop a conditional random field to integrate information from different sources to classify the cuboids. With this formulation, scene classification and 3D object recognition are coupled and can be jointly solved through probabilistic inference. We test the effectiveness of our approach on the challenging NYU v2 dataset. The experimental results demonstrate that through effective evidence integration and holistic reasoning, our approach achieves substantial improvement over the state-of-the-art.
6 0.69434434 1 iccv-2013-3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding
7 0.68207836 375 iccv-2013-Scene Collaging: Analysis and Synthesis of Natural Images with Semantic Layers
8 0.67447847 64 iccv-2013-Box in the Box: Joint 3D Layout and Object Reasoning from Single Images
9 0.65805942 404 iccv-2013-Structured Forests for Fast Edge Detection
10 0.62888783 148 iccv-2013-Example-Based Facade Texture Synthesis
11 0.62464285 386 iccv-2013-Sequential Bayesian Model Update under Structured Scene Prior for Semantic Road Scenes Labeling
12 0.62402266 448 iccv-2013-Weakly Supervised Learning of Image Partitioning Using Decision Trees with Structured Split Criteria
13 0.62068242 102 iccv-2013-Data-Driven 3D Primitives for Single Image Understanding
14 0.60898036 412 iccv-2013-Synergistic Clustering of Image and Segment Descriptors for Unsupervised Scene Understanding
15 0.60765511 72 iccv-2013-Characterizing Layouts of Outdoor Scenes Using Spatial Topic Processes
16 0.59709716 42 iccv-2013-Active MAP Inference in CRFs for Efficient Semantic Segmentation
17 0.57483011 250 iccv-2013-Lifting 3D Manhattan Lines from a Single Image
18 0.57089961 352 iccv-2013-Revisiting Example Dependent Cost-Sensitive Learning with Decision Trees
19 0.56976402 79 iccv-2013-Coherent Object Detection with 3D Geometric Context from a Single Image
20 0.56807351 57 iccv-2013-BOLD Features to Detect Texture-less Objects
topicId topicWeight
[(2, 0.088), (7, 0.015), (12, 0.013), (26, 0.102), (31, 0.029), (40, 0.24), (42, 0.095), (48, 0.013), (64, 0.045), (73, 0.038), (89, 0.235)]
simIndex simValue paperId paperTitle
Author: Ying Fu, Antony Lam, Imari Sato, Takahiro Okabe, Yoichi Sato
Abstract: Hyperspectral imaging is beneficial to many applications, but current methods do not consider fluorescent effects, which are present in everyday items ranging from paper, to clothing, to even our food. Furthermore, everyday fluorescent items exhibit a mix of reflectance and fluorescence, so proper separation of these components is necessary for analyzing them. In this paper, we demonstrate efficient separation and recovery of reflective and fluorescent emission spectra through the use of high frequency illumination in the spectral domain. With the obtained fluorescent emission spectra from our high frequency illuminants, we then present, to our knowledge, the first method for estimating the fluorescent absorption spectrum of a material given its emission spectrum. Conventional bispectral measurement of absorption and emission spectra needs to examine all combinations of incident and observed light wavelengths. In contrast, our method requires only two hyperspectral images. The effectiveness of our proposed methods is then evaluated through a combination of simulation and real experiments. We also demonstrate an application of our method to synthetic relighting of real scenes.
2 0.92801946 318 iccv-2013-PixelTrack: A Fast Adaptive Algorithm for Tracking Non-rigid Objects
Author: Stefan Duffner, Christophe Garcia
Abstract: In this paper, we present a novel algorithm for fast tracking of generic objects in videos. The algorithm uses two components: a detector that makes use of the generalised Hough transform with pixel-based descriptors, and a probabilistic segmentation method based on global models for foreground and background. These components are used for tracking in a combined way, and they adapt each other in a co-training manner. Through effective model adaptation and segmentation, the algorithm is able to track objects that undergo rigid and non-rigid deformations and considerable shape and appearance variations. The proposed tracking method has been thoroughly evaluated on challenging standard videos, and outperforms state-of-the-art tracking methods designed for the same task. Finally, the proposed models allow for an extremely efficient implementation, and thus tracking is very fast.
3 0.92622316 421 iccv-2013-Total Variation Regularization for Functions with Values in a Manifold
Author: Jan Lellmann, Evgeny Strekalovskiy, Sabrina Koetter, Daniel Cremers
Abstract: While total variation is among the most popular regularizers for variational problems, its extension to functions with values in a manifold is an open problem. In this paper, we propose the first algorithm to solve such problems which applies to arbitrary Riemannian manifolds. The key idea is to reformulate the variational problem as a multilabel optimization problem with an infinite number of labels. This leads to a hard optimization problem which can be approximately solved using convex relaxation techniques. The framework can be easily adapted to different manifolds including spheres and three-dimensional rotations, and allows accurate solutions to be obtained even with a relatively coarse discretization. With numerous examples we demonstrate that the proposed framework can be applied to variational models that incorporate chromaticity values, normal fields, or camera trajectories.
same-paper 4 0.85979933 132 iccv-2013-Efficient 3D Scene Labeling Using Fields of Trees
Author: Olaf Kähler, Ian Reid
Abstract: We address the problem of 3D scene labeling in a structured learning framework. Unlike previous work which uses structured Support Vector Machines, we employ the recently described Decision Tree Field and Regression Tree Field frameworks, which learn the unary and binary terms of a Conditional Random Field from training data. We show this has significant advantages in terms of inference speed, while maintaining similar accuracy. We also demonstrate empirically the importance for overall labeling accuracy of features that make use of prior knowledge about the coarse scene layout such as the location of the ground plane. We show how this coarse layout can be estimated by our framework automatically, and that this information can be used to bootstrap improved accuracy in the detailed labeling.
5 0.85023797 160 iccv-2013-Fast Object Segmentation in Unconstrained Video
Author: Anestis Papazoglou, Vittorio Ferrari
Abstract: We present a technique for separating foreground objects from the background in a video. Our method is fast, fully automatic, and makes minimal assumptions about the video. This enables handling essentially unconstrained settings, including rapidly moving background, arbitrary object motion and appearance, and non-rigid deformations and articulations. In experiments on two datasets containing over 1400 video shots, our method outperforms a state-of-the-art background subtraction technique [4] as well as methods based on clustering point tracks [6, 18, 19]. Moreover, it performs comparably to recent video object segmentation methods based on object proposals [14, 16, 27], while being orders of magnitude faster.
6 0.83381468 445 iccv-2013-Visual Reranking through Weakly Supervised Multi-graph Learning
7 0.81993979 391 iccv-2013-Sieving Regression Forest Votes for Facial Feature Detection in the Wild
8 0.81563437 47 iccv-2013-Alternating Regression Forests for Object Detection and Pose Estimation
9 0.79813933 447 iccv-2013-Volumetric Semantic Segmentation Using Pyramid Context Features
10 0.79707885 404 iccv-2013-Structured Forests for Fast Edge Detection
11 0.79117727 217 iccv-2013-Initialization-Insensitive Visual Tracking through Voting with Salient Local Features
12 0.78808957 340 iccv-2013-Real-Time Articulated Hand Pose Estimation Using Semi-supervised Transductive Regression Forests
13 0.78623068 386 iccv-2013-Sequential Bayesian Model Update under Structured Scene Prior for Semantic Road Scenes Labeling
14 0.78195596 78 iccv-2013-Coherent Motion Segmentation in Moving Camera Videos Using Optical Flow Orientations
15 0.7815631 336 iccv-2013-Random Forests of Local Experts for Pedestrian Detection
16 0.78098559 172 iccv-2013-Flattening Supervoxel Hierarchies by the Uniform Entropy Slice
17 0.78005105 448 iccv-2013-Weakly Supervised Learning of Image Partitioning Using Decision Trees with Structured Split Criteria
18 0.77975452 414 iccv-2013-Temporally Consistent Superpixels
19 0.77929932 89 iccv-2013-Constructing Adaptive Complex Cells for Robust Visual Tracking
20 0.77862525 254 iccv-2013-Live Metric 3D Reconstruction on Mobile Phones