nips nips2011 nips2011-247 knowledge-graph by maker-knowledge-mining

247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes


Source: pdf

Author: Hema S. Koppula, Abhishek Anand, Thorsten Joachims, Ashutosh Saxena

Abstract: Inexpensive RGB-D cameras that give an RGB image together with depth data have become widely available. In this paper, we use this data to build 3D point clouds of full indoor scenes such as an office and address the task of semantic labeling of these 3D point clouds. We propose a graphical model that captures various features and contextual relations, including the local visual appearance and shape cues, object co-occurence relationships and geometric relationships. With a large number of object classes and relations, the model’s parsimony becomes important and we address that by using multiple types of edge potentials. The model admits efficient approximate inference, and we train it using a maximum-margin learning approach. In our experiments over a total of 52 3D scenes of homes and offices (composed from about 550 views, having 2495 segments labeled with 27 object classes), we get a performance of 84.06% in labeling 17 object classes for offices, and 73.38% in labeling 17 object classes for home scenes. Finally, we applied these algorithms successfully on a mobile robot for the task of finding objects in large cluttered rooms.1 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper, we use this data to build 3D point clouds of full indoor scenes such as an office and address the task of semantic labeling of these 3D point clouds. [sent-5, score-0.382]

2 We propose a graphical model that captures various features and contextual relations, including the local visual appearance and shape cues, object co-occurence relationships and geometric relationships. [sent-6, score-0.717]

3 With a large number of object classes and relations, the model’s parsimony becomes important and we address that by using multiple types of edge potentials. [sent-7, score-0.371]

4 In our experiments over a total of 52 3D scenes of homes and offices (composed from about 550 views, having 2495 segments labeled with 27 object classes), we get a performance of 84. [sent-9, score-0.515]

5 However, a lot of valuable information about the shape and geometric layout of objects is lost when a 2D image is formed from the corresponding 3D world. [sent-19, score-0.444]

6 We then (over-)segment the scene and predict semantic labels for each segment (see Fig. [sent-32, score-0.513]

7 Figure 1: Office scene (top) and home scene (bottom) with the corresponding label coloring above the images. [sent-38, score-0.368]

8 In this paper, we propose and evaluate the first model and learning algorithm for scene understanding that exploits rich relational information derived from the full-scene 3D point cloud for object labeling. [sent-44, score-0.484]

9 Each 3D segment is associated with a node, and pairwise potentials model the relationships between segments (e. [sent-46, score-0.649]

10 Some features are better indicators of label similarity, while other features are better indicators of nonassociative relations such as geometric arrangement (e. [sent-52, score-0.411]

11 We consider labeling each segment (from a total of about 50 segments per scene) into 27 classes (17 for offices and 17 for homes, with 7 classes in common). [sent-61, score-0.754]

12 We also consider the problem of labeling 3D segments with multiple attributes meaningful in a robotics context (such as small objects that can be manipulated, furniture, etc. [sent-65, score-0.61]

13 Previous works focus on several different aspects: designing good local features such as HOG (histograms of oriented gradients) [5] and bag of words [4], and designing good global (context) features such as GIST features [33]. [sent-69, score-0.396]

14 However, a lot of valuable information about the shape and geometric layout of objects is lost when a 2D image is formed from the corresponding 3D world. [sent-75, score-0.444]

15 The recent availability of synchronized color and depth videos from RGB-D (Kinect-style) depth cameras has shifted the focus to making use of both visual and shape features for object detection [9, 18, 19, 24, 26] and 3D segmentation (e. [sent-84, score-0.538]

16 However, these works do not make use of the contextual relationships between various objects which have been shown to be useful for tasks such as object detection and scene understanding in 2D images. [sent-88, score-0.567]

17 Our goal is to perform semantic labeling of indoor scenes by modeling and learning several contextual relationships. [sent-89, score-0.378]

18 There is also some recent work in labeling outdoor scenes obtained from LIDAR data into a few geometric classes (e. [sent-90, score-0.376]

19 [8, 30] capture context by designing node features and [36] do so by stacking layers of classifiers; however, these methods do not model the correlation between the labels. [sent-94, score-0.334]

20 However, many relative features between objects are not associative in nature. [sent-97, score-0.419]

21 , a ground segment cannot be “on top of” another ground segment. [sent-100, score-0.376]

22 Furthermore, these methods only consider very few geometric classes (three to five classes) in outdoor environments, whereas we consider a large number of object classes for labeling the indoor RGB-D data. [sent-103, score-0.515]

23 The most related work to ours is [35], where they label the planar patches in a point-cloud of an indoor scene with four geometric labels (walls, floors, ceilings, clutter). [sent-104, score-0.374]

24 In comparison, our basic representation is a 3D segment (as compared to planar patches) and we consider a much larger number of classes (beyond just the geometric classes). [sent-107, score-0.406]

25 The reasonable success of object detection in 2D images shows that visual appearance is a good indicator for labeling scenes. [sent-121, score-0.402]

26 1 Model Formulation We model the three-dimensional structure of a scene using a model isomorphic to a Markov Random Field with log-linear node and pairwise edge potentials. [sent-139, score-0.393]

27 , xN ) consisting of segments xi , we aim to predict a labeling y = (y1 , . [sent-143, score-0.358]

28 Each segment label yi is itself a vector of K binary class labels yi = (y_i^1, . . . , y_i^K). [sent-147, score-0.637]

29 Each y_i^k ∈ {0, 1} indicates whether segment i is a member of class k. [sent-150, score-0.588]

30 Note that multiple y_i^k can be 1 for each segment (e. [sent-151, score-0.434]

31 Some features capture the spatial location of an object in the scene (e.g., N9: extent of the bounding box). [sent-195, score-0.593]

32 Some features capture the spatial location of an object in the scene and its shape. [sent-215, score-0.593]

33 We connect two segments (nodes) i and j by an edge if there exists a point in segment i and a point in segment j which are less than context range distance apart. [sent-223, score-1.442]

34 This captures the closest distance between two segments (as compared to the centroid distance between the segments)—we study the effect of context range more in Section 4. [sent-225, score-0.869]

35 Some features capture the spatial location of an object in the scene, such as its location above ground. [sent-226, score-0.975]

36 The edge features φt (i, j) are listed in Table 1 (right). [sent-227, score-0.365]

37 They consist of associative features (E1-E2) based on visual appearance and local shape, as well as non-associative features (E3-E8) that capture the tendencies of two objects to occur in certain configurations. [sent-232, score-1.165]

38 Note that our features are insensitive to horizontal translation and rotation of the camera. [sent-234, score-0.416]

39 However, our features place a lot of emphasis on the vertical direction because gravity influences the shape and relative positions of objects to a large extent. [sent-235, score-0.706]

40 This captures the closest distance between the segments (as compared to the centroid distance between the segments). [sent-236, score-0.963]

41 The edge features φt (i, j) (Table 1-right) consist of associative features (E1-E2) based on visual appearance and local shape, as well as non-associative features (E3-E8) that capture the tendencies of two objects to occur in certain configurations. [sent-238, score-1.064]

42 However, our features place a lot of emphasis on the vertical direction because gravity influences the shape and relative positions of objects to a large extent. [sent-240, score-0.583]

43 ŷ = argmax_y max_z Σ_i Σ_k y_i^k (w_n^k · φn(i)) + Σ_(i,j) Σ_t Σ_(l,k)∈Tt z_ij^lk (w_t^lk · φt(i, j)), subject to ∀i, j, l, k: z_ij^lk ≤ y_i^l, z_ij^lk ≤ y_j^k, y_i^l + y_j^k ≤ z_ij^lk + 1, z_ij^lk, y_i^l ∈ {0, 1}. [sent-267, score-3.421]

44 Relaxing the variables z_ij^lk and y_i^l to the interval [0, 1] leads to a linear program that can be shown to always have half-integral solutions (i.e., values in {0, 0.5, 1}). [sent-269, score-2.426]

45 Furthermore, the products y_i^l y_j^k have been replaced by auxiliary variables z_ij^lk. [sent-273, score-0.516]

46 This relaxation can also be solved as a quadratic pseudo-Boolean optimization problem using the graph-cut method (about 10 sec for labeling a typical scene in our experiments). [sent-280, score-0.347]

47 We allow such multi-labelings (i.e., several y_i^k set to 1) in our attribute experiments where each segment can have multiple attributes, but not in the segment labeling experiments where each segment can have only one label. [sent-288, score-1.527]

48 This is orders of magnitude faster than using a general purpose LP solver for a typical labeled scene in our experiments. [sent-294, score-0.464]

49 Since every segment in our experiments is in exactly one class, we also consider the linear relaxation from above with the additional constraint ∀i: Σ_{j=1..K} y_i^j = 1. [sent-302, score-0.371]

50 Any segment for which the value of y_i^l is integral (i.e., not 0.5) is labeled as in the optimal discrete solution ŷ = argmax_y fw(x, y). [sent-305, score-0.364]

51 Persistence says that any segment for which the value of y_i^l is integral in ŷcut (i.e., 0 or 1) keeps that label in an optimal solution. [sent-310, score-0.521]

52 Since every segment in our experiments is in exactly one class, we also consider the linear relaxation ŷLP from above with the additional constraint ∀i: Σ_{j=1..K} y_i^j = 1. [sent-314, score-0.789]
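
To make the mixed-integer formulation above concrete, the sketch below builds the same objective and linearization constraints with the PuLP modeling library; switching the variables from Binary to Continuous gives the half-integral LP relaxation. It is a minimal illustration under assumed inputs (node_score, edge_score, the single_label flag), not the authors' implementation, which relies on a graph-cut / quadratic pseudo-Boolean solver for speed.

    from pulp import LpProblem, LpVariable, LpMaximize, lpSum

    def build_labeling_program(N, K, edges, node_score, edge_score,
                               relax=True, single_label=False):
        """node_score[i][k] = w_n^k . phi_n(i); edge_score[(i, j)][(l, k)] = sum_t w_t^lk . phi_t(i, j)."""
        cat = "Continuous" if relax else "Binary"   # relax=True -> LP with half-integral optima
        prob = LpProblem("semantic_labeling", LpMaximize)
        y = {(i, k): LpVariable(f"y_{i}_{k}", 0, 1, cat) for i in range(N) for k in range(K)}
        z = {(i, j, l, k): LpVariable(f"z_{i}_{j}_{l}_{k}", 0, 1, cat)
             for (i, j) in edges for l in range(K) for k in range(K)}
        # Objective: node potentials plus edge potentials, with z_ij^lk standing in for y_i^l * y_j^k.
        prob += (lpSum(node_score[i][k] * y[i, k] for i in range(N) for k in range(K))
                 + lpSum(edge_score[(i, j)].get((l, k), 0.0) * z[i, j, l, k]
                         for (i, j) in edges for l in range(K) for k in range(K)))
        # Linearization constraints tying each z to the product of the two labels.
        for (i, j) in edges:
            for l in range(K):
                for k in range(K):
                    prob += z[i, j, l, k] <= y[i, l]
                    prob += z[i, j, l, k] <= y[j, k]
                    prob += y[i, l] + y[j, k] <= z[i, j, l, k] + 1
        if single_label:
            # "Exactly one class per segment" variant used for segment labeling.
            for i in range(N):
                prob += lpSum(y[i, k] for k in range(K)) == 1
        return prob, y, z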

53 The discriminant function captures the dependencies between segment labels as defined by an undirected graph (V, E) with vertices V = {1, . . . , N} and edges E. [sent-322, score-0.383]

54 We use the following discriminant function based on individual segment features φn(i) and edge features φt(i, j), as further described below. [sent-331, score-0.73]

55 fw(y, x) = Σ_{i∈V} Σ_{k=1..K} y_i^k (w_n^k · φn(i)) + Σ_{(i,j)∈E} Σ_{Tt∈T} Σ_{(l,k)∈Tt} y_i^l y_j^k (w_t^lk · φt(i, j))   (2) The node feature map φn(i) describes segment i through a vector of features, and there is one weight vector for each of the K classes. [sent-332, score-1.116]
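
For readers who prefer code to the summation in Eq. (2), the following NumPy sketch scores a candidate labeling; the variable names (node_feats, w_node, type_pairs, etc.) are illustrative assumptions rather than the authors' actual data structures.

    import numpy as np

    def discriminant_score(y, node_feats, w_node, edges, edge_feats, w_edge, type_pairs):
        """f_w(y, x) as in Eq. (2).
        y          : (N, K) binary matrix, y[i, k] = y_i^k
        node_feats : (N, Dn) array of phi_n(i)
        w_node     : (K, Dn) array, one weight vector w_n^k per class
        edges      : list of (i, j) segment pairs
        edge_feats : {edge type t: (num_edges, Dt) array of phi_t(i, j)}
        w_edge     : {edge type t: {(l, k): (Dt,) weight vector w_t^lk for (l, k) in T_t}}
        type_pairs : {edge type t: set of class pairs T_t}
        """
        # Node potentials: sum_i sum_k y_i^k (w_n^k . phi_n(i))
        score = float(np.sum(y * (node_feats @ w_node.T)))
        # Edge potentials: sum over edges, edge types, and class pairs in T_t
        for t, pairs in type_pairs.items():
            for e, (i, j) in enumerate(edges):
                phi = edge_feats[t][e]
                for (l, k) in pairs:
                    if y[i, l] and y[j, k]:
                        score += float(w_edge[t][(l, k)] @ phi)
        return score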

56 The edge feature maps φt (i, j) describe the relationship between segments i and j. [sent-334, score-0.373]

57 Examples of edge features are the ones capturing similarity in visual appearance and geometric context. [sent-335, score-0.476]

58 If Tt contains an edge between classes l and k, then this feature map and a weight vector w_t^lk are used to model the dependencies between classes l and k. [sent-337, score-0.577]

59 We say that a type t of edge features is modeled by an associative edge potential if Tt = {(k, k)|∀k = 1. [sent-339, score-0.552]

60 In our experiments we distinguished between two types of edge feature maps—“object-associative” features φoa (i, j) used between classes that are parts of the same object (e. [sent-347, score-0.463]

61 , features that indicate whether two adjacent segments belong to the same class or object. [sent-352, score-0.373]

62 In this parsimonious model (referred to as svm mrf parsimon), we model object-associative features using object-associative edge potentials and non-associative features using non-associative edge potentials. [sent-354, score-1.107]

63 Due to this, the model we learn with both types of edge features has far fewer parameters than a model learnt with all edge features as non-associative. [sent-357, score-0.528]
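
A quick back-of-the-envelope view of that parsimony argument (the numbers below are purely hypothetical, not the paper's feature dimensions): a non-associative potential needs a weight block for every ordered class pair, while an object-associative potential only needs one per within-object class pair.

    K = 27             # number of object classes
    D_oa, D_na = 2, 6  # hypothetical counts of object-associative (E1-E2) and non-associative (E3-E8) features
    P = 40             # hypothetical number of within-object class pairs

    all_nonassociative = (D_oa + D_na) * K * K   # every edge feature gets K^2 class-pair weights
    parsimonious = D_oa * P + D_na * K * K       # object-associative features only over the P pairs
    print(all_nonassociative, parsimonious)      # 5832 vs 4454 weight entries in this toy setting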

64 ri is the ray vector to the centroid of segment i from the position of the camera in which it was captured. [sent-362, score-0.483]

65 n̂i is the unit normal of segment i, which points towards the camera (i.e., ri · n̂i < 0). [sent-364, score-0.374]
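
These two quantities are cheap to compute from a segment's points and the pose of the camera that captured them; the sketch below (NumPy, with illustrative names) also shows the usual fix of flipping the estimated normal so that it faces the camera, along with a few of the geometry-style node features mentioned above.

    import numpy as np

    def segment_geometry(points, normal, camera_pos):
        """points: (M, 3) array of a segment's 3D points; normal: (3,) estimated normal;
        camera_pos: (3,) position of the camera the segment was captured from."""
        centroid = points.mean(axis=0)
        r = centroid - camera_pos                # ray vector r_i from the camera to the centroid
        n = normal / np.linalg.norm(normal)
        if np.dot(r, n) > 0:                     # flip so the unit normal n_i points towards the camera
            n = -n
        vertical = np.array([0.0, 0.0, 1.0])     # assumes z is the up / gravity axis
        return {
            "centroid_height": float(centroid[2]),                  # location above ground
            "cos_angle_with_vertical": float(np.dot(n, vertical)),  # orientation cue
            "bbox_extent": points.max(axis=0) - points.min(axis=0),
        }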

66 Table 1 (left) lists the node features, including the average of HOG features of the image blocks spanned by the points of a segment, followed by the local shape and geometry features (N4 onward). [sent-371, score-0.447]

67 Table 1 also gives the count of each feature type and the edge features for (segment i, segment j). [sent-381, score-0.412]

68 Some features capture spatial location of an object in the scene (e. [sent-401, score-0.593]

69 We connect two segments (nodes) i and j by an edge if there exists a point in segment i and a point in segment j which are less than context range distance apart. [sent-404, score-1.069]

70 This captures the closest distance between two segments (as compared to centroid distance between the segments)—we study the effect of context range more in Section 4. [sent-405, score-0.551]
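
A sketch of this neighborhood construction: an edge is added whenever the closest pair of points between two segments is within the context range. The SciPy k-d tree call is standard; the segment point arrays and the default range are placeholders for illustration (the effect of the range is what the paper studies in Section 4).

    import numpy as np
    from scipy.spatial import cKDTree

    def build_context_edges(segments, context_range=0.3):
        """segments: list of (M_i, 3) point arrays; returns the list of (i, j) edges.
        The test uses the closest point-to-point distance, not the centroid distance."""
        trees = [cKDTree(pts) for pts in segments]
        edges = []
        for i in range(len(segments)):
            for j in range(i + 1, len(segments)):
                d, _ = trees[j].query(segments[i], k=1)  # nearest neighbor in segment j for every point of i
                if d.min() < context_range:
                    edges.append((i, j))
        return edges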

71 The edge features φt (i, j) (Table 1-right) consist of associative features (E1-E2) based on visual appearance and local shape, as well as non-associative features (E3-E8) that capture the tendencies of two objects to occur in certain configurations. [sent-406, score-1.064]

72 However, our features place a lot of emphasis on the vertical direction because gravity influences the shape and relative positions of objects to a large extent. [sent-408, score-0.583]

73 Persistence says that any segment for which the value of y_i^l is integral in ŷcut (i.e., 0 or 1) keeps that label in an optimal solution. [sent-424, score-0.521]

74 Since every segment in our experiments is in exactly one class, we also consider the linear relaxation from above with the additional constraint ∀i: Σ_{j=1..K} y_i^j = 1. [sent-428, score-0.525]

75 , during our robotic experiments), objects other than the 27 objects we modeled might be present (e. [sent-438, score-0.36]

76 Eq. (3) can be equivalently written as w · Ψ(x, y) by appropriately stacking the w_n^k and w_t^lk into w and the y_i^k φn(i) and z_ij^lk φt(i, j) into Ψ(x, y), where each z_ij^lk is consistent with Eq. [sent-460, score-0.36]

77 ∀ȳ ∈ {0, 0.5, 1}^{N·K}: (1/n) w^T Σ_{i=1..n} [Ψ(xi, yi) − Ψ(xi, ȳi)] ≥ (1/n) Σ_{i=1..n} ∆(yi, ȳi) − ξ. While the number of constraints in this quadratic program is exponential in n, N, and K, it can nevertheless be solved efficiently using the cutting-plane algorithm for training structural SVMs [15]. [sent-468, score-0.508]
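
The cutting-plane procedure referenced here can be sketched as follows: alternately find the currently most violated (loss-augmented) constraint and re-solve a small QP over the working set. In the sketch, loss_augmented_inference and solve_qp are hypothetical placeholders standing in for the MAP inference above and a generic QP solver; this is a schematic of the standard 1-slack algorithm, not the authors' code.

    import numpy as np

    def cutting_plane_train(examples, psi, delta, loss_augmented_inference, solve_qp,
                            C=1.0, eps=1e-3, max_iter=100):
        """examples: list of (x, y) scenes with ground-truth labelings y.
        psi(x, y) -> joint feature vector Psi(x, y); delta(y, ybar) -> labeling loss."""
        dim = psi(*examples[0]).shape[0]
        w, xi = np.zeros(dim), 0.0
        working_set = []                                   # accumulated (dpsi, loss) constraints
        for _ in range(max_iter):
            dpsi, loss = np.zeros(dim), 0.0
            for (x, y) in examples:                        # most violated joint constraint under current w
                ybar = loss_augmented_inference(w, x, y)   # argmax_y' w.psi(x, y') + delta(y, y')
                dpsi += psi(x, y) - psi(x, ybar)
                loss += delta(y, ybar)
            dpsi, loss = dpsi / len(examples), loss / len(examples)
            if loss - w @ dpsi <= xi + eps:                # nothing violated by more than eps: done
                break
            working_set.append((dpsi, loss))
            # Re-solve: min 1/2 ||w||^2 + C*xi  s.t.  w.dpsi_c >= loss_c - xi  for all c in working_set
            w, xi = solve_qp(working_set, C)
        return w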

78 Data: We consider labeling object segments in a full 3D scene (as compared to 2. [sent-482, score-0.683]

79 This gave us a total of 1108 labeled segments in the office scenes and 1387 segments in the home scenes. [sent-489, score-0.778]

80 Often one object may be divided into multiple segments because of over-segmentation. [sent-490, score-0.382]

81 We use both the macro and micro averaging to aggregate precision and recall over various classes. [sent-498, score-0.344]
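
For reference, micro-averaging pools the per-class counts before computing precision and recall, while macro-averaging computes them per class and then takes an unweighted mean, so rare classes count equally. A minimal NumPy sketch:

    import numpy as np

    def micro_macro_precision_recall(tp, fp, fn):
        """tp, fp, fn: length-K arrays of per-class true positives, false positives, false negatives."""
        tp, fp, fn = map(np.asarray, (tp, fp, fn))
        micro_p = tp.sum() / (tp.sum() + fp.sum())
        micro_r = tp.sum() / (tp.sum() + fn.sum())
        per_class_p = tp / np.maximum(tp + fp, 1)   # guard against empty classes
        per_class_r = tp / np.maximum(tp + fn, 1)
        return micro_p, micro_r, per_class_p.mean(), per_class_r.mean()

    # Toy example: the frequent class dominates the micro averages, the rare class pulls the macro average down.
    print(micro_macro_precision_recall(tp=[90, 1], fp=[10, 9], fn=[5, 4]))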

82 Figure 1 shows the original point cloud, ground-truth and predicted labels for one office (top) and one home scene (bottom). [sent-502, score-0.396]

83 The table shows average micro precision/recall, and average macro precision and recall for home and office scenes. [sent-504, score-0.507]

84 Table 2 reports, for Office Scenes and Home Scenes, the micro P/R and the macro Precision and Recall for each feature set and algorithm; the first row is the max class baseline that uses no features. [sent-505, score-0.542]

85 To show the effect of the features independent of the effect of context, we only use the node potentials from our model, referred to as svm node only in Table 2. [sent-560, score-0.449]

86 The type of contextual relations we capture depends on the type of edge potentials we model. [sent-571, score-0.356]

87 To study this, we compared our method with models using only associative or only non-associative edge potentials referred to as svm mrf assoc and svm mrf nonassoc respectively. [sent-572, score-0.823]

88 We observed that modeling all edge features using associative potentials is poor compared to our full model. [sent-573, score-0.492]

89 In fact, using only associative potentials showed a drop in performance compared to the svm node only model on the office dataset. [sent-574, score-0.396]

90 Our svm mrf nonassoc model does so by modeling all edge features using non-associative potentials, which can favor or disfavor labels of different classes for nearby segments. [sent-576, score-0.667]

91 It gives higher precision and recall compared to svm node only and svm mrf assoc. [sent-577, score-0.517]

92 However, not all the edge features are non-associative in nature, so modeling them all using only non-associative potentials could be overkill (each non-associative feature adds K^2 more parameters to be learnt). [sent-579, score-0.379]

93 Therefore using our svm mrf parsimon model to model these relations achieves higher performance in both datasets. [sent-580, score-0.333]

94 In order to study this, we compared our svm mrf parsimon with varying context range for determining the neighborhood (see Figure 3 for average micro precision vs range plot). [sent-586, score-0.656]

95 We attribute the improvement in macro precision and recall to the fact that larger scenes have more context, and models are more complete because of multiple views. [sent-603, score-0.349]

96 For example, on office data, the graphcut inference for our svm mrf parsimon gave a micro precision of 90. [sent-608, score-0.518]

97 Here, the micro precision and recall are not the same, as some of the segments might not get any label. [sent-611, score-0.508]

98 Robotic experiments: The ability to label segments is very useful for robotics applications, for example, in detecting objects (so that a robot can find/retrieve an object on request) or for other robotic tasks. [sent-614, score-0.649]

99 Attribute Learning: In some robotic tasks, such as robotic grasping, it is not important to know the exact object category, but just knowing a few attributes of an object may be useful. [sent-616, score-0.535]

100 Note that each segment in the point cloud can have multiple attributes, and therefore we can learn these attributes using our model, which naturally allows multiple labels per segment. [sent-621, score-0.602]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lk', 0.329), ('segment', 0.28), ('segments', 0.241), ('zij', 0.229), ('scene', 0.184), ('home', 0.163), ('cloud', 0.159), ('associative', 0.156), ('yi', 0.154), ('object', 0.141), ('scenes', 0.133), ('edge', 0.132), ('features', 0.132), ('objects', 0.131), ('tt', 0.131), ('micro', 0.128), ('shape', 0.121), ('mrf', 0.119), ('labeling', 0.117), ('ce', 0.114), ('centroid', 0.104), ('keyboard', 0.101), ('mip', 0.101), ('robotic', 0.098), ('precision', 0.093), ('relaxation', 0.091), ('svm', 0.091), ('appearance', 0.088), ('yj', 0.087), ('parsimon', 0.087), ('ycut', 0.087), ('vertical', 0.08), ('node', 0.077), ('macro', 0.077), ('indoor', 0.073), ('potentials', 0.072), ('icra', 0.07), ('persistence', 0.07), ('geometric', 0.068), ('context', 0.064), ('capture', 0.061), ('clouds', 0.059), ('classes', 0.058), ('kinect', 0.058), ('niz', 0.058), ('ylp', 0.058), ('attributes', 0.057), ('relationships', 0.056), ('visual', 0.056), ('contextual', 0.055), ('saxena', 0.055), ('discriminant', 0.054), ('horizontal', 0.054), ('camera', 0.053), ('minu', 0.051), ('labels', 0.049), ('argmax', 0.049), ('rgb', 0.049), ('ground', 0.048), ('cos', 0.048), ('minutes', 0.047), ('program', 0.046), ('recall', 0.046), ('position', 0.046), ('lot', 0.045), ('wall', 0.044), ('depth', 0.044), ('mobile', 0.044), ('tendencies', 0.044), ('hsv', 0.044), ('layout', 0.044), ('cut', 0.044), ('views', 0.044), ('ciz', 0.043), ('njz', 0.043), ('nonassoc', 0.043), ('nonassociative', 0.043), ('rhi', 0.043), ('lp', 0.043), ('ni', 0.041), ('location', 0.04), ('parsimony', 0.04), ('heitz', 0.038), ('gravity', 0.038), ('robot', 0.038), ('range', 0.037), ('monitor', 0.037), ('chair', 0.037), ('cluttered', 0.037), ('relations', 0.036), ('oor', 0.036), ('insensitive', 0.036), ('emphasis', 0.036), ('image', 0.035), ('spatial', 0.035), ('distance', 0.035), ('closest', 0.035), ('angle', 0.035), ('normals', 0.035), ('fw', 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes

Author: Hema S. Koppula, Abhishek Anand, Thorsten Joachims, Ashutosh Saxena

Abstract: Inexpensive RGB-D cameras that give an RGB image together with depth data have become widely available. In this paper, we use this data to build 3D point clouds of full indoor scenes such as an office and address the task of semantic labeling of these 3D point clouds. We propose a graphical model that captures various features and contextual relations, including the local visual appearance and shape cues, object co-occurence relationships and geometric relationships. With a large number of object classes and relations, the model’s parsimony becomes important and we address that by using multiple types of edge potentials. The model admits efficient approximate inference, and we train it using a maximum-margin learning approach. In our experiments over a total of 52 3D scenes of homes and offices (composed from about 550 views, having 2495 segments labeled with 27 object classes), we get a performance of 84.06% in labeling 17 object classes for offices, and 73.38% in labeling 17 object classes for home scenes. Finally, we applied these algorithms successfully on a mobile robot for the task of finding objects in large cluttered rooms.1 1

2 0.27184787 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

Author: Congcong Li, Ashutosh Saxena, Tsuhan Chen

Abstract: For most scene understanding tasks (such as object detection or depth estimation), the classifiers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by defining a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multi-class object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks. 1

3 0.24194753 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling

Author: Adrian Ion, Joao Carreira, Cristian Sminchisescu

Abstract: We present a joint image segmentation and labeling model (JSL) which, given a bag of figure-ground segment hypotheses extracted at multiple image locations and scales, constructs a joint probability distribution over both the compatible image interpretations (tilings or image segmentations) composed from those segments, and over their labeling into categories. The process of drawing samples from the joint distribution can be interpreted as first sampling tilings, modeled as maximal cliques, from a graph connecting spatially non-overlapping segments in the bag [1], followed by sampling labels for those segments, conditioned on the choice of a particular tiling. We learn the segmentation and labeling parameters jointly, based on Maximum Likelihood with a novel Incremental Saddle Point estimation procedure. The partition function over tilings and labelings is increasingly more accurately approximated by including incorrect configurations that a not-yet-competent model rates probable during learning. We show that the proposed methodology matches the current state of the art in the Stanford dataset [2], as well as in VOC2010, where 41.7% accuracy on the test set is achieved.

4 0.21333693 227 nips-2011-Pylon Model for Semantic Segmentation

Author: Victor Lempitsky, Andrea Vedaldi, Andrew Zisserman

Abstract: Graph cut optimization is one of the standard workhorses of image segmentation since for binary random field representations of the image, it gives globally optimal results and there are efficient polynomial time implementations. Often, the random field is applied over a flat partitioning of the image into non-intersecting elements, such as pixels or super-pixels. In the paper we show that if, instead of a flat partitioning, the image is represented by a hierarchical segmentation tree, then the resulting energy combining unary and boundary terms can still be optimized using graph cut (with all the corresponding benefits of global optimality and efficiency). As a result of such inference, the image gets partitioned into a set of segments that may come from different layers of the tree. We apply this formulation, which we call the pylon model, to the task of semantic segmentation where the goal is to separate an image into areas belonging to different semantic classes. The experiments highlight the advantage of inference on a segmentation tree (over a flat partitioning) and demonstrate that the optimization in the pylon model is able to flexibly choose the level of segmentation across the image. Overall, the proposed system has superior segmentation accuracy on several datasets (Graz-02, Stanford background) compared to previously suggested approaches. 1

5 0.18296461 138 nips-2011-Joint 3D Estimation of Objects and Scene Layout

Author: Andreas Geiger, Christian Wojek, Raquel Urtasun

Abstract: We propose a novel generative model that is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene. In particular, we infer the scene topology, geometry as well as traffic activities from a short video sequence acquired with a single camera mounted on a moving car. Our generative model takes advantage of dynamic information in the form of vehicle tracklets as well as static information coming from semantic labels and geometry (i.e., vanishing points). Experiments show that our approach outperforms a discriminative baseline based on multiple kernel learning (MKL) which has access to the same image information. Furthermore, as we reason about objects in 3D, we are able to significantly increase the performance of state-of-the-art object detectors in their ability to estimate object orientation. 1

6 0.16305465 127 nips-2011-Image Parsing with Stochastic Scene Grammar

7 0.15356235 180 nips-2011-Multiple Instance Filtering

8 0.14694068 165 nips-2011-Matrix Completion for Multi-label Image Classification

9 0.1205872 154 nips-2011-Learning person-object interactions for action recognition in still images

10 0.11388683 271 nips-2011-Statistical Tests for Optimization Efficiency

11 0.11193874 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning

12 0.10442711 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs

13 0.098982818 35 nips-2011-An ideal observer model for identifying the reference frame of objects

14 0.097508296 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

15 0.097031206 119 nips-2011-Higher-Order Correlation Clustering for Image Segmentation

16 0.096331529 76 nips-2011-Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials

17 0.092966646 231 nips-2011-Randomized Algorithms for Comparison-based Search

18 0.091490023 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection

19 0.090943597 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition

20 0.090754196 130 nips-2011-Inductive reasoning about chimeric creatures


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.234), (1, 0.157), (2, -0.183), (3, 0.234), (4, 0.127), (5, 0.059), (6, -0.076), (7, -0.027), (8, 0.058), (9, 0.118), (10, 0.096), (11, -0.013), (12, 0.0), (13, -0.047), (14, -0.112), (15, 0.006), (16, 0.07), (17, -0.062), (18, -0.034), (19, 0.02), (20, -0.105), (21, -0.03), (22, -0.036), (23, 0.034), (24, 0.006), (25, 0.024), (26, 0.011), (27, -0.08), (28, -0.02), (29, -0.001), (30, -0.02), (31, 0.033), (32, 0.024), (33, -0.024), (34, 0.129), (35, -0.015), (36, -0.053), (37, 0.083), (38, -0.055), (39, -0.025), (40, 0.005), (41, -0.011), (42, 0.007), (43, 0.029), (44, 0.035), (45, 0.1), (46, 0.041), (47, -0.042), (48, -0.011), (49, 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9547013 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes

Author: Hema S. Koppula, Abhishek Anand, Thorsten Joachims, Ashutosh Saxena

Abstract: Inexpensive RGB-D cameras that give an RGB image together with depth data have become widely available. In this paper, we use this data to build 3D point clouds of full indoor scenes such as an office and address the task of semantic labeling of these 3D point clouds. We propose a graphical model that captures various features and contextual relations, including the local visual appearance and shape cues, object co-occurence relationships and geometric relationships. With a large number of object classes and relations, the model’s parsimony becomes important and we address that by using multiple types of edge potentials. The model admits efficient approximate inference, and we train it using a maximum-margin learning approach. In our experiments over a total of 52 3D scenes of homes and offices (composed from about 550 views, having 2495 segments labeled with 27 object classes), we get a performance of 84.06% in labeling 17 object classes for offices, and 73.38% in labeling 17 object classes for home scenes. Finally, we applied these algorithms successfully on a mobile robot for the task of finding objects in large cluttered rooms.1 1

2 0.84645486 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling

Author: Adrian Ion, Joao Carreira, Cristian Sminchisescu

Abstract: We present a joint image segmentation and labeling model (JSL) which, given a bag of figure-ground segment hypotheses extracted at multiple image locations and scales, constructs a joint probability distribution over both the compatible image interpretations (tilings or image segmentations) composed from those segments, and over their labeling into categories. The process of drawing samples from the joint distribution can be interpreted as first sampling tilings, modeled as maximal cliques, from a graph connecting spatially non-overlapping segments in the bag [1], followed by sampling labels for those segments, conditioned on the choice of a particular tiling. We learn the segmentation and labeling parameters jointly, based on Maximum Likelihood with a novel Incremental Saddle Point estimation procedure. The partition function over tilings and labelings is increasingly more accurately approximated by including incorrect configurations that a not-yet-competent model rates probable during learning. We show that the proposed methodology matches the current state of the art in the Stanford dataset [2], as well as in VOC2010, where 41.7% accuracy on the test set is achieved.

3 0.78462058 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

Author: Congcong Li, Ashutosh Saxena, Tsuhan Chen

Abstract: For most scene understanding tasks (such as object detection or depth estimation), the classifiers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by defining a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multi-class object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks. 1

4 0.77914906 138 nips-2011-Joint 3D Estimation of Objects and Scene Layout

Author: Andreas Geiger, Christian Wojek, Raquel Urtasun

Abstract: We propose a novel generative model that is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene. In particular, we infer the scene topology, geometry as well as traffic activities from a short video sequence acquired with a single camera mounted on a moving car. Our generative model takes advantage of dynamic information in the form of vehicle tracklets as well as static information coming from semantic labels and geometry (i.e., vanishing points). Experiments show that our approach outperforms a discriminative baseline based on multiple kernel learning (MKL) which has access to the same image information. Furthermore, as we reason about objects in 3D, we are able to significantly increase the performance of state-of-the-art object detectors in their ability to estimate object orientation. 1

5 0.7405318 227 nips-2011-Pylon Model for Semantic Segmentation

Author: Victor Lempitsky, Andrea Vedaldi, Andrew Zisserman

Abstract: Graph cut optimization is one of the standard workhorses of image segmentation since for binary random field representations of the image, it gives globally optimal results and there are efficient polynomial time implementations. Often, the random field is applied over a flat partitioning of the image into non-intersecting elements, such as pixels or super-pixels. In the paper we show that if, instead of a flat partitioning, the image is represented by a hierarchical segmentation tree, then the resulting energy combining unary and boundary terms can still be optimized using graph cut (with all the corresponding benefits of global optimality and efficiency). As a result of such inference, the image gets partitioned into a set of segments that may come from different layers of the tree. We apply this formulation, which we call the pylon model, to the task of semantic segmentation where the goal is to separate an image into areas belonging to different semantic classes. The experiments highlight the advantage of inference on a segmentation tree (over a flat partitioning) and demonstrate that the optimization in the pylon model is able to flexibly choose the level of segmentation across the image. Overall, the proposed system has superior segmentation accuracy on several datasets (Graz-02, Stanford background) compared to previously suggested approaches. 1

6 0.73358166 127 nips-2011-Image Parsing with Stochastic Scene Grammar

7 0.67148364 193 nips-2011-Object Detection with Grammar Models

8 0.66123194 154 nips-2011-Learning person-object interactions for action recognition in still images

9 0.65511036 293 nips-2011-Understanding the Intrinsic Memorability of Images

10 0.64747488 76 nips-2011-Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials

11 0.60598046 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs

12 0.59363043 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection

13 0.5701257 119 nips-2011-Higher-Order Correlation Clustering for Image Segmentation

14 0.55642509 235 nips-2011-Recovering Intrinsic Images with a Global Sparsity Prior on Reflectance

15 0.5552668 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning

16 0.55385864 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation

17 0.54039884 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation

18 0.53174615 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition

19 0.53052193 141 nips-2011-Large-Scale Category Structure Aware Image Categorization

20 0.52313936 35 nips-2011-An ideal observer model for identifying the reference frame of objects


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.03), (4, 0.065), (20, 0.073), (26, 0.02), (31, 0.064), (33, 0.061), (43, 0.055), (45, 0.092), (57, 0.044), (60, 0.306), (65, 0.012), (74, 0.041), (83, 0.014), (84, 0.011), (99, 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.861552 138 nips-2011-Joint 3D Estimation of Objects and Scene Layout

Author: Andreas Geiger, Christian Wojek, Raquel Urtasun

Abstract: We propose a novel generative model that is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene. In particular, we infer the scene topology, geometry as well as traffic activities from a short video sequence acquired with a single camera mounted on a moving car. Our generative model takes advantage of dynamic information in the form of vehicle tracklets as well as static information coming from semantic labels and geometry (i.e., vanishing points). Experiments show that our approach outperforms a discriminative baseline based on multiple kernel learning (MKL) which has access to the same image information. Furthermore, as we reason about objects in 3D, we are able to significantly increase the performance of state-of-the-art object detectors in their ability to estimate object orientation. 1

same-paper 2 0.77247977 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes

Author: Hema S. Koppula, Abhishek Anand, Thorsten Joachims, Ashutosh Saxena

Abstract: Inexpensive RGB-D cameras that give an RGB image together with depth data have become widely available. In this paper, we use this data to build 3D point clouds of full indoor scenes such as an office and address the task of semantic labeling of these 3D point clouds. We propose a graphical model that captures various features and contextual relations, including the local visual appearance and shape cues, object co-occurence relationships and geometric relationships. With a large number of object classes and relations, the model’s parsimony becomes important and we address that by using multiple types of edge potentials. The model admits efficient approximate inference, and we train it using a maximum-margin learning approach. In our experiments over a total of 52 3D scenes of homes and offices (composed from about 550 views, having 2495 segments labeled with 27 object classes), we get a performance of 84.06% in labeling 17 object classes for offices, and 73.38% in labeling 17 object classes for home scenes. Finally, we applied these algorithms successfully on a mobile robot for the task of finding objects in large cluttered rooms.1 1

3 0.55758607 127 nips-2011-Image Parsing with Stochastic Scene Grammar

Author: Yibiao Zhao, Song-chun Zhu

Abstract: This paper proposes a parsing algorithm for scene understanding which includes four aspects: computing 3D scene layout, detecting 3D objects (e.g. furniture), detecting 2D faces (windows, doors etc.), and segmenting background. In contrast to previous scene labeling work that applied discriminative classifiers to pixels (or super-pixels), we use a generative Stochastic Scene Grammar (SSG). This grammar represents the compositional structures of visual entities from scene categories, 3D foreground/background, 2D faces, to 1D lines. The grammar includes three types of production rules and two types of contextual relations. Production rules: (i) AND rules represent the decomposition of an entity into sub-parts; (ii) OR rules represent the switching among sub-types of an entity; (iii) SET rules represent an ensemble of visual entities. Contextual relations: (i) Cooperative “+” relations represent positive links between binding entities, such as hinged faces of a object or aligned boxes; (ii) Competitive “-” relations represents negative links between competing entities, such as mutually exclusive boxes. We design an efficient MCMC inference algorithm, namely Hierarchical cluster sampling, to search in the large solution space of scene configurations. The algorithm has two stages: (i) Clustering: It forms all possible higher-level structures (clusters) from lower-level entities by production rules and contextual relations. (ii) Sampling: It jumps between alternative structures (clusters) in each layer of the hierarchy to find the most probable configuration (represented by a parse tree). In our experiment, we demonstrate the superiority of our algorithm over existing methods on public dataset. In addition, our approach achieves richer structures in the parse tree. 1

4 0.55166924 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling

Author: Adrian Ion, Joao Carreira, Cristian Sminchisescu

Abstract: We present a joint image segmentation and labeling model (JSL) which, given a bag of figure-ground segment hypotheses extracted at multiple image locations and scales, constructs a joint probability distribution over both the compatible image interpretations (tilings or image segmentations) composed from those segments, and over their labeling into categories. The process of drawing samples from the joint distribution can be interpreted as first sampling tilings, modeled as maximal cliques, from a graph connecting spatially non-overlapping segments in the bag [1], followed by sampling labels for those segments, conditioned on the choice of a particular tiling. We learn the segmentation and labeling parameters jointly, based on Maximum Likelihood with a novel Incremental Saddle Point estimation procedure. The partition function over tilings and labelings is increasingly more accurately approximated by including incorrect configurations that a not-yet-competent model rates probable during learning. We show that the proposed methodology matches the current state of the art in the Stanford dataset [2], as well as in VOC2010, where 41.7% accuracy on the test set is achieved.

5 0.55033976 227 nips-2011-Pylon Model for Semantic Segmentation

Author: Victor Lempitsky, Andrea Vedaldi, Andrew Zisserman

Abstract: Graph cut optimization is one of the standard workhorses of image segmentation since for binary random field representations of the image, it gives globally optimal results and there are efficient polynomial time implementations. Often, the random field is applied over a flat partitioning of the image into non-intersecting elements, such as pixels or super-pixels. In the paper we show that if, instead of a flat partitioning, the image is represented by a hierarchical segmentation tree, then the resulting energy combining unary and boundary terms can still be optimized using graph cut (with all the corresponding benefits of global optimality and efficiency). As a result of such inference, the image gets partitioned into a set of segments that may come from different layers of the tree. We apply this formulation, which we call the pylon model, to the task of semantic segmentation where the goal is to separate an image into areas belonging to different semantic classes. The experiments highlight the advantage of inference on a segmentation tree (over a flat partitioning) and demonstrate that the optimization in the pylon model is able to flexibly choose the level of segmentation across the image. Overall, the proposed system has superior segmentation accuracy on several datasets (Graz-02, Stanford background) compared to previously suggested approaches. 1

6 0.52854204 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

7 0.49691439 66 nips-2011-Crowdclustering

8 0.49664938 154 nips-2011-Learning person-object interactions for action recognition in still images

9 0.48871273 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation

10 0.48649123 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss

11 0.48337764 168 nips-2011-Maximum Margin Multi-Instance Learning

12 0.48189965 303 nips-2011-Video Annotation and Tracking with Active Learning

13 0.47989285 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning

14 0.47870916 180 nips-2011-Multiple Instance Filtering

15 0.47372267 263 nips-2011-Sparse Manifold Clustering and Embedding

16 0.47360939 231 nips-2011-Randomized Algorithms for Comparison-based Search

17 0.47202352 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data

18 0.46953934 156 nips-2011-Learning to Learn with Compound HD Models

19 0.46898645 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound

20 0.46805546 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features