nips nips2011 nips2011-138 knowledge-graph by maker-knowledge-mining

138 nips-2011-Joint 3D Estimation of Objects and Scene Layout


Source: pdf

Author: Andreas Geiger, Christian Wojek, Raquel Urtasun

Abstract: We propose a novel generative model that is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene. In particular, we infer the scene topology, geometry as well as traffic activities from a short video sequence acquired with a single camera mounted on a moving car. Our generative model takes advantage of dynamic information in the form of vehicle tracklets as well as static information coming from semantic labels and geometry (i.e., vanishing points). Experiments show that our approach outperforms a discriminative baseline based on multiple kernel learning (MKL) which has access to the same image information. Furthermore, as we reason about objects in 3D, we are able to significantly increase the performance of state-of-the-art object detectors in their ability to estimate object orientation. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We propose a novel generative model that is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene. [sent-5, score-0.922]

2 In particular, we infer the scene topology, geometry as well as traffic activities from a short video sequence acquired with a single camera mounted on a moving car. [sent-6, score-0.794]

3 Our generative model takes advantage of dynamic information in the form of vehicle tracklets as well as static information coming from semantic labels and geometry (i.e., vanishing points). [sent-7, score-0.747]

4 Furthermore, as we reason about objects in 3D, we are able to significantly increase the performance of state-of-the-art object detectors in their ability to estimate object orientation. [sent-11, score-0.35]

5 1 Introduction: Visual 3D scene understanding is an important component in applications such as autonomous driving and robot navigation. [sent-12, score-0.321]

6 E.g., semantic labels [10, 26], object detection [5] or rough 3D [15, 24]. [sent-15, score-0.3]

7 A notable exception are approaches that try to infer the scene layout of indoor scenes in the form of 3D bounding boxes [13, 22]. [sent-16, score-0.676]

8 I.e., walls (and often objects) are aligned with the three dominant vanishing points. [sent-21, score-0.258]

9 In contrast, outdoor scenarios often show more clutter, vanishing points are not necessarily orthogonal [25, 2], and objects often do not agree with the dominant vanishing points. [sent-22, score-0.561]

10 Prior work on 3D urban scene analysis is mostly limited to simple ground plane estimation [4, 29] or models for which the objects and the scene are inferred separately [6, 7]. [sent-23, score-0.831]

11 In contrast, in this paper we propose a novel generative model that is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene. [sent-24, score-0.922]

12 In particular, given a video sequence of short duration acquired with a single camera mounted on a moving car, we estimate the scene topology and geometry, as well as the traffic activities and 3D objects present in the scene (see Fig. [sent-25, score-1.103]

13 Towards this goal we propose a novel image likelihood which takes advantage of dynamic information in the form of vehicle tracklets as well as static information coming from semantic labels and geometry (i.e., vanishing points). [sent-27, score-0.767]

14 Furthermore, we propose a novel learning-based approach to detecting vanishing points and experimentally show improved performance in the presence of clutter when compared to existing approaches [19]. [sent-31, score-0.298]

15 We focus our evaluation mainly on estimating the layout of intersections, as this is the most challenging inference task in urban scenes. [sent-32, score-0.375]

16 We evaluate our method on a wide range of metrics including the accuracy of estimating the topology and geometry of the scene, as well as detecting traffic activities. (Figure 1: Monocular 3D Urban Scene Understanding — Vehicle Tracklets, Vanishing Points, Scene Labels ⇒ estimated layout.) [sent-36, score-0.241]

17 (Right) Estimated layout: Detections belonging to a tracklet are depicted with the same color, traffic activities are depicted with red lines. [sent-38, score-0.293]

18 Furthermore, we show that we are able to significantly increase the performance of state-of-the-art object detectors [5] in terms of estimating object orientation. [sent-42, score-0.307]

19 2 Related Work While outdoor scenarios remain fairly unexplored, estimating the 3D layout of indoor scenes has experienced increased popularity in the past few years [13, 27, 22]. [sent-43, score-0.354]

20 I.e., edges in the image can be associated with parallel lines defined in terms of the three dominant vanishing points, which are orthonormal. [sent-46, score-0.314]

21 [2] proposed to jointly perform line detection as well as vanishing point, azimuth and zenith estimation. [sent-51, score-0.284]

22 However, their approach does not tackle the problem of 3D scene understanding and 3D object detection. [sent-52, score-0.459]

23 Relations between objects (e.g., object A supports object B) have also been introduced [11]. [sent-57, score-0.276]

24 Prior work on 3D traffic scene analysis is mostly limited to simple ground plane estimation [4], or models for which the objects and scene are inferred separately [6]. [sent-59, score-0.706]

25 In contrast, our model offers a much richer scene description and reasons jointly about 3D objects and the scene layout. [sent-60, score-0.684]

26 The most successful approaches use tracklets to prune spurious detections by linking consistent evidence in successive frames [18, 16]. [sent-62, score-0.479]

27 However, these models are either designed for static camera setups in surveillance applications [16] or do not provide a rich scene description [18]. [sent-63, score-0.431]

28 Notable exceptions are [3, 29] which jointly infer the camera pose and the location of objects. [sent-64, score-0.348]

29 However, the employed scene models are rather simplistic containing only a single flat ground plane. [sent-65, score-0.351]

30 Most related to our approach is [7], where a generative model is proposed in order to estimate the scene topology, geometry as well as traffic activities at intersections. [sent-67, score-0.485]

31 Towards this goal we develop a richer image likelihood model that takes advantage of vehicle tracklets, vanishing points as well as segmentations of the scene into semantic labels. [sent-71, score-0.749]

32 [7] estimate only the scene layout, while we reason jointly about the layout as well as the 3D location and orientation of objects in the scene. [sent-73, score-1.168]

33 Finally, non-parametric models have been proposed to perform traffic scene analysis from a stationary camera with a view similar to bird’s eye perspective [20, 28]. [sent-78, score-0.607]

34 In our work we aim to infer similar activities but use video sequences from a camera mounted on a moving car with a substantially lower viewpoint. [sent-79, score-0.462]

35 3 3D Urban Scene Understanding: We tackle the problem of estimating the 3D layout of urban scenes. [sent-82, score-0.385]

36 In this paper 2D refers to observations in the image plane while 3D refers to the bird’s eye perspective (in our scenario the height above ground is non-informative). [sent-85, score-0.344]

37 We assume that the road surface is flat, and model the bird’s eye perspective as the y = 0 plane of the standard camera coordinate system. [sent-86, score-0.51]

38 The intrinsic parameters of the camera are obtained using camera calibration and the extrinsics using a standard Structure-from-Motion (SfM) pipeline [12]. [sent-88, score-0.348]

39 We take advantage of dynamic and static information in the form of 3D vehicle tracklets, semantic labels and vanishing points. [sent-89, score-0.268]

40 In order to compute 3D tracklets, we first detect vehicles in each frame independently using a semi-supervised version of the part-based detector of [5] to obtain orientation estimates. [sent-92, score-0.383]

41 2D tracklets are then estimated using 'tracking-by-detection': first, adjacent frames are linked, and then short tracklets are associated into longer ones via the Hungarian method. [sent-93, score-0.764]
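
As an illustration of this 'tracking-by-detection' association step, the sketch below links detections of adjacent frames by solving a bipartite assignment with the Hungarian method (via scipy.optimize.linear_sum_assignment). The center-distance cost, the gating threshold max_dist and the box format are illustrative assumptions rather than the authors' exact implementation; longer tracklets would then be formed by chaining such links and merging short tracklets with a second assignment of the same kind.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_center_distance(b1, b2):
    """Euclidean distance between 2D bounding-box centers; boxes are (u, v, w, h)."""
    return float(np.hypot(b1[0] - b2[0], b1[1] - b2[1]))

def link_adjacent_frames(dets_t, dets_t1, max_dist=50.0):
    """Associate detections of frame t with frame t+1 via the Hungarian method.

    dets_t, dets_t1: lists of boxes (u, v, w, h). Returns matched index pairs (i, j).
    """
    if not dets_t or not dets_t1:
        return []
    cost = np.array([[box_center_distance(a, b) for b in dets_t1] for a in dets_t])
    rows, cols = linear_sum_assignment(cost)            # globally optimal assignment
    # Gate away implausible links (large image-plane jumps between frames).
    return [(int(i), int(j)) for i, j in zip(rows, cols) if cost[i, j] < max_dist]
```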

42 Finally, 3D vehicle tracklets are obtained by projecting the 2D tracklets into bird’s eye perspective, employing error-propagation to obtain covariance estimates. [sent-94, score-1.006]

43 See Fig. 1, where detections belonging to the same tracklet are grouped by color. [sent-96, score-0.284]

44 We model lanes with splines (see red lines for active lanes in Fig. 1). [sent-105, score-0.242]

45 We place parking spots at equidistant places along the street boundaries (see Fig. [sent-106, score-0.258]

46 Our model then infers whether the cars participate in traffic or are parked in order to obtain more accurate layout estimates. [sent-108, score-0.302]

47 Latent variables are employed to associate each detected vehicle with positions in one of these lanes or parking spaces. [sent-109, score-0.531]

48 Each of these layouts is associated with a set of geometric random variables: The intersection center c, the street width w, the global scene rotation r and the angle of the crossing street α with respect to r (see Fig. [sent-114, score-0.664]

49 Joint Distribution: Our goal is to estimate the most likely configuration R = (θ, c, w, r, α) given the image evidence E = {T, V, S}, which comprises vehicle tracklets T = {t1, . . . , tN}, vanishing points V and semantic labels S. [sent-117, score-0.572]

50 (Figure 3: Graphical model (a) and road model (b), with lanes represented as B-splines.) [sent-119, score-0.275]

51 Prior: Let us first define a scene prior, which factorizes as p(R) = p(θ) p(c, w) p(r) p(α) (Eq. 2), where c and w are modeled jointly to capture their correlation. [sent-126, score-0.328]
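
A minimal sketch of how such a factorized prior can be evaluated in log-space is given below. The concrete distributional choices (a categorical prior over topologies θ, a joint Gaussian coupling c and w, and Gaussians over r and α) and all parameter names are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def log_prior(theta, c, w, r, alpha, params):
    """log p(R) = log p(theta) + log p(c, w) + log p(r) + log p(alpha) (sketch)."""
    lp = np.log(params["p_theta"][theta])                      # categorical over scene topologies
    lp += multivariate_normal.logpdf(np.r_[c, w],              # joint term couples center and width
                                     mean=params["mu_cw"], cov=params["cov_cw"])
    lp += norm.logpdf(r, params["mu_r"], params["sigma_r"])              # global scene rotation
    lp += norm.logpdf(alpha, params["mu_alpha"], params["sigma_alpha"])  # crossing-street angle
    return float(lp)
```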

52 2 Image Likelihood This section details our image likelihood for tracklets, vanishing points and semantic labels. [sent-132, score-0.313]

53 Let us define a 3D tracklet as a set of object detections t = {d1, . . . , dM}. [sent-134, score-0.422]

54 Here, each object detection dm = (fm , bm , om ) contains the frame index fm ∈ N, the object bounding box bm ∈ R4 defined as 2D position and size, as well as a normalized orientation histogram om ∈ R8 with 8 bins. [sent-137, score-1.058]

55 We compute the bounding box bm and orientation om by supervised training of a part-based object detector [5], where each component contains examples from a single orientation. [sent-138, score-0.602]
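
For concreteness, the tracklet and detection structures just defined can be represented as below; the field names and container types are illustrative choices, not the authors' code.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Detection:
    frame: int               # frame index f_m
    box: np.ndarray          # bounding box b_m = (u, v, width, height) in pixels
    orientation: np.ndarray  # normalized 8-bin orientation histogram o_m

@dataclass
class Tracklet:
    detections: List[Detection]  # t = {d_1, ..., d_M}, ordered by frame index
```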

56 This gives l ∈ {1, . . . , K(K − 1) + 2K} positions in total, where K(K − 1) is the number of lanes and 2K is the number of parking areas. [sent-146, score-0.311]

57 We use the latent variable l to index the lane or parking position associated with a tracklet. [sent-147, score-0.301]

58 The joint probability of a tracklet t and its lane index l is given by p(t, l|R, C) = p(t|l, R, C)p(l). [sent-148, score-0.318]

59 We assume a uniform prior over lanes and parking positions l ∼ U(1, K(K − 1) + 2K), and denote the posterior by pl when l corresponds to a lane, and pp when it is a parking position. [sent-149, score-0.699]

60 In order to evaluate the tracklet posterior for lanes pl (t|l, R, C), we need to associate all object detections t = {d1, . . . , dM} with positions along the lane. [sent-150, score-0.72]

61 The posterior is modeled using a left-to-right Hidden Markov Model (HMM), defined as pl(t|l, R, C) = ∑_{s1,...,sM} pl(s1) pl(d1|s1, l, R, C) ∏_{m=2}^{M} pl(sm|sm−1) pl(dm|sm, l, R, C). (3) [sent-156, score-0.286]

62 We constrain all tracklets to move forward in 3D by defining the transition probability p(sm|sm−1) as uniform on sm ≥ sm−1 and 0 otherwise. [sent-158, score-0.612]
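
The marginalization over the latent spline positions s1, …, sM in Eq. 3 can be carried out with the standard HMM forward algorithm; a sketch is given below. The discretization of the lane spline into S candidate positions and the emission interface are illustrative assumptions; the left-to-right constraint enters as a transition that is uniform over non-decreasing positions.

```python
import numpy as np

def tracklet_log_likelihood(log_emission):
    """Forward algorithm for the left-to-right HMM of Eq. 3 (sketch).

    log_emission: (M, S) array of log p_l(d_m | s, l, R, C) for M detections and
    S discretized positions on the lane spline. The transition p(s_m | s_{m-1})
    is uniform over s_m >= s_{m-1} and zero otherwise. Returns log p_l(t | l, R, C).
    """
    M, S = log_emission.shape
    log_alpha = -np.log(S) + log_emission[0]          # uniform prior p_l(s_1)
    for m in range(1, M):
        # Predecessor j contributes alpha(j) / (S - j) to every state i >= j.
        per_pred = log_alpha - np.log(S - np.arange(S))
        log_alpha = np.logaddexp.accumulate(per_pred) + log_emission[m]
    return float(np.logaddexp.reduce(log_alpha))
```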

63 We assume that the emission likelihood pl (dm |sm , l, R, C) factorizes into the object location and its orientation. [sent-160, score-0.376]

64 We impose a multinomial distribution over the orientation pl (fm , om |sm , l, R, C), where each object orientation votes for its bin as well as neighboring bins, accounting for the uncertainty of the object detector. [sent-161, score-0.875]
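
Such a smoothed multinomial over the 8 orientation bins can be sketched as follows; the vote weights and the small uniform floor are illustrative assumptions.

```python
import numpy as np

def orientation_log_score(obs_hist, expected_bin, num_bins=8, w_center=0.5, w_side=0.25):
    """Score an observed 8-bin orientation histogram against the orientation expected
    from the lane geometry; the expected bin and its two neighbors receive votes (sketch)."""
    p = np.full(num_bins, 1e-3)                       # floor accounts for detector noise
    p[expected_bin] += w_center
    p[(expected_bin - 1) % num_bins] += w_side
    p[(expected_bin + 1) % num_bins] += w_side
    p /= p.sum()
    return float(np.dot(obs_hist, np.log(p)))         # expected log-probability under p
```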

65 We now describe how we transform the 2D tracklets into 3D tracklets {π1, Σ1, . . . , πM, ΣM}, which we use in pl(dm|sm, l, R, C). [sent-163, score-0.724]

66 We project the image coordinates into bird’s eye perspective by backprojecting objects into 3D using several complementary cues. [sent-165, score-0.449]

67 Towards this goal we use the 2D bounding box foot-point in combination with the estimated road plane. [sent-166, score-0.251]

68 Assuming typical vehicle dimensions obtained from annotated ground truth, we also exploit the width and height of the bounding box. [sent-167, score-0.324]

69 Our parking posterior model is similar to the lane posterior described above, except that we do not allow parked vehicles to move; we assume them to have arbitrary orientations and place them at the sides of the road. [sent-170, score-0.52]

70 For inference, we subsample each tracklet trajectory equidistantly in intervals of 5 meters in order to reduce the number of detections within a tracklet and keep the total evaluation time of p(R, E|C) low. [sent-173, score-0.491]
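
An illustrative way to subsample a tracklet trajectory at (roughly) equidistant 5-meter intervals in bird's eye view is sketched below; the greedy arc-length rule is our own simplification.

```python
import numpy as np

def subsample_trajectory(points, spacing=5.0):
    """Keep a subset of bird's-eye positions (N, 2) such that consecutive kept points
    are at least `spacing` meters apart along the trajectory (greedy sketch)."""
    kept = [0]
    travelled = 0.0
    for i in range(1, len(points)):
        travelled += float(np.linalg.norm(points[i] - points[i - 1]))
        if travelled >= spacing:
            kept.append(i)
            travelled = 0.0
    return np.asarray(kept)
```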

71 Vanishing Points: We detect two types of dominant vanishing points (VP) in the last frame of each sequence: vf corresponding to the forward facing street and vc corresponding to the crossing street. [sent-174, score-0.632]

72 As a consequence, we represent vf ∈ R by its image u-coordinate and vc ∈ [−π/4, π/4] by the angle of the crossing road, back-projected into the image. [sent-177, score-0.35]

73 Our feature set comprises geometric information in the form of position, length, orientation and number of lines with the same orientation as well as perpendicular orientation in a local window. [sent-200, score-0.618]

74 Finally, we add texton-like features using a Gabor filter bank, as well as 3 principal components of the scene GIST [23]. [sent-202, score-0.282]

75 We assume that vf and vc are independent given the road parameters. [sent-207, score-0.323]

76 Let µf = µf (R, C) be the image u-coordinate (in pixels) of the forward facing street’s VP and let µc = µc (R, C) be the orientation (in radians) of the crossing street in the image. [sent-208, score-0.421]
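
For intuition, the u-coordinate of the forward street's vanishing point follows from standard pinhole geometry: the vanishing point of a 3D direction d projects to K d. The sketch below assumes the forward street direction is obtained by rotating the camera's viewing direction by the global rotation r within the ground plane; this parameterization and the aligned-camera assumption are illustrative.

```python
import numpy as np

def forward_vp_u(K, r):
    """u-coordinate (pixels) of the vanishing point of the forward-facing street.

    K: 3x3 camera intrinsics; r: global scene rotation (radians) in the ground plane.
    The VP of a direction d is K d in homogeneous pixel coordinates (sketch)."""
    d = np.array([np.sin(r), 0.0, np.cos(r)])   # street direction in the bird's eye plane
    vp = K @ d
    return float(vp[0] / vp[2])

# Example with arbitrary intrinsics:
# K = np.array([[700.0, 0, 620.0], [0, 700.0, 190.0], [0, 0, 1.0]]); forward_vp_u(K, 0.05)
```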

77 Fig. 4 illustrates an example of the scene labeling returned by boosting (left) as well as the labeling generated from the reprojection of our model (right). [sent-220, score-0.318]

78 Eq. 4 requires a function ϕ : (f, b, C) → (π, Σ) which takes a frame index f ∈ N, an object bounding box b ∈ R4 and the calibration parameters C as input and maps them to the object location π ∈ R2 and uncertainty Σ ∈ R2×2 in bird’s eye perspective. [sent-259, score-0.692]

79 As cues for this mapping we use the bounding box width and height, as well as the location of the bounding box foot-point. [sent-260, score-0.326]

80 The unknown parameters of the mapping are the uncertainty in bounding box location σu , σv , width σ∆u and height σ∆v as well as the real-world object dimensions ∆x , ∆y along with their uncertainties σ∆x , σ∆y . [sent-262, score-0.411]
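
A minimal sketch of the foot-point cue of this mapping is given below: the bounding-box foot-point is backprojected through the camera onto the flat road plane, and the pixel uncertainty is propagated to a 2D covariance in bird's eye view with a first-order (numerical Jacobian) approximation. The road-plane parameterization via a camera height and the parameter names are illustrative assumptions.

```python
import numpy as np

def backproject_footpoint(u, v, K, cam_height, sigma_px=(2.0, 2.0)):
    """Map a bounding-box foot-point (u, v) to a bird's-eye location pi = (x, z)
    and covariance Sigma by intersecting its viewing ray with the road plane,
    assumed here to lie cam_height below the camera with y pointing down (sketch)."""
    Kinv = np.linalg.inv(K)

    def to_ground(uv):
        ray = Kinv @ np.array([uv[0], uv[1], 1.0])
        scale = cam_height / ray[1]            # intersect the ray with the road plane
        X = scale * ray
        return np.array([X[0], X[2]])          # bird's eye coordinates (x, z)

    pi = to_ground(np.array([float(u), float(v)]))
    eps = 1e-3                                  # numerical Jacobian for error propagation
    J = np.column_stack([
        (to_ground(np.array([u + eps, v])) - pi) / eps,
        (to_ground(np.array([u, v + eps])) - pi) / eps,
    ])
    Sigma = J @ np.diag(np.square(sigma_px)) @ J.T   # Sigma = J diag(sigma^2) J^T
    return pi, Sigma
```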

81 To avoid trans-dimensional jumps, the road layout θ is estimated separately beforehand using the MAP estimate θMAP provided by joint boosting [30]. [sent-269, score-0.373]

82 4 Experimental Evaluation In this section, we first show that learning which line features convey structural information improves dominant vanishing point detection. [sent-271, score-0.258]

83 Next, we compare our method to a multiple kernel learning (MKL) baseline in estimating scene topology, geometry and traffic activities on the dataset of [7], but only employing information from a single camera. [sent-272, score-0.522]

84 Finally, we show that our model can significantly improve object orientation estimates compared to state-of-the-art part based models [5]. [sent-273, score-0.344]

85 3D Urban Scene Inference: We evaluate our method’s ability to infer the scene layout by building a competitive baseline based on multi-kernel Gaussian process regression [17]. [sent-283, score-0.565]

86 We employ a total of 4 kernels built on GIST [23], tracklet histograms, VPs as well as scene labels. [sent-284, score-0.489]

87 Note that these are the same features employed by our model to estimate the scene topology, θMAP. [sent-285, score-0.314]

88 Following [7] we measure error in terms of the location of the intersection center in meters, the orientation of the intersection arms in degrees, the overlap of road area with ground truth as well as the percentage of correctly discovered intersection crossing activities. [sent-290, score-0.875]
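
Minimalist versions of these metrics are sketched below; the road-area overlap is implemented here as an intersection-over-union on rasterized bird's-eye masks, which is one possible reading of the overlap criterion rather than the authors' exact definition.

```python
import numpy as np

def center_error_m(c_est, c_gt):
    """Intersection-center error in meters (Euclidean distance in bird's eye view)."""
    return float(np.linalg.norm(np.asarray(c_est) - np.asarray(c_gt)))

def orientation_error_deg(a_est, a_gt):
    """Smallest absolute angular difference between street orientations, in degrees."""
    d = abs(np.rad2deg(a_est - a_gt)) % 360.0
    return float(min(d, 360.0 - d))

def road_overlap(mask_est, mask_gt):
    """Overlap of estimated and ground-truth road area on boolean bird's-eye masks."""
    inter = np.logical_and(mask_est, mask_gt).sum()
    union = np.logical_or(mask_est, mask_gt).sum()
    return float(inter) / max(float(union), 1.0)
```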

89 The inferred intersection layout is shown in gray, ground truth labels are given in blue. [sent-295, score-0.351]

90 8 shows qualitative results, with detections belonging to the same tracklet depicted with the same color. [sent-306, score-0.284]

91 As cars are mostly aligned with the road surface, we only focus on the orientation angle in bird’s eye coordinates. [sent-312, score-0.565]

92 We correct for the ego motion and project the highest scoring orientation into bird’s eye perspective. [sent-314, score-0.334]

93 For our method, we infer the scene layout R using our approach and associate every tracklet to its lane by maximizing pl (l|t, R, C) over l using Viterbi decoding. [sent-315, score-1.018]

94 We then select the tangent angle at the associated spline’s footpoint s on the inferred lane l as our orientation estimate. [sent-316, score-0.351]
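
The lane association and orientation read-out of the last two sentences can be sketched as follows. The Viterbi recursion runs over the same left-to-right HMM as Eq. 3 (with the transition normalization omitted for brevity), and the lane is represented as a scipy B-spline; both the discretization and the spline interface are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import splev

def viterbi_left_to_right(log_emission):
    """Most likely non-decreasing sequence of spline positions for one lane (sketch)."""
    M, S = log_emission.shape
    delta = -np.log(S) + log_emission[0]
    back = np.zeros((M, S), dtype=int)
    for m in range(1, M):
        back[m] = [int(np.argmax(delta[:i + 1])) for i in range(S)]   # best predecessor j <= i
        delta = np.maximum.accumulate(delta) + log_emission[m]
    path = [int(np.argmax(delta))]
    for m in range(M - 1, 0, -1):
        path.append(int(back[m, path[-1]]))
    return path[::-1], float(np.max(delta))

def lane_orientation(tck, s_foot):
    """Tangent angle of the lane B-spline (from scipy.interpolate.splprep) at footpoint s_foot."""
    dx, dz = splev(s_foot, tck, der=1)
    return float(np.arctan2(dx, dz))
```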

95 5 Conclusions We have proposed a generative model which is able to perform joint 3D inference over the scene layout as well as the location and orientation of objects. [sent-321, score-0.838]

96 Our approach is able to infer the scene topology and geometry, as well as traffic activities from a short video sequence acquired with a single camera mounted on a car driving around a mid-size city. [sent-322, score-0.805]

97 A generative model for 3d urban scene understanding from movable platforms. [sent-368, score-0.482]

98 Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. [sent-461, score-0.257]

99 Discriminative learning with latent variables for cluttered indoor scene understanding. [sent-492, score-0.337]

100 A dynamic CRF model for joint labeling of object and scene classes. [sent-510, score-0.42]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('tracklets', 0.362), ('scene', 0.282), ('tracklet', 0.207), ('orientation', 0.206), ('vanishing', 0.19), ('parking', 0.19), ('layout', 0.183), ('road', 0.154), ('vehicle', 0.154), ('traf', 0.149), ('camera', 0.149), ('pl', 0.143), ('object', 0.138), ('bird', 0.132), ('eye', 0.128), ('urban', 0.125), ('lanes', 0.121), ('lane', 0.111), ('vf', 0.111), ('vp', 0.11), ('sm', 0.107), ('monocular', 0.105), ('vps', 0.103), ('location', 0.095), ('topology', 0.092), ('vehicles', 0.091), ('crossing', 0.091), ('deg', 0.091), ('activities', 0.086), ('intersection', 0.084), ('geiger', 0.084), ('geometry', 0.081), ('stereo', 0.078), ('detections', 0.077), ('bm', 0.077), ('parked', 0.076), ('objects', 0.074), ('fm', 0.072), ('dm', 0.071), ('clutter', 0.071), ('street', 0.068), ('dominant', 0.068), ('semantic', 0.067), ('mounted', 0.061), ('infer', 0.058), ('vc', 0.058), ('image', 0.056), ('ijcv', 0.056), ('pp', 0.055), ('indoor', 0.055), ('mkl', 0.055), ('eccv', 0.052), ('bounding', 0.052), ('manhattan', 0.052), ('wojek', 0.052), ('orientations', 0.052), ('calibration', 0.05), ('detection', 0.048), ('perspective', 0.048), ('labels', 0.047), ('scenes', 0.046), ('frame', 0.046), ('video', 0.046), ('jointly', 0.046), ('schindler', 0.045), ('box', 0.045), ('height', 0.044), ('om', 0.044), ('ap', 0.043), ('cars', 0.043), ('baseline', 0.042), ('tracking', 0.041), ('detector', 0.04), ('frames', 0.04), ('arms', 0.04), ('understanding', 0.039), ('outdoor', 0.039), ('depth', 0.038), ('detecting', 0.037), ('width', 0.037), ('ground', 0.037), ('inference', 0.036), ('bins', 0.036), ('generative', 0.036), ('boosting', 0.036), ('barinova', 0.034), ('facade', 0.034), ('kosecka', 0.034), ('trackets', 0.034), ('angle', 0.034), ('associate', 0.034), ('efros', 0.033), ('employed', 0.032), ('estimating', 0.031), ('hoiem', 0.031), ('cc', 0.031), ('car', 0.031), ('plane', 0.031), ('moving', 0.031), ('buildings', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 138 nips-2011-Joint 3D Estimation of Objects and Scene Layout

Author: Andreas Geiger, Christian Wojek, Raquel Urtasun

Abstract: We propose a novel generative model that is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene. In particular, we infer the scene topology, geometry as well as traffic activities from a short video sequence acquired with a single camera mounted on a moving car. Our generative model takes advantage of dynamic information in the form of vehicle tracklets as well as static information coming from semantic labels and geometry (i.e., vanishing points). Experiments show that our approach outperforms a discriminative baseline based on multiple kernel learning (MKL) which has access to the same image information. Furthermore, as we reason about objects in 3D, we are able to significantly increase the performance of state-of-the-art object detectors in their ability to estimate object orientation. 1

2 0.25498071 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

Author: Congcong Li, Ashutosh Saxena, Tsuhan Chen

Abstract: For most scene understanding tasks (such as object detection or depth estimation), the classifiers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by defining a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multi-class object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks. 1

3 0.18296461 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes

Author: Hema S. Koppula, Abhishek Anand, Thorsten Joachims, Ashutosh Saxena

Abstract: Inexpensive RGB-D cameras that give an RGB image together with depth data have become widely available. In this paper, we use this data to build 3D point clouds of full indoor scenes such as an office and address the task of semantic labeling of these 3D point clouds. We propose a graphical model that captures various features and contextual relations, including the local visual appearance and shape cues, object co-occurence relationships and geometric relationships. With a large number of object classes and relations, the model’s parsimony becomes important and we address that by using multiple types of edge potentials. The model admits efficient approximate inference, and we train it using a maximum-margin learning approach. In our experiments over a total of 52 3D scenes of homes and offices (composed from about 550 views, having 2495 segments labeled with 27 object classes), we get a performance of 84.06% in labeling 17 object classes for offices, and 73.38% in labeling 17 object classes for home scenes. Finally, we applied these algorithms successfully on a mobile robot for the task of finding objects in large cluttered rooms.1 1

4 0.17434628 35 nips-2011-An ideal observer model for identifying the reference frame of objects

Author: Joseph L. Austerweil, Abram L. Friesen, Thomas L. Griffiths

Abstract: The object people perceive in an image can depend on its orientation relative to the scene it is in (its reference frame). For example, the images of the symbols × and + differ by a 45 degree rotation. Although real scenes have multiple images and reference frames, psychologists have focused on scenes with only one reference frame. We propose an ideal observer model based on nonparametric Bayesian statistics for inferring the number of reference frames in a scene and their parameters. When an ambiguous image could be assigned to two conflicting reference frames, the model predicts two factors should influence the reference frame inferred for the image: The image should be more likely to share the reference frame of the closer object (proximity) and it should be more likely to share the reference frame containing the most objects (alignment). We confirm people use both cues using a novel methodology that allows for easy testing of human reference frame inference. 1

5 0.15604213 127 nips-2011-Image Parsing with Stochastic Scene Grammar

Author: Yibiao Zhao, Song-chun Zhu

Abstract: This paper proposes a parsing algorithm for scene understanding which includes four aspects: computing 3D scene layout, detecting 3D objects (e.g. furniture), detecting 2D faces (windows, doors etc.), and segmenting background. In contrast to previous scene labeling work that applied discriminative classifiers to pixels (or super-pixels), we use a generative Stochastic Scene Grammar (SSG). This grammar represents the compositional structures of visual entities from scene categories, 3D foreground/background, 2D faces, to 1D lines. The grammar includes three types of production rules and two types of contextual relations. Production rules: (i) AND rules represent the decomposition of an entity into sub-parts; (ii) OR rules represent the switching among sub-types of an entity; (iii) SET rules represent an ensemble of visual entities. Contextual relations: (i) Cooperative “+” relations represent positive links between binding entities, such as hinged faces of a object or aligned boxes; (ii) Competitive “-” relations represents negative links between competing entities, such as mutually exclusive boxes. We design an efficient MCMC inference algorithm, namely Hierarchical cluster sampling, to search in the large solution space of scene configurations. The algorithm has two stages: (i) Clustering: It forms all possible higher-level structures (clusters) from lower-level entities by production rules and contextual relations. (ii) Sampling: It jumps between alternative structures (clusters) in each layer of the hierarchy to find the most probable configuration (represented by a parse tree). In our experiment, we demonstrate the superiority of our algorithm over existing methods on public dataset. In addition, our approach achieves richer structures in the parse tree. 1

6 0.12955272 154 nips-2011-Learning person-object interactions for action recognition in still images

7 0.11303417 180 nips-2011-Multiple Instance Filtering

8 0.1082734 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs

9 0.10268115 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning

10 0.098891519 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling

11 0.09493129 303 nips-2011-Video Annotation and Tracking with Active Learning

12 0.083812989 227 nips-2011-Pylon Model for Semantic Segmentation

13 0.082942165 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound

14 0.082236357 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning

15 0.079503424 193 nips-2011-Object Detection with Grammar Models

16 0.078788273 275 nips-2011-Structured Learning for Cell Tracking

17 0.074522711 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features

18 0.073564209 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection

19 0.071828537 258 nips-2011-Sparse Bayesian Multi-Task Learning

20 0.071439996 165 nips-2011-Matrix Completion for Multi-label Image Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.185), (1, 0.128), (2, -0.106), (3, 0.208), (4, 0.101), (5, 0.051), (6, 0.005), (7, -0.059), (8, 0.079), (9, 0.15), (10, 0.026), (11, -0.039), (12, -0.025), (13, 0.033), (14, 0.015), (15, -0.024), (16, 0.089), (17, 0.009), (18, -0.058), (19, 0.097), (20, -0.046), (21, -0.059), (22, -0.058), (23, 0.122), (24, 0.02), (25, -0.106), (26, -0.061), (27, 0.037), (28, 0.061), (29, 0.005), (30, -0.089), (31, 0.046), (32, -0.002), (33, 0.033), (34, 0.021), (35, 0.004), (36, -0.109), (37, 0.078), (38, -0.126), (39, 0.01), (40, -0.046), (41, 0.01), (42, -0.081), (43, 0.046), (44, 0.022), (45, 0.014), (46, 0.042), (47, 0.005), (48, 0.028), (49, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9604525 138 nips-2011-Joint 3D Estimation of Objects and Scene Layout

Author: Andreas Geiger, Christian Wojek, Raquel Urtasun

Abstract: We propose a novel generative model that is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene. In particular, we infer the scene topology, geometry as well as traffic activities from a short video sequence acquired with a single camera mounted on a moving car. Our generative model takes advantage of dynamic information in the form of vehicle tracklets as well as static information coming from semantic labels and geometry (i.e., vanishing points). Experiments show that our approach outperforms a discriminative baseline based on multiple kernel learning (MKL) which has access to the same image information. Furthermore, as we reason about objects in 3D, we are able to significantly increase the performance of state-of-the-art object detectors in their ability to estimate object orientation. 1

2 0.75612104 35 nips-2011-An ideal observer model for identifying the reference frame of objects

Author: Joseph L. Austerweil, Abram L. Friesen, Thomas L. Griffiths

Abstract: The object people perceive in an image can depend on its orientation relative to the scene it is in (its reference frame). For example, the images of the symbols × and + differ by a 45 degree rotation. Although real scenes have multiple images and reference frames, psychologists have focused on scenes with only one reference frame. We propose an ideal observer model based on nonparametric Bayesian statistics for inferring the number of reference frames in a scene and their parameters. When an ambiguous image could be assigned to two conflicting reference frames, the model predicts two factors should influence the reference frame inferred for the image: The image should be more likely to share the reference frame of the closer object (proximity) and it should be more likely to share the reference frame containing the most objects (alignment). We confirm people use both cues using a novel methodology that allows for easy testing of human reference frame inference. 1

3 0.74086976 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

Author: Congcong Li, Ashutosh Saxena, Tsuhan Chen

Abstract: For most scene understanding tasks (such as object detection or depth estimation), the classifiers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by defining a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multi-class object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks. 1

4 0.71513641 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes

Author: Hema S. Koppula, Abhishek Anand, Thorsten Joachims, Ashutosh Saxena

Abstract: Inexpensive RGB-D cameras that give an RGB image together with depth data have become widely available. In this paper, we use this data to build 3D point clouds of full indoor scenes such as an office and address the task of semantic labeling of these 3D point clouds. We propose a graphical model that captures various features and contextual relations, including the local visual appearance and shape cues, object co-occurence relationships and geometric relationships. With a large number of object classes and relations, the model’s parsimony becomes important and we address that by using multiple types of edge potentials. The model admits efficient approximate inference, and we train it using a maximum-margin learning approach. In our experiments over a total of 52 3D scenes of homes and offices (composed from about 550 views, having 2495 segments labeled with 27 object classes), we get a performance of 84.06% in labeling 17 object classes for offices, and 73.38% in labeling 17 object classes for home scenes. Finally, we applied these algorithms successfully on a mobile robot for the task of finding objects in large cluttered rooms.1 1

5 0.70447373 154 nips-2011-Learning person-object interactions for action recognition in still images

Author: Vincent Delaitre, Josef Sivic, Ivan Laptev

Abstract: We investigate a discriminatively trained model of person-object interactions for recognizing common human actions in still images. We build on the locally order-less spatial pyramid bag-of-features model, which was shown to perform extremely well on a range of object, scene and human action recognition tasks. We introduce three principal contributions. First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors. Second, we introduce new person-object interaction features based on spatial co-occurrences of individual body parts and objects. Third, we address the combinatorial problem of a large number of possible interaction pairs and propose a discriminative selection procedure using a linear support vector machine (SVM) with a sparsity inducing regularizer. Learning of action-specific body part and object interactions bypasses the difficult problem of estimating the complete human body pose configuration. Benefits of the proposed model are shown on human action recognition in consumer photographs, outperforming the strong bag-of-features baseline. 1

6 0.69146734 127 nips-2011-Image Parsing with Stochastic Scene Grammar

7 0.6697194 193 nips-2011-Object Detection with Grammar Models

8 0.62961614 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition

9 0.5984782 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs

10 0.57112002 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning

11 0.5584892 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection

12 0.54844576 293 nips-2011-Understanding the Intrinsic Memorability of Images

13 0.50580299 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation

14 0.49986687 180 nips-2011-Multiple Instance Filtering

15 0.49314409 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling

16 0.49191245 303 nips-2011-Video Annotation and Tracking with Active Learning

17 0.49174753 275 nips-2011-Structured Learning for Cell Tracking

18 0.46257231 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound

19 0.45954543 235 nips-2011-Recovering Intrinsic Images with a Global Sparsity Prior on Reflectance

20 0.40744418 85 nips-2011-Emergence of Multiplication in a Biophysical Model of a Wide-Field Visual Neuron for Computing Object Approaches: Dynamics, Peaks, & Fits


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.015), (4, 0.062), (20, 0.058), (26, 0.016), (31, 0.078), (33, 0.052), (43, 0.055), (45, 0.079), (57, 0.035), (60, 0.352), (65, 0.015), (74, 0.031), (83, 0.018), (84, 0.015), (99, 0.04)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82189238 138 nips-2011-Joint 3D Estimation of Objects and Scene Layout

Author: Andreas Geiger, Christian Wojek, Raquel Urtasun

Abstract: We propose a novel generative model that is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene. In particular, we infer the scene topology, geometry as well as traffic activities from a short video sequence acquired with a single camera mounted on a moving car. Our generative model takes advantage of dynamic information in the form of vehicle tracklets as well as static information coming from semantic labels and geometry (i.e., vanishing points). Experiments show that our approach outperforms a discriminative baseline based on multiple kernel learning (MKL) which has access to the same image information. Furthermore, as we reason about objects in 3D, we are able to significantly increase the performance of state-of-the-art object detectors in their ability to estimate object orientation. 1

2 0.71214938 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes

Author: Hema S. Koppula, Abhishek Anand, Thorsten Joachims, Ashutosh Saxena

Abstract: Inexpensive RGB-D cameras that give an RGB image together with depth data have become widely available. In this paper, we use this data to build 3D point clouds of full indoor scenes such as an office and address the task of semantic labeling of these 3D point clouds. We propose a graphical model that captures various features and contextual relations, including the local visual appearance and shape cues, object co-occurence relationships and geometric relationships. With a large number of object classes and relations, the model’s parsimony becomes important and we address that by using multiple types of edge potentials. The model admits efficient approximate inference, and we train it using a maximum-margin learning approach. In our experiments over a total of 52 3D scenes of homes and offices (composed from about 550 views, having 2495 segments labeled with 27 object classes), we get a performance of 84.06% in labeling 17 object classes for offices, and 73.38% in labeling 17 object classes for home scenes. Finally, we applied these algorithms successfully on a mobile robot for the task of finding objects in large cluttered rooms.1 1

3 0.48612911 127 nips-2011-Image Parsing with Stochastic Scene Grammar

Author: Yibiao Zhao, Song-chun Zhu

Abstract: This paper proposes a parsing algorithm for scene understanding which includes four aspects: computing 3D scene layout, detecting 3D objects (e.g. furniture), detecting 2D faces (windows, doors etc.), and segmenting background. In contrast to previous scene labeling work that applied discriminative classifiers to pixels (or super-pixels), we use a generative Stochastic Scene Grammar (SSG). This grammar represents the compositional structures of visual entities from scene categories, 3D foreground/background, 2D faces, to 1D lines. The grammar includes three types of production rules and two types of contextual relations. Production rules: (i) AND rules represent the decomposition of an entity into sub-parts; (ii) OR rules represent the switching among sub-types of an entity; (iii) SET rules represent an ensemble of visual entities. Contextual relations: (i) Cooperative “+” relations represent positive links between binding entities, such as hinged faces of a object or aligned boxes; (ii) Competitive “-” relations represents negative links between competing entities, such as mutually exclusive boxes. We design an efficient MCMC inference algorithm, namely Hierarchical cluster sampling, to search in the large solution space of scene configurations. The algorithm has two stages: (i) Clustering: It forms all possible higher-level structures (clusters) from lower-level entities by production rules and contextual relations. (ii) Sampling: It jumps between alternative structures (clusters) in each layer of the hierarchy to find the most probable configuration (represented by a parse tree). In our experiment, we demonstrate the superiority of our algorithm over existing methods on public dataset. In addition, our approach achieves richer structures in the parse tree. 1

4 0.47807473 227 nips-2011-Pylon Model for Semantic Segmentation

Author: Victor Lempitsky, Andrea Vedaldi, Andrew Zisserman

Abstract: Graph cut optimization is one of the standard workhorses of image segmentation since for binary random field representations of the image, it gives globally optimal results and there are efficient polynomial time implementations. Often, the random field is applied over a flat partitioning of the image into non-intersecting elements, such as pixels or super-pixels. In the paper we show that if, instead of a flat partitioning, the image is represented by a hierarchical segmentation tree, then the resulting energy combining unary and boundary terms can still be optimized using graph cut (with all the corresponding benefits of global optimality and efficiency). As a result of such inference, the image gets partitioned into a set of segments that may come from different layers of the tree. We apply this formulation, which we call the pylon model, to the task of semantic segmentation where the goal is to separate an image into areas belonging to different semantic classes. The experiments highlight the advantage of inference on a segmentation tree (over a flat partitioning) and demonstrate that the optimization in the pylon model is able to flexibly choose the level of segmentation across the image. Overall, the proposed system has superior segmentation accuracy on several datasets (Graz-02, Stanford background) compared to previously suggested approaches. 1

5 0.47793901 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling

Author: Adrian Ion, Joao Carreira, Cristian Sminchisescu

Abstract: We present a joint image segmentation and labeling model (JSL) which, given a bag of figure-ground segment hypotheses extracted at multiple image locations and scales, constructs a joint probability distribution over both the compatible image interpretations (tilings or image segmentations) composed from those segments, and over their labeling into categories. The process of drawing samples from the joint distribution can be interpreted as first sampling tilings, modeled as maximal cliques, from a graph connecting spatially non-overlapping segments in the bag [1], followed by sampling labels for those segments, conditioned on the choice of a particular tiling. We learn the segmentation and labeling parameters jointly, based on Maximum Likelihood with a novel Incremental Saddle Point estimation procedure. The partition function over tilings and labelings is increasingly more accurately approximated by including incorrect configurations that a not-yet-competent model rates probable during learning. We show that the proposed methodology matches the current state of the art in the Stanford dataset [2], as well as in VOC2010, where 41.7% accuracy on the test set is achieved.

6 0.45211041 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

7 0.43490139 66 nips-2011-Crowdclustering

8 0.41848576 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation

9 0.41649598 180 nips-2011-Multiple Instance Filtering

10 0.41494274 154 nips-2011-Learning person-object interactions for action recognition in still images

11 0.41035092 303 nips-2011-Video Annotation and Tracking with Active Learning

12 0.40833718 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss

13 0.40657625 168 nips-2011-Maximum Margin Multi-Instance Learning

14 0.40396416 55 nips-2011-Collective Graphical Models

15 0.40363777 156 nips-2011-Learning to Learn with Compound HD Models

16 0.40317607 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning

17 0.40176561 112 nips-2011-Heavy-tailed Distances for Gradient Based Image Descriptors

18 0.40176484 231 nips-2011-Randomized Algorithms for Comparison-based Search

19 0.40039298 35 nips-2011-An ideal observer model for identifying the reference frame of objects

20 0.39978233 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data