iccv iccv2013 iccv2013-433 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hongyi Zhang, Andreas Geiger, Raquel Urtasun
Abstract: In this paper, we are interested in understanding the semantics of outdoor scenes in the context of autonomous driving. Towards this goal, we propose a generative model of 3D urban scenes which is able to reason not only about the geometry and objects present in the scene, but also about the high-level semantics in the form of traffic patterns. We found that a small number of patterns is sufficient to model the vast majority of traffic scenes and show how these patterns can be learned. As evidenced by our experiments, this high-level reasoning significantly improves the overall scene estimation as well as the vehicle-to-lane association when compared to state-of-the-art approaches [10].
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper, we are interested in understanding the semantics of outdoor scenes in the context of autonomous driving. [sent-6, score-0.225]
2 Towards this goal, we propose a generative model of 3D urban scenes which is able to reason not only about the geometry and objects present in the scene, but also about the high-level semantics in the form of traffic patterns. [sent-7, score-0.804]
3 We found that a small number of patterns is sufficient to model the vast majority of traffic scenes and show how these patterns can be learned. [sent-8, score-0.921]
4 In this paper, we are interested in understanding the semantics of outdoor scenes captured from a movable platform for the task of autonomous driving. [sent-16, score-0.293]
5 Unfortunately, such approaches are hard to transfer to autonomous systems due to the assumption of a static observer and the fact that [...]. (Figure caption: In [10] high-order dependencies between objects are ignored, leading to physically implausible inference results with colliding vehicles (left).) [sent-21, score-0.295]
6 We propose to explicitly account for traffic patterns (right, correct situation marked in red), thereby substantially improving scene layout and activity estimation results. [sent-22, score-0.832]
7 [9, 10], who infer the 3D geometry of intersections as well as the location and orientation of objects in 3D from short video sequences. [sent-25, score-0.26]
8 Unfortunately their approach does not capture high-order dependencies, and as a consequence, interactions between objects are not properly captured, leading to illegitimate traffic situations in the presence of ambiguous observations, as illustrated in Fig. [sent-26, score-0.544]
9 As humans, however, we can easily infer the correct situation as we are aware of the existence of traffic signals and passing rules at signalized intersections, which can be summarized by typical traffic flow patterns, e. [sent-28, score-1.188]
10 In this paper, we take our inspiration from humans and propose a generative model of 3D urban scenes, which is able to reason not only about the geometry and objects present in the scene, but also about the high-level semantics in the form of traffic patterns. [sent-33, score-0.766]
11 As shown in our experiments, not only does this help in associating objects with the correct lanes, but it also improves the overall scene estimation. [sent-34, score-0.189]
12 We learn the traffic patterns from real scenarios and propose a novel object likelihood which, by integrating detection evidence over the full lane width, represents lane surfaces very accurately. [sent-35, score-1.594]
13 Our experiments reveal that a small number of traffic patterns is sufficient to cover the majority of traffic scenarios at signalized intersections. [sent-36, score-1.355]
14 Taken together, the proposed model significantly outperforms the state-of-the-art [10] in estimating the geometry and tracklet associations. [sent-37, score-0.287]
15 Furthermore, it provides high-level knowledge about the scene such as the current traffic light phase, without detecting and recognizing traffic lights explicitly, a task that is extremely difficult without the use of hand-annotated maps. [sent-38, score-1.167]
16 Related Work A wide variety of approaches have been proposed to recover the 3D layout of static indoor scenes from a single image [13, 28, 19, 26]. [sent-40, score-0.197]
17 Using long-term scene observations, [17] proposes a method for unsupervised activity recognition and abnormality detection that is able to recover spatio-temporal dependencies and traffic light states from a static camera mounted on the roof of a building. [sent-53, score-0.808]
18 This limits their applicability as the learned scene models cannot be transferred to new scenes, which is required when performing scene understanding from movable platforms. [sent-55, score-0.217]
19 In contrast, here we are interested in inferring semantics at a higher level, such as multi-object traffic patterns at intersections, in order to improve the layout and object estimation process. [sent-57, score-0.852]
20 Prior work on 3D traffic scene analysis from movable platforms is mostly limited to ground plane estimation [2], classification [3, 8] or very simple planar scene models [29]. [sent-60, score-0.72]
21 [9, 10] tackle the problem of urban traffic scene understanding by considering static infrastructure (e. [sent-62, score-0.727]
22 However, their limiting assumption of independent tracklets given the road layout can lead to implausible inference results such as vehicles on a collision course, as illustrated in Fig. [sent-67, score-0.583]
23 In this work, we aim at alleviating these problems by incorporating a latent variable model that captures learned traffic patterns which jointly determine vehicle velocities and traffic light states. [sent-69, score-1.356]
24 Moreover, we show how a small set of plausible traffic patterns can be learned from annotated data. [sent-71, score-0.702]
25 Additionally, we propose a novel object location likelihood that, by marginalizing over the lateral position on the lane, models lanes much more accurately than [10] and improves the estimation of parameters such as the street orientations. [sent-72, score-0.308]
26 Modeling Traffic Patterns In this section we present our generative model of scenes, which reasons jointly about high-level semantics in the form of traffic patterns as well as the 3D layout and objects present in the scene. [sent-74, score-0.894]
27 We restrict our domain to 3-arm and 4-arm intersections, which occur frequently and exhibit interesting traffic patterns. [sent-75, score-0.638]
28 We first learn a subset of predominant patterns from training data. [sent-76, score-0.158]
29 At inference we recover these patterns from short video sequences and jointly associate vehicles to the corresponding lanes. [sent-77, score-0.291]
30 Following [10], we represent the geometry of the intersection in bird’s eye coordinates and denote c ∈ R2 the center of the intersection, r the rotation of our own car with respect to the intersection, w the street width and α the crossing angle. [sent-78, score-0.218]
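As an illustrative sketch only (the class and field names are assumptions, not the authors' code), the road parameters R described above could be grouped as follows:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RoadModel:
    """Bird's-eye-view intersection geometry R = (c, r, w, alpha); illustrative only."""
    c: np.ndarray    # (2,) center of the intersection in bird's eye coordinates
    r: float         # rotation of the ego-vehicle w.r.t. the intersection [rad]
    w: float         # street width [m]
    alpha: float     # crossing angle of the intersecting street [rad]
```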
31 We utilize semantic segmentation, 3D tracklets and vanishing points as observations to estimate the traffic patterns, layout and the vehicle-to-lane associations. [sent-83, score-0.878]
32 Figure caption (partly recovered): geometry, tracklets and MAP estimates of the road model; (a) road model parameters; hidden variables: the location of the vehicle along the normal of the lane spline at distance si. [sent-89, score-0.787]
33 Figure caption (partly recovered): Blue: lane spline and parking spots with associated si’s (blue circles). [sent-92, score-0.21]
34 In (c), despite being less probable, the blue and purple tracklets get assigned a higher likelihood than the red and green ones in the model of [10]. [sent-93, score-0.259]
35 Instead, our model correctly considers the red and the green tracklets more likely. [sent-94, score-0.227]
36 A Traffic-Aware Tracklet Model We aim at estimating the lane each vehicle is driving on, or in which road the vehicle is parked. [sent-104, score-0.733]
37 Let l be a latent variable indexing the lane or parking position associated with a tracklet. [sent-109, score-0.484]
38 Note that this is in contrast to [10] which assumes that the cars drive in the middle of the road with uniform prior over velocity and acceleration. [sent-116, score-0.164]
39 In Fig. 2(c), the blue and purple tracklets will have higher likelihood than the green and red tracklets in the model of [10]. [sent-119, score-0.481]
40 Thus, we include a latent variable h that models the lateral location of a vehicle and integrate it with uniform prior probability over the lane width (i. [sent-121, score-0.698]
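A minimal sketch of integrating detection evidence over the lane width under a uniform prior on the lateral offset h, assuming (purely for illustration) a Gaussian measurement model for the observed lateral offset; the function and parameter names are hypothetical:

```python
import numpy as np

def lateral_likelihood(observed_offset, lane_width, sigma, n_samples=50):
    """Approximate (1/w) * integral_{-w/2}^{w/2} N(observed_offset; h, sigma^2) dh
    by averaging the Gaussian density over evenly spaced lateral offsets h."""
    hs = np.linspace(-lane_width / 2.0, lane_width / 2.0, n_samples)
    dens = np.exp(-0.5 * ((observed_offset - hs) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return dens.mean()  # uniform prior over h reduces the integral to a simple average
```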
41 In addition, [10] does not model dependencies in vehicle dynamics, whereas we encode the intuition that vehicles switch between “stop” and “go” states rather rarely. [sent-125, score-0.265]
42 This penalty helps to further reduce detection noise and to improve the tracklet estimation results. [sent-127, score-0.256]
43 In order to accurately estimate traffic patterns and lane associations, high-quality tracklet observations are crucial. [sent-128, score-1.371]
44 We compute tracklets using a two-stage approach: In the first stage we form short contiguous tracklets by associating detections using the Hungarian method while predicting bounding boxes over time using a Kalman filter. [sent-129, score-0.531]
45 The second stage overcomes occlusions by joining tracklets which are up to 20 frames apart. [sent-131, score-0.227]
46 Including the appearance term into our association led to much more stable tracklet associations, especially in the case of heavy occlusions. [sent-133, score-0.297]
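A minimal sketch of the first association stage, matching Kalman-predicted boxes to per-frame detections with the Hungarian method; the cost weighting, IoU term and gating threshold are assumptions, not the paper's values:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(predicted_boxes, detected_boxes, appearance_dist, max_cost=0.8, w_app=0.5):
    """Return (track_idx, detection_idx) matches for one frame.
    Boxes are (x1, y1, x2, y2); appearance_dist[i, j] is an appearance distance."""
    def iou(a, b):
        x1, y1 = np.maximum(a[:2], b[:2])
        x2, y2 = np.minimum(a[2:], b[2:])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    cost = np.zeros((len(predicted_boxes), len(detected_boxes)))
    for i, p in enumerate(predicted_boxes):
        for j, d in enumerate(detected_boxes):
            cost[i, j] = (1 - w_app) * (1 - iou(p, d)) + w_app * appearance_dist[i, j]
    rows, cols = linear_sum_assignment(cost)          # Hungarian method
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < max_cost]
```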
47 (i.e., its longitudinal position), and h ∈ R to denote its true location along the normal direction of the spline (i.e. [sent-139, score-0.204]
48 Together, they define a lane spline coordinate system such that every point on the ‘y=0’ plane in bird’s eye perspective can be represented. [sent-142, score-0.51]
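A sketch of mapping lane-spline coordinates (s, h) back to the bird's-eye plane, assuming the lane spline is available as a dense 2D polyline; this helper is illustrative, not the authors' implementation:

```python
import numpy as np

def spline_to_birdseye(spline_pts, s, h):
    """Map (s = arc length along the lane spline, h = signed offset along its normal)
    to a bird's-eye (x, y) point. spline_pts is an (N, 2) polyline approximation."""
    seg = np.diff(spline_pts, axis=0)
    seg_len = np.linalg.norm(seg, axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg_len)])
    i = int(np.clip(np.searchsorted(arc, s) - 1, 0, len(seg) - 1))
    t = (s - arc[i]) / (seg_len[i] + 1e-12)
    point = spline_pts[i] + t * seg[i]             # point on the spline at arc length s
    tangent = seg[i] / (seg_len[i] + 1e-12)
    normal = np.array([-tangent[1], tangent[0]])   # left-hand normal of the spline
    return point + h * normal
```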
49 To simplify notation, we will use gi to represent the pair gi = (si, bi). [sent-144, score-0.354]
50 We define a 3D tracklet as a set of object detections t = {di , . [sent-150, score-0.255]
51 In order to reason about traffic semantics, we introduce an additional latent variable a representing the possible traffic flow patterns. [sent-153, score-1.088]
52 Fig. 4 illustrates the learned traffic patterns we use for 3-armed and 4-armed intersections. [sent-155, score-0.702]
53 We define the probability distribution over all tracklet observations as p(T|R) = Σa p(a) Πn Σln p(ln) p(tn|ln, a, R) (2), where a is the traffic pattern and ln is the lane index of tracklet n, denoted as tn. [sent-160, score-1.339]
54 We assume a uniform prior over traffic patterns a and lane assignments l. [sent-161, score-1.136]
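With the uniform priors just stated, Eq. (2) can be evaluated in log space. The sketch below assumes a precomputed array loglik[a, n, l] = log p(tn | l, a, R); the array layout is an assumption made for illustration:

```python
import numpy as np
from scipy.special import logsumexp

def log_p_tracklets(loglik):
    """Eq. (2): log p(T|R) with uniform priors over patterns a and lane indices l.
    loglik has shape (num_patterns, num_tracklets, num_lanes)."""
    A, N, L = loglik.shape
    per_tracklet = logsumexp(loglik - np.log(L), axis=2)  # log sum_l p(l) p(t_n|l,a,R)
    per_pattern = per_tracklet.sum(axis=1)                # log prod_n (...)
    return logsumexp(per_pattern - np.log(A))             # log sum_a p(a) (...)
```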
55 In the following we will define the likelihood of a single tracklet p(t|l, a, R), dropping the tracklet index n for clarity. [sent-162, score-0.29]
56 Our tracklet formulation combines a hidden Markov model (HMM) and a dynamical system with nonlinear constraints. [sent-163, score-0.269]
57 The tracklet likelihood p(t|a, l, R) is given as p(t|a, l, R) = p(d1|a, l, R) Πi p(di|di−1, a, l, R) (3). For the sake of clarity let us drop the dependencies on a, l, R in the following. [sent-169, score-0.28]
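A log-space sketch of the chain factorization in Eq. (3); the dependence on (a, l, R), and on whatever hidden quantities the full model integrates out, is folded into two user-supplied callables, which are assumptions made for illustration:

```python
def log_tracklet_likelihood(detections, log_p_first, log_p_trans):
    """Eq. (3): log p(t) = log p(d_1) + sum_i log p(d_i | d_{i-1}).
    log_p_first(d) and log_p_trans(d_cur, d_prev) are supplied by the caller."""
    total = log_p_first(detections[0])
    for prev, cur in zip(detections[:-1], detections[1:]):
        total += log_p_trans(cur, prev)
    return total
```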
58 Learning Traffic Patterns: Number of tracklets explained by the learned patterns for different maximum numbers of total patterns. [sent-178, score-0.385]
59 Note that 4 patterns are sufficient to explain the majority of scenarios in our dataset. [sent-179, score-0.21]
60 The longitudinal transition probability is defined as p(gi|gi−1) = 0 if bi = stop ∧ si ≠ si−1, and p(gi|gi−1) = p(bi|bi−1) π(si − si−1) if bi = go, where π(si − si−1) represents a look-up table that depends on the difference si − si−1 in consecutive frames (i. [sent-182, score-0.201]
61 For the parking spots, s is assumed to be constant and h is truncated at the end of the parking area. [sent-188, score-0.215]
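A minimal sketch of the longitudinal transition defined above; how the displacement look-up table π is binned and smoothed is an assumption, not taken from the paper:

```python
import numpy as np

def p_longitudinal(s_i, b_i, s_prev, b_prev, p_b_trans, pi_table, bin_width=0.5):
    """p(g_i | g_{i-1}) for g = (s, b) with b in {'stop', 'go'} (cf. the text above).
    p_b_trans[(b_prev, b_i)] is the stop/go switching probability; pi_table is the
    look-up table over binned displacements s_i - s_prev (binning is illustrative)."""
    if b_i == 'stop' and not np.isclose(s_i, s_prev):
        return 0.0                                  # a stopped vehicle must not move
    k = int(np.clip((s_i - s_prev) / bin_width, 0, len(pi_table) - 1))
    return p_b_trans[(b_prev, b_i)] * pi_table[k]
```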
62 The generative process works as follows: First the road geometry and the traffic pattern are sampled. [sent-191, score-0.783]
63 Next, we sample the hidden states h and g conditioned on the geometry in order to generate the first frame of the tracklet. [sent-192, score-0.198]
64 This gives us the longitudinal position on the spline as well as the lateral distance to the spline. [sent-193, score-0.22]
65 Finally, we first sample the remaining hidden states using the dynamics and then sample the vehicle detections conditioned on all other variables. [sent-196, score-0.178]
66 Learning We restrict the set of possible traffic patterns to those that are collision-free. [sent-200, score-0.702]
67 Our goal is to recover a small subset of patterns which explains the annotated data well. [sent-203, score-0.158]
68 Each combination is scored according to the total number of tracklets explained by the best pattern in the current set. [sent-205, score-0.26]
69 A tracklet is explained by a pattern if its lane association and stop-or-go state agree with the pattern. [sent-206, score-0.694]
70 As illustrated in Table 1, 4 patterns are sufficient to explain the majority of scenarios. [sent-208, score-0.181]
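A sketch of the pattern-subset selection just described: each candidate combination of collision-free patterns is scored by the number of annotated tracklets it explains; the explains(pattern, tracklet) agreement test is a placeholder for the lane-association and stop-or-go check from the text:

```python
from itertools import combinations

def select_patterns(candidate_patterns, tracklets, k, explains):
    """Return the size-k subset of patterns explaining the most tracklets."""
    best_subset, best_score = None, -1
    for subset in combinations(candidate_patterns, k):
        score = sum(any(explains(p, t) for p in subset) for t in tracklets)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```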
71 Tracklets on lanes of different states exhibit different “stop-go” transition statistics. [sent-215, score-0.275]
72 However, note that all lane states have low switching probabilities. [sent-216, score-0.482]
73 We learn the transition probability of the binary hidden states b on active/inactive lanes respectively using the tracklets and corresponding ground truth tracklet-to-lane associations. [sent-219, score-0.575]
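Estimating the binary stop/go switching probabilities then reduces to counting transitions in the ground-truth-associated tracklets, done separately for active and inactive lanes; a minimal sketch (the Laplace smoothing is an assumption):

```python
import numpy as np

def learn_b_transitions(state_sequences):
    """Estimate p(b_i | b_{i-1}) from sequences of 0/1 (stop/go) labels,
    e.g. called once for tracklets on active lanes and once for inactive lanes."""
    counts = np.ones((2, 2))                        # Laplace smoothing (assumption)
    for seq in state_sequences:
        for prev, cur in zip(seq[:-1], seq[1:]):
            counts[prev, cur] += 1
    return counts / counts.sum(axis=1, keepdims=True)
```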
74 Inference In this section we describe how to infer the road parameters, the traffic patterns, the lane associations and the hidden states in our model. [sent-223, score-1.313]
75 Given the traffic pattern a and road parameters R, the lane association of tracklet tn is given by the maximum of p(ln|a, tn, R) ∝ p(tn|a, ln, R). [sent-233, score-0.795]
76 Similarly, we infer the maximum-a-posteriori traffic pattern for a particular sequence by taking the product over all N tracklets tn and marginalizing over the lane associations ln: p(a|T, R) ∝ Πn Σln p(tn|a, ln, R) (13). [sent-235, score-1.513]
77 Here, we assume a uniform prior on traffic patterns p(a). [sent-236, score-0.73]
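Given per-tracklet log-likelihoods, the MAP traffic pattern of Eq. (13) and the per-tracklet lane associations follow directly; the sketch reuses the loglik[a, n, l] array convention assumed in the earlier snippet:

```python
import numpy as np
from scipy.special import logsumexp

def infer_pattern_and_lanes(loglik):
    """loglik[a, n, l] = log p(t_n | l, a, R). Returns (a_map, lane_map) under
    uniform priors; the constant 1/L factor cancels in the argmax."""
    log_post_a = logsumexp(loglik, axis=2).sum(axis=1)   # Eq. (13) up to a constant
    a_map = int(np.argmax(log_post_a))
    lane_map = np.argmax(loglik[a_map], axis=1)          # argmax_l p(l | a, t_n, R)
    return a_map, lane_map
```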
78 Given the MAP estimate of the traffic pattern a and the lane association ln, the MAP assignment of hidden states {g1, . [sent-237, score-1.161]
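For a discretized hidden-state grid, a Viterbi-style dynamic program is one standard way to obtain such a MAP state sequence; note that the paper's own runtime discussion mentions sampling (15000 samples), so this is only an illustrative alternative, not the authors' inference scheme:

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """MAP state sequence for a discretized chain.
    log_init: (S,), log_trans: (S, S) with [prev, cur], log_emit: (T, S)."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans          # best predecessor per state
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_emit[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```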
79 For a fair comparison of tracklet-to-lane associations and road parameter inference, the method of [10] was run using our improved tracklets. [sent-248, score-0.207]
80 Our dataset consists of all video sequences of signalized intersections from the dataset of [9], summing up to 11 three-armed and 49 four-armed intersection sequences. [sent-251, score-0.206]
81 As our primary goal is to reason about the vehicle dynamics and the traffic patterns we do not infer the topology. [sent-254, score-0.87]
82 Road Parameter Estimation: Following the error metric employed in [9, 10], we first evaluate the performance of our model in inferring the location (center of intersections), the orientation of its arms as well as the overlap of the inferred road area with ground truth. [sent-263, score-0.203]
83 Note that in contrast to the activities proposed in [9, 10] this measure is much more restrictive as it does not only consider which lanes are given the green light, but instead requires each tracklet to be associated with the correct lane. [sent-267, score-0.367]
84 This is a difficult task, especially given the fact that some tracklets are so short (in time or space) that they can only be disambiguated using high-level knowledge. [sent-268, score-0.254]
85 As shown in Table 3, by modeling traffic patterns and object location and dynamics more accurately, we achieve a significant reduction in terms of tracklet-to-lane association (T-L) error wrt. [sent-269, score-0.848]
86 Traffic Patterns: We labeled the traffic pattern for each of the sequences in our dataset, which is summarized in Table 4. [sent-271, score-0.577]
87 In our dataset, 4 videos are dominated by pattern transitions and another 9 videos contain unidentifiable patterns which do not correspond to any of the patterns in our model. [sent-272, score-0.377]
88 We evaluate the performance of our model on these videos with the exception of the traffic pattern inference task. [sent-273, score-0.654]
89 As shown in Table 3 (right column), our model can infer traffic patterns with high accuracy while only having access to short monocular video sequences. [sent-274, score-0.795]
90 [... using 15000 samples ...] Estimating the MAP vehicle locations given the road parameters only takes about 1 second for all tracklets of a sequence. [sent-288, score-0.251]
91 Geometry Estimation and Tracklet-to-Lane Association Results: Results of tracklet-to-lane association (T-L) error, intersection location errors (bird’s eye view), street orientation errors and street area overlap (see [10] for a definition). [sent-305, score-0.219]
92 Lanes with at least one visible tracklet are shown. [sent-306, score-0.367]
93 The pictogram in the lower-left corner (red) of each image shows the inferred traffic pattern, the symbol in the lower-right corner (green) the ground truth pattern. [sent-307, score-0.597]
94 The other cases are ambiguous ones as they contain transitions between two adjacent patterns or irregular driving behaviors such as U-turns (rightmost figure). [sent-314, score-0.234]
95 Traffic Patterns: Number of occurrences of each traffic pattern (see Fig. [sent-317, score-0.577]
96 Conclusions In this paper, we proposed a generative model of 3D urban scenes which is able to reason jointly about the geometry and objects present in the scene, as well as about the high-level semantics in the form of traffic patterns. [sent-320, score-0.804]
97 As shown by our experiments, this results in significant performance improvements over the state-of-the-art in all aspects of the scene estimation and allows us to infer the current traffic light situation. [sent-321, score-0.666]
98 In the future, we plan to extend our approach to model transitions between traffic patterns. [sent-322, score-0.572]
99 A generative model for 3d urban scene understanding from movable platforms. [sent-406, score-0.256]
100 Combining appearance and structure from motion features for road scene understanding. [sent-532, score-0.163]
wordName wordTfidf (topN-words)
[('traffic', 0.544), ('lane', 0.406), ('tracklet', 0.232), ('tracklets', 0.227), ('hi', 0.187), ('gi', 0.177), ('patterns', 0.158), ('di', 0.142), ('te', 0.141), ('lanes', 0.135), ('road', 0.109), ('spline', 0.104), ('associations', 0.098), ('intersections', 0.094), ('vehicle', 0.085), ('parking', 0.078), ('layout', 0.076), ('states', 0.076), ('semantics', 0.074), ('movable', 0.068), ('association', 0.065), ('transition', 0.064), ('geiger', 0.064), ('rm', 0.06), ('longitudinal', 0.059), ('tn', 0.059), ('signalized', 0.057), ('lateral', 0.057), ('ln', 0.057), ('vehicles', 0.056), ('intersection', 0.055), ('geometry', 0.055), ('scene', 0.054), ('inferred', 0.053), ('urban', 0.051), ('inference', 0.05), ('dependencies', 0.048), ('driving', 0.048), ('marginalizing', 0.046), ('indoor', 0.046), ('autonomous', 0.046), ('width', 0.045), ('dhi', 0.044), ('infer', 0.043), ('generative', 0.042), ('si', 0.042), ('location', 0.041), ('understanding', 0.041), ('dynamics', 0.04), ('scenes', 0.038), ('hongyi', 0.038), ('december', 0.038), ('heading', 0.038), ('hidden', 0.037), ('static', 0.037), ('truncated', 0.036), ('probability', 0.036), ('implausible', 0.034), ('tinhea', 0.034), ('xln', 0.034), ('pattern', 0.033), ('etra', 0.031), ('collision', 0.031), ('observations', 0.031), ('conditioned', 0.03), ('hoiem', 0.029), ('street', 0.029), ('scenarios', 0.029), ('oi', 0.028), ('uniform', 0.028), ('spots', 0.028), ('transitions', 0.028), ('ess', 0.028), ('cars', 0.027), ('schindler', 0.027), ('short', 0.027), ('bounding', 0.027), ('likelihood', 0.027), ('exception', 0.027), ('outdoor', 0.026), ('xp', 0.025), ('light', 0.025), ('observer', 0.024), ('detection', 0.024), ('ro', 0.024), ('recovering', 0.024), ('bird', 0.024), ('rooms', 0.024), ('kuettel', 0.023), ('stop', 0.023), ('additionally', 0.023), ('state', 0.023), ('social', 0.023), ('majority', 0.023), ('detections', 0.023), ('monocular', 0.023), ('hedau', 0.023), ('hyperparameters', 0.023), ('ree', 0.023), ('manhattan', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 433 iccv-2013-Understanding High-Level Semantics by Modeling Traffic Patterns
Author: Hongyi Zhang, Andreas Geiger, Raquel Urtasun
Abstract: In this paper, we are interested in understanding the semantics of outdoor scenes in the context of autonomous driving. Towards this goal, we propose a generative model of 3D urban scenes which is able to reason not only about the geometry and objects present in the scene, but also about the high-level semantics in the form of traffic patterns. We found that a small number of patterns is sufficient to model the vast majority of traffic scenes and show how these patterns can be learned. As evidenced by our experiments, this high-level reasoning significantly improves the overall scene estimation as well as the vehicle-to-lane association when compared to state-of-the-art approaches [10].
2 0.24872877 393 iccv-2013-Simultaneous Clustering and Tracklet Linking for Multi-face Tracking in Videos
Author: Baoyuan Wu, Siwei Lyu, Bao-Gang Hu, Qiang Ji
Abstract: We describe a novel method that simultaneously clusters and associates short sequences of detected faces (termed as face tracklets) in videos. The rationale of our method is that face tracklet clustering and linking are related problems that can benefit from the solutions of each other. Our method is based on a hidden Markov random field model that represents the joint dependencies of cluster labels and tracklet linking associations . We provide an efficient algorithm based on constrained clustering and optimal matching for the simultaneous inference of cluster labels and tracklet associations. We demonstrate significant improvements on the state-of-the-art results in face tracking and clustering performances on several video datasets.
3 0.16278879 418 iccv-2013-The Way They Move: Tracking Multiple Targets with Similar Appearance
Author: Caglayan Dicle, Octavia I. Camps, Mario Sznaier
Abstract: We introduce a computationally efficient algorithm for multi-object tracking by detection that addresses four main challenges: appearance similarity among targets, missing data due to targets being out of the field of view or occluded behind other objects, crossing trajectories, and camera motion. The proposed method uses motion dynamics as a cue to distinguish targets with similar appearance, minimize target mis-identification and recover missing data. Computational efficiency is achieved by using a Generalized Linear Assignment (GLA) coupled with efficient procedures to recover missing data and estimate the complexity of the underlying dynamics. The proposed approach works with tracklets of arbitrary length and does not assume a dynamical model a priori, yet it captures the overall motion dynamics of the targets. Experiments using challenging videos show that this framework can handle complex target motions, non-stationary cameras and long occlusions, on scenarios where appearance cues are not available or poor.
4 0.13839039 187 iccv-2013-Group Norm for Learning Structured SVMs with Unstructured Latent Variables
Author: Daozheng Chen, Dhruv Batra, William T. Freeman
Abstract: Latent variables models have been applied to a number of computer vision problems. However, the complexity of the latent space is typically left as a free design choice. A larger latent space results in a more expressive model, but such models are prone to overfitting and are slower to perform inference with. The goal of this paper is to regularize the complexity of the latent space and learn which hidden states are really relevant for prediction. Specifically, we propose using group-sparsity-inducing regularizers such as ?1-?2 to estimate the parameters of Structured SVMs with unstructured latent variables. Our experiments on digit recognition and object detection show that our approach is indeed able to control the complexity of latent space without any significant loss in accuracy of the learnt model.
5 0.1310672 144 iccv-2013-Estimating the 3D Layout of Indoor Scenes and Its Clutter from Depth Sensors
Author: Jian Zhang, Chen Kan, Alexander G. Schwing, Raquel Urtasun
Abstract: In this paper we propose an approach to jointly estimate the layout ofrooms as well as the clutterpresent in the scene using RGB-D data. Towards this goal, we propose an effective model that is able to exploit both depth and appearance features, which are complementary. Furthermore, our approach is efficient as we exploit the inherent decomposition of additive potentials. We demonstrate the effectiveness of our approach on the challenging NYU v2 dataset and show that employing depth reduces the layout error by 6% and the clutter estimation by 13%.
6 0.11552922 286 iccv-2013-NYC3DCars: A Dataset of 3D Vehicles in Geographic Context
7 0.11412591 64 iccv-2013-Box in the Box: Joint 3D Layout and Object Reasoning from Single Images
8 0.11183956 58 iccv-2013-Bayesian 3D Tracking from Monocular Video
9 0.098446809 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
10 0.09693002 174 iccv-2013-Forward Motion Deblurring
11 0.096438058 1 iccv-2013-3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding
12 0.092268422 230 iccv-2013-Latent Data Association: Bayesian Model Selection for Multi-target Tracking
13 0.088199429 72 iccv-2013-Characterizing Layouts of Outdoor Scenes Using Spatial Topic Processes
14 0.084036969 201 iccv-2013-Holistic Scene Understanding for 3D Object Detection with RGBD Cameras
15 0.083274834 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation
16 0.079475127 242 iccv-2013-Learning People Detectors for Tracking in Crowded Scenes
17 0.076260351 250 iccv-2013-Lifting 3D Manhattan Lines from a Single Image
18 0.074592747 46 iccv-2013-Allocentric Pose Estimation
19 0.072035477 386 iccv-2013-Sequential Bayesian Model Update under Structured Scene Prior for Semantic Road Scenes Labeling
20 0.072004445 324 iccv-2013-Potts Model, Parametric Maxflow and K-Submodular Functions
topicId topicWeight
[(0, 0.169), (1, -0.028), (2, 0.014), (3, 0.036), (4, 0.054), (5, -0.002), (6, -0.023), (7, 0.019), (8, -0.021), (9, -0.037), (10, -0.005), (11, -0.066), (12, -0.032), (13, 0.031), (14, -0.029), (15, -0.016), (16, -0.054), (17, 0.051), (18, -0.023), (19, -0.04), (20, -0.166), (21, -0.076), (22, 0.103), (23, -0.088), (24, 0.061), (25, -0.057), (26, 0.08), (27, -0.11), (28, 0.005), (29, 0.012), (30, 0.009), (31, -0.001), (32, 0.043), (33, 0.064), (34, 0.067), (35, 0.043), (36, 0.023), (37, 0.013), (38, -0.018), (39, -0.029), (40, 0.057), (41, 0.014), (42, -0.033), (43, -0.024), (44, -0.182), (45, -0.001), (46, 0.093), (47, 0.011), (48, -0.073), (49, 0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.92333919 433 iccv-2013-Understanding High-Level Semantics by Modeling Traffic Patterns
Author: Hongyi Zhang, Andreas Geiger, Raquel Urtasun
Abstract: In this paper, we are interested in understanding the semantics of outdoor scenes in the context of autonomous driving. Towards this goal, we propose a generative model of 3D urban scenes which is able to reason not only about the geometry and objects present in the scene, but also about the high-level semantics in the form of traffic patterns. We found that a small number of patterns is sufficient to model the vast majority of traffic scenes and show how these patterns can be learned. As evidenced by our experiments, this high-level reasoning significantly improves the overall scene estimation as well as the vehicle-to-lane association when compared to state-of-the-art approaches [10].
2 0.6891076 393 iccv-2013-Simultaneous Clustering and Tracklet Linking for Multi-face Tracking in Videos
Author: Baoyuan Wu, Siwei Lyu, Bao-Gang Hu, Qiang Ji
Abstract: We describe a novel method that simultaneously clusters and associates short sequences of detected faces (termed as face tracklets) in videos. The rationale of our method is that face tracklet clustering and linking are related problems that can benefit from the solutions of each other. Our method is based on a hidden Markov random field model that represents the joint dependencies of cluster labels and tracklet linking associations . We provide an efficient algorithm based on constrained clustering and optimal matching for the simultaneous inference of cluster labels and tracklet associations. We demonstrate significant improvements on the state-of-the-art results in face tracking and clustering performances on several video datasets.
3 0.61616355 58 iccv-2013-Bayesian 3D Tracking from Monocular Video
Author: Ernesto Brau, Jinyan Guan, Kyle Simek, Luca Del Pero, Colin Reimer Dawson, Kobus Barnard
Abstract: [...] for tracking an unknown and changing number of people in a scene using video taken from a single, fixed viewpoint. We develop a Bayesian modeling approach for tracking people in 3D from monocular video with unknown cameras. Modeling in 3D provides natural explanations for occlusions and smoothness discontinuities that result from projection, and allows priors on velocity and smoothness to be grounded in physical quantities: meters and seconds vs. pixels and frames. We pose the problem in the context of data association, in which observations are assigned to tracks. A correct application of Bayesian inference to multitarget tracking must address the fact that the model's dimension changes as tracks are added or removed, and thus, posterior densities of different hypotheses are not comparable. We address this by marginalizing out the trajectory parameters so the resulting posterior over data associations has constant dimension. This is made tractable by using (a) Gaussian process priors for smooth trajectories and (b) approximately Gaussian likelihood functions. Our approach provides a principled method for incorporating multiple sources of evidence; we present results using both optical flow and object detector outputs. Results are comparable to recent work on 3D tracking and, unlike others, our method requires no pre-calibrated cameras.
4 0.60943884 418 iccv-2013-The Way They Move: Tracking Multiple Targets with Similar Appearance
Author: Caglayan Dicle, Octavia I. Camps, Mario Sznaier
Abstract: We introduce a computationally efficient algorithm for multi-object tracking by detection that addresses four main challenges: appearance similarity among targets, missing data due to targets being out of the field of view or occluded behind other objects, crossing trajectories, and camera motion. The proposed method uses motion dynamics as a cue to distinguish targets with similar appearance, minimize target mis-identification and recover missing data. Computational efficiency is achieved by using a Generalized Linear Assignment (GLA) coupled with efficient procedures to recover missing data and estimate the complexity of the underlying dynamics. The proposed approach works with tracklets of arbitrary length and does not assume a dynamical model a priori, yet it captures the overall motion dynamics of the targets. Experiments using challenging videos show that this framework can handle complex target motions, non-stationary cameras and long occlusions, on scenarios where appearance cues are not available or poor.
5 0.57570142 230 iccv-2013-Latent Data Association: Bayesian Model Selection for Multi-target Tracking
Author: Aleksandr V. Segal, Ian Reid
Abstract: We propose a novel parametrization of the data association problem for multi-target tracking. In our formulation, the number of targets is implicitly inferred together with the data association, effectively solving data association and model selection as a single inference problem. The novel formulation allows us to interpret data association and tracking as a single Switching Linear Dynamical System (SLDS). We compute an approximate posterior solution to this problem using a dynamic programming/message passing technique. This inference-based approach allows us to incorporate richer probabilistic models into the tracking system. In particular, we incorporate inference over inliers/outliers and track termination times into the system. We evaluate our approach on publicly available datasets and demonstrate results competitive with, and in some cases exceeding the state of the art.
6 0.56439507 64 iccv-2013-Box in the Box: Joint 3D Layout and Object Reasoning from Single Images
7 0.55072117 87 iccv-2013-Conservation Tracking
8 0.52790928 72 iccv-2013-Characterizing Layouts of Outdoor Scenes Using Spatial Topic Processes
9 0.52651078 200 iccv-2013-Higher Order Matching for Consistent Multiple Target Tracking
10 0.52477247 216 iccv-2013-Inferring "Dark Matter" and "Dark Energy" from Videos
11 0.50052428 201 iccv-2013-Holistic Scene Understanding for 3D Object Detection with RGBD Cameras
12 0.49519393 144 iccv-2013-Estimating the 3D Layout of Indoor Scenes and Its Clutter from Depth Sensors
13 0.48919448 331 iccv-2013-Pyramid Coding for Functional Scene Element Recognition in Video Scenes
14 0.47427055 289 iccv-2013-Network Principles for SfM: Disambiguating Repeated Structures with Local Context
15 0.46685874 167 iccv-2013-Finding Causal Interactions in Video Sequences
16 0.4516384 1 iccv-2013-3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding
17 0.44364139 350 iccv-2013-Relative Attributes for Large-Scale Abandoned Object Detection
18 0.44100046 79 iccv-2013-Coherent Object Detection with 3D Geometric Context from a Single Image
19 0.44064674 246 iccv-2013-Learning the Visual Interpretation of Sentences
20 0.43926316 187 iccv-2013-Group Norm for Learning Structured SVMs with Unstructured Latent Variables
topicId topicWeight
[(2, 0.04), (7, 0.035), (12, 0.015), (26, 0.063), (31, 0.063), (34, 0.029), (40, 0.014), (42, 0.088), (48, 0.012), (64, 0.067), (73, 0.067), (78, 0.011), (88, 0.04), (89, 0.182), (94, 0.149), (95, 0.022), (98, 0.014)]
simIndex simValue paperId paperTitle
1 0.87290221 222 iccv-2013-Joint Learning of Discriminative Prototypes and Large Margin Nearest Neighbor Classifiers
Author: Martin Köstinger, Paul Wohlhart, Peter M. Roth, Horst Bischof
Abstract: In this paper, we raise important issues concerning the evaluation complexity of existing Mahalanobis metric learning methods. The complexity scales linearly with the size of the dataset. This is especially cumbersome on large scale or for real-time applications with limited time budget. To alleviate this problem we propose to represent the dataset by a fixed number of discriminative prototypes. In particular, we introduce a new method that jointly chooses the positioning of prototypes and also optimizes the Mahalanobis distance metric with respect to these. We show that choosing the positioning of the prototypes and learning the metric in parallel leads to a drastically reduced evaluation effort while maintaining the discriminative essence of the original dataset. Moreover, for most problems our method performing k-nearest prototype (k-NP) classification on the condensed dataset leads to even better generalization compared to k-NN classification using all data. Results on a variety of challenging benchmarks demonstrate the power of our method. These include standard machine learning datasets as well as the challenging Public Fig- ures Face Database. On the competitive machine learning benchmarks we are comparable to the state-of-the-art while being more efficient. On the face benchmark we clearly outperform the state-of-the-art in Mahalanobis metric learning with drastically reduced evaluation effort.
same-paper 2 0.87050253 433 iccv-2013-Understanding High-Level Semantics by Modeling Traffic Patterns
Author: Hongyi Zhang, Andreas Geiger, Raquel Urtasun
Abstract: In this paper, we are interested in understanding the semantics of outdoor scenes in the context of autonomous driving. Towards this goal, we propose a generative model of 3D urban scenes which is able to reason not only about the geometry and objects present in the scene, but also about the high-level semantics in the form of traffic patterns. We found that a small number of patterns is sufficient to model the vast majority of traffic scenes and show how these patterns can be learned. As evidenced by our experiments, this high-level reasoning significantly improves the overall scene estimation as well as the vehicle-to-lane association when compared to state-of-the-art approaches [10].
3 0.83610249 189 iccv-2013-HOGgles: Visualizing Object Detection Features
Author: Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, Antonio Torralba
Abstract: We introduce algorithms to visualize feature spaces used by object detectors. The tools in this paper allow a human to put on ‘HOG goggles ’ and perceive the visual world as a HOG based object detector sees it. We found that these visualizations allow us to analyze object detection systems in new ways and gain new insight into the detector’s failures. For example, when we visualize the features for high scoring false alarms, we discovered that, although they are clearly wrong in image space, they do look deceptively similar to true positives in feature space. This result suggests that many of these false alarms are caused by our choice of feature space, and indicates that creating a better learning algorithm or building bigger datasets is unlikely to correct these errors. By visualizing feature spaces, we can gain a more intuitive understanding of our detection systems.
4 0.83454508 152 iccv-2013-Extrinsic Camera Calibration without a Direct View Using Spherical Mirror
Author: Amit Agrawal
Abstract: We consider the problem of estimating the extrinsic parameters (pose) of a camera with respect to a reference 3D object without a direct view. Since the camera does not view the object directly, previous approaches have utilized reflections in a planar mirror to solve this problem. However, a planar mirror based approach requires a minimum of three reflections and has degenerate configurations where estimation fails. In this paper, we show that the pose can be obtained using a single reflection in a spherical mirror of known radius. This makes our approach simpler and easier in practice. In addition, unlike planar mirrors, the spherical mirror based approach does not have any degenerate configurations, leading to a robust algorithm. While a planar mirror reflection results in a virtual perspective camera, a spherical mirror reflection results in a non-perspective axial camera. The axial nature of rays allows us to compute the axis (direction of sphere center) and few pose parameters in a linear fashion. We then derive an analytical solution to obtain the distance to the sphere cen- ter and remaining pose parameters and show that it corresponds to solving a 16th degree equation. We present comparisons with a recent method that use planar mirrors and show that our approach recovers more accurate pose in the presence of noise. Extensive simulations and results on real data validate our algorithm.
5 0.8314774 285 iccv-2013-NEIL: Extracting Visual Knowledge from Web Data
Author: Xinlei Chen, Abhinav Shrivastava, Abhinav Gupta
Abstract: We propose NEIL (NeverEnding Image Learner), a computer program that runs 24 hours per day and 7 days per week to automatically extract visual knowledge from Internet data. NEIL uses a semi-supervised learning algorithm that jointly discovers common sense relationships (e.g., “Corolla is a kind of/looks similar to Car”, “Wheel is a part of Car”) and labels instances of the given visual categories. It is an attempt to develop the world’s largest visual structured knowledge base with minimum human labeling effort. As of 10th October 2013, NEIL has been continuously running for 2.5 months on 200 core cluster (more than 350K CPU hours) and has an ontology of 1152 object categories, 1034 scene categories and 87 attributes. During this period, NEIL has discovered more than 1700 relationships and has labeled more than 400K visual instances. 1. Motivation Recent successes in computer vision can be primarily attributed to the ever increasing size of visual knowledge in terms of labeled instances of scenes, objects, actions, attributes, and the contextual relationships between them. But as we move forward, a key question arises: how will we gather this structured visual knowledge on a vast scale? Recent efforts such as ImageNet [8] and Visipedia [30] have tried to harness human intelligence for this task. However, we believe that these approaches lack both the richness and the scalability required for gathering massive amounts of visual knowledge. For example, at the time of submission, only 7% of the data in ImageNet had bounding boxes and the relationships were still extracted via Wordnet. In this paper, we consider an alternative approach of automatically extracting visual knowledge from Internet scale data. The feasibility of extracting knowledge automatically from images and videos will itself depend on the state-ofthe-art in computer vision. While we have witnessed significant progress on the task of detection and recognition, we still have a long way to go for automatically extracting the semantic content of a given image. So, is it really possible to use existing approaches for gathering visual knowledge directly from web data? 1.1. NEIL – Never Ending Image Learner We propose NEIL, a computer program that runs 24 hours per day, 7 days per week, forever to: (a) semantically understand images on the web; (b) use this semantic understanding to augment its knowledge base with new labeled instances and common sense relationships; (c) use this dataset and these relationships to build better classifiers and detectors which in turn help improve semantic understanding. NEIL is a constrained semi-supervised learning (SSL) system that exploits the big scale of visual data to automatically extract common sense relationships and then uses these relationships to label visual instances of existing categories. It is an attempt to develop the world’s largest visual structured knowledge base with minimum human effort one that reflects the factual content of the images on the Internet, and that would be useful to many computer vision and AI efforts. Specifically, NEIL can use web data to extract: (a) Labeled examples of object categories with bounding boxes; (b) Labeled examples of scenes; (c) Labeled examples of attributes; (d) Visual subclasses for object categories; and (e) Common sense relationships about scenes, objects and attributes like “Corolla is a kind of/looks similar to Car”, “Wheel is a part ofCar”, etc. (See Figure 1). 
We believe our approach is possible for three key reasons: (a) Macro-vision vs. Micro-vision: We use the term “micro-vision” to refer to the traditional paradigm where the input is an image and the output is some information extracted from that image. In contrast, we define “macrovision” as a paradigm where the input is a large collection of images and the desired output is extracting significant or interesting patterns in visual data (e.g., car is detected frequently in raceways). These patterns help us to extract common sense relationships. Note, the key difference is that macro-vision does not require us to understand every image in the corpora and extract all possible patterns. Instead, it relies on understanding a few images and statistically combine evidence from these to build our visual knowledge. – (b) Structure of the Visual World: Our approach exploits the structure of the visual world and builds constraints for detection and classification. These global constraints are represented in terms of common sense relationships be1409 orCllaoraC Hloc e yrs(a) Objects (w/Bounding Boxes and Vislue hWal Subcategories) aongkPi lrt(b) ScenewyacaResaephs nuoRd(c) At d worCreibutes Visual Instances Labeled by NEIL (O-O) Wheel is a part of Car. (S-O) Car is found in Raceway. (O-O) Corolla is a kind of/looks similar to Car. (S-O) Pyramid is found in Egypt. (O-A) Wheel is/has Round shape. (S-A) Alley is/has Narrow. (S-A) Bamboo forest is/has Vertical lines. (O-A) Sunflower is/has Yellow. Relationships Extracted by NEIL Figure 1. NEIL is a computer program that runs 24 hours a day and 7 days a week to gather visual knowledge from the Internet. Specifically, it simultaneously labels the data and extracts common sense relationships between categories. tween categories. Most prior work uses manually defined relationships or learns relationships in a supervised setting. Our key insight is that at a large scale one can simultane- ously label the visual instances and extract common sense relationships in ajoint semi-supervised learning framework. (c) Semantically driven knowledge acquisition: We use a semantic representation for visual knowledge; that is, we group visual data based on semantic categories and develop relationships between semantic categories. This allows us to leverage text-based indexing tools such as Google Image Search to initialize our visual knowledge base learning. Contributions: Our main contributions are: (a) We propose a never ending learning algorithm for gathering visual knowledge from the Internet via macro-vision. NEIL has been continuously running for 2.5 months on a 200 core cluster; (b) We are automatically building a large visual structured knowledge base which not only consists of labeled instances of scenes, objects, and attributes but also the relationships between them. While NEIL’s core SSL algorithm works with a fixed vocabulary, we also use noun phrases from NELL’s ontology [5] to grow our vocabulary. Currently, our growing knowledge base has an ontology of 1152 object categories, 1034 scene categories, and 87 attributes. NEIL has discovered more than 1700 relationships and labeled more than 400K visual instances of these categories. (c) We demonstrate how joint discovery of relationships and labeling of instances at a gigantic scale can provide constraints for improving semi-supervised learning. 2. Related Work Recent work has only focused on extracting knowledge in the form of large datasets for recognition and classification [8, 23, 30]. 
One of the most commonly used approaches to build datasets is using manual annotations by motivated teams of people [30] or the power of crowds [8, 40]. To minimize human effort, recent works have also focused on active learning [37, 39] which selects label requests that are most informative. However, both of these directions have a major limitation: annotations are expensive, prone to errors, biased and do not scale. An alternative approach is to use visual recognition for extracting these datasets automatically from the Internet [23, 34, 36]. A common way of automatically creating a dataset is to use image search results and rerank them via visual classifiers [14] or some form of joint-clustering in text and visual space [2, 34]. Another approach is to use a semi-supervised framework [42]. Here, a small amount of labeled data is used in conjunction with a large amount of unlabeled data to learn reliable and robust visual models. These seed images can be manually labeled [36] or the top retrievals of a text-based search [23]. The biggest problem with most of these automatic approaches is that the small number of labeled examples or image search results do not provide enough constraints for learning robust visual classifiers. Hence, these approaches suffer from semantic drift [6]. One way to avoid semantic drift is to exploit additional constraints based on the structure of our visual data. Researchers have exploited a variety of constraints such as those based on visual similarity [11, 15], seman- tic similarity [17] or multiple feature spaces [3]. However, most of these constraints are weak in nature: for example, visual similarity only models the constraint that visuallysimilar images should receive the same labels. On the other hand, our visual world is highly structured: object cate1410 gories share parts and attributes, objects and scenes have strong contextual relationships, etc. Therefore, we need a way to capture the rich structure of our visual world and exploit this structure during semi-supervised learning. In recent years, there have been huge advances in modeling the rich structure of our visual world via contextual relationships. Some of these relationships include: SceneObject [38], Object-Object [3 1], Object-Attribute [12, 22, 28], Scene-Attribute [29]. All these relationships can provide a rich set of constraints which can help us improve SSL [4]. For example, scene-attribute relationships such as amphitheaters are circular can help improve semisupervised learning of scene classifiers [36] and Wordnet hierarchical relationships can help in propagating segmentations [21]. But the big question is: how do we obtain these relationships? One way to obtain such relationships is via text analysis [5, 18]. However, as [40] points out that the visual knowledge we need to obtain is so obvious that no one would take the time to write it down and put it on web. In this work, we argue that, at a large-scale, one can jointly discover relationships and constrain the SSL prob- lem for extracting visual knowledge and learning visual classifiers and detectors. Motivated by a never ending learning algorithm for text [5], we propose a never ending visual learning algorithm that cycles between extracting global relationships, labeling data and learning classifiers/detectors for building visual knowledge from the Internet. 
Our work is also related to attribute discovery [33, 35] since these approaches jointly discover the attributes and relationships between objects and attributes simultaneously. However, in our case, we only focus on semantic attributes and therefore our goal is to discover semantic relationships and semantically label visual instances. 3. Technical Approach Our goal is to extract visual knowledge from the pool of visual data on the web. We define visual knowledge as any information that can be useful for improving vision tasks such as image understanding and object/scene recognition. One form of visual knowledge would be labeled examples of different categories or labeled segments/boundaries. Labeled examples helps us learn classifiers or detectors and improve image understanding. Another example of visual knowledge would be relationships. For example, spatial contextual relationships can be used to improve object recognition. In this paper, we represent visual knowledge in terms of labeled examples of semantic categories and the relationships between those categories. Our knowledge base consists of labeled examples of: (1) Objects (e.g., Car, Corolla); (2) Scenes (e.g., Alley, Church); (3) Attributes (e.g., Blue, Modern). Note that for objects we learn detectors and for scenes we build classifiers; however for the rest of the paper we will use the term detector and classifier interchangeably. Our knowledge base also contains relationships of four types: (1) Object-Object (e.g., Wheel is a part of Car);(2) Object-Attribute (e.g., Sheep is/has White); (3) Scene-Object (e.g., Car is found in Raceway); (4) SceneAttribute (e.g., Alley is/has Narrow). The outline of our approach is shown in Figure 2. We use Google Image Search to download thousands of images for each object, scene and attribute category. Our method then uses an iterative approach to clean the labels and train detectors/classifiers in a semi-supervised manner. For a given concept (e.g., car), we first discover the latent visual subcategories and bounding boxes for these sub-categories using an exemplar-based clustering approach (Section 3. 1). We then train multiple detectors for a concept (one for each sub-category) using the clustering and localization results. These detectors and classifiers are then used for detections on millions of images to learn relationships based on cooccurrence statistics (Section 3.2). Here, we exploit the fact the we are interested in macro-vision and therefore build co-occurrence statistics using only confident detections/classifications. Once we have relationships, we use them in conjunction with our classifiers and detectors to label the large set of noisy images (Section 3.3). The most confidently labeled images are added to the pool of labeled data and used to retrain the models, and the process repeats itself. At every iteration, we learn better classifiers and detectors, which in turn help us learn more relationships and further constrain the semi-supervised learning problem. We now describe each step in detail below. 3.1. Seeding Classifiers via Google Image Search The first step in our semi-supervised algorithm is to build classifiers for visual categories. One way to build initial classifiers is via a few manually labeled seed images. Here, we take an alternative approach and use text-based image retrieval systems to provide seed images for training initial detectors. For scene and attribute classifiers we directly use these retrieved images as positive data. 
However, such an approach fails for training object and attribute detectors because of four reasons (Figure 3(a)) (1) Outliers: Due to the imperfectness of text-based image retrieval, the downloaded images usually have irrelevant images/outliers; (2) Polysemy: In many cases, semantic categories might be overloaded and a single semantic category might have multiple senses (e.g., apple can mean both the company and the fruit); (3) Visual Diversity: Retrieved images might have high intra-class variation due to different viewpoint, illumination etc.; (4) Localization: In many cases the retrieved image might be a scene without a bounding-box and hence one needs to localize the concept before training a detector. Most of the current approaches handle these problems via clustering. Clustering helps in handling visual diversity [9] and discovering multiple senses of retrieval (polysemy) [25]. It can also help us to reject outliers based on – distances from cluster centers. One simple way to cluster 141 1 would be to use K-means on the set of all possible bounding boxes and then use the representative clusters as visual sub-categories. However, clustering using K-means has two issues: (1) High Dimensionality: We use the Color HOG (CHOG) [20] representation and standard distance metrics do not work well in such high-dimensions [10]; (2) Scalability: Most clustering approaches tend to partition the complete feature space. In our case, since we do not have bounding boxes provided, every image creates millions of data points and the majority of the datapoints are outliers. Recent work has suggested that K-means is not scalable and has bad performance in this scenario since it assigns membership to every data point [10]. Instead, we propose to use a two-step approach for clustering. In the first step, we mine the set of downloaded im- × ages from Google Image Search to create candidate object windows. Specifically, every image is used to train a detector using recently proposed exemplar-LDA [19]. These detectors are then used for dense detections on the same set of downloaded images. We select the top K windows which have high scores from multiple detectors. Note that this step helps us prune out outliers as the candidate windows are selected via representativeness (how many detectors fire on them). For example, in Figure 3, none of the tricycle detectors fire on the outliers such as circular dots and people eating, and hence these images are rejected at this candidate widow step. Once we have candidate windows, we cluster them in the next step. However, instead of using the high-dimensional CHOG representation for clustering, we use the detection signature of each window (represented as a vector of seed detector ELDA scores on the window) to create a K K affinity matrix. The (i, j) entry in the affinity amteat arix K i s× thKe da fofti product orixf t.h Tish vee (cit,ojr) )fo enr twryin indo thwes ai fainndj. Intuitively, this step connects candidate windows if the same set of detectors fire on both windows. Once we have the affinity matrix, we cluster the candidate windows using the standard affinity propagation algorithm [16]. Affinity propagation also allows us to extract a representative window (prototype) for each cluster which acts as an iconic image for the object [32] (Figure 3). After clustering, we train a detector for each cluster/sub-category using three-quarters of the images in the cluster. The remaining quarter is used as a validation set for calibration. 3.2. 
3.2. Extracting Relationships

Once we have initialized the object detectors, attribute detectors, attribute classifiers and scene classifiers, we can use them to extract relationships automatically from the data. The key idea is that we do not need to understand each and every image downloaded from the Internet; instead, we need to understand the statistical pattern of detections and classifications at a large scale. These patterns are used to select the top-N relationships at every iteration. Specifically, we extract four kinds of relationships:

Object-Object Relationships: The first kind of relationship we extract is object-object relationships, which include: (1) partonomy relationships such as "Eye is a part of Baby"; (2) taxonomy relationships such as "BMW 320 is a kind of Car"; and (3) similarity relationships such as "Swan looks similar to Goose". To extract these relationships, we first build a co-detection matrix O0 whose elements represent the probability of simultaneous detection of object categories i and j. Intuitively, the co-detection matrix has high values when object detector i detects objects inside the bounding box of object j with high detection scores. To account for detectors that fire everywhere and for images with many detections, we normalize the matrix O0. The normalized co-detection matrix can be written as N1^(-1/2) O0 N2^(-1/2), where N1 and N2 are the out-degree and in-degree matrices and the (i, j) element of O0 is the average score of the top detections of detector i on images of object category j (see the sketch at the end of this subsection). Once we have selected a relationship between a pair of categories, we learn its characteristics in terms of the mean and variance of the relative locations, relative aspect ratios, relative scores and relative sizes of the detections. For example, the nose-face relationship is characterized by a small relative window size (the nose covers less than 20% of the face area) and a relative location near the center of the face. This is used to define a compatibility function ψ_{i,j}(·) which evaluates whether detections from categories i and j are compatible. We also classify the relationships into two semantic categories (part-of vs. taxonomy/similar) using relative features, to provide a human-communicable view of the visual knowledge base.

Object-Attribute Relationships: The second type of relationship we extract is object-attribute relationships such as "Pizza has Round Shape", "Sunflower is Yellow", etc. To extract these relationships we use the same methodology, where the attributes are detected in the labeled examples of object categories. These detections and their scores are then used to build a normalized co-detection matrix, which is used to find the top object-attribute relationships.

Scene-Object Relationships: The third type of relationship extracted by our algorithm is scene-object relationships such as "Bus is found in Bus depot" and "Monitor is found in Control room". For extracting scene-object relationships, we run the object detectors on randomly sampled images of different scene classes. The detections are then used to create a normalized co-presence matrix (analogous to the object-object case) whose (i, j) element represents the likelihood of detecting an instance of object category i in images of scene category j.
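The ranking of candidate relationships from such a matrix can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes the out-degree and in-degree matrices are diagonal matrices built from the row and column sums of O0, and the number of relationships kept per iteration is a placeholder.

```python
import numpy as np

def top_relationships(O0, top_n=5, eps=1e-8):
    """Rank candidate relationships from a raw co-detection matrix.

    O0[i, j] = average score of the top detections of detector i on images
    of category j, built from confident detections only.
    """
    # Diagonal out-degree and in-degree matrices from row and column sums.
    d_out = O0.sum(axis=1)
    d_in = O0.sum(axis=0)
    N1_inv_sqrt = np.diag(1.0 / np.sqrt(d_out + eps))
    N2_inv_sqrt = np.diag(1.0 / np.sqrt(d_in + eps))

    # Normalized co-detection matrix: N1^(-1/2) O0 N2^(-1/2).
    O = N1_inv_sqrt @ O0 @ N2_inv_sqrt

    # The largest off-diagonal entries are kept as candidate relationships.
    pairs = [(i, j) for i in range(O.shape[0])
                    for j in range(O.shape[1]) if i != j]
    pairs.sort(key=lambda ij: O[ij], reverse=True)
    return pairs[:top_n]

# Toy usage with a random 4x4 co-detection matrix.
candidate_relationships = top_relationships(np.random.rand(4, 4))
```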
Scene-Attribute Relationships: The fourth and final type of relationship extracted by our algorithm is scene-attribute relationships such as "Ocean is Blue", "Alleys are Narrow", etc. Here we follow a simple methodology: we compute a co-classification matrix whose (i, j) element is the average classification score of attribute i on images of scene j. The top entries of this co-classification matrix are used to extract scene-attribute relationships.

3.3. Retraining via Labeling New Instances

Once we have the initial set of classifiers/detectors and the set of relationships, we use them to find new instances of the different object and scene categories. These new instances are added to the set of labeled data, and we retrain the classifiers/detectors on the updated set. The new classifiers are then used to extract more relationships, which in turn are used to label more data, and so on. One way to find new instances is to use the detector itself directly, e.g., using the car detector to find more cars. However, this approach leads to semantic drift. To avoid semantic drift, we use the rich set of relationships extracted in the previous section and ensure that newly labeled instances of car satisfy the extracted relationships (e.g., has wheels, found in raceways, etc.).

Mathematically, let RO, RA and RS denote the sets of object-object, object-attribute and scene-object relationships at iteration t. If φ_i(·) is the potential from object detector i, ω_k(·) is the scene potential, and ψ_{i,j}(·) is the compatibility function between object categories i and j, then we find new instances of object category i using the contextual scoring function

    φ_i(x) + Σ_{(i,j) ∈ RO ∪ RA} φ_j(x_l) ψ_{i,j}(x, x_l) + Σ_{(i,k) ∈ RS} ω_k(x),

where x is the window being evaluated and x_l is the top-detected window of the related object/attribute category. The above equation has three terms: the first term is the appearance term for the object category itself, measured by the score of the SVM detector on the window x. The second term measures the compatibility between object category i and the object/attribute category j if the relationship (i, j) is part of the catalogue. For example, if "Wheel is a part of Car" exists in the catalogue, this term is the product of the score of the wheel detector and the compatibility function between the wheel window (x_l) and the car window (x). The final term measures scene-object compatibility: if the knowledge base contains the relationship "Car is found in Raceway", this term boosts "Car" detection scores in "Raceway" scenes.

Figure 4. Qualitative examples of bounding-box labeling done by NEIL (Nilgai, Yamaha, Violin, Bass, F-18).

At each iteration we also add new instances of the different scene categories. We find new instances of scene category k using the contextual scoring function

    ω_k(x) + Σ_{(m,k) ∈ RA′} ω_m(x) + Σ_{(i,k) ∈ RS} φ_i(x_l),

where RA′ is the catalogue of scene-attribute relationships. This equation also has three terms: the first term is the appearance term for the scene category itself and is estimated using the scene classifier. The second term is the appearance term for the attribute category and is estimated using the attribute classifier; it ensures that if a scene-attribute relationship exists, the attribute classifier score should be high. The third and final term is the appearance term of an object category and is estimated using the corresponding object detector; it ensures that if a scene-object relationship exists, the object detector should detect objects in the scene. A minimal code sketch of these two scoring functions follows.
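The sketch below shows one plausible way to organize the two contextual scores; it is not the authors' implementation. The potentials phi (object detectors), omega (scene/attribute classifiers) and the compatibility functions psi are assumed to be given as callables keyed by category, and the data layout (dictionaries of top-detected windows and relationship pairs) is illustrative.

```python
# Minimal sketch of the contextual scoring functions (hypothetical data layout).

def object_score(i, x, scene_image, top_windows, phi, psi, omega,
                 obj_rels, scene_obj_rels):
    """Contextual score for placing object category i at window x."""
    score = phi[i](x)                          # appearance: detector i on x
    for (a, j) in obj_rels:                    # object-object / object-attribute
        if a == i and j in top_windows:
            x_l = top_windows[j]               # top-detected window of category j
            score += phi[j](x_l) * psi[(i, j)](x, x_l)
    for (a, k) in scene_obj_rels:              # scene-object relationships
        if a == i:
            score += omega[k](scene_image)     # boost when the scene matches
    return score

def scene_score(k, image, top_windows, phi, omega,
                scene_attr_rels, scene_obj_rels):
    """Contextual score for labeling an image with scene category k."""
    score = omega[k](image)                    # appearance: scene classifier
    for (m, kk) in scene_attr_rels:            # scene-attribute relationships
        if kk == k:
            score += omega[m](image)           # related attribute classifier
    for (i, kk) in scene_obj_rels:             # scene-object relationships
        if kk == k and i in top_windows:
            score += phi[i](top_windows[i])    # related object detector evidence
    return score
```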
Implementation Details: To train scene and attribute classifiers, we first extract a 3912-dimensional feature vector from each image. The feature vector consists of 512-D GIST [27] features concatenated with bag-of-words representations for SIFT [24], HOG [7], Lab color space, and Texton [26], with dictionary sizes of 1000, 1000, 400 and 1000, respectively. Features of randomly sampled windows from other categories are used as negative examples for SVM training and hard-negative mining. For the object and attribute detectors, we use CHOG [20] features with a bin size of 8 and train the detectors using the latent SVM model (without parts) [13]. A sketch of the feature layout is shown below.
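To make the layout of the 3912-D vector explicit (512 + 1000 + 1000 + 400 + 1000), here is a small assembly sketch. The individual extractor functions are hypothetical placeholders that return random vectors of the stated sizes; only the concatenation order and dimensions are taken from the text.

```python
import numpy as np

# Hypothetical per-image descriptors standing in for GIST and the four
# bag-of-words histograms; random vectors of the documented sizes.
def gist_512(image):   return np.random.rand(512)
def bow_sift(image):   return np.random.rand(1000)
def bow_hog(image):    return np.random.rand(1000)
def bow_lab(image):    return np.random.rand(400)
def bow_texton(image): return np.random.rand(1000)

def scene_feature(image):
    """Concatenate the five descriptors into one 3912-D feature vector."""
    feat = np.concatenate([
        gist_512(image),    # 512-D GIST
        bow_sift(image),    # SIFT BoW, dictionary size 1000
        bow_hog(image),     # HOG BoW, dictionary size 1000
        bow_lab(image),     # Lab color BoW, dictionary size 400
        bow_texton(image),  # Texton BoW, dictionary size 1000
    ])
    assert feat.shape == (3912,)
    return feat

feat = scene_feature(image=None)  # placeholder input
```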
4. Experimental Results

We demonstrate the quality of the extracted visual knowledge through qualitative results, verification by human subjects, and quantitative results on tasks such as object detection and scene recognition.

4.1. NEIL Statistics

While NEIL's core algorithm uses a fixed vocabulary, we use noun phrases from NELL [5] to grow NEIL's vocabulary. As of 10th October 2013, NEIL has an ontology of 1152 object categories, 1034 scene categories and 87 attributes. It has downloaded more than 2 million images for extracting the current structured visual knowledge. For bootstrapping our system, we use a few seed images from ImageNet [8], SUN [41] or the top images from Google Image Search. For the extensive experimental evaluation in this paper, we ran NEIL on steroids (200 cores as opposed to the 30 cores used generally) for the last 2.5 months. NEIL has completed 16 iterations and has labeled more than 400K visual instances (including 300,000 objects with their bounding boxes). It has also extracted 1703 common sense relationships. Readers can browse the current visual knowledge base and download the detectors from www.neil-kb.com.

4.2. Qualitative Results

We first show some qualitative results in terms of the visual knowledge extracted by NEIL. Figure 4 shows the extracted visual sub-categories along with a few labeled instances belonging to each sub-category. It can be seen from the figure that NEIL effectively handles intra-class variation and polysemy via the clustering process. The purity and diversity of the clusters for different concepts indicate that contextual relationships help make our system robust to semantic drift and ensure diversity. Figure 5 shows qualitative examples of scene-object and object-object relationships extracted by NEIL; it is effective in using a few confident detections to extract interesting relationships. Figure 6 shows some of the interesting scene-attribute and object-attribute relationships extracted by NEIL.

Figure 5. Qualitative examples of Scene-Object (rows 1-2) and Object-Object (rows 3-4) relationships extracted by NEIL (e.g., "Helicopter is found in Airfield", "Leaning tower is found in Pisa", "Zebra is found in Savanna", "Opera house is found in Sydney", "Eye is a part of Baby", "Van is a kind of/looks similar to Ambulance", "Duck is a kind of/looks similar to Goose", "Basketball net is a part of Backboard").

4.3. Evaluating Quality via Human Subjects

Next, we evaluate the quality of the visual knowledge extracted by NEIL. It should be noted that an extensive and comprehensive evaluation of the whole NEIL system is an extremely difficult task; it is impractical to verify each and every labeled instance and relationship for correctness. Therefore, we randomly sample 500 visual instances and 500 relationships and verify them using human experts. At the end of iteration 6, 79% of the relationships extracted by NEIL are correct, and 98% of the visual data labeled by NEIL has been labeled correctly. We also evaluate the per-iteration correctness of relationships: at iteration 1, more than 96% of the relationships are correct, and by iteration 3 the system stabilizes with 80% of the extracted relationships being correct. While the system does not currently exhibit any major semantic drift, we plan to continue evaluating and analyzing the knowledge base extensively as NEIL grows older. We also evaluate the quality of the bounding boxes generated by NEIL. For this, we randomly sample 100 images and label ground-truth bounding boxes. On the standard intersection-over-union (IoU) metric, NEIL generates bounding boxes with 0.78 average overlap with ground truth. To give context to the difficulty of the task, the standard Objectness algorithm [1] produces bounding boxes with 0.59 average overlap. (A minimal IoU computation is sketched below.)
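For reference, the intersection-over-union criterion used above is the standard one for axis-aligned boxes; a minimal implementation for boxes in (x1, y1, x2, y2) format is shown here. The example boxes are arbitrary.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: two 100x100 boxes offset by 12 pixels give an IoU of about 0.79.
print(iou((10, 10, 110, 110), (22, 10, 122, 110)))
```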
4.4. Using Knowledge for Vision Tasks

Finally, we demonstrate the usefulness of the visual knowledge learned by NEIL on standard vision tasks such as object detection and scene classification. Here, we also compare several aspects of our approach: (a) we first compare the quality of our automatically labeled dataset; as baselines, we train classifiers/detectors directly on the seed images downloaded from Google Image Search; (b) we compare NEIL against a standard bootstrapping approach which does not extract or use relationships; (c) finally, we demonstrate the usefulness of relationships by detecting and classifying new test data with and without the learned relationships.

Scene Classification: First we evaluate our visual knowledge on the task of scene classification. We build a dataset of 600 images (12 scene categories) using Flickr images. We compare the performance of our scene classifiers against scene classifiers trained from the top 15 Google Image Search results (our seed classifiers) and against a standard bootstrapping approach that does not use relationship extraction. Table 1 shows the results, using mean average precision (mAP) as the evaluation metric. As the results show, automatic relationship extraction helps constrain the learning problem, so the learned classifiers perform much better. Finally, if we also use the contextual information from NEIL relationships, we get a significant further boost in performance.

Table 1. mAP performance for scene classification on 12 categories.
    Method                                    mAP
    Seed Classifier (15 Google Images)        0.52
    Bootstrapping (without relationships)     0.54
    NEIL Scene Classifiers                    0.57
    NEIL (Classifiers + Relationships)        0.62

Object Detection: We also evaluate our extracted visual knowledge on the task of object detection. We build a dataset of 1000 images (15 object categories) using Flickr data for testing. We compare the performance against object detectors trained directly on the top-50 and top-450 images from Google Image Search. We also compare the performance of detectors trained after aspect-ratio clustering, HOG-based clustering, and our proposed clustering procedure. Table 2 shows the detection results. Using 450 images from Google Image Search decreases performance due to noisy retrievals. While other clustering methods help, the gain from our clustering procedure is much larger. Finally, detectors trained using NEIL work better than standard bootstrapping.

Figure 6. Examples of extracted common sense relationships (e.g., "Monitor is found in Control room", "Washing machine is found in Utility room", "Siberian tiger is found in Zoo", "Bullet train is found in Train station platform", "Cougar looks similar to Cat", "Urn looks similar to Goblet", "Samsung galaxy is a kind of Cellphone", "Computer room is/has Modern", "Hallway is/has Narrow", "Trading floor is/has Crowded", "Bonfire is found in Volcano").

Table 2. mAP performance for object detection on 15 categories.
    Method                                        mAP
    Latent SVM (50 Google Images)                 0.34
    Latent SVM (450 Google Images)                0.28
    Latent SVM (450, Aspect Ratio Clustering)     0.30
    Latent SVM (450, HOG-based Clustering)        0.33
    Seed Detector (NEIL Clustering)               0.44
    Bootstrapping (without relationships)         0.45
    NEIL Detector                                 0.49
    NEIL Detector + Relationships                 0.51

Acknowledgements: This research was supported by ONR MURI N000141010934 and a gift from Google. The authors would like to thank Tom Mitchell and David Fouhey for insightful discussions. We would also like to thank our computing clusters warp and workhorse for doing all the hard work!

References
[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? TPAMI, 2010.
[2] T. Berg and D. Forsyth. Animals on the web. In CVPR, 2006.
[3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, 1998.
[4] A. Carlson, J. Betteridge, E. R. H. Jr., and T. M. Mitchell. Coupling semi-supervised learning of categories and relations. In NAACL HLT Workshop on SSL for NLP, 2009.
[5] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
[6] J. R. Curran, T. Murphy, and B. Scholz. Minimising semantic drift with mutual exclusion bootstrapping. In Pacific Association for Computational Linguistics, 2007.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[8] J. Deng, W. Dong, R. Socher, J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[9] S. Divvala, A. Efros, and M. Hebert. How important are 'deformable parts' in the deformable parts model? In ECCV Parts and Attributes Workshop, 2012.
[10] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes Paris look like Paris? SIGGRAPH, 2012.
[11] S. Ebert, D. Larlus, and B. Schiele. Extracting structures in image collections for object recognition. In ECCV, 2010.
[12] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.
[13] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. TPAMI, 2010.
[14] R. Fergus, P. Perona, and A. Zisserman. A visual category filter for Google images. In ECCV, 2004.
[15] R. Fergus, Y. Weiss, and A. Torralba. Semi-supervised learning in gigantic image collections. In NIPS, 2009.
[16] B. Frey and D. Dueck. Clustering by passing messages between data points. Science, 2007.
[17] M. Guillaumin, J. Verbeek, and C. Schmid. Multimodal semi-supervised learning for image classification. In CVPR, 2010.
[18] A. Gupta and L. S. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV, 2008.
[19] B. Hariharan, J. Malik, and D. Ramanan. Discriminative decorrelation for clustering and classification. In ECCV, 2012.
[20] F. S. Khan, R. M. Anwer, J. van de Weijer, A. D. Bagdanov, M. Vanrell, and A. M. Lopez. Color attributes for object detection. In CVPR, 2012.
[21] D. Kuettel, M. Guillaumin, and V. Ferrari. Segmentation propagation in ImageNet. In ECCV, 2012.
[22] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
[23] L.-J. Li, G. Wang, and L. Fei-Fei. OPTIMOL: Automatic object picture collection via incremental model learning. In CVPR, 2007.
[24] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[25] A. Lucchi and J. Weston. Joint image and word sense discrimination for image retrieval. In ECCV, 2012.
[26] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. PAMI, 2004.
[27] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001.
[28] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.
[29] G. Patterson and J. Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In CVPR, 2012.
[30] P. Perona. Visions of a Visipedia. Proceedings of the IEEE, 2010.
[31] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In ICCV, 2007.
[32] R. Raguram and S. Lazebnik. Computing iconic summaries of general visual concepts. In Workshop on Internet Vision, 2008.
[33] M. Rastegari, A. Farhadi, and D. Forsyth. Attribute discovery via predictable discriminative binary codes. In ECCV, 2012.
[34] F. Schroff, A. Criminisi, and A. Zisserman. Harvesting image databases from the web. In ICCV, 2007.
[35] V. Sharmanska, N. Quadrianto, and C. H. Lampert. Augmented attribute representations. In ECCV, 2012.
[36] A. Shrivastava, S. Singh, and A. Gupta. Constrained semi-supervised learning using attributes and comparative attributes. In ECCV, 2012.
[37] B. Siddiquie and A. Gupta. Beyond active noun tagging: Modeling contextual interactions for multi-class active learning. In CVPR, 2010.
[38] E. Sudderth, A. Torralba, W. T. Freeman, and A. Willsky. Learning hierarchical models of scenes, objects, and parts. In ICCV, 2005.
[39] S. Vijayanarasimhan and K. Grauman. Large-scale live active learning: Training object detectors with crawled data and crowds. In CVPR, 2011.
[40] L. von Ahn and L. Dabbish. Labeling images with a computer game. In SIGCHI, 2004.
[41] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
[42] X. Zhu. Semi-supervised learning literature survey. Technical report, CS, UW-Madison, 2005.