iccv iccv2013 iccv2013-189 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, Antonio Torralba
Abstract: We introduce algorithms to visualize feature spaces used by object detectors. The tools in this paper allow a human to put on ‘HOG goggles ’ and perceive the visual world as a HOG based object detector sees it. We found that these visualizations allow us to analyze object detection systems in new ways and gain new insight into the detector’s failures. For example, when we visualize the features for high scoring false alarms, we discovered that, although they are clearly wrong in image space, they do look deceptively similar to true positives in feature space. This result suggests that many of these false alarms are caused by our choice of feature space, and indicates that creating a better learning algorithm or building bigger datasets is unlikely to correct these errors. By visualizing feature spaces, we can gain a more intuitive understanding of our detection systems.
Reference: text
sentIndex sentText sentNum sentScore
1 edu , , , Abstract We introduce algorithms to visualize feature spaces used by object detectors. [sent-3, score-0.211]
2 The tools in this paper allow a human to put on ‘HOG goggles ’ and perceive the visual world as a HOG based object detector sees it. [sent-4, score-0.241]
3 We found that these visualizations allow us to analyze object detection systems in new ways and gain new insight into the detector’s failures. [sent-5, score-0.546]
4 For example, when we visualize the features for high scoring false alarms, we discovered that, although they are clearly wrong in image space, they do look deceptively similar to true positives in feature space. [sent-6, score-0.35]
5 This result suggests that many of these false alarms are caused by our choice of feature space, and indicates that creating a better learning algorithm or building bigger datasets is unlikely to correct these errors. [sent-7, score-0.221]
6 By visualizing feature spaces, we can gain a more intuitive understanding of our detection systems. [sent-8, score-0.281]
7 Introduction Figure 1 shows a high scoring detection from an object detector with HOG features and a linear SVM classifier trained on PASCAL. [sent-10, score-0.225]
8 Unfortunately, computer vision researchers are often unable to explain the failures of object detection systems. [sent-12, score-0.211]
9 1 We present algorithms to visualize the feature spaces of object detectors. [sent-16, score-0.211]
10 Since features are too high dimensional for humans to directly inspect, our visualization algorithms work by inverting features back to natural images. [sent-17, score-0.534]
11 We found that these inversions provide an intuitive and accurate visualization of the feature spaces used by object detectors. [sent-18, score-0.487]
12 Figure 2: We show the crop for the false car detection from Figure 1. [sent-22, score-0.217]
13 On the right, we show our visualization of the HOG features for the same patch. [sent-23, score-0.239]
14 Our visualization reveals that this false alarm actually looks like a car in HOG space. [sent-24, score-0.43]
15 Figure 2 shows the output from our visualization on the features for the false car detection. [sent-25, score-0.395]
16 This visualization reveals that, while there are clearly no cars in the original image, there is a car hiding in the HOG descriptor. [sent-26, score-0.335]
17 HOG features see a slightly different visual world than what we see, and by visualizing this space, we can gain a more intuitive understanding of our object detectors. [sent-27, score-0.339]
18 Although every visualization looks like a true positive, all of these detections are actually false alarms. [sent-31, score-0.41]
19 Consequently, even with a better learning algorithm or more data, these false alarms will likely persist. [sent-32, score-0.251]
20 The principle contribution of this paper is the presentation of algorithms for visualizing features used in object detection. [sent-34, score-0.271]
21 To this end, we present four algorithms to invert Figure 3: We visualize some high scoring detections from the deformable parts model [8] for person, chair, and car. [sent-35, score-0.472]
22 Our visualizations are perceptually intuitive for humans to understand. [sent-39, score-0.606]
23 We evaluate our inversions with both automatic benchmarks and a large human study, and we found our visualizations are perceptually more accurate at representing the content of a HOG feature than existing methods; see Figure 4 for a comparison between our visualization and HOG glyphs. [sent-42, score-0.822]
24 We then use our visualizations to inspect the behaviors of object detection systems and analyze their features. [sent-43, score-0.602]
25 Since we hope our visualizations will be useful to other researchers, our final contribution is a public feature visualization toolbox. [sent-44, score-0.613]
26 Related Work Our visualization algorithms extend an actively growing body of work in feature inversion. [sent-46, score-0.251]
27 2 While [22, 4, 17] do a good job at reconstructing images from SIFT, LBP, and gist features, our visualization algorithms have several advantages. [sent-54, score-0.346]
28 Firstly, while existing methods are tailored for specific features, our visualization algorithms we propose are feature independent. [sent-55, score-0.251]
29 Since we cast feature inversion as a machine learning problem, our algorithms can be used to visualize any feature. [sent-56, score-0.389]
30 Secondly, our algorithms are fast: our best algorithm can invert features in under a second on a desktop computer, enabling interactive visualization. [sent-58, score-0.305]
31 Finally, to our knowledge, this paper is the first to invert HOG. [sent-59, score-0.231]
32 Our visualizations enable analysis that complement a recent line of papers that provide tools to diagnose object recognition systems, which we briefly review here. [sent-60, score-0.48]
33 By putting on ‘HOG glasses’ and visualizing the world according to the features, we are able to gain a better understanding of the failures and behaviors of our object detection systems. [sent-72, score-0.366]
34 Feature Visualization Algorithms We pose the feature visualization problem as one of feature inversion, i. [sent-74, score-0.208]
35 φH(·e)nc ise, we asneyek-t an image x tohna,t, n owh aenna we compute Figure 5: We found that averaging the images of top detections from an exemplar LDA detector provide one method to invert HOG features. [sent-79, score-0.439]
36 Since, to our knowledge, no algorithms to invert HOG have yet been developed, we first describe three simple baselines for HOG inversion. [sent-87, score-0.274]
37 Baseline A: Exemplar LDA (ELDA) Consider the top detections for the exemplar object detector [9, 15] for a few images shown in Figure 5. [sent-91, score-0.257]
38 Although all top detections are false positives, notice that each detection captures some statistics about the query. [sent-92, score-0.275]
39 We use this simple observation to produce our first inversion baseline. [sent-94, score-0.255]
40 Baseline B: Ridge Regression We present a fast, parametric inversion baseline based off ridge regression. [sent-108, score-0.355]
41 Since we wish to invert any HOG point, we assume that P(X, Y ) is stationary [9], allowing us to efficiently learn the covariance across massive datasets. [sent-123, score-0.303]
42 We invert an arbitrary dimensional HOG point by marginalizing out unused dimensions. [sent-124, score-0.231]
43 Algorithm D: Paired Dictionary Learning In this section, we present our main inversion algorithm. [sent-143, score-0.255]
44 Figure 6: Inverting HOG using paired dictionary learning. [sent-145, score-0.338]
45 6 exist [14, 11], we can invert features in under two seconds on a 4 core CPU. [sent-155, score-0.262]
46 To do this, we solve a paired dictionary learning problem, inspired by recent super resolution sparse coding work [23, 21]: ? [sent-158, score-0.338]
47 Evaluation of Visualizations We evaluate our inversion algorithms using both qualitative and quantitative measures. [sent-171, score-0.298]
48 We use PASCAL VOC 2011 [6] as our dataset and we invert patches corresponding to objects. [sent-172, score-0.258]
49 PairDict (seconds)Greedy (days)Original Figure 10: Although our algorithms are good at inverting HOG, they are not perfect, and struggle to reconstruct high frequency detail. [sent-179, score-0.224]
50 ) Figure 11: We recursively compute HOG and invert it with a paired dictionary. [sent-184, score-0.428]
51 While there is some information loss, our visualizations still do a good job at accurately representing HOG features. [sent-185, score-0.461]
52 Paired dictionary learning tends to produce the best visualization for HOG descriptors. [sent-189, score-0.349]
53 By learning a dictionary over the visual world and the correlation between HOG and natural images, paired dictionary learning recovered high frequencies without introducing significant noise. [sent-190, score-0.545]
54 We discovered that the paired dictionary is able to recover color from HOG descriptors. [sent-191, score-0.397]
55 Figure 9 shows the re- sult of training a paired dictionary to estimate RGB images instead of grayscale images. [sent-192, score-0.373]
56 While the paired dictionary assigns arbitrary colors to man-made objects and in-door scenes, it frequently colors natural objects correctly, such as grass or the sky, likely because those categories are strongly correlated to HOG descriptors. [sent-193, score-0.395]
57 We focus on grayscale visualizations in this paper because we found those to be more intuitive for humans to understand. [sent-194, score-0.598]
58 While our visualizations do a good job at representing HOG features, they have some limitations. [sent-195, score-0.461]
59 Figure 10 compares our best visualization (paired dictionary) against a greedy algorithm that draws triangles of random rotation, 5 Figure 12: Our inversion algorithms are sensitive to the HOG template size. [sent-196, score-0.506]
60 If we allow the greedy algorithm to execute for an extremely long time (a few days), the visualization better shows higher frequency detail. [sent-199, score-0.236]
61 This reveals that there exists a visualization better than paired dictionary learning, although it may not be tractable. [sent-200, score-0.584]
62 Despite these limitations, our visualizations are, as we will now show, still perceptually intuitive for humans to understand. [sent-204, score-0.606]
63 Firstly, we use an automatic inversion metric that measures how well our inversions reconstruct original images. [sent-206, score-0.458]
64 Secondly, we conducted a large visualization challenge with human subjects on Amazon Mechanical Turk (MTurk), which is designed to determine how well people can infer high level semantics from our visualizations. [sent-207, score-0.309]
65 Inversion Benchmark We consider the inversion performance of our algorithm: given a HOG feature y, how well does our inverse φ−1 (y) reconstruct the original pixels x for each algorithm? [sent-210, score-0.376]
66 Since HOG is invariant up to a constant shift and scale, we score each inversion against the original image with normalized cross correlation. [sent-211, score-0.283]
67 Visualization Benchmark While the inversion benchmark evaluates how well the inversions reconstruct the original image, it does not capture the high level content of the inverse: is the inverse of a sheep still a sheep? [sent-216, score-0.534]
68 We then showed participants an inversion from one of our algorithms and asked users to classify it into one of the 20 categories. [sent-219, score-0.412]
69 Paired dictionary learning provides the best visualizations for humans. [sent-230, score-0.546]
70 Expert refers to MIT PhD students in computer vision performing the same visualization challenge with HOG glyphs. [sent-231, score-0.242]
71 We also compared our algorithms against the standard black-and-white HOG glyph popularized by [3]. [sent-234, score-0.26]
72 Our results in Table 2 show that paired dictionary learning and direct optimization provide the best visualization of HOG descriptors for humans. [sent-235, score-0.546]
73 Interestingly, the glyph does the best job at visualizing bicycles, likely due to their unique circular gradients. [sent-238, score-0.403]
74 Our results overall suggest that visualizing HOG with the glyph is misleading, and richer visualizations from our paired dictionary are useful for interpreting HOG vectors. [sent-239, score-1.06]
75 Human accuracy on inversions and state-ofthe-art object detection AP scores from [7] are correlated 6 Figure 13: HOG inversion reveals the world that object detectors see. [sent-241, score-0.68]
76 If we compute HOG on this image and invert it, the previously dark scene behind the man emerges. [sent-243, score-0.289]
77 We also asked computer vision PhD students at MIT to classify HOG glyphs in order to compare MTurk workers with experts in HOG. [sent-247, score-0.25]
78 HOG experts performed slightly better than non-experts on the glyph challenge, but experts on glyphs did not beat non-experts on other visualizations. [sent-249, score-0.321]
79 This result suggests that our algorithms produce more intuitive visualizations even for object detection researchers. [sent-250, score-0.63]
80 Understanding Object Detectors We have so far presented four algorithms to visualize object detection features. [sent-252, score-0.244]
81 We evaluated the visualizations with a large human study, and we found that paired dictionary learning provides the most intuitive visualization of HOG features. [sent-253, score-1.059]
82 In this section, we will use this visualization to inspect the behavior of object detection systems. [sent-254, score-0.374]
83 HOG Goggles Our visualizations reveal that the world that features see is slightly different from the world that the human eye perceives. [sent-257, score-0.55]
84 In order to understand how this clutter affects object detection, we visualized the features of some of the top false alarms from the Felzenszwalb et al. [sent-260, score-0.301]
85 Figure 3 shows our visualizations of the features of the top false alarms. [sent-262, score-0.531]
86 Notice how the false alarms look very similar to true positives. [sent-263, score-0.221]
87 While there are many different types of detector errors, this result suggests that these particular failures are due to limitations of HOG, and consequently, even if we develop better learning algorithms or use larger datasets, these will false alarms will likely persist. [sent-264, score-0.419]
88 Notice how when we view these detections in image space, all of the false alarms are difficult to explain. [sent-266, score-0.3]
89 By visualizing the detections in feature space, we discovered that the learning algorithm made reasonable failures since the features are deceptively similar to true positives. [sent-268, score-0.398]
90 We built an online interface for humans to look at HOG visualizations of window patches at the same resolution as DPM. [sent-274, score-0.518]
91 We instructed workers to either classify a HOG visualization as a positive example or a negative example for a category. [sent-275, score-0.345]
92 In most cases, human subjects classifying HOG visualizations were able to rank sliding windows with either the same accuracy or better than DPM. [sent-280, score-0.506]
93 Model Visualization We found our algorithms are also useful for visualizing the learned models of an object detector. [sent-322, score-0.24]
94 These visualizations provide hints on which gradients the learning found discriminative. [sent-324, score-0.405]
95 Conclusion We believe visualizations can be a powerful tool for understanding object detection systems and advancing research in computer vision. [sent-328, score-0.515]
96 To this end, this paper presented and evaluated four algorithms to visualize object detection features. [sent-329, score-0.244]
97 Since object detection researchers analyze HOG glyphs everyday and nearly every recent object detection Figure15:Wevisualizeafewdeformablepartsmodelstrainedwith[8]. [sent-330, score-0.355]
98 Our visualizations tend to reveal more detail than the glyph. [sent-335, score-0.405]
99 Figure 16: We show the original RGB patches that correspond to the visualizations from Figure 3. [sent-336, score-0.46]
100 paper includes HOG visualizations, we hope more intuitive visualizations will prove useful for the community. [sent-339, score-0.477]
wordName wordTfidf (topN-words)
[('hog', 0.468), ('visualizations', 0.405), ('inversion', 0.255), ('invert', 0.231), ('visualization', 0.208), ('paired', 0.197), ('glyph', 0.169), ('visualizing', 0.148), ('dictionary', 0.141), ('inversions', 0.13), ('alarms', 0.126), ('inverting', 0.108), ('ridge', 0.1), ('false', 0.095), ('visualize', 0.091), ('humans', 0.086), ('detections', 0.079), ('lda', 0.078), ('exemplar', 0.073), ('glyphs', 0.072), ('intuitive', 0.072), ('failures', 0.069), ('rgb', 0.065), ('dpm', 0.062), ('detection', 0.061), ('car', 0.061), ('rd', 0.059), ('detectors', 0.059), ('workers', 0.059), ('detector', 0.056), ('job', 0.056), ('inspect', 0.056), ('pascal', 0.052), ('object', 0.049), ('inverse', 0.048), ('alahi', 0.048), ('angelo', 0.048), ('arxg', 0.048), ('inverts', 0.048), ('pairdict', 0.048), ('popularized', 0.048), ('tatu', 0.048), ('mturk', 0.048), ('chairs', 0.048), ('reconstruct', 0.045), ('classify', 0.045), ('algorithms', 0.043), ('wish', 0.043), ('perceptually', 0.043), ('weinzaepfel', 0.043), ('voc', 0.042), ('users', 0.042), ('chair', 0.042), ('experts', 0.04), ('deceptively', 0.04), ('notice', 0.04), ('world', 0.039), ('reconstructing', 0.039), ('reveals', 0.038), ('elda', 0.037), ('parikh', 0.037), ('human', 0.036), ('goggles', 0.035), ('grayscale', 0.035), ('students', 0.034), ('positives', 0.034), ('sliding', 0.033), ('instructed', 0.033), ('people', 0.033), ('regression', 0.032), ('dictionaries', 0.032), ('blurred', 0.032), ('researchers', 0.032), ('subjects', 0.032), ('features', 0.031), ('analyze', 0.031), ('supplemental', 0.031), ('discovered', 0.031), ('eigenvectors', 0.031), ('likely', 0.03), ('man', 0.03), ('study', 0.03), ('vondrick', 0.029), ('divvala', 0.029), ('massive', 0.029), ('standing', 0.029), ('original', 0.028), ('spaces', 0.028), ('scoring', 0.028), ('dark', 0.028), ('recover', 0.028), ('minute', 0.028), ('sheep', 0.028), ('frequency', 0.028), ('looks', 0.028), ('natural', 0.027), ('rg', 0.027), ('participants', 0.027), ('patches', 0.027), ('tools', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 189 iccv-2013-HOGgles: Visualizing Object Detection Features
Author: Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, Antonio Torralba
Abstract: We introduce algorithms to visualize feature spaces used by object detectors. The tools in this paper allow a human to put on ‘HOG goggles ’ and perceive the visual world as a HOG based object detector sees it. We found that these visualizations allow us to analyze object detection systems in new ways and gain new insight into the detector’s failures. For example, when we visualize the features for high scoring false alarms, we discovered that, although they are clearly wrong in image space, they do look deceptively similar to true positives in feature space. This result suggests that many of these false alarms are caused by our choice of feature space, and indicates that creating a better learning algorithm or building bigger datasets is unlikely to correct these errors. By visualizing feature spaces, we can gain a more intuitive understanding of our detection systems.
2 0.12688686 161 iccv-2013-Fast Sparsity-Based Orthogonal Dictionary Learning for Image Restoration
Author: Chenglong Bao, Jian-Feng Cai, Hui Ji
Abstract: In recent years, how to learn a dictionary from input images for sparse modelling has been one very active topic in image processing and recognition. Most existing dictionary learning methods consider an over-complete dictionary, e.g. the K-SVD method. Often they require solving some minimization problem that is very challenging in terms of computational feasibility and efficiency. However, if the correlations among dictionary atoms are not well constrained, the redundancy of the dictionary does not necessarily improve the performance of sparse coding. This paper proposed a fast orthogonal dictionary learning method for sparse image representation. With comparable performance on several image restoration tasks, the proposed method is much more computationally efficient than the over-complete dictionary based learning methods.
3 0.12531377 384 iccv-2013-Semi-supervised Robust Dictionary Learning via Efficient l-Norms Minimization
Author: Hua Wang, Feiping Nie, Weidong Cai, Heng Huang
Abstract: Representing the raw input of a data set by a set of relevant codes is crucial to many computer vision applications. Due to the intrinsic sparse property of real-world data, dictionary learning, in which the linear decomposition of a data point uses a set of learned dictionary bases, i.e., codes, has demonstrated state-of-the-art performance. However, traditional dictionary learning methods suffer from three weaknesses: sensitivity to noisy and outlier samples, difficulty to determine the optimal dictionary size, and incapability to incorporate supervision information. In this paper, we address these weaknesses by learning a Semi-Supervised Robust Dictionary (SSR-D). Specifically, we use the ℓ2,0+ norm as the loss function to improve the robustness against outliers, and develop a new structured sparse regularization com, , tom. . cai@sydney . edu . au , heng@uta .edu make the learning tasks easier to deal with and reduce the computational cost. For example, in image tagging, instead of using the raw pixel-wise features, semi-local or patch- based features, such as SIFT and geometric blur, are usually more desirable to achieve better performance. In practice, finding a set of compact features bases, also referred to as dictionary, with enhanced representative and discriminative power, plays a significant role in building a successful computer vision system. In this paper, we explore this important problem by proposing a novel formulation and its solution for learning Semi-Supervised Robust Dictionary (SSRD), where we examine the challenges in dictionary learning, and seek opportunities to overcome them and improve the dictionary qualities. 1.1. Challenges in Dictionary Learning to incorporate the supervision information in dictionary learning, without incurring additional parameters. Moreover, the optimal dictionary size is automatically learned from the input data. Minimizing the derived objective function is challenging because it involves many non-smooth ℓ2,0+ -norm terms. We present an efficient algorithm to solve the problem with a rigorous proof of the convergence of the algorithm. Extensive experiments are presented to show the superior performance of the proposed method.
4 0.12247046 426 iccv-2013-Training Deformable Part Models with Decorrelated Features
Author: Ross Girshick, Jitendra Malik
Abstract: In this paper, we show how to train a deformable part model (DPM) fast—typically in less than 20 minutes, or four times faster than the current fastest method—while maintaining high average precision on the PASCAL VOC datasets. At the core of our approach is “latent LDA,” a novel generalization of linear discriminant analysis for learning latent variable models. Unlike latent SVM, latent LDA uses efficient closed-form updates and does not require an expensive search for hard negative examples. Our approach also acts as a springboard for a detailed experimental study of DPM training. We isolate and quantify the impact of key training factors for the first time (e.g., How important are discriminative SVM filters? How important is joint parameter estimation? How many negative images are needed for training?). Our findings yield useful insights for researchers working with Markov random fields and partbased models, and have practical implications for speeding up tasks such as model selection.
5 0.10977886 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition
Author: Hans Lobel, René Vidal, Alvaro Soto
Abstract: Currently, Bag-of-Visual-Words (BoVW) and part-based methods are the most popular approaches for visual recognition. In both cases, a mid-level representation is built on top of low-level image descriptors and top-level classifiers use this mid-level representation to achieve visual recognition. While in current part-based approaches, mid- and top-level representations are usually jointly trained, this is not the usual case for BoVW schemes. A main reason for this is the complex data association problem related to the usual large dictionary size needed by BoVW approaches. As a further observation, typical solutions based on BoVW and part-based representations are usually limited to extensions of binary classification schemes, a strategy that ignores relevant correlations among classes. In this work we propose a novel hierarchical approach to visual recognition based on a BoVW scheme that jointly learns suitable midand top-level representations. Furthermore, using a maxmargin learning framework, the proposed approach directly handles the multiclass case at both levels of abstraction. We test our proposed method using several popular bench- mark datasets. As our main result, we demonstrate that, by coupling learning of mid- and top-level representations, the proposed approach fosters sharing of discriminative visual words among target classes, being able to achieve state-ofthe-art recognition performance using far less visual words than previous approaches.
6 0.1010516 390 iccv-2013-Shufflets: Shared Mid-level Parts for Fast Object Detection
7 0.097825631 111 iccv-2013-Detecting Dynamic Objects with Multi-view Background Subtraction
8 0.09391059 424 iccv-2013-Tracking Revisited Using RGBD Camera: Unified Benchmark and Baselines
9 0.091747776 354 iccv-2013-Robust Dictionary Learning by Error Source Decomposition
10 0.091370501 276 iccv-2013-Multi-attributed Dictionary Learning for Sparse Coding
11 0.089786984 377 iccv-2013-Segmentation Driven Object Detection with Fisher Vectors
12 0.089675315 308 iccv-2013-Parsing IKEA Objects: Fine Pose Estimation
13 0.089479536 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps
14 0.088106759 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition
15 0.085284449 236 iccv-2013-Learning Discriminative Part Detectors for Image Classification and Cosegmentation
16 0.083959781 327 iccv-2013-Predicting an Object Location Using a Global Image Representation
17 0.082670018 359 iccv-2013-Robust Object Tracking with Online Multi-lifespan Dictionary Learning
18 0.080893487 277 iccv-2013-Multi-channel Correlation Filters
19 0.080768861 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary
20 0.080033988 242 iccv-2013-Learning People Detectors for Tracking in Crowded Scenes
topicId topicWeight
[(0, 0.182), (1, 0.062), (2, -0.027), (3, -0.044), (4, -0.024), (5, -0.089), (6, -0.072), (7, -0.016), (8, -0.096), (9, -0.049), (10, 0.078), (11, 0.017), (12, -0.009), (13, -0.05), (14, -0.053), (15, -0.005), (16, -0.001), (17, 0.071), (18, 0.079), (19, 0.041), (20, -0.065), (21, 0.008), (22, 0.01), (23, 0.048), (24, 0.01), (25, 0.034), (26, -0.046), (27, -0.056), (28, -0.001), (29, -0.075), (30, -0.026), (31, -0.017), (32, 0.006), (33, -0.02), (34, 0.037), (35, -0.024), (36, 0.051), (37, -0.017), (38, 0.0), (39, 0.056), (40, 0.008), (41, -0.022), (42, -0.042), (43, -0.042), (44, 0.005), (45, -0.037), (46, -0.004), (47, -0.03), (48, -0.046), (49, 0.042)]
simIndex simValue paperId paperTitle
same-paper 1 0.95792031 189 iccv-2013-HOGgles: Visualizing Object Detection Features
Author: Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, Antonio Torralba
Abstract: We introduce algorithms to visualize feature spaces used by object detectors. The tools in this paper allow a human to put on ‘HOG goggles ’ and perceive the visual world as a HOG based object detector sees it. We found that these visualizations allow us to analyze object detection systems in new ways and gain new insight into the detector’s failures. For example, when we visualize the features for high scoring false alarms, we discovered that, although they are clearly wrong in image space, they do look deceptively similar to true positives in feature space. This result suggests that many of these false alarms are caused by our choice of feature space, and indicates that creating a better learning algorithm or building bigger datasets is unlikely to correct these errors. By visualizing feature spaces, we can gain a more intuitive understanding of our detection systems.
2 0.71895468 390 iccv-2013-Shufflets: Shared Mid-level Parts for Fast Object Detection
Author: Iasonas Kokkinos
Abstract: We present a method to identify and exploit structures that are shared across different object categories, by using sparse coding to learn a shared basis for the ‘part’ and ‘root’ templates of Deformable Part Models (DPMs). Our first contribution consists in using Shift-Invariant Sparse Coding (SISC) to learn mid-level elements that can translate during coding. This results in systematically better approximations than those attained using standard sparse coding. To emphasize that the learned mid-level structures are shiftable we call them shufflets. Our second contribution consists in using the resulting score to construct probabilistic upper bounds to the exact template scores, instead of taking them ‘at face value ’ as is common in current works. We integrate shufflets in DualTree Branch-and-Bound and cascade-DPMs and demonstrate that we can achieve a substantial acceleration, with practically no loss in performance.
3 0.69211483 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition
Author: Hans Lobel, René Vidal, Alvaro Soto
Abstract: Currently, Bag-of-Visual-Words (BoVW) and part-based methods are the most popular approaches for visual recognition. In both cases, a mid-level representation is built on top of low-level image descriptors and top-level classifiers use this mid-level representation to achieve visual recognition. While in current part-based approaches, mid- and top-level representations are usually jointly trained, this is not the usual case for BoVW schemes. A main reason for this is the complex data association problem related to the usual large dictionary size needed by BoVW approaches. As a further observation, typical solutions based on BoVW and part-based representations are usually limited to extensions of binary classification schemes, a strategy that ignores relevant correlations among classes. In this work we propose a novel hierarchical approach to visual recognition based on a BoVW scheme that jointly learns suitable midand top-level representations. Furthermore, using a maxmargin learning framework, the proposed approach directly handles the multiclass case at both levels of abstraction. We test our proposed method using several popular bench- mark datasets. As our main result, we demonstrate that, by coupling learning of mid- and top-level representations, the proposed approach fosters sharing of discriminative visual words among target classes, being able to achieve state-ofthe-art recognition performance using far less visual words than previous approaches.
4 0.68841499 406 iccv-2013-Style-Aware Mid-level Representation for Discovering Visual Connections in Space and Time
Author: Yong Jae Lee, Alexei A. Efros, Martial Hebert
Abstract: We present a weakly-supervised visual data mining approach that discovers connections between recurring midlevel visual elements in historic (temporal) and geographic (spatial) image collections, and attempts to capture the underlying visual style. In contrast to existing discovery methods that mine for patterns that remain visually consistent throughout the dataset, our goal is to discover visual elements whose appearance changes due to change in time or location; i.e., exhibit consistent stylistic variations across the label space (date or geo-location). To discover these elements, we first identify groups of patches that are stylesensitive. We then incrementally build correspondences to find the same element across the entire dataset. Finally, we train style-aware regressors that model each element’s range of stylistic differences. We apply our approach to date and geo-location prediction and show substantial improvement over several baselines that do not model visual style. We also demonstrate the method’s effectiveness on the related task of fine-grained classification.
5 0.67823786 349 iccv-2013-Regionlets for Generic Object Detection
Author: Xiaoyu Wang, Ming Yang, Shenghuo Zhu, Yuanqing Lin
Abstract: Generic object detection is confronted by dealing with different degrees of variations in distinct object classes with tractable computations, which demands for descriptive and flexible object representations that are also efficient to evaluate for many locations. In view of this, we propose to model an object class by a cascaded boosting classifier which integrates various types of features from competing local regions, named as regionlets. A regionlet is a base feature extraction region defined proportionally to a detection window at an arbitrary resolution (i.e. size and aspect ratio). These regionlets are organized in small groups with stable relative positions to delineate fine-grained spatial layouts inside objects. Their features are aggregated to a one-dimensional feature within one group so as to tolerate deformations. Then we evaluate the object bounding box proposal in selective search from segmentation cues, limiting the evaluation locations to thousands. Our approach significantly outperforms the state-of-the-art on popular multi-class detection benchmark datasets with a single method, without any contexts. It achieves the detec- tion mean average precision of 41. 7% on the PASCAL VOC 2007 dataset and 39. 7% on the VOC 2010 for 20 object categories. It achieves 14. 7% mean average precision on the ImageNet dataset for 200 object categories, outperforming the latest deformable part-based model (DPM) by 4. 7%.
6 0.67142361 426 iccv-2013-Training Deformable Part Models with Decorrelated Features
7 0.66199398 286 iccv-2013-NYC3DCars: A Dataset of 3D Vehicles in Geographic Context
8 0.66065657 111 iccv-2013-Detecting Dynamic Objects with Multi-view Background Subtraction
9 0.65992731 66 iccv-2013-Building Part-Based Object Detectors via 3D Geometry
10 0.64991504 179 iccv-2013-From Subcategories to Visual Composites: A Multi-level Framework for Object Detection
11 0.64798164 277 iccv-2013-Multi-channel Correlation Filters
12 0.6424225 327 iccv-2013-Predicting an Object Location Using a Global Image Representation
13 0.62411714 377 iccv-2013-Segmentation Driven Object Detection with Fisher Vectors
14 0.62345463 61 iccv-2013-Beyond Hard Negative Mining: Efficient Detector Learning via Block-Circulant Decomposition
15 0.61769831 109 iccv-2013-Detecting Avocados to Zucchinis: What Have We Done, and Where Are We Going?
16 0.59154189 269 iccv-2013-Modeling Occlusion by Discriminative AND-OR Structures
17 0.58131993 384 iccv-2013-Semi-supervised Robust Dictionary Learning via Efficient l-Norms Minimization
18 0.58024734 75 iccv-2013-CoDeL: A Human Co-detection and Labeling Framework
19 0.57986212 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction
20 0.5797959 285 iccv-2013-NEIL: Extracting Visual Knowledge from Web Data
topicId topicWeight
[(2, 0.088), (4, 0.014), (7, 0.023), (12, 0.034), (13, 0.014), (26, 0.071), (31, 0.054), (42, 0.123), (48, 0.014), (64, 0.031), (73, 0.039), (78, 0.027), (89, 0.2), (94, 0.139), (98, 0.021)]
simIndex simValue paperId paperTitle
1 0.93788135 222 iccv-2013-Joint Learning of Discriminative Prototypes and Large Margin Nearest Neighbor Classifiers
Author: Martin Köstinger, Paul Wohlhart, Peter M. Roth, Horst Bischof
Abstract: In this paper, we raise important issues concerning the evaluation complexity of existing Mahalanobis metric learning methods. The complexity scales linearly with the size of the dataset. This is especially cumbersome on large scale or for real-time applications with limited time budget. To alleviate this problem we propose to represent the dataset by a fixed number of discriminative prototypes. In particular, we introduce a new method that jointly chooses the positioning of prototypes and also optimizes the Mahalanobis distance metric with respect to these. We show that choosing the positioning of the prototypes and learning the metric in parallel leads to a drastically reduced evaluation effort while maintaining the discriminative essence of the original dataset. Moreover, for most problems our method performing k-nearest prototype (k-NP) classification on the condensed dataset leads to even better generalization compared to k-NN classification using all data. Results on a variety of challenging benchmarks demonstrate the power of our method. These include standard machine learning datasets as well as the challenging Public Fig- ures Face Database. On the competitive machine learning benchmarks we are comparable to the state-of-the-art while being more efficient. On the face benchmark we clearly outperform the state-of-the-art in Mahalanobis metric learning with drastically reduced evaluation effort.
2 0.90980428 285 iccv-2013-NEIL: Extracting Visual Knowledge from Web Data
Author: Xinlei Chen, Abhinav Shrivastava, Abhinav Gupta
Abstract: We propose NEIL (NeverEnding Image Learner), a computer program that runs 24 hours per day and 7 days per week to automatically extract visual knowledge from Internet data. NEIL uses a semi-supervised learning algorithm that jointly discovers common sense relationships (e.g., “Corolla is a kind of/looks similar to Car”, “Wheel is a part of Car”) and labels instances of the given visual categories. It is an attempt to develop the world’s largest visual structured knowledge base with minimum human labeling effort. As of 10th October 2013, NEIL has been continuously running for 2.5 months on 200 core cluster (more than 350K CPU hours) and has an ontology of 1152 object categories, 1034 scene categories and 87 attributes. During this period, NEIL has discovered more than 1700 relationships and has labeled more than 400K visual instances. 1. Motivation Recent successes in computer vision can be primarily attributed to the ever increasing size of visual knowledge in terms of labeled instances of scenes, objects, actions, attributes, and the contextual relationships between them. But as we move forward, a key question arises: how will we gather this structured visual knowledge on a vast scale? Recent efforts such as ImageNet [8] and Visipedia [30] have tried to harness human intelligence for this task. However, we believe that these approaches lack both the richness and the scalability required for gathering massive amounts of visual knowledge. For example, at the time of submission, only 7% of the data in ImageNet had bounding boxes and the relationships were still extracted via Wordnet. In this paper, we consider an alternative approach of automatically extracting visual knowledge from Internet scale data. The feasibility of extracting knowledge automatically from images and videos will itself depend on the state-ofthe-art in computer vision. While we have witnessed significant progress on the task of detection and recognition, we still have a long way to go for automatically extracting the semantic content of a given image. So, is it really possible to use existing approaches for gathering visual knowledge directly from web data? 1.1. NEIL – Never Ending Image Learner We propose NEIL, a computer program that runs 24 hours per day, 7 days per week, forever to: (a) semantically understand images on the web; (b) use this semantic understanding to augment its knowledge base with new labeled instances and common sense relationships; (c) use this dataset and these relationships to build better classifiers and detectors which in turn help improve semantic understanding. NEIL is a constrained semi-supervised learning (SSL) system that exploits the big scale of visual data to automatically extract common sense relationships and then uses these relationships to label visual instances of existing categories. It is an attempt to develop the world’s largest visual structured knowledge base with minimum human effort one that reflects the factual content of the images on the Internet, and that would be useful to many computer vision and AI efforts. Specifically, NEIL can use web data to extract: (a) Labeled examples of object categories with bounding boxes; (b) Labeled examples of scenes; (c) Labeled examples of attributes; (d) Visual subclasses for object categories; and (e) Common sense relationships about scenes, objects and attributes like “Corolla is a kind of/looks similar to Car”, “Wheel is a part ofCar”, etc. (See Figure 1). We believe our approach is possible for three key reasons: (a) Macro-vision vs. Micro-vision: We use the term “micro-vision” to refer to the traditional paradigm where the input is an image and the output is some information extracted from that image. In contrast, we define “macrovision” as a paradigm where the input is a large collection of images and the desired output is extracting significant or interesting patterns in visual data (e.g., car is detected frequently in raceways). These patterns help us to extract common sense relationships. Note, the key difference is that macro-vision does not require us to understand every image in the corpora and extract all possible patterns. Instead, it relies on understanding a few images and statistically combine evidence from these to build our visual knowledge. – (b) Structure of the Visual World: Our approach exploits the structure of the visual world and builds constraints for detection and classification. These global constraints are represented in terms of common sense relationships be1409 orCllaoraC Hloc e yrs(a) Objects (w/Bounding Boxes and Vislue hWal Subcategories) aongkPi lrt(b) ScenewyacaResaephs nuoRd(c) At d worCreibutes Visual Instances Labeled by NEIL (O-O) Wheel is a part of Car. (S-O) Car is found in Raceway. (O-O) Corolla is a kind of/looks similar to Car. (S-O) Pyramid is found in Egypt. (O-A) Wheel is/has Round shape. (S-A) Alley is/has Narrow. (S-A) Bamboo forest is/has Vertical lines. (O-A) Sunflower is/has Yellow. Relationships Extracted by NEIL Figure 1. NEIL is a computer program that runs 24 hours a day and 7 days a week to gather visual knowledge from the Internet. Specifically, it simultaneously labels the data and extracts common sense relationships between categories. tween categories. Most prior work uses manually defined relationships or learns relationships in a supervised setting. Our key insight is that at a large scale one can simultane- ously label the visual instances and extract common sense relationships in ajoint semi-supervised learning framework. (c) Semantically driven knowledge acquisition: We use a semantic representation for visual knowledge; that is, we group visual data based on semantic categories and develop relationships between semantic categories. This allows us to leverage text-based indexing tools such as Google Image Search to initialize our visual knowledge base learning. Contributions: Our main contributions are: (a) We propose a never ending learning algorithm for gathering visual knowledge from the Internet via macro-vision. NEIL has been continuously running for 2.5 months on a 200 core cluster; (b) We are automatically building a large visual structured knowledge base which not only consists of labeled instances of scenes, objects, and attributes but also the relationships between them. While NEIL’s core SSL algorithm works with a fixed vocabulary, we also use noun phrases from NELL’s ontology [5] to grow our vocabulary. Currently, our growing knowledge base has an ontology of 1152 object categories, 1034 scene categories, and 87 attributes. NEIL has discovered more than 1700 relationships and labeled more than 400K visual instances of these categories. (c) We demonstrate how joint discovery of relationships and labeling of instances at a gigantic scale can provide constraints for improving semi-supervised learning. 2. Related Work Recent work has only focused on extracting knowledge in the form of large datasets for recognition and classification [8, 23, 30]. One of the most commonly used approaches to build datasets is using manual annotations by motivated teams of people [30] or the power of crowds [8, 40]. To minimize human effort, recent works have also focused on active learning [37, 39] which selects label requests that are most informative. However, both of these directions have a major limitation: annotations are expensive, prone to errors, biased and do not scale. An alternative approach is to use visual recognition for extracting these datasets automatically from the Internet [23, 34, 36]. A common way of automatically creating a dataset is to use image search results and rerank them via visual classifiers [14] or some form of joint-clustering in text and visual space [2, 34]. Another approach is to use a semi-supervised framework [42]. Here, a small amount of labeled data is used in conjunction with a large amount of unlabeled data to learn reliable and robust visual models. These seed images can be manually labeled [36] or the top retrievals of a text-based search [23]. The biggest problem with most of these automatic approaches is that the small number of labeled examples or image search results do not provide enough constraints for learning robust visual classifiers. Hence, these approaches suffer from semantic drift [6]. One way to avoid semantic drift is to exploit additional constraints based on the structure of our visual data. Researchers have exploited a variety of constraints such as those based on visual similarity [11, 15], seman- tic similarity [17] or multiple feature spaces [3]. However, most of these constraints are weak in nature: for example, visual similarity only models the constraint that visuallysimilar images should receive the same labels. On the other hand, our visual world is highly structured: object cate1410 gories share parts and attributes, objects and scenes have strong contextual relationships, etc. Therefore, we need a way to capture the rich structure of our visual world and exploit this structure during semi-supervised learning. In recent years, there have been huge advances in modeling the rich structure of our visual world via contextual relationships. Some of these relationships include: SceneObject [38], Object-Object [3 1], Object-Attribute [12, 22, 28], Scene-Attribute [29]. All these relationships can provide a rich set of constraints which can help us improve SSL [4]. For example, scene-attribute relationships such as amphitheaters are circular can help improve semisupervised learning of scene classifiers [36] and Wordnet hierarchical relationships can help in propagating segmentations [21]. But the big question is: how do we obtain these relationships? One way to obtain such relationships is via text analysis [5, 18]. However, as [40] points out that the visual knowledge we need to obtain is so obvious that no one would take the time to write it down and put it on web. In this work, we argue that, at a large-scale, one can jointly discover relationships and constrain the SSL prob- lem for extracting visual knowledge and learning visual classifiers and detectors. Motivated by a never ending learning algorithm for text [5], we propose a never ending visual learning algorithm that cycles between extracting global relationships, labeling data and learning classifiers/detectors for building visual knowledge from the Internet. Our work is also related to attribute discovery [33, 35] since these approaches jointly discover the attributes and relationships between objects and attributes simultaneously. However, in our case, we only focus on semantic attributes and therefore our goal is to discover semantic relationships and semantically label visual instances. 3. Technical Approach Our goal is to extract visual knowledge from the pool of visual data on the web. We define visual knowledge as any information that can be useful for improving vision tasks such as image understanding and object/scene recognition. One form of visual knowledge would be labeled examples of different categories or labeled segments/boundaries. Labeled examples helps us learn classifiers or detectors and improve image understanding. Another example of visual knowledge would be relationships. For example, spatial contextual relationships can be used to improve object recognition. In this paper, we represent visual knowledge in terms of labeled examples of semantic categories and the relationships between those categories. Our knowledge base consists of labeled examples of: (1) Objects (e.g., Car, Corolla); (2) Scenes (e.g., Alley, Church); (3) Attributes (e.g., Blue, Modern). Note that for objects we learn detectors and for scenes we build classifiers; however for the rest of the paper we will use the term detector and classifier interchangeably. Our knowledge base also contains relationships of four types: (1) Object-Object (e.g., Wheel is a part of Car);(2) Object-Attribute (e.g., Sheep is/has White); (3) Scene-Object (e.g., Car is found in Raceway); (4) SceneAttribute (e.g., Alley is/has Narrow). The outline of our approach is shown in Figure 2. We use Google Image Search to download thousands of images for each object, scene and attribute category. Our method then uses an iterative approach to clean the labels and train detectors/classifiers in a semi-supervised manner. For a given concept (e.g., car), we first discover the latent visual subcategories and bounding boxes for these sub-categories using an exemplar-based clustering approach (Section 3. 1). We then train multiple detectors for a concept (one for each sub-category) using the clustering and localization results. These detectors and classifiers are then used for detections on millions of images to learn relationships based on cooccurrence statistics (Section 3.2). Here, we exploit the fact the we are interested in macro-vision and therefore build co-occurrence statistics using only confident detections/classifications. Once we have relationships, we use them in conjunction with our classifiers and detectors to label the large set of noisy images (Section 3.3). The most confidently labeled images are added to the pool of labeled data and used to retrain the models, and the process repeats itself. At every iteration, we learn better classifiers and detectors, which in turn help us learn more relationships and further constrain the semi-supervised learning problem. We now describe each step in detail below. 3.1. Seeding Classifiers via Google Image Search The first step in our semi-supervised algorithm is to build classifiers for visual categories. One way to build initial classifiers is via a few manually labeled seed images. Here, we take an alternative approach and use text-based image retrieval systems to provide seed images for training initial detectors. For scene and attribute classifiers we directly use these retrieved images as positive data. However, such an approach fails for training object and attribute detectors because of four reasons (Figure 3(a)) (1) Outliers: Due to the imperfectness of text-based image retrieval, the downloaded images usually have irrelevant images/outliers; (2) Polysemy: In many cases, semantic categories might be overloaded and a single semantic category might have multiple senses (e.g., apple can mean both the company and the fruit); (3) Visual Diversity: Retrieved images might have high intra-class variation due to different viewpoint, illumination etc.; (4) Localization: In many cases the retrieved image might be a scene without a bounding-box and hence one needs to localize the concept before training a detector. Most of the current approaches handle these problems via clustering. Clustering helps in handling visual diversity [9] and discovering multiple senses of retrieval (polysemy) [25]. It can also help us to reject outliers based on – distances from cluster centers. One simple way to cluster 141 1 would be to use K-means on the set of all possible bounding boxes and then use the representative clusters as visual sub-categories. However, clustering using K-means has two issues: (1) High Dimensionality: We use the Color HOG (CHOG) [20] representation and standard distance metrics do not work well in such high-dimensions [10]; (2) Scalability: Most clustering approaches tend to partition the complete feature space. In our case, since we do not have bounding boxes provided, every image creates millions of data points and the majority of the datapoints are outliers. Recent work has suggested that K-means is not scalable and has bad performance in this scenario since it assigns membership to every data point [10]. Instead, we propose to use a two-step approach for clustering. In the first step, we mine the set of downloaded im- × ages from Google Image Search to create candidate object windows. Specifically, every image is used to train a detector using recently proposed exemplar-LDA [19]. These detectors are then used for dense detections on the same set of downloaded images. We select the top K windows which have high scores from multiple detectors. Note that this step helps us prune out outliers as the candidate windows are selected via representativeness (how many detectors fire on them). For example, in Figure 3, none of the tricycle detectors fire on the outliers such as circular dots and people eating, and hence these images are rejected at this candidate widow step. Once we have candidate windows, we cluster them in the next step. However, instead of using the high-dimensional CHOG representation for clustering, we use the detection signature of each window (represented as a vector of seed detector ELDA scores on the window) to create a K K affinity matrix. The (i, j) entry in the affinity amteat arix K i s× thKe da fofti product orixf t.h Tish vee (cit,ojr) )fo enr twryin indo thwes ai fainndj. Intuitively, this step connects candidate windows if the same set of detectors fire on both windows. Once we have the affinity matrix, we cluster the candidate windows using the standard affinity propagation algorithm [16]. Affinity propagation also allows us to extract a representative window (prototype) for each cluster which acts as an iconic image for the object [32] (Figure 3). After clustering, we train a detector for each cluster/sub-category using three-quarters of the images in the cluster. The remaining quarter is used as a validation set for calibration. 3.2. Extracting Relationships Once we have initialized object detectors, attribute detectors, attribute classifiers and scene classifiers, we can use them to extract relationships automatically from the data. The key idea is that we do not need to understand each and every image downloaded from the Internet but instead understand the statistical pattern of detections and classifications at a large scale. These patterns can be used to select the top-N relationships at every iteration. Specifically, we extract four different kinds of relationships: Object-Object Relationships: The first kind of relationship we extract are object-object relationships which include: (1) Partonomy relationships such as “Eye is a part of Baby”; (2) Taxonomy relationships such as “BMW 320 is a kind of Car”; and (3) Similarity relationships such as 1412 (a) Google Image Search for “tricycle” (b) Sub-category Discovery Figure 3. An example of how clustering handles polysemy, intraclass variation and outlier removal (a). The bottom row shows our discovered clusters. “Swan looks similar to Goose”. To extract these relationships, we first build a co-detection matrix O0 whose elements represent the probability of simultaneous detection of object categories i and j. Intuitively, the co-detection matrix has high values when object detector idetects objects inside the bounding box of object j with high detection scores. To account for detectors that fire everywhere and images which have lots of detections, we normalize the matrix O0. The normalized co-detection matrix can be written 1 1 as: N1− 2 O0N2− 2 , where N1 and N2 are out-degree and indegree matrix and (i, j) element of O0 represents the average score of top-detections of detector ion images of object category j. Once we have selected a relationship between pair of categories, we learn its characteristics in terms of mean and variance of relative locations, relative aspect ra- tio, relative scores and relative size of the detections. For example, the nose-face relationship is characterized by low relative window size (nose is less than 20% of face area) and the relative location that nose occurs in center of the face. This is used to define a compatibility function ψi,j (·) which evaluates if the detections from category iand j are compatible or not. We also classify the relationships into the two semantic categories (part-of, taxonomy/similar) using relative features to have a human-communicable view of visual knowledge base. Object-Attribute Relationships: The second type of relationship we extract is object-attribute relationships such as “Pizza has Round Shape”, ”Sunflower is Yellow” etc. To extract these relationships we use the same methodology where the attributes are detected in the labeled examples of object categories. These detections and their scores are then used to build a normalized co-detection matrix which is used to find the top object-attribute relationships. Scene-Object Relationships: The third type of relationship extracted by our algorithm includes scene-object relationships such as “Bus is found in Bus depot” and “Monitor is found in Control room”. For extracting scene-object relationships, we use the object detectors on randomly sampled images of different scene classes. The detections are then used to create the normalized co-presence matrix (similar to object-object relationships) where the (i, j) element represents the likelihood of detection of instance of object category iand the scene category class j. Scene-Attribute Relationships: The fourth and final type of relationship extracted by our algorithm includes sceneattribute relationships such as “Ocean is Blue”, “Alleys are Narrow”, etc. Here, we follow a simple methodology for extracting scene-attribute relationships where we compute co-classification matrix such that the element (i, j) of the matrix represents average classification scores of attribute ion images of scene j. The top entries in this coclassification matrix are used to extract scene-attribute relationships. 3.3. Retraining via Labeling New Instances Once we have the initial set of classifiers/detectors and the set of relationships, we can use them to find new instances of different objects and scene categories. These new instances are then added to the set of labeled data and we retrain new classifiers/detectors using the updated set of labeled data. These new classifiers are then used to extract more relationships which in turn are used to label more data and so on. One way to find new instances is directly using the detector itself. For instance, using the car detector to find more cars. However, this approach leads to semantic drift. To avoid semantic drift, we use the rich set of relationships we extracted in the previous section and ensure that the new labeled instances of car satisfy the extracted relationships (e.g., has wheels, found in raceways etc.) Mathematically, let RO, RA and RS represent the set of object-object, object-attribute aanndd scene-object relationships at iteration t. If φi (·) represents the potential from object detector i, ωk (·) represents sthenet scene potential, raonmd ψi,j (·) represent the compatibility sfu thnect siocnen nbeet pwoeteennt tiwalo, aonbdject categories i,j,ethceonm we can ifityndfu uthncet new inetswtaenecnetsw woof oobb-ject category iusing the contextual scoring function given below: φi(x) + ? φj(xl)ψi,j(x,xl) + ? i,j∈R?O RA ? ωk(x) i,k?∈RS where x is the wi?ndow being evaluated and xl is the topdetected window of related object/attribute category. The above equation has three terms: the first term is appearance term for the object category itself and is measured by the 1413 Nilgai Yamaha Violin Bass F-18 Figure 4. Qualitative Examples of Bounding Box Labeling Done by NEIL score of the SVM detector on the window x. The second term measures the compatibility between object category i and the object/attribute category j if the relationship (i, j) is part of the catalogue. For example, if “Wheel is a part of Car” exists in the catalogue then this term will be the product of the score of wheel detector and the compatibility function between the wheel window (xl) and the car window (x). The final term measures the scene-object compatibility. Therefore, if the knowledge base contains the re- lationship “Car is found in Raceway”, this term boosts the “Car” detection scores in the “Raceway” scenes. At each iteration, we also add new instances of different scene categories. We find new instances of scene category k using the contextual scoring function given below: ωk(x) + ? ωm(x) + ? φi(xl) m,k?∈RA? i,k?∈RS where RA? represents the catalogue of scene-attribute relationships. The above equation has three terms: the first term is the appearance term for the scene category itself and is estimated using the scene classifier. The second term is the appearance term for the attribute category and is estimated using the attribute classifier. This term ensures that if a scene-attribute relationship exists then the attribute classifier score should be high. The third and the final term is the appearance term of an object category and is estimated using the corresponding object detector. This term ensures that if a scene-object relationship exists then the object detector should detect objects in the scene. Implementation Details: To train scene & attribute classifiers, we first extract a 3912 dimensional feature vector from each image. The feature vector includes 5 12D GIST [27] features, concatenated with bag ofwords representations for SIFT [24], HOG [7], Lab color space, and Texton [26]. The dictionary sizes are 1000, 1000, 400, 1000, respectively. Features of randomly sampled windows from other categories are used as negative examples for SVM training and hard mining. For the object and attribute section, we use CHOG [20] features with a bin size of 8. We train the detectors using latent SVM model (without parts) [13]. 4. Experimental Results We demonstrate the quality of visual knowledge by qualitative results, verification via human subjects and quantitative results on tasks such as object detection and scene recognition. 4.1. NEIL Statistics While NEIL’s core algorithm uses a fixed vocabulary, we use noun phrases from NELL [5] to grow NEIL’s vocabulary. As of 10th October 2013, NEIL has an ontology of 1152 object categories, 1034 scene categories and 87 attributes. It has downloaded more than 2 million images for extracting the current structured visual knowledge. For bootstrapping our system, we use a few seed images from ImageNet [8], SUN [4 1] or the top-images from Google Image Search. For the purposes of extensive experimental evaluation in this paper, we ran NEIL on steroids (200 cores as opposed to 30 cores used generally) for the last 2.5 months. NEIL has completed 16 iterations and it has labeled more than 400K visual instances (including 300,000 objects with their bounding boxes). It has also extracted 1703 common sense relationships. Readers can browse the current visual knowledge base and download the detectors from: www.neil-kb.com 4.2. Qualitative Results We first show some qualitative results in terms of ex- tracted visual knowledge by NEIL. Figure 4 shows the extracted visual sub-categories along with a few labeled instances belonging to each sub-category. It can be seen from the figure that NEIL effectively handles the intra-class variation and polysemy via the clustering process. The purity and diversity of the clusters for different concepts indicate that contextual relationships help make our system robust to semantic drift and ensure diversity. Figure 5 shows some of the qualitative examples of scene-object and object-object relationships extracted by NEIL. It is effective in using a few confident detections to extract interesting relationships. Figure 6 shows some of the interesting scene-attribute and object-attribute relationships extracted by NEIL. 1414 Helicopter is found in Airfield Leaning tower is found in Pisa Van is a kind of/looks similar to Ambulance Airplane nose is a part of Airbus 330 Zebra is found in Savanna Ferris wheel is found in Amusement park Opera house is found in Sydney Eye is a part of Baby Duck is a kind of/looks similar to Goose Monitor is a kind of/looks similar to Desktop computer Figure 5. Qualitative Examples of Scene-Object (rows Bus is found in Bus depot outdoor Sparrow is a kind of/looks similar to bird 1-2) and Object-Object (rows Throne is found in Throne room Camry is found in Pub outdoor Gypsy moth is a kind of/looks similar to Butterfly Basketball net is a part of Backboard 3-4) Relationships Extracted by NEIL 4.3. Evaluating Quality via Human Subjects Next, we want to evaluate the quality of extracted visual knowledge by NEIL. It should be noted that an extensive and comprehensive evaluation for the whole NEIL system is an extremely difficult task. It is impractical to evaluate each and every labeled instance and each and every rela- tionship for correctness. Therefore, we randomly sample the 500 visual instances and 500 relationships, and verify them using human experts. At the end of iteration 6, 79% of the relationships extracted by NEIL are correct, and 98% of the visual data labeled by NEIL has been labeled correctly. We also evaluate the per iteration correctness of relationships: At iteration 1, more than 96% relationships are correct and by iteration 3, the system stabilizes and 80% of extracted relationships are correct. While currently the system does not demonstrate any major semantic drift, we do plan to continue evaluation and extensive analysis of knowledge base as NEIL grows older. We also evaluate the quality of bounding-boxes generated by NEIL. For this we sample 100 images randomly and label the ground-truth bounding boxes. On the standard intersection-over-union metric, NEIL generates bounding boxes with 0.78 overlap on average with ground-truth. To give context to the difficulty of the task, the standard Objectness algorithm [1] produces bounding boxes with 0.59 overlap on average. 4.4. Using Knowledge for Vision Tasks Finally, we want to demonstrate the usefulness of the visual knowledge learned by NEIL on standard vision tasks such as object detection and scene classification. Here, we will also compare several aspects of our approach: (a) We first compare the quality of our automatically labeled dataset. As baselines, we train classifiers/detectors directly on the seed images downloaded from Google Image Search. (b) We compare NEIL against a standard bootstrapping ap- proach which does not extract/use relationships. (c) Finally, we will demonstrate the usefulness of relationships by detecting and classifying new test data with and without the learned relationships. Scene Classification: First we evaluate our visual knowledge for the task of scene classification. We build a dataset of 600 images (12 scene categories) using Flickr images. We compare the performance ofour scene classifiers against the scene classifiers trained from top 15 images of Google Image Search (our seed classifier). We also compare the performance with standard bootstrapping approach without using any relationship extraction. Table 1shows the results. We use mean average precision (mAP) as the evaluation metric. As the results show, automatic relationship extraction helps us to constrain the learning problem, and so the learned classifiers give much better performance. Finally, if we also use the contextual information from NEIL relationships we get a significant boost in performance. Table 1. mAP performance for scene classification on 12 categories. mAP Seed Classifier (15 Google Images) Bootstrapping (without relationships) NEIL Scene Classifiers NEIL (Classifiers + Relationships) 0.52 0.54 0.57 0.62 Object Detection: We also evaluate our extracted visual knowledge for the task of object detection. We build a dataset of 1000 images (15 object categories) using Flickr data for testing. We compare the performance against object detectors trained directly using (top-50 and top-450) images from Google Image Search. We also compare the perfor- mance of detectors trained after aspect-ratio, HOG clustering and our proposed clustering procedure. Table 2 shows the detection results. Using 450 images from Google image search decreases the performance due to noisy retrievals. While other clustering methods help, the gain by our clustering procedure is much larger. Finally, detectors trained using NEIL work better than standard bootstrapping. 1415 MMoonniittoorr iiss f foouunndd iinn CCoonnttrrooll rroooomm? rroooomm? MMoonniittoorr iiss ffoouunndd iinn CCoonnttrrooll Washing machine is found in Utility room? Siberian tiger is found in Zoo Baseball is found in Butters box Bullet train is found in Train station platform? Cougar looks similar to Cat Urn looks similar to Goblet Samsung galaxy is a kind of Cellphone Computer room is /has Modern Hallway is /has Narrow? Building facade is /has Check texture Trading floor is /has Crowded Umbrella looks similar to Ferris wheel Bonfire is found in Volcano Figure 6. Examples of extracted common sense relationships. Table 2. mAP performance for object detection on 15 categories. mAP Latent SVM (50 Google Images) Latent SVM (450 Google Images) 0.34 0.28 Latent SVM (450, Aspect Ratio Clustering) Latent SVM (450, HOG-based Clustering) Seed Detector (NEIL Clustering) Bootstrapping (without relationships) NEIL Detector NEIL Detector + Relationships 0.30 0.33 0.44 0.45 0.49 0.51 Acknowledgements: This research was supported by ONR MURI N000141010934 and a gift from Google. The authors would like to thank Tom Mitchell and David Fouhey for insightful discussions. We would also like to thank our computing clusters warp and workhorse for doing all the hard work! References [1] B. Alexe, T. Deselares, and V. Ferrari. What is an object? In TPAMI, 2010. 7 [2] T. Berg and D. Forsyth. Animals on the web. In CVPR, 2006. 2 [3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, 1998. 2 [4] A. Carlson, J. Betteridge, E. R. H. Jr., and T. M. Mitchell. Coupling semi-supervised learning of categories and relations. In NAACL HLT Workskop on SSL for NLP, 2009. 3 [5] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010. 2, 3, 6 [6] J. R. Curran, T. Murphy, and B. Scholz. Minimising semantic drift with mutual exclusion bootstrapping. In PacificAssociationfor Computational Linguistics, 2007. 2 [7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005. 6 [8] J. Deng, W. Dong, R. Socher, J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. 1, 2, 6 [9] S. Divvala, A. Efros, and M. Hebert. How important are ‘deformable parts’ in the deformable parts model? In ECCV, Parts and Attributes [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] Workshop, 2012. 3 C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes Paris look like Paris? SIGGRAPH, 2012. 4 S. Ebert, D. Larlus, and B. Schiele. Extracting structures in image collections for object recognition. In ECCV, 2010. 2 A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009. 3 P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. TPAMI, 2010. 6 R. Fergus, P. Perona, and A. Zisserman. A visual category filter for Google images. In ECCV, 2004. 2 R. Fergus, Y. Weiss, and A. Torralba. Semi-supervised learning in gigantic image collections. In NIPS. 2009. 2 B. Frey and D. Dueck. Clustering by passing messages between data points. Science, 2007. 4 M. Guillaumin, J. Verbeek, and C. Schmid. Multimodal semisupervised learning for image classification. In CVPR, 2010. 2 A. Gupta and L. S. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV, 2008. 3 B. Hariharan, J. Malik, and D. Ramanan. Discriminative decorrelation for clustering and classification. In ECCV. 2012. 4 S. Khan, F. Anwer, R. Muhammad, J. van de Weijer, A. Joost, M. Vanrell, and A. Lopez. Color attributes for object detection. In CVPR, 2012. 4, 6 D. Kuettel, M. Guillaumin, and V. Ferrari. Segmentation propagation in ImageNet. In ECCV, 2012. 3 C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009. 3 L.-J. Li, G. Wang, and L. Fei-Fei. OPTIMOL: Automatic object picture collection via incremental model learning. In CVPR, 2007. 2 D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004. 6 [25] A. Lucchi and J. Weston. Joint image and word sense discrimination for image retrieval. In ECCV, 2012. 3 [26] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. PAMI, 2004. 6 [27] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001. 6 [28] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011. 3 [29] G. Patterson and J. Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In CVPR, 2012. 3 [30] P. Perona. Visions of a Visipedia. Proceedings of IEEE, 2010. 1, 2 [3 1] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In ICCV, 2007. 3 [32] R. Raguram and S. Lazebnik. Computing iconic summaries of general visual concepts. In Workshop on Internet Vision, 2008. 4 [33] M. Rastegari, A. Farhadi, and D. Forsyth. Attribute discovery via predictable discriminative binary codes. In ECCV, 2012. 3 [34] F. Schroff, A. Criminisi, and A. Zisserman. Harvesting image databases from the web. In ICCV, 2007. 2 [35] V. Sharmanska, N. Quadrianto, and C. H. Lampert. Augmented attribute representations. In ECCV, 2012. 3 [36] A. Shrivastava, S. Singh, and A. Gupta. Constrained semi-supervised learning using attributes and comparative attributes. In ECCV, 2012. 2, 3 [37] B. Siddiquie and A. Gupta. Beyond active noun tagging: Modeling contextual interactions for multi-class active learning. In CVPR, 2010. 2 [38] E. Sudderth, A. Torralba, W. T. Freeman, and A. Wilsky. Learning hierarchical models of scenes, objects, and parts. In ICCV, 2005. 3 [39] S. Vijayanarasimhan and K. Grauman. Large-scale live active learning: Training object detectors with crawled data and crowds. In CVPR, 2011. 2 [40] L. von Ahn and L. Dabbish. Labeling images with a computer game. In SIGCHI, 2004. 2, 3 [41] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large scale scene recognition from abbey to zoo. In CVPR, 2010. 6 [42] X. Zhu. Semi-supervised learning literature survey. Technical report, CS, UW-Madison, 2005. 2 1416
same-paper 3 0.90615082 189 iccv-2013-HOGgles: Visualizing Object Detection Features
Author: Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, Antonio Torralba
Abstract: We introduce algorithms to visualize feature spaces used by object detectors. The tools in this paper allow a human to put on ‘HOG goggles ’ and perceive the visual world as a HOG based object detector sees it. We found that these visualizations allow us to analyze object detection systems in new ways and gain new insight into the detector’s failures. For example, when we visualize the features for high scoring false alarms, we discovered that, although they are clearly wrong in image space, they do look deceptively similar to true positives in feature space. This result suggests that many of these false alarms are caused by our choice of feature space, and indicates that creating a better learning algorithm or building bigger datasets is unlikely to correct these errors. By visualizing feature spaces, we can gain a more intuitive understanding of our detection systems.
4 0.89586335 152 iccv-2013-Extrinsic Camera Calibration without a Direct View Using Spherical Mirror
Author: Amit Agrawal
Abstract: We consider the problem of estimating the extrinsic parameters (pose) of a camera with respect to a reference 3D object without a direct view. Since the camera does not view the object directly, previous approaches have utilized reflections in a planar mirror to solve this problem. However, a planar mirror based approach requires a minimum of three reflections and has degenerate configurations where estimation fails. In this paper, we show that the pose can be obtained using a single reflection in a spherical mirror of known radius. This makes our approach simpler and easier in practice. In addition, unlike planar mirrors, the spherical mirror based approach does not have any degenerate configurations, leading to a robust algorithm. While a planar mirror reflection results in a virtual perspective camera, a spherical mirror reflection results in a non-perspective axial camera. The axial nature of rays allows us to compute the axis (direction of sphere center) and few pose parameters in a linear fashion. We then derive an analytical solution to obtain the distance to the sphere cen- ter and remaining pose parameters and show that it corresponds to solving a 16th degree equation. We present comparisons with a recent method that use planar mirrors and show that our approach recovers more accurate pose in the presence of noise. Extensive simulations and results on real data validate our algorithm.
5 0.88049763 433 iccv-2013-Understanding High-Level Semantics by Modeling Traffic Patterns
Author: Hongyi Zhang, Andreas Geiger, Raquel Urtasun
Abstract: In this paper, we are interested in understanding the semantics of outdoor scenes in the context of autonomous driving. Towards this goal, we propose a generative model of 3D urban scenes which is able to reason not only about the geometry and objects present in the scene, but also about the high-level semantics in the form of traffic patterns. We found that a small number of patterns is sufficient to model the vast majority of traffic scenes and show how these patterns can be learned. As evidenced by our experiments, this high-level reasoning significantly improves the overall scene estimation as well as the vehicle-to-lane association when compared to state-of-the-art approaches [10].
6 0.87295508 349 iccv-2013-Regionlets for Generic Object Detection
7 0.87254226 327 iccv-2013-Predicting an Object Location Using a Global Image Representation
8 0.87253332 126 iccv-2013-Dynamic Label Propagation for Semi-supervised Multi-class Multi-label Classification
9 0.87183362 384 iccv-2013-Semi-supervised Robust Dictionary Learning via Efficient l-Norms Minimization
10 0.87058663 328 iccv-2013-Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation
11 0.87008202 426 iccv-2013-Training Deformable Part Models with Decorrelated Features
12 0.86894029 137 iccv-2013-Efficient Salient Region Detection with Soft Image Abstraction
14 0.86811626 150 iccv-2013-Exemplar Cut
15 0.86756074 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps
16 0.86745834 194 iccv-2013-Heterogeneous Image Features Integration via Multi-modal Semi-supervised Learning Model
17 0.8674044 61 iccv-2013-Beyond Hard Negative Mining: Efficient Detector Learning via Block-Circulant Decomposition
18 0.86722952 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition
19 0.86716568 233 iccv-2013-Latent Task Adaptation with Large-Scale Hierarchies
20 0.86711854 75 iccv-2013-CoDeL: A Human Co-detection and Labeling Framework