iccv iccv2013 iccv2013-327 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jose A. Rodriguez Serrano, Diane Larlus
Abstract: We tackle the detection of prominent objects in images as a retrieval task: given a global image descriptor, we find the most similar images in an annotated dataset, and transfer the object bounding boxes. We refer to this approach as data driven detection (DDD), that is an alternative to sliding windows. Previous works have used similar notions but with task-independent similarities and representations, i.e. they were not tailored to the end-goal of localization. This article proposes two contributions: (i) a metric learning algorithm and (ii) a representation of images as object probability maps, that are both optimized for detection. We show experimentally that these two contributions are crucial to DDD, do not require costly additional operations, and in some cases yield comparable or better results than state-of-the-art detectors despite conceptual simplicity and increased speed. As an application of prominent object detection, we improve fine-grained categorization by precropping images with the proposed approach.
Reference: text
sentIndex sentText sentNum sentScore
1 We tackle the detection of prominent objects in images as a retrieval task: given a global image descriptor, we find the most similar images in an annotated dataset, and transfer the object bounding boxes. [sent-6, score-0.677]
2 We refer to this approach as data driven detection (DDD), that is an alternative to sliding windows. [sent-7, score-0.29]
3 This article proposes two contributions: (i) a metric learning algorithm and (ii) a representation of images as object probability maps, that are both optimized for detection. [sent-11, score-0.354]
4 As an application of prominent object detection, we improve fine-grained categorization by precropping images with the proposed approach. [sent-13, score-0.3]
5 Introduction This paper deals with the problem of prominent object detection, where the goal is to predict the region containing the relevant subject (or object of interest) in an image, as opposed to other regions containing background or nonrelevant objects. [sent-15, score-0.3]
6 The localization of the prominent object is a key cue that can be used to improve this difficult Figure1. [sent-26, score-0.27]
7 A query image is compared to an annotated set, and the nearest neighbors transfer their annotations. [sent-28, score-0.292]
8 For this retrieval process, we use supervision to learn an image representation and a metric geared toward detection. [sent-29, score-0.254]
9 As these examples show, the definition of prominent or relevant object is application-dependent, and we assume this is defined by a training set with annotated bounding boxes. [sent-31, score-0.354]
10 Research in detection has converged to combining a template descriptor (e. [sent-33, score-0.168]
11 HOG [6] or its extension to deformable parts [10]) with sliding windows. [sent-35, score-0.197]
12 Although methods for accelerating sliding window search have been proposed, a large number of window descriptors have to be extracted and classified independently. [sent-37, score-0.375]
13 In this paper, we aim at an efficient solution, and propose to extract a single global feature for the input image and to estimate the prominent object location directly from this feature vector, avoiding sliding windows. [sent-38, score-0.48]
14 This suggests a simple retrieval-based method for prominent object localization: given an input image, find the nearest images from a database (using the global descriptor), and transfer the bounding box of the most similar image. [sent-40, score-0.49]
15 First, detection is performed with the ease and efficiency of a retrieval operation. [sent-42, score-0.211]
16 Second, it allows handling any object shape and does not rely on the rectangular and rigid object assumption of the sliding-window approaches. [sent-43, score-0.154]
17 Finally, as detection is obtained using a global image descriptor, it intrinsically leverages context for detection. [sent-44, score-0.163]
18 We apply metric learning to enforce image pairs with high overlap between detection rectangles to be more similar than images with no (or small) overlap. [sent-52, score-0.336]
19 We propose to use a compact image representation that is constructed from the probability that each image patch belongs to the object. [sent-54, score-0.242]
20 This requires some knowledge about the object to locate, but allows representing the image as a probability map of the object location. [sent-55, score-0.235]
21 Since there exists a strong correlation between the true detection and such a probability map, these features are well-suited to data-driven detection. [sent-56, score-0.246]
22 Both contributions use supervision to connect the features and similarity to the detection task, by converting an initial assumption (similar images tend to have similar layouts) into an actual optimization step (we learn what makes two layouts similar). [sent-57, score-0.252]
23 Finally, we show in fine-grained categorization experiments that we improve classification accuracy by concentrating on the region predicted by a DDD. [sent-64, score-0.151]
24 De-facto standard detection methods [6, 10] combine a template representation and a sliding window approach. [sent-76, score-0.419]
25 Additional abundant work on detection has been published but the most successful methods cast detection as a classification problem: a large number of classification operations are performed (each possible sub-window is classified as containing the object or not). [sent-80, score-0.444]
26 In contrast, in this paper we would like to take a data-driven approach, and cast detection as a retrieval problem. [sent-81, score-0.211]
27 It differs from DDD in spirit (it is not application dependent) and in practice (it is a sliding window approach). [sent-85, score-0.244]
28 The concept of transferring bounding boxes (or pixel-level masks) from the nearest neighbors of an image has been successfully exploited in previous work. [sent-87, score-0.28]
29 Two main strategies exist: transfer at subwindow level and transfer at full-image level. [sent-88, score-0.16]
30 In the first strategy, approaches still perform sliding window search and each sub-window is used as a query for which nearest-neighbor similarity is computed [31, 22]. [sent-89, score-0.35]
31 A similar case is the figure-ground segmentation of [15] where sliding window search is replaced by an objectness detector [2]. [sent-90, score-0.274]
32 Although these works clearly have a data-driven component that is key to their success (the scoring of subwindows), they still compare all sub-windows of an image against a database, and pay a complexity price that is at least as big as the one paid by sliding window approaches. [sent-91, score-0.36]
33 All these works boil down to the same principle: find the nearest neighbors of the input image, and feed their annotations to a more complex method. [sent-93, score-0.177]
34 Similarly, [27] uses the neighbors to induce priors over object classes and bounding box locations in a graphical model. [sent-95, score-0.252]
35 We observe that for the second strategy, transferring the labels of the neighbors is crucial to guide the algorithm, but as the retrieval step is based on task-independent features and similarities, this is not sufficient. [sent-96, score-0.22]
36 In [15] a global neighbor transfer baseline produces poor results in a multi-object binary segmentation task. [sent-98, score-0.165]
37 Data-driven detection baseline Our goal is to infer the bounding box of the prominent object from the global feature vector of the image, using a training set with annotated bounding boxes. [sent-101, score-0.675]
38 Task-aware metric Our first contribution is a similarity learning algorithm that optimizes a detection criterion. [sent-127, score-0.281]
39 We are not aware of previous works applying metric learning on object location labels. [sent-130, score-0.206]
40 For object localization, a common similarity is the overlap score, defined as the intersection-over-union area ratio (e. [sent-136, score-0.163]
41 More precisely, we consider a similarity function which augments the dot product as: $k_W(q, x) = q^\top W x$. [sent-145, score-0.143]
42 Task-aware representation: patch-level object classifiers are used to represent query images by probability maps. [sent-165, score-0.228]
43 Annotations of training images with similar probability maps are transferred to solve detection. [sent-166, score-0.215]
44 Task-aware representation The second contribution is an improved representation built using supervised information of the detection task. [sent-195, score-0.219]
45 More precisely, we propose to build a “probability map” indicating probabilities that a certain pixel contains the prominent object. [sent-196, score-0.18]
46 This is based on a patch-level classifier that has been pre-trained to distinguish between patches from prominent objects and from the rest of the image. [sent-197, score-0.219]
47 We highlight that probability maps constitute responses of local patch classifiers, which are noisy and smooth, and that the response map of explicit object detectors (e. [sent-198, score-0.358]
48 However, object banks suffer from the same limitations as sliding-window based approaches, while the probability map is fast to compute (just one extra dot product on top of the patch encoding). [sent-201, score-0.392]
49 Probability maps have been used as an input for object segmentation, to classify super-pixels [5], as the unary potential of a random field [17], or with auto-context algorithms [32]. [sent-202, score-0.195]
50 In our case, we use probability maps directly as an image representation within data-driven detection. [sent-203, score-0.259]
51 Patches are extracted densely and at multiple scales within images of the training set, and are associated to a binary label depending on their degree of overlap with the annotated object region. [sent-205, score-0.149]
52 Patch-level descriptors and their labels are used to train a linear SVM classifier which assigns each patch to the “object” or “background” class. [sent-208, score-0.168]
53 These scores are transformed into probabilities at the pixel level [5], yielding probability maps (see Fig. [sent-214, score-0.215]
54 To make them comparable, the probability maps are resized to a small fixed-size image (50x50 pixels). [sent-217, score-0.256]
55 First, by nature, this representation captures information about the end task, so images having similar representations are more likely to have similar detection annotations. [sent-221, score-0.225]
56 This means that, despite the extra cost at training time to learn an object classifier, and the small constant cost at test time to compute the map, the smaller dimension of this representation makes retrieval and consequently detection much faster. [sent-223, score-0.315]
57 Finally, one can combine several low-level cues (for instance SIFT and color) without increasing the final dimension of the representation by averaging maps computed using different channels [5]. [sent-226, score-0.192]
58 Since probability maps are a strong cue for object location, one may wonder why not directly use a sliding window over the probability map. [sent-227, score-0.549]
59 Intuitively, patch level classifiers are far from perfect, but we expect classifier errors to be consistent between similar training and query images. [sent-229, score-0.175]
60 Therefore, even for inaccurate probability maps, the closest maps in the database can help transferring the object location reliably. [sent-230, score-0.377]
61 In all experiments, the number of neighbors L is determined from the validation set. [sent-245, score-0.143]
62 For our task-aware features, the probability maps are built from FVs computed at patch level as in [5]. [sent-249, score-0.298]
63 This essentially follows the same process as explained above but computing one FV per patch instead of aggregating the patch contributions. [sent-250, score-0.166]
64 The probability maps using FVs computed from low-level SIFT descriptors are denoted PM-SIFT. [sent-255, score-0.309]
65 We also consider color statistics [25] as low-level descriptors (PM-COL) and the average of both maps (PM-SIFT+COL). [sent-256, score-0.146]
66 While the database annotations are at the level of body joint locations (knees, elbows, neck, etc.), it has been designed so that there is one prominent person (i. [sent-262, score-0.18]
67 We would like to evaluate the prominent subject detection task on these challenging images. [sent-265, score-0.311]
68 To this end, we transformed the annotation, and obtained ground-truth rectangles by taking the bounding rectangle of the body joint annotations. [sent-266, score-0.206]
69 The quality of the detection is evaluated using the overlap score of Eq. [sent-268, score-0.181]
70 SIFT), the task-aware representation PM-SIFT still compares favorably to the baseline (+5. [sent-274, score-0.169]
71 We also combine probability maps with a sliding window (SW) process. [sent-284, score-0.459]
72 The rectangle with the best density score (density inside the rectangle minus density outside) is kept. [sent-285, score-0.146]
73 Comparison of DDD for FV (and different spatial pooling $G = G_W \times G_H$) and PM (for different resized map sizes) with and without metric learning. [sent-288, score-0.168]
74 achieves competitive results (confirming that probability maps are powerful representations), but is outperformed by our best DDD strategies. [sent-289, score-0.215]
75 The previous results use SIFT as the low-level descriptor in the probability maps to be comparable to the DPM and DDD baselines that are based on gradients and do not use color. [sent-302, score-0.334]
76 When combining low-level SIFT and color descriptors in the probability maps (PM-SIFT+COL) and using metric learning, we obtain the best results with 44. [sent-304, score-0.358]
77 ImageNet dogs is a dataset of dog images used for fine-grained classification purposes [1]. [sent-310, score-0.316]
78 Here, the dog is the prominent object, and we measure the dog detection task, and the effect of our task-aware method on the final classification accuracy. [sent-311, score-0.596]
79 The dataset is composed of 120 different breeds of dogs, making the detection challenging (see Fig. [sent-312, score-0.175]
80 As the test annotations are not available, we have split the validation set into 1,000 images that we use as an actual validation set to find the best parameters, and the remaining 5,000 images are used as our test set. [sent-315, score-0.169]
81 We used a generic dog classifier in the task-aware representation, trained using all 120 types of annotations. [sent-318, score-0.151]
82 The sliding window (PM-SIFT+SW) is on par with the FV-SIFT. [sent-339, score-0.285]
83 We found it tends to fail for dogs in “non-canonical poses”. [sent-343, score-0.143]
84 One could also ask whether detection is trivial just because dogs seem centered. [sent-346, score-0.274]
85 We assume that the object location is available at train time (as for DDD), and we train classifiers over the 120 breeds of dogs using the cropped images. [sent-357, score-0.296]
86 At test time, we use the bounding boxes predicted by our system to crop images, and classify the cropped region. [sent-358, score-0.172]
87 Classification results are reported in Table 3, and compared to (i) the classification of full images, and (ii) cropping using the ground-truth detections (to measure how far we are from the classification figure of a perfect detection). [sent-363, score-0.213]
88 We show detection results for generic bird detectors (the bird being the prominent object of each image). [sent-370, score-0.515]
89 The overlap threshold to compute precision is set to 70% for the same reason as in the dog set. [sent-372, score-0.162]
90 For the same reason as in the dogs dataset, we concentrate on the task-aware representations. [sent-376, score-0.143]
91 In this set, probability maps based on color (PM-COL) yield better results than those based on SIFT (PM-SIFT), probably due to the colorful nature of birds. [sent-377, score-0.215]
92 The sliding window (PM-SIFT+SW) yields the same behavior as in the dogs set. [sent-387, score-0.387]
93 We also conduct a fine-grained classification experiment on this set with the same protocol as before, and measure the impact of detection on the final classification accuracy. [sent-389, score-0.253]
94 As expected, detection has an impact on fine-grained classification (increase of > 13%). [sent-393, score-0.192]
95 But if we inspect the detection precision at 20% overlap (which corresponds to estimating the object location roughly), the DPM is at 95. [sent-398, score-0.29]
96 Note the difference in classification accuracy between perfect detection and DPM detection is 4. [sent-416, score-0.362]
97 Since the proposed method still reduces to finding nearest neighbors at test time using a single feature vector per image, and the two contributions significantly reduce the dimension of the representation, we avoid sliding window search, and the retrieval process is fast (about 200ms per image). [sent-420, score-0.444]
98 DDD compares favorably to a state-of-the-art sliding window approach in the presence of non-rigid objects. [sent-421, score-0.316]
99 It compares less favorably for rigid objects (such as birds) but appears to be good enough as a pre-cropping method for fine-grained classification. [sent-422, score-0.167]
100 Mobile product image search by automatic query object extraction. [sent-627, score-0.147]
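The retrieval-and-transfer idea described in the extracted sentences above (overlap score, similarity k_W(q, x) = q^T W x, and transfer of the nearest neighbors' boxes) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the (x1, y1, x2, y2) box format, the use of boxes normalized to the image size, and the plain averaging of the L nearest boxes are all assumptions made for illustration. How W is learned is not shown; it is simply passed in.

```python
import numpy as np


def overlap(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def similarity(q, X, W=None):
    """Return q^T W x for every row x of X (plain dot product if W is None)."""
    if W is None:
        return X @ q
    # x^T (W^T q) equals q^T W x, so one matrix-vector product covers all rows.
    return X @ (W.T @ q)


def transfer_box(q, X, boxes, W=None, num_neighbors=1):
    """Predict a box for query descriptor q from an annotated set (X, boxes).

    Assumption: the boxes of the num_neighbors most similar images are
    averaged; with num_neighbors=1 this is the box of the nearest image.
    """
    scores = similarity(q, X, W)
    top = np.argsort(-scores)[:num_neighbors]  # indices by decreasing similarity
    return boxes[top].mean(axis=0)
```

With `num_neighbors=1` the sketch reduces to the simplest variant mentioned in the text, transferring the bounding box of the most similar image. The `overlap` helper is the kind of score that would be used to compare a transferred box against a ground-truth annotation.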
wordName wordTfidf (topN-words)
[('ddd', 0.584), ('ift', 0.243), ('prominent', 0.18), ('sliding', 0.159), ('dpm', 0.147), ('dogs', 0.143), ('ijk', 0.143), ('detection', 0.131), ('kw', 0.122), ('probability', 0.115), ('dog', 0.112), ('maps', 0.1), ('metric', 0.097), ('fv', 0.094), ('neighbors', 0.087), ('window', 0.085), ('patch', 0.083), ('birds', 0.083), ('transfer', 0.08), ('retrieval', 0.08), ('col', 0.078), ('lsp', 0.076), ('bounding', 0.075), ('rectangle', 0.073), ('bird', 0.072), ('ri', 0.069), ('classification', 0.061), ('triplets', 0.06), ('object', 0.06), ('categorization', 0.06), ('ku', 0.059), ('rectangles', 0.058), ('annotations', 0.057), ('lijk', 0.057), ('taskindependent', 0.057), ('validation', 0.056), ('dot', 0.056), ('fisher', 0.055), ('baseline', 0.053), ('query', 0.053), ('transferring', 0.053), ('similarity', 0.053), ('similarities', 0.052), ('cropping', 0.052), ('sift', 0.051), ('overlap', 0.05), ('dim', 0.05), ('representations', 0.05), ('location', 0.049), ('imagenet', 0.048), ('lowlevel', 0.048), ('unreasonable', 0.047), ('descriptors', 0.046), ('representation', 0.044), ('banks', 0.044), ('fvs', 0.044), ('breeds', 0.044), ('leeds', 0.044), ('pascal', 0.044), ('sw', 0.044), ('resized', 0.041), ('big', 0.041), ('par', 0.041), ('pay', 0.04), ('favorably', 0.04), ('classifier', 0.039), ('annotated', 0.039), ('rd', 0.039), ('perfect', 0.039), ('costly', 0.039), ('pm', 0.039), ('deformable', 0.038), ('article', 0.038), ('descriptor', 0.037), ('dimensionality', 0.036), ('classify', 0.035), ('layouts', 0.035), ('paid', 0.035), ('russell', 0.034), ('confirming', 0.034), ('product', 0.034), ('baselines', 0.034), ('rigid', 0.034), ('supervision', 0.033), ('species', 0.033), ('nearest', 0.033), ('boxes', 0.032), ('compares', 0.032), ('triplet', 0.032), ('ij', 0.032), ('global', 0.032), ('anchez', 0.031), ('loss', 0.031), ('predicted', 0.03), ('objectness', 0.03), ('pooling', 0.03), ('box', 0.03), ('strategy', 0.03), ('localization', 0.03), ('acts', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 327 iccv-2013-Predicting an Object Location Using a Global Image Representation
2 0.20294988 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction
Author: Ning Zhang, Ryan Farrell, Forrest Iandola, Trevor Darrell
Abstract: Recognizing objects in fine-grained domains can be extremely challenging due to the subtle differences between subcategories. Discriminative markings are often highly localized, leading traditional object recognition approaches to struggle with the large pose variation often present in these domains. Pose-normalization seeks to align training exemplars, either piecewise by part or globally for the whole object, effectively factoring out differences in pose and in viewing angle. Prior approaches relied on computationally-expensive filter ensembles for part localization and required extensive supervision. This paper proposes two pose-normalized descriptors based on computationally-efficient deformable part models. The first leverages the semantics inherent in strongly-supervised DPM parts. The second exploits weak semantic annotations to learn cross-component correspondences, computing pose-normalized descriptors from the latent parts of a weakly-supervised DPM. These representations enable pooling across pose and viewpoint, in turn facilitating tasks such as fine-grained recognition and attribute prediction. Experiments conducted on the Caltech-UCSD Birds 200 dataset and Berkeley Human Attribute dataset demonstrate significant improvements over state-of-art algorithms.
3 0.17202625 377 iccv-2013-Segmentation Driven Object Detection with Fisher Vectors
Author: Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid
Abstract: We present an object detection system based on the Fisher vector (FV) image representation computed over SIFT and color descriptors. For computational and storage efficiency, we use a recent segmentation-based method to generate class-independent object detection hypotheses, in combination with data compression techniques. Our main contribution is a method to produce tentative object segmentation masks to suppress background clutter in the features. Re-weighting the local image features based on these masks is shown to improve object detection significantly. We also exploit contextual features in the form of a full-image FV descriptor, and an inter-category rescoring mechanism. Our experiments on the PASCAL VOC 2007 and 2010 datasets show that our detector improves over the current state-of-the-art detection results.
4 0.16450244 411 iccv-2013-Symbiotic Segmentation and Part Localization for Fine-Grained Categorization
Author: Yuning Chai, Victor Lempitsky, Andrew Zisserman
Abstract: We propose a new method for the task of fine-grained visual categorization. The method builds a model of the base-level category that can be fitted to images, producing high-quality foreground segmentation and mid-level part localizations. The model can be learnt from the typical datasets available for fine-grained categorization, where the only annotation provided is a loose bounding box around the instance (e.g. bird) in each image. Both segmentation and part localizations are then used to encode the image content into a highly-discriminative visual signature. The model is symbiotic in that part discovery/localization is helped by segmentation and, conversely, the segmentation is helped by the detection (e.g. part layout). Our model builds on top of the part-based object category detector of Felzenszwalb et al., and also on the powerful GrabCut segmentation algorithm of Rother et al., and adds a simple spatial saliency coupling between them. In our evaluation, the model improves the categorization accuracy over the state-of-the-art. It also improves over what can be achieved with an analogous system that runs segmentation and part-localization independently.
5 0.14259429 169 iccv-2013-Fine-Grained Categorization by Alignments
Author: E. Gavves, B. Fernando, C.G.M. Snoek, A.W.M. Smeulders, T. Tuytelaars
Abstract: The aim of this paper is fine-grained categorization without human interaction. Different from prior work, which relies on detectors for specific object parts, we propose to localize distinctive details by roughly aligning the objects using just the overall shape, since implicit to fine-grained categorization is the existence of a super-class shape shared among all classes. The alignments are then used to transfer part annotations from training images to test images (supervised alignment), or to blindly yet consistently segment the object in a number of regions (unsupervised alignment). We furthermore argue that in the distinction of fine-grained sub-categories, classification-oriented encodings like Fisher vectors are better suited for describing localized information than popular matching oriented features like HOG. We evaluate the method on the CU-2011 Birds and Stanford Dogs fine-grained datasets, outperforming the state-of-the-art.
6 0.14155956 379 iccv-2013-Semantic Segmentation without Annotating Segments
7 0.11996345 233 iccv-2013-Latent Task Adaptation with Large-Scale Hierarchies
8 0.11907513 111 iccv-2013-Detecting Dynamic Objects with Multi-view Background Subtraction
9 0.10395043 62 iccv-2013-Bird Part Localization Using Exemplar-Based Models with Enforced Pose and Subcategory Consistency
10 0.10239962 104 iccv-2013-Decomposing Bag of Words Histograms
11 0.10063899 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
12 0.098852523 337 iccv-2013-Random Grids: Fast Approximate Nearest Neighbors and Range Searching for Image Search
13 0.097511321 426 iccv-2013-Training Deformable Part Models with Decorrelated Features
14 0.097419739 198 iccv-2013-Hierarchical Part Matching for Fine-Grained Visual Categorization
15 0.094666429 349 iccv-2013-Regionlets for Generic Object Detection
16 0.093568988 431 iccv-2013-Unbiased Metric Learning: On the Utilization of Multiple Datasets and Web Images for Softening Bias
17 0.091964513 269 iccv-2013-Modeling Occlusion by Discriminative AND-OR Structures
18 0.091595814 378 iccv-2013-Semantic-Aware Co-indexing for Image Retrieval
19 0.090159379 403 iccv-2013-Strong Appearance and Expressive Spatial Models for Human Pose Estimation
20 0.088011935 121 iccv-2013-Discriminatively Trained Templates for 3D Object Detection: A Real Time Scalable Approach
topicId topicWeight
[(0, 0.256), (1, 0.053), (2, 0.017), (3, -0.098), (4, 0.094), (5, 0.03), (6, -0.041), (7, 0.02), (8, -0.087), (9, -0.067), (10, 0.104), (11, 0.025), (12, -0.025), (13, -0.091), (14, -0.015), (15, -0.058), (16, 0.049), (17, 0.017), (18, 0.1), (19, -0.012), (20, 0.011), (21, 0.024), (22, -0.016), (23, 0.04), (24, -0.002), (25, 0.13), (26, 0.005), (27, -0.047), (28, 0.035), (29, 0.049), (30, 0.026), (31, -0.099), (32, -0.033), (33, -0.015), (34, -0.034), (35, -0.041), (36, 0.083), (37, -0.038), (38, -0.021), (39, 0.012), (40, 0.005), (41, 0.03), (42, -0.005), (43, 0.018), (44, 0.039), (45, -0.029), (46, 0.055), (47, -0.012), (48, 0.085), (49, -0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.94322622 327 iccv-2013-Predicting an Object Location Using a Global Image Representation
2 0.84080023 109 iccv-2013-Detecting Avocados to Zucchinis: What Have We Done, and Where Are We Going?
Author: Olga Russakovsky, Jia Deng, Zhiheng Huang, Alexander C. Berg, Li Fei-Fei
Abstract: The growth of detection datasets and the multiple directions of object detection research provide both an unprecedented need and a great opportunity for a thorough evaluation of the current state of the field of categorical object detection. In this paper we strive to answer two key questions. First, where are we currently as a field: what have we done right, what still needs to be improved? Second, where should we be going in designing the next generation of object detectors? Inspired by the recent work of Hoiem et al. [10] on the standard PASCAL VOC detection dataset, we perform a large-scale study on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) data. First, we quantitatively demonstrate that this dataset provides many of the same detection challenges as the PASCAL VOC. Due to its scale of 1000 object categories, ILSVRC also provides an excellent testbed for understanding the performance of detectors as a function of several key properties of the object classes. We conduct a series of analyses looking at how different detection methods perform on a number of image-level and object-class-level properties such as texture, color, deformation, and clutter. We learn important lessons of the current object detection methods and propose a number of insights for designing the next generation object detectors.
3 0.83278948 169 iccv-2013-Fine-Grained Categorization by Alignments
Author: E. Gavves, B. Fernando, C.G.M. Snoek, A.W.M. Smeulders, T. Tuytelaars
Abstract: The aim of this paper is fine-grained categorization without human interaction. Different from prior work, which relies on detectors for specific object parts, we propose to localize distinctive details by roughly aligning the objects using just the overall shape, since implicit to fine-grained categorization is the existence of a super-class shape shared among all classes. The alignments are then used to transfer part annotations from training images to test images (supervised alignment), or to blindly yet consistently segment the object in a number of regions (unsupervised alignment). We furthermore argue that in the distinction of fine-grained sub-categories, classification-oriented encodings like Fisher vectors are better suited for describing localized information than popular matching oriented features like HOG. We evaluate the method on the CU-2011 Birds and Stanford Dogs fine-grained datasets, outperforming the state-of-the-art.
4 0.81008464 349 iccv-2013-Regionlets for Generic Object Detection
Author: Xiaoyu Wang, Ming Yang, Shenghuo Zhu, Yuanqing Lin
Abstract: Generic object detection is confronted by dealing with different degrees of variations in distinct object classes with tractable computations, which demands for descriptive and flexible object representations that are also efficient to evaluate for many locations. In view of this, we propose to model an object class by a cascaded boosting classifier which integrates various types of features from competing local regions, named as regionlets. A regionlet is a base feature extraction region defined proportionally to a detection window at an arbitrary resolution (i.e. size and aspect ratio). These regionlets are organized in small groups with stable relative positions to delineate fine-grained spatial layouts inside objects. Their features are aggregated to a one-dimensional feature within one group so as to tolerate deformations. Then we evaluate the object bounding box proposal in selective search from segmentation cues, limiting the evaluation locations to thousands. Our approach significantly outperforms the state-of-the-art on popular multi-class detection benchmark datasets with a single method, without any contexts. It achieves the detection mean average precision of 41.7% on the PASCAL VOC 2007 dataset and 39.7% on the VOC 2010 for 20 object categories. It achieves 14.7% mean average precision on the ImageNet dataset for 200 object categories, outperforming the latest deformable part-based model (DPM) by 4.7%.
5 0.79994845 377 iccv-2013-Segmentation Driven Object Detection with Fisher Vectors
Author: Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid
Abstract: We present an object detection system based on the Fisher vector (FV) image representation computed over SIFT and color descriptors. For computational and storage efficiency, we use a recent segmentation-based method to generate class-independent object detection hypotheses, in combination with data compression techniques. Our main contribution is a method to produce tentative object segmentation masks to suppress background clutter in the features. Re-weighting the local image features based on these masks is shown to improve object detection significantly. We also exploit contextual features in the form of a full-image FV descriptor, and an inter-category rescoring mechanism. Our experiments on the PASCAL VOC 2007 and 2010 datasets show that our detector improves over the current state-of-the-art detection results.
6 0.79535228 198 iccv-2013-Hierarchical Part Matching for Fine-Grained Visual Categorization
7 0.77548337 104 iccv-2013-Decomposing Bag of Words Histograms
8 0.77262586 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction
9 0.76818222 77 iccv-2013-Codemaps - Segment, Classify and Search Objects Locally
10 0.75435317 411 iccv-2013-Symbiotic Segmentation and Part Localization for Fine-Grained Categorization
11 0.74925584 202 iccv-2013-How Do You Tell a Blackbird from a Crow?
12 0.73490512 179 iccv-2013-From Subcategories to Visual Composites: A Multi-level Framework for Object Detection
13 0.71953148 406 iccv-2013-Style-Aware Mid-level Representation for Discovering Visual Connections in Space and Time
14 0.70946783 193 iccv-2013-Heterogeneous Auto-similarities of Characteristics (HASC): Exploiting Relational Information for Classification
15 0.69682306 189 iccv-2013-HOGgles: Visualizing Object Detection Features
16 0.68446594 390 iccv-2013-Shufflets: Shared Mid-level Parts for Fast Object Detection
18 0.65785068 285 iccv-2013-NEIL: Extracting Visual Knowledge from Web Data
19 0.65367556 236 iccv-2013-Learning Discriminative Part Detectors for Image Classification and Cosegmentation
20 0.64775884 426 iccv-2013-Training Deformable Part Models with Decorrelated Features
topicId topicWeight
[(0, 0.134), (2, 0.089), (7, 0.025), (12, 0.016), (13, 0.025), (26, 0.097), (31, 0.075), (34, 0.028), (35, 0.02), (42, 0.121), (48, 0.011), (64, 0.047), (73, 0.035), (89, 0.176), (98, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.88772726 327 iccv-2013-Predicting an Object Location Using a Global Image Representation
Author: Jose A. Rodriguez Serrano, Diane Larlus
Abstract: We tackle the detection of prominent objects in images as a retrieval task: given a global image descriptor, we find the most similar images in an annotated dataset, and transfer the object bounding boxes. We refer to this approach as data driven detection (DDD), that is an alternative to sliding windows. Previous works have used similar notions but with task-independent similarities and representations, i.e. they were not tailored to the end-goal of localization. This article proposes two contributions: (i) a metric learning algorithm and (ii) a representation of images as object probability maps, that are both optimized for detection. We show experimentally that these two contributions are crucial to DDD, do not require costly additional operations, and in some cases yield comparable or better results than state-of-the-art detectors despite conceptual simplicity and increased speed. As an application of prominent object detection, we improve fine-grained categorization by precropping images with the proposed approach.
2 0.88140422 21 iccv-2013-A Method of Perceptual-Based Shape Decomposition
Author: Chang Ma, Zhongqian Dong, Tingting Jiang, Yizhou Wang, Wen Gao
Abstract: In this paper, we propose a novel perception-based shape decomposition method which aims to decompose a shape into semantically meaningful parts. In addition to three popular perception rules (the Minima rule, the Short-cut rule and the Convexity rule) in shape decomposition, we propose a new rule named part-similarity rule to encourage consistent partition of similar parts. The problem is formulated as a quadratically constrained quadratic program (QCQP) problem and is solved by a trust-region method. Experiment results on MPEG-7 dataset show that we can get a more consistent shape decomposition with human perception compared with other state-of-the-art methods both qualitatively and quantitatively. Finally, we show the advantage of semantic parts over non-meaningful parts in object detection on the ETHZ dataset.
3 0.87423521 180 iccv-2013-From Where and How to What We See
Author: S. Karthikeyan, Vignesh Jagadeesh, Renuka Shenoy, Miguel Eckstein, B.S. Manjunath
Abstract: Eye movement studies have confirmed that overt attention is highly biased towards faces and text regions in images. In this paper we explore a novel problem of predicting face and text regions in images using eye tracking data from multiple subjects. The problem is challenging as we aim to predict the semantics (face/text/background) only from eye tracking data without utilizing any image information. The proposed algorithm spatially clusters eye tracking data obtained in an image into different coherent groups and subsequently models the likelihood of the clusters containing faces and text using a fully connected Markov Random Field (MRF). Given the eye tracking data from a test image, it predicts potential face/head (humans, dogs and cats) and text locations reliably. Furthermore, the approach can be used to select regions of interest for further analysis by object detectors for faces and text. The hybrid eye position/object detector approach achieves better detection performance and reduced computation time compared to using only the object detection algorithm. We also present a new eye tracking dataset on 300 images selected from ICDAR, Street-view, Flickr and Oxford-IIIT Pet Dataset from 15 subjects.
4 0.86830914 137 iccv-2013-Efficient Salient Region Detection with Soft Image Abstraction
Author: Ming-Ming Cheng, Jonathan Warrell, Wen-Yan Lin, Shuai Zheng, Vibhav Vineet, Nigel Crook
Abstract: Detecting visually salient regions in images is one of the fundamental problems in computer vision. We propose a novel method to decompose an image into large scale perceptually homogeneous elements for efficient salient region detection, using a soft image abstraction representation. By considering both appearance similarity and spatial distribution of image pixels, the proposed representation abstracts out unnecessary image details, allowing the assignment of comparable saliency values across similar regions, and producing perceptually accurate salient region detection. We evaluate our salient region detection approach on the largest publicly available dataset with pixel accurate annotations. The experimental results show that the proposed method outperforms 18 alternate methods, reducing the mean absolute error by 25.2% compared to the previous best result, while being computationally more efficient.
5 0.86353636 80 iccv-2013-Collaborative Active Learning of a Kernel Machine Ensemble for Recognition
Author: Gang Hua, Chengjiang Long, Ming Yang, Yan Gao
Abstract: Active learning is an effective way of engaging users to interactively train models for visual recognition. The vast majority of previous works, if not all of them, focused on active learning with a single human oracle. The problem of active learning with multiple oracles in a collaborative setting has not been well explored. Moreover, most of the previous works assume that the labels provided by the human oracles are noise free, which may often be violated in reality. We present a collaborative computational model for active learning with multiple human oracles. It leads to not only an ensemble kernel machine that is robust to label noises, but also a principled label quality measure to online detect irresponsible labelers. Instead of running independent active learning processes for each individual human oracle, our model captures the inherent correlations among the labelers through shared data among them. Our simulation experiments and experiments with real crowd-sourced noisy labels demonstrated the efficacy of our model.
6 0.86273974 349 iccv-2013-Regionlets for Generic Object Detection
7 0.86192662 427 iccv-2013-Transfer Feature Learning with Joint Distribution Adaptation
8 0.86189044 328 iccv-2013-Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation
9 0.86153197 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection
10 0.86033577 126 iccv-2013-Dynamic Label Propagation for Semi-supervised Multi-class Multi-label Classification
11 0.8600105 384 iccv-2013-Semi-supervised Robust Dictionary Learning via Efficient l-Norms Minimization
12 0.85894465 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition
13 0.85879493 54 iccv-2013-Attribute Pivots for Guiding Relevance Feedback in Image Search
14 0.85878289 156 iccv-2013-Fast Direct Super-Resolution by Simple Functions
15 0.85804081 150 iccv-2013-Exemplar Cut
16 0.85775805 326 iccv-2013-Predicting Sufficient Annotation Strength for Interactive Foreground Segmentation
17 0.85773444 73 iccv-2013-Class-Specific Simplex-Latent Dirichlet Allocation for Image Classification
18 0.85726786 95 iccv-2013-Cosegmentation and Cosketch by Unsupervised Learning
19 0.85659909 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
20 0.85536492 245 iccv-2013-Learning a Dictionary of Shape Epitomes with Applications to Image Labeling