iccv iccv2013 iccv2013-377 knowledge-graph by maker-knowledge-mining

377 iccv-2013-Segmentation Driven Object Detection with Fisher Vectors


Source: pdf

Author: Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid

Abstract: We present an object detection system based on the Fisher vector (FV) image representation computed over SIFT and color descriptors. For computational and storage efficiency, we use a recent segmentation-based method to generate class-independent object detection hypotheses, in combination with data compression techniques. Our main contribution is a method to produce tentative object segmentation masks to suppress background clutter in the features. Re-weighting the local image features based on these masks is shown to improve object detection significantly. We also exploit contextual features in the form of a full-image FV descriptor, and an inter-category rescoring mechanism. Our experiments on the PASCAL VOC 2007 and 2010 datasets show that our detector improves over the current state-of-the-art detection results.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 For computational and storage efficiency, we use a recent segmentation-based method to generate class-independent object detection hypotheses, in combination with data compression techniques. [sent-4, score-0.317]

2 Our main contribution is a method to produce tentative object segmentation masks to suppress background clutter in the features. [sent-5, score-0.622]

3 Re-weighting the local image features based on these masks is shown to improve object detection significantly. [sent-6, score-0.417]

4 We also exploit contextual features in the form of a full-image FV descriptor, and an inter-category rescoring mechanism. [sent-7, score-0.385]

5 Our experiments on the PASCAL VOC 2007 and 2010 datasets show that our detector improves over the current state-of-the-art detection results. [sent-8, score-0.306]

6 [10, 14], is based on the sliding window approach, where detection windows of various scales and aspect ratios are evaluated at many positions across the image. [sent-14, score-0.653]

7 To alleviate this problem, the seminal approach of Viola and Jones [37] implements a cascade, which iteratively reduces the number of windows to be examined. [sent-16, score-0.285]

8 In a similar spirit, two or three-stage approaches have been explored [20, 35], where windows are discarded at each stage, while progressively using richer features. [sent-17, score-0.285]

9 A recent alternative is to prune the set of candidate windows without using class specific information, by relying on low-level contours and image segmentation, see e. [sent-19, score-0.42]

10 [6] recently also explored Fisher vectors (FV) for detection, and proposed an efficient detection mechanism based on integral images to find the best scoring window per image. [sent-26, score-0.436]

11 Our second contribution is that we show that the image segmentation that drives the object hypotheses generation, can also be used to improve the appearance features computed over the windows. [sent-30, score-0.316]

12 To this end, we compute a mask for each candidate window which counts for each pixel how many superpixels that cover that pixel are fully contained in the window, and weight the contribution of local descriptors in the Fisher vector representation accordingly. [sent-31, score-0.9]

13 This local feature weighting process is class-independent, completely unsupervised, and suppresses background clutter on superpixels that traverse the window boundary. [sent-32, score-0.58]

14 Related work in the literature has used segmentation for object detection in different ways. [sent-33, score-0.346]

15 [9, 27, 28, 38], extracts explicit segmentation for each object detection hypothesis as a post-processing step. [sent-36, score-0.346]

16 Moreover, if the supervision is limited to bounding box annotations, it is difficult to learn accurate object segmentation models. [sent-38, score-0.32]

17 [15] improve object detection using the output from the semantic segmentation of [3]. [sent-42, score-0.39]

18 The semantic segmentation is used to extract additional features encoding spatial relationships between the associated segments and object detection windows. [sent-43, score-0.489]

19 This approach, however, requires groundtruth segmentations to train the semantic segmentation model. [sent-44, score-0.3]

20 Our work is different in the sense that we incorporate segmentation into the feature extraction step for object detection, and remain in the training-from-bounding-boxes paradigm. [sent-45, score-0.245]

21 Even if the segmentation step fails to delineate the object accurately, our detector still benefits from the approximate segmentation, since part of the background clutter can still be suppressed. [sent-46, score-0.685]

22 [30] sample 1,000 windows per image using the objectness measure of [1], and weight local features proportional to the number of windows that overlap them when computing a Fisher vector representation. [sent-52, score-0.683]

23 With a gain of around 2 mAP points, our approximate segmentation masks significantly contribute to the success of our method. [sent-55, score-0.498]

24 Segmentation driven object detection In this section we describe how we generate our approximate segmentation masks, the feature extraction and compression processes, and the detector training procedure. [sent-59, score-0.764]

25 Segmentation mask generation Hierarchical segmentation was proposed in [34] to generate class-independent candidate detection windows. [sent-62, score-0.601]

26 In this manner, a rich set of segments of varying sizes and shapes is obtained, and the bounding boxes of the segments are used as candidate detection windows. [sent-65, score-0.398]

27 When producing around 1,500 object windows per image, more than 95% of the ground truth object windows are matched in the sense that they have an intersection/union measure of over 50%, as measured on the VOC’07 dataset. [sent-66, score-0.741]
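
The matching criterion in this sentence (intersection/union above 50%) can be illustrated with a minimal sketch. The box layout `(x_min, y_min, x_max, y_max)` with exclusive maximum edges and the function names are assumptions made for this example, not details taken from the paper.

```python
def intersection_over_union(box_a, box_b):
    """Intersection/union of two boxes given as (x_min, y_min, x_max, y_max).

    The coordinate convention (exclusive max edges) is an assumption of this sketch.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_matched(candidate, ground_truth, threshold=0.5):
    """A ground-truth window counts as matched when the measure exceeds the threshold."""
    return intersection_over_union(candidate, ground_truth) > threshold
```

For instance, a candidate covering the top 60% of a ground-truth box of the same width has a measure of 0.6 and counts as a match.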

28 In this manner more computationally expensive classifiers and features can be used, since far fewer windows need to be evaluated than in a sliding window approach. [sent-67, score-0.552]

29 Examples of candidate windows together with their generating segments can be found in Figure 1. [sent-68, score-0.523]

30 In general, however, the segments used to generate these candidate windows do not provide good object segmentations. [sent-69, score-0.549]

31 To obtain masks that are more suitable to improve object localization, we exploit the idea that background clutter is likely to be represented by superpixels that traverse the window boundary. [sent-70, score-0.858]

32 Therefore, we produce a binary mask based on each of the eight segmentations by retaining the superpixels that lie completely inside the window, and suppressing the other ones. [sent-71, score-0.479]

33 We average the eight binary masks to produce the weighted mask, which we use to weight the contribution of local features in the window descriptor. [sent-72, score-0.675]
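
The mask construction described in sentences 32 and 33 can be sketched as follows. This is a minimal illustration in plain Python, assuming each segmentation is given as a 2D grid of superpixel ids and the window as `(x0, y0, x1, y1)` with exclusive maximum edges; the function names are hypothetical and the paper's own implementation may differ.

```python
def binary_mask(labels, window):
    """Keep superpixels that lie completely inside the window, suppress the rest.

    labels: 2D list (rows) of superpixel ids for one segmentation (assumed layout).
    window: (x0, y0, x1, y1) in pixel coordinates, max edges exclusive (assumed).
    Returns a 2D list over the window's pixels with 1 = retained, 0 = suppressed.
    """
    x0, y0, x1, y1 = window
    # Any superpixel with at least one pixel outside the window is not fully contained.
    outside = set()
    for y, row in enumerate(labels):
        for x, sp in enumerate(row):
            if not (x0 <= x < x1 and y0 <= y < y1):
                outside.add(sp)
    return [[0 if labels[y][x] in outside else 1 for x in range(x0, x1)]
            for y in range(y0, y1)]


def weighted_mask(segmentations, window):
    """Average the binary masks obtained from several segmentations of one image."""
    masks = [binary_mask(labels, window) for labels in segmentations]
    n = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    return [[sum(m[y][x] for m in masks) / n for x in range(width)]
            for y in range(height)]
```

With eight segmentations, each mask value is a multiple of 1/8: a pixel gets weight 1 only if every superpixel covering it is fully contained in the window.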

34 It is important to consider the segmentation masks produced for incorrect candidate windows too, since these represent the vast majority of the candidate windows. [sent-75, score-1.023]

35 5 objects per image, while we use on the order of 1,000 to 2,000 candidate windows per image. [sent-77, score-0.49]

36 The first incorrect candidate window in Figure 1 shows a case where a partially visible horse is largely suppressed, since the superpixels on the object straddle outside the window. [sent-78, score-0.582]

37 As a result this window gets a lower score than the correct one containing the entire horse. [sent-79, score-0.309]

38 The second incorrect window shows a case where the car features are retained, and background is suppressed. [sent-80, score-0.343]

39 Since the window does not accurately cover the object, this might be detrimental to the detector performance. [sent-81, score-0.472]

40 It is, therefore, important to also take into account the features of the entire window as shown experimentally in Section 3. [sent-82, score-0.267]

41 Feature extraction To represent the candidate object windows we use two local features: SIFT and the local color descriptor of [8]. [sent-85, score-0.591]

42 The eight images on the right show the binary masks of superpixels lying fully inside the window, for each of the eight segmentations. [sent-91, score-0.549]

43 To represent a candidate window we sum up these normalized gradients, and weight the contribution of local descriptors by the averaged segmentation masks when we use them. [sent-101, score-1.0]
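
The aggregation step in this sentence, summing per-descriptor gradient vectors with mask weights, might look like the sketch below. It assumes the per-descriptor normalized gradients are already computed and that each descriptor has an integer pixel position; `window_descriptor` and its arguments are hypothetical names for illustration.

```python
def window_descriptor(gradients, positions, window, mask=None):
    """Sum per-descriptor gradient vectors over a window, optionally mask-weighted.

    gradients: list of equal-length vectors, one per local descriptor (assumed given).
    positions: list of (x, y) integer pixel positions of the descriptors.
    window:    (x0, y0, x1, y1), max edges exclusive (assumed convention).
    mask:      2D list of weights over the window's pixels, or None for no weighting.
    """
    x0, y0, x1, y1 = window
    total = [0.0] * len(gradients[0])
    for grad, (x, y) in zip(gradients, positions):
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue  # descriptor lies outside the candidate window
        weight = 1.0 if mask is None else mask[y - y0][x - x0]
        for i, g in enumerate(grad):
            total[i] += weight * g
    return total
```

Calling the function once with `mask=None` and once with the averaged mask yields the two descriptors (full window and masked window) that the later experiments combine.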

44 To obtain the final window descriptor we concatenate the FVs obtained over the color and SIFT features. [sent-114, score-0.37]

45 Feature compression During training we apply our detectors several times to the training images to retrieve hard negative examples. [sent-120, score-0.379]

46 Reextracting descriptors at each hard negative mining iteration would be very costly. [sent-121, score-0.26]

47 For example, the PASCAL VOC 2007 dataset contains about 5,000 training images and we have between 1,000 and 2,000 candidate windows per image, thus we have to assess in the order of 5 to 10 million candidate windows in each iteration. [sent-122, score-0.812]

48 On the other hand, storing all window descriptors in memory is also problematic. [sent-123, score-0.41]

49 In our experiments we use K = 64 Gaussians, which leads to K(2D + 1) = 8,256-dimensional FVs, which for 5 million candidate windows represents about 160 GB when using 4-byte floating point encoding. [sent-124, score-0.515]
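
The figures in this sentence can be checked with a few lines of arithmetic. The local descriptor dimension D = 64 used below is inferred from K(2D + 1) = 8,256 with K = 64; it is not stated in the extracted sentences, and the helper names are hypothetical.

```python
def fisher_vector_dim(num_gaussians, descriptor_dim):
    """Dimensionality K(2D + 1) of a Fisher vector with K Gaussians over D-dim descriptors."""
    return num_gaussians * (2 * descriptor_dim + 1)


def storage_bytes(num_windows, dim, bytes_per_value=4):
    """Raw storage for num_windows uncompressed vectors of the given dimensionality."""
    return num_windows * dim * bytes_per_value


K = 64
D = 64  # inferred from 8,256 = 64 * (2 * D + 1); an assumption of this sketch
dim = fisher_vector_dim(K, D)
total = storage_bytes(5_000_000, dim)
print(dim)          # 8256
print(total / 1e9)  # 165.12 (decimal GB), in line with "about 160 GB"
```

At the upper end of 10 million windows the same calculation gives roughly 330 GB, which is why holding uncompressed descriptors in memory is impractical.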

50 In practice we use M = 8, and H/B = 8 dimensional subvectors, which leads to a compression factor of 32 as compared to a 4-byte floating point encoding of the original vector. [sent-131, score-0.281]

51 To reduce the memory requirements even further, we use Blosc compression [2] on the PQ codes per image. [sent-132, score-0.273]

52 1 Note that PQ compression was used for object detection before in [36], but for a HOG feature based system which is far less demanding in terms of storage. [sent-134, score-0.317]

53 One can choose not to compress the data during test time, and apply the detector in an online manner computing the features for one image at the time. [sent-136, score-0.245]

54 Using PQ codes, all window descriptors for the whole dataset take 580 GB of disk space. [sent-138, score-0.362]

55 In order to apply a detector (for hard negative mining or evaluation), we only need to decompress Blosc-compressed data on-the-fly back to PQ codes. [sent-140, score-0.436]

56 The PQ codes can be used directly to score windows efficiently using lookup tables [36]. [sent-141, score-0.369]
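
Scoring directly from PQ codes with lookup tables can be sketched as below, assuming a linear classifier and one small codebook per subvector. The codebooks here are toy values; in practice they would be learned (for example with k-means), which this sketch omits. All function names are hypothetical.

```python
def encode(vector, codebooks):
    """Replace each subvector by the index of its nearest centroid.

    codebooks[b] is the list of centroids for subvector b; all centroids share one length.
    """
    sub_dim = len(codebooks[0][0])
    codes = []
    for b, centroids in enumerate(codebooks):
        sub = vector[b * sub_dim:(b + 1) * sub_dim]
        dists = [sum((s - c) ** 2 for s, c in zip(sub, cen)) for cen in centroids]
        codes.append(dists.index(min(dists)))
    return codes


def build_lookup_tables(weights, codebooks):
    """Precompute, per subvector, the dot product of the classifier weights with each centroid."""
    sub_dim = len(codebooks[0][0])
    tables = []
    for b, centroids in enumerate(codebooks):
        w_sub = weights[b * sub_dim:(b + 1) * sub_dim]
        tables.append([sum(w * c for w, c in zip(w_sub, cen)) for cen in centroids])
    return tables


def score(codes, tables, bias=0.0):
    """Linear classifier score of a PQ-encoded vector: one table lookup per subvector."""
    return bias + sum(tables[b][k] for b, k in enumerate(codes))
```

The tables are built once per classifier, so scoring a window costs one addition per subvector instead of one multiply-add per dimension, and the window descriptor never has to be decompressed to floating point.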

57 As positive training examples, we use the windows given by the ground-truth annotation. [sent-147, score-0.318]

58 We initialize the set of negative training examples by randomly sampling candidate boxes around ground-truth windows, and retaining those windows that have an overlap between 20% and 30% with a positive example in terms of intersection over union. [sent-148, score-0.555]

59 After the initial training stage, we add hard negative examples by applying the detector on the training set. [sent-149, score-0.378]

60 To avoid redundancy in negative samples, we do not allow two negative windows to have more than 60% overlap. [sent-151, score-0.415]
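
The two overlap rules in the training procedure (initial negatives at 20% to 30% overlap with a positive, and at most 60% overlap between any two negatives) can be sketched as follows. The greedy, highest-score-first order of the redundancy filter is an assumption of this example; the extracted text does not say how the constraint is enforced. Function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of boxes (x_min, y_min, x_max, y_max), max edges exclusive."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_initial_negative(candidate, positives, low=0.2, high=0.3):
    """True if the candidate overlaps some positive window by between 20% and 30%."""
    return any(low <= iou(candidate, p) <= high for p in positives)


def filter_redundant(scored_boxes, max_overlap=0.6):
    """Greedily keep negatives so that no two kept boxes overlap by more than max_overlap.

    scored_boxes: list of (box, detector_score); higher-scoring boxes are visited first
    (an assumed ordering, chosen because hard negatives are the high-scoring ones).
    """
    kept = []
    for box, _score in sorted(scored_boxes, key=lambda item: item[1], reverse=True):
        if all(iou(box, k) <= max_overlap for k in kept):
            kept.append(box)
    return kept
```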

61 Using our development dataset, described in the next section, we observed that the detector performance significantly increases after the first hard negative mining itera- 1 We use the public code from http://blosc. [sent-152, score-0.448]

62 Parameter evaluation on the development set We evaluated different versions of our detector on the development set, the results of which can be found in Table 1. [sent-170, score-0.395]

63 In our first three experiments we consider different detectors that only rely on the candidate windows, and do not make use of segmentation masks. [sent-171, score-0.37]

64 2 normalized) Fisher vector (FV) over the SIFT descriptors in each window, which leads to an mAP of 25. [sent-173, score-0.362]

65 In order to evaluate the importance of descriptor normalization, we removed the power normalization and test three versions: (i) no normalization (i. [sent-181, score-0.257]
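
For readers unfamiliar with the normalizations being compared, the sketch below shows power normalization followed by L2 normalization as commonly applied to Fisher vectors. The exponent 0.5 (signed square root) is an assumption of this example; the extracted sentences do not state the value used in the paper.

```python
import math


def power_normalize(vector, rho=0.5):
    """Signed power normalization: sign(z) * |z| ** rho, applied element-wise."""
    return [math.copysign(abs(z) ** rho, z) for z in vector]


def l2_normalize(vector):
    """Scale the vector to unit Euclidean norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(z * z for z in vector))
    return list(vector) if norm == 0.0 else [z / norm for z in vector]


def normalize_fv(vector, rho=0.5):
    """Power normalization followed by L2 normalization."""
    return l2_normalize(power_normalize(vector, rho))
```

Power normalization dampens the few very large components of the vector, and the L2 step then makes descriptors of windows with different numbers of local features comparable in scale.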

66 First, we use for each window the generating segment used to produce it, i. [sent-196, score-0.309]

67 In this case we suppress all descriptors within the bounding box that do not lie inside the segment, except for the groundtruth object windows during training, for which there are no generating segments. [sent-199, score-0.609]

68 Although this result is 10 mAP points below that using the window itself, it is still surprisingly good considering that the generating segments often poorly capture the object shape. [sent-202, score-0.438]

69 Second, we repeat this experiment when using the weighted masks (see Figure 1, third column), which improves mAP by about five points to 40. [sent-203, score-0.248]

70 Third, we also use the weighted masks on the ground-truth object windows during training. [sent-205, score-0.601]

71 This might be due to the fact that useful contextual background descriptors tend to be suppressed. [sent-209, score-0.333]

72 Our last experiment in this set considers combining the mask and window descriptors, so as to benefit from both local context, and crisper object-centered features. [sent-210, score-0.455]

73 First, we consider adding color to both the window-only detector and the window+mask detector. [sent-215, score-0.265]

74 The window-only SIFT+color detector performs very similarly to the window+mask SIFT-only detector at 46. [sent-216, score-0.41]

75 When adding color to the window+mask detector performance rises to 48. [sent-218, score-0.265]

76 7%, clearly showing the complementarity of the mask and color features. [sent-219, score-0.248]

77 Finally, we add a contextual feature by means of a FV computed over the full image, which further increases the mAP score to 49. [sent-220, score-0.281]

78 Finally, we implement the contextual rescoring mechanism proposed in [14], which further increases the score from 38. [sent-242, score-0.427]

79 To gain insight into the effect of the masked features, we present top detections in example images with our best detector (SIFT+color, window+full+context) with and without masks in Figure 2. [sent-245, score-0.639]

80 Images in the top row illustrate cases where the detector benefits from the masked features. [sent-246, score-0.311]

81 Our approximate object segmentation suppresses background clutter, which is particularly important when the object does not fill the bounding box (columns 1–4). [sent-247, score-0.491]

82 The bus example shows a case where a detection that is too small is suppressed because superpixels extend over the full bus; using the mask leads to the full bus being detected. [sent-248, score-0.705]

83 The bottom row shows examples where the use of masks degrades the top detection, typically the detection window is too large, since included background features are suppressed by the masks. [sent-249, score-0.704]

84 We divide them into two groups depending on whether they exploit inter-class contextual features, which we refer to as contextual detectors for short, or score windows independently. [sent-251, score-0.737]

85 We can observe that our detector without contextual rescoring obtains 38. [sent-252, score-0.637]

86 Since we use the candidate window method of Van de Sande et al. [sent-258, score-0.434]

87 Our detector without contextual Table 2: Performance on VOC’07 with different descriptors (S: SIFT, C: color), regions (W: window, M: mask, F: full image). [sent-261, score-0.539]

88 Table 3: Comparison of our detector with and without context with the state-of-the-art object detectors on VOC 2007. [sent-283, score-0.364]

89 Although the method itself does not directly use inter-class information, they utilize the detections given by the contextual rescoring approach of [16]. [sent-295, score-0.424]

90 5% mAP, and our contextual detector performs significantly better at 40. [sent-297, score-0.41]

91 Our non-contextual detector outperforms all other non-contextual detectors in terms of mAP, and performs comparably to other contextual object detectors. [sent-305, score-0.602]

92 When contextual rescoring is used, detection performance of our method increases to 38. [sent-306, score-0.486]

93 [34], which are based on the same candidate windows, our noncontextual detection results are better than theirs on 12 of the 20 categories, as well as on average: 35. [sent-311, score-0.302]

94 Conclusions We presented an object detection approach that exploits the powerful high dimensional Fisher vector representation. [sent-322, score-0.246]

95 We have shown that the same superpixels that drive the selective search can Table 4: Comparison of our detector with and without context with the state-of-the-art object detectors on VOC 2010. [sent-324, score-0.547]

96 ∗∗ Utilizes groundtruth segmentation annotations and extra training images. [sent-325, score-0.254]

97 be used to obtain approximate object segmentation masks, which allow us to compute object-centric features that are complementary to full-window features. [sent-326, score-0.277]

98 Our detector also exploits contextual features in the form of a full-image FV descriptor, and an inter-category rescoring mechanism. [sent-327, score-0.619]

99 With a gain of around 2 mAP points, our approximate segmentation masks significantly contribute to the success of our method. [sent-329, score-0.498]

100 In future work we want to explore the effectiveness of our approximate object detection masks for tasks such as semantic segmentation, by using them as a strongly semantic and spatially detailed prior. [sent-330, score-0.537]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('windows', 0.285), ('window', 0.267), ('masks', 0.248), ('contextual', 0.205), ('detector', 0.205), ('voc', 0.196), ('fisher', 0.19), ('mask', 0.188), ('rescoring', 0.18), ('segmentation', 0.177), ('fv', 0.169), ('compression', 0.148), ('superpixels', 0.137), ('candidate', 0.135), ('pq', 0.132), ('blosc', 0.132), ('pascal', 0.122), ('masked', 0.106), ('detection', 0.101), ('descriptors', 0.095), ('map', 0.087), ('eight', 0.082), ('song', 0.082), ('bus', 0.078), ('development', 0.078), ('normalization', 0.077), ('fvs', 0.076), ('anchez', 0.072), ('kd', 0.071), ('object', 0.068), ('decompress', 0.066), ('noncontextual', 0.066), ('negative', 0.065), ('sift', 0.065), ('clutter', 0.061), ('segments', 0.061), ('power', 0.06), ('color', 0.06), ('normalizations', 0.058), ('mining', 0.058), ('detectors', 0.058), ('sande', 0.056), ('suppressed', 0.055), ('khan', 0.054), ('subvectors', 0.051), ('verbeek', 0.05), ('spm', 0.05), ('perronnin', 0.049), ('cinbis', 0.049), ('memory', 0.048), ('dimensional', 0.048), ('vedaldi', 0.047), ('obtains', 0.047), ('floating', 0.047), ('selective', 0.046), ('nlpr', 0.045), ('groundtruth', 0.044), ('semantic', 0.044), ('traverse', 0.044), ('weight', 0.043), ('descriptor', 0.043), ('incorrect', 0.043), ('generating', 0.042), ('aez', 0.042), ('hard', 0.042), ('score', 0.042), ('codes', 0.042), ('van', 0.042), ('gain', 0.041), ('arbel', 0.041), ('weijer', 0.04), ('compress', 0.04), ('fidler', 0.04), ('bounding', 0.04), ('detections', 0.039), ('assess', 0.039), ('suppresses', 0.038), ('encoding', 0.038), ('retaining', 0.037), ('hypotheses', 0.036), ('per', 0.035), ('segmentations', 0.035), ('contribution', 0.035), ('objectness', 0.035), ('box', 0.035), ('versions', 0.034), ('full', 0.034), ('liblinear', 0.034), ('background', 0.033), ('context', 0.033), ('xd', 0.033), ('quantization', 0.033), ('vectors', 0.033), ('training', 0.033), ('approximate', 0.032), ('de', 0.032), ('pk', 0.031), ('compressed', 0.03), ('branch', 0.03), ('exploits', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 377 iccv-2013-Segmentation Driven Object Detection with Fisher Vectors

2 0.24972679 379 iccv-2013-Semantic Segmentation without Annotating Segments

Author: Wei Xia, Csaba Domokos, Jian Dong, Loong-Fah Cheong, Shuicheng Yan

Abstract: Numerous existing object segmentation frameworks commonly utilize the object bounding box as a prior. In this paper, we address semantic segmentation assuming that object bounding boxes are provided by object detectors, but no training data with annotated segments are available. Based on a set of segment hypotheses, we introduce a simple voting scheme to estimate shape guidance for each bounding box. The derived shape guidance is used in the subsequent graph-cut-based figure-ground segmentation. The final segmentation result is obtained by merging the segmentation results in the bounding boxes. We conduct an extensive analysis of the effect of object bounding box accuracy. Comprehensive experiments on both the challenging PASCAL VOC object segmentation dataset and GrabCut50 image segmentation dataset show that the proposed approach achieves competitive results compared to previous detection or bounding box prior based methods, as well as other state-of-the-art semantic segmentation methods.

3 0.21398598 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set

Author: Dan Oneata, Jakob Verbeek, Cordelia Schmid

Abstract: Action recognition in uncontrolled video is an important and challenging computer vision problem. Recent progress in this area is due to new local features and models that capture spatio-temporal structure between local features, or human-object interactions. Instead of working towards more complex models, we focus on the low-level features and their encoding. We evaluate the use of Fisher vectors as an alternative to bag-of-word histograms to aggregate a small set of state-of-the-art low-level descriptors, in combination with linear classifiers. We present a large and varied set of evaluations, considering (i) classification of short actions in five datasets, (ii) localization of such actions in feature-length movies, and (iii) large-scale recognition of complex events. We find that for basic action recognition and localization MBH features alone are enough for state-of-the-art performance. For complex events we find that SIFT and MFCC features provide complementary cues. On all three problems we obtain state-of-the-art results, while using fewer features and less complex models.

4 0.20184365 111 iccv-2013-Detecting Dynamic Objects with Multi-view Background Subtraction

Author: Raúl Díaz, Sam Hallman, Charless C. Fowlkes

Abstract: The confluence of robust algorithms for structure from motion along with high-coverage mapping and imaging of the world around us suggests that it will soon be feasible to accurately estimate camera pose for a large class of photographs taken in outdoor, urban environments. In this paper, we investigate how such information can be used to improve the detection of dynamic objects such as pedestrians and cars. First, we show that when rough camera location is known, we can utilize detectors that have been trained with a scene-specific background model in order to improve detection accuracy. Second, when precise camera pose is available, dense matching to a database of existing images using multi-view stereo provides a way to eliminate static backgrounds such as building facades, akin to background-subtraction often used in video analysis. We evaluate these ideas using a dataset of tourist photos with estimated camera pose. For template-based pedestrian detection, we achieve a 50 percent boost in average precision over baseline.

5 0.17202625 327 iccv-2013-Predicting an Object Location Using a Global Image Representation

Author: Jose A. Rodriguez Serrano, Diane Larlus

Abstract: We tackle the detection of prominent objects in images as a retrieval task: given a global image descriptor, we find the most similar images in an annotated dataset, and transfer the object bounding boxes. We refer to this approach as data driven detection (DDD), that is an alternative to sliding windows. Previous works have used similar notions but with task-independent similarities and representations, i.e. they were not tailored to the end-goal of localization. This article proposes two contributions: (i) a metric learning algorithm and (ii) a representation of images as object probability maps, that are both optimized for detection. We show experimentally that these two contributions are crucial to DDD, do not require costly additional operations, and in some cases yield comparable or better results than state-of-the-art detectors despite conceptual simplicity and increased speed. As an application of prominent object detection, we improve fine-grained categorization by precropping images with the proposed approach.

6 0.16814503 169 iccv-2013-Fine-Grained Categorization by Alignments

7 0.1664997 104 iccv-2013-Decomposing Bag of Words Histograms

8 0.14956945 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary

9 0.14936909 299 iccv-2013-Online Video SEEDS for Temporal Window Objectness

10 0.1373105 414 iccv-2013-Temporally Consistent Superpixels

11 0.13704097 242 iccv-2013-Learning People Detectors for Tracking in Crowded Scenes

12 0.13258211 201 iccv-2013-Holistic Scene Understanding for 3D Object Detection with RGBD Cameras

13 0.13238853 411 iccv-2013-Symbiotic Segmentation and Part Localization for Fine-Grained Categorization

14 0.12206274 322 iccv-2013-Pose Estimation and Segmentation of People in 3D Movies

15 0.1216533 179 iccv-2013-From Subcategories to Visual Composites: A Multi-level Framework for Object Detection

16 0.12158781 77 iccv-2013-Codemaps - Segment, Classify and Search Objects Locally

17 0.11947414 349 iccv-2013-Regionlets for Generic Object Detection

18 0.11926663 279 iccv-2013-Multi-stage Contextual Deep Learning for Pedestrian Detection

19 0.11696497 282 iccv-2013-Multi-view Object Segmentation in Space and Time

20 0.11641735 39 iccv-2013-Action Recognition with Improved Trajectories


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.27), (1, 0.038), (2, 0.069), (3, -0.031), (4, 0.152), (5, 0.021), (6, -0.091), (7, 0.071), (8, -0.087), (9, -0.134), (10, 0.127), (11, 0.1), (12, 0.022), (13, -0.087), (14, -0.088), (15, -0.083), (16, 0.027), (17, 0.055), (18, 0.033), (19, -0.031), (20, -0.005), (21, 0.005), (22, -0.061), (23, 0.057), (24, -0.074), (25, 0.106), (26, -0.115), (27, -0.039), (28, -0.048), (29, 0.008), (30, -0.022), (31, 0.0), (32, -0.061), (33, 0.027), (34, -0.011), (35, -0.022), (36, 0.12), (37, -0.08), (38, -0.006), (39, 0.12), (40, 0.003), (41, 0.06), (42, -0.035), (43, 0.074), (44, 0.031), (45, -0.015), (46, 0.042), (47, -0.11), (48, 0.033), (49, -0.083)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97109818 377 iccv-2013-Segmentation Driven Object Detection with Fisher Vectors

2 0.80429286 349 iccv-2013-Regionlets for Generic Object Detection

Author: Xiaoyu Wang, Ming Yang, Shenghuo Zhu, Yuanqing Lin

Abstract: Generic object detection is confronted by dealing with different degrees of variations in distinct object classes with tractable computations, which demands for descriptive and flexible object representations that are also efficient to evaluate for many locations. In view of this, we propose to model an object class by a cascaded boosting classifier which integrates various types of features from competing local regions, named as regionlets. A regionlet is a base feature extraction region defined proportionally to a detection window at an arbitrary resolution (i.e. size and aspect ratio). These regionlets are organized in small groups with stable relative positions to delineate fine-grained spatial layouts inside objects. Their features are aggregated to a one-dimensional feature within one group so as to tolerate deformations. Then we evaluate the object bounding box proposal in selective search from segmentation cues, limiting the evaluation locations to thousands. Our approach significantly outperforms the state-of-the-art on popular multi-class detection benchmark datasets with a single method, without any contexts. It achieves the detection mean average precision of 41.7% on the PASCAL VOC 2007 dataset and 39.7% on the VOC 2010 for 20 object categories. It achieves 14.7% mean average precision on the ImageNet dataset for 200 object categories, outperforming the latest deformable part-based model (DPM) by 4.7%.

3 0.77718157 104 iccv-2013-Decomposing Bag of Words Histograms

Author: Ankit Gandhi, Karteek Alahari, C.V. Jawahar

Abstract: We aim to decompose a global histogram representation of an image into histograms of its associated objects and regions. This task is formulated as an optimization problem, given a set of linear classifiers, which can effectively discriminate the object categories present in the image. Our decomposition bypasses harder problems associated with accurately localizing and segmenting objects. We evaluate our method on a wide variety of composite histograms, and also compare it with MRF-based solutions. In addition to merely measuring the accuracy of decomposition, we also show the utility of the estimated object and background histograms for the task of image classification on the PASCAL VOC 2007 dataset.

4 0.77585131 327 iccv-2013-Predicting an Object Location Using a Global Image Representation

Author: Jose A. Rodriguez Serrano, Diane Larlus

Abstract: We tackle the detection of prominent objects in images as a retrieval task: given a global image descriptor, we find the most similar images in an annotated dataset, and transfer the object bounding boxes. We refer to this approach as data driven detection (DDD), that is an alternative to sliding windows. Previous works have used similar notions but with task-independent similarities and representations, i.e. they were not tailored to the end-goal of localization. This article proposes two contributions: (i) a metric learning algorithm and (ii) a representation of images as object probability maps, that are both optimized for detection. We show experimentally that these two contributions are crucial to DDD, do not require costly additional operations, and in some cases yield comparable or better results than state-of-the-art detectors despite conceptual simplicity and increased speed. As an application of prominent object detection, we improve fine-grained categorization by precropping images with the proposed approach.

5 0.76130992 379 iccv-2013-Semantic Segmentation without Annotating Segments

Author: Wei Xia, Csaba Domokos, Jian Dong, Loong-Fah Cheong, Shuicheng Yan

Abstract: Numerous existing object segmentation frameworks commonly utilize the object bounding box as a prior. In this paper, we address semantic segmentation assuming that object bounding boxes are provided by object detectors, but no training data with annotated segments are available. Based on a set of segment hypotheses, we introduce a simple voting scheme to estimate shape guidance for each bounding box. The derived shape guidance is used in the subsequent graph-cut-based figure-ground segmentation. The final segmentation result is obtained by merging the segmentation results in the bounding boxes. We conduct an extensive analysis of the effect of object bounding box accuracy. Comprehensive experiments on both the challenging PASCAL VOC object segmentation dataset and GrabCut50 image segmentation dataset show that the proposed approach achieves competitive results compared to previous detection or bounding box prior based methods, as well as other state-of-the-art semantic segmentation methods.

6 0.72910649 109 iccv-2013-Detecting Avocados to Zucchinis: What Have We Done, and Where Are We Going?

7 0.72366887 77 iccv-2013-Codemaps - Segment, Classify and Search Objects Locally

8 0.69987607 169 iccv-2013-Fine-Grained Categorization by Alignments

9 0.69733769 411 iccv-2013-Symbiotic Segmentation and Part Localization for Fine-Grained Categorization

10 0.67360085 111 iccv-2013-Detecting Dynamic Objects with Multi-view Background Subtraction

11 0.66241062 189 iccv-2013-HOGgles: Visualizing Object Detection Features

12 0.65174019 198 iccv-2013-Hierarchical Part Matching for Fine-Grained Visual Categorization

13 0.61531103 416 iccv-2013-The Interestingness of Images

14 0.60329795 388 iccv-2013-Shape Index Descriptors Applied to Texture-Based Galaxy Analysis

15 0.59577703 326 iccv-2013-Predicting Sufficient Annotation Strength for Interactive Foreground Segmentation

16 0.59307045 447 iccv-2013-Volumetric Semantic Segmentation Using Pyramid Context Features

17 0.5916577 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction

18 0.58823562 201 iccv-2013-Holistic Scene Understanding for 3D Object Detection with RGBD Cameras

19 0.58785278 186 iccv-2013-GrabCut in One Cut

20 0.58596444 193 iccv-2013-Heterogeneous Auto-similarities of Characteristics (HASC): Exploiting Relational Information for Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.098), (4, 0.014), (7, 0.017), (13, 0.235), (26, 0.114), (31, 0.041), (35, 0.017), (42, 0.115), (64, 0.066), (73, 0.013), (78, 0.011), (89, 0.185)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92468965 429 iccv-2013-Tree Shape Priors with Connectivity Constraints Using Convex Relaxation on General Graphs

Author: Jan Stühmer, Peter Schröder, Daniel Cremers

Abstract: We propose a novel method to include a connectivity prior into image segmentation that is based on a binary labeling of a directed graph, in this case a geodesic shortest path tree. Specifically we make two contributions: First, we construct a geodesic shortest path tree with a distance measure that is related to the image data and the bending energy of each path in the tree. Second, we include a connectivity prior in our segmentation model that allows segmenting not only a single elongated structure but a whole connected branching tree. Because both our segmentation model and the connectivity constraint are convex, a global optimal solution can be found. To this end, we generalize a recent primal-dual algorithm for continuous convex optimization to an arbitrary graph structure. To validate our method we present results on data from medical imaging in angiography and retinal blood vessel segmentation.

2 0.87082982 333 iccv-2013-Quantize and Conquer: A Dimensionality-Recursive Solution to Clustering, Vector Quantization, and Image Retrieval

Author: Yannis Avrithis

Abstract: Inspired by the close relation between nearest neighbor search and clustering in high-dimensional spaces as well as the success of one helping to solve the other, we introduce a new paradigm where both problems are solved simultaneously. Our solution is recursive, not in the size of input data but in the number of dimensions. One result is a clustering algorithm that is tuned to small codebooks but does not need all data in memory at the same time and is practically constant in the data size. As a by-product, a tree structure performs either exact or approximate quantization on trained centroids, the latter being not very precise but extremely fast. A lesser contribution is a new indexing scheme for image retrieval that exploits multiple small codebooks to provide an arbitrarily fine partition of the descriptor space. Large scale experiments on public datasets exhibit state of the art performance and remarkable generalization.

3 0.85704684 145 iccv-2013-Estimating the Material Properties of Fabric from Video

Author: Katherine L. Bouman, Bei Xiao, Peter Battaglia, William T. Freeman

Abstract: Passively estimating the intrinsic material properties of deformable objects moving in a natural environment is essential for scene understanding. We present a framework to automatically analyze videos of fabrics moving under various unknown wind forces, and recover two key material properties of the fabric: stiffness and area weight. We extend features previously developed to compactly represent static image textures to describe video textures, such as fabric motion. A discriminatively trained regression model is then used to predict the physical properties of fabric from these features. The success of our model is demonstrated on a new, publicly available database of fabric videos with corresponding measured ground truth material properties. We show that our predictions are well correlated with ground truth measurements of stiffness and density for the fabrics. Our contributions include: (a) a database that can be used for training and testing algorithms for passively predicting fabric properties from video, (b) an algorithm for predicting the material properties of fabric from a video, and (c) a perceptual study of humans’ ability to estimate the material properties of fabric from videos and images.

same-paper 4 0.84053314 377 iccv-2013-Segmentation Driven Object Detection with Fisher Vectors

Author: Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid

Abstract: We present an object detection system based on the Fisher vector (FV) image representation computed over SIFT and color descriptors. For computational and storage efficiency, we use a recent segmentation-based method to generate class-independent object detection hypotheses, in combination with data compression techniques. Our main contribution is a method to produce tentative object segmentation masks to suppress background clutter in the features. Re-weighting the local image features based on these masks is shown to improve object detection significantly. We also exploit contextual features in the form of a full-image FV descriptor, and an inter-category rescoring mechanism. Our experiments on the PASCAL VOC 2007 and 2010 datasets show that our detector improves over the current state-of-the-art detection results.

5 0.83157569 273 iccv-2013-Monocular Image 3D Human Pose Estimation under Self-Occlusion

Author: Ibrahim Radwan, Abhinav Dhall, Roland Goecke

Abstract: In this paper, an automatic approach for 3D pose reconstruction from a single image is proposed. The presence of human body articulation, hallucinated parts and cluttered background leads to ambiguity during the pose inference, which makes the problem non-trivial. Researchers have explored various methods based on motion and shading in order to reduce the ambiguity and reconstruct the 3D pose. The key idea of our algorithm is to impose both kinematic and orientation constraints. The former is imposed by projecting a 3D model onto the input image and pruning the parts, which are incompatible with the anthropomorphism. The latter is applied by creating synthetic views via regressing the input view to multiple oriented views. After applying the constraints, the 3D model is projected onto the initial and synthetic views, which further reduces the ambiguity. Finally, we borrow the direction of the unambiguous parts from the synthetic views to the initial one, which results in the 3D pose. Quantitative experiments are performed on the HumanEva-I dataset and qualitatively on unconstrained images from the Image Parse dataset. The results show the robustness of the proposed approach to accurately reconstruct the 3D pose from a single image.

6 0.76543748 414 iccv-2013-Temporally Consistent Superpixels

7 0.76091909 389 iccv-2013-Shortest Paths with Curvature and Torsion

8 0.75951099 349 iccv-2013-Regionlets for Generic Object Detection

9 0.75350291 29 iccv-2013-A Scalable Unsupervised Feature Merging Approach to Efficient Dimensionality Reduction of High-Dimensional Visual Data

10 0.7500422 285 iccv-2013-NEIL: Extracting Visual Knowledge from Web Data

11 0.7498554 287 iccv-2013-Neighbor-to-Neighbor Search for Fast Coding of Feature Vectors

12 0.74981749 327 iccv-2013-Predicting an Object Location Using a Global Image Representation

13 0.74893343 109 iccv-2013-Detecting Avocados to Zucchinis: What Have We Done, and Where Are We Going?

14 0.74788481 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary

15 0.74784243 95 iccv-2013-Cosegmentation and Cosketch by Unsupervised Learning

16 0.74690986 326 iccv-2013-Predicting Sufficient Annotation Strength for Interactive Foreground Segmentation

17 0.74340242 180 iccv-2013-From Where and How to What We See

18 0.74336237 150 iccv-2013-Exemplar Cut

19 0.74327469 77 iccv-2013-Codemaps - Segment, Classify and Search Objects Locally

20 0.74314779 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition