cvpr cvpr2013 cvpr2013-217 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Guang Shu, Afshin Dehghan, Mubarak Shah
Abstract: We propose an approach to improve the detection performance of a generic detector when it is applied to a particular video. The performance of offline-trained object detectors is usually degraded in unconstrained video environments due to varying illumination, backgrounds and camera viewpoints. Moreover, most object detectors are trained using Haar-like features or gradient features but ignore video-specific features like consistent color patterns. In our approach, we apply a Superpixel-based Bag-of-Words (BoW) model to iteratively refine the output of a generic detector. Compared to other related work, our method builds a video-specific detector using superpixels, hence it can handle the problem of appearance variation. Most importantly, using a Conditional Random Field (CRF) along with our superpixel-based BoW model, we develop an algorithm to segment the object from the background. Therefore our method generates an output of the exact object regions instead of the bounding boxes generated by most detectors. In general, our method takes detection bounding boxes of a generic detector as input and generates the detection output with higher average precision and precise object regions. Experiments on four recent datasets demonstrate the effectiveness of our approach, which significantly improves the state-of-the-art detector by 5-16% in average precision.
Reference: text
sentIndex sentText sentNum sentScore
1 The performance of offline-trained object detectors is usually degraded in unconstrained video environments due to varying illumination, backgrounds and camera viewpoints. [sent-2, score-0.426]
2 Moreover, most object detectors are trained using Haar-like features or gradient features but ignore video-specific features like consistent color patterns. [sent-3, score-0.254]
3 In our approach, we apply a Superpixel-based Bag-of-Words (BoW) model to iteratively refine the output of a generic detector. [sent-4, score-0.214]
4 Compared to other related work, our method builds a video-specific detector using superpixels, hence it can handle the problem of appearance variation. [sent-5, score-0.36]
5 Therefore our method generates an output of the exact object regions instead of the bounding boxes generated by most detectors. [sent-7, score-0.332]
6 In general, our method takes detection bounding boxes of a generic detector as input and generates the detection output with higher average precision and precise object regions. [sent-8, score-0.957]
7 Experiments on four recent datasets demonstrate the effectiveness of our approach, which significantly improves the state-of-the-art detector by 5-16% in average precision. [sent-9, score-0.301]
8 Introduction With the prevalence of video recording devices nowadays, the demand for automatically detecting objects in videos has significantly increased. [sent-11, score-0.175]
9 It is very expensive to manually label examples and re-train the detector for each new video. [sent-13, score-0.373]
10 Second, most detectors are designed for a [sent-14, score-0.118]
11 In the upper right image, the red bounding boxes show the detection results of the DPM and the yellow bounding boxes show the missed detections. [sent-19, score-0.523]
12 The lower right image shows the results after refinement using our approach, in which all the objects are correctly detected and the object regions are extracted. [sent-20, score-0.136]
13 generic object class using Histograms of Oriented Gradients (HOG) [5] or Haar-like features [13]. [sent-21, score-0.165]
14 When applied to a particular video, they are not able to fully leverage the information presented in different frames of the video, such as the consistent color patterns of objects and background. [sent-22, score-0.135]
15 A common technique of these approaches is to apply a coarse detector to the video and get initial detections, which are then added into the training set to improve the coarse detector. [sent-25, score-0.515]
16 While these approaches have proven effective, they can only adapt their appearance models based on the coarse detections and so are not truly adaptive. [sent-26, score-0.217]
17 On the other hand, detection-by-tracking approaches [18, 15, 3, 9] use trackers to improve the detection results for a particular video. [sent-27, score-0.143]
18 However, they may introduce more noise to the detection results if the tracker is not reliable for the video. [sent-28, score-0.175]
19 To address the above-mentioned problems, we propose the use of an online-learned appearance model to iteratively refine a generic detector. [sent-29, score-0.228]
20 These instances have varying poses but consistent color features. [sent-33, score-0.145]
21 Based on this assumption, we transfer the knowledge from the generic detector to a more video-specific detector by using the superpixel-based appearance model. [sent-34, score-0.738]
22 First, we apply the original detector with a low detection threshold on every frame of a video and obtain a substantial number of detection examples. [sent-36, score-0.587]
23 Those examples are initially labeled as positive or hard according to their confidence scores. [sent-37, score-0.312]
24 Second, we extract superpixel features from all examples and build a Bag-of-Words representation for each example. [sent-39, score-0.317]
25 In the last step, we train an SVM model with positive and negative examples and label the hard examples iteratively. [sent-40, score-0.433]
26 In each iteration, a small number of hard examples is conservatively added to the training set, until the process converges. [sent-41, score-0.207]
27 Superpixels have been successfully applied in image segmentation [1], object localization [8] and tracking [16]. [sent-42, score-0.15]
28 On the other hand, superpixels have great flexibility, which avoids the misalignment of HOG and Haar-like features on varying poses of objects. [sent-44, score-0.544]
29 Using these advantages of superpixels, along with the proposed algorithm, we also extract the regions of objects, as shown in Figure 1. [sent-45, score-0.488]
30 A confidence map, which shows the likelihood of each pixel belonging to the target, is computed using a generic background model. [sent-46, score-0.281]
31 Unlike background subtraction methods, our method requires no background modeling; hence it is not sensitive to camera motion and still works with a moving camera. [sent-48, score-0.279]
32 In general, our algorithm can extract the object regions without prior knowledge of the object’s shape, and the output can serve as a more precise initialization for other applications such as tracking and recognition. [sent-49, score-0.29]
33 In this paper we take pedestrian detection as an example to illustrate our approach. [sent-50, score-0.152]
34 Related Work A substantial amount of work has been reported on building online learning approaches for object detection. [sent-57, score-0.147]
35 However, both approaches require a number of manually labeled examples as the initial training examples. [sent-61, score-0.213]
36 [11] presented a framework that can automatically label data and learn a classifier for detecting moving objects in video. [sent-62, score-0.176]
37 However, in [11] and [4] the initial coarse detectors are based on background subtraction, hence they do not fit scenarios with complex backgrounds or a moving camera. [sent-65, score-0.418]
38 et al. [17] only learn objects having an appearance similar to the initial examples. [sent-69, score-0.194]
39 These approaches are likely to miss some hard examples with large appearance variations. [sent-70, score-0.313]
40 On the other hand, the detection-by-tracking approaches improve the detections by using trackers [18, 15, 3, 9]. [sent-71, score-0.147]
41 Initial Detection We employ the deformable part-based model (DPM) detector [7] as the initial detector in our approach, since it has shown excellent performance on static images. [sent-82, score-0.617]
42 The detector is given a lower detection threshold td so we can obtain almost all true detections along with a large number of false alarms. [sent-83, score-0.549]
43 According to the detector’s confidence scores, we initially split all detections into two groups: the ones with confidence scores above a threshold are labeled as positive examples; the rest are labeled as hard examples. [sent-84, score-0.605]
44 In addition, a large number of negative examples are randomly collected such that they do not overlap with any positive or hard examples. [sent-85, score-0.316]
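As a concrete illustration of this initialization, the following minimal Python sketch splits raw (box, score) detections by confidence and samples non-overlapping negatives. The positive-label threshold t_pos and the uniform sampling scheme are assumptions for illustration; the paper only reports the low detection threshold td used to collect the detections.

```python
import random

def boxes_overlap(a, b):
    """Axis-aligned overlap test for (x1, y1, x2, y2) boxes."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def split_detections(detections, t_pos=0.0):
    """Split (box, score) detections into positive and hard examples.

    t_pos is an illustrative positive-label threshold, not a value
    reported in the paper.
    """
    positives = [(b, s) for b, s in detections if s >= t_pos]
    hard = [(b, s) for b, s in detections if s < t_pos]
    return positives, hard

def sample_negatives(frame_w, frame_h, used_boxes, box_w, box_h, n, seed=0):
    """Randomly sample negative boxes overlapping no positive or hard example."""
    rng = random.Random(seed)
    negatives = []
    while len(negatives) < n:
        x = rng.randrange(frame_w - box_w)
        y = rng.randrange(frame_h - box_h)
        box = (x, y, x + box_w, y + box_h)
        if not any(boxes_overlap(box, u) for u in used_boxes):
            negatives.append(box)
    return negatives
```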
45 Superpixels and Appearance Model Most object detectors use HOG or Haar-like features which can represent a generic object class well. [sent-89, score-0.327]
46 As shown in Figure 3, an individual can have varying poses, which causes misalignment for HOG, Haar-like, or any other pixel-level features. [sent-91, score-0.145]
47 To handle this problem, we need to transfer the knowledge from a generic detector to a video-specific detector. [sent-92, score-0.464]
48 Therefore we build a statistical appearance model with superpixels as units. [sent-93, score-0.46]
49 We segment each detection output into Nsp superpixels using the SLIC superpixel segmentation algorithm [1]. [sent-96, score-0.582]
50 We choose an appropriate number Nsp so that each superpixel is roughly uniform in color and naturally preserves the boundaries of objects. [sent-97, score-0.16]
51 In order to encode both color and spatial information into superpixels, we describe each superpixel Sp(i) by a 5-dimensional feature vector f = (L, a, b, x, y), in which (L, a, b) is the average CIELAB color value of all pixels and (x, y) is the average location of all pixels. [sent-98, score-0.2]
52 An M-word vocabulary is assembled by clustering all the superpixels using the K-means algorithm. [sent-99, score-0.399]
53 Then the superpixels of each example are aggregated into an M-bin, L2-normalized histogram, so that each example is represented in a BoW fashion. [sent-100, score-0.399]
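A minimal sketch of this feature pipeline, assuming scikit-image's SLIC implementation and scikit-learn's K-means; the vocabulary size M and the SLIC compactness value are illustrative choices, not the paper's settings:

```python
import numpy as np
from skimage.segmentation import slic
from skimage.color import rgb2lab
from sklearn.cluster import KMeans

N_SP = 100   # superpixels per example (Nsp in the paper)
M = 400      # vocabulary size; illustrative, not reported above

def superpixel_features(patch_rgb, n_segments=N_SP):
    """5-D descriptor (L, a, b, x, y) for each SLIC superpixel of a patch."""
    lab = rgb2lab(patch_rgb)
    labels = slic(patch_rgb, n_segments=n_segments, compactness=10)
    feats = []
    for sp in np.unique(labels):
        mask = labels == sp
        ys, xs = np.nonzero(mask)
        L, a, b = lab[mask].mean(axis=0)       # average CIELAB color
        feats.append([L, a, b, xs.mean(), ys.mean()])  # plus average location
    return np.asarray(feats)

def build_vocabulary(all_feats, m_words=M):
    """Cluster all superpixel descriptors into an M-word vocabulary."""
    return KMeans(n_clusters=m_words, n_init=10, random_state=0).fit(np.vstack(all_feats))

def bow_histogram(feats, vocab):
    """M-bin, L2-normalized BoW histogram for one example."""
    words = vocab.predict(feats)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-12)
```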
54 After each iteration we obtain SVM scores for the hard examples, and we split the hard examples into three groups again. [sent-106, score-0.338]
55 We move the examples with high scores into the positive set and the examples with low scores into the negative set, then re-train the SVM model. [sent-107, score-0.425]
56 After all the hard examples are labeled, we can project the positive examples back into the image sequence and generate the detection output. [sent-111, score-0.478]
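The iterative labeling loop might be sketched as follows, here with a linear SVM over the BoW histograms. The score thresholds t_hi and t_lo stand in for the paper's conservative per-iteration selection of a small number of hard examples:

```python
import numpy as np
from sklearn.svm import LinearSVC

def iterative_refinement(pos, neg, hard, t_hi=1.0, t_lo=-1.0, max_iters=10):
    """Iteratively label hard examples with an SVM over BoW histograms.

    pos, neg, hard: 2-D arrays of BoW histograms. t_hi and t_lo are
    illustrative thresholds for conservatively moving confidently scored
    hard examples into the positive and negative sets.
    """
    svm = LinearSVC(C=1.0).fit(np.vstack([pos, neg]),
                               np.hstack([np.ones(len(pos)), -np.ones(len(neg))]))
    for _ in range(max_iters):
        if len(hard) == 0:
            break
        scores = svm.decision_function(hard)
        hi, lo = scores >= t_hi, scores <= t_lo
        if not hi.any() and not lo.any():
            break  # convergence: no remaining hard example is confidently labeled
        pos = np.vstack([pos, hard[hi]])
        neg = np.vstack([neg, hard[lo]])
        hard = hard[~(hi | lo)]
        svm = LinearSVC(C=1.0).fit(np.vstack([pos, neg]),
                                   np.hstack([np.ones(len(pos)), -np.ones(len(neg))]))
    return pos, svm
```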
57 Region Extraction The superpixel-based appearance model enables us not only to improve the detector, but also to precisely extract the regions of objects. [sent-114, score-0.15]
58 Since superpixels naturally preserve the boundaries of objects, we develop an algorithm that takes the detection bounding box as input and calculates a confidence map indicating how likely each superpixel is to belong to the target. [sent-115, score-0.906]
59 First, we cluster all superpixels of the negative samples into Mn clusters using CIELAB color features. [sent-116, score-0.539]
60 Then we calculate the similarities between all superpixels from positive examples and all the clusters. [sent-118, score-0.628]
61 The similarity W(i, j) is computed from Sp(i), clst(j) and prior(j) (Eq. 1), in which Sp(i) is the i-th superpixel from the positive examples and clst(j) is the j-th cluster center. [sent-121, score-0.389]
62 prior(j) is the prior probability that the j-th cluster belongs to the background; it is defined by the number of superpixels in the j-th cluster. [sent-123, score-0.451]
63 After obtaining the similarity matrix W(i, j), we can calculate the confidence of a superpixel belonging to the target by the equation Q(i) = 1 − max_j W(i, j). (2) [sent-125, score-0.325]
64 Therefore we can obtain a confidence map Q for each positive example, as shown in the second row of Figure 4. [sent-126, score-0.225]
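A sketch of this confidence computation under stated assumptions: since the exact expression of Eq. 1 is not reproduced above, the similarity W(i, j) is modeled here as a Gaussian kernel on CIELAB distance weighted by the cluster prior; Mn = 200 follows the paper, while sigma is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def superpixel_confidence(pos_sp_lab, neg_sp_lab, m_neg=200, sigma=10.0):
    """Confidence that each positive-example superpixel belongs to the target.

    pos_sp_lab: (N, 3) mean CIELAB colors of superpixels from positive examples.
    neg_sp_lab: (K, 3) mean CIELAB colors of superpixels from negative examples.
    The Gaussian similarity and bandwidth sigma are assumptions standing in
    for the paper's W(i, j).
    """
    km = KMeans(n_clusters=m_neg, n_init=10, random_state=0).fit(neg_sp_lab)
    # prior(j): fraction of negative superpixels falling in cluster j
    prior = np.bincount(km.labels_, minlength=m_neg) / len(neg_sp_lab)
    # W(i, j): similarity of superpixel i to background cluster j, weighted by prior(j)
    d2 = ((pos_sp_lab[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2)) * prior[None, :]
    # Q(i) = 1 - max_j W(i, j): dissimilar to every background cluster => target
    return 1.0 - W.max(axis=1)
```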
65 In order to extract a precise region from the confidence map, a conditional random field (CRF) model [2] is utilized to learn the conditional distribution over the class labeling. [sent-127, score-0.239]
66 The labeling energy is E(c|s) = Σ_i Ψ(c_i|s_i) + Σ_{(i,j) ∈ Edges} Φ(c_i, c_j|s_i, s_j), (3) where Ψ is the unary potential defined by the probability provided by the confidence map Q: Ψ(c_i|s_i) = −log(Pr(c_i|s_i)), (4) and Φ is the pairwise edge potential, defined in terms of the L2-norm between the features of neighboring superpixels. [sent-137, score-0.227]
67 After the CRF segmentation we obtain a binary map on which the target and background are distinctly separated. [sent-148, score-0.132]
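The paper performs this inference with the CRF model of [2]; as a self-contained stand-in, the sketch below minimizes the same kind of unary-plus-pairwise energy over the superpixel adjacency graph using iterated conditional modes (ICM). The contrast-sensitive edge weight and the parameters beta and sigma are assumptions:

```python
import numpy as np

def icm_segment(Q, edges, feats, beta=1.0, sigma=10.0, n_iters=10):
    """Binary labeling of superpixels by iterated conditional modes (ICM).

    Q: per-superpixel target confidence (unary term). edges: pairs of
    adjacent superpixel indices. feats: per-superpixel feature vectors.
    The pairwise term discourages label changes across similar neighbors.
    """
    eps = 1e-6
    unary = np.stack([-np.log(np.clip(1 - Q, eps, 1)),      # cost of background label
                      -np.log(np.clip(Q, eps, 1))], axis=1)  # cost of target label
    # Contrast-sensitive edge weights from the L2 distance between features
    w = {tuple(sorted(e)): np.exp(-np.linalg.norm(feats[e[0]] - feats[e[1]]) ** 2
                                  / (2 * sigma ** 2)) for e in edges}
    nbrs = {i: [] for i in range(len(Q))}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    labels = (Q > 0.5).astype(int)
    for _ in range(n_iters):
        changed = False
        for i in range(len(Q)):
            costs = unary[i].copy()
            for lab in (0, 1):
                costs[lab] += beta * sum(w[tuple(sorted((i, j)))]
                                         for j in nbrs[i] if labels[j] != lab)
            new = int(np.argmin(costs))
            changed |= new != labels[i]
            labels[i] = new
        if not changed:
            break
    return labels
```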
68 Note that in some positive examples, there are usually superpixels which belong to other nearby targets but are labeled as target. [sent-149, score-0.504]
69 The first row shows some pedestrian examples, the second row shows the corresponding confidence maps and the last row shows the corresponding CRF segmentations. [sent-161, score-0.327]
70 We analyze the detector performance by computing Precision-Recall curves for all four datasets, as shown in Figure 6. [sent-165, score-0.301]
71 We set a detection threshold td = −2 to achieve a high recall. [sent-167, score-0.138]
72 For the superpixel segmentation, we set the number of superpixels for each example to Nsp = 100. [sent-168, score-0.676]
73 In the region extraction we set the number of negative clusters to be Mn = 200. [sent-170, score-0.174]
74 We also calculate the average precision, as used in [6], for quantitative comparison. [sent-172, score-0.115]
75 While the initial detector takes around 15 seconds per frame, our additional steps take only 3 seconds on average per frame on a 3 GHz CPU. [sent-176, score-0.307]
76 Figures 5 and 7 show the detection results; Figure 8(a) shows the region extraction results. [sent-178, score-0.18]
77 The green bounding boxes are the output of the DPM detector; the red bounding boxes are the output of our approach. [sent-182, score-0.088]
78 It is clear that our approach has fewer false positives as well as false negatives. [sent-183, score-0.114]
79 PNNL Parking lot Dataset (PL): This dataset consists of two video sequences collected in a parking lot using a static camera. [sent-187, score-0.723]
80 Parking lot 1 is a moderately crowded scene including groups of pedestrians walking in queues with parallel motion and similar appearance. [sent-188, score-0.289]
81 Parking lot 2 is a more challenging sequence due to the large amount of pose variation and occlusion; hence the results on this dataset are lower than on the other datasets. [sent-189, score-0.18]
82 However, our approach still performs significantly better than the DPM detector and the HOG-based variant of our method. [sent-190, score-0.256]
83 Tables 1 and 2 show that we outperform the DPM detector in both precision and average precision by a significant margin. [sent-194, score-0.382]
84 The original detector has already achieved satisfactory results, but our approach performs even better. [sent-197, score-0.299]
85 The second row shows the results of our method using HOG descriptors, and the third row shows the proposed method using a bag of words of superpixels. [sent-200, score-0.104]
86 Conclusion We proposed an effective method to improve generic detectors and extract object regions using a superpixel-based Bag-of-Words model. [sent-228, score-0.372]
87 Our method captures rich information about individuals through superpixels; hence it is highly discriminative and robust against appearance changes. [sent-229, score-0.104]
88 We employ a part-based human detector to obtain initial labels and gradually refine the detections in an iterative way. [sent-230, score-0.503]
89 We also presented a region extraction algorithm that extracts precise object regions. [sent-231, score-0.135]
90 We demonstrated by experiments that our method effectively improves the performance of object detectors on four recent datasets. [sent-232, score-0.207]
91 Segmentation of objects in a detection window by nonparametric inhomogeneous CRFs. [sent-252, score-0.137]
92 Unsupervised and simultaneous training of multiple object detectors from unlabeled surveillance video. [sent-264, score-0.21]
93 We compared our method against the original detector, once using HOG and once using a bag of words of superpixels as the feature. [sent-273, score-0.655]
94 The green bounding boxes are the output of the DPM detector; the red bounding boxes are the output of our approach. [sent-277, score-0.088]
95 It is clear that our approach has fewer false positives as well as false negatives. [sent-278, score-0.114]
96 The first row shows the original detection window; the second row shows our segmentation results using the CRF. [sent-283, score-0.24]
97 An unsupervised, online learning framework for moving object detection. [sent-323, score-0.15]
98 Online detection and classification of moving objects using progressively improving detectors. [sent-329, score-0.191]
99 Rapid object detection using a boosted cascade of simple features. [sent-334, score-0.138]
100 Detection by detections: Non-parametric detector adaptation for a video. [sent-364, score-0.256]
wordName wordTfidf (topN-words)
[('superpixels', 0.399), ('parking', 0.303), ('detector', 0.256), ('dpm', 0.206), ('sp', 0.173), ('superpixel', 0.16), ('clst', 0.147), ('lot', 0.137), ('pnnl', 0.13), ('generic', 0.121), ('detectors', 0.118), ('examples', 0.117), ('confidence', 0.113), ('crf', 0.109), ('dehghan', 0.108), ('hog', 0.104), ('boxes', 0.104), ('detections', 0.098), ('skateborading', 0.098), ('town', 0.097), ('detection', 0.094), ('video', 0.092), ('hard', 0.09), ('bounding', 0.088), ('variant', 0.088), ('celik', 0.087), ('tracker', 0.081), ('cielab', 0.08), ('tracking', 0.064), ('bow', 0.063), ('precision', 0.063), ('shu', 0.063), ('appearance', 0.061), ('pedestrians', 0.061), ('positive', 0.06), ('coarse', 0.058), ('pedestrian', 0.058), ('illuminations', 0.057), ('potentials', 0.057), ('false', 0.057), ('poses', 0.057), ('slic', 0.056), ('shah', 0.056), ('static', 0.054), ('moving', 0.054), ('army', 0.053), ('row', 0.052), ('gradually', 0.052), ('cluster', 0.052), ('online', 0.052), ('calculate', 0.052), ('initial', 0.051), ('substantial', 0.051), ('si', 0.05), ('regions', 0.049), ('negative', 0.049), ('trackers', 0.049), ('crowded', 0.048), ('surveillance', 0.048), ('output', 0.047), ('background', 0.047), ('pr', 0.046), ('precise', 0.046), ('refine', 0.046), ('extraction', 0.046), ('four', 0.045), ('labeled', 0.045), ('miss', 0.045), ('camera', 0.044), ('subtraction', 0.044), ('object', 0.044), ('td', 0.044), ('levin', 0.044), ('transfer', 0.044), ('queues', 0.043), ('nsp', 0.043), ('ceonn', 0.043), ('gshu', 0.043), ('hanjalic', 0.043), ('oofu', 0.043), ('videospecific', 0.043), ('wncilel', 0.043), ('hence', 0.043), ('mn', 0.043), ('satisfying', 0.043), ('objects', 0.043), ('segmentation', 0.042), ('svm', 0.041), ('backgrounds', 0.041), ('scores', 0.041), ('extract', 0.04), ('guang', 0.04), ('erent', 0.04), ('benfold', 0.04), ('colorspace', 0.04), ('jh', 0.04), ('detecting', 0.04), ('region', 0.04), ('learns', 0.039), ('clusters', 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 217 cvpr-2013-Improving an Object Detector and Extracting Regions Using Superpixels
Author: Guang Shu, Afshin Dehghan, Mubarak Shah
Abstract: We propose an approach to improve the detection performance of a generic detector when it is applied to a particular video. The performance of offline-trained object detectors is usually degraded in unconstrained video environments due to varying illumination, backgrounds and camera viewpoints. Moreover, most object detectors are trained using Haar-like features or gradient features but ignore video-specific features like consistent color patterns. In our approach, we apply a Superpixel-based Bag-of-Words (BoW) model to iteratively refine the output of a generic detector. Compared to other related work, our method builds a video-specific detector using superpixels, hence it can handle the problem of appearance variation. Most importantly, using a Conditional Random Field (CRF) along with our superpixel-based BoW model, we develop an algorithm to segment the object from the background. Therefore our method generates an output of the exact object regions instead of the bounding boxes generated by most detectors. In general, our method takes detection bounding boxes of a generic detector as input and generates the detection output with higher average precision and precise object regions. Experiments on four recent datasets demonstrate the effectiveness of our approach, which significantly improves the state-of-the-art detector by 5-16% in average precision.
2 0.25745201 388 cvpr-2013-Semi-supervised Learning of Feature Hierarchies for Object Detection in a Video
Author: Yang Yang, Guang Shu, Mubarak Shah
Abstract: We propose a novel approach to boost the performance of generic object detectors on videos by learning video-specific features using a deep neural network. The insight behind our proposed approach is that an object appearing in different frames of a video clip should share similar features, which can be learned to build better detectors. Unlike many supervised detector adaptation or detection-by-tracking methods, our method does not require any extra annotations or utilize temporal correspondence. We start with the high-confidence detections from a generic detector, then iteratively learn new video-specific features and refine the detection scores. In order to learn discriminative and compact features, we propose a new feature learning method using a deep neural network based on autoencoders. It differs from the existing unsupervised feature learning methods in two ways: first, it optimizes both discriminative and generative properties of the features simultaneously, which gives our features better discriminative ability; second, our learned features are more compact, while the unsupervised feature learning methods usually learn a redundant set of over-complete features. Extensive experimental results on person and horse detection show that significant performance improvement can be achieved with our proposed method.
3 0.24179107 398 cvpr-2013-Single-Pedestrian Detection Aided by Multi-pedestrian Detection
Author: Wanli Ouyang, Xiaogang Wang
Abstract: In this paper, we address the challenging problem of detecting pedestrians who appear in groups and have interaction. A new approach is proposed for single-pedestrian detection aided by multi-pedestrian detection. A mixture model of multi-pedestrian detectors is designed to capture the unique visual cues which are formed by nearby multiple pedestrians but cannot be captured by single-pedestrian detectors. A probabilistic framework is proposed to model the relationship between the configurations estimated by single- and multi-pedestrian detectors, and to refine the single-pedestrian detection result with multi-pedestrian detection. It can integrate with any single-pedestrian detector without significantly increasing the computation load. 15 state-of-the-art single-pedestrian detection approaches are investigated on three widely used public datasets: Caltech, TUD-Brussels and ETH. Experimental results show that our framework significantly improves all these approaches. The average improvement is 9% on the Caltech-Test dataset, 11% on the TUD-Brussels dataset and 17% on the ETH dataset in terms of average miss rate. The lowest average miss rate is reduced from 48% to 43% on the Caltech-Test dataset, from 55% to 50% on the TUD-Brussels dataset and from 51% to 41% on the ETH dataset.
4 0.21699645 242 cvpr-2013-Label Propagation from ImageNet to 3D Point Clouds
Author: Yan Wang, Rongrong Ji, Shih-Fu Chang
Abstract: Recent years have witnessed a growing interest in understanding the semantics of point clouds in a wide variety of applications. However, point cloud labeling remains an open problem, due to the difficulty in acquiring sufficient 3D point labels towards training effective classifiers. In this paper, we overcome this challenge by utilizing the existing massive 2D semantic labeled datasets from decade-long community efforts, such as ImageNet and LabelMe, and a novel “cross-domain” label propagation approach. Our proposed method consists of two major novel components, Exemplar SVM based label propagation, which effectively addresses the cross-domain issue, and a graphical model based contextual refinement incorporating 3D constraints. Most importantly, the entire process does not require any training data from the target scenes, also with good scalability towards large scale applications. We evaluate our approach on the well-known Cornell Point Cloud Dataset, achieving much greater efficiency and comparable accuracy even without any 3D training data. Our approach shows further major gains in accuracy when the training data from the target scenes is used, outperforming state-of-the-art approaches with far better efficiency.
5 0.21682647 29 cvpr-2013-A Video Representation Using Temporal Superpixels
Author: Jason Chang, Donglai Wei, John W. Fisher_III
Abstract: We develop a generative probabilistic model for temporally consistent superpixels in video sequences. In contrast to supervoxel methods, object parts in different frames are tracked by the same temporal superpixel. We explicitly model flow between frames with a bilateral Gaussian process and use this information to propagate superpixels in an online fashion. We consider four novel metrics to quantify performance of a temporal superpixel representation and demonstrate superior performance when compared to supervoxel methods.
6 0.2137626 370 cvpr-2013-SCALPEL: Segmentation Cascades with Localized Priors and Efficient Learning
7 0.21029156 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
8 0.20999555 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection
9 0.18956487 363 cvpr-2013-Robust Multi-resolution Pedestrian Detection in Traffic Scenes
10 0.18713024 142 cvpr-2013-Efficient Detector Adaptation for Object Detection in a Video
11 0.18322419 357 cvpr-2013-Revisiting Depth Layers from Occlusions
12 0.18270859 458 cvpr-2013-Voxel Cloud Connectivity Segmentation - Supervoxels for Point Clouds
13 0.18021309 309 cvpr-2013-Nonparametric Scene Parsing with Adaptive Feature Relevance and Semantic Context
14 0.17668577 460 cvpr-2013-Weakly-Supervised Dual Clustering for Image Semantic Segmentation
15 0.16455036 364 cvpr-2013-Robust Object Co-detection
16 0.1632573 414 cvpr-2013-Structure Preserving Object Tracking
17 0.16278298 386 cvpr-2013-Self-Paced Learning for Long-Term Tracking
18 0.16138151 329 cvpr-2013-Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images
19 0.16041383 318 cvpr-2013-Optimized Pedestrian Detection for Multiple and Occluded People
20 0.15792283 387 cvpr-2013-Semi-supervised Domain Adaptation with Instance Constraints
topicId topicWeight
[(0, 0.324), (1, -0.071), (2, 0.065), (3, -0.118), (4, 0.138), (5, 0.027), (6, 0.196), (7, 0.083), (8, -0.047), (9, 0.08), (10, 0.029), (11, -0.222), (12, 0.127), (13, -0.035), (14, -0.015), (15, -0.026), (16, 0.022), (17, -0.1), (18, -0.148), (19, 0.12), (20, 0.047), (21, -0.063), (22, -0.092), (23, 0.059), (24, -0.11), (25, -0.011), (26, -0.123), (27, -0.01), (28, 0.02), (29, 0.04), (30, 0.072), (31, -0.004), (32, 0.022), (33, -0.078), (34, 0.017), (35, -0.002), (36, -0.046), (37, -0.03), (38, 0.052), (39, -0.033), (40, 0.077), (41, -0.053), (42, -0.023), (43, -0.043), (44, -0.054), (45, -0.031), (46, -0.017), (47, -0.025), (48, 0.025), (49, -0.088)]
simIndex simValue paperId paperTitle
same-paper 1 0.93956137 217 cvpr-2013-Improving an Object Detector and Extracting Regions Using Superpixels
Author: Guang Shu, Afshin Dehghan, Mubarak Shah
Abstract: We propose an approach to improve the detection performance of a generic detector when it is applied to a particular video. The performance of offline-trained object detectors is usually degraded in unconstrained video environments due to varying illumination, backgrounds and camera viewpoints. Moreover, most object detectors are trained using Haar-like features or gradient features but ignore video-specific features like consistent color patterns. In our approach, we apply a Superpixel-based Bag-of-Words (BoW) model to iteratively refine the output of a generic detector. Compared to other related work, our method builds a video-specific detector using superpixels, hence it can handle the problem of appearance variation. Most importantly, using a Conditional Random Field (CRF) along with our superpixel-based BoW model, we develop an algorithm to segment the object from the background. Therefore our method generates an output of the exact object regions instead of the bounding boxes generated by most detectors. In general, our method takes detection bounding boxes of a generic detector as input and generates the detection output with higher average precision and precise object regions. Experiments on four recent datasets demonstrate the effectiveness of our approach, which significantly improves the state-of-the-art detector by 5-16% in average precision.
2 0.7481603 29 cvpr-2013-A Video Representation Using Temporal Superpixels
Author: Jason Chang, Donglai Wei, John W. Fisher_III
Abstract: We develop a generative probabilistic model for temporally consistent superpixels in video sequences. In contrast to supervoxel methods, object parts in different frames are tracked by the same temporal superpixel. We explicitly model flow between frames with a bilateral Gaussian process and use this information to propagate superpixels in an online fashion. We consider four novel metrics to quantify performance of a temporal superpixel representation and demonstrate superior performance when compared to supervoxel methods.
3 0.72217482 370 cvpr-2013-SCALPEL: Segmentation Cascades with Localized Priors and Efficient Learning
Author: David Weiss, Ben Taskar
Abstract: We propose SCALPEL, a flexible method for object segmentation that integrates rich region-merging cues with mid- and high-level information about object layout, class, and scale into the segmentation process. Unlike competing approaches, SCALPEL uses a cascade of bottom-up segmentation models that is capable of learning to ignore boundaries early on, yet use them as a stopping criterion once the object has been mostly segmented. Furthermore, we show how such cascades can be learned efficiently. When paired with a novel method that generates better localized shape priors than our competitors, our method leads to a concise, accurate set of segmentation proposals; these proposals are more accurate on the PASCAL VOC2010 dataset than state-of-the-art methods that use re-ranking to filter much larger bags of proposals. The code for our algorithm is available online.
4 0.70719558 242 cvpr-2013-Label Propagation from ImageNet to 3D Point Clouds
Author: Yan Wang, Rongrong Ji, Shih-Fu Chang
Abstract: Recent years have witnessed a growing interest in understanding the semantics of point clouds in a wide variety of applications. However, point cloud labeling remains an open problem, due to the difficulty in acquiring sufficient 3D point labels towards training effective classifiers. In this paper, we overcome this challenge by utilizing the existing massive 2D semantic labeled datasets from decade-long community efforts, such as ImageNet and LabelMe, and a novel “cross-domain” label propagation approach. Our proposed method consists of two major novel components, Exemplar SVM based label propagation, which effectively addresses the cross-domain issue, and a graphical model based contextual refinement incorporating 3D constraints. Most importantly, the entire process does not require any training data from the target scenes, also with good scalability towards large scale applications. We evaluate our approach on the well-known Cornell Point Cloud Dataset, achieving much greater efficiency and comparable accuracy even without any 3D training data. Our approach shows further major gains in accuracy when the training data from the target scenes is used, outperforming state-of-the-art approaches with far better efficiency.
5 0.68331802 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation
Author: Luming Zhang, Mingli Song, Zicheng Liu, Xiao Liu, Jiajun Bu, Chun Chen
Abstract: Weakly supervised image segmentation is a challenging problem in the computer vision field. In this paper, we present a new weakly supervised image segmentation algorithm by learning the distribution of spatially structured superpixel sets from image-level labels. Specifically, we first extract graphlets from each image, where a graphlet is a small-sized graph consisting of superpixels as its nodes, and it encapsulates the spatial structure of those superpixels. Then, a manifold embedding algorithm is proposed to transform graphlets of different sizes into equal-length feature vectors. Thereafter, we use GMM to learn the distribution of the post-embedding graphlets. Finally, we propose a novel image segmentation algorithm, called graphlet cut, that leverages the learned graphlet distribution in measuring the homogeneity of a set of spatially structured superpixels. Experimental results show that the proposed approach outperforms state-of-the-art weakly supervised image segmentation methods, and its performance is comparable to those of the fully supervised segmentation models.
6 0.67906713 86 cvpr-2013-Composite Statistical Inference for Semantic Segmentation
7 0.67677891 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence
8 0.67575908 458 cvpr-2013-Voxel Cloud Connectivity Segmentation - Supervoxels for Point Clouds
9 0.67518425 212 cvpr-2013-Image Segmentation by Cascaded Region Agglomeration
10 0.66228294 460 cvpr-2013-Weakly-Supervised Dual Clustering for Image Semantic Segmentation
11 0.66112268 26 cvpr-2013-A Statistical Model for Recreational Trails in Aerial Images
12 0.64919388 398 cvpr-2013-Single-Pedestrian Detection Aided by Multi-pedestrian Detection
13 0.6448555 280 cvpr-2013-Maximum Cohesive Grid of Superpixels for Fast Object Localization
14 0.63013095 366 cvpr-2013-Robust Region Grouping via Internal Patch Statistics
15 0.62435412 142 cvpr-2013-Efficient Detector Adaptation for Object Detection in a Video
16 0.6235258 383 cvpr-2013-Seeking the Strongest Rigid Detector
17 0.61454207 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection
18 0.61320692 363 cvpr-2013-Robust Multi-resolution Pedestrian Detection in Traffic Scenes
19 0.60806966 388 cvpr-2013-Semi-supervised Learning of Feature Hierarchies for Object Detection in a Video
20 0.59341049 144 cvpr-2013-Efficient Maximum Appearance Search for Large-Scale Object Detection
topicId topicWeight
[(10, 0.153), (26, 0.044), (33, 0.32), (37, 0.124), (67, 0.157), (69, 0.043), (87, 0.079)]
simIndex simValue paperId paperTitle
1 0.94579327 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation
Author: Luming Zhang, Mingli Song, Zicheng Liu, Xiao Liu, Jiajun Bu, Chun Chen
Abstract: Weakly supervised image segmentation is a challenging problem in the computer vision field. In this paper, we present a new weakly supervised image segmentation algorithm by learning the distribution of spatially structured superpixel sets from image-level labels. Specifically, we first extract graphlets from each image, where a graphlet is a small-sized graph consisting of superpixels as its nodes, and it encapsulates the spatial structure of those superpixels. Then, a manifold embedding algorithm is proposed to transform graphlets of different sizes into equal-length feature vectors. Thereafter, we use GMM to learn the distribution of the post-embedding graphlets. Finally, we propose a novel image segmentation algorithm, called graphlet cut, that leverages the learned graphlet distribution in measuring the homogeneity of a set of spatially structured superpixels. Experimental results show that the proposed approach outperforms state-of-the-art weakly supervised image segmentation methods, and its performance is comparable to those of the fully supervised segmentation models.
2 0.94533873 2 cvpr-2013-3D Pictorial Structures for Multiple View Articulated Pose Estimation
Author: Magnus Burenius, Josephine Sullivan, Stefan Carlsson
Abstract: We consider the problem of automatically estimating the 3D pose of humans from images, taken from multiple calibrated views. We show that it is possible and tractable to extend the pictorial structures framework, popular for 2D pose estimation, to 3D. We discuss how to use this framework to impose view, skeleton, joint angle and intersection constraints in 3D. The 3D pictorial structures are evaluated on multiple view data from a professional football game. The evaluation is focused on computational tractability, but we also demonstrate how a simple 2D part detector can be plugged into the framework.
3 0.94194025 334 cvpr-2013-Pose from Flow and Flow from Pose
Author: Katerina Fragkiadaki, Han Hu, Jianbo Shi
Abstract: Human pose detectors, although successful in localising faces and torsos of people, often fail with lower arms. Motion estimation is often inaccurate under fast movements of body parts. We build a segmentation-detection algorithm that mediates the information between body parts recognition, and multi-frame motion grouping to improve both pose detection and tracking. Motion of body parts, though not accurate, is often sufficient to segment them from their backgrounds. Such segmentations are crucial for extracting hard-to-detect body parts out of their interior body clutter. By matching these segments to exemplars we obtain pose labeled body segments. The pose labeled segments and corresponding articulated joints are used to improve the motion flow fields by proposing kinematically constrained affine displacements on body parts. The pose-based articulated motion model is shown to handle large limb rotations and displacements. Our algorithm can detect people under rare poses, frequently missed by pose detectors, showing the benefits of jointly reasoning about pose, segmentation and motion in videos.
4 0.94048506 264 cvpr-2013-Learning to Detect Partially Overlapping Instances
Author: Carlos Arteta, Victor Lempitsky, J. Alison Noble, Andrew Zisserman
Abstract: The objective of this work is to detect all instances of a class (such as cells or people) in an image. The instances may be partially overlapping and clustered, and hence quite challenging for traditional detectors, which aim at localizing individual instances. Our approach is to propose a set of candidate regions, and then select regions based on optimizing a global classification score, subject to the constraint that the selected regions are non-overlapping. Our novel contribution is to extend standard object detection by introducing separate classes for tuples of objects into the detection process. For example, our detector can pick a region containing two or three object instances, while assigning such region an appropriate label. We show that this formulation can be learned within the structured output SVM framework, and that the inference in such model can be accomplished using dynamic programming on a tree structured region graph. Furthermore, the learning only requires weak annotations – a dot on each instance. The improvement resulting from the addition of the capability to detect tuples of objects is demonstrated on quite disparate data sets: fluorescence microscopy images and UCSD pedestrians.
5 0.93898803 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence
Author: Guang Chen, Yuanyuan Ding, Jing Xiao, Tony X. Han
Abstract: Context has been playing an increasingly important role to improve the object detection performance. In this paper we propose an effective representation, Multi-Order Contextual co-Occurrence (MOCO), to implicitly model the high level context using solely detection responses from a baseline object detector. The so-called (1st-order) context feature is computed as a set of randomized binary comparisons on the response map of the baseline object detector. The statistics of the 1st-order binary context features are further calculated to construct a high order co-occurrence descriptor. Combining the MOCO feature with the original image feature, we can evolve the baseline object detector to a stronger context aware detector. With the updated detector, we can continue the evolution till the contextual improvements saturate. Using the successful deformable-part-model detector [13] as the baseline detector, we test the proposed MOCO evolution framework on the PASCAL VOC 2007 dataset [8] and Caltech pedestrian dataset [7]: The proposed MOCO detector outperforms all known state-of-the-art approaches, contextually boosting deformable part models (ver.5) [13] by 3.3% in mean average precision on the PASCAL 2007 dataset. For the Caltech pedestrian dataset, our method further reduces the log-average miss rate from 48% to 46% and the miss rate at 1 FPPI from 25% to 23%, compared with the best prior art [6].
6 0.93846542 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
7 0.93834692 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
8 0.93808198 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection
9 0.93729478 345 cvpr-2013-Real-Time Model-Based Rigid Object Pose Estimation and Tracking Combining Dense and Sparse Visual Cues
same-paper 10 0.9369182 217 cvpr-2013-Improving an Object Detector and Extracting Regions Using Superpixels
11 0.93480653 60 cvpr-2013-Beyond Physical Connections: Tree Models in Human Pose Estimation
12 0.93419635 160 cvpr-2013-Face Recognition in Movie Trailers via Mean Sequence Sparse Representation-Based Classification
13 0.93337005 414 cvpr-2013-Structure Preserving Object Tracking
14 0.93317902 45 cvpr-2013-Articulated Pose Estimation Using Discriminative Armlet Classifiers
15 0.9308942 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
16 0.93054044 288 cvpr-2013-Modeling Mutual Visibility Relationship in Pedestrian Detection
17 0.93022686 375 cvpr-2013-Saliency Detection via Graph-Based Manifold Ranking
18 0.92988116 314 cvpr-2013-Online Object Tracking: A Benchmark
19 0.92957109 275 cvpr-2013-Lp-Norm IDF for Large Scale Image Search
20 0.92855382 325 cvpr-2013-Part Discovery from Partial Correspondence