Author: Guang Shu, Afshin Dehghan, Mubarak Shah
Abstract: We propose an approach to improve the detection performance of a generic detector when it is applied to a particular video. The performance of offline-trained object detectors is usually degraded in unconstrained video environments due to varying illumination, backgrounds and camera viewpoints. Moreover, most object detectors are trained using Haar-like features or gradient features but ignore video-specific features like consistent color patterns. In our approach, we apply a superpixel-based Bag-of-Words (BoW) model to iteratively refine the output of a generic detector. Compared to other related work, our method builds a video-specific detector using superpixels, hence it can handle the problem of appearance variation. Most importantly, using a Conditional Random Field (CRF) along with our superpixel-based BoW model, we develop an algorithm to segment the object from the background. Therefore our method generates an output of the exact object regions instead of the bounding boxes generated by most detectors. In general, our method takes detection bounding boxes of a generic detector as input and generates the detection output with higher average precision and precise object regions. Experiments on four recent datasets demonstrate the effectiveness of our approach, which significantly improves the state-of-the-art detector by 5-16% in average precision.
1 The performance of offline-trained object detectors is usually degraded in unconstrained video environments due to varying illumination, backgrounds and camera viewpoints. [sent-2, score-0.426]
2 Moreover, most object detectors are trained using Haar-like features or gradient features but ignore video-specific features like consistent color patterns. [sent-3, score-0.254]
3 In our approach, we apply a Superpixel-based Bag-of-Words (BoW) model to iteratively refine the output of a generic detector. [sent-4, score-0.214]
4 Compared to other related work, our method builds a video-specific detector using superpixels, hence it can handle the problem of appearance variation. [sent-5, score-0.36]
5 Therefore our method generates an output of the exact object regions instead of the bounding boxes generated by most detectors. [sent-7, score-0.332]
6 In general, our method takes detection bounding boxes of a generic detector as input and generates the detection output with higher average precision and precise object regions. [sent-8, score-0.957]
7 Experiments on four recent datasets demonstrate the effectiveness of our approach, which significantly improves the state-of-the-art detector by 5-16% in average precision. [sent-9, score-0.301]
8 Introduction With the prevalence of video recording devices nowadays, the demand for automatically detecting objects in videos has significantly increased. [sent-11, score-0.175]
9 And it is very expensive to manually label examples and re-train the detector for each new video. [sent-13, score-0.373]
10 Second, most detectors are designed for a [sent-14, score-0.118]
11 In the upper right image, the red bounding boxes show the detection results of the DPM and the yellow bounding boxes show the missed detections. [sent-19, score-0.523]
12 The lower right image shows the results after refinement using our approach, in which all the objects are correctly detected and the object regions are extracted. [sent-20, score-0.136]
13 generic object class using Histograms of Oriented Gradients (HOG) [5] or Haar-like features [13]. [sent-21, score-0.165]
14 When applied on a particular video, they are not able to fully leverage the information presented in different frames of the video such as the consistent color pattern of objects and background. [sent-22, score-0.135]
15 A common technique of these approaches is to apply a coarse detector to the video and get initial detections, which are then added into the training set to improve the coarse detector. [sent-25, score-0.515]
16 While these approaches have proven effective, they can only adapt their appearance models based on the coarse detections and so are not truly adaptive. [sent-26, score-0.217]
17 On the other hand, detection-by-tracking approaches [18, 15, 3, 9] use trackers to improve the detection results for a particular video. [sent-27, score-0.143]
18 However, they may introduce more noise to the detection results if the tracker is not reliable for the video. [sent-28, score-0.175]
19 To address the above-mentioned problems, we propose the use of an online-learned appearance model to iteratively refine a generic detector. [sent-29, score-0.228]
20 These instances have variant poses but consistent color features. [sent-33, score-0.145]
21 Based on this assumption, we transfer the knowledge from the generic detector to a more video-specific detector by using the superpixel-based appearance model. [sent-34, score-0.738]
22 First we apply the original detector with a low detection threshold on every frame of a video and obtain a substantial amount of detection examples. [sent-36, score-0.587]
23 Those examples are initially labeled as positive or hard by their confidences. [sent-37, score-0.312]
24 Second, we extract superpixel features from all examples and make a Bag-of-Words representation for each example. [sent-39, score-0.317]
25 In the last step, we train an SVM model with positive and negative examples and label the hard examples iteratively. [sent-40, score-0.433]
26 Each time a small number of hard examples are conservatively added into the training set until the iterations converge. [sent-41, score-0.207]
27 Superpixels have been successfully applied in image segmentation [1], object localization [8] and tracking [16]. [sent-42, score-0.15]
28 On the other hand, the superpixels have great flexibility, which avoids the mis-alignment of the HOG and Haar-like features on variant poses of objects. [sent-44, score-0.544]
29 Using the mentioned advantages of superpixels along with the proposed algorithm, we also extract the regions of objects, as shown in Figure 1. [sent-45, score-0.488]
30 A confidence map which shows the likelihood of each pixel belonging to the target is made using a background generic model. [sent-46, score-0.281]
31 Different from any background subtraction method, our method requires no background modeling, hence it is not sensitive to camera motion and will still work with a moving camera. [sent-48, score-0.279]
32 In general, our algorithm can extract the object regions without prior knowledge of the object’s shape, and the output could serve as a more precise initialization for other applications such as tracking and recognition. [sent-49, score-0.29]
33 In this paper we take pedestrian detection as an example to illustrate our approach. [sent-50, score-0.152]
34 Related Work A substantial amount of work has been reported for building online learning approaches for object detection. [sent-57, score-0.147]
35 However, both approaches require a number of manually labeled examples as the initial training examples. [sent-61, score-0.213]
36 [11] presented a framework that can automatically label data and learns the classifier for detecting moving objects from video. [sent-62, score-0.176]
37 However, in [11] and [4] the initial coarse detectors are based on background subtraction, hence they don’t fit into the scenarios with complex background or moving camera. [sent-65, score-0.418]
38 al. [17] only learns objects having a similar appearance to the initial examples. [sent-69, score-0.194]
39 These approaches are likely to miss some hard examples with large appearance variations. [sent-70, score-0.313]
40 On the other hand, the detection-by-tracking approaches improve the detections by using trackers [18, 15, 3, 9]. [sent-71, score-0.147]
41 Initial Detection We employ the deformable part-based model detector (DPM) [7] as the initial detector in our approach since it has shown excellent performance in static images. [sent-82, score-0.617]
42 The detector is given a lower detection threshold td so we can obtain almost all true detections and a large amount of false alarms. [sent-83, score-0.549]
43 According to the detector’s confidence scores, we initially split all detections into two groups: the ones with confidence scores above a threshold are labeled as the positive examples; the rest are labeled as hard examples. [sent-84, score-0.605]
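The initial split described above can be sketched in a few lines of Python. This is only an illustration: the `(box, score)` tuple layout, the function name `split_detections`, and the positive threshold value of 0.0 are assumptions made for the example, not values taken from the paper.

```python
def split_detections(detections, positive_threshold):
    """Split (box, score) pairs into confident positives and uncertain hard examples.

    `detections` is assumed to be the output of the generic detector run
    with a low detection threshold, so it mixes true detections and false alarms.
    """
    positives = [d for d in detections if d[1] > positive_threshold]
    hard = [d for d in detections if d[1] <= positive_threshold]
    return positives, hard


# Hypothetical detections: ((x, y, width, height), confidence score).
detections = [
    ((10, 20, 50, 120), 0.8),
    ((200, 40, 45, 110), -1.5),
    ((90, 30, 48, 115), -0.2),
]
positives, hard = split_detections(detections, positive_threshold=0.0)
```

Only the high-confidence group is trusted as training data at this stage; the hard group is left unlabeled for the later iterative step.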
44 In addition, a large number of negative examples are randomly collected in a way that they do not overlap with any positive or hard examples. [sent-85, score-0.316]
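One simple way to collect such negatives is rejection sampling: draw random boxes and keep only those that overlap no detection. The sketch below assumes axis-aligned `(x, y, width, height)` boxes and a fixed negative box size; the helper names and the `max_tries` guard are hypothetical, and the paper does not specify its sampling procedure.

```python
import random


def boxes_overlap(a, b):
    """Return True if two (x, y, width, height) boxes share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def sample_negatives(frame_size, detections, box_size, count, rng, max_tries=10000):
    """Draw random boxes that do not overlap any positive or hard detection."""
    frame_w, frame_h = frame_size
    box_w, box_h = box_size
    negatives = []
    tries = 0
    while len(negatives) < count and tries < max_tries:
        tries += 1
        x = rng.randint(0, frame_w - box_w)
        y = rng.randint(0, frame_h - box_h)
        candidate = (x, y, box_w, box_h)
        if not any(boxes_overlap(candidate, d) for d in detections):
            negatives.append(candidate)
    return negatives


rng = random.Random(0)  # seeded so the example is repeatable
detections = [(10, 20, 50, 120), (200, 40, 45, 110)]
negatives = sample_negatives((640, 480), detections, (40, 100), 5, rng)
```

The function may return fewer than `count` boxes when the frame is densely covered by detections, which is why the caller should check the length of the result.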
45 Superpixels and Appearance Model Most object detectors use HOG or Haar-like features which can represent a generic object class well. [sent-89, score-0.327]
46 As shown in Figure 3, an individual can have variant poses that render a mis-alignment for HOG, Haar-like features or any other pixel-level features. [sent-91, score-0.145]
47 To handle this problem, we need to transfer the knowledge from a generic detector to a video-specific detector. [sent-92, score-0.464]
48 Therefore we build a statistical appearance model with the superpixels as units. [sent-93, score-0.46]
49 We segment each detection output into Nsp superpixels by using the SLIC Superpixels segmentation algorithm in [1]. [sent-96, score-0.582]
50 We choose an appropriate number Nsp so that one superpixel is roughly uniform in color and naturally preserves the boundaries of objects. [sent-97, score-0.16]
51 In order to encode both color and spatial information into superpixels, we describe each superpixel Sp(i) by a 5-dimensional feature vector f = (L, a, b, x, y), in which (L, a, b) is the average CIELAB colorspace value of all pixels and (x, y) is the average location of all pixels. [sent-98, score-0.2]
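As a rough illustration of the 5-dimensional descriptor f = (L, a, b, x, y), the sketch below averages color and position over each superpixel. It assumes the image is already converted to CIELAB and that a superpixel label map (for example from SLIC) is available; both are represented as nested Python lists here purely to keep the example self-contained, and the function name is hypothetical.

```python
from collections import defaultdict


def superpixel_features(lab_image, labels):
    """Return {label: (mean L, mean a, mean b, mean x, mean y)} per superpixel.

    lab_image[y][x] is an (L, a, b) tuple and labels[y][x] is the
    superpixel id of that pixel.
    """
    sums = defaultdict(lambda: [0.0] * 5)
    counts = defaultdict(int)
    for y, (pixel_row, label_row) in enumerate(zip(lab_image, labels)):
        for x, ((L, a, b), label) in enumerate(zip(pixel_row, label_row)):
            acc = sums[label]
            for k, value in enumerate((L, a, b, x, y)):
                acc[k] += value
            counts[label] += 1
    return {
        label: tuple(v / counts[label] for v in acc)
        for label, acc in sums.items()
    }


# Tiny 2x2 example with two superpixels (one per row).
lab_image = [[(50, 0, 0), (70, 0, 0)], [(10, 5, 5), (30, 5, 5)]]
labels = [[0, 0], [1, 1]]
features = superpixel_features(lab_image, labels)
```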
52 An M-word vocabulary is assembled by clustering all the superpixels using the K-means algorithm. [sent-99, score-0.399]
53 Then the superpixels are aggregated into an M-bin L2-normalized histogram for each example and later each example is represented in a BoW fashion. [sent-100, score-0.399]
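The histogram step can be sketched as follows, assuming the M vocabulary words (K-means cluster centers) have already been computed. Each superpixel feature is assigned to its nearest word by Euclidean distance, which is an assumption about the assignment rule; the function names are hypothetical.

```python
import math


def nearest_word(feature, vocabulary):
    """Index of the vocabulary word closest to the feature (Euclidean distance)."""
    return min(range(len(vocabulary)), key=lambda m: math.dist(feature, vocabulary[m]))


def bow_histogram(features, vocabulary):
    """M-bin, L2-normalized histogram of word assignments for one example."""
    hist = [0.0] * len(vocabulary)
    for feature in features:
        hist[nearest_word(feature, vocabulary)] += 1.0
    norm = math.sqrt(sum(h * h for h in hist))
    return [h / norm for h in hist] if norm > 0 else hist


# Two-word toy vocabulary and three superpixel features of one example.
vocabulary = [(0, 0, 0, 0, 0), (10, 10, 10, 10, 10)]
features = [(1, 0, 0, 0, 0), (9, 10, 10, 10, 10), (0, 1, 0, 0, 0)]
histogram = bow_histogram(features, vocabulary)
```

Because the histogram discards where each superpixel sits inside the example, two examples of the same person in different poses can still map to similar histograms.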
54 After each iteration we obtain SVM scores for the hard examples, and we split the hard examples into three groups again. [sent-106, score-0.338]
55 We move the examples with high scores into the positive set and the examples with low scores into the negative set, then re-train the SVM model. [sent-107, score-0.425]
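The relabeling loop can be sketched as below. To keep the example dependency-free, the SVM is replaced by a nearest-centroid scorer; this stand-in is not the paper's classifier, and the thresholds `high` and `low`, the iteration cap, and all function names are assumptions made for illustration.

```python
import math


def train_centroid_scorer(positives, negatives):
    """Stand-in for SVM training: returns a function scoring a BoW histogram.

    Score = distance to the negative centroid minus distance to the
    positive centroid, so larger values look more like positives.
    """
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos_c, neg_c = centroid(positives), centroid(negatives)
    return lambda x: math.dist(x, neg_c) - math.dist(x, pos_c)


def refine(positives, negatives, hard, train, high, low, max_iters=20):
    """Iteratively move confidently scored hard examples into the training sets."""
    positives, negatives, hard = list(positives), list(negatives), list(hard)
    for _ in range(max_iters):
        if not hard:
            break
        score = train(positives, negatives)
        still_hard = []
        moved = False
        for x in hard:
            s = score(x)
            if s > high:
                positives.append(x)
                moved = True
            elif s < low:
                negatives.append(x)
                moved = True
            else:
                still_hard.append(x)
        hard = still_hard
        if not moved:  # converged: nothing was relabeled in this round
            break
    return positives, negatives, hard


pos, neg, remaining = refine(
    positives=[[1.0, 0.0]],
    negatives=[[0.0, 1.0]],
    hard=[[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]],
    train=train_centroid_scorer,
    high=0.5,
    low=-0.5,
)
```

Keeping a middle band between `low` and `high` is what makes the update conservative: ambiguous examples stay in the hard set until the retrained model scores them decisively.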
56 After all the hard examples are labeled, we can project the positive examples back into the image sequence and generate the detection output. [sent-111, score-0.478]
57 Region Extraction The superpixel-based appearance model enables us not only to improve the detector, but also to precisely extract the regions of objects. [sent-114, score-0.15]
58 Since the superpixels can naturally preserve the boundary of objects, we develop an algorithm that takes the detection bounding box as input and calculates a confidence map indicating how likely each superpixel belongs to the target. [sent-115, score-0.906]
59 First, we cluster all superpixels of the negative samples into Mn clusters by CIELAB color features. [sent-116, score-0.539]
60 Then we calculate the similarities between all superpixels from positive examples and all the clusters. [sent-118, score-0.628]
61 prior(j)) , (1) in which Sp(i) is the i-th superpixel from positive examples and clst(j) is the j-th cluster center. [sent-121, score-0.389]
62 prior(j) is the prior probability that j-th cluster belongs to the background; this is defined by the number of superpixels in the j-th cluster. [sent-123, score-0.451]
63 After obtaining the similarity matrix W(i, j), we can calculate the confidence of a superpixel belonging to the target by the equation Q(i) = 1 − max_j W(i, j). [sent-125, score-0.325]
64 (2) Therefore we can obtain a confidence map Q for each positive example, as shown in the second row of Figure 4. [sent-126, score-0.225]
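Equation (2) reduces to one line of code once the similarity matrix is available. The sketch below takes W as a given nested list with entries in [0, 1]; how W is computed (Equation (1)) is not reproduced here, and the function name is hypothetical.

```python
def target_confidence(similarity):
    """Q(i) = 1 - max_j W(i, j) for each superpixel i.

    similarity[i][j] holds W(i, j), the similarity between superpixel i
    of a positive example and background cluster j.
    """
    return [1.0 - max(row) for row in similarity]


# Superpixel 0 closely matches a background cluster; superpixel 1 does not.
W = [[0.9, 0.2], [0.1, 0.3]]
Q = target_confidence(W)
```

A superpixel that closely matches any background cluster gets a low confidence, so only superpixels unlike every background cluster are treated as likely target regions.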
65 In order to extract a precise region in the confidence map, a conditional random field (CRF) model [2] is utilized to learn the conditional distribution over the class labeling. [sent-127, score-0.239]
66 ∈ Edge Φ(ci, cj | si, sj), (3) where Ψ is the unary potential defined by the probability provided by the confidence map Q: Ψ(ci | si) = − log(Pr(ci | si)), (4) and Φ is the pairwise edge potential defined by Φ(ci, cj | si, sj) =? [sent-137, score-0.227]
67 After the CRF segmentation we will obtain a binary map on which the target and background are distinctly separated. [sent-148, score-0.132]
68 Note that in some positive examples, there are usually some superpixels that belong to other nearby targets but are labeled as target. [sent-149, score-0.504]
69 The first row shows some pedestrian examples, the second row shows the corresponding confidence maps and the last row shows the corresponding CRF segmentations. [sent-161, score-0.327]
70 We analyze the detector performance by computing Precision-Recall curves for all four datasets, as shown in Figure 6. [sent-165, score-0.301]
71 We set a detection threshold td = −2 to achieve a high recall. [sent-167, score-0.138]
72 For the superpixel segmentation, we set the number of superpixels for each example to Nsp = 100. [sent-168, score-0.676]
73 In the region extraction we set the number of negative clusters to be Mn = 200. [sent-170, score-0.174]
74 We also calculate the average precision for quantitative comparison which is used in [6]. [sent-172, score-0.115]
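For readers unfamiliar with the metric, the sketch below computes average precision as the area under an interpolated precision-recall curve. This is one common definition and is an assumption here; the exact protocol of [6] may differ (for example, by sampling precision at fixed recall levels). The function name and input layout are hypothetical.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the interpolated precision-recall curve.

    scores[i] is the confidence of detection i, is_true_positive[i] says
    whether it matched a ground-truth object, and num_ground_truth is
    the total number of annotated objects.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = 0
    fp = 0
    recalls = []
    precisions = []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Interpolate: precision at each rank becomes the best precision
    # achievable at that recall or any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap = 0.0
    prev_recall = 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


ap = average_precision([0.9, 0.8, 0.7], [True, False, True], num_ground_truth=2)
```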
75 While the initial detector takes around 15 seconds for each frame, our additional steps take only 3 seconds on average for each frame with a 3GHz CPU. [sent-176, score-0.307]
76 Figures 5 and 7 show the detection results; Figure 8(a) shows the region extraction results. [sent-178, score-0.18]
77 The green bounding boxes are the output by DPM detector; the red bounding boxes are the output by our approach. [sent-182, score-0.478]
78 It is clear that our approach has fewer false positives as well as false negatives. [sent-183, score-0.114]
79 PNNL Parking lot Dataset (PL): This dataset consists of two video sequences collected in a parking lot using a static camera. [sent-187, score-0.723]
80 Parking lot 1 is a moderately crowded scene including groups of pedestrians walking in queues with parallel motion and similar appearance. [sent-188, score-0.289]
81 Parking lot 2 is a more challenging sequence due to the large amounts of pose variations and occlusions, hence the results on this dataset are lower than other datasets. [sent-189, score-0.18]
82 However, our approach still performs significantly better than the DPM detector and HOG feature. [sent-190, score-0.256]
83 Tables 1 and 2 show that we outperform the DPM detector both in precision and average precision by a significant margin. [sent-194, score-0.382]
84 The original detector has already achieved satisfactory results but our approach performs even better. [sent-197, score-0.299]
85 The second row shows the results of our method using HOG as descriptors and the third row shows the proposed method using bag of words of superpixels. [sent-200, score-0.104]
86 Conclusion We proposed an effective method to improve generic detectors and extract object regions using a superpixels-based Bag-of-Words model. [sent-228, score-0.372]
87 Our method captures rich information about individuals by superpixels; hence it is highly discriminative and robust against appearance changes. [sent-229, score-0.104]
88 We employ a part-based human detector to obtain initial labels and gradually refine the detections in an iterative way. [sent-230, score-0.503]
89 We also present a region extraction algorithm that extracts the regions of objects. [sent-231, score-0.135]
90 We demonstrated by experiments that our method effectively improves the performance of object detectors in four recent datasets. [sent-232, score-0.207]
91 Segmentation of objects in a detection window by nonparametric inhomogeneous crfs. [sent-252, score-0.137]
92 Unsupervised and simultaneous training of multiple object detectors from unlabeled surveillance video. [sent-264, score-0.21]
93 We compared our method against the original detector, once using HOG and once using bag-of-words of superpixels as the feature. [sent-273, score-0.655]
96 The first row shows the original detection window; the second row shows our segmentation results using CRF. [sent-283, score-0.24]
97 An unsupervised, online learning framework for moving object detection. [sent-323, score-0.15]
98 Online detection and classification of moving objects using progressively improving detectors. [sent-329, score-0.191]
99 Rapid object detection using a boosted cascade of simple features. [sent-334, score-0.138]
100 Detection by detections: Non-parametric detector adaptation for a video. [sent-364, score-0.256]
