iccv iccv2013 iccv2013-242 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Siyu Tang, Mykhaylo Andriluka, Anton Milan, Konrad Schindler, Stefan Roth, Bernt Schiele
Abstract: People tracking in crowded real-world scenes is challenging due to frequent and long-term occlusions. Recent tracking methods obtain the image evidence from object (people) detectors, but typically use off-the-shelf detectors and treat them as black box components. In this paper we argue that for best performance one should explicitly train people detectors on failure cases of the overall tracker instead. To that end, we first propose a novel joint people detector that combines a state-of-the-art single person detector with a detector for pairs of people, which explicitly exploits common patterns of person-person occlusions across multiple viewpoints that are a frequent failure case for tracking in crowded scenes. To explicitly address remaining failure modes of the tracker we explore two methods. First, we analyze typical failures of trackers and train a detector explicitly on these cases. And second, we train the detector with the people tracker in the loop, focusing on the most common tracker failures. We show that our joint multi-person detector significantly improves both detection accuracy and tracker performance, improving the state-of-the-art on standard benchmarks.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper we argue that for best performance one should explicitly train people detectors on failure cases of the overall tracker instead. [sent-3, score-0.54]
2 First, we analyze typical failures of trackers and train a detector explicitly on these cases. [sent-6, score-0.524]
3 And second, we train the detector with the people tracker in the loop, focusing on the most common tracker failures. [sent-7, score-0.876]
4 We show that our joint multi-person detector significantly improves both detection accuracy and tracker performance, improving the state-of-the-art on standard benchmarks. [sent-8, score-0.689]
5 Introduction: People detection is a key building block of most state-of-the-art people tracking methods [3, 22, 23]. [sent-10, score-0.542]
6 Although the performance of people detectors has improved tremendously in recent years, detecting partially occluded people remains a weakness of current approaches [8]. [sent-11, score-0.622]
7 This is also a key limiting factor when tracking people in crowded environments, such as typical street scenes, where many people remain occluded for long periods of time, or may not even become fully visible for the entire duration of the sequence. [sent-12, score-1.145]
8 The starting point of this paper is the observation that people detectors used for tracking are typically trained independently from the tracker, and are thus not specifically tailored to the tracking task (Fig. 1). [sent-13, score-0.521]
9 In contrast, the present work aims to train people detectors explicitly to address failure modes of tracking in order to improve overall tracking performance. [sent-18, score-0.913]
10 However, this is not straightforward, since many tracking failures are related to frequent and long-term occlusions, a typical failure case also for people detectors. [sent-19, score-0.71]
11 We address this problem in two steps: First, we target the limitations of people detection in crowded street scenes with many occlusions. [sent-20, score-0.663]
12 We build on these ideas, focusing on person-person occlusions, which are the dominant occlusion type in crowded street scenes. [sent-24, score-0.771]
13 Finally, some of the person-person occlusion cases are already handled well by existing tracking approaches. [sent-29, score-0.631]
14 We argue that the decision about incorporating certain types of occlusion patterns into the detector should be made in a tracking-aware fashion, either by manually observing typical tracking failures or by directly integrating the tracker into the detector training. [sent-32, score-1.786]
15 First, we manually define relevant occlusion patterns using a discretization of the mutual arrangement of people. [sent-34, score-0.671]
16 In addition, we train the detector with the tracker in the loop, by automatically identifying occlusion patterns based on regularities in the failure modes of the tracker. [sent-35, score-1.222]
17 We demonstrate that this tighter integration of tracker and detector improves tracking results on three challenging benchmark sequences. [sent-36, score-0.753]
18 Many recent methods for multi-person tracking [1, 2, 3, 22] follow the tracking-by-detection paradigm and use the output of people detectors as initial state space for tracking. [sent-38, score-0.521]
19 Although these methods are often robust to false positive detections and are able to fill in some missing detections due to short-term occlusions, they typically require successful detection before and after the occlusion events, thus limiting their applicability in crowded scenes. [sent-39, score-0.776]
20 Recently, [17] proposed a people detector for crowded street environments that exploits characteristic appearance patterns from person-person occlusions. [sent-43, score-1.15]
21 We generalize this approach in several ways: First, we reformulate the approach as a structured prediction problem, which allows us to explicitly penalize activations of single-person detector components on examples with two people and vice versa. [sent-47, score-0.707]
22 Moreover, we generalize the joint detection approach of [17] to cope with a variety of viewpoints, not just side views, which is important when using the detector for tracking in more general scenes. [sent-49, score-0.854]
23 To address this we propose an approach tailored to the requirements of people tracking, and in particular train a people detector based on feedback from the tracker. [sent-51, score-0.824]
24 Unlike previous work, here we not only consider detection and tracking jointly, but also explicitly adapt the detector to typical tracking failures. [sent-57, score-0.986]
25 We now use the DPM model to build a joint people detector, which overcomes the limitations imposed by frequent occlusions in real-world street scenes. [sent-92, score-0.628]
26 To address this we explicitly integrate multi-view person/person occlusion patterns into a joint DPM detector. [sent-94, score-0.815]
27 We model our joint detector as a mixture of components that capture appearance patterns of either a single person or a person/person occlusion pair. [sent-100, score-1.185]
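As a hedged sketch (not the paper's exact formulation), a DPM-style mixture score for such a joint detector, with components c that are either single-person or person/person-pair templates and latent part placements z, would take the form:

```latex
f_w(x) \;=\; \max_{c \in \mathcal{C}} \; \max_{z \in Z(c)} \; \bigl\langle w_c, \Phi(x, z) \bigr\rangle,
\qquad \mathcal{C} = \mathcal{C}_{\text{single}} \cup \mathcal{C}_{\text{pair}}.
```

The best-scoring component then determines whether one or two bounding boxes are emitted for the detection.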
28 We then introduce an explicit variable modeling the detection type, with the goal of enabling the joint detector to distinguish between a single person and a highly occluded person pair. [sent-101, score-1.08]
29 Incorporating the detection type into the structural loss then allows us to force the joint detector to learn the fundamental appearance difference between a single person and a person/person pair. [sent-102, score-0.899]
30 Before going into detail on learning occlusion patterns in Sec. [sent-103, score-0.596]
31 The advantage of the proposed structured learning of a joint people detector is that it learns that a detection with larger overlap with the ground truth bounding box has a higher score than a detection with lower overlap. [sent-130, score-1.049]
32 Hence, the single-person component should also have a lower score than the double-person component on double-person examples. [sent-131, score-0.697]
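A hedged sketch of the structured-SVM constraint implied here, using standard margin rescaling with the intersection-over-union (VOC) loss; the paper's exact formulation may differ:

```latex
\langle w, \Psi(x, y) \rangle \;\geq\; \langle w, \Psi(x, \hat{y}) \rangle \;+\; \Delta_{\mathrm{VOC}}(y, \hat{y}) \;-\; \xi,
\qquad
\Delta_{\mathrm{VOC}}(y, \hat{y}) \;=\; 1 \;-\; \frac{|y \cap \hat{y}|}{|y \cup \hat{y}|}.
```

Under this loss, hypotheses with lower ground-truth overlap must be separated by a larger margin, which yields exactly the score ordering described above.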
33 One limitation of the loss ΔVOC for joint person detection is that it does not sufficiently encourage the model to distinguish between a single person and a highly occluded double-person pair. [sent-134, score-1.019]
34 In order to teach the model to distinguish between a single person and a highly occluded person pair, we extend the structured output label with a detection type variable ydt ∈ {1, 2}, which denotes single-person or double-person detection. [sent-137, score-1.169]
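One plausible form of the resulting type-aware loss (an assumption for illustration; the paper's exact definition may differ) keeps the VOC term when the predicted detection type matches the ground truth and assigns the maximal loss otherwise:

```latex
\Delta\bigl((y, y_{dt}), (\hat{y}, \hat{y}_{dt})\bigr) \;=\;
\begin{cases}
\Delta_{\mathrm{VOC}}(y, \hat{y}) & \text{if } y_{dt} = \hat{y}_{dt}, \\
1 & \text{otherwise.}
\end{cases}
```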
35 To compare our joint detector with the joint detector proposed in [17], we explicitly train a side-view joint person detector using the same synthetic training images and initialize the single- and double-person detector components in the same way. [sent-144, score-2.612]
36 The joint detector further improves precision and achieves recall similar to [17]. [sent-149, score-0.629]
37 Note that the employed tracking approach does not include any explicit occlusion handling. [sent-186, score-0.603]
38 Table 1 shows tracking results on the TUD-Crossing sequence [1], using various detector variants as described above. [sent-189, score-0.668]
39 As expected, tracking based on the output of the joint detector shows improved performance compared to the single-person DPM detector. [sent-190, score-0.77]
40 Note that the side-view joint detector of Tang et al. [sent-191, score-0.53]
41 Learning People Detectors for Tracking: So far we have shown that the proposed structured learning approach for training joint people detectors yields significant improvements for detection of occluded people in side-view street scenes. [sent-198, score-1.057]
42 This suggests the potential of leveraging characteristic appearance patterns of person/person pairs also for detecting occluded people in more general settings. [sent-199, score-0.599]
43 However, the generalization of this idea to crowded scenes with people walking in arbitrary directions is rather challenging due to the vast number of possible person-person occlusion patterns. (Figure caption: synthetically generated training images for different occlusion patterns and walking directions.) [sent-200, score-1.365]
44 The number of putative occlusion patterns is exponential in the number of factors. [sent-203, score-0.596]
45 For example, short-term occlusions resulting from people crossing each other’s way are frequent, but can often be easily resolved by modern tracking algorithms. [sent-205, score-0.548]
46 Therefore, finding occlusion patterns that are relevant in practice in order to reduce the modeling space is essential for applying joint person detectors for tracking in general crowded scenes. [sent-206, score-1.519]
47 We now propose two methods for discovering occlusion patterns for people walking in arbitrary directions by (a) manually designing regular occlusion combinations that appear frequently due to long-term occlusions and are, therefore, most relevant for tracking (Sec. [sent-207, score-1.728]
48 1); and (b) automatically learning a joint detector that exploits the tracking performance on occluded people and is explicitly optimized for the tracking task (Sec. [sent-209, score-1.367]
49 Designing occlusion patterns: For many state-of-the-art trackers, the most important cases for improving tracking performance in crowded scenes correspond to long-term partial occlusions. [sent-214, score-1.104]
50 We begin by quantizing the space of possible occlusion patterns as shown in Fig. [sent-216, score-0.624]
51 We restrict ourselves to cases in which people walk in the same direction, as they cause long-term occlusions and moreover appear to have sufficient regularity in appearance, which is essential for detection performance in crowded scenes. [sent-228, score-0.619]
52 The occlusion patterns that we consider in the rest of this analysis correspond to a combination of the four walking directions of the subjects and one of the three remaining sectors (“A”, “B” or “C”). [sent-229, score-0.775]
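As a minimal sketch of this quantization, the following enumerates the 4 × 3 = 12 designed patterns; the direction labels are hypothetical placeholders, while the sector names follow the quantization above:

```python
from itertools import product

# Assumed labels: four shared walking directions of the person pair,
# and the three remaining occluder sectors from the quantization.
WALKING_DIRECTIONS = ("front", "back", "left", "right")
OCCLUDER_SECTORS = ("A", "B", "C")

def designed_occlusion_patterns():
    """Enumerate the 12 (walking direction, occluder sector) patterns."""
    return list(product(WALKING_DIRECTIONS, OCCLUDER_SECTORS))
```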
53 Our joint detector uses a mixture of components that capture appearance patterns of either a single person or a person/person occlusion pair. [sent-231, score-1.37]
54 In the case of double-person components, we generate two bounding boxes (one per person) instead of one for each of the component’s detections. [sent-232, score-0.542]
55 However, in the multi-view setting, the same degree of occlusion can result in very different occlusion patterns. [sent-237, score-0.726]
56 Here, we instead initialize the components from the quantized occlusion patterns from above (Fig. [sent-238, score-0.63]
57 We collect 2400 images of people walking in 8 different directions to construct a synthetic training image pool. [sent-245, score-0.526]
58 In a similar fashion, we are able to generate training examples for different occlusion patterns and walking directions by overlaying people on top of each other in a novel image. [sent-248, score-0.905]
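A minimal sketch of such compositing, assuming segmented person cut-outs with soft alpha masks (all names here are hypothetical):

```python
import numpy as np

def composite_person(scene, person, mask, top_left):
    """Alpha-blend a segmented person cut-out onto a scene image.

    scene: HxWx3 float array; person: hxwx3 float array;
    mask: hxw float array in [0, 1]; top_left: (row, col) placement.
    """
    r, c = top_left
    h, w = mask.shape
    alpha = mask[..., None]
    scene[r:r + h, c:c + w] = (
        alpha * person + (1.0 - alpha) * scene[r:r + h, c:c + w]
    )
    return scene

# To realize a given occlusion pattern, paste the occludee first and
# then the occluder at the relative offset dictated by the pattern.
```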
59 Mining occlusion patterns from tracking: As we will see in Sec. [sent-259, score-0.836]
60 5 in detail, carefully analyzing and designing occlusion patterns by hand already allows us to train a joint detector that generalizes to more realistic and challenging crowded street scenes. [sent-260, score-1.538]
61 Nonetheless, the question remains which manually designed occlusion patterns are most relevant for successful tracking. [sent-261, score-0.71]
62 Furthermore, it is still unclear whether it is reasonable to harvest difficult cases from tracking failures and explicitly guide the joint detector to concentrate on those. [sent-262, score-0.888]
63 In the following, we describe a method to learn a joint detector specifically for tracking. [sent-263, score-0.53]
64 We employ tracking performance evaluation, occlusion pattern mining, synthetic image generation, and detector training jointly to optimize the detector for tracking multiple targets. [sent-264, score-1.363]
65 We use the same synthetic training images to train a single-person baseline detector as we used for training the single-person component of our joint detector with manually designed occlusion patterns (see Sec. [sent-269, score-1.441]
66 Output: A joint detector that is tailored to detect occlusion patterns that are most relevant for multi-target tracking. [sent-276, score-1.158]
67 Occlusion pattern mining (step 5): The majority of missed targets are occlusion-related. [sent-279, score-0.77]
68 Figure caption: S2.L2 mining sequence and mined occlusion patterns: (a) no person nearby; (b) interfered with by one person; (c) interfered with by more persons; (d) mined occlusion pattern, 1st iteration; (e) mined occlusion pattern, 2nd iteration. [sent-284, score-1.969]
69 Here, we concentrate on mining occlusion patterns for pairs of persons and consider the multiple-people situation as a special case of a person pair, augmented by distractions from the surroundings. [sent-288, score-1.266]
70 Note that our algorithm can easily be generalized to multiple-people occlusion patterns, given a sufficient number of mining sequences that contain certain distributions of multi-people occlusion patterns. [sent-289, score-1.419]
71 From the missed targets (step 4), we determine the problematic occlusion patterns and cluster them in terms of the relative position of the occluder/occludee pair. [sent-290, score-0.836]
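A hedged sketch of this clustering step, with hypothetical names (the exact procedure is not specified in the extracted text): occluder/occludee offsets, normalized by person height, are quantized on a coarse grid, and the dominant bins are returned as candidate patterns:

```python
from collections import Counter

def mine_occlusion_patterns(missed_pairs, grid=0.25, top_k=1):
    """Cluster occluder/occludee pairs by quantized relative position.

    missed_pairs: iterable of (occludee_box, occluder_box); each box
    exposes a center (cx, cy) and a height h. Returns the top_k most
    frequent (dx, dy) bins as candidate occlusion patterns.
    """
    bins = Counter()
    for occludee, occluder in missed_pairs:
        dx = (occluder.cx - occludee.cx) / occludee.h
        dy = (occluder.cy - occludee.cy) / occludee.h
        bins[(round(dx / grid) * grid, round(dy / grid) * grid)] += 1
    return [pattern for pattern, _ in bins.most_common(top_k)]
```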
72 Figures 5(d) and 5(e) show the dominant occlusion patterns of the first and second mining iterations. [sent-293, score-0.649]
73 Note that we only mine occlusion patterns and no additional image information (see next step). [sent-294, score-0.639]
74 Synthetic training example generation (step 6): We generate synthetic training images for the mined occlusion pattern using the same synthetic image pool as in Sec. [sent-295, score-0.718]
75 Training image generation, in principle, thus enables us to model arbitrary occlusion patterns in each iteration. [sent-300, score-0.596]
76 We generate 200 images for every new occlusion pattern, which amounts to the same number of training images as we used in the context of manually designed occlusion patterns. [sent-301, score-0.859]
77 Joint detector training with mined occlusion patterns (step 7): The single-person component of the joint detector is initialized with the same training images as the baseline detector. [sent-303, score-1.704]
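Putting steps 4–7 together, a minimal sketch of the tracker-in-the-loop training; the callables and attribute names are hypothetical placeholders for the components described above:

```python
def train_with_tracker_in_loop(detector, tracker, mining_seq, image_pool,
                               find_missed, mine_patterns, synthesize, retrain,
                               iterations=2, images_per_pattern=200):
    """Iteratively adapt the joint detector to the tracker's failure modes."""
    for _ in range(iterations):
        tracks = tracker.run(mining_seq, detector)            # run the tracker
        missed = find_missed(tracks, mining_seq.annotations)  # step 4
        patterns = mine_patterns(missed)                      # step 5
        images = [synthesize(image_pool, p, images_per_pattern)
                  for p in patterns]                          # step 6
        detector = retrain(detector, images)                  # step 7
    return detector
```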
78 Experiments: We evaluate the performance of the proposed joint person detector with learned occlusion patterns and its application to tracking on three publicly available and particularly challenging sequences: PETS S2.L2, TUD-Crossing, and ParkingLot. [sent-309, score-1.551]
79 Note that our mining algorithm only extracts occlusion patterns and no additional image information. [sent-318, score-0.799]
80 Next, we evaluate the performance of our joint detector with manually designed occlusion patterns (see Fig. [sent-406, score-1.208]
81 The joint detector (blue) shows its advantage by outperforming the single-person detector on all sequences. [sent-408, score-0.884]
82 On the S2.L2 test sequence, the joint detector outperforms the baseline detector by a large margin. [sent-412, score-0.916]
83 These detection results suggest that joint detection is much more powerful than single-person detection; the designed occlusion patterns correspond to compact appearance and can be detected well. [sent-414, score-1.004]
84 Using the joint detector (Joint-Design) yields a remarkable performance boost on the S2.L2 sequence. [sent-416, score-0.689]
85 On the S2.L2 and ParkingLot sequences, the joint detector also outperforms the single-person detector, with a significantly higher recall achieved by detecting more occluded targets. [sent-425, score-1.047]
86 By carefully analyzing and designing the occlusion patterns, we obtain very competitive results on publicly available sequences, both in terms of detection and tracking, which shows the advantage of the proposed joint detector for tracking people in crowded scenes. [sent-426, score-1.716]
87 We report the joint detector performance for one and two mining iterations. [sent-429, score-0.733]
88 We use S2.L2 (frames 1–218) as the mining sequence, extracting occlusion patterns, but no further image information. [sent-431, score-0.566]
89 On the S2.L2 test sequence (frames 219–436), which is more similar to the mining sequence than the other two sequences, our joint detector (black, Joint-Learn 1st, 56.5% MOTA) is nearly on par with the hand-designed patterns after the first iteration, as shown in Fig. [sent-433, score-1.114]
90 This is because the most dominant occlusion pattern is already captured and learned by the joint detector. [sent-435, score-0.976]
91 On the S2.L2 test sequence, however, the precision slightly decreases, because the dominant occlusion pattern of the second iteration only contains about 48 missed targets, compared to 5861 ground truth annotations, thus limiting the potential performance improvement and introducing potential false positives. [sent-437, score-0.653]
92 On S2.L2, the learned joint detector (black) is already slightly better than the Joint-Design detector after the first iteration, as shown in Fig. [sent-447, score-0.912]
93 Mining the occlusion patterns from the tracker improves the accuracy (MOTA) with each iteration. [sent-454, score-0.845]
94 Note that, similar to the findings above, the tracking performance reaches competitive levels after only one iteration, when compared to manually designed occlusion patterns. [sent-458, score-0.709]
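For reference, MOTA is the multiple object tracking accuracy of the CLEAR MOT metrics, aggregating false negatives, false positives, and identity switches over all frames t relative to the total number of ground truth objects:

```latex
\mathrm{MOTA} \;=\; 1 \;-\; \frac{\sum_t \bigl(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\bigr)}{\sum_t \mathrm{GT}_t}.
```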
95 In Fig. 6(c), the joint detector from the first iteration outperforms all other detectors, and reaches similar performance for tracking (Tab. [sent-464, score-0.817]
96 The results on the S2.L2 and ParkingLot sequences suggest that our detector learning algorithm is not limited to particular occlusion patterns or crowd densities. [sent-476, score-0.989]
97 To that end, we plan to build a large dataset of crowded street scenes to mine a more diverse set of occlusion patterns. [sent-479, score-0.767]
98 Another promising future extension would be to learn a joint upper-body detector on extremely dense scenes, yielding specialized upper-body occlusion patterns. [sent-480, score-0.893]
99 Conclusion: We presented a novel joint person detector specifically designed to address common failure cases during tracking in crowded street scenes due to long-term inter-object occlusions. [sent-482, score-1.402]
100 First, we showed that the most common occlusion patterns can be designed manually, and second, we proposed to learn recurring constellations with the tracker in the loop. [sent-483, score-0.813]
wordName wordTfidf (topN-words)
[('occlusion', 0.363), ('detector', 0.354), ('tracking', 0.24), ('patterns', 0.233), ('crowded', 0.227), ('people', 0.218), ('mota', 0.211), ('mining', 0.203), ('person', 0.185), ('joint', 0.176), ('parkinglot', 0.164), ('pets', 0.153), ('tracker', 0.135), ('dpm', 0.123), ('motp', 0.118), ('prcsn', 0.103), ('rcll', 0.103), ('missed', 0.097), ('occluded', 0.096), ('street', 0.093), ('occlusions', 0.09), ('mined', 0.09), ('detection', 0.084), ('ydt', 0.082), ('occluder', 0.079), ('walking', 0.076), ('sequence', 0.074), ('targets', 0.074), ('double', 0.071), ('synthetic', 0.065), ('detectors', 0.063), ('mot', 0.063), ('sectors', 0.063), ('doubleperson', 0.062), ('yib', 0.062), ('structured', 0.058), ('frequent', 0.051), ('training', 0.051), ('dominant', 0.05), ('voc', 0.05), ('iteration', 0.047), ('failure', 0.047), ('bounding', 0.045), ('constellations', 0.043), ('manually', 0.043), ('explicitly', 0.043), ('mine', 0.043), ('scenes', 0.041), ('doublepersonoutscores', 0.041), ('interfered', 0.041), ('recall', 0.04), ('directions', 0.04), ('failures', 0.039), ('sequences', 0.039), ('designed', 0.039), ('position', 0.038), ('type', 0.038), ('detections', 0.037), ('loss', 0.037), ('concentrate', 0.036), ('front', 0.036), ('precision', 0.035), ('synthetically', 0.035), ('half', 0.034), ('components', 0.034), ('train', 0.034), ('pattern', 0.033), ('relevant', 0.032), ('yl', 0.032), ('andriyenko', 0.032), ('mpg', 0.032), ('baseline', 0.032), ('relative', 0.031), ('box', 0.03), ('designing', 0.03), ('trackers', 0.029), ('occluders', 0.029), ('constellation', 0.029), ('dehghan', 0.029), ('partbased', 0.029), ('already', 0.028), ('limiting', 0.028), ('quantizing', 0.028), ('regularities', 0.028), ('compositions', 0.028), ('persons', 0.028), ('leibe', 0.028), ('modes', 0.028), ('detecting', 0.027), ('trajectories', 0.026), ('orientation', 0.026), ('employ', 0.026), ('appearance', 0.025), ('typical', 0.025), ('improves', 0.024), ('competitive', 0.024), ('pedestrian', 0.023), ('cyan', 0.023), ('viewpoints', 0.023), ('boxes', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 242 iccv-2013-Learning People Detectors for Tracking in Crowded Scenes
Author: Siyu Tang, Mykhaylo Andriluka, Anton Milan, Konrad Schindler, Stefan Roth, Bernt Schiele
Abstract: People tracking in crowded real-world scenes is challenging due to frequent and long-term occlusions. Recent tracking methods obtain the image evidence from object (people) detectors, but typically use off-the-shelf detectors and treat them as black box components. In this paper we argue that for best performance one should explicitly train people detectors on failure cases of the overall tracker instead. To that end, we first propose a novel joint people detector that combines a state-of-the-art single person detector with a detector for pairs of people, which explicitly exploits common patterns of person-person occlusions across multiple viewpoints that are a frequent failure case for tracking in crowded scenes. To explicitly address remaining failure modes of the tracker we explore two methods. First, we analyze typical failures of trackers and train a detector explicitly on these cases. And second, we train the detector with the people tracker in the loop, focusing on the most common tracker failures. We show that our joint multi-person detector significantly improves both detection accuracy and tracker performance, improving the state-of-the-art on standard benchmarks.
2 0.25681496 424 iccv-2013-Tracking Revisited Using RGBD Camera: Unified Benchmark and Baselines
Author: Shuran Song, Jianxiong Xiao
Abstract: Despite significant progress, tracking is still considered to be a very challenging task. Recently, the increasing popularity of depth sensors has made it possible to obtain reliable depth easily. This may be a game changer for tracking, since depth can be used to prevent model drift and handle occlusion. We also observe that current tracking algorithms are mostly evaluated on a very small number of videos collected and annotated by different groups. The lack of a reasonable size and consistently constructed benchmark has prevented a persuasive comparison among different algorithms. In this paper, we construct a unified benchmark dataset of 100 RGBD videos with high diversity, propose different kinds of RGBD tracking algorithms using 2D or 3D model, and present a quantitative comparison of various algorithms with RGB or RGBD input. We aim to lay the foundation for further research in both RGB and RGBD tracking, and our benchmark is available at http://tracking.cs.princeton.edu.
3 0.24608856 269 iccv-2013-Modeling Occlusion by Discriminative AND-OR Structures
Author: Bo Li, Wenze Hu, Tianfu Wu, Song-Chun Zhu
Abstract: Occlusion presents a challenge for detecting objects in real world applications. To address this issue, this paper models object occlusion with an AND-OR structure which (i) represents occlusion at semantic part level, and (ii) captures the regularities of different occlusion configurations (i.e., the different combinations of object part visibilities). This paper focuses on car detection on street. Since annotating part occlusion on real images is time-consuming and error-prone, we propose to learn the AND-OR structure automatically using synthetic images of CAD models placed at different relative positions. The model parameters are learned from real images under the latent structural SVM (LSSVM) framework. In inference, an efficient dynamic programming (DP) algorithm is utilized. In experiments, we test our method on both car detection and car view estimation. Experimental results show that (i) our CAD simulation strategy is capable of generating occlusion patterns for real scenarios, (ii) the proposed AND-OR structure model is effective for modeling occlusions, which outperforms the deformable part-based model (DPM) [6, 10] in car detection on both our self-collected street parking dataset and the Pascal VOC 2007 car dataset [4], (iii) the learned model is on par with the state-of-the-art methods on car view estimation tested on two public datasets.
4 0.2413549 190 iccv-2013-Handling Occlusions with Franken-Classifiers
Author: Markus Mathias, Rodrigo Benenson, Radu Timofte, Luc Van_Gool
Abstract: Detecting partially occluded pedestrians is challenging. A common practice to maximize detection quality is to train a set of occlusion-specific classifiers, each for a certain amount and type of occlusion. Since training classifiers is expensive, only a handful are typically trained. We show that by using many occlusion-specific classifiers, we outperform previous approaches on three pedestrian datasets: INRIA, ETH, and Caltech USA. We present a new approach to train such classifiers. By reusing computations among different training stages, 16 occlusion-specific classifiers can be trained at only one tenth the cost of one full training. We show that test time cost also grows sub-linearly.
5 0.20039405 58 iccv-2013-Bayesian 3D Tracking from Monocular Video
Author: Ernesto Brau, Jinyan Guan, Kyle Simek, Luca Del Pero, Colin Reimer Dawson, Kobus Barnard
Abstract: … for tracking an unknown and changing number of people in a scene using video taken from a single, fixed viewpoint. We develop a Bayesian modeling approach for tracking people in 3D from monocular video with unknown cameras. Modeling in 3D provides natural explanations for occlusions and smoothness discontinuities that result from projection, and allows priors on velocity and smoothness to be grounded in physical quantities: meters and seconds vs. pixels and frames. We pose the problem in the context of data association, in which observations are assigned to tracks. A correct application of Bayesian inference to multitarget tracking must address the fact that the model’s dimension changes as tracks are added or removed, and thus, posterior densities of different hypotheses are not comparable. We address this by marginalizing out the trajectory parameters so the resulting posterior over data associations has constant dimension. This is made tractable by using (a) Gaussian process priors for smooth trajectories and (b) approximately Gaussian likelihood functions. Our approach provides a principled method for incorporating multiple sources of evidence; we present results using both optical flow and object detector outputs. Results are comparable to recent work on 3D tracking and, unlike others, our method requires no pre-calibrated cameras.
6 0.1783713 111 iccv-2013-Detecting Dynamic Objects with Multi-view Background Subtraction
7 0.17230441 230 iccv-2013-Latent Data Association: Bayesian Model Selection for Multi-target Tracking
8 0.16233917 318 iccv-2013-PixelTrack: A Fast Adaptive Algorithm for Tracking Non-rigid Objects
9 0.15825692 256 iccv-2013-Locally Affine Sparse-to-Dense Matching for Motion and Occlusion Estimation
10 0.15443258 322 iccv-2013-Pose Estimation and Segmentation of People in 3D Movies
11 0.15430556 120 iccv-2013-Discriminative Label Propagation for Multi-object Tracking with Sporadic Appearance Features
12 0.14906579 425 iccv-2013-Tracking via Robust Multi-task Multi-view Joint Sparse Representation
13 0.14895324 298 iccv-2013-Online Robust Non-negative Dictionary Learning for Visual Tracking
14 0.13704097 377 iccv-2013-Segmentation Driven Object Detection with Fisher Vectors
15 0.13484973 168 iccv-2013-Finding the Best from the Second Bests - Inhibiting Subjective Bias in Evaluation of Visual Tracking Algorithms
16 0.13407467 338 iccv-2013-Randomized Ensemble Tracking
17 0.13033962 403 iccv-2013-Strong Appearance and Expressive Spatial Models for Human Pose Estimation
18 0.13026218 359 iccv-2013-Robust Object Tracking with Online Multi-lifespan Dictionary Learning
19 0.12027729 366 iccv-2013-STAR3D: Simultaneous Tracking and Reconstruction of 3D Objects Using RGB-D Data
20 0.11340801 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary
topicId topicWeight
[(0, 0.257), (1, -0.029), (2, 0.02), (3, 0.032), (4, 0.132), (5, -0.136), (6, -0.096), (7, 0.166), (8, -0.12), (9, 0.146), (10, 0.006), (11, -0.117), (12, 0.04), (13, -0.031), (14, 0.018), (15, -0.051), (16, 0.054), (17, 0.114), (18, 0.084), (19, 0.062), (20, -0.159), (21, 0.045), (22, -0.04), (23, -0.074), (24, 0.06), (25, 0.03), (26, -0.064), (27, -0.112), (28, -0.013), (29, -0.12), (30, -0.058), (31, -0.006), (32, -0.015), (33, 0.006), (34, 0.105), (35, -0.014), (36, 0.112), (37, 0.056), (38, -0.097), (39, 0.011), (40, 0.081), (41, -0.038), (42, -0.062), (43, 0.001), (44, -0.003), (45, -0.008), (46, 0.02), (47, -0.008), (48, -0.015), (49, -0.125)]
simIndex simValue paperId paperTitle
same-paper 1 0.98584074 242 iccv-2013-Learning People Detectors for Tracking in Crowded Scenes
Author: Siyu Tang, Mykhaylo Andriluka, Anton Milan, Konrad Schindler, Stefan Roth, Bernt Schiele
Abstract: People tracking in crowded real-world scenes is challenging due to frequent and long-term occlusions. Recent tracking methods obtain the image evidence from object (people) detectors, but typically use off-the-shelf detectors and treat them as black box components. In this paper we argue that for best performance one should explicitly train people detectors on failure cases of the overall tracker instead. To that end, we first propose a novel joint people detector that combines a state-of-the-art single person detector with a detector for pairs of people, which explicitly exploits common patterns of person-person occlusions across multiple viewpoints that are a frequent failure case for tracking in crowded scenes. To explicitly address remaining failure modes of the tracker we explore two methods. First, we analyze typical failures of trackers and train a detector explicitly on these cases. And second, we train the detector with the people tracker in the loop, focusing on the most common tracker failures. We show that our joint multi-person detector significantly improves both detection accuracy and tracker performance, improving the state-of-the-art on standard benchmarks.
2 0.73141748 269 iccv-2013-Modeling Occlusion by Discriminative AND-OR Structures
Author: Bo Li, Wenze Hu, Tianfu Wu, Song-Chun Zhu
Abstract: Occlusion presents a challenge for detecting objects in real world applications. To address this issue, this paper models object occlusion with an AND-OR structure which (i) represents occlusion at semantic part level, and (ii) captures the regularities of different occlusion configurations (i.e., the different combinations of object part visibilities). This paper focuses on car detection on street. Since annotating part occlusion on real images is time-consuming and error-prone, we propose to learn the AND-OR structure automatically using synthetic images of CAD models placed at different relative positions. The model parameters are learned from real images under the latent structural SVM (LSSVM) framework. In inference, an efficient dynamic programming (DP) algorithm is utilized. In experiments, we test our method on both car detection and car view estimation. Experimental results show that (i) our CAD simulation strategy is capable of generating occlusion patterns for real scenarios, (ii) the proposed AND-OR structure model is effective for modeling occlusions, which outperforms the deformable part-based model (DPM) [6, 10] in car detection on both our self-collected street parking dataset and the Pascal VOC 2007 car dataset [4], (iii) the learned model is on par with the state-of-the-art methods on car view estimation tested on two public datasets.
3 0.70467079 338 iccv-2013-Randomized Ensemble Tracking
Author: Qinxun Bai, Zheng Wu, Stan Sclaroff, Margrit Betke, Camille Monnier
Abstract: We propose a randomized ensemble algorithm to model the time-varying appearance of an object for visual tracking. In contrast with previous online methods for updating classifier ensembles in tracking-by-detection, the weight vector that combines weak classifiers is treated as a random variable and the posterior distribution for the weight vector is estimated in a Bayesian manner. In essence, the weight vector is treated as a distribution that reflects the confidence among the weak classifiers used to construct and adapt the classifier ensemble. The resulting formulation models the time-varying discriminative ability among weak classifiers so that the ensembled strong classifier can adapt to the varying appearance, backgrounds, and occlusions. The formulation is tested in a tracking-by-detection implementation. Experiments on 28 challenging benchmark videos demonstrate that the proposed method can achieve results comparable to and often better than those of state-of-the-art approaches.
4 0.69265944 58 iccv-2013-Bayesian 3D Tracking from Monocular Video
Author: Ernesto Brau, Jinyan Guan, Kyle Simek, Luca Del Pero, Colin Reimer Dawson, Kobus Barnard
Abstract: … for tracking an unknown and changing number of people in a scene using video taken from a single, fixed viewpoint. We develop a Bayesian modeling approach for tracking people in 3D from monocular video with unknown cameras. Modeling in 3D provides natural explanations for occlusions and smoothness discontinuities that result from projection, and allows priors on velocity and smoothness to be grounded in physical quantities: meters and seconds vs. pixels and frames. We pose the problem in the context of data association, in which observations are assigned to tracks. A correct application of Bayesian inference to multitarget tracking must address the fact that the model’s dimension changes as tracks are added or removed, and thus, posterior densities of different hypotheses are not comparable. We address this by marginalizing out the trajectory parameters so the resulting posterior over data associations has constant dimension. This is made tractable by using (a) Gaussian process priors for smooth trajectories and (b) approximately Gaussian likelihood functions. Our approach provides a principled method for incorporating multiple sources of evidence; we present results using both optical flow and object detector outputs. Results are comparable to recent work on 3D tracking and, unlike others, our method requires no pre-calibrated cameras.
5 0.68194509 190 iccv-2013-Handling Occlusions with Franken-Classifiers
Author: Markus Mathias, Rodrigo Benenson, Radu Timofte, Luc Van_Gool
Abstract: Detecting partially occluded pedestrians is challenging. A common practice to maximize detection quality is to train a set of occlusion-specific classifiers, each for a certain amount and type of occlusion. Since training classifiers is expensive, only a handful are typically trained. We show that by using many occlusion-specific classifiers, we outperform previous approaches on three pedestrian datasets: INRIA, ETH, and Caltech USA. We present a new approach to train such classifiers. By reusing computations among different training stages, 16 occlusion-specific classifiers can be trained at only one tenth the cost of one full training. We show that test time cost also grows sub-linearly.
6 0.67156112 424 iccv-2013-Tracking Revisited Using RGBD Camera: Unified Benchmark and Baselines
7 0.66386884 87 iccv-2013-Conservation Tracking
8 0.65642208 230 iccv-2013-Latent Data Association: Bayesian Model Selection for Multi-target Tracking
9 0.59496045 350 iccv-2013-Relative Attributes for Large-Scale Abandoned Object Detection
10 0.59368962 168 iccv-2013-Finding the Best from the Second Bests - Inhibiting Subjective Bias in Evaluation of Visual Tracking Algorithms
11 0.58275294 349 iccv-2013-Regionlets for Generic Object Detection
12 0.58115679 111 iccv-2013-Detecting Dynamic Objects with Multi-view Background Subtraction
13 0.58046174 318 iccv-2013-PixelTrack: A Fast Adaptive Algorithm for Tracking Non-rigid Objects
14 0.56976378 286 iccv-2013-NYC3DCars: A Dataset of 3D Vehicles in Geographic Context
15 0.56563836 189 iccv-2013-HOGgles: Visualizing Object Detection Features
16 0.56184781 75 iccv-2013-CoDeL: A Human Co-detection and Labeling Framework
17 0.55276901 136 iccv-2013-Efficient Pedestrian Detection by Directly Optimizing the Partial Area under the ROC Curve
18 0.55141282 406 iccv-2013-Style-Aware Mid-level Representation for Discovering Visual Connections in Space and Time
19 0.5379678 425 iccv-2013-Tracking via Robust Multi-task Multi-view Joint Sparse Representation
20 0.53471667 303 iccv-2013-Orderless Tracking through Model-Averaged Posterior Estimation
topicId topicWeight
[(2, 0.055), (7, 0.011), (26, 0.051), (31, 0.051), (34, 0.019), (35, 0.011), (42, 0.078), (64, 0.42), (73, 0.034), (89, 0.159)]
simIndex simValue paperId paperTitle
1 0.95128953 99 iccv-2013-Cross-View Action Recognition over Heterogeneous Feature Spaces
Author: Xinxiao Wu, Han Wang, Cuiwei Liu, Yunde Jia
Abstract: In cross-view action recognition, “what you saw” in one view is different from “what you recognize” in another view. The data distribution and even the feature space can change from one view to another, because the appearance and motion of actions drastically vary across different views. In this paper, we address the problem of transferring action models learned in one view (source view) to another different view (target view), where action instances from these two views are represented by heterogeneous features. A novel learning method, called Heterogeneous Transfer Discriminant-analysis of Canonical Correlations (HTDCC), is proposed to learn a discriminative common feature space for linking source and target views to transfer knowledge between them. Two projection matrices that respectively map data from source and target views into the common space are optimized via simultaneously minimizing the canonical correlations of inter-class samples and maximizing the intra-class canonical correlations. Our model is neither restricted to corresponding action instances in the two views nor restricted to the same type of feature, and can handle only a few or even no labeled samples available in the target view. To reduce the data distribution mismatch between the source and target views in the common feature space, a nonparametric criterion is included in the objective function. We additionally propose a joint weight learning method to fuse multiple source-view action classifiers for recognition in the target view. Different combination weights are assigned to different source views, with each weight presenting how contributive the corresponding source view is to the target view. The proposed method is evaluated on the IXMAS multi-view dataset and achieves promising results.
2 0.93826443 298 iccv-2013-Online Robust Non-negative Dictionary Learning for Visual Tracking
Author: Naiyan Wang, Jingdong Wang, Dit-Yan Yeung
Abstract: This paper studies the visual tracking problem in video sequences and presents a novel robust sparse tracker under the particle filter framework. In particular, we propose an online robust non-negative dictionary learning algorithm for updating the object templates so that each learned template can capture a distinctive aspect of the tracked object. Another appealing property of this approach is that it can automatically detect and reject the occlusion and cluttered background in a principled way. In addition, we propose a new particle representation formulation using the Huber loss function. The advantage is that it can yield robust estimation without using trivial templates adopted by previous sparse trackers, leading to faster computation. We also reveal the equivalence between this new formulation and the previous one which uses trivial templates. The proposed tracker is empirically compared with state-of-the-art trackers on some challenging video sequences. Both quantitative and qualitative comparisons show that our proposed tracker is superior and more stable.
3 0.92911947 88 iccv-2013-Constant Time Weighted Median Filtering for Stereo Matching and Beyond
Author: Ziyang Ma, Kaiming He, Yichen Wei, Jian Sun, Enhua Wu
Abstract: Despite the continuous advances in local stereo matching for years, most efforts are on developing robust cost computation and aggregation methods. Little attention has been seriously paid to the disparity refinement. In this work, we study weighted median filtering for disparity refinement. We discover that with this refinement, even the simple box filter aggregation achieves comparable accuracy with various sophisticated aggregation methods (with the same refinement). This is due to the nice weighted median filtering properties of removing outlier error while respecting edges/structures. This reveals that the previously overlooked refinement can be at least as crucial as aggregation. We also develop the first constant time algorithm for the previously time-consuming weighted median filter. This makes the simple combination “box aggregation + weighted median” an attractive solution in practice for both speed and accuracy. As a byproduct, the fast weighted median filtering unleashes its potential in other applications that were hampered by high complexities. We show its superiority in various applications such as depth upsampling, clip-art JPEG artifact removal, and image stylization.
same-paper 4 0.91502959 242 iccv-2013-Learning People Detectors for Tracking in Crowded Scenes
Author: Siyu Tang, Mykhaylo Andriluka, Anton Milan, Konrad Schindler, Stefan Roth, Bernt Schiele
Abstract: People tracking in crowded real-world scenes is challenging due to frequent and long-term occlusions. Recent tracking methods obtain the image evidence from object (people) detectors, but typically use off-the-shelf detectors and treat them as black box components. In this paper we argue that for best performance one should explicitly train people detectors on failure cases of the overall tracker instead. To that end, we first propose a novel joint people detector that combines a state-of-the-art single person detector with a detector for pairs of people, which explicitly exploits common patterns of person-person occlusions across multiple viewpoints that are a frequent failure case for tracking in crowded scenes. To explicitly address remaining failure modes of the tracker we explore two methods. First, we analyze typical failures of trackers and train a detector explicitly on these cases. And second, we train the detector with the people tracker in the loop, focusing on the most common tracker failures. We show that our joint multi-person detector significantly improves both detection accuracy and tracker performance, improving the state-of-the-art on standard benchmarks.
5 0.88730901 166 iccv-2013-Finding Actors and Actions in Movies
Author: P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, J. Sivic
Abstract: We address the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts. Specifically, we extract actor/action pairs from the script and use them as constraints in a discriminative clustering framework. The corresponding optimization problem is formulated as a quadratic program under linear constraints. People in video are represented by automatically extracted and tracked faces together with corresponding motion features. First, we apply the proposed framework to the task of learning names of characters in the movie and demonstrate significant improvements over previous methods used for this task. Second, we explore the joint actor/action constraint and show its advantage for weakly supervised action learning. We validate our method in the challenging setting of localizing and recognizing characters and their actions in the feature-length movies Casablanca and American Beauty.
6 0.87771142 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
7 0.85833073 441 iccv-2013-Video Motion for Every Visible Point
8 0.84024829 215 iccv-2013-Incorporating Cloud Distribution in Sky Representation
9 0.80254155 380 iccv-2013-Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes
10 0.78151816 303 iccv-2013-Orderless Tracking through Model-Averaged Posterior Estimation
11 0.75576305 442 iccv-2013-Video Segmentation by Tracking Many Figure-Ground Segments
12 0.73975074 86 iccv-2013-Concurrent Action Detection with Structural Prediction
13 0.73262703 424 iccv-2013-Tracking Revisited Using RGBD Camera: Unified Benchmark and Baselines
14 0.73193413 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
15 0.70723474 359 iccv-2013-Robust Object Tracking with Online Multi-lifespan Dictionary Learning
16 0.70472693 425 iccv-2013-Tracking via Robust Multi-task Multi-view Joint Sparse Representation
17 0.68660688 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
18 0.66291505 230 iccv-2013-Latent Data Association: Bayesian Model Selection for Multi-target Tracking
19 0.65871102 338 iccv-2013-Randomized Ensemble Tracking
20 0.65401852 22 iccv-2013-A New Adaptive Segmental Matching Measure for Human Activity Recognition