cvpr cvpr2013 cvpr2013-187 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: S. Hussain Raza, Matthias Grundmann, Irfan Essa
Abstract: We present a novel algorithm for estimating the broad 3D geometric structure of outdoor video scenes. Leveraging spatio-temporal video segmentation, we decompose a dynamic scene captured by a video into geometric classes, based on predictions made by region-classifiers that are trained on appearance and motion features. By examining the homogeneity of the prediction, we combine predictions across multiple segmentation hierarchy levels alleviating the need to determine the granularity a priori. We built a novel, extensive dataset on geometric context of video to evaluate our method, consisting of over 100 groundtruth annotated outdoor videos with over 20,000 frames. To further scale beyond this dataset, we propose a semisupervised learning framework to expand the pool of labeled data with high confidence predictions obtained from unlabeled data. Our system produces an accurate prediction of geometric context of video achieving 96% accuracy across main geometric classes.
Reference: text
sentIndex sentText sentNum sentScore
1 edu/cpl/projects/videogeometriccontext Abstract We present a novel algorithm for estimating the broad 3D geometric structure of outdoor video scenes. [sent-5, score-0.41]
2 Leveraging spatio-temporal video segmentation, we decompose a dynamic scene captured by a video into geometric classes, based on predictions made by region-classifiers that are trained on appearance and motion features. [sent-6, score-0.889]
3 By examining the homogeneity of the prediction, we combine predictions across multiple segmentation hierarchy levels alleviating the need to determine the granularity a priori. [sent-7, score-1.095]
4 We built a novel, extensive dataset on geometric context of video to evaluate our method, consisting of over 100 groundtruth annotated outdoor videos with over 20,000 frames. [sent-8, score-0.832]
5 To further scale beyond this dataset, we propose a semisupervised learning framework to expand the pool of labeled data with high confidence predictions obtained from unlabeled data. [sent-9, score-0.363]
6 Our system produces an accurate prediction of geometric context of video achieving 96% accuracy across main geometric classes. [sent-10, score-0.965]
7 [12] showed that such geometric context can be used to obtain a probabilistic representation of the scene layout based on geometric classes, which in turn can be used to improve object detection. [sent-16, score-0.592]
8 In this paper, we propose a novel method to provide a high level description of a video scene by assigning geometric classes to spatio-temporal regions as shown in Figure 1. [sent-21, score-0.643]
9 We achieve high accuracy leveraging motion and appearance features while achieving temporal consistency by relying on spatio-temporal regions across various granularities. [sent-23, score-0.759]
10 Building upon a hierarchical video-segmentation to achieve temporal consistency, we compute a wide variety of appearance, location, and motion features which are used to train classifiers to predict geometric context in video. [sent-25, score-0.884]
11 A significant challenge for developing a scene understanding system for videos is the need for an annotated video dataset for training and evaluation. [sent-26, score-0.629]
12 To this end, we have collected and annotated a video dataset with pixel level ground truth labels for over 20,000 frames across 100 videos covering a wide variety of scene examples. [sent-27, score-1.097]
13 The primary contributions of this paper are: • A scene description for video via geometric classes (96% accuracy across main geometric classes). [sent-28, score-0.898]
14 • Exploiting motion and temporal causality/redundancy present in video by using motion features and aggregating predictions across spatio-temporal regions. [sent-29, score-0.505]
15 • A semi-supervised bootstrap learning framework for expanding the pool of labeled data with highly confident predictions obtained on unlabeled data. [sent-30, score-0.358]
16 • A thorough evaluation of our system by examining the importance of features, the benefit of temporal redundancy, and independence of segmentation granularity. [sent-32, score-0.455]
17 Related Work. Image-based scene understanding methods [13, 9] can be directly applied to individual video frames to generate a description of the scene. [sent-34, score-0.414]
18 Further, lacking temporal consistency, they can result in temporally inconsistent labels across frames, which can impact performance, as scene labels suddenly change between frames. [sent-36, score-0.564]
19 In addition, frame-based methods do not exploit temporal redundancy to process videos efficiently as processing each segment in video independently results in a longer processing time. [sent-37, score-0.817]
20 Another approach to achieve temporal consistency across frames is to use optical flow between consecutive frames to estimate the neighborhood of each pixel and then combine past predictions to make a final prediction [14]. [sent-46, score-0.785]
21 This requires labeling every pixel in every frame in the video independently, which doesn’t leverage the causality in video. [sent-47, score-0.396]
22 Our video scene understanding approach takes advantage of spatio-temporal information by employing hierarchical video segmentation[10], which segments a video into spatio-temporal regions. [sent-48, score-0.9]
23 Further, we leverage causality in videos to efficiently label videos, achieving favorable complexity which is linear in the number of unique spatiotemporal segments in videos. [sent-49, score-0.635]
24 In contrast, our approach performs geometric labeling by leveraging multiple hierarchy levels while probabilistically aggregating labels over a temporal window. [sent-53, score-0.864]
25 A significant hurdle in video scene understanding is the limited availability of ground truth annotated datasets for training. [sent-54, score-0.582]
26 While several datasets exist for predicting geometric context in the image domain [13, 9], datasets for videos [2, 6, 19] are currently limited in their scope. [sent-55, score-0.468]
27 Our approach differs in that it takes advantage of spatio-temporal context, extends the feature set to be more suitable for video, leverages temporal redundancy while achieving temporal consistency, and broadens the pool of available data via semi-supervised learning. [sent-59, score-0.686]
28 Dataset and Geometric Classes. Existing Datasets: In our supervised learning setting, we require an annotated dataset supplying ground truth labels for training and evaluation. [sent-61, score-0.376]
29 While several datasets for geometric scene understanding exist for still images [13, 9], our video-based scene analysis method demands an annotated video dataset. [sent-62, score-0.717]
30 However, existing datasets for video scene understanding only provide limited ground truth data. [sent-63, score-0.419]
31 To overcome this limitation, we provide a novel, pixel-level annotated dataset for geometric scene analysis of video, consisting of over 20,000 frames across 100 videos. [sent-67, score-0.649]
32 A video dataset for geometric scene understanding: Our dataset consists of 160 outdoor videos, with annotations available for a subset of 100 videos. [sent-68, score-0.56]
33 We split the videos into a set used for training and cross-validation (13,000 frames), 40 videos for independent testing via external-validation (7,000 frames), and 60 videos that are kept unlabeled (14,000 frames) and are later used for semi-supervised learning (Section 5. [sent-72, score-0.4]
34 Videos in the cross and external-validation sets are completely annotated with ground truth labels (every frame and pixel). [sent-74, score-0.428]
35 While many different partitions can be imagined, we follow [13, 11] and partition the video content into three main geometric classes: “Sky”, “support”, and “vertical”. [sent-76, score-0.436]
36 The porous vertical sub-class includes non-solid, static objects such as trees and foliage. [sent-79, score-0.491]
37 Hierarchical Segmentation: the video is first segmented into a hierarchy of spatio-temporal regions using [10]. [sent-86, score-0.441]
38 Then, features are extracted for each segment to train a main and sub-classifier to predict geometric context in videos. [sent-87, score-0.599]
39 Geometric Context From Videos. Our algorithm for determining geometric context from video consists of 3 main steps (Figure 2). [sent-89, score-0.532]
40 First, we apply hierarchical video segmentation, obtaining spatio-temporal regions at different hierarchy levels. [sent-90, score-0.665]
41 We rely on video segmentation to achieve (a) temporal coherence without having to explicitly enforce it in our framework and (b) greatly reduced computational complexity, since we label regions as opposed to individual pixels. [sent-91, score-0.598]
42 Third, we train a classifier to discriminate segments into sky, ground, and vertical classes. [sent-93, score-0.474]
43 Their spatio-temporal hierarchical video segmentation builds upon the graph-based image segmentation of Felzenszwalb et al. [sent-103, score-0.416]
44 From left to right: Hierarchy levels in increasing order; region area increases as segments from lower hierarchy levels are grouped together. [sent-106, score-0.788]
45 This creates an over-segmented video volume, which is further segmented into a hierarchy of super-regions of varying granularity. [sent-107, score-0.559]
46 Successive application of this algorithm yields a segmentation hierarchy of the video as shown in Figure 3 for one of our sample videos. [sent-110, score-0.686]
47 To address this problem, we introduce a new label, “mix”, for super-voxels that are a mixture of two or more classes or whose identity changes over time across geometric classes. [sent-123, score-0.545]
48 Figure 4: Annotation hierarchy of spatio-temporal segments (Mix, Sky, Ground, Vertical; sub-vertical: Solid, Porous, Object): Segments are either labeled as a mixture of classes (mix) or assigned a main geometric class label. [sent-125, score-0.882]
49 The vertical geometric class is further discriminated into solid, porous, and object. [sent-126, score-0.438]
50 To obtain a ground truth labeling for every level of the segmentation hierarchy, we leverage the ground truth labels of the over-segmented super-voxels, which are annotated manually, by pooling their labels. [sent-127, score-0.69]
51 Supervoxel labels are then combined to generate ground truth for each level of segmentation hierarchy (see Section 4. [sent-128, score-0.75]
52 Table 2: Percentage of segments annotated for each geometric class (∼ 2. [sent-133, score-0.617]
53 We manually annotated over 20,000 frames at the over-segmentation level and then combined their labels via the above approach across the hierarchy to generate labels at higher levels. [sent-137, score-0.961]
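To make this label-pooling step concrete, here is a minimal Python sketch of how per-level ground truth could be derived by pooling annotated super-voxel labels. The helper names, the majority-vote rule, and the purity threshold are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def pool_labels(child_labels, purity_threshold=0.95):
    """Assign a ground-truth label to a higher-level region from the labels of
    its annotated child super-voxels: majority vote if one class clearly
    dominates, otherwise the region is marked as 'mix'."""
    counts = Counter(child_labels)
    label, count = counts.most_common(1)[0]
    if count / sum(counts.values()) >= purity_threshold:
        return label
    return "mix"

def annotate_level(level_regions, supervoxel_labels):
    """level_regions: dict mapping a region id at some hierarchy level to the
    list of over-segmentation super-voxel ids it groups (assumed structure)."""
    return {rid: pool_labels([supervoxel_labels[s] for s in svs])
            for rid, svs in level_regions.items()}

# Example: a region grouping three 'sky' super-voxels and one 'vertical'
# super-voxel falls below the purity threshold and is labeled 'mix'.
print(pool_labels(["sky", "sky", "sky", "vertical"]))  # -> mix
```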
54 Table 2 gives an overview of the percentage of segments annotated for each geometric class. [sent-138, score-0.541]
55 Features We estimate the class-dependent probability of each geometric label for a segment in a frame using a wide variety of features. [sent-141, score-0.595]
56 Specifically, we compute appearance (color, texture, location, perspective) and motion features across each segment in a frame. [sent-143, score-0.396]
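As an illustration of per-segment appearance features, the sketch below computes mean color, per-channel color histograms, and simple location cues for a single segment mask; the paper's full feature set (texture and perspective cues in particular) is not reproduced here, and the exact descriptors are assumptions.

```python
import numpy as np

def appearance_features(frame_rgb, segment_mask, n_bins=8):
    """Toy per-segment appearance features: mean color, per-channel color
    histograms, and normalized centroid/extent as location cues."""
    h, w, _ = frame_rgb.shape
    pixels = frame_rgb[segment_mask]                    # (N, 3) pixels inside the segment
    hists = [np.histogram(pixels[:, c], bins=n_bins, range=(0, 255), density=True)[0]
             for c in range(3)]
    ys, xs = np.nonzero(segment_mask)
    location = [ys.mean() / h, xs.mean() / w,           # normalized centroid
                (ys.max() - ys.min()) / h,              # normalized height
                (xs.max() - xs.min()) / w]              # normalized width
    return np.concatenate([pixels.mean(axis=0) / 255.0, *hists, location])
```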
57 To capture the motion and changes in velocity and acceleration of objects across time, we compute flow histograms and mean flow for each frame Ij w. [sent-149, score-0.464]
58 Table 3 lists all of our motion based features used for estimating geometric context of video. [sent-164, score-0.44]
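A rough sketch of per-segment motion features in this spirit, using OpenCV's Farneback optical flow as a stand-in; the paper's exact flow method, histogram binning, and the temporal differentials listed in Table 3 are not specified in this summary, so the details below are assumptions.

```python
import cv2
import numpy as np

def motion_features(prev_gray, curr_gray, segment_mask, n_bins=16):
    """Mean flow plus magnitude/orientation histograms inside one segment."""
    # Dense optical flow between consecutive grayscale frames (Farneback).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx = flow[..., 0][segment_mask]
    fy = flow[..., 1][segment_mask]
    mag = np.hypot(fx, fy)
    ang = np.arctan2(fy, fx)
    mag_hist, _ = np.histogram(mag, bins=n_bins, range=(0.0, mag.max() + 1e-6), density=True)
    ang_hist, _ = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi), density=True)
    return np.concatenate([[fx.mean(), fy.mean()], mag_hist, ang_hist])
```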
59 Multiple Segmentations As the appropriate granularity of the segmentation is not known a priori, we make use of multiple segmentations across several hierarchy levels, utilizing the increased spatial support of the segments at higher levels to compute features. [sent-171, score-1.047]
60 We generate multiple segmentations of the scene at various granularity levels ranging from 10% to 50% of the hierarchy height using [10] in increments of 10% (5 hierarchy levels in total). [sent-177, score-1.223]
61 Classification We evaluate our method using boosted decision trees based on a logistic regression version of Adaboost [3] that outputs the class probability for each segment in a frame and perform 5-fold cross validation. [sent-180, score-0.443]
62 We train two multi-class classifiers to predict the geometric labels: the first discriminates between the main geometric classes, and the second discriminates among the sub-vertical classes. Figure 6: Input video image, predicted labels, and confidence for each geometric class (Sky, Ground, Vertical, Solid, Porous, Object). [sent-181, score-0.928]
63 Notice that trees are correctly assigned high probability for the porous class, walls for the solid class, and humans and cars for the object class. [sent-182, score-0.558]
64 In addition to the two multi-class classifiers, we independently train a homogeneity classifier that estimates the probability of the segment being a single label segment or part of the class “mix”. [sent-184, score-0.733]
65 As the segments vary across time, we opt to extract features for each frame for the same segment to provide discriminating information over time (e. [sent-190, score-0.536]
66 We extract features from 5 segmentation hierarchical levels ranging from 10% to 50% of the hierarchy height. [sent-194, score-0.674]
67 We train the homogeneity classifier by providing single-label and “mix”-label segments as positive and negative instances, respectively. [sent-196, score-0.629]
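The homogeneity classifier can be sketched the same way; here it is a binary AdaBoost model whose positive class means "single geometric label" and whose negative class means "mix" (the synthetic data and feature layout are assumptions).

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(1)
X_seg = rng.normal(size=(600, 40))        # stand-in per-frame segment features
y_single = rng.integers(0, 2, size=600)   # 1 = single-label segment, 0 = "mix" segment

homogeneity_clf = AdaBoostClassifier(n_estimators=50).fit(X_seg, y_single)
# P(s_j | x_j): probability that a segment is homogeneous, later used as the
# weight when combining class posteriors across hierarchy levels.
p_homogeneous = homogeneity_clf.predict_proba(X_seg)[:, 1]
```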
68 We compute the sub-vertical labels for all the segments in a frame but only apply them to segments labeled as vertical by the main classifier. [sent-201, score-0.866]
69 When using multiple segmentations across different hierarchies, a super-pixel is part of different segments at each level of segmentation hierarchy. [sent-202, score-0.513]
70 To determine the label yi of super-pixel i, class-posteriors from all segments sj in the hierarchy containing the super-pixel are combined using a weighted average based on their homogeneity likelihoods P(sj |xj ) [13, 11], where xj is the corresponding feature vector. [sent-203, score-0.859]
71 yi = argmax_k Σ_j P(yi = k | xj) · P(sj |xj ), where k denotes the possible geometric labels and the sum runs over the ns hierarchical segmentations containing the super-pixel. [sent-206, score-0.362]
72 This technique yields a final classification of super-pixels at the oversegmentation level by combining the individual predictions across hierarchy levels. [sent-207, score-0.71]
73 These weighted posterior probabilities of super-pixels, for main and sub-vertical class, are then averaged across frames in a temporal window to give final predictions for each super-voxel. [sent-208, score-0.61]
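A small sketch of the combination step just described: per-level class posteriors for the segments containing a super-pixel are averaged with homogeneity weights, and the resulting per-frame posteriors are then averaged over the temporal window. Array shapes and the normalization are assumptions consistent with the description above.

```python
import numpy as np

def combine_over_hierarchy(class_post, homogeneity):
    """class_post: (n_levels, n_classes) posteriors of the segments containing
    one super-pixel; homogeneity: (n_levels,) likelihoods P(s_j | x_j).
    Returns the homogeneity-weighted average posterior for the super-pixel."""
    w = homogeneity / homogeneity.sum()
    return w @ class_post

def label_supervoxel(per_frame_posteriors):
    """Average a super-voxel's per-frame posteriors over its temporal window
    and return the most likely geometric class with its mean posterior."""
    mean_post = np.asarray(per_frame_posteriors).mean(axis=0)
    return int(np.argmax(mean_post)), mean_post

# Example with 5 hierarchy levels and 3 main classes.
post = np.random.default_rng(2).dirichlet(np.ones(3), size=5)   # (5, 3)
homog = np.array([0.9, 0.8, 0.6, 0.4, 0.3])
print(label_supervoxel([combine_over_hierarchy(post, homog)]))
```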
74 In our experiments, leveraging multiple hierarchy levels and temporal redundancy, we achieve an overall classification accuracy of 96. [sent-213, score-0.863]
75 It is insightful to quantify to what extent temporal redundancy improves classification accuracy. [sent-229, score-0.406]
76 Figure 8: (a) Temporal consistency (b) Classification results for various hierarchy levels. [sent-241, score-0.439]
77 The temporal window starts at the very first frame a segment appears in the video. [sent-243, score-0.447]
78 (b) Classification accuracy estimated over 5-fold cross validation: (left) Single segmentation hierarchy level, (right) Multiple segmentation hierarchy levels. [sent-244, score-1.032]
79 Specifically, we compute the class-posteriors of a segment independently for each frame, obtaining the final probability by taking the average of the per-frame probabilities across the temporal window. [sent-247, score-0.485]
80 Using a temporal window for labeling improves classification accuracy from 92. [sent-250, score-0.42]
81 Figure 8b demonstrates the variation in classification accuracy when using a single versus multiple segmentation hierarchy levels. [sent-254, score-0.582]
82 When using a single segmentation, the classification accuracy decreases with increasing hierarchy level from 0. [sent-255, score-0.536]
83 This decrease in accuracy is due to segments of different classes being increasingly mixed at higher hierarchy levels, as regions tend to get under-segmented. [sent-259, score-0.842]
84 Using multiple segmentations by combining different segmentation hierarchy levels provides much more consistent accuracy; in particular, it mitigates the problem of determining the correct granularity for a segment. [sent-260, score-0.754]
85 In our experiments, combining predictions for geometric context at hierarchy levels 0. [sent-261, score-0.905]
86 For vertical sub-classes, accuracy is lower because the vertical class contains large intra-class variations and its regions tend to be more affected by segmentation errors than those of the other classes. [sent-266, score-0.586]
87 It can be seen that the use of motion and appearance features yields the best accuracy, where motion features are primarily beneficial for the sub-vertical classifier (accuracy improves by 5% compared to appearance features alone). [sent-272, score-0.609]
88 Table 5 also shows the benefit of temporal redundancy by using spatiotemporal regions. [sent-273, score-0.374]
89 Temporal redundancy is significant to our results, as shown by the reduced accuracy when limiting features to only the very first frame of each segment (last 2 rows). [sent-287, score-0.425]
90 Figure 10 (columns: Input, Ground Truth, Labels): Misclassification examples: Scattered clouds are labeled as vertical class; a mix region of object/solid is labeled as car (top). [sent-289, score-0.519]
91 Then, these classifiers are used to predict geometric context on unlabeled data (2). [sent-309, score-0.457]
92 In addition, we make use of multiple segmentations at different hierarchy levels, by including all high-confidence segments from the hierarchy that have high homogeneity (probability of being a single class ≥ 80%). [sent-311, score-1.32]
93 To avoid adding low quality segments to the labeled set, we perform introspection every 5th iteration, discarding added segments whose confidence (maximum class posterior) dropped below 80%. [sent-314, score-0.602]
94 Our initial classifier is trained on a dataset of 63 videos (all videos in the cross-validation set, ∼200,000 segments). [sent-316, score-0.4]
95 At each iteration, we add 5,000 high-confidence segments of each geometric class from the unlabeled dataset, extending the training data. [sent-317, score-0.564]
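A rough, runnable sketch of this bootstrap loop follows; clf is any scikit-learn-style classifier with predict_proba, the 80% threshold, the per-class quota of 5,000, and the introspection interval of 5 follow the numbers quoted above, and everything else (including skipping the per-hierarchy-level homogeneity test) is an assumption.

```python
import numpy as np

def bootstrap(clf, X_lab, y_lab, X_unlab, n_iters=20, per_class=5000,
              conf_thresh=0.80, introspect_every=5):
    """Semi-supervised bootstrap: train, score the unlabeled pool, promote
    high-confidence segments into the training set, and periodically discard
    promoted segments whose confidence has dropped below the threshold."""
    added_idx, added_lbl = [], []
    for it in range(1, n_iters + 1):
        if added_idx:
            X_train = np.vstack([X_lab, X_unlab[added_idx]])
            y_train = np.concatenate([y_lab, added_lbl])
        else:
            X_train, y_train = X_lab, y_lab
        clf.fit(X_train, y_train)

        prob = clf.predict_proba(X_unlab)
        conf = prob.max(axis=1)
        pred = clf.classes_[prob.argmax(axis=1)]
        already = set(added_idx)
        for c in np.unique(y_lab):
            # Promote up to `per_class` new high-confidence segments per class.
            cand = [i for i in np.flatnonzero((pred == c) & (conf >= conf_thresh))
                    if i not in already]
            top = sorted(cand, key=lambda i: -conf[i])[:per_class]
            added_idx += top
            added_lbl += [c] * len(top)
            already.update(top)

        if it % introspect_every == 0:
            # Introspection: drop promoted segments whose maximum class
            # posterior has fallen below the confidence threshold.
            keep = [k for k, i in enumerate(added_idx) if conf[i] >= conf_thresh]
            added_idx = [added_idx[k] for k in keep]
            added_lbl = [added_lbl[k] for k in keep]
    return clf
```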
96 In particular, we evaluate our bootstrap approach on a separate video dataset of 40 videos (7,000 frames). [sent-320, score-0.478]
97 We thoroughly evaluate the contribution of motion features and demonstrate the benefit of utilizing temporal redundancy across frames. [sent-325, score-0.551]
98 To measure accuracy of our approach, we collected a comprehensive dataset of annotated video which we plan to make available to the research community. [sent-326, score-0.421]
99 One reason for its lower accuracy is that objects tend to be under-segmented even at the superpixel level, merging with porous or solid classes at higher hierarchy levels. [sent-330, score-0.931]
100 Finally, we plan on leveraging geometric context to improve object detection and activity recognition in video. [sent-332, score-0.414]
wordName wordTfidf (topN-words)
[('hierarchy', 0.391), ('porous', 0.253), ('geometric', 0.209), ('segments', 0.205), ('temporal', 0.204), ('homogeneity', 0.183), ('video', 0.168), ('videos', 0.163), ('vertical', 0.153), ('mix', 0.139), ('annotated', 0.127), ('redundancy', 0.124), ('segment', 0.12), ('predictions', 0.113), ('hoeim', 0.113), ('bootstrap', 0.111), ('frames', 0.111), ('solid', 0.109), ('motion', 0.1), ('labels', 0.097), ('granularity', 0.097), ('levels', 0.096), ('segmentation', 0.096), ('context', 0.096), ('across', 0.088), ('classes', 0.088), ('frame', 0.088), ('sky', 0.085), ('label', 0.08), ('scene', 0.078), ('flow', 0.078), ('differentials', 0.078), ('ij', 0.078), ('leveraging', 0.077), ('class', 0.076), ('segmentations', 0.074), ('unlabeled', 0.074), ('ground', 0.067), ('subvertical', 0.063), ('pool', 0.06), ('main', 0.059), ('buildings', 0.059), ('labeled', 0.059), ('accuracy', 0.058), ('grundmann', 0.058), ('understanding', 0.057), ('confidence', 0.057), ('hierarchical', 0.056), ('trees', 0.053), ('appearance', 0.053), ('adaboost', 0.052), ('causality', 0.052), ('trains', 0.051), ('cars', 0.051), ('sfm', 0.05), ('regions', 0.05), ('level', 0.05), ('truth', 0.049), ('matthias', 0.049), ('consistency', 0.048), ('supervoxel', 0.047), ('spatiotemporal', 0.046), ('achieving', 0.046), ('fauqueur', 0.045), ('labeling', 0.045), ('hoiem', 0.044), ('boats', 0.043), ('train', 0.043), ('efros', 0.043), ('leverage', 0.043), ('watch', 0.042), ('georgia', 0.042), ('tighe', 0.042), ('classifiers', 0.041), ('improves', 0.041), ('brostow', 0.04), ('classifier', 0.038), ('independently', 0.038), ('predict', 0.037), ('logistic', 0.037), ('classification', 0.037), ('dataset', 0.036), ('sj', 0.036), ('features', 0.035), ('probability', 0.035), ('opposed', 0.035), ('window', 0.035), ('discriminate', 0.035), ('boosted', 0.034), ('outdoor', 0.033), ('torralba', 0.033), ('plan', 0.032), ('predicted', 0.032), ('wide', 0.032), ('objects', 0.032), ('prediction', 0.032), ('yields', 0.031), ('examining', 0.031), ('successive', 0.031), ('variety', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000013 187 cvpr-2013-Geometric Context from Videos
Author: S. Hussain Raza, Matthias Grundmann, Irfan Essa
Abstract: We present a novel algorithm for estimating the broad 3D geometric structure of outdoor video scenes. Leveraging spatio-temporal video segmentation, we decompose a dynamic scene captured by a video into geometric classes, based on predictions made by region-classifiers that are trained on appearance and motion features. By examining the homogeneity of the prediction, we combine predictions across multiple segmentation hierarchy levels alleviating the need to determine the granularity a priori. We built a novel, extensive dataset on geometric context of video to evaluate our method, consisting of over 100 groundtruth annotated outdoor videos with over 20,000 frames. To further scale beyond this dataset, we propose a semisupervised learning framework to expand the pool of labeled data with high confidence predictions obtained from unlabeled data. Our system produces an accurate prediction of geometric context of video achieving 96% accuracy across main geometric classes.
2 0.19186053 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model
Author: Wei-Chen Chiu, Mario Fritz
Abstract: Video data provides a rich source of information that is available to us today in large quantities e.g. from online resources. Tasks like segmentation benefit greatly from the analysis of spatio-temporal motion patterns in videos and recent advances in video segmentation has shown great progress in exploiting these addition cues. However, observing a single video is often not enough to predict meaningful segmentations and inference across videos becomes necessary in order to predict segmentations that are consistent with objects classes. Therefore the task of video cosegmentation is being proposed, that aims at inferring segmentation from multiple videos. But current approaches are limited to only considering binary foreground/background segmentation and multiple videos of the same object. This is a clear mismatch to the challenges that we are facing with videos from online resources or consumer videos. We propose to study multi-class video co-segmentation where the number of object classes is unknown as well as the number of instances in each frame and video. We achieve this by formulating a non-parametric bayesian model across videos sequences that is based on a new videos segmentation prior as well as a global appearance model that links segments of the same class. We present the first multi-class video co-segmentation evaluation. We show that our method is applicable to real video data from online resources and outperforms state-of-the-art video segmentation and image co-segmentation baselines.
3 0.18651286 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction
Author: Dennis Park, C. Lawrence Zitnick, Deva Ramanan, Piotr Dollár
Abstract: We describe novel but simple motion features for the problem of detecting objects in video sequences. Previous approaches either compute optical flow or temporal differences on video frame pairs with various assumptions about stabilization. We describe a combined approach that uses coarse-scale flow and fine-scale temporal difference features. Our approach performs weak motion stabilization by factoring out camera motion and coarse object motion while preserving nonrigid motions that serve as useful cues for recognition. We show results for pedestrian detection and human pose estimation in video sequences, achieving state-of-the-art results in both. In particular, given a fixed detection rate our method achieves a five-fold reduction in false positives over prior art on the Caltech Pedestrian benchmark. Finally, we perform extensive diagnostic experiments to reveal what aspects of our system are crucial for good performance. Proper stabilization, long time-scale features, and proper normalization are all critical.
4 0.17147966 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs
Author: Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun, Devi Parikh
Abstract: Recent trends in semantic image segmentation have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning. In this work, we are interested in understanding the roles of these different tasks in aiding semantic segmentation. Towards this goal, we “plug-in ” human subjects for each of the various components in a state-of-the-art conditional random field model (CRF) on the MSRC dataset. Comparisons among various hybrid human-machine CRFs give us indications of how much “head room ” there is to improve segmentation by focusing research efforts on each of the tasks. One of the interesting findings from our slew of studies was that human classification of isolated super-pixels, while being worse than current machine classifiers, provides a significant boost in performance when plugged into the CRF! Fascinated by this finding, we conducted in depth analysis of the human generated potentials. This inspired a new machine potential which significantly improves state-of-the-art performance on the MRSC dataset.
5 0.16765039 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
Author: Raghuraman Gopalan
Abstract: While the notion of joint sparsity in understanding common and innovative components of a multi-receiver signal ensemble has been well studied, we investigate the utility of such joint sparse models in representing information contained in a single video signal. By decomposing the content of a video sequence into that observed by multiple spatially and/or temporally distributed receivers, we first recover a collection of common and innovative components pertaining to individual videos. We then present modeling strategies based on subspace-driven manifold metrics to characterize patterns among these components, across other videos in the system, to perform subsequent video analysis. We demonstrate the efficacy of our approach for activity classification and clustering by reporting competitive results on standard datasets such as, HMDB, UCF-50, Olympic Sports and KTH.
6 0.15888199 329 cvpr-2013-Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images
7 0.15706013 133 cvpr-2013-Discriminative Segment Annotation in Weakly Labeled Video
8 0.14654453 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
9 0.14458032 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors
10 0.13825998 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
11 0.13423404 203 cvpr-2013-Hierarchical Video Representation with Trajectory Binary Partition Tree
12 0.13340352 230 cvpr-2013-Joint 3D Scene Reconstruction and Class Segmentation
13 0.13122302 455 cvpr-2013-Video Object Segmentation through Spatially Accurate and Temporally Dense Extraction of Primary Object Regions
14 0.13002791 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
15 0.12828252 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
16 0.12666449 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding
17 0.12355558 173 cvpr-2013-Finding Things: Image Parsing with Regions and Per-Exemplar Detectors
18 0.12346673 86 cvpr-2013-Composite Statistical Inference for Semantic Segmentation
19 0.12240051 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection
20 0.12098113 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
topicId topicWeight
[(0, 0.3), (1, -0.031), (2, 0.021), (3, -0.121), (4, -0.027), (5, 0.011), (6, 0.001), (7, 0.041), (8, -0.094), (9, 0.094), (10, 0.19), (11, -0.063), (12, 0.114), (13, -0.001), (14, 0.031), (15, 0.022), (16, 0.074), (17, -0.013), (18, -0.121), (19, -0.061), (20, -0.095), (21, 0.006), (22, 0.033), (23, -0.094), (24, -0.048), (25, 0.004), (26, 0.007), (27, 0.017), (28, 0.027), (29, 0.022), (30, -0.097), (31, -0.017), (32, -0.097), (33, 0.079), (34, 0.055), (35, 0.035), (36, -0.025), (37, -0.08), (38, -0.01), (39, -0.012), (40, -0.052), (41, 0.04), (42, -0.031), (43, 0.011), (44, -0.072), (45, 0.078), (46, -0.065), (47, 0.055), (48, 0.013), (49, -0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.98302877 187 cvpr-2013-Geometric Context from Videos
Author: S. Hussain Raza, Matthias Grundmann, Irfan Essa
Abstract: We present a novel algorithm for estimating the broad 3D geometric structure of outdoor video scenes. Leveraging spatio-temporal video segmentation, we decompose a dynamic scene captured by a video into geometric classes, based on predictions made by region-classifiers that are trained on appearance and motion features. By examining the homogeneity of the prediction, we combine predictions across multiple segmentation hierarchy levels alleviating the need to determine the granularity a priori. We built a novel, extensive dataset on geometric context of video to evaluate our method, consisting of over 100 groundtruth annotated outdoor videos with over 20,000 frames. To further scale beyond this dataset, we propose a semisupervised learning framework to expand the pool of labeled data with high confidence predictions obtained from unlabeled data. Our system produces an accurate prediction of geometric context of video achieving 96% accuracy across main geometric classes.
2 0.89981288 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model
Author: Wei-Chen Chiu, Mario Fritz
Abstract: Video data provides a rich source of information that is available to us today in large quantities e.g. from online resources. Tasks like segmentation benefit greatly from the analysis of spatio-temporal motion patterns in videos and recent advances in video segmentation has shown great progress in exploiting these addition cues. However, observing a single video is often not enough to predict meaningful segmentations and inference across videos becomes necessary in order to predict segmentations that are consistent with objects classes. Therefore the task of video cosegmentation is being proposed, that aims at inferring segmentation from multiple videos. But current approaches are limited to only considering binary foreground/background segmentation and multiple videos of the same object. This is a clear mismatch to the challenges that we are facing with videos from online resources or consumer videos. We propose to study multi-class video co-segmentation where the number of object classes is unknown as well as the number of instances in each frame and video. We achieve this by formulating a non-parametric bayesian model across videos sequences that is based on a new videos segmentation prior as well as a global appearance model that links segments of the same class. We present the first multi-class video co-segmentation evaluation. We show that our method is applicable to real video data from online resources and outperforms state-of-the-art video segmentation and image co-segmentation baselines.
3 0.85127062 133 cvpr-2013-Discriminative Segment Annotation in Weakly Labeled Video
Author: Kevin Tang, Rahul Sukthankar, Jay Yagnik, Li Fei-Fei
Abstract: The ubiquitous availability of Internet video offers the vision community the exciting opportunity to directly learn localized visual concepts from real-world imagery. Unfortunately, most such attempts are doomed because traditional approaches are ill-suited, both in terms of their computational characteristics and their inability to robustly contend with the label noise that plagues uncurated Internet content. We present CRANE, a weakly supervised algorithm that is specifically designed to learn under such conditions. First, we exploit the asymmetric availability of real-world training data, where small numbers of positive videos tagged with the concept are supplemented with large quantities of unreliable negative data. Second, we ensure that CRANE is robust to label noise, both in terms of tagged videos that fail to contain the concept as well as occasional negative videos that do. Finally, CRANE is highly parallelizable, making it practical to deploy at large scale without sacrificing the quality of the learned solution. Although CRANE is general, this paper focuses on segment annotation, where we show state-of-the-art pixel-level segmentation results on two datasets, one of which includes a training set of spatiotemporal segments from more than 20,000 videos.
4 0.81884676 413 cvpr-2013-Story-Driven Summarization for Egocentric Video
Author: Zheng Lu, Kristen Grauman
Abstract: We present a video summarization approach that discovers the story of an egocentric video. Given a long input video, our method selects a short chain of video subshots depicting the essential events. Inspired by work in text analysis that links news articles over time, we define a randomwalk based metric of influence between subshots that reflects how visual objects contribute to the progression of events. Using this influence metric, we define an objective for the optimal k-subshot summary. Whereas traditional methods optimize a summary ’s diversity or representativeness, ours explicitly accounts for how one sub-event “leads to ” another—which, critically, captures event connectivity beyond simple object co-occurrence. As a result, our summaries provide a better sense of story. We apply our approach to over 12 hours of daily activity video taken from 23 unique camera wearers, and systematically evaluate its quality compared to multiple baselines with 34 human subjects.
Author: Dong Zhang, Omar Javed, Mubarak Shah
Abstract: In this paper, we propose a novel approach to extract primary object segments in videos in the ‘object proposal’ domain. The extracted primary object regions are then used to build object models for optimized video segmentation. The proposed approach has several contributions: First, a novel layered Directed Acyclic Graph (DAG) based framework is presented for detection and segmentation of the primary object in video. We exploit the fact that, in general, objects are spatially cohesive and characterized by locally smooth motion trajectories, to extract the primary object from the set of all available proposals based on motion, appearance and predicted-shape similarity across frames. Second, the DAG is initialized with an enhanced object proposal set where motion based proposal predictions (from adjacent frames) are used to expand the set of object proposals for a particular frame. Last, the paper presents a motion scoring function for selection of object proposals that emphasizes high optical flow gradients at proposal boundaries to discriminate between moving objects and the background. The proposed approach is evaluated using several challenging benchmark videos and it outperforms both unsupervised and supervised state-of-the-art methods. 1. Introduction & Related Work In this paper, our goal is to detect the primary object in videos and to delineate it from the background in allframes. Video object segmentation is a well-researched problem in the computer vision community and is a prerequisite for a variety of high-level vision applications, including content based video retrieval, video summarization, activity understanding and targeted content replacement. Both fully automatic methods and methods requiring manual initialization have been proposed for video object segmentation. In the latter class of approaches, [2, 15, 23] need annotations of object segments in key frames for initialization. Frame #38 #39 #61 #62 V ideo Frames Key-?fram e Object Regions [13] ? PrimaryObjectRegionsExtractedbyProposedMethod Figure 1. Primary object region selection in the object proposal domain. The first row shows frames from a video. The second row shows key object proposals (in red boundaries) extracted by [13]. “?” indicates that no proposal related to the primary object was found by the method. The third row shows primary object proposals selected by the proposed method. Note that the proposed method was able to find primary object proposals in all frames. The results in row 2 and 3 are prior to per-pixel segmentation. In this paper we demonstrate that temporally dense extraction of primary object proposals results in significant improvement in object segmentation performance. Please see Table 1for quantitative results and comparisons to state of the art.[Please Print in Color] Optimization techniques employing motion and appearance constraints are then used to propagate the segments to all frames. Other methods ([16, 20]) only require accurate object region annotation for the first frame, then employ region tracking to segment the rest of frames into object and background regions. Note that, the aforementioned semi-automatic techniques generally give good segmenta666222668 Figure 2. Object proposals from a video frame employing the method in [7]. The left side image is one of the video frames. Note that the monkey is the object of interest in the frame. Images on the right show some of the top ranked object proposals from the frame. 
Most of the proposals do not correspond to an actual object. The goal of the proposed work is to generate an enhanced set of object proposals and extract the segments related to the primary object from the video. tion results. However, most computer vision applications involve processing of large amounts of video data, which makes manual initialization cost prohibitive. Consequently, a large number of automatic methods have also been proposed for video object segmentation. A subset of these methods employs motion grouping ([19, 18, 4]) for object segmentation. Other methods ([10, 3, 21]) use appearance cues to segment each frame first and then use both appearance and motion constraints for a bottom-up final segmentation. Methods like [9, 3, 11, 22] present efficient optimization frameworks for spatiotemporal grouping of pixels for video segmentation. However, all of these automatic methods do not have an explicit model of how an object looks or moves, and therefore, the segments usually don’t correspond to a particular object but only to image regions that exhibit coherent appearance or motion. Recently, several methods ([7, 5, 1]) were proposed that provided an explicit notion of how a generic object looks like. Specifically, the method [7] could extract object-like regions or ‘object proposals’ from images. This work was built upon by Lee et al. [13] and Ma and Latecki [14] to employ object proposals for object video segmentation. Lee et al. [13] proposed to detect the primary object by collecting a pool of object proposals from the video, and then applying spectral graph clustering to obtain multiple binary inlier/outlier partitions. Each inlier cluster corresponds to a particular object’s regions. Both motion and appearance based cues are used to measure the ‘objectness’ of a proposal in the cluster. The cluster with the largest average ‘objectness’ is likely to contain the primary object in video. One shortcoming of this approach is that the clustering process ignores the order of the proposals in the video, and there- fore, cannot model the evolution of object’s shape and location with time. The work by Ma and Latecki [14] attempts Input Videos Figure 3. The Video Object Segmentation Framework to mitigate this issue by utilizing relationships between object proposals in adjacent frames. The object region selection problem is modeled as a constrained Maximum Weight Cliques problem in order to find the true object region from all the video frames simultaneously. However, this problem is NP-hard ([14]) and an approximate optimization technique is used to obtain the solution. The object proposal based segmentation approaches [13, 14] have two additional limitations compared to the proposed method. First, in both approaches, object proposal generation for a particular frame doesn’t directly depend on object proposals generated for adjacent frames. Second, both approaches do not actually predict the shape of the object in adjacent frames when computing region similarity, which degrades segmentation performance for fast moving objects. In this paper, we present an approach that though inspired from aforementioned approaches, attempts to remove their shortcomings. Note that, in general, an object’s shape and appearance varies slowly from frame to frame. Therefore, the intuition is that the object proposal sequence in a video with high ‘objectness’, and high similarity across frames is likely to be the primary object. 
To this end, we use optical flow to track the evolution of object shape, and compute the difference between predicted and actual shape (along with appearance) to measure similarity of object proposals across frames. The ‘objectness’ is measured using appearance and a motion based criterion that emphasizes high optical flow gradients at the boundaries between objects proposals and the background. Moreover, the primary object proposal selection problem is formulated as the longest path problem for Directed Acyclic Graph (DAG), for which (unlike [14]) an optimal solution exists in linear time. Note that, if the temporal order of object proposals locations (across frames) is not used ([13], then it can result in no proposals being associated with the prima666222779 ry object for many frames (please see Figure 1). The proposed method not only uses object proposals from a particular frame (please see Figure 2), but also expands the proposal set using predictions from proposals of neighboring frame. The combination of proposal expansion, and the predicted shape based similarity criteria results in temporally dense and spatially accurate primary object proposal extraction. We have evaluated the proposed approach using several challenging benchmark videos and it outperforms both unsupervised and supervised state-of-the-art methods In Section 2, the proposed layered DAG based object selection approach is introduced and discussed in detail; In Section 3, both qualitative and quantitative experiments results for two publicly available datasets and some other challenging videos are shown; The paper is concluded in Section 4. 2. Layered DAG based Video Object Segmentation 2.1. The Framework The proposed framework consists of 3 stages (as shown in Figure 3): 1. Generation of object proposals per-frame and then expansion of the proposal set for each frame based on object proposals in adjacent frames. 2. Generation of a layered DAG from all the object proposals in the video. The longest path in the graph fulfills the goal of maximizing ob- jectness and similarity scores, and represents the most likely set of proposals denoting the primary object in the video. 3. The primary object proposals are used to build object and background models using Gaussian mixtures, and a graph-cuts based optimization method is used to obtain refined per-pixel segmentation. Since the proposed approach is centered around layered DAG framework for selection of primary object regions, we will start with its description. 2.2. Layered DAG Structure We want to extract object proposals with high objectness likelihood, high appearance similarity and smoothly varying shape from the set of all proposals obtained from the video. Also since we want to extract the primary object only, we want to extract at most a single proposal per frame. Keeping these objectives in mind, the layered DAG is formed as follows. Each object proposal is represented by two nodes: a ‘beginning node’ and an ‘ending node’ and there are two types of edges: unary edges and binary edges. The unary edges have weights which measure the objectness of a proposal. The details of the function for unary weight assignments (measuring objectness) are given in section 2.2. 1. All the beginning nodes in the same frame form a layer, so as the ending nodes. A directed unary edge is built from beginning node to ending node. Thus, each video frame is represented by two layers in the graph. 
DiFrame i-1 Frame i Frame i+1 s… … La2i-ye3rL2ai-y2erL2ayi-1erLa2yierL2ai+y1erL2ai+y2er… … t Figure 4. Layered Directed Acyclic Graph (DAG) Structure. Node “s” and “t” are source and sink nodes respectively, which have zero weights for edges with other nodes in the graph. The yellow nodes and the green nodes are “beginning nodes” and “ending nodes” respectively and they are paired such that each yellow-green pair represents an object proposal. All the beginning nodes in the same frame are arranged in a layer and the same as the ending nodes. The green edges are the unary edges and red edges are the binary edges. rected binary edges are built from any ending node to all the beginning nodes in latter layers. The binary edges have weights which measure the appearance and shape similarity between the corresponding object proposals across frames. The binary weight assignment functions are introduced in Section 2.2.2. Figure 4 is an illustration of the graph structure. It shows frame i− 1, iand i 1 of the graph, with corresponding layers oif − −2i 1 1−,3 i, a2nid d− i2, + +2 i1 1− o1f, h2ie, 2gira+p 1h ,a wndi t2hi +co2rr. eNspooten tdhinagt, only 3s object proposals are s1h, o2wi,n 2 ifo+r 1e aacnhd layer f.or N simplic- + ity, however, there are usually hundreds of object proposals for each frame and the number of object proposals for different frames are not necessary the same. The yellow nodes are “beginning nodes”, the green nodes are “ending nodes”, the green edges are unary edges with weights indicating objectness and the red edges are binary edges with weights indicating appearance and shape similarity (note that the graph only shows some of the binary edges for simplicity). There is also a virtual source node s and a sink node t with 0 weighted edges (black edges) to the graph. Note that, it is not necessary to build binary edges from an ending node to all the beginning nodes in latter layers. In practice, only building binary edges to the next three subsequent frames is enough for most of the videos. 2.2.1 Unary Edges Unary edges measure the objectness of the proposals. Both appearance and motion are important to infer the objectness, so the scoring function for object proposals is defined as Sunary (r) = A(r) + M(r), in which r is any object proposal, A(r) is the appearance score and M(r) is the motion score. We define M(r) as the average Frobenius norm of optical flow gradient around the boundary of object pro666232880 Figure 5. Optical Flow Gradient Magnitude Motion Scoring. In row 1, column 1 shows the original video frame, column 2 is one of the object proposals and column 3 shows dilated boundary of the object proposal. In row 2, column 1 shows the forward optical flow of the frame, column 2 shows the optical flow gradient magnitude map and column 3 shows the optical flow gradient magnitude response for the specific object proposal around the boundary. [Please Print in Color] posal r. The Frobenius norm of optical flow gradients is defined as: ??UX??F=?????uvxx uvy ?????F=?ux2+ u2y+ vx2+ vy2, in ?whic?h U =? (1) (u, v) is th??e forward optical flow of the frame, ux , vx and uy, vy are optical flow gradients in x and y directions respectively. The intuition behind this motion scoring function is that, the motions of foreground object and background are usually distinct, so boundary of moving objects usually implies discontinuity in motion. 
Therefore, ideally, the gradient of optical flow should have high magnitude around foreground object boundary (this phenomenon could be easily observed from Figure 5). In equation 1, we use the Frobenius norm to measure the optical flow gradient magnitude, the higher the value, the more likely the region is from a moving object. In practice, usually the maximum of optical flow gradient magnitude does not coincide exactly with the moving object boundary due to underlying approximation of optical flow calculation. Therefore, we dilate the object proposal boundary and get the average optical flow gradient magnitude as the motion score. Figure 5 is an illustration of this process. The appearance scoring function A(r) is measured by the objectness ([7]). 2.2.2 Binary Edges Binary edges measure the similarity between object proposals across frames. For measuring the similarity of regions, color, location, size and shape are the properties to be considered. We define the similarity between regions as the weight of binary edges as follows: Sbinary(rm, rn) = λ · Scolor(rm, rn) · Soverlap(rm, rn), (2) in which rm and rn are regions from frame m and n, λ is a constant value for adjusting the ratio between unary and binary edges, Soverlap is the overlap similarity between regions and Scolor is the color histogram similarity: Scolor(rm, rn) = hist(rm) · hist(rn)T, (3) in which hist(r) is the normalized color histogram for a region r. Soverlap(rm,rn) =||rrmm∩∪ wwaarrppmmnn((rrnn))||, (4) in which warpmn (rn) is the warped region from rn by optical flow to frame m. It is clear that Scolor encodes the color similarity between regions and Soverlap encodes the size and location similarity between regions. If two regions are close, and the sizes and shapes are similar, the value would be higher, and vice versa. Note that, unlike prior approaches [13, 14], we use optical flow to predict the region (i.e. encoding location and shape), and therefore we are better able to compute similarity for fast moving objects. 2.2.3 Dynamic Programming Solution Until now, we have built the layered DAG and the objective is clear: to find the highest weighted path in the DAG. Assume the graph contains 2F + 2 layers (F is the frame number), the source node is in layer 0 and the sink node is in layer 2F + 2. Let Nij denotes the jth node in ith layer and E(Nij , Nkl) denotes the edge from Nij to Nkl. Layer i has Mi nodes. Let P = (p1, p2 , ..., pm+1) = (N01, Nj1j2, ..., Njm−1jm, N(2n+2)1) be a path from source to sink node. Therefore, ?m Pmax= arg mPax?i=1E(pi,pi+1). (5) Pmax forms a Longest (simple) Path Problem for DAG. Let OPT(i, j) be the maximum path value for Nij from source node. The maximum path value satisfies the following recurrence for i≥ 1and j ≥ 1: OPT(i,j) = k=0...i−m1a,lx=1...Mk[OPT(k,l) + E(Nkl,Nij)]. (6) This problem could be solved by dynamic programming in linear time [12]. The computational complexity for the algorithm is O(n + m), in which n is the number of nodes 666322 919 and m is the number of edges. The most important parameter for the layered DAG is the ratio λ between unary edges and binary edges. However, in practice, the results are not sensitive to it, and in the experiments λ is simply set to be 1. 2.3. Per-pixel Video Object Segmentation Once the primary object proposals are obtained in a video, the results are further refined by a graph-based method to get per-pixel segmentation results. We define a spatiotemporal graph by connecting frames temporally with optical flow displacement. 
Each of the nodes in the graph is a pixel in a frame, and edges are set to be the 8-neighbors within one frame and the forward-backward 18 neighbors in adjacent frames. We define the energy function for labeling f = [f1, f2, ..., fn] of n pixels with prior knowledge of h: E(f,h) = ?Dhi(fi) + λ ?i∈S ? Vi,j(fi,fj), (7) (i,?j)∈N where S = {pi, ..., pn} is the set of n pixels in the video, N cwohnesriest Ss o =f neighboring pixels, ta ondf i,j ixnedlesx in nt thhee pixels. pi could be set to 0 or 1which represents background or foreground respectively. The unary term Dih defines the cost of labeling pixel iwith label fi which we get from the Gaussian Mixture Models (GMM) for both color and location. Dih(fi) = −log(αUic(fi, h) + (1 − α)Uil(fi, h)), (8) where Uic(.) is the color-induced cost and Uil (.) is the location cost. For the binary term Vi,j (fi, fj), we follow the definitions in [17]: Vi,j(fi, fj) = [fi = fj]exp−β(Ci−Cj)2, (9) where [.] denotes the indicator function taking values 0 and 1, (Ci − Cj)2 is the Euclidean distance betwe?en two adjacent nodes in RGB space, and β = (2? (Ci − Cj)2)−1|(i,j)∈N ?We use −th Ce graph-cuts based minimization method in [8] to o?btain the optimal solution for equation 7, and thus get the final segmentation results. Next, we describe the method for object proposal generation that is used to initialize the video object segmentation process. 2.4. Object Proposal Generation & Expansion In order to achieve our goal of identifying image regions belonging to the primary object in the video, it is preferable (though not necessary) to have an object proposal corresponding to the actual object for each frame in which object is present. Using only appearance or optical flow based Figure 6. Object Proposal Expansion. For each optical flow warped object proposal in frame i− 1, we look for object proposals din o fbjreamcte p ir owpohsicahl ihnav fer high overlap erat liooosk kw fiotrh tohbej warped one. If some object proposals all have high overlap ratios with the warped one, they are merged into a new large object proposal. This process will produce the right object proposal if it is not discovered by [7] from frame i, but frame i− 1. cues to generate object proposals is usually not enough for this purpose. This phenomenon could be observed in the example shown in Figure 6. For frame iin this figure, hundreds of object proposals were generated using method in [7], however, no proposal is consistent with the true object, and the object is fragmented between different proposals. We assume that an object’s shape and location changes smoothly across frames and propose to enhance the set of object proposals for a frame by using the proposals generated for its adjacent frames. The object proposal expansion method works by the guidance of optical flow (see Figure 6). For the forward version of object proposal expansion, each object proposal rk in frame i− 1 is warped by the forward optical flow toi −fra1mine fir,a tmheen i a −ch 1ec isk wisa rmpaedde bify any proposal in frame i has a large overlap ratio with the rij 666333002 warped object proposal, i.e., o =|warpi−1,|ir(jir|ik−1) ∩ rij|. (10) The contiguous overlapped areas, for regions in i+1 with o greater than 0.5, are merged into a single region, and are used as additional proposals. Note that, the old original proposals are also kept, so this is an ‘expansion’ of the proposal set, and not a replacement. In practice, this process is carried out both forward and backward in time. 
Since it is an iterative process, even if suitable object proposals are missing in consecutive frames, they could potentially be produced by this expansion process. Figure 6 shows an example image sequence where the expansion process resulted in generation of a suitable proposal. 3. Experiments The proposed method was evaluated using two wellknown segmentation datasets: SegTrack dataset [20] and GaTech video segmentation dataset [9]. Quantitative comparisons are shown for SegTrack dataset since ground-truth is available for this dataset. Qualitative results are shown for GaTech video segmentation dataset. We also evaluated the proposed approach on additional challenging videos, for which we will share the ground-truth to aid future evaluations. 3.1. SegTrack Dataset We first evaluate our method on Segtrack dataset [20]. There are 6 videos in this dataset, and also a pixel-level segmentation ground-truth for each video is available. We follow the setup in the literature ([13, 14]), and use 5 (birdfall, cheetah, girl, monkeydog and parachute) of the videos for evaluation (since the ground-truth for the other one (penguin) is not useable). We use an optical flow magnitude based model selection method to infer the camera motion: for static cameras, a background subtraction cue is also used for moving object extraction; for all the results shown in this section, the static camera model was only selected (automatically) for the “birdfall” video. We compare our method with 4 state-of-the-art methods [14], [13], [20] and [6] shown in Table 1. Note that our method is a unsupervised method, and it outperforms all the other unsupervised methods except for the parachute video where it is a close second. Note that [20] and [6] are supervised methods which need an initial annotation for the first frame. The results in Table 1are the average per-frame pixel error rate compared to the ground-truth. The definition is [20]: error = XORF(f,GT), (11) where f is the segmentation labeling results of the method, GT is the ground-truth labeling of the video, and F is the (a) Birdfall (b) Cheetah (c) Girl (d) Monkeydog (e) Parachute Figure 7. SegTrack dataset results. The regions within the red boundaries are the segmented primary objects. [Please Print in Color] VideoOurs[14][13][20][6] birdfall155189288252454 cheetah 633 806 905 1142 1217 girl 1488 1698 1785 1304 1755 monkeydog 365 472 521 563 683 parachute 220 221 201 235 502 Avg. 452 542 592 594 791 supervised? N N N Y Y Table 1. Quantitative results and comparison with the state of the art on SegTrack dataset number of frames in the video. Figure 7 shows qualitative results for the videos of SegTrack dataset. Figure 8 is an example that shows the effectiveness of the proposed layered DAG approach for temporally dense extraction of primary object regions. The figure shows consecutive frames (frame 38 to frame 43) from “monkeydog” video. The top 2 rows show the results of key-frame objec- t extraction method [13], and the bottom 2 rows show our object region selection results. As one can see, [13] detects the primary object proposal in only one of the frames, however, by using the proposed approach, we can extract the 666333113 #41 ?#42 ?#43 ?(a) Key-frame Obje?ct Re gion Sel cti?on #41 #42 #43 Frame #38 ?#39 ?#40 Frame #38 #39 #40 (b) Layered DAG Object Region Sel ction Figure 8. Comparison of object region selection methods. The regions within the red boundaries are the selected object regions. “?” means there is no object region selected by the method. 
3.2. GaTech Segmentation Dataset

We also evaluated the proposed method on the GaTech video segmentation dataset. Figure 9 shows a qualitative comparison between the proposed approach and the original bottom-up method for this dataset. As one can observe, our method segments the true foreground object from the background, whereas the method of [9] does not use an object model, which induces over-segmentation (although its results are very good for the general segmentation problem).

Figure 9. Object segmentation results on the GaTech video segmentation dataset: (a) waterski, (b) yunakim. Row 1: original frame. Row 2: segmentation results of the bottom-up segmentation method [9]. Row 3: video object segmentation by the proposed method. The regions within the red or green boundaries are the segmented primary objects. [Please Print in Color]

3.3. Persons and Cars Segmentation Dataset

We have built a new dataset for video object segmentation. The dataset is challenging: persons appear in a variety of poses, and cars move at different speeds; when cars are slow, motion segmentation is very hard. We generated ground truth for these videos. Figure 10 shows some sample results from this dataset, and Table 2 shows the quantitative results (the average per-frame pixel error is defined in the same way as for the SegTrack dataset [20]). Please go to http://crcv.ucf.edu for more details.

Table 2. Quantitative results on the Persons and Cars dataset.

Video       Average per-frame pixel error
Surfing      1209
Jumping       835
Skiing        817
Sliding      2228
Big car      1129
Small car     272

4. Conclusions

We have proposed a novel and efficient layered DAG based approach to segment the primary object in videos. This approach also uses innovative mechanisms to compute the ‘objectness’ of a region and to compute the similarity between object proposals across frames. The proposed approach outperforms the state of the art on the well-known SegTrack dataset. We also demonstrate good segmentation performance on additional challenging datasets.

Figure 10. Sample results on the Persons and Cars dataset: (a) Surfing, (b) Jumping, (c) Skiing, (d) Sliding, (e) Big car, (f) Small car. Please go to http://crcv.ucf.edu for more details.

Acknowledgment

This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20066. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

References

[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR, pages 73–80, 2010.
[2] X. Bai, J. Wang, D. Simons, and G. Sapiro. Video SnapCut: robust video object cutout using localized classifiers. ACM Transactions on Graphics, 28(3):70, 2009.
[3] W. Brendel and S. Todorovic. Video object segmentation by tracking regions. In ICCV, pages 833–840, 2009.
[4] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, pages 282–295, 2010.
[5] J. Carreira and C. Sminchisescu.
Constrained parametric min-cuts for automatic object segmentation. In CVPR, pages 3241–3248, 2010.
[6] P. Chockalingam, N. Pradeep, and S. Birchfield. Adaptive fragments-based tracking of non-rigid objects using level sets. In ICCV, pages 1530–1537, 2009.
[7] I. Endres and D. Hoiem. Category independent object proposals. In ECCV, pages 575–588, 2010.
[8] B. Fulkerson, A. Vedaldi, and S. Soatto. Class segmentation and object localization with superpixel neighborhoods. In ICCV, pages 670–677, 2009.
[9] M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph-based video segmentation. In CVPR, pages 2141–2148, 2010.
[10] Y. Huang, Q. Liu, and D. Metaxas. Video object segmentation by hypergraph cut. In CVPR, pages 1738–1745, 2009.
[11] J. Wang, B. Thiesson, Y. Xu, and M. Cohen. Image and video segmentation by anisotropic kernel mean shift. In ECCV, 2004.
[12] J. Kleinberg and E. Tardos. Algorithm design. Pearson Education and Addison Wesley, 2006.
[13] Y. Lee, J. Kim, and K. Grauman. Key-segments for video object segmentation. In ICCV, pages 1995–2002, 2011.
[14] T. Ma and L. Latecki. Maximum weight cliques with mutex constraints for video object segmentation. In CVPR, pages 670–677, 2012.
[15] B. Price, B. Morse, and S. Cohen. LiveCut: Learning-based interactive video segmentation by evaluation of multiple propagated cues. In ICCV, pages 779–786, 2009.
[16] X. Ren and J. Malik. Tracking as repeated figure/ground segmentation. In CVPR, pages 1–8, 2007.
[17] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3):309–314, 2004.
[18] Y. Sheikh, O. Javed, and T. Kanade. Background subtraction for freely moving cameras. In ICCV, pages 1219–1225, 2009.
[19] J. Shi and J. Malik. Motion segmentation and tracking using normalized cuts. In ICCV, pages 1154–1160, 1998.
[20] D. Tsai, M. Flagg, and J. Rehg. Motion coherent tracking with multi-label MRF optimization. In BMVC, page 1, 2010.
[21] A. Vazquez-Reina, S. Avidan, H. Pfister, and E. Miller. Multiple hypothesis video segmentation from superpixel flows. In ECCV, pages 268–281, 2010.
[22] C. Xu, C. Xiong, and J. J. Corso. Streaming hierarchical video segmentation. In ECCV, pages 626–639, 2012.
[23] J. Yuen, B. Russell, C. Liu, and A. Torralba. LabelMe video: Building a video database with human annotations. In ICCV, pages 1451–1458, 2009.
6 0.76997936 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors
7 0.74466455 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos
8 0.73986715 333 cvpr-2013-Plane-Based Content Preserving Warps for Video Stabilization
9 0.69509387 137 cvpr-2013-Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis
10 0.68452191 203 cvpr-2013-Hierarchical Video Representation with Trajectory Binary Partition Tree
11 0.66137415 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
13 0.65596545 434 cvpr-2013-Topical Video Object Discovery from Key Frames by Modeling Word Co-occurrence Prior
14 0.64873594 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
15 0.63405848 353 cvpr-2013-Relative Hidden Markov Models for Evaluating Motion Skill
16 0.62813467 37 cvpr-2013-Adherent Raindrop Detection and Removal in Video
17 0.62739682 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
18 0.62592518 118 cvpr-2013-Detecting Pulse from Head Motions in Video
19 0.61508137 29 cvpr-2013-A Video Representation Using Temporal Superpixels
20 0.61033553 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs
topicId topicWeight
[(10, 0.172), (16, 0.012), (24, 0.014), (26, 0.062), (33, 0.32), (67, 0.066), (69, 0.056), (87, 0.108), (98, 0.118)]
simIndex simValue paperId paperTitle
1 0.96700925 372 cvpr-2013-SLAM++: Simultaneous Localisation and Mapping at the Level of Objects
Author: Renato F. Salas-Moreno, Richard A. Newcombe, Hauke Strasdat, Paul H.J. Kelly, Andrew J. Davison
Abstract: We present the major advantages of a new ‘object oriented’ 3D SLAM paradigm, which takes full advantage in the loop of prior knowledge that many scenes consist of repeated, domain-specific objects and structures. As a hand-held depth camera browses a cluttered scene, realtime 3D object recognition and tracking provides 6DoF camera-object constraints which feed into an explicit graph of objects, continually refined by efficient pose-graph optimisation. This offers the descriptive and predictive power of SLAM systems which perform dense surface reconstruction, but with a huge representation compression. The object graph enables predictions for accurate ICP-based camera to model tracking at each live frame, and efficient active search for new objects in currently undescribed image regions. We demonstrate real-time incremental SLAM in large, cluttered environments, including loop closure, relocalisation and the detection of moved objects, and of course the generation of an object level scene description with the potential to enable interaction.
2 0.9505583 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
Author: Ian Endres, Kevin J. Shih, Johnston Jiaa, Derek Hoiem
Abstract: We propose a method to learn a diverse collection of discriminative parts from object bounding box annotations. Part detectors can be trained and applied individually, which simplifies learning and extension to new features or categories. We apply the parts to object category detection, pooling part detections within bottom-up proposed regions and using a boosted classifier with proposed sigmoid weak learners for scoring. On PASCAL VOC 2010, we evaluate the part detectors’ ability to discriminate and localize annotated keypoints. Our detection system is competitive with the best existing systems, outperforming other HOG-based detectors on the more deformable categories.
3 0.94479108 414 cvpr-2013-Structure Preserving Object Tracking
Author: Lu Zhang, Laurens van_der_Maaten
Abstract: Model-free trackers can track arbitrary objects based on a single (bounding-box) annotation of the object. Whilst the performance of model-free trackers has recently improved significantly, simultaneously tracking multiple objects with similar appearance remains very hard. In this paper, we propose a new multi-object model-free tracker (based on tracking-by-detection) that resolves this problem by incorporating spatial constraints between the objects. The spatial constraints are learned along with the object detectors using an online structured SVM algorithm. The experimental evaluation of our structure-preserving object tracker (SPOT) reveals significant performance improvements in multi-object tracking. We also show that SPOT can improve the performance of single-object trackers by simultaneously tracking different parts of the object.
4 0.94472677 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities
Author: Horst Possegger, Sabine Sternig, Thomas Mauthner, Peter M. Roth, Horst Bischof
Abstract: Combining foreground images from multiple views by projecting them onto a common ground-plane has been recently applied within many multi-object tracking approaches. These planar projections introduce severe artifacts and constrain most approaches to objects moving on a common 2D ground-plane. To overcome these limitations, we introduce the concept of an occupancy volume exploiting the full geometry and the objects’ center of mass and develop an efficient algorithm for 3D object tracking. Individual objects are tracked using the local mass density scores within a particle filter based approach, constrained by a Voronoi partitioning between nearby trackers. Our method benefits from the geometric knowledge given by the occupancy volume to robustly extract features and train classifiers on-demand, when volumetric information becomes unreliable. We evaluate our approach on several challenging real-world scenarios including the public APIDIS dataset. Experimental evaluations demonstrate significant improvements compared to state-of-the-art methods, while achieving real-time performance.
5 0.94304597 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking
Author: Junseok Kwon, Kyoung Mu Lee
Abstract: We propose a novel tracking algorithm that robustly tracks the target by finding the state which minimizes uncertainty of the likelihood at current state. The uncertainty of the likelihood is estimated by obtaining the gap between the lower and upper bounds of the likelihood. By minimizing the gap between the two bounds, our method finds the confident and reliable state of the target. In the paper, the state that gives the Minimum Uncertainty Gap (MUG) between likelihood bounds is shown to be more reliable than the state which gives the maximum likelihood only, especially when there are severe illumination changes, occlusions, and pose variations. A rigorous derivation of the lower and upper bounds of the likelihood for the visual tracking problem is provided to address this issue. Additionally, an efficient inference algorithm using Interacting Markov Chain Monte Carlo is presented to find the best state that maximizes the average of the lower and upper bounds of the likelihood and minimizes the gap between two bounds simultaneously. Experimental results demonstrate that our method successfully tracks the target in realistic videos and outperforms conventional tracking methods.
6 0.94214612 22 cvpr-2013-A Non-parametric Framework for Document Bleed-through Removal
7 0.94195288 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
same-paper 8 0.94117302 187 cvpr-2013-Geometric Context from Videos
9 0.94046116 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path
10 0.93996108 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
11 0.93989348 325 cvpr-2013-Part Discovery from Partial Correspondence
12 0.93914336 143 cvpr-2013-Efficient Large-Scale Structured Learning
13 0.9389168 242 cvpr-2013-Label Propagation from ImageNet to 3D Point Clouds
14 0.93827307 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image
15 0.93815368 227 cvpr-2013-Intrinsic Scene Properties from a Single RGB-D Image
16 0.93787998 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
17 0.93741226 400 cvpr-2013-Single Image Calibration of Multi-axial Imaging Systems
18 0.93699753 121 cvpr-2013-Detection- and Trajectory-Level Exclusion in Multiple Object Tracking
19 0.93669677 19 cvpr-2013-A Minimum Error Vanishing Point Detection Approach for Uncalibrated Monocular Images of Man-Made Environments
20 0.93662363 314 cvpr-2013-Online Object Tracking: A Benchmark