cvpr cvpr2013 cvpr2013-294 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Wei-Chen Chiu, Mario Fritz
Abstract: Video data provides a rich source of information that is available to us today in large quantities, e.g. from online resources. Tasks like segmentation benefit greatly from the analysis of spatio-temporal motion patterns in videos, and recent advances in video segmentation have shown great progress in exploiting these additional cues. However, observing a single video is often not enough to predict meaningful segmentations, and inference across videos becomes necessary in order to predict segmentations that are consistent with object classes. Therefore the task of video co-segmentation has been proposed, which aims at inferring segmentations from multiple videos. But current approaches are limited to only considering binary foreground/background segmentation and multiple videos of the same object. This is a clear mismatch to the challenges that we are facing with videos from online resources or consumer videos. We propose to study multi-class video co-segmentation where the number of object classes is unknown as well as the number of instances in each frame and video. We achieve this by formulating a non-parametric Bayesian model across video sequences that is based on a new video segmentation prior as well as a global appearance model that links segments of the same class. We present the first multi-class video co-segmentation evaluation. We show that our method is applicable to real video data from online resources and outperforms state-of-the-art video segmentation and image co-segmentation baselines.
Reference: text
sentIndex sentText sentNum sentScore
1 Tasks like segmentation benefit greatly from the analysis of spatio-temporal motion patterns in videos, and recent advances in video segmentation have shown great progress in exploiting these additional cues. [sent-4, score-0.85]
2 However, observing a single video is often not enough to predict meaningful segmentations, and inference across videos becomes necessary in order to predict segmentations that are consistent with object classes. [sent-5, score-0.645]
3 Therefore the task of video co-segmentation has been proposed, which aims at inferring segmentations from multiple videos. [sent-6, score-0.546]
4 This is a clear mismatch to the challenges that we are facing with videos from online resources or consumer videos. [sent-10, score-0.35]
5 We propose to study multi-class video co-segmentation where the number of object classes is unknown as well as the number of instances in each frame and video. [sent-11, score-0.467]
6 We achieve this by formulating a non-parametric Bayesian model across video sequences that is based on a new video segmentation prior as well as a global appearance model that links segments of the same class. [sent-12, score-0.971]
7 We show that our method is applicable to real video data from online resources and outperforms state-of-the-art video segmentation and image co-segmentation baselines. [sent-14, score-0.829]
8 Our proposed multi-class video co-segmentation model addresses segmentation of multiple object classes across multiple videos. [sent-21, score-0.612]
9 The segments are linked within and across videos via the global object classes. [sent-22, score-0.515]
10 As a single video might only expose a partial view, accidental similarities in appearance and motion patterns might lead to an ambiguous or even misleading analysis. [sent-25, score-0.463]
11 In addition, performing video segmentation independently on each video of a video collection does not reveal any object class structure between the segments that would lead to a much richer representation. [sent-26, score-1.24]
12 Second, a richer problem set should be investigated where the approach is enabled to reason across multiple video sequences in order to collect additional evidence that is able to link segments across videos. [sent-29, score-0.645]
13 background segmentation is assumed, and therefore no association between object classes is required across videos. [sent-33, score-0.356]
14 Furthermore, one presented evaluation [5] remains qualitative and the other one uses synthetically generated sequences [16] that paste a foreground video into different backgrounds. [sent-35, score-0.372]
15 There is still a big disconnect between the idea of video co-segmentation and the challenges presented in video data from the web or personal video collections. [sent-36, score-0.962]
16 We propose an approach that considers real video data, where neither the global number of appearance classes nor the number of instances in each image is known. [sent-38, score-0.534]
17 Our method is based on the first application of distance-dependent Chinese Restaurant Processes to video data in order to formulate a video segmentation prior. [sent-39, score-0.741]
18 Related Work The idea of spatio–temporal analysis and segmentation of video data [6, 23, 22] has seen several refinements over the last years. [sent-42, score-0.475]
19 Although their methods provide plausible solutions on video segmentation tasks, they lack a global appearance model that would relate segments across videos for the video co-segmentation task. [sent-46, score-1.239]
20 None of these models has presented a video segmentation prior or described a generative model for appearance classes across multiple videos. [sent-54, score-0.76]
21 We present a model that employs the ddCRP in order to formulate a video segmentation prior, and that learns appearance models together with the segmentation across multiple videos. [sent-56, score-0.759]
22 Video Co-Segmentation Recently, two initial attempts [16, 5] have been made to approach video co-segmentation with a binary foreground/background segmentation task. [sent-57, score-0.439]
23 We define and address a multi-class video cosegmentation task. [sent-60, score-0.409]
24 Generative Multi-Video Model The goal of this paper is to perform segmentation across multiple videos where the segments should correspond to the objects and segments of the same object class are linked together within and across videos. [sent-63, score-0.935]
25 As motivated above, video segmentation on each video independently can lead to ambiguities that can only be resolved by reasoning across sequences. [sent-64, score-0.807]
26 In order to deal with this problem we approach video cosegmentation by a generative model where videos are linked by a global appearance model. [sent-65, score-0.834]
27 In particular, we define a video segmentation prior that proposes contiguous segments of coherent motion by a distance dependent Chinese Restaurant Process (ddCRP) as well as an infinite mixture model for the global appearance classes based on a Chinese Restaurant Process (CRP) [15]. [sent-67, score-1.003]
28 After describing our video representation, we give an overview of Chinese Restaurant Processes (CRP) and their extension to distance-dependent Chinese Restaurant Processes (ddCRP) [2]. [sent-68, score-0.342]
29 The ddCRP will then be used to define a video segmentation prior. [sent-69, score-0.439]
30 In order to define a generative model across videos, we add another layer on top that links the videos with a shared appearance model. [sent-70, score-0.798]
31 Video Representation Given a set of videos V, we start with a superpixel segmentation for each frame within the sequence and represent each video as a collection of superpixels. [sent-73, score-0.575]
32 For every video v ∈ V, we denote its total number of superpixels by Nv, and describe each superpixel i by its appearance feature xi, spatio-temporal location si and motion vector mi. [sent-74, score-0.662]
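As a minimal sketch of this per-superpixel representation (the names below are illustrative placeholders, not the authors' code), each video can be viewed as a list of superpixels that carry an appearance feature, a space-time location, and a mean motion vector:

```python
# Minimal sketch of the video representation assumed by the model; the field
# names are illustrative, not taken from the authors' implementation.
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Superpixel:
    appearance: np.ndarray   # x_i, e.g. a histogram over appearance codewords
    location: np.ndarray     # s_i = (x, y, t) centroid in space-time
    motion: np.ndarray       # m_i, mean optical-flow vector within the superpixel

@dataclass
class Video:
    superpixels: List[Superpixel] = field(default_factory=list)   # N_v entries
```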
33 A sequence of customers enters the restaurant and sits at randomly chosen tables. [sent-80, score-0.572]
34 The i-th customer sits down at a table with a probability that is proportional to how many customers are already sitting at that table or opens up a new table with a probability proportional to a hyperparameter. [sent-81, score-0.656]
35 Since the table assignment of each customer just depends on the number of people sitting at each table and is independent of the other ones, the ordering of customers does not affect the distribution over partitions and therefore exchangeability holds. [sent-84, score-0.653]
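The following is a small sketch of this standard CRP construction (a generic illustration, not code from the paper): customer i joins an occupied table with probability proportional to its occupancy, or opens a new table with probability proportional to the hyperparameter.

```python
import numpy as np

def sample_crp_partition(num_customers, gamma, rng=None):
    """Sample table assignments under a CRP(gamma): an occupied table is chosen
    with probability proportional to its occupancy, a new table with probability
    proportional to gamma."""
    rng = rng or np.random.default_rng()
    table_sizes = []                 # table_sizes[t] = customers seated at table t
    assignments = []
    for _ in range(num_customers):
        weights = np.array(table_sizes + [gamma], dtype=float)
        t = rng.choice(len(weights), p=weights / weights.sum())
        if t == len(table_sizes):
            table_sizes.append(0)    # open a new table
        table_sizes[t] += 1
        assignments.append(t)
    return assignments
```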
36 The main difference between the CRP and ddCRP is that rather than directly linking customers to tables with table assignments, in ddCRP customers sit down with other customers according to the dependencies between them, which leads to customer assignments. [sent-87, score-1.286]
37 Groups of customers sit together at a table only implicitly if they can be connected by traversing the customer assignments. [sent-88, score-0.58]
38 Therefore the i-th customer sits with customer j with a probability inversely proportional to the distance dij between them, or sits alone with a probability proportional to the hyperparameter α: p(ci = j | D, f, α) ∝ f(dij) if j ≠ i, and ∝ α if j = i (1). [sent-89, score-0.754]
39 Here ci is the customer assignment for customer i, f(d) is the decay function, and D denotes the set of all distances between customers. [sent-91, score-0.688]
40 The decay function satisfies f(∞) = 0 and describes how distances between customers affect the probability of linking them together. [sent-93, score-0.341]
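A sketch of drawing customer (superpixel) links from Equation 1 and of recovering the implied tables (segments) by traversing the links is given below; the distance matrix D, a vectorized decay function f, and the hyperparameter α are assumed to be given, and the helper names are illustrative.

```python
import numpy as np

def sample_ddcrp_links(D, f, alpha, rng=None):
    """Draw a customer assignment c_i for every customer under Eq. (1):
    p(c_i = j) is proportional to f(d_ij) for j != i and to alpha for j = i."""
    rng = rng or np.random.default_rng()
    n = D.shape[0]
    links = np.empty(n, dtype=int)
    for i in range(n):
        weights = np.asarray(f(D[i]), dtype=float).copy()
        weights[i] = alpha                       # self-link starts a new table
        links[i] = rng.choice(n, p=weights / weights.sum())
    return links

def links_to_segments(links):
    """Customers only sit together implicitly: follow the customer links until an
    already-labeled customer or a cycle is reached; connected customers form one
    table (segment)."""
    n = len(links)
    labels = -np.ones(n, dtype=int)
    next_label = 0
    for i in range(n):
        path, j = [], i
        while labels[j] == -1 and j not in path:
            path.append(j)
            j = links[j]
        if labels[j] == -1:                      # closed a new cycle -> new segment
            label, next_label = next_label, next_label + 1
        else:
            label = labels[j]
        for k in path:
            labels[k] = label
    return labels
```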
41 ddCRP Video Segmentation Prior We use the ddCRP in order to define a video segmentation prior. [sent-96, score-0.439]
42 For the motion distance Dm between superpixels, we simply use the Euclidean distance between their mean motion vectors. [sent-103, score-0.368]
43 For fs, we use the window decay f(d) = [d < A], which allows a customer to link only with customers that are at most distance A away. [sent-104, score-0.468]
44 For fm, we use the exponential decay f(d) = exp(−d/B), which decays the probability of linking to customers exponentially with the distance to the current one, where B is the width parameter of the decay. [sent-105, score-0.635]
45 With the decay functions fs and fm for the spatio-temporal and motion domains, we have defined a distribution over customer (superpixel) assignments which encourages clustering nearby superpixels with similar motion, and thus yields contiguous segments in the spatio-temporal and motion domains. [sent-106, score-1.044]
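The two decay functions can be written as below; how the spatio-temporal and motion terms are combined into a single link weight is not spelled out in this excerpt, so a multiplicative combination is assumed here purely for illustration.

```python
import numpy as np

def window_decay(d, A):
    """Spatio-temporal window decay f_s(d) = [d < A]."""
    return np.asarray(d < A, dtype=float)

def exp_decay(d, B):
    """Motion decay f_m(d) = exp(-d / B), with B the decay width."""
    return np.exp(-np.asarray(d, dtype=float) / B)

def link_weights(Ds_row, Dm_row, A, B):
    """Unnormalized link weights of one superpixel to all others, combining the
    spatio-temporal and motion decays (multiplicative combination assumed)."""
    return window_decay(Ds_row, A) * exp_decay(Dm_row, B)
```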
46 In Figure 2 we show samples from this ddCRP video segmentation prior for different hyperparameters. [sent-107, score-0.479]
47 The prior proposes segments having contiguous superpixels with similar motion. [sent-108, score-0.344]
48 Generative Multi-Video Model In this section we formulate a probabilistic, generative model that links the videos by a global appearance model that is also non-parametric. [sent-111, score-0.437]
49 We consider the following hierarchical generative procedure of multiple video sequences: Videos consist of multiple global object classes with different appearances, and for every video there is an arbitrary number of instances which are located at different positions and possibly move over time. [sent-112, score-0.876]
50 Remaining rows: samples from the ddCRP video segmentation prior under different settings of the concentration hyperparameter α and the width parameter B of the exponential motion decay function fm. [sent-116, score-0.787]
51 For each table (object instance) of each restaurant, one dish (object class) is ordered from the menu by the first customer (superpixel) who sits there, and it is shared among all customers (superpixels) who sit at that table (object instance). [sent-117, score-1.018]
52 So the analogy is the following: restaurants correspond to videos, dishes correspond to object classes, tables correspond to instances, and customers correspond to superpixels. [sent-119, score-0.599]
53 1. For each superpixel iv in video v, draw an assignment civ ∼ ddCRP(D, f, α) to an object instance. [sent-121, score-0.597]
54 2. For each object instance tv in video v, draw an assignment ktv ∼ CRP(γ) to an object class. [sent-122, score-0.626]
55 3. For each superpixel iv in video v, draw the observed feature xiv ∼ P(· | φziv), where ziv = ktiv is the class assignment for iv, i.e. the dish of the table at which iv sits. [sent-124, score-0.59]
56 For each global object class k discovered across video sequences, the parameter φk for its appearance model is sampled from G0. [sent-128, score-0.586]
57 Therefore, given the observed appearance feature xi for superpixel i, its likelihood under global object class k can be denoted as p(xi | φk) = ηk(xi). [sent-130, score-0.378]
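A sketch of the upper two layers of this generative procedure (steps 2 and 3 above) is given below: every object instance (table) found in a video orders a global class (dish) from a CRP(γ) shared across all videos, and each superpixel then emits its appearance feature from that class's model. The per-video instance labels are assumed to come from the ddCRP prior sketched earlier, and base_measure / emit are illustrative stand-ins for G0 and P(·|φ).

```python
import numpy as np

def assign_classes_and_emit(instance_labels_per_video, gamma, base_measure, emit, rng=None):
    """Steps 2-3 of the generative procedure (illustrative sketch):
    k_tv ~ CRP(gamma) links tables to global classes shared across videos,
    x_iv ~ P(. | phi_{z_iv}) with phi_k ~ G0 (= base_measure) for each class."""
    rng = rng or np.random.default_rng()
    class_params = []            # phi_k for every global class discovered so far
    tables_per_class = []        # how many tables already serve class k
    features = []
    for labels in instance_labels_per_video:      # labels[i] = instance of superpixel i
        dish_of = {}
        for t in np.unique(labels):               # step 2: draw a dish for each table
            weights = np.array(tables_per_class + [gamma], dtype=float)
            k = rng.choice(len(weights), p=weights / weights.sum())
            if k == len(class_params):            # a brand-new global class
                class_params.append(base_measure(rng))
                tables_per_class.append(0)
            tables_per_class[k] += 1
            dish_of[t] = k
        # step 3: every superpixel emits a feature from its table's class model
        features.append([emit(class_params[dish_of[t]], rng) for t in labels])
    return features, class_params
```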
58 The posterior over customer assignments is p(c1:Nv | x1:Nv, D, f, α, γ) ∝ p(c1:Nv | D, f, α) · p(x1:Nv | z(c1:Nv), γ) (3). [sent-133, score-0.353]
59 Here the ddCRP term p(c1:Nv | D, f, α) acts as the prior over all possible customer configurations c; its combinatorial nature makes the posterior intractable, and we therefore resort to sampling techniques. [sent-143, score-0.377]
60 l=1 (5) Resampling the global class (dish) assignment k follows typical Gibbs sampling method for Chinese Restaurant Process but consider all the features xV and assignments kV 333222444 in the video set V. [sent-152, score-0.513]
61 p(ktv = l | kV−tv, xV) ∝ ml−tv · ηl−tv(xtv) if class l is already used, and ∝ γ · ηlnew(xtv) if l is new (6). Here kV−tv denotes the class assignments of all tables in the video set V excluding table tv, xV is the appearance features of all superpixels within V, and xtv is the appearance features of the superpixels at table tv. [sent-155, score-0.5]
62 Given the class assignments kV−tv, ml−tv counts the number of tables linked to global class l, whose appearance model is ηl−tv. [sent-156, score-0.394]
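A sketch of this collapsed Gibbs update for a single table's dish is shown below. It assumes the multinomial appearance model with a symmetric Dirichlet prior mentioned later in this section, so that η reduces to the standard Dirichlet-multinomial (Polya) marginal likelihood; classes with no remaining tables are assumed to have been pruned beforehand.

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_multinomial_loglik(counts, prior):
    """log p(counts | prior) for a multinomial with a Dirichlet prior
    (collapsed / Polya marginal likelihood)."""
    n = counts.sum()
    return (gammaln(prior.sum()) - gammaln(prior.sum() + n)
            + np.sum(gammaln(prior + counts) - gammaln(prior)))

def resample_dish(table_counts, class_word_counts, x_table, gamma, dir_prior, rng=None):
    """Gibbs update for one table's dish, in the spirit of Eq. (6):
    p(k_tv = l)   ~ m_l * p(x_tv | data of class l excluding this table)  for used l,
    p(k_tv = new) ~ gamma * p(x_tv | prior)                               for a new class.
    table_counts[l]: tables serving class l excluding this table (assumed > 0);
    class_word_counts[l]: codeword counts of class l excluding this table;
    x_table: codeword counts of the superpixels sitting at this table."""
    rng = rng or np.random.default_rng()
    log_scores = [np.log(m_l) + dirichlet_multinomial_loglik(x_table, dir_prior + c_l)
                  for m_l, c_l in zip(table_counts, class_word_counts)]
    log_scores.append(np.log(gamma) + dirichlet_multinomial_loglik(x_table, dir_prior))
    log_scores = np.array(log_scores)
    probs = np.exp(log_scores - log_scores.max())
    return rng.choice(len(probs), p=probs / probs.sum())
```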
63 The hyperparameter of the multinomial distribution for appearance information is assigned a symmetric Dirichlet prior Ha = Dir(2e+2), which encourages bigger segments for the global classes. [sent-164, score-0.389]
64 The concentration parameter α = 1e−100 for the proposed video segmentation prior and the width parameter B = 1e−1 for the motion decay function fm were determined by inspecting samples from the prior obtained from Equation 2. [sent-165, score-0.564]
65 Experimental Results In this section, we evaluate our generative video cosegmentation approach and compare it to baselines from image co-segmentation and video segmentation. [sent-169, score-0.836]
66 Then we present the results of our method and compare them to image co-segmentation and video segmentation baselines. [sent-171, score-0.439]
67 Dataset We present a new Multi-Object Video Co-Segmentation (MOViCS) challenge, that is based on real videos and exposes several challenges encountered in online or consumer videos. [sent-174, score-0.386]
68 Accordingly, their task is defined as binary foreground/background segmentation that does not address segmentation of multiple classes and how the segments are linked across videos by the classes. [sent-178, score-0.777]
69 In contrast to these early video co-segmentation approaches, we do not phrase the task as a binary foreground/background segmentation problem but rather as a multi-class labeling problem. [sent-179, score-0.439]
70 This change in task is crucial in order to make progress towards more unconstrained video settings as we encounter them in online resources and consumer media collections. [sent-180, score-0.431]
71 Therefore, we propose a new video co-segmentation task of real videos with multiple objects in the scene. [sent-181, score-0.517]
72 We propose the first benchmark for this task based on real video sequences downloaded from YouTube. [sent-183, score-0.345]
73 The dataset has 4 different video sets including 11 videos with 514 frames in total, and we equidistantly sample 5 frames from each video that we provide ground truth for. [sent-184, score-0.842]
74 Note that for each video set there are different numbers of common object classes appearing in each video sequence, and all the objects belonging to the same object class are denoted by the same label. [sent-185, score-0.844]
75 Unlike the popular image co-segmentation dataset iCoseg [1], which has similar lighting, image conditions and backgrounds, or the video segmentation dataset moseg [3] with significant motion patterns, our dataset exposes many of the difficulties encountered when processing less constrained sources. [sent-186, score-0.61]
76 In Figure 3 we show examples of video frames for the four video sets together with the provided groundtruth annotations. [sent-187, score-0.668]
77 Different color blocks stand for different video sets and the images within the same block come from the same video sequences. [sent-212, score-0.604]
78 We define our co-segmentation task as finding for each object class a set of segments that coincide with the object instances in the video frames. [sent-214, score-0.643]
79 Therefore our evaluation assigns the object class to the best matching set of segments predicted by an algorithm: Scorej = maxi M(Si, Gj) (8). Please note that this measure is not prone to over-segmentation, as only a single label is assigned per object class for the whole set of videos. [sent-217, score-0.348]
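A sketch of this evaluation is shown below; the matching function M is assumed here to be the intersection-over-union between the pixels of a predicted label, pooled over all evaluated frames, and the ground-truth pixels of class j, which is an assumption of this illustration.

```python
import numpy as np

def intersection_over_union(pred_mask, gt_mask):
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 0.0

def class_score(predicted_label_maps, gt_masks_for_class, labels):
    """Score_j = max_i M(S_i, G_j): for ground-truth class j, take the single
    best-matching predicted label i over the whole video set (IoU assumed as M)."""
    gt = np.concatenate([g.ravel() for g in gt_masks_for_class])
    best = 0.0
    for i in labels:
        pred = np.concatenate([(m == i).ravel() for m in predicted_label_maps])
        best = max(best, intersection_over_union(pred, gt))
    return best
```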
80 Comparison to video segmentation A comparison to video segmentation methods is not straightforward. [sent-221, score-0.878]
81 As each video is processed independently, there is no linking of segments across the videos. [sent-222, score-0.556]
82 We therefore give the video segmentation method the advantage that our evaluation links its segments across videos via the groundtruth. [sent-223, score-0.89]
83 Results We evaluate our approach on the new MOViCS dataset and compare it to two state-of-the-art baselines from video segmentation and image co-segmentation. [sent-226, score-0.49]
84 Our video segmentation baseline [ 14] is denoted by (VS) and the image co-segmentation baseline [10] is denoted by (ICS) whereas we use (VCS) for our video co-segmentation method. [sent-227, score-0.741]
85 Here the evaluation is performed per set of video sequences, since the algorithm not only has to correctly segment the object instances but also has to link them to a consistent object class. [sent-236, score-0.525]
86 As described above, we give an advantage to the VS method by linking its segments across videos via the groundtruth. [sent-240, score-0.556]
87 Comparison of co-segmentation accuracies between our method (VCS), image co-segmentation (ICS) and video segmentation (VS) on the proposed MOViCS dataset. [sent-245, score-0.439]
88 Discussion The video segmentation baseline strongly depends on motion information in order to produce a good segmentation. [sent-247, score-0.523]
89 These issues are particularly pronounced in the first video set, where the chicken moves together with the turtle and the motion map is noisy due to fast motion in the second video. [sent-249, score-0.516]
90 For example, in the second and third video sets in Figure 7, there is a varying number of objects moving in and out. [sent-253, score-0.352]
91 Another interesting aspect of our model is how segmentation is supported by jointly considering all the videos of a set and learning a global object class model. [sent-257, score-0.468]
92 We give an example in Figure 6, where the first row shows images from a single tiger video, the second row shows the results of applying our proposed method only to this single sequence, and the last row shows our VCS result when taking all videos in the tiger set into account. [sent-260, score-0.528]
93 Example of improved results by segmenting across videos with a global object class model. [sent-266, score-0.397]
94 Please note that this relaxed measure does not penalize missing links between the videos or over-segmentation in the spatial domain. [sent-273, score-0.39]
95 The improvements under this measure are particularly prominent on the video sets where appearance is hard to match across sequences. [sent-279, score-0.445]
96 Examples of results from the proposed method (VCS) and baselines (ICS, VS) for all four video sets in MOViCS dataset. [sent-286, score-0.353]
97 Our method incorporates a probabilistic video segmentation prior that proposes spatially contiguous segments of similar motion. [sent-289, score-0.696]
98 The proposed Multi-Object Video CoSegmentation (MOViCS) dataset is based on real videos and exposes challenges encountered in consumer or online video collections. [sent-291, score-0.688]
99 Our method outperforms state-of-the-art image cosegmentation and video segmentation baselines on this new task. [sent-292, score-0.597]
100 Spatial distance dependent chinese restaurant processes for image segmentation. [sent-343, score-0.412]
wordName wordTfidf (topN-words)
[('ddcrp', 0.39), ('video', 0.302), ('customers', 0.285), ('customer', 0.245), ('restaurant', 0.237), ('nv', 0.21), ('videos', 0.19), ('movics', 0.184), ('tiger', 0.169), ('decay', 0.147), ('segmentation', 0.137), ('segments', 0.132), ('crp', 0.132), ('vs', 0.128), ('vcs', 0.122), ('ics', 0.118), ('superpixels', 0.116), ('cosegmentation', 0.107), ('dirichlet', 0.106), ('chinese', 0.096), ('motion', 0.084), ('superpixel', 0.083), ('appearance', 0.077), ('civ', 0.075), ('generative', 0.074), ('restaurants', 0.069), ('sits', 0.068), ('across', 0.066), ('class', 0.065), ('classes', 0.064), ('links', 0.063), ('concentration', 0.063), ('assignments', 0.062), ('dish', 0.061), ('instances', 0.058), ('vtv', 0.057), ('linking', 0.056), ('contiguous', 0.056), ('exposes', 0.053), ('fm', 0.053), ('tables', 0.052), ('resources', 0.051), ('linked', 0.051), ('assignment', 0.051), ('baselines', 0.051), ('sit', 0.05), ('chicken', 0.046), ('exchangeability', 0.046), ('franchise', 0.046), ('ktv', 0.046), ('menu', 0.046), ('wherefore', 0.046), ('xtv', 0.046), ('ziv', 0.046), ('posterior', 0.046), ('sudderth', 0.044), ('hyperparameter', 0.044), ('draw', 0.043), ('sequences', 0.043), ('object', 0.043), ('consumer', 0.041), ('prior', 0.04), ('dependent', 0.04), ('groundtruth', 0.04), ('fs', 0.039), ('processes', 0.039), ('infinite', 0.038), ('dishes', 0.038), ('multinomial', 0.037), ('online', 0.037), ('link', 0.036), ('refinements', 0.036), ('icoseg', 0.036), ('gibbs', 0.034), ('encountered', 0.034), ('global', 0.033), ('tv', 0.033), ('today', 0.033), ('trajectories', 0.033), ('iby', 0.033), ('challenges', 0.031), ('segmentations', 0.031), ('temporal', 0.03), ('proportional', 0.029), ('xz', 0.029), ('xv', 0.029), ('probabilistic', 0.029), ('correspond', 0.028), ('dependencies', 0.028), ('doesn', 0.027), ('ha', 0.027), ('synthetically', 0.027), ('distribution', 0.026), ('shared', 0.026), ('dij', 0.026), ('codeword', 0.026), ('moving', 0.025), ('big', 0.025), ('objects', 0.025), ('frames', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model
Author: Wei-Chen Chiu, Mario Fritz
Abstract: Video data provides a rich source of information that is available to us today in large quantities, e.g. from online resources. Tasks like segmentation benefit greatly from the analysis of spatio-temporal motion patterns in videos, and recent advances in video segmentation have shown great progress in exploiting these additional cues. However, observing a single video is often not enough to predict meaningful segmentations, and inference across videos becomes necessary in order to predict segmentations that are consistent with object classes. Therefore the task of video co-segmentation has been proposed, which aims at inferring segmentations from multiple videos. But current approaches are limited to only considering binary foreground/background segmentation and multiple videos of the same object. This is a clear mismatch to the challenges that we are facing with videos from online resources or consumer videos. We propose to study multi-class video co-segmentation where the number of object classes is unknown as well as the number of instances in each frame and video. We achieve this by formulating a non-parametric Bayesian model across video sequences that is based on a new video segmentation prior as well as a global appearance model that links segments of the same class. We present the first multi-class video co-segmentation evaluation. We show that our method is applicable to real video data from online resources and outperforms state-of-the-art video segmentation and image co-segmentation baselines.
2 0.19186053 187 cvpr-2013-Geometric Context from Videos
Author: S. Hussain Raza, Matthias Grundmann, Irfan Essa
Abstract: We present a novel algorithm for estimating the broad 3D geometric structure of outdoor video scenes. Leveraging spatio-temporal video segmentation, we decompose a dynamic scene captured by a video into geometric classes, based on predictions made by region-classifiers that are trained on appearance and motion features. By examining the homogeneity of the prediction, we combine predictions across multiple segmentation hierarchy levels alleviating the need to determine the granularity a priori. We built a novel, extensive dataset on geometric context of video to evaluate our method, consisting of over 100 groundtruth annotated outdoor videos with over 20,000 frames. To further scale beyond this dataset, we propose a semisupervised learning framework to expand the pool of labeled data with high confidence predictions obtained from unlabeled data. Our system produces an accurate prediction of geometric context of video achieving 96% accuracy across main geometric classes.
3 0.14791016 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
Author: Raghuraman Gopalan
Abstract: While the notion of joint sparsity in understanding common and innovative components of a multi-receiver signal ensemble has been well studied, we investigate the utility of such joint sparse models in representing information contained in a single video signal. By decomposing the content of a video sequence into that observed by multiple spatially and/or temporally distributed receivers, we first recover a collection of common and innovative components pertaining to individual videos. We then present modeling strategies based on subspace-driven manifold metrics to characterize patterns among these components, across other videos in the system, to perform subsequent video analysis. We demonstrate the efficacy of our approach for activity classification and clustering by reporting competitive results on standard datasets such as, HMDB, UCF-50, Olympic Sports and KTH.
4 0.14121738 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors
Author: Aditya Khosla, Raffay Hamid, Chih-Jen Lin, Neel Sundaresan
Abstract: Given the enormous growth in user-generated videos, it is becoming increasingly important to be able to navigate them efficiently. As these videos are generally of poor quality, summarization methods designed for well-produced videos do not generalize to them. To address this challenge, we propose to use web-images as a prior to facilitate summarization of user-generated videos. Our main intuition is that people tend to take pictures of objects to capture them in a maximally informative way. Such images could therefore be used as prior information to summarize videos containing a similar set of objects. In this work, we apply our novel insight to develop a summarization algorithm that uses the web-image based prior information in an unsupervised manner. Moreover, to automatically evaluate summarization algorithms on a large scale, we propose a framework that relies on multiple summaries obtained through crowdsourcing. We demonstrate the effectiveness of our evaluation framework by comparing its performance to that of multiple human evaluators. Finally, we present results for our framework tested on hundreds of user-generated videos.
5 0.13534458 217 cvpr-2013-Improving an Object Detector and Extracting Regions Using Superpixels
Author: Guang Shu, Afshin Dehghan, Mubarak Shah
Abstract: We propose an approach to improve the detection performance of a generic detector when it is applied to a particular video. The performance of offline-trained object detectors is usually degraded in unconstrained video environments due to variant illuminations, backgrounds and camera viewpoints. Moreover, most object detectors are trained using Haar-like features or gradient features but ignore video-specific features like consistent color patterns. In our approach, we apply a Superpixel-based Bag-of-Words (BoW) model to iteratively refine the output of a generic detector. Compared to other related work, our method builds a video-specific detector using superpixels, hence it can handle the problem of appearance variation. Most importantly, using a Conditional Random Field (CRF) along with our superpixel-based BoW model, we develop an algorithm to segment the object from the background. Therefore our method generates an output of the exact object regions instead of the bounding boxes generated by most detectors. In general, our method takes detection bounding boxes of a generic detector as input and generates the detection output with higher average precision and precise object regions. The experiments on four recent datasets demonstrate the effectiveness of our approach and significantly improve the state-of-the-art detector by 5-16% in average precision.
6 0.12432957 29 cvpr-2013-A Video Representation Using Temporal Superpixels
7 0.1144186 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
8 0.11273348 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
9 0.11045601 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding
10 0.11015326 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
11 0.10959516 455 cvpr-2013-Video Object Segmentation through Spatially Accurate and Temporally Dense Extraction of Primary Object Regions
12 0.10672724 434 cvpr-2013-Topical Video Object Discovery from Key Frames by Modeling Word Co-occurrence Prior
13 0.10632019 133 cvpr-2013-Discriminative Segment Annotation in Weakly Labeled Video
14 0.10355962 370 cvpr-2013-SCALPEL: Segmentation Cascades with Localized Priors and Efficient Learning
15 0.10255169 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
16 0.10127426 203 cvpr-2013-Hierarchical Video Representation with Trajectory Binary Partition Tree
17 0.097287111 460 cvpr-2013-Weakly-Supervised Dual Clustering for Image Semantic Segmentation
18 0.095737241 148 cvpr-2013-Ensemble Video Object Cut in Highly Dynamic Scenes
19 0.094526954 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
20 0.093729027 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos
topicId topicWeight
[(0, 0.204), (1, -0.03), (2, 0.02), (3, -0.086), (4, -0.038), (5, 0.016), (6, 0.015), (7, -0.004), (8, -0.096), (9, 0.077), (10, 0.17), (11, -0.09), (12, 0.06), (13, 0.021), (14, 0.013), (15, 0.016), (16, 0.089), (17, -0.041), (18, -0.116), (19, -0.043), (20, -0.034), (21, 0.014), (22, -0.006), (23, -0.124), (24, -0.04), (25, -0.026), (26, -0.057), (27, -0.004), (28, 0.036), (29, 0.02), (30, 0.044), (31, -0.018), (32, -0.025), (33, -0.008), (34, 0.065), (35, 0.055), (36, -0.058), (37, -0.035), (38, 0.023), (39, -0.05), (40, 0.032), (41, 0.007), (42, -0.051), (43, -0.027), (44, -0.131), (45, 0.084), (46, -0.108), (47, 0.087), (48, 0.001), (49, -0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.96113378 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model
Author: Wei-Chen Chiu, Mario Fritz
Abstract: Video data provides a rich source of information that is available to us today in large quantities, e.g. from online resources. Tasks like segmentation benefit greatly from the analysis of spatio-temporal motion patterns in videos, and recent advances in video segmentation have shown great progress in exploiting these additional cues. However, observing a single video is often not enough to predict meaningful segmentations, and inference across videos becomes necessary in order to predict segmentations that are consistent with object classes. Therefore the task of video co-segmentation has been proposed, which aims at inferring segmentations from multiple videos. But current approaches are limited to only considering binary foreground/background segmentation and multiple videos of the same object. This is a clear mismatch to the challenges that we are facing with videos from online resources or consumer videos. We propose to study multi-class video co-segmentation where the number of object classes is unknown as well as the number of instances in each frame and video. We achieve this by formulating a non-parametric Bayesian model across video sequences that is based on a new video segmentation prior as well as a global appearance model that links segments of the same class. We present the first multi-class video co-segmentation evaluation. We show that our method is applicable to real video data from online resources and outperforms state-of-the-art video segmentation and image co-segmentation baselines.
2 0.81916213 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors
Author: Aditya Khosla, Raffay Hamid, Chih-Jen Lin, Neel Sundaresan
Abstract: Given the enormous growth in user-generated videos, it is becoming increasingly important to be able to navigate them efficiently. As these videos are generally of poor quality, summarization methods designed for well-produced videos do not generalize to them. To address this challenge, we propose to use web-images as a prior to facilitate summarization of user-generated videos. Our main intuition is that people tend to take pictures of objects to capture them in a maximally informative way. Such images could therefore be used as prior information to summarize videos containing a similar set of objects. In this work, we apply our novel insight to develop a summarization algorithm that uses the web-image based prior information in an unsupervised manner. Moreover, to automatically evaluate summarization algorithms on a large scale, we propose a framework that relies on multiple summaries obtained through crowdsourcing. We demonstrate the effectiveness of our evaluation framework by comparing its performance to that of multiple human evaluators. Finally, we present results for our framework tested on hundreds of user-generated videos.
3 0.81028605 187 cvpr-2013-Geometric Context from Videos
Author: S. Hussain Raza, Matthias Grundmann, Irfan Essa
Abstract: We present a novel algorithm for estimating the broad 3D geometric structure of outdoor video scenes. Leveraging spatio-temporal video segmentation, we decompose a dynamic scene captured by a video into geometric classes, based on predictions made by region-classifiers that are trained on appearance and motion features. By examining the homogeneity of the prediction, we combine predictions across multiple segmentation hierarchy levels alleviating the need to determine the granularity a priori. We built a novel, extensive dataset on geometric context of video to evaluate our method, consisting of over 100 groundtruth annotated outdoor videos with over 20,000 frames. To further scale beyond this dataset, we propose a semisupervised learning framework to expand the pool of labeled data with high confidence predictions obtained from unlabeled data. Our system produces an accurate prediction of geometric context of video achieving 96% accuracy across main geometric classes.
4 0.78535014 413 cvpr-2013-Story-Driven Summarization for Egocentric Video
Author: Zheng Lu, Kristen Grauman
Abstract: We present a video summarization approach that discovers the story of an egocentric video. Given a long input video, our method selects a short chain of video subshots depicting the essential events. Inspired by work in text analysis that links news articles over time, we define a random-walk based metric of influence between subshots that reflects how visual objects contribute to the progression of events. Using this influence metric, we define an objective for the optimal k-subshot summary. Whereas traditional methods optimize a summary's diversity or representativeness, ours explicitly accounts for how one sub-event “leads to” another—which, critically, captures event connectivity beyond simple object co-occurrence. As a result, our summaries provide a better sense of story. We apply our approach to over 12 hours of daily activity video taken from 23 unique camera wearers, and systematically evaluate its quality compared to multiple baselines with 34 human subjects.
5 0.77923793 133 cvpr-2013-Discriminative Segment Annotation in Weakly Labeled Video
Author: Kevin Tang, Rahul Sukthankar, Jay Yagnik, Li Fei-Fei
Abstract: The ubiquitous availability of Internet video offers the vision community the exciting opportunity to directly learn localized visual concepts from real-world imagery. Unfortunately, most such attempts are doomed because traditional approaches are ill-suited, both in terms of their computational characteristics and their inability to robustly contend with the label noise that plagues uncurated Internet content. We present CRANE, a weakly supervised algorithm that is specifically designed to learn under such conditions. First, we exploit the asymmetric availability of real-world training data, where small numbers of positive videos tagged with the concept are supplemented with large quantities of unreliable negative data. Second, we ensure that CRANE is robust to label noise, both in terms of tagged videos that fail to contain the concept as well as occasional negative videos that do. Finally, CRANE is highly parallelizable, making it practical to deploy at large scale without sacrificing the quality of the learned solution. Although CRANE is general, this paper focuses on segment annotation, where we show state-of-the-art pixel-level segmentation results on two datasets, one of which includes a training set of spatiotemporal segments from more than 20,000 videos.
6 0.72438681 29 cvpr-2013-A Video Representation Using Temporal Superpixels
8 0.71363002 434 cvpr-2013-Topical Video Object Discovery from Key Frames by Modeling Word Co-occurrence Prior
9 0.6968267 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos
11 0.6304093 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
12 0.62807119 333 cvpr-2013-Plane-Based Content Preserving Warps for Video Stabilization
13 0.61655962 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding
14 0.59803289 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
15 0.5973798 203 cvpr-2013-Hierarchical Video Representation with Trajectory Binary Partition Tree
16 0.59276938 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation
17 0.56871212 217 cvpr-2013-Improving an Object Detector and Extracting Regions Using Superpixels
18 0.55665177 148 cvpr-2013-Ensemble Video Object Cut in Highly Dynamic Scenes
19 0.55297655 86 cvpr-2013-Composite Statistical Inference for Semantic Segmentation
20 0.55277205 212 cvpr-2013-Image Segmentation by Cascaded Region Agglomeration
topicId topicWeight
[(10, 0.126), (16, 0.016), (26, 0.05), (28, 0.011), (33, 0.289), (67, 0.033), (69, 0.043), (75, 0.242), (76, 0.014), (87, 0.061)]
simIndex simValue paperId paperTitle
same-paper 1 0.85696262 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model
Author: Wei-Chen Chiu, Mario Fritz
Abstract: Video data provides a rich source of information that is available to us today in large quantities, e.g. from online resources. Tasks like segmentation benefit greatly from the analysis of spatio-temporal motion patterns in videos, and recent advances in video segmentation have shown great progress in exploiting these additional cues. However, observing a single video is often not enough to predict meaningful segmentations, and inference across videos becomes necessary in order to predict segmentations that are consistent with object classes. Therefore the task of video co-segmentation has been proposed, which aims at inferring segmentations from multiple videos. But current approaches are limited to only considering binary foreground/background segmentation and multiple videos of the same object. This is a clear mismatch to the challenges that we are facing with videos from online resources or consumer videos. We propose to study multi-class video co-segmentation where the number of object classes is unknown as well as the number of instances in each frame and video. We achieve this by formulating a non-parametric Bayesian model across video sequences that is based on a new video segmentation prior as well as a global appearance model that links segments of the same class. We present the first multi-class video co-segmentation evaluation. We show that our method is applicable to real video data from online resources and outperforms state-of-the-art video segmentation and image co-segmentation baselines.
2 0.85077816 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking
Author: Junseok Kwon, Kyoung Mu Lee
Abstract: We propose a novel tracking algorithm that robustly tracks the target by finding the state which minimizes uncertainty of the likelihood at current state. The uncertainty of the likelihood is estimated by obtaining the gap between the lower and upper bounds of the likelihood. By minimizing the gap between the two bounds, our method finds the confident and reliable state of the target. In the paper, the state that gives the Minimum Uncertainty Gap (MUG) between likelihood bounds is shown to be more reliable than the state which gives the maximum likelihood only, especially when there are severe illumination changes, occlusions, and pose variations. A rigorous derivation of the lower and upper bounds of the likelihood for the visual tracking problem is provided to address this issue. Additionally, an efficient inference algorithm using Interacting Markov Chain Monte Carlo is presented to find the best state that maximizes the average of the lower and upper bounds of the likelihood and minimizes the gap between two bounds simultaneously. Experimental results demonstrate that our method successfully tracks the target in realistic videos and outperforms conventional tracking methods.
3 0.79992098 121 cvpr-2013-Detection- and Trajectory-Level Exclusion in Multiple Object Tracking
Author: Anton Milan, Konrad Schindler, Stefan Roth
Abstract: When tracking multiple targets in crowded scenarios, modeling mutual exclusion between distinct targets becomes important at two levels: (1) in data association, each target observation should support at most one trajectory and each trajectory should be assigned at most one observation per frame; (2) in trajectory estimation, two trajectories should remain spatially separated at all times to avoid collisions. Yet, existing trackers often sidestep these important constraints. We address this using a mixed discrete-continuous conditional random field (CRF) that explicitly models both types of constraints: Exclusion between conflicting observations with supermodular pairwise terms, and exclusion between trajectories by generalizing global label costs to suppress the co-occurrence of incompatible labels (trajectories). We develop an expansion move-based MAP estimation scheme that handles both non-submodular constraints and pairwise global label costs. Furthermore, we perform a statistical analysis of ground-truth trajectories to derive appropriate CRF potentials for modeling data fidelity, target dynamics, and inter-target occlusion.
4 0.79983467 426 cvpr-2013-Tensor-Based Human Body Modeling
Author: Yinpeng Chen, Zicheng Liu, Zhengyou Zhang
Abstract: In this paper, we present a novel approach to model 3D human body with variations on both human shape and pose, by exploring a tensor decomposition technique. 3D human body modeling is important for 3D reconstruction and animation of realistic human body, which can be widely used in Tele-presence and video game applications. It is challenging due to a wide range of shape variations over different people and poses. The existing SCAPE model [4] is popular in computer vision for modeling 3D human body. However, it considers shape and pose deformations separately, which is not accurate since pose deformation is persondependent. Our tensor-based model addresses this issue by jointly modeling shape and pose deformations. Experimental results demonstrate that our tensor-based model outperforms the SCAPE model quite significantly. We also apply our model to capture human body using Microsoft Kinect sensors with excellent results.
5 0.79948831 360 cvpr-2013-Robust Estimation of Nonrigid Transformation for Point Set Registration
Author: Jiayi Ma, Ji Zhao, Jinwen Tian, Zhuowen Tu, Alan L. Yuille
Abstract: We present a new point matching algorithm for robust nonrigid registration. The method iteratively recovers the point correspondence and estimates the transformation between two point sets. In the first step of the iteration, feature descriptors such as shape context are used to establish rough correspondence. In the second step, we estimate the transformation using a robust estimator called L2E. This is the main novelty of our approach and it enables us to deal with the noise and outliers which arise in the correspondence step. The transformation is specified in a functional space, more specifically a reproducing kernel Hilbert space. We apply our method to nonrigid sparse image feature correspondence on 2D images and 3D surfaces. Our results quantitatively show that our approach outperforms state-of-the-art methods, particularly when there are a large number of outliers. Moreover, our method of robustly estimating transformations from correspondences is general and has many other applications.
6 0.79908222 267 cvpr-2013-Least Soft-Threshold Squares Tracking
7 0.79881829 170 cvpr-2013-Fast Rigid Motion Segmentation via Incrementally-Complex Local Models
8 0.79872692 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
9 0.79866976 317 cvpr-2013-Optimal Geometric Fitting under the Truncated L2-Norm
10 0.79826754 143 cvpr-2013-Efficient Large-Scale Structured Learning
11 0.79824734 227 cvpr-2013-Intrinsic Scene Properties from a Single RGB-D Image
12 0.7981354 425 cvpr-2013-Tensor-Based High-Order Semantic Relation Transfer for Semantic Scene Segmentation
13 0.79808897 245 cvpr-2013-Layer Depth Denoising and Completion for Structured-Light RGB-D Cameras
14 0.79800344 156 cvpr-2013-Exploring Compositional High Order Pattern Potentials for Structured Output Learning
15 0.79792297 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
16 0.79773784 242 cvpr-2013-Label Propagation from ImageNet to 3D Point Clouds
17 0.7974444 111 cvpr-2013-Dense Reconstruction Using 3D Object Shape Priors
18 0.79732436 96 cvpr-2013-Correlation Filters for Object Alignment
19 0.79730749 256 cvpr-2013-Learning Structured Hough Voting for Joint Object Detection and Occlusion Reasoning
20 0.79730409 424 cvpr-2013-Templateless Quasi-rigid Shape Modeling with Implicit Loop-Closure