cvpr cvpr2013 cvpr2013-133 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kevin Tang, Rahul Sukthankar, Jay Yagnik, Li Fei-Fei
Abstract: The ubiquitous availability of Internet video offers the vision community the exciting opportunity to directly learn localized visual concepts from real-world imagery. Unfortunately, most such attempts are doomed because traditional approaches are ill-suited, both in terms of their computational characteristics and their inability to robustly contend with the label noise that plagues uncurated Internet content. We present CRANE, a weakly supervised algorithm that is specifically designed to learn under such conditions. First, we exploit the asymmetric availability of real-world training data, where small numbers of positive videos tagged with the concept are supplemented with large quantities of unreliable negative data. Second, we ensure that CRANE is robust to label noise, both in terms of tagged videos that fail to contain the concept as well as occasional negative videos that do. Finally, CRANE is highly parallelizable, making it practical to deploy at large scale without sacrificing the quality of the learned solution. Although CRANE is general, this paper focuses on segment annotation, where we show state-of-the-art pixel-level segmentation results on two datasets, one of which includes a training set of spatiotemporal segments from more than 20,000 videos.
Reference: text
sentIndex sentText sentNum sentScore
1 We present CRANE, a weakly supervised algorithm that is specifically designed to learn under such conditions. [sent-12, score-0.357]
2 First, we exploit the asymmetric availability of real-world training data, where small numbers of positive videos tagged with the concept are supplemented with large quantities of unreliable negative data. [sent-13, score-0.61]
3 Second, we ensure that CRANE is robust to label noise, both in terms of tagged videos that fail to contain the concept as well as occasional negative videos that do. [sent-14, score-0.689]
4 Although CRANE is general, this paper focuses on segment annotation, where we show state-of-the-art pixel-level segmentation results on two datasets, one of which includes a training set of spatiotemporal segments from more than 20,000 videos. [sent-16, score-0.589]
5 Introduction. The ease of authoring and uploading video to the Internet creates a vast resource for computer vision research, particularly because Internet videos are frequently associated with semantic tags that identify visual concepts appearing in the video. [sent-18, score-0.307]
6 However, since tags are not spatially or temporally localized within the video, such videos cannot be directly exploited for training traditional supervised recognition systems. [sent-19, score-0.287]
7 In this paper, we examine the problem of generating pixel-level concept annotations for weakly labeled video. [sent-21, score-0.492]
8 Our method identifies segments that correspond to the label to generate a semantic segmentation [bottom]. [sent-27, score-0.319]
9 Given a video weakly tagged with a concept, such as “dog”, we process it using a standard unsupervised spatiotemporal segmentation method that aims to preserve object boundaries [3, 10, 15]. [sent-30, score-0.624]
10 From the video-level tag, we know that some of the segments correspond to the “dog” concept while most probably do not. [sent-31, score-0.415]
11 Our goal is to classify each segment within the video either as coming from the concept “dog”, which we denote as concept segments, or not, which we denote as background segments. [sent-32, score-0.572]
12 Given the varied nature of Internet videos, we cannot rely on assumptions about the relative frequencies or spatiotemporal distributions of segments from the two classes, neither within a frame nor across the video; nor can we assume that each video contains a single instance of the concept. [sent-33, score-0.468]
13 The first scenario, which we term transductive segment annotation (TSA), is studied in [23]. [sent-39, score-0.446]
14 This scenario is closely related to automatically annotating a weakly labeled dataset. [sent-40, score-0.383]
15 Here, the test videos that we seek to annotate are compared against a large amount of negative segments (from videos not tagged with the concept) to enable a direct discriminative separation of the test video segments into two classes. [sent-41, score-1.16]
16 The second scenario, which we term inductive segment annotation (ISA), is studied in [11]. [sent-42, score-0.456]
17 In this setting, a segment classifier is trained using a large quantity of weakly labeled segments from both positively- and negatively-tagged videos. [sent-43, score-0.777]
18 We observe that the TSA and ISA settings parallel the distinction between transductive and inductive learning, since the test instances are available during training in the former but not in the latter. [sent-45, score-0.384]
19 We present a unified interpretation under which a broad class of weakly supervised learning algorithms can be analyzed. [sent-49, score-0.398]
20 We introduce spatiotemporal segment-level annotations for a subset of the YouTube-Objects dataset [20], and present a detailed analysis of our method compared to other methods on this dataset for the transductive segment annotation scenario. [sent-54, score-0.582]
21 We also compare CRANE directly against [11] on the inductive segment annotation scenario and demonstrate state-of-the-art results. [sent-56, score-0.491]
22 Related Work. Several methods have recently been proposed for high-quality, unsupervised spatiotemporal segmentation of videos [3, 10, 15, 30, 31]. [sent-58, score-0.267]
23 Several recent works have leveraged spatiotemporal segments for a variety of tasks in video understanding, including event detection [12], human motion volume generation [17], human 1 Annotations and additional details are available at the project website: https://sites. [sent-60, score-0.468]
24 In the former (TSA), the proposed algorithm (CRANE) is evaluated on weakly labeled training data; in the latter (ISA), we train a classifier and evaluate on a disjoint test set. [sent-65, score-0.404]
25 TSA and ISA have parallels to transductive and inductive learning, respectively. [sent-66, score-0.29]
26 Drawing inspiration from these, we also employ such segments as a core representation in our work. [sent-68, score-0.284]
27 [11], where object segmentations are generated on weakly labeled video data. [sent-72, score-0.394]
28 , linear classifiers and multiple-instance learning), we propose a new way of thinking about this weakly supervised problem that leads to significantly superior results. [sent-75, score-0.374]
29 Discriminative segment annotation from weakly labeled data shares similarities with Multiple Instance Learning (MIL), on which there has been considerable research (e. [sent-76, score-0.613]
30 In MIL, we are given labeled bags of instances, where a positive bag contains at least one positive instance, and a negative bag contains no positive instances. [sent-79, score-0.317]
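The bag-labeling rule stated above can be illustrated with a minimal Python sketch. The function name `bag_label` and the boolean encoding of instance labels are illustrative assumptions, not part of the paper.

```python
from typing import List


def bag_label(instance_labels: List[bool]) -> bool:
    """Standard MIL assumption: a bag is positive iff it holds
    at least one positive instance (hypothetical helper)."""
    return any(instance_labels)


# A positive bag needs only one positive instance.
print(bag_label([False, False, True]))
# A negative bag contains no positive instances.
print(bag_label([False, False, False]))
```

In the weakly labeled video setting this assumption is only approximately true, because a positively tagged video may contain no concept segment at all.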
31 Spatiotemporal segments computed on “horse” and “dog” video sequences using [10]. [sent-82, score-0.325]
32 contain no concept segments as well as rare cases where some concept segments appear in negative videos. [sent-84, score-0.953]
33 There is increasing interest in exploring the idea of learning visual concepts from a combination of weakly supervised images and weakly supervised video [1, 6, 14, 19, 21, 26]. [sent-85, score-0.844]
34 Most applicable to our problem is recent work that achieves state-of-the-art results on bounding box annotation in weakly labeled 2D images [23]. [sent-86, score-0.434]
35 Weakly Supervised Segment Annotation. As discussed earlier, we start with spatiotemporal segments for each video, such as those shown in Fig. [sent-90, score-0.366]
36 Each segment is a spatiotemporal (3D) volume that we represent as a point in a high-dimensional feature space using a set of standard features computed over the segment. [sent-92, score-0.282]
37 segment i, with the label being positive if the segment was extracted from a video with concept c as a weak label, and negative otherwise. [sent-107, score-0.563]
38 We denote the set P to be the set of all instances with a positive label, and similarly N to be the set of all negative instances. [sent-108, score-0.231]
39 Since our negative data was weakly labeled with concepts other than c, we can assume that the segments labeled as negative are (with rare exceptions) correctly labeled. [sent-109, score-0.867]
40 Our task then is to determine which of the positive segments P are concept segments, and which are background segments. [sent-110, score-0.493]
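A minimal sketch of how the sets P and N could be assembled from video-level tags, assuming each video is represented as a pair of a tag set and a list of segment feature vectors. The data layout and the name `split_by_weak_label` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Segment = Tuple[float, ...]
Video = Tuple[Set[str], Sequence[Segment]]


def split_by_weak_label(videos: Sequence[Video], concept: str):
    """Propagate each video-level tag to all of that video's segments.

    Returns (P, N): segments from videos tagged with `concept`, and
    segments from all other videos.
    """
    P: List[Segment] = []
    N: List[Segment] = []
    for tags, segments in videos:
        target = P if concept in tags else N
        target.extend(segments)
    return P, N


videos = [
    ({"dog"}, [(0.1, 0.2), (0.9, 0.8)]),   # weakly tagged positive video
    ({"horse"}, [(0.5, 0.5)]),             # tagged with another concept
]
P, N = split_by_weak_label(videos, "dog")
print(len(P), len(N))
```

Every segment in P inherits a positive label here, even though most of them are probably background. Deciding which ones are true concept segments is the annotation task itself.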
41 We present a generalized interpretation of transductive segment annotation, which leads to a family of methods that includes several common methods and previous works [23]. [sent-111, score-0.343]
42 Visualization of pairwise distance matrix between segments for weakly supervised annotation. [sent-113, score-0.62]
43 segments si from both the positive and negative videos, for a particular concept c. [sent-115, score-0.376]
44 Across the rows and columns, we order the segments from P first, followed by those from N. [sent-116, score-0.263]
45 Within P, we further order the concept segments Pc ⊂ P [sent-117, score-0.415]
46 first, followed by the background segments Pb = P \ Pc. [sent-118, score-0.29]
47 The blocks A, B and C correspond to intra-class distances among segments from Pc, Pb, and N, respectively. [sent-121, score-0.263]
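The ordering of the distance matrix can be sketched as follows, assuming Euclidean distance in feature space (the surviving text does not name the metric). Because Pc and Pb are unknown in practice, this sketch extracts only the observable block between P and N, which the later text refers to as block D. Function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Segment = Tuple[float, ...]


def pairwise_distances(points: Sequence[Segment]) -> List[List[float]]:
    """Full symmetric matrix of Euclidean distances (assumed metric)."""
    return [[math.dist(a, b) for b in points] for a in points]


def positive_negative_block(P: Sequence[Segment],
                            N: Sequence[Segment]) -> List[List[float]]:
    """Order segments with P first, then N, and return the |P| x |N|
    sub-block holding distances between positives and negatives."""
    ordered = list(P) + list(N)
    full = pairwise_distances(ordered)
    n_pos = len(P)
    return [row[n_pos:] for row in full[:n_pos]]


block = positive_negative_block([(0.0, 0.0)], [(3.0, 4.0)])
print(block)
```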
48 We can now analyze a variety of weakly supervised approaches in this framework. [sent-126, score-0.357]
49 Co-segmentation [27] exploits the observation that concept segments across videos are similar, but that background segments are diverse. [sent-129, score-0.823]
50 The hope is that the concept segments form a dominant cluster/clique in feature space. [sent-131, score-0.449]
51 This principled approach to weakly supervised learning exploits the insight that the (unknown) distribution of background segments Pb must be similar to the (known) distribution of negative segments N, since the latter consists almost entirely of background segments. [sent-133, score-0.954]
52 In our interpretation, this corresponds to building a generative model according to the information in block C of the distance matrix, and scoring segments according to: $S_{\mathrm{KDE}}(s_i) = -P_N(s_i) = -\frac{1}{|N|} \sum_{z} \ldots$ [sent-140, score-0.304]
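A rough sketch of density-based scoring in this spirit. The kernel is truncated in the text above, so an unnormalized Gaussian kernel on Euclidean distance with bandwidth `sigma` is assumed here. The kernel, the bandwidth, and the name `kde_score` are all illustrative choices, not the paper's definition.

```python
import math
from typing import Sequence, Tuple

Segment = Tuple[float, ...]


def kde_score(s: Segment, negatives: Sequence[Segment],
              sigma: float = 1.0) -> float:
    """Negated kernel density of `s` under the negative segments.

    Assumption: unnormalized Gaussian kernel on Euclidean distance.
    Higher scores mean `s` looks less like the negative (background) data.
    """
    if not negatives:
        raise ValueError("need at least one negative segment")
    density = sum(
        math.exp(-math.dist(s, z) ** 2 / (2.0 * sigma ** 2))
        for z in negatives
    ) / len(negatives)
    return -density


negatives = [(0.0, 0.0), (0.1, 0.0)]
# A segment far from all negatives should score higher than a near one.
print(kde_score((5.0, 5.0), negatives) > kde_score((0.0, 0.1), negatives))
```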
53 Standard fully supervised methods, such as Support Vector Machines (SVM), learn a discriminative classifier to separate positive from negative data, given instance-level labels. [sent-145, score-0.315]
54 Such methods can be shoehorned into the weakly supervised setting of segment annotation by propagating video-level labels to segments. [sent-146, score-0.663]
55 background segments from positively tagged videos, Pb (which are typically the majority), as label noise. [sent-150, score-0.437]
56 In our experiments, methods that tackle weakly labeled segment annotation from a more principled perspective significantly outperform these techniques. [sent-153, score-0.445]
57 ’s negative mining method [23], which we denote as MIN, can be interpreted as a discriminative method that operates on block D of the matrix to identify Pc. [sent-156, score-0.222]
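A minimal sketch of a negative-mining style score. It assumes the score of a positive segment is its minimum distance to any negative segment, which matches the later remark that MIN relies on the minimum distance from a positive instance to a negative instance. Euclidean distance and the name `min_score` are assumptions.

```python
import math
from typing import Sequence, Tuple

Segment = Tuple[float, ...]


def min_score(s: Segment, negatives: Sequence[Segment]) -> float:
    """Distance from `s` to its nearest negative (assumed Euclidean).

    Larger values suggest `s` is more likely a concept segment.
    """
    return min(math.dist(s, z) for z in negatives)


print(min_score((0.0, 0.0), [(3.0, 4.0), (6.0, 8.0)]))
```

Because the score depends on a single nearest negative, one mislabeled negative lying close to a true concept segment can pull that segment's score down sharply.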
58 Following this perspective on how various weakly supervised approaches for segment annotations relate through the distance matrix, we detail our proposed algorithm, CRANE. [sent-161, score-0.569]
59 Proposed Method: CRANE. Like MIN, our method, CRANE, operates on block D of the matrix, corresponding to the distances between weakly tagged positive and negative segments. [sent-163, score-0.628]
60 Unlike MIN, CRANE iterates through the segments in N, and each such negative instance penalizes nearby segments in P. [sent-164, score-0.649]
61 The intuition is that concept segments in P are those that are far ... [sent-165, score-0.432]
62 Positive instances are less likely to be concept segments if they are near many negatives. [sent-184, score-0.472]
63 Background segments in positive videos tend to fall near one or more segments from negative videos (in feature space). [sent-199, score-0.936]
64 Consequently, such segments are ranked lower than other positives. [sent-202, score-0.281]
65 Since concept segments are rarely the closest to negative instances, they are typically ranked higher. [sent-203, score-0.556]
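The ranking idea described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formulation: each negative penalizes only its single nearest positive segment, distances are Euclidean, and the penalty function `f_cut` (a hypothetical parameter here) defaults to a constant 1.

```python
import math
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[float, ...]


def crane_scores(P: Sequence[Segment], N: Sequence[Segment],
                 f_cut: Callable[[float], float] = lambda d: 1.0
                 ) -> List[float]:
    """Score each positive segment by the penalties it collects.

    Assumption: every negative z penalizes the positive segment
    closest to it by f_cut(distance). Segments that collect few
    penalties keep high scores and are ranked as likely concept segments.
    """
    scores = [0.0] * len(P)
    for z in N:
        # Each negative is handled independently of the others.
        nearest = min(range(len(P)), key=lambda i: math.dist(P[i], z))
        scores[nearest] -= f_cut(math.dist(P[nearest], z))
    return scores


P = [(0.0, 0.0), (10.0, 10.0)]            # second segment is far from negatives
N = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
print(crane_scores(P, N))
```

Since the loop body touches one negative at a time, the negatives can be split across workers and the per-segment penalties summed afterwards, which is consistent with the parallelizability claimed in the abstract.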
66 Here, the unknown segment, si, is very close to a negative instance that may have come from an incorrectly tagged video. [sent-207, score-0.31]
67 Before detailing the specifics of how we apply CRANE to transductive and inductive segment annotation tasks, we discuss some properties of the algorithm that make it particularly suitable to practical implementations. [sent-210, score-0.596]
68 [23]’s observation regarding the abundance of negative data, our proposed approach enforces independence among negative instances (i. [sent-215, score-0.303]
69 Application to transductive segment annotation. Applying CRANE to transductive segment annotation is straightforward. [sent-221, score-0.892]
70 We generate weakly labeled positive and negative instances for each concept. [sent-222, score-0.538]
71 Then we use CRANE to rank all of the segments in the positive set according to this score. [sent-223, score-0.339]
72 Application to inductive segment annotation. In the inductive segment annotation task, for each concept, we are given a large number of weakly tagged positive and negative videos, from which we learn a set of segment-level classifiers that can be applied to arbitrary weakly tagged test videos. [sent-228, score-1.929]
73 Inductive segment annotation can be decomposed into a two-stage problem. [sent-229, score-0.306]
74 In the second stage, the most confident predictions for concept segments (from the first stage) are treated as segment-level labels. [sent-231, score-0.415]
75 Using these and our large set of negative instances, we train a standard fully supervised classifier. [sent-232, score-0.214]
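A minimal sketch of assembling the second-stage training set from first-stage scores. The retained fraction is exposed as `keep_fraction`, and the +1/-1 label encoding and function name are illustrative assumptions. The classifier itself is left out, since any standard fully supervised learner could consume the result.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, ...]


def second_stage_training_set(P: Sequence[Segment],
                              scores: Sequence[float],
                              N: Sequence[Segment],
                              keep_fraction: float = 0.2):
    """Keep the top-scoring fraction of P as positives; use all of N
    as negatives. Returns (X, y) with labels +1 / -1."""
    k = max(1, round(keep_fraction * len(P)))
    ranked = sorted(range(len(P)), key=lambda i: scores[i], reverse=True)
    confident = [P[i] for i in ranked[:k]]
    X: List[Segment] = confident + list(N)
    y = [1] * len(confident) + [-1] * len(N)
    return X, y


P = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
scores = [0.1, 0.9, 0.3, 0.2, 0.5]
N = [(9.0,), (8.0,)]
X, y = second_stage_training_set(P, scores, N, keep_fraction=0.4)
print(X, y)
```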
76 Experiments. To evaluate the different methods, we score each segment in our test videos, rank segments in decreasing order of score and compute precision/recall curves. [sent-236, score-0.467]
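The evaluation protocol can be sketched as follows, assuming ground-truth segment labels are given as 0/1 values. The function name is illustrative, and ties in score are broken by input order.

```python
from typing import List, Sequence, Tuple


def precision_recall_curve(scores: Sequence[float],
                           labels: Sequence[int]
                           ) -> List[Tuple[float, float]]:
    """Rank segments by decreasing score and return one
    (precision, recall) pair per rank cutoff."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve = []
    true_pos = 0
    for rank, i in enumerate(order, start=1):
        true_pos += labels[i]
        curve.append((true_pos / rank, true_pos / total_pos))
    return curve


for precision, recall in precision_recall_curve([0.9, 0.8, 0.1], [1, 0, 1]):
    print(round(precision, 3), round(recall, 3))
```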
77 Transductive segment annotation (TSA). To evaluate transductive segment annotation, we use the YouTube-Objects (YTO) dataset [20], which consists of videos collected for 10 of the classes from the PASCAL Visual Objects Challenge [8]. [sent-240, score-0.743]
78 hope is to identify methods that can “clean” weakly supervised video to generate suitable data for training supervised classifiers for image challenges such as PASCAL VOC. [sent-246, score-0.563]
79 For negative data, we sample 5000 segments from videos tagged with other classes; our experiments show that additional negative data increases computation time but does not significantly affect results for any of the methods on this dataset. [sent-249, score-0.774]
80 7), we see that in many videos, the cat and background segments are very similar in appearance. [sent-265, score-0.315]
81 Direct comparison of several approaches for transductive segment annotation on the YouTube-Objects dataset [20]. [sent-269, score-0.446]
82 Visualizations of instances for the “cat” class where MIL is better able to distinguish between the similar looking concept and background segments (see text for details). [sent-271, score-0.499]
83 the minimum distance from a positive instance to a negative instance, it is more susceptible to label noise. [sent-272, score-0.245]
84 The transductive segment annotation scenario is useful for directly comparing various weakly supervised learning methods in a classifier-independent manner. [sent-273, score-0.855]
85 However, TSA is of limited practical use as it requires that each segment from every input video be compared against the negative data. [sent-274, score-0.364]
86 Inductive segment annotation (ISA). For the task of inductive segment annotation, where we learn a segment-level classifier from weakly labeled video, we use the dataset introduced by [11], as this dataset contains a large number of weakly labeled videos and deals exactly with this task. [sent-278, score-1.395]
87 Additional videos from several other tags are used to increase the set of negative background videos. [sent-280, score-0.309]
88 These videos are used for training, and a separate, disjoint set of test videos from these 8 concept classes is used for evaluation. [sent-281, score-0.42]
89 For both CRANE and MIN, we retain the top 20% of the ranked segments from P as positive training data for the second stage segment classifier. [sent-286, score-0.548]
90 As expected, if we retain too few segments, we do not span the intra-class variability of the target concept; conversely, retaining too many concepts risks including background segments and consequently corrupting the learned classifier. [sent-299, score-0.359]
91 Direct comparison of several methods for inductive segment annotation using the object segmentation dataset [11]. [sent-303, score-0.481]
92 Average precision as we vary CRANE’s fraction of retained segments [top] and number of training segments [bottom]. [sent-305, score-0.577]
93 Observations on successes: we segment multiple non-centered objects (top left), which is difficult for GrabCut-based methods [22]; we highlight the horse but not the visually salient ball, improving over [11]; we find the speedboat but not the moving water. [sent-311, score-0.211]
94 Conclusion. We introduce CRANE, a surprisingly simple yet effective algorithm for annotating spatiotemporal segments from video-level labels. [sent-314, score-0.407]
95 We also present a generalized interpretation based on the distance matrix that serves as a taxonomy for weakly supervised methods and provides a deeper understanding of this problem. [sent-315, score-0.381]
96 We describe two related scenarios of the segment annotation problem (TSA and ISA) and present comprehensive experiments on published datasets. [sent-316, score-0.306]
97 In particular, CRANE is only one of a family of methods that exploit distances between weakly labeled instances for discriminative ranking and classification. [sent-319, score-0.404]
98 In each pair, the left image shows the original spatiotemporal segments and the right shows the output. [sent-385, score-0.366]
99 Modeling temporal structure of decomposable motion segments for activity classification. [sent-448, score-0.263]
100 In defence of negative mining for annotating weakly labelled data. [sent-493, score-0.466]
wordName wordTfidf (topN-words)
[('crane', 0.642), ('weakly', 0.266), ('tsa', 0.264), ('segments', 0.263), ('segment', 0.179), ('isa', 0.161), ('concept', 0.152), ('inductive', 0.15), ('tagged', 0.147), ('transductive', 0.14), ('annotation', 0.127), ('mil', 0.124), ('negative', 0.123), ('videos', 0.118), ('spatiotemporal', 0.103), ('supervised', 0.091), ('fcut', 0.075), ('dog', 0.074), ('internet', 0.067), ('video', 0.062), ('instances', 0.057), ('pb', 0.052), ('concepts', 0.051), ('positive', 0.051), ('si', 0.05), ('min', 0.05), ('siva', 0.046), ('dist', 0.043), ('block', 0.041), ('labeled', 0.041), ('tags', 0.041), ('annotating', 0.041), ('instance', 0.04), ('annot', 0.038), ('egment', 0.038), ('hartmann', 0.038), ('madani', 0.038), ('successes', 0.036), ('kwatra', 0.036), ('mining', 0.036), ('scenario', 0.035), ('pc', 0.035), ('shots', 0.035), ('grundmann', 0.035), ('annotations', 0.033), ('retained', 0.032), ('disjoint', 0.032), ('horse', 0.032), ('label', 0.031), ('pn', 0.03), ('contend', 0.029), ('rahul', 0.029), ('classifier', 0.028), ('shot', 0.028), ('background', 0.027), ('parallelizable', 0.027), ('segmentations', 0.025), ('vijayanarasimhan', 0.025), ('hoffman', 0.025), ('rank', 0.025), ('cat', 0.025), ('direct', 0.025), ('segmentation', 0.025), ('interpretation', 0.024), ('ford', 0.023), ('refers', 0.022), ('noise', 0.022), ('discriminative', 0.022), ('niebles', 0.021), ('employ', 0.021), ('histograms', 0.021), ('leveraged', 0.021), ('partitioning', 0.021), ('han', 0.021), ('unsupervised', 0.021), ('tsai', 0.02), ('tang', 0.02), ('leistner', 0.02), ('training', 0.019), ('boosting', 0.019), ('event', 0.019), ('annotate', 0.019), ('knn', 0.018), ('ranked', 0.018), ('ranking', 0.018), ('tnhe', 0.018), ('creates', 0.018), ('former', 0.018), ('localized', 0.018), ('density', 0.018), ('retain', 0.018), ('classifiers', 0.017), ('learning', 0.017), ('hope', 0.017), ('aosr', 0.017), ('coof', 0.017), ('googl', 0.017), ('uploading', 0.017), ('inat', 0.017), ('atthee', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999917 133 cvpr-2013-Discriminative Segment Annotation in Weakly Labeled Video
2 0.15706013 187 cvpr-2013-Geometric Context from Videos
Author: S. Hussain Raza, Matthias Grundmann, Irfan Essa
Abstract: We present a novel algorithm for estimating the broad 3D geometric structure of outdoor video scenes. Leveraging spatio-temporal video segmentation, we decompose a dynamic scene captured by a video into geometric classes, based on predictions made by region-classifiers that are trained on appearance and motion features. By examining the homogeneity of the prediction, we combine predictions across multiple segmentation hierarchy levels alleviating the need to determine the granularity a priori. We built a novel, extensive dataset on geometric context of video to evaluate our method, consisting of over 100 groundtruth annotated outdoor videos with over 20,000 frames. To further scale beyond this dataset, we propose a semisupervised learning framework to expand the pool of labeled data with high confidence predictions obtained from unlabeled data. Our system produces an accurate prediction of geometric context of video achieving 96% accuracy across main geometric classes.
3 0.13537143 273 cvpr-2013-Looking Beyond the Image: Unsupervised Learning for Object Saliency and Detection
Author: Parthipan Siva, Chris Russell, Tao Xiang, Lourdes Agapito
Abstract: We propose a principled probabilistic formulation of object saliency as a sampling problem. This novel formulation allows us to learn, from a large corpus of unlabelled images, which patches of an image are of the greatest interest and most likely to correspond to an object. We then sample the object saliency map to propose object locations. We show that using only a single object location proposal per image, we are able to correctly select an object in over 42% of the images in the PASCAL VOC 2007 dataset, substantially outperforming existing approaches. Furthermore, we show that our object proposal can be used as a simple unsupervised approach to the weakly supervised annotation problem. Our simple unsupervised approach to annotating objects of interest in images achieves a higher annotation accuracy than most weakly supervised approaches.
4 0.1208224 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection
Author: Sanja Fidler, Roozbeh Mottaghi, Alan Yuille, Raquel Urtasun
Abstract: In this paper we are interested in how semantic segmentation can help object detection. Towards this goal, we propose a novel deformable part-based model which exploits region-based segmentation algorithms that compute candidate object regions by bottom-up clustering followed by ranking of those regions. Our approach allows every detection hypothesis to select a segment (including void), and scores each box in the image using both the traditional HOG filters as well as a set of novel segmentation features. Thus our model “blends” between the detector and segmentation models. Since our features can be computed very efficiently given the segments, we maintain the same complexity as the original DPM [14]. We demonstrate the effectiveness of our approach in PASCAL VOC 2010, and show that when employing only a root filter our approach outperforms Dalal & Triggs detector [12] on all classes, achieving 13% higher average AP. When employing the parts, we outperform the original DPM [14] in 19 out of 20 classes, achieving an improvement of 8% AP. Furthermore, we outperform the previous state-of-the-art on VOC’10 test by 4%.
5 0.11759041 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs
Author: Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun, Devi Parikh
Abstract: Recent trends in semantic image segmentation have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning. In this work, we are interested in understanding the roles of these different tasks in aiding semantic segmentation. Towards this goal, we “plug-in” human subjects for each of the various components in a state-of-the-art conditional random field model (CRF) on the MSRC dataset. Comparisons among various hybrid human-machine CRFs give us indications of how much “head room” there is to improve segmentation by focusing research efforts on each of the tasks. One of the interesting findings from our slew of studies was that human classification of isolated super-pixels, while being worse than current machine classifiers, provides a significant boost in performance when plugged into the CRF! Fascinated by this finding, we conducted in depth analysis of the human generated potentials. This inspired a new machine potential which significantly improves state-of-the-art performance on the MSRC dataset.
6 0.1147171 200 cvpr-2013-Harvesting Mid-level Visual Concepts from Large-Scale Internet Images
7 0.10632019 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model
8 0.10033444 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
9 0.096566349 386 cvpr-2013-Self-Paced Learning for Long-Term Tracking
10 0.089654386 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
11 0.089543179 86 cvpr-2013-Composite Statistical Inference for Semantic Segmentation
12 0.089196123 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
13 0.086449817 19 cvpr-2013-A Minimum Error Vanishing Point Detection Approach for Uncalibrated Monocular Images of Man-Made Environments
14 0.080947787 80 cvpr-2013-Category Modeling from Just a Single Labeling: Use Depth Information to Guide the Learning of 2D Models
15 0.080716297 67 cvpr-2013-Blocks That Shout: Distinctive Parts for Scene Classification
16 0.079593934 450 cvpr-2013-Unsupervised Joint Object Discovery and Segmentation in Internet Images
17 0.079556175 28 cvpr-2013-A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching
18 0.074623466 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
19 0.073591352 193 cvpr-2013-Graph Transduction Learning with Connectivity Constraints with Application to Multiple Foreground Cosegmentation
20 0.073509596 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors
topicId topicWeight
[(0, 0.16), (1, -0.067), (2, 0.023), (3, -0.044), (4, 0.003), (5, 0.023), (6, -0.01), (7, 0.019), (8, -0.054), (9, 0.036), (10, 0.063), (11, -0.075), (12, 0.033), (13, -0.009), (14, -0.039), (15, -0.026), (16, 0.066), (17, 0.003), (18, -0.061), (19, -0.057), (20, -0.048), (21, 0.04), (22, 0.062), (23, -0.075), (24, 0.037), (25, -0.016), (26, 0.001), (27, 0.042), (28, 0.077), (29, -0.029), (30, -0.033), (31, -0.006), (32, -0.048), (33, 0.069), (34, 0.052), (35, 0.077), (36, -0.018), (37, -0.066), (38, -0.002), (39, -0.015), (40, 0.024), (41, -0.011), (42, 0.011), (43, 0.066), (44, -0.117), (45, 0.043), (46, -0.052), (47, 0.012), (48, -0.01), (49, -0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.95805603 133 cvpr-2013-Discriminative Segment Annotation in Weakly Labeled Video
2 0.78394526 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model
Author: Wei-Chen Chiu, Mario Fritz
Abstract: Video data provides a rich source of information that is available to us today in large quantities e.g. from online resources. Tasks like segmentation benefit greatly from the analysis of spatio-temporal motion patterns in videos and recent advances in video segmentation has shown great progress in exploiting these addition cues. However, observing a single video is often not enough to predict meaningful segmentations and inference across videos becomes necessary in order to predict segmentations that are consistent with objects classes. Therefore the task of video cosegmentation is being proposed, that aims at inferring segmentation from multiple videos. But current approaches are limited to only considering binary foreground/background segmentation and multiple videos of the same object. This is a clear mismatch to the challenges that we are facing with videos from online resources or consumer videos. We propose to study multi-class video co-segmentation where the number of object classes is unknown as well as the number of instances in each frame and video. We achieve this by formulating a non-parametric bayesian model across videos sequences that is based on a new videos segmentation prior as well as a global appearance model that links segments of the same class. We present the first multi-class video co-segmentation evaluation. We show that our method is applicable to real video data from online resources and outperforms state-of-the-art video segmentation and image co-segmentation baselines.
3 0.76281458 187 cvpr-2013-Geometric Context from Videos
Author: S. Hussain Raza, Matthias Grundmann, Irfan Essa
Abstract: We present a novel algorithm for estimating the broad 3D geometric structure of outdoor video scenes. Leveraging spatio-temporal video segmentation, we decompose a dynamic scene captured by a video into geometric classes, based on predictions made by region classifiers that are trained on appearance and motion features. By examining the homogeneity of the prediction, we combine predictions across multiple segmentation hierarchy levels, alleviating the need to determine the granularity a priori. We built a novel, extensive dataset on geometric context of video to evaluate our method, consisting of over 100 ground-truth annotated outdoor videos with over 20,000 frames. To further scale beyond this dataset, we propose a semi-supervised learning framework to expand the pool of labeled data with high-confidence predictions obtained from unlabeled data. Our system produces an accurate prediction of geometric context of video, achieving 96% accuracy across main geometric classes.
4 0.73427927 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors
Author: Aditya Khosla, Raffay Hamid, Chih-Jen Lin, Neel Sundaresan
Abstract: Given the enormous growth in user-generated videos, it is becoming increasingly important to be able to navigate them efficiently. As these videos are generally of poor quality, summarization methods designed for well-produced videos do not generalize to them. To address this challenge, we propose to use web-images as a prior to facilitate summarization of user-generated videos. Our main intuition is that people tend to take pictures of objects to capture them in a maximally informative way. Such images could therefore be used as prior information to summarize videos containing a similar set of objects. In this work, we apply our novel insight to develop a summarization algorithm that uses the web-image based prior information in an unsupervised manner. Moreover, to automatically evaluate summarization algorithms on a large scale, we propose a framework that relies on multiple summaries obtained through crowdsourcing. We demonstrate the effectiveness of our evaluation framework by comparing its performance to that of multiple human evaluators. Finally, we present results for our framework tested on hundreds of user-generated videos.
5 0.73225719 434 cvpr-2013-Topical Video Object Discovery from Key Frames by Modeling Word Co-occurrence Prior
Author: Gangqiang Zhao, Junsong Yuan, Gang Hua
Abstract: A topical video object refers to an object that is frequently highlighted in a video. It could be, e.g., the product logo and the leading actor/actress in a TV commercial. We propose a topic model that incorporates a word co-occurrence prior for efficient discovery of topical video objects from a set of key frames. Previous work using topic models, such as Latent Dirichlet Allocation (LDA), for video object discovery often takes a bag-of-visual-words representation, which ignores important co-occurrence information among the local features. We show that such data-driven, bottom-up co-occurrence information can conveniently be incorporated in LDA with a Gaussian Markov prior, which combines top-down probabilistic topic modeling with bottom-up priors in a unified model. Our experiments on challenging videos demonstrate that the proposed approach can discover different types of topical objects despite variations in scale, view-point, color and lighting changes, or even partial occlusions. The efficacy of the co-occurrence prior is clearly demonstrated when comparing with topic models without such priors.
6 0.7315346 413 cvpr-2013-Story-Driven Summarization for Egocentric Video
9 0.59337324 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
10 0.57542181 86 cvpr-2013-Composite Statistical Inference for Semantic Segmentation
11 0.56913233 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos
12 0.56758142 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
13 0.56671357 193 cvpr-2013-Graph Transduction Learning with Connectivity Constraints with Application to Multiple Foreground Cosegmentation
14 0.55802208 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection
15 0.55312085 353 cvpr-2013-Relative Hidden Markov Models for Evaluating Motion Skill
16 0.55123037 25 cvpr-2013-A Sentence Is Worth a Thousand Pixels
17 0.53514636 145 cvpr-2013-Efficient Object Detection and Segmentation for Fine-Grained Recognition
18 0.53372908 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs
19 0.52748489 148 cvpr-2013-Ensemble Video Object Cut in Highly Dynamic Scenes
20 0.52253515 370 cvpr-2013-SCALPEL: Segmentation Cascades with Localized Priors and Efficient Learning
topicId topicWeight
[(0, 0.187), (10, 0.166), (16, 0.017), (26, 0.049), (28, 0.011), (33, 0.259), (67, 0.058), (69, 0.064), (77, 0.01), (80, 0.019), (87, 0.061)]
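The topic list above is a sparse vector of (topicId, topicWeight) pairs. The page does not say how the simValue scores in the following table are derived from such vectors, so the snippet below is only a minimal sketch under the assumption that a cosine measure is used. The function name `cosine_similarity` and the second vector `other_paper` are hypothetical and serve purely as illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {topicId: weight}."""
    # Only topics present in both vectors contribute to the dot product.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_a * norm_b)

# Topic vector for paper 133, copied from the list above.
paper_133 = dict([(0, 0.187), (10, 0.166), (16, 0.017), (26, 0.049),
                  (28, 0.011), (33, 0.259), (67, 0.058), (69, 0.064),
                  (77, 0.01), (80, 0.019), (87, 0.061)])

# Hypothetical topic vector for some other paper (made-up values).
other_paper = {0: 0.2, 33: 0.3, 50: 0.1}

score = cosine_similarity(paper_133, other_paper)
```

A dictionary keyed by topicId keeps the comparison proportional to the number of non-zero topics rather than the full topic count, which suits vectors as sparse as the one listed.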
simIndex simValue paperId paperTitle
same-paper 1 0.87669873 133 cvpr-2013-Discriminative Segment Annotation in Weakly Labeled Video
Author: Kevin Tang, Rahul Sukthankar, Jay Yagnik, Li Fei-Fei
Abstract: The ubiquitous availability of Internet video offers the vision community the exciting opportunity to directly learn localized visual concepts from real-world imagery. Unfortunately, most such attempts are doomed because traditional approaches are ill-suited, both in terms of their computational characteristics and their inability to robustly contend with the label noise that plagues uncurated Internet content. We present CRANE, a weakly supervised algorithm that is specifically designed to learn under such conditions. First, we exploit the asymmetric availability of real-world training data, where small numbers of positive videos tagged with the concept are supplemented with large quantities of unreliable negative data. Second, we ensure that CRANE is robust to label noise, both in terms of tagged videos that fail to contain the concept as well as occasional negative videos that do. Finally, CRANE is highly parallelizable, making it practical to deploy at large scale without sacrificing the quality of the learned solution. Although CRANE is general, this paper focuses on segment annotation, where we show state-of-the-art pixel-level segmentation results on two datasets, one of which includes a training set of spatiotemporal segments from more than 20,000 videos.
Author: Tsz-Ho Yu, Tae-Kyun Kim, Roberto Cipolla
Abstract: This work addresses the challenging problem of unconstrained 3D human pose estimation (HPE) from a novel perspective. Existing approaches struggle to operate in realistic applications, mainly due to their scene-dependent priors, such as background segmentation and multi-camera networks, which restrict their use in unconstrained environments. We therefore present a framework which applies action detection and 2D pose estimation techniques to infer 3D poses in an unconstrained video. Action detection offers spatiotemporal priors to 3D human pose estimation by both recognising and localising actions in space-time. Instead of holistic features, e.g. silhouettes, we leverage the flexibility of the deformable part model to detect 2D body parts as a feature to estimate 3D poses. A new unconstrained pose dataset has been collected to justify the feasibility of our method, which demonstrated promising results, significantly outperforming the relevant state-of-the-art methods.
3 0.85698575 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
Author: Omar Oreifej, Zicheng Liu
Abstract: We present a new descriptor for activity recognition from videos acquired by a depth sensor. Previous descriptors mostly compute shape and motion features independently; thus, they often fail to capture the complex joint shape-motion cues at pixel-level. In contrast, we describe the depth sequence using a histogram capturing the distribution of the surface normal orientation in the 4D space of time, depth, and spatial coordinates. To build the histogram, we create 4D projectors, which quantize the 4D space and represent the possible directions for the 4D normal. We initialize the projectors using the vertices of a regular polychoron. Consequently, we refine the projectors using a discriminative density measure, such that additional projectors are induced in the directions where the 4D normals are more dense and discriminative. Through extensive experiments, we demonstrate that our descriptor better captures the joint shape-motion cues in the depth sequence, and thus outperforms the state-of-the-art on all relevant benchmarks.
4 0.85654575 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
Author: Ian Endres, Kevin J. Shih, Johnston Jiaa, Derek Hoiem
Abstract: We propose a method to learn a diverse collection of discriminative parts from object bounding box annotations. Part detectors can be trained and applied individually, which simplifies learning and extension to new features or categories. We apply the parts to object category detection, pooling part detections within bottom-up proposed regions and using a boosted classifier with proposed sigmoid weak learners for scoring. On PASCAL VOC 2010, we evaluate the part detectors' ability to discriminate and localize annotated keypoints. Our detection system is competitive with the best existing systems, outperforming other HOG-based detectors on the more deformable categories.
5 0.85590678 414 cvpr-2013-Structure Preserving Object Tracking
Author: Lu Zhang, Laurens van der Maaten
Abstract: Model-free trackers can track arbitrary objects based on a single (bounding-box) annotation of the object. Whilst the performance of model-free trackers has recently improved significantly, simultaneously tracking multiple objects with similar appearance remains very hard. In this paper, we propose a new multi-object model-free tracker (based on tracking-by-detection) that resolves this problem by incorporating spatial constraints between the objects. The spatial constraints are learned along with the object detectors using an online structured SVM algorithm. The experimental evaluation of our structure-preserving object tracker (SPOT) reveals significant performance improvements in multi-object tracking. We also show that SPOT can improve the performance of single-object trackers by simultaneously tracking different parts of the object.
6 0.85339326 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking
7 0.85089785 314 cvpr-2013-Online Object Tracking: A Benchmark
8 0.8497237 325 cvpr-2013-Part Discovery from Partial Correspondence
9 0.8496725 324 cvpr-2013-Part-Based Visual Tracking with Online Latent Structural Learning
10 0.84862298 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
11 0.8479647 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
12 0.84653294 458 cvpr-2013-Voxel Cloud Connectivity Segmentation - Supervoxels for Point Clouds
13 0.846416 131 cvpr-2013-Discriminative Non-blind Deblurring
14 0.84594268 445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models
15 0.84585601 360 cvpr-2013-Robust Estimation of Nonrigid Transformation for Point Set Registration
16 0.84536266 267 cvpr-2013-Least Soft-Threshold Squares Tracking
17 0.84482777 462 cvpr-2013-Weakly Supervised Learning of Mid-Level Features with Beta-Bernoulli Process Restricted Boltzmann Machines
18 0.84429896 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
19 0.84332263 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
20 0.84315908 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection