cvpr cvpr2013 cvpr2013-243 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Aditya Khosla, Raffay Hamid, Chih-Jen Lin, Neel Sundaresan
Abstract: Given the enormous growth in user-generated videos, it is becoming increasingly important to be able to navigate them efficiently. As these videos are generally of poor quality, summarization methods designed for well-produced videos do not generalize to them. To address this challenge, we propose to use web-images as a prior to facilitate summarization of user-generated videos. Our main intuition is that people tend to take pictures of objects to capture them in a maximally informative way. Such images could therefore be used as prior information to summarize videos containing a similar set of objects. In this work, we apply our novel insight to develop a summarization algorithm that uses the web-image based prior information in an unsupervised manner. Moreover, to automatically evaluate summarization algorithms on a large scale, we propose a framework that relies on multiple summaries obtained through crowdsourcing. We demonstrate the effectiveness of our evaluation framework by comparing its performance to that of multiple human evaluators. Finally, we present results for our framework tested on hundreds of user-generated videos.
Reference: text
sentIndex sentText sentNum sentScore
1 As these videos are generally of poor quality, summarization methods designed for well-produced videos do not generalize to them. [sent-8, score-1.042]
2 To address this challenge, we propose to use web-images as a prior to facilitate summarization of user-generated videos. [sent-9, score-0.571]
3 Our main intuition is that people tend to take pictures of objects to capture them in a maximally informative way. [sent-10, score-0.302]
4 Such images could therefore be used as prior information to summarize videos containing a similar set of objects. [sent-11, score-0.344]
5 In this work, we apply our novel insight to develop a summarization algorithm that uses the web-image based prior information in an unsupervised manner. [sent-12, score-0.546]
6 Moreover, to automatically evaluate summarization algorithms on a large scale, we propose a framework that relies on multiple summaries obtained through crowdsourcing. [sent-13, score-0.965]
7 These videos are extremely diverse in their content, and can vary in length from a few minutes to a few hours. [sent-18, score-0.265]
8 It is therefore becoming increasingly important to automatically extract a brief yet informative summary of these videos in order to enable a more efficient and engaging viewing experience. [sent-19, score-0.444]
9 In this work, we focus on the problem of automatic summarization and evaluation of user-generated videos. [sent-20, score-0.654]
10 Summarizing user-generated videos differs from summarizing well-produced videos in two important ways. [sent-21, score-0.53]
11 Unlike the individual images, which are taken from different canonical viewpoints to capture the car in a maximally informative way, very few of the sampled video frames are as informative. [sent-28, score-0.406]
12 Secondly, most of the user-generated videos contain only a small fraction of frames where some interesting event is happening. [sent-29, score-0.455]
13 The main contribution of this work is the idea of using web-images as a prior to facilitate the process of creating summaries of user-generated videos. [sent-31, score-0.512]
14 Our intuition is that people tend to take pictures of objects and events from a few canonical viewpoints in order to capture them in a maximally informative way. [sent-32, score-0.57]
15 On the other hand, as shown in Figure 1, user-generated videos taken by hand-held cameras often contain many uninformative frames captured while transitioning between the various canonical viewpoints. [sent-33, score-0.603]
16 We therefore hypothesize that images of objects and events present on the web contain information that could be used as a prior for building semantically meaningful (figure caption fragment: Assigned Viewpoint Labels; each discovered subclass corresponds to a “canonical viewpoint”.) [sent-34, score-0.287]
17 To improve our subclass models, we also use the unlabeled video data by first assigning each video frame to a subclass, and then repeating the optimization procedure from Section 2. [sent-37, score-0.55]
18 Finally, to generate the output summary with k representative frames, we select the k frames from the test video, each of which is closest to the centroid of one of the top k ranked subclasses. [sent-41, score-0.389]
19 summaries of user-generated videos in a scene-independent manner. [sent-42, score-0.718]
20 In this work, we apply our novel intuition to propose a summarization algorithm that incorporates this image-based prior to automatically select the maximally informative frames from a video. [sent-43, score-0.959]
21 An important related question we explore in this work is the evaluation of video summarization algorithms in a large-scale setting. [sent-44, score-0.568]
22 A majority of the previous work on video summarization uses expert opinion to evaluate its results. [sent-45, score-0.824]
23 Moreover, videos considered by these methods are generally well-produced with their content following a strict cinematographic structure. [sent-47, score-0.299]
24 Since user-generated videos are produced at scale, and do not follow strict structure, using expert opinion for their evaluation is infeasible. [sent-49, score-0.45]
25 To this end, we propose to rely on crowd-sourcing to obtain multiple candidate summaries of user-generated videos, to get a general sense of what their summaries should look like. [sent-50, score-0.906]
26 We cast the question of matching the output of a summarization algorithm and the crowd-sourced summarization as a graph-theoretic problem. [sent-51, score-1.024]
27 The main contributions of our work are: • A novel intuition to incorporate information from web images as a prior to automatically select maximally informative frames from user-generated videos without using any human-annotated summaries for training. [sent-53, score-1.126]
28 • A crowd-sourcing based automatic evaluation framework to evaluate the results of multiple video summarization algorithms. [sent-54, score-0.654]
29 • An analysis of our summarization mechanism tested over a large set of user-generated videos. [sent-55, score-0.512]
30 We then explain how we automatically evaluate different summarization results using a crowd-sourcing platform in Section 3, and present a comparative analysis of our experiments and results in Section 4. [sent-57, score-0.512]
31 Using these discovered viewpoints, we want to learn a discriminative model to identify similar frames in a video capturing a different instance of the object class. [sent-63, score-0.484]
32 Since many user-generated videos are captured from hand-held mobile devices, they contain a lot of variation in the viewpoints from which they capture an object. [sent-67, score-0.416]
33 Therefore, frames from these videos can be used as difficult-to-classify negative examples to further improve models of the canonical viewpoints. [sent-69, score-0.566]
34 To this end, we use the pre-trained viewpoint classifiers learned only from web-images to initialize a second round of training, where the labels for both web-images and video frames are considered. (Algorithm 1: Video summarization algorithm, explained in Section 2.) [sent-70, score-0.969]
35 (Algorithm 1 input: unlabeled images xD and videos xV, number of subclasses K.) [sent-72, score-0.399]
36 Identifying Canonical Viewpoints: In order to discover the canonical viewpoints, we want to identify visually similar images in a corpus of web images. [sent-88, score-0.272]
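As an illustration of the viewpoint-discovery step described here, the sketch below groups web images into K candidate canonical viewpoints by clustering their feature descriptors. This is a minimal sketch, not the paper's exact procedure: the descriptor, the clustering algorithm (k-means here) and the value of K are assumptions.

```python
# Illustrative sketch: group web images into K candidate "canonical viewpoint"
# subclasses by clustering their feature descriptors. The descriptor choice,
# the use of k-means, and K itself are assumptions, not the paper's settings.
import numpy as np
from sklearn.cluster import KMeans

def discover_viewpoints(image_features, num_subclasses=10, seed=0):
    """image_features: (num_images, feature_dim) array of image descriptors."""
    kmeans = KMeans(n_clusters=num_subclasses, n_init=10, random_state=seed)
    subclass_labels = kmeans.fit_predict(image_features)
    return subclass_labels, kmeans.cluster_centers_

# Example usage with random stand-in features:
# feats = np.random.randn(300, 128)
# labels, centroids = discover_viewpoints(feats, num_subclasses=8)
```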
37 Furthermore, we want to reliably identify these viewpoints in a previously unseen set of video frames. [sent-89, score-0.389]
38 Using Unlabeled Videos for Training: In this section, we assume that we have a classifier for one canonical viewpoint, and we want to identify additional examples of the same viewpoint from the videos. [sent-113, score-0.504]
39 We break the videos into frames and treat all the frames as independent examples. [sent-114, score-0.645]
40 We then assign each frame from all videos to a subclass using the equation ŷiV = argmaxy(wy · xiV), where the weights wy learned from the images are used for the video-frame subclass assignment. [sent-137, score-0.816]
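The assignment rule just stated can be written compactly; the sketch below is a minimal illustration (array names and shapes are assumptions): each frame goes to the subclass whose linear classifier, learned from the web images, scores it highest.

```python
# Minimal sketch of the subclass assignment rule yhat_i = argmax_y (w_y . x_i):
# score every video frame against every subclass classifier and take the argmax.
import numpy as np

def assign_frames_to_subclasses(frame_features, subclass_weights):
    """frame_features: (num_frames, d); subclass_weights: (K, d), one row w_y per subclass."""
    scores = frame_features @ subclass_weights.T   # (num_frames, K) of w_y . x_i
    return np.argmax(scores, axis=1)               # subclass label for every frame
```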
41 1 with both video frames and images for a few iterations. [sent-139, score-0.373]
42 Given a test video, we assign its frames to the different subclasses using their learned classifiers, and compute the average decision score of the positive examples from each subclass. [sent-140, score-0.349]
43 To generate the output summary with k representative frames, we select the k frames from the test video, each of which is closest to the centroid of any one of the top k ranked subclasses. [sent-145, score-0.389]
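Putting the last two sentences together, a hedged sketch of the summary-generation step follows: rank subclasses by the average decision score of the frames assigned to them, then return, for each of the top-k subclasses, the test frame closest to that subclass centroid. The centroid representation and the handling of empty subclasses are assumptions.

```python
# Sketch of summary generation: rank subclasses by the mean decision score of
# their assigned frames, then pick the frame nearest to each top subclass centroid.
import numpy as np

def generate_summary(frame_features, subclass_weights, centroids, k=5):
    scores = frame_features @ subclass_weights.T          # (num_frames, K)
    assignments = scores.argmax(axis=1)                   # subclass per frame
    K = subclass_weights.shape[0]
    avg_scores = np.full(K, -np.inf)
    for y in range(K):
        members = np.where(assignments == y)[0]
        if members.size:
            avg_scores[y] = scores[members, y].mean()
    summary = []
    for y in np.argsort(-avg_scores)[:k]:                 # top-k ranked subclasses
        members = np.where(assignments == y)[0]
        if members.size == 0:
            continue                                      # skip empty subclasses
        dists = np.linalg.norm(frame_features[members] - centroids[y], axis=1)
        summary.append(int(members[dists.argmin()]))      # frame closest to centroid
    return summary                                        # indices of selected frames
```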
44 Overall, the process consists of obtaining multiple summaries of a single video via AMT, and later comparing those summaries against the ones obtained by applying different algorithms. [sent-149, score-1.089]
45 Obtaining Annotation using Mechanical Turk Summarizing a video is a subjective task, and summaries produced by different people are often different, even when done by experts. [sent-153, score-0.67]
46 Thus, it is beneficial to obtain multiple summaries of a single video as ground truth to evaluate the performance of different algorithms. [sent-154, score-0.636]
47 A turker must select at least 3 and at most 25 frames that he believes adequately summarize the content of the frames shown. [sent-159, score-0.537]
48 20 per summary per video, and we obtain a total of 10 summaries per video, for a total of 155 videos. [sent-164, score-0.552]
49 Evaluation using Average Precision: Since the number of frames to use for a summary is application-dependent, we propose to evaluate a variable number of frames from a ranked list (similar to [14]). [sent-167, score-0.518]
50 Thus, we can iteratively evaluate the precision and recall of using 1 frame ({S1}), 2 frames ({S1, S2}), and so on, to plot a precision-recall curve. [sent-180, score-0.286]
51 Thus, to compute precision, we want to find how well all the retrieved frames match the reference frames, while to compute recall we want to find how many, and how accurately, the reference frames are returned in the retrieval result. [sent-186, score-0.626]
52 We want to find a matching between the two sets of frames such that there is one reference frame corresponding to each retrieved frame. [sent-200, score-0.329]
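A hedged sketch of this evaluation idea: retrieved frames are matched one-to-one to reference frames via a bipartite assignment over a frame-to-frame distance matrix, and the ranked list is then scored at every cutoff to trace a precision-recall curve. The distance function and the match threshold below are assumptions, not the paper's exact choices.

```python
# Sketch: one-to-one matching of retrieved frames to reference frames (bipartite
# assignment on a distance matrix), then precision/recall at every cutoff k.
import numpy as np
from scipy.optimize import linear_sum_assignment

def count_matches(retrieved_feats, reference_feats, threshold=0.5):
    """Number of retrieved frames matched to a distinct reference frame."""
    dists = np.linalg.norm(
        retrieved_feats[:, None, :] - reference_feats[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(dists)        # minimum-cost matching
    return int(np.sum(dists[rows, cols] < threshold))

def precision_recall_curve(ranked_feats, reference_feats, threshold=0.5):
    precisions, recalls = [], []
    for k in range(1, len(ranked_feats) + 1):
        matched = count_matches(ranked_feats[:k], reference_feats, threshold)
        precisions.append(matched / k)
        recalls.append(matched / len(reference_feats))
    return precisions, recalls    # average precision can be read off this curve
```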
53 Dataset: For this work, we focus on the “Cars and Trucks” class of objects, since it is one of the most popular categories where users upload both images and videos to e-commerce websites. [sent-215, score-0.341]
54 In order to collect our image corpus, we crawled several popular e-commerce websites and downloaded about 300,000 images of cars and trucks that users had uploaded to their listings. [sent-216, score-0.266]
55 To collect video data, we searched for all the user listings with a “youtube. [sent-217, score-0.28]
56 For each of these 180 listings, we downloaded their corresponding videos from youtube. [sent-219, score-0.302]
57 We ensured that images from listings containing test videos were not included in our training data. [sent-222, score-0.336]
58 The images and video frames were resized to have a maximum dimension of 500 pixels (preserving aspect ratio). [sent-229, score-0.373]
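The resizing step above (maximum dimension of 500 pixels, preserving aspect ratio) is simple; a small sketch using Pillow, which is an assumption about tooling, is shown below.

```python
# Sketch of the preprocessing step: shrink an image so its larger side is at
# most 500 pixels while preserving the aspect ratio (Pillow is assumed here).
from PIL import Image

def resize_max_dim(path, max_dim=500):
    img = Image.open(path)
    w, h = img.size
    scale = max_dim / float(max(w, h))
    if scale < 1.0:                       # only shrink; never upscale small images
        img = img.resize((int(round(w * scale)), int(round(h * scale))))
    return img
```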
59 Random sampling is the simplest baseline where we randomly select n frames from the video. [sent-237, score-0.278]
60 In uniform sampling, we split the video into n + 1 equal segments, where the last frame from each of the first n segments is selected. [sent-238, score-0.304]
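The two sampling baselines can be sketched directly from these descriptions; the frame-indexing conventions are assumptions.

```python
# Sketch of the two baselines: random sampling draws n frames at random;
# uniform sampling splits the video into n + 1 equal segments and keeps the
# last frame of each of the first n segments.
import random

def random_baseline(num_frames, n, seed=0):
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_frames), min(n, num_frames)))

def uniform_baseline(num_frames, n):
    segment = num_frames / float(n + 1)
    return [int(round((i + 1) * segment)) - 1 for i in range(n)]
```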
61 2 and our method, we also compute the average precision (AP) when reference summaries from the AMT workers are used for evaluation. [sent-245, score-0.615]
62 Figure 5 shows the summaries produced by different methods for three example videos, to give the reader a visual sense of each algorithm's output. [sent-251, score-1.171]
63 Note that our algorithm produces summaries most similar to the ones generated by human annotators. [sent-252, score-0.481]
64 Human Evaluation: With 15 human judges, we performed a human evaluation of the retrieved summaries from different algorithms to verify the results obtained from the automatic evaluation. [sent-257, score-0.69]
65 Each expert was shown a set of 25 randomly sampled videos. [sent-258, score-0.359]
66 (Figure 5 caption fragment: summaries from the different methods (uniform, k-means, ours and AMT), showing different numbers of frames for 3 different videos to give a sense of the visual quality of summaries obtained using the different summarization algorithms.) [sent-278, score-1.23]
67 They watched the video at 3x speed and were then shown 4 sets of summaries constructed using different methods: uniform sampling, k-means clustering, our proposed algorithm (Section 2), and a reference summary from AMT. [sent-281, score-0.802]
68 70 between the scores assigned to each video by human evaluators and our automatic method. [sent-287, score-0.297]
69 Finally, the high performance of AMT summaries in both human and automatic evaluation illustrates that our method to obtain summaries using crowdsourcing is effective, allowing us to evaluate video summarization results in a large-scale setting, while keeping costs low. [sent-289, score-1.806]
70 (Figure 6, x-axis: sorted video index. Caption: Improvement of our algorithm over the baseline (k-means) for individual videos, sorted by the amount of improvement.) [sent-292, score-0.475]
71 Our main intuition is that people tend to take pictures of objects from select viewpoints in order to capture them in a maximally informative way. [sent-297, score-0.456]
72 We therefore hypothesized that images of objects could be used to create summaries of user-generated videos. [sent-298, score-0.718]
73 (Table 2: Human Evaluation. 15 human judges evaluated the summaries from different algorithms on a scale of 1 to 10 to verify the results of our automatic evaluation scheme.) [sent-303, score-0.701]
74 As shown in Table 1, the results of our experiments confirm our hypothesis: the average precision we obtain while using the web-image prior to summarize videos is significantly better (54. [sent-308, score-0.392]
75 We also posit that since user-generated videos have a lot of variation in the viewpoints from which they capture an object, frames from these videos could be used in addition to the image based prior information to further improve the summarization performance. [sent-311, score-1.417]
76 This hypothesis is confirmed by the results in Table 1, where using video frames and images together performs better (59. [sent-312, score-0.373]
77 This is because combining images and video frames results in viewpoint clusters that are largely coherent (see Figure 7 for example clusters). [sent-315, score-0.461]
78 This indicates that using image based priors for summarizing user-generated videos captures what humans consider good summaries. [sent-318, score-0.343]
79 Furthermore, based on the feedback from the judges we learned that users generally position their cameras at the start and end of recording the videos such that the first and last frames of the videos are usually more informative than a randomly selected frame. [sent-319, score-0.907]
80 Adding this information in our summarization algorithm is likely to improve our overall performance. [sent-320, score-0.512]
81 generated cooking videos from Youtube and downloaded close to 10,000 images from Flickr for similar queries, all of which were related to the activity of making a salad. [sent-325, score-0.302]
82 For these reasons, we found the returned summaries of our algorithm and uniform sampling to be largely similar. [sent-330, score-0.543]
83 Furthermore, there are challenges of domain adaptation [33] when training on images in one setting, and testing the learned models on videos in a different setting. [sent-331, score-0.291]
84 Related Work: Video summarization has been studied from multiple perspectives [28]. [sent-334, score-0.512]
85 While the representation used for the summary might be key-frames [34] [12], image montages [3], or short glimpses [26] [25], the goal of video summarization is nevertheless to produce a compact visual summary that encapsulates the most informative parts of a video. [sent-335, score-0.973]
86 Most of the previous summarization techniques are designed for well-produced videos, and rely on low-level appearance and motion cues [22] [15]. [sent-336, score-0.512]
87 Our current work is another step in this general direction of content-aware summarization, where unlike previous approaches, we use web-images as a prior to facilitate summarization of user-generated videos. [sent-339, score-0.571]
88 The lack of an agreed-upon notion of the “optimal” summary of a video can make summary evaluation a key challenge for video summarization. [sent-340, score-0.646]
89 Similar challenges exist in other domains, such as machine translation [24] and text summarization [19], where previous methods have tried to combine several human-generated candidate summaries to infer a final answer which in expectation is better than any of the individual candidate results. [sent-341, score-0.991]
90 Following this approach, there has been previous work in the field of video summarization that also attempts to aggregate multiple summaries of a video to infer a final answer [18] [29] [7]. [sent-342, score-1.331]
91 More recently, there has been an interest in the problem of evaluating video summarization results at a large scale [2] [23]. [sent-344, score-0.695]
92 However, these approaches use multiple expert summaries, which is an expensive and time-consuming exercise. [sent-345, score-0.547]
93 In this work, however, we show how to use a crowd-sourcing model to get multiple summarization labels specifically for user-generated videos. [sent-347, score-0.551]
94 We demonstrated that web images could be used as a prior to summarize videos that capture objects similar to those present in the image corpus. [sent-350, score-0.416]
95 We also focused on the related problem of large-scale automatic evaluation of summarization algorithms. [sent-351, score-0.654]
96 We proposed an evaluation framework that uses multiple summaries obtained by crowd-sourcing, and compared the performance of our framework to that of multiple expert users. [sent-352, score-0.603]
97 Our main intuition regarding people taking pictures of objects to capture them in an informative way is applicable to videos of events and activities as well. [sent-353, score-0.537]
98 Online, simultaneous shot boundary detection and key frame extraction for sports videos using rank tracing. [sent-358, score-0.358]
99 Vert: automatic evaluation of video summaries. [sent-462, score-0.325]
100 Event driven web video summarization by tag localization and key-shot identification. [sent-554, score-0.739]
wordName wordTfidf (topN-words)
[('summarization', 0.512), ('summaries', 0.453), ('videos', 0.265), ('frames', 0.19), ('video', 0.183), ('amt', 0.141), ('subclasses', 0.134), ('viewpoints', 0.123), ('subclass', 0.122), ('canonical', 0.111), ('trucks', 0.105), ('summary', 0.099), ('expert', 0.094), ('siftflow', 0.091), ('automatic', 0.086), ('xiv', 0.084), ('informative', 0.08), ('summarizing', 0.078), ('judges', 0.078), ('workers', 0.076), ('listings', 0.071), ('wy', 0.07), ('bipartite', 0.067), ('maximally', 0.064), ('frame', 0.062), ('argmaxy', 0.058), ('evaluation', 0.056), ('instructions', 0.055), ('iew', 0.055), ('ldo', 0.05), ('want', 0.05), ('pictures', 0.048), ('precision', 0.048), ('intuition', 0.048), ('cars', 0.048), ('ecommerce', 0.047), ('examplesoflearnedcano', 0.047), ('partite', 0.047), ('rushes', 0.047), ('turker', 0.047), ('xid', 0.047), ('summarize', 0.045), ('viewpoint', 0.045), ('web', 0.044), ('clusters', 0.043), ('ebay', 0.042), ('yvi', 0.042), ('ap', 0.042), ('labels', 0.039), ('retrieved', 0.039), ('tomccap', 0.039), ('ranked', 0.039), ('reference', 0.038), ('downloaded', 0.037), ('taiwan', 0.037), ('worker', 0.037), ('uninformative', 0.037), ('annotators', 0.035), ('crowdsourcing', 0.035), ('food', 0.035), ('opinion', 0.035), ('corpus', 0.034), ('prior', 0.034), ('events', 0.034), ('content', 0.034), ('people', 0.034), ('fie', 0.033), ('identify', 0.033), ('annotation', 0.032), ('rank', 0.031), ('select', 0.031), ('returned', 0.031), ('sampling', 0.03), ('mechanical', 0.03), ('centroid', 0.03), ('trecvid', 0.03), ('xv', 0.03), ('pritch', 0.03), ('uniform', 0.029), ('optima', 0.029), ('users', 0.029), ('human', 0.028), ('capture', 0.028), ('descending', 0.028), ('acl', 0.028), ('discovered', 0.028), ('baseline', 0.027), ('keyframes', 0.026), ('slack', 0.026), ('challenges', 0.026), ('clustering', 0.026), ('scoring', 0.026), ('user', 0.026), ('notion', 0.026), ('growth', 0.025), ('spectral', 0.025), ('cluster', 0.025), ('semantically', 0.025), ('facilitate', 0.025), ('decision', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors
Author: Aditya Khosla, Raffay Hamid, Chih-Jen Lin, Neel Sundaresan
Abstract: Given the enormous growth in user-generated videos, it is becoming increasingly important to be able to navigate them efficiently. As these videos are generally of poor quality, summarization methods designed for well-produced videos do not generalize to them. To address this challenge, we propose to use web-images as a prior to facilitate summarization of user-generated videos. Our main intuition is that people tend to take pictures of objects to capture them in a maximally informative way. Such images could therefore be used as prior information to summarize videos containing a similar set of objects. In this work, we apply our novel insight to develop a summarization algorithm that uses the web-image based prior information in an unsupervised manner. Moreover, to automatically evaluate summarization algorithms on a large scale, we propose a framework that relies on multiple summaries obtained through crowdsourcing. We demonstrate the effectiveness of our evaluation framework by comparing its performance to that ofmultiple human evaluators. Finally, wepresent resultsfor our framework tested on hundreds of user-generated videos.
2 0.21673495 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization
Author: Yale Song, Louis-Philippe Morency, Randall Davis
Abstract: Recent progress has shown that learning from hierarchical feature representations leads to improvements in various computer vision tasks. Motivated by the observation that human activity data contains information at various temporal resolutions, we present a hierarchical sequence summarization approach for action recognition that learns multiple layers of discriminative feature representations at different temporal granularities. We build up a hierarchy dynamically and recursively by alternating sequence learning and sequence summarization. For sequence learning we use CRFs with latent variables to learn hidden spatiotemporal dynamics; for sequence summarization we group observations that have similar semantic meaning in the latent space. For each layer we learn an abstract feature representation through non-linear gate functions. This procedure is repeated to obtain a hierarchical sequence summary representation. We develop an efficient learning method to train our model and show that its complexity grows sublinearly with the size of the hierarchy. Experimental results show the effectiveness of our approach, achieving the best published results on the ArmGesture and Canal9 datasets.
3 0.21607889 413 cvpr-2013-Story-Driven Summarization for Egocentric Video
Author: Zheng Lu, Kristen Grauman
Abstract: We present a video summarization approach that discovers the story of an egocentric video. Given a long input video, our method selects a short chain of video subshots depicting the essential events. Inspired by work in text analysis that links news articles over time, we define a randomwalk based metric of influence between subshots that reflects how visual objects contribute to the progression of events. Using this influence metric, we define an objective for the optimal k-subshot summary. Whereas traditional methods optimize a summary ’s diversity or representativeness, ours explicitly accounts for how one sub-event “leads to ” another—which, critically, captures event connectivity beyond simple object co-occurrence. As a result, our summaries provide a better sense of story. We apply our approach to over 12 hours of daily activity video taken from 23 unique camera wearers, and systematically evaluate its quality compared to multiple baselines with 34 human subjects.
4 0.15901574 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
Author: Raghuraman Gopalan
Abstract: While the notion of joint sparsity in understanding common and innovative components of a multi-receiver signal ensemble has been well studied, we investigate the utility of such joint sparse models in representing information contained in a single video signal. By decomposing the content of a video sequence into that observed by multiple spatially and/or temporally distributed receivers, we first recover a collection of common and innovative components pertaining to individual videos. We then present modeling strategies based on subspace-driven manifold metrics to characterize patterns among these components, across other videos in the system, to perform subsequent video analysis. We demonstrate the efficacy of our approach for activity classification and clustering by reporting competitive results on standard datasets such as, HMDB, UCF-50, Olympic Sports and KTH.
5 0.14684731 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding
Author: Jérôme Revaud, Matthijs Douze, Cordelia Schmid, Hervé Jégou
Abstract: This paper presents an approach for large-scale event retrieval. Given a video clip of a specific event, e.g., the wedding of Prince William and Kate Middleton, the goal is to retrieve other videos representing the same event from a dataset of over 100k videos. Our approach encodes the frame descriptors of a video to jointly represent their appearance and temporal order. It exploits the properties of circulant matrices to compare the videos in the frequency domain. This offers a significant gain in complexity and accurately localizes the matching parts of videos. Furthermore, we extend product quantization to complex vectors in order to compress our descriptors, and to compare them in the compressed domain. Our method outperforms the state of the art both in search quality and query time on two large-scale video benchmarks for copy detection, TRECVID and CCWEB. Finally, we introduce a challenging dataset for event retrieval, EVVE, and report the performance on this dataset.
6 0.14458032 187 cvpr-2013-Geometric Context from Videos
7 0.14121738 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model
8 0.13547297 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
9 0.13234012 386 cvpr-2013-Self-Paced Learning for Long-Term Tracking
10 0.11758592 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
11 0.11061694 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
12 0.10906269 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
14 0.10667242 85 cvpr-2013-Complex Event Detection via Multi-source Video Attributes
15 0.10512511 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
16 0.098887488 150 cvpr-2013-Event Recognition in Videos by Learning from Heterogeneous Web Sources
17 0.094873384 455 cvpr-2013-Video Object Segmentation through Spatially Accurate and Temporally Dense Extraction of Primary Object Regions
18 0.088482149 434 cvpr-2013-Topical Video Object Discovery from Key Frames by Modeling Word Co-occurrence Prior
19 0.087786563 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction
20 0.082647413 388 cvpr-2013-Semi-supervised Learning of Feature Hierarchies for Object Detection in a Video
topicId topicWeight
[(0, 0.195), (1, -0.064), (2, -0.009), (3, -0.085), (4, -0.066), (5, 0.005), (6, -0.053), (7, -0.051), (8, -0.03), (9, 0.044), (10, 0.062), (11, -0.08), (12, 0.062), (13, -0.026), (14, -0.021), (15, -0.049), (16, 0.056), (17, 0.005), (18, -0.031), (19, -0.143), (20, -0.062), (21, -0.013), (22, 0.025), (23, -0.123), (24, -0.053), (25, -0.041), (26, 0.019), (27, -0.001), (28, 0.029), (29, 0.007), (30, 0.032), (31, 0.004), (32, -0.017), (33, 0.017), (34, 0.044), (35, 0.18), (36, -0.039), (37, 0.018), (38, -0.011), (39, -0.059), (40, 0.013), (41, 0.026), (42, -0.082), (43, -0.084), (44, -0.107), (45, 0.131), (46, -0.128), (47, 0.056), (48, -0.021), (49, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.96166533 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors
Author: Aditya Khosla, Raffay Hamid, Chih-Jen Lin, Neel Sundaresan
Abstract: Given the enormous growth in user-generated videos, it is becoming increasingly important to be able to navigate them efficiently. As these videos are generally of poor quality, summarization methods designed for well-produced videos do not generalize to them. To address this challenge, we propose to use web-images as a prior to facilitate summarization of user-generated videos. Our main intuition is that people tend to take pictures of objects to capture them in a maximally informative way. Such images could therefore be used as prior information to summarize videos containing a similar set of objects. In this work, we apply our novel insight to develop a summarization algorithm that uses the web-image based prior information in an unsupervised manner. Moreover, to automatically evaluate summarization algorithms on a large scale, we propose a framework that relies on multiple summaries obtained through crowdsourcing. We demonstrate the effectiveness of our evaluation framework by comparing its performance to that ofmultiple human evaluators. Finally, wepresent resultsfor our framework tested on hundreds of user-generated videos.
2 0.86808115 413 cvpr-2013-Story-Driven Summarization for Egocentric Video
Author: Zheng Lu, Kristen Grauman
Abstract: We present a video summarization approach that discovers the story of an egocentric video. Given a long input video, our method selects a short chain of video subshots depicting the essential events. Inspired by work in text analysis that links news articles over time, we define a randomwalk based metric of influence between subshots that reflects how visual objects contribute to the progression of events. Using this influence metric, we define an objective for the optimal k-subshot summary. Whereas traditional methods optimize a summary ’s diversity or representativeness, ours explicitly accounts for how one sub-event “leads to ” another—which, critically, captures event connectivity beyond simple object co-occurrence. As a result, our summaries provide a better sense of story. We apply our approach to over 12 hours of daily activity video taken from 23 unique camera wearers, and systematically evaluate its quality compared to multiple baselines with 34 human subjects.
3 0.80339038 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model
Author: Wei-Chen Chiu, Mario Fritz
Abstract: Video data provides a rich source of information that is available to us today in large quantities e.g. from online resources. Tasks like segmentation benefit greatly from the analysis of spatio-temporal motion patterns in videos and recent advances in video segmentation has shown great progress in exploiting these addition cues. However, observing a single video is often not enough to predict meaningful segmentations and inference across videos becomes necessary in order to predict segmentations that are consistent with objects classes. Therefore the task of video cosegmentation is being proposed, that aims at inferring segmentation from multiple videos. But current approaches are limited to only considering binary foreground/background segmentation and multiple videos of the same object. This is a clear mismatch to the challenges that we are facing with videos from online resources or consumer videos. We propose to study multi-class video co-segmentation where the number of object classes is unknown as well as the number of instances in each frame and video. We achieve this by formulating a non-parametric bayesian model across videos sequences that is based on a new videos segmentation prior as well as a global appearance model that links segments of the same class. We present the first multi-class video co-segmentation evaluation. We show that our method is applicable to real video data from online resources and outperforms state-of-the-art video segmentation and image co-segmentation baselines.
Author: Pradipto Das, Chenliang Xu, Richard F. Doell, Jason J. Corso
Abstract: The problem of describing images through natural language has gained importance in the computer vision community. Solutions to image description have either focused on a top-down approach of generating language through combinations of object detections and language models or bottom-up propagation of keyword tags from training images to test images through probabilistic or nearest neighbor techniques. In contrast, describing videos with natural language is a less studied problem. In this paper, we combine ideas from the bottom-up and top-down approaches to image description and propose a method for video description that captures the most relevant contents of a video in a natural language description. We propose a hybrid system consisting of a low level multimodal latent topic model for initial keyword annotation, a middle level of concept detectors and a high level module to produce final lingual descriptions. We compare the results of our system to human descriptions in both short and long forms on two datasets, and demonstrate that final system output has greater agreement with the human descriptions than any single level.
5 0.78348917 434 cvpr-2013-Topical Video Object Discovery from Key Frames by Modeling Word Co-occurrence Prior
Author: Gangqiang Zhao, Junsong Yuan, Gang Hua
Abstract: A topical video object refers to an object that is frequently highlighted in a video. It could be, e.g., the product logo and the leading actor/actress in a TV commercial. We propose a topic model that incorporates a word co-occurrence prior for efficient discovery of topical video objects from a set of key frames. Previous work using topic models, such as Latent Dirichelet Allocation (LDA), for video object discovery often takes a bag-of-visual-words representation, which ignored important co-occurrence information among the local features. We show that such data driven co-occurrence information from bottom-up can conveniently be incorporated in LDA with a Gaussian Markov prior, which combines top down probabilistic topic modeling with bottom up priors in a unified model. Our experiments on challenging videos demonstrate that the proposed approach can discover different types of topical objects despite variations in scale, view-point, color and lighting changes, or even partial occlusions. The efficacy of the co-occurrence prior is clearly demonstrated when comparing with topic models without such priors.
6 0.76128578 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding
7 0.71655989 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos
8 0.70572221 133 cvpr-2013-Discriminative Segment Annotation in Weakly Labeled Video
9 0.67555523 187 cvpr-2013-Geometric Context from Videos
10 0.66697085 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
11 0.58381659 455 cvpr-2013-Video Object Segmentation through Spatially Accurate and Temporally Dense Extraction of Primary Object Regions
12 0.58130783 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
13 0.58029187 333 cvpr-2013-Plane-Based Content Preserving Warps for Video Stabilization
14 0.55583775 430 cvpr-2013-The SVM-Minus Similarity Score for Video Face Recognition
15 0.5534566 85 cvpr-2013-Complex Event Detection via Multi-source Video Attributes
16 0.53937495 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization
17 0.53762013 37 cvpr-2013-Adherent Raindrop Detection and Removal in Video
18 0.52428937 49 cvpr-2013-Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition
19 0.5053246 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
20 0.4994089 353 cvpr-2013-Relative Hidden Markov Models for Evaluating Motion Skill
topicId topicWeight
[(10, 0.132), (16, 0.02), (26, 0.056), (28, 0.02), (33, 0.255), (50, 0.247), (67, 0.068), (69, 0.027), (77, 0.018), (87, 0.058)]
simIndex simValue paperId paperTitle
1 0.83436537 8 cvpr-2013-A Fast Approximate AIB Algorithm for Distributional Word Clustering
Author: Lei Wang, Jianjia Zhang, Luping Zhou, Wanqing Li
Abstract: Distributional word clustering merges the words having similar probability distributions to attain reliable parameter estimation, compact classification models and even better classification performance. Agglomerative Information Bottleneck (AIB) is one of the typical word clustering algorithms and has been applied to both traditional text classification and recent image recognition. Although enjoying theoretical elegance, AIB has one main issue on its computational efficiency, especially when clustering a large number of words. Different from existing solutions to this issue, we analyze the characteristics of its objective function the loss of mutual information, and show that by merely using the ratio of word-class joint probabilities of each word, good candidate word pairs for merging can be easily identified. Based on this finding, we propose a fast approximate AIB algorithm and show that it can significantly improve the computational efficiency of AIB while well maintaining or even slightly increasing its classification performance. Experimental study on both text and image classification benchmark data sets shows that our algorithm can achieve more than 100 times speedup on large real data sets over the state-of-the-art method.
same-paper 2 0.82354569 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors
Author: Aditya Khosla, Raffay Hamid, Chih-Jen Lin, Neel Sundaresan
Abstract: Given the enormous growth in user-generated videos, it is becoming increasingly important to be able to navigate them efficiently. As these videos are generally of poor quality, summarization methods designed for well-produced videos do not generalize to them. To address this challenge, we propose to use web-images as a prior to facilitate summarization of user-generated videos. Our main intuition is that people tend to take pictures of objects to capture them in a maximally informative way. Such images could therefore be used as prior information to summarize videos containing a similar set of objects. In this work, we apply our novel insight to develop a summarization algorithm that uses the web-image based prior information in an unsupervised manner. Moreover, to automatically evaluate summarization algorithms on a large scale, we propose a framework that relies on multiple summaries obtained through crowdsourcing. We demonstrate the effectiveness of our evaluation framework by comparing its performance to that ofmultiple human evaluators. Finally, wepresent resultsfor our framework tested on hundreds of user-generated videos.
3 0.82278764 417 cvpr-2013-Subcategory-Aware Object Classification
Author: Jian Dong, Wei Xia, Qiang Chen, Jianshi Feng, Zhongyang Huang, Shuicheng Yan
Abstract: In this paper, we introduce a subcategory-aware object classification framework to boost category level object classification performance. Motivated by the observation of considerable intra-class diversities and inter-class ambiguities in many current object classification datasets, we explicitly split data into subcategories by ambiguity guided subcategory mining. We then train an individual model for each subcategory rather than attempt to represent an object category with a monolithic model. More specifically, we build the instance affinity graph by combining both intraclass similarity and inter-class ambiguity. Visual subcategories, which correspond to the dense subgraphs, are detected by the graph shift algorithm and seamlessly integrated into the state-of-the-art detection assisted classification framework. Finally the responses from subcategory models are aggregated by subcategory-aware kernel regression. The extensive experiments over the PASCAL VOC 2007 and PASCAL VOC 2010 databases show the state-ofthe-art performance from our framework.
4 0.81992847 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
Author: Michalis Raptis, Leonid Sigal
Abstract: In this paper, we develop a new model for recognizing human actions. An action is modeled as a very sparse sequence of temporally local discriminative keyframes collections of partial key-poses of the actor(s), depicting key states in the action sequence. We cast the learning of keyframes in a max-margin discriminative framework, where we treat keyframes as latent variables. This allows us to (jointly) learn a set of most discriminative keyframes while also learning the local temporal context between them. Keyframes are encoded using a spatially-localizable poselet-like representation with HoG and BoW components learned from weak annotations; we rely on structured SVM formulation to align our components and minefor hard negatives to boost localization performance. This results in a model that supports spatio-temporal localization and is insensitive to dropped frames or partial observations. We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an on-line streaming setting.
5 0.81113499 321 cvpr-2013-PDM-ENLOR: Learning Ensemble of Local PDM-Based Regressions
Author: Yen H. Le, Uday Kurkure, Ioannis A. Kakadiaris
Abstract: Statistical shape models, such as Active Shape Models (ASMs), sufferfrom their inability to represent a large range of variations of a complex shape and to account for the large errors in detection of model points. We propose a novel method (dubbed PDM-ENLOR) that overcomes these limitations by locating each shape model point individually using an ensemble of local regression models and appearance cues from selected model points. Our method first detects a set of reference points which were selected based on their saliency during training. For each model point, an ensemble of regressors is built. From the locations of the detected reference points, each regressor infers a candidate location for that model point using local geometric constraints, encoded by a point distribution model (PDM). The final location of that point is determined as a weighted linear combination, whose coefficients are learnt from the training data, of candidates proposed from its ensemble ’s component regressors. We use different subsets of reference points as explanatory variables for the component regressors to provide varying degrees of locality for the models in each ensemble. This helps our ensemble model to capture a larger range of shape variations as compared to a single PDM. We demonstrate the advantages of our method on the challenging problem of segmenting gene expression images of mouse brain.
6 0.77859938 413 cvpr-2013-Story-Driven Summarization for Egocentric Video
7 0.77343518 414 cvpr-2013-Structure Preserving Object Tracking
8 0.77266794 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
9 0.77205992 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
10 0.77095711 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
11 0.77070296 325 cvpr-2013-Part Discovery from Partial Correspondence
12 0.77017522 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
13 0.76975596 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking
14 0.76906645 314 cvpr-2013-Online Object Tracking: A Benchmark
15 0.76690578 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image
16 0.76608586 360 cvpr-2013-Robust Estimation of Nonrigid Transformation for Point Set Registration
17 0.76571381 311 cvpr-2013-Occlusion Patterns for Object Class Detection
18 0.76525271 277 cvpr-2013-MODEC: Multimodal Decomposable Models for Human Pose Estimation
19 0.76516593 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
20 0.76493299 206 cvpr-2013-Human Pose Estimation Using Body Parts Dependent Joint Regressors