cvpr cvpr2013 cvpr2013-413 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zheng Lu, Kristen Grauman
Abstract: We present a video summarization approach that discovers the story of an egocentric video. Given a long input video, our method selects a short chain of video subshots depicting the essential events. Inspired by work in text analysis that links news articles over time, we define a random-walk based metric of influence between subshots that reflects how visual objects contribute to the progression of events. Using this influence metric, we define an objective for the optimal k-subshot summary. Whereas traditional methods optimize a summary's diversity or representativeness, ours explicitly accounts for how one sub-event “leads to” another—which, critically, captures event connectivity beyond simple object co-occurrence. As a result, our summaries provide a better sense of story. We apply our approach to over 12 hours of daily activity video taken from 23 unique camera wearers, and systematically evaluate its quality compared to multiple baselines with 34 human subjects.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a video summarization approach that discovers the story of an egocentric video. [sent-3, score-0.792]
2 Given a long input video, our method selects a short chain of video subshots depicting the essential events. [sent-4, score-0.891]
3 Inspired by work in text analysis that links news articles over time, we define a random-walk based metric of influence between subshots that reflects how visual objects contribute to the progression of events. [sent-5, score-0.963]
4 Whereas traditional methods optimize a summary's diversity or representativeness, ours explicitly accounts for how one sub-event “leads to” another—which, critically, captures event connectivity beyond simple object co-occurrence. [sent-7, score-0.214]
5 We apply our approach to over 12 hours of daily activity video taken from 23 unique camera wearers, and systematically evaluate its quality compared to multiple baselines with 34 human subjects. [sent-9, score-0.269]
6 Much of the data consists of long-running, unedited content—for example, surveillance feeds, home videos, or video dumps from a camera worn by a human or robot. [sent-12, score-0.213]
7 The resulting summaries can be used to enhance video browsing, or to aid activity recognition algorithms. [sent-16, score-0.255]
8 Summarization methods compress the video by selecting a series of keyframes [26, 27, 10, 16] or subshots [19, 13, 18, 6] that best represent the original input. [sent-17, score-0.756]
9 (Figure caption) Our method produces a story-driven summary from unedited egocentric video. [sent-21, score-0.345]
10 A good story is defined as a coherent chain of video subshots in which each subshot influences the next through some (active) subset of influential visual objects. [sent-22, score-1.694]
11 However, we contend that defining video summarization as a sampling problem is much too limiting. [sent-26, score-0.269]
12 While a problem for any video source, this limitation is especially pronounced for egocentric video summarization. [sent-29, score-0.516]
13 Egocentric video captured with a wearable camera is long and unstructured, and its continuous nature yields no evident shot boundaries; yet, the raw data inherently should tell a story—that of the camera wearer’s day. [sent-30, score-0.21]
14 Specifically, we define a good story as a coherent chain of video subshots1 in which each subshot influences the next through some subset of key visual objects. [sent-34, score-1.027]
15 For example, in the “story” of visiting the bookstore, a book plays an important role in linking the actions of browsing the shelves. (Footnote 1: Throughout, we use subshot and keyframe interchangeably; the proposed method can produce summaries based on either unit.) [sent-36, score-0.647]
16 The metric builds a bipartite graph between subshots and objects, and then scores the impact each object has on the stationary probability for a random walk starting from one of the subshots and ending in another. [sent-40, score-1.243]
17 First, we segment the input video into subshots using a novel static-transit grouping procedure well-suited for unstructured egocentric video. [sent-42, score-1.005]
18 Next, for each subshot, we estimate its individual importance as well as its influence on every other subshot in the original sequence, given their objects/words. [sent-45, score-0.628]
19 Finally, we optimize an energy function that scores a candidate chain of k selected subshots according to how well it preserves both influence over time and individually important events. [sent-46, score-0.901]
20 Contributions Our main contribution is the idea of story-driven video summarization; to our knowledge, ours is the first summarization work to explicitly model the influence between sub-events. [sent-48, score-0.392]
21 Related Work We review prior work in video summarization, egocentric video analysis, and influence discovery in text mining. [sent-52, score-0.392]
22 Video summarization Keyframe-based methods select a sequence of keyframes to form a summary, and typically use low-level features like optical flow [26] or image differences [27]. [sent-53, score-0.206]
23 In contrast, video skimming techniques first segment the input into subshots using shot boundary detection. [sent-55, score-0.737]
24 Features used for subshot selection include motion-based attention [19], motion activity [18], or spatiotemporal features [13]. [sent-57, score-0.485]
25 User interaction can help guide subshot selection; for example, the user could point out a few interesting subshots [6], or provide keyframes for locations in a map-based storyboard [21]. [sent-58, score-1.179]
26 In contrast, our approach models the influence between subshots, which we show is vital to capture the story in the original video. [sent-61, score-0.35]
27 Egocentric video analysis Due to the small form factor of today's egocentric cameras, as well as expanding application areas, vision researchers are actively exploring egocentric video analysis. [sent-64, score-0.812]
28 Influence in news articles Both our influence metric as well as the search strategy we use to find good chains are directly inspired by recent work in text mining [24]. [sent-68, score-0.348]
29 Given a start and end news article, that system extracts a coherent chain of articles connecting them. [sent-69, score-0.353]
30 Adapting their model of influence to video requires defining analogies for documents and words. [sent-71, score-0.259]
31 For the former, we develop a novel subshot segmentation method for egocentric data; for the latter, we explore both category-specific and category-independent models of visual objects. [sent-72, score-0.778]
32 Finally, we find that compared to news articles, egocentric video contains substantial redundancy, and subshot quality varies greatly. [sent-73, score-0.932]
33 Thus, whereas the model in [24] scores only the influence of selected documents, we also model chain quality in terms of predicted importance and scene diversity. [sent-74, score-0.347]
34 Approach Our approach takes a long video as input and returns a short video summary as output. [sent-76, score-0.319]
35 Consider the subshots as nodes in a 1D chain, and let S = {sk1, . . . , skK} denote a selected chain of K subshots. [sent-86, score-0.599]
36 Egocentric Subshot Representation Subshot extraction is especially challenging for egocentric video. [sent-103, score-0.296]
37 Unlike edited video, whose cut points can be detected with low-level cues (e.g., detecting an abrupt change to the color histogram), egocentric videos are continuous. [sent-106, score-0.324]
38 Thus, we introduce a novel subshot segmentation approach tailored to egocentric data. [sent-108, score-0.756]
39 Thus, the number of subshots n will vary per video; in our data (described below) a typical subshot lasts 15 seconds and a typical 4-hour video has n = 960 total subshots. [sent-127, score-1.169]
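The excerpt does not spell out the static-transit grouping itself, so the following is only a rough illustrative sketch: frames are labeled static or in-transit by thresholding a per-frame motion magnitude (e.g., mean optical-flow magnitude), consecutive same-label frames are grouped, and very short runs are absorbed into their predecessor. The motion feature, threshold, and minimum run length are assumptions for illustration, not the paper's choices.

    import numpy as np

    def segment_subshots(motion_mag, thresh=1.0, min_len=30):
        # Label each frame static (0) or in-transit (1) by thresholding its motion magnitude.
        labels = (np.asarray(motion_mag) > thresh).astype(int)
        runs, start = [], 0
        for t in range(1, len(labels) + 1):
            if t == len(labels) or labels[t] != labels[start]:
                runs.append([start, t])          # run of identical labels, end-exclusive
                start = t
        # Absorb runs shorter than min_len frames into the preceding subshot.
        merged = [runs[0]]
        for s, e in runs[1:]:
            if e - s < min_len:
                merged[-1][1] = e
            else:
                merged.append([s, e])
        return [(s, e) for s, e in merged]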
40 We represent a subshot si in terms of the visual objects that appear within it. [sent-129, score-0.531]
41 For example, for egocentric video capturing daily living activities in the living room and kitchen [20], the object bank could naturally consist of household objects like fridge, mug, couch, etc. [sent-132, score-0.336]
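As a rough sketch of this bag-of-objects representation (the detector outputs and per-detection importance scores are hypothetical inputs, not the paper's actual pipeline), each subshot can be summarized as an importance-weighted object histogram over the object bank; the same matrix is consistent with the edge weighting described later, i.e., object frequency scaled by predicted importance.

    import numpy as np

    def subshot_object_histograms(detections_per_subshot, n_objects):
        # detections_per_subshot[s] is a list of (object_id, importance) pairs for subshot s,
        # e.g. from object-bank detectors plus an egocentric importance predictor (assumed inputs).
        H = np.zeros((len(detections_per_subshot), n_objects))
        for s, dets in enumerate(detections_per_subshot):
            for obj_id, importance in dets:
                H[s, obj_id] += importance   # frequency scaled by predicted importance
        return H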
42 Top row: object nodes, bottom row: subshot nodes. [sent-154, score-0.46]
43 Story progress between subshots The first term S(S) captures the element of story, and is most crucial to the novelty of our approach. [sent-157, score-0.643]
44 We say a selected chain S tells a good story if it consists of a coherent chain of visual objects, where each strongly influences the next in sequence. [sent-158, score-0.636]
45 The influence criterion means that for any pair of subshots selected in sequence, the objects in the first one “lead to” those in the second. [sent-159, score-0.79]
46 In the following, our definitions for influence and coherency are directly adapted from [24], where we draw an analogy between the news articles and words in that work, and the subshots and visual objects in our work. [sent-161, score-0.361]
47 Suppose we were considering influence alone for the story term S(S). [sent-162, score-0.35]
48 S(S) = min_{j=1,...,K−1} Σ_{oi∈O} INFLUENCE(sj, sj+1 | oi), (3) that is, the chain whose weakest link is as strong as possible. [sent-167, score-0.223]
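Assuming a precomputed array influence[i, j, o] holding the INFLUENCE(si, sj | o) values (a hypothetical data structure used only for illustration here), the influence-only objective above amounts to a weakest-link score over the selected chain:

    import numpy as np

    def story_score(chain, influence):
        # chain: sequence of selected subshot indices (k1, ..., kK).
        # Sum influence over all objects for each consecutive pair; take the weakest link.
        link_totals = [influence[a, b, :].sum() for a, b in zip(chain[:-1], chain[1:])]
        return min(link_totals)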
49 To compute the influence between two subshots requires more than simply counting their shared objects, as discussed above. [sent-168, score-0.722]
50 We construct a bipartite directed graph G = (Vs ∪ Vo, E) connecting subshots and objects. [sent-170, score-0.626]
51 The vertices Vs and Vo correspond to the subshots and objects, respectively. [sent-171, score-0.599]
52 For every object o that appears in subshot s, we add both the edges (o, s) and (s, o) to E. [sent-172, score-0.46]
53 The edges have weights based on the association between the subshot and object; we define the weight to be the frequency with which the object occurs in that subshot, scaled by its predicted egocentric importance, using [14]. [sent-173, score-0.756]
54 Intuitively, two subshots are highly connected if a random walk on the graph starting at the first subshot vertex frequently reaches the second one. [sent-176, score-1.079]
55 In the story of making cereal, our influence measure can capture grabbing a dish leading to fetching the milk (left). [sent-180, score-0.463]
56 INFLUENCE(si, sj | o) = Πi(sj) − Πi^o(sj), (5) where Πi(sj) is the stationary probability of reaching sj in a random walk started at si, and Πi^o is the same quantity with the node for object o removed. Intuitively, the score is high if object o is key to the influence of subshot si on sj—that is, if its removal would cause sj to no longer be reachable from si. [sent-193, score-0.646]
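A minimal numerical sketch of this influence computation, assuming W is the (n_subshots + n_objects) × (n_subshots + n_objects) weight matrix of the bipartite graph with subshot nodes indexed first, and approximating the stationary distribution by a random walk with restart. The restart probability, iteration count, and the sink treatment of the removed object node are assumptions; the paper's exact formulation may differ.

    import numpy as np

    def stationary_from(W, start, restart=0.15, n_iter=200):
        # Stationary distribution of a random walk over the graph, restarting at `start`.
        P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)   # row-normalize edge weights
        r = np.zeros(W.shape[0]); r[start] = 1.0
        pi = r.copy()
        for _ in range(n_iter):
            pi = (1.0 - restart) * (pi @ P) + restart * r
        return pi

    def influence(W, i, j, o, n_subshots):
        # Influence of object o on the transition subshot i -> subshot j (in the spirit of Eqn. 5):
        # how much the probability of reaching j from i drops when o is turned into a sink.
        pi_full = stationary_from(W, i)
        W_o = W.copy()
        W_o[n_subshots + o, :] = 0.0       # cut object o's outgoing edges
        pi_wo = stationary_from(W_o, i)
        return pi_full[j] - pi_wo[j]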
57 As desired, this metric of influence captures relationships between subshots even when they do not share objects. [sent-194, score-0.743]
58 To account for coherency as well as influence, we also enforce preferences that only a small number of objects be “active” for any given subshot transition, and that their activation patterns be smooth in the summary. [sent-201, score-0.569]
59 Here ai,j denotes the activation value for object i and subshot j. [sent-206, score-0.46]
60 Figure 4 shows an example of the activation pattern over a chain of subshots for our method and a baseline that uniformly samples frames throughout the original video. [sent-215, score-0.809]
61 Our result shows how the story progresses through the objects (i. [sent-216, score-0.276]
62 Importance of individual subshots The second term of our objective (Eqn. [sent-220, score-0.599]
63 It exploits cues specific to egocentric data, such as nearness of the region to the camera wearer’s hands, its size and location, and its frequency of appearance in a short time window. [sent-224, score-0.321]
64 We define I(S) = Σ_{j=1}^{K} IMPORTANCE(sj), (7) where the importance of a subshot sj is the average of importance scores for all its regions. [sent-225, score-0.592]
65 Note our influence computation also uses importance to weight edges in G above; however, the normalization step discards the overall importance of the subshot that we capture here. [sent-226, score-0.673]
66 Note this value is high when the scenes in sequential subshots are dissimilar. [sent-233, score-0.599]
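Putting the three cues together, a hedged sketch of an overall chain score: the additive form and the weights are assumptions made only for illustration, and influence, importance, and scene_dist are the same kind of hypothetical precomputed arrays used above (per-object pairwise influence, per-subshot importance, and pairwise scene dissimilarity).

    def chain_objective(chain, influence, importance, scene_dist,
                        w_story=1.0, w_imp=1.0, w_div=1.0):
        # Story term: weakest link of summed per-object influence (as sketched earlier).
        story = min(influence[a, b, :].sum() for a, b in zip(chain[:-1], chain[1:]))
        # Importance term (Eqn. 7): sum of per-subshot importance scores.
        imp = sum(importance[s] for s in chain)
        # Diversity term: smallest scene dissimilarity between consecutive subshots.
        div = min(scene_dist[a, b] for a, b in zip(chain[:-1], chain[1:]))
        return w_story * story + w_imp * imp + w_div * div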
67 The basic idea is to use a priority queue to hold intermediate chains, and exploit the fact that computing the story term S for a single-link chain is very efficient. [sent-241, score-0.486]
68 Each chain in the priority queue is either associated with its Q(S) score or an approximate score that is computed very efficiently. [sent-244, score-0.301]
69 At each iteration, the top chain in the priority queue is scored by Q(S) and reinserted if the chain is currently associated with its approximate score; otherwise the chain is expanded to longer chains by adding the subsequent subshots. [sent-246, score-0.506]
70 Then each newly created chain is inserted in the priority queue with its approximate score. [sent-247, score-0.259]
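The search loop could be sketched as follows, assuming the cheap approximate score upper-bounds Q(S), so that stopping at the first exact K-length chain popped is safe; exact_score and approx_score are assumed callables, and the seeding and stopping rule are illustrative rather than the paper's exact procedure.

    import heapq

    def best_chain(n_subshots, K, exact_score, approx_score):
        heap = []
        for s in range(n_subshots):                               # seed with single-subshot chains
            heapq.heappush(heap, (-approx_score((s,)), 0, (s,)))  # 0 = approximate, 1 = exact
        while heap:
            _, is_exact, chain = heapq.heappop(heap)
            if not is_exact:                                      # re-score the top chain exactly, reinsert
                heapq.heappush(heap, (-exact_score(chain), 1, chain))
                continue
            if len(chain) == K:                                   # best exact K-subshot chain found
                return chain
            for nxt in range(chain[-1] + 1, n_subshots):          # expand by a subsequent subshot
                longer = chain + (nxt,)
                heapq.heappush(heap, (-approx_score(longer), 0, longer))
        return None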
71 Selecting a Chain of Chains in Long Videos For long egocentric video inputs, it is often ill-posed to measure influence across the boundaries of major distinct events (such as entirely different physical locations). [sent-253, score-0.596]
72 To compute a boundary score for each subshot, we sum the affinity between that subshot and all others within a small temporal window, and normalize that value by the affinity of all pairs in which one subshot is either before or after the current window. [sent-258, score-0.979]
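A loose sketch of that boundary score, given a precomputed subshot-affinity matrix; the window size and the exact normalization are assumptions, and the description above is followed only approximately.

    import numpy as np

    def boundary_scores(affinity, window=5):
        n = affinity.shape[0]
        scores = np.zeros(n)
        for t in range(n):
            lo, hi = max(0, t - window), min(n, t + window + 1)
            within = affinity[t, lo:hi].sum()                      # affinity to neighbors in the window
            outside = affinity[lo:hi, :lo].sum() + affinity[lo:hi, hi:].sum()
            scores[t] = within / (outside + 1e-8)                  # normalize by cross-window affinity
        return scores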
73 The final video summary is constructed by selecting one chain from the candidates per event, and concatenating the selected chains together. [sent-263, score-0.453]
74 We simply select the chain with the highest importance among those for which the minimum diversity term is higher than a threshold τ. [sent-264, score-0.274]
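In code, the per-event selection could look like the following sketch; candidates_per_event, importance_of, and min_diversity_of are hypothetical inputs, and the fallback when no candidate passes the diversity threshold is an assumption not stated in the excerpt.

    def assemble_summary(candidates_per_event, importance_of, min_diversity_of, tau):
        summary = []
        for chains in candidates_per_event:                  # one list of candidate chains per event
            passing = [c for c in chains if min_diversity_of(c) > tau]
            pool = passing if passing else chains            # assumed fallback if none passes tau
            summary.extend(max(pool, key=importance_of))     # keep the most important chain
        return summary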
75 For ADL we use keyframes rather than subshots due to their shorter duration. [sent-282, score-0.665]
76 Baselines We compare to three baselines: (1) Uniform sampling: We select K subshots uniformly spaced throughout the video. [sent-315, score-0.599]
77 (2) Shortest-path: We construct a graph where all pairs of subshots have an edge connecting them, and the edge is weighted by their bag-of-objects distance. [sent-317, score-0.626]
78 We then select the K subshots that form the shortest path connecting the first and last subshot. [sent-318, score-0.626]
79 (3) Object-driven: We apply the state-of-the-art egocentric summarization method [14] using the authors’ code. [sent-320, score-0.455]
80 We first show the users a sped-up version of the entire original video, and ask them to write down the main story events. [sent-325, score-0.255]
81 This supports our main claim, that our approach can better capture stories in egocentric videos. [sent-351, score-0.296]
82 In such cases, our model of coherent influence finds subshots that give the sense of one event leading to the next. [sent-356, score-0.797]
83 In contrast, the state-of-the-art approach [14] tends to include subshots with important objects, but with a less obvious thread connecting them. [sent-357, score-0.626]
84 When a video focuses on the same scene for a long time, our method summarizes a short essential part, thanks to our importance and scene diversity terms. [sent-358, score-0.246]
85 On the other hand, our method does not have much advantage when the story is uneventful, or when there are multiple interwoven threads (e. [sent-363, score-0.227]
86 In such cases, our method tends to select a chain of subshots that are influential to each other, but miss other important parts of the story. [sent-366, score-0.827]
87 Example summaries Figures 6 and 7 show all methods’ summaries for example UTE and ADL inputs. [sent-368, score-0.24]
88 Discovering influential objects Finally, we demonstrate how our influence estimates can be used to discover the objects central to the story (Figure 5). [sent-371, score-0.24]
89 For a given video, we sort the objects oi ∈ O by their total influence scores across all its subshot transitions (Eqn. [sent-375, score-0.658]
90 To obtain ground truth, we had 3 workers on MTurk identify which of the N = 42 objects they found central to the story per video, and took the majority vote. [sent-378, score-0.276]
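Given the same hypothetical precomputed influence array used above, ranking objects by their total influence over the video's consecutive subshot transitions is a short sketch:

    import numpy as np

    def rank_objects_by_influence(influence):
        n_sub = influence.shape[0]
        totals = sum(influence[j, j + 1, :] for j in range(n_sub - 1))   # total influence per object
        return np.argsort(-totals)                                       # most influential objects first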
91 This application of our work may be useful for video retrieval or video saliency detection applications. [sent-381, score-0.22]
92 Towards this goal, we have developed a novel subshot segmentation method for egocentric data, and a selection objective that captures the influence between subshots as well as shot importance and diversity. [sent-384, score-1.572]
93 We are interested in our method’s use for egocentric data, since there is great need in that domain to cope with long unedited video—and it will only increase as more people and robots wear a camera as one of their mobile computing devices. [sent-387, score-0.392]
94 Still, in the future we’d like to explore visual influence in the context of other video domains. [sent-388, score-0.255]
95 We also plan to extend our subshot descriptions to reflect motion patterns or detected actions, moving beyond the object-centric view. [sent-389, score-0.46]
96 Our method clearly captures the progress of the story: serving ice cream leads to weighing the ice cream, which leads to watching TV in the ice cream shop, then driving home. [sent-394, score-0.29]
97 Even when there are no obvious visual links for the story, our method captures visually distinct scenes (see last few subshots in top row). [sent-395, score-0.674]
98 The shortest-path approach makes abrupt hops across the storyline in order to preserve subshots that smoothly transition (see redundancy in its last 5 subshots). [sent-396, score-0.619]
99 Discovering important people and objects for egocentric video summarization. [sent-506, score-0.455]
100 Figure-ground segmentation improves handled object recognition in egocentric video. [sent-560, score-0.296]
wordName wordTfidf (topN-words)
[('subshots', 0.599), ('subshot', 0.46), ('egocentric', 0.296), ('story', 0.227), ('chain', 0.16), ('summarization', 0.159), ('influence', 0.123), ('summaries', 0.12), ('video', 0.11), ('adl', 0.095), ('chains', 0.087), ('summary', 0.077), ('articles', 0.072), ('ute', 0.072), ('diversity', 0.069), ('influential', 0.068), ('news', 0.066), ('fridge', 0.062), ('wearer', 0.056), ('queue', 0.055), ('cream', 0.049), ('unedited', 0.049), ('objects', 0.049), ('event', 0.047), ('keyframes', 0.047), ('user', 0.045), ('importance', 0.045), ('events', 0.045), ('priority', 0.044), ('daily', 0.044), ('sj', 0.042), ('cereal', 0.042), ('grabbing', 0.042), ('wearers', 0.042), ('weakest', 0.041), ('food', 0.041), ('keyframe', 0.04), ('hours', 0.039), ('dish', 0.037), ('ice', 0.037), ('watching', 0.037), ('milk', 0.034), ('inclusion', 0.032), ('mug', 0.032), ('links', 0.032), ('activation', 0.031), ('tv', 0.03), ('living', 0.03), ('redundant', 0.029), ('home', 0.029), ('cooking', 0.029), ('coherency', 0.029), ('coherent', 0.028), ('users', 0.028), ('videos', 0.028), ('moetuhro', 0.028), ('storyboard', 0.028), ('wkshp', 0.028), ('shot', 0.028), ('fathi', 0.027), ('browsing', 0.027), ('connecting', 0.027), ('blur', 0.027), ('baselines', 0.026), ('documents', 0.026), ('activities', 0.026), ('oi', 0.026), ('activity', 0.025), ('camera', 0.025), ('quantize', 0.025), ('file', 0.025), ('eating', 0.025), ('microwave', 0.025), ('stationary', 0.025), ('bank', 0.024), ('progress', 0.023), ('dishes', 0.023), ('ought', 0.023), ('tric', 0.023), ('long', 0.022), ('link', 0.022), ('visual', 0.022), ('representativeness', 0.022), ('tea', 0.022), ('subjects', 0.021), ('score', 0.021), ('captures', 0.021), ('redundancy', 0.02), ('influences', 0.02), ('minutes', 0.02), ('notion', 0.02), ('else', 0.02), ('interchangeably', 0.02), ('civr', 0.02), ('walk', 0.02), ('frames', 0.019), ('selected', 0.019), ('affinity', 0.019), ('vo', 0.019), ('shorter', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 413 cvpr-2013-Story-Driven Summarization for Egocentric Video
Author: Zheng Lu, Kristen Grauman
Abstract: We present a video summarization approach that discovers the story of an egocentric video. Given a long input video, our method selects a short chain of video subshots depicting the essential events. Inspired by work in text analysis that links news articles over time, we define a randomwalk based metric of influence between subshots that reflects how visual objects contribute to the progression of events. Using this influence metric, we define an objective for the optimal k-subshot summary. Whereas traditional methods optimize a summary ’s diversity or representativeness, ours explicitly accounts for how one sub-event “leads to ” another—which, critically, captures event connectivity beyond simple object co-occurrence. As a result, our summaries provide a better sense of story. We apply our approach to over 12 hours of daily activity video taken from 23 unique camera wearers, and systematically evaluate its quality compared to multiple baselines with 34 human subjects.
2 0.21607889 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors
Author: Aditya Khosla, Raffay Hamid, Chih-Jen Lin, Neel Sundaresan
Abstract: Given the enormous growth in user-generated videos, it is becoming increasingly important to be able to navigate them efficiently. As these videos are generally of poor quality, summarization methods designed for well-produced videos do not generalize to them. To address this challenge, we propose to use web-images as a prior to facilitate summarization of user-generated videos. Our main intuition is that people tend to take pictures of objects to capture them in a maximally informative way. Such images could therefore be used as prior information to summarize videos containing a similar set of objects. In this work, we apply our novel insight to develop a summarization algorithm that uses the web-image based prior information in an unsupervised manner. Moreover, to automatically evaluate summarization algorithms on a large scale, we propose a framework that relies on multiple summaries obtained through crowdsourcing. We demonstrate the effectiveness of our evaluation framework by comparing its performance to that of multiple human evaluators. Finally, we present results for our framework tested on hundreds of user-generated videos.
3 0.10216377 332 cvpr-2013-Pixel-Level Hand Detection in Ego-centric Videos
Author: Cheng Li, Kris M. Kitani
Abstract: We address the task of pixel-level hand detection in the context of ego-centric cameras. Extracting hand regions in ego-centric videos is a critical step for understanding handobject manipulation and analyzing hand-eye coordination. However, in contrast to traditional applications of hand detection, such as gesture interfaces or sign-language recognition, ego-centric videos present new challenges such as rapid changes in illuminations, significant camera motion and complex hand-object manipulations. To quantify the challenges and performance in this new domain, we present a fully labeled indoor/outdoor ego-centric hand detection benchmark dataset containing over 200 million labeled pixels, which contains hand images taken under various illumination conditions. Using both our dataset and a publicly available ego-centric indoors dataset, we give extensive analysis of detection performance using a wide range of local appearance features. Our analysis highlights the effectiveness of sparse features and the importance of modeling global illumination. We propose a modeling strategy based on our findings and show that our model outperforms several baseline approaches.
4 0.087260425 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization
Author: Yale Song, Louis-Philippe Morency, Randall Davis
Abstract: Recent progress has shown that learning from hierarchical feature representations leads to improvements in various computer vision tasks. Motivated by the observation that human activity data contains information at various temporal resolutions, we present a hierarchical sequence summarization approach for action recognition that learns multiple layers of discriminative feature representations at different temporal granularities. We build up a hierarchy dynamically and recursively by alternating sequence learning and sequence summarization. For sequence learning we use CRFs with latent variables to learn hidden spatiotemporal dynamics; for sequence summarization we group observations that have similar semantic meaning in the latent space. For each layer we learn an abstract feature representation through non-linear gate functions. This procedure is repeated to obtain a hierarchical sequence summary representation. We develop an efficient learning method to train our model and show that its complexity grows sublinearly with the size of the hierarchy. Experimental results show the effectiveness of our approach, achieving the best published results on the ArmGesture and Canal9 datasets.
5 0.076033965 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
Author: Michalis Raptis, Leonid Sigal
Abstract: In this paper, we develop a new model for recognizing human actions. An action is modeled as a very sparse sequence of temporally local discriminative keyframes collections of partial key-poses of the actor(s), depicting key states in the action sequence. We cast the learning of keyframes in a max-margin discriminative framework, where we treat keyframes as latent variables. This allows us to (jointly) learn a set of most discriminative keyframes while also learning the local temporal context between them. Keyframes are encoded using a spatially-localizable poselet-like representation with HoG and BoW components learned from weak annotations; we rely on structured SVM formulation to align our components and mine for hard negatives to boost localization performance. This results in a model that supports spatio-temporal localization and is insensitive to dropped frames or partial observations. We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an on-line streaming setting.
6 0.074603714 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
7 0.072811291 287 cvpr-2013-Modeling Actions through State Changes
8 0.072160549 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
9 0.072092302 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model
10 0.0690988 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding
11 0.063031226 187 cvpr-2013-Geometric Context from Videos
12 0.062490266 85 cvpr-2013-Complex Event Detection via Multi-source Video Attributes
13 0.061019123 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
14 0.059246764 28 cvpr-2013-A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching
15 0.058174167 49 cvpr-2013-Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition
16 0.055572681 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
17 0.053046249 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
18 0.052505907 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction
19 0.05091631 455 cvpr-2013-Video Object Segmentation through Spatially Accurate and Temporally Dense Extraction of Primary Object Regions
20 0.050375547 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
topicId topicWeight
[(0, 0.132), (1, -0.018), (2, 0.012), (3, -0.05), (4, -0.052), (5, 0.021), (6, -0.022), (7, -0.017), (8, -0.026), (9, 0.038), (10, 0.051), (11, -0.039), (12, 0.03), (13, -0.008), (14, -0.016), (15, -0.009), (16, 0.044), (17, 0.035), (18, -0.018), (19, -0.081), (20, -0.042), (21, 0.019), (22, 0.001), (23, -0.067), (24, -0.026), (25, -0.027), (26, 0.039), (27, 0.015), (28, 0.004), (29, 0.034), (30, 0.01), (31, -0.015), (32, -0.034), (33, -0.002), (34, 0.021), (35, 0.108), (36, -0.013), (37, 0.022), (38, -0.011), (39, -0.022), (40, 0.013), (41, 0.021), (42, -0.057), (43, -0.015), (44, -0.045), (45, 0.059), (46, -0.074), (47, 0.002), (48, -0.024), (49, -0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.93123525 413 cvpr-2013-Story-Driven Summarization for Egocentric Video
Author: Zheng Lu, Kristen Grauman
Abstract: We present a video summarization approach that discovers the story of an egocentric video. Given a long input video, our method selects a short chain of video subshots depicting the essential events. Inspired by work in text analysis that links news articles over time, we define a randomwalk based metric of influence between subshots that reflects how visual objects contribute to the progression of events. Using this influence metric, we define an objective for the optimal k-subshot summary. Whereas traditional methods optimize a summary ’s diversity or representativeness, ours explicitly accounts for how one sub-event “leads to ” another—which, critically, captures event connectivity beyond simple object co-occurrence. As a result, our summaries provide a better sense of story. We apply our approach to over 12 hours of daily activity video taken from 23 unique camera wearers, and systematically evaluate its quality compared to multiple baselines with 34 human subjects.
2 0.91379768 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors
Author: Aditya Khosla, Raffay Hamid, Chih-Jen Lin, Neel Sundaresan
Abstract: Given the enormous growth in user-generated videos, it is becoming increasingly important to be able to navigate them efficiently. As these videos are generally of poor quality, summarization methods designed for well-produced videos do not generalize to them. To address this challenge, we propose to use web-images as a prior to facilitate summarization of user-generated videos. Our main intuition is that people tend to take pictures of objects to capture them in a maximally informative way. Such images could therefore be used as prior information to summarize videos containing a similar set of objects. In this work, we apply our novel insight to develop a summarization algorithm that uses the web-image based prior information in an unsupervised manner. Moreover, to automatically evaluate summarization algorithms on a large scale, we propose a framework that relies on multiple summaries obtained through crowdsourcing. We demonstrate the effectiveness of our evaluation framework by comparing its performance to that of multiple human evaluators. Finally, we present results for our framework tested on hundreds of user-generated videos.
Author: Pradipto Das, Chenliang Xu, Richard F. Doell, Jason J. Corso
Abstract: The problem of describing images through natural language has gained importance in the computer vision community. Solutions to image description have either focused on a top-down approach of generating language through combinations of object detections and language models or bottom-up propagation of keyword tags from training images to test images through probabilistic or nearest neighbor techniques. In contrast, describing videos with natural language is a less studied problem. In this paper, we combine ideas from the bottom-up and top-down approaches to image description and propose a method for video description that captures the most relevant contents of a video in a natural language description. We propose a hybrid system consisting of a low level multimodal latent topic model for initial keyword annotation, a middle level of concept detectors and a high level module to produce final lingual descriptions. We compare the results of our system to human descriptions in both short and long forms on two datasets, and demonstrate that final system output has greater agreement with the human descriptions than any single level.
4 0.79262936 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model
Author: Wei-Chen Chiu, Mario Fritz
Abstract: Video data provides a rich source of information that is available to us today in large quantities e.g. from online resources. Tasks like segmentation benefit greatly from the analysis of spatio-temporal motion patterns in videos and recent advances in video segmentation has shown great progress in exploiting these addition cues. However, observing a single video is often not enough to predict meaningful segmentations and inference across videos becomes necessary in order to predict segmentations that are consistent with objects classes. Therefore the task of video cosegmentation is being proposed, that aims at inferring segmentation from multiple videos. But current approaches are limited to only considering binary foreground/background segmentation and multiple videos of the same object. This is a clear mismatch to the challenges that we are facing with videos from online resources or consumer videos. We propose to study multi-class video co-segmentation where the number of object classes is unknown as well as the number of instances in each frame and video. We achieve this by formulating a non-parametric bayesian model across videos sequences that is based on a new videos segmentation prior as well as a global appearance model that links segments of the same class. We present the first multi-class video co-segmentation evaluation. We show that our method is applicable to real video data from online resources and outperforms state-of-the-art video segmentation and image co-segmentation baselines.
5 0.78911388 434 cvpr-2013-Topical Video Object Discovery from Key Frames by Modeling Word Co-occurrence Prior
Author: Gangqiang Zhao, Junsong Yuan, Gang Hua
Abstract: A topical video object refers to an object that is frequently highlighted in a video. It could be, e.g., the product logo and the leading actor/actress in a TV commercial. We propose a topic model that incorporates a word co-occurrence prior for efficient discovery of topical video objects from a set of key frames. Previous work using topic models, such as Latent Dirichelet Allocation (LDA), for video object discovery often takes a bag-of-visual-words representation, which ignored important co-occurrence information among the local features. We show that such data driven co-occurrence information from bottom-up can conveniently be incorporated in LDA with a Gaussian Markov prior, which combines top down probabilistic topic modeling with bottom up priors in a unified model. Our experiments on challenging videos demonstrate that the proposed approach can discover different types of topical objects despite variations in scale, view-point, color and lighting changes, or even partial occlusions. The efficacy of the co-occurrence prior is clearly demonstrated when comparing with topic models without such priors.
6 0.74994373 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos
7 0.7305131 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding
8 0.71409148 133 cvpr-2013-Discriminative Segment Annotation in Weakly Labeled Video
9 0.71275514 187 cvpr-2013-Geometric Context from Videos
10 0.68674242 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
11 0.65248406 37 cvpr-2013-Adherent Raindrop Detection and Removal in Video
12 0.64347959 333 cvpr-2013-Plane-Based Content Preserving Warps for Video Stabilization
13 0.62730187 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
14 0.62209284 455 cvpr-2013-Video Object Segmentation through Spatially Accurate and Temporally Dense Extraction of Primary Object Regions
15 0.58969194 49 cvpr-2013-Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition
16 0.54858583 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
17 0.54553986 85 cvpr-2013-Complex Event Detection via Multi-source Video Attributes
18 0.54393035 148 cvpr-2013-Ensemble Video Object Cut in Highly Dynamic Scenes
19 0.53234196 353 cvpr-2013-Relative Hidden Markov Models for Evaluating Motion Skill
20 0.53119218 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization
topicId topicWeight
[(10, 0.109), (16, 0.025), (26, 0.048), (28, 0.016), (33, 0.198), (50, 0.04), (52, 0.231), (67, 0.059), (69, 0.055), (76, 0.016), (77, 0.018), (87, 0.077)]
simIndex simValue paperId paperTitle
same-paper 1 0.79259819 413 cvpr-2013-Story-Driven Summarization for Egocentric Video
Author: Zheng Lu, Kristen Grauman
Abstract: We present a video summarization approach that discovers the story of an egocentric video. Given a long input video, our method selects a short chain of video subshots depicting the essential events. Inspired by work in text analysis that links news articles over time, we define a randomwalk based metric of influence between subshots that reflects how visual objects contribute to the progression of events. Using this influence metric, we define an objective for the optimal k-subshot summary. Whereas traditional methods optimize a summary ’s diversity or representativeness, ours explicitly accounts for how one sub-event “leads to ” another—which, critically, captures event connectivity beyond simple object co-occurrence. As a result, our summaries provide a better sense of story. We apply our approach to over 12 hours of daily activity video taken from 23 unique camera wearers, and systematically evaluate its quality compared to multiple baselines with 34 human subjects.
2 0.78762913 102 cvpr-2013-Decoding, Calibration and Rectification for Lenselet-Based Plenoptic Cameras
Author: Donald G. Dansereau, Oscar Pizarro, Stefan B. Williams
Abstract: Plenoptic cameras are gaining attention for their unique light gathering and post-capture processing capabilities. We describe a decoding, calibration and rectification procedure for lenselet-based plenoptic cameras appropriate for a range of computer vision applications. We derive a novel physically based 4D intrinsic matrix relating each recorded pixel to its corresponding ray in 3D space. We further propose a radial distortion model and a practical objective function based on ray reprojection. Our 15-parameter camera model is of much lower dimensionality than camera array models, and more closely represents the physics of lenselet-based cameras. Results include calibration of a commercially available camera using three calibration grid sizes over five datasets. Typical RMS ray reprojection errors are 0.0628, 0.105 and 0.363 mm for 3.61, 7.22 and 35.1 mm calibration grids, respectively. Rectification examples include calibration targets and real-world imagery.
Author: Xiaolong Wang, Liang Lin, Lichao Huang, Shuicheng Yan
Abstract: This paper proposes a reconfigurable model to recognize and detect multiclass (or multiview) objects with large variation in appearance. Compared with well acknowledged hierarchical models, we study two advanced capabilities in hierarchy for object modeling: (i) “switch” variables (i.e. or-nodes) for specifying alternative compositions, and (ii) making local classifiers (i.e. leaf-nodes) shared among different classes. These capabilities enable us to account well for structural variabilities while preserving the model compact. Our model, in the form of an And-Or Graph, comprises four layers: a batch of leaf-nodes with collaborative edges in bottom for localizing object parts; the or-nodes over bottom to activate their children leaf-nodes; the andnodes to classify objects as a whole; one root-node on the top for switching multiclass classification, which is also an or-node. For model training, we present an EM-type algorithm, namely dynamical structural optimization (DSO), to iteratively determine the structural configuration, (e.g., leaf-node generation associated with their parent or-nodes and shared across other classes), along with optimizing multi-layer parameters. The proposed method is valid on challenging databases, e.g., PASCAL VOC2007 and UIUC People, and it achieves state-of-the-arts performance.
4 0.75158882 63 cvpr-2013-Binary Code Ranking with Weighted Hamming Distance
Author: Lei Zhang, Yongdong Zhang, Jinhu Tang, Ke Lu, Qi Tian
Abstract: Binary hashing has been widely used for efficient similarity search due to its query and storage efficiency. In most existing binary hashing methods, the high-dimensional data are embedded into Hamming space and the distance or similarity of two points are approximated by the Hamming distance between their binary codes. The Hamming distance calculation is efficient, however, in practice, there are often lots of results sharing the same Hamming distance to a query, which makes this distance measure ambiguous and poses a critical issue for similarity search where ranking is important. In this paper, we propose a weighted Hamming distance ranking algorithm (WhRank) to rank the binary codes of hashing methods. By assigning different bit-level weights to different hash bits, the returned binary codes are ranked at a finer-grained binary code level. We give an algorithm to learn the data-adaptive and query-sensitive weight for each hash bit. Evaluations on two large-scale image data sets demonstrate the efficacy of our weighted Hamming distance for binary code ranking.
5 0.73338807 237 cvpr-2013-Kernel Learning for Extrinsic Classification of Manifold Features
Author: Raviteja Vemulapalli, Jaishanker K. Pillai, Rama Chellappa
Abstract: In computer vision applications, features often lie on Riemannian manifolds with known geometry. Popular learning algorithms such as discriminant analysis, partial least squares, support vector machines, etc., are not directly applicable to such features due to the non-Euclidean nature of the underlying spaces. Hence, classification is often performed in an extrinsic manner by mapping the manifolds to Euclidean spaces using kernels. However, for kernel based approaches, poor choice of kernel often results in reduced performance. In this paper, we address the issue of kernel selection for the classification of features that lie on Riemannian manifolds using the kernel learning approach. We propose two criteria for jointly learning the kernel and the classifier using a single optimization problem. Specifically, for the SVM classifier, we formulate the problem of learning a good kernel-classifier combination as a convex optimization problem and solve it efficiently following the multiple kernel learning approach. Experimental results on image set-based classification and activity recognition clearly demonstrate the superiority of the proposed approach over existing methods for classification of manifold features.
6 0.7239787 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
7 0.71889937 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities
8 0.71518147 414 cvpr-2013-Structure Preserving Object Tracking
9 0.71363324 61 cvpr-2013-Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics
10 0.71345335 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
11 0.71343732 331 cvpr-2013-Physically Plausible 3D Scene Tracking: The Single Actor Hypothesis
12 0.71314383 19 cvpr-2013-A Minimum Error Vanishing Point Detection Approach for Uncalibrated Monocular Images of Man-Made Environments
13 0.71218818 325 cvpr-2013-Part Discovery from Partial Correspondence
14 0.71166307 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking
15 0.71157116 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
16 0.71145117 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
17 0.71104294 400 cvpr-2013-Single Image Calibration of Multi-axial Imaging Systems
18 0.71086073 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path
19 0.71079302 445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models
20 0.71057385 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection