cvpr cvpr2013 cvpr2013-151 knowledge-graph by maker-knowledge-mining

151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding


Source: pdf

Author: Jérôme Revaud, Matthijs Douze, Cordelia Schmid, Hervé Jégou

Abstract: This paper presents an approach for large-scale event retrieval. Given a video clip of a specific event, e.g., the wedding of Prince William and Kate Middleton, the goal is to retrieve other videos representing the same event from a dataset of over 100k videos. Our approach encodes the frame descriptors of a video to jointly represent their appearance and temporal order. It exploits the properties of circulant matrices to compare the videos in the frequency domain. This offers a significant gain in complexity and accurately localizes the matching parts of videos. Furthermore, we extend product quantization to complex vectors in order to compress our descriptors, and to compare them in the compressed domain. Our method outperforms the state of the art both in search quality and query time on two large-scale video benchmarks for copy detection, TRECVID and CCWEB. Finally, we introduce a challenging dataset for event retrieval, EVVE, and report the performance on this dataset.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Event retrieval in large video collections with circulant temporal encoding. Jérôme Revaud, Matthijs Douze, Cordelia Schmid, Hervé Jégou (INRIA). Abstract: This paper presents an approach for large-scale event retrieval. [sent-1, score-0.895]

2 Given a video clip of a specific event, e.g., the wedding of Prince William and Kate Middleton, the goal is to retrieve other videos representing the same event from a dataset of over 100k videos. [sent-4, score-0.576]

3 Our approach encodes the frame descriptors of a video to jointly represent their appearance and temporal order. [sent-5, score-0.484]

4 It exploits the properties of circulant matrices to compare the videos in the frequency domain. [sent-6, score-0.483]

5 Furthermore, we extend product quantization to complex vectors in order to compress our descriptors, and to compare them in the compressed domain. [sent-8, score-0.278]

6 Our method outperforms the state of the art both in search quality and query time on two large-scale video benchmarks for copy detection, TRECVID and CCWEB. [sent-9, score-0.524]

7 Finally, we introduce a challenging dataset for event retrieval, EVVE, and report the performance on this dataset. [sent-10, score-0.29]

8 Introduction. This paper introduces an approach for specific event retrieval. [sent-12, score-0.317]

9 Examples of events are news items such as the wedding of Prince William and Kate, or re-occurring events such as the eruption of a geyser. [sent-13, score-0.494]

10 Searching for specific events is related to video copy detection [13] and event category recognition [16], but there are substantial differences with both. [sent-17, score-0.822]

11 The goal of video copy detection is to find deformed copies of a given video. [sent-18, score-0.377]

12 Detecting event categories requires a classification approach that captures the large intra-class variability. [sent-21, score-0.29]

13 The method introduced in this paper is tailored to specific event retrieval, as it is flexible enough to handle significant viewpoint change while still producing a precise alignment in time. [sent-22, score-0.335]

14 Our first contribution is to encode the frame descriptors of a video into a temporal representation and to exploit the properties of circulant matrices to compare videos in the frequency domain. [sent-23, score-0.967]

15 The second contribution is a dataset for specific event retrieval in large user-generated video content. [sent-24, score-0.534]

16 This dataset, named EVVE, has been collected from YouTube and comprises a set of manually annotated videos of 13 events, as well as 100,000 distractor videos. [sent-25, score-0.304]

17 Many techniques for video retrieval represent a video as a set of descriptors extracted from frames or keyframes [4, 11, 20]. [sent-26, score-0.501]

18 Searching in a collection is performed by comparing the query descriptors with those of the dataset. [sent-27, score-0.265]

19 Temporal consistency is typically enforced by partial alignment [22] or by classic voting techniques, such as the temporal Hough transform [4], which was popular in the TRECVID video copy detection task [19]. [sent-30, score-0.603]

20 Such approaches are costly, since all frame descriptors of the query must be compared to those of the database before performing the temporal verification. [sent-31, score-0.574]

21 Frame descriptors are jointly encoded in the frequency domain, where convolutions are cast into efficient element-wise multiplications. [sent-35, score-0.216]

22 Computing a matching score between videos only requires component-wise operations and a single one-dimensional inverse Fourier transform, avoiding the reconstruction of the descriptor in the temporal domain. [sent-38, score-0.549]
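
The following sketch (Python/NumPy, not the authors' code) illustrates this idea: both descriptor sequences are transformed with per-dimension FFTs, compared with component-wise products, and a single 1-D inverse FFT yields a matching score for every temporal shift. The zero-padding length is an assumption, and the regularization discussed later in the paper is omitted here.

```python
import numpy as np

def shift_scores(q, b):
    """Matching score between two frame-descriptor sequences for every
    temporal shift, computed via element-wise products in the frequency
    domain and a single 1-D inverse FFT (a sketch of the idea, not the
    paper's exact regularized score).

    q: (m, d) query frame descriptors, b: (n, d) database frame descriptors.
    Returns one score per circular shift of the zero-padded sequences."""
    m, d = q.shape
    n, _ = b.shape
    N = m + n                          # zero-pad to avoid wrap-around
    Q = np.fft.fft(q, n=N, axis=0)     # d FFTs over the temporal axis
    B = np.fft.fft(b, n=N, axis=0)
    S = (np.conj(Q) * B).sum(axis=1)   # component-wise products, summed over dimensions
    return np.fft.ifft(S).real         # one inverse FFT -> score per shift

# the best temporal shift maximizes the score:
# delta_star = np.argmax(shift_scores(q, b))
```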

23 Recently, transforming a multi-dimensional signal to the Fourier domain to speed up detection was shown to be useful [5], but to our knowledge, it is new to analyze the temporal aspect of global image descriptors in this way. [sent-42, score-0.334]

24 The tradeoff between search quality, speed and memory usage is optimized with the product quantization technique [9], which is extended to complex vectors in order to compare our descriptors in the compressed Fourier domain. [sent-43, score-0.444]

25 Section 3 describes frame descriptors, Section 4 describes our temporal circulant encoding technique and Section 5 presents our indexing strategy. [sent-46, score-0.516]

26 The experiments in Section 6 demonstrate the excellent results of our approach for event retrieval on the EVVE dataset. [sent-47, score-0.394]

27 Our approach also significantly outperforms state-of-the-art systems for efficient video copy detection on the TRECVID and CCWEB benchmarks. [sent-48, score-0.377]

28 EVVE: an event retrieval dataset. This section introduces the EVVE (EVent VidEo) dataset, which is dedicated to the retrieval of particular events. [sent-50, score-0.525]

29 This differs from recognizing event categories such as “birthday party” or “grooming an animal”, as in the TRECVID Multimedia event detection task [16]. [sent-51, score-0.611]

30 Several of the EVVE events are localized precisely in time and space, as professional reporters and spectators have captured the same event simultaneously. [sent-53, score-0.325]

31 An example is the event “Concert of Madonna in Rome 2012”. [sent-54, score-0.29]

32 In this case, the videos overlap visually and can be aligned. [sent-55, score-0.232]

33 EVVE also includes events for which relevant videos might not correspond to the same instance in place or time. [sent-56, score-0.387]

34 For instance, the event "The major autumn flood in Thailand in 2011" is covered by videos of the flood in different places, and "Austerity riots in Barcelona" includes shots of riots at different places and moments. [sent-57, score-0.809]

35 Each event was annotated by one annotator, who first produced a precise definition of the event. [sent-60, score-0.29]

36 For example, the event "The wedding of Prince William and Kate Middleton" comes with such a precise definition. [sent-61, score-0.344]

37 In addition to the videos collected for the specific events, we have also retrieved a set of 100,000 "distractor" videos by querying YouTube with unrelated terms. [sent-69, score-0.495]

38 These videos have all been collected before September 2008, which ensures that the distractor set does not contain any of the relevant events of EVVE, since all events are temporally localized after September 2008. [sent-70, score-0.644]

39 The distractor videos representing a similar but distinct event, such as videos of other bomb attacks for Event #9, are counted as negatives. [sent-77, score-0.536]

40 Evaluation is performed in a standard retrieval scenario, where we submit one video query at a time and the algorithm returns a list of videos ranked by similarity scores. [sent-79, score-0.624]

41 Frame description. We represent a video by a sequence of high-dimensional frame descriptors, as described in this section. [sent-85, score-0.226]

42 All videos are mapped to a common format, by sampling them at a fixed rate of 15 fps and resizing them to a maximum of 120k pixels, while keeping the aspect ratio. [sent-87, score-0.292]
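
As a minimal illustration of this preprocessing rule, the helper below computes the target frame size; only the 120k-pixel budget and the preserved aspect ratio come from the text, the rounding behaviour is an assumption.

```python
import math

def target_size(width, height, max_pixels=120_000):
    """Downscale (width, height) so the frame has at most roughly max_pixels
    pixels while keeping the aspect ratio (a sketch of the preprocessing rule)."""
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, int(round(width * scale))), max(1, int(round(height * scale)))

# e.g. a 1280x720 video is mapped to roughly 462x260 (about 120k pixels)
```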

43 Local SIFT descriptors [14] are extracted for each frame on a dense grid [15], every 4 pixels and for 5 scale levels. [sent-89, score-0.203]

44 The SIFT descriptors of a frame are encoded using MultiVLAD [8], a variant of the Fisher vector [17]. [sent-93, score-0.231]
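
MultiVLAD [8] builds on VLAD aggregation; the sketch below shows plain VLAD over one frame's dense SIFT descriptors to convey the underlying building block. The centroids, the power-law normalization and the single-vocabulary setup are assumptions; the paper's MultiVLAD variant is not reproduced here.

```python
import numpy as np

def vlad(local_descs, centroids):
    """Plain VLAD aggregation of one frame's local descriptors.

    local_descs: (n, 128) SIFT descriptors of the frame
    centroids:   (k, 128) visual-word centroids learned offline
    Returns a (k*128,) L2-normalized frame descriptor."""
    # assign each local descriptor to its nearest centroid
    dists = ((local_descs[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    k, d = centroids.shape
    v = np.zeros((k, d))
    for i in range(k):
        sel = local_descs[assign == i]
        if len(sel):
            v[i] = (sel - centroids[i]).sum(axis=0)   # accumulate residuals
    v = np.sign(v) * np.sqrt(np.abs(v))               # power-law normalization (assumption)
    v = v.reshape(-1)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```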

45 Circulant temporal aggregation. The method introduced in this section aims at comparing two sequences of frame descriptors q = [q1, . . . ]. [sent-99, score-0.408]

46 This is the case for Fisher and our Multi-VLAD descriptors (Section 3), but not for other types of descriptors that are compared with complex kernels. [sent-118, score-0.273]

47 In practice, this assumption is not well satisfied, because the videos are very self-similar in time, so the similarity proposed in Eqn. [sent-120, score-0.232]

48 The encoding technique for sequences of vector descriptors presented in this section is referred to as Circulant Temporal Encoding (CTE). [sent-123, score-0.258]

49 Unfortunately, averaging does not always suffice, as many videos contain only one shot composed of a single frame: the components associated with high frequencies are almost 0 for all dimensions. [sent-215, score-0.317]

50 This leads to a regularized score sλ(q, b) between two video sequences q and b. [sent-231, score-0.204]

51 The scores are computed between two video sequences q and b for all possible temporal shifts. [sent-249, score-0.437]
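
The exact regularized expression is garbled in this extraction; as a purely illustrative stand-in (an assumption, not the paper's formula), the sketch below shows one common way a regularization weight λ enters a frequency-domain correlation, damping frequencies whose energy is close to zero.

```python
import numpy as np

def regularized_shift_scores(q, b, lam=0.1):
    """Illustrative regularized frequency-domain matching (NOT the paper's
    exact score): each frequency is re-weighted by the query energy plus a
    regularization term lam, so near-zero high frequencies do not dominate."""
    m, d = q.shape
    n, _ = b.shape
    N = m + n
    Q = np.fft.fft(q, n=N, axis=0)
    B = np.fft.fft(b, n=N, axis=0)
    energy = (np.abs(Q) ** 2).sum(axis=1)            # per-frequency query energy
    S = (np.conj(Q) * B).sum(axis=1) / (energy + lam)
    return np.fft.ifft(S).real / d
```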

52 In some applications such as video alignment (see Section 6), we also need the boundaries of the matching segments. [sent-251, score-0.215]

53 For this purpose, the database descriptors are reconstructed in the temporal domain with an inverse Fourier transform. [sent-252, score-0.385]

54 Yet, on large datasets this does not impact the overall efficiency, since it is only applied to a short-list of videos with the highest scores. [sent-259, score-0.232]

55 Frequency-domain representation. A database video b of length n is represented in the Fourier domain by a complex matrix B. [sent-267, score-0.306]

56 Therefore, expanded versions of the database descriptors can be generated on the fly and at no cost. [sent-294, score-0.243]

57 This asymmetric processing of the videos was chosen for efficiency reasons. [sent-295, score-0.232]

58 Unfortunately, this introduces an uncertainty on the alignment of the query and database videos: δ∗ can be determined modulo n only. [sent-296, score-0.302]

59 We propose two extensions of the product quantization technique [9], a compression technique that enables efficient compressed-domain comparison and search. [sent-300, score-0.245]

60 The comparison between a query descriptor x and the database vectors is performed in two stages. [sent-319, score-0.333]

61 We learn the k-means centroids for complex vectors by considering a d-dimensional complex vector to be a 2d-dimensional real vector; this is done for all the frequency vectors that we keep: Cd ≡ R2d and fj ≡ yj. [sent-327, score-0.3]

62 At query time, the table T stores complex values. [sent-328, score-0.219]
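
A sketch of this complex product quantization: each complex sub-vector is viewed as a real vector of twice the dimension for nearest-centroid assignment, and at query time a per-sub-vector table of complex similarities is precomputed. The codebook layout and the Hermitian inner product used for the table are assumptions, not the paper's code.

```python
import numpy as np

def complex_to_real(x):
    """View a d-dimensional complex vector as a 2d-dimensional real vector
    (real parts followed by imaginary parts), as done to learn the k-means
    centroids for complex vectors."""
    return np.concatenate([x.real, x.imag])

def pq_encode(x, codebooks):
    """Encode a complex vector with a product quantizer: split it into p
    sub-vectors (len(x) must be divisible by p) and assign each to its
    nearest centroid in the real view. codebooks[i] has shape (k, 2*d_sub)."""
    subs = np.split(x, len(codebooks))
    return [int(((cb - complex_to_real(sub)) ** 2).sum(axis=1).argmin())
            for sub, cb in zip(subs, codebooks)]

def lookup_tables(query, codebooks):
    """Per-sub-vector tables of complex similarities between the query and
    every centroid; comparing a compressed database vector then reduces to
    p table lookups and additions, and the accumulated values stay complex."""
    subs = np.split(query, len(codebooks))
    tables = []
    for sub, cb in zip(subs, codebooks):
        d_sub = cb.shape[1] // 2
        centroids = cb[:, :d_sub] + 1j * cb[:, d_sub:]   # back to complex form
        tables.append(centroids.conj() @ sub)            # one complex value per centroid
    return tables
```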

63 Summary of search procedure and complexity. Each database video is processed offline as follows: 1. [sent-343, score-0.222]

64 The video is pre-processed and each frame is described as a d-dimensional Multi-VLAD descriptor. [sent-344, score-0.226]

65 These vectors are separately encoded with a complex product quantizer, producing a compressed representation. [sent-353, score-0.214]

66 At query time, the submitted video is described in the same manner. [sent-355, score-0.288]

67 The complexity at query time depends on the number N of database videos, the dimensionality d of the frame descriptor and the video length, which we assume for readability to be constant (n frames): 1. [sent-356, score-0.52]

68 O(d n log n) – The query frame descriptors are mapped to the frequency domain by d FFTs. [sent-357, score-0.5]

69 This vector is mapped to the temporal domain using a single inverse FFT. [sent-376, score-0.266]

70 Experiments. In this section we evaluate our approach, both for video copy detection and event retrieval. [sent-385, score-0.667]

71 To compare the contributions of the frame descriptors and of the temporal matching, we introduce an additional descriptor obtained by averaging the frame descriptors (see Section 3) over the entire video. [sent-386, score-0.651]

72 Video copy detection. This task is evaluated on two public benchmarks, the CCWEB dataset [21] and the TRECVID 2008 content-based copy detection dataset (CCD) [19]; see Table 1. [sent-390, score-0.474]

73 The transformed versions in the database correspond to user re-posts on video sharing sites. [sent-392, score-0.266]

74 We present results on the camcording subtask, which is most relevant to our context of event retrieval in the presence of significant viewpoint changes. [sent-396, score-0.394]

75 The spatial and temporal compression is parametrized by the dimensionality d after PCA, the number p of PQ sub-quantizers and the frame description rate β, which defines the ratio between the number of frequency vectors and the number of video frames. [sent-399, score-0.53]

76 For near-duplicate retrieval as well as for event retrieval, Figure 2 shows that intermediate values of λ yield the best performance. [sent-417, score-0.394]

77 In contrast, we observe that small values of λ produce the best NDCR performance for the TRECVID copy detection task. [sent-418, score-0.237]

78 We therefore set λ = 0.1 for the near-duplicate and event retrieval tasks, and a smaller value of λ for the TRECVID copy detection task. [sent-421, score-0.394]

79 On CCWEB, both the temporal and non-temporal versions of our method outperform the state of the art for comparable memory footprints. [sent-425, score-0.23]

80 Figure 2 shows the impact of the parameter λ on the performance. Results for the large-scale version of the dataset are not strictly comparable with those of the original paper [20] because the distractor videos are different (they do not provide theirs). [sent-436, score-0.304]

81 Despite this advantage, MMV performs poorly (NDCR close to 1), due to the small overlap between queries and database videos (typically 1%), which dilutes the matching segment in the video descriptor. [sent-441, score-0.484]

82 Remark: The performance of CTE mainly depends on the length of the subsequence shared by the query and retrieved videos: pairs with subsequences shorter than 5 s are correctly found with 62% accuracy, subsequences between 5 s and 10 s with 80% accuracy, and longer subsequences with 93% accuracy. [sent-442, score-0.347]

83 Even on the largest datasets, e.g., CCWEB with 100k distractors, the bottleneck remains the descriptor computation, which is performed faster than real time on one processor core (1-2 minutes per query on TRECVID and CCWEB). [sent-446, score-0.24]

84 On EVVE+100k, this generates a database size of 943 MB and an average query time of 11s. [sent-453, score-0.23]

85 The detailed results are presented per event in Table 3 for both the temporal and non-temporal versions of our algorithm. [sent-454, score-0.475]

86 Interestingly, MMV performs similarly to CTE on average, at a much lower memory and computational cost, which means that some events are better captured by using a global descriptor of visual appearance. [sent-455, score-0.264]

87 For instance, videos from the Shakira concert always feature the crowd in the foreground and the same concert scene behind, so averaging the frame descriptors provides a robust visual summary of the event. [sent-456, score-0.482]

88 This is done by adding the normalized scores obtained from MMV and CTE for each database video and for each query. [sent-459, score-0.222]
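
A minimal sketch of this fusion, assuming a zero-mean / unit-variance normalization of each method's scores per query (the paper only states that normalized scores are added):

```python
import numpy as np

def fuse_scores(mmv_scores, cte_scores):
    """Late fusion of MMV and CTE for one query: normalize each method's
    scores over the database, then add them. The normalization used here
    (zero mean, unit variance) is an assumption.

    mmv_scores, cte_scores: (N,) arrays, one score per database video."""
    def norm(s):
        s = np.asarray(s, dtype=float)
        std = s.std()
        return (s - s.mean()) / std if std > 0 else s - s.mean()
    return norm(mmv_scores) + norm(cte_scores)

# ranking = np.argsort(-fuse_scores(mmv_scores, cte_scores))
```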

89 Note that CTE also outputs the matching video parts, which is important for the video alignment described in the next section. [sent-464, score-0.355]

90 Automatic video alignment. For some events from EVVE, many people have filmed the same scene. [sent-467, score-0.34]

91 We use the CTE method to automatically align the videos on a common timeline. [sent-470, score-0.232]

92 We match all possible video pairs (including all query and database videos), which results in a time shift δ∗ for each pair (see Section 4). [sent-471, score-0.504]

93 Aligning the videos consists of estimating the starting time of each video on the common timeline, so that the pairwise time shifts are satisfied. [sent-473, score-0.372]

94 During this process, groups of independent videos emerge, where each group corresponds to a distinct scene. [sent-477, score-0.232]
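
The sketch below illustrates one way to do this (not necessarily the authors' exact procedure, e.g., how unreliable matches are filtered is omitted): connected components of the match graph give the groups, and a least-squares fit of per-video start times approximately satisfies the pairwise shifts.

```python
import numpy as np

def align_on_timeline(num_videos, pairwise_shifts):
    """Estimate a start time per video on a common timeline from pairwise
    temporal shifts (a least-squares sketch under the stated assumptions).

    pairwise_shifts: list of (i, j, delta) meaning video j starts roughly
    delta frames after video i."""
    A = np.zeros((len(pairwise_shifts) + 1, num_videos))
    rhs = np.zeros(len(pairwise_shifts) + 1)
    for row, (i, j, delta) in enumerate(pairwise_shifts):
        A[row, i] = -1.0           # one equation t_j - t_i = delta per pair
        A[row, j] = 1.0
        rhs[row] = delta
    A[-1, 0] = 1.0                 # anchor the timeline at video 0
    t, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return t

def connected_groups(num_videos, pairwise_shifts):
    """Groups of videos connected by at least one match (union-find);
    each group corresponds to a distinct scene and can be aligned separately."""
    parent = list(range(num_videos))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j, _ in pairwise_shifts:
        parent[find(i)] = find(j)
    groups = {}
    for v in range(num_videos):
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())
```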

95 We use this to display different viewpoints of an event on a shared timeline, as depicted in Figure 3. [sent-478, score-0.29]

96 This video representation provides an efficient search scheme that avoids the exhaustive comparison of frames, which is commonly performed when estimating the temporal Hough transform. [sent-483, score-0.281]

97 Extensive experiments on two video copy detection benchmarks show that our approach improves over the state of the art with respect to accuracy, search time and memory usage. [sent-484, score-0.407]

98 Moving towards the more challenging task of event retrieval, our approach efficiently retrieves instances of events in a large collection of videos, as shown for the EVVE event retrieval dataset introduced in this paper. [sent-485, score-0.839]

99 Compact video description for copy detection with precise temporal alignment. [sent-516, score-0.518]

100 Tiny Videos: A large data set for nonparametric video retrieval and frame classification. [sent-561, score-0.33]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('evve', 0.379), ('event', 0.29), ('mmv', 0.247), ('videos', 0.232), ('cte', 0.207), ('copy', 0.206), ('trecvid', 0.19), ('circulant', 0.18), ('ccweb', 0.177), ('fourier', 0.16), ('events', 0.155), ('ndcr', 0.152), ('qi', 0.149), ('query', 0.148), ('temporal', 0.141), ('video', 0.14), ('descriptors', 0.117), ('retrieval', 0.104), ('frame', 0.086), ('database', 0.082), ('eruption', 0.076), ('distractor', 0.072), ('frequency', 0.071), ('flood', 0.067), ('pca', 0.065), ('descriptor', 0.064), ('sequences', 0.064), ('wi', 0.063), ('concert', 0.062), ('kate', 0.062), ('quantization', 0.059), ('product', 0.059), ('william', 0.059), ('subsequences', 0.056), ('prince', 0.054), ('wedding', 0.054), ('compression', 0.053), ('autumn', 0.051), ('geyser', 0.051), ('iceland', 0.051), ('middleton', 0.051), ('nmax', 0.051), ('riots', 0.051), ('strokkur', 0.051), ('thailand', 0.051), ('compressed', 0.049), ('fft', 0.049), ('hough', 0.048), ('inverse', 0.047), ('domain', 0.045), ('alignment', 0.045), ('bolme', 0.045), ('timeline', 0.045), ('frequencies', 0.045), ('memory', 0.045), ('versions', 0.044), ('shift', 0.042), ('centroids', 0.042), ('impacting', 0.042), ('padded', 0.042), ('regularization', 0.041), ('douze', 0.041), ('averaging', 0.04), ('transform', 0.04), ('encoding', 0.04), ('complex', 0.039), ('vectors', 0.039), ('smeaton', 0.037), ('archives', 0.037), ('technique', 0.037), ('seam', 0.036), ('gou', 0.036), ('byproduct', 0.036), ('distractors', 0.036), ('civr', 0.036), ('operations', 0.035), ('professional', 0.035), ('ccd', 0.033), ('compress', 0.033), ('mapped', 0.033), ('september', 0.033), ('indexing', 0.032), ('qt', 0.032), ('stores', 0.032), ('multimedia', 0.031), ('retrieved', 0.031), ('yj', 0.031), ('detection', 0.031), ('fisher', 0.031), ('benchmarks', 0.03), ('temporally', 0.03), ('matching', 0.03), ('storing', 0.029), ('filters', 0.028), ('processor', 0.028), ('encoded', 0.028), ('egou', 0.028), ('introduces', 0.027), ('pq', 0.027), ('fps', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding

Author: Jérôme Revaud, Matthijs Douze, Cordelia Schmid, Hervé Jégou

Abstract: This paper presents an approach for large-scale event retrieval. Given a video clip of a specific event, e.g., the wedding of Prince William and Kate Middleton, the goal is to retrieve other videos representing the same event from a dataset of over 100k videos. Our approach encodes the frame descriptors of a video to jointly represent their appearance and temporal order. It exploits the properties of circulant matrices to compare the videos in the frequency domain. This offers a significant gain in complexity and accurately localizes the matching parts of videos. Furthermore, we extend product quantization to complex vectors in order to compress our descriptors, and to compare them in the compressed domain. Our method outperforms the state of the art both in search quality and query time on two large-scale video benchmarks for copy detection, TRECVID and CCWEB. Finally, we introduce a challenging dataset for event retrieval, EVVE, and report the performance on this dataset.

2 0.28291011 85 cvpr-2013-Complex Event Detection via Multi-source Video Attributes

Author: Zhigang Ma, Yi Yang, Zhongwen Xu, Shuicheng Yan, Nicu Sebe, Alexander G. Hauptmann

Abstract: Complex events essentially include human, scenes, objects and actions that can be summarized by visual attributes, so leveraging relevant attributes properly could be helpful for event detection. Many works have exploited attributes at image level for various applications. However, attributes at image level are possibly insufficient for complex event detection in videos due to their limited capability in characterizing the dynamic properties of video data. Hence, we propose to leverage attributes at video level (named as video attributes in this work), i.e., the semantic labels of external videos are used as attributes. Compared to complex event videos, these external videos contain simple contents such as objects, scenes and actions which are the basic elements of complex events. Specifically, building upon a correlation vector which correlates the attributes and the complex event, we incorporate video attributes latently as extra informative cues into the event detector learnt from complex event videos. Extensive experiments on a real-world large-scale dataset validate the efficacy of the proposed approach.

3 0.19714576 292 cvpr-2013-Multi-agent Event Detection: Localization and Role Assignment

Author: Suha Kwak, Bohyung Han, Joon Hee Han

Abstract: We present a joint estimation technique of event localization and role assignment when the target video event is described by a scenario. Specifically, to detect multi-agent events from video, our algorithm identifies agents involved in an event and assigns roles to the participating agents. Instead of iterating through all possible agent-role combinations, we formulate the joint optimization problem as two efficient subproblems—quadratic programming for role assignment followed by linear programming for event localization. Additionally, we reduce the computational complexity significantly by applying role-specific event detectors to each agent independently. We test the performance of our algorithm in natural videos, which contain multiple target events and nonparticipating agents.

4 0.17262657 49 cvpr-2013-Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition

Author: Vinay Bettadapura, Grant Schindler, Thomas Ploetz, Irfan Essa

Abstract: We present data-driven techniques to augment Bag of Words (BoW) models, which allow for more robust modeling and recognition of complex long-term activities, especially when the structure and topology of the activities are not known a priori. Our approach specifically addresses the limitations of standard BoW approaches, which fail to represent the underlying temporal and causal information that is inherent in activity streams. In addition, we also propose the use ofrandomly sampled regular expressions to discover and encode patterns in activities. We demonstrate the effectiveness of our approach in experimental evaluations where we successfully recognize activities and detect anomalies in four complex datasets.

5 0.15232545 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities

Author: Raghuraman Gopalan

Abstract: While the notion of joint sparsity in understanding common and innovative components of a multi-receiver signal ensemble has been well studied, we investigate the utility of such joint sparse models in representing information contained in a single video signal. By decomposing the content of a video sequence into that observed by multiple spatially and/or temporally distributed receivers, we first recover a collection of common and innovative components pertaining to individual videos. We then present modeling strategies based on subspace-driven manifold metrics to characterize patterns among these components, across other videos in the system, to perform subsequent video analysis. We demonstrate the efficacy of our approach for activity classification and clustering by reporting competitive results on standard datasets such as, HMDB, UCF-50, Olympic Sports and KTH.

6 0.14943351 343 cvpr-2013-Query Adaptive Similarity for Large Scale Object Retrieval

7 0.14684731 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors

8 0.1373513 150 cvpr-2013-Event Recognition in Videos by Learning from Heterogeneous Web Sources

9 0.12907794 402 cvpr-2013-Social Role Discovery in Human Events

10 0.12666449 187 cvpr-2013-Geometric Context from Videos

11 0.122967 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches

12 0.12032226 260 cvpr-2013-Learning and Calibrating Per-Location Classifiers for Visual Place Recognition

13 0.12024903 189 cvpr-2013-Graph-Based Discriminative Learning for Location Recognition

14 0.11524409 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition

15 0.11506071 456 cvpr-2013-Visual Place Recognition with Repetitive Structures

16 0.11045601 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model

17 0.10964483 172 cvpr-2013-Finding Group Interactions in Social Clutter

18 0.1091188 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos

19 0.10825852 28 cvpr-2013-A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching

20 0.10793575 246 cvpr-2013-Learning Binary Codes for High-Dimensional Data Using Bilinear Projections


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.211), (1, -0.07), (2, -0.038), (3, -0.073), (4, -0.062), (5, 0.016), (6, -0.13), (7, -0.134), (8, -0.096), (9, 0.027), (10, 0.022), (11, -0.06), (12, 0.124), (13, 0.016), (14, 0.029), (15, -0.08), (16, 0.088), (17, 0.041), (18, 0.039), (19, -0.22), (20, -0.065), (21, -0.026), (22, -0.012), (23, -0.07), (24, -0.085), (25, 0.006), (26, 0.007), (27, -0.091), (28, -0.004), (29, -0.006), (30, 0.178), (31, -0.061), (32, -0.016), (33, -0.097), (34, -0.105), (35, 0.067), (36, 0.068), (37, 0.039), (38, 0.05), (39, 0.066), (40, 0.107), (41, 0.015), (42, -0.056), (43, -0.075), (44, -0.005), (45, 0.039), (46, -0.136), (47, 0.085), (48, -0.004), (49, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96042323 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding

Author: Jérôme Revaud, Matthijs Douze, Cordelia Schmid, Hervé Jégou

Abstract: This paper presents an approach for large-scale event retrieval. Given a video clip of a specific event, e.g., the wedding of Prince William and Kate Middleton, the goal is to retrieve other videos representing the same event from a dataset of over 100k videos. Our approach encodes the frame descriptors of a video to jointly represent their appearance and temporal order. It exploits the properties of circulant matrices to compare the videos in the frequency domain. This offers a significant gain in complexity and accurately localizes the matching parts of videos. Furthermore, we extend product quantization to complex vectors in order to compress our descriptors, and to compare them in the compressed domain. Our method outperforms the state of the art both in search quality and query time on two large-scale video benchmarks for copy detection, TRECVID and CCWEB. Finally, we introduce a challenging dataset for event retrieval, EVVE, and report the performance on this dataset.

2 0.75610858 292 cvpr-2013-Multi-agent Event Detection: Localization and Role Assignment

Author: Suha Kwak, Bohyung Han, Joon Hee Han

Abstract: We present a joint estimation technique of event localization and role assignment when the target video event is described by a scenario. Specifically, to detect multi-agent events from video, our algorithm identifies agents involved in an event and assigns roles to the participating agents. Instead of iterating through all possible agent-role combinations, we formulate the joint optimization problem as two efficient subproblems—quadratic programming for role assignment followed by linear programming for event localization. Additionally, we reduce the computational complexity significantly by applying role-specific event detectors to each agent independently. We test the performance of our algorithm in natural videos, which contain multiple target events and nonparticipating agents.

3 0.6810165 49 cvpr-2013-Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition

Author: Vinay Bettadapura, Grant Schindler, Thomas Ploetz, Irfan Essa

Abstract: We present data-driven techniques to augment Bag of Words (BoW) models, which allow for more robust modeling and recognition of complex long-term activities, especially when the structure and topology of the activities are not known a priori. Our approach specifically addresses the limitations of standard BoW approaches, which fail to represent the underlying temporal and causal information that is inherent in activity streams. In addition, we also propose the use ofrandomly sampled regular expressions to discover and encode patterns in activities. We demonstrate the effectiveness of our approach in experimental evaluations where we successfully recognize activities and detect anomalies in four complex datasets.

4 0.66803795 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors

Author: Aditya Khosla, Raffay Hamid, Chih-Jen Lin, Neel Sundaresan

Abstract: Given the enormous growth in user-generated videos, it is becoming increasingly important to be able to navigate them efficiently. As these videos are generally of poor quality, summarization methods designed for well-produced videos do not generalize to them. To address this challenge, we propose to use web-images as a prior to facilitate summarization of user-generated videos. Our main intuition is that people tend to take pictures of objects to capture them in a maximally informative way. Such images could therefore be used as prior information to summarize videos containing a similar set of objects. In this work, we apply our novel insight to develop a summarization algorithm that uses the web-image based prior information in an unsupervised manner. Moreover, to automatically evaluate summarization algorithms on a large scale, we propose a framework that relies on multiple summaries obtained through crowdsourcing. We demonstrate the effectiveness of our evaluation framework by comparing its performance to that ofmultiple human evaluators. Finally, wepresent resultsfor our framework tested on hundreds of user-generated videos.

5 0.64947277 402 cvpr-2013-Social Role Discovery in Human Events

Author: Vignesh Ramanathan, Bangpeng Yao, Li Fei-Fei

Abstract: We deal with the problem of recognizing social roles played by people in an event. Social roles are governed by human interactions, and form a fundamental component of human event description. We focus on a weakly supervised setting, where we are provided different videos belonging to an event class, without training role labels. Since social roles are described by the interaction between people in an event, we propose a Conditional Random Field to model the inter-role interactions, along with person specific social descriptors. We develop tractable variational inference to simultaneously infer model weights, as well as role assignment to all people in the videos. We also present a novel YouTube social roles dataset with ground truth role annotations, and introduce annotations on a subset of videos from the TRECVID-MED11 [1] event kits for evaluation purposes. The performance of the model is compared against different baseline methods on these datasets.

6 0.63572371 85 cvpr-2013-Complex Event Detection via Multi-source Video Attributes

7 0.60840851 413 cvpr-2013-Story-Driven Summarization for Egocentric Video

8 0.60755479 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos

9 0.57460648 172 cvpr-2013-Finding Group Interactions in Social Clutter

10 0.54335886 28 cvpr-2013-A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching

11 0.53078371 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities

12 0.50833458 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model

13 0.48312733 38 cvpr-2013-All About VLAD

14 0.46187234 456 cvpr-2013-Visual Place Recognition with Repetitive Structures

15 0.46149907 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos

16 0.4571121 246 cvpr-2013-Learning Binary Codes for High-Dimensional Data Using Bilinear Projections

17 0.44066828 79 cvpr-2013-Cartesian K-Means

18 0.44033241 149 cvpr-2013-Evaluation of Color STIPs for Human Action Recognition

19 0.43893105 434 cvpr-2013-Topical Video Object Discovery from Key Frames by Modeling Word Co-occurrence Prior

20 0.4376919 7 cvpr-2013-A Divide-and-Conquer Method for Scalable Low-Rank Latent Matrix Pursuit


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.103), (16, 0.034), (26, 0.041), (33, 0.291), (67, 0.068), (69, 0.033), (77, 0.295), (80, 0.01), (87, 0.051)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8955074 18 cvpr-2013-A Max-Margin Riffled Independence Model for Image Tag Ranking

Author: Tian Lan, Greg Mori

Abstract: We propose Max-Margin Riffled Independence Model (MMRIM), a new method for image tag ranking modeling the structured preferences among tags. The goal is to predict a ranked tag list for a given image, where tags are ordered by their importance or relevance to the image content. Our model integrates the max-margin formalism with riffled independence factorizations proposed in [10], which naturally allows for structured learning and efficient ranking. Experimental results on the SUN Attribute and LabelMe datasets demonstrate the superior performance of the proposed model compared with baseline tag ranking methods. We also apply the predicted rank list of tags to several higher-level computer vision applications in image understanding and retrieval, and demonstrate that MMRIM significantly improves the accuracy of these applications.

2 0.88764507 358 cvpr-2013-Robust Canonical Time Warping for the Alignment of Grossly Corrupted Sequences

Author: Yannis Panagakis, Mihalis A. Nicolaou, Stefanos Zafeiriou, Maja Pantic

Abstract: Temporal alignment of human behaviour from visual data is a very challenging problem due to a numerous reasons, including possible large temporal scale differences, inter/intra subject variability and, more importantly, due to the presence of gross errors and outliers. Gross errors are often in abundance due to incorrect localization and tracking, presence of partial occlusion etc. Furthermore, such errors rarely follow a Gaussian distribution, which is the de-facto assumption in machine learning methods. In this paper, building on recent advances on rank minimization and compressive sensing, a novel, robust to gross errors temporal alignment method is proposed. While previous approaches combine the dynamic time warping (DTW) with low-dimensional projections that maximally correlate two sequences, we aim to learn two underlyingprojection matrices (one for each sequence), which not only maximally correlate the sequences but, at the same time, efficiently remove the possible corruptions in any datum in the sequences. The projections are obtained by minimizing the weighted sum of nuclear and ?1 norms, by solving a sequence of convex optimization problems, while the temporal alignment is found by applying the DTW in an alternating fashion. The superiority of the proposed method against the state-of-the-art time alignment methods, namely the canonical time warping and the generalized time warping, is indicated by the experimental results on both synthetic and real datasets.

3 0.88027561 402 cvpr-2013-Social Role Discovery in Human Events

Author: Vignesh Ramanathan, Bangpeng Yao, Li Fei-Fei

Abstract: We deal with the problem of recognizing social roles played by people in an event. Social roles are governed by human interactions, and form a fundamental component of human event description. We focus on a weakly supervised setting, where we are provided different videos belonging to an event class, without training role labels. Since social roles are described by the interaction between people in an event, we propose a Conditional Random Field to model the inter-role interactions, along with person specific social descriptors. We develop tractable variational inference to simultaneously infer model weights, as well as role assignment to all people in the videos. We also present a novel YouTube social roles dataset with ground truth role annotations, and introduce annotations on a subset of videos from the TRECVID-MED11 [1] event kits for evaluation purposes. The performance of the model is compared against different baseline methods on these datasets.

same-paper 4 0.84219724 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding

Author: Jérôme Revaud, Matthijs Douze, Cordelia Schmid, Hervé Jégou

Abstract: This paper presents an approach for large-scale event retrieval. Given a video clip of a specific event, e.g., the wedding of Prince William and Kate Middleton, the goal is to retrieve other videos representing the same event from a dataset of over 100k videos. Our approach encodes the frame descriptors of a video to jointly represent their appearance and temporal order. It exploits the properties of circulant matrices to compare the videos in the frequency domain. This offers a significant gain in complexity and accurately localizes the matching parts of videos. Furthermore, we extend product quantization to complex vectors in order to compress our descriptors, and to compare them in the compressed domain. Our method outperforms the state of the art both in search quality and query time on two large-scale video benchmarks for copy detection, TRECVID and CCWEB. Finally, we introduce a challenging dataset for event retrieval, EVVE, and report the performance on this dataset.

5 0.84066784 146 cvpr-2013-Enriching Texture Analysis with Semantic Data

Author: Tim Matthews, Mark S. Nixon, Mahesan Niranjan

Abstract: We argue for the importance of explicit semantic modelling in human-centred texture analysis tasks such as retrieval, annotation, synthesis, and zero-shot learning. To this end, low-level attributes are selected and used to define a semantic space for texture. 319 texture classes varying in illumination and rotation are positioned within this semantic space using a pairwise relative comparison procedure. Low-level visual features used by existing texture descriptors are then assessed in terms of their correspondence to the semantic space. Textures with strong presence ofattributes connoting randomness and complexity are shown to be poorly modelled by existing descriptors. In a retrieval experiment semantic descriptors are shown to outperform visual descriptors. Semantic modelling of texture is thus shown to provide considerable value in both feature selection and in analysis tasks.

6 0.80161256 364 cvpr-2013-Robust Object Co-detection

7 0.78653991 292 cvpr-2013-Multi-agent Event Detection: Localization and Role Assignment

8 0.77031571 85 cvpr-2013-Complex Event Detection via Multi-source Video Attributes

9 0.75765222 422 cvpr-2013-Tag Taxonomy Aware Dictionary Learning for Region Tagging

10 0.75169897 213 cvpr-2013-Image Tag Completion via Image-Specific and Tag-Specific Linear Sparse Reconstructions

11 0.74925196 412 cvpr-2013-Stochastic Deconvolution

12 0.74603367 377 cvpr-2013-Sample-Specific Late Fusion for Visual Category Recognition

13 0.74499094 28 cvpr-2013-A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching

14 0.74488783 150 cvpr-2013-Event Recognition in Videos by Learning from Heterogeneous Web Sources

15 0.74462944 356 cvpr-2013-Representing and Discovering Adversarial Team Behaviors Using Player Roles

16 0.74279034 432 cvpr-2013-Three-Dimensional Bilateral Symmetry Plane Estimation in the Phase Domain

17 0.74273098 164 cvpr-2013-Fast Convolutional Sparse Coding

18 0.74175727 7 cvpr-2013-A Divide-and-Conquer Method for Scalable Low-Rank Latent Matrix Pursuit

19 0.74088651 348 cvpr-2013-Recognizing Activities via Bag of Words for Attribute Dynamics

20 0.73681134 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection