cvpr cvpr2013 cvpr2013-233 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Raghuraman Gopalan
Abstract: While the notion of joint sparsity in understanding common and innovative components of a multi-receiver signal ensemble has been well studied, we investigate the utility of such joint sparse models in representing information contained in a single video signal. By decomposing the content of a video sequence into that observed by multiple spatially and/or temporally distributed receivers, we first recover a collection of common and innovative components pertaining to individual videos. We then present modeling strategies based on subspace-driven manifold metrics to characterize patterns among these components, across other videos in the system, to perform subsequent video analysis. We demonstrate the efficacy of our approach for activity classification and clustering by reporting competitive results on standard datasets such as HMDB, UCF-50, Olympic Sports and KTH.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract While the notion of joint sparsity in understanding common and innovative components of a multi-receiver signal ensemble has been well studied, we investigate the utility of such joint sparse models in representing information contained in a single video signal. [sent-4, score-1.205]
2 By decomposing the content of a video sequence into that observed by multiple spatially and/or temporally distributed receivers, we first recover a collection of common and innovative components pertaining to individual videos. [sent-5, score-0.713]
3 We then present modeling strategies based on subspace-driven manifold metrics to characterize patterns among these components, across other videos in the system, to perform subsequent video analysis. [sent-6, score-0.632]
4 We demonstrate the efficacy of our approach for activity classification and clustering by reporting competitive results on standard datasets such as HMDB, UCF-50, Olympic Sports and KTH. [sent-7, score-0.5]
5 Introduction Understanding information contained in videos is a well-studied problem that has found utility in applications such as activity recognition, object tracking, search and retrieval, and summarization, among others [39]. [sent-9, score-0.734]
6 As with other visual recognition tasks, there has been emphasis on both designing features to represent action patterns and learning strategies to derive pertinent information from features to perform robust activity classification. [sent-11, score-0.638]
7 We pursue such an intermediate representation of a video based on the principles of joint sparsity, and perform activity classification and clustering using modeling techniques on Grassmann manifolds. [sent-15, score-1.104]
8 The goal of this work is to study the utility of such joint sparse models in the context of a single video, by expressing the content of a video sequence as that of an ensemble observed by multiple receivers that are spatially and/or temporally distributed in that video. [sent-19, score-0.803]
9 Starting with an initial representation of such a signal ensemble, in the form of spatio-temporal bag-of-features [22], we first recover the intermediate joint sparse representation of the video in terms of common and innovative atoms pertaining to the ensemble. [sent-20, score-1.061]
10 Before getting into the details of our approach, we first overview recent efforts on unconstrained activity analysis. [sent-22, score-0.442]
11 Related Work Unconstrained activity analysis under multiple sources of variation such as camera motion, inter-object interaction, the associated background clutter, and changes in scene appearance due to illumination, viewpoint, etc. [sent-26, score-0.372]
12 is receiving recent attention in part due to the proliferation of video content in consumer, broadcast and surveillance domains. [sent-27, score-0.253]
13 [25] addressed a more challenging set of videos collected from YouTube by obtaining visual vocabularies from pruned static and motion features. [sent-31, score-0.264]
14 Learning a discriminative hierarchy of space-time features was pursued by [20], while [42] investigated dense trajectories corresponding to local features and [33] clustered such trajectories into action classes using graphical models. [sent-32, score-0.287]
15 The focus of this work is to analyze activities by obtaining a mid-level, intermediate video representation based on joint sparsity principles [2], which aim at understanding the information that a signal ensemble shares and the information over which it varies. [sent-37, score-0.858]
16 However, there has not been much work on understanding the utility of joint sparsity for video analysis, which is one of the main motivations behind this paper. [sent-39, score-0.577]
17 Problem Description Let V = {Vi}_{i=1}^{N} denote the set of N videos in the system belonging to m activity classes. [sent-43, score-0.587]
18 We divide each video Vi into a set of spatially and/or temporally distributed segments and obtain its joint sparse representation J(Vi), consisting of a set of common and innovative components, Ci and Ii respectively. [sent-45, score-0.57]
19 With the collection of such intermediate representations J = {J(Vi)}_{i=1}^{N} corresponding to all videos in the system, we perform subsequent video analysis by learning f(C¯, I¯), where C¯ = {Ci}_{i=1}^{N}, I¯ = {Ii}_{i=1}^{N} and f is modeled using strategies based on Grassmann manifolds. [sent-46, score-0.596]
20 We address both classification and clustering scenarios, by adapting f to account for activity labels li ∈ {1, 2, . . . , m} [sent-47, score-0.5]
21 accompanying Vi or otherwise, and perform experiments on several standard unconstrained activity datasets. [sent-50, score-0.442]
22 Joint Sparse Representation of a Video Given a video Vi, we first extract twenty-four segments {Vij} = Vis ∪ Vit ∪ Vist that are grouped into spatial Vis, temporal Vit and spatio-temporal Vist ensembles. [sent-54, score-0.31]
23 Step-1: Forming spatial, temporal and spatio-temporal ensembles (containing 4, 4, and 16 video segments Vij respectively) from a video Vi and utilizing joint sparse models to obtain an intermediate representation J(Vi) consisting of common Ci and innovative Ii atoms, using the recovery techniques of [2]. [sent-57, score-0.999]
24 Step-2: Modeling these joint sparse atoms across all N videos in the system, on the Grassmann manifold, to perform subsequent video analysis such as activity classification and clustering. [sent-58, score-1.23]
25 Vit consists of the four segments obtained by dividing the video into four equal intervals along the temporal dimension. [sent-60, score-0.31]
26 We then consider each spatial segment pertaining to each temporal interval to make up 16 spatio-temporal segments that represent Vist. [sent-61, score-0.228]
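As a concrete illustration of this segmentation step, the sketch below splits a video into the 4 spatial, 4 temporal and 16 spatio-temporal segments described above. The text does not specify how the spatial split is carried out, so a division into four equal quadrants is assumed here; the function name and array layout are likewise illustrative.

```python
import numpy as np

def video_segments(frames):
    """Split a video (a T x H x W or T x H x W x C array of frames) into the
    4 spatial, 4 temporal, and 16 spatio-temporal segments described above."""
    T, H, W = frames.shape[:3]
    t_parts = np.array_split(np.arange(T), 4)            # 4 equal temporal intervals
    s_parts = [(slice(0, H // 2), slice(0, W // 2)),      # 4 spatial quadrants (assumed)
               (slice(0, H // 2), slice(W // 2, W)),
               (slice(H // 2, H), slice(0, W // 2)),
               (slice(H // 2, H), slice(W // 2, W))]

    V_s = [frames[:, ys, xs] for ys, xs in s_parts]       # spatial ensemble Vis
    V_t = [frames[idx] for idx in t_parts]                # temporal ensemble Vit
    V_st = [frames[idx][:, ys, xs]                        # 16 spatio-temporal segments Vist
            for idx in t_parts for ys, xs in s_parts]
    return {"spatial": V_s, "temporal": V_t, "spatio-temporal": V_st}
```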
27 We now recover common and innovative components pertaining to each of these three ensembles using the principles of joint sparsity [2], to obtain the intermediate representation J(Vi) of a video Vi. [sent-64, score-1.012]
28 Three joint sparse models (JSM) have been investigated in [2] in the context of multireceiver communication settings by imposing assumptions on the sparsity of common and innovative components of the signal ensemble. [sent-65, score-0.754]
29 We first present a qualitative analogy of these models in terms of information contained in a video, to facilitate intuitions on the applicability of JSM to video analysis. [sent-66, score-0.248]
30 JSM-1 represents the case where both the common and innovative components are sparse. [sent-67, score-0.252]
31 While there could be sparse changes in the global background due to variations in the daylight intensity, the innovations will also be sparse since the number of people at work on a weekend is relatively smaller than during weekdays. [sent-69, score-0.331]
32 JSM-2 corresponds to cases where the common component is ideally zero whereas the innovations are sparse with similar supports. [sent-70, score-0.293]
33 JSM-3 deals with the case where the common component is not sparse while the innovations are sparse. [sent-73, score-0.293]
34 This could correspond to a video covering the daily routine of a mailman. [sent-74, score-0.242]
35 While the places he travels to deliver mail will have large variations (non-sparse common component), the action he performs at those places will be mostly similar with few innovations (interacting with a customer, delivering mail to the mailbox, or dropping it off near the front door, etc.). [sent-75, score-0.378]
36 In pursuit of such joint sparse information contained in the spatial and/or temporal ensembles of a video, we begin with the d-dimensional features extracted from the segments Vij and use the recovery algorithms presented in [2] to obtain the intermediate representation J(Vi). [sent-76, score-0.528]
37 We used all three JSMs since we are dealing with unconstrained videos and any of these models could be representative of the activities occurring in different video segments. [sent-77, score-0.583]
38 Hence, for JSM-1 and JSM-3 we obtain one common and four innovations each for the ensembles Vis and Vit, and one common and 16 innovations for the ensemble Vist, whereas for JSM-2 we obtain the same number of innovations as above but without any common component. [sent-78, score-0.736]
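The recovery algorithms themselves are taken from [2] and detailed in the supplementary material; the following is only a generic JSM-1-style sketch, not the authors' solver. It stacks the J segment descriptors of one ensemble, builds the block dictionary for the model x_j = c + i_j with sparse c and sparse i_j, and recovers all components with a single l1-regularized least-squares (Lasso) solve. The function name, the regularization weight, and the use of scikit-learn are assumptions made purely for illustration.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import Lasso

def jsm1_decompose(X, lam=0.01):
    """JSM-1-style decomposition of one ensemble.  X is a d x J matrix whose
    columns are segment descriptors, modeled as x_j = c + i_j with a sparse
    common part c and sparse innovations i_j.  Returns c (d,) and I (d x J)."""
    d, J = X.shape
    # Stack the observations: [x_1; ...; x_J] = A [c; i_1; ...; i_J]
    # with the block dictionary A = [1_J (kron) I_d | I_{J*d}].
    A = sp.hstack([sp.kron(np.ones((J, 1)), sp.identity(d)),
                   sp.identity(J * d)]).tocsc()
    y = X.flatten(order="F")                              # column-major stacking
    theta = Lasso(alpha=lam, fit_intercept=False, max_iter=5000).fit(A, y).coef_
    c, I = theta[:d], theta[d:].reshape((d, J), order="F")
    return c, I
```

Under JSM-2 the common block would be dropped and the shared support of the innovations enforced (for example with a group penalty), while under JSM-3 the l1 penalty on the common component would be removed since it is not assumed sparse; none of this is claimed to match the exact solvers of [2].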
39 The collection of 6 common components Ci and 72 innovative components Ii represents the joint sparse representation J(Vi) ∈ R^{d×78} of the video Vi. [sent-80, score-0.3]
40 (Footnote 2: the algorithmic details of the recovery algorithms from [2] are provided in the supplementary material.) [sent-81, score-0.448]
41 In the following we refer to Ci and Ii as the joint sparse atoms of a video Vi, and let C¯ and I¯ refer to the collections of such atoms obtained from all videos in the system. [sent-82, score-0.877]
42 Modeling Jointly Sparse Atoms We now perform activity analysis by modeling information contained in C¯ and I¯. [sent-85, score-0.458]
43 Towards this end we consider the subspace Si spanned by the columns of (orthonormalized) matrix J(Vi), which includes the set of all linear combinations of the joint sparse atoms from video Vi. [sent-87, score-0.594]
44 The problem of performing activity analysis then translates to that of ‘comparing’ the subspaces S = {Si}_{i=1}^{N} corresponding to all videos in the system, for which we pursue a geometrically meaningful Grassmann manifold interpretation of the space spanned by S. [sent-88, score-0.804]
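A minimal sketch of this step: each J(Vi) is orthonormalized so that its column span becomes a point on the Grassmann manifold, and two such points can be compared through their principal angles. The arc-length (geodesic) distance shown here is one standard choice of subspace metric and is only illustrative; the intrinsic and extrinsic tools used below rely on their own constructions.

```python
import numpy as np

def grassmann_point(J_Vi):
    """Orthonormalize the d x k matrix of joint sparse atoms; the span of the
    returned basis is a point on the Grassmann manifold G(k, d)."""
    Q, _ = np.linalg.qr(J_Vi)
    return Q

def geodesic_distance(Y1, Y2):
    """Arc-length subspace distance from the principal angles between
    span(Y1) and span(Y2), both given as orthonormal d x k bases."""
    s = np.linalg.svd(Y1.T @ Y2, compute_uv=False)
    angles = np.arccos(np.clip(s, -1.0, 1.0))
    return float(np.linalg.norm(angles))
```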
45 There have been several works addressing geometric properties [10] and related statistical techniques [7] on this manifold, and we now utilize some of these tools to build f for analyzing the point cloud S and to facilitate both activity classification and clustering. [sent-93, score-0.472]
46 1 Activity Classification We first consider the case where each video Vi is accompanied by an activity label li ∈ {1, 2, . . . , m}. [sent-96, score-0.572]
47 Let V˜ denote the test video whose activity label l˜ is to be inferred. [sent-100, score-0.572]
48 We pursue two statistical techniques for this supervised scenario, namely intrinsic (f1 ∈ f) and extrinsic (f2 ∈ f). [sent-101, score-0.245]
49 While the extrinsic methods embed nonlinear manifolds in higher dimensional Euclidean spaces and perform computations in those larger spaces, the intrinsic methods are completely restricted to the manifolds themselves and do not rely on any Euclidean embedding. [sent-102, score-0.296]
50 We first pursue the intrinsic method of [40] that learns parametric class conditional densities pertaining to the labeled point cloud S. [sent-103, score-0.3]
51 The class-conditional density Ci for the ith activity class is then completely represented by the tuple Ci = {Mi, μi, Σi}. [sent-109, score-0.372]
52 The same process is repeated for points in S corresponding to all m activity labels. [sent-110, score-0.372]
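The specifics of the intrinsic method of [40] are not reproduced in the text above; the sketch below only illustrates the manifold machinery such a method relies on, namely log/exp maps on the Grassmannian and a Karcher-style mean of the subspaces belonging to one class. Treating Mi as such a class mean, and omitting the tangent-space statistics (μi, Σi), are simplifying assumptions of this sketch.

```python
import numpy as np

def grassmann_log(Y, X):
    """Log map on the Grassmannian: tangent vector at the base point Y
    pointing towards X (both d x k orthonormal bases)."""
    M = Y.T @ X
    L = (X - Y @ M) @ np.linalg.inv(M)            # (I - Y Y^T) X (Y^T X)^{-1}
    U, S, Vt = np.linalg.svd(L, full_matrices=False)
    return U @ np.diag(np.arctan(S)) @ Vt

def grassmann_exp(Y, D):
    """Exp map on the Grassmannian: follow the geodesic from Y along the
    tangent vector D for unit time."""
    U, S, Vt = np.linalg.svd(D, full_matrices=False)
    return Y @ Vt.T @ np.diag(np.cos(S)) @ Vt + U @ np.diag(np.sin(S)) @ Vt

def karcher_mean(bases, iters=30, tol=1e-6):
    """Iterative Karcher (intrinsic) mean of a list of subspaces of one class."""
    M = bases[0]
    for _ in range(iters):
        D = np.mean([grassmann_log(M, Y) for Y in bases], axis=0)
        if np.linalg.norm(D) < tol:
            break
        M, _ = np.linalg.qr(grassmann_exp(M, D))  # re-orthonormalize the update
    return M
```

With the class means in hand, a much-simplified classifier could assign a test subspace to the nearest class mean under the geodesic distance of the earlier sketch; the actual f1 instead evaluates learned class-conditional densities as described above.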
53 Next we pursue an extrinsic method proposed by [14] that performs kernel discriminant analysis on the labeled point cloud S using a projection kernel kP, which is a positive definite kernel well-defined for points on Gn,d. [sent-112, score-0.232]
54 We then use kP to create kernel matrices from training and test data, and perform test video activity classification f2 in the standard discriminant analysis framework. [sent-116, score-0.629]
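A small sketch of the extrinsic route, assuming orthonormal bases as input: the projection kernel is k_P(Y1, Y2) = ||Y1^T Y2||_F^2, and the resulting Gram matrices can be fed to any kernel method. A precomputed-kernel SVM is substituted below for the kernel discriminant analysis of [14] used in the paper, purely to keep the example short; the function names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def projection_kernel(bases_a, bases_b):
    """Gram matrix of the projection kernel k_P(Y1, Y2) = ||Y1^T Y2||_F^2
    between two lists of orthonormal d x k bases."""
    return np.array([[np.linalg.norm(Y1.T @ Y2, "fro") ** 2 for Y2 in bases_b]
                     for Y1 in bases_a])

def classify_extrinsic(train_bases, train_labels, test_bases):
    """Classify test subspaces with a precomputed-kernel classifier; an SVM
    stands in here for the kernel discriminant analysis used in the paper."""
    K_train = projection_kernel(train_bases, train_bases)
    K_test = projection_kernel(test_bases, train_bases)
    clf = SVC(kernel="precomputed").fit(K_train, train_labels)
    return clf.predict(K_test)
```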
55 2 Activity Clustering We then consider cases where the videos V do not have activity labels associated with them. [sent-120, score-0.587]
56 In our experiments, we hide the activity labels of the videos V in the dataset and obtain their grouping f3. [sent-141, score-0.587]
57 We then evaluate the clustering accuracy with the widely used method of [45], which labels each of the resulting clusters with the majority activity class according to the original ground truth labels li, and finally measures the number of misclassifications in all clusters. [sent-142, score-0.443]
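This evaluation step is simple enough to spell out; a sketch assuming non-negative integer class labels is given below. Every cluster inherits the majority ground-truth class of its members, and the accuracy is the fraction of videos whose inherited label matches their true one.

```python
import numpy as np

def clustering_accuracy(true_labels, cluster_ids):
    """Label every cluster with the majority ground-truth class of its members
    and return the fraction of samples consistent with that label."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    correct = 0
    for c in np.unique(cluster_ids):
        members = true_labels[cluster_ids == c]
        correct += np.bincount(members).max()     # size of the majority class
    return correct / len(true_labels)
```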
58 Experiments We now evaluate our approach on four standard activity analysis datasets. [sent-144, score-0.372]
59 We first experiment with the UCF-50 dataset [1] that consists of real-world videos taken from YouTube. [sent-145, score-0.215]
60 We then consider the Human Motion DataBase (HMDB) [21], which is argued to be more challenging than UCF-50 since it contains videos from multiple sources such as YouTube, motion pictures, etc. [sent-146, score-0.305]
61 We then focus on the Olympic Sports dataset [28] that contains several sports clips which makes it interesting to analyze the performance of the method within a specific theme of activities. [sent-147, score-0.252]
62 Figure 3 provides a sample of activity classes from these datasets. [sent-149, score-0.411]
63 Feature Extraction Our basic feature representation of the video segments Vij, ∀j = 1 to 24, ∀i = 1 to N, is a d-dimensional histogram pertaining to spatio-temporal bag-of-features obtained using the method of [22], which has shown good empirical performance on many datasets. [sent-153, score-0.428]
64 We then obtain the intermediate joint sparse representation J(Vi) ∈ R^{4000×78} for all videos in the system, with which the subsequent modeling (f1, f2, f3) is done on G_{78,4000}. [sent-156, score-0.629]
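Putting the pieces together, a hedged sketch of assembling J(Vi) is shown below: the d x 24 matrix of segment histograms is split back into its three ensembles, each recovery routine (one per JSM variant) is applied to each ensemble, and the resulting common and innovative atoms are stacked column-wise. The decompose_fns argument, the hypothetical jsm1_decompose from the earlier sketch standing in for the recovery algorithms of [2], and the column ordering (4 spatial, 4 temporal, 16 spatio-temporal) are assumptions of this sketch; with all three JSM variants the result has the 6 + 72 = 78 columns mentioned above.

```python
import numpy as np

def video_representation(segment_histograms, decompose_fns):
    """Build J(Vi) from a d x 24 matrix whose columns are the segment
    histograms, ordered as 4 spatial, 4 temporal, 16 spatio-temporal.
    decompose_fns is a list of recovery routines (one per JSM variant), each
    returning (common, innovations) with common possibly None (JSM-2)."""
    ensembles = [segment_histograms[:, :4],    # spatial ensemble Vis
                 segment_histograms[:, 4:8],   # temporal ensemble Vit
                 segment_histograms[:, 8:24]]  # spatio-temporal ensemble Vist
    atoms = []
    for decompose in decompose_fns:
        for X in ensembles:
            c, innovations = decompose(X)
            if c is not None:                  # JSM-2 yields no common atom
                atoms.append(c[:, None])
            atoms.append(innovations)
    return np.hstack(atoms)                    # d x 78 when all three JSMs are used
```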
65 UCF-50 Dataset A recent real-world dataset is UCF-50 [1], which has 50 activity classes with at least 100 videos per class. [sent-159, score-0.757]
66 These videos, taken from YouTube, cover a range of activities from general sports to daily-life exercises, and there are 6618 videos in total. [sent-160, score-0.68]
67 Each activity class is divided into 25 homogeneous groups with at least 4 videos per activity in each of these groups. [sent-161, score-1.09]
68 The videos in the same group may share some common features, such as the same person, similar background or similar viewpoint. [sent-162, score-0.264]
69 We evaluate our intrinsic f1 and extrinsic f2 modeling strategies using the Leave-one-Group-out (LoGo) cross-validation scheme suggested on the dataset website, and report the average classification accuracy across all activity classes in Table 1. [sent-163, score-0.721]
70 For the clustering experiment, we report clustering results in two settings: (i) Case-A where only the test videos used in the classification experiment are clustered, and (ii) Case-B where both training and test videos in the classification experiment are clustered. [sent-165, score-0.686]
71 We ran the clustering algorithm f3 ten times, and the average clustering accuracy (for LoGo setting, but without the activity labels) is as follows, Case-A: 53. [sent-166, score-0.514]
72 HMDB Another recent dataset is the Human Motion DataBase (HMDB) [21], which has 51 distinct activity classes with at least 101 videos per class. [sent-176, score-0.757]
73 The 6766 videos in total were extracted from a wide range of sources, including YouTube and motion pictures, and the dataset has 10 activity classes that overlap with UCF-50. [sent-177, score-0.675]
74 Each video was validated by at least two human observers to ensure consistency. [sent-178, score-0.331]
75 The evaluation protocol for this dataset consists of three distinct training and testing splits, each containing 70 training and 30 testing videos per activity class, and the splits are selected in such a way as to display a representative mix of video quality and camera motion attributes. [sent-179, score-0.836]
76 The dataset contains both original videos and their stabilized version, and we report classification results on both of these sets in Table 2. [sent-180, score-0.345]
77 Sample frames corresponding to different activity classes in datasets - UCF-50 (first row), HMDB (second), KTH (third), and Olympic Sports (last). [sent-187, score-0.411]
78 These datasets have a good mix of realistic activities from YouTube, motion pictures, sports programs etc. [sent-188, score-0.299]
79 There are 16 sports events: high-jump, long-jump, triple-jump, pole-vault, basketball lay-up, bowling, tennis-serve, platform, discus, hammer, javelin, shot-put, springboard, snatch, clean-jerk and vault, represented by a total of 783 video sequences. [sent-192, score-0.352]
80 The mean average precision over all activity classes is reported in Table 3. [sent-194, score-0.411]
81 Average classification accuracy over all action classes is given in Table 4 and the clustering results for this dataset are Case-A: 71. [sent-209, score-0.339]
82 Discussion We now empirically analyze the utility of modeling using joint sparsity principles. [sent-219, score-0.372]
83 The top three videos classified into each of these classes are displayed here, with a representative frame corresponding to those videos, where the first row pertains to results using joint sparse modeling and the second row to those of PCA modeling. [sent-235, score-0.5]
84 Given the segment features corresponding to a video Vi, instead of extracting jointly sparse atoms J(Vi), we perform principal component analysis (PCA) to obtain an orthonormal matrix, using which we perform subsequent computations for activity classification and clustering (Section 2. [sent-237, score-1.051]
85 We see that the joint sparse modeling yields better results, and some illustrations on activity (mis-)classification are shown in Figure 4. [sent-247, score-0.618]
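For reference, a sketch of this PCA baseline under the same assumptions: the top-k left singular vectors of the mean-centered d x 24 segment-feature matrix give an orthonormal basis that replaces J(Vi) in the Grassmann pipeline. The choice of k and the mean-centering are assumptions of this sketch; the text does not state them.

```python
import numpy as np

def pca_basis(segment_features, k=20):
    """PCA baseline: top-k principal directions of the mean-centered d x 24
    segment-feature matrix, used in place of the joint sparse atoms J(Vi).
    The default k is arbitrary and only illustrative."""
    X = segment_features - segment_features.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]        # orthonormal columns, hence already a Grassmann point
```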
86 Given the amount of variation in activity patterns across the datasets, these results demonstrate the generalizability of our joint sparsity based manifold modeling. [sent-250, score-0.741]
87 For a 100-frame test video, it takes about 10 seconds to obtain its intermediate joint sparse representation. Then, to infer its activity label, it takes around 8 seconds using the intrinsic method f1 and 3 seconds using the extrinsic algorithm f2. [sent-254, score-1.059]
88 Conclusion Through this work we studied the utility of joint sparsity models for representing common and diverse content within a video and a subsequent manifold interpretation for performing video analysis under both supervised and unsupervised settings. [sent-261, score-0.962]
89 We demonstrated the generalizability of the approach across several video activity datasets that portrayed varying degrees of event complexity, without resorting to feature tuning, and achieved competitive results on many counts. [sent-262, score-0.73]
90 It is an interesting direction for future work to explore integrating feature selection mechanisms with our model and to study their utility for more general video understanding problems. [sent-263, score-0.38]
91 Visual event recognition in videos by learning from web data. [sent-321, score-0.355]
92 In Proceedings of the 25th international conference on Machine learning, pages 376–383. [sent-355, score-0.309]
93 Motion interchange patterns for action recognition in unconstrained videos. [sent-389, score-0.321]
94 HMDB: a large video database for human motion recognition. [sent-403, score-0.249]
95 Local expert forest of score fusion for video event classification. [sent-441, score-0.339]
96 Modeling temporal structure of decomposable motion segments for activity classification. [sent-455, score-0.531]
97 Action bank: A high-level representation of activity in video. [sent-511, score-0.412]
98 Statistical computations on grassmann and stiefel manifolds for image and video-based recognition. [sent-557, score-0.239]
99 Efficient orthogonal matching pursuit using sparse random projections for scene and video classification. [sent-563, score-0.287]
100 Dense trajectories and motion boundary descriptors for action recognition. [sent-579, score-0.259]
wordName wordTfidf (topN-words)
[('activity', 0.372), ('videos', 0.215), ('video', 0.2), ('action', 0.172), ('conference', 0.17), ('innovations', 0.157), ('hmdb', 0.157), ('innovative', 0.155), ('grassmann', 0.154), ('sports', 0.152), ('vi', 0.145), ('olympic', 0.131), ('atleast', 0.131), ('atoms', 0.127), ('joint', 0.121), ('pertaining', 0.118), ('sparsity', 0.114), ('ieee', 0.113), ('jsm', 0.106), ('extrinsic', 0.106), ('utility', 0.099), ('event', 0.099), ('activities', 0.098), ('vij', 0.097), ('pages', 0.092), ('sparse', 0.087), ('youtube', 0.086), ('pursue', 0.083), ('multireceiver', 0.08), ('receivers', 0.08), ('vist', 0.08), ('vit', 0.079), ('intermediate', 0.077), ('actions', 0.076), ('manifold', 0.075), ('ensemble', 0.073), ('stabilized', 0.073), ('clustering', 0.071), ('vis', 0.07), ('segments', 0.07), ('unconstrained', 0.07), ('pattern', 0.065), ('theme', 0.062), ('outdoors', 0.062), ('logo', 0.062), ('spanned', 0.059), ('generalizability', 0.059), ('classification', 0.057), ('intrinsic', 0.056), ('ci', 0.055), ('sensing', 0.055), ('content', 0.053), ('onlookers', 0.053), ('orthonormalized', 0.053), ('communication', 0.053), ('players', 0.053), ('strategies', 0.053), ('subsequent', 0.051), ('orthonormal', 0.05), ('manifolds', 0.049), ('motion', 0.049), ('kth', 0.049), ('common', 0.049), ('contained', 0.048), ('components', 0.048), ('distributed', 0.048), ('athletes', 0.047), ('hyperspectral', 0.047), ('signal', 0.047), ('international', 0.047), ('xu', 0.046), ('principles', 0.045), ('ensembles', 0.045), ('kp', 0.045), ('ii', 0.044), ('compressive', 0.044), ('understanding', 0.043), ('cloud', 0.043), ('temporally', 0.042), ('daily', 0.042), ('firstperson', 0.041), ('pictures', 0.041), ('recognition', 0.041), ('duan', 0.04), ('representation', 0.04), ('temporal', 0.04), ('pooling', 0.04), ('expert', 0.04), ('football', 0.039), ('karcher', 0.039), ('crowd', 0.039), ('classes', 0.039), ('trajectories', 0.038), ('modeling', 0.038), ('interesting', 0.038), ('grassmannian', 0.038), ('interchange', 0.038), ('turaga', 0.036), ('jhuang', 0.036), ('computations', 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
Author: Raghuraman Gopalan
Abstract: While the notion of joint sparsity in understanding common and innovative components of a multi-receiver signal ensemble has been well studied, we investigate the utility of such joint sparse models in representing information contained in a single video signal. By decomposing the content of a video sequence into that observed by multiple spatially and/or temporally distributed receivers, we first recover a collection of common and innovative components pertaining to individual videos. We then present modeling strategies based on subspace-driven manifold metrics to characterize patterns among these components, across other videos in the system, to perform subsequent video analysis. We demonstrate the efficacy of our approach for activity classification and clustering by reporting competitive results on standard datasets such as, HMDB, UCF-50, Olympic Sports and KTH.
2 0.35425338 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
Author: Michael S. Ryoo, Larry Matthies
Abstract: This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint. The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video inputs. These include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects to the observer’, whose videos involve a large amount of camera ego-motion caused by physical interactions. The paper investigates multichannel kernels to integrate global and local motion information, and presents a new activity learning/recognition methodology that explicitly considers temporal structures displayed in first-person activity videos. In our experiments, we not only show classification results with segmented videos, but also confirm that our new approach is able to detect activities from continuous videos reliably.
3 0.31923893 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
Author: Yu Cao, Daniel Barrett, Andrei Barbu, Siddharth Narayanaswamy, Haonan Yu, Aaron Michaux, Yuewei Lin, Sven Dickinson, Jeffrey Mark Siskind, Song Wang
Abstract: Recognizing human activities in partially observed videos is a challengingproblem and has many practical applications. When the unobserved subsequence is at the end of the video, the problem is reduced to activity prediction from unfinished activity streaming, which has been studied by many researchers. However, in the general case, an unobserved subsequence may occur at any time by yielding a temporal gap in the video. In this paper, we propose a new method that can recognize human activities from partially observed videos in the general case. Specifically, we formulate the problem into a probabilistic framework: 1) dividing each activity into multiple ordered temporal segments, 2) using spatiotemporal features of the training video samples in each segment as bases and applying sparse coding (SC) to derive the activity likelihood of the test video sample at each segment, and 3) finally combining the likelihood at each segment to achieve a global posterior for the activities. We further extend the proposed method to include more bases that correspond to a mixture of segments with different temporal lengths (MSSC), which can better rep- resent the activities with large intra-class variations. We evaluate the proposed methods (SC and MSSC) on various real videos. We also evaluate the proposed methods on two special cases: 1) activity prediction where the unobserved subsequence is at the end of the video, and 2) human activity recognition on fully observed videos. Experimental results show that the proposed methods outperform existing state-of-the-art comparison methods.
4 0.3142944 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
Author: Yingying Zhu, Nandita M. Nayak, Amit K. Roy-Chowdhury
Abstract: In thispaper, rather than modeling activities in videos individually, we propose a hierarchical framework that jointly models and recognizes related activities using motion and various context features. This is motivated from the observations that the activities related in space and time rarely occur independently and can serve as the context for each other. Given a video, action segments are automatically detected using motion segmentation based on a nonlinear dynamical model. We aim to merge these segments into activities of interest and generate optimum labels for the activities. Towards this goal, we utilize a structural model in a max-margin framework that jointly models the underlying activities which are related in space and time. The model explicitly learns the duration, motion and context patterns for each activity class, as well as the spatio-temporal relationships for groups of them. The learned model is then used to optimally label the activities in the testing videos using a greedy search method. We show promising results on the VIRAT Ground Dataset demonstrating the benefit of joint modeling and recognizing activities in a wide-area scene.
5 0.24297768 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
Author: Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis
Abstract: How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatiotemporal patch in the video. What defines these spatiotemporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos and align the videos for label transfer techniques. Furthermore, these patches can be used as a discriminative vocabulary for action classification where they demonstrate stateof-the-art performance on UCF50 and Olympics datasets.
6 0.23255464 287 cvpr-2013-Modeling Actions through State Changes
7 0.23074025 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
8 0.22617628 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
9 0.22052228 237 cvpr-2013-Kernel Learning for Extrinsic Classification of Manifold Features
10 0.19824548 40 cvpr-2013-An Approach to Pose-Based Action Recognition
11 0.19491798 49 cvpr-2013-Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition
12 0.19069777 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition
13 0.18011977 348 cvpr-2013-Recognizing Activities via Bag of Words for Attribute Dynamics
14 0.16765039 187 cvpr-2013-Geometric Context from Videos
15 0.16716838 215 cvpr-2013-Improved Image Set Classification via Joint Sparse Approximated Nearest Subspaces
16 0.166879 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos
17 0.16662373 444 cvpr-2013-Unconstrained Monocular 3D Human Pose Estimation by Action Detection and Cross-Modality Regression Forest
18 0.16086416 85 cvpr-2013-Complex Event Detection via Multi-source Video Attributes
19 0.15901574 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors
20 0.15876712 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
topicId topicWeight
[(0, 0.304), (1, -0.125), (2, -0.089), (3, -0.162), (4, -0.304), (5, -0.011), (6, -0.161), (7, -0.09), (8, -0.098), (9, 0.009), (10, 0.143), (11, -0.133), (12, -0.009), (13, -0.078), (14, -0.072), (15, 0.05), (16, -0.02), (17, 0.088), (18, -0.095), (19, -0.121), (20, -0.057), (21, 0.102), (22, 0.048), (23, -0.073), (24, -0.06), (25, -0.026), (26, -0.013), (27, -0.013), (28, 0.023), (29, 0.086), (30, 0.028), (31, 0.026), (32, 0.016), (33, -0.047), (34, 0.035), (35, 0.027), (36, -0.02), (37, -0.074), (38, 0.015), (39, 0.075), (40, -0.044), (41, 0.026), (42, 0.044), (43, -0.059), (44, -0.052), (45, -0.003), (46, 0.063), (47, 0.027), (48, -0.021), (49, -0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.96278274 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
Author: Raghuraman Gopalan
Abstract: While the notion of joint sparsity in understanding common and innovative components of a multi-receiver signal ensemble has been well studied, we investigate the utility of such joint sparse models in representing information contained in a single video signal. By decomposing the content of a video sequence into that observed by multiple spatially and/or temporally distributed receivers, we first recover a collection of common and innovative components pertaining to individual videos. We then present modeling strategies based on subspace-driven manifold metrics to characterize patterns among these components, across other videos in the system, to perform subsequent video analysis. We demonstrate the efficacy of our approach for activity classification and clustering by reporting competitive results on standard datasets such as, HMDB, UCF-50, Olympic Sports and KTH.
2 0.90591019 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
Author: Yu Cao, Daniel Barrett, Andrei Barbu, Siddharth Narayanaswamy, Haonan Yu, Aaron Michaux, Yuewei Lin, Sven Dickinson, Jeffrey Mark Siskind, Song Wang
Abstract: Recognizing human activities in partially observed videos is a challengingproblem and has many practical applications. When the unobserved subsequence is at the end of the video, the problem is reduced to activity prediction from unfinished activity streaming, which has been studied by many researchers. However, in the general case, an unobserved subsequence may occur at any time by yielding a temporal gap in the video. In this paper, we propose a new method that can recognize human activities from partially observed videos in the general case. Specifically, we formulate the problem into a probabilistic framework: 1) dividing each activity into multiple ordered temporal segments, 2) using spatiotemporal features of the training video samples in each segment as bases and applying sparse coding (SC) to derive the activity likelihood of the test video sample at each segment, and 3) finally combining the likelihood at each segment to achieve a global posterior for the activities. We further extend the proposed method to include more bases that correspond to a mixture of segments with different temporal lengths (MSSC), which can better rep- resent the activities with large intra-class variations. We evaluate the proposed methods (SC and MSSC) on various real videos. We also evaluate the proposed methods on two special cases: 1) activity prediction where the unobserved subsequence is at the end of the video, and 2) human activity recognition on fully observed videos. Experimental results show that the proposed methods outperform existing state-of-the-art comparison methods.
3 0.8512733 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
Author: Michael S. Ryoo, Larry Matthies
Abstract: This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint. The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video inputs. These include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects to the observer’, whose videos involve a large amount of camera ego-motion caused by physical interactions. The paper investigates multichannel kernels to integrate global and local motion information, and presents a new activity learning/recognition methodology that explicitly considers temporal structures displayed in first-person activity videos. In our experiments, we not only show classification results with segmented videos, but also confirm that our new approach is able to detect activities from continuous videos reliably.
4 0.8446238 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
Author: Yingying Zhu, Nandita M. Nayak, Amit K. Roy-Chowdhury
Abstract: In thispaper, rather than modeling activities in videos individually, we propose a hierarchical framework that jointly models and recognizes related activities using motion and various context features. This is motivated from the observations that the activities related in space and time rarely occur independently and can serve as the context for each other. Given a video, action segments are automatically detected using motion segmentation based on a nonlinear dynamical model. We aim to merge these segments into activities of interest and generate optimum labels for the activities. Towards this goal, we utilize a structural model in a max-margin framework that jointly models the underlying activities which are related in space and time. The model explicitly learns the duration, motion and context patterns for each activity class, as well as the spatio-temporal relationships for groups of them. The learned model is then used to optimally label the activities in the testing videos using a greedy search method. We show promising results on the VIRAT Ground Dataset demonstrating the benefit of joint modeling and recognizing activities in a wide-area scene.
Author: Vinay Bettadapura, Grant Schindler, Thomas Ploetz, Irfan Essa
Abstract: We present data-driven techniques to augment Bag of Words (BoW) models, which allow for more robust modeling and recognition of complex long-term activities, especially when the structure and topology of the activities are not known a priori. Our approach specifically addresses the limitations of standard BoW approaches, which fail to represent the underlying temporal and causal information that is inherent in activity streams. In addition, we also propose the use ofrandomly sampled regular expressions to discover and encode patterns in activities. We demonstrate the effectiveness of our approach in experimental evaluations where we successfully recognize activities and detect anomalies in four complex datasets.
6 0.71207982 348 cvpr-2013-Recognizing Activities via Bag of Words for Attribute Dynamics
7 0.70684105 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos
8 0.65900511 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition
9 0.65552419 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
10 0.63863736 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors
11 0.6311602 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
12 0.62495923 413 cvpr-2013-Story-Driven Summarization for Egocentric Video
13 0.61402494 3 cvpr-2013-3D R Transform on Spatio-temporal Interest Points for Action Recognition
14 0.61097479 287 cvpr-2013-Modeling Actions through State Changes
15 0.61086357 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
16 0.60899568 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model
17 0.59855217 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding
18 0.58752441 103 cvpr-2013-Decoding Children's Social Behavior
19 0.58081377 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization
20 0.57795644 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
topicId topicWeight
[(10, 0.132), (16, 0.021), (19, 0.015), (26, 0.048), (28, 0.015), (33, 0.297), (38, 0.175), (67, 0.085), (69, 0.069), (77, 0.01), (87, 0.049)]
simIndex simValue paperId paperTitle
1 0.94754153 350 cvpr-2013-Reconstructing Loopy Curvilinear Structures Using Integer Programming
Author: Engin Türetken, Fethallah Benmansour, Bjoern Andres, Hanspeter Pfister, Pascal Fua
Abstract: We propose a novel approach to automated delineation of linear structures that form complex and potentially loopy networks. This is in contrast to earlier approaches that usually assume a tree topology for the networks. At the heart of our method is an Integer Programming formulation that allows us to find the global optimum of an objective function designed to allow cycles but penalize spurious junctions and early terminations. We demonstrate that it outperforms state-of-the-art techniques on a wide range of datasets.
2 0.91888958 263 cvpr-2013-Learning the Change for Automatic Image Cropping
Author: Jianzhou Yan, Stephen Lin, Sing Bing Kang, Xiaoou Tang
Abstract: Image cropping is a common operation used to improve the visual quality of photographs. In this paper, we present an automatic cropping technique that accounts for the two primary considerations of people when they crop: removal of distracting content, and enhancement of overall composition. Our approach utilizes a large training set consisting of photos before and after cropping by expert photographers to learn how to evaluate these two factors in a crop. In contrast to the many methods that exist for general assessment of image quality, ours specifically examines differences between the original and cropped photo in solving for the crop parameters. To this end, several novel image features are proposed to model the changes in image content and composition when a crop is applied. Our experiments demonstrate improvements of our method over recent cropping algorithms on a broad range of images.
Author: Alessandro Perina, Nebojsa Jojic
Abstract: Recently, the Counting Grid (CG) model [5] was developed to represent each input image as a point in a large grid of feature counts. This latent point is a corner of a window of grid points which are all uniformly combined to match the (normalized) feature counts in the image. Being a bag of word model with spatial layout in the latent space, the CG model has superior handling of field of view changes in comparison to other bag of word models, but with the price of being essentially a mixture, mapping each scene to a single window in the grid. In this paper we introduce a family of componential models, dubbed the Componential Counting Grid, whose members represent each input image by multiple latent locations, rather than just one. In this way, we make a substantially more flexible admixture model which captures layers or parts of images and maps them to separate windows in a Counting Grid. We tested the models on scene and place classification where their com- ponential nature helped to extract objects, to capture parallax effects, thus better fitting the data and outperforming Counting Grids and Latent Dirichlet Allocation, especially on sequences taken with wearable cameras.
same-paper 4 0.89631635 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
Author: Raghuraman Gopalan
Abstract: While the notion of joint sparsity in understanding common and innovative components of a multi-receiver signal ensemble has been well studied, we investigate the utility of such joint sparse models in representing information contained in a single video signal. By decomposing the content of a video sequence into that observed by multiple spatially and/or temporally distributed receivers, we first recover a collection of common and innovative components pertaining to individual videos. We then present modeling strategies based on subspace-driven manifold metrics to characterize patterns among these components, across other videos in the system, to perform subsequent video analysis. We demonstrate the efficacy of our approach for activity classification and clustering by reporting competitive results on standard datasets such as, HMDB, UCF-50, Olympic Sports and KTH.
5 0.89475566 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction
Author: Dennis Park, C. Lawrence Zitnick, Deva Ramanan, Piotr Dollár
Abstract: We describe novel but simple motion features for the problem of detecting objects in video sequences. Previous approaches either compute optical flow or temporal differences on video frame pairs with various assumptions about stabilization. We describe a combined approach that uses coarse-scale flow and fine-scale temporal difference features. Our approach performs weak motion stabilization by factoring out camera motion and coarse object motion while preserving nonrigid motions that serve as useful cues for recognition. We show results for pedestrian detection and human pose estimation in video sequences, achieving state-of-the-art results in both. In particular, given a fixed detection rate our method achieves a five-fold reduction in false positives over prior art on the Caltech Pedestrian benchmark. Finally, we perform extensive diagnostic experiments to reveal what aspects of our system are crucial for good performance. Proper stabilization, long time-scale features, and proper normalization are all critical.
6 0.89429408 171 cvpr-2013-Fast Trust Region for Segmentation
7 0.8861236 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
8 0.88409311 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
9 0.88401383 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection
10 0.88370186 325 cvpr-2013-Part Discovery from Partial Correspondence
11 0.88356459 414 cvpr-2013-Structure Preserving Object Tracking
12 0.88330424 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection
13 0.88270098 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
14 0.88074696 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
15 0.88045937 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
16 0.88045007 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
17 0.87950486 256 cvpr-2013-Learning Structured Hough Voting for Joint Object Detection and Occlusion Reasoning
18 0.87929791 60 cvpr-2013-Beyond Physical Connections: Tree Models in Human Pose Estimation
19 0.87928897 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
20 0.87875456 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking