cvpr cvpr2013 cvpr2013-94 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yingying Zhu, Nandita M. Nayak, Amit K. Roy-Chowdhury
Abstract: In this paper, rather than modeling activities in videos individually, we propose a hierarchical framework that jointly models and recognizes related activities using motion and various context features. This is motivated from the observations that the activities related in space and time rarely occur independently and can serve as the context for each other. Given a video, action segments are automatically detected using motion segmentation based on a nonlinear dynamical model. We aim to merge these segments into activities of interest and generate optimum labels for the activities. Towards this goal, we utilize a structural model in a max-margin framework that jointly models the underlying activities which are related in space and time. The model explicitly learns the duration, motion and context patterns for each activity class, as well as the spatio-temporal relationships for groups of them. The learned model is then used to optimally label the activities in the testing videos using a greedy search method. We show promising results on the VIRAT Ground Dataset demonstrating the benefit of joint modeling and recognizing activities in a wide-area scene.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper, rather than modeling activities in videos individually, we propose a hierarchical framework that jointly models and recognizes related activities using motion and various context features. [sent-5, score-1.263]
2 This is motivated from the observations that the activities related in space and time rarely occur independently and can serve as the context for each other. [sent-6, score-0.606]
3 Given a video, action segments are automatically detected using motion segmentation based on a nonlinear dynamical model. [sent-7, score-0.458]
4 We aim to merge these segments into activities of interest and generate optimum labels for the activities. [sent-8, score-0.642]
5 Towards this goal, we utilize a structural model in a max-margin framework that jointly models the underlying activities which are related in space and time. [sent-9, score-0.52]
6 The model explicitly learns the duration, motion and context patterns for each activity class, as well as the spatio-temporal relationships for groups of them. [sent-10, score-0.848]
7 The learned model is then used to optimally label the activities in the testing videos using a greedy search method. [sent-11, score-0.532]
8 We show promising results on the VIRAT Ground Dataset demonstrating the benefit of joint modeling and recognizing activities in a wide-area scene. [sent-12, score-0.46]
9 The spatial layout of activities and their sequential patterns provide useful cues for their understanding. [sent-17, score-0.438]
10 Consider the activities that happen in the same spatio-temporal region in Fig. [sent-18, score-0.5]
11 Figure 1: An example that demonstrates the importance of context in activity recognition. [sent-29, score-0.626]
12 (bounded by red circle) is doing, and the relative position of the person of interest and the car says that activities (a) and (c) are very different from activity (b). [sent-31, score-1.057]
13 However, it is hard to tell what the person is doing in (a) and (c) - getting out of the vehicle or getting into the vehicle. [sent-32, score-0.503]
14 If we knew that these activities occurred around the same vehicle along time, it would be immediately clear that in (a) the person is getting out of the vehicle and in (c) the person is getting into the vehicle. [sent-33, score-1.23]
15 This example shows the importance of spatial and temporal relationships for activity recognition. [sent-34, score-0.665]
16 Many existing works on activity recognition assume that the temporal locations of the activities are known [1, 19]. [sent-35, score-1.015]
17 We focus on the problem of detecting activities of interest in continuous videos without prior information about the locations of activities. [sent-36, score-0.527]
18 Finally, we learn a structural model that merges these segments into activities and generates the optimum activity labels for them. [sent-39, score-1.184]
19 The dashed lines indicate that the connections between activity labels and the observations of action segments are not fixed, i.e., [sent-46, score-0.776]
20 the structure of connections is different for different activity sets. [sent-48, score-0.482]
21 To achieve this goal, we build upon existing well-known feature descriptors and spatio-temporal context representations that, when combined together, provide a powerful framework to model activities in continuous videos. [sent-51, score-0.668]
22 Action segments that are related to each other in space and time are grouped together into activity sets. [sent-52, score-0.583]
23 For each set, the underlying activities are jointly modeled and recognized by a structural model with the activity durations as the auxiliary variables. [sent-53, score-1.123]
24 For the testing, the action segments, which are considered as the basic elements of activities, are merged together and assigned activity labels by inference on the structural model. [sent-54, score-0.803]
25 (i) We combine low-level motion segmentation with high-level activity modeling under one framework. [sent-58, score-0.666]
26 (ii) We jointly model and recognize the activities in video using a structural model, which integrates activity durations, motion features and various context features within and between activities into a unified model. [sent-59, score-1.77]
27 (iii) We formulate the inference problem as a greedy strategy that iteratively searches for the optimum activity labels on the learned structural model. [sent-60, score-0.763]
28 Related Work. Many existing works exploring context focus on interactions among features, objects and actions [26, 23, 12, 2, 1], environmental conditions such as spatial locations of certain activities in the scene [16], and temporal relationships of activities [24, 17]. [sent-63, score-1.251]
29 Space-time constraints across activities in a wide-area scene are rarely considered. [sent-64, score-0.462]
30 The work in [24] models a complex activity by a variable-duration hidden Markov model on equal-length temporal segments. [sent-65, score-0.577]
31 It decomposes a complex activity into sequential actions, which are the context of each other. [sent-66, score-0.626]
32 The AND-OR graph [11, 21] is a powerful tool for activity representation. [sent-68, score-0.482]
33 However, the learning and inference processes of AND-OR graphs become more complex as the graph grows large and more and more activities are learned. [sent-69, score-0.484]
34 This method labels each image with a group activity label. [sent-71, score-0.51]
35 Also, these methods aim to recognize group activities and are not suitable in our scenario where activities cannot be considered as the parts of larger activities. [sent-73, score-0.876]
36 However, there was no activity segmentation or modeling of the activity duration; only the regions with activity were detected. [sent-75, score-1.496]
37 We propose an alternative method that explicitly models the durations, motion, intra-activity context and the spatio-temporal relationships between the activities, and uses them in the inference stage for recognition. [sent-76, score-0.716]
38 The novelty of this paper lies in developing a structural model for representing related activities in a video, and to demonstrate how to perform efficient inference on this model. [sent-80, score-0.566]
39 Each motion region is segmented into action segments using the motion segmentation based on the method in [5] with STIP histograms as the model observation. [sent-90, score-0.594]
40 Motion and Context Feature Descriptors. Assume there are M + 1 classes of activities in the scene, including a background class with label 0 and M classes of interest with labels 1, · · · , M. [sent-95, score-0.513]
41 An activity is a 3D region consisting of one or multiple consecutive action segments. [sent-100, score-0.679]
42 Intra-activity motion feature descriptor. Features of an activity that encode the motion information extracted from low-level motion features, such as STIP features, are defined as intra-activity motion features. [sent-105, score-1.041]
43 si,0, · · · , si,M are the scores of classifying the action segment i as activity classes 0, 1, · · · , M. [sent-109, score-0.647]
44 We define a set G of attributes related to the scene and the involved objects in activities of interest. [sent-124, score-0.469]
45 Since we work on the VIRAT dataset with individual person activities and person-object interactions, we use the following (NG = 6) subsets of attributes. Figure 3: Example subsets of context attributes used for the development of intra-activity context features. [sent-126, score-0.899]
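To illustrate how such an intra-activity context descriptor could be assembled from the per-segment classification scores and the attribute subsets described above, here is a minimal Python sketch; the concatenation of averaged scores with binary attribute flags, and every name in it, are assumptions rather than the paper's exact construction.

    import numpy as np

    def intra_activity_context(segment_scores, attribute_flags):
        # segment_scores: (num_segments, M + 1) classification scores of the
        # action segments forming the activity; attribute_flags: NG binary
        # indicators for the scene/object attributes (e.g. "near a vehicle").
        # Averaging the scores and concatenating the flags is one illustrative
        # choice, not the paper's exact descriptor.
        scores = np.asarray(segment_scores, dtype=float).mean(axis=0)
        attrs = np.asarray(attribute_flags, dtype=float)
        return np.concatenate([scores, attrs])

    # Example: an activity made of 2 segments, M = 3 classes plus background,
    # and NG = 6 attribute flags.
    g_i = intra_activity_context([[0.1, 0.6, 0.2, 0.1],
                                  [0.2, 0.5, 0.2, 0.1]],
                                 [1, 0, 0, 1, 0, 0])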
46 Inter-activity context feature descriptor. Features that capture the relative spatial and temporal relationships of activities are defined as inter-activity context features. [sent-139, score-0.932]
47 Temporal context is defined by the following temporal relationships: nth frame of ai is before aj, nth frame of ai is during aj, and nth frame of ai is after aj . [sent-146, score-1.333]
48 tcij (n) is the temporal relationship of ai and aj at the nth frame of ai as shown in Fig. [sent-147, score-0.852]
49 The normalized histogram tcij (over the frames of ai) is the inter-activity temporal context feature of activity ai with respect to activity aj. [sent-149, score-1.551]
50 The red circle indicates the motion region of ai at this frame while the purple rectangle indicates the activity region of aj . [sent-151, score-1.096]
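A minimal Python sketch of the inter-activity temporal context just described; the helper name, the frame representation and the bin order are assumptions, but the before/during/after counting and the normalization follow the definition above.

    def temporal_context(frames_i, span_j):
        # frames_i: sequence of frame indices covered by activity a_i.
        # span_j: (start_frame, end_frame) of activity a_j.
        # Returns the normalized 3-bin histogram [before, during, after].
        start_j, end_j = span_j
        counts = [0, 0, 0]
        for n in frames_i:
            if n < start_j:
                counts[0] += 1          # nth frame of a_i is before a_j
            elif n <= end_j:
                counts[1] += 1          # nth frame of a_i is during a_j
            else:
                counts[2] += 1          # nth frame of a_i is after a_j
        total = float(sum(counts)) or 1.0
        return [c / total for c in counts]

    # Example: a_i covers frames 100-139, a_j covers frames 120-200.
    tc_ij = temporal_context(range(100, 140), (120, 200))   # -> [0.5, 0.5, 0.0]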
51 Structural Activity Model. For an activity set a with n action segments, we assign an auxiliary duration vector d = [d1, · · · , dm] (with Σ_{i=1}^{m} di = n) and a label vector y = [y1, · · · , ym]. [sent-164, score-0.839]
52 yi ∈ {0, 1, · · · , M} is the activity label of the ith activity and di is its activity duration, for i = 1, · · · , m. [sent-167, score-1.069]
53 Thus, for a = [a1, · · · , am], ai is the ith activity in the set. [sent-168, score-0.663]
54 Assume xi ∈ R^Dx and gi ∈ R^Dg to be the motion feature and intra-activity context feature of instance ai, and Dx and Dg to be the dimensions of xi and gi respectively. [sent-170, score-0.419]
55 ωd,yi ∈ R^Dx, ωx,yi ∈ R^Dx and ωg,yi ∈ R^Dg are the weight vectors that capture the valid duration, motion and intra-activity context patterns of activity class yi. [sent-171, score-0.76]
56 scij ∈ R^Dsc and tcij ∈ R^Dtc are the inter-activity context features associated with ai and aj. [sent-172, score-0.712]
57 ωsc,yi,yj ∈ R^Dsc and ωtc,yi,yj ∈ R^Dtc are the weight vectors that capture the valid spatial and temporal relationships of activity classes yi and yj. [sent-174, score-0.703]
58 In general, the dimensions of the same kind of feature can be different for different activity classes or class pairs. [sent-175, score-0.505]
59 Four potentials are developed to measure the compatibilities between the assigned variables (y, d) and the observed features of activity set a. [sent-176, score-0.505]
60 Activity-duration potential Fd(yi, di) measures the compatibility between the activity label yi and its duration di for activity ai. [sent-178, score-1.25]
61 Intra-activity motion potential measures the compatibility between the activity label of ai and the intra-activity motion feature xi developed from the associated action segments, as Fx(yi, di) = di ω_{x,yi}^T xi (3). [sent-181, score-1.348]
62 Intra-activity context potential measures the compatibility between the activity label of ai and its intra-activity context feature gi, as Fg(yi, di) = di ω_{g,yi}^T gi (4). [sent-182, score-1.138]
63 Inter-activity context potential measures the compatibility between the activity labels of ai and aj and their spatial and temporal relationships scij and tcij, as Fsc,tc(yi, yj, di, dj) = di dj (ω_{sc,yi,yj}^T scij + ω_{tc,yi,yj}^T tcij) (5). [sent-183, score-1.485]
64 Combined potential function F(a, y, d) is defined to measure the compatibility between (y, d) of the activity set a and its features: F(a, y, d) = Σ_{i=1}^{m} Fd(yi, di) + Σ_{i=1}^{m} Fx(yi, di) + Σ_{i=1}^{m} Fg(yi, di) + Σ_{i≠j} Fsc,tc(yi, yj, di, dj) (6). [sent-184, score-0.562]
65 Thus, the potential function F(a, y, d) can be converted into a linear function with a single parameter ω, F(a, y, d) = ω^T Γ(a, y, d) (8), where Γ(a, y, d), called the joint feature of activity set a, can be easily obtained from (6). [sent-191, score-0.536]
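The sketch below spells out the combined potential F(a, y, d) from the four terms above and makes the linearity in ω visible, since every term is a weight block dotted with an observed, duration-scaled feature; the dictionary layout and the encoding of the duration feature are assumptions, not the paper's implementation.

    import numpy as np

    def combined_potential(acts, y, d, w):
        # acts[i]: features of activity a_i, with 'dur' (duration feature,
        # encoding assumed), 'x' (motion), 'g' (intra-activity context), and
        # pairwise 'sc'/'tc' features indexed by the other activity j.
        # w: per-class weights w['d'][y], w['x'][y], w['g'][y] and
        # per-class-pair weights w['sc'][yi, yj], w['tc'][yi, yj].
        m = len(acts)
        F = 0.0
        for i in range(m):
            yi, di = y[i], d[i]
            F += np.dot(w['d'][yi], acts[i]['dur'])        # duration potential
            F += di * np.dot(w['x'][yi], acts[i]['x'])     # intra-activity motion potential
            F += di * np.dot(w['g'][yi], acts[i]['g'])     # intra-activity context potential
            for j in range(m):
                if j != i:                                 # inter-activity context potential
                    F += di * d[j] * (np.dot(w['sc'][yi, y[j]], acts[i]['sc'][j])
                                      + np.dot(w['tc'][yi, y[j]], acts[i]['tc'][j]))
        return F

Stacking the per-class and per-class-pair weight blocks into one vector ω, and the matching duration-scaled feature blocks into Γ(a, y, d), gives the linear form F(a, y, d) = ω^T Γ(a, y, d) used for max-margin learning.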
66 The training set consists of P samples (a1, y1, d1), · · · , (aP, yP, dP), where ai is the activity set, yi is the label vector and di is the auxiliary duration vector.
67 dP), di The loss function for assigning ai with ( byi,bdi), ∆(ai,b yi,dbi), equals the number of action swegitmhe (nb yts tbhat associ,ab y te wbith incorrect activity labels (an acwtiiothn (s byegmbent is mis,lb yabebled if over half of the segment is mislabeled). [sent-197, score-0.936]
68 We set all weights related to background activities to be zeros. [sent-208, score-0.438]
69 We greedily instantiate di consecutive segments denoted as ai that, when labeled as a specific activity class, can increase the weighted value of the compatibility function, F, by the largest amount. [sent-215, score-0.893]
70 Algorithm 1: Greedy Search Algorithm. Input: a testing instance with n action segments. Output: interested activities A, label vector Y and the duration vector D. [sent-220, score-0.792]
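One way such a greedy search could be realized is sketched below; the candidate enumeration over (start, duration, class), the score_gain callback and the stopping rule are assumptions about a possible implementation, not the paper's exact algorithm.

    def greedy_inference(n_segments, classes, max_duration, score_gain):
        # score_gain(assigned, start, duration, cls) is assumed to return the
        # increase of the compatibility function F if segments
        # start..start+duration-1 are instantiated as one activity of class cls.
        assigned = {}                                     # segment index -> class
        while True:
            best = None
            for start in range(n_segments):
                for dur in range(1, max_duration + 1):
                    end = min(start + dur, n_segments)
                    if any(s in assigned for s in range(start, end)):
                        continue                          # keep activities disjoint
                    for cls in classes:
                        gain = score_gain(assigned, start, end - start, cls)
                        if best is None or gain > best[0]:
                            best = (gain, start, end, cls)
            if best is None or best[0] <= 0:              # no candidate improves F
                break
            _, start, end, cls = best
            for s in range(start, end):
                assigned[s] = cls
        return assigned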
71 Experiment. To assess the effectiveness of our structural model in activity modeling and recognition, we perform experiments on the public VIRAT Ground Dataset [8]. [sent-226, score-0.586]
72 We compare our results with the popular activity recognition method, BOW+SVM [18], and recently developed methods - string of feature graphs (SFG) [10] and sum-product networks (SPN) [1]. [sent-228, score-0.558]
73 Dataset. The VIRAT Ground Dataset is a state-of-the-art activity dataset with many challenging characteristics, such as wide variation in the activities and clutter in the scene. [sent-231, score-0.92]
74 The activities defined in Release 1 include: 1 - person loading an object to a vehicle; 2 - person unloading an object from a vehicle; 3 - person opening a vehicle trunk; 4 - person closing a vehicle trunk; 5 - person getting into a vehicle; 6 - person getting out of a vehicle. [sent-233, score-1.842]
75 Five more activities are defined in VIRAT Release 2: 7 - person gesturing; 8 - person carrying an object; 9 - person running; 10 - person entering a facility; 11 - person exiting a facility. [sent-235, score-1.089]
76 Preprocessing and Feature Extraction. Motion regions involving only moving vehicles are excluded from the experiments since we are only interested in person activities and person-vehicle interactions. [sent-239, score-0.615]
77 A distance threshold of 2 times the height of the person and a time threshold of 15 seconds are used to group action segments into activity sets. [sent-242, score-0.863]
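A minimal sketch of this grouping step; the 2x-person-height and 15-second thresholds come from the sentence above, while the single-link merging strategy and the segment fields are assumptions.

    def group_into_activity_sets(segments, person_height, fps,
                                 dist_mult=2.0, time_thresh_s=15.0):
        # segments: list of dicts with 'center' = (x, y) image location and
        # 'frame' = representative frame index (field names are assumptions).
        def related(a, b):
            dx = a['center'][0] - b['center'][0]
            dy = a['center'][1] - b['center'][1]
            near = (dx * dx + dy * dy) ** 0.5 <= dist_mult * person_height
            soon = abs(a['frame'] - b['frame']) / float(fps) <= time_thresh_s
            return near and soon

        activity_sets = []
        for idx, seg in enumerate(segments):
            target = next((g for g in activity_sets
                           if any(related(seg, segments[j]) for j in g)), None)
            if target is None:
                activity_sets.append([idx])      # start a new activity set
            else:
                target.append(idx)               # join an existing related set
        return activity_sets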
78 2 to develop the feature descriptors for each activity set. [sent-244, score-0.505]
79 Motion Segmentation. We develop an automatic motion segmentation algorithm by detecting boundaries where the statistics of motion features change dramatically, and obtain the action segments. [sent-249, score-0.461]
80 Defining the segmentation accuracy as twice the absolute sum of deviations of estimated activity boundaries from the real ones, normalized by the total number of frames, the segmentation accuracy on VIRAT dataset Release 1 is 85. [sent-260, score-0.538]
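A small sketch of the boundary-deviation measure just defined; matching estimated to true boundaries by their order, and reporting the accuracy as one minus the normalized deviation ratio, are assumed readings of the sentence above rather than details given in it.

    def segmentation_accuracy(est_boundaries, true_boundaries, n_frames):
        # Twice the absolute sum of boundary deviations, normalized by the
        # total number of frames; returned here as 1 - deviation ratio.
        dev = sum(abs(e - t) for e, t in zip(est_boundaries, true_boundaries))
        return 1.0 - 2.0 * dev / float(n_frames)

    # Example: three boundaries each off by 5-10 frames in a 1000-frame video.
    acc = segmentation_accuracy([100, 420, 700], [110, 415, 690], 1000)  # -> 0.95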
81 The smaller the distance threshold dth is, the more action segments a complex activity may have. [sent-265, score-0.748]
82 As an example of the importance of context features, the baseline classifier often confuses “open a vehicle trunk” and “close a vehicle trunk” with each other. [sent-270, score-0.563]
83 However, if the two activities happen closely in time in the same place, the first activity in time is probably “open a vehicle trunk”. [sent-271, score-1.124]
84 This kind of contextual information within and across activity classes is captured by our model and used to improve the recognition performance. [sent-272, score-0.482]
85 (a): Confusion matrix for the baseline classifier; (b): Confusion matrix for our approach using motion and intra-activity context features; (c): Confusion matrix for our approach using motion and intra- and inter-activity context features. [sent-274, score-1.073]
86 The results are expected since the intra-activity and inter-activity context give the model additional information about the activities beyond the motion information encoded in low-level features. [sent-279, score-0.716]
87 However, it does not consider the relationships between various activities and thus our method outperforms the SFGs. [sent-281, score-0.526]
88 Fig. 8 shows examples that demonstrate the significance of context in activity recognition. [sent-283, score-0.626]
89 However, [1] works on video clips, each containing an activity of interest with additional 10 seconds occurring randomly before or after the target activity instance, while we work on continuous video. [sent-292, score-1.054]
90 Our method (1): the proposed structural model with motion feature and intra-activity context feature; our method (2): the proposed structural model with motion feature, intra-activity and inter-activity context features. [sent-293, score-1.143]
91 Fig. 10 compares the precision and recall for the eleven activities defined in VIRAT Release 2 for the BOW+SVM method, the baseline classifier, and our method. [sent-300, score-0.5]
92 We see that by modeling the relationships between activities, those with strong context patterns, such as “person closing a vehicle trunk”(4) and “person running”(9), achieve larger performance gain compared to activities with weak context patterns such as “person gesturing”(7). [sent-301, score-1.034]
93 Fig. 11 shows example results on activities in Release 2. [sent-303, score-0.438]
94 Figure 10: Precision (a) and recall (b) for the eleven activities defined in VIRAT Release 2, comparing BOW+SVM, the baseline (NDM+SVM), and our method. [sent-306, score-0.465]
95 Conclusion. In this paper, we present a novel approach to jointly model a variable number of activities in continuous videos. [sent-308, score-0.478]
96 We have addressed the problem of automatic motion segmentation based on low-level motion features and the problem of high-level representations of activities in the scene. [sent-309, score-0.734]
97 Upon the detected activity elements, we can build a high-level model that integrates various features within and between activities. [sent-310, score-0.536]
98 Our experiments demonstrate that joint modeling of activities, encapsulating object interactions and spatial and temporal relationships between activity classes, can significantly improve the recognition accuracy. [sent-312, score-0.71]
99 A “string of feature graphs” model for recognition of complex activities in natural videos. [sent-381, score-0.461]
100 Modeling temporal structure of decomposable motion segments for activity classification. [sent-444, score-0.812]
wordName wordTfidf (topN-words)
[('activity', 0.482), ('activities', 0.438), ('virat', 0.32), ('release', 0.195), ('ai', 0.181), ('vehicle', 0.174), ('aj', 0.167), ('action', 0.165), ('context', 0.144), ('motion', 0.134), ('trunk', 0.133), ('scij', 0.12), ('person', 0.115), ('getting', 0.107), ('segments', 0.101), ('tcij', 0.1), ('temporal', 0.095), ('relationships', 0.088), ('nth', 0.082), ('structural', 0.082), ('ndm', 0.08), ('di', 0.08), ('duration', 0.063), ('durations', 0.063), ('nayak', 0.06), ('gi', 0.059), ('opening', 0.055), ('optimum', 0.053), ('compatibility', 0.049), ('stip', 0.048), ('inference', 0.046), ('rs', 0.046), ('frame', 0.046), ('facility', 0.044), ('bow', 0.043), ('greedy', 0.042), ('continuous', 0.04), ('gesturing', 0.04), ('rdg', 0.04), ('rdsc', 0.04), ('rdtc', 0.04), ('sfg', 0.04), ('spn', 0.04), ('unloading', 0.04), ('yi', 0.038), ('classifier', 0.036), ('rdx', 0.036), ('baseline', 0.035), ('recognized', 0.034), ('loading', 0.033), ('region', 0.032), ('xm', 0.032), ('potential', 0.031), ('bin', 0.031), ('moving', 0.031), ('attributes', 0.031), ('vehicles', 0.031), ('clips', 0.031), ('confusion', 0.031), ('happen', 0.03), ('searches', 0.03), ('string', 0.03), ('subsequences', 0.03), ('detected', 0.03), ('ix', 0.029), ('facilities', 0.028), ('video', 0.028), ('segmentation', 0.028), ('labels', 0.028), ('exiting', 0.027), ('maxi', 0.027), ('event', 0.027), ('videos', 0.027), ('development', 0.027), ('dmax', 0.027), ('carrying', 0.027), ('eleven', 0.027), ('doors', 0.027), ('ng', 0.026), ('persons', 0.026), ('label', 0.025), ('actions', 0.025), ('amit', 0.025), ('closing', 0.024), ('integrates', 0.024), ('auxiliary', 0.024), ('rarely', 0.024), ('zhu', 0.024), ('feature', 0.023), ('developed', 0.023), ('interactions', 0.023), ('ts', 0.023), ('dist', 0.023), ('upon', 0.023), ('modeling', 0.022), ('circle', 0.022), ('interest', 0.022), ('entering', 0.022), ('opt', 0.022), ('lan', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
Author: Yingying Zhu, Nandita M. Nayak, Amit K. Roy-Chowdhury
Abstract: In thispaper, rather than modeling activities in videos individually, we propose a hierarchical framework that jointly models and recognizes related activities using motion and various context features. This is motivated from the observations that the activities related in space and time rarely occur independently and can serve as the context for each other. Given a video, action segments are automatically detected using motion segmentation based on a nonlinear dynamical model. We aim to merge these segments into activities of interest and generate optimum labels for the activities. Towards this goal, we utilize a structural model in a max-margin framework that jointly models the underlying activities which are related in space and time. The model explicitly learns the duration, motion and context patterns for each activity class, as well as the spatio-temporal relationships for groups of them. The learned model is then used to optimally label the activities in the testing videos using a greedy search method. We show promising results on the VIRAT Ground Dataset demonstrating the benefit of joint modeling and recognizing activities in a wide-area scene.
2 0.45586881 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
Author: Michael S. Ryoo, Larry Matthies
Abstract: This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint. The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video inputs. These include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects to the observer’, whose videos involve a large amount of camera ego-motion caused by physical interactions. The paper investigates multichannel kernels to integrate global and local motion information, and presents a new activity learning/recognition methodology that explicitly considers temporal structures displayed in first-person activity videos. In our experiments, we not only show classification results with segmented videos, but also confirm that our new approach is able to detect activities from continuous videos reliably.
3 0.36424497 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
Author: Yu Cao, Daniel Barrett, Andrei Barbu, Siddharth Narayanaswamy, Haonan Yu, Aaron Michaux, Yuewei Lin, Sven Dickinson, Jeffrey Mark Siskind, Song Wang
Abstract: Recognizing human activities in partially observed videos is a challengingproblem and has many practical applications. When the unobserved subsequence is at the end of the video, the problem is reduced to activity prediction from unfinished activity streaming, which has been studied by many researchers. However, in the general case, an unobserved subsequence may occur at any time by yielding a temporal gap in the video. In this paper, we propose a new method that can recognize human activities from partially observed videos in the general case. Specifically, we formulate the problem into a probabilistic framework: 1) dividing each activity into multiple ordered temporal segments, 2) using spatiotemporal features of the training video samples in each segment as bases and applying sparse coding (SC) to derive the activity likelihood of the test video sample at each segment, and 3) finally combining the likelihood at each segment to achieve a global posterior for the activities. We further extend the proposed method to include more bases that correspond to a mixture of segments with different temporal lengths (MSSC), which can better rep- resent the activities with large intra-class variations. We evaluate the proposed methods (SC and MSSC) on various real videos. We also evaluate the proposed methods on two special cases: 1) activity prediction where the unobserved subsequence is at the end of the video, and 2) human activity recognition on fully observed videos. Experimental results show that the proposed methods outperform existing state-of-the-art comparison methods.
4 0.3142944 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
Author: Raghuraman Gopalan
Abstract: While the notion of joint sparsity in understanding common and innovative components of a multi-receiver signal ensemble has been well studied, we investigate the utility of such joint sparse models in representing information contained in a single video signal. By decomposing the content of a video sequence into that observed by multiple spatially and/or temporally distributed receivers, we first recover a collection of common and innovative components pertaining to individual videos. We then present modeling strategies based on subspace-driven manifold metrics to characterize patterns among these components, across other videos in the system, to perform subsequent video analysis. We demonstrate the efficacy of our approach for activity classification and clustering by reporting competitive results on standard datasets such as, HMDB, UCF-50, Olympic Sports and KTH.
Author: Vinay Bettadapura, Grant Schindler, Thomas Ploetz, Irfan Essa
Abstract: We present data-driven techniques to augment Bag of Words (BoW) models, which allow for more robust modeling and recognition of complex long-term activities, especially when the structure and topology of the activities are not known a priori. Our approach specifically addresses the limitations of standard BoW approaches, which fail to represent the underlying temporal and causal information that is inherent in activity streams. In addition, we also propose the use ofrandomly sampled regular expressions to discover and encode patterns in activities. We demonstrate the effectiveness of our approach in experimental evaluations where we successfully recognize activities and detect anomalies in four complex datasets.
6 0.24696511 287 cvpr-2013-Modeling Actions through State Changes
7 0.19620007 62 cvpr-2013-Bilinear Programming for Human Activity Recognition with Unknown MRF Graphs
8 0.17490242 348 cvpr-2013-Recognizing Activities via Bag of Words for Attribute Dynamics
9 0.17153896 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
10 0.16347305 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
11 0.15738466 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
12 0.15233439 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
13 0.13304092 407 cvpr-2013-Spatio-temporal Depth Cuboid Similarity Feature for Activity Recognition Using Depth Camera
14 0.13184278 40 cvpr-2013-An Approach to Pose-Based Action Recognition
16 0.12448677 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
17 0.12349003 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
18 0.12193254 77 cvpr-2013-Capturing Complex Spatio-temporal Relations among Facial Muscles for Facial Expression Recognition
19 0.12126182 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos
20 0.12098113 187 cvpr-2013-Geometric Context from Videos
topicId topicWeight
[(0, 0.222), (1, -0.09), (2, -0.019), (3, -0.21), (4, -0.251), (5, 0.014), (6, -0.117), (7, -0.011), (8, -0.1), (9, 0.048), (10, 0.15), (11, -0.099), (12, 0.036), (13, -0.038), (14, -0.01), (15, 0.075), (16, 0.039), (17, 0.177), (18, -0.037), (19, -0.121), (20, -0.091), (21, 0.098), (22, 0.04), (23, -0.012), (24, 0.009), (25, 0.0), (26, -0.083), (27, 0.053), (28, 0.083), (29, 0.189), (30, 0.022), (31, 0.016), (32, 0.025), (33, -0.085), (34, 0.029), (35, -0.069), (36, -0.013), (37, -0.105), (38, -0.049), (39, 0.04), (40, -0.094), (41, 0.067), (42, 0.213), (43, 0.005), (44, 0.035), (45, -0.076), (46, 0.148), (47, -0.04), (48, -0.022), (49, -0.064)]
simIndex simValue paperId paperTitle
same-paper 1 0.96490824 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
Author: Yingying Zhu, Nandita M. Nayak, Amit K. Roy-Chowdhury
Abstract: In thispaper, rather than modeling activities in videos individually, we propose a hierarchical framework that jointly models and recognizes related activities using motion and various context features. This is motivated from the observations that the activities related in space and time rarely occur independently and can serve as the context for each other. Given a video, action segments are automatically detected using motion segmentation based on a nonlinear dynamical model. We aim to merge these segments into activities of interest and generate optimum labels for the activities. Towards this goal, we utilize a structural model in a max-margin framework that jointly models the underlying activities which are related in space and time. The model explicitly learns the duration, motion and context patterns for each activity class, as well as the spatio-temporal relationships for groups of them. The learned model is then used to optimally label the activities in the testing videos using a greedy search method. We show promising results on the VIRAT Ground Dataset demonstrating the benefit of joint modeling and recognizing activities in a wide-area scene.
2 0.88444895 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
Author: Yu Cao, Daniel Barrett, Andrei Barbu, Siddharth Narayanaswamy, Haonan Yu, Aaron Michaux, Yuewei Lin, Sven Dickinson, Jeffrey Mark Siskind, Song Wang
Abstract: Recognizing human activities in partially observed videos is a challengingproblem and has many practical applications. When the unobserved subsequence is at the end of the video, the problem is reduced to activity prediction from unfinished activity streaming, which has been studied by many researchers. However, in the general case, an unobserved subsequence may occur at any time by yielding a temporal gap in the video. In this paper, we propose a new method that can recognize human activities from partially observed videos in the general case. Specifically, we formulate the problem into a probabilistic framework: 1) dividing each activity into multiple ordered temporal segments, 2) using spatiotemporal features of the training video samples in each segment as bases and applying sparse coding (SC) to derive the activity likelihood of the test video sample at each segment, and 3) finally combining the likelihood at each segment to achieve a global posterior for the activities. We further extend the proposed method to include more bases that correspond to a mixture of segments with different temporal lengths (MSSC), which can better rep- resent the activities with large intra-class variations. We evaluate the proposed methods (SC and MSSC) on various real videos. We also evaluate the proposed methods on two special cases: 1) activity prediction where the unobserved subsequence is at the end of the video, and 2) human activity recognition on fully observed videos. Experimental results show that the proposed methods outperform existing state-of-the-art comparison methods.
3 0.86355031 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
Author: Michael S. Ryoo, Larry Matthies
Abstract: This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint. The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video inputs. These include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects to the observer’, whose videos involve a large amount of camera ego-motion caused by physical interactions. The paper investigates multichannel kernels to integrate global and local motion information, and presents a new activity learning/recognition methodology that explicitly considers temporal structures displayed in first-person activity videos. In our experiments, we not only show classification results with segmented videos, but also confirm that our new approach is able to detect activities from continuous videos reliably.
Author: Vinay Bettadapura, Grant Schindler, Thomas Ploetz, Irfan Essa
Abstract: We present data-driven techniques to augment Bag of Words (BoW) models, which allow for more robust modeling and recognition of complex long-term activities, especially when the structure and topology of the activities are not known a priori. Our approach specifically addresses the limitations of standard BoW approaches, which fail to represent the underlying temporal and causal information that is inherent in activity streams. In addition, we also propose the use ofrandomly sampled regular expressions to discover and encode patterns in activities. We demonstrate the effectiveness of our approach in experimental evaluations where we successfully recognize activities and detect anomalies in four complex datasets.
5 0.73463404 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
Author: Raghuraman Gopalan
Abstract: While the notion of joint sparsity in understanding common and innovative components of a multi-receiver signal ensemble has been well studied, we investigate the utility of such joint sparse models in representing information contained in a single video signal. By decomposing the content of a video sequence into that observed by multiple spatially and/or temporally distributed receivers, we first recover a collection of common and innovative components pertaining to individual videos. We then present modeling strategies based on subspace-driven manifold metrics to characterize patterns among these components, across other videos in the system, to perform subsequent video analysis. We demonstrate the efficacy of our approach for activity classification and clustering by reporting competitive results on standard datasets such as, HMDB, UCF-50, Olympic Sports and KTH.
6 0.68686146 62 cvpr-2013-Bilinear Programming for Human Activity Recognition with Unknown MRF Graphs
7 0.62747657 348 cvpr-2013-Recognizing Activities via Bag of Words for Attribute Dynamics
8 0.54672199 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos
9 0.54490066 103 cvpr-2013-Decoding Children's Social Behavior
10 0.52165061 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
11 0.51667559 287 cvpr-2013-Modeling Actions through State Changes
12 0.47266346 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition
13 0.4417229 187 cvpr-2013-Geometric Context from Videos
14 0.4412362 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
15 0.43288931 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
17 0.40944821 137 cvpr-2013-Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis
18 0.40583345 407 cvpr-2013-Spatio-temporal Depth Cuboid Similarity Feature for Activity Recognition Using Depth Camera
19 0.40267202 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model
20 0.40167147 353 cvpr-2013-Relative Hidden Markov Models for Evaluating Motion Skill
topicId topicWeight
[(10, 0.095), (16, 0.02), (26, 0.043), (28, 0.015), (29, 0.012), (33, 0.253), (46, 0.154), (67, 0.138), (69, 0.082), (87, 0.074)]
simIndex simValue paperId paperTitle
1 0.90107894 213 cvpr-2013-Image Tag Completion via Image-Specific and Tag-Specific Linear Sparse Reconstructions
Author: Zijia Lin, Guiguang Ding, Mingqing Hu, Jianmin Wang, Xiaojun Ye
Abstract: Though widely utilized for facilitating image management, user-provided image tags are usually incomplete and insufficient to describe the whole semantic content of corresponding images, resulting in performance degradations in tag-dependent applications and thus necessitating effective tag completion methods. In this paper, we propose a novel scheme denoted as LSR for automatic image tag completion via image-specific and tag-specific Linear Sparse Reconstructions. Given an incomplete initial tagging matrix with each row representing an image and each column representing a tag, LSR optimally reconstructs each image (i.e. row) and each tag (i.e. column) with remaining ones under constraints of sparsity, considering imageimage similarity, image-tag association and tag-tag concurrence. Then both image-specific and tag-specific reconstruction values are normalized and merged for selecting missing related tags. Extensive experiments conducted on both benchmark dataset and web images well demonstrate the effectiveness of the proposed LSR.
same-paper 2 0.89495939 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
Author: Yingying Zhu, Nandita M. Nayak, Amit K. Roy-Chowdhury
Abstract: In thispaper, rather than modeling activities in videos individually, we propose a hierarchical framework that jointly models and recognizes related activities using motion and various context features. This is motivated from the observations that the activities related in space and time rarely occur independently and can serve as the context for each other. Given a video, action segments are automatically detected using motion segmentation based on a nonlinear dynamical model. We aim to merge these segments into activities of interest and generate optimum labels for the activities. Towards this goal, we utilize a structural model in a max-margin framework that jointly models the underlying activities which are related in space and time. The model explicitly learns the duration, motion and context patterns for each activity class, as well as the spatio-temporal relationships for groups of them. The learned model is then used to optimally label the activities in the testing videos using a greedy search method. We show promising results on the VIRAT Ground Dataset demonstrating the benefit of joint modeling and recognizing activities in a wide-area scene.
3 0.87938517 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
Author: Xiaohui Shen, Zhe Lin, Jonathan Brandt, Ying Wu
Abstract: Detecting faces in uncontrolled environments continues to be a challenge to traditional face detection methods[24] due to the large variation in facial appearances, as well as occlusion and clutter. In order to overcome these challenges, we present a novel and robust exemplarbased face detector that integrates image retrieval and discriminative learning. A large database of faces with bounding rectangles and facial landmark locations is collected, and simple discriminative classifiers are learned from each of them. A voting-based method is then proposed to let these classifiers cast votes on the test image through an efficient image retrieval technique. As a result, faces can be very efficiently detected by selecting the modes from the voting maps, without resorting to exhaustive sliding window-style scanning. Moreover, due to the exemplar-based framework, our approach can detect faces under challenging conditions without explicitly modeling their variations. Evaluation on two public benchmark datasets shows that our new face detection approach is accurate and efficient, and achieves the state-of-the-art performance. We further propose to use image retrieval for face validation (in order to remove false positives) and for face alignment/landmark localization. The same methodology can also be easily generalized to other facerelated tasks, such as attribute recognition, as well as general object detection.
4 0.87801105 2 cvpr-2013-3D Pictorial Structures for Multiple View Articulated Pose Estimation
Author: Magnus Burenius, Josephine Sullivan, Stefan Carlsson
Abstract: We consider the problem of automatically estimating the 3D pose of humans from images, taken from multiple calibrated views. We show that it is possible and tractable to extend the pictorial structures framework, popular for 2D pose estimation, to 3D. We discuss how to use this framework to impose view, skeleton, joint angle and intersection constraints in 3D. The 3D pictorial structures are evaluated on multiple view data from a professional football game. The evaluation is focused on computational tractability, but we also demonstrate how a simple 2D part detector can be plugged into the framework.
5 0.87636024 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection
Author: Jianguo Li, Yimin Zhang
Abstract: This paper presents a novel learning framework for training boosting cascade based object detector from large scale dataset. The framework is derived from the wellknown Viola-Jones (VJ) framework but distinguished by three key differences. First, the proposed framework adopts multi-dimensional SURF features instead of single dimensional Haar features to describe local patches. In this way, the number of used local patches can be reduced from hundreds of thousands to several hundreds. Second, it adopts logistic regression as weak classifier for each local patch instead of decision trees in the VJ framework. Third, we adopt AUC as a single criterion for the convergence test during cascade training rather than the two trade-off criteria (false-positive-rate and hit-rate) in the VJ framework. The benefit is that the false-positive-rate can be adaptive among different cascade stages, and thus yields much faster convergence speed of SURF cascade. Combining these points together, the proposed approach has three good properties. First, the boosting cascade can be trained very efficiently. Experiments show that the proposed approach can train object detectors from billions of negative samples within one hour even on personal computers. Second, the built detector is comparable to the stateof-the-art algorithm not only on the accuracy but also on the processing speed. Third, the built detector is small in model-size due to short cascade stages.
6 0.87523586 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation
7 0.87445581 45 cvpr-2013-Articulated Pose Estimation Using Discriminative Armlet Classifiers
8 0.87370288 288 cvpr-2013-Modeling Mutual Visibility Relationship in Pedestrian Detection
9 0.87367791 160 cvpr-2013-Face Recognition in Movie Trailers via Mean Sequence Sparse Representation-Based Classification
10 0.87277472 345 cvpr-2013-Real-Time Model-Based Rigid Object Pose Estimation and Tracking Combining Dense and Sparse Visual Cues
11 0.87145323 275 cvpr-2013-Lp-Norm IDF for Large Scale Image Search
12 0.86872655 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
13 0.86765361 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence
14 0.86734706 375 cvpr-2013-Saliency Detection via Graph-Based Manifold Ranking
15 0.86346912 322 cvpr-2013-PISA: Pixelwise Image Saliency by Aggregating Complementary Appearance Contrast Measures with Spatial Priors
16 0.86330462 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision
17 0.86276501 246 cvpr-2013-Learning Binary Codes for High-Dimensional Data Using Bilinear Projections
18 0.86122984 398 cvpr-2013-Single-Pedestrian Detection Aided by Multi-pedestrian Detection
19 0.8605634 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image
20 0.86046082 60 cvpr-2013-Beyond Physical Connections: Tree Models in Human Pose Estimation