cvpr cvpr2013 cvpr2013-347 knowledge-graph by maker-knowledge-mining

347 cvpr-2013-Recognize Human Activities from Partially Observed Videos


Source: pdf

Author: Yu Cao, Daniel Barrett, Andrei Barbu, Siddharth Narayanaswamy, Haonan Yu, Aaron Michaux, Yuewei Lin, Sven Dickinson, Jeffrey Mark Siskind, Song Wang

Abstract: Recognizing human activities in partially observed videos is a challenging problem and has many practical applications. When the unobserved subsequence is at the end of the video, the problem is reduced to activity prediction from unfinished activity streaming, which has been studied by many researchers. However, in the general case, an unobserved subsequence may occur at any time by yielding a temporal gap in the video. In this paper, we propose a new method that can recognize human activities from partially observed videos in the general case. Specifically, we formulate the problem into a probabilistic framework: 1) dividing each activity into multiple ordered temporal segments, 2) using spatiotemporal features of the training video samples in each segment as bases and applying sparse coding (SC) to derive the activity likelihood of the test video sample at each segment, and 3) finally combining the likelihood at each segment to achieve a global posterior for the activities. We further extend the proposed method to include more bases that correspond to a mixture of segments with different temporal lengths (MSSC), which can better represent the activities with large intra-class variations. We evaluate the proposed methods (SC and MSSC) on various real videos. We also evaluate the proposed methods on two special cases: 1) activity prediction where the unobserved subsequence is at the end of the video, and 2) human activity recognition on fully observed videos. Experimental results show that the proposed methods outperform existing state-of-the-art comparison methods.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 When the unobserved subsequence is at the end of the video, the problem is reduced to activity prediction from unfinished activity streaming, which has been studied by many researchers. [sent-2, score-1.338]

2 However, in the general case, an unobserved subsequence may occur at any time by yielding a temporal gap in the video. [sent-3, score-0.387]

3 In this paper, we propose a new method that can recognize human activities from partially observed videos in the general case. [sent-4, score-0.529]

4 We further extend the proposed method to include more bases that correspond to a mixture of segments with different temporal lengths (MSSC), which can better represent the activities with large intra-class variations. [sent-6, score-0.501]

5 We also evaluate the proposed methods on two special cases: 1) activity prediction where the unobserved subsequence is at the end of the video, and 2) human activity recognition on fully observed videos. [sent-8, score-1.486]

6 Introduction. Human activity recognition aims at building robust and efficient computer vision algorithms and systems which can automatically recognize specific human activities from a sequence of video frames. [sent-11, score-0.894]

7 Recently, research on activity recognition has been extended to more complex activity scenarios which involve multiple persons interacting with each other or objects [15, 20, 24]. [sent-18, score-0.984]

8 One widely used approach for human activity recognition is to train and classify the spatiotemporal features extracted from videos with different activities. [sent-19, score-0.794]

9 In these methods, a sequence of 2D video frames is treated as a 3D XYT video volume in which interest points are located by finding local maxima in the responses of the feature detector, followed by calculating vectorized feature descriptors at each interest point. [sent-21, score-0.354]

10 By using the bag-of-visual-words technique, spatiotemporal features within a video can be combined into a feature vector that describes the activity presented in the video. [sent-22, score-0.711]

11 (a) Human activity recognition from fully observed videos: What is this activity? [sent-23, score-0.603]

12 (b) Human activity recognition from partially observed videos − prediction: What is this activity? [sent-24, score-0.814]

13 (c) Human activity recognition from partially observed videos − gapfilling: What is this activity? [sent-25, score-0.849]

14 An illustration of the human activity recognition from fully and partially observed videos. [sent-27, score-0.731]

15 Previous research on human activity recognition usually focused on recognizing activities after fully observing the entire video, as illustrated in Fig. [sent-28, score-0.758]

16 However, in practice, partially observed videos may occur when video signals drop off, cameras or objects of interest are occluded, or videos are composited from multiple sources. [sent-30, score-0.63]

17 The unobserved subsequence may occur any time with any duration, yielding a temporal gap as shown in Fig. [sent-31, score-0.387]

18 For example, one of the four major themes in the DARPA Mind’s Eye program is to handle such gapped videos for activity recognition. [sent-34, score-0.709]

19 When the unobserved subsequence is at the end of the video, the problem is reduced to activity prediction from unfinished activity streaming, as illustrated in Fig. [sent-35, score-1.32]

20 Another example of related work on activity prediction is the max-margin early event detector (MMED) [6], which tries to detect the temporal location and duration of a certain activity from a streaming video. [sent-38, score-1.418]

21 Kitani et al. studied a special activity prediction problem in [8], which tries to predict the walking path of a person in certain environments based on historical data. [sent-40, score-0.633]

22 However, activity recognition from general gapped videos, as shown in Fig. [sent-41, score-0.58]

23 Note that, in general, this is a different problem from activity prediction because the temporal gap may divide the video into two disjoint observed video subsequences and we need to combine them to achieve a reliable recognition. [sent-43, score-1.248]

24 In this paper, we propose a probabilistic formulation for human activity recognition from partially observed videos, where the posterior is maximized for the recognized activity class and the observed video frames. [sent-44, score-1.532]

25 In our formulation, the key component in defining the posterior is the likelihood that the observed video frames describe a certain class of activity. [sent-45, score-0.413]

26 In this paper, we take a set of training video samples (completely observed) of each activity class as the bases, and then use sparse coding (SC) to derive the likelihood that a certain type of activity is presented in a partially observed test video. [sent-46, score-1.596]

27 Furthermore, we divide each activity into multiple temporal segments, apply sparse coding to derive the activity likelihood at each segment, and finally combine the likelihoods at each segment to achieve a global posterior for the activity. [sent-47, score-1.42]

28 While video segments are constructed by uniformly dividing the video in SC, we also extend it to include more sparse coding bases constructed from a mixture of training video segments (MSSC) with different lengths and locations. [sent-48, score-1.117]

29 Using sparse coding with the constructed bases, the proposed methods can find closest video segments from different training videos when matching a new test video. [sent-49, score-0.691]

30 Thus, the proposed methods don’t require full temporal alignment between any pair of (training or test) videos, and they can handle the problems of 1) a limited number of training videos; and 2) possible outliers in the training video data. [sent-50, score-0.332]

31 In the experiments, we not only evaluate the performance on general gapped videos, but also on fully observed videos without a gap and videos with a gap at the end (activity prediction). [sent-55, score-0.679]

32 In Section 2, we present our probabilistic formulation of human activity recognition from partially observed videos. [sent-57, score-0.706]

33 Section 3 introduces the likelihood component using a sparse coding (SC) technique followed by extending the SC to include more bases constructed from a mixture of segments (MSSC) with different temporal lengths. [sent-58, score-0.557]

34 Therefore, we can divide the video O[1 : T] into a sequence of shorter video segments for spatiotemporal feature extraction. [sent-68, score-0.355]

35 For simplicity, we uniformly divide the video O[1 : T] into M equal-length segments, where each segment O(ti−1 : ti], with ti = iT/M, corresponds to the i-th stage of the activity, with i = 1, 2, · · · , M. [sent-69, score-0.578]
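
As a concrete illustration of this uniform division, the following minimal Python sketch (not from the paper; the frame indexing and rounding are assumptions) computes the stage boundaries ti = iT/M and the segments (ti−1 : ti].

```python
# Minimal sketch: uniform division of a T-frame video into M equal-length
# temporal segments with boundaries t_i = i*T/M (rounding is an assumption).
def segment_boundaries(T, M):
    """Return the M half-open stages (t_{i-1}, t_i] as frame-index pairs."""
    ts = [round(i * T / M) for i in range(M + 1)]
    return [(ts[i - 1], ts[i]) for i in range(1, M + 1)]

if __name__ == "__main__":
    # e.g. a 200-frame video divided into M = 20 stages, as in the experiments
    print(segment_boundaries(200, 20))
```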

36 The posterior probability that an activity Ap is presented in the video O[1 : T] can be defined as P(Ap | O[1 : T]), which can be rewritten as: P(Ap | O[1 : T]) ∝ Σ_{i=1}^{M} P(O[1 : T] | Ap, (ti−1 : ti]) P(Ap, (ti−1 : ti]). (1) [sent-72, score-0.532]

37 In this formulation, P(Ap, (ti−1 : ti]) is the prior of stage i of activity Ap, and P(O[1 : T] | Ap, (ti−1 : ti]) is the observation likelihood given activity class Ap in the i-th stage. [sent-81, score-0.521]

38 The index of the recognized activity is p∗ = arg max_p P(Ap | O[1 : T]). [sent-82, score-0.514]
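
A minimal sketch of this stage-marginalized decision rule is given below; the array shapes, the uniform prior, and the normalization step are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

# Minimal sketch of Eq. (1): sum per-stage likelihood x prior over the M stages
# for each class, then pick p* = argmax_p P(A_p | O).  Shapes are assumed.
def recognize(stage_likelihood, stage_prior):
    """
    stage_likelihood: (P, M) array, entry [p, i] ~ P(O | A_p, stage i)
    stage_prior:      (P, M) array, entry [p, i] ~ P(A_p, stage i)
    """
    posterior = (stage_likelihood * stage_prior).sum(axis=1)
    posterior = posterior / posterior.sum()    # normalize over the P classes
    return int(np.argmax(posterior)), posterior

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lik = rng.random((5, 20))                  # 5 activity classes, 20 stages
    prior = np.full((5, 20), 1.0 / (5 * 20))   # a uniform prior (assumption)
    p_star, post = recognize(lik, prior)
    print(p_star, post.round(3))
```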

39 Human Activity Recognition from a Partially Observed Video. A partially observed video can be represented by O[1 : T1] ∪ O[T2 : T], where frames O(T1 : T2) are missing, as illustrated in Fig. [sent-87, score-0.308]

40 (Figure 2 schematic: observation O[1 : T1], unobserved subsequence, observation O[T2 : T]; frames 1 through T divided into segments 1, 2, ..., M.) [sent-91, score-0.581]

41 An illustration of a partially observed video (general case), where the unobserved subsequence is located in the middle of the video. [sent-92, score-0.523]

42 By following Eq. (1), the posterior probability that an activity Ap is presented in this partially observed video can be defined as: P(Ap | O[1 : T1] ∪ O[T2 : T]) ∝ Σ_{i=1}^{M} [ω_i^1 P(O[1 : T1] | Ap, (ti−1 : ti]) + ω_i^2 P(O[T2 : T] | Ap, (ti−1 : ti])] P(Ap, (ti−1 : ti]). (4) [sent-94, score-0.84]

43 The index of the recognized activity is therefore: p∗ = arg max_p P(Ap | O[1 : T1] ∪ O[T2 : T]). [sent-103, score-0.514]

44 When T1 = T, the problem degenerates to the classic human activity recognition from fully observed videos. [sent-105, score-0.656]

45 Likelihood calculation using sparse coding. Without loss of generality, in this section, we only consider the calculation of the likelihood P(O[1 : T1] | Ap, (ti−1 : ti]), since P(O[T2 : T] | Ap, (ti−1 : ti]) can be calculated in a similar way. [sent-111, score-0.279]

46 The basic idea is to collect a set of training videos (completely observed) for activity class Ap and then define the likelihood P(O[1 : T1] | Ap, (ti−1 : ti]) by comparing O[1 : T1] with the i-th segment of all the training videos. [sent-112, score-0.87]

47 One intuitive way to define the likelihood P(O[1 : T1] | Ap, (ti−1 : ti]) is to first construct a mean feature vector, as used in [14], h̄i = (1/N) Σ_n h_i^n for the i-th stage over all N training videos in the class. [sent-116, score-0.302]
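
The mean-feature baseline can be sketched as below; the (videos × stages × dimensions) array layout is an assumption made for illustration.

```python
import numpy as np

# Minimal sketch of the mean activity model of [14]: average the i-th stage
# feature h_i^n over the N training videos of one class (array layout assumed).
def mean_stage_features(H):
    """H: (N, M, D) array of stage features; returns the (M, D) mean model."""
    return H.mean(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    H = rng.random((10, 20, 500))        # 10 training videos, 20 stages, 500-d
    print(mean_stage_features(H).shape)  # (20, 500)
```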

48 However, simply using the mean feature vector as the unique activity model may suffer from two practical limitations. [sent-127, score-0.498]

49 First, when the number of training videos is limited, the mean feature may not be a representative of the true ‘center’ of its activity class in feature space. [sent-128, score-0.706]

50 Second, when the activity label for a training video is actually incorrect or one training video shows a large difference from the other training videos in the same activity class, the mean feature vector may not well represent the considered activity. [sent-131, score-0.577]

51 By using the sparse coding as described above, the proposed SC method can automatically select a proper subset of bases for approximating the test video segments. [sent-155, score-0.455]
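
A minimal sparse-coding sketch is shown below. It solves a generic ℓ1-regularized reconstruction with plain ISTA and scores a test segment by its reconstruction error over the selected bases; the solver, the regularization weight, and the error-based scoring are assumptions and not necessarily the exact objective of the paper's Eq. (9).

```python
import numpy as np

# Minimal sketch: sparse-code a test segment feature x over a basis matrix D
# whose columns are training-segment features, via ISTA for
#   min_w 0.5 * ||x - D w||^2 + lam * ||w||_1,
# then use the reconstruction error as the matching score.
def sparse_code(D, x, lam=0.1, n_iter=200):
    L = np.linalg.norm(D, 2) ** 2 + 1e-8          # Lipschitz constant of the gradient
    w = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ w - x)
        z = w - grad / L
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return w

def reconstruction_error(D, x, lam=0.1):
    w = sparse_code(D, x, lam)
    return np.linalg.norm(x - D @ w)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    D = rng.random((500, 40))    # 40 basis segments collected from training videos
    x = rng.random(500)          # feature of one test video segment
    print(reconstruction_error(D, x))
```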

52 In addition, for different video segments in the test video, the proposed SC method can identify segments from different training videos and use their linear combination for representing a test video segment. [sent-157, score-0.819]

53 Compared to the mean activity model, the proposed method can substantially decrease the approximation error and to some extent, alleviate the problem of limited number of training videos. [sent-158, score-0.522]

54 Note that, in the proposed method, different test videos and different segments from a test video will identify different bases and different coefficients for likelihood calculation. [sent-159, score-0.685]

55 In the experiments, we will show that the proposed SC method outperforms MMED [6], a structured SVM based method, on the activity prediction task. [sent-161, score-0.593]

56 Likelihood calculation using sparse coding on a mixture of segments. It is well known that, in practice, humans perform activities with different paces and overhead time. [sent-164, score-0.471]

57 To handle such variations, we further extend SC to MSSC by including more bases that are constructed from a mixture of segments with different temporal lengths and temporal locations in the training videos. [sent-166, score-0.505]

58 More specifically, when calculating the likelihood of segment i of the test video, we not only take segment (ti−1 , ti] in the training video to construct a basis, but also take 8 segments in each training video to construct 8 more bases. [sent-167, score-0.816]

59 These 8 more segments are (ti−2, ti−1], (ti, ti+1], (ti−2, ti], (ti−1, ti+1], (ti−2, ti+1], (Figure 3 schematic: segments i−1, i, i+1 of the test video matched against shorter and longer segments of the j-th training video.) [sent-170, score-0.777]

60 An illustration of a mixture of segments with different temporal lengths and shifted temporal locations. [sent-171, score-0.33]

61 (ti−1, ti−1 + (ti − ti−1)/2], (ti−1 + (ti − ti−1)/2, ti], and (ti−1 + (ti − ti−1)/4, ti − (ti − ti−1)/4], which are all around segment (ti−1, ti], but with varied segment lengths and small temporal location shifts. [sent-172, score-0.588]
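
One reading of this mixture is sketched below; the function name and boundary arguments are illustrative, and the interval extended on both sides, (ti−2, ti+1], is an assumption, so treat the exact set as indicative rather than definitive.

```python
# Minimal sketch: the original segment (t_{i-1}, t_i] plus 8 extra basis
# segments around it, with varied lengths and small temporal shifts.
def mssc_segments(t_im2, t_im1, t_i, t_ip1):
    half = (t_i - t_im1) / 2.0
    quarter = (t_i - t_im1) / 4.0
    return [
        (t_im1, t_i),                       # original i-th segment
        (t_im2, t_im1),                     # previous segment
        (t_i, t_ip1),                       # next segment
        (t_im2, t_i),                       # extended to the left
        (t_im1, t_ip1),                     # extended to the right
        (t_im2, t_ip1),                     # extended on both sides (assumed)
        (t_im1, t_im1 + half),              # first half of the segment
        (t_im1 + half, t_i),                # second half of the segment
        (t_im1 + quarter, t_i - quarter),   # centered half-length segment
    ]

if __name__ == "__main__":
    for seg in mssc_segments(10, 15, 20, 25):
        print(seg)
```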

62 We expect that these additional bases can better handle the intra-class activity variation by approximating the test video segments more accurately. [sent-173, score-0.883]

63 In the experiments, each activity (and each video) is uniformly divided into 20 temporal segments. [sent-181, score-0.553]

64 We choose several state-of-the-art comparison methods, including Ryoo’s human activity prediction methods (both non-dynamic and dynamic versions) [14], early event detector MMED [6], C2 [7], and Action Bank [17]. [sent-185, score-0.727]

65 We also implement a baseline sparse coding (named after ‘baseline’) method which concatenates features from different segments of a training video into a single feature vector as one row of the basis matrix and then directly applies sparse coding for recognition. [sent-187, score-0.656]

66 More specifically, a (partially observed) test video is classified into an activity class that leads to the smallest reconstruction error. [sent-188, score-0.706]

67 This way, we can apply a simple majority voting to the activity classes of K nearest training videos to classify the test video. [sent-191, score-0.727]
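
A minimal sketch of this KNN comparison baseline follows; the Euclidean distance and feature layout are assumptions.

```python
from collections import Counter
import numpy as np

# Minimal sketch of the KNN baseline: majority vote over the activity labels
# of the K nearest training videos (Euclidean distance assumed).
def knn_classify(train_feats, train_labels, test_feat, K=5):
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    nearest = np.argsort(dists)[:K]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = rng.random((30, 500))              # 30 training videos, 500-d features
    y = [i % 3 for i in range(30)]         # 3 toy activity classes
    print(knn_classify(X, y, rng.random(500), K=5))
```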

68 Evaluation on the special case: prediction In the prediction task, we simulate incremental arrival of video frames (represented by observation ratio [0. [sent-194, score-0.513]

69 When recognizing a test video O, it is concatenated with other test videos from other activity classes into a long video, with O at the end. [sent-205, score-0.936]

70 To adapt MMED from early event detection to human activity prediction, we set its minimum searching length to be the temporal length of O, and the step length of the searching to be the segment length of the proposed SC method. [sent-207, score-0.857]

71 If the subsequence returned by MMED contains no less than 50% of the frames in the observed subsequence in O, it is counted as a correct prediction. [sent-208, score-0.39]

72 For each test video, the result is a single human activity class out of all possible activity classes, and we use the average accuracy over all cross-validation tests and all activity classes as a quantitative metric of the performance. [sent-215, score-1.245]

73 The baseline sparse coding method achieves good performance (close to SC and MSSC) on UT-interaction #1, #2 but not in DARPA Y1 because the activities in UT-interaction datasets show very small intra-class variations, while the proposed methods can better handle the large intra-class variations. [sent-218, score-0.341]

74 Evaluation on the general case: gapfilling In the gapfilling task, we compare the proposed SC and MSSC methods with adapted Ryoo’s methods (nondynamic and dynamic versions), KNN (non-dynamic and dynamic versions) and the baseline sparse coding method. [sent-222, score-0.845]

75 Specifically, we adapt Ryoo’s methods to perform activity prediction on these two subsequences, and the gapfilling posterior score is the summation of prediction posterior scores on each observed subsequence. [sent-223, score-1.198]

76 We first perform gapfilling evaluation on UT-interaction #1,#2 and DARPA Y1 datasets. [sent-225, score-0.306]

77 We intentionally replace a subsequence of frames from a test video by empty frames to create a partially observed video. [sent-226, score-0.577]
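
A minimal sketch of this gap construction; zero-valued frames and the boolean observation mask are assumptions about the representation.

```python
import numpy as np

# Minimal sketch: create a partially observed video by blanking the frames in
# O(T1 : T2) and recording which frames remain observed.
def make_gapped(frames, T1, T2):
    gapped = frames.copy()
    gapped[T1:T2] = 0                      # "empty" frames for the gap (assumed)
    observed = np.ones(len(frames), dtype=bool)
    observed[T1:T2] = False
    return gapped, observed

if __name__ == "__main__":
    video = np.random.default_rng(4).random((100, 64, 64))   # 100 toy frames
    gapped, mask = make_gapped(video, 40, 70)
    print(int(mask.sum()), "observed frames out of", len(mask))
```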

78 We finally evaluate the accuracy rate in terms of each possible non-observation ratio βˆ by counting the percentage of the correctly recognized test videos with the non-observation ratio βˆ over all folds (10 folds for UT-interaction #1, #2; 20 folds for DARPA Y1) of cross-validation. [sent-239, score-0.469]
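
The per-ratio accuracy can be computed as in the sketch below; the record format is an assumption.

```python
from collections import defaultdict

# Minimal sketch: accuracy per non-observation ratio, pooled over all folds.
def accuracy_by_ratio(records):
    """records: iterable of (non_observation_ratio, predicted_label, true_label)."""
    correct, total = defaultdict(int), defaultdict(int)
    for beta, pred, true in records:
        total[beta] += 1
        correct[beta] += int(pred == true)
    return {beta: correct[beta] / total[beta] for beta in sorted(total)}

if __name__ == "__main__":
    recs = [(0.2, "hug", "hug"), (0.2, "push", "hug"), (0.4, "kick", "kick")]
    print(accuracy_by_ratio(recs))
```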

79 We further evaluate the proposed methods and comparison methods on more complex datasets: the DARPA Mind’s Eye program provides a Year-2 evaluation corpus, which contains 4 sub-datasets of long test videos (each video is longer than ten minutes) with multiple gaps. [sent-250, score-0.436]

80 This dataset is much more challenging than the DARPA Y1 dataset because: 1) the important action units are missing for the underlying activities in many cases; and 2) many activities are performed simultaneously by multiple actors. [sent-252, score-0.366]

81 These three training sets are ‘CD2b’ (22 activity classes, totally 3, 819 training videos), ‘C-D2c’ (16 activity classes, totally 2, 80 training videos) and ‘C-D2bc’ (23 activity classes, totally 4, 409 training videos). [sent-255, score-1.684]

82 In our experiment, for each gap in DARPA Y2-Gapfilling, we construct test video clips by including a certain number of observed frames before and/or after this gap. [sent-257, score-0.446]

83 For each gap, nine video clips are constructed with the gap appearing ‘at the beginning’, ‘in the center’ or ‘at the end’ of the video clip and accounting for 20%, 40% or 60% of the clip length. [sent-258, score-0.589]
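
One way to realize this construction is sketched below; how the observed context is split around the gap and the rounding are assumptions for illustration.

```python
# Minimal sketch: for a gap [g0, g1), build a clip in which the gap occupies a
# given fraction of the clip length and sits at the beginning, center, or end.
def build_clip(g0, g1, fraction, position):
    gap_len = g1 - g0
    clip_len = gap_len / fraction          # so the gap is `fraction` of the clip
    extra = clip_len - gap_len             # observed frames to add around the gap
    if position == "beginning":
        start, end = g0, g1 + extra
    elif position == "end":
        start, end = g0 - extra, g1
    else:                                  # "center"
        start, end = g0 - extra / 2, g1 + extra / 2
    return int(round(start)), int(round(end))

if __name__ == "__main__":
    for frac in (0.2, 0.4, 0.6):
        for pos in ("beginning", "center", "end"):
            print(frac, pos, build_clip(1000, 1100, frac, pos))   # 9 clips per gap
```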

84 This way, we construct a total of 2,403 video clips with a gap and evaluate the recognition results against the human-annotated ground truth (which may give multiple activity labels for a test gapped video clip). [sent-259, score-0.815]

85 However, the general performance is low, which indicates that gapfilling in practical scenarios is far from a solved problem. [sent-263, score-0.328]

86 Previously published recognition methods are mostly evaluated on short video clips, each containing a single activity, which cannot reflect real performance in practical scenarios. [sent-267, score-0.291]

87 0), we further test these methods on DARPA Year-2 Recognition, a dataset provided by the DARPA Mind’s Eye program for large-scale activity recognition evaluation. [sent-347, score-0.572]

88 We break the long video into partially overlapping short clips using sliding windows for activity recognition. [sent-350, score-0.811]

89 For the proposed SC and MSSC methods, Ryoo’s methods (non-dynamic and dynamic versions), and the baseline sparse coding method, we calculate the posteriors of each activity presented in each short clip. [sent-351, score-0.721]

90 We normalize the posterior scores that an activity is present in each short clip and label the video clip with the activities that have posterior scores larger than a pre-set threshold τ. [sent-352, score-1.02]
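
A minimal sketch of this per-clip labeling step; the sum-to-one normalization and class names are assumptions.

```python
import numpy as np

# Minimal sketch: normalize the per-clip posterior scores over the activity
# classes and keep every activity whose normalized score exceeds the threshold tau.
def label_clip(scores, class_names, tau=0.1):
    scores = np.asarray(scores, dtype=float)
    probs = scores / scores.sum()
    return [c for c, p in zip(class_names, probs) if p > tau]

if __name__ == "__main__":
    classes = ["approach", "carry", "dig", "run"]     # toy class names
    print(label_clip([0.9, 0.5, 0.05, 0.4], classes, tau=0.2))
```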

91 For C2 and Action Bank methods, we use their default parameters to detect activities in each constructed video clip. [sent-355, score-0.338]

92 We check the overlap between the sliding window with the recognized activities and the ground-truth labeling of the activities (starting and ending frames) using the intersection/union ratio. [sent-356, score-0.352]

93 Given the identical activity label, we threshold this overlap ratio to get precision/recall values and then combine them into an F1-measure, which is shown in Table 1. [sent-357, score-0.507]
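
The scoring can be sketched as below; the greedy one-to-one matching of detections to ground-truth intervals is an assumption.

```python
# Minimal sketch: temporal intersection-over-union matching of detected
# activity intervals against ground truth, then precision, recall, and F1.
def temporal_iou(a, b):
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def prf1(detections, ground_truth, iou_thresh=0.5):
    """detections / ground_truth: lists of (label, start_frame, end_frame)."""
    matched, tp = set(), 0
    for label, s, e in detections:
        for j, (gl, gs, ge) in enumerate(ground_truth):
            if j not in matched and gl == label and temporal_iou((s, e), (gs, ge)) >= iou_thresh:
                matched.add(j)
                tp += 1
                break
    prec = tp / len(detections) if detections else 0.0
    rec = tp / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return prec, rec, f1

if __name__ == "__main__":
    dets = [("run", 0, 50), ("carry", 60, 120)]
    gt = [("run", 10, 55), ("dig", 200, 260)]
    print(prf1(dets, gt))
```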

94 However, the general performance is very low and this indicates that there is still a long way to go to achieve good activity recognition in practical scenarios. [sent-359, score-0.553]

95 Conclusion In this paper, we proposed novel methods for recognizing human activities from partially observed videos. [sent-361, score-0.37]

96 We formulated the problem as a posterior-maximization problem whose likelihood is calculated on each activity temporal stage using a sparse coding (SC) technique. [sent-362, score-0.798]

97 We further include more sparse coding bases for a mixture of varied-length and/or varied-location segments (MSSC) from the training videos. [sent-363, score-0.425]

98 We evaluated the proposed SC and MSSC methods on three tasks: activity prediction, gapfilling and full-video recognition. [sent-364, score-0.782]

99 Human activity prediction: Early recognition of ongoing activities from streaming videos. [sent-505, score-0.688]

100 Action bank: A high-level representation of activity in video. [sent-529, score-0.476]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('activity', 0.476), ('gapfilling', 0.306), ('mssc', 0.306), ('darpa', 0.291), ('ti', 0.281), ('ryoo', 0.167), ('video', 0.163), ('mmed', 0.162), ('videos', 0.161), ('ap', 0.149), ('sc', 0.147), ('activities', 0.147), ('subsequence', 0.146), ('prediction', 0.117), ('coding', 0.104), ('bases', 0.101), ('segments', 0.099), ('gap', 0.095), ('segment', 0.091), ('temporal', 0.077), ('partially', 0.075), ('likelihood', 0.073), ('spatiotemporal', 0.072), ('gapped', 0.072), ('observed', 0.07), ('unobserved', 0.069), ('subsequences', 0.066), ('posterior', 0.056), ('gapfilliing', 0.054), ('human', 0.053), ('action', 0.053), ('mind', 0.051), ('duration', 0.049), ('hio', 0.048), ('lengths', 0.048), ('clip', 0.047), ('eye', 0.046), ('folds', 0.046), ('clips', 0.046), ('training', 0.046), ('sparse', 0.046), ('test', 0.044), ('recognized', 0.038), ('versions', 0.037), ('nondynamic', 0.036), ('unfinished', 0.036), ('utinteraction', 0.036), ('observation', 0.035), ('streaming', 0.033), ('recognition', 0.032), ('ratio', 0.031), ('pages', 0.031), ('event', 0.03), ('early', 0.03), ('mixture', 0.029), ('calculation', 0.028), ('constructed', 0.028), ('frames', 0.028), ('short', 0.028), ('kitani', 0.027), ('knn', 0.026), ('cross', 0.026), ('organize', 0.025), ('validations', 0.025), ('length', 0.025), ('implement', 0.025), ('fully', 0.025), ('recognizing', 0.025), ('toronto', 0.025), ('totally', 0.024), ('frame', 0.023), ('baseline', 0.023), ('long', 0.023), ('posteriors', 0.023), ('hte', 0.023), ('bank', 0.023), ('actions', 0.023), ('class', 0.023), ('recognize', 0.023), ('intentionally', 0.023), ('stage', 0.022), ('practical', 0.022), ('special', 0.022), ('varies', 0.022), ('laptev', 0.022), ('description', 0.022), ('divide', 0.021), ('datasets', 0.021), ('overly', 0.021), ('icpr', 0.021), ('dynamic', 0.021), ('ending', 0.02), ('program', 0.02), ('gaps', 0.019), ('missing', 0.019), ('overhead', 0.018), ('degenerate', 0.018), ('studied', 0.018), ('adapted', 0.018), ('surveillance', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999851 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos

Author: Yu Cao, Daniel Barrett, Andrei Barbu, Siddharth Narayanaswamy, Haonan Yu, Aaron Michaux, Yuewei Lin, Sven Dickinson, Jeffrey Mark Siskind, Song Wang

Abstract: Recognizing human activities in partially observed videos is a challenging problem and has many practical applications. When the unobserved subsequence is at the end of the video, the problem is reduced to activity prediction from unfinished activity streaming, which has been studied by many researchers. However, in the general case, an unobserved subsequence may occur at any time by yielding a temporal gap in the video. In this paper, we propose a new method that can recognize human activities from partially observed videos in the general case. Specifically, we formulate the problem into a probabilistic framework: 1) dividing each activity into multiple ordered temporal segments, 2) using spatiotemporal features of the training video samples in each segment as bases and applying sparse coding (SC) to derive the activity likelihood of the test video sample at each segment, and 3) finally combining the likelihood at each segment to achieve a global posterior for the activities. We further extend the proposed method to include more bases that correspond to a mixture of segments with different temporal lengths (MSSC), which can better represent the activities with large intra-class variations. We evaluate the proposed methods (SC and MSSC) on various real videos. We also evaluate the proposed methods on two special cases: 1) activity prediction where the unobserved subsequence is at the end of the video, and 2) human activity recognition on fully observed videos. Experimental results show that the proposed methods outperform existing state-of-the-art comparison methods.

2 0.39014953 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?

Author: Michael S. Ryoo, Larry Matthies

Abstract: This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint. The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video inputs. These include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects to the observer’, whose videos involve a large amount of camera ego-motion caused by physical interactions. The paper investigates multichannel kernels to integrate global and local motion information, and presents a new activity learning/recognition methodology that explicitly considers temporal structures displayed in first-person activity videos. In our experiments, we not only show classification results with segmented videos, but also confirm that our new approach is able to detect activities from continuous videos reliably.

3 0.36424497 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video

Author: Yingying Zhu, Nandita M. Nayak, Amit K. Roy-Chowdhury

Abstract: In this paper, rather than modeling activities in videos individually, we propose a hierarchical framework that jointly models and recognizes related activities using motion and various context features. This is motivated from the observations that the activities related in space and time rarely occur independently and can serve as the context for each other. Given a video, action segments are automatically detected using motion segmentation based on a nonlinear dynamical model. We aim to merge these segments into activities of interest and generate optimum labels for the activities. Towards this goal, we utilize a structural model in a max-margin framework that jointly models the underlying activities which are related in space and time. The model explicitly learns the duration, motion and context patterns for each activity class, as well as the spatio-temporal relationships for groups of them. The learned model is then used to optimally label the activities in the testing videos using a greedy search method. We show promising results on the VIRAT Ground Dataset demonstrating the benefit of joint modeling and recognizing activities in a wide-area scene.

4 0.31923893 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities

Author: Raghuraman Gopalan

Abstract: While the notion of joint sparsity in understanding common and innovative components of a multi-receiver signal ensemble has been well studied, we investigate the utility of such joint sparse models in representing information contained in a single video signal. By decomposing the content of a video sequence into that observed by multiple spatially and/or temporally distributed receivers, we first recover a collection of common and innovative components pertaining to individual videos. We then present modeling strategies based on subspace-driven manifold metrics to characterize patterns among these components, across other videos in the system, to perform subsequent video analysis. We demonstrate the efficacy of our approach for activity classification and clustering by reporting competitive results on standard datasets such as, HMDB, UCF-50, Olympic Sports and KTH.

5 0.19868016 49 cvpr-2013-Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition

Author: Vinay Bettadapura, Grant Schindler, Thomas Ploetz, Irfan Essa

Abstract: We present data-driven techniques to augment Bag of Words (BoW) models, which allow for more robust modeling and recognition of complex long-term activities, especially when the structure and topology of the activities are not known a priori. Our approach specifically addresses the limitations of standard BoW approaches, which fail to represent the underlying temporal and causal information that is inherent in activity streams. In addition, we also propose the use of randomly sampled regular expressions to discover and encode patterns in activities. We demonstrate the effectiveness of our approach in experimental evaluations where we successfully recognize activities and detect anomalies in four complex datasets.

6 0.16084597 348 cvpr-2013-Recognizing Activities via Bag of Words for Attribute Dynamics

7 0.16025692 287 cvpr-2013-Modeling Actions through State Changes

8 0.15314592 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots

9 0.14654453 187 cvpr-2013-Geometric Context from Videos

10 0.13942114 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches

11 0.12575132 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences

12 0.12349703 62 cvpr-2013-Bilinear Programming for Human Activity Recognition with Unknown MRF Graphs

13 0.12111337 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition

14 0.11393902 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos

15 0.11296982 178 cvpr-2013-From Local Similarity to Global Coding: An Application to Image Classification

16 0.11273348 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model

17 0.11135039 407 cvpr-2013-Spatio-temporal Depth Cuboid Similarity Feature for Activity Recognition Using Depth Camera

18 0.1091188 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding

19 0.10906269 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors

20 0.10227663 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.19), (1, -0.097), (2, -0.039), (3, -0.157), (4, -0.207), (5, -0.0), (6, -0.105), (7, -0.032), (8, -0.094), (9, 0.037), (10, 0.132), (11, -0.14), (12, 0.042), (13, -0.035), (14, -0.022), (15, 0.056), (16, 0.052), (17, 0.156), (18, -0.032), (19, -0.178), (20, -0.122), (21, 0.114), (22, 0.076), (23, -0.097), (24, -0.025), (25, 0.036), (26, -0.081), (27, 0.043), (28, 0.073), (29, 0.161), (30, 0.037), (31, 0.014), (32, -0.027), (33, -0.078), (34, 0.052), (35, 0.017), (36, -0.048), (37, -0.101), (38, -0.011), (39, 0.017), (40, -0.106), (41, 0.05), (42, 0.154), (43, -0.045), (44, -0.064), (45, -0.063), (46, 0.104), (47, -0.023), (48, -0.048), (49, -0.069)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9694916 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos

Author: Yu Cao, Daniel Barrett, Andrei Barbu, Siddharth Narayanaswamy, Haonan Yu, Aaron Michaux, Yuewei Lin, Sven Dickinson, Jeffrey Mark Siskind, Song Wang

Abstract: Recognizing human activities in partially observed videos is a challenging problem and has many practical applications. When the unobserved subsequence is at the end of the video, the problem is reduced to activity prediction from unfinished activity streaming, which has been studied by many researchers. However, in the general case, an unobserved subsequence may occur at any time by yielding a temporal gap in the video. In this paper, we propose a new method that can recognize human activities from partially observed videos in the general case. Specifically, we formulate the problem into a probabilistic framework: 1) dividing each activity into multiple ordered temporal segments, 2) using spatiotemporal features of the training video samples in each segment as bases and applying sparse coding (SC) to derive the activity likelihood of the test video sample at each segment, and 3) finally combining the likelihood at each segment to achieve a global posterior for the activities. We further extend the proposed method to include more bases that correspond to a mixture of segments with different temporal lengths (MSSC), which can better represent the activities with large intra-class variations. We evaluate the proposed methods (SC and MSSC) on various real videos. We also evaluate the proposed methods on two special cases: 1) activity prediction where the unobserved subsequence is at the end of the video, and 2) human activity recognition on fully observed videos. Experimental results show that the proposed methods outperform existing state-of-the-art comparison methods.

2 0.90294904 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video

Author: Yingying Zhu, Nandita M. Nayak, Amit K. Roy-Chowdhury

Abstract: In this paper, rather than modeling activities in videos individually, we propose a hierarchical framework that jointly models and recognizes related activities using motion and various context features. This is motivated from the observations that the activities related in space and time rarely occur independently and can serve as the context for each other. Given a video, action segments are automatically detected using motion segmentation based on a nonlinear dynamical model. We aim to merge these segments into activities of interest and generate optimum labels for the activities. Towards this goal, we utilize a structural model in a max-margin framework that jointly models the underlying activities which are related in space and time. The model explicitly learns the duration, motion and context patterns for each activity class, as well as the spatio-temporal relationships for groups of them. The learned model is then used to optimally label the activities in the testing videos using a greedy search method. We show promising results on the VIRAT Ground Dataset demonstrating the benefit of joint modeling and recognizing activities in a wide-area scene.

3 0.87143332 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?

Author: Michael S. Ryoo, Larry Matthies

Abstract: This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint. The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video inputs. These include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects to the observer’, whose videos involve a large amount of camera ego-motion caused by physical interactions. The paper investigates multichannel kernels to integrate global and local motion information, and presents a new activity learning/recognition methodology that explicitly considers temporal structures displayed in first-person activity videos. In our experiments, we not only show classification results with segmented videos, but also confirm that our new approach is able to detect activities from continuous videos reliably.

4 0.84053123 49 cvpr-2013-Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition

Author: Vinay Bettadapura, Grant Schindler, Thomas Ploetz, Irfan Essa

Abstract: We present data-driven techniques to augment Bag of Words (BoW) models, which allow for more robust modeling and recognition of complex long-term activities, especially when the structure and topology of the activities are not known a priori. Our approach specifically addresses the limitations of standard BoW approaches, which fail to represent the underlying temporal and causal information that is inherent in activity streams. In addition, we also propose the use of randomly sampled regular expressions to discover and encode patterns in activities. We demonstrate the effectiveness of our approach in experimental evaluations where we successfully recognize activities and detect anomalies in four complex datasets.

5 0.80418062 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities

Author: Raghuraman Gopalan

Abstract: While the notion of joint sparsity in understanding common and innovative components of a multi-receiver signal ensemble has been well studied, we investigate the utility of such joint sparse models in representing information contained in a single video signal. By decomposing the content of a video sequence into that observed by multiple spatially and/or temporally distributed receivers, we first recover a collection of common and innovative components pertaining to individual videos. We then present modeling strategies based on subspace-driven manifold metrics to characterize patterns among these components, across other videos in the system, to perform subsequent video analysis. We demonstrate the efficacy of our approach for activity classification and clustering by reporting competitive results on standard datasets such as, HMDB, UCF-50, Olympic Sports and KTH.

6 0.62823302 62 cvpr-2013-Bilinear Programming for Human Activity Recognition with Unknown MRF Graphs

7 0.60737514 348 cvpr-2013-Recognizing Activities via Bag of Words for Attribute Dynamics

8 0.59778816 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos

9 0.53319925 103 cvpr-2013-Decoding Children's Social Behavior

10 0.52171737 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model

11 0.51755792 133 cvpr-2013-Discriminative Segment Annotation in Weakly Labeled Video

12 0.51469082 413 cvpr-2013-Story-Driven Summarization for Egocentric Video

13 0.50867605 187 cvpr-2013-Geometric Context from Videos

14 0.50460017 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors

15 0.45687726 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding

16 0.44502997 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition

17 0.43561259 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition

18 0.42946339 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots

19 0.4245947 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization

20 0.42051038 353 cvpr-2013-Relative Hidden Markov Models for Evaluating Motion Skill


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.115), (13, 0.238), (16, 0.018), (26, 0.049), (28, 0.013), (33, 0.241), (39, 0.01), (67, 0.089), (69, 0.048), (87, 0.085)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.85465896 42 cvpr-2013-Analytic Bilinear Appearance Subspace Construction for Modeling Image Irradiance under Natural Illumination and Non-Lambertian Reflectance

Author: Shireen Y. Elhabian, Aly A. Farag

Abstract: Conventional subspace construction approaches suffer from the need of a “large-enough” image ensemble, rendering numerical methods intractable. In this paper, we propose an analytic formulation for low-dimensional subspace construction in which shading cues lie while preserving the natural structure of an image sample. Using the frequency-space representation of the image irradiance equation, the process of finding such subspace is cast as establishing a relation between its principal components and that of a deterministic set of basis functions, termed as irradiance harmonics. Representing images as matrices further lessens the number of parameters to be estimated to define a bilinear projection which maps the image sample to a lower-dimensional bilinear subspace. Results show significant impact on dimensionality reduction with minimal loss of information as well as robustness against noise.

2 0.81822324 292 cvpr-2013-Multi-agent Event Detection: Localization and Role Assignment

Author: Suha Kwak, Bohyung Han, Joon Hee Han

Abstract: We present a joint estimation technique of event localization and role assignment when the target video event is described by a scenario. Specifically, to detect multi-agent events from video, our algorithm identifies agents involved in an event and assigns roles to the participating agents. Instead of iterating through all possible agent-role combinations, we formulate the joint optimization problem as two efficient subproblems—quadratic programming for role assignment followed by linear programming for event localization. Additionally, we reduce the computational complexity significantly by applying role-specific event detectors to each agent independently. We test the performance of our algorithm in natural videos, which contain multiple target events and nonparticipating agents.

same-paper 3 0.81615305 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos

Author: Yu Cao, Daniel Barrett, Andrei Barbu, Siddharth Narayanaswamy, Haonan Yu, Aaron Michaux, Yuewei Lin, Sven Dickinson, Jeffrey Mark Siskind, Song Wang

Abstract: Recognizing human activities in partially observed videos is a challenging problem and has many practical applications. When the unobserved subsequence is at the end of the video, the problem is reduced to activity prediction from unfinished activity streaming, which has been studied by many researchers. However, in the general case, an unobserved subsequence may occur at any time by yielding a temporal gap in the video. In this paper, we propose a new method that can recognize human activities from partially observed videos in the general case. Specifically, we formulate the problem into a probabilistic framework: 1) dividing each activity into multiple ordered temporal segments, 2) using spatiotemporal features of the training video samples in each segment as bases and applying sparse coding (SC) to derive the activity likelihood of the test video sample at each segment, and 3) finally combining the likelihood at each segment to achieve a global posterior for the activities. We further extend the proposed method to include more bases that correspond to a mixture of segments with different temporal lengths (MSSC), which can better represent the activities with large intra-class variations. We evaluate the proposed methods (SC and MSSC) on various real videos. We also evaluate the proposed methods on two special cases: 1) activity prediction where the unobserved subsequence is at the end of the video, and 2) human activity recognition on fully observed videos. Experimental results show that the proposed methods outperform existing state-of-the-art comparison methods.

4 0.79545861 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?

Author: Michael S. Ryoo, Larry Matthies

Abstract: This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint. The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video inputs. These include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects to the observer’, whose videos involve a large amount of camera ego-motion caused by physical interactions. The paper investigates multichannel kernels to integrate global and local motion information, and presents a new activity learning/recognition methodology that explicitly considers temporal structures displayed in first-person activity videos. In our experiments, we not only show classification results with segmented videos, but also confirm that our new approach is able to detect activities from continuous videos reliably.

5 0.7771697 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

Author: Ian Endres, Kevin J. Shih, Johnston Jiaa, Derek Hoiem

Abstract: We propose a method to learn a diverse collection of discriminative parts from object bounding box annotations. Part detectors can be trained and applied individually, which simplifies learning and extension to new features or categories. We apply the parts to object category detection, pooling part detections within bottom-up proposed regions and using a boosted classifier with proposed sigmoid weak learners for scoring. On PASCAL VOC 2010, we evaluate the part detectors ’ ability to discriminate and localize annotated keypoints. Our detection system is competitive with the best-existing systems, outperforming other HOG-based detectors on the more deformable categories.

6 0.76908523 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation

7 0.76870012 414 cvpr-2013-Structure Preserving Object Tracking

8 0.76853442 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities

9 0.76732916 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval

10 0.7669338 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection

11 0.76664793 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

12 0.76630872 325 cvpr-2013-Part Discovery from Partial Correspondence

13 0.76574624 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation

14 0.76546746 30 cvpr-2013-Accurate Localization of 3D Objects from RGB-D Data Using Segmentation Hypotheses

15 0.7648477 277 cvpr-2013-MODEC: Multimodal Decomposable Models for Human Pose Estimation

16 0.76476562 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection

17 0.76469111 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image

18 0.76397127 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection

19 0.76369405 242 cvpr-2013-Label Propagation from ImageNet to 3D Point Clouds

20 0.76317155 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path