cvpr cvpr2013 cvpr2013-175 knowledge-graph by maker-knowledge-mining

175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?


Source: pdf

Author: Michael S. Ryoo, Larry Matthies

Abstract: This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint. The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video inputs. These include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects to the observer’, whose videos involve a large amount of camera ego-motion caused by physical interactions. The paper investigates multichannel kernels to integrate global and local motion information, and presents a new activity learning/recognition methodology that explicitly considers temporal structures displayed in first-person activity videos. In our experiments, we not only show classification results with segmented videos, but also confirm that our new approach is able to detect activities from continuous videos reliably.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint. [sent-6, score-0.37]

2 The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video inputs. [sent-9, score-0.857]

3 These include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects to the observer’, whose videos involve a large amount of camera ego-motion caused by physical interactions. [sent-10, score-0.829]

4 The paper investigates multichannel kernels to integrate global and local motion information, and presents a new activity learning/recognition methodology that explicitly considers temporal structures displayed in first-person activity videos. [sent-11, score-1.534]

5 In our experiments, we not only show classification results with segmented videos, but also confirm that our new approach is able to detect activities from continuous videos reliably. [sent-12, score-0.599]

6 Introduction: In the past decade, there has been a large amount of progress in human activity recognition research. [sent-14, score-0.635]

7 Researchers not only focused on developing reliable video features robust to noise and illumination changes [14, 3, 7], but also proposed various types of hierarchical approaches to recognize high-level activities with multiple actors [12, 9, 17] and even group activities [13]. [sent-15, score-0.808]

8 However, most of these previous works focused on activity recognition from a 3rd-person perspective (i. [sent-17, score-0.566]

9 This 3rd-person activity recognition paradigm is insufficient when the observer itself is involved in interactions, such as ‘a person attacking the camera’. [sent-23, score-0.93]

10 What we require is the ability to recognize physical and social human activities targeted to the observer (e. [sent-25, score-0.759]

11 , a wearable camera or a robot) from its viewpoint: first-person human activity recognition. [sent-27, score-0.692]

12 Even though there have been previous attempts to recognize activities from first-person videos [6, 4, 10], they focused on recognition of ego-actions of the person wearing the camera (e. [sent-29, score-0.661]

13 There are also works on recognition of gesture-level motion to the sensor [16] and analysis of face/eye directions [5], but recognition of high-level activities involving physical interactions (e. [sent-32, score-0.659]

14 Recognition of ‘what others are doing to the observer’ from its own perspective is not only crucial for any surveillance or military systems to protect themselves from harmful activities by hostile humans, but is also very important for friendly human-robot interaction scenarios (e. [sent-35, score-0.471]

15 In this paper, we introduce our new dataset composed of first-person videos collected during humans’ interaction with the observer, and investigate features and approaches necessary for the system to understand activities from such videos. [sent-38, score-0.501]

16 We particularly focus on two aspects of first-person activity recognition, aiming to provide answers to the following two questions: (1) What features (and their combination) do we need to recognize interaction-level activities from first-person videos? [sent-39, score-0.899]

17 (2) How important is it to consider temporal structure of the activities in first-person recognition? [sent-40, score-0.402]

18 Next, we present a new kernel-based activity recognition approach that explicitly considers temporal structures of first-person activities (Figure 1). [sent-42, score-0.566]

19 Our approach learns sub-events composing an activity and how they are temporally organized, obtaining superior performance in first-person activity recognition. [sent-46, score-1.113]

20 Related works: Computer vision researchers have explored various human activity recognition approaches since the early 1990s [1]. [sent-49, score-0.635]

21 However, these previous human activity recognition works detected human behaviors from videos with third-person viewpoints (e. [sent-52, score-0.935]

22 Even though there are recent works on first-person action recognition from wearable cameras [6, 4, 10, 5], research on recognition of physical human interactions targeted to the camera and their influences on the camera movements has been very limited. [sent-55, score-0.495]

23 First-person video dataset: We constructed a new first-person video dataset containing interactions between humans and the observer. [sent-58, score-0.423]

24 We attached a GoPro camera to the head of a humanoid model (Figure 1), and asked human participants to interact with the humanoid by performing activities. [sent-59, score-0.371]

25 The neutral interaction is the situation where two persons have a conversation about the observer while occasionally pointing at it. [sent-69, score-0.419]

26 Videos were recorded continuously during human activities, and each video sequence contains 0 to 3 activities. [sent-71, score-0.467]

27 Notice that the robot (and its camera) is not stationary and it displays a large amount of ego-motion in its videos, particularly during the human activity. [sent-74, score-0.418]

28 In addition, in order to support the training of the robot, we also prepared a segmented version of the dataset: videos in each dataset are segmented so that each video segment contains one activity execution, providing us with at least 7 video segments per set. [sent-83, score-1.032]

29 We emphasize that our first-person videos are different from public activity recognition datasets (e. [sent-84, score-0.763]

30 Features for first-person videos: In this section, we discuss motion features for first-person videos. [sent-96, score-0.468]

31 We construct and evaluate two categories of video features, global motion descriptors and local motion descriptors, and confirm that each of them contributes to the recognition of different activities from first-person videos. [sent-97, score-0.947]

32 In addition, we present kernel functions to combine global features and local features for the activity recognition. [sent-98, score-0.824]

33 Our kernels reliably integrate both global and local motion information, and we illustrate that these multi-channel kernels benefit first-person activity recognition. [sent-99, score-0.995]

34 Global motion descriptors: For describing global motion in first-person videos, we take advantage of dense optical flows. [sent-111, score-0.499]

35 We designed our global motion descriptor as a histogram of extracted optical flows: We categorize observed optical flows into multiple types based on their locations and directions, and count the number of optical flows belonging to each category. [sent-114, score-0.663]
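
The sketch below illustrates one way such a flow histogram can be computed, using OpenCV's Farneback dense optical flow; the spatial grid size, the number of direction bins, and the magnitude threshold are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of a global motion descriptor: a histogram of dense optical
# flow binned jointly by image location and flow direction.
import cv2
import numpy as np

def global_motion_descriptor(prev_gray, gray, grid=(4, 4), n_dirs=8, min_mag=1.0):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray.shape
    gy, gx = np.mgrid[0:h, 0:w]
    cell = (gy * grid[0] // h) * grid[1] + (gx * grid[1] // w)   # spatial bin index
    mag = np.linalg.norm(flow, axis=2)
    ang = np.arctan2(flow[..., 1], flow[..., 0])                 # flow direction in (-pi, pi]
    direction = ((ang + np.pi) / (2 * np.pi) * n_dirs).astype(int) % n_dirs
    hist = np.zeros(grid[0] * grid[1] * n_dirs)
    mask = mag > min_mag                                         # ignore near-zero flow vectors
    np.add.at(hist, cell[mask] * n_dirs + direction[mask], 1)    # count flows per (cell, direction) bin
    return hist / max(mask.sum(), 1)                             # normalized histogram
```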

36 Local motion descriptors: We use sparse 3-D XYT space-time features as our local motion descriptors. [sent-123, score-0.439]

37 The intention is to abstract local motion information inside each of the detected video patches, and use it as a descriptor. [sent-128, score-0.373]

38 These three descriptors (obtained during different types of ego-motion of the camera) are distinct, suggesting that our descriptors correctly capture observer ego-motion. [sent-131, score-0.496]

39 Visual words: We take advantage of the concept of visual words in order to represent motion information in videos more efficiently. [sent-136, score-0.383]

40 Our clustering and histogram construction processes are applied to the global motion descriptors and local motion descriptors separately. [sent-150, score-0.594]

41 The feature histogram Hi for video vi directly serves as our feature vector representing the video: xi = [Hi1; Hi2], where Hi1 is the histogram of global descriptors and Hi2 is the histogram of local descriptors. [sent-152, score-0.49]
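
A minimal sketch of this bag-of-visual-words step, assuming k-means vocabularies built separately for the global and local channels with scikit-learn; the vocabulary sizes and L1 normalization are illustrative choices rather than the paper's exact settings.

```python
# Sketch: cluster descriptors into visual words per channel, build one histogram
# per video and channel, then concatenate them into x_i = [H_i^1; H_i^2].
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, n_words=200, seed=0):
    return KMeans(n_clusters=n_words, n_init=10, random_state=seed).fit(all_descriptors)

def video_histogram(vocab, descriptors):
    words = vocab.predict(descriptors)                        # assign each descriptor to a word
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)                        # L1-normalize

def video_feature(global_vocab, local_vocab, global_desc, local_desc):
    h1 = video_histogram(global_vocab, global_desc)           # H_i^1: global channel
    h2 = video_histogram(local_vocab, local_desc)             # H_i^2: local channel
    return np.concatenate([h1, h2])                           # x_i = [H_i^1; H_i^2]
```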

42 Multi-channel kernels: We present multi-channel kernels that consider both global features and local features for computing video similarities. [sent-155, score-0.502]

43 In order to integrate both global and local motion cues for reliable recognition from first-person videos, we defined multi-channel kernels that lead to the computation of a non-linear decision boundary. [sent-162, score-0.398]

44 These multi-channel kernels robustly combine information from both channels (global motion and local motion in our case). [sent-164, score-0.453]
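
A sketch of one common multi-channel formulation, K(xi, xj) = exp(-Σc Dc(xi, xj)/Ac), where Dc is a χ2 distance on channel c and Ac is the mean distance of that channel; this follows standard practice for multi-channel kernels and may differ from the paper's exact weighting.

```python
# Sketch of a multi-channel chi-square kernel: per-channel chi-square distances
# are normalized by the channel's mean distance, summed, and exponentiated into
# a Gram matrix usable as a precomputed SVM kernel.
import numpy as np

def chi2_distance(H):                        # H: (n_videos, n_bins) histograms of one channel
    num = (H[:, None, :] - H[None, :, :]) ** 2
    den = H[:, None, :] + H[None, :, :] + 1e-10
    return 0.5 * (num / den).sum(axis=2)     # pairwise distances, shape (n_videos, n_videos)

def multichannel_kernel(channels):           # channels: list of per-channel histogram matrices
    total = 0.0
    for H in channels:
        D = chi2_distance(H)
        A = D.mean()                         # channel normalizer (mean distance)
        total = total + D / max(A, 1e-10)
    return np.exp(-total)
```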

45 Figure 5: (a) global descriptors, (b) local descriptors, (c) histogram intersection, (d) χ2 kernel. [sent-222, score-0.577]

46 First, we evaluate the activity classification ability of our approach while forcing the system to use only one of the two motion features (global vs. local motion). [sent-227, score-0.789]

47 The objective is to identify which motion representation contributes to recognition of which activity, and confirm that using two types of motion features jointly (using our multi-channel kernel) will benefit the overall recognition. [sent-229, score-0.452]

48 We implemented two baseline activity classifiers: both are support vector machine (SVM) classifiers that use a standard Gaussian kernel and rely on only one feature channel (either global or local) for classification. [sent-230, score-0.848]
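
A minimal sketch of such a single-channel baseline with scikit-learn; the hyperparameters C and gamma are placeholders, not values reported in the paper.

```python
# Sketch of the single-channel SVM baselines: one classifier trained on the
# global histogram H^1 only, another on the local histogram H^2 only.
from sklearn.svm import SVC

def train_single_channel_baseline(X_channel, labels, C=10.0, gamma="scale"):
    clf = SVC(kernel="rbf", C=C, gamma=gamma)   # standard Gaussian (RBF) kernel
    clf.fit(X_channel, labels)
    return clf

# Usage sketch (n_global_words is the size of the global vocabulary):
#   clf_global = train_single_channel_baseline(X[:, :n_global_words], y)
#   clf_local  = train_single_channel_baseline(X[:, n_global_words:], y)
```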

49 This confirms that utilizing both global and local motion benefits overall recognition of human activities from first-person videos, and that our kernel functions are able to combine such information reliably. [sent-246, score-0.874]

50 Recognition with activity structure: In the case of high-level activities, considering activities’ structures is crucial for their reliable recognition. [sent-248, score-0.648]

51 That is, we must determine what temporal parts (i.e., sub-events) the activity should be divided into and how they must be organized temporally. [sent-251, score-0.522]

52 This is particularly important for interaction-level activities where cause-and-effect relations are explicitly displayed, such as the observer ‘collapsing’ as a result of a person ‘hitting’ it in the punching interaction. [sent-252, score-0.885]

53 The system must learn the structure representation of each activity and take advantage of it for more reliable recognition. [sent-253, score-0.614]

54 In this section, we present a new recognition methodology that explicitly considers the activity structure, and investigate how important it is to learn/use structures for first-person activity videos. [sent-254, score-1.157]

55 We first describe our structure representation, and define a new kernel function computing video distances given a particular structure. [sent-255, score-0.376]

56 Next, we present an algorithm to search for the best activity structure given training videos. [sent-256, score-0.642]

57 Hierarchical structure match kernel: We represent an activity as a continuous concatenation of its sub-events. [sent-260, score-0.885]

58 That is, we define the structure of an activity as a particular division that temporally splits an entire video containing the activity into multiple video segments. [sent-261, score-1.409]

59 Any activity structure constructed by applying a number of production rules starting from S[0, 1] (until they reach terminals) is considered as a valid structure (e. [sent-263, score-0.803]
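
The sketch below enumerates candidate structures by recursively applying a binary split rule to the interval [0, 1]; restricting the split points to a small grid and bounding the recursion depth are assumptions made here for tractability, not the paper's exact production rules.

```python
# Sketch of candidate structure generation: a production rule splits an interval
# (a, b) at a point t into (a, t) and (t, b); recursion stops at terminals.
def candidate_structures(a=0.0, b=1.0, depth=2, grid=(0.25, 0.5, 0.75)):
    yield (a, b)                                        # terminal: keep the interval unsplit
    if depth == 0:
        return
    for t in grid:
        s = a + (b - a) * t                             # split point inside (a, b)
        for left in candidate_structures(a, s, depth - 1, grid):
            for right in candidate_structures(s, b, depth - 1, grid):
                yield (left, right)                     # one application of a production rule

# e.g. ((0.0, 0.5), ((0.5, 0.75), (0.75, 1.0))) is one valid hierarchical structure.
```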

60 An example matching between two hugging videos, xi and xj, using the kernel KS constructed from the hierarchical structure S = (((0, 0. [sent-285, score-0.397]

61 That is, if two videos contain an identical activity and if they are divided into video segments based on the correct activity structure, the similarity between each pair of video segments must be high. [sent-293, score-1.565]

62 Given a particular activity structure S, we define the kernel function kS(xi, xj) measuring the distance between two feature vectors xi and xj with two equations: kS[t1,t2](xi, xj) = kS[t1,t3](xi, xj) + kS[t3,t2](xi, xj), where t3 is the split point of the interval [t1, t2], and a terminal kernel k(t1,t2)(xi, xj) defined as a sum over the n bins of the interval histograms of xi and xj on [t1, t2]. [sent-294, score-0.898]

63 Notice that this structure kernel is constructed for each channel c, resulting in a multi-channel kernel integrating (i. [sent-296, score-0.441]

64 Our structure match kernel can be efficiently implemented with temporal integral histograms [11], which allow us to obtain a feature histogram of any particular time interval in O(w). [sent-305, score-0.428]
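
A sketch of this idea: with a temporal integral histogram (cumulative visual-word counts per frame), the histogram of any interval [t1, t2] is the difference of two rows, and the structure kernel recursively sums a base segment similarity over the segments defined by the structure. Histogram intersection is used here as one plausible terminal kernel; the paper's exact base kernel may differ.

```python
# Sketch of the structure match kernel on top of temporal integral histograms.
# cum[t] holds cumulative visual-word counts up to frame t, so the histogram of
# a relative interval [t1, t2] is cum[t2] - cum[t1] (two lookups).
import numpy as np

def interval_hist(cum, t1, t2):
    i1, i2 = int(t1 * (len(cum) - 1)), int(t2 * (len(cum) - 1))
    h = cum[i2] - cum[i1]
    return h / max(h.sum(), 1.0)

def structure_kernel(S, cum_i, cum_j):
    if isinstance(S[0], tuple):                         # non-terminal: (left, right) sub-structures
        return structure_kernel(S[0], cum_i, cum_j) + structure_kernel(S[1], cum_i, cum_j)
    t1, t2 = S                                          # terminal interval
    hi, hj = interval_hist(cum_i, t1, t2), interval_hist(cum_j, t1, t2)
    return np.minimum(hi, hj).sum()                     # histogram intersection on the segment
```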

65 Structure learning: In this subsection, we present our approach to learn the activity structure and its kernel that best match the training videos. [sent-311, score-0.803]

66 We first introduce kernel target alignment [15], which measures the angle between two Gram matrices, and show that it can be used to evaluate structure kernels for our activity recognition. [sent-312, score-0.979]

67 The idea is to represent the ‘optimal kernel function’ and candidate structure kernels in terms of Gram matrices and measure their similarities. [sent-313, score-0.372]

68 We take advantage of the kernel target alignment for evaluating candidate activity structures. [sent-341, score-0.768]

69 Lij = 0 if yi = yj, and 1 otherwise, (9) where yi is the activity class label corresponding to the training sample xi. [sent-343, score-0.55]

70 The matrix L essentially indicates that the distance between any two training samples must be 0 if they have an identical activity class, and 1 otherwise. [sent-344, score-0.55]
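
A sketch of the alignment score in its usual Frobenius-inner-product form; here the target is built directly from the class labels as a similarity matrix (1 within a class, 0 across classes), which plays the role of the ideal Gram matrix corresponding to the distance matrix L described above.

```python
# Sketch of kernel target alignment: compare the Gram matrix of a candidate
# structure kernel against an ideal label-derived target.
import numpy as np

def target_matrix(labels):
    y = np.asarray(labels)
    return (y[:, None] == y[None, :]).astype(float)     # 1 if same activity class, 0 otherwise

def alignment(K, Y):
    # cosine of the angle between the two Gram matrices (Frobenius inner products)
    num = (K * Y).sum()
    den = np.sqrt((K * K).sum() * (Y * Y).sum())
    return num / max(den, 1e-10)
```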

71 This provides the system with the ability to score possible activity structure candidates so that it can search for the best structure S∗. [sent-347, score-0.706]

72 With the above formulation, structure learning is interpreted as a search for S[0, 1]∗, the best structure dividing the entire activity duration [0, 1], among an exponential number of possible structures. [sent-359, score-0.706]
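
A sketch of this search, reusing the helpers from the sketches above (candidate_structures, structure_kernel, target_matrix, alignment); the exhaustive loop over a small grid-limited candidate set is an illustrative simplification of the search over an exponential structure space.

```python
# Sketch of structure learning: score each candidate structure by the alignment
# of its kernel's Gram matrix with the label target, and keep the best one.
import numpy as np

def learn_structure(train_cums, labels, depth=2):
    Y = target_matrix(labels)
    best_S, best_score = None, -np.inf
    for S in candidate_structures(depth=depth):
        K = np.array([[structure_kernel(S, ci, cj) for cj in train_cums]
                      for ci in train_cums])            # Gram matrix of this structure kernel
        score = alignment(K, Y)
        if score > best_score:
            best_S, best_score = S, score
    return best_S, best_score
```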

73 This structure can either be learned per activity, or the system may learn the common structure suitable for all activity classes. [sent-372, score-0.706]

74 One common structure that best distinguishes videos with different activities was obtained, and our kernel function corresponding to the learned structure was constructed. [sent-381, score-0.817]

75 In addition, in order to illustrate the advantage of our structure learning and recognition for first-person videos, we tested two state-of-the-art activity recognition approaches: spatio-temporal pyramid matching [2], and dynamic bag-of-words (BoW) [11]. [sent-383, score-0.733]

76 The spatio-temporal pyramid match kernel is a spatio-temporal version of the spatial pyramid match kernel [8]. [sent-384, score-0.5]

77 However, in dynamic BoW, an activity model was learned using only videos belonging to that class, without considering other activity videos, which results in inferior performance. [sent-390, score-1.241]

78 This confirms that learning the optimal structure suitable for activity videos benefits their recognition, particularly in the first-person activity recognition setting. [sent-401, score-1.548]

79 Evaluation - detection: In this subsection, we evaluate the activity detection ability of our approach using the first-person dataset. [sent-404, score-0.522]

80 Activity detection is the process of finding the correct starting and ending times of an activity in continuous videos. [sent-405, score-0.574]

81 Given continuous video input (i.e., continuous observations from a camera), for each activity, the system must decide whether the activity is contained in the video and when it is occurring. [sent-408, score-0.697]

82 Multiple activity durations learned from positive training examples of each activity were considered, and we trained the classifier by sampling video segments (with the same length) from continuous training videos. [sent-411, score-1.314]
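
A sketch of such sliding-window detection over a continuous stream, assuming a per-frame cumulative (integral) histogram of the stream and a hypothetical scoring function score_fn (e.g., an SVM decision value on the windowed segment); the candidate durations, stride, and threshold are placeholders.

```python
# Sketch of activity detection from continuous video: slide windows of the
# learned durations over the stream, score each windowed segment, and keep
# windows above a threshold (non-maximum suppression would follow in practice).
def detect(stream_cum, durations, score_fn, stride=5, threshold=0.0):
    T = len(stream_cum)
    detections = []
    for dur in durations:                                # candidate activity lengths (in frames)
        for start in range(0, T - dur, stride):
            seg_hist = stream_cum[start + dur] - stream_cum[start]
            score = score_fn(seg_hist)                   # e.g. an SVM decision value
            if score > threshold:
                detections.append((start, start + dur, score))
    return detections
```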

83 In addition to the recognition approach with our structure matching kernel, we implemented three baseline approaches for comparison: SVM classifiers only using local features, those only using global features, and the method with our multi-channel kernel discussed in Section 3. [sent-417, score-0.466]

84 Figure 7 shows average PR-curves combining results for all seven activity classes. [sent-432, score-0.551]

85 We are able to confirm that our method using the structure match kernel performs better than the conventional SVMs with the bag-of-words paradigm. [sent-433, score-0.395]

86 We also present PR curves for each activity category in Figure 9. [sent-440, score-0.522]

87 Our structure match kernel obtained the highest mean APs in all activity categories, and particularly outperformed the baseline approaches for ‘punching’, ‘point-converse’, and ‘petting’. [sent-441, score-0.777]

88 The structure match kernel not only considers both global motion and local motion of first-person videos (with an optimal weighting computed using kernel target alignment), but also reflects the sequential structure of the activity, thereby correctly distinguishing interactions from false positives. [sent-442, score-1.418]

89 The result suggests that fusing global/local motion information and considering their temporal structure are particularly necessary for detecting high-level human interactions with complex motion. [sent-443, score-0.498]

90 Throwing (orange box) and punching (red box) are detected in the upper video, and pointing (yellow box), hand shaking (cyan box), and waving (magenta box) are detected in the lower video. [sent-446, score-0.427]

91 Average precision-recall curves for each activity category are presented: (c) Pet, (d) Wave, (e) Point-Converse, (f) Punch; the proposed methods performed better than the baselines using space-time features. [sent-519, score-0.522]

92 (g) Throw. In particular, activity detection using our structure match kernel showed superior performance compared to all the others. [sent-521, score-0.875]

93 Conclusion: In this paper, we introduced the problem of recognizing interaction-level activities from videos taken from a first-person perspective. [sent-523, score-0.498]

94 Furthermore, we developed a new kernel-based activity learning/recognition methodology to consider the activities’ hierarchical structures, and verified that learning activity structures from training videos benefits recognition of human interactions targeted to the observer. [sent-525, score-1.603]

95 As a result, friendly human activities such as ‘shaking hands with the observer’ as well as hostile interactions like ‘throwing objects to the observer’ were correctly detected from continuous video streams. [sent-526, score-0.91]

96 One direction for future work is to extend our approach to early recognition of humans’ intentions based on activity detection results and other subtle information from human body movements. [sent-528, score-0.669]

97 Our paper presented the idea that first-person recognition of physical/social interactions becomes possible by analyzing video motion patterns observed during the activities. [sent-529, score-0.432]

98 Modeling temporal structure of decomposable motion segments for activity classification. [sent-590, score-0.84]

99 Detecting activities of daily living in first-person camera views. [sent-595, score-0.384]

100 Human activity prediction: Early recognition of ongoing activities from streaming videos. [sent-600, score-0.841]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('activity', 0.522), ('observer', 0.311), ('activities', 0.275), ('punching', 0.209), ('videos', 0.197), ('kernel', 0.161), ('motion', 0.152), ('throwing', 0.133), ('ks', 0.127), ('humanoid', 0.123), ('video', 0.123), ('kernels', 0.119), ('robot', 0.115), ('interactions', 0.113), ('gram', 0.108), ('friendly', 0.103), ('subsection', 0.096), ('hostile', 0.093), ('structure', 0.092), ('firstperson', 0.09), ('flows', 0.086), ('shaking', 0.086), ('xj', 0.08), ('descriptors', 0.076), ('xyt', 0.076), ('axa', 0.07), ('hro', 0.07), ('hugging', 0.07), ('human', 0.069), ('optical', 0.066), ('classifiers', 0.06), ('match', 0.058), ('camera', 0.056), ('hi', 0.056), ('histogram', 0.055), ('ryoo', 0.054), ('global', 0.053), ('person', 0.053), ('alignment', 0.053), ('continuous', 0.052), ('av', 0.05), ('hands', 0.048), ('argtm', 0.046), ('conversation', 0.046), ('petting', 0.046), ('propulsion', 0.046), ('wearable', 0.045), ('recognition', 0.044), ('xi', 0.043), ('confirm', 0.042), ('superior', 0.042), ('production', 0.04), ('segments', 0.039), ('hj', 0.039), ('jet', 0.038), ('ke', 0.038), ('hierarchical', 0.037), ('particularly', 0.037), ('targeted', 0.037), ('humans', 0.037), ('pr', 0.036), ('recognize', 0.036), ('dch', 0.036), ('emulate', 0.036), ('hitting', 0.036), ('temporal', 0.035), ('considers', 0.035), ('ug', 0.034), ('hug', 0.034), ('intention', 0.034), ('detected', 0.034), ('pointing', 0.034), ('structures', 0.034), ('words', 0.034), ('types', 0.033), ('classification', 0.033), ('target', 0.032), ('pet', 0.032), ('multichannel', 0.032), ('physical', 0.031), ('pyramid', 0.031), ('local', 0.03), ('rules', 0.03), ('bow', 0.03), ('fathi', 0.03), ('hierarchically', 0.03), ('waving', 0.03), ('features', 0.029), ('seven', 0.029), ('training', 0.028), ('ch', 0.028), ('neutral', 0.028), ('constructed', 0.027), ('temporally', 0.027), ('interval', 0.027), ('divides', 0.027), ('confusion', 0.026), ('recognizing', 0.026), ('baseline', 0.026), ('nth', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?

Author: Michael S. Ryoo, Larry Matthies

Abstract: This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint. The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video inputs. These include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects to the observer’, whose videos involve a large amount of camera ego-motion caused by physical interactions. The paper investigates multichannel kernels to integrate global and local motion information, and presents a new activity learning/recognition methodology that explicitly considers temporal structures displayed in first-person activity videos. In our experiments, we not only show classification results with segmented videos, but also confirm that our new approach is able to detect activities from continuous videos reliably.

2 0.45586881 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video

Author: Yingying Zhu, Nandita M. Nayak, Amit K. Roy-Chowdhury

Abstract: In this paper, rather than modeling activities in videos individually, we propose a hierarchical framework that jointly models and recognizes related activities using motion and various context features. This is motivated from the observations that the activities related in space and time rarely occur independently and can serve as the context for each other. Given a video, action segments are automatically detected using motion segmentation based on a nonlinear dynamical model. We aim to merge these segments into activities of interest and generate optimum labels for the activities. Towards this goal, we utilize a structural model in a max-margin framework that jointly models the underlying activities which are related in space and time. The model explicitly learns the duration, motion and context patterns for each activity class, as well as the spatio-temporal relationships for groups of them. The learned model is then used to optimally label the activities in the testing videos using a greedy search method. We show promising results on the VIRAT Ground Dataset demonstrating the benefit of joint modeling and recognizing activities in a wide-area scene.

3 0.39014953 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos

Author: Yu Cao, Daniel Barrett, Andrei Barbu, Siddharth Narayanaswamy, Haonan Yu, Aaron Michaux, Yuewei Lin, Sven Dickinson, Jeffrey Mark Siskind, Song Wang

Abstract: Recognizing human activities in partially observed videos is a challenging problem and has many practical applications. When the unobserved subsequence is at the end of the video, the problem is reduced to activity prediction from unfinished activity streaming, which has been studied by many researchers. However, in the general case, an unobserved subsequence may occur at any time by yielding a temporal gap in the video. In this paper, we propose a new method that can recognize human activities from partially observed videos in the general case. Specifically, we formulate the problem into a probabilistic framework: 1) dividing each activity into multiple ordered temporal segments, 2) using spatiotemporal features of the training video samples in each segment as bases and applying sparse coding (SC) to derive the activity likelihood of the test video sample at each segment, and 3) finally combining the likelihood at each segment to achieve a global posterior for the activities. We further extend the proposed method to include more bases that correspond to a mixture of segments with different temporal lengths (MSSC), which can better represent the activities with large intra-class variations. We evaluate the proposed methods (SC and MSSC) on various real videos. We also evaluate the proposed methods on two special cases: 1) activity prediction where the unobserved subsequence is at the end of the video, and 2) human activity recognition on fully observed videos. Experimental results show that the proposed methods outperform existing state-of-the-art comparison methods.

4 0.35425338 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities

Author: Raghuraman Gopalan

Abstract: While the notion of joint sparsity in understanding common and innovative components of a multi-receiver signal ensemble has been well studied, we investigate the utility of such joint sparse models in representing information contained in a single video signal. By decomposing the content of a video sequence into that observed by multiple spatially and/or temporally distributed receivers, we first recover a collection of common and innovative components pertaining to individual videos. We then present modeling strategies based on subspace-driven manifold metrics to characterize patterns among these components, across other videos in the system, to perform subsequent video analysis. We demonstrate the efficacy of our approach for activity classification and clustering by reporting competitive results on standard datasets such as, HMDB, UCF-50, Olympic Sports and KTH.

5 0.2343953 49 cvpr-2013-Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition

Author: Vinay Bettadapura, Grant Schindler, Thomas Ploetz, Irfan Essa

Abstract: We present data-driven techniques to augment Bag of Words (BoW) models, which allow for more robust modeling and recognition of complex long-term activities, especially when the structure and topology of the activities are not known a priori. Our approach specifically addresses the limitations of standard BoW approaches, which fail to represent the underlying temporal and causal information that is inherent in activity streams. In addition, we also propose the use of randomly sampled regular expressions to discover and encode patterns in activities. We demonstrate the effectiveness of our approach in experimental evaluations where we successfully recognize activities and detect anomalies in four complex datasets.

6 0.16967602 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots

7 0.16878641 348 cvpr-2013-Recognizing Activities via Bag of Words for Attribute Dynamics

8 0.16284248 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences

9 0.15874344 287 cvpr-2013-Modeling Actions through State Changes

10 0.15646255 62 cvpr-2013-Bilinear Programming for Human Activity Recognition with Unknown MRF Graphs

11 0.15467992 332 cvpr-2013-Pixel-Level Hand Detection in Ego-centric Videos

12 0.14516138 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches

13 0.14370702 237 cvpr-2013-Kernel Learning for Extrinsic Classification of Manifold Features

14 0.14103009 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos

15 0.13503623 407 cvpr-2013-Spatio-temporal Depth Cuboid Similarity Feature for Activity Recognition Using Depth Camera

16 0.13460454 172 cvpr-2013-Finding Group Interactions in Social Clutter

17 0.13002791 187 cvpr-2013-Geometric Context from Videos

18 0.12869218 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction

19 0.12854658 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition

20 0.12484239 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.251), (1, -0.055), (2, -0.036), (3, -0.165), (4, -0.235), (5, 0.054), (6, -0.111), (7, -0.076), (8, -0.084), (9, 0.054), (10, 0.144), (11, -0.098), (12, 0.07), (13, -0.081), (14, 0.016), (15, 0.047), (16, 0.046), (17, 0.167), (18, -0.073), (19, -0.219), (20, -0.114), (21, 0.107), (22, 0.086), (23, -0.091), (24, -0.05), (25, 0.047), (26, -0.079), (27, 0.1), (28, 0.059), (29, 0.185), (30, 0.002), (31, 0.002), (32, 0.044), (33, -0.045), (34, 0.062), (35, -0.041), (36, 0.015), (37, -0.055), (38, -0.029), (39, 0.03), (40, -0.08), (41, 0.049), (42, 0.182), (43, -0.0), (44, 0.013), (45, -0.052), (46, 0.164), (47, -0.037), (48, -0.044), (49, -0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9576205 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?

Author: Michael S. Ryoo, Larry Matthies

Abstract: This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint. The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video inputs. These include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects to the observer’, whose videos involve a large amount of camera ego-motion caused by physical interactions. The paper investigates multichannel kernels to integrate global and local motion information, and presents a new activity learning/recognition methodology that explicitly considers temporal structures displayed in first-person activity videos. In our experiments, we not only show classification results with segmented videos, but also confirm that our new approach is able to detect activities from continuous videos reliably.

2 0.93388754 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video

Author: Yingying Zhu, Nandita M. Nayak, Amit K. Roy-Chowdhury

Abstract: In this paper, rather than modeling activities in videos individually, we propose a hierarchical framework that jointly models and recognizes related activities using motion and various context features. This is motivated from the observations that the activities related in space and time rarely occur independently and can serve as the context for each other. Given a video, action segments are automatically detected using motion segmentation based on a nonlinear dynamical model. We aim to merge these segments into activities of interest and generate optimum labels for the activities. Towards this goal, we utilize a structural model in a max-margin framework that jointly models the underlying activities which are related in space and time. The model explicitly learns the duration, motion and context patterns for each activity class, as well as the spatio-temporal relationships for groups of them. The learned model is then used to optimally label the activities in the testing videos using a greedy search method. We show promising results on the VIRAT Ground Dataset demonstrating the benefit of joint modeling and recognizing activities in a wide-area scene.

3 0.91299874 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos

Author: Yu Cao, Daniel Barrett, Andrei Barbu, Siddharth Narayanaswamy, Haonan Yu, Aaron Michaux, Yuewei Lin, Sven Dickinson, Jeffrey Mark Siskind, Song Wang

Abstract: Recognizing human activities in partially observed videos is a challenging problem and has many practical applications. When the unobserved subsequence is at the end of the video, the problem is reduced to activity prediction from unfinished activity streaming, which has been studied by many researchers. However, in the general case, an unobserved subsequence may occur at any time by yielding a temporal gap in the video. In this paper, we propose a new method that can recognize human activities from partially observed videos in the general case. Specifically, we formulate the problem into a probabilistic framework: 1) dividing each activity into multiple ordered temporal segments, 2) using spatiotemporal features of the training video samples in each segment as bases and applying sparse coding (SC) to derive the activity likelihood of the test video sample at each segment, and 3) finally combining the likelihood at each segment to achieve a global posterior for the activities. We further extend the proposed method to include more bases that correspond to a mixture of segments with different temporal lengths (MSSC), which can better represent the activities with large intra-class variations. We evaluate the proposed methods (SC and MSSC) on various real videos. We also evaluate the proposed methods on two special cases: 1) activity prediction where the unobserved subsequence is at the end of the video, and 2) human activity recognition on fully observed videos. Experimental results show that the proposed methods outperform existing state-of-the-art comparison methods.

4 0.83478022 49 cvpr-2013-Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition

Author: Vinay Bettadapura, Grant Schindler, Thomas Ploetz, Irfan Essa

Abstract: We present data-driven techniques to augment Bag of Words (BoW) models, which allow for more robust modeling and recognition of complex long-term activities, especially when the structure and topology of the activities are not known a priori. Our approach specifically addresses the limitations of standard BoW approaches, which fail to represent the underlying temporal and causal information that is inherent in activity streams. In addition, we also propose the use of randomly sampled regular expressions to discover and encode patterns in activities. We demonstrate the effectiveness of our approach in experimental evaluations where we successfully recognize activities and detect anomalies in four complex datasets.

5 0.77926725 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities

Author: Raghuraman Gopalan

Abstract: While the notion of joint sparsity in understanding common and innovative components of a multi-receiver signal ensemble has been well studied, we investigate the utility of such joint sparse models in representing information contained in a single video signal. By decomposing the content of a video sequence into that observed by multiple spatially and/or temporally distributed receivers, we first recover a collection of common and innovative components pertaining to individual videos. We then present modeling strategies based on subspace-driven manifold metrics to characterize patterns among these components, across other videos in the system, to perform subsequent video analysis. We demonstrate the efficacy of our approach for activity classification and clustering by reporting competitive results on standard datasets such as, HMDB, UCF-50, Olympic Sports and KTH.

6 0.64386934 62 cvpr-2013-Bilinear Programming for Human Activity Recognition with Unknown MRF Graphs

7 0.62186915 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos

8 0.60485536 103 cvpr-2013-Decoding Children's Social Behavior

9 0.60408503 348 cvpr-2013-Recognizing Activities via Bag of Words for Attribute Dynamics

10 0.51811731 137 cvpr-2013-Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis

11 0.51358014 187 cvpr-2013-Geometric Context from Videos

12 0.50436211 413 cvpr-2013-Story-Driven Summarization for Egocentric Video

13 0.47412875 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model

14 0.47004816 133 cvpr-2013-Discriminative Segment Annotation in Weakly Labeled Video

15 0.46696514 118 cvpr-2013-Detecting Pulse from Head Motions in Video

16 0.4611474 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors

17 0.4510771 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding

18 0.44848159 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition

19 0.44292238 353 cvpr-2013-Relative Hidden Markov Models for Evaluating Motion Skill

20 0.4400937 172 cvpr-2013-Finding Group Interactions in Social Clutter


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.136), (13, 0.197), (16, 0.017), (26, 0.045), (33, 0.281), (67, 0.109), (69, 0.064), (76, 0.014), (87, 0.051)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90998 42 cvpr-2013-Analytic Bilinear Appearance Subspace Construction for Modeling Image Irradiance under Natural Illumination and Non-Lambertian Reflectance

Author: Shireen Y. Elhabian, Aly A. Farag

Abstract: Conventional subspace construction approaches suffer from the need of “large-enough ” image ensemble rendering numerical methods intractable. In this paper, we propose an analytic formulation for low-dimensional subspace construction in which shading cues lie while preserving the natural structure of an image sample. Using the frequencyspace representation of the image irradiance equation, the process of finding such subspace is cast as establishing a relation between its principal components and that of a deterministic set of basis functions, termed as irradiance harmonics. Representing images as matrices further lessen the number of parameters to be estimated to define a bilinear projection which maps the image sample to a lowerdimensional bilinear subspace. Results show significant impact on dimensionality reduction with minimal loss of information as well as robustness against noise.

2 0.88790363 292 cvpr-2013-Multi-agent Event Detection: Localization and Role Assignment

Author: Suha Kwak, Bohyung Han, Joon Hee Han

Abstract: We present a joint estimation technique of event localization and role assignment when the target video event is described by a scenario. Specifically, to detect multi-agent events from video, our algorithm identifies agents involved in an event and assigns roles to the participating agents. Instead of iterating through all possible agent-role combinations, we formulate the joint optimization problem as two efficient subproblems—quadratic programming for role assignment followed by linear programming for event localization. Additionally, we reduce the computational complexity significantly by applying role-specific event detectors to each agent independently. We test the performance of our algorithm in natural videos, which contain multiple target events and nonparticipating agents.

3 0.88412583 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos

Author: Yu Cao, Daniel Barrett, Andrei Barbu, Siddharth Narayanaswamy, Haonan Yu, Aaron Michaux, Yuewei Lin, Sven Dickinson, Jeffrey Mark Siskind, Song Wang

Abstract: Recognizing human activities in partially observed videos is a challenging problem and has many practical applications. When the unobserved subsequence is at the end of the video, the problem is reduced to activity prediction from unfinished activity streaming, which has been studied by many researchers. However, in the general case, an unobserved subsequence may occur at any time by yielding a temporal gap in the video. In this paper, we propose a new method that can recognize human activities from partially observed videos in the general case. Specifically, we formulate the problem into a probabilistic framework: 1) dividing each activity into multiple ordered temporal segments, 2) using spatiotemporal features of the training video samples in each segment as bases and applying sparse coding (SC) to derive the activity likelihood of the test video sample at each segment, and 3) finally combining the likelihood at each segment to achieve a global posterior for the activities. We further extend the proposed method to include more bases that correspond to a mixture of segments with different temporal lengths (MSSC), which can better represent the activities with large intra-class variations. We evaluate the proposed methods (SC and MSSC) on various real videos. We also evaluate the proposed methods on two special cases: 1) activity prediction where the unobserved subsequence is at the end of the video, and 2) human activity recognition on fully observed videos. Experimental results show that the proposed methods outperform existing state-of-the-art comparison methods.

same-paper 4 0.87573719 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?

Author: Michael S. Ryoo, Larry Matthies

Abstract: This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint. The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video inputs. These include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects to the observer’, whose videos involve a large amount of camera ego-motion caused by physical interactions. The paper investigates multichannel kernels to integrate global and local motion information, and presents a new activity learning/recognition methodology that explicitly considers temporal structures displayed in first-person activity videos. In our experiments, we not only show classification results with segmented videos, but also confirm that our new approach is able to detect activities from continuous videos reliably.

5 0.8580749 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

Author: Ian Endres, Kevin J. Shih, Johnston Jiaa, Derek Hoiem

Abstract: We propose a method to learn a diverse collection of discriminative parts from object bounding box annotations. Part detectors can be trained and applied individually, which simplifies learning and extension to new features or categories. We apply the parts to object category detection, pooling part detections within bottom-up proposed regions and using a boosted classifier with proposed sigmoid weak learners for scoring. On PASCAL VOC 2010, we evaluate the part detectors ’ ability to discriminate and localize annotated keypoints. Our detection system is competitive with the best-existing systems, outperforming other HOG-based detectors on the more deformable categories.

6 0.85491961 414 cvpr-2013-Structure Preserving Object Tracking

7 0.85275954 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval

8 0.85274112 325 cvpr-2013-Part Discovery from Partial Correspondence

9 0.8525846 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence

10 0.8521533 288 cvpr-2013-Modeling Mutual Visibility Relationship in Pedestrian Detection

11 0.85172606 60 cvpr-2013-Beyond Physical Connections: Tree Models in Human Pose Estimation

12 0.85163212 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation

13 0.85062748 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection

14 0.84987998 314 cvpr-2013-Online Object Tracking: A Benchmark

15 0.84970081 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

16 0.84878635 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation

17 0.84851807 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection

18 0.84849703 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection

19 0.84840983 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection

20 0.84807312 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection