iccv iccv2013 iccv2013-86 knowledge-graph by maker-knowledge-mining

86 iccv-2013-Concurrent Action Detection with Structural Prediction


Source: pdf

Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu

Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence has only one action class label and that different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. A detected action interval is determined both by the unary local detector and by its relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our newly collected concurrent action dataset demonstrate the strength of our method.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Action recognition has often been posed as a classification problem, which assumes that a video sequence has only one action class label and that different actions are independent. [sent-6, score-0.983]

2 However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. [sent-7, score-0.877]

3 This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. [sent-8, score-1.652]

4 In this model, an interval in a video sequence can be described by multiple action labels. [sent-9, score-0.913]

5 A detected action interval is determined both by the unary local detector and by its relations with other actions. [sent-10, score-0.895]

6 We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. [sent-11, score-1.584]

7 Experiments on our newly collected concurrent action dataset demonstrate the strength of our method. [sent-14, score-0.903]

8 i.e., a classifier assigns one action label to a video sequence [18]. [sent-17, score-0.748]

9 In this case, the video sequence in the concurrent time interval cannot simply be classified into one action class. [sent-22, score-1.242]

10 The action turn on monitor usually occurs before the action type on keyboard. [sent-29, score-1.242]

11 We believe that such action relation information should play an important role in action recognition and localization. [sent-31, score-1.28]

12 We define concurrent actions as multiple actions performed simultaneously by one human body. [sent-32, score-0.796]

13 By concurrent action detection, we mean recognizing all the actions and localizing their time intervals in a long video sequence, as shown in Figure 1. [sent-34, score-1.294]

14 In this paper, we propose a novel concurrent action detection model (COA). [sent-35, score-0.945]

15 Our model formulates the detection of concurrent actions as a structural prediction problem, similar to multi-class object layout in still images [5]. [sent-36, score-1.069]

16 In this formulation, the detected action instances are determined by both the unary local detectors and the relations with other actions. [sent-37, score-0.73]
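
In this formulation the joint score of a labeling is the sum of unary local-detector responses and pairwise relation terms over a temporal neighborhood. A minimal Python sketch of such a scoring function follows; the container layout (detection dicts, weight dictionaries, a neighbor list) is an illustrative assumption, not the authors' code.

```python
import numpy as np

def structural_score(detections, unary_weights, pair_weights, neighbors):
    """Joint score of a set of labeled action intervals: unary
    local-detector terms plus pairwise temporal-relation terms over
    a neighborhood system N (sketch; names and shapes are assumed)."""
    score = 0.0
    for det in detections:
        # det['rho']: local detection response for the class det['label']
        score += np.dot(unary_weights[det['label']], det['rho'])
    for (i, j) in neighbors:  # pairs (i, j) in the temporal neighborhood N
        yi, yj = detections[i]['label'], detections[j]['label']
        r_ij = detections[i]['relation'][j]  # temporal-logic descriptor
        score += np.dot(pair_weights[(yi, yj)], r_ij)
    return score
```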

17 A multiple kernel learning method [2] is applied to mine the informative body parts for different action classes. [sent-38, score-0.814]

18 With informative part mining, the human body is softly divided into weighted parts which perform the concurrent actions. [sent-39, score-0.61]

19 Given a video sequence, we propose an online sequential decision window search algorithm to detect the concurrent actions. [sent-41, score-0.623]

20 We collect a new concurrent action dataset for evaluation. [sent-42, score-0.903]

21 It includes 12 action classes, which are listed in Figure 1, and 61 long video sequences in total. [sent-44, score-0.62]

22 The small colorful blocks correspond to the action intervals on the time axis. [sent-54, score-0.759]

23 This method requires the video sequence to be pre-segmented, and predicts one action class for each segment. [sent-57, score-0.772]

24 The work [12] represented the concurrent decision with a semi-Markov model where plans were learned from concurrent actions. [sent-62, score-0.74]

25 (3) Temporal relations are used in the literature to facilitate action modeling [1, 9, 10, 13, 16, 19]. [sent-64, score-0.706]

26 Allen [1] introduced classical temporal logics to describe the relations between actions, which were further applied to representing action structures and to action detection in [10]. [sent-68, score-1.488]

27 Actually, the problem of action detection in motion data is more complex than object detection in still images, because motion data has more scales and more complex structures. [sent-76, score-0.658]

28 Concurrent Action Model Suppose there are M overlapping action intervals in a video sequence. [sent-79, score-0.783]

29 These intervals are obtained by sliding the local action detectors of all 12 action classes along the time axis of the video sequence, similar to object detection in images with sliding windows. [sent-80, score-1.457]

30 yi ∈ Y is the action class label of the interval di, where Y is the set of all action class labels. [sent-83, score-0.863]

31 ρyi (xi) is the local detection model of the action yi. [sent-103, score-0.616]

32 ρyi is related to the action class yi, which suggests that different actions correspond to different parts of the body. [sent-105, score-0.85]

33 ωyi,yj is the relation parameter, which encodes the location and semantic relations between action classes yi and yj . [sent-110, score-0.995]

34 The introduction of the neighborhood system N indicates that an action in a sequence is only related to the actions which are close to it. [sent-113, score-0.728]

35 Wavelet Feature and Local Detection ρyi(xi) In our work, the input human action data is a sequence of 3D human poses estimated by the Kinect [14]. [sent-129, score-0.792]

36 A human action sequence forms K trajectories, as shown in Figure 2. [sent-131, score-0.747]

37 We apply the symlet wavelet transform to the interpolated hk, and keep the first V wavelet coefficients as the action feature of the kth joint, denoted as Hk. [sent-148, score-0.88]

38 With the wavelet feature x, the local action detection model is ρyi(xi) = (fyi, 1), where fyi is an action detector: fyi = β_yi^T x + b_yi (2). The wavelet transform has the attribute of time-frequency localization. [sent-153, score-1.64]

39 Furthermore, by keeping only the first V wavelet coefficients, we eliminate the noise in the original pose data, which makes the action description more robust. [sent-157, score-0.735]
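
A hedged sketch of this per-joint wavelet feature using PyWavelets: interpolate a joint trajectory to a fixed length, decompose it with a symlet wavelet, and keep the first V coefficients. The fixed length 256, the 'sym4' order, and V = 32 are assumptions for illustration; the paper specifies only a symlet transform with the first V coefficients kept.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_feature(trajectory, V=32, wavelet='sym4', fixed_len=256):
    """Per-joint action feature H_k: interpolate one coordinate of the
    kth joint trajectory to a fixed length, apply a symlet wavelet
    decomposition, and keep the first V (coarsest) coefficients.
    fixed_len, 'sym4', and V are illustrative assumptions."""
    t_old = np.linspace(0.0, 1.0, len(trajectory))
    t_new = np.linspace(0.0, 1.0, fixed_len)
    h_k = np.interp(t_new, t_old, trajectory)   # interpolated trajectory h_k
    coeffs = pywt.wavedec(h_k, wavelet)         # [cA_n, cD_n, ..., cD_1]
    flat = np.concatenate(coeffs)               # coarse-to-fine ordering
    return flat[:V]                             # keep the first V coefficients
```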

40 Composite Temporal Logic Descriptor for rij rij represents the temporal location of interval dj relative to the interval di. [sent-160, score-0.633]

41 For example, the actions press button and turn on monitor both occur before the action type on keyboard. [sent-163, score-1.255]

42 It is decomposed into three components rij = (r_ij^S, r_ij^C, r_ij^E): 1) r_ij^S, the location of dj relative to the start point of di; 2) r_ij^C, the location of dj relative to the center point of di; 3) r_ij^E, the location of dj relative to the end point of di. [sent-166, score-0.615]

43 The actions bend down and pick up trash always start simultaneously. [sent-169, score-1.005]

44 So the action pick up trash is closely related to the start of bend down. [sent-170, score-1.005]

45 For example, the action throw trash always occurs after the action pick up trash ends. [sent-173, score-1.887]

46 So the action throw trash is closely related to the end of pick up trash. [sent-174, score-1.035]

47 Because it quantizes the duration of the action interval, it also characterizes the action's duration information. [sent-188, score-0.682]
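
Under these definitions, the descriptor can be sketched as three quantized offset histograms, one each for the start, center, and end point of di. Using the center of dj as the reference point and one-hot bin indicators are illustrative assumptions; only the three-component decomposition itself comes from the paper.

```python
import numpy as np

def temporal_logic_descriptor(d_i, d_j, bin_edges):
    """Composite temporal logic descriptor r_ij = (r^S, r^C, r^E): the
    temporal location of interval d_j relative to the start, center,
    and end of interval d_i, each quantized into bins (sketch)."""
    s_i, e_i = d_i
    s_j, e_j = d_j
    c_i, c_j = 0.5 * (s_i + e_i), 0.5 * (s_j + e_j)
    parts = []
    for anchor in (s_i, c_i, e_i):          # start, center, end of d_i
        offset = c_j - anchor               # signed temporal offset of d_j
        hist = np.zeros(len(bin_edges) - 1)
        b = np.searchsorted(bin_edges, offset) - 1
        if 0 <= b < len(hist):
            hist[b] = 1.0                   # one-hot bin indicator
        parts.append(hist)
    return np.concatenate(parts)            # (r^S, r^C, r^E)
```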

48 Mining Informative Parts with MKL This subsection elaborates on how we learn the local action detector fyi = β_yi^T x + b_yi by mining the informative body parts for different actions. [sent-192, score-0.932]

49 An action is usually related to some specific parts of the human body. [sent-193, score-0.66]

50 For example, the action drink is mainly performed by the hand and arms. [sent-194, score-0.625]

51 We use a multiple kernel learning (MKL) [2] method to automatically mine the informative parts for each action class. [sent-197, score-0.688]

52 The MKL method learns a weight vector (α1, . . . , αK) for each action class y, where K is the number of human body joints, and αk ≥ 0 corresponds to the kth joint. [sent-202, score-0.762]

53 Each wavelet action feature x is decomposed into K blocks corresponding to the K body joints. [sent-203, score-0.708]

54 Such decomposition makes it possible to differentiate the effects of different joints on the action y. [sent-213, score-0.656]
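
A sketch of the resulting per-joint kernel combination: the feature is split into K joint blocks and the kernel of the kth block is scaled by a non-negative weight αk. Linear per-joint kernels and a feature length divisible by K are assumptions here; the weights themselves are learned with the MKL solver of [2].

```python
import numpy as np

def mkl_kernel(x, x_prime, alpha, K):
    """Weighted sum of per-joint kernels: K(x, x') = sum_k alpha_k * K_k.
    Assumes len(x) is divisible by K and uses linear block kernels
    (an illustrative choice, not necessarily the paper's)."""
    x_blocks = np.split(x, K)          # one feature block per body joint
    xp_blocks = np.split(x_prime, K)
    return sum(alpha[k] * np.dot(x_blocks[k], xp_blocks[k])
               for k in range(K))
```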

55 Each Nu-dimension block of ωu corresponds to an action class. [sent-261, score-0.599]

56 The elements of ϕ(ρyi(xi), yi) are all zeros except the block corresponding to the action class yi, where it is ρyi(xi). [sent-262, score-0.623]

57 Each Nb-dimension block corresponds to a pair of action classes. [sent-264, score-0.599]

58 The elements of ψ(rij, yi, yj) are all zeros except the block corresponding to the action class pair (yi, yj), where it is rij. [sent-265, score-0.753]
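
These block-sparse joint feature maps can be written out directly; the sketch below assumes Nu-dimensional unary responses and Nb-dimensional relation descriptors, as described in the text.

```python
import numpy as np

def phi(rho, y, num_classes):
    """Unary joint feature map: all zeros except the block of class y,
    which holds the local detection response rho (length Nu)."""
    Nu = len(rho)
    out = np.zeros(num_classes * Nu)
    out[y * Nu:(y + 1) * Nu] = rho
    return out

def psi(r_ij, y_i, y_j, num_classes):
    """Pairwise joint feature map: all zeros except the block of the
    class pair (y_i, y_j), which holds the relation descriptor r_ij."""
    Nb = len(r_ij)
    idx = y_i * num_classes + y_j          # linear index of the class pair
    out = np.zeros(num_classes ** 2 * Nb)
    out[idx * Nb:(idx + 1) * Nb] = r_ij
    return out
```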

59 Inference Given a long temporal sequence X containing multiple actions, our goal is to localize all the action intervals and label them with the action classes. [sent-283, score-1.54]

60 The detection of multiple concurrent actions in a temporal sequence is more complex than object layout in still images. [sent-286, score-0.873]

61 We run the local detectors (Eq. (2)) of all the 12 action classes on the temporal sequence in a sliding-window manner. [sent-303, score-0.857]

62 This local detection process produces a large number of action intervals, which are pruned by a non-maximum suppression step to generate the candidate action intervals. [sent-305, score-0.616]
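
The pruning step is a standard 1D non-maximum suppression over scored intervals; a minimal sketch follows (the 0.5 overlap threshold is an assumed value, not taken from the paper).

```python
def temporal_nms(intervals, overlap_thresh=0.5):
    """Greedy 1D non-maximum suppression over (start, end, score)
    intervals: keep the highest-scoring interval and drop any interval
    that overlaps a kept one by more than the threshold."""
    kept = []
    for s, e, score in sorted(intervals, key=lambda it: -it[2]):
        suppressed = False
        for ks, ke, _ in kept:
            inter = max(0.0, min(e, ke) - max(s, ks))
            union = (e - s) + (ke - ks) - inter
            if union > 0 and inter / union > overlap_thresh:
                suppressed = True
                break
        if not suppressed:
            kept.append((s, e, score))
    return kept
```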

63 Suppose Ds ⊆ D = {di}, and XDs and YDs are respectively the feature set and the corresponding action label set of the action intervals in Ds. [sent-311, score-1.282]

64 But this is reasonable for human action sequence data, because an action is usually only related to other actions which are close to it. [sent-321, score-1.532]
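
The extracted sentences describe the sequential decision window search only at a high level, so the sketch below is a generic greedy approximation under that reading: slide a window along the time axis and, inside each window, greedily add the candidate interval that most improves the joint structural score.

```python
def window_search(candidates, window_len, score_fn):
    """Greedy sketch of a sequential decision window search. candidates
    are (start, end, label) tuples; score_fn maps a list of selected
    candidates to their joint structural score. A simplification of the
    paper's online algorithm, which is not fully reproduced here."""
    selected = []
    t = 0.0
    end_time = max(e for _, e, _ in candidates)
    while t < end_time:
        window = [c for c in candidates if t <= c[0] < t + window_len]
        improved = True
        while improved:                      # greedy additions inside window
            improved = False
            best, best_gain = None, 0.0
            for c in window:
                if c in selected:
                    continue
                gain = score_fn(selected + [c]) - score_fn(selected)
                if gain > best_gain:
                    best, best_gain = c, gain
            if best is not None:
                selected.append(best)
                improved = True
        t += window_len                      # slide the window forward
    return selected
```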

65 Dataset To evaluate our method, we collect a new concurrent action dataset with annotation. [sent-329, score-0.903]

66 Each sequence contains many actions which are concurrent on the time axis and interact with each other. [sent-334, score-0.668]

67 The dataset includes 12 action classes: drink, make a call, turn on monitor, type on keyboard, fetch water, pour water, press button, pick up trash, throw trash, bend down, sit, and stand. [sent-335, score-0.991]

68 Our dataset is new in two aspects: i) each sequence contains multiple concurrent actions; ii) these actions semantically and temporally interact with each other. [sent-336, score-0.696]

69 Thirdly, the instances of each action class have large variances. [sent-340, score-0.598]

70 For example, some instances of the action sit last for less than thirty frames, but some may last for more than one thousand frames. [sent-341, score-0.62]

71 Finally, some different actions are very similar, like drink and make a call, or pick up trash and throw trash. [sent-342, score-0.723]

72 A detected action interval is taken as correct if the overlap length of the detected interval and the ground truth interval is larger than 60% of their union length, or if the detected interval is totally covered by the ground truth interval. [sent-346, score-1.263]

73 The second condition is special to action detection, because part of an action is still described with the same action label by humans. [sent-347, score-1.764]
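
The evaluation rule is concrete enough to code directly: the intersection must exceed 60% of the union, or the detection must be fully contained in the ground-truth interval.

```python
def is_correct_detection(det, gt, thresh=0.6):
    """Correctness test from the paper: overlap larger than 60% of the
    union length, or the detected interval totally covered by the
    ground truth. det and gt are (start, end) pairs."""
    ds, de = det
    gs, ge = gt
    inter = max(0.0, min(de, ge) - max(ds, gs))
    union = (de - ds) + (ge - gs) - inter
    contained = gs <= ds and de <= ge
    return (union > 0 and inter / union > thresh) or contained
```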

74 This method uses the original aligned skeleton sequence as the action feature, and an SVM-trained detector to detect the action with sliding windows. [sent-352, score-1.335]

75 This method is similar to SVM-SKL, except that its action feature is our proposed wavelet feature. [sent-354, score-0.708]

76 Actionlet ensemble [18] is the state-of-the-art method for multiple action recognition with 3D human pose data. [sent-356, score-0.646]

77 On most action classes, our method outperforms the other methods, which demonstrates its effectiveness and advantage. [sent-375, score-0.603]

78 The temporal relations between them and other action classes can facilitate the detection. [sent-377, score-0.79]

79 For example, the action throw trash is usually inconspicuous and hard to detect. [sent-378, score-0.974]

80 With the context of pick up trash, which usually occurs closely before throw trash, the AP of throw trash is significantly boosted. [sent-379, score-0.886]

81 i.e., the results of all the testing sequences and all the action classes are put together to compute the AP. [sent-384, score-0.599]

82 SVM-WAV and SVM-SKL differ only in the action sequence feature. [sent-389, score-0.702]

83 Figure 5: action detection results on video sequences 1-4.

84 To intuitively show the strength of our model, we visualize some action detection results in Figure 5. [sent-422, score-0.616]

85 Informative Body Parts The informative body parts are weighted human body parts for different action classes. [sent-428, score-0.936]

86 An action is usually related to some specific parts of the human body. [sent-430, score-0.66]

87 Figure 6 shows that though the data of action instances is noisy and has large variance, our algorithm can mine reasonable body parts for different action classes. [sent-433, score-1.27]

88 For example, the action throw trash usually occurs after the action pick up trash. [sent-441, score-1.634]

89 The action type on keyboard usually co-occurs with the action sit. [sent-443, score-1.275]

90 This is shown by the fact that the descriptor of action yj relative to yi and the descriptor of yi relative to yj are asymmetric, as in the relations between turn on monitor and type on keyboard. [sent-447, score-1.151]

91 The block describes the relation of the column action relative to the row action. [sent-454, score-0.66]

92 Conclusion In this paper, we present a new problem of concurrent action detection and propose a structural prediction formulation for it. [sent-457, score-1.036]

93 This formulation extends the action recognition from unary feature classification to multiple structural labeling. [sent-458, score-0.663]

94 We describe the phenomenon of concurrent actions by introducing the informative body parts, which are mined for each action class by multiple kernel learning. [sent-459, score-1.292]

95 To accommodate the sequential nature and long duration of video sequences, we design a sequential decision window search algorithm, which can detect actions online in a video sequence. [sent-460, score-0.677]

96 We design two descriptors for representing the local action feature and temporal relations between actions, respectively. [sent-461, score-0.836]

97 The experiment results on our new concurrent action dataset demonstrate the benefit of our model. [sent-462, score-0.903]

98 Future work will focus on multiple action detection in real surveillance videos of large scenes. [sent-463, score-0.662]

99 Human action detection using PNF propagation of temporal constraints. [sent-529, score-0.746]

100 Mining actionlet ensemble for action recognition with depth cameras. [sent-590, score-0.611]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('action', 0.574), ('concurrent', 0.329), ('trash', 0.253), ('actions', 0.211), ('interval', 0.165), ('throw', 0.147), ('coa', 0.14), ('intervals', 0.134), ('wavelet', 0.134), ('relations', 0.132), ('temporal', 0.13), ('sequence', 0.128), ('dj', 0.107), ('yi', 0.1), ('ds', 0.099), ('mcol', 0.091), ('keyboard', 0.09), ('bend', 0.086), ('rij', 0.086), ('decision', 0.082), ('joints', 0.082), ('body', 0.081), ('logic', 0.077), ('informative', 0.073), ('ricj', 0.072), ('riej', 0.072), ('sequential', 0.072), ('structural', 0.065), ('fyi', 0.064), ('dw', 0.063), ('zl', 0.062), ('relation', 0.061), ('pick', 0.061), ('fetch', 0.059), ('window', 0.058), ('water', 0.055), ('greedy', 0.055), ('byi', 0.054), ('pictrka', 0.054), ('risj', 0.054), ('duration', 0.054), ('du', 0.054), ('drink', 0.051), ('composite', 0.051), ('di', 0.051), ('video', 0.046), ('sit', 0.046), ('human', 0.045), ('mining', 0.045), ('descriptor', 0.044), ('yj', 0.044), ('encapsulates', 0.042), ('detection', 0.042), ('parts', 0.041), ('mip', 0.04), ('hk', 0.039), ('button', 0.038), ('kth', 0.038), ('durations', 0.037), ('actionlet', 0.037), ('type', 0.037), ('logics', 0.036), ('nere', 0.036), ('usph', 0.036), ('xds', 0.036), ('yds', 0.036), ('ytix', 0.036), ('stand', 0.036), ('search', 0.036), ('yn', 0.034), ('location', 0.034), ('layout', 0.033), ('monitor', 0.032), ('sliding', 0.031), ('start', 0.031), ('bins', 0.03), ('slides', 0.03), ('else', 0.03), ('proves', 0.029), ('overlapping', 0.029), ('kinect', 0.029), ('allen', 0.028), ('semantically', 0.028), ('skeleton', 0.028), ('pose', 0.027), ('pour', 0.027), ('event', 0.026), ('neighborhood', 0.026), ('prediction', 0.026), ('colorful', 0.026), ('hoai', 0.026), ('xn', 0.025), ('relative', 0.025), ('blocks', 0.025), ('encodes', 0.025), ('block', 0.025), ('occurs', 0.025), ('classes', 0.025), ('unary', 0.024), ('class', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 86 iccv-2013-Concurrent Action Detection with Structural Prediction

Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu

Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence has only one action class label and that different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. A detected action interval is determined both by the unary local detector and by its relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our newly collected concurrent action dataset demonstrate the strength of our method.

2 0.44316918 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection

Author: Mihai Zanfir, Marius Leordeanu, Cristian Sminchisescu

Abstract: Human action recognition under low observational latency is receiving a growing interest in computer vision due to rapidly developing technologies in human-robot interaction, computer gaming and surveillance. In this paper we propose a fast, simple, yet powerful non-parametric Moving Pose (MP) framework for low-latency human action and activity recognition. Central to our methodology is a moving pose descriptor that considers both pose information as well as differential quantities (speed and acceleration) of the human body joints within a short time window around the current frame. The proposed descriptor is used in conjunction with a modified kNN classifier that considers both the temporal location of a particular frame within the action sequence as well as the discrimination power of its moving pose descriptor compared to other frames in the training set. The resulting method is non-parametric and enables low-latency recognition, one-shot learning, and action detection in difficult unsegmented sequences. Moreover, the framework is real-time, scalable, and outperforms more sophisticated approaches on challenging benchmarks like MSR-Action3D or MSR-DailyActivities3D.

3 0.4031834 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition

Author: Jiang Wang, Ying Wu

Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.

4 0.35247242 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos

Author: Sunil Bandla, Kristen Grauman

Abstract: Collecting and annotating videos of realistic human actions is tedious, yet critical for training action recognition systems. We propose a method to actively request the most useful video annotations among a large set of unlabeled videos. Predicting the utility of annotating unlabeled video is not trivial, since any given clip may contain multiple actions of interest, and it need not be trimmed to temporal regions of interest. To deal with this problem, we propose a detection-based active learner to train action category models. We develop a voting-based framework to localize likely intervals of interest in an unlabeled clip, and use them to estimate the total reduction in uncertainty that annotating that clip would yield. On three datasets, we show our approach can learn accurate action detectors more efficiently than alternative active learning strategies that fail to accommodate the “untrimmed” nature of real video data.

5 0.30928552 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

Author: Vignesh Ramanathan, Percy Liang, Li Fei-Fei

Abstract: Human action and role recognition play an important part in complex event understanding. State-of-the-art methods learn action and role models from detailed spatio-temporal annotations, which requires extensive human effort. In this work, we propose a method to learn such models based on natural language descriptions of the training videos, which are easier to collect and scale with the number of actions and roles. There are two challenges with using this form of weak supervision: First, these descriptions only provide a high-level summary and often do not directly mention the actions and roles occurring in a video. Second, natural language descriptions do not provide spatio-temporal annotations of actions and roles. To tackle these challenges, we introduce a topic-based semantic relatedness (SR) measure between a video description and an action and role label, and incorporate it into a posterior regularization objective. Our event recognition system based on these action and role models matches the state-of-the-art method on the TRECVID-MED11 event kit, despite weaker supervision.

6 0.28582713 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments

7 0.27887654 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction

8 0.27677932 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition

9 0.26650691 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding

10 0.24847154 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition

11 0.23664407 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition

12 0.23465377 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps

13 0.22334218 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set

14 0.21476685 166 iccv-2013-Finding Actors and Actions in Movies

15 0.2090123 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition

16 0.20486647 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition

17 0.19323793 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition

18 0.18403485 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition

19 0.17501229 396 iccv-2013-Space-Time Robust Representation for Action Recognition

20 0.16606116 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.24), (1, 0.305), (2, 0.15), (3, 0.389), (4, 0.003), (5, -0.04), (6, 0.155), (7, -0.156), (8, -0.047), (9, 0.036), (10, 0.021), (11, 0.068), (12, -0.066), (13, -0.068), (14, 0.173), (15, 0.011), (16, 0.019), (17, -0.061), (18, -0.052), (19, -0.035), (20, -0.009), (21, 0.04), (22, -0.049), (23, -0.154), (24, -0.044), (25, -0.032), (26, -0.026), (27, 0.007), (28, -0.017), (29, -0.003), (30, -0.022), (31, 0.03), (32, 0.029), (33, 0.058), (34, 0.076), (35, 0.015), (36, 0.006), (37, -0.007), (38, 0.012), (39, -0.021), (40, -0.028), (41, 0.004), (42, 0.0), (43, 0.005), (44, 0.01), (45, -0.042), (46, -0.006), (47, -0.006), (48, 0.011), (49, -0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98910815 86 iccv-2013-Concurrent Action Detection with Structural Prediction

Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu

Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence has only one action class label and that different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. A detected action interval is determined both by the unary local detector and by its relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our newly collected concurrent action dataset demonstrate the strength of our method.

2 0.96631712 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition

Author: Jiang Wang, Ying Wu

Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.

3 0.90711397 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection

Author: Mihai Zanfir, Marius Leordeanu, Cristian Sminchisescu

Abstract: Human action recognition under low observational latency is receiving a growing interest in computer vision due to rapidly developing technologies in human-robot interaction, computer gaming and surveillance. In this paper we propose a fast, simple, yet powerful non-parametric Moving Pose (MP) framework for low-latency human action and activity recognition. Central to our methodology is a moving pose descriptor that considers both pose information as well as differential quantities (speed and acceleration) of the human body joints within a short time window around the current frame. The proposed descriptor is used in conjunction with a modified kNN classifier that considers both the temporal location of a particular frame within the action sequence as well as the discrimination power of its moving pose descriptor compared to other frames in the training set. The resulting method is non-parametric and enables low-latency recognition, one-shot learning, and action detection in difficult unsegmented sequences. Moreover, the framework is real-time, scalable, and outperforms more sophisticated approaches on challenging benchmarks like MSR-Action3D or MSR-DailyActivities3D.

4 0.88815308 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding

Author: Weiyu Zhang, Menglong Zhu, Konstantinos G. Derpanis

Abstract: This paper presents a novel approach for analyzing human actions in non-scripted, unconstrained video settings based on volumetric, x-y-t, patch classifiers, termed actemes. Unlike previous action-related work, the discovery of patch classifiers is posed as a strongly-supervised process. Specifically, keypoint labels (e.g., position) across spacetime are used in a data-driven training process to discover patches that are highly clustered in the spacetime keypoint configuration space. To support this process, a new human action dataset consisting of challenging consumer videos is introduced, where notably the action label, the 2D position of a set of keypoints and their visibilities are provided for each video frame. On a novel input video, each acteme is used in a sliding volume scheme to yield a set of sparse, non-overlapping detections. These detections provide the intermediate substrate for segmenting out the action. For action classification, the proposed representation shows significant improvement over state-of-the-art low-level features, while providing spatiotemporal localization as additional output. This output sheds further light into detailed action understanding.

5 0.86720765 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition

Author: Behrooz Mahasseni, Sinisa Todorovic

Abstract: This paper presents an approach to view-invariant action recognition, where human poses and motions exhibit large variations across different camera viewpoints. When each viewpoint of a given set of action classes is specified as a learning task then multitask learning appears suitable for achieving view invariance in recognition. We extend the standard multitask learning to allow identifying: (1) latent groupings of action views (i.e., tasks), and (2) discriminative action parts, along with joint learning of all tasks. This is because it seems reasonable to expect that certain distinct views are more correlated than some others, and thus identifying correlated views could improve recognition. Also, part-based modeling is expected to improve robustness against self-occlusion when actors are imaged from different views. Results on the benchmark datasets show that we outperform standard multitask learning by 21.9%, and the state-of-the-art alternatives by 4.5–6%.

6 0.82394707 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos

7 0.82384425 38 iccv-2013-Action Recognition with Actons

8 0.80757195 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

9 0.72182679 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set

10 0.72005415 166 iccv-2013-Finding Actors and Actions in Movies

11 0.70545959 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition

12 0.67753816 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition

13 0.66407835 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition

14 0.66266394 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition

15 0.66239804 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments

16 0.64810491 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction

17 0.63469756 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition

18 0.6263563 260 iccv-2013-Manipulation Pattern Discovery: A Nonparametric Bayesian Approach

19 0.58755881 69 iccv-2013-Capturing Global Semantic Relationships for Facial Action Unit Recognition

20 0.54484689 99 iccv-2013-Cross-View Action Recognition over Heterogeneous Feature Spaces


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.087), (7, 0.011), (12, 0.033), (13, 0.015), (26, 0.085), (31, 0.041), (35, 0.014), (40, 0.015), (42, 0.093), (44, 0.199), (64, 0.125), (73, 0.024), (78, 0.011), (89, 0.154)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.88278317 303 iccv-2013-Orderless Tracking through Model-Averaged Posterior Estimation

Author: Seunghoon Hong, Suha Kwak, Bohyung Han

Abstract: We propose a novel offline tracking algorithm based on model-averaged posterior estimation through patch matching across frames. Contrary to existing online and offline tracking methods, our algorithm is not based on temporally ordered estimates of target state but attempts to select easy-to-track frames first out of the remaining ones without exploiting the temporal coherency of the target. The posterior of the selected frame is estimated by propagating densities from the already tracked frames in a recursive manner. The density propagation across frames is implemented by an efficient patch matching technique, which is useful for our algorithm since it does not require a motion smoothness assumption. Also, we present a hierarchical approach, where a small set of key frames are tracked first and non-key frames are handled by local key frames. Our tracking algorithm is conceptually well-suited for sequences with abrupt motion, shot changes, and occlusion. We compare our tracking algorithm with existing techniques in real videos with such challenges and illustrate its superior performance qualitatively and quantitatively.

same-paper 2 0.85245955 86 iccv-2013-Concurrent Action Detection with Structural Prediction

Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu

Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence has only one action class label and that different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. A detected action interval is determined both by the unary local detector and by its relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our newly collected concurrent action dataset demonstrate the strength of our method.

3 0.82416242 225 iccv-2013-Joint Segmentation and Pose Tracking of Human in Natural Videos

Author: Taegyu Lim, Seunghoon Hong, Bohyung Han, Joon Hee Han

Abstract: We propose an on-line algorithm to extract a human by foreground/background segmentation and estimate the pose of the human from videos captured by moving cameras. We claim that a virtuous cycle can be created by appropriate interactions between the two modules to solve individual problems. This joint estimation problem is divided into two subproblems, foreground/background segmentation and pose tracking, which alternate iteratively for optimization; the segmentation step generates a foreground mask for human pose tracking, and the pose tracking step provides a foreground response map for segmentation. The final solution is obtained when the iterative procedure converges. We evaluate our algorithm quantitatively and qualitatively in real videos involving various challenges, and present its outstanding performance compared to the state-of-the-art techniques for segmentation and pose estimation.

4 0.79799438 416 iccv-2013-The Interestingness of Images

Author: Michael Gygli, Helmut Grabner, Hayko Riemenschneider, Fabian Nater, Luc Van_Gool

Abstract: We investigate human interest in photos. Based on our own and others' psychological experiments, we identify various cues for "interestingness", namely aesthetics, unusualness and general preferences. For the ranking of retrieved images, interestingness is more appropriate than cues proposed earlier. Interestingness is, for example, correlated with what people believe they will remember. This is opposed to actual memorability, which is uncorrelated to both of them. We introduce a set of features computationally capturing the three main aspects of visual interestingness that we propose and build an interestingness predictor from them. Its performance is shown on three datasets with varying context, reflecting diverse levels of prior knowledge of the viewers.

5 0.7690841 359 iccv-2013-Robust Object Tracking with Online Multi-lifespan Dictionary Learning

Author: Junliang Xing, Jin Gao, Bing Li, Weiming Hu, Shuicheng Yan

Abstract: Recently, sparse representation has been introduced for robust object tracking. By representing the object sparsely, i.e., using only a few templates via ℓ1-norm minimization, these so-called ℓ1-trackers exhibit promising tracking results. In this work, we address the object template building and updating problem in these ℓ1-tracking approaches, which has not been fully studied. We propose to perform template updating, in a new perspective, as an online incremental dictionary learning problem, which is efficiently solved through an online optimization procedure. To guarantee the robustness and adaptability of the tracking algorithm, we also propose to build a multi-lifespan dictionary model. By building target dictionaries of different lifespans, effective object observations can be obtained to deal with the well-known drifting problem in tracking and thus improve the tracking accuracy. We derive effective observation models both generatively and discriminatively based on the online multi-lifespan dictionary learning model and deploy them to the Bayesian sequential estimation framework to perform tracking. The proposed approach has been extensively evaluated on ten challenging video sequences. Experimental results demonstrate the effectiveness of the online learned templates, as well as the state-of-the-art tracking performance of the proposed approach.

6 0.76535189 215 iccv-2013-Incorporating Cloud Distribution in Sky Representation

7 0.7582655 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection

8 0.75817662 380 iccv-2013-Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes

9 0.7570647 442 iccv-2013-Video Segmentation by Tracking Many Figure-Ground Segments

10 0.75515985 166 iccv-2013-Finding Actors and Actions in Movies

11 0.75437033 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition

12 0.75047982 338 iccv-2013-Randomized Ensemble Tracking

13 0.75030279 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments

14 0.74529546 447 iccv-2013-Volumetric Semantic Segmentation Using Pyramid Context Features

15 0.74497938 425 iccv-2013-Tracking via Robust Multi-task Multi-view Joint Sparse Representation

16 0.74011731 242 iccv-2013-Learning People Detectors for Tracking in Crowded Scenes

17 0.73751807 22 iccv-2013-A New Adaptive Segmental Matching Measure for Human Activity Recognition

18 0.7372278 441 iccv-2013-Video Motion for Every Visible Point

19 0.73657286 88 iccv-2013-Constant Time Weighted Median Filtering for Stereo Matching and Beyond

20 0.73631728 340 iccv-2013-Real-Time Articulated Hand Pose Estimation Using Semi-supervised Transductive Regression Forests