iccv iccv2013 iccv2013-240 knowledge-graph by maker-knowledge-mining

240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition


Source: pdf

Author: Jiang Wang, Ying Wu

Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. [sent-4, score-1.52]

2 To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. [sent-5, score-1.411]

3 Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. [sent-6, score-1.656]

4 The recognition of this action class is based on the associated learned alignment of the input action. [sent-7, score-0.685]

5 Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations. [sent-8, score-0.83]

6 Introduction A fundamental yet challenging problem in human action recognition is to deal with its temporal variations. [sent-10, score-0.776]

7 Even for the same style (i.e., the way of performing an action), the action can be performed at different paces and thus span different time durations. [sent-13, score-0.532]

8 Moreover, in practice, action video data may not be accurately localized along the time axis, and the starting and ending of an action are not provided. [sent-14, score-1.083]

9 If used in training, such action videos can only be regarded as weakly labeled. [sent-15, score-0.55]

10 If used as inputs for recognition, they bring extra work of action localization, explicitly or implicitly. [sent-16, score-0.532]

11 Effective handling of such temporal variations is important to the performance of action recognition. [sent-17, score-0.723]

12 These methods attempt to model the generative process of actions so as to perform action inference. (Figure 1: positive sample and negative sample.) [sent-19, score-0.61]

13 The video actions are temporally aligned to a phantom action template. [sent-20, score-1.485]

14 As they exploit the structural or compositional information in modeling, they may produce effective representations for action parsing and interpretation. [sent-23, score-0.595]

15 This facilitates discriminative models for action classification, whose training may be simpler than generative models. [sent-26, score-0.532]

16 Dynamic time warping (DTW) has been used to align videos for recognition [32], time series classification [13] and action retrieval [14]. [sent-27, score-0.828]

17 As a result, action alignment and classification are treated independently. [sent-30, score-0.669]

18 In this paper, we propose to learn action alignment so that we can unify action alignment and classification. [sent-31, score-1.294]

19 Specifically, the proposed method, called maximum margin temporal warping (MMTW), learns temporal action alignment for max margin classification. [sent-32, score-1.431]

20 For each action class, an MMTW model is learned to achieve maximum margin separation from the rest of the action classes. [sent-33, score-1.161]

21 This learned MMTW model can be treated as a phantom action template for representing this action class. [sent-34, score-1.578]

22 First, the proposed maximum margin temporal warping (MMTW) is a novel approach to both action alignment and action recognition. [sent-41, score-1.698]

23 It learns to align action videos and to model actions. [sent-42, score-0.557]

24 Second, we find an innovative method to achieve computationally efficient action alignment and MMTW inference based on dynamic programming, which also enables effective learning. [sent-43, score-0.68]

25 Because the action models are discriminatively learned, and the temporal deformation is explicitly modeled, the proposed approach achieves excellent results on action recognition tasks, as demonstrated by our extensive experiments on these five benchmark datasets. [sent-46, score-1.352]

26 Representing the temporal structure is crucial for successful action recognition. [sent-49, score-0.723]

27 This representation is robust to temporal misalignment because it discards the phase information, but may fall short when the phase information is important for action classification. [sent-54, score-0.795]

28 The temporal structure of an action can also be modeled based on hidden Markov models [10, 11]. [sent-55, score-0.723]

29 Other temporal structure models include temporal AND-OR graph [17], actom sequence model [5] and spatio-temporal graphs [2]. [sent-57, score-0.45]

30 The proposed method is a novel approach to learning the temporal structure for action alignment and classification. [sent-58, score-0.838]

31 Recently, structural max-margin learning has been applied to action detection [21, 23]. [sent-61, score-0.578]

32 [6] employs the structural output SVM to learn a well shaped predictive function for early action detection. [sent-63, score-0.595]

33 [13] models the temporal structure of the action with maximum margin temporal clustering. [sent-64, score-1.011]

34 Action Classification with Maximum Margin Temporal Warping In this section, we propose the maximum margin temporal warping (MMTW) approach to integrate action temporal alignment and action classification. [sent-69, score-1.889]

35 The proposed MMTW approach is robust to the temporal deformation and misalignment in action recognition tasks, and has the discriminative power of the max margin methods. [sent-70, score-0.931]

36 A video action is represented as a sequence of frame-level features X = (x1, · · · , xL), where xi is the visual descriptor extracted at the i-th frame. [sent-71, score-0.598]

37 We denote an action dataset by {(X1, y1), (X2, y2), · · · , (XN, yN)}, where Xi ∈ X is a video action, and yi ∈ Y is its action category label. [sent-75, score-0.582]

38 For each action class, we define a phantom action template T that consists of a sequence of atomic actions: T = {t1, t2, ··· , tLT}, (1) where LT is the length of the atomic action sequence. [sent-81, score-2.468]

39 In addition, the expected length of an atomic action tj is μj, but it can deform under warping. [sent-83, score-0.718]
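
As a concrete illustration of this representation, the following sketch (hypothetical names, not code from the paper) stores a phantom action template as a matrix of atomic-action weight vectors together with their expected lengths and deformation parameters; the exact form of the deformation parameters is an assumption.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class PhantomTemplate:
        atoms: np.ndarray  # (L_T, D): one weight vector t_j per atomic action
        mu: np.ndarray     # (L_T,): expected length mu_j of each atomic action
        d: np.ndarray      # (L_T,): linear length-deformation penalty d_j (assumed form)
        a: np.ndarray      # (L_T,): quadratic length-deformation penalty a_j (assumed form)

        @property
        def length(self) -> int:
            # L_T, the number of atomic actions in the template
            return self.atoms.shape[0]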

40 The long sequence above is aligned to the short sequence below; a red element (i, j) in the alignment matrix means that the i-th element in the long sequence is aligned to the j-th element in the short sequence. [sent-86, score-0.423]

41 In the binary classification setting, the phantom action template is associated with the positive class. [sent-88, score-1.068]

42 In the multi-class setting, each action class is associated with a phantom action template. [sent-89, score-1.477]

43 The phantom template and its deformation parameters are learned from training data (details will be discussed in Sec. [sent-90, score-0.558]

44 In order to deal with misalignment, we align an action X of length L to the phantom action template T with a warping function. [sent-92, score-1.893]

45 Notice that the length L of the input action and the length LT of the phantom template are not necessarily the same. [sent-95, score-1.164]

46 A warping path P is a contiguous set of matrix elements that defines a mapping between X and T. [sent-96, score-0.33]

47 We define the cost function of aligning the action X to the phantom action template T under a warping path P as g(X, P) = (1/L) Σj=1..LT Σi=bj..ej tj · xi − C(P). (3) [sent-104, score-1.902]

48 Here, the {bj, bj + 1, · · · , ej} elements in X are aligned to the j-th element in T, and C(P) is the cost of the length deformation of the atomic actions under the warping P. [sent-108, score-0.646]

49 We denote the number of the elements in X that are aligned to the j-th element in the phantom action template T by lj = ej − bj + 1. [sent-109, score-1.24]

50 In Eq. (4), μj is the expected length of the j-th atomic action in the phantom action template T, and dj, aj model its length variation. [sent-113, score-1.924]
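
A minimal sketch of this cost, assuming the score of a warping path is the sum of inner products between each aligned frame descriptor and its atomic action, minus a linear-plus-quadratic length-deformation penalty, all normalized by L; the exact penalty form and normalization are assumptions, not taken from the paper.

    import numpy as np

    def alignment_cost(X, template, path):
        """X: (L, D) frame descriptors; path: list of (b_j, e_j) index spans,
        one per atomic action, covering X in order (0-based, inclusive)."""
        L = X.shape[0]
        score, deform = 0.0, 0.0
        for j, (b, e) in enumerate(path):
            # frames b..e are aligned to atomic action j: sum of t_j . x_i
            score += float((X[b:e + 1] @ template.atoms[j]).sum())
            dev = (e - b + 1) - template.mu[j]  # l_j - mu_j
            deform += template.d[j] * abs(dev) + template.a[j] * dev ** 2
        return (score - deform) / L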

51 The predictive mapping function is evaluated by finding the optimal warping path P that maximizes the cost function in Eq. (3). [sent-115, score-0.321]

52 Then the binary classification of the action can simply be based on f(X). [sent-121, score-0.554]

53 First, it finds the optimal alignment of the input action to the phantom action template of a particular action class. [sent-123, score-2.225]

54 Second, since both the phantom templates and their deformation parameters are learnt from the training data, the proposed method is more discriminative and adaptive than the traditional dynamic temporal warping. [sent-125, score-0.679]

55 Inference: Action Alignment and Classification In order to predict the class label of an input action X, we perform the following steps. [sent-127, score-0.552]

56 Then, we determine the action class label of X by f(X). [sent-131, score-0.552]

57 We can then compare the matching scores of all the action categories for action recognition. [sent-139, score-1.064]
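
The alignment in these steps can be found with a simple dynamic program over (frames consumed, atomic actions used), assuming each atomic action absorbs a contiguous, non-empty run of frames and using the assumed cost from the earlier sketch; this illustrates the dynamic-programming idea rather than reproducing the paper's exact recursion.

    import numpy as np

    def best_alignment_score(X, template):
        """Max over warping paths of the (assumed) alignment cost, via dynamic programming."""
        L, LT = X.shape[0], template.length
        frame_scores = X @ template.atoms.T                  # (L, LT): t_j . x_i per frame
        prefix = np.vstack([np.zeros((1, LT)), np.cumsum(frame_scores, axis=0)])
        D = np.full((L + 1, LT + 1), -np.inf)
        D[0, 0] = 0.0
        for j in range(1, LT + 1):
            for i in range(j, L + 1):                        # at least one frame per atom
                for l in range(1, i - (j - 1) + 1):          # l frames go to atom j
                    span = prefix[i, j - 1] - prefix[i - l, j - 1]
                    dev = l - template.mu[j - 1]
                    pen = template.d[j - 1] * abs(dev) + template.a[j - 1] * dev ** 2
                    D[i, j] = max(D[i, j], D[i - l, j - 1] + span - pen)
        return D[L, LT] / L

    def predict(X, class_templates):
        # Multi-class decision: pick the class whose phantom template gives
        # the highest matching score f(X).
        return max(class_templates, key=lambda c: best_alignment_score(X, class_templates[c]))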

58 Learning via Latent Structural SVM Since the warping path P is not observable in the training data, we formulate the learning problem as a latent structural SVM [33], with the warping path P as the latent variable. [sent-141, score-0.744]

59 The loss Δ(P, P′) compares the elements in the feature sequence X that are aligned to the j-th element of the action template sequence T in P and P′. [sent-147, score-0.836]

60 In the training data, Xi ∈ X is the sequence of features, and yi ∈ Y is the action category label. [sent-156, score-0.596]

61 The constraints are Δ(P, Pi) + g(Xi, P) − g(Xi, Pi) ≤ ξi, ∀P, ∀yi = −1; 1 − g(Xi, Pi) ≤ ξi, ∀yi = 1; and ξi ≥ 0, ∀i, (9) where Pi is the warping path for the i-th training data and P can be any feasible warping path. [sent-162, score-0.535]

62 Since Δ(P, Pi) can also be decomposed according to each pj of the warping path P, the most violated constraints can be found with the dynamic programming algorithm in Sec. [sent-169, score-0.356]

63 In addition, since Pi are not observable in the training data, we iteratively solve for the warping path Pi, the expected length μj, and the parameters w in our optimization. [sent-171, score-0.383]

64 The optimal warping path Pi is solved via the dynamic programming algorithm in Sec. [sent-172, score-0.356]

65 The expected length μj for the j-th atomic action element in the phantom action template is computed as the average number of elements matched to it in the positive training data: μj = (1/N+) Σ{i: yi = 1} lj^i, where N+ is the number of positive training sequences. [sent-175, score-1.829]

66 In the beginning of the algorithm, we initialize Pi to be a uniform warping, which aligns the same number of elements to each atomic action in the phantom action template. [sent-177, score-1.648]
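
Two steps of this alternating procedure follow directly from the description here: the uniform-warping initialization and the update of the expected lengths μj from the aligned positive sequences. The sketch below shows both (hypothetical helper names); the max-margin update of the template and deformation parameters via a latent structural SVM solver is not reproduced.

    import numpy as np

    def uniform_warping(L, LT):
        # Initialization: align (roughly) the same number of frames to each
        # atomic action of the phantom template; assumes L >= LT.
        bounds = np.linspace(0, L, LT + 1).astype(int)
        return [(int(bounds[j]), int(bounds[j + 1]) - 1) for j in range(LT)]

    def update_expected_lengths(positive_paths):
        # mu_j = average number of frames aligned to atomic action j over the
        # positive training sequences (the update described above).
        lengths = np.array([[e - b + 1 for (b, e) in P] for P in positive_paths], dtype=float)
        return lengths.mean(axis=0)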

67 We learn a phantom action template and the associated score function f(Xi) for each class. [sent-180, score-1.046]

68 We require the length of the phantom action template LT to be the same for all the classes. [sent-182, score-1.105]

69 Frame Feature Description We represent an action by a sequence of high-dimensional frame descriptors, as described in this section. [sent-188, score-0.613]

70 We are interested in action recognition from both the depth sequences captured by Kinect cameras and conventional RGB videos. [sent-189, score-0.636]

71 For example, for the action “call cellphone”, some people tend to use their right hand, while others use their left hand. [sent-216, score-0.532]

72 We cluster the training data of each action category via k-means clustering using the video-level descriptors (such as bag-of-words and Fourier temporal pyramid). [sent-218, score-0.723]

73 We learn a phantom action template for each cluster as a sub-category of a conceptual action class. [sent-219, score-1.578]
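
A minimal sketch of this sub-category split, assuming each video of a class is summarized by a fixed-length video-level descriptor (e.g., a bag-of-words histogram); the cluster index then decides which of the class's phantom templates a video trains. The helper name and parameter values are illustrative.

    import numpy as np
    from sklearn.cluster import KMeans

    def split_into_subcategories(video_descriptors, n_subcategories=2):
        # video_descriptors: (N, D) video-level features for one action class.
        # Returns a cluster index per video; each cluster gets its own MMTW model.
        km = KMeans(n_clusters=n_subcategories, n_init=10, random_state=0)
        return km.fit_predict(np.asarray(video_descriptors, dtype=float))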

74 In the following experiments, unless specified, we use a mixture of two MMTW models for each action class, as described in Sec. [sent-232, score-0.532]

75 MSR Sports Action3D dataset [11] is an action dataset of depth sequences captured by a Kinect camera. [sent-237, score-0.711]

76 Every action was performed by ten subjects three times each. [sent-240, score-0.532]

77 We employ the relative joint positions as the frame descriptors for this dataset, and set the length of the phantom action template LT = 11 in this experiment. [sent-244, score-1.25]

78 Compared to the 79% accuracy of using uniform warping (no action alignment), the proposed MMTW approach achieves much better accuracy because it discriminatively aligns the sequences. [sent-249, score-0.819]

79 We also evaluate dynamic temporal warping (DTW) on this dataset, using the Euclidean distance between the skeleton joint positions as the frame matching cost. [sent-250, score-0.728]
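
For reference, the DTW baseline mentioned here is the standard dynamic-time-warping recursion with the Euclidean distance between per-frame skeleton joint positions as the local matching cost; the sketch below is the textbook algorithm, not code from the paper.

    import numpy as np

    def dtw_distance(A, B):
        # A: (LA, D), B: (LB, D) sequences of frame descriptors (e.g., joint positions).
        LA, LB = A.shape[0], B.shape[0]
        D = np.full((LA + 1, LB + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, LA + 1):
            for j in range(1, LB + 1):
                cost = np.linalg.norm(A[i - 1] - B[j - 1])  # Euclidean frame cost
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[LA, LB]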

80 Finally, we study the robustness of the proposed method to temporal misalignment and phantom action template length, shown in Fig. [sent-253, score-1.309]

81 The robustness of the methods to temporal shifts and phantom template length. [sent-263, score-0.705]

82 The proposed method is more robust than the uniform warping approach because of the explicit action alignment in MMTW. [sent-264, score-0.898]

83 Moreover, the proposed method is insensitive to the length of the phantom action template. [sent-266, score-0.984]

84 In addition, we use the local HON4D features [16] extracted at each human skeleton joint, together with the human skeleton joint positions, as per-frame features. [sent-283, score-0.411]

85 We set the length of the phantom action template LT = 12 in this experiment. [sent-285, score-0.573]

86 3D ActionPair dataset [16] is an action dataset captured by a Kinect camera. [sent-292, score-0.646]

87 Since the motion cue is usually similar for a pair of actions, modeling the temporal structure is crucial for successful action recognition. [sent-294, score-0.723]

88 We employ the relative joint positions and the local HON4D features [16] extracted at each human skeleton joint as the per-frame features. [sent-297, score-0.556]

89 We set the length of the phantom action template LT for this dataset. [sent-300, score-0.573]

90 It contains sports actions from 16 sport classes: basketball layup, bowling, clean and jerk, discus throw, diving platform 10m, diving springboard 3m, hammer throw, high jump, javelin throw, long jump, pole vault, shot put, snatch, tennis serve, triple jump, and vault, with 50 sequences per class. [sent-313, score-0.375]

91 The actions in this dataset usually exhibit complex temporal structure and temporal misalignment. [sent-314, score-0.491]

92 The length of the phantom action template is set to be LT = 30. [sent-317, score-1.105]

93 In the proposed MMTW method, an atomic action can occur at any position in the input sequence as long as the order of the atomic actions is preserved, while [15] restricts the positions of the atomic actions. [sent-321, score-1.445]

94 The length of the phantom action template is set to be LT = 25 in this experiment. [sent-328, score-1.105]

95 Conclusion This paper proposes a novel unification of action alignment and classification, called maximum margin temporal warping (MMTW). [sent-336, score-1.166]

96 The MMTW method integrates the advantages of dynamic temporal warping and discriminative max-margin learning. [sent-337, score-0.455]

97 Due to the learned action alignment, it is robust to temporal variations and misalignment, while at the same time maximizing the margin among different action classes. [sent-338, score-1.329]

98 Learning a hierarchy of discriminative spacetime neighborhood features for human action recognition. [sent-383, score-0.567]

99 Learning hierarchical invariant spatiotemporal features for action recognition with independent subspace analysis. [sent-397, score-0.569]

100 Action MACH: a spatio-temporal Maximum Average Correlation Height filter for action recognition. [sent-452, score-0.532]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('action', 0.532), ('mmtw', 0.496), ('phantom', 0.393), ('warping', 0.231), ('temporal', 0.191), ('skeleton', 0.135), ('atomic', 0.127), ('template', 0.121), ('alignment', 0.115), ('actions', 0.078), ('sports', 0.075), ('margin', 0.074), ('path', 0.073), ('misalignment', 0.072), ('lt', 0.06), ('length', 0.059), ('msr', 0.058), ('aj', 0.056), ('dtw', 0.056), ('actionpair', 0.05), ('lj', 0.048), ('pi', 0.047), ('olympic', 0.047), ('structural', 0.046), ('dj', 0.045), ('sequence', 0.045), ('deformation', 0.044), ('activity', 0.044), ('rgb', 0.039), ('kinect', 0.039), ('element', 0.039), ('employ', 0.038), ('throw', 0.037), ('frame', 0.036), ('depth', 0.036), ('joint', 0.036), ('latent', 0.035), ('positions', 0.035), ('human', 0.035), ('bj', 0.034), ('hof', 0.034), ('fourier', 0.033), ('bji', 0.033), ('lltl', 0.033), ('lltlj', 0.033), ('xcos', 0.033), ('wave', 0.033), ('dynamic', 0.033), ('dataset', 0.031), ('tennis', 0.03), ('vault', 0.029), ('eji', 0.029), ('sequences', 0.029), ('aligned', 0.028), ('sport', 0.028), ('tlt', 0.027), ('elements', 0.026), ('pages', 0.026), ('archives', 0.026), ('cellphone', 0.026), ('align', 0.025), ('joints', 0.025), ('june', 0.024), ('swing', 0.024), ('cutting', 0.023), ('maximum', 0.023), ('actom', 0.023), ('kick', 0.023), ('hoai', 0.023), ('jump', 0.023), ('tracked', 0.023), ('pyramid', 0.023), ('diving', 0.023), ('hammer', 0.023), ('classification', 0.022), ('xi', 0.021), ('captured', 0.021), ('histograms', 0.021), ('svm', 0.021), ('sofa', 0.021), ('uniform', 0.02), ('aligning', 0.02), ('class', 0.02), ('observable', 0.02), ('ending', 0.019), ('yi', 0.019), ('spatiotemporal', 0.019), ('ej', 0.019), ('programming', 0.019), ('regarded', 0.018), ('recognition', 0.018), ('templates', 0.018), ('aligns', 0.018), ('discriminatively', 0.018), ('five', 0.017), ('hmm', 0.017), ('compositional', 0.017), ('predictive', 0.017), ('sit', 0.017), ('frames', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition

Author: Jiang Wang, Ying Wu

Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.

2 0.4031834 86 iccv-2013-Concurrent Action Detection with Structural Prediction

Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu

Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence only have one action class label and different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. An detected action interval is determined both by the unary local detector and the relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our new collected concurrent action dataset demonstrate the strength of our method.

3 0.39433706 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection

Author: Mihai Zanfir, Marius Leordeanu, Cristian Sminchisescu

Abstract: Human action recognition under low observational latency is receiving a growing interest in computer vision due to rapidly developing technologies in human-robot interaction, computer gaming and surveillance. In this paper we propose a fast, simple, yet powerful non-parametric Moving Pose (MP)frameworkfor low-latency human action and activity recognition. Central to our methodology is a moving pose descriptor that considers both pose information as well as differential quantities (speed and acceleration) of the human body joints within a short time window around the current frame. The proposed descriptor is used in conjunction with a modified kNN classifier that considers both the temporal location of a particular frame within the action sequence as well as the discrimination power of its moving pose descriptor compared to other frames in the training set. The resulting method is non-parametric and enables low-latency recognition, one-shot learning, and action detection in difficult unsegmented sequences. Moreover, the framework is real-time, scalable, and outperforms more sophisticated approaches on challenging benchmarks like MSR-Action3D or MSR-DailyActivities3D.

4 0.27676231 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos

Author: Sunil Bandla, Kristen Grauman

Abstract: Collecting and annotating videos of realistic human actions is tedious, yet critical for training action recognition systems. We propose a method to actively request the most useful video annotations among a large set of unlabeled videos. Predicting the utility of annotating unlabeled video is not trivial, since any given clip may contain multiple actions of interest, and it need not be trimmed to temporal regions of interest. To deal with this problem, we propose a detection-based active learner to train action category models. We develop a voting-based framework to localize likely intervals of interest in an unlabeled clip, and use them to estimate the total reduction in uncertainty that annotating that clip would yield. On three datasets, we show our approach can learn accurate action detectors more efficiently than alternative active learning strategies that fail to accommodate the “untrimmed” nature of real video data.

5 0.27523696 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments

Author: Shugao Ma, Jianming Zhang, Nazli Ikizler-Cinbis, Stan Sclaroff

Abstract: We propose Hierarchical Space-Time Segments as a new representation for action recognition and localization. This representation has a two-level hierarchy. The first level comprises the root space-time segments that may contain a human body. The second level comprises multi-grained space-time segments that contain parts of the root. We present an unsupervised method to generate this representation from video, which extracts both static and non-static relevant space-time segments, and also preserves their hierarchical and temporal relationships. Using simple linear SVM on the resultant bag of hierarchical space-time segments representation, we attain better than, or comparable to, state-of-the-art action recognition performance on two challenging benchmark datasets and at the same time produce good action localization results.

6 0.2681621 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

7 0.26338559 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition

8 0.24257609 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction

9 0.24017175 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding

10 0.2351228 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition

11 0.22582361 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps

12 0.20892553 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition

13 0.20783271 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition

14 0.19076228 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set

15 0.18147564 39 iccv-2013-Action Recognition with Improved Trajectories

16 0.18118818 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition

17 0.16998014 127 iccv-2013-Dynamic Pooling for Complex Event Recognition

18 0.16009228 166 iccv-2013-Finding Actors and Actions in Movies

19 0.15379992 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition

20 0.15346639 396 iccv-2013-Space-Time Robust Representation for Action Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.222), (1, 0.254), (2, 0.126), (3, 0.368), (4, -0.028), (5, -0.04), (6, 0.152), (7, -0.136), (8, -0.04), (9, 0.045), (10, 0.009), (11, 0.07), (12, -0.02), (13, -0.039), (14, 0.183), (15, -0.002), (16, 0.032), (17, -0.027), (18, -0.027), (19, -0.064), (20, 0.021), (21, 0.051), (22, -0.083), (23, -0.148), (24, -0.051), (25, -0.025), (26, -0.044), (27, 0.046), (28, -0.025), (29, 0.021), (30, 0.02), (31, -0.0), (32, 0.035), (33, 0.054), (34, 0.046), (35, 0.008), (36, -0.032), (37, -0.0), (38, 0.073), (39, -0.061), (40, -0.049), (41, 0.013), (42, 0.043), (43, 0.021), (44, 0.015), (45, -0.046), (46, 0.026), (47, 0.045), (48, -0.001), (49, -0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98499799 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition

Author: Jiang Wang, Ying Wu

Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.

2 0.95330667 86 iccv-2013-Concurrent Action Detection with Structural Prediction

Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu

Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence only have one action class label and different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. An detected action interval is determined both by the unary local detector and the relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our new collected concurrent action dataset demonstrate the strength of our method.

3 0.87069476 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection

Author: Mihai Zanfir, Marius Leordeanu, Cristian Sminchisescu

Abstract: Human action recognition under low observational latency is receiving a growing interest in computer vision due to rapidly developing technologies in human-robot interaction, computer gaming and surveillance. In this paper we propose a fast, simple, yet powerful non-parametric Moving Pose (MP)frameworkfor low-latency human action and activity recognition. Central to our methodology is a moving pose descriptor that considers both pose information as well as differential quantities (speed and acceleration) of the human body joints within a short time window around the current frame. The proposed descriptor is used in conjunction with a modified kNN classifier that considers both the temporal location of a particular frame within the action sequence as well as the discrimination power of its moving pose descriptor compared to other frames in the training set. The resulting method is non-parametric and enables low-latency recognition, one-shot learning, and action detection in difficult unsegmented sequences. Moreover, the framework is real-time, scalable, and outperforms more sophisticated approaches on challenging benchmarks like MSR-Action3D or MSR-DailyActivities3D.

4 0.85788661 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding

Author: Weiyu Zhang, Menglong Zhu, Konstantinos G. Derpanis

Abstract: This paper presents a novel approach for analyzing human actions in non-scripted, unconstrained video settings based on volumetric, x-y-t, patch classifiers, termed actemes. Unlike previous action-related work, the discovery of patch classifiers is posed as a strongly-supervised process. Specifically, keypoint labels (e.g., position) across spacetime are used in a data-driven training process to discover patches that are highly clustered in the spacetime keypoint configuration space. To support this process, a new human action dataset consisting of challenging consumer videos is introduced, where notably the action label, the 2D position of a set of keypoints and their visibilities are provided for each video frame. On a novel input video, each acteme is used in a sliding volume scheme to yield a set of sparse, non-overlapping detections. These detections provide the intermediate substrate for segmenting out the action. For action classification, the proposed representation shows significant improvement over state-of-the-art low-level features, while providing spatiotemporal localiza- tion as additional output. This output sheds further light into detailed action understanding.

5 0.84385633 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition

Author: Behrooz Mahasseni, Sinisa Todorovic

Abstract: This paper presents an approach to view-invariant action recognition, where human poses and motions exhibit large variations across different camera viewpoints. When each viewpoint of a given set of action classes is specified as a learning task then multitask learning appears suitable for achieving view invariance in recognition. We extend the standard multitask learning to allow identifying: (1) latent groupings of action views (i.e., tasks), and (2) discriminative action parts, along with joint learning of all tasks. This is because it seems reasonable to expect that certain distinct views are more correlated than some others, and thus identifying correlated views could improve recognition. Also, part-based modeling is expected to improve robustness against self-occlusion when actors are imaged from different views. Results on the benchmark datasets show that we outperform standard multitask learning by 21.9%, and the state-of-the-art alternatives by 4.5–6%.

6 0.80947667 38 iccv-2013-Action Recognition with Actons

7 0.76532221 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos

8 0.75777507 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

9 0.70217842 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set

10 0.70004153 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition

11 0.67884707 166 iccv-2013-Finding Actors and Actions in Movies

12 0.67731351 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments

13 0.65913296 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition

14 0.65148306 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition

15 0.64250845 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction

16 0.63581318 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition

17 0.6074754 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition

18 0.58103424 260 iccv-2013-Manipulation Pattern Discovery: A Nonparametric Bayesian Approach

19 0.57489157 69 iccv-2013-Capturing Global Semantic Relationships for Facial Action Unit Recognition

20 0.54577494 99 iccv-2013-Cross-View Action Recognition over Heterogeneous Feature Spaces


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.056), (7, 0.01), (12, 0.039), (13, 0.012), (16, 0.012), (26, 0.052), (31, 0.033), (40, 0.011), (42, 0.09), (48, 0.014), (64, 0.141), (68, 0.138), (73, 0.064), (78, 0.03), (89, 0.176)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.8929745 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition

Author: Jiang Wang, Ying Wu

Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.

2 0.88229156 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?

Author: Yi Yang, Zhigang Ma, Zhongwen Xu, Shuicheng Yan, Alexander G. Hauptmann

Abstract: Compared to visual concepts such as actions, scenes and objects, complex event is a higher level abstraction of longer video sequences. For example, a “marriage proposal” event is described by multiple objects (e.g., ring, faces), scenes (e.g., in a restaurant, outdoor) and actions (e.g., kneeling down). The positive exemplars which exactly convey the precise semantic of an event are hard to obtain. It would be beneficial to utilize the related exemplars for complex event detection. However, the semantic correlations between related exemplars and the target event vary substantially as relatedness assessment is subjective. Two related exemplars can be about completely different events, e.g., in the TRECVID MED dataset, both bicycle riding and equestrianism are labeled as related to “attempting a bike trick” event. To tackle the subjectiveness of human assessment, our algorithm automatically evaluates how positive the related exemplars are for the detection of an event and uses them on an exemplar-specific basis. Experiments demonstrate that our algorithm is able to utilize related exemplars adaptively, and the algorithm gains good perform- z. ance for complex event detection.

3 0.86020386 215 iccv-2013-Incorporating Cloud Distribution in Sky Representation

Author: Kuan-Chuan Peng, Tsuhan Chen

Abstract: Most sky models only describe the cloudiness ofthe overall sky by a single category or parameter such as sky index, which does not account for the distribution of the clouds across the sky. To capture variable cloudiness, we extend the concept of sky index to a random field indicating the level of cloudiness of each sky pixel in our proposed sky representation based on the Igawa sky model. We formulate the problem of solving the sky index of every sky pixel as a labeling problem, where an approximate solution can be efficiently found. Experimental results show that our proposed sky model has better expressiveness, stability with respect to variation in camera parameters, and geo-location estimation in outdoor images compared to the uniform sky index model. Potential applications of our proposed sky model include sky image rendering, where sky images can be generated with an arbitrary cloud distribution at any time and any location, previously impossible with traditional sky models.

4 0.85258579 380 iccv-2013-Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes

Author: Sukrit Shankar, Joan Lasenby, Roberto Cipolla

Abstract: Relative (comparative) attributes are promising for thematic ranking of visual entities, which also aids in recognition tasks [19, 23]. However, attribute rank learning often requires a substantial amount of relational supervision, which is highly tedious, and apparently impracticalfor realworld applications. In this paper, we introduce the Semantic Transform, which under minimal supervision, adaptively finds a semantic feature space along with a class ordering that is related in the best possible way. Such a semantic space is found for every attribute category. To relate the classes under weak supervision, the class ordering needs to be refined according to a cost function in an iterative procedure. This problem is ideally NP-hard, and we thus propose a constrained search tree formulation for the same. Driven by the adaptive semantic feature space representation, our model achieves the best results to date for all of the tasks of relative, absolute and zero-shot classification on two popular datasets.

5 0.85150039 166 iccv-2013-Finding Actors and Actions in Movies

Author: P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, J. Sivic

Abstract: We address the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts. Specifically, we extract actor/action pairs from the script and use them as constraints in a discriminative clustering framework. The corresponding optimization problem is formulated as a quadratic program under linear constraints. People in video are represented by automatically extracted and tracked faces together with corresponding motion features. First, we apply the proposed framework to the task of learning names of characters in the movie and demonstrate significant improvements over previous methods used for this task. Second, we explore the joint actor/action constraint and show its advantage for weakly supervised action learning. We validate our method in the challenging setting of localizing and recognizing characters and their actions in feature length movies Casablanca and American Beauty.

6 0.84736985 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments

7 0.84579343 441 iccv-2013-Video Motion for Every Visible Point

8 0.84487367 88 iccv-2013-Constant Time Weighted Median Filtering for Stereo Matching and Beyond

9 0.8446331 242 iccv-2013-Learning People Detectors for Tracking in Crowded Scenes

10 0.84240443 442 iccv-2013-Video Segmentation by Tracking Many Figure-Ground Segments

11 0.84224021 424 iccv-2013-Tracking Revisited Using RGBD Camera: Unified Benchmark and Baselines

12 0.8405652 303 iccv-2013-Orderless Tracking through Model-Averaged Posterior Estimation

13 0.83993554 86 iccv-2013-Concurrent Action Detection with Structural Prediction

14 0.83796251 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection

15 0.83607244 359 iccv-2013-Robust Object Tracking with Online Multi-lifespan Dictionary Learning

16 0.83514249 425 iccv-2013-Tracking via Robust Multi-task Multi-view Joint Sparse Representation

17 0.8348906 298 iccv-2013-Online Robust Non-negative Dictionary Learning for Visual Tracking

18 0.83246469 338 iccv-2013-Randomized Ensemble Tracking

19 0.82026541 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding

20 0.81783307 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos