iccv iccv2013 iccv2013-265 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Limin Wang, Yu Qiao, Xiaoou Tang
Abstract: This paper proposes motion atom and phrase as a mid-level temporal “part” for representing and classifying complex action. Motion atom is defined as an atomic part of action, and captures the motion information of action video in a short temporal scale. Motion phrase is a temporal composite of multiple motion atoms with an AND/OR structure, which further enhances the discriminative ability of motion atoms by incorporating temporal constraints in a longer scale. Specifically, given a set of weakly labeled action videos, we first design a discriminative clustering method to automatically discover a set of representative motion atoms. Then, based on these motion atoms, we mine effective motion phrases with high discriminative and representative power. We introduce a bottom-up phrase construction algorithm and a greedy selection method for this mining task. We examine the classification performance of the motion atom and phrase based representation on two complex action datasets: Olympic Sports and UCF50. Experimental results show that our method achieves superior performance over recently published methods on both datasets.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper proposes motion atom and phrase as a mid-level temporal “part” for representing and classifying complex action. [sent-12, score-1.36]
2 Motion atom is defined as an atomic part of action, and captures the motion information of action video in a short temporal scale. [sent-13, score-1.179]
3 Motion phrase is a temporal composite of multiple motion atoms with an AND/OR structure, which further enhances the discriminative ability of motion atoms by incorporating temporal constraints in a longer scale. [sent-14, score-2.265]
4 Specifically, given a set of weakly labeled action videos, we first design a discriminative clustering method to automatically discover a set of representative motion atoms. [sent-15, score-0.629]
5 Then, based on these motion atoms, we mine effective motion phrases with high discriminative and representative power. [sent-16, score-1.151]
6 We examine the classification performance of the motion atom and phrase based representation on two complex action datasets: Olympic Sports and UCF50. [sent-18, score-1.449]
7 Each atom describes motion information at a short temporal scale, which can be shared by different complex action classes. [sent-33, score-0.79]
8 For each complex action, there exist temporal structures of multiple atoms over a long temporal scale. [sent-34, score-0.718]
9 First, due to background clutter, viewpoint changes, and motion speed variation, there always exist large intra-class appearance and motion variations within the same class of action. [sent-36, score-0.663]
10 Recent research shows that the temporal structures of complex actions yield effective cues for action classification [8, 15, 23, 26]. [sent-38, score-0.723]
11 At a short temporal scale, each atomic motion corresponds to a simple pattern, and these atomic motions may be shared by different complex action classes. [sent-42, score-0.992]
12 These observations offer us insights into complex action recognition: • Unsupervised discovery of motion atoms. [sent-44, score-0.64]
13 We need to design an unsupervised method to discover a set of motion atoms automatically from the video dataset. [sent-50, score-0.734]
14 The discriminative power of a motion atom is limited by its temporal duration. [sent-54, score-0.853]
15 A motion phrase (i.e., a sequential composition of motion atoms) captures motion information at a longer scale and provides important cues to discriminate different action classes. [sent-57, score-0.866]
16 Based on the above insights, this paper proposes motion atom and phrase, a mid-level representation of action video, which jointly encodes the motion, appearance, and temporal structure of multiple atomic actions. [sent-60, score-1.142]
17 First, we discover a set of motion atoms from training samples in an unsupervised manner. [sent-61, score-0.707]
18 Then, we construct motion phrase as a temporal composite of multiple atoms. [sent-67, score-1.042]
19 It not only captures short-scale motion information of each atom, but also models the temporal structure of multiple atoms in a longer temporal scale. [sent-68, score-0.993]
20 We propose a bottom-up mining algorithm and greedy selection method to obtain a set of motion phrases with high discriminative and representative power. [sent-70, score-0.93]
21 Finally, we represent each video by an activation vector over motion atoms and phrases, obtained by max-pooling the response score of each atom and phrase. [sent-71, score-1.521]
22 … motion atoms and phrases, to represent videos of complex actions; our representation is flexible with respect to the classifier used for recognition. [sent-87, score-0.774]
23 Besides, previous studies usually train a single model for each action class, but our method can discover a set of motion atoms and phrases. [sent-88, score-0.902]
24 Their cuboids are limited in temporal duration and not suitable for complex action recognition. Our motion atom plays a similar role to these motion attributes and parts in essence. [sent-99, score-1.451]
25 However, our motion atoms are obtained in an unsupervised manner from training data, and we model the temporal structure of multiple motion atoms to enhance their descriptive power. [sent-100, score-1.517]
26 Unsupervised Discovery of Motion Atoms: To construct effective representations for complex actions, we first discover a set of motion atoms that capture the motion patterns at a short temporal scale. [sent-106, score-1.002]
27 (Figure caption: left, a motion atom corresponding to running and opening arms for the complex action gym-vault; right, a motion atom corresponding to rolling in circles for the complex action hammer throw.) These atoms act [sent-107, score-1.618]
28 as basic units for constructing more discriminative motion phrases at a longer scale. [sent-108, score-0.968]
29 Given a set of training videos, our objective is to automatically discover a set of common motion patterns as motion atoms. [sent-109, score-0.682]
30 Our main goal is to determine a large set of simple motion patterns, which are shared by many complex actions and can be used as basic units to represent complex actions. [sent-116, score-0.641]
31 We need to make sure that the obtained atom set can cover different motion patterns occurring in various actions. [sent-117, score-0.69]
32 Given two segments … (Algorithm 1: Discovery of motion atoms). [sent-127, score-0.678]
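Since this dump elides the body of Algorithm 1, the following is a minimal sketch of one plausible discriminative clustering loop for atom discovery, assuming each video segment is already encoded as a fixed-length descriptor (e.g., a bag-of-visual-words vector). The k-means initialization, the per-cluster linear SVM, and the score-based reassignment are our assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def discover_motion_atoms(segment_descs, n_atoms=100, n_iters=5):
    """Hypothetical discriminative clustering loop for atom discovery.

    segment_descs: (N, D) array, one descriptor per video segment.
    Returns one linear detector per discovered atom (None if degenerate).
    """
    # Initialize clusters with plain k-means on the segment descriptors.
    labels = KMeans(n_clusters=n_atoms, n_init=4).fit_predict(segment_descs)
    detectors = [None] * n_atoms
    for _ in range(n_iters):
        scores = np.full((len(segment_descs), n_atoms), -np.inf)
        for a in range(n_atoms):
            pos = segment_descs[labels == a]
            if len(pos) < 2:  # skip clusters that collapsed
                continue
            neg = segment_descs[labels != a]
            # A one-vs-rest linear SVM acts as the detector for atom a.
            clf = LinearSVC(C=1.0).fit(
                np.vstack([pos, neg]),
                np.hstack([np.ones(len(pos)), np.zeros(len(neg))]))
            detectors[a] = clf
            scores[:, a] = clf.decision_function(segment_descs)
        # Reassign each segment to the atom whose detector fires strongest.
        labels = scores.argmax(axis=1)
    return detectors
```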
33 One atom usually corresponds to a simple motion pattern within a short temporal scale, and may occur in different classes of complex actions. [sent-167, score-0.884]
34 These facts limit the discriminative ability of motion atoms in classifying complex actions. [sent-168, score-0.816]
35 To circumvent this problem, we make use of these atoms as basic units to construct motion phrases at a longer scale. [sent-169, score-1.273]
36 For the action classification task, motion phrases are expected to have the following properties: • Descriptive property: Each phrase should be a temporal composite of highly related motion atoms. [sent-170, score-1.552]
37 Meanwhile, to deal with motion speed variations, a motion phrase needs to allow temporal displacement among its composite atoms. [sent-172, score-1.362]
38 Illustration for motion phrase: motion phrase is an AND/OR structure over a set of atom units, which are indicated by ellipsoids. [sent-174, score-1.454]
39 It is desirable that a motion phrase is highly related to a certain class of action. [sent-176, score-0.846]
40 Representative property: Due to large variations among complex action videos, each motion phrase can only cover part of the action videos. [sent-178, score-1.149]
41 Thus, we need to take account of the correlations between different phrases, and we wish to determine a set of motion phrases which convey enough motion patterns to handle the variations of complex actions. [sent-179, score-1.173]
42 Motion Phrase Definition: Based on the analysis above, we define a motion phrase as an AND/OR structure on a set of motion atom units, as shown in Figure 3. [sent-180, score-1.542]
43 • Each atom unit, denoted as Π = (A, t, σ), refers to a motion atom A detected in the neighborhood of temporal anchor point t. [sent-183, score-1.102]
44 Based on these atom units, we construct motion phrases by AND/OR structure. [sent-193, score-1.09]
45 We first apply an OR operation over several atom units that have the same atom label and are located nearby (e.g. …
46 Then, we conduct an AND operation over the selected atom units and take the smallest response as the motion phrase response. [sent-200, score-1.279]
47 Thus the response value r of a motion phrase P with respect to a given video V is: r(V, P) = min_{OR_i ∈ P} max_{Π_j ∈ OR_i} v(V, Π_j), (4) where OR_i denotes the OR operations in motion phrase P. [sent-201, score-1.703]
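As a concrete reading of Equation (4), the sketch below computes a phrase response in Python. The per-segment atom score matrix and the Gaussian-weighted neighborhood used for the atom-unit response v(V, Π) are our assumptions; the text only states that atom A is detected near anchor point t, with σ presumably controlling the neighborhood size.

```python
import numpy as np

def atom_unit_response(seg_scores, unit):
    """Response v(V, Pi) of one atom unit Pi = (A, t, sigma).

    seg_scores: (S, M) array; seg_scores[s, a] is the detection score of
    atom a on segment s of video V. The Gaussian temporal weighting
    around anchor t is an assumption, not the paper's stated formula.
    """
    atom, t, sigma = unit
    s_idx = np.arange(seg_scores.shape[0])
    weights = np.exp(-0.5 * ((s_idx - t) / sigma) ** 2)
    return np.max(weights * seg_scores[:, atom])

def phrase_response(seg_scores, phrase):
    """Equation (4): r(V, P) = min over OR groups of max over their units.

    phrase: list of OR groups; each group is a list of (atom, t, sigma).
    """
    return min(
        max(atom_unit_response(seg_scores, u) for u in group)
        for group in phrase
    )
```

Note that for the special 0-motion phrase (A, 0, +∞) introduced later, the infinite σ makes the Gaussian weights uniform, so the response reduces to the best score of atom A over the whole video.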
48 The size of a motion phrase is defined as the number of OR operations it includes (e.g. …
49 In essence, motion phrase representation is the temporal composite of multiple atomic motion units. [sent-205, score-1.448]
50 The OR operation allows us to search for the best location for the current motion atom, and makes the representation flexible in dealing with the temporal displacement caused by motion speed variations. [sent-206, score-0.809]
51 Overall, a motion phrase not only delivers the motion information of each atom, but also encodes the temporal structure among the atoms. [sent-208, score-1.285]
52 Evaluation of Discriminative Ability: A motion phrase P is discriminative for the c-th class of complex action if it is highly related to this class but appears sparsely in other action classes. [sent-210, score-1.423]
53 Due to the large variance among action videos, a single motion phrase may obtain strong responses only on part of the videos of a certain class. [sent-213, score-1.099]
54 Mining Motion Phrase: Given a training video set V = {Vi}_{i=1}^N with class labels Y = {yi}_{i=1}^N and a set of motion atoms A = {Ai}_{i=1}^M, our goal is to find a set of motion phrases P = {Pi}_{i=1}^K for complex action classes. [sent-215, score-0.851]
55 Given the class c, for each individual motion phrase, we want it to have high discriminative and representative ability with respect to the current class c. [sent-216, score-1.283]
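Equations (5) and (6) are not reproduced in this dump, so the following are assumed stand-ins that capture the stated intent: a phrase is discriminative when it responds much more strongly on class c than on any other class, and representative when the videos it responds to most strongly belong to c. The top-40 rule echoes the mining step quoted later in the text.

```python
import numpy as np

def discriminative_score(responses, labels, c):
    """Assumed stand-in for Equation (5): mean response on class c minus
    the best mean response among the other classes."""
    on_c = responses[labels == c].mean()
    best_other = max(responses[labels == k].mean()
                     for k in np.unique(labels) if k != c)
    return on_c - best_other

def representative_score(responses, labels, c, top=40):
    """Assumed stand-in for Equation (6): fraction of the phrase's
    top-responding videos that belong to class c."""
    top_idx = np.argsort(responses)[::-1][:top]
    return float(np.mean(labels[top_idx] == c))
```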
56 Thus, the set of motion phrases is able to cover the complexity of action videos. [sent-223, score-1.003]
57 The main challenge comes from the fact that the number of possible combinations of atom units that form a motion phrase is huge. [sent-224, score-1.218]
58 Assuming a video has k segments and the motion atom set has size M, there are M × k possible atom units. [sent-225, score-1.073]
59 If a phrase of size s has high representative ability for action class c (Equation (6)), then any (s − 1)-atom phrase obtained by eliminating one motion atom should also have high representative ability. [sent-230, score-1.753]
60 Finally, we eliminate motion phrases of low discriminative ability with a threshold τ. [sent-233, score-0.889]
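The pruning property above suggests an apriori-style bottom-up construction, sketched below. The `scores_fn` callback, the encoding of candidates as sorted tuples of atom-unit indices, and the thresholds are our assumptions about how the text's procedure could be realized, not the paper's exact algorithm.

```python
from itertools import combinations

def mine_phrase_candidates(n_units, scores_fn, rep_thresh, tau, max_size=3):
    """Apriori-style bottom-up phrase construction (sketch).

    scores_fn(candidate) is assumed to return a pair
    (representative_score, discriminative_score) for a candidate phrase,
    encoded as a sorted tuple of atom-unit indices (one per OR group).
    """
    kept = {1: [(u,) for u in range(n_units)
                if scores_fn((u,))[0] >= rep_thresh]}
    for s in range(2, max_size + 1):
        prev = set(kept[s - 1])
        kept[s] = []
        for base in kept[s - 1]:
            for u in range(max(base) + 1, n_units):
                cand = base + (u,)
                # By the property above, every (s-1)-subset must have
                # survived the previous round, or cand can be pruned.
                if all(sub in prev for sub in combinations(cand, s - 1)):
                    if scores_fn(cand)[0] >= rep_thresh:
                        kept[s].append(cand)
    # Finally, drop candidates of low discriminative ability (threshold tau).
    return [p for size in kept for p in kept[size] if scores_fn(p)[1] >= tau]
```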
61 In each iteration, we determine a motion phrase with high individual representative power that meanwhile increases the representative power of the selected set the most. [sent-238, score-0.96]
62 Data: motion phrase candidates P = {Pi}_{i=1}^L; class: c; number of phrases: K.
63 - Return motion phrases: P. In the mining process, for each motion phrase, we consider the top 40 videos with the highest response value (i.e. … [sent-252, score-0.769]
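Below is a sketch of the greedy selection step described above. The coverage criterion, built from each phrase's top-40 responding videos of class c, is our reading of the quoted mining rule rather than the paper's exact objective.

```python
import numpy as np

def greedy_select_phrases(candidates, responses, labels, c, K, top=40):
    """Greedily pick K phrases that most increase coverage of class c.

    responses: dict mapping each candidate phrase to its (N,) array of
    response values over the N training videos (assumed precomputed).
    """
    class_vids = set(np.flatnonzero(labels == c).tolist())

    def covered(phrase):
        # Class-c videos among this phrase's top-responding videos.
        top_idx = np.argsort(responses[phrase])[::-1][:top]
        return class_vids.intersection(top_idx.tolist())

    selected, covered_so_far = [], set()
    pool = list(candidates)
    while pool and len(selected) < K:
        # Pick the phrase adding the most not-yet-covered class-c videos.
        best = max(pool, key=lambda p: len(covered(p) - covered_so_far))
        pool.remove(best)
        selected.append(best)
        covered_so_far |= covered(best)
    return selected
```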
64 Recognition with Motion Atoms and Phrases: Motion atoms and phrases can be regarded as mid-level units for representing complex actions. [sent-259, score-0.949]
65 Specifically, for each motion atom A, we define a special motion phrase, in which there is only one atom unit (A, 0, +∞). [sent-261, score-1.274]
66 We call this special motion phrase a 0-motion phrase. [sent-262, score-0.808]
67 Then, with a set of motion phrases P = {Pi}_{i=1}^K whose sizes range from 0 to MAX, we represent each video V by an activation vector f = [r1, · · · , rK], where ri is the response value of motion phrase Pi with respect to video V. [sent-264, score-1.464]
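Putting the pieces together, here is a sketch of the final video representation: one activation entry per mined phrase, fed to the linear SVM the text says is used for atoms and phrases. `phrase_response` is the function sketched after Equation (4); the variable names are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def activation_vector(seg_scores, phrases):
    """f = [r_1, ..., r_K]: one max-pooled response per mined phrase
    (0-motion phrases are single atom units (A, 0, +inf))."""
    return np.array([phrase_response(seg_scores, p) for p in phrases])

# Usage sketch: featurize every training video, then train a linear SVM.
# X = np.stack([activation_vector(s, phrases) for s in all_seg_scores])
# clf = LinearSVC().fit(X, y)
```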
68 Experiments We evaluate the effectiveness of motion atom and phrase on two complex action datasets: Olympic Sports dataset [15] and UCF50 dataset [18]. [sent-268, score-1.435]
69 Left: performance of motion phrase for different sizes on the Olympic Sports dataset. [sent-270, score-0.808]
70 Right: performance trend of varying maximum size for motion phrase on the Olympic Sports dataset. [sent-271, score-0.808]
71 Size of Motion Phrases: We examine the performance of motion phrases with different sizes on the Olympic Sports dataset and the results are shown in Figure 4. [sent-274, score-0.754]
72 1-motion phrases are mined for high discriminative and representative power, and thus their performance is better than 0-motion phrases (motion atoms), whose discriminative power is relatively low. [sent-276, score-1.082]
73 Secondly, we notice that the mAPs of 2-motion phrases and 3-motion phrases are lower than the mAPs of 1-motion phrases and 0-motion phrases. [sent-277, score-1.347]
74 This may be due to the large variations of video data, and the number of mined 2-motion phrases and 3-motion phrases is much smaller than the other two. [sent-278, score-0.972]
75 Although motion phrases of large size are more discriminative than others, they only cover a small part of the video data. [sent-279, score-0.862]
76 Besides, the information conveyed by large motion phrases is partly contained in the motion phrases of smaller size. [sent-281, score-1.525]
77 We combine the representation of motion phrases with different sizes and the performance is shown in the right of Figure 4. [sent-282, score-0.768]
78 We see that the performance increases noticeably when using motion phrases of sizes from 0 to 2. [sent-283, score-0.754]
79 But there is only slight improvement when including motion phrases of size 3. [sent-284, score-0.754]
80 Therefore, in the remaining discussions, we fix the maximum size of motion phrases as 2. [sent-286, score-0.754]
81 Motion phrase can automatically locate temporal composites of multiple motion atoms (indicated by red boxes) in complex actions. [sent-304, score-1.396]
82 "Combine all" indicates the combination of low-level features with motion atoms and phrases, with which we obtain state-of-the-art results (Table 2).
83 Note that for motion atoms and phrases, we only use linear SVM. [sent-318, score-0.64]
84 We find that the motion atom based mid-level representations achieve better performance than low-level features on both datasets. [sent-320, score-0.644]
85 However, motion atoms can achieve good results just with linear SVM. [sent-322, score-0.64]
86 The combination of motion atoms and phrases can further improve the recognition results. [sent-323, score-1.089]
87 Finally, we combine motion atoms and phrases with lowlevel features, and obtain the state-of-the-art performances on both datasets. [sent-325, score-1.105]
88 Comparison with Other Methods: We compare motion atoms and phrases with other methods on both datasets, and the results are shown in Table 2 and Table 3. [sent-327, score-1.089]
89 Our mid-level representation aims to find multiple motion atoms and phrases, and each representation covers a subset of videos. [sent-330, score-0.668]
90 In [14], the authors use attribute representation, where the attributes are specified in advance, and we find motion atoms and phrases learned from training data are more flexible and effective. [sent-332, score-1.142]
91 From the results of Table 3, we see that our motion atoms and phrases outperform these low-level features on UCF50 dataset. [sent-334, score-1.089]
92 Unlike action bank, our motion atom and phrase correspond to mid-level “parts” of the action, similar to the mid-level motionlet [25]. [sent-337, score-1.387]
93 Compared with the latest paper [24], our motion atom and phrase use fewer descriptors and a smaller codebook size. [sent-340, score-1.147]
94 This indicates that motion atom and phrase are effective for action classification, especially for complex action classes with longer temporal scale. [sent-343, score-1.862]
95 Visualization: We show some examples of motion atoms and phrases in Figure 2 and Figure 5 respectively. [sent-344, score-1.089]
96 Motion phrase consists of a sequence of motion atoms. [sent-347, score-0.808]
97 As shown in the examples of Figure 5, motion phrase can discover waiting and diving for diving-platform, running and layup for basketball-layup, running and jumping for triple jumping, and running and landing for vault. [sent-348, score-0.978]
98 Conclusion We propose motion atom and phrase for representing and recognizing complex actions. [sent-350, score-1.226]
99 Motion atom describes a simple motion pattern at a short temporal scale, and motion phrase encodes the temporal structure of multiple atoms at a longer scale. [sent-351, score-2.15]
100 From the experimental results, we see that motion atoms and phrases are effective representations and outperform several recently published low-level features and complex models. [sent-354, score-1.215]
wordName wordTfidf (topN-words)
[('phrase', 0.503), ('phrases', 0.449), ('atoms', 0.335), ('atom', 0.322), ('motion', 0.305), ('action', 0.228), ('temporal', 0.153), ('olympic', 0.121), ('sports', 0.104), ('atomic', 0.101), ('actions', 0.094), ('units', 0.088), ('complex', 0.077), ('segments', 0.068), ('composite', 0.067), ('mining', 0.066), ('representative', 0.051), ('videos', 0.049), ('rep', 0.047), ('vi', 0.046), ('discriminative', 0.044), ('qiao', 0.044), ('response', 0.044), ('video', 0.043), ('cooking', 0.04), ('class', 0.038), ('ability', 0.037), ('jumping', 0.036), ('discover', 0.034), ('descriptive', 0.032), ('cluster', 0.032), ('firstly', 0.031), ('discovery', 0.03), ('mine', 0.03), ('power', 0.029), ('hjm', 0.029), ('motionlet', 0.029), ('longer', 0.028), ('short', 0.027), ('running', 0.027), ('return', 0.027), ('resort', 0.026), ('motionlets', 0.025), ('svm', 0.025), ('duration', 0.024), ('pi', 0.024), ('activity', 0.024), ('composites', 0.023), ('activation', 0.023), ('patterns', 0.022), ('ik', 0.022), ('bovw', 0.021), ('interchange', 0.021), ('grouplet', 0.021), ('shenzhen', 0.021), ('attributes', 0.021), ('clustering', 0.021), ('bank', 0.021), ('cover', 0.021), ('meanwhile', 0.021), ('sure', 0.02), ('actom', 0.02), ('amer', 0.02), ('rohrbach', 0.02), ('unit', 0.02), ('rolling', 0.019), ('diving', 0.019), ('recognizing', 0.019), ('structure', 0.019), ('researches', 0.019), ('script', 0.019), ('facts', 0.018), ('latent', 0.018), ('gaidon', 0.018), ('effective', 0.018), ('representations', 0.017), ('codebook', 0.017), ('partly', 0.017), ('niebles', 0.017), ('unsupervised', 0.017), ('operation', 0.017), ('cuboids', 0.016), ('attribute', 0.016), ('training', 0.016), ('lowlevel', 0.016), ('ap', 0.016), ('mined', 0.016), ('preference', 0.015), ('decomposed', 0.015), ('libsvm', 0.015), ('marszalek', 0.015), ('variations', 0.015), ('displacement', 0.015), ('greedy', 0.015), ('representation', 0.014), ('peng', 0.014), ('construct', 0.014), ('deal', 0.014), ('published', 0.014), ('variance', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition
Author: Limin Wang, Yu Qiao, Xiaoou Tang
Abstract: This paper proposes motion atom and phrase as a mid-level temporal “part” for representing and classifying complex action. Motion atom is defined as an atomic part of action, and captures the motion information of action video in a short temporal scale. Motion phrase is a temporal composite of multiple motion atoms with an AND/OR structure, which further enhances the discriminative ability of motion atoms by incorporating temporal constraints in a longer scale. Specifically, given a set of weakly labeled action videos, we first design a discriminative clustering method to automatically discover a set of representative motion atoms. Then, based on these motion atoms, we mine effective motion phrases with high discriminative and representative power. We introduce a bottom-up phrase construction algorithm and a greedy selection method for this mining task. We examine the classification performance of the motion atom and phrase based representation on two complex action datasets: Olympic Sports and UCF50. Experimental results show that our method achieves superior performance over recently published methods on both datasets.
2 0.26213947 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps
Author: Jiajia Luo, Wei Wang, Hairong Qi
Abstract: Human action recognition based on the depth information provided by commodity depth sensors is an important yet challenging task. The noisy depth maps, different lengths of action sequences, and free styles in performing actions, may cause large intra-class variations. In this paper, a new framework based on sparse coding and temporal pyramid matching (TPM) is proposed for depth-based human action recognition. Especially, a discriminative class-specific dictionary learning algorithm is proposed for sparse coding. By adding the group sparsity and geometry constraints, features can be well reconstructed by the sub-dictionary belonging to the same class, and the geometry relationships among features are also kept in the calculated coefficients. The proposed approach is evaluated on two benchmark datasets captured by depth cameras. Experimental results show that the proposed algorithm repeatedly achieves superior performance to state-of-the-art algorithms. Moreover, the proposed dictionary learning method also outperforms classic dictionary learning approaches.
3 0.23985794 179 iccv-2013-From Subcategories to Visual Composites: A Multi-level Framework for Object Detection
Author: Tian Lan, Michalis Raptis, Leonid Sigal, Greg Mori
Abstract: The appearance of an object changes profoundly with pose, camera view and interactions of the object with other objects in the scene. This makes it challenging to learn detectors based on an object-level label (e.g., “car”). We postulate that having a richer set of labelings (at different levels of granularity) for an object, including finer-grained subcategories, consistent in appearance and view, and higher-order composites – contextual groupings of objects consistent in their spatial layout and appearance, can significantly alleviate these problems. However, obtaining such a rich set of annotations, including annotation of an exponentially growing set of object groupings, is simply not feasible. We propose a weakly-supervised framework for object detection where we discover subcategories and the composites automatically with only traditional object-level category labels as input. To this end, we first propose an exemplar-SVM-based clustering approach, with latent SVM refinement, that discovers a variable length set of discriminative subcategories for each object class. We then develop a structured model for object detection that captures interactions among object subcategories and automatically discovers semantically meaningful and discriminatively relevant visual composites. We show that this model produces state-of-the-art performance on UIUC phrase object detection benchmark.
4 0.20892553 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
Author: Jiang Wang, Ying Wu
Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.
5 0.20486647 86 iccv-2013-Concurrent Action Detection with Structural Prediction
Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu
Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence only has one action class label and different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. A detected action interval is determined both by the unary local detector and the relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our newly collected concurrent action dataset demonstrate the strength of our method.
6 0.20404172 294 iccv-2013-Offline Mobile Instance Retrieval with a Small Memory Footprint
7 0.19728681 198 iccv-2013-Hierarchical Part Matching for Fine-Grained Visual Categorization
8 0.19598435 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
9 0.19382015 39 iccv-2013-Action Recognition with Improved Trajectories
10 0.19082312 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
11 0.17881331 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition
12 0.15616766 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
13 0.15527229 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition
14 0.1552114 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
15 0.15219402 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction
16 0.14799054 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos
17 0.1328941 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding
18 0.12906554 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
19 0.12577882 260 iccv-2013-Manipulation Pattern Discovery: A Nonparametric Bayesian Approach
20 0.1183892 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition
topicId topicWeight
[(0, 0.19), (1, 0.199), (2, 0.079), (3, 0.255), (4, -0.046), (5, 0.003), (6, 0.041), (7, -0.072), (8, 0.02), (9, 0.029), (10, 0.045), (11, 0.066), (12, 0.047), (13, -0.04), (14, 0.006), (15, 0.003), (16, 0.033), (17, -0.012), (18, 0.03), (19, -0.031), (20, -0.02), (21, 0.033), (22, 0.013), (23, -0.014), (24, -0.046), (25, -0.018), (26, 0.001), (27, 0.006), (28, 0.007), (29, 0.001), (30, 0.007), (31, -0.096), (32, -0.097), (33, 0.058), (34, -0.034), (35, 0.052), (36, -0.034), (37, -0.002), (38, -0.051), (39, 0.015), (40, -0.027), (41, 0.003), (42, -0.003), (43, -0.03), (44, -0.074), (45, 0.042), (46, -0.02), (47, 0.082), (48, 0.155), (49, 0.132)]
simIndex simValue paperId paperTitle
same-paper 1 0.94182444 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition
Author: Limin Wang, Yu Qiao, Xiaoou Tang
Abstract: This paper proposes motion atom and phrase as a mid-level temporal “part” for representing and classifying complex action. Motion atom is defined as an atomic part of action, and captures the motion information of action video in a short temporal scale. Motion phrase is a temporal composite of multiple motion atoms with an AND/OR structure, which further enhances the discriminative ability of motion atoms by incorporating temporal constraints in a longer scale. Specifically, given a set of weakly labeled action videos, we first design a discriminative clustering method to automatically discover a set of representative motion atoms. Then, based on these motion atoms, we mine effective motion phrases with high discriminative and representative power. We introduce a bottom-up phrase construction algorithm and a greedy selection method for this mining task. We examine the classification performance of the motion atom and phrase based representation on two complex action datasets: Olympic Sports and UCF50. Experimental results show that our method achieves superior performance over recently published methods on both datasets.
2 0.72058964 260 iccv-2013-Manipulation Pattern Discovery: A Nonparametric Bayesian Approach
Author: Bingbing Ni, Pierre Moulin
Abstract: We aim to unsupervisedly discover human’s action (motion) patterns of manipulating various objects in scenarios such as assisted living. We are motivated by two key observations. First, large variation exists in motion patterns associated with various types of objects being manipulated, thus manually defining motion primitives is infeasible. Second, some motion patterns are shared among different objects being manipulated while others are object specific. We therefore propose a nonparametric Bayesian method that adopts a hierarchical Dirichlet process prior to learn representative manipulation (motion) patterns in an unsupervised manner. Taking easy-to-obtain object detection score maps and dense motion trajectories as inputs, the proposed probabilistic model can discover motion pattern groups associated with different types of objects being manipulated with a shared manipulation pattern dictionary. The size of the learned dictionary is automatically inferred. Comprehensive experiments on two assisted living benchmarks and a cooking motion dataset demonstrate superiority of our learned manipulation pattern dictionary in representing manipulation actions for recognition.
3 0.6891219 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
Author: Jiang Wang, Ying Wu
Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.
4 0.68646622 86 iccv-2013-Concurrent Action Detection with Structural Prediction
Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu
Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence only has one action class label and different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. A detected action interval is determined both by the unary local detector and the relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our newly collected concurrent action dataset demonstrate the strength of our method.
5 0.68451118 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding
Author: Weiyu Zhang, Menglong Zhu, Konstantinos G. Derpanis
Abstract: This paper presents a novel approach for analyzing human actions in non-scripted, unconstrained video settings based on volumetric, x-y-t, patch classifiers, termed actemes. Unlike previous action-related work, the discovery of patch classifiers is posed as a strongly-supervised process. Specifically, keypoint labels (e.g., position) across spacetime are used in a data-driven training process to discover patches that are highly clustered in the spacetime keypoint configuration space. To support this process, a new human action dataset consisting of challenging consumer videos is introduced, where notably the action label, the 2D position of a set of keypoints and their visibilities are provided for each video frame. On a novel input video, each acteme is used in a sliding volume scheme to yield a set of sparse, non-overlapping detections. These detections provide the intermediate substrate for segmenting out the action. For action classification, the proposed representation shows significant improvement over state-of-the-art low-level features, while providing spatiotemporal localization as additional output. This output sheds further light into detailed action understanding.
6 0.66484576 38 iccv-2013-Action Recognition with Actons
7 0.6568014 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
8 0.65640241 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
9 0.64844763 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
10 0.63699597 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
11 0.60634476 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction
12 0.59518272 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition
13 0.5795908 39 iccv-2013-Action Recognition with Improved Trajectories
14 0.57549417 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition
15 0.55187201 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps
16 0.551561 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition
17 0.54635108 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos
18 0.54373896 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
19 0.53575563 69 iccv-2013-Capturing Global Semantic Relationships for Facial Action Unit Recognition
20 0.52658045 145 iccv-2013-Estimating the Material Properties of Fabric from Video
topicId topicWeight
[(2, 0.107), (4, 0.012), (7, 0.019), (12, 0.018), (26, 0.144), (31, 0.029), (42, 0.065), (64, 0.111), (73, 0.024), (78, 0.041), (83, 0.088), (89, 0.198)]
simIndex simValue paperId paperTitle
1 0.91084719 414 iccv-2013-Temporally Consistent Superpixels
Author: Matthias Reso, Jörn Jachalsky, Bodo Rosenhahn, Jörn Ostermann
Abstract: Superpixel algorithms represent a very useful and increasingly popular preprocessing step for a wide range of computer vision applications, as they offer the potential to boost efficiency and effectiveness. In this regard, this paper presents a highly competitive approach for temporally consistent superpixels for video content. The approach is based on energy-minimizing clustering utilizing a novel hybrid clustering strategy for a multi-dimensional feature space working in a global color subspace and local spatial subspaces. Moreover, a new contour evolution based strategy is introduced to ensure spatial coherency of the generated superpixels. For a thorough evaluation the proposed approach is compared to state-of-the-art supervoxel algorithms using established benchmarks and shows a superior performance.
same-paper 2 0.90374112 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition
Author: Limin Wang, Yu Qiao, Xiaoou Tang
Abstract: This paper proposes motion atom and phrase as a mid-level temporal “part” for representing and classifying complex action. Motion atom is defined as an atomic part of action, and captures the motion information of action video in a short temporal scale. Motion phrase is a temporal composite of multiple motion atoms with an AND/OR structure, which further enhances the discriminative ability of motion atoms by incorporating temporal constraints in a longer scale. Specifically, given a set of weakly labeled action videos, we first design a discriminative clustering method to automatically discover a set of representative motion atoms. Then, based on these motion atoms, we mine effective motion phrases with high discriminative and representative power. We introduce a bottom-up phrase construction algorithm and a greedy selection method for this mining task. We examine the classification performance of the motion atom and phrase based representation on two complex action datasets: Olympic Sports and UCF50. Experimental results show that our method achieves superior performance over recently published methods on both datasets.
3 0.90311074 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding
Author: Weiyu Zhang, Menglong Zhu, Konstantinos G. Derpanis
Abstract: This paper presents a novel approach for analyzing human actions in non-scripted, unconstrained video settings based on volumetric, x-y-t, patch classifiers, termed actemes. Unlike previous action-related work, the discovery of patch classifiers is posed as a strongly-supervised process. Specifically, keypoint labels (e.g., position) across spacetime are used in a data-driven training process to discover patches that are highly clustered in the spacetime keypoint configuration space. To support this process, a new human action dataset consisting of challenging consumer videos is introduced, where notably the action label, the 2D position of a set of keypoints and their visibilities are provided for each video frame. On a novel input video, each acteme is used in a sliding volume scheme to yield a set of sparse, non-overlapping detections. These detections provide the intermediate substrate for segmenting out the action. For action classification, the proposed representation shows significant improvement over state-of-the-art low-level features, while providing spatiotemporal localization as additional output. This output sheds further light into detailed action understanding.
4 0.90004981 411 iccv-2013-Symbiotic Segmentation and Part Localization for Fine-Grained Categorization
Author: Yuning Chai, Victor Lempitsky, Andrew Zisserman
Abstract: We propose a new method for the task of fine-grained visual categorization. The method builds a model of the base-level category that can be fitted to images, producing high-quality foreground segmentation and mid-level part localizations. The model can be learnt from the typical datasets available for fine-grained categorization, where the only annotation provided is a loose bounding box around the instance (e.g. bird) in each image. Both segmentation and part localizations are then used to encode the image content into a highly-discriminative visual signature. The model is symbiotic in that part discovery/localization is helped by segmentation and, conversely, the segmentation is helped by the detection (e.g. part layout). Our model builds on top of the part-based object category detector of Felzenszwalb et al., and also on the powerful GrabCut segmentation algorithm of Rother et al., and adds a simple spatial saliency coupling between them. In our evaluation, the model improves the categorization accuracy over the state-of-the-art. It also improves over what can be achieved with an analogous system that runs segmentation and part-localization independently.
5 0.89832842 86 iccv-2013-Concurrent Action Detection with Structural Prediction
Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu
Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence only has one action class label and different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. A detected action interval is determined both by the unary local detector and the relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our newly collected concurrent action dataset demonstrate the strength of our method.
6 0.89820957 442 iccv-2013-Video Segmentation by Tracking Many Figure-Ground Segments
7 0.89681089 95 iccv-2013-Cosegmentation and Cosketch by Unsupervised Learning
8 0.89527595 396 iccv-2013-Space-Time Robust Representation for Action Recognition
9 0.89462978 299 iccv-2013-Online Video SEEDS for Temporal Window Objectness
10 0.89419818 150 iccv-2013-Exemplar Cut
11 0.89351416 8 iccv-2013-A Deformable Mixture Parsing Model with Parselets
12 0.89199102 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction
13 0.88873816 102 iccv-2013-Data-Driven 3D Primitives for Single Image Understanding
14 0.88850015 22 iccv-2013-A New Adaptive Segmental Matching Measure for Human Activity Recognition
15 0.88794583 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
16 0.88757098 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification
17 0.88734186 155 iccv-2013-Facial Action Unit Event Detection by Cascade of Tasks
18 0.88712704 359 iccv-2013-Robust Object Tracking with Online Multi-lifespan Dictionary Learning
19 0.88676 326 iccv-2013-Predicting Sufficient Annotation Strength for Interactive Foreground Segmentation
20 0.88515794 160 iccv-2013-Fast Object Segmentation in Unconstrained Video