Author: LiMin Wang, Yu Qiao, Xiaoou Tang
Abstract: This paper proposes motionlet, a mid-level and spatiotemporal part, for human motion recognition. Motionlet can be seen as a tight cluster in motion and appearance space, corresponding to the moving process of different body parts. We postulate three key properties of motionlet for action recognition: high motion saliency, multiple scale representation, and representative-discriminative ability. Towards this goal, we develop a data-driven approach to learn motionlets from training videos. First, we extract 3D regions with high motion saliency. Then we cluster these regions and preserve the centers as candidate templates for motionlet. Finally, we examine the representative and discriminative power of the candidates, and introduce a greedy method to select effective candidates. With motionlets, we present a mid-level representation for video, called motionlet activation vector. We conduct experiments on three datasets, KTH, HMDB51, and UCF50. The results show that the proposed methods significantly outperform state-of-the-art methods.
1 Abstract This paper proposes motionlet, a mid-level and spatiotemporal part, for human motion recognition. [sent-12, score-0.222]
2 Motionlet can be seen as a tight cluster in motion and appearance space, corresponding to the moving process of different body parts. [sent-13, score-0.188]
3 We postulate three key properties of motionlet for action recognition: high motion saliency, multiple scale representation, and representative-discriminative ability. [sent-14, score-1.024]
4 Towards this goal, we develop a data-driven approach to learn motionlets from training videos. [sent-15, score-0.558]
5 First, we extract 3D regions with high motion saliency. [sent-16, score-0.15]
6 Finally, we examine the representative and discriminative power of the candidates, and introduce a greedy method to select effective candidates. [sent-18, score-0.149]
7 With motionlets, we present a mid-level representation for video, called motionlet activation vector. [sent-19, score-0.774]
8 Introduction Due to the popularization of surveillance cameras and personal video devices, video based human motion analysis and recognition have become a highly active area in computer vision [2]. [sent-23, score-0.286]
9 Human action recognition is difficult for many reasons, such as the high dimensionality of video data, intraclass variability caused by scale, viewpoint and illumination changes, and low resolution and video quality. [sent-24, score-0.347]
10 However, they only capture low-level information, and may lack discriminative power for high-level motion recognition. [sent-38, score-0.144]
11 Among them, Action Bank [27] applies a large set of action detectors to the input video and uses the responses of these detectors as a semantically rich representation (Figure 1). [sent-40, score-0.262]
12 In addition to appearance, motion is an important visual cue for action recognition. [sent-46, score-0.281]
13 Moreover, it is more ambiguous and difficult to define parts for human motion than for objects. [sent-47, score-0.14]
14 To achieve the above goals, we propose a learning-based approach to extract motionlets from training videos. [sent-51, score-0.579]
15 Specifically, we first estimate motion saliency using spatiotemporal orientation energies [1], and extract 3D regions with high motion saliency. [sent-52, score-0.515]
16 Then we tightly cluster these 3D regions into candidate motionlets, and keep the medians for each cluster as the templates. [sent-53, score-0.154]
17 Finally, we examine the representative and discriminative power of these candidates, and introduce a greedy search algorithm to select effective candidates as motionlets. [sent-54, score-0.178]
18 We represent a video by a motionlet activation vector, which measures the strength of each motionlet occurring in the video. [sent-55, score-1.54]
19 We conduct experiments on human motion recognition on three public datasets: KTH [28], UCF50 [26], and HMDB51 [19]. [sent-56, score-0.168]
20 Motionlet differs from Poselet in two ways: 1) Motionlet is a 3D part constructed from video and designed for human motion recognition; 2) we construct motionlet in an unsupervised way without using human annotations of pose. [sent-71, score-0.961]
21 Several recent action recognition methods also make use of the concept of “part”, either explicitly or implicitly. [sent-72, score-0.201]
22 firstly over-segment the whole video into tubes corresponding to action “part” and adopt spatiotemporal graphs to learn the relationship among the parts. [sent-76, score-0.356]
23 group the trajectories into clusters, each of which can be seen as an action part. [sent-78, score-0.207]
24 Different from these methods, our motionlets are motion templates and provide mid-level representation of video. [sent-81, score-0.729]
25 Moreover, motionlets do not rely on specific inference algorithms in the recognition step, which makes them easy to combine with other methods. [sent-82, score-0.572]
26 In this paper, we use spatiotemporal orientation energy (SOE) [1] as low level features. [sent-87, score-0.176]
27 SOEs have been used for action recognition in [8, 27]. [sent-88, score-0.201]
28 This approach not only fits the motionlet representation of video very well, but also reduces the time cost of template matching. [sent-92, score-0.861]
29 We use a 3D steerable filter to estimate local spatiotemporal orientation energy, which represents the strength of motion along 3D spatiotemporal directions. [sent-95, score-0.402]
30 We can estimate the spatiotemporal orientation energy at each pixel as follows, Eθˆ(x) = ? [sent-97, score-0.176]
31 We use nine spatiotemporal energies with different image velocities (u, v)? [sent-123, score-0.167]
32 can be seen as measures of motion saliency along eight different orientations (Figure 2). [sent-141, score-0.194]
33 We extract dense histogram of spatiotemporal orientation energy (HOE) and histogram of gradient (HOG) for video representation. [sent-144, score-0.331]
34 For motion information, we compute histogram of eight pure energies by Equation (4). [sent-166, score-0.206]
35 Being histogram features, dense HOE and HOG are more compact and efficient than the spatiotemporal orientation energy features used in [27, 8], which compute a feature vector for each pixel. [sent-172, score-0.217]
36 Motionlet Construction This section describes how to construct motionlet for video representation. [sent-179, score-0.786]
37 As shown in Figure 4, the whole process consists of three steps, 1) extracting motion salient regions, 2) finding motionlet candidates, and 3) ranking motionlets. [sent-180, score-0.822]
38 Extraction of Motion Salient Regions In the first step, we extract 3D video regions with high motion saliency as seeds for constructing motionlets. [sent-183, score-0.316]
39 For each volume Ω, we use the summation of spatiotemporal orientation energies as a measure of motion saliency (see left of Figure 4), s(Ω) =? [sent-185, score-0.334]
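The saliency measure described above (summing orientation energy over a 3D volume and keeping the most salient volumes) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the regular scanning grid, the `stride` and `keep` parameters, and the assumption that `energy[t][y][x]` already holds the energy summed over orientations are all hypothetical, since the exact formula and selection rule are truncated in this text.

```python
from itertools import product


def volume_saliency(energy, origin, size):
    """Sum the energy values inside one axis-aligned 3D volume.

    energy: nested list indexed as energy[t][y][x]; assumed to hold the
            per-pixel orientation energy already summed over orientations.
    origin: (t0, y0, x0) corner of the volume.
    size:   (dt, dy, dx) extent of the volume.
    """
    t0, y0, x0 = origin
    dt, dy, dx = size
    return sum(
        energy[t][y][x]
        for t in range(t0, t0 + dt)
        for y in range(y0, y0 + dy)
        for x in range(x0, x0 + dx)
    )


def top_salient_volumes(energy, size, stride, keep):
    """Scan volumes on a regular grid (an assumption) and return the
    `keep` highest-scoring ones as (saliency, origin) pairs."""
    n_t, n_y, n_x = len(energy), len(energy[0]), len(energy[0][0])
    dt, dy, dx = size
    st, sy, sx = stride
    origins = product(
        range(0, n_t - dt + 1, st),
        range(0, n_y - dy + 1, sy),
        range(0, n_x - dx + 1, sx),
    )
    scored = [(volume_saliency(energy, o, size), o) for o in origins]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:keep]
```

Keeping a fixed number of top-scoring volumes is only one plausible reading of "high motion saliency"; a threshold on s(Ω) would fit the description equally well.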
40 Finding Motionlet Candidates The 3D regions generated from motion saliency serve as the seeds for constructing motionlet. [sent-203, score-0.222]
41 Then, for each group, we cluster the 3D regions according to motion and appearance information. [sent-209, score-0.188]
42 The pipeline of motionlet construction: we first generate a large pool of 3D regions using motion saliency; then, we tightly cluster 3D regions into candidate motionlets; finally, we rank and select motionlets based on their representative and discriminative ability. [sent-220, score-1.61]
43 Some examples of representative-discriminative and non representative-discriminative motionlets for brush hair. [sent-222, score-0.558]
44 Due to the great variance of video data, the preference parameters of Affinity Propagation are set to be larger than the median to ensure that 3D regions within the same cluster are very similar. [sent-225, score-0.151]
45 The construction of motionlet is conducted for each action category separately. [sent-228, score-0.9]
46 For each action category, we generate about 3,000 3D regions and cluster them into 500 templates. [sent-229, score-0.265]
47 Ranking Motionlet The motionlet templates constructed above mainly take into account the low-level features captured by HOE and HOG. [sent-232, score-0.755]
48 As a consequence, it is still uncertain whether these templates are representative and discriminative for high-level action classification. [sent-233, score-0.325]
49 To be representative, a motionlet should occur frequently and distribute widely in different videos (See Figure 4). [sent-234, score-0.765]
50 To be discriminative, a motionlet should provide information to distinguish one action class from the others (See Figure 4). [sent-235, score-0.9]
51 Specifically, let denote motionlet activation value which is calculated as the max pooling result of matching motionlet Mj with video Vi, . [sent-243, score-1.542]
52 indicates the strength of motionlet Mj occurring in Vi. [sent-248, score-0.728]
53 ber of action classes, Nk is the number of videos in action class Ck, and sjk and sj are the means within class Ck and over all classes, sjk=N1kVi? [sent-254, score-0.45]
54 However, this method treats each motionlet independently and ignores the correlation between motionlets. [sent-260, score-0.713]
55 We overcome this limitation by exploring the k nearest videos of each motionlet in training samples, as shown in Figure 4. [sent-262, score-0.765]
56 We call video Vi ‘k nearest’ to motionlet Mj if the matching result belongs to the k largest values of {sjn} (n = 1, . [sent-263, score-0.786]
57 Our goal is to find a subset of motionlets satisfying two requirements: the sum of representative and discriminative power should be as large as possible, and the coverage percentage of training samples should be as high as possible. [sent-269, score-0.138]
58 We design a greedy algorithm to select motionlets sequentially as shown in Algorithm 1. [sent-270, score-0.606]
59 Then, we search for the set of motionlets that cover these training samples. [sent-272, score-0.558]
60 Finally, we greedily select the motionlet that has the highest representative and discriminative power in this set. [sent-273, score-0.834]
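The selection steps above can be illustrated with a short sketch. Algorithm 1 itself is not reproduced in this text, so the following is one plausible reading under stated assumptions rather than the authors' procedure: `scores[j]` is a hypothetical combined representative-discriminative score for motionlet j, `activations[j][i]` is its activation on training video i, a motionlet "covers" its k nearest videos, and the loop stops after `budget` selections or once every video is covered.

```python
def k_nearest_videos(activations_row, k):
    """Indices of the k videos with the largest activation for one motionlet."""
    order = sorted(
        range(len(activations_row)),
        key=lambda i: activations_row[i],
        reverse=True,
    )
    return set(order[:k])


def greedy_select(scores, activations, k, budget):
    """Greedily pick motionlets that cover not-yet-covered training videos.

    scores:      scores[j] is the (assumed) combined score of motionlet j.
    activations: activations[j][i] is the activation of motionlet j on video i.
    k:           number of nearest videos each motionlet is said to cover.
    budget:      maximum number of motionlets to select (an assumption).
    """
    n_videos = len(activations[0])
    cover = [k_nearest_videos(row, k) for row in activations]
    uncovered = set(range(n_videos))
    selected = []
    while uncovered and len(selected) < budget:
        # Motionlets not yet chosen that cover at least one uncovered video.
        candidates = [
            j for j in range(len(scores))
            if j not in selected and cover[j] & uncovered
        ]
        if not candidates:
            break
        # Among them, take the one with the highest score.
        best = max(candidates, key=lambda j: scores[j])
        selected.append(best)
        uncovered -= cover[best]
    return selected
```

Restricting candidates to motionlets that add coverage is what lets the sketch account for correlation between motionlets: a high-scoring template that only re-covers already covered videos is skipped.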
61 Video Representation using Motionlet With a set of motionlets M = {M1, M2, . [sent-276, score-0.558]
62 , sm], where sj is the max pooling result for matching motionlet Mj with V (Equation (9)). [sent-285, score-0.73]
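As a rough illustration of the activation vector, the sketch below max-pools a similarity score between each motionlet template and every candidate volume of a video. The matching function used in the paper is not given in this text, so histogram intersection is a placeholder assumption, and the function names and the list-of-histograms data layout are hypothetical.

```python
def histogram_intersection(a, b):
    """Placeholder similarity between two histogram features (assumed)."""
    return sum(min(x, y) for x, y in zip(a, b))


def activation_vector(templates, video_volumes, similarity=histogram_intersection):
    """Return one max-pooled matching score per motionlet template.

    templates:     list of template feature histograms, one per motionlet.
    video_volumes: list of feature histograms, one per 3D volume in the video.
    """
    return [
        max(similarity(template, volume) for volume in video_volumes)
        for template in templates
    ]


# Usage: two templates matched against a video with two volumes.
templates = [[1.0, 0.0], [0.0, 1.0]]
video_volumes = [[0.5, 0.5], [0.9, 0.1]]
vector = activation_vector(templates, video_volumes)
```

The resulting vector has one entry per motionlet regardless of video length, so it can be passed to any standard classifier.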
63 Experiment We evaluate the effectiveness of motionlet on three datasets, one small scale dataset KTH [28] and two large scale datasets UCF50 [26] and HMDB51 [19]. [sent-290, score-0.758]
64 KTH [28] consists of six human action classes and each action is performed several times by 25 subjects. [sent-291, score-0.406]
65 UCF50 [26] and HMDB51 [19] are two large datasets for human action recognition. [sent-296, score-0.236]
66 UCF50 has 50 action classes with total 6,618 videos, and each action class is divided into 25 groups with at least 100 videos for each class. [sent-297, score-0.426]
67 HMDB51 has 51 action classes with a total of 6,766 videos, and each action class has at least 100 videos. [sent-298, score-0.426]
68 From the results, we can see that video parts belonging to the same motionlets exhibit similar motion and appearance features. [sent-305, score-0.755]
69 Motionlets can correspond to the motion of a body part (such as the upper body or leg) or a visual phrase (person-horse, gun-hand), and thus can yield important cues for recognizing the human motion category. [sent-306, score-0.258]
70 From these results, we see that the proposed motionlets achieve a comparable result on the simple dataset and high performance on the two large scale datasets. [sent-310, score-0.572]
71 These results yield a 13 percent improvement over a baseline HOG/HOF (low-level representation), a 7 percent (footnote 1: They remove part of the testing videos in the bank and do not split each video into short clips according to [28]; thus their testing settings differ from those of the other methods and ours.) [sent-319, score-0.29]
72 Examples of motionlets from three datasets: KTH (left), UCF50 (middle) and HMDB51 (right). [sent-321, score-0.713]
73 We find each motionlet is a tight cluster in both motion and appearance space. [sent-322, score-0.88]
74 improvement over the recent Action Bank method (high-level representation), and a 4 percent improvement over the recent motion interchange pattern feature (low-level representation) [18]. [sent-327, score-0.41]
75 Our method outperforms HOG/HOF, Action Bank, and motion interchange pattern for both group wise cross validation (GV) and leave one group out cross validation schemes (LOGO). [sent-334, score-0.233]
76 Regarding computational cost, extracting motion saliency takes about 30 s and matching 3,000 motionlets takes about 40 s per video on average on HMDB51 and UCF50 on a PC with an E5645 CPU (2. [sent-335, score-0.823]
77 From these comparisons, we can conclude that motionlet is effective in dealing with realistic videos. [sent-337, score-0.734]
78 Local features like HOG/HOF cannot describe the complex motion information in realistic videos, while high level templates like action bank fail to deal with the large deformation among video samples very well. [sent-339, score-0.477]
79 Due to the mid-level nature, motionlets yield a good tradeoff between low-level and high-level representation, and provide rich and robust information for classification. [sent-340, score-0.558]
80 We explore the influence of motionlet number and the effectiveness of motionlet selection algorithm using HMDB51 and UCF50 (GV). [sent-347, score-1.457]
81 The results are shown in Figure 7, from which we can see that the accuracy increases little when the number of motionlets is larger than 2,000. [sent-349, score-0.558]
82 These results indicate high redundancy within candidates, and thus it is necessary to conduct motionlet selection. [sent-350, score-0.741]
83 We also make comparison between our motionlet selection method and random selection (we randomly select motionlets and repeat the random experiments 50 times). [sent-351, score-1.323]
84 Besides, we can achieve a bit higher classification accuracy using selected motionlets than using all candidate motionlets. [sent-353, score-0.575]
85 All these results imply that our greedy algorithm is effective in motionlet selection. [sent-354, score-0.741]
86 We use motionlets to obtain a mid-level representation of video. [sent-356, score-0.593]
87 Results of varying the number of motionlets and comparing the ranking algorithm with random selection. Left: HMDB51; right: UCF50. [sent-360, score-0.728]
88 use action bank representation with 205 detectors1 . [sent-365, score-0.282]
89 The number of motionlets is set to 3,000 in this combination. [sent-366, score-0.558]
90 Conclusion In this paper, we propose a mid-level video representation for motion recognition using motionlet. [sent-374, score-0.216]
91 A motionlet is defined as a spatiotemporal part with coherent appearance and motion features. [sent-375, score-0.223]
92 We develop a data-driven approach to learn motionlets by considering three properties, high motion saliency, multiple scale representation, and representative-discriminative ability. [sent-376, score-0.666]
93 Compared with local features (such as STIP) and global templates (such as action bank), motionlets are mid-level parts and provide a good tradeoff between repeatability and discriminative ability. [sent-377, score-0.827]
94 We evaluate the performance of motionlet on three public datasets, KTH, HMDB51 and UCF50. [sent-378, score-0.713]
95 Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. [sent-432, score-0.147]
96 Efficient action spotting based on a spacetime oriented structure representation. [sent-442, score-0.209]
97 Motion interchange patterns for action recognition in unconstrained videos. [sent-514, score-0.234]
98 Hmdb: A large video database for human motion recognition. [sent-522, score-0.199]
99 Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. [sent-544, score-0.201]
100 A comparative study of encoding, pooling and normalization methods for action recognition. [sent-605, score-0.204]
