cvpr cvpr2013 cvpr2013-291 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: LiMin Wang, Yu Qiao, Xiaoou Tang
Abstract: This paper proposes motionlet, a mid-level and spatiotemporal part, for human motion recognition. Motionlet can be seen as a tight cluster in motion and appearance space, corresponding to the moving process of different body parts. We postulate three key properties of motionlet for action recognition: high motion saliency, multiple scale representation, and representative-discriminative ability. Towards this goal, we develop a data-driven approach to learn motionlets from training videos. First, we extract 3D regions with high motion saliency. Then we cluster these regions and preserve the centers as candidate templates for motionlet. Finally, we examine the representative and discriminative power of the candidates, and introduce a greedy method to select effective candidates. With motionlets, we present a mid-level representation for video, called motionlet activation vector. We conduct experiments on three datasets, KTH, HMDB51, and UCF50. The results show that the proposed methods significantly outperform state-of-the-art methods.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper proposes motionlet, a mid-level and spatiotemporal part, for human motion recognition. [sent-12, score-0.222]
2 Motionlet can be seen as a tight cluster in motion and appearance space, corresponding to the moving process of different body parts. [sent-13, score-0.188]
3 We postulate three key properties of motionlet for action recognition: high motion saliency, multiple scale representation, and representative-discriminative ability. [sent-14, score-1.024]
4 Towards this goal, we develop a data-driven approach to learn motionlets from training videos. [sent-15, score-0.558]
5 First, we extract 3D regions with high motion saliency. [sent-16, score-0.15]
6 Finally, we examine the representative and discriminative power of the candidates, and introduce a greedy method to select effective candidates. [sent-18, score-0.149]
7 With motionlets, we present a mid-level representation for video, called motionlet activation vector. [sent-19, score-0.774]
8 Introduction Due to the popularization of surveillance cameras and personal video devices, video-based human motion analysis and recognition have become a highly active area in computer vision [2]. [sent-23, score-0.286]
9 Human action recognition is difficult for many reasons, such as the high dimensionality of video data, intra-class variability caused by scale, viewpoint, and illumination changes, and low resolution and video quality. [sent-24, score-0.347]
10 However, they only capture low-level information, and may lack discriminative power for high-level motion recognition. [sent-38, score-0.144]
11 Among them, Action Bank [27] applies a large set of action detectors to the input video and uses the responses of these detectors as a semantically rich representation (Figure 1). [sent-40, score-0.262]
12 In addition to appearance, motion is an important visual cue for action recognition. [sent-46, score-0.281]
13 Moreover, it is more ambiguous and difficult to define parts for human motion than for objects. [sent-47, score-0.14]
14 To achieve the above goals, we propose a learning-based approach to extract motionlets from training videos. [sent-51, score-0.579]
15 Specifically, we first estimate motion saliency using spatiotemporal orientation energies [1], and extract 3D regions with high motion saliency. [sent-52, score-0.515]
16 Then we tightly cluster these 3D regions into candidate motionlets, and keep the median of each cluster as its template. [sent-53, score-0.154]
17 Finally, we examine the representative and discriminative power of these candidates, and introduce a greedy search algorithm to select effective candidates as motionlets. [sent-54, score-0.178]
18 We represent a video by its motionlet activation vector, which measures the strength of each motionlet occurring in the video. [sent-55, score-1.54]
19 We conduct experiments on human motion recognition on three public datasets: KTH [28], UCF50 [26], and HMDB51 [19]. [sent-56, score-0.168]
20 Motionlet differs from Poselet in two ways: 1) a motionlet is a 3D part constructed from video and designed for human motion recognition; 2) we construct motionlets in an unsupervised way, without using human annotations of pose. [sent-71, score-0.961]
21 Several recent action recognition methods also make use of the concept of “part”, either explicitly or implicitly. [sent-72, score-0.201]
22 first over-segment the whole video into tubes corresponding to action “parts” and adopt spatiotemporal graphs to learn the relationships among the parts. [sent-76, score-0.356]
23 group the trajectories into clusters, each of which can be seen as an action part. [sent-78, score-0.207]
24 Different from these methods, our motionlets are motion templates and provide a mid-level representation of video. [sent-81, score-0.729]
25 Moreover, motionlets do not rely on specific inference algorithms in the recognition step, which makes them easy to combine with other methods. [sent-82, score-0.572]
26 In this paper, we use spatiotemporal orientation energy (SOE) [1] as low-level features. [sent-87, score-0.176]
27 SOEs have been used for action recognition in [8, 27]. [sent-88, score-0.201]
28 This approach not only fits the motionlet representation of video very well, but also reduces the time cost of template matching. [sent-92, score-0.861]
29 We use 3D steerable filters to estimate local spatiotemporal orientation energies that represent the strength of motion along 3D spatiotemporal directions. [sent-95, score-0.402]
30 We can estimate the spatiotemporal orientation energy at each pixel as follows, Eθˆ(x) = ? [sent-97, score-0.176]
31 In total, we use nine spatiotemporal energies with different image velocities (u, v). [sent-123, score-0.167]
32 These energies can be seen as measures of motion saliency along eight different orientations (Figure 2). [sent-141, score-0.194]
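To make these quantities concrete, the following is a minimal sketch (not the authors' code) of estimating oriented spatiotemporal energies on a grayscale video volume and normalizing them into per-pixel pure energies. It stands in for the third-order steerable filters of [1] with simple first-order Gaussian derivatives; the function names, sigma, and epsilon are illustrative assumptions.

```python
# Sketch only: approximates the oriented spatiotemporal energies described above.
import numpy as np
from scipy.ndimage import gaussian_filter

def oriented_energy(video, direction, sigma=2.0):
    """video: (T, H, W) float array; direction: unit 3-vector (dt, dy, dx)."""
    # Gaussian-derivative responses along the t, y and x axes.
    grads = [gaussian_filter(video, sigma=sigma, order=o)
             for o in ((1, 0, 0), (0, 1, 0), (0, 0, 1))]
    # Directional derivative = projection of the gradient onto the direction.
    resp = sum(d * g for d, g in zip(direction, grads))
    return resp ** 2  # pointwise energy for this orientation

def pure_energies(video, directions, eps=1e-6):
    """Normalize the oriented energies so they sum to (roughly) one per pixel."""
    E = np.stack([oriented_energy(video, d) for d in directions])
    return E / (E.sum(axis=0, keepdims=True) + eps)
```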
33 We extract dense histograms of spatiotemporal orientation energy (HOE) and histograms of gradient (HOG) for video representation. [sent-144, score-0.331]
34 For motion information, we compute a histogram of the eight pure energies using Equation (4). [sent-166, score-0.206]
35 Being histogram features, dense HOE and HOG are more compact and efficient than the spatiotemporal orientation energy features used in [27, 8], which compute a feature vector for each pixel. [sent-172, score-0.217]
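As a rough illustration of the dense HOE descriptor, the sketch below pools a pure-energy volume such as the one sketched above over small spatio-temporal cells and L2-normalizes each cell histogram; the cell size is an arbitrary placeholder, and the HOG channel would be pooled analogously from image gradients.

```python
import numpy as np

def dense_hoe(pure, cell=(4, 8, 8)):
    """pure: (K, T, H, W) pure energies; returns (nT, nH, nW, K) cell histograms."""
    K, T, H, W = pure.shape
    ct, ch, cw = cell
    nT, nH, nW = T // ct, H // ch, W // cw
    # Crop to a whole number of cells, then sum the energies inside each cell.
    cropped = pure[:, :nT * ct, :nH * ch, :nW * cw]
    cells = cropped.reshape(K, nT, ct, nH, ch, nW, cw).sum(axis=(2, 4, 6))
    hist = np.moveaxis(cells, 0, -1)                      # (nT, nH, nW, K)
    norm = np.linalg.norm(hist, axis=-1, keepdims=True) + 1e-6
    return hist / norm                                    # L2-normalized per cell
```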
36 Motionlet Construction This section describes how to construct motionlets for video representation. [sent-179, score-0.786]
37 As shown in Figure 4, the whole process consists of three steps: 1) extracting motion-salient regions, 2) finding motionlet candidates, and 3) ranking motionlets. [sent-180, score-0.822]
38 Extraction of Motion Salient Regions In the first step, we extract 3D video regions with high motion saliency as seeds for constructing motionlets. [sent-183, score-0.316]
39 For each volume Ω of size W, we use the sum of spatiotemporal orientation energies as a measure of motion saliency (see left of Figure 4), s(Ω) = ... [sent-185, score-0.334]
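A possible reading of this step, as a sketch: score every 3D window by its summed orientation energy s(Ω) and keep the top-scoring window centres as seeds. The window size and number of seeds are illustrative, and overlap suppression between nearby windows is omitted.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def salient_seeds(energy_sum, win=(8, 32, 32), n_seeds=100):
    """energy_sum: (T, H, W) total oriented energy per pixel."""
    # The windowed mean is proportional to the sum s(Omega) over each window.
    scores = uniform_filter(energy_sum, size=win)
    order = np.argsort(scores, axis=None)[::-1]            # highest saliency first
    coords = np.column_stack(np.unravel_index(order[:n_seeds], scores.shape))
    return coords, scores                                  # seed centres + saliency map
```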
40 Finding Motionlet Candidates The 3D regions generated from motion saliency serve as the seeds for constructing motionlets. [sent-203, score-0.222]
41 Then, for each group, we cluster the 3D regions according to motion and appearance information. [sent-209, score-0.188]
42 The pipeline of motionlet construction: we first generate a large pool of 3D regions using motion saliency; then, we tightly cluster 3D regions into candidate motionlets; finally, we rank and select motionlets based on their representative and discriminative ability. [sent-220, score-1.61]
43 Some examples of representative-discriminative and non-representative-discriminative motionlets for brush hair. [sent-222, score-0.558]
44 Due to the great variance of video data, the preference parameters of Affinity Propagation are set to be larger than the median to make sure that 3D regions within the same cluster are very similar. [sent-225, score-0.151]
45 The construction of motionlets is conducted for each action category separately. [sent-228, score-0.9]
46 For each action category, we generate about 3,000 3D regions and cluster them into 500 templates. [sent-229, score-0.265]
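A sketch of this clustering step with scikit-learn's Affinity Propagation, assuming each 3D region is already summarized by a fixed-length HOE+HOG vector. The negative-squared-distance similarity and the factor used to raise the preference above the median are illustrative, not the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def candidate_templates(features, boost=0.5):
    """features: (n_regions, d) HOE+HOG descriptors of the motion-salient 3D regions."""
    sim = -np.square(features[:, None] - features[None]).sum(-1)  # negative squared distances
    # Raising the preference above the median similarity yields more, tighter clusters.
    pref = np.median(sim) * (1.0 - boost)
    ap = AffinityPropagation(affinity="precomputed", preference=pref).fit(sim)
    return ap.cluster_centers_indices_   # indices of exemplar regions used as templates
```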
47 Ranking Motionlet The motionlet templates constructed above mainly take account of the low-level features captured by HOE and HOG. [sent-232, score-0.755]
48 As a consequence, it is still uncertain whether these templates are representative and discriminative for high-level action classification. [sent-233, score-0.325]
49 To be representative, a motionlet should occur frequently and be distributed widely across different videos (See Figure 4). [sent-234, score-0.765]
50 To be discriminative, a motionlet should provide information to distinguish one action class from the others (See Figure 4). [sent-235, score-0.9]
51 Specifically, the motionlet activation value is calculated as the max pooling result of matching motionlet Mj with video Vi. [sent-243, score-1.542]
52 It indicates the strength of motionlet Mj occurring in Vi. [sent-248, score-0.728]
53 ... number of action classes, Nk is the number of videos in action class Ck, and sjk and sj are the means within class Ck and over all classes, sjk = (1/Nk) Σ_{Vi ∈ Ck} ... [sent-254, score-0.45]
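These statistics can be computed directly from the video-by-motionlet activation matrix. The sketch below only builds the per-class means sjk and the overall means sj, and uses a generic between-class contrast as a stand-in, because the exact representative-discriminative score is truncated in this summary.

```python
import numpy as np

def activation_means(S, labels, n_classes):
    """S[i, j]: activation of motionlet j in video i; labels[i] in {0, ..., K-1}."""
    s_j = S.mean(axis=0)                               # mean over all training videos
    s_jk = np.stack([S[labels == k].mean(axis=0)       # mean within class C_k
                     for k in range(n_classes)])       # shape (K, n_motionlets)
    # Placeholder contrast: how far the class means deviate from the overall mean.
    contrast = np.square(s_jk - s_j).mean(axis=0)
    return s_jk, s_j, contrast
```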
54 However, this method treats each motionlet independently and ignores the correlations between motionlets. [sent-260, score-0.713]
55 We overcome this limitation by exploring the k nearest videos of each motionlet in training samples, as shown in Figure 4. [sent-262, score-0.765]
56 We call video Vi ‘k nearest’ to motionlet Mj if its matching result belongs to the k largest values of {sjn} (n = 1, ... [sent-263, score-0.786]
57 Our goal is to find a subset of motionlets satisfying two requirements: the sum of representative and discriminative power should be as large as possible, and the coverage percentage of training samples should be as high as possible. [sent-269, score-0.138]
58 We design a greedy algorithm to select motionlets sequentially as shown in Algorithm 1. [sent-270, score-0.606]
59 Then, we search for the set of motionlets that cover these training samples. [sent-272, score-0.558]
60 Finally, we greedily select the motionlet that has the highest representative and discriminative power in this set. [sent-273, score-0.834]
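A compact sketch of that greedy loop, assuming each candidate motionlet already carries a scalar ranking score and the index set of its k nearest training videos; stopping and tie-breaking rules are simplified relative to Algorithm 1.

```python
def greedy_select(scores, knn_videos, n_videos):
    """scores[j]: ranking score of motionlet j; knn_videos[j]: set of covered video ids."""
    selected, covered = [], set()
    while len(covered) < n_videos:
        target = next(v for v in range(n_videos) if v not in covered)
        # Motionlets whose k nearest videos cover the currently uncovered sample.
        covering = [j for j, vids in enumerate(knn_videos)
                    if target in vids and j not in selected]
        if not covering:                 # no remaining motionlet covers this video
            covered.add(target)
            continue
        best = max(covering, key=lambda j: scores[j])
        selected.append(best)
        covered |= knn_videos[best]
    return selected
```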
61 Video Representation using Motionlet With a set of motionlets M = {M1, M2, . [sent-276, score-0.558]
62 , ... }, we represent video V by its activation vector, where each motionlet activation is the max pooling result for matching motionlet Mj with V (Equation (9)). [sent-285, score-0.73]
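A sketch of assembling the motionlet activation vector: each template is slid over the video's descriptor volume and its responses are max pooled. Plain N-dimensional cross-correlation stands in for the paper's HOE/HOG-based matching, and the descriptor layout is an assumption.

```python
import numpy as np
from scipy.signal import correlate

def activation_vector(video_desc, templates):
    """video_desc: (T, H, W, d) descriptor volume; templates: list of (t, h, w, d) motionlets."""
    acts = []
    for tpl in templates:
        # Cross-correlate the template with the video volume and max-pool the responses.
        resp = correlate(video_desc, tpl, mode="valid", method="auto")
        acts.append(float(resp.max()))
    return np.array(acts)   # one activation per motionlet
```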
63 Experiment We evaluate the effectiveness of motionlets on three datasets: one small-scale dataset, KTH [28], and two large-scale datasets, UCF50 [26] and HMDB51 [19]. [sent-290, score-0.758]
64 KTH [28] consists of six human action classes and each action is performed several times by 25 subjects. [sent-291, score-0.406]
65 UCF50 [26] and HMDB51 [19] are two large datasets for human action recognition. [sent-296, score-0.236]
66 UCF50 has 50 action classes with a total of 6,618 videos; each action class is divided into 25 groups and has at least 100 videos. [sent-297, score-0.426]
67 HMDB51 has 51 action classes with a total of 6,766 videos, and each action class has at least 100 videos. [sent-298, score-0.426]
68 From the results, we can see that video parts belonging to the same motionlets exhibit similar motion and appearance features. [sent-305, score-0.755]
69 Motionlets can correspond to the motion of a body part (such as the upper body or a leg) or a visual phrase (person-horse, gun-hand), and thus can yield important cues for recognizing the human motion category. [sent-306, score-0.258]
70 From these results, we see that the proposed motionlets achieve a comparable result on the simple dataset and high performance on the two large-scale datasets. [sent-310, score-0.572]
71 These results yield a 13 percent improvement over a baseline HOG/HOF (low-level representation) and a 7 percent (footnote 1: they remove part of the testing videos in the bank and do not split each video into short clips according to [28], so their testing setting differs from the other methods and ours) [sent-319, score-0.29]
72 Examples of motionlets from three datasets: KTH (left), UCF50 (middle) and HMDB51 (right). [sent-321, score-0.713]
73 We find each motionlet is a tight cluster both in motion and appearance space. [sent-322, score-0.88]
74 improvement over a recent method, Action Bank (high-level representation), and a 4 percent improvement over a recent feature, motion interchange patterns (low-level representation) [18]. [sent-327, score-0.41]
75 Our method outperforms HOG/HOF, Action Bank, and motion interchange patterns under both the group-wise cross-validation (GV) and the leave-one-group-out cross-validation (LOGO) schemes. [sent-334, score-0.233]
76 For computational cost, extracting motion saliency takes about 30 s and matching 3,000 motionlets takes about 40 s per video on average on HMDB51 and UCF50, on a PC with an E5645 CPU (2. [sent-335, score-0.823]
77 From these comparisons, we can conclude that motionlets are effective in dealing with realistic videos. [sent-337, score-0.734]
78 Local features like HOG/HOF cannot describe the complex motion information in realistic videos, while high-level templates like Action Bank fail to deal well with the large deformations among video samples. [sent-339, score-0.477]
79 Due to their mid-level nature, motionlets yield a good tradeoff between low-level and high-level representations, and provide rich and robust information for classification. [sent-340, score-0.558]
80 We explore the influence of the number of motionlets and the effectiveness of the motionlet selection algorithm using HMDB51 and UCF50 (GV). [sent-347, score-1.457]
81 The results for UCF50 are shown in Figure 7, from which we can see that the accuracy increases little when the number of motionlets is larger than 2,000. [sent-349, score-0.558]
82 These results indicate high redundancy within the candidates, and thus it is necessary to conduct motionlet selection. [sent-350, score-0.741]
83 We also compare our motionlet selection method with random selection (we randomly select motionlets and repeat the random experiments 50 times). [sent-351, score-1.323]
84 Besides, we achieve slightly higher classification accuracy using the selected motionlets than using all candidate motionlets. [sent-353, score-0.575]
85 All these results imply that our greedy algorithm is effective in motionlet selection. [sent-354, score-0.741]
86 We use motionlets to obtain a mid-level representation of video. [sent-356, score-0.593]
87 Results of varying the number of motionlets and comparing the ranking algorithm with random selection. Left: HMDB51; Right: UCF50. [sent-360, score-0.728]
88 use the action bank representation with 205 detectors. [sent-365, score-0.282]
89 The number of motionlets is set to 3,000 in this combination. [sent-366, score-0.558]
90 Conclusion In this paper, we propose a mid-level video representation for motion recognition using motionlet. [sent-374, score-0.216]
91 Motionlets are defined as spatiotemporal parts with coherent appearance and motion features. [sent-375, score-0.223]
92 We develop a data-driven approach to learn motionlets by considering three properties: high motion saliency, multiple scale representation, and representative-discriminative ability. [sent-376, score-0.666]
93 Compared with local features (such as STIP) and global templates (such as Action Bank), motionlets are mid-level parts and provide a good tradeoff between repeatability and discriminative ability. [sent-377, score-0.827]
94 We evaluate the performance of motionlet on three public datasets, KTH, HMDB51 and UCF50. [sent-378, score-0.713]
95 Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. [sent-432, score-0.147]
96 Efficient action spotting based on a spacetime oriented structure representation. [sent-442, score-0.209]
97 Motion interchange patterns for action recognition in unconstrained videos. [sent-514, score-0.234]
98 Hmdb: A large video database for human motion recognition. [sent-522, score-0.199]
99 Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. [sent-544, score-0.201]
100 A comparative study of encoding, pooling and normalization methods for action recognition. [sent-605, score-0.204]
wordName wordTfidf (topN-words)
[('motionlet', 0.713), ('motionlets', 0.558), ('action', 0.187), ('spatiotemporal', 0.096), ('motion', 0.094), ('saliency', 0.077), ('hoe', 0.076), ('video', 0.073), ('mj', 0.067), ('bank', 0.06), ('energies', 0.052), ('videos', 0.052), ('kth', 0.051), ('representative', 0.051), ('gv', 0.046), ('orientation', 0.046), ('soe', 0.044), ('cluster', 0.043), ('templates', 0.042), ('sij', 0.041), ('template', 0.04), ('stip', 0.037), ('bovw', 0.036), ('percents', 0.036), ('logo', 0.036), ('regions', 0.035), ('representation', 0.035), ('energy', 0.034), ('volumes', 0.034), ('interchange', 0.033), ('vi', 0.032), ('human', 0.032), ('candidates', 0.029), ('greedy', 0.028), ('conduct', 0.028), ('cuboids', 0.028), ('discriminative', 0.028), ('activation', 0.026), ('shenzhen', 0.024), ('mxax', 0.024), ('sjk', 0.024), ('ri', 0.024), ('eight', 0.023), ('poselet', 0.022), ('power', 0.022), ('rj', 0.022), ('coverage', 0.022), ('spacetime', 0.022), ('extract', 0.021), ('originates', 0.021), ('jhuang', 0.021), ('dense', 0.021), ('body', 0.021), ('steerable', 0.021), ('realistic', 0.021), ('detectors', 0.02), ('select', 0.02), ('raptis', 0.02), ('histogram', 0.02), ('group', 0.02), ('actions', 0.02), ('dei', 0.019), ('nine', 0.019), ('laptev', 0.019), ('aser', 0.018), ('cross', 0.018), ('tpami', 0.018), ('pooling', 0.017), ('highlevel', 0.017), ('niebles', 0.017), ('cto', 0.017), ('candidate', 0.017), ('messages', 0.017), ('part', 0.017), ('datasets', 0.017), ('pure', 0.017), ('um', 0.016), ('appearance', 0.016), ('affinity', 0.016), ('properties', 0.016), ('seeds', 0.016), ('clips', 0.016), ('dimension', 0.016), ('tightly', 0.016), ('selection', 0.016), ('mo', 0.015), ('libsvm', 0.015), ('influence', 0.015), ('strength', 0.015), ('indicated', 0.015), ('validation', 0.015), ('ranking', 0.015), ('hog', 0.015), ('volume', 0.015), ('felzenszwalb', 0.015), ('scale', 0.014), ('parts', 0.014), ('tight', 0.014), ('convolutional', 0.014), ('recognition', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition
Author: LiMin Wang, Yu Qiao, Xiaoou Tang
Abstract: This paper proposes motionlet, a mid-level and spatiotemporal part, for human motion recognition. Motionlet can be seen as a tight cluster in motion and appearance space, corresponding to the moving process of different body parts. We postulate three key properties of motionlet for action recognition: high motion saliency, multiple scale representation, and representative-discriminative ability. Towards this goal, we develop a data-driven approach to learn motionlets from training videos. First, we extract 3D regions with high motion saliency. Then we cluster these regions and preserve the centers as candidate templates for motionlet. Finally, we examine the representative and discriminative power of the candidates, and introduce a greedy method to select effective candidates. With motionlets, we present a mid-level representation for video, called motionlet activation vector. We conduct experiments on three datasets, KTH, HMDB51, and UCF50. The results show that the proposed methods significantly outperform state-of-the-art methods.
2 0.16039085 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
Author: Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis
Abstract: How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatiotemporal patch in the video. What defines these spatiotemporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos and align the videos for label transfer techniques. Furthermore, these patches can be used as a discriminative vocabulary for action classification where they demonstrate stateof-the-art performance on UCF50 and Olympics datasets.
3 0.15162936 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
Author: Feng Shi, Emil Petriu, Robert Laganière
Abstract: Local spatio-temporal features and bag-of-features representations have become popular for action recognition. A recent trend is to use dense sampling for better performance. While many methods claimed to use dense feature sets, most of them are just denser than approaches based on sparse interest point detectors. In this paper, we explore sampling with high density on action recognition. We also investigate the impact of random sampling over dense grid for computational efficiency. We present a real-time action recognition system which integrates fast random sampling method with local spatio-temporal features extracted from a Local Part Model. A new method based on histogram intersection kernel is proposed to combine multiple channels of different descriptors. Our technique shows high accuracy on the simple KTH dataset, and achieves state-of-the-art on two very challenging real-world datasets, namely, 93% on KTH, 83.3% on UCF50 and 47.6% on HMDB51.
4 0.14184527 40 cvpr-2013-An Approach to Pose-Based Action Recognition
Author: Chunyu Wang, Yizhou Wang, Alan L. Yuille
Abstract: We address action recognition in videos by modeling the spatial-temporal structures of human poses. We start by improving a state of the art method for estimating human joint locations from videos. More precisely, we obtain the K-best estimations output by the existing method and incorporate additional segmentation cues and temporal constraints to select the “best” one. Then we group the estimated joints into five body parts (e.g. the left arm) and apply data mining techniques to obtain a representation for the spatial-temporal structures of human actions. This representation captures the spatial configurations of body parts in one frame (by spatial-part-sets) as well as the body part movements (by temporal-part-sets) which are characteristic of human actions. It is interpretable, compact, and also robust to errors on joint estimations. Experimental results first show that our approach is able to localize body joints more accurately than existing methods. Next we show that it outperforms state of the art action recognizers on the UCF sport, the Keck Gesture and the MSR-Action3D datasets.
5 0.1417304 287 cvpr-2013-Modeling Actions through State Changes
Author: Alireza Fathi, James M. Rehg
Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.
6 0.14042042 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
8 0.13154624 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes
9 0.1174986 3 cvpr-2013-3D R Transform on Spatio-temporal Interest Points for Action Recognition
10 0.11437037 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
11 0.10643466 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
12 0.10562109 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
13 0.10533276 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition
14 0.10390939 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
15 0.097573623 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)
16 0.087470286 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
17 0.087315448 153 cvpr-2013-Expanded Parts Model for Human Attribute and Action Recognition in Still Images
18 0.087253049 149 cvpr-2013-Evaluation of Color STIPs for Human Action Recognition
19 0.084138758 258 cvpr-2013-Learning Video Saliency from Human Gaze Using Candidate Selection
20 0.082688943 202 cvpr-2013-Hierarchical Saliency Detection
topicId topicWeight
[(0, 0.15), (1, -0.078), (2, 0.062), (3, -0.079), (4, -0.202), (5, -0.009), (6, -0.034), (7, 0.006), (8, -0.049), (9, -0.039), (10, 0.004), (11, -0.001), (12, -0.019), (13, -0.013), (14, -0.0), (15, 0.008), (16, 0.015), (17, -0.018), (18, 0.028), (19, 0.071), (20, 0.023), (21, -0.017), (22, 0.019), (23, 0.037), (24, 0.05), (25, -0.018), (26, 0.001), (27, -0.001), (28, 0.02), (29, -0.01), (30, -0.043), (31, 0.042), (32, 0.019), (33, 0.02), (34, 0.018), (35, 0.038), (36, -0.032), (37, 0.017), (38, 0.021), (39, -0.018), (40, 0.019), (41, 0.017), (42, -0.026), (43, -0.06), (44, -0.003), (45, -0.021), (46, -0.011), (47, 0.068), (48, 0.002), (49, -0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.91019499 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition
Author: LiMin Wang, Yu Qiao, Xiaoou Tang
Abstract: This paper proposes motionlet, a mid-level and spatiotemporal part, for human motion recognition. Motionlet can be seen as a tight cluster in motion and appearance space, corresponding to the moving process of different body parts. We postulate three key properties of motionlet for action recognition: high motion saliency, multiple scale representation, and representative-discriminative ability. Towards this goal, we develop a data-driven approach to learn motionlets from training videos. First, we extract 3D regions with high motion saliency. Then we cluster these regions and preserve the centers as candidate templates for motionlet. Finally, we examine the representative and discriminative power of the candidates, and introduce a greedy method to select effective candidates. With motionlets, we present a mid-level representation for video, called motionlet activation vector. We conduct experiments on three datasets, KTH, HMDB51, and UCF50. The results show that the proposed methods significantly outperform state-of-the-art methods.
2 0.82895362 287 cvpr-2013-Modeling Actions through State Changes
Author: Alireza Fathi, James M. Rehg
Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.
3 0.82088292 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
Author: Yicong Tian, Rahul Sukthankar, Mubarak Shah
Abstract: Deformable part models have achieved impressive performance for object detection, even on difficult image datasets. This paper explores the generalization of deformable part models from 2D images to 3D spatiotemporal volumes to better study their effectiveness for action detection in video. Actions are treated as spatiotemporal patterns and a deformable part model is generated for each action from a collection of examples. For each action model, the most discriminative 3D subvolumes are automatically selected as parts and the spatiotemporal relations between their locations are learned. By focusing on the most distinctive parts of each action, our models adapt to intra-class variation and show robustness to clutter. Extensive experiments on several video datasets demonstrate the strength of spatiotemporal DPMs for classifying and localizing actions.
4 0.80492944 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
Author: Feng Shi, Emil Petriu, Robert Laganière
Abstract: Local spatio-temporal features and bag-of-features representations have become popular for action recognition. A recent trend is to use dense sampling for better performance. While many methods claimed to use dense feature sets, most of them are just denser than approaches based on sparse interest point detectors. In this paper, we explore sampling with high density on action recognition. We also investigate the impact of random sampling over dense grid for computational efficiency. We present a real-time action recognition system which integrates fast random sampling method with local spatio-temporal features extracted from a Local Part Model. A new method based on histogram intersection kernel is proposed to combine multiple channels of different descriptors. Our technique shows high accuracy on the simple KTH dataset, and achieves state-of-the-art on two very challenging real-world datasets, namely, 93% on KTH, 83.3% on UCF50 and 47.6% on HMDB51.
5 0.78709954 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
Author: Michalis Raptis, Leonid Sigal
Abstract: In this paper, we develop a new model for recognizing human actions. An action is modeled as a very sparse sequence of temporally local discriminative keyframes, collections of partial key-poses of the actor(s), depicting key states in the action sequence. We cast the learning of keyframes in a max-margin discriminative framework, where we treat keyframes as latent variables. This allows us to (jointly) learn a set of most discriminative keyframes while also learning the local temporal context between them. Keyframes are encoded using a spatially-localizable poselet-like representation with HoG and BoW components learned from weak annotations; we rely on structured SVM formulation to align our components and mine for hard negatives to boost localization performance. This results in a model that supports spatio-temporal localization and is insensitive to dropped frames or partial observations. We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an on-line streaming setting.
6 0.76488835 3 cvpr-2013-3D R Transform on Spatio-temporal Interest Points for Action Recognition
7 0.72609991 149 cvpr-2013-Evaluation of Color STIPs for Human Action Recognition
8 0.72535861 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)
9 0.72534704 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
10 0.72213703 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes
11 0.67805463 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition
12 0.66024405 40 cvpr-2013-An Approach to Pose-Based Action Recognition
13 0.60637236 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path
14 0.58551848 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
15 0.58535755 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization
16 0.58232474 444 cvpr-2013-Unconstrained Monocular 3D Human Pose Estimation by Action Detection and Cross-Modality Regression Forest
17 0.58162189 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
18 0.56678104 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
19 0.501849 137 cvpr-2013-Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis
20 0.4991551 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
topicId topicWeight
[(10, 0.134), (16, 0.027), (26, 0.049), (28, 0.013), (33, 0.266), (38, 0.01), (48, 0.014), (61, 0.186), (67, 0.081), (69, 0.044), (72, 0.011), (87, 0.055)]
simIndex simValue paperId paperTitle
1 0.93485153 47 cvpr-2013-As-Projective-As-Possible Image Stitching with Moving DLT
Author: Julio Zaragoza, Tat-Jun Chin, Michael S. Brown, David Suter
Abstract: We investigate projective estimation under model inadequacies, i.e., when the underpinning assumptions of the projective model are not fully satisfied by the data. We focus on the task of image stitching which is customarily solved by estimating a projective warp — a model that is justified when the scene is planar or when the views differ purely by rotation. Such conditions are easily violated in practice, and this yields stitching results with ghosting artefacts that necessitate the usage of deghosting algorithms. To this end we propose as-projective-as-possible warps, i.e., warps that aim to be globally projective, yet allow local non-projective deviations to account for violations to the assumed imaging conditions. Based on a novel estimation technique called Moving Direct Linear Transformation (Moving DLT), our method seamlessly bridges image regions that are inconsistent with the projective model. The result is highly accurate image stitching, with significantly reduced ghosting effects, thus lowering the dependency on post hoc deghosting.
2 0.89682549 16 cvpr-2013-A Linear Approach to Matching Cuboids in RGBD Images
Author: Hao Jiang, Jianxiong Xiao
Abstract: We propose a novel linear method to match cuboids in indoor scenes using RGBD images from Kinect. Beyond depth maps, these cuboids reveal important structures of a scene. Instead of directly fitting cuboids to 3D data, we first construct cuboid candidates using superpixel pairs on a RGBD image, and then we optimize the configuration of the cuboids to satisfy the global structure constraints. The optimal configuration has low local matching costs, small object intersection and occlusion, and the cuboids tend to project to a large region in the image; the number of cuboids is optimized simultaneously. We formulate the multiple cuboid matching problem as a mixed integer linear program and solve the optimization efficiently with a branch and bound method. The optimization guarantees the global optimal solution. Our experiments on the Kinect RGBD images of a variety of indoor scenes show that our proposed method is efficient, accurate and robust against object appearance variations, occlusions and strong clutter.
3 0.88012356 72 cvpr-2013-Boundary Detection Benchmarking: Beyond F-Measures
Author: Xiaodi Hou, Alan Yuille, Christof Koch
Abstract: For an ill-posed problem like boundary detection, human labeled datasets play a critical role. Compared with the active research on finding a better boundary detector to refresh the performance record, there is surprisingly little discussion on the boundary detection benchmark itself. The goal of this paper is to identify the potential pitfalls of today’s most popular boundary benchmark, BSDS 300. In the paper, we first introduce a psychophysical experiment to show that many of the “weak” boundary labels are unreliable and may contaminate the benchmark. Then we analyze the computation of f-measure and point out that the current benchmarking protocol encourages an algorithm to bias towards those problematic “weak” boundary labels. With this evidence, we focus on a new problem of detecting strong boundaries as one alternative. Finally, we assess the performances of 9 major algorithms on different ways of utilizing the dataset, suggesting new directions for improvements.
same-paper 4 0.8730635 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition
Author: LiMin Wang, Yu Qiao, Xiaoou Tang
Abstract: This paper proposes motionlet, a mid-level and spatiotemporal part, for human motion recognition. Motionlet can be seen as a tight cluster in motion and appearance space, corresponding to the moving process of different body parts. We postulate three key properties of motionlet for action recognition: high motion saliency, multiple scale representation, and representative-discriminative ability. Towards this goal, we develop a data-driven approach to learn motionlets from training videos. First, we extract 3D regions with high motion saliency. Then we cluster these regions and preserve the centers as candidate templates for motionlet. Finally, we examine the representative and discriminative power of the candidates, and introduce a greedy method to select effective candidates. With motionlets, we present a mid-level representation for video, called motionlet activation vector. We conduct experiments on three datasets, KTH, HMDB51, and UCF50. The results show that the proposed methods significantly outperform state-of-the-art methods.
5 0.86174148 100 cvpr-2013-Crossing the Line: Crowd Counting by Integer Programming with Local Features
Author: Zheng Ma, Antoni B. Chan
Abstract: We propose an integer programming method for estimating the instantaneous count of pedestrians crossing a line of interest in a video sequence. Through a line sampling process, the video is first converted into a temporal slice image. Next, the number of people is estimated in a set of overlapping sliding windows on the temporal slice image, using a regression function that maps from local features to a count. Given that count in a sliding window is the sum of the instantaneous counts in the corresponding time interval, an integer programming method is proposed to recover the number of pedestrians crossing the line of interest in each frame. Integrating over a specific time interval yields the cumulative count of pedestrian crossing the line. Compared with current methods for line counting, our proposed approach achieves state-of-the-art performance on several challenging crowd video datasets.
6 0.8597064 304 cvpr-2013-Multipath Sparse Coding Using Hierarchical Matching Pursuit
7 0.85759485 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
8 0.85740918 325 cvpr-2013-Part Discovery from Partial Correspondence
9 0.85671705 414 cvpr-2013-Structure Preserving Object Tracking
10 0.85506296 49 cvpr-2013-Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition
11 0.85451519 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
12 0.85399276 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
13 0.85349917 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
14 0.85180032 314 cvpr-2013-Online Object Tracking: A Benchmark
15 0.85146648 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking
16 0.85060942 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
17 0.84960914 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image
18 0.84943968 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection
19 0.84860861 60 cvpr-2013-Beyond Physical Connections: Tree Models in Human Pose Estimation
20 0.84769094 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence