cvpr cvpr2013 cvpr2013-378 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Feng Shi, Emil Petriu, Robert Laganière
Abstract: Local spatio-temporal features and bag-of-features representations have become popular for action recognition. A recent trend is to use dense sampling for better performance. While many methods claimed to use dense feature sets, most of them are just denser than approaches based on sparse interest point detectors. In this paper, we explore sampling with high density for action recognition. We also investigate the impact of random sampling over a dense grid for computational efficiency. We present a real-time action recognition system which integrates a fast random sampling method with local spatio-temporal features extracted from a Local Part Model. A new method based on the histogram intersection kernel is proposed to combine multiple channels of different descriptors. Our technique shows high accuracy on the simple KTH dataset (93%), and achieves state-of-the-art results on two very challenging real-world datasets: 83.3% on UCF50 and 47.6% on HMDB51.
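As an illustration of the channel-combination idea in the abstract, the following minimal Python sketch computes a histogram intersection kernel per descriptor channel and merges the per-channel values with a simple weighted average. The uniform weighting and the function names are assumptions for illustration; the paper proposes its own HIK-based combination scheme.

```python
import numpy as np

def hik(a, b):
    """Histogram intersection kernel between two L1-normalized histograms."""
    return float(np.minimum(a, b).sum())

def combined_kernel(x_channels, y_channels, weights=None):
    """Average per-channel HIK values into a single kernel value.

    x_channels / y_channels: one BoF histogram per descriptor channel
    (e.g. HOG3D, HOF, MBHx, MBHy). Uniform weighting is an assumption;
    the paper's own HIK-based combination may differ.
    """
    n = len(x_channels)
    weights = weights if weights is not None else [1.0 / n] * n
    return sum(w * hik(x, y)
               for w, x, y in zip(weights, x_channels, y_channels))
```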
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Local spatio-temporal features and bag-of-features representations have become popular for action recognition. [sent-3, score-0.413]
2 A recent trend is to use dense sampling for better performance. [sent-4, score-0.651]
3 While many methods claimed to use dense feature sets, most of them are just denser than approaches based on sparse interest point detectors. [sent-5, score-0.377]
4 In this paper, we explore sampling with high density on action recognition. [sent-6, score-0.877]
5 We also investigate the impact of random sampling over dense grid for computational efficiency. [sent-7, score-0.738]
6 We present a real-time action recognition system which integrates fast random sampling method with local spatio-temporal features extracted from a Local Part Model. [sent-8, score-0.921]
7 Introduction: Action recognition in videos has recently been a very active research area due to its wide range of applications, such as intelligent video surveillance, video retrieval, human-computer interaction and smart homes. [sent-14, score-0.403]
8 State-of-the-art approaches [23, 24, 30] have reported good results on human action datasets. [sent-15, score-0.418]
9 Among all the methods, local spatio-temporal features and bag-of-features(BoF) representations achieved remarkable performance for action recognition. [sent-16, score-0.451]
10 A recent trend is the use of dense sampled feature points [26, 32] and trajectories [30] for action recognition. [sent-24, score-0.824]
11 First, most existing action recognition methods use computationally expensive feature extraction, which is a very limiting factor considering the huge amount of data to be processed. [sent-26, score-0.465]
12 In contrast, dense sampling methods can provide a very large number of feature patches and thus can potentially produce excellent performance. [sent-28, score-0.452]
13 The best results are observed when the sampling step size decreases [30, 32]. [sent-29, score-0.412]
14 However, the increase in the number of processed points adds to the computation complexity even if simplifying techniques are used, such as integral video and approximative box-filters. [sent-30, score-0.442]
15 Most interest point detectors used for action classification are extended from 2D space domain. [sent-31, score-0.484]
16 To overcome these challenges, we propose an efficient method for real-time action recognition. [sent-39, score-0.376]
17 Inspired by the success of the random sampling approach in image classification [21], we use random sampling for action recognition. [sent-40, score-1.366]
18 We evaluated our random sampling strategy combined with the Local Part Model on publicly available datasets and show real-time performance with state-of-the-art accuracy. [sent-43, score-0.511]
19 Related work: A recent trend is to use dense sampling over sparse interest points for better performance. [sent-45, score-0.709]
20 Dense sampling has been shown to produce good results for image classification [1, 15]. [sent-46, score-0.462]
21 It is demonstrated in [32] that dense sampling at regular space-time grids outperforms state-of-the-art interest point detectors. [sent-48, score-0.666]
22 Compared with interest point detectors, dense sampling captures most information by sampling every pixel in each spatial scale. [sent-50, score-1.078]
23 Uniform random sampling [21], on the other hand, can provide performances comparable to dense sampling. [sent-52, score-0.703]
24 A recent study [29] shows that action recognition performance can be maintained with as little as 30% of the densely detected features. [sent-53, score-0.418]
25 Given the effectiveness of the uniform sampling strategy, one can think of using biased random samplers in order to find more discriminative patches. [sent-55, score-0.506]
26 By using such saliency maps, they pruned 20-50% of the dense features and achieved better results. [sent-62, score-0.29]
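To make the biased-sampler idea above concrete, here is a small Python sketch that draws grid points with probability proportional to a saliency score rather than uniformly. The precomputed per-point saliency scores and the function name are assumptions for illustration.

```python
import numpy as np

def saliency_biased_sample(grid_points, saliency, n_samples, rng=None):
    """Draw patch centers with probability proportional to saliency.

    grid_points: (N, 3) array of (x, y, t) centers on the dense grid.
    saliency:    (N,) non-negative scores, assumed precomputed
                 (e.g. from an eye-movement or motion saliency map).
    """
    rng = rng or np.random.default_rng()
    p = np.asarray(saliency, dtype=np.float64)
    p = p / p.sum()
    idx = rng.choice(len(grid_points), size=n_samples, replace=False, p=p)
    return grid_points[idx]
```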
27 In addition, because of computational constraints, these methods did not explore high-density sampling schemes to improve their performance. [sent-64, score-0.501]
28 As for real-time action recognition algorithms, both Ke et al. [sent-65, score-0.376]
29 [33] use approximative box-filter operations and an integral video structure to speed up feature extraction. [sent-67, score-0.401]
30 Figure 1: Example of a Local Part Model defined by a root filter and overlapping grids of part filters. [sent-79, score-0.329]
31 Real-time action recognition approach: Our method builds on our previous work based on the Local Part Model [26]. [sent-82, score-0.376]
32 While the performance of dense sampling improves as the sampling step size decreases [30], such an approach rapidly becomes computationally intractable due to the very large number of patches produced. [sent-86, score-1.179]
33 To overcome this problem, our approach increases the sampling density by decreasing the sampling step size while, at the same time, controlling the number of sampled patches. [sent-87, score-1.038]
34 In addition, we experimentally found that, with proper sampling density, state-of-the-art performance can be achieved by randomly discarding up to 92% of densely sampled patches. [sent-88, score-0.627]
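A minimal sketch of the uniform subsampling step, assuming the dense grid is given as an array of patch centers. The keep_ratio of 0.08 mirrors the 92% discard figure above; in practice the paper draws a fixed number of patches per video (e.g. 10000).

```python
import numpy as np

def uniform_subsample(grid_points, keep_ratio=0.08, rng=None):
    """Uniformly keep a small fraction of densely sampled patch centers.

    grid_points: (N, 3) array of (x, y, t) centers from the dense grid.
    keep_ratio = 0.08 corresponds to discarding up to 92% of the dense
    samples, as reported above.
    """
    rng = rng or np.random.default_rng()
    n_keep = max(1, int(len(grid_points) * keep_ratio))
    idx = rng.choice(len(grid_points), size=n_keep, replace=False)
    return grid_points[idx]
```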
35 However, instead of adopting deformable “parts”, we use “parts” with fixed size and location for the purpose of maintaining both structural information and the ordering of local events for action recognition. [sent-92, score-0.469]
36 As shown in Figure 1, the local part model includes both a coarse, primitive-level root feature covering event-content statistics and higher-resolution overlapping part filters incorporating structure and temporal relations. [sent-93, score-0.621]
37 Under the local part model, a feature consists of a coarse global root filter and several fine overlapping part filters. [sent-98, score-0.433]
38 The root filter is extracted from the video at half resolution. [sent-99, score-0.476]
39 For every coarse root filter, a group of fine part filters are acquired from the full resolution video at the locations where the root filter serves as a reference position. [sent-100, score-0.735]
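The root/part geometry can be sketched as follows in Python. The 20% part overlap and the exact coordinate convention are assumptions, since the text only states that the 2 × 2 × 2 parts overlap and that the root, sampled at half resolution, anchors their positions.

```python
import itertools

def part_windows(root_corner, root_size, grid=(2, 2, 2), overlap=0.2):
    """Map one root patch, sampled at half resolution, to its 2x2x2
    overlapping part windows in the full-resolution video.

    root_corner = (x, y, t) and root_size = (w, h, l) are given in
    half-resolution coordinates, so both double at full resolution.
    The 20% overlap between neighbouring parts is an assumption.
    """
    corner = [2 * c for c in root_corner]      # half -> full resolution
    size = [2 * s for s in root_size]
    step = [s / g for s, g in zip(size, grid)]
    part_size = tuple(st * (1 + overlap) for st in step)
    windows = []
    for ix, iy, it in itertools.product(*(range(g) for g in grid)):
        origin = (corner[0] + ix * step[0],
                  corner[1] + iy * step[1],
                  corner[2] + it * step[2])
        windows.append((origin, part_size))
    return windows   # 8 overlapping part windows per root patch
```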
40 For sampling, the dense sampling grid is determined by the root filter, applied at half the spatial resolution of the processed video. [sent-101, score-1.142]
41 At this resolution, very high sampling density can be achieved with far fewer samples. [sent-102, score-0.501]
42 Integral video and descriptors: Following the ideas of [7, 9, 33], we use integral video for fast 3D cuboid computation. [sent-106, score-0.57]
43 In our method, for each clip, we compute two integral videos, one for the root filter at half resolution, and another one for the part filters at full resolution. [sent-108, score-0.581]
44 The descriptor of a 3D patch can then be computed very efficiently: 8 additions per filter, multiplied by the total number of root and part filters. [sent-109, score-0.346]
45 Apart from descriptor quantization, most of the cost associated with feature extraction is spent accessing memory through the integral videos. [sent-110, score-0.3]
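A sketch of the integral-video trick described above; the inclusion-exclusion lookup uses exactly 8 array reads per cuboid, matching the "8 additions" figure. This is a numpy-based assumption of the layout, with one integral volume per descriptor channel (and here, per resolution: root at half, parts at full).

```python
import numpy as np

def integral_video(v):
    """Cumulative sums over x, y and t; zero-padded so border lookups
    need no special cases. v is a 3D array (e.g. one gradient channel)."""
    iv = v.astype(np.float64)
    for axis in range(3):
        iv = np.cumsum(iv, axis=axis)
    return np.pad(iv, ((1, 0), (1, 0), (1, 0)))

def cuboid_sum(iv, x0, x1, y0, y1, t0, t1):
    """Sum over v[x0:x1, y0:y1, t0:t1] by 3D inclusion-exclusion:
    exactly 8 array reads per cuboid."""
    return (iv[x1, y1, t1]
            - iv[x0, y1, t1] - iv[x1, y0, t1] - iv[x1, y1, t0]
            + iv[x0, y0, t1] + iv[x0, y1, t0] + iv[x1, y0, t0]
            - iv[x0, y0, t0])
```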
46 1 Dense Sampling Grid: We perform uniform random sampling on a very dense sampling grid. [sent-121, score-1.114]
47 We follow the same multi-scale dense sampling grid as in [26] but with denser patches. [sent-122, score-0.756]
48 A 3D video patch centred at (x, y, t) is sampled with a patch size determined by the multi-scale factor (σ, τ). [sent-124, score-0.416]
49 With a total of 8 spatial scales and 2 temporal scales, we sampled the video 16 times. [sent-127, score-0.328]
50 A key factor governing sampling density is the overlap rate of the sampled patches. [sent-128, score-1.006]
51 We explore very high sampling density with 80% overlap for both spatial and temporal sampling. [sent-129, score-0.553]
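The grid construction can be sketched as follows. The base patch size and the 1.2 inter-scale factor are assumptions; the 8 × 2 scales and the 80% overlap (i.e. a stride of 20% of the patch size in every dimension) come from the text.

```python
import numpy as np

def dense_grid(video_shape, base_patch=(18, 18, 10),
               n_spatial=8, n_temporal=2, scale=1.2, overlap=0.8):
    """Enumerate (x, y, t, w, h, l) patches over a multi-scale dense grid.

    8 spatial x 2 temporal scales give the 16 grids mentioned above;
    base_patch and the 1.2 inter-scale factor are illustrative guesses.
    """
    W, H, T = video_shape
    patches = []
    for si in range(n_spatial):
        for ti in range(n_temporal):
            w = int(round(base_patch[0] * scale ** si))
            h = int(round(base_patch[1] * scale ** si))
            l = int(round(base_patch[2] * scale ** ti))
            # 80% overlap -> stride is 20% of the patch size
            sx = max(1, int(w * (1 - overlap)))
            sy = max(1, int(h * (1 - overlap)))
            st = max(1, int(l * (1 - overlap)))
            for x in range(0, W - w + 1, sx):
                for y in range(0, H - h + 1, sy):
                    for t in range(0, T - l + 1, st):
                        patches.append((x, y, t, w, h, l))
    return np.array(patches)
```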
52 The features produced with cuboid and dense sampling in [32] are sampled from videos with a resolution of 360 × 288 pixels. [sent-131, score-1.012]
53 At the same resolution, we generate 43 times more features than the dense sampling method in [32]. [sent-132, score-0.645]
54 It has been shown in [21] that the performance is always improved as the number of randomly sampled patches is increased, with as many as 10000 points per image. [sent-137, score-0.331]
55 Therefore, we have to use strategies to reduce the number of sampled points per frame while maintaining an adequate sampling density. [sent-139, score-0.297]
56 One solution is to sample at a lower spatial resolution. [sent-140, score-0.412]
57 As discussed above, the Local Part Model is well suited to maintaining sampling density. [sent-141, score-0.412]
58 By using it, the dense sampling grid is determined by the root filter, which is applied at half the resolution of the processed video. [sent-142, score-1.07]
59 As stated above, we use half the video resolution for UCF50 and HMDB51 in our experiments. [sent-143, score-0.329]
60 Table 2 shows the average number of dense points (the third column) per video for different datasets. [sent-144, score-0.388]
61 For example, the average video size of HMDB51 is 182 × 120 pixels and 95 frames, and we randomly sample 10000 patches from the dense grid of 87,249 points. [sent-147, score-0.584]
62 The dense grid is determined by the root filter, which is applied at half the video size (91 × 60 pixels and 95 frames). [sent-148, score-0.672]
63 Table 2: The sampling percentage of 10,000 random samples vs. [sent-152, score-0.47]
64 total points (the third column) of dense sampling for different datasets. [sent-153, score-0.608]
65 Our sampling density is much higher than [30, 32]. [sent-161, score-0.501]
66 Experiments: To demonstrate the performance of our sampling strategy, we evaluated our method on three public action benchmarks: the KTH [25], UCF50 [22] and HMDB51 [10] datasets. [sent-164, score-0.788]
67 We randomly sampled 3D patches from the dense grid, and used them to represent a video with a standard bag-of-features approach. [sent-165, score-0.637]
68 The sampled 3D patches are represented by descriptors, and the descriptors are matched to their nearest visual words with Euclidean distance. [sent-167, score-0.305]
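A minimal sketch of the quantization step: brute-force Euclidean nearest-codeword assignment followed by L1 normalization. The paper's word statistics mention FLANN, so an approximate nearest-neighbour search is presumably used in practice; this brute-force version is for clarity only.

```python
import numpy as np

def bof_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword (Euclidean
    distance) and return an L1-normalized bag-of-features histogram.

    descriptors: (N, D) array; codebook: (K, D) array of visual words.
    """
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                 # nearest visual word per patch
    hist = np.bincount(words, minlength=len(codebook)).astype(np.float64)
    return hist / max(hist.sum(), 1.0)
```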
69 It contains six action classes: walking, jogging, running, boxing, hand waving and hand clapping. [sent-186, score-0.376]
70 Each action is performed by 25 subjects in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes and indoors. [sent-187, score-0.536]
71 The videos are grouped into 25 groups, where each group consists of a minimum of 4 action clips. [sent-192, score-0.477]
72 The HMDB51 dataset [10] is by far the largest human action dataset, with 51 action categories and at least 101 clips per category. [sent-196, score-0.865]
73 The root filter of the local part model is sampled from the dense sampling grid of the processed video at half the resolution. [sent-314, score-1.429]
74 For each root patch, we sampled 8 (2 × 2 × 2) overlapping part filters from the full-resolution video. [sent-315, score-0.516]
75 Both root and part patches are represented with a descriptor. [sent-316, score-0.335]
76 The histograms of 1 root patch and 8 part patches are concatenated into one local spatio-temporal (ST) feature. [sent-317, score-0.443]
77 With one HOG3D descriptor of dimension 60, 96 or 144, our local part model feature (with 1 root filter and 8 part filters) has a dimension of 540, 864 or 1296, respectively. [sent-326, score-0.684]
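The dimensions follow directly from concatenation, e.g. 9 × 60 = 540; a trivial sketch (the function name is illustrative):

```python
import numpy as np

def lpm_feature(root_desc, part_descs):
    """Concatenate 1 root + 8 part HOG3D histograms into one local
    spatio-temporal feature: 9 x 60/96/144 dims -> 540/864/1296 dims."""
    assert len(part_descs) == 8, "local part model uses 2x2x2 = 8 parts"
    return np.concatenate([np.asarray(root_desc)] +
                          [np.asarray(p) for p in part_descs])
```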
78 For each video, we randomly sampled 4000, 6000, 8000 and 10000 3D patches as features, and performed classification using a standard bag-of-features approach with 4000 and 6000 codewords. [sent-340, score-0.34]
79 To compensate for the sampling randomness, all the tests were run 3 times. [sent-342, score-0.412]
80 The performance is almost always improved as the number of patches sampled from the video is increased. [sent-345, score-0.393]
81 This is consistent with the results of random sampling for image classification [21]. [sent-346, score-0.52]
82 On both the KTH and UCF50 datasets, the best result is achieved using 6000 codewords and 10000 sampled patches. Table 3: Average accuracy on all three datasets with 4000, 6000, 8000 and 10000 randomly sampled features per video. [sent-351, score-0.678]
83 Note: if a video has more than 160 frames, additional features are sampled at the same sampling rate as for the first 160 frames. [sent-354, score-0.725]
84 The consistency of the results at such a high rate of randomness can be explained by the very high sampling density used by our approach. [sent-364, score-0.542]
85 Nevertheless, we obtain 93%, which is better than the uniform dense sampling methods [13, 26, 32]. [sent-373, score-0.644]
86 Also, by randomly sampling cubic patches, we are able to use integral video to accelerate the processing. [sent-387, score-0.76]
87 One possible explanation for such good performance on real-life videos lies in our random sampling conducted over an extremely dense sampling grid. [sent-391, score-1.263]
88 For 10000 patches per video on HMDB51, we have around 100 features per frame, which is similar to [31]. [sent-392, score-0.387]
89 However, our sampling density is much higher because the sampling is performed at one quarter of the video size used in [31]. [sent-393, score-0.913]
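A small worked check of this argument, using only the figures quoted above (the 4× factor follows from halving both spatial dimensions):

```python
# Worked numbers from the HMDB51 example above.
patches_per_video = 10000
frames = 95
print(patches_per_video / frames)          # ~105 features per frame

full_res_pixels = 182 * 120                # HMDB51 average frame size
half_res_pixels = 91 * 60                  # root-filter grid resolution
print(full_res_pixels / half_res_pixels)   # 4.0: same patch count drawn
                                           # from a quarter of the pixels
```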
90 Compared with interest point detectors, we have more patches sampled from the test videos, and with uniform random sampling our method also includes correlated background information. [sent-394, score-0.895]
91 The computation time for the integral video varies little; it depends only on the size of the videos. [sent-427, score-0.36]
92 Because there is no feature detection step in random sampling, the introduction of the integral video greatly improves the speed of random sampling and HOG3D computation. [sent-428, score-0.935]
93 Conclusions: This paper has introduced a sampling strategy for efficient action recognition. [sent-438, score-0.788]
94 Motion interchange patterns for action recognition in unconstrained videos. [sent-495, score-0.376]
95 Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. [sent-542, score-0.413]
96 Human action recognition based on boosted feature selection and naive Bayes nearest-neighbor classification. [sent-553, score-0.423]
97 Dynamic eye movement datasets and learnt saliency models for visual action recognition. [sent-572, score-0.591]
98 Space-variant descriptor sampling for action recognition based on saliency and eye movements. [sent-638, score-0.976]
99 Dense trajectories and motion boundary descriptors for action recognition. [sent-654, score-0.476]
100 Real-time action recognition by spatiotemporal semantic and structural forests. [sent-693, score-0.376]
wordName wordTfidf (topN-words)
[('sampling', 0.412), ('action', 0.376), ('dense', 0.196), ('mbh', 0.171), ('root', 0.16), ('video', 0.151), ('integral', 0.149), ('codewords', 0.144), ('kth', 0.127), ('sampled', 0.125), ('patches', 0.117), ('pages', 0.108), ('mbhy', 0.104), ('videos', 0.101), ('actions', 0.099), ('half', 0.093), ('pling', 0.092), ('dimension', 0.092), ('density', 0.089), ('polyhedron', 0.085), ('resolution', 0.085), ('laptev', 0.084), ('outdoors', 0.08), ('denser', 0.076), ('filter', 0.072), ('grid', 0.072), ('clips', 0.071), ('willems', 0.071), ('patch', 0.07), ('klser', 0.069), ('mathe', 0.069), ('ottawa', 0.069), ('petriu', 0.069), ('descriptor', 0.067), ('intersection', 0.066), ('channels', 0.065), ('eye', 0.064), ('descriptors', 0.063), ('wc', 0.061), ('speed', 0.06), ('random', 0.058), ('interest', 0.058), ('part', 0.058), ('saliency', 0.057), ('yeffet', 0.057), ('ldt', 0.057), ('cuboid', 0.056), ('ordering', 0.055), ('background', 0.054), ('approximative', 0.054), ('governing', 0.054), ('flann', 0.054), ('movement', 0.053), ('int', 0.052), ('temporal', 0.052), ('processed', 0.052), ('conf', 0.051), ('classification', 0.05), ('additions', 0.049), ('filters', 0.049), ('randomly', 0.048), ('resides', 0.047), ('nowak', 0.047), ('feature', 0.047), ('realistic', 0.046), ('uncorrelated', 0.045), ('trend', 0.043), ('deviation', 0.042), ('frames', 0.042), ('listed', 0.042), ('splits', 0.042), ('human', 0.042), ('densely', 0.042), ('computationally', 0.042), ('randomness', 0.041), ('per', 0.041), ('datasets', 0.041), ('hof', 0.04), ('klaser', 0.04), ('multiscale', 0.04), ('strategies', 0.039), ('overlapping', 0.039), ('histogram', 0.039), ('local', 0.038), ('features', 0.037), ('spent', 0.037), ('performances', 0.037), ('trajectories', 0.037), ('computation', 0.036), ('cells', 0.036), ('uniform', 0.036), ('kernel', 0.036), ('includes', 0.035), ('efficiency', 0.035), ('svm', 0.035), ('maximal', 0.035), ('stages', 0.035), ('doll', 0.034), ('dimensions', 0.034), ('samples', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999905 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
Author: Feng Shi, Emil Petriu, Robert Laganière
Abstract: Local spatio-temporal features and bag-of-features representations have become popular for action recognition. A recent trend is to use dense sampling for better performance. While many methods claimed to use dense feature sets, most of them are just denser than approaches based on sparse interest point detectors. In this paper, we explore sampling with high density on action recognition. We also investigate the impact of random sampling over dense grid for computational efficiency. We present a real-time action recognition system which integrates fast random sampling method with local spatio-temporal features extracted from a Local Part Model. A new method based on histogram intersection kernel is proposed to combine multiple channels of different descriptors. Our technique shows high accuracy on the simple KTH dataset, and achieves state-of-the-art on two very challenging real-world datasets, namely, 93% on KTH, 83.3% on UCF50 and 47.6% on HMDB51.
2 0.35546607 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
Author: Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis
Abstract: How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatiotemporal patch in the video. What defines these spatiotemporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos and align the videos for label transfer techniques. Furthermore, these patches can be used as a discriminative vocabulary for action classification where they demonstrate stateof-the-art performance on UCF50 and Olympics datasets.
3 0.31090084 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes
Author: Simon Hadfield, Richard Bowden
Abstract: Action recognition in unconstrained situations is a difficult task, suffering from massive intra-class variations. It is made even more challenging when complex 3D actions are projected down to the image plane, losing a great deal of information. The recent emergence of 3D data, both in broadcast content, and commercial depth sensors, provides the possibility to overcome this issue. This paper presents a new dataset, for benchmarking action recognition algorithms in natural environments, while making use of 3D information. The dataset contains around 650 video clips, across 14 classes. In addition, two state of the art action recognition algorithms are extended to make use ofthe 3D data, andfive new interestpoint detection strategies are alsoproposed, that extend to the 3D data. Our evaluation compares all 4 feature descriptors, using 7 different types of interest point, over a variety of threshold levels, for the Hollywood3D dataset. We make the dataset including stereo video, estimated depth maps and all code required to reproduce the benchmark results, available to the wider community.
4 0.29007301 287 cvpr-2013-Modeling Actions through State Changes
Author: Alireza Fathi, James M. Rehg
Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.
5 0.26525012 40 cvpr-2013-An Approach to Pose-Based Action Recognition
Author: Chunyu Wang, Yizhou Wang, Alan L. Yuille
Abstract: We address action recognition in videos by modeling the spatial-temporal structures of human poses. We start by improving a state of the art method for estimating human joint locations from videos. More precisely, we obtain the K-best estimations output by the existing method and incorporate additional segmentation cues and temporal constraints to select the “best” one. Then we group the estimated joints into five body parts (e.g. the left arm) and apply data mining techniques to obtain a representation for the spatial-temporal structures of human actions. This representation captures the spatial configurations ofbodyparts in one frame (by spatial-part-sets) as well as the body part movements(by temporal-part-sets) which are characteristic of human actions. It is interpretable, compact, and also robust to errors on joint estimations. Experimental results first show that our approach is able to localize body joints more accurately than existing methods. Next we show that it outperforms state of the art action recognizers on the UCF sport, the Keck Gesture and the MSR-Action3D datasets.
6 0.26495525 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
8 0.25787711 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
9 0.241675 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition
10 0.23074025 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
11 0.21475494 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)
12 0.21129525 3 cvpr-2013-3D R Transform on Spatio-temporal Interest Points for Action Recognition
13 0.20299712 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
14 0.20164315 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
15 0.18252593 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path
16 0.17815885 149 cvpr-2013-Evaluation of Color STIPs for Human Action Recognition
17 0.17537202 153 cvpr-2013-Expanded Parts Model for Human Attribute and Action Recognition in Still Images
18 0.16168894 407 cvpr-2013-Spatio-temporal Depth Cuboid Similarity Feature for Activity Recognition Using Depth Camera
19 0.15739624 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
20 0.15162936 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition
topicId topicWeight
[(0, 0.32), (1, -0.136), (2, 0.036), (3, -0.165), (4, -0.344), (5, -0.019), (6, -0.087), (7, 0.006), (8, -0.112), (9, -0.093), (10, -0.027), (11, -0.074), (12, 0.046), (13, -0.008), (14, -0.012), (15, -0.025), (16, -0.007), (17, -0.06), (18, 0.147), (19, 0.137), (20, 0.054), (21, 0.008), (22, -0.0), (23, -0.017), (24, 0.029), (25, -0.042), (26, 0.003), (27, 0.04), (28, -0.035), (29, -0.052), (30, -0.066), (31, 0.035), (32, 0.003), (33, 0.006), (34, -0.049), (35, 0.027), (36, -0.024), (37, -0.014), (38, -0.031), (39, -0.001), (40, 0.05), (41, 0.016), (42, -0.04), (43, -0.049), (44, -0.025), (45, 0.012), (46, -0.026), (47, 0.022), (48, -0.003), (49, -0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.95461661 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
Author: Feng Shi, Emil Petriu, Robert Laganière
Abstract: Local spatio-temporal features and bag-of-features representations have become popular for action recognition. A recent trend is to use dense sampling for better performance. While many methods claimed to use dense feature sets, most of them are just denser than approaches based on sparse interest point detectors. In this paper, we explore sampling with high density on action recognition. We also investigate the impact of random sampling over dense grid for computational efficiency. We present a real-time action recognition system which integrates fast random sampling method with local spatio-temporal features extracted from a Local Part Model. A new method based on histogram intersection kernel is proposed to combine multiple channels of different descriptors. Our technique shows high accuracy on the simple KTH dataset, and achieves state-of-the-art on two very challenging real-world datasets, namely, 93% on KTH, 83.3% on UCF50 and 47.6% on HMDB51.
2 0.88920534 287 cvpr-2013-Modeling Actions through State Changes
Author: Alireza Fathi, James M. Rehg
Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.
3 0.88276315 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition
Author: LiMin Wang, Yu Qiao, Xiaoou Tang
Abstract: This paper proposes motionlet, a mid-level and spatiotemporal part, for human motion recognition. Motionlet can be seen as a tight cluster in motion and appearance space, corresponding to the moving process of different body parts. We postulate three key properties of motionlet for action recognition: high motion saliency, multiple scale representation, and representative-discriminative ability. Towards this goal, we develop a data-driven approach to learn motionlets from training videos. First, we extract 3D regions with high motion saliency. Then we cluster these regions and preserve the centers as candidate templates for motionlet. Finally, we examine the representative and discriminative power of the candidates, and introduce a greedy method to select effective candidates. With motionlets, we present a mid-level representation for video, called motionlet activation vector. We conduct experiments on three datasets, KTH, HMDB51, and UCF50. The results show that the proposed methods significantly outperform state-of-the-art methods.
4 0.87623894 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
Author: Yicong Tian, Rahul Sukthankar, Mubarak Shah
Abstract: Deformable part models have achieved impressive performance for object detection, even on difficult image datasets. This paper explores the generalization of deformable part models from 2D images to 3D spatiotemporal volumes to better study their effectiveness for action detection in video. Actions are treated as spatiotemporal patterns and a deformable part model is generated for each action from a collection of examples. For each action model, the most discriminative 3D subvolumes are automatically selected as parts and the spatiotemporal relations between their locations are learned. By focusing on the most distinctive parts of each action, our models adapt to intra-class variation and show robustness to clutter. Extensive experiments on several video datasets demonstrate the strength of spatiotemporal DPMs for classifying and localizing actions.
5 0.85265219 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
Author: Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis
Abstract: How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatiotemporal patch in the video. What defines these spatiotemporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos and align the videos for label transfer techniques. Furthermore, these patches can be used as a discriminative vocabulary for action classification where they demonstrate stateof-the-art performance on UCF50 and Olympics datasets.
6 0.83923632 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
7 0.82807398 3 cvpr-2013-3D R Transform on Spatio-temporal Interest Points for Action Recognition
8 0.812415 149 cvpr-2013-Evaluation of Color STIPs for Human Action Recognition
9 0.79169363 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)
10 0.77480906 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes
11 0.74673921 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition
12 0.69650096 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path
13 0.67408282 40 cvpr-2013-An Approach to Pose-Based Action Recognition
14 0.63802218 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization
15 0.63535202 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
16 0.62277114 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
18 0.58869648 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
19 0.5828476 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
20 0.57015485 332 cvpr-2013-Pixel-Level Hand Detection in Ego-centric Videos
topicId topicWeight
[(10, 0.108), (16, 0.041), (26, 0.036), (28, 0.011), (33, 0.391), (48, 0.143), (67, 0.09), (69, 0.047), (87, 0.055)]
simIndex simValue paperId paperTitle
1 0.95791876 67 cvpr-2013-Blocks That Shout: Distinctive Parts for Scene Classification
Author: Mayank Juneja, Andrea Vedaldi, C.V. Jawahar, Andrew Zisserman
Abstract: The automatic discovery of distinctive parts for an object or scene class is challenging since it requires simultaneously to learn the part appearance and also to identify the part occurrences in images. In this paper, we propose a simple, efficient, and effective method to do so. We address this problem by learning parts incrementally, starting from a single part occurrence with an Exemplar SVM. In this manner, additional part instances are discovered and aligned reliably before being considered as training examples. We also propose entropy-rank curves as a means of evaluating the distinctiveness of parts shareable between categories and use them to select useful parts out of a set of candidates. We apply the new representation to the task of scene categorisation on the MIT Scene 67 benchmark. We show that our method can learn parts which are significantly more informative and for a fraction of the cost, compared to previouspart-learning methods such as Singh et al. [28]. We also show that a well constructed bag of words or Fisher vector model can substantially outperform the previous state-of- the-art classification performance on this data.
same-paper 2 0.95045859 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
Author: Feng Shi, Emil Petriu, Robert Laganière
Abstract: Local spatio-temporal features and bag-of-features representations have become popular for action recognition. A recent trend is to use dense sampling for better performance. While many methods claimed to use dense feature sets, most of them are just denser than approaches based on sparse interest point detectors. In this paper, we explore sampling with high density on action recognition. We also investigate the impact of random sampling over dense grid for computational efficiency. We present a real-time action recognition system which integrates fast random sampling method with local spatio-temporal features extracted from a Local Part Model. A new method based on histogram intersection kernel is proposed to combine multiple channels of different descriptors. Our technique shows high accuracy on the simple KTH dataset, and achieves state-of-the-art on two very challenging real-world datasets, namely, 93% on KTH, 83.3% on UCF50 and 47.6% on HMDB51.
3 0.94816279 387 cvpr-2013-Semi-supervised Domain Adaptation with Instance Constraints
Author: Jeff Donahue, Judy Hoffman, Erik Rodner, Kate Saenko, Trevor Darrell
Abstract: Most successful object classification and detection methods rely on classifiers trained on large labeled datasets. However, for domains where labels are limited, simply borrowing labeled data from existing datasets can hurt performance, a phenomenon known as “dataset bias.” We propose a general framework for adapting classifiers from “borrowed” data to the target domain using a combination of available labeled and unlabeled examples. Specifically, we show that imposing smoothness constraints on the classifier scores over the unlabeled data can lead to improved adaptation results. Such constraints are often available in the form of instance correspondences, e.g. when the same object or individual is observed simultaneously from multiple views, or tracked between video frames. In these cases, the object labels are unknown but can be constrained to be the same or similar. We propose techniques that build on existing domain adaptation methods by explicitly modeling these relationships, and demonstrate empirically that they improve recognition accuracy in two scenarios, multicategory image classification and object detection in video.
4 0.94435126 82 cvpr-2013-Class Generative Models Based on Feature Regression for Pose Estimation of Object Categories
Author: Michele Fenzi, Laura Leal-Taixé, Bodo Rosenhahn, Jörn Ostermann
Abstract: In this paper, we propose a method for learning a class representation that can return a continuous value for the pose of an unknown class instance using only 2D data and weak 3D labelling information. Our method is based on generative feature models, i.e., regression functions learnt from local descriptors of the same patch collected under different viewpoints. The individual generative models are then clustered in order to create class generative models which form the class representation. At run-time, the pose of the query image is estimated in a maximum a posteriori fashion by combining the regression functions belonging to the matching clusters. We evaluate our approach on the EPFL car dataset [17] and the Pointing’04 face dataset [8]. Experimental results show that our method outperforms by 10% the state-of-the-art in the first dataset and by 9% in the second.
5 0.94356495 202 cvpr-2013-Hierarchical Saliency Detection
Author: Qiong Yan, Li Xu, Jianping Shi, Jiaya Jia
Abstract: When dealing with objects with complex structures, saliency detection confronts a critical problem namely that detection accuracy could be adversely affected if salient foreground or background in an image contains small-scale high-contrast patterns. This issue is common in natural images and forms a fundamental challenge for prior methods. We tackle it from a scale point of view and propose a multi-layer approach to analyze saliency cues. The final saliency map is produced in a hierarchical model. Different from varying patch sizes or downsizing images, our scale-based region handling is by finding saliency values optimally in a tree model. Our approach improves saliency detection on many images that cannot be handled well traditionally. A new dataset is also constructed. –
6 0.94332439 204 cvpr-2013-Histograms of Sparse Codes for Object Detection
7 0.94278759 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs
8 0.94260621 438 cvpr-2013-Towards Pose Robust Face Recognition
9 0.94234562 464 cvpr-2013-What Makes a Patch Distinct?
10 0.94220597 173 cvpr-2013-Finding Things: Image Parsing with Regions and Per-Exemplar Detectors
11 0.94218081 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
12 0.94180858 92 cvpr-2013-Constrained Clustering and Its Application to Face Clustering in Videos
13 0.9417128 380 cvpr-2013-Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images
14 0.94168931 163 cvpr-2013-Fast, Accurate Detection of 100,000 Object Classes on a Single Machine
15 0.94164497 318 cvpr-2013-Optimized Pedestrian Detection for Multiple and Occluded People
16 0.94161326 36 cvpr-2013-Adding Unlabeled Samples to Categories by Learned Attributes
17 0.94153208 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
19 0.94135946 168 cvpr-2013-Fast Object Detection with Entropy-Driven Evaluation
20 0.94117302 207 cvpr-2013-Human Pose Estimation Using a Joint Pixel-wise and Part-wise Formulation