cvpr cvpr2013 cvpr2013-205 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Simon Hadfield, Richard Bowden
Abstract: Action recognition in unconstrained situations is a difficult task, suffering from massive intra-class variations. It is made even more challenging when complex 3D actions are projected down to the image plane, losing a great deal of information. The recent emergence of 3D data, both in broadcast content, and commercial depth sensors, provides the possibility to overcome this issue. This paper presents a new dataset, for benchmarking action recognition algorithms in natural environments, while making use of 3D information. The dataset contains around 650 video clips, across 14 classes. In addition, two state of the art action recognition algorithms are extended to make use of the 3D data, and five new interest point detection strategies are also proposed, that extend to the 3D data. Our evaluation compares all 4 feature descriptors, using 7 different types of interest point, over a variety of threshold levels, for the Hollywood3D dataset. We make the dataset including stereo video, estimated depth maps and all code required to reproduce the benchmark results, available to the wider community.
Reference: text
sentIndex sentText sentNum sentScore
1 The recent emergence of 3D data, both in broadcast content, and commercial depth sensors, provides the possibility to overcome this issue. [sent-7, score-0.414]
2 This paper presents a new dataset, for benchmarking action recognition algorithms in natural environments, while making use of 3D information. [sent-8, score-0.39]
3 In addition, two state of the art action recognition algorithms are extended to make use of the 3D data, and five new interest point detection strategies are also proposed, that extend to the 3D data. [sent-10, score-0.473]
4 We make the dataset including stereo video, estimated depth maps and all code required to reproduce the benchmark results, available to the wider community. [sent-12, score-0.324]
5 Figure 1: Example frames of various action sequences from the dataset, showing the left viewpoint and depth streams. [sent-21, score-0.667]
6 In this work, a new natural action dataset is introduced, termed Hollywood3D (see figure 1); it builds on the spirit of the existing Hollywood datasets but includes 3D information. [sent-24, score-0.352]
7 Lighting variations are generally not expressed in depth data, and actor appearance differences are eliminated (although differences in body shape remain). [sent-26, score-0.324]
8 In this work, data is extracted from the latter commercially available sources, providing a number of advantages over self-captured data with a depth sensor [15, 5, 4]. [sent-30, score-0.361]
9 Additionally, active depth sensors are often unable to function in direct sunlight, severely limiting possible applications. [sent-34, score-0.322]
10 Finally, 3D information produced by active depth sensors tends to be much lower fidelity than that available commercially, and is limited in terms of operational range. [sent-35, score-0.322]
11 In addition to the release of a new dataset, which incorporates both original video and depth estimates, this paper provides baseline performance using both depth and appearance, and the software necessary to reproduce these results. [sent-36, score-0.634]
12 Previous work on action recognition has focused on the use of feature points, which can be sampled either densely or sparsely within the video. [sent-37, score-0.407]
13 In this work, the additional dimension z is employed, and we show how this depth information can be incorporated both at the descriptor level, and while detecting regions of interest, extending common Spatio-temporal Interest Point techniques. [sent-40, score-0.397]
14 The state of the art in natural 2D action recognition is first discussed in section 2, followed by section 3 covering the data extraction process, with details of the dataset. [sent-42, score-0.424]
15 Section 4 provides a general overview of the action recognition methodology employed. [sent-43, score-0.39]
16 Section 5 details the depth-aware spatiotemporal interest point detection schemes, followed by extensions for two state of the art feature descriptors in section 7. [sent-44, score-0.433]
17 Results are provided with different combinations of interest point and recognition schemes in sections 8. [sent-45, score-0.382]
18 Finally section 9 draws conclusions about the benefits of depth data in natural action tasks, and the relative merits of the presented approaches. [sent-46, score-0.633]
19 Related Work The majority of existing approaches to action recognition focus on collections of local feature descriptors. [sent-48, score-0.422]
20 These interest point detectors find salient image locations, for example using separable linear filters [7] or spatio-temporal Harris corners [13]. [sent-53, score-0.523]
21 Descriptors are generated around these interest points in a number of ways, including SIFT and SURF approaches [26, 22, 12], pixel gradients [7], Jet descriptors [21] or detection distributions and strengths [19, 9]. [sent-54, score-0.465]
22 By performing a separate scene classification stage, combined with prior knowledge of probable action contexts (for example the “Get Out Car” action is unlikely to occur indoors), recognition rates can be improved. [sent-58, score-0.742]
23 [25] demonstrated that dense sampling of features provides combined action and context information, and generally outperforms sparse interest points. [sent-60, score-0.562]
24 An example of this is assigning each frame to a state in a Hidden Markov Model (HMM), then determining the most probable action for the observed sequence of states [3]. [sent-64, score-0.382]
25 However, the subset that is useful for generating an action recognition dataset is still limited. [sent-69, score-0.39]
26 Depth data extracted from these films is less rich, lacking depth variations within objects, resembling a collection of cardboard cut-outs, and is fundamentally artificial, created for effect only. [sent-71, score-0.438]
27 These technologies produce 3D consumer content from real stereo cameras which can be used to reconstruct accurate 3D depth maps. [sent-75, score-0.328]
28 It contains over 650 manually labeled video clips across 13 action classes, plus a further 78 clips representing the “NoAction” class. [sent-77, score-0.549]
29 Most 3D films are too recent to have publicly available transcriptions, and subtitles alone rarely offer action cues, so automatic extraction techniques such as those employed by Marszalek et al. cannot be applied. [sent-78, score-0.541]
30 In addition to the action sequences, a collection of sequences containing no actions was also automatically extracted as negative data, while ensuring no overlap with positive classes. [sent-81, score-0.597]
31 If the right appearance stream is removed from the dataset, it is possible to simulate the input data that would be provided by hybrid sensors like the Kinect, albeit at a higher spatial resolution and a lower depth resolution. [sent-86, score-0.438]
32 Artifacts introduced by post-processing are not considered; however, it may be useful in future work to examine the behavior and consequences of such artifacts with regard to action recognition. [sent-87, score-0.352]
33 This means each action is tested on actors and settings not seen in the training data, emphasizing generalization. [sent-95, score-0.398]
34 Firstly, salient points are detected using a range of detection schemes which incorporate the depth information, as discussed in section 5. [sent-102, score-0.519]
35 Interest Point Detection The additional information present in the depth data may be exploited during interest point extraction, in order to detect more salient features, and discount irrelevant detections. [sent-106, score-0.607]
36 4D Harris Corners The Harris Corner [11] is a frequently used interest point detector, which was extended into the spatio-temporal domain by Laptev et al. [sent-111, score-0.308]
37 However, the combination of appearance and depth streams constitutes 3.5D data. [sent-120, score-0.382]
38 Instead, the relationship between the spatio-temporal gradients of the depth stream and those of the appearance stream is exploited. [sent-124, score-0.532]
39 Equation 2 employs the chain rule, where Ix, Iy, It are the intensity gradients along the spatial and temporal dimensions, and Dx, Dy, Dt are the gradients of the depth stream. [sent-125, score-0.496]
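To make this concrete, the following is a minimal sketch (not the authors' released code) of a 4D Harris-style response. The exact chain-rule combination in equation 2, the helper names, and the generalisation of the Harris corner criterion to a 4x4 matrix are all illustrative assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter

def harris_4d_response(I, D, kappa=0.04, sigma=1.5, eps=1e-6):
    # I, D: appearance and depth volumes, both shaped (rows, cols, frames).
    Iy, Ix, It = np.gradient(I.astype(np.float64))
    Dy, Dx, Dt = np.gradient(D.astype(np.float64))
    # Illustrative chain-rule estimate of the intensity gradient along z,
    # built from the appearance and depth stream gradients (an assumption,
    # not the exact form of equation 2); eps avoids division by zero.
    Iz = Ix / (Dx + eps) + Iy / (Dy + eps) + It / (Dt + eps)
    grads = [Ix, Iy, It, Iz]
    # Second-moment matrix per voxel, smoothed over a local window.
    mu = np.empty(I.shape + (4, 4))
    for a in range(4):
        for b in range(4):
            mu[..., a, b] = gaussian_filter(grads[a] * grads[b], sigma)
    det = np.linalg.det(mu)
    trace = np.trace(mu, axis1=-2, axis2=-1)
    # Harris criterion generalised to the 4x4 case.
    return det - kappa * trace ** 4

Interest points would then be taken as local maxima of this response that exceed the 4D Harris threshold.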
40 The effect of this threshold (and the threshold of each interest point detector) on recognition performance is examined in detail in section 8. [sent-129, score-0.377]
41 The detected interest points relate to areas with strong second order intensity derivatives, including both blobs and saddles. [sent-135, score-0.352]
42 As in the 4D Harris scheme, gradients along z are estimated using the relationships between the depth and intensity stream gradients. [sent-136, score-0.472]
43 The set of interest points F4D-He is calculated as the set of spatio-temporal locations for which the determinant of μ is greater than the threshold λ4D-He, as in equation 6. [sent-138, score-0.417]
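The determinant test of equation 6 might then be sketched as below; the chain-rule estimate of the z derivatives is again an illustrative assumption rather than the paper's exact formulation.

import numpy as np

def hessian_4d_interest_points(I, D, lam, eps=1e-6):
    # I, D: appearance and depth volumes, both shaped (rows, cols, frames).
    def z_estimate(gy, gx, gt, Dy, Dx, Dt):
        # Illustrative chain-rule estimate of a derivative along z.
        return gx / (Dx + eps) + gy / (Dy + eps) + gt / (Dt + eps)

    Iy, Ix, It = np.gradient(I.astype(np.float64))
    Dy, Dx, Dt = np.gradient(D.astype(np.float64))
    Iz = z_estimate(Iy, Ix, It, Dy, Dx, Dt)

    # Assemble a 4x4 Hessian per voxel from second derivatives of I,
    # filling the z column with the same estimate.
    H = np.empty(I.shape + (4, 4))
    for a, g in enumerate([Ix, Iy, It, Iz]):
        gy, gx, gt = np.gradient(g)
        gz = z_estimate(gy, gx, gt, Dy, Dx, Dt)
        H[..., a, :] = np.stack([gx, gy, gt, gz], axis=-1)

    det = np.linalg.det(H)
    # Keep locations whose absolute determinant exceeds the threshold,
    # capturing strong blob- and saddle-like structures.
    return np.argwhere(np.abs(det) > lam)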
44 3.5D In part, the Harris and Hessian interest point operators are motivated by the idea that object boundary points are highly salient, and that intensity gradients relate to boundaries. [sent-141, score-0.491]
45 However, depth data directly provides boundary information, rendering the estimation of the intensity gradient along z somewhat redundant. [sent-142, score-0.366]
46 This leads to a “3.5D” representation, using a pair of complementary 3D spatio-temporal volumes from the appearance and depth sequences. [sent-144, score-0.363]
47 Here, the first term and φ are equation 1 applied to the appearance and depth streams respectively, while υ and ω are the 3 by 3 Hessians. [sent-146, score-0.43]
48 The relative weighting of the appearance and depth information is controlled by α. [sent-147, score-0.324]
49 This approach exploits complementary information between the streams, to detect interest points where there are large intensity changes and/or large depth changes. [sent-148, score-0.641]
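A minimal sketch of this combination is given below. The weighting by α follows the description above, but the helper names and the use of the standard spatio-temporal Harris criterion on the blended matrix are assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter

def structure_matrix_3d(V, sigma=1.5):
    # 3x3 spatio-temporal second-moment matrix of a volume V (rows, cols, frames).
    gy, gx, gt = np.gradient(V.astype(np.float64))
    grads = [gx, gy, gt]
    M = np.empty(V.shape + (3, 3))
    for a in range(3):
        for b in range(3):
            M[..., a, b] = gaussian_filter(grads[a] * grads[b], sigma)
    return M

def harris_3p5d_response(I, D, alpha=1.0, kappa=0.04):
    # Complementary appearance and depth structure matrices, blended by alpha.
    M = structure_matrix_3d(I) + alpha * structure_matrix_3d(D)
    det = np.linalg.det(M)
    trace = np.trace(M, axis1=-2, axis2=-1)
    # Spatio-temporal Harris criterion; thresholding this response keeps points
    # with large intensity changes and/or large depth changes.
    return det - kappa * trace ** 3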
50 Applying the same 3.5D approach used for the Harris and Hessian detectors leads to equation 10, where I and D are the appearance and depth streams respectively. [sent-158, score-0.43]
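For the separable filter case, a comparable sketch is shown below, assuming a spatial Gaussian with a temporal quadrature (even/odd) filter pair on each stream and an additive, α-weighted combination of the two responses; the exact form of equation 10 may differ.

import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def separable_filter_response(V, sigma=2.0, tau=1.5, omega=0.6):
    # Spatial Gaussian smoothing followed by a temporal quadrature Gabor pair.
    t = np.arange(-int(3 * tau), int(3 * tau) + 1)
    h_ev = np.cos(2 * np.pi * omega * t) * np.exp(-t ** 2 / tau ** 2)
    h_od = np.sin(2 * np.pi * omega * t) * np.exp(-t ** 2 / tau ** 2)
    S = gaussian_filter(V.astype(np.float64), sigma=(sigma, sigma, 0))
    return (convolve1d(S, h_ev, axis=2) ** 2 +
            convolve1d(S, h_od, axis=2) ** 2)

def separable_3p5d_response(I, D, alpha=1.0):
    # Appearance and depth responses combined with the weighting alpha.
    return separable_filter_response(I) + alpha * separable_filter_response(D)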
51 The descriptors can be based on various types of information, including appearance, motion and saliency; however, depth information has rarely been utilized. [sent-164, score-0.433]
52 In the following sections we describe feature extraction approaches, based on the descriptors of two widely successful action recognition schemes, extended to make use of the additional information present in the Hollywood3D dataset. [sent-165, score-0.52]
53 Bag of Visual Words One of the most successful feature descriptors for action recognition is that of Laptev et al. [sent-168, score-0.471]
54 Descriptors are extracted only in salient regions (found through interest point detection) and are composed of a Histogram of Oriented Gradients (HOG) G, concatenated with a Histogram of Oriented Flow (HOF) F. [sent-170, score-0.326]
55 This provides a descriptor ρ of the visual appearance and local motion around the salient point at I(u, v, w). [sent-172, score-0.314]
56 Importantly, this descriptor is not dependent on the interest point detector, provided the HODG can be calculated from the depth stream D. [sent-180, score-0.76]
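To illustrate how the depth stream slots into this descriptor, the sketch below assembles ρ from HOG, HOF and a depth-based HODG around a detected point. The cell layout, bin count and helper names are assumptions for illustration, not the released implementation.

import numpy as np

def oriented_histogram(gx, gy, bins=8):
    # Orientation histogram weighted by gradient/flow magnitude, L1-normalised.
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    mag = np.hypot(gx, gy)
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
    return hist / (hist.sum() + 1e-9)

def descriptor_rho_4d(patch_I, patch_D, flow_u, flow_v, bins=8):
    # patch_I, patch_D: appearance and depth patches around the salient point;
    # flow_u, flow_v: precomputed optical flow components for the same patch.
    Gy, Gx = np.gradient(patch_I.astype(np.float64))
    Zy, Zx = np.gradient(patch_D.astype(np.float64))
    G = oriented_histogram(Gx, Gy, bins)          # HOG (appearance)
    F = oriented_histogram(flow_u, flow_v, bins)  # HOF (motion)
    H = oriented_histogram(Zx, Zy, bins)          # HODG (depth stream D)
    return np.concatenate([G, F, H])              # extended descriptor rho

In the usual bag-of-words pipeline these descriptors would be quantised against a learned codebook, with each clip represented as a histogram of visual word occurrences.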
57 The Relative Motion Descriptor (RMD) has also been shown to perform well across a large range of action recognition datasets, while making use of only the saliency information obtained during interest point detection. [sent-184, score-0.919]
58 An integral volume η is created, based on the interest point detections and their strengths. [sent-185, score-0.345]
59 The saliency content of a sub-cuboid with origin at (u, v, w) is defined in equation 13 as c(u, v, w), for a sub-cuboid of dimensions (û, v̂, ŵ). [sent-186, score-0.365]
60 The descriptor δ of the saliency distribution at a position (u, v, w) can then be formed by performing N comparisons of the content of two randomly offset spatio-temporal sub-cuboids, with origins at (u, v, w) + α_n and (u, v, w) + β_n (equation 14). [sent-188, score-0.464]
61 By extracting δ at every location in the sequence, a histogram may be constructed, which encodes the occurrences of relative saliency distributions within the sequence, without requiring appearance data or motion estimation. [sent-202, score-0.415]
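A small sketch of this mechanism follows; the cuboid size, the comparison rule and the bit-packing of δ are illustrative choices, and the released code should be treated as the reference implementation.

import numpy as np

def integral_volume(saliency):
    # Cumulative sums along x, y and t, so any sub-cuboid sum needs 8 lookups.
    eta = saliency.astype(np.float64)
    for axis in range(3):
        eta = np.cumsum(eta, axis=axis)
    return eta

def cuboid_sum(eta, origin, size):
    # Saliency content c(u, v, w) of a sub-cuboid, via 3D inclusion-exclusion.
    u, v, w = origin
    du, dv, dw = size
    def at(a, b, c):
        if a < 0 or b < 0 or c < 0:
            return 0.0
        a = min(a, eta.shape[0] - 1)
        b = min(b, eta.shape[1] - 1)
        c = min(c, eta.shape[2] - 1)
        return eta[a, b, c]
    return (at(u+du, v+dv, w+dw) - at(u-1, v+dv, w+dw) - at(u+du, v-1, w+dw)
            - at(u+du, v+dv, w-1) + at(u-1, v-1, w+dw) + at(u-1, v+dv, w-1)
            + at(u+du, v-1, w-1) - at(u-1, v-1, w-1))

def rmd_codeword(eta, origin, offsets_a, offsets_b, size=(8, 8, 8)):
    # N binary comparisons of two randomly offset sub-cuboids -> one codeword delta.
    bits, o = 0, np.asarray(origin)
    for n, (a, b) in enumerate(zip(offsets_a, offsets_b)):
        if cuboid_sum(eta, o + a, size) > cuboid_sum(eta, o + b, size):
            bits |= 1 << n
    return bits

Extracting such a codeword at every location and histogramming the results gives the sequence-level representation described above.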
62 We propose extending the standard RMD described above, by storing the saliency measurements within a 4D integral hyper-volume, so as to encode the behavior of the interest point distribution across the 3D scene, rather than within the image plane. [sent-214, score-0.605]
63 The 4D integral volume can be populated by extracting the depth measurements at each detected interest point. [sent-215, score-0.636]
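A hedged sketch of that construction: detection strengths are accumulated into a quantised depth bin as well as the (x, y, t) location before taking cumulative sums, so that sub-hypercuboid comparisons also reflect where detections lie in the 3D scene. The binning scheme and bin count below are assumptions.

import numpy as np

def integral_hypervolume_4d(detections, shape, depth_range, n_depth_bins=8):
    # detections: iterable of (row, col, frame, depth, strength) tuples, i.e.
    # interest point locations with the depth measured at each detection.
    # shape: (rows, cols, frames) of the sequence.
    vol = np.zeros(tuple(shape) + (n_depth_bins,), dtype=np.float64)
    d_min, d_max = depth_range
    for r, c, f, d, s in detections:
        b = int(np.clip((d - d_min) / (d_max - d_min) * n_depth_bins,
                        0, n_depth_bins - 1))
        vol[r, c, f, b] += s
    for axis in range(4):  # cumulative sums along all four axes
        vol = np.cumsum(vol, axis=axis)
    return vol

Sub-hypercuboid sums then follow the same inclusion-exclusion pattern as the 3D case, with 16 lookups instead of 8.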
64 As with the original RMD, the descriptor can be applied in conjunction with any interest point detector. [sent-218, score-0.375]
65 As with the 4D Bag of Words approach, these features are not restricted to the extended interest point detectors described in section 5, and work equally well with standard spatio-temporal interest points, provided that a depth video is available during descriptor extraction. [sent-219, score-0.98]
66 The source code for the three novel interest point detection algorithms and the two extended action recognition techniques is available, to allow reproduction of these results. [sent-223, score-0.308]
67 Interest Point Analysis First we examine the benefits of including depth information during interest point detection. [sent-228, score-0.54]
68 For this comparison, a single fixed descriptor is used for classification, in conjunction with the traditional spatio-temporal interest points (Separable Filters 3D-S and Harris Corners 3D-Ha), which are compared against the proposed depth aware schemes. [sent-232, score-0.677]
69 This is also reflected in the depth aware schemes, and is unsurprising, as separable filters were designed primarily for computational speed. [sent-236, score-0.561]
70 Hessian based interest points prove less informative than the extended Harris operators in both the 4D and 3.5D schemes. [sent-237, score-0.413]
71 Interestingly, certain actions consistently perform better when described by depth aware interest points. [sent-242, score-0.794]
72 These are actions such as Kiss, Hug, Drive and Run where there is an informative foreground object, which depth aware interest points are better able to pick out. [sent-243, score-0.849]
73 In contrast, actions such as Swim, Dance and Shoot are often performed against a similar depth background, or within a group of people, and the inclusion of depth in the saliency measure is less valuable. [sent-244, score-1.036]
74 This suggests that a combination of standard spatio-temporal and depth aware schemes may prove valuable. [sent-250, score-0.452]
75 The complexity of the depth aware interest point detectors remains of the same order as their spatio-temporal counterparts (linear with respect to u, v and w). [sent-251, score-0.707]
76 Descriptor Analysis Next, the use of depth information at the feature level was explored, including its interaction with the depth aware saliency measures. [sent-256, score-0.963]
77 The previously noted relationship between saliency measures appears to hold regardless of the feature descriptor used. [sent-259, score-0.386]
78 The 3.5D scheme proves to be the most effective way to incorporate depth information. [sent-263, score-0.321]
79 This is unsurprising, as the RMD relies only on interest point detections, without the inclusion of any visual or motion information. [sent-267, score-0.375]
80 It may have been reasonable to guess that including structural features would prove more valuable with a standard saliency measure, as the depth information had not previously been exploited. [sent-268, score-0.657]
81 In fact, the opposite proves to be true: 4D features provide more modest gains for 3D-S and 3D-Ha (up to 20%) than they do when combined with extended saliency measures (up to 45%). [sent-269, score-0.319]
82 This demonstrates that depth aware saliency measures are capable of focusing computation into regions where structural features are particularly valuable. [sent-270, score-0.717]
83 The complexity of the RMD-4D is greater than the standard RMD (being linear in the range of depth values, as well as in u, v and w). [sent-271, score-0.314]
84 However, the increased feature vector length does lead to an increased computational cost. Table 2: Average precision per class, on the 3D action dataset, for a range of interest point detectors, including simple spatio-temporal interest points and depth aware schemes. [sent-274, score-1.335]
85 Classes are shown in bold when depth aware interest points outperform both 3D schemes. [sent-276, score-0.677]
86 Table 3: Correct Classification rate and Average Precision for each combination of descriptor and saliency measure. [sent-277, score-0.386]
87 Interest Point Threshold Results Different interest point operators produce very different response strengths, meaning the optimal threshold for extracting salient points varies. [sent-282, score-0.508]
88 In general, an arbitrary threshold is selected; indeed, the experiments in the previous sections employed saliency thresholds based on those suggested in previous literature. [sent-283, score-0.35]
89 In figure 2, the relationship between the saliency threshold and action recognition performance is contrasted for the 4D and 3.5D schemes. [sent-284, score-0.7]
90 Regardless of the saliency measure, the standard feature descriptors and their depth aware extensions follow the same trend. [sent-286, score-0.827]
91 In contrast, bag of words approaches provide greater accuracy for lower saliency thresholds. [sent-290, score-0.475]
92 This makes sense, as a weak interest point relates to a single histogram entry under the bag of words scheme. [sent-292, score-0.466]
93 In contrast, poor interest points will affect the RMD descriptor of all surrounding locations. [sent-293, score-0.381]
94 Conclusions In this paper, we propose and make available a large corpus of 3D data to the community, for the comparison of action recognition techniques in natural environments. [sent-297, score-0.39]
95 Figure 2: Average Precision on the Hollywood 3D action recognition dataset, for various saliency thresholds, with the 3.5D and 4D Harris and Hessian detectors. [sent-301, score-0.66]
96 It has been shown that 3D information provides valuable cues to improve action recognition. [sent-303, score-0.383]
97 A variety of new interest point detection algorithms, incorporating depth data, have been shown to improve action recognition rates, doubling performance in some cases, even using standard features. [sent-304, score-0.93]
98 Human daily action analysis with multi-view and color-depth data. [sent-335, score-0.352]
99 Capturing the relative distribution of features for action recognition. [sent-425, score-0.352]
100 A 3-dimensional sift descriptor and its application to action recognition. [sent-445, score-0.468]
wordName wordTfidf (topN-words)
[('action', 0.352), ('rmd', 0.3), ('depth', 0.281), ('harris', 0.27), ('saliency', 0.27), ('interest', 0.21), ('actions', 0.172), ('hollywood', 0.169), ('films', 0.157), ('hessian', 0.14), ('aware', 0.131), ('bag', 0.117), ('descriptor', 0.116), ('laptev', 0.11), ('surrey', 0.106), ('separable', 0.095), ('schemes', 0.085), ('clips', 0.084), ('hodg', 0.082), ('oshin', 0.082), ('descriptors', 0.081), ('commercially', 0.08), ('stream', 0.073), ('salient', 0.067), ('gilbert', 0.063), ('gradients', 0.062), ('operators', 0.059), ('streams', 0.058), ('strengths', 0.057), ('willems', 0.056), ('broadcast', 0.056), ('intensity', 0.056), ('points', 0.055), ('words', 0.055), ('hadfie', 0.055), ('hol', 0.055), ('noaction', 0.055), ('filters', 0.054), ('det', 0.053), ('marszalek', 0.05), ('point', 0.049), ('extended', 0.049), ('equation', 0.048), ('klaser', 0.048), ('content', 0.047), ('actors', 0.046), ('integral', 0.045), ('recognise', 0.045), ('unsurprising', 0.045), ('hev', 0.045), ('offsets', 0.044), ('dollar', 0.044), ('appearance', 0.043), ('reproduce', 0.043), ('kiss', 0.042), ('hod', 0.042), ('corners', 0.042), ('volume', 0.041), ('sensors', 0.041), ('commercial', 0.041), ('threshold', 0.04), ('hug', 0.04), ('prove', 0.04), ('motion', 0.039), ('ensuring', 0.039), ('complimentary', 0.039), ('tuytelaars', 0.038), ('recognition', 0.038), ('worst', 0.036), ('emergence', 0.036), ('increased', 0.036), ('detectors', 0.036), ('temporal', 0.035), ('structural', 0.035), ('histogram', 0.035), ('art', 0.034), ('sequences', 0.034), ('greater', 0.033), ('rarely', 0.032), ('uk', 0.032), ('collections', 0.032), ('inclusion', 0.032), ('valuable', 0.031), ('detected', 0.031), ('drive', 0.031), ('storing', 0.031), ('calculated', 0.031), ('comparisons', 0.031), ('iz', 0.03), ('movies', 0.03), ('histograms', 0.03), ('codebook', 0.03), ('precision', 0.03), ('sequence', 0.03), ('spatiotemporal', 0.03), ('somewhat', 0.029), ('video', 0.029), ('extensions', 0.029), ('extracting', 0.028), ('tests', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes
Author: Simon Hadfield, Richard Bowden
Abstract: Action recognition in unconstrained situations is a difficult task, suffering from massive intra-class variations. It is made even more challenging when complex 3D actions are projected down to the image plane, losing a great deal of information. The recent emergence of 3D data, both in broadcast content, and commercial depth sensors, provides the possibility to overcome this issue. This paper presents a new dataset, for benchmarking action recognition algorithms in natural environments, while making use of 3D information. The dataset contains around 650 video clips, across 14 classes. In addition, two state of the art action recognition algorithms are extended to make use of the 3D data, and five new interest point detection strategies are also proposed, that extend to the 3D data. Our evaluation compares all 4 feature descriptors, using 7 different types of interest point, over a variety of threshold levels, for the Hollywood3D dataset. We make the dataset including stereo video, estimated depth maps and all code required to reproduce the benchmark results, available to the wider community.
2 0.31090084 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
Author: Feng Shi, Emil Petriu, Robert Laganière
Abstract: Local spatio-temporal features and bag-of-features representations have become popular for action recognition. A recent trend is to use dense sampling for better performance. While many methods claimed to use dense feature sets, most of them are just denser than approaches based on sparse interest point detectors. In this paper, we explore sampling with high density on action recognition. We also investigate the impact of random sampling over dense grid for computational efficiency. We present a real-time action recognition system which integrates fast random sampling method with local spatio-temporal features extracted from a Local Part Model. A new method based on histogram intersection kernel is proposed to combine multiple channels of different descriptors. Our technique shows high accuracy on the simple KTH dataset, and achieves state-of-the-art on two very challenging real-world datasets, namely, 93% on KTH, 83.3% on UCF50 and 47.6% on HMDB51.
3 0.29157731 287 cvpr-2013-Modeling Actions through State Changes
Author: Alireza Fathi, James M. Rehg
Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.
Author: Tsz-Ho Yu, Tae-Kyun Kim, Roberto Cipolla
Abstract: This work addresses the challenging problem of unconstrained 3D human pose estimation (HPE) from a novel perspective. Existing approaches struggle to operate in realistic applications, mainly due to their scene-dependent priors, such as background segmentation and multi-camera network, which restrict their use in unconstrained environments. We therefore present a framework which applies action detection and 2D pose estimation techniques to infer 3D poses in an unconstrained video. Action detection offers spatiotemporal priors to 3D human pose estimation by both recognising and localising actions in space-time. Instead of holistic features, e.g. silhouettes, we leverage the flexibility of deformable part model to detect 2D body parts as a feature to estimate 3D poses. A new unconstrained pose dataset has been collected to justify the feasibility of our method, which demonstrated promising results, significantly outperforming the relevant state-of-the-arts.
5 0.24598558 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
Author: Omar Oreifej, Zicheng Liu
Abstract: We present a new descriptor for activity recognition from videos acquired by a depth sensor. Previous descriptors mostly compute shape and motion features independently; thus, they often fail to capture the complex joint shape-motion cues at pixel-level. In contrast, we describe the depth sequence using a histogram capturing the distribution of the surface normal orientation in the 4D space of time, depth, and spatial coordinates. To build the histogram, we create 4D projectors, which quantize the 4D space and represent the possible directions for the 4D normal. We initialize the projectors using the vertices of a regular polychoron. Consequently, we refine the projectors using a discriminative density measure, such that additional projectors are induced in the directions where the 4D normals are more dense and discriminative. Through extensive experiments, we demonstrate that our descriptor better captures the joint shape-motion cues in the depth sequence, and thus outperforms the state-of-the-art on all relevant benchmarks.
6 0.24431767 40 cvpr-2013-An Approach to Pose-Based Action Recognition
7 0.24234237 376 cvpr-2013-Salient Object Detection: A Discriminative Regional Feature Integration Approach
9 0.23912385 375 cvpr-2013-Saliency Detection via Graph-Based Manifold Ranking
10 0.23739956 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
11 0.23665833 202 cvpr-2013-Hierarchical Saliency Detection
12 0.23639564 245 cvpr-2013-Layer Depth Denoising and Completion for Structured-Light RGB-D Cameras
13 0.23448415 273 cvpr-2013-Looking Beyond the Image: Unsupervised Learning for Object Saliency and Detection
14 0.22769313 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
15 0.2222161 149 cvpr-2013-Evaluation of Color STIPs for Human Action Recognition
16 0.22068462 374 cvpr-2013-Saliency Aggregation: A Data-Driven Approach
17 0.21834908 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
18 0.21742781 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)
19 0.20310767 407 cvpr-2013-Spatio-temporal Depth Cuboid Similarity Feature for Activity Recognition Using Depth Camera
20 0.19882995 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
topicId topicWeight
[(0, 0.3), (1, -0.063), (2, 0.237), (3, -0.024), (4, -0.399), (5, -0.061), (6, -0.118), (7, 0.068), (8, -0.035), (9, -0.076), (10, -0.067), (11, -0.062), (12, 0.002), (13, 0.118), (14, -0.011), (15, -0.039), (16, -0.132), (17, -0.015), (18, 0.051), (19, 0.109), (20, 0.053), (21, -0.078), (22, -0.014), (23, 0.094), (24, 0.058), (25, 0.048), (26, -0.018), (27, 0.042), (28, -0.037), (29, -0.047), (30, -0.005), (31, 0.067), (32, 0.044), (33, -0.007), (34, -0.025), (35, -0.018), (36, 0.005), (37, 0.036), (38, -0.02), (39, 0.043), (40, 0.007), (41, -0.016), (42, -0.069), (43, -0.005), (44, -0.068), (45, -0.072), (46, -0.033), (47, 0.026), (48, 0.032), (49, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.95663047 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes
Author: Simon Hadfield, Richard Bowden
Abstract: Action recognition in unconstrained situations is a difficult task, suffering from massive intra-class variations. It is made even more challenging when complex 3D actions are projected down to the image plane, losing a great deal of information. The recent emergence of 3D data, both in broadcast content, and commercial depth sensors, provides the possibility to overcome this issue. This paper presents a new dataset, for benchmarking action recognition algorithms in natural environments, while making use of 3D information. The dataset contains around 650 video clips, across 14 classes. In addition, two state of the art action recognition algorithms are extended to make use of the 3D data, and five new interest point detection strategies are also proposed, that extend to the 3D data. Our evaluation compares all 4 feature descriptors, using 7 different types of interest point, over a variety of threshold levels, for the Hollywood3D dataset. We make the dataset including stereo video, estimated depth maps and all code required to reproduce the benchmark results, available to the wider community.
2 0.72956926 407 cvpr-2013-Spatio-temporal Depth Cuboid Similarity Feature for Activity Recognition Using Depth Camera
Author: Lu Xia, J.K. Aggarwal
Abstract: Local spatio-temporal interest points (STIPs) and the resulting features from RGB videos have been proven successful at activity recognition that can handle cluttered backgrounds and partial occlusions. In this paper, we propose its counterpart in depth video and show its efficacy on activity recognition. We present a filtering method to extract STIPs from depth videos (called DSTIP) that effectively suppresses the noisy measurements. Further, we build a novel depth cuboid similarity feature (DCSF) to describe the local 3D depth cuboid around the DSTIPs with an adaptable supporting size. We test this feature on activity recognition application using the public MSRAction3D, MSRDailyActivity3D datasets and our own dataset. Experimental evaluation shows that the proposed approach outperforms state-of-the-art activity recognition algorithms on depth videos, and the framework is more widely applicable than existing approaches. We also give detailed comparisons with other features and analysis of choice of parameters as a guidance for applications.
3 0.72798187 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
Author: Omar Oreifej, Zicheng Liu
Abstract: We present a new descriptor for activity recognition from videos acquired by a depth sensor. Previous descriptors mostly compute shape and motion features independently; thus, they often fail to capture the complex joint shape-motion cues at pixel-level. In contrast, we describe the depth sequence using a histogram capturing the distribution of the surface normal orientation in the 4D space of time, depth, and spatial coordinates. To build the histogram, we create 4D projectors, which quantize the 4D space and represent the possible directions for the 4D normal. We initialize the projectors using the vertices of a regular polychoron. Consequently, we refine the projectors using a discriminative density measure, such that additional projectors are induced in the directions where the 4D normals are more dense and discriminative. Through extensive experiments, we demonstrate that our descriptor better captures the joint shape-motion cues in the depth sequence, and thus outperforms the state-of-the-art on all relevant benchmarks.
4 0.72228271 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition
Author: LiMin Wang, Yu Qiao, Xiaoou Tang
Abstract: This paper proposes motionlet, a mid-level and spatiotemporal part, for human motion recognition. Motionlet can be seen as a tight cluster in motion and appearance space, corresponding to the moving process of different body parts. We postulate three key properties of motionlet for action recognition: high motion saliency, multiple scale representation, and representative-discriminative ability. Towards this goal, we develop a data-driven approach to learn motionlets from training videos. First, we extract 3D regions with high motion saliency. Then we cluster these regions and preserve the centers as candidate templates for motionlet. Finally, we examine the representative and discriminative power of the candidates, and introduce a greedy method to select effective candidates. With motionlets, we present a mid-level representation for video, called motionlet activation vector. We conduct experiments on three datasets, KTH, HMDB51, and UCF50. The results show that the proposed methods significantly outperform state-of-the-art methods.
5 0.71542472 287 cvpr-2013-Modeling Actions through State Changes
Author: Alireza Fathi, James M. Rehg
Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.
6 0.69731522 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
7 0.68959898 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
8 0.68625814 149 cvpr-2013-Evaluation of Color STIPs for Human Action Recognition
9 0.65661895 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)
10 0.65185416 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
11 0.63777131 3 cvpr-2013-3D R Transform on Spatio-temporal Interest Points for Action Recognition
12 0.56795394 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition
13 0.55653334 40 cvpr-2013-An Approach to Pose-Based Action Recognition
14 0.55591464 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path
15 0.5541684 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
16 0.54631501 322 cvpr-2013-PISA: Pixelwise Image Saliency by Aggregating Complementary Appearance Contrast Measures with Spatial Priors
17 0.53874838 376 cvpr-2013-Salient Object Detection: A Discriminative Regional Feature Integration Approach
18 0.52436078 258 cvpr-2013-Learning Video Saliency from Human Gaze Using Candidate Selection
19 0.52336967 444 cvpr-2013-Unconstrained Monocular 3D Human Pose Estimation by Action Detection and Cross-Modality Regression Forest
20 0.52158731 202 cvpr-2013-Hierarchical Saliency Detection
topicId topicWeight
[(10, 0.141), (16, 0.022), (26, 0.052), (33, 0.299), (64, 0.158), (67, 0.095), (69, 0.049), (80, 0.011), (87, 0.075)]
simIndex simValue paperId paperTitle
1 0.9225108 382 cvpr-2013-Scene Text Recognition Using Part-Based Tree-Structured Character Detection
Author: Cunzhao Shi, Chunheng Wang, Baihua Xiao, Yang Zhang, Song Gao, Zhong Zhang
Abstract: Scene text recognition has inspired great interests from the computer vision community in recent years. In this paper, we propose a novel scene text recognition method using part-based tree-structured character detection. Different from conventional multi-scale sliding window character detection strategy, which does not make use of the character-specific structure information, we use part-based tree-structure to model each type of character so as to detect and recognize the characters at the same time. While for word recognition, we build a Conditional Random Field model on the potential character locations to incorporate the detection scores, spatial constraints and linguistic knowledge into one framework. The final word recognition result is obtained by minimizing the cost function defined on the random field. Experimental results on a range of challenging public datasets (ICDAR 2003, ICDAR 2011, SVT) demonstrate that the proposed method outperforms stateof-the-art methods significantly bothfor character detection and word recognition.
same-paper 2 0.92099643 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes
Author: Simon Hadfield, Richard Bowden
Abstract: Action recognition in unconstrained situations is a difficult task, suffering from massive intra-class variations. It is made even more challenging when complex 3D actions are projected down to the image plane, losing a great deal of information. The recent emergence of 3D data, both in broadcast content, and commercial depth sensors, provides the possibility to overcome this issue. This paper presents a new dataset, for benchmarking action recognition algorithms in natural environments, while making use of 3D information. The dataset contains around 650 video clips, across 14 classes. In addition, two state of the art action recognition algorithms are extended to make use of the 3D data, and five new interest point detection strategies are also proposed, that extend to the 3D data. Our evaluation compares all 4 feature descriptors, using 7 different types of interest point, over a variety of threshold levels, for the Hollywood3D dataset. We make the dataset including stereo video, estimated depth maps and all code required to reproduce the benchmark results, available to the wider community.
3 0.91173297 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
Author: Ian Endres, Kevin J. Shih, Johnston Jiaa, Derek Hoiem
Abstract: We propose a method to learn a diverse collection of discriminative parts from object bounding box annotations. Part detectors can be trained and applied individually, which simplifies learning and extension to new features or categories. We apply the parts to object category detection, pooling part detections within bottom-up proposed regions and using a boosted classifier with proposed sigmoid weak learners for scoring. On PASCAL VOC 2010, we evaluate the part detectors ’ ability to discriminate and localize annotated keypoints. Our detection system is competitive with the best-existing systems, outperforming other HOG-based detectors on the more deformable categories.
4 0.90767419 414 cvpr-2013-Structure Preserving Object Tracking
Author: Lu Zhang, Laurens van_der_Maaten
Abstract: Model-free trackers can track arbitrary objects based on a single (bounding-box) annotation of the object. Whilst the performance of model-free trackers has recently improved significantly, simultaneously tracking multiple objects with similar appearance remains very hard. In this paper, we propose a new multi-object model-free tracker (based on tracking-by-detection) that resolves this problem by incorporating spatial constraints between the objects. The spatial constraints are learned along with the object detectors using an online structured SVM algorithm. The experimental evaluation ofour structure-preserving object tracker (SPOT) reveals significant performance improvements in multi-object tracking. We also show that SPOT can improve the performance of single-object trackers by simultaneously tracking different parts of the object.
5 0.90610051 262 cvpr-2013-Learning for Structured Prediction Using Approximate Subgradient Descent with Working Sets
Author: Aurélien Lucchi, Yunpeng Li, Pascal Fua
Abstract: We propose a working set based approximate subgradient descent algorithm to minimize the margin-sensitive hinge loss arising from the soft constraints in max-margin learning frameworks, such as the structured SVM. We focus on the setting of general graphical models, such as loopy MRFs and CRFs commonly used in image segmentation, where exact inference is intractable and the most violated constraints can only be approximated, voiding the optimality guarantees of the structured SVM’s cutting plane algorithm as well as reducing the robustness of existing subgradient based methods. We show that the proposed method obtains better approximate subgradients through the use of working sets, leading to improved convergence properties and increased reliability. Furthermore, our method allows new constraints to be randomly sampled instead of computed using the more expensive approximate inference techniques such as belief propagation and graph cuts, which can be used to reduce learning time at only a small cost of performance. We demonstrate the strength of our method empirically on the segmentation of a new publicly available electron microscopy dataset as well as the popular MSRC data set and show state-of-the-art results.
6 0.9060871 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
7 0.90580392 325 cvpr-2013-Part Discovery from Partial Correspondence
8 0.90578836 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
9 0.90362042 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
10 0.90298164 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
11 0.90297413 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image
12 0.90193695 60 cvpr-2013-Beyond Physical Connections: Tree Models in Human Pose Estimation
13 0.90176976 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking
14 0.90146178 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
15 0.90140349 314 cvpr-2013-Online Object Tracking: A Benchmark
16 0.90117615 277 cvpr-2013-MODEC: Multimodal Decomposable Models for Human Pose Estimation
17 0.90112782 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence
18 0.90017587 206 cvpr-2013-Human Pose Estimation Using Body Parts Dependent Joint Regressors
19 0.90006661 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation