Author: Simon Hadfield, Richard Bowden
Abstract: Action recognition in unconstrained situations is a difficult task, suffering from massive intra-class variations. It is made even more challenging when complex 3D actions are projected down to the image plane, losing a great deal of information. The recent emergence of 3D data, both in broadcast content, and commercial depth sensors, provides the possibility to overcome this issue. This paper presents a new dataset, for benchmarking action recognition algorithms in natural environments, while making use of 3D information. The dataset contains around 650 video clips, across 14 classes. In addition, two state of the art action recognition algorithms are extended to make use of the 3D data, and five new interest point detection strategies are also proposed, that extend to the 3D data. Our evaluation compares all 4 feature descriptors, using 7 different types of interest point, over a variety of threshold levels, for the Hollywood3D dataset. We make the dataset including stereo video, estimated depth maps and all code required to reproduce the benchmark results, available to the wider community.
1 The recent emergence of 3D data, both in broadcast content, and commercial depth sensors, provides the possibility to overcome this issue. [sent-7, score-0.414]
2 This paper presents a new dataset, for benchmarking action recognition algorithms in natural environments, while making use of 3D information. [sent-8, score-0.39]
3 In addition, two state of the art action recognition algorithms are extended to make use of the 3D data, and five new interest point detection strategies are also proposed, that extend to the 3D data. [sent-10, score-0.473]
4 We make the dataset including stereo video, estimated depth maps and all code required to reproduce the benchmark results, available to the wider community. [sent-12, score-0.324]
5 Figure 1: Example frames of various action sequences from the dataset, showing the left viewpoint and depth streams. [sent-21, score-0.667]
6 In this work, a new natural action dataset is introduced, termed Hollywood3D (see figure 1). It builds on the spirit of the existing Hollywood datasets but includes 3D information. [sent-24, score-0.352]
7 Lighting variations are generally not expressed in depth data, and actor appearance differences are eliminated (although differences in body shape remain). [sent-26, score-0.324]
8 In this work, data is extracted from the latter commercially available sources, providing a number of advantages over self-captured data with a depth sensor [15, 5, 4]. [sent-30, score-0.361]
9 Additionally, active depth sensors are often unable to function in direct sunlight, severely limiting possible applications. [sent-34, score-0.322]
10 Finally, 3D information produced by active depth sensors tends to be much lower fidelity than that available commercially, and is limited in terms of operational range. [sent-35, score-0.322]
11 In addition to the release of a new dataset, which incorporates both original video and depth estimates, this paper provides baseline performance using both depth and appearance, and the software necessary to reproduce these results. [sent-36, score-0.634]
12 Previous work on action recognition, has focused on the use of feature points which can either be sampled densely or sparsely within the video. [sent-37, score-0.407]
13 In this work, the additional dimension z is employed, and we show how this depth information can be incorporated both at the descriptor level, and while detecting regions of interest, extending common Spatio-temporal Interest Point techniques. [sent-40, score-0.397]
14 The state of the art in natural 2D action recognition is first discussed in section 2, followed by section 3 covering the data extraction process, with details of the dataset. [sent-42, score-0.424]
15 Section 4 provides a general overview of the action recognition methodology employed. [sent-43, score-0.39]
16 Section 5 details the depth-aware spatiotemporal interest point detection schemes, followed by extensions for two state of the art feature descriptors in section 7. [sent-44, score-0.433]
17 Results are provided with different combinations of interest point and recognition schemes in section 8. [sent-45, score-0.382]
18 Finally section 9 draws conclusions about the benefits of depth data in natural action tasks, and the relative merits of the presented approaches. [sent-46, score-0.633]
19 Related Work The majority of existing approaches to action recognition focus on collections of local feature descriptors. [sent-48, score-0.422]
20 These interest points detect salient image locations, for example using separable linear filters [7] or spatio-temporal Harris corners [13]. [sent-53, score-0.523]
21 Descriptors are generated around these interest points in a number of ways, including SIFT and SURF approaches [26, 22, 12], pixel gradients [7], Jet descriptors [21] or detection distributions and strengths [19, 9]. [sent-54, score-0.465]
22 By performing a separate scene classification stage, combined with prior knowledge of probable action contexts (for example the “Get Out Car” action is unlikely to occur indoors) recognition rates can be improved. [sent-58, score-0.742]
23 [25] demonstrated that dense sampling of features provides combined action and context information, and generally outperforms sparse interest points. [sent-60, score-0.562]
24 An example of this is assigning each frame to a state in a Hidden Markov Model (HMM), then determining the most probable action for the observed sequence of states [3]. [sent-64, score-0.382]
25 However, the subset that is useful for generating an action recognition dataset is still limited. [sent-69, score-0.39]
26 Depth data extracted from these films is less rich, lacking depth variations within objects, resembling a collection of cardboard cut-outs, and is fundamentally artificial, created for effect only. [sent-71, score-0.438]
27 These technologies produce 3D consumer content from real stereo cameras which can be used to reconstruct accurate 3D depth maps. [sent-75, score-0.328]
28 It contains over 650 manually labeled video clips across 13 action classes, plus a further 78 clips representing the “NoAction” class. [sent-77, score-0.549]
29 Most 3D films are too recent to have publicly available transcriptions, and subtitles alone rarely offer action cues, so automatic extraction techniques such as those employed by Marszalek et al. [sent-78, score-0.541]
30 In addition to the action sequences, a collection of sequences containing no actions was also automatically extracted as negative data, while ensuring no overlap with positive classes. [sent-81, score-0.597]
31 If the right appearance stream is removed from the dataset, it is possible to simulate the input data that would be provided by hybrid sensors like the Kinect, albeit at a higher spatial, and lower depth resolution. [sent-86, score-0.438]
32 Artifacts introduced by post processing are not considered, however it may be useful in future work to examine the behavior and consequences of such artifacts, with regards to action recognition. [sent-87, score-0.352]
33 This means each action is tested on actors and settings not seen in the training data, emphasizing generalization. [sent-95, score-0.398]
34 First, salient points are detected using a range of detection schemes which incorporate the depth information, as discussed in section 5. [sent-102, score-0.519]
35 Interest Point Detection The additional information present in the depth data may be exploited during interest point extraction, in order to detect more salient features, and discount irrelevant detections. [sent-106, score-0.607]
36 4D Harris Corners The Harris Corner [11] is a frequently used interest point detector, which was extended into the spatio-temporal domain by Laptev et al. [sent-111, score-0.308]
37 However, the combination of appearance and depth streams constitutes 3. [sent-120, score-0.382]
38 Instead, the relationship between the spatio-temporal gradients of the depth stream and those of the appearance stream are exploited. [sent-124, score-0.532]
39 Equation 2 employs the chain rule, where Ix , Iy, It are intensity gradients along the spatial and temporal dimensions and Dx , Dy , Dt are the gradients of the depth stream. [sent-125, score-0.496]
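As a rough illustration of the chain-rule idea, the sketch below estimates an intensity gradient along z at a single location from the appearance and depth gradients. Equation 2 itself is not reproduced in this text, so the sum-of-ratios form used here, the function name `estimate_iz` and the `eps` guard against near-zero depth gradients are all assumptions, not the authors' implementation.

```python
def estimate_iz(ix, iy, it, dx, dy, dt, eps=1e-6):
    """Chain-rule estimate of dI/dz from intensity and depth gradients.

    Assumed form: dI/dz = Ix/Dx + Iy/Dy + It/Dt, since each depth
    gradient (e.g. Dx = dz/dx) is the reciprocal of dx/dz.
    """
    total = 0.0
    for i_grad, d_grad in ((ix, dx), (iy, dy), (it, dt)):
        # Assumed convention: skip axes along which depth is locally
        # constant, to avoid dividing by (near) zero.
        if abs(d_grad) > eps:
            total += i_grad / d_grad
    return total
```

For example, `estimate_iz(2.0, 0.0, 1.0, 4.0, 1.0, 0.0)` uses only the x and y terms, because the temporal depth gradient is zero.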
40 The effect of this threshold (and the threshold of each interest point detector) on recognition performance, is examined in detail in section 8. [sent-129, score-0.377]
41 The detected interest points relate to areas with strong second order intensity derivatives, including both blobs and saddles. [sent-135, score-0.352]
42 As in the 4D Harris scheme, gradients along z are estimated using the relationships between the depth and intensity stream gradients. [sent-136, score-0.472]
43 The set of interest points F4D-He is calculated as the set of spatio-temporal locations, for which the determinant of μ is greater than the threshold λ4D-He as in equation 6. [sent-138, score-0.417]
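The thresholding step can be sketched as follows, assuming the matrix μ has already been computed at each spatio-temporal location. The dictionary layout, the function names and the use of Gaussian elimination for the determinant are illustrative choices, not details taken from the paper.

```python
def determinant(m):
    """Determinant of a square matrix (list of rows) via Gaussian elimination."""
    a = [[float(x) for x in row] for row in m]
    n = len(a)
    det = 1.0
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def select_interest_points(mu_by_location, threshold):
    """Keep the locations (u, v, w) whose matrix mu has det(mu) > threshold."""
    return {loc for loc, mu in mu_by_location.items()
            if determinant(mu) > threshold}
```

Raising `threshold` keeps fewer, stronger detections, which is the trade-off the later threshold experiments explore.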
44 5D In part, the Harris and Hessian interest point operators are motivated by the idea that object boundary points are highly salient, and that intensity gradients relate to boundaries. [sent-141, score-0.491]
45 However, depth data directly provides boundary information, rendering the estimation of the intensity gradient along z somewhat redundant. [sent-142, score-0.366]
46 5D” representation, using a pair of complementary 3D spatio-temporal volumes, from the appearance and depth sequences. [sent-144, score-0.363]
47 Where and φ are equation 1 applied to the appearance and depth streams respectively, while υ and ω are the 3 by 3 Hessians. [sent-146, score-0.43]
48 The relative weighting of the appearance and depth information, is controlled by α. [sent-147, score-0.324]
49 This approach exploits complementary information between the streams, to detect interest points where there are large intensity changes and/or large depth changes. [sent-148, score-0.641]
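A minimal sketch of this weighting, assuming the detector responses for the appearance and depth streams have been computed separately and are combined additively before thresholding. The additive form, the function names and the dictionary inputs are assumptions made for illustration; the exact equations are not reproduced in this text.

```python
def combined_response(app_response, depth_response, alpha):
    """Weighted sum of appearance and depth responses (assumed additive form)."""
    return app_response + alpha * depth_response


def detect_35d(app, depth, alpha, threshold):
    """Return locations whose combined response exceeds the threshold.

    app, depth: dicts mapping (u, v, w) -> detector response for that stream.
    """
    points = set()
    for loc in app.keys() & depth.keys():
        if combined_response(app[loc], depth[loc], alpha) > threshold:
            points.add(loc)
    return points
```

With this form, a location can be selected because of a strong intensity change, a strong depth change, or a moderate response in both, and `alpha = 0` reduces to the appearance-only detector.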
50 5D approach used for the Harris and Hessian detectors, leads to equation 10, where I and D are the appearance and depth streams respectively. [sent-158, score-0.43]
51 The descriptors can be based on various types of information, including appearance, motion and saliency, however depth information has rarely been utilized. [sent-164, score-0.433]
52 In the following sections we describe feature extraction approaches, based on the descriptors of two widely successful action recognition schemes, extended to make use of the additional information present in the Hollywood3D dataset. [sent-165, score-0.52]
53 Bag of Visual Words One of the most successful feature descriptors for action recognition is that of Laptev et al. [sent-168, score-0.471]
54 Descriptors are extracted only in salient regions (found through interest point detection) and are composed of a Histogram of Oriented Gradients (HOG) G, concatenated with a Histogram of Oriented Flow (HOF) F. [sent-170, score-0.326]
55 This provides a descriptor ρ of the visual appearance and local motion around the salient point at I(u, v, w). [sent-172, score-0.314]
56 Importantly, this descriptor is not dependent on the interest point detector, provided the HODG can be calculated from the depth stream D. [sent-180, score-0.76]
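The concatenation can be sketched as below, assuming that each histogram is a magnitude-weighted orientation histogram over 2D vectors sampled around the salient point. The bin count, the L1 normalisation and the function names are hypothetical choices, and the cell layout used by the original descriptor is omitted.

```python
import math


def orientation_histogram(vectors, bins=8):
    """Magnitude-weighted, L1-normalised histogram of 2D vector orientations."""
    hist = [0.0] * bins
    for vx, vy in vectors:
        mag = math.hypot(vx, vy)
        if mag == 0.0:
            continue  # zero vectors carry no orientation
        angle = math.atan2(vy, vx) % (2 * math.pi)
        index = min(int(angle / (2 * math.pi) * bins), bins - 1)
        hist[index] += mag
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


def describe(app_gradients, flow, depth_gradients, bins=8):
    """Concatenate HOG, HOF and a depth-gradient histogram (HODG)."""
    return (orientation_histogram(app_gradients, bins)
            + orientation_histogram(flow, bins)
            + orientation_histogram(depth_gradients, bins))
```

Because the depth histogram is simply appended, the same function works with interest points from any detector, as long as depth gradients are available at the detected location.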
57 has also been shown to perform well in a large range of action recognition datasets, while making use of only the saliency information obtained during interest point detection. [sent-184, score-0.919]
58 An integral volume η is created, based on the interest point detection and their strengths. [sent-185, score-0.345]
59 The saliency content of a sub-cuboid, with origin at (u, v, w) is defined in equation 13 as c(u, v, w) for a sub-cuboid of dimensions ( uˆ, vˆ, wˆ ). [sent-186, score-0.365]
60 The descriptor δ of the saliency distribution at a position (u, v, w) can then be formed, by performing N comparisons of the content of two randomly offset spatio-temporal sub-cuboids, with origins at (u, v, w) + and (u, v, w) + β? [sent-188, score-0.464]
61 By extracting δ at every location in the sequence, a histogram may be constructed, which encodes the occurrences of relative saliency distributions within the sequence, without requiring appearance data or motion estimation. [sent-202, score-0.415]
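The integral volume and sub-cuboid comparisons can be sketched as follows. This is a generic summed-volume implementation under stated assumptions: saliency is stored as a nested list `sal[u][v][w]`, the offset pairs are supplied by the caller (the text describes them as random), all sub-cuboids are assumed to lie inside the volume, and the descriptor is packed into an integer bit code. None of these names or layout choices come from the original source code.

```python
def integral_volume(sal):
    """Build eta where eta[u][v][w] sums sal over [0,u) x [0,v) x [0,w)."""
    U, V, W = len(sal), len(sal[0]), len(sal[0][0])
    eta = [[[0.0] * (W + 1) for _ in range(V + 1)] for _ in range(U + 1)]
    for u in range(U):
        for v in range(V):
            for w in range(W):
                # 3D inclusion-exclusion over the seven preceding corners.
                eta[u + 1][v + 1][w + 1] = (
                    sal[u][v][w]
                    + eta[u][v + 1][w + 1] + eta[u + 1][v][w + 1] + eta[u + 1][v + 1][w]
                    - eta[u][v][w + 1] - eta[u][v + 1][w] - eta[u + 1][v][w]
                    + eta[u][v][w])
    return eta


def cuboid_content(eta, u, v, w, du, dv, dw):
    """Total saliency in the sub-cuboid with origin (u, v, w) and size (du, dv, dw)."""
    u1, v1, w1 = u + du, v + dv, w + dw
    return (eta[u1][v1][w1]
            - eta[u][v1][w1] - eta[u1][v][w1] - eta[u1][v1][w]
            + eta[u][v][w1] + eta[u][v1][w] + eta[u1][v][w]
            - eta[u][v][w])


def rmd_descriptor(eta, u, v, w, offset_pairs, size):
    """Bit n is set when the first offset sub-cuboid of pair n holds more saliency."""
    code = 0
    for n, (a, b) in enumerate(offset_pairs):
        ca = cuboid_content(eta, u + a[0], v + a[1], w + a[2], *size)
        cb = cuboid_content(eta, u + b[0], v + b[1], w + b[2], *size)
        if ca > cb:
            code |= 1 << n
    return code
```

Each sub-cuboid sum costs eight lookups regardless of its size, which is what makes evaluating the descriptor at every location practical. A histogram over the resulting codes then summarises the sequence.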
62 We propose extending the standard RMD described above, by storing the saliency measurements within a 4D integral hyper-volume, so as to encode the behavior of the interest point distribution across the 3D scene, rather than within the image plane. [sent-214, score-0.605]
63 The 4D integral volume can be populated by extracting the depth measurements at each detected interest point. [sent-215, score-0.636]
64 As with the original RMD, the descriptor can be applied in conjunction with any interest point detector. [sent-218, score-0.375]
65 As with the 4D Bag of Words approach, these features are not restricted to the extended interest point detectors described in section 5, and work equally well with standard spatio-temporal interest points, provided that a depth video is available during descriptor extraction. [sent-219, score-0.98]
66 The source code for the three novel interest point detection algorithms, and the two extended Action Recognition techniques is available, to allow reproduction of these results. [sent-223, score-0.308]
67 Interest Point Analysis First we examine the benefits of including depth information during interest point detection. [sent-228, score-0.54]
68 1) is used for classification, and the traditional spatio-temporal interest points (Separable Filters 3D-S and Harris Corners 3D-Ha) are compared to the proposed depth aware schemes. [sent-232, score-0.677]
69 This is also reflected in the depth aware schemes, and is unsurprising, as separable filters were designed primarily for computational speed. [sent-236, score-0.561]
70 Hessian based interest points prove less informative than the extended Harris operators in both the 4D and 3. [sent-237, score-0.413]
71 Interestingly, certain actions consistently perform better, when described by depth aware interest points. [sent-242, score-0.794]
72 These are actions such as Kiss, Hug, Drive and Run where there is an informative foreground object, which depth aware interest points are better able to pick out. [sent-243, score-0.849]
73 In contrast, actions such as Swim, Dance and Shoot are often performed against a similar depth background, or within a group of people, and the inclusion of depth in the saliency measure is less valuable. [sent-244, score-1.036]
74 This suggests that a combination of standard spatiotemporal, and depth aware schemes, may prove valuable. [sent-250, score-0.452]
75 The complexity of the depth aware interest point detectors remains of the same order as their spatio-temporal counterparts (linear with respect to u, v and w). [sent-251, score-0.707]
76 Descriptor Analysis Next, the use of depth information at the feature level was explored, including its interaction with the depth aware saliency measures. [sent-256, score-0.963]
77 The previously noted relationship between saliency measures, appears to hold regardless of the feature descriptor used. [sent-259, score-0.386]
78 5D scheme prove to be the most effective way to incorporate depth information. [sent-263, score-0.321]
79 This is unsurprising as the RMD relies only on interest point detections, without the inclusion of any visual and motion information. [sent-267, score-0.375]
80 It may have been reasonable to guess, that including structural features would prove more valuable with a standard saliency measure, as the depth information had not previously been exploited. [sent-268, score-0.657]
81 In fact the opposite proves to be true, 4D features provide more modest gains for 3D-S and 3D-Ha (up to 20%) than they do when combined with extended saliency measures (up to 45%). [sent-269, score-0.319]
82 This demonstrates that depth aware saliency measures are capable of focusing computation, into regions where structural features are particularly valuable. [sent-270, score-0.717]
83 The complexity of the RMD-4D is greater than the standard RMD (being linear in the range of depth values, as well as in u, v and w). [sent-271, score-0.314]
84 However the increased feature vector length does lead to an increased cost during Table 2: Average precision per class, on the 3D action dataset, for a range of interest point detectors, including simple spatio-temporal interest points, and depth aware schemes. [sent-274, score-1.335]
85 Classes are shown in bold, when depth aware interest points outperform both 3D schemes. [sent-276, score-0.677]
86 Table 3: Correct Classification rate and Average Precision for each combination of descriptor and saliency measure. [sent-277, score-0.386]
87 Interest Point Threshold Results Different interest point operators produce very different response strengths, meaning the optimal threshold for extracting salient points varies. [sent-282, score-0.508]
88 In general an arbitrary threshold is selected, indeed the experiments in the previous sections employed a saliency threshold based on those suggested in previous literature. [sent-283, score-0.35]
89 In figure 2 the relationship between the saliency threshold and the action recognition performance, is contrasted for 4D and 3. [sent-284, score-0.7]
90 Regardless of the saliency measure, the standard feature descriptors and their depth aware extensions follow the same trend. [sent-286, score-0.827]
91 In contrast, bag of words approaches provide greater accuracy for lower saliency thresholds. [sent-290, score-0.475]
92 This makes sense, as a weak interest point relates to a single histogram entry under the bag of words scheme. [sent-292, score-0.466]
93 In contrast, poor interest points will affect the RMD descriptor of all surrounding locations. [sent-293, score-0.381]
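To make the single-entry argument concrete, the sketch below builds a bag of words histogram by assigning each descriptor to its nearest codeword. The codebook is assumed to be precomputed (for example by clustering training descriptors), and the squared Euclidean distance and the function name are illustrative assumptions.

```python
def bag_of_words(descriptors, codebook):
    """Histogram of nearest-codeword assignments, one vote per descriptor."""
    hist = [0] * len(codebook)
    for d in descriptors:
        # Index of the closest codeword under squared Euclidean distance.
        nearest = min(
            range(len(codebook)),
            key=lambda k: sum((x - y) ** 2 for x, y in zip(d, codebook[k])))
        hist[nearest] += 1
    return hist
```

Adding one extra, weak detection changes exactly one bin by one count, so its influence on the final histogram stays small, in contrast to a descriptor that is recomputed at every surrounding location.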
94 Conclusions In this paper, we propose and make available a large corpus of 3D data to the community, for the comparison of action recognition techniques, in natural environments. [sent-297, score-0.39]
95 5D Harris (c) 4D Hessian (d) 4D Harris Figure 2: Average Precision on the Hollywood 3D action recognition dataset, for various saliency thresholds, with 3. [sent-301, score-0.66]
96 It has been shown that 3D information provides valuable cues to improve action recognition. [sent-303, score-0.383]
97 A variety of new interest point detection algorithms, incorporating depth data, have been shown to improve action recognition rates, doubling performance in some cases, even using standard features. [sent-304, score-0.93]
98 Human daily action analysis with multi-view and color-depth data. [sent-335, score-0.352]
99 Capturing the relative distribution of features for action recognition. [sent-425, score-0.352]
100 A 3-dimensional sift descriptor and its application to action recognition. [sent-445, score-0.468]
