cvpr cvpr2013 cvpr2013-334 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Katerina Fragkiadaki, Han Hu, Jianbo Shi
Abstract: Human pose detectors, although successful in localising faces and torsos of people, often fail with lower arms. Motion estimation is often inaccurate under fast movements of body parts. We build a segmentation-detection algorithm that mediates the information between body part recognition and multi-frame motion grouping to improve both pose detection and tracking. Motion of body parts, though not accurate, is often sufficient to segment them from their backgrounds. Such segmentations are crucial for extracting hard-to-detect body parts out of their interior body clutter. By matching these segments to exemplars we obtain pose-labeled body segments. The pose-labeled segments and corresponding articulated joints are used to improve the motion flow fields by proposing kinematically constrained affine displacements on body parts. The pose-based articulated motion model is shown to handle large limb rotations and displacements. Our algorithm can detect people under rare poses, frequently missed by pose detectors, showing the benefits of jointly reasoning about pose, segmentation and motion in videos.
Reference: text
sentIndex sentText sentNum sentScore
1 We build a segmentation-detection algorithm that mediates the information between body part recognition and multi-frame motion grouping to improve both pose detection and tracking. [sent-10, score-0.943]
2 Such segmentations are crucial for extracting hard-to-detect body parts out of their interior body clutter. [sent-12, score-0.893]
3 By matching these segments to exemplars we obtain pose labeled body segments. [sent-13, score-0.726]
4 The pose labeled segments and corresponding articulated joints are used to improve the motion flow fields by proposing kinematically constrained affine displacements on body parts. [sent-14, score-1.77]
5 The pose-based articulated motion model is shown to handle large limb rotations and displacements. [sent-15, score-0.599]
6 Our algorithm can detect people under rare poses, frequently missed by pose detectors, showing the benefits of jointly reasoning about pose, segmentation and motion in videos. [sent-16, score-0.615]
7 Introduction. We study human pose detection and dense body motion estimation. [sent-18, score-0.825]
8 With fast motion and extreme pose variation, both pose and motion estimation algorithms often fail. [sent-19, score-0.934]
9 Our insight is that estimated body part motion, though not accurate, is often sufficient to segment body parts from their backgrounds. [sent-21, score-0.829]
10 By matching body part segments to shape exemplars, one can improve pose estimation under large body deformations. [sent-22, score-1.031]
11 Such reliable detections help segmentation of body pose by adjusting motion affinities. Left: the pose detector of [24], which combines Pb, optical flow edges, skin color and image gradient features in a structural model of human body parts in space and time. [sent-25, score-2.104]
12 Right: Results of our method that mediates salient motion segmentations with body part detections for detecting human body limbs under large motion. [sent-26, score-1.225]
13 Our method exploits “lucky” segmentations of moving body parts to 1) index into pose space, 2) infer articulated kinematic chains in the image, 3) improve body part motion estimates using kinematic constraints. [sent-29, score-2.003]
14 The proposed framework targets rare, widely deformed poses, often missed by pose detectors, and optical flow of human body parts, often inaccurate due to clutter and large motion. [sent-30, score-1.29]
15 However, body parts at the end of the articulation chains, i.e., the lower arms and hands, move the fastest and are the hardest to detect and track. [sent-33, score-0.552]
16 We estimate pose inversely to current detectors: our method aligns image segmentations to pose exemplars rather than learnt templates to image gradients, bypassing the need for enormous training sets. [sent-37, score-0.691]
17 Our algorithm segments moving body parts by leveraging motion grouping cues with figure-ground segregation of reliably detected body parts, e. [sent-40, score-1.132]
18 Confident body part detections [2] induce figure-ground repulsions between regions residing in their interior and exterior, and clean up region motion affinities in places where motion is not informative. [sent-43, score-1.259]
19 Extracted motion segments with hypothesized body joint locations (at their corners and endpoints) are matched against body pose exemplars close in body joint configuration. [sent-44, score-1.645]
20 Resulting pose labeled segments extract occluding body part boundaries (also interior to the body), not only the human silhouette outline, in contrast to background subtraction works [15]. [sent-45, score-0.732]
21 Pose segmentation hypotheses induce kinematic constraints during motion estimation of body parts. [sent-46, score-0.778]
22 We compute coarse piece-wise affine, kinematically constrained part motion models, incorporating reliable pixel correspondences from optical flow, whenever they are available. [sent-47, score-0.556]
23 Our hybrid flow model benefits from fine-grain optical flow tracking for elbows and slowly moving limbs of the articulation chain, while computing coarser motion estimates for fast moving ones. [sent-48, score-1.517]
24 The resulting “articulated” flow can accurately follow large rotations or mixed displacements and rotations of body parts, which are hard to track in the standard optical flow framework. [sent-49, score-1.398]
25 It propagates the pose segmentations in time, from frames of large motion to frames with no salient motion. [sent-50, score-0.625]
26 Our algorithm can detect people under rare poses, frequently missed by state-of-the-art pose detectors, by proposing a versatile representation for the human body that effectively adapts to the segmentability or detectability of different body parts and motion patterns. [sent-53, score-1.466]
27 Related work. We distinguish two main categories of work combining pose and motion estimation in the existing literature: (i) pose estimation methods that exploit optical flow information; and (ii) part motion estimation methods that exploit pose information. [sent-55, score-1.596]
28 The first class of methods comprises methods that use optical flow as a cue either for body part detection or for pose propagation from frame-to-frame [12, 24]. [sent-56, score-1.161]
29 [7] propose a pose tracking system that alternates between contour-driven pose estimation and optical flow pose propagation from frame to frame. [sent-58, score-1.395]
30 The second class of methods comprises approaches that exploit kinematic constraints of the body for part motion estimation. [sent-60, score-0.739]
31 Bregler and Malik [3] represent 3D motion of ellipsoidal body parts using a kinematic chain of twists. [sent-61, score-0.835]
32 [17] model the human body as a collection of planar patches undergoing affine motion, and soft constraints penalize the distance between the articulation points predicted by adjacent affine models. [sent-63, score-0.917]
33 [8] constrain the body joint displacements to be the same under the affine models of the adjacent parts, resulting in a simple linear constrained least squares optimization for kinematically constrained part tracking. [sent-65, score-0.741]
34 In the “strike a pose” work of [21], stylized (canonical) human body poses are detected reliably, and are used to learn instance specific part appearance models for better pose detection in other frames. [sent-67, score-0.755]
35 In this work, we follow a “strike a segment” approach by segmenting widely deforming body poses and propagating inferred body pose in time using articulated optical flow. [sent-68, score-1.381]
36 Highly deformable poses appear infrequently in the datasets, which reflects their low frequency in people’s body pose repertoire. [sent-75, score-0.628]
37 There is an asymmetry of motion segmentability among the parts of the human body due to its articulated nature. [sent-78, score-0.936]
38 Parts towards the ends of the articulated chains often deform much faster than the main torso (root of the body articulation tree). [sent-79, score-0.803]
39 Lack of motion may cause ambiguities in motion segmentation of root body parts. [sent-80, score-0.796]
40 We exploit detectability and segmentability across different body poses and parts in a graph-theoretic framework which combines motion-driven grouping cues of articulated parts and detection-driven grouping cues of torso-like parts. [sent-82, score-0.99]
41 We segment arm articulated chains by constrained normalized cuts in the steered region graph. [sent-84, score-0.561]
42 Resulting segmentations, with hypothesized body joints at their corners and endpoints, infer body pose by matching against pose exemplars. [sent-85, score-1.296]
43 While detectors would need many training examples to learn to extract a deformed pose from background clutter [16], our pose segmentations are already freed from their backgrounds. [sent-86, score-1.634] [sent-92, score-0.63]
Figure 2. Part detections change motion affinities; region clusters index into pose; body pose induces kinematic constraints; pose propagates with articulated flow.
44 Region motion affinities in A change according to confident body part detections that induce repulsions R between regions assigned to their foreground and background. [sent-88, score-0.616]
45 Region clusters index into pose exemplars according to hypothesized joint locations at their endpoints. [sent-89, score-0.723]
46 Coarse motion proposals compute an articulated optical flow field that can deal with large part rotations. [sent-91, score-1.003]
48 Multi-frame segmentation methods exploit optical flow trajectory correspondences that integrate motion estimates across multiple frames and can segment parts reliably even in frames with no motion. [sent-97, score-1.115]
49 As such, we will integrate per frame optical flow estimates on region spatial support to segment frames with large motion as measured from a box around a shoulder activation. [sent-99, score-1.096]
50 We describe the motion of an image region in two ways: i) with the set of point trajectories, if any, overlapping with the region mask, ii) with an affine model fitted to the optical flow displacements of the region pixels. [sent-100, score-1.247]
51 Affine motion fitting allows motion representation in places of ambiguous optical flow anchoring and sparse trajectory coverage. [sent-101, score-1.01]
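For illustration, the affine fit described above can be obtained with an ordinary least-squares solve over the pixels of the region mask; the sketch below is a minimal Python version (the function name and array layout are ours, not the paper's).

```python
import numpy as np

def fit_affine_motion(flow, mask):
    """Least-squares fit of an affine motion model w(x, y) = A [x, y, 1]^T
    to the optical flow vectors of the pixels selected by a region mask.

    flow : (H, W, 2) array of per-pixel (u, v) displacements
    mask : (H, W) boolean array marking the region support
    Returns the (2, 3) affine parameter matrix A.
    """
    ys, xs = np.nonzero(mask)
    X = np.stack([xs, ys, np.ones_like(xs)], axis=1).astype(np.float64)  # (N, 3)
    U = flow[ys, xs, :]                                                  # (N, 2)
    A_T, *_ = np.linalg.lstsq(X, U, rcond=None)   # solves X @ A.T ~= U
    return A_T.T
```

The fitted parameters can then stand in for the flow wherever trajectory coverage is sparse or flow anchoring is ambiguous, as noted above.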
52 Given the set of point trajectories, between each pair tra, trb we compute motion affinities AT (tra, trb) encoding their long range motion similarity [6]. [sent-106, score-0.635]
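The exact affinity definition is not given in this extract; a minimal sketch in the spirit of the long-range motion similarity of [6], with an assumed Gaussian scaling, is:

```python
import numpy as np

def trajectory_affinity(tra, trb, sigma=1.0):
    """Long-range motion similarity between two point trajectories.

    tra, trb : dicts mapping frame index -> (x, y) position.
    The distance is the maximum velocity difference over the frames the
    two trajectories share, so a single frame of clearly different motion
    is enough to separate them; the Gaussian scaling sigma is an assumption.
    """
    common = sorted(set(tra) & set(trb))
    if len(common) < 2:
        return 0.0                      # no temporal overlap, no evidence
    d2 = 0.0
    for t0, t1 in zip(common[:-1], common[1:]):
        va = np.subtract(tra[t1], tra[t0])
        vb = np.subtract(trb[t1], trb[t0])
        d2 = max(d2, float(np.sum((va - vb) ** 2)))
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))
```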
53 Steering cut. We combine motion-driven affinities and detection-driven repulsions in one region affinity graph by canceling motion affinities between repulsive regions. [sent-130, score-0.842]
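A minimal sketch of this cancellation step (the rule that repulsive pairs receive exactly zero affinity is our assumption; the paper's Asteer may combine A and R differently):

```python
import numpy as np

def steer_affinities(A, R):
    """Form the steered region graph by canceling motion affinities
    between repulsive region pairs.

    A : (n, n) symmetric matrix of motion-driven region affinities
    R : (n, n) matrix of detection-driven repulsions (nonzero when two
        regions fall on the foreground and background of a confident
        part detection, respectively)
    """
    A_steer = A.copy()
    A_steer[R > 0] = 0.0           # repulsive pairs receive zero affinity
    np.fill_diagonal(A_steer, 1.0)
    return A_steer
```

The steered graph is then partitioned with constrained normalized cuts to extract the arm articulated chains, as described above.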
54 Using Jk1, Jk2 and detected shoulder locations, we select pose exemplars close in body joint configuration as measured by the partial Procrustes distance between the corresponding sets of body joints (we do not consider scaling). [sent-163, score-1.161]
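As a worked example of the matching criterion, the partial Procrustes distance between two corresponding joint sets can be computed by aligning them with translation and rotation only (no scaling, as stated); the sketch below uses the standard SVD-based alignment and is our illustration, not the paper's code.

```python
import numpy as np

def partial_procrustes_distance(J1, J2):
    """Distance between two sets of corresponding 2D body joints after
    optimal translation and rotation (no scaling).

    J1, J2 : (k, 2) arrays of joint coordinates in correspondence.
    """
    A = J1 - J1.mean(axis=0)                 # remove translation
    B = J2 - J2.mean(axis=0)
    U, _, Vt = np.linalg.svd(A.T @ B)        # Kabsch alignment
    Rmat = Vt.T @ U.T
    if np.linalg.det(Rmat) < 0:              # disallow reflections
        Vt[-1] *= -1
        Rmat = Vt.T @ U.T
    return float(np.linalg.norm(A @ Rmat.T - B))
```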
55 Standard optical flow cannot follow fast motion of the lower arm in most cases. [sent-169, score-0.926]
56 However, when descriptors capture the hand but miss the arm, hand and arm appear disconnected in the motion flow space (2nd row). [sent-171, score-0.667]
57 Knowing the rough body articulation points allows us to restrict our motion model to be a kinematic chain along the body parts. [sent-172, score-1.219]
58 The resulting articulated motion flow is more accurate. [sent-173, score-0.753]
59 Confidently matched pose segments recover body parts that would have been missed by the pose detectors due to overwhelming surrounding clutter or misalignment of pose. [sent-177, score-1.022]
60 Each kinematic chain is comprised of upper and lower arms du, dl connected at the elbow body joint Ju,l, as shown in Figure 2. [sent-179, score-0.874]
61 From pose to flow. We use the estimated body pose to help motion estimation of lower limbs. [sent-181, score-0.872]
62 Human body limbs are hard to track accurately with general motion estimation techniques, such as optical flow methods, due to large rotations, deformations, and ambiguity of correspondence along their medial axis (aperture problems). [sent-182, score-1.216]
63 Articulation points correspond to rotation axes and impose kinematic constraints on the body parts they are connected to. [sent-185, score-0.573]
64 Articulated flow. We use our pose-labelled segmentations to infer dense displacement fields for body parts, which we call articulated flow fields. [sent-189, score-1.309]
65 Given an arm articulated chain (left or right), let Mu, Ml denote the masks of the corresponding upper and lower arms du, dl, linked at the elbow location Ju,l . [sent-190, score-0.68]
66 Let wuD, wlD denote the affine motion fields of parts du and dl, respectively. [sent-192, score-0.569]
67 The cost function for our articulated optical flow is minimized jointly over the dense displacement field w and the affine fields wuD, wlD. [sent-199, score-0.753]
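The equation itself did not survive extraction; a hedged LaTeX reconstruction consistent with the term descriptions that follow (the robust penalty $\Psi$ and the weights $\alpha,\lambda$ are assumptions borrowed from standard variational optical flow, only the three-term structure and the joint equality constraint are stated in the text) is:

```latex
\min_{\mathbf{w},\,w^D_u,\,w^D_l}\;
\int_{\Omega}\Psi\!\left(|I_2(\mathbf{x}+\mathbf{w}(\mathbf{x}))-I_1(\mathbf{x})|^2\right)
+\alpha\,\Psi\!\left(|\nabla\mathbf{w}(\mathbf{x})|^2\right)\,d\mathbf{x}
+\lambda\int_{\Omega}\big(\phi_u(\mathbf{x})\,M_u(\mathbf{x})\,|\mathbf{w}(\mathbf{x})-w^D_u(\mathbf{x})|^2
+\phi_l(\mathbf{x})\,M_l(\mathbf{x})\,|\mathbf{w}(\mathbf{x})-w^D_l(\mathbf{x})|^2\big)\,d\mathbf{x}
\quad\text{s.t.}\quad w^D_u(J_{u,l})=w^D_l(J_{u,l}).
```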
68 Bottom Row: Pose propagation with affine motion fitting to the optical flow estimates of [5]. [sent-205, score-0.989]
69 Limb motion is often too erratic to track with standard optical flow schemes, which drift to surroundings under wide deformations. [sent-207, score-0.754]
70 The third term penalizes deviations of the displacement field w from the affine fields wuD, wlD, weighted by the pixelwise confidence of the affine displacements φu (x) , φl (x). [sent-211, score-0.623]
71 The constraint requires the affine displacements predicted for the articulated joint by the two affine fields to be equal. [sent-213, score-0.75]
72 We minimize the cost function of Eq. 3 by computing coarse affine models for the upper and lower arms and then injecting their affine displacements as soft constraints in an optical flow computation for the kinematic chain. [sent-215, score-1.452]
73 For computing the two kinematically constrained affine fields we use “hybrid” tracking: for upper arms or the background, standard optical flow displacements are often reliable, since their motion is not erratic. [sent-216, score-1.32]
74 We use such flow displacements to propagate foreground and background of the arm kinematic chain from the previous frame, and compute an affine motion field for the upper arm wuD. [sent-217, score-1.331]
75 Such propagation constrains i) the possible displacement hypotheses of the articulation point Ju,l, and ii) the possible affine deformations of the lower limb dl . [sent-218, score-0.664]
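One concrete way to realize these kinematic constraints is an equality-constrained least-squares fit: the lower-arm affine model is fitted to whatever flow evidence exists on its mask Ml, subject to predicting the same displacement at the elbow Ju,l as the upper-arm field wuD. The sketch below is our interpretation (a per-component KKT solve), not necessarily the paper's exact solver.

```python
import numpy as np

def fit_constrained_lower_arm_affine(flow, lower_mask, elbow, elbow_disp):
    """Fit a lower-arm affine field w_l^D whose prediction at the elbow
    J_{u,l} equals elbow_disp, the displacement already implied by the
    upper-arm field w_u^D (e.g. elbow_disp = A_upper @ [ex, ey, 1]).

    Per flow component, solves equality-constrained least squares via the
    KKT system  [2 X^T X  c; c^T  0] [a; nu] = [2 X^T u; d].
    """
    ys, xs = np.nonzero(lower_mask)
    X = np.stack([xs, ys, np.ones_like(xs)], axis=1).astype(np.float64)  # (N, 3)
    U = flow[ys, xs, :].astype(np.float64)                               # (N, 2)
    ex, ey = elbow
    c = np.array([ex, ey, 1.0])              # affine prediction at the elbow

    A = np.zeros((2, 3))
    for k in range(2):                       # u- and v-components decouple
        KKT = np.zeros((4, 4))
        KKT[:3, :3] = 2.0 * X.T @ X
        KKT[:3, 3] = c
        KKT[3, :3] = c
        rhs = np.concatenate([2.0 * X.T @ U[:, k], [elbow_disp[k]]])
        A[k] = np.linalg.solve(KKT, rhs)[:3]
    return A
```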
76 The affine displacements wuD, wlD receive higher weights at coarse pyramid levels and are down-weighted at finer pyramid levels as more and more image evidence is taken into account, to better adapt to the fine-grain details of part motion, which may deviate from an affine model. [sent-225, score-0.58]
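A toy illustration of such a schedule (the geometric decay and its rate are our assumptions; the text only states that the weight decreases toward finer levels):

```python
def affine_prior_weight(level, lam0=1.0, decay=0.5):
    """Weight of the affine-displacement prior at a pyramid level.
    level = 0 is the coarsest level; the prior is strongest there and is
    geometrically down-weighted at finer levels so that image evidence
    can capture fine-grain part motion deviating from the affine model.
    """
    return lam0 * (decay ** level)
```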
77 In the descriptor-augmented optical flow of [26], the motion estimate of the arm "breaks" when there is no reliable descriptor match to capture its deformation. [sent-228, score-0.876]
78 Standard coarse-to-fine flow misses the fast moving hand whose motion is larger than its spatial extent. [sent-229, score-0.583]
79 We propagate our body segmentations in time using articulated optical flow trajectories, as shown in Figure 4. [sent-230, score-1.192]
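A minimal stand-in for this propagation step, warping a binary part mask forward with a dense flow field (trajectory bookkeeping, occlusion handling and hole filling are omitted; names are ours):

```python
import numpy as np

def propagate_mask(mask, flow):
    """Push a binary part mask to the next frame with a dense flow field
    (forward warping with rounding to the nearest pixel).
    """
    H, W = mask.shape
    ys, xs = np.nonzero(mask)
    xs2 = np.clip(np.rint(xs + flow[ys, xs, 0]).astype(int), 0, W - 1)
    ys2 = np.clip(np.rint(ys + flow[ys, xs, 1]).astype(int), 0, H - 1)
    out = np.zeros_like(mask, dtype=bool)
    out[ys2, xs2] = True
    return out
```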
80 We compare with affine fitting to standard flow estimates in Figure 4. [sent-232, score-0.571]
81 Ambiguities of limb motion estimation due to self occlusions, nondiscriminative appearance and wide deformations cause flow estimates to drift, in the absence of pose-informed kinematic constraints. [sent-233, score-1.199]
82 We selected 15 video sequences with widely deformed body pose in at least one frame. [sent-236, score-0.624]
83 For each exemplar we automatically extract a set of boundary contours lying inside the groundtruth body part bounding boxes of width one fifth of the shoulder distance. [sent-242, score-0.606]
84 We also evaluate our pose detection step only, without improving the motion estimation, but rather propagating the pose in time by fitting affine motion models to standard optical flow [5]. [sent-247, score-1.645]
85 It combines multiple cues such as Probability of Boundary, optical flow edges and skin color for computing unary and pairwise part potentials. [sent-252, score-0.586]
86 It extends the state-of-the-art static pose detector of [28] for human pose estimation in videos by keeping the N best pose samples per frame and inferring the most coherent pose sequence across frames using dynamic programming. [sent-256, score-1.11]
87 The two methods, though, have a large performance gap when tracking lower arms, whose wide frame-to-frame deformations cause standard optical flow to drift. [sent-261, score-0.705]
88 This demonstrates the importance of improving the motion estimation via articulation constraints for tracking the pose in time. [sent-262, score-0.675]
89 The baseline system of [24] uses optical flow edge as a cue for part detection. [sent-263, score-0.586]
90 We attribute its weaker performance for wrist detection to two factors: 1) it learns a single weight combination for optical flow edges, Pb and image gradients for each part or pair of parts, which may create contradictions in the absence of motion. [sent-270, score-0.691]
91 2) Optical flow edges may not align well with the body part boundaries due to the optical flow "bleeding" effect. [sent-271, score-1.258]
92 We recover from mis-alignments of optical flow with part boundaries by computing a flow-based region segmentation, rather than using optical flow as a raw feature for part detection. [sent-273, score-1.572]
93 In contrast to standard pose detectors, and also our baseline systems, our method does not require all body parts to be present in each frame. [sent-275, score-0.659]
94 The lack of dedicated wrist and elbow detectors makes our wrist and elbow localization occasionally poor (see last column of Figure 5), although it still lies inside the body part. [sent-276, score-0.646]
95 Conclusion. We proposed an approach that detects human body poses by a steering cut on motion grouping affinities of lower limbs and figure-ground repulsions from shoulder detections. [sent-278, score-1.347]
96 Arm articulated chains, obtained by matching such segmentations to exemplars, are used to provide feedback to dense body motion estimation about articulation points and region stiffness. [sent-281, score-1.162]
97 Resulting flow fields can deal with large per frame deformations of body parts and propagate the detected pose in time, during its deforming posture. [sent-282, score-1.123]
98 Our flow-to-pose-to-flow process is able to infer poses under wide deformations that would have been both too hard to detect and too hard to track otherwise. [sent-283, score-1.017]
99 High accuracy optical flow estimation based on a theory for warping. [sent-309, score-0.583]
100 High accuracy optical flow serves 3-D pose tracking: exploiting contour and flow based constraints. [sent-328, score-1.12]
wordName wordTfidf (topN-words)
[('body', 0.336), ('flow', 0.336), ('pose', 0.239), ('optical', 0.209), ('motion', 0.209), ('articulated', 0.208), ('affine', 0.204), ('affinities', 0.179), ('repulsions', 0.154), ('wud', 0.154), ('kinematic', 0.153), ('limb', 0.14), ('shoulder', 0.14), ('articulation', 0.132), ('arms', 0.13), ('arm', 0.122), ('xqb', 0.115), ('exemplars', 0.11), ('segmentations', 0.103), ('displacements', 0.097), ('asteer', 0.096), ('wld', 0.096), ('xqf', 0.096), ('limbs', 0.088), ('parts', 0.084), ('elbow', 0.082), ('chains', 0.072), ('region', 0.064), ('kinematically', 0.063), ('steered', 0.063), ('brox', 0.062), ('steering', 0.06), ('fragkiadaki', 0.058), ('segmentability', 0.058), ('tracking', 0.057), ('friends', 0.057), ('repulsive', 0.057), ('torso', 0.055), ('wrist', 0.054), ('dq', 0.054), ('poses', 0.053), ('chain', 0.053), ('deformations', 0.053), ('contradictions', 0.051), ('transient', 0.051), ('reliably', 0.051), ('displacement', 0.05), ('lower', 0.05), ('deformed', 0.049), ('contours', 0.048), ('ldof', 0.047), ('trajectory', 0.047), ('stylized', 0.045), ('missed', 0.045), ('endpoints', 0.043), ('elbows', 0.043), ('strike', 0.043), ('rotations', 0.042), ('segmentation', 0.042), ('segments', 0.041), ('exemplar', 0.041), ('rare', 0.041), ('part', 0.041), ('human', 0.041), ('people', 0.039), ('ri', 0.039), ('datta', 0.038), ('detectability', 0.038), ('lucky', 0.038), ('mediates', 0.038), ('mediating', 0.038), ('saliently', 0.038), ('seas', 0.038), ('trb', 0.038), ('zme', 0.038), ('moving', 0.038), ('estimation', 0.038), ('detectors', 0.038), ('hypothesized', 0.038), ('frame', 0.038), ('fields', 0.037), ('grouping', 0.037), ('frames', 0.037), ('rj', 0.037), ('dl', 0.035), ('pb', 0.035), ('shoulders', 0.035), ('inaccurate', 0.035), ('upper', 0.035), ('coarse', 0.034), ('wed', 0.034), ('upenn', 0.034), ('fablet', 0.034), ('nr', 0.034), ('interior', 0.034), ('detections', 0.033), ('segment', 0.032), ('recomputing', 0.032), ('pixelwise', 0.031), ('estimates', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000013 334 cvpr-2013-Pose from Flow and Flow from Pose
Author: Katerina Fragkiadaki, Han Hu, Jianbo Shi
Abstract: Human pose detectors, although successful in localising faces and torsos of people, often fail with lower arms. Motion estimation is often inaccurate under fast movements of body parts. We build a segmentation-detection algorithm that mediates the information between body part recognition and multi-frame motion grouping to improve both pose detection and tracking. Motion of body parts, though not accurate, is often sufficient to segment them from their backgrounds. Such segmentations are crucial for extracting hard-to-detect body parts out of their interior body clutter. By matching these segments to exemplars we obtain pose-labeled body segments. The pose-labeled segments and corresponding articulated joints are used to improve the motion flow fields by proposing kinematically constrained affine displacements on body parts. The pose-based articulated motion model is shown to handle large limb rotations and displacements. Our algorithm can detect people under rare poses, frequently missed by pose detectors, showing the benefits of jointly reasoning about pose, segmentation and motion in videos.
2 0.31724826 244 cvpr-2013-Large Displacement Optical Flow from Nearest Neighbor Fields
Author: Zhuoyuan Chen, Hailin Jin, Zhe Lin, Scott Cohen, Ying Wu
Abstract: We present an optical flow algorithm for large displacement motions. Most existing optical flow methods use the standard coarse-to-fine framework to deal with large displacement motions which has intrinsic limitations. Instead, we formulate the motion estimation problem as a motion segmentation problem. We use approximate nearest neighbor fields to compute an initial motion field and use a robust algorithm to compute a set of similarity transformations as the motion candidates for segmentation. To account for deviations from similarity transformations, we add local deformations in the segmentation process. We also observe that small objects can be better recovered using translations as the motion candidates. We fuse the motion results obtained under similarity transformations and under translations together before a final refinement. Experimental validation shows that our method can successfully handle large displacement motions. Although we particularly focus on large displacement motions in this work, we make no sacrifice in terms of overall performance. In particular, our method ranks at the top of the Middlebury benchmark.
3 0.25738162 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction
Author: Dennis Park, C. Lawrence Zitnick, Deva Ramanan, Piotr Dollár
Abstract: We describe novel but simple motion features for the problem of detecting objects in video sequences. Previous approaches either compute optical flow or temporal differences on video frame pairs with various assumptions about stabilization. We describe a combined approach that uses coarse-scale flow and fine-scale temporal difference features. Our approach performs weak motion stabilization by factoring out camera motion and coarse object motion while preserving nonrigid motions that serve as useful cues for recognition. We show results for pedestrian detection and human pose estimation in video sequences, achieving state-of-the-art results in both. In particular, given a fixed detection rate our method achieves a five-fold reduction in false positives over prior art on the Caltech Pedestrian benchmark. Finally, we perform extensive diagnostic experiments to reveal what aspects of our system are crucial for good performance. Proper stabilization, long time-scale features, and proper normalization are all critical.
4 0.25123078 206 cvpr-2013-Human Pose Estimation Using Body Parts Dependent Joint Regressors
Author: Matthias Dantone, Juergen Gall, Christian Leistner, Luc Van_Gool
Abstract: In this work, we address the problem of estimating 2d human pose from still images. Recent methods that rely on discriminatively trained deformable parts organized in a tree model have shown to be very successful in solving this task. Within such a pictorial structure framework, we address the problem of obtaining good part templates by proposing novel, non-linear joint regressors. In particular, we employ two-layered random forests as joint regressors. The first layer acts as a discriminative, independent body part classifier. The second layer takes the estimated class distributions of the first one into account and is thereby able to predict joint locations by modeling the interdependence and co-occurrence of the parts. This results in a pose estimation framework that takes dependencies between body parts already for joint localization into account and is thus able to circumvent typical ambiguities of tree structures, such as for legs and arms. In the experiments, we demonstrate that our body parts dependent joint regressors achieve a higher joint localization accuracy than tree-based state-of-the-art methods.
5 0.25033411 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
Author: Mihir Jain, Hervé Jégou, Patrick Bouthemy
Abstract: Several recent works on action recognition have attested the importance of explicitly integrating motion characteristics in the video description. This paper establishes that adequately decomposing visual motion into dominant and residual motions, both in the extraction of the space-time trajectories and for the computation of descriptors, significantly improves action recognition algorithms. Then, we design a new motion descriptor, the DCS descriptor, based on differential motion scalar quantities, divergence, curl and shear features. It captures additional information on the local motion patterns enhancing results. Finally, applying the recent VLAD coding technique proposed in image retrieval provides a substantial improvement for action recognition. Our three contributions are complementary and lead to outperform all reported results by a significant margin on three challenging datasets, namely Hollywood 2, HMDB51 and Olympic Sports. 1. Introduction and related work Human actions often convey the essential meaningful content in videos. Yet, recognizing human actions in un- constrained videos is a challenging problem in Computer Vision which receives a sustained attention due to the potential applications. In particular, there is a large interest in designing video-surveillance systems, providing some automatic annotation of video archives as well as improving human-computer interaction. The solutions proposed to address this problem inherit, to a large extent, from the techniques first designed for the goal of image search and classification. The successful local features developed to describe image patches [15, 23] have been translated in the 2D+t domain as spatio-temporal local descriptors [13, 30] and now include motion clues [29]. These descriptors are often extracted from spatial-temporal interest points [12, 3 1]. More recent techniques assume some underlying temporal motion model involving trajectories [2, 6, 7, 17, 18, 25, 29, 32]. Most of these approaches produce large set of local descriptors which are in turn aggregated to produce a single vector representing the video, in order to enable the use of powerful discriminative classifiers such as support vector machines (SVMs). This is usually done with the bag- Figure 1. Optical flow field vectors (green vectors with red end points) before and after dominant motion compensation. Most of the flow vectors due to camera motion are suppressed after compensation. One of the contributions of this paper is to show that compensating for the dominant motion is beneficial for most of the existing descriptors used for action recognition. of-words technique [24], which quantizes the local features using a k-means codebook. Thanks to the successful combination of this encoding technique with the aforementioned local descriptors, the state of the art in action recognition is able to go beyond the toy problems ofclassifying simple human actions in controlled environment and considers the detection of actions in real movies or video clips [11, 16]. Despite these progresses, the existing descriptors suffer from an uncompleted handling of motion in the video sequence. Motion is arguably the most reliable source of information for action recognition, as often related to the actions of interest. However, it inevitably involves the background or camera motion when dealing with uncontrolled and re- alistic situations. 
Although some attempts have been made to compensate camera motion in several ways [10, 21, 26, 29, 32], how to separate action motion from that caused by the camera, and how to reflect it in the video description remains an open issue. The motion compensation mechanism employed in [10] is tailor-made to the Motion Interchange Pattern encoding technique. The Motion Boundary Histogram (MBH) [29] is a recent appealing approach to 222555555533 suppress the constant motion by considering the flow gradient. It is robust to some extent to the presence of camera motion, yet it does not explicitly handle the camera motion. Another approach [26] uses a sophisticated and robust (RANSAC) estimation of camera motion. It first segments the color image into regions corresponding to planar parts in the scene and estimates the (three) dominant homographies to update the motion associated with local features. A rather different view is adopted in [32] where the motion decomposition is performed at the trajectory level. All these works support the potential of motion compensation. As the first contribution of this paper, we address the problem in a way that departs from these works by considering the compensation of the dominant motion in both the tracking stages and encoding stages involved in the computation of action recognition descriptors. We rely on the pioneering works on motion compensation such as the technique proposed in [20], that considers 2D polynomial affine motion models for estimating the dominant image motion. We consider this particular model for its robustness and its low computational cost. It was already used in [21] to separate the dominant motion (assumed to be due to the camera motion) and the residual motion (corresponding to the independent scene motions) for dynamic event recognition in videos. However, the statistical modeling of both motion components was global (over the entire image) and only the normal flow was computed for the latter. Figure 1 shows the vectors of optical flow before and after applying the proposed motion compensation. Our method successfully suppresses most of the background motion and reinforces the focus towards the action of interest. We exploit this compensated motion both for descriptor computation and for extracting trajectories. However, we also show that the camera motion should not be thrown as it contains complementary information that is worth using to recognize certain action categories. Then, we introduce the Divergence-Curl-Shear (DCS) descriptor, which encodes scalar first-order motion features, namely the motion divergence, curl and shear. It captures physical properties of the flow pattern that are not involved in the best existing descriptors for action recognition, except in the work of [1] which exploits divergence and vorticity among a set of eleven kinematic features computed from the optical flow. Our DCS descriptor provides a good performance recognition performance on its own. Most importantly, it conveys some information which is not captured by existing descriptors and further improves the recognition performance when combined with the other descriptors. As a last contribution, we bring an encoding technique known as VLAD (vector oflocal aggregated descriptors) [8] to the field of action recognition. This technique is shown to be better than the bag-of-words representation for combining all the local video descriptors we have considered. The organization of the paper is as follows. 
Section 2 introduces the motion properties that we will consider through this paper. Section 3 presents the datasets and classification scheme used in our different evaluations. Section 4 details how we revisit several popular descriptors of the literature by the means of dominant motion compensation. Our DCS descriptor based on kinematic properties is introduced in Section 5 and improved by the VLAD encoding technique, which is introduced and bench-marked in Section 6 for several video descriptors. Section 7 provides a comparison showing the large improvement achieved over the state of the art. Finally, Section 8 concludes the paper. 2. Motion Separation and Kinematic Features In this section, we describe the motion clues we incorporate in our action recognition framework. We separate the dominant motion and the residual motion. In most cases, this will account to distinguishing the impact of camera movement and independent actions. Note that we do not aim at recovering the 3D camera motion: The 2D parametric motion model describes the global (or dominant) motion between successive frames. We first explain how we estimate the dominant motion and employ it to separate the dominant flow from the optical flow. Then, we will introduce kinematic features, namely divergence, curl and shear for a more comprehensive description of the visual motion. 2.1. Affine motion for compensating camera motion Among polynomial motion models, we consider the 2D affine motion model. Simplest motion models such as the 4parameter model formed by the combination of 2D translation, 2D rotation and scaling, or more complex ones such as the 8-parameter quadratic model (equivalent to a homography), could be selected as well. The affine model is a good trade-off between accuracy and efficiency which is of primary importance when processing a huge video database. It does have limitations since strictly speaking it implies a single plane assumption for the static background. However, this is not that penalizing (especially for outdoor scenes) if differences in depth remain moderated with respect to the distance to the camera. The affine flow vector at point p = (x, y) and at time t, is defined as waff(pt) =?cc12((t ) ?+?aa31((t ) aa42((t ) ? ?xytt?. (1) = + + = + uaff(pt) c1(t) a1(t)xt a2(t)yt and vaff(pt) c2(t) a3 (t)xt + a4(t)yt are horizontal and vertical components of waff(pt) respectively. Let us denote the optical flow vector at point p at time t as w(pt) = (u(pt) , v(pt)). We introduce the flow vector ω(pt) obtained by removing the affine flow vector from the optical flow vector ω(pt) = w(pt) − waff(pt) . (2) 222555555644 The dominant motion (estimated as waff(pt)) is usually due to the camera motion. In this case, Equation 2 amounts to canceling (or compensating) the camera motion. Note that this is not always true. For example in case of close-up on a moving actor, the dominant motion will be the affine estimation of the apparent actor motion. The interpretation of the motion compensation output will not be that straightforward in this case, however the resulting ω-field will still exhibit different patterns for the foreground action part and the background part. In the remainder, we will refer to the “compensated” flow as ω-flow. Figure 1 displays the computed optical flow and the ωflow. We compute the affine flow with the publicly available Motion2D software1 [20] which implements a realtime robust multiresolution incremental estimation framework. 
The affine motion model has correctly accounted for the motion induced by the camera movement which corresponds to the dominant motion in the image pair. Indeed, we observe that the compensated flow vectors in the background are close to null and the compensated flow in the foreground, i.e., corresponding to the actors, is conversely inflated. The experiments presented along this paper will show that effective separation of dominant motion from the residual motions is beneficial for action recognition. As explained in Section 4, we will compute local motion descriptors, such as HOF, on both the optical flow and the compensated flow (ω-flow), which allows us to explicitly and directly characterize the scene motion. 2.2. Local kinematic features By kinematic features, we mean local first-order differential scalar quantities computed on the flow field. We consider the divergence, the curl (or vorticity) and the hyperbolic terms. They inform on the physical pattern of the flow so that they convey useful information on actions in videos. They can be computed from the first-order derivatives of the flow at every point p at every frame t as ⎨⎪ ⎪ ⎪ ⎧hcdyuipvr1l2(p t) = −∂ u ∂ (yxp(xtp) +−∂ v ∂ v(px ypxt ) The diverg⎪⎩ence is related to axial motion, expansion scaling effects, the curl to rotation in the image plane. hyperbolic terms express the shear of the visual flow responding to more complex configuration. We take account the shear quantity only: shear(pt) = ?hyp12(pt) + hyp22(pt). (3) and The corinto (4) 1http://www.irisa.fr/vista/Motion2D/ In Section 5, we propose the DCS descriptor that is based on the kinematic features (divergence, curl and shear) of the visual motion discussed in this subsection. It is computed on either the optical or the compensated flow, ω-flow. 3. Datasets and evaluation This section first introduces the datasets used for the evaluation. Then, we briefly present the bag-of-feature model and the classification scheme used to encode the descriptors which will be introduced in Section 4. Hollywood2. The Hollywood2 dataset [16] contains 1,707 video clips from 69 movies representing 12 action classes. It is divided into train set and test set of 823 and 884 samples respectively. Following the standard evaluation protocol of this benchmark, we use average precision (AP) for each class and the mean of APs (mAP) for evaluation. HMDB51. The HMDB51 dataset [11] is a large dataset containing 6,766 video clips extracted from various sources, ranging from movies to YouTube. It consists of 51 action classes, each having at least 101 samples. We follow the evaluation protocol of [11] and use three train/test splits, each with 70 training and 30 testing samples per class. The average classification accuracy is computed over all classes. Out of the two released sets, we use the original set as it is more challenging and used by most of the works reporting results in action recognition. Olympic Sports. The third dataset we use is Olympic Sports [19], which again is obtained from YouTube. This dataset contains 783 samples with 16 sports action classes. We use the provided2 train/test split, there are 17 to 56 training samples and 4 to 11test samples per class. Mean AP is used for the evaluation, which is the standard choice. Bag of features and classification setup. We first adopt the standard BOF [24] approach to encode all kinds of descriptors. It produces a vector that serves as the video representation. 
The codebook is constructed for each type of descriptor separately by the k-means algorithm. Following a common practice in the literature [27, 29, 30], the codebook size is set to k=4,000 elements. Note that Section 6 will consider encoding technique for descriptors. For the classification, we use a non-linear SVM with χ2kernel. When combining different descriptors, we simply add the kernel matrices, as done in [27]: K(xi,xj) = exp?−?cγ1cD(xic,xjc)?, 2http://vision.stanford.edu/Datasets/OlympicSports/ 222555555755 (5) where D(xic, xjc) is χ2 distance between video xic and xjc with respect to c-th channel, corresponding to c-th descriptor. The quantity γc is the mean value of χ2 distances between the training samples for the c-th channel. The multiclass classification problem that we consider is addressed by applying a one-against-rest approach. 4. Compensated descriptors This section describes how the compensation ofthe dominant motion is exploited to improve the quality of descriptors encoding the motion and the appearance around spatio-temporal positions, hence the term “compensated descriptors”. First, we briefly review the local descriptors [5, 13, 16, 29, 30] used here along with dense trajectories [29]. Second, we analyze the impact of motion flow compensation when used in two different stages of the descriptor computation, namely in the tracking and the description part. 4.1. Dense trajectories and local descriptors Employing dense trajectories to compute local descriptors is one of the state-of-the-art approaches for action recognition. It has been shown [29] that when local descriptors are computed over dense trajectories the performance improves considerably compared to when computed over spatio temporal features [30]. Dense Trajectories [29]: The trajectories are obtained by densely tracking sampled points using optical flow fields. First, feature points are sampled from a dense grid, with step size of 5 pixels and over 8 scales. Each feature point pt = (xt, yt) at frame t is then tracked to the next frame by median filtering in a dense optical flow field F = (ut, vt) as follows: pt+1 = (xt+1 , yt+1) = (xt, yt) + (M ∗ F) | (x ¯t,y ¯t) , (6) where M is the kernel of median filtering and ( x¯ t, y¯ t) is the rounded position of (xt, yt). The tracking is limited to L (=15) frames to avoid any drifting effect. Excessively short trajectories and trajectories exhibiting sudden large displacements are removed as they induce some artifacts. Trajectories must be understood here as tracks in the spacetime volume of the video. Local descriptors: The descriptors are computed within a space-time volume centered around each trajectory. Four types of descriptors are computed to encode the shape of the trajectory, local motion pattern and appearance, namely Trajectory [29], HOF (histograms of optical flow) [13], MBH [4] and HOG (histograms of oriented gradients) [3]. All these descriptors depend on the flow field used for the tracking and as input of the descriptor computation: 1. The Trajectory descriptor encodes the shape of the trajectory represented by the normalized relative coor- × dinates of the successive points forming the trajectory. It directly depends on the dense flow used for tracking points. 2. HOF is computed using the orientations and magnitudes of the flow field. 3. MBH is designed to capture the gradient of horizontal and vertical components of the flow. The motion boundaries encode the relative pixel motion and therefore suppress camera motion, but only to some extent. 4. 
HOG encodes the appearance by using the intensity gradient orientations and magnitudes. It is formally not a motion descriptor. Yet the position where the descriptor is computed depends on the trajectory shape. As in [29], volume around a feature point is divided into a 2 2 3 space-time grid. The orientations are quantized ian 2to × ×8 b2i ×ns 3fo srp HacOe-Gti amned g g9r ibdi.ns T fhoer o oHriOenFt (awtioitnhs one a qdudainttiiozneadl zero bin). The horizontal and vertical components of MBH are separately quantized into 8 bins each. 4.2. Impact of motion compensation The optical flow is simply referred to as flow in the following, while the compensated flow (see subsection 2. 1) is denoted by ω-flow. Both of them are considered in the tracking and descriptor computation stages. The trajectories obtained by tracking with the ω-flow are called ω-trajectories. Figure 2 comparatively illustrates the ωtrajectories and the trajectories obtained using the flow. The input video shows a man moving away from the car. In this video excerpt, the camera is following the man walking to the right, thus inducing a global motion to the left in the video. When using the flow, the computed trajectories reflect the combination of these two motion components (camera and scene motion) as depicted by Subfigure 2(b), which hampers the characterization of the current action. In contrast, the ω-trajectories plotted in Subfigure 2(c) are more active on the actor moving on the foreground, while those localized in the background are now parallel to the time axis enhancing static parts of the scene. The ω-trajectories are therefore more relevant for action recognition, since they are more regularly and more exclusively following the actor’s motion. Impact on Trajectory and HOG descriptors. Table 1reports the impact of ω-trajectories on Trajectory and HOG descriptors, which are both significantly improved by 3%4% of mAP on the two datasets. When improved by ωflow, these descriptors will be respectively referred to as ω-Trajdesc and ω-HOG in the rest of the paper. Although the better performance of ω-Trajdesc versus the original Trajectory descriptor was expected, the one 222555555866 2. Trajectories obtained from optical and compensated flows. The green tail is the trajectory the current frame. The trajectories are sub-sampled for the sake of clarity. The frames are extracted Figure over every 15 frames with red dot indicating 5 frames in this example. DescriptorHollywood2HMDB51 BaseTrliaωnje- Tc(rtoarejrdpyreos[c2d9u]ced)54 7 1. 7 4% %2382.–89% BaseliHnωOe- (GHreOp [2rG9od]uced)4 451 . 658%%%2296.– 13%% Table 1. ω-Trajdesc and ω-HOG: Impact of compensating flow on Trajectory descriptor and HOG descriptors. achieved by ω-HOG might be surprising. Our interpretation is that HOG captures more context with the modified trajectories. More precisely, the original HOG descriptor is computed from a 2D+t sub-volume aligned with the corresponding trajectory and hence represents the appearance along the trajectory shape. When using ω-flow, we do not align the video sequence. As a result, the ω-HOG descriptor is no more computed around the very same tracked physical point in the space-time volume but around points lying in a patch of the initial feature point, whose size depends on the affine flow magnitude. ω-HOG can be viewed as a “patchbased” computation capturing more information about the appearance of the background or of the moving foreground. 
As for ω-trajectories, they are closer to the real trajectories of the moving actors as they usually cancel the camera movement, and so, more easier to train and recognize. Impact on HOF. The ω-flow impacts computation used as an input to HOF computation itself. Therefore, HOF can both types of trajectories (ω-trajectories both the trajectory and the descriptor be computed along or those extracted MethodHollywood2HMDB51 Table(ω2rHf.-alocO IwomkF)inpHgacOtFobf[2u9ωhsb]i:f-nlo ωgwot-hwωHOflFown5H 02 34O. 58291F% %descripto3 r706s38.:–1076% m%APfor Hollywood2 and average accuracy for HMDB5 1. The ω-HOF is used in subsequent evaluations. from flow) and can encode both kinds of flows (ω-flow or flow). For the sake of completeness, we evaluate all the variants as well as the combination of both flows in the descriptor computation stage. The results are presented in Table 2 and demonstrate the significant improvement obtained by computing the HOF descriptor with the ω-flow instead of the optical flow. Note that the type of trajectories which is used, either “Tracking flow” or “Tracking ω-flow”, has a limited impact in this case. From now on, we only consider the “Tracking ω-flow” case where HOF is computed along ω-trajectories. Interestingly, combining the HOF computed from the flow and the ω-flow further improves the results. This suggests that the two flow fields are complementary and the affine flow that was subtracted from ω-flow brings in additional information. For the sake of brevity, the combination of the two kinds of HOF, i.e., computed from the flow and the ω-flow using ω-trajectories, is referred to as the ω-HOF 222555555977 MethodHollywood2HMDB51 Tab(lerT3a.cIkmM inpBgMacHgωtBf-loH w [u2)s9in]gω f-lo wo MBH5 d42 e.052s7c% riptos:m34A90P.–3769f% orHllywood2 and average accuracy for HMDB5 1. DTerHasMjcBrOeblitpHGeFor4.ySumTωraw- frcilykto hw ionfgtheduωpCs-fcaolrtωmeiwNp-df/tl+Aωoutrw-finlogwthdesωcr- isTpc-fHtrMloaiOjrBpdswtGeHFosrc descriptor in the rest of this paper. Compared to the HOF baseline, the ω-HOF descriptor achieves a gain of +3.1% of mAP on Hollywood 2 and of +7.8% on HMDB51. Impact on MBH. Since MBH is computed from gradient of flow and cancel the constant motion, there is practically no benefit in using the ω-flow to compute the MBH descriptors, as shown in Table 3. However, by tracking ω-flow, the performance improves by around 1.3% for HMDB5 1 dataset and drops by around 1.5% for Hollywood2. This relative performance depends on the encoding technique. We will come back on this descriptor when considering another encoding scheme for local descriptors in Section 6. 4.3. Summary of compensated descriptors Table 4 summarizes the refined versions of the descriptors obtained by exploiting the ω-flow, and both ω-flow and the optical flow in the case of HOF. The revisited descriptors considerably improve the results compared to the orig- inal ones, with the noticeable exception of ω-MBH which gives mixed performance with a bag-of-features encoding scheme. But we already mention as this point that this incongruous behavior of ω-MBH is stabilized with the VLAD encoding scheme considered in Section 6. Another advantage of tracking the compensated flow is that fewer trajectories are produced. For instance, the total number of trajectories decreases by about 9. 16% and 22.81% on the Hollywood2 and HMDB51 datasets, respectively. 
Note that exploiting both the flow and the ω-flow do not induce much computational overhead, as the latter is obtained from the flow and the affine flow which is computed in real-time and already used to get the ω-trajectories. The only additional computational cost that we introduce by using the descriptors summarized in Table 4 is the computation of a second HOF descriptor, but this stage is relatively efficient and not the bottleneck of the extraction procedure. 5. Divergence-Curl-Shear descriptor This section introduces a new descriptor encoding the kinematic properties of motion discussed in Section 2.2. It is denoted by DCS in the rest of this paper. Combining kinematic features. The spatial derivatives are computed for the horizontal and vertical components of the flow field, which are used in turn to compute the divergence, curl and shear scalar values, see Equation 3. We consider all possible pairs of kinematic features, namely (div, curl), (div, shear) and (curl, shear). At each × ×× pixel, we compute the orientation and magnitude of the 2-D vector corresponding to each of these pairs. The orientation is quantized into histograms and the magnitude is used for weighting, similar to SIFT. Our motivation for encoding pairs is that the joint distribution of kinematic features conveys more information than exploiting them independently. Implementation details. The descriptor computation and parameters are similar to HOG and other popular descriptors such as MBH, HOF. We obtain 8-bin histograms for each of the three feature pairs or components of DCS. The range of possible angles is 2π for the (div,curl) pair and π for the other pairs, because the shear is always positive. The DCS descriptor is computed for a space-time volume aligned with a trajectory, as done with the four descriptors mentioned in the previous section. In order to capture the spatio-temporal structure of kinematic features, the volume (32 32 pixels and L = 15 frames) is subdivided into a spatio-temporal grid nofd s Lize = nx 5× f ny m×e nt, sw situhb nx =de ny =to 2a and nt = 3. These parameters ×hnave× × bneen fixed for the sake of consistency with the other descriptors. For each pair of kinematic features, each cell in the grid is represented by a histogram. The resulting local descriptors have a dimensionality equal to 288 = nx ny nt 8 3. At the video level, these descriptors are nenc×od end i×nto 8 a single vector representation using either BOF or the VLAD encoding scheme introduced in the next section. 6. VLAD in actions VLAD [8] is a descriptor encoding technique that aggregates the descriptors based on a locality criterion in the feature space. To our knowledge, this technique has never been considered for action recognition. Below, we briefly introduce this approach and give the performance achieved for all the descriptors introduced along the previous sections. VLAD in brief. Similar to BOF, VLAD relies on a codebook C = {c1, c2 , ...ck} of k centroids learned by k-means. bTohoek representation is ob}t oaifn ked c by summing, efodr b yea kch-m mveiasunasl. word ci, the differences x − ci of the vectors x assigned to ci, thereby producing a sv exct −or c representation oflength d×k, 222555556088 DMeBscHriptorV5 LH.A1o%Dlywo5Bo4d.O2 %F4V3L.3HA%MD B35B91.O7%F Taωbl-eDHM5rOCBa.FSjGPdHe+rsωfco-mMHaBOnFeofV54L2936A.51D% with5431ω208-.5T96% rajde3s42c97158,.ω3% -HOG342,58019ω.6-% HOF descriptors and their combination. where d is the dimension ofthe local descriptors. We use the codebook size, k = 256. 
Despite this large dimensionality, VLAD is efficient because it is effectively compared with a linear kernel. VLAD is post-processed using a componentwise power normalization, which dramatically improves its performance [8]. While cross validating the parameter α involved in this power normalization, we consistently observe, for all the descriptors, a value between 0.15 and 0.3. Therefore, this parameter is set to α = 0.2 in all our experiments. For classification, we use a linear SVM and oneagainst-rest approach everywhere, unless stated otherwise. Impact on existing descriptors. We employ VLAD because it is less sensitive to quantization parameters and appears to provide better performance with descriptors having a large dimensionality. These properties are interesting in our case, because the quantization parameters involved in the DCS and MBH descriptors have been used unchanged in Section 4 for the sake of direct comparison. They might be suboptimal when using the ω-flow instead of the optical flow on which they have initially been optimized [29]. Results for MBH and ω-MBH in Table 5 supports this argument. When using VLAD instead of BOF, the scores are stable in both the cases and there is no mixed inference as that observed in Table 3. VLAD also has significant positive influence on accuracy of ω-DCS descriptor. We also observe that ω-DCS is complementary to ω-MBH and adds to the performance. Still DCS is probably not best utilized in the current setting of parameters. In case of ω-Trajdesc and ω-HOG, the scores are better with BOF on both the datasets. ω-HOF with VLAD improves on HMDB5 1, but remains equivalent for Hollywood2. Although BOF leads to better scores for the descriptors considered individually, their combination with VLAD outperforms the BOF. 7. Comparison with the state of the art This section reports our results with all descriptors combined and compares our method with the state of the art. TrajectorCy+omHbOiGna+tHioOnF+ MDBCHSHol5 l98y.w76%o%od2H4M489.D02%B%51 All ω-descriptors all five compensated descriptors using combined62.5%52.1% Table 6. Combination of VLAD representation. WU*JVliaOnughreM tHaeolth. [yo2w9d87o] 256 0985. 37% SKa*duOeJinhrau tnegdMteHatlMa.h [ol1Dd.0B[91]25 24 609.8172% Table 7. Comparison with the state of the art on Hollywood2 and HMDB5 1 datasets. *Vig et al. [28] gets 61.9% by using external eye movements data. *Jiang et al. [9] used one-vs-one multi class SVM while our and other methods use one-vs-rest SVMs. With one-against-one multi class SVM we obtain 45. 1% for HMDB51. Descriptor combination. Table 6 reports the results obtained when the descriptors are combined. Since we use VLAD, our baseline is updated that is combination of Trajectory, HOG, HOF and MBH with VLAD representation. When DCS is added to the baseline there is an improvement of 0.9% and 1.2%. With combination of all five compensated descriptors we obtain 62.5% and 52.1% on the two datasets. This is a large improvement even over the updated baseline, which shows that the proposed motion compensation and the way we exploit it are significantly important for action recognition. The comparison with the state of the art is shown in Table 7. Our method outperforms all the previously reported results in the literature. In particular, on the HMDB51 dataset, the improvement over the best reported results to date is more than 11% in average accuracy. Jiang el al. [9] used a one-against-one multi-class SVM, which might have resulted in inferior scores. 
7. Comparison with the state of the art

This section reports our results with all descriptors combined and compares our method with the state of the art.

Table 6. Combination of the compensated descriptors with the VLAD representation on Hollywood2 and HMDB51: the Trajectory+HOG+HOF+MBH baseline, the baseline with DCS added, and all five compensated descriptors combined (62.5% / 52.1%).

Table 7. Comparison with the state of the art on the Hollywood2 and HMDB51 datasets. *Vig et al. [28] obtain 61.9% by using external eye-movement data. *Jiang et al. [9] used a one-vs-one multi-class SVM, while our and the other methods use one-vs-rest SVMs; with a one-against-one multi-class SVM we obtain 45.1% for HMDB51.

Descriptor combination. Table 6 reports the results obtained when the descriptors are combined. Since we use VLAD, our baseline is updated accordingly: it is the combination of Trajectory, HOG, HOF and MBH with the VLAD representation. When DCS is added to this baseline, there is an improvement of 0.9% and 1.2% on the two datasets. With the combination of all five compensated descriptors we obtain 62.5% and 52.1%. This is a large improvement even over the updated baseline, which shows that the proposed motion compensation, and the way we exploit it, matter significantly for action recognition. The comparison with the state of the art is shown in Table 7. Our method outperforms all previously reported results in the literature. In particular, on the HMDB51 dataset, the improvement over the best result reported to date is more than 11% in average accuracy. Jiang et al. [9] used a one-against-one multi-class SVM, which might have resulted in inferior scores. With a similar multi-class SVM approach, our method obtains 45.1%, which remains significantly better than their result. All other results were reported with the one-against-rest approach. On the Olympic Sports dataset we obtain an mAP of 83.2% with all ω-descriptors combined; the improvement is mostly due to VLAD and the ω-flow. The best previously reported mAPs on this dataset are those of Liu et al. [14] (74.4%) and Jiang et al. [9] (80.6%), which we exceed convincingly. Gaidon et al. [6] report the best average accuracy, 82.7%.

8. Conclusions

This paper first demonstrates the benefit of canceling the dominant motion (predominantly camera motion) so that the visual motion is truly related to the actions, for both the trajectory extraction and the descriptor computation stages. It produces significantly better versions (called compensated descriptors) of several state-of-the-art local descriptors for action recognition. The simplicity, efficiency and effectiveness of this motion-compensation approach make it applicable to any action recognition framework based on motion descriptors and trajectories. The second contribution is the new DCS descriptor derived from the first-order scalar motion quantities specifying the local motion patterns. It captures additional information which proves complementary to the other descriptors. Finally, we show that the VLAD encoding technique, used instead of bag-of-words, boosts several action descriptors and yields significantly better overall performance when different types of descriptors are combined. Our contributions are complementary and, combined, significantly outperform the state of the art, as demonstrated by our extensive experiments on the Hollywood2, HMDB51 and Olympic Sports datasets.

Acknowledgments

This work was supported by the Quaero project, funded by OSEO, the French agency for innovation. We acknowledge Heng Wang's help in reproducing some of their results.

References

[1] S. Ali and M. Shah. Human action recognition in videos using kinematic features and multiple instance learning. IEEE T-PAMI, 32(2):288–303, Feb. 2010.
[2] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, Sep. 2010.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, Jun. 2005.
[4] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, May 2006.
[5] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, Oct. 2005.
[6] A. Gaidon, Z. Harchaoui, and C. Schmid. Recognizing activities with cluster-trees of tracklets. In BMVC, Sep. 2012.
[7] A. Hervieu, P. Bouthemy, and J.-P. Le Cadre. A statistical video content recognition method using invariant features on object trajectories. IEEE T-CSVT, 18(11):1533–1543, 2008.
[8] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid. Aggregating local descriptors into compact codes. IEEE T-PAMI, 34(9):1704–1716, 2012.
[9] Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based modeling of human actions with motion reference points. In ECCV, Oct. 2012.
[10] O. Kliper-Gross, Y. Gurovich, T. Hassner, and L. Wolf. Motion interchange patterns for action recognition in unconstrained videos. In ECCV, Oct. 2012.
[11] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, Nov. 2011.
[12] I. Laptev and T. Lindeberg. Space-time interest points. In ICCV, Oct. 2003.
[13] I. Laptev, M. Marzalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, Jun. 2008.
[14] J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In CVPR, Jun. 2011.
[15] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, Nov. 2004.
[16] M. Marzalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, Jun. 2009.
[17] P. Matikainen, M. Hebert, and R. Sukthankar. Trajectons: Action recognition through the motion analysis of tracked features. In Workshop on Video-Oriented Object and Event Classification, ICCV, Sep. 2009.
[18] R. Messing, C. J. Pal, and H. A. Kautz. Activity recognition using the velocity histories of tracked keypoints. In ICCV, Sep. 2009.
[19] J. C. Niebles, C.-W. Chen, and F.-F. Li. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, Sep. 2010.
[20] J.-M. Odobez and P. Bouthemy. Robust multiresolution estimation of parametric motion models. Journal of Visual Communication and Image Representation, 6(4):348–365, Dec. 1995.
[21] G. Piriou, P. Bouthemy, and J.-F. Yao. Recognition of dynamic video contents with global probabilistic models of visual motion. IEEE T-IP, 15(11):3417–3430, 2006.
[22] S. Sadanand and J. J. Corso. Action bank: A high-level representation of activity in video. In CVPR, Jun. 2012.
[23] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. IEEE T-PAMI, 19(5):530–534, May 1997.
[24] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, pages 1470–1477, Oct. 2003.
[25] J. Sun, X. Wu, S. Yan, L. F. Cheong, T.-S. Chua, and J. Li. Hierarchical spatio-temporal context modeling for action recognition. In CVPR, Jun. 2009.
[26] H. Uemura, S. Ishikawa, and K. Mikolajczyk. Feature tracking and motion compensation for action recognition. In BMVC, Sep. 2008.
[27] M. M. Ullah, S. N. Parizi, and I. Laptev. Improving bag-of-features action recognition with non-local cues. In BMVC, Sep. 2010.
[28] E. Vig, M. Dorr, and D. Cox. Saliency-based space-variant descriptor sampling for action recognition. In ECCV, Oct. 2012.
[29] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, Jun. 2011.
[30] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, Sep. 2009.
[31] G. Willems, T. Tuytelaars, and L. J. V. Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In ECCV, Oct. 2008.
[32] S. Wu, O. Oreifej, and M. Shah. Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories. In ICCV, Nov. 2011.
6 0.24890405 335 cvpr-2013-Poselet Conditioned Pictorial Structures
7 0.24870418 40 cvpr-2013-An Approach to Pose-Based Action Recognition
8 0.23434255 207 cvpr-2013-Human Pose Estimation Using a Joint Pixel-wise and Part-wise Formulation
9 0.23281769 345 cvpr-2013-Real-Time Model-Based Rigid Object Pose Estimation and Tracking Combining Dense and Sparse Visual Cues
10 0.22316332 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image
11 0.20458274 10 cvpr-2013-A Fully-Connected Layered Model of Foreground and Background Flow
12 0.2030894 439 cvpr-2013-Tracking Human Pose by Tracking Symmetric Parts
13 0.20064031 362 cvpr-2013-Robust Monocular Epipolar Flow Estimation
14 0.19859903 60 cvpr-2013-Beyond Physical Connections: Tree Models in Human Pose Estimation
15 0.18544263 444 cvpr-2013-Unconstrained Monocular 3D Human Pose Estimation by Action Detection and Cross-Modality Regression Forest
16 0.17454366 203 cvpr-2013-Hierarchical Video Representation with Trajectory Binary Partition Tree
17 0.17133993 124 cvpr-2013-Determining Motion Directly from Normal Flows Upon the Use of a Spherical Eye Platform
18 0.16166747 300 cvpr-2013-Multi-target Tracking by Lagrangian Relaxation to Min-cost Network Flow
19 0.16026787 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
20 0.1561358 46 cvpr-2013-Articulated and Restricted Motion Subspaces and Their Signatures
topicId topicWeight
[(0, 0.276), (1, 0.073), (2, 0.047), (3, -0.193), (4, -0.09), (5, -0.031), (6, 0.197), (7, -0.017), (8, 0.036), (9, -0.006), (10, 0.039), (11, 0.359), (12, -0.008), (13, -0.004), (14, 0.198), (15, 0.12), (16, 0.063), (17, -0.221), (18, -0.157), (19, -0.039), (20, -0.036), (21, -0.04), (22, 0.004), (23, -0.071), (24, -0.013), (25, 0.03), (26, 0.025), (27, 0.005), (28, 0.003), (29, -0.003), (30, 0.062), (31, -0.032), (32, 0.051), (33, 0.049), (34, 0.004), (35, 0.005), (36, 0.064), (37, 0.068), (38, -0.012), (39, -0.004), (40, -0.003), (41, -0.102), (42, 0.052), (43, 0.041), (44, 0.003), (45, -0.024), (46, 0.051), (47, 0.008), (48, 0.006), (49, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.98907256 334 cvpr-2013-Pose from Flow and Flow from Pose
Author: Katerina Fragkiadaki, Han Hu, Jianbo Shi
Abstract: Human pose detectors, although successful in localising faces and torsos of people, often fail with lower arms. Motion estimation is often inaccurate under fast movements of body parts. We build a segmentation-detection algorithm that mediates the information between body parts recognition, and multi-frame motion grouping to improve both pose detection and tracking. Motion of body parts, though not accurate, is often sufficient to segment them from their backgrounds. Such segmentations are crucialfor extracting hard to detect body parts out of their interior body clutter. By matching these segments to exemplars we obtain pose labeled body segments. The pose labeled segments and corresponding articulated joints are used to improve the motion flow fields by proposing kinematically constrained affine displacements on body parts. The pose-based articulated motion model is shown to handle large limb rotations and displacements. Our algorithm can detect people under rare poses, frequently missed by pose detectors, showing the benefits of jointly reasoning about pose, segmentation and motion in videos.
2 0.74011385 244 cvpr-2013-Large Displacement Optical Flow from Nearest Neighbor Fields
Author: Zhuoyuan Chen, Hailin Jin, Zhe Lin, Scott Cohen, Ying Wu
Abstract: We present an optical flow algorithm for large displacement motions. Most existing optical flow methods use the standard coarse-to-fine framework to deal with large displacement motions which has intrinsic limitations. Instead, we formulate the motion estimation problem as a motion segmentation problem. We use approximate nearest neighbor fields to compute an initial motion field and use a robust algorithm to compute a set of similarity transformations as the motion candidates for segmentation. To account for deviations from similarity transformations, we add local deformations in the segmentation process. We also observe that small objects can be better recovered using translations as the motion candidates. We fuse the motion results obtained under similarity transformations and under translations together before a final refinement. Experimental validation shows that our method can successfully handle large displacement motions. Although we particularly focus on large displacement motions in this work, we make no sacrifice in terms of overall performance. In particular, our method ranks at the top of the Middlebury benchmark.
Author: Karl Pauwels, Leonardo Rubio, Javier Díaz, Eduardo Ros
Abstract: We propose a novel model-based method for estimating and tracking the six-degrees-of-freedom (6DOF) pose of rigid objects of arbitrary shapes in real-time. By combining dense motion and stereo cues with sparse keypoint correspondences, and by feeding back information from the model to the cue extraction level, the method is both highly accurate and robust to noise and occlusions. A tight integration of the graphical and computational capability of Graphics Processing Units (GPUs) results in pose updates at framerates exceeding 60 Hz. Since a benchmark dataset that enables the evaluation of stereo-vision-based pose estimators in complex scenarios is currently missing in the literature, we have introduced a novel synthetic benchmark dataset with varying objects, background motion, noise and occlusions. Using this dataset and a novel evaluation methodology, we show that the proposed method greatly outperforms state-of-the-art methods. Finally, we demonstrate excellent performance on challenging real-world sequences involving object manipulation.
4 0.69486326 10 cvpr-2013-A Fully-Connected Layered Model of Foreground and Background Flow
Author: Deqing Sun, Jonas Wulff, Erik B. Sudderth, Hanspeter Pfister, Michael J. Black
Abstract: Layered models allow scene segmentation and motion estimation to be formulated together and to inform one another. Traditional layered motion methods, however, employ fairly weak models of scene structure, relying on locally connected Ising/Potts models which have limited ability to capture long-range correlations in natural scenes. To address this, we formulate a fully-connected layered model that enables global reasoning about the complicated segmentations of real objects. Optimization with fully-connected graphical models is challenging, and our inference algorithm leverages recent work on efficient mean field updates for fully-connected conditional random fields. These methods can be implemented efficiently using high-dimensional Gaussian filtering. We combine these ideas with a layered flow model, and find that the long-range connections greatly improve segmentation into figure-ground layers when compared with locally connected MRF models. Experiments on several benchmark datasets show that the method can recover fine structures and large occlusion regions, with good flow accuracy and much lower computational cost than previous locally-connected layered models.
5 0.69386142 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction
Author: Dennis Park, C. Lawrence Zitnick, Deva Ramanan, Piotr Dollár
Abstract: We describe novel but simple motion features for the problem of detecting objects in video sequences. Previous approaches either compute optical flow or temporal differences on video frame pairs with various assumptions about stabilization. We describe a combined approach that uses coarse-scale flow and fine-scale temporal difference features. Our approach performs weak motion stabilization by factoring out camera motion and coarse object motion while preserving nonrigid motions that serve as useful cues for recognition. We show results for pedestrian detection and human pose estimation in video sequences, achieving state-of-the-art results in both. In particular, given a fixed detection rate our method achieves a five-fold reduction in false positives over prior art on the Caltech Pedestrian benchmark. Finally, we perform extensive diagnostic experiments to reveal what aspects of our system are crucial for good performance. Proper stabilization, long time-scale features, and proper normalization are all critical.
6 0.68145454 439 cvpr-2013-Tracking Human Pose by Tracking Symmetric Parts
7 0.6687308 335 cvpr-2013-Poselet Conditioned Pictorial Structures
8 0.65064961 2 cvpr-2013-3D Pictorial Structures for Multiple View Articulated Pose Estimation
9 0.64475638 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image
10 0.64395112 124 cvpr-2013-Determining Motion Directly from Normal Flows Upon the Use of a Spherical Eye Platform
11 0.63628858 362 cvpr-2013-Robust Monocular Epipolar Flow Estimation
12 0.62548512 45 cvpr-2013-Articulated Pose Estimation Using Discriminative Armlet Classifiers
13 0.6220538 207 cvpr-2013-Human Pose Estimation Using a Joint Pixel-wise and Part-wise Formulation
14 0.62126231 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
15 0.61733156 206 cvpr-2013-Human Pose Estimation Using Body Parts Dependent Joint Regressors
16 0.60687441 40 cvpr-2013-An Approach to Pose-Based Action Recognition
17 0.60504603 88 cvpr-2013-Compressible Motion Fields
18 0.59502685 277 cvpr-2013-MODEC: Multimodal Decomposable Models for Human Pose Estimation
19 0.58397162 203 cvpr-2013-Hierarchical Video Representation with Trajectory Binary Partition Tree
20 0.58114964 46 cvpr-2013-Articulated and Restricted Motion Subspaces and Their Signatures
topicId topicWeight
[(10, 0.091), (26, 0.057), (28, 0.024), (33, 0.328), (37, 0.148), (65, 0.014), (67, 0.106), (69, 0.044), (76, 0.01), (80, 0.032), (87, 0.059)]
simIndex simValue paperId paperTitle
1 0.94572043 264 cvpr-2013-Learning to Detect Partially Overlapping Instances
Author: Carlos Arteta, Victor Lempitsky, J. Alison Noble, Andrew Zisserman
Abstract: The objective of this work is to detect all instances of a class (such as cells or people) in an image. The instances may be partially overlapping and clustered, and hence quite challenging for traditional detectors, which aim at localizing individual instances. Our approach is to propose a set of candidate regions, and then select regions based on optimizing a global classification score, subject to the constraint that the selected regions are non-overlapping. Our novel contribution is to extend standard object detection by introducing separate classes for tuples of objects into the detection process. For example, our detector can pick a region containing two or three object instances, while assigning such region an appropriate label. We show that this formulation can be learned within the structured output SVM framework, and that the inference in such model can be accomplished using dynamic programming on a tree structured region graph. Furthermore, the learning only requires weak annotations – a dot on each instance. The improvement resulting from the addition of the capability to detect tuples of objects is demonstrated on quite disparate data sets: fluorescence microscopy images and UCSD pedestrians.
same-paper 2 0.94230866 334 cvpr-2013-Pose from Flow and Flow from Pose
Author: Katerina Fragkiadaki, Han Hu, Jianbo Shi
Abstract: Human pose detectors, although successful in localising faces and torsos of people, often fail with lower arms. Motion estimation is often inaccurate under fast movements of body parts. We build a segmentation-detection algorithm that mediates the information between body parts recognition, and multi-frame motion grouping to improve both pose detection and tracking. Motion of body parts, though not accurate, is often sufficient to segment them from their backgrounds. Such segmentations are crucialfor extracting hard to detect body parts out of their interior body clutter. By matching these segments to exemplars we obtain pose labeled body segments. The pose labeled segments and corresponding articulated joints are used to improve the motion flow fields by proposing kinematically constrained affine displacements on body parts. The pose-based articulated motion model is shown to handle large limb rotations and displacements. Our algorithm can detect people under rare poses, frequently missed by pose detectors, showing the benefits of jointly reasoning about pose, segmentation and motion in videos.
3 0.92978549 217 cvpr-2013-Improving an Object Detector and Extracting Regions Using Superpixels
Author: Guang Shu, Afshin Dehghan, Mubarak Shah
Abstract: We propose an approach to improve the detection performance of a generic detector when it is applied to a particular video. The performance of offline-trained object detectors is usually degraded in unconstrained video environments due to varying illumination, backgrounds and camera viewpoints. Moreover, most object detectors are trained using Haar-like features or gradient features but ignore video-specific features like consistent color patterns. In our approach, we apply a Superpixel-based Bag-of-Words (BoW) model to iteratively refine the output of a generic detector. Compared to other related work, our method builds a video-specific detector using superpixels, hence it can handle the problem of appearance variation. Most importantly, using a Conditional Random Field (CRF) along with our superpixel-based BoW model, we develop an algorithm to segment the object from the background. Therefore our method generates an output of the exact object regions instead of the bounding boxes generated by most detectors. In general, our method takes detection bounding boxes of a generic detector as input and generates the detection output with higher average precision and precise object regions. The experiments on four recent datasets demonstrate the effectiveness of our approach, which significantly improves the state-of-the-art detector by 5-16% in average precision.
4 0.91961086 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
Author: Xiaohui Shen, Zhe Lin, Jonathan Brandt, Ying Wu
Abstract: Detecting faces in uncontrolled environments continues to be a challenge to traditional face detection methods [24] due to the large variation in facial appearances, as well as occlusion and clutter. In order to overcome these challenges, we present a novel and robust exemplar-based face detector that integrates image retrieval and discriminative learning. A large database of faces with bounding rectangles and facial landmark locations is collected, and simple discriminative classifiers are learned from each of them. A voting-based method is then proposed to let these classifiers cast votes on the test image through an efficient image retrieval technique. As a result, faces can be very efficiently detected by selecting the modes from the voting maps, without resorting to exhaustive sliding window-style scanning. Moreover, due to the exemplar-based framework, our approach can detect faces under challenging conditions without explicitly modeling their variations. Evaluation on two public benchmark datasets shows that our new face detection approach is accurate and efficient, and achieves the state-of-the-art performance. We further propose to use image retrieval for face validation (in order to remove false positives) and for face alignment/landmark localization. The same methodology can also be easily generalized to other face-related tasks, such as attribute recognition, as well as general object detection.
5 0.91823775 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection
Author: Jianguo Li, Yimin Zhang
Abstract: This paper presents a novel learning framework for training a boosting cascade based object detector from a large scale dataset. The framework is derived from the well-known Viola-Jones (VJ) framework but distinguished by three key differences. First, the proposed framework adopts multi-dimensional SURF features instead of single dimensional Haar features to describe local patches. In this way, the number of used local patches can be reduced from hundreds of thousands to several hundreds. Second, it adopts logistic regression as weak classifier for each local patch instead of decision trees in the VJ framework. Third, we adopt AUC as a single criterion for the convergence test during cascade training rather than the two trade-off criteria (false-positive-rate and hit-rate) in the VJ framework. The benefit is that the false-positive-rate can be adaptive among different cascade stages, and thus yields much faster convergence speed of the SURF cascade. Combining these points together, the proposed approach has three good properties. First, the boosting cascade can be trained very efficiently. Experiments show that the proposed approach can train object detectors from billions of negative samples within one hour even on personal computers. Second, the built detector is comparable to the state-of-the-art algorithm not only on the accuracy but also on the processing speed. Third, the built detector is small in model-size due to short cascade stages.
6 0.91821223 273 cvpr-2013-Looking Beyond the Image: Unsupervised Learning for Object Saliency and Detection
8 0.91584921 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
9 0.91547507 204 cvpr-2013-Histograms of Sparse Codes for Object Detection
10 0.91509932 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image
11 0.91494089 207 cvpr-2013-Human Pose Estimation Using a Joint Pixel-wise and Part-wise Formulation
12 0.91419804 318 cvpr-2013-Optimized Pedestrian Detection for Multiple and Occluded People
13 0.91379994 206 cvpr-2013-Human Pose Estimation Using Body Parts Dependent Joint Regressors
14 0.91341817 438 cvpr-2013-Towards Pose Robust Face Recognition
15 0.9130494 167 cvpr-2013-Fast Multiple-Part Based Object Detection Using KD-Ferns
16 0.91290909 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence
17 0.91279888 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision
18 0.91273588 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs
19 0.9125455 89 cvpr-2013-Computationally Efficient Regression on a Dependency Graph for Human Pose Estimation
20 0.91243523 168 cvpr-2013-Fast Object Detection with Entropy-Driven Evaluation