cvpr cvpr2013 cvpr2013-244 knowledge-graph by maker-knowledge-mining

244 cvpr-2013-Large Displacement Optical Flow from Nearest Neighbor Fields


Source: pdf

Author: Zhuoyuan Chen, Hailin Jin, Zhe Lin, Scott Cohen, Ying Wu

Abstract: We present an optical flow algorithm for large displacement motions. Most existing optical flow methods use the standard coarse-to-fine framework to deal with large displacement motions which has intrinsic limitations. Instead, we formulate the motion estimation problem as a motion segmentation problem. We use approximate nearest neighbor fields to compute an initial motion field and use a robust algorithm to compute a set of similarity transformations as the motion candidates for segmentation. To account for deviations from similarity transformations, we add local deformations in the segmentation process. We also observe that small objects can be better recovered using translations as the motion candidates. We fuse the motion results obtained under similarity transformations and under translations together before a final refinement. Experimental validation shows that our method can successfully handle large displacement motions. Although we particularly focus on large displacement motions in this work, we make no sacrifice in terms of overall performance. In particular, our method ranks at the top of the Middlebury benchmark.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We present an optical flow algorithm for large displacement motions. [sent-3, score-0.737]

2 Most existing optical flow methods use the standard coarse-to-fine framework to deal with large displacement motions which has intrinsic limitations. [sent-4, score-0.984]

3 Instead, we formulate the motion estimation problem as a motion segmentation problem. [sent-5, score-1.006]

4 We use approximate nearest neighbor fields to compute an initial motion field and use a robust algorithm to compute a set of similarity transformations as the motion candidates for segmentation. [sent-6, score-1.71]

5 We also observe that small objects can be better recovered using translations as the motion candidates. [sent-8, score-0.531]

6 We fuse the motion results obtained under similarity transformations and under translations together before a final refinement. [sent-9, score-0.746]

7 It started with Horn and Schunck’s original optical flow work [15] in the early eighties. [sent-15, score-0.594]

8 However, a good solution still remains elusive in challenging situations such as occlusions, motion boundaries, texture-less regions, and/or large displacement motions. [sent-17, score-0.597]

9 This paper addresses particularly the issue of large displacement motions in optical flow. [sent-18, score-0.619]

10 Most existing optical flow formulations are based on linearizing the optical flow constraint which requires an initial motion field between the two images. [sent-19, score-1.755]

11 In the absence of any prior knowledge, they use zero as the initial motion field which is then refined by a gradient-based optimization technique. [sent-20, score-0.59]

12 Bottom left: color-coded motion computed from an approximate nearest neighbor field computed using [17]. [sent-24, score-0.998]
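
A color-coded motion image like the one referenced in this caption can be reproduced with the standard HSV flow rendering. The sketch below is an OpenCV-based stand-in (not necessarily the exact Middlebury color wheel used in the paper's figures): it maps offset direction to hue and magnitude to brightness; the function name and normalization are illustrative choices.

```python
import cv2
import numpy as np

def flow_to_color(flow):
    """Render a flow/NNF offset field of shape (H, W, 2), ordered (dx, dy),
    as a BGR image: hue encodes direction, brightness encodes magnitude."""
    mag, ang = cv2.cartToPolar(flow[..., 0].astype(np.float32),
                               flow[..., 1].astype(np.float32),
                               angleInDegrees=True)
    hsv = np.zeros(flow.shape[:2] + (3,), dtype=np.uint8)
    hsv[..., 0] = (ang / 2).astype(np.uint8)   # OpenCV hue lives in [0, 180)
    hsv[..., 1] = 255                          # full saturation
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255,
                                cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```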

13 To handle large displacement motions, most optical flow methods adopt a multi-scale coarse-to-fine framework which sub-samples the images when going from a fine scale to a coarse scale. [sent-31, score-0.737]

14 Sub-sampling reduces the size of the images and the motion within, but at the same time, the reduction in image size leads to a loss of motion details that any algorithm can recover. [sent-32, score-0.908]

15 However, it turns out that it is possible to obtain reliable motion information for a sparse set of distinct image locations using robust keypoint detection and matching such as [19]. [sent-36, score-0.563]

16 One can incorporate the sparse matches into a dense field through either motion segmentation [16], constraints [9], or fusion [26]. [sent-37, score-0.743]

17 In this paper, we propose to incorporate a different type of correspondence information between two images, namely nearest neighbor fields [5]. [sent-41, score-0.544]

18 A nearest neighbor field between two images is defined as, for each patch in one image, the most similar patch in the other image. [sent-42, score-0.563]

19 Computing exact nearest neighbor fields can be computationally expensive depending on the size of images but there exist efficient approximate algorithms such as [5, 13, 17, 21]. [sent-43, score-0.535]
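
To make the notion concrete, the sketch below computes a crude approximate nearest neighbor field with off-the-shelf tools (flattened patches indexed by a KD-tree). It only illustrates the data structure the paper consumes, not the CSH [17] or PatchMatch [5] algorithms; the patch size and the scikit-learn index are illustrative choices.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from sklearn.neighbors import NearestNeighbors

def approximate_nnf(img1, img2, patch=7):
    """Rough approximate nearest neighbor field between two grayscale images
    of identical shape. Returns an (H', W', 2) array of (dy, dx) offsets for
    every patch position, where H' = H - patch + 1 and W' = W - patch + 1."""
    # Extract all overlapping patches and flatten them into feature vectors.
    p1 = sliding_window_view(img1, (patch, patch)).reshape(-1, patch * patch)
    p2 = sliding_window_view(img2, (patch, patch)).reshape(-1, patch * patch)

    h = img1.shape[0] - patch + 1
    w = img1.shape[1] - patch + 1

    # Approximate nearest neighbors with a KD-tree (stand-in for CSH/PatchMatch).
    index = NearestNeighbors(n_neighbors=1, algorithm="kd_tree").fit(p2)
    _, nn_idx = index.kneighbors(p1)

    # Convert matched linear indices into a per-pixel offset (flow-like) field.
    ys, xs = np.unravel_index(nn_idx.ravel(), (h, w))
    grid_y, grid_x = np.mgrid[0:h, 0:w]
    nnf = np.stack([ys.reshape(h, w) - grid_y,
                    xs.reshape(h, w) - grid_x], axis=-1)
    return nnf
```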

20 The first key observation in this paper is that although not designed for the motion estimation problem, approximate nearest neighbor fields contain a sufficiently high number of patches with approximately correct motions (see Figure 1). [sent-45, score-1.32]

21 However, one cannot directly use nearest neighbor fields as the input for a nonlinear refinement because they often contain a huge amount of noise. [sent-46, score-0.544]

22 Based on this observation, we can view the problem as a motion segmentation problem. [sent-48, score-0.518]

23 In particular, we segment the images into a set of regions that have similar motion patterns using a multi-label graph-cut algorithm [8]. [sent-49, score-0.587]

24 We compute the motion patterns from a noisy nearest neighbor field using an algorithm that is robust to noise. [sent-50, score-1.126]

25 There are two issues in the motion segmentation formulation: the type of motion patterns and their number. [sent-51, score-1.105]

26 These two issues are related in that more complex motion patterns can describe the image with fewer patterns. [sent-52, score-0.587]

27 In this paper, we choose to use similarity transformations as the motion pattern. [sent-54, score-0.608]

28 Our third key observation is that it is not necessary for the motion segmentation to obtain perfect results in terms of motion estimation, as long as the error is within the limit of a typical optical flow refinement. [sent-57, score-1.591]

29 Based on this observation, we propose to allow for small deformations on top of the similarity transformations in motion segmentation. [sent-58, score-0.688]

30 Finally, we observe that although motion segmentation with similarity transformations and local deformations is very effective in terms of capturing the overall motion between two images, it may sometimes miss objects of small scale. [sent-59, score-1.229]

31 Therefore, we perform a fusion between the motion segmentation result under translations and that under the similarity transformations before a final refinement. [sent-61, score-0.859]

32 Related Work There is a huge body of literature on optical flow following the original work of Horn and Schunck [15]. [sent-64, score-0.594]

33 We only discuss the papers that address the large displacement motion problem as that is the main focus of this work. [sent-66, score-0.597]

34 It has since been adopted by most optical flow algorithms to handle large displacement motions. [sent-68, score-0.76]

35 Brox and Malik [9] proposed to add robust keypoint detection and matching such as the SIFT features [19] into the classical optical flow framework which can handle arbitrarily large motion without any performance sacrifice. [sent-74, score-1.151]

36 Instead of adding keypoint matches as constraints, they expand the matches into candidate motion fields and fuse them with the standard optical flow result. [sent-77, score-1.243]

37 Our work is related to [16] where motion segmentation is also used to deal with large displacement motions. [sent-79, score-0.661]

38 The differences are that we use nearest neighbor fields as the input and we allow local deformations in motion segmentation, both of which are shown to improve the overall performance. [sent-80, score-1.03]

39 Motion segmentation is related to the so-called layer representation in optical flow [24, 25]. [sent-81, score-0.658]

40 The advantages of having an explicit segmentation are that we can incorporate simple models such as translation or similarity transformations to describe the motion between two images and integrate the correspondence information within individual segments. [sent-82, score-0.779]

41 The differences are that [6] uses a variant of Belief Propagation to regularize the noisy nearest neighbor fields and we use motion segmentation. [sent-85, score-0.984]

42 Contributions. The main contribution of this work is an optical flow algorithm that can handle large displacement motions. [sent-88, score-0.76]

43 In particular, we improve upon existing methods in the following ways: • We use approximate nearest neighbor field algorithms to compute an initial dense correspondence field. [sent-89, score-0.592]

44 It turns out the approximate nearest neighbor field contains a high percentage of approximately accurate motions which can be used by robust algorithms to recover the dominant motion patterns. [sent-90, score-1.458]

45 • We formulate the motion estimation problem as a motion segmentation problem and allow local deformations in the segmentation process. [sent-91, score-0.616]

46 Having local deformations significantly reduces the number of motion patterns needed to describe the motion and therefore improves the robustness of the algorithm. [sent-92, score-1.121]

47 • We use a novel fusion algorithm to merge the motion result under translations with that under similarity transformations. [sent-94, score-0.685]

48 Admittedly, our method focuses on the large displacement motion issue in optical flow and does not explicitly address other outstanding issues, such as occlusions, motion boundaries, etc. [sent-95, score-1.645]

49 Nearest Neighbor Fields (NNF). Given a pair of input images, we first compute an approximate nearest neighbor field between them using Coherency Sensitive Hashing (CSH) [17]. [sent-101, score-0.544]

50 As noted in the introduction, the nearest neighbor field is approximately consistent with the ground truth flow field in the majority of pixel locations. [sent-102, score-1.032]

51 Our empirical study shows that this is a valid assumption for most cases in optical flow estimation. [sent-103, score-0.594]

52 Under this assumption, there are two advantages in leveraging the nearest neighbor field for optical flow problems. [sent-104, score-1.099]

53 Firstly, since nearest neighbor field algorithms are not restricted by the magnitude of motions, they can provide valuable information for handling large motions in optical flow estimation, which has been a main challenge for traditional optical flow algorithms. [sent-105, score-1.98]

54 Secondly, although the nearest neighbor fields are generally noisy, they retain motion details for small image structures, which would most likely be ignored by traditional optical flow algorithms. [sent-106, score-1.544]

55 Directly applying the nearest neighbor field as an initialization to traditional optical flow algorithms cannot recover from large errors in the nearest neighbor field since these algorithms only refine flows locally, which makes noise handling crucial in our formulation. [sent-107, score-1.846]

56 Dominant Motion Patterns. To suppress noise in the nearest neighbor field, we propose a motion segmentation-based method that restricts the initial nearest neighbor field to a sparse set of dominant motion patterns. [sent-115, score-1.993]

57 To achieve this, we first identify those patterns robustly based on simple geometric transformation models between the two images using an iterative RANSAC algorithm and then use them to compute motion segmentation from the nearest neighbor field. [sent-116, score-1.101]

58 We can simply adopt histogram statistics as in [14] to extract K most frequent motion modes and use them as the dominant motion patterns. [sent-117, score-1.152]
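
A minimal sketch of this histogram step, assuming the NNF is an (H, W, 2) integer offset field: quantize the offsets into a 2D histogram and keep the K most frequent bins as translational motion modes. The bin width and K are illustrative values, not the paper's.

```python
import numpy as np

def dominant_translations(nnf, k=10, bin_size=1):
    """Return the k most frequent (dy, dx) offsets in a nearest neighbor field."""
    offsets = (nnf.reshape(-1, 2) // bin_size).astype(int)   # quantize offsets
    # Hash each 2D offset to a single integer so np.unique can count it.
    dy, dx = offsets[:, 0], offsets[:, 1]
    span = dx.max() - dx.min() + 1
    codes = (dy - dy.min()) * span + (dx - dx.min())
    uniq, counts = np.unique(codes, return_counts=True)
    top = uniq[np.argsort(counts)[::-1][:k]]
    modes = np.stack([(top // span) + dy.min(),
                      (top % span) + dx.min()], axis=-1) * bin_size
    return modes                                             # shape (k, 2)
```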

59 However, when the scene contains more complex rigid motions such as rotation and scaling, the number of modes required for accurately representing the underlying motion field can be very large. [sent-119, score-0.87]

60 To address this problem, we can extract the dominant motion patterns under more sophisticated models such as similarity/affine transformations. [sent-120, score-0.775]

61 A comparison of dominant motion patterns extracted from sparse SIFT correspondences and a dense nearest neighbor field. [sent-129, score-1.247]

62 (a) The RubberWhale example [4], (b) the ground truth, (c),(d) the similarity transformations inferred with dominant motion patterns extracted from sparse feature correspondences and the nearest neighbor field, respectively. [sent-130, score-1.374]

63 Here, we adopt a more robust approach by removing only those “inliers” with high confidence values (samples which are sufficiently close to the current motion pattern) during the iterative RANSAC-based motion estimation process. [sent-133, score-0.942]
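
A sketch of this iterative estimation is given below, using scikit-image's RANSAC and SimilarityTransform as a stand-in for the paper's robust estimator; the thresholds, the "confident inlier" margin, and the stopping rule are illustrative values, not the paper's.

```python
import numpy as np
from skimage.transform import SimilarityTransform
from skimage.measure import ransac

def dominant_similarities(src, dst, n_models=5, thresh=3.0, confident=1.0):
    """Iteratively extract similarity transformations from noisy correspondences.

    src, dst: (N, 2) arrays of matched (x, y) positions, e.g. pixel coordinates
    and their NNF matches. After each RANSAC fit, only correspondences that fit
    the model very tightly ("confident inliers") are removed, so points that
    merely happen to lie near the current model can still support later ones.
    """
    models = []
    for _ in range(n_models):
        if len(src) < 10:
            break
        model, inliers = ransac((src, dst), SimilarityTransform,
                                min_samples=2, residual_threshold=thresh,
                                max_trials=500)
        if model is None:
            break
        models.append(model)
        residuals = np.linalg.norm(model(src) - dst, axis=1)
        keep = residuals > confident      # drop only high-confidence inliers
        src, dst = src[keep], dst[keep]
    return models
```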

64 Also, the large number of potential correspondences offered by the nearest neighbor field allows us to estimate the motion patterns robustly even for small objects or nontexture scenes. [sent-134, score-1.172]

65 As we can see from the figure, due to lack of textures on the cylinder, the SIFT-based estimation [16] cannot reconstruct the motion of the rotating cylinder accurately. [sent-138, score-0.615]

66 In contrast, our dense correspondence-based method closely reconstructs the ground truth with a similarity transformation-based motion pattern. [sent-139, score-0.574]

67 In our implementation, we use the dominant motion patterns extracted from both translation and similarity transformation. [sent-140, score-0.903]

68 The reason is that motion modes from offset histograms are complementary to motion patterns from similarity transformations, i.e. [sent-141, score-1.198]

69 translation models can more robustly identify motions on small independently moving objects and cover motions unexplainable with the set of estimated similarity transformations. [sent-143, score-0.677]

70 We also tried to complement our motion models with affine transformation, but we found that it is quite sensitive to errors in the original nearest neighbor field due to its increased DOF. [sent-144, score-0.986]

71 (a),(c) motion estimation without motion pattern perturbation and (b),(d) with motion pattern perturbation. [sent-148, score-1.558]

72 To deal with this problem, we allow a small perturbation around each motion pattern. [sent-152, score-0.56]
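
One way to realize such a perturbation, sketched below under the assumption that each motion pattern predicts a target position per pixel: for every pattern, search a small window around the predicted position and keep the best-matching offset, so each label effectively covers the pattern plus a small local deformation. The window radius and the simple per-pixel cost are illustrative choices.

```python
import numpy as np

def perturbed_candidate_flow(img1, img2, predicted, radius=2):
    """For each pixel, refine a pattern-predicted target position by searching
    a (2*radius+1)^2 window around it and keeping the best color match.

    img1, img2: float grayscale images of equal shape.
    predicted:  (H, W, 2) array of target (y, x) positions under one motion pattern.
    Returns an (H, W, 2) flow field = perturbed target - pixel position.
    """
    h, w = img1.shape
    grid_y, grid_x = np.mgrid[0:h, 0:w]
    best_cost = np.full((h, w), np.inf)
    best_flow = np.zeros((h, w, 2))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ty = np.clip(np.round(predicted[..., 0]) + dy, 0, h - 1).astype(int)
            tx = np.clip(np.round(predicted[..., 1]) + dx, 0, w - 1).astype(int)
            cost = np.abs(img2[ty, tx] - img1)      # simple per-pixel data cost
            better = cost < best_cost
            best_cost = np.where(better, cost, best_cost)
            best_flow[better] = np.stack([ty - grid_y, tx - grid_x], -1)[better]
    return best_flow
```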

73 This perturbation step is essential in improving the quality of motion segmentation using dominant motion patterns. [sent-170, score-1.266]

74 An example is shown in Figure 3 to demonstrate the advantages of local perturbation on regularizing motion fields and obtaining more accurate motion segmentation. [sent-172, score-1.118]

75 The time complexity of the motion segmentation step (described in the following) is super-linear in the number of motion patterns, so we typically achieve at least a 4X speed-up by allowing perturbation (e.g. ...). [sent-174, score-1.078]

76 Motion Segmentation with Dominant Motion Patterns. Given the set of K candidate motion patterns and their perturbation models, we formulate the dense motion estimation procedure as a labeling problem: E(u) = ... [sent-180, score-1.208]
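
The energy of Eqn (3) is clipped in this extraction. As an assumed, generic form only (not the paper's exact definition), such a labeling energy typically combines a per-pixel data term under the chosen candidate with a pairwise smoothness term:

```latex
% Generic multi-label energy; the exact data/smoothness terms of Eqn (3)
% are not recoverable from this summary and are written here as placeholders.
E(u) \;=\; \sum_{p} \rho_D\!\big(I_1(p) - I_2(p + u(p))\big)
      \;+\; \lambda \sum_{(p,q)\in\mathcal{N}} \rho_S\!\big(u(p),\, u(q)\big),
\qquad u(p) \in \{\text{the candidate motion patterns and their perturbations}\}.
```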

77 First, we estimate the motion configuration that best explains the data by choosing from motion patterns obtained from translation and similarity transformation separately as: u∗ = arg min_u Ê(u) s.t. ... [sent-208, score-1.2]

78 u(x) ∈ { Ω(P1 ◦ x), ..., Ω(PJ ◦ x) }, where the energy function Ê takes the same form as E in Eqn (3) but with a smoothness term on the motion pattern type rather than the actual flow vectors. [sent-218, score-0.847]

79 This step is equivalent to solving for motion segmentation with the two motion models separately, which is reasonable given that motion patterns from these two models are estimated independently so they have overlaps. [sent-220, score-1.559]
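
A compact sketch of this labeling step is given below: build a data cost for each candidate flow field, add a Potts smoothness penalty, and optimize. The paper solves this with a multi-label graph cut [8]; the simple ICM loop here is only a self-contained stand-in, and the smoothness weight is an illustrative value.

```python
import numpy as np

def label_motion_patterns(img1, img2, candidate_flows, lam=2.0, iters=5):
    """Assign each pixel one of K candidate flow fields (motion patterns).

    img1, img2: float grayscale images of equal shape.
    candidate_flows: list of (H, W, 2) flow fields, one per motion pattern.
    Returns (labels, flow): per-pixel pattern index and the composed flow field.
    """
    h, w = img1.shape
    gy, gx = np.mgrid[0:h, 0:w]

    # Data cost: how well does each candidate pattern explain each pixel?
    data = []
    for f in candidate_flows:
        ty = np.clip(gy + f[..., 0], 0, h - 1).astype(int)
        tx = np.clip(gx + f[..., 1], 0, w - 1).astype(int)
        data.append(np.abs(img2[ty, tx] - img1))
    data = np.stack(data, axis=-1)                   # (H, W, K)
    k = data.shape[-1]

    # Crude ICM loop with a Potts smoothness term; the paper uses a
    # multi-label graph-cut solver [8] instead.
    labels = data.argmin(axis=-1)
    for _ in range(iters):
        cost = data.copy()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            neigh = np.roll(labels, shift=(dy, dx), axis=(0, 1))
            cost += lam * (np.arange(k)[None, None, :] != neigh[..., None])
        labels = cost.argmin(axis=-1)

    flow = np.stack(candidate_flows)[labels, gy, gx]  # (H, W, 2)
    return labels, flow
```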

80 The third column shows that our flow result captures the motion of the fast moving, motion-blurred, textureless balls in each of these three examples. [sent-225, score-0.977]

81 Continuous Flow Refinement. To generate the final optical flow with sub-pixel accuracy, we need a final continuous refinement. [sent-228, score-0.689]

82 We achieve this by simply initializing the motion field with ˆu and estimating the sub-pixel motion field with a continuous optical flow framework [23]. [sent-229, score-1.773]
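
The continuous refinement of [23] is a variational method; as an easy-to-run stand-in, the sketch below hands the fused field ˆu to OpenCV's Farneback solver as an initial flow via OPTFLOW_USE_INITIAL_FLOW. Parameter values are illustrative, and OpenCV expects (dx, dy) channel ordering.

```python
import cv2
import numpy as np

def refine_flow(img1_gray, img2_gray, init_flow):
    """Refine an initial flow field to sub-pixel accuracy.

    img1_gray, img2_gray: uint8 grayscale frames of equal size.
    init_flow: (H, W, 2) float32 flow in OpenCV's (dx, dy) channel order.
    Farneback is used here as a stand-in for the variational refinement of [23];
    like most continuous methods it only refines locally, which is why a good
    initialization (the fused field) matters.
    """
    flow = np.ascontiguousarray(init_flow, dtype=np.float32)
    # Args: prev, next, flow, pyr_scale, levels, winsize, iterations,
    #       poly_n, poly_sigma, flags
    return cv2.calcOpticalFlowFarneback(
        img1_gray, img2_gray, flow,
        0.5, 1, 15, 5, 5, 1.1, cv2.OPTFLOW_USE_INITIAL_FLOW)
```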

83 Pyramid-based methods that rely on a small motion assumption will also have problems because the balls will be heavily blurred at the pyramid level for which the small motion assumption holds. [sent-239, score-0.983]

84 In contrast, our algorithm achieves accurate motion estimation by fusion of different motion patterns. [sent-248, score-1.027]

85 Our flow results in Figure 6 (d),(i) show that we can compute these large motions of large objects quite effectively from the NNF and dominant translations (not shown). [sent-252, score-0.904]

86 As shown in Figure 6(e),(j), the state-of-the-art MDPOF method [26] fails to capture many of the motions of the fast moving, low texture umbrellas and has difficulties at the motion discontinuity on the left side of the tree. [sent-253, score-0.807]

87 As we can see, for sequences in the first column, we estimate motions with the translation modes to get u∗ and the similarity modes to get u∗∗ in the third and fourth column, and fuse them adaptively to obtain ˆu as in fifth column. [sent-265, score-0.604]
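
A minimal per-pixel version of such a fusion is sketched below: at every pixel, keep whichever of the two flows (the translation result u∗ or the similarity result u∗∗) produces the smaller warping error. The paper's fusion is again a labeling/energy-based step, so this only illustrates the selection criterion.

```python
import numpy as np

def fuse_flows(img1, img2, flow_a, flow_b):
    """Fuse two candidate flow fields by per-pixel warping error.

    img1, img2: float grayscale images of equal shape.
    flow_a, flow_b: (H, W, 2) flows, e.g. the translation-based result u* and
    the similarity-based result u**. Returns the fused (H, W, 2) flow.
    """
    h, w = img1.shape
    gy, gx = np.mgrid[0:h, 0:w]

    def warp_error(flow):
        ty = np.clip(gy + flow[..., 0], 0, h - 1).astype(int)
        tx = np.clip(gx + flow[..., 1], 0, w - 1).astype(int)
        return np.abs(img2[ty, tx] - img1)

    use_a = warp_error(flow_a) <= warp_error(flow_b)
    return np.where(use_a[..., None], flow_a, flow_b)
```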

88 ... (e.g., the cloth in the sequence “Dimetrodon”) the dominant offset model u∗ performs much better, and for rigid-body motion (e.g. ...). [sent-269, score-0.674]

89 For example, the fused Rubber Whale flow in the fifth column uses the small hole flows in the letter D and the fence from the translation result and the rotating cylinder flow from the similarity result. [sent-273, score-1.097]

90 An example result of our method compared to the state-of-the-art optical flow estimation algorithms. [sent-279, score-0.628]

91 Note that, although noisy as a flow field, the NNF in (h) captures the dominant motions in this scene. [sent-285, score-0.834]

92 First column: image sequences; second column: ground truth; third column: flow u∗ by translation in Eqn (4); fourth column: similarity flow u∗∗ in Eqn (5); fifth column: fusion result ˆu by Eqn (6); last column: further refined optical flow initialized with ˆu. [sent-291, score-1.599]

93 Conclusion. In this work, we introduce a novel PatchMatch flow method for motion estimation. [sent-296, score-0.819]

94 Based on this observation, we extract dominant offsets and rigid-body motion modes. [sent-298, score-0.669]

95 Although direct dense matching provides pixel-wise motion candidates, raw patch features are not robust to large appearance variations such as changes in radiometric conditions, as well as scaling and rotation. [sent-315, score-0.568]

96 Reliable estimation of dense optical flow fields with large displacements. [sent-325, score-0.759]

97 Estimating optical flow in segmented images using variable-order parametric models with local deformations. [sent-368, score-0.594]

98 Large displacement optical flow: descriptor matching in variational motion estimation. [sent-379, score-0.851]

99 Investigations of multigrid algorithms for the estimation of optical flow fields in image sequences. [sent-397, score-0.732]

100 Joint motion estimation and segmentation of complex scenes with label costs and occlusion modeling. [sent-473, score-0.552]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('motion', 0.454), ('flow', 0.365), ('nnf', 0.254), ('motions', 0.247), ('optical', 0.229), ('neighbor', 0.217), ('dominant', 0.188), ('nearest', 0.175), ('middlebury', 0.152), ('displacement', 0.143), ('patterns', 0.133), ('patchmatch', 0.131), ('field', 0.113), ('eqn', 0.113), ('perturbation', 0.106), ('fields', 0.104), ('mdpof', 0.102), ('fusion', 0.085), ('transformations', 0.085), ('deformations', 0.08), ('cylinder', 0.08), ('translations', 0.077), ('balls', 0.075), ('similarity', 0.069), ('flowergarden', 0.068), ('rubberwhale', 0.068), ('segmentation', 0.064), ('ldof', 0.063), ('translation', 0.059), ('modes', 0.056), ('keypoint', 0.055), ('perturbed', 0.054), ('correspondences', 0.053), ('ui', 0.051), ('schefflera', 0.051), ('steinbruecker', 0.051), ('umbrellas', 0.051), ('refinement', 0.048), ('correspondence', 0.048), ('continuous', 0.045), ('schunck', 0.045), ('column', 0.043), ('adaptively', 0.042), ('horn', 0.042), ('umbrella', 0.042), ('ransac', 0.042), ('pock', 0.04), ('handling', 0.04), ('textureless', 0.04), ('inliers', 0.039), ('fifth', 0.039), ('repetitive', 0.039), ('approximate', 0.039), ('fuse', 0.036), ('served', 0.035), ('estimation', 0.034), ('noisy', 0.034), ('direct', 0.033), ('offset', 0.032), ('transformation', 0.031), ('eliminate', 0.031), ('discontinuity', 0.03), ('deviations', 0.03), ('benchmark', 0.029), ('reliable', 0.029), ('patch', 0.029), ('occlusions', 0.028), ('moving', 0.028), ('pattern', 0.028), ('offsets', 0.027), ('army', 0.027), ('quite', 0.027), ('flows', 0.027), ('robustly', 0.027), ('dense', 0.027), ('roth', 0.026), ('fused', 0.026), ('final', 0.025), ('fails', 0.025), ('approximately', 0.025), ('observation', 0.025), ('matching', 0.025), ('satisfactory', 0.025), ('row', 0.025), ('truth', 0.024), ('risk', 0.024), ('rotating', 0.024), ('keypoints', 0.024), ('ball', 0.024), ('brox', 0.023), ('miss', 0.023), ('refined', 0.023), ('reconstruct', 0.023), ('handle', 0.023), ('unger', 0.023), ('werlberger', 0.023), ('muin', 0.023), ('evanston', 0.023), ('sheridan', 0.023), ('yingwu', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 244 cvpr-2013-Large Displacement Optical Flow from Nearest Neighbor Fields

Author: Zhuoyuan Chen, Hailin Jin, Zhe Lin, Scott Cohen, Ying Wu

Abstract: We present an optical flow algorithm for large displacement motions. Most existing optical flow methods use the standard coarse-to-fine framework to deal with large displacement motions which has intrinsic limitations. Instead, we formulate the motion estimation problem as a motion segmentation problem. We use approximate nearest neighbor fields to compute an initial motion field and use a robust algorithm to compute a set of similarity transformations as the motion candidates for segmentation. To account for deviations from similarity transformations, we add local deformations in the segmentation process. We also observe that small objects can be better recovered using translations as the motion candidates. We fuse the motion results obtained under similarity transformations and under translations together before a final refinement. Experimental validation shows that our method can successfully handle large displacement motions. Although we particularly focus on large displacement motions in this work, we make no sacrifice in terms of overall performance. In particular, our method ranks at the top of the Middlebury benchmark.

2 0.31724826 334 cvpr-2013-Pose from Flow and Flow from Pose

Author: Katerina Fragkiadaki, Han Hu, Jianbo Shi

Abstract: Human pose detectors, although successful in localising faces and torsos of people, often fail with lower arms. Motion estimation is often inaccurate under fast movements of body parts. We build a segmentation-detection algorithm that mediates the information between body parts recognition, and multi-frame motion grouping to improve both pose detection and tracking. Motion of body parts, though not accurate, is often sufficient to segment them from their backgrounds. Such segmentations are crucialfor extracting hard to detect body parts out of their interior body clutter. By matching these segments to exemplars we obtain pose labeled body segments. The pose labeled segments and corresponding articulated joints are used to improve the motion flow fields by proposing kinematically constrained affine displacements on body parts. The pose-based articulated motion model is shown to handle large limb rotations and displacements. Our algorithm can detect people under rare poses, frequently missed by pose detectors, showing the benefits of jointly reasoning about pose, segmentation and motion in videos.

3 0.29391602 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition

Author: Mihir Jain, Hervé Jégou, Patrick Bouthemy

Abstract: Several recent works on action recognition have attested the importance of explicitly integrating motion characteristics in the video description. This paper establishes that adequately decomposing visual motion into dominant and residual motions, both in the extraction of the space-time trajectories and for the computation of descriptors, significantly improves action recognition algorithms. Then, we design a new motion descriptor, the DCS descriptor, based on differential motion scalar quantities, divergence, curl and shear features. It captures additional information on the local motion patterns enhancing results. Finally, applying the recent VLAD coding technique proposed in image retrieval provides a substantial improvement for action recognition. Our three contributions are complementary and lead to outperform all reported results by a significant margin on three challenging datasets, namely Hollywood 2, HMDB51 and Olympic Sports. 1. Introduction and related work Human actions often convey the essential meaningful content in videos. Yet, recognizing human actions in un- constrained videos is a challenging problem in Computer Vision which receives a sustained attention due to the potential applications. In particular, there is a large interest in designing video-surveillance systems, providing some automatic annotation of video archives as well as improving human-computer interaction. The solutions proposed to address this problem inherit, to a large extent, from the techniques first designed for the goal of image search and classification. The successful local features developed to describe image patches [15, 23] have been translated in the 2D+t domain as spatio-temporal local descriptors [13, 30] and now include motion clues [29]. These descriptors are often extracted from spatial-temporal interest points [12, 3 1]. More recent techniques assume some underlying temporal motion model involving trajectories [2, 6, 7, 17, 18, 25, 29, 32]. Most of these approaches produce large set of local descriptors which are in turn aggregated to produce a single vector representing the video, in order to enable the use of powerful discriminative classifiers such as support vector machines (SVMs). This is usually done with the bag- Figure 1. Optical flow field vectors (green vectors with red end points) before and after dominant motion compensation. Most of the flow vectors due to camera motion are suppressed after compensation. One of the contributions of this paper is to show that compensating for the dominant motion is beneficial for most of the existing descriptors used for action recognition. of-words technique [24], which quantizes the local features using a k-means codebook. Thanks to the successful combination of this encoding technique with the aforementioned local descriptors, the state of the art in action recognition is able to go beyond the toy problems ofclassifying simple human actions in controlled environment and considers the detection of actions in real movies or video clips [11, 16]. Despite these progresses, the existing descriptors suffer from an uncompleted handling of motion in the video sequence. Motion is arguably the most reliable source of information for action recognition, as often related to the actions of interest. However, it inevitably involves the background or camera motion when dealing with uncontrolled and re- alistic situations. 
Although some attempts have been made to compensate camera motion in several ways [10, 21, 26, 29, 32], how to separate action motion from that caused by the camera, and how to reflect it in the video description remains an open issue. The motion compensation mechanism employed in [10] is tailor-made to the Motion Interchange Pattern encoding technique. The Motion Boundary Histogram (MBH) [29] is a recent appealing approach to 222555555533 suppress the constant motion by considering the flow gradient. It is robust to some extent to the presence of camera motion, yet it does not explicitly handle the camera motion. Another approach [26] uses a sophisticated and robust (RANSAC) estimation of camera motion. It first segments the color image into regions corresponding to planar parts in the scene and estimates the (three) dominant homographies to update the motion associated with local features. A rather different view is adopted in [32] where the motion decomposition is performed at the trajectory level. All these works support the potential of motion compensation. As the first contribution of this paper, we address the problem in a way that departs from these works by considering the compensation of the dominant motion in both the tracking stages and encoding stages involved in the computation of action recognition descriptors. We rely on the pioneering works on motion compensation such as the technique proposed in [20], that considers 2D polynomial affine motion models for estimating the dominant image motion. We consider this particular model for its robustness and its low computational cost. It was already used in [21] to separate the dominant motion (assumed to be due to the camera motion) and the residual motion (corresponding to the independent scene motions) for dynamic event recognition in videos. However, the statistical modeling of both motion components was global (over the entire image) and only the normal flow was computed for the latter. Figure 1 shows the vectors of optical flow before and after applying the proposed motion compensation. Our method successfully suppresses most of the background motion and reinforces the focus towards the action of interest. We exploit this compensated motion both for descriptor computation and for extracting trajectories. However, we also show that the camera motion should not be thrown as it contains complementary information that is worth using to recognize certain action categories. Then, we introduce the Divergence-Curl-Shear (DCS) descriptor, which encodes scalar first-order motion features, namely the motion divergence, curl and shear. It captures physical properties of the flow pattern that are not involved in the best existing descriptors for action recognition, except in the work of [1] which exploits divergence and vorticity among a set of eleven kinematic features computed from the optical flow. Our DCS descriptor provides a good performance recognition performance on its own. Most importantly, it conveys some information which is not captured by existing descriptors and further improves the recognition performance when combined with the other descriptors. As a last contribution, we bring an encoding technique known as VLAD (vector oflocal aggregated descriptors) [8] to the field of action recognition. This technique is shown to be better than the bag-of-words representation for combining all the local video descriptors we have considered. The organization of the paper is as follows. 
Section 2 introduces the motion properties that we will consider through this paper. Section 3 presents the datasets and classification scheme used in our different evaluations. Section 4 details how we revisit several popular descriptors of the literature by the means of dominant motion compensation. Our DCS descriptor based on kinematic properties is introduced in Section 5 and improved by the VLAD encoding technique, which is introduced and bench-marked in Section 6 for several video descriptors. Section 7 provides a comparison showing the large improvement achieved over the state of the art. Finally, Section 8 concludes the paper. 2. Motion Separation and Kinematic Features In this section, we describe the motion clues we incorporate in our action recognition framework. We separate the dominant motion and the residual motion. In most cases, this will account to distinguishing the impact of camera movement and independent actions. Note that we do not aim at recovering the 3D camera motion: The 2D parametric motion model describes the global (or dominant) motion between successive frames. We first explain how we estimate the dominant motion and employ it to separate the dominant flow from the optical flow. Then, we will introduce kinematic features, namely divergence, curl and shear for a more comprehensive description of the visual motion. 2.1. Affine motion for compensating camera motion Among polynomial motion models, we consider the 2D affine motion model. Simplest motion models such as the 4parameter model formed by the combination of 2D translation, 2D rotation and scaling, or more complex ones such as the 8-parameter quadratic model (equivalent to a homography), could be selected as well. The affine model is a good trade-off between accuracy and efficiency which is of primary importance when processing a huge video database. It does have limitations since strictly speaking it implies a single plane assumption for the static background. However, this is not that penalizing (especially for outdoor scenes) if differences in depth remain moderated with respect to the distance to the camera. The affine flow vector at point p = (x, y) and at time t, is defined as waff(pt) =?cc12((t ) ?+?aa31((t ) aa42((t ) ? ?xytt?. (1) = + + = + uaff(pt) c1(t) a1(t)xt a2(t)yt and vaff(pt) c2(t) a3 (t)xt + a4(t)yt are horizontal and vertical components of waff(pt) respectively. Let us denote the optical flow vector at point p at time t as w(pt) = (u(pt) , v(pt)). We introduce the flow vector ω(pt) obtained by removing the affine flow vector from the optical flow vector ω(pt) = w(pt) − waff(pt) . (2) 222555555644 The dominant motion (estimated as waff(pt)) is usually due to the camera motion. In this case, Equation 2 amounts to canceling (or compensating) the camera motion. Note that this is not always true. For example in case of close-up on a moving actor, the dominant motion will be the affine estimation of the apparent actor motion. The interpretation of the motion compensation output will not be that straightforward in this case, however the resulting ω-field will still exhibit different patterns for the foreground action part and the background part. In the remainder, we will refer to the “compensated” flow as ω-flow. Figure 1 displays the computed optical flow and the ωflow. We compute the affine flow with the publicly available Motion2D software1 [20] which implements a realtime robust multiresolution incremental estimation framework. 
The affine motion model has correctly accounted for the motion induced by the camera movement which corresponds to the dominant motion in the image pair. Indeed, we observe that the compensated flow vectors in the background are close to null and the compensated flow in the foreground, i.e., corresponding to the actors, is conversely inflated. The experiments presented along this paper will show that effective separation of dominant motion from the residual motions is beneficial for action recognition. As explained in Section 4, we will compute local motion descriptors, such as HOF, on both the optical flow and the compensated flow (ω-flow), which allows us to explicitly and directly characterize the scene motion. 2.2. Local kinematic features By kinematic features, we mean local first-order differential scalar quantities computed on the flow field. We consider the divergence, the curl (or vorticity) and the hyperbolic terms. They inform on the physical pattern of the flow so that they convey useful information on actions in videos. They can be computed from the first-order derivatives of the flow at every point p at every frame t as ⎨⎪ ⎪ ⎪ ⎧hcdyuipvr1l2(p t) = −∂ u ∂ (yxp(xtp) +−∂ v ∂ v(px ypxt ) The diverg⎪⎩ence is related to axial motion, expansion scaling effects, the curl to rotation in the image plane. hyperbolic terms express the shear of the visual flow responding to more complex configuration. We take account the shear quantity only: shear(pt) = ?hyp12(pt) + hyp22(pt). (3) and The corinto (4) 1http://www.irisa.fr/vista/Motion2D/ In Section 5, we propose the DCS descriptor that is based on the kinematic features (divergence, curl and shear) of the visual motion discussed in this subsection. It is computed on either the optical or the compensated flow, ω-flow. 3. Datasets and evaluation This section first introduces the datasets used for the evaluation. Then, we briefly present the bag-of-feature model and the classification scheme used to encode the descriptors which will be introduced in Section 4. Hollywood2. The Hollywood2 dataset [16] contains 1,707 video clips from 69 movies representing 12 action classes. It is divided into train set and test set of 823 and 884 samples respectively. Following the standard evaluation protocol of this benchmark, we use average precision (AP) for each class and the mean of APs (mAP) for evaluation. HMDB51. The HMDB51 dataset [11] is a large dataset containing 6,766 video clips extracted from various sources, ranging from movies to YouTube. It consists of 51 action classes, each having at least 101 samples. We follow the evaluation protocol of [11] and use three train/test splits, each with 70 training and 30 testing samples per class. The average classification accuracy is computed over all classes. Out of the two released sets, we use the original set as it is more challenging and used by most of the works reporting results in action recognition. Olympic Sports. The third dataset we use is Olympic Sports [19], which again is obtained from YouTube. This dataset contains 783 samples with 16 sports action classes. We use the provided2 train/test split, there are 17 to 56 training samples and 4 to 11test samples per class. Mean AP is used for the evaluation, which is the standard choice. Bag of features and classification setup. We first adopt the standard BOF [24] approach to encode all kinds of descriptors. It produces a vector that serves as the video representation. 
The codebook is constructed for each type of descriptor separately by the k-means algorithm. Following a common practice in the literature [27, 29, 30], the codebook size is set to k=4,000 elements. Note that Section 6 will consider encoding technique for descriptors. For the classification, we use a non-linear SVM with χ2kernel. When combining different descriptors, we simply add the kernel matrices, as done in [27]: K(xi,xj) = exp?−?cγ1cD(xic,xjc)?, 2http://vision.stanford.edu/Datasets/OlympicSports/ 222555555755 (5) where D(xic, xjc) is χ2 distance between video xic and xjc with respect to c-th channel, corresponding to c-th descriptor. The quantity γc is the mean value of χ2 distances between the training samples for the c-th channel. The multiclass classification problem that we consider is addressed by applying a one-against-rest approach. 4. Compensated descriptors This section describes how the compensation ofthe dominant motion is exploited to improve the quality of descriptors encoding the motion and the appearance around spatio-temporal positions, hence the term “compensated descriptors”. First, we briefly review the local descriptors [5, 13, 16, 29, 30] used here along with dense trajectories [29]. Second, we analyze the impact of motion flow compensation when used in two different stages of the descriptor computation, namely in the tracking and the description part. 4.1. Dense trajectories and local descriptors Employing dense trajectories to compute local descriptors is one of the state-of-the-art approaches for action recognition. It has been shown [29] that when local descriptors are computed over dense trajectories the performance improves considerably compared to when computed over spatio temporal features [30]. Dense Trajectories [29]: The trajectories are obtained by densely tracking sampled points using optical flow fields. First, feature points are sampled from a dense grid, with step size of 5 pixels and over 8 scales. Each feature point pt = (xt, yt) at frame t is then tracked to the next frame by median filtering in a dense optical flow field F = (ut, vt) as follows: pt+1 = (xt+1 , yt+1) = (xt, yt) + (M ∗ F) | (x ¯t,y ¯t) , (6) where M is the kernel of median filtering and ( x¯ t, y¯ t) is the rounded position of (xt, yt). The tracking is limited to L (=15) frames to avoid any drifting effect. Excessively short trajectories and trajectories exhibiting sudden large displacements are removed as they induce some artifacts. Trajectories must be understood here as tracks in the spacetime volume of the video. Local descriptors: The descriptors are computed within a space-time volume centered around each trajectory. Four types of descriptors are computed to encode the shape of the trajectory, local motion pattern and appearance, namely Trajectory [29], HOF (histograms of optical flow) [13], MBH [4] and HOG (histograms of oriented gradients) [3]. All these descriptors depend on the flow field used for the tracking and as input of the descriptor computation: 1. The Trajectory descriptor encodes the shape of the trajectory represented by the normalized relative coor- × dinates of the successive points forming the trajectory. It directly depends on the dense flow used for tracking points. 2. HOF is computed using the orientations and magnitudes of the flow field. 3. MBH is designed to capture the gradient of horizontal and vertical components of the flow. The motion boundaries encode the relative pixel motion and therefore suppress camera motion, but only to some extent. 4. 
HOG encodes the appearance by using the intensity gradient orientations and magnitudes. It is formally not a motion descriptor. Yet the position where the descriptor is computed depends on the trajectory shape. As in [29], volume around a feature point is divided into a 2 2 3 space-time grid. The orientations are quantized ian 2to × ×8 b2i ×ns 3fo srp HacOe-Gti amned g g9r ibdi.ns T fhoer o oHriOenFt (awtioitnhs one a qdudainttiiozneadl zero bin). The horizontal and vertical components of MBH are separately quantized into 8 bins each. 4.2. Impact of motion compensation The optical flow is simply referred to as flow in the following, while the compensated flow (see subsection 2. 1) is denoted by ω-flow. Both of them are considered in the tracking and descriptor computation stages. The trajectories obtained by tracking with the ω-flow are called ω-trajectories. Figure 2 comparatively illustrates the ωtrajectories and the trajectories obtained using the flow. The input video shows a man moving away from the car. In this video excerpt, the camera is following the man walking to the right, thus inducing a global motion to the left in the video. When using the flow, the computed trajectories reflect the combination of these two motion components (camera and scene motion) as depicted by Subfigure 2(b), which hampers the characterization of the current action. In contrast, the ω-trajectories plotted in Subfigure 2(c) are more active on the actor moving on the foreground, while those localized in the background are now parallel to the time axis enhancing static parts of the scene. The ω-trajectories are therefore more relevant for action recognition, since they are more regularly and more exclusively following the actor’s motion. Impact on Trajectory and HOG descriptors. Table 1reports the impact of ω-trajectories on Trajectory and HOG descriptors, which are both significantly improved by 3%4% of mAP on the two datasets. When improved by ωflow, these descriptors will be respectively referred to as ω-Trajdesc and ω-HOG in the rest of the paper. Although the better performance of ω-Trajdesc versus the original Trajectory descriptor was expected, the one 222555555866 2. Trajectories obtained from optical and compensated flows. The green tail is the trajectory the current frame. The trajectories are sub-sampled for the sake of clarity. The frames are extracted Figure over every 15 frames with red dot indicating 5 frames in this example. DescriptorHollywood2HMDB51 BaseTrliaωnje- Tc(rtoarejrdpyreos[c2d9u]ced)54 7 1. 7 4% %2382.–89% BaseliHnωOe- (GHreOp [2rG9od]uced)4 451 . 658%%%2296.– 13%% Table 1. ω-Trajdesc and ω-HOG: Impact of compensating flow on Trajectory descriptor and HOG descriptors. achieved by ω-HOG might be surprising. Our interpretation is that HOG captures more context with the modified trajectories. More precisely, the original HOG descriptor is computed from a 2D+t sub-volume aligned with the corresponding trajectory and hence represents the appearance along the trajectory shape. When using ω-flow, we do not align the video sequence. As a result, the ω-HOG descriptor is no more computed around the very same tracked physical point in the space-time volume but around points lying in a patch of the initial feature point, whose size depends on the affine flow magnitude. ω-HOG can be viewed as a “patchbased” computation capturing more information about the appearance of the background or of the moving foreground. 
As for ω-trajectories, they are closer to the real trajectories of the moving actors as they usually cancel the camera movement, and so, more easier to train and recognize. Impact on HOF. The ω-flow impacts computation used as an input to HOF computation itself. Therefore, HOF can both types of trajectories (ω-trajectories both the trajectory and the descriptor be computed along or those extracted MethodHollywood2HMDB51 Table(ω2rHf.-alocO IwomkF)inpHgacOtFobf[2u9ωhsb]i:f-nlo ωgwot-hwωHOflFown5H 02 34O. 58291F% %descripto3 r706s38.:–1076% m%APfor Hollywood2 and average accuracy for HMDB5 1. The ω-HOF is used in subsequent evaluations. from flow) and can encode both kinds of flows (ω-flow or flow). For the sake of completeness, we evaluate all the variants as well as the combination of both flows in the descriptor computation stage. The results are presented in Table 2 and demonstrate the significant improvement obtained by computing the HOF descriptor with the ω-flow instead of the optical flow. Note that the type of trajectories which is used, either “Tracking flow” or “Tracking ω-flow”, has a limited impact in this case. From now on, we only consider the “Tracking ω-flow” case where HOF is computed along ω-trajectories. Interestingly, combining the HOF computed from the flow and the ω-flow further improves the results. This suggests that the two flow fields are complementary and the affine flow that was subtracted from ω-flow brings in additional information. For the sake of brevity, the combination of the two kinds of HOF, i.e., computed from the flow and the ω-flow using ω-trajectories, is referred to as the ω-HOF 222555555977 MethodHollywood2HMDB51 Tab(lerT3a.cIkmM inpBgMacHgωtBf-loH w [u2)s9in]gω f-lo wo MBH5 d42 e.052s7c% riptos:m34A90P.–3769f% orHllywood2 and average accuracy for HMDB5 1. DTerHasMjcBrOeblitpHGeFor4.ySumTωraw- frcilykto hw ionfgtheduωpCs-fcaolrtωmeiwNp-df/tl+Aωoutrw-finlogwthdesωcr- isTpc-fHtrMloaiOjrBpdswtGeHFosrc descriptor in the rest of this paper. Compared to the HOF baseline, the ω-HOF descriptor achieves a gain of +3.1% of mAP on Hollywood 2 and of +7.8% on HMDB51. Impact on MBH. Since MBH is computed from gradient of flow and cancel the constant motion, there is practically no benefit in using the ω-flow to compute the MBH descriptors, as shown in Table 3. However, by tracking ω-flow, the performance improves by around 1.3% for HMDB5 1 dataset and drops by around 1.5% for Hollywood2. This relative performance depends on the encoding technique. We will come back on this descriptor when considering another encoding scheme for local descriptors in Section 6. 4.3. Summary of compensated descriptors Table 4 summarizes the refined versions of the descriptors obtained by exploiting the ω-flow, and both ω-flow and the optical flow in the case of HOF. The revisited descriptors considerably improve the results compared to the orig- inal ones, with the noticeable exception of ω-MBH which gives mixed performance with a bag-of-features encoding scheme. But we already mention as this point that this incongruous behavior of ω-MBH is stabilized with the VLAD encoding scheme considered in Section 6. Another advantage of tracking the compensated flow is that fewer trajectories are produced. For instance, the total number of trajectories decreases by about 9. 16% and 22.81% on the Hollywood2 and HMDB51 datasets, respectively. 
Note that exploiting both the flow and the ω-flow do not induce much computational overhead, as the latter is obtained from the flow and the affine flow which is computed in real-time and already used to get the ω-trajectories. The only additional computational cost that we introduce by using the descriptors summarized in Table 4 is the computation of a second HOF descriptor, but this stage is relatively efficient and not the bottleneck of the extraction procedure. 5. Divergence-Curl-Shear descriptor This section introduces a new descriptor encoding the kinematic properties of motion discussed in Section 2.2. It is denoted by DCS in the rest of this paper. Combining kinematic features. The spatial derivatives are computed for the horizontal and vertical components of the flow field, which are used in turn to compute the divergence, curl and shear scalar values, see Equation 3. We consider all possible pairs of kinematic features, namely (div, curl), (div, shear) and (curl, shear). At each × ×× pixel, we compute the orientation and magnitude of the 2-D vector corresponding to each of these pairs. The orientation is quantized into histograms and the magnitude is used for weighting, similar to SIFT. Our motivation for encoding pairs is that the joint distribution of kinematic features conveys more information than exploiting them independently. Implementation details. The descriptor computation and parameters are similar to HOG and other popular descriptors such as MBH, HOF. We obtain 8-bin histograms for each of the three feature pairs or components of DCS. The range of possible angles is 2π for the (div,curl) pair and π for the other pairs, because the shear is always positive. The DCS descriptor is computed for a space-time volume aligned with a trajectory, as done with the four descriptors mentioned in the previous section. In order to capture the spatio-temporal structure of kinematic features, the volume (32 32 pixels and L = 15 frames) is subdivided into a spatio-temporal grid nofd s Lize = nx 5× f ny m×e nt, sw situhb nx =de ny =to 2a and nt = 3. These parameters ×hnave× × bneen fixed for the sake of consistency with the other descriptors. For each pair of kinematic features, each cell in the grid is represented by a histogram. The resulting local descriptors have a dimensionality equal to 288 = nx ny nt 8 3. At the video level, these descriptors are nenc×od end i×nto 8 a single vector representation using either BOF or the VLAD encoding scheme introduced in the next section. 6. VLAD in actions VLAD [8] is a descriptor encoding technique that aggregates the descriptors based on a locality criterion in the feature space. To our knowledge, this technique has never been considered for action recognition. Below, we briefly introduce this approach and give the performance achieved for all the descriptors introduced along the previous sections. VLAD in brief. Similar to BOF, VLAD relies on a codebook C = {c1, c2 , ...ck} of k centroids learned by k-means. bTohoek representation is ob}t oaifn ked c by summing, efodr b yea kch-m mveiasunasl. word ci, the differences x − ci of the vectors x assigned to ci, thereby producing a sv exct −or c representation oflength d×k, 222555556088 DMeBscHriptorV5 LH.A1o%Dlywo5Bo4d.O2 %F4V3L.3HA%MD B35B91.O7%F Taωbl-eDHM5rOCBa.FSjGPdHe+rsωfco-mMHaBOnFeofV54L2936A.51D% with5431ω208-.5T96% rajde3s42c97158,.ω3% -HOG342,58019ω.6-% HOF descriptors and their combination. where d is the dimension ofthe local descriptors. We use the codebook size, k = 256. 
Despite this large dimensionality, VLAD is efficient because it is effectively compared with a linear kernel. VLAD is post-processed using a componentwise power normalization, which dramatically improves its performance [8]. While cross validating the parameter α involved in this power normalization, we consistently observe, for all the descriptors, a value between 0.15 and 0.3. Therefore, this parameter is set to α = 0.2 in all our experiments. For classification, we use a linear SVM and oneagainst-rest approach everywhere, unless stated otherwise. Impact on existing descriptors. We employ VLAD because it is less sensitive to quantization parameters and appears to provide better performance with descriptors having a large dimensionality. These properties are interesting in our case, because the quantization parameters involved in the DCS and MBH descriptors have been used unchanged in Section 4 for the sake of direct comparison. They might be suboptimal when using the ω-flow instead of the optical flow on which they have initially been optimized [29]. Results for MBH and ω-MBH in Table 5 supports this argument. When using VLAD instead of BOF, the scores are stable in both the cases and there is no mixed inference as that observed in Table 3. VLAD also has significant positive influence on accuracy of ω-DCS descriptor. We also observe that ω-DCS is complementary to ω-MBH and adds to the performance. Still DCS is probably not best utilized in the current setting of parameters. In case of ω-Trajdesc and ω-HOG, the scores are better with BOF on both the datasets. ω-HOF with VLAD improves on HMDB5 1, but remains equivalent for Hollywood2. Although BOF leads to better scores for the descriptors considered individually, their combination with VLAD outperforms the BOF. 7. Comparison with the state of the art This section reports our results with all descriptors combined and compares our method with the state of the art. TrajectorCy+omHbOiGna+tHioOnF+ MDBCHSHol5 l98y.w76%o%od2H4M489.D02%B%51 All ω-descriptors all five compensated descriptors using combined62.5%52.1% Table 6. Combination of VLAD representation. WU*JVliaOnughreM tHaeolth. [yo2w9d87o] 256 0985. 37% SKa*duOeJinhrau tnegdMteHatlMa.h [ol1Dd.0B[91]25 24 609.8172% Table 7. Comparison with the state of the art on Hollywood2 and HMDB5 1 datasets. *Vig et al. [28] gets 61.9% by using external eye movements data. *Jiang et al. [9] used one-vs-one multi class SVM while our and other methods use one-vs-rest SVMs. With one-against-one multi class SVM we obtain 45. 1% for HMDB51. Descriptor combination. Table 6 reports the results obtained when the descriptors are combined. Since we use VLAD, our baseline is updated that is combination of Trajectory, HOG, HOF and MBH with VLAD representation. When DCS is added to the baseline there is an improvement of 0.9% and 1.2%. With combination of all five compensated descriptors we obtain 62.5% and 52.1% on the two datasets. This is a large improvement even over the updated baseline, which shows that the proposed motion compensation and the way we exploit it are significantly important for action recognition. The comparison with the state of the art is shown in Table 7. Our method outperforms all the previously reported results in the literature. In particular, on the HMDB51 dataset, the improvement over the best reported results to date is more than 11% in average accuracy. Jiang el al. [9] used a one-against-one multi-class SVM, which might have resulted in inferior scores. 
With a similar multi-class SVM approach, our method obtains 45. 1%, which remains significantly better than their result. All others results were reported with one-against-rest approach. On Olympic Sports dataset we obtain mAP of 83.2% with ‘All ω-descriptors combined’ and the improvement is mostly because of VLAD and ω-flow. The best reported mAPs on this dataset are Liu et al. [14] (74.4%) and Jiang et al. [9] (80.6%), which we exceed convincingly. Gaidon et al. [6] reports the best average accuracy of 82.7%. 8. Conclusions This paper first demonstrates the interest of canceling the dominant motion (predominantly camera motion) to make the visual motion truly related to actions, for both the trajectory extraction and descriptor computation stages. It pro222555556199 duces significantly better versions (called compensated descriptors) of several state-of-the-art local descriptors for action recognition. The simplicity, efficiency and effectiveness of this motion compensation approach make it applicable to any action recognition framework based on motion descriptors and trajectories. The second contribution is the new DCS descriptor derived from the first-order scalar motion quantities specifying the local motion patterns. It captures additional information which is proved complementary to the other descriptors. Finally, we show that VLAD encoding technique instead of bag-of-words boosts several action descriptors, and overall exhibits a significantly better performance when combining different types of descriptors. Our contributions are all complementary and significantly outperform the state of the art when combined, as demon- strated by our extensive experiments on the Hollywood 2, HMDB51 and Olympic Sports datasets. Acknowledgments This work was supported by the Quaero project, funded by Oseo, French agency for innovation. We acknowledge Heng Wang’s help for reproducing some of their results. References [1] S. Ali and M. Shah. Human action recognition in videos using kinematic features and multiple instance learning. IEEE T-PAMI, 32(2):288–303, Feb. 2010. [2] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, Sep. 2010. [3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, Jun. 2005. [4] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, May 2006. [5] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, Oct. 2005. [6] A. Gaidon, Z. Harchaoui, and C. Schmid. Recognizing activities with cluster-trees of tracklets. In BMVC, Sep. 2012. [7] A. Hervieu, P. Bouthemy, and J.-P. Le Cadre. A statistical video content recognition method using invariant features on object trajectories. IEEE T-CSVT, 18(1 1): 1533–1543, 2008. [8] H. J ´egou, F. Perronnin, M. Douze, J. S ´anchez, P. P ´erez, and C. Schmid. Aggregating local descriptors into compact [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] codes. IEEE T-PAMI, 34(9):1704–1716, 2012. Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based modeling of human actions with motion reference points. In ECCV, Oct. 2012. O. Kliper-Gross, Y. Gurovich, T. Hassner, and L. Wolf. Motion interchange patterns for action recognition in unconstrained videos. In ECCV, Oct. 2012. H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: A large video database for human motion recognition. In ICCV, Nov. 2011. I. Laptev and T. 
[12] I. Laptev and T. Lindeberg. Space-time interest points. In ICCV, Oct. 2003.
[13] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, Jun. 2008.
[14] J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In CVPR, Jun. 2011.
[15] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, Nov. 2004.
[16] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, Jun. 2009.
[17] P. Matikainen, M. Hebert, and R. Sukthankar. Trajectons: Action recognition through the motion analysis of tracked features. In Workshop on Video-Oriented Object and Event Classification, ICCV, Sep. 2009.
[18] R. Messing, C. J. Pal, and H. A. Kautz. Activity recognition using the velocity histories of tracked keypoints. In ICCV, Sep. 2009.
[19] J. C. Niebles, C.-W. Chen, and F.-F. Li. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, Sep. 2010.
[20] J.-M. Odobez and P. Bouthemy. Robust multiresolution estimation of parametric motion models. Journal of Visual Communication and Image Representation, 6(4):348–365, Dec. 1995.
[21] G. Piriou, P. Bouthemy, and J.-F. Yao. Recognition of dynamic video contents with global probabilistic models of visual motion. IEEE T-IP, 15(11):3417–3430, 2006.
[22] S. Sadanand and J. J. Corso. Action bank: A high-level representation of activity in video. In CVPR, Jun. 2012.
[23] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. IEEE T-PAMI, 19(5):530–534, May 1997.
[24] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, pages 1470–1477, Oct. 2003.
[25] J. Sun, X. Wu, S. Yan, L. F. Cheong, T.-S. Chua, and J. Li. Hierarchical spatio-temporal context modeling for action recognition. In CVPR, Jun. 2009.
[26] H. Uemura, S. Ishikawa, and K. Mikolajczyk. Feature tracking and motion compensation for action recognition. In BMVC, Sep. 2008.
[27] M. M. Ullah, S. N. Parizi, and I. Laptev. Improving bag-of-features action recognition with non-local cues. In BMVC, Sep. 2010.
[28] E. Vig, M. Dorr, and D. Cox. Saliency-based space-variant descriptor sampling for action recognition. In ECCV, Oct. 2012.
[29] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, Jun. 2011.
[30] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, Sep. 2009.
[31] G. Willems, T. Tuytelaars, and L. J. V. Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In ECCV, Oct. 2008.
[32] S. Wu, O. Oreifej, and M. Shah. Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories. In ICCV, Nov. 2011.
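As an appendix-style illustration of the dominant-motion cancellation summarized in Section 8 above, the sketch below estimates the camera-induced motion between two frames and subtracts it from a dense flow field. It is only a stand-in under stated assumptions: the paper estimates a 2D parametric dominant motion with the robust method of [20], whereas this sketch fits an affine model with OpenCV and uses Farneback flow; all function names are ours.

import cv2
import numpy as np

def dominant_affine_motion(prev_gray, next_gray):
    # Fit a 2x3 affine model of the dominant (mostly camera) motion between two
    # grayscale frames from tracked corners, with RANSAC to reject outliers.
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=8)
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    ok = status.ravel() == 1
    A, _inliers = cv2.estimateAffine2D(pts[ok], nxt[ok], method=cv2.RANSAC)
    return A

def compensated_flow(prev_gray, next_gray):
    # Dense flow minus the flow induced by the dominant motion, i.e. a residual
    # flow in the spirit of the w-flow (omega-flow) used by the descriptors above.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    A = dominant_affine_motion(prev_gray, next_gray)
    h, w = prev_gray.shape
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    dom_u = A[0, 0] * xs + A[0, 1] * ys + A[0, 2] - xs
    dom_v = A[1, 0] * xs + A[1, 1] * ys + A[1, 2] - ys
    return flow - np.dstack([dom_u, dom_v])

The residual flow, and the coefficients of the fitted model itself, can then feed the trajectory extraction and the descriptor computation, which is where the compensation pays off according to the experiments above.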

4 0.26947737 10 cvpr-2013-A Fully-Connected Layered Model of Foreground and Background Flow

Author: Deqing Sun, Jonas Wulff, Erik B. Sudderth, Hanspeter Pfister, Michael J. Black

Abstract: Layered models allow scene segmentation and motion estimation to be formulated together and to inform one another. Traditional layered motion methods, however, employ fairly weak models of scene structure, relying on locally connected Ising/Potts models which have limited ability to capture long-range correlations in natural scenes. To address this, we formulate a fully-connected layered model that enables global reasoning about the complicated segmentations of real objects. Optimization with fully-connected graphical models is challenging, and our inference algorithm leverages recent work on efficient mean field updates for fully-connected conditional random fields. These methods can be implemented efficiently using high-dimensional Gaussian filtering. We combine these ideas with a layered flow model, and find that the long-range connections greatly improve segmentation into figure-ground layers when compared with locally connected MRF models. Experiments on several benchmark datasets show that the method can recover fine structures and large occlusion regions, with good flow accuracy and much lower computational cost than previous locally-connected layered models.

5 0.26782751 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction

Author: Dennis Park, C. Lawrence Zitnick, Deva Ramanan, Piotr Dollár

Abstract: We describe novel but simple motion features for the problem of detecting objects in video sequences. Previous approaches either compute optical flow or temporal differences on video frame pairs with various assumptions about stabilization. We describe a combined approach that uses coarse-scale flow and fine-scale temporal difference features. Our approach performs weak motion stabilization by factoring out camera motion and coarse object motion while preserving nonrigid motions that serve as useful cues for recognition. We show results for pedestrian detection and human pose estimation in video sequences, achieving state-of-the-art results in both. In particular, given a fixed detection rate our method achieves a five-fold reduction in false positives over prior art on the Caltech Pedestrian benchmark. Finally, we perform extensive diagnostic experiments to reveal what aspects of our system are crucial for good performance. Proper stabilization, long time-scale features, and proper normalization are all critical.

6 0.26771897 124 cvpr-2013-Determining Motion Directly from Normal Flows Upon the Use of a Spherical Eye Platform

7 0.22602364 362 cvpr-2013-Robust Monocular Epipolar Flow Estimation

8 0.21398036 46 cvpr-2013-Articulated and Restricted Motion Subspaces and Their Signatures

9 0.20895801 88 cvpr-2013-Compressible Motion Fields

10 0.20515338 345 cvpr-2013-Real-Time Model-Based Rigid Object Pose Estimation and Tracking Combining Dense and Sparse Visual Cues

11 0.20259061 326 cvpr-2013-Patch Match Filter: Efficient Edge-Aware Filtering Meets Randomized Search for Fast Correspondence Field Estimation

12 0.18030907 316 cvpr-2013-Optical Flow Estimation Using Laplacian Mesh Energy

13 0.17895368 203 cvpr-2013-Hierarchical Video Representation with Trajectory Binary Partition Tree

14 0.16462903 108 cvpr-2013-Dense 3D Reconstruction from Severely Blurred Images Using a Single Moving Camera

15 0.15509516 300 cvpr-2013-Multi-target Tracking by Lagrangian Relaxation to Min-cost Network Flow

16 0.15486822 170 cvpr-2013-Fast Rigid Motion Segmentation via Incrementally-Complex Local Models

17 0.14779282 465 cvpr-2013-What Object Motion Reveals about Shape with Unknown BRDF and Lighting

18 0.14754051 107 cvpr-2013-Deformable Spatial Pyramid Matching for Fast Dense Correspondences

19 0.14621732 290 cvpr-2013-Motion Estimation for Self-Driving Cars with a Generalized Camera

20 0.13702123 455 cvpr-2013-Video Object Segmentation through Spatially Accurate and Temporally Dense Extraction of Primary Object Regions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.236), (1, 0.163), (2, 0.023), (3, -0.058), (4, -0.105), (5, -0.006), (6, 0.075), (7, -0.153), (8, -0.095), (9, 0.107), (10, 0.151), (11, 0.232), (12, 0.168), (13, -0.037), (14, 0.271), (15, 0.081), (16, 0.0), (17, -0.18), (18, -0.091), (19, 0.001), (20, -0.029), (21, -0.146), (22, 0.078), (23, -0.049), (24, 0.054), (25, -0.013), (26, -0.033), (27, 0.043), (28, 0.014), (29, 0.036), (30, -0.088), (31, -0.05), (32, 0.122), (33, 0.068), (34, -0.023), (35, -0.015), (36, 0.061), (37, -0.005), (38, -0.058), (39, -0.032), (40, -0.054), (41, -0.076), (42, -0.032), (43, 0.076), (44, 0.056), (45, 0.042), (46, 0.073), (47, 0.005), (48, -0.012), (49, 0.058)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98958528 244 cvpr-2013-Large Displacement Optical Flow from Nearest Neighbor Fields

Author: Zhuoyuan Chen, Hailin Jin, Zhe Lin, Scott Cohen, Ying Wu

Abstract: We present an optical flow algorithm for large displacement motions. Most existing optical flow methods use the standard coarse-to-fine framework to deal with large displacement motions which has intrinsic limitations. Instead, we formulate the motion estimation problem as a motion segmentation problem. We use approximate nearest neighbor fields to compute an initial motion field and use a robust algorithm to compute a set of similarity transformations as the motion candidates for segmentation. To account for deviations from similarity transformations, we add local deformations in the segmentation process. We also observe that small objects can be better recovered using translations as the motion candidates. We fuse the motion results obtained under similarity transformations and under translations together before a final refinement. Experimental validation shows that our method can successfully handle large displacement motions. Although we particularly focus on large displacement motions in this work, we make no sac- rifice in terms of overall performance. In particular, our method ranks at the top of the Middlebury benchmark.

2 0.85845995 124 cvpr-2013-Determining Motion Directly from Normal Flows Upon the Use of a Spherical Eye Platform

Author: Tak-Wai Hui, Ronald Chung

Abstract: We address the problem of recovering camera motion from video data without establishing feature correspondences or computing optical flow, working from normal flows directly. We have designed an imaging system with a wide field of view by fixing a number of cameras together to form an approximate spherical eye. With a substantially widened visual field, we find that estimating the directions of the translation and rotation components of the motion separately is possible and particularly efficient. In addition, the inherent ambiguities between translation and rotation disappear. The magnitude of rotation is recovered subsequently. Experimental results on synthetic and real image data are provided. They show that the accuracy of motion estimation is comparable to that of state-of-the-art methods that require explicit feature correspondences or optical flow, while the computation is faster.

3 0.83452159 10 cvpr-2013-A Fully-Connected Layered Model of Foreground and Background Flow

Author: Deqing Sun, Jonas Wulff, Erik B. Sudderth, Hanspeter Pfister, Michael J. Black

Abstract: Layered models allow scene segmentation and motion estimation to be formulated together and to inform one another. Traditional layered motion methods, however, employ fairly weak models of scene structure, relying on locally connected Ising/Potts models which have limited ability to capture long-range correlations in natural scenes. To address this, we formulate a fully-connected layered model that enables global reasoning about the complicated segmentations of real objects. Optimization with fully-connected graphical models is challenging, and our inference algorithm leverages recent work on efficient mean field updates for fully-connected conditional random fields. These methods can be implemented efficiently using high-dimensional Gaussian filtering. We combine these ideas with a layered flow model, and find that the long-range connections greatly improve segmentation into figure-ground layers when compared with locally connected MRF models. Experiments on several benchmark datasets show that the method can recover fine structures and large occlusion regions, with good flow accuracy and much lower computational cost than previous locally-connected layered models.

4 0.80328763 362 cvpr-2013-Robust Monocular Epipolar Flow Estimation

Author: Koichiro Yamaguchi, David McAllester, Raquel Urtasun

Abstract: We consider the problem of computing optical flow in monocular video taken from a moving vehicle. In this setting, the vast majority of image flow is due to the vehicle's ego-motion. We propose to take advantage of this fact and estimate flow along the epipolar lines of the ego-motion. Towards this goal, we derive a slanted-plane MRF model which explicitly reasons about the ordering of planes and their physical validity at junctions. Furthermore, we present a bottom-up grouping algorithm which produces over-segmentations that respect flow boundaries. We demonstrate the effectiveness of our approach in the challenging KITTI flow benchmark [11], achieving half the error of the best competing general flow algorithm and one third of the error of the best epipolar flow algorithm.

5 0.79922199 88 cvpr-2013-Compressible Motion Fields

Author: Giuseppe Ottaviano, Pushmeet Kohli

Abstract: Traditional video compression methods obtain a compact representation for image frames by computing coarse motion fields defined on patches of pixels called blocks, in order to compensate for the motion in the scene across frames. This piecewise constant approximation makes the motion field efficiently encodable, but it introduces block artifacts in the warped image frame. In this paper, we address the problem of estimating dense motion fields that, while accurately predicting one frame from a given reference frame by warping it with the field, are also compressible. We introduce a representation for motion fields based on wavelet bases, and approximate the compressibility of their coefficients with a piecewise smooth surrogate function that yields an objective function similar to classical optical flow formulations. We then show how to quantize and encode such coefficients with adaptive precision. We demonstrate the effectiveness of our approach by comparing its performance with a state-of-the-art wavelet video encoder. Experimental results on a number of standard flow and video datasets reveal that our method significantly outperforms both block-based and optical-flow-based motion compensation algorithms.

6 0.76007473 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition

7 0.7554003 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction

8 0.72578865 334 cvpr-2013-Pose from Flow and Flow from Pose

9 0.68195385 203 cvpr-2013-Hierarchical Video Representation with Trajectory Binary Partition Tree

10 0.6676811 326 cvpr-2013-Patch Match Filter: Efficient Edge-Aware Filtering Meets Randomized Search for Fast Correspondence Field Estimation

11 0.6564123 46 cvpr-2013-Articulated and Restricted Motion Subspaces and Their Signatures

12 0.64264357 137 cvpr-2013-Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis

13 0.63371164 170 cvpr-2013-Fast Rigid Motion Segmentation via Incrementally-Complex Local Models

14 0.63013375 316 cvpr-2013-Optical Flow Estimation Using Laplacian Mesh Energy

15 0.62408721 345 cvpr-2013-Real-Time Model-Based Rigid Object Pose Estimation and Tracking Combining Dense and Sparse Visual Cues

16 0.58491695 290 cvpr-2013-Motion Estimation for Self-Driving Cars with a Generalized Camera

17 0.57540238 283 cvpr-2013-Megastereo: Constructing High-Resolution Stereo Panoramas

18 0.56863898 455 cvpr-2013-Video Object Segmentation through Spatially Accurate and Temporally Dense Extraction of Primary Object Regions

19 0.56708056 118 cvpr-2013-Detecting Pulse from Head Motions in Video

20 0.54218721 300 cvpr-2013-Multi-target Tracking by Lagrangian Relaxation to Min-cost Network Flow


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.149), (10, 0.118), (16, 0.074), (26, 0.03), (33, 0.354), (67, 0.057), (69, 0.054), (87, 0.063)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96634418 289 cvpr-2013-Monocular Template-Based 3D Reconstruction of Extensible Surfaces with Local Linear Elasticity

Author: Abed Malti, Richard Hartley, Adrien Bartoli, Jae-Hak Kim

Abstract: We propose a new approach for template-based extensible surface reconstruction from a single view. We extend the method of isometric surface reconstruction and more recent work on conformal surface reconstruction. Our approach relies on the minimization of a proposed stretching energy formalized with respect to the Poisson ratio parameter of the surface. We derive a patch-based formulation of this stretching energy by assuming local linear elasticity. This formulation unifies geometrical and mechanical constraints in a single energy term. We prevent local scale ambiguities by imposing a set of fixed boundary 3D points. We experimentally prove the sufficiency of this set of boundary points and demonstrate the effectiveness of our approach on different developable and non-developable surfaces with a wide range of extensibility.

2 0.96119535 9 cvpr-2013-A Fast Semidefinite Approach to Solving Binary Quadratic Problems

Author: Peng Wang, Chunhua Shen, Anton van_den_Hengel

Abstract: Many computer vision problems can be formulated as binary quadratic programs (BQPs). Two classic relaxation methods are widely used for solving BQPs, namely, spectral methods and semidefinite programming (SDP), each with their own advantages and disadvantages. Spectral relaxation is simple and easy to implement, but its bound is loose. Semidefinite relaxation has a tighter bound, but its computational complexity is high for large scale problems. We present a new SDP formulation for BQPs, with two desirable properties. First, it has a similar relaxation bound to conventional SDP formulations. Second, compared with conventional SDP methods, the new SDP formulation leads to a significantly more efficient and scalable dual optimization approach, which has the same degree of complexity as spectral methods. Extensive experiments on various applications including clustering, image segmentation, co-segmentation and registration demonstrate the usefulness of our SDP formulation for solving large-scale BQPs.

3 0.95095021 226 cvpr-2013-Intrinsic Characterization of Dynamic Surfaces

Author: Tony Tung, Takashi Matsuyama

Abstract: This paper presents a novel approach to characterizing deformable surfaces using intrinsic property dynamics. 3D dynamic surfaces representing humans in motion can be obtained using multi-view stereo reconstruction methods or depth cameras. These technologies are now capable of capturing surface variations in real time and of providing details such as clothing wrinkles and deformations. Assuming repetitive patterns in the deformations, we propose to model complex surface variations using sets of linear dynamical systems (LDS), where the observations across time are given by intrinsic surface properties such as local curvatures. We introduce an approach based on bags of dynamical systems, where each surface feature to be represented in the codebook is modeled by a set of LDS equipped with a timing structure. Experiments are performed on datasets of real-world dynamical surfaces and show compelling results for description, classification and segmentation.

4 0.94029087 322 cvpr-2013-PISA: Pixelwise Image Saliency by Aggregating Complementary Appearance Contrast Measures with Spatial Priors

Author: Keyang Shi, Keze Wang, Jiangbo Lu, Liang Lin

Abstract: Driven by recent vision and graphics applications such as image segmentation and object recognition, assigning pixel-accurate saliency values that uniformly highlight foreground objects has become increasingly critical. Such fine-grained saliency detection is often also required to run fast. Motivated by these needs, we propose a generic and fast computational framework called PISA (Pixelwise Image Saliency Aggregating complementary saliency cues based on color and structure contrasts with spatial priors, holistically). Overcoming the limitations of previous methods, which often rely on homogeneous superpixel-based and color-contrast-only treatment, our PISA approach directly performs saliency modeling for each individual pixel and makes use of densely overlapping, feature-adaptive observations for saliency measure computation. We further impose a spatial prior term on each of the two contrast measures, which constrains pixels rendered salient to be compact and centered in the image domain. By fusing complementary contrast measures in such a pixelwise adaptive manner, the detection effectiveness is significantly boosted. Without requiring reliable region segmentation or post-relaxation, PISA exploits an efficient edge-aware image representation and filtering technique and produces spatially coherent yet detail-preserving saliency maps. Extensive experiments on three public datasets demonstrate PISA's superior detection accuracy and competitive runtime speed compared with state-of-the-art approaches.

5 0.93947756 91 cvpr-2013-Consensus of k-NNs for Robust Neighborhood Selection on Graph-Based Manifolds

Author: Vittal Premachandran, Ramakrishna Kakarala

Abstract: Propagating similarity information along the data manifold requires careful selection of the local neighborhood. Selecting a “good” neighborhood in an unsupervised setting, given an affinity graph, has been a difficult task. The most common way to select a local neighborhood has been to use the k-nearest neighborhood (k-NN) selection criterion. However, it has the tendency to include noisy edges. In this paper, we propose a way to select a robust neighborhood using the consensus of multiple rounds of k-NNs. We explain how using consensus information can give better control over neighborhood selection. We also explain in detail the problems with another recently proposed neighborhood selection criterion, i.e., Dominant Neighbors, and show that our method is immune to those problems. Finally, we show the results from experiments in which we compare our method to other neighborhood selection approaches. The results corroborate our claims that the consensus of k-NNs does indeed help in selecting more robust and stable localities.

same-paper 6 0.93812191 244 cvpr-2013-Large Displacement Optical Flow from Nearest Neighbor Fields

7 0.93754864 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation

8 0.92820758 326 cvpr-2013-Patch Match Filter: Efficient Edge-Aware Filtering Meets Randomized Search for Fast Correspondence Field Estimation

9 0.92613751 245 cvpr-2013-Layer Depth Denoising and Completion for Structured-Light RGB-D Cameras

10 0.92609489 384 cvpr-2013-Segment-Tree Based Cost Aggregation for Stereo Matching

11 0.92482877 352 cvpr-2013-Recovering Stereo Pairs from Anaglyphs

12 0.92451894 44 cvpr-2013-Area Preserving Brain Mapping

13 0.92407 380 cvpr-2013-Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images

14 0.92340004 403 cvpr-2013-Sparse Output Coding for Large-Scale Visual Recognition

15 0.92290723 330 cvpr-2013-Photometric Ambient Occlusion

16 0.92277402 271 cvpr-2013-Locally Aligned Feature Transforms across Views

17 0.92277169 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

18 0.92212075 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection

19 0.92201871 425 cvpr-2013-Tensor-Based High-Order Semantic Relation Transfer for Semantic Scene Segmentation

20 0.92178106 138 cvpr-2013-Efficient 2D-to-3D Correspondence Filtering for Scalable 3D Object Recognition