cvpr cvpr2013 cvpr2013-124 knowledge-graph by maker-knowledge-mining

124 cvpr-2013-Determining Motion Directly from Normal Flows Upon the Use of a Spherical Eye Platform


Source: pdf

Author: Tak-Wai Hui, Ronald Chung

Abstract: We address the problem of recovering camera motion from video data directly from normal flows, without requiring the establishment of feature correspondences or the computation of optical flows. We have designed an imaging system that has a wide field of view by fixating a number of cameras together to form an approximate spherical eye. With a substantially widened visual field, we discover that estimating the directions of the translation and rotation components of the motion separately is possible and particularly efficient. In addition, the inherent ambiguities between translation and rotation also disappear. The magnitude of rotation is recovered subsequently. Experimental results on synthetic and real image data are provided. The results show that not only is the accuracy of motion estimation comparable to that of the state-of-the-art methods that require explicit feature correspondences or optical flows, but the computation is also faster.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We address the problem of recovering camera motion from video data directly from normal flows, without requiring the establishment of feature correspondences or the computation of optical flows. [sent-5, score-1.666]

2 We have designed an imaging system that has a wide field of view by fixating a number of cameras together to form an approximate spherical eye. [sent-6, score-0.5]

3 With a substantially widened visual field, we discover that estimating the directions of the translation and rotation components of the motion separately is possible and particularly efficient. [sent-7, score-0.605]

4 In addition, the inherent ambiguities between translation and rotation also disappear. [sent-8, score-0.33]

5 The results show that not only is the accuracy of motion estimation comparable to that of the state-of-the-art methods that require explicit feature correspondences or optical flows, but the computation is also faster. [sent-11, score-0.379]

6 The translation magnitude of the motion is generally not determinable and is left as an arbitrary overall scale relative to object depth, because of the well-known ambiguity between object size/depth and translation speed. [sent-14, score-0.726]

7 This paper presents a direct method to determine the five degrees of freedom (DoFs), namely the direction of translation and the full rotation, of a camera moving in a static scene. [sent-15, score-0.649]

8 The correspondences might be in the form of optical flows (also known as full flows) using a monocular camera [14], [26], [33], multiple cameras [17], or a spherical camera [sent-17, score-1.397]

9 [24], [25], or correspondences over distinct features using a monocular camera [31], [23], [28], multiple cameras [32], [20], [19], or a spherical camera [21]. [sent-19, score-0.805]

10 Optical flow induced by the spatial motion at any image position is only partially observable in general due to the familiar aperture problem. [sent-20, score-0.541]

11 The apparent flow, termed the normal flow, which is the component of the optical flow along or opposite to the direction of the local intensity gradient, is fully observable. [sent-21, score-0.797]

12 The partial observability of the flow is what makes full flow computation and in turn motion determination a challenge. [sent-22, score-0.839]

13 A few methods have been proposed to determine camera motion from normal flows directly without ever requiring the full flows to be recovered explicitly. [sent-26, score-1.523]

14 Unlike optical flow, normal flow can be obtained directly from image data, without involving the minimization of a cost functional [15], [6], [36], which is often computationally demanding. [sent-29, score-0.63]
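As a hedged illustration of this point (a minimal sketch, not the authors' code; the frame names, forward-difference temporal derivative, and gradient threshold are assumptions), normal flow can be read directly off spatio-temporal image gradients via the brightness constancy constraint, with no iterative cost-functional minimization:

```python
import numpy as np

def normal_flow(frame_prev, frame_next, grad_thresh=1e-3):
    """Normal flow from two consecutive grayscale frames via the
    brightness constancy constraint  grad(I) . xdot + I_t = 0."""
    I0 = frame_prev.astype(np.float64)
    I1 = frame_next.astype(np.float64)
    Iy, Ix = np.gradient(I0)           # spatial gradients (rows = y, cols = x)
    It = I1 - I0                       # forward-difference temporal derivative
    mag = np.hypot(Ix, Iy)             # gradient magnitude
    valid = mag > grad_thresh          # aperture problem: need local texture
    u_n = np.zeros_like(I0)            # signed magnitude along the gradient
    u_n[valid] = -It[valid] / mag[valid]
    nx = np.where(valid, Ix / np.maximum(mag, 1e-12), 0.0)
    ny = np.where(valid, Iy / np.maximum(mag, 1e-12), 0.0)
    return u_n * nx, u_n * ny, valid   # normal flow components and validity mask
```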

15 In addition, the invention of a normal-flow measurement camera gives additional support to the use of normal flow [29]. [sent-32, score-1.153]

16 The video perceived under, say, a pure translation of the camera in the x-direction would be similar to that under a pure left-hand rotation about the y-axis, where the x and y axes are the orthogonal coordinate axes of the image domain. [sent-34, score-0.726]
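For context, the standard instantaneous motion-field equations for a planar pinhole camera (a textbook result, not reproduced from this paper; signs depend on the convention for camera versus scene motion) make this ambiguity explicit:

```latex
u(x,y) = \frac{x t_z - f t_x}{Z} + \frac{xy}{f}\,\omega_x - \Big(f + \frac{x^2}{f}\Big)\omega_y + y\,\omega_z,
\qquad
v(x,y) = \frac{y t_z - f t_y}{Z} + \Big(f + \frac{y^2}{f}\Big)\omega_x - \frac{xy}{f}\,\omega_y - x\,\omega_z .
```

Near the image center (x, y ≈ 0) these reduce to u ≈ −f t_x/Z − f ω_y and v ≈ −f t_y/Z + f ω_x, so with a narrow field of view and unknown depth Z a pure translation along x is nearly indistinguishable from a pure rotation about y; a wide (spherical) field of view breaks this symmetry.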

17 They can execute navigational tasks accurately relying on the visual clue from optical flows [35]. [sent-37, score-0.555]

18 In this paper, we present a direct method that uses an approximate spherical eye to recover motion parameters in a static scene. [sent-38, score-0.582]

19 The approximate spherical eye comprises a number of cameras that have optical centers placed near one another, without necessarily having overlapping FoVs. [sent-39, score-0.659]

20 First, the translation’s direction and the rotation’s direction of the motion can be recovered separately in an efficient manner. [sent-41, score-0.361]

21 Second, tighter constraints on the motion solution (two pairs of special normal flows are enough to reduce the possible motion solutions by 3/4 of the motion space) are available to improve the result significantly and, in turn, the computation speed. [sent-43, score-1.265]

22 The main contribution is instead to provide a mechanism by which the motion ambiguity arising from the use of normal flows can be reduced, by separating the translation and rotation motion components through the use of three particular subsets of normal flows. [sent-46, score-1.706]

23 Related Works. Direct methods determine camera motion from normal flows without prior computation of full flows. [sent-50, score-1.071]

24 Horn and Weldon required the camera to undergo pure translation, pure rotation, or general motion with known rotation [16]. [sent-52, score-0.58]

25 The boundary of each pattern is generally difficult to extract due to the sparse normal flow field. [sent-57, score-0.495]

26 Yet only a limited number of normal flows participated in recovering egomotion. [sent-60, score-0.45]

27 turned the recovery of camera motion into three sub-problems, each involving one rotational and two translational parameters (translation and rotation are mixed up) [30]. [sent-67, score-0.589]

28 This partial separation of motion components is only possible when normal flows are located at the three equators which are perpendicular to the three principal axes of the camera’s coordinate frame respectively. [sent-68, score-1.086]

29 also presented the use of global patterns [8] for the case of spherical eye [7]. [sent-70, score-0.332]

30 extended their previous work from planar eye [5] to spherical eye [2]. [sent-72, score-0.433]

31 provided an analysis of the conditions under which apparent flows become ambiguous [4]. [sent-76, score-0.461]

32 Fermüller and Aloimonos characterized the structure of rigid motion fields [9], the ambiguity in structure from motion using planar and spherical eyes [10], and also the observability of 3D motion under different fields of view [11]. [sent-77, score-0.973]

33 Our proposed method, being a direct one, provides an alternative approach to recover camera motion without the need for matching feature correspondences and recovering full optical flows, as the current state-of-the-art methods do. [sent-78, score-1.096]

34 Our work is related to [24] in that both determine the directions of translation and rotation from general motion using merely the direction component of flow vectors. [sent-79, score-0.898]

35 Unlike their work, we utilize normal flows, which are directly observable, instead of requiring prior computation of full flows. [sent-80, score-0.766]

36 Our algorithm is developed for a multi-camera rig but not for a perfectly spherical eye. [sent-81, score-0.401]

37 Moreover, we do not demand that each pair of flow vectors be located at opposite image positions on the image sphere. [sent-82, score-0.366]

38 Unlike the works of [2], [17], [25], our strategy is to separate the directions of translation and rotation from general motion. [sent-85, score-0.367]

39 The optical flow ẋ at the image position x = (x, y)ᵀ is given by: ẋ = (1/Z) A(x) t + B(x) w, where A(x) = [−f 0 x; 0 −f y] and B(x) = [xy/f −(f + x²/f) y; f + y²/f −xy/f −x]. [sent-95, score-0.424]

40 If the image plane is warped to a spherical imaging surface with focal length f (as shown in Figure 1b), the image position becomes xs = fX/‖X‖. [sent-108, score-0.455]

41 The optical flow ẋs, which is tangential to the imaging spherical surface at xs, is given by: ẋs = (xs × (xs × t)) / (f‖X‖) − w × xs. [sent-109, score-0.687]
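A minimal numerical sketch of these two projections (illustrative only; it assumes the reconstructed form of the spherical motion field above, with X, t, w as length-3 arrays and unit focal length by default):

```python
import numpy as np

def spherical_flow(X, t, w, f=1.0):
    """Instantaneous flow of scene point X (3-vector) on a spherical
    imaging surface of radius f, for camera translation t and rotation w."""
    depth = np.linalg.norm(X)
    xs = f * X / depth                                   # point on the sphere
    trans = np.cross(xs, np.cross(xs, t)) / (f * depth)  # translational part
    rot = -np.cross(w, xs)                               # rotational part
    return xs, trans + rot                               # flow is tangent to the sphere

def normal_component(flow, n):
    """Project a (tangential) flow vector onto a unit gradient direction n."""
    return np.dot(flow, n) * n
```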

42 In general, optical flow at any image position is not directly observable from the image because of the well-known aperture problem. [sent-113, score-0.479]

43 Only the component of the flow projected onto the spatial intensity gradient n (normalized to a unit vector) at that position, known as the normal flow, is directly observable. [sent-114, score-0.556]

44 By using the Brightness Constancy Constraint Equation (BCCE) [15], we can relate the optical flow ẋ and the spatial intensity gradient ∇I in a planar image as: ∇I · ẋ + I_t = 0. [sent-115, score-0.402]

45 The camera moves with a translation t and a rotation w. [sent-120, score-0.493]

46 A 3D scene point X projects onto the image plane at x and induces optical flow ẋ and normal flow ẋn there. [sent-121, score-0.915]

47 Similarly, optical flow ẋs and normal flow ẋsn are induced at xs on the spherical surface. [sent-122, score-1.18]

48 One way of constructing a spherical eye is to stitch two omnidirectional cameras together in a back-to-back configuration. [sent-125, score-0.478]

49 In this work, we explore the use of multiple standard cameras to approximate the spherical eye. [sent-127, score-0.403]

50 We bundle the cameras together with their optical centers close to one another and with their visual fields distinct. [sent-128, score-0.327]

51 If the optical centers of the cameras can be made exactly concurrent, the multi-camera system mimics a spherical eye. [sent-129, score-0.627]

52 There is evidence indicating that even with such an imperfect imaging system, the gain (in having a wider FoV) generally outweighs the loss (from the non-concurrency of the optical centers), and substantial improvement in recovering motion is possible [19]. [sent-131, score-0.416]

53 With a number of cameras stitched together, we seek to recover the 5 DoFs of the rigid motion of the camera rig directly from the normal flows observed in the various cameras. [sent-132, score-1.324]

54 Below we first outline how normal flows in the multiple cameras are related to the camera rig motion when the baseline distances between the cameras are small compared with the overall scene depth from the camera rig. [sent-133, score-1.667]

55 Suppose the image point xi in the ith camera is the projection of a 3D point Xi with depth Zi (with respect to the camera coordinate frame Ci), and at the image position xi the local camera motion (ti, wi) induces a full flow ẋi. [sent-134, score-1.235]

56 By projecting the full flow ẋi onto the spatial gradient ni (a unit vector) at the image position xi, we obtain the normal flow magnitude |ẋi · ni|, which combines a translational term scaled by the inverse depth 1/Zi and a depth-independent rotational term. [sent-135, score-0.91]

57 By a few algebraic manipulations over (6), (7), (8), and (9), the normal flow ẋni in the ith camera can be related to the desired motion parameters t and w by (10), a sum of a translational term weighted by ati and scaled by the inverse depth and a rotational term weighted by awi, [sent-149, score-0.995]

58 where ati, defined in (11) with respect to (sin θi, −cos θi, 0)T, and awi = ati × xi (12) are terms related to the normal flow with orientation θi at the image position xi of the ith camera. [sent-153, score-0.746]

59 In particular, ati is orthogonal to (sin θi, −cos θi, 0)T, which is the direction vector of the normal flow (ẋni, 0)T (in projective coordinates) rotated about the camera's optical axis by 90°. [sent-157, score-0.905]

60 This means that ati, the image position vector xi, and the normal flow ( x˙ni, 0)T (in projective coordinates) lie on the plane (Πi), and ati points in the direction governed by (11). [sent-158, score-0.828]

61 The inverse-scene-depth-scaled term Ri ati and the term Ri awi indicate that every normal flow data point from each camera is transformed from its local camera system Ci to the global coordinate system C0 through the rotation matrix Ri. [sent-182, score-1.097]

62 One is to use a wide FoV imaging system to reduce the ambiguity in motion estimation. [sent-194, score-0.348]

63 In addition, we separate the translation and rotation components in the motion recovery process, and treat them one by one. [sent-195, score-0.571]

64 Classification of a Pair of Normal Flows. Consider the spherical imaging surface approximated by the image planes of several standard cameras (with camera centers possibly mildly non-concurrent). [sent-199, score-0.698]

65 Here, we just use two cameras with centers C1 and C2 to illustrate the classification of normal flow pairs in Figure 2. [sent-200, score-0.687]

66 Suppose we have two observable normal flows x˙ n1 and x˙ n2 at the image positions x1 and x2 (3-vectors expressed in projective coordinates) with respect to their local camera coordinates respectively. [sent-201, score-1.048]

67 For the spherical approximation of multiple cameras, the two camera centers C1 and C2 coincide with the camera rig's center C0. [sent-202, score-0.629]

68 All the measured entities are transformed from their local camera coordinates to the camera rig's coordinate system (i.e., C0). [sent-203, score-0.447]
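A small bookkeeping sketch of this step (variable names and data layout are assumptions, not the authors' code): each per-camera measurement is rotated into the rig frame C0 before any pairing or classification.

```python
import numpy as np

def to_rig_frame(x_local, flow_local, R_i):
    """Express an image position and its normal flow, measured in camera i's
    frame C_i, in the rig's coordinate system C_0 (R_i rotates C_i into C_0)."""
    return R_i @ x_local, R_i @ flow_local

# Usage idea: collect all measurements in one common frame before classification,
# e.g. [to_rig_frame(x, nf, R[i]) for i, (x, nf) in enumerate(per_camera_data)],
# where per_camera_data is a hypothetical list of (position, normal flow) pairs.
```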

69 and ẋn1 · ẋn2 > 0, then we classify the normal flow pair as an α-vector pair (Figure 2a). [sent-228, score-0.577]

70 T normal flow (in projective coordinates) lie on the plane Πi. [sent-238, score-0.641]

71 Π1 and Π2 are orthogonal to P when the normal flows form a γ-pair. [sent-249, score-0.741]

72 2 of the approximate spherical eye as shown in Figure 2a. [sent-256, score-0.332]

73 The above AFS constraints, with the exception of the AFSβ (tˆ; x, θ, d) constraint, depend only on the direction of the normal flows. [sent-335, score-0.32]

74 Each pair of normal flows that belongs to either the α-, β-, or γ-group can trim away up to half of the associated motion space (α- and β-vectors for the t̂-space, γ-vectors for the ŵ-space). [sent-337, score-0.948]
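The way such half-space constraints are typically accumulated can be sketched as a vote over a discretized direction sphere (an illustrative scheme consistent with the probability maps mentioned next, not the paper's exact algorithm; the predicate `pair_rejects` is a placeholder for the α-/β-/γ-pair tests):

```python
import numpy as np

def fibonacci_sphere(n=2000):
    """Roughly uniform sample of candidate unit directions."""
    k = np.arange(n) + 0.5
    phi = np.arccos(1.0 - 2.0 * k / n)
    theta = np.pi * (1.0 + 5**0.5) * k
    return np.stack([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(phi)], axis=1)

def direction_map(pairs, pair_rejects, n_dirs=2000):
    """Accumulate support for candidate directions; each qualifying pair of
    normal flows can rule out up to half of the candidates."""
    dirs = fibonacci_sphere(n_dirs)
    votes = np.zeros(n_dirs)
    for pair in pairs:
        alive = np.array([not pair_rejects(pair, d) for d in dirs])
        votes += alive                        # surviving directions gain support
    return dirs, votes / max(len(pairs), 1)   # normalized support map
```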

75 Apparent Flow Magnitude (AFM) Constraint. The AFS constraints return two probability maps that indicate the direction of translation and the direction of rotation separately. [sent-388, score-0.462]

76 The solution set can be further refined by going through a second stage that involves the normal flow's magnitude, by the use of partial detranslation and complete derotation similar to that described in [8]. [sent-389, score-0.553]

77 The magnitude component of normal flows could be more erroneous than the direction component because the temporal resolution of a video is generally lower than its spatial resolution. [sent-391, score-0.871]

78 In complete derotation, we need to remove the rotational component in the normal flows by using {w}, which is obtained from partial detranslation. [sent-402, score-0.777]
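A hedged sketch of this derotation step (it assumes the spherical-flow decomposition used earlier; w_hat is a candidate rotation from the first stage, and the variable names are assumptions, not the paper's notation):

```python
import numpy as np

def derotate_normal_flow(u_n, n, xs, w_hat):
    """Subtract the rotation-induced part of the flow at sphere point xs from
    the measured normal-flow magnitude u_n along the unit gradient direction n."""
    rot_flow = -np.cross(w_hat, xs)       # rotational flow on the sphere
    return u_n - np.dot(rot_flow, n)      # remaining, purely translational part
```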

79 The spherical imaging system consisted of 4 cameras which were positioned in a cross-shaped configuration as shown in Figure 4b. [sent-423, score-0.5]

80 Each camera in the rig was placed 2cm away from the global coordinate frame C0. [sent-426, score-0.379]

81 To simulate a sparse flow field, we randomly chose only 5% of the flow vectors. [sent-433, score-0.509]

82 To simulate flow extraction error, full flows were corrupted by Gaussian noise. [sent-434, score-0.725]
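The sparse, noisy synthetic flow described in these two steps can be reproduced with a few lines (the 5% sampling rate follows the text; the noise standard deviation and random seed are assumptions):

```python
import numpy as np

def sparsify_and_corrupt(flow, keep_ratio=0.05, noise_sigma=0.1, seed=0):
    """Keep a random fraction of the full-flow vectors and add Gaussian noise.
    flow: (N, 2) array of full-flow vectors."""
    rng = np.random.default_rng(seed)
    n = flow.shape[0]
    idx = rng.choice(n, size=max(1, int(keep_ratio * n)), replace=False)
    noisy = flow[idx] + rng.normal(0.0, noise_sigma, size=(idx.size, flow.shape[1]))
    return idx, noisy
```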

83 The camera rig underwent general motion that included both translation and rotation motions randomly generated over all possible directions. [sent-436, score-0.834]

84 The magnitudes of translation and rotation were fixed to 6. [sent-437, score-0.33]

85 We randomly picked 2000 pairs of normal flows for each of the α- and β-groups, and another 4000 pairs of normal flows for the γ-group. [sent-441, score-1.348]

86 This means that the estimation of the directions of translation and rotation used the same number of normal flows. [sent-442, score-0.367]

87 The camera system was placed on a computer-controlled xy-table with a manually tunable rotation stage (Fig. [sent-468, score-0.354]

88 The same number of normal flows was used as in the simulation. [sent-483, score-0.674]

89 spher-5-pt RANSAC – A 5-point algorithm that utilizes feature correspondences in RANSAC to estimate motion from an approximate spherical camera [19]. [sent-487, score-0.696]

90 TV-L1-NL+LM – A linear method that utilizes optical flows to estimate camera motion [33]. [sent-489, score-0.947]

91 TV-L1-NL+LQP – A linear quasi-parallax method [17] uses optical flows [36] from pairs of anti-parallel visual rays. [sent-492, score-0.555]

92 AFD+AFM – A two-stage direct method that utilizes normal flows to estimate the motion of a monocular camera [18]. [sent-494, score-1.148]

93 Conclusion. We have proposed two constraints that allow normal flows to be used directly for motion recovery, which are readily usable with the availability of a wide field-of-view imaging system. [sent-521, score-0.925]

94 The first constraint separates the directions of translation and rotation components from general motion. [sent-522, score-0.404]

95 A spherical eye from multiple cameras (makes better models of the world). [sent-538, score-0.478]

96 Determining spatial motion directly from normal flow: A comprehensive treatment. [sent-632, score-0.451]

97 Spherical approximation for multiple cameras in motion estimation: Its applicability and advantages. [sent-638, score-0.343]

98 Estimation of the epipole using optical flow at antipodal points. [sent-675, score-0.417]

99 Finding motion parameters from spherical motion fields (or the advantages of having eyes in the back of your head). [sent-704, score-0.651]

100 Robust egomotion estimation from the normal flow using search subspaces. [sent-723, score-0.54]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('flows', 0.42), ('afs', 0.315), ('spherical', 0.257), ('normal', 0.254), ('flow', 0.241), ('ferm', 0.198), ('motion', 0.197), ('translation', 0.182), ('camera', 0.163), ('ller', 0.148), ('rotation', 0.148), ('cameras', 0.146), ('rig', 0.144), ('optical', 0.135), ('afm', 0.126), ('derotation', 0.102), ('ransac', 0.083), ('brodsk', 0.081), ('detranslation', 0.081), ('riati', 0.081), ('magnitude', 0.077), ('eye', 0.075), ('ati', 0.073), ('projective', 0.069), ('orthogonal', 0.067), ('direction', 0.066), ('riawi', 0.061), ('fov', 0.06), ('ni', 0.055), ('observable', 0.055), ('imaging', 0.054), ('ambiguity', 0.054), ('awi', 0.054), ('direct', 0.053), ('xs', 0.052), ('positions', 0.051), ('wi', 0.05), ('aloimonos', 0.05), ('angular', 0.05), ('separation', 0.05), ('position', 0.048), ('correspondences', 0.047), ('centers', 0.046), ('egomotion', 0.045), ('observability', 0.045), ('horn', 0.045), ('plane', 0.044), ('recovery', 0.044), ('system', 0.043), ('coordinate', 0.042), ('apparent', 0.041), ('xi', 0.041), ('pair', 0.041), ('antipodal', 0.041), ('subtended', 0.041), ('widened', 0.041), ('determination', 0.039), ('instantaneous', 0.039), ('partial', 0.039), ('cos', 0.038), ('rotational', 0.037), ('compound', 0.037), ('constraint', 0.037), ('directions', 0.037), ('full', 0.037), ('pure', 0.036), ('coordinates', 0.036), ('trim', 0.036), ('hui', 0.036), ('nelson', 0.036), ('weldon', 0.036), ('dofs', 0.036), ('icosahedral', 0.036), ('ijcv', 0.036), ('inequality', 0.035), ('ith', 0.035), ('depth', 0.034), ('unit', 0.034), ('lie', 0.033), ('opposite', 0.033), ('recovered', 0.032), ('planes', 0.032), ('utilizes', 0.032), ('locus', 0.031), ('simulation', 0.031), ('eliminated', 0.03), ('frame', 0.03), ('recovering', 0.03), ('inequalities', 0.029), ('monocular', 0.029), ('median', 0.028), ('perpendicular', 0.028), ('sdet', 0.028), ('velocity', 0.028), ('component', 0.027), ('coplanar', 0.027), ('suppose', 0.027), ('simulate', 0.027), ('planar', 0.026), ('axes', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 124 cvpr-2013-Determining Motion Directly from Normal Flows Upon the Use of a Spherical Eye Platform

Author: Tak-Wai Hui, Ronald Chung

Abstract: We address the problem of recovering camera motion from video data directly from normal flows, without requiring the establishment of feature correspondences or the computation of optical flows. We have designed an imaging system that has a wide field of view by fixating a number of cameras together to form an approximate spherical eye. With a substantially widened visual field, we discover that estimating the directions of the translation and rotation components of the motion separately is possible and particularly efficient. In addition, the inherent ambiguities between translation and rotation also disappear. The magnitude of rotation is recovered subsequently. Experimental results on synthetic and real image data are provided. The results show that not only is the accuracy of motion estimation comparable to that of the state-of-the-art methods that require explicit feature correspondences or optical flows, but the computation is also faster.

2 0.26771897 244 cvpr-2013-Large Displacement Optical Flow from Nearest Neighbor Fields

Author: Zhuoyuan Chen, Hailin Jin, Zhe Lin, Scott Cohen, Ying Wu

Abstract: We present an optical flow algorithm for large displacement motions. Most existing optical flow methods use the standard coarse-to-fine framework to deal with large displacement motions which has intrinsic limitations. Instead, we formulate the motion estimation problem as a motion segmentation problem. We use approximate nearest neighbor fields to compute an initial motion field and use a robust algorithm to compute a set of similarity transformations as the motion candidates for segmentation. To account for deviations from similarity transformations, we add local deformations in the segmentation process. We also observe that small objects can be better recovered using translations as the motion candidates. We fuse the motion results obtained under similarity transformations and under translations together before a final refinement. Experimental validation shows that our method can successfully handle large displacement motions. Although we particularly focus on large displacement motions in this work, we make no sac- rifice in terms of overall performance. In particular, our method ranks at the top of the Middlebury benchmark.

3 0.20843789 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction

Author: Dennis Park, C. Lawrence Zitnick, Deva Ramanan, Piotr Dollár

Abstract: We describe novel but simple motion features for the problem of detecting objects in video sequences. Previous approaches either compute optical flow or temporal differences on video frame pairs with various assumptions about stabilization. We describe a combined approach that uses coarse-scale flow and fine-scale temporal difference features. Our approach performs weak motion stabilization by factoring out camera motion and coarse object motion while preserving nonrigid motions that serve as useful cues for recognition. We show results for pedestrian detection and human pose estimation in video sequences, achieving state-of-the-art results in both. In particular, given a fixed detection rate our method achieves a five-fold reduction in false positives over prior art on the Caltech Pedestrian benchmark. Finally, we perform extensive diagnostic experiments to reveal what aspects of our system are crucial for good performance. Proper stabilization, long time-scale features, and proper normalization are all critical.

4 0.17737761 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition

Author: Mihir Jain, Hervé Jégou, Patrick Bouthemy

Abstract: Several recent works on action recognition have attested the importance of explicitly integrating motion characteristics in the video description. This paper establishes that adequately decomposing visual motion into dominant and residual motions, both in the extraction of the space-time trajectories and for the computation of descriptors, significantly improves action recognition algorithms. Then, we design a new motion descriptor, the DCS descriptor, based on differential motion scalar quantities, divergence, curl and shear features. It captures additional information on the local motion patterns enhancing results. Finally, applying the recent VLAD coding technique proposed in image retrieval provides a substantial improvement for action recognition. Our three contributions are complementary and lead to outperform all reported results by a significant margin on three challenging datasets, namely Hollywood 2, HMDB51 and Olympic Sports. 1. Introduction and related work Human actions often convey the essential meaningful content in videos. Yet, recognizing human actions in un- constrained videos is a challenging problem in Computer Vision which receives a sustained attention due to the potential applications. In particular, there is a large interest in designing video-surveillance systems, providing some automatic annotation of video archives as well as improving human-computer interaction. The solutions proposed to address this problem inherit, to a large extent, from the techniques first designed for the goal of image search and classification. The successful local features developed to describe image patches [15, 23] have been translated in the 2D+t domain as spatio-temporal local descriptors [13, 30] and now include motion clues [29]. These descriptors are often extracted from spatial-temporal interest points [12, 3 1]. More recent techniques assume some underlying temporal motion model involving trajectories [2, 6, 7, 17, 18, 25, 29, 32]. Most of these approaches produce large set of local descriptors which are in turn aggregated to produce a single vector representing the video, in order to enable the use of powerful discriminative classifiers such as support vector machines (SVMs). This is usually done with the bag- Figure 1. Optical flow field vectors (green vectors with red end points) before and after dominant motion compensation. Most of the flow vectors due to camera motion are suppressed after compensation. One of the contributions of this paper is to show that compensating for the dominant motion is beneficial for most of the existing descriptors used for action recognition. of-words technique [24], which quantizes the local features using a k-means codebook. Thanks to the successful combination of this encoding technique with the aforementioned local descriptors, the state of the art in action recognition is able to go beyond the toy problems ofclassifying simple human actions in controlled environment and considers the detection of actions in real movies or video clips [11, 16]. Despite these progresses, the existing descriptors suffer from an uncompleted handling of motion in the video sequence. Motion is arguably the most reliable source of information for action recognition, as often related to the actions of interest. However, it inevitably involves the background or camera motion when dealing with uncontrolled and re- alistic situations. 
Although some attempts have been made to compensate camera motion in several ways [10, 21, 26, 29, 32], how to separate action motion from that caused by the camera, and how to reflect it in the video description remains an open issue. The motion compensation mechanism employed in [10] is tailor-made to the Motion Interchange Pattern encoding technique. The Motion Boundary Histogram (MBH) [29] is a recent appealing approach to 222555555533 suppress the constant motion by considering the flow gradient. It is robust to some extent to the presence of camera motion, yet it does not explicitly handle the camera motion. Another approach [26] uses a sophisticated and robust (RANSAC) estimation of camera motion. It first segments the color image into regions corresponding to planar parts in the scene and estimates the (three) dominant homographies to update the motion associated with local features. A rather different view is adopted in [32] where the motion decomposition is performed at the trajectory level. All these works support the potential of motion compensation. As the first contribution of this paper, we address the problem in a way that departs from these works by considering the compensation of the dominant motion in both the tracking stages and encoding stages involved in the computation of action recognition descriptors. We rely on the pioneering works on motion compensation such as the technique proposed in [20], that considers 2D polynomial affine motion models for estimating the dominant image motion. We consider this particular model for its robustness and its low computational cost. It was already used in [21] to separate the dominant motion (assumed to be due to the camera motion) and the residual motion (corresponding to the independent scene motions) for dynamic event recognition in videos. However, the statistical modeling of both motion components was global (over the entire image) and only the normal flow was computed for the latter. Figure 1 shows the vectors of optical flow before and after applying the proposed motion compensation. Our method successfully suppresses most of the background motion and reinforces the focus towards the action of interest. We exploit this compensated motion both for descriptor computation and for extracting trajectories. However, we also show that the camera motion should not be thrown as it contains complementary information that is worth using to recognize certain action categories. Then, we introduce the Divergence-Curl-Shear (DCS) descriptor, which encodes scalar first-order motion features, namely the motion divergence, curl and shear. It captures physical properties of the flow pattern that are not involved in the best existing descriptors for action recognition, except in the work of [1] which exploits divergence and vorticity among a set of eleven kinematic features computed from the optical flow. Our DCS descriptor provides a good performance recognition performance on its own. Most importantly, it conveys some information which is not captured by existing descriptors and further improves the recognition performance when combined with the other descriptors. As a last contribution, we bring an encoding technique known as VLAD (vector oflocal aggregated descriptors) [8] to the field of action recognition. This technique is shown to be better than the bag-of-words representation for combining all the local video descriptors we have considered. The organization of the paper is as follows. 
Section 2 introduces the motion properties that we will consider through this paper. Section 3 presents the datasets and classification scheme used in our different evaluations. Section 4 details how we revisit several popular descriptors of the literature by the means of dominant motion compensation. Our DCS descriptor based on kinematic properties is introduced in Section 5 and improved by the VLAD encoding technique, which is introduced and bench-marked in Section 6 for several video descriptors. Section 7 provides a comparison showing the large improvement achieved over the state of the art. Finally, Section 8 concludes the paper. 2. Motion Separation and Kinematic Features In this section, we describe the motion clues we incorporate in our action recognition framework. We separate the dominant motion and the residual motion. In most cases, this will account to distinguishing the impact of camera movement and independent actions. Note that we do not aim at recovering the 3D camera motion: The 2D parametric motion model describes the global (or dominant) motion between successive frames. We first explain how we estimate the dominant motion and employ it to separate the dominant flow from the optical flow. Then, we will introduce kinematic features, namely divergence, curl and shear for a more comprehensive description of the visual motion. 2.1. Affine motion for compensating camera motion Among polynomial motion models, we consider the 2D affine motion model. Simplest motion models such as the 4parameter model formed by the combination of 2D translation, 2D rotation and scaling, or more complex ones such as the 8-parameter quadratic model (equivalent to a homography), could be selected as well. The affine model is a good trade-off between accuracy and efficiency which is of primary importance when processing a huge video database. It does have limitations since strictly speaking it implies a single plane assumption for the static background. However, this is not that penalizing (especially for outdoor scenes) if differences in depth remain moderated with respect to the distance to the camera. The affine flow vector at point p = (x, y) and at time t, is defined as waff(pt) =?cc12((t ) ?+?aa31((t ) aa42((t ) ? ?xytt?. (1) = + + = + uaff(pt) c1(t) a1(t)xt a2(t)yt and vaff(pt) c2(t) a3 (t)xt + a4(t)yt are horizontal and vertical components of waff(pt) respectively. Let us denote the optical flow vector at point p at time t as w(pt) = (u(pt) , v(pt)). We introduce the flow vector ω(pt) obtained by removing the affine flow vector from the optical flow vector ω(pt) = w(pt) − waff(pt) . (2) 222555555644 The dominant motion (estimated as waff(pt)) is usually due to the camera motion. In this case, Equation 2 amounts to canceling (or compensating) the camera motion. Note that this is not always true. For example in case of close-up on a moving actor, the dominant motion will be the affine estimation of the apparent actor motion. The interpretation of the motion compensation output will not be that straightforward in this case, however the resulting ω-field will still exhibit different patterns for the foreground action part and the background part. In the remainder, we will refer to the “compensated” flow as ω-flow. Figure 1 displays the computed optical flow and the ωflow. We compute the affine flow with the publicly available Motion2D software1 [20] which implements a realtime robust multiresolution incremental estimation framework. 
The affine motion model has correctly accounted for the motion induced by the camera movement which corresponds to the dominant motion in the image pair. Indeed, we observe that the compensated flow vectors in the background are close to null and the compensated flow in the foreground, i.e., corresponding to the actors, is conversely inflated. The experiments presented along this paper will show that effective separation of dominant motion from the residual motions is beneficial for action recognition. As explained in Section 4, we will compute local motion descriptors, such as HOF, on both the optical flow and the compensated flow (ω-flow), which allows us to explicitly and directly characterize the scene motion. 2.2. Local kinematic features By kinematic features, we mean local first-order differential scalar quantities computed on the flow field. We consider the divergence, the curl (or vorticity) and the hyperbolic terms. They inform on the physical pattern of the flow so that they convey useful information on actions in videos. They can be computed from the first-order derivatives of the flow at every point p at every frame t as ⎨⎪ ⎪ ⎪ ⎧hcdyuipvr1l2(p t) = −∂ u ∂ (yxp(xtp) +−∂ v ∂ v(px ypxt ) The diverg⎪⎩ence is related to axial motion, expansion scaling effects, the curl to rotation in the image plane. hyperbolic terms express the shear of the visual flow responding to more complex configuration. We take account the shear quantity only: shear(pt) = ?hyp12(pt) + hyp22(pt). (3) and The corinto (4) 1http://www.irisa.fr/vista/Motion2D/ In Section 5, we propose the DCS descriptor that is based on the kinematic features (divergence, curl and shear) of the visual motion discussed in this subsection. It is computed on either the optical or the compensated flow, ω-flow. 3. Datasets and evaluation This section first introduces the datasets used for the evaluation. Then, we briefly present the bag-of-feature model and the classification scheme used to encode the descriptors which will be introduced in Section 4. Hollywood2. The Hollywood2 dataset [16] contains 1,707 video clips from 69 movies representing 12 action classes. It is divided into train set and test set of 823 and 884 samples respectively. Following the standard evaluation protocol of this benchmark, we use average precision (AP) for each class and the mean of APs (mAP) for evaluation. HMDB51. The HMDB51 dataset [11] is a large dataset containing 6,766 video clips extracted from various sources, ranging from movies to YouTube. It consists of 51 action classes, each having at least 101 samples. We follow the evaluation protocol of [11] and use three train/test splits, each with 70 training and 30 testing samples per class. The average classification accuracy is computed over all classes. Out of the two released sets, we use the original set as it is more challenging and used by most of the works reporting results in action recognition. Olympic Sports. The third dataset we use is Olympic Sports [19], which again is obtained from YouTube. This dataset contains 783 samples with 16 sports action classes. We use the provided2 train/test split, there are 17 to 56 training samples and 4 to 11test samples per class. Mean AP is used for the evaluation, which is the standard choice. Bag of features and classification setup. We first adopt the standard BOF [24] approach to encode all kinds of descriptors. It produces a vector that serves as the video representation. 
The codebook is constructed for each type of descriptor separately by the k-means algorithm. Following a common practice in the literature [27, 29, 30], the codebook size is set to k=4,000 elements. Note that Section 6 will consider encoding technique for descriptors. For the classification, we use a non-linear SVM with χ2kernel. When combining different descriptors, we simply add the kernel matrices, as done in [27]: K(xi,xj) = exp?−?cγ1cD(xic,xjc)?, 2http://vision.stanford.edu/Datasets/OlympicSports/ 222555555755 (5) where D(xic, xjc) is χ2 distance between video xic and xjc with respect to c-th channel, corresponding to c-th descriptor. The quantity γc is the mean value of χ2 distances between the training samples for the c-th channel. The multiclass classification problem that we consider is addressed by applying a one-against-rest approach. 4. Compensated descriptors This section describes how the compensation ofthe dominant motion is exploited to improve the quality of descriptors encoding the motion and the appearance around spatio-temporal positions, hence the term “compensated descriptors”. First, we briefly review the local descriptors [5, 13, 16, 29, 30] used here along with dense trajectories [29]. Second, we analyze the impact of motion flow compensation when used in two different stages of the descriptor computation, namely in the tracking and the description part. 4.1. Dense trajectories and local descriptors Employing dense trajectories to compute local descriptors is one of the state-of-the-art approaches for action recognition. It has been shown [29] that when local descriptors are computed over dense trajectories the performance improves considerably compared to when computed over spatio temporal features [30]. Dense Trajectories [29]: The trajectories are obtained by densely tracking sampled points using optical flow fields. First, feature points are sampled from a dense grid, with step size of 5 pixels and over 8 scales. Each feature point pt = (xt, yt) at frame t is then tracked to the next frame by median filtering in a dense optical flow field F = (ut, vt) as follows: pt+1 = (xt+1 , yt+1) = (xt, yt) + (M ∗ F) | (x ¯t,y ¯t) , (6) where M is the kernel of median filtering and ( x¯ t, y¯ t) is the rounded position of (xt, yt). The tracking is limited to L (=15) frames to avoid any drifting effect. Excessively short trajectories and trajectories exhibiting sudden large displacements are removed as they induce some artifacts. Trajectories must be understood here as tracks in the spacetime volume of the video. Local descriptors: The descriptors are computed within a space-time volume centered around each trajectory. Four types of descriptors are computed to encode the shape of the trajectory, local motion pattern and appearance, namely Trajectory [29], HOF (histograms of optical flow) [13], MBH [4] and HOG (histograms of oriented gradients) [3]. All these descriptors depend on the flow field used for the tracking and as input of the descriptor computation: 1. The Trajectory descriptor encodes the shape of the trajectory represented by the normalized relative coor- × dinates of the successive points forming the trajectory. It directly depends on the dense flow used for tracking points. 2. HOF is computed using the orientations and magnitudes of the flow field. 3. MBH is designed to capture the gradient of horizontal and vertical components of the flow. The motion boundaries encode the relative pixel motion and therefore suppress camera motion, but only to some extent. 4. 
HOG encodes the appearance by using the intensity gradient orientations and magnitudes. It is formally not a motion descriptor. Yet the position where the descriptor is computed depends on the trajectory shape. As in [29], volume around a feature point is divided into a 2 2 3 space-time grid. The orientations are quantized ian 2to × ×8 b2i ×ns 3fo srp HacOe-Gti amned g g9r ibdi.ns T fhoer o oHriOenFt (awtioitnhs one a qdudainttiiozneadl zero bin). The horizontal and vertical components of MBH are separately quantized into 8 bins each. 4.2. Impact of motion compensation The optical flow is simply referred to as flow in the following, while the compensated flow (see subsection 2. 1) is denoted by ω-flow. Both of them are considered in the tracking and descriptor computation stages. The trajectories obtained by tracking with the ω-flow are called ω-trajectories. Figure 2 comparatively illustrates the ωtrajectories and the trajectories obtained using the flow. The input video shows a man moving away from the car. In this video excerpt, the camera is following the man walking to the right, thus inducing a global motion to the left in the video. When using the flow, the computed trajectories reflect the combination of these two motion components (camera and scene motion) as depicted by Subfigure 2(b), which hampers the characterization of the current action. In contrast, the ω-trajectories plotted in Subfigure 2(c) are more active on the actor moving on the foreground, while those localized in the background are now parallel to the time axis enhancing static parts of the scene. The ω-trajectories are therefore more relevant for action recognition, since they are more regularly and more exclusively following the actor’s motion. Impact on Trajectory and HOG descriptors. Table 1reports the impact of ω-trajectories on Trajectory and HOG descriptors, which are both significantly improved by 3%4% of mAP on the two datasets. When improved by ωflow, these descriptors will be respectively referred to as ω-Trajdesc and ω-HOG in the rest of the paper. Although the better performance of ω-Trajdesc versus the original Trajectory descriptor was expected, the one 222555555866 2. Trajectories obtained from optical and compensated flows. The green tail is the trajectory the current frame. The trajectories are sub-sampled for the sake of clarity. The frames are extracted Figure over every 15 frames with red dot indicating 5 frames in this example. DescriptorHollywood2HMDB51 BaseTrliaωnje- Tc(rtoarejrdpyreos[c2d9u]ced)54 7 1. 7 4% %2382.–89% BaseliHnωOe- (GHreOp [2rG9od]uced)4 451 . 658%%%2296.– 13%% Table 1. ω-Trajdesc and ω-HOG: Impact of compensating flow on Trajectory descriptor and HOG descriptors. achieved by ω-HOG might be surprising. Our interpretation is that HOG captures more context with the modified trajectories. More precisely, the original HOG descriptor is computed from a 2D+t sub-volume aligned with the corresponding trajectory and hence represents the appearance along the trajectory shape. When using ω-flow, we do not align the video sequence. As a result, the ω-HOG descriptor is no more computed around the very same tracked physical point in the space-time volume but around points lying in a patch of the initial feature point, whose size depends on the affine flow magnitude. ω-HOG can be viewed as a “patchbased” computation capturing more information about the appearance of the background or of the moving foreground. 
As for ω-trajectories, they are closer to the real trajectories of the moving actors as they usually cancel the camera movement, and so, more easier to train and recognize. Impact on HOF. The ω-flow impacts computation used as an input to HOF computation itself. Therefore, HOF can both types of trajectories (ω-trajectories both the trajectory and the descriptor be computed along or those extracted MethodHollywood2HMDB51 Table(ω2rHf.-alocO IwomkF)inpHgacOtFobf[2u9ωhsb]i:f-nlo ωgwot-hwωHOflFown5H 02 34O. 58291F% %descripto3 r706s38.:–1076% m%APfor Hollywood2 and average accuracy for HMDB5 1. The ω-HOF is used in subsequent evaluations. from flow) and can encode both kinds of flows (ω-flow or flow). For the sake of completeness, we evaluate all the variants as well as the combination of both flows in the descriptor computation stage. The results are presented in Table 2 and demonstrate the significant improvement obtained by computing the HOF descriptor with the ω-flow instead of the optical flow. Note that the type of trajectories which is used, either “Tracking flow” or “Tracking ω-flow”, has a limited impact in this case. From now on, we only consider the “Tracking ω-flow” case where HOF is computed along ω-trajectories. Interestingly, combining the HOF computed from the flow and the ω-flow further improves the results. This suggests that the two flow fields are complementary and the affine flow that was subtracted from ω-flow brings in additional information. For the sake of brevity, the combination of the two kinds of HOF, i.e., computed from the flow and the ω-flow using ω-trajectories, is referred to as the ω-HOF 222555555977 MethodHollywood2HMDB51 Tab(lerT3a.cIkmM inpBgMacHgωtBf-loH w [u2)s9in]gω f-lo wo MBH5 d42 e.052s7c% riptos:m34A90P.–3769f% orHllywood2 and average accuracy for HMDB5 1. DTerHasMjcBrOeblitpHGeFor4.ySumTωraw- frcilykto hw ionfgtheduωpCs-fcaolrtωmeiwNp-df/tl+Aωoutrw-finlogwthdesωcr- isTpc-fHtrMloaiOjrBpdswtGeHFosrc descriptor in the rest of this paper. Compared to the HOF baseline, the ω-HOF descriptor achieves a gain of +3.1% of mAP on Hollywood 2 and of +7.8% on HMDB51. Impact on MBH. Since MBH is computed from gradient of flow and cancel the constant motion, there is practically no benefit in using the ω-flow to compute the MBH descriptors, as shown in Table 3. However, by tracking ω-flow, the performance improves by around 1.3% for HMDB5 1 dataset and drops by around 1.5% for Hollywood2. This relative performance depends on the encoding technique. We will come back on this descriptor when considering another encoding scheme for local descriptors in Section 6. 4.3. Summary of compensated descriptors Table 4 summarizes the refined versions of the descriptors obtained by exploiting the ω-flow, and both ω-flow and the optical flow in the case of HOF. The revisited descriptors considerably improve the results compared to the orig- inal ones, with the noticeable exception of ω-MBH which gives mixed performance with a bag-of-features encoding scheme. But we already mention as this point that this incongruous behavior of ω-MBH is stabilized with the VLAD encoding scheme considered in Section 6. Another advantage of tracking the compensated flow is that fewer trajectories are produced. For instance, the total number of trajectories decreases by about 9. 16% and 22.81% on the Hollywood2 and HMDB51 datasets, respectively. 
Note that exploiting both the flow and the ω-flow do not induce much computational overhead, as the latter is obtained from the flow and the affine flow which is computed in real-time and already used to get the ω-trajectories. The only additional computational cost that we introduce by using the descriptors summarized in Table 4 is the computation of a second HOF descriptor, but this stage is relatively efficient and not the bottleneck of the extraction procedure. 5. Divergence-Curl-Shear descriptor This section introduces a new descriptor encoding the kinematic properties of motion discussed in Section 2.2. It is denoted by DCS in the rest of this paper. Combining kinematic features. The spatial derivatives are computed for the horizontal and vertical components of the flow field, which are used in turn to compute the divergence, curl and shear scalar values, see Equation 3. We consider all possible pairs of kinematic features, namely (div, curl), (div, shear) and (curl, shear). At each × ×× pixel, we compute the orientation and magnitude of the 2-D vector corresponding to each of these pairs. The orientation is quantized into histograms and the magnitude is used for weighting, similar to SIFT. Our motivation for encoding pairs is that the joint distribution of kinematic features conveys more information than exploiting them independently. Implementation details. The descriptor computation and parameters are similar to HOG and other popular descriptors such as MBH, HOF. We obtain 8-bin histograms for each of the three feature pairs or components of DCS. The range of possible angles is 2π for the (div,curl) pair and π for the other pairs, because the shear is always positive. The DCS descriptor is computed for a space-time volume aligned with a trajectory, as done with the four descriptors mentioned in the previous section. In order to capture the spatio-temporal structure of kinematic features, the volume (32 32 pixels and L = 15 frames) is subdivided into a spatio-temporal grid nofd s Lize = nx 5× f ny m×e nt, sw situhb nx =de ny =to 2a and nt = 3. These parameters ×hnave× × bneen fixed for the sake of consistency with the other descriptors. For each pair of kinematic features, each cell in the grid is represented by a histogram. The resulting local descriptors have a dimensionality equal to 288 = nx ny nt 8 3. At the video level, these descriptors are nenc×od end i×nto 8 a single vector representation using either BOF or the VLAD encoding scheme introduced in the next section. 6. VLAD in actions VLAD [8] is a descriptor encoding technique that aggregates the descriptors based on a locality criterion in the feature space. To our knowledge, this technique has never been considered for action recognition. Below, we briefly introduce this approach and give the performance achieved for all the descriptors introduced along the previous sections. VLAD in brief. Similar to BOF, VLAD relies on a codebook C = {c1, c2 , ...ck} of k centroids learned by k-means. bTohoek representation is ob}t oaifn ked c by summing, efodr b yea kch-m mveiasunasl. word ci, the differences x − ci of the vectors x assigned to ci, thereby producing a sv exct −or c representation oflength d×k, 222555556088 DMeBscHriptorV5 LH.A1o%Dlywo5Bo4d.O2 %F4V3L.3HA%MD B35B91.O7%F Taωbl-eDHM5rOCBa.FSjGPdHe+rsωfco-mMHaBOnFeofV54L2936A.51D% with5431ω208-.5T96% rajde3s42c97158,.ω3% -HOG342,58019ω.6-% HOF descriptors and their combination. where d is the dimension ofthe local descriptors. We use the codebook size, k = 256. 
Despite this large dimensionality, VLAD is efficient because it is effectively compared with a linear kernel. VLAD is post-processed using a componentwise power normalization, which dramatically improves its performance [8]. While cross validating the parameter α involved in this power normalization, we consistently observe, for all the descriptors, a value between 0.15 and 0.3. Therefore, this parameter is set to α = 0.2 in all our experiments. For classification, we use a linear SVM and oneagainst-rest approach everywhere, unless stated otherwise. Impact on existing descriptors. We employ VLAD because it is less sensitive to quantization parameters and appears to provide better performance with descriptors having a large dimensionality. These properties are interesting in our case, because the quantization parameters involved in the DCS and MBH descriptors have been used unchanged in Section 4 for the sake of direct comparison. They might be suboptimal when using the ω-flow instead of the optical flow on which they have initially been optimized [29]. Results for MBH and ω-MBH in Table 5 supports this argument. When using VLAD instead of BOF, the scores are stable in both the cases and there is no mixed inference as that observed in Table 3. VLAD also has significant positive influence on accuracy of ω-DCS descriptor. We also observe that ω-DCS is complementary to ω-MBH and adds to the performance. Still DCS is probably not best utilized in the current setting of parameters. In case of ω-Trajdesc and ω-HOG, the scores are better with BOF on both the datasets. ω-HOF with VLAD improves on HMDB5 1, but remains equivalent for Hollywood2. Although BOF leads to better scores for the descriptors considered individually, their combination with VLAD outperforms the BOF. 7. Comparison with the state of the art This section reports our results with all descriptors combined and compares our method with the state of the art. TrajectorCy+omHbOiGna+tHioOnF+ MDBCHSHol5 l98y.w76%o%od2H4M489.D02%B%51 All ω-descriptors all five compensated descriptors using combined62.5%52.1% Table 6. Combination of VLAD representation. WU*JVliaOnughreM tHaeolth. [yo2w9d87o] 256 0985. 37% SKa*duOeJinhrau tnegdMteHatlMa.h [ol1Dd.0B[91]25 24 609.8172% Table 7. Comparison with the state of the art on Hollywood2 and HMDB5 1 datasets. *Vig et al. [28] gets 61.9% by using external eye movements data. *Jiang et al. [9] used one-vs-one multi class SVM while our and other methods use one-vs-rest SVMs. With one-against-one multi class SVM we obtain 45. 1% for HMDB51. Descriptor combination. Table 6 reports the results obtained when the descriptors are combined. Since we use VLAD, our baseline is updated that is combination of Trajectory, HOG, HOF and MBH with VLAD representation. When DCS is added to the baseline there is an improvement of 0.9% and 1.2%. With combination of all five compensated descriptors we obtain 62.5% and 52.1% on the two datasets. This is a large improvement even over the updated baseline, which shows that the proposed motion compensation and the way we exploit it are significantly important for action recognition. The comparison with the state of the art is shown in Table 7. Our method outperforms all the previously reported results in the literature. In particular, on the HMDB51 dataset, the improvement over the best reported results to date is more than 11% in average accuracy. Jiang el al. [9] used a one-against-one multi-class SVM, which might have resulted in inferior scores. 
With a similar multi-class SVM approach, our method obtains 45. 1%, which remains significantly better than their result. All others results were reported with one-against-rest approach. On Olympic Sports dataset we obtain mAP of 83.2% with ‘All ω-descriptors combined’ and the improvement is mostly because of VLAD and ω-flow. The best reported mAPs on this dataset are Liu et al. [14] (74.4%) and Jiang et al. [9] (80.6%), which we exceed convincingly. Gaidon et al. [6] reports the best average accuracy of 82.7%. 8. Conclusions This paper first demonstrates the interest of canceling the dominant motion (predominantly camera motion) to make the visual motion truly related to actions, for both the trajectory extraction and descriptor computation stages. It pro222555556199 duces significantly better versions (called compensated descriptors) of several state-of-the-art local descriptors for action recognition. The simplicity, efficiency and effectiveness of this motion compensation approach make it applicable to any action recognition framework based on motion descriptors and trajectories. The second contribution is the new DCS descriptor derived from the first-order scalar motion quantities specifying the local motion patterns. It captures additional information which is proved complementary to the other descriptors. Finally, we show that VLAD encoding technique instead of bag-of-words boosts several action descriptors, and overall exhibits a significantly better performance when combining different types of descriptors. Our contributions are all complementary and significantly outperform the state of the art when combined, as demon- strated by our extensive experiments on the Hollywood 2, HMDB51 and Olympic Sports datasets. Acknowledgments This work was supported by the Quaero project, funded by Oseo, French agency for innovation. We acknowledge Heng Wang’s help for reproducing some of their results. References [1] S. Ali and M. Shah. Human action recognition in videos using kinematic features and multiple instance learning. IEEE T-PAMI, 32(2):288–303, Feb. 2010. [2] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, Sep. 2010. [3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, Jun. 2005. [4] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, May 2006. [5] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, Oct. 2005. [6] A. Gaidon, Z. Harchaoui, and C. Schmid. Recognizing activities with cluster-trees of tracklets. In BMVC, Sep. 2012. [7] A. Hervieu, P. Bouthemy, and J.-P. Le Cadre. A statistical video content recognition method using invariant features on object trajectories. IEEE T-CSVT, 18(1 1): 1533–1543, 2008. [8] H. J ´egou, F. Perronnin, M. Douze, J. S ´anchez, P. P ´erez, and C. Schmid. Aggregating local descriptors into compact [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] codes. IEEE T-PAMI, 34(9):1704–1716, 2012. Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based modeling of human actions with motion reference points. In ECCV, Oct. 2012. O. Kliper-Gross, Y. Gurovich, T. Hassner, and L. Wolf. Motion interchange patterns for action recognition in unconstrained videos. In ECCV, Oct. 2012. H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: A large video database for human motion recognition. In ICCV, Nov. 2011. I. Laptev and T. 
Acknowledgments

This work was supported by the Quaero project, funded by OSEO, the French agency for innovation. We acknowledge Heng Wang's help in reproducing some of their results.

References

[1] S. Ali and M. Shah. Human action recognition in videos using kinematic features and multiple instance learning. IEEE T-PAMI, 32(2):288–303, Feb. 2010.
[2] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, Sep. 2010.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, Jun. 2005.
[4] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, May 2006.
[5] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, Oct. 2005.
[6] A. Gaidon, Z. Harchaoui, and C. Schmid. Recognizing activities with cluster-trees of tracklets. In BMVC, Sep. 2012.
[7] A. Hervieu, P. Bouthemy, and J.-P. Le Cadre. A statistical video content recognition method using invariant features on object trajectories. IEEE T-CSVT, 18(11):1533–1543, 2008.
[8] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid. Aggregating local descriptors into compact codes. IEEE T-PAMI, 34(9):1704–1716, 2012.
[9] Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based modeling of human actions with motion reference points. In ECCV, Oct. 2012.
[10] O. Kliper-Gross, Y. Gurovich, T. Hassner, and L. Wolf. Motion interchange patterns for action recognition in unconstrained videos. In ECCV, Oct. 2012.
[11] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, Nov. 2011.
[12] I. Laptev and T. Lindeberg. Space-time interest points. In ICCV, Oct. 2003.
[13] I. Laptev, M. Marzalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, Jun. 2008.
[14] J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In CVPR, Jun. 2011.
[15] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, Nov. 2004.
[16] M. Marzalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, Jun. 2009.
[17] P. Matikainen, M. Hebert, and R. Sukthankar. Trajectons: Action recognition through the motion analysis of tracked features. In Workshop on Video-Oriented Object and Event Classification, ICCV, Sep. 2009.
[18] R. Messing, C. J. Pal, and H. A. Kautz. Activity recognition using the velocity histories of tracked keypoints. In ICCV, Sep. 2009.
[19] J. C. Niebles, C.-W. Chen, and F.-F. Li. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, Sep. 2010.
[20] J.-M. Odobez and P. Bouthemy. Robust multiresolution estimation of parametric motion models. Journal of Visual Communication and Image Representation, 6(4):348–365, Dec. 1995.
[21] G. Piriou, P. Bouthemy, and J.-F. Yao. Recognition of dynamic video contents with global probabilistic models of visual motion. IEEE T-IP, 15(11):3417–3430, 2006.
[22] S. Sadanand and J. J. Corso. Action bank: A high-level representation of activity in video. In CVPR, Jun. 2012.
[23] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. IEEE T-PAMI, 19(5):530–534, May 1997.
[24] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, pages 1470–1477, Oct. 2003.
[25] J. Sun, X. Wu, S. Yan, L. F. Cheong, T.-S. Chua, and J. Li. Hierarchical spatio-temporal context modeling for action recognition. In CVPR, Jun. 2009.
[26] H. Uemura, S. Ishikawa, and K. Mikolajczyk. Feature tracking and motion compensation for action recognition. In BMVC, Sep. 2008.
[27] M. M. Ullah, S. N. Parizi, and I. Laptev. Improving bag-of-features action recognition with non-local cues. In BMVC, Sep. 2010.
[28] E. Vig, M. Dorr, and D. Cox. Saliency-based space-variant descriptor sampling for action recognition. In ECCV, Oct. 2012.
[29] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, Jun. 2011.
[30] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, Sep. 2009.
[31] G. Willems, T. Tuytelaars, and L. J. V. Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In ECCV, Oct. 2008.
[32] S. Wu, O. Oreifej, and M. Shah. Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories. In ICCV, Nov. 2011.
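As a complement to the BOF-versus-VLAD discussion above, the sketch below shows the generic VLAD encoding step itself: each local descriptor is assigned to its nearest codebook centre and the residuals are accumulated per centre, giving a k·d-dimensional vector that would then be power- and L2-normalized as described earlier. The codebook size, descriptor dimensionality and random data are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

def vlad_encode(descriptors, codebook):
    """Accumulate descriptor-to-centre residuals per nearest codebook centre."""
    k, d = codebook.shape
    # Hard-assign each descriptor to its nearest centre (squared Euclidean distance).
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)
    v = np.zeros((k, d))
    for i in range(k):
        members = descriptors[assign == i]
        if len(members) > 0:
            v[i] = (members - codebook[i]).sum(axis=0)  # residual accumulation
    return v.ravel()  # k*d-dimensional VLAD vector

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 96))      # e.g. k = 64 centres, 96-D local descriptors
descriptors = rng.normal(size=(500, 96))  # local descriptors extracted from one video
print(vlad_encode(descriptors, codebook).shape)  # (6144,)
```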

5 0.17620505 362 cvpr-2013-Robust Monocular Epipolar Flow Estimation

Author: Koichiro Yamaguchi, David McAllester, Raquel Urtasun

Abstract: We consider the problem of computing optical flow in monocular video taken from a moving vehicle. In this setting, the vast majority of image flow is due to the vehicle's ego-motion. We propose to take advantage of this fact and estimate flow along the epipolar lines of the ego-motion. Towards this goal, we derive a slanted-plane MRF model which explicitly reasons about the ordering of planes and their physical validity at junctions. Furthermore, we present a bottom-up grouping algorithm which produces over-segmentations that respect flow boundaries. We demonstrate the effectiveness of our approach in the challenging KITTI flow benchmark [11] achieving half the error of the best competing general flow algorithm and one third of the error of the best epipolar flow algorithm.

6 0.17133993 334 cvpr-2013-Pose from Flow and Flow from Pose

7 0.16733707 167 cvpr-2013-Fast Multiple-Part Based Object Detection Using KD-Ferns

8 0.1569497 432 cvpr-2013-Three-Dimensional Bilateral Symmetry Plane Estimation in the Phase Domain

9 0.15395896 400 cvpr-2013-Single Image Calibration of Multi-axial Imaging Systems

10 0.15180412 345 cvpr-2013-Real-Time Model-Based Rigid Object Pose Estimation and Tracking Combining Dense and Sparse Visual Cues

11 0.14933443 46 cvpr-2013-Articulated and Restricted Motion Subspaces and Their Signatures

12 0.14713818 290 cvpr-2013-Motion Estimation for Self-Driving Cars with a Generalized Camera

13 0.1351907 10 cvpr-2013-A Fully-Connected Layered Model of Foreground and Background Flow

14 0.12734522 423 cvpr-2013-Template-Based Isometric Deformable 3D Reconstruction with Sampling-Based Focal Length Self-Calibration

15 0.1255199 349 cvpr-2013-Reconstructing Gas Flows Using Light-Path Approximation

16 0.125018 108 cvpr-2013-Dense 3D Reconstruction from Severely Blurred Images Using a Single Moving Camera

17 0.12351609 170 cvpr-2013-Fast Rigid Motion Segmentation via Incrementally-Complex Local Models

18 0.11525858 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?

19 0.11157163 300 cvpr-2013-Multi-target Tracking by Lagrangian Relaxation to Min-cost Network Flow

20 0.10632373 209 cvpr-2013-Hypergraphs for Joint Multi-view Reconstruction and Multi-object Tracking


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.207), (1, 0.214), (2, 0.008), (3, -0.018), (4, -0.082), (5, -0.063), (6, -0.001), (7, -0.091), (8, -0.006), (9, 0.09), (10, 0.052), (11, 0.171), (12, 0.132), (13, -0.08), (14, 0.13), (15, 0.046), (16, 0.052), (17, -0.018), (18, -0.104), (19, 0.002), (20, -0.012), (21, -0.114), (22, -0.016), (23, -0.046), (24, 0.048), (25, 0.041), (26, -0.057), (27, 0.096), (28, -0.003), (29, 0.046), (30, -0.056), (31, 0.048), (32, 0.086), (33, 0.087), (34, -0.048), (35, 0.018), (36, 0.068), (37, 0.031), (38, -0.047), (39, -0.016), (40, -0.005), (41, -0.06), (42, -0.025), (43, 0.051), (44, 0.074), (45, 0.075), (46, 0.107), (47, -0.015), (48, -0.054), (49, -0.05)]
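The simValue scores in the lists below pair each candidate paper with a similarity to this one. A plausible reading, stated here only as an assumption, is that each paper is represented by a dense vector built from topic weights like those above (the same would apply to the LDA weights further down) and that neighbours are ranked by cosine similarity. The sketch below illustrates that computation; the helper names and the comparison papers' weights are made up, while the query weights are truncated from the list above.

```python
import numpy as np

def to_vector(topic_weights, num_topics=50):
    """Expand a sparse (topicId, topicWeight) list into a dense vector."""
    v = np.zeros(num_topics)
    for topic_id, weight in topic_weights:
        v[topic_id] = weight
    return v

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

papers = {
    "124": [(0, 0.207), (1, 0.214), (11, 0.171), (14, 0.13)],  # truncated from above
    "244": [(0, 0.19), (1, 0.20), (11, 0.15), (14, 0.10)],     # made-up example
    "62":  [(3, 0.25), (7, 0.18)],                             # made-up example
}
query = to_vector(papers["124"])
ranking = sorted(
    ((cosine(query, to_vector(w)), pid) for pid, w in papers.items()),
    reverse=True,
)
for sim, pid in ranking:
    print(f"{sim:.4f}  cvpr-2013-{pid}")
```

Under this reading, the top-ranked entry is the paper itself, which matches the "same-paper" line with a similarity close to 1 at the head of each list.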

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97860116 124 cvpr-2013-Determining Motion Directly from Normal Flows Upon the Use of a Spherical Eye Platform

Author: Tak-Wai Hui, Ronald Chung

Abstract: We address the problem of recovering camera motion from video data, which does not require the establishment of feature correspondences or computation of optical flows but from normal flows directly. We have designed an imaging system that has a wide field of view by fixating a number of cameras together to form an approximate spherical eye. With a substantially widened visual field, we discover that estimating the directions of translation and rotation components of the motion separately are possible and particularly efficient. In addition, the inherent ambiguities between translation and rotation also disappear. Magnitude of rotation is recovered subsequently. Experimental results on synthetic and real image data are provided. The results show that not only the accuracy of motion estimation is comparable to those of the state-of-the-art methods that require explicit feature correspondences or optical flows, but also a faster computation time.

2 0.83807212 244 cvpr-2013-Large Displacement Optical Flow from Nearest Neighbor Fields

Author: Zhuoyuan Chen, Hailin Jin, Zhe Lin, Scott Cohen, Ying Wu

Abstract: We present an optical flow algorithm for large displacement motions. Most existing optical flow methods use the standard coarse-to-fine framework to deal with large displacement motions which has intrinsic limitations. Instead, we formulate the motion estimation problem as a motion segmentation problem. We use approximate nearest neighbor fields to compute an initial motion field and use a robust algorithm to compute a set of similarity transformations as the motion candidates for segmentation. To account for deviations from similarity transformations, we add local deformations in the segmentation process. We also observe that small objects can be better recovered using translations as the motion candidates. We fuse the motion results obtained under similarity transformations and under translations together before a final refinement. Experimental validation shows that our method can successfully handle large displacement motions. Although we particularly focus on large displacement motions in this work, we make no sacrifice in terms of overall performance. In particular, our method ranks at the top of the Middlebury benchmark.

3 0.72854817 290 cvpr-2013-Motion Estimation for Self-Driving Cars with a Generalized Camera

Author: Gim Hee Lee, Friedrich Fraundorfer, Marc Pollefeys

Abstract: In this paper, we present a visual ego-motion estimation algorithm for a self-driving car equipped with a close-to-market multi-camera system. By modeling the multi-camera system as a generalized camera and applying the non-holonomic motion constraint of a car, we show that this leads to a novel 2-point minimal solution for the generalized essential matrix where the full relative motion including metric scale can be obtained. We provide the analytical solutions for the general case with at least one inter-camera correspondence and a special case with only intra-camera correspondences. We show that up to a maximum of 6 solutions exist for both cases. We identify the existence of degeneracy when the car undergoes straight motion in the special case with only intra-camera correspondences where the scale becomes unobservable and provide a practical alternative solution. Our formulation can be efficiently implemented within RANSAC for robust estimation. We verify the validity of our assumptions on the motion model by comparing our results on a large real-world dataset collected by a car equipped with 4 cameras with minimal overlapping field-of-views against the GPS/INS ground truth.

4 0.71920818 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction

Author: Dennis Park, C. Lawrence Zitnick, Deva Ramanan, Piotr Dollár

Abstract: We describe novel but simple motion features for the problem of detecting objects in video sequences. Previous approaches either compute optical flow or temporal differences on video frame pairs with various assumptions about stabilization. We describe a combined approach that uses coarse-scale flow and fine-scale temporal difference features. Our approach performs weak motion stabilization by factoring out camera motion and coarse object motion while preserving nonrigid motions that serve as useful cues for recognition. We show results for pedestrian detection and human pose estimation in video sequences, achieving state-of-the-art results in both. In particular, given a fixed detection rate our method achieves a five-fold reduction in false positives over prior art on the Caltech Pedestrian benchmark. Finally, we perform extensive diagnostic experiments to reveal what aspects of our system are crucial for good performance. Proper stabilization, long time-scale features, and proper normalization are all critical.

5 0.70813489 88 cvpr-2013-Compressible Motion Fields

Author: Giuseppe Ottaviano, Pushmeet Kohli

Abstract: Traditional video compression methods obtain a compact representation for image frames by computing coarse motion fields defined on patches of pixels called blocks, in order to compensate for the motion in the scene across frames. This piecewise constant approximation makes the motion field efficiently encodable, but it introduces block artifacts in the warped image frame. In this paper, we address the problem of estimating dense motion fields that, while accurately predicting one frame from a given reference frame by warping it with the field, are also compressible. We introduce a representation for motion fields based on wavelet bases, and approximate the compressibility of their coefficients with a piecewise smooth surrogate function that yields an objective function similar to classical optical flow formulations. We then show how to quantize and encode such coefficients with adaptive precision. We demonstrate the effectiveness of our approach by comparing its performance with a state-of-the-art wavelet video encoder. Experimental results on a number of standard flow and video datasets reveal that our method significantly outperforms both block-based and optical-flow-based motion compensation algorithms.

6 0.69833541 283 cvpr-2013-Megastereo: Constructing High-Resolution Stereo Panoramas

7 0.69419158 362 cvpr-2013-Robust Monocular Epipolar Flow Estimation

8 0.69394505 46 cvpr-2013-Articulated and Restricted Motion Subspaces and Their Signatures

9 0.67441863 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition

10 0.66391391 10 cvpr-2013-A Fully-Connected Layered Model of Foreground and Background Flow

11 0.63582611 84 cvpr-2013-Cloud Motion as a Calibration Cue

12 0.6340937 368 cvpr-2013-Rolling Shutter Camera Calibration

13 0.62633371 334 cvpr-2013-Pose from Flow and Flow from Pose

14 0.59926826 345 cvpr-2013-Real-Time Model-Based Rigid Object Pose Estimation and Tracking Combining Dense and Sparse Visual Cues

15 0.59091252 170 cvpr-2013-Fast Rigid Motion Segmentation via Incrementally-Complex Local Models

16 0.57080483 137 cvpr-2013-Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis

17 0.5663026 400 cvpr-2013-Single Image Calibration of Multi-axial Imaging Systems

18 0.56537277 333 cvpr-2013-Plane-Based Content Preserving Warps for Video Stabilization

19 0.55433267 118 cvpr-2013-Detecting Pulse from Head Motions in Video

20 0.54969043 349 cvpr-2013-Reconstructing Gas Flows Using Light-Path Approximation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.101), (16, 0.031), (26, 0.07), (33, 0.267), (47, 0.012), (65, 0.013), (67, 0.066), (69, 0.037), (72, 0.198), (87, 0.112)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89557081 62 cvpr-2013-Bilinear Programming for Human Activity Recognition with Unknown MRF Graphs

Author: Zhenhua Wang, Qinfeng Shi, Chunhua Shen, Anton van den Hengel

Abstract: Markov Random Fields (MRFs) have been successfully applied to human activity modelling, largely due to their ability to model complex dependencies and deal with local uncertainty. However, the underlying graph structure is often manually specified, or automatically constructed by heuristics. We show, instead, that learning an MRF graph and performing MAP inference can be achieved simultaneously by solving a bilinear program. Equipped with the bilinear program based MAP inference for an unknown graph, we show how to estimate parameters efficiently and effectively with a latent structural SVM. We apply our techniques to predict sport moves (such as serve, volley in tennis) and human activity in TV episodes (such as kiss, hug and Hi-Five). Experimental results show the proposed method outperforms the state-of-the-art.

same-paper 2 0.8849535 124 cvpr-2013-Determining Motion Directly from Normal Flows Upon the Use of a Spherical Eye Platform

Author: Tak-Wai Hui, Ronald Chung

Abstract: We address the problem of recovering camera motion from video data, which does not require the establishment of feature correspondences or computation of optical flows but from normal flows directly. We have designed an imaging system that has a wide field of view by fixating a number of cameras together to form an approximate spherical eye. With a substantially widened visual field, we discover that estimating the directions of translation and rotation components of the motion separately are possible and particularly efficient. In addition, the inherent ambiguities between translation and rotation also disappear. Magnitude of rotation is recovered subsequently. Experimental results on synthetic and real image data are provided. The results show that not only the accuracy of motion estimation is comparable to those of the state-of-the-art methods that require explicit feature correspondences or optical flows, but also a faster computation time.

3 0.87839067 149 cvpr-2013-Evaluation of Color STIPs for Human Action Recognition

Author: Ivo Everts, Jan C. van_Gemert, Theo Gevers

Abstract: This paper is concerned with recognizing realistic human actions in videos based on spatio-temporal interest points (STIPs). Existing STIP-based action recognition approaches operate on intensity representations of the image data. Because of this, these approaches are sensitive to disturbing photometric phenomena such as highlights and shadows. Moreover, valuable information is neglected by discarding chromaticity from the photometric representation. These issues are addressed by Color STIPs. Color STIPs are multi-channel reformulations of existing intensity-based STIP detectors and descriptors, for which we consider a number of chromatic representations derived from the opponent color space. This enhanced modeling of appearance improves the quality of subsequent STIP detection and description. Color STIPs are shown to substantially outperform their intensity-based counterparts on the challenging UCF sports, UCF11 and UCF50 action recognition benchmarks. Moreover, the results show that color STIPs are currently the single best low-level feature choice for STIP-based approaches to human action recognition.

4 0.87139356 352 cvpr-2013-Recovering Stereo Pairs from Anaglyphs

Author: Armand Joulin, Sing Bing Kang

Abstract: An anaglyph is a single image created by selecting complementary colors from a stereo color pair; the user can perceive depth by viewing it through color-filtered glasses. We propose a technique to reconstruct the original color stereo pair given such an anaglyph. We modified SIFT-Flow and use it to initially match the different color channels across the two views. Our technique then iteratively refines the matches, selects the good matches (which defines the “anchor” colors), and propagates the anchor colors. We use a diffusion-based technique for the color propagation, and added a step to suppress unwanted colors. Results on a variety of inputs demonstrate the robustness of our technique. We also extended our method to anaglyph videos by using optic flow between time frames.

5 0.86723477 325 cvpr-2013-Part Discovery from Partial Correspondence

Author: Subhransu Maji, Gregory Shakhnarovich

Abstract: We study the problem of part discovery when partial correspondence between instances of a category are available. For visual categories that exhibit high diversity in structure such as buildings, our approach can be used to discover parts that are hard to name, but can be easily expressed as a correspondence between pairs of images. Parts naturally emerge from point-wise landmark matches across many instances within a category. We propose a learning framework for automatic discovery of parts in such weakly supervised settings, and show the utility of the rich part library learned in this way for three tasks: object detection, category-specific saliency estimation, and fine-grained image parsing.

6 0.86649501 229 cvpr-2013-It's Not Polite to Point: Describing People with Uncertain Attributes

7 0.83932287 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities

8 0.83623749 71 cvpr-2013-Boundary Cues for 3D Object Shape Recovery

9 0.83318371 331 cvpr-2013-Physically Plausible 3D Scene Tracking: The Single Actor Hypothesis

10 0.8329559 242 cvpr-2013-Label Propagation from ImageNet to 3D Point Clouds

11 0.83204067 147 cvpr-2013-Ensemble Learning for Confidence Measures in Stereo Vision

12 0.83199972 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

13 0.83113623 222 cvpr-2013-Incorporating User Interaction and Topological Constraints within Contour Completion via Discrete Calculus

14 0.83072948 72 cvpr-2013-Boundary Detection Benchmarking: Beyond F-Measures

15 0.83026659 15 cvpr-2013-A Lazy Man's Approach to Benchmarking: Semisupervised Classifier Evaluation and Recalibration

16 0.83007222 155 cvpr-2013-Exploiting the Power of Stereo Confidences

17 0.82964504 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path

18 0.82948899 19 cvpr-2013-A Minimum Error Vanishing Point Detection Approach for Uncalibrated Monocular Images of Man-Made Environments

19 0.82940012 290 cvpr-2013-Motion Estimation for Self-Driving Cars with a Generalized Camera

20 0.82864547 277 cvpr-2013-MODEC: Multimodal Decomposable Models for Human Pose Estimation