cvpr cvpr2013 cvpr2013-137 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Christian Thériault, Nicolas Thome, Matthieu Cord
Abstract: In this paper, we address the challenging problem of categorizing video sequences composed of dynamic natural scenes. Contrarily to previous methods that rely on handcrafted descriptors, we propose here to represent videos using unsupervised learning of motion features. Our method encompasses three main contributions: 1) Based on the Slow Feature Analysis principle, we introduce a learned local motion descriptor which represents the principal and more stable motion components of training videos. 2) We integrate our local motion feature into a global coding/pooling architecture in order to provide an effective signature for each video sequence. 3) We report state of the art classification performances on two challenging natural scenes data sets. In particular, an outstanding improvement of 11 % in classification score is reached on a data set introduced in 2012.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper, we address the challenging problem of categorizing video sequences composed of dynamic natural scenes. [sent-3, score-0.2]
2 Contrarily to previous methods that rely on handcrafted descriptors, we propose here to represent videos using unsupervised learning of motion features. [sent-4, score-0.352]
3 Our method encompasses three main contributions: 1) Based on the Slow Feature Analysis principle, we introduce a learned local motion descriptor which represents the principal and more stable motion components of training videos. [sent-5, score-0.503]
4 2) We integrate our local motion feature into a global coding/pooling architecture in order to provide an effective signature for each video sequence. [sent-6, score-0.335]
5 Designing efficient motion descriptors is a key ingredient of current video analysis systems. [sent-11, score-0.269]
6 In the most usual context, motion features arise from the relative motion between the different objects in the scene and the camera. [sent-12, score-0.463]
7 In this context, motion is often correlated with effects that may be considered as interferences or artifacts: shadows, lighting variations, specular effects, etc. [sent-16, score-0.21]
8 However, SF1 (the slowest feature learned with SFA) correctly untangles the classes. [sent-24, score-0.21]
9 Bottom: SF1 reveals stable motion components which correlate with semantic categories: upward/backward water motion (fountains/waterfalls), complex flame motion (Forest Fire). [sent-25, score-0.708]
10 In the neuroscience community, one challenge of internal representation design or learning is related to the class manifold untangling problem [10]: high level representations are expected to be well separated for different semantic categories. [sent-31, score-0.193]
In this paper, we introduce an unsupervised method to learn local motion features which self-adapt to the difficult context of dynamic scenes. [sent-34, score-0.383]
12 The curves compare the mean temporal signal, over each class, for V1 features3 and for learned motion features, inside the green windows shown at the bottom. [sent-38, score-0.416]
13 On the other hand, the slowest learned feature (SF1) correctly untangles the classes by generating outputs with stable responses inside categories and yet different responses between categories. [sent-40, score-0.349]
14 Quite impressively, one single slow feature is able to untangle 7 video classes. [sent-41, score-0.417]
15 The bottom part of figure 1 illustrates that the slow features learned by SFA reveal sensible motion components correlated with the semantic classes: upward/backward water motion (fountains/waterfalls), complex flame motion (Forest Fire), etc. [sent-42, score-1.059]
16 Section 3 gives the details of the method, introducing our SFA-based learned local motion features and their embedding into a coding/pooling framework. [sent-45, score-0.291]
17 Section 4 reports classification scores on two challenging dynamic scenes data sets, pointing out the remarkable level of performance achieved by the described method using learned motion features. [sent-46, score-0.426]
18 Related work & Contributions. In this section, we give more details on video classification approaches related to ours, and focus on two main aspects of the proposed systems: the chosen motion features, and their use for video categorization. [sent-49, score-0.355]
19 The literature on scene classification includes several handcrafted motion features responding to space-time variations. [sent-50, score-0.411]
20 These motion features are often optimally handcrafted for specific applications and are not learned from the statistics of training images. [sent-51, score-0.211]
21 This motion feature uses Histograms of Optic Flow (HOF) in a similar spirit to the static image features SIFT [22] or HOG [7]. [sent-55, score-0.272]
22 Since optic flow estimation assumes constant illumination between subsequent frames, the performance of this type of motion feature is subject to collapse in the context of natural video scenes. [sent-60, score-0.352]
23 Such stochastic models have been successfully applied in various contexts, from dynamic texture classification to motion segmentation [6] or tracking [5]. [sent-64, score-0.314]
24 Other motion features [12, 18] presented in the literature are based on biological inspirations. [sent-67, score-0.248]
25 These features can be related to neuro-physiological recordings from the V1-V2-V4 cortical areas which are known to process local spatio-temporal information [25] and from the MT area which is believed to integrate global motion patterns [30]. [sent-68, score-0.213]
26 These biologically inspired motion features are still not truly learned from stimuli. [sent-69, score-0.328]
27 Although both works address the same classification problem as we do, our approach and method are different as we focus on unsupervised motion feature learning. [sent-73, score-0.312]
28 One unsupervised learning principle in neuroscience is to minimize temporal variations created by motion in order to learn stable representations of objects undergoing motion [27, 14, 24]. [sent-74, score-0.75]
29 Recently, SFA has been investigated in [35] to represent local motion for human action recognition. [sent-80, score-0.243]
30 Interestingly, this work, closely related to ours, consolidates the relevance of using SFA to extract meaningful motion patterns for video classification. [sent-81, score-0.242]
31 One possibility is to use global motion descriptors [11, 32] which cover the entire spatial area of the scene to be classified. [sent-84, score-0.242]
32 Figure 2 (caption fragment): V1 features are then mapped on a set of learned slow features through the SFA principle. [sent-86, score-0.405]
33 Blue: Temporal sequences of slow feature codes are used to train a dictionary of motion features. [sent-87, score-0.662]
34 Orange: Motion features from new videos are mapped on the dictionary before being pooled across time and space into a final vector signature. [sent-88, score-0.243]
35 Other models extend the BoW framework [31, 1] of static images to video classification [19, 20], where local motion features (HOF) are extracted at Space Time Interest Points (STIP) and coded by a mapping function on a learned dictionary of features. [sent-89, score-0.486]
36 The work in [35] uses the SFA principle to transform videos into histograms of slow feature temporal averages. [sent-92, score-0.587]
37 With this approach, the temporal dimension of the input signal is reduced to a scalar value before being accumulated into histograms with no further coding or pooling. [sent-93, score-0.296]
38 SFA outputs a set of elementary motion patterns, in a similar manner as done in [35] for human action recognition. [sent-100, score-0.243]
39 Indeed, their data set is concerned with human motion recorded in stable and controlled environments (i. [sent-103, score-0.272]
40 Second, we show that the SFA principle gives good untangling of semantic class manifolds in the context of complex natural scene videos. [sent-106, score-0.291]
41 Importantly, to incorporate temporal information in our video representation, SFA codes are threaded along τ frames, so that local regions over time are represented with output sequences in R^(M×τ). [sent-108, score-0.28]
42 Here, the difference with respect to [35] is significant since we maintain the full temporal dimension of the input signal, which gives richer temporal categorial information compared to the averaging method. [sent-111, score-0.48]
43 To summarize, the paper presents the three following main contributions: • We introduce a local motion descriptor adapted to complex dynamic scenes. [sent-112, score-0.255]
44 SFA generates a low dimensional and low variational subspace representing the embedded stable components of motions inside the video frames. [sent-114, score-0.194]
45 • We propose a coding/pooling architecture in which temporal output sequences of SFA generate global video signatures. [sent-116, score-0.274]
46 By keeping the temporal dimension in the output signal, categorial information is not diluted as it is when using a temporal average over the signal [35] (see the sketch below). [sent-117, score-0.48]
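To make this distinction concrete, the following minimal sketch contrasts the temporal-average representation of [35] with the full temporal profile kept here; the array sizes and random stand-in values are illustrative assumptions, not the authors' code.

```python
import numpy as np

# z: slow-feature outputs for one local region, M features over tau frames.
M, tau = 30, 16                    # values reported later in the experiments
z = np.random.randn(M, tau)        # stand-in for real SFA outputs

averaged_code = z.mean(axis=1)     # (M,)      -- temporal average, as in [35]
temporal_code = z.flatten()        # (M*tau,)  -- full temporal profile kept here
```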
47 Learning local motion features with SFA. The SFA principle has been introduced as a means to learn invariant representations from transformation sequences [15, 34]. [sent-123, score-0.333]
48 The invariance emerging from the SFA principle, which has been used for human action recognition [35], makes it an excellent choice to extract stable motion features for dynamic scene classification. [sent-124, score-0.481]
49 Specifically, given a D-dimensional temporal input signal v(t) = [v1(t), v2(t), ..., vD(t)]^T, [sent-130, score-0.233]
50 SFA seeks a set of functions Sj such that the output signals y(t) = [y1(t), ..., yM(t)]^T, where yj = Sj(v(t)), vary as slowly as possible while still retaining relevant information (see figure 3). [sent-138, score-0.336]
51 This is obtained by minimizing the average square of the signal's temporal derivative, min_Sj < ẏj² >t (1), under the following constraints: [sent-142, score-0.233]
52 ∀ j < j′ : < yj , yj′ >t = 0 (decorrelation), where < · >t denotes the temporal average. [sent-145, score-0.194]
53 With these constraints the SFA principle ensures that output signals vary as slowly as possible without being a simple constant signal carrying no information. [sent-146, score-0.198]
54 The zero-mean and unit-variance constraints normalize the outputs to a common scale and prevent the trivial solution yj = cst, which would be obtained with a temporal low-pass filter (temporal smoothing). [sent-149, score-0.224]
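For readability, the slowness objective (1) and the constraints discussed in the surrounding sentences can be collected in one place. This is the standard SFA formulation; the zero-mean and unit-variance constraints are implied by the text above (only the decorrelation constraint survives verbatim in this dump), so the exact typesetting of the paper's equations may differ.

```latex
\min_{S_j}\ \langle \dot{y}_j^{2} \rangle_t
\quad\text{s.t.}\quad
\langle y_j \rangle_t = 0 \ (\text{zero mean}),\quad
\langle y_j^{2} \rangle_t = 1 \ (\text{unit variance}),\quad
\forall j < j' :\ \langle y_j\, y_{j'} \rangle_t = 0 \ (\text{decorrelation}),
```

with yj = Sj(v(t)) and < · >t the temporal average.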
55 Therefore, the slow features Sj must be instantaneous and cannot average the signals over time. [sent-150, score-0.489]
56 This ensures that the slow features carry time specific information and do not simply dilute the signals. [sent-151, score-0.389]
57 The solution to equation 1, with the slow features Sj ranked from the slowest to the fastest, can be obtained by solving the following eigenvalue problem (equation 2), where the slower features are associated with the smaller eigenvalues λ1 ≤ λ2 ≤ ... [sent-154, score-0.55]
58 Now, to learn slow features from these V1 features (vectors v ∈ R^D), we need to define the temporal covariance matrix of equation 2. [sent-167, score-0.22]
59 We compute all possible features v^n_xy(t) and their temporal derivatives v̇^n_xy(t). [sent-171, score-0.323]
60 The temporal covariance matrix of equation 2 is then computed by < v̇ v̇^T >t = (1 / (p² N T)) Σ_{x,y=1..p} Σ_{n=1..N} Σ_{t=1..T} v̇^n_xy(t) v̇^n_xy(t)^T (3). The eigenvectors of this matrix associated with the M smallest eigenvalues define our slow features S(v) = [S1(v), ..., SM(v)]. [sent-172, score-0.522]
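As an illustration of the learning step described by equations (2)-(3), here is a minimal linear-SFA sketch in NumPy/SciPy. It assumes the V1 feature sequences have already been extracted and centered (the paper normalizes them to a unit sphere), it computes the covariances on a single sequence rather than averaging over positions, samples and time as in equation (3), and none of the names come from the authors' code.

```python
import numpy as np
from scipy.linalg import eigh

def learn_slow_features(V, M):
    """Linear SFA sketch (illustrative, not the authors' implementation).

    V : (T, D) array, a temporal sequence of D-dimensional V1-like features,
        assumed already centered over time.
    M : number of slow features to keep.
    Returns S, an (M, D) matrix whose rows project an input feature vector
    onto the slow features, ordered from slowest to fastest.
    """
    # Finite-difference approximation of the temporal derivative v_dot(t).
    V_dot = np.diff(V, axis=0)

    # Covariance of the derivatives (equation 3 averages this quantity over
    # spatial positions, samples n and time).
    C_dot = (V_dot.T @ V_dot) / V_dot.shape[0]

    # Covariance of the signal itself, used to enforce the unit-variance and
    # decorrelation constraints through a generalized eigenvalue problem.
    C = (V.T @ V) / V.shape[0]
    C += 1e-8 * np.eye(C.shape[0])        # small ridge for numerical stability

    # Slow features = eigenvectors with the M smallest eigenvalues of
    # C_dot s = lambda C s, so that lambda_1 <= lambda_2 <= ... hold.
    _, eigvecs = eigh(C_dot, C, subset_by_index=[0, M - 1])
    return eigvecs.T                       # shape (M, D)
```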
61 The slowest features generate the most stable non trivial output signals. [sent-175, score-0.246]
62 As previously shown in figure 1, these slow features already produce an impressive untangling of class manifolds and are thus excellent candidates to define stable and relevant motion features for classification. [sent-176, score-0.83]
63 (Footnote 5: the V1 features are normalized to a unit sphere [34].) In the next section, we use these slow features to encode local motion features, which are then pooled into a final signature for each video. [sent-179, score-0.637]
64 Coding and Pooling. Our motion features are defined by threading together short temporal sequences of SFA outputs to generate a new representation space. [sent-182, score-0.468]
65 Specifically, we define a motion feature m(t) at position (x, y) and across time t = [t .. t+τ] [sent-183, score-0.212]
66 by a short temporal SFA output sequence from V1 features, using the matrix product m(t) = [zxy(t) ... zxy(t+τ)] = S [vxy(t) ... vxy(t+τ)] (4). [sent-185, score-0.22]
67 If we use M slow features, then S ∈ R^(M×D) and equation 4 defines motion features m(t) ∈ R^(M×τ). [sent-189, score-0.302]
68 As illustrated in figure 2, these motion features m(t) can be interpreted as spatio-temporal atoms describing the stable motion components inside a small space-time window of dimension k × k × τ. [sent-190, score-0.345]
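A minimal sketch of how one such atom could be assembled from the learned matrix S and a local V1 feature sequence, following equation (4); the function name and shapes are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def motion_feature(S, v_sequence):
    """Build a motion feature m(t) as in equation (4) (illustrative sketch).

    S          : (M, D) slow-feature matrix learned by SFA.
    v_sequence : (D, tau) matrix [v_xy(t), ..., v_xy(t + tau - 1)] of V1
                 features at one spatial position (x, y) over tau frames.
    Returns the flattened (M * tau,) spatio-temporal atom used as the local
    motion descriptor.
    """
    z_sequence = S @ v_sequence    # (M, tau): column t is z_xy(t) = S v_xy(t)
    return z_sequence.flatten()
```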
69 To build the dictionary, we chose a simple unsupervised sampling procedure [28] in which we sample N motion features on training videos at random positions and times. [sent-195, score-0.335]
70 The generated temporal codes ci are computed by mapping the motion features onto the learned dictionary. [sent-204, score-0.218]
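The exact coding function is lost in this dump, so the sketch below is only one plausible instantiation of the coding/pooling architecture of figure 2 (k-means dictionary, hard assignment, max pooling); the dictionary size, pooling operator and all names are assumptions rather than the authors' choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_dictionary(sampled_motion_features, k=256):
    """Learn a dictionary from the N randomly sampled motion features (rows)."""
    return KMeans(n_clusters=k, n_init=10).fit(sampled_motion_features)

def video_signature(motion_features, dictionary):
    """Code each motion feature on the dictionary, then pool over space/time.

    motion_features : (n, M * tau) motion features extracted from one video.
    Returns a k-dimensional signature (hard assignment + max pooling here;
    soft coding or other pooling schemes plug into the same pipeline).
    """
    k = dictionary.n_clusters
    assignment = dictionary.predict(motion_features)           # (n,)
    codes = np.zeros((motion_features.shape[0], k))
    codes[np.arange(len(assignment)), assignment] = 1.0        # one-hot codes c_i
    return codes.max(axis=0)                                   # pooled signature
```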
71 Our scores are compared with the spatio-temporal filters (SOE) recently proposed in [9], and with many other state-of-the-art image and motion features used in the computer vision community: HOF [23], GIST [26] and Chaotic invariants [29]. [sent-237, score-0.483]
72 SFA for learning motion descriptors and their embedding in a coding-pooling framework. [sent-243, score-0.215]
73 (… ∼3% in Yupenn and Maryland), the improvement of learning motion descriptors with SFA is outstanding: ∼30% in Yupenn, ∼20% in Maryland. [sent-246, score-0.249]
74 This clearly validates the relevance of learning motion descriptors which self-adapt to the statistics of training videos. [sent-247, score-0.215]
75 This illustrates the importance of keeping temporal information and not diluting the temporal signal using a temporal average. [sent-255, score-0.233]
76 An increase of 8% in classification scores is reached when using dictionary elements with a temporal depth of τ = 16. [sent-260, score-0.366]
77 This suggests that more categorial temporal information is captured when using features which span multiple frames. [sent-261, score-0.282]
78 This is one major difference with the approach used in [35], in which the slow feature temporal dimension is reduced to a single scalar statistic (i.e., a temporal average). [sent-262, score-0.511]
79 As reported in figure 6, good classification scores are reached using only a small set of slow features. [sent-266, score-0.426]
80 By keeping only the most stable slow features (i. [sent-267, score-0.446]
81 The scores on both data sets are stable under a wide range of dictionary sizes, highlighting the robustness of our motion feature representation. [sent-272, score-0.416]
82 Motion feature space. The SFA algorithm is based on the computation of temporal derivatives and therefore assumes a smooth (i. [sent-275, score-0.207]
83 The smooth spatial structure of learned slow features is illustrated by mapping our motion features into V1 space. [sent-278, score-0.676]
84 Figure 8 displays the V1 projection of the 10 slowest features learned from the Yupenn data set (top) and the Maryland data set (bottom). [sent-279, score-0.243]
85 Figure 9 illustrates the smooth temporal output signal from the first slow feature learned on the Yupenn data set in response to a wave pattern. [sent-280, score-0.693]
86 As shown, the output signal of the instantaneous V1 feature (no SFA) does not give smooth motion information compared to the slow feature signal (with SFA). [sent-281, score-0.8]
87 In addition, the SFA signal correlates with the semantic motion pattern (the wave), whereas the raw V1 curve has a more random behavior. [sent-282, score-0.284]
88 As defined in section 3, our motion features are the result of M slow features varying over time. [sent-285, score-0.61]
89 While our full system, using M = 30 slow features, reaches a score of 86.9 [sent-286, score-0.324]
90 on the Yupenn data set, one single slow feature (the slowest) still reaches a score of 73. [sent-287, score-0.348]
91 This remarkable result is first introduced in figure 1 which illustrates the perfect separation achieved by a single slow feature on 7 classes of the Yupenn data set. [sent-289, score-0.356]
92 Figure 10 complements the results of figure 1 and illustrates the semantic untangling achieved by individual slow features on all 14 classes of the Yupenn data set. [sent-290, score-0.551]
93 As shown, one single slow feature cannot untangle all the classes but still achieves an impressive separation using a single dimension. [sent-291, score-0.363]
94 The classification nevertheless benefits from combining several slow features, as our classification scores confirm. [sent-295, score-0.399]
95 Conclusions and summary. This paper presented motion features for video scene classification, learned in an unsupervised manner. [sent-299, score-0.476]
96 These motion features are the result of mapping temporal sequences of instantaneous image features into a low dimensional subspace where temporal variations are minimized. [sent-300, score-0.751]
97 This learned low dimensional representation provides stable descriptions of video scenes which can be used to obtain state-of-the-art classification on two challenging dynamic scene data sets. [sent-301, score-0.395]
98 One possibility unevaluated in this paper would be to learn stable features from spatio-temporal filters instead of from instantaneous spatial filters. [sent-302, score-0.26]
99 The outstanding classification results reported in this paper also suggest that temporal output signals provide more categorical information compared to instantaneous outputs. [sent-303, score-0.38]
100 As many classes studied in the paper consist of dynamical textures, one interesting direction for future work is to use the classification pipeline for motion segmentation and action recognition. [sent-304, score-0.302]
wordName wordTfidf (topN-words)
[('sfa', 0.718), ('slow', 0.302), ('yupenn', 0.228), ('motion', 0.188), ('temporal', 0.16), ('untangling', 0.136), ('thome', 0.104), ('slowest', 0.102), ('instantaneous', 0.093), ('maryland', 0.088), ('stable', 0.084), ('cord', 0.083), ('dictionary', 0.082), ('handcrafted', 0.077), ('signal', 0.073), ('signature', 0.069), ('dynamic', 0.067), ('categorial', 0.062), ('xny', 0.062), ('yupen', 0.062), ('features', 0.06), ('classification', 0.059), ('principle', 0.055), ('pooled', 0.055), ('action', 0.055), ('video', 0.054), ('fire', 0.053), ('hof', 0.048), ('pooling', 0.047), ('videos', 0.046), ('learned', 0.043), ('chaotic', 0.041), ('crivelli', 0.041), ('eriault', 0.041), ('nicolas', 0.041), ('striate', 0.041), ('untangles', 0.041), ('vxny', 0.041), ('vxy', 0.041), ('zxy', 0.041), ('unsupervised', 0.041), ('sj', 0.04), ('coding', 0.038), ('scores', 0.038), ('wave', 0.038), ('biologically', 0.037), ('performances', 0.037), ('untangle', 0.037), ('bouthemy', 0.037), ('flame', 0.037), ('slowly', 0.036), ('codes', 0.036), ('yj', 0.034), ('tino', 0.034), ('matthieu', 0.034), ('neuroscience', 0.034), ('outstanding', 0.034), ('signals', 0.034), ('tpi', 0.032), ('motions', 0.031), ('scenes', 0.031), ('illustrates', 0.03), ('sequences', 0.03), ('outputs', 0.03), ('optical', 0.029), ('pages', 0.029), ('stabilized', 0.028), ('lds', 0.028), ('flow', 0.028), ('cviu', 0.027), ('serre', 0.027), ('scene', 0.027), ('descriptors', 0.027), ('carry', 0.027), ('context', 0.027), ('reached', 0.027), ('slower', 0.026), ('art', 0.026), ('invariants', 0.026), ('categorizing', 0.026), ('dim', 0.026), ('dimension', 0.025), ('inside', 0.025), ('laptev', 0.025), ('stip', 0.025), ('tures', 0.025), ('receptive', 0.025), ('feature', 0.024), ('filters', 0.023), ('dne', 0.023), ('semantic', 0.023), ('natural', 0.023), ('smooth', 0.023), ('pyramid', 0.022), ('score', 0.022), ('frames', 0.022), ('ci', 0.022), ('lighting', 0.022), ('cells', 0.021), ('wolf', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 137 cvpr-2013-Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis
Author: Christian Thériault, Nicolas Thome, Matthieu Cord
Abstract: In this paper, we address the challenging problem of categorizing video sequences composed of dynamic natural scenes. Contrarily to previous methods that rely on handcrafted descriptors, we propose here to represent videos using unsupervised learning of motion features. Our method encompasses three main contributions: 1) Based on the Slow Feature Analysis principle, we introduce a learned local motion descriptor which represents the principal and more stable motion components of training videos. 2) We integrate our local motion feature into a global coding/pooling architecture in order to provide an effective signature for each video sequence. 3) We report state of the art classification performances on two challenging natural scenes data sets. In particular, an outstanding improvement of 11 % in classification score is reached on a data set introduced in 2012.
2 0.15345436 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction
Author: Dennis Park, C. Lawrence Zitnick, Deva Ramanan, Piotr Dollár
Abstract: We describe novel but simple motion features for the problem of detecting objects in video sequences. Previous approaches either compute optical flow or temporal differences on video frame pairs with various assumptions about stabilization. We describe a combined approach that uses coarse-scale flow and fine-scale temporal difference features. Our approach performs weak motion stabilization by factoring out camera motion and coarse object motion while preserving nonrigid motions that serve as useful cues for recognition. We show results for pedestrian detection and human pose estimation in video sequences, achieving state-of-the-art results in both. In particular, given a fixed detection rate our method achieves a five-fold reduction in false positives over prior art on the Caltech Pedestrian benchmark. Finally, we perform extensive diagnostic experiments to reveal what aspects of our system are crucial for good performance. Proper stabilization, long time-scale features, and proper normalization are all critical.
3 0.12770881 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
Author: Mihir Jain, Hervé Jégou, Patrick Bouthemy
Abstract: Several recent works on action recognition have attested the importance of explicitly integrating motion characteristics in the video description. This paper establishes that adequately decomposing visual motion into dominant and residual motions, both in the extraction of the space-time trajectories and for the computation of descriptors, significantly improves action recognition algorithms. Then, we design a new motion descriptor, the DCS descriptor, based on differential motion scalar quantities, divergence, curl and shear features. It captures additional information on the local motion patterns enhancing results. Finally, applying the recent VLAD coding technique proposed in image retrieval provides a substantial improvement for action recognition. Our three contributions are complementary and lead to outperform all reported results by a significant margin on three challenging datasets, namely Hollywood 2, HMDB51 and Olympic Sports. 1. Introduction and related work Human actions often convey the essential meaningful content in videos. Yet, recognizing human actions in un- constrained videos is a challenging problem in Computer Vision which receives a sustained attention due to the potential applications. In particular, there is a large interest in designing video-surveillance systems, providing some automatic annotation of video archives as well as improving human-computer interaction. The solutions proposed to address this problem inherit, to a large extent, from the techniques first designed for the goal of image search and classification. The successful local features developed to describe image patches [15, 23] have been translated in the 2D+t domain as spatio-temporal local descriptors [13, 30] and now include motion clues [29]. These descriptors are often extracted from spatial-temporal interest points [12, 3 1]. More recent techniques assume some underlying temporal motion model involving trajectories [2, 6, 7, 17, 18, 25, 29, 32]. Most of these approaches produce large set of local descriptors which are in turn aggregated to produce a single vector representing the video, in order to enable the use of powerful discriminative classifiers such as support vector machines (SVMs). This is usually done with the bag- Figure 1. Optical flow field vectors (green vectors with red end points) before and after dominant motion compensation. Most of the flow vectors due to camera motion are suppressed after compensation. One of the contributions of this paper is to show that compensating for the dominant motion is beneficial for most of the existing descriptors used for action recognition. of-words technique [24], which quantizes the local features using a k-means codebook. Thanks to the successful combination of this encoding technique with the aforementioned local descriptors, the state of the art in action recognition is able to go beyond the toy problems ofclassifying simple human actions in controlled environment and considers the detection of actions in real movies or video clips [11, 16]. Despite these progresses, the existing descriptors suffer from an uncompleted handling of motion in the video sequence. Motion is arguably the most reliable source of information for action recognition, as often related to the actions of interest. However, it inevitably involves the background or camera motion when dealing with uncontrolled and re- alistic situations. 
Although some attempts have been made to compensate camera motion in several ways [10, 21, 26, 29, 32], how to separate action motion from that caused by the camera, and how to reflect it in the video description remains an open issue. The motion compensation mechanism employed in [10] is tailor-made to the Motion Interchange Pattern encoding technique. The Motion Boundary Histogram (MBH) [29] is a recent appealing approach to 222555555533 suppress the constant motion by considering the flow gradient. It is robust to some extent to the presence of camera motion, yet it does not explicitly handle the camera motion. Another approach [26] uses a sophisticated and robust (RANSAC) estimation of camera motion. It first segments the color image into regions corresponding to planar parts in the scene and estimates the (three) dominant homographies to update the motion associated with local features. A rather different view is adopted in [32] where the motion decomposition is performed at the trajectory level. All these works support the potential of motion compensation. As the first contribution of this paper, we address the problem in a way that departs from these works by considering the compensation of the dominant motion in both the tracking stages and encoding stages involved in the computation of action recognition descriptors. We rely on the pioneering works on motion compensation such as the technique proposed in [20], that considers 2D polynomial affine motion models for estimating the dominant image motion. We consider this particular model for its robustness and its low computational cost. It was already used in [21] to separate the dominant motion (assumed to be due to the camera motion) and the residual motion (corresponding to the independent scene motions) for dynamic event recognition in videos. However, the statistical modeling of both motion components was global (over the entire image) and only the normal flow was computed for the latter. Figure 1 shows the vectors of optical flow before and after applying the proposed motion compensation. Our method successfully suppresses most of the background motion and reinforces the focus towards the action of interest. We exploit this compensated motion both for descriptor computation and for extracting trajectories. However, we also show that the camera motion should not be thrown as it contains complementary information that is worth using to recognize certain action categories. Then, we introduce the Divergence-Curl-Shear (DCS) descriptor, which encodes scalar first-order motion features, namely the motion divergence, curl and shear. It captures physical properties of the flow pattern that are not involved in the best existing descriptors for action recognition, except in the work of [1] which exploits divergence and vorticity among a set of eleven kinematic features computed from the optical flow. Our DCS descriptor provides a good performance recognition performance on its own. Most importantly, it conveys some information which is not captured by existing descriptors and further improves the recognition performance when combined with the other descriptors. As a last contribution, we bring an encoding technique known as VLAD (vector oflocal aggregated descriptors) [8] to the field of action recognition. This technique is shown to be better than the bag-of-words representation for combining all the local video descriptors we have considered. The organization of the paper is as follows. 
Section 2 introduces the motion properties that we will consider through this paper. Section 3 presents the datasets and classification scheme used in our different evaluations. Section 4 details how we revisit several popular descriptors of the literature by the means of dominant motion compensation. Our DCS descriptor based on kinematic properties is introduced in Section 5 and improved by the VLAD encoding technique, which is introduced and bench-marked in Section 6 for several video descriptors. Section 7 provides a comparison showing the large improvement achieved over the state of the art. Finally, Section 8 concludes the paper. 2. Motion Separation and Kinematic Features In this section, we describe the motion clues we incorporate in our action recognition framework. We separate the dominant motion and the residual motion. In most cases, this will account to distinguishing the impact of camera movement and independent actions. Note that we do not aim at recovering the 3D camera motion: The 2D parametric motion model describes the global (or dominant) motion between successive frames. We first explain how we estimate the dominant motion and employ it to separate the dominant flow from the optical flow. Then, we will introduce kinematic features, namely divergence, curl and shear for a more comprehensive description of the visual motion. 2.1. Affine motion for compensating camera motion Among polynomial motion models, we consider the 2D affine motion model. Simplest motion models such as the 4parameter model formed by the combination of 2D translation, 2D rotation and scaling, or more complex ones such as the 8-parameter quadratic model (equivalent to a homography), could be selected as well. The affine model is a good trade-off between accuracy and efficiency which is of primary importance when processing a huge video database. It does have limitations since strictly speaking it implies a single plane assumption for the static background. However, this is not that penalizing (especially for outdoor scenes) if differences in depth remain moderated with respect to the distance to the camera. The affine flow vector at point p = (x, y) and at time t, is defined as waff(pt) =?cc12((t ) ?+?aa31((t ) aa42((t ) ? ?xytt?. (1) = + + = + uaff(pt) c1(t) a1(t)xt a2(t)yt and vaff(pt) c2(t) a3 (t)xt + a4(t)yt are horizontal and vertical components of waff(pt) respectively. Let us denote the optical flow vector at point p at time t as w(pt) = (u(pt) , v(pt)). We introduce the flow vector ω(pt) obtained by removing the affine flow vector from the optical flow vector ω(pt) = w(pt) − waff(pt) . (2) 222555555644 The dominant motion (estimated as waff(pt)) is usually due to the camera motion. In this case, Equation 2 amounts to canceling (or compensating) the camera motion. Note that this is not always true. For example in case of close-up on a moving actor, the dominant motion will be the affine estimation of the apparent actor motion. The interpretation of the motion compensation output will not be that straightforward in this case, however the resulting ω-field will still exhibit different patterns for the foreground action part and the background part. In the remainder, we will refer to the “compensated” flow as ω-flow. Figure 1 displays the computed optical flow and the ωflow. We compute the affine flow with the publicly available Motion2D software1 [20] which implements a realtime robust multiresolution incremental estimation framework. 
The affine motion model has correctly accounted for the motion induced by the camera movement which corresponds to the dominant motion in the image pair. Indeed, we observe that the compensated flow vectors in the background are close to null and the compensated flow in the foreground, i.e., corresponding to the actors, is conversely inflated. The experiments presented along this paper will show that effective separation of dominant motion from the residual motions is beneficial for action recognition. As explained in Section 4, we will compute local motion descriptors, such as HOF, on both the optical flow and the compensated flow (ω-flow), which allows us to explicitly and directly characterize the scene motion. 2.2. Local kinematic features By kinematic features, we mean local first-order differential scalar quantities computed on the flow field. We consider the divergence, the curl (or vorticity) and the hyperbolic terms. They inform on the physical pattern of the flow so that they convey useful information on actions in videos. They can be computed from the first-order derivatives of the flow at every point p at every frame t as ⎨⎪ ⎪ ⎪ ⎧hcdyuipvr1l2(p t) = −∂ u ∂ (yxp(xtp) +−∂ v ∂ v(px ypxt ) The diverg⎪⎩ence is related to axial motion, expansion scaling effects, the curl to rotation in the image plane. hyperbolic terms express the shear of the visual flow responding to more complex configuration. We take account the shear quantity only: shear(pt) = ?hyp12(pt) + hyp22(pt). (3) and The corinto (4) 1http://www.irisa.fr/vista/Motion2D/ In Section 5, we propose the DCS descriptor that is based on the kinematic features (divergence, curl and shear) of the visual motion discussed in this subsection. It is computed on either the optical or the compensated flow, ω-flow. 3. Datasets and evaluation This section first introduces the datasets used for the evaluation. Then, we briefly present the bag-of-feature model and the classification scheme used to encode the descriptors which will be introduced in Section 4. Hollywood2. The Hollywood2 dataset [16] contains 1,707 video clips from 69 movies representing 12 action classes. It is divided into train set and test set of 823 and 884 samples respectively. Following the standard evaluation protocol of this benchmark, we use average precision (AP) for each class and the mean of APs (mAP) for evaluation. HMDB51. The HMDB51 dataset [11] is a large dataset containing 6,766 video clips extracted from various sources, ranging from movies to YouTube. It consists of 51 action classes, each having at least 101 samples. We follow the evaluation protocol of [11] and use three train/test splits, each with 70 training and 30 testing samples per class. The average classification accuracy is computed over all classes. Out of the two released sets, we use the original set as it is more challenging and used by most of the works reporting results in action recognition. Olympic Sports. The third dataset we use is Olympic Sports [19], which again is obtained from YouTube. This dataset contains 783 samples with 16 sports action classes. We use the provided2 train/test split, there are 17 to 56 training samples and 4 to 11test samples per class. Mean AP is used for the evaluation, which is the standard choice. Bag of features and classification setup. We first adopt the standard BOF [24] approach to encode all kinds of descriptors. It produces a vector that serves as the video representation. 
The codebook is constructed for each type of descriptor separately by the k-means algorithm. Following a common practice in the literature [27, 29, 30], the codebook size is set to k=4,000 elements. Note that Section 6 will consider encoding technique for descriptors. For the classification, we use a non-linear SVM with χ2kernel. When combining different descriptors, we simply add the kernel matrices, as done in [27]: K(xi,xj) = exp?−?cγ1cD(xic,xjc)?, 2http://vision.stanford.edu/Datasets/OlympicSports/ 222555555755 (5) where D(xic, xjc) is χ2 distance between video xic and xjc with respect to c-th channel, corresponding to c-th descriptor. The quantity γc is the mean value of χ2 distances between the training samples for the c-th channel. The multiclass classification problem that we consider is addressed by applying a one-against-rest approach. 4. Compensated descriptors This section describes how the compensation ofthe dominant motion is exploited to improve the quality of descriptors encoding the motion and the appearance around spatio-temporal positions, hence the term “compensated descriptors”. First, we briefly review the local descriptors [5, 13, 16, 29, 30] used here along with dense trajectories [29]. Second, we analyze the impact of motion flow compensation when used in two different stages of the descriptor computation, namely in the tracking and the description part. 4.1. Dense trajectories and local descriptors Employing dense trajectories to compute local descriptors is one of the state-of-the-art approaches for action recognition. It has been shown [29] that when local descriptors are computed over dense trajectories the performance improves considerably compared to when computed over spatio temporal features [30]. Dense Trajectories [29]: The trajectories are obtained by densely tracking sampled points using optical flow fields. First, feature points are sampled from a dense grid, with step size of 5 pixels and over 8 scales. Each feature point pt = (xt, yt) at frame t is then tracked to the next frame by median filtering in a dense optical flow field F = (ut, vt) as follows: pt+1 = (xt+1 , yt+1) = (xt, yt) + (M ∗ F) | (x ¯t,y ¯t) , (6) where M is the kernel of median filtering and ( x¯ t, y¯ t) is the rounded position of (xt, yt). The tracking is limited to L (=15) frames to avoid any drifting effect. Excessively short trajectories and trajectories exhibiting sudden large displacements are removed as they induce some artifacts. Trajectories must be understood here as tracks in the spacetime volume of the video. Local descriptors: The descriptors are computed within a space-time volume centered around each trajectory. Four types of descriptors are computed to encode the shape of the trajectory, local motion pattern and appearance, namely Trajectory [29], HOF (histograms of optical flow) [13], MBH [4] and HOG (histograms of oriented gradients) [3]. All these descriptors depend on the flow field used for the tracking and as input of the descriptor computation: 1. The Trajectory descriptor encodes the shape of the trajectory represented by the normalized relative coor- × dinates of the successive points forming the trajectory. It directly depends on the dense flow used for tracking points. 2. HOF is computed using the orientations and magnitudes of the flow field. 3. MBH is designed to capture the gradient of horizontal and vertical components of the flow. The motion boundaries encode the relative pixel motion and therefore suppress camera motion, but only to some extent. 4. 
HOG encodes the appearance by using the intensity gradient orientations and magnitudes. It is formally not a motion descriptor. Yet the position where the descriptor is computed depends on the trajectory shape. As in [29], volume around a feature point is divided into a 2 2 3 space-time grid. The orientations are quantized ian 2to × ×8 b2i ×ns 3fo srp HacOe-Gti amned g g9r ibdi.ns T fhoer o oHriOenFt (awtioitnhs one a qdudainttiiozneadl zero bin). The horizontal and vertical components of MBH are separately quantized into 8 bins each. 4.2. Impact of motion compensation The optical flow is simply referred to as flow in the following, while the compensated flow (see subsection 2. 1) is denoted by ω-flow. Both of them are considered in the tracking and descriptor computation stages. The trajectories obtained by tracking with the ω-flow are called ω-trajectories. Figure 2 comparatively illustrates the ωtrajectories and the trajectories obtained using the flow. The input video shows a man moving away from the car. In this video excerpt, the camera is following the man walking to the right, thus inducing a global motion to the left in the video. When using the flow, the computed trajectories reflect the combination of these two motion components (camera and scene motion) as depicted by Subfigure 2(b), which hampers the characterization of the current action. In contrast, the ω-trajectories plotted in Subfigure 2(c) are more active on the actor moving on the foreground, while those localized in the background are now parallel to the time axis enhancing static parts of the scene. The ω-trajectories are therefore more relevant for action recognition, since they are more regularly and more exclusively following the actor’s motion. Impact on Trajectory and HOG descriptors. Table 1reports the impact of ω-trajectories on Trajectory and HOG descriptors, which are both significantly improved by 3%4% of mAP on the two datasets. When improved by ωflow, these descriptors will be respectively referred to as ω-Trajdesc and ω-HOG in the rest of the paper. Although the better performance of ω-Trajdesc versus the original Trajectory descriptor was expected, the one 222555555866 2. Trajectories obtained from optical and compensated flows. The green tail is the trajectory the current frame. The trajectories are sub-sampled for the sake of clarity. The frames are extracted Figure over every 15 frames with red dot indicating 5 frames in this example. DescriptorHollywood2HMDB51 BaseTrliaωnje- Tc(rtoarejrdpyreos[c2d9u]ced)54 7 1. 7 4% %2382.–89% BaseliHnωOe- (GHreOp [2rG9od]uced)4 451 . 658%%%2296.– 13%% Table 1. ω-Trajdesc and ω-HOG: Impact of compensating flow on Trajectory descriptor and HOG descriptors. achieved by ω-HOG might be surprising. Our interpretation is that HOG captures more context with the modified trajectories. More precisely, the original HOG descriptor is computed from a 2D+t sub-volume aligned with the corresponding trajectory and hence represents the appearance along the trajectory shape. When using ω-flow, we do not align the video sequence. As a result, the ω-HOG descriptor is no more computed around the very same tracked physical point in the space-time volume but around points lying in a patch of the initial feature point, whose size depends on the affine flow magnitude. ω-HOG can be viewed as a “patchbased” computation capturing more information about the appearance of the background or of the moving foreground. 
As for ω-trajectories, they are closer to the real trajectories of the moving actors as they usually cancel the camera movement, and so, more easier to train and recognize. Impact on HOF. The ω-flow impacts computation used as an input to HOF computation itself. Therefore, HOF can both types of trajectories (ω-trajectories both the trajectory and the descriptor be computed along or those extracted MethodHollywood2HMDB51 Table(ω2rHf.-alocO IwomkF)inpHgacOtFobf[2u9ωhsb]i:f-nlo ωgwot-hwωHOflFown5H 02 34O. 58291F% %descripto3 r706s38.:–1076% m%APfor Hollywood2 and average accuracy for HMDB5 1. The ω-HOF is used in subsequent evaluations. from flow) and can encode both kinds of flows (ω-flow or flow). For the sake of completeness, we evaluate all the variants as well as the combination of both flows in the descriptor computation stage. The results are presented in Table 2 and demonstrate the significant improvement obtained by computing the HOF descriptor with the ω-flow instead of the optical flow. Note that the type of trajectories which is used, either “Tracking flow” or “Tracking ω-flow”, has a limited impact in this case. From now on, we only consider the “Tracking ω-flow” case where HOF is computed along ω-trajectories. Interestingly, combining the HOF computed from the flow and the ω-flow further improves the results. This suggests that the two flow fields are complementary and the affine flow that was subtracted from ω-flow brings in additional information. For the sake of brevity, the combination of the two kinds of HOF, i.e., computed from the flow and the ω-flow using ω-trajectories, is referred to as the ω-HOF 222555555977 MethodHollywood2HMDB51 Tab(lerT3a.cIkmM inpBgMacHgωtBf-loH w [u2)s9in]gω f-lo wo MBH5 d42 e.052s7c% riptos:m34A90P.–3769f% orHllywood2 and average accuracy for HMDB5 1. DTerHasMjcBrOeblitpHGeFor4.ySumTωraw- frcilykto hw ionfgtheduωpCs-fcaolrtωmeiwNp-df/tl+Aωoutrw-finlogwthdesωcr- isTpc-fHtrMloaiOjrBpdswtGeHFosrc descriptor in the rest of this paper. Compared to the HOF baseline, the ω-HOF descriptor achieves a gain of +3.1% of mAP on Hollywood 2 and of +7.8% on HMDB51. Impact on MBH. Since MBH is computed from gradient of flow and cancel the constant motion, there is practically no benefit in using the ω-flow to compute the MBH descriptors, as shown in Table 3. However, by tracking ω-flow, the performance improves by around 1.3% for HMDB5 1 dataset and drops by around 1.5% for Hollywood2. This relative performance depends on the encoding technique. We will come back on this descriptor when considering another encoding scheme for local descriptors in Section 6. 4.3. Summary of compensated descriptors Table 4 summarizes the refined versions of the descriptors obtained by exploiting the ω-flow, and both ω-flow and the optical flow in the case of HOF. The revisited descriptors considerably improve the results compared to the orig- inal ones, with the noticeable exception of ω-MBH which gives mixed performance with a bag-of-features encoding scheme. But we already mention as this point that this incongruous behavior of ω-MBH is stabilized with the VLAD encoding scheme considered in Section 6. Another advantage of tracking the compensated flow is that fewer trajectories are produced. For instance, the total number of trajectories decreases by about 9. 16% and 22.81% on the Hollywood2 and HMDB51 datasets, respectively. 
Note that exploiting both the flow and the ω-flow do not induce much computational overhead, as the latter is obtained from the flow and the affine flow which is computed in real-time and already used to get the ω-trajectories. The only additional computational cost that we introduce by using the descriptors summarized in Table 4 is the computation of a second HOF descriptor, but this stage is relatively efficient and not the bottleneck of the extraction procedure. 5. Divergence-Curl-Shear descriptor This section introduces a new descriptor encoding the kinematic properties of motion discussed in Section 2.2. It is denoted by DCS in the rest of this paper. Combining kinematic features. The spatial derivatives are computed for the horizontal and vertical components of the flow field, which are used in turn to compute the divergence, curl and shear scalar values, see Equation 3. We consider all possible pairs of kinematic features, namely (div, curl), (div, shear) and (curl, shear). At each × ×× pixel, we compute the orientation and magnitude of the 2-D vector corresponding to each of these pairs. The orientation is quantized into histograms and the magnitude is used for weighting, similar to SIFT. Our motivation for encoding pairs is that the joint distribution of kinematic features conveys more information than exploiting them independently. Implementation details. The descriptor computation and parameters are similar to HOG and other popular descriptors such as MBH, HOF. We obtain 8-bin histograms for each of the three feature pairs or components of DCS. The range of possible angles is 2π for the (div,curl) pair and π for the other pairs, because the shear is always positive. The DCS descriptor is computed for a space-time volume aligned with a trajectory, as done with the four descriptors mentioned in the previous section. In order to capture the spatio-temporal structure of kinematic features, the volume (32 32 pixels and L = 15 frames) is subdivided into a spatio-temporal grid nofd s Lize = nx 5× f ny m×e nt, sw situhb nx =de ny =to 2a and nt = 3. These parameters ×hnave× × bneen fixed for the sake of consistency with the other descriptors. For each pair of kinematic features, each cell in the grid is represented by a histogram. The resulting local descriptors have a dimensionality equal to 288 = nx ny nt 8 3. At the video level, these descriptors are nenc×od end i×nto 8 a single vector representation using either BOF or the VLAD encoding scheme introduced in the next section. 6. VLAD in actions VLAD [8] is a descriptor encoding technique that aggregates the descriptors based on a locality criterion in the feature space. To our knowledge, this technique has never been considered for action recognition. Below, we briefly introduce this approach and give the performance achieved for all the descriptors introduced along the previous sections. VLAD in brief. Similar to BOF, VLAD relies on a codebook C = {c1, c2 , ...ck} of k centroids learned by k-means. bTohoek representation is ob}t oaifn ked c by summing, efodr b yea kch-m mveiasunasl. word ci, the differences x − ci of the vectors x assigned to ci, thereby producing a sv exct −or c representation oflength d×k, 222555556088 DMeBscHriptorV5 LH.A1o%Dlywo5Bo4d.O2 %F4V3L.3HA%MD B35B91.O7%F Taωbl-eDHM5rOCBa.FSjGPdHe+rsωfco-mMHaBOnFeofV54L2936A.51D% with5431ω208-.5T96% rajde3s42c97158,.ω3% -HOG342,58019ω.6-% HOF descriptors and their combination. where d is the dimension ofthe local descriptors. We use the codebook size, k = 256. 
Despite this large dimensionality, VLAD is efficient because it is effectively compared with a linear kernel. VLAD is post-processed using a componentwise power normalization, which dramatically improves its performance [8]. While cross validating the parameter α involved in this power normalization, we consistently observe, for all the descriptors, a value between 0.15 and 0.3. Therefore, this parameter is set to α = 0.2 in all our experiments. For classification, we use a linear SVM and oneagainst-rest approach everywhere, unless stated otherwise. Impact on existing descriptors. We employ VLAD because it is less sensitive to quantization parameters and appears to provide better performance with descriptors having a large dimensionality. These properties are interesting in our case, because the quantization parameters involved in the DCS and MBH descriptors have been used unchanged in Section 4 for the sake of direct comparison. They might be suboptimal when using the ω-flow instead of the optical flow on which they have initially been optimized [29]. Results for MBH and ω-MBH in Table 5 supports this argument. When using VLAD instead of BOF, the scores are stable in both the cases and there is no mixed inference as that observed in Table 3. VLAD also has significant positive influence on accuracy of ω-DCS descriptor. We also observe that ω-DCS is complementary to ω-MBH and adds to the performance. Still DCS is probably not best utilized in the current setting of parameters. In case of ω-Trajdesc and ω-HOG, the scores are better with BOF on both the datasets. ω-HOF with VLAD improves on HMDB5 1, but remains equivalent for Hollywood2. Although BOF leads to better scores for the descriptors considered individually, their combination with VLAD outperforms the BOF. 7. Comparison with the state of the art This section reports our results with all descriptors combined and compares our method with the state of the art. TrajectorCy+omHbOiGna+tHioOnF+ MDBCHSHol5 l98y.w76%o%od2H4M489.D02%B%51 All ω-descriptors all five compensated descriptors using combined62.5%52.1% Table 6. Combination of VLAD representation. WU*JVliaOnughreM tHaeolth. [yo2w9d87o] 256 0985. 37% SKa*duOeJinhrau tnegdMteHatlMa.h [ol1Dd.0B[91]25 24 609.8172% Table 7. Comparison with the state of the art on Hollywood2 and HMDB5 1 datasets. *Vig et al. [28] gets 61.9% by using external eye movements data. *Jiang et al. [9] used one-vs-one multi class SVM while our and other methods use one-vs-rest SVMs. With one-against-one multi class SVM we obtain 45. 1% for HMDB51. Descriptor combination. Table 6 reports the results obtained when the descriptors are combined. Since we use VLAD, our baseline is updated that is combination of Trajectory, HOG, HOF and MBH with VLAD representation. When DCS is added to the baseline there is an improvement of 0.9% and 1.2%. With combination of all five compensated descriptors we obtain 62.5% and 52.1% on the two datasets. This is a large improvement even over the updated baseline, which shows that the proposed motion compensation and the way we exploit it are significantly important for action recognition. The comparison with the state of the art is shown in Table 7. Our method outperforms all the previously reported results in the literature. In particular, on the HMDB51 dataset, the improvement over the best reported results to date is more than 11% in average accuracy. Jiang el al. [9] used a one-against-one multi-class SVM, which might have resulted in inferior scores. 
With a similar multi-class SVM approach, our method obtains 45. 1%, which remains significantly better than their result. All others results were reported with one-against-rest approach. On Olympic Sports dataset we obtain mAP of 83.2% with ‘All ω-descriptors combined’ and the improvement is mostly because of VLAD and ω-flow. The best reported mAPs on this dataset are Liu et al. [14] (74.4%) and Jiang et al. [9] (80.6%), which we exceed convincingly. Gaidon et al. [6] reports the best average accuracy of 82.7%. 8. Conclusions This paper first demonstrates the interest of canceling the dominant motion (predominantly camera motion) to make the visual motion truly related to actions, for both the trajectory extraction and descriptor computation stages. It pro222555556199 duces significantly better versions (called compensated descriptors) of several state-of-the-art local descriptors for action recognition. The simplicity, efficiency and effectiveness of this motion compensation approach make it applicable to any action recognition framework based on motion descriptors and trajectories. The second contribution is the new DCS descriptor derived from the first-order scalar motion quantities specifying the local motion patterns. It captures additional information which is proved complementary to the other descriptors. Finally, we show that VLAD encoding technique instead of bag-of-words boosts several action descriptors, and overall exhibits a significantly better performance when combining different types of descriptors. Our contributions are all complementary and significantly outperform the state of the art when combined, as demon- strated by our extensive experiments on the Hollywood 2, HMDB51 and Olympic Sports datasets. Acknowledgments This work was supported by the Quaero project, funded by Oseo, French agency for innovation. We acknowledge Heng Wang’s help for reproducing some of their results. References [1] S. Ali and M. Shah. Human action recognition in videos using kinematic features and multiple instance learning. IEEE T-PAMI, 32(2):288–303, Feb. 2010. [2] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, Sep. 2010. [3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, Jun. 2005. [4] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, May 2006. [5] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, Oct. 2005. [6] A. Gaidon, Z. Harchaoui, and C. Schmid. Recognizing activities with cluster-trees of tracklets. In BMVC, Sep. 2012. [7] A. Hervieu, P. Bouthemy, and J.-P. Le Cadre. A statistical video content recognition method using invariant features on object trajectories. IEEE T-CSVT, 18(1 1): 1533–1543, 2008. [8] H. J ´egou, F. Perronnin, M. Douze, J. S ´anchez, P. P ´erez, and C. Schmid. Aggregating local descriptors into compact [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] codes. IEEE T-PAMI, 34(9):1704–1716, 2012. Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based modeling of human actions with motion reference points. In ECCV, Oct. 2012. O. Kliper-Gross, Y. Gurovich, T. Hassner, and L. Wolf. Motion interchange patterns for action recognition in unconstrained videos. In ECCV, Oct. 2012. H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: A large video database for human motion recognition. In ICCV, Nov. 2011. I. Laptev and T. 
[12] I. Laptev and T. Lindeberg. Space-time interest points. In ICCV, Oct. 2003.
[13] I. Laptev, M. Marzalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, Jun. 2008.
[14] J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In CVPR, Jun. 2011.
[15] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, Nov. 2004.
[16] M. Marzalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, Jun. 2009.
[17] P. Matikainen, M. Hebert, and R. Sukthankar. Trajectons: Action recognition through the motion analysis of tracked features. In Workshop on Video-Oriented Object and Event Classification, ICCV, Sep. 2009.
[18] R. Messing, C. J. Pal, and H. A. Kautz. Activity recognition using the velocity histories of tracked keypoints. In ICCV, Sep. 2009.
[19] J. C. Niebles, C.-W. Chen, and F.-F. Li. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, Sep. 2010.
[20] J.-M. Odobez and P. Bouthemy. Robust multiresolution estimation of parametric motion models. Jal of Vis. Comm. and Image Representation, 6(4):348–365, Dec. 1995.
[21] G. Piriou, P. Bouthemy, and J.-F. Yao. Recognition of dynamic video contents with global probabilistic models of visual motion. IEEE T-IP, 15(11):3417–3430, 2006.
[22] S. Sadanand and J. J. Corso. Action bank: A high-level representation of activity in video. In CVPR, Jun. 2012.
[23] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. IEEE T-PAMI, 19(5):530–534, May 1997.
[24] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, pages 1470–1477, Oct. 2003.
[25] J. Sun, X. Wu, S. Yan, L. F. Cheong, T.-S. Chua, and J. Li. Hierarchical spatio-temporal context modeling for action recognition. In CVPR, Jun. 2009.
[26] H. Uemura, S. Ishikawa, and K. Mikolajczyk. Feature tracking and motion compensation for action recognition. In BMVC, Sep. 2008.
[27] M. M. Ullah, S. N. Parizi, and I. Laptev. Improving bag-of-features action recognition with non-local cues. In BMVC, Sep. 2010.
[28] E. Vig, M. Dorr, and D. Cox. Saliency-based space-variant descriptor sampling for action recognition. In ECCV, Oct. 2012.
[29] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, Jun. 2011.
[30] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, Sep. 2009.
[31] G. Willems, T. Tuytelaars, and L. J. V. Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In ECCV, Oct. 2008.
[32] S. Wu, O. Oreifej, and M. Shah. Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories. In ICCV, Nov. 2011.
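The VLAD aggregation and the componentwise power normalization discussed in the text above translate into a few lines of code. The sketch below is a minimal illustration only, assuming a pre-trained k-means codebook and generic local descriptors; the codebook size (k = 256) and α = 0.2 follow the text, but the function and variable names are our own, not the authors' implementation.

```python
import numpy as np

def vlad_encode(descriptors, centroids, alpha=0.2):
    """Minimal VLAD sketch: aggregate residuals to the nearest centroid,
    then apply signed power normalization and L2 normalization.

    descriptors: (n, d) local descriptors of one video
    centroids:   (k, d) k-means codebook (e.g., k = 256)
    """
    k, d = centroids.shape
    # Assign each descriptor to its nearest centroid (squared Euclidean distance).
    dists = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)

    # Accumulate the residuals x - c_i for each visual word.
    vlad = np.zeros((k, d))
    for i in range(k):
        members = descriptors[assign == i]
        if len(members):
            vlad[i] = (members - centroids[i]).sum(axis=0)
    vlad = vlad.ravel()  # length d * k

    # Componentwise (signed) power normalization with alpha = 0.2, then L2.
    vlad = np.sign(vlad) * (np.abs(vlad) ** alpha)
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad
```

The resulting d×k vector can then be fed to a linear SVM in a one-against-rest setup, as in the experiments described above.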
4 0.1239531 244 cvpr-2013-Large Displacement Optical Flow from Nearest Neighbor Fields
Author: Zhuoyuan Chen, Hailin Jin, Zhe Lin, Scott Cohen, Ying Wu
Abstract: We present an optical flow algorithm for large displacement motions. Most existing optical flow methods use the standard coarse-to-fine framework to deal with large displacement motions which has intrinsic limitations. Instead, we formulate the motion estimation problem as a motion segmentation problem. We use approximate nearest neighbor fields to compute an initial motion field and use a robust algorithm to compute a set of similarity transformations as the motion candidates for segmentation. To account for deviations from similarity transformations, we add local deformations in the segmentation process. We also observe that small objects can be better recovered using translations as the motion candidates. We fuse the motion results obtained under similarity transformations and under translations together before a final refinement. Experimental validation shows that our method can successfully handle large displacement motions. Although we particularly focus on large displacement motions in this work, we make no sacrifice in terms of overall performance. In particular, our method ranks at the top of the Middlebury benchmark.
5 0.10505091 187 cvpr-2013-Geometric Context from Videos
Author: S. Hussain Raza, Matthias Grundmann, Irfan Essa
Abstract: We present a novel algorithm for estimating the broad 3D geometric structure of outdoor video scenes. Leveraging spatio-temporal video segmentation, we decompose a dynamic scene captured by a video into geometric classes, based on predictions made by region-classifiers that are trained on appearance and motion features. By examining the homogeneity of the prediction, we combine predictions across multiple segmentation hierarchy levels alleviating the need to determine the granularity a priori. We built a novel, extensive dataset on geometric context of video to evaluate our method, consisting of over 100 groundtruth annotated outdoor videos with over 20,000 frames. To further scale beyond this dataset, we propose a semisupervised learning framework to expand the pool of labeled data with high confidence predictions obtained from unlabeled data. Our system produces an accurate prediction of geometric context of video achieving 96% accuracy across main geometric classes.
6 0.098018892 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
7 0.094390847 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
8 0.090328202 296 cvpr-2013-Multi-level Discriminative Dictionary Learning towards Hierarchical Visual Categorization
9 0.088917233 388 cvpr-2013-Semi-supervised Learning of Feature Hierarchies for Object Detection in a Video
10 0.08884988 40 cvpr-2013-An Approach to Pose-Based Action Recognition
11 0.087324128 46 cvpr-2013-Articulated and Restricted Motion Subspaces and Their Signatures
12 0.086713947 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
13 0.085909232 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
14 0.084905006 100 cvpr-2013-Crossing the Line: Crowd Counting by Integer Programming with Local Features
15 0.081083857 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
16 0.080022417 203 cvpr-2013-Hierarchical Video Representation with Trajectory Binary Partition Tree
17 0.079471417 392 cvpr-2013-Separable Dictionary Learning
18 0.07650426 257 cvpr-2013-Learning Structured Low-Rank Representations for Image Classification
19 0.076472819 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition
20 0.075992212 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes
topicId topicWeight
[(0, 0.179), (1, -0.033), (2, -0.061), (3, -0.013), (4, -0.134), (5, -0.018), (6, 0.0), (7, -0.033), (8, -0.059), (9, 0.038), (10, 0.054), (11, 0.017), (12, 0.06), (13, -0.04), (14, 0.108), (15, 0.052), (16, 0.013), (17, -0.003), (18, -0.035), (19, -0.007), (20, 0.002), (21, -0.042), (22, 0.011), (23, -0.022), (24, -0.064), (25, 0.061), (26, -0.01), (27, 0.05), (28, -0.036), (29, -0.009), (30, -0.051), (31, -0.009), (32, -0.019), (33, 0.021), (34, 0.001), (35, 0.019), (36, 0.029), (37, -0.008), (38, -0.024), (39, -0.026), (40, -0.024), (41, -0.003), (42, -0.08), (43, 0.047), (44, -0.016), (45, 0.0), (46, 0.009), (47, 0.06), (48, 0.001), (49, 0.056)]
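The topicId/topicWeight pairs above and the simIndex/simValue entries below suggest that paper similarity is scored in a shared topic space. The mining pipeline is not documented in this dump, so the sketch below only shows one plausible reading, assuming the similarity values are cosine similarities between such sparse topic-weight vectors; the topic count, parsing format and function names are our own assumptions.

```python
import numpy as np

def to_dense(topic_weights, n_topics=100):
    """Turn a sparse [(topicId, weight), ...] list into a dense vector.
    n_topics is an assumption; the dump does not state the topic count."""
    v = np.zeros(n_topics)
    for topic_id, weight in topic_weights:
        v[topic_id] = weight
    return v

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

# Hypothetical usage: the first vector uses a few of the weights listed above,
# the second is an invented comparison paper.
paper_137 = to_dense([(0, 0.179), (1, -0.033), (2, -0.061), (3, -0.013)])
other_paper = to_dense([(0, 0.150), (2, -0.040), (5, -0.018)])
print(cosine_similarity(paper_137, other_paper))
```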
simIndex simValue paperId paperTitle
same-paper 1 0.95545644 137 cvpr-2013-Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis
Author: Christian Thériault, Nicolas Thome, Matthieu Cord
Abstract: In this paper, we address the challenging problem of categorizing video sequences composed of dynamic natural scenes. Contrarily to previous methods that rely on handcrafted descriptors, we propose here to represent videos using unsupervised learning of motion features. Our method encompasses three main contributions: 1) Based on the Slow Feature Analysis principle, we introduce a learned local motion descriptor which represents the principal and more stable motion components of training videos. 2) We integrate our local motion feature into a global coding/pooling architecture in order to provide an effective signature for each video sequence. 3) We report state of the art classification performances on two challenging natural scenes data sets. In particular, an outstanding improvement of 11 % in classification score is reached on a data set introduced in 2012.
2 0.7697413 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
Author: Mihir Jain, Hervé Jégou, Patrick Bouthemy
Abstract: Several recent works on action recognition have attested the importance of explicitly integrating motion characteristics in the video description. This paper establishes that adequately decomposing visual motion into dominant and residual motions, both in the extraction of the space-time trajectories and for the computation of descriptors, significantly improves action recognition algorithms. Then, we design a new motion descriptor, the DCS descriptor, based on differential motion scalar quantities, divergence, curl and shear features. It captures additional information on the local motion patterns enhancing results. Finally, applying the recent VLAD coding technique proposed in image retrieval provides a substantial improvement for action recognition. Our three contributions are complementary and lead to outperform all reported results by a significant margin on three challenging datasets, namely Hollywood 2, HMDB51 and Olympic Sports. 1. Introduction and related work Human actions often convey the essential meaningful content in videos. Yet, recognizing human actions in un- constrained videos is a challenging problem in Computer Vision which receives a sustained attention due to the potential applications. In particular, there is a large interest in designing video-surveillance systems, providing some automatic annotation of video archives as well as improving human-computer interaction. The solutions proposed to address this problem inherit, to a large extent, from the techniques first designed for the goal of image search and classification. The successful local features developed to describe image patches [15, 23] have been translated in the 2D+t domain as spatio-temporal local descriptors [13, 30] and now include motion clues [29]. These descriptors are often extracted from spatial-temporal interest points [12, 3 1]. More recent techniques assume some underlying temporal motion model involving trajectories [2, 6, 7, 17, 18, 25, 29, 32]. Most of these approaches produce large set of local descriptors which are in turn aggregated to produce a single vector representing the video, in order to enable the use of powerful discriminative classifiers such as support vector machines (SVMs). This is usually done with the bag- Figure 1. Optical flow field vectors (green vectors with red end points) before and after dominant motion compensation. Most of the flow vectors due to camera motion are suppressed after compensation. One of the contributions of this paper is to show that compensating for the dominant motion is beneficial for most of the existing descriptors used for action recognition. of-words technique [24], which quantizes the local features using a k-means codebook. Thanks to the successful combination of this encoding technique with the aforementioned local descriptors, the state of the art in action recognition is able to go beyond the toy problems ofclassifying simple human actions in controlled environment and considers the detection of actions in real movies or video clips [11, 16]. Despite these progresses, the existing descriptors suffer from an uncompleted handling of motion in the video sequence. Motion is arguably the most reliable source of information for action recognition, as often related to the actions of interest. However, it inevitably involves the background or camera motion when dealing with uncontrolled and re- alistic situations. 
Although some attempts have been made to compensate camera motion in several ways [10, 21, 26, 29, 32], how to separate action motion from that caused by the camera, and how to reflect it in the video description remains an open issue. The motion compensation mechanism employed in [10] is tailor-made to the Motion Interchange Pattern encoding technique. The Motion Boundary Histogram (MBH) [29] is a recent appealing approach to 222555555533 suppress the constant motion by considering the flow gradient. It is robust to some extent to the presence of camera motion, yet it does not explicitly handle the camera motion. Another approach [26] uses a sophisticated and robust (RANSAC) estimation of camera motion. It first segments the color image into regions corresponding to planar parts in the scene and estimates the (three) dominant homographies to update the motion associated with local features. A rather different view is adopted in [32] where the motion decomposition is performed at the trajectory level. All these works support the potential of motion compensation. As the first contribution of this paper, we address the problem in a way that departs from these works by considering the compensation of the dominant motion in both the tracking stages and encoding stages involved in the computation of action recognition descriptors. We rely on the pioneering works on motion compensation such as the technique proposed in [20], that considers 2D polynomial affine motion models for estimating the dominant image motion. We consider this particular model for its robustness and its low computational cost. It was already used in [21] to separate the dominant motion (assumed to be due to the camera motion) and the residual motion (corresponding to the independent scene motions) for dynamic event recognition in videos. However, the statistical modeling of both motion components was global (over the entire image) and only the normal flow was computed for the latter. Figure 1 shows the vectors of optical flow before and after applying the proposed motion compensation. Our method successfully suppresses most of the background motion and reinforces the focus towards the action of interest. We exploit this compensated motion both for descriptor computation and for extracting trajectories. However, we also show that the camera motion should not be thrown as it contains complementary information that is worth using to recognize certain action categories. Then, we introduce the Divergence-Curl-Shear (DCS) descriptor, which encodes scalar first-order motion features, namely the motion divergence, curl and shear. It captures physical properties of the flow pattern that are not involved in the best existing descriptors for action recognition, except in the work of [1] which exploits divergence and vorticity among a set of eleven kinematic features computed from the optical flow. Our DCS descriptor provides a good performance recognition performance on its own. Most importantly, it conveys some information which is not captured by existing descriptors and further improves the recognition performance when combined with the other descriptors. As a last contribution, we bring an encoding technique known as VLAD (vector oflocal aggregated descriptors) [8] to the field of action recognition. This technique is shown to be better than the bag-of-words representation for combining all the local video descriptors we have considered. The organization of the paper is as follows. 
Section 2 introduces the motion properties that we will consider through this paper. Section 3 presents the datasets and classification scheme used in our different evaluations. Section 4 details how we revisit several popular descriptors of the literature by the means of dominant motion compensation. Our DCS descriptor based on kinematic properties is introduced in Section 5 and improved by the VLAD encoding technique, which is introduced and bench-marked in Section 6 for several video descriptors. Section 7 provides a comparison showing the large improvement achieved over the state of the art. Finally, Section 8 concludes the paper. 2. Motion Separation and Kinematic Features In this section, we describe the motion clues we incorporate in our action recognition framework. We separate the dominant motion and the residual motion. In most cases, this will account to distinguishing the impact of camera movement and independent actions. Note that we do not aim at recovering the 3D camera motion: The 2D parametric motion model describes the global (or dominant) motion between successive frames. We first explain how we estimate the dominant motion and employ it to separate the dominant flow from the optical flow. Then, we will introduce kinematic features, namely divergence, curl and shear for a more comprehensive description of the visual motion. 2.1. Affine motion for compensating camera motion Among polynomial motion models, we consider the 2D affine motion model. Simplest motion models such as the 4parameter model formed by the combination of 2D translation, 2D rotation and scaling, or more complex ones such as the 8-parameter quadratic model (equivalent to a homography), could be selected as well. The affine model is a good trade-off between accuracy and efficiency which is of primary importance when processing a huge video database. It does have limitations since strictly speaking it implies a single plane assumption for the static background. However, this is not that penalizing (especially for outdoor scenes) if differences in depth remain moderated with respect to the distance to the camera. The affine flow vector at point p = (x, y) and at time t, is defined as waff(pt) =?cc12((t ) ?+?aa31((t ) aa42((t ) ? ?xytt?. (1) = + + = + uaff(pt) c1(t) a1(t)xt a2(t)yt and vaff(pt) c2(t) a3 (t)xt + a4(t)yt are horizontal and vertical components of waff(pt) respectively. Let us denote the optical flow vector at point p at time t as w(pt) = (u(pt) , v(pt)). We introduce the flow vector ω(pt) obtained by removing the affine flow vector from the optical flow vector ω(pt) = w(pt) − waff(pt) . (2) 222555555644 The dominant motion (estimated as waff(pt)) is usually due to the camera motion. In this case, Equation 2 amounts to canceling (or compensating) the camera motion. Note that this is not always true. For example in case of close-up on a moving actor, the dominant motion will be the affine estimation of the apparent actor motion. The interpretation of the motion compensation output will not be that straightforward in this case, however the resulting ω-field will still exhibit different patterns for the foreground action part and the background part. In the remainder, we will refer to the “compensated” flow as ω-flow. Figure 1 displays the computed optical flow and the ωflow. We compute the affine flow with the publicly available Motion2D software1 [20] which implements a realtime robust multiresolution incremental estimation framework. 
The affine motion model has correctly accounted for the motion induced by the camera movement which corresponds to the dominant motion in the image pair. Indeed, we observe that the compensated flow vectors in the background are close to null and the compensated flow in the foreground, i.e., corresponding to the actors, is conversely inflated. The experiments presented along this paper will show that effective separation of dominant motion from the residual motions is beneficial for action recognition. As explained in Section 4, we will compute local motion descriptors, such as HOF, on both the optical flow and the compensated flow (ω-flow), which allows us to explicitly and directly characterize the scene motion. 2.2. Local kinematic features By kinematic features, we mean local first-order differential scalar quantities computed on the flow field. We consider the divergence, the curl (or vorticity) and the hyperbolic terms. They inform on the physical pattern of the flow so that they convey useful information on actions in videos. They can be computed from the first-order derivatives of the flow at every point p at every frame t as ⎨⎪ ⎪ ⎪ ⎧hcdyuipvr1l2(p t) = −∂ u ∂ (yxp(xtp) +−∂ v ∂ v(px ypxt ) The diverg⎪⎩ence is related to axial motion, expansion scaling effects, the curl to rotation in the image plane. hyperbolic terms express the shear of the visual flow responding to more complex configuration. We take account the shear quantity only: shear(pt) = ?hyp12(pt) + hyp22(pt). (3) and The corinto (4) 1http://www.irisa.fr/vista/Motion2D/ In Section 5, we propose the DCS descriptor that is based on the kinematic features (divergence, curl and shear) of the visual motion discussed in this subsection. It is computed on either the optical or the compensated flow, ω-flow. 3. Datasets and evaluation This section first introduces the datasets used for the evaluation. Then, we briefly present the bag-of-feature model and the classification scheme used to encode the descriptors which will be introduced in Section 4. Hollywood2. The Hollywood2 dataset [16] contains 1,707 video clips from 69 movies representing 12 action classes. It is divided into train set and test set of 823 and 884 samples respectively. Following the standard evaluation protocol of this benchmark, we use average precision (AP) for each class and the mean of APs (mAP) for evaluation. HMDB51. The HMDB51 dataset [11] is a large dataset containing 6,766 video clips extracted from various sources, ranging from movies to YouTube. It consists of 51 action classes, each having at least 101 samples. We follow the evaluation protocol of [11] and use three train/test splits, each with 70 training and 30 testing samples per class. The average classification accuracy is computed over all classes. Out of the two released sets, we use the original set as it is more challenging and used by most of the works reporting results in action recognition. Olympic Sports. The third dataset we use is Olympic Sports [19], which again is obtained from YouTube. This dataset contains 783 samples with 16 sports action classes. We use the provided2 train/test split, there are 17 to 56 training samples and 4 to 11test samples per class. Mean AP is used for the evaluation, which is the standard choice. Bag of features and classification setup. We first adopt the standard BOF [24] approach to encode all kinds of descriptors. It produces a vector that serves as the video representation. 
The codebook is constructed for each type of descriptor separately by the k-means algorithm. Following a common practice in the literature [27, 29, 30], the codebook size is set to k=4,000 elements. Note that Section 6 will consider encoding technique for descriptors. For the classification, we use a non-linear SVM with χ2kernel. When combining different descriptors, we simply add the kernel matrices, as done in [27]: K(xi,xj) = exp?−?cγ1cD(xic,xjc)?, 2http://vision.stanford.edu/Datasets/OlympicSports/ 222555555755 (5) where D(xic, xjc) is χ2 distance between video xic and xjc with respect to c-th channel, corresponding to c-th descriptor. The quantity γc is the mean value of χ2 distances between the training samples for the c-th channel. The multiclass classification problem that we consider is addressed by applying a one-against-rest approach. 4. Compensated descriptors This section describes how the compensation ofthe dominant motion is exploited to improve the quality of descriptors encoding the motion and the appearance around spatio-temporal positions, hence the term “compensated descriptors”. First, we briefly review the local descriptors [5, 13, 16, 29, 30] used here along with dense trajectories [29]. Second, we analyze the impact of motion flow compensation when used in two different stages of the descriptor computation, namely in the tracking and the description part. 4.1. Dense trajectories and local descriptors Employing dense trajectories to compute local descriptors is one of the state-of-the-art approaches for action recognition. It has been shown [29] that when local descriptors are computed over dense trajectories the performance improves considerably compared to when computed over spatio temporal features [30]. Dense Trajectories [29]: The trajectories are obtained by densely tracking sampled points using optical flow fields. First, feature points are sampled from a dense grid, with step size of 5 pixels and over 8 scales. Each feature point pt = (xt, yt) at frame t is then tracked to the next frame by median filtering in a dense optical flow field F = (ut, vt) as follows: pt+1 = (xt+1 , yt+1) = (xt, yt) + (M ∗ F) | (x ¯t,y ¯t) , (6) where M is the kernel of median filtering and ( x¯ t, y¯ t) is the rounded position of (xt, yt). The tracking is limited to L (=15) frames to avoid any drifting effect. Excessively short trajectories and trajectories exhibiting sudden large displacements are removed as they induce some artifacts. Trajectories must be understood here as tracks in the spacetime volume of the video. Local descriptors: The descriptors are computed within a space-time volume centered around each trajectory. Four types of descriptors are computed to encode the shape of the trajectory, local motion pattern and appearance, namely Trajectory [29], HOF (histograms of optical flow) [13], MBH [4] and HOG (histograms of oriented gradients) [3]. All these descriptors depend on the flow field used for the tracking and as input of the descriptor computation: 1. The Trajectory descriptor encodes the shape of the trajectory represented by the normalized relative coor- × dinates of the successive points forming the trajectory. It directly depends on the dense flow used for tracking points. 2. HOF is computed using the orientations and magnitudes of the flow field. 3. MBH is designed to capture the gradient of horizontal and vertical components of the flow. The motion boundaries encode the relative pixel motion and therefore suppress camera motion, but only to some extent. 4. 
HOG encodes the appearance by using the intensity gradient orientations and magnitudes. It is formally not a motion descriptor. Yet the position where the descriptor is computed depends on the trajectory shape. As in [29], volume around a feature point is divided into a 2 2 3 space-time grid. The orientations are quantized ian 2to × ×8 b2i ×ns 3fo srp HacOe-Gti amned g g9r ibdi.ns T fhoer o oHriOenFt (awtioitnhs one a qdudainttiiozneadl zero bin). The horizontal and vertical components of MBH are separately quantized into 8 bins each. 4.2. Impact of motion compensation The optical flow is simply referred to as flow in the following, while the compensated flow (see subsection 2. 1) is denoted by ω-flow. Both of them are considered in the tracking and descriptor computation stages. The trajectories obtained by tracking with the ω-flow are called ω-trajectories. Figure 2 comparatively illustrates the ωtrajectories and the trajectories obtained using the flow. The input video shows a man moving away from the car. In this video excerpt, the camera is following the man walking to the right, thus inducing a global motion to the left in the video. When using the flow, the computed trajectories reflect the combination of these two motion components (camera and scene motion) as depicted by Subfigure 2(b), which hampers the characterization of the current action. In contrast, the ω-trajectories plotted in Subfigure 2(c) are more active on the actor moving on the foreground, while those localized in the background are now parallel to the time axis enhancing static parts of the scene. The ω-trajectories are therefore more relevant for action recognition, since they are more regularly and more exclusively following the actor’s motion. Impact on Trajectory and HOG descriptors. Table 1reports the impact of ω-trajectories on Trajectory and HOG descriptors, which are both significantly improved by 3%4% of mAP on the two datasets. When improved by ωflow, these descriptors will be respectively referred to as ω-Trajdesc and ω-HOG in the rest of the paper. Although the better performance of ω-Trajdesc versus the original Trajectory descriptor was expected, the one 222555555866 2. Trajectories obtained from optical and compensated flows. The green tail is the trajectory the current frame. The trajectories are sub-sampled for the sake of clarity. The frames are extracted Figure over every 15 frames with red dot indicating 5 frames in this example. DescriptorHollywood2HMDB51 BaseTrliaωnje- Tc(rtoarejrdpyreos[c2d9u]ced)54 7 1. 7 4% %2382.–89% BaseliHnωOe- (GHreOp [2rG9od]uced)4 451 . 658%%%2296.– 13%% Table 1. ω-Trajdesc and ω-HOG: Impact of compensating flow on Trajectory descriptor and HOG descriptors. achieved by ω-HOG might be surprising. Our interpretation is that HOG captures more context with the modified trajectories. More precisely, the original HOG descriptor is computed from a 2D+t sub-volume aligned with the corresponding trajectory and hence represents the appearance along the trajectory shape. When using ω-flow, we do not align the video sequence. As a result, the ω-HOG descriptor is no more computed around the very same tracked physical point in the space-time volume but around points lying in a patch of the initial feature point, whose size depends on the affine flow magnitude. ω-HOG can be viewed as a “patchbased” computation capturing more information about the appearance of the background or of the moving foreground. 
As for ω-trajectories, they are closer to the real trajectories of the moving actors as they usually cancel the camera movement, and so, more easier to train and recognize. Impact on HOF. The ω-flow impacts computation used as an input to HOF computation itself. Therefore, HOF can both types of trajectories (ω-trajectories both the trajectory and the descriptor be computed along or those extracted MethodHollywood2HMDB51 Table(ω2rHf.-alocO IwomkF)inpHgacOtFobf[2u9ωhsb]i:f-nlo ωgwot-hwωHOflFown5H 02 34O. 58291F% %descripto3 r706s38.:–1076% m%APfor Hollywood2 and average accuracy for HMDB5 1. The ω-HOF is used in subsequent evaluations. from flow) and can encode both kinds of flows (ω-flow or flow). For the sake of completeness, we evaluate all the variants as well as the combination of both flows in the descriptor computation stage. The results are presented in Table 2 and demonstrate the significant improvement obtained by computing the HOF descriptor with the ω-flow instead of the optical flow. Note that the type of trajectories which is used, either “Tracking flow” or “Tracking ω-flow”, has a limited impact in this case. From now on, we only consider the “Tracking ω-flow” case where HOF is computed along ω-trajectories. Interestingly, combining the HOF computed from the flow and the ω-flow further improves the results. This suggests that the two flow fields are complementary and the affine flow that was subtracted from ω-flow brings in additional information. For the sake of brevity, the combination of the two kinds of HOF, i.e., computed from the flow and the ω-flow using ω-trajectories, is referred to as the ω-HOF 222555555977 MethodHollywood2HMDB51 Tab(lerT3a.cIkmM inpBgMacHgωtBf-loH w [u2)s9in]gω f-lo wo MBH5 d42 e.052s7c% riptos:m34A90P.–3769f% orHllywood2 and average accuracy for HMDB5 1. DTerHasMjcBrOeblitpHGeFor4.ySumTωraw- frcilykto hw ionfgtheduωpCs-fcaolrtωmeiwNp-df/tl+Aωoutrw-finlogwthdesωcr- isTpc-fHtrMloaiOjrBpdswtGeHFosrc descriptor in the rest of this paper. Compared to the HOF baseline, the ω-HOF descriptor achieves a gain of +3.1% of mAP on Hollywood 2 and of +7.8% on HMDB51. Impact on MBH. Since MBH is computed from gradient of flow and cancel the constant motion, there is practically no benefit in using the ω-flow to compute the MBH descriptors, as shown in Table 3. However, by tracking ω-flow, the performance improves by around 1.3% for HMDB5 1 dataset and drops by around 1.5% for Hollywood2. This relative performance depends on the encoding technique. We will come back on this descriptor when considering another encoding scheme for local descriptors in Section 6. 4.3. Summary of compensated descriptors Table 4 summarizes the refined versions of the descriptors obtained by exploiting the ω-flow, and both ω-flow and the optical flow in the case of HOF. The revisited descriptors considerably improve the results compared to the orig- inal ones, with the noticeable exception of ω-MBH which gives mixed performance with a bag-of-features encoding scheme. But we already mention as this point that this incongruous behavior of ω-MBH is stabilized with the VLAD encoding scheme considered in Section 6. Another advantage of tracking the compensated flow is that fewer trajectories are produced. For instance, the total number of trajectories decreases by about 9. 16% and 22.81% on the Hollywood2 and HMDB51 datasets, respectively. 
Note that exploiting both the flow and the ω-flow do not induce much computational overhead, as the latter is obtained from the flow and the affine flow which is computed in real-time and already used to get the ω-trajectories. The only additional computational cost that we introduce by using the descriptors summarized in Table 4 is the computation of a second HOF descriptor, but this stage is relatively efficient and not the bottleneck of the extraction procedure. 5. Divergence-Curl-Shear descriptor This section introduces a new descriptor encoding the kinematic properties of motion discussed in Section 2.2. It is denoted by DCS in the rest of this paper. Combining kinematic features. The spatial derivatives are computed for the horizontal and vertical components of the flow field, which are used in turn to compute the divergence, curl and shear scalar values, see Equation 3. We consider all possible pairs of kinematic features, namely (div, curl), (div, shear) and (curl, shear). At each × ×× pixel, we compute the orientation and magnitude of the 2-D vector corresponding to each of these pairs. The orientation is quantized into histograms and the magnitude is used for weighting, similar to SIFT. Our motivation for encoding pairs is that the joint distribution of kinematic features conveys more information than exploiting them independently. Implementation details. The descriptor computation and parameters are similar to HOG and other popular descriptors such as MBH, HOF. We obtain 8-bin histograms for each of the three feature pairs or components of DCS. The range of possible angles is 2π for the (div,curl) pair and π for the other pairs, because the shear is always positive. The DCS descriptor is computed for a space-time volume aligned with a trajectory, as done with the four descriptors mentioned in the previous section. In order to capture the spatio-temporal structure of kinematic features, the volume (32 32 pixels and L = 15 frames) is subdivided into a spatio-temporal grid nofd s Lize = nx 5× f ny m×e nt, sw situhb nx =de ny =to 2a and nt = 3. These parameters ×hnave× × bneen fixed for the sake of consistency with the other descriptors. For each pair of kinematic features, each cell in the grid is represented by a histogram. The resulting local descriptors have a dimensionality equal to 288 = nx ny nt 8 3. At the video level, these descriptors are nenc×od end i×nto 8 a single vector representation using either BOF or the VLAD encoding scheme introduced in the next section. 6. VLAD in actions VLAD [8] is a descriptor encoding technique that aggregates the descriptors based on a locality criterion in the feature space. To our knowledge, this technique has never been considered for action recognition. Below, we briefly introduce this approach and give the performance achieved for all the descriptors introduced along the previous sections. VLAD in brief. Similar to BOF, VLAD relies on a codebook C = {c1, c2 , ...ck} of k centroids learned by k-means. bTohoek representation is ob}t oaifn ked c by summing, efodr b yea kch-m mveiasunasl. word ci, the differences x − ci of the vectors x assigned to ci, thereby producing a sv exct −or c representation oflength d×k, 222555556088 DMeBscHriptorV5 LH.A1o%Dlywo5Bo4d.O2 %F4V3L.3HA%MD B35B91.O7%F Taωbl-eDHM5rOCBa.FSjGPdHe+rsωfco-mMHaBOnFeofV54L2936A.51D% with5431ω208-.5T96% rajde3s42c97158,.ω3% -HOG342,58019ω.6-% HOF descriptors and their combination. where d is the dimension ofthe local descriptors. We use the codebook size, k = 256. 
Despite this large dimensionality, VLAD is efficient because it is effectively compared with a linear kernel. VLAD is post-processed using a componentwise power normalization, which dramatically improves its performance [8]. While cross validating the parameter α involved in this power normalization, we consistently observe, for all the descriptors, a value between 0.15 and 0.3. Therefore, this parameter is set to α = 0.2 in all our experiments. For classification, we use a linear SVM and oneagainst-rest approach everywhere, unless stated otherwise. Impact on existing descriptors. We employ VLAD because it is less sensitive to quantization parameters and appears to provide better performance with descriptors having a large dimensionality. These properties are interesting in our case, because the quantization parameters involved in the DCS and MBH descriptors have been used unchanged in Section 4 for the sake of direct comparison. They might be suboptimal when using the ω-flow instead of the optical flow on which they have initially been optimized [29]. Results for MBH and ω-MBH in Table 5 supports this argument. When using VLAD instead of BOF, the scores are stable in both the cases and there is no mixed inference as that observed in Table 3. VLAD also has significant positive influence on accuracy of ω-DCS descriptor. We also observe that ω-DCS is complementary to ω-MBH and adds to the performance. Still DCS is probably not best utilized in the current setting of parameters. In case of ω-Trajdesc and ω-HOG, the scores are better with BOF on both the datasets. ω-HOF with VLAD improves on HMDB5 1, but remains equivalent for Hollywood2. Although BOF leads to better scores for the descriptors considered individually, their combination with VLAD outperforms the BOF. 7. Comparison with the state of the art This section reports our results with all descriptors combined and compares our method with the state of the art. TrajectorCy+omHbOiGna+tHioOnF+ MDBCHSHol5 l98y.w76%o%od2H4M489.D02%B%51 All ω-descriptors all five compensated descriptors using combined62.5%52.1% Table 6. Combination of VLAD representation. WU*JVliaOnughreM tHaeolth. [yo2w9d87o] 256 0985. 37% SKa*duOeJinhrau tnegdMteHatlMa.h [ol1Dd.0B[91]25 24 609.8172% Table 7. Comparison with the state of the art on Hollywood2 and HMDB5 1 datasets. *Vig et al. [28] gets 61.9% by using external eye movements data. *Jiang et al. [9] used one-vs-one multi class SVM while our and other methods use one-vs-rest SVMs. With one-against-one multi class SVM we obtain 45. 1% for HMDB51. Descriptor combination. Table 6 reports the results obtained when the descriptors are combined. Since we use VLAD, our baseline is updated that is combination of Trajectory, HOG, HOF and MBH with VLAD representation. When DCS is added to the baseline there is an improvement of 0.9% and 1.2%. With combination of all five compensated descriptors we obtain 62.5% and 52.1% on the two datasets. This is a large improvement even over the updated baseline, which shows that the proposed motion compensation and the way we exploit it are significantly important for action recognition. The comparison with the state of the art is shown in Table 7. Our method outperforms all the previously reported results in the literature. In particular, on the HMDB51 dataset, the improvement over the best reported results to date is more than 11% in average accuracy. Jiang el al. [9] used a one-against-one multi-class SVM, which might have resulted in inferior scores. 
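The dominant-motion compensation (ω-flow) and the divergence/curl/shear quantities described in the paper above also translate directly into code. The sketch below is an illustration under simplifying assumptions: it fits the 2D affine model to a precomputed dense flow field with ordinary least squares, whereas the paper relies on the robust multiresolution Motion2D estimator [20], and it uses the standard finite-difference definitions of the first-order kinematic features; all function names are ours.

```python
import numpy as np

def affine_compensate(u, v):
    """Fit a 6-parameter affine flow to (u, v) and return the residual
    omega-flow. Plain least squares is used here for illustration only."""
    h, w = u.shape
    ys, xs = np.mgrid[0:h, 0:w]
    A = np.column_stack([np.ones(h * w), xs.ravel(), ys.ravel()])
    pu, *_ = np.linalg.lstsq(A, u.ravel(), rcond=None)  # c1, a1, a2
    pv, *_ = np.linalg.lstsq(A, v.ravel(), rcond=None)  # c2, a3, a4
    u_aff = (A @ pu).reshape(h, w)
    v_aff = (A @ pv).reshape(h, w)
    return u - u_aff, v - v_aff

def kinematic_features(u, v):
    """Divergence, curl and shear of a dense flow field via finite differences."""
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)
    div = du_dx + dv_dy
    curl = dv_dx - du_dy
    hyp1 = du_dx - dv_dy
    hyp2 = du_dy + dv_dx
    shear = np.sqrt(hyp1 ** 2 + hyp2 ** 2)
    return div, curl, shear

# Hypothetical usage with a random flow field standing in for real optical flow.
u = np.random.randn(64, 64)
v = np.random.randn(64, 64)
omega_u, omega_v = affine_compensate(u, v)
div, curl, shear = kinematic_features(omega_u, omega_v)
```

Histograms over the (div, curl), (div, shear) and (curl, shear) pairs, accumulated in a spatio-temporal grid along trajectories, would then give a DCS-like local descriptor in the spirit of the paper.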
3 0.75862557 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction
Author: Dennis Park, C. Lawrence Zitnick, Deva Ramanan, Piotr Dollár
Abstract: We describe novel but simple motion features for the problem of detecting objects in video sequences. Previous approaches either compute optical flow or temporal differences on video frame pairs with various assumptions about stabilization. We describe a combined approach that uses coarse-scale flow and fine-scale temporal difference features. Our approach performs weak motion stabilization by factoring out camera motion and coarse object motion while preserving nonrigid motions that serve as useful cues for recognition. We show results for pedestrian detection and human pose estimation in video sequences, achieving state-of-the-art results in both. In particular, given a fixed detection rate our method achieves a five-fold reduction in false positives over prior art on the Caltech Pedestrian benchmark. Finally, we perform extensive diagnostic experiments to reveal what aspects of our system are crucial for good performance. Proper stabilization, long time-scale features, and proper normalization are all critical.
4 0.73965025 118 cvpr-2013-Detecting Pulse from Head Motions in Video
Author: Guha Balakrishnan, Fredo Durand, John Guttag
Abstract: We extract heart rate and beat lengths from videos by measuring subtle head motion caused by the Newtonian reaction to the influx of blood at each beat. Our method tracks features on the head and performs principal component analysis (PCA) to decompose their trajectories into a set of component motions. It then chooses the component that best corresponds to heartbeats based on its temporal frequency spectrum. Finally, we analyze the motion projected to this component and identify peaks of the trajectories, which correspond to heartbeats. When evaluated on 18 subjects, our approach reported heart rates nearly identical to an electrocardiogram device. Additionally we were able to capture clinically relevant information about heart rate variability.
5 0.70295817 88 cvpr-2013-Compressible Motion Fields
Author: Giuseppe Ottaviano, Pushmeet Kohli
Abstract: Traditional video compression methods obtain a compact representation for image frames by computing coarse motion fields defined on patches of pixels called blocks, in order to compensate for the motion in the scene across frames. This piecewise constant approximation makes the motion field efficiently encodable, but it introduces block artifacts in the warped image frame. In this paper, we address the problem of estimating dense motion fields that, while accurately predicting one frame from a given reference frame by warping it with the field, are also compressible. We introduce a representation for motion fields based on wavelet bases, and approximate the compressibility of their coefficients with a piecewise smooth surrogate function that yields an objective function similar to classical optical flow formulations. We then show how to quantize and encode such coefficients with adaptive precision. We demonstrate the effectiveness of our approach by comparing its performance with a state-of-the-art wavelet video encoder. Experimental results on a number of standard flow and video datasets reveal that our method significantly outperforms both block-based and optical-flow-based motion compensation algorithms.
6 0.67263287 244 cvpr-2013-Large Displacement Optical Flow from Nearest Neighbor Fields
7 0.663845 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos
8 0.66212583 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition
9 0.66035539 203 cvpr-2013-Hierarchical Video Representation with Trajectory Binary Partition Tree
10 0.65029764 187 cvpr-2013-Geometric Context from Videos
11 0.60707194 124 cvpr-2013-Determining Motion Directly from Normal Flows Upon the Use of a Spherical Eye Platform
12 0.60513216 333 cvpr-2013-Plane-Based Content Preserving Warps for Video Stabilization
13 0.6041804 10 cvpr-2013-A Fully-Connected Layered Model of Foreground and Background Flow
14 0.59479064 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
15 0.59446043 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition
16 0.59202331 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
17 0.59060943 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization
18 0.5857932 353 cvpr-2013-Relative Hidden Markov Models for Evaluating Motion Skill
19 0.58512473 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
20 0.56609106 83 cvpr-2013-Classification of Tumor Histology via Morphometric Context
topicId topicWeight
[(10, 0.066), (16, 0.017), (26, 0.021), (29, 0.012), (33, 0.682), (67, 0.042), (69, 0.042), (87, 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.9990598 137 cvpr-2013-Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis
Author: Christian Thériault, Nicolas Thome, Matthieu Cord
Abstract: In this paper, we address the challenging problem of categorizing video sequences composed of dynamic natural scenes. Contrarily to previous methods that rely on handcrafted descriptors, we propose here to represent videos using unsupervised learning of motion features. Our method encompasses three main contributions: 1) Based on the Slow Feature Analysis principle, we introduce a learned local motion descriptor which represents the principal and more stable motion components of training videos. 2) We integrate our local motion feature into a global coding/pooling architecture in order to provide an effective signature for each video sequence. 3) We report state of the art classification performances on two challenging natural scenes data sets. In particular, an outstanding improvement of 11 % in classification score is reached on a data set introduced in 2012.
2 0.99868804 260 cvpr-2013-Learning and Calibrating Per-Location Classifiers for Visual Place Recognition
Author: Petr Gronát, Guillaume Obozinski, Josef Sivic, Tomáš Pajdla
Abstract: The aim of this work is to localize a query photograph by finding other images depicting the same place in a large geotagged image database. This is a challenging task due to changes in viewpoint, imaging conditions and the large size of the image database. The contribution of this work is two-fold. First, we cast the place recognition problem as a classification task and use the available geotags to train a classifier for each location in the database in a similar manner to per-exemplar SVMs in object recognition. Second, as only few positive training examples are available for each location, we propose a new approach to calibrate all the per-location SVM classifiers using only the negative examples. The calibration we propose relies on a significance measure essentially equivalent to the p-values classically used in statistical hypothesis testing. Experiments are performed on a database of 25,000 geotagged street view images of Pittsburgh and demonstrate improved place recognition accuracy of the proposed approach over the previous work. 2Center for Machine Perception, Faculty of Electrical Engineering. 3WILLOW project, Laboratoire d'Informatique de l'École Normale Supérieure, ENS/INRIA/CNRS UMR 8548. 4Université Paris-Est, LIGM (UMR CNRS 8049), Center for Visual Computing, École des Ponts - ParisTech, 77455 Marne-la-Vallée, France
3 0.99852788 357 cvpr-2013-Revisiting Depth Layers from Occlusions
Author: Adarsh Kowdle, Andrew Gallagher, Tsuhan Chen
Abstract: In this work, we consider images of a scene with a moving object captured by a static camera. As the object (human or otherwise) moves about the scene, it reveals pairwise depth-ordering or occlusion cues. The goal of this work is to use these sparse occlusion cues along with monocular depth occlusion cues to densely segment the scene into depth layers. We cast the problem of depth-layer segmentation as a discrete labeling problem on a spatiotemporal Markov Random Field (MRF) that uses the motion occlusion cues along with monocular cues and a smooth motion prior for the moving object. We quantitatively show that the depth ordering produced by the proposed combination of the depth cues from object motion and monocular occlusion cues is superior to using either feature independently, and to using a naïve combination of the features.
4 0.99832189 180 cvpr-2013-Fully-Connected CRFs with Non-Parametric Pairwise Potential
Author: Neill D.F. Campbell, Kartic Subr, Jan Kautz
Abstract: Conditional Random Fields (CRFs) are used for diverse tasks, ranging from image denoising to object recognition. For images, they are commonly defined as a graph with nodes corresponding to individual pixels and pairwise links that connect nodes to their immediate neighbors. Recent work has shown that fully-connected CRFs, where each node is connected to every other node, can be solved efficiently under the restriction that the pairwise term is a Gaussian kernel over a Euclidean feature space. In this paper, we generalize the pairwise terms to a non-linear dissimilarity measure that is not required to be a distance metric. To this end, we propose a density estimation technique to derive conditional pairwise potentials in a nonparametric manner. We then use an efficient embedding technique to estimate an approximate Euclidean feature space for these potentials, in which the pairwise term can still be expressed as a Gaussian kernel. We demonstrate that the use of non-parametric models for the pairwise interactions, conditioned on the input data, greatly increases expressive power whilst maintaining efficient inference.
5 0.99828357 346 cvpr-2013-Real-Time No-Reference Image Quality Assessment Based on Filter Learning
Author: Peng Ye, Jayant Kumar, Le Kang, David Doermann
Abstract: This paper addresses the problem of general-purpose No-Reference Image Quality Assessment (NR-IQA) with the goal of developing a real-time, cross-domain model that can predict the quality of distorted images without prior knowledge of non-distorted reference images and types of distortions present in these images. The contributions of our work are two-fold: first, the proposed method is highly efficient. NR-IQA measures are often used in real-time imaging or communication systems, therefore it is important to have a fast NR-IQA algorithm that can be used in these real-time applications. Second, the proposed method has the potential to be used in multiple image domains. Previous work on NR-IQA focuses primarily on predicting the quality of natural scene images with respect to human perception, yet, in other image domains, the final receiver of a digital image may not be a human. The proposed method consists of the following components: (1) a local feature extractor; (2) a global feature extractor and (3) a regression model. While previous approaches usually treat local feature extraction and regression model training independently, we propose a supervised method based on back-projection, which links the two steps by learning a compact set of filters which can be applied to local image patches to obtain discriminative local features. Using a small set of filters, the proposed method is extremely fast. We have tested this method on various natural scene and document image datasets and obtained state-of-the-art results.
6 0.99823141 178 cvpr-2013-From Local Similarity to Global Coding: An Application to Image Classification
7 0.99804115 55 cvpr-2013-Background Modeling Based on Bidirectional Analysis
8 0.99791694 93 cvpr-2013-Constraints as Features
9 0.99788493 165 cvpr-2013-Fast Energy Minimization Using Learned State Filters
10 0.99785155 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
11 0.99764317 113 cvpr-2013-Dense Variational Reconstruction of Non-rigid Surfaces from Monocular Video
12 0.99726516 48 cvpr-2013-Attribute-Based Detection of Unfamiliar Classes with Humans in the Loop
13 0.9970305 252 cvpr-2013-Learning Locally-Adaptive Decision Functions for Person Verification
14 0.99620759 301 cvpr-2013-Multi-target Tracking by Rank-1 Tensor Approximation
15 0.99277627 189 cvpr-2013-Graph-Based Discriminative Learning for Location Recognition
16 0.99266219 379 cvpr-2013-Scalable Sparse Subspace Clustering
17 0.9909631 266 cvpr-2013-Learning without Human Scores for Blind Image Quality Assessment
18 0.99061787 343 cvpr-2013-Query Adaptive Similarity for Large Scale Object Retrieval
19 0.98992306 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
20 0.98959512 148 cvpr-2013-Ensemble Video Object Cut in Highly Dynamic Scenes