iccv iccv2013 iccv2013-39 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Heng Wang, Cordelia Schmid
Abstract: Recently dense trajectories were shown to be an efficient video representation for action recognition and achieved state-of-the-art results on a variety of datasets. This paper improves their performance by taking into account camera motion to correct them. To estimate camera motion, we match feature points between frames using SURF descriptors and dense optical flow, which are shown to be complementary. These matches are, then, used to robustly estimate a homography with RANSAC. Human motion is in general different from camera motion and generates inconsistent matches. To improve the estimation, a human detector is employed to remove these matches. Given the estimated camera motion, we remove trajectories consistent with it. We also use this estimation to cancel out camera motion from the optical flow. This significantly improves motion-based descriptors, such as HOF and MBH. Experimental results on four challenging action datasets (i.e., Hollywood2, HMDB51, Olympic Sports and UCF50) significantly outperform the current state of the art.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Recently dense trajectories were shown to be an efficient video representation for action recognition and achieved state-of-the-art results on a variety of datasets. [sent-3, score-0.615]
2 This paper improves their performance by taking into account camera motion to correct them. [sent-4, score-0.43]
3 To estimate camera motion, we match feature points between frames using SURF descriptors and dense optical flow, which are shown to be complementary. [sent-5, score-0.576]
4 These matches are, then, used to robustly estimate a homography with RANSAC. [sent-6, score-0.425]
5 Human motion is in general different from camera motion and generates inconsistent matches. [sent-7, score-0.739]
6 Given the estimated camera motion, we remove trajectories consistent with it. [sent-9, score-0.514]
7 We also use this estimation to cancel out camera motion from the optical flow. [sent-10, score-0.606]
8 Recent research focuses on realistic datasets collected from movies [20, 22], web videos [21, 31], TV shows [28], etc. [sent-17, score-0.292]
9 Among the local space-time features, dense trajectories [40] have been shown to perform best on a variety of datasets. [sent-26, score-0.383]
10 First row: images of two consecutive frames overlaid; second row: optical flow [8] between the two frames; third row: optical flow after removing camera motion; last row: trajectories removed due to camera motion in white. [sent-28, score-1.713]
11 idea is to densely sample feature points in each frame, and track them in the video based on optical flow. [sent-29, score-0.295]
12 Multiple descriptors are computed along the trajectories of feature points to capture shape, appearance and motion information. [sent-30, score-0.664]
13 Interestingly, motion boundary histograms (MBH) [6] give the best results due to their robustness to camera motion. [sent-31, score-0.385]
14 MBH is based on derivatives of optical flow, which is a simple and efficient way to suppress camera motion. [sent-32, score-0.333]
15 However, we argue that we can still benefit from explicit camera motion estimation. [sent-33, score-0.385]
16 Green arrows correspond to SURF descriptor matches, and red ones to dense optical flow. [sent-35, score-0.377]
17 We can prune them and only keep trajectories from humans or objects of interest, if we know the camera motion (see Figure 1). [sent-37, score-0.886]
18 Furthermore, given the camera motion, we can correct the optical flow, so that the motion vectors of human actors are independent of camera motion. [sent-38, score-0.81]
19 This improves the performance of motion descriptors based on optical flow, i. [sent-39, score-0.572]
20 We illustrate the difference between the original and corrected optical flow in the middle two rows of Figure 1. [sent-42, score-0.362]
21 Very few approaches consider camera motion when extracting feature trajectories for action recognition. [sent-43, score-0.918]
22 [42] apply a low-rank assumption to decompose feature trajectories into camera-induced and object-induced components. [sent-47, score-0.33]
23 [27] perform weak stabilization to remove both camera and object-centric motion using coarsescale optical flow for pedestrian detection and pose estimation in video. [sent-49, score-0.932]
24 [14] decompose visual motion into dominant and residual motions both for extracting trajectories and computing descriptors. [sent-51, score-0.547]
25 In section 2, we detail our approach for camera motion estimation and discuss how to remove inconsistent matches due to humans. [sent-58, score-0.731]
26 The code to compute improved trajectories and descriptors is available online. [sent-60, score-0.43]
27 The right one fits the homography to the moving humans as they dominate the frame. [sent-71, score-0.484]
28 Improving dense trajectories. In this section, we first describe the major steps of our camera motion estimation method, and how to use it to improve dense trajectories. [sent-73, score-0.877]
29 We, then, discuss how to remove potentially inconsistent matches based on humans to obtain a robust homography estimation. [sent-74, score-0.729]
30 Camera motion estimation. To estimate the global background motion, we assume that two consecutive frames are related by a homography [37]. [sent-77, score-0.665]
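As summarized in the abstract, SURF keypoints are matched between consecutive frames and a homography is then fitted with RANSAC. The following is a minimal OpenCV sketch of that step, not the authors' implementation; the Hessian threshold, ratio-test value and reprojection threshold are assumed values, and SURF lives in the opencv-contrib xfeatures2d module, which may require a build with non-free algorithms enabled.

```python
import cv2
import numpy as np

def surf_matches(prev_gray, curr_gray):
    # SURF detector/descriptor (opencv-contrib, non-free); threshold is illustrative.
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    kp1, des1 = surf.detectAndCompute(prev_gray, None)
    kp2, des2 = surf.detectAndCompute(curr_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(des1, des2, k=2)
    good = []
    for p in pairs:  # Lowe-style ratio test keeps distinctive matches only
        if len(p) == 2 and p[0].distance < 0.75 * p[1].distance:
            good.append(p[0])
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    return pts1, pts2

def fit_homography(pts1, pts2, reproj_thresh=1.0):
    if len(pts1) < 4:
        return None, None
    # RANSAC keeps only matches consistent with the dominant (camera) motion.
    H, inliers = cv2.findHomography(pts1.reshape(-1, 1, 2),
                                    pts2.reshape(-1, 1, 2),
                                    cv2.RANSAC, reproj_thresh)
    return H, inliers
```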
31 We also sample motion vectors from the optical flow, which provides us with dense matches between frames. [sent-84, score-0.656]
32 Here, we use an efficient optical flow algorithm based on polynomial expansion [8]. [sent-85, score-0.362]
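The flow algorithm cited as [8] is Farnebäck's polynomial-expansion method, available in OpenCV as calcOpticalFlowFarneback. Below is a sketch of turning the dense flow into additional point matches by sampling it on a regular grid; the grid step and flow parameters are assumptions rather than the paper's settings, and fit_homography refers to the hypothetical helper sketched above.

```python
import cv2
import numpy as np

def flow_matches(prev_gray, curr_gray, step=16):
    """Sample dense Farneback flow on a grid to obtain extra frame-to-frame matches."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape[:2]
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    pts1 = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    disp = flow[ys.ravel(), xs.ravel()]   # flow vector at each sampled pixel
    return pts1, pts1 + disp

# Usage sketch: concatenate both kinds of matches before the RANSAC fit.
# p1 = np.vstack([surf_p1, flow_p1]); p2 = np.vstack([surf_p2, flow_p2])
# H, inliers = fit_homography(p1, p2)
```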
33 The optical flow (second and fourth columns) is warped with the corresponding homography. [sent-92, score-0.464]
34 Figure 1 (two rows in the middle) demonstrates the difference of optical flow before and after rectification. [sent-102, score-0.362]
35 Compared to the original flow (the second row of Figure 1), the rectified version (the third row) suppresses the background camera motion and enhances the foreground moving objects. [sent-103, score-0.673]
36 For dense trajectories, there are two major advantages of canceling out camera motion from optical flow. [sent-104, score-0.659]
37 First, the motion descriptors can directly benefit from this. [sent-105, score-0.334]
38 Second, we can remove trajectories generated by camera motion. [sent-110, score-0.514]
39 This can be achieved by thresholding the displacement vectors of the trajectories in the warped flow field. [sent-111, score-0.6]
40 If the displacement is too small, the trajectory is considered to be too similar to camera motion, and thus removed. [sent-112, score-0.342]
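A sketch of this pruning rule, assuming each trajectory stores its per-frame (x, y) points and that the camera-compensated (warped) flow field of each frame is available; the 1-pixel threshold mentioned later in the text is used as the default here.

```python
import numpy as np

def is_camera_trajectory(traj_points, warped_flows, thresh=1.0):
    """True if a trajectory moves less than `thresh` pixels in the warped flow,
    i.e. its apparent motion is explained by the camera and it should be dropped."""
    disps = []
    for (x, y), flow in zip(traj_points, warped_flows):
        fx, fy = flow[int(round(y)), int(round(x))]
        disps.append(np.hypot(fx, fy))
    return max(disps) < thresh

# foreground = [t for t in trajectories if not is_camera_trajectory(t, warped_flows)]
```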
41 Trajectories consistent with camera motion (e.g., pan, tilt and zoom) are thus removed and only trajectories related to human actions are kept (shown in green in Figure 3). [sent-116, score-0.452]
42 The left one is due to severe motion blur, which makes both SURF descriptor matching and optical flow estimation unreliable. [sent-119, score-0.738]
43 Improving motion estimation in the presence of motion blur is worth further attention, since blur often occurs in realistic datasets. [sent-120, score-0.628]
44 In the example shown on the right, humans dominate the frame, which causes homography estimation to fail. [sent-121, score-0.483]
45 Removing inconsistent matches due to humans. In action datasets, videos often focus on the humans performing the action. [sent-125, score-0.842]
46 As a result, it is very common that humans dominate the frame, which can be a problem for camera motion estimation as human motion is in general not consistent with it. [sent-126, score-0.945]
47 We propose to use a human detector to remove matches from human regions. [sent-127, score-0.439]
48 In general, human detection in action datasets is rather difficult, as there are dramatic pose changes when the person is performing the action. [sent-128, score-0.367]
49 Here, we apply a state-of-the-art human detector [30], which adapts the general part-based human detector [9] to action datasets. [sent-130, score-0.479]
50 We use the human detector as a mask to remove feature matches inside the bounding boxes when estimating the homography. [sent-134, score-0.436]
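The paper uses the action-adapted part-based detector of [30]/[9]; since that model is not bundled with common libraries, the sketch below substitutes OpenCV's default HOG people detector purely for illustration, building a mask and dropping matches whose source point falls inside a detected person box.

```python
import cv2
import numpy as np

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def human_mask(frame):
    """Boolean mask that is True outside detected person bounding boxes."""
    boxes, _ = hog.detectMultiScale(frame, winStride=(8, 8))
    mask = np.ones(frame.shape[:2], dtype=bool)
    for (x, y, w, h) in boxes:
        mask[y:y + h, x:x + w] = False
    return mask

def keep_non_human_matches(pts1, pts2, mask):
    """Drop matches originating inside human regions before fitting the homography."""
    keep = np.array([mask[int(round(y)), int(round(x))] for (x, y) in pts1], dtype=bool)
    return pts1[keep], pts2[keep]
```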
51 Without human detection (the left two columns of Figure 4), many features from the moving humans become inlier matches and the homography is, thus, incorrect. [sent-135, score-0.7]
52 As a result, the corresponding optical flow is not correctly warped. [sent-136, score-0.362]
53 In contrast, camera motion is successfully compensated (the right two columns of Figure 4), when the human bounding boxes are used to remove matches not corresponding to camera motion. [sent-137, score-0.92]
54 The homography does not fit the background very well despite detecting the humans correctly, as the background is represented by two planes, one of which is very close to the camera. [sent-139, score-0.525]
55 In section 4.3, we compare the performance of action recognition with or without human detection. [sent-141, score-0.295]
56 In the following, we always use the human detector to remove potentially inconsistent matches before computing the homography, unless stated otherwise. [sent-151, score-0.456]
57 Tracking points is achieved by median filtering in a dense optical flow field [8]. [sent-161, score-0.443]
58 We remove static feature trajectories as they do not contain motion information, and also prune trajectories with sudden large displacements. [sent-163, score-0.997]
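A sketch of the tracking and pruning steps, assuming trajectories are plain lists of (x, y) points; the median-filter kernel and the pruning thresholds are illustrative values, not those of [40].

```python
import cv2
import numpy as np

def smooth_flow(flow, ksize=5):
    """Median-filter each flow component once per frame (float32 supports ksize 3 or 5)."""
    fx = cv2.medianBlur(np.ascontiguousarray(flow[..., 0]), ksize)
    fy = cv2.medianBlur(np.ascontiguousarray(flow[..., 1]), ksize)
    return np.dstack([fx, fy])

def track_point(point, smoothed_flow):
    """Move a sampled point to the next frame by the median-filtered flow at its location."""
    x, y = point
    fx, fy = smoothed_flow[int(round(y)), int(round(x))]
    return (x + fx, y + fy)

def keep_trajectory(points, min_var=1.0, max_step=20.0):
    """Reject essentially static trajectories and those with sudden large jumps."""
    pts = np.asarray(points, dtype=np.float32)
    if len(pts) < 2:
        return False
    steps = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    return pts.var(axis=0).sum() >= min_var and steps.max() <= max_step
```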
59 Both HOF and MBH measure motion information, and are based on optical flow. [sent-170, score-0.438]
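HOF is, in its standard form, a magnitude-weighted histogram of flow orientations with an extra bin for near-zero flow. A minimal per-cell sketch follows; the bin count and zero-motion threshold are illustrative, and in the full descriptor such histograms are aggregated over a space-time grid around each trajectory.

```python
import numpy as np

def hof_histogram(flow, n_bins=8, min_mag=1.0):
    """Magnitude-weighted orientation histogram of flow, plus a zero-motion bin."""
    fx, fy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(fx, fy)
    ang = np.arctan2(fy, fx)                        # orientation in [-pi, pi]
    bins = ((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.zeros(n_bins + 1, dtype=np.float64)
    moving = mag >= min_mag
    hist[n_bins] = np.count_nonzero(~moving)        # near-static pixels go to the extra bin
    np.add.at(hist, bins[moving], mag[moving])      # weight orientation bins by magnitude
    return hist / (hist.sum() + 1e-8)
```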
60 MBH splits the optical flow into horizontal and vertical components, and quantizes the derivatives of each component. [sent-172, score-0.434]
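A per-cell sketch of MBH: each flow component is treated as an image and a gradient-orientation histogram is computed on it (HOG on flow), so constant camera motion contributes no gradient. Cell aggregation is omitted and the parameter values are illustrative.

```python
import cv2
import numpy as np

def mbh_histograms(flow, n_bins=8):
    """Motion boundary histograms: gradient-orientation histograms of each flow component."""
    feats = []
    for c in range(2):                               # horizontal (MBHx) and vertical (MBHy)
        comp = np.ascontiguousarray(flow[..., c])
        gx = cv2.Sobel(comp, cv2.CV_32F, 1, 0, ksize=1)
        gy = cv2.Sobel(comp, cv2.CV_32F, 0, 1, ksize=1)
        mag = np.hypot(gx, gy)
        ang = np.arctan2(gy, gx)
        bins = ((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
        hist = np.zeros(n_bins, dtype=np.float64)
        np.add.at(hist, bins.ravel(), mag.ravel())   # magnitude-weighted, as in HOG
        feats.append(hist / (hist.sum() + 1e-8))
    return np.concatenate(feats)
```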
61 To compute the descriptors, we first estimate the homography with RANSAC using the feature matches extracted between two consecutive frames; matches on detected humans are removed. [sent-184, score-0.743]
62 The optical flow [8] is re-computed between the first and the warped second frame. [sent-186, score-0.464]
63 Motion descriptors (HOF and MBH) are computed on the warped optical flow. [sent-187, score-0.384]
64 We estimate the homography and warped optical flow for every two frames independently to avoid error propagation. [sent-189, score-0.769]
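A sketch of this per-frame-pair compensation, under the assumption that H maps points of the first frame to the second (the convention returned by the RANSAC fit sketched earlier); the second frame is warped into the first frame's coordinates and the flow is recomputed on the warped pair.

```python
import cv2

def compensated_flow(prev_gray, curr_gray, H):
    """Recompute Farneback flow between the first frame and the homography-warped
    second frame; the remaining flow is approximately free of camera motion."""
    h, w = prev_gray.shape[:2]
    # With H mapping prev -> curr, WARP_INVERSE_MAP makes dst(x) = curr(H(x)),
    # i.e. the second frame is brought back into the first frame's coordinates.
    warped = cv2.warpPerspective(curr_gray, H, (w, h),
                                 flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
    return cv2.calcOpticalFlowFarneback(prev_gray, warped, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```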
65 The Trajectory descriptor is also computed based on the motion vectors of the warped flow. [sent-191, score-0.45]
66 We further utilize these stabilized motion vectors to remove background trajectories. [sent-192, score-0.435]
67 If the displacement in the warped flow is smaller than a threshold (e.g., 1 pixel), the trajectory is considered to be consistent with camera motion, and thus removed. [sent-196, score-0.315]
68 In recent evaluations [5, 26], this shows an improved performance over bag of features for both image and action classification. [sent-207, score-0.334]
69 The UCF50 dataset [31] has 50 action categories, consisting of real-world videos taken from YouTube. [sent-236, score-0.294]
70 Experimental results. We first evaluate the gain due to different motion stabilization steps in section 4. [sent-244, score-0.299]
71 Section 4.3 evaluates the impact of removing inconsistent matches based on human detection. [sent-249, score-0.461]
72 Evaluation of improved dense trajectories. We choose the dense trajectories [40] as our baseline and apply RootSIFT normalization as described in section 3. [sent-254, score-0.851]
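The RootSIFT-style normalization referred to here is, in its standard form, an L1 normalization followed by an element-wise signed square root applied to each descriptor; the sketch below follows that common recipe, which may differ in detail from the paper's section 3.

```python
import numpy as np

def rootsift(desc, eps=1e-8):
    """L1-normalize a descriptor, then take the element-wise (signed) square root."""
    d = np.asarray(desc, dtype=np.float64)
    d = d / (np.abs(d).sum() + eps)
    return np.sign(d) * np.sqrt(np.abs(d))
```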
73 We consider two intermediate variants, “WarpFlow” and “RmTrack”, which stand for warping optical flow with the homography corresponding to the camera motion and removing background trajectories consistent with the homography, respectively. [sent-258, score-1.453]
74 The performance of the Trajectory descriptor is significantly improved, when camera motion is compensated for. [sent-283, score-0.521]
75 “Combined” further improves over “WarpFlow” as background trajectories are removed. [sent-289, score-0.404]
76 Since HOG is designed to capture static appearance information, we do not expect that compensating camera motion significantly improves its performance. [sent-291, score-0.43]
77 MBH is known for its robustness to camera motion [40]. [sent-300, score-0.385]
78 HOF represents zero-order motion information, whereas MBH focuses on first-order derivatives. [sent-304, score-0.307]
79 Feature encoding with BOF and FV. In this section, we evaluate the performance of our improved trajectories using different feature encoding methods. [sent-309, score-0.503]
80 We can observe a similar amount of improvement due to our motion stabilized descriptors when encoding them with bag of features (BOF) or Fisher vector (FV). [sent-311, score-0.581]
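A compact sketch of Fisher vector encoding with a diagonal-covariance GMM (first- and second-order statistics, followed by power and L2 normalization), using scikit-learn only to fit the GMM; the number of Gaussians and the descriptor set `training_descriptors` are placeholders, not the paper's exact configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descs, gmm):
    """Fisher vector of a set of local descriptors under a diagonal-covariance GMM."""
    X = np.atleast_2d(descs).astype(np.float64)
    N = X.shape[0]
    gamma = gmm.predict_proba(X)                               # (N, K) posteriors
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_    # diag covariances (K, D)
    diff = (X[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (gamma[:, :, None] * diff).sum(0) / (N * np.sqrt(w)[:, None])
    g_var = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                     # power normalization
    return fv / (np.linalg.norm(fv) + 1e-8)                    # L2 normalization

# gmm = GaussianMixture(n_components=256, covariance_type='diag').fit(training_descriptors)
```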
81 “DTF” stands for the original dense trajectory features [40] with RootSIFT normalization, whereas “ITF” are our improved trajectory features. [sent-323, score-0.499]
82 Removing inconsistent matches due to humans. In this section, we investigate the impact of removing inconsistent matches due to humans when estimating the homography, see Figure 4 for an illustration. [sent-345, score-0.917]
83 We consider three settings: estimating the homography without human detection, with automatic human detection, and with manual labeling of humans. [sent-348, score-0.47]
84 This allows us to measure the impact of removing matches from human regions as well as to determine an upper bound in case of a perfect human detector. [sent-349, score-0.444]
85 To limit the labeling effort, we annotated humans in 20 training and 20 testing videos for each action class from Hollywood2. [sent-350, score-0.445]
86 As shown in Table 3, human detection helps to improve all motion related descriptors (Trajectory, HOF and MBH), since removing inconsistent matches on humans improves the homography estimation. [sent-351, score-1.246]
87 It is always better to use human detection for homography estimation on these action datasets. [sent-355, score-0.614]
88 On Hollywood2, all presented results [14, 15, 23, 39] improve dense trajectories in different ways. [sent-368, score-0.383]
89 Dense-trajectory-based approaches [14, 15] seem to be very successful on HMDB51. [sent-385, score-0.302]
90 It contains significant camera motion, which results in a large number of trajectories in the background. [sent-390, score-0.442]
91 Conclusion. This paper improves dense trajectories by explicitly estimating camera motion. [sent-410, score-0.568]
92 We show that the performance can be significantly improved by removing background trajectories and warping optical flow with a robustly estimated homography approximating the camera motion. [sent-411, score-0.973]
93 Using a stateof-the-art human detector, potentially inconsistent matches can be removed during camera motion estimation, which makes it more robust. [sent-412, score-0.723]
94 Trajectorybased modeling of human actions with motion reference points. [sent-510, score-0.395]
95 HMDB: A large video database for human motion recognition. [sent-531, score-0.366]
96 Exploring weak stabilization for motion feature extraction. [sent-586, score-0.327]
97 A 3-dimensional SIFT descriptor and its application to action recognition. [sent-622, score-0.306]
98 Space-variant descriptor sampling for action recognition based on saliency and eye movements. [sent-656, score-0.345]
99 Dense trajectories and motion boundary descriptors for action recognition. [sent-664, score-0.839]
100 Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories. [sent-676, score-0.505]
wordName wordTfidf (topN-words)
[('mbh', 0.312), ('trajectories', 0.302), ('hof', 0.295), ('homography', 0.26), ('motion', 0.245), ('action', 0.203), ('optical', 0.193), ('trajectory', 0.175), ('flow', 0.169), ('olympic', 0.167), ('sports', 0.155), ('humans', 0.151), ('fisher', 0.141), ('camera', 0.14), ('matches', 0.137), ('inconsistent', 0.109), ('descriptor', 0.103), ('warped', 0.102), ('surf', 0.102), ('human', 0.092), ('bag', 0.092), ('videos', 0.091), ('descriptors', 0.089), ('removing', 0.087), ('rootsift', 0.086), ('dense', 0.081), ('vig', 0.079), ('warpflow', 0.079), ('remove', 0.072), ('itf', 0.07), ('encoding', 0.067), ('dtf', 0.065), ('movies', 0.062), ('stabilized', 0.061), ('actions', 0.058), ('background', 0.057), ('stabilization', 0.054), ('rmtrack', 0.053), ('uemura', 0.053), ('marsza', 0.05), ('prune', 0.048), ('solmaz', 0.047), ('normalization', 0.046), ('detector', 0.046), ('hog', 0.046), ('track', 0.045), ('improves', 0.045), ('frames', 0.045), ('dominate', 0.044), ('trinary', 0.043), ('jain', 0.043), ('youtube', 0.042), ('datasets', 0.041), ('mathe', 0.041), ('quantizes', 0.041), ('saliency', 0.039), ('improved', 0.039), ('atl', 0.039), ('interchange', 0.039), ('realistic', 0.038), ('fv', 0.037), ('impact', 0.036), ('blur', 0.036), ('reddy', 0.035), ('sadanand', 0.035), ('ek', 0.034), ('bof', 0.034), ('cor', 0.034), ('compensation', 0.034), ('row', 0.033), ('focuses', 0.033), ('gaidon', 0.033), ('compensated', 0.033), ('bounding', 0.032), ('detection', 0.031), ('splits', 0.031), ('failure', 0.03), ('occupy', 0.03), ('consecutive', 0.03), ('whereas', 0.029), ('boxes', 0.029), ('recommended', 0.029), ('around', 0.029), ('moving', 0.029), ('schmid', 0.029), ('video', 0.029), ('feature', 0.028), ('aser', 0.028), ('zoom', 0.028), ('jiang', 0.028), ('robustly', 0.028), ('estimation', 0.028), ('collected', 0.027), ('displacement', 0.027), ('improvement', 0.027), ('report', 0.026), ('improvements', 0.026), ('manual', 0.026), ('activity', 0.026), ('shi', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 39 iccv-2013-Action Recognition with Improved Trajectories
Author: Heng Wang, Cordelia Schmid
Abstract: Recently dense trajectories were shown to be an efficient video representation for action recognition and achieved state-of-the-art results on a variety of datasets. This paper improves their performance by taking into account camera motion to correct them. To estimate camera motion, we match feature points between frames using SURF descriptors and dense optical flow, which are shown to be complementary. These matches are, then, used to robustly estimate a homography with RANSAC. Human motion is in general different from camera motion and generates inconsistent matches. To improve the estimation, a human detector is employed to remove these matches. Given the estimated camera motion, we remove trajectories consistent with it. We also use this estimation to cancel out camera motion from the optical flow. This significantly improves motion-based descriptors, such as HOF and MBH. Experimental results on four challenging action datasets (i.e., Hollywood2, HMDB51, Olympic Sports and UCF50) significantly outperform the current state of the art.
2 0.39442655 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction
Author: Jiaming Guo, Zhuwen Li, Loong-Fah Cheong, Steven Zhiying Zhou
Abstract: Given a pair of videos having a common action, our goal is to simultaneously segment this pair of videos to extract this common action. As a preprocessing step, we first remove background trajectories by a motion-based figure-ground segmentation. To remove the remaining background and those extraneous actions, we propose the trajectory co-saliency measure, which captures the notion that trajectories recurring in all the videos should have their mutual saliency boosted. This requires a trajectory matching process which can compare trajectories with different lengths and not necessarily spatiotemporally aligned, and yet be discriminative enough despite significant intra-class variation in the common action. We further leverage the graph matching to enforce geometric coherence between regions so as to reduce feature ambiguity and matching errors. Finally, to classify the trajectories into common action and action outliers, we formulate the problem as a binary labeling of a Markov Random Field, in which the data term is measured by the trajectory co-saliency and the smoothness term is measured by the spatiotemporal consistency between trajectories. To evaluate the performance of our framework, we introduce a dataset containing clips that have animal actions as well as human actions. Experimental results show that the proposed method performs well in common action extraction.
3 0.35042214 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
Author: Dan Oneata, Jakob Verbeek, Cordelia Schmid
Abstract: Action recognition in uncontrolled video is an important and challenging computer vision problem. Recent progress in this area is due to new local features and models that capture spatio-temporal structure between local features, or human-object interactions. Instead of working towards more complex models, we focus on the low-level features and their encoding. We evaluate the use of Fisher vectors as an alternative to bag-of-word histograms to aggregate a small set of state-of-the-art low-level descriptors, in combination with linear classifiers. We present a large and varied set of evaluations, considering (i) classification of short actions in five datasets, (ii) localization of such actions in feature-length movies, and (iii) large-scale recognition of complex events. We find that for basic action recognition and localization MBH features alone are enough for state-of-the-art performance. For complex events we find that SIFT and MFCC features provide complementary cues. On all three problems we obtain state-of-the-art results, while using fewer features and less complex models.
4 0.28335354 297 iccv-2013-Online Motion Segmentation Using Dynamic Label Propagation
Author: Ali Elqursh, Ahmed Elgammal
Abstract: The vast majority of work on motion segmentation adopts the affine camera model due to its simplicity. Under the affine model, the motion segmentation problem becomes that of subspace separation. Due to this assumption, such methods are mainly offline and exhibit poor performance when the assumption is not satisfied. This is made evident in state-of-the-art methods that relax this assumption by using piecewise affine spaces and spectral clustering techniques to achieve better results. In this paper, we formulate the problem of motion segmentation as that of manifold separation. We then show how label propagation can be used in an online framework to achieve manifold separation. The performance of our framework is evaluated on a benchmark dataset and achieves competitive performance while being online.
5 0.27121779 361 iccv-2013-Robust Trajectory Clustering for Motion Segmentation
Author: Feng Shi, Zhong Zhou, Jiangjian Xiao, Wei Wu
Abstract: Due to occlusions and objects’ non-rigid deformation in the scene, the obtained motion trajectories from common trackers may contain a number of missing or mis-associated entries. To cluster such corrupted point based trajectories into multiple motions is still a hard problem. In this paper, we present an approach that exploits temporal and spatial characteristics from tracked points to facilitate segmentation of incomplete and corrupted trajectories, thereby obtain highly robust results against severe data missing and noises. Our method first uses the Discrete Cosine Transform (DCT) bases as a temporal smoothness constraint on trajectory projection to ensure the validity of resulting components to repair pathological trajectories. Then, based on an observation that the trajectories of foreground and background in a scene may have different spatial distributions, we propose a two-stage clustering strategy that first performs foreground-background separation then segments remaining foreground trajectories. We show that, with this new clustering strategy, sequences with complex motions can be accurately segmented by even using a simple translational model. Finally, a series of experiments on Hopkins 155 dataset and Berkeley motion segmentation dataset show the advantage of our method over other state-of-the-art motion segmentation algorithms in terms of both effectiveness and robustness.
6 0.25286156 68 iccv-2013-Camera Alignment Using Trajectory Intersections in Unsynchronized Videos
7 0.23120105 78 iccv-2013-Coherent Motion Segmentation in Moving Camera Videos Using Optical Flow Orientations
8 0.22415094 263 iccv-2013-Measuring Flow Complexity in Videos
9 0.22145967 105 iccv-2013-DeepFlow: Large Displacement Optical Flow with Deep Matching
10 0.21760286 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
11 0.2121131 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition
12 0.20360968 12 iccv-2013-A General Dense Image Matching Framework Combining Direct and Feature-Based Costs
13 0.20247003 317 iccv-2013-Piecewise Rigid Scene Flow
14 0.19382015 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition
15 0.19057888 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition
16 0.18580703 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
17 0.18147564 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
18 0.16983575 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos
19 0.1675227 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding
20 0.16600728 86 iccv-2013-Concurrent Action Detection with Structural Prediction
topicId topicWeight
[(0, 0.295), (1, 0.036), (2, 0.157), (3, 0.36), (4, -0.017), (5, 0.102), (6, 0.086), (7, 0.006), (8, 0.154), (9, 0.141), (10, 0.092), (11, 0.084), (12, 0.199), (13, -0.121), (14, 0.045), (15, -0.003), (16, 0.017), (17, 0.093), (18, 0.146), (19, 0.03), (20, -0.082), (21, -0.027), (22, 0.102), (23, 0.147), (24, 0.054), (25, 0.062), (26, 0.045), (27, 0.042), (28, 0.034), (29, -0.006), (30, 0.001), (31, -0.011), (32, -0.043), (33, 0.005), (34, -0.042), (35, -0.094), (36, 0.008), (37, -0.054), (38, -0.013), (39, 0.041), (40, -0.052), (41, -0.007), (42, 0.021), (43, 0.082), (44, 0.03), (45, -0.024), (46, -0.021), (47, -0.031), (48, 0.003), (49, -0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.97489744 39 iccv-2013-Action Recognition with Improved Trajectories
Author: Heng Wang, Cordelia Schmid
Abstract: Recently dense trajectories were shown to be an efficient video representation for action recognition and achieved state-of-the-art results on a variety of datasets. This paper improves their performance by taking into account camera motion to correct them. To estimate camera motion, we match feature points between frames using SURF descriptors and dense optical flow, which are shown to be complementary. These matches are, then, used to robustly estimate a homography with RANSAC. Human motion is in general different from camera motion and generates inconsistent matches. To improve the estimation, a human detector is employed to remove these matches. Given the estimated camera motion, we remove trajectories consistent with it. We also use this estimation to cancel out camera motion from the optical flow. This significantly improves motion-based descriptors, such as HOF and MBH. Experimental results on four challenging action datasets (i.e., Hollywood2, HMDB51, Olympic Sports and UCF50) significantly outperform the current state of the art.
2 0.85269612 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction
Author: Jiaming Guo, Zhuwen Li, Loong-Fah Cheong, Steven Zhiying Zhou
Abstract: Given a pair of videos having a common action, our goal is to simultaneously segment this pair of videos to extract this common action. As a preprocessing step, we first remove background trajectories by a motion-based figure-ground segmentation. To remove the remaining background and those extraneous actions, we propose the trajectory co-saliency measure, which captures the notion that trajectories recurring in all the videos should have their mutual saliency boosted. This requires a trajectory matching process which can compare trajectories with different lengths and not necessarily spatiotemporally aligned, and yet be discriminative enough despite significant intra-class variation in the common action. We further leverage the graph matching to enforce geometric coherence between regions so as to reduce feature ambiguity and matching errors. Finally, to classify the trajectories into common action and action outliers, we formulate the problem as a binary labeling of a Markov Random Field, in which the data term is measured by the trajectory co-saliency and the smoothness term is measured by the spatiotemporal consistency between trajectories. To evaluate the performance of our framework, we introduce a dataset containing clips that have animal actions as well as human actions. Experimental results show that the proposed method performs well in common action extraction.
3 0.83333808 263 iccv-2013-Measuring Flow Complexity in Videos
Author: Saad Ali
Abstract: In this paper a notion of flow complexity that measures the amount of interaction among objects is introduced and an approach to compute it directly from a video sequence is proposed. The approach employs particle trajectories as the input representation of motion and maps it into a ‘braid’ based representation. The mapping is based on the observation that 2D trajectories of particles take the form of a braid in space-time due to the intermingling among particles over time. As a result of this mapping, the problem of estimating the flow complexity from particle trajectories becomes the problem of estimating braid complexity, which in turn can be computed by measuring the topological entropy of a braid. For this purpose recently developed mathematical tools from braid theory are employed which allow rapid computation of topological entropy of braids. The approach is evaluated on a dataset consisting of open source videos depicting variations in terms of types of moving objects, scene layout, camera view angle, motion patterns, and object densities. The results show that the proposed approach is able to quantify the complexity of the flow, and at the same time provides useful insights about the sources of the complexity.
4 0.75088334 260 iccv-2013-Manipulation Pattern Discovery: A Nonparametric Bayesian Approach
Author: Bingbing Ni, Pierre Moulin
Abstract: We aim to unsupervisedly discover human’s action (motion) patterns of manipulating various objects in scenarios such as assisted living. We are motivated by two key observations. First, large variation exists in motion patterns associated with various types of objects being manipulated, thus manually defining motion primitives is infeasible. Second, some motion patterns are shared among different objects being manipulated while others are object specific. We therefore propose a nonparametric Bayesian method that adopts a hierarchical Dirichlet process prior to learn representative manipulation (motion) patterns in an unsupervised manner. Taking easy-to-obtain object detection score maps and dense motion trajectories as inputs, the proposed probabilistic model can discover motion pattern groups associated with different types of objects being manipulated with a shared manipulation pattern dictionary. The size of the learned dictionary is automatically inferred. Comprehensive experiments on two assisted living benchmarks and a cooking motion dataset demonstrate superiority of our learned manipulation pattern dictionary in representing manipulation actions for recognition.
5 0.72809923 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition
Author: Limin Wang, Yu Qiao, Xiaoou Tang
Abstract: This paper proposes motion atom and phrase as a mid-level temporal “part” for representing and classifying complex action. Motion atom is defined as an atomic part of action, and captures the motion information of action video in a short temporal scale. Motion phrase is a temporal composite of multiple motion atoms with an AND/OR structure, which further enhances the discriminative ability of motion atoms by incorporating temporal constraints in a longer scale. Specifically, given a set of weakly labeled action videos, we firstly design a discriminative clustering method to automatically discover a set of representative motion atoms. Then, based on these motion atoms, we mine effective motion phrases with high discriminative and representative power. We introduce a bottom-up phrase construction algorithm and a greedy selection method for this mining task. We examine the classification performance of the motion atom and phrase based representation on two complex action datasets: Olympic Sports and UCF50. Experimental results show that our method achieves superior performance over recent published methods on both datasets.
6 0.69821316 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
7 0.69493765 68 iccv-2013-Camera Alignment Using Trajectory Intersections in Unsynchronized Videos
8 0.69457275 361 iccv-2013-Robust Trajectory Clustering for Motion Segmentation
9 0.66238308 297 iccv-2013-Online Motion Segmentation Using Dynamic Label Propagation
10 0.63367146 78 iccv-2013-Coherent Motion Segmentation in Moving Camera Videos Using Optical Flow Orientations
11 0.62144047 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding
12 0.6152932 226 iccv-2013-Joint Subspace Stabilization for Stereoscopic Video
13 0.58938426 145 iccv-2013-Estimating the Material Properties of Fabric from Video
14 0.58875853 216 iccv-2013-Inferring "Dark Matter" and "Dark Energy" from Videos
15 0.58216089 301 iccv-2013-Optimal Orthogonal Basis and Image Assimilation: Motion Modeling
16 0.57062614 105 iccv-2013-DeepFlow: Large Displacement Optical Flow with Deep Matching
17 0.55190939 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition
18 0.52492583 317 iccv-2013-Piecewise Rigid Scene Flow
19 0.52278084 12 iccv-2013-A General Dense Image Matching Framework Combining Direct and Feature-Based Costs
20 0.52166992 256 iccv-2013-Locally Affine Sparse-to-Dense Matching for Motion and Occlusion Estimation
topicId topicWeight
[(2, 0.045), (7, 0.011), (13, 0.017), (26, 0.042), (31, 0.018), (42, 0.043), (64, 0.078), (73, 0.021), (89, 0.629)]
simIndex simValue paperId paperTitle
same-paper 1 0.99732405 39 iccv-2013-Action Recognition with Improved Trajectories
Author: Heng Wang, Cordelia Schmid
Abstract: Recently dense trajectories were shown to be an efficient video representation for action recognition and achieved state-of-the-art results on a variety of datasets. This paper improves their performance by taking into account camera motion to correct them. To estimate camera motion, we match feature points between frames using SURF descriptors and dense optical flow, which are shown to be complementary. These matches are, then, used to robustly estimate a homography with RANSAC. Human motion is in general different from camera motion and generates inconsistent matches. To improve the estimation, a human detector is employed to remove these matches. Given the estimated camera motion, we remove trajectories consistent with it. We also use this estimation to cancel out camera motion from the optical flow. This significantly improves motion-based descriptors, such as HOF and MBH. Experimental results on four challenging action datasets (i.e., Hollywood2, HMDB51, Olympic Sports and UCF50) significantly outperform the current state of the art.
2 0.99419487 81 iccv-2013-Combining the Right Features for Complex Event Recognition
Author: Kevin Tang, Bangpeng Yao, Li Fei-Fei, Daphne Koller
Abstract: In this paper, we tackle the problem of combining features extracted from video for complex event recognition. Feature combination is an especially relevant task in video data, as there are many features we can extract, ranging from image features computed from individual frames to video features that take temporal information into account. To combine features effectively, we propose a method that is able to be selective of different subsets of features, as some features or feature combinations may be uninformative for certain classes. We introduce a hierarchical method for combining features based on the AND/OR graph structure, where nodes in the graph represent combinations of different sets of features. Our method automatically learns the structure of the AND/OR graph using score-based structure learning, and we introduce an inference procedure that is able to efficiently compute structure scores. We present promising results and analysis on the difficult and large-scale 2011 TRECVID Multimedia Event Detection dataset [17].
3 0.992827 139 iccv-2013-Elastic Fragments for Dense Scene Reconstruction
Author: Qian-Yi Zhou, Stephen Miller, Vladlen Koltun
Abstract: We present an approach to reconstruction of detailed scene geometry from range video. Range data produced by commodity handheld cameras suffers from high-frequency errors and low-frequency distortion. Our approach deals with both sources of error by reconstructing locally smooth scene fragments and letting these fragments deform in order to align to each other. We develop a volumetric registration formulation that leverages the smoothness of the deformation to make optimization practical for large scenes. Experimental results demonstrate that our approach substantially increases the fidelity of complex scene geometry reconstructed with commodity handheld cameras.
4 0.99014038 216 iccv-2013-Inferring "Dark Matter" and "Dark Energy" from Videos
Author: Dan Xie, Sinisa Todorovic, Song-Chun Zhu
Abstract: This paper presents an approach to localizing functional objects in surveillance videos without domain knowledge about semantic object classes that may appear in the scene. Functional objects do not have discriminative appearance and shape, but they affect behavior of people in the scene. For example, they “attract” people to approach them for satisfying certain needs (e.g., vending machines could quench thirst), or “repel” people to avoid them (e.g., grass lawns). Therefore, functional objects can be viewed as “dark matter”, emanating “dark energy” that affects people’s trajectories in the video. To detect “dark matter” and infer their “dark energy” field, we extend the Lagrangian mechanics. People are treated as particle-agents with latent intents to approach “dark matter” and thus satisfy their needs, where their motions are subject to a composite “dark energy” field of all functional objects in the scene. We make the assumption that people take globally optimal paths toward the intended “dark matter” while avoiding latent obstacles. A Bayesian framework is used to probabilistically model: people’s trajectories and intents, constraint map of the scene, and locations of functional objects. A data-driven Markov Chain Monte Carlo (MCMC) process is used for inference. Our evaluation on videos of public squares and courtyards demonstrates our effectiveness in localizing functional objects and predicting people’s trajectories in unobserved parts of the video footage.
5 0.98602808 103 iccv-2013-Deblurring by Example Using Dense Correspondence
Author: Yoav Hacohen, Eli Shechtman, Dani Lischinski
Abstract: This paper presents a new method for deblurring photos using a sharp reference example that contains some shared content with the blurry photo. Most previous deblurring methods that exploit information from other photos require an accurately registered photo of the same static scene. In contrast, our method aims to exploit reference images where the shared content may have undergone substantial photometric and non-rigid geometric transformations, as these are the kind of reference images most likely to be found in personal photo albums. Our approach builds upon a recent method for example-based deblurring using non-rigid dense correspondence (NRDC) [11] and extends it in two ways. First, we suggest exploiting information from the reference image not only for blur kernel estimation, but also as a powerful local prior for the non-blind deconvolution step. Second, we introduce a simple yet robust technique for spatially varying blur estimation, rather than assuming spatially uniform blur. Unlike the above previous method, which has proven successful only with simple deblurring scenarios, we demonstrate that our method succeeds on a variety of real-world examples. We provide quantitative and qualitative evaluation of our method and show that it outperforms the state-of-the-art.
6 0.98340243 56 iccv-2013-Automatic Registration of RGB-D Scans via Salient Directions
7 0.98239964 302 iccv-2013-Optimization Problems for Fast AAM Fitting in-the-Wild
8 0.98014098 337 iccv-2013-Random Grids: Fast Approximate Nearest Neighbors and Range Searching for Image Search
9 0.97595012 2 iccv-2013-3D Scene Understanding by Voxel-CRF
10 0.97193635 390 iccv-2013-Shufflets: Shared Mid-level Parts for Fast Object Detection
11 0.97020501 319 iccv-2013-Point-Based 3D Reconstruction of Thin Objects
12 0.95931733 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
13 0.95722842 129 iccv-2013-Dynamic Scene Deblurring
14 0.95227331 228 iccv-2013-Large-Scale Multi-resolution Surface Reconstruction from RGB-D Sequences
15 0.95222366 317 iccv-2013-Piecewise Rigid Scene Flow
16 0.95083696 343 iccv-2013-Real-World Normal Map Capture for Nearly Flat Reflective Surfaces
17 0.94529617 226 iccv-2013-Joint Subspace Stabilization for Stereoscopic Video
18 0.94444764 9 iccv-2013-A Flexible Scene Representation for 3D Reconstruction Using an RGB-D Camera
19 0.94337052 256 iccv-2013-Locally Affine Sparse-to-Dense Matching for Motion and Occlusion Estimation
20 0.94321531 42 iccv-2013-Active MAP Inference in CRFs for Efficient Semantic Segmentation