cvpr cvpr2013 cvpr2013-10 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Deqing Sun, Jonas Wulff, Erik B. Sudderth, Hanspeter Pfister, Michael J. Black
Abstract: Layered models allow scene segmentation and motion estimation to be formulated together and to inform one another. Traditional layered motion methods, however, employ fairly weak models of scene structure, relying on locally connected Ising/Potts models which have limited ability to capture long-range correlations in natural scenes. To address this, we formulate a fully-connected layered model that enables global reasoning about the complicated segmentations of real objects. Optimization with fully-connected graphical models is challenging, and our inference algorithm leverages recent work on efficient mean field updates for fully-connected conditional random fields. These methods can be implemented efficiently using high-dimensional Gaussian filtering. We combine these ideas with a layered flow model, and find that the long-range connections greatly improve segmentation into figure-ground layers when compared with locally connected MRF models. Experiments on several benchmark datasets show that the method can recover fine structures and large occlusion regions, with good flow accuracy and much lower computational cost than previous locally-connected layered models.
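The mean field updates for fully-connected CRFs mentioned in the abstract can be sketched by brute force on a handful of pixels. The Python sketch below is an illustration under stated assumptions, not the authors' implementation: it builds dense pairwise weights as a mixture of an appearance-and-position Gaussian and a position-only Gaussian, then runs damped parallel mean field updates for a binary label. The function names, the bandwidths `sigma_i`, `sigma_a`, `sigma_s`, the mixing weight `eta`, the damping factor, and the unary costs are all hypothetical, and the O(N^2) matrix product stands in for the high-dimensional Gaussian filtering that the paper relies on for efficiency.

```python
import numpy as np


def dense_weights(intensity, coords, eta=0.5, sigma_i=0.1, sigma_a=3.0, sigma_s=1.0):
    """Dense weights w = eta * G1(appearance, position) + (1 - eta) * G2(position).

    The isotropic Gaussian form and all bandwidths are assumptions for illustration.
    """
    d_pos = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    d_int = (intensity[:, None] - intensity[None, :]) ** 2
    g1 = np.exp(-d_int / (2 * sigma_i**2) - d_pos / (2 * sigma_a**2))
    g2 = np.exp(-d_pos / (2 * sigma_s**2))
    w = eta * g1 + (1 - eta) * g2
    np.fill_diagonal(w, 0.0)  # a pixel is not connected to itself
    return w


def _normalize(logits):
    """Row-wise softmax, shifted for numerical stability."""
    logits = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)


def mean_field(unary, w, n_iters=50, damping=0.5):
    """Damped parallel mean field for a binary, fully connected Potts-style model.

    Assumed energy: sum_p unary[p, g_p] + sum_{p<q} w[p, q] * [g_p != g_q].
    """
    q = _normalize(-unary)
    for _ in range(n_iters):
        # msg[p, l] = sum_q w[p, q] * Q_q(1 - l): expected disagreement cost
        msg = w @ q[:, ::-1]
        new_q = _normalize(-unary - msg)
        # Damping: blend old and new marginals to reduce oscillation
        q = damping * q + (1 - damping) * new_q
    return q


# Toy example: six pixels in a row, dark on the left and bright on the right.
coords = np.array([[0.0, float(x)] for x in range(6)])
intensity = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
unary = np.array([[0.0, 2.0]] * 3 + [[2.0, 0.0]] * 3)  # hypothetical data costs

w = dense_weights(intensity, coords)
q = mean_field(unary, w)
print(q.argmax(axis=1))
```

Setting `damping=0.0` gives the plain parallel update; the blend with the previous marginals is one simple way to damp it, chosen here only for illustration.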
Reference: text
sentIndex sentText sentNum sentScore
1 Affiliations: (1) Harvard University, (2) MPI for Intelligent Systems, (3) Brown University. Figure 1: (a) First frame of video. (b) Flow field [27]. (c) Segmentation [27]. (d) Flow field, proposed. (e) Segmentation, proposed. [sent-3, score-0.148]
2 The proposed fully-connected layered model can recover fine structures better than a locally connected layered model [27]. [sent-4, score-1.083]
3 Abstract Layered models allow scene segmentation and motion estimation to be formulated together and to inform one another. [sent-5, score-0.257]
4 Traditional layered motion methods, however, employ fairly weak models of scene structure, relying on locally connected Ising/Potts models which have limited ability to capture long-range correlations in natural scenes. [sent-6, score-0.75]
5 To address this, we formulate a fully-connected layered model that enables global reasoning about the complicated segmentations of real objects. [sent-7, score-0.503]
6 Optimization with fully-connected graphical models is challenging, and our inference algorithm leverages recent work on efficient mean field updates for fully-connected conditional random fields. [sent-8, score-0.201]
7 We combine these ideas with a layered flow model, and find that the long-range connections greatly improve segmentation into figure-ground layers when compared with locally connected MRF models. [sent-10, score-1.104]
8 Experiments on several benchmark datasets show that the method can recover fine structures and large occlusion regions, with good flow accuracy and much lower computational cost than previous locally-connected layered models. [sent-11, score-0.855]
9 Introduction Layered models [8, 12, 30] are promising for motion analysis, particularly for handling occlusion and capturing temporally consistent scene structure [27]. [sent-13, score-0.189]
10 This allows the integration of image-based information with flow information to arrive at a good segmentation of the scene, which is propagated over time using motion cues. [sent-15, score-0.534]
11 While popular in many vision tasks, including layered models [31], such local dependencies have limited modeling power. [sent-18, score-0.48]
12 For example, in the “Hand” sequence of Figure 1, ambiguous local motion and boundary cues cause locally connected layered models [27] to merge the background between the fingers into the foreground (Figure 1(b)). [sent-19, score-0.9]
13 By linking the narrow background regions between the fingers to other distant background regions, it becomes far easier to correctly segment foreground objects. [sent-21, score-0.229]
14 Fortunately, Krähenbühl and Koltun [15] recently showed that mean field approximation algorithms can efficiently optimize densely connected CRF models for static image segmentation. [sent-24, score-0.264]
15 Recent work applies their optimization scheme to optical flow [16] by directly modeling the flow field with a densely connected CRF. [sent-26, score-0.975]
16 For optical flow estimation, we argue that it is more powerful to utilize a fully-connected prior for layer segmentation. [sent-27, score-0.551]
17 To that end, we formulate a new model that combines recent work on layered flow estimation with algorithms for static image segmentation with fully-connected models. [sent-28, score-0.925]
18 Because flow fields are not directly observed, optimization for fully-connected layered models is more challenging than for static image segmentation models, and we propose several innovations to improve speed and accuracy. [sent-32, score-0.97]
19 In spite of modeling additional dependencies, our resulting mean field method is more efficient than previous locally connected formulations [27]. [sent-33, score-0.267]
20 Related Work Several previous and current lines of research intersect our theme, including figure-ground segmentation, video segmentation, and layered optical flow estimation. [sent-36, score-0.882]
21 Here we focus on layered models, and specifically ones that combine motion estimation and segmentation. [sent-38, score-0.569]
22 While our 2-layer method is related to figure-ground segmentation, previous work in that area typically uses motion only as a cue to segmentation [7], treats it as an observation [2], and does not attempt to accurately estimate optical flow. [sent-39, score-0.289]
23 In contrast, our work attempts to simultaneously estimate accurate flow and solve for a figure-ground segmentation that gives good flow estimates. [sent-40, score-0.784]
24 Recent work by Ochs and Brox [22] addresses motion segmentation using point trajectories. [sent-43, score-0.195]
25 They show that higher-order tuples (3-affinities) and segmentation using spectral clustering can produce nice segmentations of scenes with several moving objects. [sent-44, score-0.16]
26 Unlike our work, they do not address the dense flow estimation problem, and consequently do not test on competitive flow benchmarks. [sent-45, score-0.709]
27 Our work is directly descended from layered models of optical flow [12, 30, 31]. [sent-49, score-0.913]
28 Several methods extract layered models of the scene as well as layer movement, using parametric models of the motion of each layer. [sent-50, score-0.718]
29 Accurate flow estimation and segmentation requires richer models of flow within layers that go beyond simple parametric transformations. [sent-54, score-0.906]
30 Previous methods [27, 31] allow the flow to vary smoothly or discontinuously within layers. [sent-55, score-0.339]
31 Recent work [27] shows that such models can achieve good flow and segmentation accuracy, albeit with high computational cost. [sent-56, score-0.476]
32 A limitation of these previous methods is that the spatial variation in the flow is modeled by a local (typically pairwise, Ising or Potts) MRF. [sent-59, score-0.339]
33 Krähenbühl and Koltun [15] describe a mean-field approximate inference scheme for fully connected CRF models. [sent-66, score-0.229]
34 They show that the spatial message passing step can be efficiently approximated by high-dimensional filtering [1]. [sent-67, score-0.159]
35 Zhang and Chen [33] independently suggest a quadratic programming relaxation for fully connected CRFs, and use bilateral filtering to perform gradient descent. [sent-68, score-0.28]
36 Our main contribution is to extend these fully-connected inference methods to layered models for optical flow estimation and segmentation. [sent-69, score-0.997]
37 Fully Connected Modeling and Inference We first formulate our fully connected layered model, and then describe a variational expectation maximization (EM) inference algorithm. [sent-71, score-0.71]
38 We use the terms foreground and background loosely; the foreground layer is one that contains regions occluding the background. [sent-75, score-0.364]
39 Each layer k has its own flow field $(u_{tk}, v_{tk})$. [sent-79, score-0.544]
40 We use a semi-parametric flow model [27] that biases the flow within each layer to be piecewise smooth, and roughly similar to a global affine motion. [sent-80, score-0.823]
41 For the horizontal flow field, $u_{tk}$, we define the spatial energy term $E_{\mathrm{mrf}}(u_{tk}, \theta_{tk}) = \ldots$ [sent-81, score-0.366]
42 The energy function for $v_{tk}$, the vertical flow field, is defined similarly. [sent-86, score-0.366]
43 We use a binary mask $g_t$ to model the foreground support at frame t. [sent-87, score-0.23]
44 As shown in Figure 2, we model the binary mask spatially as a fully connected CRF and define the spatial energy term $E_{\mathrm{space}}(g_t) = \ldots$ [sent-89, score-0.236]
45 … (2) where a pixel is fully connected to all other pixels at the current frame, $\delta(x)$ is 1 if $x$ is true and 0 otherwise, and the weight $w_{qp}$ is defined as $w_{qp} = \eta\, G_1(I_t^p - I_t^q,\ p - q) + (1 - \eta)\, G_2(p - q) = \eta \exp(\ldots)$ (3) [sent-95, score-0.27]
46 For our layered flow model, they further lead to inaccurate flow estimates; removing these isolated regions significantly reduces outliers in our final results. [sent-105, score-1.127]
47 The binary layer support masks evolve over time accord- mask defining foreground layer support. [sent-106, score-0.396]
48 The center pixel (red) is spatially fully connected to all other pixels at the current frame. [sent-107, score-0.176]
49 The center pixel is also temporally connected to two temporal neighbors (green), as determined by the foreground and background flow vectors. [sent-108, score-0.762]
50 ing to the flow field of the foreground layer, Etime(gt,gt+1,ut1,vt1) = ? [sent-110, score-0.525]
51 … $(p, q) \in E_{t1}$, where $E_{t1} = \{(p, q) : q = p + (u_{t1}^p, v_{t1}^p)\}$ contains all temporal neighbors linked by the foreground flow field. [sent-113, score-0.222]
52 2, we handle subpixel motion by bilinear interpolation of the temporal neighbors. [sent-116, score-0.21]
53 The layer support mask provides a segmentation of the video sequence: a pixel p belongs to the foreground layer if $g_t^p = 1$, and to the background otherwise. [sent-117, score-0.895]
54 The foreground term is only “on” when a pixel and its successor at the next frame are both visible, $g_t^p = g_{t+1}^q = 1$. [sent-134, score-0.533]
55 The background term is active when $g_t^p = g_{t+1}^q = 0$. [sent-135, score-0.421]
56 Note that occluded states are less likely than flow vectors whose robust matching costs are smaller than λD. [sent-137, score-0.339]
57 The energy function is proportional to the negative log probability of the joint distribution of the binary masks and flow fields P(g, u, v, θ | I). [sent-147, score-0.439]
58 Inference We use a variational EM algorithm [9], maximizing the posterior probability of the hidden flow fields while approximately marginalizing over possible layer support masks: $\max_{u,v,\theta} \log P(u, v, \theta \mid I) = \max_{u,v,\theta} \log \sum_{g} P(g, u, v, \theta \mid I)$ [sent-150, score-0.534]
59 1 summarizes an inference algorithm based on a mean field message update schedule. [sent-165, score-0.229]
60 Message passing from next frame: $\tilde{Q}_t^p \leftarrow \tilde{Q}_t^p + \lambda \ldots$ [sent-201, score-0.186]
61 Each pixel p has two temporal neighbors q at the next frame, determined by the motion of the foreground and the background layers. [sent-211, score-0.359]
62 A pixel, p, may have several temporal neighbors, q, at the previous frame, so that its update depends on marginals $\{Q_{t-1}^q : p = q + (u_{t-1,k}^q, v_{t-1,k}^q),\ k = 1, 2\}$. [sent-215, score-0.154]
63 To implement spatial message passing via high-dimensional filtering, we must update the node marginals within a frame simultaneously and in parallel [15]. [sent-218, score-0.214]
64 While mean field methods are guaranteed to converge when marginals are updated sequentially [9], they may oscillate with parallel updates as demonstrated in Figure 5. [sent-219, score-0.213]
65 We suspect this is a greater problem for our flow model, where likelihoods are more ambiguous than for … $Q_t^p(g_t^p) = \frac{1}{Z_t^p} \exp(\ldots)$ [sent-220, score-0.339]
66 With strong temporal dependence, it is difficult to deviate from our (temporally consistent) initialization; a weak temporal term allows the algorithm to escape local optima via likelihood cues. [sent-237, score-0.237]
67 Third, we apply median filtering to the approximate distribution Q whenever the temporal weight $\lambda_c$ changes. [sent-239, score-0.192]
68 This median filtering step helps reduce speckles caused by outliers in the data matching term, and results in better local optima as measured by the K-L divergence between the approximate and true distributions. [sent-240, score-0.296]
69 We interleave mean field updates to the layer support distributions with refinement of the foreground and background flow fields. [sent-242, score-0.721]
70 Gradient-based optimization is similar to single-layer affine-biased flow estimation, except that likelihoods are weighted by the inferred layer supports. [sent-243, score-0.457]
71 To avoid local optima, we initialize via a FlowFusion [20] step which combines the current flow estimate and the affine flow field of each layer. [sent-244, score-0.765]
72 Following [26], we first compute an initial flow estimate using Classic+NL [25], and cluster the flow vectors into 2 groups. [sent-250, score-0.678]
73 We use the synthetic examples from [27] to compare the effect of using graph cuts for local models and a mean field approximation for fully connected models. [sent-253, score-0.372]
74 The fully connected model recovers from the poor initialization. [sent-255, score-0.176]
75 We perform a FlowFusion step to obtain stronger local minima, with some increase in com… Figure 3: (a) First frame of video. (b) Initial flow by Classic+NL. (c) Local model optimized by graph cuts. (d) Fully connected model optimized by mean field. [sent-257, score-0.165]
76 The fully connected model can more accurately infer the layer ownership by using global information. [sent-260, score-0.294]
77 As shown in Tables 1 and 2, FlowFusion improves the flow estimates on some sequences. [sent-265, score-0.339]
78 Figure 6 shows that adjusting the weight on the temporal term and applying median filtering each help the mean field algorithm converge. [sent-272, score-0.354]
79 The parallel mean field algorithm (blue circle) fails to converge for the fully connected layered model, while damping (red dot) helps the algorithm to converge to a better local minimum. [sent-276, score-0.934]
80 (better viewed on a computer screen) ally, the segmentation has fewer speckles with the adjusting and filtering. [sent-280, score-0.168]
81 We perform a bootstrap statistical significance test of the flow estimation results on the Middlebury training and test set for the algorithm with and without the FlowFusion step. [sent-286, score-0.37]
82 expect future multi-layer formulations to further improve the performance of fully-connected models of layer support. [sent-292, score-0.149]
83 However, there are also failure cases that reveal the limitation of the fully-connected layered model. [sent-298, score-0.449]
84 The estimated flow fields are visually close to the ground truth. [sent-309, score-0.384]
85 The proposed FC-2Layers and FC-2Layers-FF methods improve over the single layered Classic+NL, while obtaining performance close to a multi-layered formulation. [sent-313, score-0.449]
86 Average end-point error (EPE) on the Middlebury optical flow benchmark test set. [sent-325, score-0.433]
87 The two-layer formulation of the fully-connected layered model achieves performance comparable to a multi-layer local model (Layers++). [sent-326, score-0.449]
88 Fast version uses a fast but less accurate version to compute the initial flow field, which results in slight loss in performance. [sent-327, score-0.339]
89 The core algorithm for mean field inference takes about 5 minutes. [sent-357, score-0.17]
90 The remaining time is largely spent on computing the initial flow fields with Classic+NL [25] in MATLAB. [sent-358, score-0.384]
91 The fully connected layered model can recover the fine structures in the scene, such as the background holes in “Car3” (right column). [sent-362, score-0.708]
92 Further speedup is achievable by using C++ and a GPU flow implementation. [sent-364, score-0.339]
93 Note that this is already much faster than previous locally connected layered models, which take 5 hours to process 2 frames [26] or more than 10 hours for 4 frames [27]. [sent-365, score-0.483]
94 Conclusion We have formulated a fully-connected layered model that captures long-range correlations in natural scenes for joint motion segmentation and estimation. [sent-367, score-0.644]
95 The proposed algorithm achieves competitive results on the Middlebury and MPI Sintel optical flow benchmark and produces reliable results on a variety of other sequences. [sent-372, score-0.433]
96 Our work extends previous work on fully-connected models for joint motion segmentation and estimation, and also suggests that layered models can be a rich and flexible representation for natural scenes. [sent-373, score-0.706]
97 A naturalistic open source movie for optical flow evaluation. [sent-417, score-0.433]
98 Efficient inference in fully connected CRFs with Gaussian edge potentials. [sent-468, score-0.26]
99 Layered image motion with explicit occlusions, temporal consistency, and depth ordering. [sent-541, score-0.179]
100 Joint motion estimation and segmentation of complex scenes with label costs and occlusion modeling. [sent-568, score-0.258]
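Sentence 86 above reports average end-point error (EPE) on the Middlebury benchmark. As a reminder of what that metric measures, here is a minimal Python sketch: EPE is the Euclidean distance between the estimated and ground-truth flow vectors at each pixel, averaged over pixels. The function name and the optional `valid` mask are assumptions for illustration; benchmark-specific evaluation rules (for example, which pixels are counted) are not reproduced here.

```python
import numpy as np


def average_epe(u_est, v_est, u_gt, v_gt, valid=None):
    """Average end-point error between an estimated and a ground-truth flow field.

    `valid` is a hypothetical boolean mask selecting the pixels to evaluate;
    when omitted, every pixel contributes.
    """
    err = np.sqrt((u_est - u_gt) ** 2 + (v_est - v_gt) ** 2)
    if valid is None:
        return float(err.mean())
    return float(err[valid].mean())


# Toy example: every estimated vector is (3, 4) and the ground truth is zero,
# so each per-pixel error is the length of a 3-4-5 triangle's hypotenuse.
u_est = np.full((2, 2), 3.0)
v_est = np.full((2, 2), 4.0)
u_gt = np.zeros((2, 2))
v_gt = np.zeros((2, 2))

epe_all = average_epe(u_est, v_est, u_gt, v_gt)

# Restrict the evaluation to the top row only.
mask = np.array([[True, True], [False, False]])
epe_masked = average_epe(u_est, v_est, u_gt, v_gt, valid=mask)
print(epe_all, epe_masked)
```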
wordName wordTfidf (topN-words)
[('layered', 0.449), ('gtp', 0.373), ('flow', 0.339), ('flowfusion', 0.256), ('gtq', 0.21), ('middlebury', 0.125), ('layer', 0.118), ('emrf', 0.116), ('qtp', 0.116), ('connected', 0.116), ('segmentation', 0.106), ('foreground', 0.099), ('tp', 0.095), ('optical', 0.094), ('ahenbu', 0.093), ('itp', 0.093), ('temporal', 0.09), ('motion', 0.089), ('field', 0.087), ('damping', 0.083), ('epe', 0.083), ('utk', 0.083), ('nl', 0.078), ('aff', 0.072), ('unmatched', 0.072), ('qprev', 0.07), ('utpk', 0.07), ('vtk', 0.07), ('filtering', 0.07), ('mpi', 0.066), ('marginals', 0.064), ('sintel', 0.062), ('speckles', 0.062), ('frame', 0.061), ('fully', 0.06), ('itq', 0.06), ('layers', 0.06), ('message', 0.059), ('optima', 0.057), ('classic', 0.055), ('segmentations', 0.054), ('inference', 0.053), ('tq', 0.052), ('pages', 0.05), ('cuts', 0.048), ('frey', 0.048), ('background', 0.048), ('bespace', 0.047), ('dkata', 0.047), ('etk', 0.047), ('etkeq', 0.047), ('qqt', 0.047), ('shaman', 0.047), ('wqp', 0.047), ('adjustment', 0.045), ('fields', 0.045), ('helps', 0.045), ('hl', 0.042), ('koltun', 0.041), ('unger', 0.041), ('mrf', 0.041), ('kr', 0.038), ('permutohedral', 0.038), ('tk', 0.038), ('gt', 0.037), ('temporally', 0.037), ('potentials', 0.036), ('gq', 0.036), ('fine', 0.035), ('locally', 0.034), ('edata', 0.034), ('fingers', 0.034), ('ipt', 0.034), ('qtq', 0.034), ('bilateral', 0.034), ('neighbors', 0.033), ('ochs', 0.033), ('mask', 0.033), ('occlusions', 0.032), ('variational', 0.032), ('median', 0.032), ('roth', 0.032), ('jojic', 0.032), ('org', 0.032), ('converge', 0.032), ('occlusion', 0.032), ('estimation', 0.031), ('crfs', 0.031), ('bp', 0.031), ('gc', 0.031), ('subpixel', 0.031), ('models', 0.031), ('passing', 0.03), ('mean', 0.03), ('nonlocal', 0.03), ('sudderth', 0.03), ('divergence', 0.03), ('masks', 0.028), ('energy', 0.027), ('piecewise', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 10 cvpr-2013-A Fully-Connected Layered Model of Foreground and Background Flow
Author: Deqing Sun, Jonas Wulff, Erik B. Sudderth, Hanspeter Pfister, Michael J. Black
Abstract: Layered models allow scene segmentation and motion estimation to be formulated together and to inform one another. Traditional layered motion methods, however, employ fairly weak models of scene structure, relying on locally connected Ising/Potts models which have limited ability to capture long-range correlations in natural scenes. To address this, we formulate a fully-connected layered model that enables global reasoning about the complicated segmentations of real objects. Optimization with fully-connected graphical models is challenging, and our inference algorithm leverages recent work on efficient mean field updates for fully-connected conditional random fields. These methods can be implemented efficiently using high-dimensional Gaussian filtering. We combine these ideas with a layered flow model, and find that the long-range connections greatly improve segmentation into figure-ground layers when compared with locally connected MRF models. Experiments on several benchmark datasets show that the method can recover fine structures and large occlusion regions, with good flow accuracy and much lower computational cost than previous locally-connected layered models.
2 0.26947737 244 cvpr-2013-Large Displacement Optical Flow from Nearest Neighbor Fields
Author: Zhuoyuan Chen, Hailin Jin, Zhe Lin, Scott Cohen, Ying Wu
Abstract: We present an optical flow algorithm for large displacement motions. Most existing optical flow methods use the standard coarse-to-fine framework to deal with large displacement motions which has intrinsic limitations. Instead, we formulate the motion estimation problem as a motion segmentation problem. We use approximate nearest neighbor fields to compute an initial motion field and use a robust algorithm to compute a set of similarity transformations as the motion candidates for segmentation. To account for deviations from similarity transformations, we add local deformations in the segmentation process. We also observe that small objects can be better recovered using translations as the motion candidates. We fuse the motion results obtained under similarity transformations and under translations together before a final refinement. Experimental validation shows that our method can successfully handle large displacement motions. Although we particularly focus on large displacement motions in this work, we make no sacrifice in terms of overall performance. In particular, our method ranks at the top of the Middlebury benchmark.
3 0.20649171 362 cvpr-2013-Robust Monocular Epipolar Flow Estimation
Author: Koichiro Yamaguchi, David McAllester, Raquel Urtasun
Abstract: We consider the problem of computing optical flow in monocular video taken from a moving vehicle. In this setting, the vast majority of image flow is due to the vehicle's ego-motion. We propose to take advantage of this fact and estimate flow along the epipolar lines of the egomotion. Towards this goal, we derive a slanted-plane MRF model which explicitly reasons about the ordering of planes and their physical validity at junctions. Furthermore, we present a bottom-up grouping algorithm which produces over-segmentations that respect flow boundaries. We demonstrate the effectiveness of our approach in the challenging KITTI flow benchmark [11] achieving half the error of the best competing general flow algorithm and one third of the error of the best epipolar flow algorithm.
4 0.20458274 334 cvpr-2013-Pose from Flow and Flow from Pose
Author: Katerina Fragkiadaki, Han Hu, Jianbo Shi
Abstract: Human pose detectors, although successful in localising faces and torsos of people, often fail with lower arms. Motion estimation is often inaccurate under fast movements of body parts. We build a segmentation-detection algorithm that mediates the information between body parts recognition, and multi-frame motion grouping to improve both pose detection and tracking. Motion of body parts, though not accurate, is often sufficient to segment them from their backgrounds. Such segmentations are crucial for extracting hard to detect body parts out of their interior body clutter. By matching these segments to exemplars we obtain pose labeled body segments. The pose labeled segments and corresponding articulated joints are used to improve the motion flow fields by proposing kinematically constrained affine displacements on body parts. The pose-based articulated motion model is shown to handle large limb rotations and displacements. Our algorithm can detect people under rare poses, frequently missed by pose detectors, showing the benefits of jointly reasoning about pose, segmentation and motion in videos.
Author: Dong Zhang, Omar Javed, Mubarak Shah
Abstract: In this paper, we propose a novel approach to extract primary object segments in videos in the ‘object proposal’ domain. The extracted primary object regions are then used to build object models for optimized video segmentation. The proposed approach has several contributions: First, a novel layered Directed Acyclic Graph (DAG) based framework is presented for detection and segmentation of the primary object in video. We exploit the fact that, in general, objects are spatially cohesive and characterized by locally smooth motion trajectories, to extract the primary object from the set of all available proposals based on motion, appearance and predicted-shape similarity across frames. Second, the DAG is initialized with an enhanced object proposal set where motion based proposal predictions (from adjacent frames) are used to expand the set of object proposals for a particular frame. Last, the paper presents a motion scoring function for selection of object proposals that emphasizes high optical flow gradients at proposal boundaries to discriminate between moving objects and the background. The proposed approach is evaluated using several challenging benchmark videos and it outperforms both unsupervised and supervised state-of-the-art methods. 1. Introduction & Related Work In this paper, our goal is to detect the primary object in videos and to delineate it from the background in all frames. Video object segmentation is a well-researched problem in the computer vision community and is a prerequisite for a variety of high-level vision applications, including content based video retrieval, video summarization, activity understanding and targeted content replacement. Both fully automatic methods and methods requiring manual initialization have been proposed for video object segmentation. In the latter class of approaches, [2, 15, 23] need annotations of object segments in key frames for initialization.
Figure 1. Primary object region selection in the object proposal domain (rows: video frames #38, #39, #61, #62; key-frame object regions [13]; primary object regions extracted by the proposed method). The first row shows frames from a video. The second row shows key object proposals (in red boundaries) extracted by [13]. “?” indicates that no proposal related to the primary object was found by the method. The third row shows primary object proposals selected by the proposed method. Note that the proposed method was able to find primary object proposals in all frames. The results in rows 2 and 3 are prior to per-pixel segmentation. In this paper we demonstrate that temporally dense extraction of primary object proposals results in significant improvement in object segmentation performance. Please see Table 1 for quantitative results and comparisons to the state of the art. Optimization techniques employing motion and appearance constraints are then used to propagate the segments to all frames. Other methods ([16, 20]) only require accurate object region annotation for the first frame, then employ region tracking to segment the rest of the frames into object and background regions. Note that the aforementioned semi-automatic techniques generally give good segmentation results. However, most computer vision applications involve processing of large amounts of video data, which makes manual initialization cost prohibitive. Figure 2. Object proposals from a video frame employing the method in [7]. The left side image is one of the video frames. Note that the monkey is the object of interest in the frame. Images on the right show some of the top ranked object proposals from the frame. Most of the proposals do not correspond to an actual object. The goal of the proposed work is to generate an enhanced set of object proposals and extract the segments related to the primary object from the video.
Consequently, a large number of automatic methods have also been proposed for video object segmentation. A subset of these methods employs motion grouping ([19, 18, 4]) for object segmentation. Other methods ([10, 3, 21]) use appearance cues to segment each frame first and then use both appearance and motion constraints for a bottom-up final segmentation. Methods like [9, 3, 11, 22] present efficient optimization frameworks for spatiotemporal grouping of pixels for video segmentation. However, all of these automatic methods do not have an explicit model of how an object looks or moves, and therefore, the segments usually don’t correspond to a particular object but only to image regions that exhibit coherent appearance or motion. Recently, several methods ([7, 5, 1]) were proposed that provided an explicit notion of how a generic object looks like. Specifically, the method [7] could extract object-like regions or ‘object proposals’ from images. This work was built upon by Lee et al. [13] and Ma and Latecki [14] to employ object proposals for object video segmentation. Lee et al. [13] proposed to detect the primary object by collecting a pool of object proposals from the video, and then applying spectral graph clustering to obtain multiple binary inlier/outlier partitions. Each inlier cluster corresponds to a particular object’s regions. Both motion and appearance based cues are used to measure the ‘objectness’ of a proposal in the cluster. The cluster with the largest average ‘objectness’ is likely to contain the primary object in video. One shortcoming of this approach is that the clustering process ignores the order of the proposals in the video, and there- fore, cannot model the evolution of object’s shape and location with time. The work by Ma and Latecki [14] attempts Input Videos Figure 3. The Video Object Segmentation Framework to mitigate this issue by utilizing relationships between object proposals in adjacent frames. 
The object region selection problem is modeled as a constrained Maximum Weight Cliques problem in order to find the true object region from all the video frames simultaneously. However, this problem is NP-hard ([14]) and an approximate optimization technique is used to obtain the solution. The object proposal based segmentation approaches [13, 14] have two additional limitations compared to the proposed method. First, in both approaches, object proposal generation for a particular frame doesn’t directly depend on object proposals generated for adjacent frames. Second, both approaches do not actually predict the shape of the object in adjacent frames when computing region similarity, which degrades segmentation performance for fast moving objects. In this paper, we present an approach that though inspired from aforementioned approaches, attempts to remove their shortcomings. Note that, in general, an object’s shape and appearance varies slowly from frame to frame. Therefore, the intuition is that the object proposal sequence in a video with high ‘objectness’, and high similarity across frames is likely to be the primary object. To this end, we use optical flow to track the evolution of object shape, and compute the difference between predicted and actual shape (along with appearance) to measure similarity of object proposals across frames. The ‘objectness’ is measured using appearance and a motion based criterion that emphasizes high optical flow gradients at the boundaries between objects proposals and the background. Moreover, the primary object proposal selection problem is formulated as the longest path problem for Directed Acyclic Graph (DAG), for which (unlike [14]) an optimal solution exists in linear time. Note that, if the temporal order of object proposals locations (across frames) is not used ([13], then it can result in no proposals being associated with the prima666222779 ry object for many frames (please see Figure 1). 
The proposed method not only uses object proposals from a particular frame (please see Figure 2), but also expands the proposal set using predictions from proposals of neighboring frames. The combination of proposal expansion and the predicted-shape based similarity criterion results in temporally dense and spatially accurate primary object proposal extraction. We have evaluated the proposed approach using several challenging benchmark videos, and it outperforms both unsupervised and supervised state-of-the-art methods. In Section 2, the proposed layered DAG based object selection approach is introduced and discussed in detail; in Section 3, both qualitative and quantitative experimental results for two publicly available datasets and some other challenging videos are shown; the paper is concluded in Section 4.
2. Layered DAG based Video Object Segmentation
2.1. The Framework
The proposed framework consists of 3 stages (as shown in Figure 3):
1. Generation of object proposals per frame and then expansion of the proposal set for each frame based on object proposals in adjacent frames.
2. Generation of a layered DAG from all the object proposals in the video. The longest path in the graph fulfills the goal of maximizing objectness and similarity scores, and represents the most likely set of proposals denoting the primary object in the video.
3. The primary object proposals are used to build object and background models using Gaussian mixtures, and a graph-cuts based optimization method is used to obtain a refined per-pixel segmentation.
Since the proposed approach is centered around the layered DAG framework for selection of primary object regions, we start with its description.
2.2. Layered DAG Structure
We want to extract object proposals with high objectness likelihood, high appearance similarity and smoothly varying shape from the set of all proposals obtained from the video.
Also, since we want to extract the primary object only, we want to extract at most a single proposal per frame. Keeping these objectives in mind, the layered DAG is formed as follows. Each object proposal is represented by two nodes, a ‘beginning node’ and an ‘ending node’, and there are two types of edges: unary edges and binary edges. The unary edges have weights which measure the objectness of a proposal. The details of the function for unary weight assignment (measuring objectness) are given in Section 2.2.1. All the beginning nodes in the same frame form a layer, and so do the ending nodes. A directed unary edge is built from each beginning node to its ending node. Thus, each video frame is represented by two layers in the graph. Directed binary edges are built from any ending node to all the beginning nodes in later layers. The binary edges have weights which measure the appearance and shape similarity between the corresponding object proposals across frames. The binary weight assignment functions are introduced in Section 2.2.2.
Figure 4. Layered Directed Acyclic Graph (DAG) Structure. Nodes “s” and “t” are the source and sink nodes respectively, which have zero weights for edges with other nodes in the graph. The yellow nodes and the green nodes are “beginning nodes” and “ending nodes” respectively, and they are paired such that each yellow-green pair represents an object proposal. All the beginning nodes in the same frame are arranged in a layer, and likewise the ending nodes. The green edges are the unary edges and the red edges are the binary edges.
Figure 4 is an illustration of the graph structure. It shows frames i−1, i and i+1 of the graph, with corresponding layers 2i−3, 2i−2, 2i−1, 2i, 2i+1 and 2i+2. Note that only 3 object proposals are shown for each layer for simplicity; however, there are usually hundreds of object proposals for each frame, and the number of object proposals is not necessarily the same for different frames. The yellow nodes are “beginning nodes”, the green nodes are “ending nodes”, the green edges are unary edges with weights indicating objectness and the red edges are binary edges with weights indicating appearance and shape similarity (note that the graph only shows some of the binary edges for simplicity). There is also a virtual source node s and a sink node t with 0-weighted edges (black edges) to the graph. Note that it is not necessary to build binary edges from an ending node to all the beginning nodes in later layers. In practice, building binary edges only to the next three subsequent frames is enough for most videos.
2.2.1 Unary Edges
Unary edges measure the objectness of the proposals. Both appearance and motion are important to infer the objectness, so the scoring function for object proposals is defined as S_unary(r) = A(r) + M(r), in which r is any object proposal, A(r) is the appearance score and M(r) is the motion score. We define M(r) as the average Frobenius norm of the optical flow gradient around the boundary of object proposal r. The Frobenius norm of the optical flow gradient is defined as:
$$\|U_X\|_F = \left\| \begin{bmatrix} u_x & u_y \\ v_x & v_y \end{bmatrix} \right\|_F = \sqrt{u_x^2 + u_y^2 + v_x^2 + v_y^2}, \quad (1)$$
in which U = (u, v) is the forward optical flow of the frame, and u_x, v_x and u_y, v_y are the optical flow gradients in the x and y directions respectively.
Figure 5. Optical Flow Gradient Magnitude Motion Scoring. In row 1, column 1 shows the original video frame, column 2 is one of the object proposals and column 3 shows the dilated boundary of the object proposal. In row 2, column 1 shows the forward optical flow of the frame, column 2 shows the optical flow gradient magnitude map and column 3 shows the optical flow gradient magnitude response for the specific object proposal around the boundary. [Please Print in Color]
The intuition behind this motion scoring function is that the motions of the foreground object and the background are usually distinct, so the boundary of a moving object usually implies a discontinuity in motion. Therefore, ideally, the gradient of the optical flow should have high magnitude around the foreground object boundary (this phenomenon can easily be observed in Figure 5). In equation 1, we use the Frobenius norm to measure the optical flow gradient magnitude: the higher the value, the more likely the region is from a moving object. In practice, the maximum of the optical flow gradient magnitude usually does not coincide exactly with the moving object boundary, due to the underlying approximations of the optical flow calculation. Therefore, we dilate the object proposal boundary and take the average optical flow gradient magnitude as the motion score. Figure 5 is an illustration of this process. The appearance scoring function A(r) is measured by the objectness ([7]).
2.2.2 Binary Edges
Binary edges measure the similarity between object proposals across frames. For measuring the similarity of regions, color, location, size and shape are the properties to be considered. We define the similarity between regions as the weight of binary edges as follows:
$$S_{binary}(r_m, r_n) = \lambda \cdot S_{color}(r_m, r_n) \cdot S_{overlap}(r_m, r_n), \quad (2)$$
in which r_m and r_n are regions from frames m and n, λ is a constant value for adjusting the ratio between unary and binary edges, S_overlap is the overlap similarity between regions and S_color is the color histogram similarity:
$$S_{color}(r_m, r_n) = hist(r_m) \cdot hist(r_n)^T, \quad (3)$$
in which hist(r) is the normalized color histogram for a region r, and
$$S_{overlap}(r_m, r_n) = \frac{|r_m \cap warp_{mn}(r_n)|}{|r_m \cup warp_{mn}(r_n)|}, \quad (4)$$
in which warp_mn(r_n) is the region obtained by warping r_n to frame m with optical flow.
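The binary edge weight of equations 2 to 4 can be illustrated with a small sketch. This is a minimal example under stated assumptions, not the authors' implementation: regions are assumed to be sets of (x, y) pixel coordinates, the flow is assumed to be a dictionary from pixel to displacement, and the histogram is assumed to be a joint RGB histogram with 4 bins per channel scaled to unit L2 norm (the text only says the histogram is normalized). All function names are hypothetical.

```python
import math

def normalized_histogram(colors, bins=4):
    """Joint RGB histogram with `bins` bins per channel, scaled to unit L2 norm.

    The bin count and the L2 normalization are assumptions of this sketch.
    """
    size = 256 // bins
    counts = [0.0] * (bins ** 3)
    for r, g, b in colors:
        # Clamp so that 255 never falls outside the last bin.
        ri, gi, bi = (min(c // size, bins - 1) for c in (r, g, b))
        counts[ri * bins * bins + gi * bins + bi] += 1.0
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm > 0 else counts

def color_similarity(hist_m, hist_n):
    """Eq. 3: dot product of the two normalized histograms."""
    return sum(a * b for a, b in zip(hist_m, hist_n))

def warp_region(region, flow):
    """Shift each pixel of `region` by its flow vector, rounded to the pixel grid."""
    return {(x + round(flow[(x, y)][0]), y + round(flow[(x, y)][1]))
            for (x, y) in region}

def overlap_similarity(region_m, warped_n):
    """Eq. 4: intersection over union of r_m and the warped r_n."""
    union = region_m | warped_n
    return len(region_m & warped_n) / len(union) if union else 0.0

def binary_weight(region_m, colors_m, region_n, colors_n, flow_n_to_m, lam=1.0):
    """Eq. 2: lambda * S_color * S_overlap."""
    s_color = color_similarity(normalized_histogram(colors_m),
                               normalized_histogram(colors_n))
    s_overlap = overlap_similarity(region_m, warp_region(region_n, flow_n_to_m))
    return lam * s_color * s_overlap

# Toy example: a 2-pixel region in frame n moves one pixel to the right
# and lands inside a 4-pixel region of the same color in frame m.
region_m = {(1, 0), (2, 0), (3, 0), (4, 0)}
region_n = {(0, 0), (1, 0)}
flow = {p: (1.0, 0.0) for p in region_n}
red = [(200, 10, 10)]
weight = binary_weight(region_m, red * 4, region_n, red * 2, flow)
print(weight)
```

Because the overlap is computed against the warped region rather than the original one, a fast moving object can still score well as long as the flow predicts where it ends up.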
It is clear that S_color encodes the color similarity between regions and S_overlap encodes the size and location similarity between regions. If two regions are close, and their sizes and shapes are similar, the value is higher, and vice versa. Note that, unlike prior approaches [13, 14], we use optical flow to predict the region (i.e. encoding location and shape), and therefore we are better able to compute similarity for fast moving objects.
2.2.3 Dynamic Programming Solution
So far, we have built the layered DAG, and the objective is clear: to find the highest weighted path in the DAG. Assume the graph contains 2F + 2 layers (F is the number of frames), the source node is in layer 0 and the sink node is in layer 2F + 2. Let N_ij denote the j-th node in the i-th layer and E(N_ij, N_kl) denote the edge from N_ij to N_kl. Layer i has M_i nodes. Let P = (p_1, p_2, ..., p_{m+1}) = (N_01, N_{j_1 j_2}, ..., N_{j_{m−1} j_m}, N_{(2F+2)1}) be a path from the source to the sink node. Therefore,
$$P_{max} = \arg\max_{P} \sum_{i=1}^{m} E(p_i, p_{i+1}). \quad (5)$$
Finding P_max is a Longest (simple) Path Problem for a DAG. Let OPT(i, j) be the maximum path value for N_ij from the source node. The maximum path value satisfies the following recurrence for i ≥ 1 and j ≥ 1:
$$OPT(i, j) = \max_{k = 0 \ldots i-1,\; l = 1 \ldots M_k} \left[ OPT(k, l) + E(N_{kl}, N_{ij}) \right]. \quad (6)$$
This problem can be solved by dynamic programming in linear time [12]. The computational complexity of the algorithm is O(n + m), in which n is the number of nodes and m is the number of edges. The most important parameter for the layered DAG is the ratio λ between unary edges and binary edges. However, in practice, the results are not sensitive to it, and in the experiments λ is simply set to 1.
2.3. Per-pixel Video Object Segmentation
Once the primary object proposals are obtained in a video, the results are further refined by a graph-based method to get per-pixel segmentation results. We define a spatiotemporal graph by connecting frames temporally with optical flow displacement.
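As a brief aside on the longest-path formulation of Section 2.2.3, the recurrence in equation 6 can be sketched as follows. This is an illustrative example rather than the authors' code: the graph encoding (a list of layers plus an adjacency dictionary), the node names, the toy weights, and the choice to connect the source to every beginning node and every ending node to the sink with zero-weight edges are all assumptions made for the sketch. It uses the forward form of the recurrence, in which each node pushes its best value to its successors, so every edge is examined once.

```python
def longest_path(layers, edges, source, sink):
    """Highest-weight path from `source` to `sink` in a layered DAG.

    layers: list of lists of node ids, ordered so that every edge goes
            from an earlier layer to a later one (a topological order).
    edges:  dict mapping a node id to a list of (successor, weight) pairs.
    Returns (best_score, path_as_list_of_node_ids).
    """
    best = {source: 0.0}   # OPT value of each node reached so far
    parent = {}            # predecessor on the best path, for backtracking
    for layer in layers:
        for node in layer:
            if node not in best:
                continue   # not reachable from the source
            for succ, weight in edges.get(node, ()):
                candidate = best[node] + weight
                if succ not in best or candidate > best[succ]:
                    best[succ] = candidate
                    parent[succ] = node
    if sink not in best:
        raise ValueError("sink is not reachable from source")
    path = [sink]
    while path[-1] != source:
        path.append(parent[path[-1]])
    path.reverse()
    return best[sink], path

# Toy graph: 2 frames, 2 proposals per frame ("a" and "b").
# b<frame><proposal> is a beginning node, e<frame><proposal> an ending node.
layers = [["s"], ["b1a", "b1b"], ["e1a", "e1b"],
          ["b2a", "b2b"], ["e2a", "e2b"], ["t"]]
edges = {
    "s": [("b1a", 0.0), ("b1b", 0.0), ("b2a", 0.0), ("b2b", 0.0)],
    "b1a": [("e1a", 1.0)],    # unary edges (objectness)
    "b1b": [("e1b", 0.5)],
    "b2a": [("e2a", 0.25)],
    "b2b": [("e2b", 0.75)],
    "e1a": [("b2a", 0.125), ("b2b", 0.5), ("t", 0.0)],   # binary edges (similarity)
    "e1b": [("b2a", 1.0), ("b2b", 0.25), ("t", 0.0)],
    "e2a": [("t", 0.0)],
    "e2b": [("t", 0.0)],
}
score, path = longest_path(layers, edges, "s", "t")
print(score, path)
```

Processing the layers in order guarantees that a node's value is final before it is propagated, which is what makes a single pass over the nodes and edges sufficient.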
Each of the nodes in the graph is a pixel in a frame, and edges are set to be the 8-neighbors within one frame and the forward-backward 18 neighbors in adjacent frames. We define the energy function for labeling f = [f_1, f_2, ..., f_n] of n pixels with prior knowledge of h:
$$E(f, h) = \sum_{i \in S} D_i^h(f_i) + \lambda \sum_{(i,j) \in N} V_{i,j}(f_i, f_j), \quad (7)$$
where S = {p_1, ..., p_n} is the set of n pixels in the video, N consists of neighboring pixels, and i, j index the pixels. Each p_i can be set to 0 or 1, which represents background or foreground respectively. The unary term D_i^h defines the cost of labeling pixel i with label f_i, which we get from the Gaussian Mixture Models (GMM) for both color and location:
$$D_i^h(f_i) = -\log\left(\alpha U_i^c(f_i, h) + (1 - \alpha) U_i^l(f_i, h)\right), \quad (8)$$
where U_i^c(.) is the color-induced cost and U_i^l(.) is the location cost. For the binary term V_{i,j}(f_i, f_j), we follow the definitions in [17]:
$$V_{i,j}(f_i, f_j) = [f_i \neq f_j] \exp\left(-\beta (C_i - C_j)^2\right), \quad (9)$$
where [.] denotes the indicator function taking values 0 and 1, (C_i − C_j)^2 is the squared Euclidean distance between two adjacent nodes in RGB space, and $\beta = \left(2 \langle (C_i - C_j)^2 \rangle\right)^{-1} |_{(i,j) \in N}$. We use the graph-cuts based minimization method in [8] to obtain the optimal solution for equation 7, and thus get the final segmentation results. Next, we describe the method for object proposal generation that is used to initialize the video object segmentation process.
2.4. Object Proposal Generation & Expansion
In order to achieve our goal of identifying image regions belonging to the primary object in the video, it is preferable (though not necessary) to have an object proposal corresponding to the actual object for each frame in which the object is present. Using only appearance or optical flow based cues to generate object proposals is usually not enough for this purpose. This phenomenon can be observed in the example shown in Figure 6. For frame i in this figure, hundreds of object proposals were generated using the method in [7]; however, no proposal is consistent with the true object, and the object is fragmented between different proposals.
Figure 6. Object Proposal Expansion. For each optical flow warped object proposal in frame i−1, we look for object proposals in frame i which have high overlap ratios with the warped one. If some object proposals all have high overlap ratios with the warped one, they are merged into a new large object proposal. This process will produce the right object proposal if it is not discovered by [7] from frame i, but from frame i−1.
We assume that an object’s shape and location change smoothly across frames and propose to enhance the set of object proposals for a frame by using the proposals generated for its adjacent frames. The object proposal expansion method is guided by optical flow (see Figure 6). For the forward version of object proposal expansion, each object proposal r_{i−1}^k in frame i−1 is warped by the forward optical flow to frame i, then a check is made whether any proposal r_i^j in frame i has a large overlap ratio with the warped object proposal, i.e.,
$$o = \frac{|warp_{i-1,i}(r_{i-1}^k) \cap r_i^j|}{|r_i^j|}. \quad (10)$$
The contiguous overlapped areas, for regions in i+1 with o greater than 0.5, are merged into a single region and are used as additional proposals. Note that the original proposals are also kept, so this is an ‘expansion’ of the proposal set, and not a replacement. In practice, this process is carried out both forward and backward in time. Since it is an iterative process, even if suitable object proposals are missing in consecutive frames, they could potentially be produced by this expansion process. Figure 6 shows an example image sequence where the expansion process resulted in the generation of a suitable proposal.
3. Experiments
The proposed method was evaluated using two well-known segmentation datasets: the SegTrack dataset [20] and the GaTech video segmentation dataset [9].
Quantitative comparisons are shown for the SegTrack dataset, since ground truth is available for this dataset. Qualitative results are shown for the GaTech video segmentation dataset. We also evaluated the proposed approach on additional challenging videos, for which we will share the ground truth to aid future evaluations.
3.1. SegTrack Dataset
We first evaluate our method on the SegTrack dataset [20]. There are 6 videos in this dataset, and a pixel-level segmentation ground truth is available for each video. We follow the setup in the literature ([13, 14]) and use 5 of the videos (birdfall, cheetah, girl, monkeydog and parachute) for evaluation, since the ground truth for the remaining one (penguin) is not usable. We use an optical flow magnitude based model selection method to infer the camera motion: for static cameras, a background subtraction cue is also used for moving object extraction. For all the results shown in this section, the static camera model was selected (automatically) only for the “birdfall” video. We compare our method with 4 state-of-the-art methods, [14], [13], [20] and [6], as shown in Table 1. Note that our method is an unsupervised method, and it outperforms all the other unsupervised methods except on the parachute video, where it is a close second. Note that [20] and [6] are supervised methods which need an initial annotation for the first frame. The results in Table 1 are the average per-frame pixel error rate compared to the ground truth. The definition is [20]:
$$error = \frac{XOR(f, GT)}{F}, \quad (11)$$
where f is the segmentation labeling result of the method, GT is the ground-truth labeling of the video, and F is the number of frames in the video. Figure 7 shows qualitative results for the videos of the SegTrack dataset.
Figure 7. SegTrack dataset results: (a) Birdfall, (b) Cheetah, (c) Girl, (d) Monkeydog, (e) Parachute. The regions within the red boundaries are the segmented primary objects. [Please Print in Color]
Table 1. Quantitative results and comparison with the state of the art on the SegTrack dataset.
Video        Ours   [14]   [13]   [20]    [6]
birdfall      155    189    288    252    454
cheetah       633    806    905   1142   1217
girl         1488   1698   1785   1304   1755
monkeydog     365    472    521    563    683
parachute     220    221    201    235    502
Avg.          452    542    592    594    791
supervised?     N      N      N      Y      Y
Figure 8 is an example that shows the effectiveness of the proposed layered DAG approach for temporally dense extraction of primary object regions. The figure shows consecutive frames (frame 38 to frame 43) from the “monkeydog” video. The top 2 rows show the results of the key-frame object extraction method [13], and the bottom 2 rows show our object region selection results. As one can see, [13] detects the primary object proposal in only one of the frames; however, by using the proposed approach, we can extract the primary object region from all the frames. This is the main reason that the segmentation results of the proposed method are better than those of prior methods.
Figure 8. Comparison of object region selection methods on frames #38 to #43: (a) Key-frame Object Region Selection, (b) Layered DAG Object Region Selection. The regions within the red boundaries are the selected object regions. “?” means there is no object region selected by the method. Numbers above are the frame indices. [Please Print in Color]
3.2. GaTech Segmentation Dataset
We also evaluated the proposed method on the GaTech video segmentation dataset. We show a qualitative comparison of results between the proposed approach and the original bottom-up method for the dataset in Figure 9. As one can observe, our method segments the true foreground object from the background. The method of [9] does not use an object model, which induces over-segmentation (although the results are very good for the general segmentation problem).
Figure 9. Object Segmentation Results on GaTech Video Segmentation Dataset: (a) waterski, (b) yunakim. Row 1: original frame. Row 2: segmentation results by the bottom-up segmentation method [9]. Row 3: video object segmentation by the proposed method. The regions within the red or green boundaries are the segmented primary objects. [Please Print in Color]
3.3. Persons and Cars Segmentation Dataset
We have built a new dataset for video object segmentation. The dataset is challenging: persons are in a variety of poses; cars have different speeds, and when they are slow, it is very hard to do motion segmentation. We generate ground truth for those videos. Figure 10 shows some sample results from this dataset, and Table 2 shows the quantitative results for this dataset (the average per-frame pixel error is defined in the same way as for the SegTrack dataset [20]). Please go to http://crcv.ucf.edu for more details.
Table 2. Quantitative Results on Persons and Cars dataset.
Video       Average per-frame pixel error
Surfing     1209
Jumping     835
Skiing      817
Sliding     2228
Big car     1129
Small car   272
4. Conclusions
We have proposed a novel and efficient layered DAG based approach to segment the primary object in videos. This approach also uses innovative mechanisms to compute the ‘objectness’ of a region and to compute similarity between object proposals across frames. The proposed approach outperforms the state of the art on the well-known SegTrack dataset. We also demonstrate good segmentation performance on additional challenging data sets.
Figure 10. Sample Results on Persons and Cars Dataset: (a) Surfing, (b) Jumping, (c) Skiing, (d) Sliding, (e) Big car, (f) Small car. Please go to http://crcv.ucf.edu for more details.
Acknowledgment
This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract numbers D11PC20066. The U.S.
Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.
References
[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR, pages 73–80, 2010.
[2] X. Bai, J. Wang, D. Simons, and G. Sapiro. Video snapcut: robust video object cutout using localized classifiers. ACM Transactions on Graphics, 28(3):70, 2009.
[3] W. Brendel and S. Todorovic. Video object segmentation by tracking regions. In ICCV, pages 833–840, 2009.
[4] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, pages 282–295, 2010.
[5] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In CVPR, pages 3241–3248, 2010.
[6] P. Chockalingam, N. Pradeep, and S. Birchfield. Adaptive fragments-based tracking of non-rigid objects using level sets. In ICCV, pages 1530–1537, 2009.
[7] I. Endres and D. Hoiem. Category independent object proposals. In ECCV, pages 575–588, 2010.
[8] B. Fulkerson, A. Vedaldi, and S. Soatto. Class segmentation and object localization with superpixel neighborhoods. In ICCV, pages 670–677, 2009.
[9] M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph-based video segmentation. In CVPR, pages 2141–2148, 2010.
[10] Y. Huang, Q. Liu, and D. Metaxas. Video object segmentation by hypergraph cut. In CVPR, pages 1738–1745, 2009.
[11] J. Wang, B. Thiesson, Y. Xu, and M. Cohen. Image and video segmentation by anisotropic kernel mean shift. In ECCV, 2004.
[12] J. Kleinberg and E. Tardos. Algorithm design. Pearson Education and Addison Wesley, 2006.
[13] Y. Lee, J. Kim, and K. Grauman. Key-segments for video object segmentation. In ICCV, pages 1995–2002, 2011.
[14] T. Ma and L. Latecki. Maximum weight cliques with mutex constraints for video object segmentation. In CVPR, pages 670–677, 2012.
[15] B. Price, B. Morse, and S. Cohen. Livecut: Learning-based interactive video segmentation by evaluation of multiple propagated cues. In ICCV, pages 779–786, 2009.
[16] X. Ren and J. Malik. Tracking as repeated figure/ground segmentation. In CVPR, pages 1–8, 2007.
[17] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In ACM Transactions on Graphics, volume 23, pages 309–314, 2004.
[18] Y. Sheikh, O. Javed, and T. Kanade. Background subtraction for freely moving cameras. In ICCV, pages 1219–1225, 2009.
[19] J. Shi and J. Malik. Motion segmentation and tracking using normalized cuts. In ICCV, pages 1154–1160, 1998.
[20] D. Tsai, M. Flagg, and J. Rehg. Motion coherent tracking with multi-label mrf optimization. In BMVC, page 1, 2010.
[21] A. Vazquez-Reina, S. Avidan, H. Pfister, and E. Miller. Multiple hypothesis video segmentation from superpixel flows. In ECCV, pages 268–281, 2010.
[22] C. Xu, C. Xiong, and J. J. Corso. Streaming hierarchical video segmentation. In ECCV, pages 626–639, 2012.
[23] J. Yuen, B. Russell, C. Liu, and A. Torralba. Labelme video: Building a video database with human annotations. In ICCV, pages 1451–1458, 2009.
6 0.16812333 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction
7 0.1523809 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
8 0.14790224 300 cvpr-2013-Multi-target Tracking by Lagrangian Relaxation to Min-cost Network Flow
9 0.13813454 345 cvpr-2013-Real-Time Model-Based Rigid Object Pose Estimation and Tracking Combining Dense and Sparse Visual Cues
10 0.1351907 124 cvpr-2013-Determining Motion Directly from Normal Flows Upon the Use of a Spherical Eye Platform
11 0.12719098 316 cvpr-2013-Optical Flow Estimation Using Laplacian Mesh Energy
12 0.12353875 107 cvpr-2013-Deformable Spatial Pyramid Matching for Fast Dense Correspondences
13 0.11936578 180 cvpr-2013-Fully-Connected CRFs with Non-Parametric Pairwise Potential
14 0.1175016 187 cvpr-2013-Geometric Context from Videos
15 0.11399329 88 cvpr-2013-Compressible Motion Fields
16 0.11131267 357 cvpr-2013-Revisiting Depth Layers from Occlusions
18 0.11001354 203 cvpr-2013-Hierarchical Video Representation with Trajectory Binary Partition Tree
19 0.10725451 371 cvpr-2013-SCaLE: Supervised and Cascaded Laplacian Eigenmaps for Visual Object Recognition Based on Nearest Neighbors
20 0.10693184 256 cvpr-2013-Learning Structured Hough Voting for Joint Object Detection and Occlusion Reasoning
topicId topicWeight
[(0, 0.209), (1, 0.096), (2, 0.049), (3, -0.046), (4, -0.016), (5, -0.015), (6, 0.091), (7, -0.024), (8, -0.092), (9, 0.083), (10, 0.154), (11, 0.106), (12, 0.101), (13, 0.017), (14, 0.169), (15, 0.146), (16, -0.024), (17, -0.113), (18, 0.007), (19, -0.01), (20, -0.044), (21, -0.072), (22, 0.052), (23, -0.105), (24, 0.031), (25, -0.073), (26, 0.054), (27, 0.09), (28, -0.031), (29, 0.012), (30, 0.031), (31, -0.02), (32, 0.113), (33, 0.046), (34, -0.063), (35, -0.033), (36, 0.112), (37, -0.015), (38, 0.034), (39, 0.007), (40, -0.096), (41, -0.047), (42, -0.015), (43, 0.05), (44, -0.015), (45, 0.044), (46, -0.004), (47, -0.014), (48, 0.003), (49, 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.96561134 10 cvpr-2013-A Fully-Connected Layered Model of Foreground and Background Flow
Author: Deqing Sun, Jonas Wulff, Erik B. Sudderth, Hanspeter Pfister, Michael J. Black
Abstract: Layered models allow scene segmentation and motion estimation to be formulated together and to inform one another. Traditional layered motion methods, however, employ fairly weak models of scene structure, relying on locally connected Ising/Potts models which have limited ability to capture long-range correlations in natural scenes. To address this, we formulate a fully-connected layered model that enables global reasoning about the complicated segmentations of real objects. Optimization with fully-connected graphical models is challenging, and our inference algorithm leverages recent work on efficient mean field updates for fully-connected conditional random fields. These methods can be implemented efficiently using high-dimensional Gaussian filtering. We combine these ideas with a layered flow model, and find that the long-range connections greatly improve segmentation into figure-ground layers when compared with locally connected MRF models. Experiments on several benchmark datasets show that the method can recover fine structures and large occlusion regions, with good flow accuracy and much lower computational cost than previous locally-connected layered models.
2 0.81762087 362 cvpr-2013-Robust Monocular Epipolar Flow Estimation
Author: Koichiro Yamaguchi, David McAllester, Raquel Urtasun
Abstract: We consider the problem of computing optical flow in monocular video taken from a moving vehicle. In this setting, the vast majority of image flow is due to the vehicle’s ego-motion. We propose to take advantage of this fact and estimate flow along the epipolar lines of the egomotion. Towards this goal, we derive a slanted-plane MRF model which explicitly reasons about the ordering of planes and their physical validity at junctions. Furthermore, we present a bottom-up grouping algorithm which produces over-segmentations that respect flow boundaries. We demonstrate the effectiveness of our approach in the challenging KITTI flow benchmark [11] achieving half the error of the best competing general flow algorithm and one third of the error of the best epipolar flow algorithm.
3 0.81704205 244 cvpr-2013-Large Displacement Optical Flow from Nearest Neighbor Fields
Author: Zhuoyuan Chen, Hailin Jin, Zhe Lin, Scott Cohen, Ying Wu
Abstract: We present an optical flow algorithm for large displacement motions. Most existing optical flow methods use the standard coarse-to-fine framework to deal with large displacement motions which has intrinsic limitations. Instead, we formulate the motion estimation problem as a motion segmentation problem. We use approximate nearest neighbor fields to compute an initial motion field and use a robust algorithm to compute a set of similarity transformations as the motion candidates for segmentation. To account for deviations from similarity transformations, we add local deformations in the segmentation process. We also observe that small objects can be better recovered using translations as the motion candidates. We fuse the motion results obtained under similarity transformations and under translations together before a final refinement. Experimental validation shows that our method can successfully handle large displacement motions. Although we particularly focus on large displacement motions in this work, we make no sacrifice in terms of overall performance. In particular, our method ranks at the top of the Middlebury benchmark.
Author: Dong Zhang, Omar Javed, Mubarak Shah
Abstract: In this paper, we propose a novel approach to extract primary object segments in videos in the ‘object proposal’ domain. The extracted primary object regions are then used to build object models for optimized video segmentation. The proposed approach has several contributions: First, a novel layered Directed Acyclic Graph (DAG) based framework is presented for detection and segmentation of the primary object in video. We exploit the fact that, in general, objects are spatially cohesive and characterized by locally smooth motion trajectories, to extract the primary object from the set of all available proposals based on motion, appearance and predicted-shape similarity across frames. Second, the DAG is initialized with an enhanced object proposal set where motion based proposal predictions (from adjacent frames) are used to expand the set of object proposals for a particular frame. Last, the paper presents a motion scoring function for selection of object proposals that emphasizes high optical flow gradients at proposal boundaries to discriminate between moving objects and the background. The proposed approach is evaluated using several challenging benchmark videos and it outperforms both unsupervised and supervised state-of-the-art methods.
1. Introduction & Related Work
In this paper, our goal is to detect the primary object in videos and to delineate it from the background in all frames. Video object segmentation is a well-researched problem in the computer vision community and is a prerequisite for a variety of high-level vision applications, including content based video retrieval, video summarization, activity understanding and targeted content replacement. Both fully automatic methods and methods requiring manual initialization have been proposed for video object segmentation. In the latter class of approaches, [2, 15, 23] need annotations of object segments in key frames for initialization.
Optimization techniques employing motion and appearance constraints are then used to propagate the segments to all frames. Other methods ([16, 20]) only require accurate object region annotation for the first frame, then employ region tracking to segment the rest of the frames into object and background regions. Note that the aforementioned semi-automatic techniques generally give good segmentation results. However, most computer vision applications involve processing of large amounts of video data, which makes manual initialization cost prohibitive.
Figure 1. Primary object region selection in the object proposal domain (frames #38, #39, #61 and #62). The first row shows frames from a video. The second row shows key object proposals (in red boundaries) extracted by [13]. “?” indicates that no proposal related to the primary object was found by the method. The third row shows primary object proposals selected by the proposed method. Note that the proposed method was able to find primary object proposals in all frames. The results in rows 2 and 3 are prior to per-pixel segmentation. In this paper we demonstrate that temporally dense extraction of primary object proposals results in significant improvement in object segmentation performance. Please see Table 1 for quantitative results and comparisons to the state of the art. [Please Print in Color]
Figure 2. Object proposals from a video frame employing the method in [7]. The left side image is one of the video frames. Note that the monkey is the object of interest in the frame. Images on the right show some of the top ranked object proposals from the frame. Most of the proposals do not correspond to an actual object. The goal of the proposed work is to generate an enhanced set of object proposals and extract the segments related to the primary object from the video.
Consequently, a large number of automatic methods have also been proposed for video object segmentation. A subset of these methods employs motion grouping ([19, 18, 4]) for object segmentation. Other methods ([10, 3, 21]) use appearance cues to segment each frame first and then use both appearance and motion constraints for a bottom-up final segmentation. Methods like [9, 3, 11, 22] present efficient optimization frameworks for spatiotemporal grouping of pixels for video segmentation. However, none of these automatic methods has an explicit model of how an object looks or moves, and therefore the segments usually don’t correspond to a particular object but only to image regions that exhibit coherent appearance or motion.

Figure 3. The Video Object Segmentation Framework.

Recently, several methods ([7, 5, 1]) were proposed that provided an explicit notion of what a generic object looks like. Specifically, the method [7] could extract object-like regions or ‘object proposals’ from images. This work was built upon by Lee et al. [13] and Ma and Latecki [14] to employ object proposals for video object segmentation. Lee et al. [13] proposed to detect the primary object by collecting a pool of object proposals from the video, and then applying spectral graph clustering to obtain multiple binary inlier/outlier partitions. Each inlier cluster corresponds to a particular object’s regions. Both motion and appearance based cues are used to measure the ‘objectness’ of a proposal in the cluster. The cluster with the largest average ‘objectness’ is likely to contain the primary object in video. One shortcoming of this approach is that the clustering process ignores the order of the proposals in the video, and therefore cannot model the evolution of the object’s shape and location with time. The work by Ma and Latecki [14] attempts to mitigate this issue by utilizing relationships between object proposals in adjacent frames.
The object region selection problem is modeled as a constrained Maximum Weight Cliques problem in order to find the true object region from all the video frames simultaneously. However, this problem is NP-hard ([14]) and an approximate optimization technique is used to obtain the solution. The object proposal based segmentation approaches [13, 14] have two additional limitations compared to the proposed method. First, in both approaches, object proposal generation for a particular frame doesn’t directly depend on object proposals generated for adjacent frames. Second, both approaches do not actually predict the shape of the object in adjacent frames when computing region similarity, which degrades segmentation performance for fast moving objects.

In this paper, we present an approach that, though inspired by the aforementioned approaches, attempts to remove their shortcomings. Note that, in general, an object’s shape and appearance vary slowly from frame to frame. Therefore, the intuition is that the object proposal sequence in a video with high ‘objectness’ and high similarity across frames is likely to be the primary object. To this end, we use optical flow to track the evolution of object shape, and compute the difference between predicted and actual shape (along with appearance) to measure similarity of object proposals across frames. The ‘objectness’ is measured using appearance and a motion based criterion that emphasizes high optical flow gradients at the boundaries between object proposals and the background. Moreover, the primary object proposal selection problem is formulated as the longest path problem for a Directed Acyclic Graph (DAG), for which (unlike [14]) an optimal solution can be found in linear time. Note that, if the temporal order of object proposal locations (across frames) is not used ([13]), then it can result in no proposals being associated with the primary object for many frames (please see Figure 1).
The proposed method not only uses object proposals from a particular frame (please see Figure 2), but also expands the proposal set using predictions from proposals of neighboring frames. The combination of proposal expansion and the predicted shape based similarity criteria results in temporally dense and spatially accurate primary object proposal extraction. We have evaluated the proposed approach using several challenging benchmark videos and it outperforms both unsupervised and supervised state-of-the-art methods.

In Section 2, the proposed layered DAG based object selection approach is introduced and discussed in detail. In Section 3, both qualitative and quantitative experimental results for two publicly available datasets and some other challenging videos are shown. The paper is concluded in Section 4.

2. Layered DAG based Video Object Segmentation

2.1. The Framework

The proposed framework consists of 3 stages (as shown in Figure 3):
1. Generation of object proposals per frame and then expansion of the proposal set for each frame based on object proposals in adjacent frames.
2. Generation of a layered DAG from all the object proposals in the video. The longest path in the graph fulfills the goal of maximizing objectness and similarity scores, and represents the most likely set of proposals denoting the primary object in the video.
3. The primary object proposals are used to build object and background models using Gaussian mixtures, and a graph-cuts based optimization method is used to obtain refined per-pixel segmentation.

Since the proposed approach is centered around the layered DAG framework for selection of primary object regions, we will start with its description.

2.2. Layered DAG Structure

We want to extract object proposals with high objectness likelihood, high appearance similarity and smoothly varying shape from the set of all proposals obtained from the video.
Also, since we want to extract the primary object only, we want to extract at most a single proposal per frame. Keeping these objectives in mind, the layered DAG is formed as follows. Each object proposal is represented by two nodes, a ‘beginning node’ and an ‘ending node’, and there are two types of edges: unary edges and binary edges. The unary edges have weights which measure the objectness of a proposal. The details of the function for unary weight assignments (measuring objectness) are given in Section 2.2.1. All the beginning nodes in the same frame form a layer, as do the ending nodes. A directed unary edge is built from each beginning node to its ending node. Thus, each video frame is represented by two layers in the graph. Directed binary edges are built from any ending node to all the beginning nodes in later layers. The binary edges have weights which measure the appearance and shape similarity between the corresponding object proposals across frames. The binary weight assignment functions are introduced in Section 2.2.2.

Figure 4. Layered Directed Acyclic Graph (DAG) Structure (frames i−1, i and i+1; layers 2i−3 to 2i+2). Node “s” and “t” are source and sink nodes respectively, which have zero weights for edges with other nodes in the graph. The yellow nodes and the green nodes are “beginning nodes” and “ending nodes” respectively, and they are paired such that each yellow-green pair represents an object proposal. All the beginning nodes in the same frame are arranged in a layer, and the same holds for the ending nodes. The green edges are the unary edges and the red edges are the binary edges.

Figure 4 is an illustration of the graph structure. It shows frames i−1, i and i+1 of the graph, with corresponding layers 2i−3, 2i−2, 2i−1, 2i, 2i+1 and 2i+2. Note that only 3 object proposals are shown for each layer for simplicity; however, there are usually hundreds of object proposals for each frame, and the number of object proposals for different frames is not necessarily the same. The yellow nodes are “beginning nodes”, the green nodes are “ending nodes”, the green edges are unary edges with weights indicating objectness and the red edges are binary edges with weights indicating appearance and shape similarity (note that the graph only shows some of the binary edges for simplicity). There is also a virtual source node s and a sink node t with 0 weighted edges (black edges) to the graph. Note that it is not necessary to build binary edges from an ending node to all the beginning nodes in later layers. In practice, only building binary edges to the next three subsequent frames is enough for most of the videos.

2.2.1 Unary Edges

Unary edges measure the objectness of the proposals. Both appearance and motion are important to infer the objectness, so the scoring function for object proposals is defined as $S_{unary}(r) = A(r) + M(r)$, in which r is any object proposal, A(r) is the appearance score and M(r) is the motion score. We define M(r) as the average Frobenius norm of the optical flow gradient around the boundary of object proposal r. The Frobenius norm of optical flow gradients is defined as:

$$\|U_X\|_F = \left\| \begin{bmatrix} u_x & u_y \\ v_x & v_y \end{bmatrix} \right\|_F = \sqrt{u_x^2 + u_y^2 + v_x^2 + v_y^2}, \quad (1)$$

in which $U = (u, v)$ is the forward optical flow of the frame, and $u_x, v_x$ and $u_y, v_y$ are optical flow gradients in the x and y directions respectively.

Figure 5. Optical Flow Gradient Magnitude Motion Scoring. In row 1, column 1 shows the original video frame, column 2 is one of the object proposals and column 3 shows the dilated boundary of the object proposal. In row 2, column 1 shows the forward optical flow of the frame, column 2 shows the optical flow gradient magnitude map and column 3 shows the optical flow gradient magnitude response for the specific object proposal around the boundary. [Please Print in Color]

The intuition behind this motion scoring function is that the motions of the foreground object and the background are usually distinct, so the boundary of a moving object usually implies a discontinuity in motion. Therefore, ideally, the gradient of optical flow should have high magnitude around the foreground object boundary (this phenomenon can be easily observed in Figure 5). In equation 1, we use the Frobenius norm to measure the optical flow gradient magnitude; the higher the value, the more likely the region is from a moving object. In practice, the maximum of the optical flow gradient magnitude usually does not coincide exactly with the moving object boundary due to the underlying approximation of optical flow calculation. Therefore, we dilate the object proposal boundary and take the average optical flow gradient magnitude as the motion score. Figure 5 is an illustration of this process. The appearance scoring function A(r) is measured by the objectness ([7]).

2.2.2 Binary Edges

Binary edges measure the similarity between object proposals across frames. For measuring the similarity of regions, color, location, size and shape are the properties to be considered. We define the similarity between regions as the weight of binary edges as follows:

$$S_{binary}(r_m, r_n) = \lambda \cdot S_{color}(r_m, r_n) \cdot S_{overlap}(r_m, r_n), \quad (2)$$

in which $r_m$ and $r_n$ are regions from frames m and n, $\lambda$ is a constant value for adjusting the ratio between unary and binary edges, $S_{overlap}$ is the overlap similarity between regions and $S_{color}$ is the color histogram similarity:

$$S_{color}(r_m, r_n) = hist(r_m) \cdot hist(r_n)^T, \quad (3)$$

in which hist(r) is the normalized color histogram for a region r.

$$S_{overlap}(r_m, r_n) = \frac{|r_m \cap warp_{mn}(r_n)|}{|r_m \cup warp_{mn}(r_n)|}, \quad (4)$$

in which $warp_{mn}(r_n)$ is the warped region from $r_n$ by optical flow to frame m.
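The edge weights of Eqs. (1) to (4) can be illustrated with a minimal sketch. It is not the authors' implementation: regions are assumed to be sets of `(y, x)` pixel coordinates, flow fields are nested lists, the gradients use clamped forward differences, and the dilated boundary and the warped region are assumed to be computed elsewhere. All function names are hypothetical.

```python
import math

def flow_gradient_magnitude(u, v, y, x):
    """Frobenius norm of the flow Jacobian at pixel (y, x), as in Eq. (1).

    u and v are 2-D lists of floats. Forward differences clamped at the
    image border are an assumption of this sketch.
    """
    h, w = len(u), len(u[0])
    y1, x1 = min(y + 1, h - 1), min(x + 1, w - 1)
    ux, uy = u[y][x1] - u[y][x], u[y1][x] - u[y][x]
    vx, vy = v[y][x1] - v[y][x], v[y1][x] - v[y][x]
    return math.sqrt(ux * ux + uy * uy + vx * vx + vy * vy)

def motion_score(u, v, boundary_pixels):
    """M(r): mean gradient magnitude over the (dilated) boundary pixels."""
    vals = [flow_gradient_magnitude(u, v, y, x) for (y, x) in boundary_pixels]
    return sum(vals) / len(vals) if vals else 0.0

def color_similarity(hist_m, hist_n):
    """Eq. (3): dot product of two normalized color histograms."""
    return sum(a * b for a, b in zip(hist_m, hist_n))

def overlap_similarity(region_m, warped_n):
    """Eq. (4): intersection over union of r_m and the warped r_n."""
    union = region_m | warped_n
    return len(region_m & warped_n) / len(union) if union else 0.0

def binary_weight(hist_m, hist_n, region_m, warped_n, lam=1.0):
    """Eq. (2): lambda * S_color * S_overlap."""
    return (lam * color_similarity(hist_m, hist_n)
            * overlap_similarity(region_m, warped_n))
```

Pixel sets keep the example free of dependencies; an array-based version would use boolean masks instead.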
It is clear that $S_{color}$ encodes the color similarity between regions and $S_{overlap}$ encodes the size and location similarity between regions. If two regions are close, and the sizes and shapes are similar, the value would be higher, and vice versa. Note that, unlike prior approaches [13, 14], we use optical flow to predict the region (i.e. encoding location and shape), and therefore we are better able to compute similarity for fast moving objects.

2.2.3 Dynamic Programming Solution

Until now, we have built the layered DAG and the objective is clear: to find the highest weighted path in the DAG. Assume the graph contains 2F + 2 layers (F is the frame number), the source node is in layer 0 and the sink node is in layer 2F + 2. Let $N_{ij}$ denote the jth node in the ith layer and $E(N_{ij}, N_{kl})$ denote the edge from $N_{ij}$ to $N_{kl}$. Layer i has $M_i$ nodes. Let $P = (p_1, p_2, ..., p_{m+1}) = (N_{01}, N_{j_1 j_2}, ..., N_{j_{m-1} j_m}, N_{(2F+2)1})$ be a path from the source to the sink node. Therefore,

$$P_{max} = \arg\max_{P} \sum_{i=1}^{m} E(p_i, p_{i+1}). \quad (5)$$

Finding $P_{max}$ is a Longest (simple) Path Problem for a DAG. Let OPT(i, j) be the maximum path value for $N_{ij}$ from the source node. The maximum path value satisfies the following recurrence for i ≥ 1 and j ≥ 1:

$$OPT(i, j) = \max_{k=0...i-1,\; l=1...M_k} \left[ OPT(k, l) + E(N_{kl}, N_{ij}) \right]. \quad (6)$$

This problem can be solved by dynamic programming in linear time [12]. The computational complexity of the algorithm is O(n + m), in which n is the number of nodes and m is the number of edges. The most important parameter for the layered DAG is the ratio λ between unary edges and binary edges. However, in practice, the results are not sensitive to it, and in the experiments λ is simply set to be 1.

2.3. Per-pixel Video Object Segmentation

Once the primary object proposals are obtained in a video, the results are further refined by a graph-based method to get per-pixel segmentation results. We define a spatiotemporal graph by connecting frames temporally with optical flow displacement.
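As an aside on Section 2.2.3, the recurrence in Eq. (6) can be sketched as a single pass over the layers. This is a minimal illustration under stated assumptions, not the authors' code: nodes are `(layer, index)` tuples with 0-based indices (the paper's j is 1-based), the edge dictionary and the function name `longest_path` are hypothetical, and the first and last layers are assumed to hold only the source and the sink.

```python
def longest_path(nodes_per_layer, edges):
    """Highest-weight source-to-sink path in a layered DAG.

    nodes_per_layer[i] is the number of nodes in layer i.
    edges maps (src, dst) -> weight, where src and dst are (layer, index)
    tuples and every edge goes from a lower layer to a higher one.
    Returns (best_value, path), or (None, []) if the sink is unreachable.
    """
    incoming = {}
    for (src, dst), weight in edges.items():
        if src[0] >= dst[0]:
            raise ValueError("edges must point to a later layer")
        incoming.setdefault(dst, []).append((src, weight))

    source = (0, 0)
    sink = (len(nodes_per_layer) - 1, 0)
    opt = {source: 0.0}   # OPT(i, j) for every reachable node
    back = {}             # best predecessor, used to recover the path

    # Visiting layers in increasing order is a topological order, so each
    # edge is relaxed exactly once: O(n + m) overall.
    for i in range(1, len(nodes_per_layer)):
        for j in range(nodes_per_layer[i]):
            node = (i, j)
            best = None
            for src, weight in incoming.get(node, []):
                if src in opt and (best is None or opt[src] + weight > best[0]):
                    best = (opt[src] + weight, src)
            if best is not None:
                opt[node], back[node] = best

    if sink not in opt:
        return None, []
    path = [sink]
    while path[-1] != source:
        path.append(back[path[-1]])
    path.reverse()
    return opt[sink], path
```

Odd layers would hold beginning nodes and even layers ending nodes, so a unary edge joins `(2f-1, j)` to `(2f, j)` and binary edges join an ending node to beginning nodes of later frames.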
Each of the nodes in the graph is a pixel in a frame, and edges are set to be the 8-neighbors within one frame and the forward-backward 18 neighbors in adjacent frames. We define the energy function for labeling $f = [f_1, f_2, ..., f_n]$ of n pixels with prior knowledge of h:

$$E(f, h) = \sum_{i \in S} D_i^h(f_i) + \lambda \sum_{(i,j) \in N} V_{i,j}(f_i, f_j), \quad (7)$$

where $S = \{p_1, ..., p_n\}$ is the set of n pixels in the video, N consists of neighboring pixels, and i, j index the pixels. $p_i$ could be set to 0 or 1, which represents background or foreground respectively. The unary term $D_i^h$ defines the cost of labeling pixel i with label $f_i$, which we get from the Gaussian Mixture Models (GMM) for both color and location:

$$D_i^h(f_i) = -\log\left(\alpha U_i^c(f_i, h) + (1 - \alpha) U_i^l(f_i, h)\right), \quad (8)$$

where $U_i^c(\cdot)$ is the color-induced cost and $U_i^l(\cdot)$ is the location cost. For the binary term $V_{i,j}(f_i, f_j)$, we follow the definitions in [17]:

$$V_{i,j}(f_i, f_j) = [f_i \neq f_j] \exp\left(-\beta (C_i - C_j)^2\right), \quad (9)$$

where $[\cdot]$ denotes the indicator function taking values 0 and 1, $(C_i - C_j)^2$ is the squared Euclidean distance between two adjacent nodes in RGB space, and $\beta = \left(2 \langle (C_i - C_j)^2 \rangle\right)^{-1}$, with the expectation taken over $(i, j) \in N$. We use the graph-cuts based minimization method in [8] to obtain the optimal solution for equation 7, and thus get the final segmentation results. Next, we describe the method for object proposal generation that is used to initialize the video object segmentation process.

2.4. Object Proposal Generation & Expansion

In order to achieve our goal of identifying image regions belonging to the primary object in the video, it is preferable (though not necessary) to have an object proposal corresponding to the actual object for each frame in which the object is present. Using only appearance or optical flow based cues to generate object proposals is usually not enough for this purpose. This phenomenon can be observed in the example shown in Figure 6. For frame i in this figure, hundreds of object proposals were generated using the method in [7]; however, no proposal is consistent with the true object, and the object is fragmented between different proposals.

Figure 6. Object Proposal Expansion. For each optical flow warped object proposal in frame i−1, we look for object proposals in frame i which have high overlap ratios with the warped one. If some object proposals all have high overlap ratios with the warped one, they are merged into a new large object proposal. This process will produce the right object proposal if it is not discovered by [7] from frame i, but from frame i−1.

We assume that an object’s shape and location change smoothly across frames and propose to enhance the set of object proposals for a frame by using the proposals generated for its adjacent frames. The object proposal expansion method works under the guidance of optical flow (see Figure 6). For the forward version of object proposal expansion, each object proposal $r_{i-1}^k$ in frame i−1 is warped by the forward optical flow to frame i, then a check is made if any proposal $r_i^j$ in frame i has a large overlap ratio with the warped object proposal, i.e.,

$$o = \frac{|warp_{i-1,i}(r_{i-1}^k) \cap r_i^j|}{|r_i^j|}. \quad (10)$$

The contiguous overlapped areas, for regions in i+1 with o greater than 0.5, are merged into a single region, and are used as additional proposals. Note that the original proposals are also kept, so this is an ‘expansion’ of the proposal set, and not a replacement. In practice, this process is carried out both forward and backward in time. Since it is an iterative process, even if suitable object proposals are missing in consecutive frames, they could potentially be produced by this expansion process. Figure 6 shows an example image sequence where the expansion process resulted in generation of a suitable proposal.

3. Experiments

The proposed method was evaluated using two well-known segmentation datasets: SegTrack dataset [20] and GaTech video segmentation dataset [9].
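As an aside on Section 2.4, the forward expansion step built on Eq. (10) can be sketched as follows. This is an illustration under simplifying assumptions rather than the authors' procedure: regions are sets of `(y, x)` pixels, the warp moves each pixel by the rounded flow vector, and the merge is a plain union of all proposals passing the threshold, without the contiguity check mentioned in the text. The names `warp_region` and `expand_proposals` are hypothetical.

```python
def warp_region(region, u, v):
    """Move each pixel (y, x) by the rounded forward flow.

    u holds x-displacements and v holds y-displacements (2-D lists).
    Pixels that leave the image are dropped.
    """
    h, w = len(u), len(u[0])
    warped = set()
    for (y, x) in region:
        ny = y + int(round(v[y][x]))
        nx = x + int(round(u[y][x]))
        if 0 <= ny < h and 0 <= nx < w:
            warped.add((ny, nx))
    return warped

def expand_proposals(prev_proposals, cur_proposals, u, v, threshold=0.5):
    """Return the proposals of frame i plus merged proposals from frame i-1.

    For every warped proposal of the previous frame, current proposals
    whose overlap ratio o (Eq. 10) exceeds the threshold are merged into
    one new region. Original proposals are kept.
    """
    added = []
    for r_prev in prev_proposals:
        warped = warp_region(r_prev, u, v)
        merged = set()
        for r_cur in cur_proposals:
            if not r_cur:
                continue
            o = len(warped & r_cur) / len(r_cur)
            if o > threshold:
                merged |= r_cur
        if merged and merged not in cur_proposals and merged not in added:
            added.append(merged)
    return list(cur_proposals) + added
```

The backward pass would call the same function with the roles of the two frames swapped and the backward flow in place of the forward flow.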
Quantitative comparisons are shown for the SegTrack dataset since ground-truth is available for this dataset. Qualitative results are shown for the GaTech video segmentation dataset. We also evaluated the proposed approach on additional challenging videos, for which we will share the ground-truth to aid future evaluations.

3.1. SegTrack Dataset

We first evaluate our method on the SegTrack dataset [20]. There are 6 videos in this dataset, and a pixel-level segmentation ground-truth for each video is available. We follow the setup in the literature ([13, 14]), and use 5 of the videos (birdfall, cheetah, girl, monkeydog and parachute) for evaluation, since the ground-truth for the other one (penguin) is not useable. We use an optical flow magnitude based model selection method to infer the camera motion: for static cameras, a background subtraction cue is also used for moving object extraction; for all the results shown in this section, the static camera model was only selected (automatically) for the “birdfall” video. We compare our method with 4 state-of-the-art methods, [14], [13], [20] and [6], shown in Table 1. Note that our method is an unsupervised method, and it outperforms all the other unsupervised methods except for the parachute video, where it is a close second. Note that [20] and [6] are supervised methods which need an initial annotation for the first frame. The results in Table 1 are the average per-frame pixel error rate compared to the ground-truth. The definition is [20]:

$$error = \frac{XOR(f, GT)}{F}, \quad (11)$$

where f is the segmentation labeling result of the method, GT is the ground-truth labeling of the video, and F is the number of frames in the video.

Figure 7. SegTrack dataset results: (a) Birdfall, (b) Cheetah, (c) Girl, (d) Monkeydog, (e) Parachute. The regions within the red boundaries are the segmented primary objects. [Please Print in Color]

Video        Ours  [14]  [13]  [20]  [6]
birdfall      155   189   288   252   454
cheetah       633   806   905  1142  1217
girl         1488  1698  1785  1304  1755
monkeydog     365   472   521   563   683
parachute     220   221   201   235   502
Avg.          452   542   592   594   791
supervised?     N     N     N     Y     Y

Table 1. Quantitative results and comparison with the state of the art on SegTrack dataset.

Figure 7 shows qualitative results for the videos of the SegTrack dataset. Figure 8 is an example that shows the effectiveness of the proposed layered DAG approach for temporally dense extraction of primary object regions. The figure shows consecutive frames (frame 38 to frame 43) from the “monkeydog” video. The top 2 rows show the results of the key-frame object extraction method [13], and the bottom 2 rows show our object region selection results. As one can see, [13] detects the primary object proposal in only one of the frames; however, by using the proposed approach, we can extract the primary object region from all the frames. This is the main reason that the segmentation results of the proposed method are better than prior methods.

Figure 8. Comparison of object region selection methods (frames #38 to #43): (a) Key-frame Object Region Selection, (b) Layered DAG Object Region Selection. The regions within the red boundaries are the selected object regions. “?” means there is no object region selected by the method. Numbers above are the frame indices. [Please Print in Color]

3.2. GaTech Segmentation Dataset

We also evaluated the proposed method on the GaTech video segmentation dataset. We show a qualitative comparison of results between the proposed approach and the original bottom-up method for the dataset in Figure 9. As one can observe, our method segments the true foreground object from the background. The method [9] doesn’t use an object model, which induces over-segmentation (although the results are very good for the general segmentation problem).

3.3.
Persons and Cars Segmentation Dataset

We have built a new dataset for video object segmentation. The dataset is challenging: persons are in a variety of poses; cars have different speeds, and when they are slow, it is very hard to do motion segmentation. We generate ground truth for those videos. Figure 10 shows some sample results from this dataset, and Table 2 shows the quantitative results for this dataset (the average per-frame pixel error is defined in the same way as for the SegTrack dataset [20]). Please go to http://crcv.ucf.edu for more details.

Figure 9. Object Segmentation Results on GaTech Video Segmentation Dataset: (a) waterski, (b) yunakim. Row 1: original frame. Row 2: Segmentation results by the bottom-up segmentation method [9]. Row 3: Video object segmentation by the proposed method. The regions within the red or green boundaries are the segmented primary objects. [Please Print in Color]

Video      Average per-frame pixel error
Surfing    1209
Jumping     835
Skiing      817
Sliding    2228
Big car    1129
Small car   272

Table 2. Quantitative Results on Persons and Cars dataset.

4. Conclusions

We have proposed a novel and efficient layered DAG based approach to segment the primary object in videos. This approach also uses innovative mechanisms to compute the ‘objectness’ of a region and to compute similarity between object proposals across frames. The proposed approach outperforms the state of the art on the well-known SegTrack dataset. We also demonstrate good segmentation performance on additional challenging data sets.

Figure 10. Sample Results on Persons and Cars Dataset: (a) Surfing, (b) Jumping, (c) Skiing, (d) Sliding, (e) Big car, (f) Small car. Please go to http://crcv.ucf.edu for more details.

Acknowledgment

This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract numbers D11PC20066. The U.S.
Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

References
[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR, pages 73–80, 2010.
[2] X. Bai, J. Wang, D. Simons, and G. Sapiro. Video snapcut: robust video object cutout using localized classifiers. ACM Transactions on Graphics, 28(3):70, 2009.
[3] W. Brendel and S. Todorovic. Video object segmentation by tracking regions. In ICCV, pages 833–840, 2009.
[4] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, pages 282–295, 2010.
[5] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In CVPR, pages 3241–3248, 2010.
[6] P. Chockalingam, N. Pradeep, and S. Birchfield. Adaptive fragments-based tracking of non-rigid objects using level sets. In ICCV, pages 1530–1537, 2009.
[7] I. Endres and D. Hoiem. Category independent object proposals. In ECCV, pages 575–588, 2010.
[8] B. Fulkerson, A. Vedaldi, and S. Soatto. Class segmentation and object localization with superpixel neighborhoods. In ICCV, pages 670–677, 2009.
[9] M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph-based video segmentation. In CVPR, pages 2141–2148, 2010.
[10] Y. Huang, Q. Liu, and D. Metaxas. Video object segmentation by hypergraph cut. In CVPR, pages 1738–1745, 2009.
[11] J. Wang, B. Thiesson, Y. Xu, and M. Cohen. Image and video segmentation by anisotropic kernel mean shift. In ECCV, 2004.
[12] J. Kleinberg and E. Tardos. Algorithm design. Pearson Education and Addison Wesley, 2006.
[13] Y. Lee, J. Kim, and K. Grauman. Key-segments for video object segmentation. In ICCV, pages 1995–2002, 2011.
[14] T. Ma and L. Latecki. Maximum weight cliques with mutex constraints for video object segmentation. In CVPR, pages 670–677, 2012.
[15] B. Price, B. Morse, and S. Cohen. Livecut: Learning-based interactive video segmentation by evaluation of multiple propagated cues. In ICCV, pages 779–786, 2009.
[16] X. Ren and J. Malik. Tracking as repeated figure/ground segmentation. In CVPR, pages 1–8, 2007.
[17] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In ACM Transactions on Graphics, volume 23, pages 309–314, 2004.
[18] Y. Sheikh, O. Javed, and T. Kanade. Background subtraction for freely moving cameras. In ICCV, pages 1219–1225, 2009.
[19] J. Shi and J. Malik. Motion segmentation and tracking using normalized cuts. In ICCV, pages 1154–1160, 1998.
[20] D. Tsai, M. Flagg, and J. Rehg. Motion coherent tracking with multi-label mrf optimization. In BMVC, page 1, 2010.
[21] A. Vazquez-Reina, S. Avidan, H. Pfister, and E. Miller. Multiple hypothesis video segmentation from superpixel flows. In ECCV, pages 268–281, 2010.
[22] C. Xu, C. Xiong, and J. J. Corso. Streaming hierarchical video segmentation. In ECCV, pages 626–639, 2012.
[23] J. Yuen, B. Russell, C. Liu, and A. Torralba. Labelme video: Building a video database with human annotations. In ICCV, pages 1451–1458, 2009.
5 0.69260061 88 cvpr-2013-Compressible Motion Fields
Author: Giuseppe Ottaviano, Pushmeet Kohli
Abstract: Traditional video compression methods obtain a compact representation for image frames by computing coarse motion fields defined on patches of pixels called blocks, in order to compensate for the motion in the scene across frames. This piecewise constant approximation makes the motion field efficiently encodable, but it introduces block artifacts in the warped image frame. In this paper, we address the problem of estimating dense motion fields that, while accurately predicting one frame from a given reference frame by warping it with the field, are also compressible. We introduce a representation for motion fields based on wavelet bases, and approximate the compressibility of their coefficients with a piecewise smooth surrogate function that yields an objective function similar to classical optical flow formulations. We then show how to quantize and encode such coefficients with adaptive precision. We demonstrate the effectiveness of our approach by comparing its performance with a state-of-the-art wavelet video encoder. Experimental results on a number of standard flow and video datasets reveal that our method significantly outperforms both block-based and optical-flow-based motion compensation algorithms.
7 0.66292655 124 cvpr-2013-Determining Motion Directly from Normal Flows Upon the Use of a Spherical Eye Platform
8 0.65991467 334 cvpr-2013-Pose from Flow and Flow from Pose
9 0.65709203 316 cvpr-2013-Optical Flow Estimation Using Laplacian Mesh Energy
10 0.65423667 203 cvpr-2013-Hierarchical Video Representation with Trajectory Binary Partition Tree
11 0.63523096 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction
12 0.61278194 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
13 0.61038595 300 cvpr-2013-Multi-target Tracking by Lagrangian Relaxation to Min-cost Network Flow
14 0.5811311 345 cvpr-2013-Real-Time Model-Based Rigid Object Pose Estimation and Tracking Combining Dense and Sparse Visual Cues
15 0.55066794 352 cvpr-2013-Recovering Stereo Pairs from Anaglyphs
16 0.53888786 112 cvpr-2013-Dense Segmentation-Aware Descriptors
17 0.53666264 137 cvpr-2013-Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis
18 0.50972402 357 cvpr-2013-Revisiting Depth Layers from Occlusions
19 0.49286792 107 cvpr-2013-Deformable Spatial Pyramid Matching for Fast Dense Correspondences
20 0.49159634 118 cvpr-2013-Detecting Pulse from Head Motions in Video
topicId topicWeight
[(10, 0.103), (11, 0.244), (16, 0.024), (26, 0.032), (33, 0.351), (67, 0.034), (69, 0.036), (87, 0.064)]
simIndex simValue paperId paperTitle
same-paper 1 0.88312286 10 cvpr-2013-A Fully-Connected Layered Model of Foreground and Background Flow
Author: Deqing Sun, Jonas Wulff, Erik B. Sudderth, Hanspeter Pfister, Michael J. Black
Abstract: Layered models allow scene segmentation and motion estimation to be formulated together and to inform one another. Traditional layered motion methods, however, employ fairly weak models of scene structure, relying on locally connected Ising/Potts models which have limited ability to capture long-range correlations in natural scenes. To address this, we formulate a fully-connected layered model that enables global reasoning about the complicated segmentations of real objects. Optimization with fully-connected graphical models is challenging, and our inference algorithm leverages recent work on efficient mean field updates for fully-connected conditional random fields. These methods can be implemented efficiently using high-dimensional Gaussian filtering. We combine these ideas with a layered flow model, and find that the long-range connections greatly improve segmentation into figure-ground layers when compared with locally connected MRF models. Experiments on several benchmark datasets show that the method can recover fine structures and large occlusion regions, with good flow accuracy and much lower computational cost than previous locally-connected layered models.
2 0.83799821 425 cvpr-2013-Tensor-Based High-Order Semantic Relation Transfer for Semantic Scene Segmentation
Author: Heesoo Myeong, Kyoung Mu Lee
Abstract: We propose a novel nonparametric approach for semantic segmentation using high-order semantic relations. Conventional context models mainly focus on learning pairwise relationships between objects. Pairwise relations, however, are not enough to represent high-level contextual knowledge within images. In this paper, we propose semantic relation transfer, a method to transfer high-order semantic relations of objects from annotated images to unlabeled images, analogous to label transfer techniques where label information is transferred. We first define semantic tensors representing high-order relations of objects. The semantic relation transfer problem is then formulated as semi-supervised learning using a quadratic objective function of the semantic tensors. By exploiting the low-rank property of the semantic tensors and employing Kronecker sum similarity, an efficient approximation algorithm is developed. Based on the predicted high-order semantic relations, we reason semantic segmentation and evaluate the performance on several challenging datasets.
3 0.83758903 370 cvpr-2013-SCALPEL: Segmentation Cascades with Localized Priors and Efficient Learning
Author: David Weiss, Ben Taskar
Abstract: We propose SCALPEL, a flexible method for object segmentation that integrates rich region-merging cues with mid- and high-level information about object layout, class, and scale into the segmentation process. Unlike competing approaches, SCALPEL uses a cascade of bottom-up segmentation models that is capable of learning to ignore boundaries early on, yet use them as a stopping criterion once the object has been mostly segmented. Furthermore, we show how such cascades can be learned efficiently. When paired with a novel method that generates better localized shape priors than our competitors, our method leads to a concise, accurate set of segmentation proposals; these proposals are more accurate on the PASCAL VOC2010 dataset than state-of-the-art methods that use re-ranking to filter much larger bags of proposals. The code for our algorithm is available online.
4 0.83724153 46 cvpr-2013-Articulated and Restricted Motion Subspaces and Their Signatures
Author: Bastien Jacquet, Roland Angst, Marc Pollefeys
Abstract: Articulated objects represent an important class of objects in our everyday environment. Automatic detection of the type of articulated or otherwise restricted motion and extraction of the corresponding motion parameters are therefore of high value, e.g. in order to augment an otherwise static 3D reconstruction with dynamic semantics, such as rotation axes and allowable translation directions for certain rigid parts or objects. Hence, in this paper, a novel theory to analyse relative transformations between two motion-restricted parts will be presented. The analysis is based on linear subspaces spanned by relative transformations. Moreover, a signature for relative transformations will be introduced which uniquely specifies the type of restricted motion encoded in these relative transformations. This theoretic framework enables the derivation of novel algebraic constraints, such as low-rank constraints for subsequent rotations around two fixed axes for example. Lastly, given the type of restricted motion as predicted by the signature, the paper shows how to extract all the motion parameters with matrix manipulations from linear algebra. Our theory is verified on several real data sets, such as a rotating blackboard or a wheel rolling on the floor amongst others.
5 0.83723921 111 cvpr-2013-Dense Reconstruction Using 3D Object Shape Priors
Author: Amaury Dame, Victor A. Prisacariu, Carl Y. Ren, Ian Reid
Abstract: We propose a formulation of monocular SLAM which combines live dense reconstruction with shape priors-based 3D tracking and reconstruction. Current live dense SLAM approaches are limited to the reconstruction of visible surfaces. Moreover, most of them are based on the minimisation of a photo-consistency error, which usually makes them sensitive to specularities. In the 3D pose recovery literature, problems caused by imperfect and ambiguous image information have been dealt with by using prior shape knowledge. At the same time, the success of depth sensors has shown that combining joint image and depth information drastically increases the robustness of the classical monocular 3D tracking and 3D reconstruction approaches. In this work we link dense SLAM to 3D object pose and shape recovery. More specifically, we automatically augment our SLAM system with object specific identity, together with 6D pose and additional shape degrees of freedom for the object(s) of known class in the scene, combining image data and depth information for the pose and shape recovery. This leads to a system that allows for full scaled 3D reconstruction with the known object(s) segmented from the scene. The segmentation enhances the clarity, accuracy and completeness of the maps built by the dense SLAM system, while the dense 3D data aids the segmentation process, yielding faster and more reliable convergence than when using 2D image data alone.
6 0.83697271 187 cvpr-2013-Geometric Context from Videos
7 0.83683938 333 cvpr-2013-Plane-Based Content Preserving Warps for Video Stabilization
8 0.83678955 450 cvpr-2013-Unsupervised Joint Object Discovery and Segmentation in Internet Images
9 0.8367734 300 cvpr-2013-Multi-target Tracking by Lagrangian Relaxation to Min-cost Network Flow
10 0.83655661 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
11 0.83635008 203 cvpr-2013-Hierarchical Video Representation with Trajectory Binary Partition Tree
12 0.83634335 299 cvpr-2013-Multi-source Multi-scale Counting in Extremely Dense Crowd Images
13 0.83620912 250 cvpr-2013-Learning Cross-Domain Information Transfer for Location Recognition and Clustering
14 0.83611292 390 cvpr-2013-Semi-supervised Node Splitting for Random Forest Construction
15 0.83603507 34 cvpr-2013-Adaptive Active Learning for Image Classification
16 0.83599043 296 cvpr-2013-Multi-level Discriminative Dictionary Learning towards Hierarchical Visual Categorization
17 0.83597517 380 cvpr-2013-Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images
18 0.83585387 284 cvpr-2013-Mesh Based Semantic Modelling for Indoor and Outdoor Scenes
19 0.83578598 329 cvpr-2013-Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images
20 0.83547211 20 cvpr-2013-A New Model and Simple Algorithms for Multi-label Mumford-Shah Problems