iccv iccv2013 iccv2013-252 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zhan Yu, Xinqing Guo, Haibing Lin, Andrew Lumsdaine, Jingyi Yu
Abstract: Light fields are image-based representations that use densely sampled rays as a scene description. In this paper, we explore geometric structures of 3D lines in ray space for improving light field triangulation and stereo matching. The triangulation problem aims to fill in the ray space with continuous and non-overlapping simplices anchored at sampled points (rays). Such a triangulation provides a piecewise-linear interpolant useful for light field superresolution. We show that the light field space is largely bilinear due to 3D line segments in the scene, and direct triangulation of these bilinear subspaces leads to large errors. We instead present a simple but effective algorithm to first map bilinear subspaces to line constraints and then apply Constrained Delaunay Triangulation (CDT). Based on our analysis, we further develop a novel line-assisted graphcut (LAGC) algorithm that effectively encodes 3D line constraints into light field stereo matching. Experiments on synthetic and real data show that both our triangulation and LAGC algorithms outperform state-of-the-art solutions in accuracy and visual quality.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we explore geometric structures of 3D lines in ray space for improving light field triangulation and stereo matching. [sent-6, score-0.772]
2 The triangulation problem aims to fill in the ray space with continuous and non-overlapping simplices anchored at sampled points (rays). [sent-7, score-0.586]
3 Such a triangulation provides a piecewise-linear interpolant useful for light field superresolution. [sent-8, score-0.465]
4 We show that the light field space is largely bilinear due to 3D line segments in the scene, and direct triangulation of these bilinear subspaces leads to large errors. [sent-9, score-0.931]
5 We instead present a simple but effective algorithm to first map bilinear subspaces to line constraints and then apply Constrained Delaunay Triangulation (CDT). [sent-10, score-0.289]
6 Based on our analysis, we further develop a novel line-assisted graphcut (LAGC) algorithm that effectively encodes 3D line constraints into light field stereo matching. [sent-11, score-0.339]
7 Experiments on synthetic and real data show that both our triangulation and LAGC algorithms outperform state-of-the-art solutions in accuracy and visual quality. [sent-12, score-0.316]
8 A light field (LF) [17, 9] captures a dense set of rays as scene descriptions in place of geometry. [sent-16, score-0.219]
9 Every ray is parameterized by its intersections with two parallel planes: [s, t] as the intersection with the first plane Πst and [u, v] as the second with Πuv. [sent-18, score-0.233]
10 This paper explores ray geometry of a common primitive, 3D line segments. [sent-20, score-0.341]
11 Previous studies show that the LF space is largely linear: a 3D scene point maps to a 2D ray hyperplane [39, 38]. [sent-21, score-0.261]
12 (a) A 3D line segment l maps to a bilinear subspace in a LF; (b) l maps to a curve on a diagonal cut; (c) Brute-force triangulation creates volume. [sent-35, score-0.712]
13 The triangulation provides a natural anisotropic reconstruction kernel: any point in the space can be approximated using a convex combination of the enclosing simplex’s vertices (samples). [sent-37, score-0.35]
14 The simplest triangulation method is to apply high-dimensional Delaunay Triangulation [8]. [sent-38, score-0.316]
15 A better approach is to align simplices with the ray geometry of the 3D scene. [sent-42, score-0.33]
16 For example, we can first estimate the disparity (depth) of the feature pixels (rays), map them to the hyperplane constraints, and apply Constrained Delaunay Triangulation (CDT) [27]. [sent-43, score-0.326]
17 We show this approach is still insufficient to produce high quality triangulations: the LF space contains a large amount of non-linear, or more precisely, bilinear substructures that correspond to 3D line segments. [sent-44, score-0.263]
18 Brute-force triangulation of these bilinear structures leads to large errors and visual artifacts. [sent-45, score-0.495]
19 We instead present a new solution that combines the bilinear and hyperplane constraints for CDT. [sent-46, score-0.239]
20 Our ray geometry analysis of 3D lines also leads to a new LF stereo algorithm. [sent-47, score-0.4]
21 We first introduce a new energy term to preserve disparity consistency along line segments. [sent-48, score-0.292]
22 We then modify the binocular stereo graph via the general purpose graph construction framework [15] and solve it using the extended Quadratic Pseudo-Boolean Optimization algorithm [25]. [sent-49, score-0.234]
23 Experiments show that both our LF triangulation and stereo matching algorithms outperform state-of-the-art solutions in accuracy and visual quality. [sent-51, score-0.429]
24 Ponce [22] applies projective geometry to analyze ray structures. [sent-57, score-0.257]
25 Yu and McMillan [39] use General Linear Cameras (GLCs) to analyze all 2D linear structures (hyperplanes) in 4D LF ray space. [sent-58, score-0.197]
26 Their studies have shown that the LF ray space is largely linear and hence suitable for triangulation: scene geometry such as 3D points or parallel directions maps to GLCs. [sent-59, score-0.323]
27 We show that special care needs to be taken to handle non-linear (bilinear) ray structures. [sent-60, score-0.197]
28 (a) A scanline from a stereo pair; (b) RG Delaunay triangulation (bottom) performs poorly on LF super-resolution (top); (c) Using disparity as additional edge constraints, Constrained Delaunay triangulation significantly improves LF super-resolution. [sent-84, score-1.062]
29 Ray Geometry of 3D Lines We first briefly reiterate the ray geometry [38, 23]. [sent-86, score-0.284]
30 If a 3D line l is parallel to Πuv, we represent it with a point Ṗ = [Px, Py, Pz] and a direction [γx, γy, 0]. [sent-87, score-0.12]
31 If l is not parallel to Πuv, it can be directly parameterized by a ray under 2PP as [u0, v0, s0, t0]. [sent-90, score-0.233]
32 All rays passing through l satisfy the following bilinear constraint: (s − s0) / (u − u0) = (t − t0) / (v − v0). (2) [sent-91, score-0.252]
33 The bilinear ray geometry is particularly important since a real scene usually contains many linear structures that are not parallel to the image plane. [sent-92, score-0.466]
34 This reveals that the LF ray space contains a large amount of bilinear structures. [sent-93, score-0.376]
35 Light Field Triangulation The simplest LF triangulation is regular-grid (RG) triangulation. [sent-99, score-0.316]
36 If we use this triangulation to super-resolve the LF, i. [sent-104, score-0.316]
37 This is because RG triangulation is analogous to bilinear interpolation and does not consider scene geometry (e. [sent-107, score-0.585]
38 Using stereo matching, we can first estimate every pixel’s disparity and map it to a 2D hyperplane [38, 23] as a constraint. [sent-114, score-0.439]
39 In the 2D EPI case, each pixel maps to an edge where the slope of the edge corresponds to its disparity (depth). [sent-115, score-0.319]
40 2 (c) shows an E-CDT triangulation and its super-resolution result. [sent-119, score-0.316]
41 Our triangulation applies CDT to all pixels with these edge constraints. [sent-121, score-0.316]
42 The new view improves RG at non-occlusion regions but exhibits strong aliasing near linear occlusion boundaries. [sent-131, score-0.12]
43 To analyze the cause of aliasing, let us consider a 3D line segment l whose image is (l1x, l1y) − (l2x, l2y) in LF view (s, t). [sent-132, score-0.119]
44 Assume the disparities of l1 and l2 are d1 and d2, respectively. [sent-133, score-0.292]
45 In geometric modeling, it is well known that any direct triangulation of S from the four vertices of S will introduce large error: S is a surface that does not occupy any volume. [sent-137, score-0.35]
46 However, a triangulation of the four vertices will turn S into a tetrahedron which will occupy large volume when |d1 − d2| is large, as shown in Fig. [sent-138, score-0.375]
47 Therefore it is important to add additional constraints onto the bilinear structure. [sent-143, score-0.234]
48 CDT with 3D Edge Constraints We present a simple but effective scheme that directly maps bilinear ray structures of 3D lines into the CDT framework. [sent-146, score-0.433]
49 Specifically, we apply a subdivision scheme [21] by discretizing the bilinear surface into slim bilinear patches and then triangulate each patch. [sent-147, score-0.451]
50 Finally, we use edges of bilinear patches and disparity hyperplanes as constraints for CDT. [sent-148, score-0.497]
51 3), we detect additional 303 line segments in the reference view, subdivide their corresponding bilinear surfaces, and add them as constraints for conducting B-CDT. [sent-153, score-0.375]
52 The new triangulation significantly improves the E-CDT result: it preserves most sharp edges and exhibits very little aliasing near occlusion boundaries. [sent-154, score-0.484]
53 E-CDT preserves non-boundary contents but exhibits strong aliasing near the boundary pixels such as the tripod, the light edges, and the bust. [sent-161, score-0.249]
54 The Tsukuba scene has a relatively large disparity range and direct warping produces many holes. [sent-163, score-0.36]
55 Specifically, to synthesize a new view Vst in the 4D LF with four sample views indexed as V00, V01, V10, V11, we first detect 3D line segments and apply 3D B-CDT to synthesize two new views Vs0 and Vs1 from 3D LFs V00 V10 and V01 V11, respectively. [sent-175, score-0.141]
56 4 shows a skyscraper LF with disparity range [0, 300]. [sent-178, score-0.292]
57 Results using RG exhibit severe aliasing where directly warping produces holes and discontinuity. [sent-179, score-0.163]
58 Next, we select 90,269 feature points (11% of total pixels) and 2092 line segments and apply the pseudo 4D CDT. [sent-180, score-0.141]
59 Light Field Stereo Matching Our LF triangulation also provides useful insights on incorporating 3D line constraints into multi-view stereo. [sent-183, score-0.426]
60 Disparity Interpolant We first prove the linearity of disparity along a line segment, i. [sent-186, score-0.376]
61 , given two endpoints l1 and l2 of a 3D line segment l with disparities d1 and d2, the disparity dk of any intermediate point lk = λk l1 + (1 − λk) l2 is λk d1 + (1 − λk) d2. [sent-188, score-0.893]
62 We present a different proof based on bilinear ray geometry of line l. [sent-192, score-0.52]
63 If l is not parallel to Πuv and Πst, l can be represented as a ray (s0, t0, u0, v0). [sent-194, score-0.233]
64 Consider a specific pixel (s, t) in camera (u, v) that observes a point P on line l and pixel s + Δs in a neighboring camera (u + Δu, v) that also observes P. [sent-195, score-0.138]
65 Both ray (s, t, u, v) and (s + Δs, t, u + Δu, v) satisfy the bilinear ray constraint (Eqn. [sent-196, score-0.573]
66 This reveals that disparity is a linear function along l, i. [sent-199, score-0.292]
67 Line-Assisted Graph Cut (LAGC) To incorporate the linear disparity constraint into multiview stereo matching, the most direct approach is to first detect line segments in the captured LF, then estimate their disparities, and use them as hard constraints in the graphcut algorithm. [sent-204, score-0.572]
68 For the endpoints of each line segment l, we iterate over all possible disparities and interpolate the disparity for all intermediate points. [sent-209, score-0.467]
69 Finally, we find the optimal disparity assignments to the endpoints that yield the highest consistency of all intermediate points. [sent-210, score-0.397]
70 However, if the disparity of the line segment is incorrectly assigned, it will lead to large errors, e. [sent-214, score-0.411]
71 Next, we study how to explicitly encode the disparity constraint of line segments into MVGC. [sent-218, score-0.433]
72 MVGC aims to find the optimal disparity label that minimizes the energy. Figure 5 (panels: Reference, MVGC, Hard Constraint). [sent-219, score-0.292]
73 Encoding 3D line segments as hard constraints improves MVGC but misses important details, e. [sent-220, score-0.167]
74 The occlusion term Eocc measures whether occlusion is correctly preserved when warping the disparity from I to I′. [sent-229, score-0.33]
75 Our key observation is that when assigning disparity labels to the two endpoints, every intermediate point along the line should check occlusion consistency. [sent-232, score-0.411]
76 Specifically, given the two endpoints (pixels) li and lj of line segment l and an intermediate pixel lk = λk li + (1 − λk) lj, we define Eline. [sent-233, score-0.408]
77 [5] and Kolmogorov and Zabih [13, 14] show that one can minimize Econventional by consecutively solving the two-label problem: at each iteration, a new disparity label is added and the algorithm decides whether each pixel should keep the old disparity or switch to the new disparity. [sent-240, score-0.611]
78 This is because an unlabeled pixel is caused by non-submodularity which, in our case, occurs mostly on the line segments since Eline is non-submodular. [sent-261, score-0.168]
79 For natural scenes, line segments are generally sparse and therefore only a small percentage of pixels are unlabeled in QPBO. [sent-262, score-0.141]
80 We then add the source node s for label 0, the sink node t for label 1, the t-links from the graph nodes to s or t, and the n-links between the graph nodes using 4-connectivity. [sent-273, score-0.174]
81 Specifically, for each edge tuple (li , lj , lk) (i, j the endpoints and k the intermediate point), we add three auxiliary n-links (anlinks) li − lj, li − lk and lk − lj, as shown in Fig. [sent-278, score-0.421]
82 (b) For a line segment (pink), we add auxiliary n-links (green). [sent-301, score-0.193]
83 We first evaluate our algorithm on binocular stereo using the Tsukuba dataset. [sent-310, score-0.156]
84 The scene has a disparity range from 0 to 16 pixels. [sent-323, score-0.322]
85 The scene hence is challenging for classical stereo matching. [sent-327, score-0.143]
86 9 compares the disparity maps computed by GCDL and LAGC. [sent-337, score-0.292]
87 Therefore, it requires ultra-densely sampled LFs with a small disparity range (usually between -2 to 2). [sent-339, score-0.292]
88 The disparity range is between −3 and 5 pixels. [sent-347, score-0.292]
89 On continuous regions such as the ground, LAGC produces much smoother disparity transitions whereas the result from GCDL contains large discontinuities. [sent-349, score-0.292]
90 The disparity range is small, from −3 to 3 pixels. [sent-353, score-0.292]
91 The disparity range is ultra small (between −1 and 1 pixels). [sent-363, score-0.292]
92 We discretize the disparity range using 0. [sent-364, score-0.292]
93 Limitations and Future Work We have presented a LF triangulation and stereo matching framework by imposing ray geometry of 3D line segments as constraints. [sent-372, score-0.827]
94 Our current super-resolution scheme requires rasterizing ray simplices into voxels. [sent-377, score-0.346]
95 An alternative approach is to use a walkthrough algorithm that picks one face of the ray simplex at a time and does the orientation test for locating the simplex, a process that can be accelerated using parallel processing on the graphics hardware. [sent-378, score-0.262]
96 An important next step is to test our scheme on the Raytrix data, which have a larger disparity range. [sent-382, score-0.319]
97 In addition, given the increasing interest in LF imaging and the availability of commercial LF cameras, we also plan to build a LF stereo benchmark of real scenes analogous to the Middlebury Stereo Portal [6], for evaluating LF stereo matching algorithms. [sent-383, score-0.226]
98 Complexity of the Delaunay triangulation. Discrete & Computational Geometry, 2003. [sent-402, score-0.456]
99 General-dimensional constrained delaunay and constrained regular triangulations i: Combinatorial properties. [sent-542, score-0.247]
100 Variational light field analysis for disparity estimation and super-resolution. [sent-593, score-0.408]
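The bilinear constraint of Eqn. (2) and the linear disparity interpolant stated in the sentences above can be checked numerically. The following is only a minimal sketch under assumptions that are not taken from the paper: the st plane is placed at z = 0 and the uv plane at z = 1 with absolute coordinates, the camera is a unit-focal-length pinhole with unit baseline, and all function names are hypothetical.

```python
import numpy as np


def point_on_line(line_ray, z):
    """3D point at depth z on the line given as a 2PP ray (s0, t0, u0, v0).

    Assumption: the st plane is z = 0 and the uv plane is z = 1.
    """
    s0, t0, u0, v0 = line_ray
    return np.array([s0 + z * (u0 - s0), t0 + z * (v0 - t0), z])


def ray_through(point, s, t):
    """2PP ray [s, t, u, v] leaving (s, t, 0) and passing through `point`."""
    origin = np.array([s, t, 0.0])
    direction = np.asarray(point, dtype=float) - origin
    if np.isclose(direction[2], 0.0):
        raise ValueError("ray is parallel to the parameterization planes")
    hit = origin + direction / direction[2]  # intersection with z = 1
    return np.array([s, t, hit[0], hit[1]])


def bilinear_residual(ray, line_ray):
    """Cross-multiplied form of Eqn. (2); zero when the ray meets the line."""
    s, t, u, v = ray
    s0, t0, u0, v0 = line_ray
    return (s - s0) * (v - v0) - (t - t0) * (u - u0)


def project(point, focal=1.0):
    """Pinhole projection (assumed camera model, not from the paper)."""
    x, y, z = point
    return np.array([focal * x / z, focal * y / z])


def disparity(point, focal=1.0, baseline=1.0):
    """Disparity of a 3D point for an assumed rectified stereo pair."""
    return focal * baseline / point[2]


def interpolated_disparity(x, x1, x2, d1, d2):
    """Disparity at image point x = lam * x1 + (1 - lam) * x2 on a segment."""
    x, x1, x2 = (np.asarray(a, dtype=float) for a in (x, x1, x2))
    lam = np.dot(x - x2, x1 - x2) / np.dot(x1 - x2, x1 - x2)
    return lam * d1 + (1.0 - lam) * d2
```

The residual uses the cross-multiplied form of the constraint so that rays with u = u0 or v = v0 do not cause a division by zero. The interpolation weight is computed in image space, matching the definition lk = λk l1 + (1 − λk) l2.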
wordName wordTfidf (topN-words)
[('lf', 0.563), ('triangulation', 0.316), ('disparity', 0.292), ('lagc', 0.247), ('ray', 0.197), ('bilinear', 0.179), ('cdt', 0.165), ('gcdl', 0.165), ('delaunay', 0.14), ('mvgc', 0.115), ('stereo', 0.113), ('lytro', 0.102), ('qpbo', 0.088), ('aliasing', 0.087), ('lk', 0.085), ('line', 0.084), ('light', 0.081), ('epi', 0.076), ('rays', 0.073), ('simplices', 0.073), ('lj', 0.072), ('endpoints', 0.07), ('eline', 0.066), ('rg', 0.066), ('geometry', 0.06), ('segments', 0.057), ('kolmogorov', 0.055), ('wanner', 0.054), ('goldl', 0.054), ('econventional', 0.049), ('gantry', 0.049), ('gcp', 0.049), ('lfs', 0.049), ('lmaps', 0.049), ('rasterizing', 0.049), ('tetgen', 0.049), ('preserves', 0.048), ('es', 0.046), ('nk', 0.046), ('auxiliary', 0.045), ('triangulating', 0.044), ('microlens', 0.044), ('binocular', 0.043), ('uv', 0.043), ('triangulate', 0.041), ('esmooth', 0.041), ('triangulations', 0.041), ('array', 0.04), ('graph', 0.039), ('warping', 0.038), ('holes', 0.038), ('plenoptic', 0.038), ('venus', 0.036), ('parallel', 0.036), ('segment', 0.035), ('disparities', 0.035), ('levoy', 0.035), ('tsukuba', 0.035), ('field', 0.035), ('intermediate', 0.035), ('hyperplane', 0.034), ('vertices', 0.034), ('zabih', 0.034), ('edata', 0.034), ('sink', 0.034), ('exhibits', 0.033), ('constrained', 0.033), ('amethyst', 0.033), ('chimneys', 0.033), ('crane', 0.033), ('interpolant', 0.033), ('lego', 0.033), ('sosp', 0.033), ('subnodes', 0.033), ('tlinks', 0.033), ('xinqing', 0.033), ('xpz', 0.033), ('ypz', 0.033), ('stanford', 0.032), ('lines', 0.03), ('scene', 0.03), ('cut', 0.029), ('simplex', 0.029), ('eocc', 0.029), ('lumsdaine', 0.029), ('tripod', 0.029), ('add', 0.029), ('ni', 0.029), ('middlebury', 0.029), ('triangulated', 0.028), ('scheme', 0.027), ('reiterate', 0.027), ('pixel', 0.027), ('constraints', 0.026), ('scanline', 0.025), ('tetrahedron', 0.025), ('woodford', 0.025), ('subdivision', 0.025), ('el', 0.025), ('lsd', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 252 iccv-2013-Line Assisted Light Field Triangulation and Stereo Matching
2 0.45869982 423 iccv-2013-Towards Motion Aware Light Field Video for Dynamic Scenes
Author: Salil Tambe, Ashok Veeraraghavan, Amit Agrawal
Abstract: Current Light Field (LF) cameras offer fixed resolution in space, time and angle which is decided a-priori and is independent of the scene. These cameras either trade off spatial resolution to capture single-shot LF [20, 27, 12] or trade off temporal resolution by assuming a static scene to capture high spatial resolution LF [18, 3]. Thus, capturing high spatial resolution LF video for dynamic scenes remains an open and challenging problem. We present the concept, design and implementation of a LF video camera that allows capturing high resolution LF video. The spatial, angular and temporal resolution are not fixed a-priori and we exploit the scene-specific redundancy in space, time and angle. Our reconstruction is motion-aware and offers a continuum of resolution tradeoff with increasing motion in the scene. The key idea is (a) to design efficient multiplexing matrices that allow resolution tradeoffs, (b) use dictionary learning and sparse representations for robust reconstruction, and (c) perform local motion-aware adaptive reconstruction. We perform extensive analysis and characterize the performance of our motion-aware reconstruction algorithm. We show realistic simulations using a graphics simulator as well as real results using an LCoS-based programmable camera. We demonstrate novel results such as high resolution digital refocusing for dynamic moving objects.
3 0.22048801 322 iccv-2013-Pose Estimation and Segmentation of People in 3D Movies
Author: Karteek Alahari, Guillaume Seguin, Josef Sivic, Ivan Laptev
Abstract: We seek to obtain a pixel-wise segmentation and pose estimation of multiple people in a stereoscopic video. This involves challenges such as dealing with unconstrained stereoscopic video, non-stationary cameras, and complex indoor and outdoor dynamic scenes. The contributions of our work are two-fold: First, we develop a segmentation model incorporating person detection, pose estimation, as well as colour, motion, and disparity cues. Our new model explicitly represents depth ordering and occlusion. Second, we introduce a stereoscopic dataset with frames extracted from feature-length movies “StreetDance 3D” and “Pina”. The dataset contains 2727 realistic stereo pairs and includes annotation of human poses, person bounding boxes, and pixel-wise segmentations for hundreds of people. The dataset is composed of indoor and outdoor scenes depicting multiple people with frequent occlusions. We demonstrate results on our new challenging dataset, as well as on the H2view dataset from (Sheasby et al. ACCV 2012).
4 0.21572839 304 iccv-2013-PM-Huber: PatchMatch with Huber Regularization for Stereo Matching
Author: Philipp Heise, Sebastian Klose, Brian Jensen, Alois Knoll
Abstract: Most stereo correspondence algorithms match support windows at integer-valued disparities and assume a constant disparity value within the support window. The recently proposed PatchMatch stereo algorithm [7] overcomes this limitation of previous algorithms by directly estimating planes. This work presents a method that integrates the PatchMatch stereo algorithm into a variational smoothing formulation using quadratic relaxation. The resulting algorithm allows the explicit regularization of the disparity and normal gradients using the estimated plane parameters. Evaluation of our method in the Middlebury benchmark shows that our method outperforms the traditional integer-valued disparity strategy as well as the original algorithm and its variants in sub-pixel accurate disparity estimation.
5 0.18276966 382 iccv-2013-Semi-dense Visual Odometry for a Monocular Camera
Author: Jakob Engel, Jürgen Sturm, Daniel Cremers
Abstract: We propose a fundamentally novel approach to real-time visual odometry for a monocular camera. It allows us to benefit from the simplicity and accuracy of dense tracking, which does not depend on visual features, while running in real time on a CPU. The key idea is to continuously estimate a semi-dense inverse depth map for the current frame, which in turn is used to track the motion of the camera using dense image alignment. More specifically, we estimate the depth of all pixels which have a non-negligible image gradient. Each estimate is represented as a Gaussian probability distribution over the inverse depth. We propagate this information over time, and update it with new measurements as new images arrive. In terms of tracking accuracy and computational speed, the proposed method compares favorably to both state-of-the-art dense and feature-based visual odometry and SLAM algorithms. As our method runs in real time on a CPU, it is of large practical value for robotics and augmented reality applications. 1. Towards Dense Monocular Visual Odometry Tracking a hand-held camera and recovering the three-dimensional structure of the environment in real time is among the most prominent challenges in computer vision. In recent years, dense approaches to these challenges have become increasingly popular: Instead of operating solely on visual feature positions, they reconstruct and track on the whole image using a surface-based map and thereby are fundamentally different from feature-based approaches. Yet, these methods are to date either not real-time capable on standard CPUs [11, 15, 17] or require direct depth measurements from the sensor [7], making them unsuitable for many practical applications. In this paper, we propose a novel semi-dense visual odometry approach for a monocular camera, which combines the accuracy and robustness of dense approaches with the efficiency of feature-based methods.
Further, it computes highly accurate semi-dense depth maps from the monocular images, providing rich information about the 3D structure of the environment. (∗ This work was supported by the ERC Starting Grant ConvexVision and the DFG project Mapping on Demand.) Figure 1. Semi-Dense Monocular Visual Odometry: Our approach works on a semi-dense inverse depth map and combines the accuracy and robustness of dense visual SLAM methods with the efficiency of feature-based techniques. Left: video frame, Right: color-coded semi-dense depth map, which consists of depth estimates in all image regions with sufficient structure. We use the term visual odometry as opposed to SLAM, as for simplicity we deliberately maintain only information about the currently visible scene, instead of building a global world-model. 1.1. Related Work Feature-based monocular SLAM. In all feature-based methods (such as [4, 8]), tracking and mapping consists of two separate steps: First, discrete feature observations (i.e., their locations in the image) are extracted and matched to each other. Second, the camera and the full feature poses are calculated from a set of such observations, disregarding the images themselves. While this preliminary abstraction step greatly reduces the complexity of the overall problem and allows it to be tackled in real time, it inherently comes with two significant drawbacks: First, only image information conforming to the respective feature type and parametrization (typically image corners and blobs [6] or line segments [9]) is utilized. Second, features have to be matched to each other, which often requires the costly computation of scale- and rotation-invariant descriptors and robust outlier estimation methods like RANSAC. Dense monocular SLAM. To overcome these limitations and to better exploit the available image information, dense monocular SLAM methods [11, 17] have recently been proposed.
The fundamental difference to keypoint-based approaches is that these methods directly work on the images instead of a set of extracted features, for both mapping and tracking: The world is modeled as a dense surface while in turn new frames are tracked using whole-image alignment. This concept removes the need for discrete features, and allows to exploit all information present in the image, increasing tracking accuracy and robustness. To date however, doing this in real time is only possible using modern, powerful GPU processors. Similar methods are broadly used in combination with RGB-D cameras [7], which directly measure the depth of each pixel, or stereo camera rigs [3], greatly reducing the complexity of the problem. Dense multi-view stereo. Significant prior work exists on multi-view dense reconstruction, both in a real-time setting [13, 11, 15], as well as off-line [5, 14]. In particular for offline reconstruction, there is a long history of using different baselines to steer the stereo-inherent trade-off between accuracy and precision [12]. Most similar to our approach is the early work of Matthies et al., who proposed probabilistic depth map fusion and propagation for image sequences [10], however only for structure from motion, i.e., not coupled with subsequent dense tracking. 1.2. Contributions In this paper, we propose a novel semi-dense approach to monocular visual odometry, which does not require feature points. The key concepts are • a probabilistic depth map representation, • tracking based on whole-image alignment, • the reduction on image-regions which carry information (semi-dense), and • the full incorporation of stereo measurement uncertainty. To the best of our knowledge, this is the first featureless, real-time monocular visual odometry approach, which runs in real time on a CPU. 1.3.
Method Outline Our approach is partially motivated by the basic principle that for most real-time applications, video information is abundant and cheap to come by. Therefore, the computational budget should be spent such that the expected information gain is maximized. Instead of reducing the images to a sparse set of feature observations however, our method continuously estimates a semi-dense inverse depth map for the current frame, i.e., a dense depth map covering all image regions with non-negligible gradient (see Fig. 2). It is comprised of one inverse depth hypothesis per pixel modeled by a Gaussian probability distribution. This representation still allows to use whole-image alignment [7] to track new frames, while at the same time greatly reducing computational complexity compared to volumetric methods. Figure 2 (panels: original image; semi-dense depth map (ours); keypoint depth map [8]; dense depth map [1 ]; RGB-D camera [16]). Semi-Dense Approach: Our approach reconstructs and tracks on a semi-dense inverse depth map, which is dense in all image regions carrying information (top-right). For comparison, the bottom row shows the respective result from a keypoint-based approach, a fully dense approach and the ground truth from an RGB-D camera. The estimated depth map is propagated from frame to frame, and updated with variable-baseline stereo comparisons. We explicitly use prior knowledge about a pixel’s depth to select a suitable reference frame on a per-pixel basis, and to limit the disparity search range. The remainder of this paper is organized as follows: Section 2 describes the semi-dense mapping part of the proposed method, including the derivation of the observation accuracy as well as the probabilistic data fusion, propagation and regularization steps. Section 3 describes how new frames are tracked using whole-image alignment, and Sec. 4 summarizes the complete visual odometry method. A qualitative as well as a quantitative evaluation is presented in Sec. 5.
We then give a brief conclusion in Sec. 6. 2. Semi-Dense Depth Map Estimation One of the key ideas proposed in this paper is to estimate a semi-dense inverse depth map for the current camera image, which in turn can be used for estimating the camera pose of the next frame. This depth map is continuously propagated from frame to frame, and refined with new stereo depth measurements, which are obtained by performing per-pixel, adaptive-baseline stereo comparisons. This allows us to accurately estimate the depth both of close-by and far-away image regions. In contrast to previous work that accumulates the photometric cost over a sequence of several frames [11, 15], we keep exactly one inverse depth hypothesis per pixel that we represent as a Gaussian probability distribution. Figure 3 (panels: reference, small baseline, medium baseline, large baseline; plot of matching cost versus inverse depth d for the small, medium and large baselines). Variable Baseline Stereo: Reference image (left), three stereo images at different baselines (right), and the respective matching cost functions. While a small baseline (black) gives a unique, but imprecise minimum, a large baseline (red) allows for a very precise estimate, but has many false minima. This section is comprised of three main parts: Section 2.1 describes the stereo method used to extract new depth measurements from previous frames, and how they are incorporated into the prior depth map. In Sec. 2.2, we describe how the depth map is propagated from frame to frame. In Sec. 2.3, we detail how we partially regularize the obtained depth map in each iteration, and how outliers are handled. Throughout this section, d denotes the inverse depth of a pixel. 2.1. Stereo-Based Depth Map Update It is well known [12] that for stereo, there is a trade-off between precision and accuracy (see Fig. 3).
While many multiple-baseline stereo approaches resolve this by accumulating the respective cost functions over many frames [5, 13], we propose a probabilistic approach which explicitly takes advantage of the fact that in a video, small-baseline frames are available before large-baseline frames.

The full depth map update (performed once for each new frame) consists of the following steps: First, a subset of pixels is selected for which the accuracy of a disparity search is sufficiently large. For this we use three intuitive and very efficiently computable criteria, which will be derived in Sec. 2.1.3. For each selected pixel, we then individually select a suitable reference frame, and perform a one-dimensional disparity search. Propagated prior knowledge is used to reduce the disparity search range when possible, decreasing computational cost and eliminating false minima. The obtained inverse depth estimate is then fused into the depth map.

2.1.1 Reference Frame Selection

Ideally, the reference frame is chosen such that it maximizes the stereo accuracy, while keeping the disparity search range as well as the observation angle sufficiently small.

Figure 4. Adaptive Baseline Selection: For each pixel in the new frame (top left), a different stereo-reference frame is selected, based on how long the pixel was visible (top right: the more yellow, the older the pixel). Some of the reference frames are displayed below; the red regions were used for stereo comparisons. Panels: current frame; pixel’s “age”; reference frames at -4.8 s, -3.9 s, -3.1 s, -2.2 s, -1.2 s, -0.8 s, -0.5 s and -0.4 s.

As the stereo accuracy depends on many factors and because this selection is done for each pixel independently, we employ the following heuristic: We use the oldest frame the pixel was observed in, where the disparity search range and the observation angle do not exceed a certain threshold (see Fig. 4).
If a disparity search is unsuccessful (i.e., no good match is found), the pixel’s “age” is increased, such that subsequent disparity searches use newer frames where the pixel is likely to be still visible.

2.1.2 Stereo Matching Method

We perform an exhaustive search for the pixel’s intensity along the epipolar line in the selected reference frame, and then perform a sub-pixel accurate localization of the matching disparity. If a prior inverse depth hypothesis is available, the search interval is limited by $d \pm 2\sigma_d$, where $d$ and $\sigma_d$ denote the mean and standard deviation of the prior hypothesis. Otherwise, the full disparity range is searched.

In our implementation, we use the SSD error over five equidistant points on the epipolar line. While this significantly increases robustness in high-frequency image regions, it does not change the purely one-dimensional nature of this search. Furthermore, it is computationally efficient, as 4 out of 5 interpolated image values can be re-used for each SSD evaluation.

2.1.3 Uncertainty Estimation

In this section, we use uncertainty propagation to derive an expression for the error variance $\sigma_d^2$ on the inverse depth $d$. In general this can be done by expressing the optimal inverse depth $d^*$ as a function of the noisy inputs. Here we consider the images $I_0$, $I_1$ themselves, their relative orientation $\xi$ and the camera calibration in terms of a projection function $\pi$ (footnote 1):

$$d^* = d(I_0, I_1, \xi, \pi). \qquad (1)$$

The error variance of $d^*$ is then given by

$$\sigma_d^2 = J_d \, \Sigma \, J_d^T, \qquad (2)$$

where $J_d$ is the Jacobian of $d$, and $\Sigma$ the covariance of the input error. For more details on covariance propagation, including the derivation of this formula, we refer to [2]. For simplicity, the following analysis is performed for patch-free stereo, i.e., we consider only a point-wise search for a single intensity value along the epipolar line.
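The five-point SSD search of Sec. 2.1.2 can be illustrated with a short Python sketch. It is not the authors' implementation: the function name `ssd_disparity_search`, the assumption that both intensity profiles are already sampled at the same spacing, and the index-based search range (standing in for the interval $d \pm 2\sigma_d$) are all hypothetical choices made for illustration. Sub-pixel refinement is omitted.

```python
import numpy as np

def ssd_disparity_search(ref_samples, line_samples, lo=0, hi=None):
    """Exhaustive 1-D SSD search along a sampled epipolar line.

    ref_samples:  intensities at the equidistant points around the pixel
                  in the new frame (five in the paper's setting).
    line_samples: intensities sampled at the same spacing along the
                  epipolar line of the reference frame.
    lo, hi:       inclusive index range to search (hypothetical stand-in
                  for the prior-limited interval d +/- 2*sigma_d).
    Returns (best_index, best_cost); (-1, inf) if the range is empty.
    """
    ref = np.asarray(ref_samples, dtype=float)
    line = np.asarray(line_samples, dtype=float)
    n = ref.size
    last = line.size - n                      # last valid window start
    hi = last if hi is None else min(hi, last)
    best_idx, best_cost = -1, np.inf
    for k in range(max(lo, 0), hi + 1):
        # Neighbouring windows overlap in n-1 samples, which is why a real
        # implementation only needs one newly interpolated value per step.
        cost = float(np.sum((line[k:k + n] - ref) ** 2))
        if cost < best_cost:
            best_idx, best_cost = k, cost
    return best_idx, best_cost
```

Restricting `lo` and `hi` shrinks the loop and can exclude false minima outside the prior interval, which mirrors the role of the propagated prior described in the text.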
For this analysis, we split the computation into three steps: First, the epipolar line in the reference frame is computed. Second, the best matching position $\lambda^* \in \mathbb{R}$ along it (i.e., the disparity) is determined. Third, the inverse depth $d^*$ is computed from the disparity $\lambda^*$. The first two steps involve two independent error sources: the geometric error, which originates from noise on $\xi$ and $\pi$ and affects the first step, and the photometric error, which originates from noise in the images $I_0$, $I_1$ and affects the second step. The third step scales these errors by a factor, which depends on the baseline.

Footnote 1: In the linear case, this is the camera matrix $K$; in practice however, nonlinear distortion and other (unmodeled) effects also play a role.

Geometric disparity error. The geometric error is the error $\epsilon_\lambda$ on the disparity $\lambda^*$ caused by noise on $\xi$ and $\pi$. While it would be possible to model, propagate, and estimate the complete covariance on $\xi$ and $\pi$, we found that the gain in accuracy does not justify the increase in computational complexity. We therefore use an intuitive approximation: Let the considered epipolar line segment $L \subset \mathbb{R}^2$ be defined by

$$L := \left\{ l_0 + \lambda \begin{pmatrix} l_x \\ l_y \end{pmatrix} \;\middle|\; \lambda \in S \right\}, \qquad (3)$$

where $\lambda$ is the disparity with search interval $S$, $(l_x, l_y)^T$ the normalized epipolar line direction and $l_0$ the point corresponding to infinite depth. We now assume that only the absolute position of this line segment, i.e., $l_0$, is subject to isotropic Gaussian noise $\epsilon_l$. As in practice we keep the searched epipolar line segments short, the influence of rotational error is small, making this a good approximation.

Intuitively, a positioning error $\epsilon_l$ on the epipolar line causes a small disparity error $\epsilon_\lambda$ if the epipolar line is parallel to the image gradient, and a large one otherwise (see Fig. 5). This can be mathematically derived as follows: The image constrains the optimal disparity $\lambda^*$ to lie on a certain isocurve, i.e., a curve of equal intensity. We approximate this isocurve to be locally linear, i.e., the gradient direction to be locally constant. This gives

$$l_0 + \lambda^* \begin{pmatrix} l_x \\ l_y \end{pmatrix} \stackrel{!}{=} g_0 + \gamma \begin{pmatrix} -g_y \\ g_x \end{pmatrix}, \quad \gamma \in \mathbb{R}, \qquad (4)$$

where $g := (g_x, g_y)^T$ is the image gradient and $g_0$ a point on the isoline. The influence of noise on the image values will be derived in the next paragraph, hence at this point $g$ and $g_0$ are assumed noise-free. Solving for $\lambda$ gives the optimal disparity $\lambda^*$ in terms of the noisy input $l_0$:

$$\lambda^*(l_0) = \frac{\langle g, g_0 - l_0 \rangle}{\langle g, l \rangle}. \qquad (5)$$

Analogously to (2), the variance of the geometric disparity error can then be expressed as

$$\sigma^2_{\lambda(\xi,\pi)} = J_{\lambda^*(l_0)} \begin{pmatrix} \sigma_l^2 & 0 \\ 0 & \sigma_l^2 \end{pmatrix} J^T_{\lambda^*(l_0)} = \frac{\sigma_l^2}{\langle g, l \rangle^2}, \qquad (6)$$

where $g$ is the normalized image gradient, $l$ the normalized epipolar line direction and $\sigma_l^2$ the variance of $\epsilon_l$. Note that this error term solely originates from noise on the relative camera orientation $\xi$ and the camera calibration $\pi$, i.e., it is independent of image intensity noise.

Figure 5. Geometric Disparity Error: Influence of a small positioning error $\epsilon_l$ of the epipolar line on the disparity error $\epsilon_\lambda$. The dashed line represents the isocurve on which the matching point has to lie. $\epsilon_\lambda$ is small if the epipolar line is parallel to the image gradient (left), and large otherwise (right).

Photometric disparity error. Intuitively, this error encodes that small image intensity errors have a large effect on the estimated disparity if the image gradient is small, and a small effect otherwise (see Fig. 6). Mathematically, this relation can be derived as follows. We seek the disparity $\lambda^*$ that minimizes the difference in intensities, i.e.,

$$\lambda^* = \arg\min_\lambda \left( i_{ref} - I_p(\lambda) \right)^2, \qquad (7)$$

where $i_{ref}$ is the reference intensity, and $I_p(\lambda)$ the image intensity on the epipolar line at disparity $\lambda$. We assume a good initialization $\lambda_0$ to be available from the exhaustive search. Using a first-order Taylor approximation for $I_p$ gives

$$\lambda^*(I) = \lambda_0 + \left( i_{ref} - I_p(\lambda_0) \right) g_p^{-1}, \qquad (8)$$

where $g_p$ is the gradient of $I_p$, that is, the image gradient along the epipolar line.
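Returning briefly to the geometric error: Eqs. (5) and (6) are simple enough to check numerically. The following sketch is only an illustration under stated assumptions; the function names are hypothetical, and gradient and line direction are normalized inside the variance function so that callers may pass unnormalized vectors.

```python
import numpy as np

def optimal_disparity(g, g0, l, l0):
    """Eq. (5): lambda* = <g, g0 - l0> / <g, l>."""
    g, g0, l, l0 = (np.asarray(v, dtype=float) for v in (g, g0, l, l0))
    return float(np.dot(g, g0 - l0) / np.dot(g, l))

def geometric_disparity_variance(g, l, sigma_l2):
    """Eq. (6): sigma_l^2 / <g, l>^2 for unit-length g and l."""
    g = np.asarray(g, dtype=float)
    l = np.asarray(l, dtype=float)
    g = g / np.linalg.norm(g)   # normalized image gradient
    l = l / np.linalg.norm(l)   # normalized epipolar line direction
    return float(sigma_l2 / np.dot(g, l) ** 2)
```

When the epipolar line is parallel to the gradient, the variance equals $\sigma_l^2$; at an angle of 60 degrees it is already four times larger, and it grows without bound as the two directions approach orthogonality. That degenerate case is not handled in this sketch.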
For clarity we only consider noise on $i_{ref}$ and $I_p(\lambda_0)$; equivalent results are obtained in the general case when taking into account noise on the image values involved in the computation of $g_p$. The variance of the photometric disparity error is given by

$$\sigma^2_{\lambda(I)} = J_{\lambda^*(I)} \begin{pmatrix} \sigma_i^2 & 0 \\ 0 & \sigma_i^2 \end{pmatrix} J^T_{\lambda^*(I)} = \frac{2\sigma_i^2}{g_p^2}, \qquad (9)$$

where $\sigma_i^2$ is the variance of the image intensity noise. The respective error originates solely from noisy image intensity values, and hence is independent of the geometric disparity error.

Figure 6. Photometric Disparity Error: Noise $\epsilon_i$ on the image intensity values causes a small disparity error $\epsilon_\lambda$ if the image gradient along the epipolar line is large (left). If the gradient is small, the disparity error is magnified (right).

Pixel to inverse depth conversion. Using that, for small camera rotation, the inverse depth $d$ is approximately proportional to the disparity $\lambda$, the observation variance of the inverse depth $\sigma^2_{d,obs}$ can be calculated using

$$\sigma^2_{d,obs} = \alpha^2 \left( \sigma^2_{\lambda(\xi,\pi)} + \sigma^2_{\lambda(I)} \right), \qquad (10)$$

where the proportionality constant $\alpha$ (in the general, non-rectified case) is different for each pixel, and can be calculated from

$$\alpha := \frac{\delta_d}{\delta_\lambda}, \qquad (11)$$

where $\delta_d$ is the length of the searched inverse depth interval, and $\delta_\lambda$ the length of the searched epipolar line segment. While $\alpha$ is inversely linear in the length of the camera translation, it also depends on the translation direction and the pixel’s location in the image.

When using an SSD error over multiple points along the epipolar line (as our implementation does), a good upper bound for the matching uncertainty is then given by

$$\sigma^2_{d,obs\text{-}SSD} \le \alpha^2 \left( \min\{\sigma^2_{\lambda(\xi,\pi)}\} + \min\{\sigma^2_{\lambda(I)}\} \right), \qquad (12)$$

where the min goes over all points included in the SSD error.

2.1.4 Depth Observation Fusion

After a depth observation for a pixel in the current image has been obtained, we integrate it into the depth map as follows: If no prior hypothesis for a pixel exists, we initialize it directly with the observation.
Otherwise, the new observation is incorporated into the prior, i.e., the two distributions are multiplied (corresponding to the update step in a Kalman filter): Given a prior distribution $\mathcal{N}(d_p, \sigma_p^2)$ and a noisy observation $\mathcal{N}(d_o, \sigma_o^2)$, the posterior is given by

$$\mathcal{N}\left( \frac{\sigma_p^2 d_o + \sigma_o^2 d_p}{\sigma_p^2 + \sigma_o^2}, \; \frac{\sigma_p^2 \sigma_o^2}{\sigma_p^2 + \sigma_o^2} \right). \qquad (13)$$

2.1.5 Summary of Uncertainty-Aware Stereo

New stereo observations are obtained on a per-pixel basis, adaptively selecting for each pixel a suitable reference frame and performing a one-dimensional search along the epipolar line. We identified the three major factors which determine the accuracy of such a stereo observation, i.e.,
• the photometric disparity error $\sigma^2_{\lambda(I)}$, depending on the magnitude of the image gradient along the epipolar line,
• the geometric disparity error $\sigma^2_{\lambda(\xi,\pi)}$, depending on the angle between the image gradient and the epipolar line (independent of the gradient magnitude), and
• the pixel to inverse depth ratio $\alpha$, depending on the camera translation, the focal length and the pixel’s position.

These three simple-to-compute and purely local criteria are used to determine for which pixel a stereo update is worth the computational cost. Further, the computed observation variance is then used to integrate the new measurements into the existing depth map.

2.2. Depth Map Propagation

We continuously propagate the estimated inverse depth map from frame to frame, once the camera position of the next frame has been estimated. Based on the inverse depth estimate $d_0$ for a pixel, the corresponding 3D point is calculated and projected into the new frame, providing an inverse depth estimate $d_1$ in the new frame. The hypothesis is then assigned to the closest integer pixel position; to eliminate discretization errors, the sub-pixel accurate image location of the projected point is kept, and re-used for the next propagation step.
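The observation variance of Eqs. (9) and (10) and the fusion rule of Eq. (13) from Sec. 2.1 reduce to a few arithmetic operations. The sketch below is an illustration only; the function names are hypothetical, and all variances are assumed to be strictly positive.

```python
def photometric_disparity_variance(g_p, sigma_i2):
    """Eq. (9): 2 * sigma_i^2 / g_p^2, with g_p the gradient along the epipolar line."""
    return 2.0 * sigma_i2 / (g_p ** 2)

def observation_variance(alpha, var_geometric, var_photometric):
    """Eq. (10): scale the summed disparity variances by alpha^2."""
    return alpha ** 2 * (var_geometric + var_photometric)

def fuse(d_p, var_p, d_o, var_o):
    """Eq. (13): product of the prior N(d_p, var_p) and the observation N(d_o, var_o)."""
    s = var_p + var_o
    d = (var_p * d_o + var_o * d_p) / s   # each mean is weighted by the other's variance
    var = (var_p * var_o) / s             # always smaller than both input variances
    return d, var
```

With equal variances the fused mean is the midpoint and the variance halves. A very uncertain observation (large `var_o`) leaves the prior almost unchanged, which is the behaviour the text describes as the Kalman filter update step.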
For propagating the inverse depth variance, we assume the camera rotation to be small. The new inverse depth $d_1$ can then be approximated by

$$d_1(d_0) = \left( d_0^{-1} - t_z \right)^{-1}, \qquad (14)$$

where $t_z$ is the camera translation along the optical axis. The variance of $d_1$ is hence given by

$$\sigma^2_{d_1} = J_{d_1} \sigma^2_{d_0} J^T_{d_1} + \sigma_p^2 = \left( \frac{d_1}{d_0} \right)^4 \sigma^2_{d_0} + \sigma_p^2, \qquad (15)$$

where $\sigma_p^2$ is the prediction uncertainty, which directly corresponds to the prediction step in an extended Kalman filter. It can also be interpreted as keeping the variance on the z-coordinate of a point fixed, i.e., setting $\sigma^2_{z_0} = \sigma^2_{z_1}$. We found that using small values for $\sigma_p^2$ decreases drift, as it causes the estimated geometry to gradually “lock” into place.

Figure 7. [...] in the top right shows the new frame $I_2(x)$ without depth information. Middle: Intermediate steps while minimizing $E(\xi)$ on different pyramid levels. The top row shows the back-warped new frame $I_2(w(x, d, \xi))$, the bottom row shows the respective residual image $I_2(w(x, d, \xi)) - I_1(x)$. The bottom right image shows the final pixel-weights (black = small weight). Small weights mainly correspond to newly occluded or disoccluded pixels.

Collision handling. At all times, we allow at most one inverse depth hypothesis per pixel: If two inverse depth hypotheses are propagated to the same pixel in the new frame, we distinguish between two cases:
1. if they are statistically similar, i.e., lie within $2\sigma$ bounds, they are treated as two independent observations of the pixel’s depth and fused according to (13).
2. otherwise, the point that is further away from the camera is assumed to be occluded, and is removed.

2.3. Depth Map Regularization

For each frame, after all observations have been incorporated, we perform one regularization iteration by assigning each inverse depth value the average of the surrounding inverse depths, weighted by their respective inverse variance.
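The propagation step of Eqs. (14) and (15) and the two collision cases of Sec. 2.2 can be sketched as follows. This is an illustration rather than the authors' code: the function names are hypothetical, and the similarity test (difference within two standard deviations of the less certain hypothesis) is an assumption, since the text only says "within $2\sigma$ bounds".

```python
import math

def propagate_inverse_depth(d0, var0, t_z, var_pred):
    """Eqs. (14) and (15): move an inverse depth hypothesis to the next frame."""
    d1 = 1.0 / (1.0 / d0 - t_z)                  # Eq. (14)
    var1 = (d1 / d0) ** 4 * var0 + var_pred      # Eq. (15)
    return d1, var1

def resolve_collision(a, b):
    """Resolve two hypotheses (d, var) that land on the same pixel.

    Assumption: 'statistically similar' means |d_a - d_b| <= 2 * sigma,
    with sigma taken from the less certain of the two hypotheses.
    """
    (d_a, var_a), (d_b, var_b) = a, b
    if abs(d_a - d_b) <= 2.0 * math.sqrt(max(var_a, var_b)):
        # Case 1: fuse as independent observations, as in Eq. (13).
        s = var_a + var_b
        return (var_a * d_b + var_b * d_a) / s, (var_a * var_b) / s
    # Case 2: the farther point (smaller inverse depth) is assumed occluded.
    return a if d_a > d_b else b
```

For a point at depth 2 (inverse depth 0.5) and a forward translation of 1, the new inverse depth is 1 and the variance is scaled by $(d_1/d_0)^4 = 16$ before the prediction uncertainty is added.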
To preserve sharp edges, if two adjacent inverse depth values are statistically different, i.e., are further away than $2\sigma$, they do not contribute to one another. Note that the respective variances are not changed during regularization to account for the high correlation between neighboring hypotheses. Instead we use the minimal variance of all neighboring pixels when defining the stereo search range, and as a weighting factor for tracking (see Sec. 3).

Outlier removal. To handle outliers, we continuously keep track of the validity of each inverse depth hypothesis in terms of the probability that it is an outlier, or has become invalid (e.g., due to occlusion or a moving object). For each successful stereo observation, this probability is decreased. It is increased for each failed stereo search, if the respective intensity changes significantly on propagation, or when the absolute image gradient falls below a given threshold.

If, during regularization, the probability that all contributing neighbors are outliers (i.e., the product of their individual outlier-probabilities) rises above a given threshold, the hypothesis is removed. Equally, if for an “empty” pixel this product drops below a given threshold, a new hypothesis is created from the neighbors. This fills holes arising from the forward-warping nature of the propagation step, and dilates the semi-dense depth map to a small neighborhood around sharp image intensity edges, which significantly increases tracking and mapping robustness.

3. Dense Tracking

Based on the inverse depth map of the previous frame, we estimate the camera pose of the current frame using dense image alignment. Such methods have previously been applied successfully (in real-time on a CPU) for tracking RGB-D cameras [7], which directly provide dense depth measurements along with the color image.
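As a brief aside on Sec. 2.3, one regularization iteration can be sketched as below. Several details are assumptions made for illustration and are not stated in the text: a 3x3 neighborhood that includes the pixel itself, missing hypotheses encoded as NaN, and the $2\sigma$ test using the standard deviation of the center pixel. The name `regularize_once` is hypothetical.

```python
import numpy as np

def regularize_once(d, var):
    """One smoothing pass over an inverse depth map.

    d, var: 2-D float arrays of inverse depths and variances (NaN in d = no
    hypothesis). Returns the smoothed inverse depths; var is left unchanged,
    as described in the text.
    """
    h, w = d.shape
    out = d.copy()
    for y in range(h):
        for x in range(w):
            if np.isnan(d[y, x]):
                continue
            two_sigma = 2.0 * np.sqrt(var[y, x])
            num = den = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if not (0 <= yy < h and 0 <= xx < w) or np.isnan(d[yy, xx]):
                        continue
                    # Edge preservation: statistically different neighbors
                    # do not contribute.
                    if abs(d[yy, xx] - d[y, x]) > two_sigma:
                        continue
                    weight = 1.0 / var[yy, xx]   # inverse-variance weighting
                    num += weight * d[yy, xx]
                    den += weight
            out[y, x] = num / den                # center pixel always contributes
    return out
```

Because the center pixel always passes its own $2\sigma$ test, the denominator is never zero, and an isolated hypothesis is simply returned unchanged.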
Dense image alignment is based on the direct minimization of the photometric error

$$r_i(\xi) := \left( I_2(w(x_i, d_i, \xi)) - I_1(x_i) \right)^2, \qquad (16)$$

where the warp function $w : \Omega_1 \times \mathbb{R} \times \mathbb{R}^6 \to \Omega_2$ maps each point $x_i \in \Omega_1$ in the reference image $I_1$ to the respective point $w(x_i, d_i, \xi) \in \Omega_2$ in the new image $I_2$. As input it requires the 3D pose of the camera $\xi \in \mathbb{R}^6$ and uses the estimated inverse depth $d_i \in \mathbb{R}$ for the pixel in $I_1$. Note that no depth information with respect to $I_2$ is required.

To increase robustness to self-occlusion and moving objects, we apply a weighting scheme as proposed in [7]. Further, we add the variance of the inverse depth $\sigma^2_{d_i}$ as an additional weighting term, making the tracking resistant to recently initialized and still inaccurate depth estimates from the mapping process. The final energy that is minimized is hence given by

$$E(\xi) := \sum_i \frac{\alpha(r_i(\xi))}{\sigma^2_{d_i}} \, r_i(\xi), \qquad (17)$$

where $\alpha : \mathbb{R} \to \mathbb{R}$ defines the weight for a given residual. Minimizing this error can be interpreted as computing the maximum likelihood estimator for $\xi$, assuming independent noise on the image intensity values. The resulting weighted least-squares problem is solved efficiently using an iteratively reweighted Gauss-Newton algorithm coupled with a coarse-to-fine approach, using four pyramid levels. Figure 7 shows an example of the tracking process. For further details on the minimization we refer to [1].

Figure 8. Examples: Top: Camera images overlaid with the respective estimated semi-dense inverse depth map. Bottom: 3D view of tracked scene. Note the versatility of our approach: It accurately reconstructs and tracks through (outside) scenes with a large depth variance, including far-away objects like clouds, as well as (indoor) scenes with little structure and close to no image corners / keypoints. More examples are shown in the attached video.

4. System Overview

Tracking and depth estimation are split into two separate threads: One continuously propagates the inverse depth map to the most recent tracked frame, updates it with stereo comparisons and partially regularizes it. The other simultaneously tracks each incoming frame on the most recent available depth map. While tracking is performed in real-time at 30 Hz, one complete mapping iteration takes longer and is hence done at roughly 15 Hz. If the map is heavily populated, we adaptively reduce the number of stereo comparisons to maintain a constant frame-rate. For stereo observations, a buffer of up to 100 past frames is kept, automatically removing those that are used least.

We use a standard, keypoint-based method to obtain the relative camera pose between two initial frames, which are then used to initialize the inverse depth map needed for tracking successive frames. From this point onward, our method is entirely self-contained. In preliminary experiments, we found that in most cases our approach is even able to recover from random or extremely inaccurate initial depth maps, indicating that the keypoint-based initialization might become superfluous in the future.

Table 1. Results on RGB-D Benchmark

              position drift (cm/s)      rotation drift (deg/s)
              ours    [7]     [8]        ours    [7]     [8]
fr2/xyz       0.6     0.6     8.2        0.33    0.34    3.27
fr2/desk      2.1     2.0     -          0.65    0.70    -

5. Results

We have tested our approach both on publicly available benchmark sequences and live, using a hand-held camera. Some examples are shown in Fig. 8. Note that our method does not attempt to build a global map, i.e., once a point leaves the field of view of the camera or becomes occluded, the respective depth value is deleted. All experiments are performed on a standard consumer laptop with Intel i7 quad-core CPU. In a preprocessing step, we rectify all images such that a pinhole camera-model can be applied.

5.1. RGB-D Benchmark Sequences

As basis for a quantitative evaluation and to facilitate reproducibility and easy comparison with other methods, we use the TUM RGB-D benchmark [16]. For tracking and mapping we only use the gray-scale images; for the very first frame, however, the provided depth image is used as initialization.

Our method (like any monocular visual odometry method) fails in case of pure camera rotation, as the depth of new regions cannot be determined. The achieved tracking accuracy for two feasible sequences (that is, sequences which do not contain strong camera rotation without simultaneous translation) is given in Table 1. For comparison we also list the accuracy from (1) a state-of-the-art, dense RGB-D odometry [7], and (2) a state-of-the-art, keypoint-based monocular SLAM system (PTAM, [8]). We initialize PTAM using the built-in stereo initializer, and perform a 7DoF (rigid body plus scale) alignment to the ground truth trajectory. Figure 9 shows the tracked camera trajectory for fr2/desk. We found that our method achieves similar accuracy as [7], which uses the same dense tracking algorithm but relies on the Kinect depth images. The keypoint-based approach [8] proves to be significantly less accurate and robust; it consistently failed after a few seconds for the second sequence.

Figure 9. [...] camera trajectory (black), the depth map of the first frame (blue), and the estimated depth map (gray-scale) after a complete loop around the table. Note how well certain details such as the keyboard and monitor align.

5.2. Additional Test Sequences

To analyze our approach in more detail, we recorded additional challenging sequences with the corresponding ground truth trajectory in a motion capture studio. Figure 10 shows an extract from the video, as well as the tracked and the ground-truth camera position over time.
As can be seen from the figure, our approach is able to maintain a reasonably dense depth map at all times and the estimated camera trajectory closely matches the ground truth.

Figure 10. Additional Sequence: Estimated camera trajectory and ground truth (dashed) for a long and challenging sequence. The complete sequence is shown in the attached video. Plot: position [m] in x, y and z over time (0 s to 60 s).

6. Conclusion

In this paper we proposed a novel visual odometry method for a monocular camera, which does not require discrete features. In contrast to previous work on dense tracking and mapping, our approach is based on probabilistic depth map estimation and fusion over time. Depth measurements are obtained from patch-free stereo matching in different reference frames at a suitable baseline, which are selected on a per-pixel basis. To our knowledge, this is the first featureless monocular visual odometry method which runs in real-time on a CPU. In our experiments, we showed that the tracking performance of our approach is comparable to that of fully dense methods without requiring a depth sensor.

References

[1] S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying framework. Technical report, Carnegie Mellon Univ., 2002.
[2] A. Clifford. Multivariate Error Analysis. John Wiley & Sons, 1973.
[3] A. Comport, E. Malis, and P. Rives. Accurate quadri-focal tracking for robust 3d visual odometry. In ICRA, 2007.
[4] A. Davison, I. Reid, N. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 29, 2007.
[5] D. Gallup, J. Frahm, P. Mordohai, and M. Pollefeys. Variable baseline/resolution stereo. In CVPR, 2008.
[6] C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision Conference, 1988.
[7] C. Kerl, J. Sturm, and D. Cremers. Robust odometry estimation for RGB-D cameras. In ICRA, 2013.
[8] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Mixed and Augmented Reality (ISMAR), 2007.
[9] G. Klein and D. Murray. Improving the agility of keyframe-based SLAM. In ECCV, 2008.
[10] M. Pollefeys et al. Detailed real-time urban 3d reconstruction from video. IJCV, 78(2-3):143–167, 2008.
[11] L. Matthies, R. Szeliski, and T. Kanade. Incremental estimation of dense depth maps from image sequences. In CVPR, 1988.
[12] R. Newcombe, S. Lovegrove, and A. Davison. DTAM: Dense tracking and mapping in real-time. In ICCV, 2011.
[13] M. Okutomi and T. Kanade. A multiple-baseline stereo. Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 15(4):353–363, 1993.
[14] T. Sato, M. Kanbara, N. Yokoya, and H. Takemura. Dense 3-d reconstruction of an outdoor scene by hundreds-baseline stereo using a hand-held camera. IJCV, 47:1–3, 2002.
[15] J. Stuehmer, S. Gumhold, and D. Cremers. Real-time dense geometry from a handheld camera. In Pattern Recognition (DAGM), 2010.
[16] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Intelligent Robot Systems (IROS), 2012.
[17] A. Wendel, M. Maurer, G. Graber, T. Pock, and H. Bischof. Dense reconstruction on-the-fly. In ECCV, 2012.
6 0.1502254 28 iccv-2013-A Rotational Stereo Model Based on XSlit Imaging
7 0.13729203 271 iccv-2013-Modeling the Calibration Pipeline of the Lytro Camera for High Quality Light-Field Image Reconstruction
8 0.13009268 317 iccv-2013-Piecewise Rigid Scene Flow
9 0.098630264 108 iccv-2013-Depth from Combining Defocus and Correspondence Using Light-Field Cameras
10 0.082193285 250 iccv-2013-Lifting 3D Manhattan Lines from a Single Image
11 0.079710424 88 iccv-2013-Constant Time Weighted Median Filtering for Stereo Matching and Beyond
12 0.07626655 12 iccv-2013-A General Dense Image Matching Framework Combining Direct and Feature-Based Costs
13 0.075242177 343 iccv-2013-Real-World Normal Map Capture for Nearly Flat Reflective Surfaces
14 0.075066514 319 iccv-2013-Point-Based 3D Reconstruction of Thin Objects
15 0.074556142 348 iccv-2013-Refractive Structure-from-Motion on Underwater Images
16 0.07244046 281 iccv-2013-Multi-view Normal Field Integration for 3D Reconstruction of Mirroring Objects
17 0.060686596 226 iccv-2013-Joint Subspace Stabilization for Stereoscopic Video
18 0.059901662 160 iccv-2013-Fast Object Segmentation in Unconstrained Video
19 0.057780195 402 iccv-2013-Street View Motion-from-Structure-from-Motion
20 0.056774642 57 iccv-2013-BOLD Features to Detect Texture-less Objects
topicId topicWeight
[(0, 0.144), (1, -0.143), (2, -0.034), (3, 0.047), (4, -0.023), (5, 0.019), (6, -0.016), (7, -0.126), (8, 0.016), (9, -0.025), (10, -0.001), (11, 0.012), (12, 0.035), (13, 0.032), (14, -0.006), (15, -0.074), (16, -0.102), (17, 0.008), (18, -0.005), (19, 0.111), (20, 0.055), (21, 0.03), (22, -0.206), (23, 0.016), (24, 0.062), (25, 0.025), (26, -0.109), (27, -0.162), (28, -0.009), (29, 0.087), (30, -0.009), (31, 0.098), (32, -0.008), (33, -0.178), (34, 0.145), (35, 0.142), (36, -0.195), (37, -0.074), (38, 0.008), (39, 0.022), (40, 0.084), (41, 0.098), (42, -0.037), (43, -0.043), (44, 0.067), (45, -0.035), (46, 0.015), (47, 0.116), (48, 0.013), (49, -0.065)]
simIndex simValue paperId paperTitle
same-paper 1 0.9483285 252 iccv-2013-Line Assisted Light Field Triangulation and Stereo Matching
Author: Zhan Yu, Xinqing Guo, Haibing Lin, Andrew Lumsdaine, Jingyi Yu
Abstract: Light fields are image-based representations that use densely sampled rays as a scene description. In this paper, we explore geometric structures of 3D lines in ray space for improving light field triangulation and stereo matching. The triangulation problem aims to fill in the ray space with continuous and non-overlapping simplices anchored at sampled points (rays). Such a triangulation provides a piecewise-linear interpolant useful for light field superresolution. We show that the light field space is largely bilinear due to 3D line segments in the scene, and direct triangulation of these bilinear subspaces leads to large errors. We instead present a simple but effective algorithm to first map bilinear subspaces to line constraints and then apply Constrained Delaunay Triangulation (CDT). Based on our analysis, we further develop a novel line-assisted graphcut (LAGC) algorithm that effectively encodes 3D line constraints into light field stereo matching. Experiments on synthetic and real data show that both our triangulation and LAGC algorithms outperform state-of-the-art solutions in accuracy and visual quality.
2 0.79196721 28 iccv-2013-A Rotational Stereo Model Based on XSlit Imaging
Author: Jinwei Ye, Yu Ji, Jingyi Yu
Abstract: Traditional stereo matching assumes perspective viewing cameras under a translational motion: the second camera is translated away from the first one to create parallax. In this paper, we investigate a different, rotational stereo model on a special multi-perspective camera, the XSlit camera [9, 24]. We show that rotational XSlit (R-XSlit) stereo can be effectively created by fixing the sensor and slit locations but switching the two slits’ directions. We first derive the epipolar geometry of R-XSlit in the 4D light field ray space. Our derivation leads to a simple but effective scheme for locating corresponding epipolar “curves”. To conduct stereo matching, we further derive a new disparity term in our model and develop a patch-based graph-cut solution. To validate our theory, we assemble an XSlit lens by using a pair of cylindrical lenses coupled with slit-shaped apertures. The XSlit lens can be mounted on commodity cameras where the slit directions are adjustable to form desirable R-XSlit pairs. We show through experiments that R-XSlit provides a potentially advantageous imaging system for conducting fixed-location, dynamic baseline stereo.
3 0.7256 304 iccv-2013-PM-Huber: PatchMatch with Huber Regularization for Stereo Matching
Author: Philipp Heise, Sebastian Klose, Brian Jensen, Alois Knoll
Abstract: Most stereo correspondence algorithms match support windows at integer-valued disparities and assume a constant disparity value within the support window. The recently proposed PatchMatch stereo algorithm [7] overcomes this limitation of previous algorithms by directly estimating planes. This work presents a method that integrates the PatchMatch stereo algorithm into a variational smoothing formulation using quadratic relaxation. The resulting algorithm allows the explicit regularization of the disparity and normal gradients using the estimated plane parameters. Evaluation of our method in the Middlebury benchmark shows that our method outperforms the traditional integer-valued disparity strategy as well as the original algorithm and its variants in sub-pixel accurate disparity estimation.
4 0.70695758 423 iccv-2013-Towards Motion Aware Light Field Video for Dynamic Scenes
Author: Salil Tambe, Ashok Veeraraghavan, Amit Agrawal
Abstract: Current Light Field (LF) cameras offer fixed resolution in space, time and angle which is decided a-priori and is independent of the scene. These cameras either trade off spatial resolution to capture single-shot LF [20, 27, 12] or trade off temporal resolution by assuming a static scene to capture high spatial resolution LF [18, 3]. Thus, capturing high spatial resolution LF video for dynamic scenes remains an open and challenging problem. We present the concept, design and implementation of a LF video camera that allows capturing high resolution LF video. The spatial, angular and temporal resolution are not fixed a-priori and we exploit the scene-specific redundancy in space, time and angle. Our reconstruction is motion-aware and offers a continuum of resolution tradeoff with increasing motion in the scene. The key idea is (a) to design efficient multiplexing matrices that allow resolution tradeoffs, (b) use dictionary learning and sparse representations for robust reconstruction, and (c) perform local motion-aware adaptive reconstruction. We perform extensive analysis and characterize the performance of our motion-aware reconstruction algorithm. We show realistic simulations using a graphics simulator as well as real results using a LCoS based programmable camera. We demonstrate novel results such as high resolution digital refocusing for dynamic moving objects.
5 0.59053892 88 iccv-2013-Constant Time Weighted Median Filtering for Stereo Matching and Beyond
Author: Ziyang Ma, Kaiming He, Yichen Wei, Jian Sun, Enhua Wu
Abstract: Despite the continuous advances in local stereo matching for years, most efforts are on developing robust cost computation and aggregation methods. Little attention has been seriously paid to the disparity refinement. In this work, we study weighted median filtering for disparity refinement. We discover that with this refinement, even the simple box filter aggregation achieves comparable accuracy with various sophisticated aggregation methods (with the same refinement). This is due to the nice weighted median filtering properties of removing outlier error while respecting edges/structures. This reveals that the previously overlooked refinement can be at least as crucial as aggregation. We also develop the first constant time algorithm for the previously time-consuming weighted median filter. This makes the simple combination “box aggregation + weighted median” an attractive solution in practice for both speed and accuracy. As a byproduct, the fast weighted median filtering unleashes its potential in other applications that were hampered by high complexities. We show its superiority in various applications such as depth upsampling, clip-art JPEG artifact removal, and image stylization.
6 0.58582711 271 iccv-2013-Modeling the Calibration Pipeline of the Lytro Camera for High Quality Light-Field Image Reconstruction
7 0.51760447 322 iccv-2013-Pose Estimation and Segmentation of People in 3D Movies
8 0.48850214 382 iccv-2013-Semi-dense Visual Odometry for a Monocular Camera
9 0.46989301 255 iccv-2013-Local Signal Equalization for Correspondence Matching
10 0.39654863 407 iccv-2013-Subpixel Scanning Invariant to Indirect Lighting Using Quadratic Code Length
11 0.3885203 108 iccv-2013-Depth from Combining Defocus and Correspondence Using Light-Field Cameras
12 0.38763386 405 iccv-2013-Structured Light in Sunlight
13 0.35575712 348 iccv-2013-Refractive Structure-from-Motion on Underwater Images
14 0.34827921 226 iccv-2013-Joint Subspace Stabilization for Stereoscopic Video
15 0.33820277 284 iccv-2013-Multiview Photometric Stereo Using Planar Mesh Parameterization
16 0.33305481 385 iccv-2013-Separating Reflective and Fluorescent Components Using High Frequency Illumination in the Spectral Domain
17 0.31123051 343 iccv-2013-Real-World Normal Map Capture for Nearly Flat Reflective Surfaces
18 0.30836776 397 iccv-2013-Space-Time Tradeoffs in Photo Sequencing
19 0.30008754 317 iccv-2013-Piecewise Rigid Scene Flow
20 0.29721633 324 iccv-2013-Potts Model, Parametric Maxflow and K-Submodular Functions
topicId topicWeight
[(2, 0.072), (7, 0.018), (12, 0.012), (16, 0.014), (26, 0.073), (31, 0.037), (34, 0.012), (35, 0.013), (39, 0.013), (42, 0.061), (64, 0.033), (73, 0.044), (78, 0.296), (89, 0.148), (98, 0.052)]
simIndex simValue paperId paperTitle
1 0.78736031 69 iccv-2013-Capturing Global Semantic Relationships for Facial Action Unit Recognition
Author: Ziheng Wang, Yongqiang Li, Shangfei Wang, Qiang Ji
Abstract: In this paper we tackle the problem of facial action unit (AU) recognition by exploiting the complex semantic relationships among AUs, which carry crucial top-down information yet have not been thoroughly exploited. Towards this goal, we build a hierarchical model that combines the bottom-level image features and the top-level AU relationships to jointly recognize AUs in a principled manner. The proposed model has two major advantages over existing methods. 1) Unlike methods that can only capture local pair-wise AU dependencies, our model is developed upon the restricted Boltzmann machine and therefore can exploit the global relationships among AUs. 2) Although AU relationships are influenced by many related factors such as facial expressions, these factors are generally ignored by the current methods. Our model, however, can successfully capture them to more accurately characterize the AU relationships. Efficient learning and inference algorithms of the proposed model are also developed. Experimental results on benchmark databases demonstrate the effectiveness of the proposed approach in modelling complex AU relationships as well as its superior AU recognition performance over existing approaches.
same-paper 2 0.77406186 252 iccv-2013-Line Assisted Light Field Triangulation and Stereo Matching
Author: Zhan Yu, Xinqing Guo, Haibing Lin, Andrew Lumsdaine, Jingyi Yu
Abstract: Light fields are image-based representations that use densely sampled rays as a scene description. In this paper, we explore geometric structures of 3D lines in ray space for improving light field triangulation and stereo matching. The triangulation problem aims to fill in the ray space with continuous and non-overlapping simplices anchored at sampled points (rays). Such a triangulation provides a piecewise-linear interpolant useful for light field superresolution. We show that the light field space is largely bilinear due to 3D line segments in the scene, and direct triangulation of these bilinear subspaces leads to large errors. We instead present a simple but effective algorithm to first map bilinear subspaces to line constraints and then apply Constrained Delaunay Triangulation (CDT). Based on our analysis, we further develop a novel line-assisted graphcut (LAGC) algorithm that effectively encodes 3D line constraints into light field stereo matching. Experiments on synthetic and real data show that both our triangulation and LAGC algorithms outperform state-of-the-art solutions in accuracy and visual quality.
3 0.76363301 344 iccv-2013-Recognising Human-Object Interaction via Exemplar Based Modelling
Author: Jian-Fang Hu, Wei-Shi Zheng, Jianhuang Lai, Shaogang Gong, Tao Xiang
Abstract: Human action can be recognised from a single still image by modelling Human-object interaction (HOI), which infers the mutual spatial structure information between human and object as well as their appearance. Existing approaches rely heavily on accurate detection of human and object, and estimation of human pose. They are thus sensitive to large variations of human poses, occlusion and unsatisfactory detection of small size objects. To overcome this limitation, a novel exemplar based approach is proposed in this work. Our approach learns a set of spatial pose-object interaction exemplars, which are density functions describing how a person is interacting with a manipulated object for different activities spatially in a probabilistic way. A representation based on our HOI exemplar thus has great potential for being robust to the errors in human/object detection and pose estimation. A new framework consisting of a proposed exemplar based HOI descriptor and an activity specific matching model that learns the parameters is formulated for robust human activity recognition. Experiments on two benchmark activity datasets demonstrate that the proposed approach obtains state-of-the-art performance.
4 0.74477232 290 iccv-2013-New Graph Structured Sparsity Model for Multi-label Image Annotations
Author: Xiao Cai, Feiping Nie, Weidong Cai, Heng Huang
Abstract: In multi-label image annotations, because each image is associated with multiple categories, the semantic terms (label classes) are not mutually exclusive. Previous research showed that such label correlations can largely boost the annotation accuracy. However, all existing methods only directly apply the label correlation matrix to enhance the label inference and assignment without further learning the structural information among classes. In this paper, we model the label correlations using the relational graph, and propose a novel graph structured sparse learning model to incorporate the topological constraints of the relation graph in multi-label classifications. As a result, our new method will capture and utilize the hidden class structures in the relational graph to improve the annotation results. In the proposed objective, a large number of structured sparsity-inducing norms are utilized, thus the optimization becomes difficult. To solve this problem, we derive an efficient optimization algorithm with proved convergence. We perform extensive experiments on six multi-label image annotation benchmark data sets. In all empirical results, our new method shows better annotation results than the state-of-the-art approaches.
5 0.6966989 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding
Author: Weiyu Zhang, Menglong Zhu, Konstantinos G. Derpanis
Abstract: This paper presents a novel approach for analyzing human actions in non-scripted, unconstrained video settings based on volumetric, x-y-t, patch classifiers, termed actemes. Unlike previous action-related work, the discovery of patch classifiers is posed as a strongly-supervised process. Specifically, keypoint labels (e.g., position) across spacetime are used in a data-driven training process to discover patches that are highly clustered in the spacetime keypoint configuration space. To support this process, a new human action dataset consisting of challenging consumer videos is introduced, where notably the action label, the 2D position of a set of keypoints and their visibilities are provided for each video frame. On a novel input video, each acteme is used in a sliding volume scheme to yield a set of sparse, non-overlapping detections. These detections provide the intermediate substrate for segmenting out the action. For action classification, the proposed representation shows significant improvement over state-of-the-art low-level features, while providing spatiotemporal localization as additional output. This output sheds further light into detailed action understanding.
6 0.69049454 276 iccv-2013-Multi-attributed Dictionary Learning for Sparse Coding
7 0.65155506 155 iccv-2013-Facial Action Unit Event Detection by Cascade of Tasks
8 0.60605097 150 iccv-2013-Exemplar Cut
9 0.5922938 319 iccv-2013-Point-Based 3D Reconstruction of Thin Objects
10 0.58581114 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition
11 0.58028871 194 iccv-2013-Heterogeneous Image Features Integration via Multi-modal Semi-supervised Learning Model
12 0.57200313 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition
13 0.56724823 126 iccv-2013-Dynamic Label Propagation for Semi-supervised Multi-class Multi-label Classification
14 0.56486845 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction
15 0.56251025 179 iccv-2013-From Subcategories to Visual Composites: A Multi-level Framework for Object Detection
16 0.56160235 108 iccv-2013-Depth from Combining Defocus and Correspondence Using Light-Field Cameras
17 0.5613097 384 iccv-2013-Semi-supervised Robust Dictionary Learning via Efficient l-Norms Minimization
18 0.56115979 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
19 0.56108904 43 iccv-2013-Active Visual Recognition with Expertise Estimation in Crowdsourcing
20 0.55875635 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps