iccv iccv2013 iccv2013-209 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David Ferstl, Christian Reinbacher, Rene Ranftl, Matthias Ruether, Horst Bischof
Abstract: In this work we present a novel method for the challenging problem of depth image upsampling. Modern depth cameras such as Kinect or Time of Flight cameras deliver dense, high quality depth measurements but are limited in their lateral resolution. To overcome this limitation we formulate a convex optimization problem using higher order regularization for depth image upsampling. In this optimization an anisotropic diffusion tensor, calculated from a high resolution intensity image, is used to guide the upsampling. We derive a numerical algorithm based on a primal-dual formulation that is efficiently parallelized and runs at multiple frames per second. We show that this novel upsampling clearly outperforms state of the art approaches in terms of speed and accuracy on the widely used Middlebury 2007 datasets. Furthermore, we introduce novel datasets with highly accurate groundtruth, which, for the first time, enable benchmarking of depth upsampling methods using real sensor data.
Reference: text
sentIndex sentText sentNum sentScore
1 Modern depth cameras such as Kinect or Time of Flight cameras deliver dense, high quality depth measurements but are limited in their lateral resolution. [sent-4, score-1.169]
2 To overcome this limitation we formulate a convex optimization problem using higher order regularization for depth image upsampling. [sent-5, score-0.57]
3 In this optimization an anisotropic diffusion tensor, calculated from a high resolution intensity image, is used to guide the upsampling. [sent-6, score-0.663]
4 We show that this novel upsampling clearly outperforms state of the art approaches in terms of speed and accuracy on the widely used Middlebury 2007 datasets. [sent-8, score-0.57]
5 Furthermore, we introduce novel datasets with highly accurate groundtruth, which, for the first time, enable benchmarking of depth upsampling methods using real sensor data. [sent-9, score-1.155]
6 Introduction Accurate, high resolution depth sensing is a fundamental challenge in computer vision. [sent-11, score-0.677]
7 Traditional computer vision approaches calculate the scene depth through computationally exhaustive stereo calculations or expensive laser range measurements. [sent-13, score-0.509]
8 A per-pixel depth is measured actively through the runtime of light. [sent-15, score-0.515]
9 It delivers a dense depth map even at very close ranges [12, 21]. [sent-17, score-0.549]
10 No additional calculations are necessary, which results in depth measurements at high frame rates. [sent-18, score-0.588]
11 Recently, ToF sensors have become affordable in the mass market and a small [Figure 1 panels: (a) Low resolution depth, (b) High resolution intensity, (c) High resolution depth upsampling result] Figure 1. [sent-19, score-1.207]
12 Upsampling of a low resolution depth image (a) using an additional high resolution intensity image (b) through image guided anisotropic Total Generalized Variation (c). [sent-20, score-1.239]
13 In this work, we propose a method to drastically increase the lateral measurement resolution by a novel depth map upsampling approach, as shown in Figure 1. [sent-24, score-1.28]
14 To increase both, quality and resolution, we add information from a high resolution intensity camera in a variational optimization framework. [sent-25, score-0.523]
15 We build on the observation that textural edges are more likely to appear at high depth discontinuities, whereas homogeneous textured regions correspond to homogeneous surface parts [23]. [sent-26, score-0.515]
16 Fusing both, low resolution but very robust depth and high resolution intensity in a spatial sense, results in a dense depth map with increased lateral resolution and visual quality. [sent-27, score-1.787]
17 We formulate the upsampling as a convex optimization problem [2, 6]. [sent-28, score-0.59]
18 The main contributions of this work are two-fold: (1) We propose a novel method for fast depth image upsampling by combining a low resolution depth image with high resolution texture information in a variational energy optimization framework. [sent-33, score-1.966]
19 The employed higher order regularization is well suited to model the image acquisition process of modern depth cameras and leads to an improved quality of the upsampled depth maps, compared to state of the art methods. [sent-34, score-1.155]
20 (2) We propose benchmarking datasets that enable a quantitative comparison of depth image upsampling methods providing real ToF and intensity camera acquisitions together with a highly accurate groundtruth measurement. [sent-35, score-1.588]
21 In our experiments we demonstrate the upsampling quality by a numerical and visual comparison on synthetic and real benchmarking datasets. [sent-37, score-0.717]
22 Related Work There are many ways to increase the resolution and the accuracy of depth measurements. [sent-40, score-0.642]
23 In general, they can be separated in three main classes: (1) fusion of multiple depth sensors, (2) temporal and spatial fusion and (3) upsampling by combining depth and intensity sensors. [sent-41, score-1.801]
24 Multiple Depth Sensor Fusion Recent works addressed the fusion of different depth sensing techniques to increase resolution and quality. [sent-42, score-0.736]
25 [8] presented a method for stereo and Time of Flight (ToF) depth map fusion in a dynamic programming approach. [sent-44, score-0.637]
26 [26] using an accurate depth calibration and fusing the measurements in a Markov Random Field (MRF) framework. [sent-46, score-0.608]
27 at/tofmark Temporal and Spatial Upsampling A common way to improve the resolution and quality of depth information is to fuse multiple depth measurements into one depth map. [sent-51, score-1.688]
28 [14] proposed a method for simultaneous camera localization and depth fusion in real time. [sent-58, score-0.639]
29 Depth Upsampling through Intensity Information This class of approaches uses additional intensity information as depth cue for image upsampling. [sent-59, score-0.634]
30 [24] used bilateral filtering of a depth cost volume and a RGB image in an iterative refinement process. [sent-61, score-0.611]
31 [3] used a noise aware joint bilateral filter to increase the resolution and to reduce depth map errors at multiple frames per second. [sent-63, score-0.792]
32 Diebel and Thrun [5] performed an upsampling using a MRF formulation, where the smoothness term is weighted according to texture derivatives. [sent-64, score-0.555]
33 They used a combination of different weighting terms of a least squares optimization including segmentation, image gradients, edge saliency and non-local means for depth upsampling. [sent-67, score-0.507]
34 The combination of intensity and depth data in a Bayesian Framework was proposed by Li et al. [sent-68, score-0.634]
35 Discussion While the methods for multiple sensor fusion deliver accurate depth results, their quality relies on high calibration effort. [sent-70, score-0.778]
36 Further, most sensor fusion techniques have to calculate a depth map from passive stereo in a preprocessing step before the actual fusion is able to start. [sent-71, score-0.858]
37 Contrary, temporal and spatial fusion approaches rely on multiple acquisitions from a single depth sensor. [sent-72, score-0.647]
38 To overcome these limitations, we chose the combination of a low resolution depth and a high resolution intensity sensor to increase the natural depth sensor resolution. [sent-74, score-1.732]
39 The upsampling is calculated on a per image basis without the need for complex preprocessing. [sent-75, score-0.574]
40 Existing approaches, such as [3, 24], calculate this depth upsampling by a bilateral filtering. [sent-76, score-1.095]
41 In contrast, our method builds on the success of recently introduced upsampling methods using MRF and least squares optimization [5, 15]. [sent-78, score-0.555]
42 This tensor not only weights the depth gradient but also orients the gradient direction during the optimization process. [sent-81, score-0.654]
43 Method Our upsampling approach generates a high quality and high resolution depth map DH out of a high resolution intensity image IH and a low resolution and noisy depth map DL, where IH, DH : ΩH ⊆ R2 and DL : ΩL ⊆ R2. [sent-83, score-2.423]
44 The methodology of this approach can be divided into three main areas: (1) Registering the low-resolution depth measurements and the high resolution intensity information in one common coordinate system (Section 3. [sent-84, score-0.958]
45 1), (2) formulating the depth upsampling problem into a convex energy functional (Section 3. [sent-85, score-1.047]
46 Depth Mapping Since the low resolution depth map DL and the high resolution intensity image IH stem from different cameras, a mapping can only be established when intrinsic and extrinsic parameters are known (see Section 4. [sent-90, score-1.152]
47 Each depth measurement di,j at pixel position xi,j = [i, j, 1]T is projected into the high resolution intensity image space ΩH. [sent-93, score-0.927]
48 x̃i,j = PH Xi,j ∀ i,j ∈ ΩL, (1) where PL† is the pseudoinverse of the depth camera projection matrix, CL the camera center and Xi,j the 3D point. [sent-96, score-0.551]
49 Hence, we get a projected depth image DS consisting of a sparse set of base depth points at position x̃i,j in the intensity image space ΩH where the depth value is given by the distance to the 3D point Xi,j (see Figure 2). [sent-98, score-1.591]
50 resolution sparse depth map DS in the intensity camera coordinate system. [sent-99, score-0.906]
51 Although one low resolution sensor pixel DL i,j measures the average depth of multiple pixels in the high resolution space, we only project it to one central pixel DS i,j at position x̃i,j. [sent-100, score-0.997]
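The depth mapping described above can be sketched in a few lines. This is only an illustration under stated assumptions: both cameras are modeled as simple pinhole cameras with hypothetical intrinsics (fx, fy, cx, cy), the pose R, t between the cameras is assumed known, and back-projection uses the inverse intrinsics instead of the pseudoinverse form referenced in Eq. (1). The function names are made up for this example.

```python
import math

def backproject(i, j, d, fx, fy, cx, cy):
    """Return the 3D point at distance d along the viewing ray of pixel (i, j)."""
    ray = ((i - cx) / fx, (j - cy) / fy, 1.0)
    norm = math.sqrt(sum(c * c for c in ray))
    return tuple(d * c / norm for c in ray)

def project_sparse_depth(depth_low, cam_low, cam_high, R, t, size_high):
    """Map each low resolution depth sample to one pixel of the high resolution grid.

    depth_low: dict {(i, j): distance}; cam_low / cam_high: (fx, fy, cx, cy).
    R (3x3 nested lists) and t (3 values) are a hypothetical pose that moves
    points from the depth camera frame into the intensity camera frame.
    Returns a sparse dict {(u, v): distance to the intensity camera center}.
    """
    width, height = size_high
    fx, fy, cx, cy = cam_high
    sparse = {}
    for (i, j), d in depth_low.items():
        X = backproject(i, j, d, *cam_low)
        # Transform into the intensity camera frame: Xh = R X + t
        Xh = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
        if Xh[2] <= 0:
            continue  # point lies behind the intensity camera
        u = round(fx * Xh[0] / Xh[2] + cx)
        v = round(fy * Xh[1] / Xh[2] + cy)
        if 0 <= u < width and 0 <= v < height:
            dist = math.sqrt(sum(c * c for c in Xh))
            # If two samples land on the same pixel, keep the closer surface.
            if (u, v) not in sparse or dist < sparse[(u, v)]:
                sparse[(u, v)] = dist
    return sparse
```

Each low resolution sample lands on exactly one high resolution pixel, so the result is sparse; all other pixels stay empty and are left to the optimization.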
52 Depth Image Upsampling Our upsampling method increases the resolution of measured depth data from a low resolution depth sensor by adding edge cues from a high resolution intensity image. [sent-106, score-2.383]
53 To be able to use both information, we map the depth measurements to the intensity camera coordinate system as described in Section 3. [sent-107, score-0.817]
54 With this mapping we get a depth map DS of a sparse set of base depth measurements from the low resolution depth sensor. [sent-109, score-1.738]
55 The high resolution depth map DH is given by DH = argmin_u {G(u, DS) + αF(u)}. (2) [sent-110, score-0.711]
56 This formulation is composed of the data term G(u, DS) that measures the fidelity of the argument u to the input depth measurements DS and the regularization term F(u) that reflects prior knowledge of the smoothness of our solution. [sent-111, score-0.606]
57 The data term in our energy model is designed to ensure a data consistency to the base depth points DS from the depth camera. [sent-114, score-0.965]
58 Additionally, we allow weighting the depth measurements with a weighting operator w = [0, 1] ∈ RΩH, which is zero at unmapped image points and between zero and one on the base points according to some application specific confidence. [sent-115, score-0.604]
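To make the structure of the energy in Eq. (2) concrete, the following sketch evaluates a discrete version of it on a small grid. Two assumptions are made that the excerpts here do not confirm: the data term is taken as a weighted squared difference, and the regularizer is plain first-order total variation, used only as a stand-in for the higher order TGV term. All function names are hypothetical.

```python
def data_term(u, d_sparse, w):
    """Weighted squared deviation of u from the sparse depth samples.

    w is zero where no depth sample was mapped, so those pixels do not contribute.
    """
    return sum(w[y][x] * (u[y][x] - d_sparse[y][x]) ** 2
               for y in range(len(u)) for x in range(len(u[0])))

def tv_term(u):
    """First-order total variation with forward differences (stand-in, not TGV)."""
    height, width = len(u), len(u[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                total += abs(u[y][x + 1] - u[y][x])
            if y + 1 < height:
                total += abs(u[y + 1][x] - u[y][x])
    return total

def energy(u, d_sparse, w, alpha):
    """Discrete analogue of G(u, DS) + alpha * F(u)."""
    return data_term(u, d_sparse, w) + alpha * tv_term(u)
```

A larger alpha favors smoother candidates u, while a smaller alpha keeps u closer to the measured base points.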
59 The regularization term has to meet the challenges of producing a high resolution depth map out of a sparse set of depth points. [sent-118, score-1.217]
60 This prevents the depth map from becoming a piecewise smooth surface, resulting in piecewise fronto parallel depth reconstructions. [sent-128, score-1.012]
61 Assuming that texture edges most likely correspond to depth discontinuities, we use the high resolution intensity data to produce a more accurate upsampling result. [sent-141, score-1.413]
62 The anisotropic diffusion tensor not only weights the first order depth gradient but also orients the gradient direction during the optimization process. [sent-144, score-0.835]
63 Including this term in our TGV model we can penalize high depth discontinuities at homogeneous regions and allow sharp depth edges at corresponding texture differences. [sent-145, score-0.997]
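The effect of such a tensor can be illustrated per pixel. The excerpts here do not state the tensor formula, so this sketch assumes one common construction, T = exp(-β|∇I|^γ) n nᵀ + n⊥ n⊥ᵀ, where n is the normalized intensity gradient and n⊥ its perpendicular. The default values for beta and gamma are placeholders, not the settings used in the paper.

```python
import math

def diffusion_tensor(gx, gy, beta=9.0, gamma=0.85):
    """2x2 tensor for one pixel, built from the intensity gradient (gx, gy).

    Assumed form: T = exp(-beta * |g|**gamma) * n n^T + p p^T,
    with n across the edge and p along the edge.
    """
    mag = math.hypot(gx, gy)
    if mag < 1e-12:
        return [[1.0, 0.0], [0.0, 1.0]]  # flat region: isotropic smoothing
    nx, ny = gx / mag, gy / mag          # unit vector across the edge
    px, py = -ny, nx                     # unit vector along the edge
    s = math.exp(-beta * mag ** gamma)   # small at strong edges
    return [[s * nx * nx + px * px, s * nx * ny + px * py],
            [s * nx * ny + px * py, s * ny * ny + py * py]]

def apply_tensor(T, v):
    """Multiply the 2x2 tensor T with a depth gradient v = (vx, vy)."""
    return (T[0][0] * v[0] + T[0][1] * v[1],
            T[1][0] * v[0] + T[1][1] * v[1])
```

With this form, a depth gradient pointing across a strong intensity edge is scaled down before it is penalized, so a sharp depth step is cheap there, while gradients along the edge and in flat regions are penalized in full.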
64 Evaluation In this section, we show a quantitative and qualitative evaluation of our upsampling method. [sent-194, score-0.558]
65 Visual comparison of ×8 upsampling on a snippet of the Middlebury Art dataset including fine image, (b) low resolution input image (enlarged using nearest neighbor upsampling). [sent-314, score-0.526]
66 (d) Adaptive bilateral upsampling proposed by Chan et al. [sent-316, score-0.642]
67 (e) Nonlocal means upsampling (f) Our upsampling method using image guided anisotropic suffers from edge bleeding especially at small structure structures. [sent-318, score-1.214]
68 The tensor parameters β and γ as well as the TGV parameters α0 and α1 are manually set once for each upsampling factor and are constant in synthetic and the real world evaluations. [sent-328, score-0.595]
69 [15] provides low resolution input depth images with different downsampling factors (×2, ×4, ×8, ×16). [sent-334, score-0.672]
70 Further quantitative comparisons to other depth upsampling methods on the Middlebury 2003 and 2007 datasets can be found in the supplemental material. [sent-357, score-1.043]
71 Discussion What can be clearly seen is that our method delivers an upsampling quality that is superior compared to state of the art methods at a lower computation time. [sent-358, score-0.672]
72 The higher order regularization better captures the surface of real world scenes, while the anisotropic diffusion tensor delivers a more defined guidance of the high resolution intensity data compared to a simple scalar weighting. [sent-360, score-0.871]
73 While the Middlebury datasets are popular to evaluate depth upsampling methods, they neglect some important properties of real acquisition setups. [sent-361, score-1.111]
74 Typically, depth and intensity data do not originate from the same sensor and are therefore not aligned. [sent-362, score-0.735]
75 Further, real low resolution depth sensors measure depth data with a more complex acquisition noise which can not be simulated by adding simple Gaussian noise. [sent-363, score-1.264]
76 For depth measurements we use a PMD Nano ToF camera delivering a 120 × 160 dense depth and IR amplitude image [16]. [sent-368, score-1.173]
77 Camera Calibration Calibration of the intensity camera and the ToF camera is a crucial part in our upsampling system since the quality of the calibration directly affects the accuracy of the upsampling result. [sent-370, score-1.426]
78 In addition to the intrinsic and extrinsic camera parameters, the ToF depth is calibrated and a depth confidence value is calculated through the 3D projections of the planar feature points. [sent-376, score-1.076]
79 Because ToF cameras measure depth through active illumination, the depth measurement certainty increases with the measured amplitude [7]. [sent-377, score-1.132]
80 Through a comparison of the very accurate 3D measurements of the calibration points and the measured ToF depth points a dependence between the acquired IR amplitude image and the measurement error can be established, as shown in Figure 4. [sent-378, score-0.846]
81 Correlation between measured IR amplitude and depth measurement error of TOF acquisitions. [sent-380, score-0.652]
82 Compensation for amplitude based depth error of TOF cameras. [sent-387, score-0.571]
83 We assume a linear correspondence between the IR amplitude and the depth error below a full confidence threshold Imax, as shown in Figure 4. [sent-390, score-0.603]
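One simple way to turn this linear assumption into a per-pixel weight is sketched below. The exact mapping from amplitude to confidence is not given in these excerpts, so the clamped linear ramp and the name `amplitude_confidence` are assumptions made for illustration.

```python
def amplitude_confidence(amplitude, i_max):
    """Confidence in [0, 1] that grows linearly with IR amplitude up to i_max.

    Assumption: amplitudes at or above the threshold i_max get full confidence.
    """
    if i_max <= 0:
        raise ValueError("i_max must be positive")
    return max(0.0, min(1.0, amplitude / i_max))
```

Values produced this way could serve as the weights on the base depth points, with zero kept for pixels that received no measurement.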
84 The input density value shows the percentage of sparse depth values which are projected into the high resolution image space. [sent-419, score-0.702]
85 This corresponds to an upsampling factor of approximately 6. [sent-420, score-0.526]
86 To get a dense depth map, multiple acquisitions with slightly displaced projection angles are fused together. [sent-425, score-0.584]
87 The acquired scenes are chosen to incorporate structures with high texture variations (see Books scene) as well as thin wiry elements (Shark and Devil scenes) to evaluate the upsampling accuracy. [sent-426, score-0.629]
88 A quantitative accuracy evaluation of our upsampling for three real world datasets is shown in Table 2. [sent-431, score-0.659]
89 The upsampling error is calculated by the RMSE to the groundtruth depth map measured with the highly accurate structured light scanner. [sent-432, score-1.199]
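The error measure can be written down directly. In this minimal sketch, pixels without a groundtruth measurement are marked with `None` and skipped; that convention is an assumption for the example, not something the source specifies.

```python
import math

def rmse(estimate, groundtruth):
    """Root mean squared error over all pixels that have a groundtruth value."""
    squared_sum, count = 0.0, 0
    for row_est, row_gt in zip(estimate, groundtruth):
        for est, gt in zip(row_est, row_gt):
            if gt is None:
                continue  # no groundtruth at this pixel
            squared_sum += (est - gt) ** 2
            count += 1
    if count == 0:
        raise ValueError("no valid groundtruth pixels")
    return math.sqrt(squared_sum / count)
```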
90 We compared our method to two common interpolation techniques, joint bilateral upsampling [11] and guided image filtering [9]. [sent-433, score-0.772]
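For orientation, the idea behind joint bilateral upsampling can be sketched as follows: each high resolution depth value is a weighted average of nearby low resolution samples, where the weights combine spatial distance with similarity in the high resolution guidance image. This is a simplified reading of the technique with hypothetical parameter names and defaults, assuming an integer upsampling factor; it is not the implementation used for the comparison.

```python
import math

def joint_bilateral_upsample(depth_low, guide_high, factor,
                             sigma_s=1.0, sigma_r=0.1, radius=2):
    """Upsample depth_low by an integer factor, guided by guide_high."""
    h_hi, w_hi = len(guide_high), len(guide_high[0])
    h_lo, w_lo = len(depth_low), len(depth_low[0])
    out = [[0.0] * w_hi for _ in range(h_hi)]
    for y in range(h_hi):
        for x in range(w_hi):
            yl, xl = y / factor, x / factor        # position in low-res coordinates
            cy, cx = int(round(yl)), int(round(xl))
            num = den = 0.0
            for qy in range(cy - radius, cy + radius + 1):
                for qx in range(cx - radius, cx + radius + 1):
                    if not (0 <= qy < h_lo and 0 <= qx < w_lo):
                        continue
                    # Guidance value at the high-res location of low-res pixel q
                    gy = min(qy * factor, h_hi - 1)
                    gx = min(qx * factor, w_hi - 1)
                    dist2 = (qy - yl) ** 2 + (qx - xl) ** 2
                    diff = guide_high[y][x] - guide_high[gy][gx]
                    weight = (math.exp(-dist2 / (2 * sigma_s ** 2)) *
                              math.exp(-diff * diff / (2 * sigma_r ** 2)))
                    num += weight * depth_low[qy][qx]
                    den += weight
            out[y][x] = num / den if den > 0 else 0.0
    return out
```

Because the range weight depends only on the guidance image, low resolution samples from the other side of an intensity edge contribute almost nothing, which keeps depth edges aligned with intensity edges.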
91 As depth input to all methods we used the offset corrected ToF depth input. [sent-434, score-0.906]
92 Thus, the projected depth measurements near large depth steps can differ from correct depth values. [sent-440, score-1.484]
93 Because the distance between the cameras is very small compared to the measured depth range, these wrong measurements have no large impact on the result and can be handled by the regularization term. [sent-441, score-0.67]
94 Despite that, in the visual and numerical results it can be seen that our method delivers high quality upsampling results at multiple frames per second for an approximate upsampling factor of 6. [sent-442, score-1.226]
95 Conclusion In this paper we propose a novel method for depth map upsampling using a low resolution, low cost 3D sensor and an additional high resolution 2D sensor. [sent-449, score-1.398]
96 The upsampling is formulated as a global energy optimization problem using Total Generalized Variation (TGV) regularization. [sent-450, score-0.588]
97 We further provide benchmarking datasets of real world scenes providing a highly accurate groundtruth that, for the first time, enable a real quality comparison of depth image upsampling methods. [sent-453, score-1.335]
98 In column (a) the low resolution ToF image and the high resolution intensity image are shown, whereas column (b) shows the high resolution groundtruth depth. [sent-500, score-0.949]
99 In column (c) the upsampling result of our method is shown whereas in column (d) the relative depth error to the known groundtruth is shown. [sent-502, score-1.08]
100 Reliability fusion of time-of-flight depth and stereo geometry for high quality depth maps. [sent-641, score-1.131]
wordName wordTfidf (topN-words)
[('upsampling', 0.526), ('depth', 0.453), ('tof', 0.283), ('tgv', 0.201), ('resolution', 0.189), ('intensity', 0.181), ('amplitude', 0.118), ('bilateral', 0.116), ('anisotropic', 0.102), ('rmse', 0.101), ('sensor', 0.101), ('groundtruth', 0.101), ('acquisitions', 0.1), ('measurements', 0.1), ('fusion', 0.094), ('middlebury', 0.088), ('diffusion', 0.079), ('thrun', 0.077), ('tensor', 0.074), ('books', 0.073), ('benchmarking', 0.071), ('chan', 0.07), ('ds', 0.069), ('shark', 0.067), ('diebel', 0.067), ('delivers', 0.062), ('flight', 0.062), ('park', 0.061), ('guided', 0.06), ('acquisition', 0.057), ('stereo', 0.056), ('calibration', 0.055), ('regularization', 0.053), ('schuon', 0.05), ('camera', 0.049), ('calculated', 0.048), ('ranftl', 0.046), ('imax', 0.046), ('measurement', 0.044), ('art', 0.044), ('ih', 0.044), ('real', 0.043), ('filtering', 0.042), ('extrinsic', 0.041), ('quality', 0.04), ('sensors', 0.039), ('acquired', 0.039), ('dh', 0.039), ('gudmundsson', 0.038), ('orients', 0.038), ('preconditioning', 0.038), ('saddlepoint', 0.038), ('therewith', 0.038), ('primal', 0.037), ('numerical', 0.037), ('measured', 0.037), ('piecewise', 0.036), ('mrf', 0.036), ('dl', 0.035), ('ir', 0.035), ('convex', 0.035), ('high', 0.035), ('devil', 0.034), ('map', 0.034), ('lateral', 0.034), ('pmd', 0.033), ('siegen', 0.033), ('energy', 0.033), ('mm', 0.032), ('quantitative', 0.032), ('confidence', 0.032), ('generalized', 0.032), ('datasets', 0.032), ('displaced', 0.031), ('davis', 0.03), ('gradient', 0.03), ('low', 0.03), ('bredies', 0.029), ('texture', 0.029), ('optimization', 0.029), ('graz', 0.028), ('theobalt', 0.028), ('upsampled', 0.028), ('polynomials', 0.028), ('imaging', 0.028), ('interpolation', 0.028), ('surface', 0.027), ('sharp', 0.027), ('cameras', 0.027), ('austria', 0.027), ('base', 0.026), ('world', 0.026), ('variation', 0.026), ('passive', 0.026), ('dual', 0.026), ('weighting', 0.025), ('projected', 0.025), ('il', 0.025), ('cui', 0.025), ('runtime', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 209 iccv-2013-Image Guided Depth Upsampling Using Anisotropic Total Generalized Variation
Author: David Ferstl, Christian Reinbacher, Rene Ranftl, Matthias Ruether, Horst Bischof
Abstract: In this work we present a novel method for the challenging problem of depth image upsampling. Modern depth cameras such as Kinect or Time of Flight cameras deliver dense, high quality depth measurements but are limited in their lateral resolution. To overcome this limitation we formulate a convex optimization problem using higher order regularization for depth image upsampling. In this optimization an anisotropic diffusion tensor, calculated from a high resolution intensity image, is used to guide the upsampling. We derive a numerical algorithm based on a primal-dual formulation that is efficiently parallelized and runs at multiple frames per second. We show that this novel upsampling clearly outperforms state of the art approaches in terms of speed and accuracy on the widely used Middlebury 2007 datasets. Furthermore, we introduce novel datasets with highly accurate groundtruth, which, for the first time, enable benchmarking of depth upsampling methods using real sensor data.
2 0.30969286 18 iccv-2013-A Joint Intensity and Depth Co-sparse Analysis Model for Depth Map Super-resolution
Author: Martin Kiechle, Simon Hawe, Martin Kleinsteuber
Abstract: High-resolution depth maps can be inferred from low-resolution depth measurements and an additional high-resolution intensity image of the same scene. To that end, we introduce a bimodal co-sparse analysis model, which is able to capture the interdependency of registered intensity and depth information. This model is based on the assumption that the co-supports of corresponding bimodal image structures are aligned when computed by a suitable pair of analysis operators. No analytic form of such operators exist and we propose a method for learning them from a set of registered training signals. This learning process is done offline and returns a bimodal analysis operator that is universally applicable to natural scenes. We use this to exploit the bimodal co-sparse analysis model as a prior for solving inverse problems, which leads to an efficient algorithm for depth map super-resolution. Despite very active research in this area and significant improvements over the past years, stereo methods still struggle with noise, texture-less regions, repetitive texture, and occluded areas. For an overview of stereo methods, the reader is referred to [25].
3 0.2744042 382 iccv-2013-Semi-dense Visual Odometry for a Monocular Camera
Author: Jakob Engel, Jürgen Sturm, Daniel Cremers
Abstract: We propose a fundamentally novel approach to real-time visual odometry for a monocular camera. It allows to benefit from the simplicity and accuracy of dense tracking which does not depend on visual features while running in real-time on a CPU. The key idea is to continuously estimate a semi-dense inverse depth map for the current frame, which in turn is used to track the motion of the camera using dense image alignment. More specifically, we estimate the depth of all pixels which have a non-negligible image gradient. Each estimate is represented as a Gaussian probability distribution over the inverse depth. We propagate this information over time, and update it with new measurements as new images arrive. In terms of tracking accuracy and computational speed, the proposed method compares favorably to both state-of-the-art dense and feature-based visual odometry and SLAM algorithms. As our method runs in real-time on a CPU, it is of large practical value for robotics and augmented reality applications. 1. Towards Dense Monocular Visual Odometry Tracking a hand-held camera and recovering the three-dimensional structure of the environment in real-time is among the most prominent challenges in computer vision. In the last years, dense approaches to these challenges have become increasingly popular: Instead of operating solely on visual feature positions, they reconstruct and track on the whole image using a surface-based map and thereby are fundamentally different from feature-based approaches. Yet, these methods are to date either not real-time capable on standard CPUs [11, 15, 17] or require direct depth measurements from the sensor [7], making them unsuitable for many practical applications. In this paper, we propose a novel semi-dense visual odometry approach for a monocular camera, which combines the accuracy and robustness of dense approaches with the efficiency of feature-based methods.
Further, it computes highly accurate semi-dense depth maps from the monocular images, providing rich information about the 3D structure of the environment. We use the term visual odometry as opposed to SLAM, as for simplicity we deliberately maintain only information about the currently visible scene, instead of building a global world-model. [Footnote: This work was supported by the ERC Starting Grant ConvexVision and the DFG project Mapping on Demand.] [Figure 1. Semi-Dense Monocular Visual Odometry: Our approach works on a semi-dense inverse depth map and combines the accuracy and robustness of dense visual SLAM methods with the efficiency of feature-based techniques. Left: video frame, Right: color-coded semi-dense depth map, which consists of depth estimates in all image regions with sufficient structure.] 1.1. Related Work Feature-based monocular SLAM. In all feature-based methods (such as [4, 8]), tracking and mapping consists of two separate steps: First, discrete feature observations (i.e., their locations in the image) are extracted and matched to each other. Second, the camera and the full feature poses are calculated from a set of such observations disregarding the images themselves. While this preliminary abstraction step greatly reduces the complexity of the overall problem and allows it to be tackled in real time, it inherently comes with two significant drawbacks: First, only image information conforming to the respective feature type and parametrization (typically image corners and blobs [6] or line segments [9]) is utilized. Second, features have to be matched to each other, which often requires the costly computation of scale- and rotation-invariant descriptors and robust outlier estimation methods like RANSAC. Dense monocular SLAM. To overcome these limitations and to better exploit the available image information, dense monocular SLAM methods [11, 17] have recently been proposed.
The fundamental difference to keypoint-based approaches is that these methods directly work on the images instead of a set of extracted features, for both mapping and tracking: The world is modeled as dense surface while in turn new frames are tracked using whole-image alignment. This concept removes the need for discrete features, and allows to exploit all information present in the image, increasing tracking accuracy and robustness. To date however, doing this in real-time is only possible using modern, powerful GPU processors. Similar methods are broadly used in combination with RGB-D cameras [7], which directly measure the depth of each pixel, or stereo camera rigs [3], greatly reducing the complexity of the problem. Dense multi-view stereo. Significant prior work exists on multi-view dense reconstruction, both in a real-time setting [13, 11, 15], as well as off-line [5, 14]. In particular for offline reconstruction, there is a long history of using different baselines to steer the stereo-inherent trade-off between accuracy and precision [12]. Most similar to our approach is the early work of Matthies et al., who proposed probabilistic depth map fusion and propagation for image sequences [10], however only for structure from motion, i.e., not coupled with subsequent dense tracking. 1.2. Contributions In this paper, we propose a novel semi-dense approach to monocular visual odometry, which does not require feature points. The key concepts are • a probabilistic depth map representation, • tracking based on whole-image alignment, • the reduction on image-regions which carry information (semi-dense), and • the full incorporation of stereo measurement uncertainty. To the best of our knowledge, this is the first featureless, real-time monocular visual odometry approach, which runs in real-time on a CPU. 1.3.
Method Outline Our approach is partially motivated by the basic principle that for most real-time applications, video information is abundant and cheap to come by. Therefore, the computational budget should be spent such that the expected information gain is maximized. Instead of reducing the images to a sparse set of feature observations however, our method continuously estimates a semi-dense inverse depth map for the current frame, i.e., a dense depth map covering all image regions with non-negligible gradient (see Fig. 2). It is comprised of one inverse depth hypothesis per pixel modeled by a Gaussian probability distribution. This representation still allows to use whole-image alignment [7] to track new frames, while at the same time greatly reducing computational complexity compared to volumetric methods. [Figure 2 panels: original image, semi-dense depth map (ours), keypoint depth map [8], dense depth map [1 ], RGB-D camera [16]. Figure 2. Semi-Dense Approach: Our approach reconstructs and tracks on a semi-dense inverse depth map, which is dense in all image regions carrying information (top-right). For comparison, the bottom row shows the respective result from a keypoint-based approach, a fully dense approach and the ground truth from an RGB-D camera.] The estimated depth map is propagated from frame to frame, and updated with variable-baseline stereo comparisons. We explicitly use prior knowledge about a pixel’s depth to select a suitable reference frame on a per-pixel basis, and to limit the disparity search range. The remainder of this paper is organized as follows: Section 2 describes the semi-dense mapping part of the proposed method, including the derivation of the observation accuracy as well as the probabilistic data fusion, propagation and regularization steps. Section 3 describes how new frames are tracked using whole-image alignment, and Sec. 4 summarizes the complete visual odometry method. A qualitative as well as a quantitative evaluation is presented in Sec. 5.
We then give a brief conclusion in Sec. 6. 2. Semi-Dense Depth Map Estimation One of the key ideas proposed in this paper is to estimate a semi-dense inverse depth map for the current camera image, which in turn can be used for estimating the camera pose of the next frame. This depth map is continuously propagated from frame to frame, and refined with new stereo depth measurements, which are obtained by performing per-pixel, adaptive-baseline stereo comparisons. This allows us to accurately estimate the depth both of close-by and far-away image regions. In contrast to previous work that accumulates the photometric cost over a sequence of several frames [11, 15], we keep exactly one inverse depth hypothesis per pixel that we represent as Gaussian probability distribution. This section is comprised of three main parts: Section 2.1 describes the stereo method used to extract new depth measurements from previous frames, and how they are incorporated into the prior depth map. In Sec. 2.2, we describe how the depth map is propagated from frame to frame. In Sec. 2.3, we detail how we partially regularize the obtained depth map in each iteration, and how outliers are handled. Throughout this section, d denotes the inverse depth of a pixel. [Figure 3 panels: reference, small baseline, medium baseline, large baseline; plot axis: inverse depth d. Figure 3. Variable Baseline Stereo: Reference image (left), three stereo images at different baselines (right), and the respective matching cost functions. While a small baseline (black) gives a unique, but imprecise minimum, a large baseline (red) allows for a very precise estimate, but has many false minima.] 2.1. Stereo-Based Depth Map Update It is well known [12] that for stereo, there is a trade-off between precision and accuracy (see Fig. 3).
While many multiple-baseline stereo approaches resolve this by accumulating the respective cost functions over many frames [5, 13], we propose a probabilistic approach which explicitly takes advantage of the fact that in a video, small-baseline frames are available before large-baseline frames.

The full depth map update (performed once for each new frame) consists of the following steps: First, a subset of pixels is selected for which the accuracy of a disparity search is sufficiently high. For this we use three intuitive and very efficiently computable criteria, which will be derived in Sec. 2.1.3. For each selected pixel, we then individually select a suitable reference frame, and perform a one-dimensional disparity search. Propagated prior knowledge is used to reduce the disparity search range when possible, decreasing computational cost and eliminating false minima. The obtained inverse depth estimate is then fused into the depth map.

Figure 4. Adaptive Baseline Selection: For each pixel in the new frame (top left), a different stereo-reference frame is selected, based on how long the pixel was visible (top right: the more yellow, the older the pixel). Some of the reference frames are displayed below; the red regions were used for stereo comparisons. (Panels: current frame; pixel's “age”; reference frames at -4.8 s, -3.9 s, -3.1 s, -2.2 s, -1.2 s, -0.8 s, -0.5 s and -0.4 s.)

2.1.1 Reference Frame Selection

Ideally, the reference frame is chosen such that it maximizes the stereo accuracy, while keeping the disparity search range as well as the observation angle sufficiently small. As the stereo accuracy depends on many factors and because this selection is done for each pixel independently, we employ the following heuristic: We use the oldest frame the pixel was observed in, where the disparity search range and the observation angle do not exceed a certain threshold (see Fig. 4).
If a disparity search is unsuccessful (i.e., no good match is found), the pixel's “age” is increased, such that subsequent disparity searches use newer frames where the pixel is likely to be still visible.

2.1.2 Stereo Matching Method

We perform an exhaustive search for the pixel's intensity along the epipolar line in the selected reference frame, and then perform a sub-pixel accurate localization of the matching disparity. If a prior inverse depth hypothesis is available, the search interval is limited to $d \pm 2\sigma_d$, where $d$ and $\sigma_d$ denote the mean and standard deviation of the prior hypothesis. Otherwise, the full disparity range is searched.

In our implementation, we use the SSD error over five equidistant points on the epipolar line: While this significantly increases robustness in high-frequency image regions, it does not change the purely one-dimensional nature of this search. Furthermore, it is computationally efficient, as 4 out of 5 interpolated image values can be re-used for each SSD evaluation.

2.1.3 Uncertainty Estimation

In this section, we use uncertainty propagation to derive an expression for the error variance $\sigma_d^2$ on the inverse depth $d$. In general, this can be done by expressing the optimal inverse depth $d^*$ as a function of the noisy inputs. Here we consider the images $I_0$, $I_1$ themselves, their relative orientation $\xi$ and the camera calibration in terms of a projection function $\pi$ (see footnote 1):

$d^* = d(I_0, I_1, \xi, \pi)$.    (1)

The error variance of $d^*$ is then given by

$\sigma_d^2 = J_d \Sigma J_d^T$,    (2)

where $J_d$ is the Jacobian of $d$, and $\Sigma$ the covariance of the input error. For more details on covariance propagation, including the derivation of this formula, we refer to [2]. For simplicity, the following analysis is performed for patch-free stereo, i.e., we consider only a point-wise search for a single intensity value along the epipolar line.
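As a hedged illustration of the first-order covariance propagation in Eq. (2), the following Python sketch pushes an input covariance through a scalar function using a numerically estimated Jacobian. The names `numeric_jacobian` and `propagate_variance`, the central-difference step `eps` and the toy function in the note below are assumptions made for illustration; they are not part of the method described here, which derives its Jacobians analytically.

```python
def numeric_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian (a single row) of a scalar function f at x."""
    jac = []
    for k in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[k] += eps
        lo[k] -= eps
        jac.append((f(hi) - f(lo)) / (2.0 * eps))
    return jac


def propagate_variance(f, x, cov, eps=1e-6):
    """First-order variance of f(x) for input covariance cov: J * cov * J^T."""
    jac = numeric_jacobian(f, x, eps)
    n = len(x)
    return sum(jac[a] * cov[a][b] * jac[b] for a in range(n) for b in range(n))
```

For a linear function such as f(x) = 2 x_0 - 3 x_1 with independent input variances 0.5 and 0.25, the sketch returns approximately 4 * 0.5 + 9 * 0.25 = 4.25, which is what Eq. (2) gives for a constant Jacobian.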
For this analysis, we split the computation into three steps: First, the epipolar line in the reference frame is computed. Second, the best matching position $\lambda^* \in \mathbb{R}$ along it (i.e., the disparity) is determined. Third, the inverse depth $d^*$ is computed from the disparity $\lambda^*$. The first two steps involve two independent error sources: the geometric error, which originates from noise on $\xi$ and $\pi$ and affects the first step, and the photometric error, which originates from noise in the images $I_0$, $I_1$ and affects the second step. The third step scales these errors by a factor, which depends on the baseline.

Footnote 1: In the linear case, this is the camera matrix $K$; in practice, however, nonlinear distortion and other (unmodeled) effects also play a role.

Geometric disparity error. The geometric error is the error $\epsilon_\lambda$ on the disparity $\lambda^*$ caused by noise on $\xi$ and $\pi$. While it would be possible to model, propagate, and estimate the complete covariance on $\xi$ and $\pi$, we found that the gain in accuracy does not justify the increase in computational complexity. We therefore use an intuitive approximation: Let the considered epipolar line segment $L \subset \mathbb{R}^2$ be defined by

$L := \left\{ l_0 + \lambda \begin{pmatrix} l_x \\ l_y \end{pmatrix} \,\middle|\, \lambda \in S \right\}$,    (3)

where $\lambda$ is the disparity with search interval $S$, $(l_x, l_y)^T$ the normalized epipolar line direction and $l_0$ the point corresponding to infinite depth. We now assume that only the absolute position of this line segment, i.e., $l_0$, is subject to isotropic Gaussian noise $\epsilon_l$. As in practice we keep the searched epipolar line segments short, the influence of rotational error is small, making this a good approximation.

Intuitively, a positioning error $\epsilon_l$ on the epipolar line causes a small disparity error $\epsilon_\lambda$ if the epipolar line is parallel to the image gradient, and a large one otherwise (see Fig. 5). This can be mathematically derived as follows: The image constrains the optimal disparity $\lambda^*$ to lie on a certain isocurve, i.e., a curve of equal intensity. We approximate this isocurve to be locally linear, i.e., the gradient direction to be locally constant. This gives

$l_0 + \lambda^* \begin{pmatrix} l_x \\ l_y \end{pmatrix} \overset{!}{=} g_0 + \gamma \begin{pmatrix} -g_y \\ g_x \end{pmatrix}, \quad \gamma \in \mathbb{R}$,    (4)

where $g := (g_x, g_y)$ is the image gradient and $g_0$ a point on the isoline. The influence of noise on the image values will be derived in the next paragraph, hence at this point $g$ and $g_0$ are assumed noise-free. Solving for $\lambda$ gives the optimal disparity $\lambda^*$ in terms of the noisy input $l_0$:

$\lambda^*(l_0) = \frac{\langle g, g_0 - l_0 \rangle}{\langle g, l \rangle}$.    (5)

Analogously to (2), the variance of the geometric disparity error can then be expressed as

$\sigma^2_{\lambda(\xi,\pi)} = J_{\lambda^*(l_0)} \begin{pmatrix} \sigma_l^2 & 0 \\ 0 & \sigma_l^2 \end{pmatrix} J^T_{\lambda^*(l_0)} = \frac{\sigma_l^2}{\langle g, l \rangle^2}$,    (6)

where $g$ is the normalized image gradient, $l$ the normalized epipolar line direction and $\sigma_l^2$ the variance of $\epsilon_l$. Note that this error term solely originates from noise on the relative camera orientation $\xi$ and the camera calibration $\pi$, i.e., it is independent of image intensity noise.

Figure 5. Geometric Disparity Error: Influence of a small positioning error $\epsilon_l$ of the epipolar line on the disparity error $\epsilon_\lambda$. The dashed line represents the isocurve on which the matching point has to lie. $\epsilon_\lambda$ is small if the epipolar line is parallel to the image gradient (left), and large otherwise (right).

Photometric disparity error. Intuitively, this error encodes that small image intensity errors have a large effect on the estimated disparity if the image gradient is small, and a small effect otherwise (see Fig. 6). Mathematically, this relation can be derived as follows. We seek the disparity $\lambda^*$ that minimizes the difference in intensities, i.e.,

$\lambda^* = \min_\lambda \left( i_{\mathrm{ref}} - I_p(\lambda) \right)^2$,    (7)

where $i_{\mathrm{ref}}$ is the reference intensity, and $I_p(\lambda)$ the image intensity on the epipolar line at disparity $\lambda$. We assume a good initialization $\lambda_0$ to be available from the exhaustive search. Using a first-order Taylor approximation for $I_p$ gives

$\lambda^*(I) = \lambda_0 + \left( i_{\mathrm{ref}} - I_p(\lambda_0) \right) g_p^{-1}$,    (8)

where $g_p$ is the gradient of $I_p$, that is, the image gradient along the epipolar line.
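The geometric disparity error of Eqs. (5) and (6) can be checked numerically with a small Python sketch. This is a minimal illustration under stated assumptions: the function names and the tuple-based 2D vectors are hypothetical, and the inputs are normalized inside the variance function so that Eq. (6) applies as written. The guard against a vanishing inner product is an addition for robustness and is not taken from the text.

```python
import math


def optimal_disparity(g, g0, l0, l):
    """Eq. (5): lambda* = <g, g0 - l0> / <g, l> for 2D tuples g, g0, l0, l."""
    num = g[0] * (g0[0] - l0[0]) + g[1] * (g0[1] - l0[1])
    den = g[0] * l[0] + g[1] * l[1]
    return num / den


def geometric_disparity_variance(g, l, sigma_l_sq):
    """Eq. (6): sigma_l^2 / <g, l>^2, with g and l normalized to unit length."""
    g_norm = math.hypot(g[0], g[1])
    l_norm = math.hypot(l[0], l[1])
    cos_angle = (g[0] * l[0] + g[1] * l[1]) / (g_norm * l_norm)
    if abs(cos_angle) < 1e-12:
        # Epipolar line runs along the isocurve: the disparity is unconstrained.
        raise ValueError("epipolar line is perpendicular to the image gradient")
    return sigma_l_sq / (cos_angle * cos_angle)
```

With the gradient parallel to the epipolar line, the variance equals the positioning variance itself; at an angle of 45 degrees it doubles, which mirrors the intuition illustrated in Fig. 5.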
For clarity we only consider noise on $i_{\mathrm{ref}}$ and $I_p(\lambda_0)$; equivalent results are obtained in the general case when taking into account noise on the image values involved in the computation of $g_p$. The variance of the photometric disparity error is given by

$\sigma^2_{\lambda(I)} = J_{\lambda^*(I)} \begin{pmatrix} \sigma_i^2 & 0 \\ 0 & \sigma_i^2 \end{pmatrix} J^T_{\lambda^*(I)} = \frac{2 \sigma_i^2}{g_p^2}$,    (9)

where $\sigma_i^2$ is the variance of the image intensity noise. The respective error originates solely from noisy image intensity values, and hence is independent of the geometric disparity error.

Figure 6. Photometric Disparity Error: Noise $\epsilon_i$ on the image intensity values causes a small disparity error $\epsilon_\lambda$ if the image gradient along the epipolar line is large (left). If the gradient is small, the disparity error is magnified (right).

Pixel to inverse depth conversion. Using that, for small camera rotation, the inverse depth $d$ is approximately proportional to the disparity $\lambda$, the observation variance of the inverse depth $\sigma^2_{d,\mathrm{obs}}$ can be calculated using

$\sigma^2_{d,\mathrm{obs}} = \alpha^2 \left( \sigma^2_{\lambda(\xi,\pi)} + \sigma^2_{\lambda(I)} \right)$,    (10)

where the proportionality constant $\alpha$ (in the general, non-rectified case) is different for each pixel, and can be calculated from

$\alpha := \frac{\delta_d}{\delta_\lambda}$,    (11)

where $\delta_d$ is the length of the searched inverse depth interval, and $\delta_\lambda$ the length of the searched epipolar line segment. While $\alpha$ is inversely linear in the length of the camera translation, it also depends on the translation direction and the pixel's location in the image.

When using an SSD error over multiple points along the epipolar line, as our implementation does, a good upper bound for the matching uncertainty is then given by

$\sigma^2_{d,\mathrm{obs\text{-}SSD}} \le \alpha^2 \left( \min\{\sigma^2_{\lambda(\xi,\pi)}\} + \min\{\sigma^2_{\lambda(I)}\} \right)$,    (12)

where the min goes over all points included in the SSD error.

2.1.4 Depth Observation Fusion

After a depth observation for a pixel in the current image has been obtained, we integrate it into the depth map as follows: If no prior hypothesis for a pixel exists, we initialize it directly with the observation.
Otherwise, the new observation is incorporated into the prior, i.e., the two distributions are multiplied (corresponding to the update step in a Kalman filter): Given a prior distribution $\mathcal{N}(d_p, \sigma_p^2)$ and a noisy observation $\mathcal{N}(d_o, \sigma_o^2)$, the posterior is given by

$\mathcal{N}\left( \frac{\sigma_p^2 d_o + \sigma_o^2 d_p}{\sigma_p^2 + \sigma_o^2}, \; \frac{\sigma_p^2 \sigma_o^2}{\sigma_p^2 + \sigma_o^2} \right)$.    (13)

2.1.5 Summary of Uncertainty-Aware Stereo

New stereo observations are obtained on a per-pixel basis, adaptively selecting for each pixel a suitable reference frame and performing a one-dimensional search along the epipolar line. We identified the three major factors which determine the accuracy of such a stereo observation, i.e.,

• the photometric disparity error $\sigma^2_{\lambda(I)}$, depending on the magnitude of the image gradient along the epipolar line,
• the geometric disparity error $\sigma^2_{\lambda(\xi,\pi)}$, depending on the angle between the image gradient and the epipolar line (independent of the gradient magnitude), and
• the pixel to inverse depth ratio $\alpha$, depending on the camera translation, the focal length and the pixel's position.

These three simple-to-compute and purely local criteria are used to determine for which pixels a stereo update is worth the computational cost. Further, the computed observation variance is then used to integrate the new measurements into the existing depth map.

2.2. Depth Map Propagation

We continuously propagate the estimated inverse depth map from frame to frame, once the camera position of the next frame has been estimated. Based on the inverse depth estimate $d_0$ for a pixel, the corresponding 3D point is calculated and projected into the new frame, providing an inverse depth estimate $d_1$ in the new frame. The hypothesis is then assigned to the closest integer pixel position; to eliminate discretization errors, the sub-pixel accurate image location of the projected point is kept and re-used for the next propagation step.
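The uncertainty-aware update of Sec. 2.1.3 to 2.1.5 can be sketched in a few lines of Python: the photometric variance of Eq. (9), the combined observation variance of Eq. (10) and the Gaussian fusion of Eq. (13). This is a minimal sketch, not the authors' implementation; the function and parameter names are hypothetical, and the geometric variance is assumed to have been computed separately according to Eq. (6).

```python
def photometric_disparity_variance(g_p, sigma_i_sq):
    """Eq. (9): 2 * sigma_i^2 / g_p^2, with g_p the gradient along the epipolar line."""
    return 2.0 * sigma_i_sq / (g_p * g_p)


def observation_variance(alpha, var_geometric, var_photometric):
    """Eq. (10): alpha^2 * (geometric + photometric disparity variance)."""
    return alpha * alpha * (var_geometric + var_photometric)


def fuse(d_prior, var_prior, d_obs, var_obs):
    """Eq. (13): product of two Gaussians over the inverse depth."""
    total = var_prior + var_obs
    mean = (var_prior * d_obs + var_obs * d_prior) / total
    var = var_prior * var_obs / total
    return mean, var
```

Fusing a prior N(1.0, 0.04) with an equally uncertain observation N(2.0, 0.04) gives a mean of 1.5 and halves the variance to 0.02. An observation with a much larger variance leaves the prior nearly unchanged, which is why pixels with a poor expected observation variance are not worth a stereo update.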
For propagating the inverse depth variance, we assume the camera rotation to be small. The new inverse depth $d_1$ can then be approximated by

$d_1(d_0) = \left( d_0^{-1} - t_z \right)^{-1}$,    (14)

where $t_z$ is the camera translation along the optical axis. The variance of $d_1$ is hence given by

$\sigma^2_{d_1} = J_{d_1} \sigma^2_{d_0} J^T_{d_1} + \sigma_p^2 = \left( \frac{d_1}{d_0} \right)^4 \sigma^2_{d_0} + \sigma_p^2$,    (15)

where $\sigma_p^2$ is the prediction uncertainty, which directly corresponds to the prediction step in an extended Kalman filter. It can also be interpreted as keeping the variance on the z-coordinate of a point fixed, i.e., setting $\sigma^2_{z_0} = \sigma^2_{z_1}$. We found that using small values for $\sigma_p^2$ decreases drift, as it causes the estimated geometry to gradually “lock” into place.

Figure 7. [...] in the top right shows the new frame $I_2(x)$ without depth information. Middle: Intermediate steps while minimizing $E(\xi)$ on different pyramid levels. The top row shows the back-warped new frame $I_2(w(x, d, \xi))$, the bottom row shows the respective residual image $I_2(w(x, d, \xi)) - I_1(x)$. The bottom right image shows the final pixel-weights (black = small weight). Small weights mainly correspond to newly occluded or disoccluded pixels.

Collision handling. At all times, we allow at most one inverse depth hypothesis per pixel: If two inverse depth hypotheses are propagated to the same pixel in the new frame, we distinguish between two cases:

1. if they are statistically similar, i.e., lie within 2σ bounds, they are treated as two independent observations of the pixel's depth and fused according to (13).
2. otherwise, the point that is further away from the camera is assumed to be occluded, and is removed.

2.3. Depth Map Regularization

For each frame, after all observations have been incorporated, we perform one regularization iteration by assigning each inverse depth value the average of the surrounding inverse depths, weighted by their respective inverse variance.
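The propagation step of Eqs. (14) and (15) in Sec. 2.2 reduces to two lines of arithmetic per pixel. The following Python sketch illustrates it under the small-rotation assumption stated there; the function name, the argument names and the check that the point stays in front of the camera are assumptions added for illustration.

```python
def propagate_inverse_depth(d0, var_d0, t_z, var_pred):
    """Propagate an inverse depth hypothesis along the optical axis.

    d0       : inverse depth in the old frame (must be positive)
    var_d0   : variance of d0
    t_z      : camera translation along the optical axis
    var_pred : prediction uncertainty sigma_p^2 added in every step
    """
    new_depth = 1.0 / d0 - t_z
    if d0 <= 0.0 or new_depth <= 0.0:
        raise ValueError("point does not stay in front of the camera")
    d1 = 1.0 / new_depth                          # Eq. (14)
    var_d1 = (d1 / d0) ** 4 * var_d0 + var_pred   # Eq. (15)
    return d1, var_d1
```

For example, a point at depth 2 (d0 = 0.5) seen after moving 1 unit forward ends up at depth 1 (d1 = 1.0), and its variance is scaled by (d1/d0)^4 = 16 before the prediction uncertainty is added. Approaching a point therefore inflates the inverse depth variance, while moving away from it shrinks the variance.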
To preserve sharp edges, if two adjacent inverse depth values are statistically different, i.e., are further away than 2σ, they do not contribute to one another. Note that the respective variances are not changed during regularization to account for the high correlation between neighboring hypotheses. Instead we use the minimal variance of all neighboring pixels when defining the stereo search range, and as a weighting factor for tracking (see Sec. 3).

Outlier removal. To handle outliers, we continuously keep track of the validity of each inverse depth hypothesis in terms of the probability that it is an outlier, or has become invalid (e.g., due to occlusion or a moving object). For each successful stereo observation, this probability is decreased. It is increased for each failed stereo search, if the respective intensity changes significantly on propagation, or when the absolute image gradient falls below a given threshold.

If, during regularization, the probability that all contributing neighbors are outliers (i.e., the product of their individual outlier-probabilities) rises above a given threshold, the hypothesis is removed. Equally, if for an “empty” pixel this product drops below a given threshold, a new hypothesis is created from the neighbors. This fills holes arising from the forward-warping nature of the propagation step, and dilates the semi-dense depth map to a small neighborhood around sharp image intensity edges, which significantly increases tracking and mapping robustness.

3. Dense Tracking

Based on the inverse depth map of the previous frame, we estimate the camera pose of the current frame using dense image alignment. Such methods have previously been applied successfully (in real-time on a CPU) for tracking RGB-D cameras [7], which directly provide dense depth measurements along with the color image.
It is based on the direct minimization of the photometric error

$r_i(\xi) := \left( I_2(w(x_i, d_i, \xi)) - I_1(x_i) \right)^2$,    (16)

where the warp function $w: \Omega_1 \times \mathbb{R} \times \mathbb{R}^6 \to \Omega_2$ maps each point $x_i \in \Omega_1$ in the reference image $I_1$ to the respective point $w(x_i, d_i, \xi) \in \Omega_2$ in the new image $I_2$. As input it requires the 3D pose of the camera $\xi \in \mathbb{R}^6$ and uses the estimated inverse depth $d_i \in \mathbb{R}$ for the pixel in $I_1$. Note that no depth information with respect to $I_2$ is required.

To increase robustness to self-occlusion and moving objects, we apply a weighting scheme as proposed in [7]. Further, we add the variance of the inverse depth $\sigma^2_{d_i}$ as an additional weighting term, making the tracking resistant to recently initialized and still inaccurate depth estimates from the mapping process. The final energy that is minimized is hence given by

$E(\xi) := \sum_i \frac{\alpha(r_i(\xi))}{\sigma^2_{d_i}} \, r_i(\xi)$,    (17)

where $\alpha: \mathbb{R} \to \mathbb{R}$ defines the weight for a given residual. Minimizing this error can be interpreted as computing the maximum likelihood estimator for $\xi$, assuming independent noise on the image intensity values.

The resulting weighted least-squares problem is solved efficiently using an iteratively reweighted Gauss-Newton algorithm coupled with a coarse-to-fine approach, using four pyramid levels. Figure 7 shows an example of the tracking process. For further details on the minimization we refer to [1].

Figure 8. Examples: Top: Camera images overlaid with the respective estimated semi-dense inverse depth map. Bottom: 3D view of tracked scene. Note the versatility of our approach: It accurately reconstructs and tracks through (outdoor) scenes with a large depth variance, including far-away objects like clouds, as well as (indoor) scenes with little structure and close to no image corners / keypoints. More examples are shown in the attached video.

4. System Overview

Tracking and depth estimation are split into two separate threads: One continuously propagates the inverse depth map to the most recent tracked frame, updates it with stereo comparisons and partially regularizes it. The other simultaneously tracks each incoming frame on the most recent available depth map. While tracking is performed in real-time at 30 Hz, one complete mapping iteration takes longer and is hence done at roughly 15 Hz; if the map is heavily populated, we adaptively reduce the number of stereo comparisons to maintain a constant frame-rate. For stereo observations, a buffer of up to 100 past frames is kept, automatically removing those that are used least.

We use a standard, keypoint-based method to obtain the relative camera pose between two initial frames, which are then used to initialize the inverse depth map needed for tracking successive frames. From this point onward, our method is entirely self-contained. In preliminary experiments, we found that in most cases our approach is even able to recover from random or extremely inaccurate initial depth maps, indicating that the keypoint-based initialization might become superfluous in the future.

Table 1. Results on RGB-D Benchmark

             position drift (cm/s)     rotation drift (deg/s)
             ours    [7]     [8]       ours    [7]     [8]
fr2/xyz      0.6     0.6     8.2       0.33    0.34    3.27
fr2/desk     2.1     2.0     -         0.65    0.70    -

5. Results

We have tested our approach both on publicly available benchmark sequences and live, using a hand-held camera. Some examples are shown in Fig. 8. Note that our method does not attempt to build a global map, i.e., once a point leaves the field of view of the camera or becomes occluded, the respective depth value is deleted. All experiments are performed on a standard consumer laptop with an Intel i7 quad-core CPU. In a preprocessing step, we rectify all images such that a pinhole camera model can be applied.

5.1. RGB-D Benchmark Sequences

As basis for a quantitative evaluation and to facilitate reproducibility and easy comparison with other methods, we use the TUM RGB-D benchmark [16]. For tracking and mapping we only use the gray-scale images; for the very first frame, however, the provided depth image is used as initialization.

Our method (like any monocular visual odometry method) fails in case of pure camera rotation, as the depth of new regions cannot be determined. The achieved tracking accuracy for two feasible sequences (that is, sequences which do not contain strong camera rotation without simultaneous translation) is given in Table 1. For comparison we also list the accuracy from (1) a state-of-the-art, dense RGB-D odometry [7], and (2) a state-of-the-art, keypoint-based monocular SLAM system (PTAM, [8]). We initialize PTAM using the built-in stereo initializer, and perform a 7DoF (rigid body plus scale) alignment to the ground truth trajectory. Figure 9 shows the tracked camera trajectory for fr2/desk. We found that our method achieves similar accuracy as [7], which uses the same dense tracking algorithm but relies on the Kinect depth images. The keypoint-based approach [8] proves to be significantly less accurate and robust; it consistently failed after a few seconds for the second sequence.

Figure 9. [...] camera trajectory (black), the depth map of the first frame (blue), and the estimated depth map (gray-scale) after a complete loop around the table. Note how well certain details such as the keyboard and monitor align.

5.2. Additional Test Sequences

To analyze our approach in more detail, we recorded additional challenging sequences with the corresponding ground truth trajectory in a motion capture studio. Figure 10 shows an extract from the video, as well as the tracked and the ground-truth camera position over time.
As can be seen from the figure, our approach is able to maintain a reasonably dense depth map at all times and the estimated camera trajectory closely matches the ground truth.

Figure 10. Additional Sequence: Estimated camera trajectory and ground truth (dashed) for a long and challenging sequence (position [m] in x, y and z over time, 0 s to 60 s). The complete sequence is shown in the attached video.

6. Conclusion

In this paper we proposed a novel visual odometry method for a monocular camera, which does not require discrete features. In contrast to previous work on dense tracking and mapping, our approach is based on probabilistic depth map estimation and fusion over time. Depth measurements are obtained from patch-free stereo matching in different reference frames at a suitable baseline, which are selected on a per-pixel basis. To our knowledge, this is the first featureless monocular visual odometry method which runs in real-time on a CPU. In our experiments, we showed that the tracking performance of our approach is comparable to that of fully dense methods without requiring a depth sensor.

References

[1] S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying framework. Technical report, Carnegie Mellon Univ., 2002.
[2] A. Clifford. Multivariate Error Analysis. John Wiley & Sons, 1973.
[3] A. Comport, E. Malis, and P. Rives. Accurate quadri-focal tracking for robust 3d visual odometry. In ICRA, 2007.
[4] A. Davison, I. Reid, N. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 29, 2007.
[5] D. Gallup, J. Frahm, P. Mordohai, and M. Pollefeys. Variable baseline/resolution stereo. In CVPR, 2008.
[6] C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision Conference, 1988.
[7] C. Kerl, J. Sturm, and D. Cremers. Robust odometry estimation for RGB-D cameras. In ICRA, 2013.
[8] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Mixed and Augmented Reality (ISMAR), 2007.
[9] G. Klein and D. Murray. Improving the agility of keyframe-based SLAM. In ECCV, 2008.
[10] M. Pollefeys et al. Detailed real-time urban 3d reconstruction from video. IJCV, 78(2-3):143–167, 2008.
[11] L. Matthies, R. Szeliski, and T. Kanade. Incremental estimation of dense depth maps from image sequences. In CVPR, 1988.
[12] R. Newcombe, S. Lovegrove, and A. Davison. DTAM: Dense tracking and mapping in real-time. In ICCV, 2011.
[13] M. Okutomi and T. Kanade. A multiple-baseline stereo. Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 15(4):353–363, 1993.
[14] T. Sato, M. Kanbara, N. Yokoya, and H. Takemura. Dense 3-d reconstruction of an outdoor scene by hundreds-baseline stereo using a hand-held camera. IJCV, 47:1–3, 2002.
[15] J. Stuehmer, S. Gumhold, and D. Cremers. Real-time dense geometry from a handheld camera. In Pattern Recognition (DAGM), 2010.
[16] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Intelligent Robot Systems (IROS), 2012.
[17] A. Wendel, M. Maurer, G. Graber, T. Pock, and H. Bischof. Dense reconstruction on-the-fly. In ECCV, 2012.
4 0.20754518 408 iccv-2013-Super-resolution via Transform-Invariant Group-Sparse Regularization
Author: Carlos Fernandez-Granda, Emmanuel J. Candès
Abstract: We present a framework to super-resolve planar regions found in urban scenes and other man-made environments by taking into account their 3D geometry. Such regions have highly structured straight edges, but this prior is challenging to exploit due to deformations induced by the projection onto the imaging plane. Our method factors out such deformations by using recently developed tools based on convex optimization to learn a transform that maps the image to a domain where its gradient has a simple group-sparse structure. This allows us to obtain a novel convex regularizer that enforces global consistency constraints between the edges of the image. Computational experiments with real images show that this data-driven approach to the design of regularizers promoting transform-invariant group sparsity is very effective at high super-resolution factors. We view our approach as complementary to most recent super-resolution methods, which tend to focus on hallucinating high-frequency textures.
5 0.18055557 133 iccv-2013-Efficient Hand Pose Estimation from a Single Depth Image
Author: Chi Xu, Li Cheng
Abstract: We tackle the practical problem of hand pose estimation from a single noisy depth image. A dedicated three-step pipeline is proposed: The initial estimation step provides an initial estimate of the hand in-plane orientation and 3D location; the candidate generation step produces a set of 3D pose candidates from the Hough voting space with the help of the rotational invariant depth features; the verification step delivers the final 3D hand pose as the solution to an optimization problem. We analyze the depth noise, and suggest tips to minimize its negative impact on the overall performance. Our approach is able to work with Kinect-type noisy depth images, and reliably and efficiently produces pose estimates of general motions (12 frames per second). Extensive experiments are conducted to qualitatively and quantitatively evaluate the performance with respect to the state-of-the-art methods that have access to additional RGB images. Our approach is shown to deliver on par or even better results.
6 0.17969632 108 iccv-2013-Depth from Combining Defocus and Correspondence Using Light-Field Cameras
7 0.17393111 444 iccv-2013-Viewing Real-World Faces in 3D
8 0.17217268 199 iccv-2013-High Quality Shape from a Single RGB-D Image under Uncalibrated Natural Illumination
9 0.14923966 319 iccv-2013-Point-Based 3D Reconstruction of Thin Objects
10 0.14386266 341 iccv-2013-Real-Time Body Tracking with One Depth Camera and Inertial Sensors
11 0.13675255 88 iccv-2013-Constant Time Weighted Median Filtering for Stereo Matching and Beyond
12 0.12933975 424 iccv-2013-Tracking Revisited Using RGBD Camera: Unified Benchmark and Baselines
13 0.12496547 366 iccv-2013-STAR3D: Simultaneous Tracking and Reconstruction of 3D Objects Using RGB-D Data
14 0.12219074 254 iccv-2013-Live Metric 3D Reconstruction on Mobile Phones
15 0.12126336 218 iccv-2013-Interactive Markerless Articulated Hand Motion Tracking Using RGB and Depth Data
16 0.11811993 9 iccv-2013-A Flexible Scene Representation for 3D Reconstruction Using an RGB-D Camera
17 0.1142465 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps
18 0.11072107 12 iccv-2013-A General Dense Image Matching Framework Combining Direct and Feature-Based Costs
19 0.105825 367 iccv-2013-SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels
20 0.1041986 144 iccv-2013-Estimating the 3D Layout of Indoor Scenes and Its Clutter from Depth Sensors
topicId topicWeight
[(0, 0.182), (1, -0.217), (2, -0.063), (3, 0.042), (4, -0.045), (5, -0.033), (6, 0.011), (7, -0.184), (8, -0.058), (9, 0.018), (10, -0.021), (11, -0.023), (12, -0.009), (13, 0.035), (14, 0.0), (15, -0.104), (16, -0.089), (17, -0.18), (18, -0.036), (19, 0.069), (20, -0.015), (21, 0.174), (22, -0.118), (23, 0.073), (24, 0.032), (25, 0.072), (26, 0.009), (27, 0.036), (28, -0.038), (29, 0.108), (30, 0.005), (31, -0.007), (32, -0.065), (33, 0.126), (34, -0.082), (35, -0.086), (36, 0.01), (37, 0.025), (38, 0.013), (39, 0.061), (40, 0.081), (41, -0.06), (42, 0.144), (43, 0.071), (44, -0.153), (45, 0.054), (46, 0.035), (47, 0.122), (48, -0.067), (49, -0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.98462319 209 iccv-2013-Image Guided Depth Upsampling Using Anisotropic Total Generalized Variation
Author: David Ferstl, Christian Reinbacher, Rene Ranftl, Matthias Ruether, Horst Bischof
Abstract: In this work we present a novel method for the challenging problem of depth image upsampling. Modern depth cameras such as Kinect or Time of Flight cameras deliver dense, high quality depth measurements but are limited in their lateral resolution. To overcome this limitation we formulate a convex optimization problem using higher order regularization for depth image upsampling. In this optimization an anisotropic diffusion tensor, calculated from a high resolution intensity image, is used to guide the upsampling. We derive a numerical algorithm based on a primal-dual formulation that is efficiently parallelized and runs at multiple frames per second. We show that this novel upsampling clearly outperforms state of the art approaches in terms of speed and accuracy on the widely used Middlebury 2007 datasets. Furthermore, we introduce novel datasets with highly accurate groundtruth, which, for the first time, enable benchmarking of depth upsampling methods using real sensor data.
2 0.94273251 18 iccv-2013-A Joint Intensity and Depth Co-sparse Analysis Model for Depth Map Super-resolution
Author: Martin Kiechle, Simon Hawe, Martin Kleinsteuber
Abstract: High-resolution depth maps can be inferred from low-resolution depth measurements and an additional high-resolution intensity image of the same scene. To that end, we introduce a bimodal co-sparse analysis model, which is able to capture the interdependency of registered intensity and depth information. This model is based on the assumption that the co-supports of corresponding bimodal image structures are aligned when computed by a suitable pair of analysis operators. No analytic form of such operators exists and we propose a method for learning them from a set of registered training signals. This learning process is done offline and returns a bimodal analysis operator that is universally applicable to natural scenes. We use this to exploit the bimodal co-sparse analysis model as a prior for solving inverse problems, which leads to an efficient algorithm for depth map super-resolution.
3 0.89179128 108 iccv-2013-Depth from Combining Defocus and Correspondence Using Light-Field Cameras
Author: Michael W. Tao, Sunil Hadap, Jitendra Malik, Ravi Ramamoorthi
Abstract: Light-field cameras have recently become available to the consumer market. An array of micro-lenses captures enough information that one can refocus images after acquisition, as well as shift one's viewpoint within the subapertures of the main lens, effectively obtaining multiple views. Thus, depth cues from both defocus and correspondence are available simultaneously in a single capture. Previously, defocus could be achieved only through multiple image exposures focused at different depths, while correspondence cues needed multiple exposures at different viewpoints or multiple cameras; moreover, both cues could not easily be obtained together. In this paper, we present a novel simple and principled algorithm that computes dense depth estimation by combining both defocus and correspondence depth cues. We analyze the x-u 2D epipolar image (EPI), where by convention we assume the spatial x coordinate is horizontal and the angular u coordinate is vertical (our final algorithm uses the full 4D EPI). We show that defocus depth cues are obtained by computing the horizontal (spatial) variance after vertical (angular) integration, and correspondence depth cues by computing the vertical (angular) variance. We then show how to combine the two cues into a high quality depth map, suitable for computer vision applications such as matting, full control of depth-of-field, and surface reconstruction.
4 0.79662603 382 iccv-2013-Semi-dense Visual Odometry for a Monocular Camera
Author: Jakob Engel, Jürgen Sturm, Daniel Cremers
Abstract: We propose a fundamentally novel approach to real-time visual odometry for a monocular camera. It retains the simplicity and accuracy of dense tracking, which does not depend on visual features, while running in real-time on a CPU. The key idea is to continuously estimate a semi-dense inverse depth map for the current frame, which in turn is used to track the motion of the camera using dense image alignment. More specifically, we estimate the depth of all pixels which have a non-negligible image gradient. Each estimate is represented as a Gaussian probability distribution over the inverse depth. We propagate this information over time, and update it with new measurements as new images arrive. In terms of tracking accuracy and computational speed, the proposed method compares favorably to both state-of-the-art dense and feature-based visual odometry and SLAM algorithms. As our method runs in real-time on a CPU, it is of large practical value for robotics and augmented reality applications.

1. Towards Dense Monocular Visual Odometry

Tracking a hand-held camera and recovering the three-dimensional structure of the environment in real-time is among the most prominent challenges in computer vision. In recent years, dense approaches to these challenges have become increasingly popular: Instead of operating solely on visual feature positions, they reconstruct and track on the whole image using a surface-based map and thereby are fundamentally different from feature-based approaches. Yet, these methods are to date either not real-time capable on standard CPUs [11, 15, 17] or require direct depth measurements from the sensor [7], making them unsuitable for many practical applications. In this paper, we propose a novel semi-dense visual odometry approach for a monocular camera, which combines the accuracy and robustness of dense approaches with the efficiency of feature-based methods.
Further, it computes highly accurate semi-dense depth maps from the monocular images, providing rich information about the 3D structure of the environment. We use the term visual odometry as opposed to SLAM because, for simplicity, we deliberately maintain only information about the currently visible scene, instead of building a global world-model.

Acknowledgment: This work was supported by the ERC Starting Grant ConvexVision and the DFG project Mapping on Demand.

Figure 1. Semi-Dense Monocular Visual Odometry: Our approach works on a semi-dense inverse depth map and combines the accuracy and robustness of dense visual SLAM methods with the efficiency of feature-based techniques. Left: video frame. Right: color-coded semi-dense depth map, which consists of depth estimates in all image regions with sufficient structure.

1.1. Related Work

Feature-based monocular SLAM. In all feature-based methods (such as [4, 8]), tracking and mapping consist of two separate steps: First, discrete feature observations (i.e., their locations in the image) are extracted and matched to each other. Second, the camera and the full feature poses are calculated from a set of such observations, disregarding the images themselves. While this preliminary abstraction step greatly reduces the complexity of the overall problem and allows it to be tackled in real time, it inherently comes with two significant drawbacks: First, only image information conforming to the respective feature type and parametrization (typically image corners and blobs [6] or line segments [9]) is utilized. Second, features have to be matched to each other, which often requires the costly computation of scale- and rotation-invariant descriptors and robust outlier estimation methods like RANSAC.

Dense monocular SLAM. To overcome these limitations and to better exploit the available image information, dense monocular SLAM methods [11, 17] have recently been proposed.
The fundamental difference to keypoint-based approaches is that these methods directly work on the images instead of a set of extracted features, for both mapping and tracking: The world is modeled as a dense surface, while in turn new frames are tracked using whole-image alignment. This concept removes the need for discrete features and makes it possible to exploit all information present in the image, increasing tracking accuracy and robustness. To date, however, doing this in real-time is only possible using modern, powerful GPU processors. Similar methods are broadly used in combination with RGB-D cameras [7], which directly measure the depth of each pixel, or stereo camera rigs [3], greatly reducing the complexity of the problem.

Dense multi-view stereo. Significant prior work exists on multi-view dense reconstruction, both in a real-time setting [13, 11, 15] and off-line [5, 14]. In particular for off-line reconstruction, there is a long history of using different baselines to steer the stereo-inherent trade-off between accuracy and precision [12]. Most similar to our approach is the early work of Matthies et al., who proposed probabilistic depth map fusion and propagation for image sequences [10], however only for structure from motion, i.e., not coupled with subsequent dense tracking.

1.2. Contributions

In this paper, we propose a novel semi-dense approach to monocular visual odometry, which does not require feature points. The key concepts are
• a probabilistic depth map representation,
• tracking based on whole-image alignment,
• the reduction to image regions which carry information (semi-dense), and
• the full incorporation of stereo measurement uncertainty.
To the best of our knowledge, this is the first featureless monocular visual odometry approach which runs in real-time on a CPU.

1.3. Method Outline

Our approach is partially motivated by the basic principle that for most real-time applications, video information is abundant and cheap to come by. Therefore, the computational budget should be spent such that the expected information gain is maximized. Instead of reducing the images to a sparse set of feature observations, however, our method continuously estimates a semi-dense inverse depth map for the current frame, i.e., a dense depth map covering all image regions with non-negligible gradient (see Fig. 2). It comprises one inverse depth hypothesis per pixel, modeled by a Gaussian probability distribution. This representation still allows the use of whole-image alignment [7] to track new frames, while at the same time greatly reducing computational complexity compared to volumetric methods. The estimated depth map is propagated from frame to frame, and updated with variable-baseline stereo comparisons. We explicitly use prior knowledge about a pixel's depth to select a suitable reference frame on a per-pixel basis, and to limit the disparity search range.

Figure 2. Semi-Dense Approach (panels: original image; semi-dense depth map (ours); keypoint depth map [8]; dense depth map; RGB-D camera [16]): Our approach reconstructs and tracks on a semi-dense inverse depth map, which is dense in all image regions carrying information (top-right). For comparison, the bottom row shows the respective result from a keypoint-based approach, a fully dense approach and the ground truth from an RGB-D camera.

The remainder of this paper is organized as follows: Section 2 describes the semi-dense mapping part of the proposed method, including the derivation of the observation accuracy as well as the probabilistic data fusion, propagation and regularization steps. Section 3 describes how new frames are tracked using whole-image alignment, and Sec. 4 summarizes the complete visual odometry method. A qualitative as well as a quantitative evaluation is presented in Sec. 5.
We then give a brief conclusion in Sec. 6.

2. Semi-Dense Depth Map Estimation

One of the key ideas proposed in this paper is to estimate a semi-dense inverse depth map for the current camera image, which in turn can be used for estimating the camera pose of the next frame. This depth map is continuously propagated from frame to frame, and refined with new stereo depth measurements, which are obtained by performing per-pixel, adaptive-baseline stereo comparisons. This allows us to accurately estimate the depth both of close-by and far-away image regions. In contrast to previous work that accumulates the photometric cost over a sequence of several frames [11, 15], we keep exactly one inverse depth hypothesis per pixel, which we represent as a Gaussian probability distribution.

This section comprises three main parts: Section 2.1 describes the stereo method used to extract new depth measurements from previous frames, and how they are incorporated into the prior depth map. In Sec. 2.2, we describe how the depth map is propagated from frame to frame. In Sec. 2.3, we detail how we partially regularize the obtained depth map in each iteration, and how outliers are handled. Throughout this section, $d$ denotes the inverse depth of a pixel.

Figure 3. Variable Baseline Stereo: Reference image (left), three stereo images at small, medium and large baselines (right), and the respective matching cost functions, plotted as cost over inverse depth $d$. While a small baseline (black) gives a unique, but imprecise minimum, a large baseline (red) allows for a very precise estimate, but has many false minima.

2.1. Stereo-Based Depth Map Update

It is well known [12] that for stereo, there is a trade-off between precision and accuracy (see Fig. 3).
While many multiple-baseline stereo approaches resolve this by accumulating the respective cost functions over many frames [5, 13], we propose a probabilistic approach which explicitly takes advantage of the fact that in a video, small-baseline frames are available before large-baseline frames.

The full depth map update (performed once for each new frame) consists of the following steps: First, a subset of pixels is selected for which the accuracy of a disparity search is sufficiently large. For this we use three intuitive and very efficiently computable criteria, which will be derived in Sec. 2.1.3. For each selected pixel, we then individually select a suitable reference frame, and perform a one-dimensional disparity search. Propagated prior knowledge is used to reduce the disparity search range when possible, decreasing computational cost and eliminating false minima. The obtained inverse depth estimate is then fused into the depth map.

2.1.1 Reference Frame Selection

Figure 4. Adaptive Baseline Selection (panels: current frame; pixel's "age"; reference frames at -4.8 s, -3.9 s, -3.1 s, -2.2 s, -1.2 s, -0.8 s, -0.5 s, -0.4 s): For each pixel in the new frame (top left), a different stereo-reference frame is selected, based on how long the pixel was visible (top right: the more yellow, the older the pixel). Some of the reference frames are displayed below; the red regions were used for stereo comparisons.

Ideally, the reference frame is chosen such that it maximizes the stereo accuracy, while keeping the disparity search range as well as the observation angle sufficiently small. As the stereo accuracy depends on many factors and because this selection is done for each pixel independently, we employ the following heuristic: We use the oldest frame the pixel was observed in, where the disparity search range and the observation angle do not exceed a certain threshold (see Fig. 4).
If a disparity search is unsuccessful (i.e., no good match is found), the pixel's "age" is increased, such that subsequent disparity searches use newer frames where the pixel is likely to be still visible.

2.1.2 Stereo Matching Method

We perform an exhaustive search for the pixel's intensity along the epipolar line in the selected reference frame, and then perform a sub-pixel accurate localization of the matching disparity. If a prior inverse depth hypothesis is available, the search interval is limited by $d \pm 2\sigma_d$, where $d$ and $\sigma_d$ denote the mean and standard deviation of the prior hypothesis. Otherwise, the full disparity range is searched.

In our implementation, we use the SSD error over five equidistant points on the epipolar line: While this significantly increases robustness in high-frequency image regions, it does not change the purely one-dimensional nature of this search. Furthermore, it is computationally efficient, as 4 out of 5 interpolated image values can be re-used for each SSD evaluation.

2.1.3 Uncertainty Estimation

In this section, we use uncertainty propagation to derive an expression for the error variance $\sigma_d^2$ on the inverse depth $d$. In general this can be done by expressing the optimal inverse depth $d^*$ as a function of the noisy inputs (here we consider the images $I_0$, $I_1$ themselves, their relative orientation $\xi$ and the camera calibration in terms of a projection function $\pi$, see footnote 1):

$$ d^* = d(I_0, I_1, \xi, \pi). \tag{1} $$

The error-variance of $d^*$ is then given by

$$ \sigma_d^2 = J_d \, \Sigma \, J_d^T, \tag{2} $$

where $J_d$ is the Jacobian of $d$, and $\Sigma$ the covariance of the input-error. For more details on covariance propagation, including the derivation of this formula, we refer to [2]. For simplicity, the following analysis is performed for patch-free stereo, i.e., we consider only a point-wise search for a single intensity value along the epipolar line.
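As a numerical illustration of the first-order propagation rule in Eq. (2), the following minimal sketch estimates the Jacobian by central differences and evaluates $J \Sigma J^T$ for a scalar function. The function names (`numerical_jacobian`, `propagate_variance`), the toy function and the step size are assumptions made for illustration only; the paper derives its Jacobians analytically.

```python
def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian (one row) of a scalar function f at x."""
    jac = []
    for k in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[k] += eps
        lo[k] -= eps
        jac.append((f(hi) - f(lo)) / (2.0 * eps))
    return jac


def propagate_variance(f, x, cov):
    """First-order variance propagation: sigma^2 = J * Sigma * J^T."""
    j = numerical_jacobian(f, x)
    n = len(x)
    return sum(j[a] * cov[a][b] * j[b] for a in range(n) for b in range(n))


def toy_function(x):
    """Hypothetical stand-in for d(.): a simple ratio of two noisy inputs."""
    return x[0] / x[1]


# Independent input noise with variances 0.01 and 0.04 (illustrative values).
input_cov = [[0.01, 0.0], [0.0, 0.04]]
output_var = propagate_variance(toy_function, [2.0, 4.0], input_cov)
```

For the toy ratio, the analytic Jacobian is $(1/x_1, -x_0/x_1^2)$, so the numerical result can be checked by hand.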
For this analysis, we split the computation into three steps: First, the epipolar line in the reference frame is computed. Second, the best matching position $\lambda^* \in \mathbb{R}$ along it (i.e., the disparity) is determined. Third, the inverse depth $d^*$ is computed from the disparity $\lambda^*$. The first two steps involve two independent error sources: the geometric error, which originates from noise on $\xi$ and $\pi$ and affects the first step, and the photometric error, which originates from noise in the images $I_0$, $I_1$ and affects the second step. The third step scales these errors by a factor which depends on the baseline.

Footnote 1: In the linear case, $\pi$ is the camera matrix $K$; in practice, however, nonlinear distortion and other (unmodeled) effects also play a role.

Geometric disparity error. The geometric error is the error $\epsilon_\lambda$ on the disparity $\lambda^*$ caused by noise on $\xi$ and $\pi$. While it would be possible to model, propagate, and estimate the complete covariance on $\xi$ and $\pi$, we found that the gain in accuracy does not justify the increase in computational complexity. We therefore use an intuitive approximation: Let the considered epipolar line segment $L \subset \mathbb{R}^2$ be defined by

$$ L := \left\{ l_0 + \lambda \begin{pmatrix} l_x \\ l_y \end{pmatrix} \;\middle|\; \lambda \in S \right\}, \tag{3} $$

where $\lambda$ is the disparity with search interval $S$, $(l_x, l_y)^T$ the normalized epipolar line direction and $l_0$ the point corresponding to infinite depth. We now assume that only the absolute position of this line segment, i.e., $l_0$, is subject to isotropic Gaussian noise $\epsilon_l$. As in practice we keep the searched epipolar line segments short, the influence of rotational error is small, making this a good approximation.

Intuitively, a positioning error $\epsilon_l$ on the epipolar line causes a small disparity error $\epsilon_\lambda$ if the epipolar line is parallel to the image gradient, and a large one otherwise (see Fig. 5). This can be mathematically derived as follows: The image constrains the optimal disparity $\lambda^*$ to lie on a certain isocurve, i.e., a curve of equal intensity. We approximate this isocurve to be locally linear, i.e., the gradient direction to be locally constant. This gives

$$ l_0 + \lambda^* \begin{pmatrix} l_x \\ l_y \end{pmatrix} \overset{!}{=} g_0 + \gamma \begin{pmatrix} -g_y \\ g_x \end{pmatrix}, \quad \gamma \in \mathbb{R}, \tag{4} $$

where $g := (g_x, g_y)$ is the image gradient and $g_0$ a point on the isoline. The influence of noise on the image values will be derived in the next paragraph, hence at this point $g$ and $g_0$ are assumed noise-free. Solving for $\lambda$ gives the optimal disparity $\lambda^*$ in terms of the noisy input $l_0$:

$$ \lambda^*(l_0) = \frac{\langle g, g_0 - l_0 \rangle}{\langle g, l \rangle}. \tag{5} $$

Analogously to (2), the variance of the geometric disparity error can then be expressed as

$$ \sigma^2_{\lambda(\xi,\pi)} = J_{\lambda^*(l_0)} \begin{pmatrix} \sigma_l^2 & 0 \\ 0 & \sigma_l^2 \end{pmatrix} J^T_{\lambda^*(l_0)} = \frac{\sigma_l^2}{\langle g, l \rangle^2}, \tag{6} $$

where $g$ is the normalized image gradient, $l$ the normalized epipolar line direction and $\sigma_l^2$ the variance of $\epsilon_l$. Note that this error term solely originates from noise on the relative camera orientation $\xi$ and the camera calibration $\pi$, i.e., it is independent of image intensity noise.

Figure 5. Geometric Disparity Error: Influence of a small positioning error $\epsilon_l$ of the epipolar line on the disparity error $\epsilon_\lambda$. The dashed line represents the isocurve on which the matching point has to lie. $\epsilon_\lambda$ is small if the epipolar line is parallel to the image gradient (left), and large otherwise (right).

Photometric disparity error. Intuitively, this error encodes that small image intensity errors have a large effect on the estimated disparity if the image gradient is small, and a small effect otherwise (see Fig. 6). Mathematically, this relation can be derived as follows. We seek the disparity $\lambda^*$ that minimizes the difference in intensities, i.e.,

$$ \lambda^* = \min_\lambda \, (i_{\mathrm{ref}} - I_p(\lambda))^2, \tag{7} $$

where $i_{\mathrm{ref}}$ is the reference intensity, and $I_p(\lambda)$ the image intensity on the epipolar line at disparity $\lambda$. We assume a good initialization $\lambda_0$ to be available from the exhaustive search. Using a first-order Taylor approximation for $I_p$ gives

$$ \lambda^*(I) = \lambda_0 + (i_{\mathrm{ref}} - I_p(\lambda_0)) \, g_p^{-1}, \tag{8} $$

where $g_p$ is the gradient of $I_p$, that is, the image gradient along the epipolar line.
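The two closed-form expressions above, the geometric variance of Eq. (6) and the first-order disparity refinement of Eq. (8), can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the tuple-based vector representation and the handling of the degenerate perpendicular case are hypothetical and not taken from the paper's implementation.

```python
import math


def geometric_disparity_variance(grad, line_dir, sigma_l_sq):
    """Eq. (6): sigma_l^2 / <g, l>^2 with g and l normalized to unit length."""
    g_norm = math.hypot(grad[0], grad[1])
    l_norm = math.hypot(line_dir[0], line_dir[1])
    if g_norm == 0.0 or l_norm == 0.0:
        return math.inf  # assumption: no gradient or direction, no constraint
    dot = (grad[0] * line_dir[0] + grad[1] * line_dir[1]) / (g_norm * l_norm)
    if abs(dot) < 1e-12:
        return math.inf  # gradient perpendicular to the epipolar line
    return sigma_l_sq / (dot * dot)


def refine_disparity(lambda0, i_ref, i_p_at_lambda0, g_p):
    """Eq. (8): one first-order step from the initial disparity lambda0."""
    return lambda0 + (i_ref - i_p_at_lambda0) / g_p


# Gradient parallel to the epipolar line: variance equals sigma_l^2.
var_parallel = geometric_disparity_variance((1.0, 0.0), (1.0, 0.0), 0.25)
# Gradient at 45 degrees to the line: variance doubles.
var_diagonal = geometric_disparity_variance((1.0, 0.0), (1.0, 1.0), 0.25)
# Intensity residual of 2 with gradient 4 shifts the disparity by 0.5.
refined = refine_disparity(3.0, 10.0, 8.0, 4.0)
```

The example values show the behavior described in the text: the geometric error grows as the epipolar line turns away from the gradient direction.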
For clarity we only consider noise on $i_{\mathrm{ref}}$ and $I_p(\lambda_0)$; equivalent results are obtained in the general case when taking into account noise on the image values involved in the computation of $g_p$. The variance of the photometric disparity error is given by

$$ \sigma^2_{\lambda(I)} = J_{\lambda^*(I)} \begin{pmatrix} \sigma_i^2 & 0 \\ 0 & \sigma_i^2 \end{pmatrix} J^T_{\lambda^*(I)} = \frac{2\sigma_i^2}{g_p^2}, \tag{9} $$

where $\sigma_i^2$ is the variance of the image intensity noise. The respective error originates solely from noisy image intensity values, and hence is independent of the geometric disparity error.

Figure 6. Photometric Disparity Error: Noise $\epsilon_i$ on the image intensity values causes a small disparity error $\epsilon_\lambda$ if the image gradient along the epipolar line is large (left). If the gradient is small, the disparity error is magnified (right).

Pixel to inverse depth conversion. Using that, for small camera rotation, the inverse depth $d$ is approximately proportional to the disparity $\lambda$, the observation variance of the inverse depth $\sigma^2_{d,\mathrm{obs}}$ can be calculated using

$$ \sigma^2_{d,\mathrm{obs}} = \alpha^2 \left( \sigma^2_{\lambda(\xi,\pi)} + \sigma^2_{\lambda(I)} \right), \tag{10} $$

where the proportionality constant $\alpha$ (in the general, non-rectified case) is different for each pixel, and can be calculated from

$$ \alpha := \frac{\delta_d}{\delta_\lambda}, \tag{11} $$

where $\delta_d$ is the length of the searched inverse depth interval, and $\delta_\lambda$ the length of the searched epipolar line segment. While $\alpha$ is inversely linear in the length of the camera translation, it also depends on the translation direction and the pixel's location in the image.

When using an SSD error over multiple points along the epipolar line, as our implementation does, a good upper bound for the matching uncertainty is then given by

$$ \sigma^2_{d,\mathrm{obs\text{-}SSD}} \le \alpha^2 \left( \min\{\sigma^2_{\lambda(\xi,\pi)}\} + \min\{\sigma^2_{\lambda(I)}\} \right), \tag{12} $$

where the min goes over all points included in the SSD error.

2.1.4 Depth Observation Fusion

After a depth observation for a pixel in the current image has been obtained, we integrate it into the depth map as follows: If no prior hypothesis for a pixel exists, we initialize it directly with the observation.
Otherwise, the new observation is incorporated into the prior, i.e., the two distributions are multiplied (corresponding to the update step in a Kalman filter): Given a prior distribution $\mathcal{N}(d_p, \sigma_p^2)$ and a noisy observation $\mathcal{N}(d_o, \sigma_o^2)$, the posterior is given by

$$ \mathcal{N}\left( \frac{\sigma_p^2 d_o + \sigma_o^2 d_p}{\sigma_p^2 + \sigma_o^2}, \; \frac{\sigma_p^2 \sigma_o^2}{\sigma_p^2 + \sigma_o^2} \right). \tag{13} $$

2.1.5 Summary of Uncertainty-Aware Stereo

New stereo observations are obtained on a per-pixel basis, adaptively selecting for each pixel a suitable reference frame and performing a one-dimensional search along the epipolar line. We identified the three major factors which determine the accuracy of such a stereo observation, i.e.,
• the photometric disparity error $\sigma^2_{\lambda(I)}$, depending on the magnitude of the image gradient along the epipolar line,
• the geometric disparity error $\sigma^2_{\lambda(\xi,\pi)}$, depending on the angle between the image gradient and the epipolar line (independent of the gradient magnitude), and
• the pixel to inverse depth ratio $\alpha$, depending on the camera translation, the focal length and the pixel's position.
These three simple-to-compute and purely local criteria are used to determine for which pixel a stereo update is worth the computational cost. Further, the computed observation variance is then used to integrate the new measurements into the existing depth map.

2.2. Depth Map Propagation

We continuously propagate the estimated inverse depth map from frame to frame, once the camera position of the next frame has been estimated. Based on the inverse depth estimate $d_0$ for a pixel, the corresponding 3D point is calculated and projected into the new frame, providing an inverse depth estimate $d_1$ in the new frame. The hypothesis is then assigned to the closest integer pixel position; to eliminate discretization errors, the sub-pixel accurate image location of the projected point is kept and re-used for the next propagation step.
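The fusion rule of Eq. (13) is the product of two one-dimensional Gaussians and can be sketched as follows. The function name `fuse_inverse_depth` and its argument names are chosen here for illustration and are not part of the paper.

```python
def fuse_inverse_depth(d_prior, var_prior, d_obs, var_obs):
    """Eq. (13): fuse a prior N(d_p, var_p) with an observation N(d_o, var_o).

    Returns the mean and variance of the posterior Gaussian.
    """
    total = var_prior + var_obs
    mean = (var_prior * d_obs + var_obs * d_prior) / total
    variance = (var_prior * var_obs) / total
    return mean, variance


# Equal uncertainties: the posterior mean is the average, the variance halves.
mean_equal, var_equal = fuse_inverse_depth(1.0, 0.04, 2.0, 0.04)

# A precise observation dominates an uncertain prior.
mean_precise, var_precise = fuse_inverse_depth(1.0, 1.0, 2.0, 0.01)
```

Each input is weighted by the variance of the other, so the more certain estimate pulls the posterior mean toward itself, and the posterior variance is always smaller than either input variance.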
For propagating the inverse depth variance, we assume the camera rotation to be small. The new inverse depth $d_1$ can then be approximated by

$$ d_1(d_0) = (d_0^{-1} - t_z)^{-1}, \tag{14} $$

where $t_z$ is the camera translation along the optical axis. The variance of $d_1$ is hence given by

$$ \sigma^2_{d_1} = J_{d_1} \sigma^2_{d_0} J^T_{d_1} + \sigma_p^2 = \left( \frac{d_1}{d_0} \right)^4 \sigma^2_{d_0} + \sigma_p^2, \tag{15} $$

where $\sigma_p^2$ is the prediction uncertainty, which directly corresponds to the prediction step in an extended Kalman filter. It can also be interpreted as keeping the variance on the z-coordinate of a point fixed, i.e., setting $\sigma^2_{z_0} = \sigma^2_{z_1}$. We found that using small values for $\sigma_p^2$ decreases drift, as it causes the estimated geometry to gradually "lock" into place.

Figure 7. [...] in the top right shows the new frame $I_2(x)$ without depth information. Middle: Intermediate steps while minimizing $E(\xi)$ on different pyramid levels. The top row shows the back-warped new frame $I_2(w(x, d, \xi))$, the bottom row shows the respective residual image $I_2(w(x, d, \xi)) - I_1(x)$. The bottom right image shows the final pixel-weights (black = small weight). Small weights mainly correspond to newly occluded or disoccluded pixels.

Collision handling. At all times, we allow at most one inverse depth hypothesis per pixel: If two inverse depth hypotheses are propagated to the same pixel in the new frame, we distinguish between two cases:
1. if they are statistically similar, i.e., lie within $2\sigma$ bounds, they are treated as two independent observations of the pixel's depth and fused according to (13).
2. otherwise, the point that is further away from the camera is assumed to be occluded, and is removed.

2.3. Depth Map Regularization

For each frame, after all observations have been incorporated, we perform one regularization iteration by assigning each inverse depth value the average of the surrounding inverse depths, weighted by their respective inverse variance.
To preserve sharp edges, if two adjacent inverse depth values are statistically different, i.e., are further away than $2\sigma$, they do not contribute to one another. Note that the respective variances are not changed during regularization to account for the high correlation between neighboring hypotheses. Instead we use the minimal variance of all neighboring pixels when defining the stereo search range, and as a weighting factor for tracking (see Sec. 3).

Outlier removal. To handle outliers, we continuously keep track of the validity of each inverse depth hypothesis in terms of the probability that it is an outlier, or has become invalid (e.g., due to occlusion or a moving object). For each successful stereo observation, this probability is decreased. It is increased for each failed stereo search, if the respective intensity changes significantly on propagation, or when the absolute image gradient falls below a given threshold.

If, during regularization, the probability that all contributing neighbors are outliers (i.e., the product of their individual outlier-probabilities) rises above a given threshold, the hypothesis is removed. Equally, if for an "empty" pixel this product drops below a given threshold, a new hypothesis is created from the neighbors. This fills holes arising from the forward-warping nature of the propagation step, and dilates the semi-dense depth map to a small neighborhood around sharp image intensity edges, which significantly increases tracking and mapping robustness.

3. Dense Tracking

Based on the inverse depth map of the previous frame, we estimate the camera pose of the current frame using dense image alignment. Such methods have previously been applied successfully (in real-time on a CPU) for tracking RGB-D cameras [7], which directly provide dense depth measurements along with the color image.
It is based on the direct minimization of the photometric error

$$ r_i(\xi) := \left( I_2(w(x_i, d_i, \xi)) - I_1(x_i) \right)^2, \tag{16} $$

where the warp function $w : \Omega_1 \times \mathbb{R} \times \mathbb{R}^6 \to \Omega_2$ maps each point $x_i \in \Omega_1$ in the reference image $I_1$ to the respective point $w(x_i, d_i, \xi) \in \Omega_2$ in the new image $I_2$. As input it requires the 3D pose of the camera $\xi \in \mathbb{R}^6$ and uses the estimated inverse depth $d_i \in \mathbb{R}$ for the pixel in $I_1$. Note that no depth information with respect to $I_2$ is required.

To increase robustness to self-occlusion and moving objects, we apply a weighting scheme as proposed in [7]. Further, we add the variance of the inverse depth $\sigma^2_{d_i}$ as an additional weighting term, making the tracking resistant to recently initialized and still inaccurate depth estimates from the mapping process. The final energy that is minimized is hence given by

$$ E(\xi) := \sum_i \frac{\alpha(r_i(\xi))}{\sigma^2_{d_i}} \, r_i(\xi), \tag{17} $$

where $\alpha : \mathbb{R} \to \mathbb{R}$ defines the weight for a given residual. Minimizing this error can be interpreted as computing the maximum likelihood estimator for $\xi$, assuming independent noise on the image intensity values.

The resulting weighted least-squares problem is solved efficiently using an iteratively reweighted Gauss-Newton algorithm coupled with a coarse-to-fine approach, using four pyramid levels. Figure 7 shows an example of the tracking process. For further details on the minimization we refer to [1].

Figure 8. Examples: Top: Camera images overlaid with the respective estimated semi-dense inverse depth map. Bottom: 3D view of tracked scene. Note the versatility of our approach: It accurately reconstructs and tracks through (outside) scenes with a large depth-variance, including far-away objects like clouds, as well as (indoor) scenes with little structure and close to no image corners / keypoints. More examples are shown in the attached video.

4. System Overview

Tracking and depth estimation are split into two separate threads: One continuously propagates the inverse depth map to the most recent tracked frame, updates it with stereo comparisons and partially regularizes it. The other simultaneously tracks each incoming frame on the most recent available depth map. While tracking is performed in real-time at 30Hz, one complete mapping iteration takes longer and is hence done at roughly 15Hz; if the map is heavily populated, we adaptively reduce the number of stereo comparisons to maintain a constant frame-rate. For stereo observations, a buffer of up to 100 past frames is kept, automatically removing those that are used least.

We use a standard, keypoint-based method to obtain the relative camera pose between two initial frames, which are then used to initialize the inverse depth map needed for tracking successive frames. From this point onward, our method is entirely self-contained. In preliminary experiments, we found that in most cases our approach is even able to recover from random or extremely inaccurate initial depth maps, indicating that the keypoint-based initialization might become superfluous in the future.

Table 1. Results on RGB-D Benchmark

            position drift (cm/s)     rotation drift (deg/s)
            ours    [7]     [8]       ours    [7]     [8]
fr2/xyz     0.6     0.6     8.2       0.33    0.34    3.27
fr2/desk    2.1     2.0     -         0.65    0.70    -

5. Results

We have tested our approach both on publicly available benchmark sequences and live, using a hand-held camera. Some examples are shown in Fig. 8. Note that our method does not attempt to build a global map, i.e., once a point leaves the field of view of the camera or becomes occluded, the respective depth value is deleted. All experiments are performed on a standard consumer laptop with an Intel i7 quad-core CPU. In a preprocessing step, we rectify all images such that a pinhole camera-model can be applied.

5.1. RGB-D Benchmark Sequences

As basis for a quantitative evaluation and to facilitate reproducibility and easy comparison with other methods, we use the TUM RGB-D benchmark [16]. For tracking and mapping we only use the gray-scale images; for the very first frame, however, the provided depth image is used as initialization.

Our method (like any monocular visual odometry method) fails in case of pure camera rotation, as the depth of new regions cannot be determined. The achieved tracking accuracy for two feasible sequences (that is, sequences which do not contain strong camera rotation without simultaneous translation) is given in Table 1. For comparison we also list the accuracy from (1) a state-of-the-art, dense RGB-D odometry [7], and (2) a state-of-the-art, keypoint-based monocular SLAM system (PTAM, [8]). We initialize PTAM using the built-in stereo initializer, and perform a 7DoF (rigid body plus scale) alignment to the ground truth trajectory. Figure 9 shows the tracked camera trajectory for fr2/desk. We found that our method achieves similar accuracy as [7], which uses the same dense tracking algorithm but relies on the Kinect depth images. The keypoint-based approach [8] proves to be significantly less accurate and robust; it consistently failed after a few seconds for the second sequence.

Figure 9. [...] trajectory (black), the depth map of the first frame (blue), and the estimated depth map (gray-scale) after a complete loop around the table. Note how well certain details such as the keyboard and monitor align.

5.2. Additional Test Sequences

To analyze our approach in more detail, we recorded additional challenging sequences with the corresponding ground truth trajectory in a motion capture studio. Figure 10 shows an extract from the video, as well as the tracked and the ground-truth camera position over time.
As can be seen from the figure, our approach is able to maintain a reasonably dense depth map at all times, and the estimated camera trajectory closely matches the ground truth.
Figure 10. Additional Sequence: Estimated camera trajectory and ground truth (dashed) for a long and challenging sequence. The complete sequence is shown in the attached video.
6. Conclusion
In this paper we proposed a novel visual odometry method for a monocular camera, which does not require discrete features. In contrast to previous work on dense tracking and mapping, our approach is based on probabilistic depth map estimation and fusion over time. Depth measurements are obtained from patch-free stereo matching in different reference frames at a suitable baseline, which are selected on a per-pixel basis. To our knowledge, this is the first featureless monocular visual odometry method which runs in real-time on a CPU. In our experiments, we showed that the tracking performance of our approach is comparable to that of fully dense methods without requiring a depth sensor.
References
[1] S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying framework. Technical report, Carnegie Mellon Univ., 2002.
[2] A. Clifford. Multivariate Error Analysis. John Wiley & Sons, 1973.
[3] A. Comport, E. Malis, and P. Rives. Accurate quadri-focal tracking for robust 3d visual odometry. In ICRA, 2007.
[4] A. Davison, I. Reid, N. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 29, 2007.
[5] D. Gallup, J. Frahm, P. Mordohai, and M. Pollefeys. Variable baseline/resolution stereo. In CVPR, 2008.
[6] C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision Conference, 1988.
[7] C. Kerl, J. Sturm, and D. Cremers. Robust odometry estimation for RGB-D cameras. In ICRA, 2013.
[8] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Mixed and Augmented Reality (ISMAR), 2007.
[9] G. Klein and D. Murray. Improving the agility of keyframe-based SLAM. In ECCV, 2008.
[10] M. Pollefeys et al. Detailed real-time urban 3d reconstruction from video. IJCV, 78(2-3):143–167, 2008.
[11] L. Matthies, R. Szeliski, and T. Kanade. Incremental estimation of dense depth maps from image sequences. In CVPR, 1988.
[12] R. Newcombe, S. Lovegrove, and A. Davison. DTAM: Dense tracking and mapping in real-time. In ICCV, 2011.
[13] M. Okutomi and T. Kanade. A multiple-baseline stereo. Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 15(4):353–363, 1993.
[14] T. Sato, M. Kanbara, N. Yokoya, and H. Takemura. Dense 3-d reconstruction of an outdoor scene by hundreds-baseline stereo using a hand-held camera. IJCV, 47:1–3, 2002.
[15] J. Stuehmer, S. Gumhold, and D. Cremers. Real-time dense geometry from a handheld camera. In Pattern Recognition (DAGM), 2010.
[16] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Intelligent Robot Systems (IROS), 2012.
[17] A. Wendel, M. Maurer, G. Graber, T. Pock, and H. Bischof. Dense reconstruction on-the-fly. In ECCV, 2012.
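The evaluation above mentions a 7DoF (rigid body plus scale) alignment of a monocular trajectory to the ground truth, but does not say how it was computed. One standard option is the closed-form least-squares similarity fit (often attributed to Umeyama). The sketch below assumes that method, assumes NumPy is available, and assumes the two trajectories are already time-associated N x 3 position arrays. The names `align_sim3` and `ate_rmse` are hypothetical and do not come from the paper or the benchmark tools.

```python
import numpy as np


def align_sim3(est, gt):
    """Fit scale s, rotation R, translation t so that gt ~= s * R @ est + t.

    Assumed method: closed-form least-squares similarity alignment via SVD.
    est, gt: (N, 3) arrays of corresponding camera positions.
    """
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    cov = g.T @ e / len(est)               # 3x3 cross-covariance (gt vs. est)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                     # avoid returning a reflection
    R = U @ S @ Vt
    var_e = (e ** 2).sum() / len(est)      # variance of the estimated points
    s = np.trace(np.diag(D) @ S) / var_e
    t = mu_g - s * R @ mu_e
    return s, R, t


def ate_rmse(est, gt):
    """Root-mean-square position error after the 7DoF alignment."""
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    s, R, t = align_sim3(est, gt)
    aligned = s * est @ R.T + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))
```

The scale term is what separates this from the 6DoF rigid alignment that suffices for RGB-D methods: a monocular system recovers the trajectory only up to an unknown global scale, so the scale has to be fitted before errors can be compared.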
5 0.69829649 254 iccv-2013-Live Metric 3D Reconstruction on Mobile Phones
Author: Petri Tanskanen, Kalin Kolev, Lorenz Meier, Federico Camposeco, Olivier Saurer, Marc Pollefeys
Abstract: unknown-abstract
6 0.65028173 135 iccv-2013-Efficient Image Dehazing with Boundary Constraint and Contextual Regularization
7 0.63009745 133 iccv-2013-Efficient Hand Pose Estimation from a Single Depth Image
8 0.58920878 319 iccv-2013-Point-Based 3D Reconstruction of Thin Objects
9 0.56781411 199 iccv-2013-High Quality Shape from a Single RGB-D Image under Uncalibrated Natural Illumination
10 0.56076777 9 iccv-2013-A Flexible Scene Representation for 3D Reconstruction Using an RGB-D Camera
11 0.54132992 28 iccv-2013-A Rotational Stereo Model Based on XSlit Imaging
12 0.52123177 444 iccv-2013-Viewing Real-World Faces in 3D
13 0.50482368 271 iccv-2013-Modeling the Calibration Pipeline of the Lytro Camera for High Quality Light-Field Image Reconstruction
14 0.49618059 255 iccv-2013-Local Signal Equalization for Correspondence Matching
15 0.49600029 341 iccv-2013-Real-Time Body Tracking with One Depth Camera and Inertial Sensors
16 0.49350429 284 iccv-2013-Multiview Photometric Stereo Using Planar Mesh Parameterization
17 0.47633699 101 iccv-2013-DCSH - Matching Patches in RGBD Images
18 0.446381 408 iccv-2013-Super-resolution via Transform-Invariant Group-Sparse Regularization
19 0.44504288 366 iccv-2013-STAR3D: Simultaneous Tracking and Reconstruction of 3D Objects Using RGB-D Data
20 0.44207489 30 iccv-2013-A Simple Model for Intrinsic Image Decomposition with Depth Cues
topicId topicWeight
[(2, 0.046), (7, 0.015), (26, 0.126), (31, 0.05), (40, 0.012), (42, 0.065), (48, 0.013), (62, 0.011), (64, 0.047), (73, 0.065), (89, 0.222), (93, 0.2), (98, 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.86450255 209 iccv-2013-Image Guided Depth Upsampling Using Anisotropic Total Generalized Variation
Author: David Ferstl, Christian Reinbacher, Rene Ranftl, Matthias Ruether, Horst Bischof
Abstract: In this work we present a novel method for the challenging problem of depth image upsampling. Modern depth cameras such as Kinect or Time of Flight cameras deliver dense, high quality depth measurements but are limited in their lateral resolution. To overcome this limitation we formulate a convex optimization problem using higher order regularization for depth image upsampling. In this optimization an anisotropic diffusion tensor, calculated from a high resolution intensity image, is used to guide the upsampling. We derive a numerical algorithm based on a primal-dual formulation that is efficiently parallelized and runs at multiple frames per second. We show that this novel upsampling clearly outperforms state-of-the-art approaches in terms of speed and accuracy on the widely used Middlebury 2007 datasets. Furthermore, we introduce novel datasets with highly accurate ground truth, which, for the first time, enable benchmarking of depth upsampling methods using real sensor data.
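The abstract says an anisotropic diffusion tensor is calculated from the high resolution intensity image, but gives no formula. As an illustration only, the sketch below assumes one commonly used edge-weighted form, T = exp(-beta * |g|^gamma) * n n^T + n_perp n_perp^T, where g is the intensity gradient at a pixel and n = g / |g|. The function name and the default values of `beta` and `gamma` are hypothetical placeholders, not values taken from this listing.

```python
import math


def diffusion_tensor(gx, gy, beta=9.0, gamma=0.85, eps=1e-12):
    """2x2 anisotropic diffusion tensor for one pixel (assumed form).

    gx, gy: intensity gradient at the pixel.
    beta, gamma: hypothetical edge-sensitivity parameters.
    Smoothing across an edge (along n) is damped by w; smoothing along
    the edge (along n_perp) is left at full strength.
    """
    mag = math.hypot(gx, gy)
    if mag < eps:
        # Flat region: no preferred direction, isotropic smoothing.
        return [[1.0, 0.0], [0.0, 1.0]]
    nx, ny = gx / mag, gy / mag
    w = math.exp(-beta * mag ** gamma)
    # w * n n^T + n_perp n_perp^T with n_perp = (-ny, nx)
    return [
        [w * nx * nx + ny * ny, (w - 1.0) * nx * ny],
        [(w - 1.0) * nx * ny, w * ny * ny + nx * nx],
    ]


def apply_tensor(T, v):
    """Multiply the 2x2 tensor T with the 2-vector v."""
    return (T[0][0] * v[0] + T[0][1] * v[1],
            T[1][0] * v[0] + T[1][1] * v[1])
```

Applied to a depth gradient, a tensor of this kind leaves the component parallel to an image edge untouched and shrinks the component perpendicular to it. This is the sense in which the intensity image can guide where depth discontinuities are allowed.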
2 0.82969153 404 iccv-2013-Structured Forests for Fast Edge Detection
Author: Piotr Dollár, C. Lawrence Zitnick
Abstract: Edge detection is a critical component of many vision systems, including object detectors and image segmentation algorithms. Patches of edges exhibit well-known forms of local structure, such as straight lines or T-junctions. In this paper we take advantage of the structure present in local image patches to learn both an accurate and computationally efficient edge detector. We formulate the problem of predicting local edge masks in a structured learning framework applied to random decision forests. Our novel approach to learning decision trees robustly maps the structured labels to a discrete space on which standard information gain measures may be evaluated. The result is an approach that obtains realtime performance that is orders of magnitude faster than many competing state-of-the-art approaches, while also achieving state-of-the-art edge detection results on the BSDS500 Segmentation dataset and NYU Depth dataset. Finally, we show the potential of our approach as a general purpose edge detector by showing our learned edge models generalize well across datasets.
3 0.82963002 109 iccv-2013-Detecting Avocados to Zucchinis: What Have We Done, and Where Are We Going?
Author: Olga Russakovsky, Jia Deng, Zhiheng Huang, Alexander C. Berg, Li Fei-Fei
Abstract: The growth of detection datasets and the multiple directions of object detection research provide both an unprecedented need and a great opportunity for a thorough evaluation of the current state of the field of categorical object detection. In this paper we strive to answer two key questions. First, where are we currently as a field: what have we done right, what still needs to be improved? Second, where should we be going in designing the next generation of object detectors? Inspired by the recent work of Hoiem et al. [10] on the standard PASCAL VOC detection dataset, we perform a large-scale study on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) data. First, we quantitatively demonstrate that this dataset provides many of the same detection challenges as the PASCAL VOC. Due to its scale of 1000 object categories, ILSVRC also provides an excellent testbed for understanding the performance of detectors as a function of several key properties of the object classes. We conduct a series of analyses looking at how different detection methods perform on a number of image-level and object-class-level properties such as texture, color, deformation, and clutter. We learn important lessons about the current object detection methods and propose a number of insights for designing the next generation of object detectors.
4 0.82853013 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition
Author: Ling Wang, Hichem Sahbi
Abstract: One of the trends of action recognition consists in extracting and comparing mid-level features which encode visual and motion aspects of objects into scenes. However, when scenes contain high-level semantic actions with many interacting parts, these mid-level features are not sufficient to capture high level structures as well as high order causal relationships between moving objects, resulting in a clear drop in performance. In this paper, we address this issue and we propose an alternative action recognition method based on a novel graph kernel. In the main contributions of this work, we first describe actions in videos using directed acyclic graphs (DAGs), that naturally encode pairwise interactions between moving object parts, and then we compare these DAGs by analyzing the spectrum of their sub-patterns that capture complex higher order interactions. This extraction and comparison process is computationally tractable, resulting from the acyclic property of DAGs, and it also defines a positive semi-definite kernel. When plugging the latter into support vector machines, we obtain an action recognition algorithm that overtakes related work, including graph-based methods, on a standard evaluation dataset.
5 0.79881614 196 iccv-2013-Hierarchical Data-Driven Descent for Efficient Optimal Deformation Estimation
Author: Yuandong Tian, Srinivasa G. Narasimhan
Abstract: Real-world surfaces such as clothing, water and the human body deform in complex ways. The image distortions observed are high-dimensional and non-linear, making it hard to estimate these deformations accurately. The recent data-driven descent approach [17] applies Nearest Neighbor estimators iteratively on a particular distribution of training samples to obtain a globally optimal and dense deformation field between a template and a distorted image. In this work, we develop a hierarchical structure for the Nearest Neighbor estimators, each of which can have only a local image support. We demonstrate in both theory and practice that this algorithm has several advantages over the nonhierarchical version: it guarantees global optimality with significantly fewer training samples, is several orders faster, provides a metric to decide whether a given image is “hard” (or “easy”) requiring more (or less) samples, and can handle more complex scenes that include both global motion and local deformation. The proposed algorithm successfully tracks a broad range of non-rigid scenes including water, clothing, and medical images, and compares favorably against several other deformation estimation and tracking approaches that do not provide optimality guarantees.
6 0.794635 78 iccv-2013-Coherent Motion Segmentation in Moving Camera Videos Using Optical Flow Orientations
7 0.79426461 390 iccv-2013-Shufflets: Shared Mid-level Parts for Fast Object Detection
8 0.79230487 102 iccv-2013-Data-Driven 3D Primitives for Single Image Understanding
10 0.79078925 423 iccv-2013-Towards Motion Aware Light Field Video for Dynamic Scenes
11 0.78963053 411 iccv-2013-Symbiotic Segmentation and Part Localization for Fine-Grained Categorization
12 0.78898603 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition
13 0.78828251 60 iccv-2013-Bayesian Robust Matrix Factorization for Image and Video Processing
14 0.7867924 414 iccv-2013-Temporally Consistent Superpixels
15 0.78658116 309 iccv-2013-Partial Enumeration and Curvature Regularization
16 0.78510845 8 iccv-2013-A Deformable Mixture Parsing Model with Parselets
17 0.78440237 361 iccv-2013-Robust Trajectory Clustering for Motion Segmentation
18 0.78361046 420 iccv-2013-Topology-Constrained Layered Tracking with Latent Flow
19 0.78271699 156 iccv-2013-Fast Direct Super-Resolution by Simple Functions
20 0.78234804 18 iccv-2013-A Joint Intensity and Depth Co-sparse Analysis Model for Depth Map Super-resolution