Author: Amaury Dame, Victor A. Prisacariu, Carl Y. Ren, Ian Reid

Abstract: We propose a formulation of monocular SLAM which combines live dense reconstruction with shape priors-based 3D tracking and reconstruction. Current live dense SLAM approaches are limited to the reconstruction of visible surfaces. Moreover, most of them are based on the minimisation of a photo-consistency error, which usually makes them sensitive to specularities. In the 3D pose recovery literature, problems caused by imperfect and ambiguous image information have been dealt with by using prior shape knowledge. At the same time, the success of depth sensors has shown that combining joint image and depth information drastically increases the robustness of the classical monocular 3D tracking and 3D reconstruction approaches. In this work we link dense SLAM to 3D object pose and shape recovery. More specifically, we automatically augment our SLAM system with object specific identity, together with 6D pose and additional shape degrees of freedom for the object(s) of known class in the scene, combining image data and depth information for the pose and shape recovery. This leads to a system that allows for full scaled 3D reconstruction with the known object(s) segmented from the scene. The segmentation enhances the clarity, accuracy and completeness of the maps built by the dense SLAM system, while the dense 3D data aids the segmentation process, yielding faster and more reliable convergence than when using 2D image data alone.

1 Abstract We propose a formulation of monocular SLAM which combines live dense reconstruction with shape priors-based 3D tracking and reconstruction. [sent-6, score-0.443]

2 Current live dense SLAM approaches are limited to the reconstruction of visible surfaces. [sent-7, score-0.242]

3 Moreover, most of them are based on the minimisation of a photo-consistency error, which usually makes them sensitive to specularities. [sent-8, score-0.129]

4 In the 3D pose recovery literature, problems caused by imperfect and ambiguous image information have been dealt with by using prior shape knowledge. [sent-9, score-0.356]

5 At the same time, the success of depth sensors has shown that combining joint image and depth information drastically increases the robustness of the classical monocular 3D tracking and 3D reconstruction approaches. [sent-10, score-0.694]

6 In this work we link dense SLAM to 3D object pose and shape recovery. [sent-11, score-0.476]

7 More specifically, we automatically augment our SLAM system with object specific identity, together with 6D pose and additional shape degrees of freedom for the object(s) of known class in the scene, combining image data and depth information for the pose and shape recovery. [sent-12, score-0.92]

8 This leads to a system that allows for full scaled 3D reconstruction with the known object(s) segmented from the scene. [sent-13, score-0.153]

9 The segmentation enhances the clarity, accuracy and completeness of the maps built by the dense SLAM system, while the dense 3D data aids the segmentation process, yielding faster and more reliable convergence than when using 2D image data alone. [sent-14, score-0.282]

10 Introduction The reconstruction of scene geometry from a single monocular image sequence is a key problem in computer vision. [sent-16, score-0.129]

11 When the camera trajectory is unknown, the joint on-line estimation problem for scene structure and camera pose has become known as visual Simultaneous Localisation and Mapping. [sent-17, score-0.466]

12 Early methods for visual SLAM [7, 11] concentrated on accurate camera pose estimation using only sparse reconstructions. [sent-18, score-0.324]

13 computation devices, real-time, dense SLAM has become a technical possibility [10, 15]. [sent-23, score-0.141]

14 While this has the effect of increasing SLAM robustness and accuracy, both approaches are limited by their use of a sparse map and by the fact that they consider the objects to be of fixed and perfectly known shape. [sent-27, score-0.1]

15 A more generic semantic reconstruction is proposed in [8], where shape and layout priors of buildings are learned offline. [sent-28, score-0.224]

16 While shape priors have seen limited use in the SLAM literature, they have been extensively used in segmentation and tracking, as a solution to the problem of imperfect raw image information. [sent-31, score-0.202]

17 One of the most effective and popular approaches to represent shape knowledge is to use dimensionality reduction to capture the shape variance as low dimensional latent shape spaces. [sent-32, score-0.434]

18 Initial works, such as [24], focused on (implicitly or explicitly defined) 2D shapes, and used linear dimensionality reduction in the form of principal component analysis (PCA). [sent-33, score-0.118]

19 More recent works use nonlinear dimensionality reduction such as Kernel PCA in [6] and Gaussian Process Latent Variable Models (GP-LVM) in [17]. [sent-34, score-0.117]

20 This led to 3D shape priors being first introduced in [23]. [sent-35, score-0.162]

21 Most recently, [19] learn GP-LVM latent spaces of 3D shapes and use them in monocular simultaneous 2D segmentation, 3D reconstruction and 3D pose recovery. [sent-36, score-0.461]

22 Our objective in this paper is to address these limitations of existing systems by proposing an efficient dense SLAM approach that integrates a shape-prior-based estimator as-and-when possible. [sent-37, score-0.141]

23 Here, in a manner similar to [19], we represent the shape-prior using GP-LVM and optimise an energy over the pose and a low-dimensional latent shape space. [sent-39, score-0.444]

24 An implicit volumetric representation of the dense reconstruction (similar to that used in [16]) allows for a very efficient fusion of the dense reconstruction with the reconstructed object shape. [sent-40, score-0.566]

25 Next, in Section 3 the semantic part of the system is described, including the recognition of an object together with the estimation of its refined pose and shape. [sent-44, score-0.327]

26 In Section 4 we present the way the information provided by the shape prior based estimator can be integrated into the dense SLAM. [sent-45, score-0.24]

27 Dense SLAM Our dense SLAM system is structured as follows: firstly, assuming known camera pose from the PTAM system [11], dense depth maps are built using a brightness constancy assumption. [sent-48, score-1.033]

28 Each depth map is subsequently fused into a global volumetric representation of the scene. [sent-49, score-0.509]

29 Local Depth Map Estimation Closely mirroring the approach of [15], we formulate the initial depth map estimation problem as one of finding the depth of each point that is seen in one reference image. [sent-52, score-0.717]

30 More formally, let u denote the coordinates of a pixel in the reference image Ir. [sent-56, score-0.219]

31 $\pi^{-1}$ is the function that maps the pixel coordinates and depth of a point to its homogeneous coordinates in the camera frame (computed using the intrinsic parameters of the camera). [sent-66, score-0.682]

32 ${}^mM_r$ is the SE(3) matrix that maps the coordinates of a point in the reference camera frame into its coordinates in the camera frame m; this matrix is available from PTAM, which provides the world-to-camera transformations ${}^mM_w$, using ${}^mM_r = {}^mM_w \, ({}^rM_w)^{-1}$. [sent-67, score-0.659]
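The relative transformation between the reference frame and frame m can be illustrated with a minimal NumPy sketch. It assumes the world-to-camera poses are available as 4x4 homogeneous matrices and composes one with the inverse of the other; the helper `se3` and the numeric poses are hypothetical and only serve to build test data.

```python
import numpy as np

def relative_pose(m_M_w: np.ndarray, r_M_w: np.ndarray) -> np.ndarray:
    """Return m_M_r, which maps reference-frame coordinates into frame m."""
    return m_M_w @ np.linalg.inv(r_M_w)

def se3(rotation_z: float, translation) -> np.ndarray:
    """Hypothetical helper: 4x4 SE(3) matrix from a rotation about z and a translation."""
    c, s = np.cos(rotation_z), np.sin(rotation_z)
    M = np.eye(4)
    M[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    M[:3, 3] = translation
    return M

# Two made-up world-to-camera poses (reference frame r and frame m).
r_M_w = se3(0.3, [0.1, 0.0, 0.5])
m_M_w = se3(-0.2, [0.0, 0.2, 0.4])
m_M_r = relative_pose(m_M_w, r_M_w)

# A world point expressed in both camera frames.
X_w = np.array([1.0, 2.0, 3.0, 1.0])
X_r = r_M_w @ X_w
X_m = m_M_w @ X_w
```

Applying `m_M_r` to `X_r` should reproduce `X_m`, which is a quick sanity check on the composition order.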

33 π is the function that projects the homogeneous coordinates of a point in the camera frame into its pixel coordinates in the image plane. [sent-68, score-0.422]

34 Searching for the actual surface along one ray is then equivalent to searching for the depth d that leads to the minimum photo-consistency error. [sent-70, score-0.444]

35 The points along each ray are evenly sampled along the inverse depth so that the corresponding epipolar lines are evenly sampled. [sent-71, score-0.417]
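The per-ray search can be sketched as follows, assuming the photo-consistency errors have already been gathered into a cost volume of shape (number of depth samples, height, width). The function names, the depth range and the random toy cost volume are illustrative assumptions, not the authors' implementation; the sketch only shows even sampling in inverse depth and a per-pixel minimum.

```python
import numpy as np

def inverse_depth_samples(d_min: float, d_max: float, n: int) -> np.ndarray:
    """Depths whose inverses are evenly spaced between 1/d_max and 1/d_min."""
    inv = np.linspace(1.0 / d_max, 1.0 / d_min, n)
    return 1.0 / inv

def best_depth(cost: np.ndarray, depths: np.ndarray) -> np.ndarray:
    """Per-pixel depth with the minimum cost; cost has shape (n_depths, H, W)."""
    return depths[np.argmin(cost, axis=0)]

depths = inverse_depth_samples(0.5, 4.0, 8)

# Toy cost volume: random errors, with sample 3 forced to be the best everywhere.
rng = np.random.default_rng(0)
cost = rng.random((8, 4, 5))
cost[3] = -1.0
depth_map = best_depth(cost, depths)
```

The resulting `depth_map` is the raw, unregularised estimate; it would still be subject to the noise sources that motivate the regularisation step.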

36 The depth estimate resulting from this process is, however, noisy, since (i) for many pixels the brightness constancy assumption is not respected; (ii) the pixels themselves are noisy; (iii) the evaluation space is discretised; and (iv) uniform regions lack colour information. [sent-72, score-0.606]

37 To improve the depth map the standard approach is to regularise with a weak prior that favours continuous depth in uniform regions. [sent-73, score-0.591]

38 This yields the following energy minimisation over the depth map d(u) : E(d(u)) =? [sent-74, score-0.551]

39 where $\nabla d(u)$ is the depth map gradient and $\gamma$ is a scalar weighting term (see [15] for further details). [sent-81, score-0.326]

40 Robust Map Representation The process above can be repeated for several reference frames, and the resulting depth maps merged into a single global map. [sent-91, score-0.389]

41 To address this limitation we fuse the local depth maps into a dense volumetric parametrization of the 3D world akin to that used in [26, 10, 16]. [sent-95, score-0.602]

42 The surface is recovered from this representation as the TSDF zero level set. [sent-97, score-0.099]

43 Each time a new depth map is generated, the values in the TSDF are updated to take the new information into account following a similar process to the one in [16]. [sent-98, score-0.363]

44 , the distance as estimated in the new depth map: Wk+1= Wk+ W? [sent-100, score-0.265]

45 The weight of the new measurement, which lies in [0, maxW], denotes the confidence in the new information and is a function of the angle between the surface normal and the optic ray, with greater confidence associated with frontal surfaces (and near-zero confidence for surfaces tangential to an optic ray). [sent-106, score-0.338]

46 Thus the new (approximate) distance to the surface is a weighted average of all previous measurements, helping to smooth out errors. [sent-107, score-0.099]
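A minimal sketch of such a running weighted average is given below. It assumes the usual form in which the stored distance is blended with the new distance in proportion to their weights and the weights are summed; the cosine-based confidence model and all names (`fuse_tsdf`, `confidence_from_angle`, `max_w`) are assumptions made for illustration, since the text only states that confidence depends on the angle between the surface normal and the optic ray.

```python
import numpy as np

def confidence_from_angle(normal, ray, max_w=1.0):
    """Assumed model: max_w * |cos| of the angle between unit normal and unit ray."""
    cos_angle = abs(float(np.dot(normal, ray)))
    return max_w * min(cos_angle, 1.0)

def fuse_tsdf(D, W, D_new, W_new):
    """Blend stored distances D (weights W) with new distances D_new (weights W_new)."""
    W_next = W + W_new
    safe = np.where(W_next > 0, W_next, 1.0)  # avoid division by zero
    D_next = np.where(W_next > 0, (W * D + W_new * D_new) / safe, D)
    return D_next, W_next

# Two voxels: one already observed, one seen for the first time.
D = np.array([1.0, 0.0])
W = np.array([1.0, 0.0])
D_new = np.array([0.0, 0.5])
W_new = np.array([1.0, 2.0])
D_next, W_next = fuse_tsdf(D, W, D_new, W_new)

# Frontal surface (normal facing the camera) versus tangential surface.
ray = np.array([0.0, 0.0, 1.0])
frontal = confidence_from_angle(np.array([0.0, 0.0, -1.0]), ray)
tangential = confidence_from_angle(np.array([1.0, 0.0, 0.0]), ray)
```

Because each update is a convex combination, repeated fusion averages out independent errors in the individual depth maps.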

47 Incorporating object knowledge Our objective in this paper is to show how the ability to detect objects and incorporate them into a SLAM map is beneficial, as a step towards a more object-based, more semantically meaningful map. [sent-110, score-0.115]

48 Image-based object detection and localisation While the dense SLAM system continually acquires new depth meshes at key-frames and fuses these into a global volumetric representation, in parallel we run a part-based object-class detector based on the effective procedure described in [9]. [sent-115, score-0.772]

49 To perform the optimization using shape priors, it is first necessary to have a coarse estimation of the pose of the detected object. [sent-123, score-0.391]

50 To estimate this pose, we use a combination of the data available from the detector and from the dense SLAM map as follows. [sent-124, score-0.25]

51 We thus require a detection in at least two key-frames before proceeding to estimate the pose [12]. [sent-126, score-0.182]

52 Next, we estimate the vertical axis of the object by finding its supporting plane using the dense SLAM map. [sent-128, score-0.252]

53 To do so, we make the assumptions that (i) there is indeed a supporting planar surface; and (ii) the supporting plane is unoccluded in the immediate area around the object. [sent-129, score-0.114]

54 In particular, we sample depth values from pixels located immediately below the object in the key-frames and apply RANSAC to the resulting point cloud (see Figure 1(b)). [sent-130, score-0.319]
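The plane-fitting step can be sketched with a basic three-point RANSAC loop. This is a generic textbook variant under assumed parameters (iteration count, inlier threshold, synthetic point cloud); the text does not specify which RANSAC settings or refinement the authors use.

```python
import numpy as np

def ransac_plane(points, n_iters=200, threshold=0.01, seed=0):
    """Fit a plane n.x + d = 0 to an (N, 3) point cloud; returns ((n, d), inlier count)."""
    rng = np.random.default_rng(seed)
    best_inliers, best_plane = 0, None
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-12:  # degenerate (collinear) sample
            continue
        normal = normal / norm
        offset = -normal @ sample[0]
        inliers = int(np.sum(np.abs(points @ normal + offset) < threshold))
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (normal, offset)
    return best_plane, best_inliers

# Synthetic data: 80 points on the plane z = 0 plus 20 outliers above it.
rng = np.random.default_rng(1)
plane_pts = np.column_stack([rng.uniform(-1, 1, 80), rng.uniform(-1, 1, 80), np.zeros(80)])
outliers = np.column_stack([rng.uniform(-1, 1, 20), rng.uniform(-1, 1, 20), rng.uniform(0.2, 1.0, 20)])
cloud = np.vstack([plane_pts, outliers])

(normal, offset), n_inliers = ransac_plane(cloud)
```

The recovered `normal` (up to sign) plays the role of the object's vertical axis under the supporting-plane assumptions stated above.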

55 To estimate the second (and hence third and final) principal direction we consider the projection of the part configuration from the Felzenszwalb detector [9]. [sent-131, score-0.17]

56 Finally, the size of the object is estimated using the size of the projection of the detected object in the first image and the depth of the object available from an initial triangulation. [sent-135, score-0.538]

57 This, together with the intrinsic camera calibration, is sufficient to yield an estimate of the size of the object. [sent-136, score-0.143]

58 We leverage this information to build foreground and background colour models for the detected object as a whole. [sent-140, score-0.342]

59 Colour histograms of both foreground P(y|Cf) and background P(y|Cb) are then easily created (here y represents the image pixel colour at location u). [sent-144, score-0.29]

60 foreground and background posterior probabilities: $P_f(u) = \frac{P(C_f|y)}{\eta_f} = \frac{P(y|C_f)}{\eta_f P(y|C_f) + \eta_b P(y|C_b)}$, $P_b(u) = \frac{P(C_b|y)}{\eta_b} = \frac{P(y|C_b)}{\eta_f P(y|C_f) + \eta_b P(y|C_b)}$ (4) with ηf and ηb being the number of pixels in the foreground and background regions respectively. [sent-147, score-0.162]
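As an illustration of eq (4), the sketch below builds normalised colour histograms from labelled pixels and evaluates the per-pixel terms. It reads eq (4) as Pf = P(y|Cf) / (ηf P(y|Cf) + ηb P(y|Cb)) and likewise for Pb, which is an interpretation of the extracted formula; the quantised single-channel "colours", bin count and function names are assumptions chosen to keep the example small.

```python
import numpy as np

def colour_histogram(bins: np.ndarray, n_bins: int) -> np.ndarray:
    """Normalised histogram P(y|C) over quantised colour indices."""
    counts = np.bincount(bins, minlength=n_bins).astype(float)
    return counts / counts.sum()

def pixel_posteriors(lik_f, lik_b, eta_f, eta_b):
    """Per-pixel Pf and Pb given likelihoods and region sizes eta_f, eta_b."""
    denom = eta_f * lik_f + eta_b * lik_b
    denom = np.where(denom > 0, denom, 1.0)  # colours seen in neither region
    return lik_f / denom, lik_b / denom

# Quantised colours of pixels labelled foreground and background.
fg_pixels = np.array([0, 0, 0, 1])
bg_pixels = np.array([1, 1, 2, 2, 2, 2])
hist_f = colour_histogram(fg_pixels, 3)
hist_b = colour_histogram(bg_pixels, 3)
eta_f, eta_b = len(fg_pixels), len(bg_pixels)

# Evaluate on a small image of quantised colours.
image = np.array([[0, 1], [2, 1]])
Pf, Pb = pixel_posteriors(hist_f[image], hist_b[image], eta_f, eta_b)
```

Under this reading, ηf Pf + ηb Pb equals 1 at every pixel whose colour occurs in at least one histogram, which matches the interpretation of Pf and Pb as posteriors divided by the region sizes.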

61 1(a) where each part of the detected object is represented with its relative segmentation (black region represents the background, white is foreground and gray is unknown) and the foreground per-pixel posterior probability shown in Fig. [sent-149, score-0.253]

62 Prior based shape and pose estimation To segment the object in 3D (and subsequently fuse this information back into the volumetric model), we use a method similar to [19]. [sent-153, score-0.573]

63 3D shapes are represented volumetrically as Signed Distance Functions (SDFs), with the object surface implicitly defined by the zero-level set making this a natural candidate for use with the volumetric models produced using the methods in Section 2. [sent-154, score-0.294]
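To make the volumetric representation concrete, the sketch below samples the signed distance function of a sphere on a voxel grid and extracts the voxels inside the shape and those close to its zero-level set. The sphere, the grid size and the sign convention (negative inside) are assumptions for illustration; learned object shapes would replace the analytic sphere.

```python
import numpy as np

def sphere_sdf(grid_size: int, radius: float) -> np.ndarray:
    """Signed distance to a sphere centred in the cube [-1, 1]^3 (negative inside)."""
    axis = np.linspace(-1.0, 1.0, grid_size)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.sqrt(x**2 + y**2 + z**2) - radius

phi = sphere_sdf(33, 0.5)
voxel_size = 2.0 / 32

inside = phi < 0                         # voxels enclosed by the surface
near_surface = np.abs(phi) < voxel_size  # band around the zero-level set
```

A band such as `near_surface` is also the natural region to touch when only voxels close to an object surface should be modified in a global volume.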

64 Within-class shape variation is represented via a low-dimensional embedding of the otherwise very high-dimensional 3D shape space. [sent-155, score-0.099]

65 Unlike [19] however, in our current context, we have camera pose and depth information available from the SLAM system, which we aim to use to improve the object pose and shape recovery results. [sent-159, score-0.92]

66 one that also takes depth into consideration), and, on the other hand, the need to match the scale between the SLAM system and the learned object coordinate system. [sent-162, score-0.432]

67 Our aim therefore becomes, for objects detected within the scene, the simultaneous recovery of 3D shape (parametrised by the latent space), 6D pose and scale. [sent-163, score-0.468]

68 We do this by defining an image and depth based energy function, finding its derivatives w. [sent-164, score-0.458]

69 pose, scale and shape and using standard nonlinear minimisation techniques. [sent-167, score-0.275]

70 1 Energy Function Our dense SLAM system provides pose and depth information over multiple frames coming from a single monocular source. [sent-171, score-0.741]

71 We use the nv key-frames from this data stream as multiple views in our joint 3D shape / 3D pose optimisation. [sent-172, score-0.281]

72 image coordinates and depth) from a key frame v, and the corresponding points X0 in the object coordinate frame, we write our energy function as: E(Φ) =n1v? [sent-175, score-0.353]

73 This energy function combines an image based error Eiv (Φ) and a depth based one Edv (Φ), with α representing the balance between the two. [sent-184, score-0.361]

74 Note that there is a principled, probabilistic explanation behind this coupling, as each part of the energy function can be written as the log of a per pixel joint probability. [sent-185, score-0.135]

75 Furthermore, since the two parts of the energy function are sums of per-pixel values, we can perform the multi view information fusion by simply averaging the per-view energy function values. [sent-186, score-0.242]

76 Eiv (Φ) measures the discrimination between statistically defined foreground and background regions, as a function of the projected 3D SDF Φ, using the functions Pf and Pb from eq (4) in each reference view v. [sent-187, score-0.218]

77 πv (Φ) projects Φ to a 2D occupancy map, with value 1 inside the projection outline and 0 outside. [sent-190, score-0.161]

78 It does this by evaluating, for each pixel in the image-depth domain, the probability of it being a projection of a voxel “inside” the 3D SDF Φ. [sent-191, score-0.167]

79 e the surface of the 3D object model), using the pose corresponding to view v and assuming pixelwise independence. [sent-201, score-0.385]

80 The probability that an image-depth point lies on the object surface is equal to the probability that the back projected 3D point lies on the zero-level of the SDF. [sent-202, score-0.153]

81 This approach was also used in [20], in which depth data coming from a Microsoft Kinect unit was used for simultaneous model based 3D tracking and calibration. [sent-204, score-0.382]

82 Unlike [20] however, here we (i) also make use of the RGB image data, (ii) adapt the shape of the object and (iii) use the dense SLAM system to provide depth data. [sent-205, score-0.611]

83 To minimise this energy, we compute its derivatives with respect to pose, shape and scale and use them in a Levenberg-Marquardt style nonlinear minimisation. [sent-206, score-0.243]

84 2 Pose/Scale Derivatives Each image-depth point x in a view v is the projection of a point X in the camera coordinate frame, which itself has a corresponding point X0 in the object coordinate frame. [sent-209, score-0.403]

85 The transformation from X to x is parametrised by the camera intrinsic parameters corresponding to view v. [sent-210, score-0.295]

86 The transformation from X0 to X is $X = {}^vM_o X_0$, where ${}^vM_o = {}^vM_w \, {}^wM_o$ with ${}^vM_w$ being the SE(3) transformation from the world to the reference camera v coordinates defined in Section 2. [sent-211, score-0.411]

87 wMo is the transformation from object to world coordinates, i. [sent-212, score-0.136]

88 , 7 represent the unknown 6 DoF pose parameters p( ∈thre 1e, f. [sent-219, score-0.222]

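89 […] (12) where $\frac{\partial \Phi}{\partial \lambda_p} = -\frac{\partial \Phi}{\partial X_0} ({}^vM_o)^{-1} \, {}^vM_w \frac{\partial \, {}^wM_o}{\partial \lambda_p} ({}^vM_o)^{-1} X$ (13) As in [19], in order to make the computation of $\frac{\partial \pi_v}{\partial \lambda_p}$ easier, we use OpenGL-style normalised device coordinates for Φ and X. [sent-229, score-0.177]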

90 In this coordinate system the 3D SDF Φ is transformed into Φv, using the pose, scale and intrinsics corresponding to view v. [sent-230, score-0.163]

91 Also, the 3D point that projects to x under the known camera calibration for view v is now denoted by Xnv. [sent-231, score-0.237]

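92 Therefore, using the chain rule, we can write: $\frac{\partial X_{nv}}{\partial \lambda_p} = \frac{\partial X_{nv}}{\partial X} \frac{\partial X}{\partial \lambda_p}$ (15) where $\frac{\partial X_{nv}}{\partial X}$ are the derivatives of the standard normalised device coordinate conversion (i. [sent-232, score-0.242]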

93 projection and normalisation of the Z coordinate) and ∂∂λXp follow in a straightforward manner as derivatives of X = vMwwMoX0, wrt. [sent-234, score-0.171]

94 Finally, the derivatives $\frac{\partial \, {}^wM_o}{\partial \lambda_p}$ are computed analogously and 3. [sent-236, score-0.14]

95 We do this, in a manner similar to [19], by using a dimensionality reduction technique called Gaussian Process Latent Variable Models, to learn nonlinear and probabilistic latent shape spaces. [sent-240, score-0.283]

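96 pose and scale, by replacing $\frac{\partial \Phi_v}{\partial X_{nv}} \frac{\partial X_{nv}}{\partial \lambda_p}$ with $\frac{\partial \Phi_v}{\partial l_q}$ and $\frac{\partial \Phi}{\partial \lambda_p}$ with $\frac{\partial \Phi}{\partial l_q}$. [sent-251, score-0.182]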

97 These final two derivatives are the ones of the standard GP-LVM generative process [13], on which the inverse DCT transform has been applied. [sent-252, score-0.134]

98 Map update Once the shape and pose estimation of the object has converged (as measured using the standard Levenberg-Marquardt test), we fuse the shape SDF Φ with the global map. [sent-254, score-0.527]

99 is defined by the object SDF while the confidence W? [sent-257, score-0.103]

100 (or weight) in this distance is defined so that only the voxels close to the object surface are modified: W? [sent-258, score-0.191]

