Author: Renato F. Salas-Moreno, Richard A. Newcombe, Hauke Strasdat, Paul H.J. Kelly, Andrew J. Davison

Abstract: We present the major advantages of a new ‘object oriented’ 3D SLAM paradigm, which takes full advantage in the loop of prior knowledge that many scenes consist of repeated, domain-specific objects and structures. As a hand-held depth camera browses a cluttered scene, realtime 3D object recognition and tracking provides 6DoF camera-object constraints which feed into an explicit graph of objects, continually refined by efficient pose-graph optimisation. This offers the descriptive and predictive power of SLAM systems which perform dense surface reconstruction, but with a huge representation compression. The object graph enables predictions for accurate ICP-based camera to model tracking at each live frame, and efficient active search for new objects in currently undescribed image regions. We demonstrate real-time incremental SLAM in large, cluttered environments, including loop closure, relocalisation and the detection of moved objects, and of course the generation of an object level scene description with the potential to enable interaction.

1 (Affiliation 1: Imperial College London) Abstract: We present the major advantages of a new ‘object oriented’ 3D SLAM paradigm, which takes full advantage in the loop of prior knowledge that many scenes consist of repeated, domain-specific objects and structures. [sent-6, score-0.214]

2 As a hand-held depth camera browses a cluttered scene, realtime 3D object recognition and tracking provides 6DoF camera-object constraints which feed into an explicit graph of objects, continually refined by efficient pose-graph optimisation. [sent-7, score-0.482]

3 This offers the descriptive and predictive power of SLAM systems which perform dense surface reconstruction, but with a huge representation compression. [sent-8, score-0.141]

4 The object graph enables predictions for accurate ICP-based camera to model tracking at each live frame, and efficient active search for new objects in currently undescribed image regions. [sent-9, score-0.773]

5 We demonstrate real-time incremental SLAM in large, cluttered environments, including loop closure, relocalisation and the detection of moved objects, and of course the generation of an object level scene description with the potential to enable interaction. [sent-10, score-0.457]

6 Modern processing hardware permits ever-improving levels of detail and scale, and much interest is now turning to semantic labelling of this geometry in terms of the objects and regions that are known to exist in the scene. [sent-13, score-0.233]

7 However, some thought about this process reveals a huge amount of wasted computational effort; and the potential for a much better way of taking account of domain knowledge in the loop of SLAM operation itself. [sent-14, score-0.181]

8 We propose a paradigm for real-time localisation and mapping which harnesses 3D object recognition to jump over low level geometry processing and produce incremen- (Affiliation 2: University of Washington) Figure 1. [sent-15, score-0.273]

9 (left) A live view at the current camera pose and the synthetic rendered objects. [sent-17, score-0.42]

10 (right) We contrast a raw depth camera normal map with the corresponding high quality prediction from our object graph, used both for camera tracking and for masking object search. [sent-18, score-0.653]

11 As a hand-held depth camera browses a cluttered scene, prior knowledge of the objects likely to be repetitively present enables real-time 3D recognition and the creation of a simple pose graph map of relative object locations. [sent-20, score-0.653]

12 This graph is continuously optimised as new measurements arrive, and enables always up-to-date, dense and precise prediction of the next camera measurement. [sent-21, score-0.381]

13 These predictions are used for robust camera tracking and the generation of active search regions for further object detection. [sent-22, score-0.257]

14 Our approach is enabled by efficient GPGPU parallel implementation of recent advances in real-time 3D object detection and 6DoF (Degree of Freedom) ICP-based pose refinement. [sent-23, score-0.207]

15 Real-Time SLAM with Hand-Held Sensors: In SLAM (Simultaneous Localisation and Mapping), building an internally consistent map in real-time from a moving sensor enables drift-free localisation during arbitrarily long periods of motion. [sent-26, score-0.199]

16 Sparse feature filtering methods like [5] were improved on by ‘keyframe SLAM’ systems like PTAM [8] which used bundle adjustment in parallel with tracking to enable high feature counts and more accurate tracking. [sent-29, score-0.149]

17 While this approach is possible with an RGB camera [12], commodity depth cameras have now come to the fore in high performance, robust indoor 3D mapping, in particular via the KinectFusion algorithm [11]. [sent-31, score-0.208]

18 New developments such as [18] have tackled scaling the method via a sliding volume, sub-blocking or octrees; but a truly scalable, multi-resolution, loop closure capable dense nonparametric surface representation remains elusive, and will always be wasteful in environments with symmetry. [sent-32, score-0.391]

19 From sparse feature-based SLAM, where the world is modelled as an unconnected point cloud, to dense SLAM which assumes that scenes contain continuous surfaces, we have seen an increase in the prior scene knowledge brought to bear. [sent-33, score-0.152]

20 While we currently pre-define the objects expected in a scene, we intend that the paradigm permits the objects in a scene to be identified and segmented automatically as salient, repeated elements. [sent-35, score-0.269]

21 Unlike dense nonparametric approaches, the relatively few discrete entities in the map make it highly feasible to jointly optimise over all object positions to make globally consistent maps. [sent-37, score-0.192]

22 Further, and crucially, instant recognition of objects provides great efficiency and robustness benefits via the active approaches it permits to tracking and object detection, guided entirely by the dense predictions we can make of the positions of known objects. [sent-39, score-0.319]

23 SLAM++ relates strongly to the growing interest in semantically labelling scene reconstructions and maps, in both the computer vision and robotics communities, though we stress the big difference between post-hoc labelling of geometry and the closed loop, real-time algorithm we present. [sent-40, score-0.336]

24 A depth camera is first used to scan a scene, similar in scale and object content to the results we demonstrate later, and all data is fused into a single large point cloud. [sent-43, score-0.302]

25 Off-line, learned object models, with a degree of variation to cope with a range of real object types, are then matched into the joint scan, optimising both similarity and object configuration constraints. [sent-44, score-0.186]

26 The results are impressive, though the emphasis is on labelling rather than aiding mapping and we can see problems with missing data which cannot be fixed with non-interactive capture. [sent-45, score-0.184]

27 Other good work on labelling using RGB-D data was by Silberman [16] as well as Ren et al. [sent-46, score-0.108]

28 [14] who used kernel descriptors for appearance and shape to label single depth camera images with object and region identity. [sent-47, score-0.256]

29 These objects, once recognized via SIFT descriptors, improved the quality of SLAM due to their known size and shape, though the objects were simple highly textured posters and the scene scale small. [sent-51, score-0.125]

30 [13] demonstrated a 2D laser/camera system which used object recognition to generate discrete entities to map (tree trunks) rather than using Finally, the same idea that object recognition aids reconstruction has been used in off-line structure from motion. [sent-53, score-0.185]

31 [3] represented a scene as a set of points, objects and regions in two-view SfM, solving expensively and jointly in a graph optimisation for a labelling and reconstruction solution taking account of interactions between all scene entities. [sent-55, score-0.481]

32 Given a live depth map $D_l$, we first compute a surface measurement in the form of a vertex and normal map $N_l$, providing input to the sequentially computed camera tracking and object detection pipelines. [sent-59, score-0.806]

33 (1) We track the live camera pose $T_{wl}$ with an iterative closest point approach using the dense multi-object scene prediction captured in the current SLAM graph G. [sent-60, score-0.66]

34 (3) We add successfully detected objects g into the SLAM graph in the form of an object-pose vertex connected to the live estimated camera-pose vertex via a measurement edge. [sent-64, score-0.563]

35 (4) Rendering objects from the SLAM graph produces a predicted depth $D_r$ and normal map $N_r$ into the live estimated frame, enabling us to actively search only those pixels not described by current objects in the graph. [sent-65, score-0.411]

36 We run an individual ICP between each object and the live image, resulting in the addition of a new camera-object constraint into the SLAM graph. [sent-66, score-0.249]

37 Creating an Object Database: Before live operation in a certain scene, we rapidly make high quality 3D models of repeatedly occurring objects via interactive scanning using KinectFusion [11] in a controlled setting where the object can easily be circled without occlusion. [sent-71, score-0.354]

38 A mesh for the object is extracted from the truncated signed distance volume obtained from KinectFusion using marching cubes [10]. [sent-72, score-0.162]

39 A small amount of manual editing in a mesh tool is performed to separate the object from the ground plane, and mark it up with a coordinate frame such that domain-specific object constraints can be applied. [sent-73, score-0.173]

40 SLAM Map Representation: Our representation of the world is a graph, where each node stores either the estimated SE(3) pose (rotation and translation relative to a fixed world frame) $T_{wo_j}$ of object j, or the historical pose $T_{wi}$ of the camera at timestep i (see Figure 5). [sent-77, score-0.45]

41 Each object node is annotated with a type from the object database. [sent-78, score-0.096]

42 Each SE(3) measurement $Z_{i,o_j}$ of the pose of an object from the camera is stored in the graph as a factor (constraint) which links one camera pose and one object pose. [sent-79, score-0.665]
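
As a minimal sketch of the map representation described above, the graph could be held in a few plain containers. All class and field names here (`SlamGraph`, `CameraNode`, `ObjectNode`, `Factor`) are hypothetical and chosen for illustration; poses are stored as 4x4 row-major nested tuples, which is an assumption rather than the paper's storage format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A 4x4 homogeneous transform stored row-major (illustrative choice).
Pose = Tuple[Tuple[float, ...], ...]

IDENTITY: Pose = (
    (1.0, 0.0, 0.0, 0.0),
    (0.0, 1.0, 0.0, 0.0),
    (0.0, 0.0, 1.0, 0.0),
    (0.0, 0.0, 0.0, 1.0),
)


@dataclass
class CameraNode:
    timestep: int
    T_wi: Pose  # estimated camera pose in the world frame


@dataclass
class ObjectNode:
    object_type: str  # type label from the object database
    T_wo: Pose        # estimated object pose in the world frame


@dataclass
class Factor:
    camera: int  # index of the camera node
    obj: int     # index of the object node
    Z: Pose      # measured pose of the object as seen from the camera


@dataclass
class SlamGraph:
    cameras: List[CameraNode] = field(default_factory=list)
    objects: List[ObjectNode] = field(default_factory=list)
    factors: List[Factor] = field(default_factory=list)

    def add_camera(self, T_wi: Pose) -> int:
        self.cameras.append(CameraNode(len(self.cameras), T_wi))
        return len(self.cameras) - 1

    def add_object(self, object_type: str, T_wo: Pose) -> int:
        self.objects.append(ObjectNode(object_type, T_wo))
        return len(self.objects) - 1

    def add_measurement(self, camera: int, obj: int, Z: Pose) -> None:
        # Each factor links exactly one camera pose and one object pose.
        if not 0 <= camera < len(self.cameras):
            raise IndexError("unknown camera node")
        if not 0 <= obj < len(self.objects):
            raise IndexError("unknown object node")
        self.factors.append(Factor(camera, obj, Z))


graph = SlamGraph()
cam = graph.add_camera(IDENTITY)
chair = graph.add_object("chair", IDENTITY)
graph.add_measurement(cam, chair, IDENTITY)
```

Keeping the factors in a separate list from the nodes means an optimiser can iterate over constraints without walking the node lists, though other layouts would work equally well.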

43 Additional factors can optionally be added to the graph; between camera poses to represent camera-camera motion estimates (e. [sent-80, score-0.152]

44 Details on graph optimisation are given in Section 3. [sent-85, score-0.192]

45 [6] for recognising the 6DoF pose of 3D objects, represented by meshes, in a depth image. [sent-90, score-0.191]

46 We give details of our parallel implementation, which achieves the real-time detection of multiple instances of multiple objects we need by fully exploiting the fine-grained parallelism of GPUs. [sent-91, score-0.127]

47 ’s method an object is detected and simultaneously localised via the accumulation of votes in a parameter space. [sent-93, score-0.143]

48 Points, with normal estimates, are randomly sampled on a bilateral-filtered image from the depth camera. [sent-95, score-0.136]

49 These samples are paired up in all possible combinations to generate PPFs which vote for 6DoF model configurations containing a similar PPF. [sent-96, score-0.098]
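
For illustration, a point pair feature in the commonly used four-component form (pair distance plus three angles between the normals and the connecting vector) can be computed as below. This is a sketch under assumptions: the text does not spell out the feature definition, and the function names and the quantisation steps (`dist_step`, `angle_step`) are hypothetical.

```python
import math


def _angle(a, b):
    """Angle in radians between two 3-vectors (0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))


def point_pair_feature(p1, n1, p2, n2):
    """Return (distance, angle(n1, d), angle(n2, d), angle(n1, n2))."""
    d = tuple(b - a for a, b in zip(p1, p2))
    dist = math.sqrt(sum(x * x for x in d))
    return (dist, _angle(n1, d), _angle(n2, d), _angle(n1, n2))


def quantise(feature, dist_step=0.05, angle_step=math.radians(12)):
    """Discretise a feature so that similar pairs share one hash key."""
    dist, a1, a2, a3 = feature
    return (int(dist // dist_step), int(a1 // angle_step),
            int(a2 // angle_step), int(a3 // angle_step))


def all_pair_features(points, normals):
    """Features for every ordered pair of distinct samples."""
    n = len(points)
    return {
        (i, j): point_pair_feature(points[i], normals[i],
                                   points[j], normals[j])
        for i in range(n) for j in range(n) if i != j
    }


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
normals = [(0.0, 0.0, 1.0)] * 3
features = all_pair_features(points, normals)
```

Quantised features act as hash keys: scene pairs that land in the same bin as a model pair cast a vote for the corresponding model configuration.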

50 Similar structures are also built from each live frame. [sent-98, score-0.201]

51 Matching similar features of the scene against the model can be efficiently performed in parallel via a vectorised binary search, producing a vote for each match. [sent-100, score-0.212]

52 32 + i; codes ← Sort(codes); // Decode PPF index and hash key: key2ppfMap ← new array; hashKeys ← new array; foreach i ← 0 to N − 1 in parallel do key2ppfMap[i] = ∼(1 ? [sent-108, score-0.204]

53 To overcome this, each vote is represented as a 64-bit integer code (Figure 4), which can then be efficiently sorted and reduced in parallel. [sent-111, score-0.098]

54 The first 6 bits encode the alignment angle, followed by 26 bits for the model point and 32 bits for the scene reference point. [sent-114, score-0.206]

55 This is followed by a parallel reduction with a sum operation to accumulate equal vote codes (Algorithm 2). [sent-116, score-0.286]
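
The packing, sorting and reduction of vote codes can be sketched sequentially as follows. The bit widths (6, 26 and 32) come from the text; the exact bit order is an assumption made here (alignment angle in the least significant bits, scene reference point in the most significant bits, so that sorting groups votes by scene reference point), and the function names are hypothetical. The sequential `sorted` plus `groupby` stands in for the parallel sort and sum-reduction.

```python
from itertools import groupby

ANGLE_BITS, MODEL_BITS, SCENE_BITS = 6, 26, 32


def encode_vote(scene_ref, model_point, angle_bin):
    """Pack one vote into a single 64-bit integer code."""
    if not 0 <= angle_bin < (1 << ANGLE_BITS):
        raise ValueError("angle_bin out of range")
    if not 0 <= model_point < (1 << MODEL_BITS):
        raise ValueError("model_point out of range")
    if not 0 <= scene_ref < (1 << SCENE_BITS):
        raise ValueError("scene_ref out of range")
    return ((scene_ref << (MODEL_BITS + ANGLE_BITS))
            | (model_point << ANGLE_BITS)
            | angle_bin)


def decode_vote(code):
    """Inverse of encode_vote: (scene_ref, model_point, angle_bin)."""
    angle_bin = code & ((1 << ANGLE_BITS) - 1)
    model_point = (code >> ANGLE_BITS) & ((1 << MODEL_BITS) - 1)
    scene_ref = code >> (MODEL_BITS + ANGLE_BITS)
    return scene_ref, model_point, angle_bin


def accumulate_votes(codes):
    """Sort the codes, then count runs of equal codes (sum-reduction)."""
    return [(code, sum(1 for _ in run))
            for code, run in groupby(sorted(codes))]


votes = [encode_vote(3, 10, 5), encode_vote(1, 2, 0), encode_vote(3, 10, 5)]
histogram = accumulate_votes(votes)
```

Because equal votes become equal integers, accumulation needs no shared accumulator array: after sorting, identical codes are adjacent and can be counted in one pass.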

56 After peak finding, pose estimates for each scene reference point are clustered on the CPU according to translation and rotation thresholds as in [6]. [sent-117, score-0.196]

57 ’s recognition algorithm [6] in room scenes is highly successful when objects occupy most of the camera’s field of view, but poor for objects which are distant or partly occluded by other objects, due to poor sample point coverage. [sent-121, score-0.179]

58 The view prediction capabilities of the system mean that we can generate a mask in image space for depth pixels which are not already described by projected known objects. [sent-123, score-0.133]

59 The measured depth images from the camera are cropped using these masks and samples are only spread in the undescribed regions (see Figure 3). [sent-124, score-0.281]
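
A minimal sketch of this masking step, assuming the rendered prediction stores a sentinel depth of `0.0` wherever no known object projects (an assumption; the function names are hypothetical). Depth images are plain nested lists to keep the example dependency-free.

```python
def undescribed_mask(rendered_depth, invalid=0.0):
    """True where the rendered prediction has no object, i.e. the pixel
    is not yet described by the graph."""
    return [[d == invalid for d in row] for row in rendered_depth]


def crop_depth(measured_depth, mask, invalid=0.0):
    """Keep measured depth only in undescribed regions; blank the rest
    so that no detector samples are spread over known objects."""
    return [
        [d if keep else invalid for d, keep in zip(depth_row, mask_row)]
        for depth_row, mask_row in zip(measured_depth, mask)
    ]


rendered = [[0.0, 1.2, 0.0],
            [0.0, 0.0, 1.5]]
measured = [[2.0, 1.1, 2.1],
            [2.2, 2.3, 1.4]]
mask = undescribed_mask(rendered)
search_region = crop_depth(measured, mask)
```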

60 The result of object detection is often multiple cluster peaks, and quantised localisation results. [sent-125, score-0.155]

61 We update the live camera to world transform $T_{wl}$ by estimating a sequence of m incremental updates $\{\tilde{T}_{rl}^{n}\}_{n=1}^{m}$ parametrised with a vector $x \in \mathbb{R}^6$ defining a twist in SE(3), with $\tilde{T}_{rl}^{n=0}$ as the identity. [sent-132, score-0.447]

62 We iteratively minimise the whole depth image point-plane metric over all available valid pixels $u \in \Omega$ in the live depth map: $E_c(x) = \ldots$ [sent-133, score-0.381]

63 $u' = \pi(K \hat{v}_l(u))$, computed by projecting the vertex $v_l(u)$ at pixel u from the live depth map into the reference frame with camera intrinsic matrix K and standard pin-hole projection function $\pi$. [sent-142, score-0.598]

64 The current live vertex is transformed into the reference frame using the current incremental transform $\tilde{T}_{rl}^{n}$: $\hat{v}_l(u) = \tilde{T}_{rl}^{n} v_l(u)$, $v_l(u) = K^{-1} \dot{u} D_l(u)$, $v_r(u') \ldots$ [sent-143, score-0.426]

65 Taking the solution vector x of the 6×6 linear system to an element in SE(3) via the exponential map, we compose the computed incremental transform at iteration n + 1 onto the previous estimated transform $\tilde{T}_{rl}^{n}$: $\tilde{T}_{rl}^{n+1} \leftarrow \exp(x)\,\tilde{T}_{rl}^{n}$. (7) [sent-157, score-0.131]

66 The estimated live camera pose $T_{wl}$ therefore results by composing the final incremental transform onto the previous frame pose: $T_{wl} \leftarrow T_{wr}\,\tilde{T}_{rl}^{m}$. (8) [sent-158, score-0.543]

67 Tracking Convergence: We use a maximum of m = 10 iterations and check for poor convergence of the optimisation process using two simple criteria. [sent-159, score-0.14]

68 Second, after an optimisation iteration has completed we compute the ratio of pixels in the live image which have been correctly matched with the predicted model ascertained by discounting pixels which induce a point-plane error greater than a specified magnitude ? [sent-161, score-0.349]
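
The incremental update and the inlier-ratio check can be sketched as below. This is an illustration under assumptions: the twist is ordered translation first, then rotation; the closed-form SE(3) exponential is the standard Rodrigues-based expression; and the function names and the `max_error` threshold are hypothetical.

```python
import math


def _matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def _skew(w):
    wx, wy, wz = w
    return [[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]]


def se3_exp(x):
    """Map a twist x = (v, w) in R^6 to a 4x4 rigid transform."""
    v, w = list(x[:3]), list(x[3:])
    theta = math.sqrt(sum(c * c for c in w))
    K = _skew(w)
    K2 = _matmul(K, K)
    if theta < 1e-9:  # small-angle limits of the coefficients
        a, b, c = 1.0, 0.5, 1.0 / 6.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    R = [[eye[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)]
         for i in range(3)]
    V = [[eye[i][j] + b * K[i][j] + c * K2[i][j] for j in range(3)]
         for i in range(3)]
    t = [sum(V[i][k] * v[k] for k in range(3)) for i in range(3)]
    return [R[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def compose_update(x, T_prev):
    """One incremental step: T_next = exp(x) * T_prev."""
    return _matmul(se3_exp(x), T_prev)


def inlier_ratio(residuals, max_error):
    """Fraction of pixels whose point-plane error is within max_error."""
    if not residuals:
        return 0.0
    return sum(1 for r in residuals if abs(r) <= max_error) / len(residuals)


T = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
T = compose_update([0.1, 0.0, 0.0, 0.0, 0.0, 0.0], T)
ratio = inlier_ratio([0.01, 0.5, 0.02, 0.03], max_error=0.1)
```

Left-multiplying by `exp(x)` matches the update order in the text, where each new increment is applied on top of the previous estimate in the reference frame.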

69 Tracking for Model Initialisation: We utilise the dense ICP pose estimation and convergence check for two further key components in SLAM++. [sent-163, score-0.19]

70 Therefore, given a candidate object and detected pose, we run camera-model ICP estimation on the detected object pose, and check for convergence using the previously described criteria. [sent-165, score-0.202]

71 We find that for correctly detected objects, the pose estimates from the detector are erroneous within ±30° rotation and ±50 cm translation. [sent-166, score-0.137]

72 Camera-Object Pose Constraints: Given the active set of objects that have been detected in SLAM++, we further estimate relative camera-object pose parameters which are used to induce constraints in the scene pose graph. [sent-169, score-0.363]

73 To that end, we run the dense ICP estimate between the live frame and each model object currently visible in the frame. [sent-170, score-0.386]

74 The ability to compute an individual relative pose estimate introduces the possibility to prune poorly initialised or incorrectly tracked objects from the pose graph at a later date. [sent-171, score-0.426]

75 By analysing the statistics of the camera-object pose estimate’s convergence we can keep an inlier-outlier style count on the inserted objects, and cull poorly performing ones. [sent-172, score-0.185]
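
One simple way to keep such an inlier-outlier count per object is sketched here. The class name, the minimum number of observations and the outlier fraction are all hypothetical values chosen for illustration; the text does not give the actual culling rule.

```python
from dataclasses import dataclass


@dataclass
class ObjectTrackStats:
    inliers: int = 0   # ICP runs that converged for this object
    outliers: int = 0  # ICP runs that failed the convergence check

    def record(self, converged: bool) -> None:
        if converged:
            self.inliers += 1
        else:
            self.outliers += 1

    def should_cull(self, min_observations: int = 5,
                    max_outlier_fraction: float = 0.5) -> bool:
        """Cull only once enough evidence has accumulated."""
        total = self.inliers + self.outliers
        if total < min_observations:
            return False
        return self.outliers / total > max_outlier_fraction


stats = ObjectTrackStats()
for converged in (False, False, True, False, False):
    stats.record(converged)
cull = stats.should_cull()
```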

76 Example graph illustrating the pose of the moving camera over four time steps $T_{wi}$ (red) as well as the poses of three static objects in the world $T_{wo_j}$ (blue). [sent-176, score-0.144]

77 We formulate the problem of estimating the historical poses of the depth camera $T_{wi}$ at time i and the poses of the static objects $T_{wo_j}$ as graph optimisation. [sent-179, score-0.431]

78 $Z_{i,o_j}$ denotes the 6DoF measurement of object j in frame i and $\Sigma_{i,o_j}^{-1}$ its inverse measurement covariance, which can be estimated using the approximated Hessian $\Sigma_{i,o_j}^{-1} = J \ldots$ [sent-180, score-0.174]

79 $Z_{i,i+1}$ is the relative ICP constraint between cameras i and i+1, with $\Sigma_{i,i+1}^{-1}$ the corresponding inverse covariance. [sent-182, score-0.118]
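
For orientation, a pose-graph objective over these quantities is commonly written in the following form. This is a sketch under assumptions rather than a quotation: it assumes each measurement expresses the second pose in the frame of the first, uses the SE(3) logarithm to turn the pose error into a 6-vector, and weights it with the Mahalanobis distance.

```latex
E = \sum_{Z_{i,o_j}} \left\| \log\!\left( Z_{i,o_j}^{-1}\, T_{wi}^{-1}\, T_{wo_j} \right) \right\|_{\Sigma_{i,o_j}}
  + \sum_{Z_{i,i+1}} \left\| \log\!\left( Z_{i,i+1}^{-1}\, T_{wi}^{-1}\, T_{w,i+1} \right) \right\|_{\Sigma_{i,i+1}},
\qquad
\| e \|_{\Sigma} := e^{\top} \Sigma^{-1} e
```

Each term vanishes when the measured relative pose agrees with the relative pose implied by the two node estimates, so minimising E pulls camera and object poses towards mutual consistency.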

80 Including Structural Priors: Additional information can be incorporated in the graph in order to improve the robustness and accuracy of the optimisation problem. [sent-198, score-0.192]

81 The world reference frame w is defined such that the x- and z-axes lie within the ground plane, with the y-axis perpendicular to it. [sent-200, score-0.116]

82 The ground plane is implicitly detected from the initial observation $Z_{1,o_1}$ of the first object; its pose $T_{wf}$ remains fixed during optimisation. [sent-201, score-0.137]

83 Other Priors: The ground plane constraint can have value beyond the pose graph. [sent-212, score-0.101]

84 While this is not yet implemented we at least cull hypotheses of object positions far from the ground plane. [sent-215, score-0.097]
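
A sketch of such a culling test, assuming the world frame convention given earlier (ground plane spanned by x and z, so the y coordinate is the height) and assuming each object's coordinate origin was marked up on its base. The function names and the `max_height` threshold are hypothetical.

```python
def far_from_ground(T_wo, max_height=0.1):
    """True if the object origin lies more than max_height (metres)
    from the ground plane y = 0 of the world frame."""
    height = T_wo[1][3]  # y component of the translation column
    return abs(height) > max_height


def cull_hypotheses(hypotheses, max_height=0.1):
    """Keep only the object pose hypotheses close to the ground plane."""
    return [T for T in hypotheses if not far_from_ground(T, max_height)]


def pose_at(x, y, z):
    """Helper: identity rotation with translation (x, y, z)."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


candidates = [pose_at(1.0, 0.02, 2.0), pose_at(0.5, 0.8, 1.0)]
kept = cull_hypotheses(candidates)
```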

85 Relocalisation: When camera tracking is lost the system enters a relocalisation mode. [sent-218, score-0.339]

86 Here a new local graph is created and tracked from, and when it contains at least 3 objects it is matched against the previously tracked long-term graph (see Figure 6). [sent-219, score-0.351]

87 The matched vertex with highest vote in the long-term graph is used instead of the currently observed vertex in the local graph and camera tracking is resumed from it, discarding the local map. [sent-223, score-0.742]

88 When tracking is lost a local graph (blue) is created and matched against a long-term graph (red). [sent-226, score-0.214]

89 (top) Scene with objects and camera frustum when tracking is resumed a few frames after relocalisation. [sent-227, score-0.327]

90 Results: The in-the-loop operation of our system is more effectively demonstrated in our submitted video than on paper, where the advantages of our method over off-line scene labelling may not be immediately obvious. [sent-231, score-0.2]

91 Loop Closure: Loop closure in SLAM occurs when a location is revisited after a period of neglect, and the arising drift corrected. [sent-235, score-0.108]

92 Larger loop closures (see Figure 7), where the drift is too much to enable matching via predictive ICP, are detected using a module based on matching fragments within the main long-term graph in the same manner as in relocalisation (Section 3. [sent-237, score-0.478]

93 The real-time process lasted around 10 minutes, including various loop closures and relocalisations due to lost tracking. [sent-242, score-0.145]

94 We apply this to command virtual characters to navigate the scene and find places to sit as soon as the system is started (without the need to scan a whole room). [sent-253, score-0.102]

95 System statistics: We present system settings for mapping the room in Figure 7 (10 × 6 × 3 m) using a gaming laptop. [sent-258, score-0.117]

96 In this paper we have shown that using high performance 3D object recognition in the loop permits a new approach to real-time SLAM with large advantages in terms of efficient and semantic scene description. [sent-261, score-0.305]

97 In particular we demonstrate how the tight interaction of recognition, mapping and tracking elements is mutually beneficial to all. [sent-262, score-0.167]

98 (middle) Imposing the new correspondences and re-optimising the graph closes the loop and yields a more metric map. [sent-266, score-0.231]

99 (right) For visualisation purposes only (since raw scans are not normally saved in SLAM++), we show a coloured point cloud after loop closure. [sent-267, score-0.145]

100 Combining object recognition and SLAM for extended map representations. [sent-371, score-0.099]

