iccv iccv2013 iccv2013-152 knowledge-graph by maker-knowledge-mining

152 iccv-2013-Extrinsic Camera Calibration without a Direct View Using Spherical Mirror


Source: pdf

Author: Amit Agrawal

Abstract: We consider the problem of estimating the extrinsic parameters (pose) of a camera with respect to a reference 3D object without a direct view. Since the camera does not view the object directly, previous approaches have utilized reflections in a planar mirror to solve this problem. However, a planar mirror based approach requires a minimum of three reflections and has degenerate configurations where estimation fails. In this paper, we show that the pose can be obtained using a single reflection in a spherical mirror of known radius. This makes our approach simpler and easier in practice. In addition, unlike planar mirrors, the spherical mirror based approach does not have any degenerate configurations, leading to a robust algorithm. While a planar mirror reflection results in a virtual perspective camera, a spherical mirror reflection results in a non-perspective axial camera. The axial nature of rays allows us to compute the axis (direction of sphere center) and a few pose parameters in a linear fashion. We then derive an analytical solution to obtain the distance to the sphere center and remaining pose parameters and show that it corresponds to solving a 16th degree equation. We present comparisons with a recent method that uses planar mirrors and show that our approach recovers more accurate pose in the presence of noise. Extensive simulations and results on real data validate our algorithm.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Since the camera does not view the object directly, previous approaches have utilized reflections in a planar mirror to solve this problem. [sent-2, score-1.096]

2 However, a planar mirror based approach requires a minimum of three reflections and has degenerate configurations where estimation fails. [sent-3, score-1.112]

3 In this paper, we show that the pose can be obtained using a single reflection in a spherical mirror of known radius. [sent-4, score-1.254]

4 In addition, unlike planar mirrors, the spherical mirror based approach does not have any degenerate configurations, leading to a robust algorithm. [sent-6, score-1.347]

5 While a planar mirror reflection results in a virtual perspective camera, a spherical mirror reflection results in a non-perspective axial camera. [sent-7, score-2.264]

6 The axial nature of rays allows us to compute the axis (direction of sphere center) and a few pose parameters in a linear fashion. [sent-8, score-0.463]

7 We then derive an analytical solution to obtain the distance to the sphere center and remaining pose parameters and show that it corresponds to solving a 16th degree equation. [sent-9, score-0.359]

8 We present comparisons with a recent method that uses planar mirrors and show that our approach recovers more accurate pose in the presence of noise. [sent-10, score-0.524]

9 Introduction: The external calibration of a camera with a reference 3D object is a fundamental pose estimation procedure in computer vision. [sent-13, score-0.37]

10 If 2D projections of three 3D points are visible, the extrinsic parameters (or pose), namely rotation and translation, can be obtained from the three 2D-3D correspondences. [sent-14, score-0.426]
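For reference, the standard direct-view case described in the sentence above can be sketched with OpenCV's generic PnP solver; the grid geometry, intrinsics and ground-truth pose below are illustrative placeholders rather than values from the paper, and the generic solver requires at least four correspondences instead of the minimal three.

    # Minimal sketch of direct-view pose estimation from 2D-3D correspondences.
    # All numeric values are illustrative assumptions.
    import numpy as np
    import cv2

    # A small planar grid of reference 3D points (checkerboard-like), in mm.
    obj = np.array([[x * 30.0, y * 30.0, 0.0] for y in range(2) for x in range(3)],
                   dtype=np.float64)

    K = np.array([[800.0, 0.0, 320.0],       # assumed camera intrinsics
                  [0.0, 800.0, 240.0],
                  [0.0, 0.0, 1.0]])
    rvec_gt = np.array([0.10, -0.20, 0.05])  # assumed ground-truth rotation (Rodrigues)
    tvec_gt = np.array([10.0, -20.0, 500.0]) # assumed ground-truth translation

    # Synthesize the 2D projections, then recover the pose from them.
    img_pts, _ = cv2.projectPoints(obj, rvec_gt, tvec_gt, K, None)
    ok, rvec, tvec = cv2.solvePnP(obj, img_pts, K, None)
    R, _ = cv2.Rodrigues(rvec)
    print(ok, np.round(tvec.ravel(), 3))     # should recover tvec_gt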

11 Previous approaches to solve this problem have used planar mirrors to observe the reflections of the reference object. [sent-18, score-0.583]

12 The extrinsic parameters along with the location and orientation of planar mirrors can be obtained using the reflections. [sent-19, score-0.658]

13 As shown in Sturm and Bonfort [28], at least three reflections of the reference object are required to uniquely determine the extrinsic parameters using a planar mirror. [sent-21, score-0.554]

14 The singular configuration occurs if all the mirror planes intersect in a single line. [sent-24, score-0.725]

15 This happens if the mirror is rotated around a fixed axis, or if all mirror planes are parallel (intersection line is at infinity) [27]. [sent-25, score-1.362]
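As an aside, the singular planar-mirror configuration just described (all mirror planes meeting in a common line, including the parallel case where the line is at infinity) can be flagged numerically; the sketch below is our own illustration and not a procedure from the paper, and it assumes each mirror plane is given as n . x = d and stacked in homogeneous form [n, -d].

    # Hedged sketch: three or more planes share a common line (possibly at
    # infinity) exactly when the stacked homogeneous plane coefficients have
    # rank <= 2. This is one possible way to detect the degenerate setup.
    import numpy as np

    def planar_mirrors_degenerate(normals, offsets, tol=1e-8):
        """normals: Nx3 unit normals, offsets: length-N offsets with n . x = d."""
        M = np.hstack([np.asarray(normals, float),
                       -np.asarray(offsets, float).reshape(-1, 1)])
        s = np.linalg.svd(M, compute_uv=False)
        rank = int(np.sum(s > tol * s[0]))
        return rank <= 2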

16 If the intersection of any two mirrors lies on the plane defined by the reference 3D points, it results in a degenerate configuration. [sent-28, score-0.399]

17 Previous approaches either assume that the mirror configuration is non-singular or use heuristics to detect the presence of such configurations. [sent-30, score-0.724]

18 We show that the mirror-based extrinsic calibration can be done using a single reflection from a spherical mirror with known radius. [sent-34, score-1.491]

19 The location of the spherical mirror is not required to be known and is estimated simultaneously with the extrinsic parameters. [Table 1 header fragment: Method | Mirror Shape | Mirrors | 3D Points | Total 2D Projections | Degenerate Mirror Configurations] [sent-35, score-1.27]

20 Unlike previous approaches, our approach does not have any degenerate mirror configurations. [sent-39, score-0.768]

21 In addition, we show in Section 6 that it outperforms the planar mirror based method [31]. [sent-40, score-0.899]

22 Our approach does not require the boundary of the sphere to be visible in the image, which is commonly used to estimate the sphere center [16, 6, 7], and only uses the projection of reference 3D points. [sent-42, score-0.423]

23 Moreover, no degenerate mirror configurations exist when using a spherical mirror, since we require only one view. [sent-44, score-1.158]

24 In addition, we show that our approach outperforms the planar mirror based approach [31] and recovers more accurate pose in the presence of noise. [sent-45, score-0.524]

25 Contributions: • We demonstrate that mirror based extrinsic calibration can be done from a single reflection using a non-planar spherical mirror. [sent-48, score-1.472]

26 We derive analytical solutions for estimating the spherical mirror location and pose of the reference object using a single reflection. [sent-49, score-1.326]

27 Related Work. Mirror based extrinsic calibration: Previous approaches either attach markers to planar mirrors to estimate the mirror poses [15, 20] or estimate the mirror poses along with the extrinsic calibration [28, 27, 18, 14, 31]. [sent-51, score-2.358]

28 As discussed, a minimum of three reflections is required using a planar mirror and degenerate configurations exist. [sent-52, score-1.112]

29 Catadioptric Cameras: Both planar and spherical mirrors have been used with a perspective camera as a catadioptric imaging system. [sent-54, score-1.043]

30 This is referred to as planar catadioptric stereo [23, 10, 32, 9]. [sent-56, score-0.379]

31 [26] used three planar mirrors to obtain hundreds of views of an object for visual hull reconstruction. [sent-59, score-0.44]

32 Spherical Mirrors: Spherical mirror based catadioptric cameras have also been used for wide-angle 3D reconstruction [22, 5, 3, 16, 19], for obtaining wide-angle light fields [29] and navigation [17]. [sent-60, score-0.868]

33 These approaches typically attach multiple spherical mirrors on a planar surface along with markers [5, 19] and use the markers to estimate the initial location of spheres or utilize the sphere boundary (contour) in the image. [sent-61, score-1.1]

34 [6] require several images of a moving sphere along with the sphere boundary to be visible in all images for extrinsic calibration. [sent-63, score-0.514]

35 [1], where a catadioptric system based on multiple spherical mirrors is calibrated using a planar checkerboard. [sent-68, score-0.943]

36 Thus, calibration of such systems is similar to our problem of extrinsic mirror based calibration. [sent-70, score-1.002]

37 The approach in [1] utilizes the fact that by using rays from two or more spherical mirrors, the pose can be obtained using a linear algorithm. [sent-71, score-0.485]

38 Thus, if we use two spherical mirrors (in a single view) or take two views of a spherical mirror, we can apply the algorithm of [1] to perform extrinsic calibration. [sent-73, score-1.118]

39 However, in this paper our goal is to achieve it using a single view of a single spherical mirror. [sent-74, score-0.387]

40 For a spherical mirror, using two views results in a degenerate configuration when the sphere location is the same in both views. [sent-75, score-0.682]

41 Thus, to completely avoid degenerate mirror configurations, we need a solution involving a single view/mirror. [sent-77, score-0.792]

42 In addition, [1] clearly states that they were unable to find an analytical solution for calibration using a single spherical mirror, which is our key contribution. [sent-78, score-0.605]

43 Our goal is to estimate the unknown pose (rotation R and translation t) of these points in the camera coordinate system. [sent-84, score-0.35]

44 To perform the calibration, we place a spherical mirror of radius r at an unknown location C in front of the camera and observe the reflection of the 3D points. [sent-85, score-1.361]

45 The goal is to then recover the desired pose [R, t] along with the unknown sphere location C using K 3D-2D correspondences. [sent-88, score-0.332]

46 Axial Geometry: It is well-known that a perspective camera looking into a spherical mirror corresponds to an axial imaging system [3]. [sent-89, score-1.241]

47 In addition, the coplanarity constraints can also recover the rotation R and translation orthogonal to the axis (two degrees of translation). [sent-96, score-0.361]
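A sketch of how such coplanarity constraints could be stacked is given below; it assumes the axis direction is already known and it ignores both the orthonormality of R and the unobservable translation component along the axis, so it only illustrates the linear structure of the constraints rather than the paper's exact algorithm.

    # Each correspondence gives one coplanarity equation (v_i x a) . (R X_i + t) = 0,
    # which is linear in the entries of R and t. Since a . (v_i x a) = 0, the
    # component of t along the axis a drops out, which is why it must be
    # recovered separately.
    import numpy as np

    def coplanarity_rows(V, axis, X):
        """V: Nx3 camera ray directions, axis: 3-vector towards the sphere centre,
        X: Nx3 reference 3D points. Unknown vector is [R.flatten(), t], R row-major."""
        rows = []
        for v, x in zip(np.asarray(V, float), np.asarray(X, float)):
            n = np.cross(v, axis)                      # normal of the reflection plane
            rows.append(np.concatenate([np.kron(n, x), n]))
        return np.asarray(rows)

    # The (approximate) null vector of the stacked system gives [vec(R); t] up to scale:
    # _, _, Vt = np.linalg.svd(coplanarity_rows(V, axis, X)); sol = Vt[-1]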

48 In the next section, we derive an analytical solution for the remaining parameters, namely the distance d to the sphere center and the translation along the axis, which is a key contribution of this paper. [sent-99, score-0.417]

49 Let π be the plane of reflection given by the axis and a camera ray v(i). [sent-100, score-0.372]

50 To summarize, for planar reference 3D points, we obtain a solution for the axis and four solutions for the rotation R and vector s. [sent-152, score-0.521]

51 Analytical solution for d and tA: In this section, we describe an analytical solution to obtain the sphere distance d and the translation along the axis tA, assuming known axis, rotation R and vector s. [sent-157, score-0.629]

52 We again emphasise that in [1], the authors state that finding an analytical solution for a single mirror is extremely difficult and they were unable to find such a solution. [sent-158, score-0.778]

53 In contrast, since we assume a single view (single spherical mirror), the linear pose estimation algorithm of [1] cannot be applied. [sent-160, score-0.471]

54 Since the translation along the axis is unknown, the reflected camera ray should pass through u1 = [ux, uy + α]. [sent-169, score-0.461]

55 The spherical mirror is represented as a circle on π with center at C = [0, d] and radius r. [sent-172, score-1.132]

56 Let M = kw be the common point on the mirror and the camera ray for some scalar k. [sent-173, score-0.813]
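Writing this intersection out explicitly (a sketch that assumes the camera centre sits at the origin of π, which is consistent with C = [0, d] above, although the paper's exact parameterization may differ):

\[ \|k\,\mathbf{w} - \mathbf{C}\|^2 = r^2 \;\Longrightarrow\; k^2\,(w_x^2 + w_y^2) - 2\,k\,d\,w_y + d^2 - r^2 = 0, \]
\[ k = \frac{d\,w_y - \sqrt{d^2 w_y^2 - (w_x^2 + w_y^2)(d^2 - r^2)}}{w_x^2 + w_y^2}, \]

taking the smaller root as the first intersection of the camera ray with the mirror. The unit normal there is n = (M - C)/r, and the law of reflection gives the reflected direction w_r = w - 2(w . n)n, which must pass through u1 = [ux, uy + α]; eliminating the auxiliary unknowns from these constraints is what leads to the 16th degree polynomial stated in the paper.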

57 Non-linear refinement: Once the initial estimate of pose and sphere location is obtained, the parameters can be refined by minimizing the image re-projection error for all K points. [sent-195, score-0.374]

58 For a spherical mirror, the image projection of a 3D point can be obtained by solving a 4th degree equation [3]. [sent-196, score-0.411]
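A hedged sketch of this refinement step is given below; project_via_sphere is a placeholder for the forward projection through the mirror (which, per [3], requires solving a 4th degree equation per point) and is not implemented here, and the 9-parameter packing (Rodrigues rotation, translation, sphere centre) is our own choice of parameterization rather than the paper's.

    # Jointly refine the pose and the sphere centre by minimizing the image
    # re-projection error over all observed points.
    import numpy as np
    from scipy.optimize import least_squares

    def residuals(params, X, u_obs, K_intr, r, project_via_sphere):
        rvec, t, C = params[:3], params[3:6], params[6:9]
        u_pred = project_via_sphere(X, rvec, t, C, r, K_intr)  # N x 2 predicted pixels
        return (u_pred - u_obs).ravel()

    def refine(params0, X, u_obs, K_intr, r, project_via_sphere):
        # params0: initial [rvec, t, C] from the linear and analytical steps above.
        res = least_squares(residuals, params0,
                            args=(X, u_obs, K_intr, r, project_via_sphere))
        return res.x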

59 For comparison with the planar mirror based approach, we also implemented the corresponding non-linear refinement procedure, which involves 6 + 3 × 3 = 15 parameters (6 for the pose and 3 for each mirror position). [sent-209, score-1.701]

60 Simulations: In this section, we present simulations of our approach along with comparisons with the planar mirror based approach [31]. [sent-213, score-0.966]

61 A planar checkerboard with 9 × 6 squares, each of size 30 mm, is used as a reference object. [sent-215, score-0.489]

62 The ground truth rotation angles and translation of checkerboard in the camera coordinate system are θ = [2. [sent-216, score-0.446]

63 Translation error is computed as the norm of the translation error vector, normalized by the true translation magnitude. [sent-229, score-0.322]
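For completeness, the error metric just described amounts to the following trivial sketch; the rotation-error convention used in the plots is not spelled out in this summary, so only the translation metric is shown.

    import numpy as np

    def translation_error(t_est, t_true):
        # Norm of the translation error vector, normalized by the true magnitude.
        t_est, t_true = np.asarray(t_est, float), np.asarray(t_true, float)
        return np.linalg.norm(t_est - t_true) / np.linalg.norm(t_true)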

64 Comparison with [31]: For comparison with [31], we observe the reflection of checkerboard corners using three planar mirrors. [sent-232, score-0.515]

65 For fair comparison, we made sure that the ground truth 2D projections of checkerboard corners using both spherical and planar mirrors occupy a similar pixel area in the image. [sent-238, score-0.998]

66 Figure 2 shows the 2D projections using the spherical mirror (black) and using planar mirrors (red, green and blue). [sent-239, score-1.492]

67 Notice that the spherical mirror projections are in fact confined to a smaller pixel area. [sent-240, score-1.091]

68 Again for fair comparisons, we use the same 8 3D points for the planar mirror based approach in each trial. [sent-242, score-0.925]

69 In addition, we also use all 8 correspondences for computing the mirror orientations and the checkerboard pose in the least-squares framework of [31]. [sent-246, score-0.943]

70 Thus, these simulations clearly demonstrate that our approach outperforms the planar mirror based approach. [sent-260, score-0.938]

71 Analysis. Mirror Size: Intuitively, calibration using a spherical mirror is possible using only a single view, since a spherical mirror results in a non-central (non-perspective) camera. [sent-262, score-2.257]

72 Thus, it is expected that a larger mirror (with larger radius) will perform better than a smaller mirror. [sent-264, score-0.681]

73 In practice, the deviations of camera rays from a perspective (central) model ... [footnote fragment: "Better than using 3 points since we assume no ..."; figure caption fragment: "(Middle and Right) Error in rotation and translation with noise for initial estimates."] [sent-265, score-0.386]

74 We analyze the effect of mirror size on calibration accuracy using the same simulation scenario as in Section 6. [sent-271, score-0.828]

75 Figures 4 and 5 show the errors in translation and rotation for different mirror radii but for the same location of spherical mirror. [sent-272, score-1.295]

76 Notice that a larger mirror size provides smaller errors, both for the initial estimates and for the estimates after non-linear refinement. [sent-273, score-0.795]

77 Mirror radius error: Since our algorithm assumes that the mirror radius is known, we evaluate performance using an incorrect value of radius. [sent-274, score-0.846]

78 As the size of the spherical mirror increases, pose estimation performance with noise improves. [sent-298, score-1.126]

79 Plots show errors for initial estimates using mirrors of different radii at the same location. [sent-299, score-0.747]

80 Pose estimation errors with varying mirror size after nonlinear refinement of initial estimates. [sent-320, score-0.737]

81 Pose estimation errors with 5% error in true mirror radius (r = 25. [sent-323, score-0.796]

82 ... using a spherical ball of diameter 3 inches and three photos using a planar mirror (Figure 7). [sent-327, score-1.299]

83 For the spherical mirror, 48 corners were detected, whereas for the planar mirror, 30 corners were detected in each photo. [sent-331, score-0.643]

84 Note that the 2D projections are confined to a significantly smaller pixel area for the spherical mirror. [sent-333, score-0.41]

85 Figure 8 shows the detected checkerboard points (red) along with the re-projected checkerboard points (green) using the estimated calibration parameters for our approach, which overlap nicely. [sent-335, score-0.539]

86 Discussions and Conclusions: To the best of our knowledge, we have presented the first approach for mirror-based extrinsic calibration using a single reflection from a non-planar spherical mirror. [sent-337, score-0.791]

87 The spherical mirror provides a win-win situation: (a) only a single view is required, (b) there exist no degenerate mirror configurations, and (c) it provides better performance than the planar mirror based approach in the presence of noise. [sent-339, score-2.754]

88 approach with varying mirror size and assuming error in known mirror radius. [sent-347, score-1.424]

89 Since spherical mirrors are easy to manufacture and are low cost, we believe that our algorithm will be widely adopted. [sent-348, score-0.564]

90 Our algorithm equivalently provides a method for calibrating spherical mirror based catadioptric systems using a single photo of a checkerboard. [sent-349, score-1.198]

91 Techniques for display-camera calibration [25] often utilize reflections from an eye by modeling them as spherical mirror reflections. [sent-350, score-1.286]

92 We believe that our approach will be useful in several vision applications such as specular surface reconstruction, robot navigation, catadioptric imaging, display-camera calibration and multi-camera calibration without overlapping field of view. [sent-351, score-0.453]

93 A moving planar mirror based approach for cultural reconstruction: Research articles. [sent-465, score-0.218]

94 3D scene reconstruction from reflection images in a spherical mirror. [sent-474, score-0.47]

95 Calibration and performance evaluation of omnidirectional sensor with compound spherical mirrors. [sent-482, score-0.361]

96 Flexible extrinsic calibration of nonoverlapping cameras using a planar mirror: Application to vision-based robotics. [sent-507, score-0.562]

97 Camera pose estimation using images of planar mirror reflections. [sent-559, score-0.983]

98 Axial-cones: Modeling spherical catadioptric cameras for wide-angle light field rendering. [sent-574, score-0.521]

99 A new mirror-based extrinsic camera calibration using an orthogonality constraint. [sent-592, score-0.395]

100 Geometric properties of multiple reflections in catadioptric camera with two planar mirrors. [sent-600, score-0.526]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mirror', 0.681), ('spherical', 0.361), ('planar', 0.218), ('mirrors', 0.203), ('extrinsic', 0.174), ('checkerboard', 0.156), ('sphere', 0.156), ('calibration', 0.147), ('catadioptric', 0.137), ('translation', 0.118), ('reflection', 0.109), ('axis', 0.108), ('reflections', 0.097), ('degenerate', 0.087), ('pose', 0.084), ('rotation', 0.079), ('axial', 0.075), ('camera', 0.074), ('analytical', 0.073), ('radius', 0.072), ('reference', 0.065), ('agrawal', 0.06), ('ray', 0.058), ('mm', 0.05), ('june', 0.048), ('uy', 0.046), ('error', 0.043), ('francken', 0.043), ('ta', 0.042), ('ux', 0.04), ('rays', 0.04), ('fov', 0.04), ('simulations', 0.039), ('coplanarity', 0.038), ('refinement', 0.037), ('taguchi', 0.036), ('location', 0.035), ('pages', 0.034), ('corners', 0.032), ('nitschkea', 0.032), ('takahashi', 0.032), ('rp', 0.031), ('perspective', 0.03), ('unknown', 0.029), ('configurations', 0.029), ('reflected', 0.029), ('cop', 0.029), ('hermans', 0.029), ('nobuhara', 0.029), ('reshetouski', 0.029), ('snell', 0.029), ('projections', 0.029), ('along', 0.028), ('br', 0.028), ('markers', 0.028), ('projection', 0.028), ('solutions', 0.027), ('canadian', 0.027), ('notice', 0.027), ('navigation', 0.027), ('estimates', 0.026), ('view', 0.026), ('wx', 0.026), ('points', 0.026), ('ik', 0.025), ('solution', 0.024), ('stereo', 0.024), ('calibrated', 0.024), ('attach', 0.024), ('configuration', 0.024), ('cameras', 0.023), ('plane', 0.023), ('correspondences', 0.022), ('int', 0.022), ('ramalingam', 0.022), ('robot', 0.022), ('degree', 0.022), ('radii', 0.021), ('lie', 0.021), ('incorrect', 0.021), ('wy', 0.021), ('determinant', 0.021), ('singular', 0.02), ('photos', 0.02), ('imaging', 0.02), ('confined', 0.02), ('coordinate', 0.019), ('views', 0.019), ('initial', 0.019), ('diameter', 0.019), ('rms', 0.019), ('presence', 0.019), ('known', 0.019), ('calibrating', 0.019), ('orthogonal', 0.018), ('geiger', 0.018), ('sturm', 0.018), ('let', 0.018), ('reprojection', 0.018), ('center', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 152 iccv-2013-Extrinsic Camera Calibration without a Direct View Using Spherical Mirror

Author: Amit Agrawal

Abstract: We consider the problem of estimating the extrinsic parameters (pose) of a camera with respect to a reference 3D object without a direct view. Since the camera does not view the object directly, previous approaches have utilized reflections in a planar mirror to solve this problem. However, a planar mirror based approach requires a minimum of three reflections and has degenerate configurations where estimation fails. In this paper, we show that the pose can be obtained using a single reflection in a spherical mirror of known radius. This makes our approach simpler and easier in practice. In addition, unlike planar mirrors, the spherical mirror based approach does not have any degenerate configurations, leading to a robust algorithm. While a planar mirror reflection results in a virtual perspective camera, a spherical mirror reflection results in a non-perspective axial camera. The axial nature of rays allows us to compute the axis (direction of sphere center) and a few pose parameters in a linear fashion. We then derive an analytical solution to obtain the distance to the sphere center and remaining pose parameters and show that it corresponds to solving a 16th degree equation. We present comparisons with a recent method that uses planar mirrors and show that our approach recovers more accurate pose in the presence of noise. Extensive simulations and results on real data validate our algorithm.

2 0.24613944 280 iccv-2013-Multi-view 3D Reconstruction from Uncalibrated Radially-Symmetric Cameras

Author: Jae-Hak Kim, Yuchao Dai, Hongdong Li, Xin Du, Jonghyuk Kim

Abstract: We present a new multi-view 3D Euclidean reconstruction method for arbitrary uncalibrated radially-symmetric cameras, which needs no calibration or any camera model parameters other than radial symmetry. It is built on the radial 1D camera model [25], a unified mathematical abstraction to different types of radially-symmetric cameras. We formulate the problem of multi-view reconstruction for radial 1D cameras as a matrix rank minimization problem. Efficient implementation based on alternating direction continuation is proposed to handle scalability issue for real-world applications. Our method applies to a wide range of omnidirectional cameras including both dioptric and catadioptric (central and non-central) cameras. Additionally, our method deals with complete and incomplete measurements under a unified framework elegantly. Experiments on both synthetic and real images from various types of cameras validate the superior performance of our new method, in terms of numerical accuracy and robustness.

3 0.16577877 343 iccv-2013-Real-World Normal Map Capture for Nearly Flat Reflective Surfaces

Author: Bastien Jacquet, Christian Häne, Kevin Köser, Marc Pollefeys

Abstract: Although specular objects have gained interest in recent years, virtually no approaches exist for markerless reconstruction of reflective scenes in the wild. In this work, we present a practical approach to capturing normal maps in real-world scenes using video only. We focus on nearly planar surfaces such as windows, facades from glass or metal, or frames, screens and other indoor objects and show how normal maps of these can be obtained without the use of an artificial calibration object. Rather, we track the reflections of real-world straight lines, while moving with a hand-held or vehicle-mounted camera in front of the object. In contrast to error-prone local edge tracking, we obtain the reflections by a robust, global segmentation technique of an ortho-rectified 3D video cube that also naturally allows efficient user interaction. Then, at each point of the reflective surface, the resulting 2D-curve to 3D-line correspondence provides a novel quadratic constraint on the local surface normal. This allows to globally solve for the shape by integrability and smoothness constraints and easily supports the usage of multiple lines. We demonstrate the technique on several objects and facades.

4 0.10594846 436 iccv-2013-Unsupervised Intrinsic Calibration from a Single Frame Using a "Plumb-Line" Approach

Author: R. Melo, M. Antunes, J.P. Barreto, G. Falcão, N. Gonçalves

Abstract: Estimating the amount and center of distortion from lines in the scene has been addressed in the literature by the so-called “plumb-line” approach. In this paper we propose a new geometric method to estimate not only the distortion parameters but the entire camera calibration (up to an “angular” scale factor) using a minimum of 3 lines. We propose a new framework for the unsupervised simultaneous detection of natural image of lines and camera parameters estimation, enabling a robust calibration from a single image. Comparative experiments with existing automatic approaches for the distortion estimation and with ground truth data are presented.

5 0.10534363 115 iccv-2013-Direct Optimization of Frame-to-Frame Rotation

Author: Laurent Kneip, Simon Lynen

Abstract: This work makes use of a novel, recently proposed epipolar constraint for computing the relative pose between two calibrated images. By enforcing the coplanarity of epipolar plane normal vectors, it constrains the three degrees of freedom of the relative rotation between two camera views directly—independently of the translation. The present paper shows how the approach can be extended to n points, and translated into an efficient eigenvalue minimization over the three rotational degrees of freedom. Each iteration in the non-linear optimization has constant execution time, independently of the number of features. Two global optimization approaches are proposed. The first one consists of an efficient Levenberg-Marquardt scheme with randomized initial value, which already leads to stable and accurate results. The second scheme consists of a globally optimal branch-and-bound algorithm based on a bound on the eigenvalue variation derived from symmetric eigenvalue-perturbation theory. Analysis of the cost function reveals insights into the nature of a specific relative pose problem, and outlines the complexity under different conditions. The algorithm shows state-of-the-art performance w.r.t. essential-matrix based solutions, and a frame-to-frame application to a video sequence immediately leads to an alternative, real-time visual odometry solution. Note: All algorithms in this paper are made available in the OpenGV library. Please visit http://laurentkneip.github.io/opengv

6 0.096618034 151 iccv-2013-Exploiting Reflection Change for Automatic Reflection Removal

7 0.095601678 281 iccv-2013-Multi-view Normal Field Integration for 3D Reconstruction of Mirroring Objects

8 0.091017894 346 iccv-2013-Rectangling Stereographic Projection for Wide-Angle Image Visualization

9 0.087028965 342 iccv-2013-Real-Time Solution to the Absolute Pose Problem with Unknown Radial Distortion and Focal Length

10 0.082932025 17 iccv-2013-A Global Linear Method for Camera Pose Registration

11 0.080058955 323 iccv-2013-Pose Estimation with Unknown Focal Length Using Points, Directions and Lines

12 0.079353221 184 iccv-2013-Global Fusion of Relative Motions for Robust, Accurate and Scalable Structure from Motion

13 0.078621298 353 iccv-2013-Revisiting the PnP Problem: A Fast, General and Optimal Solution

14 0.076429389 348 iccv-2013-Refractive Structure-from-Motion on Underwater Images

15 0.067521445 444 iccv-2013-Viewing Real-World Faces in 3D

16 0.064067453 291 iccv-2013-No Matter Where You Are: Flexible Graph-Guided Multi-task Learning for Multi-view Head Pose Classification under Target Motion

17 0.062521361 56 iccv-2013-Automatic Registration of RGB-D Scans via Salient Directions

18 0.062187172 317 iccv-2013-Piecewise Rigid Scene Flow

19 0.061226577 27 iccv-2013-A Robust Analytical Solution to Isometric Shape-from-Template with Focal Length Calibration

20 0.05793887 100 iccv-2013-Curvature-Aware Regularization on Riemannian Submanifolds


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.114), (1, -0.134), (2, -0.046), (3, 0.018), (4, -0.023), (5, 0.016), (6, 0.036), (7, -0.107), (8, 0.026), (9, 0.028), (10, 0.037), (11, -0.024), (12, -0.118), (13, -0.015), (14, -0.011), (15, 0.053), (16, 0.048), (17, 0.095), (18, -0.025), (19, 0.027), (20, 0.035), (21, -0.077), (22, -0.005), (23, -0.031), (24, -0.067), (25, 0.066), (26, 0.01), (27, -0.055), (28, -0.055), (29, -0.029), (30, -0.025), (31, 0.079), (32, -0.02), (33, -0.089), (34, -0.022), (35, -0.014), (36, -0.044), (37, -0.038), (38, -0.07), (39, -0.022), (40, 0.053), (41, -0.125), (42, 0.014), (43, -0.009), (44, 0.026), (45, 0.019), (46, -0.012), (47, 0.022), (48, 0.097), (49, -0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93450546 152 iccv-2013-Extrinsic Camera Calibration without a Direct View Using Spherical Mirror

Author: Amit Agrawal

Abstract: We consider the problem of estimating the extrinsic parameters (pose) of a camera with respect to a reference 3D object without a direct view. Since the camera does not view the object directly, previous approaches have utilized reflections in a planar mirror to solve this problem. However, a planar mirror based approach requires a minimum of three reflections and has degenerate configurations where estimation fails. In this paper, we show that the pose can be obtained using a single reflection in a spherical mirror of known radius. This makes our approach simpler and easier in practice. In addition, unlike planar mirrors, the spherical mirror based approach does not have any degenerate configurations, leading to a robust algorithm. While a planar mirror reflection results in a virtual perspective camera, a spherical mirror reflection results in a non-perspective axial camera. The axial nature of rays allows us to compute the axis (direction of sphere center) and a few pose parameters in a linear fashion. We then derive an analytical solution to obtain the distance to the sphere center and remaining pose parameters and show that it corresponds to solving a 16th degree equation. We present comparisons with a recent method that uses planar mirrors and show that our approach recovers more accurate pose in the presence of noise. Extensive simulations and results on real data validate our algorithm.

2 0.82761824 348 iccv-2013-Refractive Structure-from-Motion on Underwater Images

Author: Anne Jordt-Sedlazeck, Reinhard Koch

Abstract: In underwater environments, cameras need to be confined in an underwater housing, viewing the scene through a piece of glass. In case of flat port underwater housings, light rays entering the camera housing are refracted twice, due to different medium densities of water, glass, and air. This causes the usually linear rays of light to bend and the commonly used pinhole camera model to be invalid. When using the pinhole camera model without explicitly modeling refraction in Structure-from-Motion (SfM) methods, a systematic model error occurs. Therefore, in this paper, we propose a system for computing camera path and 3D points with explicit incorporation of refraction using new methods for pose estimation. Additionally, a new error function is introduced for non-linear optimization, especially bundle adjustment. The proposed method allows to increase reconstruction accuracy and is evaluated in a set of experiments, where the proposed method’s performance is compared to SfM with the perspective camera model.

3 0.81569928 436 iccv-2013-Unsupervised Intrinsic Calibration from a Single Frame Using a "Plumb-Line" Approach

Author: R. Melo, M. Antunes, J.P. Barreto, G. Falcão, N. Gonçalves

Abstract: Estimating the amount and center of distortion from lines in the scene has been addressed in the literature by the so-called “plumb-line” approach. In this paper we propose a new geometric method to estimate not only the distortion parameters but the entire camera calibration (up to an “angular” scale factor) using a minimum of 3 lines. We propose a new framework for the unsupervised simultaneous detection of natural image of lines and camera parameters estimation, enabling a robust calibration from a single image. Comparative experiments with existing automatic approaches for the distortion estimation and with ground truth data are presented.

4 0.7974388 280 iccv-2013-Multi-view 3D Reconstruction from Uncalibrated Radially-Symmetric Cameras

Author: Jae-Hak Kim, Yuchao Dai, Hongdong Li, Xin Du, Jonghyuk Kim

Abstract: We present a new multi-view 3D Euclidean reconstruction method for arbitrary uncalibrated radially-symmetric cameras, which needs no calibration or any camera model parameters other than radial symmetry. It is built on the radial 1D camera model [25], a unified mathematical abstraction to different types of radially-symmetric cameras. We formulate the problem of multi-view reconstruction for radial 1D cameras as a matrix rank minimization problem. Efficient implementation based on alternating direction continuation is proposed to handle scalability issue for real-world applications. Our method applies to a wide range of omnidirectional cameras including both dioptric and catadioptric (central and non-central) cameras. Additionally, our method deals with complete and incomplete measurements under a unified framework elegantly. Experiments on both synthetic and real images from various types of cameras validate the superior performance of our new method, in terms of numerical accuracy and robustness.

5 0.74806499 49 iccv-2013-An Enhanced Structure-from-Motion Paradigm Based on the Absolute Dual Quadric and Images of Circular Points

Author: Lilian Calvet, Pierre Gurdjos

Abstract: This work aims at introducing a new unified Structure-from-Motion (SfM) paradigm in which images of circular point-pairs can be combined with images of natural points. An imaged circular point-pair encodes the 2D Euclidean structure of a world plane and can easily be derived from the image of a planar shape, especially those including circles. A classical SfM method generally runs two steps: first a projective factorization of all matched image points (into projective cameras and points) and second a camera self-calibration that updates the obtained world from projective to Euclidean. This work shows how to introduce images of circular points in these two SfM steps while its key contribution is to provide the theoretical foundations for combining “classical” linear self-calibration constraints with additional ones derived from such images. We show that the two proposed SfM steps clearly contribute to better results than the classical approach. We validate our contributions on synthetic and real images.

6 0.7372613 323 iccv-2013-Pose Estimation with Unknown Focal Length Using Points, Directions and Lines

7 0.73446918 346 iccv-2013-Rectangling Stereographic Projection for Wide-Angle Image Visualization

8 0.73411888 343 iccv-2013-Real-World Normal Map Capture for Nearly Flat Reflective Surfaces

9 0.71718025 342 iccv-2013-Real-Time Solution to the Absolute Pose Problem with Unknown Radial Distortion and Focal Length

10 0.68533123 250 iccv-2013-Lifting 3D Manhattan Lines from a Single Image

11 0.62339777 27 iccv-2013-A Robust Analytical Solution to Isometric Shape-from-Template with Focal Length Calibration

12 0.61851877 353 iccv-2013-Revisiting the PnP Problem: A Fast, General and Optimal Solution

13 0.6132555 17 iccv-2013-A Global Linear Method for Camera Pose Registration

14 0.55372852 84 iccv-2013-Complex 3D General Object Reconstruction from Line Drawings

15 0.54846603 115 iccv-2013-Direct Optimization of Frame-to-Frame Rotation

16 0.53338861 90 iccv-2013-Content-Aware Rotation

17 0.52211118 281 iccv-2013-Multi-view Normal Field Integration for 3D Reconstruction of Mirroring Objects

18 0.52184778 397 iccv-2013-Space-Time Tradeoffs in Photo Sequencing

19 0.49389973 271 iccv-2013-Modeling the Calibration Pipeline of the Lytro Camera for High Quality Light-Field Image Reconstruction

20 0.4875553 141 iccv-2013-Enhanced Continuous Tabu Search for Parameter Estimation in Multiview Geometry


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.038), (7, 0.025), (26, 0.073), (31, 0.056), (35, 0.013), (38, 0.011), (42, 0.175), (64, 0.02), (73, 0.021), (89, 0.207), (94, 0.204), (98, 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89140248 222 iccv-2013-Joint Learning of Discriminative Prototypes and Large Margin Nearest Neighbor Classifiers

Author: Martin Köstinger, Paul Wohlhart, Peter M. Roth, Horst Bischof

Abstract: In this paper, we raise important issues concerning the evaluation complexity of existing Mahalanobis metric learning methods. The complexity scales linearly with the size of the dataset. This is especially cumbersome on large scale or for real-time applications with limited time budget. To alleviate this problem we propose to represent the dataset by a fixed number of discriminative prototypes. In particular, we introduce a new method that jointly chooses the positioning of prototypes and also optimizes the Mahalanobis distance metric with respect to these. We show that choosing the positioning of the prototypes and learning the metric in parallel leads to a drastically reduced evaluation effort while maintaining the discriminative essence of the original dataset. Moreover, for most problems our method performing k-nearest prototype (k-NP) classification on the condensed dataset leads to even better generalization compared to k-NN classification using all data. Results on a variety of challenging benchmarks demonstrate the power of our method. These include standard machine learning datasets as well as the challenging Public Figures Face Database. On the competitive machine learning benchmarks we are comparable to the state-of-the-art while being more efficient. On the face benchmark we clearly outperform the state-of-the-art in Mahalanobis metric learning with drastically reduced evaluation effort.

same-paper 2 0.85599464 152 iccv-2013-Extrinsic Camera Calibration without a Direct View Using Spherical Mirror

Author: Amit Agrawal

Abstract: We consider the problem of estimating the extrinsic parameters (pose) of a camera with respect to a reference 3D object without a direct view. Since the camera does not view the object directly, previous approaches have utilized reflections in a planar mirror to solve this problem. However, a planar mirror based approach requires a minimum of three reflections and has degenerate configurations where estimation fails. In this paper, we show that the pose can be obtained using a single reflection in a spherical mirror of known radius. This makes our approach simpler and easier in practice. In addition, unlike planar mirrors, the spherical mirror based approach does not have any degenerate configurations, leading to a robust algorithm. While a planar mirror reflection results in a virtual perspective camera, a spherical mirror reflection results in a non-perspective axial camera. The axial nature of rays allows us to compute the axis (direction of sphere center) and a few pose parameters in a linear fashion. We then derive an analytical solution to obtain the distance to the sphere center and remaining pose parameters and show that it corresponds to solving a 16th degree equation. We present comparisons with a recent method that uses planar mirrors and show that our approach recovers more accurate pose in the presence of noise. Extensive simulations and results on real data validate our algorithm.

3 0.84160519 189 iccv-2013-HOGgles: Visualizing Object Detection Features

Author: Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, Antonio Torralba

Abstract: We introduce algorithms to visualize feature spaces used by object detectors. The tools in this paper allow a human to put on ‘HOG goggles ’ and perceive the visual world as a HOG based object detector sees it. We found that these visualizations allow us to analyze object detection systems in new ways and gain new insight into the detector’s failures. For example, when we visualize the features for high scoring false alarms, we discovered that, although they are clearly wrong in image space, they do look deceptively similar to true positives in feature space. This result suggests that many of these false alarms are caused by our choice of feature space, and indicates that creating a better learning algorithm or building bigger datasets is unlikely to correct these errors. By visualizing feature spaces, we can gain a more intuitive understanding of our detection systems.

4 0.83576202 285 iccv-2013-NEIL: Extracting Visual Knowledge from Web Data

Author: Xinlei Chen, Abhinav Shrivastava, Abhinav Gupta

Abstract: We propose NEIL (NeverEnding Image Learner), a computer program that runs 24 hours per day and 7 days per week to automatically extract visual knowledge from Internet data. NEIL uses a semi-supervised learning algorithm that jointly discovers common sense relationships (e.g., “Corolla is a kind of/looks similar to Car”, “Wheel is a part of Car”) and labels instances of the given visual categories. It is an attempt to develop the world’s largest visual structured knowledge base with minimum human labeling effort. As of 10th October 2013, NEIL has been continuously running for 2.5 months on 200 core cluster (more than 350K CPU hours) and has an ontology of 1152 object categories, 1034 scene categories and 87 attributes. During this period, NEIL has discovered more than 1700 relationships and has labeled more than 400K visual instances. 1. Motivation Recent successes in computer vision can be primarily attributed to the ever increasing size of visual knowledge in terms of labeled instances of scenes, objects, actions, attributes, and the contextual relationships between them. But as we move forward, a key question arises: how will we gather this structured visual knowledge on a vast scale? Recent efforts such as ImageNet [8] and Visipedia [30] have tried to harness human intelligence for this task. However, we believe that these approaches lack both the richness and the scalability required for gathering massive amounts of visual knowledge. For example, at the time of submission, only 7% of the data in ImageNet had bounding boxes and the relationships were still extracted via Wordnet. In this paper, we consider an alternative approach of automatically extracting visual knowledge from Internet scale data. The feasibility of extracting knowledge automatically from images and videos will itself depend on the state-ofthe-art in computer vision. While we have witnessed significant progress on the task of detection and recognition, we still have a long way to go for automatically extracting the semantic content of a given image. So, is it really possible to use existing approaches for gathering visual knowledge directly from web data? 1.1. NEIL – Never Ending Image Learner We propose NEIL, a computer program that runs 24 hours per day, 7 days per week, forever to: (a) semantically understand images on the web; (b) use this semantic understanding to augment its knowledge base with new labeled instances and common sense relationships; (c) use this dataset and these relationships to build better classifiers and detectors which in turn help improve semantic understanding. NEIL is a constrained semi-supervised learning (SSL) system that exploits the big scale of visual data to automatically extract common sense relationships and then uses these relationships to label visual instances of existing categories. It is an attempt to develop the world’s largest visual structured knowledge base with minimum human effort one that reflects the factual content of the images on the Internet, and that would be useful to many computer vision and AI efforts. Specifically, NEIL can use web data to extract: (a) Labeled examples of object categories with bounding boxes; (b) Labeled examples of scenes; (c) Labeled examples of attributes; (d) Visual subclasses for object categories; and (e) Common sense relationships about scenes, objects and attributes like “Corolla is a kind of/looks similar to Car”, “Wheel is a part ofCar”, etc. (See Figure 1). 
We believe our approach is possible for three key reasons: (a) Macro-vision vs. Micro-vision: We use the term “micro-vision” to refer to the traditional paradigm where the input is an image and the output is some information extracted from that image. In contrast, we define “macrovision” as a paradigm where the input is a large collection of images and the desired output is extracting significant or interesting patterns in visual data (e.g., car is detected frequently in raceways). These patterns help us to extract common sense relationships. Note, the key difference is that macro-vision does not require us to understand every image in the corpora and extract all possible patterns. Instead, it relies on understanding a few images and statistically combine evidence from these to build our visual knowledge. – (b) Structure of the Visual World: Our approach exploits the structure of the visual world and builds constraints for detection and classification. These global constraints are represented in terms of common sense relationships be1409 orCllaoraC Hloc e yrs(a) Objects (w/Bounding Boxes and Vislue hWal Subcategories) aongkPi lrt(b) ScenewyacaResaephs nuoRd(c) At d worCreibutes Visual Instances Labeled by NEIL (O-O) Wheel is a part of Car. (S-O) Car is found in Raceway. (O-O) Corolla is a kind of/looks similar to Car. (S-O) Pyramid is found in Egypt. (O-A) Wheel is/has Round shape. (S-A) Alley is/has Narrow. (S-A) Bamboo forest is/has Vertical lines. (O-A) Sunflower is/has Yellow. Relationships Extracted by NEIL Figure 1. NEIL is a computer program that runs 24 hours a day and 7 days a week to gather visual knowledge from the Internet. Specifically, it simultaneously labels the data and extracts common sense relationships between categories. tween categories. Most prior work uses manually defined relationships or learns relationships in a supervised setting. Our key insight is that at a large scale one can simultane- ously label the visual instances and extract common sense relationships in ajoint semi-supervised learning framework. (c) Semantically driven knowledge acquisition: We use a semantic representation for visual knowledge; that is, we group visual data based on semantic categories and develop relationships between semantic categories. This allows us to leverage text-based indexing tools such as Google Image Search to initialize our visual knowledge base learning. Contributions: Our main contributions are: (a) We propose a never ending learning algorithm for gathering visual knowledge from the Internet via macro-vision. NEIL has been continuously running for 2.5 months on a 200 core cluster; (b) We are automatically building a large visual structured knowledge base which not only consists of labeled instances of scenes, objects, and attributes but also the relationships between them. While NEIL’s core SSL algorithm works with a fixed vocabulary, we also use noun phrases from NELL’s ontology [5] to grow our vocabulary. Currently, our growing knowledge base has an ontology of 1152 object categories, 1034 scene categories, and 87 attributes. NEIL has discovered more than 1700 relationships and labeled more than 400K visual instances of these categories. (c) We demonstrate how joint discovery of relationships and labeling of instances at a gigantic scale can provide constraints for improving semi-supervised learning. 2. Related Work Recent work has only focused on extracting knowledge in the form of large datasets for recognition and classification [8, 23, 30]. 
One of the most commonly used approaches to build datasets is using manual annotations by motivated teams of people [30] or the power of crowds [8, 40]. To minimize human effort, recent works have also focused on active learning [37, 39] which selects label requests that are most informative. However, both of these directions have a major limitation: annotations are expensive, prone to errors, biased and do not scale. An alternative approach is to use visual recognition for extracting these datasets automatically from the Internet [23, 34, 36]. A common way of automatically creating a dataset is to use image search results and rerank them via visual classifiers [14] or some form of joint-clustering in text and visual space [2, 34]. Another approach is to use a semi-supervised framework [42]. Here, a small amount of labeled data is used in conjunction with a large amount of unlabeled data to learn reliable and robust visual models. These seed images can be manually labeled [36] or the top retrievals of a text-based search [23]. The biggest problem with most of these automatic approaches is that the small number of labeled examples or image search results do not provide enough constraints for learning robust visual classifiers. Hence, these approaches suffer from semantic drift [6]. One way to avoid semantic drift is to exploit additional constraints based on the structure of our visual data. Researchers have exploited a variety of constraints such as those based on visual similarity [11, 15], seman- tic similarity [17] or multiple feature spaces [3]. However, most of these constraints are weak in nature: for example, visual similarity only models the constraint that visuallysimilar images should receive the same labels. On the other hand, our visual world is highly structured: object cate1410 gories share parts and attributes, objects and scenes have strong contextual relationships, etc. Therefore, we need a way to capture the rich structure of our visual world and exploit this structure during semi-supervised learning. In recent years, there have been huge advances in modeling the rich structure of our visual world via contextual relationships. Some of these relationships include: SceneObject [38], Object-Object [3 1], Object-Attribute [12, 22, 28], Scene-Attribute [29]. All these relationships can provide a rich set of constraints which can help us improve SSL [4]. For example, scene-attribute relationships such as amphitheaters are circular can help improve semisupervised learning of scene classifiers [36] and Wordnet hierarchical relationships can help in propagating segmentations [21]. But the big question is: how do we obtain these relationships? One way to obtain such relationships is via text analysis [5, 18]. However, as [40] points out that the visual knowledge we need to obtain is so obvious that no one would take the time to write it down and put it on web. In this work, we argue that, at a large-scale, one can jointly discover relationships and constrain the SSL prob- lem for extracting visual knowledge and learning visual classifiers and detectors. Motivated by a never ending learning algorithm for text [5], we propose a never ending visual learning algorithm that cycles between extracting global relationships, labeling data and learning classifiers/detectors for building visual knowledge from the Internet. 
Our work is also related to attribute discovery [33, 35] since these approaches jointly discover the attributes and relationships between objects and attributes simultaneously. However, in our case, we only focus on semantic attributes and therefore our goal is to discover semantic relationships and semantically label visual instances. 3. Technical Approach Our goal is to extract visual knowledge from the pool of visual data on the web. We define visual knowledge as any information that can be useful for improving vision tasks such as image understanding and object/scene recognition. One form of visual knowledge would be labeled examples of different categories or labeled segments/boundaries. Labeled examples helps us learn classifiers or detectors and improve image understanding. Another example of visual knowledge would be relationships. For example, spatial contextual relationships can be used to improve object recognition. In this paper, we represent visual knowledge in terms of labeled examples of semantic categories and the relationships between those categories. Our knowledge base consists of labeled examples of: (1) Objects (e.g., Car, Corolla); (2) Scenes (e.g., Alley, Church); (3) Attributes (e.g., Blue, Modern). Note that for objects we learn detectors and for scenes we build classifiers; however for the rest of the paper we will use the term detector and classifier interchangeably. Our knowledge base also contains relationships of four types: (1) Object-Object (e.g., Wheel is a part of Car);(2) Object-Attribute (e.g., Sheep is/has White); (3) Scene-Object (e.g., Car is found in Raceway); (4) SceneAttribute (e.g., Alley is/has Narrow). The outline of our approach is shown in Figure 2. We use Google Image Search to download thousands of images for each object, scene and attribute category. Our method then uses an iterative approach to clean the labels and train detectors/classifiers in a semi-supervised manner. For a given concept (e.g., car), we first discover the latent visual subcategories and bounding boxes for these sub-categories using an exemplar-based clustering approach (Section 3. 1). We then train multiple detectors for a concept (one for each sub-category) using the clustering and localization results. These detectors and classifiers are then used for detections on millions of images to learn relationships based on cooccurrence statistics (Section 3.2). Here, we exploit the fact the we are interested in macro-vision and therefore build co-occurrence statistics using only confident detections/classifications. Once we have relationships, we use them in conjunction with our classifiers and detectors to label the large set of noisy images (Section 3.3). The most confidently labeled images are added to the pool of labeled data and used to retrain the models, and the process repeats itself. At every iteration, we learn better classifiers and detectors, which in turn help us learn more relationships and further constrain the semi-supervised learning problem. We now describe each step in detail below. 3.1. Seeding Classifiers via Google Image Search The first step in our semi-supervised algorithm is to build classifiers for visual categories. One way to build initial classifiers is via a few manually labeled seed images. Here, we take an alternative approach and use text-based image retrieval systems to provide seed images for training initial detectors. For scene and attribute classifiers we directly use these retrieved images as positive data. 
However, such an approach fails for training object and attribute detectors because of four reasons (Figure 3(a)) (1) Outliers: Due to the imperfectness of text-based image retrieval, the downloaded images usually have irrelevant images/outliers; (2) Polysemy: In many cases, semantic categories might be overloaded and a single semantic category might have multiple senses (e.g., apple can mean both the company and the fruit); (3) Visual Diversity: Retrieved images might have high intra-class variation due to different viewpoint, illumination etc.; (4) Localization: In many cases the retrieved image might be a scene without a bounding-box and hence one needs to localize the concept before training a detector. Most of the current approaches handle these problems via clustering. Clustering helps in handling visual diversity [9] and discovering multiple senses of retrieval (polysemy) [25]. It can also help us to reject outliers based on – distances from cluster centers. One simple way to cluster 141 1 would be to use K-means on the set of all possible bounding boxes and then use the representative clusters as visual sub-categories. However, clustering using K-means has two issues: (1) High Dimensionality: We use the Color HOG (CHOG) [20] representation and standard distance metrics do not work well in such high-dimensions [10]; (2) Scalability: Most clustering approaches tend to partition the complete feature space. In our case, since we do not have bounding boxes provided, every image creates millions of data points and the majority of the datapoints are outliers. Recent work has suggested that K-means is not scalable and has bad performance in this scenario since it assigns membership to every data point [10]. Instead, we propose to use a two-step approach for clustering. In the first step, we mine the set of downloaded im- × ages from Google Image Search to create candidate object windows. Specifically, every image is used to train a detector using recently proposed exemplar-LDA [19]. These detectors are then used for dense detections on the same set of downloaded images. We select the top K windows which have high scores from multiple detectors. Note that this step helps us prune out outliers as the candidate windows are selected via representativeness (how many detectors fire on them). For example, in Figure 3, none of the tricycle detectors fire on the outliers such as circular dots and people eating, and hence these images are rejected at this candidate widow step. Once we have candidate windows, we cluster them in the next step. However, instead of using the high-dimensional CHOG representation for clustering, we use the detection signature of each window (represented as a vector of seed detector ELDA scores on the window) to create a K K affinity matrix. The (i, j) entry in the affinity amteat arix K i s× thKe da fofti product orixf t.h Tish vee (cit,ojr) )fo enr twryin indo thwes ai fainndj. Intuitively, this step connects candidate windows if the same set of detectors fire on both windows. Once we have the affinity matrix, we cluster the candidate windows using the standard affinity propagation algorithm [16]. Affinity propagation also allows us to extract a representative window (prototype) for each cluster which acts as an iconic image for the object [32] (Figure 3). After clustering, we train a detector for each cluster/sub-category using three-quarters of the images in the cluster. The remaining quarter is used as a validation set for calibration. 3.2. 
3.2. Extracting Relationships

Once we have initialized object detectors, attribute detectors, attribute classifiers and scene classifiers, we can use them to extract relationships automatically from the data. The key idea is that we do not need to understand each and every image downloaded from the Internet, but instead understand the statistical pattern of detections and classifications at a large scale. These patterns can be used to select the top-N relationships at every iteration. Specifically, we extract four different kinds of relationships.

Object-Object Relationships: The first kind of relationship we extract is object-object relationships, which include: (1) Partonomy relationships such as "Eye is a part of Baby"; (2) Taxonomy relationships such as "BMW 320 is a kind of Car"; and (3) Similarity relationships such as "Swan looks similar to Goose". To extract these relationships, we first build a co-detection matrix O0 whose elements represent the probability of simultaneous detection of object categories i and j. Intuitively, the co-detection matrix has high values when object detector i detects objects inside the bounding box of object j with high detection scores. To account for detectors that fire everywhere and for images which have lots of detections, we normalize the matrix O0. The normalized co-detection matrix can be written as N1^(-1/2) O0 N2^(-1/2), where N1 and N2 are the out-degree and in-degree matrices, and the (i, j) element of O0 represents the average score of the top detections of detector i on images of object category j. Once we have selected a relationship between a pair of categories, we learn its characteristics in terms of the mean and variance of the relative locations, relative aspect ratio, relative scores and relative size of the detections. For example, the nose-face relationship is characterized by a low relative window size (the nose is less than 20% of the face area) and a relative location near the center of the face. This is used to define a compatibility function ψi,j(·) which evaluates whether the detections from categories i and j are compatible or not. We also classify the relationships into the two semantic categories (part-of, taxonomy/similar) using relative features, to give a human-communicable view of the visual knowledge base.

Object-Attribute Relationships: The second type of relationship we extract is object-attribute relationships such as "Pizza has Round Shape", "Sunflower is Yellow", etc. To extract these relationships we use the same methodology, where the attributes are detected in the labeled examples of object categories. These detections and their scores are then used to build a normalized co-detection matrix, which is used to find the top object-attribute relationships.

Scene-Object Relationships: The third type of relationship extracted by our algorithm includes scene-object relationships such as "Bus is found in Bus depot" and "Monitor is found in Control room". For extracting scene-object relationships, we run the object detectors on randomly sampled images of different scene classes. The detections are then used to create a normalized co-presence matrix (similar to object-object relationships) whose (i, j) element represents the likelihood of detecting an instance of object category i in images of scene category j.

Scene-Attribute Relationships: The fourth and final type of relationship extracted by our algorithm includes scene-attribute relationships such as "Ocean is Blue", "Alleys are Narrow", etc. Here, we follow a simple methodology for extracting scene-attribute relationships: we compute a co-classification matrix such that the element (i, j) of the matrix represents the average classification score of attribute i on images of scene j. The top entries in this co-classification matrix are used to extract scene-attribute relationships.
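The degree normalization of the co-detection matrix can be sketched as follows. Here `O` stands for the raw matrix of average top-detection scores of detector i on images of category j (assumed precomputed), and the top entries of the normalized matrix are read off as candidate relationships; this is an illustrative sketch under those assumptions, not NEIL's code.

```python
import numpy as np

def normalize_codetection(O, eps=1e-8):
    """Return N1^(-1/2) O N2^(-1/2) with N1/N2 the out-/in-degree matrices."""
    out_deg = O.sum(axis=1)   # row sums    -> out-degree of each category
    in_deg = O.sum(axis=0)    # column sums -> in-degree of each category
    d_out = 1.0 / np.sqrt(out_deg + eps)
    d_in = 1.0 / np.sqrt(in_deg + eps)
    return (d_out[:, None] * O) * d_in[None, :]

def top_relationships(O_norm, categories, k=10):
    """Read off the k highest-scoring (i, j) pairs as candidate relationships."""
    order = np.argsort(O_norm, axis=None)[::-1]
    pairs = []
    for flat_idx in order:
        i, j = np.unravel_index(flat_idx, O_norm.shape)
        if i != j:
            pairs.append((categories[i], categories[j], float(O_norm[i, j])))
        if len(pairs) == k:
            break
    return pairs
```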
3.3. Retraining via Labeling New Instances

Once we have the initial set of classifiers/detectors and the set of relationships, we can use them to find new instances of the different object and scene categories. These new instances are then added to the set of labeled data, and we retrain new classifiers/detectors using the updated set of labeled data. These new classifiers are then used to extract more relationships, which in turn are used to label more data, and so on. One way to find new instances is to use the detector itself directly, for instance using the car detector to find more cars. However, this approach leads to semantic drift. To avoid semantic drift, we use the rich set of relationships extracted in the previous section and ensure that the new labeled instances of car satisfy the extracted relationships (e.g., has wheels, found in raceways, etc.).

Mathematically, let RO, RA and RS represent the sets of object-object, object-attribute and scene-object relationships at iteration t. If φi(·) represents the potential from object detector i, ωk(·) represents the scene potential, and ψi,j(·) represents the compatibility function between two object categories i and j, then we can find new instances of object category i using the contextual scoring function given below:

φi(x) + Σ_{(i,j)∈RO∪RA} φj(xl) ψi,j(x, xl) + Σ_{(i,k)∈RS} ωk(x),

where x is the window being evaluated and xl is the top-detected window of the related object/attribute category. The above equation has three terms: the first term is the appearance term for the object category itself and is measured by the score of the SVM detector on the window x. The second term measures the compatibility between object category i and the object/attribute category j if the relationship (i, j) is part of the catalogue. For example, if "Wheel is a part of Car" exists in the catalogue, then this term will be the product of the score of the wheel detector and the compatibility function between the wheel window (xl) and the car window (x). The final term measures the scene-object compatibility. Therefore, if the knowledge base contains the relationship "Car is found in Raceway", this term boosts the "Car" detection scores in "Raceway" scenes.

Figure 4. Qualitative examples of bounding box labeling done by NEIL (Nilgai, Yamaha, Violin, Bass, F-18).

At each iteration, we also add new instances of the different scene categories. We find new instances of scene category k using the contextual scoring function given below:

ωk(x) + Σ_{(m,k)∈RA′} ωm(x) + Σ_{(i,k)∈RS} φi(xl),

where RA′ represents the catalogue of scene-attribute relationships. The above equation has three terms: the first term is the appearance term for the scene category itself and is estimated using the scene classifier. The second term is the appearance term for the attribute category and is estimated using the attribute classifier. This term ensures that if a scene-attribute relationship exists, then the attribute classifier score should be high. The third and final term is the appearance term of an object category and is estimated using the corresponding object detector. This term ensures that if a scene-object relationship exists, then the object detector should detect objects in the scene.
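A minimal sketch of the object contextual scoring function above: the appearance score of a window is boosted by compatible detections of related objects/attributes and by the scene classifier whenever a relationship is in the catalogue. The dictionaries of precomputed potentials and the compatibility function are assumed to exist and are hypothetical; this illustrates the formula rather than the released system.

```python
def contextual_object_score(i, x, phi, omega, psi, top_windows,
                            obj_attr_rels, scene_obj_rels):
    """Score window x for object category i (cf. the first equation above).

    phi[c](x)          -> detector/attribute potential of category c on window x
    omega[k](x)        -> scene potential for scene k on the image containing x
    psi[(i, j)](x, xl) -> compatibility of windows x and xl for relationship (i, j)
    top_windows[j]     -> top-detected window of related category j in the image
    """
    score = phi[i](x)  # appearance term for the category itself

    # Object-object and object-attribute relationships boost the score when a
    # compatible related detection is present in the image.
    for (a, j) in obj_attr_rels:
        if a == i and j in top_windows:
            xl = top_windows[j]
            score += phi[j](xl) * psi[(i, j)](x, xl)

    # Scene-object relationships boost the score in compatible scenes.
    for (a, k) in scene_obj_rels:
        if a == i:
            score += omega[k](x)

    return score
```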
Implementation Details: To train scene and attribute classifiers, we first extract a 3912-dimensional feature vector from each image. The feature vector includes 512D GIST [27] features, concatenated with bag-of-words representations for SIFT [24], HOG [7], Lab color space, and Texton [26]. The dictionary sizes are 1000, 1000, 400 and 1000, respectively. Features of randomly sampled windows from other categories are used as negative examples for SVM training and hard mining. For the object and attribute detectors, we use CHOG [20] features with a bin size of 8. We train the detectors using the latent SVM model (without parts) [13].
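The implementation details above describe a 3912-dimensional scene/attribute feature: a 512D GIST descriptor concatenated with bag-of-words histograms over SIFT, HOG, Lab color and Texton with dictionary sizes 1000, 1000, 400 and 1000. The sketch below only illustrates the bookkeeping; `compute_gist` and `bow_histogram` are hypothetical placeholders, not an existing API.

```python
import numpy as np

# Dictionary sizes from the implementation details: SIFT, HOG, Lab, Texton.
DICT_SIZES = {"sift": 1000, "hog": 1000, "lab": 400, "texton": 1000}
GIST_DIM = 512

def scene_attribute_feature(image):
    """Concatenate GIST with bag-of-words histograms (3912-D in total).

    compute_gist() and bow_histogram() are hypothetical placeholders for the
    GIST [27] descriptor and the quantized local-feature histograms.
    """
    parts = [compute_gist(image)]                       # 512-D
    for channel, k in DICT_SIZES.items():
        parts.append(bow_histogram(image, channel, k))  # 1000/1000/400/1000-D
    feature = np.concatenate(parts)
    assert feature.shape == (GIST_DIM + sum(DICT_SIZES.values()),)  # 3912
    return feature
```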
4. Experimental Results

We demonstrate the quality of the extracted visual knowledge through qualitative results, verification by human subjects, and quantitative results on tasks such as object detection and scene recognition.

4.1. NEIL Statistics

While NEIL's core algorithm uses a fixed vocabulary, we use noun phrases from NELL [5] to grow NEIL's vocabulary. As of 10th October 2013, NEIL has an ontology of 1152 object categories, 1034 scene categories and 87 attributes. It has downloaded more than 2 million images for extracting the current structured visual knowledge. For bootstrapping our system, we use a few seed images from ImageNet [8], SUN [41] or the top images from Google Image Search. For the purposes of the extensive experimental evaluation in this paper, we ran NEIL on steroids (200 cores as opposed to the 30 cores used generally) for the last 2.5 months. NEIL has completed 16 iterations and has labeled more than 400K visual instances (including 300,000 objects with their bounding boxes). It has also extracted 1703 common sense relationships. Readers can browse the current visual knowledge base and download the detectors from: www.neil-kb.com

4.2. Qualitative Results

We first show some qualitative results in terms of the visual knowledge extracted by NEIL. Figure 4 shows the extracted visual sub-categories along with a few labeled instances belonging to each sub-category. It can be seen from the figure that NEIL effectively handles intra-class variation and polysemy via the clustering process. The purity and diversity of the clusters for different concepts indicate that contextual relationships help make our system robust to semantic drift and ensure diversity. Figure 5 shows some qualitative examples of scene-object and object-object relationships extracted by NEIL. It is effective in using a few confident detections to extract interesting relationships. Figure 6 shows some of the interesting scene-attribute and object-attribute relationships extracted by NEIL.

Figure 5. Qualitative examples of Scene-Object (rows 1-2) and Object-Object (rows 3-4) relationships extracted by NEIL (e.g., "Helicopter is found in Airfield", "Zebra is found in Savanna", "Eye is a part of Baby", "Duck is a kind of/looks similar to Goose", "Basketball net is a part of Backboard").

4.3. Evaluating Quality via Human Subjects

Next, we want to evaluate the quality of the visual knowledge extracted by NEIL. It should be noted that an extensive and comprehensive evaluation of the whole NEIL system is an extremely difficult task: it is impractical to evaluate each and every labeled instance and each and every relationship for correctness. Therefore, we randomly sample 500 visual instances and 500 relationships, and verify them using human experts. At the end of iteration 6, 79% of the relationships extracted by NEIL are correct, and 98% of the visual data labeled by NEIL has been labeled correctly. We also evaluate the per-iteration correctness of relationships: at iteration 1, more than 96% of the relationships are correct, and by iteration 3 the system stabilizes and 80% of the extracted relationships are correct. While the system does not currently demonstrate any major semantic drift, we plan to continue the evaluation and extensive analysis of the knowledge base as NEIL grows older. We also evaluate the quality of the bounding boxes generated by NEIL. For this, we randomly sample 100 images and label the ground-truth bounding boxes. On the standard intersection-over-union metric, NEIL generates bounding boxes with 0.78 overlap on average with ground truth. To give context to the difficulty of the task, the standard Objectness algorithm [1] produces bounding boxes with 0.59 overlap on average.

4.4. Using Knowledge for Vision Tasks

Finally, we want to demonstrate the usefulness of the visual knowledge learned by NEIL on standard vision tasks such as object detection and scene classification. Here, we also compare several aspects of our approach: (a) we first compare the quality of our automatically labeled dataset; as baselines, we train classifiers/detectors directly on the seed images downloaded from Google Image Search. (b) We compare NEIL against a standard bootstrapping approach which does not extract/use relationships. (c) Finally, we demonstrate the usefulness of relationships by detecting and classifying new test data with and without the learned relationships.

Scene Classification: First we evaluate our visual knowledge for the task of scene classification. We build a dataset of 600 images (12 scene categories) using Flickr images. We compare the performance of our scene classifiers against scene classifiers trained from the top 15 images of Google Image Search (our seed classifier). We also compare the performance with a standard bootstrapping approach that does not use any relationship extraction. Table 1 shows the results. We use mean average precision (mAP) as the evaluation metric. As the results show, automatic relationship extraction helps us constrain the learning problem, and so the learned classifiers give much better performance. Finally, if we also use the contextual information from NEIL relationships, we get a significant boost in performance.
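The scene classification experiment reports mean average precision over the 12 categories; a minimal sketch of that metric is given below, assuming per-category classifier scores and binary ground-truth labels over the test images (this is a sketch of the metric, not the authors' evaluation code).

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(scores, labels):
    """scores, labels: dicts mapping category -> arrays over the test images.

    labels[c] is 1 where the image belongs to category c, else 0;
    scores[c] is the classifier score for category c on each image.
    """
    aps = [average_precision_score(labels[c], scores[c]) for c in scores]
    return float(np.mean(aps))
```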
Table 1. mAP performance for scene classification on 12 categories.

Method                                     mAP
Seed Classifier (15 Google Images)         0.52
Bootstrapping (without relationships)      0.54
NEIL Scene Classifiers                     0.57
NEIL (Classifiers + Relationships)         0.62

Object Detection: We also evaluate the extracted visual knowledge for the task of object detection. We build a dataset of 1000 images (15 object categories) using Flickr data for testing. We compare the performance against object detectors trained directly using (top-50 and top-450) images from Google Image Search. We also compare the performance of detectors trained after aspect-ratio clustering, HOG-based clustering, and our proposed clustering procedure. Table 2 shows the detection results. Using 450 images from Google Image Search decreases the performance due to noisy retrievals. While the other clustering methods help, the gain from our clustering procedure is much larger. Finally, detectors trained using NEIL work better than standard bootstrapping.

Figure 6. Examples of extracted common sense relationships (e.g., "Monitor is found in Control room", "Washing machine is found in Utility room", "Siberian tiger is found in Zoo", "Cougar looks similar to Cat", "Samsung galaxy is a kind of Cellphone", "Trading floor is/has Crowded").

Table 2. mAP performance for object detection on 15 categories.

Method                                       mAP
Latent SVM (50 Google Images)                0.34
Latent SVM (450 Google Images)               0.28
Latent SVM (450, Aspect Ratio Clustering)    0.30
Latent SVM (450, HOG-based Clustering)       0.33
Seed Detector (NEIL Clustering)              0.44
Bootstrapping (without relationships)        0.45
NEIL Detector                                0.49
NEIL Detector + Relationships                0.51

Acknowledgements: This research was supported by ONR MURI N000141010934 and a gift from Google. The authors would like to thank Tom Mitchell and David Fouhey for insightful discussions. We would also like to thank our computing clusters warp and workhorse for doing all the hard work!

References
[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? TPAMI, 2010.
[2] T. Berg and D. Forsyth. Animals on the web. In CVPR, 2006.
[3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, 1998.
[4] A. Carlson, J. Betteridge, E. R. H. Jr., and T. M. Mitchell. Coupling semi-supervised learning of categories and relations. In NAACL HLT Workshop on SSL for NLP, 2009.
[5] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
[6] J. R. Curran, T. Murphy, and B. Scholz. Minimising semantic drift with mutual exclusion bootstrapping. In Pacific Association for Computational Linguistics, 2007.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[9] S. Divvala, A. Efros, and M. Hebert. How important are 'deformable parts' in the deformable parts model? In ECCV Parts and Attributes Workshop, 2012.
[10] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes Paris look like Paris? SIGGRAPH, 2012.
[11] S. Ebert, D. Larlus, and B. Schiele. Extracting structures in image collections for object recognition. In ECCV, 2010.
[12] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.
[13] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. TPAMI, 2010.
[14] R. Fergus, P. Perona, and A. Zisserman. A visual category filter for Google images. In ECCV, 2004.
[15] R. Fergus, Y. Weiss, and A. Torralba. Semi-supervised learning in gigantic image collections. In NIPS, 2009.
[16] B. Frey and D. Dueck. Clustering by passing messages between data points. Science, 2007.
[17] M. Guillaumin, J. Verbeek, and C. Schmid. Multimodal semi-supervised learning for image classification. In CVPR, 2010.
[18] A. Gupta and L. S. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV, 2008.
[19] B. Hariharan, J. Malik, and D. Ramanan. Discriminative decorrelation for clustering and classification. In ECCV, 2012.
[20] F. S. Khan, R. M. Anwer, J. van de Weijer, A. D. Bagdanov, M. Vanrell, and A. M. Lopez. Color attributes for object detection. In CVPR, 2012.
[21] D. Kuettel, M. Guillaumin, and V. Ferrari. Segmentation propagation in ImageNet. In ECCV, 2012.
[22] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
[23] L.-J. Li, G. Wang, and L. Fei-Fei. OPTIMOL: Automatic object picture collection via incremental model learning. In CVPR, 2007.
[24] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[25] A. Lucchi and J. Weston. Joint image and word sense discrimination for image retrieval. In ECCV, 2012.
[26] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. PAMI, 2004.
[27] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001.
[28] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.
[29] G. Patterson and J. Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In CVPR, 2012.
[30] P. Perona. Visions of a Visipedia. Proceedings of the IEEE, 2010.
[31] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In ICCV, 2007.
[32] R. Raguram and S. Lazebnik. Computing iconic summaries of general visual concepts. In Workshop on Internet Vision, 2008.
[33] M. Rastegari, A. Farhadi, and D. Forsyth. Attribute discovery via predictable discriminative binary codes. In ECCV, 2012.
[34] F. Schroff, A. Criminisi, and A. Zisserman. Harvesting image databases from the web. In ICCV, 2007.
[35] V. Sharmanska, N. Quadrianto, and C. H. Lampert. Augmented attribute representations. In ECCV, 2012.
[36] A. Shrivastava, S. Singh, and A. Gupta. Constrained semi-supervised learning using attributes and comparative attributes. In ECCV, 2012.
[37] B. Siddiquie and A. Gupta. Beyond active noun tagging: Modeling contextual interactions for multi-class active learning. In CVPR, 2010.
[38] E. Sudderth, A. Torralba, W. T. Freeman, and A. Willsky. Learning hierarchical models of scenes, objects, and parts. In ICCV, 2005.
[39] S. Vijayanarasimhan and K. Grauman. Large-scale live active learning: Training object detectors with crawled data and crowds. In CVPR, 2011.
[40] L. von Ahn and L. Dabbish. Labeling images with a computer game. In SIGCHI, 2004.
[41] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large scale scene recognition from abbey to zoo. In CVPR, 2010.
[42] X. Zhu. Semi-supervised learning literature survey. Technical report, CS, UW-Madison, 2005.

5 0.81120515 259 iccv-2013-Manifold Based Face Synthesis from Sparse Samples

Author: Hongteng Xu, Hongyuan Zha

Abstract: Data sparsity has been a thorny issuefor manifold-based image synthesis, and in this paper we address this critical problem by leveraging ideas from transfer learning. Specifically, we propose methods based on generating auxiliary data in the form of synthetic samples using transformations of the original sparse samples. To incorporate the auxiliary data, we propose a weighted data synthesis method, which adaptively selects from the generated samples for inclusion during the manifold learning process via a weighted iterative algorithm. To demonstrate the feasibility of the proposed method, we apply it to the problem of face image synthesis from sparse samples. Compared with existing methods, the proposed method shows encouraging results with good performance improvements.

6 0.80798221 433 iccv-2013-Understanding High-Level Semantics by Modeling Traffic Patterns

7 0.80592865 79 iccv-2013-Coherent Object Detection with 3D Geometric Context from a Single Image

8 0.80592507 438 iccv-2013-Unsupervised Visual Domain Adaptation Using Subspace Alignment

9 0.80527186 26 iccv-2013-A Practical Transfer Learning Algorithm for Face Verification

10 0.80439687 45 iccv-2013-Affine-Constrained Group Sparse Coding and Its Application to Image-Based Classifications

11 0.80269921 122 iccv-2013-Distributed Low-Rank Subspace Segmentation

12 0.80250722 321 iccv-2013-Pose-Free Facial Landmark Fitting via Optimized Part Mixtures and Cascaded Deformable Shape Model

13 0.80198437 392 iccv-2013-Similarity Metric Learning for Face Recognition

14 0.80102801 97 iccv-2013-Coupling Alignments with Recognition for Still-to-Video Face Recognition

15 0.80100346 328 iccv-2013-Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation

16 0.800403 232 iccv-2013-Latent Space Sparse Subspace Clustering

17 0.79969233 362 iccv-2013-Robust Tucker Tensor Decomposition for Effective Image Representation

18 0.7991277 339 iccv-2013-Rank Minimization across Appearance and Shape for AAM Ensemble Fitting

19 0.79898179 187 iccv-2013-Group Norm for Learning Structured SVMs with Unstructured Latent Variables

20 0.79891849 106 iccv-2013-Deep Learning Identity-Preserving Face Space