iccv iccv2013 iccv2013-111 iccv2013-111-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Raúl Díaz, Sam Hallman, Charless C. Fowlkes
Abstract: The confluence of robust algorithms for structure from motion along with high-coverage mapping and imaging of the world around us suggests that it will soon be feasible to accurately estimate camera pose for a large class of photographs taken in outdoor, urban environments. In this paper, we investigate how such information can be used to improve the detection of dynamic objects such as pedestrians and cars. First, we show that when rough camera location is known, we can utilize detectors that have been trained with a scene-specific background model in order to improve detection accuracy. Second, when precise camera pose is available, dense matching to a database of existing images using multi-view stereo provides a way to eliminate static backgrounds such as building facades, akin to the background subtraction often used in video analysis. We evaluate these ideas using a dataset of tourist photos with estimated camera pose. For template-based pedestrian detection, we achieve a 50 percent boost in average precision over baseline.
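The abstract's second idea, pruning detections that fall on static background recovered by multi-view stereo matching against an image database, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the box/mask formats, and the 0.5 overlap threshold are assumptions chosen for illustration only.

```python
import numpy as np

def prune_background_detections(boxes, scores, background_mask, max_bg_overlap=0.5):
    """Hypothetical sketch of MVBS-style pruning: discard detections whose
    bounding box lies mostly on pixels identified as static background
    (e.g., building facades matched to an existing photo collection).

    boxes           : (N, 4) array of [x0, y0, x1, y1] pixel coordinates
    scores          : (N,) detector confidences
    background_mask : (H, W) boolean array, True where the pixel matched
                      the static scene reconstruction (assumed given)
    max_bg_overlap  : assumed threshold on the fraction of box pixels
                      allowed to be background before rejection
    """
    keep_boxes, keep_scores = [], []
    for (x0, y0, x1, y1), s in zip(boxes.astype(int), scores):
        patch = background_mask[y0:y1, x0:x1]
        if patch.size == 0:
            continue  # skip degenerate boxes with no area inside the image
        bg_fraction = patch.mean()           # fraction of background pixels under the box
        if bg_fraction <= max_bg_overlap:    # keep boxes mostly covering non-background
            keep_boxes.append((x0, y0, x1, y1))
            keep_scores.append(s)
    return np.array(keep_boxes), np.array(keep_scores)
```

Under this sketch, the scene-specific knowledge enters only through the precomputed background mask; the detector itself is unchanged, which mirrors the test-time nature of the background-subtraction step described above.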
[1] S. Agarwal, N. Snavely, I. Simon, S. Seitz, and R. Szeliski. Building Rome in a day. ICCV, pages 72–79, Sept. 2009. 3
[2] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, pages I: 886–893, 2005. 5
[3] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL VOC2007 Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html. 5
[4] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 32(9):1627–1645, Sept. 2010. 5
[5] J. Frahm, P. Fite-Georgel, and D. Gallup. Building Rome on a cloudless day. In ECCV, 2010. 3
[6] Y. Furukawa. CMVS. http://grail.cs.washington.edu/software/cmvs/. 3
[7] Y. Furukawa, B. Curless, S. Seitz, and R. Szeliski. Towards Internet-scale multi-view stereo. In CVPR, 2010. 1, 3
[8] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. TPAMI, 1(1): 1–14, 2010. 3
[9] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester. Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/~rbg/latent-release5/. 5
[10] H. Grabner, J. Gall, and L. V. Gool. What makes a chair a chair? In CVPR, 2011. 1
[11] A. Gupta, S. Satkin, A. A. Efros, and M. Hebert. From 3d scene geometry to human workspace. In CVPR, 2011. 1
[12] J. Hays and A. A. Efros. im2gps: estimating geographic information from a single image. In CVPR, 2008. 3
[13] D. Hoiem, A. Efros, and M. Hebert. Putting Objects in Perspective. CVPR, 2:2137–2144, 2006. 1, 5, 6, 7
[14] D. Hoiem, A. A. Efros, and M. Hebert. Geometric context from a single image. ICCV, 1:654–661, 2005. 1, 7
[15] Y. Li, N. Snavely, D. Huttenlocher, and P. Fua. Worldwide Pose Estimation using 3D Point Clouds. In ECCV, 2012. 3
[16] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 60(2):91–110, Nov. 2004. 3
[17] K. Murphy, A. Torralba, and W. Freeman. Using the forest to see the trees: a graphical model relating features, objects and scenes. NIPS, 2003. 1
[18] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG), 23(3):309–314, 2004. 4
[19] T. Sattler, B. Leibe, and L. Kobbelt. Fast image-based localization using direct 2D-to-3D matching. ICCV, 2011. 2, 3
[20] Y. Sheikh, O. Javed, and T. Kanade. Background Subtraction for Freely Moving Cameras. ICCV, pages 1219–1225, Sept. 2009. 4
[21] N. Snavely, S. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. ACM Transactions on Graphics (TOG), 2006. 3, 4
[22] N. Snavely, I. Simon, M. Goesele, R. Szeliski, and S. M. Seitz. Scene Reconstruction and Visualization From Community Photo Collections. Proceedings of the IEEE, 98(8): 1370–1390, Aug. 2010. 1, 3
[23] C. Stauffer and W. Grimson. Adaptive background mixture models for real-time tracking. CVPR, 1999. 4
[24] A. Torralba and A. Efros. Unbiased look at dataset bias. CVPR, pages 1521–1528, 2011. 7
[25] A. Torralba and P. Sinha. Statistical context priming for object detection. ICCV, 1:763–770, 2001. 1
Figure 5 (panels: DT, DT+SS−, DT+SS−+MVBS, DT+SS−+GC, DT+SS−+MVBS+GC): Example detector outputs at 50% recall. Unsupervised scene-specific training makes the detector better able to reject common distractors (e.g., the statues in row 2). MVBS can prune additional false positives at test time by performing stereo matching to a database of existing images. Note that MVBS is able to remove some false positives which are not caught by geometric consistency (GC) with the horizon line because the hypothesized detections overlap heavily with regions identified as background.
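The geometric consistency (GC) cue mentioned in the caption can be sketched as a ground-plane sanity check: given the horizon row, a pedestrian's apparent height should scale with how far below the horizon their feet appear. The formula below is the standard ground-plane relation, not the paper's exact test, and every parameter value (person height, camera height, tolerance) is an illustrative assumption.

```python
def horizon_consistent(v_foot, box_height, v_horizon,
                       person_height=1.7, camera_height=1.6,
                       tolerance=0.4):
    """Hypothetical GC check: under a ground-plane model with image rows
    increasing downward, a person standing at row v_foot should appear
    roughly (person_height / camera_height) * (v_foot - v_horizon)
    pixels tall. Accept the detection if the observed box height is
    within `tolerance` relative error of that prediction.
    """
    expected = (person_height / camera_height) * (v_foot - v_horizon)
    if expected <= 0:
        return False  # feet above the horizon line: implausible for a grounded pedestrian
    return abs(box_height - expected) / expected <= tolerance
```

As the caption notes, such a check cannot reject false positives whose size happens to agree with the horizon-based prediction, which is where the stereo-based background mask adds complementary information.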