cvpr cvpr2013 cvpr2013-187 cvpr2013-187-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: S. Hussain Raza, Matthias Grundmann, Irfan Essa
Abstract: We present a novel algorithm for estimating the broad 3D geometric structure of outdoor video scenes. Leveraging spatio-temporal video segmentation, we decompose a dynamic scene captured by a video into geometric classes, based on predictions made by region classifiers that are trained on appearance and motion features. By examining the homogeneity of the prediction, we combine predictions across multiple levels of the segmentation hierarchy, alleviating the need to determine the granularity a priori. We built a novel, extensive dataset on geometric context of video to evaluate our method, consisting of over 100 ground-truth annotated outdoor videos with over 20,000 frames. To scale beyond this dataset, we propose a semi-supervised learning framework that expands the pool of labeled data with high-confidence predictions obtained from unlabeled data. Our system produces an accurate prediction of the geometric context of video, achieving 96% accuracy across the main geometric classes.
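The semi-supervised step mentioned in the abstract (adding high-confidence predictions on unlabeled data to the labeled pool) can be illustrated with a minimal self-training sketch. This is not the authors' implementation: the class names, the threshold of 0.9, and the helper names `select_confident` and `expand_pool` are all assumptions made for illustration, and the classifier that produces the probabilities is left out.

```python
from typing import List, Sequence, Tuple

# Hypothetical class names; the abstract only refers to "main geometric classes".
CLASSES = ("sky", "ground", "vertical")


def select_confident(probs: Sequence[Sequence[float]],
                     threshold: float = 0.9) -> List[Tuple[int, int]]:
    """Return (sample_index, class_index) pairs whose top probability
    is at least `threshold` (the threshold value is an assumption)."""
    selected = []
    for i, p in enumerate(probs):
        best = max(range(len(p)), key=lambda k: p[k])
        if p[best] >= threshold:
            selected.append((i, best))
    return selected


def expand_pool(labeled, unlabeled, probs, threshold=0.9):
    """Move confidently predicted unlabeled samples into the labeled pool.

    Returns the enlarged labeled pool and the samples that stay unlabeled.
    """
    chosen = select_confident(probs, threshold)
    chosen_idx = {i for i, _ in chosen}
    new_labeled = list(labeled) + [(unlabeled[i], c) for i, c in chosen]
    remaining = [x for i, x in enumerate(unlabeled) if i not in chosen_idx]
    return new_labeled, remaining


# Toy usage: region identifiers stand in for feature vectors.
labeled = [("region_a", 0)]
unlabeled = ["region_b", "region_c", "region_d"]
probs = [[0.95, 0.03, 0.02], [0.40, 0.35, 0.25], [0.05, 0.02, 0.93]]
new_labeled, remaining = expand_pool(labeled, unlabeled, probs)
```

In a full self-training loop, the classifier would be retrained on the enlarged pool and the selection repeated. A stricter threshold adds fewer pseudo-labels but reduces the risk of reinforcing wrong predictions.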
[1] G. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV, 2008. 2
[2] G.J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 2008. 2
[3] M. Collins, R.E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 2002. 3, 4
[4] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. ECCV, 2006. 4
[5] S.K. Divvala, D. Hoiem, J.H. Hays, A.A. Efros, and M. Hebert. An empirical study of context in object detection. In IEEE CVPR, 2009. 1
[6] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Scene parsing with multiscale feature learning, purity trees, and optimal covers. In ICML, 2012. 2
[7] G. Farnebäck. Two-frame motion estimation based on polynomial expansion. Image Analysis, 2003. 4
[8] P.F. Felzenszwalb and D.P. Huttenlocher. Efficient graph-based image segmentation. IJCV, 2004. 3
[9] S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. In IEEE CVPR, 2009. 2
[10] M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph-based video segmentation. In IEEE CVPR, 2010. 2, 3, 4
[11] D. Hoiem, A.A. Efros, and M. Hebert. Geometric context from a single image. In ICCV, 2005. 2, 5
[12] D. Hoiem, A.A. Efros, and M. Hebert. Putting objects in perspective. In IEEE CVPR, 2006. 1
[13] D. Hoiem, A.A. Efros, and M. Hebert. Recovering surface layout from an image. IJCV, 2007. 2, 4, 5
[14] O. Miksik, D. Munoz, J.A. Bagnell, and M. Hebert. Efficient temporal consistency for streaming video scene analysis. Technical report, RI, CMU, Sep. 2012. 2
[15] B.C. Russell, A. Torralba, K.P. Murphy, and W.T. Freeman. LabelMe: a database and web-based tool for image annotation. IJCV, 2008. 7
[16] P. Sturgess, K. Alahari, L. Ladicky, and P. Torr. Combining appearance and structure from motion features for road scene understanding. In BMVC, 2009. 2
[17] J. Tighe and S. Lazebnik. Superparsing. IJCV, 2012. 2
[18] A. Torralba, K.P. Murphy, and W.T. Freeman. Contextual models for object detection using boosted random fields. In NIPS, 2004. 1
[19] C. Wojek, S. Roth, K. Schindler, and B. Schiele. Monocular 3D scene modeling and inference: Understanding multi-object traffic scenes. In ECCV. Springer, 2010. 2
[20] C. Xu and J.J. Corso. Evaluation of super-voxel methods for early video processing. In IEEE CVPR, 2012. 3