
303 nips-2011-Video Annotation and Tracking with Active Learning


Source: pdf

Author: Carl Vondrick, Deva Ramanan

Abstract: We introduce a novel active learning framework for video annotation. By judiciously choosing which frames a user should annotate, we can obtain highly accurate tracks with minimal user effort. We cast this problem as one of active learning, and show that we can obtain excellent performance by querying frames that, if annotated, would produce a large expected change in the estimated object track. We implement a constrained tracker and compute the expected change for putative annotations with efficient dynamic programming algorithms. We demonstrate our framework on four datasets, including two benchmark datasets constructed with key frame annotations obtained by Amazon Mechanical Turk. Our results indicate that we could obtain equivalent labels for a small fraction of the original cost.
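To make the query criterion concrete, the following is a minimal Python sketch of the idea, not the authors' implementation: a Viterbi-style dynamic program tracks a single 1-D state per frame subject to user-annotated keyframe constraints, and the active learner queries the frame whose putative annotation would most change the resulting track. The 1-D state space, the quadratic motion penalty, and scoring only the single most plausible putative label per frame (rather than an expectation over all putative labels, as the abstract describes) are all simplifying assumptions.

import numpy as np

def track_dp(unary, motion_weight, constraints=None):
    # Viterbi-style dynamic program: choose one state per frame to
    # minimize per-frame appearance (unary) cost plus a quadratic
    # motion penalty between consecutive frames. unary is a (T, S)
    # cost array; constraints maps frame index -> annotated state.
    T, S = unary.shape
    states = np.arange(S)
    constraints = constraints or {}

    def clamp(cost, t):
        # Force the track through any user-annotated keyframe.
        if t in constraints:
            clamped = np.full(S, np.inf)
            clamped[constraints[t]] = cost[constraints[t]]
            return clamped
        return cost

    cost = clamp(unary[0].copy(), 0)
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        trans = motion_weight * (states[:, None] - states[None, :]) ** 2
        total = cost[:, None] + trans            # total[prev, cur]
        back[t] = np.argmin(total, axis=0)
        cost = clamp(total[back[t], states] + unary[t], t)

    path = np.zeros(T, dtype=int)                # backtrack best path
    path[-1] = int(np.argmin(cost))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

def query_frame(unary, motion_weight, constraints):
    # Active-learning query: rerun the constrained tracker with a
    # putative annotation at each unlabeled frame and return the
    # frame whose annotation would change the current track the most.
    current = track_dp(unary, motion_weight, constraints)
    best_frame, best_change = None, -1.0
    for t in range(unary.shape[0]):
        if t in constraints:
            continue
        putative = dict(constraints)
        putative[t] = int(np.argmin(unary[t]))   # most plausible label at t
        change = np.abs(track_dp(unary, motion_weight, putative) - current).sum()
        if change > best_change:
            best_frame, best_change = t, change
    return best_frame

Calling query_frame, asking the user to label the returned frame, adding that label to constraints, and repeating yields the annotate-and-query loop described above; replacing the single putative label with a likelihood-weighted sum over all putative labels would recover an expected-change criterion closer to the one the abstract describes.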


reference text

[1] A. Agarwala, A. Hertzmann, D. Salesin, and S. Seitz. Keyframe-based tracking for rotoscoping and animation. ACM Transactions on Graphics (TOG), 23(3):584–591, 2004.

[2] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML, pages 65–72, 2006.

[3] R. Bellman. Some problems in the theory of dynamic programming. Econometrica: Journal of the Econometric Society, pages 37–48, 1954.

[4] S. Branson, P. Perona, and S. Belongie. Strong supervision from weak annotation: Interactive training of deformable part models. In ICCV, 2011.

[5] A. Buchanan and A. Fitzgibbon. Interactive feature tracking using k-d trees and dynamic programming. In CVPR, volume 1, pages 626–633, 2006.

[6] F. Crow. Summed-area tables for texture mapping. ACM SIGGRAPH Computer Graphics, 18(3):207–212, 1984.

[7] A. Culotta, T. Kristjansson, A. McCallum, and P. Viola. Corrective feedback and persistent learning for information extraction. Artificial Intelligence, 170(14–15):1101–1122, 2006.

[8] A. Culotta and A. McCallum. Reducing labeling effort for structured prediction tasks. In AAAI, volume 20, page 746, 2005.

[9] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, pages 886–893, 2005.

[10] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

[11] P. Felzenszwalb and D. Huttenlocher. Distance transforms of sampled functions. Technical Report TR2004-1963, Cornell Computing and Information Science, 2004.

[12] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. Freeman. SIFT Flow: Dense correspondence across different scenes. In ECCV, pages 28–42, 2008.

[13] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C.-C. Chen, J. T. Lee, S. Mukherjee, J. K. Aggarwal, H. Lee, L. Davis, E. Swears, X. Wang, Q. Ji, K. Reddy, M. Shah, C. Vondrick, H. Pirsiavash, D. Ramanan, J. Yuen, A. Torralba, B. Song, A. Fong, A. Roy-Chowdhury, and M. Desai. A large-scale benchmark dataset for event recognition in surveillance video. In CVPR, 2011.

[14] B. Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.

[15] C. Vondrick, D. Ramanan, and D. Patterson. Efficiently scaling up video annotation on crowdsourced marketplaces. In ECCV, 2010.

[16] J. Yuen, B. Russell, C. Liu, and A. Torralba. LabelMe video: Building a video database with human annotations. In ICCV, 2009.

[17] J. Yuen and A. Torralba. A data-driven approach for event prediction. In ECCV, pages 707–720, 2010.