Venue: CVPR 2013
Source: pdf
Author: Kevin Tang, Rahul Sukthankar, Jay Yagnik, Li Fei-Fei
Abstract: The ubiquitous availability of Internet video offers the vision community the exciting opportunity to directly learn localized visual concepts from real-world imagery. Unfortunately, most such attempts are doomed because traditional approaches are ill-suited, both in terms of their computational characteristics and their inability to robustly contend with the label noise that plagues uncurated Internet content. We present CRANE, a weakly supervised algorithm that is specifically designed to learn under such conditions. First, we exploit the asymmetric availability of real-world training data, where small numbers of positive videos tagged with the concept are supplemented with large quantities of unreliable negative data. Second, we ensure that CRANE is robust to label noise, both in terms of tagged videos that fail to contain the concept as well as occasional negative videos that do. Finally, CRANE is highly parallelizable, making it practical to deploy at large scale without sacrificing the quality of the learned solution. Although CRANE is general, this paper focuses on segment annotation, where we show state-of-the-art pixel-level segmentation results on two datasets, one of which includes a training set of spatiotemporal segments from more than 20,000 videos.
[1] K. Ali, D. Hasler, and F. Fleuret. FlowBoost—Appearance learning from sparsely annotated video. In CVPR, 2011. 3
[2] W. Brendel and S. Todorovic. Learning spatiotemporal graphs of human activities. In ICCV, 2011. 2
[3] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, 2010. 1, 2
[4] R. Chaudhry et al. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In CVPR, 2009. 5
[5] Y. Chen, J. Bi, and J. Wang. MILES: Multiple-instance learning via embedded instance selection. PAMI, 28(12), 2006. 2
[6] T. Deselaers, B. Alexe, and V. Ferrari. Localizing objects while learning their appearance. In ECCV, 2010. 3
[7] J. C. Duchi and Y. Singer. Boosting with structural sparsity. In ICML, 2009. 5
Figure 10. Object segmentations obtained using CRANE. The top two rows are obtained for the ISA task on the dataset introduced by [11]. The bottom two rows are obtained for the TSA task on the YouTube-Objects dataset [20]. In each pair, the left image shows the original spatiotemporal segments and the right shows the output. (a) Successes; (b) Failures.
[8] M. Everingham et al. The Pascal visual object classes (VOC) challenge. IJCV, 2010. 5
[9] R.-E. Fan et al. LIBLINEAR: A library for large linear classification. JMLR, 9, 2008. 6
[10] M. Grundmann, V. Kwatra, M. Han, and I. A. Essa. Efficient hierarchical graph-based video segmentation. In CVPR, 2010. 1, 2, 3
[11] G. Hartmann, M. Grundmann, J. Hoffman, D. Tsai, V. Kwatra, O. Madani, S. Vijayanarasimhan, I. A. Essa, J. M. Rehg, and R. Sukthankar. Weakly supervised learning of object segmentations from web-scale video. In ECCV Workshop on Vision in Web-Scale Media, 2012. 1, 2, 4, 6, 7, 8
[12] Y. Ke, R. Sukthankar, and M. Hebert. Event detection in crowded videos. In ICCV, 2007. 2
[13] Y. Lee, J. Kim, and K. Grauman. Key-segments for video object segmentation. In ICCV, 2011. 2
[14] C. Leistner et al. Improving classifiers with unlabeled weakly-related videos. In CVPR, 2011. 3
[15] J. Lezama, K. Alahari, J. Sivic, and I. Laptev. Track to the future: Spatio-temporal video segmentation with long-range motion cues. In CVPR, 2011. 1, 2
[16] J. C. Niebles, C.-W. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, 2010. 1
[17] J.-C. Niebles, B. Han, A. Ferencz, and L. Fei-Fei. Extracting moving people from internet videos. In ECCV, 2008. 2
[18] T. Ojala, M. Pietikainen, and D. Harwood. Performance evaluation of texture measures with classification based on Kullback discrimination of distributions. In ICPR, 1994. 5
[19] B. Ommer, T. Mader, and J. Buhmann. Seeing the objects behind the dots: Recognition in videos from a moving camera. IJCV, 83(1), 2009. 3
[20] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In CVPR, 2012. 1, 2, 5, 6, 8
[21] D. Ramanan, D. Forsyth, and K. Barnard. Building models of animals from video. PAMI, 28(8), 2006. 3
[22] C. Rother et al. GrabCut: interactive foreground extraction using iterated graph cuts. Trans. Graphics, 23(3), 2004. 7
[23] P. Siva, C. Russell, and T. Xiang. In defence of negative mining for annotating weakly labelled data. In ECCV, 2012. 2, 3, 4, 5, 7
[24] K. Tang et al. Shifting weights: Adapting object detectors from image to video. In NIPS, 2012. 5
[25] K. Tang, L. Fei-Fei, and D. Koller. Learning latent temporal structure for complex event detection. In CVPR, 2012. 1
[26] A. Vezhnevets, V. Ferrari, and J. M. Buhmann. Weakly supervised structured output learning for semantic segmentation. In CVPR, 2012. 3
[27] S. Vicente, V. Kolmogorov, and C. Rother. Cosegmentation revisited: Models and optimization. In ECCV, 2010. 3
[28] P. Viola, J. Platt, and C. Zhang. Multiple instance boosting for object detection. In NIPS, 2005. 2, 5
[29] X. Wang, T. X. Han, and S. Yan. An HOG-LBP human detector with partial occlusion handling. In ICCV, 2009. 5
[30] J. Xiao and M. Shah. Motion layer extraction in the presence of occlusion using graph cuts. PAMI, 27(10), 2005. 2
[31] C. Xu, C. Xiong, and J. Corso. Streaming hierarchical video segmentation. In ECCV, 2012. 2
[32] Z.-J. Zha et al. Joint multi-label multi-instance learning for image classification. In CVPR, 2008. 2
[33] Z.-H. Zhou and M.-L. Zhang. Multi-instance multi-label learning with application to scene classification. In NIPS, 2007. 2