iccv iccv2013 iccv2013-81 iccv2013-81-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kevin Tang, Bangpeng Yao, Li Fei-Fei, Daphne Koller
Abstract: In this paper, we tackle the problem of combining features extracted from video for complex event recognition. Feature combination is an especially relevant task in video data, as there are many features we can extract, ranging from image features computed from individual frames to video features that take temporal information into account. To combine features effectively, we propose a method that is able to be selective of different subsets of features, as some features or feature combinations may be uninformative for certain classes. We introduce a hierarchical method for combining features based on the AND/OR graph structure, where nodes in the graph represent combinations of different sets of features. Our method automatically learns the structure of the AND/OR graph using score-based structure learning, and we introduce an inference procedure that is able to efficiently compute structure scores. We present promising results and analysis on the difficult and large-scale 2011 TRECVID Multimedia Event Detection dataset [17].
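To make the AND/OR idea concrete, here is a minimal sketch of how a score could be computed over a small AND/OR graph of features. It is an illustration under stated assumptions, not the authors' model: the `Node` class, the `evaluate` function, the choice of summing at AND nodes and taking the maximum at OR nodes, and the feature names and scores are all hypothetical, and the structure-learning step described in the abstract is not shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """One node of a toy AND/OR graph (hypothetical structure)."""
    kind: str                      # "leaf", "and", or "or"
    feature: str = ""              # feature name, used only by leaves
    children: List["Node"] = field(default_factory=list)


def evaluate(node: Node, scores: Dict[str, float]) -> float:
    """Recursively score a graph from per-feature scores.

    Assumption: AND nodes combine all children (here by summing),
    OR nodes select a single child (here the best-scoring one).
    """
    if node.kind == "leaf":
        return scores[node.feature]
    child_scores = [evaluate(child, scores) for child in node.children]
    if node.kind == "and":
        return sum(child_scores)
    if node.kind == "or":
        return max(child_scores)
    raise ValueError(f"unknown node kind: {node.kind}")


# Illustrative graph: use (HOG and SIFT together) or GIST alone.
graph = Node("or", children=[
    Node("and", children=[Node("leaf", "hog"), Node("leaf", "sift")]),
    Node("leaf", "gist"),
])

# Made-up per-feature classifier scores for one video.
example_scores = {"hog": 0.25, "sift": 0.5, "gist": 0.625}
result = evaluate(graph, example_scores)
```

With these made-up scores the AND branch (0.25 + 0.5) outscores the GIST leaf, so the OR node selects the combined branch. An uninformative feature subset would simply lose at the OR node, which is the kind of selectivity the abstract describes.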
[1] F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In NIPS, 2008. 2
[2] H. Chen, Z. Xu, Z. Liu, and S. C. Zhu. Composite templates for cloth modeling and sketching. In CVPR, 2006. 2, 4
[3] Y. Chen, L. Zhu, C. Lin, A. L. Yuille, and H. Zhang. Rapid inference on a novel and/or graph for object detection, segmentation and parsing. In NIPS, 2007. 2, 4
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005. 5
[5] P. V. Gehler and S. Nowozin. On feature combination for multiclass object classification. In ICCV, 2009. 2, 6, 7
[6] M. Gönen and E. Alpaydin. Multiple kernel learning algorithms. JMLR, 2011. 2, 7
[7] A. Gupta, P. Srinivasan, J. Shi, and L. S. Davis. Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos. In CVPR, 2009. 2
[8] S. J. Hwang, K. Grauman, and F. Sha. Semantic kernel forests from multiple taxonomies. In NIPS, 2012. 2
[9] H. Izadinia and M. Shah. Recognizing complex events using large margin joint low-level event model. In ECCV, 2012. 2
[10] A. Kläser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC, 2008. 6
[11] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. lp-norm multiple kernel learning. JMLR, 12:953–997, 2011. 6
[12] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011. 6
[13] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004. 5
[14] P. Natarajan, S. Wu, S. N. P. Vitaladevuni, X. Zhuang, S. Tsakalidis, U. Park, R. Prasad, and P. Natarajan. Multimodal feature fusion for robust event detection in web videos. In CVPR, 2012. 1, 2
[15] T. Ojala, M. Pietikäinen, and T. Mäenpää. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE TPAMI, 2002. 5
[16] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001. 5
[17] P. Over et al. Trecvid 2011 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In TRECVID 2011, 2011. 1, 2, 5, 8
[18] A. Tamrakar, S. Ali, Q. Yu, J. Liu, O. Javed, A. Divakaran, H. Cheng, and H. S. Sawhney. Evaluation of low-level features and their combinations for complex event detection in open source videos. In CVPR, 2012. 2
[19] K. Tang, L. Fei-Fei, and D. Koller. Learning latent temporal structure for complex event detection. In CVPR, 2012. 5, 7
[20] M. Varma and D. Ray. Learning the discriminative power-invariance trade-off. In ICCV, 2007. 2, 7
[21] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In ICCV, 2009. 2
[22] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010. 5
[23] Y. Yang and M. Shah. Complex events detection using data-driven concepts. In ECCV, 2012. 2
[24] B. Yao and L. Fei-Fei. Grouplet: A structured image representation for recognizing human and object interactions. In CVPR, 2010. 2
[25] L. Zhu, Y. Chen, Y. Lu, C. Lin, and A. L. Yuille. Max margin and/or graph learning for parsing the human body. In CVPR, 2008. 2