iccv iccv2013 iccv2013-163 iccv2013-163-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zhongwen Xu, Yi Yang, Ivor Tsang, Nicu Sebe, Alexander G. Hauptmann
Abstract: Fusion of multiple features can boost the performance of large-scale visual classification and detection tasks such as the TRECVID Multimedia Event Detection (MED) competition [1]. In this paper, we propose a novel feature fusion approach, namely Feature Weighting via Optimal Thresholding (FWOT), to effectively fuse various features. FWOT learns the weights, thresholding and smoothing parameters in a joint framework to combine the decision values obtained from all the individual features and the early fusion. To the best of our knowledge, this is the first work to consider the weight and threshold factors of the fusion problem simultaneously. Compared to state-of-the-art fusion algorithms, our approach achieves promising improvements on the HMDB [8] action recognition dataset and the CCV [5] video classification dataset. In addition, experiments on two TRECVID MED 2011 collections show that our approach outperforms the state-of-the-art fusion methods for complex event detection.
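The abstract does not give the FWOT formulation, so the sketch below is only a generic illustration of late fusion that combines per-feature decision values using weights, thresholds, and a smoothing parameter. The sigmoid soft threshold, the function names, and all numeric values are assumptions made for illustration, not the paper's method, and nothing here is learned jointly as FWOT does.

```python
import math


def soft_threshold(score, threshold, smoothing):
    """Smoothly gate a decision value (assumed sigmoid form, not from the paper).

    Returns a value near 0 well below the threshold, 0.5 at the threshold,
    and near 1 well above it. A smaller `smoothing` gives a sharper step.
    """
    return 1.0 / (1.0 + math.exp(-(score - threshold) / smoothing))


def fuse_scores(scores, weights, thresholds, smoothing):
    """Weighted sum of soft-thresholded per-feature decision values."""
    if not (len(scores) == len(weights) == len(thresholds)):
        raise ValueError("scores, weights, and thresholds must have equal length")
    return sum(
        w * soft_threshold(s, t, smoothing)
        for s, w, t in zip(scores, weights, thresholds)
    )


# Hypothetical decision values from three feature-specific classifiers.
scores = [0.8, -0.2, 0.1]
weights = [0.5, 0.3, 0.2]      # hand-picked here; a real system would learn them
thresholds = [0.0, 0.0, 0.1]   # likewise hand-picked for illustration
fused = fuse_scores(scores, weights, thresholds, smoothing=0.25)
```

With weights that sum to one, as in this example, the fused value stays between 0 and 1, which makes it easy to compare across videos.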
[1] http://www.nist.gov/itl/iad/mig/med11.cfm
[2] M. Chen and A. Hauptmann. Mosift: Recognizing human actions in surveillance videos. In CMU-CS-09-161, 2009.
[3] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. Liblinear: A library for large linear classification. JMLR, 9:1871–1874, 2008.
[4] P. Gehler and S. Nowozin. On feature combination for multiclass object classification. In CVPR, 2009.
[5] Y.-G. Jiang, G. Ye, S.-F. Chang, D. Ellis, and A. C. Loui. Consumer video understanding: A benchmark database and an evaluation of human and machine performance. In ICMR, 2011.
[6] J. Kelley Jr. The cutting-plane method for solving convex programs. Journal of the Society for Industrial & Applied Mathematics, 8(4):703–712, 1960.
[7] S. Kim and S. Boyd. A minimax theorem with applications to machine learning, signal processing, and finance. SIAM Journal on Optimization, 19(3):1344–1367, 2008.
[8] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
[9] Z. Lan, L. Bao, S.-I. Yu, W. Liu, and A. G. Hauptmann. Double fusion for multimedia event detection. In Advances in Multimedia Modeling, 2012.
[10] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[11] Y. Li, I. Tsang, J. Kwok, and Z. Zhou. Tighter and convex maximum margin clustering. In AISTATS, 2009.
[12] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[13] Z. Ma, Y. Yang, Z. Xu, S. Yan, N. Sebe, and A. G. Hauptmann. Complex event detection via multi-source video attributes. In CVPR, 2013.
[14] P. Natarajan, S. Wu, S. Vitaladevuni, X. Zhuang, U. Park, R. Prasad, and N. P. Multi-channel shape-flow kernel descriptors for robust video event detection and retrieval. In ECCV, 2012.
[15] P. Natarajan, S. Wu, S. Vitaladevuni, X. Zhuang, S. Tsakalidis, U. Park, and R. Prasad. Multimodal feature fusion for robust event detection in web videos. In CVPR, 2012.
[16] A. Rakotomamonjy, F. Bach, S. Canu, Y. Grandvalet, et al. SimpleMKL. JMLR, 9:2491–2521, 2008.
[17] K. Reddy and M. Shah. Recognizing 50 human action categories of web videos. MVAP, 2012.
[18] C. G. Snoek, M. Worring, and A. W. Smeulders. Early versus late fusion in semantic video analysis. In ACM Multimedia. ACM, 2005.
[19] A. Tamrakar, S. Ali, Q. Yu, J. Liu, O. Javed, A. Divakaran, H. Cheng, and H. Sawhney. Evaluation of low-level features and their combinations for complex event detection in open source videos. In CVPR, 2012.
[20] M. Tan, L. Wang, and I. W. Tsang. Learning sparse SVM for feature selection on very high dimensional datasets. In ICML, 2010.
[21] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluating color descriptors for object and scene recognition. TPAMI, 32(9):1582–1596, 2010.
[22] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In CVPR, 2009.
[23] H. Wang, A. Klaser, C. Schmid, and C. Liu. Action recognition by dense trajectories. In CVPR, 2011.
[24] H. Wang, M. Ullah, A. Klaser, I. Laptev, C. Schmid, et al. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
[25] Y. Yang, J. Song, Z. Huang, Z. Ma, N. Sebe, and A. Hauptmann. Multi-feature fusion via hierarchical regression for multimedia analysis. TMM, 15(3):572–581, 2013.
[26] S.-I. Yu, Z. Xu, D. Ding, W. Sze, F. Vicente, Z. Lan, Y. Cai, et al. Informedia e-lamp@ trecvid2012: Multimedia event detection and recounting med and mer. In NIST TRECVID Workshop, 2012.