iccv iccv2013 iccv2013-127 iccv2013-127-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Weixin Li, Qian Yu, Ajay Divakaran, Nuno Vasconcelos
Abstract: The problem of adaptively selecting pooling regions for the classification of complex video events is considered. Complex events are defined as events composed of several characteristic behaviors, whose temporal configuration can change from sequence to sequence. A dynamic pooling operator is defined so as to enable a unified solution to the problems of event-specific video segmentation, temporal structure modeling, and event detection. Video is decomposed into segments, and the segments most informative for detecting a given event are identified, so as to dynamically determine the pooling operator most suited for each sequence. This dynamic pooling is implemented by treating the locations of characteristic segments as hidden information, which is inferred, on a sequence-by-sequence basis, via a large-margin classification rule with latent variables. Although the feasible set of segment selections is combinatorial, it is shown that a globally optimal solution to the inference problem can be obtained efficiently, through the solution of a series of linear programs. Besides the coarse-level location of segments, a finer model of video structure is implemented by jointly pooling features of segment-tuples. Experimental evaluation demonstrates that the resulting event detector has state-of-the-art performance on challenging video datasets.
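The core idea of the abstract, selecting the segments most informative for an event and pooling only over them, can be illustrated with a minimal sketch. This is not the paper's latent-variable large-margin formulation; it is a toy version assuming a fixed classifier `w`, average pooling, and a fixed number `k` of selected segments (names `dynamic_pool`, `segment_feats` are hypothetical). Under a cardinality constraint, the combinatorial selection is solved exactly by ranking per-segment scores, which mirrors the abstract's point that the selection problem admits efficient globally optimal inference:

```python
import numpy as np

def dynamic_pool(segment_feats, w, k):
    """Toy latent-segment inference: choose the k segments whose
    features score highest under classifier w, then average-pool
    their features. With fixed cardinality k, ranking per-segment
    scores gives the exact optimum of the 0/1 selection problem."""
    scores = segment_feats @ w          # per-segment evidence for the event
    idx = np.argsort(scores)[::-1][:k]  # top-k most informative segments
    pooled = segment_feats[idx].mean(axis=0)
    return pooled, np.sort(idx)

# Usage: 5 video segments with 3-d features; w favors dimension 0.
X = np.array([[1., 0, 0], [0, 1, 0], [3., 0, 0], [0, 0, 1], [2., 0, 0]])
w = np.array([1., 0., 0.])
pooled, idx = dynamic_pool(X, w, k=2)
# idx -> [2, 4] (scores 3 and 2); pooled -> [2.5, 0, 0]
```

In the paper this selection is instead inferred jointly with learning, via linear programs over a relaxed feasible set, but the sketch shows why pooling adapts per sequence: different videos place their characteristic segments at different temporal locations.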
[1] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. NIPS, 2002. 4
[2] S. Boyd and L. Vandenberghe. Convex optimization. 2004. 4
[3] W. Brendel, A. Fern, and S. Todorovic. Probabilistic event logic for interval-based event recognition. CVPR, 2011. 7
[4] W. Brendel and S. Todorovic. Learning spatiotemporal graphs of human activities. ICCV, 2011. 3, 7
[5] L. Cao, Y. Mu, N. Apostol, S.-F. Chang, G. Hua, and J. R. Smith. Scene aligned pooling for complex video recognition. ECCV, 2012. 1, 2
[6] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. Liblinear: A library for large linear classification. JMLR, 9: 1871–1874, 2008. 5
[7] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE TPAMI, 32(9): 1627–1645, 2009. 4, 5
[8] A. Gaidon, Z. Harchaoui, and C. Schmid. Actom sequence models for efficient action detection. CVPR, 2011. 3, 5
[9] A. Gaidon, Z. Harchaoui, and C. Schmid. Recognizing activities with cluster-trees of tracklets. BMVC, 2012. 2, 7
[10] Y. Jia, C. Huang, and T. Darrell. Beyond spatial pyramids: Receptive field learning for pooled image features. CVPR, 2012. 2
[11] Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based modeling of human actions with motion reference points. ECCV, 2012. 7
[12] K. Kiwiel. Proximity control in bundle methods for convex nondifferentiable minimization. Math. Program., 46: 105–122, 1990. 5
[13] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. CVPR, 2008. 2, 3, 4, 5, 6, 7, 8
[14] W. Li and N. Vasconcelos. Recognizing activities by attribute dynamics. NIPS, 2012. 2, 3, 5, 6, 7
[15] W. Li and N. Vasconcelos. Exact linear relaxation of integer linear fractional programming with non-negative denominators. SVCL Technical Report, 2013. 4
[16] W. Li, Q. Yu, H. Sawhney, and N. Vasconcelos. Recognizing activities via bag of words for attribute dynamics. CVPR, 2013. 1, 3
[17] J. Niebles, C. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. ECCV, 2010. 1, 2, 3, 6, 7, 8
[18] S. Nowozin, G. Bakir, and K. Tsuda. Discriminative subsequence mining for action classification. ICCV, 2007. 2, 5
[19] P. Over, G. Awad, J. Fiscus, B. Antonishek, M. Michel, A. F. Smeaton, and W. Kraaij. TRECVID 2011 – an overview of the goals, tasks, data, evaluation mechanisms, and metrics. Proceedings of TRECVID 2011, 2011. 7
[20] S. Satkin and M. Hebert. Modeling the temporal extent of actions. ECCV, 2010. 2, 3
[21] K. Schindler and L. V. Gool. Action snippets: How many frames does human action recognition require? CVPR, 2008. 2
[22] B. K. Sriperumbudur and G. R. G. Lanckriet. A proof of convergence of the concave-convex procedure using Zangwill's theory. Neural Computation, 24: 1391–1407, 2012. 5
[23] K. Tang, L. Fei-Fei, and D. Koller. Learning latent temporal structure for complex event detection. CVPR, 2012. 1, 2, 3, 5, 6, 7, 8
[24] S. Todorovic. Human activities as stochastic Kronecker graphs. ECCV, 2012. 3, 7
[25] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6: 1453–1484, 2005. 5
[26] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE TPAMI, 34(3): 480–492, 2012. 3, 7
[27] H. Wang, A. Kläser, C. Schmid, and L. Cheng-Lin. Action recognition by dense trajectories. CVPR, 2011. 2, 5, 7
[28] H. Wang, M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. BMVC, 2009. 3, 6
[29] A. L. Yuille and A. Rangarajan. The concave-convex procedure (CCCP). NIPS, 2003. 4