cvpr cvpr2013 cvpr2013-94 cvpr2013-94-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yingying Zhu, Nandita M. Nayak, Amit K. Roy-Chowdhury
Abstract: In this paper, rather than modeling activities in videos individually, we propose a hierarchical framework that jointly models and recognizes related activities using motion and various context features. This is motivated by the observation that activities related in space and time rarely occur independently and can serve as context for each other. Given a video, action segments are automatically detected using motion segmentation based on a nonlinear dynamical model. We aim to merge these segments into activities of interest and generate optimum labels for the activities. Towards this goal, we utilize a structural model in a max-margin framework that jointly models the underlying activities which are related in space and time. The model explicitly learns the duration, motion, and context patterns for each activity class, as well as the spatio-temporal relationships for groups of them. The learned model is then used to optimally label the activities in the testing videos using a greedy search method. We show promising results on the VIRAT Ground Dataset, demonstrating the benefit of jointly modeling and recognizing activities in a wide-area scene.
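The labeling step mentioned in the abstract (a greedy search under a learned joint model) can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring form (per-segment unary scores plus symmetric pairwise context scores between related segments), the function name `greedy_label`, and the toy numbers are all assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple


def greedy_label(
    unary: Sequence[Sequence[float]],
    pairwise: Sequence[Sequence[float]],
    edges: Sequence[Tuple[int, int]],
) -> Tuple[List[int], float]:
    """Greedily assign one class label per segment (hypothetical helper).

    unary[i][c]    : assumed score of giving segment i the label c
    pairwise[a][b] : assumed symmetric context score for related
                     segments labelled a and b
    edges          : pairs of segments treated as related in space-time
    """
    n = len(unary)
    num_classes = len(unary[0])

    # Build an adjacency list from the undirected edge list.
    neighbours = {i: [] for i in range(n)}
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)

    labels = {}
    total = 0.0
    while len(labels) < n:
        best = None
        for i in range(n):
            if i in labels:
                continue
            for c in range(num_classes):
                # Gain = unary score + context with already-labelled neighbours.
                gain = unary[i][c] + sum(
                    pairwise[c][labels[j]] for j in neighbours[i] if j in labels
                )
                if best is None or gain > best[0]:
                    best = (gain, i, c)
        gain, i, c = best
        labels[i] = c      # commit the single best (segment, label) pair
        total += gain
    return [labels[i] for i in range(n)], total


# Toy input (assumed values): three segments, two classes, a chain 0-1-2.
toy_unary = [[2.0, 0.0], [0.4, 0.5], [0.0, 0.1]]
toy_pairwise = [[1.0, -1.0], [-1.0, 1.0]]
toy_edges = [(0, 1), (1, 2)]
toy_labels, toy_score = greedy_label(toy_unary, toy_pairwise, toy_edges)
```

On this toy input, labeling each segment independently by its best unary score would give `[0, 1, 1]`, whereas the greedy search with context assigns class 0 to all three segments. Greedy search is not guaranteed to find the globally best labeling; it trades optimality for speed.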
[1] M. R. Amer and S. Todorovic. Sum-product networks for modeling activities with stochastic structure. In CVPR, 2012.
[2] B. Yao and L. Fei-Fei. Modeling mutual context of object and human pose in human object interaction activities. In CVPR, 2010.
[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2nd edition, 2006.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 2011.
[5] R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In CVPR, 2009.
[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[7] C. Desai, D. Ramanan, and C. C. Fowlkes. Discriminative models for multi-class object layout. In International Journal of Computer Vision, 2011.
[8] S. Oh et al. A large-scale benchmark dataset for event recognition in surveillance video. In CVPR, 2011.
[9] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformable part models, Release 4. http://people.cs.uchicago.edu/~pff/latent-release4/.
[10] U. Gaur, Y. Zhu, B. Song, and A. K. Roy-Chowdhury. A “string of feature graphs” model for recognition of complex activities in natural videos. In ICCV, 2011.
[11] A. Gupta, P. Srinivasan, J. Shi, and L. S. Davis. Understanding videos, constructing plots: learning a visually grounded storyline model from annotated videos. In CVPR, 2009.
[12] D. Han, L. Bo, and C. Sminchisescu. Selection and context for action recognition. In ICCV, 2009.
[13] T. Lan, Y. Wang, S. N. Robinovitch, and G. Mori. Discriminative latent models for recognizing contextual group activities. In IEEE Trans. on Pattern Analysis and Machine Intelligence, 2012.
[14] T. Lan, Y. Wang, W. Yang, and G. Mori. Beyond actions: Discriminative models for contextual group activities. In NIPS, 2010.
[15] I. Laptev. On space-time interest points. In International Journal of Computer Vision, 2005.
[16] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, 2009.
[17] V. I. Morariu and L. S. Davis. Multi-agent event recognition in structured scenarios. In CVPR, 2011.
[18] Y.-G. Jiang, C.-W. Ngo, and J. Yang. Towards optimal bag of words for object categorization and semantic video retrieval. In ACM CIVR, 2007.
[19] J. C. Niebles, H. Wang, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, 2010.
[20] A. Oliva and A. Torralba. The role of context in object recognition. In Trends in Cognitive Science, 2007.
[21] Z. Si, M. Pei, B. Yao, and S. Zhu. Unsupervised learning of event and-or grammar and semantics from video. In ICCV, 2011.
[22] B. Song, T. Jeng, E. Staudt, and A. Roy-Chowdhury. A stochastic graph evolution framework for robust multi-target tracking. In ECCV, 2010.
[23] J. Sun, X. Wu, S. Yan, L.-F. Cheong, T.-S. Chua, and J. Li. Hierarchical spatio-temporal context modeling for action recognition. In CVPR, 2009.
[24] K. Tang, L. Fei-Fei, and D. Koller. Learning latent temporal structure for complex event detection. In CVPR, 2012.
[25] C. H. Teo, Q. Le, A. Smola, and S. V. N. Vishwanathan. A scalable modular convex solver for regularized risk minimization. In SIGKDD, pages 727–736, 2007.
[26] J. Wang, Z. Chen, and Y. Wu. Action recognition with multiscale spatio-temporal contexts. In CVPR, 2011.
[27] Y. Zhu, N. M. Nayak, and A. K. Roy-Chowdhury. Context-aware activity recognition and anomaly detection in video. In IEEE Journal of Selected Topics in Signal Processing, pages 91–101, February 2013.
[28] Z. Zivkovic. Improved adaptive Gaussian mixture model for background subtraction. In ICPR, 2004.