iccv iccv2013 iccv2013-4 iccv2013-4-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Chen Sun, Ram Nevatia
Abstract: The goal of high level event classification from videos is to assign a single, high level event label to each query video. Traditional approaches represent each video as a set of low level features and encode it into a fixed length feature vector (e.g., Bag-of-Words), which leaves a large gap between low level visual features and high level events. Our paper tries to address this problem by exploiting activity concept transitions in video events (ACTIVE). A video is treated as a sequence of short clips, all of which are observations corresponding to latent activity concept variables in a Hidden Markov Model (HMM). We propose to apply Fisher Kernel techniques so that the concept transitions over time can be encoded into a compact, fixed length feature vector very efficiently. Our approach can utilize concept annotations from independent datasets, and works well even with a very small number of training samples. Experiments on the challenging NIST TRECVID Multimedia Event Detection (MED) dataset show that our approach performs favorably against the state of the art.
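The encoding idea in the abstract can be illustrated with a minimal sketch. It assumes a generic discrete-state HMM and takes the Fisher score with respect to the transition probabilities only; it is not the authors' implementation. The names `obs_lik`, `expected_transition_stats`, and `transition_fisher_vector` are hypothetical. `obs_lik[t][k]` stands for the likelihood of clip `t` under activity concept `k`, however that is obtained, and subtracting the expected state occupancy is an assumed way of accounting for the constraint that each row of the transition matrix sums to one.

```python
def expected_transition_stats(pi, A, obs_lik):
    """Scaled forward-backward pass over one clip sequence.

    pi      : initial state probabilities, length K
    A       : K x K transition matrix, A[i][j] = P(next=j | current=i)
    obs_lik : T x K table (hypothetical input), likelihood of clip t under concept k
    Returns the expected transition counts (K x K) and the expected
    occupancy of each state over the first T-1 clips (length K).
    """
    T, K = len(obs_lik), len(pi)
    alpha = [[0.0] * K for _ in range(T)]
    scale = [0.0] * T
    for t in range(T):
        for j in range(K):
            prior = pi[j] if t == 0 else sum(alpha[t - 1][i] * A[i][j] for i in range(K))
            alpha[t][j] = prior * obs_lik[t][j]
        scale[t] = sum(alpha[t])  # normalizer keeps long sequences from underflowing
        alpha[t] = [a / scale[t] for a in alpha[t]]

    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(K):
            beta[t][i] = sum(
                A[i][j] * obs_lik[t + 1][j] * beta[t + 1][j] for j in range(K)
            ) / scale[t + 1]

    counts = [[0.0] * K for _ in range(K)]
    occupancy = [0.0] * K
    for t in range(T - 1):
        for i in range(K):
            occupancy[i] += alpha[t][i] * beta[t][i]  # P(state_t = i | all clips)
            for j in range(K):
                # P(state_t = i, state_{t+1} = j | all clips)
                counts[i][j] += (
                    alpha[t][i] * A[i][j] * obs_lik[t + 1][j] * beta[t + 1][j] / scale[t + 1]
                )
    return counts, occupancy


def transition_fisher_vector(pi, A, obs_lik):
    """Flattened K*K score vector; its length does not depend on T.

    Assumes every A[i][j] > 0. The occupancy term is the assumed
    correction for the row-sum constraint on A.
    """
    counts, occupancy = expected_transition_stats(pi, A, obs_lik)
    K = len(pi)
    return [counts[i][j] / A[i][j] - occupancy[i] for i in range(K) for j in range(K)]
```

Videos of different lengths map to vectors of the same dimension (K squared for K concepts), which is what allows a standard linear classifier to be trained on top of them.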
Figure 4. High level events and their top rated descriptions based on activity concept transitions. The three concept transitions with the highest responses are listed for each event.
attempting a board trick: (jumping, jumping), (sliding, jumping), (flipping, jumping)
feeding an animal: (reeling, reeling), (hands visible, hands visible), (walking, animal eating)
landing a fish: (reeling, reeling), (casting, reeling), (hands visible, reeling)
wedding ceremony: (kissing, kissing), (pouring, kissing), (hugging, kissing)
woodworking project: (hands visible, hands visible), (carving, carving), (carving, hands visible)
birthday party: (singing, singing), (singing, jumping), (clapping, singing)
changing a vehicle tire: (turning wrench, turning wrench), (hands visible, turning wrench), (fitting bolts, turning wrench)
flash mob gathering: (marching, marching), (dancing, marching), (dancing, dancing)
getting a vehicle unstuck: (steering, steering), (vehicle moving, steering), (vehicle moving, vehicle moving)
grooming an animal: (washing, washing), (hands visible, washing), (hands visible, hands visible)
making a sandwich: (hands visible, hands visible), (spreading cream, spreading cream), (spreading cream, hands visible)
parade: (marching, marching), (dancing, marching), (walking, marching)
parkour: (jumping, jumping), (flipping, jumping), (dancing, jumping)
repairing an appliance: (hands visible, hands visible), (turning wrench, hands visible), (pointing, hands visible)
sewing project: (sewing, sewing), (hands visible, sewing), (eating, sewing)
[1] http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.11.org.html. 4
[2] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 2008. 3
[3] A. Farhadi, I. Endres, D. Hoiem, and D. A. Forsyth. Describing objects by their attributes. In CVPR, 2009. 2
[4] A. Gaidon, Z. Harchaoui, and C. Schmid. Actom sequence models for efficient action detection. In CVPR, 2011. 2
[5] H. Izadinia and M. Shah. Recognizing complex events using large margin joint low-level event model. In ECCV, 2012. 2, 4, 5, 6, 7
[6] T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 7(1-2), 2000. 2
[7] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In NIPS, 1999. 2, 3
[8] A. Kläser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC, 2008. 1
[9] I. Laptev. On space-time interest points. IJCV, 64(2-3):107–123, 2005. 1, 2
[10] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008. 1
[11] L.-J. Li, H. Su, E. P. Xing, and F.-F. Li. Object bank: A high-level image representation for scene classification & semantic feature sparsification. In NIPS, 2010. 2
[12] J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In CVPR, 2011. 2
[13] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004. 1, 2
[14] P. Natarajan, S. Vitaladevuni, U. Park, S. Wu, V. Manohar, X. Zhuang, S. Tsakalidis, R. Prasad, and P. Natarajan. Multimodal feature fusion for robust event detection in web videos. In CVPR, 2012. 5
[15] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007. 2, 5, 6
[16] L. R. Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. In Proceedings of the IEEE, 1989. 4
[17] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local svm approach. In ICPR, 2004. 1
[18] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01. 5
[19] C. Sun and R. Nevatia. Large-scale web video event classification by use of fisher vectors. In WACV, 2013. 2, 5
[20] A. Tamrakar, S. Ali, Q. Yu, J. Liu, O. Javed, A. Divakaran, H. Cheng, and H. S. Sawhney. Evaluation of low-level features and their combinations for complex event detection in open source videos. In CVPR, 2012. 2
[21] K. Tang, F.-F. Li, and D. Koller. Learning latent temporal structure for complex event detection. In CVPR, 2012. 2
[22] L. Torresani, M. Szummer, and A. W. Fitzgibbon. Efficient object category recognition using classemes. In ECCV, 2010. 2
[23] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011. 1, 2, 5
[24] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009. 2