39 iccv-2013-Action Recognition with Improved Trajectories


Source: pdf

Author: Heng Wang, Cordelia Schmid

Abstract: Recently dense trajectories were shown to be an efficient video representation for action recognition and achieved state-of-the-art results on a variety of datasets. This paper improves their performance by taking camera motion into account to correct them. To estimate camera motion, we match feature points between frames using SURF descriptors and dense optical flow, which are shown to be complementary. These matches are then used to robustly estimate a homography with RANSAC. Human motion is in general different from camera motion and generates inconsistent matches. To improve the estimation, a human detector is employed to remove these matches. Given the estimated camera motion, we remove trajectories consistent with it. We also use this estimation to cancel out camera motion from the optical flow. This significantly improves motion-based descriptors, such as HOF and MBH. Experimental results on four challenging action datasets (i.e., Hollywood2, HMDB51, Olympic Sports and UCF50) significantly outperform the current state of the art.
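A minimal sketch of the camera-motion compensation the abstract describes, written in Python with OpenCV: match keypoints between consecutive frames, fit a homography with RANSAC, warp the previous frame by the estimated camera motion, and recompute dense Farnebäck flow so the residual flow reflects mostly foreground motion. The paper uses SURF keypoints (available in opencv-contrib only), so ORB is substituted here as a freely available stand-in; the human-detector masking step is omitted, and all function and variable names are illustrative, not from the authors' code.

import cv2
import numpy as np

def estimate_camera_motion(prev_gray, curr_gray):
    """Fit a homography between two gray frames from feature matches."""
    detector = cv2.ORB_create(nfeatures=2000)  # the paper uses SURF
    kp1, des1 = detector.detectAndCompute(prev_gray, None)
    kp2, des2 = detector.detectAndCompute(curr_gray, None)
    if des1 is None or des2 is None:
        return None
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    if len(matches) < 4:  # a homography needs at least 4 correspondences
        return None
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # RANSAC rejects matches inconsistent with a single global motion,
    # e.g. those generated by moving humans.
    H, _inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H

def compensated_flow(prev_gray, curr_gray, H):
    """Warp the previous frame by the camera motion, then recompute
    dense flow; the residual feeds descriptors such as HOF and MBH."""
    h, w = prev_gray.shape
    warped_prev = cv2.warpPerspective(prev_gray, H, (w, h))
    return cv2.calcOpticalFlowFarneback(
        warped_prev, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)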


reference text

[1] J. Aggarwal and M. Ryoo. Human activity analysis: A review. ACM Computing Surveys, 43(3):16:1–16:43, 2011.

[2] R. Arandjelovic and A. Zisserman. Three things everyone should know to improve object retrieval. In CVPR, pages 2911–2918, 2012.

[3] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In ECCV, 2006.

[4] W. Brendel and S. Todorovic. Learning spatiotemporal graphs of human activities. In ICCV, 2011.

[5] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011.

[6] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, 2006.

[7] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In IEEE Workshop Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.

[8] G. Farnebäck. Two-frame motion estimation based on polynomial expansion. In SCIA, 2003.

[9] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE PAMI, 32(9):1627–1645, 2010.

[10] V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In CVPR, 2008.

[11] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

[12] A. Gaidon, Z. Harchaoui, and C. Schmid. Recognizing activities with cluster-trees of tracklets. In BMVC, 2012.

[13] S. Gauglitz, T. Höllerer, and M. Turk. Evaluation of interest point detectors and feature descriptors for visual tracking. IJCV, 94(3):335–360, 2011.

[14] M. Jain, H. Jégou, and P. Bouthemy. Better exploiting motion for better action recognition. In CVPR, 2013.

[15] Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based modeling of human actions with motion reference points. In ECCV, 2012.

[16] A. Kläser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008.

[17] O. Kliper-Gross, Y. Gurovich, T. Hassner, and L. Wolf. Motion interchange patterns for action recognition in unconstrained videos. In ECCV, 2012.

[18] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, pages 2556–2563, 2011.

[19] I. Laptev. On space-time interest points. IJCV, 64(2-3):107–123, 2005.

[20] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.

[21] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos in the wild. In CVPR, 2009.

[22] M. Marszałek, I. Laptev, and C. Schmid. Actions in context. In CVPR, 2009.

[23] S. Mathe and C. Sminchisescu. Dynamic eye movement datasets and learnt saliency models for visual action recognition. In ECCV, 2012.

[24] J. C. Niebles, C.-W. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, 2010.

[25] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42(3):145–175, 2001.

[26] D. Oneata, J. Verbeek, and C. Schmid. Action and event recognition with Fisher vectors on a compact feature set. In ICCV, 2013.

[27] D. Park, C. L. Zitnick, D. Ramanan, and P. Dollár. Exploring weak stabilization for motion feature extraction. In CVPR, 2013.

[28] A. Patron-Perez, M. Marszalek, I. Reid, and A. Zisserman. Structured learning of human interactions in TV shows. IEEE PAMI, 2012.

[29] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, 2010.

[30] A. Prest, C. Schmid, and V. Ferrari. Weakly supervised learning of interactions between humans and objects. IEEE PAMI, 34(3):601–614, 2012.

[31] K. Reddy and M. Shah. Recognizing 50 human action categories of web videos. Machine Vision and Applications, pages 1–11, 2012.

[32] S. Sadanand and J. J. Corso. Action bank: A high-level representation of activity in video. In CVPR, 2012.

[33] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and its application to action recognition. In ACM Conference on Multimedia, 2007.

[34] F. Shi, E. Petriu, and R. Laganiere. Sampling strategies for real-time action recognition. In CVPR, 2013.

[35] J. Shi and C. Tomasi. Good features to track. In CVPR, 1994.

[36] B. Solmaz, S. M. Assari, and M. Shah. Classifying web videos using a global video descriptor. Machine Vision and Applications, pages 1–13, 2012.

[37] R. Szeliski. Image alignment and stitching: a tutorial. Foundations and Trends in Computer Graphics and Vision, 2(1):1–104, 2006.

[38] H. Uemura, S. Ishikawa, and K. Mikolajczyk. Feature tracking and motion compensation for action recognition. In BMVC, 2008.

[39] E. Vig, M. Dorr, and D. Cox. Space-variant descriptor sampling for action recognition based on saliency and eye movements. In ECCV, 2012.

[40] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1):60–79, 2013.

[41] G. Willems, T. Tuytelaars, and L. Van Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In ECCV, 2008.

[42] S. Wu, O. Oreifej, and M. Shah. Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories. In ICCV, 2011.

[43] L. Yeffet and L. Wolf. Local trinary patterns for human action recognition. In ICCV, 2009.