cvpr2013-291: reference list (CVPR 2013)
Source: pdf
Author: LiMin Wang, Yu Qiao, Xiaoou Tang
Abstract: This paper proposes motionlet, a mid-level and spatiotemporal part, for human motion recognition. Motionlet can be seen as a tight cluster in motion and appearance space, corresponding to the moving process of different body parts. We postulate three key properties of motionlet for action recognition: high motion saliency, multiple scale representation, and representative-discriminative ability. Towards this goal, we develop a data-driven approach to learn motionlets from training videos. First, we extract 3D regions with high motion saliency. Then we cluster these regions and preserve the centers as candidate templates for motionlet. Finally, we examine the representative and discriminative power of the candidates, and introduce a greedy method to select effective candidates. With motionlets, we present a mid-level representation for video, called motionlet activation vector. We conduct experiments on three datasets, KTH, HMDB51, and UCF50. The results show that the proposed methods significantly outperform state-of-the-art methods.
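To make the pipeline in the abstract concrete, the sketch below walks its four stages on a toy grayscale clip: extract 3D windows with high motion saliency, cluster them to obtain candidate templates, greedily select a compact set, and max-pool template responses into a motionlet activation vector. Everything in it is an illustrative stand-in rather than the paper's implementation: the frame-difference saliency proxy, the plain k-means in place of the paper's motion/appearance clustering, the redundancy-penalized greedy score, the cosine-response pooling, and all function names, window sizes, and parameters are assumptions.

import numpy as np

def iter_windows(video, size=(8, 16, 16), stride=(4, 8, 8)):
    # Yield 3D sub-volumes (windows) of a (T, H, W) video.
    (dt, dh, dw), (st, sh, sw) = size, stride
    T, H, W = video.shape
    for t in range(0, T - dt + 1, st):
        for y in range(0, H - dh + 1, sh):
            for x in range(0, W - dw + 1, sw):
                yield video[t:t+dt, y:y+dh, x:x+dw]

def motion_saliency(cube):
    # Stand-in saliency score: mean absolute temporal gradient.
    # The paper's actual saliency measure is not given in the abstract.
    return np.abs(np.diff(cube, axis=0)).mean()

def salient_regions(video, thresh=0.05, **kw):
    # Step 1: keep only 3D regions with high motion saliency.
    return np.array([c.ravel() for c in iter_windows(video, **kw)
                     if motion_saliency(c) > thresh])

def kmeans_centers(X, k=20, iters=25, seed=0):
    # Step 2: cluster salient regions; the centers become candidate
    # motionlet templates (plain k-means on raw voxels here; the paper
    # clusters in a joint motion and appearance space).
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        d = (X**2).sum(1)[:, None] - 2 * X @ C.T + (C**2).sum(1)[None, :]
        a = d.argmin(1)
        for j in range(k):
            if (a == j).any():
                C[j] = X[a == j].mean(0)
    return C

def greedy_select(C, scores, budget=8, redundancy=0.5):
    # Step 3: greedily pick candidates by score while penalizing overlap
    # with already-selected templates, a stand-in for the paper's
    # representative/discriminative selection criterion.
    Cn = C / (np.linalg.norm(C, axis=1, keepdims=True) + 1e-8)
    chosen = []
    for _ in range(min(budget, len(C))):
        best, best_val = -1, -np.inf
        for j in range(len(C)):
            if j in chosen:
                continue
            overlap = max((Cn[j] @ Cn[i] for i in chosen), default=0.0)
            val = scores[j] - redundancy * overlap
            if val > best_val:
                best, best_val = j, val
        chosen.append(best)
    return C[chosen]

def activation_vector(video, motionlets, **kw):
    # Step 4: motionlet activation vector, i.e. the max cosine response
    # of each template over all windows (max pooling over space-time).
    W = np.array([c.ravel() for c in iter_windows(video, **kw)])
    Wn = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-8)
    Mn = motionlets / (np.linalg.norm(motionlets, axis=1, keepdims=True) + 1e-8)
    return (Wn @ Mn.T).max(axis=0)

# Toy usage on random data standing in for a grayscale clip.
video = np.random.rand(32, 64, 64)
regions = salient_regions(video, thresh=0.05)
candidates = kmeans_centers(regions, k=20)
scores = np.random.rand(len(candidates))  # placeholder quality scores
motionlets = greedy_select(candidates, scores, budget=8)
print(activation_vector(video, motionlets).shape)  # (8,)

In the paper the selection score comes from examining the representative and discriminative power of each candidate; the random placeholder scores above only mark where that criterion would plug in.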
[1] E. Adelson and J. Bergen. Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2(2):284–299, 1985.
[2] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Comput. Surv., 43(3):16, 2011.
[3] A. F. Bobick and J. W. Davis. The recognition of human movement using temporal templates. TPAMI, 23(3):257–267, 2001.
[4] L. D. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In ICCV, 2009.
[5] W. Brendel and S. Todorovic. Learning spatiotemporal graphs of human activities. In ICCV, 2011.
[6] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM TIST, 2:27:1–27:27, 2011.
[7] G. K. M. Cheung, S. Baker, and T. Kanade. Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In CVPR, 2003.
[8] K. G. Derpanis, M. Sizintsev, K. J. Cannons, and R. P. Wildes. Efficient action spotting based on a spacetime oriented structure representation. In CVPR, 2010.
[9] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, 2005.
[10] P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 32(9):1627–1645, 2010.
[11] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1):55–79, 2005.
[12] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. TPAMI, 13(9):891–906, 1991.
[13] B. J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 315:972–976, 2007.
[14] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. TPAMI, 29(12):2247–2253, 2007.
[15] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically inspired system for action recognition. In ICCV, 2007.
[16] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. In ICML, 2010.
[17] A. Kläser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC, 2008.
[18] O. Kliper-Gross, Y. Gurovich, T. Hassner, and L. Wolf. Motion interchange patterns for action recognition in unconstrained videos. In ECCV, 2012.
[19] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
[20] I. Laptev. On space-time interest points. IJCV, 64(2-3):107–123, 2005.
[21] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[22] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
[23] J. C. Niebles, C.-W. Chen, and F.-F. Li. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, 2010.
[24] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42(3):145–175, 2001.
[25] M. Raptis, I. Kokkinos, and S. Soatto. Discovering discriminative action parts from mid-level video representations. In CVPR, 2012.
[26] K. K. Reddy and M. Shah. Recognizing 50 human action categories of web videos. MVAJ, 2012.
[27] S. Sadanand and J. J. Corso. Action bank: A high-level representation of activity in video. In CVPR, 2012. Available at http://www.cse.buffalo.edu/~jcorso/r/actionbank/
[28] C. Schüldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In ICPR, 2004.
[29] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In ECCV, 2010.
[30] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
[31] X. Wang, L. Wang, and Y. Qiao. A comparative study of encoding, pooling and normalization methods for action recognition. In ACCV, 2012.
[32] G. Willems, T. Tuytelaars, and L. J. V. Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In ECCV, 2008.