iccv iccv2013 iccv2013-37 iccv2013-37-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shugao Ma, Jianming Zhang, Nazli Ikizler-Cinbis, Stan Sclaroff
Abstract: We propose Hierarchical Space-Time Segments as a new representation for action recognition and localization. This representation has a two-level hierarchy. The first level comprises the root space-time segments that may contain a human body. The second level comprises multi-grained space-time segments that contain parts of the root. We present an unsupervised method to generate this representation from video, which extracts both static and non-static relevant space-time segments and preserves their hierarchical and temporal relationships. Using a simple linear SVM on the resulting bag of hierarchical space-time segments, we attain action recognition performance that is better than, or comparable to, the state of the art on two challenging benchmark datasets, and at the same time produce good action localization results.
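The abstract describes a two-level segment hierarchy and a bag-of-segments classification step. As a reading aid, below is a minimal Python sketch of that structure, not the authors' implementation: the names SpaceTimeSegment, flatten, bag_of_segments, descriptor, and CODEBOOK_SIZE are all hypothetical, and the unsupervised segment extraction itself (the core of the paper) is assumed to have already produced the root segments and their parts.

# Minimal sketch of the two-level Hierarchical Space-Time Segments idea.
# All names below are hypothetical illustrations, not the authors' code.
# Assumes numpy and scikit-learn; the segment descriptors and codebook
# stand in for whatever features the extraction stage actually produces.
from dataclasses import dataclass, field
from typing import List
import numpy as np
from sklearn.svm import LinearSVC

CODEBOOK_SIZE = 1000  # hypothetical vocabulary size

@dataclass
class SpaceTimeSegment:
    # A track of per-frame regions summarized by one feature vector.
    descriptor: np.ndarray
    # Second-level (part) segments contained in this root segment.
    children: List["SpaceTimeSegment"] = field(default_factory=list)

def flatten(seg: SpaceTimeSegment) -> List[SpaceTimeSegment]:
    # Collect the segment and all of its descendants.
    out = [seg]
    for child in seg.children:
        out.extend(flatten(child))
    return out

def bag_of_segments(roots: List[SpaceTimeSegment], codebook: np.ndarray) -> np.ndarray:
    # Quantize every segment (roots and parts alike) against the codebook
    # and build an L1-normalized word histogram for the whole video.
    hist = np.zeros(len(codebook))
    for root in roots:
        for seg in flatten(root):
            word = int(np.argmin(np.linalg.norm(codebook - seg.descriptor, axis=1)))
            hist[word] += 1.0
    total = hist.sum()
    return hist / total if total > 0 else hist

# Training, as in the abstract: one histogram per video, then a linear SVM.
# X = np.stack([bag_of_segments(v, codebook) for v in training_videos])
# clf = LinearSVC(C=1.0).fit(X, labels)

In this sketch the hierarchy only needs two levels (roots and their parts), but flatten is written recursively, so deeper part trees would also be counted in the histogram; the codebook and descriptor are left open, since the paper's own features are produced by its segment extraction stage.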
[1] P. Arbelaez, M. Maire, C. C. Fowlkes, and J. Malik. From contours to regions: An empirical evaluation. In CVPR, 2009.
[2] A. F. Bobick and J. W. Davis. The recognition of human movement using temporal templates. TPAMI, 23(3):257–267, 2001.
[3] M. Bregonzio, S. Gong, and T. Xiang. Recognising action as clouds of space-time interest points. In CVPR, 2009.
[4] W. Brendel and S. Todorovic. Learning spatiotemporal graphs of human activities. In ICCV, 2011.
[5] C. Xu, C. Xiong, and J. J. Corso. Streaming hierarchical video segmentation. In ECCV, 2012.
[6] A. Gaidon, Z. Harchaoui, and C. Schmid. Recognizing activities with cluster-trees of tracklets. In BMVC, 2012.
[7] A. Gilbert, J. Illingworth, and R. Bowden. Fast realistic multi-action recognition using mined dense spatio-temporal features. In ICCV, 2009.
[8] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. TPAMI, 29(12):2247–2253, 2007.
[9] M. Grundmann, V. Kwatra, M. Han, and I. A. Essa. Efficient hierarchical graph-based video segmentation. In CVPR, 2010.
[10] N. Ikizler-Cinbis and S. Sclaroff. Object, scene and actions: Combining multiple features for human action recognition. In ECCV, 2010.
[11] A. Kovashka and K. Grauman. Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In CVPR, 2010.
[12] T. Lan, Y. Wang, and G. Mori. Discriminative figure-centric models for joint action localization and recognition. In ICCV, 2011.
[13] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[Figure: Yellow boxes outline the extracted segments; inclusion of one box in another indicates a parent-child relationship; the region covered by the red mask is the action localization output. The first five rows are from UCF-Sports and the last two from HighFive.]
[14] Y. J. Lee, J. Kim, and K. Grauman. Key-segments for video object segmentation. In ICCV, 2011.
[15] M. Leordeanu, R. Sukthankar, and C. Sminchisescu. Efficient closed-form solution to generalized boundary detection. In ECCV, 2012.
[16] T. Ma and L. J. Latecki. Maximum weight cliques with mutex constraints for video object segmentation. In CVPR, 2012.
[17] A. Patron-Perez, M. Marszalek, A. Zisserman, and I. D. Reid. High five: Recognising human interactions in TV shows. In BMVC, 2010.
[18] M. Raptis, I. Kokkinos, and S. Soatto. Discovering discriminative action parts from mid-level video representations. In CVPR, 2012.
[19] M. D. Rodriguez, J. Ahmed, and M. Shah. Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. In CVPR, 2008.
[20] D. Tran and J. Yuan. Optimal spatio-temporal path discovery for video event detection. In CVPR, 2011.
[21] D. Tran and J. Yuan. Max-margin structured output regression for spatio-temporal action localization. In NIPS, 2012.
[22] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011.
[23] Y. Wang, K. Huang, and T. Tan. Human activity recognition based on R transform. In CVPR, 2007.
[24] X. Wu, D. Xu, L. Duan, and J. Luo. Action recognition using context and appearance distribution features. In CVPR, 2011.
[25] Y. Xie, H. Chang, Z. Li, L. Liang, X. Chen, and D. Zhao. A unified framework for locating and recognizing human actions. In CVPR, 2011.
[26] Y. Wang, D. Tran, Z. Liao, and D. Forsyth. Discriminative hierarchical part-based models for human parsing and action recognition. JMLR, 13:3075–3102, 2012.