cvpr cvpr2013 cvpr2013-355 cvpr2013-355-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis
Abstract: How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatio-temporal patch in the video. What defines these spatio-temporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that they establish correspondence across videos and align the videos for label-transfer techniques. Furthermore, these patches can be used as a discriminative vocabulary for action classification, where they demonstrate state-of-the-art performance on the UCF50 and Olympics datasets.
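The mining step described in the abstract can be pictured as an iterative cluster-then-train loop over candidate patches. The following is a minimal Python sketch of such a loop, assuming candidate spatio-temporal patches have already been encoded as fixed-length feature vectors (e.g., HOG3D descriptors as in [13]); the function name, hyperparameters, and final ranking heuristic are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def mine_discriminative_patches(pos_feats, neg_feats,
                                n_clusters=20, n_iters=3, top_k=5):
    # Cluster candidate patch descriptors from videos of one action class.
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(pos_feats)
    detectors = []
    for _ in range(n_iters):
        detectors = []
        for c in range(n_clusters):
            members = pos_feats[clusters == c]
            if len(members) < 3:  # drop degenerate clusters
                continue
            # Train a linear SVM detector: cluster members vs. a large
            # negative set drawn from videos of other classes.
            X = np.vstack([members, neg_feats])
            y = np.r_[np.ones(len(members)), -np.ones(len(neg_feats))]
            detectors.append(LinearSVC(C=0.1).fit(X, y))
        # Re-assign every positive patch to its highest-scoring detector,
        # so clusters and detectors co-evolve across iterations.
        scores = np.stack([d.decision_function(pos_feats) for d in detectors],
                          axis=1)
        clusters = scores.argmax(axis=1)
        n_clusters = len(detectors)
    # Keep the detectors that fire least on the negative set (a crude
    # proxy for discriminativeness; the paper's selection criteria are
    # richer than this single score).
    fp_free = [d.score(neg_feats, -np.ones(len(neg_feats))) for d in detectors]
    order = np.argsort(fp_free)[::-1][:top_k]
    return [detectors[i] for i in order]

The selected detectors can then serve as the discriminative vocabulary the abstract mentions: their maximal responses over a test video form a feature vector for action classification.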
[1] A. Bobick and J. Davis. The recognition of human movement using temporal templates. PAMI, 2001. 2
[2] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3D human pose annotations. In ICCV, 2009. 1, 2
[3] W. Brendel and S. Todorovic. Learning spatiotemporal graphs of human activities. In ICCV, 2011. 6, 7
[4] V. Delaitre, J. Sivic, and I. Laptev. Learning person-object interactions for action recognition in still images. In NIPS, 2011. 1
[5] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros. What makes Paris look like Paris? ACM Transactions on Graphics (SIGGRAPH), 2012. 3
[6] A. A. Efros, A. C. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In ICCV, 2003. 1
[7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 2010. 1, 2
[8] A. Gaidon, Z. Harchaoui, and C. Schmid. Recognizing activities with cluster-trees of tracklets. In BMVC, 2012. 7
[9] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In ICCV, 2005. 2
[10] A. Gupta, A. Kembhavi, and L. S. Davis. Observing human-object interactions: Using spatial and functional compatibility for recognition. PAMI, 2009. 1, 2
[11] A. Gupta, P. Srinivasan, J. Shi, and L. S. Davis. Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos. In CVPR, 2009. 1
[12] Y. Ke, R. Sukthankar, and M. Hebert. Event detection in crowded videos. In ICCV, 2007. 2
[13] A. Kläser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC, 2008. 3
[14] H. Koppula, R. Gupta, and A. Saxena. Learning human activities and object affordances from rgb-d videos. IJRR, 2013. 2
[15] A. Kovashka and K. Grauman. Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In CVPR, 2010. 2
[16] T. Lan, Y. Wang, and G. Mori. Discriminative figure-centric models for joint action localization and recognition. In ICCV, 2011. 2
[17] I. Laptev. On space-time interest points. IJCV, 2005. 2
[18] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008. 6, 7
[19] M. Leordeanu, M. Hebert, and R. Sukthankar. An integer projected fixed point method for graph matching and MAP inference. In NIPS, 2009. 5
[20] L.-J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS, 2010. 4
[21] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos in the wild. In CVPR, 2009. 2
[22] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-svms for object detection and beyond. In ICCV, 2011. 1, 3
[23] J. Niebles, C. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, 2010. 1, 2, 6, 7
[24] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. IJCV, 2008. 2, 5
[25] M. Raptis, I. Kokkinos, and S. Soatto. Discovering discriminative action parts from mid-level video representations. In CVPR, 2012. 2
[26] S. Sadanand and J. J. Corso. Action bank: A high-level representation of activity in video. In CVPR, 2012. 2, 6
[27] S. Satkin and M. Hebert. Modeling the temporal extent of actions. In ECCV, 2010. 1
[28] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional sift descriptor and its application to action recognition. In ACM Multimedia, 2007. 2
[29] N. Shapovalova, A. Vahdat, K. Cannons, T. Lan, and G. Mori. Similarity constrained latent support vector machine: An application to weakly supervised action classification. In ECCV, 2012. 2
[30] Z. Si, M. Pei, and S. Zhu. Unsupervised learning of event and-or grammar and semantics from video. In ICCV, 2011. 1
[Figure 7. Example alignment for the Olympics Dataset. Once videos are aligned, annotations such as object bounding boxes and human poses can be obtained by simple label transfer.]
[31] S. Singh, A. Gupta, and A. A. Efros. Unsupervised discovery of mid-level discriminative patches. In ECCV, 2012. 1, 2
[32] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009. 2
[33] Y. Wang and G. Mori. Hidden part models for human action recognition: Probabilistic versus max margin. PAMI, 2011. 2
[34] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. J. Guibas, and L. Fei-Fei. Action recognition by learning bases of action attributes and parts. In ICCV, 2011. 2