cvpr2013-175-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michael S. Ryoo, Larry Matthies
Abstract: This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint. The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video inputs. These include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects at the observer’, whose videos involve a large amount of camera ego-motion caused by physical interactions. The paper investigates multichannel kernels to integrate global and local motion information, and presents a new activity learning/recognition methodology that explicitly considers temporal structures displayed in first-person activity videos. In our experiments, we not only show classification results with segmented videos, but also confirm that our new approach is able to detect activities from continuous videos reliably.
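The ‘multichannel kernels’ mentioned in the abstract combine per-channel video similarities (e.g., a global motion descriptor channel and a local spatio-temporal feature channel) into a single kernel for an SVM, in the spirit of Zhang et al. [19]. Below is a minimal Python sketch assuming each channel is encoded as a fixed-length histogram per video; the exponentiated chi-square kernel, the uniform channel weights, and the array names (global_hists, local_hists) are illustrative assumptions, not the paper's exact formulation.

import numpy as np
from sklearn.svm import SVC

def chi2_kernel(X, Y, gamma=1.0):
    # Exponentiated chi-square kernel between rows of X and Y:
    #   K(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i))
    # A common choice for comparing histogram features.
    d = np.sum((X[:, None, :] - Y[None, :, :]) ** 2
               / (X[:, None, :] + Y[None, :, :] + 1e-10), axis=2)
    return np.exp(-gamma * d)

def multichannel_kernel(chans_a, chans_b, weights):
    # Weighted sum of per-channel kernels. Each list element is an
    # (n_videos, dim_c) histogram matrix for one channel, e.g. global
    # camera motion vs. local spatio-temporal visual words.
    K = np.zeros((chans_a[0].shape[0], chans_b[0].shape[0]))
    for Xa, Xb, w in zip(chans_a, chans_b, weights):
        K += w * chi2_kernel(Xa, Xb)
    return K

# Hypothetical usage with a precomputed-kernel SVM:
# K_train = multichannel_kernel([global_hists, local_hists],
#                               [global_hists, local_hists],
#                               weights=[0.5, 0.5])
# clf = SVC(kernel='precomputed').fit(K_train, labels)

At test time the same function is applied between the test videos' channels and the training videos' channels, and the resulting matrix is passed to clf.predict.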
[1] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Computing Surveys, 43:16:1–16:43, April 2011.
[2] J. Choi, W. Jeon, and S. Lee. Spatio-temporal pyramid matching for sports videos. In ACM MIR, 2008.
[3] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In IEEE Workshop on VS-PETS, 2005.
[4] A. Fathi, A. Farhadi, and J. M. Rehg. Understanding egocentric activities. In ICCV, 2011.
[5] A. Fathi, J. Hodgins, and J. Rehg. Social interactions: A first-person perspective. In CVPR, 2012.
[6] K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto. Fast unsupervised ego-action learning for first-person sports videos. In CVPR, 2011.
[7] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[8] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[9] J. Niebles, C. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, 2010.
[10] H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In CVPR, 2012.
[11] M. S. Ryoo. Human activity prediction: Early recognition of ongoing activities from streaming videos. In ICCV, 2011.
[12] M. S. Ryoo and J. K. Aggarwal. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In ICCV, 2009.
[13] M. S. Ryoo and J. K. Aggarwal. Stochastic representation and recognition of high-level group activities. IJCV, 93(2):183–200, 2011.
[14] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In ICPR, 2004.
[15] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. In NIPS, 2002.
[16] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from a single depth image. In CVPR, 2011.
[17] Z. Si, M. Pei, B. Yao, and S. Zhu. Unsupervised learning of event AND-OR grammar and semantics from video. In ICCV, 2011.
[18] T. Wu, C. Lin, and R. Weng. Probability estimates for multi-class classification by pairwise coupling. JMLR, 5:975–1005, 2004.
[19] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. IJCV, 73:213–238, April 2007.