jmlr jmlr2013 jmlr2013-56 jmlr2013-56-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sean Ryan Fanello, Ilaria Gori, Giorgio Metta, Francesca Odone
Abstract: Sparsity has been shown to be one of the most important properties for visual recognition purposes. In this paper we show that sparse representation plays a fundamental role in achieving one-shot learning and real-time recognition of actions. We start off from RGBD images, combine motion and appearance cues, and extract state-of-the-art features in a computationally efficient way. The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from data. We then propose simultaneous on-line video segmentation and recognition of actions using linear SVMs. The main contribution of the paper is an effective real-time system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques to represent 3D actions. We obtain very good results on three different data sets: a benchmark data set for one-shot action learning (the ChaLearn Gesture Data Set), an in-house data set acquired by a Kinect sensor including complex actions and gestures differing by small details, and a data set created for human-robot interaction purposes. Finally, we demonstrate that our system is also effective in a human-robot interaction setting and propose a memory game, “All Gestures You Can”, to be played against a humanoid robot. Keywords: real-time action recognition, sparse representation, one-shot action learning, human robot interaction
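To make the pipeline described in the abstract concrete, below is a minimal sketch of the descriptor-to-classifier chain: per-frame descriptors are sparse-coded over a learned dictionary and then classified with a linear SVM. This is not the authors' implementation; it assumes the 3DHOF+GHOG frame descriptors are already computed and stacked into NumPy arrays, and it uses scikit-learn's DictionaryLearning and LinearSVC as stand-ins for the paper's adaptive sparse coding and linear SVM stages. All function names and parameter values here are illustrative.

```python
# Sketch only: scikit-learn dictionary learning + sparse coding + linear SVM
# as a stand-in for the paper's adaptive sparse coding / linear SVM stages.
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode
from sklearn.svm import LinearSVC

def learn_dictionary(descriptors, n_atoms=256, alpha=1.0):
    """Learn a dictionary from raw frame descriptors (n_frames x d)."""
    dl = DictionaryLearning(n_components=n_atoms, alpha=alpha, max_iter=50)
    dl.fit(descriptors)
    return dl.components_          # dictionary atoms: (n_atoms, d)

def encode(descriptors, dictionary, alpha=1.0):
    """Sparse-code each frame descriptor over the learned dictionary."""
    return sparse_encode(descriptors, dictionary,
                         algorithm="lasso_lars", alpha=alpha)

def train_action_classifier(codes, frame_labels):
    """Train a per-frame linear SVM (one-vs-rest) on the sparse codes."""
    clf = LinearSVC(C=1.0)
    clf.fit(codes, frame_labels)
    return clf

# Illustrative usage (shapes assumed):
#   X_train: (n_frames, d) stacked 3DHOF+GHOG descriptors, y_train: (n_frames,)
#   D = learn_dictionary(X_train)
#   clf = train_action_classifier(encode(X_train, D), y_train)
#   predictions = clf.predict(encode(X_test, D))
```

Frame-level predictions like these would still need a temporal aggregation / segmentation step (handled by the paper's on-line video segmentation) before whole actions can be recognized.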
J.K. Aggarwal and M.S. Ryoo. Human activity analysis: A review. ACM Computing Surveys, 2011.
A. Ali and J.K. Aggarwal. Segmentation and recognition of continuous human activity. IEEE Workshop on Detection and Recognition of Events in Video, 2001.
J. Alon, V. Athitsos, Y. Quan, and S. Sclaroff. A unified framework for gesture recognition and spatiotemporal gesture segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009.
A. Bisio, N. Stucchi, M. Jacono, L. Fadiga, and T. Pozzo. Automatic versus voluntary motor imitation: Effect of visual context and stimulus velocity. PLoS ONE, 2010.
A. F. Bobick and J. W. Davis. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001.
M. Bregonzio, S. Gong, and T. Xiang. Recognising action as clouds of space-time interest points. IEEE Conference on Computer Vision and Pattern Recognition, 2009.
M. J. Burden and D. B. Mitchell. Implicit memory development in school-aged children with attention deficit hyperactivity disorder (ADHD): Conceptual priming deficit? In Developmental Neurophysiology, 2005.
J. Cech, J. Sanchez-Riera, and R. Horaud. Scene flow estimation by growing correspondence seeds. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
ChaLearn Gesture Dataset (CGD2011). http://gesture.chalearn.org/data, 2011.
S.P. Chatzis, D. Kosmopoulos, and P. Doliotis. A conditional random field-based model for joint sequence segmentation and classification. In Pattern Recognition, 2013.
C. Cornoldi, A. Barbieri, C. Gaiani, and S. Zocchi. Strategic memory deficits in attention deficit disorder with hyperactivity participants: The role of executive processes. In Developmental Neurophysiology, 1999.
N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. IEEE Conference on Computer Vision and Pattern Recognition, 2005.
A. Destrero, C. De Mol, F. Odone, and A. Verri. A sparsity-enforcing method for learning face features. IEEE Transactions on Image Processing, 18:188–201, 2009.
A. Efros, A. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In International Conference on Computer Vision, 2003.
M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 2006.
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 2008.
S. R. Fanello, I. Gori, and F. Pirri. Arm-hand behaviours modelling: From attention to imitation. In International Symposium on Visual Computing, 2010.
G. Farnebäck. Two-frame motion estimation based on polynomial expansion. In Scandinavian Conference on Image Analysis, 2003.
J. Feng, B. Ni, Q. Tian, and S. Yan. Geometric lp-norm feature pooling for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
M. A. Giese and T. Poggio. Neural mechanisms for the recognition of biological movements. Nature Reviews Neuroscience, 2003.
L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29, 2007.
I. Gori, S. R. Fanello, F. Odone, and G. Metta. All gestures you can: A memory game against a humanoid robot. IEEE-RAS International Conference on Humanoid Robots, 2012.
R.D. Green and L. Guan. Continuous human activity recognition. Control, Automation, Robotics and Vision Conference, 2004.
I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
I. Guyon, V. Athitsos, P. Jangyodsuk, B. Hammer, and H. J. E. Balderas. ChaLearn gesture challenge: Design and first results. In Computer Vision and Pattern Recognition Workshops, 2012.
H. O. Hirschfeld. A connection between correlation and contingency. In Mathematical Proceedings of the Cambridge Philosophical Society, 1935.
H. Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 1981.
F. Huguet and F. Devernay. A variational method for scene flow estimation from stereo sequences. In International Conference on Computer Vision, 2007.
I. Laptev and T. Lindeberg. Space-time interest points. In IEEE International Conference on Computer Vision, 2003.
I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.
H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In Conference on Neural Information Processing Systems, 2007.
V. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics - Doklady, 1966.
W. Li, Z. Zhang, and Z. Liu. Action recognition based on a bag of 3D points. In Computer Vision and Pattern Recognition Workshops, 2010.
H.-Y.M. Liao, D.-Y. Chen, and S.-W. Shih. Continuous human action segmentation and recognition using a spatio-temporal probabilistic framework. IEEE International Symposium on Multimedia, 2006.
Y. M. Lui. A least squares regression framework on manifolds and its application to gesture recognition. In Computer Vision and Pattern Recognition Workshops, 2012.
F. Lv and R. Nevatia. Single view human action recognition using key pose matching and Viterbi path searching. In IEEE Conference on Computer Vision and Pattern Recognition, 2007.
U. Mahbub, H. Imtiaz, T. Roy, S. Rahman, and A. R. Ahad. Action recognition from one example. Pattern Recognition Letters, 2011.
J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. In IEEE Conference on Computer Vision and Pattern Recognition, 2008a.
J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration. IEEE Transactions on Image Processing, pages 53–69, 2008b.
M. R. Malgireddy, I. Inwogu, and V. Govindaraju. A temporal Bayesian model for classifying, detecting and localizing activities in video sequences. Computer Vision and Pattern Recognition Workshops, 2012.
G. Metta, P. Fitzpatrick, and L. Natale. YARP: Yet Another Robot Platform. International Journal of Advanced Robotic Systems, 2006.
G. Metta, G. Sandini, D. Vernon, L. Natale, and F. Nori. The iCub humanoid robot: An open platform for research in embodied cognition. In Workshop on Performance Metrics for Intelligent Systems, 2008.
D. Minnen, T. Westeyn, and T. Starner. Performance metrics and evaluation issues for continuous activity recognition. In Performance Metrics for Intelligent Systems Workshop, 2006.
D. L. Mumme. Early social cognition: Understanding others in the first months of life. Journal of Infant and Child Development, 2001.
P. Natarajan and R. Nevatia. Coupled hidden semi-Markov models for activity recognition. In Workshop on Motion and Video Computing, 2007.
B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 1997.
N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man and Cybernetics, 1979.
N. Papenberg, A. Bruhn, T. Brox, S. Didas, and J. Weickert. Highly accurate optic flow computation with theoretically justified warping. International Journal of Computer Vision, 2006.
R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 2010.
H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 1978.
H. J. Seo and P. Milanfar. A template matching approach of one-shot-learning gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
J. W. Shneider and P. Borlund. Matrix comparison, part 1: Motivation and important issues for measuring the resemblance between proximity measures or ordination results. In Journal of the American Society for Information Science and Technology, 2007.
J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from a single depth image. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
C. Stauffer and W. E. L. Grimson. Adaptive background mixture models for real-time tracking. IEEE Conference on Computer Vision and Pattern Recognition, 1999.
V. Vapnik. Statistical Learning Theory. John Wiley and Sons, Inc., 1998.
M. Varma and D. Ray. Learning the discriminative power-invariance trade-off. In IEEE International Conference on Computer Vision, 2007.
P. Viola and M.J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57:137–154, 2004.
J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.
J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu. Robust 3D action recognition with random occupancy patterns. European Conference on Computer Vision, 2012.
A. Wedel, T. Brox, T. Vaudrey, C. Rabe, U. Franke, and D. Cremers. Stereoscopic scene flow computation for 3D motion understanding. International Journal of Computer Vision, 2010.
G. Willems, T. Tuytelaars, and L. Van Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. European Conference on Computer Vision, 2008.
D. Wu, F. Zhu, and L. Shao. One shot learning gesture recognition from RGBD images. In Computer Vision and Pattern Recognition Workshops, 2012.
J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.