cvpr cvpr2013 cvpr2013-32 cvpr2013-32-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yale Song, Louis-Philippe Morency, Randall Davis
Abstract: Recent progress has shown that learning from hierarchical feature representations leads to improvements in various computer vision tasks. Motivated by the observation that human activity data contains information at various temporal resolutions, we present a hierarchical sequence summarization approach for action recognition that learns multiple layers of discriminative feature representations at different temporal granularities. We build up a hierarchy dynamically and recursively by alternating sequence learning and sequence summarization. For sequence learning we use CRFs with latent variables to learn hidden spatiotemporal dynamics; for sequence summarization we group observations that have similar semantic meaning in the latent space. For each layer we learn an abstract feature representation through non-linear gate functions. This procedure is repeated to obtain a hierarchical sequence summary representation. We develop an efficient learning method to train our model and show that its complexity grows sublinearly with the size of the hierarchy. Experimental results show the effectiveness of our approach, achieving the best published results on the ArmGesture and Canal9 datasets.
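The summarization step described in the abstract (grouping observations that are similar in the latent space, then passing a shorter sequence to the next layer) can be illustrated with a minimal sketch. This is not the authors' procedure: the greedy adjacent-frame merge, the L1 distance, the `threshold` parameter, and the function name `summarize_sequence` are all assumptions made for illustration, and the CRF learning step and the non-linear gate functions are omitted. The latent posteriors are taken as given inputs.

```python
from typing import List, Sequence, Tuple


def summarize_sequence(
    features: Sequence[Sequence[float]],
    posteriors: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> Tuple[List[List[int]], List[List[float]]]:
    """Group adjacent time steps with similar latent posteriors (hypothetical sketch).

    features[t]   : observation vector at time t
    posteriors[t] : distribution over latent states at time t (assumed given)
    threshold     : maximum L1 distance for merging neighbors (assumed parameter)
    """
    if len(features) != len(posteriors):
        raise ValueError("features and posteriors must have the same length")
    if not features:
        return [], []

    def l1(p: Sequence[float], q: Sequence[float]) -> float:
        return sum(abs(a - b) for a, b in zip(p, q))

    # Greedy pass: extend the current group while neighboring frames stay close
    # in the latent space, otherwise start a new group.
    groups: List[List[int]] = [[0]]
    for t in range(1, len(features)):
        if l1(posteriors[t - 1], posteriors[t]) <= threshold:
            groups[-1].append(t)
        else:
            groups.append([t])

    # Summarize each group by the mean of its observation vectors.
    dim = len(features[0])
    summary = [
        [sum(features[t][d] for t in g) / len(g) for d in range(dim)]
        for g in groups
    ]
    return groups, summary


# Toy input: four frames whose latent posteriors fall into two regimes.
toy_features = [[0.0], [2.0], [10.0], [12.0]]
toy_posteriors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
toy_groups, toy_summary = summarize_sequence(toy_features, toy_posteriors)
```

On the toy input the four frames collapse into two groups, so the summarized sequence is half as long as the original. Applying the same function again to the summarized sequence, with posteriors from a newly trained model, would give the next layer of such a hierarchy.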
[1] Y. Bengio. Learning deep architectures for AI. FTML, 2(1), 2009. 2, 3, 4
[2] K. Bousmalis, L.-P. Morency, and M. Pantic. Modeling hidden dynamics of multimodal cues for spontaneous agreement and disagreement recognition. In FG, 2011. 6
[3] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. IJCV, 59(2), 2004. 2, 4
[4] J. R. K. Hartline. Incremental Optimization. PhD thesis, Cornell University, 2008. 5
[5] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. NECO, 18(7), 2006. 2
[6] G. B. Huang, H. Lee, and E. G. Learned-Miller. Learning hierarchical representations for face verification with convolutional deep belief networks. In CVPR, 2012. 2
[7] P. Kohli, L. Ladicky, and P. H. S. Torr. Robust higher order potentials for enforcing label consistency. IJCV, 82, 2009. 1
[8] A. Kovashka and K. Grauman. Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In CVPR, 2010. 2
[9] I. Laptev. On space-time interest points. IJCV, 64(2-3), 2005. 2
[10] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008. 2, 3
[11] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006. 1
[12] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011. 1, 2, 4
[13] J. C. Niebles and L. Fei-Fei. A hierarchical model of shape and appearance for human action classification. In CVPR, 2007. 2
[14] J. Nocedal and S. J. Wright. Numerical Optimization. Springer-Verlag, 1999. 5
[15] J. Peng, L. Bo, and J. Xu. Conditional neural fields. In NIPS, 2009. 2, 6
[16] A. Quattoni, S. B. Wang, L.-P. Morency, M. Collins, and T. Darrell. Hidden conditional random fields. PAMI, 29(10), 2007. 2, 3, 5, 6
[17] M. Ranzato, J. Susskind, V. Mnih, and G. E. Hinton. On deep generative models with applications to recognition. In CVPR, 2011. 1
[18] R. Salakhutdinov and G. E. Hinton. An efficient learning procedure for deep boltzmann machines. NECO, 24(8), 2012. 2
[19] A. Shyr, R. Urtasun, and M. I. Jordan. Sufficient dimension reduction for visual sequence classification. In CVPR, 2010. 6
[20] Y. Song, D. Demirdjian, and R. Davis. Tracking body and hands for gesture recognition: Natops aircraft handling signals database. In FG, 2011. 6, 8
[21] Y. Song, L.-P. Morency, and R. Davis. Multi-view latent variable discriminative models for action recognition. In CVPR, 2012. 6
[22] Y. Song, L.-P. Morency, and R. Davis. Multimodal human behavior analysis: learning correlation and interaction across modalities. In ICMI, 2012. 6
[23] J. Sun, X. Wu, S. Yan, L. F. Cheong, T.-S. Chua, and J. Li. Hierarchical spatio-temporal context modeling for action recognition. In CVPR, 2009. 2
[24] K. Tang, F.-F. Li, and D. Koller. Learning latent temporal structure for complex event detection. In CVPR, 2012. 2
[25] A. Vinciarelli, A. Dielmann, S. Favre, and H. Salamin. Canal9: A database of political debates for analysis of social interactions. In ACII, 2009. 6
[26] J. Wang, Z. Chen, and Y. Wu. Action recognition with multiscale spatio-temporal contexts. In CVPR, 2011. 2
[27] Y. Wang and G. Mori. Hidden part models for human action recognition: Probabilistic versus max margin. PAMI, 33(7), 2011. 2
[28] D. Yu and L. Deng. Deep-structured hidden conditional random fields for phonetic recognition. In INTERSPEECH, 2010. 2