Max-Margin Structured Output Regression for Spatio-Temporal Action Localization (NIPS 2012, paper 209)


Source: pdf

Author: Du Tran, Junsong Yuan

Abstract: Structured output learning has been successfully applied to object localization, where the mapping between an image and an object bounding box can be well captured. Its extension to action localization in videos, however, is much more challenging, because we need to predict the locations of the action patterns both spatially and temporally, i.e., identifying a sequence of bounding boxes that track the action in video. The problem becomes intractable due to the exponentially large size of the structured video space where actions could occur. We propose a novel structured learning approach for spatio-temporal action localization. The mapping between a video and a spatio-temporal action trajectory is learned. The intractable inference and learning problems are addressed by leveraging an efficient Max-Path search method, thus making it feasible to optimize the model over the whole structured space. Experiments on two challenging benchmark datasets show that our proposed method outperforms the state-of-the-art methods.
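
To make the inference step more concrete, here is a minimal sketch of a max-sum path search over a per-frame score volume. It is not the authors' implementation, and the abstract does not specify these details. The sketch assumes that every spatial cell in every frame already has a real-valued score, that a path may stay in place or move to one of the 8 neighbouring cells between consecutive frames, and that a path may start and end at any frame. The names `max_path` and `scores` are illustrative only.

```python
def max_path(scores):
    """Best-scoring path through a T x H x W score volume (nested lists).

    Returns (best_sum, path), where path is a list of (t, y, x) cells
    on consecutive frames. Assumes a non-empty volume.
    """
    T, H, W = len(scores), len(scores[0]), len(scores[0][0])
    # best[t][y][x]: best sum over all paths that end at cell (t, y, x)
    best = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    # prev[t][y][x]: predecessor (y, x) in frame t-1, or None if the path starts here
    prev = [[[None] * W for _ in range(H)] for _ in range(T)]
    top = (float("-inf"), None)

    for t in range(T):
        for y in range(H):
            for x in range(W):
                carry, src = 0.0, None  # carry 0.0 means "start a new path here"
                if t > 0:
                    # Extend the best neighbouring path only if it adds a positive sum
                    for py in range(max(y - 1, 0), min(y + 2, H)):
                        for px in range(max(x - 1, 0), min(x + 2, W)):
                            if best[t - 1][py][px] > carry:
                                carry, src = best[t - 1][py][px], (py, px)
                best[t][y][x] = scores[t][y][x] + carry
                prev[t][y][x] = src
                if best[t][y][x] > top[0]:
                    top = (best[t][y][x], (t, y, x))

    # Backtrack from the best end cell to recover the path
    best_sum, (t, y, x) = top
    path = [(t, y, x)]
    while prev[t][y][x] is not None:
        y, x = prev[t][y][x]
        t -= 1
        path.append((t, y, x))
    path.reverse()
    return best_sum, path
```

Under these assumptions the search visits each cell once and inspects at most nine predecessors, so its cost grows linearly with the size of the score volume rather than with the number of candidate paths.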


reference text

[1] L. Bertelli, T. Yu, D. Vu, and S. Gokturk. Kernelized structural SVM learning for supervised object segmentation. CVPR, 2011.

[2] M. B. Blaschko and C. H. Lampert. Learning to localize objects with structured output regression. ECCV, 2008.

[3] O. Boiman and M. Irani. Detecting irregularities in images and in video. IJCV, 2007.

[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR, 2005.

[5] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. ECCV, 2006.

[6] K. Derpanis, M. Sizintsev, K. Cannons, and P. Wildes. Efficient action spotting based on a spacetime oriented structure representation. CVPR, 2010.

[7] T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. ICML, 2008.

[8] Y. Hu, L. Cao, F. Lv, S. Yan, Y. Gong, and T. S. Huang. Action detection in complex scenes with spatial and temporal ambiguities. ICCV, 2009.

[9] Y. Ke, R. Sukthankar, and M. Hebert. Volumetric features for video event detection. IJCV, 2010.

[10] A. Kläser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. BMVC, 2008.

[11] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Efficient subwindow search: A branch and bound framework for object localization. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2009.

[12] T. Lan, Y. Wang, and G. Mori. Discriminative figure-centric models for joint action localization and recognition. ICCV, 2011.

[13] T. Lan, Y. Wang, W. Yang, and G. Mori. Beyond actions: Discriminative models for contextual group activities. NIPS, 2010.

[14] Q. Le, W. Zou, S. Yeung, and A. Ng. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR, 2011.

[15] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos. Anomaly detection in crowded scenes. CVPR, 2010.

[16] M. H. Nguyen, T. Simon, F. De la Torre, and J. Cohn. Action unit detection with segment-based SVMs. CVPR, 2010.

[17] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. IJCV, 2008.

[18] A. Patron-Perez, M. Marszalek, A. Zisserman, and I. Reid. High five: Recognising human interactions in TV shows. BMVC, 2010.

[19] N. Ratliff, J. A. Bagnell, and M. Zinkevich. Subgradient methods for maximum margin structured learning. ICML 2006 Workshop on Learning in Structured Output Spaces, 2006.

[20] M. D. Rodriguez, J. Ahmed, and M. Shah. Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. CVPR, 2008.

[21] B. Taskar, S. Lacoste-Julien, and M. Jordan. Structured prediction via the extragradient method. NIPS, 2005.

[22] D. Tran and D. Forsyth. Configuration estimates improve pedestrian finding. NIPS, 2007.

[23] D. Tran and J. Yuan. Optimal spatio-temporal path discovery for video event detection. CVPR, pages 3321–3328, 2011.

[24] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 2005.

[25] A. Vedaldi and A. Zisserman. Structured output regression for detection with partial truncation. NIPS, 2009.

[26] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. BMVC, 2009.

[27] Y. Wang, D. Tran, and Z. Liao. Learning hierarchical poselets for human parsing. CVPR, 2011.

[28] A. Yao, J. Gall, L. V. Gool, and R. Urtasun. Learning probabilistic non-linear latent variable models for tracking complex activities. NIPS, 2011.

[29] J. Yuan, Z. Liu, and Y. Wu. Discriminative video pattern search for efficient action detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2011.