Author: Vignesh Ramanathan, Percy Liang, Li Fei-Fei
Abstract: Human action and role recognition play an important part in complex event understanding. State-of-the-art methods learn action and role models from detailed spatio-temporal annotations, which require extensive human effort. In this work, we propose a method to learn such models based on natural language descriptions of the training videos, which are easier to collect and scale with the number of actions and roles. There are two challenges with using this form of weak supervision: First, these descriptions only provide a high-level summary and often do not directly mention the actions and roles occurring in a video. Second, natural language descriptions do not provide spatio-temporal annotations of actions and roles. To tackle these challenges, we introduce a topic-based semantic relatedness (SR) measure between a video description and an action and role label, and incorporate it into a posterior regularization objective. Our event recognition system based on these action and role models matches the state-of-the-art method on the TRECVID-MED11 event kit, despite weaker supervision.
[1] TRECVID multimedia event detection track, 2011. 1
[2] T. L. Berg, A. C. Berg, and J. Shih. Automatic attribute discovery and characterization from noisy web data. In ECCV, 2010. 2
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003. 1
[4] T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar. Movie/script: Alignment and parsing of video and text transcription. In ECCV, 2008. 2
[5] A. Fathi, J. K. Hodgins, and J. M. Rehg. Social interactions: A first person perspective. In CVPR, 2012. 2
[6] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 32(9):1627–1645, 2010. 3
[7] K. Ganchev, J. Graça, J. Gillenwater, and B. Taskar. Posterior regularization for structured latent variable models. JMLR, 11, 2010. 1, 3
[8] H. Izadinia and M. Shah. Recognizing complex events using large margin joint low-level event model. In ECCV, 2012. 1, 2, 4, 5, 7
[9] Y. Jia, M. Salzmann, and T. Darrell. Learning cross-modality similarity for multinomial data. In ICCV, 2011. 2
[10] A. Kläser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC, 2008. 3
[11] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In NIPS, 2010. 4
[12] T. Lan, L. Sigal, and G. Mori. Social roles in hierarchical models for human activity recognition. In CVPR, 2012. 1, 2, 3
[13] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008. 2
[14] J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In CVPR, 2011. 2
[15] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60, 2004. 3
[16] M. Marszałek, I. Laptev, and C. Schmid. Actions in context. In CVPR, 2009. 2
[17] M. Merler, B. Huang, L. Xie, G. Hua, and A. Natsev. Semantic model vectors for complex video event recognition. IEEE Trans. Multimedia, 14:88–101, 2012. 1
[18] T. S. Motwani and R. J. Mooney. Improving video activity recognition using object recognition and text mining. In ECAI, 2012. 2
[19] D. Parikh and K. Grauman. Interactively building a discriminative vocabulary of nameable attributes. In CVPR, 2011. 2
[20] L. Rabiner and B.-H. Juang. Fundamentals of speech recognition. Prentice-Hall, Inc., 1993. 3
[21] M. Raptis, I. Kokkinos, and S. Soatto. Discovering discriminative action parts from mid-level video representations. In ECCV, 2012. 2
[22] M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele, and M. Pinkal. Grounding action descriptions in videos. TACL, 1:25–36, 2013. 1
[23] M. Rohrbach, M. Regneri, M. Andriluka, S. Amin, M. Pinkal, and B. Schiele. Script data for attribute-based recognition of composite activities. In ECCV, 2012. 1, 2
[24] M. Rohrbach, M. Stark, G. Szarvas, and B. Schiele. Combining language sources and robust semantic relatedness for attribute-based knowledge transfer. In ECCV-PnA, 2010. 2
[25] M. Rohrbach, M. Stark, G. Szarvas, and B. Schiele. What helps where – and why? semantic relatedness for knowledge transfer. In CVPR, 2010. 2, 4, 5, 6
[26] M. Rohrbach, Q. Wei, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In ICCV, 2013. 7
[27] S. Satkin and M. Hebert. Modelling the temporal extent of action. In ECCV, 2010. 2
[28] S. Singh, A. Gupta, and A. A. Efros. Unsupervised discovery of mid-level discriminative patches. In ECCV, 2012. 2
[29] R. Socher and L. Fei-Fei. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In CVPR, 2010. 2
[30] G. Wang, A. Gallagher, J. Luo, and D. Forsyth. Seeing people in social context: recognizing people and social relationships. In ECCV, 2010. 2
[31] W. Yang and G. Toderici. Discriminative tag learning on youtube videos with latent sub-tags. In CVPR, 2011. 2
[32] Z.-H. Zhou and M.-L. Zhang. Multi-instance multi-label learning with application to scene classification. In NIPS, 2007. 3