cvpr cvpr2013 cvpr2013-402 cvpr2013-402-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Vignesh Ramanathan, Bangpeng Yao, Li Fei-Fei
Abstract: We deal with the problem of recognizing the social roles played by people in an event. Social roles are governed by human interactions and form a fundamental component of human event description. We focus on a weakly supervised setting, where we are given different videos belonging to an event class without role labels for training. Since social roles are described by the interactions between people in an event, we propose a Conditional Random Field that models inter-role interactions along with person-specific social descriptors. We develop a tractable variational inference procedure to simultaneously infer the model weights and the role assignments of all people in the videos. We also present a novel YouTube social roles dataset with ground-truth role annotations, and introduce annotations on a subset of videos from the TRECVID-MED11 [1] event kits for evaluation purposes. The performance of the model is compared against different baseline methods on these datasets.
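To make the kind of inference described in the abstract concrete, the sketch below runs mean-field updates over a small, fully connected CRF in which each person receives a distribution over roles from a unary score (standing in for person-specific social descriptors) plus a shared role-role interaction matrix (standing in for inter-role terms). This is a minimal illustrative sketch, not the paper's implementation: the score matrices, the mean-field scheme, and all function names are assumptions chosen for clarity, and the actual model additionally learns the weights jointly with the role assignments.

```python
# Minimal sketch (assumed, not the paper's code): mean-field style inference
# over a fully connected CRF on the people in one video.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mean_field_roles(unary, pairwise, n_iters=20):
    """unary: (P, R) per-person role scores; pairwise: (R, R) role-interaction
    scores. Returns q: (P, R), an approximate posterior over role assignments."""
    P, R = unary.shape
    q = softmax(unary, axis=1)                      # initialise from unaries
    for _ in range(n_iters):
        msg = q @ pairwise                          # expected interaction per person
        msg = msg.sum(axis=0, keepdims=True) - msg  # sum messages from the *other* people
        q = softmax(unary + msg, axis=1)            # coordinate-wise mean-field update
    return q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    unary = rng.normal(size=(5, 4))                 # 5 people, 4 hypothetical roles
    pairwise = rng.normal(scale=0.1, size=(4, 4))
    print(mean_field_roles(unary, pairwise).round(2))
```

A mean-field scheme of this form keeps each update linear in the number of people and roles, which is what makes joint reasoning over every person in a video tractable.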
[1] TRECVID multimedia event detection track. 1, 2, 5
[2] B. J. Biddle. Recent development in role theory. Annual Review of Sociology, 12:67–92, 1986. 1
[3] W. Choi and S. Savarese. A unified framework for multitarget tracking and collective activity recognition. In ECCV, 2012. 2
[4] L. Ding and A. Yilmaz. Learning relations among movie characters: A social network perspective. In ECCV, 2010. 2
[5] L. Ding and A. Yilmaz. Inferring social relations from visual concepts. In ICCV, 2011. 2
[6] C. Direkoğlu and N. O'Connor. Team activity recognition in sports. In ECCV, 2012. 2, 3
[7] A. Fathi, J. K. Hodgins, and J. M. Rehg. Social interactions: A first-person perspective. In CVPR, 2012. 1, 2
[8] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Trans. on PAMI, 32(9):1627–1645, 2010. 4
[9] Y. Fu, T. Hospedales, T. Xiang, and S. Gong. Attribute learning for understanding unstructured social activity. In ECCV, 2012. 2
[10] A. C. Gallagher and T. Chen. Understanding images of groups of people. In CVPR, 2009. 2
[11] A. Kläser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008. 3
[12] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011. 4
[13] T. Lan, L. Sigal, and G. Mori. Social roles in hierarchical models for human activity recognition. In CVPR, 2012. 1, 2, 3
[14] T. Lan, Y. Wang, W. Yang, S. Robinovitch, and G. Mori. Discriminative latent models for recognizing contextual group activities. IEEE Trans. on PAMI, 34(8): 1549–1562, 2012. 2
[15] L.-J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS, 2010. 4
[16] M. Marin-Jimenez, A. Zisserman, and V. Ferrari. "Here's looking at you, kid." Detecting people looking at each other in videos. In BMVC, 2011. 2
[17] A. Patron-Perez, M. Marszalek, A. Zisserman, and I. Reid. High Five: Recognising human interactions in TV shows. In BMVC, 2010. 2
[18] Z. Qin and C. R. Shelton. Improving multi-target tracking via social grouping. In CVPR, 2012. 2
[19] Z. Song, M. Wang, X. Hua, and S. Yan. Predicting occupation via human clothing and contexts. In ICCV, 2011. 2
Figure 6 (panels: (a) Bride, (b) Priest, (c) Bridesmaids, (d) Groomsmen). Marginal of the position of each role relative to the reference role ("groom"), estimated by our model, shown for YouTube wedding videos. Spatial positions in image coordinates are normalized by the width of the reference-role bounding box and binned; the groom's position is marked by a cross-hair. The "bride" is mostly close to the "groom", while the "groomsmen" and "bridesmaids" are distributed around the groom as expected. The uncertainty in recognizing the "priest" is reflected by a scattered distribution.
Figure 7 (rows: Wedding, Birthday, Award Function, Physical Training). Sample results from the YouTube social roles dataset, where each row corresponds to an event. Boxes with solid lines indicate correct role assignments from our full model, while dashed lines represent faulty assignments. Different roles are indicated by the same color code as in Fig. 2, and the ground-truth role of a person is indicated by the color of the dot on the person. The last column shows typical failure cases for each event.
[20] Z. Stone, T. Zickler, and T. Darrell. Toward large-scale face recognition using social network context. Proc. of the IEEE, 2010. 2
[21] C. Vondrick and D. Ramanan. Video annotation and tracking with active learning. In NIPS, 2011. 5
[22] G. Wang, A. Gallagher, J. Luo, and D. Forsyth. Seeing people in social context: recognizing people and social relationships. In ECCV, 2010. 2
[23] C.-Y. Weng, W.-T. Chu, and J.-L. Wu. RoleNet: Movie analysis from the perspective of social networks. IEEE Trans. on Multimedia, (2):256–271, 2009. 2
[24] Y. Yang, S. Baker, A. Kannan, and D. Ramanan. Recognizing proxemics in personal photos. In CVPR, 2012. 4
[25] T. Yu, S.-N. Lim, K. Patwardhan, and N. Krahnstoever. Monitoring, recognizing and discovering social networks. In CVPR, 2009. 2
[26] J. Zhu and E. P. Xing. Conditional topic random fields. In ICML, 2010. 4, 5
[27] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In CVPR, 2012. 4