
166 iccv-2013-Finding Actors and Actions in Movies

Source: pdf

Author: P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, J. Sivic

Abstract: We address the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts. Specifically, we extract actor/action pairs from the script and use them as constraints in a discriminative clustering framework. The corresponding optimization problem is formulated as a quadratic program under linear constraints. People in video are represented by automatically extracted and tracked faces together with corresponding motion features. First, we apply the proposed framework to the task of learning names of characters in the movie and demonstrate significant improvements over previous methods used for this task. Second, we explore the joint actor/action constraint and show its advantage for weakly supervised action learning. We validate our method in the challenging setting of localizing and recognizing characters and their actions in the feature-length movies Casablanca and American Beauty.

reference text

[1] http://www.di.ens.fr/willow/research/actoraction, 2013. 5, 7, 8

[2] F. Bach and Z. Harchaoui. DIFFRAC: a discriminative and flexible framework for clustering. In NIPS, 2007. 2, 3, 4

[3] C. F. Baker, C. J. Fillmore, and J. B. Lowe. The Berkeley FrameNet project. In COLING-ACL, 1998. 4

[4] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. Jordan. Matching words and pictures. J. Machine Learning Research, 2003. 2

[5] T. L. Berg, A. C. Berg, J. Edwards, M. Maire, R. White, Y. W. Teh, E. G. Learned-Miller, and D. A. Forsyth. Names and faces in the news. In CVPR, 2004. 1, 2

[6] T. Cour, B. Sapp, C. Jordan, and B. Taskar. Learning from ambiguously labeled images. In CVPR, 2009. 1, 2, 5, 6

Figure 6: Examples of automatically assigned names and actions in the movie Casablanca. Top row: Correct name and action assignments for tracks that have an actor/action constraint in the script. Bottom row: Correct name and action assignments for tracks that do not have a corresponding constraint in the script, but are still correctly classified. Note that even very infrequent characters are correctly classified (Annina and Yvonne). See more examples on the project web-page [1].

[7] D. Das, A. F. T. Martins, and N. A. Smith. An exact dual decomposition algorithm for shallow semantic parsing with constraints. In SEM, 2012. 4

[8] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce. Automatic annotation of human actions in video. In ICCV, 2009. 1, 2

[9] M. Everingham, J. Sivic, and A. Zisserman. “Hello! My name is... Buffy” – automatic naming of characters in TV video. In BMVC, 2006. 2, 3, 4, 5

[10] A. Farhadi, M. Hejrati, A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: generating sentences for images. In ECCV, 2010. 2

[11] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation. In CVPR, 2009. 2

[12] Y. Guo and D. Schuurmans. Convex relaxations of latent variable training. In NIPS, 2007. 4

[13] A. Gupta and L. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV, 2008. 1, 2

[14] A. Gupta, P. Srinivasan, J. Shi, and L. Davis. Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos. In CVPR, 2009. 2

[15] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, 2007. 5

[16] A. Joulin, F. Bach, and J. Ponce. Multi-class cosegmentation. In CVPR, 2012. 2

[17] A. Joulin, F. R. Bach, and J. Ponce. Discriminative clustering for image co-segmentation. In CVPR, 2010. 2, 4

[18] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008. 1, 2

[19] J. Luo, B. Caputo, and V. Ferrari. Who’s doing what: Joint modeling of names and verbs for simultaneous face and pose annotation. In NIPS, 2009. 1, 2

[20] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, 2009. 1, 5

[21] V. Ordonez, G. Kulkarni, and T. Berg. Im2text: Describing images using 1 million captioned photographs. In NIPS, 2011. 2

[22] J. Sivic, M. Everingham, and A. Zisserman. “Who are you?” – learning person specific classifiers from video. In CVPR, 2009. 1, 2, 4, 5, 6

[23] M. Tapaswi, M. Bauml, and R. Stiefelhagen. “Knock! Knock! Who is it?” Probabilistic person identification in TV-series. In CVPR, 2012. 1

[24] S. Vijayanarasimhan and K. Grauman. Keywords to visual categories: Multiple-instance learning for weakly supervised object categorization. In CVPR, 2008. 3

[25] Y. Wang and G. Mori. A discriminative latent model of image region and object tag correspondence. In NIPS, 2010. 2, 4, 5

[26] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In CVPR, 2012. 4