402 cvpr-2013-Social Role Discovery in Human Events

Source: pdf

Author: Vignesh Ramanathan, Bangpeng Yao, Li Fei-Fei

Abstract: We deal with the problem of recognizing social roles played by people in an event. Social roles are governed by human interactions, and form a fundamental component of human event description. We focus on a weakly supervised setting, where we are provided different videos belonging to an event class, without training role labels. Since social roles are described by the interaction between people in an event, we propose a Conditional Random Field to model the inter-role interactions, along with person specific social descriptors. We develop tractable variational inference to simultaneously infer model weights, as well as role assignment to all people in the videos. We also present a novel YouTube social roles dataset with ground truth role annotations, and introduce annotations on a subset of videos from the TRECVID-MED11 [1] event kits for evaluation purposes. The performance of the model is compared against different baseline methods on these datasets.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We deal with the problem of recognizing social roles played by people in an event. [sent-2, score-1.122]

2 Social roles are governed by human interactions, and form a fundamental component of human event description. [sent-3, score-0.715]

3 We focus on a weakly supervised setting, where we are provided different videos belonging to an event class, without training role labels. [sent-4, score-0.737]

4 Since social roles are described by the interaction between people in an event, we propose a Conditional Random Field to model the inter-role interactions, along with person specific social descriptors. [sent-5, score-1.946]

5 We develop tractable variational inference to simultaneously infer model weights, as well as role assignment to all people in the videos. [sent-6, score-0.663]

6 We also present a novel YouTube social roles dataset with ground truth role annotations, and introduce annotations on a subset of videos from the TRECVID-MED11 [1] event kits for evaluation purposes. [sent-7, score-1.707]

7 Our ability to comprehend human relations stands fundamental to our survival, development and social life. [sent-11, score-0.607]

8 We understand such relationships in terms of social roles assumed by people, and tend to describe events using these roles. [sent-12, score-1.018]

9 Typically, social roles answer semantic queries like “Who is doing what in an event?” [sent-15, score-0.954]

10 While the tasks of identifying the action and detecting the person are widely studied in computer vision, the problem of role assignment is relatively new and equally interesting. [sent-17, score-0.556]

11 Social role discovery derives motivation from the field of “Role Theory” [2] in sociology, which observes that people behave in predictable ways based on their social roles. [sent-18, score-1.098]

12 This shows that knowing the role of a person can help determine his/her interactions with the environment and vice-versa. [sent-19, score-0.572]

13 Social roles act as identities for the individuals and can help us describe the event in terms of these roles. [sent-26, score-0.625]

14 Also, the knowledge of social roles can help determine the interesting segments of social event footages [7] and sports videos. [sent-29, score-1.726]

15 The definition of social roles is event specific, and can sometimes be abstract such as, people “helping”, “visiting” or “residing” in a nursing home [13], making role identification a difficult human task. [sent-30, score-1.732]

16 Recognizing these difficulties, we formulate the problem of social role discovery in a weakly supervised framework. [sent-33, score-1.033]

17 Given a set of videos belonging to a social event without training labels for the people in the videos, we group them into different social roles. [sent-34, score-1.536]

18 The event label acts as the weak annotation in our setting, restricting the discovered roles to be event specific. [sent-35, score-0.857]

19 The problem is amply challenging due to the wide variation in appearance, scale, location and scene context of a role across different videos as seen in Fig. [sent-36, score-0.506]

20 1, it is difficult to determine roles by observing people individually. [sent-39, score-0.56]

21 Rather, social role discovery is an attempt to identify people based on their interactions in an event. [sent-40, score-1.182]

22 In order to solve this problem of weakly supervised role assignment, we propose a Conditional Random Field (CRF) to capture inter-role interaction cues, and develop … [sent-42, score-0.653]

23 The different roles in each event are marked by the colors noted in the last column. [sent-177, score-0.667]

24 Further, to evaluate the model performance, we introduce a novel YouTube social roles dataset in Sec. [sent-180, score-0.954]

25 1, accompanied by event specific ground truth role annotations for the people in the videos. [sent-182, score-0.779]

26 We also provide role annotations for a subset of videos from two events of the TRECVID-MED11 [1] event kits, and test our model performance on these videos. [sent-184, score-0.735]

27 Experiments on these datasets show that our method achieves encouraging performance in weakly supervised social role assignment. [sent-185, score-0.997]

28 Related Work. Socially aware video and image analysis: Recent work on social network construction and interaction understanding is relevant to our work on social role recognition. [sent-187, score-1.715]

29 [5] uses scene context and visual concept attributes to build a social relation network. [sent-190, score-0.557]

30 [23] also builds a social role network based on the co-occurrence of movie characters in different scenes. [sent-191, score-1.004]

31 [20] studied the problem of face recognition in social context. [sent-194, score-0.536]

32 Social Interaction in Action Recognition: Another related line of work has been the use of social interaction to aid group action recognition [14, 3, 6]. [sent-195, score-0.773]

33 [18] also uses social grouping to help multi-target tracking. [sent-197, score-0.536]

34 [10] uses social context in group photos to make better predictions of human attributes and scene semantics. [sent-198, score-0.628]

35 Although the above works capture social interactions in some form, they do not explicitly identify the roles assumed by people during a social event. [sent-202, score-1.716]

36 Role Recognition: Recently, [7, 13] used social roles to predict group activities. [sent-203, score-0.98]

37 They used training labels to learn role assignments based on spatio-temporal interaction between players. [sent-207, score-0.627]

38 However, in our work we are not provided role annotations, and we wish to discover interaction-based roles automatically by studying different instances of an event. [sent-208, score-0.826]

39 Our Approach: We define social role discovery as a weakly supervised problem, where the training role labels for the people in the videos are not available. [sent-211, score-1.648]

40 We are only provided the event label for each video, and the number of roles to be discovered in an event. [sent-212, score-0.625]

41 Social roles are not only decided by person specific descriptors, but also by the interaction between people. [sent-214, score-0.732]

42 Hence, any model used to discover social roles should be capable of incorporating this information. [sent-215, score-0.978]

43 In our approach, every event has a reference role, and the interaction of any person with this reference role is most significant. [sent-217, score-1.109]

44 One instance of the reference role is assumed to be present in every video belonging to the event class. [sent-222, score-0.748]

45 Model Formulation: We present a CRF model which accounts for the reference role interaction with other roles in a video. [sent-226, score-1.105]

46 As illustrated, to capture person specific social cues, we extract unary features (Ψu) from each human track, describing spatio-temporal activity, human appearance and human-object interaction. [sent-229, score-0.811]

47 Similarly, to represent interaction based social cues, pairwise features (Ψp) describing proxemic touch codes, and spatial proximity are extracted. [sent-230, score-0.879]

48 Our CRF model uses these features to perform weakly supervised social role recognition. [sent-231, score-0.997]

49 Let Pv be the set of people in a video v and siv be the social role assigned to a person piv ∈ Pv. [sent-232, score-1.368]

50 We want to assign social roles, and jointly learn model weights by maximizing the log likelihood of the CRF shown in Eq. [sent-233, score-0.536]

51 m is the index of the reference role in the video v. [sent-246, score-0.541]

52 where mE denotes the reference role in the event E, and the person holding the reference role in v. [sent-250, score-1.301]

53 1, sE is the complete social role assignment to all people in the event, and Zv is the log-partition function for the video v. [sent-257, score-1.157]

54 Note that the model only considers interaction of different roles with the reference role, in accordance with our assumption, and every video is assumed to contain one person playing this reference role. [sent-259, score-0.982]
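
The structure described in sentences 45 to 54 can be sketched in a few lines. This is only an illustration, not the authors' Eq. 1, which is not reproduced in this summary: it assumes a linear score in which every person contributes a unary term for their role and every non-reference person contributes one pairwise term with the reference person. The function name `video_score`, the weight tables `w_unary` and `w_pair`, and the toy feature values are all hypothetical.

```python
def dot(w, x):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(w, x))


def video_score(roles, ref_idx, unary_feats, pair_feats, w_unary, w_pair):
    """Unnormalised score of one role assignment in a single video.

    roles[i]         -- role label of person i
    ref_idx          -- index of the person holding the reference role
    unary_feats[i]   -- unary feature vector of person i
    pair_feats[i, j] -- pairwise feature vector between persons i and j
    w_unary[role]    -- hypothetical unary weights for `role`
    w_pair[role]     -- hypothetical weights for `role` interacting with
                        the reference role
    """
    score = 0.0
    for i, role in enumerate(roles):
        score += dot(w_unary[role], unary_feats[i])
        # Only interactions with the reference person are modelled.
        if i != ref_idx:
            score += dot(w_pair[role], pair_feats[(ref_idx, i)])
    return score


# Toy video with three people; person 0 holds the reference role.
roles = ["ref", "other", "other"]
unary_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
pair_feats = {(0, 1): [1.0], (0, 2): [0.0]}
w_unary = {"ref": [2.0, 0.0], "other": [0.0, 1.0]}
w_pair = {"other": [3.0]}
example_score = video_score(roles, 0, unary_feats, pair_feats, w_unary, w_pair)
print(example_score)
```

Because only reference-to-other pairs are scored, the number of pairwise terms grows linearly with the number of people in the video rather than quadratically.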

55 Unary Features: The unary feature Ψu captures role specific social cues extracted from human tracks, and their interaction with the event environment. [sent-264, score-1.462]

56 Object Interaction Feature ΨuOI: The interaction of a person with the event environment plays a key role in determining his/her role. [sent-271, score-0.887]

57 : These features capture two important social aspects of a person, representing gender and clothing. [sent-277, score-0.575]

58 Pairwise Interaction Features: Human interaction forms an important basis for social role definitions. [sent-285, score-1.112]

59 : The proxemic interaction of two people provides interesting insights regarding the relation between roles in an event such as the touch-code between a “parent” and the “birthday child”. [sent-290, score-1.059]

60 1 arises due to the correlation between different social roles and the coupling introduced by Zv. [sent-304, score-0.954]

61 We also introduce a variational approximation to the social role probability distribution in a video, with similar dependencies as the original model. [sent-308, score-0.959]

62 3, where sv denotes the role assignment to all people in the video v. [sent-310, score-0.621]

63 is a factor giving the probability of a person being assigned the reference role in the video. [sent-318, score-0.617]

64 ψv is a set of |Pv| factors, where ψ(vi) is the secondary role probability matrix for the other people in the video, when piv is assigned the reference role. [sent-319, score-0.8]

65 This variational approximation of the social role probability retains the dependencies in our original structure. [sent-322, score-0.959]

66 It represents one predominant reference role, with secondary role assignments dependent on this reference role. [sent-323, score-0.701]

67 In every video v, the person pvm with the highest value of φv is assigned the reference role, forming a reference role cluster. [sent-337, score-0.817]

68 The corresponding variational probability ψ(vm) is used to assign secondary roles to other people in the video. [sent-338, score-0.663]

69 We enforce a lower and upper bound on the number of people assigned to a secondary role cluster in the event. [sent-339, score-0.632]

70 This acts as a loose prior on the number of people in each role cluster. [sent-341, score-0.551]
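
The assignment step in sentences 67 to 70 can be illustrated with a small sketch. It is a simplification under stated assumptions, not the authors' procedure: the source bounds cluster sizes per event from below and above, whereas this sketch applies only a hypothetical upper bound within a single video and fills roles greedily. The names `assign_roles`, `phi`, `psi` and `max_per_role` are illustrative.

```python
def assign_roles(phi, psi, max_per_role):
    """Pick the reference person, then give secondary roles to the rest.

    phi[i]          -- probability that person i holds the reference role
    psi[m][j][r]    -- probability that person j takes secondary role r
                       when person m is the reference
    max_per_role[r] -- assumed upper bound on people with role r
    Returns (reference index, {person index: secondary role}).
    """
    ref = max(range(len(phi)), key=lambda i: phi[i])
    counts = {r: 0 for r in max_per_role}
    others = [j for j in range(len(phi)) if j != ref]
    # Visit the most confident people first so they get their top choice.
    others.sort(key=lambda j: max(psi[ref][j].values()), reverse=True)
    roles = {}
    for j in others:
        ranked = sorted(psi[ref][j], key=psi[ref][j].get, reverse=True)
        for r in ranked:
            if counts[r] < max_per_role[r]:
                roles[j] = r
                counts[r] += 1
                break
        # A person stays unassigned if every role is already full.
    return ref, roles


phi = [0.1, 0.7, 0.2]
psi = {1: {0: {"guest": 0.6, "parent": 0.4},
           2: {"guest": 0.9, "parent": 0.1}}}
ref, secondary = assign_roles(phi, psi, {"guest": 1, "parent": 2})
print(ref, secondary)
```

In the toy input, person 1 has the largest reference probability; person 2 takes the single "guest" slot, so person 0 falls back to "parent".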

71 Datasets. YouTube Social Roles: Most publicly available video datasets are not suitable for evaluating the social role assignment task, since they do not cover a good range of people donning different roles in specific social events. [sent-347, score-1.987]

72 In an attempt to evaluate our method, we collected a set of YouTube videos under 4 social events. [sent-348, score-0.605]

73 To facilitate easy evaluation, we annotate every person in our dataset with the relevant social roles. [sent-351, score-0.64]

74 Some videos have stray individuals who are not annotated with any specific social role and are labeled “others”. [sent-352, score-1.007]

75 Within each social event, there is wide variation in event settings as seen from the sample video frames in Fig. [sent-354, score-0.789]

76 This diversity in scenarios, with the same underlying interactions between different roles, is an interesting characteristic of the dataset, and makes the task amply challenging. [sent-359, score-0.534]

77 TRECVID Social Roles: Among publicly available datasets, the TRECVID-MED11 event kits [1] have two social event classes: birthday and wedding. [sent-360, score-0.968]

78 Some videos were cropped to include only the parts showing relevant social events. [sent-363, score-0.605]

79 Due to the weakly supervised nature of the problem, we do not have a direct mapping between role clusters and ground truth role labels. [sent-379, score-0.862]

80 To facilitate easy comparison with different baselines, the role clusters obtained from a method are each mapped to one of the human-defined roles, maximizing the total correct role assignments in an event. [sent-380, score-0.861]
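
The cluster-to-role mapping used for evaluation can be sketched as a small search. This is a minimal illustration that assumes a one-to-one mapping and at least as many role names as clusters; the source does not say how the maximization is carried out, and the brute-force search over permutations here is only practical for the handful of roles per event. `best_cluster_mapping` and the toy labels are hypothetical.

```python
from itertools import permutations


def best_cluster_mapping(cluster_ids, true_roles, role_names):
    """Map each cluster to a distinct role so that accuracy is maximal.

    cluster_ids[i] -- cluster assigned to person i by a method
    true_roles[i]  -- ground truth role of person i
    role_names     -- the human-defined roles of the event
    Returns (mapping dict, accuracy under that mapping).
    """
    clusters = sorted(set(cluster_ids))
    best_map, best_correct = {}, -1
    # Try every one-to-one assignment of roles to clusters.
    for perm in permutations(role_names, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(mapping[c] == t
                      for c, t in zip(cluster_ids, true_roles))
        if correct > best_correct:
            best_map, best_correct = mapping, correct
    return best_map, best_correct / len(true_roles)


cluster_ids = [0, 0, 1, 1, 2]
true_roles = ["bride", "bride", "groom", "priest", "priest"]
mapping, accuracy = best_cluster_mapping(
    cluster_ids, true_roles, ["bride", "groom", "priest"])
print(mapping, accuracy)
```

For larger numbers of roles, an assignment solver such as the Hungarian algorithm would give the same optimum without enumerating permutations.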

81 … role, and the true prior of secondary roles is used to assign roles to other people in the video. [sent-389, score-1.042]

82 This confirms our belief that human interactions are informative for role recognition. [sent-415, score-0.513]

83 This demonstrates the value in explicitly modeling interaction between role pairs, instead of using interaction as a context feature. [sent-419, score-0.789]

84 Sample frames from videos are shown, where our full model identified the correct (a) “bride” (green box), “groom” (red box) roles in wedding and (b) “presenter” (green box), “recipient” (red box) roles in award function. [sent-425, score-1.155]

85 The column corresponding to the reference role cluster chosen by our algorithm is highlighted in each matrix. [sent-436, score-0.519]

86 The average purities of the reference role clusters are 0. [sent-437, score-0.512]
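
For readers unfamiliar with the term, cluster purity can be computed as below. This assumes the common definition (the fraction of a cluster's members that share its most frequent ground truth label); the summary does not state which definition the authors use. `cluster_purity` and the toy labels are hypothetical.

```python
from collections import Counter


def cluster_purity(cluster_ids, true_roles, cluster):
    """Fraction of `cluster` members carrying its majority true role."""
    members = [t for c, t in zip(cluster_ids, true_roles) if c == cluster]
    if not members:
        return 0.0  # an empty cluster has no majority label
    majority_count = Counter(members).most_common(1)[0][1]
    return majority_count / len(members)


cluster_ids = [0, 0, 1, 1, 2]
true_roles = ["bride", "bride", "groom", "priest", "priest"]
purity_pure = cluster_purity(cluster_ids, true_roles, 0)
purity_mixed = cluster_purity(cluster_ids, true_roles, 1)
print(purity_pure, purity_mixed)
```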

87 We observe that the model is able to cluster the roles better in the wedding event, as seen in Fig. [sent-441, score-0.602]

88 To study this interaction, we visualize the marginals of the spatial relationship of different roles with the reference role (“groom”) cluster in the YouTube wedding dataset, in Fig. [sent-444, score-1.119]

89 “friends” are difficult to distinguish from “guests” in the TRECVID birthday dataset, where we observed both roles to exhibit low interaction with the reference role. [sent-449, score-0.881]

90 The column corresponding to the reference role cluster chosen by our model is highlighted in each event. [sent-456, score-0.519]

91 In order to evaluate the latent reference role assignment in our model, we compare performances with a control setting which randomly chooses the reference role in each video. [sent-460, score-1.039]

92 82% for the YouTube social roles dataset with this choice of reference role, justifying the need to model it as a latent variable. [sent-462, score-0.986]

93 80% for the wedding event, which has more role classes than the other events leading to increased randomness in the choice of reference role in each video. [sent-464, score-1.086]

94 Conclusion: We proposed to recognize social roles from human event videos in a weakly supervised setting, and designed a CRF to model the inter-role interactions along with person specific unary features. [sent-466, score-1.621]

95 As a next step, our approach can be extended to perform simultaneous event classification along with role discovery. [sent-470, score-0.591]

96 It is also noted that our method is not robust to noisy and fragmented reference role tracking, due to the inherent assumption of one reference role per video. [sent-471, score-1.012]

97 Learning relations among movie characters: A social network perspective. [sent-492, score-0.621]

98 Marginal of the position of a role relative to the reference (“groom”), estimated by our model is shown for YouTube wedding videos. [sent-600, score-0.655]

99 Seeing people in social context: recognizing people and social relationships. [sent-645, score-1.382]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('social', 0.536), ('roles', 0.418), ('role', 0.384), ('event', 0.207), ('interaction', 0.192), ('birthday', 0.16), ('wedding', 0.16), ('people', 0.142), ('groom', 0.129), ('bride', 0.115), ('reference', 0.111), ('person', 0.104), ('proxemic', 0.1), ('interactions', 0.084), ('youtube', 0.082), ('trecvid', 0.081), ('piv', 0.081), ('presenter', 0.081), ('crf', 0.078), ('award', 0.07), ('videos', 0.069), ('kits', 0.065), ('secondary', 0.064), ('unary', 0.063), ('siv', 0.057), ('pst', 0.053), ('proxemics', 0.05), ('recipient', 0.05), ('assignment', 0.049), ('weakly', 0.049), ('cake', 0.048), ('sjv', 0.048), ('events', 0.047), ('video', 0.046), ('human', 0.045), ('pvm', 0.043), ('gender', 0.039), ('variational', 0.039), ('movie', 0.038), ('discovery', 0.036), ('amply', 0.032), ('bridesmaids', 0.032), ('distributor', 0.032), ('envamente', 0.032), ('groomsmen', 0.032), ('guests', 0.032), ('justifying', 0.032), ('methodbirthday', 0.032), ('occupation', 0.032), ('pceiaolp', 0.032), ('peesr', 0.032), ('priest', 0.032), ('rleol', 0.032), ('sociology', 0.032), ('assignments', 0.031), ('activities', 0.03), ('touch', 0.03), ('footages', 0.029), ('osfo', 0.029), ('bangpeng', 0.029), ('supervised', 0.028), ('helping', 0.028), ('annotations', 0.028), ('zv', 0.027), ('group', 0.026), ('relations', 0.026), ('tractable', 0.026), ('recognizing', 0.026), ('clothing', 0.025), ('characters', 0.025), ('acts', 0.025), ('activity', 0.025), ('discover', 0.024), ('cluster', 0.024), ('tracks', 0.023), ('inference', 0.023), ('box', 0.023), ('noted', 0.022), ('marginals', 0.022), ('network', 0.021), ('pairwise', 0.021), ('context', 0.021), ('labels', 0.02), ('full', 0.02), ('initialized', 0.02), ('gallagher', 0.02), ('marked', 0.02), ('failure', 0.019), ('track', 0.019), ('action', 0.019), ('scores', 0.019), ('confusion', 0.018), ('pv', 0.018), ('specific', 0.018), ('assigned', 0.018), ('understand', 0.017), ('lan', 0.017), ('clusters', 0.017), ('cues', 0.017), ('distinguished', 0.017)]
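
A word-weight list like the one above is typically compared across papers with cosine similarity. The sketch below assumes that this is how the similarity values in the following list were obtained, which the page does not state; `cosine_similarity` and the second paper's weights are hypothetical, while the first dictionary reuses the top four weights listed above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as word -> weight."""
    shared = set(a) & set(b)
    numerator = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return numerator / (norm_a * norm_b)


this_paper = {"social": 0.536, "roles": 0.418, "role": 0.384, "event": 0.207}
other_paper = {"role": 0.4, "event": 0.5, "agents": 0.3}  # made-up weights
self_similarity = cosine_similarity(this_paper, this_paper)
cross_similarity = cosine_similarity(this_paper, other_paper)
print(self_similarity, cross_similarity)
```

Floating-point rounding in such a computation would explain why a paper's similarity to itself can come out marginally different from exactly 1.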

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 402 cvpr-2013-Social Role Discovery in Human Events

Author: Vignesh Ramanathan, Bangpeng Yao, Li Fei-Fei

2 0.35223502 292 cvpr-2013-Multi-agent Event Detection: Localization and Role Assignment

Author: Suha Kwak, Bohyung Han, Joon Hee Han

Abstract: We present a joint estimation technique of event localization and role assignment when the target video event is described by a scenario. Specifically, to detect multi-agent events from video, our algorithm identifies agents involved in an event and assigns roles to the participating agents. Instead of iterating through all possible agent-role combinations, we formulate the joint optimization problem as two efficient subproblems—quadratic programming for role assignment followed by linear programming for event localization. Additionally, we reduce the computational complexity significantly by applying role-specific event detectors to each agent independently. We test the performance of our algorithm in natural videos, which contain multiple target events and nonparticipating agents.

3 0.21563119 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image

Author: Ishani Chakraborty, Hui Cheng, Omar Javed

Abstract: We present a unified framework for detecting and classifying people interactions in unconstrained user generated images. Unlike previous approaches that directly map people/face locations in 2D image space into features for classification, we first estimate camera viewpoint and people positions in 3D space and then extract spatial configuration features from explicit 3D people positions. This approach has several advantages. First, it can accurately estimate relative distances and orientations between people in 3D. Second, it encodes spatial arrangements of people into a richer set of shape descriptors than afforded in 2D. Our 3D shape descriptors are invariant to camera pose variations often seen in web images and videos. The proposed approach also estimates camera pose and uses it to capture the intent of the photo. To achieve accurate 3D people layout estimation, we develop an algorithm that robustly fuses semantic constraints about human interpositions into a linear camera model. This enables our model to handle large variations in people size, heights (e.g. age) and poses. An accurate 3D layout also allows us to construct features informed by Proxemics that improve our semantic classification. To characterize the human interaction space, we introduce visual proxemes, a set of prototypical patterns that represent commonly occurring social interactions in events. We train a discriminative classifier that classifies 3D arrangements of people into visual proxemes and quantitatively evaluate the performance on a large, challenging dataset.

4 0.19812562 172 cvpr-2013-Finding Group Interactions in Social Clutter

Author: Ruonan Li, Parker Porfilio, Todd Zickler

Abstract: We consider the problem of finding distinctive social interactions involving groups of agents embedded in larger social gatherings. Given a pre-defined gallery of short exemplar interaction videos, and a long input video of a large gathering (with approximately-tracked agents), we identify within the gathering small sub-groups of agents exhibiting social interactions that resemble those in the exemplars. The participants of each detected group interaction are localized in space; the extent of their interaction is localized in time; and when the gallery of exemplars is annotated with group-interaction categories, each detected interaction is classified into one of the pre-defined categories. Our approach represents group behaviors by dichotomous collections of descriptors for (a) individual actions, and (b) pairwise interactions; and it includes efficient algorithms for optimally distinguishing participants from by-standers in every temporal unit and for temporally localizing the extent of the group interaction. Most importantly, the method is generic and can be applied whenever numerous interacting agents can be approximately tracked over time. We evaluate the approach using three different video collections, two that involve humans and one that involves mice.

5 0.15826614 85 cvpr-2013-Complex Event Detection via Multi-source Video Attributes

Author: Zhigang Ma, Yi Yang, Zhongwen Xu, Shuicheng Yan, Nicu Sebe, Alexander G. Hauptmann

Abstract: Complex events essentially include humans, scenes, objects and actions that can be summarized by visual attributes, so leveraging relevant attributes properly could be helpful for event detection. Many works have exploited attributes at image level for various applications. However, attributes at image level are possibly insufficient for complex event detection in videos due to their limited capability in characterizing the dynamic properties of video data. Hence, we propose to leverage attributes at video level (named as video attributes in this work), i.e., the semantic labels of external videos are used as attributes. Compared to complex event videos, these external videos contain simple contents such as objects, scenes and actions which are the basic elements of complex events. Specifically, building upon a correlation vector which correlates the attributes and the complex event, we incorporate video attributes latently as extra informative cues into the event detector learnt from complex event videos. Extensive experiments on a real-world large-scale dataset validate the efficacy of the proposed approach.

6 0.15287368 356 cvpr-2013-Representing and Discovering Adversarial Team Behaviors Using Player Roles

7 0.15174016 103 cvpr-2013-Decoding Children's Social Behavior

8 0.12907794 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding

9 0.11486498 440 cvpr-2013-Tracking People and Their Objects

10 0.10339929 150 cvpr-2013-Event Recognition in Videos by Learning from Heterogeneous Web Sources

11 0.10300253 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities

12 0.09349966 49 cvpr-2013-Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition

13 0.091923036 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?

14 0.079794347 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs

15 0.077557996 252 cvpr-2013-Learning Locally-Adaptive Decision Functions for Person Verification

16 0.074195184 463 cvpr-2013-What's in a Name? First Names as Facial Attributes

17 0.07380119 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches

18 0.072292574 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video

19 0.072100073 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors

20 0.072093658 160 cvpr-2013-Face Recognition in Movie Trailers via Mean Sequence Sparse Representation-Based Classification

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.146), (1, -0.064), (2, -0.01), (3, -0.1), (4, -0.034), (5, 0.014), (6, -0.064), (7, -0.019), (8, 0.036), (9, 0.057), (10, 0.072), (11, -0.059), (12, 0.043), (13, -0.0), (14, -0.02), (15, 0.037), (16, 0.056), (17, 0.128), (18, -0.003), (19, -0.218), (20, -0.122), (21, 0.065), (22, -0.004), (23, 0.005), (24, -0.009), (25, -0.026), (26, 0.024), (27, -0.109), (28, -0.006), (29, -0.122), (30, 0.062), (31, -0.025), (32, 0.012), (33, -0.103), (34, 0.024), (35, 0.038), (36, 0.05), (37, 0.117), (38, 0.112), (39, 0.132), (40, 0.138), (41, -0.065), (42, -0.108), (43, 0.0), (44, 0.248), (45, 0.006), (46, -0.123), (47, -0.04), (48, 0.018), (49, 0.087)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97262555 402 cvpr-2013-Social Role Discovery in Human Events

Author: Vignesh Ramanathan, Bangpeng Yao, Li Fei-Fei

2 0.87969333 292 cvpr-2013-Multi-agent Event Detection: Localization and Role Assignment

Author: Suha Kwak, Bohyung Han, Joon Hee Han

3 0.6817174 172 cvpr-2013-Finding Group Interactions in Social Clutter

Author: Ruonan Li, Parker Porfilio, Todd Zickler

4 0.5869 103 cvpr-2013-Decoding Children's Social Behavior

Author: James M. Rehg, Gregory D. Abowd, Agata Rozga, Mario Romero, Mark A. Clements, Stan Sclaroff, Irfan Essa, Opal Y. Ousley, Yin Li, Chanho Kim, Hrishikesh Rao, Jonathan C. Kim, Liliana Lo Presti, Jianming Zhang, Denis Lantsman, Jonathan Bidwell, Zhefan Ye

Abstract: We introduce a new problem domain for activity recognition: the analysis of children's social and communicative behaviors based on video and audio data. We specifically target interactions between children aged 1–2 years and an adult. Such interactions arise naturally in the diagnosis and treatment of developmental disorders such as autism. We introduce a new publicly-available dataset containing over 160 sessions of a 3–5 minute child-adult interaction. In each session, the adult examiner followed a semistructured play interaction protocol which was designed to elicit a broad range of social behaviors. We identify the key technical challenges in analyzing these behaviors, and describe methods for decoding the interactions. We present experimental results that demonstrate the potential of the dataset to drive interesting research questions, and show preliminary results for multi-modal activity recognition.

5 0.55061287 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding

Author: Jérôme Revaud, Matthijs Douze, Cordelia Schmid, Hervé Jégou

Abstract: This paper presents an approach for large-scale event retrieval. Given a video clip of a specific event, e.g., the wedding of Prince William and Kate Middleton, the goal is to retrieve other videos representing the same event from a dataset of over 100k videos. Our approach encodes the frame descriptors of a video to jointly represent their appearance and temporal order. It exploits the properties of circulant matrices to compare the videos in the frequency domain. This offers a significant gain in complexity and accurately localizes the matching parts of videos. Furthermore, we extend product quantization to complex vectors in order to compress our descriptors, and to compare them in the compressed domain. Our method outperforms the state of the art both in search quality and query time on two large-scale video benchmarks for copy detection, TRECVID and CCWEB. Finally, we introduce a challenging dataset for event retrieval, EVVE, and report the performance on this dataset.

6 0.54717082 85 cvpr-2013-Complex Event Detection via Multi-source Video Attributes

7 0.49402237 49 cvpr-2013-Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition

8 0.48305944 356 cvpr-2013-Representing and Discovering Adversarial Team Behaviors Using Player Roles

9 0.47653821 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image

10 0.39281487 28 cvpr-2013-A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching

11 0.38253939 413 cvpr-2013-Story-Driven Summarization for Egocentric Video

12 0.37395087 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos

13 0.34933192 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors

14 0.33627442 272 cvpr-2013-Long-Term Occupancy Analysis Using Graph-Based Optimisation in Thermal Imagery

15 0.32864064 7 cvpr-2013-A Divide-and-Conquer Method for Scalable Low-Rank Latent Matrix Pursuit

16 0.31949019 440 cvpr-2013-Tracking People and Their Objects

17 0.30728874 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities

18 0.30722526 463 cvpr-2013-What's in a Name? First Names as Facial Attributes

19 0.30526671 252 cvpr-2013-Learning Locally-Adaptive Decision Functions for Person Verification

20 0.28423381 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.081), (16, 0.012), (26, 0.051), (28, 0.017), (33, 0.217), (67, 0.075), (69, 0.071), (77, 0.332), (87, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7901265 402 cvpr-2013-Social Role Discovery in Human Events

Author: Vignesh Ramanathan, Bangpeng Yao, Li Fei-Fei

2 0.78930581 18 cvpr-2013-A Max-Margin Riffled Independence Model for Image Tag Ranking

Author: Tian Lan, Greg Mori

Abstract: We propose Max-Margin Riffled Independence Model (MMRIM), a new method for image tag ranking modeling the structured preferences among tags. The goal is to predict a ranked tag list for a given image, where tags are ordered by their importance or relevance to the image content. Our model integrates the max-margin formalism with riffled independence factorizations proposed in [10], which naturally allows for structured learning and efficient ranking. Experimental results on the SUN Attribute and LabelMe datasets demonstrate the superior performance of the proposed model compared with baseline tag ranking methods. We also apply the predicted rank list of tags to several higher-level computer vision applications in image understanding and retrieval, and demonstrate that MMRIM significantly improves the accuracy of these applications.

3 0.78406799 358 cvpr-2013-Robust Canonical Time Warping for the Alignment of Grossly Corrupted Sequences

Author: Yannis Panagakis, Mihalis A. Nicolaou, Stefanos Zafeiriou, Maja Pantic

Abstract: Temporal alignment of human behaviour from visual data is a very challenging problem due to numerous reasons, including possible large temporal scale differences, inter/intra subject variability and, more importantly, due to the presence of gross errors and outliers. Gross errors are often in abundance due to incorrect localization and tracking, presence of partial occlusion, etc. Furthermore, such errors rarely follow a Gaussian distribution, which is the de-facto assumption in machine learning methods. In this paper, building on recent advances on rank minimization and compressive sensing, a novel temporal alignment method that is robust to gross errors is proposed. While previous approaches combine the dynamic time warping (DTW) with low-dimensional projections that maximally correlate two sequences, we aim to learn two underlying projection matrices (one for each sequence), which not only maximally correlate the sequences but, at the same time, efficiently remove the possible corruptions in any datum in the sequences. The projections are obtained by minimizing the weighted sum of nuclear and ℓ1 norms, by solving a sequence of convex optimization problems, while the temporal alignment is found by applying the DTW in an alternating fashion. The superiority of the proposed method against the state-of-the-art time alignment methods, namely the canonical time warping and the generalized time warping, is indicated by the experimental results on both synthetic and real datasets.

4 0.71990788 146 cvpr-2013-Enriching Texture Analysis with Semantic Data

Author: Tim Matthews, Mark S. Nixon, Mahesan Niranjan

Abstract: We argue for the importance of explicit semantic modelling in human-centred texture analysis tasks such as retrieval, annotation, synthesis, and zero-shot learning. To this end, low-level attributes are selected and used to define a semantic space for texture. 319 texture classes varying in illumination and rotation are positioned within this semantic space using a pairwise relative comparison procedure. Low-level visual features used by existing texture descriptors are then assessed in terms of their correspondence to the semantic space. Textures with strong presence of attributes connoting randomness and complexity are shown to be poorly modelled by existing descriptors. In a retrieval experiment semantic descriptors are shown to outperform visual descriptors. Semantic modelling of texture is thus shown to provide considerable value in both feature selection and in analysis tasks.

5 0.71975654 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding

Author: Jérôme Revaud, Matthijs Douze, Cordelia Schmid, Hervé Jégou

6 0.68228555 292 cvpr-2013-Multi-agent Event Detection: Localization and Role Assignment

7 0.67998117 364 cvpr-2013-Robust Object Co-detection

8 0.63622814 85 cvpr-2013-Complex Event Detection via Multi-source Video Attributes

9 0.62194157 422 cvpr-2013-Tag Taxonomy Aware Dictionary Learning for Region Tagging

10 0.61867863 213 cvpr-2013-Image Tag Completion via Image-Specific and Tag-Specific Linear Sparse Reconstructions

11 0.61304194 356 cvpr-2013-Representing and Discovering Adversarial Team Behaviors Using Player Roles

12 0.61230272 28 cvpr-2013-A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching

13 0.60963553 432 cvpr-2013-Three-Dimensional Bilateral Symmetry Plane Estimation in the Phase Domain

14 0.60590142 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval

15 0.60557956 377 cvpr-2013-Sample-Specific Late Fusion for Visual Category Recognition

16 0.60498899 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image

17 0.60490084 164 cvpr-2013-Fast Convolutional Sparse Coding

18 0.60460263 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection

19 0.60441172 116 cvpr-2013-Designing Category-Level Attributes for Discriminative Visual Recognition

20 0.60411441 248 cvpr-2013-Learning Collections of Part Models for Object Recognition