iccv iccv2013 iccv2013-440 knowledge-graph by maker-knowledge-mining

440 iccv-2013-Video Event Understanding Using Natural Language Descriptions


Source: pdf

Author: Vignesh Ramanathan, Percy Liang, Li Fei-Fei

Abstract: Human action and role recognition play an important part in complex event understanding. State-of-the-art methods learn action and role models from detailed spatio temporal annotations, which requires extensive human effort. In this work, we propose a method to learn such models based on natural language descriptions of the training videos, which are easier to collect and scale with the number of actions and roles. There are two challenges with using this form of weak supervision: First, these descriptions only provide a high-level summary and often do not directly mention the actions and roles occurring in a video. Second, natural language descriptions do not provide spatio temporal annotations of actions and roles. To tackle these challenges, we introduce a topic-based semantic relatedness (SR) measure between a video description and an action and role label, and incorporate it into a posterior regularization objective. Our event recognition system based on these action and role models matches the state-of-the-art method on the TRECVID-MED11 event kit, despite weaker supervision.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Human action and role recognition play an important part in complex event understanding. [sent-3, score-0.974]

2 State-of-the-art methods learn action and role models from detailed spatio temporal annotations, which requires extensive human effort. [sent-4, score-0.956]

3 In this work, we propose a method to learn such models based on natural language descriptions of the training videos, which are easier to collect and scale with the number of actions and roles. [sent-5, score-0.875]

4 There are two challenges with using this form of weak supervision: First, these descriptions only provide a high-level summary and often do not directly mention the actions and roles occurring in a video. [sent-6, score-0.788]

5 Second, natural language descriptions do not provide spatio temporal annotations of actions and roles. [sent-7, score-1.121]

6 To tackle these challenges, we introduce a topic-based semantic relatedness (SR) measure between a video description and an action and role label, and incorporate it into a posterior regularization objective. [sent-8, score-1.075]
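
For readers unfamiliar with posterior regularization, the generic form of such an objective is sketched below; this is the standard formulation, not necessarily the exact variant used in the paper, and the constraint features are placeholders.

```latex
% Generic posterior-regularization objective (a sketch, not the paper's exact variant).
% L(theta) is the marginal log-likelihood, q ranges over distributions on the latent
% action/role assignments (a, r), and Q encodes the description-derived constraints.
\max_{\theta}\; \mathcal{L}(\theta)\;-\;\min_{q \in \mathcal{Q}} \mathrm{KL}\big(q(\mathbf{a},\mathbf{r}) \,\|\, p_{\theta}(\mathbf{a},\mathbf{r} \mid \mathbf{x})\big),
\qquad
\mathcal{Q} = \big\{\, q : \mathbb{E}_{q}[\phi(\mathbf{a},\mathbf{r})] \le \mathbf{b} \,\big\}
```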

7 Our event recognition system based on these action and role models matches the state-of-the-art method on the TRECVID-MED11 event kit, despite weaker supervision. [sent-9, score-1.181]

8 In this work, we provide a method to learn these action and role models based on easily obtainable natural language descriptions of event videos (see Fig. [sent-13, score-1.691]

9 We rely entirely on these descriptions and do not require separate ground truth annotations of roles and actions. [sent-15, score-0.632]

10 The use of action and/or role models trained with extensive spatio temporal annotations has been shown to boost event recognition performance in videos [8, 12]. [sent-16, score-1.297]

11 The descriptions of videos containing the action “play instrument” are bounded in green, but we do not use action/role labels during training. [sent-41, score-0.848]

12 On the other hand, complex event datasets like TRECVID-MED11 [1] event kit and MPII Cooking [22] are accompanied by natural language descriptions, which are easy to obtain and incur only a one-time annotation cost during the collection of a dataset. [sent-43, score-0.952]

13 Unfortunately, natural language descriptions only provide a coarse high-level summary of the events occurring in videos. [sent-45, score-0.76]

14 Bridging the gap between highlevel natural language descriptions and low-level action/role labels is a challenging problem in natural language semantics. [sent-53, score-1.053]

15 To tackle this, we define a new semantic relatedness (SR) measure between an action/role label and a natural language description. [sent-54, score-0.534]

16 The second challenge is that natural language descriptions do not specify the spatiotemporal extents of actions and roles. [sent-56, score-0.843]

17 Specifically, we represent a video as a bag of spatiotemporally-localized human tracklets, and define an action and role assignment variable for each tracklet. [sent-58, score-0.865]

18 The natural language supervision then imposes a soft constraint that at least one of the tracklets in the video is assigned to a semantically-related action/role label. [sent-59, score-0.65]

19 We first evaluated our approach on action and role classification, showing that our SR measure improves accuracy over existing measures. [sent-60, score-0.709]

20 We incorporated our action/role models, which are trained only on natural language descriptions, into our event recognition model. [sent-62, score-0.592]

21 Related Work Natural language processing for vision Recent works attempting to leverage the vast amount of textual data available with Internet images have developed vision-specific semantic relatedness measures [25, 24] to identify the link between part-based object attributes and image classes. [sent-65, score-0.516]

22 Other attempts to use textual descriptions in conjunction with attribute recognition were presented in [2, 19]. [sent-67, score-0.423]

23 [23] transfers composite action videos to an attribute space enabling comparison with a textual corpus. [sent-70, score-0.575]

24 Again, these methods rely on the presence of the action label in the script or use a pretrained classifier [13] to identify the action-text in a script and require temporal annotations. [sent-72, score-0.557]

25 [18] processes descriptions of action segments to automatically discover a set of action classes. [sent-74, score-1.117]

26 In contrast to the above methods, we learn models based on natural language descriptions which may not contain the action and role label. [sent-75, score-1.355]

27 Action, role and event recognition [8, 12] showed significant improvement in event recognition by using atomic ac- tion and role detectors as a part of their event recognition model. [sent-77, score-1.413]

28 Both methods required spatio temporal annotation of action and roles in the training videos to learn the models. [sent-78, score-1.045]

29 Other works which have investigated the use of social roles in video understanding include [30, 5]. [sent-79, score-0.398]

30 Weakly supervised action models Discriminative spatio temporal regions in videos or images are used to localize the actions in [27, 21, 28]. [sent-81, score-0.89]

31 However, we develop a model with latent action and role assignments to different human tracklets in a video. [sent-83, score-1.015]

32 We first use natural language video descriptions to train action and role models. [sent-87, score-1.489]

33 In our setup, each training video is accompanied by a natural language description, which might or might not contain the action label present in the video. [sent-89, score-0.93]

34 No textual descriptions are present in the test data. [sent-105, score-0.423]

35 We assume a fixed set of actions A and roles R and define additional variables … [sent-106, score-0.434]

36 Human tracklet extraction Complex event videos are composed of many atomic actions and roles, confined to spatio temporal regions. [sent-116, score-0.936]

37 The action or role occurring in a video would then correspond to one or more of these human tracklets h ∈ Hi. [sent-118, score-0.92]

38 We obtain tracklets by running a human detector [6] across different segments in a video and tracking the resulting bounding boxes within a temporal window of 100 frames. [sent-120, score-0.399]
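
A minimal sketch of this tracklet-extraction step is shown below. It assumes detections are already produced per frame by an external human detector (e.g. [6] in the text) and links them with a simple greedy IoU rule; the linking heuristic and thresholds are illustrative assumptions, not the authors' implementation.

```python
def box_iou(b1, b2):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(b1) + area(b2) - inter + 1e-12)

def extract_tracklets(frame_detections, window=100, iou_thresh=0.5):
    """Link per-frame human detections into tracklets within fixed temporal windows.

    frame_detections: list over frames; each entry is a list of person boxes
    produced by an external human detector.
    """
    tracklets = []
    for start in range(0, len(frame_detections), window):
        active = []  # each item: list of (frame_idx, box)
        for t in range(start, min(start + window, len(frame_detections))):
            for box in frame_detections[t]:
                best = max(active, key=lambda tr: box_iou(tr[-1][1], box), default=None)
                if best is not None and box_iou(best[-1][1], box) >= iou_thresh:
                    best.append((t, box))      # extend the best-overlapping track
                else:
                    active.append([(t, box)])  # start a new tracklet
        tracklets.extend(active)
    return tracklets
```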

39 Action and role model We define a conditional random field (CRF) to model the actions and roles of different tracklets in a video, similar in spirit to [12]. [sent-125, score-0.935]

40 However, we neither assume perfect human tracking nor complete person-wise action and role labels for training. [sent-126, score-0.783]

41 The features fahc ∈ Rdac and frho ∈ Rdro are the action and role features for the human tracklet h, respectively. [sent-131, score-0.798]

42 The global weight is denoted by wg ∈ R|A|×|R|×dg, where wg(a, r) ∈ Rdg gives the global weight for action a and role r. [sent-132, score-0.588]

43 Similarly, win ∈ R|A|×|R| is the weight for joint action and role assignment to a track, with win (a, r) ∈ R corresponding to action a and role r. [sent-133, score-1.501]
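
The weight layout described in the two sentences above suggests a per-tracklet score of roughly the following form. This is a hedged reading of the model: the per-track action/role weights (written here as wac, wro, matching terms from the paper's vocabulary), the additive decomposition, and all feature shapes are assumptions for illustration.

```python
import numpy as np

def tracklet_score(a, r, f_ac, f_ro, f_g, wac, wro, wg, win):
    """Score of assigning action a and role r to one human tracklet.

    f_ac: action features (d_ac,), f_ro: role features (d_ro,), f_g: global video features (d_g,)
    wac: (|A|, d_ac), wro: (|R|, d_ro), wg: (|A|, |R|, d_g), win: (|A|, |R|)
    """
    return (wac[a] @ f_ac        # action evidence on the tracklet
            + wro[r] @ f_ro      # role evidence on the tracklet
            + wg[a, r] @ f_g     # global video context for the (a, r) pair
            + win[a, r])         # joint action-role compatibility

# Toy usage with random weights: 4 actions, 3 roles.
rng = np.random.default_rng(0)
A, R, d_ac, d_ro, d_g = 4, 3, 16, 8, 32
score = tracklet_score(1, 2,
                       rng.normal(size=d_ac), rng.normal(size=d_ro), rng.normal(size=d_g),
                       rng.normal(size=(A, d_ac)), rng.normal(size=(R, d_ro)),
                       rng.normal(size=(A, R, d_g)), rng.normal(size=(A, R)))
print(score)
```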

44 The log-likelihood of making action and role assignments a = (a1, . [sent-144, score-0.734]

45 We wish to learn model weights while making latent action and role assignments to each tracklet in the video. [sent-167, score-0.896]

46 This enables us to optimize the likelihood subject to soft constraints on the predicted action and role distribution. [sent-170, score-0.683]

47 Formally, let Q(a, r) be a distribution of action and role assignments to the training videos. [sent-171, score-0.766]

48 We wish to ensure that, in a video tagged as positive for a specific action, the number of tracklets corresponding to the action is at least one. [sent-172, score-0.701]

49 Similarly, in negative videos, the number of tracklets corresponding to an action should be zero. [sent-173, score-0.606]
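
The two count constraints described above can be written as expectations under Q; the following is a simplified sketch (the paper likely uses soft or slack-relaxed versions of these constraints).

```latex
% H_i is the tracklet set of video i; a_h is the latent action assignment of tracklet h.
\mathbb{E}_{Q}\Big[\textstyle\sum_{h \in H_i} \mathbf{1}[a_h = a]\Big] \;\ge\; 1
\quad \text{if video } i \text{ is tagged positive for action } a,
\qquad
\mathbb{E}_{Q}\Big[\textstyle\sum_{h \in H_i} \mathbf{1}[a_h = a]\Big] \;=\; 0
\quad \text{if video } i \text{ is negative for } a.
```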

50 Using natural language video descriptions The natural language description of a video contains rich information about the event context and can help infer the presence of specific actions and roles in the video. [sent-192, score-1.936]

51 Fig. 1 provides examples of descriptions which do not contain the action label “play instrument”. [sent-194, score-0.749]

52 We address this issue by building a task-specific language topic model and using it to define the SR measure. [sent-202, score-0.418]

53 Topic model based SR: A natural source for video descriptions is the vast collection of user-provided descriptions of YouTube videos. [sent-203, score-0.813]

54 Since the text corpus was obtained based on training descriptions, the generated topic clusters often capture frequent actions and roles in the data. [sent-207, score-0.684]

55 All video descriptions ti can now be represented by a 200 dimensional vector fdti specifying the distribution of the topics in the description. [sent-210, score-0.504]

56 An action a can be represented by fda giving the topic distribution over the action label (fdr is defined similarly for a role r). [sent-211, score-0.982]

57 This provides the proximity of a video xi to an action a. [sent-216, score-0.521]
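
A toy sketch of such a topic-based SR measure follows, using scikit-learn's LDA with 200 topics and cosine similarity between topic vectors. The miniature corpus, the label strings, and the choice of cosine similarity are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ["a man plays an instrument on stage",       # toy stand-ins for YouTube descriptions
          "kids assemble a wooden shelf with a drill"]
action_labels = ["play instrument", "use drill"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=200, random_state=0).fit(X)  # 200 topics as in the text

def topic_vector(text):
    """Topic distribution (f_d) of a piece of text under the fitted LDA model."""
    theta = lda.transform(vec.transform([text]))[0]
    return theta / theta.sum()

def semantic_relatedness(description, label):
    """Cosine similarity between the topic vectors of a description and an action/role label."""
    d, a = topic_vector(description), topic_vector(label)
    return float(d @ a / (np.linalg.norm(d) * np.linalg.norm(a) + 1e-12))

print(semantic_relatedness(corpus[0], action_labels[0]))
```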

58 Training the event model We use the action and role detection scores to perform video event classification. [sent-284, score-1.276]

59 The expected number of tracklets corresponding to different actions and roles are used as additional features along with the global video features to train a linear SVM. [sent-285, score-0.811]

60 2 and finally treat the event classification score from these classifiers as global video features. [sent-291, score-0.403]

61 Since only a small set of actions and roles are usually related to an event, we add an additional L1 regularization term for the action and role feature weights to encourage only the relevant action and/or role scores to be selected. [sent-292, score-1.855]
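
A rough sketch of the event classifier described in the last few sentences: global video features are concatenated with expected per-action and per-role tracklet counts and fed to a linear SVM. For simplicity this sketch applies the L1 penalty to all weights, whereas the text adds the extra L1 term only on the action/role block; all data and dimensions here are synthetic.

```python
import numpy as np
from sklearn.svm import LinearSVC

def event_features(global_feat, action_counts, role_counts):
    """Concatenate global video features with expected action/role tracklet counts."""
    return np.concatenate([global_feat, action_counts, role_counts])

rng = np.random.default_rng(0)
X = np.stack([event_features(rng.normal(size=64),   # global video features
                             rng.random(size=10),   # E[#tracklets] per action
                             rng.random(size=5))    # E[#tracklets] per role
              for _ in range(40)])
y = rng.integers(0, 2, size=40)                     # toy event labels

clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=1.0).fit(X, y)
print(clf.coef_.shape)   # sparse weights: only a few action/role scores survive the L1 penalty
```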

62 Experiments We test our event, action and role classification models on the TRECVID-MED11 event kit. [sent-294, score-0.96]

63 Each video is accompanied by a synopsis describing the events in the video, and only a few of them mention the atomic actions and objects present in the video. [sent-296, score-0.457]

64 Implementation details We define crude action labels y˜i and role labels z˜i for each video xi based on simple text processing. [sent-300, score-0.953]

65 We set y˜ia = 1 if ti contains the action label a, y˜ia = −1 if none of the natural language descriptions in the event class of xi contain the action label a; otherwise, we set y˜ia = 0. [sent-301, score-1.544]

66 We define ˜zri ∈ {−1, 0, 1} similarly for video xi and role r. [sent-302, score-0.443]

67 The value of τ is set to consider the top 300 (30) videos closest to the action (role) description as potential positives. [sent-306, score-0.529]
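
A hedged sketch of this crude-labeling heuristic is given below. The exact string matching, the handling of τ, and the returned data structures are assumptions; `topic_sr` stands for the SR measure from the earlier topic-model sketch.

```python
def crude_action_labels(descriptions, event_classes, action, topic_sr, top_k=300):
    """Return y~ in {-1, 0, +1} per video for one action label, plus potential positives.

    +1 : the video's own description mentions the action label
    -1 : no description in the video's event class mentions the label
     0 : otherwise (ambiguous)
    """
    labels = []
    for desc, ev in zip(descriptions, event_classes):
        same_event = [d for d, e in zip(descriptions, event_classes) if e == ev]
        if action in desc:
            labels.append(+1)
        elif all(action not in d for d in same_event):
            labels.append(-1)
        else:
            labels.append(0)
    # tau in the text: keep the top_k videos closest to the action description
    # under the SR measure as potential positives (300 for actions, 30 for roles).
    ranked = sorted(range(len(descriptions)),
                    key=lambda i: topic_sr(descriptions[i], action), reverse=True)
    return labels, set(ranked[:top_k])
```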

68 In our experiments, we train separate one-vs-all models for each action and role. [sent-307, score-0.433]

69 While training an action (role) model, we consider the relation of the action (role) to all the roles (actions) including a null role (action). [sent-308, score-1.372]

70 In practice, this makes the learning more tractable and also performs better than training a single model considering all actions and roles together. [sent-309, score-0.466]

71 We choose only the action classes which are directly mentioned at least once in the training data descriptions. [sent-315, score-0.426]

72 Each video in the test set is annotated with the actions and roles present in it for evaluation. [sent-318, score-0.529]

73 The action and role classification performance is evaluated by computing the average precision on the testing data as shown in Tab. [sent-319, score-0.711]

74 We define the expected number of tracklets performing an action in a video as the corresponding action score for the video. [sent-321, score-1.087]

75 Similarly, the expected number of tracklets holding a role in a video provides the role score. [sent-322, score-0.921]
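
A small sketch of how such expected-count scores could be computed from per-tracklet posterior marginals; the array layout (tracklets × actions × roles) is an assumption for illustration.

```python
import numpy as np

def video_scores(posteriors):
    """posteriors: (num_tracklets, |A|, |R|) joint assignment probabilities per tracklet."""
    action_scores = posteriors.sum(axis=2).sum(axis=0)   # E[#tracklets] per action
    role_scores = posteriors.sum(axis=1).sum(axis=0)     # E[#tracklets] per role
    return action_scores, role_scores

p = np.full((3, 4, 2), 1.0 / 8)   # 3 tracklets, uniform over 4 actions x 2 roles
print(video_scores(p))            # each action: 0.75 expected tracklets, each role: 1.5
```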

76 “Full model” refers to the complete algorithm using video descriptions to train PR models in a self-paced setting. [sent-323, score-0.49]

77 From Tabs. 1 and 2, we observe that identifying human tracklets in the videos improves the overall action and role classification. [sent-337, score-1.045]

78 The effect is even more prominent for roles, since roles are governed by the humans holding the … [Table 1 column headers: Action, global only, simple PR, full PR, wiki [25] SR, topic SR, full model] [sent-338, score-0.552]

79 As expected, we see strong correlation between certain action and role classes (highlighted by ovals). [sent-360, score-0.683]

80 This is illustrated in Fig. 5, where the highest scoring tracklet in a video for a certain action is shown along with the corresponding role assignment. [sent-365, score-0.881]

81 To analyze the utility of our topic model, we run experiments where natural language descriptions are assumed to be present both during training and testing. [sent-411, score-0.839]

82 We train two separate PR models which use topic model based textual features and Wikipedia SR based textual features respectively as additional global features both during training and testing. [sent-412, score-0.425]

83 For Wikipedia features, we concatenate the Wikipedia SR measure of a description with each of the action and role labels to form a feature. [sent-415, score-0.795]

84 Further, note that in our adaptation of the self-paced approach, while natural language descriptions are not available during testing, we use features extracted from natural language descriptions in the initial iterations of training, and finally anneal their weights to zero. [sent-419, score-1.369]

85 The green bars correspond to the setting where natural language descriptions are used at test time. [sent-449, score-0.701]

86 effective for action classes like hammering, writing, planing, drilling, where the number of training video descriptions directly containing the action label was below 10. [sent-452, score-1.27]

87 We notice the inclusion of videos whose descriptions do not contain the action directly. [sent-455, score-0.81]

88 Note that unlike our method, [8] used extensive spatio temporal annotation to learn completely supervised atomic action classification models. [sent-462, score-0.779]

89 We report two sets of results from [8], one using action classification scores in a linear ensemble SVM and the other using them in a joint CRF model. [sent-464, score-0.422]

90 From Tab. 3, we observe that our methods using either the action or role features outperform an SVM trained only with global video features. [sent-466, score-0.809]

91 Our full model using both action and role scores achieves the maximum mean AP. [sent-467, score-0.683]

92 Thus, our action and role models trained only with natural language descriptions match the state-of-the-art methods from [8], which use ground truth spatio temporal action annotations for training. [sent-468, score-2.027]

93 This supports the utility of our action and role models learned with very weak supervision. [sent-469, score-0.683]

94 Conclusion We have presented a method to learn atomic action and role models based on easily available natural language video descriptions. [sent-471, score-1.209]

95 We proposed a language topic model … [Table 3 header residue: Event, global only SVM, [8]* SVM, [8]* joint CRF, global + action, global + roles, full model] [sent-472, score-0.418]

96 * Unlike our method, [8] uses extensive ground truth spatio temporal annotations for training separate action classifiers to aid event classification. [sent-475, score-0.953]

97 These labels were used to train a CRF model with posterior regularization, which makes latent action and role assignments to human tracklets. [sent-477, score-0.934]

98 The action and role models were used to achieve state-of-the-art event classification performance on the TRECVID-MED11 event kit. [sent-479, score-1.209]

99 Further, such SR measures could also be used to tackle the problem of converting video content to natural language descriptions as proposed in [26]. [sent-481, score-0.767]

100 Combining language sources and robust semantic relatedness for attribute-based knowledge transfer. [sent-650, score-0.422]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('action', 0.394), ('descriptions', 0.329), ('role', 0.289), ('language', 0.283), ('roles', 0.263), ('event', 0.249), ('tracklets', 0.212), ('spatio', 0.181), ('actions', 0.171), ('sr', 0.169), ('topic', 0.135), ('wikipedia', 0.113), ('relatedness', 0.109), ('tracklet', 0.103), ('video', 0.095), ('textual', 0.094), ('pr', 0.09), ('atomic', 0.088), ('videos', 0.087), ('instrument', 0.082), ('fdti', 0.08), ('rohrbach', 0.071), ('wg', 0.066), ('youtube', 0.064), ('events', 0.063), ('natural', 0.06), ('wac', 0.06), ('zri', 0.06), ('ful', 0.059), ('temporal', 0.057), ('posterior', 0.054), ('zir', 0.053), ('win', 0.053), ('assignments', 0.051), ('description', 0.048), ('yia', 0.047), ('corpus', 0.043), ('play', 0.042), ('social', 0.04), ('annotations', 0.04), ('aih', 0.04), ('fahc', 0.04), ('fgx', 0.04), ('frho', 0.04), ('marszaek', 0.04), ('paced', 0.04), ('rdac', 0.04), ('rdro', 0.04), ('rih', 0.04), ('wro', 0.04), ('kit', 0.04), ('accompanied', 0.04), ('script', 0.04), ('text', 0.04), ('train', 0.039), ('labels', 0.038), ('ia', 0.038), ('holding', 0.036), ('rdg', 0.036), ('thater', 0.036), ('szarvas', 0.036), ('human', 0.035), ('latent', 0.034), ('kissing', 0.033), ('wrench', 0.033), ('fda', 0.033), ('repositories', 0.033), ('vid', 0.033), ('pinkal', 0.033), ('regneri', 0.033), ('training', 0.032), ('xi', 0.032), ('hq', 0.031), ('pertaining', 0.031), ('annotation', 0.031), ('global', 0.031), ('semantic', 0.03), ('regularization', 0.03), ('eq', 0.03), ('hi', 0.029), ('assignment', 0.029), ('bars', 0.029), ('cutting', 0.029), ('classification', 0.028), ('identifying', 0.028), ('complete', 0.027), ('crude', 0.027), ('similarly', 0.027), ('measure', 0.026), ('crf', 0.026), ('wordnet', 0.026), ('label', 0.026), ('weights', 0.025), ('occurring', 0.025), ('writing', 0.024), ('bold', 0.024), ('relations', 0.024), ('turning', 0.024), ('baselines', 0.024), ('bag', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999911 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

Author: Vignesh Ramanathan, Percy Liang, Li Fei-Fei

Abstract: Human action and role recognition play an important part in complex event understanding. State-of-the-art methods learn action and role models from detailed spatio temporal annotations, which requires extensive human effort. In this work, we propose a method to learn such models based on natural language descriptions of the training videos, which are easier to collect and scale with the number of actions and roles. There are two challenges with using this form of weak supervision: First, these descriptions only provide a high-level summary and often do not directly mention the actions and roles occurring in a video. Second, natural language descriptions do not provide spatio temporal annotations of actions and roles. To tackle these challenges, we introduce a topic-based semantic relatedness (SR) measure between a video description and an action and role label, and incorporate it into a posterior regularization objective. Our event recognition system based on these action and role models matches the state-of-the-art method on the TRECVID-MED11 event kit, despite weaker supervision.

2 0.32996103 428 iccv-2013-Translating Video Content to Natural Language Descriptions

Author: Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, Bernt Schiele

Abstract: Humans use rich natural language to describe and communicate visual perceptions. In order to provide natural language descriptions for visual content, this paper combines two important ingredients. First, we generate a rich semantic representation of the visual content including e.g. object and activity labels. To predict the semantic representation we learn a CRF to model the relationships between different components of the visual input. And second, we propose to formulate the generation of natural language as a machine translation problem using the semantic representation as source language and the generated sentences as target language. For this we exploit the power of a parallel corpus of videos and textual descriptions and adapt statistical machine translation to translate between our two languages. We evaluate our video descriptions on the TACoS dataset [23], which contains video snippets aligned with sentence descriptions. Using automatic evaluation and human judgments we show significant improvements over several baseline approaches, motivated by prior work. Our translation approach also shows improvements over related work on an image description task.

3 0.30928552 86 iccv-2013-Concurrent Action Detection with Structural Prediction

Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu

Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence only have one action class label and different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. An detected action interval is determined both by the unary local detector and the relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our new collected concurrent action dataset demonstrate the strength of our method.

4 0.28902408 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection

Author: Mihai Zanfir, Marius Leordeanu, Cristian Sminchisescu

Abstract: Human action recognition under low observational latency is receiving a growing interest in computer vision due to rapidly developing technologies in human-robot interaction, computer gaming and surveillance. In this paper we propose a fast, simple, yet powerful non-parametric Moving Pose (MP) framework for low-latency human action and activity recognition. Central to our methodology is a moving pose descriptor that considers both pose information as well as differential quantities (speed and acceleration) of the human body joints within a short time window around the current frame. The proposed descriptor is used in conjunction with a modified kNN classifier that considers both the temporal location of a particular frame within the action sequence as well as the discrimination power of its moving pose descriptor compared to other frames in the training set. The resulting method is non-parametric and enables low-latency recognition, one-shot learning, and action detection in difficult unsegmented sequences. Moreover, the framework is real-time, scalable, and outperforms more sophisticated approaches on challenging benchmarks like MSR-Action3D or MSR-DailyActivities3D.

5 0.2681621 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition

Author: Jiang Wang, Ying Wu

Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.

6 0.25409669 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos

7 0.24753837 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition

8 0.22548944 166 iccv-2013-Finding Actors and Actions in Movies

9 0.21829408 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments

10 0.21251556 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition

11 0.20864011 127 iccv-2013-Dynamic Pooling for Complex Event Recognition

12 0.20767604 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction

13 0.20585896 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set

14 0.20465243 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition

15 0.2017801 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints

16 0.19133215 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding

17 0.19060718 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?

18 0.17967249 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM

19 0.17288451 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition

20 0.17203102 452 iccv-2013-YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.229), (1, 0.332), (2, 0.127), (3, 0.296), (4, 0.053), (5, 0.024), (6, 0.148), (7, -0.16), (8, -0.014), (9, -0.043), (10, -0.006), (11, -0.1), (12, -0.025), (13, 0.059), (14, 0.015), (15, -0.046), (16, -0.022), (17, -0.003), (18, -0.092), (19, -0.003), (20, -0.004), (21, 0.035), (22, -0.022), (23, -0.071), (24, 0.021), (25, -0.003), (26, 0.07), (27, -0.068), (28, -0.136), (29, -0.009), (30, 0.003), (31, -0.165), (32, 0.088), (33, 0.001), (34, -0.027), (35, -0.004), (36, -0.059), (37, 0.055), (38, -0.137), (39, -0.063), (40, 0.015), (41, 0.009), (42, -0.086), (43, 0.006), (44, -0.084), (45, -0.007), (46, 0.09), (47, -0.103), (48, -0.018), (49, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96612251 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

Author: Vignesh Ramanathan, Percy Liang, Li Fei-Fei

Abstract: Human action and role recognition play an important part in complex event understanding. State-of-the-art methods learn action and role models from detailed spatio temporal annotations, which requires extensive human effort. In this work, we propose a method to learn such models based on natural language descriptions of the training videos, which are easier to collect and scale with the number of actions and roles. There are two challenges with using this form of weak supervision: First, these descriptions only provide a high-level summary and often do not directly mention the actions and roles occurring in a video. Second, natural language descriptions do not provide spatio temporal annotations of actions and roles. To tackle these challenges, we introduce a topic-based semantic relatedness (SR) measure between a video description and an action and role label, and incorporate it into a posterior regularization objective. Our event recognition system based on these action and role models matches the state-of-the-art method on the TRECVID-MED11 event kit, despite weaker supervision.

2 0.71643496 166 iccv-2013-Finding Actors and Actions in Movies

Author: P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, J. Sivic

Abstract: We address the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts. Specifically, we extract actor/action pairs from the script and use them as constraints in a discriminative clustering framework. The corresponding optimization problem is formulated as a quadratic program under linear constraints. People in video are represented by automatically extracted and tracked faces together with corresponding motion features. First, we apply the proposed framework to the task of learning names of characters in the movie and demonstrate significant improvements over previous methods used for this task. Second, we explore the joint actor/action constraint and show its advantage for weakly supervised action learning. We validate our method in the challenging setting of localizing and recognizing characters and their actions in feature length movies Casablanca and American Beauty.

3 0.71352625 86 iccv-2013-Concurrent Action Detection with Structural Prediction

Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu

Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence only have one action class label and different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. An detected action interval is determined both by the unary local detector and the relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our new collected concurrent action dataset demonstrate the strength of our method.

4 0.70442861 428 iccv-2013-Translating Video Content to Natural Language Descriptions

Author: Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, Bernt Schiele

Abstract: Humans use rich natural language to describe and communicate visual perceptions. In order to provide natural language descriptions for visual content, this paper combines two important ingredients. First, we generate a rich semantic representation of the visual content including e.g. object and activity labels. To predict the semantic representation we learn a CRF to model the relationships between different components of the visual input. And second, we propose to formulate the generation of natural language as a machine translation problem using the semantic representation as source language and the generated sentences as target language. For this we exploit the power of a parallel corpus of videos and textual descriptions and adapt statistical machine translation to translate between our two languages. We evaluate our video descriptions on the TACoS dataset [23], which contains video snippets aligned with sentence descriptions. Using automatic evaluation and human judgments we show significant improvements over several baseline approaches, motivated by prior work. Our translation approach also shows improvements over related work on an image description task.

5 0.68235862 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition

Author: Jiang Wang, Ying Wu

Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.

6 0.66878605 38 iccv-2013-Action Recognition with Actons

7 0.66395283 452 iccv-2013-YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition

8 0.65790647 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding

9 0.64746225 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set

10 0.64019603 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition

11 0.63669163 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection

12 0.62826908 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos

13 0.61310124 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification

14 0.6023491 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition

15 0.58995515 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition

16 0.58450925 127 iccv-2013-Dynamic Pooling for Complex Event Recognition

17 0.55771178 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition

18 0.54392552 163 iccv-2013-Feature Weighting via Optimal Thresholding for Video Analysis

19 0.51762331 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?

20 0.49948981 260 iccv-2013-Manipulation Pattern Discovery: A Nonparametric Bayesian Approach


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.06), (7, 0.016), (12, 0.059), (26, 0.053), (29, 0.214), (31, 0.059), (42, 0.108), (64, 0.1), (73, 0.023), (76, 0.011), (78, 0.018), (89, 0.153), (98, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80160046 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

Author: Vignesh Ramanathan, Percy Liang, Li Fei-Fei

Abstract: Human action and role recognition play an important part in complex event understanding. State-of-the-art methods learn action and role models from detailed spatio temporal annotations, which requires extensive human effort. In this work, we propose a method to learn such models based on natural language descriptions of the training videos, which are easier to collect and scale with the number of actions and roles. There are two challenges with using this form of weak supervision: First, these descriptions only provide a high-level summary and often do not directly mention the actions and roles occurring in a video. Second, natural language descriptions do not provide spatio temporal annotations of actions and roles. To tackle these challenges, we introduce a topic-based semantic relatedness (SR) measure between a video description and an action and role label, and incorporate it into a posterior regularization objective. Our event recognition system based on these action and role models matches the state-of-the-art method on the TRECVID-MED11 event kit, despite weaker supervision.

2 0.72856569 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection

Author: Mihai Zanfir, Marius Leordeanu, Cristian Sminchisescu

Abstract: Human action recognition under low observational latency is receiving a growing interest in computer vision due to rapidly developing technologies in human-robot interaction, computer gaming and surveillance. In this paper we propose a fast, simple, yet powerful non-parametric Moving Pose (MP) framework for low-latency human action and activity recognition. Central to our methodology is a moving pose descriptor that considers both pose information as well as differential quantities (speed and acceleration) of the human body joints within a short time window around the current frame. The proposed descriptor is used in conjunction with a modified kNN classifier that considers both the temporal location of a particular frame within the action sequence as well as the discrimination power of its moving pose descriptor compared to other frames in the training set. The resulting method is non-parametric and enables low-latency recognition, one-shot learning, and action detection in difficult unsegmented sequences. Moreover, the framework is real-time, scalable, and outperforms more sophisticated approaches on challenging benchmarks like MSR-Action3D or MSR-DailyActivities3D.

3 0.71842468 338 iccv-2013-Randomized Ensemble Tracking

Author: Qinxun Bai, Zheng Wu, Stan Sclaroff, Margrit Betke, Camille Monnier

Abstract: We propose a randomized ensemble algorithm to model the time-varying appearance of an object for visual tracking. In contrast with previous online methods for updating classifier ensembles in tracking-by-detection, the weight vector that combines weak classifiers is treated as a random variable and the posterior distribution for the weight vector is estimated in a Bayesian manner. In essence, the weight vector is treated as a distribution that reflects the confidence among the weak classifiers used to construct and adapt the classifier ensemble. The resulting formulation models the time-varying discriminative ability among weak classifiers so that the ensembled strong classifier can adapt to the varying appearance, backgrounds, and occlusions. The formulation is tested in a tracking-by-detection implementation. Experiments on 28 challenging benchmark videos demonstrate that the proposed method can achieve results comparable to and often better than those of stateof-the-art approaches.

4 0.7132805 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition

Author: Jiang Wang, Ying Wu

Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.

5 0.71323872 380 iccv-2013-Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes

Author: Sukrit Shankar, Joan Lasenby, Roberto Cipolla

Abstract: Relative (comparative) attributes are promising for thematic ranking of visual entities, which also aids in recognition tasks [19, 23]. However, attribute rank learning often requires a substantial amount of relational supervision, which is highly tedious, and apparently impractical for real-world applications. In this paper, we introduce the Semantic Transform, which under minimal supervision, adaptively finds a semantic feature space along with a class ordering that is related in the best possible way. Such a semantic space is found for every attribute category. To relate the classes under weak supervision, the class ordering needs to be refined according to a cost function in an iterative procedure. This problem is ideally NP-hard, and we thus propose a constrained search tree formulation for the same. Driven by the adaptive semantic feature space representation, our model achieves the best results to date for all of the tasks of relative, absolute and zero-shot classification on two popular datasets.

6 0.70821404 86 iccv-2013-Concurrent Action Detection with Structural Prediction

7 0.70655417 215 iccv-2013-Incorporating Cloud Distribution in Sky Representation

8 0.70552397 359 iccv-2013-Robust Object Tracking with Online Multi-lifespan Dictionary Learning

9 0.70410931 451 iccv-2013-Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions

10 0.70241177 442 iccv-2013-Video Segmentation by Tracking Many Figure-Ground Segments

11 0.70115024 124 iccv-2013-Domain Transfer Support Vector Ranking for Person Re-identification without Target Camera Label Information

12 0.70091259 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition

13 0.70080483 328 iccv-2013-Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation

14 0.70058358 59 iccv-2013-Bayesian Joint Topic Modelling for Weakly Supervised Object Localisation

15 0.70058137 367 iccv-2013-SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels

16 0.7002058 180 iccv-2013-From Where and How to What We See

17 0.70018488 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos

18 0.69935489 425 iccv-2013-Tracking via Robust Multi-task Multi-view Joint Sparse Representation

19 0.69873643 299 iccv-2013-Online Video SEEDS for Temporal Window Objectness

20 0.69856954 127 iccv-2013-Dynamic Pooling for Complex Event Recognition