iccv iccv2013 iccv2013-166 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, J. Sivic
Abstract: We address the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts. Specifically, we extract actor/action pairs from the script and use them as constraints in a discriminative clustering framework. The corresponding optimization problem is formulated as a quadratic program under linear constraints. People in video are represented by automatically extracted and tracked faces together with corresponding motion features. First, we apply the proposed framework to the task of learning names of characters in the movie and demonstrate significant improvements over previous methods used for this task. Second, we explore the joint actor/action constraint and show its advantage for weakly supervised action learning. We validate our method in the challenging setting of localizing and recognizing characters and their actions in the feature-length movies Casablanca and American Beauty.
Reference: text
sentIndex sentText sentNum sentScore
1 We address the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts. [sent-7, score-0.746]
2 First, we apply the proposed framework to the task of learning names of characters in the movie and demonstrate significant improvements over previous methods used for this task. [sent-11, score-0.711]
3 Second, we explore the joint actor/action constraint and show its advantage for weakly supervised action learning. [sent-12, score-0.479]
4 We validate our method in the challenging setting of localizing and recognizing characters and their actions in the feature-length movies Casablanca and American Beauty. [sent-13, score-0.661]
5 Video scripts exist for thousands of movies and TV series and contain rich descriptions in terms of people, their actions, interactions and emotions, object properties, scene layouts and more. [sent-18, score-0.51]
6 Previous work has explored video scripts to learn and automatically annotate characters in TV series [6, 22, 23]. [sent-19, score-0.415]
7 Automatic learning of human actions from scripts has also been attempted [8, 18, 20]. [sent-20, score-0.609]
8 †LEAR team, INRIA Grenoble Rhône-Alpes, France. Figure 1: Result of our automatic detection and annotation of characters and their actions in the movie Casablanca. [sent-23, score-0.564]
9 Previous work on weakly supervised learning in images [5, 13, 19] and video [6, 8, 18, 20, 22, 23] has explored redundancy to resolve ambiguity of textual annotation. [sent-26, score-0.403]
10 We follow this intuition and address joint weakly supervised learning of actors and actions by exploiting their co-occurrence in movies. [sent-37, score-0.806]
11 We follow previous work [6, 8, 18, 20, 22, 23] and use movie scripts as a source of weak supervision. [sent-38, score-0.426]
12 Unlike this prior work, we use actor-action co-occurrences derived from scripts to constrain the weakly supervised learning problem. [sent-39, score-0.553]
13 As one of our main contributions, we formulate weakly supervised joint learning of actors and actions as an optimization of a new discriminative cost function. [sent-40, score-0.845]
14 We first investigate weakly supervised learning of actors only and demonstrate the benefit of our learning method in comparison with other weakly supervised techniques designed for this task [6, 22]. [sent-41, score-0.728]
15 We then demonstrate the advantage of the joint constraints for action recognition. [sent-42, score-0.295]
16 We validate our method in the challenging setting of localizing and recognizing actors and their actions in the movies Casablanca and American Beauty. [sent-43, score-0.674]
17 An example output of our algorithm for a short movie clip and the associated script section is illustrated in Figure 1. [sent-44, score-0.307]
18 Learning from images and text has been addressed in the context of automatic annotation of images with keywords [4, 11, 25] or labeling faces with names in news collections [5]. [sent-46, score-0.59]
19 [5] label detected faces in news photographs with names of people obtained from text captions. [sent-48, score-0.546]
20 A generative model of faces and poses (such as “Hit Backhand”) was learnt from names and verbs in manually provided captions for news photographs [19]. [sent-50, score-0.521]
21 To deal with the ambiguity of annotations, we develop a new discriminative weakly supervised clustering model of video and text. [sent-52, score-0.35]
22 In video, manually provided text descriptions have been used to learn a causal model of human actions in the constrained domain of sports events [14]. [sent-53, score-0.465]
23 Others have looked at learning from videos with readily-available text, but names [6, 9, 22] and actions [8, 18] have so far been considered separately. [sent-54, score-0.746]
24 First, we consider a richer use of textual information for video and learn from pairs of names and actions co-occurring in the text. [sent-59, score-0.801]
25 Second, we formulate the problem of finding characters and their actions as weakly supervised structured classification of pairs of action and name labels. [sent-60, score-1.053]
26 Third, we develop a new discriminative clustering model jointly learning both actions and names and incorporating text annotations as constraints. [sent-61, score-0.861]
27 Finally, we demonstrate the validity of the model on two feature-length movies and corresponding movie scripts, and show improvements over earlier weakly supervised methods. [sent-63, score-0.621]
28 Joint Model of Actors and Actions. We formulate the problem of jointly detecting actors and actions as discriminative clustering [2, 17]: grouping samples into classes so that an appropriate loss is minimized. [sent-65, score-0.612]
29 In our application, Ni is the group of person tracks appearing in a scene, while Λi can be thought of as a set of sentences specifying the person-action annotations for that scene. [sent-75, score-0.439]
30 We define X to be a N × d feature matrix, Z to be a N × P matrix with person labels in rows zn, and T to be a N × A matrix with action labels in rows tn. [sent-86, score-0.299]
31 Given weak supervision in the form of constraints on Z and T (more on these in the next section), we want to recover latent variables zn, tn for every sample xn and learn two multi-class classifiers f : Rd → RP and g : Rd → RA (for persons and actions respectively). [sent-88, score-0.536]
32 Problem Formulation. Our problem can be decomposed as a sum of two cost functions (for person names and actions) that are linked by joint constraints. [sent-93, score-0.555]
33 Using (3), we next define a joint optimization problem over action labels T and person labels Z as: min_{Z,T} Tr(Z Z⊤ A(X, λ1)) + Tr(T T⊤ B(X, λ2)). [sent-109, score-0.489]
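To make the cost concrete, the following numpy sketch computes a DIFFRAC-style cost matrix A(X, λ) under the standard ridge-regression square-loss formulation of discriminative clustering (cf. [2]) and evaluates the joint objective for candidate label matrices Z and T. This is a minimal sketch under stated assumptions: the paper works with separate face and action kernels Kf and Ka rather than raw linear features, and its exact normalization may differ.

    import numpy as np

    def diffrac_cost_matrix(X, lam):
        # Closed form of min_{W,b} (1/N)||Z - XW - 1b^T||_F^2 + lam*||W||_F^2
        # expressed as Tr(Z Z^T A(X, lam)); Pi is the usual centering matrix.
        N, d = X.shape
        Pi = np.eye(N) - np.ones((N, N)) / N
        Xc = Pi @ X
        M = np.linalg.inv(Xc.T @ Xc + N * lam * np.eye(d))
        return (Pi - Xc @ M @ Xc.T) / N

    def joint_cost(X_face, X_action, Z, T, lam1, lam2):
        # Joint objective: Tr(Z Z^T A(X, lam1)) + Tr(T T^T B(X, lam2)).
        A = diffrac_cost_matrix(X_face, lam1)
        B = diffrac_cost_matrix(X_action, lam2)
        return np.trace(Z.T @ A @ Z) + np.trace(T.T @ B @ T)

In this formulation the multi-class classifiers f and g are recovered in closed form from the optimal labels, e.g. W = (X⊤ΠX + NλI)⁻¹X⊤ΠZ for the person classifier, which is what makes the trace expression above equivalent to the regression cost.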
34 We will use information mined from scripts to couple Z and T by joint constraints as described below. [sent-113, score-0.331]
35 Annotations as Constraints on Latent Variables. We would like to constrain solutions of our problem by coupling person and action labels. [sent-116, score-0.468]
36 After aligning scripts with videos [9], we extract person-action pairs (p, a) and their approximate temporal locations. [sent-118, score-0.28]
37 Given a pair (p, a) found in the script, we assume a person p performs an action a at the corresponding temporal location in the video. [sent-119, score-0.385]
38 The (p, ∅) pairs come from another source of textual information: movie scripts contain both scene descriptions and dialogues with speaker identities specified. [sent-124, score-0.646]
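As an illustration of how the (p, ∅) pairs could be mined from dialogues, the sketch below pulls speaker cues out of a screenplay-formatted script. The indentation threshold and the upper-case speaker convention are assumptions about the script format, not a description of the parser actually used in the paper.

    import re

    # Assumed screenplay convention: a dialogue block starts with a deeply
    # indented, upper-case speaker cue such as "          RICK".
    SPEAKER_RE = re.compile(r"^\s{10,}([A-Z][A-Z .']+)$")

    def speaker_pairs(script_lines):
        """Yield (person, None) supervision pairs, i.e. (p, ∅)."""
        for line in script_lines:
            m = SPEAKER_RE.match(line)
            if m:
                yield (m.group(1).strip(), None)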
39 For every person-action pair (p, a) we construct a bag i containing samples Ni corresponding to person tracks in the temporal proximity of (p, a). [sent-126, score-0.614]
40 Once the bags are defined, we use annotations to constrain the latent variables of person tracks in the bag. [sent-128, score-0.673]
41 This can be translated into constraints on sums of latent variables of tracks within a bag: ∀i ∈ I, ∀(p, a) ∈ Λi : Σ_{n∈Ni} znp tna ≥ 1. [sent-130, score-0.546]
42 Pairs (p, ∅) and (∅, a) define independent constraints on the person and action latent classes respectively. [sent-135, score-0.551]
43 In the binary case, this amounts to learning a binary classifier given bags containing samples of both classes and bags containing only negatives. [sent-138, score-0.307]
44 Slack Variables. In practice, person-action pairs in scripts may not always have corresponding person tracks in the video. [sent-142, score-0.69]
45 This can happen due to failures of automatic person detection and tracking as well as due to possible mismatches between scripts and video tracks. [sent-143, score-0.499]
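A minimal cvxpy sketch of the resulting quadratic program for the names-only case is given below: Z is relaxed to the unit interval with rows summing to one, and each (p, ∅) bag constraint receives a non-negative slack ξ penalized in the objective. A is a cost matrix of the kind sketched after the joint objective above. The joint (p, a) constraint is bilinear in Z and T, and the convex relaxation the paper uses for it is not reproduced here; the function name and its signature are illustrative.

    import cvxpy as cp

    def solve_names_qp(A, N, P, bags, slack_weight=1.0):
        # bags: list of (track_indices, person_index) mined from the script
        Z = cp.Variable((N, P))
        xi = cp.Variable(len(bags), nonneg=True)  # one slack per annotation
        constraints = [Z >= 0, cp.sum(Z, axis=1) == 1]
        for k, (tracks, pid) in enumerate(bags):
            # at least one track in the bag takes label pid, up to slack
            constraints.append(sum(Z[n, pid] for n in tracks) >= 1 - xi[k])
        cost = sum(cp.quad_form(Z[:, j], cp.psd_wrap(A)) for j in range(P))
        cp.Problem(cp.Minimize(cost + slack_weight * cp.sum(xi)), constraints).solve()
        return Z.value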
46 Kf and Ka are the two kernels that we use for faces and actions, as described in more detail in Section 3. [sent-186, score-0.404]
47 From each detected occurrence of the frame in the text we use the “agent” and the “target verb” as the name and action pair. [sent-196, score-0.35]
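The sketch below is a lightweight dependency-parse stand-in for this frame-based extraction: for each target verb it takes the nominal subject as the agent. The paper relies on semantic frames, so a plain dependency parse is only an approximation; spaCy and its small English model are assumed to be available.

    import spacy

    nlp = spacy.load("en_core_web_sm")
    TARGET_VERBS = {"walk", "sit", "open"}  # the action classes considered here

    def name_action_pairs(text, character_names):
        """Return (agent, verb lemma) pairs, i.e. candidate (p, a) pairs."""
        pairs = []
        for token in nlp(text):
            if token.pos_ == "VERB" and token.lemma_ in TARGET_VERBS:
                for child in token.children:
                    if child.dep_ == "nsubj" and child.text in character_names:
                        pairs.append((child.text, token.lemma_))
        return pairs

    # name_action_pairs("Rick walks to the door.", {"Rick"}) -> [("Rick", "walk")]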
48 The aim here is to design a representation of video that can be related to the name and action structures extracted from the text. [sent-198, score-0.289]
49 To extract person tracks, we run the multi-view face detector of [26] and associate detections across frames using point tracks in a similar manner to [9, 22]. [sent-202, score-0.495]
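A minimal sketch of the point-track-based association, assuming each face detection stores the set of KLT point-track ids passing through its bounding box; two detections in consecutive frames are linked when they share a sufficient fraction of tracks. The exact linking criteria of [9, 22] may differ.

    def link_detections(dets_t, dets_t1, min_overlap=0.5):
        """dets_t, dets_t1: lists of (det_id, set_of_point_track_ids) for two
        consecutive frames; returns linked (id_t, id_t1) pairs."""
        links = []
        for i, pts_i in dets_t:
            for j, pts_j in dets_t1:
                if not pts_i or not pts_j:
                    continue
                overlap = len(pts_i & pts_j) / min(len(pts_i), len(pts_j))
                if overlap >= min_overlap:
                    links.append((i, j))
        return links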
50 For both movies we extract person tracks and associated descriptors. [sent-220, score-0.596]
51 We discard person tracks with unreliable facial features based on the landmark localization score. [sent-221, score-0.452]
52 For Casablanca, we obtain 1,273 person tracks containing 124,423 face detections, while for American Beauty we use 1,330 person tracks containing 131,741 face detections. [sent-222, score-0.99]
53 By processing the corresponding movie scripts, we extract 17 names for the main characters in Casablanca and 11 names for the main characters in American Beauty. [sent-223, score-1.153]
54 For each movie we select the two most frequent action classes, i.e., walking and sit down for Casablanca, and walking and open door for American Beauty. [sent-224, score-0.418]
55 For Casablanca we obtain 42 action/name pairs and 359 occurrences of names with no associated actions. [sent-227, score-0.382]
56 To explicitly model non-named characters in the movie (side characters and extras) as well as non-considered action classes we introduce an additional “background” class for both faces and actions. [sent-230, score-0.795]
57 For actions, we randomly sample 500 person tracks from the Hollywood2 dataset [20] using the corresponding movie scripts to discard actions considered in this work. [sent-233, score-1.168]
58 Second, we show that even for learning names alone (without actions) the proposed method outperforms other state-of-the-art weakly supervised learning techniques designed for the same task. [sent-239, score-0.665]
59 Finally, we demonstrate the benefits of learning names and actions jointly compared to resolving both tasks independently. [sent-240, score-0.711]
60 We will use real data, namely the 1,273 face tracks and their descriptors from the movie Casablanca, but group the tracks into bags in a controlled manner. [sent-243, score-0.536]
61 To create each bag, we first sample a track from a uniform distribution over characters and then complete the bag with up to |Ni| tracks by randomly sampling tracks according to the true distribution of the characters in the movie. [sent-245, score-0.806]
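A sketch of this controlled bag-generation protocol is given below; tracks_by_char (mapping each character to its face tracks) and the exact bag composition are illustrative. Sampling uniformly over all tracks of the movie is what sampling characters according to their true frequency amounts to.

    import random

    def make_bag(tracks_by_char, all_tracks, bag_size):
        # the annotated sample: one track of a uniformly chosen character
        char = random.choice(list(tracks_by_char))
        bag = [random.choice(tracks_by_char[char])]
        # complete the bag following the true character distribution
        bag += random.sample(all_tracks, bag_size - 1)
        return char, bag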
62 Classified face tracks are then sorted by their confidence values and the percentage of correctly classified tracks (i.e., the naming accuracy) is measured as a function of the proportion of labeled tracks. [sent-251, score-0.568]
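A small sketch of this accuracy-versus-coverage protocol, in which tracks are sorted by confidence and accuracy is computed over the most confident fraction; argument names are illustrative.

    import numpy as np

    def accuracy_vs_coverage(confidences, predicted, ground_truth, fractions):
        order = np.argsort(-np.asarray(confidences))  # most confident first
        correct = (np.asarray(predicted) == np.asarray(ground_truth))[order]
        return [correct[: max(1, int(f * len(order)))].mean() for f in fractions]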
63 Here we compare our method with other weakly supervised face identification approaches. [sent-259, score-0.376]
64 (c) Keeping 7 classes, we increase the number of samples per bag, showing that more samples per bag increase confusion and lower the performance. [sent-279, score-0.436]
65 We run all methods on 1,273 face tracks from Casablanca and 1,330 face tracks from American Beauty using noisy name annotations obtained from movie scripts. [sent-281, score-0.949]
66 First, the training data within a film is limited as it is not possible to harvest face tracks across multiple episodes as in TV series. [sent-284, score-0.346]
67 This assumption is frequently violated in movies, as side characters and extras are often not mentioned in the script. [sent-294, score-0.374]
68 In contrast, our approach suffers less from this problem since (a) it can handle multiple annotations for bags of multiple tracks and (b) the noise in labels and person detections is explicitly modeled using slack variables. [sent-295, score-0.621]
69 We next evaluate the benefits of learning names and actions jointly. [sent-297, score-0.711]
70 This is achieved by first solving for the name assignments alone. Figure 4: Results of automatic person naming in movies, (a) Casablanca and (b) American Beauty; naming accuracy is plotted against the proportion of labeled tracks. [sent-298, score-0.534]
71 Our method is compared with weakly supervised face identification approaches of Cour et al. [sent-299, score-0.376]
72 The name assignments are then fixed and used as additional constraints when learning the likely action assignments T for each track. [sent-303, score-0.543]
73 While this procedure can be iterated to improve the assignment of actor names with the help of estimated action labels, we found that the optimization converges after the first iteration in our setup. [sent-304, score-0.54]
74 The distribution of action classes in our data is heavily unbalanced, with the “background” class corresponding to more than 78% of person tracks. [sent-305, score-0.432]
75 We therefore evaluate the labeling of each target action in each movie using a standard one-vs-all action precision-recall measure. [sent-306, score-0.622]
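Concretely, such an evaluation can be run with scikit-learn as below, treating each target action one-vs-all with per-track confidence scores. The toy labels are hypothetical and merely mimic the heavy class imbalance mentioned above.

    import numpy as np
    from sklearn.metrics import average_precision_score, precision_recall_curve

    rng = np.random.default_rng(0)
    y_true = (rng.random(1000) < 0.05).astype(int)   # rare target action
    scores = rng.random(1000) + 0.3 * y_true         # weakly informative scores

    precision, recall, _ = precision_recall_curve(y_true, scores)
    print("AP =", average_precision_score(y_true, scores))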
76 Names+Actions corresponds to our proposed method of learning person names and actions jointly. [sent-308, score-0.895]
77 No Names uses constraints on actions only without considering joint constraints on actions and names. [sent-309, score-0.82]
78 True Names+Actions uses the ground truth person names as constraints on actions instead of the automatic name assignment. [sent-310, score-1.048]
79 This provides an upper bound on the action classification performance given a perfect assignment of person names. Figure 5: Results of action labeling in the movies Casablanca and American Beauty (panels: Walking and Sit down for Casablanca; Walking and Open door for American Beauty). [sent-311, score-0.619]
80 Finally, we evaluate two “dummy” baselines which blindly assign action labels based on person names and person-action pairs obtained from scripts. [sent-314, score-0.803]
81 Names+Text learns face assignments for each person track and assigns action labels using person-action pairs. [sent-316, score-0.627]
82 True Names+Text assigns action labels based on person-action pairs and ground truth person names. [sent-317, score-0.464]
83 Precision-recall plots for the target action classes in two movies are shown in Figure 5. [sent-320, score-0.434]
84 We first observe that our full method (blue curves) outperforms the weakly supervised learning of actions only (green curves) in most cases. [sent-321, score-0.618]
85 As expected, action classification can be further improved using ground truth for name assignments (red curves). [sent-323, score-0.365]
86 For the frequent action walking, for which many person-action constraints are available in scripts, automatic person naming in our method provides a large benefit. [sent-324, score-0.601]
87 However, even with ground truth face assignments the action classification performance is not perfect (True Names+Actions). [sent-325, score-0.362]
88 First, the ambiguity in the weak supervision is not reduced to zero, as a single character may perform several different actions in a single clip (bag). [sent-327, score-0.396]
89 Recognizing the less frequent actions sit down and open door appears to be more difficult. [sent-329, score-0.461]
90 Incorrect constraints often occur due to failures of face detection, as actors tend to turn away from the camera when sitting down or opening doors. [sent-332, score-0.303]
91 To explicitly quantify the loss due to failures of automatic person tracking, we have manually annotated person tracks in the movie Casablanca. [sent-333, score-0.826]
92 The performance of our full method is significantly improved when run on correct person tracks yielding AP=0. [sent-334, score-0.41]
93 Qualitative results for automatically labeling actor names and actions using our method (Names+Actions) are illustrated in Figure 6. [sent-338, score-0.901]
94 Conclusion. We have developed a new discriminative weakly supervised model jointly representing actions and actors in video. [sent-341, score-0.773]
95 We have demonstrated that the model can be learnt from a feature-length movie together with its shooting script, and have shown a significant improvement over other state-of-the-art weakly supervised methods. [sent-342, score-0.435]
96 As actions are shared across movies, applying the model over multiple movies simultaneously opens up the possibility of automatically learning discriminative classifiers for a large vocabulary of action classes. [sent-343, score-0.798]
97 Figure 6: Examples of automatically assigned names and actions in the movie Casablanca. [sent-396, score-0.86]
98 Top row: Correct name and action assignments for tracks that have an actor/action constraint in the script. [sent-397, score-0.591]
99 Bottom row: Correct name and action assignments for tracks that do not have a corresponding constraint in the script, but are still correctly classified. [sent-398, score-0.591]
100 Who’s doing what: Joint modeling of names and verbs for simultaneous face and pose annotation. [sent-494, score-0.459]
wordName wordTfidf (topN-words)
[('names', 0.339), ('actions', 0.332), ('casablanca', 0.314), ('scripts', 0.237), ('tracks', 0.226), ('action', 0.201), ('movie', 0.189), ('movies', 0.186), ('person', 0.184), ('bag', 0.166), ('actors', 0.156), ('weakly', 0.15), ('characters', 0.143), ('american', 0.136), ('beauty', 0.129), ('script', 0.118), ('supervised', 0.096), ('bags', 0.091), ('name', 0.088), ('face', 0.085), ('speaker', 0.079), ('assignments', 0.076), ('faces', 0.072), ('diffrac', 0.067), ('zzta', 0.067), ('constraints', 0.062), ('text', 0.061), ('latent', 0.057), ('zn', 0.055), ('door', 0.055), ('coupling', 0.053), ('textual', 0.052), ('tv', 0.051), ('annotations', 0.05), ('xn', 0.05), ('agent', 0.048), ('classes', 0.047), ('walking', 0.046), ('ni', 0.046), ('descriptions', 0.046), ('sit', 0.046), ('identification', 0.045), ('extras', 0.045), ('framenet', 0.045), ('labeleld', 0.045), ('proportoin', 0.045), ('rtacks', 0.045), ('tttb', 0.045), ('track', 0.045), ('news', 0.044), ('pairs', 0.043), ('automatic', 0.043), ('facial', 0.042), ('layouts', 0.041), ('learning', 0.04), ('knock', 0.04), ('discriminative', 0.039), ('samples', 0.038), ('naming', 0.037), ('rick', 0.037), ('labels', 0.036), ('looked', 0.035), ('variables', 0.035), ('video', 0.035), ('film', 0.035), ('verb', 0.035), ('verbs', 0.035), ('bach', 0.035), ('slack', 0.034), ('character', 0.034), ('prepositions', 0.033), ('rp', 0.032), ('joint', 0.032), ('xxt', 0.032), ('normale', 0.032), ('rieure', 0.032), ('confidence', 0.031), ('labeling', 0.031), ('erc', 0.031), ('kf', 0.031), ('gram', 0.031), ('captions', 0.031), ('ambiguity', 0.03), ('people', 0.03), ('controlled', 0.03), ('constrain', 0.03), ('rows', 0.03), ('sivic', 0.03), ('relaxations', 0.029), ('sentences', 0.029), ('laptev', 0.028), ('confusion', 0.028), ('frequent', 0.028), ('arguments', 0.028), ('convex', 0.027), ('tr', 0.027), ('joulin', 0.027), ('matrices', 0.027), ('nn', 0.027), ('events', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 166 iccv-2013-Finding Actors and Actions in Movies
Author: P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, J. Sivic
Abstract: We address the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts. Specifically, we extract actor/action pairs from the script and use them as constraints in a discriminative clustering framework. The corresponding optimization problem is formulated as a quadratic program under linear constraints. People in video are represented by automatically extracted and tracked faces together with corresponding motion features. First, we apply the proposed framework to the task of learning names of characters in the movie and demonstrate significant improvements over previous methods used for this task. Second, we explore the joint actor/action constraint and show its advantage for weakly supervised action learning. We validate our method in the challenging setting of localizing and recognizing characters and their actions in the feature-length movies Casablanca and American Beauty.
2 0.23760855 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
Author: Mihai Zanfir, Marius Leordeanu, Cristian Sminchisescu
Abstract: Human action recognition under low observational latency is receiving a growing interest in computer vision due to rapidly developing technologies in human-robot interaction, computer gaming and surveillance. In this paper we propose a fast, simple, yet powerful non-parametric Moving Pose (MP) framework for low-latency human action and activity recognition. Central to our methodology is a moving pose descriptor that considers both pose information as well as differential quantities (speed and acceleration) of the human body joints within a short time window around the current frame. The proposed descriptor is used in conjunction with a modified kNN classifier that considers both the temporal location of a particular frame within the action sequence as well as the discrimination power of its moving pose descriptor compared to other frames in the training set. The resulting method is non-parametric and enables low-latency recognition, one-shot learning, and action detection in difficult unsegmented sequences. Moreover, the framework is real-time, scalable, and outperforms more sophisticated approaches on challenging benchmarks like MSR-Action3D or MSR-DailyActivities3D.
3 0.22548944 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
Author: Vignesh Ramanathan, Percy Liang, Li Fei-Fei
Abstract: Human action and role recognition play an important part in complex event understanding. State-of-the-art methods learn action and role models from detailed spatio-temporal annotations, which requires extensive human effort. In this work, we propose a method to learn such models based on natural language descriptions of the training videos, which are easier to collect and scale with the number of actions and roles. There are two challenges with using this form of weak supervision: First, these descriptions only provide a high-level summary and often do not directly mention the actions and roles occurring in a video. Second, natural language descriptions do not provide spatio-temporal annotations of actions and roles. To tackle these challenges, we introduce a topic-based semantic relatedness (SR) measure between a video description and an action and role label, and incorporate it into a posterior regularization objective. Our event recognition system based on these action and role models matches the state-of-the-art method on the TRECVID-MED11 event kit, despite weaker supervision.
4 0.21476685 86 iccv-2013-Concurrent Action Detection with Structural Prediction
Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu
Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence only has one action class label and that different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. A detected action interval is determined both by the unary local detector and the relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our newly collected concurrent action dataset demonstrate the strength of our method.
5 0.19885683 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos
Author: Sunil Bandla, Kristen Grauman
Abstract: Collecting and annotating videos of realistic human actions is tedious, yet critical for training action recognition systems. We propose a method to actively request the most useful video annotations among a large set of unlabeled videos. Predicting the utility of annotating unlabeled video is not trivial, since any given clip may contain multiple actions of interest, and it need not be trimmed to temporal regions of interest. To deal with this problem, we propose a detection-based active learner to train action category models. We develop a voting-based framework to localize likely intervals of interest in an unlabeled clip, and use them to estimate the total reduction in uncertainty that annotating that clip would yield. On three datasets, we show our approach can learn accurate action detectors more efficiently than alternative active learning strategies that fail to accommodate the “untrimmed” nature of real video data.
6 0.17589836 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
7 0.16687612 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition
8 0.16009228 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
9 0.15465561 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction
10 0.15387137 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition
11 0.15380491 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition
12 0.14863838 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
13 0.14062119 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding
14 0.13805188 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps
15 0.13733463 322 iccv-2013-Pose Estimation and Segmentation of People in 3D Movies
16 0.13690205 39 iccv-2013-Action Recognition with Improved Trajectories
17 0.12814055 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition
18 0.12281433 289 iccv-2013-Network Principles for SfM: Disambiguating Repeated Structures with Local Context
20 0.11565512 210 iccv-2013-Image Retrieval Using Textual Cues
topicId topicWeight
[(0, 0.234), (1, 0.202), (2, 0.04), (3, 0.161), (4, 0.028), (5, -0.028), (6, 0.146), (7, -0.036), (8, -0.01), (9, 0.029), (10, 0.101), (11, -0.002), (12, 0.032), (13, -0.02), (14, 0.107), (15, -0.004), (16, -0.047), (17, 0.017), (18, -0.119), (19, 0.001), (20, -0.045), (21, 0.003), (22, -0.041), (23, -0.111), (24, 0.067), (25, 0.001), (26, 0.004), (27, -0.088), (28, -0.029), (29, 0.035), (30, 0.017), (31, -0.027), (32, 0.0), (33, -0.061), (34, -0.024), (35, 0.007), (36, 0.02), (37, 0.042), (38, -0.029), (39, -0.041), (40, -0.02), (41, -0.004), (42, -0.02), (43, -0.026), (44, -0.054), (45, -0.064), (46, 0.065), (47, -0.102), (48, -0.05), (49, -0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.9545545 166 iccv-2013-Finding Actors and Actions in Movies
Author: P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, J. Sivic
Abstract: We address the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts. Specifically, we extract actor/action pairs from the script and use them as constraints in a discriminative clustering framework. The corresponding optimization problem is formulated as a quadratic program under linear constraints. People in video are represented by automatically extracted and tracked faces together with corresponding motion features. First, we apply the proposed framework to the task of learning names of characters in the movie and demonstrate significant improvements over previous methods used for this task. Second, we explore the joint actor/action constraint and show its advantage for weakly supervised action learning. We validate our method in the challenging setting of localizing and recognizing characters and their actions in the feature-length movies Casablanca and American Beauty.
2 0.82138002 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
Author: Vignesh Ramanathan, Percy Liang, Li Fei-Fei
Abstract: Human action and role recognition play an important part in complex event understanding. State-of-the-art methods learn action and role models from detailed spatio-temporal annotations, which requires extensive human effort. In this work, we propose a method to learn such models based on natural language descriptions of the training videos, which are easier to collect and scale with the number of actions and roles. There are two challenges with using this form of weak supervision: First, these descriptions only provide a high-level summary and often do not directly mention the actions and roles occurring in a video. Second, natural language descriptions do not provide spatio-temporal annotations of actions and roles. To tackle these challenges, we introduce a topic-based semantic relatedness (SR) measure between a video description and an action and role label, and incorporate it into a posterior regularization objective. Our event recognition system based on these action and role models matches the state-of-the-art method on the TRECVID-MED11 event kit, despite weaker supervision.
Author: Weiyu Zhang, Menglong Zhu, Konstantinos G. Derpanis
Abstract: This paper presents a novel approach for analyzing human actions in non-scripted, unconstrained video settings based on volumetric, x-y-t, patch classifiers, termed actemes. Unlike previous action-related work, the discovery of patch classifiers is posed as a strongly-supervised process. Specifically, keypoint labels (e.g., position) across spacetime are used in a data-driven training process to discover patches that are highly clustered in the spacetime keypoint configuration space. To support this process, a new human action dataset consisting of challenging consumer videos is introduced, where notably the action label, the 2D position of a set of keypoints and their visibilities are provided for each video frame. On a novel input video, each acteme is used in a sliding volume scheme to yield a set of sparse, non-overlapping detections. These detections provide the intermediate substrate for segmenting out the action. For action classification, the proposed representation shows significant improvement over state-of-the-art low-level features, while providing spatiotemporal localization as additional output. This output sheds further light into detailed action understanding.
4 0.75387895 86 iccv-2013-Concurrent Action Detection with Structural Prediction
Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu
Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence only has one action class label and that different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. A detected action interval is determined both by the unary local detector and the relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our newly collected concurrent action dataset demonstrate the strength of our method.
5 0.74548328 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition
Author: Behrooz Mahasseni, Sinisa Todorovic
Abstract: This paper presents an approach to view-invariant action recognition, where human poses and motions exhibit large variations across different camera viewpoints. When each viewpoint of a given set of action classes is specified as a learning task then multitask learning appears suitable for achieving view invariance in recognition. We extend the standard multitask learning to allow identifying: (1) latent groupings of action views (i.e., tasks), and (2) discriminative action parts, along with joint learning of all tasks. This is because it seems reasonable to expect that certain distinct views are more correlated than some others, and thus identifying correlated views could improve recognition. Also, part-based modeling is expected to improve robustness against self-occlusion when actors are imaged from different views. Results on the benchmark datasets show that we outperform standard multitask learning by 21.9%, and the state-of-the-art alternatives by 4.5–6%.
6 0.73448002 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
7 0.73253846 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos
8 0.73249143 38 iccv-2013-Action Recognition with Actons
9 0.72219616 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
10 0.6694898 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition
11 0.65104699 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
12 0.59891647 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition
14 0.58656007 260 iccv-2013-Manipulation Pattern Discovery: A Nonparametric Bayesian Approach
15 0.5863601 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition
16 0.56888044 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition
17 0.56186604 69 iccv-2013-Capturing Global Semantic Relationships for Facial Action Unit Recognition
18 0.55739671 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction
19 0.54796672 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
20 0.54184335 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition
topicId topicWeight
[(2, 0.064), (4, 0.022), (8, 0.011), (12, 0.022), (26, 0.052), (31, 0.049), (40, 0.011), (42, 0.101), (64, 0.418), (89, 0.137)]
simIndex simValue paperId paperTitle
1 0.95400691 99 iccv-2013-Cross-View Action Recognition over Heterogeneous Feature Spaces
Author: Xinxiao Wu, Han Wang, Cuiwei Liu, Yunde Jia
Abstract: In cross-view action recognition, “what you saw” in one view is different from “what you recognize” in another view. The data distribution, and even the feature space, can change from one view to another because the appearance and motion of actions vary drastically across different views. In this paper, we address the problem of transferring action models learned in one view (source view) to another different view (target view), where action instances from these two views are represented by heterogeneous features. A novel learning method, called Heterogeneous Transfer Discriminant-analysis of Canonical Correlations (HTDCC), is proposed to learn a discriminative common feature space for linking source and target views to transfer knowledge between them. Two projection matrices that respectively map data from source and target views into the common space are optimized via simultaneously minimizing the canonical correlations of inter-class samples and maximizing the intra-class canonical correlations. Our model is neither restricted to corresponding action instances in the two views nor restricted to the same type of feature, and can handle only a few or even no labeled samples available in the target view. To reduce the data distribution mismatch between the source and target views in the common feature space, a nonparametric criterion is included in the objective function. We additionally propose a joint weight learning method to fuse multiple source-view action classifiers for recognition in the target view. Different combination weights are assigned to different source views, with each weight indicating how much the corresponding source view contributes to the target view. The proposed method is evaluated on the IXMAS multi-view dataset and achieves promising results.
2 0.93268514 298 iccv-2013-Online Robust Non-negative Dictionary Learning for Visual Tracking
Author: Naiyan Wang, Jingdong Wang, Dit-Yan Yeung
Abstract: This paper studies the visual tracking problem in video sequences and presents a novel robust sparse tracker under the particle filter framework. In particular, we propose an online robust non-negative dictionary learning algorithm for updating the object templates so that each learned template can capture a distinctive aspect of the tracked object. Another appealing property of this approach is that it can automatically detect and reject the occlusion and cluttered background in a principled way. In addition, we propose a new particle representation formulation using the Huber loss function. The advantage is that it can yield robust estimation without using trivial templates adopted by previous sparse trackers, leading to faster computation. We also reveal the equivalence between this new formulation and the previous one which uses trivial templates. The proposed tracker is empirically compared with state-of-the-art trackers on some challenging video sequences. Both quantitative and qualitative comparisons show that our proposed tracker is superior and more stable.
3 0.91500986 88 iccv-2013-Constant Time Weighted Median Filtering for Stereo Matching and Beyond
Author: Ziyang Ma, Kaiming He, Yichen Wei, Jian Sun, Enhua Wu
Abstract: Despite the continuous advances in local stereo matching for years, most efforts are on developing robust cost computation and aggregation methods. Little attention has been seriously paid to the disparity refinement. In this work, we study weighted median filtering for disparity refinement. We discover that with this refinement, even the simple box filter aggregation achieves comparable accuracy with various sophisticated aggregation methods (with the same refinement). This is due to the nice weighted median filtering properties of removing outlier error while respecting edges/structures. This reveals that the previously overlooked refinement can be at least as crucial as aggregation. We also develop the first constant time algorithm for the previously time-consuming weighted median filter. This makes the simple combination “box aggregation + weighted median” an attractive solution in practice for both speed and accuracy. As a byproduct, the fast weighted median filtering unleashes its potential in other applications that were hampered by high complexities. We show its superiority in various applications such as depth upsampling, clip-art JPEG artifact removal, and image stylization.
4 0.90076381 242 iccv-2013-Learning People Detectors for Tracking in Crowded Scenes
Author: Siyu Tang, Mykhaylo Andriluka, Anton Milan, Konrad Schindler, Stefan Roth, Bernt Schiele
Abstract: People tracking in crowded real-world scenes is challenging due to frequent and long-term occlusions. Recent tracking methods obtain the image evidence from object (people) detectors, but typically use off-the-shelf detectors and treat them as black box components. In this paper we argue that for best performance one should explicitly train people detectors on failure cases of the overall tracker instead. To that end, we first propose a novel joint people detector that combines a state-of-the-art single person detector with a detector for pairs of people, which explicitly exploits common patterns of person-person occlusions across multiple viewpoints that are a frequent failure case for tracking in crowded scenes. To explicitly address remaining failure modes of the tracker we explore two methods. First, we analyze typical failures of trackers and train a detector explicitly on these cases. And second, we train the detector with the people tracker in the loop, focusing on the most common tracker failures. We show that our joint multi-person detector significantly improves both detection accuracy as well as tracker performance, improving the state-of-the-art on standard benchmarks.
same-paper 5 0.8816793 166 iccv-2013-Finding Actors and Actions in Movies
Author: P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, J. Sivic
Abstract: We address the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts. Specifically, we extract actor/action pairs from the script and use them as constraints in a discriminative clustering framework. The corresponding optimization problem is formulated as a quadratic program under linear constraints. People in video are represented by automatically extracted and tracked faces together with corresponding motion features. First, we apply the proposed framework to the task of learning names of characters in the movie and demonstrate significant improvements over previous methods used for this task. Second, we explore the joint actor/action constraint and show its advantage for weakly supervised action learning. We validate our method in the challenging setting of localizing and recognizing characters and their actions in the feature-length movies Casablanca and American Beauty.
6 0.86241698 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
7 0.83935416 441 iccv-2013-Video Motion for Every Visible Point
8 0.83033657 215 iccv-2013-Incorporating Cloud Distribution in Sky Representation
9 0.79375726 380 iccv-2013-Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes
10 0.76810014 303 iccv-2013-Orderless Tracking through Model-Averaged Posterior Estimation
11 0.74197114 442 iccv-2013-Video Segmentation by Tracking Many Figure-Ground Segments
12 0.72971505 86 iccv-2013-Concurrent Action Detection with Structural Prediction
13 0.7183342 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
14 0.70626265 424 iccv-2013-Tracking Revisited Using RGBD Camera: Unified Benchmark and Baselines
15 0.69329828 359 iccv-2013-Robust Object Tracking with Online Multi-lifespan Dictionary Learning
16 0.68818641 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
17 0.68688416 425 iccv-2013-Tracking via Robust Multi-task Multi-view Joint Sparse Representation
18 0.65324318 338 iccv-2013-Randomized Ensemble Tracking
19 0.64441848 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos
20 0.63793653 168 iccv-2013-Finding the Best from the Second Bests - Inhibiting Subjective Bias in Evaluation of Visual Tracking Algorithms