iccv iccv2013 iccv2013-274 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mohamed R. Amer, Sinisa Todorovic, Alan Fern, Song-Chun Zhu
Abstract: This paper presents an efficient approach to video parsing. Our videos show a number of co-occurring individual and group activities. To address challenges of the domain, we use an expressive spatiotemporal AND-OR graph (ST-AOG) that jointly models activity parts, their spatiotemporal relations, and context, as well as enables multitarget tracking. The standard ST-AOG inference is prohibitively expensive in our setting, since it would require running a multitude of detectors, and tracking their detections in a long video footage. This problem is addressed by formulating a cost-sensitive inference of ST-AOG as Monte Carlo Tree Search (MCTS). For querying an activity in the video, MCTS optimally schedules a sequence of detectors and trackers to be run, and where they should be applied in the space-time volume. Evaluation on the benchmark datasets demonstrates that MCTS enables two-magnitude speed-ups without compromising accuracy relative to the standard cost-insensitive inference.
Reference: text
sentIndex sentText sentNum sentScore
1 To address challenges of the domain, we use an expressive spatiotemporal AND-OR graph (ST-AOG) that jointly models activity parts, their spatiotemporal relations, and context, as well as enables multitarget tracking. [sent-6, score-0.431]
2 The standard ST-AOG inference is prohibitively expensive in our setting, since it would require running a multitude of detectors, and tracking their detections in a long video footage. [sent-7, score-0.493]
3 For querying an activity in the video, MCTS optimally schedules a sequence of detectors and trackers to be run, and where they should be applied in the space-time volume. [sent-9, score-0.5]
4 Recent approaches usually model activity parts, their spatiotemporal relations, and context (e. [sent-13, score-0.355]
5 For this they use highly expressive activity representations whose intractable inference and learning require approximate algorithms. [sent-18, score-0.55]
6 However, as the representations are getting increasingly expressive, even their approximate inference becomes prohibitively expensive. [sent-19, score-0.263]
7 Our videos show a number of individual and group activities co-occurring in a large scene, as illustrated in Fig. [sent-21, score-0.392]
8 ST-AOG is a stochastic grammar [18] that models both individual actions and group activities, captures relations of individual actions within a group activity, accounts for parts and contexts, and enables their tracking. [sent-27, score-0.846]
9 ST-AOG enables parsing of challenging videos by running a multitude of object/people/activity detectors, and tracking their detections. [sent-30, score-0.273]
10 To address this issue, we enforce that ST-AOG inference is cost sensitive, and formulate such an inference as a scheduling problem. [sent-32, score-0.562]
11 In particular, given a query about a particular activity class (e. [sent-33, score-0.385]
12 1, the scheduling of α, β, γ, and ω processes jointly defines: which activity detectors to run, and which level of activities to track, and where in the space-time video volume to apply the detectors and tracking. [sent-37, score-0.867]
13 Thus, given a query, the scheduling specifies a sequence of triplets {(process, detector, time interval)} to be run, in order to efficiently answer the query. [sent-38, score-0.256]
14 In this way, inference becomes efficient, optimizing the total number of detectors and trackers to be run, for a given time budget. [sent-40, score-0.273]
15 The best sequence of inference steps is learned for each query type and executed at inference time. The figure compares typical activity representations, where modeling complexity increases top to bottom. [sent-42, score-0.717]
16 The top two rows show detections of “walking”, and tracking these detections for recognizing structured actions of each person, as in [9]; this approach may suffer from missed detections, identity switches, and false positives. [sent-43, score-0.331]
17 The bottom row shows our performance for ST-AOG that models both individual actions and group activities, relations of individual actions within every group activity, context, and enables tracking at all semantic levels. [sent-46, score-0.703]
18 Inspired by recent advances in Monte Carlo planning [5], we use Monte Carlo tree search (MCTS) to learn the scheduling of ST-AOG inference. [sent-48, score-0.424]
19 In [4], Q-learning is used for scheduling the α, β, γ inference of a spatial AOG. [sent-71, score-0.349]
20 In that work, Q-Learning simplifies the inference of a stochastic grammar by: i) Summarizing all current parse-graph hypotheses into a single state; and ii) Conducting inference as first-order Markovian moves in a large state space, following a fixed policy. [sent-72, score-0.606]
21 Inference as Open-Loop Planning: Given a query for an activity in the video, and ST-AOG, a brute force approach to inference would be to run all detectors associated with α, β, γ, ω, and then compute the posterior of the query. [sent-79, score-0.706]
22 This problem can be viewed as a planning problem, where each triplet is viewed as an inference step. [sent-82, score-0.352]
23 Our goal is to select an optimal sequence of inference steps, given a time budget, that maximizes a utility measure. [sent-83, score-0.42]
24 One approach to selecting inference steps would be to follow a closed-loop planning, where at each step we run a planning algorithm to select the next step, based on the information from previous steps. [sent-84, score-0.428]
25 RL uses a policy that maps any inference state to an action (e. [sent-89, score-0.455]
26 However, since the number of inference states is enormous, such approaches require making significant approximations, e. [sent-92, score-0.243]
27 We pre-compute an explicit sequence of inference steps for each type of query that will be executed at inference time. [sent-97, score-0.657]
28 The assumption underlying our approach is that for each type of query there do exist high-quality open-loop sequences of inference steps. [sent-99, score-0.338]
29 The steps available to our inference are {(process, detector, time interval)} triplets. [sent-104, score-0.241]
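To make the triplet representation concrete, here is a minimal sketch of how such an inference action could be encoded; the InferenceAction class, the process names, the detector labels, and the frame-interval encoding are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical encoding of one inference step; all names are illustrative only.
@dataclass(frozen=True)
class InferenceAction:
    process: str               # one of "alpha", "beta", "gamma", "omega" (assumed labels)
    detector: str              # e.g. "walking", "talking" (assumed detector identifiers)
    interval: Tuple[int, int]  # (start_frame, end_frame) of the block in time

# An open-loop plan is simply an ordered sequence of B such actions.
plan = [
    InferenceAction("alpha", "walking", (0, 100)),
    InferenceAction("omega", "walking", (100, 200)),
]
```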
30 For each type of query, our objective is to produce a high-utility sequence of inference actions (a1, . . . , aB). [sent-107, score-0.583]
31 Note that the exact observation sequence resulting from the action sequence will vary across videos. [sent-111, score-0.322]
32 Thus, we take the utility of an action sequence to be the expectation, with respect to a distribution over videos, of the log-likelihood of the parse graph, pg. [sent-112, score-0.463]
33 6, pg summarizes the current video parsing results given observations gathered from the applied action sequence. [sent-115, score-0.365]
34 In particular, we assume the availability of a set of training videos on which we can easily simulate the application of any action sequence and compute the required likelihoods. [sent-124, score-0.309]
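As a rough sketch of this simulation-based evaluation, the function below estimates the expected utility of an action sequence as an empirical average of parse-graph log-likelihoods over training videos; simulate_parse is a hypothetical stand-in for the paper's simulation of detectors and trackers, not a real API.

```python
def estimate_utility(action_sequence, training_videos, simulate_parse):
    """Empirical estimate of the expected parse-graph log-likelihood.

    simulate_parse(video, action_sequence) is assumed to replay the detectors
    and trackers named by the sequence on one training video and return the
    log-likelihood of the resulting parse graph pg; it is a placeholder for
    the paper's simulation step.
    """
    scores = [simulate_parse(video, action_sequence) for video in training_videos]
    return sum(scores) / len(scores)
```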
35 Next, we describe how to search for a high utility action sequence using MCTS. [sent-125, score-0.383]
36 Monte-Carlo Tree Search: The number of potential action sequences is exponential in the budget B, and hence we use an intelligent search over potential action sequences, which is able to uncover high-quality sequences in a reasonable amount of time. [sent-127, score-0.47]
37 Our approach is based on the view that the set of all length B action sequences can be represented as a rooted tree, where edges correspond to actions, so that each path from the root to a leaf corresponds to a distinct length B action sequence. [sent-128, score-0.467]
38 It is initialized to a single root node, and each iteration adds a single new leaf node to the current tree, and updates certain statistics of nodes in the tree. [sent-137, score-0.288]
39 2, begins by using a tree policy to follow a path of actions from the root until reaching a leaf node v of the current tree. [sent-139, score-0.56]
40 A random action is selected at node v, and the resulting node v′ [sent-140, score-0.348]
41 corresponds to an action sequence from the root to v′. [sent-142, score-0.286]
42 This action sequence is then extended by selecting random actions until reaching a depth of B, resulting in a sequence of B actions. [sent-144, score-0.485]
43 The utility of the action sequence is then evaluated using the training videos, as described in Sec. [sent-145, score-0.376]
44 This evaluation is used to update the statistics of tree nodes along the path from the root to v′. [sent-147, score-0.245]
45 Specifically, each node v in the tree maintains a count n(v) of how many times the node has been traversed during the search, and the average utility Q(v) of the length-B action sequences that have passed through the node so far during the search. [sent-149, score-0.752]
46 Intuitively, the statistics at each tree node indicate the overall quality of the action sequences which have that node as a prefix. [sent-150, score-0.509]
47 This is done by starting at the root and selecting the action that leads to the child node v with largest utility Q(v). [sent-152, score-0.415]
48 Then, from v the next action is the one that leads to the highest utility child of v. [sent-153, score-0.256]
49 It remains to specify the tree policy, which is the key ingredient in an MCTS algorithm, as it controls how the tree is expanded. [sent-155, score-0.299]
50 Intuitively, we would like the tree to be expanded toward more promising action sequences, which exploits information from previous iterations. [sent-156, score-0.253]
51 We use the UCT algorithm that selects action a at node v as argmax_a [ Q(T(v, a)) + c * sqrt( log n(v) / n(T(v, a)) ) ], (1) [sent-159, score-0.244]
52 where T(v, a) denotes the tree node that is reached by selecting action a in node v. [sent-162, score-0.461]
53 In (1), the exploitation term, Q(T(v, a)), favors actions that have been observed to have high average utility from v in previous iterations. [sent-163, score-0.317]
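The following is a compact, generic sketch of the MCTS loop with UCT selection described above (tree policy, expansion of one leaf per iteration, random rollout to depth B, utility backup, and greedy extraction of the final sequence). The function and class names, the default constants, and the evaluate() callback are assumptions; this illustrates the standard algorithm rather than reproducing the authors' code.

```python
import math
import random

class Node:
    def __init__(self, action=None, parent=None):
        self.action, self.parent = action, parent
        self.children = {}   # action -> child Node
        self.n = 0           # visit count n(v)
        self.q = 0.0         # average utility Q(v) of sequences through v

def uct_search(actions, B, evaluate, iterations=1000, c=1.0):
    """Generic MCTS with UCT over length-B action sequences (a sketch)."""
    root = Node()
    for _ in range(iterations):
        # 1) Tree policy: follow UCT from the root while all actions are expanded.
        node, prefix = root, []
        while node.children and len(node.children) == len(actions) and len(prefix) < B:
            node = max(node.children.values(),
                       key=lambda ch: ch.q + c * math.sqrt(math.log(node.n) / ch.n))
            prefix.append(node.action)
        # 2) Expansion: add one new child for a randomly chosen untried action.
        if len(prefix) < B:
            a = random.choice([a for a in actions if a not in node.children])
            node.children[a] = Node(action=a, parent=node)
            node = node.children[a]
            prefix.append(a)
        # 3) Rollout: complete the sequence with random actions up to depth B.
        sequence = prefix + [random.choice(actions) for _ in range(B - len(prefix))]
        utility = evaluate(sequence)
        # 4) Backup: update n(v) and the running-average Q(v) along the path.
        while node is not None:
            node.n += 1
            node.q += (utility - node.q) / node.n
            node = node.parent
    # Final plan: greedily follow the highest-Q child from the root.
    plan, node = [], root
    while node.children and len(plan) < B:
        node = max(node.children.values(), key=lambda ch: ch.q)
        plan.append(node.action)
    return plan
```

In practice, evaluate could simply wrap the empirical-utility estimate sketched earlier, e.g. evaluate = lambda seq: estimate_utility(seq, training_videos, simulate_parse).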
54 6, the α, β, γ, ω processes that are scheduled by MCTS in inference for video parsing. [sent-171, score-0.31]
55 Group activities are defined as a spatial relationship of a set of individual actions. [sent-174, score-0.263]
56 Modeling efficiency is achieved by sharing children nodes among multiple parents, where AND nodes encode particular configurations of parts, and OR nodes account for alternative configurations. [sent-180, score-0.293]
57 ST-AOG establishes temporal (lateral) edges between stages of the activity to model their temporal variations. [sent-181, score-0.494]
58 G associates activity classes with ∧ nodes, which are hierarchically organized in levels l = 1, 2, 3. [sent-192, score-0.308]
59 Similarly, the hierarchical structure of G means that the ith child of an ∧ node represents activity classes. [sent-200, score-0.308]
60 From (2), the query uniquely identifies the level l in ST-AOG, and its parent level, wherein the corresponding pg = pgl is rooted. [sent-239, score-0.388]
61 A subgraph pglτ of pgl, associated with time interval τ, has a single switching node ∨lτ which selects ∧lτ representing the query activity detected in interval τ of the video. [sent-240, score-0.592]
62 The detected activity ∧lτ can be explained as a layout of Nlτ detected sub-activities, {∧il+1,τ : i = 1, . . . , Nlτ}. [sent-241, score-0.308]
63 Also, the detected activity ∧lτ can be predicted, given a preceding detected activity. [sent-245, score-0.308]
64 The processes involved in inference of pglτ, namely αlτ, βlτ, γlτ, ωlτ, are illustrated in Fig. [sent-384, score-0.25]
65 From (3), for each type of query, our inference first identifies the root node of pg. [sent-386, score-0.408]
66 Then, it executes the maximum expected-utility inference sequence (a1, . . . , aB). [sent-387, score-0.366]
67 Every inference action ab represents a triplet (process, detector, time interval), where the process is one of {αlτ, βlτ, γlτ, ωlτ : l = 1, 2, 3, τ = 1, . . .}. [sent-394, score-0.353]
68 The prior p(Nl) of the number of children nodes is an exponential distribution, learned on the numbers of corresponding children nodes of ∧l in training parse graphs. [sent-402, score-0.312]
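For illustration, here is a small sketch of fitting such an exponential prior over the number of children nodes by maximum likelihood from training parse graphs; the rate parameterization and the example counts are assumptions, not values from the paper.

```python
import math

def fit_exponential_prior(children_counts):
    # MLE of the rate for p(N) = rate * exp(-rate * N), fit on observed
    # numbers of children nodes in training parse graphs.
    mean_count = sum(children_counts) / len(children_counts)
    return 1.0 / mean_count

def log_prior(num_children, rate):
    # log p(N = num_children) under the exponential prior
    return math.log(rate) - rate * num_children

# Example with made-up children counts of AND nodes at one level.
rate = fit_exponential_prior([3, 4, 2, 5, 3])
print(log_prior(4, rate))
```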
69 Learning α: Positive examples Tα+l are labeled bounding boxes around group activities (l = 1), or individual actions (l = 2), or objects (l = 3). [sent-403, score-0.564]
70 Implementation Details. Grid of Blocks: Each video is partitioned into a grid of 2D+t blocks, allowing inference action sequences to select optimal blocks for video parsing. [sent-423, score-0.557]
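A minimal sketch of what partitioning a video volume into a grid of 2D+t blocks could look like; the function name and the block dimensions are arbitrary placeholders, not the values used in the paper.

```python
def grid_of_blocks(width, height, num_frames, bx=160, by=120, bt=50):
    """Partition a width x height x num_frames video volume into 2D+t blocks.

    Returns (x0, y0, t0, x1, y1, t1) tuples; the block sizes bx, by, bt are
    placeholder values.
    """
    blocks = []
    for t0 in range(0, num_frames, bt):
        for y0 in range(0, height, by):
            for x0 in range(0, width, bx):
                blocks.append((x0, y0, t0,
                               min(x0 + bx, width),
                               min(y0 + by, height),
                               min(t0 + bt, num_frames)))
    return blocks

# Example: a 1280x720 clip with 1000 frames yields candidate blocks over which
# the scheduler can place (process, detector, interval) actions.
blocks = grid_of_blocks(1280, 720, 1000)
```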
71 Detectors: For each level l of ST-AOG, we define a set of αl activity detectors. [sent-426, score-0.308]
72 This makes detecting group activities robust to perspective and viewpoint changes. [sent-446, score-0.286]
73 The tracks of STVs are then classified by a multiclass SVM to detect the group activities of interest. [sent-447, score-0.316]
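As a hedged illustration of this last step, the snippet below classifies per-track feature vectors with a multiclass linear SVM from scikit-learn; the random features stand in for the STV track descriptors, which are not specified here, and the label count is an assumption.

```python
import numpy as np
from sklearn.svm import LinearSVC

# One feature vector per STV track; labels are group-activity classes.
# Random placeholders stand in for the actual track descriptors.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 64))
y_train = rng.integers(0, 6, size=200)

clf = LinearSVC()          # multiclass SVM via one-vs-rest
clf.fit(X_train, y_train)

X_test = rng.normal(size=(5, 64))
predicted_activities = clf.predict(X_test)
```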
74 Results Datasets: For evaluation, we use datasets with multiple co-occurring individual actions and group activities, such as the UCLA Courtyard Dataset [4], Collective Activity Dataset [7], and New Collective Activity Dataset [6]. [sent-449, score-0.3]
75 For each group activity or individual action, the dataset contains 20 instances, and for each object the dataset contains 50 instances. [sent-457, score-0.445]
76 The dataset provides labels of every 10th frame, in terms of bounding boxes around people performing the activity, their pose, and activity class. [sent-461, score-0.366]
77 Recently [6] released a new collective activity dataset which has interactions. [sent-462, score-0.508]
78 The New Collective Activity Dataset [6] is composed of 32 video clips with 6 collective activities, 9 interactions, and 3 individual actions. [sent-463, score-0.317]
79 For each type of query, our inference identifies the root node of pg, then executes the associated inference sequence (a1, . . . , aB), [sent-467, score-0.774]
80 where every inference action ab represents a (process, detector, time interval) triplet. [sent-473, score-0.353]
81 V2(B) is a variant of our ST-AOG, whose inference accounts for the ω process only at the query level. [sent-482, score-0.335]
82 We compare our activity recognition with that of the state of the art [4, 6, 9], and our tracklet association accuracy with that of [6]. [sent-490, score-0.377]
83 For the UCLA Courtyard dataset, performance is evaluated in terms of precision and false positive rate of per-frame activity recognition. [sent-491, score-0.337]
84 The comparison of V3(B) and the S-AOG of [4] in Tables 1–2 demonstrates that the use of MCTS significantly improves per-frame activity recognition, and reduces the overall computational time, since we operate on blocks of video rather than the entire video. [sent-501, score-0.471]
85 When the time budget B = ∞, our approach achieves the best results in Tables 1–2, since it is able to run as many inference steps as needed. [sent-502, score-0.347]
86 The comparison of V2(B) and V3(B), and the comparison of V2(B) with recent work of [4, 9, 6], in Tables 1–2 and Tables 3–4 demonstrate that accounting for temporal relations between activities across the video improves performance. [sent-506, score-0.425]
87 , using dynamic programming) may be prohibitively expensive, since video parsing requires running a multitude of object and activity detectors in the long video footage. [sent-521, score-0.709]
88 To address this issue, we have formulated inference of ST-AOG as open-loop planning, which optimally schedules inference steps to be run within the allowed time budget. [sent-522, score-0.543]
89 For every query type, our inference executes a maximum utility sequence of inference processes. [sent-523, score-0.772]
90 These optimal inference sequences are learned using Monte Carlo Tree Search (MCTS). [sent-524, score-0.261]
91 MCTS efficiently estimates the expected utility of inference steps by using an empirical average over a set of training data. [sent-525, score-0.386]
92 MCTS accounts for higher-order dependencies of inference steps, and thus alleviates drawbacks of Q-Learning and the Markov Decision Process used in related work for inference. [sent-526, score-0.258]
93 Our results demonstrate that the MCTS-based scheduling of video parsing gives similar accuracy levels under two-magnitude speedups relative to the standard cost-insensitive inference with unlimited time budgets. [sent-527, score-0.535]
94 Also, the extended expressiveness of ST-AOG relative to existing activity representations leads to our superior performance on the benchmark datasets, including the UCLA Courtyard, Collective Activities, and New Collective Activities datasets. [sent-528, score-0.337]
95 A Chains model for localizing group activities in videos. [sent-540, score-0.286]
96 A unified framework for multi-target tracking and collective activity recognition. [sent-578, score-0.561]
97 : Collective activity classification using spatiotemporal relationship among people. [sent-585, score-0.355]
98 Representing pairwise spatial and temporal relations for action recognition. [sent-720, score-0.267]
99 Parsing video events with goal inference and intent prediction. [sent-727, score-0.273]
100 A numerical study of the bottom-up and top-down inference processes in and-or graphs. [sent-747, score-0.25]
wordName wordTfidf (topN-words)
[('mcts', 0.477), ('activity', 0.308), ('inference', 0.213), ('activities', 0.206), ('collective', 0.2), ('aog', 0.166), ('pgl', 0.166), ('actions', 0.163), ('action', 0.14), ('planning', 0.139), ('scheduling', 0.136), ('utility', 0.116), ('tree', 0.113), ('courtyard', 0.11), ('grammar', 0.106), ('node', 0.104), ('parsing', 0.097), ('sequence', 0.091), ('ucla', 0.09), ('stvs', 0.083), ('group', 0.08), ('nodes', 0.077), ('temporal', 0.077), ('query', 0.077), ('uct', 0.074), ('policy', 0.073), ('carlo', 0.073), ('pg', 0.068), ('tables', 0.067), ('parse', 0.067), ('monte', 0.064), ('nl', 0.063), ('children', 0.062), ('edgeslogp', 0.062), ('executes', 0.062), ('ogp', 0.062), ('stv', 0.062), ('video', 0.06), ('detectors', 0.06), ('amer', 0.059), ('budget', 0.058), ('switching', 0.058), ('exploration', 0.058), ('individual', 0.057), ('root', 0.055), ('logp', 0.055), ('ab', 0.054), ('tracking', 0.053), ('leaf', 0.052), ('prohibitively', 0.05), ('relations', 0.05), ('videos', 0.049), ('run', 0.048), ('sequences', 0.048), ('spatiotemporal', 0.047), ('accounts', 0.045), ('interval', 0.045), ('stochastic', 0.045), ('detector', 0.044), ('multitude', 0.044), ('detections', 0.043), ('ainc', 0.041), ('hmdp', 0.041), ('mecr', 0.041), ('schedules', 0.041), ('spatia', 0.041), ('staog', 0.041), ('tecto', 0.041), ('parent', 0.041), ('tracklet', 0.04), ('person', 0.039), ('exploitation', 0.038), ('jl', 0.037), ('vicinity', 0.037), ('tempo', 0.037), ('teos', 0.037), ('processes', 0.037), ('identifies', 0.036), ('search', 0.036), ('blocks', 0.036), ('executed', 0.035), ('il', 0.034), ('markovian', 0.034), ('facing', 0.033), ('edges', 0.032), ('accounting', 0.032), ('xof', 0.031), ('oregon', 0.031), ('states', 0.03), ('running', 0.03), ('tracks', 0.03), ('specifies', 0.029), ('bounding', 0.029), ('expressive', 0.029), ('boxes', 0.029), ('false', 0.029), ('relative', 0.029), ('training', 0.029), ('state', 0.029), ('steps', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000012 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition
Author: Mohamed R. Amer, Sinisa Todorovic, Alan Fern, Song-Chun Zhu
Abstract: This paper presents an efficient approach to video parsing. Our videos show a number of co-occurring individual and group activities. To address challenges of the domain, we use an expressive spatiotemporal AND-OR graph (ST-AOG) that jointly models activity parts, their spatiotemporal relations, and context, as well as enables multitarget tracking. The standard ST-AOG inference is prohibitively expensive in our setting, since it would require running a multitude of detectors, and tracking their detections in a long video footage. This problem is addressed by formulating a cost-sensitive inference of ST-AOG as Monte Carlo Tree Search (MCTS). For querying an activity in the video, MCTS optimally schedules a sequence of detectors and trackers to be run, and where they should be applied in the space-time volume. Evaluation on the benchmark datasets demonstrates that MCTS enables two-magnitude speed-ups without compromising accuracy relative to the standard cost-insensitive inference.
2 0.20151368 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
Author: Mihai Zanfir, Marius Leordeanu, Cristian Sminchisescu
Abstract: Human action recognition under low observational latency is receiving a growing interest in computer vision due to rapidly developing technologies in human-robot interaction, computer gaming and surveillance. In this paper we propose a fast, simple, yet powerful non-parametric Moving Pose (MP)frameworkfor low-latency human action and activity recognition. Central to our methodology is a moving pose descriptor that considers both pose information as well as differential quantities (speed and acceleration) of the human body joints within a short time window around the current frame. The proposed descriptor is used in conjunction with a modified kNN classifier that considers both the temporal location of a particular frame within the action sequence as well as the discrimination power of its moving pose descriptor compared to other frames in the training set. The resulting method is non-parametric and enables low-latency recognition, one-shot learning, and action detection in difficult unsegmented sequences. Moreover, the framework is real-time, scalable, and outperforms more sophisticated approaches on challenging benchmarks like MSR-Action3D or MSR-DailyActivities3D.
3 0.19323793 86 iccv-2013-Concurrent Action Detection with Structural Prediction
Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu
Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence only have one action class label and different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. An detected action interval is determined both by the unary local detector and the relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our new collected concurrent action dataset demonstrate the strength of our method.
Author: Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, Kate Saenko
Abstract: Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities “in-the-wild”. We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help to choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects; we also use a web-scale language model to “fill in ” novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches.
5 0.16096 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos
Author: Sunil Bandla, Kristen Grauman
Abstract: Collecting and annotating videos of realistic human actions is tedious, yet critical for training action recognition systems. We propose a method to actively request the most useful video annotations among a large set of unlabeled videos. Predicting the utility of annotating unlabeled video is not trivial, since any given clip may contain multiple actions of interest, and it need not be trimmed to temporal regions of interest. To deal with this problem, we propose a detection-based active learner to train action category models. We develop a voting-based framework to localize likely intervals of interest in an unlabeled clip, and use them to estimate the total reduction in uncertainty that annotating that clip would yield. On three datasets, we show our approach can learn accurate action detectors more efficiently than alternative active learning strategies that fail to accommodate the “untrimmed” nature of real video data.
6 0.15336603 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification
7 0.14930551 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
8 0.14652334 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
9 0.13544513 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
10 0.13432857 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition
11 0.13115035 81 iccv-2013-Combining the Right Features for Complex Event Recognition
12 0.12978704 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
13 0.12814055 166 iccv-2013-Finding Actors and Actions in Movies
14 0.12080751 65 iccv-2013-Breaking the Chain: Liberation from the Temporal Markov Assumption for Tracking Human Poses
15 0.11893793 448 iccv-2013-Weakly Supervised Learning of Image Partitioning Using Decision Trees with Structured Split Criteria
16 0.11725596 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding
17 0.11701944 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction
18 0.11362917 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition
19 0.1101995 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints
20 0.10701142 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps
topicId topicWeight
[(0, 0.229), (1, 0.157), (2, 0.068), (3, 0.159), (4, 0.089), (5, 0.018), (6, 0.028), (7, -0.006), (8, -0.044), (9, -0.012), (10, -0.021), (11, -0.02), (12, -0.018), (13, 0.071), (14, 0.076), (15, 0.068), (16, 0.03), (17, -0.074), (18, 0.006), (19, 0.016), (20, -0.091), (21, -0.016), (22, 0.009), (23, -0.034), (24, -0.023), (25, -0.03), (26, -0.029), (27, -0.087), (28, -0.03), (29, -0.069), (30, 0.052), (31, -0.062), (32, -0.014), (33, -0.036), (34, 0.047), (35, -0.006), (36, -0.038), (37, 0.007), (38, 0.003), (39, -0.032), (40, 0.004), (41, 0.029), (42, 0.027), (43, -0.033), (44, -0.034), (45, -0.022), (46, 0.005), (47, -0.01), (48, -0.053), (49, -0.063)]
simIndex simValue paperId paperTitle
same-paper 1 0.96753871 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition
Author: Mohamed R. Amer, Sinisa Todorovic, Alan Fern, Song-Chun Zhu
Abstract: This paper presents an efficient approach to video parsing. Our videos show a number of co-occurring individual and group activities. To address challenges of the domain, we use an expressive spatiotemporal AND-OR graph (ST-AOG) that jointly models activity parts, their spatiotemporal relations, and context, as well as enables multitarget tracking. The standard ST-AOG inference is prohibitively expensive in our setting, since it would require running a multitude of detectors, and tracking their detections in a long video footage. This problem is addressed by formulating a cost-sensitive inference of ST-AOG as Monte Carlo Tree Search (MCTS). For querying an activity in the video, MCTS optimally schedules a sequence of detectors and trackers to be run, and where they should be applied in the space-time volume. Evaluation on the benchmark datasets demonstrates that MCTS enables two-magnitude speed-ups without compromising accuracy relative to the standard cost-insensitive inference.
2 0.72849894 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
Author: Vignesh Ramanathan, Percy Liang, Li Fei-Fei
Abstract: Human action and role recognition play an important part in complex event understanding. State-of-the-art methods learn action and role models from detailed spatio temporal annotations, which requires extensive human effort. In this work, we propose a method to learn such models based on natural language descriptions of the training videos, which are easier to collect and scale with the number of actions and roles. There are two challenges with using this form of weak supervision: First, these descriptions only provide a high-level summary and often do not directly mention the actions and roles occurring in a video. Second, natural language descriptions do not provide spatio temporal annotations of actions and roles. To tackle these challenges, we introduce a topic-based semantic relatedness (SR) measure between a video description and an action and role label, and incorporate it into a posterior regularization objective. Our event recognition system based on these action and role models matches the state-ofthe-art method on the TRECVID-MED11 event kit, despite weaker supervision.
3 0.71705931 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding
Author: Weiyu Zhang, Menglong Zhu, Konstantinos G. Derpanis
Abstract: This paper presents a novel approach for analyzing human actions in non-scripted, unconstrained video settings based on volumetric, x-y-t, patch classifiers, termed actemes. Unlike previous action-related work, the discovery of patch classifiers is posed as a strongly-supervised process. Specifically, keypoint labels (e.g., position) across spacetime are used in a data-driven training process to discover patches that are highly clustered in the spacetime keypoint configuration space. To support this process, a new human action dataset consisting of challenging consumer videos is introduced, where notably the action label, the 2D position of a set of keypoints and their visibilities are provided for each video frame. On a novel input video, each acteme is used in a sliding volume scheme to yield a set of sparse, non-overlapping detections. These detections provide the intermediate substrate for segmenting out the action. For action classification, the proposed representation shows significant improvement over state-of-the-art low-level features, while providing spatiotemporal localiza- tion as additional output. This output sheds further light into detailed action understanding.
4 0.70453787 86 iccv-2013-Concurrent Action Detection with Structural Prediction
Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu
Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence only have one action class label and different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. An detected action interval is determined both by the unary local detector and the relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our new collected concurrent action dataset demonstrate the strength of our method.
Author: Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, Kate Saenko
Abstract: Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities “in-the-wild”. We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help to choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects; we also use a web-scale language model to “fill in ” novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches.
6 0.6743598 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
7 0.66785043 166 iccv-2013-Finding Actors and Actions in Movies
8 0.65625829 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
9 0.64660263 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition
10 0.6354177 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition
11 0.63166177 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
12 0.62018639 38 iccv-2013-Action Recognition with Actons
13 0.61599809 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos
14 0.59587282 260 iccv-2013-Manipulation Pattern Discovery: A Nonparametric Bayesian Approach
15 0.59039462 65 iccv-2013-Breaking the Chain: Liberation from the Temporal Markov Assumption for Tracking Human Poses
16 0.57324916 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
17 0.55656481 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation
18 0.55378103 176 iccv-2013-From Large Scale Image Categorization to Entry-Level Categories
19 0.55173731 165 iccv-2013-Find the Best Path: An Efficient and Accurate Classifier for Image Hierarchies
20 0.54596001 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition
topicId topicWeight
[(2, 0.049), (12, 0.108), (13, 0.012), (22, 0.103), (26, 0.077), (31, 0.064), (34, 0.018), (42, 0.105), (64, 0.074), (73, 0.035), (78, 0.015), (89, 0.2), (95, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.91386199 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition
Author: Mohamed R. Amer, Sinisa Todorovic, Alan Fern, Song-Chun Zhu
Abstract: This paper presents an efficient approach to video parsing. Our videos show a number of co-occurring individual and group activities. To address challenges of the domain, we use an expressive spatiotemporal AND-OR graph (ST-AOG) that jointly models activity parts, their spatiotemporal relations, and context, as well as enables multitarget tracking. The standard ST-AOG inference is prohibitively expensive in our setting, since it would require running a multitude of detectors, and tracking their detections in a long video footage. This problem is addressed by formulating a cost-sensitive inference of ST-AOG as Monte Carlo Tree Search (MCTS). For querying an activity in the video, MCTS optimally schedules a sequence of detectors and trackers to be run, and where they should be applied in the space-time volume. Evaluation on the benchmark datasets demonstrates that MCTS enables two-magnitude speed-ups without compromising accuracy relative to the standard cost-insensitive inference.
2 0.90596026 170 iccv-2013-Fingerspelling Recognition with Semi-Markov Conditional Random Fields
Author: Taehwan Kim, Greg Shakhnarovich, Karen Livescu
Abstract: Recognition of gesture sequences is in general a very difficult problem, but in certain domains the difficulty may be mitigated by exploiting the domain ’s “grammar”. One such grammatically constrained gesture sequence domain is sign language. In this paper we investigate the case of fingerspelling recognition, which can be very challenging due to the quick, small motions of the fingers. Most prior work on this task has assumed a closed vocabulary of fingerspelled words; here we study the more natural open-vocabulary case, where the only domain knowledge is the possible fingerspelled letters and statistics of their sequences. We develop a semi-Markov conditional model approach, where feature functions are defined over segments of video and their corresponding letter labels. We use classifiers of letters and linguistic handshape features, along with expected motion profiles, to define segmental feature functions. This approach improves letter error rate (Levenshtein distance between hypothesized and correct letter sequences) from 16.3% using a hidden Markov model baseline to 11.6% us- ing the proposed semi-Markov model.
3 0.90248823 299 iccv-2013-Online Video SEEDS for Temporal Window Objectness
Author: Michael Van_Den_Bergh, Gemma Roig, Xavier Boix, Santiago Manen, Luc Van_Gool
Abstract: Superpixel and objectness algorithms are broadly used as a pre-processing step to generate support regions and to speed-up further computations. Recently, many algorithms have been extended to video in order to exploit the temporal consistency between frames. However, most methods are computationally too expensive for real-time applications. We introduce an online, real-time video superpixel algorithm based on the recently proposed SEEDS superpixels. A new capability is incorporated which delivers multiple diverse samples (hypotheses) of superpixels in the same image or video sequence. The multiple samples are shown to provide a strong cue to efficiently measure the objectness of image windows, and we introduce the novel concept of objectness in temporal windows. Experiments show that the video superpixels achieve comparable performance to state-of-the-art offline methods while running at 30 fps on a single 2.8 GHz i7 CPU. State-of-the-art performance on objectness is also demonstrated, yet orders of magnitude faster and extended to temporal windows in video.
4 0.90239108 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
Author: Mihai Zanfir, Marius Leordeanu, Cristian Sminchisescu
Abstract: Human action recognition under low observational latency is receiving a growing interest in computer vision due to rapidly developing technologies in human-robot interaction, computer gaming and surveillance. In this paper we propose a fast, simple, yet powerful non-parametric Moving Pose (MP)frameworkfor low-latency human action and activity recognition. Central to our methodology is a moving pose descriptor that considers both pose information as well as differential quantities (speed and acceleration) of the human body joints within a short time window around the current frame. The proposed descriptor is used in conjunction with a modified kNN classifier that considers both the temporal location of a particular frame within the action sequence as well as the discrimination power of its moving pose descriptor compared to other frames in the training set. The resulting method is non-parametric and enables low-latency recognition, one-shot learning, and action detection in difficult unsegmented sequences. Moreover, the framework is real-time, scalable, and outperforms more sophisticated approaches on challenging benchmarks like MSR-Action3D or MSR-DailyActivities3D.
5 0.90108103 451 iccv-2013-Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions
Author: Mohamed Elhoseiny, Babak Saleh, Ahmed Elgammal
Abstract: The main question we address in this paper is how to use purely textual description of categories with no training images to learn visual classifiers for these categories. We propose an approach for zero-shot learning of object categories where the description of unseen categories comes in the form of typical text such as an encyclopedia entry, without the need to explicitly defined attributes. We propose and investigate two baseline formulations, based on regression and domain adaptation. Then, we propose a new constrained optimization formulation that combines a regression function and a knowledge transfer function with additional constraints to predict the classifier parameters for new classes. We applied the proposed approach on two fine-grained categorization datasets, and the results indicate successful classifier prediction.
6 0.89801025 367 iccv-2013-SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels
7 0.89475071 413 iccv-2013-Target-Driven Moire Pattern Synthesis by Phase Modulation
8 0.89207333 340 iccv-2013-Real-Time Articulated Hand Pose Estimation Using Semi-supervised Transductive Regression Forests
10 0.88267386 338 iccv-2013-Randomized Ensemble Tracking
11 0.88100278 305 iccv-2013-POP: Person Re-identification Post-rank Optimisation
12 0.87073278 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
13 0.87038779 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
14 0.8688246 136 iccv-2013-Efficient Pedestrian Detection by Directly Optimizing the Partial Area under the ROC Curve
15 0.86653471 349 iccv-2013-Regionlets for Generic Object Detection
16 0.86488903 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints
17 0.8622123 190 iccv-2013-Handling Occlusions with Franken-Classifiers
18 0.85947978 428 iccv-2013-Translating Video Content to Natural Language Descriptions
19 0.85865581 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
20 0.85821205 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition