cvpr cvpr2013 cvpr2013-32 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yale Song, Louis-Philippe Morency, Randall Davis
Abstract: Recent progress has shown that learning from hierarchical feature representations leads to improvements in various computer vision tasks. Motivated by the observation that human activity data contains information at various temporal resolutions, we present a hierarchical sequence summarization approach for action recognition that learns multiple layers of discriminative feature representations at different temporal granularities. We build up a hierarchy dynamically and recursively by alternating sequence learning and sequence summarization. For sequence learning we use CRFs with latent variables to learn hidden spatiotemporal dynamics; for sequence summarization we group observations that have similar semantic meaning in the latent space. For each layer we learn an abstract feature representation through non-linear gate functions. This procedure is repeated to obtain a hierarchical sequence summary representation. We develop an efficient learning method to train our model and show that its complexity grows sublinearly with the size of the hierarchy. Experimental results show the effectiveness of our approach, achieving the best published results on the ArmGesture and Canal9 datasets.
Reference: text
sentIndex sentText sentNum sentScore
1 Recent progress has shown that learning from hierarchical feature representations leads to improvements in various computer vision tasks. [sent-2, score-0.321]
2 Motivated by the observation that human activity data contains information at various temporal resolutions, we present a hierarchical sequence summarization approach for action recognition that learns multiple layers of discriminative feature representations at different temporal granularities. [sent-3, score-1.45]
3 We build up a hierarchy dynamically and recursively by alternating sequence learning and sequence summarization. [sent-4, score-0.822]
4 For sequence learning we use CRFs with latent variables to learn hidden spatiotemporal dynamics; for sequence summarization we group observations that have similar semantic meaning in the latent space. [sent-5, score-1.588]
5 For each layer we learn an abstract feature representation through non-linear gate functions. [sent-6, score-0.496]
6 This procedure is repeated to obtain a hierarchical sequence summary representation. [sent-7, score-0.549]
7 We develop an efficient learning method to train our model and show that its complexity grows sublinearly with the size of the hierarchy. [sent-8, score-0.259]
8 Although there is much difference in algorithmic details, these approaches share the common goal of learning from hierarchical feature representations in order to capture high-level concepts that are otherwise difficult to express with a single representation approach. [sent-12, score-0.336]
9 Human activity data contains information at various temporal resolutions, having many similar observations with occasional and irregular changes. [sent-62, score-0.302]
10 We build a hierarchical representation of sequence and learn from multiple layers of different feature representations. [sent-64, score-0.634]
11 Numerous approaches have been proposed to learn from a hierarchical representation of human action [13, 23, 8, 26, 12]. [sent-72, score-0.5]
12 Other efforts have proposed sequence models to learn hierarchical feature representation [13, 28, 12, 6, 24]. [sent-74, score-0.55]
13 [12] showed that learning a hierarchical feature representation leads to significant improvements in action recognition. [sent-76, score-0.505]
14 The challenge here is efficiency: for deep belief networks [5], solving the optimization problems remains difficult when the size of the hierarchy is large [18]. [sent-77, score-0.265]
15 This paper presents a hierarchical sequence summarization approach for action recognition that learns multiple layers of discriminative feature representations at different temporal granularities. [sent-78, score-1.224]
16 Our approach is motivated by the observation that human activity data contains information at various temporal resolutions. [sent-79, score-0.226]
17 We build up a hierarchical representation dynamically and recursively by alternating sequence learning and sequence summarization. [sent-80, score-0.914]
18 For sequence learning we use CRFs with latent variables [16], but modify the standard feature function to use a set of nonlinear gate functions, as used in neural networks, to automatically learn a discriminative feature representation. [sent-81, score-0.929]
19 For sequence summarization we group observations that have similar semantic meaning in the latent space, defining a similarity metric using the posteriors of latent variables, and using an efficient graph-based variable grouping algorithm [3] to obtain a sequence summary representation. [sent-82, score-1.603]
20 As the hierarchy builds, we learn discriminative feature representations that contain ever more high-level spatio-temporal information. [sent-83, score-0.278]
21 We have developed an efficient optimization method to train our model; its complexity grows only sublinearly as the size of the hierarchy grows. [sent-84, score-0.306]
22 Related Work Learning from a hierarchical feature representation has been a recurring theme in action recognition [13, 23, 8, 26, 12]. [sent-88, score-0.392]
23 This has been used to construct a hierarchical feature representation that is more discriminative and context-rich [23, 26, 8]. [sent-90, score-0.289]
24 [16] incorporated latent variables into CRF (HCRF) to learn hidden spatio-temporal dynamics, while Wang et al. [sent-102, score-0.309]
25 [15] presented Conditional Neural Fields (CNF) that used gate functions to extract nonlinear features representations. [sent-106, score-0.212]
26 However, these approaches are defined over a single representation and thus cannot benefit from the additional information that hierarchical representation provides. [sent-107, score-0.236]
27 Our model has many similarities to the deep learning paradigm [1], such as learning from multiple hidden layers with non-linear operations. [sent-108, score-0.352]
28 Compared to DBN, the learning complexity of our method grows sublinearly with the size of the hierarchy. [sent-113, score-0.259]
29 , [28]) define each layer as a combination of the original observation and the preceding layer’s posteriors, at the same temporal resolution. [sent-116, score-0.353]
30 Our work learns each layer at temporally coarser-grained resolutions, making our model capable of learning ever-more high-level concepts that incorporate the surrounding context (e. [sent-117, score-0.297]
31 Hierarchical Sequence Summarization We propose to capture complex spatio-temporal dynamics in human activity data by learning from a hierarchical sequence summary representation. [sent-121, score-0.724]
32 Intuitively, each layer in the hierarchy is a temporally coarser-grained summary of the sequence from the preceding layer, and is built dynamically and recursively by grouping observations that have similar semantic meaning in the latent space. [sent-122, score-1.232]
33 Our approach builds the hierarchy by alternating sequence learning and sequence summarization. [sent-123, score-0.73]
34 Notation Input to our model is a time-ordered sequence x = [x_1; · · · ; x_T] of length T (the length can vary across sequences); each per-frame observation x_t ∈ R^D is of dimension D and can be any type of action feature (e. [sent-133, score-0.429]
35 Each sequence is labeled y from a finite alphabet set, y ∈ Y. [sent-137, score-0.277]
36 We denote a sequence summary at the l-th layer in the hierarchy by x^l = [x^l_1; ··· ; x^l_T]. [sent-138, score-0.916]
37 A super observation x^l_t is a group of observations from the preceding layer, and we define c(x^l_t) as a reference operator of x^l_t that returns the group of observations; for l = 1 we set c(x^l_t) = x_t. [sent-139, score-0.555]
38 Because our model is defined recursively, most procedures at each layer can be formulated without specifying the layer index. [sent-140, score-0.362]
39 Sequence Learning Following [16], we use CRFs with latent variables to capture hidden dynamics in each layer in the hierarchy. [sent-146, score-0.524]
40 Using a set of latent variables h ∈ H, the conditional probability distribution is defined as p(y|x; w) = (1/Z(x; w)) … [sent-147, score-0.247]
41 t Our definition of feature function is different from that of [16] to accommodate the hierarchical nature of our approach. [sent-160, score-0.192]
42 Specifically, we define the super observation feature function that is different from [16]. [sent-161, score-0.263]
43 [16], (b) our approach uses an additional set of gate functions to learn an abstract feature representation of super observations. [sent-184, score-0.501]
44 Our super observation feature function (the first term of Equation 2) incorporates a set of non-linear gate functions G, as used in neural networks, to learn an abstract feature representation of super observations (see Figure 2 (b)). [sent-199, score-0.919]
45 Let ψ_g(x, t; w) be a function that computes, using a gate function g(·), an average of gated output values from each observation contained in a super observation x_t … [sent-200, score-0.342]
46 (3) We adopt the popular logistic function as our gate function, g(z) = 1/(1 + exp(−z)), which has been shown to perform well in various tasks [1]. [sent-207, score-0.184]
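The gated average described in sentences 45 and 46 can be sketched as follows. This is only an illustration under stated assumptions: the source equation is truncated, so the linear pre-activation, the `weights` argument, and the list-of-lists data layout are hypothetical and are not taken from the paper.

```python
import math


def logistic(z):
    """Logistic gate g(z) = 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_average(super_obs, weights):
    """Average of gated outputs over the observations in one super observation.

    super_obs: list of observations, each a list of D floats (assumed layout).
    weights:   list of D floats for a single gate (assumed parameterization).
    """
    total = 0.0
    for x in super_obs:
        # Assumed linear pre-activation: dot product of gate weights and observation.
        z = sum(w * v for w, v in zip(weights, x))
        total += logistic(z)
    return total / len(super_obs)
```

Averaging over the group keeps the feature on the same scale regardless of how many observations a super observation contains.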
47 We define our super observation feature function as f1(h,x,t;w) = … [sent-208, score-0.208]
48 The set of gate functions G creates an additional layer … [sent-213, score-0.365]
49 That is, this feature function automatically learns an abstract representation of super observations, and thus provides more discriminative information for capturing complex spatio-temporal patterns in human activity data. [sent-215, score-0.438]
50 To see the effectiveness of the gate functions, consider another definition of the observation feature function, one without the gate functions (see Figure 2 (a)), f1(h,x,t;w) = (1/|c(x_t)|) … [sent-216, score-0.501]
51 We generate a sequence summary by grouping neighboring observations that have similar semantic labeling in the latent space. [sent-257, score-0.752]
52 As evidenced by the deep learning literature [1, 12], and consistent with our experimental result in Section 4, the step of non-linear feature learning leads to a more discriminative representation. [sent-260, score-0.294]
53 Complexity Analysis: Our model parameter vector is w = [w_{g,h}; w_{g,d}; w_{y,h}; w_{y,h,h}] and has the dimension of GH + GD + YH + YHH, with the number of gate functions G, the number of latent states H, the feature dimension D, and the number of class labels Y. [sent-261, score-0.221]
54 Given a chain-structured sequence x of length T, we can solve the inference problem at O(YTH^2) using a belief propagation algorithm. [sent-262, score-0.347]
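To make the O(YTH^2) cost concrete, here is a minimal sketch of a forward pass on a chain in log space. It assumes generic unary and pairwise log-potentials as plain lists; the names `chain_log_partition`, `unary`, and `pairwise` are hypothetical, and the paper's actual potentials are not reproduced here.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def chain_log_partition(unary, pairwise):
    """Log of the summed exponentiated scores over all latent-state paths.

    unary:    T x H nested list, unary[t][j] = log-potential of state j at time t.
    pairwise: H x H nested list, pairwise[i][j] = log-potential of transition i -> j.
    """
    num_states = len(pairwise)
    alpha = list(unary[0])
    for t in range(1, len(unary)):
        # Each time step costs H * H operations, hence O(T H^2) per class label.
        alpha = [
            unary[t][j]
            + logsumexp([alpha[i] + pairwise[i][j] for i in range(num_states)])
            for j in range(num_states)
        ]
    return logsumexp(alpha)
```

Running this once per class label with label-specific potentials gives the factor Y in the stated complexity.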
55 Sequence Summarization There are many ways to summarize x^l to obtain a temporally coarser-grained sequence summary x^{l+1}. [sent-265, score-0.702]
56 One simple approach is to group observations from x^l at a fixed time interval, e. [sent-266, score-0.427]
57 , collapse every two consecutive observations and obtain a sequence with half the length of x^l. [sent-268, score-0.428]
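The fixed-interval baseline mentioned in sentences 56 and 57 is simple enough to sketch directly. The function name `group_fixed` and the list-based representation are assumptions made for illustration.

```python
def group_fixed(sequence, size=2):
    """Collapse every `size` consecutive observations into one group.

    The last group may be shorter when len(sequence) is not a multiple of `size`.
    """
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]
```

With `size=2` the summary has roughly half the length of its input, independent of what the observations contain.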
58 We therefore summarize x^l by grouping observations at an adaptive interval, based on how similar the semantic labelings of the observations are in the latent space. [sent-270, score-0.813]
59 Said slightly differently, the similarity of latent variables is a measure of the similarity of the corresponding observations, but in a space more likely to discriminate appropriately. [sent-272, score-0.251]
60 Sequence summarization can be seen as a variable grouping problem with a piecewise connectivity constraint. [sent-273, score-0.405]
61 The algorithm has the desirable property that it preserves detail in low-variance groups while ignoring detail in high-variance groups, producing a grouping of variables that is globally coherent. [sent-276, score-0.196]
62 The algorithm produces a set of super observations C = {···}. The algorithm merges c(x^{l+1}_s) and a neighboring group if the difference between the groups is smaller than the minimum internal difference within the groups. [sent-280, score-0.357]
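The merge rule in sentence 62 can be sketched on a chain as follows. This is a hedged illustration in the spirit of graph-based grouping, not the authors' implementation: the edge weights are taken as given (the paper derives them from latent-variable posteriors, which is not reproduced here), and the `tau / size` relaxation term, the `<=` comparison, and the name `group_adaptive` are assumptions.

```python
def group_adaptive(weights, tau):
    """Group consecutive observations on a chain by merging low-weight edges.

    weights: weights[t] is the dissimilarity between observations t and t + 1.
    tau:     threshold constant; larger values favor larger groups (assumed form).
    Returns a list of groups, each a list of consecutive observation indices.
    """
    n = len(weights) + 1
    parent = list(range(n))
    size = [1] * n
    internal = [0.0] * n  # largest edge weight merged into each group so far

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Sorting the T - 1 edges dominates the cost, giving O(T log T).
    for w, t in sorted((w, t) for t, w in enumerate(weights)):
        a, b = find(t), find(t + 1)
        limit = min(internal[a] + tau / size[a], internal[b] + tau / size[b])
        if a != b and w <= limit:
            parent[b] = a
            size[a] += size[b]
            internal[a] = w  # edges arrive in ascending order, so w is the max

    groups = []
    for i in range(n):
        if i > 0 and find(i) == find(i - 1):
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups
```

Because only edges between temporal neighbors exist, every resulting group is a contiguous run of indices, which matches the connectivity constraint of sequence summarization.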
63 Complexity Analysis: As shown in [3], this sequence summarization algorithm runs quite efficiently in O(T log T) with the sequence length T. [sent-294, score-0.896]
64 The first derivation comes from our reformulation of p(y|x; w) using hierarchical sequence summaries; the second comes from the way we construct the sequence summaries. [sent-307, score-0.696]
65 To see this, recall that we obtain a sequence summary x^{l+1} given the posterior of latent variables p(h^l | y, x^l; w^l), and the posterior is computed based on the parameter vector w^l; this implies that x^{l+1} is conditionally independent of x^l given w^l. [sent-308, score-1.783]
66 To make our model tractable, we assume that the parameter vectors w^l at different layers are independent of each other. [sent-309, score-0.48]
67 In our approach only the original sequence x^1 is available at the outset; to generate a sequence summary x^{l+1} we need the posterior p(h^l | y, x^l; w^l), and the quality of the posterior relies on an estimate of the solution w^l obtained so far. [sent-328, score-1.288]
68 We therefore perform incremental optimization [4], where, at each layer l, we solve for only the necessary part of the solution while fixing all the others, and iterate the optimization process, incrementing l. [sent-329, score-0.227]
69 At each layer l of the incremental optimization, we solve ? [sent-330, score-0.227]
70 The training procedure involves, for each l, solving for w*^l and generating a sequence summary x^{l+1} for each sample in the dataset. [sent-362, score-0.407]
71 The testing procedure involves adding up log p(y|x^l; w*^l) computed from each layer and finding the optimal sequence label y with the highest probability. [sent-363, score-0.532]
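The testing step in sentence 71 amounts to summing per-layer log-probabilities and taking the argmax. The sketch below assumes the per-layer values are already computed and stored as dictionaries; `predict_label` and that data layout are hypothetical.

```python
def predict_label(layer_log_probs):
    """Pick the label with the highest summed log-probability across layers.

    layer_log_probs: list over layers; each item maps label -> log p(y | x^l; w^l).
    """
    labels = list(layer_log_probs[0].keys())
    totals = {y: sum(layer[y] for layer in layer_log_probs) for y in labels}
    return max(totals, key=totals.get)
```

Summing in log space is equivalent to multiplying the per-layer probabilities, so a label must be plausible at every temporal granularity to win.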
72 Note that if the summary produced the same sequence (i. [sent-364, score-0.366]
73 Complexity Analysis: Because of this incremental optimization, the complexity grows only sublinearly with the number of layers considered. [sent-368, score-0.339]
74 To see this, recall that solving an inference problem given a sequence takes O(YTH^2) and the sequence summarization takes O(? [sent-369, score-0.867]
75 ), and thus the complexity of our model increases sublinearly with the number of layers used. [sent-378, score-0.249]
76 We varied the number of latent states H ∈ {4, 8, 12} and the number of gate functions G ∈ {4, 8, 12}, and set the number of layers L = 4; for simplicity we set H and G to be the same across layers. [sent-393, score-0.444]
77 The threshold constant in sequence summarization was varied τ ∈ {0. [sent-394, score-0.623]
78 We include previous results on each dataset reported in the literature; we also include the result obtained by us using CNF [15] with latent variables (HCNF). [sent-406, score-0.201]
79 Detailed Analysis and Discussions For detailed analysis we evaluated whether our hierarchical representation is indeed advantageous over a single representation, and how our sequence summarization in the latent space differs from the other approaches. [sent-429, score-0.922]
80 single optimal representation: While our results show significant improvements over previous sequence learning models, they do not prove the advantage of learning from hierarchical sequence summary representation, as opposed to learning from only the optimal layer inside the hierarchy (if any). [sent-431, score-1.245]
81 The top row (a)-(c) shows experimental results comparing hierarchical (HSS) and single optimal representation approaches, the bottom row (d)-(f) shows the results on three different sequence summarization approaches. [sent-589, score-0.779]
82 This shows that there is no single representation that is as discriminative as the hierarchical representation. [sent-594, score-0.239]
83 2) Different sequence summarization algorithms: Our sequence summarization produces groups of temporally neighboring observations that have similar semantic meaning in the latent space. [sent-595, score-1.555]
84 We compare this to two different approaches: One approach simply collapses every l consecutive observations and obtains a sequence of length T/l at each layer l (“Fixed” in Figure 4). [sent-596, score-0.652]
85 Another approach produces groups of observations that are similar in the feature space, with a similarity metric defined as w_st = |x_s − x_t| and with the threshold range τ = {1, 5, 10} (“Obs” in Figure 4). [sent-597, score-0.373]
86 The Fixed approach collapses observations as long as there is more than one, even if they contain discriminative information individually, which may cause over-grouping. [sent-599, score-0.215]
87 The Obs approach groups observations using input features, not the corresponding posteriors p(h|y, x; w) in the latent space. [sent-601, score-0.218]
88 Our approach, on the other hand, uses latent variables that are defined in the scale [0, 1] and contain discriminative information learned via mathematical optimization. [sent-605, score-0.251]
89 Conclusion We presented a hierarchical sequence summarization (HSS) model for action recognition, and showed that it achieves the best published results on the ArmGesture and Canal9 datasets. [sent-608, score-0.946]
90 We showed that learning from hierarchical representations is important, and that grouping observations that are similar in the latent space has several advantages over other methods. [sent-609, score-0.627]
91 By being feature agnostic, our model is applicable to other domains dealing with temporal sequence data, such as multimodal social signal processing. [sent-611, score-0.4]
92 We plan to test our model on these and other real-world sequence analysis tasks. [sent-612, score-0.277]
93 Each super observation represents key transitions of each action class. [sent-662, score-0.366]
94 For the purpose of visualization we selected the middle frame from each super observation at the 4-th layer. [sent-663, score-0.213]
95 Modeling hidden dynamics of multimodal cues for spontaneous agreement and disagreement recognition. [sent-673, score-0.194]
96 Learning hierarchical representations for face verification with convolutional deep belief networks. [sent-702, score-0.324]
97 Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. [sent-715, score-0.334]
98 Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. [sent-743, score-0.295]
99 A hierarchical model of shape and appearance for human action classification. [sent-749, score-0.329]
100 Hidden part models for human action recognition: Probabilistic versus max margin. [sent-849, score-0.187]
wordName wordTfidf (topN-words)
[('summarization', 0.313), ('wl', 0.299), ('sequence', 0.277), ('xl', 0.272), ('armgesture', 0.209), ('gate', 0.184), ('natops', 0.183), ('layer', 0.181), ('super', 0.158), ('hss', 0.156), ('action', 0.153), ('latent', 0.143), ('hierarchical', 0.142), ('sublinearly', 0.13), ('observations', 0.122), ('wst', 0.116), ('xtl', 0.116), ('morency', 0.107), ('xil', 0.101), ('hierarchy', 0.097), ('deep', 0.094), ('hcrf', 0.093), ('grouping', 0.092), ('summary', 0.089), ('layers', 0.084), ('hidden', 0.074), ('temporal', 0.073), ('gestures', 0.069), ('dynamics', 0.068), ('ht', 0.067), ('ct', 0.067), ('activity', 0.064), ('variables', 0.058), ('dbn', 0.058), ('yale', 0.055), ('observation', 0.055), ('cnf', 0.052), ('disagreement', 0.052), ('hcnf', 0.052), ('htl', 0.052), ('neco', 0.052), ('posteriors', 0.05), ('learning', 0.05), ('discriminative', 0.05), ('feature', 0.05), ('recursively', 0.048), ('logp', 0.048), ('representations', 0.047), ('representation', 0.047), ('quattoni', 0.047), ('aircraft', 0.046), ('mint', 0.046), ('xlt', 0.046), ('groups', 0.046), ('incremental', 0.046), ('cs', 0.046), ('conditional', 0.046), ('grows', 0.044), ('dynamically', 0.044), ('preceding', 0.044), ('collapses', 0.043), ('ionuptuptu', 0.043), ('occasional', 0.043), ('political', 0.043), ('procedure', 0.041), ('belief', 0.041), ('hlt', 0.04), ('fw', 0.04), ('song', 0.04), ('xt', 0.039), ('posterior', 0.037), ('resolutions', 0.036), ('meaning', 0.035), ('crfs', 0.035), ('hl', 0.035), ('complexity', 0.035), ('learns', 0.035), ('body', 0.035), ('summaries', 0.035), ('gh', 0.035), ('ong', 0.035), ('human', 0.034), ('learn', 0.034), ('summarize', 0.033), ('networks', 0.033), ('testing', 0.033), ('group', 0.033), ('varied', 0.033), ('neural', 0.033), ('improvements', 0.032), ('showed', 0.031), ('internal', 0.031), ('temporally', 0.031), ('mkl', 0.03), ('published', 0.03), ('hmm', 0.03), ('alternating', 0.029), ('semantic', 0.029), ('length', 0.029), ('functions', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization
Author: Yale Song, Louis-Philippe Morency, Randall Davis
Abstract: Recent progress has shown that learning from hierarchical feature representations leads to improvements in various computer vision tasks. Motivated by the observation that human activity data contains information at various temporal resolutions, we present a hierarchical sequence summarization approach for action recognition that learns multiple layers of discriminative feature representations at different temporal granularities. We build up a hierarchy dynamically and recursively by alternating sequence learning and sequence summarization. For sequence learning we use CRFs with latent variables to learn hidden spatiotemporal dynamics; for sequence summarization we group observations that have similar semantic meaning in the latent space. For each layer we learn an abstract feature representation through non-linear gate functions. This procedure is repeated to obtain a hierarchical sequence summary representation. We develop an efficient learning method to train our model and show that its complexity grows sublinearly with the size of the hierarchy. Experimental results show the effectiveness of our approach, achieving the best published results on the ArmGesture and Canal9 datasets.
2 0.21673495 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors
Author: Aditya Khosla, Raffay Hamid, Chih-Jen Lin, Neel Sundaresan
Abstract: Given the enormous growth in user-generated videos, it is becoming increasingly important to be able to navigate them efficiently. As these videos are generally of poor quality, summarization methods designed for well-produced videos do not generalize to them. To address this challenge, we propose to use web-images as a prior to facilitate summarization of user-generated videos. Our main intuition is that people tend to take pictures of objects to capture them in a maximally informative way. Such images could therefore be used as prior information to summarize videos containing a similar set of objects. In this work, we apply our novel insight to develop a summarization algorithm that uses the web-image based prior information in an unsupervised manner. Moreover, to automatically evaluate summarization algorithms on a large scale, we propose a framework that relies on multiple summaries obtained through crowdsourcing. We demonstrate the effectiveness of our evaluation framework by comparing its performance to that of multiple human evaluators. Finally, we present results for our framework tested on hundreds of user-generated videos.
3 0.14918698 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image
Author: Edgar Simo-Serra, Ariadna Quattoni, Carme Torras, Francesc Moreno-Noguer
Abstract: We introduce a novel approach to automatically recover 3D human pose from a single image. Most previous work follows a pipelined approach: initially, a set of 2D features such as edges, joints or silhouettes are detected in the image, and then these observations are used to infer the 3D pose. Solving these two problems separately may lead to erroneous 3D poses when the feature detector has performed poorly. In this paper, we address this issue by jointly solving both the 2D detection and the 3D inference problems. For this purpose, we propose a Bayesian framework that integrates a generative model based on latent variables and discriminative 2D part detectors based on HOGs, and perform inference using evolutionary algorithms. Real experimentation demonstrates competitive results, and the ability of our methodology to provide accurate 2D and 3D pose estimations even when the 2D detectors are inaccurate.
4 0.13696578 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
Author: Michalis Raptis, Leonid Sigal
Abstract: In this paper, we develop a new model for recognizing human actions. An action is modeled as a very sparse sequence of temporally local discriminative keyframes, collections of partial key-poses of the actor(s) depicting key states in the action sequence. We cast the learning of keyframes in a max-margin discriminative framework, where we treat keyframes as latent variables. This allows us to (jointly) learn a set of most discriminative keyframes while also learning the local temporal context between them. Keyframes are encoded using a spatially-localizable poselet-like representation with HoG and BoW components learned from weak annotations; we rely on structured SVM formulation to align our components and mine for hard negatives to boost localization performance. This results in a model that supports spatio-temporal localization and is insensitive to dropped frames or partial observations. We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an on-line streaming setting.
5 0.13693897 287 cvpr-2013-Modeling Actions through State Changes
Author: Alireza Fathi, James M. Rehg
Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.
6 0.13068706 40 cvpr-2013-An Approach to Pose-Based Action Recognition
8 0.12376805 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
10 0.11328567 288 cvpr-2013-Modeling Mutual Visibility Relationship in Pedestrian Detection
11 0.1057058 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition
12 0.10240616 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
13 0.10076705 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
14 0.10010791 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes
15 0.099703848 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
16 0.097633295 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
17 0.097024381 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)
18 0.096780948 187 cvpr-2013-Geometric Context from Videos
19 0.095635287 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
20 0.095578164 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
topicId topicWeight
[(0, 0.212), (1, -0.064), (2, -0.026), (3, -0.089), (4, -0.117), (5, 0.007), (6, -0.015), (7, 0.032), (8, -0.035), (9, -0.031), (10, 0.029), (11, -0.033), (12, -0.043), (13, -0.018), (14, -0.013), (15, 0.097), (16, -0.058), (17, 0.079), (18, 0.092), (19, 0.013), (20, 0.004), (21, -0.089), (22, 0.023), (23, -0.054), (24, -0.031), (25, -0.011), (26, 0.013), (27, -0.005), (28, 0.041), (29, 0.06), (30, 0.014), (31, 0.003), (32, -0.032), (33, 0.012), (34, 0.022), (35, 0.058), (36, -0.032), (37, 0.018), (38, 0.054), (39, -0.069), (40, -0.027), (41, -0.006), (42, -0.019), (43, -0.003), (44, -0.066), (45, 0.076), (46, -0.049), (47, 0.05), (48, -0.012), (49, 0.089)]
simIndex simValue paperId paperTitle
same-paper 1 0.93779218 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization
Author: Yale Song, Louis-Philippe Morency, Randall Davis
Abstract: Recent progress has shown that learning from hierarchical feature representations leads to improvements in various computer vision tasks. Motivated by the observation that human activity data contains information at various temporal resolutions, we present a hierarchical sequence summarization approach for action recognition that learns multiple layers of discriminative feature representations at different temporal granularities. We build up a hierarchy dynamically and recursively by alternating sequence learning and sequence summarization. For sequence learning we use CRFs with latent variables to learn hidden spatiotemporal dynamics; for sequence summarization we group observations that have similar semantic meaning in the latent space. For each layer we learn an abstract feature representation through non-linear gate functions. This procedure is repeated to obtain a hierarchical sequence summary representation. We develop an efficient learning method to train our model and show that its complexity grows sublinearly with the size of the hierarchy. Experimental results show the effectiveness of our approach, achieving the best published results on the ArmGesture and Canal9 datasets.
2 0.66607934 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
Author: Michalis Raptis, Leonid Sigal
Abstract: In this paper, we develop a new model for recognizing human actions. An action is modeled as a very sparse sequence of temporally local discriminative keyframes, collections of partial key-poses of the actor(s) depicting key states in the action sequence. We cast the learning of keyframes in a max-margin discriminative framework, where we treat keyframes as latent variables. This allows us to (jointly) learn a set of most discriminative keyframes while also learning the local temporal context between them. Keyframes are encoded using a spatially-localizable poselet-like representation with HoG and BoW components learned from weak annotations; we rely on structured SVM formulation to align our components and mine for hard negatives to boost localization performance. This results in a model that supports spatio-temporal localization and is insensitive to dropped frames or partial observations. We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an on-line streaming setting.
3 0.66592139 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition
Author: LiMin Wang, Yu Qiao, Xiaoou Tang
Abstract: This paper proposes motionlet, a mid-level and spatiotemporal part, for human motion recognition. Motionlet can be seen as a tight cluster in motion and appearance space, corresponding to the moving process of different body parts. We postulate three key properties of motionlet for action recognition: high motion saliency, multiple scale representation, and representative-discriminative ability. Towards this goal, we develop a data-driven approach to learn motionlets from training videos. First, we extract 3D regions with high motion saliency. Then we cluster these regions and preserve the centers as candidate templates for motionlet. Finally, we examine the representative and discriminative power of the candidates, and introduce a greedy method to select effective candidates. With motionlets, we present a mid-level representation for video, called motionlet activation vector. We conduct experiments on three datasets, KTH, HMDB51, and UCF50. The results show that the proposed methods significantly outperform state-of-the-art methods.
4 0.63840872 3 cvpr-2013-3D R Transform on Spatio-temporal Interest Points for Action Recognition
Author: Chunfeng Yuan, Xi Li, Weiming Hu, Haibin Ling, Stephen Maybank
Abstract: Spatio-temporal interest points serve as an elementary building block in many modern action recognition algorithms, and most of them exploit the local spatio-temporal volume features using a Bag of Visual Words (BOVW) representation. Such representation, however, ignores potentially valuable information about the global spatio-temporal distribution of interest points. In this paper, we propose a new global feature to capture the detailed geometrical distribution of interest points. It is calculated by using the ℛ transform which is defined as an extended 3D discrete Radon transform, followed by applying a two-directional two-dimensional principal component analysis. Such ℛ feature captures the geometrical information of the interest points and keeps invariant to geometry transformation and robust to noise. In addition, we propose a new fusion strategy to combine the ℛ feature with the BOVW representation for further improving recognition accuracy. We utilize a context-aware fusion method to capture both the pairwise similarities and higher-order contextual interactions of the videos. Experimental results on several publicly available datasets demonstrate the effectiveness of the proposed approach for action recognition.
5 0.63616127 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition
Author: Chunfeng Yuan, Weiming Hu, Guodong Tian, Shuang Yang, Haoran Wang
Abstract: In this paper, we formulate human action recognition as a novel Multi-Task Sparse Learning (MTSL) framework which aims to construct a test sample with multiple features from as few bases as possible. Learning the sparse representation under each feature modality is considered as a single task in MTSL. Since the tasks are generated from multiple features associated with the same visual input, they are not independent but inter-related. We introduce a Beta process (BP) prior to the hierarchical MTSL model, which efficiently learns a compact dictionary and infers the sparse structure shared across all the tasks. The MTSL model enforces the robustness in coefficient estimation compared with performing each task independently. Besides, the sparseness is achieved via the Beta process formulation rather than the computationally expensive l1 norm penalty. In terms of non-informative gamma hyper-priors, the sparsity level is totally decided by the data. Finally, the learning problem is solved by Gibbs sampling inference which estimates the full posterior on the model parameters. Experimental results on the KTH and UCF sports datasets demonstrate the effectiveness of the proposed MTSL approach for action recognition.
6 0.62409902 105 cvpr-2013-Deep Learning Shape Priors for Object Segmentation
7 0.62217844 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
8 0.61861289 353 cvpr-2013-Relative Hidden Markov Models for Evaluating Motion Skill
10 0.60597795 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
11 0.60489887 287 cvpr-2013-Modeling Actions through State Changes
12 0.58860731 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
13 0.58813155 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
14 0.58722115 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors
15 0.58396757 413 cvpr-2013-Story-Driven Summarization for Egocentric Video
16 0.5709579 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos
17 0.56733131 371 cvpr-2013-SCaLE: Supervised and Cascaded Laplacian Eigenmaps for Visual Object Recognition Based on Nearest Neighbors
18 0.55907011 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
19 0.5567534 40 cvpr-2013-An Approach to Pose-Based Action Recognition
20 0.54965407 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)
topicId topicWeight
[(6, 0.242), (10, 0.15), (16, 0.028), (26, 0.051), (28, 0.012), (33, 0.232), (67, 0.059), (69, 0.04), (80, 0.011), (87, 0.079)]
simIndex simValue paperId paperTitle
1 0.85143203 283 cvpr-2013-Megastereo: Constructing High-Resolution Stereo Panoramas
Author: Christian Richardt, Yael Pritch, Henning Zimmer, Alexander Sorkine-Hornung
Abstract: We present a solution for generating high-quality stereo panoramas at megapixel resolutions. While previous approaches introduced the basic principles, we show that those techniques do not generalise well to today’s high image resolutions and lead to disturbing visual artefacts. As our first contribution, we describe the necessary correction steps and a compact representation for the input images in order to achieve a highly accurate approximation to the required ray space. Our second contribution is a flow-based upsampling of the available input rays which effectively resolves known aliasing issues like stitching artefacts. The required rays are generated on the fly to perfectly match the desired output resolution, even for small numbers of input images. In addition, the upsampling is real-time and enables direct interactive control over the desired stereoscopic depth effect. In combination, our contributions allow the generation of stereoscopic panoramas at high output resolutions that are virtually free of artefacts such as seams, stereo discontinuities, vertical parallax and other mono-/stereoscopic shape distortions. Our process is robust, and other types of multiperspective panoramas, such as linear panoramas, can also benefit from our contributions. We show various comparisons and high-resolution results.
same-paper 2 0.81694782 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization
Author: Yale Song, Louis-Philippe Morency, Randall Davis
Abstract: Recent progress has shown that learning from hierarchical feature representations leads to improvements in various computer vision tasks. Motivated by the observation that human activity data contains information at various temporal resolutions, we present a hierarchical sequence summarization approach for action recognition that learns multiple layers of discriminative feature representations at different temporal granularities. We build up a hierarchy dynamically and recursively by alternating sequence learning and sequence summarization. For sequence learning we use CRFs with latent variables to learn hidden spatiotemporal dynamics; for sequence summarization we group observations that have similar semantic meaning in the latent space. For each layer we learn an abstract feature representation through non-linear gate functions. This procedure is repeated to obtain a hierarchical sequence summary representation. We develop an efficient learning method to train our model and show that its complexity grows sublinearly with the size of the hierarchy. Experimental results show the effectiveness of our approach, achieving the best published results on the ArmGesture and Canal9 datasets.
3 0.78393126 168 cvpr-2013-Fast Object Detection with Entropy-Driven Evaluation
Author: Raphael Sznitman, Carlos Becker, François Fleuret, Pascal Fua
Abstract: Cascade-style approaches to implementing ensemble classifiers can deliver significant speed-ups at test time. While highly effective, they remain challenging to tune and their overall performance depends on the availability of large validation sets to estimate rejection thresholds. These characteristics are often prohibitive and thus limit their applicability. We introduce an alternative approach to speeding-up classifier evaluation which overcomes these limitations. It involves maintaining a probability estimate of the class label at each intermediary response and stopping when the corresponding uncertainty becomes small enough. As a result, the evaluation terminates early based on the sequence of responses observed. Furthermore, it does so independently of the type of ensemble classifier used or the way it was trained. We show through extensive experimentation that our method provides 2 to 10 fold speed-ups, over existing state-of-the-art methods, at almost no loss in accuracy on a number of object classification tasks.
4 0.76777703 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
Author: Ian Endres, Kevin J. Shih, Johnston Jiaa, Derek Hoiem
Abstract: We propose a method to learn a diverse collection of discriminative parts from object bounding box annotations. Part detectors can be trained and applied individually, which simplifies learning and extension to new features or categories. We apply the parts to object category detection, pooling part detections within bottom-up proposed regions and using a boosted classifier with proposed sigmoid weak learners for scoring. On PASCAL VOC 2010, we evaluate the part detectors' ability to discriminate and localize annotated keypoints. Our detection system is competitive with the best-existing systems, outperforming other HOG-based detectors on the more deformable categories.
5 0.76518756 414 cvpr-2013-Structure Preserving Object Tracking
Author: Lu Zhang, Laurens van der Maaten
Abstract: Model-free trackers can track arbitrary objects based on a single (bounding-box) annotation of the object. Whilst the performance of model-free trackers has recently improved significantly, simultaneously tracking multiple objects with similar appearance remains very hard. In this paper, we propose a new multi-object model-free tracker (based on tracking-by-detection) that resolves this problem by incorporating spatial constraints between the objects. The spatial constraints are learned along with the object detectors using an online structured SVM algorithm. The experimental evaluation of our structure-preserving object tracker (SPOT) reveals significant performance improvements in multi-object tracking. We also show that SPOT can improve the performance of single-object trackers by simultaneously tracking different parts of the object.
6 0.76221889 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking
7 0.76160145 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
8 0.76061362 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
9 0.76025617 314 cvpr-2013-Online Object Tracking: A Benchmark
10 0.76021147 400 cvpr-2013-Single Image Calibration of Multi-axial Imaging Systems
11 0.75947887 325 cvpr-2013-Part Discovery from Partial Correspondence
12 0.75898504 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities
13 0.75857675 324 cvpr-2013-Part-Based Visual Tracking with Online Latent Structural Learning
14 0.75676036 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
15 0.75628668 360 cvpr-2013-Robust Estimation of Nonrigid Transformation for Point Set Registration
16 0.75579429 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path
17 0.75554144 331 cvpr-2013-Physically Plausible 3D Scene Tracking: The Single Actor Hypothesis
18 0.75522286 143 cvpr-2013-Efficient Large-Scale Structured Learning
19 0.75480932 406 cvpr-2013-Spatial Inference Machines
20 0.75412178 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image