iccv iccv2013 iccv2013-38 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jun Zhu, Baoyuan Wang, Xiaokang Yang, Wenjun Zhang, Zhuowen Tu
Abstract: With the improved accessibility to an exploding amount of video data and growing demands in a wide range of video analysis applications, video-based action recognition/classification becomes an increasingly important task in computer vision. In this paper, we propose a two-layer structure for action recognition to automatically exploit a mid-level “acton” representation. The weakly-supervised actons are learned via a new max-margin multi-channel multiple instance learning framework, which can capture multiple mid-level action concepts simultaneously. The learned actons (with no requirement for detailed manual annotations) observe the properties of being compact, informative, discriminative, and easy to scale. The experimental results demonstrate the effectiveness of applying the learned actons in our two-layer structure, and show the state-of-the-art recognition performance on two challenging action datasets, i.e., Youtube and HMDB51.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: With the improved accessibility to an exploding amount of video data and growing demands in a wide range of video analysis applications, video-based action recognition/classification becomes an increasingly important task in computer vision. [sent-7, score-0.417]
2 In this paper, we propose a two-layer structure for action recognition to automatically exploit a mid-level “acton ” representation. [sent-8, score-0.227]
3 The weakly-supervised actons are learned via a new max-margin multi-channel multiple instance learning framework, which can capture multiple mid-level action concepts simultaneously. [sent-9, score-1.038]
4 The learned actons (with no requirement for detailed manual annotations) observe the properties of being compact, informative, discriminative, and easy to scale. [sent-10, score-0.657]
5 The experimental results demonstrate the effectiveness of applying the learned actons in our two-layer structure, and show the state-of-the-art recognition performance on two challenging action datasets, i.e., Youtube and HMDB51. [sent-11, score-0.884]
6 Introduction: There have been growing demands for robust video analysis systems for action recognition/classification, ranging from content-based video retrieval, sports video analysis, and surveillance event detection to smart human-machine interfaces and gaming. [sent-15, score-0.44]
7 Continuous efforts are expected in the computer vision field to deliver working systems practical enough to deal with challenging real-world videos. [sent-16, score-0.022]
8 The current state-of-the-art performance [27, 6] has been achieved by the Bag-of-Features (BoF) model combined with a spatial-temporal pyramid (STP) [11, 12]. [sent-17, score-0.026]
9 Nevertheless, the problem of video representation remains the central issue in action recognition. [sent-19, score-0.352]
10 More recently, along the line of research on attributes for 2D static images [25], dedicated efforts have been devoted to designing/learning mid-level representations [15, 23, 21] for action recognition in dynamic videos, and promising results have greatly inspired researchers to explore this direction. [sent-20, score-0.277]
11 From another perspective, hierarchical structures with deep layers [13, 9] have drawn a large audience after big performance leaps on several grand-challenge tasks. [sent-21, score-0.096]
12 However, learning appropriate models in deep layered structures is a computationally and manually intensive job. [sent-22, score-0.11]
13 Improvement has also been observed in not-so-deep but still hierarchically layered models for object detection [36]. [sent-23, score-0.053]
14 Inspired by these previous works, we propose a new approach to learning a mid-level action representation, named “acton” in this paper, based on a weakly-supervised learning strategy. [sent-24, score-0.251]
15 Besides, we develop a two-layer structure for action recognition, with the goal of leveraging the benefits of low-level and mid-level features as well as a layered representation for knowledge abstraction. [sent-25, score-0.354]
16 Specifically, the first layer builds a low-level representation using the classical BoF-STP model, while the second layer automatically exploits a semantically meaningful mid-level representation via a new weakly-supervised learning strategy. [sent-26, score-0.39]
17 Our acton representation in the second layer is built directly on top of the first layer, with the goal of efficient knowledge abstraction and aggregation. [sent-27, score-0.525]
18 More specifically, we use those learned actons as a mid-level dictionary of intermediate concepts to characterize the semantic properties of each volume of interest (VOI). [sent-28, score-0.799]
19 In Sec. 4, we present the details of learning effective actons for VOIs, which is the primary focus of this paper. [sent-30, score-0.681]
20 Since the BoF features of the first layer and the acton response features of the second layer both tend to be highly nonlinear, we simply stack the representations of both layers and adopt a linear classifier for video-level action classification. [sent-31, score-0.828]
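As a minimal illustration of this stacking-plus-linear-classifier step (a hedged sketch only: scikit-learn's LinearSVC is one possible off-the-shelf linear classifier, not the one prescribed by the paper, and the toy data and variable names are placeholders):

```python
import numpy as np
from sklearn.svm import LinearSVC  # one possible off-the-shelf linear classifier (an assumption)

# Placeholder data: per-video first-layer (BoF-STP) and second-layer (acton) vectors.
n_videos, dim1, dim2 = 10, 96, 48
layer1 = np.random.rand(n_videos, dim1)          # toy stand-in for the BoF-STP vectors
layer2 = np.random.rand(n_videos, dim2)          # toy stand-in for the acton-response vectors
labels = np.random.randint(0, 2, size=n_videos)  # toy action class labels

stacked = np.hstack([layer1, layer2])            # stack both layers per video
clf = LinearSVC(C=1.0)                           # C=1.0 is an arbitrary illustrative setting
clf.fit(stacked, labels)
print(clf.predict(stacked[:3]))                  # in practice, predict on held-out test videos
```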
21 To achieve this goal, we learn the actons based on a novel max-margin multi-channel multiple instance learning (abbreviated by M4IL) method. [sent-32, score-0.723]
22 These actons, each of which corresponds to an underlying cluster/sub-modality of the training data, are built on the basis of a short sequence of subvolumes. [sent-34, score-0.022]
23 The existing MIL literature either assumes a single class [1] (not suitable for building a knowledge representation) or lacks explicit competition among different clusters [34, 31]. [sent-37, score-0.023]
24 In practice, our method naturally generalizes multiple instance support vector machine (MI-SVM) [1] and maximum margin multiple instance clustering (M3IC) [34], and retains the complementary advantages of both methods. [sent-38, score-0.157]
25 Our experimental results show that adding the second-layer acton representation provides complementary information w.r.t. [sent-40, score-0.418]
26 the first-layer representation for all the feature descriptor types as well as their different combinations, indicating the generalization ability facilitated by the learning of actons in the second layer. [sent-43, score-0.8]
27 Besides, the proposed two-layer representation achieves superior classification performance compared to the state-of-the-art results on two benchmark action datasets (i.e., Youtube and HMDB51). [sent-44, score-0.329]
28 The contributions of this paper are summarized as follows: (1) We propose a new mid-level acton representation for action recognition, which is automatically exploited through a weakly-supervised learning strategy. [sent-47, score-0.669]
29 (2) By generalizing the single-layer SPM pipeline that has been successfully applied in image classification, we present a two-layer structure for video representation in the action recognition task. [sent-48, score-0.352]
30 (3) In the second layer, we propose a novel M4IL method for learning the actons in a weakly-supervised manner. [sent-49, score-0.681]
31 It can capture multiple mid-level action concepts simultaneously, producing a discriminative and compact representation for action classification. [sent-50, score-0.663]
32 Related Work: As in the recent static image classification literature, BoF-STP approaches with local spatial-temporal features [11, 7, 27] have shown their effectiveness in action recognition on many challenging datasets [16, 19, 10]. [sent-52, score-0.227]
33 In the action recognition literature, typical spatial-temporal feature descriptors include histograms of oriented gradients (HOG) [11], histograms of optical flow (HOF) [11], and motion boundary histograms (MBH) [4, 27]. [sent-53, score-0.227]
34 Moreover, recent work [27, 7, 6] demonstrates that leveraging trajectory information [20, 18] leads to a more discriminative feature representation and improves recognition performance. [sent-55, score-0.097]
35 Besides those low-level features, mining mid-level feature representations (e.g., discriminative parts/patches [22, 23, 29, 28] or semantic attributes [25, 15, 14]) for image/video recognition has been a recent active area in computer vision. [sent-56, score-0.074] [sent-58, score-0.046]
37 It is even more labor-intensive to label videos for action recognition [15, 23]. [sent-60, score-0.26]
38 Our method, instead, requires only the video-level class labels of the training video clips, without any additional annotations, and is thus much easier to scale to a large amount of training data. [sent-61, score-0.123]
39 There is some existing work [21, 14, 29] that uses weakly-supervised learning for visual recognition. [sent-62, score-0.072]
40 The most relevant “MIL-BoF” work [21] directly adopts the mi-SVM algorithm [1] on top of the Bag-of-Features subvolume representation for video action classification. [sent-63, score-0.425]
41 In contrast, we focus on learning the actons that serve as a dictionary of intermediate action concepts to describe each VOI. [sent-64, score-1.05]
42 With a spatial-temporal pooling step on the resultant acton responses of VOIs, we obtain a mid-level video representation and feed it into the final classifier. [sent-65, score-0.614]
43 Our experimental results demonstrate that with multiple actons, this mid-level representation tends to be more discriminative and diverse for action recognition. [sent-66, score-0.324]
44 The very recent work of [14, 29] exploits MIL methods to learn multiple mid-level visual concepts for each class in a weakly-supervised manner. [sent-67, score-0.111]
45 In [14], the visual concepts are discovered via a successive two-stage method of using the mi-SVM algorithm followed by K-means clustering. [sent-68, score-0.088]
46 The work of [29] presents an iterative EM algorithm for visual dictionary learning, which alternates between sampling positive instances and training off-the-shelf multi-class SVM classifiers. [sent-70, score-0.09]
47 By contrast, our M4IL method can simultaneously explore multiple mid-level action concepts in a unified learning formulation, which can be readily solved by the CCCP algorithm. [sent-71, score-0.339]
48 Besides, we present a two-layer structure for video representation in action recognition, and show its superiority. [sent-72, score-0.352]
49 A two-layer representation of videos for action recognition: In this section, we elaborate on the proposed two-layer framework of video representation for action recognition. [sent-77, score-0.653]
50 The first-layer representation is built from the codes of local spatial-temporal feature points, and the second-layer representation is constructed from the acton responses of VOIs. [sent-78, score-0.615]
51 We adopt the linear SPM pipeline [32, 17, 35] to build the first-layer representation for a video clip. [sent-82, score-0.125]
52 Local feature extraction: For a video clip V, we assume there are P local spatial-temporal interest points (STIPs) detected. [sent-84, score-0.094]
53 Each STIP is then represented by a feature descriptor a ∈ RD that captures its local visual cues. [sent-85, score-0.023]
54 (Figure caption) Illustration of the proposed two-layer structure for action classification. [sent-89, score-0.227]
55 Spatial-temporal pooling: To capture informative statistics and achieve invariance (e.g., to transformations) in the spatial-temporal domain, a maximum pooling step is applied to form the video-level representation of V, via statistical summarization over the codes C = [c1, c2, · · · , cP] in L different subvolumes. [sent-94, score-0.062] [sent-96, score-0.169]
57 In this paper, we use a volumetric spatial-temporal pyramid as in [27], which includes six different spatial-temporal grids, leading to a total of L = 24 subvolumes. [sent-97, score-0.026]
58 Let Λl denote the spatial-temporal domain of the l-th subvolume used for pooling. [sent-98, score-0.073]
59 We define an element-wise maximization operation OPmax that maps the codes located in Λl into an M-dimensional vector γl(1) = [γl,1(1), γl,2(1), · · · , γl,M(1)]T = OPmax(C; Λl). [sent-99, score-0.073]
60 For any visual word u, the pooled signature is obtained by γl,u(1) = maxp∈Ω(Λl) cp,u, where Ω(Λl) refers to the set of STIPs located in Λl. [sent-100, score-0.033]
61 Thus, the first-layer representation of V is the (M × L)-dimensional vector Γ(1) = [γ1(1); γ2(1); · · · ; γL(1)]. [sent-101, score-0.074]
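To make the pooling step concrete, here is a minimal NumPy sketch of the element-wise max operation OPmax and the construction of Γ(1); it is not the authors' implementation, and the helper names (op_max, first_layer_representation) and the representation of subvolumes as membership functions are illustrative assumptions:

```python
import numpy as np

def op_max(features, members):
    """OPmax: element-wise max over the feature columns that belong to one subvolume.

    features: (M, P) array whose p-th column is the code c_p of the p-th STIP.
    members:  boolean mask of length P, True for STIPs located inside the subvolume.
    Returns the M-dimensional pooled signature (zeros if the subvolume is empty).
    """
    if not members.any():
        return np.zeros(features.shape[0])
    return features[:, members].max(axis=1)

def first_layer_representation(codes, stip_locations, subvolumes):
    """Stack the pooled signatures of all L subvolumes into the (M*L)-dim vector Gamma^(1).

    stip_locations: (P, 3) array of (x, y, t) positions of the STIPs (illustrative layout).
    subvolumes:     list of L functions, each mapping locations -> boolean membership mask
                    (the paper uses an STP with six grids, giving L = 24 subvolumes).
    """
    gammas = [op_max(codes, inside(stip_locations)) for inside in subvolumes]
    return np.concatenate(gammas)
```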
62 The second-layer representation: Besides the first-layer representation based on local STIPs, we construct a higher-level representation from the acton responses of VOIs for action recognition in videos. [sent-104, score-0.074]
63 In this layer, a video clip is decomposed into a set of VOIs, which potentially correspond to action parts or relevant objects. [sent-105, score-0.321]
64 Feature description for VOIs: Assuming there are J VOIs extracted from V, we represent V by a bag of VOI features X = [x1, x2, · · · , xJ]. [sent-107, score-0.073]
65 Let Λj denote the 3D spatial-temporal bounding box of the j-th VOI; then its feature descriptor xj can be computed by pooling the codes of the STIPs located in Λj. [sent-109, score-0.205]
66 Besides, the VOI descriptor xj is further normalized by its L2-norm. [sent-114, score-0.049]
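Under the same illustrative assumptions, a VOI descriptor x_j could be computed by reusing op_max over the STIPs that fall inside the VOI's 3D bounding box and then L2-normalizing; the box layout below is hypothetical:

```python
def voi_descriptor(codes, stip_locations, box):
    """Pool the codes of STIPs inside a VOI's 3D box, then L2-normalize (hedged sketch).

    box: (x_min, y_min, t_min, x_max, y_max, t_max) of the VOI (illustrative layout).
    """
    x0, y0, t0, x1, y1, t1 = box
    loc = stip_locations                              # (P, 3): columns are x, y, t
    inside = ((loc[:, 0] >= x0) & (loc[:, 0] <= x1) &
              (loc[:, 1] >= y0) & (loc[:, 1] <= y1) &
              (loc[:, 2] >= t0) & (loc[:, 2] <= t1))
    x_j = op_max(codes, inside)                       # reuse op_max from the sketch above
    norm = np.linalg.norm(x_j)
    return x_j / norm if norm > 0 else x_j            # guard against empty VOIs
```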
67 A mid-level representation based on acton responses: Given K actons, we construct the second-layer representation of V via their responses on the extracted VOIs. [sent-115, score-0.553]
68 For VOI j, we obtain a K-dimensional response vector rj = [rj,1, rj,2, · · · , rj,K]T, each component of which is computed by rj,k = S(wkT xj), ∀k = 1, 2, · · · , K. Here S(r) = 1 / (1 + exp(−ρr)) is a sigmoid function that produces a probabilistic value from the raw response, where ρ is a saturation parameter controlling its sharpness. [sent-117, score-0.078] [sent-118, score-0.058]
70 Let R = [r1, r2, · · · ,rJ] denote the set of responses on all VOIs in V. [sent-119, score-0.061]
71 Similar to the first-layer representation, we further pool R for each subvolume of the spatial-temporal pyramid, i.e., γl(2) = OPmax(R; Λl), ∀l = 1, 2, · · · , L, and obtain the (K × L)-dimensional vector Γ(2) = [γ1(2); γ2(2); · · · ; γL(2)] as the second-layer representation of V. [sent-120, score-0.154] [sent-122, score-0.074]
73 Then, we concatenate them into a single vector Γ = [Γ(1); Γ(2)], which is the proposed two-layer representation of V. [sent-124, score-0.074]
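Continuing the same hedged sketch, the second layer can then be assembled from the acton responses and concatenated with Γ(1); here W stacks the K learned acton models w_k as rows, rho is the saturation parameter, and pooling the VOI responses with the same subvolume masks (keyed, e.g., on VOI centers) is an illustrative choice:

```python
def acton_responses(W, X, rho=1.0):
    """r_{j,k} = S(w_k^T x_j) with the sigmoid S(r) = 1 / (1 + exp(-rho * r)).

    W: (K, D) acton models; X: (D, J) VOI descriptors.  Returns the (K, J) response matrix R.
    """
    return 1.0 / (1.0 + np.exp(-rho * (W @ X)))

def two_layer_representation(codes, stip_locations, subvolumes, W, X, voi_centers, rho=1.0):
    """Build Gamma = [Gamma^(1); Gamma^(2)] for one video clip (hedged sketch)."""
    gamma1 = first_layer_representation(codes, stip_locations, subvolumes)
    R = acton_responses(W, X, rho)                                       # (K, J)
    gamma2 = np.concatenate(
        [op_max(R, inside(voi_centers)) for inside in subvolumes])       # (K*L)-dim Gamma^(2)
    return np.concatenate([gamma1, gamma2])                              # fed to a linear classifier
```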
74 Finally, we can apply any off-the-shelf classifier on Γ for action recognition. [sent-125, score-0.227]
75 Maximum margin multi-channel multiple instance learning (M4IL) for weakly-supervised actons: In this paper, the mid-level intermediate concepts (i.e., actons) for the second-layer representation are not predefined but are learned in a data-driven manner. [sent-127, score-0.883] [sent-129, score-0.074]
77 Although the acton labels of VOIs are unknown during training, the class labels of whole video clips are available and can provide informative cues for weakly-supervised learning of actons on VOIs. [sent-130, score-1.181]
78 This inherently coincides with the assumption of multiple instance learning [1, 34]. [sent-131, score-0.066]
79 In this section, we formally introduce the proposed M4IL algorithm for learning actons. [sent-132, score-0.024]
80 Furthermore, we assume there are K underlying channels (modalities), each of which corresponds to a candidate acton model wk (k ∈ C, C = {1, 2, · · · , K}), for explaining the instances in positive bags. [sent-136, score-0.511]
81 Based on the maximum margin principle [1, 34], we present a method based on the M4IL formulation for weakly-supervised actons, posed in Eq. (1) as a max-margin optimization over the acton models wk and the slack variables ξi. [sent-137, score-0.044]
82 In Eq. (1), the first set of constraints corresponds to a multiple-instance-based margin, i.e., yi maxj∈Bi maxk∈C wkT xi,j, for each bag Xi: a positive bag should contain at least one instance that is explained well by one of the K actons, while a negative bag should preferably contain no instance belonging to any sub-modality. [sent-160, score-0.044] [sent-162, score-0.94]
84 In effect, this constraint generalizes MI-SVM [1] with a multiple-channel assumption over the VOIs in a video. [sent-165, score-0.029]
85 The difference between Eq. (1) and [34] for this constraint lies in that we only consider the instances regarded as positive by the actons to contribute to the margin between different sub-modalities. [sent-168, score-0.765]
86 As in [30, 34], the third set of constraints enforces class balance to avoid the trivial solution in which all instances are assigned to one channel. [sent-169, score-0.056]
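To illustrate only the multiple-instance part of these constraints that is explicit in the text, here is a hedged sketch of the multi-channel bag score max_j max_k w_k^T x_{i,j} and its hinge slack; the function names are hypothetical and this is not the full Eq. (1) objective:

```python
def bag_score(W, X_bag):
    """Multi-channel multiple-instance score of a bag: max_j max_k w_k^T x_{i,j}.

    W: (K, D) acton models; X_bag: (D, J_i) instance (VOI) descriptors of bag i.
    Also returns the maximizing instance/channel pair, i.e. the latent assignment.
    """
    scores = W @ X_bag                              # (K, J_i): channel k on instance j
    k_star, j_star = np.unravel_index(np.argmax(scores), scores.shape)
    return scores[k_star, j_star], j_star, k_star

def mi_hinge_slack(W, X_bag, y):
    """Slack of the MI-SVM-style soft-margin constraint y * bag_score >= 1 - xi."""
    s, _, _ = bag_score(W, X_bag)
    return max(0.0, 1.0 - y * s)
```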
87 Constraint on positive bags in Eq. (2): ∀ yi = +1, (K/(K−1)) maxj∈Bi+ [ maxzi,j∈C WTψ(xi,j, zi,j) − (1/K) · · · ] ≥ 1 − ξi. [sent-183, score-0.043]
88 In Eq. (2), we observe that the first and second sets of soft-margin constraints for positive bags are not convex. [sent-191, score-0.077]
89 In the t-th iteration of CCCP, we replace the left-hand terms of the constraints on positive bags by their first-order Taylor expansions at Wt−1. [sent-197, score-0.103]
90 For the left-hand term of the first constraint on a positive bag Xi, let f(W, Xi) = max(j,zi,j)∈Bi×C WTψ(xi,j, zi,j); its subgradient at Wt−1 can be computed as ∇f(Wt−1, Xi) = ∂f(W, Xi)/∂W |W=Wt−1 = ψ(xi,ĵ, ẑi,ĵ), (3) where (ĵ, ẑi,ĵ) = argmax(j,zi,j)∈Bi×C Wt−1T ψ(xi,j, zi,j). [sent-198, score-0.142]
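A minimal sketch of this subgradient step follows; it assumes ψ(x, z) is the standard multi-class joint feature map that places x in the z-th block of the stacked parameter vector (a layout the extracted text does not spell out), and the function names are hypothetical:

```python
def psi(x, z, K):
    """Joint feature map: place x in the z-th of K blocks (assumed block layout)."""
    D = x.shape[0]
    phi = np.zeros(K * D)
    phi[z * D:(z + 1) * D] = x
    return phi

def cccp_subgradient(W_prev, X_bag):
    """grad f(W_{t-1}, X_i) = psi(x_{i,j_hat}, z_hat), Eq. (3), where (j_hat, z_hat)
    maximizes W_{t-1}^T psi(x_{i,j}, z) over instances j in B_i and channels z in C."""
    K = W_prev.shape[0]
    scores = W_prev @ X_bag                                   # (K, J_i): entry (z, j) = w_z^T x_{i,j}
    z_hat, j_hat = np.unravel_index(np.argmax(scores), scores.shape)
    return psi(X_bag[:, j_hat], z_hat, K), j_hat, z_hat
```

At iteration t, replacing the left-hand term f(W, Xi) by its first-order expansion f(Wt−1, Xi) + ⟨∇f(Wt−1, Xi), W − Wt−1⟩ makes the positive-bag constraints linear in W, so each CCCP subproblem becomes convex.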
wordName wordTfidf (topN-words)
[('actons', 0.657), ('acton', 0.344), ('vois', 0.333), ('action', 0.227), ('abxi', 0.156), ('voi', 0.156), ('cccp', 0.144), ('opmax', 0.125), ('wk', 0.103), ('stips', 0.092), ('concepts', 0.088), ('layer', 0.085), ('representation', 0.074), ('bag', 0.073), ('subvolume', 0.073), ('besides', 0.071), ('cwt', 0.063), ('yimj', 0.063), ('responses', 0.061), ('bof', 0.06), ('pooling', 0.055), ('stip', 0.054), ('layered', 0.053), ('video', 0.051), ('weaklysupervised', 0.048), ('bags', 0.046), ('margin', 0.044), ('wt', 0.043), ('yi', 0.043), ('clip', 0.043), ('instance', 0.042), ('codes', 0.04), ('informative', 0.039), ('subgradient', 0.038), ('demands', 0.036), ('stack', 0.036), ('mil', 0.036), ('located', 0.033), ('intensive', 0.033), ('youtube', 0.033), ('bi', 0.033), ('instances', 0.033), ('spm', 0.031), ('positive', 0.031), ('kk', 0.029), ('generalizes', 0.029), ('resultant', 0.029), ('response', 0.029), ('intermediate', 0.028), ('exploding', 0.028), ('audience', 0.028), ('baoyuan', 0.028), ('bes', 0.028), ('inspire', 0.028), ('mls', 0.028), ('spatialtemporal', 0.028), ('sro', 0.028), ('tbheel', 0.028), ('tkhe', 0.028), ('twolayer', 0.028), ('xiaokang', 0.028), ('xkyang', 0.028), ('zhuowen', 0.028), ('clips', 0.026), ('xj', 0.026), ('dictionary', 0.026), ('pyramid', 0.026), ('expansions', 0.026), ('gmail', 0.026), ('aendd', 0.026), ('hye', 0.026), ('ita', 0.026), ('deep', 0.024), ('learning', 0.024), ('shanghai', 0.024), ('jiao', 0.024), ('bte', 0.024), ('sid', 0.024), ('thke', 0.024), ('compact', 0.024), ('growing', 0.024), ('descriptor', 0.023), ('discriminative', 0.023), ('invariance', 0.023), ('abbreviated', 0.023), ('asks', 0.023), ('iesd', 0.023), ('semantical', 0.023), ('class', 0.023), ('xi', 0.022), ('efforts', 0.022), ('leap', 0.022), ('transmission', 0.022), ('facilitated', 0.022), ('someone', 0.022), ('stp', 0.022), ('tel', 0.022), ('built', 0.022), ('layers', 0.022), ('tu', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 38 iccv-2013-Action Recognition with Actons
Author: Jun Zhu, Baoyuan Wang, Xiaokang Yang, Wenjun Zhang, Zhuowen Tu
Abstract: With the improved accessibility to an exploding amount of video data and growing demands in a wide range of video analysis applications, video-based action recognition/classification becomes an increasingly important task in computer vision. In this paper, we propose a two-layer structure for action recognition to automatically exploit a mid-level “acton” representation. The weakly-supervised actons are learned via a new max-margin multi-channel multiple instance learning framework, which can capture multiple mid-level action concepts simultaneously. The learned actons (with no requirement for detailed manual annotations) observe the properties of being compact, informative, discriminative, and easy to scale. The experimental results demonstrate the effectiveness of applying the learned actons in our two-layer structure, and show the state-of-the-art recognition performance on two challenging action datasets, i.e., Youtube and HMDB51.
2 0.15698445 86 iccv-2013-Concurrent Action Detection with Structural Prediction
Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu
Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence only have one action class label and different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. An detected action interval is determined both by the unary local detector and the relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our new collected concurrent action dataset demonstrate the strength of our method.
3 0.14125092 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
Author: Jiang Wang, Ying Wu
Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.
4 0.13939454 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
Author: Mihai Zanfir, Marius Leordeanu, Cristian Sminchisescu
Abstract: Human action recognition under low observational latency is receiving a growing interest in computer vision due to rapidly developing technologies in human-robot interaction, computer gaming and surveillance. In this paper we propose a fast, simple, yet powerful non-parametric Moving Pose (MP) framework for low-latency human action and activity recognition. Central to our methodology is a moving pose descriptor that considers both pose information as well as differential quantities (speed and acceleration) of the human body joints within a short time window around the current frame. The proposed descriptor is used in conjunction with a modified kNN classifier that considers both the temporal location of a particular frame within the action sequence as well as the discrimination power of its moving pose descriptor compared to other frames in the training set. The resulting method is non-parametric and enables low-latency recognition, one-shot learning, and action detection in difficult unsegmented sequences. Moreover, the framework is real-time, scalable, and outperforms more sophisticated approaches on challenging benchmarks like MSR-Action3D or MSR-DailyActivities3D.
5 0.13704869 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos
Author: Sunil Bandla, Kristen Grauman
Abstract: Collecting and annotating videos of realistic human actions is tedious, yet critical for training action recognition systems. We propose a method to actively request the most useful video annotations among a large set of unlabeled videos. Predicting the utility of annotating unlabeled video is not trivial, since any given clip may contain multiple actions of interest, and it need not be trimmed to temporal regions of interest. To deal with this problem, we propose a detection-based active learner to train action category models. We develop a voting-based framework to localize likely intervals of interest in an unlabeled clip, and use them to estimate the total reduction in uncertainty that annotating that clip would yield. On three datasets, we show our approach can learn accurate action detectors more efficiently than alternative active learning strategies that fail to accommodate the “untrimmed” nature of real video data.
6 0.13188013 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
7 0.12079868 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition
8 0.1175417 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
9 0.11524025 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction
10 0.11048834 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition
11 0.10951221 396 iccv-2013-Space-Time Robust Representation for Action Recognition
12 0.1090584 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
13 0.10190244 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
14 0.10073274 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding
15 0.09577363 39 iccv-2013-Action Recognition with Improved Trajectories
16 0.092565186 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps
17 0.092272699 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition
18 0.088395961 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition
19 0.083135679 166 iccv-2013-Finding Actors and Actions in Movies
20 0.079279616 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition
topicId topicWeight
[(0, 0.14), (1, 0.158), (2, 0.06), (3, 0.123), (4, -0.033), (5, -0.001), (6, 0.048), (7, -0.066), (8, -0.02), (9, -0.005), (10, 0.009), (11, 0.05), (12, 0.003), (13, -0.044), (14, 0.065), (15, -0.026), (16, 0.004), (17, 0.023), (18, 0.02), (19, 0.003), (20, -0.011), (21, 0.01), (22, -0.049), (23, -0.053), (24, -0.058), (25, 0.006), (26, 0.0), (27, 0.019), (28, 0.001), (29, 0.063), (30, 0.002), (31, -0.015), (32, -0.001), (33, 0.01), (34, -0.007), (35, 0.033), (36, -0.025), (37, -0.031), (38, -0.007), (39, 0.02), (40, -0.068), (41, -0.043), (42, -0.025), (43, -0.016), (44, 0.012), (45, -0.0), (46, -0.001), (47, -0.047), (48, -0.021), (49, -0.052)]
simIndex simValue paperId paperTitle
same-paper 1 0.94385755 38 iccv-2013-Action Recognition with Actons
Author: Jun Zhu, Baoyuan Wang, Xiaokang Yang, Wenjun Zhang, Zhuowen Tu
Abstract: With the improved accessibility to an exploding amount of video data and growing demands in a wide range of video analysis applications, video-based action recognition/classification becomes an increasingly important task in computer vision. In this paper, we propose a two-layer structure for action recognition to automatically exploit a mid-level “acton” representation. The weakly-supervised actons are learned via a new max-margin multi-channel multiple instance learning framework, which can capture multiple mid-level action concepts simultaneously. The learned actons (with no requirement for detailed manual annotations) observe the properties of being compact, informative, discriminative, and easy to scale. The experimental results demonstrate the effectiveness of applying the learned actons in our two-layer structure, and show the state-of-the-art recognition performance on two challenging action datasets, i.e., Youtube and HMDB51.
2 0.8389833 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
Author: Jiang Wang, Ying Wu
Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.
3 0.83764726 86 iccv-2013-Concurrent Action Detection with Structural Prediction
Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu
Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence only have one action class label and different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. An detected action interval is determined both by the unary local detector and the relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our new collected concurrent action dataset demonstrate the strength of our method.
4 0.81398726 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding
Author: Weiyu Zhang, Menglong Zhu, Konstantinos G. Derpanis
Abstract: This paper presents a novel approach for analyzing human actions in non-scripted, unconstrained video settings based on volumetric, x-y-t, patch classifiers, termed actemes. Unlike previous action-related work, the discovery of patch classifiers is posed as a strongly-supervised process. Specifically, keypoint labels (e.g., position) across spacetime are used in a data-driven training process to discover patches that are highly clustered in the spacetime keypoint configuration space. To support this process, a new human action dataset consisting of challenging consumer videos is introduced, where notably the action label, the 2D position of a set of keypoints and their visibilities are provided for each video frame. On a novel input video, each acteme is used in a sliding volume scheme to yield a set of sparse, non-overlapping detections. These detections provide the intermediate substrate for segmenting out the action. For action classification, the proposed representation shows significant improvement over state-of-the-art low-level features, while providing spatiotemporal localiza- tion as additional output. This output sheds further light into detailed action understanding.
5 0.80391324 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition
Author: Behrooz Mahasseni, Sinisa Todorovic
Abstract: This paper presents an approach to view-invariant action recognition, where human poses and motions exhibit large variations across different camera viewpoints. When each viewpoint of a given set of action classes is specified as a learning task then multitask learning appears suitable for achieving view invariance in recognition. We extend the standard multitask learning to allow identifying: (1) latent groupings of action views (i.e., tasks), and (2) discriminative action parts, along with joint learning of all tasks. This is because it seems reasonable to expect that certain distinct views are more correlated than some others, and thus identifying correlated views could improve recognition. Also, part-based modeling is expected to improve robustness against self-occlusion when actors are imaged from different views. Results on the benchmark datasets show that we outperform standard multitask learning by 21.9%, and the state-of-the-art alternatives by 4.5–6%.
6 0.7992183 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
7 0.76232171 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos
8 0.75809455 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
9 0.74749893 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
10 0.70204961 166 iccv-2013-Finding Actors and Actions in Movies
11 0.69416028 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition
12 0.66404057 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition
13 0.65232944 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition
14 0.64230365 396 iccv-2013-Space-Time Robust Representation for Action Recognition
15 0.61490101 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition
16 0.60762775 260 iccv-2013-Manipulation Pattern Discovery: A Nonparametric Bayesian Approach
17 0.59739822 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
18 0.59204155 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction
19 0.59010792 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps
20 0.57470316 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition
topicId topicWeight
[(2, 0.073), (4, 0.017), (7, 0.02), (26, 0.066), (31, 0.442), (35, 0.011), (42, 0.076), (48, 0.01), (64, 0.071), (73, 0.014), (89, 0.107)]
simIndex simValue paperId paperTitle
1 0.88639843 345 iccv-2013-Recognizing Text with Perspective Distortion in Natural Scenes
Author: Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, Chew Lim Tan
Abstract: This paper presents an approach to text recognition in natural scene images. Unlike most existing works which assume that texts are horizontal and frontal parallel to the image plane, our method is able to recognize perspective texts of arbitrary orientations. For individual character recognition, we adopt a bag-of-keypoints approach, in which Scale Invariant Feature Transform (SIFT) descriptors are extracted densely and quantized using a pre-trained vocabulary. Following [1, 2], the context information is utilized through lexicons. We formulate word recognition as finding the optimal alignment between the set of characters and the list of lexicon words. Furthermore, we introduce a new dataset called StreetViewText-Perspective, which contains texts in street images with a great variety of viewpoints. Experimental results on public datasets and the proposed dataset show that our method significantly outperforms the state-of-the-art on perspective texts of arbitrary orientations.
2 0.84408951 408 iccv-2013-Super-resolution via Transform-Invariant Group-Sparse Regularization
Author: Carlos Fernandez-Granda, Emmanuel J. Candès
Abstract: We present a framework to super-resolve planar regions found in urban scenes and other man-made environments by taking into account their 3D geometry. Such regions have highly structured straight edges, but this prior is challenging to exploit due to deformations induced by the projection onto the imaging plane. Our method factors out such deformations by using recently developed tools based on convex optimization to learn a transform that maps the image to a domain where its gradient has a simple group-sparse structure. This allows to obtain a novel convex regularizer that enforces global consistency constraints between the edges of the image. Computational experiments with real images show that this data-driven approach to the design of regularizers promoting transform-invariant group sparsity is very effective at high super-resolution factors. We view our approach as complementary to most recent superresolution methods, which tend to focus on hallucinating high-frequency textures.
3 0.82434416 72 iccv-2013-Characterizing Layouts of Outdoor Scenes Using Spatial Topic Processes
Author: Dahua Lin, Jianxiong Xiao
Abstract: In this paper, we develop a generative model to describe the layouts of outdoor scenes, i.e., the spatial configuration of regions. Specifically, the layout of an image is represented as a composite of regions, each associated with a semantic topic. At the heart of this model is a novel stochastic process called Spatial Topic Process, which generates a spatial map of topics from a set of coupled Gaussian processes, thus allowing the distributions of topics to vary continuously across the image plane. A key aspect that distinguishes this model from previous ones consists in its capability of capturing dependencies across both locations and topics while allowing substantial variations in the layouts.
same-paper 4 0.81742322 38 iccv-2013-Action Recognition with Actons
Author: Jun Zhu, Baoyuan Wang, Xiaokang Yang, Wenjun Zhang, Zhuowen Tu
Abstract: With the improved accessibility to an exploding amount of video data and growing demands in a wide range of video analysis applications, video-based action recognition/classification becomes an increasingly important task in computer vision. In this paper, we propose a two-layer structure for action recognition to automatically exploit a mid-level “acton” representation. The weakly-supervised actons are learned via a new max-margin multi-channel multiple instance learning framework, which can capture multiple mid-level action concepts simultaneously. The learned actons (with no requirement for detailed manual annotations) observe the properties of being compact, informative, discriminative, and easy to scale. The experimental results demonstrate the effectiveness of applying the learned actons in our two-layer structure, and show the state-of-the-art recognition performance on two challenging action datasets, i.e., Youtube and HMDB51.
5 0.79623413 357 iccv-2013-Robust Matrix Factorization with Unknown Noise
Author: Deyu Meng, Fernando De_La_Torre
Abstract: Many problems in computer vision can be posed as recovering a low-dimensional subspace from high-dimensional visual data. Factorization approaches to low-rank subspace estimation minimize a loss function between an observed measurement matrix and a bilinear factorization. Most popular loss functions include the L2 and L1 losses. L2 is optimal for Gaussian noise, while L1 is for Laplacian distributed noise. However, real data is often corrupted by an unknown noise distribution, which is unlikely to be purely Gaussian or Laplacian. To address this problem, this paper proposes a low-rank matrix factorization problem with a Mixture of Gaussians (MoG) noise model. The MoG model is a universal approximator for any continuous distribution, and hence is able to model a wider range of noise distributions. The parameters of the MoG model can be estimated with a maximum likelihood method, while the subspace is computed with standard approaches. We illustrate the benefits of our approach in extensive synthetic and real-world experiments including structure from motion, face modeling and background subtraction.
6 0.70305395 275 iccv-2013-Motion-Aware KNN Laplacian for Video Matting
7 0.6634903 269 iccv-2013-Modeling Occlusion by Discriminative AND-OR Structures
8 0.62812626 73 iccv-2013-Class-Specific Simplex-Latent Dirichlet Allocation for Image Classification
9 0.60548633 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection
10 0.58859479 137 iccv-2013-Efficient Salient Region Detection with Soft Image Abstraction
11 0.57659495 210 iccv-2013-Image Retrieval Using Textual Cues
12 0.57456309 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
13 0.5678916 180 iccv-2013-From Where and How to What We See
14 0.55393904 173 iccv-2013-Fluttering Pattern Generation Using Modified Legendre Sequence for Coded Exposure Imaging
15 0.55335629 415 iccv-2013-Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors
16 0.53092629 19 iccv-2013-A Learning-Based Approach to Reduce JPEG Artifacts in Image Matting
17 0.52596706 192 iccv-2013-Handwritten Word Spotting with Corrected Attributes
18 0.52560437 328 iccv-2013-Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation
19 0.52537811 287 iccv-2013-Neighbor-to-Neighbor Search for Fast Coding of Feature Vectors
20 0.51936698 156 iccv-2013-Fast Direct Super-Resolution by Simple Functions