iccv iccv2013 iccv2013-244 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jingjing Zheng, Zhuolin Jiang
Abstract: We present an approach to jointly learn a set of view-specific dictionaries and a common dictionary for cross-view action recognition. The set of view-specific dictionaries is learned for specific views while the common dictionary is shared across different views. Our approach represents videos in each view using both the corresponding view-specific dictionary and the common dictionary. More importantly, it encourages the set of videos taken from different views of the same action to have similar sparse representations. In this way, we can align view-specific features in the sparse feature spaces spanned by the view-specific dictionary set and transfer the view-shared features in the sparse feature space spanned by the common dictionary. Meanwhile, the incoherence between the common dictionary and the view-specific dictionary set enables us to exploit the discrimination information encoded in view-specific features and view-shared features separately. In addition, the learned common dictionary not only has the capability to represent actions from unseen views, but also makes our approach effective in a semi-supervised setting where no correspondence videos exist and only a few labels exist in the target view. Extensive experiments using the multi-view IXMAS dataset demonstrate that our approach outperforms many recent approaches for cross-view action recognition.
Reference: text
sentIndex sentText sentNum sentScore
1 We present an approach to jointly learn a set of view-specific dictionaries and a common dictionary for cross-view action recognition. [sent-3, score-1.225]
2 The set of view-specific dictionaries is learned for specific views while the common dictionary is shared across different views. [sent-4, score-0.993]
3 Our approach represents videos in each view using both the corresponding view-specific dictionary and the common dictionary. [sent-5, score-0.798]
4 More importantly, it encourages the set of videos taken from different views of the same action to have similar sparse representations. [sent-6, score-1.016]
5 In this way, we can align view-specific features in the sparse feature spaces spanned by the view-specific dictionary set and transfer the view-shared features in the sparse feature space spanned by the common dictionary. [sent-7, score-0.868]
6 Meanwhile, the incoherence between the common dictionary and the view-specific dictionary set enables us to exploit the discrimination information encoded in view-specific features and view-shared features separately. [sent-8, score-1.087]
7 In addition, the learned common dictionary not only has the capability to represent actions from unseen views, but also makes our approach effective in a semi-supervised setting where no correspondence videos exist and only a few labels exist in the target view. [sent-9, score-1.148]
8 Extensive experiments using the multi-view IXMAS dataset demonstrate that our approach outperforms many recent approaches for cross-view action recognition. [sent-10, score-0.363]
9 Joint learning of a view-specific dictionary pair and a common dictionary. [sent-21, score-0.467]
10 We not only learn a common dictionary D to model view-shared features of corresponding videos in both views, but also learn two view-specific dictionaries Ds and Dt that are incoherent to D to align the view-specific features. [sent-22, score-1.054]
11 This is because the same action looks quite different from different viewpoints as shown in Figure 1. [sent-25, score-0.396]
12 Thus action models learned from one view become less discriminative for recognizing actions in a much different view. [sent-26, score-0.668]
13 A very fruitful line of work for cross-view action recognition based on transfer learning is to construct the mappings or connections between different views, by using videos taken from different views of the same action [6, 7, 20, 8]. [sent-27, score-1.365]
14 [6] exploited the frame-to-frame correspondence in pairs of videos taken from two views of the same action by transferring the split-based features of video frames in the source view to the corresponding video frames in the target view. [sent-28, score-1.372]
15 [20] proposed to exploit the correspondence between the view-dependent codebooks constructed by k-means clustering on videos in each view. [sent-29, score-0.392]
16 However, the frame-to-frame correspondence [6] is computationally expensive, and the codebook-to-codebook correspondence [20] is not accurate enough to guarantee that a pair of videos observed in the source and target views will have similar feature representations. [sent-30, score-0.956]
17 In order to overcome these drawbacks, we propose a dictionary learning framework to exploit the video-to-video correspondence by encouraging pairs of videos taken in two views to have similar sparse representations. [sent-31, score-1.191]
18 Our approach not only learns a common dictionary shared by different views to model the view-shared features, but also learns a dictionary pair corresponding to the source and target views to model and align view-specific features in the two views. [sent-33, score-1.649]
19 Both the common dictionary and the corresponding view-specific dictionary are used to represent videos in each view. [sent-34, score-1.111]
20 Specifically, we transfer the positions of non-zero elements (i.e., the indices of selected dictionary items) in sparse codes of videos from the source view to sparse codes of the corresponding videos from the target view. [sent-37, score-1.544]
21 In this way, videos across different views of the same action tend to have similar sparse representations. [sent-39, score-0.962]
22 Note that our approach enforces the common dictionary to be incoherent with the view-specific dictionaries, so that the discrimination information encoded in view-specific features and view-shared features is exploited separately and the view-specific dictionaries become more compact. [sent-40, score-0.871]
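The incoherence idea can be checked numerically with a short sketch (Python/NumPy, toy sizes). The exact penalty form ||DᵀDv||_F² is an assumption, since the text above only names the term, and `incoherence` is a hypothetical helper rather than code from the paper; the point is simply that a view-specific dictionary sharing atoms with the common dictionary scores much higher than one with independent atoms, which is what such a penalty discourages.

```python
import numpy as np

def incoherence(D, Dv):
    """Cross-coherence between the common dictionary D and a view-specific
    dictionary Dv, measured as ||D^T Dv||_F^2 over unit-norm atoms."""
    D = D / np.linalg.norm(D, axis=0)
    Dv = Dv / np.linalg.norm(Dv, axis=0)
    return float(np.linalg.norm(D.T @ Dv, "fro") ** 2)

rng = np.random.default_rng(0)
D = rng.standard_normal((100, 20))                 # toy common dictionary
Dv_indep = rng.standard_normal((100, 30))          # independent view-specific atoms
Dv_shared = np.hstack([D[:, :10],                  # 10 atoms copied from D
                       rng.standard_normal((100, 20))])

print(incoherence(D, Dv_indep))    # low: random atoms are nearly orthogonal
print(incoherence(D, Dv_shared))   # much higher: duplicated atoms are penalized
```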
23 Actions are categorized into two types: shared actions observed in both views and orphan actions that are only observed in the source view. [sent-41, score-0.813]
24 Note that only pairs of videos taken from two views of the shared actions are used for dictionary learning. [sent-42, score-1.189]
25 In addition, we consider two scenarios for the shared actions: (1) shared actions in both views are unlabeled; (2) shared actions in both views are labeled. [sent-43, score-0.58]
26 Contributions The main contributions of this paper are: • We propose to simultaneously learn a set of view-specific dictionaries to exploit the video-level correspondence across views and a common dictionary to model the common patterns shared by different views. [sent-48, score-1.362]
27 • The incoherence between the common dictionary and the view-specific dictionaries enables our approach to drive the shared pattern to the common dictionary and focus on exploiting the discriminative correspondence information encoded by the view-specific dictionaries. [sent-49, score-1.494]
28 • With the separation of the common dictionary, our approach not only learns more compact view-specific dictionaries, but also bridges the gap between the sparse representations of correspondence videos taken from different views of the same action in a more flexible manner. [sent-50, score-1.324]
29 • Our framework is general and can be applied to cross-view and multi-view action recognition under both unsupervised and supervised settings. [sent-51, score-0.498]
30 Related Work Recently, several transfer learning techniques have been proposed for cross-view action recognition [6, 20, 8, 33]. [sent-53, score-0.425]
31 Even though this approach exploits the codebook-to-codebook correspondence between two views, it cannot guarantee that videos taken at different views of shared actions will have similar features. [sent-57, score-0.937]
32 [33] proposed a dictionary learning framework for cross-view action recognition with the assumption that sparse representations of videos from different views of the same action should be strictly equal. [sent-59, score-1.804]
33 [11, 10] captured the structure of temporal similarities and dissimilarities within an action sequence using a Self-Similarity Matrix. [sent-63, score-0.363]
34 [15] learned two view-specific transformations for the source and target views, and then generated a sequence of linear transformations of action descriptors as the virtual views to connect two views. [sent-65, score-0.784]
35 Another fruitful line of work for cross-view action recognition concentrates on using 3D image data. [sent-67, score-0.424]
36 [31] developed a 4D view-invariant action feature extraction to encode the shape and motion information of actors observed from multiple views. [sent-69, score-0.438]
37 Unsupervised Learning In the unsupervised setting, our goal is to find view-invariant feature representations by making use of the correspondence between videos of the shared actions taken from different views. [sent-75, score-0.798]
38 Let Yv = [y1v, . . . , yNv] ∈ Rd×N denote the d-dimensional feature representations of N videos of the shared actions taken in the v-th view. [sent-79, score-0.603]
39 yi = [yi1, . . . , yiV] are the V action videos of the i-th shared action taken from V views, which are referred to as correspondence videos. [sent-83, score-1.261]
40 On one hand, we would like to learn a common dictionary D ∈ Rd×J of size J, shared by different views, to represent videos from all views. [sent-84, score-1.078]
41 The first two terms are the reconstruction errors of videos from different views using D only or using both D and Dv. [sent-101, score-0.534]
42 Therefore the testing videos taken from different views of the same action will be encouraged to have similar sparse representations when using the learned D and Dv. [sent-107, score-1.069]
43 The last term regularizes the common dictionary to be incoherent with the view-specific dictionaries. [sent-109, score-0.521]
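Putting the terms described above together, here is a hedged sketch of how the unsupervised cost for one set of correspondence videos could be evaluated. The weights lam1/lam2, the ||DᵀDv||_F² incoherence form, the row-wise L2,1 norm over stacked view-specific codes, and equal-size view-specific dictionaries are all assumptions, since the text above describes the terms only in words.

```python
import numpy as np

def unsup_objective(Y, Z, X, D, Dvs, lam1=1.0, lam2=1.0):
    """Sketch of the unsupervised cost for ONE action's correspondence videos.

    Y   : list of V feature vectors y^v (one per view)
    Z   : list of V codes over the common dictionary D
    X   : list of V codes over the view-specific dictionaries Dvs[v]
          (assumed to all have the same size so they can be stacked)
    """
    V = len(Y)
    # Reconstruction with D only, and with both D and the view-specific Dv.
    rec_common = sum(np.sum((Y[v] - D @ Z[v]) ** 2) for v in range(V))
    rec_both = sum(np.sum((Y[v] - D @ Z[v] - Dvs[v] @ X[v]) ** 2) for v in range(V))
    # Row-wise L2,1 norm of the stacked view-specific codes: small when the
    # V codes of the correspondence videos select the same dictionary items.
    l21 = np.sum(np.linalg.norm(np.stack(X, axis=1), axis=1))
    # Incoherence between the common and each view-specific dictionary.
    incoh = sum(np.linalg.norm(D.T @ Dv, "fro") ** 2 for Dv in Dvs)
    return rec_common + rec_both + lam1 * l21 + lam2 * incoh
```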
44 Supervised Learning Given the action categories of the correspondence videos, we can learn a discriminative common dictionary and discriminative view-specific dictionaries by leveraging the category information. [sent-113, score-1.265]
45 We partition the dictionary items in each dictionary into disjoint subsets and associate each subset with one specific class label. [sent-114, score-0.982]
46 For videos from action class k, we aim to represent them using the same subset of dictionary items associated with class k. [sent-115, score-1.23]
47 For videos from different classes, we represent them using disjoint subsets of dictionary items. [sent-116, score-0.644]
48 This is supported by the intuition that action videos from the same class tend to have similar features and each action video can be well represented by other videos from the same class [30]. [sent-117, score-1.286]
49 Assume there are K shared action classes, and D = [D1, . . . , DK].
50 If a video yiv from the v-th view belongs to class k, then qik and qivk are one and all other entries in qi and qiv are zero. [sent-145, score-0.514]
51 The discriminative sparse-code error terms ||qi − Axiv||22 and ||qiv − Bziv||22 encourage the dictionary items associated with class k to be selected to reconstruct videos from class k. [sent-147, score-0.955]
52 Note that the L2,1-norm regularization only regularizes the relationship between the sparse codes of correspondence videos, but cannot regularize the relationship between the sparse codes of videos from the same action class in each view. [sent-148, score-1.178]
53 In other words, our approach not only encourages the videos taken from different views of the same action to have similar sparse representations, but also encourages videos from the same class in each view to have similar sparse representations. [sent-150, score-1.484]
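To make the label-consistency construction concrete, a minimal sketch follows: `class_indicator` builds the 0/1 target q from a class label as described above, and `discriminative_error` evaluates ||q − Ax||²; the same form applies to B and the view-specific codes z. The block layout, toy sizes, and identity initialization of A are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def class_indicator(label, items_per_class, num_classes):
    """0/1 target q: ones on the dictionary items associated with `label`,
    zeros elsewhere (each class owns a disjoint subset of items)."""
    q = np.zeros(items_per_class * num_classes)
    q[label * items_per_class:(label + 1) * items_per_class] = 1.0
    return q

def discriminative_error(q, A, x):
    """Discriminative sparse-code error ||q - A x||_2^2."""
    r = q - A @ x
    return float(r @ r)

# Toy usage: 3 classes with 4 items each; the code x is supported on the
# items of class 1, so the error against q for class 1 is small.
K, m = 3, 4
q = class_indicator(1, m, K)
x = np.zeros(K * m); x[m:2 * m] = 0.8
A = np.eye(K * m)                      # A is a learned linear transform in the paper
print(discriminative_error(q, A, x))
```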
54 This optimization problem is divided into three subproblems: (1) computing sparse codes with fixed Dv , D and A, B; (2) updating Dv , D with fixed sparse codes and A, B; (3) updating A, B with fixed Dv , D and sparse codes. [sent-154, score-0.562]
55 Computing Sparse Codes Given fixed Dv, D and A, B, we solve the sparse coding problem of the correspondence videos set by set, and (2) reduces to a sparse coding subproblem for each correspondence set. [sent-157, score-0.498]
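The three-step alternation can be sketched for a single view as below. Each subproblem solver here is a deliberately simple stand-in (a soft-thresholded least-squares coding step, a gradient dictionary update, and a ridge update of A); the L2,1 cross-view term and the incoherence penalty are omitted, and nothing in this block reproduces the paper's actual solver.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def alternate(Y, D, Dv, A, Q, iters=10, lam=0.1, lr=1e-2):
    """Single-view toy version of the three-step alternation.

    Y: (d, N) video features, Q: (J, N) 0/1 label-indicator targets.
    Each step is a simple stand-in, not the paper's solver.
    """
    J = D.shape[1]
    for _ in range(iters):
        # (1) sparse codes with D, Dv, A fixed (least squares + soft threshold)
        B = np.hstack([D, Dv])
        C = soft_threshold(np.linalg.lstsq(B, Y, rcond=None)[0], lam)
        Z, X = C[:J], C[J:]
        # (2) dictionaries with codes and A fixed (gradient step, renormalize atoms)
        R = Y - D @ Z - Dv @ X
        D = D + lr * (R @ Z.T)
        Dv = Dv + lr * (R @ X.T)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-8)
        Dv /= np.maximum(np.linalg.norm(Dv, axis=0), 1e-8)
        # (3) transform A with codes fixed (ridge regression so that Q ≈ A Z)
        A = Q @ Z.T @ np.linalg.inv(Z @ Z.T + 1e-3 * np.eye(J))
    return D, Dv, A
```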
56 Experiments We evaluated our approach for both cross-view and multi-view action recognition on the IXMAS multi-view dataset [28]. [sent-246, score-0.394]
57 This dataset contains 11 actions, each performed three times by ten actors and recorded from four side views and one top view. [sent-247, score-0.494]
58 We first detect up to 200 interest points from each action video and then extract 100-dimensional gradient-based descriptors around these interest points via PCA. [sent-250, score-0.363]
59 Similarly, this codebook is used to encode shape-flow descriptors, and each action video is represented as a histogram over the codebook. [sent-267, score-0.436]
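The bag-of-words encoding described here can be sketched as follows; the codebook size, scikit-learn k-means, and L1 normalization are illustrative choices rather than details taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def bow_histograms(train_descriptors, video_descriptors, k=1000):
    """Quantize local descriptors with a k-means codebook and represent each
    video as an L1-normalized k-bin histogram of visual words.

    train_descriptors : (M, 100) descriptors pooled over training videos
    video_descriptors : list of (m_i, 100) arrays, one per video
    """
    codebook = KMeans(n_clusters=k, n_init=4, random_state=0).fit(train_descriptors)
    hists = []
    for desc in video_descriptors:
        words = codebook.predict(desc)                   # visual-word index per descriptor
        h = np.bincount(words, minlength=k).astype(float)
        hists.append(h / max(h.sum(), 1.0))
    return np.vstack(hists)
```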
60 For fair comparison with [6, 20, 15], we use three evaluation modes: (1) unsupervised correspondence mode; (2) supervised correspondence mode; (3) partially labeled mode. [sent-269, score-0.461]
61 For the first two correspondence modes, we use the leave-one-action-class-out strategy for choosing the orphan action, which means that each time we only consider one action class for testing in the target view. [sent-270, score-1.067]
62 All videos of the orphan action are excluded when learning the quantized visual words and constructing dictionaries. [sent-271, score-0.689]
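A minimal sketch of the leave-one-action-class-out split: each class in turn becomes the orphan action whose videos are withheld from codebook and dictionary learning and used only for testing. The helper name and label layout are assumptions made for illustration.

```python
import numpy as np

def leave_one_action_class_out(labels, classes):
    """Yield (orphan, train_idx, test_idx): the orphan class's videos are
    excluded from codebook/dictionary learning and used only for testing."""
    labels = np.asarray(labels)
    for orphan in classes:
        yield orphan, np.where(labels != orphan)[0], np.where(labels == orphan)[0]

# Toy usage with 3 action classes:
labels = [0, 1, 2, 0, 1, 2]
for orphan, train_idx, test_idx in leave_one_action_class_out(labels, [0, 1, 2]):
    print(orphan, train_idx, test_idx)
```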
63 The only difference between the first and the second mode is whether the category labels of the correspondence videos are available or not. [sent-272, score-0.463]
64 For the third mode, we follow [15] to consider a semi-supervised setting where a small portion of videos from the target view is labeled and no matched correspondence videos exist. [sent-273, score-0.813]
65 The second one is MIXSVM, which trains two SVMs on the source and target views and learns an optimal linear combination of them. [sent-277, score-0.421]
66 Note that the test actions from the source and target views are not seen during dictionary learning whereas the test action can be seen in the source view for classifier training in the first two evaluation modes. [sent-278, score-1.506]
67 In contrast, the test action from different views can be seen during both dictionary learning and classifier training in the third mode. [sent-279, score-1.002]
68 For visualization purposes, two action classes, "check-watch" and "waving", taken by Camera0 and Camera2 from the IXMAS dataset were selected to construct a simple cross-view dataset. [sent-287, score-0.417]
69 We extract the shape descriptor [16] for each video frame and learn a common dictionary and two view-specific dictionaries using our approach. [sent-288, score-0.689]
70 We then reconstruct a pair of frames taken from the Camera0 and Camera2 views of the action "waving" using two methods. [sent-289, score-0.718]
71 The first one is to use the common dictionary only to reconstruct the frame pair. [sent-290, score-0.494]
72 The other one is to use both the common dictionary and the view-specific dictionary for reconstruction. [sent-291, score-0.987]
73 Figure 2(b) shows the original shape feature and the reconstructed shape features of two frames of the action "waving" from the two seen views and one unseen view using the two methods mentioned above. [sent-292, score-0.76]
74 It demonstrates that the common dictionary has the ability to exploit view-shared features from different views. [sent-294, score-0.467]
75 Second, it can be observed that better reconstruction is achieved by using both the common dictionary D and view-specific dictionaries. [sent-295, score-0.508]
76 This is because the common dictionary may not reconstruct the more detailed view-specific features well such as arm poses. [sent-296, score-0.494]
77 The separation of the common dictionary enables the view-specific dictionaries to focus on exploiting and aligning view-specific features from different views. [sent-297, score-0.729]
78 Third, from the last row in Figure 2(b), we find that a good reconstruction of an action frame taken from the unseen view can be achieved by using the common dictionary only. [sent-298, score-1.048]
79 It demonstrates that the common dictionary learned from two seen views has the capability to represent videos of the same action from an unseen view. [sent-299, score-1.364]
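The reconstruction comparison behind this experiment can be reproduced in spirit with a few lines; the dictionaries and the frame feature below are random stand-ins, and least squares replaces the sparse coder, so only the qualitative ordering of the two errors is meaningful.

```python
import numpy as np

def recon_error(frame, *dicts):
    """Reconstruct `frame` with the column-stacked dictionaries and return the
    l2 reconstruction error (least squares stands in for the sparse coder)."""
    B = np.hstack(dicts)
    c, *_ = np.linalg.lstsq(B, frame, rcond=None)
    return float(np.linalg.norm(frame - B @ c))

rng = np.random.default_rng(1)
D, Ds = rng.standard_normal((60, 10)), rng.standard_normal((60, 15))  # toy dictionaries
frame = rng.standard_normal(60)                                       # one frame's shape feature

print("common D only :", recon_error(frame, D))
print("D and Ds      :", recon_error(frame, D, Ds))   # never larger; usually clearly smaller
```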
80 (a) Visualization of all dictionary atoms in D (green), Ds (red) and Dt (purple). [sent-350, score-0.395]
81 Both methods have nearly the same reconstruction performance for frames of the same action from the unseen view. [sent-411, score-0.475]
82 In addition, the separation of the common dictionary and view-specific dictionaries can enable us to learn more compact view-specific dictionaries. [sent-413, score-0.729]
83 We first learn a common dictionary D and two view-specific dictionaries {Ds , Dt} corresponding to the source and target views respectively. [sent-417, score-1.11]
84 The higher recognition accuracy obtained by our supervised setting over our unsupervised setting demonstrates that the dictionaries learned using labeled information across views are more discriminative. [sent-428, score-0.572]
85 Multi-view Action Recognition We select one camera as a target view and use all other four cameras as source views to explore the benefits of combining multiple source views. [sent-445, score-0.59]
86 Both D and the set of correspondence dictionaries Dv are learned by aligning the sparse representations of shared action videos across all views. [sent-447, score-1.196]
87 Since videos from all views are aligned into a common view-invariant sparse feature space, we do not need to differentiate the training videos from each source view in this space. [sent-448, score-1.267]
88 Furthermore, [20, 33] and our unsupervised approach only use training videos from the four source views to train a classifier, while the other approaches use all the training videos from all five views to train the classifier. [sent-452, score-1.125]
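The sentences above do not spell out the classifier step, so the following is only a plausible reading of "not differentiating training videos from each source view": codes from every source view are pooled into one training set in the shared sparse space and a single classifier is applied. The least-squares encoder and 1-nearest-neighbour classifier are stand-ins, not the paper's choices.

```python
import numpy as np

def encode(Y, D, Dv):
    """Code videos Y (d, N) of view v over [D, Dv]; least squares stands in
    for the actual sparse coder. Returns codes as columns, (J + Jv, N)."""
    return np.linalg.lstsq(np.hstack([D, Dv]), Y, rcond=None)[0]

def predict(test_codes, train_codes, train_labels):
    """1-nearest-neighbour in the aligned sparse space (rows are samples);
    any classifier trained on the pooled codes would do."""
    d2 = ((test_codes[:, None, :] - train_codes[None, :, :]) ** 2).sum(-1)
    return np.asarray(train_labels)[np.argmin(d2, axis=1)]

# Pooling: because codes from all source views live in the same aligned space,
# they can simply be concatenated before training a single classifier, e.g.
#   train_codes = np.hstack([encode(Y[v], D, Dvs[v]) for v in source_views]).T
```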
89 Conclusion We presented a novel dictionary learning framework to learn view-invariant sparse representations for cross-view action recognition. [sent-457, score-0.946]
90 We proposed to simultaneously learn a common dictionary to model view-shared features and a set of view-specific dictionaries to align view-specific features from different views. [sent-458, score-0.722]
91 Both the common dictionary and the corresponding view-specific dictionary are used to represent videos from each view. [sent-459, score-1.111]
92 We transfer the indices of nonzeros in the sparse codes of videos from the source view to the sparse codes of the corresponding videos from the target view. [sent-460, score-1.18]
93 In this way, the mapping between the source and target views is encoded in the common dictionary and the view-specific dictionaries. [sent-461, score-1.045]
94 Meanwhile, the associated sparse representations are view-invariant because non-zero positions in the sparse codes of correspondence videos share the same set of indices. [sent-462, score-0.747]
95 Our approach can be applied to cross-view and multi-view action recognition under unsupervised, supervised and domain adaptation settings. [sent-463, score-0.446]
96 Learning a discriminative dictionary for sparse coding via label consistent k-svd. [sent-531, score-0.536]
97 A dictionary learning approach for classification: Separating the particularity and the commonality. [sent-555, score-0.395]
98 Single view human action recognition using key pose matching and viterbi path searching. [sent-608, score-0.476]
99 Making action recognition robust to occlusions and viewpoint changes. [sent-653, score-0.394]
100 Learning 4d action feature models for arbitrary view action recognition. [sent-674, score-0.808]
wordName wordTfidf (topN-words)
[('dictionary', 0.395), ('action', 0.363), ('videos', 0.249), ('views', 0.244), ('yiv', 0.218), ('dv', 0.211), ('dictionaries', 0.193), ('dxiv', 0.174), ('items', 0.161), ('actions', 0.158), ('correspondence', 0.143), ('dkv', 0.131), ('viewspecific', 0.125), ('qiv', 0.109), ('sparse', 0.106), ('yv', 0.098), ('target', 0.09), ('codes', 0.09), ('dj', 0.089), ('shared', 0.089), ('source', 0.087), ('axiv', 0.087), ('dtdv', 0.087), ('dvziv', 0.087), ('view', 0.082), ('orphan', 0.077), ('common', 0.072), ('ds', 0.071), ('mode', 0.071), ('qi', 0.068), ('bziv', 0.065), ('djv', 0.065), ('dvzv', 0.065), ('dxv', 0.065), ('jv', 0.065), ('rjv', 0.065), ('ziv', 0.065), ('zv', 0.065), ('ixmas', 0.064), ('xiv', 0.058), ('slep', 0.058), ('dt', 0.056), ('taken', 0.054), ('bracket', 0.054), ('incoherent', 0.054), ('representations', 0.053), ('vv', 0.053), ('atom', 0.052), ('supervised', 0.052), ('unsupervised', 0.052), ('dk', 0.051), ('crossview', 0.048), ('augsvm', 0.044), ('djx', 0.044), ('mixsvm', 0.044), ('qik', 0.044), ('qivk', 0.044), ('qvi', 0.044), ('ynv', 0.044), ('waving', 0.043), ('reconstruction', 0.041), ('unseen', 0.041), ('separation', 0.04), ('incoherence', 0.039), ('znv', 0.039), ('vjt', 0.039), ('actors', 0.038), ('encode', 0.037), ('scriptors', 0.036), ('dexter', 0.036), ('junejo', 0.036), ('parameswaran', 0.036), ('discriminative', 0.035), ('combinations', 0.034), ('align', 0.033), ('viewpoints', 0.033), ('zi', 0.033), ('weinland', 0.032), ('encoded', 0.032), ('updating', 0.032), ('class', 0.031), ('recognition', 0.031), ('rj', 0.031), ('kv', 0.031), ('transfer', 0.031), ('jiang', 0.031), ('kk', 0.031), ('rd', 0.03), ('fruitful', 0.03), ('recognizing', 0.03), ('frames', 0.03), ('modes', 0.029), ('enables', 0.029), ('learn', 0.029), ('accuracies', 0.028), ('reconstruct', 0.027), ('laptev', 0.027), ('yilmaz', 0.027), ('xv', 0.026), ('encourage', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999911 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition
Author: Jingjing Zheng, Zhuolin Jiang
Abstract: We present an approach to jointly learn a set of viewspecific dictionaries and a common dictionary for crossview action recognition. The set of view-specific dictionaries is learned for specific views while the common dictionary is shared across different views. Our approach represents videos in each view using both the corresponding view-specific dictionary and the common dictionary. More importantly, it encourages the set of videos taken from different views of the same action to have similar sparse representations. In this way, we can align view-specific features in the sparse feature spaces spanned by the viewspecific dictionary set and transfer the view-shared features in the sparse feature space spanned by the common dictionary. Meanwhile, the incoherence between the common dictionary and the view-specific dictionary set enables us to exploit the discrimination information encoded in viewspecific features and view-shared features separately. In addition, the learned common dictionary not only has the capability to represent actions from unseen views, but also , makes our approach effective in a semi-supervised setting where no correspondence videos exist and only a few labels exist in the target view. Extensive experiments using the multi-view IXMAS dataset demonstrate that our approach outperforms many recent approaches for cross-view action recognition.
2 0.35119414 384 iccv-2013-Semi-supervised Robust Dictionary Learning via Efficient l-Norms Minimization
Author: Hua Wang, Feiping Nie, Weidong Cai, Heng Huang
Abstract: Representing the raw input of a data set by a set of relevant codes is crucial to many computer vision applications. Due to the intrinsic sparse property of real-world data, dictionary learning, in which the linear decomposition of a data point uses a set of learned dictionary bases, i.e., codes, has demonstrated state-of-the-art performance. However, traditional dictionary learning methods suffer from three weaknesses: sensitivity to noisy and outlier samples, difficulty to determine the optimal dictionary size, and incapability to incorporate supervision information. In this paper, we address these weaknesses by learning a Semi-Supervised Robust Dictionary (SSR-D). Specifically, we use the ℓ2,0+ norm as the loss function to improve the robustness against outliers, and develop a new structured sparse regularization to incorporate the supervision information in dictionary learning, without incurring additional parameters. Moreover, the optimal dictionary size is automatically learned from the input data. Minimizing the derived objective function is challenging because it involves many non-smooth ℓ2,0+ -norm terms. We present an efficient algorithm to solve the problem with a rigorous proof of the convergence of the algorithm. Extensive experiments are presented to show the superior performance of the proposed method.
3 0.33797187 161 iccv-2013-Fast Sparsity-Based Orthogonal Dictionary Learning for Image Restoration
Author: Chenglong Bao, Jian-Feng Cai, Hui Ji
Abstract: In recent years, how to learn a dictionary from input images for sparse modelling has been one very active topic in image processing and recognition. Most existing dictionary learning methods consider an over-complete dictionary, e.g. the K-SVD method. Often they require solving some minimization problem that is very challenging in terms of computational feasibility and efficiency. However, if the correlations among dictionary atoms are not well constrained, the redundancy of the dictionary does not necessarily improve the performance of sparse coding. This paper proposed a fast orthogonal dictionary learning method for sparse image representation. With comparable performance on several image restoration tasks, the proposed method is much more computationally efficient than the over-complete dictionary based learning methods.
4 0.32038504 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps
Author: Jiajia Luo, Wei Wang, Hairong Qi
Abstract: Human action recognition based on the depth information provided by commodity depth sensors is an important yet challenging task. The noisy depth maps, different lengths of action sequences, and free styles in performing actions, may cause large intra-class variations. In this paper, a new framework based on sparse coding and temporal pyramid matching (TPM) is proposed for depth-based human action recognition. Especially, a discriminative class-specific dictionary learning algorithm is proposed for sparse coding. By adding the group sparsity and geometry constraints, features can be well reconstructed by the sub-dictionary belonging to the same class, and the geometry relationships among features are also kept in the calculated coefficients. The proposed approach is evaluated on two benchmark datasets captured by depth cameras. Experimental results show that the proposed algorithm repeatedly achieves superior performance to the state of the art algorithms. Moreover, the proposed dictionary learning method also outperforms classic dictionary learning approaches.
5 0.30820233 276 iccv-2013-Multi-attributed Dictionary Learning for Sparse Coding
Author: Chen-Kuo Chiang, Te-Feng Su, Chih Yen, Shang-Hong Lai
Abstract: We present a multi-attributed dictionary learning algorithm for sparse coding. Considering training samples with multiple attributes, a new distance matrix is proposed by jointly incorporating data and attribute similarities. Then, an objective function is presented to learn categorydependent dictionaries that are compact (closeness of dictionary atoms based on data distance and attribute similarity), reconstructive (low reconstruction error with correct dictionary) and label-consistent (encouraging the labels of dictionary atoms to be similar). We have demonstrated our algorithm on action classification and face recognition tasks on several publicly available datasets. Experimental results with improved performance over previous dictionary learning methods are shown to validate the effectiveness of the proposed algorithm.
6 0.27677932 86 iccv-2013-Concurrent Action Detection with Structural Prediction
7 0.26818511 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
8 0.25465208 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition
9 0.24645105 99 iccv-2013-Cross-View Action Recognition over Heterogeneous Feature Spaces
10 0.24008156 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos
11 0.2351228 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
12 0.23019235 354 iccv-2013-Robust Dictionary Learning by Error Source Decomposition
13 0.22764881 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition
14 0.21251556 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
15 0.21134743 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction
16 0.20865689 96 iccv-2013-Coupled Dictionary and Feature Space Learning with Applications to Cross-Domain Image Synthesis and Recognition
17 0.20159043 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition
18 0.19683555 359 iccv-2013-Robust Object Tracking with Online Multi-lifespan Dictionary Learning
19 0.18599389 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
20 0.18571398 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition
topicId topicWeight
[(0, 0.256), (1, 0.313), (2, -0.001), (3, 0.256), (4, -0.321), (5, -0.149), (6, -0.074), (7, -0.164), (8, -0.073), (9, 0.051), (10, 0.046), (11, 0.115), (12, 0.013), (13, -0.025), (14, 0.117), (15, -0.019), (16, -0.012), (17, -0.017), (18, -0.012), (19, -0.034), (20, 0.036), (21, -0.078), (22, -0.0), (23, 0.004), (24, 0.035), (25, -0.038), (26, -0.093), (27, -0.043), (28, -0.024), (29, -0.032), (30, 0.029), (31, 0.046), (32, 0.04), (33, -0.05), (34, 0.001), (35, 0.003), (36, 0.015), (37, 0.066), (38, -0.027), (39, 0.046), (40, -0.037), (41, -0.094), (42, -0.036), (43, -0.016), (44, 0.049), (45, -0.018), (46, 0.004), (47, -0.032), (48, -0.038), (49, -0.002)]
simIndex simValue paperId paperTitle
same-paper 1 0.98558843 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition
Author: Jingjing Zheng, Zhuolin Jiang
Abstract: We present an approach to jointly learn a set of view-specific dictionaries and a common dictionary for cross-view action recognition. The set of view-specific dictionaries is learned for specific views while the common dictionary is shared across different views. Our approach represents videos in each view using both the corresponding view-specific dictionary and the common dictionary. More importantly, it encourages the set of videos taken from different views of the same action to have similar sparse representations. In this way, we can align view-specific features in the sparse feature spaces spanned by the view-specific dictionary set and transfer the view-shared features in the sparse feature space spanned by the common dictionary. Meanwhile, the incoherence between the common dictionary and the view-specific dictionary set enables us to exploit the discrimination information encoded in view-specific features and view-shared features separately. In addition, the learned common dictionary not only has the capability to represent actions from unseen views, but also makes our approach effective in a semi-supervised setting where no correspondence videos exist and only a few labels exist in the target view. Extensive experiments using the multi-view IXMAS dataset demonstrate that our approach outperforms many recent approaches for cross-view action recognition.
2 0.79878402 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps
Author: Jiajia Luo, Wei Wang, Hairong Qi
Abstract: Human action recognition based on the depth information provided by commodity depth sensors is an important yet challenging task. The noisy depth maps, different lengths of action sequences, and free styles in performing actions, may cause large intra-class variations. In this paper, a new framework based on sparse coding and temporal pyramid matching (TPM) is proposed for depth-based human action recognition. Especially, a discriminative class-specific dictionary learning algorithm is proposed for sparse coding. By adding the group sparsity and geometry constraints, features can be well reconstructed by the sub-dictionary belonging to the same class, and the geometry relationships among features are also kept in the calculated coefficients. The proposed approach is evaluated on two benchmark datasets captured by depth cameras. Experimental results show that the proposed algorithm repeatedly achieves superior performance to the state of the art algorithms. Moreover, the proposed dictionary learning method also outperforms classic dictionary learning approaches.
Author: De-An Huang, Yu-Chiang Frank Wang
Abstract: Cross-domain image synthesis and recognition are typically considered as two distinct tasks in the areas of computer vision and pattern recognition. Therefore, it is not clear whether approaches addressing one task can be easily generalized or extended for solving the other. In this paper, we propose a unified model for coupled dictionary and feature space learning. The proposed learning model not only observes a common feature space for associating cross-domain image data for recognition purposes, the derived feature space is able to jointly update the dictionaries in each image domain for improved representation. This is why our method can be applied to both cross-domain image synthesis and recognition problems. Experiments on a variety of synthesis and recognition tasks such as single image super-resolution, cross-view action recognition, and sketchto-photo face recognition would verify the effectiveness of our proposed learning model.
4 0.71740925 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition
Author: Behrooz Mahasseni, Sinisa Todorovic
Abstract: This paper presents an approach to view-invariant action recognition, where human poses and motions exhibit large variations across different camera viewpoints. When each viewpoint of a given set of action classes is specified as a learning task then multitask learning appears suitable for achieving view invariance in recognition. We extend the standard multitask learning to allow identifying: (1) latent groupings of action views (i.e., tasks), and (2) discriminative action parts, along with joint learning of all tasks. This is because it seems reasonable to expect that certain distinct views are more correlated than some others, and thus identifying correlated views could improve recognition. Also, part-based modeling is expected to improve robustness against self-occlusion when actors are imaged from different views. Results on the benchmark datasets show that we outperform standard multitask learning by 21.9%, and the state-of-the-art alternatives by 4.5–6%.
5 0.70953542 38 iccv-2013-Action Recognition with Actons
Author: Jun Zhu, Baoyuan Wang, Xiaokang Yang, Wenjun Zhang, Zhuowen Tu
Abstract: With the improved accessibility to an exploding amount of video data and growing demands in a wide range of video analysis applications, video-based action recognition/classification becomes an increasingly important task in computer vision. In this paper, we propose a two-layer structure for action recognition to automatically exploit a mid-level "acton" representation. The weakly-supervised actons are learned via a new max-margin multi-channel multiple instance learning framework, which can capture multiple mid-level action concepts simultaneously. The learned actons (with no requirement for detailed manual annotations) observe the properties of being compact, informative, discriminative, and easy to scale. The experimental results demonstrate the effectiveness of applying the learned actons in our two-layer structure, and show state-of-the-art recognition performance on two challenging action datasets, i.e., Youtube and HMDB51.
6 0.69648266 384 iccv-2013-Semi-supervised Robust Dictionary Learning via Efficient l-Norms Minimization
7 0.69294566 276 iccv-2013-Multi-attributed Dictionary Learning for Sparse Coding
8 0.67658323 161 iccv-2013-Fast Sparsity-Based Orthogonal Dictionary Learning for Image Restoration
9 0.66828203 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
10 0.65955561 86 iccv-2013-Concurrent Action Detection with Structural Prediction
11 0.65135783 99 iccv-2013-Cross-View Action Recognition over Heterogeneous Feature Spaces
12 0.63595229 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding
13 0.62530816 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
14 0.62527686 354 iccv-2013-Robust Dictionary Learning by Error Source Decomposition
15 0.62392527 260 iccv-2013-Manipulation Pattern Discovery: A Nonparametric Bayesian Approach
16 0.61240137 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition
17 0.59719896 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos
18 0.59522557 20 iccv-2013-A Max-Margin Perspective on Sparse Representation-Based Classification
19 0.58769041 114 iccv-2013-Dictionary Learning and Sparse Coding on Grassmann Manifolds: An Extrinsic Solution
20 0.57965106 166 iccv-2013-Finding Actors and Actions in Movies
topicId topicWeight
[(2, 0.476), (7, 0.013), (26, 0.041), (31, 0.018), (42, 0.109), (64, 0.088), (89, 0.137), (98, 0.016)]
simIndex simValue paperId paperTitle
1 0.9360677 13 iccv-2013-A General Two-Step Approach to Learning-Based Hashing
Author: Guosheng Lin, Chunhua Shen, David Suter, Anton van_den_Hengel
Abstract: Most existing approaches to hashing apply a single form of hash function, and an optimization process which is typically deeply coupled to this specific form. This tight coupling restricts the flexibility of the method to respond to the data, and can result in complex optimization problems that are difficult to solve. Here we propose a flexible yet simple framework that is able to accommodate different types of loss functions and hash functions. This framework allows a number of existing approaches to hashing to be placed in context, and simplifies the development of new problemspecific hashing methods. Our framework decomposes the hashing learning problem into two steps: hash bit learning and hash function learning based on the learned bits. The first step can typically be formulated as binary quadratic problems, and the second step can be accomplished by training standard binary classifiers. Both problems have been extensively studied in the literature. Our extensive experiments demonstrate that the proposed framework is effective, flexible and outperforms the state-of-the-art.
2 0.92653537 294 iccv-2013-Offline Mobile Instance Retrieval with a Small Memory Footprint
Author: Jayaguru Panda, Michael S. Brown, C.V. Jawahar
Abstract: Existing mobile image instance retrieval applications assume a network-based usage where image features are sent to a server to query an online visual database. In this scenario, there are no restrictions on the size of the visual database. This paper, however, examines how to perform this same task offline, where the entire visual index must reside on the mobile device itself within a small memory footprint. Such solutions have applications on location recognition and product recognition. Mobile instance retrieval requires a significant reduction in the visual index size. To achieve this, we describe a set of strategies that can reduce the visual index up to 60-80× compared to a standard instance retrieval implementation found on desktops or servers. While our proposed reduction steps affect the overall mean Average Precision (mAP), they are able to maintain a good Precision for the top K results (PK). We argue that for such offline application, maintaining a good PK is sufficient. The effectiveness of this approach is demonstrated on several standard databases. A working application designed for a remote historical site is also presented. This application is able to reduce a 50,000 image index structure to 25 MB while providing a precision of 97% for P10 and 100% for P1.
3 0.90157986 446 iccv-2013-Visual Semantic Complex Network for Web Images
Author: Shi Qiu, Xiaogang Wang, Xiaoou Tang
Abstract: This paper proposes modeling the complex web image collections with an automatically generated graph structure called visual semantic complex network (VSCN). The nodes on this complex network are clusters of images with both visual and semantic consistency, called semantic concepts. These nodes are connected based on the visual and semantic correlations. Our VSCN with 33,240 concepts is generated from a collection of 10 million web images. A great deal of valuable information on the structures of the web image collections can be revealed by exploring the VSCN, such as the small-world behavior, concept community, indegree distribution, hubs, and isolated concepts. It not only helps us better understand the web image collections at a macroscopic level, but also has many important practical applications. This paper presents two application examples: content-based image retrieval and image browsing. Experimental results show that the VSCN leads to significant improvement on both the precision of image retrieval (over 200%) and user experience for image browsing.
same-paper 4 0.88190109 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition
Author: Jingjing Zheng, Zhuolin Jiang
Abstract: We present an approach to jointly learn a set of view-specific dictionaries and a common dictionary for cross-view action recognition. The set of view-specific dictionaries is learned for specific views while the common dictionary is shared across different views. Our approach represents videos in each view using both the corresponding view-specific dictionary and the common dictionary. More importantly, it encourages the set of videos taken from different views of the same action to have similar sparse representations. In this way, we can align view-specific features in the sparse feature spaces spanned by the view-specific dictionary set and transfer the view-shared features in the sparse feature space spanned by the common dictionary. Meanwhile, the incoherence between the common dictionary and the view-specific dictionary set enables us to exploit the discrimination information encoded in view-specific features and view-shared features separately. In addition, the learned common dictionary not only has the capability to represent actions from unseen views, but also makes our approach effective in a semi-supervised setting where no correspondence videos exist and only a few labels exist in the target view. Extensive experiments using the multi-view IXMAS dataset demonstrate that our approach outperforms many recent approaches for cross-view action recognition.
5 0.88168502 352 iccv-2013-Revisiting Example Dependent Cost-Sensitive Learning with Decision Trees
Author: Oisin Mac Aodha, Gabriel J. Brostow
Abstract: Typical approaches to classification treat class labels as disjoint. For each training example, it is assumed that there is only one class label that correctly describes it, and that all other labels are equally bad. We know however, that good and bad labels are too simplistic in many scenarios, hurting accuracy. In the realm of example dependent costsensitive learning, each label is instead a vector representing a data point’s affinity for each of the classes. At test time, our goal is not to minimize the misclassification rate, but to maximize that affinity. We propose a novel example dependent cost-sensitive impurity measure for decision trees. Our experiments show that this new impurity measure improves test performance while still retaining the fast test times of standard classification trees. We compare our approach to classification trees and other cost-sensitive methods on three computer vision problems, tracking, descriptor matching, and optical flow, and show improvements in all three domains.
6 0.87895393 191 iccv-2013-Handling Uncertain Tags in Visual Recognition
7 0.85694349 214 iccv-2013-Improving Graph Matching via Density Maximization
8 0.82498312 374 iccv-2013-Salient Region Detection by UFO: Uniqueness, Focusness and Objectness
9 0.78666568 239 iccv-2013-Learning Hash Codes with Listwise Supervision
10 0.77563024 153 iccv-2013-Face Recognition Using Face Patch Networks
11 0.7543608 322 iccv-2013-Pose Estimation and Segmentation of People in 3D Movies
12 0.75282526 409 iccv-2013-Supervised Binary Hash Code Learning with Jensen Shannon Divergence
13 0.75079381 83 iccv-2013-Complementary Projection Hashing
14 0.73843402 229 iccv-2013-Large-Scale Video Hashing via Structure Learning
15 0.70623505 248 iccv-2013-Learning to Rank Using Privileged Information
16 0.70114017 313 iccv-2013-Person Re-identification by Salience Matching
17 0.69328845 368 iccv-2013-SYM-FISH: A Symmetry-Aware Flip Invariant Sketch Histogram Shape Descriptor
18 0.6908083 378 iccv-2013-Semantic-Aware Co-indexing for Image Retrieval
19 0.68121678 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation
20 0.68102705 179 iccv-2013-From Subcategories to Visual Composites: A Multi-level Framework for Object Detection