cvpr cvpr2013 cvpr2013-98 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zhong Zhang, Chunheng Wang, Baihua Xiao, Wen Zhou, Shuang Liu, Cunzhao Shi
Abstract: In this paper, we propose a novel method for cross-view action recognition via a continuous virtual path which connects the source view and the target view. Each point on this virtual path is a virtual view which is obtained by a linear transformation of the action descriptor. All the virtual views are concatenated into an infinite-dimensional feature to characterize continuous changes from the source to the target view. However, these infinite-dimensional features cannot be used directly. Thus, we propose a virtual view kernel to compute the value of similarity between two infinite-dimensional features, which can be readily used to construct any kernelized classifiers. In addition, there are a lot of unlabeled samples from the target view, which can be utilized to improve the performance of classifiers. Thus, we present a constraint strategy to explore the information contained in the unlabeled samples. The rationality behind the constraint is that any action video belongs to only one class. Our method is verified on the IXMAS dataset, and the experimental results demonstrate that our method achieves better performance than the state-of-the-art methods.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper, we propose a novel method for cross-view action recognition via a continuous virtual path which connects the source view and the target view. [sent-9, score-1.673]
2 Each point on this virtual path is a virtual view which is obtained by a linear transformation of the action descriptor. [sent-10, score-2.055]
3 All the virtual views are concatenated into an infinite-dimensional feature to characterize continuous changes from the source to the target view. [sent-11, score-1.152]
4 Thus, we propose a virtual view kernel to compute the value of similarity between two infinite-dimensional features, which can be readily used to construct any kernelized classifiers. [sent-13, score-1.052]
5 In addition, there are a lot of unlabeled samples from the target view, which can be utilized to improve the performance of classifiers. [sent-14, score-0.379]
6 The rationality behind the constraint is that any action video belongs to only one class. [sent-16, score-0.411]
7 Action feature a is projected to form the virtual view f(ρ) (0 ≤ ρ ≤ 1) on the continuous virtual path by the transformation matrix Mρ, and then all the virtual views are concatenated to form an infinite-dimensional feature b∞. [sent-28, score-2.691]
8 The inner product between two such features defines our virtual view kernel, which can be computed in closed form. [sent-29, score-0.952]
9 The virtual view kernel can be readily used to construct any kernelized classifiers. [sent-30, score-1.025]
10 This is because the same action looks very different when observed from different views. [sent-33, score-0.34]
11 Hence, action models learned using labeled samples in one view are less discriminative for recognizing actions in a different view. [sent-34, score-0.773]
12 In this paper, we present a novel kernel-based approach for cross-view action recognition via a continuous virtual path. [sent-36, score-1.072]
13 Imagine there is a virtual path connecting the source view and the target view, and each point on the virtual path refers to a virtual view. [sent-37, score-2.702]
14 An action feature is transformed to a virtual view on the virtual path by a particular class of linear projections. [sent-38, score-2.062]
15 Then, all the virtual views are integrated into an infinite-dimensional feature. [sent-39, score-0.838]
16 Since the infinite-dimensional feature contains all the virtual views from the source to the target view, it is robust to view changes. [sent-40, score-1.358]
17 As these infinite-dimensional features cannot be used directly, we propose a virtual view kernel to measure the similarity between two infinite-dimensional features. [sent-41, score-0.979]
18 In addition, we learn the virtual view kernel under an information theoretic framework that allows maximizing discrimination among action classes. [sent-45, score-1.426]
19 Like the approaches in [7, 15, 17], an unlabeled action sample observed simultaneously in both views yields a corresponding pair, so that these pairs can be used in the training stage. [sent-48, score-0.709]
20 One is semi-supervised domain adaptation, where the target view contains a small amount of labeled samples without corresponding pairs. [sent-51, score-0.531]
21 The other is unsupervised domain adaptation, where the target view is completely unlabeled, which is referred to as the unlabeled mode. [sent-53, score-0.421]
22 It can be seen that there are several unlabeled samples from the target view in the above three modes, and it is usually insufficient to construct a good classifier using only labeled samples or corresponding pairs. [sent-54, score-0.767]
23 Hence, how to effectively utilize unlabeled samples from the target view is key to cross-view action recognition. [sent-55, score-0.943]
24 Related Work A lot of approaches have been proposed to address the problem of cross-view action recognition. [sent-60, score-0.34]
25 [21] presented a view-invariant representation of human action that captures the dramatic changes in the speed and direction of the trajectory using the spatiotemporal curvature of the 2D trajectory. [sent-63, score-0.34]
26 Recently, transfer learning approaches have been employed to address cross-view action recognition. [sent-67, score-0.357]
27 [6] generated split-based features in the source view using Maximum Margin Clustering and then transferred the split values to the corresponding frames in the target view. [sent-69, score-0.46]
28 Then, the action videos are represented by “bilingual words” in both views. [sent-72, score-0.34]
29 Li et al.'s work [15], which also explored the idea of using virtual views to overcome the problem of view changes, is close to ours. [sent-73, score-1.114]
30 One difference is that their work only samples several virtual views, while our kernel-based method utilizes all the virtual views on the virtual path. [sent-75, score-2.264]
31 This preserves all the visual information along the virtual path and eliminates the need to tune the parameter required in [15]. [sent-76, score-0.788]
32 The other is that their work only uses labeled samples or corresponding pairs to train the model, while our method makes full use of unlabeled samples from the target view as well. [sent-77, score-0.773]
33 Approach In this section, we start by reviewing the method which obtains multiple virtual views by sampling the virtual path. [sent-79, score-1.521]
34 Third, we formulate the problem under an information theoretic framework so as to maximize discrimination among action classes. [sent-81, score-0.473]
35 Multiple Virtual Views by Sampling Virtual Path Imagine that there is a virtual path V(ρ), ρ ∈ [0, 1], connecting the source view V_S and the target view V_T, where V(0) = V_S and V(1) = V_T [15]. [sent-86, score-1.024]
36 A particular class of linear projections is adopted to transform the action features on the virtual path V (ρ). [sent-87, score-1.15]
37 Let f(ρ) = M_ρ^T a be a virtual view on the virtual path V(ρ), where M_ρ is a transformation matrix and a ∈ R^{D×1} is an action feature vector. [sent-88, score-2.077]
38 In the special cases, f_S = f(0) = M_S^T a and f_T = f(1) = M_T^T a are the source virtual view and the target virtual view respectively, corresponding to the two endpoints of the virtual path. [sent-89, score-2.682]
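As a minimal sketch of this projection step, the snippet below maps an action feature onto a virtual view at position ρ. The linear blend used to build M_ρ from M_S and M_T is only an illustrative assumption (the paper derives M_ρ from the virtual-path construction of [15]), and the function name is hypothetical.

```python
import numpy as np

def virtual_view(a, M_S, M_T, rho):
    """Project action feature a (shape (D,)) onto the virtual view at rho in [0, 1].

    M_S, M_T: D x d transformation matrices for the source and target views.
    NOTE: the linear blend below is a placeholder for the path parameterization M_rho;
    it is not the construction used in the paper.
    """
    M_rho = (1.0 - rho) * M_S + rho * M_T   # assumed parameterization of the path
    return M_rho.T @ a                       # f(rho) = M_rho^T a, a d-dimensional virtual view

# Endpoints of the path: source and target virtual views.
# f_S = virtual_view(a, M_S, M_T, 0.0)   # equals M_S^T a
# f_T = virtual_view(a, M_S, M_T, 1.0)   # equals M_T^T a
```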
39 The task is to compute the virtual view f(ρ) on the virtual path, i.e., to determine the transformation matrix M_ρ. [sent-93, score-1.556]
40 Multiple virtual views are sampled on the virtual path V(ρ) at L intervals ρ_1, ρ_2, . . . , ρ_L. [sent-109, score-1.626]
41 After that, the transformation matrix M_ρi of the corresponding virtual view f(ρ_i) can be obtained from Eq. [sent-116, score-0.927]
42 The final representation of an action video is simply obtained by concatenating the transformed features M_ρi^T a into a single long feature vector: a = [(M_S^T a)^T, (M_ρ1^T a)^T, . . . , (M_T^T a)^T]^T. (2) [sent-118, score-0.362]
43 The final feature vector implicitly incorporates multiple virtual view transformations, which change from the source view to the target view. [sent-122, score-1.372]
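A sketch of this sampling-based baseline is given below: L intermediate virtual views plus the two endpoints are concatenated into one long vector. The linear blend for M_ρ and the function name are the same illustrative assumptions as in the earlier sketch.

```python
import numpy as np

def concatenated_virtual_views(a, M_S, M_T, L=5):
    """Concatenate the source view, L sampled virtual views, and the target view
    into a single long feature vector, as in the sampling-based baseline of [15]."""
    rhos = np.linspace(0.0, 1.0, L + 2)                          # includes rho=0 and rho=1
    views = [((1.0 - r) * M_S + r * M_T).T @ a for r in rhos]    # placeholder M_rho^T a
    return np.concatenate(views)                                 # dimension (L + 2) * d
```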
44 Virtual View Kernel The above method, however, only samples several virtual views on the virtual path; it neglects the information provided by the other virtual views. [sent-126, score-2.242]
45 Furthermore, this method has to adopt cross-validation to determine the parameter L (the number of virtual views sampled on the virtual path). [sent-127, score-1.521]
46 Intuitively, if we utilize all the virtual views on the virtual path, the above drawbacks can be naturally overcome. [sent-128, score-1.504]
47 Nevertheless, when the original feature is projected onto all the virtual views, it becomes an infinite-dimensional feature. [sent-129, score-0.688]
48 The proposed virtual view kernel is expected to be a measurement of similarity that is robust to viewpoint changes. [sent-132, score-0.979]
49 In other words, although actions belonging to the same class are observed from different views, the values of similarity computed by the virtual view kernel are high enough for classification. [sent-133, score-1.043]
50 We next show that our kernel-based method does not need to compute and store all the virtual views. [sent-135, score-0.666]
51 Given two original D-dimensional feature vectors ai and aj, we compute their virtual views f(ρ) for a continuous ρ from 0 to 1, and then concatenate all the virtual views into infinite-dimensional feature vectors bi∞ and bj∞ . [sent-136, score-1.812]
52 The proposed virtual view kernel is defined as the inner product between them: <b_i^∞, b_j^∞> = ∫_0^1 (M_ρ^T a_i)^T (M_ρ^T a_j) dρ = a_i^T K a_j. [sent-137, score-0.992]
53 K ∈ R^{D×D} is defined as the virtual view kernel matrix. [sent-144, score-0.89]
54 From Eq. (5), we can see that our virtual view kernel has a closed-form solution. [sent-154, score-1.052]
55 Since our method utilizes all the virtual views on the virtual path, it not only takes full advantage of the visual information provided by the continuous virtual path, but also saves the cost of tuning the parameter L (the number of virtual views sampled on the virtual path). [sent-155, score-3.085]
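Under the same placeholder parameterization of M_ρ used in the earlier sketches, the kernel matrix can be approximated by numerically integrating M_ρ M_ρ^T over ρ. The paper computes this integral in closed form; the trapezoidal approximation, the step count n_steps, and the function names below are only illustrative assumptions.

```python
import numpy as np

def virtual_view_kernel_matrix(M_S, M_T, n_steps=200):
    """Approximate K = integral over rho in [0,1] of M_rho M_rho^T (trapezoidal rule).
    M_rho is again the illustrative linear blend of M_S and M_T, not the paper's path."""
    rhos = np.linspace(0.0, 1.0, n_steps)
    stack = np.stack([((1 - r) * M_S + r * M_T) @ ((1 - r) * M_S + r * M_T).T
                      for r in rhos])                 # (n_steps, D, D)
    return np.trapz(stack, rhos, axis=0)              # D x D kernel matrix K

def vvk(a_i, a_j, K):
    """Virtual view kernel value between two original D-dimensional action features."""
    return a_i @ K @ a_j                              # a_i^T K a_j
```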
56 Maximizing Discrimination In this subsection, we discuss the problem of choosing discriminative values for M_S and M_T, because our virtual view kernel K is completely determined by the transformation matrices M_S and M_T. [sent-158, score-0.989]
57 In the unlabeled mode, all the labeled training samples are from the source view. [sent-160, score-0.418]
58 In the partially labeled mode, only a portion of the samples from the target view are labeled as training data. [sent-161, score-0.621]
59 These pairs are unlabeled, but belong to the same class observed simultaneously in the source view and the target view. [sent-179, score-0.507]
60 Since a_S and a_T describe the same action class, we expect them to have a high value of similarity, i.e., a high virtual view kernel value. [sent-180, score-0.34]
61 Thus, how to effectively leverage unlabeled samples from the target view is crucial to cross-view action recognition. [sent-200, score-0.943]
62 In this work, we impose a constraint on the unlabeled samples from the target view. [sent-201, score-0.416]
63 For the two-class problem, the constraint is equivalent to maximizing the absolute value of the following formula: γ = a_P^T K a_u − a_N^T K a_u, (13) where a_u, a_P and a_N are the unlabeled feature vector from the target view, a positive feature vector, and a negative feature vector, respectively. [sent-203, score-0.492]
64 Note that a_P and a_N could be either from the source view or the target view because the virtual view kernel K is robust to view changes. [sent-204, score-1.86]
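The two-class constraint of Eq. (13) can be sketched directly from the kernel matrix K of the previous snippet; the function name is hypothetical and a_u, a_P, a_N are chosen as described in the text.

```python
def unlabeled_constraint(a_u, a_P, a_N, K):
    """gamma = a_P^T K a_u - a_N^T K a_u; training maximizes |gamma|, reflecting that
    an unlabeled target-view video belongs to exactly one class."""
    gamma = a_P @ K @ a_u - a_N @ K @ a_u
    return abs(gamma)
```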
65 When α = 0, it reduces to the unlabeled mode or the partially labeled mode with the constraint: max_{M_S, M_T} I(V; c) − β H({g(γ), −g(γ)}). [sent-214, score-0.483]
66 Implementation Details and Extensions Before training our model, we should determine the working mode and extract the corresponding single-view action feature vector from each training action video. [sent-252, score-0.826]
67 Once the virtual view kernel K is trained, we compute the values of similarity between any two training samples. [sent-253, score-0.979]
68 For a Q-class action recognition problem, we learn Q binary one-against-all models as described above. [sent-259, score-0.368]
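One hedged way to plug the learned kernel into standard kernelized classifiers, as the one-against-all scheme above suggests, is to precompute the Gram matrix a_i^T K a_j and train Q binary SVMs on it; scikit-learn's precomputed-kernel interface is used here purely for illustration, and the function names are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_all(X_train, y_train, K):
    """Train Q binary one-against-all SVMs on the virtual view kernel.
    X_train: (n, D) action features; y_train: (n,) integer labels in {0, ..., Q-1}."""
    gram = X_train @ K @ X_train.T                        # n x n Gram matrix of a_i^T K a_j
    models = []
    for q in np.unique(y_train):
        clf = SVC(kernel='precomputed')
        clf.fit(gram, (y_train == q).astype(int))         # one class vs. the rest
        models.append(clf)
    return models

def predict(models, X_train, X_test, K):
    gram_test = X_test @ K @ X_train.T                    # kernel between test and training samples
    scores = np.column_stack([m.decision_function(gram_test) for m in models])
    return scores.argmax(axis=1)                          # class with the largest margin
```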
69 Dataset and Low-level Feature Extraction We test our approach on the IXMAS multi-view action dataset [26], which contains eleven daily-life actions, such as check watch, punch, and turn around. [sent-266, score-0.34]
70 Each action is performed three times by twelve actors and observed from five different views including four side views and one top view. [sent-267, score-0.684]
71 Each row is a source view and each column a target view. [sent-277, score-0.46]
72 Finally, for each action video, we concatenate the local and global features to form a 1500-dimensional feature vector. [sent-326, score-0.38]
73 Correspondence mode: For an accurate comparison to [17] and [15], we follow the same leave-one-action-class-out strategy for choosing the orphan action, which means that each time we only consider one action class for testing in the target view. [sent-330, score-0.871]
74 The final results are reported as the average accuracy over all action classes in each view. [sent-331, score-0.358]
75 Note that the orphan action class is used neither to train the virtual view kernel nor to establish corresponding pairs. [sent-332, score-1.348]
76 We set the transformed virtual view dimension d to 20. [sent-334, score-0.89]
77 We compare our two algorithms, the virtual view kernel (VVK) and the virtual view kernel with the constraint on unlabeled samples (VVKC), with [17] and [15]. [sent-341, score-2.185]
78 First, our algorithms (VVK and VVKC) achieve higher average recognition accuracies for all five possible target views with varying source views, as can be seen in the last row of Table 1. [sent-344, score-0.608]
79 Second, our VVKC achieves better results than [17] in all the view combinations and better results than [15] in all but one view combination. [sent-345, score-0.448]
80 Third, our VVK is superior to [15] for all view combinations but the combination of source view C1 and target view C3, owing to its use of all the virtual views on the virtual path. [sent-346, score-2.429]
81 Finally, since our algorithm takes full advantage of the unlabeled samples from the target view, the average accuracy of VVKC is about 2% better than VVK. [sent-347, score-0.397]
82 Partially labeled and unlabeled modes: For the partially labeled mode, we set β to 3 and d to 20. [sent-348, score-0.362]
83 We compare our approach with multiple virtual views (MVV) proposed in [15], and the results are shown in Table 2. [sent-349, score-0.838]
84 The labeled samples from the target view take up 30% of all the target view samples, as in [15]. [sent-350, score-0.935]
85 We then study the recognition accuracy as the proportion of labeled samples from the target view increases from 0% to 30% in steps of 10%. [sent-353, score-0.5]
86 Cross-view action recognition accuracies on the IXMAS dataset compared with the baseline from [15] when a varying proportion of samples are labeled in the target view. [sent-357, score-0.696]
87 Recognition accuracy of non-discriminative virtual views (NDVV), VVK, and VVKC on the IXMAS dataset. [sent-359, score-0.856]
88 It is worth noting that when the proportion of labeled samples from the target view is 0%, the setting degenerates into the unlabeled mode. [sent-363, score-0.699]
89 Effect of Maximizing Discrimination The virtual view kernel K (MS and MT) is learned under an information theoretic framework so as to maximize discrimination. [sent-368, score-1.031]
90 This indicates that maximizing discrimination plays an important role in cross-view action recognition. [sent-374, score-0.423]
91 We choose a target view and use all other four views as sources. [sent-378, score-0.531]
92 Cross-view action recognition accuracy (%) with multiple source views in correspondence mode. [sent-382, score-0.699]
93 Cross-view action recognition accuracy (%) with multiple source views in partially labeled mode. [sent-390, score-0.776]
94 Cross-view action recognition accuracy (%) under different α and β in correspondence mode. [sent-398, score-0.426]
95 The two accuracy numbers in a tuple are the average recognition accuracies of the partially labeled and unlabeled modes. [sent-405, score-0.376]
96 In addition, comparing Table 4 with Table 1 and Table 5 with Table 2, we can see that the fusion strategy of multiple source views performs better than using a single source view. [sent-416, score-0.374]
97 The method constructs a continuous virtual path between the source view and the target view. [sent-427, score-1.286]
98 The proposed virtual view kernel utilizes all the virtual views on the virtual path to learn new feature representations that are robust to change in views. [sent-428, score-2.622]
99 Furthermore, we impose a constraint on unlabeled samples from the target view for further performance improvement. [sent-429, score-0.64]
100 Single view human action recognition using key pose matching and viterbi path searching. [sent-564, score-0.714]
wordName wordTfidf (topN-words)
[('virtual', 0.666), ('action', 0.34), ('view', 0.224), ('vvkc', 0.209), ('vvk', 0.19), ('views', 0.172), ('unlabeled', 0.172), ('ms', 0.151), ('mt', 0.136), ('target', 0.135), ('path', 0.122), ('ixmas', 0.118), ('source', 0.101), ('mode', 0.097), ('labeled', 0.073), ('samples', 0.072), ('modes', 0.067), ('kernel', 0.062), ('concretely', 0.061), ('ln', 0.057), ('discrimination', 0.054), ('theoretic', 0.051), ('junejo', 0.047), ('partially', 0.044), ('actions', 0.042), ('correspondence', 0.04), ('inner', 0.04), ('pages', 0.039), ('tn', 0.038), ('continuous', 0.038), ('aitkaj', 0.038), ('aptkau', 0.038), ('atnkau', 0.038), ('bilingual', 0.038), ('infinitedimensional', 0.038), ('matrixes', 0.038), ('msta', 0.038), ('mtta', 0.038), ('mvv', 0.038), ('constraint', 0.037), ('transformation', 0.037), ('adaptation', 0.035), ('orphan', 0.034), ('chunheng', 0.034), ('dexter', 0.034), ('rationality', 0.034), ('shuang', 0.034), ('kernelized', 0.031), ('vn', 0.029), ('maximizing', 0.029), ('recognition', 0.028), ('yilmaz', 0.028), ('aj', 0.028), ('maximize', 0.028), ('working', 0.027), ('domain', 0.027), ('farhadi', 0.027), ('similarity', 0.027), ('convenience', 0.026), ('pairs', 0.025), ('accuracies', 0.025), ('zhong', 0.024), ('differential', 0.023), ('vp', 0.023), ('tuple', 0.023), ('rao', 0.023), ('laptev', 0.023), ('proportion', 0.023), ('xiao', 0.023), ('geodesic', 0.023), ('readily', 0.023), ('zhou', 0.022), ('utilizes', 0.022), ('feature', 0.022), ('class', 0.022), ('entropy', 0.022), ('initializations', 0.022), ('recognizing', 0.022), ('rd', 0.021), ('temporal', 0.02), ('table', 0.02), ('rotation', 0.02), ('imagine', 0.02), ('joints', 0.02), ('approximated', 0.019), ('construct', 0.019), ('cn', 0.019), ('bn', 0.019), ('shi', 0.019), ('vectors', 0.018), ('concatenated', 0.018), ('accuracy', 0.018), ('variances', 0.018), ('products', 0.018), ('concatenate', 0.018), ('tt', 0.017), ('sampling', 0.017), ('greedy', 0.017), ('flow', 0.017), ('transfer', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path
Author: Zhong Zhang, Chunheng Wang, Baihua Xiao, Wen Zhou, Shuang Liu, Cunzhao Shi
Abstract: In this paper, we propose a novel method for cross-view action recognition via a continuous virtual path which connects the source view and the target view. Each point on this virtual path is a virtual view which is obtained by a linear transformation of the action descriptor. All the virtual views are concatenated into an infinite-dimensional feature to characterize continuous changes from the source to the target view. However, these infinite-dimensional features cannot be used directly. Thus, we propose a virtual view kernel to compute the value of similarity between two infinite-dimensional features, which can be readily used to construct any kernelized classifiers. In addition, there are a lot of unlabeled samples from the target view, which can be utilized to improve the performance of classifiers. Thus, we present a constraint strategy to explore the information contained in the unlabeled samples. The rationality behind the constraint is that any action video belongs to only one class. Our method is verified on the IXMAS dataset, and the experimental results demonstrate that our method achieves better performance than the state-of-the-art methods.
2 0.23087579 287 cvpr-2013-Modeling Actions through State Changes
Author: Alireza Fathi, James M. Rehg
Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.
Author: Tsz-Ho Yu, Tae-Kyun Kim, Roberto Cipolla
Abstract: This work addresses the challenging problem of unconstrained 3D human pose estimation (HPE)from a novelperspective. Existing approaches struggle to operate in realistic applications, mainly due to their scene-dependent priors, such as background segmentation and multi-camera network, which restrict their use in unconstrained environments. We therfore present a framework which applies action detection and 2D pose estimation techniques to infer 3D poses in an unconstrained video. Action detection offers spatiotemporal priors to 3D human pose estimation by both recognising and localising actions in space-time. Instead of holistic features, e.g. silhouettes, we leverage the flexibility of deformable part model to detect 2D body parts as a feature to estimate 3D poses. A new unconstrained pose dataset has been collected to justify the feasibility of our method, which demonstrated promising results, significantly outperforming the relevant state-of-the-arts.
4 0.20041288 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
Author: Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis
Abstract: How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatiotemporal patch in the video. What defines these spatiotemporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos and align the videos for label transfer techniques. Furthermore, these patches can be used as a discriminative vocabulary for action classification where they demonstrate stateof-the-art performance on UCF50 and Olympics datasets.
5 0.19492823 40 cvpr-2013-An Approach to Pose-Based Action Recognition
Author: Chunyu Wang, Yizhou Wang, Alan L. Yuille
Abstract: We address action recognition in videos by modeling the spatial-temporal structures of human poses. We start by improving a state of the art method for estimating human joint locations from videos. More precisely, we obtain the K-best estimations output by the existing method and incorporate additional segmentation cues and temporal constraints to select the “best” one. Then we group the estimated joints into five body parts (e.g. the left arm) and apply data mining techniques to obtain a representation for the spatial-temporal structures of human actions. This representation captures the spatial configurations ofbodyparts in one frame (by spatial-part-sets) as well as the body part movements(by temporal-part-sets) which are characteristic of human actions. It is interpretable, compact, and also robust to errors on joint estimations. Experimental results first show that our approach is able to localize body joints more accurately than existing methods. Next we show that it outperforms state of the art action recognizers on the UCF sport, the Keck Gesture and the MSR-Action3D datasets.
6 0.18252593 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
7 0.18038639 387 cvpr-2013-Semi-supervised Domain Adaptation with Instance Constraints
8 0.17761707 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
9 0.16877364 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
10 0.16244099 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)
11 0.15544322 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes
12 0.15524566 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
13 0.15076149 150 cvpr-2013-Event Recognition in Videos by Learning from Heterogeneous Web Sources
14 0.15072098 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition
15 0.13612586 3 cvpr-2013-3D R Transform on Spatio-temporal Interest Points for Action Recognition
16 0.13394845 419 cvpr-2013-Subspace Interpolation via Dictionary Learning for Unsupervised Domain Adaptation
17 0.11841783 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
18 0.11701219 36 cvpr-2013-Adding Unlabeled Samples to Categories by Learned Attributes
19 0.1169042 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
20 0.10612256 337 cvpr-2013-Principal Observation Ray Calibration for Tiled-Lens-Array Integral Imaging Display
topicId topicWeight
[(0, 0.19), (1, -0.058), (2, -0.054), (3, -0.116), (4, -0.224), (5, -0.027), (6, -0.091), (7, -0.004), (8, -0.038), (9, -0.034), (10, -0.007), (11, -0.03), (12, -0.05), (13, -0.02), (14, -0.199), (15, -0.077), (16, -0.026), (17, -0.131), (18, 0.033), (19, 0.127), (20, -0.031), (21, -0.134), (22, -0.098), (23, 0.013), (24, 0.05), (25, -0.024), (26, 0.09), (27, -0.032), (28, -0.015), (29, -0.008), (30, -0.066), (31, -0.008), (32, -0.016), (33, 0.082), (34, -0.043), (35, -0.072), (36, -0.034), (37, 0.02), (38, 0.019), (39, -0.035), (40, 0.001), (41, 0.032), (42, 0.007), (43, -0.014), (44, -0.003), (45, -0.067), (46, 0.018), (47, -0.045), (48, -0.03), (49, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.97280878 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path
Author: Zhong Zhang, Chunheng Wang, Baihua Xiao, Wen Zhou, Shuang Liu, Cunzhao Shi
Abstract: In this paper, we propose a novel method for cross-view action recognition via a continuous virtual path which connects the source view and the target view. Each point on this virtual path is a virtual view which is obtained by a linear transformation of the action descriptor. All the virtual views are concatenated into an infinite-dimensional feature to characterize continuous changes from the source to the target view. However, these infinite-dimensional features cannot be used directly. Thus, we propose a virtual view kernel to compute the value of similarity between two infinite-dimensional features, which can be readily used to construct any kernelized classifiers. In addition, there are a lot of unlabeled samples from the target view, which can be utilized to improve the performance of classifiers. Thus, we present a constraint strategy to explore the information contained in the unlabeled samples. The rationality behind the constraint is that any action video belongs to only one class. Our method is verified on the IXMAS dataset, and the experimental results demonstrate that our method achieves better performance than the state-of-the-art methods.
2 0.73246849 287 cvpr-2013-Modeling Actions through State Changes
Author: Alireza Fathi, James M. Rehg
Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.
3 0.6828813 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)
Author: Yezhou Yang, Cornelia Fermüller, Yiannis Aloimonos
Abstract: The problem of action recognition and human activity has been an active research area in Computer Vision and Robotics. While full-body motions can be characterized by movement and change of posture, no characterization, that holds invariance, has yet been proposed for the description of manipulation actions. We propose that a fundamental concept in understanding such actions, are the consequences of actions. There is a small set of fundamental primitive action consequences that provides a systematic high-level classification of manipulation actions. In this paper a technique is developed to recognize these action consequences. At the heart of the technique lies a novel active tracking and segmentation method that monitors the changes in appearance and topological structure of the manipulated object. These are then used in a visual semantic graph (VSG) based procedure applied to the time sequence of the monitored object to recognize the action consequence. We provide a new dataset, called Manipulation Action Consequences (MAC 1.0), which can serve as testbed for other studies on this topic. Several ex- periments on this dataset demonstrates that our method can robustly track objects and detect their deformations and division during the manipulation. Quantitative tests prove the effectiveness and efficiency of the method.
4 0.66458052 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
Author: Chao-Yeh Chen, Kristen Grauman
Abstract: We propose an approach to learn action categories from static images that leverages prior observations of generic human motion to augment its training process. Using unlabeled video containing various human activities, the system first learns how body pose tends to change locally in time. Then, given a small number of labeled static images, it uses that model to extrapolate beyond the given exemplars and generate “synthetic ” training examples—poses that could link the observed images and/or immediately precede or follow them in time. In this way, we expand the training set without requiring additional manually labeled examples. We explore both example-based and manifold-based methods to implement our idea. Applying our approach to recognize actions in both images and video, we show it enhances a state-of-the-art technique when very few labeled training examples are available.
5 0.65358013 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
Author: Michalis Raptis, Leonid Sigal
Abstract: In this paper, we develop a new model for recognizing human actions. An action is modeled as a very sparse sequence of temporally local discriminative keyframes collections of partial key-poses of the actor(s), depicting key states in the action sequence. We cast the learning of keyframes in a max-margin discriminative framework, where we treat keyframes as latent variables. This allows us to (jointly) learn a set of most discriminative keyframes while also learning the local temporal context between them. Keyframes are encoded using a spatially-localizable poselet-like representation with HoG and BoW components learned from weak annotations; we rely on structured SVM formulation to align our components and minefor hard negatives to boost localization performance. This results in a model that supports spatio-temporal localization and is insensitive to dropped frames or partial observations. We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an on-line streaming setting.
6 0.64610249 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
7 0.61305702 3 cvpr-2013-3D R Transform on Spatio-temporal Interest Points for Action Recognition
8 0.58704263 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
9 0.57657069 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition
10 0.57455236 150 cvpr-2013-Event Recognition in Videos by Learning from Heterogeneous Web Sources
11 0.57192171 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
12 0.56885409 387 cvpr-2013-Semi-supervised Domain Adaptation with Instance Constraints
13 0.56352907 40 cvpr-2013-An Approach to Pose-Based Action Recognition
14 0.54846919 179 cvpr-2013-From N to N+1: Multiclass Transfer Incremental Learning
15 0.54150915 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition
16 0.53039104 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes
17 0.51967347 444 cvpr-2013-Unconstrained Monocular 3D Human Pose Estimation by Action Detection and Cross-Modality Regression Forest
18 0.518942 419 cvpr-2013-Subspace Interpolation via Dictionary Learning for Unsupervised Domain Adaptation
19 0.49322566 149 cvpr-2013-Evaluation of Color STIPs for Human Action Recognition
20 0.47280544 34 cvpr-2013-Adaptive Active Learning for Image Classification
topicId topicWeight
[(10, 0.161), (16, 0.033), (26, 0.037), (33, 0.237), (67, 0.077), (69, 0.046), (77, 0.012), (87, 0.108), (99, 0.185)]
simIndex simValue paperId paperTitle
1 0.89080179 181 cvpr-2013-Fusing Depth from Defocus and Stereo with Coded Apertures
Author: Yuichi Takeda, Shinsaku Hiura, Kosuke Sato
Abstract: In this paper we propose a novel depth measurement method by fusing depth from defocus (DFD) and stereo. One of the problems of passive stereo method is the difficulty of finding correct correspondence between images when an object has a repetitive pattern or edges parallel to the epipolar line. On the other hand, the accuracy of DFD method is inherently limited by the effective diameter of the lens. Therefore, we propose the fusion of stereo method and DFD by giving different focus distances for left and right cameras of a stereo camera with coded apertures. Two types of depth cues, defocus and disparity, are naturally integrated by the magnification and phase shift of a single point spread function (PSF) per camera. In this paper we give the proof of the proportional relationship between the diameter of defocus and disparity which makes the calibration easy. We also show the outstanding performance of our method which has both advantages of two depth cues through simulation and actual experiments.
Author: Won Hwa Kim, Moo K. Chung, Vikas Singh
Abstract: The analysis of 3-D shape meshes is a fundamental problem in computer vision, graphics, and medical imaging. Frequently, the needs of the application require that our analysis take a multi-resolution view of the shape ’s local and global topology, and that the solution is consistent across multiple scales. Unfortunately, the preferred mathematical construct which offers this behavior in classical image/signal processing, Wavelets, is no longer applicable in this general setting (data with non-uniform topology). In particular, the traditional definition does not allow writing out an expansion for graphs that do not correspond to the uniformly sampled lattice (e.g., images). In this paper, we adapt recent results in harmonic analysis, to derive NonEuclidean Wavelets based algorithms for a range of shape analysis problems in vision and medical imaging. We show how descriptors derived from the dual domain representation offer native multi-resolution behavior for characterizing local/global topology around vertices. With only minor modifications, the framework yields a method for extracting interest/key points from shapes, a surprisingly simple algorithm for 3-D shape segmentation (competitive with state of the art), and a method for surface alignment (without landmarks). We give an extensive set of comparison results on a large shape segmentation benchmark and derive a uniqueness theorem for the surface alignment problem.
same-paper 3 0.86855197 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path
Author: Zhong Zhang, Chunheng Wang, Baihua Xiao, Wen Zhou, Shuang Liu, Cunzhao Shi
Abstract: In this paper, we propose a novel method for cross-view action recognition via a continuous virtual path which connects the source view and the target view. Each point on this virtual path is a virtual view which is obtained by a linear transformation of the action descriptor. All the virtual views are concatenated into an infinite-dimensional feature to characterize continuous changes from the source to the target view. However, these infinite-dimensional features cannot be used directly. Thus, we propose a virtual view kernel to compute the value of similarity between two infinite-dimensional features, which can be readily used to construct any kernelized classifiers. In addition, there are a lot of unlabeled samples from the target view, which can be utilized to improve the performance of classifiers. Thus, we present a constraint strategy to explore the information contained in the unlabeled samples. The rationality behind the constraint is that any action video belongs to only one class. Our method is verified on the IXMAS dataset, and the experimental results demonstrate that our method achieves better performance than the state-of-the-art methods.
4 0.84572035 73 cvpr-2013-Bringing Semantics into Focus Using Visual Abstraction
Author: C. Lawrence Zitnick, Devi Parikh
Abstract: Relating visual information to its linguistic semantic meaning remains an open and challenging area of research. The semantic meaning of images depends on the presence of objects, their attributes and their relations to other objects. But precisely characterizing this dependence requires extracting complex visual information from an image, which is in general a difficult and yet unsolved problem. In this paper, we propose studying semantic information in abstract images created from collections of clip art. Abstract images provide several advantages. They allow for the direct study of how to infer high-level semantic information, since they remove the reliance on noisy low-level object, attribute and relation detectors, or the tedious hand-labeling of images. Importantly, abstract images also allow the ability to generate sets of semantically similar scenes. Finding analogous sets of semantically similar real images would be nearly impossible. We create 1,002 sets of 10 semantically similar abstract scenes with corresponding written descriptions. We thoroughly analyze this dataset to discover semantically important features, the relations of words to visual features and methods for measuring semantic similarity.
5 0.8451907 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
Author: Ian Endres, Kevin J. Shih, Johnston Jiaa, Derek Hoiem
Abstract: We propose a method to learn a diverse collection of discriminative parts from object bounding box annotations. Part detectors can be trained and applied individually, which simplifies learning and extension to new features or categories. We apply the parts to object category detection, pooling part detections within bottom-up proposed regions and using a boosted classifier with proposed sigmoid weak learners for scoring. On PASCAL VOC 2010, we evaluate the part detectors ’ ability to discriminate and localize annotated keypoints. Our detection system is competitive with the best-existing systems, outperforming other HOG-based detectors on the more deformable categories.
6 0.83724898 400 cvpr-2013-Single Image Calibration of Multi-axial Imaging Systems
7 0.83709866 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities
8 0.83681178 414 cvpr-2013-Structure Preserving Object Tracking
9 0.83458602 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
10 0.8337456 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
11 0.83171648 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking
12 0.83170223 314 cvpr-2013-Online Object Tracking: A Benchmark
13 0.82947749 325 cvpr-2013-Part Discovery from Partial Correspondence
14 0.82936472 324 cvpr-2013-Part-Based Visual Tracking with Online Latent Structural Learning
15 0.82742196 331 cvpr-2013-Physically Plausible 3D Scene Tracking: The Single Actor Hypothesis
16 0.82738107 143 cvpr-2013-Efficient Large-Scale Structured Learning
17 0.82616794 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image
18 0.82595849 74 cvpr-2013-CLAM: Coupled Localization and Mapping with Efficient Outlier Handling
19 0.82580358 277 cvpr-2013-MODEC: Multimodal Decomposable Models for Human Pose Estimation
20 0.82566577 19 cvpr-2013-A Minimum Error Vanishing Point Detection Approach for Uncalibrated Monocular Images of Man-Made Environments