cvpr cvpr2013 cvpr2013-40 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Chunyu Wang, Yizhou Wang, Alan L. Yuille
Abstract: We address action recognition in videos by modeling the spatial-temporal structures of human poses. We start by improving a state of the art method for estimating human joint locations from videos. More precisely, we obtain the K-best estimations output by the existing method and incorporate additional segmentation cues and temporal constraints to select the “best” one. Then we group the estimated joints into five body parts (e.g. the left arm) and apply data mining techniques to obtain a representation for the spatial-temporal structures of human actions. This representation captures the spatial configurations of body parts in one frame (by spatial-part-sets) as well as the body part movements (by temporal-part-sets) which are characteristic of human actions. It is interpretable, compact, and also robust to errors on joint estimations. Experimental results first show that our approach is able to localize body joints more accurately than existing methods. Next we show that it outperforms state of the art action recognizers on the UCF sport, the Keck Gesture and the MSR-Action3D datasets.
Reference: text
sentIndex sentText sentNum sentScore
1 An approach to pose-based action recognition Chunyu Wang1, Yizhou Wang1, and Alan L. Yuille [sent-1, score-0.488]
2 Abstract We address action recognition in videos by modeling the spatial-temporal structures of human poses. [sent-8, score-0.672]
3 Then we group the estimated joints into five body parts (e. [sent-11, score-0.64]
4 This representation captures the spatial configurations of body parts in one frame (by spatial-part-sets) as well as the body part movements (by temporal-part-sets) which are characteristic of human actions. [sent-14, score-0.452]
5 Experimental results first show that our approach is able to localize body joints more accurately than existing methods. [sent-16, score-0.502]
6 Next we show that it outperforms state of the art action recognizers on the UCF sport, the Keck Gesture and the MSR-Action3D datasets. [sent-17, score-0.575]
7 Recent action recognition systems rely on low-level and mid-level features such as local space-time interest points (e. [sent-22, score-0.488]
8 (a) A pose is composed of 14 joints at the bottom layer, which are grouped into five body parts in the layer above; (b) shows two spatial-part-sets which combine frequently co-occurring configurations of body parts in an action class. [sent-29, score-1.728]
9 An alternative line of work represents actions by sequences of poses in time (e. [sent-47, score-0.415]
10 [4][26]), where poses refer to spatial configurations of body joints. [sent-49, score-0.536]
11 However, pose-based action recognition can be very hard because of the difficulty of estimating high-quality poses from action videos, except in special cases (e. [sent-52, score-1.18]
12 In this paper we present a novel pose-based action recognition approach which is effective on some challenging videos. [sent-55, score-0.488]
13 We first extend a state of the art method [27] to estimate human poses from action videos. [sent-56, score-0.823]
14 Given a video, we first obtain best-K pose estimations for each frame using the method of [27], then we infer the best poses by incorporating segmentation and temporal constraints for all frames in the video. [sent-57, score-0.655]
15 We experimentally show that this extension localizes body joints more accurately. [sent-58, score-0.502]
16 To represent human actions, we first group the estimated joints into five body parts (e. [sent-59, score-0.698]
17 We then apply data mining techniques in the spatial domain to obtain sets of distinctive co-occurring spatial configurations (poses) of body parts, which we call spatial-part-sets. [sent-63, score-0.458]
18 Similarly, in the temporal domain, we obtain sets of distinctive co-occurring pose sequences of body parts, which we call temporal-part-sets (e. [sent-64, score-0.617]
19 For test videos, we first detect these part-sets from the estimated poses and then represent the videos by histograms of the detected part-sets. [sent-68, score-0.363]
20 (i) It is interpretable, because we decompose poses into parts, guided by human body anatomy, and represent actions by the temporal movements of these parts. [sent-71, score-0.792]
21 bag of low-level features), because it helps prevent overfitting when training action classifiers. [sent-77, score-0.454]
22 This boosts action recognition performance compared with holistic pose features. [sent-79, score-0.791]
23 We demonstrate these advantages by showing that our proposed method outperforms state of the art action recognizers on the UCF sport, the Keck Gesture and the MSR-Action3D datasets. [sent-80, score-0.575]
24 Sections 3 and 4 introduce pose estimation and action representation, respectively. [sent-83, score-0.718]
25 Related Work We briefly review the pose-based action recognition methods in literature. [sent-87, score-0.488]
26 Xu et al. [25] propose to automatically estimate joint locations from videos, and use joint locations coupled with motion features for action recognition. [sent-90, score-0.795]
27 Modest joint estimation can degrade the action recognition performance as shown in experiments. [sent-91, score-0.622]
28 However, implicit pose representations are difficult to relate to body parts, and so it is hard to model meaningful body part movements in actions. [sent-97, score-0.796]
29 Instead, we use body parts as building blocks as they are more meaningful and compact. [sent-106, score-0.32]
30 Secondly, we model spatial pose structures as well as temporal pose evolutions, which are neglected in [21]. [sent-107, score-0.551]
31 Pose Estimation in Videos We now extend a state of the art image-based pose estimation method [27] to video sequences. [sent-109, score-0.388]
32 Our extension can localize joints more accurately, which is important for achieving good action recognition performance. [sent-110, score-0.738]
33 Initial Frame-based Pose Estimation A pose P is represented by 14 joints Ji: head, neck, (left/right)-hand/elbow/shoulder/hip/knee/foot. [sent-116, score-0.473]
34 Firstly, the learnt kinematic constraints tend to bias estimations toward dominating poses in the training data, which decreases estimation accuracy for rare poses. [sent-134, score-0.353]
35 However, looking at 15-best poses returned by the model for each frame, we observe a high probability that the “correct” pose is among them. [sent-136, score-0.461]
36 (1) where φ(P^i_{j_i}, I_i) is a unary term that measures the likelihood of the pose, and ψ(P^i_{j_i}, P^{i+1}_{j_{i+1}}, I_i, I_{i+1}) is a pairwise term that measures the appearance and location consistency of the joints in consecutive frames. [sent-169, score-0.473]
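The per-frame selection described here can be sketched as a dynamic program over the K-best candidate poses, chaining the unary likelihood φ with the pairwise consistency term ψ. The following is a minimal illustrative sketch, not the authors' implementation; the candidate format, K=15, and the placeholder scoring functions are assumptions.

```python
# Sketch (not the authors' code): choose one pose per frame from the K-best
# candidates by maximizing unary + pairwise scores with dynamic programming.
import numpy as np

def select_best_poses(candidates, unary, pairwise):
    """candidates: list over frames, each an array of shape (K, 14, 2).
    unary(pose) -> float; pairwise(prev_pose, pose) -> float."""
    T, K = len(candidates), len(candidates[0])
    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    for k in range(K):
        score[0, k] = unary(candidates[0][k])
    for t in range(1, T):
        for k in range(K):
            best_prev, best_val = 0, -np.inf
            for j in range(K):
                val = score[t - 1, j] + pairwise(candidates[t - 1][j], candidates[t][k])
                if val > best_val:
                    best_val, best_prev = val, j
            score[t, k] = best_val + unary(candidates[t][k])
            back[t, k] = best_prev
    # Backtrack the highest-scoring chain of candidates.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return [candidates[t][path[t]] for t in range(T)]

# Toy usage with random candidates and placeholder score functions.
rng = np.random.default_rng(0)
cands = [rng.normal(size=(15, 14, 2)) for _ in range(5)]
best = select_best_poses(
    cands,
    unary=lambda p: -np.abs(p).sum(),               # stand-in for the likelihood term
    pairwise=lambda a, b: -np.linalg.norm(a - b))   # stand-in for location consistency
print(len(best), best[0].shape)
```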
37 In particular, we group the 14 joints of pose P into five body parts (head, left/right arm, left/right leg) by human anatomy, i. [sent-176, score-0.853]
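A small sketch of the anatomy-guided grouping of the 14 joints into the five parts; the joint names follow the paper, but the exact index order and dictionary layout are assumptions for illustration.

```python
# Sketch: group the 14 estimated joints into five body parts by anatomy.
JOINTS = ["head", "neck",
          "l_shoulder", "l_elbow", "l_hand", "r_shoulder", "r_elbow", "r_hand",
          "l_hip", "l_knee", "l_foot", "r_hip", "r_knee", "r_foot"]

BODY_PARTS = {
    "head":      ["head", "neck"],
    "left_arm":  ["l_shoulder", "l_elbow", "l_hand"],
    "right_arm": ["r_shoulder", "r_elbow", "r_hand"],
    "left_leg":  ["l_hip", "l_knee", "l_foot"],
    "right_leg": ["r_hip", "r_knee", "r_foot"],
}

def split_pose_into_parts(pose_xy):
    """pose_xy: dict joint name -> (x, y). Returns part name -> flat coordinate tuple."""
    return {part: tuple(c for j in joints for c in pose_xy[j])
            for part, joints in BODY_PARTS.items()}

# Toy usage with a zero pose.
zero_pose = {name: (0.0, 0.0) for name in JOINTS}
print(split_pose_into_parts(zero_pose)["left_arm"])
```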
38 (a) we start by estimating poses for videos of the two action classes, i. [sent-215, score-0.783]
39 (b) then we cluster poses of each body part in training data and construct a part pose dictionary as described in Section 4. [sent-218, score-0.848]
40 (c) we extract temporal-part-sets (1-2) and spatial-part-sets (3) for the two action classes as described in Section 4. [sent-222, score-0.422]
41 Action Representation We next extract representative spatial/temporal pose structures from body poses for representing actions. [sent-254, score-0.748]
42 For spatial pose structures, we pursue sets of frequently co-occurring spatial configurations of body parts in a single frame, which we call the spatial-part-set, sp_i = {p_{j_1}, . [sent-255, score-0.81]
43 For temporal pose structures, we pursue sets of frequently co-occurring body part sequences al_i = (p_{j_1}, . [sent-259, score-0.756]
44 Note that body part sequence al_i captures the temporal pose evolution of a single body part (e. [sent-266, score-0.927]
45 See Figure 3 for the overall framework of the action representation. [sent-270, score-0.454]
46 Body Part A body part p_i is composed of z_i joint locations p_i = (x^i_1, y^i_1, . [sent-273, score-0.558]
47 We learn a dictionary of pose templates V_i = {v^i_1, v^i_2, . [sent-280, score-0.357]
48 , }, for each body part by clustering the poses of training data. [sent-283, score-0.522]
49 Each template pose represents a certain spatial configuration of body parts (see Figure 3. [sent-285, score-0.475]
50 We quantize all body part poses p_i to templates v^i_{k_i}. Figure 4. [sent-287, score-0.588]
51 (a) shows estimated poses for videos of turn-left and stop-left actions. [sent-289, score-0.329]
52 Each row in (1) is a transaction composed of the five indexes of quantized body parts. [sent-292, score-0.536]
53 Quantized poses are then represented by the five indexes of the templates in the dictionaries. [sent-304, score-0.429]
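A minimal sketch of how the per-part pose dictionaries and the quantization to template indexes could be built with k-means; the dictionary sizes reuse the cross-validated values reported later (8, 25, 25, 25, 25), while the feature layout and clustering details are assumptions.

```python
# Sketch: learn a pose-template dictionary per body part with k-means, then
# quantize each part pose to its template index.
import numpy as np
from sklearn.cluster import KMeans

def learn_part_dictionaries(part_poses, sizes):
    """part_poses: dict part -> array (N, d) of normalized part coordinates."""
    return {part: KMeans(n_clusters=sizes[part], n_init=10, random_state=0).fit(feats)
            for part, feats in part_poses.items()}

def quantize_pose(dictionaries, pose_parts):
    """pose_parts: dict part -> (d,) vector. Returns one template index per part."""
    return {part: int(km.predict(pose_parts[part].reshape(1, -1))[0])
            for part, km in dictionaries.items()}

# Toy usage with random part features (head has 2 joints, limbs have 3).
rng = np.random.default_rng(0)
sizes = {"head": 8, "left_arm": 25, "right_arm": 25, "left_leg": 25, "right_leg": 25}
train = {p: rng.normal(size=(200, 4 if p == "head" else 6)) for p in sizes}
dicts = learn_part_dictionaries(train, sizes)
print(quantize_pose(dicts, {p: rng.normal(size=train[p].shape[1]) for p in sizes}))
```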
54 Spatial-part-sets We propose spatial-part-sets to capture spatial configurations of multiple body parts: sp_i = {p_{j_1}, . [sent-307, score-0.34]
55 The ideal spatial-part-sets are those which occur frequently in one action class but rarely in other classes (and hence have both representative and discriminative power). [sent-315, score-0.499]
56 We obtain sets of spatial-part-sets for each action class using Contrast Mining techniques[6]. [sent-316, score-0.454]
57 We now relate the notations in contrast mining to our problem of mining spatial-part-sets. [sent-334, score-0.348]
58 Recall that the poses are quantized and represented by the five indexes of pose templates. [sent-335, score-0.634]
59 A pose P represented by five pose templates is a transaction. [sent-341, score-0.579]
60 All poses in the training data constitute the transaction database D (see Figure 4. [sent-345, score-0.391]
61 We pursue sets of spatial-part-sets for each pair of action classes y1 and y2. [sent-351, score-0.511]
62 By increasing the support rate, we guarantee the representative power of the spatial-part-sets TSD+→D− for the positive action class. [sent-359, score-0.481]
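A brute-force illustration of the contrast-mining idea: keep itemsets whose support is high in the positive-class transaction database D+ and low in the negative-class database D−. This is not the mining algorithm of [6]; the thresholds and the small-itemset enumeration are assumptions for clarity.

```python
# Sketch: contrast mining over small itemsets. Each transaction is a pose
# encoded as five (part, template-index) items.
from itertools import combinations

def support(itemset, transactions):
    return sum(itemset <= t for t in transactions) / max(len(transactions), 1)

def mine_spatial_part_sets(d_pos, d_neg, min_pos=0.3, max_neg=0.05, max_size=3):
    """d_pos, d_neg: lists of transactions, each a frozenset of (part, idx) items."""
    items = sorted({it for t in d_pos for it in t})
    mined = []
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            s = frozenset(combo)
            if support(s, d_pos) >= min_pos and support(s, d_neg) <= max_neg:
                mined.append(s)
    return mined

# Toy usage: two action classes with repeated poses.
pos = [frozenset({("left_arm", 3), ("right_arm", 7), ("head", 1)})] * 5
neg = [frozenset({("left_arm", 0), ("right_arm", 2), ("head", 1)})] * 5
print(mine_spatial_part_sets(pos, neg))
```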
63 Temporal-part-sets We propose temporal-part-sets to capture joint pose evolution of multiple body parts. [sent-364, score-0.568]
64 We denote pose sequences of body parts as al_i = (p_{j_1}, . [sent-365, score-0.62]
65 We mine a set of frequently co-occurring pose sequences, which we call temporal-part-sets. Figure 5. [sent-369, score-0.43]
66 In implementation, for each of the five pose sequences (p^i_1, . [sent-377, score-0.333]
67 sub-sequences of the video compose a transaction, and the transactions of all videos compose the transaction database. [sent-394, score-0.412]
68 We mine a set of co-occurring sub-sequences for each pair of action classes, as in spatial-part-set mining. [sent-395, score-0.539]
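A small sketch of how one video could be turned into a transaction of part sub-sequences for temporal-part-set mining; the window length is an assumption. The resulting transactions could then be fed to the same contrast-mining routine sketched above.

```python
# Sketch: build one transaction per video from sliding-window sub-sequences of
# each body part's quantized template-index sequence.
def video_to_temporal_transaction(part_sequences, window=3):
    """part_sequences: dict part -> list of template indexes over frames.
    Returns a frozenset of (part, sub-sequence) items for one video."""
    items = set()
    for part, seq in part_sequences.items():
        for start in range(len(seq) - window + 1):
            items.add((part, tuple(seq[start:start + window])))
    return frozenset(items)

# Toy usage: a 6-frame video with two parts.
video = {"left_arm": [3, 3, 5, 5, 7, 7], "right_leg": [1, 1, 1, 2, 2, 2]}
print(sorted(video_to_temporal_transaction(video))[:3])
```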
69 Classification of Actions We use the bag-of-words model to leverage spatial-partsets and temporal-part-sets for action recognition. [sent-399, score-0.454]
70 In the off-line mode, we pursue a set of part-sets for each pair of action classes. [sent-400, score-0.511]
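A minimal sketch of the bag-of-part-sets pipeline: count how often each mined part-set fires in a video's frames, normalize into a histogram, and train a classifier per pair of action classes. A linear SVM stands in here for whatever SVM variant the paper uses; the toy data and feature details are assumptions.

```python
# Sketch: bag-of-words over mined part-sets plus a linear SVM.
import numpy as np
from sklearn.svm import LinearSVC

def video_histogram(frame_transactions, part_sets):
    """frame_transactions: list of frozensets (one per frame);
    part_sets: list of mined frozensets forming the vocabulary."""
    h = np.array([sum(ps <= ft for ft in frame_transactions) for ps in part_sets],
                 dtype=float)
    return h / max(h.sum(), 1.0)

# Toy usage: two videos, a two-entry part-set vocabulary, two classes.
vocab = [frozenset({("left_arm", 3), ("head", 1)}), frozenset({("right_leg", 2)})]
vids = [[frozenset({("left_arm", 3), ("head", 1)})] * 4,
        [frozenset({("right_leg", 2), ("head", 0)})] * 4]
X = np.stack([video_histogram(v, vocab) for v in vids])
y = np.array([0, 1])
clf = LinearSVC(C=1.0).fit(X, y)
print(clf.predict(X))
```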
71 For the UCF sport and Keck Gesture datasets, we estimate poses from videos by our proposed approach. [sent-409, score-0.487]
72 We report performance for both pose estimation and action recognition. [sent-410, score-0.718]
73 For the MSR-Action3D dataset, we skip pose estimation and use the provided 3D poses (because the video frames are not provided) to recognize actions. [sent-413, score-0.596]
74 We also evaluate our approach’s robustness to ambiguous poses by perturbing the joint locations in MSR-Action3D. [sent-414, score-0.42]
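A sketch of how such a robustness test could look: add zero-mean Gaussian noise of increasing standard deviation to the provided 3D joint locations before building the representation. The noise model and scales are assumptions, not the paper's protocol.

```python
# Sketch: perturb MSR-Action3D joint locations with Gaussian noise.
import numpy as np

def perturb_joints(joints_xyz, sigma, rng=None):
    """joints_xyz: array (num_frames, num_joints, 3) of skeleton joints."""
    rng = np.random.default_rng() if rng is None else rng
    return joints_xyz + rng.normal(scale=sigma, size=joints_xyz.shape)

skeleton = np.zeros((10, 20, 3))          # toy sequence: 10 frames, 20 joints
for sigma in (0.0, 0.05, 0.1):            # increasing perturbation severity
    noisy = perturb_joints(skeleton, sigma, np.random.default_rng(0))
    print(sigma, float(np.abs(noisy - skeleton).mean()))
```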
75 The Keck gesture dataset [11] contains 14 different gesture classes. [sent-419, score-0.324]
76 The MSR-Action3D dataset [15] contains 20 actions, with each action performed three times by ten subjects. [sent-422, score-0.454]
77 Comparison to two baselines We compare our proposed representation with two baselines: holistic pose features and local body part based features. [sent-425, score-0.587]
78 A holistic pose feature is a concatenated vector of 14 joint locations. [sent-426, score-0.396]
79 We cluster holistic pose features using the k-means algorithm and obtain a prototype dictionary of size 600. [sent-427, score-0.404]
80 For local body part based features, we compute a separate pose dictionary for each body part, extract “bag-of-body-part” features, and concatenate them into a high-dimensional vector. [sent-429, score-0.83]
81 We set dictionary sizes to (8, 25, 25, 25, 25) for the five body parts by cross validation. [sent-431, score-0.461]
82 On the UCF sport dataset, the holistic pose features and the local body part based features get 69. [sent-446, score-0.745]
83 Note that the go back and come near actions have very similar poses but in reverse temporal order. [sent-460, score-0.472]
84 Sadanand’s action bank [18] achieves the highest recognition rate. [sent-471, score-0.534]
85 But their action bank is constructed by manually selecting frames from training data, so it is not appropriate to compare our fully automatic method to theirs. [sent-472, score-0.543]
86 Comparison of action recognition using poses estimated by [27] and by our method. [sent-477, score-0.726]
87 We also evaluated the pose estimation method in the context of action recognition. [sent-501, score-0.718]
88 Table 3 compares the action recognition accuracy using poses obtained by different pose estimation methods. [sent-502, score-0.99]
89 The table shows that using the poses obtained by our method (which are more accurate) does improve the action recognition performance compared to using the poses obtained by [27]. [sent-503, score-0.964]
90 The performance of holistic pose features drops dramatically as the perturbation becomes severe, which is expected since the accuracy of joint locations has a large impact on holistic pose features. [sent-509, score-0.748]
91 Conclusion We proposed a novel action representation based on human poses. [sent-512, score-0.512]
92 The poses were obtained by extending an existing state-of-the-art pose estimation algorithm. [sent-513, score-0.502]
93 We then applied data mining techniques to mine spatial-temporal pose structures for action representation. [sent-516, score-0.971]
94 Recognition of human body motion using phase space constraints. [sent-544, score-0.367]
95 Scale invariant action recognition using compound features mined from dense spatio-temporal corners. [sent-584, score-0.538]
96 Human action recognition using distribution of oriented rectangular patches. [sent-589, score-0.527]
97 Learning a hierarchy of discriminative space-time neighborhood feature for human action recognition. [sent-604, score-0.512]
98 Action mach a spatio-temporal maximum average correlation height filter for action recognition. [sent-629, score-0.454]
99 Mining actionlet ensemble for action recognition with depth cameras. [sent-657, score-0.53]
100 Combining skeletal pose with local motion for human activity recognition. [sent-689, score-0.338]
wordName wordTfidf (topN-words)
[('action', 0.454), ('keck', 0.298), ('body', 0.252), ('joints', 0.25), ('poses', 0.238), ('pose', 0.223), ('mining', 0.174), ('ucf', 0.165), ('gesture', 0.162), ('sport', 0.158), ('actions', 0.137), ('transaction', 0.111), ('joint', 0.093), ('videos', 0.091), ('mine', 0.085), ('itemset', 0.081), ('holistic', 0.08), ('ji', 0.078), ('arm', 0.074), ('growth', 0.073), ('dictionary', 0.071), ('temporal', 0.07), ('five', 0.07), ('parts', 0.068), ('kf', 0.068), ('pi', 0.066), ('compose', 0.063), ('templates', 0.063), ('pages', 0.061), ('indexes', 0.058), ('human', 0.058), ('motion', 0.057), ('pursue', 0.057), ('interpretable', 0.056), ('iji', 0.054), ('pjni', 0.054), ('ulthoff', 0.054), ('kb', 0.052), ('video', 0.051), ('sadanand', 0.051), ('compound', 0.05), ('locations', 0.049), ('interpretability', 0.048), ('itemsets', 0.048), ('recognizers', 0.048), ('tsd', 0.048), ('estimations', 0.046), ('bank', 0.046), ('configurations', 0.046), ('quantized', 0.045), ('frequently', 0.045), ('cooccurring', 0.045), ('fulfill', 0.045), ('frames', 0.043), ('pj', 0.042), ('actionlet', 0.042), ('spi', 0.042), ('constitute', 0.042), ('confusion', 0.041), ('art', 0.041), ('estimation', 0.041), ('perturbing', 0.04), ('sequences', 0.04), ('rectangular', 0.039), ('ib', 0.039), ('head', 0.037), ('anatomy', 0.037), ('pursued', 0.037), ('movements', 0.037), ('ali', 0.037), ('svms', 0.037), ('frame', 0.035), ('explaining', 0.035), ('structures', 0.035), ('coherence', 0.035), ('histograms', 0.034), ('going', 0.034), ('blank', 0.034), ('bobick', 0.034), ('recognition', 0.034), ('ieee', 0.033), ('neck', 0.033), ('transactions', 0.033), ('part', 0.032), ('item', 0.032), ('evolving', 0.032), ('state', 0.032), ('call', 0.032), ('ep', 0.032), ('jj', 0.031), ('rate', 0.031), ('prototype', 0.03), ('sd', 0.03), ('intersection', 0.029), ('articulated', 0.029), ('distributions', 0.029), ('captures', 0.029), ('learnt', 0.028), ('support', 0.027), ('go', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 40 cvpr-2013-An Approach to Pose-Based Action Recognition
Author: Chunyu Wang, Yizhou Wang, Alan L. Yuille
Abstract: We address action recognition in videos by modeling the spatial-temporal structures of human poses. We start by improving a state of the art method for estimating human joint locations from videos. More precisely, we obtain the K-best estimations output by the existing method and incorporate additional segmentation cues and temporal constraints to select the “best” one. Then we group the estimated joints into five body parts (e.g. the left arm) and apply data mining techniques to obtain a representation for the spatial-temporal structures of human actions. This representation captures the spatial configurations of body parts in one frame (by spatial-part-sets) as well as the body part movements (by temporal-part-sets) which are characteristic of human actions. It is interpretable, compact, and also robust to errors on joint estimations. Experimental results first show that our approach is able to localize body joints more accurately than existing methods. Next we show that it outperforms state of the art action recognizers on the UCF sport, the Keck Gesture and the MSR-Action3D datasets.
Author: Tsz-Ho Yu, Tae-Kyun Kim, Roberto Cipolla
Abstract: This work addresses the challenging problem of unconstrained 3D human pose estimation (HPE) from a novel perspective. Existing approaches struggle to operate in realistic applications, mainly due to their scene-dependent priors, such as background segmentation and multi-camera network, which restrict their use in unconstrained environments. We therefore present a framework which applies action detection and 2D pose estimation techniques to infer 3D poses in an unconstrained video. Action detection offers spatiotemporal priors to 3D human pose estimation by both recognising and localising actions in space-time. Instead of holistic features, e.g. silhouettes, we leverage the flexibility of deformable part model to detect 2D body parts as a feature to estimate 3D poses. A new unconstrained pose dataset has been collected to justify the feasibility of our method, which demonstrated promising results, significantly outperforming the relevant state-of-the-arts.
3 0.35060954 287 cvpr-2013-Modeling Actions through State Changes
Author: Alireza Fathi, James M. Rehg
Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.
4 0.34457445 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
Author: Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis
Abstract: How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatiotemporal patch in the video. What defines these spatiotemporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos and align the videos for label transfer techniques. Furthermore, these patches can be used as a discriminative vocabulary for action classification where they demonstrate stateof-the-art performance on UCF50 and Olympics datasets.
5 0.3224071 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
Author: Chao-Yeh Chen, Kristen Grauman
Abstract: We propose an approach to learn action categories from static images that leverages prior observations of generic human motion to augment its training process. Using unlabeled video containing various human activities, the system first learns how body pose tends to change locally in time. Then, given a small number of labeled static images, it uses that model to extrapolate beyond the given exemplars and generate “synthetic ” training examples—poses that could link the observed images and/or immediately precede or follow them in time. In this way, we expand the training set without requiring additional manually labeled examples. We explore both example-based and manifold-based methods to implement our idea. Applying our approach to recognize actions in both images and video, we show it enhances a state-of-the-art technique when very few labeled training examples are available.
6 0.28192478 206 cvpr-2013-Human Pose Estimation Using Body Parts Dependent Joint Regressors
7 0.27951238 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
8 0.26525012 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
9 0.26147515 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
10 0.25726771 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition
11 0.25272188 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)
12 0.24870418 334 cvpr-2013-Pose from Flow and Flow from Pose
13 0.24431767 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes
14 0.23828387 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image
15 0.22636454 153 cvpr-2013-Expanded Parts Model for Human Attribute and Action Recognition in Still Images
16 0.22283721 207 cvpr-2013-Human Pose Estimation Using a Joint Pixel-wise and Part-wise Formulation
17 0.2090296 335 cvpr-2013-Poselet Conditioned Pictorial Structures
18 0.19824548 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
19 0.19492823 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path
20 0.19403787 3 cvpr-2013-3D R Transform on Spatio-temporal Interest Points for Action Recognition
topicId topicWeight
[(0, 0.309), (1, -0.124), (2, -0.055), (3, -0.275), (4, -0.403), (5, -0.051), (6, 0.023), (7, 0.136), (8, -0.02), (9, -0.164), (10, -0.073), (11, 0.182), (12, -0.112), (13, 0.064), (14, -0.072), (15, 0.032), (16, 0.036), (17, -0.1), (18, -0.009), (19, 0.089), (20, 0.009), (21, -0.006), (22, -0.058), (23, 0.036), (24, -0.025), (25, -0.047), (26, -0.008), (27, 0.005), (28, 0.023), (29, -0.009), (30, 0.073), (31, -0.017), (32, -0.035), (33, 0.031), (34, 0.017), (35, -0.007), (36, -0.007), (37, 0.045), (38, 0.011), (39, -0.004), (40, -0.007), (41, -0.064), (42, -0.008), (43, 0.029), (44, -0.009), (45, 0.012), (46, -0.01), (47, 0.012), (48, -0.023), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.97134799 40 cvpr-2013-An Approach to Pose-Based Action Recognition
Author: Chunyu Wang, Yizhou Wang, Alan L. Yuille
Abstract: We address action recognition in videos by modeling the spatial-temporal structures of human poses. We start by improving a state of the art method for estimating human joint locations from videos. More precisely, we obtain the K-best estimations output by the existing method and incorporate additional segmentation cues and temporal constraints to select the “best” one. Then we group the estimated joints into five body parts (e.g. the left arm) and apply data mining techniques to obtain a representation for the spatial-temporal structures of human actions. This representation captures the spatial configurations of body parts in one frame (by spatial-part-sets) as well as the body part movements (by temporal-part-sets) which are characteristic of human actions. It is interpretable, compact, and also robust to errors on joint estimations. Experimental results first show that our approach is able to localize body joints more accurately than existing methods. Next we show that it outperforms state of the art action recognizers on the UCF sport, the Keck Gesture and the MSR-Action3D datasets.
Author: Tsz-Ho Yu, Tae-Kyun Kim, Roberto Cipolla
Abstract: This work addresses the challenging problem of unconstrained 3D human pose estimation (HPE) from a novel perspective. Existing approaches struggle to operate in realistic applications, mainly due to their scene-dependent priors, such as background segmentation and multi-camera network, which restrict their use in unconstrained environments. We therefore present a framework which applies action detection and 2D pose estimation techniques to infer 3D poses in an unconstrained video. Action detection offers spatiotemporal priors to 3D human pose estimation by both recognising and localising actions in space-time. Instead of holistic features, e.g. silhouettes, we leverage the flexibility of deformable part model to detect 2D body parts as a feature to estimate 3D poses. A new unconstrained pose dataset has been collected to justify the feasibility of our method, which demonstrated promising results, significantly outperforming the relevant state-of-the-arts.
3 0.88451606 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
Author: Michalis Raptis, Leonid Sigal
Abstract: In this paper, we develop a new model for recognizing human actions. An action is modeled as a very sparse sequence of temporally local discriminative keyframes: collections of partial key-poses of the actor(s), depicting key states in the action sequence. We cast the learning of keyframes in a max-margin discriminative framework, where we treat keyframes as latent variables. This allows us to (jointly) learn a set of most discriminative keyframes while also learning the local temporal context between them. Keyframes are encoded using a spatially-localizable poselet-like representation with HoG and BoW components learned from weak annotations; we rely on a structured SVM formulation to align our components and mine for hard negatives to boost localization performance. This results in a model that supports spatio-temporal localization and is insensitive to dropped frames or partial observations. We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an on-line streaming setting.
4 0.79283404 287 cvpr-2013-Modeling Actions through State Changes
Author: Alireza Fathi, James M. Rehg
Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.
5 0.78841853 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
Author: Chao-Yeh Chen, Kristen Grauman
Abstract: We propose an approach to learn action categories from static images that leverages prior observations of generic human motion to augment its training process. Using unlabeled video containing various human activities, the system first learns how body pose tends to change locally in time. Then, given a small number of labeled static images, it uses that model to extrapolate beyond the given exemplars and generate “synthetic ” training examples—poses that could link the observed images and/or immediately precede or follow them in time. In this way, we expand the training set without requiring additional manually labeled examples. We explore both example-based and manifold-based methods to implement our idea. Applying our approach to recognize actions in both images and video, we show it enhances a state-of-the-art technique when very few labeled training examples are available.
6 0.78809482 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
7 0.74005795 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition
8 0.72396481 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)
9 0.67973822 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image
10 0.66833866 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
11 0.6473797 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition
12 0.64436442 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
13 0.63431805 153 cvpr-2013-Expanded Parts Model for Human Attribute and Action Recognition in Still Images
14 0.62780493 206 cvpr-2013-Human Pose Estimation Using Body Parts Dependent Joint Regressors
15 0.62196976 335 cvpr-2013-Poselet Conditioned Pictorial Structures
16 0.61763895 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path
17 0.61613703 277 cvpr-2013-MODEC: Multimodal Decomposable Models for Human Pose Estimation
18 0.61271346 3 cvpr-2013-3D R Transform on Spatio-temporal Interest Points for Action Recognition
19 0.60519642 439 cvpr-2013-Tracking Human Pose by Tracking Symmetric Parts
20 0.58611363 45 cvpr-2013-Articulated Pose Estimation Using Discriminative Armlet Classifiers
topicId topicWeight
[(10, 0.14), (16, 0.02), (26, 0.073), (28, 0.011), (33, 0.302), (63, 0.186), (67, 0.074), (69, 0.033), (80, 0.021), (87, 0.061)]
simIndex simValue paperId paperTitle
1 0.94218606 467 cvpr-2013-Wide-Baseline Hair Capture Using Strand-Based Refinement
Author: Linjie Luo, Cha Zhang, Zhengyou Zhang, Szymon Rusinkiewicz
Abstract: We propose a novel algorithm to reconstruct the 3D geometry of human hairs in wide-baseline setups using strand-based refinement. The hair strands are first extracted in each 2D view, and projected onto the 3D visual hull for initialization. The 3D positions of these strands are then refined by optimizing an objective function that takes into account cross-view hair orientation consistency, the visual hull constraint and smoothness constraints defined at the strand, wisp and global levels. Based on the refined strands, the algorithm can reconstruct an approximate hair surface: experiments with synthetic hair models achieve an accuracy of ∼3mm. We also show real-world examples to demonstrate the capability to capture full-head hair styles as well as hair in motion with as few as 8 cameras.
2 0.94181454 52 cvpr-2013-Axially Symmetric 3D Pots Configuration System Using Axis of Symmetry and Break Curve
Author: Kilho Son, Eduardo B. Almeida, David B. Cooper
Abstract: This paper introduces a novel approach for reassembling pot sherds found at archaeological excavation sites, for the purpose of reconstructing clay pots that had been made on a wheel. These pots and the sherds into which they have broken are axially symmetric. The reassembly process can be viewed as 3D puzzle solving or generalized cylinder learning from broken fragments. The estimation exploits both local and semi-global geometric structure, thus making it a fundamental problem of geometry estimation from noisy fragments in computer vision and pattern recognition. The data used are densely digitized 3D laser scans of each fragment’s outer surface. The proposed reassembly system is automatic and functions when the pile of available fragments is from one or multiple pots, and even when pieces are missing from any pot. The geometric structures used are curves on the pot along which the surface had broken and the silhouette of a pot with respect to an axis, called the axis-profile curve (APC). For reassembling multiple pots with or without missing pieces, our algorithm estimates the APC from each fragment, then reassembles into configurations the ones having distinctive APCs. Further growth of configurations is based on adding remaining fragments such that their APC and break curves are consistent with those of a configuration. The method is novel, more robust and handles the largest numbers of fragments to date.
3 0.92971164 74 cvpr-2013-CLAM: Coupled Localization and Mapping with Efficient Outlier Handling
Author: Jonathan Balzer, Stefano Soatto
Abstract: We describe a method to efficiently generate a model (map) of small-scale objects from video. The map encodes sparse geometry as well as coarse photometry, and could be used to initialize dense reconstruction schemes as well as to support recognition and localization of three-dimensional objects. Self-occlusions and the predominance of outliers present a challenge to existing online Structure From Motion and Simultaneous Localization and Mapping systems. We propose a unified inference criterion that encompasses map building and localization (object detection) relative to the map in a coupled fashion. We establish correspondence in a computationally efficient way without resorting to combinatorial matching or random-sampling techniques. Instead, we use a simpler M-estimator that exploits putative correspondence from tracking after photometric and topological validation. We have collected a new dataset to benchmark model building in the small scale, which we test our algorithm on in comparison to others. Although our system is significantly leaner than previous ones, it compares favorably to the state of the art in terms of accuracy and robustness.
4 0.92038375 308 cvpr-2013-Nonlinearly Constrained MRFs: Exploring the Intrinsic Dimensions of Higher-Order Cliques
Author: Yun Zeng, Chaohui Wang, Stefano Soatto, Shing-Tung Yau
Abstract: This paper introduces an efficient approach to integrating non-local statistics into the higher-order Markov Random Fields (MRFs) framework. Motivated by the observation that many non-local statistics (e.g., shape priors, color distributions) can usually be represented by a small number of parameters, we reformulate the higher-order MRF model by introducing additional latent variables to represent the intrinsic dimensions of the higher-order cliques. The resulting new model, called NC-MRF, not only provides the flexibility in representing the configurations of higher-order cliques, but also automatically decomposes the energy function into less coupled terms, allowing us to design an efficient algorithmic framework for maximum a posteriori (MAP) inference. Based on this novel modeling/inference framework, we achieve state-of-the-art solutions to the challenging problems of class-specific image segmentation and template-based 3D facial expression tracking, which demonstrate the potential of our approach.
same-paper 5 0.90677679 40 cvpr-2013-An Approach to Pose-Based Action Recognition
Author: Chunyu Wang, Yizhou Wang, Alan L. Yuille
Abstract: We address action recognition in videos by modeling the spatial-temporal structures of human poses. We start by improving a state of the art method for estimating human joint locations from videos. More precisely, we obtain the K-best estimations output by the existing method and incorporate additional segmentation cues and temporal constraints to select the “best” one. Then we group the estimated joints into five body parts (e.g. the left arm) and apply data mining techniques to obtain a representation for the spatial-temporal structures of human actions. This representation captures the spatial configurations of body parts in one frame (by spatial-part-sets) as well as the body part movements (by temporal-part-sets) which are characteristic of human actions. It is interpretable, compact, and also robust to errors on joint estimations. Experimental results first show that our approach is able to localize body joints more accurately than existing methods. Next we show that it outperforms state of the art action recognizers on the UCF sport, the Keck Gesture and the MSR-Action3D datasets.
6 0.88336903 33 cvpr-2013-Active Contours with Group Similarity
7 0.87839615 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
8 0.87702537 414 cvpr-2013-Structure Preserving Object Tracking
9 0.87608504 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
10 0.87594175 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
11 0.87576276 325 cvpr-2013-Part Discovery from Partial Correspondence
12 0.87567472 311 cvpr-2013-Occlusion Patterns for Object Class Detection
13 0.87362194 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking
14 0.87337047 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
15 0.87205559 206 cvpr-2013-Human Pose Estimation Using Body Parts Dependent Joint Regressors
16 0.87201053 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image
17 0.87179184 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
18 0.87108552 360 cvpr-2013-Robust Estimation of Nonrigid Transformation for Point Set Registration
19 0.87051922 314 cvpr-2013-Online Object Tracking: A Benchmark
20 0.87023824 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs