cvpr cvpr2013 cvpr2013-277 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ben Sapp, Ben Taskar
Abstract: We propose a multimodal, decomposable model for articulated human pose estimation in monocular images. A typical approach to this problem is to use a linear structured model, which struggles to capture the wide range of appearance present in realistic, unconstrained images. In this paper, we instead propose a model of human pose that explicitly captures a variety of pose modes. Unlike other multimodal models, our approach includes both global and local pose cues and uses a convex objective and joint training for mode selection and pose estimation. We also employ a cascaded mode selection step which controls the trade-off between speed and accuracy, yielding a 5x speedup in inference and learning. Our model outperforms state-of-the-art approaches across the accuracy-speed trade-off curve for several pose datasets. This includes our newly-collected dataset of people in movies, FLIC, which contains an order of magnitude more labeled data for training and testing than existing datasets. The new dataset and code are available online.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose a multimodal, decomposable model for articulated human pose estimation in monocular images. [sent-2, score-0.329]
2 In this paper, we instead propose a model of human pose that explicitly captures a variety of pose modes. [sent-4, score-0.316]
3 Unlike other multimodal models, our approach includes both global and local pose cues and uses a convex objective and joint training for mode selection and pose estimation. [sent-5, score-1.036]
4 We also employ a cascaded mode selection step which controls the trade-off between speed and accuracy, yielding a 5x speedup in inference and learning. [sent-6, score-0.743]
5 Our model outperforms state-of-the-art approaches across the accuracy-speed trade-off curve for several pose datasets. [sent-7, score-0.159]
6 Introduction Human pose estimation from 2D images holds great potential to assist in a wide range of applications—for example, semantic indexing of images and videos, action recognition, activity analysis, and human computer interaction. [sent-11, score-0.206]
7 However, human pose estimation “in the wild” is an extremely challenging problem. [sent-12, score-0.206]
8 In this work, we focus explicitly on the multimodal nature of the 2D pose estimation problem. [sent-14, score-0.328]
9 Most models developed to estimate human pose in these varied settings extend the basic linear pictorial structures model (PS) [9, 14, 4, 1, 19, 15]. [sent-20, score-0.262]
10 In such models, part detectors are learned invariant to pose and appearance—e.g. [sent-21, score-0.157]
11 Recently there has been an explosion of successful work focused on increasing the number of modes in human pose models. [sent-26, score-0.509]
12 The models in this line of work in general can be described as instantiations of a family of compositional, hierarchical pose models. [sent-27, score-0.218]
13 Part modes at any level of granularity can capture different poses (e.g. [sent-28, score-0.384]
14 Also of crucial importance are details such as how models are trained, the computational demands of inference, and how modes are defined or discovered. [sent-33, score-0.384]
15 Importantly, increasing the number of modes leads to a computational complexity at least linear and at worst exponential in the number of modes and parts. [sent-34, score-0.652]
16 A key omission in recent multimodal models is efficient and joint inference and training. [sent-35, score-0.329]
17 In this paper, we present MODEC, a multimodal decomposable model with a focus on simplicity, speed and accuracy. [sent-36, score-0.286]
18 We define modes via clustering human body joint configurations in a normalized image-coordinate space, but mode definitions could easily be extended to be a function of image appearance as well. [sent-38, score-0.979]
19 Each mode corresponds to a discriminative structured linear model. [sent-39, score-0.545]
20 Thanks to the rich, multimodal nature of the model, we see performance improvements even with only computationally-cheap image gradient features. [sent-40, score-0.172]
21 As a testament to the richness of our set of modes, learning a flat SVM classifier on HOG features and predicting the mean pose of the predicted mode at test time performs surprisingly well. [sent-41, score-0.636]
22 The feature computation is expensive, and still fails at capturing the many appearance modes in real data. [sent-42, score-0.349]
23 Our MODEC model features explicit mode selection variables which are jointly inferred along with the best layout of body parts in the image. [sent-46, score-0.62]
24 Unlike some previous work, our method is also trained jointly (thus avoiding difficulties calibrating different submodel outputs) and includes both large-scope and local part-level cues (thus allowing it to effectively predict which mode to use). [sent-47, score-0.701]
25 Finally, we employ an initial structured cascade mode selection step which cheaply discards unlikely modes up front, yielding a 5× speedup in inference and learning over considering all modes for every example. [sent-48, score-1.138]
26 It also suggests a way to scale up to even more modes as larger datasets become available. [sent-53, score-0.326]
27 In general, prior works consider either only global modes, only local modes, or several multimodal levels of parts. [sent-56, score-0.172]
28 A second approach is to focus on modeling modes only at the part level, e.g. [sent-65, score-0.35]
29 If n parts each use k modes, this effectively gives up to k^n different instantiations of modes for the complete model through mixing and matching parts to form a single detection. [sent-68, score-0.423]
30 In the past few years, there have been many instantiations of the family of multimodal models. [sent-71, score-0.227]
31 Although combinatorially rich, this approach lacks the ability to reason about pose structure larger than a pair of parts at a time. [sent-74, score-0.175]
32 A second issue is that inference must consider a quadratic number of local mode combinations—e.g. [sent-76, score-0.556]
33 for each of k wrist types, k elbow types must be considered, resulting in inference message passing that is k² times larger than unimodal inference. [sent-78, score-0.332]
34 A third category of models considers global, local, and intermediate part-granularity modes [21, 17, 3, 18]. [sent-80, score-0.356]
35 All levels use image cues, allowing models to effectively represent mode appearance at different granularities. [sent-81, score-0.559]
36 The biggest downside to these richer models is their computational demands: First, quadratic mode inference is necessary, as with any local mode modeling. [sent-82, score-1.063]
37 Cliques zc in s(x, z) are associated with groups yc. [sent-85, score-0.165]
38 Each submodel sc(x, yc, zc) can be a typical graphical model over yc for a fixed instantiation of zc. [sent-86, score-0.256]
39 In contrast to the above, our model supports multimodal reasoning at the global level, as in [11, 25, 20]. [sent-89, score-0.172]
40 Unlike those, we explicitly reason about, represent cues for, and jointly learn to predict the correct global mode as well as location of parts. [sent-90, score-0.514]
41 Unlike local mode models such as [24], we do not require quadratic part-mode inference and can reason about larger structures. [sent-91, score-0.586]
42 Furthermore, we can learn and apply a mode filtering step to reduce the number of modes considered for each test image, speeding up learning and inference by a factor of 5. [sent-93, score-0.908]
43 Other local modeling methods: In the machine learning literature, there is a vast array of multimodal methods for prediction. [sent-94, score-0.172]
44 MODEC: Multimodal decomposable model We first describe our general multimodal decomposable (MODEC) structured model, and then an effective special case for 2D human pose estimation. [sent-101, score-0.573]
, the placement of P body parts in image coordinates), and special mode variables z = [z1, . . . , zK], [sent-109, score-0.62]
zi ∈ [1, M], which capture different modes of the input and output (e.g., [sent-112, score-0.326]
z corresponds to human joint configurations which might semantically be interpreted as modes such as arm-folded, arm-raised, arm-down, as in Figure 1). [sent-114, score-0.374]
s(x, y, z) = s(x, z) + Σ_{c∈C} sc(x, yc, zc) (1) This scores a choice of output variables y and mode variables z in example x. [sent-118, score-0.569]
49 The benefit of such a model over a non-multimodal one s(x, y) is that different modeling behaviors can be captured by the different mode submodels. [sent-122, score-0.477]
50 The first term in Equation 1 can capture structured relationships between the mode variables and the observed data. [sent-124, score-0.591]
51 Given such a scoring function, the goal is to determine the highest-scoring assignment to output variables y and mode variables z given a test example x: (y∗, z∗) = argmax_{y,z} s(x, y, z) (2) [sent-126, score-0.693]
52 notably pictorial structures models for human parsing, and star or tree models for object detection, e.g. [sent-133, score-0.159]
There is a one-to-many relationship from cliques zc to each variable in y: zc can be used to index multiple yi in different subsets yc, but each yi can only participate in factors with one zc. [sent-138, score-0.448]
54 This ensures during inference that the messages passed from the submodel terms to the mode-scoring term will maintain the decomposable structure of s(x, z). [sent-139, score-0.288]
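Read operationally, this decomposition licenses a simple inference scheme: run each mode's tractable submodel independently, add the mode-scoring term, and keep the overall argmax. Below is a minimal sketch of that scheme; the function names (`mode_score`, `submodel_argmax`) are illustrative placeholders, not the authors' released code.

```python
def modec_infer(x, modes, mode_score, submodel_argmax):
    """Pick the jointly best mode z* and part layout y* for input x.

    mode_score(x, z)      -> scalar s(x, z), the mode-scoring term
    submodel_argmax(x, z) -> (y, score): tractable structured inference in submodel z
    """
    y_best, z_best, s_best = None, None, float("-inf")
    for z in modes:                            # one tractable pass per mode
        y, part_score = submodel_argmax(x, z)
        total = mode_score(x, z) + part_score  # the two terms of Equation 1
        if total > s_best:
            y_best, z_best, s_best = y, z, total
    return y_best, z_best, s_best
```

Because each per-mode pass is independent, the loop parallelizes trivially across modes, matching the parallel-submodel description below.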
55 MODEC model for human pose estimation We tailor MODEC for human pose estimation as follows (model structure is shown in Figure 1). [sent-153, score-0.412]
56 We employ two mode variables, one for the left side of the body, one for the right. [sent-154, score-0.677]
57 Then each remaining local submodel can be run in parallel on a test image, and the argmax prediction is taken as a guess. [sent-155, score-0.234]
58 Thanks to joint inference and training objectives, all submodels are well calibrated with each other. [sent-156, score-0.203]
59 Again, this is indexed by the mode and is thus mode-specific, imposed because different pose modes have different geometric characteristics. [sent-182, score-0.968]
60 We employ the following form for our mode scoring term s(x, z): s(x, z) = w_{zℓ,zr} + w_{zℓ} · f(x, zℓ) + w_{zr} · f(x, zr)
The first term is a (zℓ, zr) mode compatibility score that encodes how likely each of the M modes on one side of the body is to co-occur with each of the M modes on the other side—expressing an affinity for common poses such as arms folded, arms down together, and dislike of uncommon left-right pose combinations. [sent-192, score-1.451]
62 The other two terms can be viewed as mode classifiers: each attempts to predict the correct left/right mode based on image features. [sent-193, score-0.954]
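Concretely, with M modes per side this term is an M × M table that can be scored exhaustively. A small sketch under assumed shapes follows; the weights here are random stand-ins, and sharing one classifier matrix across sides anticipates the left/right parameter sharing described later.

```python
import numpy as np

M, D = 32, 128                         # modes per side, feature dim (illustrative)
rng = np.random.default_rng(0)
compat = rng.standard_normal((M, M))   # w_{zl,zr}: left/right mode co-occurrence
W = rng.standard_normal((M, D))        # per-mode linear mode-classifier weights

def mode_score_table(f_left, f_right):
    """s(x, z) for every (zl, zr) pair: compatibility plus two mode classifiers."""
    s_l = W @ f_left                   # (M,) left-side mode classifier scores
    s_r = W @ f_right                  # (M,) right-side mode classifier scores
    return compat + s_l[:, None] + s_r[None, :]

table = mode_score_table(rng.standard_normal(D), rng.standard_normal(D))
zl, zr = np.unravel_index(np.argmax(table), table.shape)   # best mode pair
```

In practice f_right would be computed from the horizontally flipped image, which is how shared parameters serve both sides.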
63 In the next section we show a speedup using cascaded prediction to achieve inference sublinear in M. [sent-202, score-0.238]
64 Cascaded mode filtering The use of structured prediction cascades has been a successful tool for drastically reducing state spaces in structured problems [15, 23]. [sent-205, score-0.701]
65 Here we employ a simple multiclass cascade step to reduce the number of modes considered in MODEC. [sent-206, score-0.455]
66 Quickly rejecting modes has very appealing properties: (1) it gives us an easy way to trade off accuracy versus speed, allowing us to achieve very fast state-of-the-art parsing. [sent-207, score-0.355]
67 We use an unstructured cascade model where we filter each mode variable zℓ, zr independently.
68 We employ a linear cascade model of the form κ(x, z) = θz · φ(x, z) (6) whose purpose is to score the mode z in image x, in order to filter unlikely mode candidates. [sent-211, score-1.11]
69 The features of the model are φ(x, z) which capture the pose mode as a whole instead of individual local parts, and the parameters of the model are a linear set of weights for each mode, θz. [sent-212, score-0.61]
70 Following the cascade framework, we retain a set of mode possibilities M¯ ⊆ [1, M] after applying the cascade model: M¯ = {z | κ(x, z) ≥ α max_{z′∈[1,M]} κ(x, z′) + (1−α) (1/M) Σ_{z′∈[1,M]} κ(x, z′)}
The metaparameter α ∈ [0, 1) is set via cross-validation and dictates how aggressively to prune—between pruning everything but the max-scoring mode and pruning everything below the mean score. [sent-214, score-0.525]
72 Applying this cascade before running MODEC restricts the inference task to y∗, z∗ = argmax_{y, z∈M¯} s(x, y, z).
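A minimal sketch of this filtering rule, assuming the M cascade scores have already been computed for an image:

```python
import numpy as np

def filter_modes(kappa, alpha):
    """Retain modes whose cascade score clears an alpha-blend of max and mean.

    kappa: (M,) array of scores kappa(x, z); alpha in [0, 1) trades speed vs. recall.
    """
    thresh = alpha * kappa.max() + (1.0 - alpha) * kappa.mean()
    return np.flatnonzero(kappa >= thresh)   # indices forming the retained set M-bar

kappa = np.array([2.1, -0.3, 1.8, 0.2, -1.0])
print(filter_modes(kappa, alpha=0.0))   # mean threshold: keeps modes 0 and 2
print(filter_modes(kappa, alpha=0.9))   # near-max threshold: keeps mode 0 only
```

Raising α prunes more aggressively, which is exactly the speed/accuracy dial described above.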
73 Modes are obtained from the data by finding centers {μi}_{i=1}^M and example-mode membership sets S = {Si}_{i=1}^M in pose space that minimize reconstruction error under squared Euclidean distance: min_{S,μ} Σ_{i=1}^M Σ_{t∈Si} ||yt − μi||² (7)
where μi is the Euclidean mean of the joint locations of the examples in mode cluster Si. [sent-228, score-0.525]
75 We take the cluster membership as our supervised definition of mode membership in each training example, so that we augment the training set to be D = {(xt, yt, zt)}. [sent-230, score-0.619]
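This mode-discovery step is plain k-means in normalized pose space; a sketch follows, where the normalization itself (e.g., to a common image frame) is assumed to have happened upstream:

```python
import numpy as np

def discover_modes(Y, M, iters=50, seed=0):
    """k-means over normalized pose vectors Y (N, 2P): returns mean poses mu
    (M, 2P) and per-example mode labels z (N,), minimizing Equation 7."""
    rng = np.random.default_rng(seed)
    mu = Y[rng.choice(len(Y), size=M, replace=False)].copy()
    for _ in range(iters):
        d = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, M) distances
        z = d.argmin(axis=1)                                     # nearest center
        for i in range(M):
            if (z == i).any():
                mu[i] = Y[z == i].mean(axis=0)  # Euclidean mean of the cluster
    return mu, z

# Each row of Y stacks one example's 2D joint coordinates; the returned labels
# z are the supervised mode assignments z_t added to the training set D.
```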
76 Note that some of the modes are extremely difficult to describe at a local part level, such as arms severely foreshortened or crossed. [sent-232, score-0.385]
77 We seek to learn to correctly identify the mode and locations of parts in each example. [sent-234, score-0.519]
s(xt, yt, zt) ≥ 1 + s(xt, y, zt) − ξt, ∀y ≠ yt (8) s(xt, yt, zt) ≥ 1 + s(xt, y, z′) − ξt, ∀z′ ≠ zt, ∀y (9) In words, Equation 8 states that the score of the true joint configuration for submodel zt must be higher than zt’s score for any other (wrong) joint configuration in example t—the standard max-margin parsing constraint for a single structured model. [sent-239, score-0.582]
Equation 9 states that the score of the true configuration for zt must also be higher than all scores an incorrect submodel z′ ≠ zt assigns to any configuration y.
Under cascaded training these constraints are instead enforced only for z′ ∈ M¯t, ∀y ≠ yt. Note the use of M¯t, the subset of modes unfiltered by our mode prediction cascade for each example. [sent-249, score-1.194]
81 This is considerably faster than considering all M modes in each training example. [sent-250, score-0.362]
82 We use a cutting plane technique where we find the most violated constraint in every training example via structured inference (which can be done in one parallel step over all training examples). [sent-252, score-0.242]
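A sketch of that constraint-generation step for one example, assuming a unit margin with slack and abstracting the per-mode structured decoder behind a hypothetical `best_wrong_y` callable:

```python
def most_violated_constraint(x_t, y_t, z_t, candidate_modes, score, best_wrong_y):
    """Return the (y, z) pair whose margin constraint (Eq. 8-9) is most violated.

    score(x, y, z)          -> model score s(x, y, z)
    best_wrong_y(x, z, y_t) -> (y, value): highest-scoring y != y_t in submodel z
    """
    true_score = score(x_t, y_t, z_t)
    worst, worst_violation = None, 0.0
    for z in candidate_modes:                 # only modes the cascade kept
        y, value = best_wrong_y(x_t, z, y_t)
        violation = value + 1.0 - true_score  # unit margin, before slack
        if violation > worst_violation:
            worst, worst_violation = (y, z), violation
    return worst, worst_violation             # None means no violated constraint
```

Because each example's search is independent, this step runs in one parallel pass over the training set, as noted above.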
83 Finally, we share all parameters between the left and right side, and at test time simply flip the image horizontally to compute local part and mode scores for the other side. [sent-254, score-0.527]
In order to make the deformation cue a convex, unimodal penalty (and thus computable with distance transforms), we need to ensure that the corresponding parameters on these features, w^z_{ij}, are positive. [sent-266, score-0.166]
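One simple way to maintain this positivity during learning is to project the deformation coefficients after each parameter update; a minimal sketch (the projection is our illustration of the constraint, not necessarily the authors' exact mechanism):

```python
import numpy as np

def project_deformation_weights(w, eps=1e-6):
    """Clamp quadratic deformation coefficients to stay strictly positive, keeping
    the pairwise penalty convex/unimodal so distance transforms remain applicable."""
    return np.maximum(w, eps)

w_def = np.array([0.3, -0.05, 0.7, 0.0])    # e.g., coefficients on dx^2, dy^2 terms
w_def = project_deformation_weights(w_def)  # -> [0.3, 1e-06, 0.7, 1e-06]
```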
85 The Buffy and Pascal Stickmen datasets contain only hundreds of examples for training pose estimation models. [sent-280, score-0.192]
Increasing the training set size from 500 to 4000 examples improves test wrist and elbow localization accuracy from 32% to 42%. [sent-290, score-0.188]
87 The model of Yang & Ramanan [24] is multimodal at the level of local parts, and has no larger mode structure. [sent-310, score-0.649]
We ascribe its success over [24] to (1) the flexibility of 32 global modes, (2) large-granularity mode appearance terms, and (3) the ability to train all mode models jointly. [sent-321, score-1.358]
89 The “mean cluster prediction” involves predicting the mean pose of the most likely mode, where the most likely mode is determined directly from a 32-way SVM classifier using the same HOG features as our complete model. [sent-478, score-0.266]
90 This surprising result indicates the importance of multimodal modeling in even the simplest form. [sent-481, score-0.172]
91 Note that “full training”—considering all modes in every training example—rather than “cascaded training”—just the ones selected by the cascade step—leads to roughly a 1. [sent-490, score-0.451]
92 This allows us to perform joint training and inference to manage the competition between modes in a principled way. [sent-520, score-0.489]
93 Pictorial structures revisited: People detection and articulated pose estimation. [sent-527, score-0.181]
94 Poselets: Body part detectors trained using 3d human pose annotations. [sent-533, score-0.207]
95 Articulated human pose estimation and search in (almost) unconstrained still images. [sent-554, score-0.206]
their mode is overlaid on the left and right side of each image. [sent-573, score-0.51]
97 The mode chosen by MODEC is highlighted in green. [sent-574, score-0.477]
98 Articulated part-based model for joint object detection and pose estimation. [sent-636, score-0.181]
99 Exploring the spatial hierarchy of mixture models for human pose estimation. [sent-643, score-0.213]
100 Multiple tree models for occlusion and spatial constraints in human pose estimation. [sent-655, score-0.213]
wordName wordTfidf (topN-words)
[('mode', 0.477), ('modec', 0.448), ('modes', 0.326), ('flic', 0.201), ('multimodal', 0.172), ('zc', 0.165), ('zt', 0.157), ('zr', 0.156), ('submodel', 0.134), ('pose', 0.133), ('yc', 0.122), ('unimodal', 0.099), ('yt', 0.094), ('cascade', 0.089), ('xt', 0.084), ('inference', 0.079), ('cascaded', 0.076), ('decomposable', 0.075), ('structured', 0.068), ('izj', 0.067), ('wizj', 0.067), ('elbows', 0.066), ('elbow', 0.063), ('wrist', 0.063), ('sapp', 0.06), ('instantiations', 0.055), ('maxz', 0.055), ('body', 0.055), ('wrists', 0.052), ('prediction', 0.051), ('human', 0.05), ('pictorial', 0.049), ('scoring', 0.049), ('articulated', 0.048), ('joint', 0.048), ('yi', 0.047), ('ben', 0.046), ('variables', 0.046), ('acci', 0.045), ('hopes', 0.045), ('wzi', 0.045), ('buffy', 0.044), ('parts', 0.042), ('eichner', 0.041), ('employ', 0.04), ('cps', 0.04), ('submodels', 0.04), ('equation', 0.039), ('speed', 0.039), ('sc', 0.038), ('cues', 0.037), ('cascades', 0.037), ('stickmen', 0.037), ('multimodality', 0.037), ('training', 0.036), ('arms', 0.035), ('membership', 0.035), ('forearm', 0.035), ('parsing', 0.033), ('side', 0.033), ('foreshortening', 0.032), ('maxy', 0.032), ('indexed', 0.032), ('speedup', 0.032), ('poses', 0.031), ('pascal', 0.031), ('arm', 0.03), ('models', 0.03), ('retrained', 0.03), ('rich', 0.029), ('allowing', 0.029), ('cyclic', 0.029), ('message', 0.028), ('demands', 0.028), ('yj', 0.028), ('unlikely', 0.027), ('granularity', 0.027), ('tractable', 0.027), ('weiss', 0.026), ('test', 0.026), ('fij', 0.026), ('curve', 0.026), ('unlike', 0.026), ('flexibility', 0.025), ('movies', 0.025), ('slack', 0.025), ('cliques', 0.024), ('compositional', 0.024), ('part', 0.024), ('difficulties', 0.024), ('wr', 0.024), ('pruning', 0.024), ('appearance', 0.023), ('parallel', 0.023), ('clothing', 0.023), ('estimation', 0.023), ('possibilities', 0.023), ('joints', 0.023), ('clutter', 0.023), ('tran', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 277 cvpr-2013-MODEC: Multimodal Decomposable Models for Human Pose Estimation
Author: Ben Sapp, Ben Taskar
Abstract: We propose a multimodal, decomposable model for articulated human pose estimation in monocular images. A typical approach to this problem is to use a linear structured model, which struggles to capture the wide range of appearance present in realistic, unconstrained images. In this paper, we instead propose a model of human pose that explicitly captures a variety of pose modes. Unlike other multimodal models, our approach includes both global and local pose cues and uses a convex objective and joint training for mode selection and pose estimation. We also employ a cascaded mode selection step which controls the trade-off between speed and accuracy, yielding a 5x speedup in inference and learning. Our model outperforms state-of-the-art approaches across the accuracy-speed trade-off curve for several pose datasets. This includes our newly-collected dataset of people in movies, FLIC, which contains an order of magnitude more labeled data for training and testing than existing datasets. The new dataset and code are available online.
2 0.13958135 159 cvpr-2013-Expressive Visual Text-to-Speech Using Active Appearance Models
Author: Robert Anderson, Björn Stenger, Vincent Wan, Roberto Cipolla
Abstract: This paper presents a complete system for expressive visual text-to-speech (VTTS), which is capable of producing expressive output, in the form of a ‘talking head’, given an input text and a set of continuous expression weights. The face is modeled using an active appearance model (AAM), and several extensions are proposed which make it more applicable to the task of VTTS. The model allows for normalization with respect to both pose and blink state which significantly reduces artifacts in the resulting synthesized sequences. We demonstrate quantitative improvements in terms of reconstruction error over a million frames, as well as in large-scale user studies, comparing the output of different systems.
3 0.1339224 335 cvpr-2013-Poselet Conditioned Pictorial Structures
Author: Leonid Pishchulin, Mykhaylo Andriluka, Peter Gehler, Bernt Schiele
Abstract: In this paper we consider the challenging problem of articulated human pose estimation in still images. We observe that despite high variability of the body articulations, human motions and activities often simultaneously constrain the positions of multiple body parts. Modelling such higher order part dependencies seemingly comes at a cost of more expensive inference, which resulted in their limited use in state-of-the-art methods. In this paper we propose a model that incorporates higher order part dependencies while remaining efficient. We achieve this by defining a conditional model in which all body parts are connected a-priori, but which becomes a tractable tree-structured pictorial structures model once the image observations are available. In order to derive a set of conditioning variables we rely on the poselet-based features that have been shown to be effective for people detection but have so far found limited application for articulated human pose estimation. We demonstrate the effectiveness of our approach on three publicly available pose estimation benchmarks improving or being on-par with state of the art in each case.
4 0.12547857 207 cvpr-2013-Human Pose Estimation Using a Joint Pixel-wise and Part-wise Formulation
Author: Ľ
Abstract: Our goal is to detect humans and estimate their 2D pose in single images. In particular, handling cases of partial visibility where some limbs may be occluded or one person is partially occluding another. Two standard, but disparate, approaches have developed in the field: the first is the part based approach for layout type problems, involving optimising an articulated pictorial structure; the second is the pixel based approach for image labelling involving optimising a random field graph defined on the image. Our novel contribution is a formulation for pose estimation which combines these two models in a principled way in one optimisation problem and thereby inherits the advantages of both of them. Inference on this joint model finds the set of instances of persons in an image, the location of their joints, and a pixel-wise body part labelling. We achieve near or state of the art results on standard human pose data sets, and demonstrate the correct estimation for cases of self-occlusion, person overlap and image truncation.
5 0.12247634 60 cvpr-2013-Beyond Physical Connections: Tree Models in Human Pose Estimation
Author: Fang Wang, Yi Li
Abstract: Simple tree models for articulated objects have prevailed in the last decade. However, it is also believed that these simple tree models are not capable of capturing large variations in many scenarios, such as human pose estimation. This paper attempts to address three questions: 1) are simple tree models sufficient? more specifically, 2) how to use tree models effectively in human pose estimation? and 3) how shall we use combined parts together with single parts efficiently? Assuming we have a set of single parts and combined parts, and the goal is to estimate a joint distribution of their locations. We surprisingly find that no latent variables are introduced in the Leeds Sport Dataset (LSP) during learning latent trees for deformable model, which aims at approximating the joint distributions of body part locations using minimal tree structure. This suggests one can straightforwardly use a mixed representation of single and combined parts to approximate their joint distribution in a simple tree model. As such, one only needs to build Visual Categories of the combined parts, and then perform inference on the learned latent tree. Our method outperformed the state of the art on the LSP, both in the scenarios when the training images are from the same dataset and from the PARSE dataset. Experiments on animal images from the VOC challenge further support our findings.
6 0.12120822 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image
7 0.11929518 206 cvpr-2013-Human Pose Estimation Using Body Parts Dependent Joint Regressors
8 0.11828052 334 cvpr-2013-Pose from Flow and Flow from Pose
9 0.11728413 430 cvpr-2013-The SVM-Minus Similarity Score for Video Face Recognition
10 0.11515995 89 cvpr-2013-Computationally Efficient Regression on a Dependency Graph for Human Pose Estimation
11 0.11172097 380 cvpr-2013-Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images
12 0.10865497 444 cvpr-2013-Unconstrained Monocular 3D Human Pose Estimation by Action Detection and Cross-Modality Regression Forest
13 0.10338959 40 cvpr-2013-An Approach to Pose-Based Action Recognition
14 0.10084753 82 cvpr-2013-Class Generative Models Based on Feature Regression for Pose Estimation of Object Categories
15 0.099251889 324 cvpr-2013-Part-Based Visual Tracking with Online Latent Structural Learning
16 0.095675781 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
17 0.088804543 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
18 0.088695072 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection
19 0.087837785 45 cvpr-2013-Articulated Pose Estimation Using Discriminative Armlet Classifiers
20 0.087080896 439 cvpr-2013-Tracking Human Pose by Tracking Symmetric Parts
topicId topicWeight
[(0, 0.179), (1, -0.034), (2, -0.007), (3, -0.085), (4, 0.024), (5, 0.016), (6, 0.08), (7, 0.058), (8, 0.046), (9, -0.071), (10, -0.051), (11, 0.105), (12, -0.054), (13, 0.002), (14, -0.05), (15, 0.026), (16, 0.035), (17, -0.04), (18, -0.024), (19, -0.039), (20, -0.025), (21, -0.0), (22, -0.05), (23, -0.011), (24, -0.02), (25, 0.042), (26, -0.048), (27, 0.03), (28, 0.045), (29, 0.0), (30, 0.054), (31, 0.009), (32, -0.019), (33, 0.001), (34, -0.014), (35, 0.004), (36, -0.017), (37, 0.053), (38, -0.04), (39, 0.037), (40, 0.037), (41, -0.014), (42, -0.014), (43, 0.03), (44, -0.021), (45, 0.024), (46, 0.082), (47, -0.009), (48, -0.006), (49, 0.084)]
simIndex simValue paperId paperTitle
same-paper 1 0.9439773 277 cvpr-2013-MODEC: Multimodal Decomposable Models for Human Pose Estimation
Author: Ben Sapp, Ben Taskar
Abstract: We propose a multimodal, decomposable model for articulated human pose estimation in monocular images. A typical approach to this problem is to use a linear structured model, which struggles to capture the wide range of appearance present in realistic, unconstrained images. In this paper, we instead propose a model of human pose that explicitly captures a variety of pose modes. Unlike other multimodal models, our approach includes both global and local pose cues and uses a convex objective and joint training for mode selection and pose estimation. We also employ a cascaded mode selection step which controls the trade-off between speed and accuracy, yielding a 5x speedup in inference and learning. Our model outperforms state-of-the-art approaches across the accuracy-speed trade-off curve for several pose datasets. This includes our newly-collected dataset of people in movies, FLIC, which contains an order of magnitude more labeled data for training and testing than existing datasets. The new dataset and code are available online.
2 0.83845121 45 cvpr-2013-Articulated Pose Estimation Using Discriminative Armlet Classifiers
Author: Georgia Gkioxari, Pablo Arbeláez, Lubomir Bourdev, Jitendra Malik
Abstract: We propose a novel approach for human pose estimation in real-world cluttered scenes, and focus on the challenging problem of predicting the pose of both arms for each person in the image. For this purpose, we build on the notion of poselets [4] and train highly discriminative classifiers to differentiate among arm configurations, which we call armlets. We propose a rich representation which, in addition to standard HOG features, integrates the information of strong contours, skin color and contextual cues in a principled manner. Unlike existing methods, we evaluate our approach on a large subset of images from the PASCAL VOC detection dataset, where critical visual phenomena, such as occlusion, truncation, multiple instances and clutter are the norm. Our approach outperforms Yang and Ramanan [26], the state-of-the-art technique, with an improvement from 29.0% to 37.5% PCP accuracy on the arm keypoint prediction task, on this new pose estimation dataset.
3 0.82597041 335 cvpr-2013-Poselet Conditioned Pictorial Structures
Author: Leonid Pishchulin, Mykhaylo Andriluka, Peter Gehler, Bernt Schiele
Abstract: In this paper we consider the challenging problem of articulated human pose estimation in still images. We observe that despite high variability of the body articulations, human motions and activities often simultaneously constrain the positions of multiple body parts. Modelling such higher order part dependencies seemingly comes at a cost of more expensive inference, which resulted in their limited use in state-of-the-art methods. In this paper we propose a model that incorporates higher order part dependencies while remaining efficient. We achieve this by defining a conditional model in which all body parts are connected a-priori, but which becomes a tractable tree-structured pictorial structures model once the image observations are available. In order to derive a set of conditioning variables we rely on the poselet-based features that have been shown to be effective for people detection but have so far found limited application for articulated human pose estimation. We demonstrate the effectiveness of our approach on three publicly available pose estimation benchmarks improving or being on-par with state of the art in each case.
4 0.8191067 89 cvpr-2013-Computationally Efficient Regression on a Dependency Graph for Human Pose Estimation
Author: Kota Hara, Rama Chellappa
Abstract: We present a hierarchical method for human pose estimation from a single still image. In our approach, a dependency graph representing relationships between reference points such as body joints is constructed and the positions of these reference points are sequentially estimated by a successive application of multidimensional output regressions along the dependency paths, starting from the root node. Each regressor takes image features computed from an image patch centered on the current node's position estimated by the previous regressor and is specialized for estimating its child nodes' positions. The use of the dependency graph allows us to decompose a complex pose estimation problem into a set of local pose estimation problems that are less complex. We design a dependency graph for two commonly used human pose estimation datasets, the Buffy Stickmen dataset and the ETHZ PASCAL Stickmen dataset, and demonstrate that our method achieves comparable accuracy to state-of-the-art results on both datasets with significantly lower computation time than existing methods. Furthermore, we propose an importance weighted boosted regression trees method for transductive learning settings and demonstrate the resulting improved performance for pose estimation tasks.
5 0.81702787 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image
Author: Edgar Simo-Serra, Ariadna Quattoni, Carme Torras, Francesc Moreno-Noguer
Abstract: We introduce a novel approach to automatically recover 3D human pose from a single image. Most previous work follows a pipelined approach: initially, a set of 2D features such as edges, joints or silhouettes are detected in the image, and then these observations are used to infer the 3D pose. Solving these two problems separately may lead to erroneous 3D poses when the feature detector has performed poorly. In this paper, we address this issue by jointly solving both the 2D detection and the 3D inference problems. For this purpose, we propose a Bayesian framework that integrates a generative model based on latent variables and discriminative 2D part detectors based on HOGs, and perform inference using evolutionary algorithms. Real experimentation demonstrates competitive results, and the ability of our methodology to provide accurate 2D and 3D pose estimations even when the 2D detectors are inaccurate.
6 0.80870944 206 cvpr-2013-Human Pose Estimation Using Body Parts Dependent Joint Regressors
7 0.78426707 207 cvpr-2013-Human Pose Estimation Using a Joint Pixel-wise and Part-wise Formulation
8 0.77700007 2 cvpr-2013-3D Pictorial Structures for Multiple View Articulated Pose Estimation
9 0.76478559 60 cvpr-2013-Beyond Physical Connections: Tree Models in Human Pose Estimation
10 0.72752869 82 cvpr-2013-Class Generative Models Based on Feature Regression for Pose Estimation of Object Categories
11 0.66528654 40 cvpr-2013-An Approach to Pose-Based Action Recognition
12 0.65647423 444 cvpr-2013-Unconstrained Monocular 3D Human Pose Estimation by Action Detection and Cross-Modality Regression Forest
13 0.65286136 439 cvpr-2013-Tracking Human Pose by Tracking Symmetric Parts
14 0.6524021 426 cvpr-2013-Tensor-Based Human Body Modeling
15 0.64286709 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
16 0.61668122 153 cvpr-2013-Expanded Parts Model for Human Attribute and Action Recognition in Still Images
17 0.60105902 334 cvpr-2013-Pose from Flow and Flow from Pose
18 0.5868203 159 cvpr-2013-Expressive Visual Text-to-Speech Using Active Appearance Models
19 0.58296615 143 cvpr-2013-Efficient Large-Scale Structured Learning
20 0.56760156 120 cvpr-2013-Detecting and Naming Actors in Movies Using Generative Appearance Models
topicId topicWeight
[(10, 0.118), (16, 0.021), (26, 0.053), (28, 0.012), (33, 0.214), (65, 0.232), (67, 0.125), (69, 0.034), (80, 0.017), (87, 0.078)]
simIndex simValue paperId paperTitle
1 0.88423055 435 cvpr-2013-Towards Contactless, Low-Cost and Accurate 3D Fingerprint Identification
Author: Ajay Kumar, Cyril Kwong
Abstract: In order to avail the benefits of higher user convenience, hygiene, and improved accuracy, contactless 3D fingerprint recognition techniques have recently been introduced. One of the key limitations of these emerging 3D fingerprint technologies to replace the conventional 2D fingerprint system is their bulk and high cost, which mainly results from the use of multiple imaging cameras or structured lighting employed in these systems. This paper details the development of a contactless 3D fingerprint identification system that uses only a single camera. We develop a new representation of 3D finger surface features using Finger Surface Codes and illustrate its effectiveness in matching 3D fingerprints. Conventional minutiae representation is extended in 3D space to accurately match the recovered 3D minutiae. Multiple 2D fingerprint images (with varying illumination profile) acquired to build 3D fingerprints can themselves be used to recover 2D features for further improving 3D fingerprint identification, as illustrated in this paper. The experimental results are shown on a database of 240 client fingerprints and confirm the advantages of the single camera based 3D fingerprint identification.
2 0.84739757 176 cvpr-2013-Five Shades of Grey for Fast and Reliable Camera Pose Estimation
Author: Adam Herout, István Szentandrási, Michal Zachariáš, Markéta Dubská, Rudolf Kajan
Abstract: We introduce here an improved design of the Uniform Marker Fields and an algorithm for their fast and reliable detection. Our concept of the marker field is designed so that it can be detected and recognized for camera pose estimation: in various lighting conditions, under a severe perspective, while heavily occluded, and under a strong motion blur. Our marker field detection harnesses the fact that the edges within the marker field meet at two vanishing points and that the projected planar grid of squares can be defined by a detectable mathematical formalism. The modules of the grid are greyscale and the locations within the marker field are defined by the edges between the modules. The assumption that the marker field is planar allows for a very cheap and reliable camera pose estimation in the captured scene. The detection rates and accuracy are slightly better compared to state-of-the-art marker-based solutions. At the same time, and more importantly, our detector of the marker field is several times faster and the reliable real-time detection can be thus achieved on mobile and low-power devices. We show three targeted applications where the planarity is assured and where the presented marker field design and detection algorithm provide a reliable and extremely fast solution.
3 0.82711834 139 cvpr-2013-Efficient 3D Endfiring TRUS Prostate Segmentation with Globally Optimized Rotational Symmetry
Author: Jing Yuan, Wu Qiu, Eranga Ukwatta, Martin Rajchl, Xue-Cheng Tai, Aaron Fenster
Abstract: Segmenting 3D endfiring transrectal ultrasound (TRUS) prostate images efficiently and accurately is of utmost importance for the planning and guiding 3D TRUS guided prostate biopsy. Poor image quality and imaging artifacts of 3D TRUS images often introduce a challenging task in computation to directly extract the 3D prostate surface. In this work, we propose a novel global optimization approach to delineate 3D prostate boundaries using its rotational resliced images around a specified axis, which properly enforces the inherent rotational symmetry of prostate shapes to jointly adjust a series of 2D slicewise segmentations in the global 3D sense. We show that the introduced challenging combinatorial optimization problem can be solved globally and exactly by means of convex relaxation. In this regard, we propose a novel coupled continuous max-flow model, which not only provides a powerful mathematical tool to analyze the proposed optimization problem but also amounts to a new and efficient duality-based algorithm. Extensive experiments demonstrate that the proposed method significantly outperforms the state-of-art methods in terms of efficiency, accuracy, reliability and less user-interactions, and reduces the execution time by a factor of 100.
4 0.82268667 177 cvpr-2013-FrameBreak: Dramatic Image Extrapolation by Guided Shift-Maps
Author: Yinda Zhang, Jianxiong Xiao, James Hays, Ping Tan
Abstract: We significantly extrapolate the field of view of a photograph by learning from a roughly aligned, wide-angle guide image of the same scene category. Our method can extrapolate typical photos into complete panoramas. The extrapolation problem is formulated in the shift-map image synthesis framework. We analyze the self-similarity of the guide image to generate a set of allowable local transformations and apply them to the input image. Our guided shift-map method preserves the scene layout of the guide image when extrapolating a photograph. While conventional shift-map methods only support translations, this is not expressive enough to characterize the self-similarity of complex scenes. Therefore we additionally allow image transformations of rotation, scaling and reflection. To handle this increase in complexity, we introduce a hierarchical graph optimization method to choose the optimal transformation at each output pixel. We demonstrate our approach on a variety of indoor, outdoor, natural, and man-made scenes.
same-paper 5 0.81887794 277 cvpr-2013-MODEC: Multimodal Decomposable Models for Human Pose Estimation
Author: Ben Sapp, Ben Taskar
Abstract: We propose a multimodal, decomposable model for articulated human pose estimation in monocular images. A typical approach to this problem is to use a linear structured model, which struggles to capture the wide range of appearance present in realistic, unconstrained images. In this paper, we instead propose a model of human pose that explicitly captures a variety of pose modes. Unlike other multimodal models, our approach includes both global and local pose cues and uses a convex objective and joint training for mode selection and pose estimation. We also employ a cascaded mode selection step which controls the trade-off between speed and accuracy, yielding a 5x speedup in inference and learning. Our model outperforms state-of-the-art approaches across the accuracy-speed trade-off curve for several pose datasets. This includes our newly-collected dataset of people in movies, FLIC, which contains an order of magnitude more labeled data for training and testing than existing datasets. The new dataset and code are available online.
6 0.81871372 310 cvpr-2013-Object-Centric Anomaly Detection by Attribute-Based Reasoning
7 0.79360896 80 cvpr-2013-Category Modeling from Just a Single Labeling: Use Depth Information to Guide the Learning of 2D Models
8 0.77496719 7 cvpr-2013-A Divide-and-Conquer Method for Scalable Low-Rank Latent Matrix Pursuit
9 0.76255697 2 cvpr-2013-3D Pictorial Structures for Multiple View Articulated Pose Estimation
10 0.75715214 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation
11 0.75501132 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
12 0.75329089 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection
13 0.75114191 45 cvpr-2013-Articulated Pose Estimation Using Discriminative Armlet Classifiers
14 0.75091088 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
15 0.74928302 160 cvpr-2013-Face Recognition in Movie Trailers via Mean Sequence Sparse Representation-Based Classification
16 0.74805069 345 cvpr-2013-Real-Time Model-Based Rigid Object Pose Estimation and Tracking Combining Dense and Sparse Visual Cues
17 0.74790472 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence
18 0.74744529 414 cvpr-2013-Structure Preserving Object Tracking
19 0.74715221 288 cvpr-2013-Modeling Mutual Visibility Relationship in Pedestrian Detection
20 0.74528599 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation