cvpr cvpr2013 cvpr2013-277
Author: Ben Sapp, Ben Taskar
Abstract: We propose a multimodal, decomposable model for articulated human pose estimation in monocular images. A typical approach to this problem is to use a linear structured model, which struggles to capture the wide range of appearance present in realistic, unconstrained images. In this paper, we instead propose a model of human pose that explicitly captures a variety of pose modes. Unlike other multimodal models, our approach includes both global and local pose cues and uses a convex objective and joint training for mode selection and pose estimation. We also employ a cascaded mode selection step which controls the trade-off between speed and accuracy, yielding a 5x speedup in inference and learning. Our model outperforms state-of-the-art approaches across the accuracy-speed trade-off curve for several pose datasets. This includes our newly-collected dataset of people in movies, FLIC, which contains an order of magnitude more labeled data for training and testing than existing datasets. The new dataset and code are available online.
sentIndex sentText sentNum sentScore
1 We propose a multimodal, decomposable model for articulated human pose estimation in monocular images. [sent-2, score-0.329]
2 In this paper, we instead propose a model of human pose that explicitly captures a variety of pose modes. [sent-4, score-0.316]
3 Unlike other multimodal models, our approach includes both global and local pose cues and uses a convex objective and joint training for mode selection and pose estimation. [sent-5, score-1.036]
4 We also employ a cascaded mode selection step which controls the trade-off between speed and accuracy, yielding a 5x speedup in inference and learning. [sent-6, score-0.743]
5 Our model outperforms state-of-the-art approaches across the accuracy-speed trade-off curve for several pose datasets. [sent-7, score-0.159]
6 Introduction. Human pose estimation from 2D images holds great potential to assist in a wide range of applications, for example semantic indexing of images and videos, action recognition, activity analysis, and human computer interaction. [sent-11, score-0.206]
7 However, human pose estimation “in the wild” is an extremely challenging problem. [sent-12, score-0.206]
8 In this work, we focus explicitly on the multimodal nature of the 2D pose estimation problem. [sent-14, score-0.328]
9 Most models developed to estimate human pose in these varied settings extend the basic linear pictorial structures model (PS) [9, 14, 4, 1, 19, 15]. [sent-20, score-0.262]
10 In such models, part detectors are learned invariant to pose and appearance—e. [sent-21, score-0.157]
11 Recently there has been an explosion of successful work focused on increasing the number of modes in human pose models. [sent-26, score-0.509]
12 The models in this line of work in general can be described as instantiations of a family of compositional, hierarchical pose models. [sent-27, score-0.218]
13 Part modes at any level of granularity can capture different poses (e. [sent-28, score-0.384]
14 Also of crucial importance are details such as how models are trained, the computational demands of inference, and how modes are defined or discovered. [sent-33, score-0.384]
15 Importantly, increasing the number of modes leads to a computational complexity at least linear and at worst exponential in the number of modes and parts. [sent-34, score-0.652]
16 A key omission in recent multimodal models is efficient and joint inference and training. [sent-35, score-0.329]
17 In this paper, we present MODEC, a multimodal decomposable model with a focus on simplicity, speed and accuracy. [sent-36, score-0.286]
18 We define modes via clustering human body joint configurations in a normalized image-coordinate space, but mode definitions could easily be extended to be a function of image appearance as well. [sent-38, score-0.979]
19 Each mode corresponds to a discriminative structured linear model. [sent-39, score-0.545]
20 Thanks to the rich, multimodal nature of the model, we see performance improvements even with only computationally-cheap image gradient features. [sent-40, score-0.172]
21 As a testament to the richness of our set of modes, learning a flat SVM classifier on HOG features and predicting the mean pose of the predicted mode at test time performs [...] model. [sent-41, score-0.636]
22 The feature computation is expensive, and still fails at capturing the many appearance modes in real data. [sent-42, score-0.349]
23 Our MODEC model features explicit mode selection variables which are jointly inferred along with the best layout of body parts in the image. [sent-46, score-0.62]
24 Unlike some previous work, our method is also trained jointly (thus avoiding difficulties calibrating different submodel outputs) and includes both large-scope and local part-level cues (thus allowing it to effectively predict which mode to use). [sent-47, score-0.701]
25 Finally, we employ an initial structured cascade mode selection step which cheaply discards unlikely modes up front, yielding a 5× speedup in inference and learning over considering all modes for every example. [sent-48, score-1.138]
26 It also suggests a way to scale up to even more modes as larger datasets become available. [sent-53, score-0.326]
27 In general, works either consider only global modes, local modes, or several multimodal levels of parts. [sent-56, score-0.172]
28 A second approach is to focus on modeling modes only at the part level, e. [sent-65, score-0.35]
29 If n parts each use k modes, this effectively gives up to k^n different instantiations of modes for the complete model through mixing- to form a single detection. [sent-68, score-0.423]
30 In the past few years, there have been many instantiations of the family of multimodal models. [sent-71, score-0.227]
31 Although combinatorially rich, this approach lacks the ability to reason about pose structure larger than a pair of parts at a time. [sent-74, score-0.175]
32 A second issue is that inference must consider a quadratic number of local mode combinations—e. [sent-76, score-0.556]
33 for each of k wrist types, k elbow types must be considered, resulting in inference message passing that is k^2 times larger than unimodal inference. [sent-78, score-0.332]
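As a rough illustration of the counts discussed above, the sketch below tallies the number of whole-model mode combinations and the work for a single parent-child message. The function names, and the assumption of one linear-time pass over locations per pair of types, are illustrative choices and are not taken from the paper.

```python
def num_instantiations(num_parts: int, modes_per_part: int) -> int:
    """Upper bound on whole-model mode combinations: k^n for n parts with k modes each."""
    return modes_per_part ** num_parts


def pairwise_message_cost(num_locations: int, types_per_part: int) -> int:
    """Work for one parent-child message, ASSUMING one linear-time pass over
    num_locations for every (parent type, child type) pair."""
    return types_per_part * types_per_part * num_locations
```

With one type per part the message cost reduces to the unimodal case, and with k types it grows by a factor of k^2.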
34 A third category of models considers global, local and intermediate part-granularity level modes [21, 17, 3, 18]. [sent-80, score-0.356]
35 All levels use image cues, allowing models to effectively represent mode appearance at different granularities. [sent-81, score-0.559]
36 The biggest downside to these richer models is their computational demands: First, quadratic mode inference is necessary, as with any local mode modeling. [sent-82, score-1.063]
37 Cliques zc in s(x, z) are associated with groups yc. [sent-85, score-0.165]
38 Each submodel sc(x, yc, zc) can be a typical graphical model over yc for a fixed instantiation of zc. [sent-86, score-0.256]
39 In contrast to the above, our model supports multimodal reasoning at the global level, as in [11, 25, 20]. [sent-89, score-0.172]
40 Unlike those, we explicitly reason about, represent cues for, and jointly learn to predict the correct global mode as well as location of parts. [sent-90, score-0.514]
41 Unlike local mode models such as [24], we do not require quadratic part-mode inference and can reason about larger structures. [sent-91, score-0.586]
42 Furthermore, we can learn and apply a mode filtering step to reduce the number of modes considered for each test image, speeding up learning and inference by a factor of 5. [sent-93, score-0.908]
43 Other local modeling methods: In the machine learning literature, there is a vast array of multimodal methods for prediction. [sent-94, score-0.172]
44 MODEC: Multimodal decomposable model. We first describe our general multimodal decomposable (MODEC) structured model, and then an effective special case for 2D human pose estimation. [sent-101, score-0.573]
45 , the placement of P body parts in image coordinates), and special mode variables z = [z1, . [sent-109, score-0.62]
46 , zK] , zi ∈ [1, M] which capture different modes of the input and output (e. [sent-112, score-0.326]
47 , z corresponds to human joint configurations which might semantically be interpreted as modes such as arm-folded, arm-raised, arm-down as in Figure 1). [sent-114, score-0.374]
48 This scores a choice of output variables y and mode variables z in example x. [sent-118, score-0.569]
49 The benefit of such a model over a non-multimodal one s(x, y) is that different modeling behaviors can be captured by the different mode submodels. [sent-122, score-0.477]
50 The first term in Equation 1 can capture structured relationships between the mode variables and the observed data. [sent-124, score-0.591]
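A minimal sketch of how such a score could be evaluated, assuming Equation 1 is read as a mode-scoring term s(x, z) plus a sum of per-clique submodel scores s_c(x, y_c, z_c), which is how the surrounding text describes it. The function name `modec_score` and the callable-based interface are hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Sequence


def modec_score(
    mode_score: Callable[[Any, Sequence[int]], float],
    submodels: Sequence[Callable[[Any, Any, int], float]],
    x: Any,
    y_groups: Sequence[Any],
    z: Sequence[int],
) -> float:
    """s(x, y, z) read as s(x, z) + sum over cliques c of s_c(x, y_c, z_c)."""
    total = mode_score(x, z)  # first term: modes vs. observed data
    for s_c, y_c, z_c in zip(submodels, y_groups, z):
        total += s_c(x, y_c, z_c)  # mode-specific submodel for clique c
    return total
```

Because each output group y_c only interacts with its own mode variable z_c, the submodel terms can be maximized independently for each fixed mode before the mode term is considered.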
51 Given such a scoring function, the goal is to determine the highest scoring value to output variables y and mode variables z given a test example x: z? [sent-126, score-0.693]
52 notably pictorial structures models for human parsing, and star or tree models for object detection, e. [sent-133, score-0.159]
53 (3) There is a one-to-many relationship from cliques zc to each variable in y: zc can be used to index multiple yi in different subsets yc, but each yi can only participate in factors with one zc. [sent-138, score-0.448]
54 This ensures during inference that the messages passed from the submodel terms to the mode-scoring term will maintain the decomposable structure of s(x, z). [sent-139, score-0.288]
55 MODEC model for human pose estimation. We tailor MODEC for human pose estimation as follows (model structure is shown in Figure 1). [sent-153, score-0.412]
56 We employ two mode variables, one for the left side of the body, one for [...] cascaded prediction step. [sent-154, score-0.677]
57 Then each remaining local submodel can be run in parallel on a test image, and the argmax prediction is taken as a guess. [sent-155, score-0.234]
58 Thanks to joint inference and training objectives, all submodels are well calibrated with each other. [sent-156, score-0.203]
59 Again, this is indexed by the mode and is thus mode-specific, imposed because different pose modes have different geometric characteristics. [sent-182, score-0.968]
60 We employ the following form for our mode scoring term s(x, z) : s(x, z) = w? [sent-187, score-0.566]
61 , zr) mode compatibility score that encodes how likely each of the M modes on one side of the body are to co-occur with each of the M modes on the other side—expressing an affinity for common poses such as arms folded, arms down together, and dislike of uncommon left-right pose combinations. [sent-192, score-1.451]
62 The other two terms can be viewed as mode classifiers: each attempts to predict the correct left/right mode based on image features. [sent-193, score-0.954]
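The following sketch shows one way the left/right mode pair could be selected once each submodel has reported its best score per mode. It assumes the mode-scoring term is additive in a left-right compatibility table and two per-side classifier scores, as the text describes; the exact form of s(x, z) is truncated above, so the function name, argument names and array layout are all hypothetical.

```python
import numpy as np


def best_mode_pair(compat, cls_left, cls_right, sub_left, sub_right):
    """Jointly pick (z_left, z_right).

    compat[i, j]  : compatibility of left mode i with right mode j (M x M)
    cls_*[m]      : mode-classifier score for mode m on that side
    sub_*[m]      : best submodel (part layout) score under mode m on that side
    All inputs are ASSUMED to be precomputed score arrays.
    """
    compat = np.asarray(compat, dtype=float)
    left = np.asarray(cls_left, dtype=float) + np.asarray(sub_left, dtype=float)
    right = np.asarray(cls_right, dtype=float) + np.asarray(sub_right, dtype=float)
    total = compat + left[:, None] + right[None, :]  # M x M table of joint scores
    zl, zr = np.unravel_index(np.argmax(total), total.shape)
    return int(zl), int(zr), float(total[zl, zr])
```

This exhaustive table costs O(M^2) once the per-mode submodel scores are available, which is cheap next to running the M submodels themselves.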
63 In the next section we show a speedup using cascaded prediction to achieve inference sublinear in M. [sent-202, score-0.238]
64 Cascaded mode filtering. The use of structured prediction cascades has been a successful tool for drastically reducing state spaces in structured problems [15, 23]. [sent-205, score-0.701]
65 Here we employ a simple multiclass cascade step to reduce the number of modes considered in MODEC. [sent-206, score-0.455]
66 Quickly rejecting modes has very appealing properties: (1) it gives us an easy way to trade off accuracy versus speed, allowing us to achieve very fast state-of-the-art parsing. [sent-207, score-0.355]
67 We use an unstructured cascade model where we filter each mode variable z? [sent-209, score-0.566]
68 We employ a linear cascade model of the form κ(x, z) = θz · φ(x, z) (6) whose purpose is to score the mode z in image x, in order to filter unlikely mode candidates. [sent-211, score-1.11]
69 The features of the model are φ(x, z) which capture the pose mode as a whole instead of individual local parts, and the parameters of the model are a linear set of weights for each mode, θz. [sent-212, score-0.61]
70 Following the cascade framework, we retain a set of mode possibilities $\bar{M} \subseteq [1, M]$ after applying the cascade model: $\bar{M} = \{ z \mid \kappa(x, z) \ge \alpha \max_{z' \in [1, M]} \kappa(x, z') + \frac{1 - \alpha}{M} \sum_{z' \in [1, M]} \kappa(x, z') \}$ [sent-213, score-0.678]
71 The metaparameter α ∈ [0, 1) is set via cross-validation and dictates how aggressively to prune: between pruning everything but the max-scoring mode and pruning everything below the mean score. [sent-214, score-0.525]
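A minimal sketch of the filtering rule, assuming the threshold is the convex combination α·max + (1 − α)·mean of the cascade scores, which matches the two extremes described above. The function name `filter_modes` is hypothetical, and the scores are assumed to have been computed already, for example with the linear form of Equation 6.

```python
import numpy as np


def filter_modes(scores, alpha):
    """Return indices of modes whose cascade score meets the threshold.

    scores : per-mode cascade scores kappa(x, z), one entry per mode
    alpha  : pruning aggressiveness in [0, 1); 0 keeps modes at or above the
             mean score, values near 1 keep little besides the top mode
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    scores = np.asarray(scores, dtype=float)
    threshold = alpha * scores.max() + (1.0 - alpha) * scores.mean()
    return np.flatnonzero(scores >= threshold)
```

Since the threshold never exceeds the maximum score, at least one mode always survives the filter.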
72 Applying this cascade before running MODEC results in the inference task z? [sent-216, score-0.168]
73 Modes are obtained from the data by finding centers $\{\mu_i\}_{i=1}^{M}$ and example-mode membership sets $S = \{S_i\}_{i=1}^{M}$ in pose space that minimize reconstruction error under squared Euclidean distance: $\min_{S} \sum_{i=1}^{M} \sum_{t \in S_i} \|y_t - \mu_i\|^2$ (7) [sent-224, score-0.168]
74 where μi is the Euclidean mean of the joint locations of the examples in mode cluster Si. [sent-228, score-0.525]
75 We take the cluster membership as our supervised definition of mode membership in each training example, so that we augment the training set to be D = {(xt, yt, zt)}. [sent-230, score-0.619]
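The clustering step can be sketched with plain Lloyd-style k-means, under the assumption that each pose is flattened into a single vector of normalized joint coordinates. The paper does not state which solver or initialization it uses, so the random initialization, iteration count and function name below are illustrative assumptions.

```python
import numpy as np


def cluster_pose_modes(poses, num_modes, num_iters=50, seed=0):
    """Cluster flattened poses; returns (centers, labels).

    poses     : array of shape (num_examples, pose_dim)
    num_modes : number of clusters M
    """
    poses = np.asarray(poses, dtype=float)
    rng = np.random.default_rng(seed)
    # ASSUMPTION: initialize centers from randomly chosen training poses.
    centers = poses[rng.choice(len(poses), size=num_modes, replace=False)].copy()
    for _ in range(num_iters):
        dists = ((poses[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for i in range(num_modes):
            members = poses[labels == i]
            if len(members):  # leave an empty cluster's center unchanged
                centers[i] = members.mean(axis=0)
    # Final assignment against the last set of centers.
    dists = ((poses[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dists.argmin(axis=1)
```

The returned labels play the role of the supervised mode annotation z_t attached to each training example.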
76 Note that some of the modes are extremely difficult to describe at a local part level, such as arms severely foreshortened or crossed. [sent-232, score-0.385]
77 We seek to learn to correctly identify the correct mode and location of parts in each example. [sent-234, score-0.519]
78 In words, Equation 8 states that the score of the true joint configuration for submodel zt must be higher than zt’s score for any other (wrong) joint configuration in example t, which is the standard max-margin parsing constraint for a single structured model. [sent-239, score-0.582]
79 Equation 9 states that the score of the true configuration for zt must also be higher than all scores an incorrect submodel z? [sent-240, score-0.291]
80 Note the use of $\bar{M}_t$, the subset of modes unfiltered by our mode prediction cascade for each example. [sent-249, score-1.194]
81 This is considerably faster than considering all M modes in each training example. [sent-250, score-0.362]
82 We use a cutting plane technique where we find the most violated constraint in every training example via structured inference (which can be done in one parallel step over all training examples). [sent-252, score-0.242]
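To illustrate what finding the most violated constraint means for one training example, here is a toy sketch over a small, fully enumerated score table. It assumes a unit margin and brute-force enumeration of poses. The paper uses structured inference instead, and its exact constraint form is not recoverable from the text above, so every name and the margin value are assumptions.

```python
import numpy as np


def most_violated(scores, true_mode, true_pose, kept_modes, margin=1.0):
    """Return ((mode, pose), violation) for the worst margin violation.

    scores[z, y] : model score of pose index y under mode z (toy table)
    kept_modes   : modes left unfiltered by the cascade for this example
    Returns (None, 0.0) when every constraint is satisfied.
    """
    scores = np.asarray(scores, dtype=float)
    true_score = scores[true_mode, true_pose]
    best, best_violation = None, 0.0
    for z in kept_modes:
        for y in range(scores.shape[1]):
            if z == true_mode and y == true_pose:
                continue  # the ground truth does not compete with itself
            violation = margin + scores[z, y] - true_score
            if violation > best_violation:
                best, best_violation = (int(z), y), float(violation)
    return best, best_violation
```

Restricting the loop to `kept_modes` is what makes cascaded training cheaper than scanning all M modes for every example.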
83 Finally, we share all parameters between the left and right side, and at test time simply flip the image horizontally to compute local part and mode scores for the other side. [sent-254, score-0.527]
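The parameter sharing by flipping can be sketched as follows. This assumes the left-side scorer returns a response map with the same width as its input and aligned to it; `left_scorer` is a hypothetical callable standing in for the learned part and mode scoring.

```python
import numpy as np


def score_other_side(image, left_scorer):
    """Score the right side with left-side parameters via a horizontal flip.

    image       : array of shape (H, W) or (H, W, C)
    left_scorer : callable mapping an image to a response map of the same width
    """
    flipped = image[:, ::-1]          # mirror the image left-to-right
    response = left_scorer(flipped)   # apply the shared (left) parameters
    return response[:, ::-1]          # mirror responses back to image coordinates
```

Flipping the response map back keeps its locations in the coordinate frame of the original image.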
84 In order to make the deformation cue a convex, unimodal penalty (and thus computable with distance transforms), we need to ensure that the corresponding parameters on these features wizj are positive. [sent-266, score-0.166]
85 The Buffy and Pascal Stickmen datasets contain only hundreds of examples for training pose estimation models. [sent-280, score-0.192]
86 Increasing training set size from 500 to 4000 examples improves test accuracy from 32% to 42% wrist and elbow localization accuracy. [sent-290, score-0.188]
87 The model of Yang & Ramanan [24] is multimodal at the level of local parts, and has no larger mode structure. [sent-310, score-0.649]
88 We ascribe its success over [24] to (1) the flexibility of 32 global modes (2) large-granularity mode appearance terms and (3) the ability to train all mode models jointly. [sent-321, score-1.358]
89 The “mean cluster prediction” involves predicting the mean pose defined by the most likely mode, where the most likely mode is determined directly from a 32-way SVM classifier using the same HOG features as our complete model. [sent-478, score-0.266]
90 This surprising result indicates the importance of multimodal modeling in even the simplest form. [sent-481, score-0.172]
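The mean cluster prediction baseline reduces to a lookup once per-mode classifier scores are available. The sketch below assumes those scores come from some multiclass classifier (the text uses a 32-way SVM on HOG features) and that the mode centers are the cluster means. The function name is hypothetical.

```python
import numpy as np


def mean_cluster_prediction(mode_scores, mode_centers):
    """Predict the mean pose of the highest-scoring mode.

    mode_scores  : one classifier score per mode
    mode_centers : array of shape (num_modes, pose_dim) holding cluster means
    """
    best_mode = int(np.argmax(mode_scores))
    return np.asarray(mode_centers)[best_mode]
```

No part-level inference is involved, so the output is always one of the M stored mean poses.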
91 Note that “full training”—considering all modes in every training example—rather than “cascaded training”—just the ones selected by the cascade step—leads to roughly a 1. [sent-490, score-0.451]
92 This allows us to perform joint training and inference to manage the competition between modes in a principled way. [sent-520, score-0.489]
93 Pictorial structures revisited: People detection and articulated pose estimation. [sent-527, score-0.181]
94 Poselets: Body part detectors trained using 3d human pose annotations. [sent-533, score-0.207]
95 Articulated human pose estimation and search in (almost) unconstrained still images. [sent-554, score-0.206]
96 their mode is overlaid on the left and right side of each image. [sent-573, score-0.51]
97 The mode chosen by MODEC is highlighted in green. [sent-574, score-0.477]
98 Articulated part-based model for joint object detection and pose estimation. [sent-636, score-0.181]
99 Exploring the spatial hierarchy of mixture models for human pose estimation. [sent-643, score-0.213]
100 Multiple tree models for occlusion and spatial constraints in human pose estimation. [sent-655, score-0.213]
