Author: Andreas M. Lehrmann, Peter V. Gehler, Sebastian Nowozin
Abstract: Having a sensible prior of human pose is a vital ingredient for many computer vision applications, including tracking and pose estimation. While the application of global non-parametric approaches and parametric models has led to some success, finding the right balance in terms of flexibility and tractability, as well as estimating model parameters from data has turned out to be challenging. In this work, we introduce a sparse Bayesian network model of human pose that is non-parametric with respect to the estimation of both its graph structure and its local distributions. We describe an efficient sampling scheme for our model and show its tractability for the computation of exact log-likelihoods. We empirically validate our approach on the Human 3.6M dataset and demonstrate superior performance to global models and parametric networks. We further illustrate our model’s ability to represent and compose poses not present in the training set (compositionality) and describe a speed-accuracy trade-off that allows realtime scoring of poses.
1 Abstract Having a sensible prior of human pose is a vital ingredient for many computer vision applications, including tracking and pose estimation. [sent-6, score-0.607]
2 In this work, we introduce a sparse Bayesian network model of human pose that is non-parametric with respect to the estimation of both its graph structure and its local distributions. [sent-8, score-0.748]
3 Introduction Reasoning about human pose is a key ingredient in recent successful applications of computer vision systems [20]. [sent-14, score-0.33]
4 Accurately capturing the variability of human pose is challenging because there is both a variation between different persons as well as a combinatorial number of possible poses a single person can assume. [sent-15, score-0.384]
5 In this paper we propose a pose prior, a generative probabilistic model of static human pose. [sent-16, score-0.288]
6 A good pose prior must generalize to unseen poses and persons. [sent-18, score-0.37]
7 In order to generalize, the prior must be compositional: it must represent the variations of parts that frequently occur together and produce a pose by combining these parts. [sent-24, score-0.242]
8 We achieve compositionality by factorizing the pose representation into a Bayesian network [13]. [sent-25, score-0.674]
9 The sparse hierarchical structure of the network enables efficient computation of likelihoods and exact sampling. [sent-26, score-0.418]
10 To apply a Bayesian network to human pose data, we need to specify the network structure and the conditional probability distributions along the network, and it is here that we make two novel technical contributions. [sent-27, score-1.378]
11 First, we enhance the representative power of Bayesian networks by proposing non-parametric Bayesian networks in which the conditional distributions are represented by conditional kernel density estimates. [sent-28, score-0.61]
12 Second, we use structure learning to obtain the network structure by finding parts of the pose that strongly depend on each other, leveraging non-parametric mutual information estimators on continuous joint data. [sent-29, score-0.765]
13 Related Work Pose priors are most often used within pose estimation systems and therefore some of the related works we discuss below incorporate a likelihood term that is computed from an observed image. [sent-38, score-0.25]
14 A natural idea to build a pose prior is to use the tree structure of the human skeleton as a starting point. [sent-40, score-0.595]
15 Models that follow the skeletal structure are called kinematic chain models [2] and they allow us to incorporate prior beliefs about joint angles. [sent-41, score-0.632]
16 In [17] the authors used a multivariate Normal distribution along the kinematic chain and estimate the parameters from motion capture data. [sent-42, score-0.51]
17 The different choices of possible parametrizations in terms of joint angles or relative world coordinates in a kinematic tree model give rise to qualitatively different behaviours [10]. [sent-43, score-0.569]
18 Despite this flexibility a kinematic tree model has clear limitations, as sharply argued in [15]; it is unable to express the coordination of different limbs and fails to represent global balance and gravity constraints. [sent-44, score-0.744]
19 We will demonstrate that we can avoid these limitations by using a tree model that does not correspond to the kinematic chain but instead is chosen to optimally approximate the true distribution of poses. [sent-45, score-0.725]
20 The resulting tree no longer corresponds to a skeleton (Figure 1c and 3b) but retains all computational advantages of a tree-structured model. [sent-46, score-0.23]
21 Previous works have attempted to overcome the limitations of the kinematic tree model in different ways. [sent-47, score-0.524]
22 In [3] the authors have used a global kernel density model on human pose. [sent-48, score-0.292]
23 This model is global and does not reflect the combinatorial nature of human pose; hence it is suitable only for modeling specific poses. [sent-49, score-0.339]
24 Another approach proposed in [21] has been to add further interactions to the kinematic tree so that limb-limb coordination and penetration constraints are modelled. [sent-50, score-0.569]
25 Another popular way to improve over the kinematic tree model is to add latent variables to the model. [sent-53, score-0.61]
26 In [15] the authors augment the kinematic tree model by a few latent variables that are identified by factor analysis. [sent-54, score-0.61]
27 The Gaussian Process latent variable model (GPLVM) [16] has been applied as a pose model [6]. [sent-55, score-0.342]
28 In the GPLVM model a low-dimensional latent space is transformed to pose space by means of a Gaussian Process regression function. [sent-56, score-0.297]
29 The Laplacian Eigenmap latent variable model (LELVM) [18] improves on the GPLVM by modeling the manifold of poses using a graph Laplacian and by providing tractable posterior inference in the latent space. [sent-58, score-0.398]
30 An interesting recent model based on a large number of latent binary variables is the implicit mixture of conditional restricted Boltzmann machines (imCRBM) [23]; both estimation and inference are again approximate. [sent-59, score-0.237]
31 In fact, each training pose is represented as one latent vector and they are not combined in an intelligent way. [sent-61, score-0.344]
32 Non-parametric Bayesian Networks In this section we introduce our non-parametric Bayesian network model of human pose and show its tractability. [sent-63, score-0.578]
33 We represent a human body pose by a d-dimensional vector whose components correspond either to angular or xyz coordinates of n joints. [sent-64, score-0.367]
34 Each pose thus decomposes on the joint level, x = [x1, . [sent-65, score-0.256]
35 ,n defines a high-dimensional pose distribution q(X) whose samples we denote by ? [sent-72, score-0.249]
36 A Bayesian network over X is a pair (p, G) where the distribution p factorizes over the directed acyclic graph G, ? [sent-84, score-0.375]
37 The specification of a Bayesian network hence consists of two parts: The definition of a graph str? [sent-96, score-0.375]
38 Learning the Graph Structure The graph structure of a Bayesian network models the local and global (in)dependencies of a distribution. [sent-107, score-0.472]
39 In the case of the human body, an obvious structure is the kinematic chain, i.e., [sent-109, score-0.44]
40 a tree-structured network with one parent per variable that follows the adjacency of joints in the body (Figure 1a). [sent-111, score-0.573]
41 Given a fully connected graph G˜ over X with edge weights wjk set equal to the mutual information MI(Xj, Xk) between Xj and Xk, the solution to (2) can be shown to be the maximum spanning tree of G˜ (with edges directed outwards in a consistent way) [1]. [sent-137, score-0.447]
42 In contrast to kinematic chains, a Chow-Liu tree is thus guaranteed to model those pairs of joints that exhibit a high flow of information, independent of their adjacency in the human body. [sent-138, score-0.445]
43 Using the entropy estimate we … consistent as N → ∞ in [ … (Footnote 1: We omit arrows from our network visualizations and implicitly assume the orientations to be directed away from the hip node.) [sent-156, score-0.406]
44 (4) The computed mutual information is visualized in Figure 1b and we can now solve for the Chow-Liu tree [4] by finding the maximum spanning tree [5] to obtain our final result G, shown in Figure 1c. [sent-158, score-0.475]
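The structure-learning step described above can be sketched as follows, assuming the pairwise mutual information values are already available as a symmetric matrix. The function name `chow_liu_tree`, the use of Prim's algorithm, and the toy matrix are illustrative assumptions; the source only states that a maximum spanning tree is found and that edges are directed away from a root (the hip node).

```python
def chow_liu_tree(mi, root=0):
    """Maximum spanning tree of a symmetric MI matrix (Prim's algorithm).

    Returns parent[j] for every node, with edges directed away from `root`
    (parent[root] is None). `mi` is a hypothetical n x n list of lists.
    """
    n = len(mi)
    parent = [None] * n
    in_tree = [False] * n
    best = [float("-inf")] * n  # strongest MI link from each node to the tree
    best[root] = 0.0
    for _ in range(n):
        # Add the remaining node with the strongest link to the current tree.
        j = max((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        in_tree[j] = True
        for k in range(n):
            if not in_tree[k] and mi[j][k] > best[k]:
                best[k] = mi[j][k]
                parent[k] = j
    return parent


# Toy example with 4 variables (values are made up for illustration).
toy_mi = [
    [0.0, 0.9, 0.1, 0.2],
    [0.9, 0.0, 0.8, 0.3],
    [0.1, 0.8, 0.0, 0.7],
    [0.2, 0.3, 0.7, 0.0],
]
toy_parents = chow_liu_tree(toy_mi, root=0)
```

Because the tree is grown outwards from the root, the returned parent array already gives a consistent edge orientation, so no separate orientation pass is needed.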
45 Learning the Local Models Once the network structure is fixed, we need to learn the local conditional distributions p (Xj | pa (Xj)) from training data. [sent-161, score-0.557]
46 Our approach will be to compute a conditional kernel density estimate (CKDE) in which we can condition on given values Y = y as needed. [sent-166, score-0.276]
47 In summary, we can compute the CKDE density p(x|y) efficiently and at the same asymptotic complexity as the joint KDE density p(x, y). [sent-201, score-0.261]
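A minimal sketch of a conditional kernel density estimate may help here. It assumes an isotropic Gaussian kernel with a single bandwidth `h` shared by all dimensions, which is a simplification chosen for illustration and not necessarily the kernel used in the paper; the function names are hypothetical. The conditional density is obtained as the ratio of a joint KDE and a marginal KDE, so both passes are linear in the number of training points.

```python
import math


def log_kde(point, data, h):
    """Log-density of an isotropic Gaussian KDE at `point`.

    `data` is a list of training vectors; `h` is an assumed shared bandwidth.
    """
    d = len(point)
    log_norm = -0.5 * d * math.log(2.0 * math.pi * h * h) - math.log(len(data))
    exps = [
        -sum((p - q) ** 2 for p, q in zip(point, row)) / (2.0 * h * h)
        for row in data
    ]
    m = max(exps)  # log-sum-exp for numerical stability
    return log_norm + m + math.log(sum(math.exp(e - m) for e in exps))


def log_ckde(x, y, data_x, data_y, h):
    """log p(x | y) = log p(x, y) - log p(y), both estimated by KDE."""
    joint = [list(a) + list(b) for a, b in zip(data_x, data_y)]
    return log_kde(list(x) + list(y), joint, h) - log_kde(list(y), data_y, h)
```

With a single training pair, the estimate reduces to a Gaussian centred on the training value of x, regardless of y, which gives a quick sanity check.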
48 Log-likelihoods and Sampling There are two important operations to perform in applications of our model as a pose prior: computing the likelihood of a given pose and sampling a pose from the prior. [sent-204, score-0.69]
49 Given a Chow-Liu/CKDE network with n variables, the log-likelihood log p(x) of a new observation x ∈ Rd is ? [sent-207, score-0.32]
50 This allows a detailed analysis of a pose not possible in global methods. [sent-218, score-0.262]
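The factorization behind this log-likelihood can be written out as a short worked equation. It is a reconstruction from the Bayesian network definition and the CKDE construction, not a copy of the paper's own equation; the symbols x_{pa(j)} (the parent values of joint j) and \hat{p} (a kernel density estimate) are notation assumed here.

```latex
\log p(x)
  = \sum_{j=1}^{n} \log p\bigl(x_j \mid x_{\mathrm{pa}(j)}\bigr)
  = \sum_{j=1}^{n} \Bigl[ \log \hat{p}\bigl(x_j, x_{\mathrm{pa}(j)}\bigr)
                         - \log \hat{p}\bigl(x_{\mathrm{pa}(j)}\bigr) \Bigr]
```

Each joint contributes one joint-density term and one marginal-density term, so the summands can be inspected individually to see which joints make a given pose unlikely.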
51 Thanks to the closed-form solution for a conditional Gaussian, we can employ standard ancestral sampling [13], i.e., [sent-220, score-0.255]
52 we find a topological ordering τ for the network structure and draw samples from p(Xτ(j) | Xpa(τ(j))), for j = 1, . [sent-221, score-0.374]
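Ancestral sampling from such a network can be sketched as below. The sketch assumes isotropic Gaussian kernels with one shared bandwidth `h`, in which case the conditional p(x | y) is a mixture of Gaussians centred on the training values of x, with mixture weights proportional to the kernel between y and the training values of the parent. With full-covariance kernels the component means would also shift with y; that case is not covered here. All names and the data layout are hypothetical.

```python
import math
import random


def sample_child(y, data_x, data_y, h, rng):
    """Draw x ~ p(x | y) for an isotropic Gaussian-kernel CKDE."""
    logs = [
        -sum((a - b) ** 2 for a, b in zip(y, row)) / (2.0 * h * h)
        for row in data_y
    ]
    m = max(logs)
    weights = [math.exp(v - m) for v in logs]  # unnormalized mixture weights
    i = rng.choices(range(len(data_x)), weights=weights)[0]
    return [rng.gauss(v, h) for v in data_x[i]]


def ancestral_sample(parent, order, data, h, rng):
    """Sample a pose joint by joint along a topological ordering.

    parent[j] is the parent index of joint j (None for the root),
    data[j] is the list of training vectors for joint j.
    """
    pose = {}
    for j in order:
        if parent[j] is None:
            # Root: sample from the unconditional KDE.
            i = rng.randrange(len(data[j]))
            pose[j] = [rng.gauss(v, h) for v in data[j][i]]
        else:
            pa = parent[j]
            pose[j] = sample_child(pose[pa], data[j], data[pa], h, rng)
    return pose


# Toy usage: two 1-D "joints", two training poses, root is joint 0.
toy_data = [[[0.0], [10.0]], [[1.0], [11.0]]]
toy_pose = ancestral_sample([None, 0], [0, 1], toy_data, 0.01, random.Random(0))
```

In the toy data the child always sits one unit above its parent, so with a small bandwidth every sample should preserve that offset.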
53 The H36M skeleton includes some spurious joints that we delete, which results in the same 20 joints present in the Kinect skeleton [20]. [sent-260, score-0.378]
54 Pose Model We start by learning a pose model on the H36M training set according to the techniques introduced in section 2. [sent-264, score-0.258]
55 The resulting network structure is displayed in Figure 1c and it is worth noting some of its properties: 1. [sent-265, score-0.336]
56 Note that this does not apply to the kinematic chain. [sent-267, score-0.353]
57 The uninformative pairs of nodes present in the kinematic chain (red edges in Figure 1a) are circumvented in the Chow-Liu tree, thus guaranteeing, from an information-theoretic point of view, optimal conditional distributions under the given constraint of a sparse structure. [sent-269, score-0.768]
58 Subgraphs containing joints with high entropies (Figure 1d), such as the arms and legs, largely follow the kinematic chain. [sent-271, score-0.524]
59 This confirms the intuitive belief that joints with high uncertainty should be conditioned on nearby joints, as they provide the maximum information about a joint's position in this case. [sent-272, score-0.366]
60 Using our Matlab implementation … Figure 2: We show samples drawn from our non-parametric pose prior to give … are untouched and were generated from a single Chow-Liu/CKDE model. [sent-276, score-0.28]
61 As described in section 2, our model consists of two components: estimation of the graph structure (non-parametric Chow-Liu tree) and estimation of the local distributions (conditional kernel density estimation). [sent-286, score-0.435]
62 The options for the conditional distributions are our CKDE approach and a Gaussian linear (GL) network [13]. [sent-289, score-0.464]
63 In the higher-order kinematic chain each joint is additionally conditioned on its parents’ parents. [sent-292, score-0.63]
64 We use parametric MI estimation for the parametric GL network and our distance-based non-parametric MI estimation for the non-parametric CKDE network. [sent-294, score-0.466]
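For the parametric case, a mutual information estimate under a Gaussian assumption has a closed form. The sketch below covers only two scalar variables, where MI = −½ log(1 − ρ²) with ρ the sample correlation; the joints in the paper are multi-dimensional, so this is a simplified illustration and not the estimator used by the authors. The function name is hypothetical.

```python
import math


def gaussian_mi(xs, ys):
    """MI in nats between two scalar variables, assuming joint Gaussianity."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    rho2 = (sxy * sxy) / (sxx * syy)  # squared sample correlation
    return -0.5 * math.log(1.0 - rho2)
```

Values computed this way could serve as edge weights for the spanning-tree step, keeping in mind that a Gaussian estimate only captures linear dependence.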
65 The network approaches are complemented by a comparison to the global GPLVM [16], where we employ the popular FITC approximation [22] together with subsampling to achieve tractability. [sent-295, score-0.374]
66 We use a reference implementation … Table 1: Expected log-likelihoods of GL and CKDE networks for different graph structures and a comparison to global methods. [sent-296, score-0.216]
67 Table 1 row labels (Gaussian linear network): Independent; Kinematic chain (order 1); Kinematic chain (order 2); Chow-Liu tree. Surviving value fragments: 89; −352. [sent-309, score-0.775]
68 Table 1 row labels (CKDE network): Independent; Kinematic chain (order 1); Kinematic chain (order 2). Surviving value fragments: 03; −322. [sent-325, score-0.29]
69 Let us now turn to the network approaches and analyze their graph structures. [sent-339, score-0.375]
70 Not surprisingly, a network modeling the joints independently performs worst, with test ELLs of −346 (GL) and −322 (CKDE). [sent-340, score-0.42]
71 Lawrence / fgplvm/ Figure 3: In (a), we show samples from the “wave” training set (left, 2 pose classes) and samples drawn from the learned model (right, 4 pose classes). [sent-347, score-0.545]
72 Higher-order kinematic chains improve on the results by an … (interleaved value fragments: to −31…; −271 (CKDE)). [sent-352, score-0.353]
73 The direct comparison of CKDE- to GL networks is unambiguous: CKDE networks perform consistently better, independent of the graph structure. [sent-360, score-0.279]
74 On the other hand, parametric networks based on the kinematic chain are too flexible in the sense that they allow arbitrary combinations of the position of different limbs. [sent-365, score-0.678]
75 At the same time, Gaussian linear networks are not flexible enough in the sense that their local distributions cannot cope with multimodality, which is essential when modeling human pose. [sent-371, score-0.219]
76 Ideally, we would like to have flexibility and compositionality only where it is adequate and needed. [sent-372, score-0.226]
77 We then learn a pose model according to section 2, draw 5000 samples from it and cluster them into 4 clusters using k-means. [sent-374, score-0.42]
78 Consequently, the joint positions of the latter are all modeled conditional on the corresponding joint positions of the former. [sent-381, score-0.27]
79 The samples generated by this model (Figure 3a (right)) fall into 4 distinct pose classes. [sent-382, score-0.249]
80 Two of the four clusters (coloured in purple and red) correspond to poses also present in the training set. [sent-383, score-0.276]
81 The other two clusters (coloured in blue and green) represent newly learned poses that do not appear in the training data: a neutral pose (both hands lowered) and a pose with both hands raised. [sent-384, score-0.698]
82 Real-time Scoring Time is a critical factor in applications such as tracking or pose estimation. [sent-402, score-0.246]
83 At training time, we cluster all training points into clusters C1, . [sent-411, score-0.265]
84 At test time, we partition the clusters into a set of core clusters Ce and a set of approximate clusters Ca based on the following scheme: Given a test pose x ∈ Rd, we use the kd-tree to determine the clusters whose centers lie closest to x. [sent-415, score-0.77]
85 We then evaluate all training points within the core clusters exactly. [sent-418, score-0.256]
86 As the number of core clusters approaches the total number of clusters (or as the number of total clusters approaches the total number of training points), our approximate method converges to the exact log-likelihood. [sent-439, score-0.648]
87 Since the contribution of a training point to the loglikelihood decreases exponentially with its distance from the test point, a few core clusters should suffice to achieve a high level of accuracy. [sent-440, score-0.256]
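The core/approximate split can be sketched as follows. Two parts are assumptions made for illustration: the nearest cluster centers are found by brute-force sorting instead of a kd-tree, and every point in a non-core cluster is replaced by its cluster center, since the exact approximation used in the paper is not recoverable from the text. An isotropic Gaussian kernel with bandwidth `h` is assumed as well, and all names are hypothetical.

```python
import math


def sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def approx_log_kde(x, clusters, h, n_core):
    """Approximate log KDE at x.

    `clusters` is a list of (center, points) pairs. The `n_core` clusters
    whose centers lie closest to x are evaluated point by point; every other
    cluster is summarized by its center, weighted by its size.
    """
    ranked = sorted(clusters, key=lambda c: sqdist(x, c[0]))
    exps, counts = [], []
    for rank, (center, points) in enumerate(ranked):
        if rank < n_core:
            for p in points:
                exps.append(-sqdist(x, p) / (2.0 * h * h))
                counts.append(1)
        else:
            exps.append(-sqdist(x, center) / (2.0 * h * h))
            counts.append(len(points))
    n_total = sum(counts)
    m = max(exps)  # log-sum-exp for numerical stability
    total = sum(c * math.exp(e - m) for c, e in zip(counts, exps))
    log_norm = -0.5 * len(x) * math.log(2.0 * math.pi * h * h)
    return m + math.log(total) - math.log(n_total) + log_norm


# Toy usage: two well-separated clusters in 2-D (made-up values).
toy_clusters = [
    ([0.0, 0.0], [[0.1, 0.0], [-0.1, 0.0]]),
    ([5.0, 5.0], [[5.0, 5.2], [5.0, 4.8]]),
]
exact = approx_log_kde([0.2, 0.1], toy_clusters, 1.0, n_core=2)
approx = approx_log_kde([0.2, 0.1], toy_clusters, 1.0, n_core=1)
```

Setting `n_core` to the number of clusters recovers the exact value, and distant clusters contribute so little that summarizing them barely changes the result, which is the behaviour the text describes.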
88 Figure 4a shows the results in terms of accuracy and speed for a local log-likelihood: if an absolute error of 10^−2 nats is acceptable, we need as few as 4 core clusters and the runtime is 1. [sent-444, score-0.336]
89 Adding more core clusters further decreases the error, while the runtime increases sublinearly. [sent-448, score-0.259]
90 As the evaluation of a log-likelihood for a Bayesian network in our case requires computation of 2n = 40 local log-likelihoods (see equation (7)), we achieve a total speed of approx. [sent-449, score-0.29]
91 Conclusion We have introduced a fully non-parametric Bayesian network model of human pose. [sent-454, score-0.367]
92 In order to learn the network structure, we have used a continuous variant of the Chow-Liu tree, in which we have obtained the required estimates of mutual information by means of a non-parametric entropy estimator. [sent-455, score-0.464]
93 The comparison of different graph structures has shown that our non-parametric approach to structure learning outperforms the widely used kinematic chain and also a higher-order variant thereof by a significant margin. [sent-459, score-0.641]
94 We expect widespread applicability in domains such as tracking, pose estimation and pose denoising. [sent-462, score-0.461]
95 Efficient kernel density estimation using the fast Gauss transform with applications to color modeling and tracking. [sent-505, score-0.203]
96 Beyond trees: Common-factor models for 2D human pose recovery. [sent-562, score-0.288]
97 Real-time human pose recognition in parts from single depth images. [sent-595, score-0.288]
98 Loose-limbed people: Estimating 3D human pose and motion using non-parametric belief propagation. [sent-604, score-0.319]
99 Dynamical binary latent variable models for 3D human pose tracking. [sent-619, score-0.419]
100 Modeling mutual context of object and human pose in human-object interaction activities. [sent-644, score-0.384]
