Abstract: Our goal is to detect humans and estimate their 2D pose in single images. In particular, we aim to handle cases of partial visibility, where some limbs may be occluded or one person partially occludes another. Two standard, but disparate, approaches have developed in the field: the first is the part-based approach for layout-type problems, which involves optimising an articulated pictorial structure; the second is the pixel-based approach for image labelling, which involves optimising a random field graph defined on the image. Our novel contribution is a formulation for pose estimation which combines these two models in a principled way in one optimisation problem and thereby inherits the advantages of both. Inference on this joint model finds the set of instances of persons in an image, the location of their joints, and a pixel-wise body part labelling. We achieve state-of-the-art or near state-of-the-art results on standard human pose data sets, and demonstrate correct estimation for cases of self-occlusion, person overlap and image truncation.
These proceed by searching for the most probable location of body parts, estimating a per pixel cost for each part, and combining the costs using dynamic programming over the tree structure graph.
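The dynamic programming step described above can be illustrated with a minimal min-sum sketch over a tree of parts. This is not the authors' implementation: the function name `best_pose`, the dictionary-based inputs and the discrete list of candidate locations per part are all assumptions made for illustration.

```python
def best_pose(children, unary, pairwise, root=0):
    """Min-sum dynamic programming over a tree of parts (illustrative sketch).

    children: dict part -> list of child parts
    unary:    dict part -> list of costs, one per candidate location
    pairwise: function (parent, child, parent_loc, child_loc) -> deformation cost
    Returns (total cost, dict part -> chosen location index).
    """
    argbest = {}  # argbest[c][lp]: best location of child c given parent location lp

    def subtree_cost(part):
        # cost[l] = unary cost at l plus the best cost of every child subtree
        cost = list(unary[part])
        for c in children.get(part, []):
            child_cost = subtree_cost(c)
            message, argbest[c] = [], []
            for lp in range(len(cost)):
                options = [child_cost[lc] + pairwise(part, c, lp, lc)
                           for lc in range(len(child_cost))]
                best = min(range(len(options)), key=options.__getitem__)
                message.append(options[best])
                argbest[c].append(best)
            cost = [a + b for a, b in zip(cost, message)]
        return cost

    root_cost = subtree_cost(root)
    pose = {root: min(range(len(root_cost)), key=root_cost.__getitem__)}
    # Backtrack from the root to recover the location of every part.
    stack = [root]
    while stack:
        part = stack.pop()
        for c in children.get(part, []):
            pose[c] = argbest[c][pose[part]]
            stack.append(c)
    return root_cost[pose[root]], pose
```

The leaves are processed first and each part passes a single best-cost message to its parent, so the cost grows linearly with the number of parts rather than exponentially.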
Recent work has attempted to overcome these problems, for example by enforcing consistency of ensembles of parts or eschewing the pictorial-structure formulation by directly learning poselets for human parts tightly clustered in both appearance and configuration spaces.
In order to deal with (c), and also with self-occlusion, earlier work introduced a weak background model combined with a tight model of the human foreground.
The resulting method is one of the first to deal convincingly with the problem of self occlusion and clearly demonstrates the benefit of a background model.
Others have proposed modelling dependencies and relationships between multiple people, which addresses problem (a), and methods for efficiently sampling from pictorial structures.
The outcome is the set of instances of persons in an image with the location of their joints, and the pixel-wise labelling (segmentation) of each of their body parts.
Our work is a continuation of the theme of combining segmentation and human pose estimation.
This figure shows two candidates that end up in the final solution, their HOG masks, body-part masks, instance color models, the result of the texture component and final instance and body-part segmentation.
Candidates consist of 16 joints (some of which may be invisible) and the pixel-wise labelling of 11 body-part labels and background.
Our goal is threefold: first, to assign every pixel to a body part of an instance of a person or to the background; second, for each instance, to estimate the body layout in terms of joint positions and body parts and to specify their visibility; and third, to determine the number of instances.
The first goal is close to the traditional labelling problem of semantic segmentation; the second is close to pose estimation, which typically uses tree-structured pictorial structures.
In more detail, the method should predict a set of instances (subset of the set of candidates), consisting of their pose (joint positions and body parts).
Correspondingly, a pixel in the image x is labelled by the instance and the body part that overlaps it.
The HOG component models the contribution of a pixel to an instance of an articulated model.
For a non-overlapping set of models this component corresponds to the negative sum of responses of a tree-structured mixture-of-parts pictorial structure model.
The body part mask component links the model with its body-part labelling: given a set of joint positions from an instance of the model M, the body part mask is a (soft) assignment of pixels to mattes of the body parts.
The color component is a Gaussian mixture color model for the foreground and background, built by thresholding the body part masks for the model.
The texture component is a semantic segmentation of the image into body parts and background.
It is computed independently of the instances and contributes information on the appearance and shape of body parts.
The optimization labels the pixels (with the instances and parts) and takes account of the costs from the four terms.
The outputs of the components are illustrated in figure 1, where it can be seen how the texture component can contribute additional information over that provided by the instance segmentation from the color model.
The optimization proceeds by first proposing a number, N, of candidates for the instances using the mixture-of-parts model.
Some of these candidates will survive and appear in the final solution, and the ones that do will have led to the minimal energy when all the components and interactions between instances have been taken into account (during inference).
In the following section we describe the method for generating the candidate instances, and then give the details of the component computation and inference method in section 4.
Efficiently Generating Pose Candidates
We would like to find a set of candidates (local optima) that covers all the persons in the image and all their possible poses.
5 people present, to capture all possible poses a large number, N, of candidates is required (N = 200 in our experiments) and the running time of this method is too large.
In contrast we propose a method which takes only slightly more time to find a large number of candidates than to find just the best one per root node.
To increase the chance of capturing all instances we restrict ourselves to the search for the best solutions with at most K candidates (K = 8 here) with the same root node.
Starting from the leaf nodes and moving towards the root node, the best locations and types of the children are estimated for each location and type of a part.
To find the best K candidates differing by at least one type of a child, we need to estimate, for each location and type of the part, the top K constellations (types and locations) of all its children.
In the first step we find the best location of a child for each type of a child, and take the top K solutions for this location and type.
This step is only approximate; these K solutions are only a good approximation of the real top K solutions, which can be obtained by merging all lists for each location given the type of a child.
In the second step the parent has to merge its response for each type with the top K solutions of each child.
The final set of candidates is obtained by merging whole trees of solutions of suppressed root nodes.
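The merging step, in which a parent combines its own response with the top K solutions of each child, can be sketched as follows. This is a simplified illustration rather than the authors' procedure: the function `merge_top_k`, the (score, config) tuples and the higher-is-better score convention are assumptions, and the sketch ignores locations and the constraint that candidates differ by at least one child type.

```python
import heapq
from itertools import product

def merge_top_k(parent_score, child_lists, k):
    """Combine a parent's response with the top-k lists of its children.

    child_lists: one list per child of (score, config) pairs, higher is better.
    Returns the k best (total score, tuple of child configs), best first.
    """
    merged = [(parent_score, ())]
    for options in child_lists:
        # Pair every partial solution with every option of the next child,
        # then keep only the k best; at most k * k pairs are scored per child.
        pairs = ((s + cs, cfg + (ccfg,))
                 for (s, cfg), (cs, ccfg) in product(merged, options))
        merged = heapq.nlargest(k, pairs, key=lambda item: item[0])
    return merged
```

Folding the children in one at a time keeps the work per parent at roughly K squared sums per child, which is small for K = 8 compared with a search over child locations.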
In practice for T = 5 and K = 8 (as used in the experiments) the running time of brute force search for the best location of a child of each type is much more expensive (especially in case of the sub-cell accuracy) than the sorting and merging steps together and the algorithm takes only 1.
To also obtain candidates with hidden parts, the set of types is extended with an additional hidden type, corresponding to an invisible joint whose children are also hidden.
Using this hidden type allows for candidates which have certain joints either outside of an image, occluded by another object or person, or self-occluded.
Implementation Details
In this section we describe how each component of the model energy (1) is computed, and then the inference method for the model.
Each candidate model m is defined by the locations and types of parts m = (P, t) and has its own associated HOG weight vector wm and bias bm.
We now define the pixel-wise cost in terms of (3) as
$$E_{HOG}(x^{inst}) = \sum_{j \in V} \psi_{HOG}(x_j^{inst}) + \sum_{m \in M} \left(-|c_m|\, b_m\right) \delta(m), \qquad (4)$$
where $V$ is the set of pixels, and $\delta(m)$ indicates the presence of a model $m$ in the labelling.
where $c_m(j)$ is the corresponding cell of a model $m$ for a pixel $j$.
the bias bm (note, the bias is negative in general as it has to prevent false negative human detections across the image).
The evidence should not be taken into account from occluded parts of the object, and thus a significantly occluded object would be unable to provide sufficient evidence, larger than the threshold.
The Body part mask component
Given the location of the joints, it is possible to predict the body part segmentation to a good approximation.
We achieve this by learning a classifier to predict whether each pixel belongs to each body part given the location of all joints.
The intuitive way to incorporate this potential in the random field framework would be to add it as a simple unary potential, assigning a cost if the pixel j takes a model label m and the body-part label.
Suppose we use only the HOG and body part mask components.
In other words, if the labelling agrees with the body-part mask prediction, the energy for each candidate should be 0.
Thus, we need to balance the bias of all foreground pixels, and the unbiased potential takes the form
$$E_{MASK}(M, x) = \sum_{j \in V} \left(H^m_B(j) - H^m_{x_j^{part}}(j)\right) + \sum_{m \in M} C(m)\,\delta(m), \qquad (6)$$
where $C(m)$ is defined as
$$C(m) = \sum_{j \in V} \max_{p \in \{B\} \cup L_{part}} \left(H^m_p(j) - H^m_B(j)\right).$$
If the final labelling agrees with the most probable body part mask, then it sums up to zero; if some pixels do not agree, they are penalized based on the difference of body-part likelihoods for the estimated and present label.
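The zero-sum property of the unbiased mask potential can be checked numerically with a small sketch for a single candidate. The function `mask_energy`, the dictionary layout of the scores and the toy values are assumptions for illustration; the sketch takes the constant term as a per-pixel maximum over the background and body-part labels.

```python
def mask_energy(H, labels, background="B"):
    """Unbiased body-part mask energy for one candidate (illustrative sketch).

    H:      dict label -> list of per-pixel scores H_p(j), higher = more likely
    labels: list with the label assigned to each pixel j
    """
    n = len(labels)
    # Data term: sum over pixels of H_B(j) - H_{x_j}(j)
    data = sum(H[background][j] - H[labels[j]][j] for j in range(n))
    # Constant term: sum over pixels of max_p (H_p(j) - H_B(j)), with p
    # ranging over the background and all body-part labels
    const = sum(max(H[p][j] for p in H) - H[background][j] for j in range(n))
    return data + const

# Hypothetical scores for two pixels and three labels.
scores = {"B": [0.5, 0.25], "torso": [2.0, 0.125], "arm": [1.0, 0.0]}
agree = mask_energy(scores, ["torso", "B"])    # most probable label per pixel
disagree = mask_energy(scores, ["arm", "B"])   # first pixel mislabelled
```

With the most probable label at every pixel the two terms cancel; mislabelling the first pixel as arm leaves a penalty equal to the gap between the torso and arm scores at that pixel.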
All distances are relative to the size of the object determined by the longest limb (all limbs are about the same size).
Because the joints (and limbs) may be occluded, we double the number of the decision stumps used as the weak features, where both conditions are by definition not satisfied for an occluded part or limb.
For example, the two stumps ask whether there is a shoulder further than θ from this point and whether there is a shoulder closer than θ from this point, so that the algorithm can distinguish between the cases where the shoulder is visible and where it is not.
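The doubled decision stumps can be sketched as a pair of boolean features per joint and threshold. The function `stump_features`, the use of `None` for an occluded joint and the handling of the exact tie are assumptions for illustration only.

```python
import math

def stump_features(pixel, joint, theta, scale):
    """Return (further_than_theta, closer_than_theta) for one joint.

    joint is an (x, y) tuple, or None when the joint is occluded.
    Distances are divided by `scale`, e.g. the length of the longest limb.
    """
    if joint is None:
        # Occluded joint: both conditions are unsatisfied by definition.
        return (False, False)
    d = math.hypot(pixel[0] - joint[0], pixel[1] - joint[1]) / scale
    # A distance exactly equal to theta is counted as "closer" (arbitrary choice).
    return (d > theta, d <= theta)
```

A visible joint always satisfies exactly one of the two conditions, so the pattern (False, False) is reserved for occlusion and a boosted classifier can learn from it.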
The Color component
The color component ensures that solutions in which the color models of the foreground and the background differ are preferred.
It is self-trained for each instance using a Gaussian mixture model initialised using the mask estimated as in section 4.
The Texture component
The texture component consists of potentials used for the semantic segmentation problem: the multi-feature TextonBoost and the body-part super-pixel terms, with the usual contrast-dependent Potts model as the pair-wise term.
Even though the performance of this component is not very high on its own, it can reliably distinguish between torso and arms and resolve several ambiguities of the mixture-of-parts model.
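For readers unfamiliar with the pair-wise term, a contrast-dependent Potts cost is commonly written as zero for equal labels and a constant plus a contrast-weighted exponential otherwise. The sketch below uses that common form; the function `potts_cost` and the parameter values `lam_const`, `lam_contrast` and `beta` are hypothetical and not taken from the paper.

```python
import math

def potts_cost(label_i, label_j, color_i, color_j,
               lam_const=1.0, lam_contrast=4.0, beta=0.5):
    """Contrast-dependent Potts cost for two neighbouring pixels (sketch)."""
    if label_i == label_j:
        return 0.0  # no penalty when neighbours share a label
    # Squared color difference between the two pixels
    diff2 = sum((a - b) ** 2 for a, b in zip(color_i, color_j))
    # The penalty for a label change shrinks as the contrast grows
    return lam_const + lam_contrast * math.exp(-beta * diff2)
```

Label boundaries are therefore cheap across strong image edges and expensive inside uniform regions, which encourages segment boundaries to follow image contours.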
Inference
We wish to minimize the energy (1) in order to determine the set of instances M with their layout (joints, parts), as well as a pixel-wise labelling of the image according to whether the pixels belong to a part.
The optimization cannot be carried out directly, and we proceed in two stages: first, finding the number and joint position of the instances; and second, with this restricted set of label possibilities, determining the best pixel labelling.
In the first stage we have N human pose candidates (obtained as described in section 3).
For each candidate we compute the potentials EHOG, EMASK and ECOL.
The potential ETEX is independent of the candidates and is evaluated once for the entire image.
We start by labelling everything as background and iteratively add the next best candidate.
The quality of each candidate is determined by calculating the energy after α-expansion over the 11 body-part labels and background.
In the second stage the optimization is over all the selected candidates to refine the solution.
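The first stage can be sketched as a greedy loop over the candidates. This is only an outline under stated assumptions: the function `greedy_select` is hypothetical, the `energy` callback stands in for the α-expansion evaluation described above, and stopping when no candidate lowers the energy is an assumed criterion that the text does not state.

```python
def greedy_select(candidates, energy):
    """Greedily add candidates while the total energy decreases (sketch).

    energy: function taking a list of selected candidates and returning the
            energy of the best labelling restricted to those candidates;
            here it is a caller-supplied stand-in for alpha-expansion.
    """
    selected = []
    current = energy(selected)  # everything labelled as background
    remaining = list(candidates)
    while remaining:
        # Evaluate each remaining candidate when added to the current set
        best = min(remaining, key=lambda c: energy(selected + [c]))
        best_energy = energy(selected + [best])
        if best_energy >= current:
            break  # assumed stopping rule: no candidate lowers the energy
        selected.append(best)
        remaining.remove(best)
        current = best_energy
    return selected, current
```

Because the energy is re-evaluated with the already selected instances in place, a candidate that overlaps an accepted instance is judged with that interaction included.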
We use the standard Buffy data set consisting of 748 images from episodes s5e2 to s5e6, with episode s5e3 used for training, episode s5e4 for validation and episodes s5e2, s5e5 and s5e6 for testing.
Largely occluded persons are marked as hard and ignored for the evaluation.
Each individual component of the model (the HOG, texture, color and body part mask potentials) is trained separately using this annotation.
The HOG component is trained using the approach, the texture component using the learning methods described, the color model using a mixture of 10 Gaussians, and body part mask using the method described in section 4.
The top 200 candidates are used in the experiments, 8 at most with the same scale and the same root node.
The Image Parse data set consists of 305 images, each containing only one person and a labelling of 14 joints.
The data set does not contain pixel-wise labellings, so the texture and foreground mask potentials trained on the Buffy data set are also used for this data set.
Since most pixels are background, the average of the pixel-wise recall over classes is more affected by mislabelling a body part pixel as background than the other
In cases where the instances are highly occluded (such as C) or difficult to distinguish, the joints are not labelled, and the body-part pixels are labelled as hard (black) and ignored for training and evaluation.
Some of the joints are labelled as half-visible (sometimes because they are too close to the boundary) and are also ignored for evaluation.
For the pictorial structure measures, the comparison to the state-of-the-art methods for the Buffy data set in the loose-PCP measure is given in table 1, and for the Image Parse data set for the strict- and loose-PCP measures.
The incorporation of all components leads to a significant improvement on the Buffy data set; however, the method did not improve on the Image Parse data set, probably because the texture and mask potentials were trained on a different data set with a different distribution of poses.
Texture potentials are good at distinguishing between limbs and torso, and thus help to resolve ambiguities in estimation of joints and their visibility.
Conclusion
In this paper we have shown that, given appropriate training, it is possible to achieve Kinect-style body part labelling and layout in color images (despite not having depth information).
Furthermore, we have for the first time covered the case of multiple, possibly interacting, human instances in quite varied and unconstrained poses.
The formulation of a joint model covering foreground and background has effectively dealt with all of the problems we listed in the introduction for pictorial-structures, e.
The incorporation of pose into the random field framework leads to a significant improvement of the performance.
The weights are optimised for the intersection over union measure, which is more suitable for this data set because of a significant imbalance of the body part classes (dominated by background).
Surprisingly, the incorporation of texture potentials improved the intersection over union measure also for the lower legs, even though there is insufficient training data to learn them well.
The improvement is mainly because the texture potentials give a much better definition of the instance boundary.
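The difference between class-averaged pixel-wise recall and intersection over union under class imbalance can be seen in a small sketch. The function `per_class_scores` and the ten-pixel toy labelling are assumptions for illustration, not data from the experiments.

```python
def per_class_scores(truth, pred, classes):
    """Per-class pixel-wise recall and intersection over union (sketch)."""
    recall, iou = {}, {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(truth, pred))
        fn = sum(t == c and p != c for t, p in zip(truth, pred))
        fp = sum(t != c and p == c for t, p in zip(truth, pred))
        recall[c] = tp / (tp + fn) if tp + fn else 0.0
        iou[c] = tp / (tp + fn + fp) if tp + fn + fp else 0.0
    return recall, iou

# Hypothetical image: 8 background pixels, 2 arm pixels; the prediction
# labels two background pixels as arm.
truth = ["bg"] * 8 + ["arm"] * 2
pred = ["bg"] * 6 + ["arm"] * 4
recall, iou = per_class_scores(truth, pred, ["bg", "arm"])
```

In this toy case the arm recall stays at 1.0 while the arm intersection over union drops to 0.5, so over-predicting a body part is penalised by intersection over union but barely registers in class-averaged recall.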
The first three columns are the pictorial structure pose, instance segmentation and body-part segmentation obtained using our method; the last three columns are the corresponding ground truth.
Furthermore, there are examples of partial occlusion by another person (A right, C left), a background object (C left) and self-occlusion (B right, C right).
