A Bayesian Model of Conditioned Perception

Alan Stocker, Eero P. Simoncelli

NIPS 2007

Abstract
We argue that in many circumstances, human observers evaluate sensory evidence simultaneously under multiple hypotheses regarding the physical process that has generated the sensory information. In such situations, inference can be optimal if an observer combines the evaluation results under each hypothesis according to the probability that the associated hypothesis is correct. However, a number of experimental results reveal suboptimal behavior, which may be explained by assuming that once an observer has committed to a particular hypothesis, subsequent evaluation is based on that hypothesis alone. We formulate this behavior using a conditional Bayesian observer model, and demonstrate that it can account for psychophysical data from a recently reported perceptual experiment in which strong biases in perceptual estimates arise as a consequence of a preceding decision. Not only does the model provide quantitative predictions of subjective responses in variants of the original experiment, but it also appears to be consistent with human responses to cognitive dissonance.
1 Introduction

In different situations, the very same perceptual evidence can lead to very different percepts. Our perception is conditioned on the context within which we judge the evidence. Contextual influences in low-level human perception are the norm rather than the exception, and have been widely reported. Perceptual illusions, for example, often exhibit particularly strong contextual effects, for instance in terms of perceptual space. Data from recent psychophysical experiments suggest that an observer's previous perceptual decisions provide an additional form of context that can substantially influence subsequent perception [3, 4]. In particular, the outcome of a categorical decision task can strongly bias a subsequent estimation task that is based on the same stimulus presentation. Contextual influences are typically strongest when the sensory evidence is most ambiguous in its interpretation, as in the example of the half-full (or half-empty) glass.
Bayesian estimators have proven successful in modeling human behavior in a wide variety of low-level perceptual tasks, for example cue integration. When multiple structural hypotheses about the stimulus are possible, an optimal observer evaluates the sensory evidence under each hypothesis and combines the results, weighted by the probability that each hypothesis is correct. This approach is known as optimal model evaluation [9], or Bayesian model averaging [10], and has previously been suggested to account for cognitive reasoning [11]. In contrast to these studies, however, we propose that model-averaging behavior is abandoned once the observer has committed to a particular hypothesis. Specifically, subsequent perception is conditioned only on the chosen hypothesis, thus sacrificing optimality in order to achieve self-consistency. We examine this hypothesis in the context of a recent experiment in which subjects were asked to estimate the direction of motion of random dot patterns after being forced to make a categorical decision about whether the direction of motion fell on one side or the other of a reference mark [4]. Depending on the level of motion coherence, responses in the estimation task were heavily biased by the categorical decision.
2 Observer Model

We define perception as a statistical estimation problem in which an observer tries to infer the value of some environmental variable s based on sensory evidence m (see Fig. 1). Typically, there are sources of uncertainty associated with m, including both sensor noise and uncertainty about the relationship between the sensory evidence and the variable s. In cases where the structural possibilities are discrete, we denote them as a set of hypotheses H = {h_1, ..., h_N}.
Figure 1: Perception as a conditioned inference problem. Based on noisy sensory measurements m, the observer generates different hypotheses h_1, ..., h_n for the generative structure that relates m to the stimulus variable s (diagram labels: world property s, measurement noise, observer, prior knowledge, estimate ŝ(m)). Perception is a two-fold inference problem: given the measurement and prior knowledge, the observer generates and evaluates different structural hypotheses h_i.

First, the observer computes their belief p(H|m) in each hypothesis for the given sensory evidence m:

    p(H = h_i | m) = p(m | H = h_i) p(H = h_i) / p(m) .    (1)

Second, for each hypothesis, a conditional posterior p(s|m, H = h_i) is formulated, and the full (non-conditional) posterior is computed by integrating the evidence over all hypotheses, weighted by the belief in each hypothesis h_i:

    p(s|m) = Σ_{i=1}^{N} p(s|m, H = h_i) p(H = h_i | m) .    (2)

Finally, the observer selects an estimate ŝ that minimizes the expected value (under the posterior) of an appropriate loss function.
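The two inference steps can be sketched numerically on a discrete grid for s. This is an illustrative toy setup (two Gaussian hypotheses with made-up parameters), not the model fitted in the paper:

```python
import numpy as np

def gauss(x, mu, sigma):
    """Unnormalized Gaussian; densities are normalized on the grid below."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Discrete grid for the environmental variable s.
s = np.linspace(-10.0, 10.0, 2001)
ds = s[1] - s[0]

# Two illustrative structural hypotheses; each pairs a prior p(s|h)
# with a measurement-noise width for the likelihood p(m|s,h).
hypotheses = [
    {"p_h": 0.5, "p_s": gauss(s, -2.0, 2.0), "noise": 1.0},
    {"p_h": 0.5, "p_s": gauss(s, +2.0, 2.0), "noise": 1.0},
]
for h in hypotheses:
    h["p_s"] /= h["p_s"].sum() * ds          # normalize p(s|h)

m = 0.5  # the sensory measurement on this trial

# Eq. (1): belief p(h|m) ∝ p(h) ∫ p(m|s,h) p(s|h) ds.
belief = np.array([
    h["p_h"] * (gauss(m, s, h["noise"]) * h["p_s"]).sum() * ds
    for h in hypotheses
])
belief /= belief.sum()

# Eq. (2): full posterior as a belief-weighted mixture of the
# conditional posteriors p(s|m,h).
posterior = np.zeros_like(s)
for w, h in zip(belief, hypotheses):
    cond = gauss(m, s, h["noise"]) * h["p_s"]
    posterior += w * cond / (cond.sum() * ds)

# For squared-error loss, the optimal estimate is the posterior mean.
s_hat = (s * posterior).sum() * ds
```

With m slightly positive, the belief favors the rightward hypothesis, and the estimate is a compromise between the two conditional posterior means.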
For example, suppose the observer selects the maximum a posteriori hypothesis h_MAP, the hypothesis that is most probable given the sensory evidence and the prior distribution. We assume that this decision then causes the observer to reset the posterior probabilities over the hypotheses to

    p(H|m) = 1 if H = h_MAP, and 0 otherwise.    (3)

That is, the decision-making process forces the observer to consider the selected hypothesis as correct, with all other hypotheses rendered impossible. Substituting Eq. (3) into Eq. (2), the posterior collapses to the single conditional posterior

    p(s|m) = p(s|m, H = h_MAP) .    (4)

We argue that this simplification by decision is essential for complex perceptual tasks (see Discussion). By making a decision, the observer frees resources, eliminating the need to continuously represent probabilities over the other hypotheses, and also simplifies the inference problem.
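The difference between the averaging observer and the conditioned observer can be made concrete in a small self-contained sketch (toy Gaussian hypotheses with made-up parameters):

```python
import numpy as np

def gauss(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2)

s = np.linspace(-10.0, 10.0, 2001)
ds = s[1] - s[0]

def norm(p):                       # normalize a density on the grid
    return p / (p.sum() * ds)

m = 0.5                            # measurement; all parameters illustrative
priors_s = [norm(gauss(s, -2.0, 2.0)), norm(gauss(s, 2.0, 2.0))]  # p(s|h_i)
prior_h = np.array([0.5, 0.5])                                     # p(h_i)

# Eq. (1): belief in each hypothesis.
evidence = np.array([(gauss(m, s, 1.0) * p).sum() * ds for p in priors_s])
belief = prior_h * evidence
belief /= belief.sum()

# Conditional posteriors p(s|m, h_i).
cond = [norm(gauss(m, s, 1.0) * p) for p in priors_s]

# Averaging observer, Eq. (2): belief-weighted mixture estimate.
post_avg = belief[0] * cond[0] + belief[1] * cond[1]
s_avg = (s * post_avg).sum() * ds

# Conditioned observer: commit to h_MAP (Eq. 3); the posterior collapses
# to the chosen conditional posterior (Eq. 4).
h_map = int(np.argmax(belief))
s_cond = (s * cond[h_map]).sum() * ds
# For a measurement just right of the midpoint between the hypotheses,
# s_cond lies further from that midpoint than s_avg does.
```

Committing to a hypothesis removes the pull of the discarded alternative, so the conditioned estimate is displaced away from the ambiguous region relative to the optimal averaged estimate.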
3 Example: Conditioned Perception of Visual Motion

We tested our observer model by simulating a recently reported psychophysical experiment [4]. Subjects in this experiment were asked on each trial to decide whether the overall motion direction of a random dot pattern was to the right or to the left of a reference mark (as seen from the fixation point). Low levels of motion coherence made the decision task difficult for motion directions close to the reference mark. In a randomly selected subset of trials, subjects were also asked to estimate the precise angle of motion direction (see Fig. 2). The decision task always preceded the estimation task, but at the time of the decision, subjects did not know whether they would have to perform the estimation task or not.
3.1 Formulating the observer model

We denote θ as the direction of coherent motion of the random dot pattern, and m as the noisy sensory measurement. Suppose that on a given trial the measurement m indicates a direction of motion to the right of the reference mark.
Figure 2: (a) Jazayeri and Movshon presented moving random dot patterns to subjects and asked them to decide whether the overall motion direction was to the right or to the left of a reference mark [4]. Random dot patterns could exhibit three different levels of motion coherence (3, 6, and 12%), and the single coherent motion direction was randomly selected from a uniform distribution over a symmetric range of angles [−α, α] around the reference mark. (b) In a randomly selected 30% of trials, subjects were also asked, after making the directional decision, to estimate the exact angle of motion direction by adjusting an arrow to point in the direction of perceived motion. In a second version of the experiment, motion was either toward the direction of the reference mark or in the opposite direction.
The observer's belief in each of the two hypotheses, given their measurement, is the posterior distribution according to Eq. (1), obtained by marginalizing over the motion direction θ:

    p(H = h_i | m) = p(H = h_i) ∫ p(m|θ, H = h_i) p(θ|H = h_i) dθ / p(m) .    (5)

The optimal decision is to select the hypothesis h_MAP that maximizes the posterior given by Eq. (5). The subsequent conditioned estimate of motion direction then follows from the posterior conditioned on the chosen hypothesis:

    p(θ | m, H = h_MAP) = p(m|θ, H = h_MAP) p(θ|H = h_MAP) / p(m|H = h_MAP) .    (6)

The model is completely characterized by three quantities: the likelihood functions p(m|θ, H), the prior distributions p(θ|H) of the direction of motion given each hypothesis, and the prior over the hypotheses p(H) itself (shown in Fig. 3).
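As a sketch, Eqs. (5) and (6) can be written out for a single trial, with uniform direction priors on each side of the reference and a Gaussian likelihood; α and the noise width below are illustrative values, not the parameters fitted to subjects:

```python
import numpy as np

alpha, sigma = 20.0, 8.0          # direction range [deg] and noise width; illustrative
theta = np.linspace(-alpha, alpha, 4001)
dth = theta[1] - theta[0]

# Uniform direction priors under each hypothesis (cf. Fig. 3):
# h1 = "left of reference" (θ < 0), h2 = "right of reference" (θ > 0).
prior_theta = [np.where(theta < 0, 1.0, 0.0), np.where(theta > 0, 1.0, 0.0)]
prior_theta = [p / (p.sum() * dth) for p in prior_theta]
prior_h = np.array([0.5, 0.5])    # uniform prior over the hypotheses

m = 2.0                           # a measurement slightly right of the reference
lik = np.exp(-0.5 * ((m - theta) / sigma) ** 2)   # Gaussian p(m|θ, H)

# Eq. (5): belief in each hypothesis, marginalizing over θ.
belief = prior_h * np.array([(lik * p).sum() * dth for p in prior_theta])
belief /= belief.sum()

# Decision, then Eq. (6): posterior conditioned on the chosen hypothesis.
h_map = int(np.argmax(belief))
post = lik * prior_theta[h_map]
post /= post.sum() * dth
theta_hat = (theta * post).sum() * dth
# theta_hat lies well to the right of m: conditioning on "right of the
# reference" truncates the posterior and repels the estimate from the
# decision boundary.
```

Because the conditioned posterior assigns zero probability to directions on the rejected side, its mean is displaced away from the reference even when the measurement itself is nearly ambiguous.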
In the given experimental setup, both prior distributions were uniform, but the width parameter α of the motion direction range was not explicitly available to the subjects and had to be learned individually from training trials. The likelihood functions p(m|θ, H) are determined by the uncertainty about the motion direction due to the low motion coherence levels of the stimuli, and by the sensory noise characteristics of the observer.
Figure 4 compares the predictions of the observer model with human data. Trial data for the model were generated by first sampling a hypothesis h′ according to p(H), then drawing a stimulus direction from p(θ|H = h′), then picking a sensory measurement sample m according to the conditional probability p(m|θ, H = h′), and finally performing inference according to Eqs. (5) and (6). The model captures the characteristics of human behavior in both the decision and the subsequent estimation task. Note the strong influence of the decision task on the subsequent estimation of motion direction, which effectively pushes the estimates away from the decision boundary. We also compared the model with a second version of the experiment, in which the decision task was to discriminate between motion toward and away from the reference [4].
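The trial-generation procedure just described can be sketched end to end; parameter values are again illustrative stand-ins for the fitted ones:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma = 20.0, 8.0                     # illustrative range and noise width
grid = np.linspace(-alpha, alpha, 2001)
dth = grid[1] - grid[0]

def one_trial():
    h = rng.integers(2)                                # hypothesis ~ uniform p(H)
    theta = rng.uniform(0, alpha) * (1 if h else -1)   # θ ~ p(θ|H=h), uniform
    m = theta + rng.normal(0, sigma)                   # noisy sensory measurement
    lik = np.exp(-0.5 * ((m - grid) / sigma) ** 2)
    # Decision via Eq. (5) with uniform priors: compare evidence masses.
    h_map = int(lik[grid > 0].sum() > lik[grid < 0].sum())
    # Conditioned estimate via Eq. (6): posterior restricted to chosen side.
    post = np.where(grid > 0 if h_map else grid < 0, lik, 0.0)
    post /= post.sum() * dth
    return theta, h_map, (grid * post).sum() * dth

trials = [one_trial() for _ in range(2000)]

# Mean estimate for true directions just right of the reference, on trials
# decided "right": strongly repelled from the boundary, as in Fig. 4.
near = [est for th, h, est in trials if 0 < th < 5 and h == 1]
```

For true directions within a few degrees of the reference, the mean conditioned estimate sits far to the chosen side, reproducing the repulsive bias qualitatively; lowering the coherence (raising sigma) strengthens the effect.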
In this version, coherent motion of the random dot pattern was uniformly sampled from a range around the reference and from a range around the direction opposite to the reference, as illustrated by the corresponding prior distributions.

Figure 3: Ingredients of the conditional observer model: the likelihood functions p(m|θ, H), the prior distributions p(θ|H), and the prior p(H), for the three coherence levels (3, 6, and 12%). The sensory signal is assumed to be corrupted by additive Gaussian noise, with a width that varies inversely with the level of motion coherence. The prior distribution over the hypotheses p(H) is uniform. The two prior distributions over motion direction given each hypothesis, p(θ|H = h_{1,2}), are again determined by the experimental setup, and are uniform over the range [0, ±α].
3.3 Predictions

The model framework also allows us to make quantitative predictions of human perceptual behavior under conditions not yet tested. A first modification of the original experiment is to make the two hypotheses unequally probable a priori, by presenting motion on one side of the reference more often than on the other. The model predicts that a human subject would respond to this change by more frequently choosing the more likely hypothesis. However, this hypothesis would also be more likely to be correct, and thus the estimates under this hypothesis would exhibit less bias than in the original experiment. The second modification is to add a second reference and ask the subject to decide between three different classes of motion direction. Again, the model predicts that in such a case, a human subject's estimates in the central class should be biased away from both decision boundaries, leading to an almost constant direction estimate. Estimates following a decision in favor of either of the two outer classes show the same repulsive bias as seen in the original experiment.
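The first prediction can be sketched by rerunning the same observer with a hypothetical asymmetric prior over the hypotheses (the 3:1 odds and all other numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, sigma = 20.0, 8.0                  # illustrative range and noise width
grid = np.linspace(-alpha, alpha, 2001)
dth = grid[1] - grid[0]
p_h = np.array([0.75, 0.25])              # hypothetical asymmetric prior:
                                          # "left of reference" 3x as likely

def decide_and_estimate(m):
    lik = np.exp(-0.5 * ((m - grid) / sigma) ** 2)
    mass = np.array([lik[grid < 0].sum(), lik[grid > 0].sum()])
    belief = p_h * mass / (p_h * mass).sum()      # Eq. (5), uniform p(θ|H)
    h_map = int(np.argmax(belief))                # 0 = left, 1 = right
    post = np.where(grid > 0 if h_map else grid < 0, lik, 0.0)
    post /= post.sum() * dth                      # Eq. (6)
    return h_map, (grid * post).sum() * dth

# Balanced stimulus set: equal numbers of trials at each direction.
decisions = []
for theta in np.linspace(-15, 15, 31):
    for _ in range(200):
        h, est = decide_and_estimate(theta + rng.normal(0, sigma))
        decisions.append(h)

# Prediction: the more probable hypothesis is chosen more often than 50%,
# even though the stimuli themselves are balanced.
frac_left = 1.0 - np.mean(decisions)
```

The asymmetric p(H) shifts the effective decision criterion toward the less probable side, so the favored hypothesis is chosen on well over half the trials despite the balanced stimulus set.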
4 Discussion

We have presented a normative model for human perception that captures the conditioning effects of decisions on an observer's subsequent evaluation of sensory evidence. The model is based on the premise that observers aim for optimal inference (taking into account all sensory evidence and prior information), but that they exhibit decision-induced biases because they also aim to be self-consistent, eliminating alternatives that have been decided against. Why would an observer accept this trade-off? First, self-consistency would seem an important requirement for a stable interpretation of the environment, and adhering to it might outweigh the disadvantages of perceptual misjudgments. Second, framing perception in terms of optimal statistical estimation implies that the more information an observer evaluates, the more accurately they should be able to solve a perceptual task. But this assumes that the observer can construct and retain full probability distributions and perform optimal inference calculations on them.
Figure 4: Comparison of model predictions with human data for the original experiment. Upper left: The two panels show the percentage of trials in which motion was perceived to the right, as a function of the true pattern direction, for the three coherence levels tested. Lower left: Mean estimates of the direction of motion after performing the decision task. Clearly, the decision has a substantial impact on the subsequent estimate, producing a strong bias away from the reference. The model response exhibits biases similar to those of the human subjects, with lower coherence levels producing stronger repulsive effects. Right: Grayscale images show distributions of estimates across trials for both the human subject and the model observer, for all three coherence levels. The model observer performed 40 trials at each motion direction.
One might expect, for example, that the sudden change in posterior probabilities over the hypotheses associated with the decision task would be reflected in sudden changes in the response pattern of neural populations representing these probabilities [16]. For the experiment we have modeled, the hypotheses were specified by the two alternatives of the decision task, and the subjects were forced to choose one of them. This raises two questions. First, do humans always decompose perceptual inference tasks into a set of inference problems, each conditioned on a different hypothesis? In the absence of explicit instructions, humans may automatically perform implicit comparisons relative to reference features that are unconsciously selected from the environment. Second, if humans do consider different hypotheses, do they always select a single one on which subsequent percepts are conditioned, even if not explicitly asked to do so? For example, simply displaying the reference mark in the experiment of [4] (without asking the observer to report any decision) might be sufficient to trigger an implicit decision that would result in behaviors similar to those shown in the explicit case. Finally, although we have only tested it on data from a particular psychophysical experiment, we believe that our model may have implications beyond low-level sensory perception.
Figure 5: Comparison of model predictions with human data for the second experiment, plotted against true direction [deg] for the three coherence levels (3, 6, and 12%). Right: Grayscale images show the trial distributions of the human subject and the model observer for all three coherence levels.
A well-studied human attribute is cognitive dissonance [17], which causes people to adjust their opinions and beliefs to be consistent with their previous statements or behaviors.
² An example that is directly analogous to the perceptual experiment in [4] is documented in [18]: subjects initially rated kitchen appliances for attractiveness, and were then allowed to select one as a gift from among two that they had rated equally. The data show a repulsive bias of the post-decision ratings compared with the pre-decision ratings, such that the rating of the selected appliance increased and the rating of the rejected appliance decreased.
Figure 6: Model predictions for two modifications of the original experiment, plotted against true direction [deg].
A: The prior probabilities of the two hypotheses, p(H), are made asymmetric; however, we keep the prior distribution of motion directions given a particular side, p(θ|H), constant within the range [0, ±α]. The model makes two predictions (trials shown for an intermediate coherence level). First, although the model is tested with an equal number of trials at each motion direction, there is a strong bias induced by the asymmetric prior. Second, the direction estimates on the left are more veridical than those on the right. B: We present two reference marks instead of one, asking the subjects to make a choice between three equally likely regions of motion direction. Again, we assume uniform prior distributions of motion directions within each region. The model predicts bilateral repulsion of the estimates in the central region, leading to a strong bias that is almost independent of coherence level.
wordName wordTfidf (topN-words)
[('observer', 0.514), ('motion', 0.396), ('perceptual', 0.228), ('hypotheses', 0.204), ('direction', 0.199), ('sensory', 0.168), ('coherence', 0.159), ('reference', 0.155), ('perception', 0.148), ('hypothesis', 0.141), ('decision', 0.138), ('deg', 0.133), ('human', 0.119), ('subsequent', 0.113), ('hmap', 0.112), ('subjects', 0.11), ('contextual', 0.082), ('asked', 0.08), ('measurement', 0.076), ('dot', 0.072), ('conditioned', 0.069), ('trials', 0.068), ('illusions', 0.067), ('jazayeri', 0.067), ('repulsive', 0.067), ('glass', 0.066), ('evidence', 0.066), ('observers', 0.064), ('psychophysical', 0.063), ('levels', 0.06), ('experiment', 0.055), ('veridical', 0.053), ('cognitive', 0.051), ('mark', 0.051), ('trial', 0.049), ('prior', 0.046), ('bias', 0.046), ('bayesian', 0.046), ('averaging', 0.046), ('behavior', 0.046), ('appliance', 0.045), ('dissonance', 0.045), ('liquid', 0.045), ('replotted', 0.045), ('biases', 0.045), ('perceived', 0.045), ('characteristics', 0.044), ('predictions', 0.044), ('inference', 0.043), ('posterior', 0.042), ('categorical', 0.041), ('coherent', 0.04), ('appliances', 0.039), ('stocker', 0.039), ('sudden', 0.039), ('humans', 0.037), ('half', 0.037), ('subject', 0.037), ('committed', 0.036), ('hn', 0.036), ('movshon', 0.036), ('task', 0.034), ('stimulus', 0.034), ('noise', 0.032), ('estimates', 0.031), ('asking', 0.031), ('maybe', 0.031), ('model', 0.031), ('away', 0.031), ('belief', 0.031), ('decisions', 0.03), ('predicts', 0.03), ('exhibit', 0.03), ('grayscale', 0.03), ('optical', 0.03), ('simoncelli', 0.03), ('estimation', 0.029), ('situations', 0.029), ('eliminating', 0.029), ('biased', 0.028), ('decide', 0.028), ('account', 0.028), ('evaluation', 0.028), ('visual', 0.028), ('physical', 0.028), ('structural', 0.028), ('rated', 0.027), ('sacri', 0.027), ('selected', 0.027), ('uncertainty', 0.027), ('strong', 0.027), ('cue', 0.026), ('preceding', 0.026), ('asymmetric', 0.026), ('hi', 0.026), ('inversely', 0.026), ('causes', 0.026), 
('conditional', 0.026), ('uences', 0.026), ('implicit', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 3 nips-2007-A Bayesian Model of Conditioned Perception
Author: Alan Stocker, Eero P. Simoncelli
Abstract: unkown-abstract
2 0.11286656 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
Author: Zhengdong Lu, Cristian Sminchisescu, Miguel Á. Carreira-Perpiñán
Abstract: Reliably recovering 3D human pose from monocular video requires models that bias the estimates towards typical human poses and motions. We construct priors for people tracking using the Laplacian Eigenmaps Latent Variable Model (LELVM). LELVM is a recently introduced probabilistic dimensionality reduction model that combines the advantages of latent variable models—a multimodal probability density for latent and observed variables, and globally differentiable nonlinear mappings for reconstruction and dimensionality reduction—with those of spectral manifold learning methods—no local optima, ability to unfold highly nonlinear manifolds, and good practical scaling to latent spaces of high dimension. LELVM is computationally efficient, simple to learn from sparse training data, and compatible with standard probabilistic trackers such as particle filters. We analyze the performance of a LELVM-based probabilistic sigma point mixture tracker in several real and synthetic human motion sequences and demonstrate that LELVM not only provides sufficient constraints for robust operation in the presence of missing, noisy and ambiguous image measurements, but also compares favorably with alternative trackers based on PCA or GPLVM priors. Recent research in reconstructing articulated human motion has focused on methods that can exploit available prior knowledge on typical human poses or motions in an attempt to build more reliable algorithms. The high-dimensionality of human ambient pose space—between 30-60 joint angles or joint positions depending on the desired accuracy level, makes exhaustive search prohibitively expensive. This has negative impact on existing trackers, which are often not sufficiently reliable at reconstructing human-like poses, self-initializing or recovering from failure. 
Such difficulties have stimulated research in algorithms and models that reduce the effective working space, either using generic search focusing methods (annealing, state space decomposition, covariance scaling) or by exploiting specific problem structure (e.g. kinematic jumps). Experience with these procedures has nevertheless shown that any search strategy, no matter how effective, can be made significantly more reliable if restricted to low-dimensional state spaces. This permits a more thorough exploration of the typical solution space, for a given, comparatively similar computational effort as a high-dimensional method. The argument correlates well with the belief that the human pose space, although high-dimensional in its natural ambient parameterization, has a significantly lower perceptual (latent or intrinsic) dimensionality, at least in a practical sense—many poses that are possible are so improbable in many real-world situations that it pays off to encode them with low accuracy. A perceptual representation has to be powerful enough to capture the diversity of human poses in a sufficiently broad domain of applicability (the task domain), yet compact and analytically tractable for search and optimization. This justifies the use of models that are nonlinear and low-dimensional (able to unfold highly nonlinear manifolds with low distortion), yet probabilistically motivated and globally continuous for efficient optimization. Reducing dimensionality is not the only goal: perceptual representations have to preserve critical properties of the ambient space. Reliable tracking needs locality: nearby regions in ambient space have to be mapped to nearby regions in latent space. If this does not hold, the tracker is forced to make unrealistically large, and difficult to predict jumps in latent space in order to follow smooth trajectories in the joint angle ambient space. 
1 In this paper we propose to model priors for articulated motion using a recently introduced probabilistic dimensionality reduction method, the Laplacian Eigenmaps Latent Variable Model (LELVM) [1]. Section 1 discusses the requirements of priors for articulated motion in the context of probabilistic and spectral methods for manifold learning, and section 2 describes LELVM and shows how it combines both types of methods in a principled way. Section 3 describes our tracking framework (using a particle filter) and section 4 shows experiments with synthetic and real human motion sequences using LELVM priors learned from motion-capture data. Related work: There is significant work in human tracking, using both generative and discriminative methods. Due to space limitations, we will focus on the more restricted class of 3D generative algorithms based on learned state priors, and not aim at a full literature review. Deriving compact prior representations for tracking people or other articulated objects is an active research field, steadily growing with the increased availability of human motion capture data. Howe et al. and Sidenbladh et al. [2] propose Gaussian mixture representations of short human motion fragments (snippets) and integrate them in a Bayesian MAP estimation framework that uses 2D human joint measurements, independently tracked by scaled prismatic models [3]. Brand [4] models the human pose manifold using a Gaussian mixture and uses an HMM to infer the mixture component index based on a temporal sequence of human silhouettes. Sidenbladh et al. [5] use similar dynamic priors and exploit ideas in texture synthesis—efficient nearest-neighbor search for similar motion fragments at runtime—in order to build a particle-filter tracker with observation model based on contour and image intensity measurements. 
Sminchisescu and Jepson [6] propose a low-dimensional probabilistic model based on fitting a parametric reconstruction mapping (sparse radial basis function) and a parametric latent density (Gaussian mixture) to the embedding produced with a spectral method. They track humans walking and involved in conversations using a Bayesian multiple hypotheses framework that fuses contour and intensity measurements. Urtasun et al. [7] use a dynamic MAP estimation framework based on a GPLVM and 2D human joint correspondences obtained from an independent image-based tracker. Li et al. [8] use a coordinated mixture of factor analyzers within a particle filtering framework, in order to reconstruct human motion in multiple views using chamfer matching to score different configuration. Wang et al. [9] learn a latent space with associated dynamics where both the dynamics and observation mapping are Gaussian processes, and Urtasun et al. [10] use it for tracking. Taylor et al. [11] also learn a binary latent space with dynamics (using an energy-based model) but apply it to synthesis, not tracking. Our work learns a static, generative low-dimensional model of poses and integrates it into a particle filter for tracking. We show its ability to work with real or partially missing data and to track multiple activities. 1 Priors for articulated human pose We consider the problem of learning a probabilistic low-dimensional model of human articulated motion. Call y ∈ RD the representation in ambient space of the articulated pose of a person. In this paper, y contains the 3D locations of anywhere between 10 and 60 markers located on the person’s joints (other representations such as joint angles are also possible). The values of y have been normalised for translation and rotation in order to remove rigid motion and leave only the articulated motion (see section 3 for how we track the rigid motion). 
While y is high-dimensional, the motion pattern lives in a low-dimensional manifold because most values of y yield poses that violate body constraints or are simply atypical for the motion type considered. Thus we want to model y in terms of a small number of latent variables x given a collection of poses {yn }N (recorded from a human n=1 with motion-capture technology). The model should satisfy the following: (1) It should define a probability density for x and y, to be able to deal with noise (in the image or marker measurements) and uncertainty (from missing data due to occlusion or markers that drop), and to allow integration in a sequential Bayesian estimation framework. The density model should also be flexible enough to represent multimodal densities. (2) It should define mappings for dimensionality reduction F : y → x and reconstruction f : x → y that apply to any value of x and y (not just those in the training set); and such mappings should be defined on a global coordinate system, be continuous (to avoid physically impossible discontinuities) and differentiable (to allow efficient optimisation when tracking), yet flexible enough to represent the highly nonlinear manifold of articulated poses. From a statistical machine learning point of view, this is precisely what latent variable models (LVMs) do; for example, factor analysis defines linear mappings and Gaussian densities, while the generative topographic mapping (GTM; [12]) defines nonlinear mappings and a Gaussian-mixture density in ambient space. However, factor analysis is too limited to be of practical use, and GTM— 2 while flexible—has two important practical problems: (1) the latent space must be discretised to allow tractable learning and inference, which limits it to very low (2–3) latent dimensions; (2) the parameter estimation is prone to bad local optima that result in highly distorted mappings. 
Another dimensionality reduction method recently introduced, GPLVM [13], which uses a Gaussian process mapping f (x), partly improves this situation by defining a tunable parameter xn for each data point yn . While still prone to local optima, this allows the use of a better initialisation for {xn }N (obtained from a spectral method, see later). This has prompted the application of n=1 GPLVM for tracking human motion [7]. However, GPLVM has some disadvantages: its training is very costly (each step of the gradient iteration is cubic on the number of training points N , though approximations based on using few points exist); unlike true LVMs, it defines neither a posterior distribution p(x|y) in latent space nor a dimensionality reduction mapping E {x|y}; and the latent representation it obtains is not ideal. For example, for periodic motions such as running or walking, repeated periods (identical up to small noise) can be mapped apart from each other in latent space because nothing constrains xn and xm to be close even when yn = ym (see fig. 3 and [10]). There exists a different type of dimensionality reduction methods, spectral methods (such as Isomap, LLE or Laplacian eigenmaps [14]), that have advantages and disadvantages complementary to those of LVMs. They define neither mappings nor densities but just a correspondence (xn , yn ) between points in latent space xn and ambient space yn . However, the training is efficient (a sparse eigenvalue problem) and has no local optima, and often yields a correspondence that successfully models highly nonlinear, convoluted manifolds such as the Swiss roll. While these attractive properties have spurred recent research in spectral methods, their lack of mappings and densities has limited their applicability in people tracking. However, a new model that combines the advantages of LVMs and spectral methods in a principled way has been recently proposed [1], which we briefly describe next. 
2 The Laplacian Eigenmaps Latent Variable Model (LELVM)

LELVM is based on a natural way of defining an out-of-sample mapping for Laplacian eigenmaps (LE) which, in addition, results in a density model. In LE, typically we first define a k-nearest-neighbour graph on the sample data {y_n}_{n=1}^N and weigh each edge y_n ∼ y_m by a Gaussian affinity function K(y_n, y_m) = w_nm = exp(−½ ‖(y_n − y_m)/σ‖²). Then the latent points X result from:

  min tr(X L X⊤)  s.t.  X ∈ R^{L×N},  X D X⊤ = I,  X D 1 = 0    (1)

where we define the matrix X_{L×N} = (x_1, . . . , x_N), the symmetric affinity matrix W_{N×N}, the degree matrix D = diag(Σ_{m=1}^N w_nm), the graph Laplacian matrix L = D − W, and 1 = (1, . . . , 1)⊤. The constraints eliminate the two trivial solutions X = 0 (by fixing an arbitrary scale) and x_1 = · · · = x_N (by removing 1, which is an eigenvector of L associated with a zero eigenvalue). The solution is given by the leading eigenvectors u_2, . . . , u_{L+1} of the normalised affinity matrix N = D^{−1/2} W D^{−1/2}, namely X = V⊤ D^{−1/2} with V_{N×L} = (v_2, . . . , v_{L+1}) (an a posteriori translated, rotated or uniformly scaled X is equally valid). Following [1], we now define an out-of-sample mapping F(y) = x for a new point y as a semi-supervised learning problem, by recomputing the embedding as in (1) (i.e., augmenting the graph Laplacian with the new point) but keeping the old embedding fixed:

  min_{x ∈ R^L}  tr( (X x) ( L + diag(K(y))  −K(y) ; −K(y)⊤  1⊤K(y) ) (X x)⊤ )    (2)

which, up to terms constant in x, equals min_x Σ_{n=1}^N K_n(y) ‖x − x_n‖², where K_n(y) = K(y, y_n) = exp(−½ ‖(y − y_n)/σ‖²) for n = 1, . . . , N is the kernel induced by the Gaussian affinity (applied only to the k nearest neighbours of y, i.e., K_n(y) = 0 if y ≁ y_n). This is one natural way of adding a new point to the embedding while keeping the existing embedded points fixed. We need not use the constraints from (1) because they would trivially determine x, and the uninteresting solutions X = 0 and X = constant were already removed in the old embedding anyway.
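As a concrete toy illustration of eq. (1), the LE embedding can be computed with dense linear algebra. The sketch below uses naive O(N²) neighbour search and a full eigendecomposition, whereas a real implementation would use sparse matrices; all function and parameter names are ours, not the authors':

```python
import numpy as np

def laplacian_eigenmaps(Y, L=2, k=5, sigma=1.0):
    """Toy LE embedding (eq. 1): k-NN graph with Gaussian affinities,
    then the leading non-trivial eigenvectors of N = D^{-1/2} W D^{-1/2}."""
    N = Y.shape[0]
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared distances
    W = np.exp(-0.5 * d2 / sigma**2)
    np.fill_diagonal(W, 0.0)
    # keep only the k nearest neighbours of each point, symmetrised
    idx = np.argsort(d2, axis=1)[:, 1:k + 1]
    mask = np.zeros((N, N), dtype=bool)
    mask[np.repeat(np.arange(N), k), idx.ravel()] = True
    W = W * (mask | mask.T)
    D = W.sum(axis=1)                       # degrees
    Dm12 = 1.0 / np.sqrt(D)
    Nmat = Dm12[:, None] * W * Dm12[None, :]
    evals, evecs = np.linalg.eigh(Nmat)     # ascending eigenvalues
    V = evecs[:, ::-1][:, 1:L + 1]          # skip the trivial top eigenvector
    return (V * Dm12[:, None]).T            # X = V^T D^{-1/2}, shape L x N
```

On a motion-capture training set, the columns of the returned X are the latent points x_n used in the rest of the section.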
The solution yields an out-of-sample dimensionality reduction mapping x = F(y):

  x = F(y) = X K(y) / (1⊤ K(y)) = Σ_{n=1}^N ( K(y, y_n) / Σ_{n′=1}^N K(y, y_{n′}) ) x_n    (3)

applicable to any point y (new or old). This mapping is formally identical to a Nadaraya–Watson estimator (kernel regression; [15]) using as data {(x_n, y_n)}_{n=1}^N and the kernel K. We can take this a step further by defining an LVM that has as joint distribution a kernel density estimate (KDE):

  p(x, y) = (1/N) Σ_{n=1}^N K_y(y, y_n) K_x(x, x_n),   p(y) = (1/N) Σ_{n=1}^N K_y(y, y_n),   p(x) = (1/N) Σ_{n=1}^N K_x(x, x_n)

where K_y is proportional to K so that it integrates to 1, and K_x is a pdf kernel in x-space. Consequently, the marginals in observed and latent space are also KDEs, and the dimensionality reduction and reconstruction mappings are given by kernel regression (the conditional means E{x|y}, E{y|x}):

  F(y) = Σ_{n=1}^N p(n|y) x_n,   f(x) = Σ_{n=1}^N ( K_x(x, x_n) / Σ_{n′=1}^N K_x(x, x_{n′}) ) y_n = Σ_{n=1}^N p(n|x) y_n.    (4)

We allow the bandwidths to be different in the latent and ambient spaces: K_x(x, x_n) ∝ exp(−½ ‖(x − x_n)/σ_x‖²) and K_y(y, y_n) ∝ exp(−½ ‖(y − y_n)/σ_y‖²). They may be tuned to control the smoothness of the mappings and densities [1]. Thus, LELVM naturally extends a LE embedding (efficiently obtained as a sparse eigenvalue problem with a cost O(N²)) to global, continuous, differentiable mappings (NW estimators) and potentially multimodal densities having the form of a Gaussian KDE. This allows easy computation of posterior probabilities such as p(x|y) (unlike GPLVM). It can use a continuous latent space of arbitrary dimension L (unlike GTM) by simply choosing L eigenvectors in the LE embedding. It has no local optima since it is based on the LE embedding. LELVM can learn convoluted mappings (e.g. the Swiss roll) and define maps and densities for them [1]. The only parameters to set are the graph parameters (number of neighbours k, affinity width σ) and the smoothing bandwidths σ_x, σ_y.
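In code, the mappings (3)–(4) are just two kernel regressions over the learnt correspondence {(x_n, y_n)}. The sketch below is a minimal version under our own naming, assuming X (L×N latent points) and Y (N×D ambient points) have already been computed:

```python
import numpy as np

def lelvm_mappings(X, Y, sigma_x, sigma_y):
    """Nadaraya-Watson mappings of eqs. (3)-(4) for a learnt LELVM:
    X is L x N (latent points), Y is N x D (ambient points)."""
    def F(y):  # dimensionality reduction, E{x|y}
        w = np.exp(-0.5 * ((Y - y) ** 2).sum(axis=1) / sigma_y**2)
        return X @ (w / w.sum())
    def f(x):  # reconstruction, E{y|x}
        w = np.exp(-0.5 * ((X.T - x) ** 2).sum(axis=1) / sigma_x**2)
        return (w / w.sum()) @ Y
    return F, f
```

With small bandwidths both mappings approximately interpolate the training pairs; larger σ_x, σ_y smooth them, as discussed above.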
3 Tracking framework

We follow the sequential Bayesian estimation framework, where for state variables s and observation variables z we have the recursive prediction and correction equations:

  p(s_t | z_{0:t−1}) = ∫ p(s_t | s_{t−1}) p(s_{t−1} | z_{0:t−1}) ds_{t−1},   p(s_t | z_{0:t}) ∝ p(z_t | s_t) p(s_t | z_{0:t−1}).    (5)

We define the state variables as s = (x, d), where x ∈ R^L is the low-dimensional latent space (for pose) and d ∈ R³ is the centre-of-mass location of the body (in the experiments our state also includes the orientation of the body, but for simplicity here we describe only the translation). The observed variables z consist of image features or the perspective projection of the markers on the camera plane. The mapping from state to observations is (for the markers' case, assuming M markers):

  x ∈ R^L  −f→  y ∈ R^{3M}  −⊕d→  y ⊕ d ∈ R^{3M}  −P→  z ∈ R^{2M}    (6)

where f is the LELVM reconstruction mapping (learnt from mocap data); ⊕ shifts each 3D marker by d ∈ R³; and P is the perspective projection (pinhole camera), applied to each 3D point separately. Here we use a simple observation model p(z_t | s_t): Gaussian with mean given by the transformation (6) and isotropic covariance (set by the user to control the influence of measurements in the tracking). We assume known correspondences and observations that are obtained either from the 3D markers (for tracking synthetic data) or from 2D tracks obtained with a 2D tracker. Our dynamics model is

  p(s_t | s_{t−1}) ∝ p_d(d_t | d_{t−1}) p_x(x_t | x_{t−1}) p(x_t)    (7)

where both dynamics models for d and x are random walks: Gaussians centred at the previous step values d_{t−1} and x_{t−1}, respectively, with isotropic covariance (set by the user to control the influence of dynamics in the tracking); and p(x_t) is the LELVM prior. Thus the overall dynamics predicts states that are both near the previous state and yield feasible poses. Of course, more complex dynamics models could be used if e.g. the speed and direction of movement are known.
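To make (5)–(7) concrete, here is a deliberately simplified sketch: the observation map (6) followed by a plain bootstrap particle filter with random-walk dynamics. The paper's tracker is the GMSPPF of [16], not this filter, and it additionally multiplies the dynamics by the LELVM prior p(x_t), which we omit here; the reconstruction f, the focal length and the noise levels are placeholder assumptions:

```python
import numpy as np

def project_markers(x, d, f, focal=1.0):
    """The map (6): latent pose x -> 3D markers via the reconstruction f,
    rigid shift by the body location d in R^3, then pinhole projection."""
    Ym = f(x).reshape(-1, 3) + d                    # M markers, translated
    return (focal * Ym[:, :2] / Ym[:, 2:]).ravel()  # assumes depth > 0

def bootstrap_filter(z_seq, f, P=500, q=0.05, r=0.1, L=2, rng=None):
    """Recursion (5) with the random-walk dynamics of (7) (LELVM prior
    omitted) and a Gaussian likelihood around (6). Returns posterior means."""
    if rng is None:
        rng = np.random.default_rng(0)
    S = rng.normal(0.0, 1.0, size=(P, L + 3))       # particles s = (x, d)
    means = []
    for z in z_seq:
        S = S + rng.normal(0.0, q, size=S.shape)    # predict: random walk
        err = np.array([np.sum((project_markers(s[:L], s[L:], f) - z) ** 2)
                        for s in S])
        w = np.exp(-0.5 * (err - err.min()) / r**2)  # correct: Gaussian lik.
        w = w / w.sum()
        means.append(w @ S)
        S = S[rng.choice(P, size=P, p=w)]           # resample
    return np.array(means)
```

Swapping this loop for the GMSPPF changes only how the posterior is represented and resampled; the prediction and correction steps stay as in (5).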
As tracker we use the Gaussian mixture Sigma-point particle filter (GMSPPF) [16]. This is a particle filter that uses a Gaussian mixture representation for the posterior distribution in state space and updates it with a Sigma-point Kalman filter. This Gaussian mixture is used as the proposal distribution to draw the particles. As in other particle filter implementations, the prediction step is carried out by approximating the integral in (5) with particles and updating the particles' weights. Then, a new Gaussian mixture is fitted to these particles with a weighted EM algorithm. This replaces the resampling stage needed by many particle filters and mitigates the problem of sample depletion, while also preventing the number of components in the Gaussian mixture from growing over time. The choice of this particular tracker is not critical; we use it to illustrate the fact that LELVM can be introduced in any probabilistic tracker for nonlinear, nongaussian models. Given the corrected distribution p(s_t | z_{0:t}), we choose its mean as the recovered state (pose and location). It is also possible to choose instead the mode closest to the state at t − 1, which could be found by mean-shift or Newton algorithms [17] since we are using a Gaussian-mixture representation in state space.

4 Experiments

We demonstrate our low-dimensional tracker on image sequences of people walking and running, both synthetic (fig. 1) and real (figs. 2–3). Fig. 1 shows that the model copes well with persistent partial occlusion and severely subsampled training data (A, B), and quantitatively evaluates temporal reconstruction (C). For all our experiments, the LELVM parameters (number of neighbours k, Gaussian affinity σ, and bandwidths σ_x and σ_y) were set manually. We mainly considered 2D latent spaces (for pose, plus 6D for rigid motion), which were expressive enough for our experiments. More complex, higher-dimensional models are straightforward to construct.
The initial state distribution p(s_0) was chosen as a broad Gaussian, the dynamics and observation covariances were set manually to control the tracking smoothness, and the GMSPPF tracker used a 5-component Gaussian mixture in latent space (and in the state space of rigid motion) and a small set of 500 particles. The 3D representation we use is a 102-D vector obtained by concatenating the 3D marker coordinates of all the body joints. These would be highly unconstrained if estimated independently, but we only use them as an intermediate representation; tracking actually occurs in the latent space, tightly controlled using the LELVM prior. For the synthetic experiments and some of the real experiments (figs. 2–3) the camera parameters and the body proportions were known (for the latter, we used the 2D outputs of [6]). For the CMU mocap video (fig. 2B) we roughly guessed them. We used mocap data from several sources (CMU, OSU). As observations we always use 2D marker positions, which, depending on the analyzed sequence, were either known (the synthetic case), provided by an existing tracker [6], or specified manually (fig. 2B). Alternatively, 2D point trackers similar to those of [7] can be used. The forward generative model is obtained by combining the latent-to-ambient space mapping (this provides the position of the 3D markers) with a perspective projection transformation. The observation model is a product of Gaussians, each measuring the probability of a particular marker position given its corresponding image point track. Experiments with synthetic data: we analyze the performance of our tracker in controlled conditions (noise-perturbed synthetically generated image tracks), both under regular circumstances (reasonable sampling of training data) and under more severe conditions with subsampled training points and persistent partial occlusion (the man running behind a fence, with many of the 2D marker tracks obstructed). Fig.
1B,C show both the posterior (filtered) latent-space distribution obtained from our tracker and its mean (we do not show the distribution of the global rigid body motion; in all experiments this is tracked with good accuracy). In the latent-space plot of fig. 1A, the onset of running (two cycles were used) appears as a separate region external to the main loop. It does not appear in the subsampled training set of fig. 1B, where only one running cycle was used for training and the onset of running was removed. In each case, one can see that the model is able to track quite competently, with only a modest decrease in its temporal accuracy, shown in fig. 1C, where the averages are computed per 3D joint (normalised w.r.t. body height). Subsampling causes some ambiguity in the estimate, e.g. see the bimodality in the right plot of fig. 1C. In another set of experiments (not shown) we also tracked using different subsets of 3D markers. The estimates were accurate even when about 30% of the markers were dropped. Experiments with real images: these show our tracker's ability to work with real motions of different people, with different body proportions, not in its latent variable model's training set (figs. 2–3). We study walking, running and turns. In all cases, tracking and 3D reconstruction are reasonably accurate. We have also run comparisons against low-dimensional models based on PCA and GPLVM (fig. 3). It is important to note that, for LELVM, errors in the pose estimates are primarily caused by mismatches between the mocap data used to learn the LELVM prior and the body proportions of the person in the video. For example, the body proportions of the OSU motion-captured walker are quite different from those of the person in figs. 2–3 (e.g. note how the legs of the stick man are shorter relative to the trunk). Likewise, the style of the runner from the OSU data (e.g. the swinging of the arms) is quite different from that of the video.
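For reference, the normalised per-frame error reported in fig. 1C can be computed as below; the array shapes are our assumption, not the authors' data format:

```python
import numpy as np

def rmse_per_frame(Y_hat, Y_true, height):
    """Fig. 1C metric: RMSE over the M 3D markers at each frame n,
    normalised by the stick man's height so that an error of 1 equals
    the body height. Y_hat, Y_true: arrays of shape (T, M, 3)."""
    err2 = ((Y_hat - Y_true) ** 2).sum(axis=2)  # squared 3D error per marker
    return np.sqrt(err2.mean(axis=1)) / height  # shape (T,)
```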
Finally, the interest points tracked by the 2D tracker do not entirely correspond either in number or location to the motion capture markers, and are noisy and sometimes missing. In future work, we plan to include an optimization step to also estimate the body proportions. This would be complicated for a general, unconstrained model because the dimensions of the body couple with the pose, so either one or the other can be changed to improve the tracking error (the observation likelihood can also become singular). But for dedicated prior pose models like ours these difficulties should be significantly reduced. The model simply cannot assume highly unlikely stances—these are either not representable at all, or have reduced probability—and thus avoids compensatory, unrealistic body proportion estimates.

[Figure 1 appears here; the plots (panel A with frames n = 15–140, panel B with frames n = 1–60, and panel C with RMSE-vs-time curves) are omitted from this extraction.]

Figure 1: OSU running man motion capture data. A: we use 217 datapoints for training LELVM (with added noise) and for tracking. Row 1: tracking in the 2D latent space. The contours (very tight in this sequence) are the posterior probability. Row 2: perspective-projection-based observations with occlusions. Row 3: each quadruplet (a, a′, b, b′) shows the true pose of the running man from front and side views (a, b), and the reconstructed pose by tracking with our model (a′, b′). B: we use the first running cycle for training LELVM and the second cycle for tracking. C: RMSE errors for each frame, for the tracking of A (left plot) and B (middle plot), normalised so that 1 equals the height of the stick man: RMSE(n) = ((1/M) Σ_{j=1}^M ‖ŷ_{nj} − y_{nj}‖²)^{1/2} over the 3D locations of the M markers, i.e., a comparison of the reconstructed stick man ŷ_n with the ground-truth stick man y_n. Right plot: multimodal posterior distribution in pose space for the model of A (frame 42).

Comparison with PCA and GPLVM (fig.
3): for these models, the tracker uses the same GMSPPF setting as for LELVM (number of particles, initialisation, random-walk dynamics, etc.) but with the mapping y = f(x) provided by GPLVM or PCA, and with a uniform prior p(x) in latent space (since neither GPLVM nor the non-probabilistic PCA provides one). The LELVM tracker uses both its f(x) and its latent-space prior p(x), as discussed. All methods use a 2D latent space. We ensured the best possible training of GPLVM by model selection based on multiple runs. For PCA, the latent space looks deceptively good, showing non-intersecting loops. However, (1) individual loops do not collect together as they should (for LELVM they do); and (2), worse still, the mapping from 2D to pose space yields a poor observation model. The reason is that the loop in 102-D pose space is nonlinearly bent, and a plane can at best intersect it at a few points, so the tracker often stays put at one of those (typically an “average” standing position), since leaving it would increase the error a lot. Using more latent dimensions would improve this but, as LELVM shows, it is not necessary. For GPLVM, we found high sensitivity to filter initialisation: the estimates have high variance across runs and are inaccurate ≈ 80% of the time. When it fails, the GPLVM tracker often freezes in latent space, like PCA. When it does succeed, it produces results that are comparable with LELVM, although somewhat less accurate visually. However, even then GPLVM's latent space consists of continuous chunks spread apart and offset from each other; GPLVM has no incentive to place two x's that map to the same y near each other. This effect, combined with the lack of a data-sensitive, realistic latent-space density p(x), makes GPLVM jump erratically from chunk to chunk, in contrast with LELVM, which smoothly follows the 1D loop.
Some GPLVM problems might be alleviated using higher-order dynamics, but our experiments suggest that such modeling sophistication is less crucial if locality constraints are correctly modeled (as in LELVM). We conclude that, compared to LELVM, GPLVM is significantly less robust for tracking, has much higher training overhead and lacks some operations (e.g. computing latent conditionals based on partly missing ambient data).

[Figure 2 appears here; the plots (panel A with frames n = 1–69, panel B with frames n = 4–29) are omitted from this extraction.]

Figure 2: A: tracking of a video from [6] (turning & walking). We use 220 datapoints (3 full walking cycles) for training LELVM. Row 1: tracking in the 2D latent space. The contours are the estimated posterior probability. Row 2: tracking based on markers. The red dots are the 2D tracks and the green stick man is the 3D reconstruction obtained using our model. Row 3: our 3D reconstruction from a different viewpoint. B: tracking of a person running straight towards the camera. Notice the scale changes and possible forward–backward ambiguities in the 3D estimates. We train the LELVM using 180 datapoints (2.5 running cycles); 2D tracks were obtained by manually marking the video. In both A and B the mocap training data was for a person different from the video's (with different body proportions and motions), and no ground-truth estimate was available for favourable initialisation.

[Figure 3 appears here; the latent-space tracking plots for LELVM, GPLVM and PCA at frame 38 are omitted from this extraction.]

Figure 3: Method comparison, frame 38. PCA and GPLVM map consecutive walking cycles to spatially distinct latent space regions. Compounded by a data-independent latent prior, the resulting tracker gets easily confused: it jumps across loops and/or remains put, trapped in local optima. In contrast, LELVM is stable and tightly follows a 1D manifold (see videos).

5 Conclusion and future work

We have proposed the use of priors based on the Laplacian Eigenmaps Latent Variable Model (LELVM) for people tracking. LELVM is a probabilistic dimensionality reduction
method that combines the advantages of latent variable models and spectral manifold learning algorithms: a multimodal probability density over latent and ambient variables, globally differentiable nonlinear mappings for reconstruction and dimensionality reduction, no local optima, ability to unfold highly nonlinear manifolds, and good practical scaling to latent spaces of high dimension. LELVM is computationally efficient, simple to learn from sparse training data, and compatible with standard probabilistic trackers such as particle filters. Our results using a LELVM-based probabilistic sigma-point mixture tracker with several real and synthetic human motion sequences show that LELVM provides sufficient constraints for robust operation in the presence of missing, noisy and ambiguous image measurements. Comparisons with PCA and GPLVM show LELVM is superior in terms of accuracy, robustness and computation time. The objective of this paper was to demonstrate the ability of the LELVM prior in a simple setting using 2D tracks obtained automatically or manually, and single-type motions (running, walking). Future work will explore more complex observation models such as silhouettes; the combination of different motion types in the same latent space (whose dimension will exceed 2); and the exploration of multimodal posterior distributions in latent space caused by ambiguities.

Acknowledgments

This work was partially supported by NSF CAREER award IIS–0546857 (MACP), NSF IIS–0535140 and EC MCEXT–025481 (CS). CMU data: http://mocap.cs.cmu.edu (created with funding from NSF EIA–0196217). OSU data: http://accad.osu.edu/research/mocap/mocap data.htm.

References

[1] M. Á. Carreira-Perpiñán and Z. Lu. The Laplacian Eigenmaps Latent Variable Model. In AISTATS, 2007.
[2] N. R. Howe, M. E. Leventon, and W. T. Freeman. Bayesian reconstruction of 3D human motion from single-camera video. In NIPS, volume 12, pages 820–826, 2000.
[3] T.-J. Cham and J. M. Rehg.
A multiple hypothesis approach to figure tracking. In CVPR, 1999.
[4] M. Brand. Shadow puppetry. In ICCV, pages 1237–1244, 1999.
[5] H. Sidenbladh, M. J. Black, and L. Sigal. Implicit probabilistic models of human motion for synthesis and tracking. In ECCV, volume 1, pages 784–800, 2002.
[6] C. Sminchisescu and A. Jepson. Generative modeling for continuous non-linearly embedded visual inference. In ICML, pages 759–766, 2004.
[7] R. Urtasun, D. J. Fleet, A. Hertzmann, and P. Fua. Priors for people tracking from small training sets. In ICCV, pages 403–410, 2005.
[8] R. Li, M.-H. Yang, S. Sclaroff, and T.-P. Tian. Monocular tracking of 3D human motion with a coordinated mixture of factor analyzers. In ECCV, volume 2, pages 137–150, 2006.
[9] J. M. Wang, D. Fleet, and A. Hertzmann. Gaussian process dynamical models. In NIPS, volume 18, 2006.
[10] R. Urtasun, D. J. Fleet, and P. Fua. Gaussian process dynamical models for 3D people tracking. In CVPR, pages 238–245, 2006.
[11] G. W. Taylor, G. E. Hinton, and S. Roweis. Modeling human motion using binary latent variables. In NIPS, volume 19, 2007.
[12] C. M. Bishop, M. Svensén, and C. K. I. Williams. GTM: The generative topographic mapping. Neural Computation, 10(1):215–234, January 1998.
[13] N. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783–1816, November 2005.
[14] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, June 2003.
[15] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall, 1986.
[16] R. van der Merwe and E. A. Wan. Gaussian mixture sigma-point particle filters for sequential probabilistic inference in dynamic state-space models. In ICASSP, volume 6, pages 701–704, 2003.
[17] M. Á. Carreira-Perpiñán. Acceleration strategies for Gaussian mean-shift image segmentation.
In CVPR, pages 1160–1167, 2006.
3 0.11172944 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
Author: Omer Bobrowski, Ron Meir, Shy Shoham, Yonina Eldar
Abstract: It is becoming increasingly evident that organisms acting in uncertain dynamical environments often employ exact or approximate Bayesian statistical calculations in order to continuously estimate the environmental state, integrate information from multiple sensory modalities, form predictions and choose actions. What is less clear is how these putative computations are implemented by cortical neural networks. An additional level of complexity is introduced because these networks observe the world through spike trains received from primary sensory afferents, rather than directly. A recent line of research has described mechanisms by which such computations can be implemented using a network of neurons whose activity directly represents a probability distribution across the possible “world states”. Much of this work, however, uses various approximations, which severely restrict the domain of applicability of these implementations. Here we make use of rigorous mathematical results from the theory of continuous time point process filtering, and show how optimal real-time state estimation and prediction may be implemented in a general setting using linear neural networks. We demonstrate the applicability of the approach with several examples, and relate the required network properties to the statistical nature of the environment, thereby quantifying the compatibility of a given network with its environment. 1
4 0.11040587 203 nips-2007-The rat as particle filter
Author: Aaron C. Courville, Nathaniel D. Daw
Abstract: Although theorists have interpreted classical conditioning as a laboratory model of Bayesian belief updating, a recent reanalysis showed that the key features that theoretical models capture about learning are artifacts of averaging over subjects. Rather than learning smoothly to asymptote (reflecting, according to Bayesian models, the gradual tradeoff from prior to posterior as data accumulate), subjects learn suddenly and their predictions fluctuate perpetually. We suggest that abrupt and unstable learning can be modeled by assuming subjects are conducting inference using sequential Monte Carlo sampling with a small number of samples — one, in our simulations. Ensemble behavior resembles exact Bayesian models since, as in particle filters, it averages over many samples. Further, the model is capable of exhibiting sophisticated behaviors like retrospective revaluation at the ensemble level, even given minimally sophisticated individuals that do not track uncertainty in their beliefs over trials. 1
5 0.10665813 125 nips-2007-Markov Chain Monte Carlo with People
Author: Adam Sanborn, Thomas L. Griffiths
Abstract: Many formal models of cognition implicitly use subjective probability distributions to capture the assumptions of human learners. Most applications of these models determine these distributions indirectly. We propose a method for directly determining the assumptions of human learners by sampling from subjective probability distributions. Using a correspondence between a model of human choice and Markov chain Monte Carlo (MCMC), we describe a method for sampling from the distributions over objects that people associate with different categories. In our task, subjects choose whether to accept or reject a proposed change to an object. The task is constructed so that these decisions follow an MCMC acceptance rule, defining a Markov chain for which the stationary distribution is the category distribution. We test this procedure for both artificial categories acquired in the laboratory, and natural categories acquired from experience. 1
6 0.10407262 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
7 0.092489846 51 nips-2007-Comparing Bayesian models for multisensory cue combination without mandatory integration
8 0.087524325 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes
9 0.06636367 114 nips-2007-Learning and using relational theories
10 0.064536899 154 nips-2007-Predicting Brain States from fMRI Data: Incremental Functional Principal Component Regression
11 0.061545983 202 nips-2007-The discriminant center-surround hypothesis for bottom-up saliency
12 0.059633508 155 nips-2007-Predicting human gaze using low-level saliency combined with face detection
13 0.055541102 33 nips-2007-Bayesian Inference for Spiking Neuron Models with a Sparsity Prior
14 0.055338267 145 nips-2007-On Sparsity and Overcompleteness in Image Models
15 0.054927919 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
16 0.05448202 74 nips-2007-EEG-Based Brain-Computer Interaction: Improved Accuracy by Automatic Single-Trial Error Detection
17 0.053706326 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI
18 0.052396536 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
19 0.051983457 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
20 0.051449828 50 nips-2007-Combined discriminative and generative articulated pose and non-rigid shape estimation
topicId topicWeight
[(0, -0.167), (1, 0.025), (2, 0.073), (3, -0.035), (4, 0.04), (5, 0.122), (6, -0.002), (7, 0.082), (8, -0.067), (9, -0.172), (10, -0.057), (11, 0.004), (12, -0.034), (13, -0.019), (14, 0.039), (15, -0.074), (16, 0.013), (17, 0.057), (18, 0.154), (19, -0.063), (20, 0.017), (21, 0.013), (22, -0.019), (23, -0.079), (24, -0.028), (25, -0.011), (26, -0.142), (27, -0.012), (28, 0.143), (29, -0.03), (30, 0.117), (31, -0.031), (32, -0.042), (33, -0.039), (34, 0.107), (35, -0.058), (36, 0.011), (37, -0.007), (38, -0.049), (39, -0.109), (40, 0.163), (41, 0.035), (42, 0.018), (43, -0.11), (44, -0.034), (45, -0.029), (46, -0.047), (47, 0.04), (48, -0.031), (49, 0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.97262222 3 nips-2007-A Bayesian Model of Conditioned Perception
Author: Alan Stocker, Eero P. Simoncelli
Abstract: unknown-abstract
2 0.72109967 203 nips-2007-The rat as particle filter
Author: Aaron C. Courville, Nathaniel D. Daw
Abstract: Although theorists have interpreted classical conditioning as a laboratory model of Bayesian belief updating, a recent reanalysis showed that the key features that theoretical models capture about learning are artifacts of averaging over subjects. Rather than learning smoothly to asymptote (reflecting, according to Bayesian models, the gradual tradeoff from prior to posterior as data accumulate), subjects learn suddenly and their predictions fluctuate perpetually. We suggest that abrupt and unstable learning can be modeled by assuming subjects are conducting inference using sequential Monte Carlo sampling with a small number of samples — one, in our simulations. Ensemble behavior resembles exact Bayesian models since, as in particle filters, it averages over many samples. Further, the model is capable of exhibiting sophisticated behaviors like retrospective revaluation at the ensemble level, even given minimally sophisticated individuals that do not track uncertainty in their beliefs over trials.
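The single-sample idea in this abstract can be sketched in a few lines. This is a loose illustration, not the authors' exact algorithm: one particle carries the learner's current hypothesis about whether a cue predicts an outcome, and the particle is resampled from the prior whenever an observed trial is unlikely under it, producing abrupt jumps rather than smooth learning curves. The hypothesis names and likelihoods below are hypothetical.

```python
import random

random.seed(0)  # reproducible sketch

def one_particle_learning(trials, p_outcome, hypotheses):
    # A single "particle" holds the current hypothesis. After each
    # trial, keep it with probability equal to the likelihood of the
    # trial under that hypothesis; otherwise resample from the
    # (uniform) prior -- an abrupt, step-like belief change.
    h = random.choice(hypotheses)
    history = []
    for outcome in trials:  # outcome is 1 if the event occurred
        lik = p_outcome[h] if outcome else 1.0 - p_outcome[h]
        if random.random() > lik:
            h = random.choice(hypotheses)
        history.append(h)
    return history

# Hypothetical conditioning setup: the cue either predicts the
# outcome (likelihood 0.9) or does not (likelihood 0.1).
p_outcome = {"predictive": 0.9, "non-predictive": 0.1}
trials = [1] * 200  # the outcome occurs on every trial
history = one_particle_learning(trials, p_outcome, list(p_outcome))
```

Averaging many such runs recovers a smooth curve, which is the paper's point about ensemble behavior resembling exact Bayesian models.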
3 0.68405646 125 nips-2007-Markov Chain Monte Carlo with People
Author: Adam Sanborn, Thomas L. Griffiths
Abstract: Many formal models of cognition implicitly use subjective probability distributions to capture the assumptions of human learners. Most applications of these models determine these distributions indirectly. We propose a method for directly determining the assumptions of human learners by sampling from subjective probability distributions. Using a correspondence between a model of human choice and Markov chain Monte Carlo (MCMC), we describe a method for sampling from the distributions over objects that people associate with different categories. In our task, subjects choose whether to accept or reject a proposed change to an object. The task is constructed so that these decisions follow an MCMC acceptance rule, defining a Markov chain for which the stationary distribution is the category distribution. We test this procedure for both artificial categories acquired in the laboratory, and natural categories acquired from experience.
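The mechanism this abstract describes, decisions that follow an MCMC acceptance rule so the chain's stationary distribution is the category distribution, can be sketched as a small simulation. This is not the experimental procedure: the 1-D Gaussian "category", proposal distribution, and all numbers are illustrative, and the Barker rule is one valid acceptance rule with the ratio form of a probabilistic choice rule.

```python
import math
import random

random.seed(0)  # reproducible sketch

def barker_accept(p_current, p_proposed):
    # Barker acceptance rule: accept the proposal with probability
    # p(x') / (p(x') + p(x)). For a symmetric proposal this leaves
    # the target distribution stationary, like the Metropolis rule.
    return random.random() < p_proposed / (p_proposed + p_current)

def sample_category(density, start, proposal_sd=0.3, n_steps=5000):
    # Each step mimics one trial: the "subject" sees the current
    # object x and a proposed change x', and accepts or rejects it.
    x = start
    chain = []
    for _ in range(n_steps):
        x_new = x + random.gauss(0.0, proposal_sd)
        if barker_accept(density(x), density(x_new)):
            x = x_new
        chain.append(x)
    return chain

# Hypothetical category: Gaussian over a stimulus dimension
# (mean 2.0, sd 0.5). These numbers are made up for the example.
density = lambda x: math.exp(-0.5 * ((x - 2.0) / 0.5) ** 2)
chain = sample_category(density, start=0.0)
post = chain[1000:]  # discard burn-in
mean = sum(post) / len(post)
```

If subjects' accept/reject decisions really follow such a rule, the visited objects (after burn-in) are samples from the subjective category distribution, so its mean and spread can be read off the chain directly.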
4 0.61187744 198 nips-2007-The Noisy-Logical Distribution and its Application to Causal Inference
Author: Hongjing Lu, Alan L. Yuille
Abstract: We describe a novel noisy-logical distribution for representing the distribution of a binary output variable conditioned on multiple binary input variables. The distribution is represented in terms of noisy-or’s and noisy-and-not’s of causal features which are conjunctions of the binary inputs. The standard noisy-or and noisy-andnot models, used in causal reasoning and artificial intelligence, are special cases of the noisy-logical distribution. We prove that the noisy-logical distribution is complete in the sense that it can represent all conditional distributions provided a sufficient number of causal factors are used. We illustrate the noisy-logical distribution by showing that it can account for new experimental findings on how humans perform causal reasoning in complex contexts. We speculate on the use of the noisy-logical distribution for causal reasoning and artificial intelligence.
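The two building blocks named in this abstract have simple closed forms. A minimal sketch with illustrative weights and no leak term (the paper's full distribution combines these over conjunctions of binary inputs):

```python
def noisy_or(weights, inputs):
    # Noisy-OR: each present cause i independently produces the
    # effect with probability w_i, so
    # P(E=1 | inputs) = 1 - prod_i (1 - w_i) over the active causes.
    p_all_fail = 1.0
    for w, x in zip(weights, inputs):
        if x:
            p_all_fail *= (1.0 - w)
    return 1.0 - p_all_fail

def noisy_and_not(w_cause, w_preventer, cause, preventer):
    # Noisy-AND-NOT: a generative cause gated by an inhibitory one,
    # P(E=1) = w_cause * (1 - w_preventer) when both are present.
    p = w_cause if cause else 0.0
    if preventer:
        p *= (1.0 - w_preventer)
    return p

p1 = noisy_or([0.8, 0.5], [1, 1])   # 1 - 0.2 * 0.5 = 0.9
p2 = noisy_and_not(0.8, 0.5, 1, 1)  # 0.8 * (1 - 0.5) = 0.4
```

The completeness claim in the abstract is that mixtures of such terms, with causal features that are conjunctions of the binary inputs, can represent any conditional distribution of a binary effect.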
5 0.57335323 150 nips-2007-Optimal models of sound localization by barn owls
Author: Brian J. Fischer
Abstract: Sound localization by barn owls is commonly modeled as a matching procedure where localization cues derived from auditory inputs are compared to stored templates. While the matching models can explain properties of neural responses, no model explains how the owl resolves spatial ambiguity in the localization cues to produce accurate localization for sources near the center of gaze. Here, I examine two models for the barn owl's sound localization behavior. First, I consider a maximum likelihood estimator in order to further evaluate the cue matching model. Second, I consider a maximum a posteriori estimator to test whether a Bayesian model with a prior that emphasizes directions near the center of gaze can reproduce the owl's localization behavior. I show that the maximum likelihood estimator can not reproduce the owl's behavior, while the maximum a posteriori estimator is able to match the behavior. This result suggests that the standard cue matching model will not be sufficient to explain sound localization behavior in the barn owl. The Bayesian model provides a new framework for analyzing sound localization in the barn owl and leads to predictions about the owl's localization behavior.
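The contrast between the two estimators in this abstract reduces to whether a prior term enters the argmax. A grid-search sketch with made-up Gaussian numbers (not the owl data or the paper's likelihood model):

```python
def grid_map(directions, log_likelihood, log_prior):
    # MAP estimate over a grid of candidate source directions.
    # With a flat prior (log_prior = 0) this reduces to the
    # maximum-likelihood, cue-matching estimate.
    return max(directions, key=lambda d: log_likelihood(d) + log_prior(d))

directions = range(-90, 91)  # candidate azimuths in degrees
cue = 40.0                   # direction suggested by the auditory cues

# Illustrative Gaussians: sensory noise sd 10 degrees, prior peaked
# at the center of gaze (0 degrees) with sd 20 degrees.
loglik = lambda d: -((d - cue) ** 2) / (2 * 10.0 ** 2)
logprior = lambda d: -(d ** 2) / (2 * 20.0 ** 2)

ml_est = grid_map(directions, loglik, lambda d: 0.0)  # matches the cue
map_est = grid_map(directions, loglik, logprior)      # pulled toward 0
```

With these numbers the prior pulls the estimate from 40 to 32 degrees; a prior emphasizing the center of gaze systematically biases peripheral estimates toward zero, which is the behavioral signature the maximum-likelihood model cannot produce.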
6 0.54519099 114 nips-2007-Learning and using relational theories
7 0.52846289 51 nips-2007-Comparing Bayesian models for multisensory cue combination without mandatory integration
8 0.45879066 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
9 0.43520027 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes
10 0.42011312 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
11 0.41154501 19 nips-2007-Active Preference Learning with Discrete Choice Data
12 0.39210084 133 nips-2007-Modelling motion primitives and their timing in biologically executed movements
13 0.38906094 50 nips-2007-Combined discriminative and generative articulated pose and non-rigid shape estimation
14 0.38591641 145 nips-2007-On Sparsity and Overcompleteness in Image Models
15 0.38574022 85 nips-2007-Experience-Guided Search: A Theory of Attentional Control
16 0.36905116 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes
17 0.35566291 59 nips-2007-Continuous Time Particle Filtering for fMRI
18 0.34049889 74 nips-2007-EEG-Based Brain-Computer Interaction: Improved Accuracy by Automatic Single-Trial Error Detection
19 0.34024793 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
20 0.32389328 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning
topicId topicWeight
[(5, 0.035), (13, 0.031), (16, 0.032), (18, 0.444), (21, 0.06), (34, 0.017), (35, 0.013), (47, 0.099), (83, 0.093), (85, 0.012), (87, 0.013), (90, 0.061)]
simIndex simValue paperId paperTitle
1 0.92171258 203 nips-2007-The rat as particle filter
Author: Aaron C. Courville, Nathaniel D. Daw
Abstract: Although theorists have interpreted classical conditioning as a laboratory model of Bayesian belief updating, a recent reanalysis showed that the key features that theoretical models capture about learning are artifacts of averaging over subjects. Rather than learning smoothly to asymptote (reflecting, according to Bayesian models, the gradual tradeoff from prior to posterior as data accumulate), subjects learn suddenly and their predictions fluctuate perpetually. We suggest that abrupt and unstable learning can be modeled by assuming subjects are conducting inference using sequential Monte Carlo sampling with a small number of samples — one, in our simulations. Ensemble behavior resembles exact Bayesian models since, as in particle filters, it averages over many samples. Further, the model is capable of exhibiting sophisticated behaviors like retrospective revaluation at the ensemble level, even given minimally sophisticated individuals that do not track uncertainty in their beliefs over trials.
same-paper 2 0.88048726 3 nips-2007-A Bayesian Model of Conditioned Perception
Author: Alan Stocker, Eero P. Simoncelli
Abstract: unknown-abstract
3 0.82185775 213 nips-2007-Variational Inference for Diffusion Processes
Author: Cédric Archambeau, Manfred Opper, Yuan Shen, Dan Cornford, John S. Shawe-Taylor
Abstract: Diffusion processes are a family of continuous-time continuous-state stochastic processes that are in general only partially observed. The joint estimation of the forcing parameters and the system noise (volatility) in these dynamical systems is a crucial, but non-trivial task, especially when the system is nonlinear and multimodal. We propose a variational treatment of diffusion processes, which allows us to compute type II maximum likelihood estimates of the parameters by simple gradient techniques and which is computationally less demanding than most MCMC approaches. We also show how a cheap estimate of the posterior over the parameters can be constructed based on the variational free energy.
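A full variational treatment is beyond a short sketch, but the discrete-time Gaussian transition density that variational and MCMC approaches to diffusions both build on is easy to write down. Everything below is a simplification of the paper's setting (fully observed path, linear drift, illustrative parameters): an Ornstein-Uhlenbeck process is simulated by Euler-Maruyama, and the drift and volatility parameters are recovered by minimizing the negative log-likelihood over a tiny grid.

```python
import math
import random

random.seed(0)  # reproducible sketch

def simulate_ou(theta, sigma, x0, dt, n):
    # Euler-Maruyama discretization of dX = -theta * X dt + sigma dW.
    xs = [x0]
    for _ in range(n):
        xs.append(xs[-1] - theta * xs[-1] * dt
                  + sigma * math.sqrt(dt) * random.gauss(0.0, 1.0))
    return xs

def neg_log_lik(xs, theta, sigma, dt):
    # Gaussian transition density of the Euler discretization:
    # X_{t+dt} | X_t ~ N(X_t - theta * X_t * dt, sigma^2 * dt).
    var = sigma ** 2 * dt
    nll = 0.0
    for x0, x1 in zip(xs, xs[1:]):
        mean = x0 - theta * x0 * dt
        nll += 0.5 * math.log(2 * math.pi * var) + (x1 - mean) ** 2 / (2 * var)
    return nll

xs = simulate_ou(theta=1.0, sigma=0.5, x0=1.0, dt=0.01, n=50000)
# Crude grid-based estimate of (theta, sigma); the paper instead uses
# gradients of a variational free energy under partial observation.
grid = [(t, s) for t in (0.5, 1.0, 2.0) for s in (0.25, 0.5, 1.0)]
best = min(grid, key=lambda ts: neg_log_lik(xs, ts[0], ts[1], 0.01))
```

The volatility is pinned down very sharply by the quadratic variation of the path; the drift parameter needs a long observation window, which is one reason approximate posteriors over parameters (as in the abstract) are useful.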
4 0.52444822 125 nips-2007-Markov Chain Monte Carlo with People
Author: Adam Sanborn, Thomas L. Griffiths
Abstract: Many formal models of cognition implicitly use subjective probability distributions to capture the assumptions of human learners. Most applications of these models determine these distributions indirectly. We propose a method for directly determining the assumptions of human learners by sampling from subjective probability distributions. Using a correspondence between a model of human choice and Markov chain Monte Carlo (MCMC), we describe a method for sampling from the distributions over objects that people associate with different categories. In our task, subjects choose whether to accept or reject a proposed change to an object. The task is constructed so that these decisions follow an MCMC acceptance rule, defining a Markov chain for which the stationary distribution is the category distribution. We test this procedure for both artificial categories acquired in the laboratory, and natural categories acquired from experience.
5 0.51326627 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
Author: Matthew Hoffman, Arnaud Doucet, Nando D. Freitas, Ajay Jasra
Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulations methods in situations where the reward is a rare event.
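The key move in this abstract, sampling directly from a distribution proportional to the reward, can be sketched with plain Metropolis-Hastings over a single policy parameter. This is a simplification: the paper's contribution is the trans-dimensional, reversible-jump version, and the one-parameter reward surface below is entirely hypothetical.

```python
import math
import random

random.seed(0)  # reproducible sketch

def mh_reward_sampling(reward, theta0, step=0.3, n_iter=5000):
    # Metropolis-Hastings targeting p(theta) proportional to
    # reward(theta): accept a proposed parameter with probability
    # min(1, reward(theta') / reward(theta)). High-reward regions
    # are visited in proportion to their reward.
    theta, r = theta0, reward(theta0)
    samples = []
    for _ in range(n_iter):
        prop = theta + random.gauss(0.0, step)
        r_prop = reward(prop)
        if random.random() < min(1.0, r_prop / r):
            theta, r = prop, r_prop
        samples.append(theta)
    return samples

# Hypothetical one-parameter policy whose expected reward peaks
# at theta = 1.5 (strictly positive, so the ratio is well defined).
reward = lambda t: math.exp(-(t - 1.5) ** 2)
samples = mh_reward_sampling(reward, theta0=0.0)
mean = sum(samples[1000:]) / len(samples[1000:])
```

Because the chain spends time in proportion to reward, its samples concentrate around good policies even when high reward is a rare event under naive simulation, which is the advantage the abstract claims over classic methods.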
6 0.46864027 47 nips-2007-Collapsed Variational Inference for HDP
7 0.45751601 51 nips-2007-Comparing Bayesian models for multisensory cue combination without mandatory integration
8 0.45471799 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
9 0.44760525 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
10 0.44485146 74 nips-2007-EEG-Based Brain-Computer Interaction: Improved Accuracy by Automatic Single-Trial Error Detection
11 0.43817848 202 nips-2007-The discriminant center-surround hypothesis for bottom-up saliency
12 0.42946631 155 nips-2007-Predicting human gaze using low-level saliency combined with face detection
13 0.41926065 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis
14 0.41313517 100 nips-2007-Hippocampal Contributions to Control: The Third Way
15 0.40755334 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI
16 0.40733117 198 nips-2007-The Noisy-Logical Distribution and its Application to Causal Inference
17 0.40702614 214 nips-2007-Variational inference for Markov jump processes
18 0.40556645 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
19 0.4044213 154 nips-2007-Predicting Brain States from fMRI Data: Incremental Functional Principal Component Regression
20 0.40422457 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging