jmlr jmlr2006 jmlr2006-28 knowledge-graph by maker-knowledge-mining

28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting


Source: pdf

Author: Martin J. Wainwright

Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: that is, the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working under the restriction of limited computation, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of convex variational methods. This stability result provides additional incentive, apart from the obvious benefit of unique global optima, for using message-passing methods based on convex variational relaxations. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. Keywords: graphical model, Markov random field, belief propagation, variational method, parameter estimation, prediction error, algorithmic stability

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Working under the restriction of limited computation, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. [sent-7, score-0.381]

2 En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of convex variational methods. [sent-11, score-0.611]

3 This stability result provides additional incentive, apart from the obvious benefit of unique global optima, for using message-passing methods based on convex variational relaxations. [sent-12, score-0.352]

4 Keywords: graphical model, Markov random field, belief propagation, variational method, parameter estimation, prediction error, algorithmic stability. [sent-14, score-0.475]

5 Well-known examples of such variational methods include mean field algorithms, the belief propagation or sum-product algorithm, as well as various extensions including generalized belief propagation and expectation propagation. [sent-30, score-0.499]

6 Specifically, our analysis applies to variational methods that are based on convex relaxations. [sent-62, score-0.228]

7 , 2005), reweighted forms of generalized belief propagation (Wiegerinck, 2005), and semidefinite relaxations (Wainwright and Jordan, 2005). [sent-64, score-0.289]

8 Moreover, it is possible to modify other variational methods—for instance, expectation propagation (Minka, 2001)—so as to “convexify” them. [sent-65, score-0.233]

9 En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of variational methods. [sent-71, score-0.536]

10 Thus, our global stability result provides further theoretical justification—apart from the obvious benefit of unique global optima—for using message-passing methods based on convex variational relaxations. [sent-75, score-0.382]

11 In Section 3, we introduce background on variational representations, including the notion of a convex surrogate to the cumulant generating function, and then illustrate this notion via the tree-reweighted Bethe approximation (Wainwright et al. [sent-79, score-0.612]

12 In Section 4, we describe how any convex surrogate defines a particular joint scheme for parameter estimation and prediction. [sent-81, score-0.4]

13 Section 5 provides results on the asymptotic behavior of the estimation step, as well as the stability of the prediction step. [sent-82, score-0.21]

14 Section 6 is devoted to the derivation of performance bounds for joint estimation and prediction methods, with particular emphasis on the mixture-of-Gaussians observation model. [sent-83, score-0.227]

15 In Section 7, we provide experimental results on the performance of a joint estimation/prediction method based on the tree-reweighted Bethe surrogate, and compare it to a heuristic method based on the ordinary belief propagation algorithm. [sent-84, score-0.272]

16 We use the lower-case letter x_s to denote a particular realization of the random variable X_s, which takes values in the set 𝒳_s. [sent-96, score-0.354]

17 , m − 1, the {0, 1}-valued indicator function I_j[x_s] := 1 if x_s = j, and 0 otherwise (1). These indicator functions can be used to define a potential function θ_s(·) : 𝒳_s → R via θ_s(x_s) := ∑_{j=1}^{m−1} θ_{s;j} I_j[x_s] (2), where θ_s = {θ_{s;j}, j = 1, . [sent-101, score-0.354]
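As a concrete illustration of this overcomplete indicator parameterization, the following minimal NumPy sketch evaluates θ_s(x_s) = ∑_{j=1}^{m−1} θ_{s;j} I_j[x_s] for a three-state node; the array names and values are my own, chosen purely for illustration.

    import numpy as np

    m = 3                                   # number of states at node s
    theta_sj = np.array([0.0, 0.4, -0.2])   # theta_{s;j}, with the j = 0 weight fixed to 0
    def node_potential(x_s):
        # theta_s(x_s) = sum_j theta_{s;j} * I_j[x_s] reduces to a table lookup
        return theta_sj[x_s]

    print([node_potential(x) for x in range(m)])   # [0.0, 0.4, -0.2]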

18 Construction of Convex Surrogates: This section is devoted to a systematic procedure for constructing convex functions that represent approximations to the cumulant generating function. [sent-135, score-0.287]

19 Nonetheless, this exact variational principle is useful, in that various natural relaxations of the optimization problem can be used to define convex surrogates to the cumulant generating function. [sent-138, score-0.501]

20 With this form of the dual function, we are guaranteed that the cumulant generating function A has the following variational representation: A(θ) = max_{µ ∈ MARG_φ(G)} { θ^T µ − A*(µ) }. [sent-156, score-0.356]

21 Here a special case of the junction tree theory guarantees that any Markov random field on a tree T = (V, E(T)) can be factorized in terms of its marginals as p(x; θ(µ)) = ∏_{s∈V} µ_s(x_s) ∏_{(s,t)∈E(T)} [ µ_st(x_s, x_t) / (µ_s(x_s) µ_t(x_t)) ]. [sent-168, score-0.228]

22 As we discuss in the following section, the resulting relaxed optimization problem defines a convex surrogate to the cumulant generating function. [sent-172, score-0.496]

23 2 Convex Surrogates to the Cumulant Generating Function We now describe a general procedure for constructing convex surrogates to the cumulant generating function, consisting of two main ingredients. [sent-174, score-0.312]

24 By combining these two approximations, we obtain a convex surrogate B to the cumulant generating function, specified via the solution of the following relaxed optimization problem: B(θ) := max_{τ ∈ REL_φ(G)} { θ^T τ − B*(τ) }. (11) [sent-182, score-0.496]

25 The function B so defined has several desirable properties, as summarized in the following proposition: Proposition 2 Any convex surrogate B defined via (11) has the following properties: (i) For each θ ∈ R^d, the optimum defining B is attained at a unique point τ(θ). [sent-184, score-0.311]

26 These quantities, as locally valid marginal distributions, must satisfy the following set of local consistency conditions: LOCAL_φ(G) := { τ ∈ R^d_+ | ∑_{x_s} τ_s(x_s) = 1 for all s, and ∑_{x_t} τ_st(x_s, x_t) = τ_s(x_s) for all edges (s, t) and states x_s }. [sent-198, score-0.483]

27 The convexified Bethe entropy approximation is based on taking a convex combination of these tree entropies, where each tree is weighted by a probability ρ(T ) ∈ [0, 1]. [sent-206, score-0.226]

28 , 2005) that it is strictly convex for any vector {ρ_st} of strictly positive edge appearance probabilities. [sent-209, score-0.404]

29 Bethe surrogate and reweighted sum-product: We use these two ingredients—the relaxation LOCAL_φ(G) of the marginal polytope, and the convexified Bethe entropy approximation (12)—to define the following convex surrogate: B_ρ(θ) := max_{τ ∈ LOCAL_φ(G)} { θ^T τ − B*_ρ(τ) }. [sent-210, score-0.73]
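The maximization defining B_ρ can be carried out by the tree-reweighted (TRW) sum-product algorithm. The NumPy sketch below implements the standard TRW message update and pseudomarginal read-out for a pairwise model; the data layout and helper names are my own, and the update is the form given by Wainwright et al. (2005), so treat it as an illustrative sketch rather than the paper's code.

    import numpy as np

    def trw_sum_product(theta_s, theta_st, rho, edges, n_iters=200, damp=0.5):
        # theta_s[s]: length-m node parameter vector; theta_st[(s,t)]: m-x-m edge matrix
        # rho[(s,t)]: edge appearance probability rho_st in (0,1]; edges: list of pairs
        m = len(next(iter(theta_s.values())))
        nbrs = {s: [] for s in theta_s}
        for (s, t) in edges:
            nbrs[s].append(t)
            nbrs[t].append(s)
        # msg[(t, s)] is the message passed from node t to node s
        msg = {}
        for (s, t) in edges:
            msg[(s, t)] = np.ones(m) / m
            msg[(t, s)] = np.ones(m) / m
        get_rho = lambda s, t: rho[(s, t)] if (s, t) in rho else rho[(t, s)]
        get_th = lambda s, t: theta_st[(s, t)] if (s, t) in theta_st else theta_st[(t, s)].T
        for _ in range(n_iters):
            new_msg = {}
            for (t, s) in msg:
                # product of incoming messages at t, reweighted by the rho's
                pre = np.exp(theta_s[t])
                for u in nbrs[t]:
                    if u != s:
                        pre = pre * msg[(u, t)] ** get_rho(u, t)
                pre = pre / msg[(s, t)] ** (1.0 - get_rho(s, t))
                kernel = np.exp(get_th(s, t) / get_rho(s, t))   # kernel[x_s, x_t]
                new = kernel @ pre
                new_msg[(t, s)] = damp * msg[(t, s)] + (1 - damp) * new / new.sum()
            msg = new_msg
        # singleton pseudomarginals: tau_s propto exp(theta_s) * prod_t msg[t->s]^rho_st
        tau = {}
        for s in theta_s:
            b = np.exp(theta_s[s])
            for t in nbrs[s]:
                b = b * msg[(t, s)] ** get_rho(t, s)
            tau[s] = b / b.sum()
        return tau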

30 It is intractable to maximize the regularized likelihood directly, due to the presence of the cumulant generating function A. [sent-236, score-0.433]

31 Thus, a natural thought is to use the convex surrogate B to define an alternative estimator obtained by maximizing the regularized surrogate likelihood: ℓ_B(θ) := ⟨µ^n, θ⟩ − B(θ) − λ_n R(θ). (16) [sent-237, score-0.551]

32 By design, the surrogate B and hence the surrogate likelihood ℓ_B, as well as their derivatives, can be computed in a straightforward manner (typically by some sort of message-passing algorithm). [sent-238, score-0.452]

33 It is thus straightforward to compute the parameter estimate θ^n achieving the maximum of the regularized surrogate likelihood (for instance, gradient descent would be a simple though naive method). [sent-239, score-0.275]
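Since ∇B(θ) is exactly the pseudomarginal vector τ(θ) returned by the message-passing solver, the gradient of the surrogate likelihood (16) is simply µ^n − τ(θ) − λ_n ∇R(θ). A minimal sketch of the resulting (naive) first-order method, assuming a flattened parameter vector, an oracle `trw_marginals` for τ(θ), and ridge regularization R(θ) = ‖θ‖²/2 — all choices of mine, not prescribed by the paper:

    import numpy as np

    def fit_surrogate_mle(mu_hat, trw_marginals, dim, lam=0.01, step=0.1, n_iters=500):
        # mu_hat: empirical marginals stacked into a vector, same layout as tau(theta)
        # trw_marginals(theta): returns tau(theta) = grad B(theta), e.g. via TRW messages
        theta = np.zeros(dim)
        for _ in range(n_iters):
            grad = mu_hat - trw_marginals(theta) - lam * theta   # grad of (16), R = ||.||^2 / 2
            theta = theta + step * grad                          # naive ascent step
        return theta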

34 , 2003b) that in the absence of regularization, the optimal parameter estimates θ^n have a very simple closed-form solution, specified in terms of the weights ρ_st and the empirical marginals µ. [sent-241, score-0.363]
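For the TRW Bethe surrogate in particular, this closed form is the pseudo-moment-matching solution of Wainwright et al. (2003b): θ_s(x_s) = log µ_s(x_s) and θ_st(x_s, x_t) = ρ_st log[ µ_st(x_s, x_t) / (µ_s(x_s) µ_t(x_t)) ]. A direct NumPy transcription follows; the container layout is mine, and I am quoting the formula from memory of that reference, so verify it against the estimates (32) cited below:

    import numpy as np

    def pseudo_moment_matching(mu_s, mu_st, rho):
        # mu_s[s]: empirical node marginals; mu_st[(s,t)]: empirical pairwise marginals
        theta_s = {s: np.log(mu_s[s]) for s in mu_s}
        theta_st = {(s, t): rho[(s, t)] * np.log(mu_st[(s, t)] / np.outer(mu_s[s], mu_s[t]))
                    for (s, t) in mu_st}
        return theta_s, theta_st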

35 The key point is that the same convex surrogate B is used both in forming the surrogate likelihood (16) for approximate parameter estimation, and in the variational method (11) for performing prediction. [sent-256, score-0.75]

36 In the exponential family setting, for a fixed noisy observation y, this posterior can always be written as a new exponential family member, described by parameter θ + γ(y). [sent-258, score-0.22]

37 , x_n} by maximizing the (regularized) surrogate likelihood ℓ_B. [sent-267, score-0.242]

38 , a contaminated version of z ∼ p(· ; θ*)) specified by a factorized conditional distribution of the form p(y | z) = ∏_{s=1}^{N} p(y_s | z_s), incorporate it into the model by forming the new exponential parameter θ^n_s(·) + γ_s(y), where γ_s(y) merges the new data with the fitted model θ^n. [sent-271, score-0.218]

39 Using the message-passing algorithm associated with the convex surrogate B, compute approximate marginals τ(θ + γ) for the distribution that combines the fitted model with the new observation. [sent-274, score-0.456]

40 Use these approximate marginals to construct a prediction z(y; τ) of z based on the observation y and pseudomarginals τ. [sent-275, score-0.33]
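Putting these steps together, the whole joint scheme is a short pipeline. The outline below chains the helper sketches given earlier (pseudo_moment_matching, trw_sum_product) and later (predict_node, after sentence 70) in this summary; it traces the data flow for the Gaussian mixture observation model, and is not the paper's implementation.

    import numpy as np

    def joint_estimate_predict(mu_s_hat, mu_st_hat, rho, edges, y, alpha, nu, sigma2):
        # 1. Estimation: fit theta^n from the empirical marginals via the surrogate likelihood.
        theta_s, theta_st = pseudo_moment_matching(mu_s_hat, mu_st_hat, rho)
        # 2. Data fusion: fold the noisy observation y into the node parameters by adding
        #    the per-state log-likelihood of y_s (this plays the role of gamma_s(y)).
        def obs_loglik(y_s):
            var = alpha**2 * np.asarray(sigma2) + 1.0 - alpha**2
            return -0.5 * np.log(2 * np.pi * var) - 0.5 * (y_s - alpha * np.asarray(nu))**2 / var
        theta_tilde = {s: theta_s[s] + obs_loglik(y[s]) for s in theta_s}
        # 3. Approximate marginalization with the *same* convex surrogate (TRW messages).
        tau = trw_sum_product(theta_tilde, theta_st, rho, edges)
        # 4. Prediction from the pseudomarginals (posterior-mean combination per node).
        return {s: predict_node(tau[s], y[s], alpha, nu, sigma2) for s in tau}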

41 The most important property of this joint scheme is that the convex surrogate B underlies both the parameter estimation phase (used to form the surrogate likelihood), and the prediction phase (used in the variational method for computing approximate marginals). [sent-282, score-0.875]

42 We then prove a Lipschitz stability result applicable to any variational method that is based on a strongly concave entropy approximation. [sent-287, score-0.394]

43 1 Estimator Asymptotics We begin by considering the asymptotic behavior of the parameter estimator θn defined by the surrogate likelihood (16). [sent-290, score-0.359]

44 Since this parameter estimator is a particular type of M-estimator (Serfling, 1980), its asymptotic behavior can be investigated using standard methods, as summarized in the following: Proposition 3 Recall the cumulant generating function A defined in Equation (4). [sent-291, score-0.263]

45 Let B be a strictly convex surrogate for A, defined via Equation (11) with a strictly concave entropy approximation −B∗ . [sent-292, score-0.403]

46 (b) the estimator is asymptotically normal: √n (θ^n − θ) →_d N( 0, [∇²B(θ)]^{-1} ∇²A(θ*) [∇²B(θ)]^{-1} ). Proof: By construction, the convex surrogate B and the (negative) entropy approximation B* are a Fenchel-Legendre conjugate dual pair. [sent-295, score-0.461]
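The sandwich covariance above is the standard M-estimation form: the outer factors are the inverse Hessian of the surrogate at the limit point, and the middle factor ∇²A(θ*) is the covariance of the sufficient statistics under the true model. A one-liner for assembling it, assuming both Hessians are available as matrices:

    import numpy as np

    def sandwich_covariance(hess_B, hess_A_star):
        # asymptotic covariance of sqrt(n) * (theta^n - theta):
        #   [grad^2 B(theta)]^{-1} grad^2 A(theta*) [grad^2 B(theta)]^{-1}
        inv_B = np.linalg.inv(hess_B)
        return inv_B @ hess_A_star @ inv_B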

47 By strict convexity, the regularized surrogate likelihood (17) has a unique global maximum. [sent-303, score-0.272]

48 However, previous experimental work has shown that methods based on convex relaxations, including the reweighted sum-product (or belief propagation) algorithm (Wainwright et al. [sent-320, score-0.248]

49 Here we provide theoretical support for these empirical observations. (Figure: grid with attractive coupling.) [sent-324, score-0.217]

50 Results shown for a grid with N = 100 nodes over a range of attractive coupling strengths. [sent-342, score-0.217]

51 The tree-reweighted algorithm, shown for two different settings of the edge weights ρ_st, remains stable over the full range of coupling strengths. [sent-344, score-0.237]

52 In particular, we prove that, in sharp contrast to non-convex methods, any variational method based on a strongly convex entropy approximation is globally stable. [sent-347, score-0.217]

53 In particular, it turns out that any variational method based on a suitably concave entropy approximation satisfies such a stability condition. [sent-352, score-0.365]

54 With this definition, we have: Proposition 5 Consider any strictly convex surrogate B based on a strongly concave entropy approximation −B∗ . [sent-355, score-0.432]

55 Lemma 6 For any set {ρst } of strictly positive edge appearance probabilities, the convexified Bethe entropy (12) is strongly concave. [sent-364, score-0.222]

56 The entropy approximations that underlie other variational methods (e. [sent-369, score-0.282]

57 Recall from Proposition 3 that the parameter estimator based on the surrogate likelihood B is inconsistent, in the sense that the parameter vector θ returned in the limit of infinite data is generally not equal to the true parameter θ∗ . [sent-384, score-0.427]

58 Approximate prediction: Since the marginal distributions µ_{s;j}(θ* + γ) are intractable to compute exactly, it is natural to consider an approximate predictor, based on a set τ of pseudomarginals computed from a variational relaxation. [sent-404, score-0.327]

59 More specifically, in the limiting case, the errors systematically introduced by the inconsistent learning procedure are cancelled out exactly by the approximate variational method for computing marginal distributions. [sent-419, score-0.294]

60 , m − 1, so that z_opt(Y; θ*) ≈ Y ≈ z_app(Y; θ). [sent-425, score-0.231]

61 Second, in order to apply the Lipschitz stability result, it is convenient to express the effect of introducing a new observation vector y, drawn from the additive noise observation model (21), as a perturbation of the exponential parameterization. [sent-432, score-0.264]

62 Thus, in contrast to the combination of approximate estimation with approximate prediction, applying the approximate message-passing algorithm to the true model fails to be exact even in the limit of zero SNR. [sent-455, score-0.213]

63 By the Pythagorean relation that characterizes the Bayes least squares estimator z_opt(Y; θ*) = E[Z | Y, θ*], we have ∆R(α; θ*, θ) := (1/N) E‖z_app(Y; θ) − Z‖² − (1/N) E‖z_opt(Y; θ*) − Z‖² = (1/N) E‖z_app(Y; θ) − z_opt(Y; θ*)‖². [sent-472, score-0.644]

64 Using the relation τ(θ) = µ(θ*) guaranteed by the definition of the ML estimator and surrogate estimator, we have ‖τ(θ + γ) − µ(θ* + γ)‖₂ = ‖[τ(θ + γ) − τ(θ)] + [µ(θ*) − µ(θ* + γ)]‖₂ = ‖[∇²B(θ + sγ) − ∇²A(θ* + tγ)] γ‖₂, for some s, t ∈ [0, 1], where we have used the mean value theorem. [sent-478, score-0.266]

65 Experimental Results: In order to test our joint estimation/prediction procedure, we have applied it to coupled Gaussian mixture models on different graphs, over a range of coupling strengths, observation SNRs, and mixture distributions. [sent-483, score-0.414]

66 Any instantiation of the tree-reweighted sum-product algorithm is specified by a collection of edge weights ρ_st, one for each edge (s, t) of the graph. [sent-491, score-0.405]

67 , 2003b), the estimates (32) are the global maxima of the surrogate likelihood (16) based on the convexified Bethe approximation (12) without any regularization term (i. [sent-503, score-0.272]

68 Given the new noisy observation Y of the form (21), we incorporate it by forming the new exponential parameter θ^n_s + γ_s(Y), where Equation (26a) defines γ_s for the Gaussian mixture model under consideration. [sent-507, score-0.277]

69 We then compute approximate marginals τ(θ + γ) by running the TRW sum-product algorithm with edge appearance weights ρst , using the message updates (14), on the graphical model distribution with exponential parameter θ + γ. [sent-509, score-0.388]

70 We use the approximate marginals (see Equation (15)) to construct the prediction z_app in Equation (24). [sent-510, score-0.32]
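Concretely, for the mixture-of-Gaussians observation model y_s = α z_s + √(1 − α²) v_s (see sentences 95–96 below), each node prediction is a pseudomarginal-weighted combination of per-component Gaussian posterior means, ẑ_s = ∑_j τ_{s;j} E[Z_s | y_s, X_s = j]. The conditional mean in the sketch below is my own standard Gaussian-conditioning derivation, not a formula quoted from the paper:

    import numpy as np

    def predict_node(tau_s, y_s, alpha, nu, sigma2):
        # tau_s[j]: pseudomarginal of mixture indicator X_s = j at this node
        # With y = alpha*z + sqrt(1 - alpha^2)*v and z | X=j ~ N(nu[j], sigma2[j]):
        #   E[z | y, j] = nu[j] + alpha*sigma2[j]/(alpha^2*sigma2[j] + 1 - alpha^2) * (y - alpha*nu[j])
        nu, sigma2 = np.asarray(nu), np.asarray(sigma2)
        gain = alpha * sigma2 / (alpha**2 * sigma2 + 1.0 - alpha**2)
        cond_mean = nu + gain * (y_s - alpha * nu)
        return float(np.dot(tau_s, cond_mean))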

71 The prediction step reduces to computing the Bayes least squares estimate at each node independently, based only on the local data ys . [sent-514, score-0.34]

72 (b) The standard sum-product or belief propagation (BP) approach is closely related to the tree-reweighted sum-product method, but based on the edge weights ρ_st = 1 for all edges. [sent-515, score-0.51]

73 , 2003b), this approximate parameter estimate uniquely defines the Markov random field for which the empirical marginals µs and µst are fixed points of the ordinary belief propagation algorithm. [sent-518, score-0.404]

74 In the prediction step, we then use the ordinary belief propagation algorithm (i. [sent-521, score-0.306]

75 , again with ρst = 1) to compute approximate marginals of the graphical model with parameter θ + γ. [sent-523, score-0.231]

76 Since the mixture variables have m = 2 states, the coupling distribution can be written as p(x; θ*) ∝ exp{ ∑_{s∈V} θ*_s x_s + ∑_{(s,t)∈E} θ*_st x_s x_t }, where x ∈ {−1, +1}^N are “spin” variables indexing the mixture components. [sent-550, score-0.977]

77 We tested two types of coupling in the underlying Markov random field: (a) In the case of attractive coupling, for each coupling strength β ∈ [0, 1], we chose edge parameters as θ*_st ∼ U[0, β]. [sent-555, score-0.217]

78 (b) In the case of mixed coupling, for each coupling strength β ∈ [0, 1], we chose edge parameters as θ*_st ∼ U[−β, β]. [sent-556, score-0.544]

79 Here U[a, b] denotes the uniform distribution on the interval [a, b]. [sent-557, score-0.227]
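A two-line sketch of how such coupling ensembles can be generated (the toy edge set and RNG seed are mine):

    import numpy as np

    rng = np.random.default_rng(0)
    edges = [(s, s + 1) for s in range(9)]   # toy chain; the paper uses grids
    beta = 0.5                               # coupling strength
    theta_attractive = {e: rng.uniform(0.0, beta) for e in edges}   # (a) theta*_st ~ U[0, beta]
    theta_mixed = {e: rng.uniform(-beta, beta) for e in edges}      # (b) theta*_st ~ U[-beta, beta]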

80 1, for any underlying model p(x; θ*) in which approximate message-passing yields the incorrect marginals (without any additional observations), there exists a range of SNR around α ≈ 0 for which this superior performance will hold. [sent-563, score-0.222]

81 Figure 4: Line plots of percentage increase in MSE relative to Bayes optimum for the TRW method applied to the true model (black circles) versus the approximate model (red diamonds) as a function of observation SNR for grids with N = 64 nodes, and attractive coupling β = 0. [sent-568, score-0.384]

82 Figure 4 provides an empirical demonstration of this claim, when the TRW algorithm for prediction is applied to a grid with N = 64 nodes and attractive coupling strength β = 0. [sent-571, score-0.372]

83 Comparison between tree-reweighted and ordinary sum-product: We now compare the performance of the prediction method based on tree-reweighted sum-product (TRW) message-passing to that based on ordinary sum-product or belief propagation (BP) message-passing. [sent-580, score-0.306]

84 The top two rows show performance for attractive coupling, for mixture ensemble A ((a) through (c)) and ensemble B ((d) through (f)), whereas the bottom two rows show performance for mixed coupling, for mixture ensemble A ((g) through (i)) and ensemble B ((j) through (l)). [sent-582, score-0.462]

85 Figure 5: Surface plots of the percentage increase in MSE relative to Bayes optimum for different methods as a function of edge strength and observation SNR for grids with N = 64 nodes. [sent-613, score-0.228]

86 First row: Attractive coupling and a Gaussian mixture with components (ν_0, σ_0²) = (−1, 0. [sent-617, score-0.25]

87 Second row: Attractive coupling and a Gaussian mixture with components (ν_0, σ_0²) = (0, 1) and (ν_1, σ_1²) = (0, 9). [sent-620, score-0.25]

88 Third row: Mixed coupling and a Gaussian mixture with components (ν_0, σ_0²) = (−1, 0. [sent-621, score-0.25]

89 Fourth row: Mixed coupling and a Gaussian mixture with components (ν_0, σ_0²) = (0, 1) and (ν_1, σ_1²) = (0, 9). [sent-624, score-0.25]

90 As the coupling is increased, the BP method eventually deteriorates quite seriously; indeed, for large enough coupling and low/intermediate SNR, its performance can be worse than the independence (IND) model. [sent-627, score-0.338]

91 Figure 6 provides plots of the actual MSE increase for the TRW algorithm (solid red lines), compared to the theoretical bound (27) (dotted blue lines), for the grid with N = 64 nodes, and attractive coupling of strength β = 0. [sent-636, score-0.297]

92 In this paper, we have described and analyzed methods for joint estimation and prediction that are based on convex variational methods. [sent-658, score-0.385]

93 unique global optima—to using variational methods based on convex approximations. [sent-678, score-0.258]

94 In particular, we established a global Lipschitz stability property that applies to any message-passing algorithm that is based on a strongly concave entropy approximation. [sent-679, score-0.271]

95 Form of Exponential Parameter: Consider the observation model y_s = α z_s + √(1 − α²) v_s, where v_s ∼ N(0, 1) and z_s is a mixture of two Gaussians (ν_0, σ_0²) and (ν_1, σ_1²). [sent-719, score-0.487]

96 Conditioned on the value of the mixing indicator X_s = j, the distribution of y_s is Gaussian with mean αν_j and variance α²σ_j² + (1 − α²). [sent-720, score-0.265]

97 Let us focus on one component p(y_s | x_s) in the factorized conditional distribution p(y | x) = ∏_{s=1}^{N} p(y_s | x_s). [sent-721, score-0.708]

98 We wish to represent the influence of this term on x_s in the form exp(γ_s x_s) for some exponential parameter γ_s. [sent-723, score-0.218]
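For the spin representation x_s ∈ {−1, +1}, requiring p(y_s | x_s) ∝ exp(γ_s x_s) up to an x_s-independent constant gives the log-likelihood-ratio form γ_s(y_s) = ½ log[ p(y_s | x_s = +1) / p(y_s | x_s = −1) ]. The sketch below evaluates this for the Gaussian mixture observation model; it is my generic construction of such a γ_s, not a transcription of the paper's Equation (26a):

    import numpy as np

    def gamma_from_obs(y_s, alpha, nu, sigma2):
        # log N(y_s; alpha*nu[j], alpha^2*sigma2[j] + 1 - alpha^2) for component j
        def log_lik(j):
            var = alpha**2 * sigma2[j] + 1.0 - alpha**2
            return -0.5 * np.log(2 * np.pi * var) - 0.5 * (y_s - alpha * nu[j])**2 / var
        # component j = 1 <-> x_s = +1, j = 0 <-> x_s = -1
        return 0.5 * (log_lik(1) - log_lik(0))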

99 Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudomoment matching. [sent-934, score-0.251]

100 Constructing free energy approximations and generalized belief propagation algorithms. [sent-992, score-0.211]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xs', 0.354), ('snr', 0.285), ('wainwright', 0.28), ('ys', 0.265), ('st', 0.227), ('surrogate', 0.21), ('coupling', 0.169), ('bethe', 0.168), ('rong', 0.157), ('stimating', 0.157), ('variational', 0.153), ('marg', 0.147), ('cumulant', 0.136), ('trw', 0.133), ('zopt', 0.126), ('mse', 0.115), ('zapp', 0.105), ('marginals', 0.103), ('raphical', 0.102), ('odel', 0.096), ('stability', 0.094), ('belief', 0.093), ('entropy', 0.091), ('bp', 0.09), ('mixture', 0.081), ('strength', 0.08), ('propagation', 0.08), ('reweighted', 0.08), ('prediction', 0.075), ('convex', 0.075), ('pseudomarginals', 0.073), ('convexi', 0.071), ('rel', 0.071), ('zs', 0.068), ('edge', 0.068), ('rd', 0.065), ('xt', 0.065), ('lipschitz', 0.065), ('marginal', 0.064), ('surrogates', 0.063), ('wiegerinck', 0.063), ('ensemble', 0.063), ('route', 0.061), ('ordinary', 0.058), ('estimator', 0.056), ('exponential', 0.055), ('ind', 0.053), ('markov', 0.053), ('blse', 0.052), ('incorrect', 0.051), ('eld', 0.05), ('attractive', 0.048), ('polytope', 0.045), ('mooij', 0.044), ('tatikonda', 0.044), ('observation', 0.042), ('mrfs', 0.042), ('mts', 0.042), ('treereweighted', 0.042), ('joint', 0.041), ('estimation', 0.041), ('cov', 0.04), ('inconsistent', 0.04), ('proposition', 0.038), ('generating', 0.038), ('approximations', 0.038), ('relaxed', 0.037), ('approximate', 0.037), ('tted', 0.036), ('relaxations', 0.036), ('interpolation', 0.036), ('noisy', 0.035), ('ct', 0.035), ('appearance', 0.034), ('predictor', 0.034), ('multinomial', 0.033), ('parameter', 0.033), ('variances', 0.032), ('optima', 0.032), ('ihler', 0.032), ('kappen', 0.032), ('likelihood', 0.032), ('pseudomarginal', 0.031), ('bayes', 0.031), ('model', 0.031), ('limit', 0.03), ('tree', 0.03), ('global', 0.03), ('strongly', 0.029), ('denoising', 0.029), ('dual', 0.029), ('bounds', 0.028), ('begin', 0.028), ('hs', 0.027), ('loopy', 0.027), ('graphical', 0.027), ('pietra', 0.027), ('pseudolikelihood', 0.027), ('concave', 0.027), ('optimum', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

Author: Martin J. Wainwright

Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: that is, the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working under the restriction of limited computation, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of convex variational methods. This stability result provides additional incentive, apart from the obvious benefit of unique global optima, for using message-passing methods based on convex variational relaxations. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. Keywords: graphical model, Markov random field, belief propagation, variational method, parameter estimation, prediction error, algorithmic stability

2 0.19121605 32 jmlr-2006-Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems

Author: David Barber

Abstract: We introduce a method for approximate smoothed inference in a class of switching linear dynamical systems, based on a novel form of Gaussian Sum smoother. This class includes the switching Kalman ‘Filter’ and the more general case of switch transitions dependent on the continuous latent state. The method improves on the standard Kim smoothing approach by dispensing with one of the key approximations, thus making fuller use of the available future information. Whilst the central assumption required is projection to a mixture of Gaussians, we show that an additional conditional independence assumption results in a simpler but accurate alternative. Our method consists of a single Forward and Backward Pass and is reminiscent of the standard smoothing ‘correction’ recursions in the simpler linear dynamical system. The method is numerically stable and compares favourably against alternative approximations, both in cases where a single mixture component provides a good posterior approximation, and where a multimodal approximation is required. Keywords: Gaussian sum smoother, switching Kalman filter, switching linear dynamical system, expectation propagation, expectation correction 1. Switching Linear Dynamical System The Linear Dynamical System (LDS) (Bar-Shalom and Li, 1998; West and Harrison, 1999) is a key temporal model in which a latent linear process generates the observed time-series. For more complex time-series which are not well described globally by a single LDS, we may break the time-series into segments, each modeled by a potentially different LDS. This is the basis for the Switching LDS (SLDS) where, for each time-step t, a switch variable st ∈ 1, . . . , S describes which of the LDSs is to be used.1 The observation (or ‘visible’ variable) vt ∈ R V is linearly related to the hidden state ht ∈ R H by vt = B(st )ht + ηv (st ), ηv (st ) ∼ N (v(st ), Σv (st )) ¯ (1) where N (µ, Σ) denotes a Gaussian distribution with mean µ and covariance Σ. The transition dynamics of the continuous hidden state ht is linear ht = A(st )ht−1 + ηh (st ), ¯ ηh (st ) ∼ N h(st ), Σh (st ) . (2) 1. These systems also go under the names Jump Markov model/process, switching Kalman Filter, Switching Linear Gaussian State-Space model, Conditional Linear Gaussian Model. c 2006 David Barber. BARBER s1 s2 s3 s4 h1 h2 h3 h4 v1 v2 v3 v4 Figure 1: The independence structure of the aSLDS. Square nodes denote discrete variables, round nodes continuous variables. In the SLDS links from h to s are not normally considered. The dynamics of the switch variables is Markovian, with transition p(st |st−1 ). The SLDS is used in many disciplines, from econometrics to machine learning (Bar-Shalom and Li, 1998; Ghahramani and Hinton, 1998; Lerner et al., 2000; Kitagawa, 1994; Kim and Nelson, 1999; Pavlovic et al., 2001). See Lerner (2002) and Zoeter (2005) for recent reviews of work. AUGMENTED S WITCHING L INEAR DYNAMICAL S YSTEM In this article, we will consider the more general model in which the switch st is dependent on both the previous st−1 and ht−1 . We call this an augmented Switching Linear Dynamical System 2 (aSLDS), in keeping with the terminology in Lerner (2002). An equivalent probabilistic model is, as depicted in Figure (1), T p(v1:T , h1:T , s1:T ) = p(v1 |h1 , s1 )p(h1 |s1 )p(s1 ) ∏ p(vt |ht , st )p(ht |ht−1 , st )p(st |ht−1 , st−1 ). t=2 The notation x1:T is shorthand for x1 , . . . , xT . 
The distributions are parameterized as p(vt |ht , st ) = N (v(st ) + B(st )ht , Σv (st )) , ¯ ¯ p(ht |ht−1 , st ) = N h(st ) + A(st )ht−1 , Σh (st ) where p(h1 |s1 ) = N (µ(s1 ), Σ(s1 )). The aSLDS has been used, for example, in state-duration modeling in acoustics (Cemgil et al., 2006) and econometrics (Chib and Dueker, 2004). I NFERENCE The aim of this article is to address how to perform inference in both the SLDS and aSLDS. In particular we desire the so-called filtered estimate p(ht , st |v1:t ) and the smoothed estimate p(ht , st |v1:T ), for any t, 1 ≤ t ≤ T . Both exact filtered and smoothed inference in the SLDS is intractable, scaling exponentially with time (Lerner, 2002). To see this informally, consider the filtered posterior, which may be recursively computed using p(st , ht |v1:t ) = ∑ Z st−1 ht−1 p(st , ht |st−1 , ht−1 , vt )p(st−1 , ht−1 |v1:t−1 ). (3) At timestep 1, p(s1 , h1 |v1 ) = p(h1 |s1 , v1 )p(s1 |v1 ) is an indexed set of Gaussians. At time-step 2, due to the summation over the states s1 , p(s2 , h2 |v1:2 ) will be an indexed set of S Gaussians; similarly at 2. These models are closely related to Threshold Regression Models (Tong, 1990). 2516 E XPECTATION C ORRECTION time-step 3, it will be S2 and, in general, gives rise to St−1 Gaussians. More formally, in Lauritzen and Jensen (2001), a general exact method is presented for performing stable inference in such hybrid discrete models with conditional Gaussian potentials. The method requires finding a strong junction tree which, in the SLDS case, means that the discrete variables are placed in a single cluster, resulting in exponential complexity. The key issue in the (a)SLDS, therefore, is how to perform approximate inference in a numerically stable manner. Our own interest in the SLDS stems primarily from acoustic modeling, in which the time-series consists of many thousands of time-steps (Mesot and Barber, 2006; Cemgil et al., 2006). For this, we require a stable and computationally feasible approximate inference, which is also able to deal with state-spaces of high hidden dimension, H. 2. Expectation Correction Our approach to approximate p(ht , st |v1:T ) ≈ p(ht , st |v1:T ) mirrors the Rauch-Tung-Striebel (RTS) ˜ ‘correction’ smoother for the LDS (Rauch et al., 1965; Bar-Shalom and Li, 1998). Readers unfamiliar with this approach will find a short explanation in Appendix (A), which defines the important functions LDSFORWARD and LDSBACKWARD, which we shall make use of for inference in the aSLDS. Our correction approach consists of a single Forward Pass to recursively find the filtered posterior p(ht , st |v1:t ), followed by a single Backward Pass to correct this into a smoothed posterior ˜ p(ht , st |v1:T ). The Forward Pass we use is equivalent to Assumed Density Filtering (Alspach and ˜ Sorenson, 1972; Boyen and Koller, 1998; Minka, 2001). The main contribution of this paper is a novel form of Backward Pass, based on collapsing the smoothed posterior to a mixture of Gaussians. Unless stated otherwise, all quantities should be considered as approximations to their exact counterparts, and we will therefore usually omit the tildes˜throughout the article. 2.1 Forward Pass (Filtering) Readers familiar with Assumed Density Filtering (ADF) may wish to continue directly to Section (2.2). The basic idea is to represent the (intractable) posterior using a simpler distribution. 
This is then propagated forwards through time, conditioned on the new observation, and subsequently collapsed back to the tractable distribution representation—see Figure (2). Our aim is to form a recursion for p(st , ht |v1:t ), based on a Gaussian mixture approximation of p(ht |st , v1:t ). Without loss of generality, we may decompose the filtered posterior as p(ht , st |v1:t ) = p(ht |st , v1:t )p(st |v1:t ). We will first form a recursion for p(ht |st , v1:t ), and discuss the switch recursion p(st |v1:t ) later. The full procedure for computing the filtered posterior is presented in Algorithm (1). The exact representation of p(ht |st , v1:t ) is a mixture with O(St ) components. We therefore approximate this with a smaller It -component mixture It p(ht |st , v1:t ) ≈ p(ht |st , v1:t ) ≡ ˜ ˜ ˜ ∑ p(ht |it , st , v1:t ) p(it |st , v1:t ) it =1 where p(ht |it , st , v1:t ) is a Gaussian parameterized with mean3 f (it , st ) and covariance F(it , st ). The ˜ Gaussian mixture weights are given by p(it |st , v1:t ). In the above, p represent approximations to the ˜ ˜ 3. Strictly speaking, we should use the notation ft (it , st ) since, for each time t, we have a set of means indexed by it , st . This mild abuse of notation is used elsewhere in the paper. 2517 BARBER st st+1 it ht ht+1 vt+1 Figure 2: Structure of the mixture representation of the Forward Pass. Essentially, the Forward Pass defines a ‘prior’ distribution at time t which contains all the information from the variables v1:t . This prior is propagated forwards through time using the exact dynamics, conditioned on the observation, and then collapsed back to form a new prior approximation at time t + 1. corresponding exact p distributions. To find a recursion for these parameters, consider ˜ p(ht+1 |st+1 , v1:t+1 ) = ∑ p(ht+1 , st , it |st+1 , v1:t+1 ) ˜ st ,it = ∑ p(ht+1 |it , st , st+1 , v1:t+1 ) p(st , it |st+1 , v1:t+1 ) ˜ ˜ (4) st ,it where each of the factors can be recursively computed on the basis of the previous filtered results (see below). However, this recursion suffers from an exponential increase in mixture components. To deal with this, we will later collapse p(ht+1 |st+1 , v1:t+1 ) back to a smaller mixture. For the ˜ remainder, we drop the p notation, and concentrate on computing the r.h.s of Equation (4). ˜ E VALUATING p(ht+1 |st , it , st+1 , v1:t+1 ) We find p(ht+1 |st , it , st+1 , v1:t+1 ) from the joint distribution p(ht+1 , vt+1 |st , it , st+1 , v1:t ), which is a Gaussian with covariance and mean elements4 Σhh = A(st+1 )F(it , st )AT (st+1 ) + Σh (st+1 ), Σvv = B(st+1 )Σhh BT (st+1 ) + Σv (st+1 ) Σvh = B(st+1 )F(it , st ), µv = B(st+1 )A(st+1 ) f (it , st ), µh = A(st+1 ) f (it , st ). (5) These results are obtained from integrating the forward dynamics, Equations (1,2) over h t , using the results in Appendix (B). To find p(ht+1 |st , it , st+1 , v1:t+1 ) we may then condition p(ht+1 , vt+1 | st , it , st+1 , v1:t ) on vt+1 using the results in Appendix (C)—see also Algorithm (4). E VALUATING p(st , it |st+1 , v1:t+1 ) Up to a trivial normalization constant the mixture weight in Equation (4) can be found from the decomposition p(st , it |st+1 , v1:t+1 ) ∝ p(vt+1 |it , st , st+1 , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ). ¯ 4. We derive this for ht+1 , vt+1 ≡ 0, to ease notation. ¯ 2518 (6) E XPECTATION C ORRECTION Algorithm 1 aSLDS Forward Pass. Approximate the filtered posterior p(st |v1:t ) ≡ ρt , p(ht |st , v1:t ) ≡ ∑it wt (it , st )N ( ft (it , st ), Ft (it , st )). 
Also we return the approximate log-likelihood log p(v 1:T ). We ¯ require I1 = 1, I2 ≤ S, It ≤ S × It−1 . θt (s) = A(s), B(s), Σh (s), Σv (s), h(s), v(s) for t > 1. θ1 (s) = ¯ v (s), µ(s), v(s) A(s), B(s), Σ(s), Σ ¯ for s1 ← 1 to S do { f1 (1, s1 ), F1 (1, s1 ), p} = LDSFORWARD(0, 0, v1 ; θ(s1 )) ˆ ρ1 ← p(s1 ) p ˆ end for for t ← 2 to T do for st ← 1 to S do for i ← 1 to It−1 , and s ← 1 to S do {µx|y (i, s), Σx|y (i, s), p} = LDSFORWARD( ft−1 (i, s), Ft−1 (i, s), vt ; θt (st )) ˆ p∗ (st |i, s) ≡ p(st |ht−1 , st−1 = s) p(ht−1 |it−1 =i,st−1 =s,v1:t−1 ) p (st , i, s) ← wt−1 (i, s)p∗ (st |i, s)ρt−1 (s) p ˆ end for Collapse the It−1 × S mixture of Gaussians defined by µx|y ,Σx|y , and weights p(i, s|st ) ∝ p (st , i, s) to a Gaussian with It components, p(ht |st , v1:t ) ≈ I ∑itt =1 p(it |st , v1:t )p(ht |st , it , v1:t ). This defines the new means ft (it , st ), covariances Ft (it , st ) and mixture weights wt (it , st ) ≡ p(it |st , v1:t ). Compute ρt (st ) ∝ ∑i,s p (st , i, s) end for normalize ρt ≡ p(st |v1:t ) L ← L + log ∑st ,i,s p (st , i, s) end for The first factor in Equation (6), p(vt+1 |it , st , st+1 , v1:t ), is a Gaussian with mean µv and covariance Σvv , as given in Equation (5). The last two factors p(it |st , v1:t ) and p(st |v1:t ) are given from the previous iteration. Finally, p(st+1 |it , st , v1:t ) is found from p(st+1 |it , st , v1:t ) = p(st+1 |ht , st ) p(ht |it ,st ,v1:t ) (7) where · p denotes expectation with respect to p. In the standard SLDS, Equation (7) is replaced by the Markov transition p(st+1 |st ). In the aSLDS, however, Equation (7) will generally need to be computed numerically. A simple approximation is to evaluate Equation (7) at the mean value of the distribution p(ht |it , st , v1:t ). To take covariance information into account an alternative would be to draw samples from the Gaussian p(ht |it , st , v1:t ) and thus approximate the average of p(st+1 |ht , st ) by sampling.5 C LOSING THE R ECURSION We are now in a position to calculate Equation (4). For each setting of the variable st+1 , we have a mixture of It × S Gaussians. In order to avoid an exponential explosion in the number of mixture 5. Whilst we suggest sampling as part of the aSLDS update procedure, this does not render the Forward Pass as a form of sequential sampling procedure, such as Particle Filtering. The sampling here is a form of exact sampling, for which no convergence issues arise, being used only to numerically evaluate Equation (7). 2519 BARBER components, we numerically collapse this back to It+1 Gaussians to form It+1 p(ht+1 |st+1 , v1:t+1 ) ≈ ∑ it+1 =1 p(ht+1 |it+1 , st+1 , v1:t+1 )p(it+1 |st+1 , v1:t+1 ). Hence the Gaussian components and corresponding mixture weights p(it+1 |st+1 , v1:t+1 ) are defined implicitly through a numerical (Gaussian-Mixture to smaller Gaussian-Mixture) collapse procedure, for which any method of choice may be supplied. A straightforward approach that we use in our code is based on repeatedly merging low-weight components, as explained in Appendix (D). A R ECURSION FOR THE S WITCH VARIABLES A recursion for the switch variables can be found by considering p(st+1 |v1:t+1 ) ∝ ∑ p(it , st , st+1 , vt+1 , v1:t ). it ,st The r.h.s. of the above equation is proportional to ∑ p(vt+1 |it , st , st+1 , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ) st ,it where all terms have been computed during the recursion for p(ht+1 |st+1 , v1:t+1 ). 
T HE L IKELIHOOD p(v1:T ) The likelihood p(v1:T ) may be found by recursing p(v1:t+1 ) = p(vt+1 |v1:t )p(v1:t ), where p(vt+1 |v1:t ) = ∑ it ,st ,st+1 p(vt+1 |it , st , st+1 , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ). In the above expression, all terms have been computed in forming the recursion for the filtered posterior p(ht+1 , st+1 |v1:t+1 ). 2.2 Backward Pass (Smoothing) The main contribution of this paper is to find a suitable way to ‘correct’ the filtered posterior p(st , ht |v1:t ) obtained from the Forward Pass into a smoothed posterior p(st , ht |v1:T ). We initially derive this for the case of a single Gaussian representation—the extension to the mixture case is straightforward and given in Section (2.3). Our derivation holds for both the SLDS and aSLDS. We approximate the smoothed posterior p(ht |st , v1:T ) by a Gaussian with mean g(st ) and covariance G(st ), and our aim is to find a recursion for these parameters. A useful starting point is the exact relation: p(ht , st |v1:T ) = ∑ p(st+1 |v1:T )p(ht |st , st+1 , v1:T )p(st |st+1 , v1:T ). st+1 2520 E XPECTATION C ORRECTION The term p(ht |st , st+1 , v1:T ) may be computed as p(ht |st , st+1 , v1:T ) = Z p(ht , ht+1 |st , st+1 , v1:T ) = Z p(ht |ht+1 , st , st+1 , v1:T )p(ht+1 |st , st+1 , v1:T ) Z p(ht |ht+1 , st , st+1 , v1:t )p(ht+1 |st , st+1 , v1:T ) ht+1 ht+1 = ht+1 (8) which is in the form of a recursion. This recursion therefore requires p(ht+1 |st , st+1 , v1:T ), which we can write as p(ht+1 |st , st+1 , v1:T ) ∝ p(ht+1 |st+1 , v1:T )p(st |st+1 , ht+1 , v1:t ). (9) The above recursions represent the exact computation of the smoothed posterior. In our approximate treatment, we replace all quantities p with their corresponding approximations p. A difficulty ˜ is that the functional form of p(st |st+1 , ht+1 , v1:t ) in the approximation of Equation (9) is not squared ˜ exponential in ht+1 , so that p(ht+1 |st , st+1 , v1:T ) will not be a mixture of Gaussians.6 One possibil˜ ity would be to approximate the non-Gaussian p(ht+1 |st , st+1 , v1:T ) (dropping the p notation) by a ˜ Gaussian (mixture) by minimizing the Kullback-Leilbler divergence between the two, or performing moment matching in the case of a single Gaussian. A simpler alternative is to make the assumption p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ), see Figure (3). This is a considerable simplification since p(ht+1 |st+1 , v1:T ) is already known from the previous backward recursion. Under this assumption, the recursion becomes p(ht , st |v1:T ) ≈ ∑ p(st+1 |v1:T )p(st |st+1 , v1:T ) st+1 p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) . (10) We call the procedure based on Equation (10) Expectation Correction (EC) since it ‘corrects’ the filtered results which themselves are formed from propagating expectations. In Appendix (E) we show how EC is equivalent to a partial Discrete-Continuous factorized approximation. Equation (10) forms the basis of the the EC Backward Pass. However, similar to the ADF Forward Pass, the number of mixture components needed to represent the posterior in this recursion grows exponentially as we go backwards in time. The strategy we take to deal with this is a form of Assumed Density Smoothing, in which Equation (10) is interpreted as a propagated dynamics reversal, which will subsequently be collapsed back to an assumed family of distributions—see Figure (4). How we implement the recursion for the continuous and discrete factors is detailed below.7 6. 
In the exact calculation, p(ht+1 |st , st+1 , v1:T ) is a mixture of Gaussians since p(st |st+1 , ht+1 , v1:t ) = p(st , st+1 , ht+1 , v1:T )/p(st+1 , ht+1 , v1:T ) so that the mixture of Gaussians denominator p(st+1 , ht+1 , v1:T ) cancels with the first term in Equation (9), leaving a mixture of Gaussians. However, since in Equation (9) the two terms p(ht+1 |st+1 , v1:T ) and p(st |st+1 , ht+1 , v1:t ) are replaced by approximations, this cancellation is not guaranteed. 7. Equation (10) has the pleasing form of an RTS Backward Pass for the continuous part (analogous to LDS case), and a discrete smoother (analogous to a smoother recursion for the HMM). In the Forward-Backward algorithm for the HMM (Rabiner, 1989), the posterior γt ≡ p(st |v1:T ) is formed from the product of αt ≡ p(st |v1:t ) and βt ≡ p(vt+1:T |st ). This approach is also analogous to EP (Heskes and Zoeter, 2002). In the correction approach, a direct recursion for γt in terms of γt+1 and αt is formed, without explicitly defining βt . The two approaches to inference are known as α − β and α − γ recursions. 2521 BARBER st−1 st st+1 st+2 ht−1 ht ht+1 ht+2 vt−1 vt vt+1 vt+2 Figure 3: Our Backward Pass approximates p(ht+1 |st+1 , st , v1:T ) by p(ht+1 |st+1 , v1:T ). Motivation for this is that st only influences ht+1 through ht . However, ht will most likely be heavily influenced by v1:t , so that not knowing the state of st is likely to be of secondary importance. The darker shaded node is the variable we wish to find the posterior state of. The lighter shaded nodes are variables in known states, and the hashed node a variable whose state is indeed known but assumed unknown for the approximation. E VALUATING p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) is a Gaussian in ht , whose statistics we will now compute. First we find p(ht |ht+1 , st , st+1 , v1:t ) which may be obtained from the joint distribution p(ht , ht+1 |st , st+1 , v1:t ) = p(ht+1 |ht , st+1 )p(ht |st , v1:t ) (11) which itself can be found using the forward dynamics from the filtered estimate p(ht |st , v1:t ). The statistics for the marginal p(ht |st , st+1 , v1:t ) are simply those of p(ht |st , v1:t ), since st+1 carries no extra information about ht .8 The remaining statistics are the mean of ht+1 , the covariance of ht+1 and cross-variance between ht and ht+1 , ht+1 = A(st+1 ) ft (st ) Σt+1,t+1 = A(st+1 )Ft (st )AT (st+1 ) + Σh (st+1 ), Σt+1,t = A(st+1 )Ft (st ). Given the statistics of Equation (11), we may now condition on ht+1 to find p(ht |ht+1 , st , st+1 , v1:t ). Doing so effectively constitutes a reversal of the dynamics, ← − ← − ht = A (st , st+1 )ht+1 + η (st , st+1 ) ← − ← − ← − − where A (st , st+1 ) and η (st , st+1 ) ∼ N (←(st , st+1 ), Σ (st , st+1 )) are easily found using the condim tioned Gaussian results in Appendix (C)—see also Algorithm (5). Averaging the reversed dynamics we obtain a Gaussian in ht for p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) with statistics ← − ← − ← − ← − − µt = A (st , st+1 )g(st+1 ) + ←(st , st+1 ), Σt,t = A (st , st+1 )G(st+1 ) A T (st , st+1 ) + Σ (st , st+1 ). m These equations directly mirror the RTS Backward Pass, see Algorithm (5). 8. Integrating over ht+1 means that the information from st+1 passing through ht+1 via the term p(ht+1 |st+1 , ht ) vanishes. Also, since st is known, no information from st+1 passes through st to ht . 
2522 E XPECTATION C ORRECTION st st+1 it jt+1 ht ht+1 vt vt+1 Figure 4: Structure of the Backward Pass for mixtures. Given the smoothed information at timestep t + 1, we need to work backwards to ‘correct’ the filtered estimate at time t. E VALUATING p(st |st+1 , v1:T ) The main departure of EC from previous methods is in treating the term p(st |st+1 , v1:T ) = p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) . (12) The term p(st |ht+1 , st+1 , v1:t ) is given by p(st |ht+1 , st+1 , v1:t ) = p(ht+1 |st , st+1 , v1:t )p(st , st+1 |v1:t ) . ∑st p(ht+1 |st , st+1 , v1:t )p(st , st+1 |v1:t ) (13) Here p(st , st+1 |v1:t ) = p(st+1 |st , v1:t )p(st |v1:t ), where p(st+1 |st , v1:t ) occurs in the Forward Pass, Equation (7). In Equation (13), p(ht+1 |st+1 , st , v1:t ) is found by marginalizing Equation (11). Performing the average over p(ht+1 |st+1 , v1:T ) in Equation (12) may be achieved by any numerical integration method desired. Below we outline a crude approximation that is fast and often performs surprisingly well. M EAN A PPROXIMATION A simple approximation of Equation (12) is to evaluate the integrand at the mean value of the averaging distribution. Replacing ht+1 in Equation (13) by its mean gives the simple approximation 1 T −1 1 e− 2 zt+1 (st ,st+1 )Σ (st ,st+1 |v1:t )zt+1 (st ,st+1 ) p(st |st+1 , v1:t ) p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) ≈ Z det Σ(st , st+1 |v1:t ) where zt+1 (st , st+1 ) ≡ ht+1 |st+1 , v1:T − ht+1 |st , st+1 , v1:t and Z ensures normalization over st . This result comes simply from the fact that in Equation (12) we have a Gaussian with a mean ht+1 |st , st+1 , v1:t and covariance Σ(st , st+1 |v1:t ), being the filtered covariance of ht+1 given st , st+1 and the observations v1:t , which may be taken from Σhh in Equation (5). Then evaluating this Gaussian at the specific point ht+1 |st+1 , v1:T , we arrive at the above expression. An alternative to this simple mean approximation is to sample from the Gaussian p(ht+1 |st+1 , v1:T ), which has the potential advantage that covariance information is used. 9 Other methods such as variational 9. This is a form of exact sampling since drawing samples from a Gaussian is easy. This should not be confused with meaning that this use of sampling renders EC a sequential Monte-Carlo sampling scheme. 2523 BARBER Algorithm 2 aSLDS: EC Backward Pass (Single Gaussian case I = J = 1). Approximates p(st |v1:T ) and p(ht |st , v1:T ) ≡ N (gt (st ), Gt (st )). This routine needs the results from Algorithm (1) for I = 1. GT ← FT , gT ← fT , for t ← T − 1 to 1 do for s ← 1 to S, s ← 1 to S do, (µ, Σ)(s, s ) = LDSBACKWARD(gt+1 (s ), Gt+1 (s ), ft (s), Ft (s), θt+1 (s )) p(s|s ) = p(st = s|ht+1 , st+1 = s , v1:t ) p(ht+1 |st+1 =s ,v1:T ) p(s, s |v1:T ) ← p(st+1 = s |v1:T )p(s|s ) end for for st ← 1 to S do Collapse the mixture defined by weights p(st+1 = s |st , v1:T ) ∝ p(st , s |v1:T ), means µ(st , s ) and covariances Σ(st , s ) to a single Gaussian. This defines the new means gt (st ), covariances Gt (st ). p(st |v1:T ) ← ∑s p(st , s |v1:T ) end for end for approximations to this average (Jaakkola and Jordan, 1996) or the unscented transform (Julier and Uhlmann, 1997) may be employed if desired. C LOSING THE R ECURSION We have now computed both the continuous and discrete factors in Equation (10), which we wish to use to write the smoothed estimate in the form p(ht , st |v1:T ) = p(st |v1:T )p(ht |st , v1:T ). 
The distribution p(ht |st , v1:T ) is readily obtained from the joint Equation (10) by conditioning on st to form the mixture p(ht |st , v1:T ) = ∑ p(st+1 |st , v1:T )p(ht |st , st+1 , v1:T ) st+1 which may be collapsed to a single Gaussian (or mixture if desired). As in the Forward Pass, this collapse implicitly defines the Gaussian mean g(st ) and covariance G(st ). The smoothed posterior p(st |v1:T ) is given by p(st |v1:T ) = = ∑ p(st+1 |v1:T )p(st |st+1 , v1:T ) st+1 ∑ p(st+1 |v1:T ) st+1 p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) . (14) The algorithm for the single Gaussian case is presented in Algorithm (2). N UMERICAL S TABILITY Numerical stability is a concern even in the LDS, and the same is to be expected for the aSLDS. Since the LDS recursions LDSFORWARD and LDSBACKWARD are embedded within the EC algorithm, we may immediately take advantage of the large body of work on stabilizing the LDS recursions, such as the Joseph form (Grewal and Andrews, 1993), or the square root forms (Park and Kailath, 1996; Verhaegen and Van Dooren, 1986). 2524 E XPECTATION C ORRECTION R ELAXING EC The conditional independence assumption p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ) is not strictly necessary in EC. We motivate it by computational simplicity, since finding an appropriate moment matching approximation of p(ht+1 |st , st+1 , v1:T ) in Equation (9) requires a relatively expensive nonGaussian integration. If we therefore did treat p(ht+1 |st , st+1 , v1:T ) more correctly, the central assumption in this relaxed version of EC would be a collapse to a mixture of Gaussians (the additional computation of Equation (12) may usually be numerically evaluated to high precision). Whilst we did not do so, implementing this should not give rise to numerical instabilities since no potential divisions are required, merely the estimation of moments. In the experiments presented here, we did not pursue this option, since we believe that the effect of this conditional independence assumption is relatively weak. I NCONSISTENCIES IN THE APPROXIMATION The recursion Equation (8), upon which EC depends, makes use of the Forward Pass results, and a subtle issue arises about possible inconsistencies in the Forward and Backward approximations. For example, under the conditional independence assumption in the Backward Pass, p(hT |sT −1 , sT , v1:T ) ≈ p(hT |sT , v1:T ), which is in contradiction to Equation (5) which states that the approximation to p(hT |sT −1 , sT , v1:T ) will depend on sT −1 . Similar contradictions occur also for the relaxed version of EC. Such potential inconsistencies arise because of the approximations made, and should not be considered as separate approximations in themselves. Furthermore, these inconsistencies will most likely be strongest at the end of the chain, t ≈ T , since only then is Equation (8) in direct contradiction to Equation (5). Such potential inconsistencies arise since EC is not founded on a consistency criterion, unlike EP—see Section (3)—but rather an approximation of the exact recursions. Our experience is that compared to EP, which attempts to ensure consistency based on multiple sweeps through the graph, such inconsistencies are a small price to pay compared to the numerical stability advantages of EC. 2.3 Using Mixtures in the Backward Pass The extension to the mixture case is straightforward, based on the representation Jt p(ht |st , v1:T ) ≈ ∑ p(ht |st , jt , v1:T )p( jt |st , v1:T ). 
jt =1 Analogously to the case with a single component, p(ht , st |v1:T ) = ∑ it , jt+1 ,st+1 p(st+1 |v1:T )p( jt+1 |st+1 , v1:T )p(ht | jt+1 , st+1 , it , st , v1:T ) · p(it , st |ht+1 , jt+1 , st+1 , v1:t ) p(ht+1 | jt+1 ,st+1 ,v1:T ) . The average in the last line of the above equation can be tackled using the same techniques as outlined in the single Gaussian case. To approximate p(ht | jt+1 , st+1 , it , st , v1:T ) we consider this as the marginal of the joint distribution p(ht , ht+1 |it , st , jt+1 , st+1 , v1:T ) = p(ht |ht+1 , it , st , jt+1 , st+1 , v1:t )p(ht+1 |it , st , jt+1 , st+1 , v1:T ). 2525 BARBER Algorithm 3 aSLDS: EC Backward Pass. Approximates p(st |v1:T ) and p(ht |st , v1:T ) ≡ Jt ut ( jt , st )N (gt ( jt , st ), Gt ( jt , st )) using a mixture of Gaussians. JT = IT , Jt ≤ S × It × Jt+1 . This ∑ jt =1 routine needs the results from Algorithm (1). GT ← FT , gT ← fT , uT ← wT (*) for t ← T − 1 to 1 do for s ← 1 to S, s ← 1 to S, i ← 1 to It , j ← 1 to Jt+1 do (µ, Σ)(i, s, j , s ) = LDSBACKWARD(gt+1 ( j , s ), Gt+1 ( j , s ), ft (i, s), Ft (i, s), θt+1 (s )) p(i, s| j , s ) = p(st = s, it = i|ht+1 , st+1 = s , jt+1 = j , v1:t ) p(ht+1 |st+1 =s , jt+1 = j ,v1:T ) p(i, s, j , s |v1:T ) ← p(st+1 = s |v1:T )ut+1 ( j , s )p(i, s| j , s ) end for for st ← 1 to S do Collapse the mixture defined by weights p(it = i, st+1 = s , jt+1 = j |st , v1:T ) ∝ p(i, st , j , s |v1:T ), means µ(it , st , j , s ) and covariances Σ(it , st , j , s ) to a mixture with Jt components. This defines the new means gt ( jt , st ), covariances Gt ( jt , st ) and mixture weights ut ( jt , st ). p(st |v1:T ) ← ∑it , j ,s p(it , st , j , s |v1:T ) end for end for (*) If JT < IT then the initialization is formed by collapsing the Forward Pass results at time T to JT components. As in the case of a single mixture, the problematic term is p(ht+1 |it , st , jt+1 , st+1 , v1:T ). Analogously to before, we may make the assumption p(ht+1 |it , st , jt+1 , st+1 , v1:T ) ≈ p(ht+1 | jt+1 , st+1 , v1:T ) meaning that information about the current switch state st , it is ignored.10 We can then form p(ht |st , v1:T ) = ∑ it , jt+1 ,st+1 p(it , jt+1 , st+1 |st , v1:T )p(ht |it , st , jt+1 , st+1 , v1:T ). This mixture can then be collapsed to smaller mixture using any method of choice, to give Jt p(ht |st , v1:T ) ≈ ∑ p(ht | jt , st , v1:T )p( jt |st , v1:T ) jt =1 The collapse procedure implicitly defines the means g( jt , st ) and covariances G( jt , st ) of the smoothed approximation. A recursion for the switches follows analogously to the single component Backward Pass. The resulting algorithm is presented in Algorithm (3), which includes using mixtures in both Forward and Backward Passes. Note that if JT < IT , an extra initial collapse is required of the IT component Forward Pass Gaussian mixture at time T to JT components. EC has time complexity O(S2 IJK) where S are the number of switch states, I and J are the number of Gaussians used in the Forward and Backward passes, and K is the time to compute the exact Kalman smoother for the system with a single switch state. 10. As in the single component case, in principle, this assumption may be relaxed and a moment matching approximation be performed instead. 2526 E XPECTATION C ORRECTION 3. Relation to Other Methods Approximate inference in the SLDS is a long-standing research topic, generating an extensive literature. See Lerner (2002) and Zoeter (2005) for reviews of previous work. A brief summary of some of the major existing approaches follows. 
3. Relation to Other Methods
Approximate inference in the SLDS is a long-standing research topic, generating an extensive literature. See Lerner (2002) and Zoeter (2005) for reviews of previous work. A brief summary of some of the major existing approaches follows.

Assumed Density Filtering
Since the exact filtered estimate p(h_t|s_t, v_{1:t}) is an (exponentially large) mixture of Gaussians, a useful remedy is to project, at each stage of the recursion Equation (3), back to a limited set of K Gaussians. This is a Gaussian Sum Approximation (Alspach and Sorenson, 1972), and is a form of Assumed Density Filtering (ADF) (Minka, 2001). Similarly, Generalized Pseudo Bayes2 (GPB2) (Bar-Shalom and Li, 1998) also performs filtering by collapsing to a mixture of Gaussians. This approach to filtering is also taken in Lerner et al. (2000), which performs the collapse by removing spatially similar Gaussians, thereby retaining diversity.

Several smoothing approaches directly use the results from ADF. The most popular is Kim's method, which updates the filtered posterior weights to form the smoother (Kim, 1994; Kim and Nelson, 1999). In both EC and Kim's method, the approximation p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) ≈ p(h_{t+1}|s_{t+1}, v_{1:T}) is used to form a numerically simple Backward Pass. The other approximation in EC is to numerically compute the average in Equation (14). In Kim's method, however, an update for the discrete variables is formed by replacing the required term in Equation (14) by

⟨p(s_t|h_{t+1}, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})} ≈ p(s_t|s_{t+1}, v_{1:t}).   (15)

This approximation11 decouples the discrete Backward Pass in Kim's method from the continuous dynamics, since p(s_t|s_{t+1}, v_{1:t}) ∝ p(s_{t+1}|s_t) p(s_t|v_{1:t}) / p(s_{t+1}|v_{1:t}) can be computed from the filtered results alone (the continuous Backward Pass in Kim's method, however, does depend on the discrete Backward Pass). The fundamental difference between EC and Kim's method is that the approximation (15) is not required by EC. The EC Backward Pass therefore makes fuller use of the future information, resulting in a recursion which intimately couples the continuous and discrete variables. The resulting effect on the quality of the approximation can be profound, as we will see in the experiments. Kim's smoother corresponds to a potentially severe loss of future information and, in general, cannot be expected to improve much on the filtered results from ADF.

11. In the HMM this is exact, but in the SLDS the future observations carry information about s_t.

The more recent work of Lerner et al. (2000) is similar in spirit to Kim's method, whereby the contribution from the continuous variables is ignored in forming an approximate recursion for the smoothed p(s_t|v_{1:T}). The main difference is that, for the discrete variables, Kim's method is based on a correction smoother (Rauch et al., 1965), whereas Lerner's method uses a Belief Propagation style Backward Pass (Jordan, 1998). Neither method correctly integrates information from the continuous variables. How to form a recursion for a mixture approximation which does not ignore information coming through the continuous hidden variables is a central contribution of our work.

Kitagawa (1994) used a two-filter method in which the dynamics of the chain are reversed. Essentially, this corresponds to a Belief Propagation method which defines a Gaussian sum approximation for p(v_{t+1:T}|h_t, s_t). However, since this is not a density in h_t, s_t, but rather a conditional likelihood, formally one cannot treat it using density propagation methods. In Kitagawa (1994), the singularities resulting from incorrectly treating p(v_{t+1:T}|h_t, s_t) as a density are heuristically finessed.

Table 1: Relation between methods. The rows of the table list the assumptions in play: collapsing the mixture to a single Gaussian; collapsing to a mixture; the conditional independence p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) ≈ p(h_{t+1}|s_{t+1}, v_{1:T}); the approximation of p(s_t|s_{t+1}, v_{1:T}) by the average in Equation (12); Kim's Backward Pass; and the mixture approximation of p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) of Equation (9). The columns mark which of EC, Relaxed EC, EP and Kim's method use each. In the EC methods, the mean approximation may be replaced by an essentially exact Monte Carlo approximation to Equation (12). EP refers to the single Gaussian approximation in Heskes and Zoeter (2002). In the case of using Relaxed EC with collapse to a single Gaussian, EC and EP are not equivalent, since the underlying recursions on which the two methods are based are fundamentally different.
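Approximation (15) makes Kim's discrete Backward Pass a purely discrete recursion over the filtered results. A minimal sketch (naming is ours) of that recursion:

```python
import numpy as np

def kim_discrete_smoother(filtered, trans):
    """Kim's decoupled switch smoother, using approximation (15).

    filtered : (T, S) array of filtered switch posteriors p(s_t | v_{1:t})
    trans    : (S, S) array with trans[i, j] = p(s_{t+1} = j | s_t = i)
    Returns a (T, S) array of approximate smoothed posteriors p(s_t | v_{1:T}).
    """
    T, S = filtered.shape
    smoothed = np.empty_like(filtered)
    smoothed[-1] = filtered[-1]
    for t in range(T - 2, -1, -1):
        # p(s_t | s_{t+1}, v_{1:t}) ∝ p(s_{t+1}|s_t) p(s_t|v_{1:t})
        joint = trans * filtered[t][:, None]            # indexed (s_t, s_{t+1})
        cond = joint / joint.sum(axis=0, keepdims=True)
        smoothed[t] = cond @ smoothed[t + 1]            # sum over s_{t+1}
    return smoothed
```

Note that the continuous variables never enter this recursion, which is exactly the loss of future information the text describes.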
Expectation Propagation
EP (Minka, 2001), as applied to the SLDS, corresponds to an approximate implementation of Belief Propagation12 (Jordan, 1998; Heskes and Zoeter, 2002). EP is the most sophisticated rival to Kim's method and EC, since it makes the fewest assumptions. For this reason, we'll explain briefly how EP works. Unlike EC, which is based on an approximation of the exact filtering and smoothing recursions, EP is based on a consistency criterion. First, let's simplify the notation, and write the distribution as p = ∏_t φ(x_{t-1}, v_{t-1}, x_t, v_t), where x_t ≡ h_t ⊗ s_t, and φ(x_{t-1}, v_{t-1}, x_t, v_t) ≡ p(x_t|x_{t-1}) p(v_t|x_t). EP defines 'messages' ρ, λ13 which contain information from past and future observations respectively.14 Explicitly, we define ρ_t(x_t) ∝ p(x_t|v_{1:t}) to represent knowledge about x_t given all information from time 1 to t. Similarly, λ_t(x_t) represents knowledge about state x_t given all observations from time T to time t + 1. In the sequel, we drop the time suffix for notational clarity. We define λ(x_t) implicitly through the requirement that the marginal smoothed inference is given by

p(x_t|v_{1:T}) ∝ ρ(x_t) λ(x_t).   (16)

Hence λ(x_t) ∝ p(v_{t+1:T}|x_t, v_{1:t}) = p(v_{t+1:T}|x_t) and represents all future knowledge about p(x_t|v_{1:T}). From this,

p(x_{t-1}, x_t|v_{1:T}) ∝ ρ(x_{t-1}) φ(x_{t-1}, v_{t-1}, x_t, v_t) λ(x_t).   (17)

12. Non-parametric belief propagation (Sudderth et al., 2003), which performs approximate inference in general continuous distributions, is also related to EP applied to the aSLDS, in the sense that the messages cannot be represented easily, and are approximated by mixtures of Gaussians.
13. These correspond to the α and β messages in the Hidden Markov Model framework (Rabiner, 1989).
14. In this Belief Propagation/EP viewpoint, the backward messages, traditionally labeled as β, correspond to conditional likelihoods, and not distributions. In contrast, in the EC approach, which is effectively a so-called α-γ recursion, the backward γ messages correspond to posterior distributions.

Taking the above equation as a starting point, we have

p(x_t|v_{1:T}) ∝ ∫_{x_{t-1}} ρ(x_{t-1}) φ(x_{t-1}, v_{t-1}, x_t, v_t) λ(x_t).

Consistency with Equation (16) requires (neglecting irrelevant scalings)

ρ(x_t) λ(x_t) ∝ ∫_{x_{t-1}} ρ(x_{t-1}) φ(x_{t-1}, v_{t-1}, x_t, v_t) λ(x_t).

Similarly, we can integrate Equation (17) over x_t to get the marginal at time t - 1 which, by consistency, should be proportional to ρ(x_{t-1}) λ(x_{t-1}). Hence

ρ(x_t) ∝ [ ∫_{x_{t-1}} ρ(x_{t-1}) φ(x_{t-1}, x_t) λ(x_t) ] / λ(x_t),   λ(x_{t-1}) ∝ [ ∫_{x_t} ρ(x_{t-1}) φ(x_{t-1}, x_t) λ(x_t) ] / ρ(x_{t-1}),   (18)

where the divisions can be interpreted as preventing over-counting of messages. In an exact implementation, the common factors in the numerator and denominator cancel.
EP addresses the fact that λ(x_t) is not a distribution by using Equation (18) to form the projection (or 'collapse'). In the numerator, ∫_{x_{t-1}} ρ(x_{t-1}) φ(x_{t-1}, x_t) λ(x_t) and ∫_{x_t} ρ(x_{t-1}) φ(x_{t-1}, x_t) λ(x_t) represent p(x_t|v_{1:T}) and p(x_{t-1}|v_{1:T}). Since these are distributions (an indexed mixture of Gaussians in the SLDS), they may be projected/collapsed to a single indexed Gaussian. The update for the ρ message is then found from division by the λ potential, and vice versa. In EP the explicit division of potentials only makes sense for members of the exponential family. More complex methods could be envisaged in which, rather than an explicit division, the new messages are defined by minimizing some measure of divergence between ρ(x_t)λ(x_t) and ∫_{x_{t-1}} ρ(x_{t-1}) φ(x_{t-1}, x_t) λ(x_t), such as the Kullback-Leibler divergence. In this way, non-exponential family approximations (such as mixtures of Gaussians) may be considered. Whilst this is certainly feasible, it is somewhat unattractive computationally, since it would require an expensive minimization at each time-step.

For the single Gaussian case, in order to perform the division, the potentials in the numerator and denominator are converted to their canonical representations. To form the ρ update, the result of the division is then reconverted back to a moment representation. Due to the approximation, the resulting recursions are no longer independent, and Heskes and Zoeter (2002) show that using more than a single Forward and Backward sweep often improves the quality of the approximation. This coupling is a departure from the exact recursions, which should remain independent.

Applied to the SLDS, EP suffers from severe numerical instabilities (Heskes and Zoeter, 2002), and finding a way to minimize the corresponding EP free energy in an efficient, robust and guaranteed way remains an open problem. Our experience is that current implementations of EP are unsuitable for large-scale time-series applications. Damping the parameter updates is one suggested approach to heuristically improve convergence. The source of these numerical instabilities is not well understood since, even in cases where the posterior appears uni-modal, the method is problematic. The frequent conversions between moment and canonical parameterizations of Gaussians are most likely at the root of the difficulties. An interesting comparison here is between Lauritzen's original method for exact computation on conditional Gaussian distributions (for which the SLDS is a special case) (Lauritzen, 1992), which is numerically unstable due to conversion between moment and canonical representations, and Lauritzen and Jensen (2001), which improves stability by avoiding canonical parameterizations.
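To make the conversions and the division concrete, the following is a minimal sketch (naming ours) of the moment/canonical round trip just discussed. Each conversion involves a matrix inverse, and the division can yield an indefinite precision matrix; both effects are plausible sources of the instabilities described above.

```python
import numpy as np

def to_canonical(mu, Sigma):
    """Moment form (mu, Sigma) -> canonical form (eta, Lambda), Lambda = Sigma^{-1}."""
    Lam = np.linalg.inv(Sigma)
    return Lam @ mu, Lam

def to_moment(eta, Lam):
    """Canonical form back to moment form (another inverse, more roundoff)."""
    Sigma = np.linalg.inv(Lam)
    return Sigma @ eta, Sigma

def gaussian_divide(mu1, S1, mu2, S2):
    """N(mu1, S1) / N(mu2, S2), up to normalization: subtract canonical
    parameters. The resulting precision L1 - L2 need not be positive
    definite, so the 'quotient' may not be a valid Gaussian."""
    e1, L1 = to_canonical(mu1, S1)
    e2, L2 = to_canonical(mu2, S2)
    return to_moment(e1 - e2, L1 - L2)
```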
Variational Methods
Ghahramani and Hinton (1998) used a variational method which approximates the joint distribution p(h_{1:T}, s_{1:T}|v_{1:T}) rather than the marginal p(h_t, s_t|v_{1:T}); related work is presented in Lee et al. (2004). This is a disadvantage when compared to other methods that directly approximate the marginal. The variational methods are nevertheless potentially attractive, since they are able to exploit structural properties of the distribution, such as a factored discrete state-transition. In this article, we concentrate on the case of a small number of states S and hence will not consider variational methods further here.15

15. Lerner (2002) discusses an approach in the case of a large structured discrete state transition. Related ideas could also be used in EC.

Sequential Monte Carlo (Particle Filtering)
These methods form an approximate implementation of Equation (3), using a sum of delta functions to represent the posterior; see, for example, Doucet et al. (2001). Whilst potentially powerful, these non-analytic methods typically suffer in high-dimensional hidden spaces, since they are often based on naive importance sampling, which restricts their practical use. ADF is generally preferable to Particle Filtering, since in ADF the approximation is a mixture of non-trivial distributions, which is better at capturing the variability of the posterior. Rao-Blackwellized Particle Filters (Doucet et al., 2000) are an attempt to alleviate the difficulty of sampling in high-dimensional state spaces by explicitly integrating over the continuous state.

Non-Sequential Monte Carlo
For fixed switches s_{1:T}, p(v_{1:T}|s_{1:T}) is easily computable, since this is just the likelihood of an LDS. This observation raises the possibility of sampling from the posterior p(s_{1:T}|v_{1:T}) ∝ p(v_{1:T}|s_{1:T}) p(s_{1:T}) directly. Many possible sampling methods could be applied in this case, and the most immediate is Gibbs sampling, in which a sample for each t is drawn from p(s_t|s_{\t}, v_{1:T}); see Neal (1993) for a general reference and Carter and Kohn (1996) for an application to the SLDS. This procedure may work well in practice provided that the initial setting of s_{1:T} is in a region of high probability mass; otherwise, sampling by such individual coordinate updates may be extremely inefficient.

4. Experiments
Our experiments examine the stability and accuracy of EC against several other methods on long time-series. In addition, we will compare the absolute accuracy of EC as a function of the number of mixture components on a short time-series, where exact inference may be explicitly evaluated. Testing EC in a problem with a reasonably long temporal sequence, T, is important, since numerical instabilities may not be apparent in time-series of just a few time-steps. To do this, we sequentially generate hidden states h_t, s_t and observations v_t from a given model. Then, given only the parameters of the model and the observations (but not any of the hidden states), the task is to infer p(h_t|s_t, v_{1:T}) and p(s_t|v_{1:T}). Since the exact computation is exponential in T, a formally exact evaluation of the method is infeasible. A simple alternative is to assume that the original sample states s_{1:T} are the 'correct' inferred states, and compare our most probable posterior smoothed estimates arg max_{s_t} p(s_t|v_{1:T}) with the assumed correct sample s_t.16 We look at two sets of experiments, one for the SLDS and one for the aSLDS. In both cases, scalar observations are used so that the complexity of the inference problem can be visually assessed.

16. We could also consider performance measures on the accuracy of p(h_t|s_t, v_{1:T}). However, we prefer to look at approximating arg max_{s_t} p(s_t|v_{1:T}), since the sampled discrete states are likely to correspond to the exact arg max_{s_t} p(s_t|v_{1:T}). In addition, if the posterior switch distribution is dominated by a single state s*_{1:T}, then, provided the switches are correctly estimated, the model reduces to an LDS, for which inference of the continuous hidden state is trivial.

Figure 5: SLDS: Throughout, S = 2, V = 1 (scalar observations), T = 100, with zero output bias. A(s) = 0.9999 ∗ orth(randn(H, H)), B(s) = randn(V, H), v̄_t ≡ 0, h̄_1 = 10 ∗ randn(H, 1), h̄_{t>1} = 0, Σ_1^h = I_H, p_1 = uniform. The figures show typical examples for each of the two problems: (a) Easy problem. H = 3, Σ^h(s) = I_H, Σ^v(s) = 0.1 I_V, p(s_{t+1}|s_t) ∝ 1_{S×S} + I_S. (b) Hard problem. H = 30, Σ^v(s) = 30 I_V, Σ^h(s) = 0.01 I_H, p(s_{t+1}|s_t) ∝ 1_{S×S}.
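For concreteness, data from a model resembling the 'easy' setup of Figure 5 could be generated as follows. This is a sketch: the parameter values are illustrative, the naming is ours, and minor details (such as the prior noise on h_1) are simplified.

```python
import numpy as np

def sample_slds(T=100, H=3, V=1, S=2, sv=0.1, rng=None):
    """Draw switches, hidden states and observations from a toy SLDS."""
    rng = np.random.default_rng() if rng is None else rng
    orth = lambda: np.linalg.qr(rng.standard_normal((H, H)))[0]
    A = [0.9999 * orth() for _ in range(S)]            # near-orthogonal dynamics
    B = [rng.standard_normal((V, H)) for _ in range(S)]
    P = np.ones((S, S)) + np.eye(S)                    # p(s_{t+1}|s_t) ∝ 1 + I
    P /= P.sum(axis=1, keepdims=True)
    s = np.empty(T, dtype=int); h = np.empty((T, H)); v = np.empty((T, V))
    s[0] = rng.integers(S)                             # p_1 uniform
    h[0] = 10 * rng.standard_normal(H)                 # h̄_1 = 10*randn
    for t in range(T):
        if t > 0:
            s[t] = rng.choice(S, p=P[s[t - 1]])
            h[t] = A[s[t]] @ h[t - 1] + rng.standard_normal(H)   # Σ^h = I
        v[t] = B[s[t]] @ h[t] + np.sqrt(sv) * rng.standard_normal(V)
    return s, h, v
```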
Figure 6: SLDS 'Easy' problem: the number of errors in estimating a binary switch p(s_t|v_{1:T}) over a time-series of length T = 100; 50 errors corresponds to random guessing. Plotted are histograms of the errors over 1000 experiments, cut off at 20 errors in order to improve visualization. (PF) Particle Filter. (RBPF) Rao-Blackwellized PF. (EP) Expectation Propagation. (ADFS) Assumed Density Filtering using a single Gaussian. (KimS) Kim's smoother using the results from ADFS. (ECS) Expectation Correction using a single Gaussian (I = J = 1). (ADFM) ADF using a mixture of I = 4 Gaussians. (KimM) Kim's smoother using the results from ADFM. (ECM) Expectation Correction using a mixture with I = J = 4 components. In Gibbs sampling, we use the initialization from ADFM.

Figure 7: SLDS 'Hard' problem: the number of errors in estimating a binary switch p(s_t|v_{1:T}) over a time-series of length T = 100; 50 errors corresponds to random guessing. Plotted are histograms of the errors over 1000 experiments.

SLDS Experiments
We chose experimental conditions that, from the viewpoint of classical signal processing, are difficult, with changes in the switches occurring at a much higher rate than the typical frequencies in the signal. We consider two different toy SLDS experiments: the 'easy' problem corresponds to a low hidden dimension, H = 3, with low observation noise; the 'hard' problem corresponds to a high hidden dimension, H = 30, and high observation noise. See Figure (5) for details of the experimental setup.

We compared methods using a single Gaussian, and methods using multiple Gaussians; see Figure (6) and Figure (7). For EC we use the mean approximation for the numerical integration of Equation (12). For the Particle Filter, 1000 particles were used, with Kitagawa re-sampling (Kitagawa, 1996). For the Rao-Blackwellized Particle Filter (Doucet et al., 2000), 500 particles were used, with Kitagawa re-sampling. We included the Particle Filter merely as a point of comparison with ADF, since particle filters are not designed to approximate the smoothed estimate.
An alternative MCMC procedure is to perform Gibbs sampling of p(s_{1:T}|v_{1:T}), using p(s_t|s_{\t}, v_{1:T}) ∝ p(v_{1:T}|s_{1:T}) p(s_{1:T}), where p(v_{1:T}|s_{1:T}) is simply the likelihood of an LDS; see for example Carter and Kohn (1996).17 We initialized the state s_{1:T} using the most likely states s_t from the filtered results with a Gaussian mixture (ADFM), and then swept forwards in time, sampling from the state p(s_t|s_{\t}, v_{1:T}) until the end of the chain. We then reversed direction, sampling from time T back to time 1, and repeated this procedure 100 times, with the mean over the last 80 sweeps used as the posterior mean approximation. This procedure is expensive, since each sample requires computing the likelihood of an LDS defined on the whole time-series. The procedure therefore scales as GT², where G is the number of sweeps over the time-series. Despite using a reasonable initialization, Gibbs sampling struggles to improve on the filtered results.

We found that EP was numerically unstable and often struggled to converge. To encourage convergence, we used the damping method of Heskes and Zoeter (2002), performing 20 iterations with a damping factor of 0.5. The disappointing performance of EP is most likely due to conflicts resulting from numerical instabilities introduced by the frequent conversions between moment and canonical representations.

17. Carter and Kohn (1996) proposed an overly complex procedure for computing the likelihood p(v_{1:T}|s_{1:T}). This is simply the likelihood of an LDS (since s_{1:T} are assumed known), and is readily computable using any of the standard procedures in the literature.

Figure 8: aSLDS: histogram of the number of errors in estimating a binary switch p(s_t|v_{1:T}) over a time-series of length T = 100, for the methods PF, ADFS, ECS, ADFM and ECM; 50 errors corresponds to random guessing. Plotted are histograms of the errors over 1000 experiments. ADFM used I = 4 Gaussians, and ECM used I = J = 4 Gaussians. We used 1000 samples to approximate Equation (12).

Table 2: Errors in approximating the states for the multi-path problem; see Figure (9). The mean absolute deviation |p_ec(s_t|v_{1:T}) − p_exact(s_t|v_{1:T})|, averaged over the S = 4 states of s_t and over the times t = 1, . . . , 5, computed for different numbers of mixture components in EC. The mean approximation of Equation (12) is used. The exact computation uses S^{T−1} = 256 mixtures.

  I     J     error
  1     1     0.0989
  4     1     0.0624
  4     4     0.0365
  16    1     0.0440
  16    16    0.0130
  64    1     0.0440
  64    64    4.75e-4
  256   1     0.0440
  256   256   3.40e-8

The various algorithms differ widely in performance; see Figures (6, 7). Not surprisingly, the best filtered results are given by ADF, since it is better able to represent the variance in the filtered posterior than the sampling methods. Unlike Kim's method, EC makes good use of the future information to clean up the filtered results considerably. One should bear in mind that EC, Kim's method and the Gibbs initialization all use the same ADF results. These results show that EC may dramatically improve on Kim's method, so that the small amount of extra work in making a numerical approximation of p(s_t|s_{t+1}, v_{1:T}), Equation (12), may bring significant benefits.

Augmented SLDS Experiments
In Figure (8), we chose a simple two-state (S = 2) transition distribution

p(s_{t+1} = 1|s_t, h_t) = σ(h_tᵀ w(s_t)),   where σ(x) ≡ 1/(1 + e^{−x}).
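A minimal sketch of sampling from this augmented switch transition (the two-row layout of w and all names are our own):

```python
import numpy as np

def sample_next_switch(h, s, w, rng):
    """One augmented-SLDS switch step for S = 2 (states coded 0/1):
    p(s_{t+1} = 1 | s_t, h_t) = sigma(h_t^T w(s_t)).
    w is a (2, H) array whose rows play the role of w(s_t)."""
    p1 = 1.0 / (1.0 + np.exp(-(h @ w[s])))   # logistic transition probability
    return 1 if rng.random() < p1 else 0
```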
Some care needs to be taken in constructing the model so that even exact inference would produce posterior switches close to the sampled switches. If the switch variable s_{t+1} changes wildly (which is possible given the above formula, since the hidden state h may have a large projected change if the hidden state changes), essentially no information is left in the signal for any inference method to produce reasonable results. We therefore set w(s_t) to a zero vector except for the first two components, which are independently sampled from a zero-mean Gaussian with standard deviation 5.

Figure 9: (a) The multi-path problem. The particle starts from (0, 0) at time t = 1. Subsequently, at each time-point, either the vector (10, 10) (corresponding to states s = 1 and s = 3) or (−10, 10) (corresponding to states s = 2 and s = 4) is added to the hidden dynamics, perturbed by a small amount of noise, Σ^h = 0.1. The observations are v = h + η^v(s). For states s = 1, 2 the observation noise is small, Σ^v = 0.1 I, but for s = 3, 4 the noise in the horizontal direction has variance 1000. The visible observations are given by the 'x' marks, the true hidden states by '+'. (b) The exact smoothed state posteriors p_exact(s_t|v_{1:T}) computed by enumerating all paths (given by the dashed lines).

For each of the two switch states, s, we have a transition matrix A(s), which we set to be block diagonal. The first 2 × 2 block is set to 0.9999 R_θ, where R_θ is a 2 × 2 rotation matrix with angle θ chosen uniformly from 0 to 1 radians. This means that s_{t+1} is dependent on the first two components of h_t, which are rotating at a restricted rate. The remaining (H − 2) × (H − 2) block of A(s) is chosen as (using MATLAB notation) 0.9999 ∗ orth(rand(H − 2)), that is, a scaled randomly chosen orthogonal matrix (see the sketch after this section). Throughout, S = 2, V = 1, H = 30, T = 100, with zero output bias. Using partly MATLAB notation, B(s) = randn(V, H), v̄_t ≡ 0, h̄_1 = 10 ∗ randn(H, 1), h̄_{t>1} = 0, Σ_1^h = I_H, p_1 = uniform, Σ^v = 30 I_V, Σ^h = 0.1 I_H.

We compare EC only against Particle Filters using 1000 particles, since other methods would require specialized and novel implementations. In ADFM, I = 4 Gaussians were used, and for ECM, I = J = 4 Gaussians were used. Looking at the results in Figure (8), we see that EC performs well, with some improvement in using the mixture representation (I, J = 4) over a single Gaussian (I = J = 1). The Particle Filter most likely failed because the hidden dimension is too high to be explored well with only 1000 particles.
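The block-diagonal transition matrix described above might be constructed as follows; this is a sketch under our own naming, with orthogonalization done via a QR decomposition.

```python
import numpy as np

def make_transition_matrix(H, rng):
    """Block-diagonal A(s): a slowly rotating 2x2 block driving the switch,
    plus a scaled random orthogonal block on the remaining H-2 dimensions."""
    theta = rng.uniform(0.0, 1.0)                        # angle in radians
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    Q = np.linalg.qr(rng.random((H - 2, H - 2)))[0]      # orth(rand(H-2))
    A = np.zeros((H, H))
    A[:2, :2] = 0.9999 * R
    A[2:, 2:] = 0.9999 * Q
    return A
```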
Effect of Using Mixtures
Our claim is that EC should cope in situations where the smoothed posterior p(h_t|s_t, v_{1:T}) is multi-modal and, consequently, cannot be well represented by a single Gaussian.18 We therefore constructed an SLDS which exhibits multi-modality, to see the effect of using EC with both I and J greater than 1. The 'multi-path' scenario is described in Figure (9), where a particle traces a path through a two-dimensional space. A small number of time-steps was chosen so that the exact p(s_t|v_{1:T}) can be computed by direct enumeration. The observation of the particle is at times extremely noisy in the horizontal direction. This induces multi-modality of p(h_t|s_t, v_{1:T}), since there are several paths that might plausibly have been taken to give rise to the observations.

18. This should not be confused with the multi-modality of p(h_t|v_{1:T}) = Σ_{s_t} p(h_t|s_t, v_{1:T}) p(s_t|v_{1:T}).

The accuracy with which EC predicts the exact smoothed posterior is given in Table (2). For this problem we see that both the number of Forward (I) and Backward (J) components affects the accuracy of the approximation, generally with improved accuracy as the number of mixture components increases. For a 'perfect' approximation method, one would expect that when I = J = S^{T−1} = 256, the approximation should become exact. The small error for this case in Table (2) may arise for several reasons: the extra independence assumption used in EC, the simple mean approximation used to compute Equation (12), or numerical roundoff. However, at least in this case, the effect of these assumptions on the performance is very small.

5. Discussion
Expectation Correction is a novel form of Backward Pass which makes fewer approximations than the widely used approach of Kim (1994). In Kim's method, potentially important future information channeled through the continuous hidden variables is lost. EC, along with Kim's method, makes the additional assumption p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) ≈ p(h_{t+1}|s_{t+1}, v_{1:T}). However, our experience is that this assumption is rather mild, since the state of h_{t+1} will be most heavily influenced by its immediate parent s_{t+1}. Our approximation is based on the idea that, although exact inference will consist of an exponentially large number of mixture components, due to the forgetting which commonly occurs in Markovian models, a finite number of mixture components may provide a reasonable approximation. In tracking situations where the visible information is (temporarily) not enough to specify the hidden state accurately, representing the posterior p(h_t|s_t, v_{1:T}) using a mixture of Gaussians may improve results significantly. Clearly, in systems with very long correlation times our method may require too many mixture components to produce a satisfactory result, although we are unaware of other techniques that would be able to cope well in that case.

We hope that the straightforward ideas presented here may help facilitate the practical application of dynamic hybrid networks to machine learning and related areas. Whilst models with Gaussian emission distributions such as the SLDS are widespread, the extension of this method to non-Gaussian emissions p(v_t|h_t, s_t) would clearly be of considerable interest. Software for Expectation Correction for this augmented class of Switching Linear Gaussian models is available from www.idiap.ch/~barber.

Acknowledgments
I would like to thank Onno Zoeter and Tom Heskes for kindly providing their Expectation Propagation code, Silvia Chiappa for helpful discussions, and Bertrand Mesot for many discussions, help with the simulations, and for suggesting the relationship between the partial factorization and independence viewpoints of EC. I would also like to thank the reviewers for their many helpful comments and suggestions.
Appendix A. Inference in the LDS
The LDS is defined by Equations (1, 2) in the case of a single switch, S = 1. The LDS Forward and Backward Passes define the important functions LDSFORWARD and LDSBACKWARD, which we make use of for inference in the aSLDS.

Forward Pass (Filtering)
The filtered posterior p(h_t|v_{1:t}) is a Gaussian which we parameterize with mean f_t and covariance F_t. These parameters can be updated recursively using p(h_t|v_{1:t}) ∝ p(h_t, v_t|v_{1:t−1}), where the joint distribution p(h_t, v_t|v_{1:t−1}) has statistics (see Appendix (B))

µ_h = A f_{t−1} + h̄,   µ_v = B µ_h + v̄,
Σ_hh = A F_{t−1} Aᵀ + Σ^h,   Σ_vv = B Σ_hh Bᵀ + Σ^v,   Σ_vh = B Σ_hh.

We may then find p(h_t|v_{1:t}) by conditioning p(h_t, v_t|v_{1:t−1}) on v_t; see Appendix (C). This gives rise to Algorithm (4).

Algorithm 4 LDS Forward Pass. Compute the filtered posteriors p(h_t|v_{1:t}) ≡ N(f_t, F_t) for an LDS with parameters θ_t = (A, B, Σ^h, Σ^v, h̄, v̄), for t > 1. At time t = 1, we use parameters θ_1 = (A, B, Σ, Σ^v, µ, v̄), where Σ and µ are the prior covariance and mean of h. The log-likelihood L = log p(v_{1:T}) is also returned.
  F_0 ← 0, f_0 ← 0, L ← 0
  for t ← 1 to T do
    {f_t, F_t, p_t} = LDSFORWARD(f_{t−1}, F_{t−1}, v_t; θ_t)
    L ← L + log p_t
  end for
  function LDSFORWARD(f, F, v; θ)
    Compute the joint p(h_t, v_t|v_{1:t−1}):
      µ_h ← A f + h̄,   µ_v ← B µ_h + v̄
      Σ_hh ← A F Aᵀ + Σ^h,   Σ_vv ← B Σ_hh Bᵀ + Σ^v,   Σ_vh ← B Σ_hh
    Find p(h_t|v_{1:t}) by conditioning:
      f′ ← µ_h + Σ_vhᵀ Σ_vv^{−1} (v − µ_v),   F′ ← Σ_hh − Σ_vhᵀ Σ_vv^{−1} Σ_vh
    Compute p(v_t|v_{1:t−1}):
      p′ ← exp(−½ (v − µ_v)ᵀ Σ_vv^{−1} (v − µ_v)) / √(det(2π Σ_vv))
    return f′, F′, p′
  end function

Backward Pass
The smoothed posterior p(h_t|v_{1:T}) ≡ N(g_t, G_t) can be computed recursively using

p(h_t|v_{1:T}) = ∫_{h_{t+1}} p(h_t|h_{t+1}, v_{1:T}) p(h_{t+1}|v_{1:T}) = ∫_{h_{t+1}} p(h_t|h_{t+1}, v_{1:t}) p(h_{t+1}|v_{1:T}),

where p(h_t|h_{t+1}, v_{1:t}) may be obtained from the joint distribution

p(h_t, h_{t+1}|v_{1:t}) = p(h_{t+1}|h_t) p(h_t|v_{1:t}),   (19)

which itself can be obtained by forward propagation from p(h_t|v_{1:t}). Conditioning Equation (19) to find p(h_t|h_{t+1}, v_{1:t}) effectively reverses the dynamics,

h_t = Ã_t h_{t+1} + η̃_t,

where Ã_t and η̃_t ∼ N(m̃_t, Σ̃_t) (the tilde denoting the reversed dynamics) are found using the conditioned Gaussian results in Appendix (C); these are given explicitly in Algorithm (5). Then, averaging the reversed dynamics over p(h_{t+1}|v_{1:T}), we find that p(h_t|v_{1:T}) is a Gaussian with statistics

g_t = Ã_t g_{t+1} + m̃_t,   G_t = Ã_t G_{t+1} Ã_tᵀ + Σ̃_t.

This Backward Pass is given in Algorithm (5). For parameter learning of the A matrix, the smoothed statistic ⟨h_t h_{t+1}ᵀ⟩ is required. Using the above formulation, this is given by Ã_t G_{t+1} + g_t g_{t+1}ᵀ. This is much simpler than the standard expressions cited in Shumway and Stoffer (2000) and Roweis and Ghahramani (1999).

Algorithm 5 LDS Backward Pass. Compute the smoothed posteriors p(h_t|v_{1:T}). This requires the filtered results from Algorithm (4).
  G_T ← F_T, g_T ← f_T
  for t ← T − 1 to 1 do
    {g_t, G_t} = LDSBACKWARD(g_{t+1}, G_{t+1}, f_t, F_t; θ_{t+1})
  end for
  function LDSBACKWARD(g, G, f, F; θ)
    µ_h ← A f + h̄,   Σ_h′h′ ← A F Aᵀ + Σ^h,   Σ_h′h ← A F
    Σ̃ ← F − Σ_h′hᵀ Σ_h′h′^{−1} Σ_h′h,   Ã ← Σ_h′hᵀ Σ_h′h′^{−1},   m̃ ← f − Ã µ_h
    g′ ← Ã g + m̃,   G′ ← Ã G Ãᵀ + Σ̃
    return g′, G′
  end function

Appendix B. Gaussian Propagation
Let y be linearly related to x through y = Mx + η, where η ∼ N(µ, Σ) and x ∼ N(µ_x, Σ_x). Then p(y) = ∫_x p(y|x) p(x) is a Gaussian with mean M µ_x + µ and covariance M Σ_x Mᵀ + Σ.

Appendix C. Gaussian Conditioning
For a joint Gaussian distribution over the vectors x and y with means µ_x, µ_y and covariance elements Σ_xx, Σ_xy, Σ_yy, the conditional p(x|y) is a Gaussian with mean µ_x + Σ_xy Σ_yy^{−1} (y − µ_y) and covariance Σ_xx − Σ_xy Σ_yy^{−1} Σ_yx.
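The two routines above translate almost line by line into code. The following sketch (naming is ours) is a direct numpy transcription of single steps of Algorithms (4) and (5):

```python
import numpy as np

def lds_forward(f, F, v, A, B, Sh, Sv, hbar, vbar):
    """One LDSFORWARD step: propagate the joint, then condition on v_t.
    Returns the new filtered mean/covariance and p(v_t | v_{1:t-1})."""
    mh = A @ f + hbar
    mv = B @ mh + vbar
    Shh = A @ F @ A.T + Sh
    Svv = B @ Shh @ B.T + Sv
    Svh = B @ Shh
    K = Svh.T @ np.linalg.inv(Svv)            # gain Sigma_hv Sigma_vv^{-1}
    f_new = mh + K @ (v - mv)
    F_new = Shh - K @ Svh
    d = v - mv
    p = np.exp(-0.5 * d @ np.linalg.solve(Svv, d)) / np.sqrt(
        np.linalg.det(2 * np.pi * Svv))
    return f_new, F_new, p

def lds_backward(g, G, f, F, A, Sh, hbar):
    """One LDSBACKWARD step: reverse the dynamics through
    p(h_t | h_{t+1}, v_{1:t}) and average over p(h_{t+1} | v_{1:T})."""
    mh = A @ f + hbar
    Shh = A @ F @ A.T + Sh                    # Cov(h_{t+1} | v_{1:t})
    Shx = A @ F                               # Cov(h_{t+1}, h_t | v_{1:t})
    At = Shx.T @ np.linalg.inv(Shh)           # reversed-dynamics matrix
    mt = f - At @ mh
    St = F - At @ Shx
    return At @ g + mt, At @ G @ At.T + St
```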
Appendix D. Collapsing Gaussians
The user may provide any algorithm of their choice for collapsing a set of Gaussians to a smaller set of Gaussians (Titterington et al., 1985). Here, to be explicit, we present a simple one which is fast, but has the disadvantage that no spatial information about the mixture is used.

First, we describe how to collapse a mixture to a single Gaussian: we may collapse a mixture of Gaussians p(x) = Σ_i p_i N(x|µ_i, Σ_i) to a single Gaussian with mean µ = Σ_i p_i µ_i and covariance Σ_i p_i (Σ_i + µ_i µ_iᵀ) − µµᵀ.

To collapse a mixture to a K-component mixture, we retain the K − 1 Gaussians with the largest mixture weights; the remaining N − K + 1 Gaussians are simply merged to a single Gaussian using the above method. The alternative of recursively merging the two Gaussians with the lowest mixture weights gave similar experimental performance.

More sophisticated methods which retain some spatial information would clearly be potentially useful. The method presented in Lerner et al. (2000) is a suitable approach which considers removing Gaussians which are spatially similar (and not just low-weight components), thereby retaining diversity over the possible solutions.
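A minimal implementation of this simple collapse (naming ours; the sketch assumes the mixture has more than K components):

```python
import numpy as np

def collapse_to_single(w, mus, Sigmas):
    """Moment-match sum_i w_i N(mu_i, Sigma_i) to a single Gaussian.
    w: (N,) weights summing to 1; mus: (N, d); Sigmas: (N, d, d)."""
    mu = np.einsum('i,ij->j', w, mus)
    Sigma = np.einsum('i,ijk->jk', w, Sigmas)
    Sigma += np.einsum('i,ij,ik->jk', w, mus, mus) - np.outer(mu, mu)
    return mu, Sigma

def collapse_to_k(w, mus, Sigmas, K):
    """Keep the K-1 heaviest components; merge the rest into one Gaussian."""
    order = np.argsort(w)[::-1]
    keep, rest = order[:K - 1], order[K - 1:]
    m_mu, m_S = collapse_to_single(w[rest] / w[rest].sum(),
                                   mus[rest], Sigmas[rest])
    new_w = np.append(w[keep], w[rest].sum())
    new_mus = np.vstack([mus[keep], m_mu[None]])
    new_Sigmas = np.concatenate([Sigmas[keep], m_S[None]])
    return new_w, new_mus, new_Sigmas
```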
Appendix E. The Discrete-Continuous Factorization Viewpoint
An alternative viewpoint is to proceed analogously to the Rauch-Tung-Striebel correction method for the LDS (Grewal and Andrews, 1993):

p(h_t, s_t|v_{1:T}) = Σ_{s_{t+1}} ∫_{h_{t+1}} p(s_t, h_t, h_{t+1}, s_{t+1}|v_{1:T})
 = Σ_{s_{t+1}} p(s_{t+1}|v_{1:T}) ∫_{h_{t+1}} p(h_t, s_t|h_{t+1}, s_{t+1}, v_{1:t}) p(h_{t+1}|s_{t+1}, v_{1:T})
 = Σ_{s_{t+1}} p(s_{t+1}|v_{1:T}) ⟨p(h_t|h_{t+1}, s_{t+1}, s_t, v_{1:t}) p(s_t|h_{t+1}, s_{t+1}, v_{1:t})⟩
 ≈ Σ_{s_{t+1}} p(s_{t+1}|v_{1:T}) ⟨p(h_t|h_{t+1}, s_{t+1}, s_t, v_{1:t})⟩ p(s_t|s_{t+1}, v_{1:T}),   (20)

where angled brackets ⟨·⟩ denote averages with respect to p(h_{t+1}|s_{t+1}, v_{1:T}). Whilst the factorized approximation in Equation (20) may seem severe, by comparing Equations (20) and (10) we see that it is equivalent to the apparently milder assumption p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) ≈ p(h_{t+1}|s_{t+1}, v_{1:T}). Hence this factorized approximation is equivalent to the 'standard' EC approach in which the dependency on s_t is dropped.

References

D. L. Alspach and H. W. Sorenson. Nonlinear Bayesian estimation using Gaussian sum approximations. IEEE Transactions on Automatic Control, 17(4):439–448, 1972.

Y. Bar-Shalom and X.-R. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, Norwood, MA, 1998.

X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI 1998), pages 33–42. Morgan Kaufmann, 1998.

C. Carter and R. Kohn. Markov chain Monte Carlo in conditionally Gaussian state space models. Biometrika, 83:589–601, 1996.

A. T. Cemgil, B. Kappen, and D. Barber. A generative model for music transcription. IEEE Transactions on Audio, Speech and Language Processing, 14(2):679–694, 2006.

S. Chib and M. Dueker. Non-Markovian regime switching with endogenous states and time-varying state strengths. Econometric Society 2004 North American Summer Meetings 600, 2004.

A. Doucet, N. de Freitas, K. Murphy, and S. Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Uncertainty in Artificial Intelligence, 2000.

A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001.

Z. Ghahramani and G. E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):963–996, 1998.

M. S. Grewal and A. P. Andrews. Kalman Filtering: Theory and Practice. Prentice-Hall, 1993.

T. Heskes and O. Zoeter. Expectation propagation for approximate inference in dynamic Bayesian networks. In A. Darwiche and N. Friedman, editors, Uncertainty in Artificial Intelligence, pages 216–223, 2002.

T. Jaakkola and M. Jordan. A variational approach to Bayesian logistic regression problems and their extensions. In Artificial Intelligence and Statistics, 1996.

M. I. Jordan. Learning in Graphical Models. MIT Press, 1998.

S. Julier and J. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In Int. Symp. Aerospace/Defense Sensing, Simulation and Controls, Orlando, FL, 1997.

C.-J. Kim. Dynamic linear models with Markov-switching. Journal of Econometrics, 60:1–22, 1994.

C.-J. Kim and C. R. Nelson. State-Space Models with Regime Switching. MIT Press, 1999.

G. Kitagawa. The two-filter formula for smoothing and an implementation of the Gaussian-sum smoother. Annals of the Institute of Statistical Mathematics, 46(4):605–623, 1994.

G. Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5(1):1–25, 1996.

S. Lauritzen and F. Jensen. Stable local computation with conditional Gaussian distributions. Statistics and Computing, 11:191–203, 2001.

S. L. Lauritzen. Propagation of probabilities, means, and variances in mixed graphical association models. Journal of the American Statistical Association, 87(420):1098–1108, 1992.

L. J. Lee, H. Attias, L. Deng, and P. Fieguth. A multimodal variational approach to learning and inference in switching state space models. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 04), volume 5, pages 505–508, 2004.

U. Lerner, R. Parr, D. Koller, and G. Biswas. Bayesian fault detection and diagnosis in dynamic systems. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-00), pages 531–537, 2000.

U. N. Lerner. Hybrid Bayesian Networks for Reasoning about Complex Systems. PhD thesis, Stanford University, 2002.

B. Mesot and D. Barber. Switching linear dynamical systems for noise robust speech recognition. IDIAP-RR 08, 2006.

T. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, MIT Media Lab, 2001.

R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto, 1993.

P. Park and T. Kailath. New square-root smoothing algorithms. IEEE Transactions on Automatic Control, 41:727–732, 1996.

V. Pavlovic, J. M. Rehg, and J. MacCormick. Learning switching linear models of human motion. In Advances in Neural Information Processing Systems (NIPS 13), pages 981–987, 2001.

L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

H. E. Rauch, G. Tung, and C. T. Striebel. Maximum likelihood estimates of linear dynamic systems. American Institute of Aeronautics and Astronautics Journal (AIAAJ), 3(8):1445–1450, 1965.

S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305–345, 1999.

R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications. Springer, 2000.

E. B. Sudderth, A. T. Ihler, W. T. Freeman, and A. S. Willsky. Nonparametric belief propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 605–612, 2003.

D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley, 1985.
H. Tong. Nonlinear Time Series Analysis: A Dynamical Systems Approach. Oxford University Press, 1990.

M. Verhaegen and P. Van Dooren. Numerical aspects of different Kalman filter implementations. IEEE Transactions on Automatic Control, 31(10):907–917, 1986.

M. West and J. Harrison. Bayesian Forecasting and Dynamic Models. Springer, 1999.

O. Zoeter. Monitoring Non-Linear and Switching Dynamical Systems. PhD thesis, Radboud University Nijmegen, 2005.

3 0.14608203 75 jmlr-2006-Policy Gradient in Continuous Time

Author: Rémi Munos

Abstract: Policy search is a method for approximately solving an optimal control problem by performing a parametric optimization search in a given class of parameterized policies. In order to apply a local optimization technique, such as a gradient method, we wish to evaluate the sensitivity of the performance measure with respect to the policy parameters, the so-called policy gradient. This paper is concerned with the estimation of the policy gradient for continuous-time, deterministic state dynamics, in a reinforcement learning framework, that is, when the decision maker does not have a model of the state dynamics. We show that the usual likelihood ratio methods used in discrete time fail to estimate the gradient because they are subject to variance explosion when the discretization time-step decreases to 0. We describe an alternative approach based on the approximation of the pathwise derivative, which leads to a policy gradient estimate that converges almost surely to the true gradient when the time-step tends to 0. The underlying idea starts with the derivation of an explicit representation of the policy gradient using pathwise derivation. This derivation makes use of the knowledge of the state dynamics. Then, in order to estimate the gradient from the observable data only, we use a stochastic policy to discretize the continuous deterministic system into a stochastic discrete process, which enables us to replace the unknown coefficients by quantities that solely depend on known data. We prove the almost sure convergence of this estimate to the true policy gradient when the discretization time-step goes to zero. The method is illustrated on two target problems, in discrete and continuous control spaces. Keywords: optimal control, reinforcement learning, policy search, sensitivity analysis, parametric optimization, gradient estimate, likelihood ratio method, pathwise derivation

1. Introduction and Statement of the Problem
We consider an optimal control problem with continuous state (x_t ∈ R^d)_{t≥0} whose state dynamics is defined according to the controlled differential equation

dx_t/dt = f(x_t, u_t),   (1)

where the control (u_t)_{t≥0} is a Lebesgue measurable function with values in a control space U. Note that the state dynamics f may also depend on time, but we omit this dependency in the notation for simplicity. We intend to maximize a functional J that depends on the trajectory (x_t)_{0≤t≤T} over a finite-time horizon T > 0. For simplicity, in this paper we illustrate the case of a terminal reward only:

J(x; (u_t)_{t≥0}) := r(x_T),   (2)

where r : R^d → R is the reward function. Extension to the case of a general functional of the kind

J(x; (u_t)_{t≥0}) = ∫_0^T r(t, x_t) dt + R(x_T),   (3)

with r and R being current and terminal reward functions, would easily follow, as indicated in Remark 1. The optimal control problem of finding a control (u_t)_{t≥0} that maximizes the functional is replaced by a parametric optimization problem, for which we search for a good feedback control law in a given class of parameterized policies {π_α : [0, T] × R^d → U}_α, where α ∈ R^m is the parameter. The control u_t ∈ U (or action) at time t is u_t = π_α(t, x_t), and we may write the dynamics of the resulting feedback system as

dx_t/dt = f_α(x_t),   (4)

where f_α(x_t) := f(x_t, π_α(t, x_t)). In the paper, we make the assumption that f_α is C², with bounded derivatives. Let us define the performance measure V(α) := J(x; π_α(t, x_t)_{t≥0}), where its dependency with respect to (w.r.t.) the parameter α is emphasized. One may also consider an average performance measure according to some distribution µ for the initial state: V(α) := E[J(x; π_α(t, x_t)_{t≥0}) | x ∼ µ].

In order to find a local maximum of V(α), one may perform a local search, such as a gradient ascent method

α ← α + η ∇_α V(α),   (5)

with an adequate step η (see for example Polyak, 1987; Kushner and Yin, 1997). The computation of the gradient ∇_α V(α) is the object of this paper. A first method would be to approximate the gradient by a finite-difference quotient for each of the m components of the parameter:

∂_{α_i} V(α) ≃ (V(α + ε e_i) − V(α)) / ε,

for some small value of ε (we use the notation ∂_α instead of ∇_α to indicate a single-dimensional derivative). This finite-difference method requires the simulation of m + 1 trajectories to compute an approximation of the true gradient. When the number of parameters is large, this may be computationally expensive. However, this simple method may be efficient if the number of parameters is relatively small. In the rest of the paper we will not consider this approach, and will aim at computing the gradient using one trajectory only.
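A sketch of this finite-difference baseline, where V is treated as a black-box trajectory simulator (all names are ours):

```python
import numpy as np

def finite_difference_gradient(V, alpha, eps=1e-4):
    """Finite-difference approximation of the policy gradient.
    Requires m + 1 evaluations of the performance measure V,
    one per parameter component plus the base point."""
    base = V(alpha)
    grad = np.empty_like(alpha)
    for i in range(alpha.size):
        e = np.zeros_like(alpha)
        e[i] = eps
        grad[i] = (V(alpha + e) - base) / eps
    return grad
```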
Pathwise estimation of the gradient. We now illustrate that if the decision-maker has access to a model of the state dynamics, then a pathwise derivation directly leads to the policy gradient. Indeed, let us define the gradient of the state with respect to the parameter, z_t := ∇_α x_t (i.e., z_t is defined as a d × m matrix whose (i, j) component is the derivative of the i-th component of x_t w.r.t. α_j). Our smoothness assumption on f_α allows us to differentiate the state dynamics (4) w.r.t. α, which provides the dynamics of (z_t):

dz_t/dt = ∇_α f_α(x_t) + ∇_x f_α(x_t) z_t,   (6)

where the coefficients ∇_α f_α and ∇_x f_α are, respectively, the derivatives of f w.r.t. the parameter (a matrix of size d × m) and the state (a matrix of size d × d). The initial condition for z is z_0 = 0. When the reward function r is smooth (i.e., continuously differentiable), one may apply a pathwise differentiation to derive a gradient formula (see e.g. Bensoussan, 1988, or Yang and Kushner, 1991, for an extension to the stochastic case):

∇_α V(α) = ∇_x r(x_T) z_T.   (7)

Remark 1 In the more general setting of a functional (3), the gradient is deduced (by linearity) from the above formula:

∇_α V(α) = ∫_0^T ∇_x r(t, x_t) z_t dt + ∇_x R(x_T) z_T.

What is known by the agent? The decision maker (call it the agent) that intends to design a good controller for the dynamical system may or may not know a model of the state dynamics f. In case the dynamics is known, the state gradient z_t = ∇_α x_t may be computed from (6) along the trajectory, and the gradient of the performance measure w.r.t. the parameter α is deduced at time T from (7), which allows one to perform the gradient ascent step (5); this model-based computation is sketched below. However, in this paper we consider a Reinforcement Learning (Sutton and Barto, 1998) setting in which the state dynamics is unknown to the agent, but we still assume that the state is fully observable. The agent knows only the response of the system to its control. To be more precise, the information available to the agent at time t is its own control policy π_α and the trajectory (x_s)_{0≤s≤t} up to time t. At time T, the agent receives the reward r(x_T) and, in this paper, we assume that the gradient ∇r(x_T) is available to the agent. From this point of view, it seems impossible to derive the state gradient z_t from (6), since ∇_α f and ∇_x f are unknown.
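For reference, when the dynamics is known, Equations (4), (6) and (7) can be integrated jointly. A minimal Euler sketch (callable Jacobians and all names are our own assumptions):

```python
import numpy as np

def pathwise_gradient(f, df_dx, df_da, grad_r, x0, alpha, T, dt=1e-3):
    """Model-based pathwise policy gradient.
    f(x, a) -> dx/dt (shape d); df_dx -> (d, d); df_da -> (d, m);
    grad_r(x) -> gradient of the terminal reward (shape d)."""
    x = np.array(x0, dtype=float)
    z = np.zeros((x.size, alpha.size))        # z_0 = 0, Equation (6)
    for _ in range(int(T / dt)):
        z = z + dt * (df_da(x, alpha) + df_dx(x, alpha) @ z)
        x = x + dt * f(x, alpha)
    return grad_r(x) @ z                      # Equation (7): grad_x r(x_T) z_T
```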
The term ∇_x f(x_t) may be approximated by a least-squares method from the observation of past states (x_s)_{s≤t}, as will be explained later in Section 3.2. However, the term ∇_α f(x_t) cannot be calculated analogously. In this paper, we introduce the idea of using stochastic policies to approximate the state (x_t) and the state gradient (z_t) by discrete-time stochastic processes (X_t^∆) and (Z_t^∆) (with ∆ being some discretization time-step). We show how Z_t^∆ can be computed without the knowledge of ∇_α f, but only from information available to the agent. We prove the convergence (with probability one) of the gradient estimate ∇_x r(X_T^∆) Z_T^∆ derived from the stochastic processes to ∇_α V(α) when ∆ → 0. Here, almost sure convergence is obtained using the concentration of measure phenomenon (Talagrand, 1996; Ledoux, 2001).

Figure 1: A trajectory (X_{t_n}^∆)_{0≤n≤N} and the state dynamics vector f_α of the continuous process (x_t)_{0≤t≤T}.

Likelihood ratio method? It is worth mentioning that this strong convergence result contrasts with the usual likelihood ratio method (also called the score method) in discrete time (see e.g. Reiman and Weiss, 1986; Glynn, 1987, or, more recently in the reinforcement learning literature, Williams, 1992; Sutton et al., 2000; Baxter and Bartlett, 2001; Marbach and Tsitsiklis, 2003), for which the policy gradient estimate is subject to variance explosion when the discretization time-step ∆ tends to 0. The intuitive reason for this problem lies in the fact that the number of decisions before getting the reward grows to infinity when ∆ → 0 (the variance of likelihood ratio estimates being usually linear in the number of decisions). Let us illustrate this problem on a simple 2-dimensional process. Consider the deterministic continuous process (x_t)_{0≤t≤1} defined by the state dynamics

dx_t/dt = f_α := (α, 1 − α)ᵀ,   (8)

(0 < α < 1) with initial condition x_0 = (0, 0)ᵀ (where ᵀ denotes the transpose operator). The performance measure V(α) is the reward at the terminal state at time T = 1, with the reward function being the first coordinate of the state, r((x, y)ᵀ) := x. Thus V(α) = r(x_{T=1}) = α, and its derivative is ∇_α V(α) = 1.

Let (X_{t_n}^∆)_{0≤n≤N} ∈ R² be a discrete-time stochastic process (the discrete times being {t_n = n∆}_{n=0...N}, with the discretization time-step ∆ = 1/N) that starts from the initial state X_0^∆ = x_0 = (0, 0)ᵀ and makes N random moves of length ∆ towards the right (action u_1) or the top (action u_2) (see Figure 1), according to the stochastic policy (i.e., the probability of choosing the actions in each state x) π_α(u_1|x) = α, π_α(u_2|x) = 1 − α. The process is thus defined according to the dynamics

X_{t_{n+1}}^∆ = X_{t_n}^∆ + (U_n, 1 − U_n)ᵀ ∆,   (9)

where (U_n)_{0≤n<N} are independent Bernoulli(α) random variables, with U_n = 1 when action u_1 is chosen and U_n = 0 when action u_2 is chosen.

[...] For all N > 0, there exists a constant C that does not depend on N such that d_n ≤ C/N. Thus we may take D² = C²/N. Now, from the previous paragraph, ||E[X_N] − x_N|| ≤ e(N), with e(N) → 0 when N → ∞. This means that ||h − E[h]|| + e(N) ≥ ||X_N − x_N||, thus

P(||h − E[h]|| ≥ ε + e(N)) ≥ P(||X_N − x_N|| ≥ ε),

and we deduce from (31) that

P(||X_N − x_N|| ≥ ε) ≤ 2 e^{−N(ε + e(N))²/(2C²)}.

Thus, for all ε > 0, the series Σ_{N≥0} P(||X_N − x_N|| ≥ ε) converges. Now, from the Borel-Cantelli lemma, we deduce that for all ε > 0, there exists N_ε such that for all N ≥ N_ε, ||X_N − x_N|| < ε, which proves the almost sure convergence of X_N to x_N as N → ∞ (i.e., X_T^∆ → x_T almost surely as ∆ → 0).
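The variance explosion discussed above can be checked empirically on the process (8)-(9). The following sketch (our own construction) computes the likelihood-ratio (score) estimate of ∇_α V(α) = 1; its mean stays near 1 while its variance grows roughly linearly in N, that is, as ∆ = 1/N decreases.

```python
import numpy as np

def lr_gradient_estimate(alpha, N, rng):
    """One likelihood-ratio estimate of dV/dalpha for process (8)-(9):
    reward r(X_N) = x-coordinate = (number of right moves) * Delta."""
    U = rng.random(N) < alpha                  # U_n = 1: right move (u1)
    reward = U.sum() / N                       # Delta = 1/N
    score = (U / alpha - (1 - U) / (1 - alpha)).sum()  # sum of grad log pi
    return reward * score

rng = np.random.default_rng(0)
alpha = 0.5
for N in (10, 100, 1000):                      # Delta -> 0
    est = [lr_gradient_estimate(alpha, N, rng) for _ in range(10000)]
    print(N, np.mean(est), np.var(est))        # mean ~ 1, variance ~ N
```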
Appendix C. Proof of Proposition 8
Write ∇̂_x f for the least-squares estimate of ∇_x f, and let overlines denote empirical means over s ∈ S(t) (with X̄ the mean of the points X_s^+ := X_s + ½ ∆X_s). First, note that Q_t = (XX ᵀ)‾ − X̄ X̄ᵀ is a symmetric, non-negative matrix, since it may be rewritten as

Q_t = (1/n_t) Σ_{s∈S(t)} (X_s^+ − X̄)(X_s^+ − X̄)ᵀ.

In solving the least-squares problem (21), we deduce b = ∆X̄ − A X̄ ∆, thus

min_{A,b} (1/n_t) Σ_{s∈S(t)} ||∆X_s − b − A(X_s + ½ ∆X_s)∆||²
 = min_A (1/n_t) Σ_{s∈S(t)} ||∆X_s − ∆X̄ − A(X_s^+ − X̄)∆||²
 ≤ (1/n_t) Σ_{s∈S(t)} ||∆X_s − ∆X̄ − ∇_x f(X̄, u_t)(X_s^+ − X̄)∆||².   (32)

Now, since X_s = X̄ + O(∆), one may obtain, as in (19) and (20) (by replacing X_t by X̄), that

∆X_s − ∆X̄ − ∇_x f(X̄, u_t)(X_s^+ − X̄)∆ = O(∆³).   (33)

We deduce from (32) and (33) that

(1/n_t) Σ_{s∈S(t)} ||(∇̂_x f(X_t, u_t) − ∇_x f(X̄, u_t))(X_s^+ − X̄)∆||² = O(∆⁶).

By developing each component,

Σ_{i=1}^d [∇̂_x f(X_t, u_t) − ∇_x f(X̄, u_t)]_{row i} Q_t [∇̂_x f(X_t, u_t) − ∇_x f(X̄, u_t)]_{row i}ᵀ = O(∆⁴).

Now, from the definition of ν(∆), for all vectors u ∈ R^d, uᵀ Q_t u ≥ ν(∆)||u||², thus

ν(∆) ||∇̂_x f(X_t, u_t) − ∇_x f(X̄, u_t)||² = O(∆⁴).

Condition (23) yields ∇̂_x f(X_t, u_t) = ∇_x f(X̄, u_t) + o(1), and since ∇_x f(X_t, u_t) = ∇_x f(X̄, u_t) + O(∆), we deduce

lim_{∆→0} ∇̂_x f(X_t, u_t) = ∇_x f(X_t, u_t).
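In the spirit of the least-squares problem analysed above, a Jacobian of the dynamics can be estimated from observed transitions alone. The following is a rough illustration under our own simplified setup (centering removes the intercept b), not the paper's exact estimator:

```python
import numpy as np

def estimate_state_jacobian(X, dX, dt):
    """Least-squares estimate of grad_x f from a window of transitions.
    X : (n, d) observed states; dX : (n, d) observed increments.
    Fits dX_s ≈ b + A X_s dt and returns the (d, d) estimate of A."""
    Xc = X - X.mean(axis=0)                    # centering eliminates b
    dXc = dX - dX.mean(axis=0)
    # Solve (Xc * dt) @ M ≈ dXc in least squares; the Jacobian is M^T.
    M, *_ = np.linalg.lstsq(Xc * dt, dXc, rcond=None)
    return M.T
```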
References

J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319–350, 2001.

A. Bensoussan. Perturbation Methods in Optimal Control. Wiley/Gauthier-Villars Series in Modern Applied Mathematics. John Wiley & Sons Ltd., Chichester, 1988. Translated from the French by C. Tomson.

A. Bogdanov. Optimal control of a double inverted pendulum on a cart. Technical Report CSE-04-006, CSEE, OGI School of Science and Engineering, OHSU, 2004.

P. W. Glynn. Likelihood ratio gradient estimation: an overview. In A. Thesen, H. Grant, and W. D. Kelton, editors, Proceedings of the 1987 Winter Simulation Conference, pages 366–375, 1987.

E. Gobet and R. Munos. Sensitivity analysis using Itô-Malliavin calculus and martingales: application to stochastic optimal control. SIAM Journal on Control and Optimization, 43(5):1676–1713, 2005.

G. H. Golub and C. F. Van Loan. Matrix Computations, 3rd edition. Johns Hopkins University Press, Baltimore, MD, 1996.

R. E. Kalman, P. L. Falb, and M. A. Arbib. Topics in Mathematical System Theory. McGraw-Hill, New York, 1969.

P. E. Kloeden and E. Platen. Numerical Solutions of Stochastic Differential Equations. Springer-Verlag, 1995.

H. J. Kushner and G. Yin. Stochastic Approximation Algorithms and Applications. Springer-Verlag, Berlin and New York, 1997.

S. M. LaValle. Planning Algorithms. Cambridge University Press, 2006.

M. Ledoux. The Concentration of Measure Phenomenon. American Mathematical Society, Providence, RI, 2001.

P. Marbach and J. N. Tsitsiklis. Approximate gradient methods in policy-space optimization of Markov reward processes. Journal of Discrete Event Dynamical Systems, 13:111–148, 2003.

B. T. Polyak. Introduction to Optimization. Optimization Software Inc., New York, 1987.

M. I. Reiman and A. Weiss. Sensitivity analysis via likelihood ratios. In J. Wilson, J. Henriksen, and S. Roberts, editors, Proceedings of the 1986 Winter Simulation Conference, pages 285–289, 1986.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. Bradford Book, MIT Press, 1998.

R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems, pages 1057–1063. MIT Press, 2000.

M. Talagrand. A new look at independence. Annals of Probability, 24:1–34, 1996.

R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.

J. Yang and H. J. Kushner. A Monte Carlo method for sensitivity analysis and parametric optimization of nonlinear stochastic systems. SIAM Journal on Control and Optimization, 29(5):1216–1249, 1991.

4 0.13251249 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation

Author: Kazuho Watanabe, Sumio Watanabe

Abstract: Bayesian learning has been widely used and proved to be effective in many data modeling problems. However, the computations involved in it require huge costs and generally cannot be performed exactly. The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. The properties and capabilities of variational Bayesian learning itself have not been clarified yet, and it is still unknown how good an approximation the variational Bayesian approach can achieve. In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, not only becomes important in addressing the model selection problem, but also enables us to discuss the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning. Keywords: Gaussian mixture model, variational Bayesian learning, stochastic complexity

5 0.10617184 81 jmlr-2006-Some Discriminant-Based PAC Algorithms

Author: Paul W. Goldberg

Abstract: A classical approach in multi-class pattern classification is the following. Estimate the probability distributions that generated the observations for each label class, and then label new instances by applying the Bayes classifier to the estimated distributions. That approach provides more useful information than just a class label; it also provides estimates of the conditional distribution of class labels, in situations where there is class overlap. We would like to know whether it is harder to build accurate classifiers via this approach, than by techniques that may process all data with distinct labels together. In this paper we make that question precise by considering it in the context of PAC learnability. We propose two restrictions on the PAC learning framework that are intended to correspond with the above approach, and consider their relationship with standard PAC learning. Our main restriction of interest leads to some interesting algorithms that show that the restriction is not stronger (more restrictive) than various other well-known restrictions on PAC learning. An alternative slightly milder restriction turns out to be almost equivalent to unrestricted PAC learning. Keywords: computational learning theory, computational complexity, pattern classification

6 0.096606791 57 jmlr-2006-Linear State-Space Models for Blind Source Separation

7 0.073913619 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

8 0.070814803 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann

9 0.067517847 55 jmlr-2006-Linear Programming Relaxations and Belief Propagation -- An Empirical Study     (Special Topic on Machine Learning and Optimization)

10 0.06750384 89 jmlr-2006-Structured Prediction, Dual Extragradient and Bregman Projections     (Special Topic on Machine Learning and Optimization)

11 0.066336043 27 jmlr-2006-Ensemble Pruning Via Semi-definite Programming     (Special Topic on Machine Learning and Optimization)

12 0.062062729 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

13 0.059879873 91 jmlr-2006-The Interplay of Optimization and Machine Learning Research     (Special Topic on Machine Learning and Optimization)

14 0.057919733 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

15 0.057623323 38 jmlr-2006-Incremental Support Vector Learning: Analysis, Implementation and Applications     (Special Topic on Machine Learning and Optimization)

16 0.057191618 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

17 0.05673195 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis

18 0.056213778 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

19 0.054985605 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

20 0.050728902 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.285), (1, 0.143), (2, -0.127), (3, -0.024), (4, -0.219), (5, -0.16), (6, 0.001), (7, 0.249), (8, -0.156), (9, -0.219), (10, -0.15), (11, -0.024), (12, 0.159), (13, -0.041), (14, -0.108), (15, 0.011), (16, -0.014), (17, -0.175), (18, -0.023), (19, 0.093), (20, 0.073), (21, -0.05), (22, 0.088), (23, 0.115), (24, -0.008), (25, 0.066), (26, -0.023), (27, -0.019), (28, -0.233), (29, 0.113), (30, -0.078), (31, -0.088), (32, 0.058), (33, 0.043), (34, 0.016), (35, 0.07), (36, 0.023), (37, -0.01), (38, -0.038), (39, 0.054), (40, -0.052), (41, -0.023), (42, 0.013), (43, -0.042), (44, -0.029), (45, 0.101), (46, 0.026), (47, 0.087), (48, -0.013), (49, -0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95767361 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

Author: Martin J. Wainwright

Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: that is, the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working under the restriction of limited computation, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of convex variational methods. This stability result provides additional incentive, apart from the obvious benefit of unique global optima, for using message-passing methods based on convex variational relaxations. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. Keywords: graphical model, Markov random field, belief propagation, variational method, parameter estimation, prediction error, algorithmic stability

2 0.5777896 32 jmlr-2006-Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems

Author: David Barber

Abstract: We introduce a method for approximate smoothed inference in a class of switching linear dynamical systems, based on a novel form of Gaussian Sum smoother. This class includes the switching Kalman ‘Filter’ and the more general case of switch transitions dependent on the continuous latent state. The method improves on the standard Kim smoothing approach by dispensing with one of the key approximations, thus making fuller use of the available future information. Whilst the central assumption required is projection to a mixture of Gaussians, we show that an additional conditional independence assumption results in a simpler but accurate alternative. Our method consists of a single Forward and Backward Pass and is reminiscent of the standard smoothing ‘correction’ recursions in the simpler linear dynamical system. The method is numerically stable and compares favourably against alternative approximations, both in cases where a single mixture component provides a good posterior approximation, and where a multimodal approximation is required. Keywords: Gaussian sum smoother, switching Kalman filter, switching linear dynamical system, expectation propagation, expectation correction 1. Switching Linear Dynamical System The Linear Dynamical System (LDS) (Bar-Shalom and Li, 1998; West and Harrison, 1999) is a key temporal model in which a latent linear process generates the observed time-series. For more complex time-series which are not well described globally by a single LDS, we may break the time-series into segments, each modeled by a potentially different LDS. This is the basis for the Switching LDS (SLDS) where, for each time-step t, a switch variable st ∈ 1, . . . , S describes which of the LDSs is to be used.1 The observation (or ‘visible’ variable) vt ∈ R V is linearly related to the hidden state ht ∈ R H by vt = B(st )ht + ηv (st ), ηv (st ) ∼ N (v(st ), Σv (st )) ¯ (1) where N (µ, Σ) denotes a Gaussian distribution with mean µ and covariance Σ. The transition dynamics of the continuous hidden state ht is linear ht = A(st )ht−1 + ηh (st ), ¯ ηh (st ) ∼ N h(st ), Σh (st ) . (2) 1. These systems also go under the names Jump Markov model/process, switching Kalman Filter, Switching Linear Gaussian State-Space model, Conditional Linear Gaussian Model. c 2006 David Barber. BARBER s1 s2 s3 s4 h1 h2 h3 h4 v1 v2 v3 v4 Figure 1: The independence structure of the aSLDS. Square nodes denote discrete variables, round nodes continuous variables. In the SLDS links from h to s are not normally considered. The dynamics of the switch variables is Markovian, with transition p(st |st−1 ). The SLDS is used in many disciplines, from econometrics to machine learning (Bar-Shalom and Li, 1998; Ghahramani and Hinton, 1998; Lerner et al., 2000; Kitagawa, 1994; Kim and Nelson, 1999; Pavlovic et al., 2001). See Lerner (2002) and Zoeter (2005) for recent reviews of work. AUGMENTED S WITCHING L INEAR DYNAMICAL S YSTEM In this article, we will consider the more general model in which the switch st is dependent on both the previous st−1 and ht−1 . We call this an augmented Switching Linear Dynamical System 2 (aSLDS), in keeping with the terminology in Lerner (2002). An equivalent probabilistic model is, as depicted in Figure (1), T p(v1:T , h1:T , s1:T ) = p(v1 |h1 , s1 )p(h1 |s1 )p(s1 ) ∏ p(vt |ht , st )p(ht |ht−1 , st )p(st |ht−1 , st−1 ). t=2 The notation x1:T is shorthand for x1 , . . . , xT . 
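To make the generative model concrete, here is a minimal sketch that samples a sequence from Equations (1,2) with a Markov switch. The dimensions, regime matrices, and noise scales are illustrative assumptions (with zero output and hidden biases), not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
S, H, V, T = 2, 2, 1, 50

def rotation(angle):
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])

# Two regimes: slow and fast rotation, slightly contracted for stability.
A = [0.99 * rotation(0.05), 0.99 * rotation(0.5)]
B = [rng.normal(size=(V, H)) for _ in range(S)]
Sigma_h, Sigma_v = 0.01 * np.eye(H), 0.1 * np.eye(V)
P = np.array([[0.95, 0.05], [0.05, 0.95]])    # switch transition p(s_t | s_{t-1})

s, h = rng.integers(S), rng.normal(size=H)
switches, observations = [], []
for t in range(T):
    s = rng.choice(S, p=P[s])                                       # switch dynamics
    h = A[s] @ h + rng.multivariate_normal(np.zeros(H), Sigma_h)    # Equation (2)
    v = B[s] @ h + rng.multivariate_normal(np.zeros(V), Sigma_v)    # Equation (1)
    switches.append(int(s)); observations.append(v)
```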
The distributions are parameterized as p(vt |ht , st ) = N (v(st ) + B(st )ht , Σv (st )) , ¯ ¯ p(ht |ht−1 , st ) = N h(st ) + A(st )ht−1 , Σh (st ) where p(h1 |s1 ) = N (µ(s1 ), Σ(s1 )). The aSLDS has been used, for example, in state-duration modeling in acoustics (Cemgil et al., 2006) and econometrics (Chib and Dueker, 2004). I NFERENCE The aim of this article is to address how to perform inference in both the SLDS and aSLDS. In particular we desire the so-called filtered estimate p(ht , st |v1:t ) and the smoothed estimate p(ht , st |v1:T ), for any t, 1 ≤ t ≤ T . Both exact filtered and smoothed inference in the SLDS is intractable, scaling exponentially with time (Lerner, 2002). To see this informally, consider the filtered posterior, which may be recursively computed using p(st , ht |v1:t ) = ∑ Z st−1 ht−1 p(st , ht |st−1 , ht−1 , vt )p(st−1 , ht−1 |v1:t−1 ). (3) At timestep 1, p(s1 , h1 |v1 ) = p(h1 |s1 , v1 )p(s1 |v1 ) is an indexed set of Gaussians. At time-step 2, due to the summation over the states s1 , p(s2 , h2 |v1:2 ) will be an indexed set of S Gaussians; similarly at 2. These models are closely related to Threshold Regression Models (Tong, 1990). 2516 E XPECTATION C ORRECTION time-step 3, it will be S2 and, in general, gives rise to St−1 Gaussians. More formally, in Lauritzen and Jensen (2001), a general exact method is presented for performing stable inference in such hybrid discrete models with conditional Gaussian potentials. The method requires finding a strong junction tree which, in the SLDS case, means that the discrete variables are placed in a single cluster, resulting in exponential complexity. The key issue in the (a)SLDS, therefore, is how to perform approximate inference in a numerically stable manner. Our own interest in the SLDS stems primarily from acoustic modeling, in which the time-series consists of many thousands of time-steps (Mesot and Barber, 2006; Cemgil et al., 2006). For this, we require a stable and computationally feasible approximate inference, which is also able to deal with state-spaces of high hidden dimension, H. 2. Expectation Correction Our approach to approximate p(ht , st |v1:T ) ≈ p(ht , st |v1:T ) mirrors the Rauch-Tung-Striebel (RTS) ˜ ‘correction’ smoother for the LDS (Rauch et al., 1965; Bar-Shalom and Li, 1998). Readers unfamiliar with this approach will find a short explanation in Appendix (A), which defines the important functions LDSFORWARD and LDSBACKWARD, which we shall make use of for inference in the aSLDS. Our correction approach consists of a single Forward Pass to recursively find the filtered posterior p(ht , st |v1:t ), followed by a single Backward Pass to correct this into a smoothed posterior ˜ p(ht , st |v1:T ). The Forward Pass we use is equivalent to Assumed Density Filtering (Alspach and ˜ Sorenson, 1972; Boyen and Koller, 1998; Minka, 2001). The main contribution of this paper is a novel form of Backward Pass, based on collapsing the smoothed posterior to a mixture of Gaussians. Unless stated otherwise, all quantities should be considered as approximations to their exact counterparts, and we will therefore usually omit the tildes˜throughout the article. 2.1 Forward Pass (Filtering) Readers familiar with Assumed Density Filtering (ADF) may wish to continue directly to Section (2.2). The basic idea is to represent the (intractable) posterior using a simpler distribution. 
This is then propagated forwards through time, conditioned on the new observation, and subsequently collapsed back to the tractable distribution representation—see Figure (2). Our aim is to form a recursion for p(st , ht |v1:t ), based on a Gaussian mixture approximation of p(ht |st , v1:t ). Without loss of generality, we may decompose the filtered posterior as p(ht , st |v1:t ) = p(ht |st , v1:t )p(st |v1:t ). We will first form a recursion for p(ht |st , v1:t ), and discuss the switch recursion p(st |v1:t ) later. The full procedure for computing the filtered posterior is presented in Algorithm (1). The exact representation of p(ht |st , v1:t ) is a mixture with O(St ) components. We therefore approximate this with a smaller It -component mixture It p(ht |st , v1:t ) ≈ p(ht |st , v1:t ) ≡ ˜ ˜ ˜ ∑ p(ht |it , st , v1:t ) p(it |st , v1:t ) it =1 where p(ht |it , st , v1:t ) is a Gaussian parameterized with mean3 f (it , st ) and covariance F(it , st ). The ˜ Gaussian mixture weights are given by p(it |st , v1:t ). In the above, p represent approximations to the ˜ ˜ 3. Strictly speaking, we should use the notation ft (it , st ) since, for each time t, we have a set of means indexed by it , st . This mild abuse of notation is used elsewhere in the paper. 2517 BARBER st st+1 it ht ht+1 vt+1 Figure 2: Structure of the mixture representation of the Forward Pass. Essentially, the Forward Pass defines a ‘prior’ distribution at time t which contains all the information from the variables v1:t . This prior is propagated forwards through time using the exact dynamics, conditioned on the observation, and then collapsed back to form a new prior approximation at time t + 1. corresponding exact p distributions. To find a recursion for these parameters, consider ˜ p(ht+1 |st+1 , v1:t+1 ) = ∑ p(ht+1 , st , it |st+1 , v1:t+1 ) ˜ st ,it = ∑ p(ht+1 |it , st , st+1 , v1:t+1 ) p(st , it |st+1 , v1:t+1 ) ˜ ˜ (4) st ,it where each of the factors can be recursively computed on the basis of the previous filtered results (see below). However, this recursion suffers from an exponential increase in mixture components. To deal with this, we will later collapse p(ht+1 |st+1 , v1:t+1 ) back to a smaller mixture. For the ˜ remainder, we drop the p notation, and concentrate on computing the r.h.s of Equation (4). ˜ E VALUATING p(ht+1 |st , it , st+1 , v1:t+1 ) We find p(ht+1 |st , it , st+1 , v1:t+1 ) from the joint distribution p(ht+1 , vt+1 |st , it , st+1 , v1:t ), which is a Gaussian with covariance and mean elements4 Σhh = A(st+1 )F(it , st )AT (st+1 ) + Σh (st+1 ), Σvv = B(st+1 )Σhh BT (st+1 ) + Σv (st+1 ) Σvh = B(st+1 )F(it , st ), µv = B(st+1 )A(st+1 ) f (it , st ), µh = A(st+1 ) f (it , st ). (5) These results are obtained from integrating the forward dynamics, Equations (1,2) over h t , using the results in Appendix (B). To find p(ht+1 |st , it , st+1 , v1:t+1 ) we may then condition p(ht+1 , vt+1 | st , it , st+1 , v1:t ) on vt+1 using the results in Appendix (C)—see also Algorithm (4). E VALUATING p(st , it |st+1 , v1:t+1 ) Up to a trivial normalization constant the mixture weight in Equation (4) can be found from the decomposition p(st , it |st+1 , v1:t+1 ) ∝ p(vt+1 |it , st , st+1 , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ). ¯ 4. We derive this for ht+1 , vt+1 ≡ 0, to ease notation. ¯ 2518 (6) E XPECTATION C ORRECTION Algorithm 1 aSLDS Forward Pass. Approximate the filtered posterior p(st |v1:t ) ≡ ρt , p(ht |st , v1:t ) ≡ ∑it wt (it , st )N ( ft (it , st ), Ft (it , st )). 
Also we return the approximate log-likelihood log p(v 1:T ). We ¯ require I1 = 1, I2 ≤ S, It ≤ S × It−1 . θt (s) = A(s), B(s), Σh (s), Σv (s), h(s), v(s) for t > 1. θ1 (s) = ¯ v (s), µ(s), v(s) A(s), B(s), Σ(s), Σ ¯ for s1 ← 1 to S do { f1 (1, s1 ), F1 (1, s1 ), p} = LDSFORWARD(0, 0, v1 ; θ(s1 )) ˆ ρ1 ← p(s1 ) p ˆ end for for t ← 2 to T do for st ← 1 to S do for i ← 1 to It−1 , and s ← 1 to S do {µx|y (i, s), Σx|y (i, s), p} = LDSFORWARD( ft−1 (i, s), Ft−1 (i, s), vt ; θt (st )) ˆ p∗ (st |i, s) ≡ p(st |ht−1 , st−1 = s) p(ht−1 |it−1 =i,st−1 =s,v1:t−1 ) p (st , i, s) ← wt−1 (i, s)p∗ (st |i, s)ρt−1 (s) p ˆ end for Collapse the It−1 × S mixture of Gaussians defined by µx|y ,Σx|y , and weights p(i, s|st ) ∝ p (st , i, s) to a Gaussian with It components, p(ht |st , v1:t ) ≈ I ∑itt =1 p(it |st , v1:t )p(ht |st , it , v1:t ). This defines the new means ft (it , st ), covariances Ft (it , st ) and mixture weights wt (it , st ) ≡ p(it |st , v1:t ). Compute ρt (st ) ∝ ∑i,s p (st , i, s) end for normalize ρt ≡ p(st |v1:t ) L ← L + log ∑st ,i,s p (st , i, s) end for The first factor in Equation (6), p(vt+1 |it , st , st+1 , v1:t ), is a Gaussian with mean µv and covariance Σvv , as given in Equation (5). The last two factors p(it |st , v1:t ) and p(st |v1:t ) are given from the previous iteration. Finally, p(st+1 |it , st , v1:t ) is found from p(st+1 |it , st , v1:t ) = p(st+1 |ht , st ) p(ht |it ,st ,v1:t ) (7) where · p denotes expectation with respect to p. In the standard SLDS, Equation (7) is replaced by the Markov transition p(st+1 |st ). In the aSLDS, however, Equation (7) will generally need to be computed numerically. A simple approximation is to evaluate Equation (7) at the mean value of the distribution p(ht |it , st , v1:t ). To take covariance information into account an alternative would be to draw samples from the Gaussian p(ht |it , st , v1:t ) and thus approximate the average of p(st+1 |ht , st ) by sampling.5 C LOSING THE R ECURSION We are now in a position to calculate Equation (4). For each setting of the variable st+1 , we have a mixture of It × S Gaussians. In order to avoid an exponential explosion in the number of mixture 5. Whilst we suggest sampling as part of the aSLDS update procedure, this does not render the Forward Pass as a form of sequential sampling procedure, such as Particle Filtering. The sampling here is a form of exact sampling, for which no convergence issues arise, being used only to numerically evaluate Equation (7). 2519 BARBER components, we numerically collapse this back to It+1 Gaussians to form It+1 p(ht+1 |st+1 , v1:t+1 ) ≈ ∑ it+1 =1 p(ht+1 |it+1 , st+1 , v1:t+1 )p(it+1 |st+1 , v1:t+1 ). Hence the Gaussian components and corresponding mixture weights p(it+1 |st+1 , v1:t+1 ) are defined implicitly through a numerical (Gaussian-Mixture to smaller Gaussian-Mixture) collapse procedure, for which any method of choice may be supplied. A straightforward approach that we use in our code is based on repeatedly merging low-weight components, as explained in Appendix (D). A R ECURSION FOR THE S WITCH VARIABLES A recursion for the switch variables can be found by considering p(st+1 |v1:t+1 ) ∝ ∑ p(it , st , st+1 , vt+1 , v1:t ). it ,st The r.h.s. of the above equation is proportional to ∑ p(vt+1 |it , st , st+1 , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ) st ,it where all terms have been computed during the recursion for p(ht+1 |st+1 , v1:t+1 ). 
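The collapse operation referred to above is left open to any method of choice; the simplest option, moment matching the weighted mixture down to a single Gaussian (the It+1 = 1 case), might be sketched as follows.

```python
import numpy as np

def collapse_to_single_gaussian(weights, means, covs):
    """Moment-match a mixture sum_i w_i N(mu_i, C_i) to one Gaussian.

    weights: (K,), means: (K, H), covs: (K, H, H).
    """
    w = np.asarray(weights) / np.sum(weights)
    mu = np.einsum('k,kh->h', w, means)                  # mixture mean
    # Mixture covariance: weighted covariances plus the spread of the means.
    diff = means - mu
    cov = np.einsum('k,khj->hj', w, covs) \
        + np.einsum('k,kh,kj->hj', w, diff, diff)
    return mu, cov

# Example: collapse a two-component mixture in H = 2 dimensions.
mu, cov = collapse_to_single_gaussian(
    weights=[0.3, 0.7],
    means=np.array([[0.0, 0.0], [2.0, 1.0]]),
    covs=np.stack([np.eye(2), 0.5 * np.eye(2)]))
```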
T HE L IKELIHOOD p(v1:T ) The likelihood p(v1:T ) may be found by recursing p(v1:t+1 ) = p(vt+1 |v1:t )p(v1:t ), where p(vt+1 |v1:t ) = ∑ it ,st ,st+1 p(vt+1 |it , st , st+1 , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ). In the above expression, all terms have been computed in forming the recursion for the filtered posterior p(ht+1 , st+1 |v1:t+1 ). 2.2 Backward Pass (Smoothing) The main contribution of this paper is to find a suitable way to ‘correct’ the filtered posterior p(st , ht |v1:t ) obtained from the Forward Pass into a smoothed posterior p(st , ht |v1:T ). We initially derive this for the case of a single Gaussian representation—the extension to the mixture case is straightforward and given in Section (2.3). Our derivation holds for both the SLDS and aSLDS. We approximate the smoothed posterior p(ht |st , v1:T ) by a Gaussian with mean g(st ) and covariance G(st ), and our aim is to find a recursion for these parameters. A useful starting point is the exact relation: p(ht , st |v1:T ) = ∑ p(st+1 |v1:T )p(ht |st , st+1 , v1:T )p(st |st+1 , v1:T ). st+1 2520 E XPECTATION C ORRECTION The term p(ht |st , st+1 , v1:T ) may be computed as p(ht |st , st+1 , v1:T ) = Z p(ht , ht+1 |st , st+1 , v1:T ) = Z p(ht |ht+1 , st , st+1 , v1:T )p(ht+1 |st , st+1 , v1:T ) Z p(ht |ht+1 , st , st+1 , v1:t )p(ht+1 |st , st+1 , v1:T ) ht+1 ht+1 = ht+1 (8) which is in the form of a recursion. This recursion therefore requires p(ht+1 |st , st+1 , v1:T ), which we can write as p(ht+1 |st , st+1 , v1:T ) ∝ p(ht+1 |st+1 , v1:T )p(st |st+1 , ht+1 , v1:t ). (9) The above recursions represent the exact computation of the smoothed posterior. In our approximate treatment, we replace all quantities p with their corresponding approximations p. A difficulty ˜ is that the functional form of p(st |st+1 , ht+1 , v1:t ) in the approximation of Equation (9) is not squared ˜ exponential in ht+1 , so that p(ht+1 |st , st+1 , v1:T ) will not be a mixture of Gaussians.6 One possibil˜ ity would be to approximate the non-Gaussian p(ht+1 |st , st+1 , v1:T ) (dropping the p notation) by a ˜ Gaussian (mixture) by minimizing the Kullback-Leilbler divergence between the two, or performing moment matching in the case of a single Gaussian. A simpler alternative is to make the assumption p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ), see Figure (3). This is a considerable simplification since p(ht+1 |st+1 , v1:T ) is already known from the previous backward recursion. Under this assumption, the recursion becomes p(ht , st |v1:T ) ≈ ∑ p(st+1 |v1:T )p(st |st+1 , v1:T ) st+1 p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) . (10) We call the procedure based on Equation (10) Expectation Correction (EC) since it ‘corrects’ the filtered results which themselves are formed from propagating expectations. In Appendix (E) we show how EC is equivalent to a partial Discrete-Continuous factorized approximation. Equation (10) forms the basis of the the EC Backward Pass. However, similar to the ADF Forward Pass, the number of mixture components needed to represent the posterior in this recursion grows exponentially as we go backwards in time. The strategy we take to deal with this is a form of Assumed Density Smoothing, in which Equation (10) is interpreted as a propagated dynamics reversal, which will subsequently be collapsed back to an assumed family of distributions—see Figure (4). How we implement the recursion for the continuous and discrete factors is detailed below.7 6. 
In the exact calculation, p(ht+1 |st , st+1 , v1:T ) is a mixture of Gaussians since p(st |st+1 , ht+1 , v1:t ) = p(st , st+1 , ht+1 , v1:T )/p(st+1 , ht+1 , v1:T ) so that the mixture of Gaussians denominator p(st+1 , ht+1 , v1:T ) cancels with the first term in Equation (9), leaving a mixture of Gaussians. However, since in Equation (9) the two terms p(ht+1 |st+1 , v1:T ) and p(st |st+1 , ht+1 , v1:t ) are replaced by approximations, this cancellation is not guaranteed. 7. Equation (10) has the pleasing form of an RTS Backward Pass for the continuous part (analogous to LDS case), and a discrete smoother (analogous to a smoother recursion for the HMM). In the Forward-Backward algorithm for the HMM (Rabiner, 1989), the posterior γt ≡ p(st |v1:T ) is formed from the product of αt ≡ p(st |v1:t ) and βt ≡ p(vt+1:T |st ). This approach is also analogous to EP (Heskes and Zoeter, 2002). In the correction approach, a direct recursion for γt in terms of γt+1 and αt is formed, without explicitly defining βt . The two approaches to inference are known as α − β and α − γ recursions. 2521 BARBER st−1 st st+1 st+2 ht−1 ht ht+1 ht+2 vt−1 vt vt+1 vt+2 Figure 3: Our Backward Pass approximates p(ht+1 |st+1 , st , v1:T ) by p(ht+1 |st+1 , v1:T ). Motivation for this is that st only influences ht+1 through ht . However, ht will most likely be heavily influenced by v1:t , so that not knowing the state of st is likely to be of secondary importance. The darker shaded node is the variable we wish to find the posterior state of. The lighter shaded nodes are variables in known states, and the hashed node a variable whose state is indeed known but assumed unknown for the approximation. E VALUATING p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) is a Gaussian in ht , whose statistics we will now compute. First we find p(ht |ht+1 , st , st+1 , v1:t ) which may be obtained from the joint distribution p(ht , ht+1 |st , st+1 , v1:t ) = p(ht+1 |ht , st+1 )p(ht |st , v1:t ) (11) which itself can be found using the forward dynamics from the filtered estimate p(ht |st , v1:t ). The statistics for the marginal p(ht |st , st+1 , v1:t ) are simply those of p(ht |st , v1:t ), since st+1 carries no extra information about ht .8 The remaining statistics are the mean of ht+1 , the covariance of ht+1 and cross-variance between ht and ht+1 , ht+1 = A(st+1 ) ft (st ) Σt+1,t+1 = A(st+1 )Ft (st )AT (st+1 ) + Σh (st+1 ), Σt+1,t = A(st+1 )Ft (st ). Given the statistics of Equation (11), we may now condition on ht+1 to find p(ht |ht+1 , st , st+1 , v1:t ). Doing so effectively constitutes a reversal of the dynamics, ← − ← − ht = A (st , st+1 )ht+1 + η (st , st+1 ) ← − ← − ← − − where A (st , st+1 ) and η (st , st+1 ) ∼ N (←(st , st+1 ), Σ (st , st+1 )) are easily found using the condim tioned Gaussian results in Appendix (C)—see also Algorithm (5). Averaging the reversed dynamics we obtain a Gaussian in ht for p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) with statistics ← − ← − ← − ← − − µt = A (st , st+1 )g(st+1 ) + ←(st , st+1 ), Σt,t = A (st , st+1 )G(st+1 ) A T (st , st+1 ) + Σ (st , st+1 ). m These equations directly mirror the RTS Backward Pass, see Algorithm (5). 8. Integrating over ht+1 means that the information from st+1 passing through ht+1 via the term p(ht+1 |st+1 , ht ) vanishes. Also, since st is known, no information from st+1 passes through st to ht . 
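A sketch of this dynamics reversal and averaging step, for a fixed pair (st, st+1) and zero hidden bias, is given below: it forms the joint statistics of (ht, ht+1) from the filtered estimate, conditions on ht+1 to obtain the reversed dynamics, and then averages over the smoothed distribution of ht+1.

```python
import numpy as np

def backward_correct(f, F, A, Sigma_h, g_next, G_next):
    """One smoothing correction for fixed (s_t, s_{t+1}), zero hidden bias.

    f, F:           filtered mean/covariance of h_t.
    A, Sigma_h:     dynamics matrix and noise covariance for regime s_{t+1}.
    g_next, G_next: smoothed mean/covariance of h_{t+1}.
    Returns the smoothed mean/covariance of h_t.
    """
    # Joint statistics of (h_t, h_{t+1}) under the filtered estimate.
    S_tt1 = F @ A.T                          # Cov(h_t, h_{t+1})
    S_t1t1 = A @ F @ A.T + Sigma_h           # Cov(h_{t+1})
    mu_next = A @ f
    # Condition on h_{t+1}: reversed dynamics h_t = A_rev h_{t+1} + eta.
    A_rev = np.linalg.solve(S_t1t1, S_tt1.T).T
    m_rev = f - A_rev @ mu_next
    S_rev = F - A_rev @ S_tt1.T
    # Average the reversal over the smoothed p(h_{t+1}).
    g = A_rev @ g_next + m_rev
    G = A_rev @ G_next @ A_rev.T + S_rev
    return g, G
```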
2522 E XPECTATION C ORRECTION st st+1 it jt+1 ht ht+1 vt vt+1 Figure 4: Structure of the Backward Pass for mixtures. Given the smoothed information at timestep t + 1, we need to work backwards to ‘correct’ the filtered estimate at time t. E VALUATING p(st |st+1 , v1:T ) The main departure of EC from previous methods is in treating the term p(st |st+1 , v1:T ) = p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) . (12) The term p(st |ht+1 , st+1 , v1:t ) is given by p(st |ht+1 , st+1 , v1:t ) = p(ht+1 |st , st+1 , v1:t )p(st , st+1 |v1:t ) . ∑st p(ht+1 |st , st+1 , v1:t )p(st , st+1 |v1:t ) (13) Here p(st , st+1 |v1:t ) = p(st+1 |st , v1:t )p(st |v1:t ), where p(st+1 |st , v1:t ) occurs in the Forward Pass, Equation (7). In Equation (13), p(ht+1 |st+1 , st , v1:t ) is found by marginalizing Equation (11). Performing the average over p(ht+1 |st+1 , v1:T ) in Equation (12) may be achieved by any numerical integration method desired. Below we outline a crude approximation that is fast and often performs surprisingly well. M EAN A PPROXIMATION A simple approximation of Equation (12) is to evaluate the integrand at the mean value of the averaging distribution. Replacing ht+1 in Equation (13) by its mean gives the simple approximation 1 T −1 1 e− 2 zt+1 (st ,st+1 )Σ (st ,st+1 |v1:t )zt+1 (st ,st+1 ) p(st |st+1 , v1:t ) p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) ≈ Z det Σ(st , st+1 |v1:t ) where zt+1 (st , st+1 ) ≡ ht+1 |st+1 , v1:T − ht+1 |st , st+1 , v1:t and Z ensures normalization over st . This result comes simply from the fact that in Equation (12) we have a Gaussian with a mean ht+1 |st , st+1 , v1:t and covariance Σ(st , st+1 |v1:t ), being the filtered covariance of ht+1 given st , st+1 and the observations v1:t , which may be taken from Σhh in Equation (5). Then evaluating this Gaussian at the specific point ht+1 |st+1 , v1:T , we arrive at the above expression. An alternative to this simple mean approximation is to sample from the Gaussian p(ht+1 |st+1 , v1:T ), which has the potential advantage that covariance information is used. 9 Other methods such as variational 9. This is a form of exact sampling since drawing samples from a Gaussian is easy. This should not be confused with meaning that this use of sampling renders EC a sequential Monte-Carlo sampling scheme. 2523 BARBER Algorithm 2 aSLDS: EC Backward Pass (Single Gaussian case I = J = 1). Approximates p(st |v1:T ) and p(ht |st , v1:T ) ≡ N (gt (st ), Gt (st )). This routine needs the results from Algorithm (1) for I = 1. GT ← FT , gT ← fT , for t ← T − 1 to 1 do for s ← 1 to S, s ← 1 to S do, (µ, Σ)(s, s ) = LDSBACKWARD(gt+1 (s ), Gt+1 (s ), ft (s), Ft (s), θt+1 (s )) p(s|s ) = p(st = s|ht+1 , st+1 = s , v1:t ) p(ht+1 |st+1 =s ,v1:T ) p(s, s |v1:T ) ← p(st+1 = s |v1:T )p(s|s ) end for for st ← 1 to S do Collapse the mixture defined by weights p(st+1 = s |st , v1:T ) ∝ p(st , s |v1:T ), means µ(st , s ) and covariances Σ(st , s ) to a single Gaussian. This defines the new means gt (st ), covariances Gt (st ). p(st |v1:T ) ← ∑s p(st , s |v1:T ) end for end for approximations to this average (Jaakkola and Jordan, 1996) or the unscented transform (Julier and Uhlmann, 1997) may be employed if desired. C LOSING THE R ECURSION We have now computed both the continuous and discrete factors in Equation (10), which we wish to use to write the smoothed estimate in the form p(ht , st |v1:T ) = p(st |v1:T )p(ht |st , v1:T ). 
The distribution p(ht |st , v1:T ) is readily obtained from the joint Equation (10) by conditioning on st to form the mixture p(ht |st , v1:T ) = ∑ p(st+1 |st , v1:T )p(ht |st , st+1 , v1:T ) st+1 which may be collapsed to a single Gaussian (or mixture if desired). As in the Forward Pass, this collapse implicitly defines the Gaussian mean g(st ) and covariance G(st ). The smoothed posterior p(st |v1:T ) is given by p(st |v1:T ) = = ∑ p(st+1 |v1:T )p(st |st+1 , v1:T ) st+1 ∑ p(st+1 |v1:T ) st+1 p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) . (14) The algorithm for the single Gaussian case is presented in Algorithm (2). N UMERICAL S TABILITY Numerical stability is a concern even in the LDS, and the same is to be expected for the aSLDS. Since the LDS recursions LDSFORWARD and LDSBACKWARD are embedded within the EC algorithm, we may immediately take advantage of the large body of work on stabilizing the LDS recursions, such as the Joseph form (Grewal and Andrews, 1993), or the square root forms (Park and Kailath, 1996; Verhaegen and Van Dooren, 1986). 2524 E XPECTATION C ORRECTION R ELAXING EC The conditional independence assumption p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ) is not strictly necessary in EC. We motivate it by computational simplicity, since finding an appropriate moment matching approximation of p(ht+1 |st , st+1 , v1:T ) in Equation (9) requires a relatively expensive nonGaussian integration. If we therefore did treat p(ht+1 |st , st+1 , v1:T ) more correctly, the central assumption in this relaxed version of EC would be a collapse to a mixture of Gaussians (the additional computation of Equation (12) may usually be numerically evaluated to high precision). Whilst we did not do so, implementing this should not give rise to numerical instabilities since no potential divisions are required, merely the estimation of moments. In the experiments presented here, we did not pursue this option, since we believe that the effect of this conditional independence assumption is relatively weak. I NCONSISTENCIES IN THE APPROXIMATION The recursion Equation (8), upon which EC depends, makes use of the Forward Pass results, and a subtle issue arises about possible inconsistencies in the Forward and Backward approximations. For example, under the conditional independence assumption in the Backward Pass, p(hT |sT −1 , sT , v1:T ) ≈ p(hT |sT , v1:T ), which is in contradiction to Equation (5) which states that the approximation to p(hT |sT −1 , sT , v1:T ) will depend on sT −1 . Similar contradictions occur also for the relaxed version of EC. Such potential inconsistencies arise because of the approximations made, and should not be considered as separate approximations in themselves. Furthermore, these inconsistencies will most likely be strongest at the end of the chain, t ≈ T , since only then is Equation (8) in direct contradiction to Equation (5). Such potential inconsistencies arise since EC is not founded on a consistency criterion, unlike EP—see Section (3)—but rather an approximation of the exact recursions. Our experience is that compared to EP, which attempts to ensure consistency based on multiple sweeps through the graph, such inconsistencies are a small price to pay compared to the numerical stability advantages of EC. 2.3 Using Mixtures in the Backward Pass The extension to the mixture case is straightforward, based on the representation Jt p(ht |st , v1:T ) ≈ ∑ p(ht |st , jt , v1:T )p( jt |st , v1:T ). 
jt =1 Analogously to the case with a single component, p(ht , st |v1:T ) = ∑ it , jt+1 ,st+1 p(st+1 |v1:T )p( jt+1 |st+1 , v1:T )p(ht | jt+1 , st+1 , it , st , v1:T ) · p(it , st |ht+1 , jt+1 , st+1 , v1:t ) p(ht+1 | jt+1 ,st+1 ,v1:T ) . The average in the last line of the above equation can be tackled using the same techniques as outlined in the single Gaussian case. To approximate p(ht | jt+1 , st+1 , it , st , v1:T ) we consider this as the marginal of the joint distribution p(ht , ht+1 |it , st , jt+1 , st+1 , v1:T ) = p(ht |ht+1 , it , st , jt+1 , st+1 , v1:t )p(ht+1 |it , st , jt+1 , st+1 , v1:T ). 2525 BARBER Algorithm 3 aSLDS: EC Backward Pass. Approximates p(st |v1:T ) and p(ht |st , v1:T ) ≡ Jt ut ( jt , st )N (gt ( jt , st ), Gt ( jt , st )) using a mixture of Gaussians. JT = IT , Jt ≤ S × It × Jt+1 . This ∑ jt =1 routine needs the results from Algorithm (1). GT ← FT , gT ← fT , uT ← wT (*) for t ← T − 1 to 1 do for s ← 1 to S, s ← 1 to S, i ← 1 to It , j ← 1 to Jt+1 do (µ, Σ)(i, s, j , s ) = LDSBACKWARD(gt+1 ( j , s ), Gt+1 ( j , s ), ft (i, s), Ft (i, s), θt+1 (s )) p(i, s| j , s ) = p(st = s, it = i|ht+1 , st+1 = s , jt+1 = j , v1:t ) p(ht+1 |st+1 =s , jt+1 = j ,v1:T ) p(i, s, j , s |v1:T ) ← p(st+1 = s |v1:T )ut+1 ( j , s )p(i, s| j , s ) end for for st ← 1 to S do Collapse the mixture defined by weights p(it = i, st+1 = s , jt+1 = j |st , v1:T ) ∝ p(i, st , j , s |v1:T ), means µ(it , st , j , s ) and covariances Σ(it , st , j , s ) to a mixture with Jt components. This defines the new means gt ( jt , st ), covariances Gt ( jt , st ) and mixture weights ut ( jt , st ). p(st |v1:T ) ← ∑it , j ,s p(it , st , j , s |v1:T ) end for end for (*) If JT < IT then the initialization is formed by collapsing the Forward Pass results at time T to JT components. As in the case of a single mixture, the problematic term is p(ht+1 |it , st , jt+1 , st+1 , v1:T ). Analogously to before, we may make the assumption p(ht+1 |it , st , jt+1 , st+1 , v1:T ) ≈ p(ht+1 | jt+1 , st+1 , v1:T ) meaning that information about the current switch state st , it is ignored.10 We can then form p(ht |st , v1:T ) = ∑ it , jt+1 ,st+1 p(it , jt+1 , st+1 |st , v1:T )p(ht |it , st , jt+1 , st+1 , v1:T ). This mixture can then be collapsed to smaller mixture using any method of choice, to give Jt p(ht |st , v1:T ) ≈ ∑ p(ht | jt , st , v1:T )p( jt |st , v1:T ) jt =1 The collapse procedure implicitly defines the means g( jt , st ) and covariances G( jt , st ) of the smoothed approximation. A recursion for the switches follows analogously to the single component Backward Pass. The resulting algorithm is presented in Algorithm (3), which includes using mixtures in both Forward and Backward Passes. Note that if JT < IT , an extra initial collapse is required of the IT component Forward Pass Gaussian mixture at time T to JT components. EC has time complexity O(S2 IJK) where S are the number of switch states, I and J are the number of Gaussians used in the Forward and Backward passes, and K is the time to compute the exact Kalman smoother for the system with a single switch state. 10. As in the single component case, in principle, this assumption may be relaxed and a moment matching approximation be performed instead. 2526 E XPECTATION C ORRECTION 3. Relation to Other Methods Approximate inference in the SLDS is a long-standing research topic, generating an extensive literature. See Lerner (2002) and Zoeter (2005) for reviews of previous work. A brief summary of some of the major existing approaches follows. 
Assumed Density Filtering Since the exact filtered estimate p(ht |st , v1:t ) is an (exponentially large) mixture of Gaussians, a useful remedy is to project at each stage of the recursion Equation (3) back to a limited set of K Gaussians. This is a Gaussian Sum Approximation (Alspach and Sorenson, 1972), and is a form of Assumed Density Filtering (ADF) (Minka, 2001). Similarly, Generalized Pseudo Bayes2 (GPB2) (Bar-Shalom and Li, 1998) also performs filtering by collapsing to a mixture of Gaussians. This approach to filtering is also taken in Lerner et al. (2000) which performs the collapse by removing spatially similar Gaussians, thereby retaining diversity. Several smoothing approaches directly use the results from ADF. The most popular is Kim’s method, which updates the filtered posterior weights to form the smoother (Kim, 1994; Kim and Nelson, 1999). In both EC and Kim’s method, the approximation p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ), is used to form a numerically simple Backward Pass. The other approximation in EC is to numerically compute the average in Equation (14). In Kim’s method, however, an update for the discrete variables is formed by replacing the required term in Equation (14) by p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) ≈ p(st |st+1 , v1:t ). (15) This approximation11 decouples the discrete Backward Pass in Kim’s method from the continuous dynamics, since p(st |st+1 , v1:t ) ∝ p(st+1 |st )p(st |v1:t )/p(st+1 |v1:t ) can be computed simply from the filtered results alone (the continuous Backward Pass in Kim’s method, however, does depend on the discrete Backward Pass). The fundamental difference between EC and Kim’s method is that the approximation (15) is not required by EC. The EC Backward Pass therefore makes fuller use of the future information, resulting in a recursion which intimately couples the continuous and discrete variables. The resulting effect on the quality of the approximation can be profound, as we will see in the experiments. Kim’s smoother corresponds to a potentially severe loss of future information and, in general, cannot be expected to improve much on the filtered results from ADF. The more recent work of Lerner et al. (2000) is similar in spirit to Kim’s method, whereby the contribution from the continuous variables is ignored in forming an approximate recursion for the smoothed p(st |v1:T ). The main difference is that for the discrete variables, Kim’s method is based on a correction smoother (Rauch et al., 1965), whereas Lerner’s method uses a Belief Propagation style Backward Pass (Jordan, 1998). Neither method correctly integrates information from the continuous variables. How to form a recursion for a mixture approximation which does not ignore information coming through the continuous hidden variables is a central contribution of our work. Kitagawa (1994) used a two-filter method in which the dynamics of the chain are reversed. Essentially, this corresponds to a Belief Propagation method which defines a Gaussian sum 11. In the HMM this is exact, but in the SLDS the future observations carry information about st . 2527 BARBER EC Mixture Collapsing to Single Mixture Collapsing to Mixture Cond. Indep. p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ) Approx. of p(st |st+1 , v1:T ), average Equation (12) Kim’s Backward Pass Mixture approx. of p(ht+1 |st , st+1 , v1:T ), Equation (9) Relaxed EC x x x x EP x Kim x x x x x Table 1: Relation between methods. 
In the EC methods, the mean approximation may be replaced by an essentially exact Monte Carlo approximation to Equation (12). EP refers to the Single Gaussian approximation in Heskes and Zoeter (2002). In the case of using Relaxed EC with collapse to a single Gaussian, EC and EP are not equivalent, since the underlying recursions on which the two methods are based are fundamentally different. approximation for p(vt+1:T |ht , st ). However, since this is not a density in ht , st , but rather a conditional likelihood, formally one cannot treat this using density propagation methods. In Kitagawa (1994), the singularities resulting from incorrectly treating p(vt+1:T |ht , st ) as a density are heuristically finessed. Expectation Propagation EP (Minka, 2001), as applied to the SLDS, corresponds to an approximate implementation of Belief Propagation12 (Jordan, 1998; Heskes and Zoeter, 2002). EP is the most sophisticated rival to Kim’s method and EC, since it makes the least assumptions. For this reason, we’ll explain briefly how EP works. Unlike EC, which is based on an approximation of the exact filtering and smoothing recursions, EP is based on a consistency criterion. First, let’s simplify the notation, and write the distribution as p = ∏t φ (xt−1 , vt−1 , xt , vt ), where xt ≡ ht ⊗ st , and φ (xt−1 , vt−1 , xt , vt ) ≡ p(xt |xt−1 )p(vt |xt ). EP defines ‘messages’ ρ, λ13 which contain information from past and future observations respectively. 14 Explicitly, we define ρt (xt ) ∝ p(xt |v1:t ) to represent knowledge about xt given all information from time 1 to t. Similarly, λt (xt ) represents knowledge about state xt given all observations from time T to time t + 1. In the sequel, we drop the time suffix for notational clarity. We define λ(xt ) implicitly through the requirement that the marginal smoothed inference is given by p(xt |v1:T ) ∝ ρ (xt ) λ (xt ) . (16) Hence λ (xt ) ∝ p(vt+1:T |xt , v1:t ) = p(vt+1:T |xt ) and represents all future knowledge about p(xt |v1:T ). From this p(xt−1 , xt |v1:T ) ∝ ρ (xt−1 ) φ (xt−1 , vt−1 , xt , vt ) λ (xt ) . (17) 12. Non-parametric belief propagation (Sudderth et al., 2003), which performs approximate inference in general continuous distributions, is also related to EP applied to the aSLDS, in the sense that the messages cannot be represented easily, and are approximated by mixtures of Gaussians. 13. These correspond to the α and β messages in the Hidden Markov Model framework (Rabiner, 1989). 14. In this Belief Propagation/EP viewpoint, the backward messages, traditionally labeled as β, correspond to conditional likelihoods, and not distributions. In contrast, in the EC approach, which is effectively a so-called α − γ recursion, the backward γ messages correspond to posterior distributions. 2528 E XPECTATION C ORRECTION Taking the above equation as a starting point, we have p(xt |v1:T ) ∝ Z xt−1 ρ (xt−1 ) φ (xt−1 , vt−1 , xt , vt ) λ (xt ) . Consistency with Equation (16) requires (neglecting irrelevant scalings) ρ (xt ) λ (xt ) ∝ Z xt−1 ρ (xt−1 ) φ (xt−1 , vt−1 , xt , vt ) λ (xt ) . Similarly, we can integrate Equation (17) over xt to get the marginal at time xt−1 which, by consistency, should be proportional to ρ (xt−1 ) λ (xt−1 ). Hence ρ (xt ) ∝ xt−1 ρ (xt−1 ) φ (xt−1 , xt ) λ (xt ) R λ (xt ) , λ (xt−1 ) ∝ R xt ρ (xt−1 ) φ (xt−1 , xt ) λ (xt ) ρ (xt−1 ) (18) where the divisions can be interpreted as preventing over-counting of messages. In an exact implementation, the common factors in the numerator and denominator cancel. 
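The division of Gaussian potentials underlying these updates is mechanical in the canonical parameterization, where it amounts to subtracting natural parameters; the one-dimensional sketch below also shows why near-equal precisions make the operation numerically fragile.

```python
import numpy as np

def to_canonical(mu, var):
    lam = 1.0 / var               # precision
    return lam * mu, lam          # natural parameters (eta, lambda)

def from_canonical(eta, lam):
    return eta / lam, 1.0 / lam

def divide_gaussians(mu1, var1, mu2, var2):
    """N(mu1, var1) / N(mu2, var2) as an (unnormalized) Gaussian potential."""
    eta1, lam1 = to_canonical(mu1, var1)
    eta2, lam2 = to_canonical(mu2, var2)
    eta, lam = eta1 - eta2, lam1 - lam2
    if lam <= 0:                  # division no longer yields a proper Gaussian
        raise ValueError("resulting precision is not positive")
    return from_canonical(eta, lam)

print(divide_gaussians(1.0, 0.5, 0.8, 2.0))    # well separated precisions: fine
# divide_gaussians(1.0, 1.0, 0.8, 1.0001)      # near-equal precisions: fragile
```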
EP addresses the fact that λ(xt ) is not a distribution by using Equation (18) R form the projection (or to R ‘collapse’). In the numerator, xt−1 ρ (xt−1 ) φ (xt−1 , xt ) λ (xt ) and xt ρ (xt−1 ) φ (xt−1 , xt ) λ (xt ) represent p(xt |v1:T ) and p(xt−1 |v1:T ). Since these are distributions (an indexed mixture of Gaussians in the SLDS), they may be projected/collapsed to a single indexed Gaussian. The update for the ρ message is then found from division by the λ potential, and vice versa. In EP the explicit division of potentials only makes sense for members of the exponential family. More complex methods could be envisaged in which, rather than an explicit division, the new messages are defined by minimizing some measure of divergence between R ρ(xt )λ(xt ) and xt−1 ρ (xt−1 ) φ (xt−1 , xt ) λ (xt ), such as the Kullback-Leibler divergence. In this way, non-exponential family approximations (such as mixtures of Gaussians) may be considered. Whilst this is certainly feasible, it is somewhat unattractive computationally since this would require for each time-step an expensive minimization. For the single Gaussian case, in order to perform the division, the potentials in the numerator and denominator are converted to their canonical representations. To form the ρ update, the result of the division is then reconverted back to a moment representation. The resulting recursions, due to the approximation, are no longer independent and Heskes and Zoeter (2002) show that using more than a single Forward and Backward sweep often improves on the quality of the approximation. This coupling is a departure from the exact recursions, which should remain independent. Applied to the SLDS, EP suffers from severe numerical instabilities (Heskes and Zoeter, 2002) and finding a way to minimize the corresponding EP free energy in an efficient, robust and guaranteed way remains an open problem. Our experience is that current implementations of EP are unsuitable for large scale time-series applications. Damping the parameter updates is one suggested approach to heuristically improve convergence. The source of these numerical instabilities is not well understood since, even in cases when the posterior appears uni-modal, the method is problematic. The frequent conversions between moment and canonical parameterizations of Gaussians are most likely at the root of the difficulties. An interesting comparison here is between Lauritzen’s original method for exact computation on conditional Gaussian distributions (for which the SLDS is a special case) Lauritzen (1992), 2529 BARBER which is numerically unstable due to conversion between moment and canonical representations, and Lauritzen and Jensen (2001), which improves stability by avoiding using canonical parameterizations. Variational Methods Ghahramani and Hinton (1998) used a variational method which approximates the joint distribution p(h1:T , s1:T |v1:T ) rather than the marginal p(ht , st |v1:T )—related work is presented in Lee et al. (2004). This is a disadvantage when compared to other methods that directly approximate the marginal. The variational methods are nevertheless potentially attractive since they are able to exploit structural properties of the distribution, such as a factored discrete state-transition. In this article, we concentrate on the case of a small number of states S and hence will not consider variational methods further here. 
15 Sequential Monte Carlo (Particle Filtering) These methods form an approximate implementation of Equation (3), using a sum of delta functions to represent the posterior—see, for example, Doucet et al. (2001). Whilst potentially powerful, these non-analytic methods typically suffer in high-dimensional hidden spaces since they are often based on naive importance sampling, which restricts their practical use. ADF is generally preferential to Particle Filtering, since in ADF the approximation is a mixture of non-trivial distributions, which is better at capturing the variability of the posterior. Rao-Blackwellized Particle Filters (Doucet et al., 2000) are an attempt to alleviate the difficulty of sampling in high-dimensional state spaces by explicitly integrating over the continuous state. Non-Sequential Monte Carlo For fixed switches s1:T , p(v1:T |s1:T ) is easily computable since this is just the likelihood of an LDS. This observation raises the possibility of sampling from the posterior p(s 1:T |v1:T ) ∝ p(v1:T |s1:T )p(s1:T ) directly. Many possible sampling methods could be applied in this case, and the most immediate is Gibbs sampling, in which a sample for each t is drawn from p(st |s\t , v1:T )—see Neal (1993) for a general reference and Carter and Kohn (1996) for an application to the SLDS. This procedure may work well in practice provided that the initial setting of s1:T is in a region of high probability mass—otherwise, sampling by such individual coordinate updates may be extremely inefficient. 4. Experiments Our experiments examine the stability and accuracy of EC against several other methods on long time-series. In addition, we will compare the absolute accuracy of EC as a function of the number of mixture components on a short time-series, where exact inference may be explicitly evaluated. Testing EC in a problem with a reasonably long temporal sequence, T , is important since numerical stabilities may not be apparent in time-series of just a few time-steps. To do this, we sequentially generate hidden states ht , st and observations vt from a given model. Then, given only the parameters of the model and the observations (but not any of the hidden states), the task is to infer p(ht |st , v1:T ) and p(st |v1:T ). Since the exact computation is exponential in T , a formally exact evaluation of the method is infeasible. A simple alternative is to assume that the original sample states s1:T are the ‘correct’ inferred states, and compare our most probable posterior smoothed 15. Lerner (2002) discusses an approach in the case of a large structured discrete state transition. Related ideas could also be used in EC. 2530 E XPECTATION C ORRECTION 80 150 60 100 40 50 20 0 0 −50 −20 −100 −40 −150 −60 −80 0 10 20 30 40 50 60 70 80 90 −200 100 0 10 20 (a) Easy problem 30 40 50 60 70 80 90 100 (b) Hard problem Figure 5: SLDS: Throughout, S = 2, V = 1 (scalar observations), T = 100, with zero output bias. ¯ A(s) = 0.9999 ∗ orth(randn(H, H)), B(s) = randn(V, H), vt ≡ 0, h1 = 10 ∗ randn(H, 1), ¯ ¯ t>1 = 0, Σh = IH , p1 = uniform. The figures show typical examples for each of the two h 1 problems: (a) Easy problem. H = 3, Σh (s) = IH , Σv (s) = 0.1IV , p(st+1 |st ) ∝ 1S×S + IS . (b) Hard problem. H = 30, Σv (s) = 30IV ,Σh (s) = 0.01IH , p(st+1 |st ) ∝ 1S×S . 
estimates arg maxst p(st |v1:T ) with the assumed correct sample st.16 We look at two sets of experiments, one for the SLDS and one for the aSLDS. In both cases, scalar observations are used so that the complexity of the inference problem can be visually assessed.

Figure 6: SLDS ‘Easy’ problem: the number of errors in estimating a binary switch p(st |v1:T ) over a time series of length T = 100, so that 50 errors corresponds to random guessing. Plotted are histograms of the errors over 1000 experiments, cut off at 20 errors to improve visualization; one panel per method. (PF) Particle Filter. (RBPF) Rao-Blackwellized PF. (EP) Expectation Propagation. (ADFS) Assumed Density Filtering using a Single Gaussian. (KimS) Kim’s smoother using the results from ADFS. (ECS) Expectation Correction using a Single Gaussian (I = J = 1). (ADFM) ADF using a mixture of I = 4 Gaussians. (KimM) Kim’s smoother using the results from ADFM. (ECM) Expectation Correction using a mixture with I = J = 4 components. In Gibbs sampling, we use the initialization from ADFM.

16. We could also consider performance measures on the accuracy of p(ht |st , v1:T ). However, we prefer to look at approximating arg maxst p(st |v1:T ) since the sampled discrete states are likely to correspond to the exact arg maxst p(st |v1:T ). In addition, if the posterior switch distribution is dominated by a single state s∗1:T then, provided it is correctly estimated, the model reduces to an LDS, for which inference of the continuous hidden state is trivial.

Figure 7: SLDS ‘Hard’ problem: the number of errors in estimating a binary switch p(st |v1:T ) over a time series of length T = 100, so that 50 errors corresponds to random guessing. Plotted are histograms of the errors over 1000 experiments; same panels as Figure 6.

SLDS EXPERIMENTS We chose experimental conditions that, from the viewpoint of classical signal processing, are difficult, with changes in the switches occurring at a much higher rate than the typical frequencies in the signal. We consider two different toy SLDS experiments: the ‘easy’ problem corresponds to a low hidden dimension, H = 3, with low observation noise; the ‘hard’ problem corresponds to a high hidden dimension, H = 30, and high observation noise. See Figure (5) for details of the experimental setup. We compared methods using a single Gaussian, and methods using multiple Gaussians, see Figure (6) and Figure (7). For EC we use the mean approximation for the numerical integration of Equation (12). For the Particle Filter 1000 particles were used, with Kitagawa re-sampling (Kitagawa, 1996). For the Rao-Blackwellized Particle Filter (Doucet et al., 2000), 500 particles were used, with Kitagawa re-sampling. We included the Particle Filter merely as a point of comparison with ADF, since Particle Filters are not designed to approximate the smoothed estimate.
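The performance measure in Figures (6,7) is simply the number of time-steps at which the most probable smoothed switch disagrees with the sampled one; a sketch follows, where smoothed_switch_marginals stands in for whichever inference routine is under evaluation (a hypothetical placeholder, not a function from the paper).

```python
import numpy as np

def switch_errors(s_true, switch_marginals):
    """Count t where argmax_s p(s_t | v_{1:T}) disagrees with the sampled s_t.

    s_true: (T,) sampled switch states; switch_marginals: (T, S) posteriors.
    """
    s_hat = np.argmax(switch_marginals, axis=1)
    return int(np.sum(s_hat != np.asarray(s_true)))

# Over many sampled time-series one would histogram these counts, as in
# Figures (6,7); 50 errors at T = 100 with S = 2 is chance level.
# errors = [switch_errors(s, smoothed_switch_marginals(v)) for s, v in data]
```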
An alternative MCMC procedure is to perform Gibbs sampling of p(s1:T |v1:T ) using p(st |s\t , v1:T ) ∝ p(v1:T |s1:T )p(s1:T ), where p(v1:T |s1:T ) is simply the likelihood of an LDS—see for example Carter and Kohn (1996).17 We initialized the state s1:T using the most likely states st from the filtered results with a Gaussian mixture (ADFM), then swept forwards in time, sampling from p(st |s\t , v1:T ) until the end of the chain. We then reversed direction, sampling from time T back to time 1, and repeated this procedure 100 times, with the mean over the last 80 sweeps used as the posterior mean approximation. This procedure is expensive since each sample requires computing the likelihood of an LDS defined on the whole time-series; it therefore scales as GT^2, where G is the number of sweeps over the time series. Despite using a reasonable initialization, Gibbs sampling struggles to improve on the filtered results.

We found that EP was numerically unstable and often struggled to converge. To encourage convergence, we used the damping method in Heskes and Zoeter (2002), performing 20 iterations with a damping factor of 0.5. The disappointing performance of EP is most likely due to conflicts resulting from numerical instabilities introduced by the frequent conversions between moment and canonical representations.

17. Carter and Kohn (1996) proposed an overly complex procedure for computing the likelihood p(v1:T |s1:T ). This is simply the likelihood of an LDS (since s1:T are assumed known), and is readily computable using any of the standard procedures in the literature.

Figure 8: aSLDS: histogram of the number of errors in estimating a binary switch p(st |v1:T ) over a time series of length T = 100, so that 50 errors corresponds to random guessing. Plotted are histograms of the errors over 1000 experiments; one panel per method (PF, ADFS, ECS, ADFM, ECM). Augmented SLDS results: ADFM used I = 4 Gaussians, and ECM used I = J = 4 Gaussians. We used 1000 samples to approximate Equation (12).

I    J    error
1    1    0.0989
4    1    0.0624
4    4    0.0365
16   1    0.0440
16   16   0.0130
64   1    0.0440
64   64   4.75e-4
256  1    0.0440
256  256  3.40e-8

Table 2: Errors in approximating the states for the multi-path problem, see Figure (9). The mean absolute deviation |p_ec(st |v1:T ) − p_exact(st |v1:T )|, averaged over the S = 4 states of st and over the times t = 1, . . . , 5, computed for different numbers of mixture components in EC. The mean approximation of Equation (12) is used. The exact computation uses S^(T−1) = 256 mixtures.

The various algorithms differ widely in performance, see Figures (6,7). Not surprisingly, the best filtered results are given by ADF, since it is better able to represent the variance in the filtered posterior than the sampling methods. Unlike Kim’s method, EC makes good use of the future information to clean up the filtered results considerably. One should bear in mind that EC, Kim’s method and the Gibbs initialization all use the same ADF results. These results show that EC may dramatically improve on Kim’s method, so that the small amount of extra work in making a numerical approximation of p(st |st+1 , v1:T ), Equation (12), may bring significant benefits.

AUGMENTED SLDS EXPERIMENTS In Figure (8), we chose a simple two-state (S = 2) transition distribution p(st+1 = 1|st , ht ) = σ(ht^T w(st )), where σ(x) ≡ 1/(1 + e^(−x)).
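The sample-based evaluation of the switch transition expectation, Equation (7), under a Gaussian belief over ht can be sketched as below. The logistic transition matches the form just given; the belief parameters and weight vector are placeholders for whatever the filtering recursion supplies.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def switch_transition_prob(w, mean, cov, n_samples=1000, rng=None):
    """Approximate <p(s_{t+1}=1 | h_t, s_t)> under p(h_t | i_t, s_t, v_{1:t}).

    w: (H,) weights for the current switch state; mean/cov: Gaussian belief.
    """
    rng = rng or np.random.default_rng()
    h = rng.multivariate_normal(mean, cov, size=n_samples)   # exact sampling
    return float(np.mean(sigmoid(h @ w)))

# Cheaper mean approximation, ignoring covariance information:
# sigmoid(mean @ w)
```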
Some care needs to be taken to make a model so for which even exact inference would produce posterior switches close to the sampled switches. If the switch variables st+1 changes wildly (which is possible given the above formula since the hidden state h may have a large projected change if the hidden state changes) essentially no information is left in the signal for any inference method to produce reasonable results. We therefore set w(st ) to a zero vector except for the first two components, which are independently sampled from a zero mean Gaussian with standard deviation 5. For each of the two switch states, s, we have a transition matrix A(s), which 2533 BARBER t=1 0 t=2 10 t=3 20 t=4 30 t=5 40 −40 −30 −20 −10 0 10 20 30 (a) 40 (b) Figure 9: (a) The multi-path problem. The particle starts from (0, 0) at time t = 1. Subsequently, at each time-point, either the vector (10, 10) (corresponding to states s = 1 and s = 3) or (−10, 10) (corresponding to states s = 2 and s = 4), is added to the hidden dynamics, perturbed by a small amount of noise, Σh = 0.1. The observations are v = h + ηv (s). For states s = 1, 2 the observation noise is small, Σv = 0.1I, but for s = 3, 4 the noise in the horizontal direction has variance 1000. The visible observations are given by the x’. The true hidden states are given by ‘+’. (b) The exact smoothed state posteriors p exact (st |v1:T ) computed by enumerating all paths (given by the dashed lines). we set to be block diagonal. The first 2 × 2 block is set to 0.9999R θ , where Rθ is a 2 × 2 rotation matrix with angle θ chosen uniformly from 0 to 1 radians. This means that st+1 is dependent on the first two components of ht which are rotating at a restricted rate. The remaining H − 2 × H − 2 block of A(s) is chosen as (using MATLAB notation) 0.9999 ∗ orth(rand(H − 2)), which means a scaled randomly chosen orthogonal matrix. Throughout, S = 2, V = 1, H = 30, T = 100, with zero output ¯ ¯ bias. Using partly MATLAB notation, B(s) = randn(V, H), vt ≡ 0, h1 = 10 ∗ randn(H, 1), ht>1 = 0, ¯ h = I , p = uniform. Σv = 30I , Σh = 0.1I . Σ1 H 1 V H We compare EC only against Particle Filters using 1000 particles, since other methods would require specialized and novel implementations. In ADFM, I = 4 Gaussians were used, and for ECM, I = J = 4 Gaussians were used. Looking at the results in Figure (8), we see that EC performs well, with some improvement in using the mixture representation I, J = 4 over a single Gaussian I = J = 1. The Particle Filter most likely failed since the hidden dimension is too high to be explored well with only 1000 particles. E FFECT OF U SING M IXTURES Our claim is that EC should cope in situations where the smoothed posterior p(ht |st , v1:T ) is multimodal and, consequently, cannot be well represented by a single Gaussian. 18 We therefore constructed an SLDS which exhibits multi-modality to see the effect of using EC with both I and J greater than 1. The ‘multi-path’ scenario is described in Figure (9), where a particle traces a path through a two dimensional space. A small number of time-steps was chosen so that the exact p(st |v1:T ) can be computed by direct enumeration. The observation of the particle is at times extremely noisy in the horizontal direction. This induces multi-modality of p(ht |st , v1:T ) since there 18. This should not be confused with the multi-modality of p(ht |v1:T ) = ∑st p(ht |st , v1:T )p(st |v1:T ). 
The accuracy with which EC predicts the exact smoothed posterior is given in Table (2). For this problem we see that both the number of forward components (I) and the number of backward components (J) affect the accuracy of the approximation, generally with improved accuracy as the number of mixture components increases. For a 'perfect' approximation method, one would expect the approximation to become exact when I = J = S^{T−1} = 256. The small error for this case in Table (2) may arise for several reasons: the extra independence assumption used in EC, the simple mean approximation used to compute Equation (12), or numerical roundoff. However, at least in this case, the effect of these assumptions on the performance is very small.

5. Discussion

Expectation Correction is a novel form of Backward Pass which makes fewer approximations than the widely used approach of Kim (1994). In Kim's method, potentially important future information channeled through the continuous hidden variables is lost. EC, like Kim's method, makes the additional assumption p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ). However, our experience is that this assumption is rather mild, since the state of ht+1 will be most heavily influenced by its immediate parent st+1 . Our approximation is based on the idea that, although exact inference will consist of an exponentially large number of mixture components, due to the forgetting which commonly occurs in Markovian models, a finite number of mixture components may provide a reasonable approximation. In tracking situations where the visible information is (temporarily) insufficient to specify the hidden state accurately, representing the posterior p(ht |st , v1:T ) using a mixture of Gaussians may improve results significantly. Clearly, in systems with very long correlation times our method may require too many mixture components to produce a satisfactory result, although we are unaware of other techniques that would be able to cope well in that case.

We hope that the straightforward ideas presented here may help facilitate the practical application of dynamic hybrid networks to machine learning and related areas. Whilst models with Gaussian emission distributions such as the SLDS are widespread, the extension of this method to non-Gaussian emissions p(vt |ht , st ) would clearly be of considerable interest. Software for Expectation Correction for this augmented class of Switching Linear Gaussian models is available from www.idiap.ch/∼barber.

Acknowledgments

I would like to thank Onno Zoeter and Tom Heskes for kindly providing their Expectation Propagation code, Silvia Chiappa for helpful discussions, and Bertrand Mesot for many discussions, help with the simulations and for suggesting the relationship between the partial factorization and independence viewpoints of EC. I would also like to thank the reviewers for their many helpful comments and suggestions.

Algorithm 4 LDS Forward Pass. Compute the filtered posteriors p(ht |v1:t ) ≡ N ( ft , Ft ) for an LDS with parameters θt = (A, B, Σh , Σv , h̄, v̄) for t > 1. At time t = 1 we use parameters θ1 = (A, B, Σ, Σv , µ, v̄), where Σ and µ are the prior covariance and mean of h. The log-likelihood L = log p(v1:T ) is also returned.
F0 ← 0, f0 ← 0, L ← 0
for t ← 1, T do
  { ft , Ft , pt } = LDSFORWARD( ft−1 , Ft−1 , vt ; θt )
  L ← L + log pt
end for

function LDSFORWARD( f , F, v; θ)
  Compute joint p(ht , vt |v1:t−1 ):
    µh ← A f + h̄,  µv ← Bµh + v̄
    Σhh ← AFA^T + Σh ,  Σvv ← BΣhh B^T + Σv ,  Σvh ← BΣhh
  Find p(ht |v1:t ) by conditioning:
    f′ ← µh + Σvh^T Σvv^{−1} (v − µv ),  F′ ← Σhh − Σvh^T Σvv^{−1} Σvh
  Compute p(vt |v1:t−1 ):
    p′ ← exp(−½ (v − µv )^T Σvv^{−1} (v − µv )) / √(det(2πΣvv ))
  return f′ , F′ , p′
end function

Appendix A. Inference in the LDS

The LDS is defined by Equations (1,2) in the case of a single switch S = 1. The LDS Forward and Backward passes define the important functions LDSFORWARD and LDSBACKWARD, which we shall make use of for inference in the aSLDS.

Forward Pass (Filtering)

The filtered posterior p(ht |v1:t ) is a Gaussian which we parameterize with mean ft and covariance Ft . These parameters can be updated recursively using p(ht |v1:t ) ∝ p(ht , vt |v1:t−1 ), where the joint distribution p(ht , vt |v1:t−1 ) has statistics (see Appendix (B))

  µh = A ft−1 + h̄,  µv = Bµh + v̄
  Σhh = AFt−1 A^T + Σh ,  Σvv = BΣhh B^T + Σv ,  Σvh = BΣhh .

We may then find p(ht |v1:t ) by conditioning p(ht , vt |v1:t−1 ) on vt , see Appendix (C). This gives rise to Algorithm (4).

Backward Pass

The smoothed posterior p(ht |v1:T ) ≡ N (gt , Gt ) can be computed recursively using

  p(ht |v1:T ) = ∫_{ht+1} p(ht |ht+1 , v1:T ) p(ht+1 |v1:T ) = ∫_{ht+1} p(ht |ht+1 , v1:t ) p(ht+1 |v1:T ),

where p(ht |ht+1 , v1:t ) may be obtained from the joint distribution

  p(ht , ht+1 |v1:t ) = p(ht+1 |ht ) p(ht |v1:t ),   (19)

which itself can be obtained by forward propagation from p(ht |v1:t ). Conditioning Equation (19) to find p(ht |ht+1 , v1:t ) effectively reverses the dynamics,

  ht = ←A_t ht+1 + ←η_t ,

where ←A_t and ←η_t ∼ N (←m_t , ←Σ_t ) are found using the conditioned Gaussian results in Appendix (C)—these are explicitly given in Algorithm (5). Then, averaging the reversed dynamics over p(ht+1 |v1:T ), we find that p(ht |v1:T ) is a Gaussian with statistics

  gt = ←A_t gt+1 + ←m_t ,  Gt = ←A_t Gt+1 ←A_t^T + ←Σ_t .

This Backward Pass is given in Algorithm (5).

Algorithm 5 LDS Backward Pass. Compute the smoothed posteriors p(ht |v1:T ). This requires the filtered results from Algorithm (4).

GT ← FT , gT ← fT
for t ← T − 1, 1 do
  {gt , Gt } = LDSBACKWARD(gt+1 , Gt+1 , ft , Ft ; θt+1 )
end for

function LDSBACKWARD(g, G, f , F; θ)
  µh ← A f + h̄,  Σh′h′ ← AFA^T + Σh ,  Σh′h ← AF
  ←Σ ← F − Σh′h^T Σh′h′^{−1} Σh′h ,  ←A ← Σh′h^T Σh′h′^{−1} ,  ←m ← f − ←A µh
  g′ ← ←A g + ←m ,  G′ ← ←A G ←A^T + ←Σ
  return g′ , G′
end function

For parameter learning of the A matrix, the smoothed statistic ⟨ht ht+1^T⟩ is required. Using the above formulation, this is given by ←A_t Gt+1 + ⟨ht⟩⟨ht+1⟩^T. This is much simpler than the standard expressions cited in Shumway and Stoffer (2000) and Roweis and Ghahramani (1999).

Appendix B. Gaussian Propagation

Let y be linearly related to x through y = Mx + η, where η ∼ N (µ, Σ) and x ∼ N (µx , Σx ). Then p(y) = ∫_x p(y|x)p(x) is a Gaussian with mean Mµx + µ and covariance MΣx M^T + Σ.

Appendix C. Gaussian Conditioning

For a joint Gaussian distribution over the vectors x and y with means µx , µy and covariance elements Σxx , Σxy , Σyy , the conditional p(x|y) is a Gaussian with mean µx + Σxy Σyy^{−1} (y − µy ) and covariance Σxx − Σxy Σyy^{−1} Σyx .
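Putting Appendices B and C together yields the body of LDSFORWARD in Algorithm (4). A compact numerical sketch is given below, assuming NumPy; variable names mirror the text but are otherwise ours.

import numpy as np

def lds_forward_step(f, F, v, A, B, Sigma_h, Sigma_v, h_bar, v_bar):
    # Joint statistics of p(h_t, v_t | v_{1:t-1})  (Appendix B)
    mu_h = A @ f + h_bar
    mu_v = B @ mu_h + v_bar
    S_hh = A @ F @ A.T + Sigma_h
    S_vv = B @ S_hh @ B.T + Sigma_v
    S_vh = B @ S_hh
    # Condition on the observation v_t  (Appendix C)
    K = np.linalg.solve(S_vv, S_vh).T          # = S_vh^T S_vv^{-1}
    f_new = mu_h + K @ (v - mu_v)
    F_new = S_hh - K @ S_vh
    # Likelihood p(v_t | v_{1:t-1}) under the predictive Gaussian
    d = v - mu_v
    p = np.exp(-0.5 * d @ np.linalg.solve(S_vv, d)) \
        / np.sqrt(np.linalg.det(2 * np.pi * S_vv))
    return f_new, F_new, p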
Appendix D. Collapsing Gaussians

The user may provide any algorithm of their choice for collapsing a set of Gaussians to a smaller set of Gaussians (Titterington et al., 1985). Here, to be explicit, we present a simple one which is fast, but has the disadvantage that no spatial information about the mixture is used.

First, we describe how to collapse a mixture to a single Gaussian: we may collapse a mixture of Gaussians p(x) = ∑i pi N (x|µi , Σi ) to a single Gaussian with mean µ = ∑i pi µi and covariance ∑i pi (Σi + µi µi^T) − µµ^T.

To collapse a mixture to a K-component mixture we retain the K − 1 Gaussians with the largest mixture weights—the remaining N − K + 1 Gaussians are simply merged to a single Gaussian using the above method. The alternative of recursively merging the two Gaussians with the lowest mixture weights gave similar experimental performance. More sophisticated methods which retain some spatial information would clearly be potentially useful. The method presented in Lerner et al. (2000) is a suitable approach which considers removing Gaussians which are spatially similar (and not just low-weight components), thereby retaining diversity over the possible solutions.
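The two collapse operations above are straightforward to implement; here is a hedged sketch assuming NumPy arrays for the weights, means and covariances (function names are ours).

import numpy as np

def collapse_to_single(p, mu, Sigma):
    # Moment-match sum_i p_i N(x | mu_i, Sigma_i) by a single Gaussian.
    # p: (N,), mu: (N, d), Sigma: (N, d, d).
    m = p @ mu                                    # sum_i p_i mu_i
    outer = np.einsum('i,ij,ik->jk', p, mu, mu)   # sum_i p_i mu_i mu_i^T
    S = np.einsum('i,ijk->jk', p, Sigma) + outer - np.outer(m, m)
    return m, S

def collapse_to_K(p, mu, Sigma, K):
    # Retain the K-1 heaviest components; merge the rest into one Gaussian.
    order = np.argsort(p)[::-1]
    keep, rest = order[:K - 1], order[K - 1:]
    w_rest = p[rest].sum()
    m_rest, S_rest = collapse_to_single(p[rest] / w_rest, mu[rest], Sigma[rest])
    return (np.concatenate([p[keep], [w_rest]]),
            np.vstack([mu[keep], m_rest[None, :]]),
            np.concatenate([Sigma[keep], S_rest[None]], axis=0))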
Appendix E. The Discrete-Continuous Factorization Viewpoint

An alternative viewpoint is to proceed analogously to the Rauch-Tung-Striebel correction method for the LDS (Grewal and Andrews, 1993):

  p(ht , st |v1:T ) = ∑_{st+1} ∫_{ht+1} p(st , ht , ht+1 , st+1 |v1:T )
                   = ∑_{st+1} p(st+1 |v1:T ) ∫_{ht+1} p(ht , st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 , v1:T )
                   = ∑_{st+1} p(st+1 |v1:T ) ⟨ p(ht |ht+1 , st+1 , st , v1:t ) p(st |ht+1 , st+1 , v1:t ) ⟩
                   ≈ ∑_{st+1} p(st+1 |v1:T ) ⟨ p(ht |ht+1 , st+1 , st , v1:t ) ⟩ p(st |st+1 , v1:T ),   (20)

where angled brackets ⟨·⟩ denote averages with respect to p(ht+1 |st+1 , v1:T ), and in the last line ⟨p(st |ht+1 , st+1 , v1:t )⟩ has been replaced by p(st |st+1 , v1:T ). Whilst the factorized approximation in Equation (20) may seem severe, by comparing Equations (20) and (10) we see that it is equivalent to the apparently milder assumption p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ). Hence this factorized approximation is equivalent to the 'standard' EC approach in which the dependency on st is dropped.

References

D. L. Alspach and H. W. Sorenson. Nonlinear Bayesian estimation using Gaussian sum approximations. IEEE Transactions on Automatic Control, 17(4):439–448, 1972.

Y. Bar-Shalom and X.-R. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, Norwood, MA, 1998.

X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence—UAI 1998, pages 33–42. Morgan Kaufmann, 1998.

C. Carter and R. Kohn. Markov chain Monte Carlo in conditionally Gaussian state space models. Biometrika, 83:589–601, 1996.

A. T. Cemgil, B. Kappen, and D. Barber. A generative model for music transcription. IEEE Transactions on Audio, Speech and Language Processing, 14(2):679–694, 2006.

S. Chib and M. Dueker. Non-Markovian regime switching with endogenous states and time-varying state strengths. Econometric Society 2004 North American Summer Meetings 600, 2004.

A. Doucet, N. de Freitas, K. Murphy, and S. Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. Uncertainty in Artificial Intelligence, 2000.

A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001.

Z. Ghahramani and G. E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):963–996, 1998.

M. S. Grewal and A. P. Andrews. Kalman Filtering: Theory and Practice. Prentice-Hall, 1993.

T. Heskes and O. Zoeter. Expectation propagation for approximate inference in dynamic Bayesian networks. In A. Darwiche and N. Friedman, editors, Uncertainty in Artificial Intelligence, pages 216–223, 2002.

T. Jaakkola and M. Jordan. A variational approach to Bayesian logistic regression problems and their extensions. In Artificial Intelligence and Statistics, 1996.

M. I. Jordan. Learning in Graphical Models. MIT Press, 1998.

S. Julier and J. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In Int. Symp. Aerospace/Defense Sensing, Simul. and Controls, Orlando, FL, 1997.

C.-J. Kim. Dynamic linear models with Markov-switching. Journal of Econometrics, 60:1–22, 1994.

C.-J. Kim and C. R. Nelson. State-Space Models with Regime Switching. MIT Press, 1999.

G. Kitagawa. The two-filter formula for smoothing and an implementation of the Gaussian-sum smoother. Annals of the Institute of Statistical Mathematics, 46(4):605–623, 1994.

G. Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5(1):1–25, 1996.

S. Lauritzen and F. Jensen. Stable local computation with conditional Gaussian distributions. Statistics and Computing, 11:191–203, 2001.

S. L. Lauritzen. Propagation of probabilities, means, and variances in mixed graphical association models. Journal of the American Statistical Association, 87(420):1098–1108, 1992.

L. J. Lee, H. Attias, L. Deng, and P. Fieguth. A multimodal variational approach to learning and inference in switching state space models. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 04), volume 5, pages 505–508, 2004.

U. Lerner, R. Parr, D. Koller, and G. Biswas. Bayesian fault detection and diagnosis in dynamic systems. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-00), pages 531–537, 2000.

U. N. Lerner. Hybrid Bayesian Networks for Reasoning about Complex Systems. PhD thesis, Stanford University, 2002.

B. Mesot and D. Barber. Switching linear dynamical systems for noise robust speech recognition. IDIAP-RR 08, 2006.

T. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, MIT Media Lab, 2001.

R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. CRG-TR-93-1, Dept. of Computer Science, University of Toronto, 1993.

P. Park and T. Kailath. New square-root smoothing algorithms. IEEE Transactions on Automatic Control, 41:727–732, 1996.

V. Pavlovic, J. M. Rehg, and J. MacCormick. Learning switching linear models of human motion. In Advances in Neural Information Processing Systems (NIPS 13), pages 981–987, 2001.

L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

H. E. Rauch, G. Tung, and C. T. Striebel. Maximum likelihood estimates of linear dynamic systems. American Institute of Aeronautics and Astronautics Journal (AIAAJ), 3(8):1445–1450, 1965.

S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305–345, 1999.

R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications. Springer, 2000.

E. B. Sudderth, A. T. Ihler, W. T. Freeman, and A. S. Willsky. Nonparametric belief propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 605–612, 2003.

D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley, 1985.
H. Tong. Nonlinear Time Series Analysis: A Dynamical Systems Approach. Oxford University Press, 1990.

M. Verhaegen and P. Van Dooren. Numerical aspects of different Kalman filter implementations. IEEE Transactions on Automatic Control, 31(10):907–917, 1986.

M. West and J. Harrison. Bayesian Forecasting and Dynamic Models. Springer, 1999.

O. Zoeter. Monitoring Non-Linear and Switching Dynamical Systems. PhD thesis, Radboud University Nijmegen, 2005.

3 0.47584069 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation

Author: Kazuho Watanabe, Sumio Watanabe

Abstract: Bayesian learning has been widely used and proved to be effective in many data modeling problems. However, computations involved in it require huge costs and generally cannot be performed exactly. The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. The properties and capabilities of variational Bayesian learning itself have not been clarified yet. It is still unknown how good an approximation the variational Bayesian approach can achieve. In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, not only becomes important in addressing the model selection problem, but also enables us to discuss the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning. Keywords: Gaussian mixture model, variational Bayesian learning, stochastic complexity

4 0.40352958 75 jmlr-2006-Policy Gradient in Continuous Time

Author: Rémi Munos

Abstract: Policy search is a method for approximately solving an optimal control problem by performing a parametric optimization search in a given class of parameterized policies. In order to apply a local optimization technique, such as a gradient method, we wish to evaluate the sensitivity of the performance measure with respect to the policy parameters, the so-called policy gradient. This paper is concerned with the estimation of the policy gradient for continuous-time, deterministic state dynamics, in a reinforcement learning framework, that is, when the decision maker does not have a model of the state dynamics. We show that usual likelihood ratio methods used in discrete time fail to estimate the gradient because they are subject to variance explosion when the discretization time-step decreases to 0. We describe an alternative approach based on the approximation of the pathwise derivative, which leads to a policy gradient estimate that converges almost surely to the true gradient when the time-step tends to 0. The underlying idea starts with the derivation of an explicit representation of the policy gradient using pathwise derivation. This derivation makes use of the knowledge of the state dynamics. Then, in order to estimate the gradient from the observable data only, we use a stochastic policy to discretize the continuous deterministic system into a stochastic discrete process, which enables us to replace the unknown coefficients by quantities that solely depend on known data. We prove the almost sure convergence of this estimate to the true policy gradient when the discretization time-step goes to zero. The method is illustrated on two target problems, in discrete and continuous control spaces. Keywords: optimal control, reinforcement learning, policy search, sensitivity analysis, parametric optimization, gradient estimate, likelihood ratio method, pathwise derivation

1. Introduction and Statement of the Problem

We consider an optimal control problem with continuous state (xt ∈ IR^d )t≥0 whose state dynamics is defined according to the controlled differential equation:

  dxt /dt = f (xt , ut ),   (1)

where the control (ut )t≥0 is a Lebesgue measurable function with values in a control space U. Note that the state dynamics f may also depend on time, but we omit this dependency in the notation, for simplicity. We intend to maximize a functional J that depends on the trajectory (xt )0≤t≤T over a finite-time horizon T > 0. For simplicity, in the paper, we illustrate the case of a terminal reward only:

  J(x; (ut )t≥0 ) := r(xT ),   (2)

where r : IR^d → IR is the reward function. Extension to the case of a general functional of the kind

  J(x; (ut )t≥0 ) = ∫_0^T r(t, xt ) dt + R(xT ),   (3)

with r and R being current and terminal reward functions, would easily follow, as indicated in Remark 1. The optimal control problem of finding a control (ut )t≥0 that maximizes the functional is replaced by a parametric optimization problem for which we search for a good feed-back control law in a given class of parameterized policies {πα : [0, T ] × IR^d → U}α , where α ∈ IR^m is the parameter. The control ut ∈ U (or action) at time t is ut = πα (t, xt ), and we may write the dynamics of the resulting feed-back system as

  dxt /dt = fα (xt ),   (4)

where fα (xt ) := f (xt , πα (t, xt )). In the paper, we will make the assumption that fα is C^2 , with bounded derivatives. Let us define the performance measure V (α) := J(x; πα (t, xt )t≥0 ), where its dependency with respect to (w.r.t.) the parameter α is emphasized.
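For intuition, the feedback system (4) is easy to roll out numerically; the sketch below uses a simple Euler discretization, with an illustrative scalar dynamics and linear state-feedback policy (all concrete choices are our assumptions, not the paper's benchmark problems).

import numpy as np

def rollout(f, policy, x0, alpha, T=1.0, dt=1e-3):
    # Euler-integrate dx/dt = f(x, pi_alpha(t, x)) and return x_T.
    x, t = np.array(x0, dtype=float), 0.0
    while t < T:
        x = x + dt * f(x, policy(t, x, alpha))
        t += dt
    return x

# Example: dx/dt = -x + u with linear feedback u = alpha * x.
f = lambda x, u: -x + u
policy = lambda t, x, alpha: alpha * x
x_T = rollout(f, policy, x0=[1.0], alpha=0.5)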
One may also consider an average performance measure according to some distribution µ for the initial state: V (α) := E[J(x; πα (t, xt )t≥0 ) | x ∼ µ]. In order to find a local maximum of V (α), one may perform a local search, such as a gradient ascent method

  α ← α + η ∇α V (α),   (5)

with an adequate step η (see for example (Polyak, 1987; Kushner and Yin, 1997)). The computation of the gradient ∇α V (α) is the object of this paper. A first method would be to approximate the gradient by a finite-difference quotient for each of the m components of the parameter:

  ∂αi V (α) ≃ (V (α + ε ei ) − V (α)) / ε,

for some small value of ε (we use the notation ∂α instead of ∇α to indicate that it is a single-dimensional derivative). This finite-difference method requires the simulation of m + 1 trajectories to compute an approximation of the true gradient. When the number of parameters is large, this may be computationally expensive. However, this simple method may be efficient if the number of parameters is relatively small. In the rest of the paper we will not consider this approach, and will aim at computing the gradient using one trajectory only.

Pathwise estimation of the gradient. We now illustrate that if the decision-maker has access to a model of the state dynamics, then a pathwise derivation would directly lead to the policy gradient. Indeed, let us define the gradient of the state with respect to the parameter: zt := ∇α xt (i.e., zt is defined as a d × m matrix whose (i, j)-component is the derivative of the ith component of xt w.r.t. α j ). Our smoothness assumption on fα allows us to differentiate the state dynamics (4) w.r.t. α, which provides the dynamics on (zt ):

  dzt /dt = ∇α fα (xt ) + ∇x fα (xt ) zt ,   (6)

where the coefficients ∇α fα and ∇x fα are, respectively, the derivatives of f w.r.t. the parameter (matrix of size d × m) and the state (matrix of size d × d). The initial condition for z is z0 = 0. When the reward function r is smooth (i.e., continuously differentiable), one may apply a pathwise differentiation to derive a gradient formula (see e.g. (Bensoussan, 1988), or (Yang and Kushner, 1991) for an extension to the stochastic case):

  ∇α V (α) = ∇x r(xT ) zT .   (7)

Remark 1 In the more general setting of a functional (3), the gradient is deduced (by linearity) from the above formula:

  ∇α V (α) = ∫_0^T ∇x r(t, xt ) zt dt + ∇x R(xT ) zT .

What is known to the agent? The decision maker (call it the agent) that intends to design a good controller for the dynamical system may or may not know a model of the state dynamics f . In case the dynamics is known, the state gradient zt = ∇α xt may be computed from (6) along the trajectory, and the gradient of the performance measure w.r.t. the parameter α is deduced at time T from (7), which allows one to perform the gradient ascent step (5). However, in this paper we consider a Reinforcement Learning (Sutton and Barto, 1998) setting in which the state dynamics is unknown to the agent, but we still assume that the state is fully observable. The agent knows only the response of the system to its control. To be more precise, the available information to the agent at time t is its own control policy πα and the trajectory (xs )0≤s≤t up to time t. At time T , the agent receives the reward r(xT ) and, in this paper, we assume that the gradient ∇r(xT ) is available to the agent. From this point of view, it seems impossible to derive the state gradient zt from (6), since ∇α f and ∇x f are unknown.
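For reference, when the model is available the pathwise formula is direct to implement: integrate (4) and (6) jointly, then apply (7). A sketch with an illustrative one-dimensional problem (our assumptions throughout):

import numpy as np

def pathwise_gradient(f, dfda, dfdx, grad_r, x0, m, T=1.0, dt=1e-3):
    # grad_alpha V(alpha) ~ grad_x r(x_T) @ z_T, where z solves
    # dz/dt = dfda(x) + dfdx(x) @ z  (Equation 6), with z_0 = 0.
    x = np.array(x0, dtype=float)        # x_t, shape (d,)
    z = np.zeros((x.size, m))            # z_t = d x_t / d alpha, shape (d, m)
    t = 0.0
    while t < T:
        x, z = x + dt * f(x), z + dt * (dfda(x) + dfdx(x) @ z)
        t += dt
    return grad_r(x) @ z                 # Equation (7)

# Example: d = m = 1, f_alpha(x) = (alpha - 1) * x, r(x) = x, so that
# V(alpha) = exp(alpha - 1) * x_0 and dV/dalpha = V(alpha).
alpha = 0.5
f      = lambda x: (alpha - 1.0) * x
dfda   = lambda x: x.reshape(1, 1)               # d f_alpha / d alpha
dfdx   = lambda x: np.array([[alpha - 1.0]])     # d f_alpha / d x
grad_r = lambda x: np.ones(1)
g = pathwise_gradient(f, dfda, dfdx, grad_r, x0=[1.0], m=1)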
The term ∇x f (xt ) may be approximated by a least squares method from the observation of past states (xs )s≤t , as this will be explained later on in subsection 3.2. However, the term ∇α f (xt ) cannot be calculated analogously. In this paper, we introduce the idea of using stochastic policies to approximate the state (xt ) and the state gradient (zt ) by discrete-time stochastic processes (Xt∆ ) and (Zt∆ ) (with ∆ being some discretization time-step). We show how Zt∆ can be computed without the knowledge of ∇α f , but only from information available to the agent. We prove the convergence (with probability one) of the gradient estimate ∇x r(X∆_T ) Z∆_T derived from the stochastic processes to ∇α V (α) when ∆ → 0. Here, almost sure convergence is obtained using the concentration of measure phenomenon (Talagrand, 1996; Ledoux, 2001).

[Figure 1: a trajectory plotted in the (x, y) plane.]

Figure 1: A trajectory (X∆_{tn} )0≤n≤N and the state dynamics vector fα of the continuous process (xt )0≤t≤T .

Likelihood ratio method? It is worth mentioning that this strong convergence result contrasts with the usual likelihood ratio method (also called score method) in discrete time (see e.g. (Reiman and Weiss, 1986; Glynn, 1987), or more recently in the reinforcement learning literature (Williams, 1992; Sutton et al., 2000; Baxter and Bartlett, 2001; Marbach and Tsitsiklis, 2003)), for which the policy gradient estimate is subject to variance explosion when the discretization time-step ∆ tends to 0. The intuitive reason for that problem lies in the fact that the number of decisions before getting the reward grows to infinity when ∆ → 0 (the variance of likelihood ratio estimates being usually linear in the number of decisions). Let us illustrate this problem on a simple two-dimensional process. Consider the deterministic continuous process (xt )0≤t≤1 defined by the state dynamics:

  dxt /dt = fα := (α, 1 − α)′,   (8)

(0 < α < 1) with initial condition x0 = (0 0)′ (where ′ denotes the transpose operator). The performance measure V (α) is the reward at the terminal state at time T = 1, with the reward function being the first coordinate of the state, r((x y)′ ) := x. Thus V (α) = r(xT =1 ) = α and its derivative is ∇α V (α) = 1. Let (X∆_{tn} )0≤n≤N ∈ IR^2 be a discrete-time stochastic process (the discrete times being {tn = n∆}n=0...N with the discretization time-step ∆ = 1/N) that starts from initial state X∆_0 = x0 = (0 0)′ and makes N random moves of length ∆ towards the right (action u1 ) or the top (action u2 ) (see Figure 1) according to the stochastic policy (i.e., the probability of choosing the actions in each state x) πα (u1 |x) = α, πα (u2 |x) = 1 − α. The process is thus defined according to the dynamics:

  X∆_{tn+1} = X∆_{tn} + (Un , 1 − Un )′ ∆,   (9)

where (Un )0≤n<N [...]

[...] for all N > 0, there exists a constant C that does not depend on N such that dn ≤ C/N. Thus we may take D^2 = C^2 /N. Now, from the previous paragraph, ‖E[XN ] − xN ‖ ≤ e(N), with e(N) → 0 when N → ∞. This means that ‖h − E[h]‖ + e(N) ≥ ‖XN − xN ‖, thus P(‖h − E[h]‖ ≥ ε + e(N)) ≥ P(‖XN − xN ‖ ≥ ε), and we deduce from (31) that

  P(‖XN − xN ‖ ≥ ε) ≤ 2 exp(−N(ε + e(N))^2 /(2C^2 )).

Thus, for all ε > 0, the series ∑N≥0 P(‖XN − xN ‖ ≥ ε) converges. Now, from the Borel-Cantelli lemma, we deduce that for all ε > 0, there exists Nε such that for all N ≥ Nε , ‖XN − xN ‖ < ε, which proves the almost sure convergence of XN to xN as N → ∞ (i.e., X∆_T → xT almost surely as ∆ → 0).
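Stepping back to the two-dimensional example of Equations (8)–(9), the discrete process is simple to simulate; a quick sketch (ours) showing that the first coordinate of X_T concentrates around V(α) = α:

import numpy as np

# Simulate Equation (9): N Bernoulli(alpha) moves of length Delta = 1/N.
# The first coordinate of X_T has mean alpha = V(alpha) and concentrates
# around it as N grows. Purely illustrative.
rng = np.random.default_rng(2)
alpha, N, runs = 0.3, 100, 10_000
Delta = 1.0 / N
U = (rng.random((runs, N)) < alpha)          # U_n = 1: move right; 0: move up
right = U.sum(axis=1)
X_T = np.stack([right * Delta, (N - right) * Delta], axis=1)
print(X_T[:, 0].mean(), X_T[:, 0].std())     # mean ~ alpha; std shrinks with N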
Appendix C. Proof of Proposition 8

First, note that Qt = (1/nt ) ∑_{s∈S(t)} Xs+ (Xs+ )′ − X̄ X̄′, with X̄ := (1/nt ) ∑_{s∈S(t)} Xs+ , is a symmetric, non-negative matrix, since it may be rewritten as

  (1/nt ) ∑_{s∈S(t)} (Xs+ − X̄)(Xs+ − X̄)′.

In solving the least squares problem (21), we deduce b̂ = ∆X̄ − Â X̄∆; thus

  min_{A,b} (1/nt ) ∑_{s∈S(t)} ‖∆Xs − b − A(Xs + ½ ∆Xs )∆‖^2
    = min_A (1/nt ) ∑_{s∈S(t)} ‖∆Xs − ∆X̄ − A(Xs+ − X̄)∆‖^2
    ≤ (1/nt ) ∑_{s∈S(t)} ‖∆Xs − ∆X̄ − ∇x f (X̄, ut )(Xs+ − X̄)∆‖^2 .   (32)

Now, since Xs = X̄ + O(∆), one may obtain, as in (19) and (20) (by replacing Xt by X̄), that

  ∆Xs − ∆X̄ − ∇x f (X̄, ut )(Xs+ − X̄)∆ = O(∆^3 ).   (33)

We deduce from (32) and (33) that

  (1/nt ) ∑_{s∈S(t)} ‖(∇̂x f (Xt , ut ) − ∇x f (X̄, ut ))(Xs+ − X̄)∆‖^2 = O(∆^6 ),

where ∇̂x f denotes the least-squares estimate. By developing each component,

  ∑_{i=1}^d [∇̂x f (Xt , ut ) − ∇x f (X̄, ut )]_{row i} Qt [∇̂x f (Xt , ut ) − ∇x f (X̄, ut )]′_{row i} = O(∆^4 ).

Now, from the definition of ν(∆), for all vectors u ∈ IR^d , u′ Qt u ≥ ν(∆)‖u‖^2 , thus

  ν(∆) ‖∇̂x f (Xt , ut ) − ∇x f (X̄, ut )‖^2 = O(∆^4 ).

Condition (23) yields ∇̂x f (Xt , ut ) = ∇x f (X̄, ut ) + o(1), and since ∇x f (X̄, ut ) = ∇x f (Xt , ut ) + O(∆), we deduce

  lim_{∆→0} ∇̂x f (Xt , ut ) = ∇x f (Xt , ut ).

References

J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319–350, 2001.

A. Bensoussan. Perturbation Methods in Optimal Control. Wiley/Gauthier-Villars Series in Modern Applied Mathematics. John Wiley & Sons Ltd., Chichester, 1988. Translated from the French by C. Tomson.

A. Bogdanov. Optimal control of a double inverted pendulum on a cart. Technical report CSE-04006, CSEE, OGI School of Science and Engineering, OHSU, 2004.

P. W. Glynn. Likelihood ratio gradient estimation: an overview. In A. Thesen, H. Grant, and W. D. Kelton, editors, Proceedings of the 1987 Winter Simulation Conference, pages 366–375, 1987.

E. Gobet and R. Munos. Sensitivity analysis using Itô-Malliavin calculus and martingales: application to stochastic optimal control. SIAM Journal on Control and Optimization, 43(5):1676–1713, 2005.

G. H. Golub and C. F. Van Loan. Matrix Computations, 3rd ed. Baltimore, MD: Johns Hopkins, 1996.

R. E. Kalman, P. L. Falb, and M. A. Arbib. Topics in Mathematical System Theory. New York: McGraw Hill, 1969.

P. E. Kloeden and E. Platen. Numerical Solutions of Stochastic Differential Equations. Springer-Verlag, 1995.

H. J. Kushner and G. Yin. Stochastic Approximation Algorithms and Applications. Springer-Verlag, Berlin and New York, 1997.

S. M. LaValle. Planning Algorithms. Cambridge University Press, 2006.

M. Ledoux. The Concentration of Measure Phenomenon. American Mathematical Society, Providence, RI, 2001.

P. Marbach and J. N. Tsitsiklis. Approximate gradient methods in policy-space optimization of Markov reward processes. Journal of Discrete Event Dynamical Systems, 13:111–148, 2003.

B. T. Polyak. Introduction to Optimization. Optimization Software Inc., New York, 1987.

M. I. Reiman and A. Weiss. Sensitivity analysis via likelihood ratios. In J. Wilson, J. Henriksen, and S. Roberts, editors, Proceedings of the 1986 Winter Simulation Conference, pages 285–289, 1986.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. Bradford Book, 1998.

R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems, pages 1057–1063. MIT Press, 2000.

M. Talagrand. A new look at independence. Annals of Probability, 24:1–34, 1996.
R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.

J. Yang and H. J. Kushner. A Monte Carlo method for sensitivity analysis and parametric optimization of nonlinear stochastic systems. SIAM Journal on Control and Optimization, 29(5):1216–1249, 1991.

5 0.37543681 57 jmlr-2006-Linear State-Space Models for Blind Source Separation

Author: Rasmus Kongsgaard Olsson, Lars Kai Hansen

Abstract: We apply a type of generative modelling to the problem of blind source separation in which prior knowledge about the latent source signals, such as time-varying auto-correlation and quasiperiodicity, are incorporated into a linear state-space model. In simulations, we show that in terms of signal-to-error ratio, the sources are inferred more accurately as a result of the inclusion of strong prior knowledge. We explore different schemes of maximum-likelihood optimization for the purpose of learning the model parameters. The Expectation Maximization algorithm, which is often considered the standard optimization method in this context, results in slow convergence when the noise variance is small. In such scenarios, quasi-Newton optimization yields substantial improvements in a range of signal to noise ratios. We analyze the performance of the methods on convolutive mixtures of speech signals. Keywords: blind source separation, state-space model, independent component analysis, convolutive model, EM, speech modelling

6 0.32854652 27 jmlr-2006-Ensemble Pruning Via Semi-definite Programming     (Special Topic on Machine Learning and Optimization)

7 0.32336742 81 jmlr-2006-Some Discriminant-Based PAC Algorithms

8 0.30887452 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

9 0.27575248 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

10 0.25985074 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis

11 0.25826514 55 jmlr-2006-Linear Programming Relaxations and Belief Propagation -- An Empirical Study     (Special Topic on Machine Learning and Optimization)

12 0.2562921 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann

13 0.244185 89 jmlr-2006-Structured Prediction, Dual Extragradient and Bregman Projections     (Special Topic on Machine Learning and Optimization)

14 0.22991367 74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs

15 0.22575583 15 jmlr-2006-Bayesian Network Learning with Parameter Constraints     (Special Topic on Machine Learning and Optimization)

16 0.21440914 91 jmlr-2006-The Interplay of Optimization and Machine Learning Research     (Special Topic on Machine Learning and Optimization)

17 0.21246156 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification

18 0.21104188 21 jmlr-2006-Computational and Theoretical Analysis of Null Space and Orthogonal Linear Discriminant Analysis

19 0.20070507 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

20 0.20007534 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.02), (35, 0.011), (36, 0.06), (45, 0.031), (50, 0.062), (61, 0.015), (63, 0.05), (68, 0.016), (76, 0.02), (78, 0.023), (79, 0.013), (81, 0.055), (84, 0.03), (89, 0.37), (90, 0.042), (91, 0.031), (96, 0.064)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73870564 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

Author: Martin J. Wainwright

Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: that is, the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working under the restriction of limited computation, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of convex variational methods. This stability result provides additional incentive, apart from the obvious benefit of unique global optima, for using message-passing methods based on convex variational relaxations. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. Keywords: graphical model, Markov random field, belief propagation, variational method, parameter estimation, prediction error, algorithmic stability

2 0.41541469 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

Author: Dmitry M. Malioutov, Jason K. Johnson, Alan S. Willsky

Abstract: We present a new framework based on walks in a graph for analysis and inference in Gaussian graphical models. The key idea is to decompose the correlation between each pair of variables as a sum over all walks between those variables in the graph. The weight of each walk is given by a product of edgewise partial correlation coefficients. This representation holds for a large class of Gaussian graphical models which we call walk-summable. We give a precise characterization of this class of models, and relate it to other classes including diagonally dominant, attractive, nonfrustrated, and pairwise-normalizable. We provide a walk-sum interpretation of Gaussian belief propagation in trees and of the approximate method of loopy belief propagation in graphs with cycles. The walk-sum perspective leads to a better understanding of Gaussian belief propagation and to stronger results for its convergence in loopy graphs. Keywords: Gaussian graphical models, walk-sum analysis, convergence of loopy belief propagation

3 0.36200666 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

Author: Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor

Abstract: We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms. Keywords: kernel methods, hierarchical classification, text categorization, convex optimization, structured outputs

4 0.35114732 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

Author: Mikio L. Braun

Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds

5 0.34855863 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

Author: Francis R. Bach, Michael I. Jordan

Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram. Keywords: spectral clustering, blind source separation, computational auditory scene analysis

6 0.34141913 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

7 0.34133667 66 jmlr-2006-On Model Selection Consistency of Lasso

8 0.3389487 70 jmlr-2006-Online Passive-Aggressive Algorithms

9 0.3337408 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

10 0.33190787 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

11 0.33168128 53 jmlr-2006-Learning a Hidden Hypergraph

12 0.32989761 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

13 0.3295497 44 jmlr-2006-Large Scale Transductive SVMs

14 0.32944989 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms

15 0.32837313 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs     (Special Topic on Machine Learning and Optimization)

16 0.32791787 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity

17 0.32746723 55 jmlr-2006-Linear Programming Relaxations and Belief Propagation -- An Empirical Study     (Special Topic on Machine Learning and Optimization)

18 0.32728583 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation

19 0.32630476 56 jmlr-2006-Linear Programs for Hypotheses Selection in Probabilistic Inference Models     (Special Topic on Machine Learning and Optimization)

20 0.32458642 51 jmlr-2006-Learning Sparse Representations by Non-Negative Matrix Factorization and Sequential Cone Programming     (Special Topic on Machine Learning and Optimization)