nips nips2006 nips2006-60 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mikhail Belkin, Partha Niyogi
Abstract: Geometrically based methods for various tasks of machine learning have attracted considerable attention over the last few years. In this paper we show convergence of eigenvectors of the point cloud Laplacian to the eigenfunctions of the Laplace-Beltrami operator on the underlying manifold, thus establishing the first convergence results for a spectral dimensionality reduction algorithm in the manifold setting. 1
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper we show convergence of eigenvectors of the point cloud Laplacian to the eigenfunctions of the Laplace-Beltrami operator on the underlying manifold, thus establishing the first convergence results for a spectral dimensionality reduction algorithm in the manifold setting. [sent-7, score-1.341]
2 Collectively this class of learning algorithms is often referred to as manifold learning algorithms. [sent-10, score-0.2]
3 Some recent manifold algorithms include Isomap [14] and Locally Linear Embedding (LLE) [13]. [sent-11, score-0.2]
4 In this paper we provide a theoretical analysis for the Laplacian Eigenmaps introduced in [2], a framework based on eigenvectors of the graph Laplacian associated to the point-cloud data. [sent-12, score-0.087]
5 More specifically, we prove that under certain conditions, eigenvectors of the graph Laplacian converge to eigenfunctions of the Laplace-Beltrami operator on the underlying manifold. [sent-13, score-0.546]
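The objects in this statement can be sketched concretely. The following toy computation (an illustrative setup of my own, not code from the paper; the circle example, the bandwidth t, and all array names are assumptions) builds the heat-kernel graph Laplacian of a point cloud sampled on the unit circle and checks that its bottom spectrum mirrors that of the Laplace-Beltrami operator on S^1, whose eigenvalues are 0, 1, 1, 4, 4, ... (m^2, each with multiplicity two):

```python
import numpy as np

# Toy illustration (not code from the paper): sample the unit circle, a compact
# 1-manifold embedded in R^2, form heat-kernel weights and the graph Laplacian
# L = D - W, and compare its bottom spectrum with that of the Laplace-Beltrami
# operator on S^1, whose eigenvalues are 0, 1, 1, 4, 4, ... (m^2, each twice).
n, t = 400, 0.01
theta = 2 * np.pi * np.arange(n) / n
X = np.column_stack([np.cos(theta), np.sin(theta)])  # point cloud in R^2

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise ||x_i - x_j||^2
W = np.exp(-sq / (4 * t))                            # heat-kernel weights
L = np.diag(W.sum(axis=1)) - W                       # unnormalized graph Laplacian

evals = np.linalg.eigvalsh(L)   # ascending; evals[0] ~ 0 (constant eigenvector)
ratio = evals[3] / evals[1]     # eigenvalue ratios are normalization-free; on
                                # S^1 the ratio lambda_2 / lambda_1 equals 4
```

The ratio test sidesteps the normalization constants discussed later in the paper: any rescaling of the Laplacian cancels, so the ratio should sit near 4 for small t and dense sampling.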
6 We note that in mathematics the manifold Laplacian is a classical object of differential geometry with a rich tradition of inquiry. [sent-14, score-0.236]
7 Indeed, several recent manifold learning algorithms are closely related to the Laplacian. [sent-16, score-0.2]
8 The eigenfunctions of the Laplacian are also eigenfunctions of heat diffusions, which is the point of view explored by Coifman and colleagues at Yale University in a series of recent papers on data analysis. [sent-17, score-0.745]
9 The Hessian Eigenmaps approach, which uses eigenfunctions of the Hessian operator for data representation, was proposed by Donoho and Grimes in [7]. [sent-20, score-0.561]
10 Finally, as observed in [2], the cost function that is minimized to obtain the embedding of LLE is an approximation to the squared Laplacian. [sent-22, score-0.076]
11 In the manifold learning setting, the underlying manifold is usually unknown. [sent-23, score-0.4]
12 Therefore functional maps from the manifold need to be estimated using point cloud data. [sent-24, score-0.34]
13 The common approximation strategy in these methods is to construct an adjacency graph associated to a point cloud. [sent-25, score-0.083]
14 The underlying intuition is that since the graph is a proxy for the manifold, inference based on the structure of the graph corresponds to the desired inference based on the geometric structure of the manifold. [sent-26, score-0.133]
15 Building on recent results on functional convergence of heat-kernel approximations to the Laplace-Beltrami operator, and on results on consistency of eigenfunctions for empirical approximations of such operators, we show convergence of the Laplacian Eigenmaps algorithm. [sent-28, score-1.278]
16 We note that in order to prove convergence of a spectral method, one needs to demonstrate convergence of the empirical eigenvalues and eigenfunctions. [sent-29, score-0.589]
17 To our knowledge this is the first complete convergence proof for a spectral manifold learning method. [sent-30, score-0.563]
18 1.1 Prior and Related Work. This paper relies on results obtained in [3, 1] for functional convergence of operators. [sent-32, score-0.229]
19 It turns out, however, that considerably more careful analysis is required to ensure spectral convergence, which is necessary to guarantee convergence of the corresponding algorithms. [sent-33, score-0.36]
20 To the best of our knowledge previous results are not sufficient to guarantee convergence for any spectral method in the manifold setting. [sent-34, score-0.522]
21 Lafon in [10] generalized pointwise convergence results from [1] to the important case of an arbitrary probability distribution on the manifold. [sent-35, score-0.219]
22 Those results were further generalized and presented with an empirical pointwise convergence theorem for the manifold case in [9]. [sent-37, score-0.539]
23 We observe that the arguments in this paper are likely to allow one to use these results to show convergence of eigenfunctions for a wide class of probability distributions on the manifold. [sent-38, score-0.561]
24 Empirical convergence of spectral clustering for a fixed kernel parameter t was analyzed in [11] and is used in this paper. [sent-39, score-0.301]
25 Finally we point out that while the analogies between the geometry of manifolds and the geometry of graphs are well-known in spectral graph theory and in certain areas of differential geometry, we are not aware of prior convergence results of this type. [sent-43, score-0.334]
26 2 Main Result. The main result of this paper is to show convergence of eigenvectors of the graph Laplacian associated to a point-cloud dataset to eigenfunctions of the Laplace-Beltrami operator, when the data is sampled from a uniform probability distribution on an embedded manifold. [sent-46, score-0.943]
27 In what follows we will assume that the manifold M is a compact infinitely differentiable Riemannian submanifold of RN without boundary. [sent-47, score-0.227]
28 Recall now that the Laplace-Beltrami operator ∆ on M is a differential operator ∆ : C²(M) → L²(M) defined as ∆f = −div(∇f), where ∇f is the gradient vector field and div denotes the divergence. [sent-48, score-0.536]
29 ∆ is a positive semi-definite self-adjoint operator and has a discrete spectrum on a compact manifold. [sent-49, score-0.293]
30 We will generally denote its ith smallest eigenvalue by λi and the corresponding eigenfunction by ei . [sent-50, score-0.588]
31 It is easy to see that the empirical operator acts by matrix multiplication on functions restricted to the point cloud, with the matrix being the corresponding graph Laplacian. [sent-53, score-0.101]
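That matrix action can be sketched as follows (an illustrative setup, not the paper's code; the normalization constant, the circle example, and the names are assumptions): restricting a function to the point cloud and multiplying by a scaled graph Laplacian plays the role of applying the empirical operator.

```python
import numpy as np

# Sketch of the matrix action (illustrative, not the paper's code; the
# normalization and the circle example are assumptions).  For a function f
# known only at the sample points, applying the empirical operator is a
# matrix-vector product with a scaled graph Laplacian.
n, t, k = 500, 0.005, 1
theta = 2 * np.pi * np.arange(n) / n
X = np.column_stack([np.cos(theta), np.sin(theta)])

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / (4 * t))
L_hat = (np.diag(W.sum(axis=1)) - W) / (n * t * (4 * np.pi * t) ** (k / 2))

f = np.cos(theta)   # restriction of a Laplace-Beltrami eigenfunction on S^1
g = L_hat @ f       # the empirical operator acting by matrix multiplication

# f should be an approximate eigenvector: g is nearly proportional to f.  (The
# proportionality constant depends on how the sampling density is normalized,
# so only the shape of g is checked here, not its scale.)
coef = (g @ f) / (f @ f)
resid = np.linalg.norm(g - coef * f) / np.linalg.norm(g)
```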
32 Our main theorem shows that there is a way to choose a sequence tn such that the eigenfunctions of the empirical operators L^{t_n}_n converge to the eigenfunctions of the Laplace-Beltrami operator ∆ in probability. [sent-58, score-1.215]
33 Theorem 2.1. Let λ^t_{n,i} be the ith eigenvalue of L^t_n and e^t_{n,i} be the corresponding eigenfunction (which, for each fixed i, will be shown to exist for t sufficiently small). [sent-60, score-0.191]
34 Let λi and ei be the corresponding eigenvalue and eigenfunction of ∆ respectively. [sent-61, score-0.526]
35 Then there exists a sequence tn → 0 such that lim_{n→∞} λ^{t_n}_{n,i} = λ_i and lim_{n→∞} ||e^{t_n}_{n,i}(x) − e_i(x)||_2 = 0, where the limits are taken in probability. [sent-62, score-0.658]
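A toy numerical check in the spirit of this statement (the setup is mine and purely illustrative: the circle example, grid sampling, and volume rescaling are assumptions, and of course no substitute for the proof) shows the smallest nonzero eigenvalue of a rescaled point-cloud Laplacian approaching λ_1 = 1, the first nonzero Laplace-Beltrami eigenvalue of S^1, as t shrinks:

```python
import numpy as np

# Toy check (illustrative assumptions throughout): as t -> 0 with dense
# sampling, the smallest nonzero eigenvalue of the rescaled point-cloud
# Laplacian should approach lambda_1 = 1, the first nonzero eigenvalue of
# the Laplace-Beltrami operator on the unit circle.
def bottom_eigenvalue(n, t, k=1, volume=2 * np.pi):
    theta = 2 * np.pi * np.arange(n) / n
    X = np.column_stack([np.cos(theta), np.sin(theta)])
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (4 * t))
    # rescale by the manifold volume: the empirical average (1/n) * sum_j
    # approximates integration against the uniform *probability* measure
    L_hat = volume * (np.diag(W.sum(axis=1)) - W) / (
        n * t * (4 * np.pi * t) ** (k / 2))
    return np.linalg.eigvalsh(L_hat)[1]

# error against the true eigenvalue shrinks as the bandwidth t shrinks
errs = [abs(bottom_eigenvalue(400, t) - 1.0) for t in (0.05, 0.005)]
```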
36 3 Overview of the proof. The proof of the main theorem consists of two main parts. [sent-63, score-0.261]
37 One is spectral convergence of the functional approximation L^t to ∆ as t → 0; the other is spectral convergence of the empirical approximation L^t_n to L^t as the number of data points n tends to infinity. [sent-64, score-0.765]
38 These two types of convergence are then put together to obtain the main Theorem 2.1. [sent-65, score-0.196]
39 The more difficult part of the proof is to show convergence of eigenvalues and eigenfunctions of the functional approximation L^t to those of ∆ as t → 0. [sent-68, score-0.72]
40 To demonstrate convergence we will take a different functional approximation (1 − H^t)/t of ∆, where H^t is the heat operator. [sent-69, score-0.483]
41 While (1 − H^t)/t does not converge uniformly to ∆, the two operators share an eigenbasis, and for each fixed i the ith eigenvalue of (1 − H^t)/t converges to the ith eigenvalue of ∆. [sent-70, score-0.311]
42 We will then consider the operator R^t = (1 − H^t)/t − L^t. [sent-71, score-0.222]
43 A careful analysis of this operator, which constitutes the bulk of the paper, shows that R^t is a small relatively bounded perturbation of (1 − H^t)/t, in the sense that for any function f we have ||R^t f||_2 / ||((1 − H^t)/t) f||_2 ≪ 1 as t → 0; this will imply spectral convergence and lead to the following theorem. [sent-72, score-0.371]
44 Theorem 3.1. Let λ_i, λ^t_i, e_i, e^t_i be the ith smallest eigenvalues and the corresponding eigenfunctions of ∆ and L^t respectively. [sent-74, score-0.443]
45 Then lim_{t→0} |λ_i − λ^t_i| = 0 and lim_{t→0} ||e_i − e^t_i||_2 = 0. Part 2. [sent-75, score-0.605]
46 The second part is to show that the eigenfunctions of the empirical operator L^t_n converge to eigenfunctions of L^t as n → ∞ in probability. [sent-76, score-0.976]
47 Theorem 3.2. For a fixed, sufficiently small t, let λ^t_{n,i} and λ^t_i be the ith eigenvalues of L^t_n and L^t respectively. [sent-79, score-0.138]
48 Then lim_{n→∞} λ^t_{n,i} = λ^t_i and lim_{n→∞} ||e^t_{n,i}(x) − e^t_i(x)||_2 = 0, assuming that λ^t_i ≤ 1/(2t). [sent-81, score-0.284]
49 The convergence is almost sure. [sent-82, score-0.273]
50 Observe that this implies convergence for any fixed i as soon as t is sufficiently small. [sent-83, score-0.167]
51 After demonstrating the two types of convergence in the top line of the diagram (n → ∞ and t_n → 0), a simple argument shows that a sequence t_n can be chosen to guarantee convergence as in the final Theorem 2.1. [sent-608, score-0.546]
52 4 Main Objects and the Outline of the Proof. Let M be a compact, smooth k-dimensional manifold smoothly embedded in R^N, with the induced Riemannian structure and the corresponding induced measure µ. [sent-612, score-0.227]
53 As above, we define the operator L^t : L²(M) → L²(M) as follows: L^t(f)(x) = (1/t)(4πt)^{−k/2} [ ∫_M e^{−||x−y||²/4t} f(x) dµ_y − ∫_M e^{−||x−y||²/4t} f(y) dµ_y ]. As shown in previous work, this operator serves as a functional approximation to the Laplace-Beltrami operator on M. [sent-613, score-0.758]
54 The purpose of this paper is to extend the previous results to the eigenvalues and eigenfunctions, which turn out to need some careful estimates. [sent-614, score-0.098]
55 We start by reviewing certain properties of the Laplace-Beltrami operator and its connection to the heat equation. [sent-615, score-0.446]
56 Recall that the heat equation on the manifold M is given by ∂h(x,t)/∂t = −∆h(x,t), where h(x,t) is the heat at time t at point x. [sent-616, score-0.648]
57 Let f(x) = h(x,0) be the initial heat distribution. [sent-617, score-0.224]
58 We observe that, from the definition of the derivative, ∆f = lim_{t→0} (1/t)(f(x) − h(x,t)). [sent-618, score-0.197]
59 It is well-known (e.g., [12]) that the solution to the heat equation at time t can be written as H^t f(x) := h(x,t) = ∫_M H^t(x,y) f(y) dµ_y, where H^t is the heat operator and H^t(x,y) is the heat kernel of M. [sent-620, score-0.894]
60 It is also well-known that the heat operator H^t can be written as H^t = e^{−t∆}. [sent-621, score-0.446]
61 We immediately see that ∆ = lim_{t→0} (1 − H^t)/t and that eigenfunctions of H^t, and hence eigenfunctions of (1 − H^t)/t, coincide with eigenfunctions of the Laplace operator. [sent-622, score-0.894]
62 The ith eigenvalue of (1 − H^t)/t is equal to (1 − e^{−tλ_i})/t, where λ_i as usual is the ith eigenvalue of ∆. [sent-623, score-0.276]
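The eigenvalue relation — the ith eigenvalue of (1 − H^t)/t equals (1 − e^{−tλ_i})/t — can be spot-checked numerically (a trivial sketch; the sample eigenvalues 1, 4, 9 are the circle spectrum, chosen here only for concreteness):

```python
import math

# Spot-check: (1 - exp(-t*lam)) / t tends to lam as t -> 0, for each fixed lam.
# The error is of order lam^2 * t / 2, so for fixed lam it vanishes with t.
def phi(lam, t):
    return (1.0 - math.exp(-t * lam)) / t

lams = [1.0, 4.0, 9.0]   # spectrum of the circle, for concreteness
errs_small_t = [abs(phi(lam, 1e-6) - lam) for lam in lams]
```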
63 This pointwise operator convergence is discussed in [10, 3, 1]. [sent-625, score-0.441]
64 To obtain convergence of eigenfunctions, however, one typically needs the stronger uniform convergence. [sent-626, score-0.188]
65 This is sufficient for convergence of eigenfunctions and other spectral properties. [sent-628, score-0.64]
66 It turns out that this type of convergence does not hold for the functional approximation L^t as t → 0, which presents a serious technical obstruction to proving convergence of spectral properties. [sent-629, score-0.579]
67 To see that L^t does not converge uniformly to ∆, note that while (1 − H^t)/t converges to ∆ for each fixed function f, even this convergence is not uniform. [sent-630, score-0.312]
68 Indeed, for a small t, we can always choose a sufficiently large λ_i ≫ 1/t and the corresponding eigenfunction e_i of ∆, such that [sent-631, score-0.45]
69 || ((1 − H^t)/t − ∆) e_i ||_2 = | (1/t)(1 − e^{−tλ_i}) − λ_i | ≈ | 1/t − λ_i | ≫ 1. Since L^t is an approximation to (1 − H^t)/t, uniform convergence cannot be expected and the standard perturbation theory techniques do not apply. [sent-633, score-0.59]
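The saturation effect behind this non-uniformity is easy to see numerically (an illustrative computation of my own, not from the paper): for λ ≪ 1/t the spectral approximation error is tiny, but once λ ≫ 1/t, the quantity (1 − e^{−tλ})/t plateaus near 1/t and the error against λ is enormous.

```python
import math

# Illustration of the non-uniformity: fix t and compare the spectral
# approximation error for a small eigenvalue versus one far above 1/t.
t = 1e-3
lam_small, lam_big = 1.0, 1e4   # note lam_big >> 1/t = 1000
err_small = abs((1 - math.exp(-t * lam_small)) / t - lam_small)
err_big = abs((1 - math.exp(-t * lam_big)) / t - lam_big)
# err_small is of order t/2, while err_big is of order lam_big - 1/t.
```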
70 L^t is a small relatively bounded perturbation of (1 − H^t)/t. [sent-637, score-0.104]
71 The relative boundedness of the perturbation will imply convergence of eigenfunctions of L^t to those of (1 − H^t)/t and hence, by Observation 1, to eigenfunctions of ∆. [sent-639, score-0.985]
72 We now define the perturbation operator R^t = (1 − H^t)/t − L^t. The relative boundedness of the self-adjoint perturbation operator R^t is formalized as follows. [sent-640, score-0.688]
73 Theorem 4.1. For any 0 < ε < 2/(k+2) there exists a constant C such that, for all t sufficiently small, |⟨R^t f, f⟩| ≤ C max( t^{ε(k+2)/2}, t^{1 − ε(k+2)/2} ) ⟨ ((1 − H^t)/t) f, f ⟩. In particular lim_{t→0} sup_{||f||_2 = 1} ⟨R^t f, f⟩ / ⟨((1 − H^t)/t) f, f⟩ = 0, and hence R^t is dominated by (1 − H^t)/t on L² as t tends to 0. [sent-641, score-0.187]
74 This result implies that for small values of t, the bottom eigenvalues and eigenfunctions of L^t are close to those of (1 − H^t)/t, which in turn implies convergence. [sent-642, score-0.242]
75 To establish this result, we will need two key estimates on the size of the perturbation R^t in two different norms. [sent-643, score-0.104]
76 Then there is C ∈ R such that, for all sufficiently small values of t, ||R^t f||_2 ≤ C √t ||f||_{H^{k/2+1}}. In what follows we give the proof of Theorem 4.1. [sent-648, score-0.062]
77 The proof of the Propositions requires technical estimates on the heat kernel and can be found in the longer version of the paper. [sent-650, score-0.305]
78 Let e be an eigenfunction of ∆ with eigenvalue λ. [sent-655, score-0.076]
79 [Proof of Theorem 4.1] Let e_i(x) be the ith eigenfunction of ∆ and let λ_i be the corresponding eigenvalue. [sent-658, score-0.512]
80 Recall that ei form an orthonormal basis of L2 (M). [sent-659, score-0.296]
81 Thus any function f ∈ L²(M) can be written uniquely as f(x) = Σ_{i=0}^∞ a_i e_i(x), where Σ_i a_i² < ∞. [sent-660, score-0.301]
82 For technical reasons we will assume that all our functions are perpendicular to the constant function and that the lowest eigenvalue under consideration is nonzero. [sent-661, score-0.095]
83 Recall also that H^t f = exp(−t∆) f and H^t e_i = exp(−tλ_i) e_i, so that ((1 − H^t)/t) e_i = ((1 − e^{−λ_i t})/t) e_i. Now let us fix t and consider the function φ(x) = (1 − e^{−xt})/t. It is easy to check that φ is a concave and increasing function of x. [sent-662, score-0.804]
84 Put x_0 = 1/√t, so that φ(x_0) = (1 − e^{−√t})/t. Splitting the positive real line into the two intervals [0, x_0] and [x_0, ∞), and using concavity and monotonicity, we observe that φ(x) ≥ min( ((1 − e^{−√t})/√t) x, (1 − e^{−√t})/t ). Note that lim_{t→0} (1 − e^{−√t})/√t = 1. [sent-665, score-0.075]
85 Therefore for t sufficiently small, φ(x) ≥ min( x/2, 1/(2√t) ). Thus ⟨((1 − H^t)/t) e_i, e_i⟩ = (1 − e^{−λ_i t})/t ≥ min( λ_i/2, 1/(2√t) )  (4). Now take f ∈ L², f(x) = Σ_{i≥1} a_i e_i(x). [sent-666, score-0.837]
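The elementary bound φ(λ) = (1 − e^{−λt})/t ≥ min(λ/2, 1/(2√t)) used at this step can be spot-checked numerically over a grid of λ and t (a sanity check under the stated form of the bound, not a proof):

```python
import math

# Sanity check (not a proof): for small t, phi(lam) = (1 - exp(-lam*t))/t is
# at least min(lam/2, 1/(2*sqrt(t))), verified here over a grid that includes
# the crossover point lam = 1/sqrt(t) and eigenvalues far beyond it.
ok = True
for t in (0.5, 0.1, 0.01, 0.001):
    for lam in (0.1, 1.0, 10.0, 1.0 / math.sqrt(t), 1e3, 1e6):
        phi = (1.0 - math.exp(-lam * t)) / t
        bound = min(lam / 2.0, 1.0 / (2.0 * math.sqrt(t)))
        ok = ok and phi >= bound - 1e-12   # small slack for rounding
```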
86 Taking α > 0, we split f as a sum of f_1 and f_2 as follows: f_1 = Σ_{λ_i ≤ α} a_i e_i, f_2 = Σ_{λ_i > α} a_i e_i. It is clear that f = f_1 + f_2 and, since f_1 and f_2 are orthogonal, ||f||²_2 = ||f_1||²_2 + ||f_2||²_2. We will now deal separately with f_1 and with f_2. [sent-668, score-0.602]
87 Recalling that f_2 only has basis components with eigenvalues greater than α, and using inequality (4), we see that ⟨((1 − H^t)/t) f, f⟩ ≥ ⟨((1 − H^t)/t) f_2, f_2⟩ ≥ (1/2) min( α, 1/√t ) ||f_2||²_2  (7). On the other hand, by Proposition 4. [sent-673, score-0.11]
88 SpecEss(L^t) ⊂ [ (1/2) t^{−1}, ∞ ). Proof: As noted before, L^t f is the difference of a multiplication operator and a compact operator, L^t f(p) = g(p) f(p) − Kf  (11), where g(p) = (1/t)(4πt)^{−k/2} ∫_M e^{−||p−q||²/4t} dµ_q and Kf is a convolution with a Gaussian. [sent-678, score-0.495]
89 As noted in [11], it is a fact of basic perturbation theory that SpecEss(L^t) = rg(g), where rg(g) is the range of the function g : M → R. [sent-679, score-0.214]
90 To estimate rg(g), observe first that lim_{t→0} (4πt)^{−k/2} ∫_M e^{−||p−q||²/4t} dµ_q = 1. We thus see that for t sufficiently small, (4πt)^{−k/2} ∫_M e^{−||p−q||²/4t} dµ_q > 1/2, and hence g(p) > (1/2) t^{−1}. [sent-680, score-0.252]
91 Let e^t be an eigenfunction of L^t with L^t e^t = λ^t e^t and λ^t < (1/2) t^{−1}. [sent-682, score-0.341]
92 It is a standard fact of functional analysis that such points are eigenvalues, with corresponding eigenspaces of finite dimension. [sent-688, score-0.142]
93 Consider now λ^t ∈ [0, (1/2) t^{−1}] and the corresponding eigenfunction e^t. [sent-689, score-0.235]
94 The theorem then follows from Theorem 23 and Proposition 25 in [11], which show convergence of spectral properties for the empirical operators. [sent-690, score-0.342]
95 Combining these results we obtain the following convergence diagram: Eig(L^t_n) → Eig(L^t) as n → ∞, and Eig(L^t) → Eig(∆) as t → 0, [sent-695, score-0.167]
96 where the first convergence is almost sure for λ_i ≤ (1/2) t^{−1}. [sent-827, score-0.167]
97 On the other hand, using the first arrow, we see that lim_{n→∞} P( ||e^t_{n,i} − e^t_i||_2 ≥ ε/2 ) = 0. Thus for any p > 0 and for each t there exists an N, such that [sent-831, score-0.167]
98 for all n > N, P{ ||e^t_{n,i} − e_i||_2 > ε } < p. Inverting this relationship, we see that for any N and for any probability p(N) there exists a t_N, such that [sent-833, score-0.167]
99 for all n > N, P{ ||e^{t_N}_{n,i} − e_i||_2 > ε } < p(N). Making p(N) tend to zero, we obtain convergence in probability. [sent-835, score-0.167]
100 D. Donoho and C. Grimes, Hessian Eigenmaps: new locally linear embedding techniques for high-dimensional data, PNAS, vol. [sent-872, score-0.069]
wordName wordTfidf (topN-words)
[('lt', 0.427), ('eigenfunctions', 0.339), ('ht', 0.277), ('ei', 0.268), ('heat', 0.224), ('operator', 0.222), ('manifold', 0.2), ('rt', 0.19), ('eigenfunction', 0.182), ('su', 0.175), ('convergence', 0.167), ('laplacian', 0.153), ('lim', 0.142), ('spectral', 0.134), ('eig', 0.109), ('eigenmaps', 0.106), ('perturbation', 0.104), ('di', 0.1), ('tn', 0.081), ('theorem', 0.079), ('cloud', 0.078), ('eigenvalue', 0.076), ('proposition', 0.072), ('erential', 0.069), ('specess', 0.069), ('riemannian', 0.068), ('belkin', 0.065), ('functional', 0.062), ('proof', 0.062), ('ith', 0.062), ('eigenvalues', 0.06), ('niyogi', 0.058), ('observe', 0.055), ('rg', 0.055), ('lafon', 0.055), ('et', 0.053), ('graph', 0.053), ('pointwise', 0.052), ('operators', 0.05), ('embedding', 0.046), ('div', 0.046), ('erentiable', 0.046), ('usions', 0.046), ('spectrum', 0.044), ('empirical', 0.041), ('grimes', 0.04), ('kf', 0.04), ('yale', 0.04), ('manifolds', 0.039), ('careful', 0.038), ('propositions', 0.036), ('vol', 0.036), ('boundedness', 0.036), ('geometrically', 0.036), ('luxburg', 0.036), ('geometry', 0.036), ('ciently', 0.036), ('converge', 0.035), ('eigenvectors', 0.034), ('coincide', 0.034), ('limt', 0.034), ('lle', 0.034), ('ai', 0.033), ('chicago', 0.033), ('approximation', 0.03), ('hessian', 0.029), ('main', 0.029), ('diagram', 0.029), ('donoho', 0.029), ('basis', 0.028), ('hein', 0.028), ('compact', 0.027), ('geometric', 0.027), ('consistency', 0.026), ('laplace', 0.025), ('erent', 0.025), ('von', 0.025), ('exists', 0.025), ('bousquet', 0.024), ('multiplication', 0.024), ('acts', 0.024), ('recall', 0.024), ('locally', 0.023), ('inequality', 0.022), ('uniform', 0.021), ('theorems', 0.021), ('guarantee', 0.021), ('sup', 0.02), ('prove', 0.02), ('colt', 0.02), ('concavity', 0.02), ('beltrami', 0.02), ('coifman', 0.02), ('columbus', 0.02), ('eigenspaces', 0.02), ('mbelkin', 0.02), ('sobolev', 0.02), ('summand', 0.02), ('usion', 0.02), ('technical', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 60 nips-2006-Convergence of Laplacian Eigenmaps
Author: Mikhail Belkin, Partha Niyogi
Abstract: Geometrically based methods for various tasks of machine learning have attracted considerable attention over the last few years. In this paper we show convergence of eigenvectors of the point cloud Laplacian to the eigenfunctions of the Laplace-Beltrami operator on the underlying manifold, thus establishing the first convergence results for a spectral dimensionality reduction algorithm in the manifold setting. 1
2 0.21618032 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
Author: Hariharan Narayanan, Mikhail Belkin, Partha Niyogi
Abstract: One of the intuitions underlying many graph-based methods for clustering and semi-supervised learning, is that class or cluster boundaries pass through areas of low probability density. In this paper we provide some formal analysis of that notion for a probability distribution. We introduce a notion of weighted boundary volume, which measures the length of the class/cluster boundary weighted by the density of the underlying probability distribution. We show that sizes of the cuts of certain commonly used data adjacency graphs converge to this continuous weighted volume of the boundary. keywords: Clustering, Semi-Supervised Learning 1
3 0.16266666 80 nips-2006-Fundamental Limitations of Spectral Clustering
Author: Boaz Nadler, Meirav Galun
Abstract: Spectral clustering methods are common graph-based approaches to clustering of data. Spectral clustering algorithms typically start from local information encoded in a weighted graph on the data and cluster according to the global eigenvectors of the corresponding (normalized) similarity matrix. One contribution of this paper is to present fundamental limitations of this general local to global approach. We show that based only on local information, the normalized cut functional is not a suitable measure for the quality of clustering. Further, even with a suitable similarity measure, we show that the first few eigenvectors of such adjacency matrices cannot successfully cluster datasets that contain structures at different scales of size and density. Based on these findings, a second contribution of this paper is a novel diffusion based measure to evaluate the coherence of individual clusters. Our measure can be used in conjunction with any bottom-up graph-based clustering method, it is scale-free and can determine coherent clusters at all scales. We present both synthetic examples and real image segmentation problems where various spectral clustering algorithms fail. In contrast, using this coherence measure finds the expected clusters at all scales. Keywords: Clustering, kernels, learning theory. 1
4 0.1547229 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models
Author: David Barber, Silvia Chiappa
Abstract: Linear Gaussian State-Space Models are widely used and a Bayesian treatment of parameters is therefore of considerable interest. The approximate Variational Bayesian method applied to these models is an attractive approach, used successfully in applications ranging from acoustics to bioinformatics. The most challenging aspect of implementing the method is in performing inference on the hidden state sequence of the model. We show how to convert the inference problem so that standard Kalman Filtering/Smoothing recursions from the literature may be applied. This is in contrast to previously published approaches based on Belief Propagation. Our framework both simplifies and unifies the inference problem, so that future applications may be more easily developed. We demonstrate the elegance of the approach on Bayesian temporal ICA, with an application to finding independent dynamical processes underlying noisy EEG signals. 1 Linear Gaussian State-Space Models Linear Gaussian State-Space Models (LGSSMs)1 are fundamental in time-series analysis [1, 2, 3]. In these models the observations v1:T 2 are generated from an underlying dynamical system on h1:T according to: v v vt = Bht + ηt , ηt ∼ N (0V , ΣV ), h h ht = Aht−1 + ηt , ηt ∼ N (0H , ΣH ) , where N (µ, Σ) denotes a Gaussian with mean µ and covariance Σ, and 0X denotes an Xdimensional zero vector. The observation vt has dimension V and the hidden state ht has dimension H. Probabilistically, the LGSSM is defined by: T p(v1:T , h1:T |Θ) = p(v1 |h1 )p(h1 ) p(vt |ht )p(ht |ht−1 ), t=2 with p(vt |ht ) = N (Bht , ΣV ), p(ht |ht−1 ) = N (Aht−1 , ΣH ), p(h1 ) = N (µ, Σ) and where Θ = {A, B, ΣH , ΣV , µ, Σ} denotes the model parameters. Because of the widespread use of these models, a Bayesian treatment of parameters is of considerable interest [4, 5, 6, 7, 8]. 
An exact implementation of the Bayesian LGSSM is formally intractable [8], and recently a Variational Bayesian (VB) approximation has been studied [4, 5, 6, 7, 9]. The most challenging part of implementing the VB method is performing inference over h1:T , and previous authors have developed their own specialized routines, based on Belief Propagation, since standard LGSSM inference routines appear, at first sight, not to be applicable. 1 2 Also called Kalman Filters/Smoothers, Linear Dynamical Systems. v1:T denotes v1 , . . . , vT . A key contribution of this paper is to show how the Variational Bayesian treatment of the LGSSM can be implemented using standard LGSSM inference routines. Based on the insight we provide, any standard inference method may be applied, including those specifically addressed to improve numerical stability [2, 10, 11]. In this article, we decided to describe the predictor-corrector and Rauch-Tung-Striebel recursions [2], and also suggest a small modification that reduces computational cost. The Bayesian LGSSM is particularly of interest when strong prior constraints are needed to find adequate solutions. One such case is in EEG signal analysis, whereby we wish to extract sources that evolve independently through time. Since EEG is particularly noisy [12], a prior that encourages sources to have preferential dynamics is advantageous. This application is discussed in Section 4, and demonstrates the ease of applying our VB framework. 2 Bayesian Linear Gaussian State-Space Models In the Bayesian treatment of the LGSSM, instead of considering the model parameters Θ as fixed, ˆ ˆ we define a prior distribution p(Θ|Θ), where Θ is a set of hyperparameters. Then: ˆ p(v1:T |Θ) = ˆ p(v1:T |Θ)p(Θ|Θ) . (1) Θ In a full Bayesian treatment we would define additional prior distributions over the hyperparameters ˆ Θ. 
Here we take instead the ML-II (‘evidence’) framework, in which the optimal set of hyperpaˆ ˆ rameters is found by maximizing p(v1:T |Θ) with respect to Θ [6, 7, 9]. For the parameter priors, here we define Gaussians on the columns of A and B 3 : H e− p(A|α, ΣH ) ∝ αj 2 ˆ ( A j −A j ) T ˆ Σ−1 (Aj −Aj ) H H , e− p(B|β, ΣV ) ∝ j=1 βj 2 T ˆ (Bj −Bj ) ˆ Σ−1 (Bj −Bj ) V , j=1 ˆ ˆ which has the effect of biasing the transition and emission matrices to desired forms A and B. The −1 −1 4 conjugate priors for general inverse covariances ΣH and ΣV are Wishart distributions [7] . In the simpler case assumed here of diagonal covariances these become Gamma distributions [5, 7]. The ˆ hyperparameters are then Θ = {α, β}5 . Variational Bayes ˆ Optimizing Eq. (1) with respect to Θ is difficult due to the intractability of the integrals. Instead, in VB, one considers the lower bound [6, 7, 9]6 : ˆ ˆ L = log p(v1:T |Θ) ≥ Hq (Θ, h1:T ) + log p(Θ|Θ) q(Θ) + E(h1:T , Θ) q(Θ,h1:T ) ≡ F, where E(h1:T , Θ) ≡ log p(v1:T , h1:T |Θ). Hd (x) signifies the entropy of the distribution d(x), and · d(x) denotes the expectation operator. The key approximation in VB is q(Θ, h1:T ) ≡ q(Θ)q(h1:T ), from which one may show that, for optimality of F, ˆ E(h1:T ,Θ) q(h1:T ) . q(h1:T ) ∝ e E(h1:T ,Θ) q(Θ) , q(Θ) ∝ p(Θ|Θ)e These coupled equations need to be iterated to convergence. The updates for the parameters q(Θ) are straightforward and are given in Appendices A and B. Once converged, the hyperparameters are ˆ updated by maximizing F with respect to Θ, which lead to simple update formulae [7]. Our main concern is with the update for q(h1:T ), for which this paper makes a departure from treatments previously presented. 3 More general Gaussian priors may be more suitable depending on the application. For expositional simplicity, we do not put priors on µ and Σ. 5 For simplicity, we keep the parameters of the Gamma priors fixed. 6 Strictly we should write throughout q(·|v1:T ). 
We omit the dependence on v1:T for notational convenience. 4 Unified Inference on q(h1:T ) 3 Optimally q(h1:T ) is Gaussian since, up to a constant, E(h1:T , Θ) − 1 2 q(Θ) is quadratic in h1:T 7 : T T (vt −Bht )T Σ−1 (vt −Bht ) V q(B,ΣV ) + (ht −Aht−1 ) Σ−1 (ht −Aht−1 ) H t=1 q(A,ΣH ) . (2) In addition, optimally, q(A|ΣH ) and q(B|ΣV ) are Gaussians (see Appendix A), so we can easily carry out the averages in Eq. (2). The further averages over q(ΣH ) and q(ΣV ) are also easy due to conjugacy. Whilst this defines the distribution q(h1:T ), quantities such as q(ht ), required for example for the parameter updates (see the Appendices), need to be inferred from this distribution. Clearly, in the non-Bayesian case, the averages over the parameters are not present, and the above simply represents the posterior distribution of an LGSSM whose visible variables have been clamped into their evidential states. In that case, inference can be performed using any standard LGSSM routine. Our aim, therefore, is to try to represent the averaged Eq. (2) directly as the posterior distribution q (h1:T |˜1:T ) of an LGSSM , for some suitable parameter settings. ˜ v Mean + Fluctuation Decomposition A useful decomposition is to write (vt − Bht )T Σ−1 (vt − Bht ) V = (vt − B ht )T Σ−1 (vt − B ht ) + hT SB ht , t V q(B,ΣV ) f luctuation mean and similarly (ht −Aht−1 )T Σ−1 (ht −Aht−1 ) H = (ht − A ht−1 )T Σ−1 (ht − A ht−1 ) +hT SA ht−1 , t−1 H q(A,ΣH ) mean f luctuation T −1 where the parameter covariances are SB ≡ B T Σ−1 B − B Σ−1 B = V HB and SA ≡ V V T −1 −1 −1 AT ΣH A − A ΣH A = HHA (for HA and HB defined in Appendix A). The mean terms simply represent a clamped LGSSM with averaged parameters. However, the extra contributions from the fluctuations mean that Eq. (2) cannot be written as a clamped LGSSM with averaged parameters. In order to deal with these extra terms, our idea is to treat the fluctuations as arising from an augmented visible variable, for which Eq. 
(2) can then be considered as a clamped LGSSM. Inference Using an Augmented LGSSM To represent Eq. (2) as an LGSSM q (h1:T |˜1:T ), we may augment vt and B as8 : ˜ v vt = vert(vt , 0H , 0H ), ˜ ˜ B = vert( B , UA , UB ), T where UA is the Cholesky decomposition of SA , so that UA UA = SA . Similarly, UB is the Cholesky decomposition of SB . The equivalent LGSSM q (h1:T |˜1:T ) is then completed by specifying9 ˜ v ˜ A≡ A , ˜ ΣH ≡ Σ−1 H −1 , ˜ ΣV ≡ diag( Σ−1 V −1 , IH , IH ), µ ≡ µ, ˜ ˜ Σ ≡ Σ. The validity of this parameter assignment can be checked by showing that, up to negligible constants, the exponent of this augmented LGSSM has the same form as Eq. (2)10 . Now that this has been written as an LGSSM q (h1:T |˜1:T ), standard inference routines in the literature may be applied to ˜ v compute q(ht |v1:T ) = q (ht |˜1:T ) [1, 2, 11]11 . ˜ v 7 For simplicity of exposition, we ignore the first time-point here. The notation vert(x1 , . . . , xn ) stands for vertically concatenating the arguments x1 , . . . , xn . 9 ˜ ˜ ˜ Strictly, we need a time-dependent emission Bt = B, for t = 1, . . . , T − 1. For time T , BT has the Cholesky factor UA replaced by 0H,H . 10 There are several ways of achieving a similar augmentation. We chose this since, in the non-Bayesian limit UA = UB = 0H,H , no numerical instabilities would be introduced. 11 Note that, since the augmented LGSSM q (h1:T |˜1:T ) is designed to match the fully clamped distribution ˜ v q(h1:T |v1:T ), the filtered posterior q (ht |˜1:t ) does not correspond to q(ht |v1:t ). ˜ v 8 Algorithm 1 LGSSM: Forward and backward recursive updates. The smoothed posterior p(ht |v1:T ) ˆ is returned in the mean hT and covariance PtT . 
t procedure F ORWARD 1a: P ← Σ −1 T T 1b: P ← DΣ, where D ≡ I − ΣUAB I + UAB ΣUAB UAB ˆ0 ← µ 2a: h1 ˆ 2b: h0 ← Dµ 1 1 ˆ ˆ ˆ 3: K ← P B T (BP B T + ΣV )−1 , P1 ← (I − KB)P , h1 ← h0 + K(vt − B h0 ) 1 1 1 for t ← 2, T do t−1 4: Ptt−1 ← APt−1 AT + ΣH t−1 5a: P ← Pt −1 T T 5b: P ← Dt Ptt−1 , where Dt ≡ I − Ptt−1 UAB I + UAB Ptt−1 UAB UAB ˆ ˆ 6a: ht−1 ← Aht−1 t t−1 ˆ ˆ 6b: ht−1 ← Dt Aht−1 t t−1 T ˆ ˆ ˆ 7: K ← P B (BP B T + ΣV )−1 , Ptt ← (I − KB)P , ht ← ht−1 + K(vt − B ht−1 ) t t t end for end procedure procedure BACKWARD for t ← T − 1, 1 do ← − t At ← Ptt AT (Pt+1 )−1 ← T − ← − T t t Pt ← Pt + At (Pt+1 − Pt+1 )At T ← ˆT − ˆ ˆ ˆ hT ← ht + At (ht+1 − Aht ) t t t end for end procedure For completeness, we decided to describe the standard predictor-corrector form of the Kalman Filter, together with the Rauch-Tung-Striebel Smoother [2]. These are given in Algorithm 1, where q (ht |˜1:T ) is computed by calling the FORWARD and BACKWARD procedures. ˜ v We present two variants of the FORWARD pass. Either we may call procedure FORWARD in ˜ ˜ ˜ ˜ ˜ ˜ Algorithm 1 with parameters A, B, ΣH , ΣV , µ, Σ and the augmented visible variables vt in which ˜ we use steps 1a, 2a, 5a and 6a. This is exactly the predictor-corrector form of a Kalman Filter [2]. Otherwise, in order to reduce the computational cost, we may call procedure FORWARD with the −1 ˜ ˜ parameters A, B , ΣH , Σ−1 , µ, Σ and the original visible variable vt in which we use steps ˜ ˜ V T 1b (where UAB UAB ≡ SA +SB ), 2b, 5b and 6b. The two algorithms are mathematically equivalent. Computing q(ht |v1:T ) = q (ht |˜1:T ) is then completed by calling the common BACKWARD pass. ˜ v The important point here is that the reader may supply any standard Kalman Filtering/Smoothing routine, and simply call it with the appropriate parameters. 
In some parameter regimes, or for very long time-series, numerical stability may be a serious concern, for which several stabilized algorithms have been developed over the years, for example the square-root forms [2, 10, 11]. By converting the problem to a standard form, we have therefore unified and simplified inference, so that future applications may be more readily developed^12.

3.1 Relation to Previous Approaches

An alternative approach to the one above, taken in [5, 7], is to write the posterior as

log q(h_{1:T}) = Σ_{t=2}^T φ_t(h_{t−1}, h_t) + const.

for suitably defined quadratic forms φ_t(h_{t−1}, h_t). Here the potentials φ_t(h_{t−1}, h_t) encode the averaging over the parameters A, B, Σ_H, Σ_V. The approach taken in [7] is to recognize this as a pairwise Markov chain, for which the Belief Propagation recursions may be applied. The approach in [5] is based on a Kullback-Leibler minimization of the posterior with a chain structure, which is algorithmically equivalent to Belief Propagation. Whilst mathematically valid procedures, the resulting algorithms do not correspond to any of the standard forms in the Kalman Filtering/Smoothing literature, whose properties have been well studied [14].

^12 The computation of the log-likelihood bound does not require any augmentation.

4 An Application to Bayesian ICA

A particular case for which the Bayesian LGSSM is of interest is in extracting independent source signals underlying a multivariate time-series [5, 15]. This will demonstrate how the approach developed in Section 3 makes VB easy to apply. The sources s^i are modeled as independent in the following sense:

p(s^i_{1:T}, s^j_{1:T}) = p(s^i_{1:T}) p(s^j_{1:T}), for i ≠ j, i, j = 1, ..., C.

Independence implies block-diagonal transition and state-noise matrices A, Σ_H and Σ, where each block c has dimension H_c.
A one-dimensional source s^c_t for each independent dynamical subsystem is then formed from s^c_t = (1^c)^T h^c_t, where 1^c is a unit vector and h^c_t is the state of dynamical system c. Combining the sources, we can write s_t = P h_t, where P = diag((1^1)^T, ..., (1^C)^T) and h_t = vert(h^1_t, ..., h^C_t). The resulting emission matrix is constrained to be of the form B = W P, where W is the V × C mixing matrix. This means that the observations are formed by linearly mixing the sources, v_t = W s_t + η_t. The graphical structure of this model is presented in Figure 1 (the structure of the LGSSM for ICA). To encourage redundant components to be removed, we place a zero-mean Gaussian prior on W. In this case, we do not define a prior for the parameters Σ_H and Σ_V, which are instead considered as hyperparameters. More details of the model are given in [15].

The constraint B = W P requires a minor modification from Section 3, as we discuss below.

Inference on q(h_{1:T}) A small modification of the mean + fluctuation decomposition for B occurs, namely:

⟨(v_t − B h_t)^T Σ_V^{-1} (v_t − B h_t)⟩_{q(W)} = (v_t − ⟨B⟩ h_t)^T ⟨Σ_V^{-1}⟩ (v_t − ⟨B⟩ h_t) + h_t^T P^T S_W P h_t,

where ⟨B⟩ ≡ ⟨W⟩P and S_W = V H_W. The quantities ⟨W⟩ and H_W are obtained as in Appendix A.1 with the replacement h_t ← P h_t. To represent the above as an LGSSM, we augment v_t and B as

ṽ_t = vert(v_t, 0_H, 0_C), B̃ = vert(⟨B⟩, U_A, U_W P),

where U_W is the Cholesky factor of S_W. The equivalent LGSSM is then completed by specifying Ã ≡ ⟨A⟩, Σ̃_H ≡ Σ_H, Σ̃_V ≡ diag(Σ_V, I_H, I_C), μ̃ ≡ μ, Σ̃ ≡ Σ, and inference for q(h_{1:T}) is performed using Algorithm 1. This demonstrates the elegance and unity of the approach in Section 3, since no new algorithm needs to be developed to perform inference, even in this special constrained-parameter case.

4.1 Demonstration

As a simple demonstration, we used an LGSSM to generate 3 sources s^c_t with random 5×5 transition matrices A^c, μ = 0_H and Σ ≡ Σ_H ≡ I_H.
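The constrained emission B = W P can be constructed as follows. This is an illustrative numpy sketch with made-up dimensions; the key point it checks is that B h_t = W (P h_t), i.e. the observations depend on the state only through the scalar sources.

```python
import numpy as np

rng = np.random.default_rng(1)
V, C = 4, 3              # observations and sources (illustrative)
Hc = [2, 2, 3]           # per-source state dimensions H_c
H = sum(Hc)

# P extracts one scalar source per block: P = diag((1^1)^T, ..., (1^C)^T).
P = np.zeros((C, H))
start = 0
for c, hc in enumerate(Hc):
    P[c, start:start + hc] = 1.0   # here 1^c is taken as the all-ones vector
    start += hc

W = rng.standard_normal((V, C))  # V x C mixing matrix
B = W @ P                        # constrained emission B = W P

h_t = rng.standard_normal(H)
s_t = P @ h_t                    # sources s_t = P h_t
```

Whether 1^c is all-ones or a coordinate vector is a modeling choice; the block structure of P is what matters for the constraint.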
The sources were mixed into three observations, v_t = W s_t + η_t, for W chosen with elements from a zero-mean unit-variance Gaussian distribution, and Σ_V = I_V. We then trained a Bayesian LGSSM with 5 sources and 7 × 7 transition matrices A^c. To bias the model to find the simplest sources, we used Â^c ≡ 0_{H_c,H_c} for all sources. In Figure 2a and Figure 2b we see the original sources and the noisy observations respectively. In Figure 2c we see the estimated sources from our method after convergence of the hyperparameter updates. Two of the 5 sources have been removed, and the remaining three are a reasonable estimation of the original sources.

Figure 2: (a) Original sources s_t. (b) Observations resulting from mixing the original sources, v_t = W s_t + η_t, η_t ∼ N(0, I). (c) Recovered sources using the Bayesian LGSSM. (d) Sources found with the MAP LGSSM.

Figure 3: (a) Original raw EEG recordings from 4 channels. (b-e) 16 sources s_t estimated by the Bayesian LGSSM.

Another possible approach for introducing prior knowledge is to use a Maximum a Posteriori (MAP) procedure, by adding a prior term to the original log-likelihood:

log p(v_{1:T}|A, W, Σ_H, Σ_V, μ, Σ) + log p(A|α) + log p(W|β).

However, it is not clear how to reliably find the hyperparameters α and β in this case. One solution is to estimate them by optimizing the new objective function jointly with respect to the parameters and hyperparameters (this is the so-called joint MAP estimation; see for example [16]). A typical result of using this joint MAP approach on the artificial data is presented in Figure 2d. The joint MAP does not estimate the hyperparameters well, and the incorrect number of sources is identified.
4.2 Application to EEG Analysis

In Figure 3a we plot three seconds of EEG data recorded from 4 channels (located in the right hemisphere) while a person performs imagined movement of the right hand. As is typical in EEG, each channel shows drift terms below 1 Hz, which correspond to artifacts of the instrumentation, together with 50 Hz mains contamination; these mask the rhythmical activity related to the mental task, mainly centered at 10 and 20 Hz [17]. We would therefore like a method which enables us to extract components in these information-rich 10 and 20 Hz frequency bands. Standard ICA methods such as FastICA do not find satisfactory sources based on raw 'noisy' data, and preprocessing with band-pass filters is usually required. Additionally, in EEG research, flexibility in the number of recovered sources is important, since there may be many independent oscillators of interest underlying the observations, and we would like some way to automatically determine their effective number.

To preferentially find sources at particular frequencies, we specified a block-diagonal matrix Â^c for each source c, where each block is a 2 × 2 rotation matrix at the desired frequency. We defined the following 16 groups of frequencies: [0.5], [0.5], [0.5], [0.5]; [10,11], [10,11], [10,11], [10,11]; [20,21], [20,21], [20,21], [20,21]; [50], [50], [50], [50]. The temporal evolution of the sources obtained after training the Bayesian LGSSM is given in Figure 3(b-e) (grouped by frequency range). The Bayesian LGSSM removed 4 unnecessary sources from the mixing matrix W, that is, one [10,11] Hz and three [20,21] Hz sources. The first 4 sources contain dominant low-frequency drift; sources 5, 6 and 8 contain [10,11] Hz activity, while source 10 contains [20,21] Hz centered activity. Of the 4 sources initialized to 50 Hz, only 2 retained 50 Hz activity, while the Â^c of the other two changed to model other frequencies present in the EEG.
This method demonstrates the usefulness and applicability of the VB method in a real-world situation.

5 Conclusion

We considered the application of Variational Bayesian learning to Linear Gaussian State-Space Models. This is an important class of models with widespread application, and finding a simple way to implement this approximate Bayesian procedure is of considerable interest. The most demanding part of the procedure is inference of the hidden states of the model. Previously, this has been achieved using Belief Propagation, which differs from inference in the Kalman Filtering/Smoothing literature, for which highly efficient and stabilized procedures exist. A central contribution of this paper is to show how inference can be written using the standard Kalman Filtering/Smoothing recursions by augmenting the original model. Additionally, a minor modification to the standard Kalman Filtering routine may be applied for computational efficiency. We demonstrated the elegance and unity of our approach by showing how to easily apply a Variational Bayes analysis of temporal ICA. Specifically, our Bayes ICA approach successfully extracts independent processes underlying EEG signals, biased towards preferred frequency ranges. We hope that this simple and unifying interpretation of Variational Bayesian LGSSMs may therefore facilitate the further application to related models.

A Parameter Updates for A and B

A.1 Determining q(B|Σ_V)

By examining F, the contribution of q(B|Σ_V) can be interpreted as the negative KL divergence between q(B|Σ_V) and a Gaussian. Hence, optimally, q(B|Σ_V) is a Gaussian. The covariance [Σ_B]_{ij,kl} ≡ ⟨(B_ij − ⟨B_ij⟩)(B_kl − ⟨B_kl⟩)⟩ (averages with respect to q(B|Σ_V)) is given by:

[Σ_B]_{ij,kl} = [H_B^{-1}]_{jl} [Σ_V]_{ik}, where [H_B]_{jl} ≡ Σ_{t=1}^T ⟨h_t^j h_t^l⟩_{q(h_t)} + β_j δ_{jl}.

The mean is given by ⟨B⟩ = N_B H_B^{-1}, where [N_B]_{ij} ≡ Σ_{t=1}^T ⟨h_t^j⟩_{q(h_t)} v_t^i + β_j B̂_ij.
Determining q(A|Σ_H) Optimally, q(A|Σ_H) is a Gaussian with covariance

[Σ_A]_{ij,kl} = [H_A^{-1}]_{jl} [Σ_H]_{ik}, where [H_A]_{jl} ≡ Σ_{t=1}^{T−1} ⟨h_t^j h_t^l⟩_{q(h_t)} + α_j δ_{jl}.

The mean is given by ⟨A⟩ = N_A H_A^{-1}, where [N_A]_{ij} ≡ Σ_{t=2}^T ⟨h_{t−1}^j h_t^i⟩_{q(h_{t−1:t})} + α_j Â_ij.

B Covariance Updates

By specifying a Wishart prior for the inverse of the covariances, conjugate update formulae are possible. In practice, it is more common to specify diagonal inverse covariances, for which the corresponding priors are simply Gamma distributions [7, 5]. For this simple diagonal case, the explicit updates are given below.

Determining q(Σ_V) For the constraint Σ_V^{-1} = diag(ρ), where each diagonal element follows a Gamma prior Ga(b_1, b_2) [7], q(ρ) factorizes and the optimal updates are

q(ρ_i) = Ga(b_1 + T/2, b_2 + (1/2)[Σ_{t=1}^T (v_t^i)^2 − [G_B]_ii + Σ_j β_j B̂_ij^2]), where G_B ≡ N_B H_B^{-1} N_B^T.

Determining q(Σ_H) Analogously, for Σ_H^{-1} = diag(τ) with prior Ga(a_1, a_2) [5], the updates are

q(τ_i) = Ga(a_1 + (T−1)/2, a_2 + (1/2)[Σ_{t=2}^T ⟨(h_t^i)^2⟩ − [G_A]_ii + Σ_j α_j Â_ij^2]), where G_A ≡ N_A H_A^{-1} N_A^T.

Acknowledgments This work is supported by the European DIRAC Project FP6-0027787. This paper only reflects the authors' views and funding agencies are not liable for any use that may be made of the information contained herein.

References

[1] Y. Bar-Shalom and X.-R. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, 1998.
[2] M. S. Grewal and A. P. Andrews. Kalman Filtering: Theory and Practice Using MATLAB. John Wiley and Sons, Inc., 2001.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications. Springer, 2000.
[4] M. J. Beal, F. Falciani, Z. Ghahramani, C. Rangel, and D. L. Wild. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics, 21:349-356, 2005.
[5] A. T. Cemgil and S. J. Godsill. Probabilistic phase vocoder and its application to interpolation of missing values in audio signals.
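The diagonal-case update for q(Σ_V) can be sketched as follows. This is an illustrative numpy fragment only: all sufficient statistics (v, N_B, H_B, β) are random placeholders standing in for the true posterior quantities, and the names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
T, V, H = 50, 2, 3

# Placeholder sufficient statistics standing in for the text's quantities.
v = rng.standard_normal((T, V))                  # observations v_t
N_B = rng.standard_normal((V, H))                # N_B
M = np.eye(H) + 0.01 * rng.standard_normal((H, H))
H_B = M @ M.T                                    # H_B, made SPD
B_mean = N_B @ np.linalg.inv(H_B)                # <B> = N_B H_B^{-1}
B_hat = B_mean                                   # prior mean placeholder
beta = np.ones(H)                                # ARD hyperparameters
b1, b2 = 1e-3, 1e-3                              # Gamma prior Ga(b1, b2)

G_B = N_B @ np.linalg.inv(H_B) @ N_B.T           # G_B = N_B H_B^{-1} N_B^T

# q(rho_i) = Ga(b1 + T/2,
#               b2 + (sum_t (v_t^i)^2 - [G_B]_ii + sum_j beta_j Bhat_ij^2)/2)
shape = b1 + T / 2.0
rate = b2 + 0.5 * ((v ** 2).sum(axis=0) - np.diag(G_B)
                   + (beta * B_hat ** 2).sum(axis=1))
```

The update for q(Σ_H) has the same structure with (a_1, a_2), G_A and the ⟨(h_t^i)^2⟩ statistics.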
In 13th European Signal Processing Conference, 2005.
[6] H. Valpola and J. Karhunen. An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Computation, 14:2647-2692, 2002.
[7] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[8] M. Davy and S. J. Godsill. Bayesian harmonic models for musical signal analysis (with discussion). In J. O. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics VII. Oxford University Press, 2003.
[9] D. J. C. MacKay. Ensemble learning and evidence maximisation. Unpublished manuscript: www.variational-bayes.org, 1995.
[10] M. Morf and T. Kailath. Square-root algorithms for least-squares estimation. IEEE Transactions on Automatic Control, 20:487-497, 1975.
[11] P. Park and T. Kailath. New square-root smoothing algorithms. IEEE Transactions on Automatic Control, 41:727-732, 1996.
[12] E. Niedermeyer and F. Lopes Da Silva. Electroencephalography: basic principles, clinical applications and related fields. Lippincott Williams and Wilkins, 1999.
[13] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11:305-345, 1999.
[14] M. Verhaegen and P. Van Dooren. Numerical aspects of different Kalman filter implementations. IEEE Transactions on Automatic Control, 31:907-917, 1986.
[15] S. Chiappa and D. Barber. Bayesian linear Gaussian state-space models for biosignal decomposition. Signal Processing Letters, 14, 2007.
[16] S. S. Saquib, C. A. Bouman, and K. Sauer. ML parameter estimation for Markov random fields with applications to Bayesian tomography. IEEE Transactions on Image Processing, 7:1029-1044, 1998.
[17] G. Pfurtscheller and F. H. Lopes da Silva. Event-related EEG/MEG synchronization and desynchronization: basic principles. Clinical Neurophysiology, pages 1842-1857, 1999.
A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems
Author: David Barber, Bertrand Mesot
Abstract: We introduce a method for approximate smoothed inference in a class of switching linear dynamical systems, based on a novel form of Gaussian Sum smoother. This class includes the switching Kalman Filter and the more general case of switch transitions dependent on the continuous latent state. The method improves on the standard Kim smoothing approach by dispensing with one of the key approximations, thus making fuller use of the available future information. Whilst the only central assumption required is projection to a mixture of Gaussians, we show that an additional conditional independence assumption results in a simpler but stable and accurate alternative. Unlike the alternative unstable Expectation Propagation procedure, our method consists only of a single forward and backward pass and is reminiscent of the standard smoothing 'correction' recursions in the simpler linear dynamical system. The algorithm performs well on both toy experiments and in a large scale application to noise robust speech recognition.

1 Switching Linear Dynamical System

The Linear Dynamical System (LDS) [1] is a key temporal model in which a latent linear process generates the observed series. For complex time-series which are not well described globally by a single LDS, we may break the time-series into segments, each modeled by a potentially different LDS. This is the basis for the Switching LDS (SLDS) [2, 3, 4, 5] where, for each time t, a switch variable s_t ∈ 1, ..., S describes which of the LDSs is to be used. The observation (or 'visible') v_t ∈ R^V is linearly related to the hidden state h_t ∈ R^H with additive noise η by

v_t = B(s_t) h_t + η^v(s_t), i.e. p(v_t|h_t, s_t) = N(B(s_t) h_t, Σ^v(s_t)),  (1)

where N(μ, Σ) denotes a Gaussian distribution with mean μ and covariance Σ.
The transition dynamics of the continuous hidden state h_t is linear,

h_t = A(s_t) h_{t−1} + η^h(s_t), i.e. p(h_t|h_{t−1}, s_t) = N(A(s_t) h_{t−1}, Σ^h(s_t)).  (2)

The switch s_t may depend on both the previous s_{t−1} and h_{t−1}. This is an augmented SLDS (aSLDS), and defines the model

p(v_{1:T}, h_{1:T}, s_{1:T}) = Π_{t=1}^T p(v_t|h_t, s_t) p(h_t|h_{t−1}, s_t) p(s_t|h_{t−1}, s_{t−1}).

The standard SLDS [4] considers only switch transitions p(s_t|s_{t−1}). At time t = 1, p(s_1|h_0, s_0) simply denotes the prior p(s_1), and p(h_1|h_0, s_1) denotes p(h_1|s_1).

The aim of this article is to address how to perform inference in the aSLDS. In particular we desire the filtered estimate p(h_t, s_t|v_{1:t}) and the smoothed estimate p(h_t, s_t|v_{1:T}), for any 1 ≤ t ≤ T. Both filtered and smoothed inference in the SLDS is intractable, scaling exponentially with time [4].

Figure 1: The independence structure of the aSLDS (switch chain s_1, ..., s_4 over hidden chain h_1, ..., h_4 emitting v_1, ..., v_4). Square nodes denote discrete variables, round nodes continuous variables. In the SLDS, links from h to s are not normally considered.

2 Expectation Correction

Our approach to approximating p(h_t, s_t|v_{1:T}) mirrors the Rauch-Tung-Striebel 'correction' smoother for the simpler LDS [1]. The method consists of a single forward pass to recursively find the filtered posterior p(h_t, s_t|v_{1:t}), followed by a single backward pass to correct this into a smoothed posterior p(h_t, s_t|v_{1:T}). The forward pass we use is equivalent to standard Assumed Density Filtering (ADF) [6]. The main contribution of this paper is a novel form of backward pass, based only on collapsing the smoothed posterior to a mixture of Gaussians. Together with the ADF forward pass, we call the method Expectation Correction, since it corrects the moments found from the forward pass. A more detailed description of the method, including pseudocode, is given in [7].

2.1 Forward Pass (Filtering)

Readers familiar with ADF may wish to continue directly to Section 2.2.
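The generative model above can be made concrete by ancestral sampling. The following numpy sketch is illustrative (names are ours) and covers the standard SLDS case, where the switch depends only on s_{t−1}:

```python
import numpy as np

def sample_slds(A, B, SH, SV, pi, Ptrans, T, rng):
    """Ancestral sampling from a standard SLDS:
    s_t ~ p(s_t|s_{t-1}), h_t = A(s_t) h_{t-1} + eta_h, v_t = B(s_t) h_t + eta_v.
    A, B, SH, SV are lists indexed by the switch value."""
    S, H = len(A), A[0].shape[0]
    Vdim = B[0].shape[0]
    s = np.zeros(T, dtype=int)
    h = np.zeros((T, H)); v = np.zeros((T, Vdim))
    for t in range(T):
        s[t] = rng.choice(S, p=pi if t == 0 else Ptrans[s[t - 1]])
        mean = np.zeros(H) if t == 0 else A[s[t]] @ h[t - 1]
        h[t] = rng.multivariate_normal(mean, SH[s[t]])
        v[t] = rng.multivariate_normal(B[s[t]] @ h[t], SV[s[t]])
    return s, h, v
```

For the aSLDS one would replace `Ptrans[s[t-1]]` by a transition distribution that also depends on h_{t−1}.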
Our aim is to form a recursion for p(s_t, h_t|v_{1:t}), based on a Gaussian mixture approximation of p(h_t|s_t, v_{1:t}). Without loss of generality, we may decompose the filtered posterior as

p(h_t, s_t|v_{1:t}) = p(h_t|s_t, v_{1:t}) p(s_t|v_{1:t}).  (3)

The exact representation of p(h_t|s_t, v_{1:t}) is a mixture with O(S^t) components. We therefore approximate this with a smaller I-component mixture

p(h_t|s_t, v_{1:t}) ≈ Σ_{i_t=1}^I p(h_t|i_t, s_t, v_{1:t}) p(i_t|s_t, v_{1:t}),

where p(h_t|i_t, s_t, v_{1:t}) is a Gaussian parameterized with mean f(i_t, s_t) and covariance F(i_t, s_t). To find a recursion for these parameters, consider

p(h_{t+1}|s_{t+1}, v_{1:t+1}) = Σ_{s_t,i_t} p(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1}) p(s_t, i_t|s_{t+1}, v_{1:t+1}).  (4)

Evaluating p(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1}) We find p(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1}) by first computing the joint distribution p(h_{t+1}, v_{t+1}|s_t, i_t, s_{t+1}, v_{1:t}), which is a Gaussian with covariance and mean elements

Σ_hh = A(s_{t+1}) F(i_t, s_t) A^T(s_{t+1}) + Σ^h(s_{t+1}), Σ_vv = B(s_{t+1}) Σ_hh B^T(s_{t+1}) + Σ^v(s_{t+1}),
Σ_vh = B(s_{t+1}) Σ_hh, μ_v = B(s_{t+1}) A(s_{t+1}) f(i_t, s_t), μ_h = A(s_{t+1}) f(i_t, s_t),  (5)

and then conditioning on v_{t+1}^1. For the case S = 1, this forms the usual Kalman Filter recursions [1].

Evaluating p(s_t, i_t|s_{t+1}, v_{1:t+1}) The mixture weight in (4) can be found from the decomposition

p(s_t, i_t|s_{t+1}, v_{1:t+1}) ∝ p(v_{t+1}|i_t, s_t, s_{t+1}, v_{1:t}) p(s_{t+1}|i_t, s_t, v_{1:t}) p(i_t|s_t, v_{1:t}) p(s_t|v_{1:t}).  (6)

^1 p(x|y) is a Gaussian with mean μ_x + Σ_xy Σ_yy^{-1}(y − μ_y) and covariance Σ_xx − Σ_xy Σ_yy^{-1} Σ_yx.

The first factor in (6), p(v_{t+1}|i_t, s_t, s_{t+1}, v_{1:t}), is a Gaussian with mean μ_v and covariance Σ_vv, as given in (5). The last two factors p(i_t|s_t, v_{1:t}) and p(s_t|v_{1:t}) are given from the previous iteration. Finally, p(s_{t+1}|i_t, s_t, v_{1:t}) is found from

p(s_{t+1}|i_t, s_t, v_{1:t}) = ⟨p(s_{t+1}|h_t, s_t)⟩_{p(h_t|i_t, s_t, v_{1:t})},  (7)

where ⟨·⟩_p denotes expectation with respect to p. In the SLDS, (7) is replaced by the Markov transition p(s_{t+1}|s_t).
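A single component update of this forward pass, forming the joint of (h_{t+1}, v_{t+1}) via Eq. (5) and conditioning on the observation as in footnote 1, can be sketched as follows (illustrative numpy, our names):

```python
import numpy as np

def adf_moment_update(f, F, A, B, SH, SV, v_new):
    """One ADF component update, Eq. (5): form the joint Gaussian of
    (h_{t+1}, v_{t+1}) for one (i_t, s_t, s_{t+1}) setting, then
    condition on v_{t+1}. Returns new moments and the log weight
    factor log p(v_{t+1}|i_t, s_t, s_{t+1}, v_{1:t})."""
    mu_h = A @ f
    mu_v = B @ mu_h
    Shh = A @ F @ A.T + SH
    Svv = B @ Shh @ B.T + SV
    Svh = B @ Shh                       # cov(v_{t+1}, h_{t+1})
    K = Shh @ B.T @ np.linalg.inv(Svv)  # = Sigma_hv Sigma_vv^{-1}
    f_new = mu_h + K @ (v_new - mu_v)
    F_new = Shh - K @ Svh
    d = v_new - mu_v
    logw = -0.5 * (d @ np.linalg.solve(Svv, d)
                   + np.linalg.slogdet(2 * np.pi * Svv)[1])
    return f_new, F_new, logw
```

Looping this over all (i_t, s_t) for each s_{t+1}, normalizing the weights (6), and collapsing the resulting I × S Gaussians back to I components gives the full recursion.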
In the aSLDS, however, (7) will generally need to be computed numerically.

Closing the recursion We are now in a position to calculate (4). For each setting of the variable s_{t+1}, we have a mixture of I × S Gaussians, which we numerically collapse back to I Gaussians to form

p(h_{t+1}|s_{t+1}, v_{1:t+1}) ≈ Σ_{i_{t+1}=1}^I p(h_{t+1}|i_{t+1}, s_{t+1}, v_{1:t+1}) p(i_{t+1}|s_{t+1}, v_{1:t+1}).

Any method of choice may be supplied to collapse a mixture to a smaller mixture; our code simply repeatedly merges low-weight components. In this way the new mixture coefficients p(i_{t+1}|s_{t+1}, v_{1:t+1}), i_{t+1} ∈ 1, ..., I, are defined, completing the description of how to form a recursion for p(h_{t+1}|s_{t+1}, v_{1:t+1}) in (3). A recursion for the switch variable is given by

p(s_{t+1}|v_{1:t+1}) ∝ Σ_{s_t,i_t} p(v_{t+1}|s_{t+1}, i_t, s_t, v_{1:t}) p(s_{t+1}|i_t, s_t, v_{1:t}) p(i_t|s_t, v_{1:t}) p(s_t|v_{1:t}),

where all terms have been computed during the recursion for p(h_{t+1}|s_{t+1}, v_{1:t+1}). The likelihood p(v_{1:T}) may be found by recursing p(v_{1:t+1}) = p(v_{t+1}|v_{1:t}) p(v_{1:t}), where

p(v_{t+1}|v_{1:t}) = Σ_{i_t,s_t,s_{t+1}} p(v_{t+1}|i_t, s_t, s_{t+1}, v_{1:t}) p(s_{t+1}|i_t, s_t, v_{1:t}) p(i_t|s_t, v_{1:t}) p(s_t|v_{1:t}).

2.2 Backward Pass (Smoothing)

The main contribution of this paper is to find a suitable way to 'correct' the filtered posterior p(s_t, h_t|v_{1:t}) obtained from the forward pass into a smoothed posterior p(s_t, h_t|v_{1:T}). We derive this for the case of a single Gaussian representation. The extension to the mixture case is straightforward and presented in [7]. We approximate the smoothed posterior p(h_t|s_t, v_{1:T}) by a Gaussian with mean g(s_t) and covariance G(s_t), and our aim is to find a recursion for these parameters.
A useful starting point for a recursion is:

p(h_t, s_t|v_{1:T}) = Σ_{s_{t+1}} p(s_{t+1}|v_{1:T}) p(h_t|s_t, s_{t+1}, v_{1:T}) p(s_t|s_{t+1}, v_{1:T}).

The term p(h_t|s_t, s_{t+1}, v_{1:T}) may be computed as

p(h_t|s_t, s_{t+1}, v_{1:T}) = ∫_{h_{t+1}} p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t}) p(h_{t+1}|s_t, s_{t+1}, v_{1:T}).  (8)

The recursion therefore requires p(h_{t+1}|s_t, s_{t+1}, v_{1:T}), which we can write as

p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) ∝ p(h_{t+1}|s_{t+1}, v_{1:T}) p(s_t|s_{t+1}, h_{t+1}, v_{1:t}).  (9)

The difficulty here is that the functional form of p(s_t|s_{t+1}, h_{t+1}, v_{1:t}) is not squared exponential in h_{t+1}, so that p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) will not be Gaussian^2. One possibility would be to approximate the non-Gaussian p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) by a Gaussian (or mixture thereof) by minimizing the Kullback-Leibler divergence between the two, or by performing moment matching in the case of a single Gaussian. A simpler alternative (which forms 'standard' EC) is to make the assumption p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) ≈ p(h_{t+1}|s_{t+1}, v_{1:T}), where p(h_{t+1}|s_{t+1}, v_{1:T}) is already known from the previous backward recursion. Under this assumption, the recursion becomes

p(h_t, s_t|v_{1:T}) ≈ Σ_{s_{t+1}} p(s_{t+1}|v_{1:T}) p(s_t|s_{t+1}, v_{1:T}) ⟨p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})}.  (10)

^2 In the exact calculation, p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) is a mixture of Gaussians, see [7]. However, since in (9) the term p(h_{t+1}|s_{t+1}, v_{1:T}) will only be approximately computed during the recursion, our approximation to p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) will not be a mixture of Gaussians.

Evaluating ⟨p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})} This average is a Gaussian in h_t, whose statistics we will now compute. First we find p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t}), which may be obtained from the joint distribution

p(h_t, h_{t+1}|s_t, s_{t+1}, v_{1:t}) = p(h_{t+1}|h_t, s_{t+1}) p(h_t|s_t, v_{1:t}),  (11)

which itself can be found by a forward dynamics from the filtered estimate p(h_t|s_t, v_{1:t}).
The statistics for the marginal p(h_t|s_t, s_{t+1}, v_{1:t}) are simply those of p(h_t|s_t, v_{1:t}), since s_{t+1} carries no extra information about h_t. The remaining statistics are the mean of h_{t+1}, the covariance of h_{t+1} and the cross-variance between h_t and h_{t+1}, which are given by

⟨h_{t+1}⟩ = A(s_{t+1}) f_t(s_t), Σ_{t+1,t+1} = A(s_{t+1}) F_t(s_t) A^T(s_{t+1}) + Σ^h(s_{t+1}), Σ_{t+1,t} = A(s_{t+1}) F_t(s_t).

Given the statistics of (11), we may now condition on h_{t+1} to find p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t}). Doing so effectively constitutes a reversal of the dynamics,

h_t = A^←(s_t, s_{t+1}) h_{t+1} + η^←(s_t, s_{t+1}),

where A^←(s_t, s_{t+1}) and η^←(s_t, s_{t+1}) ∼ N(m^←(s_t, s_{t+1}), Σ^←(s_t, s_{t+1})) are easily found using conditioning. Averaging the above reversed dynamics over p(h_{t+1}|s_{t+1}, v_{1:T}), we find that ⟨p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})} is a Gaussian with statistics

μ_t = A^←(s_t, s_{t+1}) g(s_{t+1}) + m^←(s_t, s_{t+1}), Σ_{t,t} = A^←(s_t, s_{t+1}) G(s_{t+1}) A^{←T}(s_t, s_{t+1}) + Σ^←(s_t, s_{t+1}).

These equations directly mirror the standard RTS backward pass [1].

Evaluating p(s_t|s_{t+1}, v_{1:T}) The main departure of EC from previous methods is in treating the term

p(s_t|s_{t+1}, v_{1:T}) = ⟨p(s_t|h_{t+1}, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})}.  (12)

The term p(s_t|h_{t+1}, s_{t+1}, v_{1:t}) is given by

p(s_t|h_{t+1}, s_{t+1}, v_{1:t}) = p(h_{t+1}|s_{t+1}, s_t, v_{1:t}) p(s_t, s_{t+1}|v_{1:t}) / Σ_{s_t'} p(h_{t+1}|s_{t+1}, s_t', v_{1:t}) p(s_t', s_{t+1}|v_{1:t}).  (13)

Here p(s_t, s_{t+1}|v_{1:t}) = p(s_{t+1}|s_t, v_{1:t}) p(s_t|v_{1:t}), where p(s_{t+1}|s_t, v_{1:t}) occurs in the forward pass, (7). In (13), p(h_{t+1}|s_{t+1}, s_t, v_{1:t}) is found by marginalizing (11). Computing the average of (13) with respect to p(h_{t+1}|s_{t+1}, v_{1:T}) may be achieved by any numerical integration method desired. A simple approximation is to evaluate the integrand at the mean value of the averaging distribution p(h_{t+1}|s_{t+1}, v_{1:T}).
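The reversed-dynamics statistics can be sketched as follows (illustrative numpy, our names). For a single switch this step reduces exactly to one RTS smoother correction, which the usage below exploits as a sanity check.

```python
import numpy as np

def ec_backward_step(f, F, g_next, G_next, A, SH):
    """One EC backward correction (single-Gaussian case): reverse the
    dynamics p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t}) and average over the
    smoothed p(h_{t+1}|s_{t+1}, v_{1:T}) with moments (g_next, G_next)."""
    S_next = A @ F @ A.T + SH                  # Sigma_{t+1,t+1}
    S_cross = A @ F                            # Sigma_{t+1,t}
    Arev = S_cross.T @ np.linalg.inv(S_next)   # A^<-: reversed dynamics
    m = f - Arev @ (A @ f)                     # m^<-: reversed mean shift
    Srev = F - Arev @ S_cross                  # Sigma^<-
    g = Arev @ g_next + m                      # smoothed mean mu_t
    G = Arev @ G_next @ Arev.T + Srev          # smoothed covariance
    return g, G
```

In the switching case this step is run per (s_t, s_{t+1}) pair, with the results mixed using the weights p(s_t|s_{t+1}, v_{1:T}) of Eq. (12).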
More sophisticated methods (see [7]), such as sampling from the Gaussian p(h_{t+1}|s_{t+1}, v_{1:T}), have the advantage that covariance information is used^3.

Closing the Recursion We have now computed both the continuous and discrete factors in (8), which we wish to use to write the smoothed estimate in the form p(h_t, s_t|v_{1:T}) = p(s_t|v_{1:T}) p(h_t|s_t, v_{1:T}). The distribution p(h_t|s_t, v_{1:T}) is readily obtained from the joint (8) by conditioning on s_t to form the mixture

p(h_t|s_t, v_{1:T}) = Σ_{s_{t+1}} p(s_{t+1}|s_t, v_{1:T}) p(h_t|s_t, s_{t+1}, v_{1:T}),

which may then be collapsed to a single Gaussian (the mixture case is discussed in [7]). The smoothed posterior p(s_t|v_{1:T}) is given by

p(s_t|v_{1:T}) = Σ_{s_{t+1}} p(s_{t+1}|v_{1:T}) ⟨p(s_t|h_{t+1}, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})}.  (14)

^3 This is a form of exact sampling since drawing samples from a Gaussian is easy. This should not be confused with meaning that this use of sampling renders EC a sequential Monte Carlo scheme.

2.3 Relation to other methods

The EC backward pass is closely related to Kim's method [8]. In both EC and Kim's method, the approximation p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) ≈ p(h_{t+1}|s_{t+1}, v_{1:T}) is used to form a numerically simple backward pass. The other 'approximation' in EC is to numerically compute the average in (14). In Kim's method, however, an update for the discrete variables is formed by replacing the required term in (14) by

⟨p(s_t|h_{t+1}, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})} ≈ p(s_t|s_{t+1}, v_{1:t}).  (15)

Since p(s_t|s_{t+1}, v_{1:t}) ∝ p(s_{t+1}|s_t) p(s_t|v_{1:t}) / p(s_{t+1}|v_{1:t}), this can be computed simply from the filtered results alone. The fundamental difference therefore between EC and Kim's method is that the approximation (15) is not required by EC. The EC backward pass therefore makes fuller use of the future information, resulting in a recursion which intimately couples the continuous and discrete variables. The resulting effect on the quality of the approximation can be profound, as we will see in the experiments.
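The collapse of a Gaussian mixture to a single Gaussian by moment matching, used when closing the recursion above, can be sketched as:

```python
import numpy as np

def collapse(weights, means, covs):
    """Moment-match a Gaussian mixture to a single Gaussian:
    the returned (mu, cov) preserve the mixture's first two moments."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mu = sum(wi * mi for wi, mi in zip(w, means))
    # Total covariance = within-component + between-component spread.
    cov = sum(wi * (Ci + np.outer(mi - mu, mi - mu))
              for wi, mi, Ci in zip(w, means, covs))
    return mu, cov
```

Collapsing to a mixture of I > 1 Gaussians (as in the forward pass) is a separate design choice; the text's approach of repeatedly merging low-weight components can reuse this routine on the components being merged.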
The Expectation Propagation (EP) algorithm makes the central assumption of collapsing the posteriors to a Gaussian family [5]; the collapse is defined by a consistency criterion on overlapping marginals. In our experiments, we take the approach in [9] of collapsing to a single Gaussian. Ensuring consistency requires frequent translations between moment and canonical parameterizations, which is the origin of potentially severe numerical instability [10]. In contrast, EC works largely with moment parameterizations of Gaussians, for which relatively few numerical difficulties arise. Unlike EP, EC is not based on a consistency criterion, and a subtle issue arises about possible inconsistencies in the forward and backward approximations for EC. For example, under the conditional independence assumption in the backward pass, p(h_T|s_{T−1}, s_T, v_{1:T}) ≈ p(h_T|s_T, v_{1:T}), which is in contradiction to (5), which states that the approximation to p(h_T|s_{T−1}, s_T, v_{1:T}) will depend on s_{T−1}. Such potential inconsistencies arise because of the approximations made, and should not be considered as separate approximations in themselves. Rather than using a global (consistency) objective, EC attempts to faithfully approximate the exact forward and backward propagation routines. For this reason, as in the exact computation, only a single forward and backward pass are required in EC.

In [11] a related dynamics reversal is proposed. However, the singularities resulting from incorrectly treating p(v_{t+1:T}|h_t, s_t) as a density are heuristically finessed. In [12] a variational method approximates the joint distribution p(h_{1:T}, s_{1:T}|v_{1:T}) rather than the marginal inference p(h_t, s_t|v_{1:T}). This is a disadvantage when compared to other methods that directly approximate the marginal. Sequential Monte Carlo methods (Particle Filters) [13] are essentially mixture-of-delta-function approximations.
Whilst potentially powerful, these typically suffer in high-dimensional hidden spaces, unless techniques such as Rao-Blackwellization are used. ADF is generally preferable to Particle Filtering, since in ADF the approximation is a mixture of non-trivial distributions, and is therefore better able to represent the posterior.

3 Demonstration

Testing EC in a problem with a reasonably long temporal sequence T is important, since numerical instabilities may not be apparent in time-series of just a few points. To do this, we sequentially generate hidden and visible states from a given model, here with H = 3, S = 2, V = 1; see Figure 2 for full details of the experimental setup. Then, given only the parameters of the model and the visible observations (but not any of the hidden states h_{1:T}, s_{1:T}), the task is to infer p(h_t|s_t, v_{1:T}) and p(s_t|v_{1:T}). Since the exact computation is exponential in T, a simple alternative is to assume that the original sample states s_{1:T} are the 'correct' inferences, and compare how our most probable posterior smoothed estimates arg max_{s_t} p(s_t|v_{1:T}) compare with the assumed correct samples s_t.

Figure 2: The number of errors in estimating p(s_t|v_{1:T}) for a binary switch (S = 2) over a time series of length T = 100. Hence 50 errors corresponds to random guessing. Plotted are histograms of the errors over 1000 experiments, for the methods PF, RBPF, EP, ADFS, KimS, ECS, ADFM, KimM and ECM. The x-axes are cut off at 20 errors to improve visualization of the results.

We chose conditions that, from the viewpoint of classical signal processing, are difficult, with changes in the switches occurring at a much higher rate than the typical frequencies in the signal v_t. For EC we use the mean approximation for the numerical integration of (12). We included the Particle Filter merely for a point of comparison with ADF, since Particle Filters are not designed to approximate
the smoothed estimate, for which 1000 particles were used, with Kitagawa resampling. For the Rao-Blackwellized Particle Filter [13], 500 particles were used, with Kitagawa resampling.

Figure 2 (continued): (PF) Particle Filter. (RBPF) Rao-Blackwellized PF. (EP) Expectation Propagation. (ADFS) Assumed Density Filtering using a single Gaussian. (KimS) Kim's smoother using the results from ADFS. (ECS) Expectation Correction using a single Gaussian (I = J = 1). (ADFM) ADF using a mixture of I = 4 Gaussians. (KimM) Kim's smoother using the results from ADFM. (ECM) Expectation Correction using a mixture with I = J = 4 components. Setup: S = 2, V = 1 (scalar observations), T = 100, with zero output bias; A(s) = 0.9999 * orth(randn(H, H)), B(s) = randn(V, H), H = 3, Σ^h(s) = I_H, Σ^v(s) = 0.1 I_V, p(s_{t+1}|s_t) ∝ 1_{S×S} + I_S. At time t = 1, the switch prior p_1 is uniform, with h_1 drawn from N(10 * randn(H, 1), I_H).

We found that EP^4 was numerically unstable and often struggled to converge. To encourage convergence, we used the damping method in [9], performing 20 iterations with a damping factor of 0.5. Nevertheless, the disappointing performance of EP is most likely due to conflicts resulting from numerical instabilities introduced by the frequent conversions between moment and canonical representations. The best filtered results are given by ADF, since it is better able to represent the variance in the filtered posterior than the sampling methods. Unlike Kim's method, EC makes good use of the future information to clean up the filtered results considerably. One should bear in mind that both EC and Kim's method use the same ADF filtered results. This demonstrates that EC may dramatically improve on Kim's method, so that the small amount of extra work in making a numerical approximation of p(s_t|s_{t+1}, v_{1:T}), (12), may bring significant benefits. We found similar conclusions for experiments with an aSLDS [7].
4 Application to Noise Robust ASR

Here we briefly present an application of the SLDS to robust Automatic Speech Recognition (ASR), for which the intractable inference is performed by EC; this serves to demonstrate how EC scales to a large application. Fuller details are given in [14]. The standard approach to noise robust ASR is to provide a set of noise-robust features to a standard Hidden Markov Model (HMM) classifier, which is based on modeling the acoustic feature vector. For example, the method of Unsupervised Spectral Subtraction (USS) [15] provides state-of-the-art performance in this respect. Incorporating noise models directly into such feature-based HMM systems is difficult, mainly because the explicit influence of the noise on the features is poorly understood. An alternative is to model the raw speech signal directly, such as the SAR-HMM model [16], which under clean conditions performs well on isolated spoken digit recognition. However, the SAR-HMM performs poorly under noisy conditions, since no explicit noise processes are taken into account by the model. The approach we take here is to extend the SAR-HMM to include an explicit noise process, so that the observed signal v_t is modeled as a noise-corrupted version of a clean hidden signal v_t^h:

v_t = v_t^h + η̃_t,   with η̃_t ∼ N(0, σ̃²).

^4 Generalized EP [5], which groups variables together, improves on the results, but is still far inferior to the EC results presented here (Onno Zoeter, personal communication).

Table 1: Comparison of the recognition accuracy of three models when the test utterances are corrupted by various levels of Gaussian noise.

Noise Variance   SNR (dB)   HMM      SAR-HMM   AR-SLDS
0                26.5       100.0%   97.0%     96.8%
10^-7            26.3       100.0%   79.8%     96.8%
10^-6            25.1       90.9%    56.7%     96.4%
10^-5            19.7       86.4%    22.2%     94.8%
10^-4            10.6       59.1%    9.7%      84.0%
10^-3             0.7        9.1%    9.1%      61.2%
The dynamics of the clean signal is modeled by a switching AR process

v_t^h = ∑_{r=1}^{R} c_r(s_t) v_{t−r}^h + η_t^h(s_t),   η_t^h(s_t) ∼ N(0, σ²(s_t)),

where s_t ∈ {1, . . . , S} denotes which of a set of AR coefficients c_r(s_t) are to be used at time t, and η_t^h(s_t) is the so-called innovation noise. When σ²(s_t) ≡ 0, this model reproduces the SAR-HMM of [16], a specially constrained HMM. Hence inference and learning for the SAR-HMM are tractable and straightforward. For the case σ²(s_t) > 0 the model can be recast as an SLDS. To do this we define h_t as a vector which contains the R most recent clean hidden samples,

h_t = [v_t^h  v_{t−1}^h  · · ·  v_{t−R+1}^h]^T,   (16)

and we set A(s_t) to be an R × R matrix whose first row contains the AR coefficients −c_r(s_t) and whose remaining rows form a shifted-down identity matrix. For example, for a third order (R = 3) AR process,

A(s_t) = [ −c_1(s_t)  −c_2(s_t)  −c_3(s_t) ;  1  0  0 ;  0  1  0 ].   (17)

The hidden covariance matrix Σ_h(s) has all elements zero, except the top-left element, which is set to the innovation variance. To extract the first component of h_t we use the (switch independent) 1 × R projection matrix B = [1 0 . . . 0]. The (switch independent) visible scalar noise variance is given by Σ_v ≡ σ_v². A well-known issue with raw speech signal models is that the energy of a signal may vary from one speaker to another, or because of a change in recording conditions. For this reason the innovation Σ_h is adjusted by maximizing the likelihood of an observed sequence with respect to the innovation covariance, a process called Gain Adaptation [16].

4.1 Training & Evaluation

Following [16], we trained a separate SAR-HMM for each of the eleven digits (0–9 and 'oh') from the TI-DIGITS database [17]. The training set for each digit was composed of 110 single digit utterances down-sampled to 8 kHz, each one pronounced by a male speaker. Each SAR-HMM was composed of ten states with a left-right transition matrix.
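The companion-form construction of Eqs. (16)-(17) can be sketched as follows. This is an illustrative sketch (the helper name is ours); following the paper's convention, the first row of A(s) stores the negated AR coefficients.

```python
import numpy as np

def ar_to_slds(coeffs, innovation_var):
    """Build SLDS parameters for an R-th order AR process, per Eqs. (16)-(17):
    first row of A holds -c_r, below it a shifted-down identity."""
    c = np.asarray(coeffs, dtype=float)
    R = len(c)
    A = np.zeros((R, R))
    A[0, :] = -c                      # first row: -c_1(s), ..., -c_R(s)
    A[1:, :-1] = np.eye(R - 1)        # shifted-down identity
    Sigma_h = np.zeros((R, R))
    Sigma_h[0, 0] = innovation_var    # only the newest sample is driven by noise
    B = np.zeros((1, R))
    B[0, 0] = 1.0                     # 1 x R projection extracting v_t^h from h_t
    return A, B, Sigma_h

# Example with made-up third-order coefficients
A, B, Sigma_h = ar_to_slds([0.5, -0.2, 0.1], innovation_var=0.7)
```

Stepping h_t = A h_{t−1} shifts the stored samples down by one position while the first row forms the new sample, exactly as Eq. (16) requires.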
Each state was associated with a 10th-order AR process and the model was constrained to stay an integer multiple of K = 140 time steps (0.0175 seconds) in the same state. We refer the reader to [16] for a detailed explanation of the training procedure used with the SAR-HMM. An AR-SLDS was built for each of the eleven digits by copying the parameters of the corresponding trained SAR-HMM, i.e., the AR coefficients c_r(s) are copied into the first row of the hidden transition matrix A(s) and the same discrete transition distribution p(s_t | s_{t−1}) is used. The models were then evaluated on a test set composed of 112 corrupted utterances of each of the eleven digits, each pronounced by different male speakers than those used in the training set. The recognition accuracy obtained by the models on the corrupted test sets is presented in Table 1. As expected, the performance of the SAR-HMM rapidly decreases with noise. The feature-based HMM with USS has high accuracy only for high SNR levels. In contrast, the AR-SLDS achieves a recognition accuracy of 61.2% at an SNR close to 0 dB, while the performance of the two other methods is equivalent to random guessing (9.1%). Whilst other inference methods may also perform well in this case, we found that EC performs admirably, without numerical instabilities, even for time series with several thousand time steps.

5 Discussion

We presented a method for approximate smoothed inference in an augmented class of switching linear dynamical systems. Our approximation is based on the idea that, due to the forgetting which commonly occurs in Markovian models, a finite number of mixture components may provide a reasonable approximation. Clearly, in systems with very long correlation times our method may require too many mixture components to produce a satisfactory result, although we are unaware of other techniques that would be able to cope well in that case.
The main benefit of EC over Kim smoothing is that future information is more accurately dealt with. Whilst EC is not as general as EP, EC carefully exploits the properties of singly-connected distributions, such as the aSLDS, to provide a numerically stable procedure. We hope that the ideas presented here may therefore help facilitate the practical application of dynamic hybrid networks.

Acknowledgements
This work is supported by the EU Project FP6-0027787. This paper only reflects the authors' views and funding agencies are not liable for any use that may be made of the information contained herein.

References
[1] Y. Bar-Shalom and X.-R. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, Norwood, MA, 1998.
[2] V. Pavlovic, J. M. Rehg, and J. MacCormick. Learning switching linear models of human motion. In Advances in Neural Information Processing Systems (NIPS 13), pages 981–987, 2001.
[3] A. T. Cemgil, B. Kappen, and D. Barber. A generative model for music transcription. IEEE Transactions on Audio, Speech and Language Processing, 14(2):679–694, 2006.
[4] U. N. Lerner. Hybrid Bayesian Networks for Reasoning about Complex Systems. PhD thesis, Stanford University, 2002.
[5] O. Zoeter. Monitoring Non-Linear and Switching Dynamical Systems. PhD thesis, Radboud University Nijmegen, 2005.
[6] T. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, MIT Media Lab, 2001.
[7] D. Barber. Expectation correction for smoothed inference in switching linear dynamical systems. Journal of Machine Learning Research, 7:2515–2540, 2006.
[8] C.-J. Kim. Dynamic linear models with Markov-switching. Journal of Econometrics, 60:1–22, 1994.
[9] T. Heskes and O. Zoeter. Expectation Propagation for approximate inference in dynamic Bayesian networks. In A. Darwiche and N. Friedman, editors, Uncertainty in Artificial Intelligence, pages 216–223, 2002.
[10] S. Lauritzen and F. Jensen. Stable local computation with conditional Gaussian distributions. Statistics and Computing, 11:191–203, 2001.
[11] G. Kitagawa. The two-filter formula for smoothing and an implementation of the Gaussian-sum smoother. Annals of the Institute of Statistical Mathematics, 46(4):605–623, 1994.
[12] Z. Ghahramani and G. E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):963–996, 1998.
[13] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001.
[14] B. Mesot and D. Barber. Switching linear dynamical systems for noise robust speech recognition. IDIAP-RR 08, 2006.
[15] G. Lathoud, M. Magimai-Doss, B. Mesot, and H. Bourlard. Unsupervised spectral subtraction for noise-robust ASR. In Proceedings of ASRU 2005, pages 189–194, November 2005.
[16] Y. Ephraim and W. J. J. Roberts. Revisiting autoregressive hidden Markov modeling of speech signals. IEEE Signal Processing Letters, 12(2):166–169, February 2005.
[17] R. G. Leonard. A database for speaker independent digit recognition. In Proceedings of ICASSP 84, volume 3, 1984.
nips2006-198 Unified Inference for Variational Bayesian Linear Gaussian State-Space Models
Author: David Barber, Silvia Chiappa
Abstract: Linear Gaussian State-Space Models are widely used and a Bayesian treatment of parameters is therefore of considerable interest. The approximate Variational Bayesian method applied to these models is an attractive approach, used successfully in applications ranging from acoustics to bioinformatics. The most challenging aspect of implementing the method is in performing inference on the hidden state sequence of the model. We show how to convert the inference problem so that standard Kalman Filtering/Smoothing recursions from the literature may be applied. This is in contrast to previously published approaches based on Belief Propagation. Our framework both simplifies and unifies the inference problem, so that future applications may be more easily developed. We demonstrate the elegance of the approach on Bayesian temporal ICA, with an application to finding independent dynamical processes underlying noisy EEG signals.

1 Linear Gaussian State-Space Models

Linear Gaussian State-Space Models (LGSSMs)^1 are fundamental in time-series analysis [1, 2, 3]. In these models the observations v_{1:T}^2 are generated from an underlying dynamical system on h_{1:T} according to

v_t = B h_t + η_t^v,  η_t^v ∼ N(0_V, Σ_V),    h_t = A h_{t−1} + η_t^h,  η_t^h ∼ N(0_H, Σ_H),

where N(µ, Σ) denotes a Gaussian with mean µ and covariance Σ, and 0_X denotes an X-dimensional zero vector. The observation v_t has dimension V and the hidden state h_t has dimension H. Probabilistically, the LGSSM is defined by

p(v_{1:T}, h_{1:T} | Θ) = p(v_1 | h_1) p(h_1) ∏_{t=2}^{T} p(v_t | h_t) p(h_t | h_{t−1}),

with p(v_t | h_t) = N(B h_t, Σ_V), p(h_t | h_{t−1}) = N(A h_{t−1}, Σ_H), p(h_1) = N(µ, Σ), and where Θ = {A, B, Σ_H, Σ_V, µ, Σ} denotes the model parameters. Because of the widespread use of these models, a Bayesian treatment of parameters is of considerable interest [4, 5, 6, 7, 8].
An exact implementation of the Bayesian LGSSM is formally intractable [8], and recently a Variational Bayesian (VB) approximation has been studied [4, 5, 6, 7, 9]. The most challenging part of implementing the VB method is performing inference over h_{1:T}, and previous authors have developed their own specialized routines, based on Belief Propagation, since standard LGSSM inference routines appear, at first sight, not to be applicable.

^1 Also called Kalman Filters/Smoothers, Linear Dynamical Systems.
^2 v_{1:T} denotes v_1, . . . , v_T.

A key contribution of this paper is to show how the Variational Bayesian treatment of the LGSSM can be implemented using standard LGSSM inference routines. Based on the insight we provide, any standard inference method may be applied, including those specifically addressed to improve numerical stability [2, 10, 11]. In this article, we decided to describe the predictor-corrector and Rauch-Tung-Striebel recursions [2], and also suggest a small modification that reduces computational cost. The Bayesian LGSSM is particularly of interest when strong prior constraints are needed to find adequate solutions. One such case is in EEG signal analysis, whereby we wish to extract sources that evolve independently through time. Since EEG is particularly noisy [12], a prior that encourages sources to have preferential dynamics is advantageous. This application is discussed in Section 4, and demonstrates the ease of applying our VB framework.

2 Bayesian Linear Gaussian State-Space Models

In the Bayesian treatment of the LGSSM, instead of considering the model parameters Θ as fixed, we define a prior distribution p(Θ | Θ̂), where Θ̂ is a set of hyperparameters. Then:

p(v_{1:T} | Θ̂) = ∫_Θ p(v_{1:T} | Θ) p(Θ | Θ̂).   (1)

In a full Bayesian treatment we would define additional prior distributions over the hyperparameters Θ̂.
Here we take instead the ML-II ('evidence') framework, in which the optimal set of hyperparameters is found by maximizing p(v_{1:T} | Θ̂) with respect to Θ̂ [6, 7, 9]. For the parameter priors, here we define Gaussians on the columns of A and B^3:

p(A | α, Σ_H) ∝ ∏_{j=1}^{H} exp( −(α_j / 2) (A_j − Â_j)^T Σ_H^{−1} (A_j − Â_j) ),
p(B | β, Σ_V) ∝ ∏_{j=1}^{H} exp( −(β_j / 2) (B_j − B̂_j)^T Σ_V^{−1} (B_j − B̂_j) ),

which has the effect of biasing the transition and emission matrices to desired forms Â and B̂. The conjugate priors for general inverse covariances Σ_H^{−1} and Σ_V^{−1} are Wishart distributions [7]^4. In the simpler case assumed here of diagonal covariances these become Gamma distributions [5, 7]. The hyperparameters are then Θ̂ = {α, β}^5.

Variational Bayes

Optimizing Eq. (1) with respect to Θ̂ is difficult due to the intractability of the integrals. Instead, in VB, one considers the lower bound [6, 7, 9]^6:

L = log p(v_{1:T} | Θ̂) ≥ H_q(Θ, h_{1:T}) + ⟨log p(Θ | Θ̂)⟩_{q(Θ)} + ⟨E(h_{1:T}, Θ)⟩_{q(Θ, h_{1:T})} ≡ F,

where E(h_{1:T}, Θ) ≡ log p(v_{1:T}, h_{1:T} | Θ). H_d(x) signifies the entropy of the distribution d(x), and ⟨·⟩_{d(x)} denotes the expectation operator. The key approximation in VB is q(Θ, h_{1:T}) ≡ q(Θ) q(h_{1:T}), from which one may show that, for optimality of F,

q(h_{1:T}) ∝ e^{⟨E(h_{1:T}, Θ)⟩_{q(Θ)}},   q(Θ) ∝ p(Θ | Θ̂) e^{⟨E(h_{1:T}, Θ)⟩_{q(h_{1:T})}}.

These coupled equations need to be iterated to convergence. The updates for the parameters q(Θ) are straightforward and are given in Appendices A and B. Once converged, the hyperparameters are updated by maximizing F with respect to Θ̂, which leads to simple update formulae [7]. Our main concern is with the update for q(h_{1:T}), for which this paper makes a departure from treatments previously presented.

^3 More general Gaussian priors may be more suitable depending on the application.
^4 For expositional simplicity, we do not put priors on µ and Σ.
^5 For simplicity, we keep the parameters of the Gamma priors fixed.
^6 Strictly we should write throughout q(· | v_{1:T}).
We omit the dependence on v_{1:T} for notational convenience.

3 Unified Inference on q(h_{1:T})

Optimally q(h_{1:T}) is Gaussian since, up to a constant, ⟨E(h_{1:T}, Θ)⟩_{q(Θ)} is quadratic in h_{1:T}^7:

−(1/2) ∑_{t=1}^{T} [ ⟨(v_t − B h_t)^T Σ_V^{−1} (v_t − B h_t)⟩_{q(B, Σ_V)} + ⟨(h_t − A h_{t−1})^T Σ_H^{−1} (h_t − A h_{t−1})⟩_{q(A, Σ_H)} ].   (2)

In addition, optimally, q(A | Σ_H) and q(B | Σ_V) are Gaussians (see Appendix A), so we can easily carry out the averages in Eq. (2). The further averages over q(Σ_H) and q(Σ_V) are also easy due to conjugacy. Whilst this defines the distribution q(h_{1:T}), quantities such as q(h_t), required for example for the parameter updates (see the Appendices), need to be inferred from this distribution. Clearly, in the non-Bayesian case, the averages over the parameters are not present, and the above simply represents the posterior distribution of an LGSSM whose visible variables have been clamped into their evidential states. In that case, inference can be performed using any standard LGSSM routine. Our aim, therefore, is to try to represent the averaged Eq. (2) directly as the posterior distribution q̃(h_{1:T} | ṽ_{1:T}) of an LGSSM, for some suitable parameter settings.

Mean + Fluctuation Decomposition

A useful decomposition is to write

⟨(v_t − B h_t)^T Σ_V^{−1} (v_t − B h_t)⟩_{q(B, Σ_V)} = (v_t − ⟨B⟩ h_t)^T ⟨Σ_V^{−1}⟩ (v_t − ⟨B⟩ h_t)  [mean]  + h_t^T S_B h_t  [fluctuation],

and similarly

⟨(h_t − A h_{t−1})^T Σ_H^{−1} (h_t − A h_{t−1})⟩_{q(A, Σ_H)} = (h_t − ⟨A⟩ h_{t−1})^T ⟨Σ_H^{−1}⟩ (h_t − ⟨A⟩ h_{t−1})  [mean]  + h_{t−1}^T S_A h_{t−1}  [fluctuation],

where the parameter covariances are S_B ≡ ⟨B^T Σ_V^{−1} B⟩ − ⟨B⟩^T ⟨Σ_V^{−1}⟩ ⟨B⟩ = V H_B^{−1} and S_A ≡ ⟨A^T Σ_H^{−1} A⟩ − ⟨A⟩^T ⟨Σ_H^{−1}⟩ ⟨A⟩ = H H_A^{−1} (for H_A and H_B defined in Appendix A). The mean terms simply represent a clamped LGSSM with averaged parameters. However, the extra contributions from the fluctuations mean that Eq. (2) cannot be written as a clamped LGSSM with averaged parameters. In order to deal with these extra terms, our idea is to treat the fluctuations as arising from an augmented visible variable, for which Eq.
(2) can then be considered as a clamped LGSSM.

Inference Using an Augmented LGSSM

To represent Eq. (2) as an LGSSM q̃(h_{1:T} | ṽ_{1:T}), we may augment v_t and B as^8

ṽ_t = vert(v_t, 0_H, 0_H),   B̃ = vert(⟨B⟩, U_A, U_B),

where U_A is the Cholesky factor of S_A, so that U_A^T U_A = S_A. Similarly, U_B is the Cholesky factor of S_B. The equivalent LGSSM q̃(h_{1:T} | ṽ_{1:T}) is then completed by specifying^9

Ã ≡ ⟨A⟩,   Σ̃_H ≡ ⟨Σ_H^{−1}⟩^{−1},   Σ̃_V ≡ diag(⟨Σ_V^{−1}⟩^{−1}, I_H, I_H),   µ̃ ≡ µ,   Σ̃ ≡ Σ.

The validity of this parameter assignment can be checked by showing that, up to negligible constants, the exponent of this augmented LGSSM has the same form as Eq. (2)^10. Now that this has been written as an LGSSM q̃(h_{1:T} | ṽ_{1:T}), standard inference routines in the literature may be applied to compute q(h_t | v_{1:T}) = q̃(h_t | ṽ_{1:T}) [1, 2, 11]^11.

^7 For simplicity of exposition, we ignore the first time-point here.
^8 The notation vert(x_1, . . . , x_n) stands for vertically concatenating the arguments x_1, . . . , x_n.
^9 Strictly, we need a time-dependent emission B̃_t = B̃, for t = 1, . . . , T − 1. For time T, B̃_T has the Cholesky factor U_A replaced by 0_{H,H}.
^10 There are several ways of achieving a similar augmentation. We chose this since, in the non-Bayesian limit U_A = U_B = 0_{H,H}, no numerical instabilities would be introduced.
^11 Note that, since the augmented LGSSM q̃(h_{1:T} | ṽ_{1:T}) is designed to match the fully clamped distribution q(h_{1:T} | v_{1:T}), the filtered posterior q̃(h_t | ṽ_{1:t}) does not correspond to q(h_t | v_{1:t}).

Algorithm 1 LGSSM: Forward and backward recursive updates. The smoothed posterior p(h_t | v_{1:T}) is returned in the mean ĥ_t^T and covariance P_t^T.
procedure FORWARD
  1a: P ← Σ
  1b: P ← DΣ, where D ≡ I − Σ U_AB (I + U_AB^T Σ U_AB)^{−1} U_AB^T
  2a: ĥ_1^0 ← µ
  2b: ĥ_1^0 ← Dµ
  3:  K ← P B^T (B P B^T + Σ_V)^{−1},  P_1^1 ← (I − K B) P,  ĥ_1^1 ← ĥ_1^0 + K (v_1 − B ĥ_1^0)
  for t ← 2, T do
    4:  P_t^{t−1} ← A P_{t−1}^{t−1} A^T + Σ_H
    5a: P ← P_t^{t−1}
    5b: P ← D_t P_t^{t−1}, where D_t ≡ I − P_t^{t−1} U_AB (I + U_AB^T P_t^{t−1} U_AB)^{−1} U_AB^T
    6a: ĥ_t^{t−1} ← A ĥ_{t−1}^{t−1}
    6b: ĥ_t^{t−1} ← D_t A ĥ_{t−1}^{t−1}
    7:  K ← P B^T (B P B^T + Σ_V)^{−1},  P_t^t ← (I − K B) P,  ĥ_t^t ← ĥ_t^{t−1} + K (v_t − B ĥ_t^{t−1})
  end for
end procedure

procedure BACKWARD
  for t ← T − 1, 1 do
    A_t^← ← P_t^t A^T (P_{t+1}^t)^{−1}
    P_t^T ← P_t^t + A_t^← (P_{t+1}^T − P_{t+1}^t) (A_t^←)^T
    ĥ_t^T ← ĥ_t^t + A_t^← (ĥ_{t+1}^T − A ĥ_t^t)
  end for
end procedure

For completeness, we describe the standard predictor-corrector form of the Kalman Filter, together with the Rauch-Tung-Striebel Smoother [2]. These are given in Algorithm 1, where q̃(h_t | ṽ_{1:T}) is computed by calling the FORWARD and BACKWARD procedures. We present two variants of the FORWARD pass. Either we may call procedure FORWARD in Algorithm 1 with parameters Ã, B̃, Σ̃_H, Σ̃_V, µ̃, Σ̃ and the augmented visible variables ṽ_t, in which case we use steps 1a, 2a, 5a and 6a. This is exactly the predictor-corrector form of a Kalman Filter [2]. Otherwise, in order to reduce the computational cost, we may call procedure FORWARD with the parameters ⟨A⟩, ⟨B⟩, Σ̃_H, ⟨Σ_V^{−1}⟩^{−1}, µ, Σ and the original visible variable v_t, in which case we use steps 1b (where U_AB U_AB^T ≡ S_A + S_B), 2b, 5b and 6b. The two algorithms are mathematically equivalent. Computing q(h_t | v_{1:T}) = q̃(h_t | ṽ_{1:T}) is then completed by calling the common BACKWARD pass. The important point here is that the reader may supply any standard Kalman Filtering/Smoothing routine, and simply call it with the appropriate parameters.
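A minimal NumPy sketch of the variant-(a) pass — the standard predictor-corrector filter plus RTS smoother — applied to the augmented model of Section 3. The parameter-fluctuation matrices S_A and S_B below are placeholders of our choosing, not values computed from a real VB run.

```python
import numpy as np

def kalman_rts(v, A, B, Sigma_H, Sigma_V, mu, Sigma):
    """Predictor-corrector Kalman filter + Rauch-Tung-Striebel smoother
    (steps 1a, 2a, 5a, 6a of Algorithm 1)."""
    T = len(v)
    H = A.shape[0]
    fm = np.zeros((T, H)); fP = np.zeros((T, H, H))
    P, hpred = Sigma, mu                      # steps 1a, 2a: the prior
    for t in range(T):
        if t > 0:                             # steps 4, 5a, 6a: predict
            P = A @ fP[t - 1] @ A.T + Sigma_H
            hpred = A @ fm[t - 1]
        K = P @ B.T @ np.linalg.inv(B @ P @ B.T + Sigma_V)   # steps 3 / 7: correct
        fP[t] = (np.eye(H) - K @ B) @ P
        fm[t] = hpred + K @ (v[t] - B @ hpred)
    sm, sP = fm.copy(), fP.copy()             # BACKWARD (RTS) pass
    for t in range(T - 2, -1, -1):
        Ppred = A @ fP[t] @ A.T + Sigma_H
        G = fP[t] @ A.T @ np.linalg.inv(Ppred)
        sP[t] = fP[t] + G @ (sP[t + 1] - Ppred) @ G.T
        sm[t] = fm[t] + G @ (sm[t + 1] - A @ fm[t])
    return fm, fP, sm, sP

# Toy averaged parameters and data (illustrative values only)
rng = np.random.default_rng(1)
H, V, T = 2, 1, 50
A = 0.9 * np.eye(H); B = rng.standard_normal((V, H))
Sigma_H = 0.1 * np.eye(H); Sigma_V = 0.1 * np.eye(V)
mu = np.zeros(H); Sigma = np.eye(H)
v = rng.standard_normal((T, V))

# Augmentation (Section 3): extra zero observations carry the fluctuations
S_A = 0.01 * np.eye(H); S_B = 0.01 * np.eye(H)   # placeholder S_A, S_B
U_A = np.linalg.cholesky(S_A).T                  # U_A^T U_A = S_A
U_B = np.linalg.cholesky(S_B).T
B_aug = np.vstack([B, U_A, U_B])                 # B~ = vert(<B>, U_A, U_B)
v_aug = np.hstack([v, np.zeros((T, 2 * H))])     # v~_t = vert(v_t, 0_H, 0_H)
Sigma_V_aug = np.zeros((V + 2 * H, V + 2 * H))
Sigma_V_aug[:V, :V] = Sigma_V
Sigma_V_aug[V:, V:] = np.eye(2 * H)              # Sigma~_V = diag(Sigma_V, I_H, I_H)

fm, fP, sm, sP = kalman_rts(v_aug, A, B_aug, Sigma_H, Sigma_V_aug, mu, Sigma)
```

Calling the same routine with the unaugmented (B, v, Σ_V) recovers ordinary LGSSM smoothing, which is exactly the point of the construction.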
In some parameter regimes, or in very long time-series, numerical stability may be a serious concern, for which several stabilized algorithms have been developed over the years, for example the square-root forms [2, 10, 11]. By converting the problem to a standard form, we have therefore unified and simplified inference, so that future applications may be more readily developed^12.

3.1 Relation to Previous Approaches

An alternative approach to the one above, taken in [5, 7], is to write the posterior as

log q(h_{1:T}) = ∑_{t=2}^{T} φ_t(h_{t−1}, h_t) + const.

for suitably defined quadratic forms φ_t(h_{t−1}, h_t). Here the potentials φ_t(h_{t−1}, h_t) encode the averaging over the parameters A, B, Σ_H, Σ_V. The approach taken in [7] is to recognize this as a pairwise Markov chain, for which the Belief Propagation recursions may be applied. The approach in [5] is based on a Kullback-Leibler minimization of the posterior with a chain structure, which is algorithmically equivalent to Belief Propagation. Whilst mathematically valid procedures, the resulting algorithms do not correspond to any of the standard forms in the Kalman Filtering/Smoothing literature, whose properties have been well studied [14].

^12 The computation of the log-likelihood bound does not require any augmentation.

4 An Application to Bayesian ICA

A particular case for which the Bayesian LGSSM is of interest is in extracting independent source signals underlying a multivariate timeseries [5, 15]. This will demonstrate how the approach developed in Section 3 makes VB easy to apply. The sources s^i are modeled as independent in the following sense:

p(s_{1:T}^i, s_{1:T}^j) = p(s_{1:T}^i) p(s_{1:T}^j),   for i ≠ j,  i, j = 1, . . . , C.

Independence implies block diagonal transition and state noise matrices A, Σ_H and Σ, where each block c has dimension H_c.
A one-dimensional source s_t^c for each independent dynamical subsystem is then formed from s_t^c = 1_c^T h_t^c, where 1_c is a unit vector and h_t^c is the state of dynamical system c. Combining the sources, we can write s_t = P h_t, where P = diag(1_1^T, . . . , 1_C^T) and h_t = vert(h_t^1, . . . , h_t^C). The resulting emission matrix is constrained to be of the form B = W P, where W is the V × C mixing matrix. This means that the observations are formed by linearly mixing the sources, v_t = W s_t + η_t. The graphical structure of this model is presented in Figure 1. To encourage redundant components to be removed, we place a zero mean Gaussian prior on W. In this case, we do not define a prior for the parameters Σ_H and Σ_V, which are instead considered as hyperparameters. More details of the model are given in [15].

Figure 1: The structure of the LGSSM for ICA.

The constraint B = W P requires a minor modification from Section 3, as we discuss below.

Inference on q(h_{1:T})

A small modification of the mean + fluctuation decomposition for B occurs, namely:

⟨(v_t − B h_t)^T Σ_V^{−1} (v_t − B h_t)⟩_{q(W)} = (v_t − ⟨B⟩ h_t)^T Σ_V^{−1} (v_t − ⟨B⟩ h_t) + h_t^T P^T S_W P h_t,

where ⟨B⟩ ≡ ⟨W⟩ P and S_W = V H_W^{−1}. The quantities ⟨W⟩ and H_W are obtained as in Appendix A.1 with the replacement h_t ← P h_t. To represent the above as an LGSSM, we augment v_t and B as

ṽ_t = vert(v_t, 0_H, 0_C),   B̃ = vert(⟨B⟩, U_A, U_W P),

where U_W is the Cholesky factor of S_W. The equivalent LGSSM is then completed by specifying Ã ≡ ⟨A⟩, Σ̃_H ≡ Σ_H, Σ̃_V ≡ diag(Σ_V, I_H, I_C), µ̃ ≡ µ, Σ̃ ≡ Σ, and inference for q(h_{1:T}) is performed using Algorithm 1. This demonstrates the elegance and unity of the approach in Section 3, since no new algorithm needs to be developed to perform inference, even in this special constrained parameter case.

4.1 Demonstration

As a simple demonstration, we used an LGSSM to generate 3 sources s_t^c with random 5 × 5 transition matrices A_c, µ = 0_H and Σ ≡ Σ_H ≡ I_H.
The sources were mixed into three observations, v_t = W s_t + η_t^v, for W chosen with elements from a zero mean unit variance Gaussian distribution, and Σ_V = I_V. We then trained a Bayesian LGSSM with 5 sources and 7 × 7 transition matrices A_c. To bias the model to find the simplest sources, we used Â_c ≡ 0_{H_c, H_c} for all sources. In Figure 2a and Figure 2b we see the original sources and the noisy observations respectively. In Figure 2c we see the estimated sources from our method after convergence of the hyperparameter updates. Two of the 5 sources have been removed, and the remaining three are a reasonable estimation of the original sources. Another possible approach for introducing prior knowledge is to use a Maximum a Posteriori (MAP) procedure by adding a prior term to the original log-likelihood,

log p(v_{1:T} | A, W, Σ_H, Σ_V, µ, Σ) + log p(A | α) + log p(W | β).

However, it is not clear how to reliably find the hyperparameters α and β in this case. One solution is to estimate them by optimizing the new objective function jointly with respect to the parameters and hyperparameters (the so-called joint MAP estimation; see for example [16]). A typical result of using this joint MAP approach on the artificial data is presented in Figure 2d. The joint MAP does not estimate the hyperparameters well, and the incorrect number of sources is identified.

Figure 2: (a) Original sources s_t. (b) Observations resulting from mixing the original sources, v_t = W s_t + η_t^v, η_t^v ∼ N(0, I). (c) Recovered sources using the Bayesian LGSSM. (d) Sources found with MAP LGSSM.

Figure 3: (a) Original raw EEG recordings from 4 channels. (b–e) 16 sources s_t estimated by the Bayesian LGSSM.
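The source-extraction matrix P = diag(1_1^T, . . . , 1_C^T) from the ICA construction in Section 4 can be built as follows (an illustrative sketch; the helper name is ours):

```python
import numpy as np

def make_projection(block_dims):
    """P = diag(1^T_1, ..., 1^T_C): row c sums subsystem c's state,
    giving s_t^c = 1_c^T h_t^c and, stacked, s_t = P h_t."""
    C, Htot = len(block_dims), sum(block_dims)
    P = np.zeros((C, Htot))
    start = 0
    for c, Hc in enumerate(block_dims):
        P[c, start:start + Hc] = 1.0
        start += Hc
    return P

P = make_projection([5, 5, 5])   # three sources with 5-dimensional subsystems
```

The emission matrix is then B = W P for a V × C mixing matrix W, so each observation is a linear mix of the C scalar sources.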
4.2 Application to EEG Analysis

In Figure 3a we plot three seconds of EEG data recorded from 4 channels (located in the right hemisphere) while a person is performing imagined movement of the right hand. As is typical in EEG, each channel shows drift terms below 1 Hz, which correspond to artifacts of the instrumentation, together with 50 Hz mains contamination; these mask the rhythmical activity related to the mental task, mainly centered at 10 and 20 Hz [17]. We would therefore like a method which enables us to extract components in these information-rich 10 and 20 Hz frequency bands. Standard ICA methods such as FastICA do not find satisfactory sources based on raw 'noisy' data, and preprocessing with band-pass filters is usually required. Additionally, in EEG research, flexibility in the number of recovered sources is important, since there may be many independent oscillators of interest underlying the observations and we would like some way to automatically determine their effective number. To preferentially find sources at particular frequencies, we specified a block diagonal matrix Â_c for each source c, where each block is a 2 × 2 rotation matrix at the desired frequency. We defined the following 16 groups of frequencies (in Hz): [0.5], [0.5], [0.5], [0.5]; [10,11], [10,11], [10,11], [10,11]; [20,21], [20,21], [20,21], [20,21]; [50], [50], [50], [50]. The temporal evolution of the sources obtained after training the Bayesian LGSSM is given in Figure 3(b–e) (grouped by frequency range). The Bayesian LGSSM removed 4 unnecessary sources from the mixing matrix W, that is, one [10,11] Hz and three [20,21] Hz sources. The first 4 sources contain dominant low frequency drift; sources 5, 6 and 8 contain [10,11] Hz activity, while source 10 contains [20,21] Hz centered activity. Of the 4 sources initialized to 50 Hz, only 2 retained 50 Hz activity, while the A_c of the other two changed to model other frequencies present in the EEG.
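The frequency-targeted prior Â_c can be sketched as follows: each 2 × 2 block rotates the state by 2π f / f_s per sample. Note the sampling rate f_s is our assumption for illustration; the paper does not state it.

```python
import numpy as np

def rotation_block(freq_hz, fs_hz):
    # 2x2 rotation advancing the phase by 2*pi*freq_hz/fs_hz each time step
    theta = 2.0 * np.pi * freq_hz / fs_hz
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def prior_transition(freqs_hz, fs_hz):
    """Block diagonal prior A_hat_c from a list of target frequencies,
    e.g. [10, 11] for one of the paper's [10,11] Hz groups."""
    blocks = [rotation_block(f, fs_hz) for f in freqs_hz]
    n = 2 * len(blocks)
    A_hat = np.zeros((n, n))
    for i, Rb in enumerate(blocks):
        A_hat[2 * i:2 * i + 2, 2 * i:2 * i + 2] = Rb
    return A_hat

A_hat = prior_transition([10, 11], fs_hz=128.0)  # one [10,11] Hz source prior
```

A state stepped by rotation_block(10, 100) returns to itself after 10 steps (one full 10 Hz cycle at 100 Hz sampling), which is what biases the learned source towards that frequency.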
This method demonstrates the usefulness and applicability of the VB method in a real-world situation.

5 Conclusion

We considered the application of Variational Bayesian learning to Linear Gaussian State-Space Models. This is an important class of models with widespread application, and finding a simple way to implement this approximate Bayesian procedure is of considerable interest. The most demanding part of the procedure is inference of the hidden states of the model. Previously, this has been achieved using Belief Propagation, which differs from inference in the Kalman Filtering/Smoothing literature, for which highly efficient and stabilized procedures exist. A central contribution of this paper is to show how inference can be written using the standard Kalman Filtering/Smoothing recursions by augmenting the original model. Additionally, a minor modification to the standard Kalman Filtering routine may be applied for computational efficiency. We demonstrated the elegance and unity of our approach by showing how to easily apply a Variational Bayes analysis of temporal ICA. Specifically, our Bayes ICA approach successfully extracts independent processes underlying EEG signals, biased towards preferred frequency ranges. We hope that this simple and unifying interpretation of Variational Bayesian LGSSMs may therefore facilitate the further application to related models.

A Parameter Updates for A and B

A.1 Determining q(B | Σ_V)

By examining F, the contribution of q(B | Σ_V) can be interpreted as the negative KL divergence between q(B | Σ_V) and a Gaussian. Hence, optimally, q(B | Σ_V) is a Gaussian. The covariance [Σ_B]_{ij,kl} ≡ ⟨(B_{ij} − ⟨B_{ij}⟩)(B_{kl} − ⟨B_{kl}⟩)⟩ (averages w.r.t. q(B | Σ_V)) is given by

[Σ_B]_{ij,kl} = [H_B^{−1}]_{jl} [Σ_V]_{ik},   where  [H_B]_{jl} ≡ ∑_{t=1}^{T} ⟨h_t^j h_t^l⟩_{q(h_t)} + β_j δ_{jl}.

The mean is given by ⟨B⟩ = N_B H_B^{−1}, where [N_B]_{ij} ≡ ∑_{t=1}^{T} ⟨h_t^j⟩_{q(h_t)} v_t^i + β_j B̂_{ij}.
A.2 Determining q(A|Σ_H)

Optimally, q(A|Σ_H) is a Gaussian with covariance

[Σ_A]_{ij,kl} = [H_A^{-1}]_{jl} [Σ_H]_{ik},  where  [H_A]_{jl} ≡ Σ_{t=2}^T ⟨h_{t-1}^j h_{t-1}^l⟩_{q(h_{t-1})} + α_j δ_{jl}.

The mean is given by ⟨A⟩ = N_A H_A^{-1}, where [N_A]_{ij} ≡ Σ_{t=2}^T ⟨h_{t-1}^j h_t^i⟩_{q(h_{t-1:t})} + α_j Â_{ij}.

B Covariance Updates

By specifying a Wishart prior for the inverse of the covariances, conjugate update formulae are possible. In practice, it is more common to specify diagonal inverse covariances, for which the corresponding priors are simply Gamma distributions [7, 5]. For this simple diagonal case, the explicit updates are given below.

Determining q(Σ_V): For the constraint Σ_V^{-1} = diag(ρ), where each diagonal element follows a Gamma prior Ga(b_1, b_2) [7], q(ρ) factorizes and the optimal updates are

q(ρ_i) = Ga( b_1 + T/2 ,  b_2 + (1/2) [ Σ_{t=1}^T (v_t^i)^2 − [G_B]_{ii} + Σ_j β_j B̂_{ij}^2 ] ),  where G_B ≡ N_B H_B^{-1} N_B^T.

Determining q(Σ_H): Analogously, for Σ_H^{-1} = diag(τ) with prior Ga(a_1, a_2) [5], the updates are

q(τ_i) = Ga( a_1 + (T−1)/2 ,  a_2 + (1/2) [ Σ_{t=2}^T ⟨(h_t^i)^2⟩ − [G_A]_{ii} + Σ_j α_j Â_{ij}^2 ] ),  where G_A ≡ N_A H_A^{-1} N_A^T.

Acknowledgments

This work is supported by the European DIRAC Project FP6-0027787. This paper only reflects the authors' views and funding agencies are not liable for any use that may be made of the information contained herein.

References
[1] Y. Bar-Shalom and X.-R. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, 1998.
[2] M. S. Grewal and A. P. Andrews. Kalman Filtering: Theory and Practice Using MATLAB. John Wiley and Sons, Inc., 2001.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications. Springer, 2000.
[4] M. J. Beal, F. Falciani, Z. Ghahramani, C. Rangel, and D. L. Wild. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics, 21:349–356, 2005.
[5] A. T. Cemgil and S. J. Godsill. Probabilistic phase vocoder and its application to interpolation of missing values in audio signals.
In 13th European Signal Processing Conference, 2005.
[6] H. Valpola and J. Karhunen. An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Computation, 14:2647–2692, 2002.
[7] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[8] M. Davy and S. J. Godsill. Bayesian harmonic models for musical signal analysis (with discussion). In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics VII. Oxford University Press, 2003.
[9] D. J. C. MacKay. Ensemble learning and evidence maximisation. Unpublished manuscript: www.variational-bayes.org, 1995.
[10] M. Morf and T. Kailath. Square-root algorithms for least-squares estimation. IEEE Transactions on Automatic Control, 20:487–497, 1975.
[11] P. Park and T. Kailath. New square-root smoothing algorithms. IEEE Transactions on Automatic Control, 41:727–732, 1996.
[12] E. Niedermeyer and F. Lopes da Silva. Electroencephalography: basic principles, clinical applications and related fields. Lippincott Williams and Wilkins, 1999.
[13] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11:305–345, 1999.
[14] M. Verhaegen and P. Van Dooren. Numerical aspects of different Kalman filter implementations. IEEE Transactions on Automatic Control, 31:907–917, 1986.
[15] S. Chiappa and D. Barber. Bayesian linear Gaussian state-space models for biosignal decomposition. Signal Processing Letters, 14, 2007.
[16] S. S. Saquib, C. A. Bouman, and K. Sauer. ML parameter estimation for Markov random fields with applications to Bayesian tomography. IEEE Transactions on Image Processing, 7:1029–1044, 1998.
[17] G. Pfurtscheller and F. H. Lopes da Silva. Event-related EEG/MEG synchronization and desynchronization: basic principles. Clinical Neurophysiology, pages 1842–1857, 1999.
3 0.4925822 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
Author: Hariharan Narayanan, Mikhail Belkin, Partha Niyogi
Abstract: One of the intuitions underlying many graph-based methods for clustering and semi-supervised learning is that class or cluster boundaries pass through areas of low probability density. In this paper we provide some formal analysis of that notion for a probability distribution. We introduce a notion of weighted boundary volume, which measures the length of the class/cluster boundary weighted by the density of the underlying probability distribution. We show that the sizes of the cuts of certain commonly used data adjacency graphs converge to this continuous weighted volume of the boundary. keywords: Clustering, Semi-Supervised Learning 1
4 0.48643726 128 nips-2006-Manifold Denoising
Author: Matthias Hein, Markus Maier
Abstract: We consider the problem of denoising a noisily sampled submanifold M in R^d, where the submanifold M is a priori unknown and we are only given a noisy point sample. The presented denoising algorithm is based on a graph-based diffusion process of the point sample. We analyze this diffusion process using recent results about the convergence of graph Laplacians. In the experiments we show that our method is capable of dealing with non-trivial high-dimensional noise. Moreover, using the denoising algorithm as a pre-processing method, we can improve the results of a semi-supervised learning algorithm. 1
5 0.42997882 10 nips-2006-A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems
Author: David Barber, Bertrand Mesot
Abstract: We introduce a method for approximate smoothed inference in a class of switching linear dynamical systems, based on a novel form of Gaussian Sum smoother. This class includes the switching Kalman Filter and the more general case of switch transitions dependent on the continuous latent state. The method improves on the standard Kim smoothing approach by dispensing with one of the key approximations, thus making fuller use of the available future information. Whilst the only central assumption required is projection to a mixture of Gaussians, we show that an additional conditional independence assumption results in a simpler but stable and accurate alternative. Unlike the alternative unstable Expectation Propagation procedure, our method consists only of a single forward and backward pass and is reminiscent of the standard smoothing 'correction' recursions in the simpler linear dynamical system. The algorithm performs well on both toy experiments and in a large scale application to noise robust speech recognition.

1 Switching Linear Dynamical System

The Linear Dynamical System (LDS) [1] is a key temporal model in which a latent linear process generates the observed series. For complex time-series which are not well described globally by a single LDS, we may break the time-series into segments, each modeled by a potentially different LDS. This is the basis for the Switching LDS (SLDS) [2, 3, 4, 5] where, for each time t, a switch variable s_t ∈ {1, . . . , S} describes which of the LDSs is to be used. The observation (or 'visible') v_t ∈ R^V is linearly related to the hidden state h_t ∈ R^H with additive noise η by

v_t = B(s_t) h_t + η^v(s_t),  i.e.  p(v_t|h_t, s_t) = N(B(s_t) h_t, Σ^v(s_t))   (1)

where N(μ, Σ) denotes a Gaussian distribution with mean μ and covariance Σ.
The transition dynamics of the continuous hidden state h_t is linear,

h_t = A(s_t) h_{t-1} + η^h(s_t),  i.e.  p(h_t|h_{t-1}, s_t) = N(A(s_t) h_{t-1}, Σ^h(s_t))   (2)

The switch s_t may depend on both the previous s_{t-1} and h_{t-1}. This is an augmented SLDS (aSLDS), and defines the model

p(v_{1:T}, h_{1:T}, s_{1:T}) = Π_{t=1}^T p(v_t|h_t, s_t) p(h_t|h_{t-1}, s_t) p(s_t|h_{t-1}, s_{t-1})

The standard SLDS [4] considers only switch transitions p(s_t|s_{t-1}). At time t = 1, p(s_1|h_0, s_0) simply denotes the prior p(s_1), and p(h_1|h_0, s_1) denotes p(h_1|s_1). The aim of this article is to address how to perform inference in the aSLDS. In particular we desire the filtered estimate p(h_t, s_t|v_{1:t}) and the smoothed estimate p(h_t, s_t|v_{1:T}), for any 1 ≤ t ≤ T. Both filtered and smoothed inference in the SLDS is intractable, scaling exponentially with time [4].

[Figure 1: The independence structure of the aSLDS. Square nodes denote discrete variables, round nodes continuous variables. In the SLDS, links from h to s are not normally considered.]

2 Expectation Correction

Our approach to approximate p(h_t, s_t|v_{1:T}) mirrors the Rauch-Tung-Striebel 'correction' smoother for the simpler LDS [1]. The method consists of a single forward pass to recursively find the filtered posterior p(h_t, s_t|v_{1:t}), followed by a single backward pass to correct this into a smoothed posterior p(h_t, s_t|v_{1:T}). The forward pass we use is equivalent to standard Assumed Density Filtering (ADF) [6]. The main contribution of this paper is a novel form of backward pass, based only on collapsing the smoothed posterior to a mixture of Gaussians. Together with the ADF forward pass, we call the method Expectation Correction, since it corrects the moments found from the forward pass. A more detailed description of the method, including pseudocode, is given in [7].

2.1 Forward Pass (Filtering)

Readers familiar with ADF may wish to continue directly to Section 2.2.
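The generative model just defined can be sampled directly, which is also how the synthetic data of the later demonstration could be produced. A minimal sketch for the standard SLDS (Markov switch transitions only, uniform prior on s_1, zero initial mean; all names are our assumptions):

```python
import numpy as np

def sample_slds(A, B, Sh, Sv, P, T, rng):
    """Draw (s_{1:T}, h_{1:T}, v_{1:T}) from a standard SLDS with
    Markov switch transitions P[s_new, s_old] = p(s_{t+1}|s_t).
    A, B, Sh, Sv are lists of parameter matrices, one per switch state."""
    S, H = len(A), A[0].shape[0]
    V = B[0].shape[0]
    s = np.zeros(T, dtype=int)
    h = np.zeros((T, H))
    v = np.zeros((T, V))
    s[0] = rng.integers(S)  # uniform prior p(s_1)
    h[0] = rng.multivariate_normal(np.zeros(H), Sh[s[0]])
    v[0] = B[s[0]] @ h[0] + rng.multivariate_normal(np.zeros(V), Sv[s[0]])
    for t in range(1, T):
        s[t] = rng.choice(S, p=P[:, s[t - 1]])      # p(s_t | s_{t-1})
        h[t] = A[s[t]] @ h[t - 1] + rng.multivariate_normal(
            np.zeros(H), Sh[s[t]])                  # eq. (2)
        v[t] = B[s[t]] @ h[t] + rng.multivariate_normal(
            np.zeros(V), Sv[s[t]])                  # eq. (1)
    return s, h, v
```

The augmented model would additionally make the switch draw depend on h[t-1] through p(s_t|h_{t-1}, s_{t-1}).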
Our aim is to form a recursion for p(s_t, h_t|v_{1:t}), based on a Gaussian mixture approximation of p(h_t|s_t, v_{1:t}). Without loss of generality, we may decompose the filtered posterior as

p(h_t, s_t|v_{1:t}) = p(h_t|s_t, v_{1:t}) p(s_t|v_{1:t})   (3)

The exact representation of p(h_t|s_t, v_{1:t}) is a mixture with O(S^t) components. We therefore approximate this with a smaller I-component mixture

p(h_t|s_t, v_{1:t}) ≈ Σ_{i_t=1}^I p(h_t|i_t, s_t, v_{1:t}) p(i_t|s_t, v_{1:t})

where p(h_t|i_t, s_t, v_{1:t}) is a Gaussian parameterized with mean f(i_t, s_t) and covariance F(i_t, s_t). To find a recursion for these parameters, consider

p(h_{t+1}|s_{t+1}, v_{1:t+1}) = Σ_{s_t, i_t} p(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1}) p(s_t, i_t|s_{t+1}, v_{1:t+1})   (4)

Evaluating p(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1}): We find p(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1}) by first computing the joint distribution p(h_{t+1}, v_{t+1}|s_t, i_t, s_{t+1}, v_{1:t}), which is a Gaussian with covariance and mean elements

Σ_hh = A(s_{t+1}) F(i_t, s_t) A^T(s_{t+1}) + Σ^h(s_{t+1}),   Σ_vv = B(s_{t+1}) Σ_hh B^T(s_{t+1}) + Σ^v(s_{t+1}),
Σ_vh = B(s_{t+1}) Σ_hh,   μ_v = B(s_{t+1}) A(s_{t+1}) f(i_t, s_t),   μ_h = A(s_{t+1}) f(i_t, s_t)   (5)

and then conditioning on v_{t+1}. (For a joint Gaussian over (x, y), p(x|y) is a Gaussian with mean μ_x + Σ_xy Σ_yy^{-1} (y − μ_y) and covariance Σ_xx − Σ_xy Σ_yy^{-1} Σ_yx.) For the case S = 1, this forms the usual Kalman Filter recursions [1].

Evaluating p(s_t, i_t|s_{t+1}, v_{1:t+1}): The mixture weight in (4) can be found from the decomposition

p(s_t, i_t|s_{t+1}, v_{1:t+1}) ∝ p(v_{t+1}|i_t, s_t, s_{t+1}, v_{1:t}) p(s_{t+1}|i_t, s_t, v_{1:t}) p(i_t|s_t, v_{1:t}) p(s_t|v_{1:t})   (6)

The first factor in (6), p(v_{t+1}|i_t, s_t, s_{t+1}, v_{1:t}), is a Gaussian with mean μ_v and covariance Σ_vv, as given in (5). The last two factors, p(i_t|s_t, v_{1:t}) and p(s_t|v_{1:t}), are given from the previous iteration. Finally, p(s_{t+1}|i_t, s_t, v_{1:t}) is found from

p(s_{t+1}|i_t, s_t, v_{1:t}) = ⟨p(s_{t+1}|h_t, s_t)⟩_{p(h_t|i_t, s_t, v_{1:t})}   (7)

where ⟨·⟩_p denotes expectation with respect to p. In the SLDS, (7) is replaced by the Markov transition p(s_{t+1}|s_t).
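The conditioning step in the forward pass (the p(x|y) formula quoted above) is standard Gaussian conditioning; a minimal sketch (names ours):

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, Sxx, Sxy, Syy, y):
    """Condition a joint Gaussian over (x, y) on an observed y:
      mean  mu_x + Sxy Syy^{-1} (y - mu_y)
      cov   Sxx - Sxy Syy^{-1} Syx."""
    K = Sxy @ np.linalg.inv(Syy)  # 'gain', cf. the Kalman gain
    mean = mu_x + K @ (y - mu_y)
    cov = Sxx - K @ Sxy.T
    return mean, cov
```

With x = h_{t+1}, y = v_{t+1} and the moments of (5), this yields the updated filtered component; for S = 1 it is exactly the Kalman measurement update.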
In the aSLDS, however, (7) will generally need to be computed numerically.

Closing the recursion: We are now in a position to calculate (4). For each setting of the variable s_{t+1}, we have a mixture of I × S Gaussians which we numerically collapse back to I Gaussians to form

p(h_{t+1}|s_{t+1}, v_{1:t+1}) ≈ Σ_{i_{t+1}=1}^I p(h_{t+1}|i_{t+1}, s_{t+1}, v_{1:t+1}) p(i_{t+1}|s_{t+1}, v_{1:t+1})

Any method of choice may be supplied to collapse a mixture to a smaller mixture; our code simply repeatedly merges low-weight components. In this way the new mixture coefficients p(i_{t+1}|s_{t+1}, v_{1:t+1}), i_{t+1} ∈ 1, . . . , I, are defined, completing the description of how to form a recursion for p(h_{t+1}|s_{t+1}, v_{1:t+1}) in (3). A recursion for the switch variable is given by

p(s_{t+1}|v_{1:t+1}) ∝ Σ_{s_t, i_t} p(v_{t+1}|s_{t+1}, i_t, s_t, v_{1:t}) p(s_{t+1}|i_t, s_t, v_{1:t}) p(i_t|s_t, v_{1:t}) p(s_t|v_{1:t})

where all terms have been computed during the recursion for p(h_{t+1}|s_{t+1}, v_{1:t+1}). The likelihood p(v_{1:T}) may be found by recursing p(v_{1:t+1}) = p(v_{t+1}|v_{1:t}) p(v_{1:t}), where

p(v_{t+1}|v_{1:t}) = Σ_{i_t, s_t, s_{t+1}} p(v_{t+1}|i_t, s_t, s_{t+1}, v_{1:t}) p(s_{t+1}|i_t, s_t, v_{1:t}) p(i_t|s_t, v_{1:t}) p(s_t|v_{1:t})

2.2 Backward Pass (Smoothing)

The main contribution of this paper is to find a suitable way to 'correct' the filtered posterior p(s_t, h_t|v_{1:t}) obtained from the forward pass into a smoothed posterior p(s_t, h_t|v_{1:T}). We derive this for the case of a single Gaussian representation. The extension to the mixture case is straightforward and presented in [7]. We approximate the smoothed posterior p(h_t|s_t, v_{1:T}) by a Gaussian with mean g(s_t) and covariance G(s_t), and our aim is to find a recursion for these parameters.
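Besides merging low-weight components, a common way to collapse a Gaussian mixture is moment matching. A minimal sketch for collapsing to a single Gaussian (names and array layout ours):

```python
import numpy as np

def collapse_to_single_gaussian(w, mus, Sigmas):
    """Moment-match the mixture sum_i w_i N(mu_i, Sigma_i) to one
    Gaussian: the collapsed covariance is the weighted average of
    the component covariances plus the between-component spread.
    w: (I,), mus: (I, H), Sigmas: (I, H, H)."""
    w = np.asarray(w) / np.sum(w)
    mu = np.einsum('i,ih->h', w, mus)                   # mixture mean
    d = mus - mu                                        # mean deviations
    Sigma = np.einsum('i,ihk->hk', w, Sigmas) \
          + np.einsum('i,ih,ik->hk', w, d, d)
    return mu, Sigma
```

Applied repeatedly to the lowest-weight pair of components, the same moment-matching step yields the merging strategy the text mentions.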
A useful starting point for a recursion is

p(h_t, s_t|v_{1:T}) = Σ_{s_{t+1}} p(s_{t+1}|v_{1:T}) p(h_t|s_t, s_{t+1}, v_{1:T}) p(s_t|s_{t+1}, v_{1:T})

The term p(h_t|s_t, s_{t+1}, v_{1:T}) may be computed as

p(h_t|s_t, s_{t+1}, v_{1:T}) = ∫_{h_{t+1}} p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t}) p(h_{t+1}|s_t, s_{t+1}, v_{1:T})   (8)

The recursion therefore requires p(h_{t+1}|s_t, s_{t+1}, v_{1:T}), which we can write as

p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) ∝ p(h_{t+1}|s_{t+1}, v_{1:T}) p(s_t|s_{t+1}, h_{t+1}, v_{1:t})   (9)

The difficulty here is that the functional form of p(s_t|s_{t+1}, h_{t+1}, v_{1:t}) is not squared exponential in h_{t+1}, so that p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) will not be Gaussian. (In the exact calculation, p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) is a mixture of Gaussians, see [7]. However, since in (9) the term p(h_{t+1}|s_{t+1}, v_{1:T}) will only be approximately computed during the recursion, our approximation to p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) will not be a mixture of Gaussians.) One possibility would be to approximate the non-Gaussian p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) by a Gaussian (or mixture thereof) by minimizing the Kullback-Leibler divergence between the two, or performing moment matching in the case of a single Gaussian. A simpler alternative (which forms 'standard' EC) is to make the assumption p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) ≈ p(h_{t+1}|s_{t+1}, v_{1:T}), where p(h_{t+1}|s_{t+1}, v_{1:T}) is already known from the previous backward recursion. Under this assumption, the recursion becomes

p(h_t, s_t|v_{1:T}) ≈ Σ_{s_{t+1}} p(s_{t+1}|v_{1:T}) p(s_t|s_{t+1}, v_{1:T}) ⟨p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})}   (10)

Evaluating ⟨p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})}: This average is a Gaussian in h_t, whose statistics we will now compute. First we find p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t}), which may be obtained from the joint distribution

p(h_t, h_{t+1}|s_t, s_{t+1}, v_{1:t}) = p(h_{t+1}|h_t, s_{t+1}) p(h_t|s_t, v_{1:t})   (11)

which itself can be found from a forward dynamics on the filtered estimate p(h_t|s_t, v_{1:t}).
The statistics for the marginal p(h_t|s_t, s_{t+1}, v_{1:t}) are simply those of p(h_t|s_t, v_{1:t}), since s_{t+1} carries no extra information about h_t. The remaining statistics are the mean of h_{t+1}, the covariance of h_{t+1} and the cross-variance between h_t and h_{t+1}, which are given by

⟨h_{t+1}⟩ = A(s_{t+1}) f_t(s_t),   Σ_{t+1,t+1} = A(s_{t+1}) F_t(s_t) A^T(s_{t+1}) + Σ^h(s_{t+1}),   Σ_{t+1,t} = A(s_{t+1}) F_t(s_t)

Given the statistics of (11), we may now condition on h_{t+1} to find p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t}). Doing so effectively constitutes a reversal of the dynamics,

h_t = A_r(s_t, s_{t+1}) h_{t+1} + η_r(s_t, s_{t+1})

where the reversed transition A_r(s_t, s_{t+1}) and reversed noise η_r(s_t, s_{t+1}) ∼ N(m_r(s_t, s_{t+1}), Σ_r(s_t, s_{t+1})) are easily found using conditioning. Averaging the above reversed dynamics over p(h_{t+1}|s_{t+1}, v_{1:T}), we find that ⟨p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})} is a Gaussian with statistics

μ_t = A_r(s_t, s_{t+1}) g(s_{t+1}) + m_r(s_t, s_{t+1}),   Σ_{t,t} = A_r(s_t, s_{t+1}) G(s_{t+1}) A_r^T(s_t, s_{t+1}) + Σ_r(s_t, s_{t+1})

These equations directly mirror the standard RTS backward pass [1].

Evaluating p(s_t|s_{t+1}, v_{1:T}): The main departure of EC from previous methods is in treating the term

p(s_t|s_{t+1}, v_{1:T}) = ⟨p(s_t|h_{t+1}, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})}   (12)

The term p(s_t|h_{t+1}, s_{t+1}, v_{1:t}) is given by

p(s_t|h_{t+1}, s_{t+1}, v_{1:t}) = p(h_{t+1}|s_{t+1}, s_t, v_{1:t}) p(s_t, s_{t+1}|v_{1:t}) / Σ_{s_t'} p(h_{t+1}|s_{t+1}, s_t', v_{1:t}) p(s_t', s_{t+1}|v_{1:t})   (13)

Here p(s_t, s_{t+1}|v_{1:t}) = p(s_{t+1}|s_t, v_{1:t}) p(s_t|v_{1:t}), where p(s_{t+1}|s_t, v_{1:t}) occurs in the forward pass, (7). In (13), p(h_{t+1}|s_{t+1}, s_t, v_{1:t}) is found by marginalizing (11). Computing the average of (13) with respect to p(h_{t+1}|s_{t+1}, v_{1:T}) may be achieved by any numerical integration method desired. A simple approximation is to evaluate the integrand at the mean value of the averaging distribution p(h_{t+1}|s_{t+1}, v_{1:T}).
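The dynamics reversal above amounts to conditioning the joint Gaussian of (h_t, h_{t+1}) on h_{t+1}. A sketch for the single-Gaussian case, writing the reversed transition, bias and noise covariance as Ar, mr, Sr (names ours):

```python
import numpy as np

def reverse_dynamics(f, F, A, Sh):
    """Reverse the forward dynamics h_{t+1} = A h_t + eta,
    eta ~ N(0, Sh), around the filtered moments f, F of h_t,
    yielding h_t = Ar h_{t+1} + noise with noise ~ N(mr, Sr)."""
    Stt1 = F @ A.T                    # cross-covariance Cov(h_t, h_{t+1})
    St1t1 = A @ F @ A.T + Sh          # Cov(h_{t+1})
    Ar = Stt1 @ np.linalg.inv(St1t1)  # reversed transition
    mr = f - Ar @ (A @ f)             # reversed bias
    Sr = F - Ar @ Stt1.T              # reversed noise covariance
    return Ar, mr, Sr
```

Averaging this reversed dynamics over a Gaussian with moments g, G on h_{t+1} then gives mean Ar g + mr and covariance Ar G Ar^T + Sr, exactly as in the text.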
More sophisticated methods (see [7]), such as sampling from the Gaussian p(h_{t+1}|s_{t+1}, v_{1:T}), have the advantage that covariance information is used. (This is a form of exact sampling, since drawing samples from a Gaussian is easy. It should not be confused with meaning that this use of sampling renders EC a sequential Monte-Carlo scheme.)

Closing the Recursion: We have now computed both the continuous and discrete factors in (8), which we wish to use to write the smoothed estimate in the form p(h_t, s_t|v_{1:T}) = p(s_t|v_{1:T}) p(h_t|s_t, v_{1:T}). The distribution p(h_t|s_t, v_{1:T}) is readily obtained from the joint (8) by conditioning on s_t to form the mixture

p(h_t|s_t, v_{1:T}) = Σ_{s_{t+1}} p(s_{t+1}|s_t, v_{1:T}) p(h_t|s_t, s_{t+1}, v_{1:T})

which may then be collapsed to a single Gaussian (the mixture case is discussed in [7]). The smoothed posterior p(s_t|v_{1:T}) is given by

p(s_t|v_{1:T}) = Σ_{s_{t+1}} p(s_{t+1}|v_{1:T}) ⟨p(s_t|h_{t+1}, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})}   (14)

2.3 Relation to other methods

The EC backward pass is closely related to Kim's method [8]. In both EC and Kim's method, the approximation p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) ≈ p(h_{t+1}|s_{t+1}, v_{1:T}) is used to form a numerically simple backward pass. The other 'approximation' in EC is to numerically compute the average in (14). In Kim's method, however, an update for the discrete variables is formed by replacing the required term in (14) by

⟨p(s_t|h_{t+1}, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})} ≈ p(s_t|s_{t+1}, v_{1:t})   (15)

Since p(s_t|s_{t+1}, v_{1:t}) ∝ p(s_{t+1}|s_t) p(s_t|v_{1:t}) / p(s_{t+1}|v_{1:t}), this can be computed simply from the filtered results alone. The fundamental difference therefore between EC and Kim's method is that the approximation (15) is not required by EC. The EC backward pass therefore makes fuller use of the future information, resulting in a recursion which intimately couples the continuous and discrete variables. The resulting effect on the quality of the approximation can be profound, as we will see in the experiments.
The Expectation Propagation (EP) algorithm makes the central assumption of collapsing the posteriors to a Gaussian family [5]; the collapse is defined by a consistency criterion on overlapping marginals. In our experiments, we take the approach in [9] of collapsing to a single Gaussian. Ensuring consistency requires frequent translations between moment and canonical parameterizations, which is the origin of potentially severe numerical instability [10]. In contrast, EC works largely with moment parameterizations of Gaussians, for which relatively few numerical difficulties arise. Unlike EP, EC is not based on a consistency criterion, and a subtle issue arises about possible inconsistencies in the Forward and Backward approximations for EC. For example, under the conditional independence assumption in the Backward Pass, p(h_T|s_{T-1}, s_T, v_{1:T}) ≈ p(h_T|s_T, v_{1:T}), which is in contradiction to (5), which states that the approximation to p(h_T|s_{T-1}, s_T, v_{1:T}) will depend on s_{T-1}. Such potential inconsistencies arise because of the approximations made, and should not be considered as separate approximations in themselves. Rather than using a global (consistency) objective, EC attempts to faithfully approximate the exact Forward and Backward propagation routines. For this reason, as in the exact computation, only a single Forward and Backward pass are required in EC. In [11] a related dynamics reversal is proposed. However, the singularities resulting from incorrectly treating p(v_{t+1:T}|h_t, s_t) as a density are heuristically finessed. In [12] a variational method approximates the joint distribution p(h_{1:T}, s_{1:T}|v_{1:T}) rather than the marginal inference p(h_t, s_t|v_{1:T}). This is a disadvantage when compared to other methods that directly approximate the marginal. Sequential Monte Carlo methods (Particle Filters) [13] are essentially mixtures of delta-function approximations.
Whilst potentially powerful, these typically suffer in high-dimensional hidden spaces, unless techniques such as Rao-Blackwellization are performed. ADF is generally preferential to Particle Filtering, since in ADF the approximation is a mixture of non-trivial distributions, and is therefore more able to represent the posterior.

3 Demonstration

Testing EC in a problem with a reasonably long temporal sequence, T, is important since numerical instabilities may not be apparent in timeseries of just a few points. To do this, we sequentially generate hidden and visible states from a given model, here with H = 3, S = 2, V = 1 – see Figure 2 for full details of the experimental setup. Then, given only the parameters of the model and the visible observations (but not any of the hidden states h_{1:T}, s_{1:T}), the task is to infer p(h_t|s_t, v_{1:T}) and p(s_t|v_{1:T}). Since the exact computation is exponential in T, a simple alternative is to assume that the original sample states s_{1:T} are the 'correct' inferences, and compare how our most probable posterior smoothed estimates arg max_{s_t} p(s_t|v_{1:T}) compare with the assumed correct sample s_t. We chose conditions that, from the viewpoint of classical signal processing, are difficult, with changes in the switches occurring at a much higher rate than the typical frequencies in the signal v_t. For EC we use the mean approximation for the numerical integration of (12). We included the Particle Filter merely for a point of comparison with ADF, since Particle Filters are not designed to approximate the smoothed estimate; 1000 particles were used, with Kitagawa resampling. For the Rao-Blackwellized Particle Filter [13], 500 particles were used, with Kitagawa resampling.

[Figure 2: Histograms, over 1000 experiments, of the number of errors in estimating p(s_t|v_{1:T}) for a binary switch (S = 2) over a time series of length T = 100; 50 errors corresponds to random guessing. The x-axes are cut off at 20 errors to improve visualization of the results. Methods: (PF) Particle Filter; (RBPF) Rao-Blackwellized PF; (EP) Expectation Propagation; (ADFS) Assumed Density Filtering using a Single Gaussian; (KimS) Kim's smoother using the results from ADFS; (ECS) Expectation Correction using a Single Gaussian (I = J = 1); (ADFM) ADF using a mixture of I = 4 Gaussians; (KimM) Kim's smoother using the results from ADFM; (ECM) Expectation Correction using a mixture with I = J = 4 components. Model: S = 2, V = 1 (scalar observations), T = 100, zero output bias, A(s) = 0.9999 * orth(randn(H, H)), B(s) = randn(V, H), H = 3, Σ^h(s) = I_H, Σ^v(s) = 0.1 I_V, p(s_{t+1}|s_t) ∝ 1_{S×S} + I_S. At time t = 1 the prior p_1 is uniform, with h_1 drawn from N(10 * randn(H, 1), I_H).]

We found that EP was numerically unstable and often struggled to converge. To encourage convergence, we used the damping method in [9], performing 20 iterations with a damping factor of 0.5. Nevertheless, the disappointing performance of EP is most likely due to conflicts resulting from numerical instabilities introduced by the frequent conversions between moment and canonical representations. The best filtered results are given using ADF, since this is better able to represent the variance in the filtered posterior than the sampling methods. Unlike Kim's method, EC makes good use of the future information to clean up the filtered results considerably. One should bear in mind that both EC and Kim's method use the same ADF filtered results. This demonstrates that EC may dramatically improve on Kim's method, so that the small amount of extra work in making a numerical approximation of p(s_t|s_{t+1}, v_{1:T}), (12), may bring significant benefits. We found similar conclusions for experiments with an aSLDS [7].
4 Application to Noise Robust ASR

Here we briefly present an application of the SLDS to robust Automatic Speech Recognition (ASR), for which the intractable inference is performed by EC; this serves to demonstrate how EC scales well to a large-scale application. Fuller details are given in [14]. The standard approach to noise robust ASR is to provide a set of noise-robust features to a standard Hidden Markov Model (HMM) classifier, which is based on modeling the acoustic feature vector. For example, the method of Unsupervised Spectral Subtraction (USS) [15] provides state-of-the-art performance in this respect. Incorporating noise models directly into such feature-based HMM systems is difficult, mainly because the explicit influence of the noise on the features is poorly understood. An alternative is to model the raw speech signal directly, such as the SAR-HMM model [16] for which, under clean conditions, isolated spoken digit recognition performs well. However, the SAR-HMM performs poorly under noisy conditions, since no explicit noise processes are taken into account by the model. The approach we take here is to extend the SAR-HMM to include an explicit noise process, so that the observed signal v_t is modeled as a noise corrupted version of a clean hidden signal v_t^h:

v_t = v_t^h + η̃_t,  with  η̃_t ∼ N(0, σ̃^2)

(Footnote: Generalized EP [5], which groups variables together, improves on the results, but is still far inferior to the EC results presented here – Onno Zoeter, personal communication.)

Table 1: Comparison of the recognition accuracy of three models when the test utterances are corrupted by various levels of Gaussian noise.

Noise Variance | 0      | 10^-7  | 10^-6 | 10^-5 | 10^-4 | 10^-3
SNR (dB)       | 26.5   | 26.3   | 25.1  | 19.7  | 10.6  | 0.7
HMM            | 100.0% | 100.0% | 90.9% | 86.4% | 59.1% | 9.1%
SAR-HMM        | 97.0%  | 79.8%  | 56.7% | 22.2% | 9.7%  | 9.1%
AR-SLDS        | 96.8%  | 96.8%  | 96.4% | 94.8% | 84.0% | 61.2%
The dynamics of the clean signal is modeled by a switching AR process

v_t^h = Σ_{r=1}^R c_r(s_t) v_{t-r}^h + η_t^h(s_t),   η_t^h(s_t) ∼ N(0, σ^2(s_t))

where s_t ∈ {1, . . . , S} denotes which of a set of AR coefficients c_r(s_t) are to be used at time t, and η_t^h(s_t) is the so-called innovation noise. When σ^2(s_t) ≡ 0, this model reproduces the SAR-HMM of [16], a specially constrained HMM. Hence inference and learning for the SAR-HMM are tractable and straightforward. For the case σ^2(s_t) > 0, the model can be recast as an SLDS. To do this we define h_t as a vector which contains the R most recent clean hidden samples,

h_t = [ v_t^h . . . v_{t-R+1}^h ]^T   (16)

and we set A(s_t) to be an R × R matrix where the first row contains the AR coefficients −c_r(s_t) and the rest is a shifted-down identity matrix. For example, for a third order (R = 3) AR process,

A(s_t) = [ −c_1(s_t)  −c_2(s_t)  −c_3(s_t) ;  1  0  0 ;  0  1  0 ]   (17)

The hidden covariance matrix Σ^h(s) has all elements zero, except the top-left most, which is set to the innovation variance. To extract the first component of h_t we use the (switch independent) 1 × R projection matrix B = [ 1 0 . . . 0 ]. The (switch independent) visible scalar noise variance is given by Σ^v ≡ σ_v^2. A well-known issue with raw speech signal models is that the energy of a signal may vary from one speaker to another or because of a change in recording conditions. For this reason the innovation Σ^h is adjusted by maximizing the likelihood of an observed sequence with respect to the innovation covariance, a process called Gain Adaptation [16].

4.1 Training & Evaluation

Following [16], we trained a separate SAR-HMM for each of the eleven digits (0–9 and 'oh') from the TI-DIGITS database [17]. The training set for each digit was composed of 110 single digit utterances down-sampled to 8 kHz, each one pronounced by a male speaker. Each SAR-HMM was composed of ten states with a left-right transition matrix.
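The companion-matrix construction of equation (17) above generalizes to any AR order R; a minimal sketch (function name ours):

```python
import numpy as np

def ar_companion(c):
    """R x R transition matrix of eq. (17): the first row holds the
    negated AR coefficients, below it a shifted-down identity so that
    h_t = [v_t, ..., v_{t-R+1}]^T evolves by one sample per step."""
    R = len(c)
    A = np.zeros((R, R))
    A[0, :] = -np.asarray(c)
    A[1:, :-1] = np.eye(R - 1)
    return A
```

Together with Σ^h(s) having the innovation variance in its top-left entry and B = [1 0 . . . 0], this embeds the switching AR process in the SLDS form of equations (1)–(2).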
Each state was associated with a 10th-order AR process, and the model was constrained to remain in the same state for an integer multiple of K = 140 time steps (0.0175 seconds). We refer the reader to [16] for a detailed explanation of the training procedure used with the SAR-HMM. An AR-SLDS was built for each of the eleven digits by copying the parameters of the corresponding trained SAR-HMM, i.e., the AR coefficients c_r(s) are copied into the first row of the hidden transition matrix A(s) and the same discrete transition distribution p(s_t|s_{t-1}) is used. The models were then evaluated on a test set composed of 112 corrupted utterances of each of the eleven digits, each pronounced by different male speakers than those used in the training set. The recognition accuracy obtained by the models on the corrupted test sets is presented in Table 1. As expected, the performance of the SAR-HMM rapidly decreases with noise. The feature-based HMM with USS has high accuracy only for high SNR levels. In contrast, the AR-SLDS achieves a recognition accuracy of 61.2% at an SNR close to 0 dB, while the performance of the two other methods is equivalent to random guessing (9.1%). Whilst other inference methods may also perform well in this case, we found that EC performs admirably, without numerical instabilities, even for time-series with several thousand time-steps.

5 Discussion

We presented a method for approximate smoothed inference in an augmented class of switching linear dynamical systems. Our approximation is based on the idea that, due to the forgetting which commonly occurs in Markovian models, a finite number of mixture components may provide a reasonable approximation. Clearly, in systems with very long correlation times our method may require too many mixture components to produce a satisfactory result, although we are unaware of other techniques that would be able to cope well in that case.
The main benefit of EC over Kim smoothing is that future information is dealt with more accurately. Whilst EC is not as general as EP, it carefully exploits the properties of singly connected distributions, such as the aSLDS, to provide a numerically stable procedure. We hope that the ideas presented here may therefore help facilitate the practical application of dynamic hybrid networks.

Acknowledgements

This work is supported by the EU Project FP6-0027787. This paper only reflects the authors' views and funding agencies are not liable for any use that may be made of the information contained herein.

References

[1] Y. Bar-Shalom and X.-R. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, Norwood, MA, 1998.
[2] V. Pavlovic, J. M. Rehg, and J. MacCormick. Learning switching linear models of human motion. In Advances in Neural Information Processing Systems (NIPS 13), pages 981–987, 2001.
[3] A. T. Cemgil, B. Kappen, and D. Barber. A generative model for music transcription. IEEE Transactions on Audio, Speech and Language Processing, 14(2):679–694, 2006.
[4] U. N. Lerner. Hybrid Bayesian Networks for Reasoning about Complex Systems. PhD thesis, Stanford University, 2002.
[5] O. Zoeter. Monitoring Non-Linear and Switching Dynamical Systems. PhD thesis, Radboud University Nijmegen, 2005.
[6] T. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, MIT Media Lab, 2001.
[7] D. Barber. Expectation correction for smoothed inference in switching linear dynamical systems. Journal of Machine Learning Research, 7:2515–2540, 2006.
[8] C.-J. Kim. Dynamic linear models with Markov-switching. Journal of Econometrics, 60:1–22, 1994.
[9] T. Heskes and O. Zoeter. Expectation Propagation for approximate inference in dynamic Bayesian networks. In A. Darwiche and N. Friedman, editors, Uncertainty in Artificial Intelligence, pages 216–223, 2002.
[10] S. Lauritzen and F. Jensen. Stable local computation with conditional Gaussian distributions. Statistics and Computing, 11:191–203, 2001.
[11] G. Kitagawa. The two-filter formula for smoothing and an implementation of the Gaussian-sum smoother. Annals of the Institute of Statistical Mathematics, 46(4):605–623, 1994.
[12] Z. Ghahramani and G. E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):963–996, 1998.
[13] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001.
[14] B. Mesot and D. Barber. Switching linear dynamical systems for noise robust speech recognition. IDIAP-RR 08, 2006.
[15] G. Lathoud, M. Magimai-Doss, B. Mesot, and H. Bourlard. Unsupervised spectral subtraction for noise-robust ASR. In Proceedings of ASRU 2005, pages 189–194, November 2005.
[16] Y. Ephraim and W. J. J. Roberts. Revisiting autoregressive hidden Markov modeling of speech signals. IEEE Signal Processing Letters, 12(2):166–169, February 2005.
[17] R. G. Leonard. A database for speaker-independent digit recognition. In Proceedings of ICASSP 84, volume 3, 1984.