NIPS 2006, Paper 198
Unified Inference for Variational Bayesian Linear Gaussian State-Space Models
David Barber, Silvia Chiappa — IDIAP Research Institute, rue du Simplon 4, Martigny, Switzerland
Abstract: Linear Gaussian State-Space Models are widely used, and a Bayesian treatment of their parameters is therefore of considerable interest. The approximate Variational Bayesian method applied to these models is an attractive approach, used successfully in applications ranging from acoustics to bioinformatics. The most challenging aspect of implementing the method is performing inference on the hidden state sequence of the model. We show how to convert the inference problem so that standard Kalman Filtering/Smoothing recursions from the literature may be applied. This is in contrast to previously published approaches based on Belief Propagation. Our framework both simplifies and unifies the inference problem, so that future applications may be more easily developed. We demonstrate the elegance of the approach on Bayesian temporal ICA, with an application to finding independent dynamical processes underlying noisy EEG signals.

1 Linear Gaussian State-Space Models

Linear Gaussian State-Space Models (LGSSMs)^1 are fundamental in time-series analysis [1, 2, 3]. In these models the observations v_{1:T}^2 are generated from an underlying dynamical system on h_{1:T} according to

    v_t = B h_t + \eta_t^v,   \eta_t^v \sim \mathcal{N}(0_V, \Sigma_V),
    h_t = A h_{t-1} + \eta_t^h,   \eta_t^h \sim \mathcal{N}(0_H, \Sigma_H),

where \mathcal{N}(\mu, \Sigma) denotes a Gaussian with mean \mu and covariance \Sigma, and 0_X denotes an X-dimensional zero vector. The observation v_t has dimension V and the hidden state h_t has dimension H. Probabilistically, the LGSSM is defined by

    p(v_{1:T}, h_{1:T} | \Theta) = p(v_1|h_1)\, p(h_1) \prod_{t=2}^{T} p(v_t|h_t)\, p(h_t|h_{t-1}),

with p(v_t|h_t) = \mathcal{N}(B h_t, \Sigma_V), p(h_t|h_{t-1}) = \mathcal{N}(A h_{t-1}, \Sigma_H), p(h_1) = \mathcal{N}(\mu, \Sigma), and where \Theta = \{A, B, \Sigma_H, \Sigma_V, \mu, \Sigma\} denotes the model parameters. Because of the widespread use of these models, a Bayesian treatment of the parameters is of considerable interest [4, 5, 6, 7, 8].
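The generative process above can be sketched in a few lines of NumPy. This is an illustrative simulation only, not code from the paper; the dimensions and parameter values are chosen arbitrarily.

```python
import numpy as np

def sample_lgssm(A, B, Sigma_H, Sigma_V, mu, Sigma, T, rng):
    """Draw (h_{1:T}, v_{1:T}) from the LGSSM defined in the text."""
    H, V = A.shape[0], B.shape[0]
    h = np.empty((T, H))
    v = np.empty((T, V))
    h[0] = rng.multivariate_normal(mu, Sigma)               # p(h_1) = N(mu, Sigma)
    for t in range(T):
        if t > 0:                                           # p(h_t|h_{t-1}) = N(A h_{t-1}, Sigma_H)
            h[t] = rng.multivariate_normal(A @ h[t - 1], Sigma_H)
        v[t] = rng.multivariate_normal(B @ h[t], Sigma_V)   # p(v_t|h_t) = N(B h_t, Sigma_V)
    return h, v

rng = np.random.default_rng(0)
H, V, T = 2, 3, 100
A = 0.9 * np.eye(H)
B = rng.standard_normal((V, H))
h, v = sample_lgssm(A, B, 0.1 * np.eye(H), 0.1 * np.eye(V),
                    np.zeros(H), np.eye(H), T, rng)
print(h.shape, v.shape)  # (100, 2) (100, 3)
```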
An exact implementation of the Bayesian LGSSM is formally intractable [8], and recently a Variational Bayesian (VB) approximation has been studied [4, 5, 6, 7, 9]. The most challenging part of implementing the VB method is performing inference over h_{1:T}, and previous authors have developed their own specialized routines, based on Belief Propagation, since standard LGSSM inference routines appear, at first sight, not to be applicable.

^1 Also called Kalman Filters/Smoothers or Linear Dynamical Systems.
^2 v_{1:T} denotes v_1, . . . , v_T.

A key contribution of this paper is to show how the Variational Bayesian treatment of the LGSSM can be implemented using standard LGSSM inference routines. Based on the insight we provide, any standard inference method may be applied, including those specifically designed to improve numerical stability [2, 10, 11]. In this article we describe the predictor-corrector and Rauch-Tung-Striebel recursions [2], and also suggest a small modification that reduces the computational cost.

The Bayesian LGSSM is of particular interest when strong prior constraints are needed to find adequate solutions. One such case is EEG signal analysis, in which we wish to extract sources that evolve independently through time. Since EEG is particularly noisy [12], a prior that encourages sources to have preferential dynamics is advantageous. This application is discussed in Section 4, and demonstrates the ease of applying our VB framework.

2 Bayesian Linear Gaussian State-Space Models

In the Bayesian treatment of the LGSSM, instead of considering the model parameters \Theta as fixed, we define a prior distribution p(\Theta|\hat\Theta), where \hat\Theta is a set of hyperparameters. Then

    p(v_{1:T}|\hat\Theta) = \int_\Theta p(v_{1:T}|\Theta)\, p(\Theta|\hat\Theta).    (1)

In a full Bayesian treatment we would define additional prior distributions over the hyperparameters \hat\Theta.
Here we instead take the ML-II ('evidence') framework, in which the optimal set of hyperparameters is found by maximizing p(v_{1:T}|\hat\Theta) with respect to \hat\Theta [6, 7, 9]. For the parameter priors, here we define Gaussians on the columns of A and B^3:

    p(A|\alpha, \Sigma_H) \propto \prod_{j=1}^{H} e^{-\frac{\alpha_j}{2} (A_j - \hat{A}_j)^T \Sigma_H^{-1} (A_j - \hat{A}_j)},
    p(B|\beta, \Sigma_V) \propto \prod_{j=1}^{H} e^{-\frac{\beta_j}{2} (B_j - \hat{B}_j)^T \Sigma_V^{-1} (B_j - \hat{B}_j)},

which has the effect of biasing the transition and emission matrices towards desired forms \hat{A} and \hat{B}. The conjugate priors for general inverse covariances \Sigma_H^{-1} and \Sigma_V^{-1} are Wishart distributions [7]^4. In the simpler case assumed here of diagonal covariances these become Gamma distributions [5, 7]. The hyperparameters are then \hat\Theta = \{\alpha, \beta\}^5.

Variational Bayes

Optimizing Eq. (1) with respect to \hat\Theta is difficult due to the intractability of the integrals. Instead, in VB, one considers the lower bound [6, 7, 9]^6

    L = \log p(v_{1:T}|\hat\Theta) \ge H_q(\Theta, h_{1:T}) + \langle \log p(\Theta|\hat\Theta) \rangle_{q(\Theta)} + \langle E(h_{1:T}, \Theta) \rangle_{q(\Theta, h_{1:T})} \equiv F,

where E(h_{1:T}, \Theta) \equiv \log p(v_{1:T}, h_{1:T}|\Theta), H_d(x) signifies the entropy of the distribution d(x), and \langle \cdot \rangle_{d(x)} denotes the expectation operator. The key approximation in VB is q(\Theta, h_{1:T}) \equiv q(\Theta) q(h_{1:T}), from which one may show that, for optimality of F,

    q(h_{1:T}) \propto e^{\langle E(h_{1:T}, \Theta) \rangle_{q(\Theta)}},   q(\Theta) \propto p(\Theta|\hat\Theta)\, e^{\langle E(h_{1:T}, \Theta) \rangle_{q(h_{1:T})}}.

These coupled equations need to be iterated to convergence. The updates for the parameter distributions q(\Theta) are straightforward and are given in Appendices A and B. Once converged, the hyperparameters are updated by maximizing F with respect to \hat\Theta, which leads to simple update formulae [7]. Our main concern is the update for q(h_{1:T}), for which this paper departs from previously presented treatments.

^3 More general Gaussian priors may be more suitable, depending on the application.
^4 For expositional simplicity, we do not put priors on \mu and \Sigma.
^5 For simplicity, we keep the parameters of the Gamma priors fixed.
^6 Strictly, we should write q(\cdot|v_{1:T}) throughout.
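The column-wise Gaussian prior on A can be evaluated directly from its definition. The sketch below computes the unnormalized log-density; the function name and the test values are mine, not the paper's.

```python
import numpy as np

def log_prior_A(A, A_hat, alpha, Sigma_H_inv):
    """Unnormalized log p(A|alpha, Sigma_H): independent Gaussians on the
    columns A_j, each centred on A_hat_j with precision alpha_j * Sigma_H_inv."""
    lp = 0.0
    for j in range(A.shape[1]):
        d = A[:, j] - A_hat[:, j]
        lp += -0.5 * alpha[j] * (d @ Sigma_H_inv @ d)
    return lp

# The prior is maximized at A = A_hat and penalizes deviations from it.
H = 3
A_hat = np.zeros((H, H))
alpha = np.array([1.0, 2.0, 3.0])
lp_center = log_prior_A(A_hat, A_hat, alpha, np.eye(H))
lp_shifted = log_prior_A(A_hat + 1.0, A_hat, alpha, np.eye(H))
```

Larger alpha_j values pull column j of the transition matrix more strongly towards the corresponding column of \hat{A}, which is exactly the biasing effect described in the text.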
We omit the dependence on v_{1:T} for notational convenience.

3 Unified Inference on q(h_{1:T})

Optimally, q(h_{1:T}) is Gaussian since, up to a constant, \langle E(h_{1:T}, \Theta) \rangle_{q(\Theta)} is quadratic in h_{1:T}^7:

    -\frac{1}{2} \sum_{t=1}^{T} \left[ \langle (v_t - B h_t)^T \Sigma_V^{-1} (v_t - B h_t) \rangle_{q(B, \Sigma_V)} + \langle (h_t - A h_{t-1})^T \Sigma_H^{-1} (h_t - A h_{t-1}) \rangle_{q(A, \Sigma_H)} \right].    (2)

In addition, optimally, q(A|\Sigma_H) and q(B|\Sigma_V) are Gaussians (see Appendix A), so we can easily carry out the averages in Eq. (2). The further averages over q(\Sigma_H) and q(\Sigma_V) are also easy due to conjugacy. Whilst this defines the distribution q(h_{1:T}), quantities such as q(h_t), required for example for the parameter updates (see the Appendices), need to be inferred from this distribution. Clearly, in the non-Bayesian case, the averages over the parameters are not present, and the above simply represents the posterior distribution of an LGSSM whose visible variables have been clamped to their evidential states. In that case, inference can be performed using any standard LGSSM routine. Our aim, therefore, is to represent the averaged Eq. (2) directly as the posterior distribution \tilde{q}(h_{1:T}|\tilde{v}_{1:T}) of an LGSSM, for some suitable parameter settings.

Mean + Fluctuation Decomposition

A useful decomposition is to write

    \langle (v_t - B h_t)^T \Sigma_V^{-1} (v_t - B h_t) \rangle_{q(B, \Sigma_V)} = \underbrace{(v_t - \langle B \rangle h_t)^T \langle \Sigma_V^{-1} \rangle (v_t - \langle B \rangle h_t)}_{mean} + \underbrace{h_t^T S_B h_t}_{fluctuation},

and similarly

    \langle (h_t - A h_{t-1})^T \Sigma_H^{-1} (h_t - A h_{t-1}) \rangle_{q(A, \Sigma_H)} = \underbrace{(h_t - \langle A \rangle h_{t-1})^T \langle \Sigma_H^{-1} \rangle (h_t - \langle A \rangle h_{t-1})}_{mean} + \underbrace{h_{t-1}^T S_A h_{t-1}}_{fluctuation},

where the parameter covariances are S_B \equiv \langle B^T \Sigma_V^{-1} B \rangle - \langle B \rangle^T \langle \Sigma_V^{-1} \rangle \langle B \rangle = V H_B^{-1} and S_A \equiv \langle A^T \Sigma_H^{-1} A \rangle - \langle A \rangle^T \langle \Sigma_H^{-1} \rangle \langle A \rangle = H H_A^{-1} (for H_A and H_B defined in Appendix A). The mean terms simply represent a clamped LGSSM with averaged parameters. However, the extra contributions from the fluctuations mean that Eq. (2) cannot be written as a clamped LGSSM with averaged parameters. In order to deal with these extra terms, our idea is to treat the fluctuations as arising from an augmented visible variable, for which Eq.
(2) can then be considered as a clamped LGSSM.

Inference Using an Augmented LGSSM

To represent Eq. (2) as an LGSSM \tilde{q}(h_{1:T}|\tilde{v}_{1:T}), we may augment v_t and B as^8

    \tilde{v}_t = vert(v_t, 0_H, 0_H),   \tilde{B} = vert(\langle B \rangle, U_A, U_B),

where U_A is the Cholesky factor of S_A, so that U_A^T U_A = S_A; similarly, U_B is the Cholesky factor of S_B. The equivalent LGSSM \tilde{q}(h_{1:T}|\tilde{v}_{1:T}) is then completed by specifying^9

    \tilde{A} \equiv \langle A \rangle,   \tilde\Sigma_H \equiv \langle \Sigma_H^{-1} \rangle^{-1},   \tilde\Sigma_V \equiv diag(\langle \Sigma_V^{-1} \rangle^{-1}, I_H, I_H),   \tilde\mu \equiv \mu,   \tilde\Sigma \equiv \Sigma.

The validity of this parameter assignment can be checked by showing that, up to negligible constants, the exponent of this augmented LGSSM has the same form as Eq. (2)^10. Now that this has been written as an LGSSM \tilde{q}(h_{1:T}|\tilde{v}_{1:T}), standard inference routines in the literature may be applied to compute q(h_t|v_{1:T}) = \tilde{q}(h_t|\tilde{v}_{1:T}) [1, 2, 11]^11.

^7 For simplicity of exposition, we ignore the first time-point here.
^8 The notation vert(x_1, . . . , x_n) stands for vertically concatenating the arguments x_1, . . . , x_n.
^9 Strictly, we need a time-dependent emission \tilde{B}_t = \tilde{B} for t = 1, . . . , T-1; for time T, \tilde{B}_T has the Cholesky factor U_A replaced by 0_{H,H}.
^10 There are several ways of achieving a similar augmentation. We chose this one since, in the non-Bayesian limit U_A = U_B = 0_{H,H}, no numerical instabilities would be introduced.
^11 Note that, since the augmented LGSSM \tilde{q}(h_{1:T}|\tilde{v}_{1:T}) is designed to match the fully clamped distribution q(h_{1:T}|v_{1:T}), the filtered posterior \tilde{q}(h_t|\tilde{v}_{1:t}) does not correspond to q(h_t|v_{1:t}).

Algorithm 1 LGSSM: Forward and backward recursive updates. The smoothed posterior p(h_t|v_{1:T}) is returned in the mean \hat{h}_t^T and covariance P_t^T.
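The augmentation step can be sketched directly from its definition. This is an illustrative construction under my own naming; it uses the convention U^T U = S (so U is the transpose of NumPy's lower-triangular Cholesky factor) and, for simplicity, ignores the time-dependent boundary handling of footnote 9.

```python
import numpy as np

def augment(v_t, B_mean, S_A, S_B):
    """Build the augmented observation and emission:
    v~_t = vert(v_t, 0_H, 0_H),  B~ = vert(<B>, U_A, U_B)."""
    U_A = np.linalg.cholesky(S_A).T   # U_A^T U_A = S_A
    U_B = np.linalg.cholesky(S_B).T   # U_B^T U_B = S_B
    H_A, H_B = S_A.shape[0], S_B.shape[0]
    v_tilde = np.concatenate([v_t, np.zeros(H_A), np.zeros(H_B)])
    B_tilde = np.vstack([B_mean, U_A, U_B])
    return v_tilde, B_tilde

rng = np.random.default_rng(0)
H, V = 3, 2
M1, M2 = rng.standard_normal((H, H)), rng.standard_normal((H, H))
S_A = M1 @ M1.T + np.eye(H)           # arbitrary SPD fluctuation covariances
S_B = M2 @ M2.T + np.eye(H)
v_t = rng.standard_normal(V)
B_mean = rng.standard_normal((V, H))
v_tilde, B_tilde = augment(v_t, B_mean, S_A, S_B)
```

With the identity noise blocks of \tilde\Sigma_V, the augmented quadratic term \tilde{B}^T \tilde\Sigma_V^{-1} \tilde{B} contributes \langle B \rangle^T \langle \Sigma_V^{-1} \rangle \langle B \rangle + S_A + S_B, which is exactly the mean plus fluctuation contribution required.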
procedure FORWARD
    1a: P <- \Sigma
    1b: P <- D \Sigma, where D \equiv I - \Sigma U_{AB}^T (I + U_{AB} \Sigma U_{AB}^T)^{-1} U_{AB}
    2a: \hat{h}_1^0 <- \mu
    2b: \hat{h}_1^0 <- D \mu
    3:  K <- P B^T (B P B^T + \Sigma_V)^{-1},  P_1^1 <- (I - K B) P,  \hat{h}_1^1 <- \hat{h}_1^0 + K (v_1 - B \hat{h}_1^0)
    for t <- 2, T do
        4:  P_t^{t-1} <- A P_{t-1}^{t-1} A^T + \Sigma_H
        5a: P <- P_t^{t-1}
        5b: P <- D_t P_t^{t-1}, where D_t \equiv I - P_t^{t-1} U_{AB}^T (I + U_{AB} P_t^{t-1} U_{AB}^T)^{-1} U_{AB}
        6a: \hat{h}_t^{t-1} <- A \hat{h}_{t-1}^{t-1}
        6b: \hat{h}_t^{t-1} <- D_t A \hat{h}_{t-1}^{t-1}
        7:  K <- P B^T (B P B^T + \Sigma_V)^{-1},  P_t^t <- (I - K B) P,  \hat{h}_t^t <- \hat{h}_t^{t-1} + K (v_t - B \hat{h}_t^{t-1})
    end for
end procedure

procedure BACKWARD
    for t <- T-1, 1 do
        \overleftarrow{A}_t <- P_t^t A^T (P_{t+1}^t)^{-1}
        P_t^T <- P_t^t + \overleftarrow{A}_t (P_{t+1}^T - P_{t+1}^t) \overleftarrow{A}_t^T
        \hat{h}_t^T <- \hat{h}_t^t + \overleftarrow{A}_t (\hat{h}_{t+1}^T - A \hat{h}_t^t)
    end for
end procedure

For completeness, we describe the standard predictor-corrector form of the Kalman Filter, together with the Rauch-Tung-Striebel Smoother [2]. These are given in Algorithm 1, where \tilde{q}(h_t|\tilde{v}_{1:T}) is computed by calling the FORWARD and BACKWARD procedures.

We present two variants of the FORWARD pass. Either we may call procedure FORWARD in Algorithm 1 with parameters \tilde{A}, \tilde{B}, \tilde\Sigma_H, \tilde\Sigma_V, \tilde\mu, \tilde\Sigma and the augmented visible variables \tilde{v}_t, in which case we use steps 1a, 2a, 5a and 6a. This is exactly the predictor-corrector form of a Kalman Filter [2]. Otherwise, in order to reduce the computational cost, we may call procedure FORWARD with the parameters \tilde{A}, \langle B \rangle, \tilde\Sigma_H, \langle \Sigma_V^{-1} \rangle^{-1}, \tilde\mu, \tilde\Sigma and the original visible variable v_t, in which case we use steps 1b (where U_{AB}^T U_{AB} \equiv S_A + S_B), 2b, 5b and 6b. The two algorithms are mathematically equivalent. Computing q(h_t|v_{1:T}) = \tilde{q}(h_t|\tilde{v}_{1:T}) is then completed by calling the common BACKWARD pass. The important point here is that the reader may supply any standard Kalman Filtering/Smoothing routine, and simply call it with the appropriate parameters.
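The standard variant of Algorithm 1 (steps 1a, 2a, 5a, 6a) can be transcribed into NumPy as below. This is my own sketch of the textbook predictor-corrector filter and RTS smoother, not the authors' code, and it includes none of the numerical-stability measures (e.g. square-root forms) mentioned in the text.

```python
import numpy as np

def kalman_smoother(v, A, B, Sigma_H, Sigma_V, mu, Sigma):
    """Predictor-corrector Kalman filter + Rauch-Tung-Striebel smoother.
    Returns smoothed means E[h_t|v_{1:T}] and covariances Cov[h_t|v_{1:T}]."""
    T, H = len(v), len(mu)
    f_mean = np.empty((T, H))          # filtered means  h^t_t
    f_cov = np.empty((T, H, H))        # filtered covariances P^t_t
    p_cov = np.empty((T, H, H))        # predicted covariances P^{t-1}_t (P^0_1 = Sigma)
    h_pred, P = mu, Sigma              # steps 1a, 2a
    for t in range(T):
        if t > 0:
            h_pred = A @ f_mean[t - 1]                 # step 6a
            P = A @ f_cov[t - 1] @ A.T + Sigma_H       # steps 4, 5a
        p_cov[t] = P
        K = P @ B.T @ np.linalg.inv(B @ P @ B.T + Sigma_V)   # steps 3, 7
        f_mean[t] = h_pred + K @ (v[t] - B @ h_pred)
        f_cov[t] = (np.eye(H) - K @ B) @ P
    s_mean, s_cov = f_mean.copy(), f_cov.copy()        # BACKWARD (RTS) pass
    for t in range(T - 2, -1, -1):
        G = f_cov[t] @ A.T @ np.linalg.inv(p_cov[t + 1])     # smoother gain
        s_mean[t] = f_mean[t] + G @ (s_mean[t + 1] - A @ f_mean[t])
        s_cov[t] = f_cov[t] + G @ (s_cov[t + 1] - p_cov[t + 1]) @ G.T
    return s_mean, s_cov

# Toy 1-D usage with arbitrary parameters.
rng = np.random.default_rng(1)
A1, B1 = np.array([[0.95]]), np.array([[1.0]])
v = rng.standard_normal((50, 1))
m, C = kalman_smoother(v, A1, B1, 0.1 * np.eye(1), 0.5 * np.eye(1),
                       np.zeros(1), np.eye(1))
```

Calling this routine with the augmented parameters of the previous section would then realize the VB update for q(h_{1:T}).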
In some parameter regimes, or for very long time-series, numerical stability may be a serious concern, for which several stabilized algorithms have been developed over the years, for example the square-root forms [2, 10, 11]. By converting the problem to a standard form, we have therefore unified and simplified inference, so that future applications may be more readily developed^12.

3.1 Relation to Previous Approaches

An alternative approach to the one above, taken in [5, 7], is to write the posterior as

    \log q(h_{1:T}) = \sum_{t=2}^{T} \phi_t(h_{t-1}, h_t) + const.

for suitably defined quadratic forms \phi_t(h_{t-1}, h_t). Here the potentials \phi_t(h_{t-1}, h_t) encode the averaging over the parameters A, B, \Sigma_H, \Sigma_V. The approach taken in [7] is to recognize this as a pairwise Markov chain, for which the Belief Propagation recursions may be applied. The approach in [5] is based on a Kullback-Leibler minimization of the posterior with a chain structure, which is algorithmically equivalent to Belief Propagation. Whilst these are mathematically valid procedures, the resulting algorithms do not correspond to any of the standard forms in the Kalman Filtering/Smoothing literature, whose properties have been well studied [14].

^12 The computation of the log-likelihood bound does not require any augmentation.

4 An Application to Bayesian ICA

A particular case for which the Bayesian LGSSM is of interest is in extracting independent source signals underlying a multivariate time-series [5, 15]. This will demonstrate how the approach developed in Section 3 makes VB easy to apply. The sources s^i are modeled as independent in the following sense: p(s^i_{1:T}, s^j_{1:T}) = p(s^i_{1:T}) p(s^j_{1:T}) for i \ne j, i, j = 1, . . . , C. Independence implies block-diagonal transition and state noise matrices A, \Sigma_H and \Sigma, where each block c has dimension H_c.
A one-dimensional source s_t^c for each independent dynamical subsystem is then formed from s_t^c = 1_c^T h_t^c, where 1_c is a unit vector and h_t^c is the state of dynamical system c. Combining the sources, we can write s_t = P h_t, where P = diag(1_1^T, . . . , 1_C^T) and h_t = vert(h_t^1, . . . , h_t^C). The resulting emission matrix is constrained to be of the form B = W P, where W is the V x C mixing matrix. This means that the observations are formed by linearly mixing the sources, v_t = W s_t + \eta_t^v. The graphical structure of this model is presented in Fig. 1. [Figure 1: The structure of the LGSSM for ICA.] To encourage redundant components to be removed, we place a zero-mean Gaussian prior on W. In this case, we do not define a prior for the parameters \Sigma_H and \Sigma_V, which are instead considered as hyperparameters. More details of the model are given in [15]. The constraint B = W P requires a minor modification from Section 3, as we discuss below.

Inference on q(h_{1:T})

A small modification of the mean + fluctuation decomposition for B occurs, namely

    \langle (v_t - B h_t)^T \Sigma_V^{-1} (v_t - B h_t) \rangle_{q(W)} = (v_t - \langle B \rangle h_t)^T \Sigma_V^{-1} (v_t - \langle B \rangle h_t) + h_t^T P^T S_W P h_t,

where \langle B \rangle \equiv \langle W \rangle P and S_W = V H_W^{-1}. The quantities \langle W \rangle and H_W are obtained as in Appendix A.1 with the replacement h_t <- P h_t. To represent the above as an LGSSM, we augment v_t and B as

    \tilde{v}_t = vert(v_t, 0_H, 0_C),   \tilde{B} = vert(\langle B \rangle, U_A, U_W P),

where U_W is the Cholesky factor of S_W. The equivalent LGSSM is then completed by specifying \tilde{A} \equiv \langle A \rangle, \tilde\Sigma_H \equiv \Sigma_H, \tilde\Sigma_V \equiv diag(\Sigma_V, I_H, I_C), \tilde\mu \equiv \mu, \tilde\Sigma \equiv \Sigma, and inference for q(h_{1:T}) is performed using Algorithm 1. This demonstrates the elegance and unity of the approach in Section 3, since no new algorithm needs to be developed to perform inference, even in this special constrained parameter case.

4.1 Demonstration

As a simple demonstration, we used an LGSSM to generate 3 sources s_t^c with random 5x5 transition matrices A^c, \mu = 0_H and \Sigma \equiv \Sigma_H \equiv I_H.
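The projection P = diag(1_1^T, . . . , 1_C^T) can be built mechanically from the block dimensions. In this sketch I read "unit vector" as the first standard basis vector of each block (an assumption on my part; a different choice of 1_c only changes P's nonzero pattern).

```python
import numpy as np

def source_projection(block_dims):
    """Build P such that s_t = P h_t extracts one scalar source per
    dynamical subsystem: row c selects the first coordinate of block c."""
    C, H = len(block_dims), sum(block_dims)
    P = np.zeros((C, H))
    col = 0
    for c, Hc in enumerate(block_dims):
        P[c, col] = 1.0   # 1_c taken as the first basis vector of block c
        col += Hc
    return P

# Two subsystems of dimension 2 and 3 give a 2 x 5 projection.
P = source_projection([2, 3])
```

Given a V x C mixing matrix W, the constrained emission is then simply B = W @ P.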
The sources were mixed into three observations v_t = W s_t + \eta_t^v, with the elements of W drawn from a zero-mean, unit-variance Gaussian, and \Sigma_V = I_V. We then trained a Bayesian LGSSM with 5 sources and 7x7 transition matrices A^c. To bias the model towards the simplest sources, we used \hat{A}^c \equiv 0_{H_c, H_c} for all sources. In Fig. 2a and Fig. 2b we show the original sources and the noisy observations respectively. In Fig. 2c we show the sources estimated by our method after convergence of the hyperparameter updates. Two of the 5 sources have been removed, and the remaining three are a reasonable estimate of the original sources.

[Figure 2: (a) Original sources s_t. (b) Observations resulting from mixing the original sources, v_t = W s_t + \eta_t^v, \eta_t^v ~ N(0, I). (c) Recovered sources using the Bayesian LGSSM. (d) Sources found with the MAP LGSSM.]

[Figure 3: (a) Original raw EEG recordings from 4 channels. (b-e) 16 sources s_t estimated by the Bayesian LGSSM.]

Another possible approach for introducing prior knowledge is to use a Maximum a Posteriori (MAP) procedure, by adding a prior term to the original log-likelihood: \log p(v_{1:T}|A, W, \Sigma_H, \Sigma_V, \mu, \Sigma) + \log p(A|\alpha) + \log p(W|\beta). However, it is not clear how to reliably find the hyperparameters \alpha and \beta in this case. One solution is to estimate them by optimizing the new objective function jointly with respect to the parameters and hyperparameters (the so-called joint MAP estimation; see for example [16]). A typical result of using this joint MAP approach on the artificial data is presented in Fig. 2d: the joint MAP does not estimate the hyperparameters well, and the incorrect number of sources is identified.
4.2 Application to EEG Analysis

In Fig. 3a we plot three seconds of EEG data recorded from 4 channels (located in the right hemisphere) while a person performs imagined movement of the right hand. As is typical in EEG, each channel shows drift below 1 Hz, corresponding to artifacts of the instrumentation, together with 50 Hz mains contamination; these mask the rhythmical activity related to the mental task, mainly centered at 10 and 20 Hz [17]. We would therefore like a method which enables us to extract components in these information-rich 10 and 20 Hz frequency bands. Standard ICA methods such as FastICA do not find satisfactory sources based on the raw 'noisy' data, and preprocessing with band-pass filters is usually required. Additionally, in EEG research, flexibility in the number of recovered sources is important, since there may be many independent oscillators of interest underlying the observations, and we would like some way to automatically determine their effective number.

To preferentially find sources at particular frequencies, we specified a block-diagonal matrix \hat{A}^c for each source c, where each block is a 2x2 rotation matrix at the desired frequency. We defined the following 16 groups of frequencies: [0.5], [0.5], [0.5], [0.5]; [10,11], [10,11], [10,11], [10,11]; [20,21], [20,21], [20,21], [20,21]; [50], [50], [50], [50]. The temporal evolution of the sources obtained after training the Bayesian LGSSM is given in Fig. 3(b-e), grouped by frequency range. The Bayesian LGSSM removed 4 unnecessary sources from the mixing matrix W, namely one [10,11] Hz and three [20,21] Hz sources. The first 4 sources contain dominant low-frequency drift; sources 5, 6 and 8 contain [10,11] Hz activity, while source 10 contains activity centered at [20,21] Hz. Of the 4 sources initialized to 50 Hz, only 2 retained 50 Hz activity, while the A^c of the other two changed to model other frequencies present in the EEG.
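A 2x2 rotation block at a target frequency can be constructed as below. The sampling rate is not stated in the text, so it appears here as an assumed parameter fs_hz; the state rotates by 2*pi*f/fs per step, making the corresponding source oscillate at f Hz.

```python
import numpy as np

def rotation_block(freq_hz, fs_hz):
    """2x2 rotation matrix advancing the phase by 2*pi*freq_hz/fs_hz per sample."""
    th = 2.0 * np.pi * freq_hz / fs_hz
    return np.array([[np.cos(th), -np.sin(th)],
                     [np.sin(th),  np.cos(th)]])

# E.g. a 10 Hz prior block at an assumed 100 Hz sampling rate.
R = rotation_block(10.0, 100.0)
```

Stacking such blocks block-diagonally gives the frequency-biased \hat{A}^c described in the text; a rotation is orthogonal with unit determinant, so the prior dynamics are energy-preserving oscillations.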
This method demonstrates the usefulness and applicability of the VB method in a real-world situation.

5 Conclusion

We considered the application of Variational Bayesian learning to Linear Gaussian State-Space Models. This is an important class of models with widespread application, and finding a simple way to implement this approximate Bayesian procedure is of considerable interest. The most demanding part of the procedure is inference of the hidden states of the model. Previously, this has been achieved using Belief Propagation, which differs from inference in the Kalman Filtering/Smoothing literature, for which highly efficient and stabilized procedures exist. A central contribution of this paper is to show how inference can be written using the standard Kalman Filtering/Smoothing recursions by augmenting the original model. Additionally, a minor modification to the standard Kalman Filtering routine may be applied for computational efficiency. We demonstrated the elegance and unity of our approach by showing how to easily apply a Variational Bayes analysis to temporal ICA. Specifically, our Bayesian ICA approach successfully extracts independent processes underlying EEG signals, biased towards preferred frequency ranges. We hope that this simple and unifying interpretation of Variational Bayesian LGSSMs may facilitate further application to related models.

A Parameter Updates for A and B

A.1 Determining q(B|\Sigma_V)

By examining F, the contribution of q(B|\Sigma_V) can be interpreted as the negative KL divergence between q(B|\Sigma_V) and a Gaussian. Hence, optimally, q(B|\Sigma_V) is a Gaussian. The covariance [\Sigma_B]_{ij,kl} \equiv \langle (B_{ij} - \langle B_{ij} \rangle)(B_{kl} - \langle B_{kl} \rangle) \rangle (averages with respect to q(B|\Sigma_V)) is given by

    [\Sigma_B]_{ij,kl} = [H_B^{-1}]_{jl} [\Sigma_V]_{ik},   where [H_B]_{jl} \equiv \sum_{t=1}^{T} \langle h_t^j h_t^l \rangle_{q(h_t)} + \beta_j \delta_{jl}.

The mean is given by \langle B \rangle = N_B H_B^{-1}, where [N_B]_{ij} \equiv \sum_{t=1}^{T} \langle h_t^j \rangle_{q(h_t)} v_t^i + \beta_j \hat{B}_{ij}.
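The update for \langle B \rangle can be written compactly from the sufficient statistics. This is an illustrative sketch with my own argument names: Eh[t] holds \langle h_t \rangle and Ehh[t] holds \langle h_t h_t^T \rangle under q(h_t).

```python
import numpy as np

def update_B(v, Eh, Ehh, beta, B_hat):
    """Mean of q(B|Sigma_V): <B> = N_B H_B^{-1}, with
    [H_B]_{jl} = sum_t <h_t^j h_t^l> + beta_j * delta_{jl}
    [N_B]_{ij} = sum_t <h_t^j> v_t^i + beta_j * B_hat_{ij}."""
    H_B = Ehh.sum(axis=0) + np.diag(beta)
    N_B = v.T @ Eh + B_hat * beta[None, :]
    return N_B @ np.linalg.inv(H_B)

# Sanity check: with zero sufficient statistics (no data contribution),
# the posterior mean reverts to the prior centre B_hat.
B_hat = np.array([[1.0, -2.0], [0.5, 3.0]])
B_mean = update_B(np.zeros((3, 2)), np.zeros((3, 2)), np.zeros((3, 2, 2)),
                  np.ones(2), B_hat)
```

The analogous update for \langle A \rangle (below) has the same structure, with the time-lagged statistics \langle h_{t-1} h_t^T \rangle replacing the instantaneous ones.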
A.2 Determining q(A|\Sigma_H)

Optimally, q(A|\Sigma_H) is a Gaussian with covariance

    [\Sigma_A]_{ij,kl} = [H_A^{-1}]_{jl} [\Sigma_H]_{ik},   where [H_A]_{jl} \equiv \sum_{t=1}^{T-1} \langle h_t^j h_t^l \rangle_{q(h_t)} + \alpha_j \delta_{jl}.

The mean is given by \langle A \rangle = N_A H_A^{-1}, where [N_A]_{ij} \equiv \sum_{t=2}^{T} \langle h_{t-1}^j h_t^i \rangle_{q(h_{t-1:t})} + \alpha_j \hat{A}_{ij}.

B Covariance Updates

By specifying a Wishart prior for the inverse of the covariances, conjugate update formulae are possible. In practice, it is more common to specify diagonal inverse covariances, for which the corresponding priors are simply Gamma distributions [7, 5]. For this simple diagonal case, the explicit updates are given below.

Determining q(\Sigma_V)

For the constraint \Sigma_V^{-1} = diag(\rho), where each diagonal element follows a Gamma prior Ga(b_1, b_2) [7], q(\rho) factorizes and the optimal updates are

    q(\rho_i) = Ga\left( b_1 + \frac{T}{2},\; b_2 + \frac{1}{2} \left[ \sum_{t=1}^{T} (v_t^i)^2 - [G_B]_{ii} + \sum_j \beta_j \hat{B}_{ij}^2 \right] \right),

where G_B \equiv N_B H_B^{-1} N_B^T.

Determining q(\Sigma_H)

Analogously, for \Sigma_H^{-1} = diag(\tau) with prior Ga(a_1, a_2) [5], the updates are

    q(\tau_i) = Ga\left( a_1 + \frac{T-1}{2},\; a_2 + \frac{1}{2} \left[ \sum_{t=2}^{T} \langle (h_t^i)^2 \rangle - [G_A]_{ii} + \sum_j \alpha_j \hat{A}_{ij}^2 \right] \right),

where G_A \equiv N_A H_A^{-1} N_A^T.

Acknowledgments

This work is supported by the European DIRAC Project FP6-0027787. This paper only reflects the authors' views; funding agencies are not liable for any use that may be made of the information contained herein.

References

[1] Y. Bar-Shalom and X.-R. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, 1998.
[2] M. S. Grewal and A. P. Andrews. Kalman Filtering: Theory and Practice Using MATLAB. John Wiley and Sons, Inc., 2001.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications. Springer, 2000.
[4] M. J. Beal, F. Falciani, Z. Ghahramani, C. Rangel, and D. L. Wild. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics, 21:349–356, 2005.
[5] A. T. Cemgil and S. J. Godsill. Probabilistic phase vocoder and its application to interpolation of missing values in audio signals.
In 13th European Signal Processing Conference, 2005.
[6] H. Valpola and J. Karhunen. An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Computation, 14:2647–2692, 2002.
[7] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[8] M. Davy and S. J. Godsill. Bayesian harmonic models for musical signal analysis (with discussion). In J. O. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics VII. Oxford University Press, 2003.
[9] D. J. C. MacKay. Ensemble learning and evidence maximisation. Unpublished manuscript: www.variational-bayes.org, 1995.
[10] M. Morf and T. Kailath. Square-root algorithms for least-squares estimation. IEEE Transactions on Automatic Control, 20:487–497, 1975.
[11] P. Park and T. Kailath. New square-root smoothing algorithms. IEEE Transactions on Automatic Control, 41:727–732, 1996.
[12] E. Niedermeyer and F. Lopes da Silva. Electroencephalography: Basic Principles, Clinical Applications and Related Fields. Lippincott Williams and Wilkins, 1999.
[13] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11:305–345, 1999.
[14] M. Verhaegen and P. Van Dooren. Numerical aspects of different Kalman filter implementations. IEEE Transactions on Automatic Control, 31:907–917, 1986.
[15] S. Chiappa and D. Barber. Bayesian linear Gaussian state-space models for biosignal decomposition. IEEE Signal Processing Letters, 14, 2007.
[16] S. S. Saquib, C. A. Bouman, and K. Sauer. ML parameter estimation for Markov random fields with applications to Bayesian tomography. IEEE Transactions on Image Processing, 7:1029–1044, 1998.
[17] G. Pfurtscheller and F. H. Lopes da Silva. Event-related EEG/MEG synchronization and desynchronization: basic principles. Clinical Neurophysiology, pages 1842–1857, 1999.
Reference: text
sentIndex sentText sentNum sentScore
1 Unified Inference for Variational Bayesian Linear Gaussian State-Space Models David Barber IDIAP Research Institute rue du Simplon 4, Martigny, Switzerland david. [sent-1, score-0.056]
2 ch Silvia Chiappa IDIAP Research Institute rue du Simplon 4, Martigny, Switzerland silvia. [sent-3, score-0.056]
3 ch Abstract Linear Gaussian State-Space Models are widely used and a Bayesian treatment of parameters is therefore of considerable interest. [sent-5, score-0.068]
4 The most challenging aspect of implementing the method is in performing inference on the hidden state sequence of the model. [sent-7, score-0.06]
5 We show how to convert the inference problem so that standard Kalman Filtering/Smoothing recursions from the literature may be applied. [sent-8, score-0.118]
6 Our framework both simplifies and unifies the inference problem, so that future applications may be more easily developed. [sent-10, score-0.06]
7 We demonstrate the elegance of the approach on Bayesian temporal ICA, with an application to finding independent dynamical processes underlying noisy EEG signals. [sent-11, score-0.086]
8 The observation vt has dimension V and the hidden state ht has dimension H. [sent-14, score-0.808]
9 Because of the widespread use of these models, a Bayesian treatment of parameters is of considerable interest [4, 5, 6, 7, 8]. [sent-16, score-0.095]
10 The most challenging part of implementing the VB method is performing inference over h1:T , and previous authors have developed their own specialized routines, based on Belief Propagation, since standard LGSSM inference routines appear, at first sight, not to be applicable. [sent-18, score-0.175]
11 A key contribution of this paper is to show how the Variational Bayesian treatment of the LGSSM can be implemented using standard LGSSM inference routines. [sent-24, score-0.103]
12 Based on the insight we provide, any standard inference method may be applied, including those specifically addressed to improve numerical stability [2, 10, 11]. [sent-25, score-0.06]
13 In this article, we decided to describe the predictor-corrector and Rauch-Tung-Striebel recursions [2], and also suggest a small modification that reduces computational cost. [sent-26, score-0.058]
14 The Bayesian LGSSM is particularly of interest when strong prior constraints are needed to find adequate solutions. [sent-27, score-0.024]
15 One such case is in EEG signal analysis, whereby we wish to extract sources that evolve independently through time. [sent-28, score-0.161]
16 Since EEG is particularly noisy [12], a prior that encourages sources to have preferential dynamics is advantageous. [sent-29, score-0.185]
17 2 Bayesian Linear Gaussian State-Space Models In the Bayesian treatment of the LGSSM, instead of considering the model parameters Θ as fixed, ˆ ˆ we define a prior distribution p(Θ|Θ), where Θ is a set of hyperparameters. [sent-31, score-0.067]
18 (1) Θ In a full Bayesian treatment we would define additional prior distributions over the hyperparameters ˆ Θ. [sent-33, score-0.112]
19 The −1 −1 4 conjugate priors for general inverse covariances ΣH and ΣV are Wishart distributions [7] . [sent-36, score-0.082]
20 In the simpler case assumed here of diagonal covariances these become Gamma distributions [5, 7]. [sent-37, score-0.077]
21 The updates for the parameters q(Θ) are straightforward and are given in Appendices A and B. [sent-45, score-0.045]
22 Once converged, the hyperparameters are ˆ updated by maximizing F with respect to Θ, which lead to simple update formulae [7]. [sent-46, score-0.074]
23 3 More general Gaussian priors may be more suitable depending on the application. [sent-48, score-0.033]
24 For expositional simplicity, we do not put priors on µ and Σ. [sent-49, score-0.033]
25 5 For simplicity, we keep the parameters of the Gamma priors fixed. [sent-50, score-0.033]
26 (2) In addition, optimally, q(A|ΣH ) and q(B|ΣV ) are Gaussians (see Appendix A), so we can easily carry out the averages in Eq. [sent-54, score-0.032]
27 The further averages over q(ΣH ) and q(ΣV ) are also easy due to conjugacy. [sent-56, score-0.032]
28 Whilst this defines the distribution q(h1:T ), quantities such as q(ht ), required for example for the parameter updates (see the Appendices), need to be inferred from this distribution. [sent-57, score-0.045]
29 Clearly, in the non-Bayesian case, the averages over the parameters are not present, and the above simply represents the posterior distribution of an LGSSM whose visible variables have been clamped into their evidential states. [sent-58, score-0.183]
30 In that case, inference can be performed using any standard LGSSM routine. [sent-59, score-0.06]
31 (2) directly as the posterior distribution q (h1:T |˜1:T ) of an LGSSM , for some suitable parameter settings. [sent-61, score-0.024]
32 The mean terms simply represent a clamped LGSSM with averaged parameters. [sent-63, score-0.092]
33 (2) cannot be written as a clamped LGSSM with averaged parameters. [sent-65, score-0.092]
34 In order to deal with these extra terms, our idea is to treat the fluctuations as arising from an augmented visible variable, for which Eq. [sent-66, score-0.079]
35 (2) as an LGSSM q (h1:T |˜1:T ), we may augment vt and B as8 : ˜ v vt = vert(vt , 0H , 0H ), ˜ ˜ B = vert( B , UA , UB ), T where UA is the Cholesky decomposition of SA , so that UA UA = SA . [sent-69, score-0.573]
36 Similarly, UB is the Cholesky decomposition of SB . [sent-70, score-0.031]
37 The equivalent LGSSM q (h1:T |˜1:T ) is then completed by specifying9 ˜ v ˜ A≡ A , ˜ ΣH ≡ Σ−1 H −1 , ˜ ΣV ≡ diag( Σ−1 V −1 , IH , IH ), µ ≡ µ, ˜ ˜ Σ ≡ Σ. [sent-71, score-0.029]
38 The validity of this parameter assignment can be checked by showing that, up to negligible constants, the exponent of this augmented LGSSM has the same form as Eq. [sent-72, score-0.044]
39 Now that this has been written as an LGSSM q (h1:T |˜1:T ), standard inference routines in the literature may be applied to ˜ v compute q(ht |v1:T ) = q (ht |˜1:T ) [1, 2, 11]11 . [sent-74, score-0.115]
40 9 ˜ ˜ ˜ Strictly, we need a time-dependent emission Bt = B, for t = 1, . [sent-83, score-0.038]
41 11 Note that, since the augmented LGSSM q (h1:T |˜1:T ) is designed to match the fully clamped distribution ˜ v q(h1:T |v1:T ), the filtered posterior q (ht |˜1:t ) does not correspond to q(ht |v1:t ). [sent-90, score-0.16]
42 ˜ v 8 Algorithm 1 LGSSM: Forward and backward recursive updates. [sent-91, score-0.038]
43 The smoothed posterior p(ht |v1:T ) ˆ is returned in the mean hT and covariance PtT . [sent-92, score-0.024]
44 These are given in Algorithm 1, where q (ht |˜1:T ) is computed by calling the FORWARD and BACKWARD procedures. [sent-94, score-0.027]
45 Either we may call procedure FORWARD in ˜ ˜ ˜ ˜ ˜ ˜ Algorithm 1 with parameters A, B, ΣH , ΣV , µ, Σ and the augmented visible variables vt in which ˜ we use steps 1a, 2a, 5a and 6a. [sent-96, score-0.35]
46 Otherwise, in order to reduce the computational cost, we may call procedure FORWARD with the −1 ˜ ˜ parameters A, B , ΣH , Σ−1 , µ, Σ and the original visible variable vt in which we use steps ˜ ˜ V T 1b (where UAB UAB ≡ SA +SB ), 2b, 5b and 6b. [sent-98, score-0.306]
47 Computing q(ht |v1:T ) = q (ht |˜1:T ) is then completed by calling the common BACKWARD pass. [sent-100, score-0.056]
In some parameter regimes, or in very long time-series, numerical stability may be a serious concern, for which several stabilized algorithms have been developed over the years, for example the square-root forms [2, 10, 11]. By converting the problem to a standard form, we have therefore unified and simplified inference, so that future applications may be more readily developed.
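The reduced-cost FORWARD variant works because step 1b's matrix D is a Woodbury rewriting of the precision update: with U_AB U_AB^T = SA + SB, one has DΣ = (Σ^{-1} + SA + SB)^{-1}. A scalar check with arbitrary illustrative numbers:

```python
import math

Sig = 0.8                 # a prior/predicted variance
SA, SB = 0.09, 0.04       # parameter fluctuation terms
U = math.sqrt(SA + SB)    # scalar analogue of the factor U_AB

# D = I - Sigma U (I + U' Sigma U)^{-1} U'  (step 1b), specialized to scalars.
D = 1.0 - Sig * U * (1.0 + U * Sig * U) ** -1 * U
lhs = D * Sig
rhs = 1.0 / (1.0 / Sig + SA + SB)   # direct precision-update form
```

Both expressions agree, so the variant never needs the augmented (larger) observation vector.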
3.1 Relation to Previous Approaches

An alternative approach to the one above, and taken in [5, 7], is to write the posterior as

log q(h1:T ) = ∑_{t=2}^T φt (ht−1 , ht ) + const.

for suitably defined quadratic forms φt (ht−1 , ht ). Here the potentials φt (ht−1 , ht ) encode the averaging over the parameters A, B, ΣH , ΣV . The approach taken in [7] is to recognize this as a pairwise Markov chain, for which the Belief Propagation recursions may be applied. The approach in [5] is based on a Kullback-Leibler minimization of the posterior with a chain structure, which is algorithmically equivalent to Belief Propagation. Whilst these are mathematically valid procedures, the resulting algorithms do not correspond to any of the standard forms in the Kalman Filtering/Smoothing literature, whose properties have been well studied [14].
4 An Application to Bayesian ICA

A particular case for which the Bayesian LGSSM is of interest is in extracting independent source signals underlying a multivariate time-series [5, 15]; this also demonstrates how the approach developed in Section 3 makes VB easy to apply. The sources s^i are modeled as independent in the following sense:

p(s^i_{1:T} , s^j_{1:T} ) = p(s^i_{1:T} ) p(s^j_{1:T} ),  for i ≠ j,  i, j = 1, . . . , C.

Independence implies block diagonal transition and state noise matrices A, ΣH and Σ, where each block c has dimension Hc . A one-dimensional source s^c_t for each independent dynamical subsystem is then formed from s^c_t = (1^c)^T h^c_t , where 1^c is a unit vector and h^c_t is the state of dynamical system c. Combining the sources, we can write st = P ht , where P = diag((1^1)^T , . . . , (1^C)^T ) and ht = vert(h^1_t , . . . , h^C_t ). The resulting emission matrix is constrained to be of the form B = W P , where W is the V × C mixing matrix. This means that the observations are formed from linearly mixing the sources, vt = W st + η^v_t . The graphical structure of this model (the LGSSM for ICA) is presented in Fig 1. To encourage redundant components to be removed, we place a zero mean Gaussian prior on W . In this case, we do not define a prior for the parameters ΣH and ΣV , which are instead considered as hyperparameters; more details of the model are given in [15]. The constraint B = W P requires a minor modification from Section 3, as we discuss below.
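The constrained structure above is easy to picture with a tiny concrete example. The following sketch (all numbers invented for illustration) builds the selector matrix P for C = 2 sources with Hc = 2, and checks that the constrained emission B = W P reproduces the mixing vt = W st on the noise-free part.

```python
def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

# C = 2 sources, each dynamical subsystem of dimension Hc = 2.
# P = diag(1^T, 1^T): each row reads one source off its own state block
# (here 1^c picks the first component of each block).
P = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]

W = [[0.5, -1.2],          # V x C mixing matrix, V = 3 observation channels
     [2.0,  0.3],
     [-0.7, 0.9]]

B = matmul(W, P)           # constrained V x (C*Hc) emission matrix

h_t = [[0.4], [1.1], [-0.2], [0.6]]   # stacked hidden state vert(h^1, h^2)
s_t = matmul(P, h_t)                  # the two scalar sources
v_clean = matmul(W, s_t)              # W s_t (noise-free observations)
v_direct = matmul(B, h_t)             # B h_t -- identical by construction
```

The equality of `v_clean` and `v_direct` is just associativity, (W P) h = W (P h), which is why no new inference machinery is needed for the constrained case.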
Inference on q(h1:T ). A small modification of the mean + fluctuation decomposition for B occurs, namely:

⟨(vt − Bht )^T ΣV^{-1} (vt − Bht )⟩_{q(W )} = (vt − ⟨B⟩ht )^T ΣV^{-1} (vt − ⟨B⟩ht ) + ht^T P^T SW P ht ,

where ⟨B⟩ ≡ ⟨W ⟩P and SW = V H_W^{-1}. The quantities ⟨W ⟩ and HW are obtained as in Appendix A.1 with the replacement ht ← P ht . To represent the above as an LGSSM, we augment vt and B as

ṽt = vert(vt , 0H , 0C ),  B̃ = vert(⟨B⟩, UA , UW P ),

where UW is the Cholesky decomposition of SW . The equivalent LGSSM is then completed by specifying Ã ≡ ⟨A⟩, Σ̃H ≡ ΣH , Σ̃V ≡ diag(ΣV , IH , IC ), µ̃ ≡ µ, Σ̃ ≡ Σ, and inference for q(h1:T ) is performed using Algorithm 1. This demonstrates the elegance and unity of the approach in Section 3, since no new algorithm needs to be developed to perform inference, even in this special constrained parameter case.
4.1 Demonstration

As a simple demonstration, we used an LGSSM to generate 3 sources s^c_t with random 5 × 5 transition matrices A^c , µ = 0H and Σ ≡ ΣH ≡ IH . The sources were mixed into three observations vt = W st + η^v_t , for W chosen with elements from a zero mean unit variance Gaussian distribution, and ΣV = IV . We then trained a Bayesian LGSSM with 5 sources and 7 × 7 transition matrices A^c . To bias the model to find the simplest sources, we used Â^c ≡ 0_{Hc,Hc} for all sources. In Fig 2a and Fig 2b we see the original sources and the noisy observations respectively. In Fig 2c we see the estimated sources from our method after convergence of the hyperparameter updates. Two of the 5 sources have been removed, and the remaining three are a reasonable estimation of the original sources.
Figure 2: (a) Original sources st . (b) Observations resulting from mixing the original sources, vt = W st + η^v_t , η^v_t ∼ N (0, I). (c) Recovered sources using the Bayesian LGSSM. (d) Sources found with MAP LGSSM.

Figure 3: (a) Original raw EEG recordings from 4 channels. (b-e) 16 sources st estimated by the Bayesian LGSSM.

Another possible approach for introducing prior knowledge is to use a Maximum a Posteriori (MAP) procedure, adding a prior term to the original log-likelihood:

log p(v1:T |A, W, ΣH , ΣV , µ, Σ) + log p(A|α) + log p(W |β).
However, it is not clear how to reliably find the hyperparameters α and β in this case. One solution is to estimate them by optimizing the new objective function jointly with respect to the parameters and hyperparameters (this is the so-called joint MAP estimation; see for example [16]). A typical result of using this joint MAP approach on the artificial data is presented in Fig 2d: the joint MAP does not estimate the hyperparameters well, and the incorrect number of sources is identified.
4.2 Application to EEG Analysis

In Fig 3a we plot three seconds of EEG data recorded from 4 channels (located in the right hemisphere) while a person is performing imagined movement of the right hand. As is typical in EEG, each channel shows drift terms below 1 Hz, which correspond to artifacts of the instrumentation, together with 50 Hz mains contamination; this masks the rhythmical activity related to the mental task, mainly centered at 10 and 20 Hz [17]. We would therefore like a method which enables us to extract components in these information-rich 10 and 20 Hz frequency bands. Standard ICA methods such as FastICA do not find satisfactory sources based on the raw ‘noisy’ data, and preprocessing with band-pass filters is usually required. Additionally, in EEG research, flexibility in the number of recovered sources is important, since there may be many independent oscillators of interest underlying the observations and we would like some way to automatically determine their effective number.
To preferentially find sources at particular frequencies, we specified a block diagonal matrix Â^c for each source c, where each block is a 2 × 2 rotation matrix at the desired frequency. We defined the following 16 groups of frequencies (in Hz): [0.5], [0.5], [0.5], [0.5]; [10,11], [10,11], [10,11], [10,11]; [20,21], [20,21], [20,21], [20,21]; [50], [50], [50], [50]. The temporal evolution of the sources obtained after training the Bayesian LGSSM is given in Fig 3(b,c,d,e) (grouped by frequency range). The Bayes LGSSM removed 4 unnecessary sources from the mixing matrix W , that is, one [10,11] Hz and three [20,21] Hz sources. The first 4 sources contain dominant low frequency drift; sources 5, 6 and 8 contain [10,11] Hz activity, while source 10 contains [20,21] Hz centered activity. Of the 4 sources initialized to 50 Hz, only 2 retained 50 Hz activity, while the A^c of the other two have changed to model other frequencies present in the EEG.
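The 2 × 2 rotation blocks described above can be sketched as follows. The paper does not state the sampling rate, so the 100 Hz value here is a hypothetical assumption used only to make the phase increment concrete.

```python
import math

def rotation_block(freq_hz, fs_hz):
    """2x2 rotation matrix advancing the state's phase by
    2*pi*freq/fs per sample (fs_hz is an assumed sampling rate)."""
    th = 2.0 * math.pi * freq_hz / fs_hz
    return [[math.cos(th), -math.sin(th)],
            [math.sin(th),  math.cos(th)]]

# A 10 Hz block under the assumed 100 Hz sampling rate: the state
# returns to its starting point after fs/freq = 10 steps.
A10 = rotation_block(10.0, 100.0)
h = [1.0, 0.0]
for _ in range(10):
    h = [A10[0][0] * h[0] + A10[0][1] * h[1],
         A10[1][0] * h[0] + A10[1][1] * h[1]]
```

Each independent subsystem thus oscillates at its preferred frequency under the prior, while the learned A^c is free to drift away from Â^c when the data demand it.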
This method demonstrates the usefulness and applicability of the VB method in a real-world situation.

5 Conclusion

We considered the application of Variational Bayesian learning to Linear Gaussian State-Space Models. This is an important class of models with widespread application, and finding a simple way to implement this approximate Bayesian procedure is of considerable interest. The most demanding part of the procedure is inference of the hidden states of the model. Previously, this has been achieved using Belief Propagation, which differs from inference in the Kalman Filtering/Smoothing literature, for which highly efficient and stabilized procedures exist. A central contribution of this paper is to show how inference can be written using the standard Kalman Filtering/Smoothing recursions by augmenting the original model. Additionally, a minor modification to the standard Kalman Filtering routine may be applied for computational efficiency. We demonstrated the elegance and unity of our approach by showing how to easily apply a Variational Bayes analysis of temporal ICA. Specifically, our Bayes ICA approach successfully extracts independent processes underlying EEG signals, biased towards preferred frequency ranges. We hope that this simple and unifying interpretation of Variational Bayesian LGSSMs may facilitate further application to related models.
A Parameter Updates for A and B

A.1 Determining q(B|ΣV )

By examining F, the contribution of q(B|ΣV ) can be interpreted as the negative KL divergence between q(B|ΣV ) and a Gaussian; hence, optimally, q(B|ΣV ) is a Gaussian. The covariance [ΣB ]ij,kl ≡ ⟨(Bij − ⟨Bij ⟩)(Bkl − ⟨Bkl ⟩)⟩ (averages wrt q(B|ΣV )) is given by:

[ΣB ]ij,kl = [H_B^{-1}]jl [ΣV ]ik ,  where [HB ]jl ≡ ∑_{t=1}^T ⟨h^j_t h^l_t ⟩_{q(ht )} + βj δjl .

The mean is given by ⟨B⟩ = NB H_B^{-1}, where [NB ]ij ≡ ∑_{t=1}^T ⟨h^j_t ⟩_{q(ht )} v^i_t + βj B̂ij .

A.2 Determining q(A|ΣH )

Optimally, q(A|ΣH ) is a Gaussian with covariance

[ΣA ]ij,kl = [H_A^{-1}]jl [ΣH ]ik ,  where [HA ]jl ≡ ∑_{t=1}^{T−1} ⟨h^j_t h^l_t ⟩_{q(ht )} + αj δjl .

The mean is given by ⟨A⟩ = NA H_A^{-1}, where [NA ]ij ≡ ∑_{t=2}^T ⟨h^j_{t−1} h^i_t ⟩_{q(ht−1:t )} + αj Âij .
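In the scalar case (H = V = 1), HB and NB are scalars and the q(B|ΣV ) mean update reduces to a ridge-regression-style shrinkage toward the prior value B̂. All numbers below are made up for illustration:

```python
# Scalar version of the q(B|Sigma_V) mean update:
#   H_B = sum_t <h_t^2> + beta,   N_B = sum_t <h_t> v_t + beta * B_hat,
#   <B> = N_B / H_B.
# As beta grows, <B> is pulled from the least-squares fit toward B_hat.

v   = [1.0, 0.9, 1.2]      # observations v_t
Eh  = [0.5, 0.45, 0.6]     # posterior means  <h_t>
Eh2 = [0.30, 0.25, 0.40]   # posterior second moments <h_t^2>
beta, B_hat = 2.0, 0.0     # prior strength and prior emission value

HB = sum(Eh2) + beta
NB = sum(e * x for e, x in zip(Eh, v)) + beta * B_hat
B_mean = NB / HB           # shrunk posterior mean of B
```

With B̂ = 0 the estimate is strictly smaller in magnitude than the unregularized ratio ∑⟨ht⟩vt / ∑⟨ht²⟩, which is the mechanism by which redundant emission weights are suppressed.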
B Covariance Updates

By specifying a Wishart prior for the inverse of the covariances, conjugate update formulae are possible. In practice, it is more common to specify diagonal inverse covariances, for which the corresponding priors are simply Gamma distributions [5, 7]. For this simple diagonal case, the explicit updates are given below.
Determining q(ΣV ). For the constraint ΣV^{-1} = diag(ρ), where each diagonal element follows a Gamma prior Ga(b1 , b2 ) [7], q(ρ) factorizes and the optimal updates are

q(ρi ) = Ga( b1 + T/2 ,  b2 + (1/2)[ ∑_{t=1}^T (v^i_t )^2 − [GB ]ii + ∑_j βj B̂ij^2 ] ),

where GB ≡ NB H_B^{-1} NB^T .

Determining q(ΣH ). Analogously, for ΣH^{-1} = diag(τ ) with prior Ga(a1 , a2 ) [5], the updates are

q(τi ) = Ga( a1 + (T − 1)/2 ,  a2 + (1/2)[ ∑_{t=2}^T ⟨(h^i_t )^2 ⟩ − [GA ]ii + ∑_j αj Âij^2 ] ),

where GA ≡ NA H_A^{-1} NA^T .
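The Gamma updates are simple shape/rate bookkeeping, which the following scalar sketch makes explicit. The numeric values are illustrative only; `GB` stands in for a single diagonal entry [GB]ii.

```python
# Scalar illustration of the q(Sigma_V) update: for Sigma_V^{-1} = rho with
# prior Ga(b1, b2), the update adds T/2 to the shape and half the expected
# squared residual to the rate.

T = 4
b1, b2 = 1.0, 1.0
sum_v2 = 3.2            # sum_t (v_t)^2
GB = 2.1                # stand-in for [G_B]_ii = [N_B H_B^{-1} N_B^T]_ii
beta, B_hat = 2.0, 0.3  # prior contribution beta * B_hat^2

shape = b1 + T / 2.0
rate = b2 + 0.5 * (sum_v2 - GB + beta * B_hat ** 2)
post_mean_precision = shape / rate   # <rho> under the Gamma posterior
```

The posterior mean precision shape/rate is the quantity that feeds back into the LGSSM parameters Σ̃V of the augmented model on the next VB iteration.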
Determining q(A|ΣH ) Optimally, q(A|ΣH ) is a Gaussian with covariance T −1 hj hl t t −1 [ΣA ]ij,kl = [HA ]jl [ΣH ]ik , where [HA ]jl ≡ t=1 −1 The mean is given by A = NA HA , where [NA ]ij ≡ B T t=2 q(ht ) hj hi t−1 t + αj δjl . q(ht−1:t ) ˆ + αj Aij . Covariance Updates By specifying a Wishart prior for the inverse of the covariances, conjugate update formulae are possible. In practice, it is more common to specify diagonal inverse covariances, for which the corresponding priors are simply Gamma distributions [7, 5]. For this simple diagonal case, the explicit updates are given below. Determining q(ΣV ) For the constraint Σ−1 = diag(ρ), where each diagonal element follows a Gamma prior V Ga(b1 , b2 ) [7], q(ρ) factorizes and the optimal updates are q(ρi ) = Ga b1 + where GB ≡ −1 T NB HB NB . T T 1 i , b2 + (vt )2 − [GB ]ii + 2 2 t=1 j ˆ2 βj Bij , Determining q(ΣH ) Analogously, for Σ−1 = diag(τ ) with prior Ga(a1 , a2 ) [5], the updates are H T T −1 1 ˆij , a2 + (hi )2 − [GA ]ii + αj A2 , q(τi ) = Ga a1 + t 2 2 t=2 j −1 T where GA ≡ NA HA NA . Acknowledgments This work is supported by the European DIRAC Project FP6-0027787. This paper only reflects the authors’ views and funding agencies are not liable for any use that may be made of the information contained herein. References [1] Y. Bar-Shalom and X.-R. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, 1998. [2] M. S. Grewal and A. P. Andrews. Kalman Filtering: Theory and Practice Using MATLAB. John Wiley and Sons, Inc., 2001. [3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications. Springer, 2000. [4] M. J. Beal, F. Falciani, Z. Ghahramani, C. Rangel, and D. L. Wild. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics, 21:349–356, 2005. [5] A. T. Cemgil and S. J. Godsill. Probabilistic phase vocoder and its application to interpolation of missing values in audio signals. 
In 13th European Signal Processing Conference, 2005.
[6] H. Valpola and J. Karhunen. An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Computation, 14:2647–2692, 2002.
[7] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[8] M. Davy and S. J. Godsill. Bayesian harmonic models for musical signal analysis (with discussion). In J. O. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics VII. Oxford University Press, 2003.
[9] D. J. C. MacKay. Ensemble learning and evidence maximisation. Unpublished manuscript: www.variational-bayes.org, 1995.
[10] M. Morf and T. Kailath. Square-root algorithms for least-squares estimation. IEEE Transactions on Automatic Control, 20:487–497, 1975.
[11] P. Park and T. Kailath. New square-root smoothing algorithms. IEEE Transactions on Automatic Control, 41:727–732, 1996.
[12] E. Niedermeyer and F. Lopes Da Silva. Electroencephalography: basic principles, clinical applications and related fields. Lippincott Williams and Wilkins, 1999.
[13] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11:305–345, 1999.
[14] M. Verhaegen and P. Van Dooren. Numerical aspects of different Kalman filter implementations. IEEE Transactions on Automatic Control, 31:907–917, 1986.
[15] S. Chiappa and D. Barber. Bayesian linear Gaussian state-space models for biosignal decomposition. Signal Processing Letters, 14, 2007.
[16] S. S. Saquib, C. A. Bouman, and K. Sauer. ML parameter estimation for Markov random fields with applications to Bayesian tomography. IEEE Transactions on Image Processing, 7:1029–1044, 1998.
[17] G. Pfurtscheller and F. H. Lopes da Silva. Event-related EEG/MEG synchronization and desynchronization: basic principles. Clinical Neurophysiology, pages 1842–1857, 1999.
nips-2006-10: A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems
Author: David Barber, Bertrand Mesot
Abstract: We introduce a method for approximate smoothed inference in a class of switching linear dynamical systems, based on a novel form of Gaussian Sum smoother. This class includes the switching Kalman Filter and the more general case of switch transitions dependent on the continuous latent state. The method improves on the standard Kim smoothing approach by dispensing with one of the key approximations, thus making fuller use of the available future information. Whilst the only central assumption required is projection to a mixture of Gaussians, we show that an additional conditional independence assumption results in a simpler but stable and accurate alternative. Unlike the alternative unstable Expectation Propagation procedure, our method consists only of a single forward and backward pass and is reminiscent of the standard smoothing 'correction' recursions in the simpler linear dynamical system. The algorithm performs well on both toy experiments and in a large scale application to noise robust speech recognition.

1 Switching Linear Dynamical System

The Linear Dynamical System (LDS) [1] is a key temporal model in which a latent linear process generates the observed series. For complex time-series which are not well described globally by a single LDS, we may break the time-series into segments, each modeled by a potentially different LDS. This is the basis for the Switching LDS (SLDS) [2, 3, 4, 5] where, for each time t, a switch variable $s_t \in \{1, \ldots, S\}$ describes which of the LDSs is to be used. The observation (or 'visible') $v_t \in \mathbb{R}^V$ is linearly related to the hidden state $h_t \in \mathbb{R}^H$ with additive noise η by
$v_t = B(s_t) h_t + \eta^v(s_t) \;\equiv\; p(v_t|h_t, s_t) = \mathcal{N}(B(s_t) h_t, \Sigma^v(s_t))$  (1)
where $\mathcal{N}(\mu, \Sigma)$ denotes a Gaussian distribution with mean µ and covariance Σ.
The transition dynamics of the continuous hidden state $h_t$ is linear,
$h_t = A(s_t) h_{t-1} + \eta^h(s_t) \;\equiv\; p(h_t|h_{t-1}, s_t) = \mathcal{N}(A(s_t) h_{t-1}, \Sigma^h(s_t))$  (2)
The switch $s_t$ may depend on both the previous $s_{t-1}$ and $h_{t-1}$. This is an augmented SLDS (aSLDS), and defines the model
$p(v_{1:T}, h_{1:T}, s_{1:T}) = \prod_{t=1}^T p(v_t|h_t, s_t)\, p(h_t|h_{t-1}, s_t)\, p(s_t|h_{t-1}, s_{t-1})$
The standard SLDS [4] considers only switch transitions $p(s_t|s_{t-1})$. At time t = 1, $p(s_1|h_0, s_0)$ simply denotes the prior $p(s_1)$, and $p(h_1|h_0, s_1)$ denotes $p(h_1|s_1)$. The aim of this article is to address how to perform inference in the aSLDS. In particular we desire the filtered estimate $p(h_t, s_t|v_{1:t})$ and the smoothed estimate $p(h_t, s_t|v_{1:T})$, for any 1 ≤ t ≤ T. Both filtered and smoothed inference in the SLDS is intractable, scaling exponentially with time [4].

Figure 1: The independence structure of the aSLDS. Square nodes denote discrete variables, round nodes continuous variables. In the SLDS links from h to s are not normally considered.

2 Expectation Correction

Our approach to approximate $p(h_t, s_t|v_{1:T})$ mirrors the Rauch-Tung-Striebel 'correction' smoother for the simpler LDS [1]. The method consists of a single forward pass to recursively find the filtered posterior $p(h_t, s_t|v_{1:t})$, followed by a single backward pass to correct this into a smoothed posterior $p(h_t, s_t|v_{1:T})$. The forward pass we use is equivalent to standard Assumed Density Filtering (ADF) [6]. The main contribution of this paper is a novel form of backward pass, based only on collapsing the smoothed posterior to a mixture of Gaussians. Together with the ADF forward pass, we call the method Expectation Correction, since it corrects the moments found from the forward pass. A more detailed description of the method, including pseudocode, is given in [7].

2.1 Forward Pass (Filtering)

Readers familiar with ADF may wish to continue directly to Section (2.2).
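Before the filtering recursion, note that the generative model (1)-(2) is straightforward to sample from, which is useful for testing any inference routine. A minimal sketch (the function name and the choice of h_1 prior are illustrative assumptions, not the authors' code):

```python
import numpy as np

def sample_slds(A, B, Sh, Sv, P, pi, T, rng):
    """Draw (s_{1:T}, h_{1:T}, v_{1:T}) from a standard SLDS, eqs (1)-(2).

    A, B, Sh, Sv are lists indexed by the switch state;
    P[i, j] = p(s_t = j | s_{t-1} = i); pi is the prior p(s_1).
    For simplicity h_1 is drawn from N(0, Sh[s_1]).
    """
    S, H, V = len(A), A[0].shape[0], B[0].shape[0]
    s = np.zeros(T, dtype=int)
    h = np.zeros((T, H))
    v = np.zeros((T, V))
    s[0] = rng.choice(S, p=pi)
    h[0] = rng.multivariate_normal(np.zeros(H), Sh[s[0]])
    v[0] = B[s[0]] @ h[0] + rng.multivariate_normal(np.zeros(V), Sv[s[0]])
    for t in range(1, T):
        s[t] = rng.choice(S, p=P[s[t-1]])   # discrete switch transition
        h[t] = A[s[t]] @ h[t-1] + rng.multivariate_normal(np.zeros(H), Sh[s[t]])
        v[t] = B[s[t]] @ h[t] + rng.multivariate_normal(np.zeros(V), Sv[s[t]])
    return s, h, v
```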
Our aim is to form a recursion for $p(s_t, h_t|v_{1:t})$, based on a Gaussian mixture approximation of $p(h_t|s_t, v_{1:t})$. Without loss of generality, we may decompose the filtered posterior as
$p(h_t, s_t|v_{1:t}) = p(h_t|s_t, v_{1:t})\, p(s_t|v_{1:t})$  (3)
The exact representation of $p(h_t|s_t, v_{1:t})$ is a mixture with $O(S^t)$ components. We therefore approximate this with a smaller I-component mixture
$p(h_t|s_t, v_{1:t}) \approx \sum_{i_t=1}^I p(h_t|i_t, s_t, v_{1:t})\, p(i_t|s_t, v_{1:t})$
where $p(h_t|i_t, s_t, v_{1:t})$ is a Gaussian parameterized with mean $f(i_t, s_t)$ and covariance $F(i_t, s_t)$. To find a recursion for these parameters, consider
$p(h_{t+1}|s_{t+1}, v_{1:t+1}) = \sum_{s_t, i_t} p(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1})\, p(s_t, i_t|s_{t+1}, v_{1:t+1})$  (4)

Evaluating $p(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1})$

We find $p(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1})$ by first computing the joint distribution $p(h_{t+1}, v_{t+1}|s_t, i_t, s_{t+1}, v_{1:t})$, which is a Gaussian with covariance and mean elements
$\Sigma_{hh} = A(s_{t+1}) F(i_t, s_t) A^{\mathsf{T}}(s_{t+1}) + \Sigma^h(s_{t+1})$, $\Sigma_{vv} = B(s_{t+1}) \Sigma_{hh} B^{\mathsf{T}}(s_{t+1}) + \Sigma^v(s_{t+1})$, $\Sigma_{vh} = B(s_{t+1}) \Sigma_{hh}$, $\mu_v = B(s_{t+1}) A(s_{t+1}) f(i_t, s_t)$, $\mu_h = A(s_{t+1}) f(i_t, s_t)$  (5)
and then conditioning on $v_{t+1}$. (Footnote 1: $p(x|y)$ is a Gaussian with mean $\mu_x + \Sigma_{xy} \Sigma_{yy}^{-1} (y - \mu_y)$ and covariance $\Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx}$.) For the case S = 1, this forms the usual Kalman Filter recursions [1].

Evaluating $p(s_t, i_t|s_{t+1}, v_{1:t+1})$

The mixture weight in (4) can be found from the decomposition
$p(s_t, i_t|s_{t+1}, v_{1:t+1}) \propto p(v_{t+1}|i_t, s_t, s_{t+1}, v_{1:t})\, p(s_{t+1}|i_t, s_t, v_{1:t})\, p(i_t|s_t, v_{1:t})\, p(s_t|v_{1:t})$  (6)
The first factor in (6), $p(v_{t+1}|i_t, s_t, s_{t+1}, v_{1:t})$, is a Gaussian with mean $\mu_v$ and covariance $\Sigma_{vv}$, as given in (5). The last two factors $p(i_t|s_t, v_{1:t})$ and $p(s_t|v_{1:t})$ are given from the previous iteration. Finally, $p(s_{t+1}|i_t, s_t, v_{1:t})$ is found from
$p(s_{t+1}|i_t, s_t, v_{1:t}) = \langle p(s_{t+1}|h_t, s_t) \rangle_{p(h_t|i_t, s_t, v_{1:t})}$  (7)
where $\langle \cdot \rangle_p$ denotes expectation with respect to p. In the SLDS, (7) is replaced by the Markov transition $p(s_{t+1}|s_t)$.
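The Gaussian conditioning step of footnote 1 is elementary but central to the forward pass; a small reusable sketch:

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, Sxx, Sxy, Syy, y):
    """Condition a joint Gaussian over (x, y) on an observed y.

    Returns the mean mu_x + Sxy Syy^{-1} (y - mu_y)
    and covariance Sxx - Sxy Syy^{-1} Syx of p(x|y).
    """
    K = Sxy @ np.linalg.inv(Syy)    # gain matrix
    return mu_x + K @ (y - mu_y), Sxx - K @ Sxy.T
```

Applied to the joint over (h_{t+1}, v_{t+1}) in (5), this yields exactly the Kalman measurement update.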
In the aSLDS, however, (7) will generally need to be computed numerically.

Closing the recursion

We are now in a position to calculate (4). For each setting of the variable $s_{t+1}$, we have a mixture of I × S Gaussians, which we numerically collapse back to I Gaussians to form
$p(h_{t+1}|s_{t+1}, v_{1:t+1}) \approx \sum_{i_{t+1}=1}^I p(h_{t+1}|i_{t+1}, s_{t+1}, v_{1:t+1})\, p(i_{t+1}|s_{t+1}, v_{1:t+1})$
Any method of choice may be supplied to collapse a mixture to a smaller mixture; our code simply repeatedly merges low-weight components. In this way the new mixture coefficients $p(i_{t+1}|s_{t+1}, v_{1:t+1})$, $i_{t+1} \in 1, \ldots, I$, are defined, completing the description of how to form a recursion for $p(h_{t+1}|s_{t+1}, v_{1:t+1})$ in (3). A recursion for the switch variable is given by
$p(s_{t+1}|v_{1:t+1}) \propto \sum_{s_t, i_t} p(v_{t+1}|s_{t+1}, i_t, s_t, v_{1:t})\, p(s_{t+1}|i_t, s_t, v_{1:t})\, p(i_t|s_t, v_{1:t})\, p(s_t|v_{1:t})$
where all terms have been computed during the recursion for $p(h_{t+1}|s_{t+1}, v_{1:t+1})$. The likelihood $p(v_{1:T})$ may be found by recursing $p(v_{1:t+1}) = p(v_{t+1}|v_{1:t})\, p(v_{1:t})$, where
$p(v_{t+1}|v_{1:t}) = \sum_{i_t, s_t, s_{t+1}} p(v_{t+1}|i_t, s_t, s_{t+1}, v_{1:t})\, p(s_{t+1}|i_t, s_t, v_{1:t})\, p(i_t|s_t, v_{1:t})\, p(s_t|v_{1:t})$

2.2 Backward Pass (Smoothing)

The main contribution of this paper is to find a suitable way to 'correct' the filtered posterior $p(s_t, h_t|v_{1:t})$ obtained from the forward pass into a smoothed posterior $p(s_t, h_t|v_{1:T})$. We derive this for the case of a single Gaussian representation. The extension to the mixture case is straightforward and presented in [7]. We approximate the smoothed posterior $p(h_t|s_t, v_{1:T})$ by a Gaussian with mean $g(s_t)$ and covariance $G(s_t)$, and our aim is to find a recursion for these parameters.
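Returning to the collapse step above: the simplest way to collapse a Gaussian mixture to a single Gaussian is moment matching (one option only; as noted, the authors' code instead repeatedly merges low-weight components). A sketch:

```python
import numpy as np

def collapse_to_gaussian(w, mus, covs):
    """Moment-match a Gaussian mixture to a single Gaussian.

    w:    (I,) mixture weights (summing to 1)
    mus:  (I, D) component means
    covs: (I, D, D) component covariances
    """
    w = np.asarray(w, dtype=float)
    mu = np.einsum('i,ij->j', w, mus)              # mixture mean
    d = mus - mu                                   # component offsets
    # mixture covariance = average covariance + spread of the means
    cov = np.einsum('i,ijk->jk', w, covs) + np.einsum('i,ij,ik->jk', w, d, d)
    return mu, cov
```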
A useful starting point for a recursion is:
$p(h_t, s_t|v_{1:T}) = \sum_{s_{t+1}} p(s_{t+1}|v_{1:T})\, p(h_t|s_t, s_{t+1}, v_{1:T})\, p(s_t|s_{t+1}, v_{1:T})$
The term $p(h_t|s_t, s_{t+1}, v_{1:T})$ may be computed as
$p(h_t|s_t, s_{t+1}, v_{1:T}) = \int_{h_{t+1}} p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t})\, p(h_{t+1}|s_t, s_{t+1}, v_{1:T})$  (8)
The recursion therefore requires $p(h_{t+1}|s_t, s_{t+1}, v_{1:T})$, which we can write as
$p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) \propto p(h_{t+1}|s_{t+1}, v_{1:T})\, p(s_t|s_{t+1}, h_{t+1}, v_{1:t})$  (9)
The difficulty here is that the functional form of $p(s_t|s_{t+1}, h_{t+1}, v_{1:t})$ is not squared exponential in $h_{t+1}$, so that $p(h_{t+1}|s_t, s_{t+1}, v_{1:T})$ will not be Gaussian. (Footnote 2: In the exact calculation, $p(h_{t+1}|s_t, s_{t+1}, v_{1:T})$ is a mixture of Gaussians, see [7]. However, since in (9) the term $p(h_{t+1}|s_{t+1}, v_{1:T})$ will only be approximately computed during the recursion, our approximation to $p(h_{t+1}|s_t, s_{t+1}, v_{1:T})$ will not be a mixture of Gaussians.) One possibility would be to approximate the non-Gaussian $p(h_{t+1}|s_t, s_{t+1}, v_{1:T})$ by a Gaussian (or mixture thereof) by minimizing the Kullback-Leibler divergence between the two, or performing moment matching in the case of a single Gaussian. A simpler alternative (which forms 'standard' EC) is to make the assumption $p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) \approx p(h_{t+1}|s_{t+1}, v_{1:T})$, where $p(h_{t+1}|s_{t+1}, v_{1:T})$ is already known from the previous backward recursion. Under this assumption, the recursion becomes
$p(h_t, s_t|v_{1:T}) \approx \sum_{s_{t+1}} p(s_{t+1}|v_{1:T})\, p(s_t|s_{t+1}, v_{1:T})\, \langle p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t}) \rangle_{p(h_{t+1}|s_{t+1}, v_{1:T})}$  (10)

Evaluating $\langle p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t}) \rangle_{p(h_{t+1}|s_{t+1}, v_{1:T})}$

$\langle p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t}) \rangle_{p(h_{t+1}|s_{t+1}, v_{1:T})}$ is a Gaussian in $h_t$, whose statistics we will now compute. First we find $p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t})$, which may be obtained from the joint distribution
$p(h_t, h_{t+1}|s_t, s_{t+1}, v_{1:t}) = p(h_{t+1}|h_t, s_{t+1})\, p(h_t|s_t, v_{1:t})$  (11)
which itself can be found from a forward dynamics from the filtered estimate $p(h_t|s_t, v_{1:t})$.
The statistics for the marginal $p(h_t|s_t, s_{t+1}, v_{1:t})$ are simply those of $p(h_t|s_t, v_{1:t})$, since $s_{t+1}$ carries no extra information about $h_t$. The remaining statistics are the mean of $h_{t+1}$, the covariance of $h_{t+1}$ and the cross-variance between $h_t$ and $h_{t+1}$, which are given by
$\langle h_{t+1} \rangle = A(s_{t+1}) f_t(s_t)$, $\Sigma_{t+1,t+1} = A(s_{t+1}) F_t(s_t) A^{\mathsf{T}}(s_{t+1}) + \Sigma^h(s_{t+1})$, $\Sigma_{t+1,t} = A(s_{t+1}) F_t(s_t)$
Given the statistics of (11), we may now condition on $h_{t+1}$ to find $p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t})$. Doing so effectively constitutes a reversal of the dynamics,
$h_t = \overleftarrow{A}(s_t, s_{t+1})\, h_{t+1} + \overleftarrow{\eta}(s_t, s_{t+1})$
where $\overleftarrow{A}(s_t, s_{t+1})$ and $\overleftarrow{\eta}(s_t, s_{t+1}) \sim \mathcal{N}(\overleftarrow{m}(s_t, s_{t+1}), \overleftarrow{\Sigma}(s_t, s_{t+1}))$ are easily found using conditioning. Averaging the above reversed dynamics over $p(h_{t+1}|s_{t+1}, v_{1:T})$, we find that $\langle p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t}) \rangle_{p(h_{t+1}|s_{t+1}, v_{1:T})}$ is a Gaussian with statistics
$\mu_t = \overleftarrow{A}(s_t, s_{t+1})\, g(s_{t+1}) + \overleftarrow{m}(s_t, s_{t+1})$, $\Sigma_{t,t} = \overleftarrow{A}(s_t, s_{t+1})\, G(s_{t+1})\, \overleftarrow{A}^{\mathsf{T}}(s_t, s_{t+1}) + \overleftarrow{\Sigma}(s_t, s_{t+1})$
These equations directly mirror the standard RTS backward pass [1].

Evaluating $p(s_t|s_{t+1}, v_{1:T})$

The main departure of EC from previous methods is in treating the term
$p(s_t|s_{t+1}, v_{1:T}) = \langle p(s_t|h_{t+1}, s_{t+1}, v_{1:t}) \rangle_{p(h_{t+1}|s_{t+1}, v_{1:T})}$  (12)
The term $p(s_t|h_{t+1}, s_{t+1}, v_{1:t})$ is given by
$p(s_t|h_{t+1}, s_{t+1}, v_{1:t}) = \frac{p(h_{t+1}|s_{t+1}, s_t, v_{1:t})\, p(s_t, s_{t+1}|v_{1:t})}{\sum_{s_t'} p(h_{t+1}|s_{t+1}, s_t', v_{1:t})\, p(s_t', s_{t+1}|v_{1:t})}$  (13)
Here $p(s_t, s_{t+1}|v_{1:t}) = p(s_{t+1}|s_t, v_{1:t})\, p(s_t|v_{1:t})$, where $p(s_{t+1}|s_t, v_{1:t})$ occurs in the forward pass, (7). In (13), $p(h_{t+1}|s_{t+1}, s_t, v_{1:t})$ is found by marginalizing (11). Computing the average of (13) with respect to $p(h_{t+1}|s_{t+1}, v_{1:T})$ may be achieved by any numerical integration method desired. A simple approximation is to evaluate the integrand at the mean value of the averaging distribution $p(h_{t+1}|s_{t+1}, v_{1:T})$.
More sophisticated methods (see [7]), such as sampling from the Gaussian $p(h_{t+1}|s_{t+1}, v_{1:T})$, have the advantage that covariance information is used. (Footnote 3: This is a form of exact sampling since drawing samples from a Gaussian is easy. This should not be confused with meaning that this use of sampling renders EC a sequential Monte-Carlo scheme.)

Closing the Recursion

We have now computed both the continuous and discrete factors in (8), which we wish to use to write the smoothed estimate in the form $p(h_t, s_t|v_{1:T}) = p(s_t|v_{1:T})\, p(h_t|s_t, v_{1:T})$. The distribution $p(h_t|s_t, v_{1:T})$ is readily obtained from the joint (8) by conditioning on $s_t$ to form the mixture
$p(h_t|s_t, v_{1:T}) = \sum_{s_{t+1}} p(s_{t+1}|s_t, v_{1:T})\, p(h_t|s_t, s_{t+1}, v_{1:T})$
which may then be collapsed to a single Gaussian (the mixture case is discussed in [7]). The smoothed posterior $p(s_t|v_{1:T})$ is given by
$p(s_t|v_{1:T}) = \sum_{s_{t+1}} p(s_{t+1}|v_{1:T})\, \langle p(s_t|h_{t+1}, s_{t+1}, v_{1:t}) \rangle_{p(h_{t+1}|s_{t+1}, v_{1:T})}$  (14)

2.3 Relation to other methods

The EC backward pass is closely related to Kim's method [8]. In both EC and Kim's method, the approximation
$p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) \approx p(h_{t+1}|s_{t+1}, v_{1:T})$
is used to form a numerically simple backward pass. The other 'approximation' in EC is to numerically compute the average in (14). In Kim's method, however, an update for the discrete variables is formed by replacing the required term in (14) by
$\langle p(s_t|h_{t+1}, s_{t+1}, v_{1:t}) \rangle_{p(h_{t+1}|s_{t+1}, v_{1:T})} \approx p(s_t|s_{t+1}, v_{1:t})$  (15)
Since $p(s_t|s_{t+1}, v_{1:t}) \propto p(s_{t+1}|s_t)\, p(s_t|v_{1:t}) / p(s_{t+1}|v_{1:t})$, this can be computed simply from the filtered results alone. The fundamental difference therefore between EC and Kim's method is that the approximation (15) is not required by EC. The EC backward pass therefore makes fuller use of the future information, resulting in a recursion which intimately couples the continuous and discrete variables. The resulting effect on the quality of the approximation can be profound, as we will see in the experiments.
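To make the contrast concrete, Kim's replacement (15) can be computed from filtered quantities alone; a sketch with hypothetical toy distributions:

```python
import numpy as np

def kim_discrete_update(p_trans, p_filt):
    """Kim's approximation (15): p(s_t|s_{t+1}, v_{1:t}) ∝ p(s_{t+1}|s_t) p(s_t|v_{1:t}).

    p_trans[i, j] = p(s_{t+1}=j | s_t=i);  p_filt[i] = p(s_t=i | v_{1:t}).
    Returns a matrix whose column j is p(s_t | s_{t+1}=j, v_{1:t}).
    """
    J = p_trans * p_filt[:, None]             # joint p(s_t, s_{t+1} | v_{1:t})
    return J / J.sum(axis=0, keepdims=True)   # normalize over s_t for each s_{t+1}
```

EC instead averages (13) over the smoothed Gaussian for h_{t+1}, which this filtered-only shortcut ignores.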
The Expectation Propagation (EP) algorithm makes the central assumption of collapsing the posteriors to a Gaussian family [5]; the collapse is defined by a consistency criterion on overlapping marginals. In our experiments, we take the approach in [9] of collapsing to a single Gaussian. Ensuring consistency requires frequent translations between moment and canonical parameterizations, which is the origin of potentially severe numerical instability [10]. In contrast, EC works largely with moment parameterizations of Gaussians, for which relatively few numerical difficulties arise. Unlike EP, EC is not based on a consistency criterion, and a subtle issue arises about possible inconsistencies in the Forward and Backward approximations for EC. For example, under the conditional independence assumption in the Backward Pass, $p(h_T|s_{T-1}, s_T, v_{1:T}) \approx p(h_T|s_T, v_{1:T})$, which is in contradiction to (5), which states that the approximation to $p(h_T|s_{T-1}, s_T, v_{1:T})$ will depend on $s_{T-1}$. Such potential inconsistencies arise because of the approximations made, and should not be considered as separate approximations in themselves. Rather than using a global (consistency) objective, EC attempts to faithfully approximate the exact Forward and Backward propagation routines. For this reason, as in the exact computation, only a single Forward and Backward pass are required in EC. In [11] a related dynamics reversal is proposed. However, the singularities resulting from incorrectly treating $p(v_{t+1:T}|h_t, s_t)$ as a density are heuristically finessed. In [12] a variational method approximates the joint distribution $p(h_{1:T}, s_{1:T}|v_{1:T})$ rather than the marginal inference $p(h_t, s_t|v_{1:T})$. This is a disadvantage when compared to other methods that directly approximate the marginal. Sequential Monte Carlo methods (Particle Filters) [13] are essentially mixtures of delta-function approximations.
Whilst potentially powerful, these typically suffer in high-dimensional hidden spaces, unless techniques such as Rao-Blackwellization are performed. ADF is generally preferential to Particle Filtering since in ADF the approximation is a mixture of non-trivial distributions, and is therefore more able to represent the posterior.

3 Demonstration

Testing EC in a problem with a reasonably long temporal sequence, T, is important since numerical instabilities may not be apparent in timeseries of just a few points. To do this, we sequentially generate hidden and visible states from a given model, here with H = 3, S = 2, V = 1; see Figure 2 for full details of the experimental setup. Then, given only the parameters of the model and the visible observations (but not any of the hidden states $h_{1:T}$, $s_{1:T}$), the task is to infer $p(h_t|s_t, v_{1:T})$ and $p(s_t|v_{1:T})$. Since the exact computation is exponential in T, a simple alternative is to assume that the original sample states $s_{1:T}$ are the 'correct' inferences, and compare how our most probable posterior smoothed estimates $\arg\max_{s_t} p(s_t|v_{1:T})$ compare with the assumed correct sample $s_t$. We chose conditions that, from the viewpoint of classical signal processing, are difficult, with changes in the switches occurring at a much higher rate than the typical frequencies in the signal $v_t$. For EC we use the mean approximation for the numerical integration of (12). We included the Particle Filter merely for a point of comparison with ADF, since they are not designed to approximate the smoothed estimate; 1000 particles were used, with Kitagawa resampling. For the Rao-Blackwellized Particle Filter [13], 500 particles were used, with Kitagawa resampling.

Figure 2: The number of errors in estimating $p(s_t|v_{1:T})$ for a binary switch (S = 2) over a time series of length T = 100. Hence 50 errors corresponds to random guessing. Plotted are histograms of the errors over 1000 experiments. The x-axes are cut off at 20 errors to improve visualization of the results. (PF) Particle Filter. (RBPF) Rao-Blackwellized PF. (EP) Expectation Propagation. (ADFS) Assumed Density Filtering using a Single Gaussian. (KimS) Kim's smoother using the results from ADFS. (ECS) Expectation Correction using a Single Gaussian (I = J = 1). (ADFM) ADF using a mixture of I = 4 Gaussians. (KimM) Kim's smoother using the results from ADFM. (ECM) Expectation Correction using a mixture with I = J = 4 components. Experimental setup: S = 2, V = 1 (scalar observations), T = 100, with zero output bias; A(s) = 0.9999 * orth(randn(H, H)), B(s) = randn(V, H); H = 3, $\Sigma^h(s) = I_H$, $\Sigma^v(s) = 0.1 I_V$, $p(s_{t+1}|s_t) \propto 1_{S \times S} + I_S$. At time t = 1, the priors are $p_1$ = uniform, with $h_1$ drawn from N(10 * randn(H, 1), $I_H$).

We found that EP was numerically unstable and often struggled to converge. To encourage convergence, we used the damping method in [9], performing 20 iterations with a damping factor of 0.5. Nevertheless, the disappointing performance of EP is most likely due to conflicts resulting from numerical instabilities introduced by the frequent conversions between moment and canonical representations. The best filtered results are given using ADF, since this is better able to represent the variance in the filtered posterior than the sampling methods. Unlike Kim's method, EC makes good use of the future information to clean up the filtered results considerably. One should bear in mind that both EC and Kim's method use the same ADF filtered results. This demonstrates that EC may dramatically improve on Kim's method, so that the small amount of extra work in making a numerical approximation of $p(s_t|s_{t+1}, v_{1:T})$, (12), may bring significant benefits. We found similar conclusions for experiments with an aSLDS [7].
4 Application to Noise Robust ASR

Here we briefly present an application of the SLDS to robust Automatic Speech Recognition (ASR), for which the intractable inference is performed by EC; this serves to demonstrate how EC scales well to a large-scale application. Fuller details are given in [14]. The standard approach to noise robust ASR is to provide a set of noise-robust features to a standard Hidden Markov Model (HMM) classifier, which is based on modeling the acoustic feature vector. For example, the method of Unsupervised Spectral Subtraction (USS) [15] provides state-of-the-art performance in this respect. Incorporating noise models directly into such feature-based HMM systems is difficult, mainly because the explicit influence of the noise on the features is poorly understood. An alternative is to model the raw speech signal directly, such as the SAR-HMM model [16] for which, under clean conditions, isolated spoken digit recognition performs well. However, the SAR-HMM performs poorly under noisy conditions, since no explicit noise processes are taken into account by the model. The approach we take here is to extend the SAR-HMM to include an explicit noise process, so that the observed signal $v_t$ is modeled as a noise corrupted version of a clean hidden signal $v_t^h$:
$v_t = v_t^h + \tilde{\eta}_t$, with $\tilde{\eta}_t \sim \mathcal{N}(0, \tilde{\sigma}^2)$
(Footnote 4: Generalized EP [5], which groups variables together, improves on the results, but is still far inferior to the EC results presented here -- Onno Zoeter, personal communication.)

Table 1: Comparison of the recognition accuracy of three models when the test utterances are corrupted by various levels of Gaussian noise.

Noise Variance   SNR (dB)   HMM      SAR-HMM   AR-SLDS
0                26.5       100.0%   97.0%     96.8%
10^-7            26.3       100.0%   79.8%     96.8%
10^-6            25.1       90.9%    56.7%     96.4%
10^-5            19.7       86.4%    22.2%     94.8%
10^-4            10.6       59.1%    9.7%      84.0%
10^-3             0.7        9.1%    9.1%      61.2%
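For orientation, the SNR column of Table 1 follows the usual definition SNR = 10 log10(P_signal / P_noise); a small sketch on synthetic data (not the TI-DIGITS audio used in the paper):

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 log10(P_signal / P_noise)."""
    p_sig = np.mean(np.asarray(signal, dtype=float) ** 2)
    p_noise = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10.0 * np.log10(p_sig / p_noise)

# A unit-amplitude sine (power 0.5) corrupted by Gaussian noise of variance 1e-2
t = np.linspace(0, 1, 8000, endpoint=False)
clean = np.sin(2 * np.pi * 50 * t)
noise = np.sqrt(1e-2) * np.random.default_rng(0).standard_normal(t.shape)
# Expected SNR ≈ 10 log10(0.5 / 1e-2) ≈ 17 dB
```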
The dynamics of the clean signal is modeled by a switching AR process
$v_t^h = -\sum_{r=1}^R c_r(s_t)\, v_{t-r}^h + \eta_t^h(s_t)$, $\quad \eta_t^h(s_t) \sim \mathcal{N}(0, \sigma^2(s_t))$
where $s_t \in \{1, \ldots, S\}$ denotes which of a set of AR coefficients $c_r(s_t)$ are to be used at time t, and $\eta_t^h(s_t)$ is the so-called innovation noise. When $\sigma^2(s_t) \equiv 0$, this model reproduces the SAR-HMM of [16], a specially constrained HMM. Hence inference and learning for the SAR-HMM are tractable and straightforward. For the case $\sigma^2(s_t) > 0$ the model can be recast as an SLDS. To do this we define $h_t$ as a vector which contains the R most recent clean hidden samples
$h_t = \begin{pmatrix} v_t^h & \cdots & v_{t-R+1}^h \end{pmatrix}^{\mathsf{T}}$  (16)
and we set $A(s_t)$ to be an R × R matrix where the first row contains the AR coefficients $-c_r(s_t)$ and the rest is a shifted-down identity matrix. For example, for a third-order (R = 3) AR process,
$A(s_t) = \begin{pmatrix} -c_1(s_t) & -c_2(s_t) & -c_3(s_t) \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$.  (17)
The hidden covariance matrix $\Sigma^h(s)$ has all elements zero, except the top-left most, which is set to the innovation variance. To extract the first component of $h_t$ we use the (switch-independent) 1 × R projection matrix B = [1 0 ... 0]. The (switch-independent) visible scalar noise variance is given by $\Sigma^v \equiv \sigma_v^2$. A well-known issue with raw speech signal models is that the energy of a signal may vary from one speaker to another or because of a change in recording conditions. For this reason the innovation $\Sigma^h$ is adjusted by maximizing the likelihood of an observed sequence with respect to the innovation covariance, a process called Gain Adaptation [16].

4.1 Training & Evaluation

Following [16], we trained a separate SAR-HMM for each of the eleven digits (0-9 and 'oh') from the TI-DIGITS database [17]. The training set for each digit was composed of 110 single digit utterances down-sampled to 8 kHz, each one pronounced by a male speaker. Each SAR-HMM was composed of ten states with a left-right transition matrix.
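The companion-matrix construction (16)-(17) can be written compactly; a sketch following the sign convention of (17) (the function names are illustrative, not from the paper's code):

```python
import numpy as np

def ar_companion(c):
    """Transition matrix A(s) for an AR process recast as an SLDS, eq. (17):
    the first row holds -c_r; below it sits a shifted-down identity."""
    c = np.asarray(c, dtype=float)
    R = len(c)
    A = np.zeros((R, R))
    A[0, :] = -c
    if R > 1:
        A[1:, :-1] = np.eye(R - 1)
    return A

def ar_innovation_cov(R, sigma2):
    """Hidden covariance Sigma^h(s): all zeros except the top-left element."""
    Sh = np.zeros((R, R))
    Sh[0, 0] = sigma2
    return Sh
```

Multiplying A by the stacked state $h_{t-1}$ then produces the new sample in the first slot and shifts the history down, exactly as (16) requires.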
Each state was associated with a 10th-order AR process, and the model was constrained to stay an integer multiple of K = 140 time steps (0.0175 seconds) in the same state. We refer the reader to [16] for a detailed explanation of the training procedure used with the SAR-HMM. An AR-SLDS was built for each of the eleven digits by copying the parameters of the corresponding trained SAR-HMM, i.e., the AR coefficients $c_r(s)$ are copied into the first row of the hidden transition matrix A(s), and the same discrete transition distribution $p(s_t|s_{t-1})$ is used. The models were then evaluated on a test set composed of 112 corrupted utterances of each of the eleven digits, each pronounced by different male speakers than those used in the training set. The recognition accuracy obtained by the models on the corrupted test sets is presented in Table 1. As expected, the performance of the SAR-HMM rapidly decreases with noise. The feature-based HMM with USS has high accuracy only for high SNR levels. In contrast, the AR-SLDS achieves a recognition accuracy of 61.2% at an SNR close to 0 dB, while the performance of the two other methods is equivalent to random guessing (9.1%). Whilst other inference methods may also perform well in this case, we found that EC performs admirably, without numerical instabilities, even for time-series with several thousand time-steps.

5 Discussion

We presented a method for approximate smoothed inference in an augmented class of switching linear dynamical systems. Our approximation is based on the idea that, due to the forgetting which commonly occurs in Markovian models, a finite number of mixture components may provide a reasonable approximation. Clearly, in systems with very long correlation times our method may require too many mixture components to produce a satisfactory result, although we are unaware of other techniques that would be able to cope well in that case.
The main benefit of EC over Kim smoothing is that future information is more accurately dealt with. Whilst EC is not as general as EP, EC carefully exploits the properties of singly-connected distributions, such as the aSLDS, to provide a numerically stable procedure. We hope that the ideas presented here may therefore help facilitate the practical application of dynamic hybrid networks.

Acknowledgements

This work is supported by the EU Project FP6-0027787. This paper only reflects the authors' views and funding agencies are not liable for any use that may be made of the information contained herein.

References

[1] Y. Bar-Shalom and X.-R. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, Norwood, MA, 1998.
[2] V. Pavlovic, J. M. Rehg, and J. MacCormick. Learning switching linear models of human motion. In Advances in Neural Information Processing Systems (NIPS 13), pages 981–987, 2001.
[3] A. T. Cemgil, B. Kappen, and D. Barber. A generative model for music transcription. IEEE Transactions on Audio, Speech and Language Processing, 14(2):679–694, 2006.
[4] U. N. Lerner. Hybrid Bayesian Networks for Reasoning about Complex Systems. PhD thesis, Stanford University, 2002.
[5] O. Zoeter. Monitoring non-linear and switching dynamical systems. PhD thesis, Radboud University Nijmegen, 2005.
[6] T. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT Media Lab, 2001.
[7] D. Barber. Expectation Correction for smoothed inference in switching linear dynamical systems. Journal of Machine Learning Research, 7:2515–2540, 2006.
[8] C.-J. Kim. Dynamic linear models with Markov-switching. Journal of Econometrics, 60:1–22, 1994.
[9] T. Heskes and O. Zoeter. Expectation Propagation for approximate inference in dynamic Bayesian networks. In A. Darwiche and N. Friedman, editors, Uncertainty in Artificial Intelligence, pages 216–223, 2002.
[10] S. Lauritzen and F. Jensen.
Stable local computation with conditional Gaussian distributions. Statistics and Computing, 11:191–203, 2001.
[11] G. Kitagawa. The two-filter formula for smoothing and an implementation of the Gaussian-sum smoother. Annals of the Institute of Statistical Mathematics, 46(4):605–623, 1994.
[12] Z. Ghahramani and G. E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):963–996, 1998.
[13] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001.
[14] B. Mesot and D. Barber. Switching linear dynamical systems for noise robust speech recognition. IDIAP-RR 08, 2006.
[15] G. Lathoud, M. Magimai-Doss, B. Mesot, and H. Bourlard. Unsupervised spectral subtraction for noise-robust ASR. In Proceedings of ASRU 2005, pages 189–194, November 2005.
[16] Y. Ephraim and W. J. J. Roberts. Revisiting autoregressive hidden Markov modeling of speech signals. IEEE Signal Processing Letters, 12(2):166–169, February 2005.
[17] R. G. Leonard. A database for speaker independent digit recognition. In Proceedings of ICASSP 84, volume 3, 1984.
nips-2006-60: Convergence of Laplacian Eigenmaps
Author: Mikhail Belkin, Partha Niyogi
Abstract: Geometrically based methods for various tasks of machine learning have attracted considerable attention over the last few years. In this paper we show convergence of eigenvectors of the point cloud Laplacian to the eigenfunctions of the Laplace-Beltrami operator on the underlying manifold, thus establishing the first convergence results for a spectral dimensionality reduction algorithm in the manifold setting.
nips-2006-61: Convex Repeated Games and Fenchel Duality
Author: Shai Shalev-shwartz, Yoram Singer
Abstract: We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can all be derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization.

1 Introduction and Problem Setting

Several problems arising in machine learning can be modeled as convex repeated games. Convex repeated games are closely related to online convex programming (see [19, 9] and the discussion in the last section). A convex repeated game is a two-player game that is performed in a sequence of consecutive rounds. On round t of the repeated game, the first player chooses a vector $w_t$ from a convex set S. Next, the second player responds with a convex function $g_t : S \to \mathbb{R}$. Finally, the first player suffers an instantaneous loss $g_t(w_t)$. We study the game from the viewpoint of the first player. The goal of the first player is to minimize its cumulative loss, $\sum_t g_t(w_t)$. To motivate this rather abstract setting let us first cast the more familiar setting of online learning as a convex repeated game. Online learning is performed in a sequence of consecutive rounds. On round t, the learner first receives a question, cast as a vector $x_t$, and is required to provide an answer for this question. For example, $x_t$ can be an encoding of an email message, and the question is whether the email is spam or not. The prediction of the learner is performed based on a hypothesis, $h_t : X \to Y$, where X is the set of questions and Y is the set of possible answers. In the aforementioned example, Y would be {+1, -1}, where +1 stands for a spam email and -1 stands for a benign one.
After predicting an answer, the learner receives the correct answer for the question, denoted y_t, and suffers loss according to a loss function ℓ(h_t, (x_t, y_t)). In most cases, the hypotheses used for prediction come from a parameterized set of hypotheses, H = {h_w : w ∈ S}. For example, the set of linear classifiers, which is used for answering yes/no questions, is defined as H = {h_w(x) = sign(⟨w, x⟩) : w ∈ R^n}. Thus, rather than saying that on round t the learner chooses a hypothesis, we can say that the learner chooses a vector w_t and its hypothesis is h_{w_t}. Next, we note that once the environment chooses a question-answer pair (x_t, y_t), the loss function becomes a function over the hypothesis space or equivalently over the set of parameter vectors S. We can therefore redefine the online learning process as follows. On round t, the learner chooses a vector w_t ∈ S, which defines a hypothesis h_{w_t} to be used for prediction. Then, the environment chooses a question-answer pair (x_t, y_t), which induces the following loss function over the set of parameter vectors: g_t(w) = ℓ(h_w, (x_t, y_t)). Finally, the learner suffers the loss g_t(w_t) = ℓ(h_{w_t}, (x_t, y_t)). We have therefore described the process of online learning as a convex repeated game. In this paper we assess the performance of the first player using the notion of regret. Given a number of rounds T and a fixed vector u ∈ S, we define the regret of the first player as the excess loss for not consistently playing the vector u: (1/T) ∑_{t=1}^{T} g_t(w_t) − (1/T) ∑_{t=1}^{T} g_t(u). Our main result is an algorithmic framework for the first player which guarantees low regret with respect to any vector u ∈ S. Specifically, we derive regret bounds that take the following form: ∀u ∈ S, (1/T) ∑_{t=1}^{T} g_t(w_t) − (1/T) ∑_{t=1}^{T} g_t(u) ≤ (f(u) + L)/√T, (1) where f : S → R and L ∈ R_+.
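As one concrete instance of this casting, the sketch below induces a loss g_t from each question-answer pair and evaluates the average regret against a fixed comparator u. The hinge loss, the data, and the player's plays are all made up for illustration (the paper keeps ℓ abstract):

```python
import numpy as np

def induced_loss(x, y):
    """The convex loss g_t over parameter vectors induced by the
    question-answer pair (x_t, y_t), here a hinge loss for h_w = sign(<w,x>)."""
    return lambda w: max(0.0, 1.0 - y * float(np.dot(w, x)))

def average_regret(ws, losses, u):
    """(1/T) sum_t g_t(w_t) - (1/T) sum_t g_t(u) for a fixed comparator u."""
    T = len(losses)
    return (sum(g(w) for w, g in zip(ws, losses))
            - sum(g(u) for g in losses)) / T

# Made-up rounds: two question-answer pairs and the player's two plays.
xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
ys = [+1, -1]
losses = [induced_loss(x, y) for x, y in zip(xs, ys)]
ws = [np.zeros(2), np.array([0.5, 0.0])]
u = np.array([1.0, -1.0])            # a comparator vector in S
r = average_regret(ws, losses, u)    # here u suffers zero loss on both rounds
```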
Informally, the function f measures the "complexity" of vectors in S and the scalar L is related to some generalized Lipschitz property of the functions g_1, . . . , g_T. We defer the exact requirements we impose on f and L to later sections. Our algorithmic framework emerges from a representation of the regret bound given in Eq. (1) using an optimization problem. Specifically, we rewrite Eq. (1) as follows: (1/T) ∑_{t=1}^{T} g_t(w_t) ≤ inf_{u∈S} [ (1/T) ∑_{t=1}^{T} g_t(u) + (f(u) + L)/√T ]. (2) That is, the average loss of the first player should be bounded above by the minimum value of an optimization problem in which we jointly minimize the average loss of u and the "complexity" of u as measured by the function f. Note that the optimization problem on the right-hand side of Eq. (2) can only be solved in hindsight, after observing the entire sequence of loss functions. Nevertheless, writing the regret bound as in Eq. (2) implies that the average loss of the first player forms a lower bound for a minimization problem. The notion of duality, commonly used in convex optimization theory, plays an important role in obtaining lower bounds for the minimal value of a minimization problem (see for example [14]). By generalizing the notion of Fenchel duality, we are able to derive a dual optimization problem which can be optimized incrementally as the game progresses. In order to derive explicit quantitative regret bounds we make immediate use of the fact that the dual objective lower bounds the primal objective. We therefore reduce the process of playing convex repeated games to the task of incrementally increasing the dual objective function. The amount by which the dual increases serves as a new and natural notion of progress. By doing so we are able to tie together the primal objective value, the average loss of the first player, and the increase in the dual. The rest of this paper is organized as follows. In Sec.
2 we establish our notation and point to a few mathematical tools that we use throughout the paper. Our main tool for deriving algorithms for playing convex repeated games is a generalization of Fenchel duality, described in Sec. 3. Our algorithmic framework is given in Sec. 4 and analyzed in Sec. 5. The generality of our framework allows us to utilize it in different problems arising in machine learning. Specifically, in Sec. 6 we underscore the applicability of our framework for online learning and in Sec. 7 we outline and analyze boosting algorithms based on our framework. We conclude with a discussion and point to related work in Sec. 8. Due to the lack of space, some of the details are omitted from the paper and can be found in [16]. 2 Mathematical Background We denote scalars with lower case letters (e.g. x and w), and vectors with bold face letters (e.g. x and w). The inner product between vectors x and w is denoted by ⟨x, w⟩. Sets are designated by upper case letters (e.g. S). The set of non-negative real numbers is denoted by R_+. For any k ≥ 1, the set of integers {1, . . . , k} is denoted by [k]. A norm of a vector x is denoted by ‖x‖. The dual norm is defined as ‖λ‖_* = sup{⟨x, λ⟩ : ‖x‖ ≤ 1}. For example, the Euclidean norm ‖x‖_2 = (⟨x, x⟩)^{1/2} is dual to itself, and the ℓ_1 norm, ‖x‖_1 = ∑_i |x_i|, is dual to the ℓ_∞ norm, ‖x‖_∞ = max_i |x_i|. We next recall a few definitions from convex analysis. The reader familiar with convex analysis may proceed to Lemma 1, while for a more thorough introduction see for example [1]. A set S is convex if for any two vectors w_1, w_2 in S, the entire line segment between w_1 and w_2 is also within S. That is, for any α ∈ [0, 1] we have that αw_1 + (1 − α)w_2 ∈ S. A set S is open if every point in S has a neighborhood lying in S. A set S is closed if its complement is an open set. A function f : S → R is closed and convex if for any scalar α ∈ R, the level set {w : f(w) ≤ α} is closed and convex.
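The three norm/dual-norm pairs just mentioned can be checked directly from the definition ‖λ‖_* = sup{⟨x, λ⟩ : ‖x‖ ≤ 1}, since in each case the supremum is attained at an explicit point of the unit ball; the vector `lam` below is an arbitrary example:

```python
import numpy as np

def l2(x): return float(np.linalg.norm(x))
def l1(x): return float(np.abs(x).sum())
def linf(x): return float(np.abs(x).max())

lam = np.array([3.0, -4.0, 1.0])

# The Euclidean norm is self-dual: the sup is attained at x = lam / ||lam||_2,
# where <x, lam> = ||lam||_2 = sqrt(26).
x2 = lam / l2(lam)
sup_l2 = float(x2 @ lam)

# The dual of ||.||_1 is ||.||_inf: the sup over the l1 unit ball is attained
# at a signed coordinate vector, giving max_i |lam_i| = 4.
k = int(np.argmax(np.abs(lam)))
x1 = np.zeros_like(lam)
x1[k] = np.sign(lam[k])
sup_l1 = float(x1 @ lam)

# The dual of ||.||_inf is ||.||_1: attained at x = sign(lam), giving
# sum_i |lam_i| = 8.
xinf = np.sign(lam)
sup_linf = float(xinf @ lam)
```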
The Fenchel conjugate of a function f : S → R is defined as f*(θ) = sup_{w∈S} ⟨w, θ⟩ − f(w). If f is closed and convex, then the Fenchel conjugate of f* is f itself. The Fenchel-Young inequality states that for any w and θ we have that f(w) + f*(θ) ≥ ⟨w, θ⟩. A vector λ is a sub-gradient of a function f at w if for all w′ ∈ S we have that f(w′) − f(w) ≥ ⟨w′ − w, λ⟩. The differential set of f at w, denoted ∂f(w), is the set of all sub-gradients of f at w. If f is differentiable at w, then ∂f(w) consists of a single vector which amounts to the gradient of f at w and is denoted by ∇f(w). Sub-gradients play an important role in the definition of the Fenchel conjugate. In particular, the following lemma states that if λ′ ∈ ∂f(w′) then the Fenchel-Young inequality holds with equality. Lemma 1 Let f be a closed and convex function and let ∂f(w′) be its differential set at w′. Then, for all λ′ ∈ ∂f(w′) we have f(w′) + f*(λ′) = ⟨λ′, w′⟩. A continuous function f is σ-strongly convex over a convex set S with respect to a norm ‖·‖ if S is contained in the domain of f and for all v, u ∈ S and α ∈ [0, 1] we have f(αv + (1 − α)u) ≤ αf(v) + (1 − α)f(u) − (σ/2) α(1 − α) ‖v − u‖². (3) Strongly convex functions play an important role in our analysis, primarily due to the following lemma. Lemma 2 Let ‖·‖ be a norm over R^n and let ‖·‖_* be its dual norm. Let f be a σ-strongly convex function on S and let f* be its Fenchel conjugate. Then, f* is differentiable with ∇f*(θ) = argmax_{x∈S} ⟨θ, x⟩ − f(x). Furthermore, for any θ, λ ∈ R^n we have f*(θ + λ) − f*(θ) ≤ ⟨∇f*(θ), λ⟩ + (1/(2σ)) ‖λ‖_*². Two notable examples of strongly convex functions which we use are as follows. Example 1 The function f(w) = (1/2)‖w‖_2² is 1-strongly convex over S = R^n with respect to the ℓ_2 norm. Its conjugate function is f*(θ) = (1/2)‖θ‖_2². Example 2 The function f(w) = ∑_{i=1}^{n} w_i log(w_i / (1/n)) is 1-strongly convex over the probabilistic simplex, S = {w ∈ R_+^n : ‖w‖_1 = 1}, with respect to the ℓ_1 norm.
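Example 1 can be verified numerically: a brute-force maximization over a grid recovers f*(θ) = (1/2)‖θ‖², and the Fenchel-Young inequality holds with equality at w = ∇f*(θ) = θ. The grid and the test point are arbitrary choices for this check:

```python
import numpy as np

def conjugate_numeric(f, theta, grid):
    """Brute-force Fenchel conjugate f*(theta) = sup_w <w, theta> - f(w),
    approximated by maximizing over a finite grid of candidate w's."""
    return max(float(w @ theta) - f(w) for w in grid)

# Example 1 of the text: f(w) = (1/2)||w||_2^2, whose conjugate is
# f*(theta) = (1/2)||theta||_2^2.
f = lambda w: 0.5 * float(w @ w)
theta = np.array([1.0, -2.0])
grid = [np.array([a, b]) for a in np.linspace(-5, 5, 101)
                         for b in np.linspace(-5, 5, 101)]
f_star = conjugate_numeric(f, theta, grid)   # ~ 0.5 * ||theta||^2 = 2.5

# Fenchel-Young: f(w) + f*(theta) >= <w, theta>, with equality when theta
# is the gradient of f at w (here grad f(w) = w, so take w = theta).
w = theta.copy()
gap = (f(w) + 0.5 * float(theta @ theta)) - float(w @ theta)  # should be 0
```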
Its conjugate function is f*(θ) = log((1/n) ∑_{i=1}^{n} exp(θ_i)). 3 Generalized Fenchel Duality In this section we derive our main analysis tool. We start by considering the following optimization problem: inf_{w∈S} [ c f(w) + ∑_{t=1}^{T} g_t(w) ], where c is a non-negative scalar. An equivalent problem is inf_{w_0, w_1, ..., w_T} [ c f(w_0) + ∑_{t=1}^{T} g_t(w_t) ] s.t. w_0 ∈ S and ∀t ∈ [T], w_t = w_0. Introducing T vectors λ_1, . . . , λ_T, where each λ_t ∈ R^n is a vector of Lagrange multipliers for the equality constraint w_t = w_0, we obtain the following Lagrangian: L(w_0, w_1, . . . , w_T, λ_1, . . . , λ_T) = c f(w_0) + ∑_{t=1}^{T} g_t(w_t) + ∑_{t=1}^{T} ⟨λ_t, w_0 − w_t⟩. The dual problem is the task of maximizing the following dual objective value: D(λ_1, . . . , λ_T) = inf_{w_0∈S, w_1,...,w_T} L(w_0, w_1, . . . , w_T, λ_1, . . . , λ_T) = −c sup_{w_0∈S} [ ⟨w_0, −(1/c) ∑_{t=1}^{T} λ_t⟩ − f(w_0) ] − ∑_{t=1}^{T} sup_{w_t} [ ⟨w_t, λ_t⟩ − g_t(w_t) ] = −c f*(−(1/c) ∑_{t=1}^{T} λ_t) − ∑_{t=1}^{T} g_t*(λ_t), where, following the exposition of Sec. 2, f*, g_1*, . . . , g_T* are the Fenchel conjugate functions of f, g_1, . . . , g_T. Therefore, the generalized Fenchel dual problem is sup_{λ_1,...,λ_T} [ −c f*(−(1/c) ∑_{t=1}^{T} λ_t) − ∑_{t=1}^{T} g_t*(λ_t) ]. (4) Note that when T = 1 and c = 1, the above duality is the so-called Fenchel duality. 4 A Template Learning Algorithm for Convex Repeated Games In this section we describe a template learning algorithm for playing convex repeated games. As mentioned before, we study convex repeated games from the viewpoint of the first player, which we shortly denote as P1. Recall that we would like our learning algorithm to achieve a regret bound of the form given in Eq. (2). We start by rewriting Eq. (2) as follows: ∑_{t=1}^{T} g_t(w_t) − c L ≤ inf_{u∈S} [ c f(u) + ∑_{t=1}^{T} g_t(u) ], (5) where c = √T. Thus, up to the sublinear term c L, the cumulative loss of P1 lower bounds the optimum of the minimization problem on the right-hand side of Eq. (5). In the previous section we derived the generalized Fenchel dual of the right-hand side of Eq. (5).
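The duality of Eq. (4) can be sanity-checked numerically on a small instance where all the conjugates are available in closed form. The choices below, f(w) = (1/2)‖w‖² and quadratic losses g_t(w) = (1/2)‖w − a_t‖² (with g_t*(λ) = ⟨λ, a_t⟩ + (1/2)‖λ‖²), are purely illustrative and not the paper's setting:

```python
import numpy as np

rng = np.random.default_rng(0)
c, T, n = 2.0, 4, 3
A = rng.normal(size=(T, n))                      # the vectors a_t

def primal(w):
    return c * 0.5 * w @ w + sum(0.5 * (w - a) @ (w - a) for a in A)

def dual(lams):                                  # Eq. (4) for these choices
    s = lams.sum(axis=0)
    return (-c * 0.5 * (s / c) @ (s / c)
            - sum(l @ a + 0.5 * l @ l for l, a in zip(lams, A)))

w_opt = A.sum(axis=0) / (c + T)                  # closed-form primal minimizer

# Weak duality: any dual value lower bounds the primal optimum ...
random_dual = max(dual(rng.normal(size=(T, n))) for _ in range(100))
# ... and for this smooth instance strong duality holds: the choice
# lam_t = w_opt - a_t (a subgradient of g_t at w_opt) attains equality.
exact_dual = dual(np.array([w_opt - a for a in A]))
```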
Our construction is based on the weak duality theorem, stating that any value of the dual problem is smaller than the optimum value of the primal problem. The algorithmic framework we propose is therefore derived by incrementally ascending the dual objective function. Intuitively, by ascending the dual objective we move closer to the optimal primal value, and therefore our performance becomes similar to the performance of the best fixed weight vector which minimizes the right-hand side of Eq. (5). Initially, we use the elementary dual solution λ_t^1 = 0 for all t. We assume that inf_w f(w) = 0 and, for all t, inf_w g_t(w) = 0, which imply that D(λ_1^1, . . . , λ_T^1) = 0. We assume in addition that f is σ-strongly convex. Therefore, based on Lemma 2, the function f* is differentiable. At trial t, P1 uses for prediction the vector w_t = ∇f*( −(1/c) ∑_{i=1}^{T} λ_i^t ). (6) After predicting w_t, P1 receives the function g_t and suffers the loss g_t(w_t). Then, P1 updates the dual variables as follows. Denote by ∂_t the differential set of g_t at w_t, that is, ∂_t = {λ : ∀w ∈ S, g_t(w) − g_t(w_t) ≥ ⟨λ, w − w_t⟩}. (7) The new dual variables (λ_1^{t+1}, . . . , λ_T^{t+1}) are set to be any set of vectors which satisfy the following two conditions: (i) ∃λ′ ∈ ∂_t s.t. D(λ_1^{t+1}, . . . , λ_T^{t+1}) ≥ D(λ_1^t, . . . , λ_{t−1}^t, λ′, λ_{t+1}^t, . . . , λ_T^t); (ii) ∀i > t, λ_i^{t+1} = 0. (8) In the next section we show that condition (i) ensures that the increase of the dual at trial t is proportional to the loss g_t(w_t). The second condition ensures that we can actually calculate the dual at trial t without any knowledge of the yet to be seen loss functions g_{t+1}, . . . , g_T. We conclude this section with two update rules that trivially satisfy the above two conditions. The first update scheme simply finds some λ′ ∈ ∂_t and sets λ_i^{t+1} = λ′ if i = t, and λ_i^{t+1} = λ_i^t if i ≠ t. (9) The second update defines (λ_1^{t+1}, . . . , λ_T^{t+1}) = argmax_{λ_1,...,λ_T} D(λ_1, . . . , λ_T) s.t. ∀i ≠ t, λ_i = λ_i^t. (10)
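A minimal sketch of the template: with f(w) = (1/2)‖w‖² over S = R^n we have ∇f*(θ) = θ, so prediction (6) combined with the simple update (9) reduces to online (sub)gradient descent with step size 1/c. This is one instance of the template, not the only one that conditions (8) allow, and the quadratic losses below are made up:

```python
import numpy as np

def template_play(losses, c, n, T):
    """Template of Eqs. (6)-(9) with f(w) = (1/2)||w||^2, so grad f* = id.
    losses[t] = (g_t, subgradient of g_t)."""
    lams = np.zeros((T, n))
    ws, total = [], 0.0
    for t in range(T):
        w_t = -lams[:t].sum(axis=0) / c     # Eq. (6); duals with i >= t are 0
        ws.append(w_t)
        g_t, dg_t = losses[t]
        total += g_t(w_t)                   # P1 suffers the loss g_t(w_t)
        lams[t] = dg_t(w_t)                 # update (9): a subgradient at w_t
    return ws, total

# Toy losses g_t(w) = (1/2)||w - a_t||^2, with subgradient w - a_t.
A = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
losses = [(lambda w, a=a: 0.5 * float((w - a) @ (w - a)),
           lambda w, a=a: w - a) for a in A]
ws, total = template_play(losses, c=np.sqrt(3.0), n=2, T=3)
```

With c = √T this realizes the step size the regret analysis of the next section expects.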
5 Analysis In this section we analyze the performance of the template algorithm given in the previous section. Our proof technique is based on monitoring the value of the dual objective function. The main result is the following lemma, which gives upper and lower bounds for the final value of the dual objective function. Lemma 3 Let f be a σ-strongly convex function with respect to a norm ‖·‖ over a set S and assume that min_{w∈S} f(w) = 0. Let g_1, . . . , g_T be a sequence of convex and closed functions such that inf_w g_t(w) = 0 for all t ∈ [T]. Suppose that a dual-incrementing algorithm which satisfies the conditions of Eq. (8) is run with f as a complexity function on the sequence g_1, . . . , g_T. Let w_1, . . . , w_T be the sequence of primal vectors that the algorithm generates and λ_1^{T+1}, . . . , λ_T^{T+1} be its final sequence of dual variables. Then, there exists a sequence of sub-gradients λ_1, . . . , λ_T, where λ_t ∈ ∂_t for all t, such that ∑_{t=1}^{T} g_t(w_t) − (1/(2σc)) ∑_{t=1}^{T} ‖λ_t‖_*² ≤ D(λ_1^{T+1}, . . . , λ_T^{T+1}) ≤ inf_{w∈S} [ c f(w) + ∑_{t=1}^{T} g_t(w) ]. Proof The second inequality follows directly from the weak duality theorem. Turning to the leftmost inequality, denote ∆_t = D(λ_1^{t+1}, . . . , λ_T^{t+1}) − D(λ_1^t, . . . , λ_T^t) and note that D(λ_1^{T+1}, . . . , λ_T^{T+1}) can be rewritten as D(λ_1^{T+1}, . . . , λ_T^{T+1}) = ∑_{t=1}^{T} ∆_t + D(λ_1^1, . . . , λ_T^1) = ∑_{t=1}^{T} ∆_t, (11) where the last equality follows from the fact that f(0) = g_1(0) = . . . = g_T(0) = 0. The definition of the update implies that ∆_t ≥ D(λ_1^t, . . . , λ_{t−1}^t, λ_t, 0, . . . , 0) − D(λ_1^t, . . . , λ_{t−1}^t, 0, 0, . . . , 0) for some subgradient λ_t ∈ ∂_t. Denoting θ_t = −(1/c) ∑_{j=1}^{t−1} λ_j, we now rewrite the lower bound on ∆_t as ∆_t ≥ −c (f*(θ_t − λ_t/c) − f*(θ_t)) − g_t*(λ_t). Using Lemma 2 and the definition of w_t we get that ∆_t ≥ ⟨w_t, λ_t⟩ − g_t*(λ_t) − (1/(2σc)) ‖λ_t‖_*². (12)
Since λ_t ∈ ∂_t and since we assume that g_t is closed and convex, we can apply Lemma 1 to get that ⟨w_t, λ_t⟩ − g_t*(λ_t) = g_t(w_t). Plugging this equality into Eq. (12) and summing over t we obtain that ∑_{t=1}^{T} ∆_t ≥ ∑_{t=1}^{T} g_t(w_t) − (1/(2σc)) ∑_{t=1}^{T} ‖λ_t‖_*². Combining the above inequality with Eq. (11) concludes our proof. The following regret bound follows as a direct corollary of Lemma 3. Theorem 1 Under the same conditions as Lemma 3, denote L = (1/T) ∑_{t=1}^{T} ‖λ_t‖_*². Then, for all w ∈ S we have (1/T) ∑_{t=1}^{T} g_t(w_t) − (1/T) ∑_{t=1}^{T} g_t(w) ≤ (c f(w))/T + L/(2σc). In particular, if c = √T, we obtain the bound (1/T) ∑_{t=1}^{T} g_t(w_t) − (1/T) ∑_{t=1}^{T} g_t(w) ≤ (f(w) + L/(2σ))/√T. 6 Application to Online Learning In Sec. 1 we cast the task of online learning as a convex repeated game. We now demonstrate the applicability of our algorithmic framework for the problem of instance ranking. We analyze this setting since several prediction problems, including binary classification, multiclass prediction, multilabel prediction, and label ranking, can be cast as special cases of the instance ranking problem. Recall that on each online round, the learner receives a question-answer pair. In instance ranking, the question is encoded by a matrix X_t of dimension k_t × n and the answer is a vector y_t ∈ R^{k_t}. The semantics of y_t is as follows. For any pair (i, j), if y_{t,i} > y_{t,j} then we say that y_t ranks the i'th row of X_t ahead of the j'th row of X_t. We also interpret y_{t,i} − y_{t,j} as the confidence with which the i'th row should be ranked ahead of the j'th row. For example, each row of X_t encompasses a representation of a movie while y_{t,i} is the movie's rating, expressed as the number of stars this movie has received by a movie reviewer. The predictions of the learner are determined based on a weight vector w_t ∈ R^n and are defined to be ŷ_t = X_t w_t. Finally, let us define two loss functions for ranking, both of which generalize the hinge-loss used in binary classification problems.
Denote by E_t the set {(i, j) : y_{t,i} > y_{t,j}}. For all (i, j) ∈ E_t we define a pair-based hinge-loss ℓ_{i,j}(w; (X_t, y_t)) = [(y_{t,i} − y_{t,j}) − ⟨w, x_{t,i} − x_{t,j}⟩]_+, where [a]_+ = max{a, 0} and x_{t,i}, x_{t,j} are respectively the i'th and j'th rows of X_t. Note that ℓ_{i,j} is zero if w ranks x_{t,i} higher than x_{t,j} with a sufficient confidence. Ideally, we would like ℓ_{i,j}(w_t; (X_t, y_t)) to be zero for all (i, j) ∈ E_t. If this is not the case, we are being penalized according to some combination of the pair-based losses ℓ_{i,j}. For example, we can set ℓ(w; (X_t, y_t)) to be the average over the pair losses, ℓ_avg(w; (X_t, y_t)) = (1/|E_t|) ∑_{(i,j)∈E_t} ℓ_{i,j}(w; (X_t, y_t)). This loss was suggested by several authors (see for example [18]). Another popular approach (see for example [5]) penalizes according to the maximal loss over the individual pairs, ℓ_max(w; (X_t, y_t)) = max_{(i,j)∈E_t} ℓ_{i,j}(w; (X_t, y_t)). We can apply our algorithmic framework given in Sec. 4 for ranking, using for g_t(w) either ℓ_avg(w; (X_t, y_t)) or ℓ_max(w; (X_t, y_t)). The following theorem provides us with a sufficient condition under which the regret bound from Thm. 1 holds for ranking as well. Theorem 2 Let f be a σ-strongly convex function over S with respect to a norm ‖·‖. Denote by L_t the maximum over (i, j) ∈ E_t of ‖x_{t,i} − x_{t,j}‖_*². Then, for both g_t(w) = ℓ_avg(w; (X_t, y_t)) and g_t(w) = ℓ_max(w; (X_t, y_t)), the following regret bound holds: ∀u ∈ S, (1/T) ∑_{t=1}^{T} g_t(w_t) − (1/T) ∑_{t=1}^{T} g_t(u) ≤ (f(u) + (1/T) ∑_{t=1}^{T} L_t/(2σ)) / √T. 7 The Boosting Game In this section we describe the applicability of our algorithmic framework to the analysis of boosting algorithms. A boosting algorithm uses a weak learning algorithm that generates weak hypotheses, whose performance is just slightly better than random guessing, to build a strong hypothesis which can attain an arbitrarily low error. The AdaBoost algorithm, proposed by Freund and Schapire [6], receives as input a training set of examples {(x_1, y_1), . . .
, (x_m, y_m)}, where for all i ∈ [m], x_i is taken from an instance domain X, and y_i is a binary label, y_i ∈ {+1, −1}. The boosting process proceeds in a sequence of consecutive trials. At trial t, the booster first defines a distribution, denoted w_t, over the set of examples. Then, the booster passes the training set along with the distribution w_t to the weak learner. The weak learner is assumed to return a hypothesis h_t : X → {+1, −1} whose average error is slightly smaller than 1/2. That is, there exists a constant γ > 0 such that ε_t := ∑_{i=1}^{m} w_{t,i} (1 − y_i h_t(x_i))/2 ≤ 1/2 − γ. (13) The goal of the boosting algorithm is to invoke the weak learner several times with different distributions, and to combine the hypotheses returned by the weak learner into a final, so-called strong, hypothesis whose error is small. The final hypothesis combines linearly the T hypotheses returned by the weak learner with coefficients α_1, . . . , α_T, and is defined to be the sign of h_f(x), where h_f(x) = ∑_{t=1}^{T} α_t h_t(x). The coefficients α_1, . . . , α_T are determined by the booster. In AdaBoost, the initial distribution is set to be the uniform distribution, w_1 = (1/m, . . . , 1/m). At iteration t, the value of α_t is set to be (1/2) log((1 − ε_t)/ε_t). The distribution is updated by the rule w_{t+1,i} = w_{t,i} exp(−α_t y_i h_t(x_i))/Z_t, where Z_t is a normalization factor. Freund and Schapire [6] have shown that under the assumption given in Eq. (13), the error of the final strong hypothesis is at most exp(−2γ²T). Several authors [15, 13, 8, 4] have proposed to view boosting as a coordinate-wise greedy optimization process. To do so, note first that h_f errs on an example (x, y) iff y h_f(x) ≤ 0. Therefore, the exp-loss function, defined as exp(−y h_f(x)), is a smooth upper bound of the zero-one error, which equals 1 if y h_f(x) ≤ 0 and 0 otherwise.
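The AdaBoost recursion just described (uniform w_1, weighted error ε_t, coefficient α_t = (1/2) log((1 − ε_t)/ε_t), multiplicative reweighting) can be sketched as follows; the three examples and the two stump outputs are fabricated for illustration:

```python
import numpy as np

def adaboost(h_outputs, y, T):
    """Plain AdaBoost over a fixed pool of weak hypotheses.
    h_outputs[t][i] = h_t(x_i) in {+1,-1}; y[i] in {+1,-1}.
    Assumes 0 < eps_t < 1/2 on every round."""
    m = len(y)
    w = np.full(m, 1.0 / m)                 # uniform initial distribution w_1
    alphas = []
    for t in range(T):
        h = h_outputs[t]
        eps = float(w @ ((1 - y * h) / 2))  # weighted error, Eq. (13)
        alpha = 0.5 * np.log((1 - eps) / eps)
        alphas.append(alpha)
        w = w * np.exp(-alpha * y * h)      # multiplicative update
        w = w / w.sum()                     # normalize by Z_t
    return np.array(alphas)

y = np.array([+1, +1, -1])
h_outputs = [np.array([+1, -1, -1]),        # stump erring on example 1
             np.array([-1, +1, -1])]        # stump erring on example 0
alphas = adaboost(h_outputs, y, T=2)
# The strong hypothesis sign(sum_t alpha_t h_t); with only these two stumps
# it still errs on example 0 in this toy run.
strong = np.sign(sum(a * h for a, h in zip(alphas, h_outputs)))
```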
Thus, we can restate the goal of boosting as minimizing the average exp-loss of h_f over the training set with respect to the variables α_1, . . . , α_T. To simplify our derivation in the sequel, we prefer to say that boosting maximizes the negation of the loss, that is, max_{α_1,...,α_T} −(1/m) ∑_{i=1}^{m} exp(−y_i ∑_{t=1}^{T} α_t h_t(x_i)). (14) In this view, boosting is an optimization procedure which iteratively maximizes Eq. (14) with respect to the variables α_1, . . . , α_T. This view of boosting enables the hypotheses returned by the weak learner to be general functions into the reals, h_t : X → R (see for instance [15]). In this paper we view boosting as a convex repeated game between a booster and a weak learner. To motivate our construction, we would like to note that boosting algorithms define weights in two different domains: the vectors w_t ∈ R^m which assign weights to examples, and the weights {α_t : t ∈ [T]} over weak hypotheses. In the terminology used throughout this paper, the weights w_t ∈ R^m are primal vectors while (as we show in the sequel) each weight α_t of the hypothesis h_t is related to a dual vector λ_t. In particular, we show that Eq. (14) is exactly the Fenchel dual of a primal problem for a convex repeated game; thus the algorithmic framework described thus far for playing games naturally fits the problem of iteratively solving Eq. (14). To derive the primal problem whose Fenchel dual is the problem given in Eq. (14), let us first denote by v_t the vector in R^m whose i'th element is v_{t,i} = y_i h_t(x_i). For all t, we set g_t to be the function g_t(w) = [⟨w, v_t⟩]_+. Intuitively, g_t penalizes vectors w which assign large weights to examples which are predicted accurately, that is y_i h_t(x_i) > 0. In particular, if h_t(x_i) ∈ {+1, −1} and w_t is a distribution over the m examples (as is the case in AdaBoost), g_t(w_t) reduces to 1 − 2ε_t (see Eq. (13)).
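The reduction g_t(w_t) = 1 − 2ε_t for binary h_t and a distribution w_t follows because ⟨w, v_t⟩ splits the probability mass into correctly and incorrectly classified parts: ⟨w, v_t⟩ = (mass on correct) − (mass on errors) = (1 − ε_t) − ε_t. A quick numeric check with made-up weights and labels:

```python
import numpy as np

# A distribution over m = 6 examples and a binary weak hypothesis (toy data).
w = np.array([0.3, 0.2, 0.1, 0.15, 0.05, 0.2])   # sums to 1
y = np.array([+1, -1, +1, +1, -1, +1])
h = np.array([+1, -1, -1, +1, +1, +1])           # errs on examples 2 and 4
v = y * h                                        # v_{t,i} = y_i h_t(x_i)

eps = float(w @ ((1 - v) / 2))       # weighted error, Eq. (13): 0.1 + 0.05
g_val = max(0.0, float(w @ v))       # g_t(w) = [<w, v_t>]_+ = 1 - 2*eps
```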
In this case, minimizing g_t is equivalent to maximizing the error of the individual hypothesis h_t over the examples. Consider the problem of minimizing c f(w) + ∑_{t=1}^{T} g_t(w), where f(w) is the relative entropy given in Example 2 and c = 1/(2γ) (see Eq. (13)). To derive its Fenchel dual, we note that g_t*(λ_t) = 0 if there exists β_t ∈ [0, 1] such that λ_t = β_t v_t, and otherwise g_t*(λ_t) = ∞ (see [16]). In addition, let us define α_t = 2γβ_t. Since our goal is to maximize the dual, we can restrict λ_t to take the form λ_t = β_t v_t = (α_t/(2γ)) v_t, and get that D(λ_1, . . . , λ_T) = −c f*(−(1/c) ∑_{t=1}^{T} β_t v_t) = −(1/(2γ)) log((1/m) ∑_{i=1}^{m} e^{−∑_{t=1}^{T} α_t y_i h_t(x_i)}). (15) Minimizing the exp-loss of the strong hypothesis is therefore the dual problem of the following primal minimization problem: find a distribution over the examples whose relative entropy to the uniform distribution is as small as possible, while the correlation of the distribution with each v_t is as small as possible. Since the correlation of w with v_t is inversely proportional to the error of h_t with respect to w, we obtain that in the primal problem we are trying to maximize the error of each individual hypothesis, while in the dual problem we minimize the global error of the strong hypothesis. The intuition of finding distributions which in retrospect result in large error rates of individual hypotheses was also alluded to in [15, 8]. We can now apply our algorithmic framework from Sec. 4 to boosting. We describe the game with the parameters α_t, where α_t ∈ [0, 2γ], and underscore that in our case, λ_t = (α_t/(2γ)) v_t. At the beginning of the game the booster sets all dual variables to be zero, ∀t α_t = 0. At trial t of the boosting game, the booster first constructs a primal weight vector w_t ∈ R^m, which assigns importance weights to the examples in the training set. The primal vector w_t is constructed as in Eq. (6), that is, w_t = ∇f*(θ_t), where θ_t = −∑_i α_i v_i.
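With the entropic f of Example 2, the primal vector w_t = ∇f*(θ_t) is a softmax of θ_t, so the booster's example weights are exactly the exponential weights AdaBoost maintains: examples with negative margins get exponentially larger weight. A small sketch with one toy past hypothesis:

```python
import numpy as np

def booster_weights(alphas, V):
    """Primal weights w_t = grad f*(theta_t) for the entropic f of Example 2:
    maximizing <theta, w> - sum_i w_i log(n w_i) over the simplex gives a
    softmax of theta_t = -sum_{i<t} alpha_i v_i."""
    theta = -(np.asarray(alphas)[:, None] * np.asarray(V)).sum(axis=0)
    z = np.exp(theta - theta.max())          # numerically stable softmax
    return z / z.sum()

# v_{t,i} = y_i h_t(x_i): toy margins of one past weak hypothesis, which
# misclassified example 1 (v = -1).
V = np.array([[+1.0, -1.0, +1.0]])
w = booster_weights([0.5 * np.log(2.0)], V)
# The misclassified example carries the most weight: w = [1/4, 1/2, 1/4],
# matching the AdaBoost reweighting for alpha = (1/2) log 2.
```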
Then, the weak learner responds by presenting the loss function g_t(w) = [⟨w, v_t⟩]_+. Finally, the booster updates the dual variables so as to increase the dual objective function. It is possible to show that if the range of h_t is {+1, −1}, then the update given in Eq. (10) is equivalent to the update α_t = min{2γ, (1/2) log((1 − ε_t)/ε_t)}. We have thus obtained a variant of AdaBoost in which the weights α_t are capped above by 2γ. A disadvantage of this variant is that we need to know the parameter γ. We would like to note in passing that this limitation can be lifted by a different definition of the functions g_t. We omit the details due to the lack of space. To analyze our game of boosting, we note that the conditions given in Lemma 3 hold, and therefore the left-hand side inequality given in Lemma 3 tells us that ∑_{t=1}^{T} g_t(w_t) − (1/(2c)) ∑_{t=1}^{T} ‖λ_t‖_∞² ≤ D(λ_1^{T+1}, . . . , λ_T^{T+1}), where we used σ = 1 since the entropic f is 1-strongly convex with respect to the ℓ_1 norm, whose dual is ℓ_∞. The definition of g_t and the weak learnability assumption given in Eq. (13) imply that ⟨w_t, v_t⟩ ≥ 2γ for all t. Thus, g_t(w_t) = ⟨w_t, v_t⟩ ≥ 2γ, which also implies that λ_t = v_t. Recall that v_{t,i} = y_i h_t(x_i). Assuming that the range of h_t is [−1, +1], we get that ‖λ_t‖_∞ ≤ 1. Combining all the above with the left-hand side inequality given in Lemma 3, we get that 2Tγ − T/(2c) ≤ D(λ_1^{T+1}, . . . , λ_T^{T+1}). Using the definition of D (see Eq. (15)), the value c = 1/(2γ), and rearranging terms, we recover the original bound for AdaBoost: (1/m) ∑_{i=1}^{m} e^{−y_i ∑_{t=1}^{T} α_t h_t(x_i)} ≤ e^{−2γ²T}. 8 Related Work and Discussion We presented a new framework for designing and analyzing algorithms for playing convex repeated games. Our framework was used for the analysis of known algorithms for both online learning and boosting settings. The framework also paves the way to new algorithms. In a previous paper [17], we suggested the use of duality for the design of online algorithms in the context of mistake bound analysis. The contribution of this paper over [17] is threefold, as we now briefly discuss.
First, we generalize the applicability of the framework beyond the specific setting of online learning with the hinge-loss to the general setting of convex repeated games. The setting of convex repeated games was formally termed "online convex programming" by Zinkevich [19] and was first presented by Gordon in [9]. There is a voluminous amount of work on unifying approaches for deriving online learning algorithms. We refer the reader to [11, 12, 3] for work closely related to the content of this paper. By generalizing our previously studied algorithmic framework [17] beyond online learning, we can automatically utilize well known online learning algorithms, such as the EG and p-norm algorithms [12, 11], in the setting of online convex programming. We would like to note that the algorithms presented in [19] can be derived as special cases of our algorithmic framework by setting f(w) = (1/2)‖w‖². Parallel and independent to this work, Gordon [10] described another algorithmic framework for online convex programming that is closely related to the potential based algorithms described by Cesa-Bianchi and Lugosi [3]. Gordon also considered the problem of defining appropriate potential functions. Our work generalizes some of the theorems in [10] while providing a somewhat simpler analysis. Second, the usage of generalized Fenchel duality rather than the Lagrange duality given in [17] enables us to analyze boosting algorithms based on the framework. Many authors have derived unifying frameworks for boosting algorithms [13, 8, 4]. Nonetheless, our general framework and the connection between game playing and Fenchel duality underscore an interesting perspective of both online learning and boosting. We believe that this viewpoint has the potential of yielding new algorithms in both domains. Last, despite the generality of the framework introduced in this paper, the resulting analysis is more distilled than the earlier analysis given in [17] for two reasons.
(i) The usage of Lagrange duality in [17] is somewhat restricted, while the notion of generalized Fenchel duality is more appropriate to the general and broader problems we consider in this paper. (ii) The strong convexity property we employ both simplifies the analysis and enables more intuitive conditions in our theorems. There are various possible extensions of the work that we did not pursue here due to the lack of space. For instance, our framework can naturally be used for the analysis of other settings such as repeated games (see [7, 19]). The applicability of our framework to online learning can also be extended to other prediction problems such as regression and sequence prediction. Last, we conjecture that our primal-dual view of boosting will lead to new methods for regularizing boosting algorithms, thus improving their generalization capabilities.
References
[1] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[4] M. Collins, R.E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 2002.
[5] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. JMLR, 7, Mar 2006.
[6] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995.
[7] Y. Freund and R.E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996.
[8] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 2000.
[9] G. Gordon. Regret bounds for prediction problems. In COLT, 1999.
[10] G. Gordon. No-regret algorithms for online convex programs. In NIPS, 2006.
[11] A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 2001.
[12] J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Machine Learning, 45(3), 2001.
[13] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999.
[14] Y. Nesterov. Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2005.
[15] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1–40, 1999.
[16] S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. Technical report, The Hebrew University, 2006.
[17] S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In COLT, 2006.
[18] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In ESANN, April 1999.
[19] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.
5 0.097033828 159 nips-2006-Parameter Expanded Variational Bayesian Methods
Author: Tommi S. Jaakkola, Yuan Qi
Abstract: Bayesian inference has become increasingly important in statistical machine learning. Exact Bayesian calculations are often not feasible in practice, however. A number of approximate Bayesian methods have been proposed to make such calculations practical, among them the variational Bayesian (VB) approach. The VB approach, while useful, can nevertheless suffer from slow convergence to the approximate solution. To address this problem, we propose Parameter-eXpanded Variational Bayesian (PX-VB) methods to speed up VB. The new algorithm is inspired by parameter-expanded expectation maximization (PX-EM) and parameter-expanded data augmentation (PX-DA). Similar to PX-EM and -DA, PX-VB expands a model with auxiliary variables to reduce the coupling between variables in the original model. We analyze the convergence rates of VB and PX-VB and demonstrate the superior convergence rates of PX-VB in variational probit regression and automatic relevance determination. 1
6 0.086438395 93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms
7 0.084543608 176 nips-2006-Single Channel Speech Separation Using Factorial Dynamics
8 0.081424177 46 nips-2006-Blind source separation for over-determined delayed mixtures
9 0.070392862 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation
10 0.070109531 12 nips-2006-A Probabilistic Algorithm Integrating Source Localization and Noise Suppression of MEG and EEG data
11 0.069651246 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
12 0.068980753 22 nips-2006-Adaptive Spatial Filters with predefined Region of Interest for EEG based Brain-Computer-Interfaces
13 0.067229338 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
14 0.065017335 16 nips-2006-A Theory of Retinal Population Coding
15 0.063300088 134 nips-2006-Modeling Human Motion Using Binary Latent Variables
16 0.055878334 27 nips-2006-An EM Algorithm for Localizing Multiple Sound Sources in Reverberant Environments
17 0.053926699 181 nips-2006-Stability of $K$-Means Clustering
18 0.049299654 57 nips-2006-Conditional mean field
19 0.047436055 194 nips-2006-Towards a general independent subspace analysis
20 0.04144907 126 nips-2006-Logistic Regression for Single Trial EEG Classification
topicId topicWeight
[(0, -0.146), (1, -0.004), (2, -0.042), (3, -0.073), (4, -0.099), (5, -0.066), (6, 0.227), (7, 0.021), (8, -0.106), (9, 0.07), (10, 0.091), (11, 0.02), (12, -0.173), (13, 0.068), (14, 0.328), (15, 0.052), (16, 0.092), (17, 0.296), (18, -0.221), (19, -0.026), (20, 0.005), (21, -0.131), (22, 0.001), (23, -0.009), (24, -0.018), (25, -0.121), (26, 0.051), (27, -0.045), (28, 0.08), (29, 0.003), (30, 0.006), (31, -0.007), (32, 0.019), (33, 0.066), (34, 0.042), (35, -0.005), (36, 0.048), (37, -0.04), (38, -0.023), (39, 0.057), (40, -0.107), (41, 0.044), (42, 0.085), (43, 0.037), (44, -0.055), (45, 0.054), (46, 0.012), (47, -0.004), (48, -0.077), (49, 0.067)]
simIndex simValue paperId paperTitle
same-paper 1 0.96758169 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models
Author: David Barber, Silvia Chiappa
Abstract: Linear Gaussian State-Space Models are widely used and a Bayesian treatment of parameters is therefore of considerable interest. The approximate Variational Bayesian method applied to these models is an attractive approach, used successfully in applications ranging from acoustics to bioinformatics. The most challenging aspect of implementing the method is in performing inference on the hidden state sequence of the model. We show how to convert the inference problem so that standard Kalman Filtering/Smoothing recursions from the literature may be applied. This is in contrast to previously published approaches based on Belief Propagation. Our framework both simplifies and unifies the inference problem, so that future applications may be more easily developed. We demonstrate the elegance of the approach on Bayesian temporal ICA, with an application to finding independent dynamical processes underlying noisy EEG signals.

1 Linear Gaussian State-Space Models

Linear Gaussian State-Space Models (LGSSMs)¹ are fundamental in time-series analysis [1, 2, 3]. In these models the observations $v_{1:T}$² are generated from an underlying dynamical system on $h_{1:T}$ according to:
$$v_t = B h_t + \eta_t^v, \quad \eta_t^v \sim \mathcal{N}(0_V, \Sigma_V), \qquad h_t = A h_{t-1} + \eta_t^h, \quad \eta_t^h \sim \mathcal{N}(0_H, \Sigma_H),$$
where $\mathcal{N}(\mu, \Sigma)$ denotes a Gaussian with mean $\mu$ and covariance $\Sigma$, and $0_X$ denotes an $X$-dimensional zero vector. The observation $v_t$ has dimension $V$ and the hidden state $h_t$ has dimension $H$. Probabilistically, the LGSSM is defined by:
$$p(v_{1:T}, h_{1:T}|\Theta) = p(v_1|h_1)\,p(h_1) \prod_{t=2}^{T} p(v_t|h_t)\,p(h_t|h_{t-1}),$$
with $p(v_t|h_t) = \mathcal{N}(Bh_t, \Sigma_V)$, $p(h_t|h_{t-1}) = \mathcal{N}(Ah_{t-1}, \Sigma_H)$, $p(h_1) = \mathcal{N}(\mu, \Sigma)$, and where $\Theta = \{A, B, \Sigma_H, \Sigma_V, \mu, \Sigma\}$ denotes the model parameters. Because of the widespread use of these models, a Bayesian treatment of parameters is of considerable interest [4, 5, 6, 7, 8].
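The generative process just defined can be sketched in a few lines of NumPy. This is our own minimal illustration, not the authors' code; the function name is hypothetical.

```python
import numpy as np

def sample_lgssm(A, B, Sigma_H, Sigma_V, mu, Sigma, T, rng=None):
    """Draw (h_{1:T}, v_{1:T}) from the LGSSM generative model:
    h_1 ~ N(mu, Sigma), h_t = A h_{t-1} + eta^h_t, v_t = B h_t + eta^v_t."""
    rng = np.random.default_rng(rng)
    H, V = A.shape[0], B.shape[0]
    h = np.zeros((T, H))
    v = np.zeros((T, V))
    for t in range(T):
        if t == 0:
            h[0] = rng.multivariate_normal(mu, Sigma)
        else:
            h[t] = rng.multivariate_normal(A @ h[t - 1], Sigma_H)
        v[t] = rng.multivariate_normal(B @ h[t], Sigma_V)
    return h, v
```

For example, with $H = 2$ and $V = 3$ this returns a $T \times 2$ state trajectory and a $T \times 3$ observation sequence.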
An exact implementation of the Bayesian LGSSM is formally intractable [8], and recently a Variational Bayesian (VB) approximation has been studied [4, 5, 6, 7, 9]. The most challenging part of implementing the VB method is performing inference over $h_{1:T}$, and previous authors have developed their own specialized routines, based on Belief Propagation, since standard LGSSM inference routines appear, at first sight, not to be applicable.

¹ Also called Kalman Filters/Smoothers, Linear Dynamical Systems.
² $v_{1:T}$ denotes $v_1, \ldots, v_T$.

A key contribution of this paper is to show how the Variational Bayesian treatment of the LGSSM can be implemented using standard LGSSM inference routines. Based on the insight we provide, any standard inference method may be applied, including those specifically addressed to improve numerical stability [2, 10, 11]. In this article, we describe the predictor-corrector and Rauch-Tung-Striebel recursions [2], and also suggest a small modification that reduces computational cost.

The Bayesian LGSSM is of particular interest when strong prior constraints are needed to find adequate solutions. One such case is EEG signal analysis, where we wish to extract sources that evolve independently through time. Since EEG is particularly noisy [12], a prior that encourages sources to have preferential dynamics is advantageous. This application is discussed in Section 4, and demonstrates the ease of applying our VB framework.

2 Bayesian Linear Gaussian State-Space Models

In the Bayesian treatment of the LGSSM, instead of considering the model parameters $\Theta$ as fixed, we define a prior distribution $p(\Theta|\hat\Theta)$, where $\hat\Theta$ is a set of hyperparameters. Then:
$$p(v_{1:T}|\hat\Theta) = \int_\Theta p(v_{1:T}|\Theta)\, p(\Theta|\hat\Theta). \quad (1)$$
In a full Bayesian treatment we would define additional prior distributions over the hyperparameters $\hat\Theta$.
Here we take instead the ML-II ('evidence') framework, in which the optimal set of hyperparameters is found by maximizing $p(v_{1:T}|\hat\Theta)$ with respect to $\hat\Theta$ [6, 7, 9]. For the parameter priors, here we define Gaussians on the columns of $A$ and $B$³:
$$p(A|\alpha, \Sigma_H) \propto \prod_{j=1}^{H} e^{-\frac{\alpha_j}{2}(A_j - \hat A_j)^T \Sigma_H^{-1}(A_j - \hat A_j)}, \qquad p(B|\beta, \Sigma_V) \propto \prod_{j=1}^{H} e^{-\frac{\beta_j}{2}(B_j - \hat B_j)^T \Sigma_V^{-1}(B_j - \hat B_j)},$$
which has the effect of biasing the transition and emission matrices towards desired forms $\hat A$ and $\hat B$. The conjugate priors for general inverse covariances $\Sigma_H^{-1}$ and $\Sigma_V^{-1}$ are Wishart distributions [7]⁴. In the simpler case assumed here of diagonal covariances these become Gamma distributions [5, 7]. The hyperparameters are then $\hat\Theta = \{\alpha, \beta\}$⁵.

Variational Bayes

Optimizing Eq. (1) with respect to $\hat\Theta$ is difficult due to the intractability of the integrals. Instead, in VB, one considers the lower bound [6, 7, 9]⁶:
$$\mathcal{L} = \log p(v_{1:T}|\hat\Theta) \geq H_q(\Theta, h_{1:T}) + \langle \log p(\Theta|\hat\Theta) \rangle_{q(\Theta)} + \langle E(h_{1:T}, \Theta) \rangle_{q(\Theta, h_{1:T})} \equiv \mathcal{F},$$
where $E(h_{1:T}, \Theta) \equiv \log p(v_{1:T}, h_{1:T}|\Theta)$, $H_d(x)$ signifies the entropy of the distribution $d(x)$, and $\langle \cdot \rangle_{d(x)}$ denotes the expectation operator. The key approximation in VB is $q(\Theta, h_{1:T}) \equiv q(\Theta)\,q(h_{1:T})$, from which one may show that, for optimality of $\mathcal{F}$,
$$q(h_{1:T}) \propto e^{\langle E(h_{1:T}, \Theta) \rangle_{q(\Theta)}}, \qquad q(\Theta) \propto p(\Theta|\hat\Theta)\, e^{\langle E(h_{1:T}, \Theta) \rangle_{q(h_{1:T})}}.$$
These coupled equations need to be iterated to convergence. The updates for the parameters $q(\Theta)$ are straightforward and are given in Appendices A and B. Once converged, the hyperparameters are updated by maximizing $\mathcal{F}$ with respect to $\hat\Theta$, which leads to simple update formulae [7]. Our main concern is with the update for $q(h_{1:T})$, for which this paper makes a departure from treatments previously presented.

³ More general Gaussian priors may be more suitable depending on the application.
⁴ For expositional simplicity, we do not put priors on $\mu$ and $\Sigma$.
⁵ For simplicity, we keep the parameters of the Gamma priors fixed.
⁶ Strictly we should write $q(\cdot|v_{1:T})$ throughout.
We omit the dependence on $v_{1:T}$ for notational convenience.

3 Unified Inference on $q(h_{1:T})$

Optimally, $q(h_{1:T})$ is Gaussian since, up to a constant, $\langle E(h_{1:T}, \Theta) \rangle_{q(\Theta)}$ is quadratic in $h_{1:T}$⁷:
$$-\frac{1}{2} \sum_{t=1}^{T} \left[ \langle (v_t - Bh_t)^T \Sigma_V^{-1} (v_t - Bh_t) \rangle_{q(B, \Sigma_V)} + \langle (h_t - Ah_{t-1})^T \Sigma_H^{-1} (h_t - Ah_{t-1}) \rangle_{q(A, \Sigma_H)} \right]. \quad (2)$$
In addition, optimally, $q(A|\Sigma_H)$ and $q(B|\Sigma_V)$ are Gaussians (see Appendix A), so we can easily carry out the averages in Eq. (2). The further averages over $q(\Sigma_H)$ and $q(\Sigma_V)$ are also easy due to conjugacy. Whilst this defines the distribution $q(h_{1:T})$, quantities such as $q(h_t)$, required for example for the parameter updates (see the Appendices), need to be inferred from this distribution.

Clearly, in the non-Bayesian case, the averages over the parameters are not present, and the above simply represents the posterior distribution of an LGSSM whose visible variables have been clamped into their evidential states. In that case, inference can be performed using any standard LGSSM routine. Our aim, therefore, is to try to represent the averaged Eq. (2) directly as the posterior distribution $\tilde q(h_{1:T}|\tilde v_{1:T})$ of an LGSSM, for some suitable parameter settings.

Mean + Fluctuation Decomposition

A useful decomposition is to write
$$\langle (v_t - Bh_t)^T \Sigma_V^{-1} (v_t - Bh_t) \rangle_{q(B, \Sigma_V)} = \underbrace{(v_t - \langle B \rangle h_t)^T \langle \Sigma_V^{-1} \rangle (v_t - \langle B \rangle h_t)}_{\text{mean}} + \underbrace{h_t^T S_B h_t}_{\text{fluctuation}},$$
and similarly
$$\langle (h_t - Ah_{t-1})^T \Sigma_H^{-1} (h_t - Ah_{t-1}) \rangle_{q(A, \Sigma_H)} = \underbrace{(h_t - \langle A \rangle h_{t-1})^T \langle \Sigma_H^{-1} \rangle (h_t - \langle A \rangle h_{t-1})}_{\text{mean}} + \underbrace{h_{t-1}^T S_A h_{t-1}}_{\text{fluctuation}},$$
where the parameter covariances are $S_B \equiv \langle B^T \Sigma_V^{-1} B \rangle - \langle B \rangle^T \langle \Sigma_V^{-1} \rangle \langle B \rangle = V H_B^{-1}$ and $S_A \equiv \langle A^T \Sigma_H^{-1} A \rangle - \langle A \rangle^T \langle \Sigma_H^{-1} \rangle \langle A \rangle = H H_A^{-1}$ (for $H_A$ and $H_B$ defined in Appendix A). The mean terms simply represent a clamped LGSSM with averaged parameters. However, the extra contributions from the fluctuations mean that Eq. (2) cannot be written as a clamped LGSSM with averaged parameters. In order to deal with these extra terms, our idea is to treat the fluctuations as arising from an augmented visible variable, for which Eq.
(2) can then be considered as a clamped LGSSM.

Inference Using an Augmented LGSSM

To represent Eq. (2) as an LGSSM $\tilde q(h_{1:T}|\tilde v_{1:T})$, we may augment $v_t$ and $B$ as⁸:
$$\tilde v_t = \mathrm{vert}(v_t, 0_H, 0_H), \qquad \tilde B = \mathrm{vert}(\langle B \rangle, U_A, U_B),$$
where $U_A$ is the Cholesky decomposition of $S_A$, so that $U_A^T U_A = S_A$. Similarly, $U_B$ is the Cholesky decomposition of $S_B$. The equivalent LGSSM $\tilde q(h_{1:T}|\tilde v_{1:T})$ is then completed by specifying⁹
$$\tilde A \equiv \langle A \rangle, \quad \tilde\Sigma_H \equiv \langle \Sigma_H^{-1} \rangle^{-1}, \quad \tilde\Sigma_V \equiv \mathrm{diag}(\langle \Sigma_V^{-1} \rangle^{-1}, I_H, I_H), \quad \tilde\mu \equiv \mu, \quad \tilde\Sigma \equiv \Sigma.$$
The validity of this parameter assignment can be checked by showing that, up to negligible constants, the exponent of this augmented LGSSM has the same form as Eq. (2)¹⁰. Now that this has been written as an LGSSM $\tilde q(h_{1:T}|\tilde v_{1:T})$, standard inference routines in the literature may be applied to compute $q(h_t|v_{1:T}) = \tilde q(h_t|\tilde v_{1:T})$ [1, 2, 11]¹¹.

⁷ For simplicity of exposition, we ignore the first time-point here.
⁸ The notation $\mathrm{vert}(x_1, \ldots, x_n)$ stands for vertically concatenating the arguments $x_1, \ldots, x_n$.
⁹ Strictly, we need a time-dependent emission $\tilde B_t = \tilde B$, for $t = 1, \ldots, T-1$. For time $T$, $\tilde B_T$ has the Cholesky factor $U_A$ replaced by $0_{H,H}$.
¹⁰ There are several ways of achieving a similar augmentation. We chose this since, in the non-Bayesian limit $U_A = U_B = 0_{H,H}$, no numerical instabilities would be introduced.
¹¹ Note that, since the augmented LGSSM $\tilde q(h_{1:T}|\tilde v_{1:T})$ is designed to match the fully clamped distribution $q(h_{1:T}|v_{1:T})$, the filtered posterior $\tilde q(h_t|\tilde v_{1:t})$ does not correspond to $q(h_t|v_{1:t})$.

Algorithm 1 LGSSM: Forward and backward recursive updates. The smoothed posterior $p(h_t|v_{1:T})$ is returned in the mean $\hat h_t^T$ and covariance $P_t^T$.
procedure FORWARD
  1a: $P \leftarrow \Sigma$
  1b: $P \leftarrow D\Sigma$, where $D \equiv I - \Sigma U_{AB}(I + U_{AB}^T \Sigma U_{AB})^{-1} U_{AB}^T$
  2a: $\hat h_1^0 \leftarrow \mu$
  2b: $\hat h_1^0 \leftarrow D\mu$
  3: $K \leftarrow P B^T (B P B^T + \Sigma_V)^{-1}$, $\quad P_1^1 \leftarrow (I - KB)P$, $\quad \hat h_1^1 \leftarrow \hat h_1^0 + K(v_1 - B\hat h_1^0)$
  for $t \leftarrow 2, T$ do
    4: $P_t^{t-1} \leftarrow A P_{t-1}^{t-1} A^T + \Sigma_H$
    5a: $P \leftarrow P_t^{t-1}$
    5b: $P \leftarrow D_t P_t^{t-1}$, where $D_t \equiv I - P_t^{t-1} U_{AB}(I + U_{AB}^T P_t^{t-1} U_{AB})^{-1} U_{AB}^T$
    6a: $\hat h_t^{t-1} \leftarrow A \hat h_{t-1}^{t-1}$
    6b: $\hat h_t^{t-1} \leftarrow D_t A \hat h_{t-1}^{t-1}$
    7: $K \leftarrow P B^T (B P B^T + \Sigma_V)^{-1}$, $\quad P_t^t \leftarrow (I - KB)P$, $\quad \hat h_t^t \leftarrow \hat h_t^{t-1} + K(v_t - B\hat h_t^{t-1})$
  end for
end procedure

procedure BACKWARD
  for $t \leftarrow T-1, 1$ do
    $\overleftarrow{A}_t \leftarrow P_t^t A^T (P_{t+1}^t)^{-1}$
    $P_t^T \leftarrow P_t^t + \overleftarrow{A}_t (P_{t+1}^T - P_{t+1}^t) \overleftarrow{A}_t^T$
    $\hat h_t^T \leftarrow \hat h_t^t + \overleftarrow{A}_t (\hat h_{t+1}^T - A \hat h_t^t)$
  end for
end procedure

For completeness, we describe the standard predictor-corrector form of the Kalman Filter, together with the Rauch-Tung-Striebel Smoother [2]. These are given in Algorithm 1, where $\tilde q(h_t|\tilde v_{1:T})$ is computed by calling the FORWARD and BACKWARD procedures. We present two variants of the FORWARD pass. Either we may call procedure FORWARD in Algorithm 1 with parameters $\tilde A, \tilde B, \tilde\Sigma_H, \tilde\Sigma_V, \tilde\mu, \tilde\Sigma$ and the augmented visible variables $\tilde v_t$, in which case we use steps 1a, 2a, 5a and 6a. This is exactly the predictor-corrector form of a Kalman Filter [2]. Otherwise, in order to reduce the computational cost, we may call procedure FORWARD with the parameters $\langle A \rangle, \langle B \rangle, \tilde\Sigma_H, \langle \Sigma_V^{-1} \rangle^{-1}, \tilde\mu, \tilde\Sigma$ and the original visible variable $v_t$, in which case we use steps 1b (where $U_{AB}^T U_{AB} \equiv S_A + S_B$), 2b, 5b and 6b. The two algorithms are mathematically equivalent. Computing $q(h_t|v_{1:T}) = \tilde q(h_t|\tilde v_{1:T})$ is then completed by calling the common BACKWARD pass. The important point here is that the reader may supply any standard Kalman Filtering/Smoothing routine, and simply call it with the appropriate parameters.
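The two passes of Algorithm 1 (using the 'a' steps) can be sketched in NumPy as below. This is our own un-optimized sketch, not the authors' implementation; for the Bayesian LGSSM one would call it with the augmented parameters $\tilde A, \tilde B, \tilde\Sigma_H, \tilde\Sigma_V$ and observations $\tilde v_t$ described above.

```python
import numpy as np

def kalman_smoother(v, A, B, Sigma_H, Sigma_V, mu, Sigma):
    """Predictor-corrector Kalman filter followed by a Rauch-Tung-Striebel
    smoother. v has shape (T, V). Returns smoothed means g (T, H) and
    covariances G (T, H, H)."""
    T = v.shape[0]
    H = A.shape[0]
    f = np.zeros((T, H))           # filtered means   hat h_t^t
    F = np.zeros((T, H, H))        # filtered covs    P_t^t
    P_pred = np.zeros((T, H, H))   # predicted covs   P_t^{t-1}
    for t in range(T):
        if t == 0:
            h_pred, P = mu, Sigma              # steps 1a, 2a
        else:
            h_pred = A @ f[t - 1]              # step 6a
            P = A @ F[t - 1] @ A.T + Sigma_H   # steps 4, 5a
        P_pred[t] = P
        # correction (steps 3 and 7)
        K = P @ B.T @ np.linalg.inv(B @ P @ B.T + Sigma_V)
        f[t] = h_pred + K @ (v[t] - B @ h_pred)
        F[t] = (np.eye(H) - K @ B) @ P
    # RTS backward pass
    g, G = f.copy(), F.copy()
    for t in range(T - 2, -1, -1):
        J = F[t] @ A.T @ np.linalg.inv(P_pred[t + 1])   # reversal gain
        g[t] = f[t] + J @ (g[t + 1] - A @ f[t])
        G[t] = F[t] + J @ (G[t + 1] - P_pred[t + 1]) @ J.T
    return g, G
```

Any numerically stabilized replacement (e.g. a square-root form) with the same interface could be substituted without changing the surrounding VB machinery, which is the point of the unification.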
In some parameter regimes, or in very long time-series, numerical stability may be a serious concern, for which several stabilized algorithms have been developed over the years, for example the square-root forms [2, 10, 11]. By converting the problem to a standard form, we have therefore unified and simplified inference, so that future applications may be more readily developed¹².

3.1 Relation to Previous Approaches

An alternative approach to the one above, taken in [5, 7], is to write the posterior as
$$\log q(h_{1:T}) = \sum_{t=2}^{T} \phi_t(h_{t-1}, h_t) + \text{const.}$$
for suitably defined quadratic forms $\phi_t(h_{t-1}, h_t)$. Here the potentials $\phi_t(h_{t-1}, h_t)$ encode the averaging over the parameters $A, B, \Sigma_H, \Sigma_V$. The approach taken in [7] is to recognize this as a pairwise Markov chain, for which the Belief Propagation recursions may be applied. The approach in [5] is based on a Kullback-Leibler minimization of the posterior with a chain structure, which is algorithmically equivalent to Belief Propagation. Whilst these are mathematically valid procedures, the resulting algorithms do not correspond to any of the standard forms in the Kalman Filtering/Smoothing literature, whose properties have been well studied [14].

¹² The computation of the log-likelihood bound does not require any augmentation.

4 An Application to Bayesian ICA

A particular case for which the Bayesian LGSSM is of interest is in extracting independent source signals underlying a multivariate time-series [5, 15]. This will demonstrate how the approach developed in Section 3 makes VB easy to apply. The sources $s^i$ are modeled as independent in the following sense:
$$p(s^i_{1:T}, s^j_{1:T}) = p(s^i_{1:T})\, p(s^j_{1:T}), \quad \text{for } i \neq j, \quad i, j = 1, \ldots, C.$$
Independence implies block-diagonal transition and state noise matrices $A$, $\Sigma_H$ and $\Sigma$, where each block $c$ has dimension $H_c$.
A one-dimensional source $s_t^c$ for each independent dynamical subsystem is then formed from $s_t^c = 1_c^T h_t^c$, where $1_c$ is a unit vector and $h_t^c$ is the state of dynamical system $c$. Combining the sources, we can write $s_t = P h_t$, where $P = \mathrm{diag}(1_1^T, \ldots, 1_C^T)$ and $h_t = \mathrm{vert}(h_t^1, \ldots, h_t^C)$. The resulting emission matrix is constrained to be of the form $B = WP$, where $W$ is the $V \times C$ mixing matrix. This means that the observations are formed by linearly mixing the sources, $v_t = W s_t + \eta_t^v$. The graphical structure of this model is presented in Fig. 1. [Figure 1: The structure of the LGSSM for ICA.] To encourage redundant components to be removed, we place a zero-mean Gaussian prior on $W$. In this case, we do not define a prior for the parameters $\Sigma_H$ and $\Sigma_V$, which are instead considered as hyperparameters. More details of the model are given in [15]. The constraint $B = WP$ requires a minor modification from Section 3, as we discuss below.

Inference on $q(h_{1:T})$

A small modification of the mean + fluctuation decomposition for $B$ occurs, namely:
$$\langle (v_t - Bh_t)^T \Sigma_V^{-1} (v_t - Bh_t) \rangle_{q(W)} = (v_t - \langle B \rangle h_t)^T \Sigma_V^{-1} (v_t - \langle B \rangle h_t) + h_t^T P^T S_W P h_t,$$
where $\langle B \rangle \equiv \langle W \rangle P$ and $S_W = V H_W^{-1}$. The quantities $\langle W \rangle$ and $H_W$ are obtained as in Appendix A.1 with the replacement $h_t \leftarrow P h_t$. To represent the above as an LGSSM, we augment $v_t$ and $B$ as
$$\tilde v_t = \mathrm{vert}(v_t, 0_H, 0_C), \qquad \tilde B = \mathrm{vert}(\langle B \rangle, U_A, U_W P),$$
where $U_W$ is the Cholesky decomposition of $S_W$. The equivalent LGSSM is then completed by specifying $\tilde A \equiv \langle A \rangle$, $\tilde\Sigma_H \equiv \Sigma_H$, $\tilde\Sigma_V \equiv \mathrm{diag}(\Sigma_V, I_H, I_C)$, $\tilde\mu \equiv \mu$, $\tilde\Sigma \equiv \Sigma$, and inference for $q(h_{1:T})$ is performed using Algorithm 1. This demonstrates the elegance and unity of the approach in Section 3, since no new algorithm needs to be developed to perform inference, even in this special constrained parameter case.

4.1 Demonstration

As a simple demonstration, we used an LGSSM to generate 3 sources $s_t^c$ with random $5 \times 5$ transition matrices $A^c$, $\mu = 0_H$ and $\Sigma \equiv \Sigma_H \equiv I_H$.
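The source-extraction construction above ($s_t = P h_t$, $B = WP$) amounts to building a sparse selector matrix $P$ from the subsystem dimensions. A minimal sketch of this construction (our own illustration; the function name is hypothetical):

```python
import numpy as np

def ica_emission(W, dims):
    """Build P = diag(1^T, ..., 1^T) and the constrained emission B = W P
    for C independent subsystems with state dimensions dims = [H_1,...,H_C].
    W is the V x C mixing matrix; each source is the sum of its block's
    state components, s_t^c = 1^T h_t^c."""
    C, H = len(dims), sum(dims)
    P = np.zeros((C, H))
    start = 0
    for c, Hc in enumerate(dims):
        P[c, start:start + Hc] = 1.0
        start += Hc
    return P, W @ P
```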
The sources were mixed into three observations $v_t = W s_t + \eta_t^v$, for $W$ chosen with elements from a zero-mean unit-variance Gaussian distribution, and $\Sigma_V = I_V$. We then trained a Bayesian LGSSM with 5 sources and $7 \times 7$ transition matrices $A^c$. To bias the model to find the simplest sources, we used $\hat A^c \equiv 0_{H_c, H_c}$ for all sources. In Fig. 2a and Fig. 2b we see the original sources and the noisy observations respectively. In Fig. 2c we see the estimated sources from our method after convergence of the hyperparameter updates. Two of the 5 sources have been removed, and the remaining three are a reasonable estimation of the original sources.

[Figure 2: (a) Original sources $s_t$. (b) Observations resulting from mixing the original sources, $v_t = W s_t + \eta_t^v$, $\eta_t^v \sim \mathcal{N}(0, I)$. (c) Recovered sources using the Bayesian LGSSM. (d) Sources found with MAP LGSSM.]

[Figure 3: (a) Original raw EEG recordings from 4 channels. (b-e) 16 sources $s_t$ estimated by the Bayesian LGSSM.]

Another possible approach for introducing prior knowledge is to use a Maximum a Posteriori (MAP) procedure, by adding a prior term to the original log-likelihood:
$$\log p(v_{1:T}|A, W, \Sigma_H, \Sigma_V, \mu, \Sigma) + \log p(A|\alpha) + \log p(W|\beta).$$
However, it is not clear how to reliably find the hyperparameters $\alpha$ and $\beta$ in this case. One solution is to estimate them by optimizing the new objective function jointly with respect to the parameters and hyperparameters (this is the so-called joint MAP estimation; see for example [16]). A typical result of using this joint MAP approach on the artificial data is presented in Fig. 2d. The joint MAP does not estimate the hyperparameters well, and the incorrect number of sources is identified.
4.2 Application to EEG Analysis

In Fig. 3a we plot three seconds of EEG data recorded from 4 channels (located in the right hemisphere) while a person is performing imagined movement of the right hand. As is typical in EEG, each channel shows drift terms below 1 Hz, which correspond to artifacts of the instrumentation, together with 50 Hz mains contamination; this masks the rhythmical activity related to the mental task, mainly centered at 10 and 20 Hz [17]. We would therefore like a method which enables us to extract components in these information-rich 10 and 20 Hz frequency bands. Standard ICA methods such as FastICA do not find satisfactory sources based on raw 'noisy' data, and preprocessing with band-pass filters is usually required. Additionally, in EEG research, flexibility in the number of recovered sources is important, since there may be many independent oscillators of interest underlying the observations, and we would like some way to automatically determine their effective number. To preferentially find sources at particular frequencies, we specified a block-diagonal matrix $\hat A^c$ for each source $c$, where each block is a $2 \times 2$ rotation matrix at the desired frequency. We defined the following 16 groups of frequencies: [0.5], [0.5], [0.5], [0.5]; [10,11], [10,11], [10,11], [10,11]; [20,21], [20,21], [20,21], [20,21]; [50], [50], [50], [50]. The temporal evolution of the sources obtained after training the Bayesian LGSSM is given in Fig. 3(b-e) (grouped by frequency range). The Bayesian LGSSM removed 4 unnecessary sources from the mixing matrix $W$: one [10,11] Hz and three [20,21] Hz sources. The first 4 sources contain dominant low-frequency drift; sources 5, 6 and 8 contain [10,11] Hz activity, while source 10 contains [20,21] Hz centered activity. Of the 4 sources initialized to 50 Hz, only 2 retained 50 Hz activity, while the $A^c$ of the other two changed to model other frequencies present in the EEG.
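A block-diagonal $\hat A^c$ of the kind described above can be built from $2 \times 2$ rotation matrices. The sketch below is ours, not the authors' code; in particular the sampling rate `fs` is an assumption (the paper does not state one), as is the optional damping `radius`.

```python
import numpy as np

def rotation_block(freq_hz, fs, radius=1.0):
    """2x2 rotation matrix advancing an oscillator's phase by one sample
    at freq_hz, given sampling rate fs; one block of the prior A_hat."""
    w = 2.0 * np.pi * freq_hz / fs
    return radius * np.array([[np.cos(w), -np.sin(w)],
                              [np.sin(w),  np.cos(w)]])

def block_diag_prior(freqs_hz, fs):
    """Block-diagonal A_hat for one source: one 2x2 rotation per frequency
    in freqs_hz (e.g. [10, 11] for a 10-11 Hz group)."""
    n = len(freqs_hz)
    A = np.zeros((2 * n, 2 * n))
    for i, f in enumerate(freqs_hz):
        A[2*i:2*i+2, 2*i:2*i+2] = rotation_block(f, fs)
    return A
```

For instance, `block_diag_prior([10, 11], fs)` yields a $4 \times 4$ prior transition matrix biasing one source toward 10-11 Hz activity.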
This method demonstrates the usefulness and applicability of the VB method in a real-world situation.

5 Conclusion

We considered the application of Variational Bayesian learning to Linear Gaussian State-Space Models. This is an important class of models with widespread application, and finding a simple way to implement this approximate Bayesian procedure is of considerable interest. The most demanding part of the procedure is inference of the hidden states of the model. Previously, this has been achieved using Belief Propagation, which differs from inference in the Kalman Filtering/Smoothing literature, for which highly efficient and stabilized procedures exist. A central contribution of this paper is to show how inference can be written using the standard Kalman Filtering/Smoothing recursions by augmenting the original model. Additionally, a minor modification to the standard Kalman Filtering routine may be applied for computational efficiency. We demonstrated the elegance and unity of our approach by showing how to easily apply a Variational Bayes analysis of temporal ICA. Specifically, our Bayes ICA approach successfully extracts independent processes underlying EEG signals, biased towards preferred frequency ranges. We hope that this simple and unifying interpretation of Variational Bayesian LGSSMs may therefore facilitate the further application to related models.

A Parameter Updates for A and B

A.1 Determining $q(B|\Sigma_V)$

By examining $\mathcal{F}$, the contribution of $q(B|\Sigma_V)$ can be interpreted as the negative KL divergence between $q(B|\Sigma_V)$ and a Gaussian. Hence, optimally, $q(B|\Sigma_V)$ is a Gaussian. The covariance $[\Sigma_B]_{ij,kl} \equiv \langle (B_{ij} - \langle B_{ij} \rangle)(B_{kl} - \langle B_{kl} \rangle) \rangle$ (averages with respect to $q(B|\Sigma_V)$) is given by:
$$[\Sigma_B]_{ij,kl} = [H_B^{-1}]_{jl}\, [\Sigma_V]_{ik}, \quad \text{where } [H_B]_{jl} \equiv \sum_{t=1}^{T} \langle h_t^j h_t^l \rangle_{q(h_t)} + \beta_j \delta_{jl}.$$
The mean is given by $\langle B \rangle = N_B H_B^{-1}$, where $[N_B]_{ij} \equiv \sum_{t=1}^{T} \langle h_t^j \rangle_{q(h_t)} v_t^i + \beta_j \hat B_{ij}$.
A.2 Determining $q(A|\Sigma_H)$

Optimally, $q(A|\Sigma_H)$ is a Gaussian with covariance
$$[\Sigma_A]_{ij,kl} = [H_A^{-1}]_{jl}\, [\Sigma_H]_{ik}, \quad \text{where } [H_A]_{jl} \equiv \sum_{t=1}^{T-1} \langle h_t^j h_t^l \rangle_{q(h_t)} + \alpha_j \delta_{jl}.$$
The mean is given by $\langle A \rangle = N_A H_A^{-1}$, where $[N_A]_{ij} \equiv \sum_{t=2}^{T} \langle h_{t-1}^j h_t^i \rangle_{q(h_{t-1:t})} + \alpha_j \hat A_{ij}$.

B Covariance Updates

By specifying a Wishart prior for the inverse of the covariances, conjugate update formulae are possible. In practice, it is more common to specify diagonal inverse covariances, for which the corresponding priors are simply Gamma distributions [7, 5]. For this simple diagonal case, the explicit updates are given below.

Determining $q(\Sigma_V)$

For the constraint $\Sigma_V^{-1} = \mathrm{diag}(\rho)$, where each diagonal element follows a Gamma prior $\mathrm{Ga}(b_1, b_2)$ [7], $q(\rho)$ factorizes and the optimal updates are
$$q(\rho_i) = \mathrm{Ga}\left(b_1 + \frac{T}{2},\; b_2 + \frac{1}{2}\Big[\sum_{t=1}^{T} (v_t^i)^2 - [G_B]_{ii} + \sum_j \beta_j \hat B_{ij}^2\Big]\right), \quad \text{where } G_B \equiv N_B H_B^{-1} N_B^T.$$

Determining $q(\Sigma_H)$

Analogously, for $\Sigma_H^{-1} = \mathrm{diag}(\tau)$ with prior $\mathrm{Ga}(a_1, a_2)$ [5], the updates are
$$q(\tau_i) = \mathrm{Ga}\left(a_1 + \frac{T-1}{2},\; a_2 + \frac{1}{2}\Big[\sum_{t=2}^{T} \langle (h_t^i)^2 \rangle - [G_A]_{ii} + \sum_j \alpha_j \hat A_{ij}^2\Big]\right), \quad \text{where } G_A \equiv N_A H_A^{-1} N_A^T.$$

Acknowledgments

This work is supported by the European DIRAC Project FP6-0027787. This paper only reflects the authors' views and funding agencies are not liable for any use that may be made of the information contained herein.

References

[1] Y. Bar-Shalom and X.-R. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, 1998.
[2] M. S. Grewal and A. P. Andrews. Kalman Filtering: Theory and Practice Using MATLAB. John Wiley and Sons, Inc., 2001.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications. Springer, 2000.
[4] M. J. Beal, F. Falciani, Z. Ghahramani, C. Rangel, and D. L. Wild. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics, 21:349–356, 2005.
[5] A. T. Cemgil and S. J. Godsill. Probabilistic phase vocoder and its application to interpolation of missing values in audio signals.
In 13th European Signal Processing Conference, 2005.
[6] H. Valpola and J. Karhunen. An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Computation, 14:2647–2692, 2002.
[7] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[8] M. Davy and S. J. Godsill. Bayesian harmonic models for musical signal analysis (with discussion). In J. O. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics VII. Oxford University Press, 2003.
[9] D. J. C. MacKay. Ensemble learning and evidence maximisation. Unpublished manuscript: www.variational-bayes.org, 1995.
[10] M. Morf and T. Kailath. Square-root algorithms for least-squares estimation. IEEE Transactions on Automatic Control, 20:487–497, 1975.
[11] P. Park and T. Kailath. New square-root smoothing algorithms. IEEE Transactions on Automatic Control, 41:727–732, 1996.
[12] E. Niedermeyer and F. Lopes Da Silva. Electroencephalography: Basic Principles, Clinical Applications and Related Fields. Lippincott Williams and Wilkins, 1999.
[13] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11:305–345, 1999.
[14] M. Verhaegen and P. Van Dooren. Numerical aspects of different Kalman filter implementations. IEEE Transactions on Automatic Control, 31:907–917, 1986.
[15] S. Chiappa and D. Barber. Bayesian linear Gaussian state-space models for biosignal decomposition. Signal Processing Letters, 14, 2007.
[16] S. S. Saquib, C. A. Bouman, and K. Sauer. ML parameter estimation for Markov random fields with applications to Bayesian tomography. IEEE Transactions on Image Processing, 7:1029–1044, 1998.
[17] G. Pfurtscheller and F. H. Lopes da Silva. Event-related EEG/MEG synchronization and desynchronization: basic principles. Clinical Neurophysiology, pages 1842–1857, 1999.
2 0.82182628 10 nips-2006-A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems
Author: David Barber, Bertrand Mesot
Abstract: We introduce a method for approximate smoothed inference in a class of switching linear dynamical systems, based on a novel form of Gaussian Sum smoother. This class includes the switching Kalman Filter and the more general case of switch transitions dependent on the continuous latent state. The method improves on the standard Kim smoothing approach by dispensing with one of the key approximations, thus making fuller use of the available future information. Whilst the only central assumption required is projection to a mixture of Gaussians, we show that an additional conditional independence assumption results in a simpler but stable and accurate alternative. Unlike the alternative unstable Expectation Propagation procedure, our method consists only of a single forward and backward pass and is reminiscent of the standard smoothing 'correction' recursions in the simpler linear dynamical system. The algorithm performs well on both toy experiments and in a large scale application to noise robust speech recognition.

1 Switching Linear Dynamical System

The Linear Dynamical System (LDS) [1] is a key temporal model in which a latent linear process generates the observed series. For complex time-series which are not well described globally by a single LDS, we may break the time-series into segments, each modeled by a potentially different LDS. This is the basis for the Switching LDS (SLDS) [2, 3, 4, 5] where, for each time $t$, a switch variable $s_t \in 1, \ldots, S$ describes which of the LDSs is to be used. The observation (or 'visible') $v_t \in \mathbb{R}^V$ is linearly related to the hidden state $h_t \in \mathbb{R}^H$ with additive noise $\eta$ by
$$v_t = B(s_t)h_t + \eta^v(s_t), \quad \text{equivalently} \quad p(v_t|h_t, s_t) = \mathcal{N}(B(s_t)h_t, \Sigma^v(s_t)), \quad (1)$$
where $\mathcal{N}(\mu, \Sigma)$ denotes a Gaussian distribution with mean $\mu$ and covariance $\Sigma$.
The transition dynamics of the continuous hidden state $h_t$ is linear,
$$h_t = A(s_t)h_{t-1} + \eta^h(s_t), \quad \text{equivalently} \quad p(h_t|h_{t-1}, s_t) = \mathcal{N}(A(s_t)h_{t-1}, \Sigma^h(s_t)). \quad (2)$$
The switch $s_t$ may depend on both the previous $s_{t-1}$ and $h_{t-1}$. This is an augmented SLDS (aSLDS), and defines the model
$$p(v_{1:T}, h_{1:T}, s_{1:T}) = \prod_{t=1}^{T} p(v_t|h_t, s_t)\, p(h_t|h_{t-1}, s_t)\, p(s_t|h_{t-1}, s_{t-1}).$$
The standard SLDS [4] considers only switch transitions $p(s_t|s_{t-1})$. At time $t = 1$, $p(s_1|h_0, s_0)$ simply denotes the prior $p(s_1)$, and $p(h_1|h_0, s_1)$ denotes $p(h_1|s_1)$. The aim of this article is to address how to perform inference in the aSLDS. In particular we desire the filtered estimate $p(h_t, s_t|v_{1:t})$ and the smoothed estimate $p(h_t, s_t|v_{1:T})$, for any $1 \leq t \leq T$. Both filtered and smoothed inference in the SLDS is intractable, scaling exponentially with time [4].

[Figure 1: The independence structure of the aSLDS. Square nodes denote discrete variables, round nodes continuous variables. In the SLDS, links from $h$ to $s$ are not normally considered.]

2 Expectation Correction

Our approach to approximating $p(h_t, s_t|v_{1:T})$ mirrors the Rauch-Tung-Striebel 'correction' smoother for the simpler LDS [1]. The method consists of a single forward pass to recursively find the filtered posterior $p(h_t, s_t|v_{1:t})$, followed by a single backward pass to correct this into a smoothed posterior $p(h_t, s_t|v_{1:T})$. The forward pass we use is equivalent to standard Assumed Density Filtering (ADF) [6]. The main contribution of this paper is a novel form of backward pass, based only on collapsing the smoothed posterior to a mixture of Gaussians. Together with the ADF forward pass, we call the method Expectation Correction, since it corrects the moments found from the forward pass. A more detailed description of the method, including pseudocode, is given in [7].

2.1 Forward Pass (Filtering)

Readers familiar with ADF may wish to continue directly to Section (2.2).
Our aim is to form a recursion for $p(s_t, h_t|v_{1:t})$, based on a Gaussian mixture approximation of $p(h_t|s_t, v_{1:t})$. Without loss of generality, we may decompose the filtered posterior as
$$p(h_t, s_t|v_{1:t}) = p(h_t|s_t, v_{1:t})\, p(s_t|v_{1:t}). \quad (3)$$
The exact representation of $p(h_t|s_t, v_{1:t})$ is a mixture with $O(S^t)$ components. We therefore approximate this with a smaller $I$-component mixture
$$p(h_t|s_t, v_{1:t}) \approx \sum_{i_t=1}^{I} p(h_t|i_t, s_t, v_{1:t})\, p(i_t|s_t, v_{1:t}),$$
where $p(h_t|i_t, s_t, v_{1:t})$ is a Gaussian parameterized with mean $f(i_t, s_t)$ and covariance $F(i_t, s_t)$. To find a recursion for these parameters, consider
$$p(h_{t+1}|s_{t+1}, v_{1:t+1}) = \sum_{s_t, i_t} p(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1})\, p(s_t, i_t|s_{t+1}, v_{1:t+1}). \quad (4)$$

Evaluating $p(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1})$

We find $p(h_{t+1}|s_t, i_t, s_{t+1}, v_{1:t+1})$ by first computing the joint distribution $p(h_{t+1}, v_{t+1}|s_t, i_t, s_{t+1}, v_{1:t})$, which is a Gaussian with covariance and mean elements
$$\Sigma_{hh} = A(s_{t+1})F(i_t, s_t)A^T(s_{t+1}) + \Sigma^h(s_{t+1}), \qquad \Sigma_{vv} = B(s_{t+1})\Sigma_{hh}B^T(s_{t+1}) + \Sigma^v(s_{t+1}),$$
$$\Sigma_{vh} = B(s_{t+1})\Sigma_{hh}, \qquad \mu_v = B(s_{t+1})A(s_{t+1})f(i_t, s_t), \qquad \mu_h = A(s_{t+1})f(i_t, s_t), \quad (5)$$
and then conditioning on $v_{t+1}$¹. For the case $S = 1$, this forms the usual Kalman Filter recursions [1].

Evaluating $p(s_t, i_t|s_{t+1}, v_{1:t+1})$

The mixture weight in (4) can be found from the decomposition
$$p(s_t, i_t|s_{t+1}, v_{1:t+1}) \propto p(v_{t+1}|i_t, s_t, s_{t+1}, v_{1:t})\, p(s_{t+1}|i_t, s_t, v_{1:t})\, p(i_t|s_t, v_{1:t})\, p(s_t|v_{1:t}). \quad (6)$$
The first factor in (6), $p(v_{t+1}|i_t, s_t, s_{t+1}, v_{1:t})$, is a Gaussian with mean $\mu_v$ and covariance $\Sigma_{vv}$, as given in (5). The last two factors $p(i_t|s_t, v_{1:t})$ and $p(s_t|v_{1:t})$ are given from the previous iteration. Finally, $p(s_{t+1}|i_t, s_t, v_{1:t})$ is found from
$$p(s_{t+1}|i_t, s_t, v_{1:t}) = \langle p(s_{t+1}|h_t, s_t) \rangle_{p(h_t|i_t, s_t, v_{1:t})}, \quad (7)$$
where $\langle \cdot \rangle_p$ denotes expectation with respect to $p$. In the SLDS, (7) is replaced by the Markov transition $p(s_{t+1}|s_t)$.

¹ $p(x|y)$ is a Gaussian with mean $\mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y)$ and covariance $\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$.
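The Gaussian conditioning step used here (footnote 1) is the workhorse of the whole recursion. A minimal sketch of it in NumPy (our own illustration; names are hypothetical):

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, y):
    """For jointly Gaussian (x, y), return the mean and covariance of
    p(x | y): mean mu_x + S_xy S_yy^{-1}(y - mu_y),
    covariance S_xx - S_xy S_yy^{-1} S_yx."""
    G = S_xy @ np.linalg.inv(S_yy)          # conditioning gain
    return mu_x + G @ (y - mu_y), S_xx - G @ S_xy.T
```

With $x = h_{t+1}$ and $y = v_{t+1}$, feeding in the moments of Eq. (5) yields the updated filter component.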
In the aSLDS, however, (7) will generally need to be computed numerically.

Closing the recursion

We are now in a position to calculate (4). For each setting of the variable $s_{t+1}$, we have a mixture of $I \times S$ Gaussians, which we numerically collapse back to $I$ Gaussians to form
$$p(h_{t+1}|s_{t+1}, v_{1:t+1}) \approx \sum_{i_{t+1}=1}^{I} p(h_{t+1}|i_{t+1}, s_{t+1}, v_{1:t+1})\, p(i_{t+1}|s_{t+1}, v_{1:t+1}).$$
Any method of choice may be supplied to collapse a mixture to a smaller mixture; our code simply repeatedly merges low-weight components. In this way the new mixture coefficients $p(i_{t+1}|s_{t+1}, v_{1:t+1})$, $i_{t+1} \in 1, \ldots, I$, are defined, completing the description of how to form a recursion for $p(h_{t+1}|s_{t+1}, v_{1:t+1})$ in (3). A recursion for the switch variable is given by
$$p(s_{t+1}|v_{1:t+1}) \propto \sum_{s_t, i_t} p(v_{t+1}|s_{t+1}, i_t, s_t, v_{1:t})\, p(s_{t+1}|i_t, s_t, v_{1:t})\, p(i_t|s_t, v_{1:t})\, p(s_t|v_{1:t}),$$
where all terms have been computed during the recursion for $p(h_{t+1}|s_{t+1}, v_{1:t+1})$. The likelihood $p(v_{1:T})$ may be found by recursing $p(v_{1:t+1}) = p(v_{t+1}|v_{1:t})\, p(v_{1:t})$, where
$$p(v_{t+1}|v_{1:t}) = \sum_{i_t, s_t, s_{t+1}} p(v_{t+1}|i_t, s_t, s_{t+1}, v_{1:t})\, p(s_{t+1}|i_t, s_t, v_{1:t})\, p(i_t|s_t, v_{1:t})\, p(s_t|v_{1:t}).$$

2.2 Backward Pass (Smoothing)

The main contribution of this paper is to find a suitable way to 'correct' the filtered posterior $p(s_t, h_t|v_{1:t})$ obtained from the forward pass into a smoothed posterior $p(s_t, h_t|v_{1:T})$. We derive this for the case of a single Gaussian representation. The extension to the mixture case is straightforward and presented in [7]. We approximate the smoothed posterior $p(h_t|s_t, v_{1:T})$ by a Gaussian with mean $g(s_t)$ and covariance $G(s_t)$, and our aim is to find a recursion for these parameters.
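The mixture-collapse step of the forward pass can be realized most simply by moment matching a mixture to a single Gaussian (the authors' code instead repeatedly merges low-weight components; the sketch below is our own simpler variant):

```python
import numpy as np

def collapse_to_gaussian(weights, means, covs):
    """Moment-match a Gaussian mixture {w_i, mu_i, S_i} to one Gaussian:
    mu = sum_i w_i mu_i,
    S  = sum_i w_i (S_i + (mu_i - mu)(mu_i - mu)^T)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                           # normalize mixture weights
    means = np.asarray(means, dtype=float)    # shape (n, d)
    covs = np.asarray(covs, dtype=float)      # shape (n, d, d)
    mu = np.einsum('i,id->d', w, means)
    d = means - mu                            # component-mean deviations
    S = np.einsum('i,idk->dk', w, covs) + np.einsum('i,id,ik->dk', w, d, d)
    return mu, S
```

Collapsing the $I \times S$ components per $s_{t+1}$ down to $I$ would then amount to grouping components and applying this moment match within each group.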
A useful starting point for a recursion is

p(h_t, s_t|v_{1:T}) = \sum_{s_{t+1}} p(s_{t+1}|v_{1:T}) p(h_t|s_t, s_{t+1}, v_{1:T}) p(s_t|s_{t+1}, v_{1:T}).

The term p(h_t|s_t, s_{t+1}, v_{1:T}) may be computed as

p(h_t|s_t, s_{t+1}, v_{1:T}) = ∫_{h_{t+1}} p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t}) p(h_{t+1}|s_t, s_{t+1}, v_{1:T}).   (8)

The recursion therefore requires p(h_{t+1}|s_t, s_{t+1}, v_{1:T}), which we can write as

p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) ∝ p(h_{t+1}|s_{t+1}, v_{1:T}) p(s_t|s_{t+1}, h_{t+1}, v_{1:t}).   (9)

The difficulty here is that the functional form of p(s_t|s_{t+1}, h_{t+1}, v_{1:t}) is not squared exponential in h_{t+1}, so that p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) will not be Gaussian^2. One possibility would be to approximate the non-Gaussian p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) by a Gaussian (or mixture thereof) by minimizing the Kullback-Leibler divergence between the two, or by performing moment matching in the case of a single Gaussian. A simpler alternative (which forms 'standard' EC) is to make the assumption

p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) ≈ p(h_{t+1}|s_{t+1}, v_{1:T}),

where p(h_{t+1}|s_{t+1}, v_{1:T}) is already known from the previous backward recursion. Under this assumption, the recursion becomes

p(h_t, s_t|v_{1:T}) ≈ \sum_{s_{t+1}} p(s_{t+1}|v_{1:T}) p(s_t|s_{t+1}, v_{1:T}) ⟨p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})}.   (10)

[Footnote 2: In the exact calculation, p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) is a mixture of Gaussians, see [7]. However, since in (9) the two terms will only be approximately computed during the recursion, our approximation to p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) will not be a mixture of Gaussians.]

Evaluating ⟨p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})}. This average is a Gaussian in h_t, whose statistics we will now compute. First we find p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t}), which may be obtained from the joint distribution

p(h_t, h_{t+1}|s_t, s_{t+1}, v_{1:t}) = p(h_{t+1}|h_t, s_{t+1}) p(h_t|s_t, v_{1:t}),   (11)

which itself can be found from a forward dynamics step applied to the filtered estimate p(h_t|s_t, v_{1:t}).
The statistics for the marginal p(h_t|s_t, s_{t+1}, v_{1:t}) are simply those of p(h_t|s_t, v_{1:t}), since s_{t+1} carries no extra information about h_t. The remaining statistics are the mean of h_{t+1}, the covariance of h_{t+1} and the cross-variance between h_t and h_{t+1}, which are given by

⟨h_{t+1}⟩ = A(s_{t+1}) f_t(s_t),   Σ_{t+1,t+1} = A(s_{t+1}) F_t(s_t) A^T(s_{t+1}) + Σ_h(s_{t+1}),   Σ_{t+1,t} = A(s_{t+1}) F_t(s_t).

Given the statistics of (11), we may now condition on h_{t+1} to find p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t}). Doing so effectively constitutes a reversal of the dynamics,

h_t = ←A(s_t, s_{t+1}) h_{t+1} + ←η(s_t, s_{t+1}),

where the reversed transition ←A(s_t, s_{t+1}) and reversed noise ←η(s_t, s_{t+1}) ∼ N(←m(s_t, s_{t+1}), ←Σ(s_t, s_{t+1})) are easily found using conditioning. Averaging the above reversed dynamics over p(h_{t+1}|s_{t+1}, v_{1:T}), we find that ⟨p(h_t|h_{t+1}, s_t, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})} is a Gaussian with statistics

µ_t = ←A(s_t, s_{t+1}) g(s_{t+1}) + ←m(s_t, s_{t+1}),   Σ_{t,t} = ←A(s_t, s_{t+1}) G(s_{t+1}) ←A^T(s_t, s_{t+1}) + ←Σ(s_t, s_{t+1}).

These equations directly mirror the standard RTS backward pass [1].

Evaluating p(s_t|s_{t+1}, v_{1:T}). The main departure of EC from previous methods is in treating the term

p(s_t|s_{t+1}, v_{1:T}) = ⟨p(s_t|h_{t+1}, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})}.   (12)

The term p(s_t|h_{t+1}, s_{t+1}, v_{1:t}) is given by

p(s_t|h_{t+1}, s_{t+1}, v_{1:t}) = p(h_{t+1}|s_{t+1}, s_t, v_{1:t}) p(s_t, s_{t+1}|v_{1:t}) / \sum_{s'_t} p(h_{t+1}|s_{t+1}, s'_t, v_{1:t}) p(s'_t, s_{t+1}|v_{1:t}).   (13)

Here p(s_t, s_{t+1}|v_{1:t}) = p(s_{t+1}|s_t, v_{1:t}) p(s_t|v_{1:t}), where p(s_{t+1}|s_t, v_{1:t}) occurs in the forward pass, (7). In (13), p(h_{t+1}|s_{t+1}, s_t, v_{1:t}) is found by marginalizing (11). Computing the average of (13) with respect to p(h_{t+1}|s_{t+1}, v_{1:T}) may be achieved by any desired numerical integration method. A simple approximation is to evaluate the integrand at the mean of the averaging distribution p(h_{t+1}|s_{t+1}, v_{1:T}).
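The reversal of the dynamics described above can be sketched as follows. The sketch (function names are our own) conditions the joint Gaussian (11) on h_{t+1} to obtain ←A, ←m, ←Σ, and then averages over the smoothed stats (g, G) of h_{t+1}, mirroring the RTS backward pass:

```python
import numpy as np

def reverse_dynamics(A, Sh, f, F):
    """Statistics of p(h_t | h_{t+1}, ...) from filtered stats (f, F) and
    forward dynamics h_{t+1} = A h_t + eta, eta ~ N(0, Sh).
    Returns (A_rev, m_rev, S_rev) so that h_t = A_rev h_{t+1} + eta_rev."""
    S_fwd = A @ F @ A.T + Sh                    # covariance of h_{t+1}
    A_rev = F @ A.T @ np.linalg.inv(S_fwd)      # gain from conditioning
    m_rev = f - A_rev @ (A @ f)
    S_rev = F - A_rev @ A @ F
    return A_rev, m_rev, S_rev

def backward_update(A_rev, m_rev, S_rev, g, G):
    """Average the reversed dynamics over the smoothed N(g, G) of h_{t+1}."""
    mu = A_rev @ g + m_rev
    Sigma = A_rev @ G @ A_rev.T + S_rev
    return mu, Sigma
```

In the scalar case A = 1, Σ_h = 1, f = 0, F = 1, the reversed gain is 0.5, which is the familiar RTS smoother gain for a unit-noise random walk.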
More sophisticated methods (see [7]), such as sampling from the Gaussian p(h_{t+1}|s_{t+1}, v_{1:T}), have the advantage that covariance information is used^3.

Closing the Recursion. We have now computed both the continuous and discrete factors in (8), which we wish to use to write the smoothed estimate in the form p(h_t, s_t|v_{1:T}) = p(s_t|v_{1:T}) p(h_t|s_t, v_{1:T}). The distribution p(h_t|s_t, v_{1:T}) is readily obtained from the joint (8) by conditioning on s_t to form the mixture

p(h_t|s_t, v_{1:T}) = \sum_{s_{t+1}} p(s_{t+1}|s_t, v_{1:T}) p(h_t|s_t, s_{t+1}, v_{1:T}),

which may then be collapsed to a single Gaussian (the mixture case is discussed in [7]). The smoothed posterior p(s_t|v_{1:T}) is given by

p(s_t|v_{1:T}) = \sum_{s_{t+1}} p(s_{t+1}|v_{1:T}) ⟨p(s_t|h_{t+1}, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})}.   (14)

[Footnote 3: This is a form of exact sampling, since drawing samples from a Gaussian is easy. This should not be confused with meaning that this use of sampling renders EC a sequential Monte-Carlo scheme.]

2.3 Relation to other methods

The EC backward pass is closely related to Kim's method [8]. In both EC and Kim's method, the approximation p(h_{t+1}|s_t, s_{t+1}, v_{1:T}) ≈ p(h_{t+1}|s_{t+1}, v_{1:T}) is used to form a numerically simple backward pass. The other 'approximation' in EC is to numerically compute the average in (14). In Kim's method, however, an update for the discrete variables is formed by replacing the required term in (14) by

⟨p(s_t|h_{t+1}, s_{t+1}, v_{1:t})⟩_{p(h_{t+1}|s_{t+1}, v_{1:T})} ≈ p(s_t|s_{t+1}, v_{1:t}).   (15)

Since p(s_t|s_{t+1}, v_{1:t}) ∝ p(s_{t+1}|s_t) p(s_t|v_{1:t}) / p(s_{t+1}|v_{1:t}), this can be computed simply from the filtered results alone. The fundamental difference between EC and Kim's method is therefore that the approximation (15) is not required by EC. The EC backward pass thus makes fuller use of the future information, resulting in a recursion which intimately couples the continuous and discrete variables. The resulting effect on the quality of the approximation can be profound, as we will see in the experiments.
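The simple mean-evaluation approximation for the average in (12)-(13) can be sketched in a few lines. The sketch below (our own function names; a hand-rolled Gaussian density to stay self-contained) evaluates the Bayes ratio (13) at the smoothed mean of h_{t+1}:

```python
import numpy as np

def gauss_pdf(x, mu, S):
    """Density of N(mu, S) at x."""
    d = len(mu)
    r = x - mu
    return np.exp(-0.5 * r @ np.linalg.inv(S) @ r) / \
        np.sqrt((2 * np.pi) ** d * np.linalg.det(S))

def switch_posterior_at_mean(h_mean, mus, covs, joint_prior):
    """Evaluate p(s_t | h_{t+1}, s_{t+1}, v_{1:t}) of (13) at h_{t+1} = h_mean,
    the mean of p(h_{t+1}|s_{t+1}, v_{1:T}) (the simple approximation in the text).
    mus[s], covs[s]: stats of p(h_{t+1}|s_{t+1}, s_t = s, v_{1:t});
    joint_prior[s]: p(s_t = s, s_{t+1}|v_{1:t})."""
    w = np.array([gauss_pdf(h_mean, mus[s], covs[s]) * joint_prior[s]
                  for s in range(len(joint_prior))])
    return w / w.sum()                       # normalize over s_t
```

By symmetry, a smoothed mean of 0 between two unit-variance components at −1 and +1 with equal prior weight gives weights (0.5, 0.5).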
The Expectation Propagation (EP) algorithm makes the central assumption of collapsing the posteriors to a Gaussian family [5]; the collapse is defined by a consistency criterion on overlapping marginals. In our experiments, we take the approach in [9] of collapsing to a single Gaussian. Ensuring consistency requires frequent translations between moment and canonical parameterizations, which is the origin of potentially severe numerical instability [10]. In contrast, EC works largely with moment parameterizations of Gaussians, for which relatively few numerical difficulties arise. Unlike EP, EC is not based on a consistency criterion, and a subtle issue arises about possible inconsistencies between the Forward and Backward approximations in EC. For example, under the conditional independence assumption in the Backward Pass, p(h_T|s_{T−1}, s_T, v_{1:T}) ≈ p(h_T|s_T, v_{1:T}), which contradicts (5), which states that the approximation to p(h_T|s_{T−1}, s_T, v_{1:T}) will depend on s_{T−1}. Such potential inconsistencies arise because of the approximations made, and should not be considered as separate approximations in themselves. Rather than using a global (consistency) objective, EC attempts to faithfully approximate the exact Forward and Backward propagation routines. For this reason, as in the exact computation, only a single Forward and Backward pass are required in EC. In [11] a related dynamics reversal is proposed. However, the singularities resulting from incorrectly treating p(v_{t+1:T}|h_t, s_t) as a density are heuristically finessed. In [12] a variational method approximates the joint distribution p(h_{1:T}, s_{1:T}|v_{1:T}) rather than the marginal inference p(h_t, s_t|v_{1:T}); this is a disadvantage compared to other methods that directly approximate the marginal. Sequential Monte Carlo methods (Particle Filters) [13] are essentially mixtures of delta-function approximations.
Whilst potentially powerful, these typically suffer in high-dimensional hidden spaces, unless techniques such as Rao-Blackwellization are performed. ADF is generally preferable to Particle Filtering, since in ADF the approximation is a mixture of non-trivial distributions and is therefore better able to represent the posterior.

3 Demonstration

Testing EC on a problem with a reasonably long temporal sequence T is important, since numerical instabilities may not be apparent in time-series of just a few points. To do this, we sequentially generate hidden and visible states from a given model, here with H = 3, S = 2, V = 1 – see Figure 2 for full details of the experimental setup. Then, given only the parameters of the model and the visible observations (but not any of the hidden states h_{1:T}, s_{1:T}), the task is to infer p(h_t|s_t, v_{1:T}) and p(s_t|v_{1:T}). Since the exact computation is exponential in T, a simple alternative is to assume that the original sample states s_{1:T} are the 'correct' inferences, and to compare the most probable smoothed posterior estimates arg max_{s_t} p(s_t|v_{1:T}) with the assumed correct samples s_t. We chose conditions that, from the viewpoint of classical signal processing, are difficult, with changes in the switches occurring at a much higher rate than the typical frequencies in the signal v_t. For EC we use the mean approximation for the numerical integration of (12). We included the Particle Filter merely as a point of comparison with ADF, since Particle Filters are not designed to approximate the smoothed estimate; 1000 particles were used, with Kitagawa resampling. For the Rao-Blackwellized Particle Filter [13], 500 particles were used, with Kitagawa resampling. We found that EP^4 was numerically unstable and often struggled to converge. To encourage convergence, we used the damping method in [9], performing 20 iterations with a damping factor of 0.5. Nevertheless, the disappointing performance of EP is most likely due to conflicts resulting from numerical instabilities introduced by the frequent conversions between moment and canonical representations. The best filtered results are given by ADF, since it is better able to represent the variance in the filtered posterior than the sampling methods. Unlike Kim's method, EC makes good use of the future information to clean up the filtered results considerably. One should bear in mind that both EC and Kim's method use the same ADF filtered results. This demonstrates that EC may dramatically improve on Kim's method, so that the small amount of extra work in making a numerical approximation of p(s_t|s_{t+1}, v_{1:T}), (12), may bring significant benefits. We found similar conclusions for experiments with an aSLDS [7].

[Figure 2: Histograms, over 1000 experiments, of the number of errors in estimating p(s_t|v_{1:T}) for a binary switch (S = 2) over a time series of length T = 100; 50 errors corresponds to random guessing. The x-axes are cut off at 20 errors to improve visualization of the results. Methods: (PF) Particle Filter; (RBPF) Rao-Blackwellized PF; (EP) Expectation Propagation; (ADFS) Assumed Density Filtering using a single Gaussian; (KimS) Kim's smoother using the results from ADFS; (ECS) Expectation Correction using a single Gaussian (I = J = 1); (ADFM) ADF using a mixture of I = 4 Gaussians; (KimM) Kim's smoother using the results from ADFM; (ECM) Expectation Correction using a mixture with I = J = 4 components. Setup: S = 2, V = 1 (scalar observations), T = 100, zero output bias; A(s) = 0.9999 ∗ orth(randn(H, H)), B(s) = randn(V, H), H = 3, Σ_h(s) = I_H, Σ_v(s) = 0.1 I_V, p(s_{t+1}|s_t) ∝ 1_{S×S} + I_S. At time t = 1 the switch prior is uniform, with h_1 drawn from N(10 ∗ randn(H, 1), I_H).]
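The experimental setup of Figure 2 can be re-created as a short numpy sketch. This is a hypothetical reconstruction under stated assumptions: `orth` is interpreted as the Q factor of a QR decomposition, and the function name is our own:

```python
import numpy as np

def sample_slds(T=100, H=3, S=2, V=1, seed=0):
    """Draw (s_{1:T}, h_{1:T}, v_{1:T}) from an SLDS with the Figure 2 settings."""
    rng = np.random.default_rng(seed)
    # A(s) = 0.9999 * orth(randn(H, H)), B(s) = randn(V, H)
    A = [0.9999 * np.linalg.qr(rng.standard_normal((H, H)))[0] for _ in range(S)]
    B = [rng.standard_normal((V, H)) for _ in range(S)]
    Sh, Sv = np.eye(H), 0.1 * np.eye(V)
    P = np.ones((S, S)) + np.eye(S)          # p(s_{t+1}|s_t) proportional to 1 + I
    P /= P.sum(axis=0, keepdims=True)        # column s holds the next-state dist.
    s = rng.integers(S)                      # uniform switch prior at t = 1
    h = rng.multivariate_normal(10 * rng.standard_normal(H), np.eye(H))
    ss, hs, vs = [], [], []
    for t in range(T):
        ss.append(s); hs.append(h)
        vs.append(B[s] @ h + rng.multivariate_normal(np.zeros(V), Sv))
        s = rng.choice(S, p=P[:, s])         # sample s_{t+1}
        h = A[s] @ h + rng.multivariate_normal(np.zeros(H), Sh)   # h_{t+1} = A(s_{t+1}) h_t + noise
    return np.array(ss), np.array(hs), np.array(vs)
```

A run of `sample_slds()` gives the (s_{1:T}, h_{1:T}, v_{1:T}) triple against which the smoothed estimates arg max_{s_t} p(s_t|v_{1:T}) would be scored.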
4 Application to Noise Robust ASR

Here we briefly present an application of the SLDS to robust Automatic Speech Recognition (ASR), for which the intractable inference is performed by EC; this serves to demonstrate how EC scales to a large application. Fuller details are given in [14]. The standard approach to noise robust ASR is to provide a set of noise-robust features to a standard Hidden Markov Model (HMM) classifier, which is based on modeling the acoustic feature vector. For example, the method of Unsupervised Spectral Subtraction (USS) [15] provides state-of-the-art performance in this respect. Incorporating noise models directly into such feature-based HMM systems is difficult, mainly because the explicit influence of the noise on the features is poorly understood. An alternative is to model the raw speech signal directly, such as the SAR-HMM model [16], for which, under clean conditions, isolated spoken digit recognition performs well. However, the SAR-HMM performs poorly under noisy conditions, since no explicit noise process is taken into account by the model. The approach we take here is to extend the SAR-HMM to include an explicit noise process, so that the observed signal v_t is modeled as a noise-corrupted version of a clean hidden signal v_t^h:

v_t = v_t^h + η̃_t,  with η̃_t ∼ N(0, σ̃²).

[Footnote 4: Generalized EP [5], which groups variables together, improves on the results, but is still far inferior to the EC results presented here – Onno Zoeter, personal communication.]

Table 1: Comparison of the recognition accuracy of three models when the test utterances are corrupted by various levels of Gaussian noise.

Noise Variance   0        10^-7    10^-6   10^-5   10^-4   10^-3
SNR (dB)         26.5     26.3     25.1    19.7    10.6    0.7
HMM              100.0%   100.0%   90.9%   86.4%   59.1%   9.1%
SAR-HMM          97.0%    79.8%    56.7%   22.2%   9.7%    9.1%
AR-SLDS          96.8%    96.8%    96.4%   94.8%   84.0%   61.2%
The dynamics of the clean signal is modeled by a switching AR process

v_t^h = \sum_{r=1}^{R} c_r(s_t) v_{t−r}^h + η_t^h(s_t),   η_t^h(s_t) ∼ N(0, σ²(s_t)),

where s_t ∈ {1, . . . , S} denotes which of a set of AR coefficients c_r(s_t) are to be used at time t, and η_t^h(s_t) is the so-called innovation noise. When σ²(s_t) ≡ 0, this model reproduces the SAR-HMM of [16], a specially constrained HMM; hence inference and learning for the SAR-HMM are tractable and straightforward. For the case σ²(s_t) > 0, the model can be recast as an SLDS. To do this we define h_t as a vector which contains the R most recent clean hidden samples,

h_t = [v_t^h . . . v_{t−R+1}^h]^T,   (16)

and we set A(s_t) to be an R × R matrix whose first row contains the AR coefficients −c_r(s_t) and whose remaining rows form a shifted-down identity matrix. For example, for a third-order (R = 3) AR process,

A(s_t) = [ −c_1(s_t)  −c_2(s_t)  −c_3(s_t)
               1           0           0
               0           1           0 ].   (17)

The hidden covariance matrix Σ_h(s) has all elements zero, except the top-left element, which is set to the innovation variance. To extract the first component of h_t we use the (switch-independent) 1 × R projection matrix B = [1 0 . . . 0]. The (switch-independent) visible scalar noise variance is given by Σ_v ≡ σ_v². A well-known issue with raw speech signal models is that the energy of a signal may vary from one speaker to another, or because of a change in recording conditions. For this reason the innovation Σ_h is adjusted by maximizing the likelihood of an observed sequence with respect to the innovation covariance, a process called Gain Adaptation [16].

4.1 Training & Evaluation

Following [16], we trained a separate SAR-HMM for each of the eleven digits (0–9 and 'oh') from the TI-DIGITS database [17]. The training set for each digit was composed of 110 single digit utterances down-sampled to 8 kHz, each one pronounced by a male speaker. Each SAR-HMM was composed of ten states with a left-right transition matrix.
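The AR-to-SLDS embedding of (16)-(17) can be sketched as a small builder function (our own naming; the sign convention on the first row follows (17) as printed):

```python
import numpy as np

def ar_companion(coeffs, innovation_var):
    """Build A(s), Sigma_h(s) and the projection B for an AR(R) process
    recast as an LGSSM, following (16)-(17)."""
    R = len(coeffs)
    A = np.zeros((R, R))
    A[0, :] = -np.asarray(coeffs)        # first row: (negated) AR coefficients
    A[1:, :-1] = np.eye(R - 1)           # shifted-down identity below
    Sh = np.zeros((R, R))
    Sh[0, 0] = innovation_var            # only the newest sample receives noise
    B = np.zeros((1, R))
    B[0, 0] = 1.0                        # extract the current clean sample
    return A, Sh, B
```

For a third-order process this reproduces the matrix displayed in (17), with the innovation variance in the top-left entry of Σ_h(s).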
Each state was associated with a 10th-order AR process, and the model was constrained to stay for an integer multiple of K = 140 time steps (0.0175 seconds) in the same state. We refer the reader to [16] for a detailed explanation of the training procedure used with the SAR-HMM. An AR-SLDS was built for each of the eleven digits by copying the parameters of the corresponding trained SAR-HMM, i.e., the AR coefficients c_r(s) are copied into the first row of the hidden transition matrix A(s), and the same discrete transition distribution p(s_t|s_{t−1}) is used. The models were then evaluated on a test set composed of 112 corrupted utterances of each of the eleven digits, each pronounced by different male speakers than those used in the training set. The recognition accuracy obtained by the models on the corrupted test sets is presented in Table 1. As expected, the performance of the SAR-HMM rapidly decreases with noise. The feature-based HMM with USS has high accuracy only for high SNR levels. In contrast, the AR-SLDS achieves a recognition accuracy of 61.2% at an SNR close to 0 dB, while the performance of the two other methods is equivalent to random guessing (9.1%). Whilst other inference methods may also perform well in this case, we found that EC performs admirably, without numerical instabilities, even for time-series with several thousand time-steps.

5 Discussion

We presented a method for approximate smoothed inference in an augmented class of switching linear dynamical systems. Our approximation is based on the idea that, due to the forgetting which commonly occurs in Markovian models, a finite number of mixture components may provide a reasonable approximation. Clearly, in systems with very long correlation times our method may require too many mixture components to produce a satisfactory result, although we are unaware of other techniques that would be able to cope well in that case.
The main benefit of EC over Kim smoothing is that future information is more accurately dealt with. Whilst EC is not as general as EP, EC carefully exploits the properties of singly-connected distributions, such as the aSLDS, to provide a numerically stable procedure. We hope that the ideas presented here may therefore help facilitate the practical application of dynamic hybrid networks.

Acknowledgements

This work is supported by the EU Project FP6-0027787. This paper only reflects the authors' views and funding agencies are not liable for any use that may be made of the information contained herein.

References

[1] Y. Bar-Shalom and Xiao-Rong Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, Norwood, MA, 1998.
[2] V. Pavlovic, J. M. Rehg, and J. MacCormick. Learning switching linear models of human motion. In Advances in Neural Information Processing Systems (NIPS 13), pages 981–987, 2001.
[3] A. T. Cemgil, B. Kappen, and D. Barber. A generative model for music transcription. IEEE Transactions on Audio, Speech and Language Processing, 14(2):679–694, 2006.
[4] U. N. Lerner. Hybrid Bayesian Networks for Reasoning about Complex Systems. PhD thesis, Stanford University, 2002.
[5] O. Zoeter. Monitoring non-linear and switching dynamical systems. PhD thesis, Radboud University Nijmegen, 2005.
[6] T. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT Media Lab, 2001.
[7] D. Barber. Expectation correction for smoothed inference in switching linear dynamical systems. Journal of Machine Learning Research, 7:2515–2540, 2006.
[8] C.-J. Kim. Dynamic linear models with Markov-switching. Journal of Econometrics, 60:1–22, 1994.
[9] T. Heskes and O. Zoeter. Expectation Propagation for approximate inference in dynamic Bayesian networks. In A. Darwiche and N. Friedman, editors, Uncertainty in Artificial Intelligence, pages 216–223, 2002.
[10] S. Lauritzen and F. Jensen.
Stable local computation with conditional Gaussian distributions. Statistics and Computing, 11:191–203, 2001.
[11] G. Kitagawa. The two-filter formula for smoothing and an implementation of the Gaussian-sum smoother. Annals of the Institute of Statistical Mathematics, 46(4):605–623, 1994.
[12] Z. Ghahramani and G. E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):963–996, 1998.
[13] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001.
[14] B. Mesot and D. Barber. Switching linear dynamical systems for noise robust speech recognition. IDIAP-RR 08, 2006.
[15] G. Lathoud, M. Magimai-Doss, B. Mesot, and H. Bourlard. Unsupervised spectral subtraction for noise-robust ASR. In Proceedings of ASRU 2005, pages 189–194, November 2005.
[16] Y. Ephraim and W. J. J. Roberts. Revisiting autoregressive hidden Markov modeling of speech signals. IEEE Signal Processing Letters, 12(2):166–169, February 2005.
[17] R. G. Leonard. A database for speaker independent digit recognition. In Proceedings of ICASSP 84, volume 3, 1984.
198 nips-2006 – Unified Inference for Variational Bayesian Linear Gaussian State-Space Models
Here we take instead the ML-II ('evidence') framework, in which the optimal set of hyperparameters is found by maximizing p(v_{1:T}|Θ̂) with respect to Θ̂ [6, 7, 9]. For the parameter priors, here we define Gaussians on the columns of A and B^3:

p(A|α, Σ_H) ∝ \prod_{j=1}^{H} exp( −(α_j/2) (A_j − Â_j)^T Σ_H^{-1} (A_j − Â_j) ),
p(B|β, Σ_V) ∝ \prod_{j=1}^{H} exp( −(β_j/2) (B_j − B̂_j)^T Σ_V^{-1} (B_j − B̂_j) ),

which has the effect of biasing the transition and emission matrices towards desired forms Â and B̂. The conjugate priors for general inverse covariances Σ_H^{-1} and Σ_V^{-1} are Wishart distributions [7]^4. In the simpler case assumed here of diagonal covariances, these become Gamma distributions [5, 7]. The hyperparameters are then Θ̂ = {α, β}^5.

Variational Bayes

Optimizing Eq. (1) with respect to Θ̂ is difficult due to the intractability of the integrals. Instead, in VB, one considers the lower bound [6, 7, 9]^6

L = log p(v_{1:T}|Θ̂) ≥ H_q(Θ, h_{1:T}) + ⟨log p(Θ|Θ̂)⟩_{q(Θ)} + ⟨E(h_{1:T}, Θ)⟩_{q(Θ, h_{1:T})} ≡ F,

where E(h_{1:T}, Θ) ≡ log p(v_{1:T}, h_{1:T}|Θ), H_d(x) signifies the entropy of the distribution d(x), and ⟨·⟩_{d(x)} denotes the expectation operator. The key approximation in VB is q(Θ, h_{1:T}) ≡ q(Θ) q(h_{1:T}), from which one may show that, for optimality of F,

q(h_{1:T}) ∝ e^{⟨E(h_{1:T}, Θ)⟩_{q(Θ)}},   q(Θ) ∝ p(Θ|Θ̂) e^{⟨E(h_{1:T}, Θ)⟩_{q(h_{1:T})}}.

These coupled equations need to be iterated to convergence. The updates for the parameters q(Θ) are straightforward and are given in Appendices A and B. Once converged, the hyperparameters are updated by maximizing F with respect to Θ̂, which leads to simple update formulae [7]. Our main concern is with the update for q(h_{1:T}), for which this paper makes a departure from previously presented treatments.

[Footnotes: 3. More general Gaussian priors may be more suitable depending on the application. 4. For expositional simplicity, we do not put priors on µ and Σ. 5. For simplicity, we keep the parameters of the Gamma priors fixed. 6. Strictly, we should write q(·|v_{1:T}) throughout; we omit this dependence for notational convenience.]
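The column-wise Gaussian prior on A can be evaluated in a few lines. The sketch below (function name ours) computes the unnormalized log-prior, summing the per-column quadratic penalties:

```python
import numpy as np

def log_prior_A(A, A_hat, alpha, Sh_inv):
    """Unnormalized log of the columnwise Gaussian prior p(A | alpha, Sigma_H):
    sum_j -alpha_j/2 (A_j - Ahat_j)^T Sigma_H^{-1} (A_j - Ahat_j)."""
    D = A - A_hat                        # column j holds A_j - Ahat_j
    return -0.5 * sum(alpha[j] * D[:, j] @ Sh_inv @ D[:, j]
                      for j in range(A.shape[1]))
```

Larger α_j values bias column j of A more strongly towards the corresponding column of Â; the analogous expression with β and Σ_V^{-1} applies to B.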
3 Unified Inference on q(h_{1:T})

Optimally, q(h_{1:T}) is Gaussian since, up to a constant, ⟨E(h_{1:T}, Θ)⟩_{q(Θ)} is the following quadratic form in h_{1:T}^7:

−(1/2) \sum_{t=1}^{T} [ ⟨(v_t − B h_t)^T Σ_V^{-1} (v_t − B h_t)⟩_{q(B, Σ_V)} + ⟨(h_t − A h_{t−1})^T Σ_H^{-1} (h_t − A h_{t−1})⟩_{q(A, Σ_H)} ].   (2)

In addition, optimally, q(A|Σ_H) and q(B|Σ_V) are Gaussians (see Appendix A), so we can easily carry out the averages in Eq. (2). The further averages over q(Σ_H) and q(Σ_V) are also easy due to conjugacy. Whilst this defines the distribution q(h_{1:T}), quantities such as q(h_t), required for example for the parameter updates (see the Appendices), need to be inferred from this distribution. Clearly, in the non-Bayesian case, the averages over the parameters are not present, and the above simply represents the posterior distribution of an LGSSM whose visible variables have been clamped into their evidential states. In that case, inference can be performed using any standard LGSSM routine. Our aim, therefore, is to try to represent the averaged Eq. (2) directly as the posterior distribution q̃(h_{1:T}|ṽ_{1:T}) of an LGSSM, for some suitable parameter settings.

Mean + Fluctuation Decomposition

A useful decomposition is to write

⟨(v_t − B h_t)^T Σ_V^{-1} (v_t − B h_t)⟩_{q(B, Σ_V)} = (v_t − ⟨B⟩ h_t)^T ⟨Σ_V^{-1}⟩ (v_t − ⟨B⟩ h_t)  [mean]  +  h_t^T S_B h_t  [fluctuation],

and similarly

⟨(h_t − A h_{t−1})^T Σ_H^{-1} (h_t − A h_{t−1})⟩_{q(A, Σ_H)} = (h_t − ⟨A⟩ h_{t−1})^T ⟨Σ_H^{-1}⟩ (h_t − ⟨A⟩ h_{t−1})  [mean]  +  h_{t−1}^T S_A h_{t−1}  [fluctuation],

where the parameter covariances are S_B ≡ ⟨B^T Σ_V^{-1} B⟩ − ⟨B⟩^T ⟨Σ_V^{-1}⟩ ⟨B⟩ = V H_B^{-1} and S_A ≡ ⟨A^T Σ_H^{-1} A⟩ − ⟨A⟩^T ⟨Σ_H^{-1}⟩ ⟨A⟩ = H H_A^{-1} (for H_A and H_B defined in Appendix A). The mean terms simply represent a clamped LGSSM with averaged parameters. However, the extra contributions from the fluctuations mean that Eq. (2) cannot be written as a clamped LGSSM with averaged parameters. In order to deal with these extra terms, our idea is to treat the fluctuations as arising from an augmented visible variable, for which Eq.
(2) can then be considered as a clamped LGSSM.

Inference Using an Augmented LGSSM

To represent Eq. (2) as an LGSSM $\tilde q(h_{1:T}|\tilde v_{1:T})$, we may augment $v_t$ and $B$ as$^8$:

$$\tilde v_t = \mathrm{vert}(v_t, 0_H, 0_H), \qquad \tilde B = \mathrm{vert}(\langle B\rangle, U_A, U_B),$$

where $U_A$ is the Cholesky factor of $S_A$, so that $U_A^T U_A = S_A$. Similarly, $U_B$ is the Cholesky factor of $S_B$. The equivalent LGSSM $\tilde q(h_{1:T}|\tilde v_{1:T})$ is then completed by specifying$^9$

$$\tilde A \equiv \langle A\rangle, \quad \tilde\Sigma_H \equiv \langle\Sigma_H^{-1}\rangle^{-1}, \quad \tilde\Sigma_V \equiv \mathrm{diag}\big(\langle\Sigma_V^{-1}\rangle^{-1}, I_H, I_H\big), \quad \tilde\mu \equiv \mu, \quad \tilde\Sigma \equiv \Sigma.$$

The validity of this parameter assignment can be checked by showing that, up to negligible constants, the exponent of this augmented LGSSM has the same form as Eq. (2)$^{10}$. Now that this has been written as an LGSSM $\tilde q(h_{1:T}|\tilde v_{1:T})$, standard inference routines in the literature may be applied to compute $q(h_t|v_{1:T}) = \tilde q(h_t|\tilde v_{1:T})$ [1, 2, 11]$^{11}$.

7 For simplicity of exposition, we ignore the first time-point here.
8 The notation $\mathrm{vert}(x_1,\ldots,x_n)$ stands for vertically concatenating the arguments $x_1,\ldots,x_n$.
9 Strictly, we need a time-dependent emission $\tilde B_t = \tilde B$ for $t = 1,\ldots,T-1$. For time $T$, $\tilde B_T$ has the Cholesky factor $U_A$ replaced by $0_{H,H}$.
10 There are several ways of achieving a similar augmentation. We chose this one since, in the non-Bayesian limit $U_A = U_B = 0_{H,H}$, no numerical instabilities would be introduced.
11 Note that, since the augmented LGSSM $\tilde q(h_{1:T}|\tilde v_{1:T})$ is designed to match the fully clamped distribution $q(h_{1:T}|v_{1:T})$, the filtered posterior $\tilde q(h_t|\tilde v_{1:t})$ does not correspond to $q(h_t|v_{1:t})$.

Algorithm 1 LGSSM: Forward and backward recursive updates. The smoothed posterior $p(h_t|v_{1:T})$ is returned in the mean $\hat h_t^T$ and covariance $P_t^T$.
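The validity check mentioned above can also be carried out numerically. The NumPy sketch below (variable names are ours) builds the augmented emission $\tilde B$ and noise $\tilde\Sigma_V$ from random positive-definite fluctuation matrices standing in for $S_A$ and $S_B$, and verifies that $\tilde B^T\tilde\Sigma_V^{-1}\tilde B$ reproduces the averaged quadratic coefficient $\langle B\rangle^T\langle\Sigma_V^{-1}\rangle\langle B\rangle + S_A + S_B$:

```python
import numpy as np

rng = np.random.default_rng(1)
V, H = 2, 3

B_mean = rng.standard_normal((V, H))   # stands in for <B>
Sigma_V_inv = 2.0 * np.eye(V)          # stands in for <Sigma_V^{-1}>
M = rng.standard_normal((H, H))
S_A = M @ M.T + np.eye(H)              # fluctuation covariances (positive definite)
M = rng.standard_normal((H, H))
S_B = M @ M.T + np.eye(H)

# Upper-triangular Cholesky factors with U^T U = S
U_A = np.linalg.cholesky(S_A).T
U_B = np.linalg.cholesky(S_B).T

B_tilde = np.vstack([B_mean, U_A, U_B])            # vert(<B>, U_A, U_B)
Sigma_V_tilde_inv = np.zeros((V + 2 * H, V + 2 * H))
Sigma_V_tilde_inv[:V, :V] = Sigma_V_inv            # diag(<Sigma_V^{-1}>, I_H, I_H)
Sigma_V_tilde_inv[V:, V:] = np.eye(2 * H)

lhs = B_tilde.T @ Sigma_V_tilde_inv @ B_tilde
rhs = B_mean.T @ Sigma_V_inv @ B_mean + S_A + S_B
print(np.allclose(lhs, rhs))   # True
```

The check works because stacking $U_A$ and $U_B$ under $\langle B\rangle$ contributes exactly $U_A^T U_A + U_B^T U_B = S_A + S_B$ to the quadratic term, which is the fluctuation contribution of the decomposition.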
procedure FORWARD
    1a: $P \leftarrow \Sigma$
    1b: $P \leftarrow D\Sigma$, where $D \equiv I - \Sigma U_{AB}^T\big(I + U_{AB}\Sigma U_{AB}^T\big)^{-1}U_{AB}$
    2a: $\hat h_1^0 \leftarrow \mu$
    2b: $\hat h_1^0 \leftarrow D\mu$
    3:  $K \leftarrow PB^T(BPB^T + \Sigma_V)^{-1}$, $\quad P_1^1 \leftarrow (I - KB)P$, $\quad \hat h_1^1 \leftarrow \hat h_1^0 + K(v_1 - B\hat h_1^0)$
    for $t \leftarrow 2, T$ do
        4:  $P_t^{t-1} \leftarrow AP_{t-1}^{t-1}A^T + \Sigma_H$
        5a: $P \leftarrow P_t^{t-1}$
        5b: $P \leftarrow D_t P_t^{t-1}$, where $D_t \equiv I - P_t^{t-1}U_{AB}^T\big(I + U_{AB}P_t^{t-1}U_{AB}^T\big)^{-1}U_{AB}$
        6a: $\hat h_t^{t-1} \leftarrow A\hat h_{t-1}^{t-1}$
        6b: $\hat h_t^{t-1} \leftarrow D_t A\hat h_{t-1}^{t-1}$
        7:  $K \leftarrow PB^T(BPB^T + \Sigma_V)^{-1}$, $\quad P_t^t \leftarrow (I - KB)P$, $\quad \hat h_t^t \leftarrow \hat h_t^{t-1} + K(v_t - B\hat h_t^{t-1})$
    end for
end procedure

procedure BACKWARD
    for $t \leftarrow T-1, 1$ do
        $\overleftarrow{A}_t \leftarrow P_t^t A^T (P_{t+1}^t)^{-1}$
        $P_t^T \leftarrow P_t^t + \overleftarrow{A}_t(P_{t+1}^T - P_{t+1}^t)\overleftarrow{A}_t^T$
        $\hat h_t^T \leftarrow \hat h_t^t + \overleftarrow{A}_t(\hat h_{t+1}^T - A\hat h_t^t)$
    end for
end procedure

For completeness, we describe the standard predictor-corrector form of the Kalman Filter, together with the Rauch-Tung-Striebel Smoother [2]. These are given in Algorithm 1, where $\tilde q(h_t|\tilde v_{1:T})$ is computed by calling the FORWARD and BACKWARD procedures.

We present two variants of the FORWARD pass. Either we may call procedure FORWARD in Algorithm 1 with parameters $\tilde A, \tilde B, \tilde\Sigma_H, \tilde\Sigma_V, \tilde\mu, \tilde\Sigma$ and the augmented visible variables $\tilde v_t$, in which case we use steps 1a, 2a, 5a and 6a. This is exactly the predictor-corrector form of a Kalman Filter [2]. Otherwise, in order to reduce the computational cost, we may call procedure FORWARD with the parameters $\tilde A, \langle B\rangle, \tilde\Sigma_H, \langle\Sigma_V^{-1}\rangle^{-1}, \tilde\mu, \tilde\Sigma$ and the original visible variable $v_t$, in which case we use steps 1b (where $U_{AB}^T U_{AB} \equiv S_A + S_B$), 2b, 5b and 6b. The two algorithms are mathematically equivalent. Computing $q(h_t|v_{1:T}) = \tilde q(h_t|\tilde v_{1:T})$ is then completed by calling the common BACKWARD pass. The important point here is that the reader may supply any standard Kalman Filtering/Smoothing routine, and simply call it with the appropriate parameters.
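A minimal NumPy sketch of Algorithm 1 in its standard predictor-corrector/RTS form (variant "a", without the cost-reducing 1b/5b/6b steps; function and variable names are ours, and no claims are made about numerical robustness — for that, see the stabilized square-root forms cited in the text):

```python
import numpy as np

def kalman_smoother(v, A, B, Sigma_H, Sigma_V, mu, Sigma):
    """Predictor-corrector Kalman filter followed by the RTS smoother.
    v: (T, V) observations. Returns smoothed means (T, H) and covariances (T, H, H)."""
    T, H = v.shape[0], A.shape[0]
    f_mean = np.zeros((T, H)); f_cov = np.zeros((T, H, H))   # filtered h_t^t, P_t^t
    p_mean = np.zeros((T, H)); p_cov = np.zeros((T, H, H))   # predicted h_t^{t-1}, P_t^{t-1}
    m, P = mu, Sigma
    for t in range(T):
        if t > 0:  # steps 4, 6a: predict
            m = A @ f_mean[t - 1]
            P = A @ f_cov[t - 1] @ A.T + Sigma_H
        p_mean[t], p_cov[t] = m, P
        K = P @ B.T @ np.linalg.inv(B @ P @ B.T + Sigma_V)   # steps 3, 7: correct
        f_mean[t] = m + K @ (v[t] - B @ m)
        f_cov[t] = (np.eye(H) - K @ B) @ P
    s_mean, s_cov = f_mean.copy(), f_cov.copy()
    for t in range(T - 2, -1, -1):                           # BACKWARD (RTS) pass
        G = f_cov[t] @ A.T @ np.linalg.inv(p_cov[t + 1])
        s_mean[t] = f_mean[t] + G @ (s_mean[t + 1] - p_mean[t + 1])
        s_cov[t] = f_cov[t] + G @ (s_cov[t + 1] - p_cov[t + 1]) @ G.T
    return s_mean, s_cov
```

Calling this with the augmented parameters $\tilde A, \tilde B, \tilde\Sigma_H, \tilde\Sigma_V, \tilde\mu, \tilde\Sigma$ and observations $\tilde v_t$ yields the VB posterior moments $q(h_t|v_{1:T})$ described in the text.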
In some parameter regimes, or for very long time-series, numerical stability may be a serious concern, for which several stabilized algorithms have been developed over the years, for example the square-root forms [2, 10, 11]. By converting the problem to a standard form, we have therefore unified and simplified inference, so that future applications may be more readily developed$^{12}$.

3.1 Relation to Previous Approaches

An alternative approach to the one above, taken in [5, 7], is to write the posterior as

$$\log q(h_{1:T}) = \sum_{t=2}^{T}\phi_t(h_{t-1}, h_t) + \mathrm{const.}$$

for suitably defined quadratic forms $\phi_t(h_{t-1}, h_t)$, which encode the averaging over the parameters $A, B, \Sigma_H, \Sigma_V$. The approach taken in [7] is to recognize this as a pairwise Markov chain, for which the Belief Propagation recursions may be applied. The approach in [5] is based on a Kullback-Leibler minimization of the posterior with a chain structure, which is algorithmically equivalent to Belief Propagation. Whilst these are mathematically valid procedures, the resulting algorithms do not correspond to any of the standard forms in the Kalman Filtering/Smoothing literature, whose properties have been well studied [14].

12 The computation of the log-likelihood bound does not require any augmentation.

4 An Application to Bayesian ICA

A particular case for which the Bayesian LGSSM is of interest is in extracting independent source signals underlying a multivariate time-series [5, 15]. This will demonstrate how the approach developed in Section 3 makes VB easy to apply. The sources $s^i$ are modeled as independent in the following sense:

$$p(s^i_{1:T}, s^j_{1:T}) = p(s^i_{1:T})\,p(s^j_{1:T}), \quad \text{for } i \ne j,\; i, j = 1, \ldots, C.$$

Independence implies block-diagonal transition and state noise matrices $A$, $\Sigma_H$ and $\Sigma$, where each block $c$ has dimension $H_c$.
A one-dimensional source $s^c_t$ for each independent dynamical subsystem is then formed from $s^c_t = 1_c^T h^c_t$, where $1_c$ is a unit vector and $h^c_t$ is the state of dynamical system $c$. Combining the sources, we can write $s_t = P h_t$, where $P = \mathrm{diag}(1_1^T, \ldots, 1_C^T)$ and $h_t = \mathrm{vert}(h^1_t, \ldots, h^C_t)$. The resulting emission matrix is constrained to be of the form $B = WP$, where $W$ is the $V \times C$ mixing matrix. This means that the observations are formed by linearly mixing the sources, $v_t = W s_t + \eta^v_t$. The graphical structure of this model is presented in Fig. 1.

Figure 1: The structure of the LGSSM for ICA.

To encourage redundant components to be removed, we place a zero-mean Gaussian prior on $W$. In this case, we do not define a prior for the parameters $\Sigma_H$ and $\Sigma_V$, which are instead considered as hyperparameters. More details of the model are given in [15]. The constraint $B = WP$ requires a minor modification to the approach of Section 3, as we discuss below.

Inference on $q(h_{1:T})$

A small modification of the mean + fluctuation decomposition for $B$ occurs, namely:

$$\big\langle (v_t - Bh_t)^T\Sigma_V^{-1}(v_t - Bh_t)\big\rangle_{q(W)} = (v_t - \langle B\rangle h_t)^T\Sigma_V^{-1}(v_t - \langle B\rangle h_t) + h_t^T P^T S_W P h_t,$$

where $\langle B\rangle \equiv \langle W\rangle P$ and $S_W = V H_W^{-1}$. The quantities $\langle W\rangle$ and $H_W$ are obtained as in Appendix A.1 with the replacement $h_t \leftarrow P h_t$. To represent the above as an LGSSM, we augment $v_t$ and $B$ as

$$\tilde v_t = \mathrm{vert}(v_t, 0_H, 0_C), \qquad \tilde B = \mathrm{vert}(\langle B\rangle, U_A, U_W P),$$

where $U_W$ is the Cholesky factor of $S_W$. The equivalent LGSSM is then completed by specifying $\tilde A \equiv \langle A\rangle$, $\tilde\Sigma_H \equiv \Sigma_H$, $\tilde\Sigma_V \equiv \mathrm{diag}(\Sigma_V, I_H, I_C)$, $\tilde\mu \equiv \mu$, $\tilde\Sigma \equiv \Sigma$, and inference for $q(h_{1:T})$ is performed using Algorithm 1. This demonstrates the elegance and unity of the approach in Section 3, since no new algorithm needs to be developed to perform inference, even in this special constrained-parameter case.

4.1 Demonstration

As a simple demonstration, we used an LGSSM to generate 3 sources $s^c_t$ with random $5\times 5$ transition matrices $A^c$, $\mu = 0_H$ and $\Sigma \equiv \Sigma_H \equiv I_H$.
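The projection $P$ and constrained emission $B = WP$ can be assembled mechanically. A sketch (names are ours; we take the unit vector $1_c$ to pick out the first component of each subsystem, an assumption not fixed by the text) for $C$ subsystems of sizes $H_c$:

```python
import numpy as np
from scipy.linalg import block_diag

def make_projection(block_sizes):
    """P = diag(1_1^T, ..., 1_C^T): one row per dynamical subsystem, each row a
    unit vector selecting one component of that subsystem, so s_t = P h_t."""
    rows = []
    for Hc in block_sizes:
        one = np.zeros(Hc)
        one[0] = 1.0   # assumed: 1_c selects the first state component
        rows.append(one)
    return block_diag(*rows)

P = make_projection([5, 5, 5])   # C = 3 sources, each with H_c = 5 hidden states
W = np.random.default_rng(2).standard_normal((3, 3))   # V x C mixing matrix
B = W @ P                        # constrained emission matrix, V x H
```

With this construction, $P h_t$ simply extracts one scalar source per block, and the emission $B = WP$ mixes these $C$ sources into the $V$ observations.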
The sources were mixed into three observations, $v_t = W s_t + \eta^v_t$, with $W$ chosen with elements from a zero-mean, unit-variance Gaussian distribution, and $\Sigma_V = I_V$. We then trained a Bayesian LGSSM with 5 sources and $7\times 7$ transition matrices $A^c$. To bias the model towards finding the simplest sources, we used $\hat A^c \equiv 0_{H_c,H_c}$ for all sources. In Fig. 2a and Fig. 2b we show the original sources and the noisy observations respectively. In Fig. 2c we show the sources estimated by our method after convergence of the hyperparameter updates. Two of the 5 sources have been removed, and the remaining three are a reasonable estimate of the original sources.

Figure 2: (a) Original sources $s_t$. (b) Observations resulting from mixing the original sources, $v_t = W s_t + \eta^v_t$, $\eta^v_t \sim \mathcal{N}(0, I)$. (c) Recovered sources using the Bayesian LGSSM. (d) Sources found with the MAP LGSSM.

Figure 3: (a) Original raw EEG recordings from 4 channels. (b–e) 16 sources $s_t$ estimated by the Bayesian LGSSM.

Another possible approach to introducing prior knowledge is to use a Maximum a Posteriori (MAP) procedure, adding a prior term to the original log-likelihood:

$$\log p(v_{1:T}|A, W, \Sigma_H, \Sigma_V, \mu, \Sigma) + \log p(A|\alpha) + \log p(W|\beta).$$

However, it is not clear how to reliably find the hyperparameters $\alpha$ and $\beta$ in this case. One solution is to estimate them by optimizing the new objective function jointly with respect to the parameters and hyperparameters (the so-called joint MAP estimation; see for example [16]). A typical result of using this joint MAP approach on the artificial data is presented in Fig. 2d. The joint MAP does not estimate the hyperparameters well, and the incorrect number of sources is identified.
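Artificial data of the kind used in this demonstration can be generated along the following lines. This is a sketch under our own assumptions about details the text leaves open (in particular, we simply rescale the random transition blocks to be stable; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
C, Hc, T, V = 3, 5, 300, 3   # 3 sources, 5x5 transition blocks, T time steps
H = C * Hc

# One random transition block per source, scaled to spectral radius 0.95 (assumed)
blocks = []
for _ in range(C):
    Ac = rng.standard_normal((Hc, Hc))
    Ac *= 0.95 / np.max(np.abs(np.linalg.eigvals(Ac)))
    blocks.append(Ac)

h = rng.standard_normal(H)   # h_1 ~ N(0, I), since Sigma = I
S = np.zeros((T, C))
for t in range(T):
    if t > 0:   # block-diagonal dynamics with Sigma_H = I
        h = np.concatenate([blocks[c] @ h[c * Hc:(c + 1) * Hc] for c in range(C)]) \
            + rng.standard_normal(H)
    S[t] = h[::Hc]   # s_t^c: first state component of each block (assumed)

W = rng.standard_normal((V, C))                   # zero-mean unit-variance mixing
Vobs = S @ W.T + rng.standard_normal((T, V))      # v_t = W s_t + noise, Sigma_V = I
```

The resulting `Vobs` plays the role of the noisy observations of Fig. 2b, to which the Bayesian LGSSM would then be fitted.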
4.2 Application to EEG Analysis

In Fig. 3a we plot three seconds of EEG data recorded from 4 channels (located in the right hemisphere) while a person performs imagined movement of the right hand. As is typical in EEG, each channel shows drift terms below 1 Hz, which correspond to artifacts of the instrumentation, together with 50 Hz mains contamination; these mask the rhythmic activity related to the mental task, mainly centred at 10 and 20 Hz [17]. We would therefore like a method that enables us to extract components in these information-rich 10 and 20 Hz frequency bands. Standard ICA methods such as FastICA do not find satisfactory sources based on the raw 'noisy' data, and preprocessing with band-pass filters is usually required. Additionally, in EEG research, flexibility in the number of recovered sources is important, since there may be many independent oscillators of interest underlying the observations, and we would like some way to automatically determine their effective number.

To preferentially find sources at particular frequencies, we specified a block-diagonal matrix $\hat A^c$ for each source $c$, where each block is a $2\times 2$ rotation matrix at the desired frequency. We defined the following 16 groups of frequencies (in Hz): [0.5], [0.5], [0.5], [0.5]; [10,11], [10,11], [10,11], [10,11]; [20,21], [20,21], [20,21], [20,21]; [50], [50], [50], [50]. The temporal evolution of the sources obtained after training the Bayesian LGSSM is given in Fig. 3(b–e) (grouped by frequency range). The Bayesian LGSSM removed 4 unnecessary sources from the mixing matrix $W$, namely one [10,11] Hz and three [20,21] Hz sources. The first 4 sources contain dominant low-frequency drift; sources 5, 6 and 8 contain [10,11] Hz activity, while source 10 contains activity centred at [20,21] Hz. Of the 4 sources initialized to 50 Hz, only 2 retained 50 Hz activity, while the $A^c$ of the other two changed to model other frequencies present in the EEG.
This method demonstrates the usefulness and applicability of the VB method in a real-world situation.

5 Conclusion

We considered the application of Variational Bayesian learning to Linear Gaussian State-Space Models. This is an important class of models with widespread application, and finding a simple way to implement this approximate Bayesian procedure is of considerable interest. The most demanding part of the procedure is inference of the hidden states of the model. Previously, this has been achieved using Belief Propagation, which differs from inference in the Kalman Filtering/Smoothing literature, for which highly efficient and stabilized procedures exist. A central contribution of this paper is to show how inference can be written using the standard Kalman Filtering/Smoothing recursions by augmenting the original model. Additionally, a minor modification to the standard Kalman Filtering routine may be applied for computational efficiency. We demonstrated the elegance and unity of our approach by showing how to easily apply a Variational Bayes analysis of temporal ICA. Specifically, our Bayesian ICA approach successfully extracts independent processes underlying EEG signals, biased towards preferred frequency ranges. We hope that this simple and unifying interpretation of Variational Bayesian LGSSMs may therefore facilitate further application to related models.

A Parameter Updates for A and B

A.1 Determining $q(B|\Sigma_V)$

By examining $\mathcal{F}$, the contribution of $q(B|\Sigma_V)$ can be interpreted as the negative KL divergence between $q(B|\Sigma_V)$ and a Gaussian. Hence, optimally, $q(B|\Sigma_V)$ is a Gaussian. The covariance $[\Sigma_B]_{ij,kl} \equiv \langle (B_{ij} - \langle B_{ij}\rangle)(B_{kl} - \langle B_{kl}\rangle)\rangle$ (averages wrt $q(B|\Sigma_V)$) is given by:

$$[\Sigma_B]_{ij,kl} = [H_B^{-1}]_{jl}[\Sigma_V]_{ik}, \quad \text{where } [H_B]_{jl} \equiv \sum_{t=1}^{T}\big\langle h^j_t h^l_t\big\rangle_{q(h_t)} + \beta_j\delta_{jl}.$$

The mean is given by $\langle B\rangle = N_B H_B^{-1}$, where $[N_B]_{ij} \equiv \sum_{t=1}^{T}\big\langle h^j_t\big\rangle_{q(h_t)} v^i_t + \beta_j\hat B_{ij}$.
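The statistics $H_B$ and $N_B$, and hence $\langle B\rangle = N_B H_B^{-1}$, follow directly from the smoothed moments. A sketch (names are ours), assuming the means $\langle h_t\rangle$ and covariances of $q(h_t)$ are available from Algorithm 1:

```python
import numpy as np

def update_qB(v, h_mean, h_cov, beta, B_hat):
    """Mean of q(B | Sigma_V): <B> = N_B H_B^{-1}.
    v: (T, V) observations; h_mean: (T, H), h_cov: (T, H, H) smoothed moments."""
    # H_B = sum_t <h_t h_t^T> + diag(beta), with <h h^T> = Cov[h] + <h><h>^T
    H_B = (h_cov + np.einsum('tj,tl->tjl', h_mean, h_mean)).sum(axis=0) \
          + np.diag(beta)
    # [N_B]_ij = sum_t v_t^i <h_t^j> + beta_j * B_hat_ij
    N_B = v.T @ h_mean + B_hat @ np.diag(beta)
    return N_B @ np.linalg.inv(H_B)
```

As a sanity check, for very large $\beta_j$ the prior dominates and $\langle B\rangle$ is pulled towards $\hat B$, which is the biasing behaviour the column priors are designed to produce.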
A.2 Determining $q(A|\Sigma_H)$

Optimally, $q(A|\Sigma_H)$ is a Gaussian with covariance

$$[\Sigma_A]_{ij,kl} = [H_A^{-1}]_{jl}[\Sigma_H]_{ik}, \quad \text{where } [H_A]_{jl} \equiv \sum_{t=1}^{T-1}\big\langle h^j_t h^l_t\big\rangle_{q(h_t)} + \alpha_j\delta_{jl}.$$

The mean is given by $\langle A\rangle = N_A H_A^{-1}$, where $[N_A]_{ij} \equiv \sum_{t=2}^{T}\big\langle h^j_{t-1} h^i_t\big\rangle_{q(h_{t-1:t})} + \alpha_j\hat A_{ij}$.

B Covariance Updates

By specifying a Wishart prior for the inverse of the covariances, conjugate update formulae are possible. In practice, it is more common to specify diagonal inverse covariances, for which the corresponding priors are simply Gamma distributions [7, 5]. For this simple diagonal case, the explicit updates are given below.

Determining $q(\Sigma_V)$

For the constraint $\Sigma_V^{-1} = \mathrm{diag}(\rho)$, where each diagonal element follows a Gamma prior $\mathrm{Ga}(b_1, b_2)$ [7], $q(\rho)$ factorizes and the optimal updates are

$$q(\rho_i) = \mathrm{Ga}\Big(b_1 + \tfrac{T}{2},\; b_2 + \tfrac{1}{2}\Big[\sum_{t=1}^{T}(v^i_t)^2 - [G_B]_{ii} + \sum_j \beta_j\hat B^2_{ij}\Big]\Big),$$

where $G_B \equiv N_B H_B^{-1} N_B^T$.

Determining $q(\Sigma_H)$

Analogously, for $\Sigma_H^{-1} = \mathrm{diag}(\tau)$ with prior $\mathrm{Ga}(a_1, a_2)$ [5], the updates are

$$q(\tau_i) = \mathrm{Ga}\Big(a_1 + \tfrac{T-1}{2},\; a_2 + \tfrac{1}{2}\Big[\sum_{t=2}^{T}\big\langle (h^i_t)^2\big\rangle - [G_A]_{ii} + \sum_j \alpha_j\hat A^2_{ij}\Big]\Big),$$

where $G_A \equiv N_A H_A^{-1} N_A^T$.

Acknowledgments

This work is supported by the European DIRAC Project FP6-0027787. This paper only reflects the authors' views and funding agencies are not liable for any use that may be made of the information contained herein.

References

[1] Y. Bar-Shalom and X.-R. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, 1998.
[2] M. S. Grewal and A. P. Andrews. Kalman Filtering: Theory and Practice Using MATLAB. John Wiley and Sons, Inc., 2001.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications. Springer, 2000.
[4] M. J. Beal, F. Falciani, Z. Ghahramani, C. Rangel, and D. L. Wild. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics, 21:349–356, 2005.
[5] A. T. Cemgil and S. J. Godsill. Probabilistic phase vocoder and its application to interpolation of missing values in audio signals.
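The diagonal-noise update for $q(\Sigma_V)$ translates directly into code. A sketch (names are ours) returning the Gamma posterior parameters for each precision $\rho_i$:

```python
import numpy as np

def update_q_rho(v, G_B, beta, B_hat, b1, b2):
    """Gamma posterior for the observation precisions rho_i:
    shape = b1 + T/2,
    rate  = b2 + (sum_t (v_t^i)^2 - [G_B]_ii + sum_j beta_j * B_hat_ij^2) / 2."""
    T = v.shape[0]
    shape = b1 + 0.5 * T
    rate = b2 + 0.5 * ((v ** 2).sum(axis=0) - np.diag(G_B) + (B_hat ** 2) @ beta)
    return shape, rate   # posterior mean precision: E[rho_i] = shape / rate[i]
```

The update for $q(\tau_i)$ is structurally identical, with $(v^i_t)^2$ replaced by the smoothed second moments $\langle (h^i_t)^2\rangle$, $G_B$ by $G_A$, and $(b_1, b_2, \beta, \hat B)$ by $(a_1, a_2, \alpha, \hat A)$.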
In 13th European Signal Processing Conference, 2005.
[6] H. Valpola and J. Karhunen. An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Computation, 14:2647–2692, 2002.
[7] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[8] M. Davy and S. J. Godsill. Bayesian harmonic models for musical signal analysis (with discussion). In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics VII. Oxford University Press, 2003.
[9] D. J. C. MacKay. Ensemble learning and evidence maximisation. Unpublished manuscript: www.variational-bayes.org, 1995.
[10] M. Morf and T. Kailath. Square-root algorithms for least-squares estimation. IEEE Transactions on Automatic Control, 20:487–497, 1975.
[11] P. Park and T. Kailath. New square-root smoothing algorithms. IEEE Transactions on Automatic Control, 41:727–732, 1996.
[12] E. Niedermeyer and F. Lopes da Silva. Electroencephalography: Basic Principles, Clinical Applications and Related Fields. Lippincott Williams and Wilkins, 1999.
[13] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11:305–345, 1999.
[14] M. Verhaegen and P. Van Dooren. Numerical aspects of different Kalman filter implementations. IEEE Transactions on Automatic Control, 31:907–917, 1986.
[15] S. Chiappa and D. Barber. Bayesian linear Gaussian state-space models for biosignal decomposition. Signal Processing Letters, 14, 2007.
[16] S. S. Saquib, C. A. Bouman, and K. Sauer. ML parameter estimation for Markov random fields with applications to Bayesian tomography. IEEE Transactions on Image Processing, 7:1029–1044, 1998.
[17] G. Pfurtscheller and F. H. Lopes da Silva. Event-related EEG/MEG synchronization and desynchronization: basic principles. Clinical Neurophysiology, pages 1842–1857, 1999.