nips nips2006 nips2006-93 knowledge-graph by maker-knowledge-mining

93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms

Source: pdf

Author: Xinhua Zhang, Wee S. Lee

Abstract: Semi-supervised learning algorithms have been successfully applied in many applications with scarce labeled data, by utilizing the unlabeled data. One important category is graph based semi-supervised learning algorithms, for which the performance depends considerably on the quality of the graph, or its hyperparameters. In this paper, we deal with the less explored problem of learning the graphs. We propose a graph learning method for the harmonic energy minimization method; this is done by minimizing the leave-one-out prediction error on labeled data points. We use a gradient based method and design an efficient algorithm which significantly accelerates the calculation of the gradient by applying the matrix inversion lemma and using careful pre-computation. Experimental results show that the graph learning method is effective in improving the performance of the classification algorithm.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Semi-supervised learning algorithms have been successfully applied in many applications with scarce labeled data, by utilizing the unlabeled data. [sent-7, score-0.236]

2 One important category is graph based semi-supervised learning algorithms, for which the performance depends considerably on the quality of the graph, or its hyperparameters. [sent-8, score-0.125]

3 We propose a graph learning method for the harmonic energy minimization method; this is done by minimizing the leave-one-out prediction error on labeled data points. [sent-10, score-0.319]

4 We use a gradient based method and designed an efﬁcient algorithm which significantly accelerates the calculation of the gradient by applying the matrix inversion lemma and using careful pre-computation. [sent-11, score-0.268]

5 Experimental results show that the graph learning method is effective in improving the performance of the classiﬁcation algorithm. [sent-12, score-0.106]

6 1 Introduction Recently, graph based semi-supervised learning algorithms have been used successfully in various machine learning problems including classiﬁcation, regression, ranking, and dimensionality reduction. [sent-13, score-0.106]

7 These methods create graphs whose vertices correspond to the labeled and unlabeled data while the edge weights encode the similarity between each pair of data points. [sent-14, score-0.277]

8 Classiﬁcation is performed using these graphs by labeling unlabeled data in such a way that instances connected by large weights are given similar labels. [sent-15, score-0.185]

9 Example graph based semi-supervised algorithms include min-cut [3], harmonic energy minimization [11], and spectral graphical transducer [8]. [sent-16, score-0.259]

10 The performance of the classiﬁer depends considerably on the similarity measure of the graph, which is normally deﬁned in two steps. [sent-17, score-0.047]

11 Firstly, the weights are deﬁned locally in a pair-wise parametric form using functions that are essentially based on a distance metric such as radial basis functions (RBF). [sent-18, score-0.055]

12 As the distance metric is an important part of graph based semi-supervised learning, it is crucial to use a good distance metric. [sent-20, score-0.182]

13 In the second step, smoothing is applied globally, typically, based on the spectral transformation of the graph Laplacian [6, 10]. [sent-21, score-0.139]

14 There have been only a few existing approaches which address the problem of graph learning. [sent-22, score-0.106]

15 [13] learns a nonparametric spectral transformation of the graph Laplacian, assuming that the weight and distance metric are given. [sent-23, score-0.228]

16 [9] learns the spectral parameters by performing evidence maximization using approximate inference and gradient descent. [sent-24, score-0.113]

17 [12] uses evidence maximization and Laplace approximation to learn simple parameters of the similarity function. [sent-25, score-0.047]

18 Instead of learning one single good graph, [4] proposed building robust graphs by applying random perturbation and edge removal [footnote ∗: This work was done when the author was at the National University of Singapore.] [sent-26, score-0.032]

19 Closest to our work is [11], which learns different bandwidths for different dimensions by minimizing the entropy on unlabeled data; like the maximum margin motivation in transductive SVM, the aim here is to get conﬁdent labeling of the data by the algorithm. [sent-29, score-0.284]

20 In this paper, we propose a new algorithm to learn the hyperparameters of distance metric, or more speciﬁcally, the bandwidth for different dimensions in the RBF form. [sent-30, score-0.148]

21 In essence, these bandwidths are just model parameters, and normal model selection methods, including k-fold cross validation or, in the extreme case, leave-one-out (LOO) cross validation, can be used for selecting the bandwidths. [sent-31, score-0.16]

22 Motivated by the same spirit, we base our learning algorithm on the aim of achieving low LOO prediction loss on labeled data, i. [sent-32, score-0.092]

23 , each labeled data can be correctly classiﬁed by the other labeled data in a semi-supervised style with as high probability as possible. [sent-34, score-0.211]

24 This idea is similar to [5] which learns multiple parameters for SVM. [sent-35, score-0.034]

25 Since most LOO style algorithms are plagued with prohibitive computational cost, an efﬁcient algorithm is designed. [sent-36, score-0.046]

26 With a simple regularizer, the experimental results show that learning the hyperparameters by minimizing the LOO loss is effective. [sent-37, score-0.054]

27 2 Graph Based Semi-supervised Learning Suppose we have a set of labeled data points {(xi , yi )} for i ∈ L {1, . [sent-38, score-0.117]

28 In addition, we also have a set of unlabeled data points {xi } for i ∈ U {l + 1, . [sent-45, score-0.125]

29 1 Graph Based Classification Algorithms One of the earliest graph based semi-supervised learning algorithms is min-cut by [3], which minimizes: $E(f) \triangleq \sum_{i,j} w_{ij} (f_i - f_j)^2$ (1) where the nonnegative $w_{ij}$ encodes the similarity between instances i and j. [sent-52, score-0.469]

30 The optimization variables fi (i ∈ U ) are constrained to {1, 0}. [sent-54, score-0.128]

31 [11] relaxed the constraint fi ∈ {1, 0} (i ∈ U ) to real numbers. [sent-56, score-0.128]

32 The optimal solution of the unlabeled data's soft labels can be written neatly as: $f_U = (D_U - W_{UU})^{-1} W_{UL} f_L = (I - P_{UU})^{-1} P_{UL} f_L$ (2) where $f_L$ is the vector of soft labels (fixed to $y_i$) for L. [sent-57, score-0.326]

33 $D \triangleq \mathrm{diag}(d_i)$, where $d_i \triangleq \sum_j w_{ij}$ and $D_U$ is the submatrix of D associated with unlabeled data. [sent-58, score-0.263]

34 All fi (i ∈ U ) are automatically bounded by [0, 1], so it is also known as square interpolation. [sent-62, score-0.128]

35 They can be interpreted by using Markov random walk on the graph. [sent-63, score-0.024]

36 Imagine a graph with n nodes corresponding to the n data points. [sent-64, score-0.106]

37 Deﬁne the probability of transferring from xi to xj as pij , which is actually row-wise normalization of wij . [sent-65, score-0.224]

38 The random walk starts from any unlabeled points, and stops once it hits any labeled point (absorbing boundary). [sent-66, score-0.263]

39 Then fi is the probability of hitting a positive labeled point. [sent-67, score-0.242]

40 In this sense, the labeling of each unlabeled point is largely based on its neighboring labeled points, which helps to alleviate the problem of noisy data. [sent-68, score-0.245]

41 (1) can also be interpreted as a quadratic energy function and its minimizer is known to be harmonic: $f_i$ (i ∈ U) equals the average of $f_j$ (j ≠ i) weighted by $p_{ij}$. [sent-69, score-0.298]
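The harmonic property suggests a simple way to compute the soft labels without a matrix inverse: repeatedly replace each unlabeled value by the weighted average of its neighbours until nothing changes. The sketch below does this with Gauss-Seidel sweeps; it is an illustrative alternative to the closed form (2), not the authors' implementation. The function name `harmonic_labels`, its parameters, and the assumption that every unlabeled node has at least one neighbour are all hypothetical.

```python
def harmonic_labels(W, labels, n_iter=1000, tol=1e-10):
    """Soft labels by iterative averaging (assumed helper, not from the paper).

    W      : symmetric nonnegative weight matrix as a list of lists.
    labels : dict {node index: 0 or 1} for the labeled nodes.
    Returns a list f with f[i] clamped to the label for labeled nodes.
    """
    n = len(W)
    # Labeled nodes are clamped; unlabeled nodes start at 0.5.
    f = [float(labels.get(i, 0.5)) for i in range(n)]
    unlabeled = [i for i in range(n) if i not in labels]
    for _ in range(n_iter):
        max_change = 0.0
        for i in unlabeled:
            # Harmonic property: f_i is the weighted average of f_j over j != i.
            num = sum(W[i][j] * f[j] for j in range(n) if j != i)
            den = sum(W[i][j] for j in range(n) if j != i)
            new_value = num / den
            max_change = max(max_change, abs(new_value - f[i]))
            f[i] = new_value
        if max_change < tol:
            break
    return f
```

On a four-node chain with the two end points labeled 1 and 0, the two interior nodes converge to 2/3 and 1/3, which matches the absorbing random walk interpretation.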

42 Finally, to translate the soft labels fi to hard labels pos/neg, the simplest way is by thresholding at 0. [sent-75, score-0.232]

43 [11] proposed another approach, called Class Mass Normalization (CMN), to make use of prior information such as class ratio in unlabeled data, estimated by that in labeled data. [sent-77, score-0.217]

44 Specifically, they normalize the soft labels to $f_i^+ \triangleq f_i / \sum_{j=1}^{n} f_j$ as the probabilistic score of being positive, and to $f_i^- \triangleq (1 - f_i) / \sum_{j=1}^{n} (1 - f_j)$ as the score of being negative. [sent-78, score-0.488]

45 Suppose there are $r^+$ positive points and $r^-$ negative points in the labeled data, then we classify $x_i$ to positive iff $f_i^+ r^+ > f_i^- r^-$. [sent-79, score-0.118]
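As a rough illustration of the CMN decision rule, the sketch below normalizes the soft labels by their class mass and compares the two scaled scores. The function name `cmn_classify` and its argument names are hypothetical, and the list `f` is assumed to hold exactly the soft labels over which the class mass is summed.

```python
def cmn_classify(f, r_pos, r_neg):
    """Class Mass Normalization decision (illustrative sketch).

    f     : soft labels in (0, 1) over which the class mass is summed.
    r_pos : number of positive labeled points.
    r_neg : number of negative labeled points.
    Returns a list of booleans, True meaning "classified positive".
    """
    mass_pos = sum(f)
    mass_neg = sum(1.0 - value for value in f)
    decisions = []
    for value in f:
        score_pos = value / mass_pos * r_pos          # f_i^+ * r^+
        score_neg = (1.0 - value) / mass_neg * r_neg  # f_i^- * r^-
        decisions.append(score_pos > score_neg)
    return decisions
```

With soft labels `[0.6, 0.6, 0.9]` and a labeled set containing one positive and three negatives, plain thresholding would call all three points positive, whereas this rule keeps only the last one positive because the class prior favours the negative class.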

46 2 Basic Hyperparameter Learning Algorithms One of the simplest parametric forms of $w_{ij}$ is RBF: $w_{ij} = \exp\left(-\sum_d (x_{i,d} - x_{j,d})^2 / \sigma_d^2\right)$ (3) where $x_{i,d}$ is the d-th component of $x_i$, and likewise the meaning of $f_{U,i}$ in (4). [sent-81, score-0.281]
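The RBF weight with one bandwidth per dimension can be sketched in a few lines. The helper names `rbf_weight` and `weight_matrix` are made up for this example, and the code assumes all bandwidths are strictly positive.

```python
import math


def rbf_weight(x_i, x_j, sigma):
    """w_ij = exp(-sum_d (x_id - x_jd)^2 / sigma_d^2), one bandwidth per dimension."""
    total = sum((a - b) ** 2 / s_d ** 2 for a, b, s_d in zip(x_i, x_j, sigma))
    return math.exp(-total)


def weight_matrix(X, sigma):
    """Dense weight matrix for a list of points X (list of lists of floats)."""
    return [[rbf_weight(x_i, x_j, sigma) for x_j in X] for x_i in X]
```

A very large bandwidth on one dimension makes differences along that dimension almost irrelevant to the weight, which is the sense in which learning separate bandwidths acts as a soft form of feature selection.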

47 The bandwidth σd has considerable inﬂuence on the classiﬁcation accuracy. [sent-82, score-0.043]

48 HEM uses one common bandwidth for all dimensions, which can be easily selected by cross validation. [sent-83, score-0.077]

49 However, it will be desirable to learn different σd for different dimensions; this allows a form of feature selection. [sent-84, score-0.019]

50 [11] proposed learning the hyperparameters $\sigma_d$ by minimizing the entropy on unlabeled data points (we call it MinEnt): $H(f_U) = -\sum_{i=1}^{u} \left( f_{U,i} \log f_{U,i} + (1 - f_{U,i}) \log(1 - f_{U,i}) \right)$ (4) The optimization is conducted by gradient descent. [sent-85, score-0.225]
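The MinEnt objective (4) is straightforward to evaluate once the soft labels are known. The sketch below is a minimal version; the name `label_entropy` and the clipping constant `eps`, which guards against taking the logarithm of zero, are assumptions added for numerical safety and are not part of the source.

```python
import math


def label_entropy(f_u, eps=1e-12):
    """Entropy H(f_U) of the soft labels on unlabeled points, as in (4)."""
    total = 0.0
    for value in f_u:
        # Clip into (0, 1) so that log is always defined (an added safeguard).
        value = min(max(value, eps), 1.0 - eps)
        total -= value * math.log(value) + (1.0 - value) * math.log(1.0 - value)
    return total
```

Confident soft labels close to 0 or 1 give a low entropy, while labels near 0.5 give a high one, so minimizing this quantity pushes the graph towards confident labelings.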

51 3 Leave-one-out Hyperparameter Learning In this section, we present the formulation and efﬁcient calculation of our graph learning algorithm. [sent-87, score-0.146]

52 1 Formulation and Efﬁcient Calculation We propose a graph learning algorithm which is similar to minimizing the leave-one-out cross validation error. [sent-89, score-0.189]

53 Suppose we hold out a labeled example $x_t$ and predict its label by using the rest of the labeled and unlabeled examples. [sent-90, score-0.359]

54 Making use of the result in (2), the soft label for $x_t$ is $s^\top f_U^t$ (the first component of $f_U^t$), where $s \triangleq (1, 0,$ . [sent-91, score-0.102]

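55 , $f_l)^\top$, $\tilde{p}_{ij} \triangleq (1 - \varepsilon) p_{ij} + \varepsilon/n$, $P_{UU}^t \triangleq \begin{pmatrix} p_{tt} & p_{tU} \\ p_{Ut} & P_{UU} \end{pmatrix}$, $p_{Ut} \triangleq (p_{l+1,t},$ . [sent-103, score-0.292]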

56 ··· ··· ··· ··· ··· ···  pn,1 · · · pn,t−1 pn,t+1 · · · pn,l t If xt is positive, then we hope that fU,1 is as close to 1 as possible. [sent-110, score-0.07]

57 Otherwise, if $x_t$ is negative, we hope that $f_{U,1}^t$ is as close to 0 as possible. [sent-111, score-0.07]

58 So the cost function to be minimized can be written as: $Q = \sum_{t=1}^{l} h_t(f_{U,1}^t) = \sum_{t=1}^{l} h_t\left(s^\top (I - \tilde{P}_{UU}^t)^{-1} \tilde{P}_{UL}^t f_L^t\right)$ (5) where $h_t(x)$ is the cost function for instance t. [sent-112, score-0.448]

59 We denote $h_t(x) = h_+(x)$ for $y_t = 1$ and $h_t(x) = h_-(x)$ for $y_t = 0$. [sent-113, score-0.322]
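To make the leave-one-out objective concrete, the sketch below evaluates Q for a given weight matrix by holding out each labeled node in turn and solving for its soft label with fixed-point sweeps instead of a matrix inverse. Several choices here are assumptions rather than the source's: the function name `loo_objective`, the convention that the first `len(y)` nodes are the labeled ones, the default `eps`, the fixed sweep count, and in particular the squared loss used for $h_+$ and $h_-$, which the text above names but does not define.

```python
def loo_objective(W, y, eps=0.01, n_iter=500):
    """Leave-one-out loss Q for weight matrix W (illustrative sketch).

    W : n x n nonnegative weights; nodes 0..len(y)-1 are the labeled ones.
    y : list of 0/1 labels for the labeled nodes.
    """
    n, l = len(W), len(y)
    # Row-normalize, then smooth: p~_ij = (1 - eps) * p_ij + eps / n.
    P = []
    for row in W:
        d = sum(row)
        P.append([(1.0 - eps) * w / d + eps / n for w in row])
    Q = 0.0
    for t in range(l):
        # The held-out node t joins the unlabeled set; other labels stay clamped.
        free = [t] + list(range(l, n))
        f = [float(y[i]) if i < l else 0.5 for i in range(n)]
        f[t] = 0.5
        for _ in range(n_iter):
            for i in free:
                # Sweep towards the fixed point f_U = P~_UU f_U + P~_UL f_L.
                f[i] = sum(P[i][j] * f[j] for j in range(n))
        # Assumed losses: h_+(x) = (1 - x)^2 and h_-(x) = x^2.
        Q += (1.0 - f[t]) ** 2 if y[t] == 1 else f[t] ** 2
    return Q
```

A graph whose strong edges connect points of the same class yields a much smaller Q than one whose strong edges cross classes, which is the signal the hyperparameter search exploits. This direct evaluation repeats the full solve for every held-out point and is only practical for small graphs.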

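60 The gradient is: $\partial Q / \partial \sigma_d = \sum_{t=1}^{l} h_t'(f_{U,1}^t)\, s^\top (I - \tilde{P}_{UU}^t)^{-1} \left( \frac{\partial \tilde{P}_{UU}^t}{\partial \sigma_d} f_U^t + \frac{\partial \tilde{P}_{UL}^t}{\partial \sigma_d} f_L^t \right)$, using matrix property $dX^{-1} = -X^{-1} (dX) X^{-1}$. [sent-118, score-0.174]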

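61 Denoting $(\beta^t)^\top \triangleq h_t'(f_{U,1}^t)\, s^\top (I - \tilde{P}_{UU}^t)^{-1}$ and noting $\tilde{P} = \varepsilon U + (1 - \varepsilon) P$, we have $\partial Q / \partial \sigma_d = (1 - \varepsilon) \sum_{t=1}^{l} (\beta^t)^\top \left( \frac{\partial P_{UU}^t}{\partial \sigma_d} f_U^t + \frac{\partial P_{UL}^t}{\partial \sigma_d} f_L^t \right)$ (6). [sent-119, score-0.148]

62 Since in both $P_{UU}^t$ and $P_{UL}^t$ the first row corresponds to $x_t$, and the i-th row (i ≥ 2) corresponds to $x_{i+l-1}$, denoting $P_{UN}^t \triangleq (P_{UL}^t \; P_{UU}^t)$ makes sense as each row of $P_{UN}^t$ corresponds to a well defined single data point. [sent-120, score-0.132]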

63 We now use $sw_i^t \triangleq \sum_{k=1}^{n} w_{UN}^t(i, k)$ and $\sum_{k=1}^{n} \partial w_{UN}^t(i, k)/\partial \sigma_d$ (i = 1, . [sent-122, score-0.28]

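64 Now (6) can be rewritten in ground terms by the following “two” equations: $\frac{\partial P_{U\bullet}^t(i,j)}{\partial \sigma_d} = (sw_i^t)^{-1} \left( \frac{\partial w_{U\bullet}^t(i,j)}{\partial \sigma_d} - p_{U\bullet}^t(i,j) \sum_{k=1}^{n} \frac{\partial w_{UN}^t(i,k)}{\partial \sigma_d} \right)$, where • can be U or L. [sent-126, score-0.062]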
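The row-normalization derivative can be checked numerically on a tiny example. The sketch below applies the quotient rule to a single row of transition probabilities, $p_{ij} = w_{ij} / \sum_k w_{ik}$, for an ordinary (not held-out) graph. It uses $\partial w_{ik}/\partial \sigma_d = 2 w_{ik} (x_{i,d} - x_{k,d})^2 / \sigma_d^3$, which is derived here from the RBF form (3) and is not quoted from the source. The helper names `rbf_row` and `dp_dsigma` are hypothetical.

```python
import math


def rbf_row(x, X, sigma):
    """Weights from point x to every point in X under the RBF form (3)."""
    return [
        math.exp(-sum((a - b) ** 2 / s ** 2 for a, b, s in zip(x, x_k, sigma)))
        for x_k in X
    ]


def dp_dsigma(i, j, d, X, sigma):
    """Analytic derivative of p_ij = w_ij / sum_k w_ik with respect to sigma_d."""
    w = rbf_row(X[i], X, sigma)
    sw = sum(w)  # row sum, playing the role of sw_i
    # Derivative of each weight in row i (derived from (3), an assumption here).
    dw = [
        w[k] * 2.0 * (X[i][d] - X[k][d]) ** 2 / sigma[d] ** 3
        for k in range(len(X))
    ]
    # Quotient rule: (dw_ij - p_ij * sum_k dw_ik) / sw_i.
    return (dw[j] - (w[j] / sw) * sum(dw)) / sw
```

Comparing the returned value against a central finite difference of $p_{ij}$ is a cheap way to catch sign or index mistakes before wiring the gradient into an optimizer.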

65 The naïve way to calculate the function value Q and its gradient is presented in Algorithm 1. [sent-128, score-0.046]

66 Algorithm 1 naïve form of LOOHL 1: function value Q ← 0, gradient g ← (0, . [sent-130, score-0.046]

67 , l (leave-one-out loop for each labeled point) do 3: $f_L^t \leftarrow (f_1,$ . [sent-136, score-0.092]

68 , $f_l)^\top$, $f_U^t \leftarrow (I - \tilde{P}_{UU}^t)^{-1} \tilde{P}_{UL}^t f_L^t$, $Q \leftarrow Q + h_t(f_{U,1}^t)$, $(\beta^t)^\top \leftarrow h_t'(f_{U,1}^t)\, s^\top (I - \tilde{P}_{UU}^t)^{-1}$ 4: for each d = 1, . [sent-141, score-0.466]

69 , l − 1 t fL The computational complexity of the naïve algorithm is high: $O(lu(mn + u^2))$, just to calculate the gradient once. [sent-153, score-0.046]

70 Here we assume the cost of inverting a u × u matrix is O(u3 ). [sent-154, score-0.066]

71 We reduce the two terms in the cost by means of using matrix inversion lemma and careful pre-computation. [sent-155, score-0.149]

72 One part of the cost, $O(lu^3)$, stems from inverting $I - \tilde{P}_{UU}^t$, a (u + 1) × (u + 1) matrix, l times in (5). [sent-156, score-0.034]

73 We note that for different t, $I - \tilde{P}_{UU}^t$ differs only by the first row and first column. [sent-157, score-0.021]

74 With $I - \tilde{P}_{UU}^t$ expressed in this form, we are ready to apply the matrix inversion lemma: $(A + \alpha\beta^\top)^{-1} = A^{-1} - \dfrac{A^{-1} \alpha\, \beta^\top A^{-1}}{1 + \beta^\top A^{-1} \alpha}$ (7). [sent-162, score-0.053]

75 We only need to invert $I - \tilde{P}_{UU}^t$ for t = 1 from scratch, and then apply (7) twice for each subsequent t. The new total complexity related to matrix inversion is $O(u^3 + lu^2)$. [sent-163, score-0.073]

76 The other part of the cost, O(lumn), can be reduced by using careful pre-computation. [sent-165, score-0.042]
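The rank-one update in the matrix inversion lemma can be sketched as follows. It takes an already computed inverse and returns the inverse of the perturbed matrix in $O(n^2)$ operations. The function name `sherman_morrison` is hypothetical, and the code assumes the denominator $1 + \beta^\top A^{-1} \alpha$ is nonzero.

```python
def sherman_morrison(A_inv, alpha, beta):
    """Return (A + alpha beta^T)^{-1} given A^{-1}, using O(n^2) operations.

    A_inv : inverse of A as a list of lists.
    alpha : column vector of the rank-one update (list of floats).
    beta  : row vector of the rank-one update (list of floats).
    """
    n = len(A_inv)
    # u = A^{-1} alpha (a column) and v = beta^T A^{-1} (a row).
    u = [sum(A_inv[i][k] * alpha[k] for k in range(n)) for i in range(n)]
    v = [sum(beta[k] * A_inv[k][j] for k in range(n)) for j in range(n)]
    denom = 1.0 + sum(beta[k] * u[k] for k in range(n))
    # A^{-1} minus the outer product of u and v, scaled by 1 / denom.
    return [
        [A_inv[i][j] - u[i] * v[j] / denom for j in range(n)]
        for i in range(n)
    ]
```

Replacing one row and one column of a matrix amounts to two rank-one changes, so calling such an update twice per held-out point avoids recomputing the full inverse each time.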

77 Algorithm 2 below presents the efficient approach to gradient calculation. [sent-168, score-0.046]

78 Algorithm 2 Efficient algorithm for gradient calculation 1: for i, j = 1, . [sent-169, score-0.086]

79 , n do 2: for all feature dimensions d on which either $x_i$ or $x_j$ is nonzero do 3: $g_d = g_d + \alpha_{ij} \cdot \partial w_{ij} / \partial \sigma_d$ 4: end for 5: end for Figure 1: Examples of degenerative graphs learned by pure LOOHL. [sent-172, score-0.304]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pu', 0.726), ('swi', 0.28), ('fl', 0.21), ('wu', 0.209), ('fu', 0.188), ('pik', 0.161), ('fi', 0.128), ('ht', 0.128), ('unlabeled', 0.125), ('wij', 0.118), ('loo', 0.111), ('graph', 0.106), ('gd', 0.102), ('labeled', 0.092), ('fj', 0.077), ('pl', 0.071), ('ij', 0.07), ('pt', 0.062), ('wii', 0.061), ('ru', 0.061), ('pij', 0.06), ('hyperparameter', 0.053), ('inversion', 0.053), ('soft', 0.052), ('hem', 0.051), ('loohl', 0.051), ('ptu', 0.051), ('xt', 0.05), ('fk', 0.049), ('ft', 0.049), ('harmonic', 0.046), ('gradient', 0.046), ('uu', 0.044), ('pit', 0.044), ('singapore', 0.044), ('bandwidth', 0.043), ('careful', 0.042), ('australia', 0.041), ('bandwidths', 0.04), ('pii', 0.04), ('calculation', 0.04), ('na', 0.035), ('cross', 0.034), ('metric', 0.034), ('du', 0.034), ('inverting', 0.034), ('learns', 0.034), ('dimensions', 0.034), ('yt', 0.033), ('energy', 0.033), ('spectral', 0.033), ('canberra', 0.032), ('cost', 0.032), ('graphs', 0.032), ('hyperparameters', 0.031), ('rbf', 0.03), ('labeling', 0.028), ('similarity', 0.028), ('style', 0.027), ('xi', 0.026), ('suppose', 0.026), ('labels', 0.026), ('validation', 0.026), ('yi', 0.025), ('national', 0.024), ('classi', 0.024), ('walk', 0.024), ('minimizing', 0.023), ('wee', 0.022), ('hits', 0.022), ('earliest', 0.022), ('hitting', 0.022), ('ptt', 0.022), ('transducer', 0.022), ('uij', 0.022), ('xinhua', 0.022), ('lemma', 0.022), ('laplacian', 0.021), ('end', 0.021), ('row', 0.021), ('dx', 0.021), ('distance', 0.021), ('hope', 0.02), ('anu', 0.02), ('invert', 0.02), ('absorbing', 0.02), ('submatrix', 0.02), ('scratch', 0.02), ('neatly', 0.02), ('transferring', 0.02), ('denoting', 0.02), ('ve', 0.02), ('th', 0.019), ('learn', 0.019), ('minimization', 0.019), ('considerably', 0.019), ('accelerates', 0.019), ('laplacians', 0.019), ('plagued', 0.019), ('scarce', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms

Author: Xinhua Zhang, Wee S. Lee

2 0.086438395 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models

Author: David Barber, Silvia Chiappa

Abstract: Linear Gaussian State-Space Models are widely used and a Bayesian treatment of parameters is therefore of considerable interest. The approximate Variational Bayesian method applied to these models is an attractive approach, used successfully in applications ranging from acoustics to bioinformatics. The most challenging aspect of implementing the method is in performing inference on the hidden state sequence of the model. We show how to convert the inference problem so that standard Kalman Filtering/Smoothing recursions from the literature may be applied. This is in contrast to previously published approaches based on Belief Propagation. Our framework both simpliﬁes and uniﬁes the inference problem, so that future applications may be more easily developed. We demonstrate the elegance of the approach on Bayesian temporal ICA, with an application to ﬁnding independent dynamical processes underlying noisy EEG signals. 1 Linear Gaussian State-Space Models Linear Gaussian State-Space Models (LGSSMs)1 are fundamental in time-series analysis [1, 2, 3]. In these models the observations v1:T 2 are generated from an underlying dynamical system on h1:T according to: v v vt = Bht + ηt , ηt ∼ N (0V , ΣV ), h h ht = Aht−1 + ηt , ηt ∼ N (0H , ΣH ) , where N (µ, Σ) denotes a Gaussian with mean µ and covariance Σ, and 0X denotes an Xdimensional zero vector. The observation vt has dimension V and the hidden state ht has dimension H. Probabilistically, the LGSSM is deﬁned by: T p(v1:T , h1:T |Θ) = p(v1 |h1 )p(h1 ) p(vt |ht )p(ht |ht−1 ), t=2 with p(vt |ht ) = N (Bht , ΣV ), p(ht |ht−1 ) = N (Aht−1 , ΣH ), p(h1 ) = N (µ, Σ) and where Θ = {A, B, ΣH , ΣV , µ, Σ} denotes the model parameters. Because of the widespread use of these models, a Bayesian treatment of parameters is of considerable interest [4, 5, 6, 7, 8]. An exact implementation of the Bayesian LGSSM is formally intractable [8], and recently a Variational Bayesian (VB) approximation has been studied [4, 5, 6, 7, 9]. 
The most challenging part of implementing the VB method is performing inference over h1:T , and previous authors have developed their own specialized routines, based on Belief Propagation, since standard LGSSM inference routines appear, at ﬁrst sight, not to be applicable. 1 2 Also called Kalman Filters/Smoothers, Linear Dynamical Systems. v1:T denotes v1 , . . . , vT . A key contribution of this paper is to show how the Variational Bayesian treatment of the LGSSM can be implemented using standard LGSSM inference routines. Based on the insight we provide, any standard inference method may be applied, including those speciﬁcally addressed to improve numerical stability [2, 10, 11]. In this article, we decided to describe the predictor-corrector and Rauch-Tung-Striebel recursions [2], and also suggest a small modiﬁcation that reduces computational cost. The Bayesian LGSSM is particularly of interest when strong prior constraints are needed to ﬁnd adequate solutions. One such case is in EEG signal analysis, whereby we wish to extract sources that evolve independently through time. Since EEG is particularly noisy [12], a prior that encourages sources to have preferential dynamics is advantageous. This application is discussed in Section 4, and demonstrates the ease of applying our VB framework. 2 Bayesian Linear Gaussian State-Space Models In the Bayesian treatment of the LGSSM, instead of considering the model parameters Θ as ﬁxed, ˆ ˆ we deﬁne a prior distribution p(Θ|Θ), where Θ is a set of hyperparameters. Then: ˆ p(v1:T |Θ) = ˆ p(v1:T |Θ)p(Θ|Θ) . (1) Θ In a full Bayesian treatment we would deﬁne additional prior distributions over the hyperparameters ˆ Θ. Here we take instead the ML-II (‘evidence’) framework, in which the optimal set of hyperpaˆ ˆ rameters is found by maximizing p(v1:T |Θ) with respect to Θ [6, 7, 9]. 
For the parameter priors, here we deﬁne Gaussians on the columns of A and B 3 : H e− p(A|α, ΣH ) ∝ αj 2 ˆ ( A j −A j ) T ˆ Σ−1 (Aj −Aj ) H H , e− p(B|β, ΣV ) ∝ j=1 βj 2 T ˆ (Bj −Bj ) ˆ Σ−1 (Bj −Bj ) V , j=1 ˆ ˆ which has the effect of biasing the transition and emission matrices to desired forms A and B. The −1 −1 4 conjugate priors for general inverse covariances ΣH and ΣV are Wishart distributions [7] . In the simpler case assumed here of diagonal covariances these become Gamma distributions [5, 7]. The ˆ hyperparameters are then Θ = {α, β}5 . Variational Bayes ˆ Optimizing Eq. (1) with respect to Θ is difﬁcult due to the intractability of the integrals. Instead, in VB, one considers the lower bound [6, 7, 9]6 : ˆ ˆ L = log p(v1:T |Θ) ≥ Hq (Θ, h1:T ) + log p(Θ|Θ) q(Θ) + E(h1:T , Θ) q(Θ,h1:T ) ≡ F, where E(h1:T , Θ) ≡ log p(v1:T , h1:T |Θ). Hd (x) signiﬁes the entropy of the distribution d(x), and · d(x) denotes the expectation operator. The key approximation in VB is q(Θ, h1:T ) ≡ q(Θ)q(h1:T ), from which one may show that, for optimality of F, ˆ E(h1:T ,Θ) q(h1:T ) . q(h1:T ) ∝ e E(h1:T ,Θ) q(Θ) , q(Θ) ∝ p(Θ|Θ)e These coupled equations need to be iterated to convergence. The updates for the parameters q(Θ) are straightforward and are given in Appendices A and B. Once converged, the hyperparameters are ˆ updated by maximizing F with respect to Θ, which lead to simple update formulae [7]. Our main concern is with the update for q(h1:T ), for which this paper makes a departure from treatments previously presented. 3 More general Gaussian priors may be more suitable depending on the application. For expositional simplicity, we do not put priors on µ and Σ. 5 For simplicity, we keep the parameters of the Gamma priors ﬁxed. 6 Strictly we should write throughout q(·|v1:T ). We omit the dependence on v1:T for notational convenience. 
4 Uniﬁed Inference on q(h1:T ) 3 Optimally q(h1:T ) is Gaussian since, up to a constant, E(h1:T , Θ) − 1 2 q(Θ) is quadratic in h1:T 7 : T T (vt −Bht )T Σ−1 (vt −Bht ) V q(B,ΣV ) + (ht −Aht−1 ) Σ−1 (ht −Aht−1 ) H t=1 q(A,ΣH ) . (2) In addition, optimally, q(A|ΣH ) and q(B|ΣV ) are Gaussians (see Appendix A), so we can easily carry out the averages in Eq. (2). The further averages over q(ΣH ) and q(ΣV ) are also easy due to conjugacy. Whilst this deﬁnes the distribution q(h1:T ), quantities such as q(ht ), required for example for the parameter updates (see the Appendices), need to be inferred from this distribution. Clearly, in the non-Bayesian case, the averages over the parameters are not present, and the above simply represents the posterior distribution of an LGSSM whose visible variables have been clamped into their evidential states. In that case, inference can be performed using any standard LGSSM routine. Our aim, therefore, is to try to represent the averaged Eq. (2) directly as the posterior distribution q (h1:T |˜1:T ) of an LGSSM , for some suitable parameter settings. ˜ v Mean + Fluctuation Decomposition A useful decomposition is to write (vt − Bht )T Σ−1 (vt − Bht ) V = (vt − B ht )T Σ−1 (vt − B ht ) + hT SB ht , t V q(B,ΣV ) f luctuation mean and similarly (ht −Aht−1 )T Σ−1 (ht −Aht−1 ) H = (ht − A ht−1 )T Σ−1 (ht − A ht−1 ) +hT SA ht−1 , t−1 H q(A,ΣH ) mean f luctuation T −1 where the parameter covariances are SB ≡ B T Σ−1 B − B Σ−1 B = V HB and SA ≡ V V T −1 −1 −1 AT ΣH A − A ΣH A = HHA (for HA and HB deﬁned in Appendix A). The mean terms simply represent a clamped LGSSM with averaged parameters. However, the extra contributions from the ﬂuctuations mean that Eq. (2) cannot be written as a clamped LGSSM with averaged parameters. In order to deal with these extra terms, our idea is to treat the ﬂuctuations as arising from an augmented visible variable, for which Eq. (2) can then be considered as a clamped LGSSM. 
Inference Using an Augmented LGSSM To represent Eq. (2) as an LGSSM q (h1:T |˜1:T ), we may augment vt and B as8 : ˜ v vt = vert(vt , 0H , 0H ), ˜ ˜ B = vert( B , UA , UB ), T where UA is the Cholesky decomposition of SA , so that UA UA = SA . Similarly, UB is the Cholesky decomposition of SB . The equivalent LGSSM q (h1:T |˜1:T ) is then completed by specifying9 ˜ v ˜ A≡ A , ˜ ΣH ≡ Σ−1 H −1 , ˜ ΣV ≡ diag( Σ−1 V −1 , IH , IH ), µ ≡ µ, ˜ ˜ Σ ≡ Σ. The validity of this parameter assignment can be checked by showing that, up to negligible constants, the exponent of this augmented LGSSM has the same form as Eq. (2)10 . Now that this has been written as an LGSSM q (h1:T |˜1:T ), standard inference routines in the literature may be applied to ˜ v compute q(ht |v1:T ) = q (ht |˜1:T ) [1, 2, 11]11 . ˜ v 7 For simplicity of exposition, we ignore the ﬁrst time-point here. The notation vert(x1 , . . . , xn ) stands for vertically concatenating the arguments x1 , . . . , xn . 9 ˜ ˜ ˜ Strictly, we need a time-dependent emission Bt = B, for t = 1, . . . , T − 1. For time T , BT has the Cholesky factor UA replaced by 0H,H . 10 There are several ways of achieving a similar augmentation. We chose this since, in the non-Bayesian limit UA = UB = 0H,H , no numerical instabilities would be introduced. 11 Note that, since the augmented LGSSM q (h1:T |˜1:T ) is designed to match the fully clamped distribution ˜ v q(h1:T |v1:T ), the ﬁltered posterior q (ht |˜1:t ) does not correspond to q(ht |v1:t ). ˜ v 8 Algorithm 1 LGSSM: Forward and backward recursive updates. The smoothed posterior p(ht |v1:T ) ˆ is returned in the mean hT and covariance PtT . 
t procedure F ORWARD 1a: P ← Σ −1 T T 1b: P ← DΣ, where D ≡ I − ΣUAB I + UAB ΣUAB UAB ˆ0 ← µ 2a: h1 ˆ 2b: h0 ← Dµ 1 1 ˆ ˆ ˆ 3: K ← P B T (BP B T + ΣV )−1 , P1 ← (I − KB)P , h1 ← h0 + K(vt − B h0 ) 1 1 1 for t ← 2, T do t−1 4: Ptt−1 ← APt−1 AT + ΣH t−1 5a: P ← Pt −1 T T 5b: P ← Dt Ptt−1 , where Dt ≡ I − Ptt−1 UAB I + UAB Ptt−1 UAB UAB ˆ ˆ 6a: ht−1 ← Aht−1 t t−1 ˆ ˆ 6b: ht−1 ← Dt Aht−1 t t−1 T ˆ ˆ ˆ 7: K ← P B (BP B T + ΣV )−1 , Ptt ← (I − KB)P , ht ← ht−1 + K(vt − B ht−1 ) t t t end for end procedure procedure BACKWARD for t ← T − 1, 1 do ← − t At ← Ptt AT (Pt+1 )−1 ← T − ← − T t t Pt ← Pt + At (Pt+1 − Pt+1 )At T ← ˆT − ˆ ˆ ˆ hT ← ht + At (ht+1 − Aht ) t t t end for end procedure For completeness, we decided to describe the standard predictor-corrector form of the Kalman Filter, together with the Rauch-Tung-Striebel Smoother [2]. These are given in Algorithm 1, where q (ht |˜1:T ) is computed by calling the FORWARD and BACKWARD procedures. ˜ v We present two variants of the FORWARD pass. Either we may call procedure FORWARD in ˜ ˜ ˜ ˜ ˜ ˜ Algorithm 1 with parameters A, B, ΣH , ΣV , µ, Σ and the augmented visible variables vt in which ˜ we use steps 1a, 2a, 5a and 6a. This is exactly the predictor-corrector form of a Kalman Filter [2]. Otherwise, in order to reduce the computational cost, we may call procedure FORWARD with the −1 ˜ ˜ parameters A, B , ΣH , Σ−1 , µ, Σ and the original visible variable vt in which we use steps ˜ ˜ V T 1b (where UAB UAB ≡ SA +SB ), 2b, 5b and 6b. The two algorithms are mathematically equivalent. Computing q(ht |v1:T ) = q (ht |˜1:T ) is then completed by calling the common BACKWARD pass. ˜ v The important point here is that the reader may supply any standard Kalman Filtering/Smoothing routine, and simply call it with the appropriate parameters. 
In some parameter regimes, or in very long time-series, numerical stability may be a serious concern, for which several stabilized algorithms have been developed over the years, for example the square-root forms [2, 10, 11]. By converting the problem to a standard form, we have therefore uniﬁed and simpliﬁed inference, so that future applications may be more readily developed12 . 3.1 Relation to Previous Approaches An alternative approach to the one above, and taken in [5, 7], is to write the posterior as T log q(h1:T ) = φt (ht−1 , ht ) + const. t=2 for suitably deﬁned quadratic forms φt (ht−1 , ht ). Here the potentials φt (ht−1 , ht ) encode the averaging over the parameters A, B, ΣH , ΣV . The approach taken in [7] is to recognize this as a 12 The computation of the log-likelihood bound does not require any augmentation. pairwise Markov chain, for which the Belief Propagation recursions may be applied. The approach in [5] is based on a Kullback-Leibler minimization of the posterior with a chain structure, which is algorithmically equivalent to Belief Propagation. Whilst mathematically valid procedures, the resulting algorithms do not correspond to any of the standard forms in the Kalman Filtering/Smoothing literature, whose properties have been well studied [14]. 4 An Application to Bayesian ICA A particular case for which the Bayesian LGSSM is of interest is in extracting independent source signals underlying a multivariate timeseries [5, 15]. This will demonstrate how the approach developed in Section 3 makes VB easily to apply. The sources si are modeled as independent in the following sense: p(si , sj ) = p(si )p(sj ), 1:T 1:T 1:T 1:T for i = j, i, j = 1, . . . , C. Independence implies block diagonal transition and state noise matrices A, ΣH and Σ, where each block c has dimension Hc . 
A one dimensional source sc for each independent dynamical subsystem is then t formed from sc = 1T hc , where 1c is a unit vector and hc is the state of t c t t dynamical system c. Combining the sources, we can write st = P ht , where P = diag(1T , . . . , 1T ), ht = vert(h1 , . . . , hC ). The resulting 1 C t t emission matrix is constrained to be of the form B = W P , where W is the V × C mixing matrix. This means that the observations v are formed from linearly mixing the sources, vt = W st + ηt . The Figure 1: The structure of graphical structure of this model is presented in Fig 1. To encourage redundant components to be removed, we place a zero mean Gaussian the LGSSM for ICA. prior on W . In this case, we do not deﬁne a prior for the parameters ΣH and ΣV which are instead considered as hyperparameters. More details of the model are given in [15]. The constraint B = W P requires a minor modiﬁcation from Section 3, as we discuss below. Inference on q(h1:T ) A small modiﬁcation of the mean + ﬂuctuation decomposition for B occurs, namely: (vt − Bht )T Σ−1 (vt − Bht ) V q(W ) = (vt − B ht )T Σ−1 (vt − B ht ) + hT P T SW P ht , t V −1 where B ≡ W P and SW = V HW . The quantities W and HW are obtained as in Appendix A.1 with the replacement ht ← P ht . To represent the above as a LGSSM, we augment vt and B as vt = vert(vt , 0H , 0C ), ˜ ˜ B = vert( B , UA , UW P ), where UW is the Cholesky decomposition of SW . The equivalent LGSSM is then completed by ˜ ˜ ˜ ˜ specifying A ≡ A , ΣH ≡ ΣH , ΣV ≡ diag(ΣV , IH , IC ), µ ≡ µ, Σ ≡ Σ, and inference for ˜ q(h1:T ) performed using Algorithm 1. This demonstrates the elegance and unity of the approach in Section 3, since no new algorithm needs to be developed to perform inference, even in this special constrained parameter case. 4.1 Demonstration As a simple demonstration, we used an LGSSM to generate 3 sources sc with random 5×5 transition t matrices Ac , µ = 0H and Σ ≡ ΣH ≡ IH . 
The sources were mixed into three observations $v_t = W s_t + \eta_t^v$, for W chosen with elements from a zero mean unit variance Gaussian distribution, and $\Sigma_V = I_V$. We then trained a Bayesian LGSSM with 5 sources and 7 × 7 transition matrices $A^c$. To bias the model to find the simplest sources, we used $\hat{A}^c \equiv 0_{H_c,H_c}$ for all sources. In Fig 2a and Fig 2b we see the original sources and the noisy observations respectively. In Fig 2c we see the estimated sources from our method after convergence of the hyperparameter updates. Two of the 5 sources have been removed, and the remaining three are a reasonable estimation of the original sources. Another possible approach for introducing prior knowledge is to use a Maximum a Posteriori (MAP) procedure by adding a prior term to the original log-likelihood $\log p(v_{1:T}|A, W, \Sigma_H, \Sigma_V, \mu, \Sigma) + \log p(A|\alpha) + \log p(W|\beta)$. However, it is not clear how to reliably find the hyperparameters α and β in this case. One solution is to estimate them by optimizing the new objective function jointly with respect to the parameters and hyperparameters (this is the so-called joint MAP estimation; see for example [16]). A typical result of using this joint MAP approach on the artificial data is presented in Fig 2d. The joint MAP does not estimate the hyperparameters well, and the incorrect number of sources is identified. Figure 2: (a) Original sources $s_t$. (b) Observations resulting from mixing the original sources, $v_t = W s_t + \eta_t^v$, $\eta_t^v \sim N(0, I)$. (c) Recovered sources using the Bayesian LGSSM. (d) Sources found with MAP LGSSM. Figure 3: (a) Original raw EEG recordings from 4 channels. (b-e) 16 sources $s_t$ estimated by the Bayesian LGSSM.
4.2 Application to EEG Analysis
In Fig 3a we plot three seconds of EEG data recorded from 4 channels (located in the right hemisphere) while a person is performing imagined movement of the right hand. As is typical in EEG, each channel shows drift terms below 1 Hz, which correspond to artifacts of the instrumentation, together with 50 Hz mains contamination; both mask the rhythmical activity related to the mental task, which is mainly centered at 10 and 20 Hz [17]. We would therefore like a method which enables us to extract components in these information-rich 10 and 20 Hz frequency bands. Standard ICA methods such as FastICA do not find satisfactory sources based on raw 'noisy' data, and preprocessing with band-pass filters is usually required. Additionally, in EEG research, flexibility in the number of recovered sources is important since there may be many independent oscillators of interest underlying the observations and we would like some way to automatically determine their effective number. To preferentially find sources at particular frequencies, we specified a block diagonal matrix $\hat{A}^c$ for each source $c$, where each block is a 2 × 2 rotation matrix at the desired frequency. We defined the following 16 groups of frequencies (in Hz): [0.5], [0.5], [0.5], [0.5]; [10,11], [10,11], [10,11], [10,11]; [20,21], [20,21], [20,21], [20,21]; [50], [50], [50], [50]. The temporal evolution of the sources obtained after training the Bayesian LGSSM is given in Fig 3(b,c,d,e) (grouped by frequency range). The Bayesian LGSSM removed 4 unnecessary sources from the mixing matrix $W$, namely one [10,11] Hz and three [20,21] Hz sources. The first 4 sources contain dominant low frequency drift, sources 5, 6 and 8 contain [10,11] Hz, while source 10 contains [20,21] Hz centered activity. Of the 4 sources initialized to 50 Hz, only 2 retained 50 Hz activity, while the $A^c$ of the other two have changed to model other frequencies present in the EEG.
This method demonstrates the usefulness and applicability of the VB method in a real-world situation. 5 Conclusion We considered the application of Variational Bayesian learning to Linear Gaussian State-Space Models. This is an important class of models with widespread application, and ﬁnding a simple way to implement this approximate Bayesian procedure is of considerable interest. The most demanding part of the procedure is inference of the hidden states of the model. Previously, this has been achieved using Belief Propagation, which differs from inference in the Kalman Filtering/Smoothing literature, for which highly efﬁcient and stabilized procedures exist. A central contribution of this paper is to show how inference can be written using the standard Kalman Filtering/Smoothing recursions by augmenting the original model. Additionally, a minor modiﬁcation to the standard Kalman Filtering routine may be applied for computational efﬁciency. We demonstrated the elegance and unity of our approach by showing how to easily apply a Variational Bayes analysis of temporal ICA. Speciﬁcally, our Bayes ICA approach successfully extracts independent processes underlying EEG signals, biased towards preferred frequency ranges. We hope that this simple and unifying interpretation of Variational Bayesian LGSSMs may therefore facilitate the further application to related models. A A.1 Parameter Updates for A and B Determining q(B|ΣV ) By examining F, the contribution of q(B|ΣV ) can be interpreted as the negative KL divergence between q(B|ΣV ) and a Gaussian. Hence, optimally, q(B|ΣV ) is a Gaussian. The covariance [ΣB ]ij,kl ≡ Bij − Bij Bkl − Bkl (averages wrt q(B|ΣV )) is given by: T hj hl t t −1 [ΣB ]ij,kl = [HB ]jl [ΣV ]ik , where [HB ]jl ≡ t=1 −1 The mean is given by B = NB HB , where [NB ]ij ≡ T t=1 hj t q(ht ) q(ht ) + βj δjl . i ˆ vt + βj Bij . 
Determining q(A|ΣH ) Optimally, q(A|ΣH ) is a Gaussian with covariance T −1 hj hl t t −1 [ΣA ]ij,kl = [HA ]jl [ΣH ]ik , where [HA ]jl ≡ t=1 −1 The mean is given by A = NA HA , where [NA ]ij ≡ B T t=2 q(ht ) hj hi t−1 t + αj δjl . q(ht−1:t ) ˆ + αj Aij . Covariance Updates By specifying a Wishart prior for the inverse of the covariances, conjugate update formulae are possible. In practice, it is more common to specify diagonal inverse covariances, for which the corresponding priors are simply Gamma distributions [7, 5]. For this simple diagonal case, the explicit updates are given below. Determining q(ΣV ) For the constraint Σ−1 = diag(ρ), where each diagonal element follows a Gamma prior V Ga(b1 , b2 ) [7], q(ρ) factorizes and the optimal updates are  q(ρi ) = Ga b1 + where GB ≡ −1 T NB HB NB .   T T 1 i , b2 +  (vt )2 − [GB ]ii + 2 2 t=1 j ˆ2 βj Bij  , Determining q(ΣH ) Analogously, for Σ−1 = diag(τ ) with prior Ga(a1 , a2 ) [5], the updates are H    T T −1 1 ˆij , a2 + (hi )2 − [GA ]ii + αj A2  , q(τi ) = Ga a1 + t 2 2 t=2 j −1 T where GA ≡ NA HA NA . Acknowledgments This work is supported by the European DIRAC Project FP6-0027787. This paper only reﬂects the authors’ views and funding agencies are not liable for any use that may be made of the information contained herein. References [1] Y. Bar-Shalom and X.-R. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, 1998. [2] M. S. Grewal and A. P. Andrews. Kalman Filtering: Theory and Practice Using MATLAB. John Wiley and Sons, Inc., 2001. [3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications. Springer, 2000. [4] M. J. Beal, F. Falciani, Z. Ghahramani, C. Rangel, and D. L. Wild. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics, 21:349–356, 2005. [5] A. T. Cemgil and S. J. Godsill. Probabilistic phase vocoder and its application to interpolation of missing values in audio signals. 
In 13th European Signal Processing Conference, 2005.
[6] H. Valpola and J. Karhunen. An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Computation, 14:2647–2692, 2002.
[7] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[8] M. Davy and S. J. Godsill. Bayesian harmonic models for musical signal analysis (with discussion). In J.O. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith, editors, Bayesian Statistics VII. Oxford University Press, 2003.
[9] D. J. C. MacKay. Ensemble learning and evidence maximisation. Unpublished manuscript: www.variational-bayes.org, 1995.
[10] M. Morf and T. Kailath. Square-root algorithms for least-squares estimation. IEEE Transactions on Automatic Control, 20:487–497, 1975.
[11] P. Park and T. Kailath. New square-root smoothing algorithms. IEEE Transactions on Automatic Control, 41:727–732, 1996.
[12] E. Niedermeyer and F. Lopes Da Silva. Electroencephalography: basic principles, clinical applications and related fields. Lippincott Williams and Wilkins, 1999.
[13] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11:305–345, 1999.
[14] M. Verhaegen and P. Van Dooren. Numerical aspects of different Kalman filter implementations. IEEE Transactions on Automatic Control, 31:907–917, 1986.
[15] S. Chiappa and D. Barber. Bayesian linear Gaussian state-space models for biosignal decomposition. Signal Processing Letters, 14, 2007.
[16] S. S. Saquib, C. A. Bouman, and K. Sauer. ML parameter estimation for Markov random fields with applications to Bayesian tomography. IEEE Transactions on Image Processing, 7:1029–1044, 1998.
[17] G. Pfurtscheller and F. H. Lopes da Silva. Event-related EEG/MEG synchronization and desynchronization: basic principles. Clinical Neurophysiology, pages 1842–1857, 1999.

3 0.081264883 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines

Author: Olivier Chapelle, Vikas Sindhwani, S. S. Keerthi

Abstract: Semi-supervised SVMs (S3 VM) attempt to learn low-density separators by maximizing the margin over labeled and unlabeled examples. The associated optimization problem is non-convex. To examine the full potential of S3 VMs modulo local minima problems in current implementations, we apply branch and bound techniques for obtaining exact, globally optimal solutions. Empirical evidence suggests that the globally optimal solution can return excellent generalization performance in situations where other implementations fail completely. While our current implementation is only applicable to small datasets, we discuss variants that can potentially lead to practically useful algorithms. 1

4 0.077331074 150 nips-2006-On Transductive Regression

Author: Corinna Cortes, Mehryar Mohri

Abstract: In many modern large-scale learning applications, the amount of unlabeled data far exceeds that of labeled data. A common instance of this problem is the transductive setting where the unlabeled test points are known to the learning algorithm. This paper presents a study of regression problems in that setting. It presents explicit VC-dimension error bounds for transductive regression that hold for all bounded loss functions and coincide with the tight classiﬁcation bounds of Vapnik when applied to classiﬁcation. It also presents a new transductive regression algorithm inspired by our bound that admits a primal and kernelized closedform solution and deals efﬁciently with large amounts of unlabeled data. The algorithm exploits the position of unlabeled points to locally estimate their labels and then uses a global optimization to ensure robust predictions. Our study also includes the results of experiments with several publicly available regression data sets with up to 20,000 unlabeled examples. The comparison with other transductive regression algorithms shows that it performs well and that it can scale to large data sets.

5 0.069054238 116 nips-2006-Learning from Multiple Sources

Author: Koby Crammer, Michael Kearns, Jennifer Wortman

Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classiﬁcation and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1

6 0.068804592 35 nips-2006-Approximate inference using planar graph decomposition

7 0.06870465 117 nips-2006-Learning on Graph with Laplacian Regularization

8 0.068345666 77 nips-2006-Fast Computation of Graph Kernels

9 0.066722445 104 nips-2006-Large-Scale Sparsified Manifold Regularization

10 0.066299908 203 nips-2006-implicit Online Learning with Kernels

11 0.063314103 10 nips-2006-A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems

12 0.061287183 60 nips-2006-Convergence of Laplacian Eigenmaps

13 0.059205096 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation

14 0.058101796 124 nips-2006-Linearly-solvable Markov decision problems

15 0.057167482 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields

16 0.05637477 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models

17 0.055055127 163 nips-2006-Prediction on a Graph with a Perceptron

18 0.054892235 154 nips-2006-Optimal Change-Detection and Spiking Neurons

19 0.054522909 39 nips-2006-Balanced Graph Matching

20 0.047770474 128 nips-2006-Manifold Denoising

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.148), (1, 0.051), (2, -0.068), (3, 0.05), (4, 0.003), (5, 0.023), (6, 0.026), (7, 0.009), (8, -0.052), (9, -0.07), (10, 0.046), (11, -0.124), (12, -0.064), (13, -0.005), (14, 0.081), (15, 0.071), (16, 0.026), (17, 0.051), (18, -0.083), (19, -0.067), (20, 0.057), (21, -0.109), (22, 0.022), (23, 0.046), (24, 0.052), (25, 0.019), (26, 0.045), (27, -0.019), (28, -0.039), (29, -0.058), (30, -0.015), (31, -0.002), (32, 0.061), (33, 0.028), (34, 0.088), (35, 0.034), (36, -0.01), (37, -0.036), (38, -0.048), (39, 0.158), (40, -0.051), (41, -0.0), (42, -0.045), (43, -0.026), (44, -0.053), (45, -0.058), (46, -0.073), (47, -0.104), (48, 0.063), (49, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92769319 93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms

Author: Xinhua Zhang, Wee S. Lee

2 0.53756309 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines

Author: Olivier Chapelle, Vikas Sindhwani, S. S. Keerthi

3 0.51551986 104 nips-2006-Large-Scale Sparsified Manifold Regularization

Author: Ivor W. Tsang, James T. Kwok

Abstract: Semi-supervised learning is more powerful than supervised learning by using both labeled and unlabeled data. In particular, the manifold regularization framework, together with kernel methods, leads to the Laplacian SVM (LapSVM) that has demonstrated state-of-the-art performance. However, the LapSVM solution typically involves kernel expansions of all the labeled and unlabeled examples, and is slow on testing. Moreover, existing semi-supervised learning methods, including the LapSVM, can only handle a small number of unlabeled examples. In this paper, we integrate manifold regularization with the core vector machine, which has been used for large-scale supervised and unsupervised learning. By using a sparsiﬁed manifold regularizer and formulating as a center-constrained minimum enclosing ball problem, the proposed method produces sparse solutions with low time and space complexities. Experimental results show that it is much faster than the LapSVM, and can handle a million unlabeled examples on a standard PC; while the LapSVM can only handle several thousand patterns. 1

4 0.48535547 150 nips-2006-On Transductive Regression

Author: Corinna Cortes, Mehryar Mohri

5 0.46879277 35 nips-2006-Approximate inference using planar graph decomposition

Author: Amir Globerson, Tommi S. Jaakkola

Abstract: A number of exact and approximate methods are available for inference calculations in graphical models. Many recent approximate methods for graphs with cycles are based on tractable algorithms for tree structured graphs. Here we base the approximation on a different tractable model, planar graphs with binary variables and pure interaction potentials (no external ﬁeld). The partition function for such models can be calculated exactly using an algorithm introduced by Fisher and Kasteleyn in the 1960s. We show how such tractable planar models can be used in a decomposition to derive upper bounds on the partition function of non-planar models. The resulting algorithm also allows for the estimation of marginals. We compare our planar decomposition to the tree decomposition method of Wainwright et. al., showing that it results in a much tighter bound on the partition function, improved pairwise marginals, and comparable singleton marginals. Graphical models are a powerful tool for modeling multivariate distributions, and have been successfully applied in various ﬁelds such as coding theory and image processing. Applications of graphical models typically involve calculating two types of quantities, namely marginal distributions, and MAP assignments. The evaluation of the model partition function is closely related to calculating marginals [12]. These three problems can rarely be solved exactly in polynomial time, and are provably computationally hard in the general case [1]. When the model conforms to a tree structure, however, all these problems can be solved in polynomial time. This has prompted extensive research into tree based methods. For example, the junction tree method [6] converts a graphical model into a tree by clustering nodes into cliques, such that the graph over cliques is a tree. The resulting maximal clique size (cf. tree width) may nevertheless be prohibitively large. Wainwright et. al. 
[9, 11] proposed an approximate method based on trees known as tree reweighting (TRW). The TRW approach decomposes the potential vector of a graphical model into a mixture over spanning trees of the model, and then uses convexity arguments to bound various quantities, such as the partition function. One key advantage of this approach is that it provides bounds on partition function value, a property which is not shared by approximations based on Bethe free energies [13]. In this paper we focus on a different class of tractable models: planar graphs. A graph is called planar if it can be drawn in the plane without crossing edges. Works in the 1960s by physicists Fisher [5] and Kasteleyn [7], among others, have shown that the partition function for planar graphs may be calculated in polynomial time. This, however, is true under two key restrictions. One is that the variables xi are binary. The other is that the interaction potential depends only on xi xj (where xi ∈ {±1}), and not on their individual values (i.e., the zero external ﬁeld case). Here we show how the above method can be used to obtain upper bounds on the partition function for non-planar graphs. As in TRW, we decompose the potential of a non-planar graph into a sum over spanning planar models, and then use a convexity argument to obtain an upper bound on the log partition function. The bound optimization is a convex problem, and can be solved in polynomial time. We compare our method with TRW on a planar graph with an external ﬁeld, and show that it performs favorably with respect to both pairwise marginals and the bound on the partition function, and the two methods give similar results for singleton marginals. 1 Deﬁnitions and Notations Given a graph G with n vertices and a set of edges E, we are interested in pairwise Markov Random Fields (MRF) over the graph G. A pairwise MRF [13] is a multivariate distribution over variables x = {x1 , . . . 
, xn } defined as

$$p(x) = \frac{1}{Z} e^{\sum_{ij \in E} f_{ij}(x_i, x_j)} \qquad (1)$$

where $f_{ij}$ are a set of $|E|$ functions, or interaction potentials, defined over pairs of variables. The partition function is defined as $Z = \sum_x e^{\sum_{ij \in E} f_{ij}(x_i, x_j)}$. Here we will focus on the case where $x_i \in \{\pm 1\}$. Furthermore, we will be interested in interaction potentials which only depend on agreement or disagreement between the signs of their variables. We define those by

$$f_{ij}(x_i, x_j) = \frac{1}{2} \theta_{ij} (1 + x_i x_j) = \theta_{ij} I(x_i = x_j) \qquad (2)$$

so that $f_{ij}(x_i, x_j)$ is zero if $x_i \neq x_j$ and $\theta_{ij}$ if $x_i = x_j$. The model is then defined via the set of parameters $\theta_{ij}$. We use $\theta$ to denote the vector of parameters $\theta_{ij}$, and denote the partition function by $Z(\theta)$ to highlight its dependence on these parameters.

A graph G is defined as planar if it can be drawn in the plane without any intersection of edges [4]. With some abuse of notation, we define E as the set of line segments in $\mathbb{R}^2$ corresponding to the edges in the graph. The regions of $\mathbb{R}^2 \setminus E$ are defined as the faces of the graph. The face which corresponds to an unbounded region is called the external face. Given a planar graph G, its dual graph G∗ is defined in the following way: the vertices of G∗ correspond to faces of G, and there is an edge between two vertices in G∗ iff the two corresponding faces in G share an edge. If the graph G is weighted, the weight on an edge in G∗ is the weight on the edge shared by the corresponding faces in G. A plane triangulation of a planar graph G is obtained from G by adding edges such that all the faces of the resulting graph have exactly three vertices. Thus a plane triangulated graph has a dual where all vertices have degree three. It can be shown that every plane graph can be plane triangulated [4]. We shall also need the notion of a perfect matching on a graph. A perfect matching on a graph G is defined as a set of edges H ⊆ E such that every vertex in G has exactly one edge in H incident on it.
If the graph is weighted, the weight of the matching is defined as the product of the weights of the edges in the matching. Finally, we recall the definition of a marginal polytope of a graph [12]. Consider an MRF over a graph G where $f_{ij}$ are given by Equation 2. Denote the probability of the event $I(x_i = x_j)$ under $p(x)$ by $\tau_{ij}$. The marginal polytope of G, denoted by M(G), is defined as the set of values $\tau_{ij}$ that can be obtained under some assignment to the parameters $\theta_{ij}$. For a general graph G the polytope M(G) cannot be described using a polynomial number of inequalities. However, for planar graphs, it turns out that a set of $O(n^3)$ constraints, commonly referred to as triangle inequalities, suffice to describe M(G) (see [3] page 434). The triangle inequalities are defined by (see footnote 1)

$$\mathrm{TRI}(n) = \{\tau_{ij} : \tau_{ij} + \tau_{jk} - \tau_{ik} \le 1,\ \tau_{ij} + \tau_{jk} + \tau_{ik} \ge 1,\ \forall i, j, k \in \{1, \ldots, n\}\} \qquad (3)$$

Note that the above inequalities actually contain variables $\tau_{ij}$ which do not correspond to edges in the original graph G. Thus the equality M(G) = TRI(n) should be understood as referring only to the values of $\tau_{ij}$ that correspond to edges in the graph. Importantly, the values of $\tau_{ij}$ for edges not in the graph need not be valid marginals for any MRF. In other words M(G) is a projection of TRI(n) on the set of edges of G. It is well known that the marginal polytope for trees is described via pairwise constraints. It is thus interesting that for planar graphs, it is triplets, rather than pairwise constraints, that characterize the polytope. In this sense, planar graphs and trees may be viewed as a hierarchy of polytope complexity classes. It remains an interesting problem to characterize other structures in this hierarchy and their related inference algorithms.

Footnote 1: The definition here is slightly different from that in [3], since here we refer to agreement probabilities, whereas [3] refers to disagreement probabilities. This polytope is also referred to as the cut polytope.
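The definitions of Section 1 can be checked numerically on a tiny model. The sketch below is an illustration written for this purpose, with hypothetical function names; it enumerates all assignments by brute force (so it is only usable for very small n), computes Z(θ) for the agreement potentials of Equation 2, computes the agreement probabilities τij for every pair of variables, and tests the triangle inequalities of Equation 3.

```python
import itertools
import math


def partition_function(n, theta):
    """Z(theta) = sum over x in {-1,+1}^n of exp(sum_ij theta_ij * I(x_i == x_j)).

    `theta` maps an edge (i, j) to its parameter theta_ij.
    """
    z = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        z += math.exp(sum(t for (i, j), t in theta.items() if x[i] == x[j]))
    return z


def agreement_probabilities(n, theta):
    """tau_ij = p(x_i == x_j) for every pair i < j, including non-edges."""
    z = partition_function(n, theta)
    tau = {pair: 0.0 for pair in itertools.combinations(range(n), 2)}
    for x in itertools.product((-1, 1), repeat=n):
        p = math.exp(sum(t for (i, j), t in theta.items() if x[i] == x[j])) / z
        for (i, j) in tau:
            if x[i] == x[j]:
                tau[(i, j)] += p
    return tau


def satisfies_triangle_inequalities(n, tau, tol=1e-12):
    """Check both families of inequalities in TRI(n) for all ordered triples."""
    def t(a, b):
        return tau[(min(a, b), max(a, b))]

    for i, j, k in itertools.permutations(range(n), 3):
        if t(i, j) + t(j, k) - t(i, k) > 1 + tol:
            return False
        if t(i, j) + t(j, k) + t(i, k) < 1 - tol:
            return False
    return True
```

For example, with `theta = {(0, 1): 0.5, (1, 2): -1.0, (2, 3): 0.3, (0, 3): 0.8}` on four variables, the agreement probabilities returned by `agreement_probabilities(4, theta)` pass `satisfies_triangle_inequalities`, as they must for marginals of an actual distribution.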
2 Exact calculation of partition function using perfect matching The seminal works of Kasteleyn [7] and Fisher [5] have shown how one can calculate the partition function for a binary MRF over a planar graph with pure interaction potentials. We brieﬂy review Fisher’s construction, which we will use in what follows. Our interpretation of the method differs somewhat from that of Fisher, but we believe it is more straightforward. The key idea in calculating the partition function is to convert the summation over values of x to the problem of calculating the sum of weights of all perfect matchings in a graph constructed from G, as shown below. In this section, we consider weighted graphs (graphs with numbers assigned to their edges). For the graph G associated with the pairwise MRF, we assign weights wij = e2θij to the edges. The ﬁrst step in the construction is to plane triangulate the graph G. Let us call the resulting graph GT . We deﬁne an MRF on GT by assigning a parameter θij = 0 to the edges that have been added to G, and the corresponding weight wij = 1. Thus GT essentially describes the same distribution as G, and therefore has the same partition function. We can thus restrict our attention to calculating the partition function for the MRF on GT . As a ﬁrst step in calculating a partition function over GT , we introduce the following deﬁnition: a ˆ set of edges E in GT is an agreement edge set (or AES) if for every triangle face F in GT one of the ˆ ˆ following holds: The edges in F are all in E, or exactly one of the edges in F is in E. The weight ˆ is deﬁned as the product of the weights of the edges in E. ˆ of a set E It can be shown that there exists a bijection between pairs of assignments {x, −x} and agreement edge sets. The mapping from x to an edge set is simply the set of edges such that xi = xj . It is easy to see that this is an agreement edge set. 
The reverse mapping is obtained by ﬁnding an assignment x such that xi = xj iff the corresponding edge is in the agreement edge set. The existence of this mapping can be shown by induction on the number of (triangle) faces. P The contribution of a given assignment x to the partition function is e ˆ sponds to an AES denoted by E it is easy to see that P e ij∈E θij I(xi =xj ) = e− P ij∈E θij P e ˆ ij∈E 2θij = ce P ˆ ij∈E ij∈E 2θij θij I(xi =xj ) =c wij . If x corre(4) ˆ ij∈E P where c = e− ij∈E θij . Deﬁne the superset Λ as the set of agreement edge sets. The above then implies that Z(θ) = 2c E∈Λ ij∈E wij , and is thus proportional to the sum of AES weights. ˆ ˆ To sum over agreement edge sets, we use the following elegant trick introduced by Fisher [5]. Construct a new graph GPM from the dual of GT by introducing new vertices and edges according to the following rule: Replace each original vertex with three vertices that are connected to each other, and assign a weight of one to the new edges. Next, consider the three neighbors of the original vertex 2 . Connect each of the three new vertices to one of these three neighbors, keeping the original weights on these edges. The transformation is illustrated in Figure 1. The new graph GPM has O(3n) vertices, and is also planar. It can be seen that there is a one to one correspondence between perfect matchings in GPM and agreement edge sets in GT . Deﬁne Ω to be the set of perfect matchings in GPM . Then Z(θ) = 2c M ∈Ω ij∈M wij where we have used the fact that all the new weights have a value of one. Thus, the partition function is a sum over the weights of perfect matchings in GPM . Finally, we need a way of summing over the weights of the set of perfect matchings in a graph. 
Kasteleyn [7] proved that for a planar graph GPM , this sum may be obtained using the following sequence of steps: • Direct the edges of the graph GPM such that for every face (except possibly the external face), the number of edges on its perimeter oriented in a clockwise manner is odd. Kasteleyn showed that such a so called Pfafﬁan orientation may be constructed in polynomial time for a planar graph (see also [8] page 322). 2 Note that in the dual of GT all vertices have degree three, since GT is plane triangulated. 1.2 0.7 0.6 1 1 1 0.8 0.6 0.8 1.5 1.4 1.5 1 1 1.2 1 1 1 1 0.7 1.4 1 1 1 Figure 1: Illustration of the graph transformations in Section 2 for a complete graph with four vertices. Left panel shows the original weighted graph (dotted edges and grey vertices) and its dual (solid edges and black vertices). Right panel shows the dual graph with each vertex replaced by a triangle (the graph GPM in the text). Weights for dual graph edges correspond to the weights on the original graph. • Deﬁne the matrix P (GPM ) to be a skew symmetric matrix such that Pij = 0 if ij is not an edge, Pij = wij if the arrow on edge ij runs from i to j and Pij = −wij otherwise. • The sum over weighted matchings can then be shown to equal |P (GPM )|. The partition function is thus given by Z(θ) = 2c |P (GPM )|. To conclude this section we reiterate the following two key points: the partition function of a binary MRF over a planar graph with interaction potentials as in Equation 2 may be calculated in polynomial time by calculating the determinant of a matrix of size O(3n). An important outcome of this result is that the functional relation between Z(θ) and the parameters θij is known, a fact we shall use in what follows. 3 Partition function bounds via planar decomposition Given a non-planar graph G over binary variables with a vector of interaction potentials θ, we wish to use the exact planar computation to obtain a bound on the partition function of the MRF on G. 
We assume for simplicity that the potentials on the MRF for G are given in the form of Equation 2. Thus, G violates the assumptions of the previous section only in its non-planarity. Define $G^{(r)}$ as a set of spanning planar subgraphs of G, i.e., each graph $G^{(r)}$ is planar and contains all the vertices of G and some of its edges. Denote by m the number of such graphs. Introduce the following definitions:
• $\theta^{(r)}$ is a set of parameters on the edges of $G^{(r)}$, and $\theta^{(r)}_{ij}$ is an element in this set. $Z(\theta^{(r)})$ is the partition function of the MRF on $G^{(r)}$ with parameters $\theta^{(r)}$.
• $\hat{\theta}^{(r)}$ is a set of parameters on the edges of G such that if edge (ij) is in $G^{(r)}$ then $\hat{\theta}^{(r)}_{ij} = \theta^{(r)}_{ij}$, and otherwise $\hat{\theta}^{(r)}_{ij} = 0$.
Given a distribution $\rho(r)$ on the graphs $G^{(r)}$ (i.e., $\rho(r) \ge 0$ for $r = 1, \ldots, m$ and $\sum_r \rho(r) = 1$), assume that the parameters for $G^{(r)}$ are such that

$$\theta = \sum_r \rho(r) \hat{\theta}^{(r)} \qquad (5)$$

Then, by the convexity of the log partition function as a function of the model parameters, we have

$$\log Z(\theta) \le \sum_r \rho(r) \log Z(\theta^{(r)}) \equiv f(\theta, \rho, \theta^{(r)}) \qquad (6)$$

Since by assumption the graphs $G^{(r)}$ are planar, this bound can be calculated in polynomial time. Since this bound is true for any set of parameters $\theta^{(r)}$ which satisfies the condition in Equation 5 and for any distribution $\rho(r)$, we may optimize over these two variables to obtain the tightest bound possible. Define the optimal bound for a fixed value of $\rho(r)$ by $g(\rho, \theta)$ (optimization is w.r.t. $\theta^{(r)}$):

$$g(\rho, \theta) = \min_{\theta^{(r)} :\, \sum_r \rho(r) \hat{\theta}^{(r)} = \theta} f(\theta, \rho, \theta^{(r)}) \qquad (7)$$

Also, define the optimum of the above w.r.t. $\rho$ by $h(\theta)$:

$$h(\theta) = \min_{\rho(r) \ge 0,\ \sum_r \rho(r) = 1} g(\rho, \theta) \qquad (8)$$

Thus, $h(\theta)$ is the optimal upper bound for the given parameter vector $\theta$. In the following section we argue that we can in fact find the global optimum of the above problem.

4 Globally Optimal Bound Optimization
First consider calculating $g(\rho, \theta)$ from Equation 7.
Note that since log Z(θ (r) ) is a convex function of θ (r) , and the constraints are linear, the overall optimization is convex and can be solved efﬁciently. In the current implementation, we use a projected gradient algorithm [2]. The gradient of f (θ, ρ, θ (r) ) w.r.t. θ (r) is given by ∂f (θ, ρ, θ (r) ) (r) ∂θij (r) = ρ(r) 1 + eθij (r) P −1 (GPM ) (r) k(i,j) Sign(Pk(i,j) (GPM )) (9) where k(i, j) returns the row and column indices of the element in the upper triangular matrix of (r) (r) P (GPM ), which contains the element e2θij . Since the optimization in Equation 7 is convex, it has an equivalent convex dual. Although we do not use this dual for optimization (because of the difﬁculty of expressing the entropy of planar models solely in terms of triplet marginals), it nevertheless allows some insight into the structure of the problem. The dual in this case is closely linked to the notion of the marginal polytope deﬁned in Section 1. Using a derivation similar to [11], we arrive at the following characterization of the dual g(ρ, θ) = max τ ∈TRI(n) ρ(r)H(θ (r) (τ )) θ·τ + (10) r where θ (r) (τ ) denotes the parameters of an MRF on G(r) such that its marginals are given by the restriction of τ to the edges of G(r) , and H(θ (r) (τ )) denotes the entropy of the MRF over G(r) with parameters θ (r) (τ ). The maximized function in Equation 10 is linear in ρ and thus g(ρ, θ) is a pointwise maximum over (linear) convex functions in ρ and is thus convex in ρ. It therefore has no (r) local minima. Denote by θmin (ρ) the set of parameters that minimizes Equation 7 for a given value of ρ. Using a derivation similar to that in [11], the gradient of g(ρ, θ) can be shown to be ∂g(ρ, θ) (r) = H(θmin (ρ)) ∂ρ(r) (11) Since the partition function for G(r) can be calculated efﬁciently, so can the entropy. We can now summarize the algorithm for calculating h(θ) • Initialize ρ0 . Iterate: – For ρt , ﬁnd θ (r) which solves the minimization in Equation 7. 
– Calculate the gradient of $g(\rho, \theta)$ at $\rho_t$ using the expression in Equation 11.
– Update $\rho_{t+1} = \rho_t + \alpha v$ where $v$ is a feasible search direction calculated from the gradient of $g(\rho, \theta)$ and the simplex constraints on $\rho$. The step size $\alpha$ is calculated via an Armijo line search.
– Halt when the change in $g(\rho, \theta)$ is smaller than some threshold.
Note that the minimization w.r.t. $\theta^{(r)}$ is not very time consuming since we can initialize it with the minimum from the previous step, and thus only a few iterations are needed to find the new optimum, provided the change in $\rho$ is not too big. The above algorithm is guaranteed to converge to a global optimum of $\rho$ [2], and thus we obtain the tightest possible upper bound on $Z(\theta)$ given our planar graph decomposition.

Figure 2: Illustration of planar subgraph construction for a rectangular lattice with external field. Original graph is shown on the left. The field vertex is connected to all vertices (edges not shown). The graph on the right results from isolating the 4th and 5th columns of the original graph (shown in grey), and connecting the field vertex to the external vertices of the three disconnected components. Note that the resulting graph is planar.

The procedure described here is asymmetric w.r.t. $\rho$ and $\theta^{(r)}$. In a symmetric formulation the minimizing gradient steps could be carried out jointly or in an alternating sequence. The symmetric formulation can be obtained by decoupling $\rho$ and $\theta^{(r)}$ in the bi-linear constraint $\sum_r \rho(r) \hat{\theta}^{(r)} = \theta$. Specifically, we introduce $\tilde{\theta}^{(r)} = \theta^{(r)} \rho(r)$ and perform the optimization w.r.t. $\rho$ and $\tilde{\theta}^{(r)}$. It can be shown that a stationary point of $f(\theta, \rho, \tilde{\theta}^{(r)})$ with the relevant (de-coupled) constraint is equivalent to the procedure described above. The advantage of this approach is that the exact minimization w.r.t. $\theta^{(r)}$ is not required before modifying $\rho$.
Our experiments have shown, however, that the methods take comparable times to converge, although this may be a property of the implementation.

5 Estimating Marginals

The optimization problem as defined above minimizes an upper bound on the partition function. However, it may also be of interest to obtain estimates of the marginals of the MRF over $G$. To obtain marginal estimates, we follow the approach in [11]. We first characterize the optimum of Equation 7 for a fixed value of $\rho$. Differentiating the Lagrangian of Equation 7 w.r.t. $\theta^{(r)}$ we obtain the following characterization of $\theta^{(r)}_{\min}(\rho)$:

Marginal Optimality Criterion: For any two graphs $G^{(r)}$, $G^{(s)}$ such that the edge $(ij)$ is in both graphs, the optimal parameter vector satisfies $\tau_{ij}(\theta^{(r)}_{\min}(\rho)) = \tau_{ij}(\theta^{(s)}_{\min}(\rho))$.

Thus, the optimal set of parameters for the graphs $G^{(r)}$ is such that every two graphs agree on the marginals of all the edges they share. This implies that at the optimum, there is a well defined set of marginals over all the edges. We use this set as an approximation to the true marginals.

A different method for estimating marginals uses the partition function bound directly. We first calculate partition function bounds on the sums $\alpha_i(1) = \sum_{x : x_i = 1} e^{\sum_{ij \in E} f_{ij}(x_i, x_j)}$ and $\alpha_i(-1) = \sum_{x : x_i = -1} e^{\sum_{ij \in E} f_{ij}(x_i, x_j)}$, and then normalize, $\frac{\alpha_i(1)}{\alpha_i(1) + \alpha_i(-1)}$, to obtain an estimate for $p(x_i = 1)$. This method has the advantage of being more numerically stable (since it does not depend on derivatives of $\log Z$). However, it needs to be calculated separately for each variable, so that it may be time consuming if one is interested in marginals for a large set of variables.

6 Experimental Evaluation

We study the application of our Planar Decomposition (PDC) method to a binary MRF on a square lattice with an external field. The MRF is given by $p(x) \propto e^{\sum_{ij \in E} \theta_{ij} x_i x_j + \sum_{i \in V} \theta_i x_i}$ where $V$ are the lattice vertices, and $\theta_i$ and $\theta_{ij}$ are parameters.
Note that this interaction does not satisfy the conditions for exact calculation of the partition function, even though the graph is planar. This problem is in fact NP-hard [1]. However, it is possible to obtain the desired interaction form by introducing an additional variable $x_{n+1}$ that is connected to all the original variables. Denote the corresponding graph by $G_f$. Consider the distribution $p(x, x_{n+1}) \propto e^{\sum_{ij \in E} \theta_{ij} x_i x_j + \sum_{i \in V} \theta_{i,n+1} x_i x_{n+1}}$, where $\theta_{i,n+1} = \theta_i$. It is easy to see that any property of $p(x)$ (e.g., partition function, marginals) may be calculated from the corresponding property of $p(x, x_{n+1})$. The advantage of the latter distribution is that it has the desired interaction form. We can thus apply PDC by choosing planar subgraphs of the non-planar graph $G_f$.

[Figure 3 plots omitted. Panels: Z Bound Error, Pairwise Marginals Error, and Singleton Marginal Error, each plotted against Interaction Strength (upper row) and Field Strength (lower row); legend: PDC, TRW.]

Figure 3: Comparison of the TRW and Planar Decomposition (PDC) algorithms on a 7×7 square lattice. TRW results shown in red squares, and PDC in blue circles. Left column shows the error in the log partition bound. Middle column is the mean error for pairwise marginals, and right column is the error for the singleton marginal of the variable at the lattice center. Results in upper row are for field parameters drawn from U[−0.05, 0.05] and various interaction parameters. Results in the lower row are for interaction parameters drawn from U[−0.5, 0.5] and various field parameters. Error bars are standard errors calculated from 40 random trials.

There are clearly many ways to choose spanning planar subgraphs of $G_f$.
Spanning subtrees are one option, and were used in [11]. Since our optimization is polynomial in the number of subgraphs, we preferred to use a number of subgraphs that is linear in $\sqrt{n}$. The key idea in generating these planar subgraphs is to generate disconnected components of the lattice and connect $x_{n+1}$ only to the external vertices of these components. Here we generate three disconnected components by isolating two neighboring columns (or rows) from the rest of the graph. This is illustrated in Figure 2. To this set of $2\sqrt{n}$ graphs, we add the independent variables graph consisting only of edges from the field node to all the other nodes.

We compared the performance of the PDC and TRW methods (footnotes 3, 4) on a 7 × 7 lattice. Since the exact partition function and marginals can be calculated for this case, we could compare both algorithms to the true values. The MRF parameters were set according to the two following scenarios:

1) Varying Interaction: The field parameters $\theta_i$ were drawn uniformly from U[−0.05, 0.05], and the interaction $\theta_{ij}$ from U[−α, α] where α ∈ {0.2, 0.4, . . . , 2}. This is the setting tested in [11].
2) Varying Field: $\theta_i$ was drawn uniformly from U[−α, α], where α ∈ {0.2, 0.4, . . . , 2}, and $\theta_{ij}$ from U[−0.5, 0.5].

For each scenario, we calculated the following measures:

1) Normalized log partition error $\frac{1}{49}(\log Z^{alg} - \log Z^{true})$.
2) Error in pairwise marginals $\frac{1}{|E|}\sum_{ij \in E} |p_{alg}(x_i = 1, x_j = 1) - p_{true}(x_i = 1, x_j = 1)|$. Pairwise marginals were calculated jointly using the marginal optimality criterion of Section 5.
3) Error in singleton marginals. We calculated the singleton marginals for the innermost node in the lattice (i.e., coordinate [3, 3]), which intuitively should be the most difficult for the planar based algorithm. This marginal was calculated using two partition functions, as explained in Section 5 (footnote 5). The same method was used for TRW. The reported error measure is $|p_{alg}(x_i = 1) - p_{true}(x_i = 1)|$.
Results were averaged over 40 random trials. Results for the two scenarios and different evaluation measures are given in Figure 3. It can be seen that the partition function bound for PDC is significantly better than TRW for almost all parameter settings, although the difference becomes smaller for large field values. Errors for the PDC pairwise marginals are smaller than those of TRW for all parameter settings. For the singleton marginals, TRW slightly outperforms PDC. This is not surprising since the field is modeled by every spanning tree in the TRW decomposition, whereas in PDC not all the structures model a given field.

Footnote 3: TRW and PDC bounds were optimized over both the subgraph parameters and the mixture parameters ρ.
Footnote 4: In terms of running time, PDC optimization for a fixed value of ρ took about 30 seconds, which is still slower than the TRW message passing implementation.
Footnote 5: Results using the marginal optimality criterion were worse for PDC, possibly due to its reduced numerical precision.

7 Discussion

We have presented a method for using planar graphs as the basis for approximating non-planar graphs such as planar graphs with external fields. While the restriction to binary variables limits the applicability of our approach, it remains relevant in many important applications, such as coding theory and combinatorial optimization. Moreover, it is always possible to convert a non-binary graphical model to a binary one by introducing additional variables. The resulting graph will typically not be planar, even when the original graph over k-ary variables is. However, the planar decomposition method can then be applied to this non-planar graph.

The optimization of the decomposition is carried out explicitly over the planar subgraphs, thus limiting the number of subgraphs that can be used in the approximation. In the TRW method this problem is circumvented since it is possible to implicitly optimize over all spanning trees.
The reason this can be done for trees is that the entropy of an MRF over a tree may be written as a function of its marginal variables. We do not know of an equivalent result for planar graphs, and it remains a challenge to find one. It is however possible to combine the planar and tree decompositions into one single bound, which is guaranteed to outperform the tree or planar approximations alone.

The planar decomposition idea may in principle be applied to bounding the value of the MAP assignment. However, as in TRW, it can be shown that the solution is not dependent on the decomposition (as long as each edge appears in some structure), and the problem is equivalent to maximizing a linear function over the marginal polytope (which can be done in polynomial time for planar graphs). Nevertheless, such a decomposition may suggest new message passing algorithms, as in [10].

Acknowledgments

The authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson is also supported by the Rothschild Yad-Hanadiv fellowship. The authors also wish to thank Martin Wainwright for providing his TRW code.

References

[1] F. Barahona. On the computational complexity of Ising spin glass models. J. Phys. A., 15(10):3241–3253, 1982.
[2] D. P. Bertsekas, editor. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
[3] M.M. Deza and M. Laurent. Geometry of Cuts and Metrics. Springer-Verlag, 1997.
[4] R. Diestel. Graph Theory. Springer-Verlag, 1997.
[5] M.E. Fisher. On the dimer solution of planar Ising models. J. Math. Phys., 7:1776–1781, 1966.
[6] M.I. Jordan, editor. Learning in graphical models. MIT Press, Cambridge, MA, 1998.
[7] P.W. Kasteleyn. Dimer statistics and phase transitions. Journal of Math. Physics, 4:287–293, 1963.
[8] L. Lovasz and M.D. Plummer. Matching Theory, volume 29 of Annals of Discrete Mathematics. North-Holland, New York, 1986.
[9] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Trans. on Information Theory, 49(5):1120–1146, 2003.
[10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. MAP estimation via agreement on trees: message-passing and linear programming. IEEE Trans. on Information Theory, 51(11):3697–3717, 2005.
[11] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. on Information Theory, 51(7):2313–2335, 2005.
[12] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. Technical report, UC Berkeley Dept. of Statistics, 2003.
[13] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005.
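
Illustrative sketch (not from the paper): Section 6 above replaces the external field by an extra vertex $x_{n+1}$ connected to every lattice vertex, and states that properties of $p(x)$ can be recovered from $p(x, x_{n+1})$. For the partition function the relation is a factor of two, because flipping all spins maps the $x_{n+1} = -1$ half of the sum onto the $x_{n+1} = +1$ half. The brute-force check below assumes a tiny lattice so that exhaustive enumeration is feasible; the function names `lattice_edges`, `partition_with_field`, and `partition_augmented` are hypothetical helpers introduced here, and nothing in it implements the planar decomposition itself.

```python
import itertools
import math
import random


def lattice_edges(rows, cols):
    """Edges of a rows x cols grid, vertices numbered row-major from 0."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            v = r * cols + c
            if c + 1 < cols:
                edges.append((v, v + 1))
            if r + 1 < rows:
                edges.append((v, v + cols))
    return edges


def partition_with_field(n, edges, theta_edge, theta_node):
    """Brute-force Z for p(x) ∝ exp(sum θij xi xj + sum θi xi), xi in {-1, +1}."""
    z = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        energy = sum(theta_edge[(i, j)] * x[i] * x[j] for (i, j) in edges)
        energy += sum(theta_node[i] * x[i] for i in range(n))
        z += math.exp(energy)
    return z


def partition_augmented(n, edges, theta_edge, theta_node):
    """Brute-force Z for the pairwise-only model with an extra field vertex n."""
    weights = {e: theta_edge[e] for e in edges}
    for i in range(n):
        weights[(i, n)] = theta_node[i]  # θ_{i,n+1} = θ_i
    z = 0.0
    for x in itertools.product((-1, 1), repeat=n + 1):
        z += math.exp(sum(w * x[i] * x[j] for (i, j), w in weights.items()))
    return z


if __name__ == "__main__":
    random.seed(0)
    rows, cols = 3, 3  # small enough to enumerate all 2^9 (and 2^10) states
    n = rows * cols
    edges = lattice_edges(rows, cols)
    theta_edge = {e: random.uniform(-0.5, 0.5) for e in edges}
    theta_node = [random.uniform(-0.5, 0.5) for _ in range(n)]
    z = partition_with_field(n, edges, theta_edge, theta_node)
    z_aug = partition_augmented(n, edges, theta_edge, theta_node)
    print(z, z_aug / 2.0)  # the two numbers should agree up to rounding
```

An upper bound on the augmented partition function therefore yields an upper bound on $Z(\theta)$ after subtracting $\log 2$ from the log bound.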

6 0.46237066 39 nips-2006-Balanced Graph Matching

7 0.44820964 163 nips-2006-Prediction on a Graph with a Perceptron

8 0.44499278 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models

9 0.41080505 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields

10 0.39573261 77 nips-2006-Fast Computation of Graph Kernels

11 0.39147744 60 nips-2006-Convergence of Laplacian Eigenmaps

12 0.37479642 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding

13 0.37412342 117 nips-2006-Learning on Graph with Laplacian Regularization

14 0.36074433 95 nips-2006-Implicit Surfaces with Globally Regularised and Compactly Supported Basis Functions

15 0.33967006 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation

16 0.33587292 10 nips-2006-A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems

17 0.32571608 146 nips-2006-No-regret Algorithms for Online Convex Programs

18 0.32164925 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow

19 0.31820029 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models

20 0.31114176 169 nips-2006-Relational Learning with Gaussian Processes

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.075), (3, 0.037), (7, 0.083), (9, 0.038), (21, 0.011), (22, 0.04), (44, 0.038), (57, 0.043), (65, 0.037), (69, 0.468), (83, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92613524 147 nips-2006-Non-rigid point set registration: Coherent Point Drift

Author: Andriy Myronenko, Xubo Song, Miguel Á. Carreira-Perpiñán

Abstract: We introduce Coherent Point Drift (CPD), a novel probabilistic method for nonrigid registration of point sets. The registration is treated as a Maximum Likelihood (ML) estimation problem with motion coherence constraint over the velocity ﬁeld such that one point set moves coherently to align with the second set. We formulate the motion coherence constraint and derive a solution of regularized ML estimation through the variational approach, which leads to an elegant kernel form. We also derive the EM algorithm for the penalized ML optimization with deterministic annealing. The CPD method simultaneously ﬁnds both the non-rigid transformation and the correspondence between two point sets without making any prior assumption of the transformation model except that of motion coherence. This method can estimate complex non-linear non-rigid transformations, and is shown to be accurate on 2D and 3D examples and robust in the presence of outliers and missing points.

2 0.89693743 88 nips-2006-Greedy Layer-Wise Training of Deep Networks

Author: Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle

Abstract: Complexity theory of circuits strongly suggests that deep architectures can be much more efﬁcient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also conﬁrm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.

3 0.88480628 176 nips-2006-Single Channel Speech Separation Using Factorial Dynamics

Author: John R. Hershey, Trausti Kristjansson, Steven Rennie, Peder A. Olsen

Abstract: Human listeners have the extraordinary ability to hear and recognize speech even when more than one person is talking. Their machine counterparts have historically been unable to compete with this ability, until now. We present a model-based system that performs on par with humans in the task of separating speech of two talkers from a single-channel recording. Remarkably, the system surpasses human recognition performance in many conditions. The models of speech use temporal dynamics to help infer the source speech signals, given mixed speech signals. The estimated source signals are then recognized using a conventional speech recognition system. We demonstrate that the system achieves its best performance when the model of temporal dynamics closely captures the grammatical constraints of the task.

One of the hallmarks of human perception is our ability to solve the auditory cocktail party problem: we can direct our attention to a given speaker in the presence of interfering speech, and understand what was said remarkably well. Until now the same could not be said for automatic speech recognition systems. However, we have recently introduced a system which in many conditions performs this task better than humans [1][2]. The model addresses the Pascal Speech Separation Challenge task [3], and outperforms all other published results by more than 10% word error rate (WER). In this model, dynamics are modeled using a layered combination of one or two Markov chains: one for long-term dependencies and another for short-term dependencies. The combination of the two speakers was handled via an iterative Laplace approximation method known as Algonquin [4]. Here we describe experiments that show better performance on the same task with a simpler version of the model.
The task we address is provided by the PASCAL Speech Separation Challenge [3], which provides standard training, development, and test data sets of single-channel speech mixtures following an arbitrary but simple grammar. In addition, the challenge organizers have conducted human-listening experiments to provide an interesting baseline for comparison of computational techniques. The overall system we developed is composed of the three components: a speaker identiﬁcation and gain estimation component, a signal separation component, and a speech recognition system. In this paper we focus on the signal separation component, which is composed of the acoustic and grammatical models. The details of the other components are discussed in [2]. Single-channel speech separation has previously been attempted using Gaussian mixture models (GMMs) on individual frames of acoustic features. However such models tend to perform well only when speakers are of different gender or have rather different voices [4]. When speakers have similar voices, speaker-dependent mixture models cannot unambiguously identify the component speakers. In such cases it is helpful to model the temporal dynamics of the speech. Several models in the literature have attempted to do so either for recognition [5, 6] or enhancement [7, 8] of speech. Such models have typically been based on a discrete-state hidden Markov model (HMM) operating on a frame-based acoustic feature vector. Modeling the dynamics of the log spectrum of speech is challenging in that different speech components evolve at different time-scales. For example the excitation, which carries mainly pitch, versus the ﬁlter, which consists of the formant structure, are somewhat independent of each other. The formant structure closely follows the sequences of phonemes in each word, which are pronounced at a rate of several per second. 
In non-tonal languages such as English, the pitch fluctuates with prosody over the course of a sentence, and is not directly coupled with the words being spoken. Nevertheless, it seems to be important in separating speech, because the pitch harmonics carry predictable structure that stands out against the background.

We address the various dynamic components of speech by testing different levels of dynamic constraints in our models. We explore four different levels of dynamics: no dynamics, low-level acoustic dynamics, high-level grammar dynamics, and a layered combination, dual dynamics, of the acoustic and grammar dynamics. The grammar dynamics and dual dynamics models perform the best in our experiments.

The acoustic models are combined to model mixtures of speech using two methods: a nonlinear model known as Algonquin, which models the combination of log-spectrum models as a sum in the power spectrum, and a simpler max model that combines two log spectra using the max function. It turns out that whereas Algonquin works well, our formulation of the max model does better overall. With the combination of the max model and grammar-level dynamics, the model produces remarkable results: it is often able to extract two utterances from a mixture even when they are from the same speaker (footnote 1). Overall results are given in Table 1, which shows that our closest competitors are human listeners.

Table 1: Overall word error rates across all conditions on the challenge task. Human: average human error rate, IBM: our best result, Next Best: the best of the eight other published results on this task, and Chance: the theoretical error rate for random guessing.

System:            Human    IBM      Next Best    Chance
Word Error Rate:   22.3%    22.6%    34.2%        93.0%

1 Speech Models

The model consists of an acoustic model and temporal dynamics model for each source, and a mixing model, which models how the source models are combined to describe the mixture.
The acoustic features were short-time log spectrum frames computed every 15 ms. Each frame was of length 40 ms and a 640-point mixed-radix FFT was used. The DC component was discarded, producing a 319-dimensional log-power-spectrum feature vector $y_t$.

The acoustic model consists of a set of diagonal-covariance Gaussians in the features. For a given speaker, $a$, we model the conditional probability of the log-power spectrum of each source signal $x^a$ given a discrete acoustic state $s^a$ as Gaussian, $p(x^a|s^a) = N(x^a; \mu_{s^a}, \Sigma_{s^a})$, with mean $\mu_{s^a}$ and covariance matrix $\Sigma_{s^a}$. We used 256 Gaussians, one per acoustic state, to model the acoustic space of each speaker. For efficiency and tractability we restrict the covariance to be diagonal. A model with no dynamics can be formulated by producing state probabilities $p(s^a)$, and is depicted in Figure 1(a).

Acoustic Dynamics: To capture the low-level dynamics of the acoustic signal, we modeled the acoustic dynamics of a given speaker, $a$, via state transitions $p(s^a_t|s^a_{t-1})$ as shown in Figure 1(b). There are 256 acoustic states, hence for each speaker $a$, we estimated a 256 × 256 element transition matrix $A^a$.

Grammar Dynamics: The grammar dynamics are modeled by grammar state transitions, $p(v^a_t|v^a_{t-1})$, which consist of left-to-right phone models. The legal word sequences are given by the Speech Separation Challenge grammar [3] and are modeled using a set of pronunciations that map from words to three-state context-dependent phone models. The state transition probabilities derived from these phone models are sparse in the sense that most transition probabilities are zero. We model speaker-dependent distributions $p(s^a|v^a)$ that associate the grammar states $v^a$ with the speaker-dependent acoustic states. These are learned from training data where the grammar state sequences and acoustic state sequences are known for each utterance. The grammar of our system has 506 states, so we estimate a 506 × 256 element conditional probability matrix $B^a$ for each speaker.

Footnote 1: Demos and information can be found at: http://www.research.ibm.com/speechseparation

Figure 1: Graph of models for a given source, with panels (a) No Dynamics, (b) Acoustic Dynamics, (c) Grammar Dynamics, (d) Dual Dynamics. In (a), there are no dynamics, so the model is a simple mixture model. In (b), only acoustic dynamics are modeled. In (c), grammar dynamics are modeled with a shared set of acoustic Gaussians. In (d), dual (grammar and acoustic) dynamics have been combined. Note that (a), (b) and (c) are special cases of (d), where different nodes are assumed independent.

Dual Dynamics: The dual-dynamics model combines the acoustic dynamics with the grammar dynamics. It is useful in this case to avoid modeling the full combination of $s$ and $v$ states in the joint transitions $p(s^a_t|s^a_{t-1}, v_t)$. Instead we make a naive-Bayes assumption to approximate this as $\frac{1}{z}\, p(s^a_t|s^a_{t-1})^{\alpha}\, p(s^a_t|v_t)^{\beta}$, where $\alpha$ and $\beta$ adjust the relative influence of the two probabilities, and $z$ is the normalizing constant. Here we simply use the probability matrices $A^a$ and $B^a$, defined above.

2 Mixed Speech Models

The speech separation challenge involves recognizing speech in mixtures of signals from two speakers, $a$ and $b$. We consider only mixing models that operate independently on each frequency for analytical and computational tractability. The short-time log spectrum of the mixture $y_t$, in a given frequency band, is related to that of the two sources $x^a_t$ and $x^b_t$ via the mixing model given by the conditional probability distribution, $p(y|x^a, x^b)$. The joint distribution of the observation and source in one feature dimension, given the source states, is thus:

$$p(y_t, x^a_t, x^b_t|s^a_t, s^b_t) = p(y_t|x^a_t, x^b_t)\, p(x^a_t|s^a_t)\, p(x^b_t|s^b_t). \qquad (1)$$

In general, to infer and reconstruct speech we need to compute the likelihood of the observed mixture

$$p(y_t|s^a_t, s^b_t) = \int\!\!\int p(y_t, x^a_t, x^b_t|s^a_t, s^b_t)\, dx^a_t\, dx^b_t, \qquad (2)$$

and the posterior expected values of the sources given the states,

$$E(x^a_t|y_t, s^a_t, s^b_t) = \int\!\!\int x^a_t\, p(x^a_t, x^b_t|y_t, s^a_t, s^b_t)\, dx^a_t\, dx^b_t, \qquad (3)$$

and similarly for $x^b_t$. These quantities, combined with a prior model for the joint state sequences $\{s^a_{1..T}, s^b_{1..T}\}$, allow us to compute the minimum mean squared error (MMSE) estimators $E(x^a_{1..T}|y_{1..T})$ or the maximum a posteriori (MAP) estimate $E(x^a_{1..T}|y_{1..T}, \hat{s}^a_{1..T}, \hat{s}^b_{1..T})$, where $\hat{s}^a_{1..T}, \hat{s}^b_{1..T} = \arg\max_{s^a_{1..T}, s^b_{1..T}} p(s^a_{1..T}, s^b_{1..T}|y_{1..T})$, and the subscript $1..T$ refers to all frames in the signal. The mixing model can be defined in a number of ways. We explore two popular candidates, for which the above integrals can be readily computed: Algonquin, and the max model.

Figure 2: Model combination for two talkers, with panels (a) Mixing Model and (b) Dual Dynamics Factorial Model. In (a) all dependencies are shown. In (b) the full dual-dynamics model is graphed with the $x^a$ and $x^b$ integrated out, and corresponding states from each speaker combined into product states. The other models are special cases of this graph with different edges removed, as in Figure 1.

Algonquin: The relationship between the sources and mixture in the log power spectral domain is approximated as

$$p(y_t|x^a_t, x^b_t) = N(y_t; \log(\exp(x^a_t) + \exp(x^b_t)), \Psi) \qquad (4)$$

where $\Psi$ is introduced to model the error due to the omission of phase [4]. An iterative Newton-Laplace method accurately approximates the conditional posterior $p(x^a_t, x^b_t|y_t, s^a_t, s^b_t)$ from (1) as Gaussian. This Gaussian allows us to analytically compute the observation likelihood $p(y_t|s^a_t, s^b_t)$ and expected value $E(x^a_t|y_t, s^a_t, s^b_t)$, as in [4].

Max model: The mixing model is simplified using the fact that the log of a sum is approximately the log of the maximum:

$$p(y|x^a, x^b) = \delta\big(y - \max(x^a, x^b)\big) \qquad (5)$$

In this model the likelihood is

$$p(y_t|s^a_t, s^b_t) = p_{x^a_t}(y_t|s^a_t)\, \Phi_{x^b_t}(y_t|s^b_t) + p_{x^b_t}(y_t|s^b_t)\, \Phi_{x^a_t}(y_t|s^a_t), \qquad (6)$$

where $\Phi_{x^a_t}(y_t|s^a_t) = \int_{-\infty}^{y_t} N(x^a_t; \mu_{s^a_t}, \Sigma_{s^a_t})\, dx^a_t$ is a Gaussian cumulative distribution function [5]. In [5], such a model was used to compute state likelihoods and find the optimal state sequence. In [8], a simplified model was used to infer binary masking values for refiltering.

We take the max model a step further and derive source posteriors, so that we can compute the MMSE estimators for the log power spectrum. Note that the source posteriors in $x^a_t$ and $x^b_t$ are each a mixture of a delta function and a truncated Gaussian. Thus we analytically derive the necessary expected value:

$$E(x^a_t|y_t, s^a_t, s^b_t) = p(x^a_t = y_t|y_t, s^a_t, s^b_t)\, y_t + p(x^a_t < y_t|y_t, s^a_t, s^b_t)\, E(x^a_t|x^a_t < y_t, s^a_t) \qquad (7)$$

$$= \pi^a_t\, y_t + \pi^b_t \left(\mu_{s^a_t} - \Sigma_{s^a_t} \frac{p_{x^a_t}(y_t|s^a_t)}{\Phi_{x^a_t}(y_t|s^a_t)}\right), \qquad (8)$$

with weights $\pi^a_t = p(x^a_t = y_t|y_t, s^a_t, s^b_t) = p_{x^a_t}(y_t|s^a_t)\, \Phi_{x^b_t}(y_t|s^b_t) / p(y_t|s^a_t, s^b_t)$, and $\pi^b_t = 1 - \pi^a_t$. For many pairs of states one model is significantly louder than another, $\mu_{s^a} \gg \mu_{s^b}$, in a given frequency band, relative to their variances. In such cases it is reasonable to approximate the likelihood as $p(y_t|s^a_t, s^b_t) \approx p_{x^a_t}(y_t|s^a_t)$, and the posterior expected values according to $E(x^a_t|y_t, s^a_t, s^b_t) \approx y_t$ and $E(x^b_t|y_t, s^a_t, s^b_t) \approx \min(y_t, \mu_{s^b_t})$, and similarly for $\mu_{s^a} \ll \mu_{s^b}$.

3 Likelihood Estimation

Because of the large number of state combinations, the model would not be practical without techniques to reduce computation time. To speed up the evaluation of the joint state likelihood, we employed both band quantization of the acoustic Gaussians and joint-state pruning.
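
Illustrative sketch (not from the paper): the max-model likelihood of Equation 6 and the posterior mean of Equations 7 and 8 can be written down directly for a single frequency band, where each source state is a scalar Gaussian. The function names and the scalar `(mu, var)` parameterization below are assumptions made for illustration, and the code makes no attempt at the log-domain arithmetic a real system would likely need; for extreme inputs the Gaussian CDF can underflow to zero.

```python
import math


def gauss_pdf(y, mu, var):
    """Density of N(mu, var) at y."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gauss_cdf(y, mu, var):
    """P(x <= y) for x ~ N(mu, var); erfc keeps some accuracy in the lower tail."""
    return 0.5 * math.erfc(-(y - mu) / math.sqrt(2.0 * var))


def max_model_likelihood(y, mu_a, var_a, mu_b, var_b):
    """Equation 6 for one band: density of y = max(x_a, x_b)."""
    return (gauss_pdf(y, mu_a, var_a) * gauss_cdf(y, mu_b, var_b)
            + gauss_pdf(y, mu_b, var_b) * gauss_cdf(y, mu_a, var_a))


def max_model_posterior_mean_a(y, mu_a, var_a, mu_b, var_b):
    """Equations 7-8 for one band: E(x_a | y, s_a, s_b)."""
    lik = max_model_likelihood(y, mu_a, var_a, mu_b, var_b)
    # pi_a: posterior probability that source a is the louder one (x_a = y).
    pi_a = gauss_pdf(y, mu_a, var_a) * gauss_cdf(y, mu_b, var_b) / lik
    pi_b = 1.0 - pi_a
    # Mean of N(mu_a, var_a) truncated to x_a < y.
    truncated_mean = mu_a - var_a * gauss_pdf(y, mu_a, var_a) / gauss_cdf(y, mu_a, var_a)
    return pi_a * y + pi_b * truncated_mean


if __name__ == "__main__":
    # Source a much louder than source b: E(x_a) is close to y,
    # and E(x_b) is close to min(y, mu_b), as stated in the text.
    print(max_model_posterior_mean_a(10.0, 10.0, 1.0, -10.0, 1.0))
    print(max_model_posterior_mean_a(10.0, -10.0, 1.0, 10.0, 1.0))
```

Swapping the roles of the two sources in the call gives the posterior mean for $x^b_t$, which is how the second line of the usage example is obtained.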
Band Quantization: One source of computational savings stems from the fact that some of the Gaussians in our model may differ only in a few features. Band quantization addresses this by approximating each of the $D$ Gaussians of each model with a shared set of $d$ Gaussians, where $d \ll D$, in each of the $F$ frequency bands of the feature vector. A similar idea is described in [9]. It relies on the use of a diagonal covariance matrix, so that $p(x^a|s^a) = \prod_f N(x^a_f; \mu_{f,s^a}, \Sigma_{f,s^a})$, where $\Sigma_{f,s^a}$ are the diagonal elements of covariance matrix $\Sigma_{s^a}$. The mapping $M_f(s_i)$ associates each of the $D$ Gaussians with one of the $d$ Gaussians in band $f$. Now $\hat{p}(x^a|s^a) = \prod_f N(x^a_f; \mu_{f,M_f(s^a)}, \Sigma_{f,M_f(s^a)})$ is used as a surrogate for $p(x^a|s^a)$. Figure 3 illustrates the idea.

Figure 3: In band quantization, many multi-dimensional Gaussians are mapped to a few unidimensional Gaussians.

Under this model the $d$ Gaussians are optimized by minimizing the KL-divergence $D\big(\sum_{s^a} p(s^a)\, p(x^a|s^a) \,\|\, \sum_{s^a} p(s^a)\, \hat{p}(x^a|s^a)\big)$, and likewise for $s^b$. Then in each frequency band, only $d \times d$, instead of $D \times D$, combinations of Gaussians have to be evaluated to compute $\hat{p}(y|s^a, s^b)$. Despite the relatively small number of components $d$ in each band, taken across bands, band quantization is capable of expressing $d^F$ distinct patterns in an $F$-dimensional feature space, although in practice only a subset of these will be used to approximate the Gaussians in a given model. We used $d = 8$ and $D = 256$, which reduced the likelihood computation time by three orders of magnitude.

Joint State Pruning: Another source of computational savings comes from the sparseness of the model. Only a handful of $s^a, s^b$ combinations have likelihoods that are significantly larger than the rest for a given observation. Only these states are required to adequately explain the observation.
By pruning the total number of combinations down to a smaller number we can speed up the likelihood calculation, the estimation of the component signals, and the temporal inference. However, we must estimate the likelihoods in order to determine which states to retain. We therefore used band quantization to estimate likelihoods for all states, performed state pruning, and then evaluated the full model on the pruned states using the exact parameters. In the experiments reported here, we pruned down to 256 state combinations. The effect of these speedup methods on accuracy will be reported in a future publication.

4 Inference

In our experiments we performed inference in four different conditions: no dynamics, with acoustic dynamics only, with grammar dynamics only, and with dual dynamics (acoustic and grammar). With no dynamics the source models reduce to GMMs and we infer MMSE estimates of the sources based on $p(x^a, x^b|y)$ as computed from (1), using Algonquin or the max model. Once the log spectrum of each source is estimated, we estimate the corresponding time-domain signal as shown in [4].

In the acoustic dynamics condition the exact inference algorithm uses a 2-dimensional Viterbi search, described below, with acoustic temporal constraints $p(s_t|s_{t-1})$ and likelihoods from Eqn. (1), to find the most likely joint state sequence $s_{1..T}$. Similarly in the grammar dynamics condition, 2-D Viterbi search is used to infer the grammar state sequences, $v_{1..T}$. Instead of single Gaussians as the likelihood models, however, we have mixture models in this case. So we can perform an MMSE estimate of the sources by averaging over the posterior probability of the mixture components given the grammar Viterbi sequence and the observations.

It is critical to use the 2-D Viterbi algorithm in both cases, rather than the forward-backward algorithm, because in the same-speaker condition at 0 dB, the acoustic models and dynamics are symmetric.
This symmetry means that the posterior is essentially bimodal and averaging over these modes would yield identical estimates for both speakers. By finding the best path through the joint state space, the 2-D Viterbi algorithm breaks this symmetry and allows the model to make different estimates for each speaker. In the dual-dynamics condition we use the model of section 2(b). With two speakers, exact inference is computationally complex because the full joint distribution of the grammar and acoustic states, $(v^a \times s^a) \times (v^b \times s^b)$, is required, and the number of joint states is very large. Instead we perform approximate inference by alternating the 2-D Viterbi search between two factors: the Cartesian product $s^a \times s^b$ of the acoustic state sequences and the Cartesian product $v^a \times v^b$ of the grammar state sequences. When evaluating each state sequence we hold the other chain constant, which decouples its dynamics and allows for efficient inference. This is a useful factorization because the states $s^a$ and $s^b$ interact strongly with each other, and similarly for $v^a$ and $v^b$. Again, in the same-talker condition, the 2-D Viterbi search breaks the symmetry in each factor.

2-D Viterbi search: The Viterbi algorithm estimates the maximum-likelihood state sequence $s_{1..T}$ given the observations $x_{1..T}$. The complexity of the Viterbi search is $O(TD^2)$, where D is the number of states and T is the number of frames. For producing MAP estimates of the two sources, we require a two-dimensional Viterbi search which finds the most likely joint state sequences $s^a_{1..T}$ and $s^b_{1..T}$ given the mixed signal $y_{1..T}$, as was proposed in [5]. On the surface, the 2-D Viterbi search appears to be of complexity $O(TD^4)$. Surprisingly, it can be computed in $O(TD^3)$ operations. This stems from the fact that the dynamics for each chain are independent. The forward-backward algorithm for a factorial HMM with N state variables requires only $O(TND^{N+1})$ rather than the $O(TD^{2N})$ required for a naive implementation [10].
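The saving for two chains can be made concrete with a small sketch. Because the joint transition factorizes into one term per chain, the previous state of one chain can be eliminated first and the result reused when eliminating the other. The code below is only an illustration under stated assumptions: the function names and the `reduce_fn` parameter are hypothetical, `trans_a[i][j]` is taken to mean the probability of moving from state i to state j, and the multiplication by the observation likelihood and the storage of back-pointers are omitted. Passing `sum` gives the prediction step of the forward recursion; passing `max` gives the corresponding maximization.

```python
def naive_step(prev, trans_a, trans_b, reduce_fn):
    """O(D^4): reduce over both previous states jointly for each target pair."""
    D = len(prev)
    return [[reduce_fn(trans_a[ap][a] * trans_b[bp][b] * prev[ap][bp]
                       for ap in range(D) for bp in range(D))
             for b in range(D)] for a in range(D)]

def factored_step(prev, trans_a, trans_b, reduce_fn):
    """O(D^3): eliminate chain b's previous state first, then chain a's."""
    D = len(prev)
    # inner[ap][b] = reduce over bp of p(b | bp) * prev[ap][bp]   (D^3 work)
    inner = [[reduce_fn(trans_b[bp][b] * prev[ap][bp] for bp in range(D))
              for b in range(D)] for ap in range(D)]
    # out[a][b] = reduce over ap of p(a | ap) * inner[ap][b]      (D^3 work)
    return [[reduce_fn(trans_a[ap][a] * inner[ap][b] for ap in range(D))
             for b in range(D)] for a in range(D)]
```

Both functions return the same D × D table up to floating-point rounding. The factored version is valid for `max` because all factors are non-negative, so the outer term can be pulled out of the inner maximization.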
The same is true for the Viterbi algorithm. In the Viterbi algorithm, we wish to find the most probable paths leading to each state by finding the two arguments $s^a_{t-1}$ and $s^b_{t-1}$ of the following maximization:

$$
\begin{aligned}
\{\hat{s}^a_{t-1}, \hat{s}^b_{t-1}\}
&= \arg\max_{s^a_{t-1},\, s^b_{t-1}} p(s^a_t \mid s^a_{t-1})\, p(s^b_t \mid s^b_{t-1})\, p(s^a_{t-1}, s^b_{t-1} \mid y_{1..t-1}) \\
&= \arg\max_{s^a_{t-1}} p(s^a_t \mid s^a_{t-1}) \max_{s^b_{t-1}} p(s^b_t \mid s^b_{t-1})\, p(s^a_{t-1}, s^b_{t-1} \mid y_{1..t-1}).
\end{aligned}
\tag{9}
$$

The two maximizations can be done in sequence, requiring $O(D^3)$ operations with $O(D^2)$ storage for each step. In general, as with the forward-backward algorithm, the N-dimensional Viterbi search requires $O(TND^{N+1})$ operations. We can also exploit the sparsity of the transition matrices and observation likelihoods by pruning unlikely values. Using both of these methods, our implementation of 2-D Viterbi search is faster than the acoustic likelihood computation that serves as its input, for the model sizes and grammars chosen in the speech separation task.

Speaker and Gain Estimation: In the challenge task, the gains and identities of the two speakers were unknown at test time and were selected from a set of 34 speakers which were mixed at SNRs ranging from 6 dB to -9 dB. We used speaker-dependent acoustic models because of their advantages when separating different speakers. These models were trained on gain-normalized data, so the models are not well matched to the different gains of the signals at test time. This means that we have to estimate both the speaker identities and the gains in order to adapt our models to the source signals for each test utterance. The number of speakers and the range of SNRs in the test set make it too expensive to consider every possible combination of models and gains. Instead, we developed an efficient model-based method for identifying the speakers and gains, described in [2].
The algorithm is based upon a very simple idea: identify and utilize frames that are dominated by a single source, based on their likelihoods under each speaker-dependent acoustic model, to determine what sources are present in the mixture. Using this criterion we can eliminate most of the unlikely speakers, and explore all combinations of the remaining speakers. An approximate EM procedure is then used to select a single pair of speakers and estimate their gains.

Recognition: Although inference in the system may involve recognition of the words (for models that contain a grammar), we still found that a separately trained recognizer performed better. After reconstruction, each of the two signals is therefore decoded with a speech recognition system that incorporates Speaker Dependent Labeling (SDL) [2]. This method uses speaker-dependent models for each of the 34 speakers. Instead of using the speaker identities provided by the speaker ID and gain module, we followed the approach for gender dependent labeling (GDL) described in [11]. This technique provides better results than if the true speaker ID is specified.

5 Results

The Speech Separation Challenge [3] involves separating the mixed speech of two speakers drawn from a set of 34 speakers. An example utterance is "place white by R 4 now". In each recording, one of the speakers says "white" while the other says "blue", "red" or "green". The task is to recognize the letter and the digit of the speaker that said "white". Using the SDL recognizer, we decoded the two estimated signals under the assumption that one signal contains "white" and the other does not, and vice versa. We then used the association that yielded the highest combined likelihood.

Figure 4: Average word error rate (WER, %) as a function of model dynamics (No Separation, No Dynamics, Acoustic Dyn., Grammar Dyn., Dual Dyn., Human), in different talker conditions (Same Talker, Same Gender, Different Gender, All), compared to human error rates, using Algonquin.

Human listener performance [3] is compared in Figure 4 to results using the SDL recognizer without speech separation, and for each of the proposed models. Performance is poor without separation in all conditions. With no dynamics the models do surprisingly well in the different-talker conditions, but poorly when the signals come from the same talker. Acoustic dynamics gives some improvement, mainly in the same-talker condition. The grammar dynamics seems to give the most benefit, bringing the error rate in the same-gender condition below that of humans. The dual-dynamics model performed about the same as the grammar dynamics model, despite our intuitions. Replacing Algonquin with the max model reduced the error rate in the dual dynamics model (from 24.3% to 23.5%) and grammar dynamics model (from 24.6% to 22.6%), which brings the latter closer than any other model to the human word error rate of 22.3%. Figure 5 shows the relative word error rate of the best system compared to human subjects. When both speakers are around the same loudness, the system exceeds human performance, and in the same-gender condition makes less than half the errors of the humans. Human listeners do better when the two signals are at different levels, even if the target is below the masker (i.e., at -9 dB), suggesting that they are better able to make use of differences in amplitude as a cue for separation.

Figure 5: Word error rate of best system relative to human performance, as a function of signal-to-noise ratio (6 dB to -9 dB), for the Same Talker, Same Gender and Different Gender conditions. Shaded area is where the system outperforms human listeners.

An interesting question is to what extent different grammar constraints affect the results.
To test this, we limited the grammar to just the two test utterances, and the error rate on the estimated sources dropped to around 10%. This may be a useful paradigm for separating speech from background noise when the text is known, such as in closed-captioned recordings. At the other extreme, in realistic speech recognition scenarios, there is little knowledge of the background speaker's grammar. In such cases the benefits of models of low-level acoustic continuity over purely grammar-based systems may be more apparent. It is our hope that further experiments with both human and machine listeners will provide us with a better understanding of the differences in their performance characteristics, and provide insights into how the human auditory system functions, as well as how automatic speech perception in general can be brought to human levels of performance.

References

[1] T. Kristjansson, J. R. Hershey, P. A. Olsen, S. Rennie, and R. Gopinath, “Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system,” in ICSLP, 2006.
[2] Steven Rennie, Peder A. Olsen, John R. Hershey, and Trausti Kristjansson, “Separating multiple speakers using temporal constraints,” in ISCA Workshop on Statistical And Perceptual Audition, 2006.
[3] Martin Cooke and Tee-Won Lee, “Interspeech speech separation challenge,” http://www.dcs.shef.ac.uk/~martin/SpeechSeparationChallenge.htm, 2006.
[4] T. Kristjansson, J. Hershey, and H. Attias, “Single microphone source separation using high resolution signal reconstruction,” ICASSP, 2004.
[5] P. Varga and R.K. Moore, “Hidden Markov model decomposition of speech and noise,” ICASSP, pp. 845–848, 1990.
[6] M. Gales and S. Young, “Robust continuous speech recognition using parallel model combination,” IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, pp. 352–359, September 1996.
[7] Y. Ephraim, “A Bayesian estimation approach for speech enhancement using hidden Markov models,” vol. 40, no. 4, pp. 725–735, 1992.
[8] S. Roweis, “Factorial models and refiltering for speech separation and denoising,” Eurospeech, pp. 1009–1012, 2003.
[9] E. Bocchieri, “Vector quantization for the efficient computation of continuous density likelihoods,” in ICASSP, 1993, vol. II, pp. 692–695.
[10] Zoubin Ghahramani and Michael I. Jordan, “Factorial hidden Markov models,” in Advances in Neural Information Processing Systems, vol. 8.
[11] Peder Olsen and Satya Dharanipragada, “An efficient integrated gender detection scheme and time mediated averaging of gender dependent acoustic models,” in Eurospeech 2003, 2003, vol. 4, pp. 2509–2512.

same-paper 4 0.87182009 93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms

Author: Xinhua Zhang, Wee S. Lee

5 0.81435579 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation

Author: Daniel Tarlow, Gal Elidan, Daphne Koller, John C. Duchi

Abstract: In general, the problem of computing a maximum a posteriori (MAP) assignment in a Markov random field (MRF) is computationally intractable. However, in certain subclasses of MRF, an optimal or close-to-optimal assignment can be found very efficiently using combinatorial optimization algorithms: certain MRFs with mutual exclusion constraints can be solved using bipartite matching, and MRFs with regular potentials can be solved using minimum cut methods. However, these solutions do not apply to the many MRFs that contain such tractable components as sub-networks, but also other non-complying potentials. In this paper, we present a new method, called COMPOSE, for exploiting combinatorial optimization for sub-networks within the context of a max-product belief propagation algorithm. COMPOSE uses combinatorial optimization for computing exact max-marginals for an entire sub-network; these can then be used for inference in the context of the network as a whole. We describe highly efficient methods for computing max-marginals for subnetworks corresponding both to bipartite matchings and to regular networks. We present results on both synthetic and real networks encoding correspondence problems between images, which involve both matching constraints and pairwise geometric constraints. We compare to a range of current methods, showing that the ability of COMPOSE to transmit information globally across the network leads to improved convergence, decreased running time, and higher-scoring assignments.

6 0.57663339 134 nips-2006-Modeling Human Motion Using Binary Latent Variables

7 0.53904009 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints

8 0.52025497 167 nips-2006-Recursive ICA

9 0.47606072 158 nips-2006-PG-means: learning the number of clusters in data

10 0.47071034 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

11 0.46696827 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements

12 0.46594161 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models

13 0.46401528 31 nips-2006-Analysis of Contour Motions

14 0.46363932 34 nips-2006-Approximate Correspondences in High Dimensions

15 0.45779544 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians

16 0.45264676 15 nips-2006-A Switched Gaussian Process for Estimating Disparity and Segmentation in Binocular Stereo

17 0.4421154 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

18 0.44167915 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

19 0.44074845 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model

20 0.43689302 39 nips-2006-Balanced Graph Matching