nips nips2006 nips2006-159 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tommi S. Jaakkola, Yuan Qi
Abstract: Bayesian inference has become increasingly important in statistical machine learning. Exact Bayesian calculations are often not feasible in practice, however. A number of approximate Bayesian methods have been proposed to make such calculations practical, among them the variational Bayesian (VB) approach. The VB approach, while useful, can nevertheless suffer from slow convergence to the approximate solution. To address this problem, we propose Parameter-eXpanded Variational Bayesian (PX-VB) methods to speed up VB. The new algorithm is inspired by parameter-expanded expectation maximization (PX-EM) and parameter-expanded data augmentation (PX-DA). Similar to PX-EM and -DA, PX-VB expands a model with auxiliary variables to reduce the coupling between variables in the original model. We analyze the convergence rates of VB and PX-VB and demonstrate the superior convergence rates of PX-VB in variational probit regression and automatic relevance determination.
Reference: text
sentIndex sentText sentNum sentScore
1 A number of approximate Bayesian methods have been proposed to make such calculations practical, among them the variational Bayesian (VB) approach. [sent-8, score-0.229]
2 The VB approach, while useful, can nevertheless suffer from slow convergence to the approximate solution. [sent-9, score-0.204]
3 Similar to PX-EM and -DA, PX-VB expands a model with auxiliary variables to reduce the coupling between variables in the original model. [sent-12, score-0.282]
4 We analyze the convergence rates of VB and PX-VB and demonstrate the superior convergence rates of PX-VB in variational probit regression and automatic relevance determination. [sent-13, score-0.75]
5 Given a target probability distribution, variational Bayesian methods approximate the target distribution with a factored distribution. [sent-16, score-0.344]
6 While factoring omits dependencies present in the target distribution, the parameters of the factored approximation can be adjusted to improve the match. [sent-17, score-0.089]
7 Specifically, the approximation is optimized by minimizing the KL-divergence between the factored distribution and the target. [sent-18, score-0.092]
8 Variational Bayesian methods nevertheless suffer from slow convergence when the variables in the factored approximation are actually strongly coupled in the original model. [sent-24, score-0.283]
9 The slow convergence can be alleviated by data augmentation (van Dyk & Meng, 2001; Liu & Wu, 1999), where the idea is to identify an optimal reparameterization (within a family of possible reparameterizations) so as to remove coupling. [sent-27, score-0.211]
10 Our approach uses auxiliary parameters to speed up the deterministic approximation of the target distribution. [sent-31, score-0.255]
11 The original model is modified by auxiliary parameters that are optimized in conjunction with the variational approximation. [sent-33, score-0.425]
12 The optimization of the auxiliary parameters corresponds to a parameterized joint optimization of the variational components; the role of the new updates is to precisely remove otherwise strong functional couplings between the components thereby facilitating fast convergence. [sent-34, score-0.576]
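13 $\langle w\rangle = \frac{D^{-1}}{1 + D^{-1}}\,y + \frac{1}{1 + D^{-1}}\,\langle w\rangle^{\text{old}}$. The variational estimate $\langle w\rangle$ converges to $y$, which is in fact the true posterior mean (for this toy problem, $p(w|y) = \mathcal{N}(w|y, 1+D)$). [sent-42, score-0.299]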
14 Intuitively, the convergence speed of w and q(w) suffers from strong coupling between the updates of w and z. [sent-45, score-0.345]
15 To alleviate the coupling, we expand the original model with an additional parameter α: p(y|w, z) = N (y | w + z, 1) p(z|α) = N (z | α, D) (5) The expanded model reduces to the original one when α equals the null value α0 = 0. [sent-47, score-0.225]
16 Then, we reduce the expanded model to the original one by applying the reduction rule $z^{\text{new}} = z - \alpha = z - \langle z\rangle$, $w^{\text{new}} = w + \alpha = w + \langle z\rangle$. [sent-49, score-0.24]
17 Here α breaks the update loop between q(w) and q(z) and plays the role of a correction force; it corrects the update trajectories of q(w) and q(z) and makes them point directly to the convergence point. [sent-51, score-0.221]
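The toy updates can be worked through by hand. The sketch below is not taken from the source: it assumes a flat prior on w (chosen because it reproduces the stated posterior N(w|y, 1+D)), tracks only the means of q(w) and q(z), and introduces the superscripts old/new as notation for one sweep.

```latex
\begin{align}
% One VB sweep with \alpha = \alpha_0 = 0
\langle z\rangle &= \frac{y - \langle w\rangle^{\text{old}}}{1 + D^{-1}},
\qquad
\langle w\rangle^{\text{new}} = y - \langle z\rangle \\
% Error recursion: plain VB shrinks the error by D/(1+D) per sweep
\langle w\rangle^{\text{new}} - y &= \frac{\langle w\rangle^{\text{old}} - y}{1 + D^{-1}} \\
% PX-VB: the KL term in \alpha is \langle (z-\alpha)^2\rangle / (2D), minimized at
\alpha &= \langle z\rangle \\
% Reduction step w^{\text{new}} = w + \alpha lands on the fixed point in one sweep
\langle w\rangle^{\text{new}} + \alpha &= y - \langle z\rangle + \langle z\rangle = y
\end{align}
```

Under these assumptions plain VB contracts the error by a factor D/(1+D) per sweep, which approaches 1 for large D, while the expanded update reaches the fixed point after a single sweep.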
18 3 The PX-VB Algorithm. In the general PX-VB formulation, we over-parameterize the model $p(\hat{x}, D)$ to get $p_\alpha(x, D)$, where the original model is recovered for some default values of the auxiliary parameters $\alpha = \alpha_0$. [sent-52, score-0.207]
19 The algorithm consists of the typical VB updates relative to pα (x, D), the optimization of auxiliary parameters α, as well as a reduction step to turn the model back to the original form where α = α0 . [sent-53, score-0.406]
20 This last reduction step has the effect of jointly modifying the components of the factored variational approximation. [sent-54, score-0.323]
21 Put another way, we push the change in pα (x, D), due to the optimization of α, into the variational approximation instead. [sent-55, score-0.253]
22 Changing the variational approximation in this manner permits us to return the model into its original form and set α = α0 . [sent-56, score-0.222]
23 We minimize $\mathrm{KL}(q(x)\,\|\,p_\alpha(x, D))$ over the auxiliary parameters $\alpha$. [sent-62, score-0.174]
24 This optimization can be done jointly with some components of the variational distribution, if feasible. [sent-63, score-0.247]
25 The expanded model is reduced to the original model through reparameterization. [sent-65, score-0.136]
26 Accordingly, we change $q^{(t+1)}(x)$ to $q^{(t+1)}(\hat{x})$ such that $\mathrm{KL}(q^{(t+1)}(\hat{x})\,\|\,p_{\alpha_0}(\hat{x}, D)) = \mathrm{KL}(q^{(t+1)}(x)\,\|\,p_{\alpha^{(t+1)}}(x, D))$ (7), where $q^{(t+1)}(\hat{x})$ are the modified components of the variational approximation. [sent-66, score-0.249]
27 Since we optimize $\alpha$ after optimizing $\{q(x_s)\}$, the mapping $S$ should change at least two components of $x$. [sent-74, score-0.146]
28 If we jointly optimize α and one component q(xs ), it suffices (albeit need not be optimal) for the mapping Sα to change only q(xs ). [sent-76, score-0.119]
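The three steps described above (VB updates, optimization of the auxiliary parameter, reduction) can be sketched as a generic loop. This is a minimal illustration, not the authors' implementation: the function `run` and the `toy_*` callables are hypothetical names, and the toy model assumes a flat prior on w and tracks only the means of q(w) and q(z).

```python
def run(q, vb_update, n_iter, fit_alpha=None, reduce_to_original=None):
    """Run VB sweeps; if both optional callables are given, add the PX-VB steps."""
    for _ in range(n_iter):
        q = vb_update(q)                      # usual VB updates with alpha = alpha_0
        if fit_alpha is not None:
            alpha = fit_alpha(q)              # minimize KL(q || p_alpha) over alpha
            q = reduce_to_original(q, alpha)  # absorb alpha into q, back to alpha_0
    return q


# Toy model (assumed here): y = w + z + unit-variance noise, z ~ N(alpha, D).
Y, D = 3.0, 100.0


def toy_vb_update(q):
    z = (Y - q["w"]) / (1.0 + 1.0 / D)  # mean of q(z) given the mean of q(w)
    return {"w": Y - z, "z": z}         # mean of q(w) given the mean of q(z)


def toy_fit_alpha(q):
    return q["z"]                       # optimal alpha is the mean of q(z)


def toy_reduce(q, alpha):
    return {"w": q["w"] + alpha, "z": q["z"] - alpha}


start = {"w": 0.0, "z": 0.0}
vb = run(start, toy_vb_update, n_iter=10)
px = run(start, toy_vb_update, n_iter=1,
         fit_alpha=toy_fit_alpha, reduce_to_original=toy_reduce)
print(vb["w"], px["w"])
```

Keeping the auxiliary steps as optional callables makes it easy to compare the plain and expanded updates on the same model; on this toy problem the expanded run reaches the fixed point after one sweep, while ten plain sweeps are still far from it.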
29 They both expand the original model and both are based on lower bounding KL-divergence. [sent-79, score-0.089]
30 However, the key difference is that the reduction step in PX-VB changes the lower-bounding distributions $\{q(x_s)\}$, while in PX-EM the reduction step is performed only for the parameters in $p(x, D)$. [sent-80, score-0.088]
31 Here we demonstrate the use of variational Bayesian methods to train Probit models. [sent-92, score-0.189]
32 The data likelihood for Probit regression is $p(t|X, w) = \prod_n \sigma(t_n w^T x_n)$, where X = [x1 , . [sent-93, score-0.159]
33 We can rewrite the likelihood in an equivalent form: $p(z_n|w, x_n) = \mathcal{N}(z_n|w^T x_n, 1)$, $p(t_n|z_n) = \mathrm{sign}(t_n z_n)$ (8). Given a Gaussian prior over the parameter, $p(w) = \mathcal{N}(w|0, v_0 I)$, we wish to approximate the posterior distribution $p(w, z|X, t)$ by $q(w, z) = q(w) \prod_n q(z_n)$. [sent-97, score-0.508]
34 To speed up the convergence of the above iterative updates, we apply the PX-VB method. [sent-99, score-0.192]
35 First, we expand the original model $p(\hat{w}, \hat{z}, t|X)$ to $p_c(w, z, t|X)$ with the mapping $w = \hat{w}c$, $z = \hat{z}c$ (11) such that $p_c(z_n|w, x_n) = \mathcal{N}(z_n|w^T x_n, c^2)$, $p_c(w) = \mathcal{N}(w|0, c^2 v_0 I)$ (12). Setting $c = c_0 = 1$ in the expanded model, we update $q(z_n)$ and $q(w)$ as before, via (9) and (10). [sent-100, score-0.493]
36 Then, we minimize $\mathrm{KL}(q(z)q(w)\,\|\,p_c(w, z, t|X))$ over $c$, yielding $c^2 = \frac{1}{N+M}\Big[\sum_n \big(\langle z_n^2\rangle - 2\langle z_n\rangle \langle w\rangle^T x_n + x_n^T \langle ww^T\rangle x_n\big) + v_0^{-1} \langle w^T w\rangle\Big]$ (13), where $M$ is the dimension of $w$. [sent-101, score-0.66]
37 Since this equation can be efficiently calculated, the extra computational cost induced by the auxiliary variable is therefore small. [sent-103, score-0.174]
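The closed form for c in (13) follows from a one-line stationarity condition. The derivation below is a sketch reconstructed from the expanded model in (12), not copied from the source; the symbol A is introduced here only as shorthand.

```latex
\begin{align}
% Terms of the expected log joint that depend on c:
% each of the N likelihood factors and the M-dimensional prior contributes -\log c
\mathbb{E}_q\big[\log p_c(w, z, t \mid X)\big]
  &= -(N + M)\log c - \frac{A}{2c^2} + \text{const}, \\
A &= \sum_n \big\langle (z_n - w^T x_n)^2 \big\rangle + v_0^{-1}\langle w^T w\rangle \\
  &= \sum_n \big(\langle z_n^2\rangle - 2\langle z_n\rangle\langle w\rangle^T x_n
     + x_n^T\langle ww^T\rangle x_n\big) + v_0^{-1}\langle w^T w\rangle, \\
% Setting the derivative with respect to c to zero
-\frac{N + M}{c} + \frac{A}{c^3} = 0
  \;&\Longrightarrow\; c^2 = \frac{A}{N + M}.
\end{align}
```

The factor $p(t_n|z_n) = \mathrm{sign}(t_n z_n)$ does not enter because rescaling z by a positive c leaves its sign unchanged.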
38 Accordingly, we change q(w) to obtain a new posterior approximation $q_c(w)$: $q_c(w) = \mathcal{N}\big(w \,\big|\, (XX^T + v_0^{-1} I)^{-1} X \langle z\rangle / c,\; (XX^T + v_0^{-1} I)^{-1} / c^2\big)$ (15). We do not actually need to compute $q_c(z_n)$ if this component will be optimized next. [sent-106, score-0.304]
39 By changing variables $w$ to $\hat{w}$ through (14), the KL divergence between the approximate and exact posteriors remains the same. [sent-107, score-0.101]
40 After obtaining new approximations $q_c(w)$ and $q(\hat{z}_n)$, we reset $c = c_0 = 1$ for the next iteration. [sent-108, score-0.094]
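The probit updates and the rescaling by c can be put together in a short NumPy sketch. It is an illustration under stated assumptions rather than the authors' code: `fit_probit` is a hypothetical name, X is stored with one input per row (so the paper's $XX^T$ becomes `X.T @ X`), labels are in {-1, +1}, and the truncated-Gaussian moments of q(z_n) are the standard ones, since equations (9) and (10) are not reproduced in this extract.

```python
import numpy as np
from scipy.stats import norm


def fit_probit(X, t, v0=1.0, expand=True, tol=1e-10, max_iter=10000):
    """VB (expand=False) or PX-VB (expand=True) for probit regression.

    X: (N, M) array with rows x_n; t: (N,) float labels in {-1, +1}.
    Returns the mean of q(w), its covariance at c = 1, and the sweeps used.
    """
    N, M = X.shape
    V = np.linalg.inv(X.T @ X + np.eye(M) / v0)  # covariance of q(w)
    m = np.zeros(M)
    for it in range(1, max_iter + 1):
        m_old = m
        # q(z_n): N(x_n^T m, 1) truncated to the half-line selected by t_n
        mu = X @ m
        z1 = mu + t * np.exp(norm.logpdf(mu) - norm.logcdf(t * mu))  # <z_n>
        z2 = 1.0 + mu * z1                                           # <z_n^2>
        # q(w): Gaussian with mean V X^T <z>
        m = V @ (X.T @ z1)
        if expand:
            S = V + np.outer(m, m)                                   # <w w^T>
            A = (np.sum(z2 - 2.0 * z1 * (X @ m))
                 + np.einsum("nm,mk,nk->", X, S, X)
                 + (np.trace(V) + m @ m) / v0)
            c = np.sqrt(A / (N + M))                                 # equation (13)
            m = m / c                                                # reduction step
        if np.max(np.abs(m - m_old)) < tol:
            break
    return m, V, it
```

Only the mean of q(w) needs rescaling in this sketch because the covariance is recomputed from X and v0 on every sweep. Calling the function once with `expand=False` and once with `expand=True` on the same data gives a direct comparison of sweep counts.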
41 Though similar to the PX-EM updates for the Probit regression problem (Liu et al. [sent-109, score-0.166]
42 , 1998), the PXVB updates are geared towards providing an approximate posterior distribution. [sent-110, score-0.199]
43 We use both synthetic data and a kidney biopsy data set (van Dyk & Meng, 2001) as numerical examples for probit regression. [sent-111, score-0.284]
44 The comparison of convergence speeds for VB and PXVB is illustrated in figure 1. [sent-113, score-0.137]
45 On the synthetic data, PXVB converges immediately while VB updates are slow to converge. [sent-120, score-0.244]
46 On the kidney biopsy data set, PX-VB converges in 507 iterations, while VB converges in 7518 iterations. [sent-122, score-0.194]
47 In terms of CPU time, which reflects the extra computational cost induced by the auxiliary variables, PX-VB is 14 times more efficient. [sent-124, score-0.174]
48 In sum, with a simple modification of VB updates, we significantly improve the convergence speed of variational Bayesian estimation for the probit model. [sent-126, score-0.519]
49 Here, we focus on variational ARD proposed by Bishop and Tipping (2000) for sparse Bayesian regression and classification. [sent-129, score-0.231]
50 The likelihood for ARD regression is $p(t|X, w, \tau) = \prod_n \mathcal{N}(t_n|w^T \phi_n, \tau^{-1})$, where $\phi_n$ is a feature vector based on $x_n$, such as [k(x1 , xn ), . [sent-130, score-0.253]
51 , [k(xN , xn )]T where k(xi , xj ) is a nonlinear basis function. [sent-133, score-0.094]
52 The sequential VB updates on q(τ ), q(w) and q(α) are described by Bishop and Tipping (2000). [sent-138, score-0.124]
53 The variational RVM achieves good generalization performance as demonstrated by Bishop and Tipping (2000). [sent-139, score-0.189]
54 However, its training based on the VB updates can be quite slow. [sent-140, score-0.124]
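55 First, we expand the original model $p(\hat{w}, \alpha, \tau|X, t)$ via $w = \hat{w}/r$ (17) while maintaining $\alpha$ and $\tau$ unchanged. [sent-142, score-0.089]
56 This gives $r = \frac{g + \sqrt{g^2 + 16Mf}}{4f}$ (19), where $f = \langle\tau\rangle \sum_n x_n^T \langle ww^T\rangle x_n + \sum_m \langle w_m^2\rangle \langle\alpha_m\rangle$ and $g = 2\langle\tau\rangle \sum_n \langle w\rangle^T x_n t_n$. [sent-146, score-0.448]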
57 If the regular update over q(w) achieves a smaller KL divergence, we reset r = 1. [sent-151, score-0.09]
58 Given $r$ and $q(w)$, we use $\hat{w} = rw$ to reduce the expanded model to the original one. [sent-152, score-0.136]
59 We can also introduce another auxiliary variable $s$ such that $\alpha = \hat{\alpha}/s$. [sent-154, score-0.174]
60 Similar to the above procedure, we optimize over $s$ the expected log joint probability of the expanded model, and at the same time update $q(\alpha)$. [sent-155, score-0.183]
61 Then we change $q(\alpha)$ back to $q_s(\hat{\alpha})$ using the inverse mapping $\hat{\alpha} = s\alpha$. [sent-156, score-0.4]
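The closed form for r in equation (19) is the positive root of a quadratic, which is easy to check numerically. The sketch below is illustrative: `optimal_r` and `ard_f_g` are hypothetical helper names, the rows of `X` are assumed to hold the feature vectors, and the stationarity condition M/r - f r + g/2 = 0 is a derivation made here from the expanded model, not a formula quoted from the source.

```python
import math

import numpy as np


def optimal_r(f, g, M):
    """Positive root of 2 f r^2 - g r - 2 M = 0, i.e. equation (19)."""
    return (g + math.sqrt(g * g + 16.0 * M * f)) / (4.0 * f)


def ard_f_g(X, t, w_mean, w_cov, tau_mean, alpha_mean):
    """Compute f and g from the current moments of q(w), q(tau), q(alpha)."""
    S = w_cov + np.outer(w_mean, w_mean)  # <w w^T>
    f = (tau_mean * np.einsum("nm,mk,nk->", X, S, X)
         + np.sum(np.diag(S) * alpha_mean))
    g = 2.0 * tau_mean * np.sum((X @ w_mean) * t)
    return f, g
```

Because f is positive whenever the second moments are, the square root always exceeds |g| and the returned r is positive, as a scale factor should be.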
62 The auxiliary variables r and s change the individual approximate posteriors q(w) and q(α) separately. [sent-158, score-0.27]
63 Setting c = c0 = 1, we perform the regular updates over q(τ ), q(w) and q(α). [sent-165, score-0.147]
64 Then we optimize over c the expected log joint probability of the expanded model. [sent-166, score-0.141]
65 Therefore, we perform a few steps of Newton updates to partially optimize c. [sent-169, score-0.162]
66 Then using the inverse mapping, we reduce the expanded model to the original one and adjust both q(w) and q(α) accordingly. [sent-171, score-0.162]
67 Empirically, this approach can achieve faster convergence than using auxiliary variables on q(w) and q(α) separately. [sent-172, score-0.376]
68 We compare the convergence speed of VB and PX-VB for the ARD model on both synthetic data and gene expression data. [sent-174, score-0.265]
69 The results of convergence comparison are shown in figure 2. [sent-180, score-0.137]
70 With a little modification of VB updates, we increase the convergence speed significantly. [sent-181, score-0.192]
71 4 Convergence properties of VB and PX-VB In this section, we analyze convergence of VB and PX-VB, and their convergence rates. [sent-183, score-0.298]
72 Define the mapping q(t+1) = M (q(t) ) as one VB update of all the approximate distributions. [sent-184, score-0.13]
73 Define an objective function as the unnormalized KL divergence: $Q(q) = \int \prod_i q_i(x) \log \frac{\prod_i q_i(x)}{p(x)}\,dx + \Big(\int p(x)\,dx - \int \prod_i q_i(x)\,dx\Big)$ (20). [sent-185, score-0.189]
74 It is easy to check that minimizing $Q(q)$ gives the same updates as VB, which minimizes the KL divergence. [sent-186, score-0.124]
75 1 by Luo and Tseng (1992), an iterative application of this mapping to minimize Q(q) results in at least linear convergence to an element q in the solution set. [sent-188, score-0.185]
76 Define the mapping q(t+1) = Mx (q(t) ) as one PX-VB update of all the approximate distributions. [sent-189, score-0.13]
77 ,$\beta = [q^T \alpha^T]^T$ converges to $[q^{*T} \alpha_0^T]^T$, where $\alpha \in \Lambda$ are the expanded model parameters and $\alpha_0$ is the null value in the original model. [sent-193, score-0.183]
78 1 Convergence rate of VB and PX-VB. The matrix rate of convergence $DM(q)$ satisfies $q^{(t+1)} - q^* = DM(q^*)^T (q^{(t)} - q^*)$ (21), where $DM(q) = \big(\partial M_j(q) / \partial q_i\big)$. [sent-195, score-0.137]
79 Define the global rate of convergence for $q$: $r = \lim_{t\to\infty} \frac{\|q^{(t+1)} - q^*\|}{\|q^{(t)} - q^*\|}$. [sent-196, score-0.137]
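The global rate defined above can be estimated from a sequence of iterates by taking the ratio of successive errors. The snippet below is a small sketch: `empirical_rate` is a hypothetical helper, and the iterates come from the toy VB recursion for the mean of q(w), assuming a flat prior on w, for which the rate should equal D/(1+D).

```python
import numpy as np


def empirical_rate(iterates, q_star):
    """Ratio of the last two error norms, an estimate of the global rate r."""
    errors = [np.linalg.norm(np.atleast_1d(q) - np.atleast_1d(q_star))
              for q in iterates]
    return errors[-1] / errors[-2]


# Toy VB recursion for the mean of q(w); the fixed point is y.
y, D = 1.0, 4.0
w, iterates = 0.0, [0.0]
for _ in range(20):
    w = y - (y - w) / (1.0 + 1.0 / D)
    iterates.append(w)

rate = empirical_rate(iterates, y)
print(rate, D / (1.0 + D))
```

A rate close to 1 signals slow linear convergence, which is the regime the parameter expansion is meant to improve.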
80 Define the constraint set gs as the constraints for the sth update. [sent-199, score-0.259]
81 1 The matrix convergence rate for VB is $DM(q^*) = \prod_{s=1}^{S} P_s$ (22), where $P_s = B_s [B_s^T [D^2 Q(q^*)]^{-1} B_s]^{-1} B_s^T [D^2 Q(q^*)]^{-1}$ and $B_s = \nabla g_s(q^*)$. [sent-201, score-0.344]
82 Let $G_s(\xi)$ be the $q_s$ that maximizes the objective function $Q(q)$ under the constraint $g_s(q) = g_s(\xi) = [\xi_{\setminus s}]$. [sent-203, score-0.707]
83 To calculate $DG_s(q^*)$, we differentiate the constraint $g_s(G_s(\xi)) = g_s(\xi)$ and evaluate both sides at $\xi = q^*$, such that $DG_s(q^*) B_s = B_s$ (25). [sent-211, score-0.493]
84 Similarly, we differentiate the Lagrange equation $DQ(G_s(\xi)) - \nabla g_s(G_s(\xi)) \lambda_s(\xi) = 0$ and evaluate both sides at $\xi = q^*$. [sent-212, score-0.286]
85 This yields $DG_s(q^*) D^2 Q(q^*) - D\lambda_s(q^*) B_s^T = 0$ (26). Equation (26) holds because $\frac{\partial^2 g_s}{\partial q_i \partial q_j} = 0$. [sent-213, score-0.207]
86 Therefore, $B_s$ is an identity matrix with its $s$th column removed, $B_s = I_{:,\setminus s}$, where $I$ is the identity matrix and the subscript $\setminus s$ means without the $s$th column. [sent-218, score-0.104]
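The rate matrix built from the projections P_s can be checked numerically on a case where everything is linear. The sketch below makes several assumptions not found in the source: the objective is a quadratic with Hessian `hessian` and minimizer at the origin, each block q_s is a single coordinate, and the function names are hypothetical. For such an objective one coordinate-descent sweep is a linear map, so its Jacobian can be compared directly with the transposed product of the P_s.

```python
import numpy as np


def constrained_projection(hessian, s):
    """P_s = B_s [B_s^T H^{-1} B_s]^{-1} B_s^T H^{-1}, B_s = identity without column s."""
    d = hessian.shape[0]
    B = np.delete(np.eye(d), s, axis=1)
    H_inv = np.linalg.inv(hessian)
    return B @ np.linalg.inv(B.T @ H_inv @ B) @ B.T @ H_inv


def rate_matrix(hessian):
    """DM = P_1 P_2 ... P_S for one sweep over all coordinates."""
    DM = np.eye(hessian.shape[0])
    for s in range(hessian.shape[0]):
        DM = DM @ constrained_projection(hessian, s)
    return DM


def coordinate_sweep(hessian, q):
    """One sweep of exact coordinate minimization of 0.5 * q^T H q."""
    q = np.array(q, dtype=float)
    for s in range(len(q)):
        others = np.delete(np.arange(len(q)), s)
        q[s] = -(hessian[s, others] @ q[others]) / hessian[s, s]
    return q


H = np.array([[2.0, 1.0], [1.0, 2.0]])
DM = rate_matrix(H)
# The sweep is linear, so applying it to basis vectors gives its Jacobian.
J = np.column_stack([coordinate_sweep(H, e) for e in np.eye(2)])
print(np.allclose(DM.T, J), np.max(np.abs(np.linalg.eigvals(DM))))
```

In this two-dimensional example the largest eigenvalue of DM equals the squared correlation implied by the Hessian, which illustrates why strongly coupled components slow the updates down.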
87 The above results help us understand the convergence speed of VB. [sent-228, score-0.192]
88 The global convergence rate is bounded by the maximal component convergence rate, and generally many components have the same convergence rate as the global rate. [sent-233, score-0.438]
89 Therefore, the instant convergence of $q_s$ could help increase the global convergence rate. [sent-234, score-0.303]
90 For PX-VB, we can compute the matrix rate of convergence similarly. [sent-235, score-0.137]
91 In the toy example in Section 2, PX-VB introduces an auxiliary variable α which has zero correlation with w, leading to instant convergence of the algorithm. [sent-236, score-0.397]
92 This suggests that PX-VB improves the convergence by reducing the correlation among {qs }. [sent-237, score-0.166]
93 Rigorously speaking, the reduction step in PX-VB implicitly defines a mapping from $q$ to $q_{\alpha_0}$ through the auxiliary variables $\alpha$: $(q, p_{\alpha_0}) \to (q, p_\alpha) \to (q_\alpha, p_{\alpha_0})$. [sent-238, score-0.289]
94 Thus, as long as the largest eigenvalue of Mα is smaller than 1, PX-VB converges faster than VB. [sent-241, score-0.122]
95 The choice of α affects the convergence rate by controlling the eigenvalue of this mapping. [sent-242, score-0.17]
96 In practice, we can check this eigenvalue to make sure the constructed PX-VB algorithm enjoys a fast convergence rate. [sent-244, score-0.17]
97 5 Discussion We have provided a general approach to speeding up convergence of variational Bayesian learning. [sent-245, score-0.326]
98 Faster convergence is guaranteed theoretically provided that the Jacobian of the transformation from auxiliary parameters to variational components has spectral norm bounded by one. [sent-246, score-0.527]
99 Our empirical results show that the performance gain due to the auxiliary method is substantial. [sent-248, score-0.174]
100 On the convergence of the coordinate descent method for convex differentiable minimization. [sent-296, score-0.137]
wordName wordTfidf (topN-words)
[('vb', 0.587), ('qs', 0.293), ('zn', 0.222), ('gs', 0.207), ('variational', 0.189), ('dgs', 0.179), ('auxiliary', 0.174), ('ard', 0.157), ('probit', 0.138), ('convergence', 0.137), ('kl', 0.133), ('updates', 0.124), ('bs', 0.118), ('tipping', 0.105), ('expanded', 0.103), ('tn', 0.1), ('xn', 0.094), ('dm', 0.092), ('xs', 0.09), ('wt', 0.089), ('dms', 0.086), ('bayesian', 0.085), ('ps', 0.085), ('pxvb', 0.079), ('liu', 0.078), ('bishop', 0.075), ('bt', 0.075), ('qc', 0.069), ('qi', 0.063), ('factored', 0.063), ('dyk', 0.06), ('kidney', 0.06), ('wnew', 0.06), ('px', 0.058), ('expand', 0.056), ('speed', 0.055), ('relevance', 0.054), ('meng', 0.052), ('sth', 0.052), ('mapping', 0.048), ('converges', 0.047), ('augmentation', 0.047), ('differentiate', 0.047), ('synthetic', 0.046), ('qq', 0.044), ('reduction', 0.044), ('faster', 0.042), ('regression', 0.042), ('update', 0.042), ('approximate', 0.04), ('auxiliar', 0.04), ('biopsy', 0.04), ('tseng', 0.04), ('vassar', 0.04), ('wwt', 0.04), ('xxt', 0.04), ('ms', 0.039), ('optimize', 0.038), ('divergence', 0.038), ('wu', 0.036), ('posterior', 0.035), ('id', 0.035), ('wm', 0.035), ('salakhutdinov', 0.035), ('jaakkola', 0.034), ('original', 0.033), ('change', 0.033), ('eigenvalue', 0.033), ('iterations', 0.032), ('sides', 0.032), ('determination', 0.032), ('ww', 0.031), ('optimization', 0.031), ('correlation', 0.029), ('tommi', 0.029), ('luo', 0.029), ('instant', 0.029), ('coupling', 0.029), ('automatic', 0.029), ('optimized', 0.029), ('toy', 0.028), ('pc', 0.028), ('mackay', 0.028), ('csail', 0.028), ('components', 0.027), ('gene', 0.027), ('slow', 0.027), ('target', 0.026), ('inverse', 0.026), ('em', 0.025), ('factorized', 0.025), ('reset', 0.025), ('analyze', 0.024), ('xx', 0.024), ('ln', 0.023), ('van', 0.023), ('likelihood', 0.023), ('variables', 0.023), ('beal', 0.023), ('regular', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 159 nips-2006-Parameter Expanded Variational Bayesian Methods
Author: Tommi S. Jaakkola, Yuan Qi
Abstract: Bayesian inference has become increasingly important in statistical machine learning. Exact Bayesian calculations are often not feasible in practice, however. A number of approximate Bayesian methods have been proposed to make such calculations practical, among them the variational Bayesian (VB) approach. The VB approach, while useful, can nevertheless suffer from slow convergence to the approximate solution. To address this problem, we propose Parameter-eXpanded Variational Bayesian (PX-VB) methods to speed up VB. The new algorithm is inspired by parameter-expanded expectation maximization (PX-EM) and parameter-expanded data augmentation (PX-DA). Similar to PX-EM and -DA, PX-VB expands a model with auxiliary variables to reduce the coupling between variables in the original model. We analyze the convergence rates of VB and PX-VB and demonstrate the superior convergence rates of PX-VB in variational probit regression and automatic relevance determination.
2 0.28776398 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation
Author: Yee W. Teh, David Newman, Max Welling
Abstract: Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking. In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efficient, easy to implement and significantly more accurate than standard variational Bayesian inference for LDA.
3 0.23801225 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
Author: Mark Girolami, Mingjun Zhong
Abstract: By adopting Gaussian process priors a fully Bayesian solution to the problem of integrating possibly heterogeneous data sets within a classification setting is presented. Approximate inference schemes employing Variational & Expectation Propagation based methods are developed and rigorously assessed. We demonstrate our approach to integrating multiple data sets on a large scale protein fold prediction problem where we infer the optimal combinations of covariance functions and achieve state-of-the-art performance without resorting to any ad hoc parameter tuning and classifier combination. 1
4 0.15970717 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
Author: Rey Ramírez, Jason Palmer, Scott Makeig, Bhaskar D. Rao, David P. Wipf
Abstract: The ill-posed nature of the MEG/EEG source localization problem requires the incorporation of prior assumptions when choosing an appropriate solution out of an infinite set of candidates. Bayesian methods are useful in this capacity because they allow these assumptions to be explicitly quantified. Recently, a number of empirical Bayesian approaches have been proposed that attempt a form of model selection by using the data to guide the search for an appropriate prior. While seemingly quite different in many respects, we apply a unifying framework based on automatic relevance determination (ARD) that elucidates various attributes of these methods and suggests directions for improvement. We also derive theoretical properties of this methodology related to convergence, local minima, and localization bias and explore connections with established algorithms. 1
5 0.14316152 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
Author: Kenichi Kurihara, Max Welling, Nikos A. Vlassis
Abstract: Dirichlet Process (DP) mixture models are promising candidates for clustering applications where the number of clusters is unknown a priori. Due to computational considerations these models are unfortunately unsuitable for large scale data-mining applications. We propose a class of deterministic accelerated DP mixture models that can routinely handle millions of data-cases. The speedup is achieved by incorporating kd-trees into a variational Bayesian algorithm for DP mixtures in the stick-breaking representation, similar to that of Blei and Jordan (2005). Our algorithm differs in the use of kd-trees and in the way we handle truncation: we only assume that the variational distributions are fixed at their priors after a certain level. Experiments show that speedups relative to the standard variational algorithm can be significant. 1
6 0.12645674 117 nips-2006-Learning on Graph with Laplacian Regularization
7 0.1140115 193 nips-2006-Tighter PAC-Bayes Bounds
8 0.11280387 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
9 0.097033828 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models
10 0.096255854 57 nips-2006-Conditional mean field
11 0.075194985 116 nips-2006-Learning from Multiple Sources
12 0.064698704 42 nips-2006-Bayesian Image Super-resolution, Continued
13 0.058860447 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
14 0.058067251 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
15 0.056817517 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
16 0.05441517 79 nips-2006-Fast Iterative Kernel PCA
17 0.053010602 41 nips-2006-Bayesian Ensemble Learning
18 0.049104534 44 nips-2006-Bayesian Policy Gradient Algorithms
19 0.046626605 61 nips-2006-Convex Repeated Games and Fenchel Duality
20 0.046527475 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
topicId topicWeight
[(0, -0.179), (1, 0.05), (2, -0.021), (3, -0.055), (4, 0.004), (5, 0.161), (6, 0.309), (7, 0.051), (8, -0.12), (9, 0.077), (10, 0.072), (11, 0.089), (12, -0.014), (13, -0.101), (14, 0.108), (15, 0.261), (16, -0.058), (17, -0.065), (18, 0.022), (19, -0.046), (20, -0.077), (21, 0.095), (22, -0.191), (23, -0.09), (24, -0.183), (25, -0.06), (26, -0.011), (27, 0.023), (28, 0.062), (29, -0.179), (30, -0.043), (31, 0.139), (32, 0.092), (33, 0.019), (34, -0.07), (35, 0.108), (36, -0.106), (37, 0.093), (38, 0.048), (39, -0.045), (40, -0.097), (41, -0.0), (42, 0.082), (43, -0.132), (44, -0.038), (45, -0.026), (46, -0.087), (47, 0.085), (48, -0.088), (49, -0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.95697939 159 nips-2006-Parameter Expanded Variational Bayesian Methods
Author: Tommi S. Jaakkola, Yuan Qi
Abstract: Bayesian inference has become increasingly important in statistical machine learning. Exact Bayesian calculations are often not feasible in practice, however. A number of approximate Bayesian methods have been proposed to make such calculations practical, among them the variational Bayesian (VB) approach. The VB approach, while useful, can nevertheless suffer from slow convergence to the approximate solution. To address this problem, we propose Parameter-eXpanded Variational Bayesian (PX-VB) methods to speed up VB. The new algorithm is inspired by parameter-expanded expectation maximization (PX-EM) and parameter-expanded data augmentation (PX-DA). Similar to PX-EM and -DA, PX-VB expands a model with auxiliary variables to reduce the coupling between variables in the original model. We analyze the convergence rates of VB and PX-VB and demonstrate the superior convergence rates of PX-VB in variational probit regression and automatic relevance determination.
2 0.65078455 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation
Author: Yee W. Teh, David Newman, Max Welling
Abstract: Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking. In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efficient, easy to implement and significantly more accurate than standard variational Bayesian inference for LDA.
3 0.6089735 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
Author: Kenichi Kurihara, Max Welling, Nikos A. Vlassis
Abstract: Dirichlet Process (DP) mixture models are promising candidates for clustering applications where the number of clusters is unknown a priori. Due to computational considerations these models are unfortunately unsuitable for large scale data-mining applications. We propose a class of deterministic accelerated DP mixture models that can routinely handle millions of data-cases. The speedup is achieved by incorporating kd-trees into a variational Bayesian algorithm for DP mixtures in the stick-breaking representation, similar to that of Blei and Jordan (2005). Our algorithm differs in the use of kd-trees and in the way we handle truncation: we only assume that the variational distributions are fixed at their priors after a certain level. Experiments show that speedups relative to the standard variational algorithm can be significant. 1
4 0.60837686 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
Author: Mark Girolami, Mingjun Zhong
Abstract: By adopting Gaussian process priors a fully Bayesian solution to the problem of integrating possibly heterogeneous data sets within a classification setting is presented. Approximate inference schemes employing Variational & Expectation Propagation based methods are developed and rigorously assessed. We demonstrate our approach to integrating multiple data sets on a large scale protein fold prediction problem where we infer the optimal combinations of covariance functions and achieve state-of-the-art performance without resorting to any ad hoc parameter tuning and classifier combination. 1
5 0.45257747 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
Author: Carla P. Gomes, Ashish Sabharwal, Bart Selman
Abstract: We propose a new technique for sampling the solutions of combinatorial problems in a near-uniform manner. We focus on problems specified as a Boolean formula, i.e., on SAT instances. Sampling for SAT problems has been shown to have interesting connections with probabilistic reasoning, making practical sampling algorithms for SAT highly desirable. The best current approaches are based on Markov Chain Monte Carlo methods, which have some practical limitations. Our approach exploits combinatorial properties of random parity (XOR) constraints to prune away solutions near-uniformly. The final sample is identified amongst the remaining ones using a state-of-the-art SAT solver. The resulting sampling distribution is provably arbitrarily close to uniform. Our experiments show that our technique achieves a significantly better sampling quality than the best alternative. 1
6 0.41077507 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
7 0.35098422 117 nips-2006-Learning on Graph with Laplacian Regularization
8 0.33126026 193 nips-2006-Tighter PAC-Bayes Bounds
9 0.30729866 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models
10 0.3048059 57 nips-2006-Conditional mean field
11 0.26002336 60 nips-2006-Convergence of Laplacian Eigenmaps
12 0.23903544 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
13 0.2330633 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
14 0.21576566 41 nips-2006-Bayesian Ensemble Learning
15 0.21352634 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
16 0.21313162 116 nips-2006-Learning from Multiple Sources
17 0.2119215 147 nips-2006-Non-rigid point set registration: Coherent Point Drift
18 0.20805098 44 nips-2006-Bayesian Policy Gradient Algorithms
19 0.19483162 156 nips-2006-Ordinal Regression by Extended Binary Classification
20 0.18914072 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers
topicId topicWeight
[(1, 0.09), (3, 0.035), (7, 0.075), (9, 0.04), (20, 0.037), (22, 0.07), (25, 0.012), (39, 0.014), (44, 0.131), (55, 0.232), (57, 0.061), (61, 0.017), (65, 0.02), (69, 0.036), (86, 0.011)]
simIndex simValue paperId paperTitle
1 0.84803814 113 nips-2006-Learning Structural Equation Models for fMRI
Author: Enrico Simonotto, Heather Whalley, Stephen Lawrie, Lawrence Murray, David Mcgonigle, Amos J. Storkey
Abstract: Structural equation models can be seen as an extension of Gaussian belief networks to cyclic graphs, and we show they can be understood generatively as the model for the joint distribution of long term average equilibrium activity of Gaussian dynamic belief networks. Most use of structural equation models in fMRI involves postulating a particular structure and comparing learnt parameters across different groups. In this paper it is argued that there are situations where priors about structure are not firm or exhaustive, and given sufficient data, it is worth investigating learning network structure as part of the approach to connectivity analysis. First we demonstrate structure learning on a toy problem. We then show that for particular fMRI data the simple models usually assumed are not supported. We show that is is possible to learn sensible structural equation models that can provide modelling benefits, but that are not necessarily going to be the same as a true causal model, and suggest the combination of prior models and learning or the use of temporal information from dynamic models may provide more benefits than learning structural equations alone. 1
2 0.83832794 196 nips-2006-TrueSkill™: A Bayesian Skill Rating System
Author: Ralf Herbrich, Tom Minka, Thore Graepel
Abstract: unkown-abstract
same-paper 3 0.83745354 159 nips-2006-Parameter Expanded Variational Bayesian Methods
Author: Tommi S. Jaakkola, Yuan Qi
Abstract: Bayesian inference has become increasingly important in statistical machine learning. Exact Bayesian calculations are often not feasible in practice, however. A number of approximate Bayesian methods have been proposed to make such calculations practical, among them the variational Bayesian (VB) approach. The VB approach, while useful, can nevertheless suffer from slow convergence to the approximate solution. To address this problem, we propose Parameter-eXpanded Variational Bayesian (PX-VB) methods to speed up VB. The new algorithm is inspired by parameter-expanded expectation maximization (PX-EM) and parameter-expanded data augmentation (PX-DA). Similar to PX-EM and -DA, PX-VB expands a model with auxiliary variables to reduce the coupling between variables in the original model. We analyze the convergence rates of VB and PX-VB and demonstrate the superior convergence rates of PX-VB in variational probit regression and automatic relevance determination.
4 0.65264601 139 nips-2006-Multi-dynamic Bayesian Networks
Author: Karim Filali, Jeff A. Bilmes
Abstract: We present a generalization of dynamic Bayesian networks to concisely describe complex probability distributions such as in problems with multiple interacting variable-length streams of random variables. Our framework incorporates recent graphical model constructs to account for existence uncertainty, value-specific independence, aggregation relationships, and local and global constraints, while still retaining a Bayesian network interpretation and efficient inference and learning techniques. We introduce one such general technique, which is an extension of Value Elimination, a backtracking search inference algorithm. Multi-dynamic Bayesian networks are motivated by our work on Statistical Machine Translation (MT). We present results on MT word alignment in support of our claim that MDBNs are a promising framework for the rapid prototyping of new MT systems.
1 INTRODUCTION
The description of factorization properties of families of probabilities using graphs (i.e., graphical models, or GMs) has proven very useful in modeling a wide variety of statistical and machine learning domains such as expert systems, medical diagnosis, decision making, speech recognition, and natural language processing. There are many different types of graphical model, each with its own properties and benefits, including Bayesian networks, undirected Markov random fields, and factor graphs. Moreover, for different types of scientific modeling, different types of graphs are more or less appropriate. For example, static Bayesian networks are quite useful when the size of the set of random variables in the domain does not grow or shrink for all data instances and queries of interest. Hidden Markov models (HMMs), on the other hand, are such that the number of underlying random variables changes depending on the desired length (which can be a random variable), and HMMs are applicable even without knowing this length as they can be extended indefinitely using online inference.
HMMs have been generalized to dynamic Bayesian networks (DBNs) and temporal conditional random fields (CRFs), where an underlying set of variables gets repeated as needed to fill any finite but unbounded length. Probabilistic relational models (PRMs) [5] allow for a more complex template that can be expanded in multiple dimensions simultaneously. An attribute common to all of the above cases is that the specification of rules for expanding any particular instance of a model is finite. In other words, these forms of GM allow the specification of models with an unlimited number of random variables (RVs) using a finite description. This is achieved using parameter tying, so while the number of RVs increases without bound, the number of parameters does not.
In this paper, we introduce a new class of model we call multi-dynamic Bayesian networks. MDBNs are motivated by our research into the application of graphical models to the domain of statistical machine translation (MT) and they have two key attributes from the graphical modeling perspective. First, an MDBN generalizes a DBN in that there are multiple “streams” of variables that can get unrolled, but where each stream may be unrolled by a differing amount. In the most general case, connecting these different streams together would require the specification of conditional probability tables with a varying and potentially unlimited number of parents. To avoid this problem and retain the template’s finite description length, we utilize a switching parent functionality (also called value-specific independence). Second, in order to capture the notion of fertility in MT systems (defined later in the text), we employ a form of existence uncertainty [7] (that we call switching existence), whereby the existence of a given random variable might depend on the value of other random variables in the network. Being fully propositional, MDBNs lie between DBNs and PRMs in terms of expressiveness.
While PRMs are capable of describing any MDBN, there are, in general, advantages to restricting ourselves to a more specific class of model. For example, in the DBN case, it is possible to provide a bound on inference costs just by looking at attributes of the DBN template only (e.g., the left or right interfaces [12, 2]). Restricting the model can also make it simpler to use in practice. MDBNs are still relatively simple, while at the same time making possible the easy expression of MT systems, and opening doors to novel forms of probabilistic inference as we show below. In section 2, we introduce MDBNs, and describe their application to machine translation showing how it is possible to represent even complex MT systems. In section 3, we describe MDBN learning and decoding algorithms. In section 4, we present experimental results in the area of statistical machine translation, and future work is discussed in section 5.
2 MDBNs
A standard DBN [4] template consists of a directed acyclic graph $G = (V, E) = (V_1 \cup V_2, E_1 \cup E_2 \cup E_2^{\rightarrow})$ with node set $V$ and edge set $E$. For $t \in \{1, 2\}$, the sets $V_t$ are the nodes at slice $t$, $E_t$ are the intra-slice edges between nodes in $V_t$, and $E_t^{\rightarrow}$ are the inter-slice edges between nodes in $V_1$ and $V_2$. To unroll a DBN to length $T$, the nodes $V_2$ along with the edges adjacent to any node in $V_2$ are cloned $T - 1$ times (where parameters of cloned variables are constrained to be the same as the template) and re-connected at the corresponding places. An MDBN with $K$ streams consists of the union of $K$ DBN templates along with a template structure specifying rules to connect the various streams together. An MDBN template is a directed graph
$$G = (V, E) = \Big( \bigcup_k V^{(k)}, \; \bigcup_k E^{(k)} \cup \hat{E}^{(k)} \Big)$$
where $(V^{(k)}, E^{(k)})$ is the $k$-th DBN, and the edges $\hat{E}^{(k)}$ are rules specifying how to connect stream $k$ to the other streams. These rules are general in that they specify the set of edges for all values of $T_k$.
The streams can be nested arbitrarily; for example, it is possible to specify a model that can grow along several dimensions simultaneously. An MDBN also utilizes “switching existence”, meaning some subset of the variables in $V$ bestow existence onto other variables in the network. We call these variables existence bestowing (or eb-nodes). The idea of bestowing existence is well defined over a discrete space, and is not dissimilar to a variable length DBN. For example, we may have a joint distribution over lengths as follows:
$$p(X_1, \ldots, X_N, N) = p(X_1, \ldots, X_n \mid N = n)\, p(N = n)$$
where $N$ is an eb-node that determines the number of other random variables in the DGM. Our notion of eb-nodes allows us to model certain characteristics found within machine translation systems, such as “fertility” [3], where a given English word is cloned a random number of times in the generative process that explains a translation from French into English. This random cloning might happen simultaneously at all points along a given MDBN stream. This means that even for a given fixed stream length $T_i = t_i$, each stream could have a randomly varying number of random variables. Our graphical notation for eb-nodes consists of the eb-node as a square box containing variables whose existence is determined by the eb-node.
We start by providing a simple example of an expanded MDBN for three well known MT systems, namely the IBM Models 1 and 2 [3], and the “HMM” model [15] (footnote 1). We adopt the convention in [3] that our goal is to translate from a string of French words $F = f$ of length $M = m$ into a string of English words $E = e$ of length $L = l$ (of course these can be any two languages). The basic generative (noisy channel) approach when translating from French to English is to represent the joint distribution $P(f, e) = P(f \mid e) P(e)$. $P(e)$ is a language model specifying the prior over the word string $e$.
Footnote 1: We will refer to it as M-HMM to avoid confusion with regular HMMs.
The key goal is to produce a finite-description length representation for $P(f \mid e)$ where $f$ and $e$ are of arbitrary length. A hidden alignment string, $a$, specifies how the English words align to the French words, leading to $P(f \mid e) = \sum_a P(f, a \mid e)$. Figure 1(a) is a 2-stream MDBN expanded representation of the three models, in this case $\ell = 4$ and $m = 3$. As shown, it appears that the fan-in to node $f_i$ will be $\ell$ and thus will grow without bound. However, a switching mechanism whereby $P(f_i \mid e, a_i) = P(f_i \mid e_{a_i})$ limits the number of parameters regardless of $L$. This means that the alignment variable $a_i$ indicates the English word $e_{a_i}$ that should be aligned to French word $f_i$. The variable $e_0$ is a null word that connects to French words not explained by any of $e_1, \ldots, e_\ell$. The graph expresses all three models; the difference is that, in Models 1 and 2, there are no edges between $a_j$ and $a_{j+1}$. In Model 1, $p(a_j = \ell)$ is uniform on the set $\{1, \ldots, L\}$; in Model 2, the distribution over $a_j$ is a function only of its position $j$, and of the English and French lengths $\ell$ and $m$ respectively. In the M-HMM model, the $a_i$ variables form a first order Markov chain.
[Figure 1: (a) Models 1, 2 and M-HMM; (b) Expanded M3 graph. Expanded 2-stream MDBN description of IBM Models 1 and 2, and the M-HMM model for MT; and the expanded MDBN description of IBM Model 3 with fertility assignment $\phi_0 = 2$, $\phi_1 = 3$, $\phi_2 = 1$, $\phi_3 = 0$.]
From the above, we see that it would be difficult to express this model graphically using a standard DBN since $L$ and $M$ are unequal random variables. Indeed, there are two DBNs in operation, one consisting of the English string, and the other consisting of the French string and its alignment.
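The switching-parent mechanism $P(f_i \mid e, a_i) = P(f_i \mid e_{a_i})$ can be illustrated with a minimal Python sketch for Model 1. This is not the authors' implementation: the lexical table `t`, the toy vocabulary, and the choice to let alignments range uniformly over every position of `e_words` (including a `NULL` entry standing in for $e_0$) are assumptions made for illustration, and the length prior is omitted. The sketch compares a brute-force sum over all alignment strings with the factored form that the uniform, independent alignment prior makes possible.

```python
from itertools import product

# Hypothetical lexical translation table t[(f, e)] = P(f | e); toy numbers only.
t = {
    ("la", "NULL"): 0.5, ("maison", "NULL"): 0.5,
    ("la", "the"): 0.9, ("maison", "the"): 0.1,
    ("la", "house"): 0.2, ("maison", "house"): 0.8,
}

def model1_likelihood_bruteforce(f_words, e_words, t):
    """Sum P(f, a | e) over every alignment string a (exponential in len(f_words))."""
    n = len(e_words)
    total = 0.0
    for a in product(range(n), repeat=len(f_words)):
        p = 1.0
        for f, a_i in zip(f_words, a):
            # Switching parent: f_i depends only on the English word selected by a_i.
            p *= (1.0 / n) * t.get((f, e_words[a_i]), 0.0)
        total += p
    return total

def model1_likelihood_factored(f_words, e_words, t):
    """Same quantity, using the fact that the a_i are independent and uniform."""
    n = len(e_words)
    p = 1.0
    for f in f_words:
        p *= sum(t.get((f, e), 0.0) for e in e_words) / n
    return p

e_words = ["NULL", "the", "house"]
f_words = ["la", "maison"]
p_brute = model1_likelihood_bruteforce(f_words, e_words, t)
p_factored = model1_likelihood_factored(f_words, e_words, t)
```

Both functions return the same value on the toy input; the table size depends only on the vocabulary, not on the length of the English string, which is the point of the switching parent.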
Moreover, the fully connected structure of the graph in the figure can represent the appropriate family of models, but it also represents models whose parameter space grows without bound; the switching function allows the model template to stay finite regardless of $L$ and $M$.
With our MDBN descriptive abilities complete, it is now possible to describe the more complex IBM Models 3 and 4 [3] (an MDBN for Model 3 is depicted in fig. 1(b)). The top-most random variable, $\ell$, is a hidden switching existence variable corresponding to the length of the English string. The box abutting $\ell$ includes all the nodes whose existence depends on the value of $\ell$. In the figure, $\ell = 3$, thus resulting in three English words $e_1$, $e_2$, and $e_3$ connected using a second-order Markov chain. To each English word $e_i$ corresponds a conditionally dependent fertility eb-node $\phi_i$, which indicates how many times $e_i$ is used by words in the French string. Each $\phi_i$ in turn controls the existence of a set of variables under it. Given the fertilities (the figure depicts the case $\phi_1 = 3$, $\phi_2 = 1$, $\phi_3 = 0$), for each word $e_i$, $\phi_i$ French word variables are granted existence and are denoted by $\tau_{i1}, \tau_{i2}, \ldots, \tau_{i\phi_i}$, what is called the tablet [3] of $e_i$. The values taken by the $\tau$ variables need to match the actual observed French sequence $f_1, \ldots, f_m$. This is represented as a shared constraint between all the $f$, $\pi$, and $\tau$ variables which have incoming edges into the observed variable $v$. The conditional probability table of $v$ is such that it is one only when the associated constraint is satisfied (footnote 2). The variable $\pi_{i,k} \in \{1, \ldots, m\}$ is a switching dependency parent with respect to the constraint variable $v$ and determines which $f_j$ participates in an equality constraint with $\tau_{i,k}$.
Footnote 2: This type of encoding of constraints corresponds to the standard mechanism used by Pearl [14]. A naive implementation, however, would enumerate a number of configurations exponential in the number of constrained variables, while typically only a small fraction of the configurations would have positive probability.
The bottom variable $m$ is a switching existence node (observed to be 6 in the figure) with corresponding French word sequence and alignment variables. The French sequence participates in the $v$ constraint described above, while the alignment variables $a_j \in \{1, \ldots, \ell\}$, $j \in 1, \ldots, m$ constrain the fertilities to take their unique allowable values (for the given alignment). Alignments also restrict the domain of permutation variables, $\pi$, using the constraint variable $x$. Finally, the domain size of each $a_j$ has to lie in the interval $[0, \ell]$ and that is enforced by the variable $u$. The dashed edges connecting the alignment $a$ variables represent an extension to implement an M3/M-HMM hybrid. The null submodel involving the deterministic node $m' = \sum_{i=1}^{\ell} \phi_i$ and eb-node $\phi_0$ accounts for French words that are not explained by any of the English words $e_1, \ldots, e_\ell$. In this submodel, successive permutation variables are ordered and this constraint is implemented using the observed child $w$ of $\pi_{0i}$ and $\pi_{0(i+1)}$. Model 4 [3] is similar to Model 3 except that the former is based on a more elaborate distortion model that uses relative instead of absolute positions both within and between tablets.
3 Inference, Parameter Estimation and MPE
Multi-dynamic Bayesian Networks are amenable to any type of inference that is applicable to regular Bayesian networks as long as switching existence relationships are respected and all the constraints (aggregation for example) are satisfied. Unfortunately, DBN inference procedures that take advantage of the repeatable template and can preprocess it offline are not easy to apply to MDBNs. A case in point is the Junction Tree algorithm [11].
Triangulation algorithms exist that create an offline triangulated version of the input graph and do not re-triangulate it for each different instance of the input data [12, 2]. In MDBNs, due to the flexibility to unroll templates in several dimensions and to specify dependencies and constraints spanning the entire unrolled graph, it is not obvious how we can exploit any repetitive patterns in a Junction Tree-style offline triangulation of the graph template. In section 4, we discuss sampling inference methods we have used. Here we discuss our extension to a backtracking search algorithm with the same performance guarantees as the JT algorithm, but with the advantage of easily handling determinism, existence uncertainty, and constraints, both learned and explicitly stated.
Value Elimination (VE) [1] is a backtracking Bayesian network inference technique that caches factors associated with portions of the search tree and uses them to avoid iterating again over the same subtrees. We follow the notation introduced in [1] and refer the reader to that paper for details about VE inference. We have extended the VE inference approach to handle explicitly encoded constraints, existence uncertainty, and to perform approximate local domain pruning (see section 4). We omit these details as well as others in the original paper and briefly describe the main data structure required by VE and sketch the algorithm we refer to as FirstPass (alg. 1) since it constitutes the first step of the learning procedure, our main contribution in this section. A VE factor, $F$, is such that we can write the following marginal of the joint distribution
$$\sum_{\mathbf{X} = \mathbf{x}} P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}, \mathbf{Z}) = F.val \times f(\mathbf{Z})$$
such that $(\mathbf{X} \cup \mathbf{Y}) \cap \mathbf{Z} = \emptyset$, F.val is a constant, and $f(\mathbf{Z})$ is a function of $\mathbf{Z}$ only. $\mathbf{Y}$ is a set of variables previously instantiated in the current branch of the search tree to the value vector $\mathbf{y}$. The pair $(\mathbf{Y}, \mathbf{y})$ is referred to as a dependency set (F.Dset). $\mathbf{X}$ is referred to as a subsumed set (F.Sset).
By caching the tuple (F.Dset, F.Sset, F.val), we avoid recomputing the marginal again whenever (1) F.Dset is active, meaning all nodes stored in F.Dset are assigned their cached values in the current branch of the search tree; and (2) none of the variables in F.Sset are assigned yet. FirstPass (alg. 1) visits nodes in the graph in Depth First fashion. In line 7, we get the values of all Newly Single-valued (NSV) CPTs, i.e., CPTs that involve the current node, $V$, and in which all other variables are already assigned (these variables and their values are accumulated into Dset). We also check for factors that are active, multiply their values in, and accumulate subsumed vars in Sset (to avoid branching on them). In line 10, we add $V$ to the Sset. In line 11, we cache a new factor $F$ with value F.val = sum. We store $V$ into F.head, a pointer to the last variable to be inserted into F.Sset, and needed for parameter estimation described below. F.Dset consists of all the variables, except $V$, that appeared in any NSV CPT or the Dset of an activated factor at line 6.
Footnote: We use a general directed domain pruning constraint. Deterministic relationships then become a special case of our constraint whereby the domain of the child variable is constrained to a single value with probability one.
[Figure 2 (search tree). Variable traversal order: A, B, C, and D. Factors are numbered by order of creation. *Fi denotes the activation of factor i.
First pass, factors: F3: Dset={B=0}, Sset={C,D}; F5: Dset={A=0}, Sset={B,C,D}; F7: Dset={}, Sset={A,B,C,D}, val=P(E=e).
Factor values needed for the c(A=0) and c(C=0,B=0) computation:
F1.val = P(D=0|C=0)P(E=e|D=0) + P(D=1|C=0)P(E=e|D=1)
F2.val = P(D=0|C=1)P(E=e|D=0) + P(D=1|C=1)P(E=e|D=1)
F3.val = P(C=0|B=0)*F1.val + P(C=1|B=0)*F2.val
F4.val = P(C=0|B=1)*F1.val + P(C=1|B=1)*F2.val
F5.val = P(B=0|A=0)*F3.val + P(B=1|A=0)*F4.val
Second pass, tau values propagated recursively:
F7.tau = 1.0 = P(Evidence)/F7.val
F5.tau = F7.tau * P(A=0)
F6.tau = F7.tau * P(A=1)
F3.tau = F5.tau * P(B=0|A=0) + F6.tau * P(B=0|A=1) = P(B=0)
F4.tau = F5.tau * P(B=1|A=0) + F6.tau * P(B=1|A=1) = P(B=1)
F1.tau = F3.tau * P(C=0|B=0) + F4.tau * P(C=0|B=1) = P(C=0)
F2.tau = F3.tau * P(C=1|B=0) + F4.tau * P(C=1|B=1) = P(C=1)
Counts:
c(A=0) = (1/P(e)) * (F7.tau * P(A=0) * F5.val) = (1/P(e)) * (P(A=0) * P(E=e|A=0)) = P(A=0|E=e)
c(C=0,B=0) = (1/P(e)) * F3.tau * P(C=0|B=0) * F1.val
= (1/P(e)) * (P(A=0,B=0) + P(A=1,B=0)) * P(C=0|B=0) * F1.val
= (1/P(e)) * P(B=0) * P(C=0|B=0) * F1.val
= (1/P(e)) * P(C=0,B=0) * F1.val
= P(C=0,B=0,E=e)/P(e) = P(C=0,B=0|E=e)]
Figure 2: Learning example using the Markov chain A → B → C → D → E, where E is observed. In the first pass, factors (Dset, Sset and val) are learned in a bottom up fashion. Also, the normalization constant P(E = e) (probability of evidence) is obtained. In the second pass, tau values are updated in a top-down fashion and used to calculate expected counts c(F.head, pa(F.head)) corresponding to each F.head (the figure shows the derivations for (A=0) and (C=0,B=0), but all counts are updated in the same pass).
Regular Value Elimination is query-based, similar to variable elimination and recursive conditioning. This means that to answer a query of the type $P(Q \mid E = e)$, where $Q$ is a query variable and $E$ a set of evidence nodes, we force $Q$ to be at the top of the search tree, run the backtracking algorithm and then read the answers to the queries $P(Q = q \mid E = e)$, $q \in Dom[Q]$, along each of the outgoing edges of $Q$. Parameter estimation would require running a number of queries on the order of the number of parameters to estimate. We extend VE into an algorithm that allows us to obtain Expectation Maximization sufficient statistics in a single run of Value Elimination plus a second pass, which can never take longer than the first one (and in practice is much faster). This two-pass procedure is analogous to the collect-distribute evidence procedure in the Junction Tree algorithm, but here we do this via a search tree.
Let $\theta_{X=x \mid pa(X)=y}$ be a parameter associated with variable $X$ with value $x$ and parents $Y = pa(X)$ when they have value $y$. Assuming a maximum likelihood learning scenario (footnote 3), to estimate $\theta_{X=x \mid pa(X)=y}$, we need to compute
$$f(X = x, pa(X) = y, E = e) = \sum_{W \setminus \{X, pa(X)\}} P(W, X = x, pa(X) = y, E = e)$$
which is a sum of joint probabilities of all configurations that are consistent with the assignment $\{X = x, pa(X) = y\}$. If we were to turn off factor caching, we would enumerate all such variable configurations and could compute the sum. When standard VE factors are used, however, this is no longer possible whenever $X$ or any of its parents becomes subsumed.
Footnote 3: For Bayesian networks the likelihood function decomposes such that maximizing the expectation of the complete likelihood is equivalent to maximizing the “local likelihood” of each variable in the network.
Fig. 2 illustrates an example of a VE tree and the factors that are learned in the case of a Markov chain with an evidence node at the end. We can readily estimate the parameters associated with variables A and B as they are not subsumed along any branch. C and D become subsumed, however, and we cannot obtain the correct counts along all the branches that would lead to C and D in the full enumeration case. To address this issue, we store a special value, F.tau, in each factor. F.tau holds the sum over all path probabilities from the first level of the search tree to the level at which the factor F was either created or activated. For example, F6.tau in fig. 2 is simply $P(A = 1)$. Although we can compute F3.tau directly, we can also compute it recursively using F5.tau and F6.tau as shown in the figure. This is because both F5 and F6 subsume F3: in the context {F5.Dset}, there exists a (unique) value $d_{sub}$ of F5.head (footnote 4) s.t. F3 becomes activable. Likewise for F6. We cannot compute F1.tau directly, but we can, recursively, from F3.tau and F4.tau by taking advantage of a similar subsumption relationship.
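The quantities in the Markov chain example can be reproduced numerically with a short Python sketch. It is a simplification written for the binary chain A → B → C → D → E only, not the general Value Elimination procedure: there is no search tree or factor cache, and the CPT numbers, the names `val`, `tau`, `counts`, and the helper `brute_force_count` are all hypothetical. Here `val[X][x]` plays the role of the cached factor value with Dset = {X = x} (for instance `val["C"][0]` corresponds to F1.val), and `tau[X][x]` plays the role of the tau of that factor (for instance `tau["B"][0]` corresponds to F3.tau).

```python
from itertools import product

# Hypothetical CPTs for a binary chain A -> B -> C -> D -> E (toy numbers).
prior_A = [0.6, 0.4]
cpt = {  # cpt[child][parent_value][child_value] = P(child | parent)
    "B": [[0.7, 0.3], [0.2, 0.8]],
    "C": [[0.9, 0.1], [0.4, 0.6]],
    "D": [[0.5, 0.5], [0.1, 0.9]],
    "E": [[0.8, 0.2], [0.3, 0.7]],
}
e_obs = 1  # observed value of E

def first_pass():
    """Bottom-up pass: val[X][x] = P(E=e | X=x); also returns P(E=e)."""
    val = {"D": [cpt["E"][d][e_obs] for d in (0, 1)]}
    for parent, child in (("C", "D"), ("B", "C"), ("A", "B")):
        val[parent] = [sum(cpt[child][p][c] * val[child][c] for c in (0, 1))
                       for p in (0, 1)]
    p_evidence = sum(prior_A[a] * val["A"][a] for a in (0, 1))
    return val, p_evidence

def second_pass(val, p_evidence):
    """Top-down pass: tau[X][x] = P(X=x); counts[(child, parent)][p][c] = P(c, p | e)."""
    tau = {"A": list(prior_A)}
    counts = {("A",): [prior_A[a] * val["A"][a] / p_evidence for a in (0, 1)]}
    for parent, child in (("A", "B"), ("B", "C"), ("C", "D")):
        counts[(child, parent)] = [
            [tau[parent][p] * cpt[child][p][c] * val[child][c] / p_evidence
             for c in (0, 1)] for p in (0, 1)]
        tau[child] = [sum(tau[parent][p] * cpt[child][p][c] for p in (0, 1))
                      for c in (0, 1)]
    return tau, counts

def brute_force_count(child, parent, c_val, p_val):
    """Reference value P(child=c, parent=p | E=e) by full enumeration."""
    num = den = 0.0
    for a, b, c, d in product((0, 1), repeat=4):
        x = {"A": a, "B": b, "C": c, "D": d}
        joint = (prior_A[a] * cpt["B"][a][b] * cpt["C"][b][c]
                 * cpt["D"][c][d] * cpt["E"][d][e_obs])
        den += joint
        if x[child] == c_val and x[parent] == p_val:
            num += joint
    return num / den

val, p_evidence = first_pass()
tau, counts = second_pass(val, p_evidence)
```

In this sketch `counts[("C", "B")][0][0]` follows the same product as the count c(C=0,B=0) in fig. 2, namely the tau for B=0 times P(C=0|B=0) times the cached value for C=0, divided by the probability of evidence, and it agrees with `brute_force_count("C", "B", 0, 0)`.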
In general, we can show that the following recursive relationship holds:
$$F.tau \leftarrow \sum_{F^{pa} \in \mathcal{F}^{pa}} F^{pa}.tau \times NSV_{F^{pa}.head = d_{sub}} \times \frac{\prod_{F^{act} \in \mathcal{F}^{act}} F^{act}.val}{F.val} \qquad (1)$$
where $\mathcal{F}^{pa}$ is the set of factors that subsume $F$, $\mathcal{F}^{act}$ is the set of all factors (including $F$) that become active in the context of $\{F^{pa}.Dset, F^{pa}.head = d_{sub}\}$, and $NSV_{F^{pa}.head = d_{sub}}$ is the product of all newly single valued CPTs under the same context. For top-level factors (not subsumed by any factor), $F.tau = P_{evidence} / F.val$, which is 1.0 when there is a unique top-level factor. Alg. 2 is a simple recursive computation of eq. 1 for each factor. We visit learned factors in the reverse order in which they were learned to ensure that, for any factor $F'$, $F'.tau$ is incremented (line 13) by any $F$ that might have activated $F'$ (line 12). For example, in fig. 2, F4 uses F1 and F2, so F4.tau needs to be updated before F1.tau and F2.tau. In line 11, we can increment the counts for any NSV CPT entries since F.tau will account for the possible ways of reaching the configuration {F.Dset, F.head = d} in an equivalent full enumeration tree.
Algorithm 1: FirstPass(level)
Input: Graph G
Output: A list of learned factors and Pevidence
1  Select var V to branch on
2  if V == NONE then return
3  Sset = {}, Dset = {}
4  for d ∈ Dom[V] do
5      V ← d
6      prod = productOfAllNSVsAndActiveFactors(Dset, Sset)
7      if prod != 0 then FirstPass(level+1)
8      sum += prod
9  Sset = Sset ∪ {V}
10 cacheNewFactor(F.head ← V, F.val ← sum, F.Sset ← Sset, F.Dset ← Dset)
Algorithm 2: SecondPass()
Input: F: List of factors in the reverse order learned in the first pass, and Pevidence
Result: Updated counts
1  foreach F ∈ F do
2      if F.Dset = {} then
3          F.tau ← Pevidence / F.val
4      else
5          F.tau ← 0.0
6      Assign vars in F.Dset to their values
7      V ← F.head (last node to have been subsumed in this factor)
8      foreach d ∈ Dom[V] do
9          prod = productOfAllNSVsAndActiveFactors()
10         prod *= F.tau
11         foreach newly single-valued CPT C do count(C.child, C.parents) += prod / Pevidence
12         F′ = getListOfActiveFactors()
13         for F′ ∈ F′ do F′.tau += prod / F′.val
Most Probable Explanation
We compute MPE using a very similar two-pass algorithm. In the first pass, factors are used to store a maximum instead of a summation over variables in the Sset. We also keep track of the value of F.head at which the maximum is achieved. In the second pass, we recursively find the optimal variable configuration by following the trail of factors that are activated when we assign each F.head variable to its maximum value starting from the last learned factor.
Footnote 4: Recall, F.head is the last variable to be added to a newly created factor in line 10 of alg. 1.
4 MACHINE TRANSLATION WORD ALIGNMENT EXPERIMENTS
A major motivation for pursuing the type of representation and inference described above is to make it possible to solve computationally-intensive real-world problems using large amounts of data, while retaining the full generality and expressiveness afforded by the MDBN modeling language. In the experiments below we compare running times of MDBNs to GIZA++ on IBM Models 1 through 4 and the M-HMM model. GIZA++ is a special-purpose optimized MT word alignment C++ tool that is widely used in current state-of-the-art phrase-based MT systems [10] and at the time of this writing is the only publicly available software that implements all of the IBM Models. We test on 107 hand-aligned French-English sentences (footnote 5) from a corpus of the European parliament proceedings (Europarl [9]) and train on 10000 sentence pairs from the same corpus, each with at most 40 words.
The Alignment Error Rate (AER) [13] evaluation metric quantifies how well the MPE assignment to the hidden alignment variables matches human-generated alignments. Several pruning and smoothing techniques are used by GIZA and MDBNs. GIZA prunes low lexical ($P(f \mid e)$) probability values and uses a default small value for unseen (or pruned) probability table entries. For Models 3 and 4, for which there is no known polynomial time algorithm to perform the full E-step or compute MPE, GIZA generates a set of high probability alignments using an M-HMM and hill-climbing and collects EM counts over these alignments using M3 or M4. For MDBN models we use the following pruning strategy: at each level of the search tree we prune values which, together, account for the lowest specified percentage of the total probability mass of the product of all newly active CPTs in line 6 of alg. 1. This is a more effective pruning than simply removing low-probability values of each CPD because it factors in the joint contribution of multiple active variables.
Table 1 shows a comparison of timing numbers obtained with GIZA++ and MDBNs. The runtime numbers shown are for the combined tasks of training and decoding; however, training time dominates given the difference in size between train and test sets. For Models 1 and 2 neither GIZA nor MDBNs perform any pruning. For the M-HMM, we prune 60% of probability mass at each level and use a Dirichlet prior over the alignment variables such that long-range transitions are exponentially less likely than shorter ones (footnote 6). This model achieves similar times and AER to GIZA’s. Interestingly, without any pruning, the MDBN M-HMM takes 160 minutes to complete while only marginally improving upon the pruned model. Experimenting with several pruning thresholds, we found that AER would worsen much more slowly than runtime decreases.
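The mass-based pruning rule described above can be sketched in a few lines of Python. This is only one reading of the rule, under stated assumptions: the function name `prune_lowest_mass`, the dictionary of per-value scores (standing in for the product of newly active CPTs for each candidate value), and the decision to always keep at least one value are all hypothetical and not taken from the MDBN implementation.

```python
def prune_lowest_mass(scores, fraction):
    """Drop the lowest-scoring values whose combined mass stays within
    `fraction` of the total mass; always keep at least one value.

    scores: dict mapping each candidate value to a non-negative score.
    fraction: share of the total mass that may be pruned, e.g. 0.6.
    """
    total = sum(scores.values())
    if total <= 0.0:
        return dict(scores)  # nothing meaningful to prune
    budget = fraction * total
    dropped_mass = 0.0
    kept = dict(scores)
    # Visit values from least to most probable and drop while within budget.
    for value, score in sorted(scores.items(), key=lambda kv: kv[1]):
        if len(kept) == 1 or dropped_mass + score > budget + 1e-12:
            break
        dropped_mass += score
        del kept[value]
    return kept

# Toy usage: four candidate values for one variable at one search level.
level_scores = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
kept_quarter = prune_lowest_mass(level_scores, 0.25)
kept_all_but_one = prune_lowest_mass(level_scores, 1.0)
```

With a budget of 25% of the mass, the two least likely values are removed together because their joint mass (0.20) fits the budget, whereas a per-value threshold would need to be tuned separately for each table.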
Models 3 and 4 have treewidth equal to the number of alignment variables (because of the global constraints tying them) and therefore require approximate inference. Using Model 3 and a drastic pruning threshold that keeps only the highest-probability value at each level, we were able to achieve an AER not much higher than GIZA’s. For M4, GIZA achieves a best AER of 31.7%, while we do not improve upon Model 3, most likely because of overly restrictive pruning. Nevertheless, a simple variation on Model 3 in the MDBN framework achieves a lower AER than our regular M3 (with the same pruning). The M3-HMM hybrid model combines the Markov alignment dependencies from the M-HMM model with the fertility model of M3.

MCMC Inference

Sampling is widely used for inference in high-treewidth models. Although MDBNs support likelihood weighting, it is very inefficient when the probability of evidence is very small, as is the case in our MT models. Besides being slow, Markov chain Monte Carlo can be problematic when the joint distribution is not positive everywhere, in particular in the presence of determinism and hard constraints. Techniques such as blocking Gibbs sampling [8] try to address the problem. Often, however, one has to carefully choose a problem-dependent proposal distribution. We used MCMC to improve training of the M3-HMM model. We were able to achieve an AER of 32.8% (down from 39.1%), although this required 400 minutes of uniprocessor time.

Table 1: MDBN VE-based learning versus GIZA++ timings and %AER using 5 EM iterations. The columns M1 and M-HMM correspond to the model that is used to initialize the model in the corresponding row. The last row is a hybrid Model3-HMM model that we implemented using MDBNs and is not expressible using GIZA.

Model    GIZA++ (M1 init)   GIZA++ (M-HMM init)   MDBN (M1 init)   MDBN (M-HMM init)   MDBN (MCMC)
M1       1m45s (47.7%)      N/A                   3m20s (48.0%)    N/A
M2       2m02s (41.3%)      N/A                   5m30s (41.0%)    N/A
M-HMM    4m05s (35.0%)      N/A                   4m15s (33.0%)    N/A
M3       2m50 (45%)         5m20s (38.5%)         12m (43.6%)      9m (42.5%)
M4       5m20s (34.8%)      7m45s (31.7%)         25m (43.6%)      23m (42.6%)
M3-HMM   N/A                N/A                   9m30 (41.0%)     9m15s (39.1%)       400m (32.8%)

5 CONCLUSION

The existing classes of graphical models are not ideally suited for representing SMT models because “natural” semantics for specifying the latter combine flavors of different GM types on top of standard directed Bayesian network semantics: switching parents found in Bayesian Multinets [6], aggregation relationships such as in Probabilistic Relational Models [5], and existence uncertainty [7]. We have introduced a generalization of dynamic Bayesian networks to easily and concisely build models consisting of varying-length, parallel, asynchronous, and interacting data streams. We have shown that our framework is useful for expressing various statistical machine translation models. We have also introduced new parameter estimation and decoding algorithms using exact and approximate search-based probability computation. While our timing results are not yet as fast as a hand-optimized C++ program on the equivalent model, we have shown that even in this general-purpose framework of MDBNs, our timing numbers are competitive and usable. Our framework can of course do much more than the IBM and HMM models. One of our goals is to use this framework to rapidly prototype novel MT systems and develop methods to statistically induce an interlingua. We also intend to use MDBNs in other domains such as multi-party social interaction analysis.

Footnote 5: Available at http://www.cs.washington.edu/homes/karim
Footnote 6: French and English have similar word orders. On a different language pair, a different prior might be more appropriate. With a uniform prior, the MDBN M-HMM has 36.0% AER.

References

[1] F. Bacchus, S. Dalmao, and T. Pitassi. Value elimination: Bayesian inference via backtracking search. In UAI-03, pages 20–28, San Francisco, CA, 2003. Morgan Kaufmann.
[2] J. Bilmes and C. Bartels. On triangulating dynamic graphical models. In Uncertainty in Artificial Intelligence: Proceedings of the 19th Conference, pages 47–56. Morgan Kaufmann, 2003.
[3] P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85, June 1990.
[4] T. Dean and K. Kanazawa. Probabilistic temporal reasoning. In AAAI, pages 524–528, 1988.
[5] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In IJCAI, pages 1300–1309, 1999.
[6] D. Geiger and D. Heckerman. Knowledge representation and inference in similarity networks and Bayesian multinets. Artif. Intell., 82(1-2):45–74, 1996.
[7] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of link structure. Journal of Machine Learning Research, 3(4-5):697–707, May 2003.
[8] C. Jensen, A. Kong, and U. Kjaerulff. Blocking Gibbs sampling in very large probabilistic expert systems. International Journal of Human Computer Studies, Special Issue on Real-World Applications of Uncertain Reasoning, 1995.
[9] P. Koehn. Europarl: A multilingual corpus for evaluation of machine translation. http://www.isi.edu/koehn/publications/europarl, 2002.
[10] P. Koehn, F. Och, and D. Marcu. Statistical phrase-based translation. In NAACL/HLT 2003, 2003.
[11] S. Lauritzen. Graphical Models. Oxford Science Publications, 1996.
[12] K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, U.C. Berkeley, Dept. of EECS, CS Division, 2002.
[13] F. J. Och and H. Ney. Improved statistical alignment models. In ACL, pages 440–447, Oct 2000.
[14] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 2nd printing edition, 1988.
[15] S. Vogel, H. Ney, and C. Tillmann. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics, pages 836–841, Morristown, NJ, USA, 1996.
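The remark above that likelihood weighting is very inefficient when the probability of evidence is very small can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the MDBN system: the network (independent hidden bits, each with one observed child), the conditional probabilities 0.9 and 0.05, and the function name `likelihood_weighting_ess`. It samples the hidden variables from their prior, weights each sample by the likelihood of the fixed evidence, and reports the effective sample size as a fraction of the samples drawn.

```python
import random


def likelihood_weighting_ess(num_evidence, num_samples=2000, seed=0):
    """Effective sample size fraction of likelihood weighting on a toy network.

    Hypothetical model: hidden bits X_i ~ Bernoulli(0.5), each with an
    observed child E_i = 1, where P(E_i=1 | X_i=1) = 0.9 and
    P(E_i=1 | X_i=0) = 0.05. More evidence nodes means a smaller
    probability of evidence.
    """
    rng = random.Random(seed)
    weights = []
    for _ in range(num_samples):
        weight = 1.0
        for _ in range(num_evidence):
            # Sample the non-evidence variable from its prior.
            x = rng.random() < 0.5
            # Evidence is clamped, so multiply in its likelihood.
            weight *= 0.9 if x else 0.05
        weights.append(weight)
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    # Kish effective sample size, normalised by the number of samples.
    return (total * total / total_sq) / num_samples


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, likelihood_weighting_ess(n))
```

With more evidence nodes, a handful of samples carry almost all of the weight, so the effective sample size fraction drops sharply. This is the kind of degradation that motivates MCMC or search-based alternatives in models where the evidence is improbable under the prior.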
5 0.64385599 188 nips-2006-Temporal and Cross-Subject Probabilistic Models for fMRI Prediction Tasks
Author: Alexis Battle, Gal Chechik, Daphne Koller
Abstract: We present a probabilistic model applied to the fMRI video rating prediction task of the Pittsburgh Brain Activity Interpretation Competition (PBAIC) [2]. Our goal is to predict a time series of subjective, semantic ratings of a movie given functional MRI data acquired during viewing by three subjects. Our method uses conditionally trained Gaussian Markov random fields, which model both the relationships between the subjects’ fMRI voxel measurements and the ratings, as well as the dependencies of the ratings across time steps and between subjects. We also employed non-traditional methods for feature selection and regularization that exploit the spatial structure of voxel activity in the brain. The model displayed good performance in predicting the scored ratings for the three subjects in test data sets, and a variant of this model was the third place entrant to the 2006 PBAIC.
6 0.63070446 121 nips-2006-Learning to be Bayesian without Supervision
7 0.63051283 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
8 0.62969863 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
9 0.62857151 57 nips-2006-Conditional mean field
10 0.62544006 175 nips-2006-Simplifying Mixture Models through Function Approximation
11 0.62356859 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
12 0.6231066 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
13 0.62191075 193 nips-2006-Tighter PAC-Bayes Bounds
14 0.62076849 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
15 0.61957037 20 nips-2006-Active learning for misspecified generalized linear models
16 0.61841446 98 nips-2006-Inferring Network Structure from Co-Occurrences
17 0.61815637 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
18 0.61380482 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
19 0.61263859 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
20 0.61258596 96 nips-2006-In-Network PCA and Anomaly Detection