jmlr jmlr2013 jmlr2013-47 knowledge-graph by maker-knowledge-mining

47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference


Source: pdf

Author: Edward Challis, David Barber

Abstract: We investigate Gaussian Kullback-Leibler (G-KL) variational approximate inference techniques for Bayesian generalised linear models and various extensions. In particular we make the following novel contributions: sufficient conditions for which the G-KL objective is differentiable and convex are described; constrained parameterisations of Gaussian covariance that make G-KL methods fast and scalable are provided; the lower bound to the normalisation constant provided by G-KL methods is proven to dominate those provided by local lower bounding methods; complexity and model applicability issues of G-KL versus other Gaussian approximate inference methods are discussed. Numerical results comparing G-KL and other deterministic Gaussian approximate inference methods are presented for: robust Gaussian process regression models with either Student-t or Laplace likelihoods, large scale Bayesian binary logistic regression models, and Bayesian sparse linear models for sequential experimental design. Keywords: generalised linear models, latent linear models, variational approximate inference, large scale inference, sparse learning, experimental design, active learning, Gaussian processes

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Department of Computer Science, University College London, London, WC1E 6BT, UK. Editor: Manfred Opper. Abstract: We investigate Gaussian Kullback-Leibler (G-KL) variational approximate inference techniques for Bayesian generalised linear models and various extensions. [sent-9, score-0.384]

2 1 Overview. In Section 2 we provide an introduction and overview of Gaussian Kullback-Leibler (G-KL) approximate inference methods for problems of the form of Equation (2) and describe a large class of models for which G-KL inference is feasible. [sent-34, score-0.382]

3 To make G-KL approximate inference scalable we present constrained parameterisations of covariance. [sent-39, score-0.481]

4 In Section 5 we compare G-KL approximate inference to other Gaussian approximate inference methods. [sent-40, score-0.426]

5 Many approximate inference methods make this assumption, for example the Laplace approximation (see Barber, 2012 for a recent introduction), expectation propagation with an assumed Gaussian approximating density (Minka, 2001) and local variational bounding methods (Jaakkola and Jordan, 1997). [sent-51, score-0.533]

6 Whilst the site potentials are not, in full generality, easy to evaluate, in the next section we describe a large class of models for which they can be computed efficiently and so for which the G-KL bound is tractable. [sent-62, score-0.436]

7 However, in many practical problems of interest each potential function φn takes the form φn (w) = φn (wT hn ), (5) for fixed vectors hn . [sent-66, score-0.402]

8 Non-Gaussian Models. G-KL approximate inference is not limited to models where the target density has a Gaussian potential N(w|µ, Σ). [sent-75, score-0.386]

9 G-KL Bound Optimisation. G-KL approximate inference proceeds to obtain the tightest lower-bound to log Z and the ‘closest’ Gaussian approximation to p(w) by maximising B_KL(m, S) with respect to the moments m and S of the variational Gaussian density. [sent-81, score-0.488]
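
To make the maximisation concrete, the following is a minimal numpy sketch of how a bound of this form can be evaluated; the function and variable names (gkl_bound, log_phi, H) are our own, the one-dimensional site expectations are approximated with Gauss-Hermite quadrature purely for illustration, and this is not the authors' vgai implementation.

    import numpy as np
    from numpy.polynomial.hermite_e import hermegauss  # probabilists' Gauss-Hermite rule

    def gkl_bound(m, C, mu, Sigma, H, log_phi, n_quad=30):
        # B_KL(m, S) with S = C^T C and C the upper-triangular Cholesky factor.
        # H holds one site projection vector h_n per row; log_phi acts elementwise.
        D = m.size
        S = C.T @ C
        # entropy of q(w) = N(w|m, S):  0.5 * log det(2 pi e S)
        entropy = 0.5 * D * np.log(2.0 * np.pi * np.e) + np.sum(np.log(np.abs(np.diag(C))))
        # <log N(w|mu, Sigma)> under q(w)
        L = np.linalg.cholesky(Sigma)                                 # lower triangular
        r = np.linalg.solve(L, m - mu)
        quad_form = r @ r                                             # (m-mu)^T Sigma^-1 (m-mu)
        tr = np.trace(np.linalg.solve(L.T, np.linalg.solve(L, S)))    # trace(Sigma^-1 S)
        gauss = -0.5 * (D * np.log(2.0 * np.pi) + 2.0 * np.sum(np.log(np.diag(L)))
                        + quad_form + tr)
        # sum_n <log phi_n(m_n + z s_n)> with m_n = h_n^T m and s_n = ||C h_n||
        mn = H @ m
        sn = np.linalg.norm(C @ H.T, axis=0)
        z, w = hermegauss(n_quad)
        w = w / np.sqrt(2.0 * np.pi)                                  # weights for N(0,1) expectations
        sites = np.sum(log_phi(mn[:, None] + sn[:, None] * z[None, :]) @ w)
        return entropy + gauss + sites

With log_phi(x) = -np.logaddexp(0, -x) and rows of H set to y_n x_n, this evaluates the Bayesian logistic regression bound discussed later in the summary.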

10 G-KL Bound Concavity. If each site potential {φ_n}_{n=1}^N is log-concave then the G-KL bound B_KL(m, S) is jointly concave with respect to the variational Gaussian mean m and C, the upper triangular Cholesky decomposition of the covariance, such that S = C^T C. [sent-105, score-0.705]

11 Thus the G-KL bound is jointly concave in (m, C) provided all site potentials {φ_n}_{n=1}^N are log-concave. [sent-117, score-0.498]

12 Another deterministic Gaussian approximate inference procedure for models of the form of Equation (2) is local variational bounding (discussed at further length in Section 5. [sent-143, score-0.475]

13 For log-concave potentials local variational bounding methods, which optimise a different criterion with a different parameterisation to the G-KL bound, have also been shown to result in a convex optimisation problem (Seeger and Nickisch, 2011b). [sent-146, score-0.792]

14 To the best of our knowledge, local variational bounding and G-KL approximate inference methods are the only known concave variational inference procedures for models of the form of Equation (2). [sent-147, score-0.903]

15 Complexity: G-KL Bound and Gradient Computations. In the previous section we provided conditions under which the G-KL bound is strongly concave and differentiable, and so conditions under which G-KL bound optimisation using quasi-Newton methods will exhibit super-linear convergence rates. [sent-157, score-0.415]
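
As a hedged illustration of the quasi-Newton point, the sketch below maximises the bound by flattening (m, C) into one vector and handing it to an off-the-shelf L-BFGS routine with finite-difference gradients; it relies on the hypothetical gkl_bound helper sketched above, whereas the paper uses analytic gradients and, in places, Hessian-free Newton methods.

    import numpy as np
    from scipy.optimize import minimize

    def fit_gkl(mu, Sigma, H, log_phi):
        # Maximise B_KL over (m, C) by minimising its negative with L-BFGS.
        D = mu.size
        iu = np.triu_indices(D)                      # only the upper triangle of C is free

        def unpack(theta):
            m = theta[:D]
            C = np.zeros((D, D))
            C[iu] = theta[D:]
            return m, C

        def neg_bound(theta):
            m, C = unpack(theta)
            return -gkl_bound(m, C, mu, Sigma, H, log_phi)   # hypothetical helper sketched above

        # start from the prior: m = mu and C^T C = Sigma
        theta0 = np.concatenate([mu, np.linalg.cholesky(Sigma).T[iu]])
        res = minimize(neg_bound, theta0, method="L-BFGS-B")  # finite-difference gradients
        return unpack(res.x)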

16 An important practical consideration is the numerical complexity of the bound and gradient computations required by any gradient ascent optimisation procedure. [sent-159, score-0.39]

17 We consider models where the covariance of the Gaussian potential in Equation (2) is spherical, Σ = ν^2 I, and each potential function is a site projection, φ_n(w) = φ_n(w^T h_n). [sent-161, score-0.67]

18 For models that do not satisfy this assumption, in Appendix D we present a full breakdown of the complexity of bound and gradient computations for each G-KL covariance parameterisation presented in Section 4. [sent-162, score-0.447]

19 The computational bottleneck arises from the projected variational variances s_n^2 = ‖C h_n‖^2 required to compute each ⟨log φ_n(w^T h_n)⟩ term. [sent-170, score-0.549]

20 Thus for a broad class of models the G-KL bound and gradient computations scale O(ND^2) for general parameterisations of the covariance S = C^T C. [sent-178, score-0.526]
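
To make the O(ND^2) bottleneck concrete, a small self-contained sketch (our own variable names) of the projected variance computation, together with a banded factor that previews the constrained parameterisations introduced below; plain numpy does not exploit the sparsity, so the cost comment refers to a band-aware implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    D, N, K = 500, 200, 10
    H = rng.standard_normal((N, D))                  # site projection vectors h_n as rows
    C = np.triu(rng.standard_normal((D, D)))         # dense upper-triangular Cholesky factor

    # full parameterisation: each s_n^2 = ||C h_n||^2 costs O(D^2), so O(N D^2) in total
    sn2_full = np.sum((C @ H.T) ** 2, axis=0)

    # banded parameterisation: keep only the K diagonals at and above the main diagonal;
    # a band-aware implementation then pays roughly O(D K) per site, O(N D K) in total
    band_mask = np.triu(np.ones((D, D))) - np.triu(np.ones((D, D)), k=K)
    C_band = C * band_mask
    sn2_band = np.sum((C_band @ H.T) ** 2, axis=0)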

21 To this end we now consider constrained parameterisations of covariance that reduce both the time and space complexity of G-KL procedures. [sent-185, score-0.409]

22 What is more, the complexity of computing the φn site potential contributions to the bound increases for the precision parameterised G-KL bound. [sent-189, score-0.403]

23 We note that since a Gaussian potential, N(w|µ, Σ), can be written as a product over D site projection potentials, computing ⟨log N(w|µ, Σ)⟩ will in general scale O(D^3); see Appendix B. [sent-192, score-0.454]

24 However, the site projection potentials are not concave with respect to (Γ_nn)^{-1}; thus the bound is neither concave nor convex for this parameterisation, resulting in convergence to a possibly local optimum. [sent-209, score-0.797]

25 In Appendix B we provide equations for each term of the G-KL bound and its gradient for each of the covariance parameterisations considered below. [sent-211, score-0.497]

26 Constrained Concave Parameterisations. Below we present constrained parameterisations of covariance which reduce both the space and time complexity of G-KL bound optimisation whilst preserving concavity. [sent-222, score-0.783]

27 The constrained parameterisations below have different qualities regarding the expressiveness of the variational Gaussian approximation. [sent-225, score-0.41]

28 Such a parameterisation describes a sparse covariance matrix and assumes zero covariance between variables that are indexed out of bandwidth. [sent-232, score-0.461]

29 Constraining C such that C = blkdiag(C_1, c I_{L×L}), with C_1 a K × K Cholesky matrix, we have that s_n^2 = ‖C_1 E_1 h_n‖^2 + c^2 ( ‖h_n‖^2 − ‖E_1 h_n‖^2 ), meaning that only the K subspace vectors in E_1 are needed to compute {s_n^2}_{n=1}^N. [sent-250, score-0.536]
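
A small numerical check of the identity just stated, assuming C_1 is a K × K upper triangular factor and E_1 holds K orthonormal subspace vectors as rows; all names here are our own.

    import numpy as np

    rng = np.random.default_rng(1)
    D, K, N = 50, 5, 20
    c = 0.3
    Q, _ = np.linalg.qr(rng.standard_normal((D, D)))   # full orthonormal basis of R^D
    E1, E2 = Q[:, :K].T, Q[:, K:].T                    # subspace rows E1 and their complement E2
    C1 = np.triu(rng.standard_normal((K, K)))          # K x K Cholesky factor of S1
    H = rng.standard_normal((N, D))                    # site projection vectors h_n as rows

    # direct evaluation with the full covariance S = E1^T C1^T C1 E1 + c^2 E2^T E2
    S = E1.T @ (C1.T @ C1) @ E1 + c**2 * (E2.T @ E2)
    sn2_direct = np.einsum("nd,de,ne->n", H, S, H)

    # the same variances using only the K subspace vectors in E1
    proj = E1 @ H.T                                    # K x N matrix of E1 h_n
    sn2_fast = (np.sum((C1 @ proj) ** 2, axis=0)
                + c**2 * (np.sum(H**2, axis=1) - np.sum(proj**2, axis=0)))

    assert np.allclose(sn2_direct, sn2_fast)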

30 For models of the form of Equation (2), three popular deterministic Gaussian approximate inference techniques are local variational bounding, Laplace approximations, and expectation propagation with an assumed Gaussian density. [sent-268, score-0.449]

31 Of the three Gaussian approximate inference methods listed above only one, local variational bounding, provides a lower-bound to the normalisation constant Z. [sent-270, score-0.389]

32 Local variational bounding procedures that use exponentiated quadratic site bounds return a Gaussian approximation to the target density p(w). [sent-300, score-0.533]

33 Each site projection potential function is lower-bounded by an exponentiated quadratic parameterised in w and a variational parameter ξn . [sent-307, score-0.555]

34 Thus we can obtain a bound on Z by substituting Equation (12) into Equation (2): Z = ∫ N(w|µ, Σ) ∏_{n=1}^N φ_n(w^T h_n) dw ≥ ∫ N(w|µ, Σ) c(ξ) e^{−½ w^T F(ξ) w + w^T f(ξ)} dw = c(ξ) e^{−½ µ^T Σ^{-1} µ} det(2πΣ)^{-1/2} ∫ e^{−½ w^T A w + w^T b} dw, (13) where A := Σ^{-1} + F(ξ) and b := Σ^{-1} µ + f(ξ). [sent-313, score-0.387]

35 Completing the square in Equation (13) and integrating, we have log Z ≥ B(ξ), where B(ξ) = log c(ξ) − ½ µ^T Σ^{-1} µ + ½ b^T A^{-1} b − ½ log det(2πΣ) + ½ log det(2πA^{-1}). [sent-316, score-0.578]
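
A hedged sketch of B(ξ) for the particular case of logistic site potentials, using the standard Jaakkola-Jordan exponentiated quadratic site bound; the choice of logistic sites, the function names and the dense linear algebra here are our own illustrative assumptions rather than the paper's implementation.

    import numpy as np
    from scipy.special import expit                    # logistic sigmoid

    def local_bound(xi, mu, Sigma, H):
        # B(xi) for logistic sites phi_n(w) = sigma(w^T h_n), h_n the rows of H, using
        # sigma(x) >= sigma(xi) exp((x - xi)/2 - lam(xi)(x^2 - xi^2)), lam(xi) = tanh(xi/2)/(4 xi)
        lam = np.tanh(xi / 2.0) / (4.0 * xi)
        F = 2.0 * (H.T * lam) @ H                      # sum_n 2 lam_n h_n h_n^T
        f = 0.5 * np.sum(H, axis=0)                    # sum_n h_n / 2
        log_c = np.sum(np.log(expit(xi)) - 0.5 * xi + lam * xi**2)
        Sinv = np.linalg.inv(Sigma)
        A = Sinv + F                                   # the Gaussian approximation is N(w | A^-1 b, A^-1)
        b = Sinv @ mu + f
        _, ld_Sigma = np.linalg.slogdet(2.0 * np.pi * Sigma)
        _, ld_Ainv = np.linalg.slogdet(2.0 * np.pi * np.linalg.inv(A))
        return (log_c - 0.5 * mu @ Sinv @ mu + 0.5 * b @ np.linalg.solve(A, b)
                - 0.5 * ld_Sigma + 0.5 * ld_Ainv)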

36 However, many problems have more site potentials φ_n than Gaussian moment parameters, that is N > ½ D(D + 3), and the local bound in such cases has a richer parameterisation than the G-KL. [sent-324, score-0.585]

37 We first substitute the local bound on ∏_{n=1}^N φ_n(w^T h_n), Equation (12), into Equation (4) to obtain a new bound B_KL(m, S) ≥ B̃_KL(m, S, ξ), where 2 B̃_KL = −2 ⟨log q(w)⟩ − log det(2πΣ) + 2 log c(ξ) − ⟨(w − µ)^T Σ^{-1} (w − µ)⟩ − ⟨w^T F(ξ) w⟩ + 2 ⟨w^T f(ξ)⟩. [sent-326, score-0.725]

38 Using Equation (14) this can be written as B̃_KL = ⟨−log q(w)⟩ − ½ log det(2πΣ) + log c(ξ) − ½ µ^T Σ^{-1} µ − ½ ⟨w^T A w⟩ + ⟨w^T b⟩. [sent-327, score-0.382]

39 By defining q̃(w) = N(w|A^{-1} b, A^{-1}) we obtain B̃_KL = −KL(q(w)‖q̃(w)) − ½ log det(2πΣ) + log c(ξ) − ½ µ^T Σ^{-1} µ + ½ b^T A^{-1} b + ½ log det(2πA^{-1}). [sent-332, score-0.485]

40 Thus optimal G-KL bounds, obtained by maximising B_KL(m, S) over m and S, are provably tighter than both the local variational bound and the G-KL bound calculated using the optimal local bound moments m_ξ and S_ξ. [sent-338, score-0.478]

41 Furthermore, constrained parameterisations of covariance, introduced in Section 4, which are required when D ≫ 1, are also frequently observed to outperform local variational solutions despite the fact that they are not provably guaranteed to do so. [sent-341, score-0.444]

42 Complexity and Model Suitability Comparison. We briefly review the core computational bottlenecks and the conditions placed on the potential functions by the local variational bounding, Laplace approximation and Gaussian expectation propagation approximate inference methods. [sent-343, score-0.508]

43 These methods have N free variational parameters, one for each site potential φ_n. [sent-357, score-0.397]

44 Local variational bounding procedures are applicable provided tight exponentiated quadratic lower-bounds to the site projection potentials {φ_n}_{n=1}^N exist; that is, each site potential is required to be super-Gaussian (Palmer et al. [sent-363, score-0.96]

45 For example, the log det (A) term is no longer exactly computed and a lower-bound on log Z is no longer maintained—only an estimate of log Z is provided. [sent-374, score-0.382]

46 G-KL. G-KL approximate inference methods require that each site projection potential has unbounded support on R. [sent-389, score-0.498]

47 Unlike local variational bounding procedures G-KL does not require the site potentials to be super-Gaussian. [sent-391, score-0.619]

48 Importantly, G-KL procedures can be made scalable by using constrained parameterisations of covariance that do not require making a priori factorisation assumptions for the approximate posterior density. [sent-401, score-0.64]

49 Scalable covariance decompositions for G-KL inference maintain a strict lower-bound on log Z, whereas approximate local bound optimisers do not. [sent-402, score-0.557]

50 We compare the performance of the constrained parameterisations of G-KL covariance that we presented in Section 4. [sent-409, score-0.409]

51 The likelihood factorises over data instances, p(y|w) = ∏_{n=1}^N φ(w_n); thus the GP posterior is of the form of Equation (1) with site projection potentials of the form of Equation (5). [sent-420, score-0.508]

52 In this section we compare G-KL approximate inference to other deterministic Gaussian approximate inference methods, namely: the Laplace approximation (Lap), local variational bounding (VB) and Gaussian expectation propagation (G-EP). [sent-462, score-0.69]

53 Second column section: Student’s t likelihood results with G-KL, local variational bounding (VB) and Laplace (Lap) approximate inference. [sent-548, score-0.384]

54 G-KL approximate inference is straightforward: for the G-KL approximate posterior q(w) = N(w|m, S) the likelihood's contribution to the bound is ⟨log p(y|w)⟩_{q(w)} = ∑_n ⟨log φ_n(m_n + z √S_nn)⟩_{N(z|0,1)}. [sent-556, score-0.617]
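
As an illustration, a sketch of this likelihood contribution for a Student-t observation model; the parameter names (nu, sigma2) and the use of Gauss-Hermite quadrature are our own choices, and the Laplace case admits the closed form given in the appendix material below.

    import numpy as np
    from numpy.polynomial.hermite_e import hermegauss
    from scipy.special import gammaln

    def student_t_likelihood_term(y, m, S, nu=4.0, sigma2=1.0, n_quad=50):
        # sum_n < log StudentT(y_n | f_n, nu, sigma2) > under N(f_n | m_n, S_nn)
        z, w = hermegauss(n_quad)
        w = w / np.sqrt(2.0 * np.pi)
        f = m[:, None] + np.sqrt(np.diag(S))[:, None] * z[None, :]
        log_const = gammaln((nu + 1) / 2) - gammaln(nu / 2) - 0.5 * np.log(nu * np.pi * sigma2)
        log_p = log_const - 0.5 * (nu + 1) * np.log1p((y[:, None] - f) ** 2 / (nu * sigma2))
        return np.sum(log_p @ w)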

55 The expectations for the Laplace likelihood site potentials have simple analytic forms—see Appendix B. [sent-558, score-0.409]

56 We follow the evidence maximisation or maximum likelihood II (ML-II) procedure to estimate the covariance hyperparameters; that is, we set the covariance hyperparameters to maximise p(y|X, θ). [sent-579, score-0.394]
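
An illustrative ML-II outer loop: everything named here (se_kernel, ml_ii, the grid search) is a hypothetical helper rather than the paper's code, which optimises the hyperparameters with gradients of the bound; the callable optimised_bound stands for any routine that fits the G-KL approximation for a given prior covariance and returns the optimised B_KL.

    import numpy as np

    def se_kernel(X, lengthscale, signal_var):
        # squared-exponential covariance function with a small jitter term
        d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        return signal_var * np.exp(-0.5 * d2 / lengthscale**2) + 1e-8 * np.eye(len(X))

    def ml_ii(X, optimised_bound, lengthscales, signal_vars):
        # pick the hyperparameters whose *optimised* G-KL bound, a lower bound on
        # log p(y|X, theta), is largest
        best, best_theta = -np.inf, None
        for ls in lengthscales:
            for sv in signal_vars:
                bound = optimised_bound(se_kernel(X, ls, sv))
                if bound > best:
                    best, best_theta = bound, (ls, sv)
        return best_theta, best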

57 The results show that both G-KL bound optimisation and G-KL hyperparameter optimisation are numerically stable. [sent-646, score-0.452]

58 G-KL approximate inference appears more robust than G-EP and VB: G-KL hyperparameter optimisation always converged, often to a better local optimum. [sent-647, score-0.451]

59 Our aim is not to make a comparison of deterministic approximate inference methods for Bayesian logistic regression models (see Nickisch and Rasmussen, 2008, to that end), but to investigate the time-accuracy trade-offs of each of the constrained G-KL covariance parameterisations. [sent-659, score-0.469]

60 The expression above is of the form of Equation (2) with log-concave site projection potentials φ_n(x) = σ(x) and h_n = y_n x_n. [sent-666, score-0.518]
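
For reference, a numerically stable form of this site potential and its projection vectors (function names ours); these plug directly into the generic bound sketch given earlier as log_phi and H.

    import numpy as np

    def log_sigmoid(x):
        # log phi_n(x) = log sigma(x), computed stably for large negative x
        return -np.logaddexp(0.0, -x)

    def site_projections(X, y):
        # h_n = y_n x_n with labels y_n in {-1, +1} and covariate rows x_n
        return y[:, None] * X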

61 For local variational bounding (VB) approximate inference K refers to the number of Lanczos vectors used to update the variational parameters. [sent-681, score-0.588]

62 Since the G-KL bound is strongly concave, we performed G-KL bound optimisation using Hessian-free Newton methods for all the Cholesky parameterised covariance experiments. [sent-685, score-0.628]

63 G-KL results obtained using chevron Cholesky (Chev), banded Cholesky (Band), subspace Cholesky (Sub) and factor analysis (FA) constrained parameterisations of covariance. [sent-916, score-0.524]
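
A sketch of the factor analysis parameterisation mentioned here, S = ΘΘ^T + diag(d)^2, showing that the projected variances can be computed in O(NDK) without forming the D × D covariance; variable names are our own.

    import numpy as np

    rng = np.random.default_rng(2)
    D, K, N = 200, 8, 100
    H = rng.standard_normal((N, D))                    # site projection vectors h_n as rows

    # factor analysis parameterisation: S = Theta Theta^T + diag(d)^2, D(K + 1) parameters
    Theta = rng.standard_normal((D, K))
    d = np.abs(rng.standard_normal(D)) + 0.1

    # projected variances in O(NDK) without ever forming the D x D covariance
    sn2 = np.sum((H @ Theta) ** 2, axis=1) + np.sum(H**2 * d**2, axis=1)

    # reference computation with the dense covariance
    S = Theta @ Theta.T + np.diag(d**2)
    assert np.allclose(sn2, np.einsum("nd,de,ne->n", H, S, H))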

64 In the smaller problems considered, the best G-KL times were achieved by the chevron Cholesky covariance, followed by the banded, the subspace and the FA parameterisations, in that order. [sent-937, score-0.534]

65 Whilst chevron and banded parameterisations both scale O(NDK), they access and compute different elements of the data and Cholesky matrices. [sent-940, score-0.414]

66 The G-KL banded covariance parameterisation achieves the strongest bound value, with the chevron and factor analysis parameterisations a close second. [sent-950, score-0.775]

67 The results show that the G-KL mean is broadly invariant to the G-KL covariance parameterisations used. [sent-962, score-0.395]

68 Summary. The results support the use of the constrained Cholesky covariance parameterisations to drive scalable G-KL approximate inference procedures. [sent-973, score-0.622]

69 Whilst neither the banded nor the chevron Cholesky parameterisations are invariant to permutations of the index set they both achieved the strongest bound values and test set performance. [sent-974, score-0.49]

70 In this larger setting we apply G-KL and VB approximate inference methods only and make use of approximate covariance decompositions. [sent-1106, score-0.427]

71 These results highlight a general distinction between the two methods: VB optimisation is an approximate EM algorithm, whilst G-KL optimisation in this setting is implemented using an approximate second order gradient ascent procedure. [sent-1125, score-0.701]

72 All that is required to apply G-KL to a model of the form of Equation (2) is the pointwise evaluation of the univariate site projection potentials and that each of these potentials has unbounded support on R. [sent-1147, score-0.525]

73 Unlike other deterministic Gaussian approximate inference methods G-KL does not require the site potentials to be differentiable, super-Gaussian or log-concave. [sent-1148, score-0.544]

74 A long perceived disadvantage of G-KL approximate inference is the difficulty of optimising the bound with respect to the O(D^2) parameters needed to specify the G-KL covariance matrix. [sent-1150, score-0.52]

75 We have shown, however, that whilst O(D^2) parameters are required in full generality, the computations needed for bound optimisation compare favourably with other deterministic Gaussian approximate inference procedures. [sent-1151, score-0.587]

76 For larger problems we provided concave constrained parameterisations of covariance that allow G-KL methods to be applied without imposing a priori factorisation assumptions on the approximate posterior density. [sent-1153, score-0.676]

77 The results presented in Section 6 show that such constrained covariance parameterisations are at least as good as other widely used deterministic methods at capturing posterior covariance. [sent-1154, score-0.478]

78 G-KL approximate inference using constrained concave covariance parameterisations has optimisation convergence times comparable to fast approximate variational local bound methods whilst maintaining a strict lower-bound on log Z. [sent-1155, score-1.429]

79 The vgai package implements G-KL approximate inference for models of the form of Equation (2) where each potential function is a site projection φn (w) = φ(wT hn ). [sent-1160, score-0.684]

80 Generic site projection potentials are supported if an implementation of ψ := log φ : R → R is provided. [sent-1162, score-0.454]

81 The package implements the unconstrained Cholesky, constrained Cholesky and factor analysis parameterisations of covariance discussed in Section 4. [sent-1163, score-0.409]

82 G-KL bound optimisation is achieved in the vgai package using Mark Schmidt’s minFunc optimisation package. [sent-1165, score-0.42]

83 G-KL Bound Gradients. We present the G-KL bound and its gradient for Gaussian and generic site projection potentials with full Cholesky and factor analysis parameterisations of G-KL covariance. [sent-1183, score-0.717]

84 Gradients for the chevron, banded and sparse Cholesky covariance parameterisations are implemented simply by placing that Cholesky parameterisation’s sparsity mask on the full Cholesky gradient matrix. [sent-1184, score-0.542]

85 Site Projection Potentials. Each site projection potential's contribution to the G-KL bound can be expressed as I_n = ⟨log φ_n(w^T h_n)⟩ = ⟨log φ(y)⟩_{N(y|m_n, s_n^2)} = ⟨log φ(m_n + z s_n)⟩_{N(z|0,1)}, where m_n = h_n^T m and s_n^2 = h_n^T S h_n. [sent-1193, score-0.918]

86 The expectations and their derivatives are given by: I_n = ∫ N(z|0,1) log φ_n(m_n + z s_n) dz, ∂I_n/∂m_n = (1/s_n) ∫ z N(z|0,1) log φ_n(m_n + z s_n) dz, and ∂I_n/∂s_n^2 = (1/(2 s_n^2)) ∫ (z^2 − 1) N(z|0,1) log φ_n(m_n + z s_n) dz. [sent-1196, score-0.612]
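
A small sketch (our own function names) that evaluates these three expectations with Gauss-Hermite quadrature and checks the derivative identities against finite differences for a logistic site.

    import numpy as np
    from numpy.polynomial.hermite_e import hermegauss

    def site_terms(mn, sn2, log_phi, n_quad=50):
        # I_n and its derivatives, from the one-dimensional expectation identities above
        z, w = hermegauss(n_quad)
        w = w / np.sqrt(2.0 * np.pi)
        sn = np.sqrt(sn2)
        g = log_phi(mn + z * sn)
        I = g @ w
        dI_dm = (z * g) @ w / sn
        dI_ds2 = ((z**2 - 1.0) * g) @ w / (2.0 * sn2)
        return I, dI_dm, dI_ds2

    # finite-difference check for a logistic site
    log_phi = lambda x: -np.logaddexp(0.0, -x)
    mn, sn2, eps = 0.7, 2.3, 1e-5
    I, dm, ds2 = site_terms(mn, sn2, log_phi)
    assert np.isclose(dm, (site_terms(mn + eps, sn2, log_phi)[0] - I) / eps, atol=1e-3)
    assert np.isclose(ds2, (site_terms(mn, sn2 + eps, log_phi)[0] - I) / eps, atol=1e-3)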

87 We consider the case of a zero mean Laplace density, p(w^T h_n | τ) = e^{−|w^T h_n|/τ} / (2τ), giving ⟨log p(m_n + z s_n)⟩_z = −log(2τ) − (1/τ) ⟨|m_n + z s_n|⟩_z. [sent-1204, score-0.629]
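
The remaining expectation ⟨|m_n + z s_n|⟩ is the mean of a folded normal, which gives the simple analytic form referred to above; a sketch with our own function names, checked against Monte Carlo.

    import numpy as np
    from math import erf, sqrt, pi, exp, log

    def laplace_site_expectation(m, s, tau):
        # <log p(m + z s)>_{N(z|0,1)} = -log(2 tau) - E|m + z s| / tau, where E|m + z s|
        # is the folded normal mean: s sqrt(2/pi) exp(-m^2/(2 s^2)) + m erf(m/(sqrt(2) s))
        e_abs = s * sqrt(2.0 / pi) * exp(-m**2 / (2.0 * s**2)) + m * erf(m / (sqrt(2.0) * s))
        return -log(2.0 * tau) - e_abs / tau

    # Monte Carlo sanity check of the closed form
    m, s, tau = 0.4, 1.7, 0.8
    z = np.random.default_rng(3).standard_normal(1_000_000)
    mc = np.mean(-np.log(2.0 * tau) - np.abs(m + z * s) / tau)
    assert np.isclose(laplace_site_expectation(m, s, tau), mc, atol=2e-2)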

88 Gaussian Potentials. For a Gaussian potential N(w|µ, Σ) the log expectation is given by ⟨log N(w|µ, Σ)⟩_{q(w)} = −½ [ log det(2πΣ) + (m − µ)^T Σ^{-1} (m − µ) + trace(Σ^{-1} S) ]. [sent-1209, score-0.531]

89 For the FA parameterised covariance we have ⟨‖y − H^T w‖^2⟩ = y^T y − 2 y^T H^T m + ∑_i (H^T m)_i^2 + ∑_{ij} (Θ^T H)_{ij}^2 + ∑_j d_j^2 ∑_i H_{ji}^2, with corresponding gradients ∂⟨log N(y|H^T w, ν^2 I)⟩ / ∂d_j = −(d_j/ν^2) ∑_i H_{ji}^2 and ∂⟨log N(y|H^T w, ν^2 I)⟩ / ∂Θ = −(1/ν^2) H H^T Θ. [sent-1218, score-0.399]

90 Subspace Covariance Decomposition. We consider optimising the G-KL bound with respect to a covariance matrix parameterised on a subspace of the parameters w ∈ R^D. [sent-1226, score-0.444]

91 Since E is orthonormal it does not affect the value or gradient of the entropy's contribution to the bound, since log det(S) = log det(S′). [sent-1229, score-0.553]

92 For S block diagonal with the second block component spherical, S_2 = c^2 I, the orthonormal basis vectors E_2 do not need to be computed or maintained, since s_n^2 = h_n^T S′ h_n = h_n^T E_1^T S_1 E_1 h_n + c^2 h_n^T E_2^T E_2 h_n = h_n^T E_1^T S_1 E_1 h_n + c^2 ( ‖h_n‖^2 − ‖E_1 h_n‖^2 ). [sent-1232, score-0.97]

93 The bound is optimised with respect to the subspace parameterised variational Gaussian by iterating between optimising the bound with respect to the parameters {m, C_1, c} and updating the subspace basis vectors E_1. [sent-1236, score-0.51]

94 The G-KL bound for the subspace Cholesky covariance parameterisation is given by B_KL(m, C, c, E) = (D/2) log(2π) + D/2 + ∑_{k=1}^K log(C_kk) + L log(c) − (D/2) log(2πν^2) − (1/(2ν^2)) [ ‖m − µ‖^2 + trace(C^T C) + L c^2 ] + ∑_{n=1}^N ⟨log φ_n(m_n + z s_n)⟩_{N(z|0,1)}. [sent-1251, score-0.939]

95 As described above, when Σ = ν^2 I, the only terms in the G-KL bound that depend on E are the site projection potential functions ⟨log φ_n(w^T h_n)⟩. [sent-1263, score-0.611]

96 We consider G-KL inference problems of the form of Equation (2) where {φ_n}_{n=1}^N are site projection potentials that are piecewise exponentiated quadratics, log-concave and have unbounded support on R. [sent-1295, score-0.557]

97 We consider each term that depends on the variational parameters m and S separately, namely: log det(S) from the entropy's contribution; trace(Σ^{-1} S) and m^T Σ^{-1} m from the Gaussian potential's contribution; and {m_n, s_n^2}_{n=1}^N from the product of site projection potentials' contribution. [sent-1336, score-0.663]

98 Note that this procedure can speed up G-KL bound optimisation only in settings where hn are not sparse. [sent-1354, score-0.405]

99 Hyperparameter Optimisation. For a general likelihood p(y|w) = ∏_{n=1}^N φ_n(w_n) and GP prior N(w|0, Σ) with covariance function Σ_mn = k(x_m, x_n) we get the G-KL bound B_KL(m, C) = D/2 + ∑_n log C_nn − ½ log det(Σ) − ½ m^T Σ^{-1} m − ½ trace(Σ^{-1} S) + ∑_n ⟨log φ(m_n + z √S_nn)⟩_{N(z|0,1)}. [sent-1364, score-0.644]

100 Concave Gaussian variational approximations for inference in large-scale Bayesian linear models. [sent-1619, score-0.436]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bkl', 0.302), ('vb', 0.227), ('parameterisations', 0.223), ('cholesky', 0.195), ('optimisation', 0.172), ('laplace', 0.168), ('site', 0.167), ('potentials', 0.164), ('ht', 0.158), ('hallis', 0.157), ('hn', 0.157), ('student', 0.149), ('parameterisation', 0.144), ('variational', 0.142), ('covariance', 0.141), ('inference', 0.14), ('barber', 0.14), ('pproximate', 0.135), ('whilst', 0.126), ('wt', 0.119), ('aussian', 0.118), ('zsn', 0.111), ('kl', 0.106), ('chevron', 0.105), ('nference', 0.105), ('det', 0.103), ('seeger', 0.102), ('gaussian', 0.1), ('mn', 0.098), ('sed', 0.098), ('nickisch', 0.094), ('log', 0.093), ('ntrn', 0.092), ('concave', 0.091), ('optimising', 0.09), ('potential', 0.088), ('banded', 0.086), ('fa', 0.086), ('chev', 0.085), ('gp', 0.082), ('optimise', 0.079), ('likelihood', 0.078), ('lanczos', 0.076), ('bound', 0.076), ('approximate', 0.073), ('parameterised', 0.072), ('band', 0.07), ('posterior', 0.069), ('mt', 0.066), ('lml', 0.066), ('subspace', 0.065), ('wtr', 0.059), ('bounding', 0.057), ('gradients', 0.057), ('gradient', 0.057), ('exponentiated', 0.056), ('density', 0.056), ('procedures', 0.055), ('bayesian', 0.054), ('ntst', 0.052), ('ct', 0.051), ('dw', 0.051), ('tlp', 0.05), ('ch', 0.047), ('ndk', 0.046), ('optimised', 0.046), ('constrained', 0.045), ('sub', 0.044), ('logistic', 0.041), ('moments', 0.04), ('challis', 0.039), ('sublevel', 0.039), ('equation', 0.039), ('normalised', 0.035), ('sigmoid', 0.035), ('sparse', 0.035), ('hyperparameters', 0.034), ('local', 0.034), ('factorisation', 0.034), ('rd', 0.033), ('initialised', 0.033), ('reconstruction', 0.032), ('hyperparameter', 0.032), ('expectation', 0.031), ('broadly', 0.031), ('projection', 0.03), ('trace', 0.03), ('prior', 0.03), ('hessian', 0.03), ('measurement', 0.029), ('models', 0.029), ('regression', 0.029), ('dk', 0.029), ('likelihoods', 0.029), ('orthonormal', 0.028), ('ascent', 0.028), ('vanhatalo', 0.028), ('rasmussen', 0.027), ('measurements', 0.027), ('covariates', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

Author: Edward Challis, David Barber

Abstract: We investigate Gaussian Kullback-Leibler (G-KL) variational approximate inference techniques for Bayesian generalised linear models and various extensions. In particular we make the following novel contributions: sufficient conditions for which the G-KL objective is differentiable and convex are described; constrained parameterisations of Gaussian covariance that make G-KL methods fast and scalable are provided; the lower bound to the normalisation constant provided by G-KL methods is proven to dominate those provided by local lower bounding methods; complexity and model applicability issues of G-KL versus other Gaussian approximate inference methods are discussed. Numerical results comparing G-KL and other deterministic Gaussian approximate inference methods are presented for: robust Gaussian process regression models with either Student-t or Laplace likelihoods, large scale Bayesian binary logistic regression models, and Bayesian sparse linear models for sequential experimental design. Keywords: generalised linear models, latent linear models, variational approximate inference, large scale inference, sparse learning, experimental design, active learning, Gaussian processes

2 0.22411659 121 jmlr-2013-Variational Inference in Nonconjugate Models

Author: Chong Wang, David M. Blei

Abstract: Mean-field variational methods are widely used for approximate posterior inference in many probabilistic models. In a typical application, mean-field methods approximately compute the posterior with a coordinate-ascent optimization algorithm. When the model is conditionally conjugate, the coordinate updates are easily derived and in closed form. However, many models of interest—like the correlated topic model and Bayesian logistic regression—are nonconjugate. In these models, mean-field methods cannot be directly applied and practitioners have had to develop variational algorithms on a case-by-case basis. In this paper, we develop two generic methods for nonconjugate models, Laplace variational inference and delta method variational inference. Our methods have several advantages: they allow for easily derived variational algorithms with a wide class of nonconjugate models; they extend and unify some of the existing algorithms that have been derived for specific models; and they work well on real-world data sets. We studied our methods on the correlated topic model, Bayesian logistic regression, and hierarchical Bayesian logistic regression. Keywords: variational inference, nonconjugate models, Laplace approximations, the multivariate delta method

3 0.16681068 108 jmlr-2013-Stochastic Variational Inference

Author: Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley

Abstract: We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets. Keywords: Bayesian inference, variational inference, stochastic optimization, topic models, Bayesian nonparametrics

4 0.15393129 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood

Author: Jaakko Riihimäki, Pasi Jylänki, Aki Vehtari

Abstract: This paper considers probabilistic multinomial probit classification using Gaussian process (GP) priors. Challenges with multiclass GP classification are the integration over the non-Gaussian posterior distribution, and the increase of the number of unknown latent variables as the number of target classes grows. Expectation propagation (EP) has proven to be a very accurate method for approximate inference but the existing EP approaches for the multinomial probit GP classification rely on numerical quadratures, or independence assumptions between the latent values associated with different classes, to facilitate the computations. In this paper we propose a novel nested EP approach which does not require numerical quadratures, and approximates accurately all betweenclass posterior dependencies of the latent values, but still scales linearly in the number of classes. The predictive accuracy of the nested EP approach is compared to Laplace, variational Bayes, and Markov chain Monte Carlo (MCMC) approximations with various benchmark data sets. In the experiments nested EP was the most consistent method compared to MCMC sampling, but in terms of classification accuracy the differences between all the methods were small from a practical point of view. Keywords: Gaussian process, multiclass classification, multinomial probit, approximate inference, expectation propagation

5 0.1264182 45 jmlr-2013-GPstuff: Bayesian Modeling with Gaussian Processes

Author: Jarno Vanhatalo, Jaakko Riihimäki, Jouni Hartikainen, Pasi Jylänki, Ville Tolvanen, Aki Vehtari

Abstract: The GPstuff toolbox is a versatile collection of Gaussian process models and computational tools required for Bayesian inference. The tools include, among others, various inference methods, sparse approximations and model assessment methods. Keywords: Gaussian process, Bayesian hierarchical model, nonparametric Bayes

6 0.11452314 49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization

7 0.1012472 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models

8 0.10023095 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs

9 0.09083122 90 jmlr-2013-Quasi-Newton Method: A New Direction

10 0.088666216 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

11 0.067797601 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

12 0.06604667 15 jmlr-2013-Bayesian Canonical Correlation Analysis

13 0.060311273 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

14 0.058704253 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression

15 0.057235669 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres

16 0.056872658 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization

17 0.055732016 120 jmlr-2013-Variational Algorithms for Marginal MAP

18 0.055096608 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation

19 0.054703705 8 jmlr-2013-A Theory of Multiclass Boosting

20 0.052919939 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.297), (1, -0.38), (2, 0.122), (3, -0.058), (4, -0.061), (5, -0.058), (6, -0.004), (7, -0.047), (8, -0.01), (9, -0.034), (10, 0.046), (11, 0.061), (12, 0.047), (13, -0.026), (14, -0.05), (15, 0.069), (16, 0.004), (17, -0.063), (18, -0.06), (19, -0.035), (20, 0.091), (21, -0.113), (22, 0.082), (23, -0.142), (24, -0.075), (25, -0.134), (26, 0.004), (27, -0.031), (28, 0.027), (29, -0.003), (30, 0.046), (31, 0.054), (32, 0.086), (33, 0.014), (34, -0.091), (35, 0.052), (36, -0.046), (37, -0.066), (38, 0.064), (39, 0.072), (40, -0.041), (41, -0.015), (42, -0.006), (43, 0.055), (44, -0.073), (45, -0.011), (46, 0.028), (47, 0.017), (48, -0.031), (49, 0.055)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94567114 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

Author: Edward Challis, David Barber

Abstract: We investigate Gaussian Kullback-Leibler (G-KL) variational approximate inference techniques for Bayesian generalised linear models and various extensions. In particular we make the following novel contributions: sufficient conditions for which the G-KL objective is differentiable and convex are described; constrained parameterisations of Gaussian covariance that make G-KL methods fast and scalable are provided; the lower bound to the normalisation constant provided by G-KL methods is proven to dominate those provided by local lower bounding methods; complexity and model applicability issues of G-KL versus other Gaussian approximate inference methods are discussed. Numerical results comparing G-KL and other deterministic Gaussian approximate inference methods are presented for: robust Gaussian process regression models with either Student-t or Laplace likelihoods, large scale Bayesian binary logistic regression models, and Bayesian sparse linear models for sequential experimental design. Keywords: generalised linear models, latent linear models, variational approximate inference, large scale inference, sparse learning, experimental design, active learning, Gaussian processes

2 0.68442273 45 jmlr-2013-GPstuff: Bayesian Modeling with Gaussian Processes

Author: Jarno Vanhatalo, Jaakko Riihimäki, Jouni Hartikainen, Pasi Jylänki, Ville Tolvanen, Aki Vehtari

Abstract: The GPstuff toolbox is a versatile collection of Gaussian process models and computational tools required for Bayesian inference. The tools include, among others, various inference methods, sparse approximations and model assessment methods. Keywords: Gaussian process, Bayesian hierarchical model, nonparametric Bayes

3 0.63717866 49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization

Author: Shinichi Nakajima, Masashi Sugiyama, S. Derin Babacan, Ryota Tomioka

Abstract: The variational Bayesian (VB) approximation is known to be a promising approach to Bayesian estimation, when the rigorous calculation of the Bayes posterior is intractable. The VB approximation has been successfully applied to matrix factorization (MF), offering automatic dimensionality selection for principal component analysis. Generally, finding the VB solution is a non-convex problem, and most methods rely on a local search algorithm derived through a standard procedure for the VB approximation. In this paper, we show that a better option is available for fully-observed VBMF—the global solution can be analytically computed. More specifically, the global solution is a reweighted SVD of the observed matrix, and each weight can be obtained by solving a quartic equation with its coefficients being functions of the observed singular value. We further show that the global optimal solution of empirical VBMF (where hyperparameters are also learned from data) can also be analytically computed. We illustrate the usefulness of our results through experiments in multi-variate analysis. Keywords: variational Bayes, matrix factorization, empirical Bayes, model-induced regularization, probabilistic PCA

4 0.63318765 121 jmlr-2013-Variational Inference in Nonconjugate Models

Author: Chong Wang, David M. Blei

Abstract: Mean-field variational methods are widely used for approximate posterior inference in many probabilistic models. In a typical application, mean-field methods approximately compute the posterior with a coordinate-ascent optimization algorithm. When the model is conditionally conjugate, the coordinate updates are easily derived and in closed form. However, many models of interest—like the correlated topic model and Bayesian logistic regression—are nonconjugate. In these models, mean-field methods cannot be directly applied and practitioners have had to develop variational algorithms on a case-by-case basis. In this paper, we develop two generic methods for nonconjugate models, Laplace variational inference and delta method variational inference. Our methods have several advantages: they allow for easily derived variational algorithms with a wide class of nonconjugate models; they extend and unify some of the existing algorithms that have been derived for specific models; and they work well on real-world data sets. We studied our methods on the correlated topic model, Bayesian logistic regression, and hierarchical Bayesian logistic regression. Keywords: variational inference, nonconjugate models, Laplace approximations, the multivariate delta method

5 0.5663892 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood

Author: Jaakko Riihimäki, Pasi Jylänki, Aki Vehtari

Abstract: This paper considers probabilistic multinomial probit classification using Gaussian process (GP) priors. Challenges with multiclass GP classification are the integration over the non-Gaussian posterior distribution, and the increase of the number of unknown latent variables as the number of target classes grows. Expectation propagation (EP) has proven to be a very accurate method for approximate inference but the existing EP approaches for the multinomial probit GP classification rely on numerical quadratures, or independence assumptions between the latent values associated with different classes, to facilitate the computations. In this paper we propose a novel nested EP approach which does not require numerical quadratures, and approximates accurately all betweenclass posterior dependencies of the latent values, but still scales linearly in the number of classes. The predictive accuracy of the nested EP approach is compared to Laplace, variational Bayes, and Markov chain Monte Carlo (MCMC) approximations with various benchmark data sets. In the experiments nested EP was the most consistent method compared to MCMC sampling, but in terms of classification accuracy the differences between all the methods were small from a practical point of view. Keywords: Gaussian process, multiclass classification, multinomial probit, approximate inference, expectation propagation

6 0.56241179 108 jmlr-2013-Stochastic Variational Inference

7 0.48393297 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

8 0.45132366 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs

9 0.44253486 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models

10 0.43799776 90 jmlr-2013-Quasi-Newton Method: A New Direction

11 0.41352189 15 jmlr-2013-Bayesian Canonical Correlation Analysis

12 0.38000762 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression

13 0.37141594 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

14 0.35580978 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

15 0.33062211 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning

16 0.32703561 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres

17 0.32700902 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

18 0.31396356 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation

19 0.2761713 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization

20 0.27327845 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.035), (5, 0.124), (6, 0.051), (10, 0.078), (20, 0.015), (23, 0.033), (41, 0.011), (61, 0.012), (68, 0.016), (70, 0.025), (72, 0.332), (75, 0.081), (85, 0.029), (87, 0.025), (89, 0.013), (93, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74436975 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

Author: Edward Challis, David Barber

Abstract: We investigate Gaussian Kullback-Leibler (G-KL) variational approximate inference techniques for Bayesian generalised linear models and various extensions. In particular we make the following novel contributions: sufficient conditions for which the G-KL objective is differentiable and convex are described; constrained parameterisations of Gaussian covariance that make G-KL methods fast and scalable are provided; the lower bound to the normalisation constant provided by G-KL methods is proven to dominate those provided by local lower bounding methods; complexity and model applicability issues of G-KL versus other Gaussian approximate inference methods are discussed. Numerical results comparing G-KL and other deterministic Gaussian approximate inference methods are presented for: robust Gaussian process regression models with either Student-t or Laplace likelihoods, large scale Bayesian binary logistic regression models, and Bayesian sparse linear models for sequential experimental design. Keywords: generalised linear models, latent linear models, variational approximate inference, large scale inference, sparse learning, experimental design, active learning, Gaussian processes

2 0.5240823 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood

Author: Jaakko Riihimäki, Pasi Jylänki, Aki Vehtari

Abstract: This paper considers probabilistic multinomial probit classification using Gaussian process (GP) priors. Challenges with multiclass GP classification are the integration over the non-Gaussian posterior distribution, and the increase of the number of unknown latent variables as the number of target classes grows. Expectation propagation (EP) has proven to be a very accurate method for approximate inference but the existing EP approaches for the multinomial probit GP classification rely on numerical quadratures, or independence assumptions between the latent values associated with different classes, to facilitate the computations. In this paper we propose a novel nested EP approach which does not require numerical quadratures, and approximates accurately all betweenclass posterior dependencies of the latent values, but still scales linearly in the number of classes. The predictive accuracy of the nested EP approach is compared to Laplace, variational Bayes, and Markov chain Monte Carlo (MCMC) approximations with various benchmark data sets. In the experiments nested EP was the most consistent method compared to MCMC sampling, but in terms of classification accuracy the differences between all the methods were small from a practical point of view. Keywords: Gaussian process, multiclass classification, multinomial probit, approximate inference, expectation propagation

3 0.4653028 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models

Author: Manfred Opper, Ulrich Paquet, Ole Winther

Abstract: Expectation Propagation (EP) provides a framework for approximate inference. When the model under consideration is over a latent Gaussian field, with the approximation being Gaussian, we show how these approximations can systematically be corrected. A perturbative expansion is made of the exact but intractable correction, and can be applied to the model’s partition function and other moments of interest. The correction is expressed over the higher-order cumulants which are neglected by EP’s local matching of moments. Through the expansion, we see that EP is correct to first order. By considering higher orders, corrections of increasing polynomial complexity can be applied to the approximation. The second order provides a correction in quadratic time, which we apply to an array of Gaussian process and Ising models. The corrections generalize to arbitrarily complex approximating families, which we illustrate on tree-structured Ising model approximations. Furthermore, they provide a polynomial-time assessment of the approximation error. We also provide both theoretical and practical insights on the exactness of the EP solution. Keywords: expectation consistent inference, expectation propagation, perturbation correction, Wick expansions, Ising model, Gaussian process

4 0.46069655 76 jmlr-2013-Nonparametric Sparsity and Regularization

Author: Lorenzo Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro, Alessandro Verri

Abstract: In this work we are interested in the problems of supervised learning and variable selection when the input-output dependence is described by a nonlinear function depending on a few variables. Our goal is to consider a sparse nonparametric model, hence avoiding linear or additive models. The key idea is to measure the importance of each variable in the model by making use of partial derivatives. Based on this intuition we propose a new notion of nonparametric sparsity and a corresponding least squares regularization scheme. Using concepts and results from the theory of reproducing kernel Hilbert spaces and proximal methods, we show that the proposed learning algorithm corresponds to a minimization problem which can be provably solved by an iterative procedure. The consistency properties of the obtained estimator are studied both in terms of prediction and selection performance. An extensive empirical analysis shows that the proposed method performs favorably with respect to the state-of-the-art methods. Keywords: sparsity, nonparametric, variable selection, regularization, proximal methods, RKHS ∗. Also at Istituto Italiano di Tecnologia, Via Morego 30, 16163 Genova, Italy and Massachusetts Institute of Technology, Bldg. 46-5155, 77 Massachusetts Avenue, Cambridge, MA 02139, USA. c 2013 Lorenzo Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro and Alessandro Verri. ROSASCO , V ILLA , M OSCI , S ANTORO AND V ERRI

5 0.45515332 120 jmlr-2013-Variational Algorithms for Marginal MAP

Author: Qiang Liu, Alexander Ihler

Abstract: The marginal maximum a posteriori probability (MAP) estimation problem, which calculates the mode of the marginal posterior distribution of a subset of variables with the remaining variables marginalized, is an important inference problem in many models, such as those with hidden variables or uncertain parameters. Unfortunately, marginal MAP can be NP-hard even on trees, and has attracted less attention in the literature compared to the joint MAP (maximization) and marginalization problems. We derive a general dual representation for marginal MAP that naturally integrates the marginalization and maximization operations into a joint variational optimization problem, making it possible to easily extend most or all variational-based algorithms to marginal MAP. In particular, we derive a set of “mixed-product” message passing algorithms for marginal MAP, whose form is a hybrid of max-product, sum-product and a novel “argmax-product” message updates. We also derive a class of convergent algorithms based on proximal point methods, including one that transforms the marginal MAP problem into a sequence of standard marginalization problems. Theoretically, we provide guarantees under which our algorithms give globally or locally optimal solutions, and provide novel upper bounds on the optimal objectives. Empirically, we demonstrate that our algorithms significantly outperform the existing approaches, including a state-of-the-art algorithm based on local search methods. Keywords: graphical models, message passing, belief propagation, variational methods, maximum a posteriori, marginal-MAP, hidden variable models

6 0.45481095 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

7 0.45447746 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

8 0.45421067 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression

9 0.45338008 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs

10 0.45228618 121 jmlr-2013-Variational Inference in Nonconjugate Models

11 0.45164913 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

12 0.45101061 21 jmlr-2013-Classifier Selection using the Predicate Depth

13 0.45041585 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

14 0.44735226 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

15 0.44628847 59 jmlr-2013-Large-scale SVD and Manifold Learning

16 0.44533685 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty

17 0.44499534 86 jmlr-2013-Parallel Vector Field Embedding

18 0.44466683 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

19 0.44277769 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

20 0.44263124 73 jmlr-2013-Multicategory Large-Margin Unified Machines