jmlr jmlr2005 jmlr2005-71 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John Winn, Christopher M. Bishop
Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing
Reference: text
sentIndex sentText sentNum sentScore
1 In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. [sent-9, score-0.555]
2 In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. [sent-10, score-0.488]
3 In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. [sent-13, score-0.546]
4 Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. [sent-16, score-0.598]
5 Keywords: Bayesian networks, variational inference, message passing 1. [sent-17, score-0.946]
6 For these reasons the framework of variational inference has received much attention. [sent-29, score-0.488]
7 In this paper, the variational message passing algorithm is developed, which optimises a variational bound using a set of local computations for each node, together with a mechanism for passing messages between the nodes. [sent-30, score-1.875]
8 These messages are defined so that the optimal variational distribution for a node can be found by summing the messages from its children together with a function of the messages from its parents, where this function depends on the conditional distribution for the node. [sent-33, score-1.467]
9 Section 3 contains the derivation of the variational message passing algorithm, along with an example of its use. [sent-44, score-0.946]
10 Variational Inference In this section, variational inference will be reviewed briefly with particular focus on the case where the variational distribution has a factorised form. [sent-48, score-1.039]
11 The goal in variational inference is to find a tractable variational distribution Q(H) that closely approximates the true posterior distribution P(H | V). [sent-56, score-1.074]
12 Here KL(Q || P) is the Kullback-Leibler divergence between the variational approximation Q(H) and the true posterior P(H | V). [sent-59, score-0.539]
13 In this case, the variational posterior distribution equals the true posterior and L (Q) = ln P(V). [sent-63, score-0.863]
14 However, working with the true posterior distribution is computationally intractable (otherwise we wouldn’t be resorting to variational methods). [sent-64, score-0.551]
15 The variational optimisation proceeds by initialising each of the Q j (H j ) and then cycling through each factor in turn replacing the current distribution with a revised estimate given by (7). [sent-82, score-0.522]
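Sketched below, as a minimal illustration of the cycling scheme just described, is a coordinate-ascent skeleton. The `factors` list and the `update_factor` callable are assumed placeholders for the model-specific revised estimate (7); neither name comes from the paper.

```python
# A minimal sketch, under assumed interfaces, of the mean-field optimisation loop.

def mean_field_optimise(factors, update_factor, n_sweeps=50):
    """Initialise each Q_j externally, then cycle through the factors in turn."""
    for _ in range(n_sweeps):
        for j in range(len(factors)):
            # Each revised Q_j depends only on expectations under the other factors.
            factors[j] = update_factor(j, factors)
    return factors
```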
16 Variational Message Passing In this section, the variational message passing algorithm will be derived and shown to optimise a factorised variational distribution using a message passing procedure on a graphical model. [sent-84, score-2.058]
17 For the initial derivation, it will be assumed that the variational distribution is factorised with respect to each hidden variable Hi and so can be written Q(H) = ∏ Qi (Hi ). [sent-85, score-0.551]
18 Figure 1: A key observation is that the variational update equation for a node H_j depends only on expectations over variables in the Markov blanket of that node (shown shaded), defined as the set of parents, children and co-parents of that node. [sent-99, score-0.945]
19 We now substitute in the form of the joint probability distribution of a Bayesian network, as given in (1), ln Q_j(H_j) = ⟨∑_i ln P(X_i | pa_i)⟩_∼Q(H_j) + const. [sent-100, score-0.493]
20 This leaves only the conditional P(H_j | pa_j) together with the conditionals for all the children of H_j, as these have H_j in their parent set, ln Q_j(H_j) = ⟨ln P(H_j | pa_j)⟩_∼Q(H_j) + ∑_{k∈ch_j} ⟨ln P(X_k | pa_k)⟩_∼Q(H_j) + const. [sent-102, score-1.118]
21 The co-parents of a node X are all the nodes with at least one child which is also a child of X (excluding X itself). [sent-116, score-0.57]
22 2 Optimisation of Q in Conjugate-Exponential Models We will now demonstrate how the optimisation of the variational distribution can be carried out, given that the model is conjugate-exponential. [sent-131, score-0.526]
23 X chY Figure 2: Part of a graphical model showing a node Y , the parents and children of Y , and the coparents of Y with respect to a child node X. [sent-143, score-0.77]
24 If X is Gaussian distributed with mean Y and precision β, it follows that the co-parent set cp_Y contains only β, and the log conditional for X is ln P(X | Y, β) = [βY, −β/2]ᵀ [X, X²] + ½(ln β − βY² − ln 2π). [sent-150, score-0.533]
25 Returning to the variational update equation (8) for a node Y, it follows that all the expectations on the right hand side can be calculated in terms of the expected natural statistic vectors ⟨u⟩ for each node in the Markov blanket of Y. [sent-160, score-0.899]
26 3 Definition of the Variational Message Passing Algorithm We have now reached the point where we can specify exactly what form the messages between nodes must take and so define the variational message passing algorithm. [sent-180, score-1.238]
27 The message from a parent node Y to a child node X is just the expectation under Q of the natural statistic vector, m_{Y→X} = ⟨u_Y⟩. (18) [sent-181, score-1.096]
28 The message from a child node X to a parent node Y is m_{X→Y} = φ_{XY}(⟨u_X⟩, {m_{i→X}}_{i∈cp_Y}), (19) which relies on X having received messages previously from all the co-parents. [sent-182, score-1.2]
29 If X is Gaussian distributed with conditional P(X | Y, β), the messages to its parents Y and β are m_{X→Y} = [⟨β⟩⟨X⟩, −⟨β⟩/2]ᵀ and m_{X→β} = [−½(⟨X²⟩ − 2⟨X⟩⟨Y⟩ + ⟨Y²⟩), ½]ᵀ, and the message from X to any child node is [⟨X⟩, ⟨X²⟩]ᵀ. [sent-184, score-0.99]
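The quoted Gaussian messages can be written directly in terms of the moments held by the neighbouring factors. The sketch below is illustrative; the function names are not from any library.

```python
import numpy as np

# Gaussian-node messages, expressed in the moments <X>, <X^2>, <Y>, <Y^2>, <beta>.

def msg_gauss_to_mean(E_beta, E_X):
    """Message m_{X->Y}: natural-parameter increment for the mean's posterior."""
    return np.array([E_beta * E_X, -0.5 * E_beta])

def msg_gauss_to_precision(E_X, E_X2, E_Y, E_Y2):
    """Message m_{X->beta}: natural-parameter increment for the precision."""
    return np.array([-0.5 * (E_X2 - 2.0 * E_X * E_Y + E_Y2), 0.5])

def msg_gauss_to_child(E_X, E_X2):
    """Message from X to any child: expected natural statistics [<X>, <X^2>]."""
    return np.array([E_X, E_X2])
```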
30 When a node Y has received messages from all parents and children, we can find its updated posterior distribution Q∗_Y by finding its updated natural parameter vector φ∗_Y. [sent-185, score-0.749]
31 The variational message passing algorithm uses these messages to optimise the variational distribution iteratively, as described in Algorithm 1 below. [sent-188, score-1.62]
32 Algorithm 1 The variational message passing algorithm 1. [sent-191, score-0.946]
33 For each node X j in turn, • Retrieve messages from all parent and child nodes, as defined in (18) and (19). [sent-194, score-0.713]
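A minimal sketch of this per-node step of Algorithm 1 follows. It assumes each node object exposes `prior_natural_params()` (the contribution computed from the parent messages) and `messages_from_children()`; these method names are hypothetical, not from the paper.

```python
import numpy as np

def vmp_sweep(hidden_nodes):
    """One sweep over the hidden nodes, updating each natural parameter vector."""
    for node in hidden_nodes:
        phi = np.asarray(node.prior_natural_params(), dtype=float)
        for m in node.messages_from_children():
            phi = phi + np.asarray(m, dtype=float)   # child contributions add up
        node.set_natural_params(phi)                 # this defines the updated Q for the node
```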
34 4 Example: the Univariate Gaussian Model To illustrate how variational message passing works, let us apply it to a model which represents a set of observed one-dimensional data {x_n}_{n=1}^N with a univariate Gaussian distribution of mean µ and precision γ, P(x | H) = ∏_{n=1}^N N(x_n | µ, γ⁻¹). [sent-204, score-1.106]
35 In this model, the conditional distribution of each data point x_n is a univariate Gaussian, which is in the exponential family and so its logarithm can be expressed in standard form as ln P(x_n | µ, γ⁻¹) = [γµ, −γ/2]ᵀ [x_n, x_n²] + ½(ln γ − γµ² − ln 2π), and so u_x(x_n) = [x_n, x_n²]ᵀ. [sent-208, score-0.985]
36 The braces around the messages leaving the plate indicate that a set of N distinct messages are being sent. [sent-211, score-0.525]
37 = [−½(x_n − µ)², ½]ᵀ [γ, ln γ] − ½ ln 2π (22) which shows that, for conjugacy, u_µ(µ) must be [µ, µ²]ᵀ and u_γ(γ) must be [γ, ln γ]ᵀ or linear transforms of these. [sent-212, score-0.642]
38 Alternatively, we could have chosen a normal-gamma prior over both parameters which leads to a slightly more complicated message passing procedure. [sent-214, score-0.528]
39 We define the parameter priors to have hyper-parameters m, β, a and b, so that ln P(µ | m, β) = [βm, −β/2]ᵀ [µ, µ²] + ½(ln β − βm² − ln 2π) and ln P(γ | a, b) = [−b, a−1]ᵀ [γ, ln γ] + a ln b − ln Γ(a). [sent-215, score-1.284]
40 1 VARIATIONAL MESSAGE PASSING IN THE UNIVARIATE GAUSSIAN MODEL We can now apply variational message passing to infer the distributions over µ and γ variationally. [sent-218, score-0.988]
41 The variational distribution is fully factorised and takes the form Q(µ, γ) = Qµ (µ)Qγ (γ). [sent-219, score-0.551]
42 Let us choose to update Qµ (µ) first, in which case variational message passing will proceed as follows (illustrated in Figure 3a-d). [sent-221, score-0.946]
43 (b) Each x_n node has now received messages from all co-parents of µ and so can send a message to µ which is the expectation of the natural parameter vector in (21), m_{x_n→µ} = [⟨γ⟩ x_n, −⟨γ⟩/2]ᵀ. [sent-227, score-0.87]
44 (d) Finally, each x_n node sends a message back to γ which is m_{x_n→γ} = [−½(x_n² − 2x_n⟨µ⟩ + ⟨µ²⟩), ½]ᵀ and γ can update its variational posterior φ∗_γ = [−b, a−1]ᵀ + ∑_{n=1}^N m_{x_n→γ}. [sent-230, score-1.15]
45 As the expectation of u_γ(γ) has changed, we can now go back to step (a) and send an updated message to each x_n node and so on. [sent-231, score-0.601]
46 Hence, in variational message passing, the message passing procedure is repeated iteratively until convergence (unlike belief propagation on a junction tree, where the exact posterior is available after a single pass of message passing). [sent-232, score-1.931]
47 Each round of message passing is equivalent to one iteration of the update equations in standard variational inference. [sent-233, score-0.946]
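As a worked numerical sketch of the univariate-Gaussian example above, each iteration below sends the messages m_{x_n→µ} and m_{x_n→γ} quoted in the text and refreshes Q(µ) (Gaussian) and Q(γ) (gamma). The hyper-parameter names m, β, a, b follow the text; the data and prior values are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5.0, 1.0, size=20)            # observed one-dimensional data
N = len(x)

m, beta, a, b = 0.0, 1e-3, 1e-3, 1e-3        # broad priors on mu and gamma
E_gamma = a / b                              # initial expectation of the precision

for _ in range(50):
    # Update Q(mu): prior natural parameters plus the N messages from the x_n.
    phi_mu = np.array([beta * m, -0.5 * beta]) + \
             np.array([E_gamma * x.sum(), -0.5 * E_gamma * N])
    prec_mu = -2.0 * phi_mu[1]               # posterior precision of mu
    E_mu = phi_mu[0] / prec_mu               # <mu>
    E_mu2 = E_mu**2 + 1.0 / prec_mu          # <mu^2>

    # Update Q(gamma): prior natural parameters plus the N messages from the x_n.
    phi_gamma = np.array([-b, a - 1.0]) + \
                np.array([-0.5 * np.sum(x**2 - 2.0 * x * E_mu + E_mu2), 0.5 * N])
    a_post = phi_gamma[1] + 1.0
    b_post = -phi_gamma[0]
    E_gamma = a_post / b_post                # <gamma>

print("approximate posterior means:", E_mu, E_gamma)
```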
48 Figure 4 gives an indication of the accuracy of the variational approximation in this model, showing plots of both the true and variational posterior distributions for a toy example. [sent-234, score-0.976]
49 5 Initialisation and Message Passing Schedule The variational message passing algorithm is guaranteed to converge to a local minimum of the KL divergence. [sent-238, score-0.946]
50 Figure 4: Contour plots of the variational and true posterior over the mean µ and precision γ of a Gaussian distribution, given four samples from N (x | 5, 1). [sent-248, score-0.568]
51 6 Calculation of the Lower Bound L (Q) The variational message passing algorithm makes use of the lower bound L (Q) as a diagnostic of convergence. [sent-259, score-0.978]
52 This is achieved naturally in the variational message passing framework by providing a way to calculate the bound automatically, as will now be described. [sent-269, score-0.978]
53 To recap, the lower bound on the log evidence is defined to be L(Q) = ⟨ln P(H, V)⟩ − ⟨ln Q(H)⟩, where the expectations are with respect to Q. [sent-270, score-0.507]
54 In a Bayesian network, with a factorised Q distribution, the bound becomes L(Q) = ∑_i ⟨ln P(X_i | pa_i)⟩ − ∑_{i∈H} ⟨ln Q_i(H_i)⟩, which by definition we write as ∑_i L_i, so that the bound is decomposed into contributions from the individual nodes {L_i}. [sent-271, score-0.69]
55 For a particular latent variable node H_j, the contribution is L_j = ⟨ln P(H_j | pa_j)⟩ − ⟨ln Q_j(H_j)⟩. [sent-272, score-0.788]
56 Three of the terms in (23) are already calculated during the variational message passing algorithm: φ_j(pa_j) and φ∗_j when finding the posterior distribution over H_j in (20), and ⟨u_j(H_j)⟩ when calculating outgoing messages from H_j. [sent-275, score-1.3]
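A hedged sketch of how the per-node contribution L_j might be assembled for an exponential-family node is given below. The f_j(H_j) terms shared by the prior and by Q_j cancel, leaving natural-parameter differences dotted with ⟨u_j⟩ plus a difference of (expected) log-normalisers; the argument names, and the assumption that the caller supplies the normaliser terms, are conventions of this sketch rather than of the paper.

```python
import numpy as np

def node_bound_contribution(E_phi_prior, phi_post, E_u, E_g_prior, g_post):
    """L_j = <ln P(H_j | pa_j)> - <ln Q_j(H_j)> for one latent node.

    E_phi_prior : expectation of the prior natural parameter vector phi_j(pa_j)
    phi_post    : updated natural parameter vector phi_j*
    E_u         : expected natural statistics <u_j(H_j)> under Q_j
    E_g_prior   : expected log-normaliser term of the prior
    g_post      : log-normaliser of Q_j evaluated at phi_post
    """
    diff = np.asarray(E_phi_prior, dtype=float) - np.asarray(phi_post, dtype=float)
    return float(diff @ np.asarray(E_u, dtype=float)) + E_g_prior - g_post
```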
57 Again, computation can be saved by computing uk (Vk ) during the initialisation of the message passing algorithm. [sent-278, score-0.584]
58 Allowable Models The variational message passing algorithm can be applied to a wide class of models, which will be characterised in this section. [sent-282, score-0.946]
59 This table can be encoded in implementations of the variational message passing algorithm and used during initialisation to check the conjugacy of the supplied model. [sent-292, score-1.183]
60 The choice of such a truncated distribution will change the form of messages to parent nodes (as the gX normalisation function will also be different) but will not change the form of messages that are passed to child nodes. [sent-297, score-0.823]
61 Table 1: Distributions for each parameter of a number of exponential family distributions if the model is to satisfy conjugacy constraints. [sent-303, score-0.557]
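Below is an illustrative, partial encoding of the kind of conjugacy table described above, usable for the initialisation-time check; only a few representative entries are listed, and the entry names are this sketch's own labels rather than the paper's.

```python
# Partial conjugacy table: (child distribution, parameter) -> allowed parent distributions.
CONJUGATE_PARENTS = {
    ("gaussian", "mean"): {"gaussian"},
    ("gaussian", "precision"): {"gamma"},
    ("gamma", "rate"): {"gamma"},
    ("discrete", "probabilities"): {"dirichlet"},
}

def check_conjugacy(child_dist, parameter, parent_dist):
    """Return True if `parent_dist` is an allowed prior for this parameter of `child_dist`."""
    return parent_dist in CONJUGATE_PARENTS.get((child_dist, parameter), set())
```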
62 This message can be evaluated only if it can be expressed as a function of the messages from the parent variables, which are the expectations of their natural statistics functions {⟨u_{Y_i}(Y_i)⟩_Q}. [sent-320, score-0.672]
63 For example, the expression X = Y1 +Y2Y3 can be achieved by having a deterministic product node A with parents Y2 and Y3 and a deterministic sum node X with parents Y1 and A. [sent-370, score-0.744]
64 2 DETERMINISTIC NODE MESSAGES To examine message passing for deterministic nodes, we must consider the general case where the deterministic node X has multiple children {Z_j}. [sent-375, score-0.925]
65 The message from the node X to any child Z_j is simply m_{X→Z_j} = ⟨u_X(f(Y))⟩_Q = ψ_X({m_{Y_i→X}}_i). [sent-376, score-0.628]
66 The message from a deterministic node to a parent Y_k is then m_{X→Y_k} = Ψ_{X,Y_k}({m_{Y_i→X}}_{i≠k}) ∑_j m_{Z_j→X}, which relies on having received messages from all the child nodes and from all the co-parents. [sent-382, score-1.121]
67 The sum of child messages can be computed and stored locally at the node and used to evaluate all child-to-parent messages. [sent-383, score-0.579]
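The following is a concrete instance of these deterministic-node messages for a deterministic sum node X = Y1 + Y2 whose neighbours use Gaussian-style natural statistics [⟨·⟩, ⟨·²⟩]; the function names are illustrative, not from the paper.

```python
import numpy as np

def sum_node_to_child(m_Y1, m_Y2):
    """psi_X: expected statistics of X = Y1 + Y2 built from the parent messages."""
    EY1, EY1sq = m_Y1
    EY2, EY2sq = m_Y2
    return np.array([EY1 + EY2, EY1sq + 2.0 * EY1 * EY2 + EY2sq])

def sum_node_to_parent(summed_child_msgs, m_other_parent):
    """Psi_{X,Y1}: rewrite a child increment on [X, X^2] as one on [Y1, Y1^2]."""
    s1, s2 = summed_child_msgs            # locally stored sum of child messages
    E_other = m_other_parent[0]           # co-parent expectation <Y2>
    return np.array([s1 + 2.0 * s2 * E_other, s2])
```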
68 The message from the node X to any child is ⟨u_X(X)⟩ as calculated from the mixture parameter vector φ_X(λ, {θ_k}). [sent-407, score-0.684]
69 Similarly, the message from X to a parent θk is the message that would be sent by the corresponding component if it were not in a mixture, scaled by the variational posterior over the indicator variable Q(λ = k). [sent-408, score-1.224]
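A brief sketch of this mixture behaviour: the message that component k would send to its parameter θ_k is scaled by the responsibility Q(λ = k). The callable `component_message(k)` is an assumption of this sketch, standing in for the message the component would send if it were not in a mixture.

```python
import numpy as np

def mixture_msg_to_theta(k, responsibilities, component_message):
    """Message from the mixture node X to the parent theta_k."""
    return responsibilities[k] * np.asarray(component_message(k), dtype=float)
```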
70 In each case, a multivariate conditional distribution is defined in the overall joint distribution P and the corresponding factor in the variational posterior Q will also be multivariate, rather than factorised with respect to the elements of the vector. [sent-413, score-0.767]
71 As a result, the conjugacy constraint between a parent node and a child node will also constrain the dimensionality of the corresponding vector-valued variables to be the same. [sent-417, score-0.89]
72 The conjugate distribution for this variable is the normal-gamma distribution, which is written ln P(c | m, λ, a, b) = [mλ, −λ/2, −b − ½λm², a − ½]ᵀ [µγ, µ²γ, γ, ln γ] + ½(ln λ − ln 2π) + a ln b − ln Γ(a). [sent-428, score-0.939]
73 In addition, deterministic nodes can be included to allow parameters of child distributions to be deterministic functions of parent variables. [sent-434, score-0.522]
74 VIBES: An Implementation of Variational Message Passing The variational message passing algorithm has been implemented in a software package called VIBES (Variational Inference for BayEsian networkS), first described by Bishop et al. [sent-438, score-0.972]
75 Extensions to Variational Message Passing In this section, three extensions to the variational message passing algorithm will be described. [sent-468, score-0.946]
76 However, it is possible to sidestep the conjugacy requirement by introducing additional variational parameters and approximating non-conjugate conditional distributions by valid conjugate ones. [sent-472, score-0.742]
77 We take the approach of Jaakkola and Jordan (1996) and use a variational bound for the logistic sigmoid function σ(a), defined as F(a, ξ) = σ(ξ) exp[(a − ξ)/2 + λ(ξ)(a² − ξ²)], where λ(ξ) = [1/2 − σ(ξ)]/(2ξ) and ξ is a variational parameter. [sent-474, score-0.965]
78 Figure 6: The logistic sigmoid function σ(a) and variational bound F(a, ξ). [sent-483, score-0.516]
79 Therefore, the variational posterior Qx (x) takes the form Qx (x) = σ(a )x [1 − σ(a )]1−x . [sent-489, score-0.516]
80 For consistency with general discrete distributions, we write the bound on the log conditional ln P(x | a) as ln P(x | a) ≥ [0, a]ᵀ [δ(x − 0), δ(x − 1)] + (−a − ξ)/2 + λ(ξ)(a² − ξ²) + ln σ(ξ) = [δ(x − 1) − ½, λ(ξ)]ᵀ [a, a²] − ξ/2 − λ(ξ)ξ² + ln σ(ξ). [sent-496, score-0.967]
81 The message from node x to node a is therefore m_{x→a} = [δ(x − 1) − ½, λ(ξ)]ᵀ and all other messages are as in standard VMP. [sent-497, score-0.925]
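A small numerical sketch of these quantities follows: λ(ξ) and the message m_{x→a} with respect to [a, a²]. Setting ξ² = ⟨a²⟩ is the usual way to tighten this bound, but that choice is an assumption of the sketch rather than something stated in the excerpt.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lam(xi):
    """lambda(xi) = [1/2 - sigma(xi)] / (2 xi); undefined at xi = 0."""
    return (0.5 - sigmoid(xi)) / (2.0 * xi)

def msg_x_to_a(x_value, xi):
    """Message from the binary node x to its parent a (x_value is 0 or 1)."""
    return np.array([float(x_value == 1) - 0.5, lam(xi)])
```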
82 In general, the introduction of additional variational parameters enormously extends the class of models to which VMP can be applied, as the constraint that the model distributions must be conjugate no longer applies. [sent-502, score-0.536]
83 2 Finding a Maximum A Posteriori Solution The advantage of using a variational distribution is that it provides a posterior distribution over latent variables. [sent-504, score-0.63]
84 From (3), the lower bound is L(Q) = ⟨ln P(H, V)⟩ − ⟨ln Q(H)⟩ = ln P(H∗, V) + h_δ where h_δ is the differential entropy of the delta function. [sent-507, score-0.674]
85 The variational distribution can be written in factorised form as QMAP (H) = ∏ Q j (H j ). [sent-512, score-0.551]
86 The KL divergence between the approximating distribution and the true posterior is minimised if KL(Q j || Q j ) is minimised, where Q j is the standard variational solution given by (6). [sent-514, score-0.574]
87 In the message passing framework, a MAP solution can be obtained for a particular latent variable H_j directly from the updated natural parameter vector φ∗_j using (φ∗_j)ᵀ du_j(H_j)/dH_j = 0. [sent-518, score-0.704]
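An illustrative instance of this stationarity condition, for a node whose natural statistics are the Gaussian-style u(H) = [H, H²]: the condition (φ∗)ᵀ du/dH = 0 reads φ1 + 2φ2 H = 0, so the MAP point is the mean of the corresponding Gaussian. The sketch below simply solves that case.

```python
def map_point_gaussian(phi_star):
    """Solve (phi*)^T d[H, H^2]/dH = 0 for H, i.e. phi1 + 2*phi2*H = 0."""
    phi1, phi2 = phi_star
    return -phi1 / (2.0 * phi2)
```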
88 Given that the variational posterior is now a delta function, the expectation of any function f(H_j) under the variational posterior is just f(H∗_j). [sent-520, score-1.062]
89 Since all surrounding nodes can process these messages as normal, a MAP solution may be obtained for any chosen subset of variables (such as particular hyper-parameters), whilst a full posterior distribution is retained for all other variables. [sent-522, score-0.492]
90 Hence, the updated variational distribution factor for that variable has the same form and inference involves just updating the parameters of that distribution. [sent-530, score-0.568]
91 Second, conjugacy results in variational distributions being in standard exponential family form allowing their moments to be calculated analytically. [sent-531, score-0.761]
92 a The form of the gamma distribution means that messages sent to the node a are with respect to a natural statistic vector a ua = ln Γ(a) which means that the updated factor distribution Qa has the form T K ln Qa (a) = ∑ mx →a i i=1 a ln Γ(a) + (−α − 1) ln a − β + const. [sent-540, score-1.761]
93 This suggests that non-conjugate parts of a general graphical model could be handled within a BUGS-style framework whilst variational message passing is used for the rest of the model. [sent-547, score-1.074]
94 Discussion The variational message passing algorithm allows approximate inference using a factorised variational distribution in any conjugate-exponential model, and in a range of non-conjugate models. [sent-550, score-1.567]
95 In general, variational message passing dramatically simplifies the construction and testing of new variational models and readily allows a range of alternative models to be tested on a given problem. [sent-552, score-1.364]
96 The general form of VMP also allows the inclusion of arbitrary nodes in the graphical model provided that each node is able to receive and generate appropriate messages in the required form, whether or not the model remains conjugate-exponential. [sent-553, score-0.598]
97 One limitation of the current algorithm is that it uses a variational distribution which is factorised across nodes, giving an approximate posterior which is separable with respect to individual (scalar or vector) variables. [sent-555, score-0.649]
98 Winn (2003) and Bishop and Winn (2003) have therefore proposed an extended version of variational message passing which allows for structured variational distributions. [sent-558, score-1.364]
99 The network now corresponds to a two-dimensional Gaussian model and variational inference can be performed automatically by pressing the Start button (which also performs initialisation). [sent-598, score-0.543]
100 Performing variational inference on this separable model leads to each one-dimensional mixture having three retained mixture components and gives an improved bound of -876 nats. [sent-647, score-0.66]
wordName wordTfidf (topN-words)
[('variational', 0.418), ('message', 0.27), ('passing', 0.258), ('messages', 0.221), ('node', 0.217), ('vmp', 0.215), ('ln', 0.214), ('conjugacy', 0.181), ('vibes', 0.181), ('ux', 0.18), ('ishop', 0.153), ('child', 0.141), ('parent', 0.134), ('inn', 0.128), ('essage', 0.128), ('cpy', 0.126), ('uy', 0.126), ('xy', 0.101), ('paj', 0.099), ('factorised', 0.098), ('posterior', 0.098), ('gamma', 0.096), ('parents', 0.088), ('statistic', 0.087), ('plate', 0.083), ('winn', 0.081), ('pay', 0.079), ('cpk', 0.072), ('ua', 0.072), ('uyi', 0.072), ('nodes', 0.071), ('inference', 0.07), ('deterministic', 0.067), ('whilst', 0.067), ('mx', 0.063), ('propagation', 0.059), ('initialisation', 0.056), ('mixture', 0.056), ('chy', 0.054), ('mxn', 0.054), ('conditional', 0.053), ('precision', 0.052), ('gaussian', 0.05), ('conjugate', 0.048), ('expectations', 0.047), ('bayesian', 0.047), ('children', 0.046), ('exponential', 0.046), ('univariate', 0.045), ('pak', 0.045), ('updated', 0.045), ('eax', 0.045), ('qa', 0.045), ('optimisation', 0.045), ('latent', 0.044), ('kl', 0.044), ('bishop', 0.044), ('qx', 0.043), ('family', 0.042), ('distributions', 0.042), ('fx', 0.039), ('xn', 0.039), ('qy', 0.038), ('qi', 0.036), ('bugs', 0.036), ('hi', 0.036), ('distribution', 0.035), ('sent', 0.034), ('sigmoid', 0.034), ('graphical', 0.033), ('logistic', 0.032), ('moments', 0.032), ('bound', 0.032), ('def', 0.031), ('xk', 0.03), ('gx', 0.03), ('multivariate', 0.03), ('belief', 0.03), ('expectation', 0.03), ('pai', 0.03), ('dirichlet', 0.029), ('yk', 0.029), ('model', 0.028), ('analysers', 0.027), ('fk', 0.027), ('inclusive', 0.027), ('inspected', 0.027), ('pressing', 0.027), ('reparameterisation', 0.027), ('screenshot', 0.027), ('winbugs', 0.027), ('allowable', 0.027), ('software', 0.026), ('discrete', 0.026), ('tutorial', 0.025), ('hinton', 0.025), ('initialising', 0.024), ('diagram', 0.024), ('divergence', 0.023), ('shape', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 71 jmlr-2005-Variational Message Passing
Author: John Winn, Christopher M. Bishop
Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing
2 0.15587911 32 jmlr-2005-Expectation Consistent Approximate Inference
Author: Manfred Opper, Ole Winther
Abstract: We propose a novel framework for approximations to intractable probabilistic models which is based on a free energy formulation. The approximation can be understood as replacing an average over the original intractable distribution with a tractable one. It requires two tractable probability distributions which are made consistent on a set of moments and encode different features of the original intractable distribution. In this way we are able to use Gaussian approximations for models with discrete or bounded variables which allow us to include non-trivial correlations. These are neglected in many other methods. We test the framework on toy benchmark problems for binary variables on fully connected graphs and 2D grids and compare with other methods, such as loopy belief propagation. Good performance is already achieved by using single nodes as tractable substructures. Significant improvements are obtained when a spanning tree is used instead.
3 0.13498668 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
Author: Alexander T. Ihler, John W. Fisher III, Alan S. Willsky
Abstract: Belief propagation (BP) is an increasingly popular method of performing approximate inference on arbitrary graphical models. At times, even further approximations are required, whether due to quantization of the messages or model parameters, from other simplified message or model representations, or from stochastic approximation methods. The introduction of such errors into the BP message computations has the potential to affect the solution obtained adversely. We analyze the effect resulting from message approximation under two particular measures of error, and show bounds on the accumulation of errors in the system. This analysis leads to convergence conditions for traditional BP message passing, and both strict bounds and estimates of the resulting error in systems of approximate BP message passing. Keywords: belief propagation, sum-product, convergence, approximate inference, quantization
4 0.12006304 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
Author: Malte Kuss, Carl Edward Rasmussen
Abstract: Gaussian process priors can be used to define flexible, probabilistic classification models. Unfortunately exact Bayesian inference is analytically intractable and various approximation techniques have been proposed. In this work we review and compare Laplace’s method and Expectation Propagation for approximate Bayesian inference in the binary Gaussian process classification model. We present a comprehensive comparison of the approximations, their predictive performance and marginal likelihood estimates to results obtained by MCMC sampling. We explain theoretically and corroborate empirically the advantages of Expectation Propagation compared to Laplace’s method. Keywords: Gaussian process priors, probabilistic classification, Laplace’s approximation, expectation propagation, marginal likelihood, evidence, MCMC
5 0.10242669 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
Author: Dmitry Rusakov, Dan Geiger
Abstract: We develop a closed form asymptotic formula to compute the marginal likelihood of data given a naive Bayesian network model with two hidden states and binary features. This formula deviates from the standard BIC score. Our work provides a concrete example that the BIC score is generally incorrect for statistical models that belong to stratified exponential families. This claim stands in contrast to linear and curved exponential families, where the BIC score has been proven to provide a correct asymptotic approximation for the marginal likelihood. Keywords: Bayesian networks, asymptotic model selection, Bayesian information criterion (BIC)
7 0.094296552 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
8 0.092940681 36 jmlr-2005-Gaussian Processes for Ordinal Regression
9 0.082446575 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
10 0.07744509 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
11 0.075165272 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
12 0.073343962 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
13 0.072189696 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
14 0.069966637 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures
15 0.0643517 22 jmlr-2005-Concentration Bounds for Unigram Language Models
16 0.062636189 48 jmlr-2005-Learning the Kernel Function via Regularization
17 0.055700257 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
18 0.05451097 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
19 0.037954852 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
20 0.035589024 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
topicId topicWeight
[(0, 0.293), (1, -0.315), (2, -0.184), (3, -0.028), (4, -0.183), (5, 0.255), (6, -0.032), (7, 0.071), (8, -0.17), (9, -0.061), (10, 0.099), (11, -0.09), (12, -0.016), (13, -0.018), (14, -0.026), (15, -0.121), (16, -0.047), (17, -0.069), (18, -0.025), (19, 0.101), (20, 0.086), (21, -0.101), (22, 0.03), (23, 0.037), (24, -0.01), (25, 0.005), (26, 0.067), (27, -0.109), (28, 0.071), (29, -0.014), (30, -0.035), (31, -0.005), (32, -0.021), (33, 0.136), (34, 0.008), (35, -0.079), (36, 0.015), (37, 0.035), (38, -0.048), (39, 0.104), (40, 0.166), (41, 0.017), (42, 0.096), (43, -0.256), (44, 0.076), (45, -0.011), (46, 0.036), (47, 0.042), (48, -0.095), (49, -0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.96789992 71 jmlr-2005-Variational Message Passing
Author: John Winn, Christopher M. Bishop
Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing
2 0.5571698 32 jmlr-2005-Expectation Consistent Approximate Inference
Author: Manfred Opper, Ole Winther
Abstract: We propose a novel framework for approximations to intractable probabilistic models which is based on a free energy formulation. The approximation can be understood as replacing an average over the original intractable distribution with a tractable one. It requires two tractable probability distributions which are made consistent on a set of moments and encode different features of the original intractable distribution. In this way we are able to use Gaussian approximations for models with discrete or bounded variables which allow us to include non-trivial correlations. These are neglected in many other methods. We test the framework on toy benchmark problems for binary variables on fully connected graphs and 2D grids and compare with other methods, such as loopy belief propagation. Good performance is already achieved by using single nodes as tractable substructures. Significant improvements are obtained when a spanning tree is used instead.
3 0.44983676 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
Author: Alexander T. Ihler, John W. Fisher III, Alan S. Willsky
Abstract: Belief propagation (BP) is an increasingly popular method of performing approximate inference on arbitrary graphical models. At times, even further approximations are required, whether due to quantization of the messages or model parameters, from other simplified message or model representations, or from stochastic approximation methods. The introduction of such errors into the BP message computations has the potential to affect the solution obtained adversely. We analyze the effect resulting from message approximation under two particular measures of error, and show bounds on the accumulation of errors in the system. This analysis leads to convergence conditions for traditional BP message passing, and both strict bounds and estimates of the resulting error in systems of approximate BP message passing. Keywords: belief propagation, sum-product, convergence, approximate inference, quantization
4 0.34746289 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
Author: Dmitry Rusakov, Dan Geiger
Abstract: We develop a closed form asymptotic formula to compute the marginal likelihood of data given a naive Bayesian network model with two hidden states and binary features. This formula deviates from the standard BIC score. Our work provides a concrete example that the BIC score is generally incorrect for statistical models that belong to stratified exponential families. This claim stands in contrast to linear and curved exponential families, where the BIC score has been proven to provide a correct asymptotic approximation for the marginal likelihood. Keywords: Bayesian networks, asymptotic model selection, Bayesian information criterion (BIC)
5 0.34150219 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
Author: Robert G. Cowell
Abstract: This paper describes a scheme for local computation in conditional Gaussian Bayesian networks that combines the approach of Lauritzen and Jensen (2001) with some elements of Shachter and Kenley (1989). Message passing takes place on an elimination tree structure rather than the more compact (and usual) junction tree of cliques. This yields a local computation scheme in which all calculations involving the continuous variables are performed by manipulating univariate regressions, and hence matrix operations are avoided. Keywords: Bayesian networks, conditional Gaussian distributions, propagation algorithm, elimination tree
7 0.33042657 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
8 0.32872543 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
9 0.30978358 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures
10 0.29221484 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
11 0.28075534 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
12 0.26206428 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
13 0.23670524 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
14 0.23347737 36 jmlr-2005-Gaussian Processes for Ordinal Regression
15 0.22939606 22 jmlr-2005-Concentration Bounds for Unigram Language Models
16 0.21924227 48 jmlr-2005-Learning the Kernel Function via Regularization
17 0.16125163 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
18 0.15603586 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
19 0.15550712 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve
20 0.13775223 29 jmlr-2005-Efficient Margin Maximizing with Boosting
topicId topicWeight
[(13, 0.026), (15, 0.296), (17, 0.027), (19, 0.026), (36, 0.072), (37, 0.03), (39, 0.03), (43, 0.023), (47, 0.019), (52, 0.094), (59, 0.039), (70, 0.054), (80, 0.015), (88, 0.118), (90, 0.017), (94, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.76318604 71 jmlr-2005-Variational Message Passing
Author: John Winn, Christopher M. Bishop
Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing
2 0.48597485 36 jmlr-2005-Gaussian Processes for Ordinal Regression
Author: Wei Chu, Zoubin Ghahramani
Abstract: We present a probabilistic kernel approach to ordinal regression based on Gaussian processes. A threshold model that generalizes the probit function is used as the likelihood function for ordinal variables. Two inference techniques, based on the Laplace approximation and the expectation propagation algorithm respectively, are derived for hyperparameter learning and model selection. We compare these two Gaussian process approaches with a previous ordinal regression method based on support vector machines on some benchmark and real-world data sets, including applications of ordinal regression to collaborative filtering and gene expression analysis. Experimental results on these data sets verify the usefulness of our approach. Keywords: Gaussian processes, ordinal regression, approximate Bayesian inference, collaborative filtering, gene expression analysis, feature selection
3 0.48584744 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
4 0.479543 39 jmlr-2005-Information Bottleneck for Gaussian Variables
Author: Gal Chechik, Amir Globerson, Naftali Tishby, Yair Weiss
Abstract: The problem of extracting the relevant aspects of data was previously addressed through the information bottleneck (IB) method, through (soft) clustering one variable while preserving information about another - relevance - variable. The current work extends these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters, for the special case of multivariate Gaussian variables. While the general continuous IB problem is difficult to solve, we provide an analytic solution for the optimal representation and tradeoff between compression and relevance for the this important case. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized regression matrix Σx|y Σ−1 , which is also the basis obtained x in canonical correlation analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector, through a cascade of structural phase transitions. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides a complete analytic expression of the preserved information as a function of the compression (the “information-curve”), in terms of the eigenvalue spectrum of the data. As in the discrete case, the information curve is concave and smooth, though it is made of different analytic segments for each optimal dimension. Finally, we show how the algorithmic theory developed in the IB framework provides an iterative algorithm for obtaining the optimal Gaussian projections. Keywords: information bottleneck, Gaussian processes, dimensionality reduction, canonical correlation analysis
5 0.47894952 44 jmlr-2005-Learning Module Networks
Author: Eran Segal, Dana Pe'er, Aviv Regev, Daphne Koller, Nir Friedman
Abstract: Methods for learning Bayesian networks can discover dependency structure between observed variables. Although these methods are useful in many applications, they run into computational and statistical problems in domains that involve a large number of variables. In this paper,1 we consider a solution that is applicable when many variables have similar behavior. We introduce a new class of models, module networks, that explicitly partition the variables into modules, so that the variables in each module share the same parents in the network and the same conditional probability distribution. We define the semantics of module networks, and describe an algorithm that learns the modules’ composition and their dependency structure from data. Evaluation on real data in the domains of gene expression and the stock market shows that module networks generalize better than Bayesian networks, and that the learned module network structure reveals regularities that are obscured in learned Bayesian networks. 1. A preliminary version of this paper appeared in the Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, 2003 (UAI ’03). c 2005 Eran Segal, Dana Pe’er, Aviv Regev, Daphne Koller and Nir Friedman. S EGAL , P E ’ ER , R EGEV, KOLLER AND F RIEDMAN
6 0.47484615 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
7 0.47059771 64 jmlr-2005-Semigroup Kernels on Measures
8 0.47017059 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
9 0.47012591 3 jmlr-2005-A Classification Framework for Anomaly Detection
10 0.46874309 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
11 0.46791402 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
12 0.46569982 48 jmlr-2005-Learning the Kernel Function via Regularization
13 0.4651472 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
14 0.4628965 11 jmlr-2005-Algorithmic Stability and Meta-Learning
15 0.4628686 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
16 0.46196353 20 jmlr-2005-Clustering with Bregman Divergences
17 0.45476058 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
18 0.45369872 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
19 0.45039561 54 jmlr-2005-Managing Diversity in Regression Ensembles
20 0.44950828 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial