nips nips2002 nips2002-204 knowledge-graph by maker-knowledge-mining

204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks


Source: pdf

Author: Christopher M. Bishop, David Spiegelhalter, John Winn

Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. [sent-19, score-0.491]

2 For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. [sent-20, score-0.406]

3 In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. [sent-22, score-0.214]

4 New models are specified either through a simple script or via a graphical interface analogous to a drawing package. [sent-23, score-0.159]

5 VIBES then automatically generates and solves the variational equations. [sent-24, score-0.358]

6 We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. [sent-25, score-0.04]

7 1 Introduction Variational methods [1, 2] have been used successfully for a wide range of models, and new applications are constantly being explored. [sent-26, score-0.054]

8 In many ways the variational framework can be seen as a complementary approach to that of Markov chain Monte Carlo (MCMC), with different strengths and weaknesses. [sent-27, score-0.358]

9 For many years there has existed a powerful tool for tackling new problems using MCMC, called BUGS (‘Bayesian inference Using Gibbs Sampling’) [3]. [sent-28, score-0.08]

10 In BUGS a new probabilistic model, expressed as a directed acyclic graph, can be encoded using a simple scripting notation, and then samples can be drawn from the posterior distribution (given some data set of observed values) using Gibbs sampling in a way that is largely automatic. [sent-29, score-0.189]

11 Furthermore, an extension called WinBUGS provides a graphical front end to BUGS in which the user draws a pictorial representation of the directed graph, and this automatically generates the required script. [sent-30, score-0.139]

12 We have been inspired by the success of BUGS to produce an analogous tool for the solution of problems using variational methods. [sent-31, score-0.391]

13 The challenge is to build a system that can handle a wide range of graph structures, a broad variety of common conditional probability distributions at the nodes, and a range of variational approximating distributions. [sent-32, score-0.634]

14 2 A General Framework for Variational Inference In this section we briefly review the variational framework, and then we characterise a large class of models for which the variational method can be implemented automatically. [sent-34, score-0.747]

15 We denote the set of all variables in the model by W = (V, X) where V are the visible (observed) variables and X are the hidden (latent) variables. [sent-35, score-0.134]

16 As with BUGS, we focus on models that are specified in terms of an acyclic directed graph (treatment of undirected graphical models is equally possible and is somewhat more straightforward). [sent-36, score-0.273]

17 The joint distribution P (V, X) is then expressed in terms of conditional distributions P (Wi |pai ) at each node i, where pai denotes the set of variables corresponding to the parents of node i, and Wi denotes the variable, or group of variables, associated with node i. [sent-37, score-1.105]

18 The joint distribution of all variables is then given by the product of the conditionals, $P(V,X) = \prod_i P(W_i|\mathrm{pa}_i)$. [sent-38, score-0.141]

19 Our goal is to find a variational distribution Q(X|V ) that approximates the true posterior distribution P (X|V ). [sent-39, score-0.477]

20 Here $\mathrm{KL}(Q \,\|\, P)$ is the Kullback-Leibler divergence between the variational approximation $Q(X|V)$ and the true posterior $P(X|V)$. [sent-41, score-0.413]

21 Since this satisfies $\mathrm{KL}(Q \,\|\, P) \ge 0$ it follows from (1) that the quantity $L(Q)$ forms a lower bound on $\ln P(V)$. [sent-42, score-0.219]
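
The equations (1) and (2) referenced in these sentences are not reproduced in this extract; the standard decomposition they correspond to, reconstructed here from the surrounding definitions rather than quoted from the paper, is:

```latex
\ln P(V) \;=\; L(Q) + \mathrm{KL}(Q \,\|\, P),
\qquad
L(Q) \;=\; \sum_{X} Q(X|V) \ln \frac{P(V,X)}{Q(X|V)},
\qquad
\mathrm{KL}(Q \,\|\, P) \;=\; -\sum_{X} Q(X|V) \ln \frac{P(X|V)}{Q(X|V)}.
```

Since the KL term is non-negative, $L(Q)$ is a lower bound on $\ln P(V)$, and it is this bound that the factorized optimization described below maximizes.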

22 We now choose some family of distributions to represent Q(X|V ) and then seek a member of that family that maximizes the lower bound L(Q). [sent-43, score-0.173]

23 If we allow Q(X|V ) to have complete flexibility then we see that the maximum of the lower bound occurs for Q(X|V ) = P (X|V ) so that the variational posterior distribution equals the true posterior. [sent-44, score-0.481]

24 In this case the Kullback-Leibler divergence vanishes and L(Q) = ln P (V ). [sent-45, score-0.183]

25 However, working with the true posterior distribution is computationally intractable (otherwise we wouldn’t be resorting to variational methods). [sent-46, score-0.445]

26 We must therefore consider a more restricted family of Q distributions which has the property that the lower bound (2) can be evaluated and optimized efficiently and yet which is still sufficiently flexible as to give a good approximation to the true posterior distribution. [sent-47, score-0.189]

27 1 Factorized Distributions For the purposes of building VIBES we have focussed attention initially on distributions that factorize with respect to disjoint groups $X_i$ of variables, $Q(X|V) = \prod_i Q_i(X_i)$. (4) [sent-49, score-0.126]

28 This approximation has been successfully used in many applications of variational methods [4, 5, 6]. [sent-50, score-0.358]

29 Substituting (4) into (2) we can maximize $L(Q)$ variationally with respect to $Q_i(X_i)$ keeping all $Q_j$ for $j \neq i$ fixed. [sent-51, score-0.048]

30 This leads to the solution $\ln Q_i(X_i) = \langle \ln P(V,X) \rangle_{\{j \neq i\}} + \text{const}$. (5) [sent-52, score-0.366]

31 Taking exponentials of both sides and normalizing we obtain $Q_i(X_i) = \dfrac{\exp\langle \ln P(V,X) \rangle_{\{j \neq i\}}}{\sum_{X_i} \exp\langle \ln P(V,X) \rangle_{\{j \neq i\}}}$. (6) [sent-54, score-0.183]

32 Note that these are coupled equations since the solution for each $Q_i(X_i)$ depends on expectations with respect to the other factors $Q_{j \neq i}$. [sent-55, score-0.296]

33 The variational optimization proceeds by initializing each of the Qi (Xi ) and then cycling through each factor in turn replacing the current distribution with a revised estimate given by (6). [sent-56, score-0.425]
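
A minimal sketch of this coordinate-ascent scheme for discrete hidden variables, assuming a hypothetical helper `expected_log_joint(i, q)` that returns $\langle \ln P(V,X) \rangle_{\{j \neq i\}}$ as a vector over the states of $X_i$; this illustrates equation (6) and is not the VIBES implementation:

```python
import numpy as np

def update_factor(i, q, expected_log_joint):
    """Revised estimate of Q_i via (6): exponentiate the expected log joint
    (taken w.r.t. all other factors) and normalize over the states of X_i."""
    log_q = np.asarray(expected_log_joint(i, q), dtype=float)  # <ln P(V,X)>_{j != i}
    log_q -= log_q.max()                                       # subtract max for stability
    q_i = np.exp(log_q)
    return q_i / q_i.sum()

def coordinate_ascent(q_init, expected_log_joint, n_iters=100):
    """Cycle through the factors, replacing each Q_i in turn; each replacement
    is the exact maximizer of L(Q) with the other factors held fixed, so the
    lower bound can never decrease."""
    q = [np.asarray(qi, dtype=float) for qi in q_init]
    for _ in range(n_iters):
        for i in range(len(q)):
            q[i] = update_factor(i, q, expected_log_joint)
    return q
```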

34 The current version of VIBES is based on a factorization of the form (4) in which each factor Qi (Xi ) corresponds to one of the nodes of the graph (each of which can be a composite node, as discussed shortly). [sent-57, score-0.205]

35 Thus the expectations that must be performed on the right hand side of (6) involve only those variables lying in the Markov blanket of node i, in other words the parents, children and co-parents of i, as illustrated in Figure 1(a). [sent-59, score-0.49]

36 This is a key concept in VIBES since it allows the variational update equations to be expressed in terms of local operations, which can therefore be implemented as generic code that is independent of the global structure of the graph. [sent-60, score-0.647]

37 Here we adopt a somewhat different viewpoint in that we make no distinction between latent variables and model parameters. [sent-63, score-0.091]

38 In a Bayesian setting these both correspond to unobserved stochastic variables and can be treated on an equal footing. [sent-64, score-0.067]

39 This allows us to consider conjugacy not just between variables and their parameters, but hierarchically between all parent-child pairs in the graph. [sent-65, score-0.192]

40 Thus we consider models in which each conditional distribution takes the standard exponential family form $\ln P(X_i|Y) = \phi_i(Y)^{\mathrm T} u_i(X_i) + f_i(X_i) + g_i(Y)$ (7) where the vector $\phi_i(Y)$ is called the natural parameter of the distribution. [sent-66, score-0.454]
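
As a concrete instance (a reconstruction consistent with the quantities quoted in sentences 54 and 56, not a quotation from the paper), a univariate Gaussian node $X$ with mean $\mu$ and precision $\tau$ fits the form (7) with:

```latex
\ln \mathcal{N}(X \mid \mu, \tau^{-1})
  \;=\;
  \underbrace{\begin{bmatrix} \tau\mu \\ -\tau/2 \end{bmatrix}^{\mathrm T}}_{\phi(\mu,\tau)}
  \underbrace{\begin{bmatrix} X \\ X^2 \end{bmatrix}}_{u(X)}
  \;+\; \underbrace{0}_{f(X)}
  \;+\; \underbrace{\tfrac{1}{2}\ln\tau - \tfrac{1}{2}\tau\mu^2 - \tfrac{1}{2}\ln 2\pi}_{g(\mu,\tau)} .
```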

41 Now consider a node $Z_j$ with parent $X_i$ and co-parents $\mathrm{cp}_j^{(i)}$, as indicated in Figure 1(a). [sent-67, score-0.534]

42 Figure 1: (a) A central observation is that the variational update equations for node $X_i$ depend only on expectations over variables appearing in the Markov blanket of $X_i$, namely the set of parents, children and co-parents. [sent-68, score-0.914]

43 As far as the pair of nodes $X_i$ and $Z_j$ are concerned, we can think of $P(X_i|Y)$ as a prior over $X_i$ and the conditional $P(Z_j|X_i, \mathrm{cp}_j^{(i)})$ as a (contribution to) the likelihood function. [sent-70, score-0.496]

44 Conjugacy requires that, as a function of Xi , the product of these two conditionals must take the same form as (7). [sent-71, score-0.042]

45 Since the conditional $P(Z_j|X_i, \mathrm{cp}_j^{(i)})$ is also in the exponential family it can be expressed as $\ln P(Z_j|X_i, \mathrm{cp}_j^{(i)}) = \phi_j(X_i, \mathrm{cp}_j^{(i)})^{\mathrm T} u_j(Z_j) + f_j(Z_j) + g_j(X_i, \mathrm{cp}_j^{(i)})$. (8) [sent-72, score-1.501]

46 Conjugacy then requires that this be expressible in the form $\ln P(Z_j|X_i, \mathrm{cp}_j^{(i)}) = \phi_{j \to i}(Z_j, \mathrm{cp}_j^{(i)})^{\mathrm T} u_i(X_i) + \lambda(Z_j, \mathrm{cp}_j^{(i)})$ (9) for some choice of functions $\phi$ and $\lambda$. [sent-73, score-1.039]

47 Since this must hold for each of the parents of $Z_j$ it follows that $\ln P(Z_j|X_i, \mathrm{cp}_j^{(i)})$ must be a multi-linear function of the $u_k(X_k)$ for each of the parents $X_k$ of node $Z_j$. [sent-74, score-0.875]

48 Also, we observe from (8) that the dependence of $\ln P(Z_j|X_i, \mathrm{cp}_j^{(i)})$ on $Z_j$ is again linear in the function $u_j(Z_j)$. [sent-75, score-0.501]

49 We can apply a similar argument to the conjugate relationship between node $X_i$ and each of its parents, showing that the contribution from the conditional $P(X_i|Y)$ can again be expressed in terms of expectations of the natural parameters for the parent node distributions. [sent-76, score-0.737]

50 Hence the right hand side of the variational update equation (5) for a particular node Xi will be a multi-linear function of the expectations u for each node in the Markov blanket of Xi . [sent-77, score-0.959]

51 The variational update equation then takes the form $\ln Q_i(X_i) = \Big[ \langle \phi_i(Y) \rangle_{Y} + \sum_{j=1}^{M} \langle \phi_{j \to i}(Z_j, \mathrm{cp}_j^{(i)}) \rangle_{Z_j,\,\mathrm{cp}_j^{(i)}} \Big]^{\mathrm T} u_i(X_i) + \text{const}$. (10) [sent-78, score-0.905]

52 This involves summation of bottom-up ‘messages’ $\langle \phi_{j \to i} \rangle_{Z_j,\,\mathrm{cp}_j^{(i)}}$ from the children together with a top-down message $\langle \phi_i(Y) \rangle_{Y}$ from the parents. [sent-79, score-0.058]

53 Since all of these messages are expressed in terms of the same basis ui (Xi ), we can write compact, generic code for updating any type of node, instead of having to take account explicitly of the many possible combinations of node types in each Markov blanket. [sent-80, score-0.389]
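
A sketch of what such generic node-update code might look like, assuming hypothetical node objects that expose `parent_message()` (the expectation $\langle \phi_i(Y) \rangle$ over the parent factors) and `message_to(node)` on each child (the expectation $\langle \phi_{j \to i} \rangle$); none of these names are taken from VIBES itself:

```python
def update_node(node):
    """Generic update following equation (10): the new natural parameter of
    Q_i is the top-down message from the parents plus the sum of the
    bottom-up messages from the children, all expressed in the u_i basis."""
    natural_param = node.parent_message()                        # <phi_i(Y)>_Y
    for child in node.children:
        natural_param = natural_param + child.message_to(node)   # <phi_{j->i}>
    node.q.set_natural_parameter(natural_param)                  # moments <u_i> then follow

def run_updates(hidden_nodes, n_sweeps=50):
    """Repeatedly sweep over the hidden nodes; observed nodes stay clamped to data."""
    for _ in range(n_sweeps):
        for node in hidden_nodes:
            update_node(node)
```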

54 For a Gaussian-distributed node $X$ with mean $\mu$ and precision $\tau$ we have $u(X) = [X, X^2]^{\mathrm T}$, so $\langle u \rangle = [\mu, \mu^2 + \tau^{-1}]^{\mathrm T}$, and the function $f_i(X_i)$ is simply zero in this case. [sent-83, score-0.039]

55 Conjugacy allows us to choose a distribution for the parent µ that is Gaussian and a prior for τ that is a Gamma distribution. [sent-84, score-0.131]

56 The corresponding natural parameterizations and update messages are given by $u_\mu = [\mu, \mu^2]^{\mathrm T}$, $\phi_{X \to \mu} = [\tau X, -\tau/2]^{\mathrm T}$, $u_\tau = [\tau, \ln \tau]^{\mathrm T}$, $\phi_{X \to \tau} = [-(X-\mu)^2/2, 1/2]^{\mathrm T}$. [sent-85, score-0.275]
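
As a concrete instance of these messages, the vectors quoted above can be evaluated from the current moments of the neighbouring factors; the function names and the convention of passing moments explicitly are illustrative assumptions, not the VIBES API:

```python
import numpy as np

def message_x_to_mu(E_x, E_tau):
    """phi_{X->mu} = [<tau><X>, -<tau>/2] in the basis u_mu = [mu, mu^2]."""
    return np.array([E_tau * E_x, -0.5 * E_tau])

def message_x_to_tau(E_x, E_x2, E_mu, E_mu2):
    """phi_{X->tau} = [-<(X-mu)^2>/2, 1/2] in the basis u_tau = [tau, ln tau].
    <(X-mu)^2> expands as <X^2> - 2<X><mu> + <mu^2> because X and mu lie in
    different factors of the variational posterior Q."""
    E_sq_diff = E_x2 - 2.0 * E_x * E_mu + E_mu2
    return np.array([-0.5 * E_sq_diff, 0.5])

def message_parents_to_x(E_mu, E_tau):
    """Top-down message: the expected natural parameter [<tau><mu>, -<tau>/2]
    of N(X | mu, 1/tau) in the basis u_X = [X, X^2]."""
    return np.array([E_tau * E_mu, -0.5 * E_tau])
```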

57 We can similarly consider multi-dimensional Gaussian distributions, with a Gaussian prior for the mean and a Wishart prior for the inverse covariance matrix. [sent-86, score-0.069]

58 A generalization of the Gaussian is the rectified Gaussian which is defined as P (X|µ, τ ) ∝ N (X|µ, τ ) for X ≥ 0 and P (X|µ, τ ) = 0 for X < 0, for which moments can be expressed in terms of the ‘erf’ function. [sent-87, score-0.089]

59 This rectification corresponds to the introduction of a step function, whose logarithm corresponds to fi (Xi ) in (7), which is carried through the variational update equations unchanged. [sent-88, score-0.486]

60 Another example is the discrete distribution for categorical variables. [sent-90, score-0.074]

61 This has distribution $P(S|\pi) = \prod_{k=1}^{K} \pi_k^{S_k}$ and we can place a conjugate Dirichlet distribution over the parameters $\{\pi_k\}$. [sent-95, score-0.13]
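
The resulting conjugate update is not spelled out in this extract; under the usual assumptions of i.i.d. 1-of-K coded observations $S_1, \dots, S_N$ and a $\mathrm{Dir}(\pi \mid a_1, \dots, a_K)$ prior, it takes the standard form:

```latex
Q(\pi) = \mathrm{Dir}\!\left(\pi \,\middle|\, a_1 + \sum_{n=1}^{N} \langle S_{n1} \rangle,\; \dots,\; a_K + \sum_{n=1}^{N} \langle S_{nK} \rangle \right),
\qquad
\langle \ln \pi_k \rangle = \psi\!\left(a_k + \sum_{n} \langle S_{nk} \rangle\right) - \psi\!\left(\sum_{j} a_j + N\right),
```

where $\psi$ is the digamma function; the expectations $\langle \ln \pi_k \rangle$ are exactly what the discrete children need for their own updates.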

62 3 Allowable Distributions We now characterize the class of models that can be solved by VIBES using the factorized variational distribution given by (4). [sent-97, score-0.454]

63 First of all we note that, since a Gaussian variable can have a Gaussian parent for its mean, we can extend this hierarchically to any number of levels to give a sub-graph which is a DAG of Gaussian nodes of arbitrary topology. [sent-98, score-0.263]

64 Next, we observe that discrete variables $S = \{S_k\}$ can be used to construct ‘pick’ functions which choose a particular parent node $Y$ from amongst several conjugate parents $\{Y_k\}$, so that $Y = Y_k$ when $S_k = 1$, which can be written $Y = \prod_{k=1}^{K} Y_k^{S_k}$. [sent-100, score-0.654]

65 Variational inference will therefore be tractable for this model provided it is tractable for each of the parents Yk individually. [sent-103, score-0.284]
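
One way to see this (a sketch, not a quotation from the paper): with 1-of-K coding the pick construction makes the child's log conditional a sum weighted by the $S_k$, so its expectation under the factorized $Q$ splits across the candidate parents:

```latex
\big\langle \ln P(X \mid \{Y_k\}, S) \big\rangle
  \;=\; \Big\langle \sum_{k=1}^{K} S_k \, \ln P(X \mid Y_k) \Big\rangle
  \;=\; \sum_{k=1}^{K} \langle S_k \rangle \, \big\langle \ln P(X \mid Y_k) \big\rangle ,
```

since $S$ and each $Y_k$ belong to different factors of $Q$; each term then involves only one parent $Y_k$ at a time.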

66 This graph represents a generalization of the Gaussian mixture model, and includes as special cases models such as hidden Markov models, Kalman filters, factor analysers and principal component analysers, as well as mixtures and hierarchical mixtures of all of these. [sent-105, score-0.173]

67 There are other classes of models that are tractable under this scheme, for example Poisson variables having Gamma priors, although these may be of limited interest. [sent-106, score-0.158]

68 We can further extend the class of tractable models by considering nodes whose natural parameters are formed from deterministic functions of the states of several parents. [sent-107, score-0.305]

69 Suppose we have some conditional distribution $P(X|Y, \ldots)$ [sent-109, score-0.09]

70 and we want to make $Y$ some deterministic function of the states of some other nodes, $\psi(Z_1, \ldots)$. [sent-112, score-0.067]

71 In effect we have a pseudo-parent that is a deterministic function of other nodes, and indeed is represented explicitly through additional deterministic nodes in the graphical interface both to WinBUGS and to VIBES. [sent-116, score-0.376]

72 This will be tractable under VIBES provided the expectation of uψ (ψ) can be expressed in terms of the expectations of the corresponding functions uj (Zj ) of the parents. [sent-117, score-0.269]

73 The pick functions discussed earlier are a special case of these deterministic functions. [sent-118, score-0.099]

74 Thus for a Gaussian node the mean can be formed from products and sums of the states of other Gaussian nodes provided the function is linear with respect to each of the nodes. [sent-119, score-0.367]

75 Finally, we have seen that continuous nodes can have both discrete and continuous parents but that discrete nodes can only have discrete parents. [sent-121, score-0.533]

76 We can allow discrete nodes to have continuous parents by stepping outside the conjugate-exponential framework by exploiting a variational bound on the logistic sigmoid function [1]. [sent-122, score-0.727]
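
The bound itself is not reproduced in this extract; the standard variational bound on the logistic sigmoid from the cited literature (stated here as a reconstruction) is:

```latex
\sigma(x) \;\ge\; \sigma(\xi)\, \exp\!\Big\{ \tfrac{1}{2}(x - \xi) - \lambda(\xi)\,\big(x^2 - \xi^2\big) \Big\},
\qquad
\lambda(\xi) = \frac{1}{2\xi}\Big(\sigma(\xi) - \tfrac{1}{2}\Big),
```

which is quadratic (hence Gaussian-conjugate) in $x$, with the variational parameter $\xi$ optimized alongside the $Q$ factors.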

77 We also wish to be able to evaluate the lower bound (2), both to confirm the correctness of the variational updates (since the value of the bound should never decrease), as well as to monitor convergence and set termination criteria. [sent-123, score-0.463]

78 This can be done efficiently, largely using quantities that have already been calculated during the variational updates. [sent-124, score-0.358]

79 The menu of distributions available to the user is dynamically adjusted at each stage to ensure that only valid conjugate models can be constructed. [sent-126, score-0.195]

80 As in WinBUGS we have adopted the convention of making logical (deterministic) nodes explicit in the graphical representation as this greatly simplifies the specification and interpretation of the model. [sent-127, score-0.216]

81 We also use the ‘plate’ notation of a box surrounding one or more nodes to denote that those nodes are replicated some number of times as specified by the parameter appearing in the bottom right hand corner of the box. [sent-128, score-0.338]

82 1 Example: Bayesian Mixture Models We illustrate VIBES using a Bayesian model for a mixture of M probabilistic PCA distributions, each having maximum intrinsic dimensionality of q, with a sparse prior [6], for which the VIBES implementation is shown in Figure 2. [sent-130, score-0.063]

83 The dimensionality of the other variables is also determined by which plates they are contained in (e. [sent-132, score-0.067]

84 Variables t, x, W and µ are Gaussian, τ and α have Gamma distributions, S is discrete and π is Dirichlet. [sent-135, score-0.042]

85 Once the model is completed (and the file or files containing the observed variables [sent-136, score-0.209]

Figure 2: Screen shot from VIBES showing the graph for a mixture of probabilistic PCA distributions.

86 The node t is coloured black to denote that this variable is observed, and the node ‘alpha’ has been highlighted and its properties (e. [sent-137, score-0.405]

87 the form of the distribution) can be changed using the menus on the left hand side. [sent-139, score-0.071]

88 W+mu’ is a deterministic node, and the double arrows denote deterministic relationships. [sent-141, score-0.138]

89 are specified) it is then ‘compiled’, which involves allocation of memory for the variables and initializing the distributions Qi (which is done using simple heuristics but which can also be over-ridden by the user). [sent-142, score-0.161]

90 If desired, monitoring of the lower bound (2) can be switched on (at the expense of slightly increased computation) and this can also be used to set a termination criterion. [sent-143, score-0.069]

91 Alternatively the variational optimization can be run for a fixed number of iterations. [sent-144, score-0.358]

92 This simply involved dragging the α node outside of the M plate using the mouse and then recompiling (since α is now a vector of length q instead of a matrix of size M × q). [sent-147, score-0.279]

93 This literally takes a few seconds, in contrast to the effort required to formulate the variational inference equations, and develop bespoke code, for a new model! [sent-148, score-0.405]

94 A screen shot of the corresponding VIBES model is shown in Figure 3. [sent-150, score-0.075]

95 4 Discussion Our early experiences with VIBES have shown that it dramatically simplifies the construction and testing of new variational models, and readily allows a range of alternative models to be evaluated on a given problem. [sent-151, score-0.412]

96 Currently we are extending VIBES to cater for a broader range of variational distributions by allowing the user to specify a Q distribution defined over a subgraph of the true graph [7]. [sent-152, score-0.594]

97 This causes there to be only q terms in α which are shared over the mixture components rather than M × q. [sent-154, score-0.062]

98 Note that, with no nodes highlighted, the side menus disappear. [sent-155, score-0.218]

99 For example, in order to broaden the range of models that can be tackled we can combine variational with other methods such as Gibbs sampling or optimization (empirical Bayes) to allow for non-conjugate hyper-priors for instance. [sent-157, score-0.412]

100 Similarly, there is scope for exploiting exact methods where there exist tractable sub-graphs. [sent-158, score-0.06]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vibes', 0.477), ('variational', 0.358), ('zj', 0.353), ('cpj', 0.27), ('xi', 0.207), ('node', 0.188), ('ln', 0.183), ('nodes', 0.145), ('bugs', 0.125), ('pai', 0.119), ('parents', 0.117), ('yk', 0.111), ('qi', 0.106), ('sk', 0.098), ('winbugs', 0.095), ('conjugacy', 0.083), ('parent', 0.076), ('expectations', 0.072), ('graphical', 0.071), ('gamma', 0.07), ('deterministic', 0.069), ('expressed', 0.067), ('variables', 0.067), ('conjugate', 0.066), ('plate', 0.062), ('tractable', 0.06), ('graph', 0.06), ('distributions', 0.059), ('children', 0.058), ('conditional', 0.058), ('wishart', 0.057), ('blanket', 0.057), ('dag', 0.053), ('bayesian', 0.051), ('gaussian', 0.051), ('update', 0.048), ('menus', 0.048), ('variationally', 0.048), ('uj', 0.048), ('inference', 0.047), ('ui', 0.046), ('messages', 0.044), ('xk', 0.044), ('discrete', 0.042), ('hierarchically', 0.042), ('conditionals', 0.042), ('analysers', 0.042), ('shot', 0.042), ('equations', 0.041), ('kl', 0.04), ('mixture', 0.04), ('fi', 0.039), ('user', 0.039), ('wi', 0.039), ('family', 0.039), ('pca', 0.038), ('bound', 0.036), ('initializing', 0.035), ('engine', 0.035), ('qj', 0.035), ('drawing', 0.035), ('sums', 0.034), ('factorized', 0.033), ('screen', 0.033), ('termination', 0.033), ('gibbs', 0.033), ('tool', 0.033), ('distribution', 0.032), ('posterior', 0.032), ('recti', 0.032), ('models', 0.031), ('wide', 0.031), ('markov', 0.031), ('pick', 0.03), ('directed', 0.029), ('acyclic', 0.029), ('highlighted', 0.029), ('outside', 0.029), ('priors', 0.026), ('mcmc', 0.026), ('christopher', 0.026), ('exponential', 0.026), ('appearing', 0.025), ('side', 0.025), ('latent', 0.024), ('jordan', 0.024), ('dirichlet', 0.024), ('range', 0.023), ('diagram', 0.023), ('kluwer', 0.023), ('prior', 0.023), ('hand', 0.023), ('similarly', 0.023), ('true', 0.023), ('terms', 0.022), ('variety', 0.022), ('interface', 0.022), ('le', 0.022), ('conditioned', 0.022), ('code', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

Author: Christopher M. Bishop, David Spiegelhalter, John Winn

Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1

2 0.22190994 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

Author: Peter Sykacek, Stephen J. Roberts

Abstract: We propose in this paper a probabilistic approach for adaptive inference of generalized nonlinear classification that combines the computational advantage of a parametric solution with the flexibility of sequential sampling techniques. We regard the parameters of the classifier as latent states in a first order Markov process and propose an algorithm which can be regarded as variational generalization of standard Kalman filtering. The variational Kalman filter is based on two novel lower bounds that enable us to use a non-degenerate distribution over the adaptation rate. An extensive empirical evaluation demonstrates that the proposed method is capable of infering competitive classifiers both in stationary and non-stationary environments. Although we focus on classification, the algorithm is easily extended to other generalized nonlinear models.

3 0.12033209 6 nips-2002-A Formulation for Minimax Probability Machine Regression

Author: Thomas Strohmann, Gregory Z. Grudic

Abstract: We formulate the regression problem as one of maximizing the minimum probability, symbolized by Ω, that future predicted outputs of the regression model will be within some ±ε bound of the true regression function. Our formulation is unique in that we obtain a direct estimate of this lower probability bound Ω. The proposed framework, minimax probability machine regression (MPMR), is based on the recently described minimax probability machine classification algorithm [Lanckriet et al.] and uses Mercer Kernels to obtain nonlinear regression models. MPMR is tested on both toy and real world data, verifying the accuracy of the Ω bound, and the efficacy of the regression models. 1

4 0.11534295 74 nips-2002-Dynamic Structure Super-Resolution

Author: Amos J. Storkey

Abstract: The problem of super-resolution involves generating feasible higher resolution images, which are pleasing to the eye and realistic, from a given low resolution image. This might be attempted by using simple filters for smoothing out the high resolution blocks or through applications where substantial prior information is used to imply the textures and shapes which will occur in the images. In this paper we describe an approach which lies between the two extremes. It is a generic unsupervised method which is usable in all domains, but goes beyond simple smoothing methods in what it achieves. We use a dynamic tree-like architecture to model the high resolution data. Approximate conditioning on the low resolution image is achieved through a mean field approach. 1

5 0.11060568 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization

Author: Harald Steck, Tommi S. Jaakkola

Abstract: A common objective in learning a model from data is to recover its network structure, while the model parameters are of minor interest. For example, we may wish to recover regulatory networks from high-throughput data sources. In this paper we examine how Bayesian regularization using a product of independent Dirichlet priors over the model parameters affects the learned model structure in a domain with discrete variables. We show that a small scale parameter - often interpreted as

6 0.09788426 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

7 0.095126905 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

8 0.093873225 124 nips-2002-Learning Graphical Models with Mercer Kernels

9 0.092500113 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition

10 0.090196185 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables

11 0.087770924 17 nips-2002-A Statistical Mechanics Approach to Approximate Analytical Bootstrap Averages

12 0.072884433 69 nips-2002-Discriminative Learning for Label Sequences via Boosting

13 0.070503026 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum

14 0.066822141 85 nips-2002-Fast Kernels for String and Tree Matching

15 0.063905545 114 nips-2002-Information Regularization with Partially Labeled Data

16 0.063128084 4 nips-2002-A Differential Semantics for Jointree Algorithms

17 0.060026195 110 nips-2002-Incremental Gaussian Processes

18 0.059030659 32 nips-2002-Approximate Inference and Protein-Folding

19 0.058513097 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

20 0.057924498 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.184), (1, -0.058), (2, -0.048), (3, 0.061), (4, -0.007), (5, 0.114), (6, -0.123), (7, 0.123), (8, -0.015), (9, -0.061), (10, 0.075), (11, -0.092), (12, 0.145), (13, -0.011), (14, -0.058), (15, -0.136), (16, 0.051), (17, 0.07), (18, -0.015), (19, 0.048), (20, 0.042), (21, 0.057), (22, -0.211), (23, -0.088), (24, -0.022), (25, 0.027), (26, -0.011), (27, 0.105), (28, -0.204), (29, -0.012), (30, -0.208), (31, -0.031), (32, 0.072), (33, 0.132), (34, -0.129), (35, -0.061), (36, 0.219), (37, -0.014), (38, 0.103), (39, -0.144), (40, 0.085), (41, -0.033), (42, 0.042), (43, -0.017), (44, -0.041), (45, -0.056), (46, -0.008), (47, 0.078), (48, 0.079), (49, 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96449965 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

Author: Christopher M. Bishop, David Spiegelhalter, John Winn

Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1

2 0.64296013 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

Author: Peter Sykacek, Stephen J. Roberts

Abstract: We propose in this paper a probabilistic approach for adaptive inference of generalized nonlinear classification that combines the computational advantage of a parametric solution with the flexibility of sequential sampling techniques. We regard the parameters of the classifier as latent states in a first order Markov process and propose an algorithm which can be regarded as variational generalization of standard Kalman filtering. The variational Kalman filter is based on two novel lower bounds that enable us to use a non-degenerate distribution over the adaptation rate. An extensive empirical evaluation demonstrates that the proposed method is capable of infering competitive classifiers both in stationary and non-stationary environments. Although we focus on classification, the algorithm is easily extended to other generalized nonlinear models.

3 0.50454503 6 nips-2002-A Formulation for Minimax Probability Machine Regression

Author: Thomas Strohmann, Gregory Z. Grudic

Abstract: We formulate the regression problem as one of maximizing the minimum probability, symbolized by Ω, that future predicted outputs of the regression model will be within some ±ε bound of the true regression function. Our formulation is unique in that we obtain a direct estimate of this lower probability bound Ω. The proposed framework, minimax probability machine regression (MPMR), is based on the recently described minimax probability machine classification algorithm [Lanckriet et al.] and uses Mercer Kernels to obtain nonlinear regression models. MPMR is tested on both toy and real world data, verifying the accuracy of the Ω bound, and the efficacy of the regression models. 1

4 0.50337142 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization

Author: Harald Steck, Tommi S. Jaakkola

Abstract: A common objective in learning a model from data is to recover its network structure, while the model parameters are of minor interest. For example, we may wish to recover regulatory networks from high-throughput data sources. In this paper we examine how Bayesian regularization using a product of independent Dirichlet priors over the model parameters affects the learned model structure in a domain with discrete variables. We show that a small scale parameter - often interpreted as

5 0.48735136 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition

Author: Shinji Watanabe, Yasuhiro Minami, Atsushi Nakamura, Naonori Ueda

Abstract: In this paper, we propose a Bayesian framework, which constructs shared-state triphone HMMs based on a variational Bayesian approach, and recognizes speech based on the Bayesian prediction classification; variational Bayesian estimation and clustering for speech recognition (VBEC). An appropriate model structure with high recognition performance can be found within a VBEC framework. Unlike conventional methods, including BIC or MDL criterion based on the maximum likelihood approach, the proposed model selection is valid in principle, even when there are insufficient amounts of data, because it does not use an asymptotic assumption. In isolated word recognition experiments, we show the advantage of VBEC over conventional methods, especially when dealing with small amounts of data.

6 0.45378855 150 nips-2002-Multiple Cause Vector Quantization

7 0.43441823 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

8 0.43134087 101 nips-2002-Handling Missing Data with Variational Bayesian Learning of ICA

9 0.42826203 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences

10 0.40026629 17 nips-2002-A Statistical Mechanics Approach to Approximate Analytical Bootstrap Averages

11 0.38221732 124 nips-2002-Learning Graphical Models with Mercer Kernels

12 0.35201204 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

13 0.34788477 110 nips-2002-Incremental Gaussian Processes

14 0.33566049 4 nips-2002-A Differential Semantics for Jointree Algorithms

15 0.32359138 32 nips-2002-Approximate Inference and Protein-Folding

16 0.31996387 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

17 0.31772876 194 nips-2002-The Decision List Machine

18 0.31548414 74 nips-2002-Dynamic Structure Super-Resolution

19 0.30571067 114 nips-2002-Information Regularization with Partially Labeled Data

20 0.29387084 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.025), (23, 0.018), (42, 0.087), (44, 0.011), (54, 0.102), (55, 0.033), (57, 0.018), (67, 0.019), (68, 0.055), (74, 0.12), (82, 0.205), (87, 0.021), (92, 0.084), (98, 0.115)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.85702509 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

Author: Christopher M. Bishop, David Spiegelhalter, John Winn

Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1

2 0.71312463 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray

Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTO BAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA. 1 Automatic Derivation of Statistical Algorithms Overview. We describe a symbolic program synthesis system which works as a “statistical algorithm compiler:” it compiles a statistical model specification into a custom algorithm design and from that further down into a working program implementing the algorithm design. This system, AUTO BAYES, can be loosely thought of as “part theorem prover, part Mathematica, part learning textbook, and part Numerical Recipes.” It provides much more flexibility than a fixed code repository such as a Matlab toolbox, and allows the creation of efficient algorithms which have never before been implemented, or even written down. AUTO BAYES is intended to automate the more routine application of complex methods in novel contexts. For example, recent multinomial extensions to PCA [2, 4] can be derived in this way. The algorithm design problem. Given a dataset and a task, creating a learning method can be characterized by two main questions: 1. What is the model? 2. What algorithm will optimize the model parameters? The statistical algorithm (i.e., a parameter optimization algorithm for the statistical model) can then be implemented manually. The system in this paper answers the algorithm question given that the user has chosen a model for the data,and continues through to implementation. Performing this task at the state-of-the-art level requires an intertwined meld of probability theory, computational mathematics, and software engineering. However, a number of factors unite to allow us to solve the algorithm design problem computationally: 1. The existence of fundamental building blocks (e.g., standardized probability distributions, standard optimization procedures, and generic data structures). 2. The existence of common representations (i.e., graphical models [3, 13] and program schemas). 3. The formalization of schema applicability constraints as guards. 1 The challenges of algorithm design. The design problem has an inherently combinatorial nature, since subparts of a function may be optimized recursively and in different ways. It also involves the use of new data structures or approximations to gain performance. 
As the research in statistical algorithms advances, its creative focus should move beyond the ultimately mechanical aspects and towards extending the abstract applicability of already existing schemas (algorithmic principles like EM), improving schemas in ways that generalize across anything they can be applied to, and inventing radically new schemas. 2 Combining Schema-based Synthesis and Bayesian Networks Statistical Models. Externally, AUTO BAYES has the look and feel of 2 const int n_points as ’nr. of data points’ a compiler. Users specify their model 3 with 0 < n_points; 4 const int n_classes := 3 as ’nr. classes’ of interest in a high-level specification 5 with 0 < n_classes language (as opposed to a program6 with n_classes << n_points; ming language). The figure shows the 7 double phi(1..n_classes) as ’weights’ specification of the mixture of Gaus8 with 1 = sum(I := 1..n_classes, phi(I)); 9 double mu(1..n_classes); sians example used throughout this 9 double sigma(1..n_classes); paper.2 Note the constraint that the 10 int c(1..n_points) as ’class labels’; sum of the class probabilities must 11 c ˜ disc(vec(I := 1..n_classes, phi(I))); equal one (line 8) along with others 12 data double x(1..n_points) as ’data’; (lines 3 and 5) that make optimization 13 x(I) ˜ gauss(mu(c(I)), sigma(c(I))); of the model well-defined. Also note 14 max pr(x| phi,mu,sigma ) wrt phi,mu,sigma ; the ability to specify assumptions of the kind in line 6, which may be used by some algorithms. The last line specifies the goal inference task: maximize the conditional probability pr with respect to the parameters , , and . Note that moving the parameters across to the left of the conditioning bar converts this from a maximum likelihood to a maximum a posteriori problem. 1 model mog as ’Mixture of Gaussians’; ¡   £  £  £ §¤¢ £ © ¨ ¦ ¥ ©   ¡     ¡ £ £ £ ¨ Computational logic and theorem proving. Internally, AUTO BAYES uses a class of techniques known as computational logic which has its roots in automated theorem proving. AUTO BAYES begins with an initial goal and a set of initial assertions, or axioms, and adds new assertions, or theorems, by repeated application of the axioms, until the goal is proven. In our context, the goal is given by the input model; the derived algorithms are side effects of constructive theorems proving the existence of algorithms for the goal. 1 Schema guards vary widely; for example, compare Nead-Melder simplex or simulated annealing (which require only function evaluation), conjugate gradient (which require both Jacobian and Hessian), EM and its variational extension [6] (which require a latent-variable structure model). 2 Here, keywords have been underlined and line numbers have been added for reference in the text. The as-keyword allows annotations to variables which end up in the generated code’s comments. Also, n classes has been set to three (line 4), while n points is left unspecified. The class variable and single data variable are vectors, which defines them as i.i.d. Computer algebra. The first core element which makes automatic algorithm derivation feasible is the fact that we can mechanize the required symbol manipulation, using computer algebra methods. General symbolic differentiation and expression simplification are capabilities fundamental to our approach. AUTO BAYES contains a computer algebra engine using term rewrite rules which are an efficient mechanism for substitution of equal quantities or expressions and thus well-suited for this task.3 Schema-based synthesis. 
The computational cost of full-blown theorem proving grinds simple tasks to a halt while elementary and intermediate facts are reinvented from scratch. To achieve the scale of deduction required by algorithm derivation, we thus follow a schema-based synthesis technique which breaks away from strict theorem proving. Instead, we formalize high-level domain knowledge, such as the general EM strategy, as schemas. A schema combines a generic code fragment with explicitly specified preconditions which describe the applicability of the code fragment. The second core element which makes automatic algorithm derivation feasible is the fact that we can use Bayesian networks to efficiently encode the preconditions of complex algorithms such as EM. First-order logic representation of Bayesian netNclasses works. A first-order logic representation of Bayesian µ σ networks was developed by Haddawy [7]. In this framework, random variables are represented by functor symbols and indexes (i.e., specific instances φ x c of i.i.d. vectors) are represented as functor arguments. discrete gauss Nclasses Since unknown index values can be represented by Npoints implicitly universally quantified Prolog variables, this approach allows a compact encoding of networks involving i.i.d. variables or plates [3]; the figure shows the initial network for our running example. Moreover, such networks correspond to backtrack-free datalog programs, allowing the dependencies to be efficiently computed. We have extended the framework to work with non-ground probability queries since we seek to determine probabilities over entire i.i.d. vectors and matrices. Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen’s framework which uses ancestral sets and set separation [9] and is more amenable to a theorem prover than the double negatives of the more widely known d-separation criteria. Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node: § ¥ ¨¦¡ ¡ ¢© Lemma 1. Let be sets of variables over a Bayesian network with . Then descendents and parents hold 4 in the corresponding dependency graph iff the following probability statement holds: £ ¤  ¡ parents B % % 9 C0A@ ! 9  @8 § ¥   ¢   2 ' % % 310  parents    ©¢   £ ¡ !    ' % #!  

3 0.70263767 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition

Author: Shinji Watanabe, Yasuhiro Minami, Atsushi Nakamura, Naonori Ueda

Abstract: In this paper, we propose a Bayesian framework, which constructs shared-state triphone HMMs based on a variational Bayesian approach, and recognizes speech based on the Bayesian prediction classification; variational Bayesian estimation and clustering for speech recognition (VBEC). An appropriate model structure with high recognition performance can be found within a VBEC framework. Unlike conventional methods, including BIC or MDL criterion based on the maximum likelihood approach, the proposed model selection is valid in principle, even when there are insufficient amounts of data, because it does not use an asymptotic assumption. In isolated word recognition experiments, we show the advantage of VBEC over conventional methods, especially when dealing with small amounts of data.

4 0.70158398 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

Author: Max Welling, Simon Osindero, Geoffrey E. Hinton

Abstract: We propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some filter outputs. We encourage the system to find sparse features by using a Studentt distribution to model each filter output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the “iterated Wiener filter” for the purpose of denoising images.

5 0.6959641 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

Author: Peter Meinicke, Thorsten Twellmann, Helge Ritter

Abstract: We propose a framework for classifier design based on discriminative densities for representation of the differences of the class-conditional distributions in a way that is optimal for classification. The densities are selected from a parametrized set by constrained maximization of some objective function which measures the average (bounded) difference, i.e. the contrast between discriminative densities. We show that maximization of the contrast is equivalent to minimization of an approximation of the Bayes risk. Therefore using suitable classes of probability density functions, the resulting maximum contrast classifiers (MCCs) can approximate the Bayes rule for the general multiclass case. In particular for a certain parametrization of the density functions we obtain MCCs which have the same functional form as the well-known Support Vector Machines (SVMs). We show that MCC-training in general requires some nonlinear optimization but under certain conditions the problem is concave and can be tackled by a single linear program. We indicate the close relation between SVM- and MCC-training and in particular we show that Linear Programming Machines can be viewed as an approximate realization of MCCs. In the experiments on benchmark data sets, the MCC shows a competitive classification performance.

6 0.69518352 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

7 0.69411635 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture

8 0.69356781 74 nips-2002-Dynamic Structure Super-Resolution

9 0.69356465 163 nips-2002-Prediction and Semantic Association

10 0.69278353 3 nips-2002-A Convergent Form of Approximate Policy Iteration

11 0.69231981 53 nips-2002-Clustering with the Fisher Score

12 0.69137049 135 nips-2002-Learning with Multiple Labels

13 0.68961167 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

14 0.68959504 27 nips-2002-An Impossibility Theorem for Clustering

15 0.68819976 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories

16 0.68793017 152 nips-2002-Nash Propagation for Loopy Graphical Games

17 0.68783844 10 nips-2002-A Model for Learning Variance Components of Natural Images

18 0.68776298 124 nips-2002-Learning Graphical Models with Mercer Kernels

19 0.68602699 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction

20 0.68422574 89 nips-2002-Feature Selection by Maximum Marginal Diversity