nips nips2012 nips2012-96 knowledge-graph by maker-knowledge-mining

96 nips-2012-Density Propagation and Improved Bounds on the Partition Function


Source: pdf

Author: Stefano Ermon, Ashish Sabharwal, Bart Selman, Carla P. Gomes

Abstract: Given a probabilistic graphical model, its density of states is a distribution that, for any likelihood value, gives the number of configurations with that probability. We introduce a novel message-passing algorithm called Density Propagation (DP) for estimating this distribution. We show that DP is exact for tree-structured graphical models and is, in general, a strict generalization of both sum-product and max-product algorithms. Further, we use density of states and tree decomposition to introduce a new family of upper and lower bounds on the partition function. For any tree decomposition, the new upper bound based on finer-grained density of state information is provably at least as tight as previously known bounds based on convexity of the log-partition function, and strictly stronger if a general condition holds. We conclude with empirical evidence of improvement over convex relaxations and mean-field based bounds. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Given a probabilistic graphical model, its density of states is a distribution that, for any likelihood value, gives the number of configurations with that probability. [sent-14, score-0.413]

2 Further, we use density of states and tree decomposition to introduce a new family of upper and lower bounds on the partition function. [sent-17, score-0.773]

3 For any tree decomposition, the new upper bound based on finer-grained density of state information is provably at least as tight as previously known bounds based on convexity of the log-partition function, and strictly stronger if a general condition holds. [sent-18, score-0.643]

4 1 Introduction Associated with any undirected graphical model [1] is the so-called density of states, a term borrowed from statistical physics indicating a distribution that, for any likelihood value, gives the number of configurations with that probability. [sent-20, score-0.325]

5 The density of states plays an important role in statistical physics because it provides a fine grained description of the system, and can be used to efficiently compute many properties of interests, such as the partition function and its parameterized version [2, 3]. [sent-21, score-0.459]

6 It can be seen that computing the density of states is computationally intractable in the worst case, since it subsumes a #-P complete problem (computing the partition function) and an NP-hard one (MAP inference). [sent-22, score-0.487]

7 All current approximate techniques estimating the density of states are based on sampling, the most prominent being the Wang-Landau algorithm [3] and its improved variants [2]. [sent-23, score-0.32]

8 Furthermore, they ignore the structure of the underlying graphical model, effectively treating the energy function (which is proportional to the negative log-likelihood of a configuration) as a black-box. [sent-26, score-0.281]

9 As a first step towards exploiting the structure of the graphical model when computing the density of states, we propose an algorithm called D ENSITY P ROPAGATION (DP). [sent-27, score-0.285]

10 The algorithm is based on dynamic programming and can be conveniently expressed in terms of message passing on the graphical model. [sent-28, score-0.272]

11 We show that D ENSITY P ROPAGATION computes the density of states exactly for any tree-structured graphical model. [sent-29, score-0.457]

12 However, it computes something much richer, namely the density of states, which contains information such as the partition function and variable marginals. [sent-31, score-0.335]

13 , trees) is important because one can often decompose a complex problem into tractable subproblems (such as spanning trees) [4], and the solutions to these simpler problems can be combined to recover useful properties of the original graphical model [5, 6]. [sent-36, score-0.25]

14 In this paper we show that by combining the additional information given by the density of states, we can obtain a new family of upper and lower bounds on the partition function. [sent-37, score-0.497]

15 We prove that the new upper bound is always at least as tight as the one based on the convexity of the log-partition function [4], and we provide a general condition where the new bound is strictly tighter. [sent-38, score-0.39]

16 Further, we illustrate empirically that the new upper bound improves upon the convexity-based one on Ising grid and clique models, and that the new lower bound is empirically slightly stronger than the one given by mean-field theory [4, 7]. [sent-39, score-0.384]

17 2 Problem definition and setup We consider a graphical model specified as a factor graph with N = |V | discrete random variables xi , i ∈ V where xi ∈ Xi . [sent-40, score-0.293]

18 Given an exponential parameter vector Θ = {Θα , α ∈ I}, the exponential family defined by φ is the family of probability distributions over X defined as follows: 1 1 exp(Θ · φ(x)) = exp p(x, Θ) = Θα φα ({x}α ) (2) Z(Θ) Z(Θ) α∈I where we assume p(x) = p(x, Θ∗ ). [sent-49, score-0.247]

19 Given an exponential family, we define the density of states [2] as the following distribution: δ (E − Θ · φ(x)) n(E, Θ) = (3) x∈X where δ (E − Θ · φ(x)) indicates a Dirac delta centered at Θ · φ(x). [sent-50, score-0.367]

20 We will refer to the quantity α∈I Θα φα ({x}α ) α∈I log ψα ({x}α ) as the energy of a configuration x, although it has an additional minus sign with respect to the conventional energy in statistical physics. [sent-52, score-0.376]

21 , by defining a uniform probability measure over satisfying assignments), it is clear that computing the density of states is computationally intractable in the worst case, as a generalization of an NP-Complete problem (satisfiability testing) and a #-P complete problem (model counting). [sent-55, score-0.388]

22 We show that the density of states can be computed efficiently1 for acyclic graphical models. [sent-56, score-0.413]

23 We provide a Dynamic Programming algorithm, which can also be interpreted as a message passing algorithm on the factor graph, called D ENSITY P ROPAGATION (DP), which computes the density of states exactly for acyclic graphical models. [sent-57, score-0.67]

24 1 Density propagation equations D ENSITY P ROPAGATION works by exchanging messages from variable to factor nodes and vice versa. [sent-60, score-0.255]

25 Unlike traditional message passing algorithms, where messages represent marginal probabilities (vectors of real numbers), for every xi ∈ Xi a D ENSITY P ROPAGATION message ma→i (xi ) is a distribution (a “marginal” density of states), i. [sent-61, score-0.655]

26 The message from variable node i to factor node a is updated as follows: mi→a (xi ) = mb→i (xi ) (4) b∈N (i)\a where is the convolution operator (commutative, associative and distributive). [sent-65, score-0.329]

27 The message from factor a to variable i is updated as follows:   ma→i (xi ) = {x}α\i  j∈N (a)\i mj→a (xj ) δEα ({x}α ) (5) where δEα ({x}α ) is a Dirac delta function centered at Eα (xα ) = log ψα ({x}α ). [sent-67, score-0.194]

28 For tree structured graphical models, D ENSITY P ROPAGATION converges after a finite number of iterations, independent of the initial condition, to the true density of states. [sent-68, score-0.406]

29 1 Complexity and Approximation with Energy Bins The most efficient message update schedule for tree structured models is a two-pass procedure where messages are first sent from the leaves to the root node, and then propagated backwards from the root to the leaves. [sent-74, score-0.404]

30 However, as with other message-passing algorithms, for tree structured instances the algorithm will converge with either a sequential or a parallel update schedule, with any initial condition for the messages. [sent-75, score-0.149]

31 Although DP requires the same number of messages updates as BP and MP, DP updates are more expensive because they require the computation of convolutions. [sent-76, score-0.191]

32 Specifically, each variable-to-factor update rule (4) requires (N − 2)L convolutions, where N is the number of neighbors of the variable node and L is the number of states in the random variable. [sent-77, score-0.17]

33 In the worst case, the density of states can have an exponential number of non-zero entries (i. [sent-80, score-0.399]

34 , the finite number of possible energy values, which we will also refer to as “buckets”), for instance when potentials are set to logarithms of prime numbers, making every x ∈ X have a different probability. [sent-82, score-0.188]

35 , SAT/CSP models and certain grounded Markov Logic Networks [9]), the number of energy “buckets” is limited, e. [sent-85, score-0.188]

36 For general graphical models, coarse-grain energy bins can be used, similar to the Wang-Landau algorithm [3], without losing much precision. [sent-88, score-0.325]

37 This also guarantees that the density of states with coarse-grain energy bins gives a constant factor approximation of the true partition function. [sent-90, score-0.685]

38 2 Relationship with sum and max product algorithms D ENSITY P ROPAGATION is closely related to traditional message passing algorithms such as BP (Belief Propagation, Sum-Product) and MP (Max-Product), since it is based on the same (conditional) independence assumptions. [sent-93, score-0.179]

39 With the same initial condition and message update schedule, at every iteration we can recover Belief Propagation and Max-Product marginals from D ENSITY P ROPAGATION messages. [sent-96, score-0.183]

40 Given a DP message mi→j (xj ) = k ck (i → j, xj )δEk (i→j,xj ) , the Max-Product algorithm corresponds to considering only the entry associated with the highest probability, i. [sent-98, score-0.237]

41 As other message passing algorithms, D ENSITY P ROPAGATION updates are well defined also for loopy graphical models, even though there is no guarantee of convergence or correctness [12]. [sent-105, score-0.39]

42 The correspondence with BP and MP (Theorem 2) however still holds: if loopy BP converges, then the corresponding quantities µi→j computed from DP messages will converge as well, and to the same value (assuming the same initial condition and update schedule). [sent-106, score-0.166]

43 4 Bounds on the density of states using tractable families Using techniques such as D ENSITY P ROPAGATION, we can compute the density of states exactly for tractable families such as tree-structured graphical models. [sent-113, score-0.899]

44 By computing the partition function or MAP estimates for the tree structured subproblems, Wainwright et al. [sent-119, score-0.22]

45 We present a way to exploit the decomposition idea to derive an upper bound on the density of states n(E, Θ∗ ) of the original intractable model, despite the fact that density of states is not a convex function of Θ∗ . [sent-121, score-0.907]

46 The result below gives a point-by-point upper bound which, to the best of our knowledge, is the first bound of this kind for density of states. [sent-122, score-0.47]

47 In the following, with some abuse of the notation, we denote n(E, Θ∗ ) = x∈X 1{Θ∗ ·φ(x)=E} the function giving the number of configurations with energy E (zero almost everywhere). [sent-123, score-0.188]

48 From the definition of density of states and using 1{} to denote the 0-1 indicator function, n(E, Θ∗ ) = 1{Θ∗ φ(x)=E} = x∈X 1{( i γi Θi )φ(x)=E} x∈X n−1 n = . [sent-133, score-0.32]

49 Bounds on the partition function using n-dimensional matching The density of states n(E, Θ∗ ) can be used to compute the partition function, since by definition Z(Θ∗ ) = ||n(E, Θ∗ ) exp(E)||1 . [sent-158, score-0.687]

50 We can therefore get an upper bound on Z(Θ∗ ) by integrating the point-by-point upper bound on n(E, Θ∗ ) from Theorem 3. [sent-159, score-0.356]

51 This bound can be tighter than the known bound [6] obtained by applying Jensen’s inequality to the log-partition function (which is convex), given by log Z(Θ∗ ) ≤ i γi log Z(Θi ). [sent-160, score-0.27]

52 For instance, consider a graphical model with weights that are large enough such that the density of states based sum defining Z(Θ∗ ) is dominated by the contribution of the highest-energy bucket. [sent-161, score-0.413]

53 In general, the latter bound will always be strictly better for large enough w unless the highest-energy bucket counts are identical across all Θi . [sent-165, score-0.154]

54 While this is already promising, we can, in fact, obtain a much tighter bound by taking into account the interactions between different energy levels across any parameter decomposition, e. [sent-166, score-0.358]

55 Then, min Z(Θ∗ , σ) ≤ Z(Θ∗ ) ≤ max Z(Θ∗ , σ) (6) n n σ∈Π σ∈Π 5 Algorithm 1 Greedy algorithm for the maximum matching (upper bound). [sent-173, score-0.197]

56 Then we have Z(Θ∗ ) = Z(Θ∗ , σI ), which proves the upper and lower bounds in equation (6). [sent-176, score-0.175]

57 We can think of σ ∈ Πn as an n-dimensional matching over the exponential size configuration space X . [sent-177, score-0.216]

58 If we define the weight of each hyper-edge in the matching graph as w(σ(x)) = i yi (σi (x))γi then Z(Θ∗ , σ) = x∈X w(σ(x)) corresponds to the weight of the matching represented by σ. [sent-179, score-0.492]

59 Intuitively, the maximum matching corresponds to the case where the configurations in the high energy buckets of the densities happen to be the same configuration (matching), so that their energies are summed up. [sent-181, score-0.596]

60 1 Upper bound The maximum matching maxσ Z(Θ∗ , σ) (i. [sent-183, score-0.297]

61 , the upper bound on the partition function) can be computed using Algorithm 1. [sent-185, score-0.277]

62 Algorithm 1 returns a distribution ub such that ub (E)dE = |X | and ub (E) exp(E)dE = maxσ Z(Θ∗ , σ). [sent-186, score-0.441]

63 Notice however that ub (E) is not a valid point-by-point upper bound on the density n(E, Θ∗ ) of the original mode. [sent-187, score-0.517]

64 Algorithm 1 computes the maximum matching and its runtime is bounded by the total number of non-empty buckets i |{E|n(E, Θi ) > 0}|. [sent-189, score-0.436]

65 Intuitively, this means that for n = 2 parameters it is always optimal to connect the highest energy configurations, therefore the greedy method is optimal. [sent-192, score-0.188]

66 The runtime is proportional to the total number of buckets because we remove one bucket from at least one density at every iteration. [sent-194, score-0.441]

67 A key property of Algorithm 1 is that even though it defines a matching over an exponential number of configurations |X |, its runtime proportional only to the total number of buckets, because it matches configurations in groups at the bucket level. [sent-195, score-0.318]

68 The following result shows that the value of the maximum matching is at least as tight as the bound provided by the convexity of the log-partition function, which is used for example by Tree Reweighted Belief Propagation (TRWBP) [6]. [sent-196, score-0.381]

69 For any parameter decomposition i=1 γi Θi = Θ∗ , the upper bound given by the maximum matching in equation (6) and computed using Algorithm 1 is always at least as tight as the bound obtained using the convexity of the log-partition function. [sent-198, score-0.612]

70 The bound obtained by applying Jensen’s inequality to the log-partition function (which is convex), given by log Z(Θ∗ ) ≤ i γi log Z(Θi ) [6], leads to the following geometric average γ bound Z(Θ∗ ) ≤ i ( x yi (x)) i . [sent-200, score-0.361]

71 6 Algorithm 2 Greedy algorithm for the minimum matching with n = 2 parameters (lower bound). [sent-202, score-0.212]

72 2 Lower bound We also provide Algorithm 2 to compute the minimum matching when there are n = 2 parameters. [sent-204, score-0.312]

73 For n = 2, Algorithm 2 computes the minimum matching and its runtime is bounded by the total number of non-empty buckets i |{E|n(E, Θi ) > 0}|. [sent-207, score-0.451]

74 For the minimum matching case, the induction argument does not apply and the result does not extend to the case n > 2. [sent-208, score-0.212]

75 For that case, we can obtain a weaker lower bound by applying Reverse Generalized Holder inequality [15], obtaining from a different perspective a bound previously 1 derived in [16]. [sent-209, score-0.257]

76 We then have min Z(Θ∗ , σ) = yi (σmin,i (x))γi = || σ x i yi (σmin,i (x))γi ||1 ≥ 1 si γi ||yi (σmin,i (x)) ||si = i yi (σmin,i (x)) i (7) i x s i γi 1 si = yi (x) i s i γi x Notice this result cannot be applied if yi (x) = 0, i. [sent-211, score-0.598]

77 Figure 1 shows a simple 2 × 2 grid Ising Z(Θ) exp model with exponential parameter Θ∗ = [0, 0, 0, 0, 1, 1, 1, 1] (Θs = 0 and Θij = 1) decomposed as the convex sum of two parameters Θ1 and Θ2 corresponding to tractable distributions, i. [sent-215, score-0.222]

78 In panels 1(d) and 1(e) we report the corresponding density of states n(E, Θ1 ) and n(E, Θ2 ) as histograms. [sent-220, score-0.32]

79 For instance, for the model corresponding to Θ2 there are only two global configurations (all variables positive and all negative) that give an energy of 6. [sent-221, score-0.188]

80 In panels 1(f) and 1(c) we show ub and lb computed using Algorithms 1 and 2, i. [sent-227, score-0.205]

81 the solutions to the maximum and minimum matching problems, respectively. [sent-229, score-0.24]

82 For instance, for the maximum matching case the 2 configurations with energy 6 from n(E, Θ1 ) are matched with 2 of the 8 with energy 2 from n(E, Θ2 ), giving an energy 6/2 + 2/2 = 4. [sent-230, score-0.761]

83 Notice that ub and lb are not valid bounds on individual densities of states themselves, but they nonetheless provide upper and lower bounds on the partition function as shown in the figure: ≈ 248. [sent-231, score-0.71]

84 In this case, the additional information provided by the density leads to tighter upper and lower bounds on the partition function. [sent-237, score-0.51]

85 For each model, we compute the upper bound given by TRWBP (with edge appearance probabilities µe based on a 7 v1 2 v2 v1 2 2 v4 v3 6 6 6 8 4 0 2 0 2 2 Energy 4 (d) Histogram n(E, Θ1 ) 2 2 0 1 8 2 3 4 12 12 10 8 6 4 2 2 Energy Energy (c) Zub = 2 + 6e + 6e + 2e4 . [sent-243, score-0.178]

86 subset of 10 randomly selected spanning trees) and the mean-field bound using the implementations in libDAI [17]. [sent-252, score-0.153]

87 We then compute the bound based on the maximum matching using the same set of spanning trees. [sent-253, score-0.35]

88 For the grid case, we also use a combination of 2 spanning trees and compute the corresponding lower bound based on the minimum matching (notice it is not possible to cover all the edges in a clique with only 2 spanning tree). [sent-254, score-0.562]

89 For each bound, we report the relative error, defined as (log(bound) − log(Z)) / log(Z), where Z is the true partition function, computed using the junction tree method. [sent-255, score-0.221]

90 In these experiments, both our upper and lower bounds improve over the ones obtained with TRWBP [6] and mean-field respectively. [sent-256, score-0.175]

91 The lower bound based on minimum matching visually overlaps with the mean-field bound and is thus omitted from Figure 2. [sent-257, score-0.443]

92 By optimizing the parameters si in the inverse Holder bound (8) using numerical optimization (BFGS and BOBYQA [18]), we were always able to obtain a lower bound at least as good as the one given by mean field. [sent-260, score-0.265]

93 7 Conclusions We presented D ENSITY P ROPAGATION, a novel message passing algorithm for computing the density of states while exploiting the structure of the underlying graphical model. [sent-261, score-0.592]

94 We showed that D ENSITY P ROPAGATION computes the exact density for tree structured graphical models and is a generalization of both Belief Propagation and Max-Product algorithms. [sent-262, score-0.45]

95 We introduced a new family of bounds on the partition function based on n-dimensional matching and tree decomposition but without relying on convexity. [sent-263, score-0.513]

96 The additional information provided by the density of states leads, both theoretically and empirically, to tighter bounds than known convexity-based ones. [sent-264, score-0.43]

97 Efficient, multiple-range random walk algorithm to calculate the density of states. [sent-281, score-0.192]

98 Tree-reweighted belief propagation algorithms and approximate ML estimation via pseudo-moment matching. [sent-297, score-0.205]

99 MAP estimation, linear programming and belief propagation with convex free energies. [sent-318, score-0.205]

100 Loopy belief propagation for approximate inference: An empirical study. [sent-331, score-0.205]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ensity', 0.403), ('ropagation', 0.403), ('emax', 0.285), ('density', 0.192), ('energy', 0.188), ('matching', 0.169), ('gurations', 0.153), ('ub', 0.147), ('buckets', 0.147), ('dp', 0.136), ('states', 0.128), ('message', 0.126), ('propagation', 0.122), ('dyn', 0.119), ('emin', 0.119), ('xj', 0.111), ('yi', 0.106), ('bound', 0.1), ('partition', 0.099), ('messages', 0.099), ('bp', 0.097), ('tree', 0.095), ('ising', 0.094), ('graphical', 0.093), ('exp', 0.091), ('configurations', 0.084), ('holder', 0.084), ('belief', 0.083), ('mi', 0.081), ('mp', 0.08), ('upper', 0.078), ('trwbp', 0.071), ('bounds', 0.066), ('mb', 0.064), ('con', 0.062), ('xs', 0.059), ('xi', 0.059), ('distributive', 0.058), ('lb', 0.058), ('schedule', 0.058), ('bucket', 0.054), ('decomposition', 0.053), ('spanning', 0.053), ('passing', 0.053), ('notice', 0.052), ('convolution', 0.051), ('convexity', 0.051), ('mj', 0.048), ('graph', 0.048), ('tractable', 0.048), ('runtime', 0.048), ('bobyqa', 0.047), ('maxe', 0.047), ('exponential', 0.047), ('updates', 0.046), ('guration', 0.045), ('bins', 0.044), ('tighter', 0.044), ('computes', 0.044), ('eld', 0.044), ('minimum', 0.043), ('node', 0.042), ('convolutions', 0.042), ('ithaca', 0.042), ('libdai', 0.042), ('ek', 0.04), ('physics', 0.04), ('dirac', 0.039), ('clique', 0.039), ('ermon', 0.039), ('sabharwal', 0.039), ('loopy', 0.039), ('trees', 0.038), ('wainwright', 0.038), ('densities', 0.037), ('grid', 0.036), ('cornell', 0.036), ('gomes', 0.036), ('intractable', 0.036), ('families', 0.035), ('updated', 0.034), ('si', 0.034), ('factor', 0.034), ('correctness', 0.033), ('reweighted', 0.033), ('tight', 0.033), ('worst', 0.032), ('family', 0.031), ('lower', 0.031), ('geometric', 0.029), ('recover', 0.029), ('logic', 0.029), ('condition', 0.028), ('maximum', 0.028), ('subproblems', 0.027), ('summed', 0.027), ('junction', 0.027), ('inequality', 0.026), ('interactions', 0.026), ('structured', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

Author: Stefano Ermon, Ashish Sabharwal, Bart Selman, Carla P. Gomes

Abstract: Given a probabilistic graphical model, its density of states is a distribution that, for any likelihood value, gives the number of configurations with that probability. We introduce a novel message-passing algorithm called Density Propagation (DP) for estimating this distribution. We show that DP is exact for tree-structured graphical models and is, in general, a strict generalization of both sum-product and max-product algorithms. Further, we use density of states and tree decomposition to introduce a new family of upper and lower bounds on the partition function. For any tree decomposition, the new upper bound based on finer-grained density of state information is provably at least as tight as previously known bounds based on convexity of the log-partition function, and strictly stronger if a general condition holds. We conclude with empirical evidence of improvement over convex relaxations and mean-field based bounds. 1

2 0.12177127 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

Author: Anima Anandkumar, Ragupathyraj Valluvan

Abstract: Graphical model selection refers to the problem of estimating the unknown graph structure given observations at the nodes in the model. We consider a challenging instance of this problem when some of the nodes are latent or hidden. We characterize conditions for tractable graph estimation and develop efficient methods with provable guarantees. We consider the class of Ising models Markov on locally tree-like graphs, which are in the regime of correlation decay. We propose an efficient method for graph estimation, and establish its structural consistency −δη(η+1)−2 when the number of samples n scales as n = Ω(θmin log p), where θmin is the minimum edge potential, δ is the depth (i.e., distance from a hidden node to the nearest observed nodes), and η is a parameter which depends on the minimum and maximum node and edge potentials in the Ising model. The proposed method is practical to implement and provides flexibility to control the number of latent variables and the cycle lengths in the output graph. We also present necessary conditions for graph estimation by any method and show that our method nearly matches the lower bound on sample requirements. Keywords: Graphical model selection, latent variables, quartet methods, locally tree-like graphs. 1

3 0.12095097 213 nips-2012-Minimization of Continuous Bethe Approximations: A Positive Variation

Author: Jason Pacheco, Erik B. Sudderth

Abstract: We develop convergent minimization algorithms for Bethe variational approximations which explicitly constrain marginal estimates to families of valid distributions. While existing message passing algorithms define fixed point iterations corresponding to stationary points of the Bethe free energy, their greedy dynamics do not distinguish between local minima and maxima, and can fail to converge. For continuous estimation problems, this instability is linked to the creation of invalid marginal estimates, such as Gaussians with negative variance. Conversely, our approach leverages multiplier methods with well-understood convergence properties, and uses bound projection methods to ensure that marginal approximations are valid at all iterations. We derive general algorithms for discrete and Gaussian pairwise Markov random fields, showing improvements over standard loopy belief propagation. We also apply our method to a hybrid model with both discrete and continuous variables, showing improvements over expectation propagation. 1

4 0.11872713 206 nips-2012-Majorization for CRFs and Latent Likelihoods

Author: Tony Jebara, Anna Choromanska

Abstract: The partition function plays a key role in probabilistic modeling including conditional random fields, graphical models, and maximum likelihood estimation. To optimize partition functions, this article introduces a quadratic variational upper bound. This inequality facilitates majorization methods: optimization of complicated functions through the iterative solution of simpler sub-problems. Such bounds remain efficient to compute even when the partition function involves a graphical model (with small tree-width) or in latent likelihood settings. For large-scale problems, low-rank versions of the bound are provided and outperform LBFGS as well as first-order methods. Several learning applications are shown and reduce to fast and convergent update rules. Experimental results show advantages over state-of-the-art optimization methods. This supplement presents additional details in support of the full article. These include the application of the majorization method to maximum entropy problems. It also contains proofs of the various theorems, in particular, a guarantee that the bound majorizes the partition function. In addition, a proof is provided guaranteeing convergence on (non-latent) maximum conditional likelihood problems. The supplement also contains supporting lemmas that show the bound remains applicable in constrained optimization problems. The supplement then proves the soundness of the junction tree implementation of the bound for graphical models with large n. It also proves the soundness of the low-rank implementation of the bound for problems with large d. Finally, the supplement contains additional experiments and figures to provide further empirical support for the majorization methodology. Supplement for Section 2 Proof of Theorem 1 Rewrite the partition function as a sum over the integer index j = 1, . . . , n under the random ordering π : Ω → {1, . . . , n}. This defines j∑ π(y) and associates h and f with = n hj = h(π −1 (j)) and fj = f (π −1 (j)). Next, write Z(θ) = j=1 αj exp(λ⊤ fj ) by introducing ˜ ˜ λ = θ − θ and αj = hj exp(θ ⊤ fj ). Define the partition function over only the first i components ∑i as Zi (θ) = j=1 αj exp(λ⊤ fj ). When i = 0, a trivial quadratic upper bound holds ( ) Z0 (θ) ≤ z0 exp 1 λ⊤ Σ0 λ + λ⊤ µ0 2 with the parameters z0 → 0+ , µ0 = 0, and Σ0 = z0 I. Next, add one term to the current partition function Z1 (θ) = Z0 (θ) + α1 exp(λ⊤ f1 ). Apply the current bound Z0 (θ) to obtain 1 Z1 (θ) ≤ z0 exp( 2 λ⊤ Σ0 λ + λ⊤ µ0 ) + α1 exp(λ⊤ f1 ). Consider the following change of variables u γ 1/2 −1/2 = Σ0 λ − Σ0 = α1 z0 exp( 1 (f1 2 (f1 − µ0 )) − µ0 )⊤ Σ−1 (f1 − µ0 )) 0 and rewrite the logarithm of the bound as log Z1 (θ) ( ) 1 ≤ log z0 − 1 (f1 − µ0 )⊤ Σ−1 (f1 − µ0 ) + λ⊤ f1 + log exp( 2 ∥u∥2 ) + γ . 0 2 Apply Lemma 1 (cf. [32] p. 100) to the last term to get ( (1 ) ) log Z1 (θ) ≤ log z0 − 1 (f1 − µ0 )⊤ Σ−1 (f1 − µ0 ) + λ⊤ f1 + log exp 2 ∥v∥2 +γ 0 2 ( ) v⊤ (u − v) 1 + + (u − v)⊤ I + Γvv⊤ (u − v) 1+γ exp(− 1 ∥v∥2 ) 2 2 10 where Γ = 1 1 tanh( 2 log(γ exp(− 2 ∥v∥2 ))) . 1 2 log(γ exp(− 2 ∥v∥2 )) The bound in [32] is tight when u = v. To achieve tightness −1/2 ˜ when θ = θ or, equivalently, λ = 0, we choose v = Σ0 (µ0 − f1 ) yielding ( ) Z1 (θ) ≤ z1 exp 1 λ⊤ Σ1 λ + λ⊤ µ1 2 where we have z1 = z0 + α1 z0 α1 = µ0 + f1 z0 + α1 z0 + α1 1 tanh( 2 log(α1 /z0 )) = Σ0 + (µ0 − f1 )(µ0 − f1 )⊤ . 2 log(α1 /z0 ) µ1 Σ1 This rule updates the bound parameters z0 , µ0 , Σ0 to incorporate an extra term in the sum over i in Z(θ). The process is iterated n times (replacing 1 with i and 0 with i − 1) to produce an overall bound on all terms. Lemma 1 (See [32] p. 100) ( ( ) ) For all u ∈ Rd , any v ∈ Rd and any γ ≥ 0, the bound log exp 1 ∥u∥2 + γ ≤ 2 ( ( ) ) log exp 1 ∥v∥2 + γ + 2 ( ) v⊤ (u − v) 1 + (u − v)⊤ I + Γvv⊤ (u − v) 1 1 + γ exp(− 2 ∥v∥2 ) 2 holds when the scalar term Γ = 1 tanh( 2 log(γ exp(−∥v∥2 /2))) . 2 log(γ exp(−∥v∥2 /2)) Equality is achieved when u = v. Proof of Lemma 1 The proof is provided in [32]. Supplement for Section 3 Maximum entropy problem We show here that partition functions arise naturally in maximum ∑ p(y) entropy estimation or minimum relative entropy RE(p∥h) = y p(y) log h(y) estimation. Consider the following problem: ∑ ∑ min RE(p∥h) s.t. p(y)f (y) = 0, p(y)g(y) ≥ 0. p y y d′ Here, assume that f : Ω → R and g : Ω → R are arbitrary (non-constant) vector-valued functions ( ) over the sample space. The solution distribution p(y) = h(y) exp θ ⊤ f (y) + ϑ⊤ g(y) /Z(θ, ϑ) is recovered by the dual optimization ∑ ( ) arg θ, ϑ = max − log h(y) exp θ ⊤ f (y) + ϑ⊤ g(y) d ϑ≥0,θ y ′ where θ ∈ Rd and ϑ ∈ Rd . These are obtained by minimizing Z(θ, ϑ) or equivalently by maximizing its negative logarithm. Algorithm 1 permits variational maximization of the dual via the quadratic program min 1 (β ϑ≥0,θ 2 ˜ ˜ − β)⊤ Σ(β − β) + β ⊤ µ ′ where β ⊤ = [θ ⊤ ϑ⊤ ]. Note that any general convex hull of constraints β ∈ Λ ⊆ Rd+d could be imposed without loss of generality. Proof of Theorem 2 We begin by proving a lemma that will be useful later. Lemma 2 If κΨ ≽ Φ ≻ 0 for Φ, Ψ ∈ Rd×d , then 1 ˜ ˜ ˜ L(θ) = − 2 (θ − θ)⊤ Φ(θ − θ) − (θ − θ)⊤ µ ˜ ˜ ˜ U (θ) = − 1 (θ − θ)⊤ Ψ(θ − θ) − (θ − θ)⊤ µ 2 satisfy supθ∈Λ L(θ) ≥ 1 κ ˜ supθ∈Λ U (θ) for any convex Λ ⊆ Rd , θ ∈ Λ, µ ∈ Rd and κ ∈ R+ . 11 Proof of Lemma 2 Define the primal problems of interest as PL = supθ∈Λ L(θ) and PU = supθ∈Λ U (θ). The constraints θ ∈ Λ can be summarized by a set of linear inequalities Aθ ≤ b where A ∈ Rk×d and b ∈ Rk for some (possibly infinite) k ∈ Z. Apply the change of variables ˜ ˜ ˜ ˜ ˜ ˜ z = θ− θ. The constraint A(z+ θ) ≤ b simplifies into Az ≤ b where b = b−Aθ. Since θ ∈ Λ, it 1 ⊤ ˜ ≥ 0. We obtain the equivalent primal problems PL = sup is easy to show that b ˜ − z Φz − Az≤b z⊤ µ and PU = supAz≤b − 1 z⊤ Ψz − z⊤ µ. The corresponding dual problems are ˜ 2 2 ⊤ −1 y⊤AΦ−1A⊤y ˜ µ Φ µ +y⊤AΦ−1µ+y⊤b+ y≥0 2 2 y⊤AΨ−1 A⊤y µ⊤Ψ−1µ ˜ DU = inf +y⊤AΨ−1µ+y⊤b+ . y≥0 2 2 DL = inf ˜ Due to strong duality, PL = DL and PU = DU . Apply the inequalities Φ ≼ κΨ and y⊤ b > 0 as ⊤ −1 y⊤AΨ−1 A⊤y y⊤AΨ−1 µ κ ˜ µΨ µ + + y⊤b + PL ≥ sup − z⊤ Ψz − z⊤ µ = inf y≥0 2 2κ κ 2κ ˜ Az≤b 1 1 ≥ DU = PU . κ κ 1 This proves that PL ≥ κ PU . We will use the above to prove Theorem 2. First, we will upper-bound (in the Loewner ordering sense) the matrices Σj in Algorithm 2. Since ∥fxj (y)∥2 ≤ r for all y ∈ Ωj and since µj in Algorithm 1 is a convex combination of fxj (y), the outer-product terms in the update for Σj satisfy (fxj (y) − µ)(fxj (y) − µ)⊤ ≼ 4r2 I. Thus, Σj ≼ F(α1 , . . . , αn )4r2 I holds where α 1 F(α1 , . . . , αn ) = i n ∑ tanh( 2 log( ∑i−1 αk )) k=1 αi 2 log( ∑i−1 α ) i=2 k k=1 using the definition of α1 , . . . , αn in the proof of Theorem 1. The formula for F starts at i = 2 since z0 → 0+ . Assume permutation π is sampled uniformly at random. The expected value of F is then α π(i) 1 n 1 ∑ ∑ tanh( 2 log( ∑i−1 απ(k) )) k=1 . Eπ [F(α1 , . . . , αn )] = απ(i) n! π i=2 ) 2 log( ∑i−1 k=1 απ(k) We claim that the expectation is maximized when all αi = 1 or any positive constant. Also, F is invariant under uniform scaling of its arguments. Write the expected value of F as E for short. ∂E ∂E ∂E Next, consider ∂αl at the setting αi = 1, ∀i. Due to the expectation over π, we have ∂αl = ∂αo for any l, o. Therefore, the gradient vector is constant when all αi = 1. Since F(α1 , . . . , αn ) is invariant to scaling, the gradient vector must therefore be the all zeros vector. Thus, the point ∂ ∂E when all αi = 1 is an extremum or a saddle. Next, consider ∂αo ∂αl for any l, o. At the setting 2 ∂ ∂E αi = 1, ∂ E = −c(n) and, ∂αo ∂αl = c(n)/(n − 1) for some non-negative constant function ∂α2 l c(n). Thus, the αi = 1 extremum is locally concave and is a maximum. This establishes that Eπ [F(α1 , . . . , αn )] ≤ Eπ [F(1, . . . , 1)] and yields the Loewner bound ) ( n−1 ∑ tanh(log(i)/2) 2 I = ωI. Σj ≼ 2r log(i) i=1 Apply this bound to each Σj in the lower bound on J(θ) and also note a corresponding upper bound ∑ ˜ ˜ ˜ J(θ) ≥ J(θ)−tω+tλ ∥θ − θ∥2− (θ − θ)⊤(µj −fxj (yj )) 2 j ∑ ˜ ˜ ˜ J(θ) ≤ J(θ)−tλ ∥θ − θ∥2− (θ − θ)⊤(µj −fxj (yj )) 2 j 12 ˜ which follows from Jensen’s inequality. Define the current θ at time τ as θτ and denote by Lτ (θ) the above lower bound and by Uτ (θ) the above upper bound at time τ . Clearly, Lτ (θ) ≤ J(θ) ≤ Uτ (θ) with equality when θ = θτ . Algorithm 2 maximizes J(θ) after initializing at θ0 and performing an update by maximizing a lower bound based on Σj . Since Lτ (θ) replaces the definition of Σj with ωI ≽ Σj , Lτ (θ) is a looser bound than the one used by Algorithm 2. Thus, performing θτ +1 = arg maxθ∈Λ Lτ (θ) makes less progress than a step of Algorithm 1. Consider computing the slower update at each iteration τ and returning θτ +1 = arg maxθ∈Λ Lτ (θ). Setting Φ = (tω +tλ)I, Ψ = tλI and κ = ω+λ allows us to apply Lemma 2 as follows λ sup Lτ (θ) − Lτ (θτ ) = θ∈Λ 1 sup Uτ (θ) − Uτ (θτ ). κ θ∈Λ Since Lτ (θτ ) = J(θτ ) = Uτ (θτ ), J(θτ +1 ) ≥ supθ∈Λ Lτ (θ) and supθ∈Λ Uτ (θ) ≥ J(θ ∗ ), we obtain ( ) 1 J(θτ +1 ) − J(θ ∗ ) ≥ 1− (J(θτ ) − J(θ ∗ )) . κ Iterate the above inequality starting at t = 0 to obtain ( )τ 1 ∗ J(θτ ) − J(θ ) ≥ 1− (J(θ0 ) − J(θ ∗ )) . κ ( ) 1 τ κ A solution within a multiplicative factor of ϵ implies that ϵ = 1 − κ or log(1/ϵ) = τ log κ−1 . ⌉ ⌈ Inserting the definition for κ shows that the number of iterations τ is at most log(1/ϵ) or κ log κ−1 ⌉ ⌈ log(1/ϵ) log(1+λ/ω) . Inserting the definition for ω gives the bound. Y12,0 Y11,1 Y21,1 Y31,1 ··· 1,1 Ym1,1 Figure 3: Junction tree of depth 2. Algorithm 5 SmallJunctionTree ˜ Input Parameters θ and h(u), f (u) ∀u ∈ Y12,0 and zi , Σi , µi ∀i = 1, . . . , m1,1 + Initialize z → 0 , µ = 0, Σ = zI For each configuration u ∈ Y12,0 { ∏m1,1 ∑m1,1 ∏m1,1 ˜ ˜ ˜ α = h(u)( ∑ zi exp(−θ ⊤ µi )) exp(θ ⊤ (f (u) + i=1 µi )) = h(u) exp(θ ⊤ f (u)) i=1 zi i=1 m1,1 l = f (u) + i=1 µi − µ 1 ∑m1,1 tanh( 2 log(α/z)) ⊤ Σ + = i=1 Σi + ll 2 log(α/z) α µ + = z+α l z += α } Output z, µ, Σ Supplement for Section 5 Proof of correctness for Algorithm 3 Consider a simple junction tree of depth 2 shown on Figure 3. The notation Yca,b refers to the cth tree node located at tree level a (first level is considered as the one with∑ leaves) whose parent is the bth from the higher tree level (the root has no parent so b = 0). tree ∑ Let Y a1 ,b1 refer to the sum over all configurations of variables in Yca1 ,b1 and Y a1 ,b1 \Y a2 ,b2 1 c1 c1 c2 refers to the sum over all configurations of variables that are in Yca1 ,b1 but not in Yca2 ,b2 . Let ma,b 1 2 denote the number of children of the bth node located at tree level a + 1. For short-hand, use ψ(Y ) = h(Y ) exp(θ ⊤ f (Y )). The partition function can be expressed as: 13 Y13,0 Y12,1 ··· Y11,1 Y21,1 ··· Y22,1 1,1 Ym1,1 ··· Y11,2 Y21,2 1,2 Ym1,2 2,1 Ym2,1 1,m2,1 Y1 ··· 1,m2,1 Y2 1,m 2,1 Ym1,m2,1 Figure 4: Junction tree of depth 3. Z(θ) = ∑ 2,0 u∈Y1 ≤ ∑ 2,0 u∈Y1 = ∑  ψ(u) ∏ m1,1 i=1 [ ∏ m1,1 ψ(u) i=1 [    ∑ ψ(v) 2,0 v∈Yi1,1 \Y1 ) 1( ˜ ˜ ˜ zi exp( θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi 2 ∏ ( m1,1 ⊤ h(u) exp(θ f (u)) 2,0 u∈Y1 zi exp i=1 ] 1 ˜ ˜ ˜ (θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi 2 )] ∑ where the upper-bound is obtained by applying Theorem 1 to each of the terms v∈Y 1,1 \Y 2,0 ψ(v). 1 i By simply rearranging terms we get: ) ( ( [ (m1,1 )) m1,1 ∏ ∑ ∑ ˜ zi exp(−θ ⊤ µi ) exp θ ⊤ f (u) + µi h(u) Z(θ) ≤ 2,0 u∈Y1 i=1 ( exp 1 ˜ (θ − θ)⊤ 2 (m1,1 ∑ ) Σi i=1 ˜ (θ − θ) )] . i=1 One ( can prove that this ) expression can be upper-bounded by 1 ˆ ⊤ Σ(θ − θ) + (θ − θ)⊤ µ where z, Σ and µ can be computed using Algoˆ ˆ z exp 2 (θ − θ) rithm 5 (a simplification of Algorithm 3). We will call this result Lemma A. The proof is similar to the proof of Theorem 1 so is not repeated here. Consider enlarging the tree to a depth 3 as shown on Figure 4. The partition function is now      m2,1 m1,i ∑  ∏ ∑ ∏ ∑   Z(θ) = ψ(w) . ψ(u)  ψ(v)  3,0 u∈Y1 i=1 3,0 v∈Yi2,1 \Y1 j=1 w∈Yj1,i \Yi2,1 ( )) ∏m1,i (∑ ∑ term By Lemma A we can upper bound each v∈Y 2,1 \Y 3,0 ψ(v) j=1 w∈Yj1,i \Yi2,1 ψ(w) 1 i ( ) ˆ ˆ ˆ by the expression zi exp 1 (θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi . This yields 2 [ )] ( m2,1 ∑ ∏ 1 ˜ ˜ ˜ Z(θ) ≤ ψ(u) (θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi . zi exp 2 3,0 i=1 u∈Y1 2,1 2,1 2,1 This process can be viewed as collapsing the sub-trees S1 , S2 , . . ., Sm2,1 to super-nodes that are represented by bound parameters, zi , Σi and µi , i = {1, 2, · · · , m2,1 }, where the sub-trees are 14 defined as: 2,1 S1 = 1,1 {Y12,1 , Y11,1 , Y21,1 , Y31,1 , . . . , Ym1,1 } 2,1 S2 = 1,2 {Y22,1 , Y11,2 , Y21,2 , Y31,2 , . . . , Ym1,2 } = 2,1 {Ym2,1 , Y1 . . . 2,1 Sm2,1 1,m2,1 1,m2,1 , Y2 1,m2,1 , Y3 1,m2,1 , . . . , Ym1,m2,1 }. Notice that the obtained expression can be further upper bounded using again Lemma A (induction) ( ) ˆ ˆ ˆ yielding a bound of the form: z exp 1 (θ − θ)⊤ Σ(θ − θ) + (θ − θ)⊤ µ . 2 Finally, for a general tree, follow the same steps described above, starting from leaves and collapsing nodes to super-nodes, each represented by bound parameters. This procedure effectively yields Algorithm 3 for the junction tree under consideration. Supplement for Section 6 Proof of correctness for Algorithm 4 We begin by proving a lemma that will be useful later. Lemma 3 For all x ∈ Rd and for all l ∈ Rd , 2  d d 2 ∑ ∑ l(i)  . x(i)2 l(i)2 ≥  x(i) √∑ d l(j)2 i=1 i=1 j=1 Proof of Lemma 3 By Jensen’s inequality, 2  ( d )2 d d ∑ x(i)l(i)2 ∑ ∑ x(i)l(i)2  . √∑ x(i)2 ∑d ⇐⇒ x(i)2 l(i)2 ≥  ≥ ∑d d l(j)2 l(j)2 l(j)2 j=1 j=1 i=1 i=1 i=1 i=1 d ∑ l(i)2 j=1 Now we prove the correctness of Algorithm 4. At the ith iteration, the algorithm stores Σi using ⊤ a low-rank representation Vi Si Vi + Di where Vi ∈ Rk×d is orthonormal, Si ∈ Rk×k positive d×d semi-definite and Di ∈ R is non-negative diagonal. The diagonal terms D are initialized to tλI where λ is the regularization term. To mimic Algorithm 1 we must increment the Σ matrix by a rank one update of the form Σi = Σi−1 + ri r⊤ . By projecting ri onto each eigenvector in V, we i ∑k ⊤ can decompose it as ri = j=1 Vi−1 (j, ·)ri Vi−1 (j, ·)⊤ + g = Vi−1 Vi−1 ri + g where g is the remaining residue. Thus the update rule can be rewritten as: Σi ⊤ ⊤ ⊤ = Σi−1 + ri r⊤ = Vi−1 Si−1 Vi−1 + Di−1 + (Vi−1 Vi−1 ri + g)(Vi−1 Vi−1 ri + g)⊤ i ′ ′ ′ ⊤ ⊤ ⊤ = Vi−1 (Si−1 + Vi−1 ri r⊤ Vi−1 )Vi−1 + Di−1 + gg⊤ = Vi−1 Si−1 Vi−1 + gg⊤ + Di−1 i ′ where we define Vi−1 = Qi−1 Vi−1 and defined Qi−1 in terms of the singular value decomposition, ′ ′ ⊤ Q⊤ Si−1 Qi−1 = svd(Si−1 + Vi−1 ri r⊤ Vi−1 ). Note that Si−1 is diagonal and nonnegative by i−1 i construction. The current formula for Σi shows that we have a rank (k + 1) system (plus diagonal term) which needs to be converted back to a rank k system (plus diagonal term) which we denote by ′ Σi . We have two options as follows. Case 1) Remove g from Σi to obtain ′ ′ ′ ′ ⊤ Σi = Vi−1 Si−1 Vi−1 + Di−1 = Σi − gg⊤ = Σi − cvv⊤ 1 where c = ∥g∥2 and v = ∥g∥ g. th Case 2) Remove the m (smallest) eigenvalue in S′ and its corresponding eigenvector: i−1 ′ Σi ′ ′ ′ ′ ′ ′ ⊤ = Vi−1 Si−1 Vi−1 + Di−1 + gg⊤ − S (m, m)V (m, ·)⊤ V (m, ·) = Σi − cvv⊤ ′ ′ where c = S (m, m) and v = V(m, ·) . 15 ′ Clearly, both cases can be written as an update of the form Σi = Σi + cvv⊤ where c ≥ 0 and v⊤ v = 1. We choose the case with smaller c value to minimize the change as we drop from a system of order (k + 1) to order k. Discarding the smallest singular value and its corresponding eigenvector would violate the bound. Instead, consider absorbing this term into the diagonal component to ′′ ′ preserve the bound. Formally, we look for a diagonal matrix F such that Σi = Σi + F which also ′′ maintains x⊤ Σi x ≥ x⊤ Σi x for all x ∈ Rd . Thus, we want to satisfy: ( d )2 d ∑ ∑ ⊤ ′′ ⊤ ⊤ ⊤ ⊤ x Σi x ≥ x Σi x ⇐⇒ x cvv x ≤ x Fx ⇐⇒ c x(i)v(i) ≤ x(i)2 F(i) i=1 i=1 where, for ease of notation, we take F(i) = F(i, i). ′ where w = v⊤ 1. Consider the case where v ≥ 0 though we will soon get rid of (∑ )2 ∑d d this assumption. We need an F such that i=1 x(i)2 F(i) ≥ c i=1 x(i)v(i) . Equivalently, we (∑ ) ∑d ′ 2 ′ d need i=1 x(i)2 F(i) ≥ . Define F(i) = F(i) for all i = 1, . . . , d. So, we need 2 i=1 x(i)v(i) cw cw2 (∑ ) ∑d ′ ′ ′ 2 d an F such that i=1 x(i)2 F(i) ≥ . Using Lemma 3 it is easy to show that we i=1 x(i)v(i) Define v = 1 wv ′ ′ ′ may choose F (i) = v(i) . Thus, we obtain F(i) = cw2 F(i) = cwv(i). Therefore, for all x ∈ Rd , ∑d all v ≥ 0, and for F(i) = cv(i) j=1 v(j) we have d ∑ ( x(i) F(i) ≥ c 2 i=1 d ∑ )2 x(i)v(i) . (3) i=1 To generalize the inequality to hold for all vectors v ∈ Rd with potentially negative entries, it is ∑d sufficient to set F(i) = c|v(i)| j=1 |v(j)|. To verify this, consider flipping the sign of any v(i). The left side of the Inequality 3 does not change. For the right side of this inequality, flipping the sign of v(i) is equivalent to flipping the sign of x(i) and not changing the sign of v(i). However, in this case the inequality holds as shown before (it holds for any x ∈ Rd ). Thus for all x, v ∈ Rd and ∑d for F(i) = c|v(i)| j=1 |v(j)|, Inequality 3 holds. Supplement for Section 7 Small scale experiments In additional small-scale experiments, we compared Algorithm 2 with steepest descent (SD), conjugate gradient (CG), BFGS and Newton-Raphson. Small-scale problems may be interesting in real-time learning settings, for example, when a website has to learn from a user’s uploaded labeled data in a split second to perform real-time retrieval. We considered logistic regression on five UCI data sets where missing values were handled via mean-imputation. A range of regularization settings λ ∈ {100 , 102 , 104 } was explored and all algorithms were initialized from the same ten random start-points. Table 3 shows the average number of seconds each algorithm needed to achieve the same solution that BFGS converged to (all algorithms achieve the same solution due to concavity). The bound is the fastest algorithm as indicated in bold. data|λ BFGS SD CG Newton Bound a|100 1.90 1.74 0.78 0.31 0.01 a|102 0.89 0.92 0.83 0.25 0.01 a|104 2.45 1.60 0.85 0.22 0.01 b|100 3.14 2.18 0.70 0.43 0.07 b|102 2.00 6.17 0.67 0.37 0.04 b|104 1.60 5.83 0.83 0.35 0.04 c|100 4.09 1.92 0.65 0.39 0.07 c|102 1.03 0.64 0.64 0.34 0.02 c|104 1.90 0.56 0.72 0.32 0.02 d|100 5.62 12.04 1.36 0.92 0.16 d|102 2.88 1.27 1.21 0.63 0.09 d|104 3.28 1.94 1.23 0.60 0.07 e|100 2.63 2.68 0.48 0.35 0.03 e|102 2.01 2.49 0.55 0.26 0.03 e|104 1.49 1.54 0.43 0.20 0.03 Table 3: Convergence time in seconds under various regularization levels for a) Bupa (t = 345, dim = 7), b) Wine (t = 178, dim = 14), c) Heart (t = 187, dim = 23), d) Ion (t = 351, dim = 34), and e) Hepatitis (t = 155, dim = 20) data sets. Influence of rank k on bound performance in large scale experiments We also examined the influence of k on bound performance and compared it with LBFGS, SD and CG. Several choices 16 of k were explored. Table 4 shows results for the SRBCT data-set. In general, the bound performs best but slows down for superfluously large values of k. Steepest descent and conjugate gradient are slow yet obviously do not vary with k. Note that each iteration takes less time with smaller k for the bound. However, we are reporting overall runtime which is also a function of the number of iterations. Therefore, total runtime (a function of both) may not always decrease/increase with k. k LBFGS SD CG Bound 1 1.37 8.80 4.39 0.56 2 1.32 8.80 4.39 0.56 4 1.39 8.80 4.39 0.67 8 1.35 8.80 4.39 0.96 16 1.46 8.80 4.39 1.34 32 1.40 8.80 4.39 2.11 64 1.54 8.80 4.39 4.57 Table 4: Convergence time in seconds as a function of k. Additional latent-likelihood results For completeness, Figure 5 depicts two additional data-sets to complement Figure 2. Similarly, Table 5 shows all experimental settings explored in order to provide the summary Table 2 in the main article. bupa wine −19 0 −5 −log(J(θ)) −log(J(θ)) −20 −21 −22 Bound Newton BFGS Conjugate gradient Steepest descent −15 −23 −24 −5 −10 0 5 log(Time) [sec] 10 −20 −4 −2 0 2 4 log(Time) [sec] 6 8 Figure 5: Convergence of test latent log-likelihood for bupa and wine data-sets. Data-set Algorithm BFGS SD CG Newton Bound ion m=1 m=2m=3m=4 -4.96 -5.55 -5.88 -5.79 -11.80 -9.92 -5.56 -8.59 -5.47 -5.81 -5.57 -5.22 -5.95 -5.95 -5.95 -5.95 -6.08 -4.84 -4.18 -5.17 Data-set Algorithm BFGS SD CG Newton Bound bupa m=1 m=2 m=3 m=4 -22.07 -21.78 -21.92 -21.87 -21.76 -21.74 -21.73 -21.83 -21.81 -21.81 -21.81 -21.81 -21.85 -21.85 -21.85 -21.85 -21.85 -19.95 -20.01 -19.97 wine m=1m=2m=3m=4 -0.90 -0.91 -1.79 -1.35 -1.61 -1.60 -1.37 -1.63 -0.51 -0.78 -0.95 -0.51 -0.71 -0.71 -0.71 -0.71 -0.51 -0.51 -0.48 -0.51 hepatitis m=1m=2m=3m=4 -4.42 -5.28 -4.95 -4.93 -4.93 -5.14 -5.01 -5.20 -4.84 -4.84 -4.84 -4.84 -5.50 -5.50 -5.50 -4.50 -5.47 -4.40 -4.75 -4.92 SRBCT m=1m=2m=3m=4 -5.99 -6.17 -6.09 -6.06 -5.61 -5.62 -5.62 -5.61 -5.62 -5.49 -5.36 -5.76 -5.54 -5.54 -5.54 -5.54 -5.31 -5.31 -4.90 -0.11 Table 5: Test latent log-likelihood at convergence for different values of m ∈ {1, 2, 3, 4} on ion, bupa, hepatitis, wine and SRBCT data-sets. 17

5 0.11816145 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

Author: Nicholas Ruozzi

Abstract: Sudderth, Wainwright, and Willsky conjectured that the Bethe approximation corresponding to any fixed point of the belief propagation algorithm over an attractive, pairwise binary graphical model provides a lower bound on the true partition function. In this work, we resolve this conjecture in the affirmative by demonstrating that, for any graphical model with binary variables whose potential functions (not necessarily pairwise) are all log-supermodular, the Bethe partition function always lower bounds the true partition function. The proof of this result follows from a new variant of the “four functions” theorem that may be of independent interest. 1

6 0.11413378 147 nips-2012-Graphical Models via Generalized Linear Models

7 0.11213608 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

8 0.10839583 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

9 0.099334471 248 nips-2012-Nonparanormal Belief Propagation (NPNBP)

10 0.09392111 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

11 0.091965377 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering

12 0.091168351 180 nips-2012-Learning Mixtures of Tree Graphical Models

13 0.087405011 204 nips-2012-MAP Inference in Chains using Column Generation

14 0.0812774 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

15 0.079731978 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

16 0.079360582 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

17 0.079356574 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

18 0.078520596 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

19 0.074032694 260 nips-2012-Online Sum-Product Computation Over Trees

20 0.073183961 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.206), (1, 0.028), (2, 0.071), (3, -0.061), (4, -0.092), (5, -0.033), (6, -0.022), (7, -0.097), (8, -0.122), (9, 0.058), (10, -0.034), (11, -0.012), (12, 0.048), (13, 0.032), (14, -0.065), (15, -0.091), (16, -0.023), (17, -0.082), (18, -0.025), (19, 0.095), (20, 0.033), (21, -0.046), (22, -0.03), (23, -0.005), (24, -0.022), (25, 0.126), (26, 0.065), (27, -0.01), (28, 0.04), (29, 0.079), (30, 0.011), (31, -0.031), (32, -0.017), (33, -0.009), (34, 0.004), (35, 0.029), (36, 0.022), (37, -0.059), (38, -0.076), (39, 0.031), (40, -0.001), (41, -0.093), (42, -0.039), (43, 0.096), (44, 0.061), (45, 0.048), (46, -0.045), (47, -0.054), (48, 0.016), (49, 0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94454455 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

Author: Stefano Ermon, Ashish Sabharwal, Bart Selman, Carla P. Gomes

Abstract: Given a probabilistic graphical model, its density of states is a distribution that, for any likelihood value, gives the number of configurations with that probability. We introduce a novel message-passing algorithm called Density Propagation (DP) for estimating this distribution. We show that DP is exact for tree-structured graphical models and is, in general, a strict generalization of both sum-product and max-product algorithms. Further, we use density of states and tree decomposition to introduce a new family of upper and lower bounds on the partition function. For any tree decomposition, the new upper bound based on finer-grained density of state information is provably at least as tight as previously known bounds based on convexity of the log-partition function, and strictly stronger if a general condition holds. We conclude with empirical evidence of improvement over convex relaxations and mean-field based bounds. 1

2 0.72624952 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

Author: Nicholas Ruozzi

Abstract: Sudderth, Wainwright, and Willsky conjectured that the Bethe approximation corresponding to any fixed point of the belief propagation algorithm over an attractive, pairwise binary graphical model provides a lower bound on the true partition function. In this work, we resolve this conjecture in the affirmative by demonstrating that, for any graphical model with binary variables whose potential functions (not necessarily pairwise) are all log-supermodular, the Bethe partition function always lower bounds the true partition function. The proof of this result follows from a new variant of the “four functions” theorem that may be of independent interest. 1

3 0.71387905 248 nips-2012-Nonparanormal Belief Propagation (NPNBP)

Author: Gal Elidan, Cobi Cario

Abstract: The empirical success of the belief propagation approximate inference algorithm has inspired numerous theoretical and algorithmic advances. Yet, for continuous non-Gaussian domains performing belief propagation remains a challenging task: recent innovations such as nonparametric or kernel belief propagation, while useful, come with a substantial computational cost and offer little theoretical guarantees, even for tree structured models. In this work we present Nonparanormal BP for performing efficient inference on distributions parameterized by a Gaussian copulas network and any univariate marginals. For tree structured networks, our approach is guaranteed to be exact for this powerful class of non-Gaussian models. Importantly, the method is as efficient as standard Gaussian BP, and its convergence properties do not depend on the complexity of the univariate marginals, even when a nonparametric representation is used. 1

4 0.6996271 213 nips-2012-Minimization of Continuous Bethe Approximations: A Positive Variation

Author: Jason Pacheco, Erik B. Sudderth

Abstract: We develop convergent minimization algorithms for Bethe variational approximations which explicitly constrain marginal estimates to families of valid distributions. While existing message passing algorithms define fixed point iterations corresponding to stationary points of the Bethe free energy, their greedy dynamics do not distinguish between local minima and maxima, and can fail to converge. For continuous estimation problems, this instability is linked to the creation of invalid marginal estimates, such as Gaussians with negative variance. Conversely, our approach leverages multiplier methods with well-understood convergence properties, and uses bound projection methods to ensure that marginal approximations are valid at all iterations. We derive general algorithms for discrete and Gaussian pairwise Markov random fields, showing improvements over standard loopy belief propagation. We also apply our method to a hybrid model with both discrete and continuous variables, showing improvements over expectation propagation. 1

5 0.66233283 206 nips-2012-Majorization for CRFs and Latent Likelihoods

Author: Tony Jebara, Anna Choromanska

Abstract: The partition function plays a key role in probabilistic modeling including conditional random fields, graphical models, and maximum likelihood estimation. To optimize partition functions, this article introduces a quadratic variational upper bound. This inequality facilitates majorization methods: optimization of complicated functions through the iterative solution of simpler sub-problems. Such bounds remain efficient to compute even when the partition function involves a graphical model (with small tree-width) or in latent likelihood settings. For large-scale problems, low-rank versions of the bound are provided and outperform LBFGS as well as first-order methods. Several learning applications are shown and reduce to fast and convergent update rules. Experimental results show advantages over state-of-the-art optimization methods. This supplement presents additional details in support of the full article. These include the application of the majorization method to maximum entropy problems. It also contains proofs of the various theorems, in particular, a guarantee that the bound majorizes the partition function. In addition, a proof is provided guaranteeing convergence on (non-latent) maximum conditional likelihood problems. The supplement also contains supporting lemmas that show the bound remains applicable in constrained optimization problems. The supplement then proves the soundness of the junction tree implementation of the bound for graphical models with large n. It also proves the soundness of the low-rank implementation of the bound for problems with large d. Finally, the supplement contains additional experiments and figures to provide further empirical support for the majorization methodology. Supplement for Section 2 Proof of Theorem 1 Rewrite the partition function as a sum over the integer index j = 1, . . . , n under the random ordering π : Ω → {1, . . . , n}. This defines j∑ π(y) and associates h and f with = n hj = h(π −1 (j)) and fj = f (π −1 (j)). Next, write Z(θ) = j=1 αj exp(λ⊤ fj ) by introducing ˜ ˜ λ = θ − θ and αj = hj exp(θ ⊤ fj ). Define the partition function over only the first i components ∑i as Zi (θ) = j=1 αj exp(λ⊤ fj ). When i = 0, a trivial quadratic upper bound holds ( ) Z0 (θ) ≤ z0 exp 1 λ⊤ Σ0 λ + λ⊤ µ0 2 with the parameters z0 → 0+ , µ0 = 0, and Σ0 = z0 I. Next, add one term to the current partition function Z1 (θ) = Z0 (θ) + α1 exp(λ⊤ f1 ). Apply the current bound Z0 (θ) to obtain 1 Z1 (θ) ≤ z0 exp( 2 λ⊤ Σ0 λ + λ⊤ µ0 ) + α1 exp(λ⊤ f1 ). Consider the following change of variables u γ 1/2 −1/2 = Σ0 λ − Σ0 = α1 z0 exp( 1 (f1 2 (f1 − µ0 )) − µ0 )⊤ Σ−1 (f1 − µ0 )) 0 and rewrite the logarithm of the bound as log Z1 (θ) ( ) 1 ≤ log z0 − 1 (f1 − µ0 )⊤ Σ−1 (f1 − µ0 ) + λ⊤ f1 + log exp( 2 ∥u∥2 ) + γ . 0 2 Apply Lemma 1 (cf. [32] p. 100) to the last term to get ( (1 ) ) log Z1 (θ) ≤ log z0 − 1 (f1 − µ0 )⊤ Σ−1 (f1 − µ0 ) + λ⊤ f1 + log exp 2 ∥v∥2 +γ 0 2 ( ) v⊤ (u − v) 1 + + (u − v)⊤ I + Γvv⊤ (u − v) 1+γ exp(− 1 ∥v∥2 ) 2 2 10 where Γ = 1 1 tanh( 2 log(γ exp(− 2 ∥v∥2 ))) . 1 2 log(γ exp(− 2 ∥v∥2 )) The bound in [32] is tight when u = v. To achieve tightness −1/2 ˜ when θ = θ or, equivalently, λ = 0, we choose v = Σ0 (µ0 − f1 ) yielding ( ) Z1 (θ) ≤ z1 exp 1 λ⊤ Σ1 λ + λ⊤ µ1 2 where we have z1 = z0 + α1 z0 α1 = µ0 + f1 z0 + α1 z0 + α1 1 tanh( 2 log(α1 /z0 )) = Σ0 + (µ0 − f1 )(µ0 − f1 )⊤ . 2 log(α1 /z0 ) µ1 Σ1 This rule updates the bound parameters z0 , µ0 , Σ0 to incorporate an extra term in the sum over i in Z(θ). The process is iterated n times (replacing 1 with i and 0 with i − 1) to produce an overall bound on all terms. Lemma 1 (See [32] p. 100) ( ( ) ) For all u ∈ Rd , any v ∈ Rd and any γ ≥ 0, the bound log exp 1 ∥u∥2 + γ ≤ 2 ( ( ) ) log exp 1 ∥v∥2 + γ + 2 ( ) v⊤ (u − v) 1 + (u − v)⊤ I + Γvv⊤ (u − v) 1 1 + γ exp(− 2 ∥v∥2 ) 2 holds when the scalar term Γ = 1 tanh( 2 log(γ exp(−∥v∥2 /2))) . 2 log(γ exp(−∥v∥2 /2)) Equality is achieved when u = v. Proof of Lemma 1 The proof is provided in [32]. Supplement for Section 3 Maximum entropy problem We show here that partition functions arise naturally in maximum ∑ p(y) entropy estimation or minimum relative entropy RE(p∥h) = y p(y) log h(y) estimation. Consider the following problem: ∑ ∑ min RE(p∥h) s.t. p(y)f (y) = 0, p(y)g(y) ≥ 0. p y y d′ Here, assume that f : Ω → R and g : Ω → R are arbitrary (non-constant) vector-valued functions ( ) over the sample space. The solution distribution p(y) = h(y) exp θ ⊤ f (y) + ϑ⊤ g(y) /Z(θ, ϑ) is recovered by the dual optimization ∑ ( ) arg θ, ϑ = max − log h(y) exp θ ⊤ f (y) + ϑ⊤ g(y) d ϑ≥0,θ y ′ where θ ∈ Rd and ϑ ∈ Rd . These are obtained by minimizing Z(θ, ϑ) or equivalently by maximizing its negative logarithm. Algorithm 1 permits variational maximization of the dual via the quadratic program min 1 (β ϑ≥0,θ 2 ˜ ˜ − β)⊤ Σ(β − β) + β ⊤ µ ′ where β ⊤ = [θ ⊤ ϑ⊤ ]. Note that any general convex hull of constraints β ∈ Λ ⊆ Rd+d could be imposed without loss of generality. Proof of Theorem 2 We begin by proving a lemma that will be useful later. Lemma 2 If κΨ ≽ Φ ≻ 0 for Φ, Ψ ∈ Rd×d , then 1 ˜ ˜ ˜ L(θ) = − 2 (θ − θ)⊤ Φ(θ − θ) − (θ − θ)⊤ µ ˜ ˜ ˜ U (θ) = − 1 (θ − θ)⊤ Ψ(θ − θ) − (θ − θ)⊤ µ 2 satisfy supθ∈Λ L(θ) ≥ 1 κ ˜ supθ∈Λ U (θ) for any convex Λ ⊆ Rd , θ ∈ Λ, µ ∈ Rd and κ ∈ R+ . 11 Proof of Lemma 2 Define the primal problems of interest as PL = supθ∈Λ L(θ) and PU = supθ∈Λ U (θ). The constraints θ ∈ Λ can be summarized by a set of linear inequalities Aθ ≤ b where A ∈ Rk×d and b ∈ Rk for some (possibly infinite) k ∈ Z. Apply the change of variables ˜ ˜ ˜ ˜ ˜ ˜ z = θ− θ. The constraint A(z+ θ) ≤ b simplifies into Az ≤ b where b = b−Aθ. Since θ ∈ Λ, it 1 ⊤ ˜ ≥ 0. We obtain the equivalent primal problems PL = sup is easy to show that b ˜ − z Φz − Az≤b z⊤ µ and PU = supAz≤b − 1 z⊤ Ψz − z⊤ µ. The corresponding dual problems are ˜ 2 2 ⊤ −1 y⊤AΦ−1A⊤y ˜ µ Φ µ +y⊤AΦ−1µ+y⊤b+ y≥0 2 2 y⊤AΨ−1 A⊤y µ⊤Ψ−1µ ˜ DU = inf +y⊤AΨ−1µ+y⊤b+ . y≥0 2 2 DL = inf ˜ Due to strong duality, PL = DL and PU = DU . Apply the inequalities Φ ≼ κΨ and y⊤ b > 0 as ⊤ −1 y⊤AΨ−1 A⊤y y⊤AΨ−1 µ κ ˜ µΨ µ + + y⊤b + PL ≥ sup − z⊤ Ψz − z⊤ µ = inf y≥0 2 2κ κ 2κ ˜ Az≤b 1 1 ≥ DU = PU . κ κ 1 This proves that PL ≥ κ PU . We will use the above to prove Theorem 2. First, we will upper-bound (in the Loewner ordering sense) the matrices Σj in Algorithm 2. Since ∥fxj (y)∥2 ≤ r for all y ∈ Ωj and since µj in Algorithm 1 is a convex combination of fxj (y), the outer-product terms in the update for Σj satisfy (fxj (y) − µ)(fxj (y) − µ)⊤ ≼ 4r2 I. Thus, Σj ≼ F(α1 , . . . , αn )4r2 I holds where α 1 F(α1 , . . . , αn ) = i n ∑ tanh( 2 log( ∑i−1 αk )) k=1 αi 2 log( ∑i−1 α ) i=2 k k=1 using the definition of α1 , . . . , αn in the proof of Theorem 1. The formula for F starts at i = 2 since z0 → 0+ . Assume permutation π is sampled uniformly at random. The expected value of F is then α π(i) 1 n 1 ∑ ∑ tanh( 2 log( ∑i−1 απ(k) )) k=1 . Eπ [F(α1 , . . . , αn )] = απ(i) n! π i=2 ) 2 log( ∑i−1 k=1 απ(k) We claim that the expectation is maximized when all αi = 1 or any positive constant. Also, F is invariant under uniform scaling of its arguments. Write the expected value of F as E for short. ∂E ∂E ∂E Next, consider ∂αl at the setting αi = 1, ∀i. Due to the expectation over π, we have ∂αl = ∂αo for any l, o. Therefore, the gradient vector is constant when all αi = 1. Since F(α1 , . . . , αn ) is invariant to scaling, the gradient vector must therefore be the all zeros vector. Thus, the point ∂ ∂E when all αi = 1 is an extremum or a saddle. Next, consider ∂αo ∂αl for any l, o. At the setting 2 ∂ ∂E αi = 1, ∂ E = −c(n) and, ∂αo ∂αl = c(n)/(n − 1) for some non-negative constant function ∂α2 l c(n). Thus, the αi = 1 extremum is locally concave and is a maximum. This establishes that Eπ [F(α1 , . . . , αn )] ≤ Eπ [F(1, . . . , 1)] and yields the Loewner bound ) ( n−1 ∑ tanh(log(i)/2) 2 I = ωI. Σj ≼ 2r log(i) i=1 Apply this bound to each Σj in the lower bound on J(θ) and also note a corresponding upper bound ∑ ˜ ˜ ˜ J(θ) ≥ J(θ)−tω+tλ ∥θ − θ∥2− (θ − θ)⊤(µj −fxj (yj )) 2 j ∑ ˜ ˜ ˜ J(θ) ≤ J(θ)−tλ ∥θ − θ∥2− (θ − θ)⊤(µj −fxj (yj )) 2 j 12 ˜ which follows from Jensen’s inequality. Define the current θ at time τ as θτ and denote by Lτ (θ) the above lower bound and by Uτ (θ) the above upper bound at time τ . Clearly, Lτ (θ) ≤ J(θ) ≤ Uτ (θ) with equality when θ = θτ . Algorithm 2 maximizes J(θ) after initializing at θ0 and performing an update by maximizing a lower bound based on Σj . Since Lτ (θ) replaces the definition of Σj with ωI ≽ Σj , Lτ (θ) is a looser bound than the one used by Algorithm 2. Thus, performing θτ +1 = arg maxθ∈Λ Lτ (θ) makes less progress than a step of Algorithm 1. Consider computing the slower update at each iteration τ and returning θτ +1 = arg maxθ∈Λ Lτ (θ). Setting Φ = (tω +tλ)I, Ψ = tλI and κ = ω+λ allows us to apply Lemma 2 as follows λ sup Lτ (θ) − Lτ (θτ ) = θ∈Λ 1 sup Uτ (θ) − Uτ (θτ ). κ θ∈Λ Since Lτ (θτ ) = J(θτ ) = Uτ (θτ ), J(θτ +1 ) ≥ supθ∈Λ Lτ (θ) and supθ∈Λ Uτ (θ) ≥ J(θ ∗ ), we obtain ( ) 1 J(θτ +1 ) − J(θ ∗ ) ≥ 1− (J(θτ ) − J(θ ∗ )) . κ Iterate the above inequality starting at t = 0 to obtain ( )τ 1 ∗ J(θτ ) − J(θ ) ≥ 1− (J(θ0 ) − J(θ ∗ )) . κ ( ) 1 τ κ A solution within a multiplicative factor of ϵ implies that ϵ = 1 − κ or log(1/ϵ) = τ log κ−1 . ⌉ ⌈ Inserting the definition for κ shows that the number of iterations τ is at most log(1/ϵ) or κ log κ−1 ⌉ ⌈ log(1/ϵ) log(1+λ/ω) . Inserting the definition for ω gives the bound. Y12,0 Y11,1 Y21,1 Y31,1 ··· 1,1 Ym1,1 Figure 3: Junction tree of depth 2. Algorithm 5 SmallJunctionTree ˜ Input Parameters θ and h(u), f (u) ∀u ∈ Y12,0 and zi , Σi , µi ∀i = 1, . . . , m1,1 + Initialize z → 0 , µ = 0, Σ = zI For each configuration u ∈ Y12,0 { ∏m1,1 ∑m1,1 ∏m1,1 ˜ ˜ ˜ α = h(u)( ∑ zi exp(−θ ⊤ µi )) exp(θ ⊤ (f (u) + i=1 µi )) = h(u) exp(θ ⊤ f (u)) i=1 zi i=1 m1,1 l = f (u) + i=1 µi − µ 1 ∑m1,1 tanh( 2 log(α/z)) ⊤ Σ + = i=1 Σi + ll 2 log(α/z) α µ + = z+α l z += α } Output z, µ, Σ Supplement for Section 5 Proof of correctness for Algorithm 3 Consider a simple junction tree of depth 2 shown on Figure 3. The notation Yca,b refers to the cth tree node located at tree level a (first level is considered as the one with∑ leaves) whose parent is the bth from the higher tree level (the root has no parent so b = 0). tree ∑ Let Y a1 ,b1 refer to the sum over all configurations of variables in Yca1 ,b1 and Y a1 ,b1 \Y a2 ,b2 1 c1 c1 c2 refers to the sum over all configurations of variables that are in Yca1 ,b1 but not in Yca2 ,b2 . Let ma,b 1 2 denote the number of children of the bth node located at tree level a + 1. For short-hand, use ψ(Y ) = h(Y ) exp(θ ⊤ f (Y )). The partition function can be expressed as: 13 Y13,0 Y12,1 ··· Y11,1 Y21,1 ··· Y22,1 1,1 Ym1,1 ··· Y11,2 Y21,2 1,2 Ym1,2 2,1 Ym2,1 1,m2,1 Y1 ··· 1,m2,1 Y2 1,m 2,1 Ym1,m2,1 Figure 4: Junction tree of depth 3. Z(θ) = ∑ 2,0 u∈Y1 ≤ ∑ 2,0 u∈Y1 = ∑  ψ(u) ∏ m1,1 i=1 [ ∏ m1,1 ψ(u) i=1 [    ∑ ψ(v) 2,0 v∈Yi1,1 \Y1 ) 1( ˜ ˜ ˜ zi exp( θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi 2 ∏ ( m1,1 ⊤ h(u) exp(θ f (u)) 2,0 u∈Y1 zi exp i=1 ] 1 ˜ ˜ ˜ (θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi 2 )] ∑ where the upper-bound is obtained by applying Theorem 1 to each of the terms v∈Y 1,1 \Y 2,0 ψ(v). 1 i By simply rearranging terms we get: ) ( ( [ (m1,1 )) m1,1 ∏ ∑ ∑ ˜ zi exp(−θ ⊤ µi ) exp θ ⊤ f (u) + µi h(u) Z(θ) ≤ 2,0 u∈Y1 i=1 ( exp 1 ˜ (θ − θ)⊤ 2 (m1,1 ∑ ) Σi i=1 ˜ (θ − θ) )] . i=1 One ( can prove that this ) expression can be upper-bounded by 1 ˆ ⊤ Σ(θ − θ) + (θ − θ)⊤ µ where z, Σ and µ can be computed using Algoˆ ˆ z exp 2 (θ − θ) rithm 5 (a simplification of Algorithm 3). We will call this result Lemma A. The proof is similar to the proof of Theorem 1 so is not repeated here. Consider enlarging the tree to a depth 3 as shown on Figure 4. The partition function is now      m2,1 m1,i ∑  ∏ ∑ ∏ ∑   Z(θ) = ψ(w) . ψ(u)  ψ(v)  3,0 u∈Y1 i=1 3,0 v∈Yi2,1 \Y1 j=1 w∈Yj1,i \Yi2,1 ( )) ∏m1,i (∑ ∑ term By Lemma A we can upper bound each v∈Y 2,1 \Y 3,0 ψ(v) j=1 w∈Yj1,i \Yi2,1 ψ(w) 1 i ( ) ˆ ˆ ˆ by the expression zi exp 1 (θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi . This yields 2 [ )] ( m2,1 ∑ ∏ 1 ˜ ˜ ˜ Z(θ) ≤ ψ(u) (θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi . zi exp 2 3,0 i=1 u∈Y1 2,1 2,1 2,1 This process can be viewed as collapsing the sub-trees S1 , S2 , . . ., Sm2,1 to super-nodes that are represented by bound parameters, zi , Σi and µi , i = {1, 2, · · · , m2,1 }, where the sub-trees are 14 defined as: 2,1 S1 = 1,1 {Y12,1 , Y11,1 , Y21,1 , Y31,1 , . . . , Ym1,1 } 2,1 S2 = 1,2 {Y22,1 , Y11,2 , Y21,2 , Y31,2 , . . . , Ym1,2 } = 2,1 {Ym2,1 , Y1 . . . 2,1 Sm2,1 1,m2,1 1,m2,1 , Y2 1,m2,1 , Y3 1,m2,1 , . . . , Ym1,m2,1 }. Notice that the obtained expression can be further upper bounded using again Lemma A (induction) ( ) ˆ ˆ ˆ yielding a bound of the form: z exp 1 (θ − θ)⊤ Σ(θ − θ) + (θ − θ)⊤ µ . 2 Finally, for a general tree, follow the same steps described above, starting from leaves and collapsing nodes to super-nodes, each represented by bound parameters. This procedure effectively yields Algorithm 3 for the junction tree under consideration. Supplement for Section 6 Proof of correctness for Algorithm 4 We begin by proving a lemma that will be useful later. Lemma 3 For all x ∈ Rd and for all l ∈ Rd , 2  d d 2 ∑ ∑ l(i)  . x(i)2 l(i)2 ≥  x(i) √∑ d l(j)2 i=1 i=1 j=1 Proof of Lemma 3 By Jensen’s inequality, 2  ( d )2 d d ∑ x(i)l(i)2 ∑ ∑ x(i)l(i)2  . √∑ x(i)2 ∑d ⇐⇒ x(i)2 l(i)2 ≥  ≥ ∑d d l(j)2 l(j)2 l(j)2 j=1 j=1 i=1 i=1 i=1 i=1 d ∑ l(i)2 j=1 Now we prove the correctness of Algorithm 4. At the ith iteration, the algorithm stores Σi using ⊤ a low-rank representation Vi Si Vi + Di where Vi ∈ Rk×d is orthonormal, Si ∈ Rk×k positive d×d semi-definite and Di ∈ R is non-negative diagonal. The diagonal terms D are initialized to tλI where λ is the regularization term. To mimic Algorithm 1 we must increment the Σ matrix by a rank one update of the form Σi = Σi−1 + ri r⊤ . By projecting ri onto each eigenvector in V, we i ∑k ⊤ can decompose it as ri = j=1 Vi−1 (j, ·)ri Vi−1 (j, ·)⊤ + g = Vi−1 Vi−1 ri + g where g is the remaining residue. Thus the update rule can be rewritten as: Σi ⊤ ⊤ ⊤ = Σi−1 + ri r⊤ = Vi−1 Si−1 Vi−1 + Di−1 + (Vi−1 Vi−1 ri + g)(Vi−1 Vi−1 ri + g)⊤ i ′ ′ ′ ⊤ ⊤ ⊤ = Vi−1 (Si−1 + Vi−1 ri r⊤ Vi−1 )Vi−1 + Di−1 + gg⊤ = Vi−1 Si−1 Vi−1 + gg⊤ + Di−1 i ′ where we define Vi−1 = Qi−1 Vi−1 and defined Qi−1 in terms of the singular value decomposition, ′ ′ ⊤ Q⊤ Si−1 Qi−1 = svd(Si−1 + Vi−1 ri r⊤ Vi−1 ). Note that Si−1 is diagonal and nonnegative by i−1 i construction. The current formula for Σi shows that we have a rank (k + 1) system (plus diagonal term) which needs to be converted back to a rank k system (plus diagonal term) which we denote by ′ Σi . We have two options as follows. Case 1) Remove g from Σi to obtain ′ ′ ′ ′ ⊤ Σi = Vi−1 Si−1 Vi−1 + Di−1 = Σi − gg⊤ = Σi − cvv⊤ 1 where c = ∥g∥2 and v = ∥g∥ g. th Case 2) Remove the m (smallest) eigenvalue in S′ and its corresponding eigenvector: i−1 ′ Σi ′ ′ ′ ′ ′ ′ ⊤ = Vi−1 Si−1 Vi−1 + Di−1 + gg⊤ − S (m, m)V (m, ·)⊤ V (m, ·) = Σi − cvv⊤ ′ ′ where c = S (m, m) and v = V(m, ·) . 15 ′ Clearly, both cases can be written as an update of the form Σi = Σi + cvv⊤ where c ≥ 0 and v⊤ v = 1. We choose the case with smaller c value to minimize the change as we drop from a system of order (k + 1) to order k. Discarding the smallest singular value and its corresponding eigenvector would violate the bound. Instead, consider absorbing this term into the diagonal component to ′′ ′ preserve the bound. Formally, we look for a diagonal matrix F such that Σi = Σi + F which also ′′ maintains x⊤ Σi x ≥ x⊤ Σi x for all x ∈ Rd . Thus, we want to satisfy: ( d )2 d ∑ ∑ ⊤ ′′ ⊤ ⊤ ⊤ ⊤ x Σi x ≥ x Σi x ⇐⇒ x cvv x ≤ x Fx ⇐⇒ c x(i)v(i) ≤ x(i)2 F(i) i=1 i=1 where, for ease of notation, we take F(i) = F(i, i). ′ where w = v⊤ 1. Consider the case where v ≥ 0 though we will soon get rid of (∑ )2 ∑d d this assumption. We need an F such that i=1 x(i)2 F(i) ≥ c i=1 x(i)v(i) . Equivalently, we (∑ ) ∑d ′ 2 ′ d need i=1 x(i)2 F(i) ≥ . Define F(i) = F(i) for all i = 1, . . . , d. So, we need 2 i=1 x(i)v(i) cw cw2 (∑ ) ∑d ′ ′ ′ 2 d an F such that i=1 x(i)2 F(i) ≥ . Using Lemma 3 it is easy to show that we i=1 x(i)v(i) Define v = 1 wv ′ ′ ′ may choose F (i) = v(i) . Thus, we obtain F(i) = cw2 F(i) = cwv(i). Therefore, for all x ∈ Rd , ∑d all v ≥ 0, and for F(i) = cv(i) j=1 v(j) we have d ∑ ( x(i) F(i) ≥ c 2 i=1 d ∑ )2 x(i)v(i) . (3) i=1 To generalize the inequality to hold for all vectors v ∈ Rd with potentially negative entries, it is ∑d sufficient to set F(i) = c|v(i)| j=1 |v(j)|. To verify this, consider flipping the sign of any v(i). The left side of the Inequality 3 does not change. For the right side of this inequality, flipping the sign of v(i) is equivalent to flipping the sign of x(i) and not changing the sign of v(i). However, in this case the inequality holds as shown before (it holds for any x ∈ Rd ). Thus for all x, v ∈ Rd and ∑d for F(i) = c|v(i)| j=1 |v(j)|, Inequality 3 holds. Supplement for Section 7 Small scale experiments In additional small-scale experiments, we compared Algorithm 2 with steepest descent (SD), conjugate gradient (CG), BFGS and Newton-Raphson. Small-scale problems may be interesting in real-time learning settings, for example, when a website has to learn from a user’s uploaded labeled data in a split second to perform real-time retrieval. We considered logistic regression on five UCI data sets where missing values were handled via mean-imputation. A range of regularization settings λ ∈ {100 , 102 , 104 } was explored and all algorithms were initialized from the same ten random start-points. Table 3 shows the average number of seconds each algorithm needed to achieve the same solution that BFGS converged to (all algorithms achieve the same solution due to concavity). The bound is the fastest algorithm as indicated in bold. data|λ BFGS SD CG Newton Bound a|100 1.90 1.74 0.78 0.31 0.01 a|102 0.89 0.92 0.83 0.25 0.01 a|104 2.45 1.60 0.85 0.22 0.01 b|100 3.14 2.18 0.70 0.43 0.07 b|102 2.00 6.17 0.67 0.37 0.04 b|104 1.60 5.83 0.83 0.35 0.04 c|100 4.09 1.92 0.65 0.39 0.07 c|102 1.03 0.64 0.64 0.34 0.02 c|104 1.90 0.56 0.72 0.32 0.02 d|100 5.62 12.04 1.36 0.92 0.16 d|102 2.88 1.27 1.21 0.63 0.09 d|104 3.28 1.94 1.23 0.60 0.07 e|100 2.63 2.68 0.48 0.35 0.03 e|102 2.01 2.49 0.55 0.26 0.03 e|104 1.49 1.54 0.43 0.20 0.03 Table 3: Convergence time in seconds under various regularization levels for a) Bupa (t = 345, dim = 7), b) Wine (t = 178, dim = 14), c) Heart (t = 187, dim = 23), d) Ion (t = 351, dim = 34), and e) Hepatitis (t = 155, dim = 20) data sets. Influence of rank k on bound performance in large scale experiments We also examined the influence of k on bound performance and compared it with LBFGS, SD and CG. Several choices 16 of k were explored. Table 4 shows results for the SRBCT data-set. In general, the bound performs best but slows down for superfluously large values of k. Steepest descent and conjugate gradient are slow yet obviously do not vary with k. Note that each iteration takes less time with smaller k for the bound. However, we are reporting overall runtime which is also a function of the number of iterations. Therefore, total runtime (a function of both) may not always decrease/increase with k. k LBFGS SD CG Bound 1 1.37 8.80 4.39 0.56 2 1.32 8.80 4.39 0.56 4 1.39 8.80 4.39 0.67 8 1.35 8.80 4.39 0.96 16 1.46 8.80 4.39 1.34 32 1.40 8.80 4.39 2.11 64 1.54 8.80 4.39 4.57 Table 4: Convergence time in seconds as a function of k. Additional latent-likelihood results For completeness, Figure 5 depicts two additional data-sets to complement Figure 2. Similarly, Table 5 shows all experimental settings explored in order to provide the summary Table 2 in the main article. bupa wine −19 0 −5 −log(J(θ)) −log(J(θ)) −20 −21 −22 Bound Newton BFGS Conjugate gradient Steepest descent −15 −23 −24 −5 −10 0 5 log(Time) [sec] 10 −20 −4 −2 0 2 4 log(Time) [sec] 6 8 Figure 5: Convergence of test latent log-likelihood for bupa and wine data-sets. Data-set Algorithm BFGS SD CG Newton Bound ion m=1 m=2m=3m=4 -4.96 -5.55 -5.88 -5.79 -11.80 -9.92 -5.56 -8.59 -5.47 -5.81 -5.57 -5.22 -5.95 -5.95 -5.95 -5.95 -6.08 -4.84 -4.18 -5.17 Data-set Algorithm BFGS SD CG Newton Bound bupa m=1 m=2 m=3 m=4 -22.07 -21.78 -21.92 -21.87 -21.76 -21.74 -21.73 -21.83 -21.81 -21.81 -21.81 -21.81 -21.85 -21.85 -21.85 -21.85 -21.85 -19.95 -20.01 -19.97 wine m=1m=2m=3m=4 -0.90 -0.91 -1.79 -1.35 -1.61 -1.60 -1.37 -1.63 -0.51 -0.78 -0.95 -0.51 -0.71 -0.71 -0.71 -0.71 -0.51 -0.51 -0.48 -0.51 hepatitis m=1m=2m=3m=4 -4.42 -5.28 -4.95 -4.93 -4.93 -5.14 -5.01 -5.20 -4.84 -4.84 -4.84 -4.84 -5.50 -5.50 -5.50 -4.50 -5.47 -4.40 -4.75 -4.92 SRBCT m=1m=2m=3m=4 -5.99 -6.17 -6.09 -6.06 -5.61 -5.62 -5.62 -5.61 -5.62 -5.49 -5.36 -5.76 -5.54 -5.54 -5.54 -5.54 -5.31 -5.31 -4.90 -0.11 Table 5: Test latent log-likelihood at convergence for different values of m ∈ {1, 2, 3, 4} on ion, bupa, hepatitis, wine and SRBCT data-sets. 17

6 0.6220482 359 nips-2012-Variational Inference for Crowdsourcing

7 0.61717653 147 nips-2012-Graphical Models via Generalized Linear Models

8 0.60960186 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

9 0.60647863 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

10 0.57933474 180 nips-2012-Learning Mixtures of Tree Graphical Models

11 0.57847148 214 nips-2012-Minimizing Sparse High-Order Energies by Submodular Vertex-Cover

12 0.55216897 204 nips-2012-MAP Inference in Chains using Column Generation

13 0.54658657 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

14 0.54289538 232 nips-2012-Multiplicative Forests for Continuous-Time Processes

15 0.5406757 53 nips-2012-Bayesian Pedigree Analysis using Measure Factorization

16 0.53318864 331 nips-2012-Symbolic Dynamic Programming for Continuous State and Observation POMDPs

17 0.53297925 123 nips-2012-Exponential Concentration for Mutual Information Estimation with Application to Forests

18 0.52690053 260 nips-2012-Online Sum-Product Computation Over Trees

19 0.51543581 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

20 0.49868172 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.04), (21, 0.028), (36, 0.011), (38, 0.156), (39, 0.031), (53, 0.011), (54, 0.018), (55, 0.015), (74, 0.088), (76, 0.138), (77, 0.211), (80, 0.126), (92, 0.047)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92676723 14 nips-2012-A P300 BCI for the Masses: Prior Information Enables Instant Unsupervised Spelling

Author: Pieter-jan Kindermans, Hannes Verschore, David Verstraeten, Benjamin Schrauwen

Abstract: The usability of Brain Computer Interfaces (BCI) based on the P300 speller is severely hindered by the need for long training times and many repetitions of the same stimulus. In this contribution we introduce a set of unsupervised hierarchical probabilistic models that tackle both problems simultaneously by incorporating prior knowledge from two sources: information from other training subjects (through transfer learning) and information about the words being spelled (through language models). We show, that due to this prior knowledge, the performance of the unsupervised models parallels and in some cases even surpasses that of supervised models, while eliminating the tedious training session. 1

2 0.88402212 167 nips-2012-Kernel Hyperalignment

Author: Alexander Lorbert, Peter J. Ramadge

Abstract: We offer a regularized, kernel extension of the multi-set, orthogonal Procrustes problem, or hyperalignment. Our new method, called Kernel Hyperalignment, expands the scope of hyperalignment to include nonlinear measures of similarity and enables the alignment of multiple datasets with a large number of base features. With direct application to fMRI data analysis, kernel hyperalignment is well-suited for multi-subject alignment of large ROIs, including the entire cortex. We report experiments using real-world, multi-subject fMRI data. 1

3 0.87591732 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

Author: Jeff Beck, Alexandre Pouget, Katherine A. Heller

Abstract: Recent experiments have demonstrated that humans and animals typically reason probabilistically about their environment. This ability requires a neural code that represents probability distributions and neural circuits that are capable of implementing the operations of probabilistic inference. The proposed probabilistic population coding (PPC) framework provides a statistically efficient neural representation of probability distributions that is both broadly consistent with physiological measurements and capable of implementing some of the basic operations of probabilistic inference in a biologically plausible way. However, these experiments and the corresponding neural models have largely focused on simple (tractable) probabilistic computations such as cue combination, coordinate transformations, and decision making. As a result it remains unclear how to generalize this framework to more complex probabilistic computations. Here we address this short coming by showing that a very general approximate inference algorithm known as Variational Bayesian Expectation Maximization can be naturally implemented within the linear PPC framework. We apply this approach to a generic problem faced by any given layer of cortex, namely the identification of latent causes of complex mixtures of spikes. We identify a formal equivalent between this spike pattern demixing problem and topic models used for document classification, in particular Latent Dirichlet Allocation (LDA). We then construct a neural network implementation of variational inference and learning for LDA that utilizes a linear PPC. This network relies critically on two non-linear operations: divisive normalization and super-linear facilitation, both of which are ubiquitously observed in neural circuits. We also demonstrate how online learning can be achieved using a variation of Hebb’s rule and describe an extension of this work which allows us to deal with time varying and correlated latent causes. 1 Introduction to Probabilistic Inference in Cortex Probabilistic (Bayesian) reasoning provides a coherent and, in many ways, optimal framework for dealing with complex problems in an uncertain world. It is, therefore, somewhat reassuring that behavioural experiments reliably demonstrate that humans and animals behave in a manner consistent with optimal probabilistic reasoning when performing a wide variety of perceptual [1, 2, 3], motor [4, 5, 6], and cognitive tasks[7]. This remarkable ability requires a neural code that represents probability distribution functions of task relevant stimuli rather than just single values. While there 1 are many ways to represent functions, Bayes rule tells us that when it comes to probability distribution functions, there is only one statistically optimal way to do it. More precisely, Bayes Rule states that any pattern of activity, r, that efficiently represents a probability distribution over some task relevant quantity s, must satisfy the relationship p(s|r) ∝ p(r|s)p(s), where p(r|s) is the stimulus conditioned likelihood function that specifies the form of neural variability, p(s) gives the prior belief regarding the stimulus, and p(s|r) gives the posterior distribution over values of the stimulus, s given the representation r . Of course, it is unlikely that the nervous system consistently achieves this level of optimality. None-the-less, Bayes rule suggests the existence of a link between neural variability as characterized by the likelihood function p(r|s) and the state of belief of a mature statistical learning machine such as the brain. The so called Probabilistic Population Coding (or PPC) framework[8, 9, 10] takes this link seriously by proposing that the function encoded by a pattern of neural activity r is, in fact, the likelihood function p(r|s). When this is the case, the precise form of the neural variability informs the nature of the neural code. For example, the exponential family of statistical models with linear sufficient statistics has been shown to be flexible enough to model the first and second order statistics of in vivo recordings in awake behaving monkeys[9, 11, 12] and anesthetized cats[13]. When the likelihood function is modeled in this way, the log posterior probability over the stimulus is linearly encoded by neural activity, i.e. log p(s|r) = h(s) · r − log Z(r) (1) Here, the stimulus dependent kernel, h(s), is a vector of functions of s, the dot represents a standard dot product, and Z(r) is the partition function which serves to normalize the posterior. This log linear form for a posterior distribution is highly computationally convenient and allows for evidence integration to be implemented via linear operations on neural activity[14, 8]. Proponents of this kind of linear PPC have demonstrated how to build biologically plausible neural networks capable of implementing the operations of probabilistic inference that are needed to optimally perform the behavioural tasks listed above. This includes, linear PPC implementations of cue combination[8], evidence integration over time, maximum likelihood and maximum a posterior estimation[9], coordinate transformation/auditory localization[10], object tracking/Kalman filtering[10], explaining away[10], and visual search[15]. Moreover, each of these neural computations has required only a single recurrently connected layer of neurons that is capable of just two non-linear operations: coincidence detection and divisive normalization, both of which are widely observed in cortex[16, 17]. Unfortunately, this research program has been a piecemeal effort that has largely proceeded by building neural networks designed deal with particular problems. As a result, there have been no proposals for a general principle by which neural network implementations of linear PPCs might be generated and no suggestions regarding how to deal with complex (intractable) problems of probabilistic inference. In this work, we will partially address this short coming by showing that Variation Bayesian Expectation Maximization (VBEM) algorithm provides a general scheme for approximate inference and learning with linear PPCs. In section 2, we briefly review the VBEM algorithm and show how it naturally leads to a linear PPC representation of the posterior as well as constraints on the neural network dynamics which build that PPC representation. Because this section describes the VB-PPC approach rather abstractly, the remainder of the paper is dedicated to concrete applications. As a motivating example, we consider the problem of inferring the concentrations of odors in an olfactory scene from a complex pattern of spikes in a population of olfactory receptor neurons (ORNs). In section 3, we argue that this requires solving a spike pattern demixing problem which is indicative of the generic problem faced by many layers of cortex. We then show that this demixing problem is equivalent to the problem addressed by a class of models for text documents know as probabilistic topic models, in particular Latent Dirichlet Allocation or LDA[18]. In section 4, we apply the VB-PPC approach to build a neural network implementation of probabilistic inference and learning for LDA. This derivation shows that causal inference with linear PPC’s also critically relies on divisive normalization. This result suggests that this particular non-linearity may be involved in very general and fundamental probabilistic computation, rather than simply playing a role in gain modulation. In this section, we also show how this formulation allows for a probabilistic treatment of learning and show that a simple variation of Hebb’s rule can implement Bayesian learning in neural circuits. 2 We conclude this work by generalizing this approach to time varying inputs by introducing the Dynamic Document Model (DDM) which can infer short term fluctuations in the concentrations of individual topics/odors and can be used to model foraging and other tracking tasks. 2 Variational Bayesian Inference with linear Probabilistic Population Codes Variational Bayesian (VB) inference refers to a class of deterministic methods for approximating the intractable integrals which arise in the context of probabilistic reasoning. Properly implemented it can result a fast alternative to sampling based methods of inference such as MCMC[19] sampling. Generically, the goal of any Bayesian inference algorithm is to infer a posterior distribution over behaviourally relevant latent variables Z given observations X and a generative model which specifies the joint distribution p(X, Θ, Z). This task is confounded by the fact that the generative model includes latent parameters Θ which must be marginalized out, i.e. we wish to compute, p(Z|X) ∝ p(X, Θ, Z)dΘ (2) When the number of latent parameters is large this integral can be quite unwieldy. The VB algorithms simplify this marginalization by approximating the complex joint distribution over behaviourally relevant latents and parameters, p(Θ, Z|X), with a distribution q(Θ, Z) for which integrals of this form are easier to deal with in some sense. There is some art to choosing the particular form for the approximating distribution to make the above integral tractable, however, a factorized approximation is common, i.e. q(Θ, Z) = qΘ (Θ)qZ (Z). Regardless, for any given observation X, the approximate posterior is found by minimizing the Kullback-Leibler divergence between q(Θ, Z) and p(Θ, Z|X). When a factorized posterior is assumed, the Variational Bayesian Expectation Maximization (VBEM) algorithm finds a local minimum of the KL divergence by iteratively updating, qΘ (Θ) and qZ (Z) according to the scheme n log qΘ (Θ) ∼ log p(X, Θ, Z) n qZ (Z) and n+1 log qZ (Z) ∼ log p(X, Θ, Z) n qΘ (Θ) (3) Here the brackets indicate an expected value taken with respect to the subscripted probability distribution function and the tilde indicates equality up to a constant which is independent of Θ and Z. The key property to note here is that the approximate posterior which results from this procedure is in an exponential family form and is therefore representable by a linear PPC (Eq. 1). This feature allows for the straightforward construction of networks which implement the VBEM algorithm with linear PPC’s in the following way. If rn and rn are patterns of activity that use a linear PPC representation Θ Z of the relevant posteriors, then n log qΘ (Θ) ∼ hΘ (Θ) · rn Θ and n+1 log qZ (Z) ∼ hZ (Z) · rn+1 . Z (4) Here the stimulus dependent kernels hZ (Z) and hΘ (Θ) are chosen so that their outer product results in a basis that spans the function space on Z × Θ given by log p(X, Θ, Z) for every X. This choice guarantees that there exist functions fΘ (X, rn ) and fZ (X, rn ) such that Z Θ rn = fΘ (X, rn ) Θ Z and rn+1 = fZ (X, rn ) Θ Z (5) satisfy Eq. 3. When this is the case, simply iterating the discrete dynamical system described by Eq. 5 until convergence will find the VBEM approximation to the posterior. This is one way to build a neural network implementation of the VB algorithm. However, its not the only way. In general, any dynamical system which has stable fixed points in common with Eq. 5 can also be said to implement the VBEM algorithm. In the example below we will take advantage of this flexibility in order to build biologically plausible neural network implementations. 3 Response! to Mixture ! of Odors! Single Odor Response Cause Intensity Figure 1: (Left) Each cause (e.g. coffee) in isolation results in a pattern of neural activity (top). When multiple causes contribute to a scene this results in an overall pattern of neural activity which is a mixture of these patterns weighted by the intensities (bottom). (Right) The resulting pattern can be represented by a raster, where each spike is colored by its corresponding latent cause. 3 Probabilistic Topic Models for Spike Train Demixing Consider the problem of odor identification depicted in Fig. 1. A typical mammalian olfactory system consists of a few hundred different types of olfactory receptor neurons (ORNs), each of which responds to a wide range of volatile chemicals. This results in a highly distributed code for each odor. Since, a typical olfactory scene consists of many different odors at different concentrations, the pattern of ORN spike trains represents a complex mixture. Described in this way, it is easy to see that the problem faced by early olfactory cortex can be described as the task of demixing spike trains to infer latent causes (odor intensities). In many ways this olfactory problem is a generic problem faced by each cortical layer as it tries to make sense of the activity of the neurons in the layer below. The input patterns of activity consist of spikes (or spike counts) labeled by the axons which deliver them and summarized by a histogram which indicates how many spikes come from each input neuron. Of course, just because a spike came from a particular neuron does not mean that it had a particular cause, just as any particular ORN spike could have been caused by any one of a large number of volatile chemicals. Like olfactory codes, cortical codes are often distributed and multiple latent causes can be present at the same time. Regardless, this spike or histogram demixing problem is formally equivalent to a class of demixing problems which arise in the context of probabilistic topic models used for document modeling. A simple but successful example of this kind of topic model is called Latent Dirichlet Allocation (LDA) [18]. LDA assumes that word order in documents is irrelevant and, therefore, models documents as histograms of word counts. It also assumes that there are K topics and that each of these topics appears in different proportions in each document, e.g. 80% of the words in a document might be concerned with coffee and 20% with strawberries. Words from a given topic are themselves drawn from a distribution over words associated with that topic, e.g. when talking about coffee you have a 5% chance of using the word ’bitter’. The goal of LDA is to infer both the distribution over topics discussed in each document and the distribution of words associated with each topic. We can map the generative model for LDA onto the task of spike demixing in cortex by letting topics become latent causes or odors, words become neurons, word occurrences become spikes, word distributions associated with each topic become patterns of neural activity associated with each cause, and different documents become the observed patterns of neural activity on different trials. This equivalence is made explicit in Fig. 2 which describes the standard generative model for LDA applied to documents on the left and mixtures of spikes on the right. 4 LDA Inference and Network Implementation In this section we will apply the VB-PPC formulation to build a biologically plausible network capable of approximating probabilistic inference for spike pattern demixing. For simplicity, we will use the equivalent Gamma-Poisson formulation of LDA which directly models word and topic counts 4 1. For each topic k = 1, . . . , K, (a) Distribution over words βk ∼ Dirichlet(η0 ) 2. For document d = 1, . . . , D, (a) Distribution over topics θd ∼ Dirichlet(α0 ) (b) For word m = 1, . . . , Ωd i. Topic assignment zd,m ∼ Multinomial(θd ) ii. Word assignment ωd,m ∼ Multinomial(βzm ) 1. For latent cause k = 1, . . . , K, (a) Pattern of neural activity βk ∼ Dirichlet(η0 ) 2. For scene d = 1, . . . , D, (a) Relative intensity of each cause θd ∼ Dirichlet(α0 ) (b) For spike m = 1, . . . , Ωd i. Cause assignment zd,m ∼ Multinomial(θd ) ii. Neuron assignment ωd,m ∼ Multinomial(βzm ) Figure 2: (Left) The LDA generative model in the context of document modeling. (Right) The corresponding LDA generative model mapped onto the problem of spike demixing. Text related attributes on the left, in red, have been replaced with neural attributes on the right, in green. rather than topic assignments. Specifically, we define, Rd,j to be the number of times neuron j fires during trial d. Similarly, we let Nd,j,k to be the number of times a spike in neuron j comes from cause k in trial d. These new variables play the roles of the cause and neuron assignment variables, zd,m and ωd,m by simply counting them up. If we let cd,k be an un-normalized intensity of cause j such that θd,k = cd,k / k cd,k then the generative model, Rd,j = k Nd,j,k Nd,j,k ∼ Poisson(βj,k cd,k ) 0 cd,k ∼ Gamma(αk , C −1 ). (6) is equivalent to the topic models described above. Here the parameter C is a scale parameter which sets the expected total number of spikes from the population on each trial. Note that, the problem of inferring the wj,k and cd,k is a non-negative matrix factorization problem similar to that considered by Lee and Seung[20]. The primary difference is that, here, we are attempting to infer a probability distribution over these quantities rather than maximum likelihood estimates. See supplement for details. Following the prescription laid out in section 2, we approximate the posterior over latent variables given a set of input patterns, Rd , d = 1, . . . , D, with a factorized distribution of the form, qN (N)qc (c)qβ (β). This results in marginal posterior distributions q (β:,k |η:,k ), q cd,k |αd,k , C −1 + 1 ), and q (Nd,j,: | log pd,j,: , Rd,i ) which are Dirichlet, Gamma, and Multinomial respectively. Here, the parameters η:,k , αd,k , and log pd,j,: are the natural parameters of these distributions. The VBEM update algorithm yields update rules for these parameters which are summarized in Fig. 3 Algorithm1. Algorithm 1: Batch VB updates 1: while ηj,k not converged do 2: for d = 1, · · · , D do 3: while pd,j,k , αd,k not converged do 4: αd,k → α0 + j Rd,j pd,j,k 5: pd,j,k → Algorithm 2: Online VB updates 1: for d = 1, · · · , D do 2: reinitialize pj,k , αk ∀j, k 3: while pj,k , αk not converged do 4: αk → α0 + j Rd,j pj,k 5: pj,k → exp (ψ(ηj,k )−ψ(¯k )) exp ψ(αk ) η η i exp (ψ(ηj,i )−ψ(¯i )) exp ψ(αi ) exp (ψ(ηj,k )−ψ(¯k )) exp ψ(αd,k ) η η i exp (ψ(ηj,i )−ψ(¯i )) exp ψ(αd,i ) 6: end while 7: end for 8: ηj,k = η 0 + 9: end while end while ηj,k → (1 − dt)ηj,k + dt(η 0 + Rd,j pj,k ) 8: end for 6: 7: d Rd,j pd,j,k Figure 3: Here ηk = j ηj,k and ψ(x) is the digamma function so that exp ψ(x) is a smoothed ¯ threshold linear function. Before we move on to the neural network implementation, note that this standard formulation of variational inference for LDA utilizes a batch learning scheme that is not biologically plausible. Fortunately, an online version of this variational algorithm was recently proposed and shown to give 5 superior results when compared to the batch learning algorithm[21]. This algorithm replaces the sum over d in update equation for ηj,k with an incremental update based upon only the most recently observed pattern of spikes. See Fig. 3 Algorithm 2. 4.1 Neural Network Implementation Recall that the goal was to build a neural network that implements the VBEM algorithm for the underlying latent causes of a mixture of spikes using a neural code that represents the posterior distribution via a linear PPC. A linear PPC represents the natural parameters of a posterior distribution via a linear operation on neural activity. Since the primary quantity of interest here is the posterior distribution over odor concentrations, qc (c|α), this means that we need a pattern of activity rα which is linearly related to the αk ’s in the equations above. One way to accomplish this is to simply assume that the firing rates of output neurons are equal to the positive valued αk parameters. Fig. 4 depicts the overall network architecture. Input patterns of activity, R, are transmitted to the synapses of a population of output neurons which represent the αk ’s. The output activity is pooled to ¯ form an un-normalized prediction of the activity of each input neuron, Rj , given the output layer’s current state of belief about the latent causes of the Rj . The activity at each synapse targeted by input neuron j is then inhibited divisively by this prediction. This results in a dendrite that reports to the ¯ soma a quantity, Nj,k , which represents the fraction of unexplained spikes from input neuron j that could be explained by latent cause k. A continuous time dynamical system with this feature and the property that it shares its fixed points with the LDA algorithm is given by d ¯ Nj,k dt d αk dt ¯ ¯ = wj,k Rj − Rj Nj,k = (7) ¯ Nj,k exp (ψ (¯k )) (α0 − αk ) + exp (ψ (αk )) η (8) i ¯ where Rj = k wj,k exp (ψ (αk )), and wj,k = exp (ψ (ηj,k )). Note that, despite its form, it is Eq. 7 which implements the required divisive normalization operation since, in the steady state, ¯ ¯ Nj,k = wj,k Rj /Rj . Regardless, this network has a variety of interesting properties that align well with biology. It predicts that a balance of excitation and inhibition is maintained in the dendrites via divisive normalization and that the role of inhibitory neurons is to predict the input spikes which target individual dendrites. It also predicts superlinear facilitation. Specifically, the final term on the right of Eq. 8 indicates that more active cells will be more sensitive to their dendritic inputs. Alternatively, this could be implemented via recurrent excitation at the population level. In either case, this is the mechanism by which the network implements a sparse prior on topic concentrations and stands in stark contrast to the winner take all mechanisms which rely on competitive mutual inhibition mechanisms. Additionally, the ηj in Eq. 8 represents a cell wide ’leak’ parameter that indicates that the total leak should be ¯ roughly proportional to the sum total weight of the synapses which drive the neuron. This predicts that cells that are highly sensitive to input should also decay back to baseline more quickly. This implementation also predicts Hebbian learning of synaptic weights. To observe this fact, note that the online update rule for the ηj,k parameters can be implemented by simply correlating the activity at ¯ each synapse, Nj,k with activity at the soma αj via the equation: τL d ¯ wj,k = exp (ψ (¯k )) (η0 − 1/2 − wj,k ) + Nj,k exp ψ (αk ) η dt (9) where τL is a long time constant for learning and we have used the fact that exp (ψ (ηjk )) ≈ ηjk −1/2 for x > 1. For a detailed derivation see the supplementary material. 5 Dynamic Document Model LDA is a rather simple generative model that makes several unrealistic assumptions about mixtures of sensory and cortical spikes. In particular, it assumes both that there are no correlations between the 6 Targeted Divisive Normalization Targeted Divisive Normalization αj Ri Input Neurons Recurrent Connections ÷ ÷ -1 -1 Σ μj Nij Ri Synapses Output Neurons Figure 4: The LDA network model. Dendritically targeted inhibition is pooled from the activity of all neurons in the output layer and acts divisively. Σ jj' Nij Input Neurons Synapses Output Neurons Figure 5: DDM network model also includes recurrent connections which target the soma with both a linear excitatory signal and an inhibitory signal that also takes the form of a divisive normalization. intensities of latent causes and that there are no correlations between the intensities of latent causes in temporally adjacent trials or scenes. This makes LDA a rather poor computational model for a task like olfactory foraging which requires the animal to track the rise a fall of odor intensities as it navigates its environment. We can model this more complicated task by replacing the static cause or odor intensity parameters with dynamic odor intensity parameters whose behavior is governed by an exponentiated Ornstein-Uhlenbeck process with drift and diffusion matrices given by (Λ and ΣD ). We call this variant of LDA the Dynamic Document Model (DDM) as it could be used to model smooth changes in the distribution of topics over the course of a single document. 5.1 DDM Model Thus the generative model for the DDM is as follows: 1. For latent cause k = 1, . . . , K, (a) Cause distribution over spikes βk ∼ Dirichlet(η0 ) 2. For scene t = 1, . . . , T , (a) Log intensity of causes c(t) ∼ Normal(Λct−1 , ΣD ) (b) Number of spikes in neuron j resulting from cause k, Nj,k (t) ∼ Poisson(βj,k exp ck (t)) (c) Number of spikes in neuron j, Rj (t) = k Nj,k (t) This model bears many similarities to the Correlated and Dynamic topic models[22], but models dynamics over a short time scale, where the dynamic relationship (Λ, ΣD ) is important. 5.2 Network Implementation Once again the quantity of interest is the current distribution of latent causes, p(c(t)|R(τ ), τ = 0..T ). If no spikes occur then no evidence is presented and posterior inference over c(t) is simply given by an undriven Kalman filter with parameters (Λ, ΣD ). A recurrent neural network which uses a linear PPC to encode a posterior that evolves according to a Kalman filter has the property that neural responses are linearly related to the inverse covariance matrix of the posterior as well as that inverse covariance matrix times the posterior mean. In the absence of evidence, it is easy to show that these quantities must evolve according to recurrent dynamics which implement divisive normalization[10]. Thus, the patterns of neural activity which linearly encode them must do so as well. When a new spike arrives, optimal inference is no longer possible and a variational approximation must be utilized. As is shown in the supplement, this variational approximation is similar to the variational approximation used for LDA. As a result, a network which can divisively inhibit its synapses is able to implement approximate Bayesian inference. Curiously, this implies that the addition of spatial and temporal correlations to the latent causes adds very little complexity to the VB-PPC network implementation of probabilistic inference. All that is required is an additional inhibitory population which targets the somata in the output population. See Fig. 5. 7 Natural Parameters Natural Parameters (α) 0.4 200 450 180 0.3 Network Estimate Network Estimate 500 400 350 300 250 200 150 100 0.1 0 50 100 150 200 250 300 350 400 450 500 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 140 120 0.4 0.3 100 0.2 80 0.1 0 60 40 0.4 20 50 0 0 0.2 160 0 0 0.3 0.2 20 40 60 80 100 120 VBEM Estimate VBEM Estimate 140 160 180 200 0.1 0 Figure 6: (Left) Neural network approximation to the natural parameters of the posterior distribution over topics (the α’s) as a function of the VBEM estimate of those same parameters for a variety of ’documents’. (Center) Same as left, but for the natural parameters of the DDM (i.e the entries of the matrix Σ−1 (t) and Σ−1 µ(t) of the distribution over log topic intensities. (Right) Three example traces for cause intensity in the DDM. Black shows true concentration, blue and red (indistinguishable) show MAP estimates for the network and VBEM algorithms. 6 Experimental Results We compared the PPC neural network implementations of the variational inference with the standard VBEM algorithm. This comparison is necessary because the two algorithms are not guaranteed to converge to the same solution due to the fact that we only required that the neural network dynamics have the same fixed points as the standard VBEM algorithm. As a result, it is possible for the two algorithms to converge to different local minima of the KL divergence. For the network implementation of LDA we find good agreement between the neural network and VBEM estimates of the natural parameters of the posterior. See Fig. 6(left) which shows the two algorithms estimates of the shape parameter of the posterior distribution over topic (odor) concentrations (a quantity which is proportional to the expected concentration). This agreement, however, is not perfect, especially when posterior predicted concentrations are low. In part, this is due to the fact we are presenting the network with difficult inference problems for which the true posterior distribution over topics (odors) is highly correlated and multimodal. As a result, the objective function (KL divergence) is littered with local minima. Additionally, the discrete iterations of the VBEM algorithm can take very large steps in the space of natural parameters while the neural network implementation cannot. In contrast, the network implementation of the DDM is in much better agreement with the VBEM estimation. See Fig. 6(right). This is because the smooth temporal dynamics of the topics eliminate the need for the VBEM algorithm to take large steps. As a result, the smooth network dynamics are better able to accurately track the VBEM algorithms output. For simulation details please see the supplement. 7 Discussion and Conclusion In this work we presented a general framework for inference and learning with linear Probabilistic Population codes. This framework takes advantage of the fact that the Variational Bayesian Expectation Maximization algorithm generates approximate posterior distributions which are in an exponential family form. This is precisely the form needed in order to make probability distributions representable by a linear PPC. We then outlined a general means by which one can build a neural network implementation of the VB algorithm using this kind of neural code. We applied this VB-PPC framework to generate a biologically plausible neural network for spike train demixing. We chose this problem because it has many of the features of the canonical problem faced by nearly every layer of cortex, i.e. that of inferring the latent causes of complex mixtures of spike trains in the layer below. Curiously, this very complicated problem of probabilistic inference and learning ended up having a remarkably simple network solution, requiring only that neurons be capable of implementing divisive normalization via dendritically targeted inhibition and superlinear facilitation. Moreover, we showed that extending this approach to the more complex dynamic case in which latent causes change in intensity over time does not substantially increase the complexity of the neural circuit. Finally, we would like to note that, while we utilized a rate coding scheme for our linear PPC, the basic equations would still apply to any spike based log probability codes such as that considered Beorlin and Deneve[23]. 8 References [1] Daniel Kersten, Pascal Mamassian, and Alan Yuille. Object perception as Bayesian inference. Annual review of psychology, 55:271–304, January 2004. [2] Marc O Ernst and Martin S Banks. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429–33, 2002. [3] Yair Weiss, Eero P Simoncelli, and Edward H Adelson. Motion illusions as optimal percepts. Nature neuroscience, 5(6):598–604, 2002. [4] P N Sabes. The planning and control of reaching movements. Current opinion in neurobiology, 10(6): 740–6, 2000. o [5] Konrad P K¨ rding and Daniel M Wolpert. Bayesian integration in sensorimotor learning. Nature, 427 (6971):244–7, 2004. [6] Emanuel Todorov. Optimality principles in sensorimotor control. Nature neuroscience, 7(9):907–15, 2004. [7] Erno T´ gl´ s, Edward Vul, Vittorio Girotto, Michel Gonzalez, Joshua B Tenenbaum, and Luca L Bonatti. e a Pure reasoning in 12-month-old infants as probabilistic inference. Science (New York, N.Y.), 332(6033): 1054–9, 2011. [8] W.J. Ma, J.M. Beck, P.E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nature Neuroscience, 2006. [9] Jeffrey M Beck, Wei Ji Ma, Roozbeh Kiani, Tim Hanks, Anne K Churchland, Jamie Roitman, Michael N Shadlen, Peter E Latham, and Alexandre Pouget. Probabilistic population codes for Bayesian decision making. Neuron, 60(6):1142–52, 2008. [10] J. M. Beck, P. E. Latham, and a. Pouget. Marginalization in Neural Circuits with Divisive Normalization. Journal of Neuroscience, 31(43):15310–15319, 2011. [11] Tianming Yang and Michael N Shadlen. Probabilistic reasoning by neurons. Nature, 447(7148):1075–80, 2007. [12] RHS Carpenter and MLL Williams. Neural computation of log likelihood in control of saccadic eye movements. Nature, 1995. [13] Arnulf B a Graf, Adam Kohn, Mehrdad Jazayeri, and J Anthony Movshon. Decoding the activity of neuronal populations in macaque primary visual cortex. Nature neuroscience, 14(2):239–45, 2011. [14] HB Barlow. Pattern Recognition and the Responses of Sensory Neurons. Annals of the New York Academy of Sciences, 1969. [15] Wei Ji Ma, Vidhya Navalpakkam, Jeffrey M Beck, Ronald Van Den Berg, and Alexandre Pouget. Behavior and neural basis of near-optimal visual search. Nature Neuroscience, (May), 2011. [16] DJ Heeger. Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 1992. [17] M Carandini, D J Heeger, and J a Movshon. Linearity and normalization in simple cells of the macaque primary visual cortex. The Journal of neuroscience : the official journal of the Society for Neuroscience, 17(21):8621–44, 1997. [18] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. JMLR, 2003. [19] M. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Unit, UCL, 2003. [20] D D Lee and H S Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401 (6755):788–91, 1999. [21] M. Hoffman, D. Blei, and F. Bach. Online learning for Latent Dirichlet Allocation. In NIPS, 2010. [22] D. Blei and J. Lafferty. Dynamic topic models. In ICML, 2006. [23] M. Boerlin and S. Deneve. Spike-based population coding and working memory. PLOS computational biology, 2011. 9

same-paper 4 0.83919531 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

Author: Stefano Ermon, Ashish Sabharwal, Bart Selman, Carla P. Gomes

Abstract: Given a probabilistic graphical model, its density of states is a distribution that, for any likelihood value, gives the number of configurations with that probability. We introduce a novel message-passing algorithm called Density Propagation (DP) for estimating this distribution. We show that DP is exact for tree-structured graphical models and is, in general, a strict generalization of both sum-product and max-product algorithms. Further, we use density of states and tree decomposition to introduce a new family of upper and lower bounds on the partition function. For any tree decomposition, the new upper bound based on finer-grained density of state information is provably at least as tight as previously known bounds based on convexity of the log-partition function, and strictly stronger if a general condition holds. We conclude with empirical evidence of improvement over convex relaxations and mean-field based bounds. 1

5 0.83799648 65 nips-2012-Cardinality Restricted Boltzmann Machines

Author: Kevin Swersky, Ilya Sutskever, Daniel Tarlow, Richard S. Zemel, Ruslan Salakhutdinov, Ryan P. Adams

Abstract: The Restricted Boltzmann Machine (RBM) is a popular density model that is also good for extracting features. A main source of tractability in RBM models is that, given an input, the posterior distribution over hidden variables is factorizable and can be easily computed and sampled from. Sparsity and competition in the hidden representation is beneficial, and while an RBM with competition among its hidden units would acquire some of the attractive properties of sparse coding, such constraints are typically not added, as the resulting posterior over the hidden units seemingly becomes intractable. In this paper we show that a dynamic programming algorithm can be used to implement exact sparsity in the RBM’s hidden units. We also show how to pass derivatives through the resulting posterior marginals, which makes it possible to fine-tune a pre-trained neural network with sparse hidden layers. 1

6 0.76888728 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

7 0.76777589 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

8 0.76595932 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

9 0.76189321 168 nips-2012-Kernel Latent SVM for Visual Recognition

10 0.75971371 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

11 0.75688428 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

12 0.75686634 260 nips-2012-Online Sum-Product Computation Over Trees

13 0.7555629 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

14 0.75519031 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

15 0.75513154 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

16 0.75488681 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

17 0.75455862 197 nips-2012-Learning with Recursive Perceptual Representations

18 0.75439453 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

19 0.75429028 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

20 0.75356537 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines