nips nips2013 nips2013-258 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Justin Domke, Xianghang Liu
Abstract: Inference in general Ising models is difficult, due to high treewidth making treebased algorithms intractable. Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling.
Reference: text
sentIndex sentText sentNum sentScore
1 Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. [sent-8, score-0.187]
2 We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. [sent-9, score-0.185]
3 We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling. [sent-10, score-0.471]
4 For example, given some intractable distribution q, mean-field inference [14] attempts to minimize KL(p||q) over p ∈ TRACT, where TRACT is the set of fully-factorized distributions. [sent-13, score-0.098]
5 In different ways, loopy belief propagation [21] and tree-reweighted belief propagation [19] also make use of tree-based approximations, while Globerson and Jaakkola [6] provide an approximate inference method based on exact inference in planar graphs with zero field. [sent-15, score-0.34]
6 These are “fast mixing” models, or distributions that, while they may be high-treewidth, have parameter-space conditions guaranteeing that Gibbs sampling will quickly converge to the stationary distribution. [sent-17, score-0.114]
7 While the precise form of the parameter space conditions is slightly technical (Sections 2-3), informally, it is simply that interaction strengths between neighboring variables are not too strong. [sent-18, score-0.18]
8 In the context of the Ising model, we attempt to use these models in the most basic way possible: by taking an arbitrary (slow-mixing) set of parameters and projecting onto the fast-mixing set, using four different divergences. [sent-19, score-0.132]
9 Secondly, we experiment with projecting using the “zero-avoiding” divergence KL(q||p). [sent-21, score-0.224]
10 Third, we suggest a novel “piecewise” approximation of the KL divergence, where one drops edges from both q and p until a low-treewidth graph remains where the exact KL divergence can be calculated. [sent-23, score-0.222]
11 Since this requires expectations with respect to p, which is constrained to be fast-mixing, it can be approximated by Gibbs sampling, and the divergence can be minimized through stochastic approximation. [sent-26, score-0.195]
12 2 Background. The literature on mixing times in Markov chains is extensive, including a recent textbook [10]. [sent-28, score-0.29]
13 In this paper, we consider the classic Gibbs sampling method [5], where one starts with some configuration x, and repeatedly picks a node i, and samples xi from p(xi |x−i ). [sent-35, score-0.177]
14 It is common to use more sophisticated methods such as block Gibbs sampling, the Swendsen-Wang algorithm [18], or tree sampling [7]. [sent-37, score-0.123]
15 Here, we focus on the univariate case for simplicity and because fast mixing of univariate Gibbs is sufficient for fast mixing of some other methods [13]. [sent-39, score-0.69]
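To make the sampler described above concrete, here is a minimal sketch of random-scan single-site Gibbs sampling for an Ising model with ±1 variables; the function and variable names are illustrative (not from the paper), and the joint is taken to count each edge once with a symmetric coupling matrix.

```python
import numpy as np

def gibbs_sweep(x, beta, alpha, rng):
    """One random-scan sweep of single-site Gibbs sampling for an Ising model
    p(x) ~ exp(sum_{i<j} beta_ij x_i x_j + sum_i alpha_i x_i), x_i in {-1, +1}.

    beta  : symmetric (n, n) coupling matrix with zero diagonal
    alpha : length-n field vector
    """
    n = len(x)
    for _ in range(n):
        i = rng.integers(n)                          # pick a node uniformly at random
        local = beta[i] @ x + alpha[i]               # local field felt by x_i
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * local))  # p(x_i = +1 | x_{-i})
        x[i] = 1 if rng.random() < p_plus else -1
    return x

rng = np.random.default_rng(0)
n = 16
beta = 0.2 * np.triu(rng.standard_normal((n, n)), 1)
beta = beta + beta.T                                 # symmetric, zero diagonal
alpha = 0.5 * rng.standard_normal(n)
x = rng.choice([-1, 1], size=n)
for _ in range(100):                                 # 100 sweeps
    x = gibbs_sweep(x, beta, alpha, rng)
```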
16 The dependency Rij of xi on xj is defined by considering two configurations x and x′, and measuring how much the conditional distribution of xi can vary when xk = x′k for all k ≠ j. [sent-43, score-0.251]
17 Given some threshold ε, the mixing time is the number of iterations needed to guarantee that the total variation distance of the Gibbs chain to the stationary distribution is less than ε. [sent-46, score-0.375]
18 The mixing time τ(ε) is the minimum time t such that the total variation distance between X^t and the stationary distribution is at most ε. [sent-49, score-0.321]
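For reference, the total variation distance used in this definition can be computed directly on small state spaces; a minimal helper (illustrative, not from the paper):

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two discrete distributions,
    given as probability vectors over the same enumerated state space."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])
print(total_variation(p, q))  # 0.2
```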
19 Unfortunately, the mixing time can be extremely long, which makes the use of Gibbs sampling delicate in practice. [sent-51, score-0.32]
20 For example, for the two-dimensional Ising model with zero field and uniform interactions, it is known that mixing time is polynomial (in the size of the grid) when the interaction strengths are below a threshold βc , and exponential for stronger interactions [11]. [sent-52, score-0.5]
21 For Gibbs sampling with random updates, if ||R||2 < 1, the mixing time is bounded by τ(ε) ≤ n/(1 − ||R||2) · ln(n/ε). [sent-60, score-0.32]
22 Roughly speaking, if the spectral norm (maximum singular value) of R is less than one, rapid mixing will occur. [sent-61, score-0.542]
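A direct transcription of this bound as reconstructed above, τ(ε) ≤ n/(1 − ||R||2) · ln(n/ε); treat the exact constant as a sketch and the dependency matrix R as given:

```python
import numpy as np

def mixing_time_bound(R, eps):
    """Upper bound on the random-scan Gibbs mixing time when ||R||_2 < 1:
    tau(eps) <= n / (1 - ||R||_2) * ln(n / eps)."""
    n = R.shape[0]
    spec = np.linalg.norm(R, 2)   # spectral norm = largest singular value
    if spec >= 1.0:
        return np.inf             # the bound does not apply
    return n / (1.0 - spec) * np.log(n / eps)

R = 0.05 * np.ones((10, 10))           # toy dependency matrix, ||R||_2 = 0.5
print(mixing_time_bound(R, eps=0.01))  # 10 / 0.5 * ln(1000) ~ 138 updates
```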
23 Some of the classic ways of establishing fast mixing can be seen as special cases of this. [sent-63, score-0.304]
24 3 Mixing Time Bounds. For variables xi ∈ {−1, +1}, an Ising model is of the form p(x) = exp( Σ_{i,j} βij xi xj + Σ_i αi xi − A(β, α) ), where βij is the interaction strength between variables i and j, αi is the “field” for variable i, and A ensures normalization. [sent-66, score-0.376]
25 Thus, to summarize, an Ising model can be guaranteed to be fast mixing if the spectral norm of the absolute value of the interaction terms is less than one. [sent-70, score-0.552]
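The sufficient condition just summarized is easy to check numerically; a minimal sketch (names illustrative), here applied to a chain with uniform couplings:

```python
import numpy as np

def is_guaranteed_fast_mixing(beta, c=1.0):
    """Check the sufficient condition: spectral norm of |beta| less than c (= 1)."""
    return np.linalg.norm(np.abs(beta), 2) < c

n = 10
beta = np.zeros((n, n))
for i in range(n - 1):                # chain graph with coupling strength 0.4
    beta[i, i + 1] = beta[i + 1, i] = 0.4
print(np.linalg.norm(np.abs(beta), 2), is_guaranteed_fast_mixing(beta))  # ~0.78, True
```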
26 4 Projection In this section, we imagine that we have some set of parameters θ, not necessarily fast mixing, and would like to obtain another set of parameters ψ which are as close as possible to θ, but guaranteed to be fast mixing. [sent-71, score-0.242]
27 This section derives a projection in the Euclidean norm, while Section 5 will build on this to consider other divergence measures. [sent-72, score-0.24]
28 We will use the following standard result, which states that, given a matrix A, the closest matrix with spectral norm at most a given bound can be obtained by thresholding the singular values. [sent-73, score-0.262]
29 We denote this projection by B = Πc[A], where the minimization is over matrices B with ||B||2 ≤ c. [sent-76, score-0.1]
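A sketch of the projection Πc[A] by singular-value thresholding, per the standard result referred to above (illustrative code, not the authors' implementation):

```python
import numpy as np

def project_spectral_norm(A, c):
    """Pi_c[A]: the closest matrix to A (in Frobenius norm) with spectral norm <= c,
    obtained by clipping the singular values of A at c."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.minimum(s, c)) @ Vt

A = np.random.default_rng(0).standard_normal((5, 5))
B = project_spectral_norm(A, c=1.0)
print(np.linalg.norm(B, 2))  # <= 1 up to numerical error
```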
30 This is close to providing an algorithm for obtaining the closest set of Ising model parameters that obey a given spectral norm constraint. [sent-77, score-0.252]
31 First, in general, even if A is sparse, the projected matrix B will be dense, meaning that projecting will destroy a sparse graph structure. [sent-79, score-0.173]
32 Second, this result constrains the spectral norm of B itself, rather than R = |B|, which is what needs to be controlled. [sent-80, score-0.207]
33 Then, enforcing that B obeys the graph structure is equivalent to enforcing that Zij Bij = 0 for all (i, j). [sent-83, score-0.157]
34 Eq. 1 is equivalent to the problem of max_{M ≥ 0, Λ} g(Λ, M), where, writing D(Λ, M) = Πc[R + M − Λ ⊙ Z], the objective and gradients of g are g(Λ, M) = (1/2)||D(Λ, M) − R||_F² + Λ · (Z ⊙ D(Λ, M)), dg/dΛ = Z ⊙ D(Λ, M), and dg/dM = D(Λ, M). [sent-92, score-0.123]
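A rough sketch of the dual ascent these expressions suggest, using the reconstruction above; the elementwise use of Z, the step size, and the iteration count are assumptions for illustration, not details from the paper:

```python
import numpy as np

def project_spectral_norm(A, c):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.minimum(s, c)) @ Vt

def structured_projection(R, Z, c=1.0, steps=500, lr=0.1):
    """Projected (sub)gradient ascent on the dual variables (Lambda, M), M >= 0,
    with D(Lambda, M) = Pi_c[R + M - Lambda * Z] and the gradients quoted above."""
    Lam = np.zeros_like(R)
    M = np.zeros_like(R)
    for _ in range(steps):
        D = project_spectral_norm(R + M - Lam * Z, c)
        Lam = Lam + lr * (Z * D)               # dg/dLambda = Z o D(Lambda, M)
        M = np.maximum(0.0, M + lr * D)        # dg/dM = D(Lambda, M), keep M >= 0
    return project_spectral_norm(R + M - Lam * Z, c)

rng = np.random.default_rng(0)
R = 0.5 * np.abs(rng.standard_normal((6, 6)))
Z = (rng.random((6, 6)) < 0.5).astype(float)   # Z_ij = 1 marks entries forced to zero
B = structured_projection(R, Z)
```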
35 Formally, if Ψ is the set of parameters that we can guarantee to be fast mixing, and D(θ, ψ) is a divergence between θ and ψ, then we would like to solve arg min_{ψ ∈ Ψ} D(θ, ψ). (5) [sent-94, score-0.245]
36 As we will see, in selecting D there appears to be something of a trade-off between the quality of the approximation and the ease of computing the projection in Eq. 5. [sent-95, score-0.1]
37 5.1 Euclidean Distance. The simplest divergence is simply the ℓ2 distance between the parameter vectors, D(θ, ψ) = ||θ − ψ||2. [sent-103, score-0.173]
38 For the Ising model, Theorem 7 provides a method to compute the projection arg min_{ψ ∈ Ψ} ||θ − ψ||2. [sent-104, score-0.1]
39 However, it also forms the basis of our projected gradient descent strategy for computing the projection in Eq. [sent-106, score-0.187]
40 In each iteration, a few Markov chain steps are applied with the current parameters, and then the gradient is estimated using them. [sent-114, score-0.105]
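A schematic of that loop; `estimate_gradient` and `project_fast_mixing` are stand-ins (assumptions, not the paper's code) for the sampling-based gradient estimate and the Euclidean projection discussed in the text:

```python
import numpy as np

def stochastic_projected_descent(theta, estimate_gradient, project_fast_mixing,
                                 iters=1000, lr=0.05, seed=0):
    """Minimize D(theta, psi) over the fast-mixing set: take a stochastic gradient
    step in psi, then project psi back onto the fast-mixing parameter set."""
    rng = np.random.default_rng(seed)
    psi = project_fast_mixing(theta.copy())       # start from the projected target
    for _ in range(iters):
        g = estimate_gradient(theta, psi, rng)    # e.g. from a few Gibbs steps at psi
        psi = project_fast_mixing(psi - lr * g)   # projected gradient step
    return psi

# Toy usage with placeholder callables (illustrative only).
theta = np.array([0.8, -0.5, 0.3])
proj = lambda v: np.clip(v, -0.5, 0.5)            # stand-in "fast-mixing" set: a box
grad = lambda th, ps, rng: (ps - th) + 0.01 * rng.standard_normal(ps.shape)
print(stochastic_projected_descent(theta, grad, proj))
```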
41 5.2 KL-Divergence. Perhaps the most natural divergence to use would be the “inclusive” KL-divergence D(θ, ψ) = KL(θ||ψ) = Σ_x p(x; θ) log [ p(x; θ) / p(x; ψ) ]. [sent-118, score-0.14]
42 Thus, this divergence is only practical on low-treewidth “toy” graphs. [sent-124, score-0.14]
43 5.3 Piecewise KL-Divergences. Inspired by the piecewise likelihood [17] and likelihood approximations based on mixtures of trees [15], we seek tractable approximations of the KL-divergence based on tractable subgraphs. [sent-126, score-0.45]
44 Thus, given some graph T, we define the “projection” θ(T) onto the tree by setting all edge parameters not in T to zero. [sent-128, score-0.274]
45 Then, given a set of graphs T, the piecewise KL-divergence is D(θ, ψ) = max_T KL(θ(T)||ψ(T)). [sent-129, score-0.323]
46 Computing the derivative of this divergence is not hard: one simply computes the KL-divergence for each graph and uses the gradient as in Eq. [sent-130, score-0.191]
47 In the simplest case, one could simply select a set of trees (ensuring that each edge is covered by at least one tree), which makes it easy to compute the KL-divergence on each tree using the sum-product algorithm. [sent-133, score-0.125]
48 We will also experiment with selecting low-treewidth graphs, where exact inference can take place using the junction tree algorithm. [sent-134, score-0.128]
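To make the piecewise divergence concrete on a small model, here is a brute-force sketch; enumeration stands in for the sum-product/junction-tree computations mentioned above, and all names and the toy subgraphs are illustrative assumptions:

```python
import itertools
import numpy as np

def ising_log_probs(beta, alpha):
    """Normalized log-probabilities of all 2^n configurations (small n only)."""
    n = len(alpha)
    X = np.array(list(itertools.product([-1, 1], repeat=n)), dtype=float)
    logp = 0.5 * np.einsum('ki,ij,kj->k', X, beta, X) + X @ alpha
    return logp - np.logaddexp.reduce(logp)

def restrict(beta, edges, n):
    """theta(T): keep only the edge parameters belonging to the subgraph T."""
    betaT = np.zeros((n, n))
    for i, j in edges:
        betaT[i, j] = betaT[j, i] = beta[i, j]
    return betaT

def piecewise_kl(beta_t, alpha_t, beta_p, alpha_p, subgraphs):
    """max over subgraphs T of KL(theta(T) || psi(T)), by enumeration."""
    n = len(alpha_t)
    vals = []
    for T in subgraphs:
        lp = ising_log_probs(restrict(beta_t, T, n), alpha_t)
        lq = ising_log_probs(restrict(beta_p, T, n), alpha_p)
        vals.append(np.sum(np.exp(lp) * (lp - lq)))
    return max(vals)

rng = np.random.default_rng(0)
n = 4
beta = np.triu(rng.standard_normal((n, n)), 1)
beta = beta + beta.T
alpha = rng.standard_normal(n)
trees = [{(0, 1), (1, 2), (2, 3)}, {(0, 2), (0, 3), (1, 3)}]   # two example spanning trees
print(piecewise_kl(beta, alpha, 0.5 * beta, alpha, trees))
```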
49 The divergence D(θ, ψ) = KL(ψ||θ) has the gradient (d/dψ) D(θ, ψ) = Σ_x p(x; ψ) ((ψ − θ) · f(x)) (f(x) − µ(ψ)). [sent-138, score-0.195]
50 Arguably, using this divergence is inferior to the “zero-avoiding” KL-divergence. [sent-139, score-0.14]
51 For example, since the parameters ψ may fail to put significant probability at configurations where θ does, using importance sampling to reweight samples from ψ to estimate expectations with respect to θ could have high variance. Further, it can be non-convex with respect to ψ. [sent-140, score-0.216]
52 Minimizing this divergence under the constraint that the dependency matrix R corresponding to ψ have a limited spectral norm is closely related to naive mean-field, which can be seen as a degenerate case where one constrains R to have zero norm. [sent-142, score-0.406]
53 On the other hand, this divergence is convenient to work with since it involves taking expectations with respect to ψ, rather than θ: since ψ is enforced to be fast-mixing, these expectations can be approximated by sampling. [sent-144, score-0.11]
54 Then, one can first approximate the marginals by µ̂ = (1/K) Σ_{k=1}^{K} f(x^k), and then approximate the gradient by ĝ = (1/K) Σ_{k=1}^{K} ((ψ − θ) · f(x^k)) (f(x^k) − µ̂). [sent-149, score-0.219]
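A direct transcription of this estimator (a sketch; the feature map f stacking pairwise products and singleton values, and the random ±1 samples in the usage, are illustrative assumptions — in the method the samples would come from Gibbs sampling at ψ):

```python
import numpy as np

def ising_features(x):
    """Sufficient statistics f(x): upper-triangular pairwise products x_i x_j,
    followed by the singleton values x_i."""
    n = len(x)
    pairs = np.array([x[i] * x[j] for i in range(n) for j in range(i + 1, n)])
    return np.concatenate([pairs, x])

def grad_estimate(samples, psi, theta):
    """ghat = (1/K) sum_k ((psi - theta) . f(x^k)) (f(x^k) - muhat),
    with muhat = (1/K) sum_k f(x^k) and x^k drawn from p(.; psi)."""
    F = np.array([ising_features(x) for x in samples])   # (K, d)
    muhat = F.mean(axis=0)
    weights = F @ (psi - theta)                          # (K,)
    return (weights[:, None] * (F - muhat)).mean(axis=0)

rng = np.random.default_rng(0)
n, K = 5, 200
samples = rng.choice([-1, 1], size=(K, n))
d = n * (n - 1) // 2 + n
psi = 0.1 * rng.standard_normal(d)
theta = 0.2 * rng.standard_normal(d)
print(grad_estimate(samples, psi, theta).shape)  # (15,)
```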
55 Figure 1 (x-axis: interaction strength): The mean error of estimated univariate marginals on 8×8 grids (top row) and low-density random graphs (bottom row), comparing 30k iterations of Gibbs sampling after projection to variational methods. [sent-206, score-0.481]
56 To approximate the computational effort of projection (Table 1), sampling on the original parameters with 250k iterations is also included as a lower curve. [sent-207, score-0.34]
57 In the experiments, we approximate randomly generated Ising models with rapid-mixing distributions using the projection algorithms described previously. [sent-210, score-0.13]
58 Then, the marginals of the rapid-mixing approximate distributions are compared against those of the target distributions by running a Gibbs chain on each. [sent-211, score-0.192]
59 We use the mean absolute distance between the estimated and true marginals as the accuracy measure, with the true marginals computed via the exact junction-tree algorithm. [sent-212, score-0.278]
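The accuracy measure, spelled out as a small sketch (assuming marginals are represented as P(x_i = +1) for each variable; an assumption for illustration):

```python
import numpy as np

def mean_abs_marginal_error(estimated, exact):
    """Mean absolute difference of univariate marginals P(x_i = +1)."""
    return np.mean(np.abs(np.asarray(estimated) - np.asarray(exact)))

print(mean_abs_marginal_error([0.90, 0.40, 0.55], [0.85, 0.50, 0.50]))  # ~0.0667
```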
60 We evaluate projecting under the Euclidean distance (Section 5.1). [sent-213, score-0.117]
61 On small graphs, it is possible to minimize the zero-avoiding KL-divergence KL(θ||ψ) by computing marginals using the junction-tree algorithm. [sent-217, score-0.108]
62 However, as minimizing this KL-divergence leads to exact marginal estimates, it doesn’t provide a useful measure of marginal accuracy. [sent-218, score-0.139]
63 Our methods are compared with four other inference algorithms, namely loopy belief propagation (LBP), tree-reweighted belief propagation (TRW), naive mean-field (MF), and Gibbs sampling on the original parameters. [sent-219, score-0.251]
64 The MF algorithm uses a fully factorized distribution as the tractable family, and can be viewed as an extreme case of minimizing the zero forcing KL-divergence KL(ψ||θ) under the constraint of zero spectral norm. [sent-221, score-0.2]
65 The tractable family that it uses guarantees “instant” mixing but is much more restrictive. [sent-222, score-0.345]
66 Theoretically, Gibbs sampling on the original parameters will produce highly accurate marginals if run long enough. [sent-223, score-0.315]
67 In contrast, Gibbs sampling on the rapid-mixing approximation is guaranteed to converge rapidly but will result in less accurate marginals asymptotically. [sent-225, score-0.245]
68 The random graph is based on an edge density of 0. [sent-248, score-0.128]
69 6.1 Configurations. Two types of graph topologies are used: two-dimensional 8 × 8 grids and random graphs with 10 nodes. [sent-253, score-0.176]
70 Node parameters θi are uniformly drawn from unif(−dn , dn ) and we fix the field strength to dn = 1. [sent-258, score-0.27]
71 Edge parameters θij are uniformly drawn from unif(−de , de ) or unif(0, de ) to obtain mixed or attractive interactions respectively. [sent-260, score-0.209]
72 We generate graphs with different interaction strength de = 0, 0. [sent-261, score-0.346]
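A sketch of the parameter generation just described, for the grid case; the truncated list of edge strengths above is not reproduced, so d_e is left as an argument, and all names are illustrative:

```python
import numpy as np

def random_grid_ising(side, d_e, d_n=1.0, attractive=False, seed=0):
    """Random Ising parameters on a side x side grid:
    theta_i ~ unif(-d_n, d_n); theta_ij ~ unif(0, d_e) if attractive, else unif(-d_e, d_e)."""
    rng = np.random.default_rng(seed)
    n = side * side
    alpha = rng.uniform(-d_n, d_n, size=n)
    beta = np.zeros((n, n))
    for r in range(side):
        for c in range(side):
            i = r * side + c
            neighbors = []
            if c + 1 < side:
                neighbors.append(i + 1)        # right neighbor
            if r + 1 < side:
                neighbors.append(i + side)     # neighbor below
            for j in neighbors:
                w = rng.uniform(0.0, d_e) if attractive else rng.uniform(-d_e, d_e)
                beta[i, j] = beta[j, i] = w
    return beta, alpha

beta, alpha = random_grid_ising(side=8, d_e=2.0, attractive=True)
```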
73 To calculate piecewise divergences, it remains to specify the set of subgraphs T . [sent-267, score-0.248]
74 Each element of T can be any tractable subgraph of the original graph. [sent-268, score-0.118]
75 For random graphs, we use as the set of subgraphs a collection of random spanning trees that together cover every edge of the original graph. [sent-271, score-0.204]
76 For each original or approximate distribution, a single chain of Gibbs sampling is run on the final parameters, and marginals are estimated from the samples drawn. [sent-280, score-0.359]
77 Note that this does not take into account the computational effort deployed during projection, which ranges from 30,000 total Gibbs iterations with repeated Euclidean projection (KL(ψ||θ)) to none at all (Original parameters). [sent-282, score-0.135]
78 As shown above, for Ising models a sufficient condition for rapid mixing is that the spectral norm of the pairwise weight matrix is less than 1. [sent-285, score-0.173]
79 However, we find in practice using a spectral norm bound of 2. [sent-287, score-0.173]
80 7 Discussion. Inference in high-treewidth graphical models is intractable, which has motivated several classes of approximations based on tractable families. [sent-291, score-0.101]
81 In this paper, we have proposed a new notion of “tractability”, insisting not that a graph has a fast algorithm for exact inference, but only that it obeys parameter-space conditions ensuring that Gibbs sampling will converge rapidly to the stationary distribution. [sent-292, score-0.287]
82 For the case of Ising models, we use a simple condition that can guarantee rapid mixing, namely that the spectral norm of the matrix of interaction strengths is less than one. [sent-293, score-0.415]
83 Figure 2 (x-axis: number of samples, log scale): Example plots of the accuracy of obtained marginals vs. number of samples. [sent-320, score-0.188]
84 Given an intractable set of parameters, we consider using this approximate family by “projecting” the intractable distribution onto it under several divergences. [sent-325, score-0.21]
85 First, we consider the Euclidean distance of parameters, and derive a dual algorithm to solve the projection, based on an iterative thresholding of the singular value decomposition. [sent-326, score-0.122]
86 This requires a stochastic approximation approach where one repeatedly generates samples from the model, and projects in the Euclidean norm after taking a gradient step. [sent-330, score-0.203]
87 We compare experimentally to Gibbs sampling on the original parameters, along with several standard variational methods. [sent-331, score-0.163]
88 Given enough time, Gibbs sampling using the original parameters will always be more accurate, but with finite time, projecting onto the fast-mixing set generally gives better results. [sent-333, score-0.307]
89 First, one must find a bound on the dependency matrix for general MRFs, and secondly, an algorithm is needed to project onto the fast-mixing set defined by this bound. [sent-336, score-0.107]
90 For example, if one is doing maximum likelihood learning using MCMC to estimate the likelihood gradient, it would be natural to constrain the parameters to a fast-mixing set. [sent-340, score-0.352]
91 One weakness of the proposed approach is the apparent looseness of the spectral norm bound. [sent-341, score-0.173]
92 For the two-dimensional Ising model with no univariate terms and a constant interaction strength β, there is a well-known threshold βc = (1/2) ln(1 + √2) ≈ 0.4407. [sent-342, score-0.312]
93 This value is obtained using more advanced techniques than the spectral norm [11]. [sent-343, score-0.173]
94 Roughly, for β < βc , mixing is known to occur quickly (polynomial in the grid size) while for β > βc , mixing is exponential. [sent-344, score-0.568]
95 On the other hand, the spectral norm bound will be equal to one for β = . [sent-345, score-0.173]
96 A tighter bound on when rapid mixing will occur would be more informative. [sent-349, score-0.309]
97 Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. [sent-374, score-0.262]
98 A simple condition implying rapid mixing of single-site dynamics on spin systems. [sent-388, score-0.348]
99 Convergent message-passing algorithms for inference over general graphs with convex free energies. [sent-391, score-0.124]
100 A generalized mean field algorithm for variational inference in exponential families. [sent-443, score-0.115]
wordName wordTfidf (topN-words)
[('kl', 0.375), ('trw', 0.292), ('tw', 0.281), ('gibbs', 0.262), ('piecewise', 0.248), ('mixing', 0.247), ('ising', 0.229), ('lbp', 0.187), ('strength', 0.15), ('divergence', 0.14), ('euclidean', 0.125), ('interaction', 0.121), ('field', 0.115), ('marginals', 0.108), ('projection', 0.1), ('spectral', 0.09), ('projecting', 0.084), ('norm', 0.083), ('edge', 0.075), ('graphs', 0.075), ('loopy', 0.075), ('grid', 0.074), ('sampling', 0.073), ('attractive', 0.071), ('rij', 0.071), ('tract', 0.07), ('svds', 0.07), ('zij', 0.065), ('tractable', 0.064), ('eld', 0.063), ('rapid', 0.062), ('xk', 0.061), ('singular', 0.06), ('dependency', 0.059), ('strengths', 0.059), ('fast', 0.057), ('divergences', 0.056), ('marginal', 0.055), ('expectations', 0.055), ('original', 0.054), ('chain', 0.054), ('unif', 0.053), ('graph', 0.053), ('dyer', 0.053), ('gradient', 0.051), ('pool', 0.05), ('tree', 0.05), ('intractable', 0.049), ('inference', 0.049), ('grids', 0.048), ('ergodic', 0.048), ('mf', 0.048), ('onto', 0.048), ('parameters', 0.048), ('mixed', 0.047), ('forcing', 0.046), ('bp', 0.044), ('mirror', 0.044), ('nicta', 0.043), ('chains', 0.043), ('interactions', 0.043), ('ij', 0.042), ('gurations', 0.042), ('univariate', 0.041), ('stationary', 0.041), ('sii', 0.041), ('samples', 0.04), ('spin', 0.039), ('bij', 0.039), ('treewidth', 0.037), ('approximations', 0.037), ('dn', 0.036), ('secondly', 0.036), ('variational', 0.036), ('projected', 0.036), ('markov', 0.036), ('peres', 0.036), ('yuval', 0.036), ('scan', 0.036), ('mrfs', 0.036), ('dg', 0.036), ('xi', 0.035), ('effort', 0.035), ('enforcing', 0.035), ('obeys', 0.034), ('stuart', 0.034), ('constrains', 0.034), ('family', 0.034), ('distance', 0.033), ('tommi', 0.033), ('planar', 0.033), ('accurate', 0.032), ('guaranteed', 0.032), ('obey', 0.031), ('exponential', 0.03), ('globerson', 0.03), ('approximate', 0.03), ('repeatedly', 0.029), ('exact', 0.029), ('thresholding', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing
Author: Justin Domke, Xianghang Liu
Abstract: Inference in general Ising models is difficult, due to high treewidth making treebased algorithms intractable. Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling.
2 0.21046454 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
Author: Tamir Hazan, Subhransu Maji, Tommi Jaakkola
Abstract: In this paper we describe how MAP inference can be used to sample efficiently from Gibbs distributions. Specifically, we provide means for drawing either approximate or unbiased samples from Gibbs’ distributions by introducing low dimensional perturbations and solving the corresponding MAP assignments. Our approach also leads to new ways to derive lower bounds on partition functions. We demonstrate empirically that our method excels in the typical “high signal high coupling” regime. The setting results in ragged energy landscapes that are challenging for alternative approaches to sampling and/or lower bounds. 1
3 0.1443319 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
Author: Ilya O. Tolstikhin, Yevgeny Seldin
Abstract: We present a PAC-Bayes-Empirical-Bernstein inequality. The inequality is based on a combination of the PAC-Bayesian bounding technique with an Empirical Bernstein bound. We show that when the empirical variance is significantly smaller than the empirical loss the PAC-Bayes-Empirical-Bernstein inequality is significantly tighter than the PAC-Bayes-kl inequality of Seeger (2002) and otherwise it is comparable. Our theoretical analysis is confirmed empirically on a synthetic example and several UCI datasets. The PAC-Bayes-Empirical-Bernstein inequality is an interesting example of an application of the PAC-Bayesian bounding technique to self-bounding functions. 1
4 0.13226143 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models
Author: Jie Liu, David Page
Abstract: In large-scale applications of undirected graphical models, such as social networks and biological networks, similar patterns occur frequently and give rise to similar parameters. In this situation, it is beneficial to group the parameters for more efficient learning. We show that even when the grouping is unknown, we can infer these parameter groups during learning via a Bayesian approach. We impose a Dirichlet process prior on the parameters. Posterior inference usually involves calculating intractable terms, and we propose two approximation algorithms, namely a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with “stripped” Beta approximation (Gibbs SBA). Simulations show that both algorithms outperform conventional maximum likelihood estimation (MLE). Gibbs SBA’s performance is close to Gibbs sampling with exact likelihood calculation. Models learned with Gibbs SBA also generalize better than the models learned by MLE on real-world Senate voting data. 1
5 0.12913661 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling
Author: Matthew Johnson, James Saunderson, Alan Willsky
Abstract: Sampling inference methods are computationally difficult to scale for many models in part because global dependencies can reduce opportunities for parallel computation. Without strict conditional independence structure among variables, standard Gibbs sampling theory requires sample updates to be performed sequentially, even if dependence between most variables is not strong. Empirical work has shown that some models can be sampled effectively by going “Hogwild” and simply running Gibbs updates in parallel with only periodic global communication, but the successes and limitations of such a strategy are not well understood. As a step towards such an understanding, we study the Hogwild Gibbs sampling strategy in the context of Gaussian distributions. We develop a framework which provides convergence conditions and error bounds along with simple proofs and connections to methods in numerical linear algebra. In particular, we show that if the Gaussian precision matrix is generalized diagonally dominant, then any Hogwild Gibbs sampler, with any update schedule or allocation of variables to processors, yields a stable sampling process with the correct sample mean. 1
6 0.1200863 143 nips-2013-Integrated Non-Factorized Variational Inference
7 0.11893788 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing
8 0.1034807 318 nips-2013-Structured Learning via Logistic Regression
9 0.09145809 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions
10 0.089318745 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
11 0.087499149 161 nips-2013-Learning Stochastic Inverses
12 0.086703397 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
13 0.077198036 306 nips-2013-Speeding up Permutation Testing in Neuroimaging
14 0.076412477 257 nips-2013-Projected Natural Actor-Critic
15 0.076255955 184 nips-2013-Marginals-to-Models Reducibility
16 0.07283809 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
17 0.072202422 36 nips-2013-Annealing between distributions by averaging moments
18 0.072173484 66 nips-2013-Computing the Stationary Distribution Locally
19 0.070839658 185 nips-2013-Matrix Completion From any Given Set of Observations
20 0.069173746 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes
topicId topicWeight
[(0, 0.213), (1, 0.053), (2, 0.002), (3, 0.058), (4, 0.011), (5, 0.154), (6, 0.112), (7, -0.011), (8, 0.112), (9, -0.033), (10, 0.067), (11, 0.031), (12, 0.066), (13, -0.009), (14, 0.058), (15, -0.001), (16, -0.109), (17, -0.075), (18, -0.049), (19, 0.111), (20, -0.08), (21, -0.023), (22, 0.018), (23, -0.021), (24, -0.056), (25, 0.092), (26, 0.035), (27, -0.034), (28, 0.011), (29, 0.144), (30, 0.064), (31, -0.107), (32, -0.004), (33, -0.017), (34, 0.051), (35, -0.043), (36, -0.013), (37, -0.069), (38, 0.117), (39, -0.077), (40, 0.056), (41, 0.141), (42, 0.057), (43, 0.094), (44, 0.084), (45, -0.033), (46, 0.035), (47, 0.018), (48, -0.032), (49, 0.105)]
simIndex simValue paperId paperTitle
same-paper 1 0.96224326 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing
Author: Justin Domke, Xianghang Liu
Abstract: Inference in general Ising models is difficult, due to high treewidth making treebased algorithms intractable. Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling.
2 0.843826 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
Author: Tamir Hazan, Subhransu Maji, Tommi Jaakkola
Abstract: In this paper we describe how MAP inference can be used to sample efficiently from Gibbs distributions. Specifically, we provide means for drawing either approximate or unbiased samples from Gibbs’ distributions by introducing low dimensional perturbations and solving the corresponding MAP assignments. Our approach also leads to new ways to derive lower bounds on partition functions. We demonstrate empirically that our method excels in the typical “high signal high coupling” regime. The setting results in ragged energy landscapes that are challenging for alternative approaches to sampling and/or lower bounds. 1
3 0.77953601 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models
Author: Jie Liu, David Page
Abstract: In large-scale applications of undirected graphical models, such as social networks and biological networks, similar patterns occur frequently and give rise to similar parameters. In this situation, it is beneficial to group the parameters for more efficient learning. We show that even when the grouping is unknown, we can infer these parameter groups during learning via a Bayesian approach. We impose a Dirichlet process prior on the parameters. Posterior inference usually involves calculating intractable terms, and we propose two approximation algorithms, namely a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with “stripped” Beta approximation (Gibbs SBA). Simulations show that both algorithms outperform conventional maximum likelihood estimation (MLE). Gibbs SBA’s performance is close to Gibbs sampling with exact likelihood calculation. Models learned with Gibbs SBA also generalize better than the models learned by MLE on real-world Senate voting data. 1
4 0.77405626 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling
Author: Matthew Johnson, James Saunderson, Alan Willsky
Abstract: Sampling inference methods are computationally difficult to scale for many models in part because global dependencies can reduce opportunities for parallel computation. Without strict conditional independence structure among variables, standard Gibbs sampling theory requires sample updates to be performed sequentially, even if dependence between most variables is not strong. Empirical work has shown that some models can be sampled effectively by going “Hogwild” and simply running Gibbs updates in parallel with only periodic global communication, but the successes and limitations of such a strategy are not well understood. As a step towards such an understanding, we study the Hogwild Gibbs sampling strategy in the context of Gaussian distributions. We develop a framework which provides convergence conditions and error bounds along with simple proofs and connections to methods in numerical linear algebra. In particular, we show that if the Gaussian precision matrix is generalized diagonally dominant, then any Hogwild Gibbs sampler, with any update schedule or allocation of variables to processors, yields a stable sampling process with the correct sample mean. 1
5 0.68663937 161 nips-2013-Learning Stochastic Inverses
Author: Andreas Stuhlmüller, Jacob Taylor, Noah Goodman
Abstract: We describe a class of algorithms for amortized inference in Bayesian networks. In this setting, we invest computation upfront to support rapid online inference for a wide range of queries. Our approach is based on learning an inverse factorization of a model’s joint distribution: a factorization that turns observations into root nodes. Our algorithms accumulate information to estimate the local conditional distributions that constitute such a factorization. These stochastic inverses can be used to invert each of the computation steps leading to an observation, sampling backwards in order to quickly find a likely explanation. We show that estimated inverses converge asymptotically in number of (prior or posterior) training samples. To make use of inverses before convergence, we describe the Inverse MCMC algorithm, which uses stochastic inverses to make block proposals for a Metropolis-Hastings sampler. We explore the efficiency of this sampler for a variety of parameter regimes and Bayes nets. 1
6 0.66534704 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing
7 0.6412071 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions
8 0.55519462 36 nips-2013-Annealing between distributions by averaging moments
9 0.53440905 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
10 0.52945518 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
11 0.51766574 184 nips-2013-Marginals-to-Models Reducibility
12 0.51550543 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
13 0.51022333 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models
14 0.50680137 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
15 0.49252936 160 nips-2013-Learning Stochastic Feedforward Neural Networks
16 0.48567525 143 nips-2013-Integrated Non-Factorized Variational Inference
17 0.48231697 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation
18 0.47447157 306 nips-2013-Speeding up Permutation Testing in Neuroimaging
19 0.47047392 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions
20 0.46943197 168 nips-2013-Learning to Pass Expectation Propagation Messages
topicId topicWeight
[(16, 0.034), (33, 0.184), (34, 0.089), (41, 0.412), (56, 0.088), (70, 0.02), (85, 0.036), (89, 0.023), (93, 0.02), (95, 0.02)]
simIndex simValue paperId paperTitle
1 0.93730175 257 nips-2013-Projected Natural Actor-Critic
Author: Philip S. Thomas, William C. Dabney, Stephen Giguere, Sridhar Mahadevan
Abstract: Natural actor-critics form a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability—their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural ActorCritics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent. 1
2 0.89049339 189 nips-2013-Message Passing Inference with Chemical Reaction Networks
Author: Nils E. Napp, Ryan P. Adams
Abstract: Recent work on molecular programming has explored new possibilities for computational abstractions with biomolecules, including logic gates, neural networks, and linear systems. In the future such abstractions might enable nanoscale devices that can sense and control the world at a molecular scale. Just as in macroscale robotics, it is critical that such devices can learn about their environment and reason under uncertainty. At this small scale, systems are typically modeled as chemical reaction networks. In this work, we develop a procedure that can take arbitrary probabilistic graphical models, represented as factor graphs over discrete random variables, and compile them into chemical reaction networks that implement inference. In particular, we show that marginalization based on sum-product message passing can be implemented in terms of reactions between chemical species whose concentrations represent probabilities. We show algebraically that the steady state concentration of these species correspond to the marginal distributions of the random variables in the graph and validate the results in simulations. As with standard sum-product inference, this procedure yields exact results for tree-structured graphs, and approximate solutions for loopy graphs.
3 0.87945217 193 nips-2013-Mixed Optimization for Smooth Functions
Author: Mehrdad Mahdavi, Lijun Zhang, Rong Jin
Abstract: It is well known that the optimal convergence rate for stochastic optimization of smooth functions is O(1/√T), which is the same as for stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of O(1/T²). In this work, we consider a new setup for optimizing smooth functions, termed Mixed Optimization, which allows access to both a stochastic oracle and a full gradient oracle. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of accesses to the full gradient oracle. We show that, with O(ln T) calls to the full gradient oracle and O(T) calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of O(1/T). 1
same-paper 4 0.85892296 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing
Author: Justin Domke, Xianghang Liu
Abstract: Inference in general Ising models is difficult, due to high treewidth making treebased algorithms intractable. Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling.
5 0.79393619 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients
Author: Lijun Zhang, Mehrdad Mahdavi, Rong Jin
Abstract: For smooth and strongly convex optimization, the optimal iteration complexity of the gradient-based algorithm is O(√κ log 1/ε), where κ is the condition number. In the case that the optimization problem is ill-conditioned, we need to evaluate a large number of full gradients, which could be computationally expensive. In this paper, we propose to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full and stochastic gradients to update the intermediate solution. Theoretical analysis shows that EMGD is able to find an ε-optimal solution by computing O(log 1/ε) full gradients and O(κ² log 1/ε) stochastic gradients. 1
6 0.78617984 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models
7 0.77325869 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing
8 0.72471666 11 nips-2013-A New Convex Relaxation for Tensor Completion
9 0.71162647 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval
10 0.65246433 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction
11 0.6351729 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives
12 0.61145443 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
13 0.60682958 54 nips-2013-Bayesian optimization explains human active search
14 0.5966053 113 nips-2013-Exact and Stable Recovery of Pairwise Interaction Tensors
15 0.58768988 184 nips-2013-Marginals-to-Models Reducibility
16 0.58371359 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
17 0.58260554 301 nips-2013-Sparse Additive Text Models with Low Rank Background
18 0.57717997 295 nips-2013-Simultaneous Rectification and Alignment via Robust Recovery of Low-rank Tensors
19 0.57657379 318 nips-2013-Structured Learning via Logistic Regression
20 0.57495493 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables