nips nips2013 nips2013-34 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Matthew Johnson, James Saunderson, Alan Willsky
Abstract: Sampling inference methods are computationally difficult to scale for many models in part because global dependencies can reduce opportunities for parallel computation. Without strict conditional independence structure among variables, standard Gibbs sampling theory requires sample updates to be performed sequentially, even if dependence between most variables is not strong. Empirical work has shown that some models can be sampled effectively by going “Hogwild” and simply running Gibbs updates in parallel with only periodic global communication, but the successes and limitations of such a strategy are not well understood. As a step towards such an understanding, we study the Hogwild Gibbs sampling strategy in the context of Gaussian distributions. We develop a framework which provides convergence conditions and error bounds along with simple proofs and connections to methods in numerical linear algebra. In particular, we show that if the Gaussian precision matrix is generalized diagonally dominant, then any Hogwild Gibbs sampler, with any update schedule or allocation of variables to processors, yields a stable sampling process with the correct sample mean. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Sampling inference methods are computationally difficult to scale for many models in part because global dependencies can reduce opportunities for parallel computation. [sent-6, score-0.107]
2 Without strict conditional independence structure among variables, standard Gibbs sampling theory requires sample updates to be performed sequentially, even if dependence between most variables is not strong. [sent-7, score-0.154]
3 Empirical work has shown that some models can be sampled effectively by going “Hogwild” and simply running Gibbs updates in parallel with only periodic global communication, but the successes and limitations of such a strategy are not well understood. [sent-8, score-0.165]
4 As a step towards such an understanding, we study the Hogwild Gibbs sampling strategy in the context of Gaussian distributions. [sent-9, score-0.17]
5 In particular, we show that if the Gaussian precision matrix is generalized diagonally dominant, then any Hogwild Gibbs sampler, with any update schedule or allocation of variables to processors, yields a stable sampling process with the correct sample mean. [sent-11, score-0.523]
6 1 Introduction: Scaling probabilistic inference algorithms to large datasets and parallel computing architectures is a challenge of great importance and considerable current research interest, and great strides have been made in designing parallelizable algorithms. [sent-12, score-0.139]
7 Along with the powerful and sometimes complex new algorithms, a very simple strategy has proven to be surprisingly successful in some situations: running Gibbs sampling updates, derived only for the sequential setting, in parallel without globally synchronizing the sampler state after each update. [sent-13, score-0.326]
8 We refer to this strategy as “Hogwild Gibbs sampling” in reference to recent work [1] in which sequential computations for computing gradient steps were applied in parallel (without global coordination) to great beneficial effect. [sent-15, score-0.192]
9 There have been recent advances in understanding some of the particular structure of AD-LDA [6], but a thorough theoretical explanation for the effectiveness and limitations of Hogwild Gibbs sampling is far from complete. [sent-18, score-0.195]
10 Sampling inference algorithms for complex Bayesian models have notoriously resisted theoretical analysis, so to begin an analysis of Hogwild Gibbs sampling we consider a restricted class of models that is especially tractable for analysis: Gaussians. [sent-19, score-0.133]
11 Further, Gaussian sampling is of great interest in its own right, and there is active research in developing powerful Gaussian samplers [7, 8, 9, 10]. [sent-21, score-0.175] [sent-42, score-0.127]
12–14 (Algorithm 1, Hogwild Gibbs Sampling, is interleaved with this sentence in the extraction. Require: samplers Gi(x̄¬i) that sample p(xi | x¬i = x̄¬i), a partition {I1, I2, . . . , IK} of the variables, and an iteration schedule q(k, ℓ). For each outer iteration ℓ, each of the K parallel processors k sets yIk ← x̄Ik and runs j = 1, 2, . . . , q(k, ℓ) local Gibbs sweeps over i ∈ Ik using stale statistics x̄ from the other processors; the processors then globally synchronize x̄ from their final local states yIk.) [sent-33, score-0.257] [sent-36, score-0.148]
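To make the procedure in Algorithm 1 concrete, the following is a minimal Python sketch of Gaussian Hogwild Gibbs sampling as described above. It is an illustrative reconstruction, not the authors' code: the function name, the single constant sub-iteration count q, and the serialized loop standing in for the K parallel processors are my own assumptions.

```python
import numpy as np

def hogwild_gibbs_gaussian(J, h, partition, q, n_outer, rng=None):
    """Hogwild Gibbs sampling for N(J^{-1} h, J^{-1}).

    `partition` is a list of integer index arrays I_1, ..., I_K and `q` is the
    number of inner Gibbs sweeps each processor runs between synchronizations.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(J.shape[0])            # globally synchronized state (x-bar)
    samples = []
    for _ in range(n_outer):            # outer iterations (global synchronizations)
        new_x = x.copy()
        for block in partition:         # would run in parallel, one block per processor
            y = x.copy()                # other blocks stay frozen at their stale values
            for _ in range(q):          # q local Gibbs sweeps on this block
                for i in block:
                    # Gaussian conditional: mean (h_i - J_{i,-i} y_{-i}) / J_ii, variance 1 / J_ii
                    mean_i = (h[i] - J[i] @ y + J[i, i] * y[i]) / J[i, i]
                    y[i] = mean_i + rng.normal(0.0, 1.0 / np.sqrt(J[i, i]))
            new_x[block] = y[block]
        x = new_x                       # global synchronization step
        samples.append(x.copy())
    return np.array(samples)
```

Per the paper's main result, if the precision matrix J is generalized diagonally dominant, the resulting process is stable with the correct mean J⁻¹h for any partition and any q.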
15 Gaussian Hogwild Gibbs sampling could be used in conjunction with those methods to allow greater parallelization and scalability, given an understanding of its applicability and tradeoffs. [sent-43, score-0.172]
16 Toward the goal of understanding Gaussian Hogwild Gibbs sampling, the main contribution of this paper is a linear algebraic framework for analyzing the stability and errors in Gaussian Hogwild Gibbs sampling. [sent-44, score-0.131]
17 2 Related Work There has been significant work on constructing parallel Gibbs sampling algorithms, and the contributions are too numerous to list here. [sent-48, score-0.218]
18 One recent body of work [11] provides exact parallel Gibbs samplers which exploit graphical model structure for parallelism. [sent-49, score-0.214]
19 The algorithms are supported by the standard Gibbs sampling analysis, and the authors point out that while heuristic parallel samplers such as the AD-LDA sampler offer easier implementation and often greater parallelism, they are currently not supported by much theoretical analysis. [sent-50, score-0.389]
20 The parallel sampling work that is most relevant to the proposed Hogwild Gibbs sampling analysis is the thorough empirical demonstration of AD-LDA [2, 3, 4, 5, 6] and its extensions. [sent-51, score-0.351]
21 The AD-LDA sampling algorithm is an instance of the strategy we have named Hogwild Gibbs, and Bekkerman et al. [sent-52, score-0.17]
22 However, this per-sample error does not necessarily provide a direct understanding of the effectiveness of the overall algorithm because errors might accumulate over sampling steps; indeed, understanding this potential error accumulation is of critical importance in iterative systems. [sent-61, score-0.297]
23 [1] provides both a motivation for Hogwild Gibbs sampling as well as the Hogwild name. [sent-67, score-0.133]
24 3 Background: In this section we fix notation for Gaussian distributions and describe known connections between Gaussian sampling and a class of stationary iterative linear system solvers which are useful in analyzing the behavior of Hogwild Gibbs sampling. [sent-69, score-0.317]
25 The density of a Gaussian distribution on n variables with mean vector µ and positive definite¹ covariance matrix Σ ≻ 0 has the form p(x) ∝ exp(−½(x − µ)ᵀΣ⁻¹(x − µ)) ∝ exp(−½ xᵀJx + hᵀx), (1) where we have written the information parameters J := Σ⁻¹ and h := Jµ. [sent-70, score-0.102]
26 One way to generate samples is via Gibbs sampling, in which one iterates sampling each xi conditioned on all other variables to construct a Markov chain for which the invariant distribution is the target N (µ, Σ). [sent-76, score-0.186]
27 The conditional distributions for Gibbs sampling steps are p(xi | x¬i = x̄¬i) ∝ exp(−½ Jii xi² + (hi − Ji,¬i x̄¬i) xi). [sent-77, score-0.154]
28 That is, we update each xi via xi ← (1/Jii)(hi − Ji,¬i x̄¬i) + vi, where vi ∼ iid N(0, 1/Jii). [sent-78, score-0.121]
29 Collecting the updates for i = 1, 2, . . . , n into a matrix equation relating the sampler state at t and t + 1 gives x^(t+1) = −D⁻¹L x^(t+1) − D⁻¹Lᵀ x^(t) + D⁻¹h + D^(−1/2) v^(t), with v^(t) ∼ iid N(0, I), where D is the diagonal part of J and L is its strictly lower-triangular part. [sent-82, score-0.145]
30 We can re-arrange the equation into an update expression: x^(t+1) = −(D + L)⁻¹Lᵀ x^(t) + (D + L)⁻¹h + (D + L)⁻¹ṽ^(t), ṽ^(t) ∼ iid N(0, D). (2) [sent-85, score-0.121]
31 The expectation of this update is exactly the Gauss-Seidel iterative linear system solver update [13, Section 7] applied to the system Jµ = h. [sent-86, score-0.294]
32 Therefore a Gaussian Gibbs sampling process can be interpreted as Gauss-Seidel iterates on the system Jµ = h with appropriately-shaped noise injected at each iteration. [sent-90, score-0.331]
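The equivalence just described is easy to check numerically: the deterministic part of one sequential Gibbs sweep coincides with a Gauss-Seidel iteration for Jµ = h. A small self-contained sketch follows; the random test model and variable names are mine, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
Q = rng.normal(size=(n, n))
J = Q @ Q.T + n * np.eye(n)            # a random positive definite precision matrix
h = rng.normal(size=n)

D = np.diag(np.diag(J))                # diagonal part of J
L = np.tril(J, k=-1)                   # strictly lower-triangular part, so J = D + L + L^T

x = rng.normal(size=n)                 # arbitrary current state

# Deterministic part of one sequential Gibbs sweep (each conditional replaced by its mean)
x_sweep = x.copy()
for i in range(n):
    x_sweep[i] = (h[i] - J[i] @ x_sweep + J[i, i] * x_sweep[i]) / J[i, i]

# Gauss-Seidel step on J mu = h with splitting M = D + L, N = -L^T
x_gs = np.linalg.solve(D + L, h - L.T @ x)

print(np.allclose(x_sweep, x_gs))      # True: the expected Gibbs sweep is a Gauss-Seidel iteration
```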
33 Gauss-Seidel is one instance of a stationary iterative linear solver based on a matrix splitting. [sent-91, score-0.135]
34 For an iterative process like (2) to be stable or convergent for any initialization, we require the eigenvalues of its update map to lie strictly inside the unit circle. (Footnote 1: assume models are non-degenerate, i.e., matrix parameters are of full rank and densities are finite everywhere.) [sent-95, score-0.182]
35 41] [15], and the connection to Gibbs sampling provides an independent proof. [sent-104, score-0.164]
36 For the Gauss-Seidel splitting (and hence Gibbs sampling), M is chosen to be lower-triangular so that the corresponding linear system can be solved efficiently via backsubstitution. [sent-106, score-0.151]
37 In the sampling context, the per-iteration computational complexity is also determined by the covariance of the injected noise process v^(t), because at each iteration one must sample from a Gaussian distribution with covariance Mᵀ + N. [sent-107, score-0.373]
38 We highlight one other standard stationary iterative linear solver that is relevant to analyzing Gaussian Hogwild Gibbs sampling: Jacobi iterations, in which one splits J = D − A where D is the diagonal part of J and A is the negative of the off-diagonal part. [sent-108, score-0.19]
39 Due to the choice of a diagonal M , each coordinate update depends only on the previous sweep’s output, and thus the Jacobi update sweep can be performed in parallel. [sent-109, score-0.221]
40 A sufficient condition for the convergence of Jacobi iterates is for J to be a generalized diagonally dominant matrix (i.e., an H-matrix). [sent-110, score-0.372]
41 One way to see this, following [16], is to consider Gauss-Seidel iterations on the lifted 2n × 2n system [D −A; −A D], whose Gauss-Seidel update matrix is [D⁻¹ 0; D⁻¹AD⁻¹ D⁻¹][0 A; 0 0] = [0 D⁻¹A; 0 (D⁻¹A)²] (block matrices written row by row). (3) [sent-115, score-0.207]
42 Therefore one iteration of Gauss-Seidel on the lifted system is exactly two applications of the Jacobi update D⁻¹A to the second half of the state vector, so Jacobi iterations converge if Gauss-Seidel on the lifted system converges. [sent-116, score-0.435]
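Equation (3) can likewise be verified directly: build the lifted 2n × 2n matrix, form its Gauss-Seidel update, and compare its blocks with powers of the Jacobi map D⁻¹A. The random test model below is again my own arbitrary choice, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
Q = rng.normal(size=(n, n))
J = Q @ Q.T + n * np.eye(n)
D = np.diag(np.diag(J))
A = D - J                              # J = D - A, with A the negated off-diagonal part

Z = np.zeros((n, n))
M = np.block([[D, Z], [-A, D]])        # lower-triangular part of the lifted matrix [[D, -A], [-A, D]]
N = np.block([[Z, A], [Z, Z]])         # remainder, so the lifted matrix equals M - N
T_lifted = np.linalg.solve(M, N)       # Gauss-Seidel update matrix of the lifted system

DinvA = np.linalg.solve(D, A)
print(np.allclose(T_lifted[:n, n:], DinvA))            # upper-right block is D^{-1} A
print(np.allclose(T_lifted[n:, n:], DinvA @ DinvA))    # lower-right block is (D^{-1} A)^2
```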
43 In this section we describe the LDS corresponding to Gaussian Hogwild Gibbs sampling and provide convergence and error analysis, along with a connection to a class of linear solvers. [sent-120, score-0.193]
44 For the majority of this section, we assume that the number of inner iterations performed on each processor is constant across time and processor index; that is, we have a single number q = q(k, ℓ) of sub-iterations per processor for each outer iteration. [sent-121, score-0.264]
45 Given a joint Gaussian distribution of dimension n represented by a pair (J, h) as in (1), we represent an allocation of the n scalar variables to local processors by a partition of {1, 2, . . . , n}. [sent-123, score-0.216]
46 Consider a block-Jacobi splitting of J into its block diagonal and block off-diagonal components, J = Dblock − A, according to the partition. [sent-127, score-0.371]
47 A includes the cross-processor potentials, and this block-Jacobi splitting will model the outer iterations in Algorithm 1. [sent-128, score-0.176]
48 We further perform a Gauss-Seidel splitting on Dblock into (block-diagonal) lower-triangular and strictly upper-triangular parts, Dblock = B − C; these processor-local Gauss-Seidel splittings model the inner Gibbs sampling steps in Algorithm 1. [sent-129, score-0.285]
49 We refer to this splitting J = B − C − A as the Hogwild splitting; see Figure 1a for an example. [sent-130, score-0.086]
50 We use the lifting argument here because we extend the idea in our other proofs. [sent-132, score-0.08]
51 The resulting update operator for one outer iteration of the Hogwild Gibbs sampling process can be written as x^(t+1) = (B⁻¹C)^q x^(t) + ∑_{j=0}^{q−1} (B⁻¹C)^j B⁻¹(A x^(t) + h + v^(t,j)), with v^(t,j) ∼ iid N(0, D), (4) where D is the diagonal of J. [sent-134, score-0.383]
52 Note that we shape the noise diagonally because in Hogwild Gibbs sampling we simply apply standard Gibbs updates in the inner loop. [sent-135, score-0.311]
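For a concrete implementation of the Hogwild splitting J = B − C − A and of the update operator in (4), the following sketch assembles B, C, A, the deterministic map T, and the injected-noise covariance Σinj for a given partition and sub-iteration count q. The helper names and the dense-matrix construction are my own choices, not the authors' code.

```python
import numpy as np

def hogwild_splitting(J, partition):
    """Return (B, C, A) with J = B - C - A: A holds the cross-processor entries,
    and B - C is a Gauss-Seidel split of the block-diagonal part D_block."""
    n = J.shape[0]
    same_block = np.zeros((n, n), dtype=bool)
    for block in partition:
        same_block[np.ix_(block, block)] = True
    D_block = np.where(same_block, J, 0.0)
    A = D_block - J                    # negated off-block-diagonal (cross-processor) part
    B = np.tril(D_block)               # block-diagonal lower triangle, including the diagonal
    C = B - D_block                    # so that D_block = B - C
    return B, C, A

def hogwild_update_operator(J, partition, q):
    """Deterministic map T and injected-noise covariance Sigma_inj from Eq. (4)."""
    B, C, A = hogwild_splitting(J, partition)
    BinvC = np.linalg.solve(B, C)
    Binv = np.linalg.inv(B)
    D = np.diag(np.diag(J))
    powers = [np.linalg.matrix_power(BinvC, j) for j in range(q)]
    T = np.linalg.matrix_power(BinvC, q) + sum(powers) @ Binv @ A
    Sigma_inj = sum(P @ Binv @ D @ Binv.T @ P.T for P in powers)
    return T, Sigma_inj
```

The process is stable exactly when the spectral radius of T is below one, and the fixed point of the mean recursion is (B − C − A)⁻¹h = J⁻¹h, matching the correctness-of-means discussion below.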
53 1 Convergence and Correctness of Means: Because the Gaussian Hogwild Gibbs sampling iterates form a Gaussian linear dynamical system, the process is stable (i.e., it converges in distribution) exactly when the eigenvalues of its deterministic update map lie strictly inside the unit circle. [sent-139, score-0.277]
54 We can write the deterministic update map in (5) as T = Tind + (I − Tind)Tblock, where Tind is the purely Gauss-Seidel update that arises when A = 0 and Tblock is the block Jacobi update, which corresponds to solving the processor-local linear systems exactly at each outer iteration. [sent-146, score-0.258]
55 The update (5) falls into the class of two-stage splitting methods [14, 17, 18], and the next proposition is equivalent to such two-stage solvers having the correct fixed point. [sent-147, score-0.285]
56 If the process is stable the mean process has a unique fixed point, and from (4) and (5) we can write the fixed-point equation for the process mean µhog as (I − T)µhog = (I − Tind)(I − Tblock)µhog = (I − Tind)(B − C)⁻¹h, hence (I − (B − C)⁻¹A)µhog = (B − C)⁻¹h and µhog = (B − C − A)⁻¹h, which is the correct mean µ = J⁻¹h since J = B − C − A. [sent-151, score-0.14]
57 Despite the complexity of the update map’s stability, in the next subsection we give a simple argument that identifies its convergence with the convergence of Gauss-Seidel iterates on a larger, non-symmetric linear system. [sent-154, score-0.201]
58 Given that relationship we then prove a condition on the entries of J that ensures the convergence of the Gaussian Hogwild Gibbs sampling process for any choice of partition or sub-iteration count. [sent-155, score-0.256]
59 (Eq. (6), a lifted block lower-bidiagonal system whose Gauss-Seidel update contains the power (L⁻¹U)^k, appears at this point in the paper.) Therefore one step of Gauss-Seidel on the larger system corresponds to k applications of the Gauss-Seidel update L⁻¹U from the original system to the last block element of the lifted state vector. [sent-173, score-0.412]
60 (a) Support pattern (in black) of the Hogwild splitting J = B − C − A with n = 9 and the processor partition {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}. [sent-180, score-0.178]
61 (Figure 1c–d axis and legend residue: off-block-diagonal error, block-diagonal error.) [sent-187, score-0.17]
62 Figure 1: (a) visualization of the Hogwild splitting; (b) Hogwild stability for generic models; (c) and (d) typical plots of ||Π(Σ − Σhog)||Fro, where Π projects to the block diagonal in (c) and to the off-block-diagonal in (d). [sent-214, score-0.22]
63 In (b) each point corresponds to a sampled model J = QQᵀ + nrI with Qij ∼ iid N(0, 1), r ∼ Uniform[0. [sent-215, score-0.104]
64 Two applications of the Hogwild update T of (5) are equivalent to the update to the last block element of the state vector in one Gauss-Seidel iteration on the (2qn) × (2qn) system [E −F; −F E] x = h̃. (7) [sent-221, score-0.318]
65 By comparing to the block update in (3), it suffices to consider E⁻¹F. [sent-235, score-0.184]
66 Furthermore, since the claim concerns the last block entry, we need only consider the last block row of E⁻¹F. [sent-236, score-0.252]
67 E is block lower-bidiagonal, like the matrix that is inverted in (6), so E⁻¹ has the same lower-triangular form as in (6), and the product of the last block row of E⁻¹ with the last block column of F yields (B⁻¹C)^q + ∑_{j=0}^{q−1} (B⁻¹C)^j B⁻¹A = T. [sent-237, score-0.389]
68 Gaussian Hogwild Gibbs sampling is convergent if Gauss-Seidel converges on (7). [sent-239, score-0.182]
69 Unfortunately the lifting is not symmetric and so we cannot impose positive semi-definiteness on the lifted system; however, another sufficient condition for Gauss-Seidel stability can be applied: Theorem 1. [sent-240, score-0.256]
70 If J is generalized diagonally dominant (i.e., an H-matrix), then Hogwild Gibbs sampling is convergent for any variable partition and any number of sub-iterations. [sent-246, score-0.224]
71 If J is generalized diagonally dominant then there exists a diagonal scaling matrix R such that J̃ := JR is row diagonally dominant, i.e., |J̃ii| > ∑_{j≠i} |J̃ij| for every row i. [sent-248, score-0.493]
72 Since each scalar row of the coefficient matrix in (7) contains only entries from one row of J and zeros, it is generalized diagonally dominant with a scaling matrix that consists of 2q copies of R along the diagonal. [sent-251, score-0.349]
73 Finally, Gauss-Seidel iterations on generalized diagonally dominant systems are convergent [13, Theorem 5. [sent-252, score-0.329]
74 More broadly, diagonally dominant systems are well-known for their tractability and applicability in many other settings [19], and Gaussian Hogwild Gibbs provides another example of their utility. [sent-256, score-0.215]
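Theorem 1's hypothesis can also be checked numerically. One standard characterization (assumed here rather than stated in this extraction) is that J, with nonzero diagonal, is generalized diagonally dominant, i.e., an H-matrix, exactly when the Jacobi iteration matrix of its comparison matrix has spectral radius below one. A sketch of such a test, with names of my own choosing:

```python
import numpy as np

def is_generalized_diagonally_dominant(J, tol=1e-12):
    """H-matrix test: true iff some positive diagonal scaling makes J strictly row
    diagonally dominant, equivalently the spectral radius of D^{-1}|offdiag(J)| is < 1."""
    d = np.abs(np.diag(J))
    off = np.abs(J - np.diag(np.diag(J)))
    rho = np.max(np.abs(np.linalg.eigvals(off / d[:, None])))
    return rho < 1.0 - tol
```

If this returns True, Theorem 1 guarantees a stable Hogwild Gibbs process with the correct mean for every partition and every sub-iteration schedule.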
75 Because of the connection to linear system solvers based on two-stage multisplittings, this result can be identified with [18, Theorem 2.3]. [sent-257, score-0.128]
76 That theorem shows that if the coefficient matrix is an H-matrix then the two-stage iterative solver is convergent. [sent-258, score-0.113]
77 Indeed, by the connection between solvers and samplers one can prove our Theorem 1 as a corollary to [18, Theorem 2.3]. [sent-259, score-0.163]
78 The sufficient condition provided by Theorem 1 is coarse in that it provides convergence for any partition or update schedule. [sent-262, score-0.168]
79 2 Exact local block samples: Convergence analysis simplifies greatly in the case where exact block samples are drawn at each processor, because q is sufficiently large or because another exact sampler [9, 10] is used on each processor. [sent-266, score-0.449]
80 This regime of Hogwild Gibbs sampling is particularly interesting because it minimizes communication between processors. [sent-267, score-0.133]
81 In (4), we see that as q → ∞ we have T → Tblock; that is, the deterministic part of the update becomes the block Jacobi update map, which admits a natural sufficient condition for convergence, given in Proposition 4. [sent-268, score-0.281]
82 If ((B − C)^(−1/2) A (B − C)^(−1/2))² ≺ I, then block Hogwild Gibbs sampling converges. [sent-269, score-0.248]
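Proposition 4's condition can be tested numerically. The sketch below assumes B, C, A are available as dense arrays (e.g., from the splitting helper sketched earlier) and that B − C is symmetric positive definite; it is my own illustration, not the paper's code.

```python
import numpy as np
from scipy.linalg import sqrtm

def block_hogwild_converges(B, C, A):
    """Proposition 4's condition: ((B-C)^{-1/2} A (B-C)^{-1/2})^2 strictly below I,
    i.e., every eigenvalue of the symmetric matrix G = W A W has magnitude below one."""
    W = np.linalg.inv(np.real(sqrtm(B - C)))   # (B - C)^{-1/2}, assuming B - C is SPD
    G = W @ A @ W
    return np.max(np.abs(np.linalg.eigvalsh(G))) < 1.0
```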
83 2 Variances: Since we can analyze Gaussian Hogwild Gibbs sampling as a linear dynamical system, we can write an expression for the steady-state covariance Σhog of the process when it is stable. [sent-273, score-0.284]
84 For a general stable LDS of the form x^(t+1) = T x^(t) + v^(t) with v^(t) ∼ N(0, Σinj), the steady-state covariance is given by the series ∑_{t=0}^{∞} T^t Σinj (T^t)ᵀ, which is the solution Σ of the linear discrete-time Lyapunov equation Σ − TΣTᵀ = Σinj. [sent-274, score-0.124]
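Given a stable update map T and injected-noise covariance Σinj (for example, as assembled in the earlier splitting sketch), the steady-state covariance can be computed with a standard discrete Lyapunov solver; this assumes SciPy is available and is not part of the paper.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def steady_state_covariance(T, Sigma_inj):
    """Solve Sigma = T Sigma T^T + Sigma_inj for the steady-state covariance of a stable LDS."""
    assert np.max(np.abs(np.linalg.eigvals(T))) < 1.0, "the update map must be stable"
    return solve_discrete_lyapunov(T, Sigma_inj)
```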
85 The injected noise for the outer loop of the Hogwild iterations is generated by the inner loop, which itself has injected noise with covariance D, the diagonal of J, so for Hogwild sampling we have Σinj = ∑_{j=0}^{q−1} (B⁻¹C)^j B⁻¹ D B⁻ᵀ ((B⁻¹C)^j)ᵀ. [sent-275, score-0.494]
86 Composing these expressions we see that the Hogwild covariance is complicated, but we can analyze some salient properties in at least two regimes: when A is small and when local processors draw exact block samples (e.g., when q → ∞). [sent-277, score-0.351]
87 When A = 0, the model is independent across processors, and both the exact covariance and the Hogwild steady-state covariance for any q are (B − C)⁻¹. [sent-283, score-0.276]
88 (Eq. (8) appears at this point in the paper.) Since A has zero block-diagonal and S is block-diagonal, we see that to first order A has no effect on the block-diagonal of either the exact covariance or the Hogwild covariance. [sent-287, score-0.109]
89 As shown in Figure 1c, in numerical experiments higher-order terms improve the Hogwild covariance on the block diagonal relative to the A = 0 approximation, and the improvements increase with local mixing rates. [sent-288, score-0.339]
90 The off-block-diagonal first-order term in the Hogwild covariance is nonzero and it depends on the local mixing performed by S. [sent-289, score-0.169]
91 In particular, if global synchronization happens infrequently relative to the speed of local sampler mixing (e.g., when q is large), the error in the off-block-diagonal part of the Hogwild covariance grows. [sent-290, score-0.215]
92 Figure 1d shows the off-block-diagonal error increasing with faster local mixing for small A. [sent-294, score-0.089]
93 Such an effect may be undesirable because increased local mixing reflects greater parallelism (or an application of more powerful local samplers [9, 10]). [sent-296, score-0.251]
94 In the next subsection we show that this case admits a special analysis and even an inexpensive correction to recover asymptotically unbiased estimates for the full covariance matrix. [sent-297, score-0.101]
95 2 Exact local block samples: As local mixing increases, e.g. [sent-300, score-0.244]
96 as q → ∞ or if we use an exact block local sampler between global synchronizations, we are effectively sampling in the lifted model of Eq. (3). [sent-302, score-0.508]
97 When local block samples are exact, the Hogwild sampled covariance ΣHog satisfies Σ = (I + (B − C)⁻¹A)ΣHog and ||Σ − ΣHog|| ≤ ||(B − C)⁻¹A|| ||ΣHog||, where Σ = J⁻¹ is the exact target covariance and || · || is any submultiplicative matrix norm. [sent-304, score-0.366]
98 Using the lifting in (3), the Hogwild process steady-state covariance is the marginal covariance of half of the lifted state vector, so using Schur complements we can write ΣHog = ((B − C) − A(B − C)⁻¹A)⁻¹ = [I + ((B − C)⁻¹A)² + · · · ](B − C)⁻¹. [sent-306, score-0.421]
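Proposition 5 suggests an inexpensive correction when local block samples are exact: multiply an empirical Hogwild covariance estimate by (I + (B − C)⁻¹A). A sketch of that correction (the function name and the assumption that B, C, A are available as dense arrays are mine):

```python
import numpy as np

def correct_block_hogwild_covariance(Sigma_hog_hat, B, C, A):
    """Turn an empirical Hogwild covariance estimate into an (asymptotically) unbiased
    estimate of the exact covariance via Sigma = (I + (B - C)^{-1} A) Sigma_hog."""
    n = Sigma_hog_hat.shape[0]
    return (np.eye(n) + np.linalg.solve(B - C, A)) @ Sigma_hog_hat
```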
99 1); and (4) when local samplers are run to convergence we can bound the error in the Hogwild variances and even efficiently correct estimates of the full covariance (Proposition 5). [sent-310, score-0.297]
100 “PLDA+: Parallel latent dirichlet allocation with data placement and pipeline processing”. [sent-345, score-0.101]
wordName wordTfidf (topN-words)
[('hogwild', 0.721), ('gibbs', 0.311), ('hog', 0.242), ('diagonally', 0.133), ('sampling', 0.133), ('jacobi', 0.129), ('tblock', 0.128), ('block', 0.115), ('samplers', 0.1), ('lifted', 0.098), ('tind', 0.091), ('processors', 0.087), ('splitting', 0.086), ('parallel', 0.085), ('dominant', 0.082), ('lifting', 0.08), ('covariance', 0.08), ('inj', 0.073), ('proposition', 0.072), ('gaussian', 0.072), ('sampler', 0.071), ('update', 0.069), ('system', 0.065), ('injected', 0.056), ('diagonal', 0.055), ('dblock', 0.055), ('lds', 0.053), ('iterates', 0.053), ('iid', 0.052), ('outer', 0.05), ('processor', 0.05), ('stability', 0.05), ('convergent', 0.049), ('mixing', 0.049), ('yik', 0.048), ('solver', 0.048), ('allocation', 0.047), ('lt', 0.047), ('stable', 0.044), ('iterative', 0.043), ('partition', 0.042), ('iterations', 0.04), ('local', 0.04), ('understanding', 0.039), ('ad', 0.037), ('strategy', 0.037), ('schur', 0.037), ('bekkerman', 0.037), ('frommer', 0.037), ('gaussseidel', 0.037), ('multisplittings', 0.037), ('ruozzi', 0.037), ('potentials', 0.036), ('complements', 0.035), ('ihler', 0.034), ('synchronization', 0.033), ('berman', 0.032), ('jii', 0.032), ('locking', 0.032), ('qqt', 0.032), ('solvers', 0.032), ('latent', 0.032), ('eecs', 0.032), ('connection', 0.031), ('convergence', 0.029), ('exact', 0.029), ('url', 0.028), ('condition', 0.028), ('xik', 0.028), ('sweep', 0.028), ('great', 0.027), ('powers', 0.027), ('correct', 0.026), ('willsky', 0.025), ('generalized', 0.025), ('process', 0.024), ('write', 0.024), ('niu', 0.024), ('inner', 0.024), ('dynamical', 0.023), ('effectiveness', 0.023), ('stationary', 0.022), ('global', 0.022), ('matrix', 0.022), ('parallelism', 0.022), ('analyzing', 0.022), ('dirichlet', 0.022), ('variances', 0.022), ('row', 0.022), ('parallelizing', 0.022), ('steps', 0.021), ('serial', 0.021), ('strictly', 0.021), ('composing', 0.021), ('asynchronous', 0.021), ('subsection', 0.021), ('updates', 0.021), ('scaling', 0.021), ('errors', 0.02), ('asuncion', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling
Author: Matthew Johnson, James Saunderson, Alan Willsky
Abstract: Sampling inference methods are computationally difficult to scale for many models in part because global dependencies can reduce opportunities for parallel computation. Without strict conditional independence structure among variables, standard Gibbs sampling theory requires sample updates to be performed sequentially, even if dependence between most variables is not strong. Empirical work has shown that some models can be sampled effectively by going “Hogwild” and simply running Gibbs updates in parallel with only periodic global communication, but the successes and limitations of such a strategy are not well understood. As a step towards such an understanding, we study the Hogwild Gibbs sampling strategy in the context of Gaussian distributions. We develop a framework which provides convergence conditions and error bounds along with simple proofs and connections to methods in numerical linear algebra. In particular, we show that if the Gaussian precision matrix is generalized diagonally dominant, then any Hogwild Gibbs sampler, with any update schedule or allocation of variables to processors, yields a stable sampling process with the correct sample mean. 1
2 0.21445614 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
Author: Tamir Hazan, Subhransu Maji, Tommi Jaakkola
Abstract: In this paper we describe how MAP inference can be used to sample efficiently from Gibbs distributions. Specifically, we provide means for drawing either approximate or unbiased samples from Gibbs’ distributions by introducing low dimensional perturbations and solving the corresponding MAP assignments. Our approach also leads to new ways to derive lower bounds on partition functions. We demonstrate empirically that our method excels in the typical “high signal high coupling” regime. The setting results in ragged energy landscapes that are challenging for alternative approaches to sampling and/or lower bounds. 1
3 0.12913661 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing
Author: Justin Domke, Xianghang Liu
Abstract: Inference in general Ising models is difficult, due to high treewidth making treebased algorithms intractable. Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling.
4 0.1290559 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
Author: Jason Chang, John W. Fisher III
Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1
5 0.11520632 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models
Author: Jie Liu, David Page
Abstract: In large-scale applications of undirected graphical models, such as social networks and biological networks, similar patterns occur frequently and give rise to similar parameters. In this situation, it is beneficial to group the parameters for more efficient learning. We show that even when the grouping is unknown, we can infer these parameter groups during learning via a Bayesian approach. We impose a Dirichlet process prior on the parameters. Posterior inference usually involves calculating intractable terms, and we propose two approximation algorithms, namely a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with “stripped” Beta approximation (Gibbs SBA). Simulations show that both algorithms outperform conventional maximum likelihood estimation (MLE). Gibbs SBA’s performance is close to Gibbs sampling with exact likelihood calculation. Models learned with Gibbs SBA also generalize better than the models learned by MLE on real-world Senate voting data. 1
6 0.092938922 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
7 0.082351699 119 nips-2013-Fast Template Evaluation with Vector Quantization
8 0.080235332 146 nips-2013-Large Scale Distributed Sparse Precision Estimation
9 0.074675649 122 nips-2013-First-order Decomposition Trees
10 0.074582227 161 nips-2013-Learning Stochastic Inverses
11 0.06831789 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
12 0.065037258 220 nips-2013-On the Complexity and Approximation of Binary Evidence in Lifted Inference
13 0.064375319 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions
14 0.062371727 40 nips-2013-Approximate Inference in Continuous Determinantal Processes
15 0.062291507 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
16 0.058685187 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
17 0.0577485 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing
18 0.055232499 123 nips-2013-Flexible sampling of discrete data correlations without the marginal distributions
19 0.0549183 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
20 0.053872637 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
topicId topicWeight
[(0, 0.16), (1, 0.046), (2, -0.01), (3, 0.029), (4, 0.01), (5, 0.116), (6, 0.09), (7, 0.004), (8, 0.074), (9, -0.042), (10, 0.029), (11, 0.031), (12, 0.033), (13, 0.02), (14, 0.005), (15, 0.026), (16, -0.102), (17, -0.088), (18, -0.004), (19, 0.099), (20, -0.036), (21, -0.024), (22, 0.018), (23, -0.005), (24, 0.04), (25, 0.065), (26, 0.022), (27, -0.041), (28, -0.051), (29, 0.128), (30, 0.103), (31, -0.064), (32, 0.052), (33, -0.019), (34, 0.049), (35, -0.09), (36, 0.027), (37, -0.012), (38, 0.177), (39, -0.104), (40, -0.03), (41, -0.032), (42, 0.128), (43, 0.107), (44, -0.091), (45, 0.032), (46, 0.042), (47, 0.017), (48, -0.135), (49, -0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.948762 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling
Author: Matthew Johnson, James Saunderson, Alan Willsky
Abstract: Sampling inference methods are computationally difficult to scale for many models in part because global dependencies can reduce opportunities for parallel computation. Without strict conditional independence structure among variables, standard Gibbs sampling theory requires sample updates to be performed sequentially, even if dependence between most variables is not strong. Empirical work has shown that some models can be sampled effectively by going “Hogwild” and simply running Gibbs updates in parallel with only periodic global communication, but the successes and limitations of such a strategy are not well understood. As a step towards such an understanding, we study the Hogwild Gibbs sampling strategy in the context of Gaussian distributions. We develop a framework which provides convergence conditions and error bounds along with simple proofs and connections to methods in numerical linear algebra. In particular, we show that if the Gaussian precision matrix is generalized diagonally dominant, then any Hogwild Gibbs sampler, with any update schedule or allocation of variables to processors, yields a stable sampling process with the correct sample mean. 1
2 0.77234393 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models
Author: Jie Liu, David Page
Abstract: In large-scale applications of undirected graphical models, such as social networks and biological networks, similar patterns occur frequently and give rise to similar parameters. In this situation, it is beneficial to group the parameters for more efficient learning. We show that even when the grouping is unknown, we can infer these parameter groups during learning via a Bayesian approach. We impose a Dirichlet process prior on the parameters. Posterior inference usually involves calculating intractable terms, and we propose two approximation algorithms, namely a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with “stripped” Beta approximation (Gibbs SBA). Simulations show that both algorithms outperform conventional maximum likelihood estimation (MLE). Gibbs SBA’s performance is close to Gibbs sampling with exact likelihood calculation. Models learned with Gibbs SBA also generalize better than the models learned by MLE on real-world Senate voting data. 1
3 0.74790543 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
Author: Tamir Hazan, Subhransu Maji, Tommi Jaakkola
Abstract: In this paper we describe how MAP inference can be used to sample efficiently from Gibbs distributions. Specifically, we provide means for drawing either approximate or unbiased samples from Gibbs’ distributions by introducing low dimensional perturbations and solving the corresponding MAP assignments. Our approach also leads to new ways to derive lower bounds on partition functions. We demonstrate empirically that our method excels in the typical “high signal high coupling” regime. The setting results in ragged energy landscapes that are challenging for alternative approaches to sampling and/or lower bounds. 1
4 0.71561426 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing
Author: Justin Domke, Xianghang Liu
Abstract: Inference in general Ising models is difficult, due to high treewidth making treebased algorithms intractable. Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling.
5 0.60881072 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
Author: Jason Chang, John W. Fisher III
Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1
6 0.59718084 161 nips-2013-Learning Stochastic Inverses
7 0.55041134 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing
8 0.51371127 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
9 0.48536968 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions
10 0.46568733 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
11 0.44967818 160 nips-2013-Learning Stochastic Feedforward Neural Networks
12 0.44778007 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
13 0.44159308 352 nips-2013-What do row and column marginals reveal about your dataset?
14 0.4326736 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
15 0.42486078 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation
16 0.41619927 122 nips-2013-First-order Decomposition Trees
17 0.3986797 220 nips-2013-On the Complexity and Approximation of Binary Evidence in Lifted Inference
18 0.39724523 306 nips-2013-Speeding up Permutation Testing in Neuroimaging
19 0.39120382 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
20 0.38352418 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
topicId topicWeight
[(16, 0.373), (33, 0.152), (34, 0.095), (41, 0.027), (49, 0.03), (56, 0.092), (70, 0.014), (85, 0.029), (89, 0.028), (93, 0.035), (95, 0.031)]
simIndex simValue paperId paperTitle
1 0.97857982 61 nips-2013-Capacity of strong attractor patterns to model behavioural and cognitive prototypes
Author: Abbas Edalat
Abstract: We solve the mean field equations for a stochastic Hopfield network with temperature (noise) in the presence of strong, i.e., multiply stored, patterns, and use this solution to obtain the storage capacity of such a network. Our result provides for the first time a rigorous solution of the mean filed equations for the standard Hopfield model and is in contrast to the mathematically unjustifiable replica technique that has been used hitherto for this derivation. We show that the critical temperature for stability of a strong pattern is equal to its degree or multiplicity, when the sum of the squares of degrees of the patterns is negligible compared to the network size. In the case of a single strong pattern, when the ratio of the number of all stored pattens and the network size is a positive constant, we obtain the distribution of the overlaps of the patterns with the mean field and deduce that the storage capacity for retrieving a strong pattern exceeds that for retrieving a simple pattern by a multiplicative factor equal to the square of the degree of the strong pattern. This square law property provides justification for using strong patterns to model attachment types and behavioural prototypes in psychology and psychotherapy. 1
Author: Barbara Rakitsch, Christoph Lippert, Karsten Borgwardt, Oliver Stegle
Abstract: Multi-task prediction methods are widely used to couple regressors or classification models by sharing information across related tasks. We propose a multi-task Gaussian process approach for modeling both the relatedness between regressors and the task correlations in the residuals, in order to more accurately identify true sharing between regressors. The resulting Gaussian model has a covariance term in form of a sum of Kronecker products, for which efficient parameter inference and out of sample prediction are feasible. On both synthetic examples and applications to phenotype prediction in genetics, we find substantial benefits of modeling structured noise compared to established alternatives. 1
3 0.93258464 245 nips-2013-Pass-efficient unsupervised feature selection
Author: Crystal Maung, Haim Schweitzer
Abstract: The goal of unsupervised feature selection is to identify a small number of important features that can represent the data. We propose a new algorithm, a modification of the classical pivoted QR algorithm of Businger and Golub, that requires a small number of passes over the data. The improvements are based on two ideas: keeping track of multiple features in each pass, and skipping calculations that can be shown not to affect the final selection. Our algorithm selects the exact same features as the classical pivoted QR algorithm, and has the same favorable numerical stability. We describe experiments on real-world datasets which sometimes show improvements of several orders of magnitude over the classical algorithm. These results appear to be competitive with recently proposed randomized algorithms in terms of pass efficiency and run time. On the other hand, the randomized algorithms may produce more accurate features, at the cost of small probability of failure. 1
4 0.89928037 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
Author: Jason Chang, John W. Fisher III
Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1
same-paper 5 0.85110044 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling
Author: Matthew Johnson, James Saunderson, Alan Willsky
Abstract: Sampling inference methods are computationally difficult to scale for many models in part because global dependencies can reduce opportunities for parallel computation. Without strict conditional independence structure among variables, standard Gibbs sampling theory requires sample updates to be performed sequentially, even if dependence between most variables is not strong. Empirical work has shown that some models can be sampled effectively by going “Hogwild” and simply running Gibbs updates in parallel with only periodic global communication, but the successes and limitations of such a strategy are not well understood. As a step towards such an understanding, we study the Hogwild Gibbs sampling strategy in the context of Gaussian distributions. We develop a framework which provides convergence conditions and error bounds along with simple proofs and connections to methods in numerical linear algebra. In particular, we show that if the Gaussian precision matrix is generalized diagonally dominant, then any Hogwild Gibbs sampler, with any update schedule or allocation of variables to processors, yields a stable sampling process with the correct sample mean. 1
6 0.85008609 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
7 0.67353207 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
8 0.66522461 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
9 0.66044128 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models
10 0.65041089 252 nips-2013-Predictive PAC Learning and Process Decompositions
11 0.6468969 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
12 0.64577454 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
13 0.64192837 355 nips-2013-Which Space Partitioning Tree to Use for Search?
14 0.63794237 350 nips-2013-Wavelets on Graphs via Deep Learning
15 0.63693368 47 nips-2013-Bayesian Hierarchical Community Discovery
16 0.62859404 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
17 0.62536186 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
18 0.62512577 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
19 0.61916012 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components
20 0.61843967 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking