nips nips2013 nips2013-34 knowledge-graph by maker-knowledge-mining

34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling


Source: pdf

Author: Matthew Johnson, James Saunderson, Alan Willsky

Abstract: Sampling inference methods are computationally difficult to scale for many models in part because global dependencies can reduce opportunities for parallel computation. Without strict conditional independence structure among variables, standard Gibbs sampling theory requires sample updates to be performed sequentially, even if dependence between most variables is not strong. Empirical work has shown that some models can be sampled effectively by going “Hogwild” and simply running Gibbs updates in parallel with only periodic global communication, but the successes and limitations of such a strategy are not well understood. As a step towards such an understanding, we study the Hogwild Gibbs sampling strategy in the context of Gaussian distributions. We develop a framework which provides convergence conditions and error bounds along with simple proofs and connections to methods in numerical linear algebra. In particular, we show that if the Gaussian precision matrix is generalized diagonally dominant, then any Hogwild Gibbs sampler, with any update schedule or allocation of variables to processors, yields a stable sampling process with the correct sample mean. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Sampling inference methods are computationally difficult to scale for many models in part because global dependencies can reduce opportunities for parallel computation. [sent-6, score-0.107]

2 Without strict conditional independence structure among variables, standard Gibbs sampling theory requires sample updates to be performed sequentially, even if dependence between most variables is not strong. [sent-7, score-0.154]

3 Empirical work has shown that some models can be sampled effectively by going “Hogwild” and simply running Gibbs updates in parallel with only periodic global communication, but the successes and limitations of such a strategy are not well understood. [sent-8, score-0.165]

4 As a step towards such an understanding, we study the Hogwild Gibbs sampling strategy in the context of Gaussian distributions. [sent-9, score-0.17]

5 In particular, we show that if the Gaussian precision matrix is generalized diagonally dominant, then any Hogwild Gibbs sampler, with any update schedule or allocation of variables to processors, yields a stable sampling process with the correct sample mean. [sent-11, score-0.523]

6 Introduction: Scaling probabilistic inference algorithms to large datasets and parallel computing architectures is a challenge of great importance and considerable current research interest, and great strides have been made in designing parallelizable algorithms. [sent-12, score-0.139]

7 Along with the powerful and sometimes complex new algorithms, a very simple strategy has proven to be surprisingly successful in some situations: running Gibbs sampling updates, derived only for the sequential setting, in parallel without globally synchronizing the sampler state after each update. [sent-13, score-0.326]

8 We refer to this strategy as “Hogwild Gibbs sampling” in reference to recent work [1] in which sequential computations for computing gradient steps were applied in parallel (without global coordination) to great beneficial effect. [sent-15, score-0.192]

9 There have been recent advances in understanding some of the particular structure of AD-LDA [6], but a thorough theoretical explanation for the effectiveness and limitations of Hogwild Gibbs sampling is far from complete. [sent-18, score-0.195]

10 Sampling inference algorithms for complex Bayesian models have notoriously resisted theoretical analysis, so to begin an analysis of Hogwild Gibbs sampling we consider a restricted class of models that is especially tractable for analysis: Gaussians. [sent-19, score-0.133]

11 Algorithm 1 (Hogwild Gibbs Sampling) requires samplers Gi(x̄¬i) which sample p(xi | x¬i = x̄¬i), a partition {I1, I2, ..., IK} of the variables, and sub-iteration counts q(k, ℓ). [sent-21, score-0.175]

12 For each of the K parallel processors k = 1, ..., K in parallel: set yIk^(1) ← x̄Ik, and then for j = 1, 2, ..., q(k, ℓ), [sent-33, score-0.257]

13 run local Gibbs steps with old statistics from the other processors, i.e. for each i ∈ Ik set yi^(j) ← Gi(ȳI1, ..., x̄IK). [sent-36, score-0.148]

14 Finally, x̄^(ℓ+1) ← (yI1^(q(1,ℓ)), ..., yIK^(q(K,ℓ))) globally synchronizes the statistics; further, Gaussian sampling is of great interest in its own right, and there is active research in developing powerful Gaussian samplers [7, 8, 9, 10]. [sent-42, score-0.127]
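
The following is a minimal NumPy sketch of the Hogwild Gibbs procedure reconstructed above, with a fixed number q of inner sweeps per outer iteration and the K "parallel" processors simulated serially; the function and variable names (hogwild_gibbs, partition, etc.) are illustrative assumptions, not code from the paper.

```python
import numpy as np

def hogwild_gibbs(J, h, partition, q=2, n_outer=5000, seed=0):
    """Simulate Hogwild Gibbs sampling for N(mu, J^{-1}) with mu = J^{-1} h.

    partition: list of index arrays I_1, ..., I_K assigning variables to processors.
    Each outer iteration, every processor runs q local Gibbs sweeps on its block,
    reading the other processors' values from the last global synchronization.
    """
    rng = np.random.default_rng(seed)
    n = J.shape[0]
    x = np.zeros(n)
    samples = []
    for _ in range(n_outer):
        x_old = x.copy()                  # globally synchronized state
        x_new = x_old.copy()
        for block in partition:           # "parallel" processors, simulated serially
            y = x_old.copy()              # local copy; off-block entries stay stale
            for _ in range(q):            # q inner Gibbs sweeps on this block
                for i in block:
                    cond_mean = (h[i] - J[i, :] @ y + J[i, i] * y[i]) / J[i, i]
                    y[i] = cond_mean + rng.normal(scale=1.0 / np.sqrt(J[i, i]))
            x_new[block] = y[block]
        x = x_new                         # global synchronization
        samples.append(x.copy())
    return np.array(samples)

# Row diagonally dominant precision, so Theorem 1 guarantees stability and the correct mean.
J = np.array([[4.0, 1.0, 0.5],
              [1.0, 4.0, 1.0],
              [0.5, 1.0, 4.0]])
h = np.array([1.0, 0.0, -1.0])
S = hogwild_gibbs(J, h, partition=[np.array([0, 1]), np.array([2])])
print("sample mean:", S[1000:].mean(axis=0), " exact mean:", np.linalg.solve(J, h))
```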

15 Gaussian Hogwild Gibbs sampling could be used in conjunction with those methods to allow greater parallelization and scalability, given an understanding of its applicability and tradeoffs. [sent-43, score-0.172]

16 Toward the goal of understanding Gaussian Hogwild Gibbs sampling, the main contribution of this paper is a linear algebraic framework for analyzing the stability and errors in Gaussian Hogwild Gibbs sampling. [sent-44, score-0.131]

17 2 Related Work There has been significant work on constructing parallel Gibbs sampling algorithms, and the contributions are too numerous to list here. [sent-48, score-0.218]

18 One recent body of work [11] provides exact parallel Gibbs samplers which exploit graphical model structure for parallelism. [sent-49, score-0.214]

19 The algorithms are supported by the standard Gibbs sampling analysis, and the authors point out that while heuristic parallel samplers such as the AD-LDA sampler offer easier implementation and often greater parallelism, they are currently not supported by much theoretical analysis. [sent-50, score-0.389]

20 The parallel sampling work that is most relevant to the proposed Hogwild Gibbs sampling analysis is the thorough empirical demonstration of AD-LDA [2, 3, 4, 5, 6] and its extensions. [sent-51, score-0.351]

21 The AD-LDA sampling algorithm is an instance of the strategy we have named Hogwild Gibbs, and Bekkerman et al. [sent-52, score-0.17]

22 However, this per-sample error does not necessarily provide a direct understanding of the effectiveness of the overall algorithm because errors might accumulate over sampling steps; indeed, understanding this potential error accumulation is of critical importance in iterative systems. [sent-61, score-0.297]

23 [1] provides both a motivation for Hogwild Gibbs sampling as well as the Hogwild name. [sent-67, score-0.133]

24 3 Background In this section we fix notation for Gaussian distributions and describe known connections between Gaussian sampling and a class of stationary iterative linear system solvers which are useful in analyzing the behavior of Hogwild Gibbs sampling. [sent-69, score-0.317]

25 The density of a Gaussian distribution on n variables with mean vector µ and positive definite covariance matrix Σ ≻ 0 has the form p(x) ∝ exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ)) ∝ exp(−½ xᵀ J x + hᵀ x) (1), where we have written the information parameters J := Σ⁻¹ and h := Jµ. [sent-70, score-0.102]

26 One way to generate samples is via Gibbs sampling, in which one iterates sampling each xi conditioned on all other variables to construct a Markov chain for which the invariant distribution is the target N (µ, Σ). [sent-76, score-0.186]

27 The conditional distributions for Gibbs sampling steps are p(xi | x¬i = x̄¬i) ∝ exp(−½ Jii xi² + (hi − Ji,¬i x̄¬i) xi). [sent-77, score-0.154]

28 That is, we update each xi via xi ← (1/Jii)(hi − Ji,¬i x̄¬i) + vi, where the vi ∼ N(0, 1/Jii) are drawn i.i.d. [sent-78, score-0.121]

29 Collecting the scalar updates for i = 1, 2, ..., n into a matrix equation relating the sampler state at t and t + 1: x(t+1) = −D⁻¹ L x(t+1) − D⁻¹ Lᵀ x(t) + D⁻¹ h + D^(−1/2) v(t), with v(t) ∼ N(0, I) i.i.d. [sent-82, score-0.145]

30 We can re-arrange the equation into an update expression: x(t+1) = −(D + L)⁻¹ Lᵀ x(t) + (D + L)⁻¹ h + (D + L)⁻¹ v(t), with v(t) ∼ N(0, D) i.i.d. [sent-85, score-0.121]

31 The expectation of this update is exactly the Gauss-Seidel iterative linear system solver update [13, Section 7]. [sent-86, score-0.294]

32 Therefore a Gaussian Gibbs sampling process can be interpreted as Gauss-Seidel iterates on the system Jµ = h with appropriately-shaped noise injected at each iteration. [sent-90, score-0.331]
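
A small numerical check of this correspondence (a sketch, not code from the paper), assuming the splitting J = L + D + L^T with D the diagonal and L the strictly lower-triangular part:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
Q = rng.normal(size=(n, n))
J = Q @ Q.T + n * np.eye(n)          # a generic symmetric positive definite precision
h = rng.normal(size=n)

D = np.diag(np.diag(J))
L = np.tril(J, k=-1)                 # strictly lower-triangular part, so J = L + D + L^T

# Gauss-Seidel update:  x <- -(D + L)^{-1} L^T x + (D + L)^{-1} h
M = D + L
T_gs = -np.linalg.solve(M, L.T)
c_gs = np.linalg.solve(M, h)

# One full Gibbs sweep with the injected noise set to zero reproduces that update,
# because each conditional-mean update is affine in the current state.
x = rng.normal(size=n)
y = x.copy()
for i in range(n):
    y[i] = (h[i] - J[i, :] @ y + J[i, i] * y[i]) / J[i, i]
print(np.allclose(y, T_gs @ x + c_gs))     # True: the sweep's mean is a Gauss-Seidel step

mu = np.linalg.solve(J, h)
print(np.allclose(T_gs @ mu + c_gs, mu))   # the Gauss-Seidel fixed point is the mean mu
```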

33 Gauss-Seidel is one instance of a stationary iterative linear solver based on a matrix splitting. [sent-91, score-0.135]

34 For an iterative process like (2) to be stable or convergent for any initialization we require the eigenvalues of its update map to lie strictly within the unit disc. (Footnote 1: assume models are non-degenerate, i.e., matrix parameters are of full rank and densities are finite everywhere.) [sent-95, score-0.182]

35 41] [15], and the connection to Gibbs sampling provides an independent proof. [sent-104, score-0.164]

36 For the Gauss-Seidel splitting (and hence Gibbs sampling), M is chosen to be lower-triangular so that the corresponding linear system can be solved efficiently via backsubstitution. [sent-106, score-0.151]

37 In the sampling context, the per-iteration computational complexity is also determined by the covariance of the injected noise process v(t), because at each iteration one must sample from a Gaussian distribution with covariance Mᵀ + N. [sent-107, score-0.373]

38 We highlight one other standard stationary iterative linear solver that is relevant to analyzing Gaussian Hogwild Gibbs sampling: Jacobi iterations, in which one splits J = D − A where D is the diagonal part of J and A is the negative of the off-diagonal part. [sent-108, score-0.19]

39 Due to the choice of a diagonal M , each coordinate update depends only on the previous sweep’s output, and thus the Jacobi update sweep can be performed in parallel. [sent-109, score-0.221]

40 A sufficient condition for the convergence of Jacobi iterates is for J to be a generalized diagonally dominant matrix (i.e., an H-matrix). [sent-110, score-0.372]

41 One argument, following [16], is to consider Gauss-Seidel iterations on a lifted 2n × 2n system: applying the Gauss-Seidel update to the coefficient matrix [[D, −A], [−A, D]] gives the update matrix [[D, 0], [−A, D]]⁻¹ [[0, A], [0, 0]] = [[0, D⁻¹A], [0, (D⁻¹A)²]]. (3) [sent-115, score-0.207]

42 Therefore one iteration of Gauss-Seidel on the lifted system is exactly two applications of the Jacobi update D⁻¹A to the second half of the state vector, so Jacobi iterations converge if Gauss-Seidel on the lifted system converges. [sent-116, score-0.435]
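
A quick numerical illustration of this lifting argument, written as a sketch with illustrative names and the splitting J = D − A used in the text:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
Q = rng.normal(size=(n, n))
J = Q @ Q.T + n * np.eye(n)

D = np.diag(np.diag(J))
A = D - J                          # negative of the off-diagonal part, so J = D - A
T_jac = np.linalg.solve(D, A)      # Jacobi update matrix D^{-1} A

# Lifted 2n x 2n coefficient matrix [[D, -A], [-A, D]]
M_lift = np.block([[D, -A], [-A, D]])
P = np.tril(M_lift)                # lower-triangular part used by Gauss-Seidel
N = P - M_lift                     # so that M_lift = P - N
T_lift = np.linalg.solve(P, N)     # Gauss-Seidel update matrix on the lifted system

# Its last block column stacks D^{-1} A and (D^{-1} A)^2: two Jacobi applications
print(np.allclose(T_lift[:n, n:], T_jac))
print(np.allclose(T_lift[n:, n:], T_jac @ T_jac))
```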

43 In this section we describe the LDS corresponding to Gaussian Hogwild Gibbs sampling and provide convergence and error analysis, along with a connection to a class of linear solvers. [sent-120, score-0.193]

44 For the majority of this section, we assume that the number of inner iterations performed on each processor is constant across time and processor index; that is, we have a single number q = q(k, ℓ) of sub-iterations per processor for each outer iteration. [sent-121, score-0.264]

45 Given a joint Gaussian distribution of dimension n represented by a pair (J, h) as in (1), we represent an allocation of the n scalar variables to local processors by a partition of {1, 2, ..., n}. [sent-123, score-0.216]

46 Consider a block-Jacobi splitting of J into its block diagonal and block off-diagonal components, J = Dblock − A, according to the partition. [sent-127, score-0.371]

47 A includes the cross-processor potentials, and this block-Jacobi splitting will model the outer iterations in Algorithm 1. [sent-128, score-0.176]

48 We further perform a Gauss-Seidel splitting on Dblock into (block-diagonal) lower-triangular and strictly upper-triangular parts, Dblock = B − C; these processor-local Gauss-Seidel splittings model the inner Gibbs sampling steps in Algorithm 1. [sent-129, score-0.285]

49 We refer to this splitting J = B − C − A as the Hogwild splitting; see Figure 1a for an example. [sent-130, score-0.086]

50 We use the lifting argument here because we extend the idea in our other proofs. [sent-132, score-0.08]

51 The resulting update operator for one outer iteration of the Hogwild Gibbs sampling process can be written as x(t+1) = (B⁻¹C)^q x(t) + Σ_{j=0}^{q−1} (B⁻¹C)^j B⁻¹ (A x(t) + h + v(t,j)), with v(t,j) ∼ N(0, D) i.i.d. (4), where D is the diagonal of J. [sent-134, score-0.383]

52 Note that we shape the noise diagonally because in Hogwild Gibbs sampling we simply apply standard Gibbs updates in the inner loop. [sent-135, score-0.311]

53 Convergence and correctness of means: because the Gaussian Hogwild Gibbs sampling iterates form a Gaussian linear dynamical system, the process is stable (i.e., convergent for any initialization) exactly when the spectral radius of its deterministic update map is less than one. [sent-139, score-0.277]

54 We can write T = Tind + (I − Tind)Tblock, where Tind is the purely Gauss-Seidel update when A = 0 and Tblock is the block Jacobi update, which corresponds to solving the processor-local linear systems exactly at each outer iteration. [sent-146, score-0.258]

55 The update (5) falls into the class of two-stage splitting methods [14, 17, 18], and the next proposition is equivalent to such two-stage solvers having the correct fixed point. [sent-147, score-0.285]

56 If the process is stable the mean process has a unique fixed point, and from (4) and (5) we can write the fixed-point equation for the process mean µhog as (I − T)µhog = (I − Tind)(I − Tblock)µhog = (I − Tind)(B − C)⁻¹ h; hence (I − (B − C)⁻¹A)µhog = (B − C)⁻¹ h and µhog = (B − C − A)⁻¹ h = J⁻¹ h, the correct mean. [sent-151, score-0.14]
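
The pieces above can be checked numerically; the sketch below builds the Hogwild splitting for a small example, forms the update operator of equation (4), and verifies the decomposition and the fixed point of the mean process (the splitting-construction code and all names are illustrative assumptions, not the authors' implementation).

```python
import numpy as np

rng = np.random.default_rng(3)
n, q = 6, 3
Q = rng.normal(size=(n, n))
J = Q @ Q.T + n * np.eye(n)
h = rng.normal(size=n)
partition = [np.array([0, 1, 2]), np.array([3, 4, 5])]

# Hogwild splitting J = B - C - A: B - C is the block diagonal (B lower-triangular
# within blocks, C strictly upper-triangular within blocks), A holds cross-processor terms.
Dblock = np.zeros_like(J)
for blk in partition:
    Dblock[np.ix_(blk, blk)] = J[np.ix_(blk, blk)]
A = Dblock - J
B = np.tril(Dblock)
C = B - Dblock

Binv = np.linalg.inv(B)
G = Binv @ C                                           # local sweep map B^{-1} C
S = sum(np.linalg.matrix_power(G, j) for j in range(q)) @ Binv
T = np.linalg.matrix_power(G, q) + S @ A               # deterministic part of (4)

# Decomposition T = T_ind + (I - T_ind) T_block
T_ind = np.linalg.matrix_power(G, q)
T_block = np.linalg.solve(Dblock, A)
print(np.allclose(T, T_ind + (np.eye(n) - T_ind) @ T_block))

# The mean process x <- T x + S h has the true mean mu = J^{-1} h as its fixed point
mu = np.linalg.solve(J, h)
print(np.allclose(T @ mu + S @ h, mu))
print("spectral radius of T:", max(abs(np.linalg.eigvals(T))))
```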

57 Despite the complexity of the update map’s stability, in the next subsection we give a simple argument that identifies its convergence with the convergence of Gauss-Seidel iterates on a larger, non-symmetric linear system. [sent-154, score-0.201]

58 Given that relationship we then prove a condition on the entries of J that ensures the convergence of the Gaussian Hogwild Gibbs sampling process for any choice of partition or sub-iteration count. [sent-155, score-0.256]

59 Therefore one step of Gauss-Seidel on the larger system corresponds to k applications of the Gauss-Seidel update L⁻¹U from the original system (i.e., to applying (L⁻¹U)^k from (6)) to the last block element of the lifted state vector. [sent-173, score-0.412]

60 Figure 1(a): support pattern (in black) of the Hogwild splitting J = B − C − A with n = 9 and the processor partition {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}.

61 (Figure 1 legend: off-block-diagonal error and block-diagonal error.)

62 Figure 1 caption: (a) visualization of the Hogwild splitting; (b) Hogwild stability for generic models; (c) and (d) typical plots of ||Π(Σ − Σhog)||Fro, where Π projects to the block diagonal in (c) and to the off-block-diagonal in (d).

63 In (b) each point corresponds to a sampled model J = QQᵀ + nrI with Qij ∼ N(0, 1) i.i.d. and r ∼ Uniform[0.

64 Two applications of the Hogwild update T of (5) are equivalent to the update to the last block element of the state vector in one Gauss-Seidel iteration on the (2qn) × (2qn) system [[E, −F], [−F, E]] x = h̃. (7) [sent-221, score-0.318]

65 By comparing to the block update in (3), it suffices to consider E −1 F . [sent-235, score-0.184]

66 Furthermore, since the claim concerns the last block entry, we need only consider the last block row of E −1 F . [sent-236, score-0.252]

67 E is block lower-bidiagonal, like the matrix that is inverted in (6), so E⁻¹ has the same lower-triangular form as in (6), and the product of the last block row of E⁻¹ with the last block column of F yields (B⁻¹C)^q + Σ_{j=0}^{q−1} (B⁻¹C)^j B⁻¹ A = T. [sent-237, score-0.389]

68 Gaussian Hogwild Gibbs sampling is convergent if Gauss-Seidel converges on (7). [sent-239, score-0.182]

69 Unfortunately the lifting is not symmetric and so we cannot impose positive semi-definiteness on the lifted system; however, another sufficient condition for Gauss-Seidel stability can be applied: Theorem 1. [sent-240, score-0.256]

70 If J is generalized diagonally dominant, then Hogwild Gibbs sampling is convergent for any variable partition and any number of sub-iterations. [sent-246, score-0.224]

71 If J is generalized diagonally dominant then there exists a diagonal scaling matrix R such that J̃ := JR is row diagonally dominant, i.e., |J̃ii| > Σ_{j≠i} |J̃ij| for each row i. [sent-248, score-0.349]

72 Since each scalar row of the coefficient matrix in (7) contains only entries from one row of J and zeros, it is generalized diagonally dominant with a scaling matrix that consists of 2q copies of R along the diagonal. [sent-251, score-0.349]

73 Finally, Gauss-Seidel iterations on generalized diagonally dominant systems are convergent [13, Theorem 5. [sent-252, score-0.329]

74 More broadly, diagonally dominant systems are well-known for their tractability and applicability in many other settings [19], and Gaussian Hogwild Gibbs provides another example of their utility. [sent-256, score-0.215]
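
One way to test the hypothesis of Theorem 1 numerically is sketched below; it uses the classical characterization of generalized diagonal dominance (J is an H-matrix exactly when the Jacobi iteration matrix of its comparison matrix has spectral radius below one), which is a standard linear-algebra fact rather than something stated in the extracted text, and the helper name is illustrative.

```python
import numpy as np

def is_generalized_diagonally_dominant(J, tol=1e-12):
    """Check whether J is generalized diagonally dominant (an H-matrix).

    Forms the comparison matrix <J> = diag(|J_ii|) - |off-diagonal(J)|; J is an
    H-matrix iff the Jacobi iteration matrix of <J> has spectral radius < 1.
    Assumes all diagonal entries are nonzero.
    """
    D = np.abs(np.diag(J))
    N = np.abs(J - np.diag(np.diag(J)))      # |off-diagonal entries|
    jacobi = N / D[:, None]                  # diag(|J_ii|)^{-1} |off-diagonal|
    rho = max(abs(np.linalg.eigvals(jacobi)))
    return rho < 1 - tol, rho

# Symmetric positive definite, not row diagonally dominant in its second row (4 < 3 + 2),
# but generalized diagonally dominant, so Theorem 1 still applies.
J = np.array([[4.0, 3.0, 0.0],
              [3.0, 4.0, 2.0],
              [0.0, 2.0, 4.0]])
print(is_generalized_diagonally_dominant(J))   # (True, ~0.90)
```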

75 Because of the connection to linear system solvers based on two-stage multisplittings, this result can be identified with [18, Theorem 2.3]. [sent-257, score-0.128]

76 [18, Theorem 2.3] shows that if the coefficient matrix is an H-matrix then the two-stage iterative solver is convergent. [sent-258, score-0.113]

77 Indeed, by the connection between solvers and samplers one can prove our Theorem 1 as a corollary to [18, Theorem 2.3]. [sent-259, score-0.163]

78 The sufficient condition provided by Theorem 1 is coarse in that it provides convergence for any partition or update schedule. [sent-262, score-0.168]

79 Exact local block samples: convergence analysis simplifies greatly in the case where exact block samples are drawn at each processor, because q is sufficiently large or because another exact sampler [9, 10] is used on each processor. [sent-266, score-0.449]

80 This regime of Hogwild Gibbs sampling is particularly interesting because it minimizes communication between processors. [sent-267, score-0.133]

81 In (4), we see that as q → ∞ we have T → Tblock; that is, the deterministic part of the update becomes the block Jacobi update map, which admits a natural sufficient condition for convergence, stated as Proposition 4. [sent-268, score-0.281]

82 If ((B − C)^(−1/2) A (B − C)^(−1/2))² ≺ I, then block Hogwild Gibbs sampling converges. [sent-269, score-0.248]
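
A small check of the strict form of Proposition 4's condition, sketched with the same block splitting assumptions as before; since B − C is symmetric positive definite here, the condition amounts to all eigenvalues of the symmetrized block-Jacobi matrix lying in (−1, 1), which also bounds the spectral radius of Tblock = (B − C)⁻¹A.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 6
Q = rng.normal(size=(n, n))
J = Q @ Q.T + n * np.eye(n)
partition = [np.array([0, 1, 2]), np.array([3, 4, 5])]

# Block-diagonal part B - C = D_block and cross-processor part A, with J = (B - C) - A
Dblock = np.zeros_like(J)
for blk in partition:
    Dblock[np.ix_(blk, blk)] = J[np.ix_(blk, blk)]
A = Dblock - J

# Symmetrized block-Jacobi matrix (B - C)^{-1/2} A (B - C)^{-1/2}
w, V = np.linalg.eigh(Dblock)
Dblock_invsqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
M = Dblock_invsqrt @ A @ Dblock_invsqrt

lam = np.linalg.eigvalsh(M)
print("Proposition 4 condition (strict) holds:", np.all(np.abs(lam) < 1))
# T_block = (B - C)^{-1} A is similar to M, so it shares these eigenvalues
print("rho(T_block) =", max(abs(np.linalg.eigvals(np.linalg.solve(Dblock, A)))))
```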

83 Variances: since we can analyze Gaussian Hogwild Gibbs sampling as a linear dynamical system, we can write an expression for the steady-state covariance Σhog of the process when it is stable. [sent-273, score-0.284]

84 For a general stable LDS of the form x(t+1) = T x(t) + v(t) with v(t) ∼ N(0, Σinj), the steady-state covariance is given by the series Σ_{t=0}^{∞} T^t Σinj (T^t)ᵀ, which is the solution of the linear discrete-time Lyapunov equation Σ − T Σ Tᵀ = Σinj in Σ. [sent-274, score-0.124]

85 The injected noise for the outer loop of the Hogwild iterations is generated by the inner loop, which itself has injected noise with covariance D, the diagonal of J, so for Hogwild sampling we have Σinj = Σ_{j=0}^{q−1} (B⁻¹C)^j B⁻¹ D B⁻ᵀ ((B⁻¹C)^j)ᵀ. [sent-275, score-0.494]
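
These expressions can be evaluated directly for a small example; the sketch below assembles T and Σinj for the splitting described above and solves the discrete Lyapunov equation with SciPy. It is an illustrative sketch with names and a test matrix chosen here, not the authors' code.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(4)
n, q = 6, 2
Q = rng.normal(size=(n, n))
J = Q @ Q.T + n * np.eye(n)
partition = [np.array([0, 1, 2]), np.array([3, 4, 5])]

# Hogwild splitting J = B - C - A (block diagonal B - C, cross-processor terms A)
Dblock = np.zeros_like(J)
for blk in partition:
    Dblock[np.ix_(blk, blk)] = J[np.ix_(blk, blk)]
A = Dblock - J
B = np.tril(Dblock)
C = B - Dblock

Binv = np.linalg.inv(B)
G = Binv @ C
D = np.diag(np.diag(J))                       # covariance of the inner-loop noise

# Deterministic update T and injected-noise covariance Sigma_inj of the outer loop
T = np.linalg.matrix_power(G, q) \
    + sum(np.linalg.matrix_power(G, j) for j in range(q)) @ Binv @ A
Sigma_inj = sum(np.linalg.matrix_power(G, j) @ Binv @ D @ Binv.T
                @ np.linalg.matrix_power(G, j).T for j in range(q))

rho = max(abs(np.linalg.eigvals(T)))
print("rho(T) =", rho)
if rho < 1:                                   # stable, so the steady-state covariance exists
    Sigma_hog = solve_discrete_lyapunov(T, Sigma_inj)   # solves X = T X T^T + Sigma_inj
    Sigma_exact = np.linalg.inv(J)
    print("||Sigma - Sigma_hog||_F =", np.linalg.norm(Sigma_exact - Sigma_hog))
```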

86 Composing these expressions we see that the Hogwild covariance is complicated, but we can analyze some salient properties in at least two regimes: when A is small and when local processors draw exact block samples (e. [sent-277, score-0.351]

87 When A = 0, the model is independent across processors and both the exact covariance and the Hogwild steady-state covariance for any q is (B − C)−1 . [sent-283, score-0.276]

88 (8) Since A has zero block-diagonal and S is block-diagonal, we see that to first order A has no effect on the block-diagonal of either the exact covariance or the Hogwild covariance. [sent-287, score-0.109]

89 As shown in Figure 1c, in numerical experiments higher-order terms improve the Hogwild covariance on the block diagonal relative to the A = 0 approximation, and the improvements increase with local mixing rates. [sent-288, score-0.339]

90 The off-block-diagonal first-order term in the Hogwild covariance is nonzero and it depends on the local mixing performed by S. [sent-289, score-0.169]

91 In particular, if global synchronization happens infrequently relative to the speed of local sampler mixing (e. [sent-290, score-0.215]

92 Figure 1d shows the off-block-diagonal error increasing with faster local mixing for small A. [sent-294, score-0.089]

93 Such an effect may be undesirable because increased local mixing reflects greater parallelism (or an application of more powerful local samplers [9, 10]). [sent-296, score-0.251]

94 In the next subsection we show that this case admits a special analysis and even an inexpensive correction to recover asymptotically unbiased estimates for the full covariance matrix. [sent-297, score-0.101]

95 Exact local block samples: as local mixing increases, e.g. [sent-300, score-0.244]

96 as q → ∞ or if we use an exact block local sampler between global synchronizations, we are effectively sampling in the lifted model of Eq. (3). [sent-302, score-0.508]

97 When local block samples are exact, the Hogwild sampled covariance ΣHog satisfies Σ = (I + (B − C)−1 A)ΣHog and ||Σ − ΣHog || ≤ ||(B − C)−1 A|| ||ΣHog || where Σ = J −1 is the exact target covariance and || · || is any submultiplicative matrix norm. [sent-304, score-0.366]

98 Using the lifting in (3), the Hogwild process steady-state covariance is the marginal covariance of half of the lifted state vector, so using Schur complements we can write ΣHog = ((B − C) − A(B −C)−1 A)−1 = [I +((B −C)−1 A)2 +· · · ](B −C)−1 . [sent-306, score-0.421]

99 (4) When local samplers are run to convergence, we can bound the error in the Hogwild variances and even efficiently correct estimates of the full covariance (Proposition 5). [sent-310, score-0.297]
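
A sketch of the covariance correction suggested by Proposition 5 in the exact-local-block-samples regime, using the same block splitting as in the earlier sketches; the matrices and names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
Q = rng.normal(size=(n, n))
J = Q @ Q.T + n * np.eye(n)
partition = [np.array([0, 1, 2]), np.array([3, 4, 5])]

# Block-diagonal / cross-processor split: J = D_block - A with D_block = B - C
Dblock = np.zeros_like(J)
for blk in partition:
    Dblock[np.ix_(blk, blk)] = J[np.ix_(blk, blk)]
A = Dblock - J

# Steady-state Hogwild covariance with exact local block samples (Schur-complement form)
Sigma_hog = np.linalg.inv(Dblock - A @ np.linalg.solve(Dblock, A))

# Proposition 5's correction recovers the exact covariance Sigma = J^{-1}
Sigma_corrected = (np.eye(n) + np.linalg.solve(Dblock, A)) @ Sigma_hog
Sigma_exact = np.linalg.inv(J)
print(np.allclose(Sigma_corrected, Sigma_exact))
print("uncorrected error:", np.linalg.norm(Sigma_exact - Sigma_hog))
```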

100 “PLDA+: Parallel latent dirichlet allocation with data placement and pipeline processing”. [sent-345, score-0.101]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hogwild', 0.721), ('gibbs', 0.311), ('hog', 0.242), ('diagonally', 0.133), ('sampling', 0.133), ('jacobi', 0.129), ('tblock', 0.128), ('block', 0.115), ('samplers', 0.1), ('lifted', 0.098), ('tind', 0.091), ('processors', 0.087), ('splitting', 0.086), ('parallel', 0.085), ('dominant', 0.082), ('lifting', 0.08), ('covariance', 0.08), ('inj', 0.073), ('proposition', 0.072), ('gaussian', 0.072), ('sampler', 0.071), ('update', 0.069), ('system', 0.065), ('injected', 0.056), ('diagonal', 0.055), ('dblock', 0.055), ('lds', 0.053), ('iterates', 0.053), ('iid', 0.052), ('outer', 0.05), ('processor', 0.05), ('stability', 0.05), ('convergent', 0.049), ('mixing', 0.049), ('yik', 0.048), ('solver', 0.048), ('allocation', 0.047), ('lt', 0.047), ('stable', 0.044), ('iterative', 0.043), ('partition', 0.042), ('iterations', 0.04), ('local', 0.04), ('understanding', 0.039), ('ad', 0.037), ('strategy', 0.037), ('schur', 0.037), ('bekkerman', 0.037), ('frommer', 0.037), ('gaussseidel', 0.037), ('multisplittings', 0.037), ('ruozzi', 0.037), ('potentials', 0.036), ('complements', 0.035), ('ihler', 0.034), ('synchronization', 0.033), ('berman', 0.032), ('jii', 0.032), ('locking', 0.032), ('qqt', 0.032), ('solvers', 0.032), ('latent', 0.032), ('eecs', 0.032), ('connection', 0.031), ('convergence', 0.029), ('exact', 0.029), ('url', 0.028), ('condition', 0.028), ('xik', 0.028), ('sweep', 0.028), ('great', 0.027), ('powers', 0.027), ('correct', 0.026), ('willsky', 0.025), ('generalized', 0.025), ('process', 0.024), ('write', 0.024), ('niu', 0.024), ('inner', 0.024), ('dynamical', 0.023), ('effectiveness', 0.023), ('stationary', 0.022), ('global', 0.022), ('matrix', 0.022), ('parallelism', 0.022), ('analyzing', 0.022), ('dirichlet', 0.022), ('variances', 0.022), ('row', 0.022), ('parallelizing', 0.022), ('steps', 0.021), ('serial', 0.021), ('strictly', 0.021), ('composing', 0.021), ('asynchronous', 0.021), ('subsection', 0.021), ('updates', 0.021), ('scaling', 0.021), ('errors', 0.02), ('asuncion', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999928 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling

Author: Matthew Johnson, James Saunderson, Alan Willsky

Abstract: Sampling inference methods are computationally difficult to scale for many models in part because global dependencies can reduce opportunities for parallel computation. Without strict conditional independence structure among variables, standard Gibbs sampling theory requires sample updates to be performed sequentially, even if dependence between most variables is not strong. Empirical work has shown that some models can be sampled effectively by going “Hogwild” and simply running Gibbs updates in parallel with only periodic global communication, but the successes and limitations of such a strategy are not well understood. As a step towards such an understanding, we study the Hogwild Gibbs sampling strategy in the context of Gaussian distributions. We develop a framework which provides convergence conditions and error bounds along with simple proofs and connections to methods in numerical linear algebra. In particular, we show that if the Gaussian precision matrix is generalized diagonally dominant, then any Hogwild Gibbs sampler, with any update schedule or allocation of variables to processors, yields a stable sampling process with the correct sample mean. 1

2 0.21445614 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations

Author: Tamir Hazan, Subhransu Maji, Tommi Jaakkola

Abstract: In this paper we describe how MAP inference can be used to sample efficiently from Gibbs distributions. Specifically, we provide means for drawing either approximate or unbiased samples from Gibbs’ distributions by introducing low dimensional perturbations and solving the corresponding MAP assignments. Our approach also leads to new ways to derive lower bounds on partition functions. We demonstrate empirically that our method excels in the typical “high signal high coupling” regime. The setting results in ragged energy landscapes that are challenging for alternative approaches to sampling and/or lower bounds. 1

3 0.12913661 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing

Author: Justin Domke, Xianghang Liu

Abstract: Inference in general Ising models is difficult, due to high treewidth making treebased algorithms intractable. Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling.

4 0.1290559 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

Author: Jason Chang, John W. Fisher III

Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1

5 0.11520632 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models

Author: Jie Liu, David Page

Abstract: In large-scale applications of undirected graphical models, such as social networks and biological networks, similar patterns occur frequently and give rise to similar parameters. In this situation, it is beneficial to group the parameters for more efficient learning. We show that even when the grouping is unknown, we can infer these parameter groups during learning via a Bayesian approach. We impose a Dirichlet process prior on the parameters. Posterior inference usually involves calculating intractable terms, and we propose two approximation algorithms, namely a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with “stripped” Beta approximation (Gibbs SBA). Simulations show that both algorithms outperform conventional maximum likelihood estimation (MLE). Gibbs SBA’s performance is close to Gibbs sampling with exact likelihood calculation. Models learned with Gibbs SBA also generalize better than the models learned by MLE on real-world Senate voting data. 1

6 0.092938922 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

7 0.082351699 119 nips-2013-Fast Template Evaluation with Vector Quantization

8 0.080235332 146 nips-2013-Large Scale Distributed Sparse Precision Estimation

9 0.074675649 122 nips-2013-First-order Decomposition Trees

10 0.074582227 161 nips-2013-Learning Stochastic Inverses

11 0.06831789 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

12 0.065037258 220 nips-2013-On the Complexity and Approximation of Binary Evidence in Lifted Inference

13 0.064375319 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions

14 0.062371727 40 nips-2013-Approximate Inference in Continuous Determinantal Processes

15 0.062291507 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

16 0.058685187 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

17 0.0577485 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing

18 0.055232499 123 nips-2013-Flexible sampling of discrete data correlations without the marginal distributions

19 0.0549183 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex

20 0.053872637 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.16), (1, 0.046), (2, -0.01), (3, 0.029), (4, 0.01), (5, 0.116), (6, 0.09), (7, 0.004), (8, 0.074), (9, -0.042), (10, 0.029), (11, 0.031), (12, 0.033), (13, 0.02), (14, 0.005), (15, 0.026), (16, -0.102), (17, -0.088), (18, -0.004), (19, 0.099), (20, -0.036), (21, -0.024), (22, 0.018), (23, -0.005), (24, 0.04), (25, 0.065), (26, 0.022), (27, -0.041), (28, -0.051), (29, 0.128), (30, 0.103), (31, -0.064), (32, 0.052), (33, -0.019), (34, 0.049), (35, -0.09), (36, 0.027), (37, -0.012), (38, 0.177), (39, -0.104), (40, -0.03), (41, -0.032), (42, 0.128), (43, 0.107), (44, -0.091), (45, 0.032), (46, 0.042), (47, 0.017), (48, -0.135), (49, -0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.948762 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling

Author: Matthew Johnson, James Saunderson, Alan Willsky

Abstract: Sampling inference methods are computationally difficult to scale for many models in part because global dependencies can reduce opportunities for parallel computation. Without strict conditional independence structure among variables, standard Gibbs sampling theory requires sample updates to be performed sequentially, even if dependence between most variables is not strong. Empirical work has shown that some models can be sampled effectively by going “Hogwild” and simply running Gibbs updates in parallel with only periodic global communication, but the successes and limitations of such a strategy are not well understood. As a step towards such an understanding, we study the Hogwild Gibbs sampling strategy in the context of Gaussian distributions. We develop a framework which provides convergence conditions and error bounds along with simple proofs and connections to methods in numerical linear algebra. In particular, we show that if the Gaussian precision matrix is generalized diagonally dominant, then any Hogwild Gibbs sampler, with any update schedule or allocation of variables to processors, yields a stable sampling process with the correct sample mean. 1

2 0.77234393 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models

Author: Jie Liu, David Page

Abstract: In large-scale applications of undirected graphical models, such as social networks and biological networks, similar patterns occur frequently and give rise to similar parameters. In this situation, it is beneficial to group the parameters for more efficient learning. We show that even when the grouping is unknown, we can infer these parameter groups during learning via a Bayesian approach. We impose a Dirichlet process prior on the parameters. Posterior inference usually involves calculating intractable terms, and we propose two approximation algorithms, namely a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with “stripped” Beta approximation (Gibbs SBA). Simulations show that both algorithms outperform conventional maximum likelihood estimation (MLE). Gibbs SBA’s performance is close to Gibbs sampling with exact likelihood calculation. Models learned with Gibbs SBA also generalize better than the models learned by MLE on real-world Senate voting data. 1

3 0.74790543 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations

Author: Tamir Hazan, Subhransu Maji, Tommi Jaakkola

Abstract: In this paper we describe how MAP inference can be used to sample efficiently from Gibbs distributions. Specifically, we provide means for drawing either approximate or unbiased samples from Gibbs’ distributions by introducing low dimensional perturbations and solving the corresponding MAP assignments. Our approach also leads to new ways to derive lower bounds on partition functions. We demonstrate empirically that our method excels in the typical “high signal high coupling” regime. The setting results in ragged energy landscapes that are challenging for alternative approaches to sampling and/or lower bounds. 1

4 0.71561426 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing

Author: Justin Domke, Xianghang Liu

Abstract: Inference in general Ising models is difficult, due to high treewidth making treebased algorithms intractable. Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling.

5 0.60881072 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

Author: Jason Chang, John W. Fisher III

Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1

6 0.59718084 161 nips-2013-Learning Stochastic Inverses

7 0.55041134 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing

8 0.51371127 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

9 0.48536968 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions

10 0.46568733 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models

11 0.44967818 160 nips-2013-Learning Stochastic Feedforward Neural Networks

12 0.44778007 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

13 0.44159308 352 nips-2013-What do row and column marginals reveal about your dataset?

14 0.4326736 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex

15 0.42486078 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation

16 0.41619927 122 nips-2013-First-order Decomposition Trees

17 0.3986797 220 nips-2013-On the Complexity and Approximation of Binary Evidence in Lifted Inference

18 0.39724523 306 nips-2013-Speeding up Permutation Testing in Neuroimaging

19 0.39120382 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

20 0.38352418 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.373), (33, 0.152), (34, 0.095), (41, 0.027), (49, 0.03), (56, 0.092), (70, 0.014), (85, 0.029), (89, 0.028), (93, 0.035), (95, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97857982 61 nips-2013-Capacity of strong attractor patterns to model behavioural and cognitive prototypes

Author: Abbas Edalat

Abstract: We solve the mean field equations for a stochastic Hopfield network with temperature (noise) in the presence of strong, i.e., multiply stored, patterns, and use this solution to obtain the storage capacity of such a network. Our result provides for the first time a rigorous solution of the mean field equations for the standard Hopfield model and is in contrast to the mathematically unjustifiable replica technique that has been used hitherto for this derivation. We show that the critical temperature for stability of a strong pattern is equal to its degree or multiplicity, when the sum of the squares of degrees of the patterns is negligible compared to the network size. In the case of a single strong pattern, when the ratio of the number of all stored patterns and the network size is a positive constant, we obtain the distribution of the overlaps of the patterns with the mean field and deduce that the storage capacity for retrieving a strong pattern exceeds that for retrieving a simple pattern by a multiplicative factor equal to the square of the degree of the strong pattern. This square law property provides justification for using strong patterns to model attachment types and behavioural prototypes in psychology and psychotherapy. 1

2 0.95098418 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals

Author: Barbara Rakitsch, Christoph Lippert, Karsten Borgwardt, Oliver Stegle

Abstract: Multi-task prediction methods are widely used to couple regressors or classification models by sharing information across related tasks. We propose a multi-task Gaussian process approach for modeling both the relatedness between regressors and the task correlations in the residuals, in order to more accurately identify true sharing between regressors. The resulting Gaussian model has a covariance term in form of a sum of Kronecker products, for which efficient parameter inference and out of sample prediction are feasible. On both synthetic examples and applications to phenotype prediction in genetics, we find substantial benefits of modeling structured noise compared to established alternatives. 1

3 0.93258464 245 nips-2013-Pass-efficient unsupervised feature selection

Author: Crystal Maung, Haim Schweitzer

Abstract: The goal of unsupervised feature selection is to identify a small number of important features that can represent the data. We propose a new algorithm, a modification of the classical pivoted QR algorithm of Businger and Golub, that requires a small number of passes over the data. The improvements are based on two ideas: keeping track of multiple features in each pass, and skipping calculations that can be shown not to affect the final selection. Our algorithm selects the exact same features as the classical pivoted QR algorithm, and has the same favorable numerical stability. We describe experiments on real-world datasets which sometimes show improvements of several orders of magnitude over the classical algorithm. These results appear to be competitive with recently proposed randomized algorithms in terms of pass efficiency and run time. On the other hand, the randomized algorithms may produce more accurate features, at the cost of small probability of failure. 1

4 0.89928037 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

Author: Jason Chang, John W. Fisher III

Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1

same-paper 5 0.85110044 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling

Author: Matthew Johnson, James Saunderson, Alan Willsky

Abstract: Sampling inference methods are computationally difficult to scale for many models in part because global dependencies can reduce opportunities for parallel computation. Without strict conditional independence structure among variables, standard Gibbs sampling theory requires sample updates to be performed sequentially, even if dependence between most variables is not strong. Empirical work has shown that some models can be sampled effectively by going “Hogwild” and simply running Gibbs updates in parallel with only periodic global communication, but the successes and limitations of such a strategy are not well understood. As a step towards such an understanding, we study the Hogwild Gibbs sampling strategy in the context of Gaussian distributions. We develop a framework which provides convergence conditions and error bounds along with simple proofs and connections to methods in numerical linear algebra. In particular, we show that if the Gaussian precision matrix is generalized diagonally dominant, then any Hogwild Gibbs sampler, with any update schedule or allocation of variables to processors, yields a stable sampling process with the correct sample mean. 1

6 0.85008609 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions

7 0.67353207 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

8 0.66522461 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

9 0.66044128 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

10 0.65041089 252 nips-2013-Predictive PAC Learning and Process Decompositions

11 0.6468969 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

12 0.64577454 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series

13 0.64192837 355 nips-2013-Which Space Partitioning Tree to Use for Search?

14 0.63794237 350 nips-2013-Wavelets on Graphs via Deep Learning

15 0.63693368 47 nips-2013-Bayesian Hierarchical Community Discovery

16 0.62859404 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

17 0.62536186 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

18 0.62512577 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models

19 0.61916012 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components

20 0.61843967 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking