nips nips2013 nips2013-243 knowledge-graph by maker-knowledge-mining

243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits


Source: pdf

Author: Jason Chang, John W. Fisher III

Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. [sent-6, score-0.485]

2 We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. [sent-7, score-0.259]

3 Each cluster is augmented with two subclusters to construct likely split moves. [sent-8, score-0.514]

4 Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. [sent-9, score-0.461]

5 Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. [sent-10, score-0.292]

6 Posterior sampling in complex models such as DPMMs is often difficult because samplers that propose local changes exhibit poor convergence. [sent-17, score-0.379]

7 Here, we develop a sampler for DPMMs that: (1) preserves limiting guarantees; (2) proposes splits and merges to improve convergence; (3) can be parallelized to accommodate large datasets; and (4) is applicable to a wide variety of DPMMs (conjugate and non-conjugate). [sent-21, score-0.593]

8 While we focus on DP mixture models here, similar methods can be extended for mixture models with other priors (finite Dirichlet distributions, Pitman-Yor Processes, etc. [sent-23, score-0.334]

9 The majority of DPMM samplers fit into one of two categories: collapsed-weight (CW) samplers, which marginalize the mixture weights, and instantiated-weight (IW) samplers, which represent them explicitly. (Footnote: Jason Chang was partially supported by the Office of Naval Research Multidisciplinary Research Initiative (MURI) program, award N000141110688.) [sent-31, score-0.6]

10 CW samplers, which marginalize the weights via Chinese restaurant process representations (e.g., [21, 23]), sample the cluster labels iteratively, one data point at a time. [sent-39, score-0.354]

11 When a conjugate prior is used, one can also marginalize out cluster parameters. [sent-40, score-0.395]

12 Additionally, due to the particular marginalization schemes, these samplers cannot be parallelized. [sent-44, score-0.3]

13 Instantiated-weight (IW) samplers explicitly represent cluster weights, typically using a finite approximation to the DP (e. [sent-45, score-0.532]

14 Recently, [7] and [24] have eliminated the need for this approximation; however, IW samplers still suffer from convergence issues. [sent-48, score-0.369]

15 If cluster parameters are marginalized, it can be very unlikely for a single point to start a new cluster. [sent-49, score-0.232]

16 When cluster parameters are instantiated, samples of parameters from the prior are often a poor fit to the data. [sent-50, score-0.259]

17 However, IW samplers are often useful because they can be parallelized across each data point conditioned on the weights and parameters. [sent-51, score-0.427]

18 We refer to this type of algorithm as “inter-cluster parallelizable”, since the cluster label for each point within a cluster can be sampled in parallel. [sent-52, score-0.513]

19 We refer to this type of implementation as “intra-cluster parallelizable”, since points in different super-clusters can be sampled in parallel, but points within a cluster cannot. [sent-55, score-0.298]

20 Recent CW samplers consider larger moves to address convergence issues. [sent-59, score-0.408]

21 Green and Richardson [9] present a reversible jump MCMC sampler that proposes splitting and merging components. [sent-60, score-0.328]

22 Proposed splits are unlikely to fit the posterior since auxiliary variables governing the split cluster parameters and weights are proposed independent of the data. [sent-62, score-0.905]

23 Jain and Neal [13, 14] construct a split by running multiple restricted Gibbs scans for a single cluster in conjugate and nonconjugate models. [sent-63, score-0.651]

24 Dahl [5] proposes a split scheme for conjugate models by reassigning labels of a cluster sequentially. [sent-66, score-0.62]

25 All current split samplers construct a proposed move to be used in a Metropolis-Hastings framework. [sent-67, score-0.552]

26 If the split is rejected, considerable computation is wasted, and all information contained in learning the split is forgotten. [sent-68, score-0.376]

27 In contrast, the proposed method of fitting sub-clusters iteratively learns likely split proposals with the auxiliary variables. [sent-69, score-0.494]

28 Additionally, we show that split proposals can be computed in parallel, allowing for very efficient implementations. [sent-70, score-0.263]

29 Therefore, posterior MCMC inference in a DPMM could, in theory, alternate between the following samplers: [sent-76, score-0.355]

30 $(\pi_1, \pi_2, \dots) \sim p(\pi \mid z, \alpha)$ (1); $\theta_k \sim p(\theta_k \mid \cdot) \propto f_x(x_{\{k\}}; \theta_k)\, f_\theta(\theta_k; \lambda)$, $\forall k \in \{1, \dots, \infty\}$ (2); $z_i \sim p(z_i \mid \cdot) \propto \sum_{k=1}^{\infty} \pi_k\, f_x(x_i; \theta_k)\, \mathbb{1}[z_i = k]$, $\forall i$ (3). [sent-79, score-0.737]

31 We use $f_x(x_{\{k\}}; \theta_k)$ to denote the product of likelihoods for all data points in cluster $k$. [sent-86, score-0.599]

32 When conjugate priors are used, the posterior distribution for cluster parameters is in the same family as the prior: $p(\theta_k \mid x, z, \lambda) \propto f_\theta(\theta_k; \lambda)\, f_x(x_{\{k\}}; \theta_k) \propto f_\theta(\theta_k; \lambda_k^*)$ (4), where $\lambda_k^*$ denotes the posterior hyperparameters for cluster $k$. [sent-87, score-0.705]
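As a brief illustration of Equation 4, the sketch below works out the posterior hyperparameters for a one-dimensional Gaussian likelihood with known variance and a Gaussian prior on the mean; the function and variable names are illustrative assumptions, not from the paper.

```python
import numpy as np

def gaussian_known_var_posterior(x_k, mu0, tau0_sq, sigma_sq):
    """Posterior hyperparameters lambda_k^* = (mu_n, tau_n_sq) for cluster k,
    given a N(mu0, tau0_sq) prior on the mean and known likelihood variance sigma_sq."""
    n = len(x_k)                                  # points currently assigned to cluster k
    post_prec = 1.0 / tau0_sq + n / sigma_sq      # posterior precision
    tau_n_sq = 1.0 / post_prec
    mu_n = tau_n_sq * (mu0 / tau0_sq + np.sum(x_k) / sigma_sq)
    return mu_n, tau_n_sq

# Sampling theta_k from Equation 4 then reduces to one draw from N(mu_n, tau_n_sq).
```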

33 When cluster parameters are explicitly sampled, these algorithms may additionally suffer from slow convergence issues. [sent-92, score-0.37]

34 4 Exact Parallel Instantiated-Weight Samplers We now present a novel alternative to the instantiated-weight samplers that does not require any finite model approximations. [sent-104, score-0.3]

35 In particular, if one desires to sample from a target distribution $\pi(z)$, satisfying detailed balance with respect to an ergodic Markov chain guarantees that simulations of the chain converge uniquely to the target distribution of interest. [sent-106, score-0.386]

36 If a Markov chain is constructed with a transition distribution $q(\hat{z} \mid z)$ that satisfies $\pi(z)\, q(\hat{z} \mid z) = \pi(\hat{z})\, q(z \mid \hat{z})$, then the chain is said to satisfy the detailed balance condition and $\pi(z)$ is guaranteed to be a stationary distribution of the chain. [sent-111, score-0.315]
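For context, the standard Metropolis-Hastings construction shows how detailed balance can be enforced for an arbitrary proposal $q$; this is textbook material rather than a result specific to this paper:

```latex
a(\hat{z} \mid z) = \min\!\left(1,\;
  \frac{\pi(\hat{z})\, q(z \mid \hat{z})}{\pi(z)\, q(\hat{z} \mid z)}\right),
\qquad
\pi(z)\, q(\hat{z} \mid z)\, a(\hat{z} \mid z)
  = \min\!\bigl(\pi(z)\, q(\hat{z} \mid z),\; \pi(\hat{z})\, q(z \mid \hat{z})\bigr)
  = \pi(\hat{z})\, q(z \mid \hat{z})\, a(z \mid \hat{z}).
```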

37 We define a restricted sampler as one that satisfies detailed balance (e. [sent-112, score-0.435]

38 One key observation of this work is that multiple restricted samplers can be combined to form an ergodic chain. [sent-116, score-0.484]

39 We consider a sampler that is restricted to only sample labels belonging to non-empty clusters. [sent-121, score-0.413]

40 Such a sampler is not ergodic because it cannot create new clusters. [sent-122, score-0.355]

41 However, when mixed with a sampler that proposes splits, the resulting chain is ergodic and yields a valid sampler. [sent-123, score-0.493]

42 The coupled split sampler is discussed in Section 5. [sent-125, score-0.443]

43 $(\pi_1, \dots, \pi_K, \tilde{\pi}_{K+1}) \sim \mathrm{Dir}(N_1, \dots, N_K, \alpha)$ (7); $\theta_k \sim p(\theta_k \mid \cdot) \propto f_x(x_{\{k\}}; \theta_k)\, f_\theta(\theta_k; \lambda)$, $\forall k \in \{1, \dots, K\}$ (8); $z_i \sim p(z_i \mid \cdot) \propto \sum_{k=1}^{K} \pi_k\, f_x(x_i; \theta_k)\, \mathbb{1}[z_i = k]$, $\forall i$ (9). [sent-140, score-0.737]

44 We note that each of these steps can be parallelized and, because the mixture parameters are explicitly represented, this procedure works for conjugate and non-conjugate priors. [sent-147, score-0.273]
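A minimal sketch of one such restricted Gibbs sweep (Equations 7-9) for a one-dimensional Gaussian mixture with known variance is given below; the label step is fully vectorized across data points, which is where the parallelism comes from. All function and variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def restricted_gibbs_sweep(x, z, alpha, sigma_sq, mu0, tau0_sq, rng):
    """One sweep of Equations 7-9, restricted to the K currently non-empty clusters.
    Assumes labels are 0..K-1 with every cluster non-empty."""
    K = z.max() + 1
    counts = np.bincount(z, minlength=K)

    # Eq. 7: (pi_1..pi_K, pi_tilde_{K+1}) ~ Dir(N_1..N_K, alpha); keep the K used weights.
    pi = rng.dirichlet(np.concatenate([counts, [alpha]]))[:K]

    # Eq. 8: sample each cluster mean from its conjugate posterior (parallel across k).
    theta = np.empty(K)
    for k in range(K):
        x_k = x[z == k]
        post_prec = 1.0 / tau0_sq + len(x_k) / sigma_sq
        mu_n = (mu0 / tau0_sq + x_k.sum() / sigma_sq) / post_prec
        theta[k] = rng.normal(mu_n, np.sqrt(1.0 / post_prec))

    # Eq. 9: resample all labels at once given pi and theta (parallel across i).
    log_p = np.log(pi)[None, :] - 0.5 * (x[:, None] - theta[None, :]) ** 2 / sigma_sq
    log_p -= log_p.max(axis=1, keepdims=True)
    p = np.exp(log_p)
    p /= p.sum(axis=1, keepdims=True)
    z_new = (p.cumsum(axis=1) > rng.random((len(x), 1))).argmax(axis=1)
    return z_new, pi, theta
```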

45 Similar to previous super-cluster methods, we can also restrict each cluster to only consider moving to a subset of other clusters. [sent-151, score-0.232]

46 By observing that any similarly restricted Gibbs sampler satisfies detailed balance, any randomized algorithm that assigns finite probability to any super-cluster grouping can be used. [sent-154, score-0.414]

47 Form the adjacency matrix, $A$, where $A_{k,m} = \exp[-D(f_x(\circ; \theta_k), f_x(\circ; \theta_m))]$ and $D(\cdot, \cdot)$ measures the dissimilarity between the two cluster likelihoods. [sent-165, score-0.334]
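The full super-cluster procedure is only partially recoverable from this extract; the sketch below is one plausible randomized grouping consistent with the sentence above (adjacency weights from a dissimilarity between cluster likelihoods, each cluster picks a weighted random neighbor, and connected components become super-clusters). Treat it as an assumption rather than the authors' exact algorithm; D_sym is a user-supplied placeholder divergence.

```python
import numpy as np

def random_superclusters(theta, D_sym, rng):
    """Randomly group K clusters into super-clusters using A_{k,m} = exp(-D_sym(theta_k, theta_m))."""
    K = len(theta)
    A = np.array([[np.exp(-D_sym(theta[k], theta[m])) for m in range(K)] for k in range(K)])
    np.fill_diagonal(A, 0.0)

    parent = list(range(K))                       # union-find over clusters
    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for k in range(K):
        total = A[k].sum()
        m = rng.choice(K, p=A[k] / total) if total > 0 else k
        parent[find(k)] = find(m)                 # merge k's group with its sampled neighbor's

    return [find(k) for k in range(K)]            # super-cluster id for each cluster
```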

48 Solid ellipses indicate regular clusters and dotted ellipses indicate sub-clusters. [sent-169, score-0.302]

49 5 Parallel Split/Merge Moves via Sub-Clusters The preceding section showed that an exact MCMC sampling algorithm can be constructed by alternating between a restricted Gibbs sampler and split moves. [sent-172, score-0.606]

50 While existing split proposals (e.g., [5, 13, 14]) can result in an ergodic chain, we now develop efficient split moves that are compatible with conjugate and non-conjugate priors and that can be parallelized. [sent-175, score-0.593]

51 We will augment the space with auxiliary variables, noting that samples of the non-auxiliary variables can be obtained by drawing samples from the joint space and simply discarding any auxiliary values. [sent-176, score-0.486]

52 1 Augmenting the Space with Auxiliary Variables Since the goal is to design a model that is tailored toward splitting clusters, we augment each regular cluster with two explicit sub-clusters (herein referred to as the “left” and “right” sub-clusters). [sent-178, score-0.344]

53 These auxiliary variables are named in a similar fashion to their regular-cluster counterparts because of the similarities between sub-clusters and regular-clusters. [sent-181, score-0.244]

54 One naïve choice for the auxiliary parameter distributions is $p(\bar{\pi}_k) = \mathrm{Dir}(\bar{\pi}_{k,\ell}, \bar{\pi}_{k,r}; \alpha/2, \alpha/2)$ and $p(\bar{\theta}_k) = f_\theta(\bar{\theta}_{k,\ell}; \lambda)\, f_\theta(\bar{\theta}_{k,r}; \lambda)$ (10), with auxiliary labels drawn from $p(\bar{z} \mid \bar{\pi}, \bar{\theta}, x, z) = \prod_k \prod_{\{i; z_i = k\}} \frac{\bar{\pi}_{k,\bar{z}_i}\, f_x(x_i; \bar{\theta}_{k,\bar{z}_i})}{\bar{\pi}_{k,\ell}\, f_x(x_i; \bar{\theta}_{k,\ell}) + \bar{\pi}_{k,r}\, f_x(x_i; \bar{\theta}_{k,r})}$ (11). [sent-182, score-0.538]

55 The corresponding graphical model is shown in Figure 3a. [sent-183, score-0.717]

56 It would be advantageous if the form of the posterior for the auxiliary variables matched that of the regular clusters in Equations 7-9. [sent-184, score-0.299]

57 Unfortunately, because the normalization in Equation 11 depends on π and θ, this choice of auxiliary distributions does not result in the posterior distributions for π and θ that one would expect. [sent-185, score-0.259]

58 We note that this problem only arises in the auxiliary space, where $x$ generates the auxiliary label $\bar{z}$ (in contrast to the regular space, where $z$ generates $x$). [sent-186, score-0.501]

59 Consequently, we alter the distribution over sub-cluster parameters to be $p(\bar{\theta}_k \mid x, z, \bar{\pi}) \propto f_\theta(\bar{\theta}_{k,\ell}; \lambda)\, f_\theta(\bar{\theta}_{k,r}; \lambda) \prod_{\{i; z_i = k\}} \left[\bar{\pi}_{k,\ell}\, f_x(x_i; \bar{\theta}_{k,\ell}) + \bar{\pi}_{k,r}\, f_x(x_i; \bar{\theta}_{k,r})\right]$ (12). [sent-188, score-0.668]

60 $(\bar{\pi}_{k,\ell}, \bar{\pi}_{k,r}) \sim \mathrm{Dir}(N_{k,\ell} + \alpha/2,\; N_{k,r} + \alpha/2)$, $\forall k \in \{1, \dots, K\}$ (13); $\bar{\theta}_{k,s} \sim p(\bar{\theta}_{k,s} \mid \cdot) \propto f_x(x_{\{k,s\}}; \bar{\theta}_{k,s})\, f_\theta(\bar{\theta}_{k,s}; \lambda)$, $\forall k \in \{1, \dots, K\}$, $\forall s \in \{\ell, r\}$ (14). [sent-192, score-0.403]

61 $\bar{z}_i \sim p(\bar{z}_i \mid \cdot) \propto \sum_{s \in \{\ell, r\}} \bar{\pi}_{z_i,s}\, f_x(x_i; \bar{\theta}_{z_i,s})\, \mathbb{1}[\bar{z}_i = s]$, $\forall i \in \{1, \dots, N\}$ (15). [sent-195, score-0.403]
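As a small illustration of the sub-cluster label step above (the last equation), the sketch below resamples every point's left/right assignment in parallel for one-dimensional Gaussian sub-clusters; names are illustrative.

```python
import numpy as np

def sample_subcluster_labels(x, z, pi_bar, theta_bar, sigma_sq, rng):
    """pi_bar[k] = (weight_left, weight_right), theta_bar[k] = (mean_left, mean_right).
    Returns zbar with zbar[i] in {0, 1}: 0 = left sub-cluster, 1 = right sub-cluster."""
    pi_z = pi_bar[z]                 # (N, 2) sub-cluster weights of each point's cluster
    th_z = theta_bar[z]              # (N, 2) sub-cluster means of each point's cluster
    log_p = np.log(pi_z) - 0.5 * (x[:, None] - th_z) ** 2 / sigma_sq
    p_right = 1.0 / (1.0 + np.exp(log_p[:, 0] - log_p[:, 1]))   # two-way softmax
    return (rng.random(len(x)) < p_right).astype(int)
```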

62 One can draw a sample from the space of K regular clusters by sampling all the regular- and sub-cluster parameters conditioned on labels and data from Equations 7, 8, 13, and 14. [sent-203, score-0.388]

63 We exploit these auxiliary variables to propose likely splits. [sent-210, score-0.244]

64 Because of the required reverse proposal in the Hastings ratio, we must propose both merges and splits. [sent-213, score-0.244]

65 Unfortunately, because the joint likelihood for the augmented space cannot be analytically expressed, the Hastings ratio for an arbitrary proposal distribution cannot be computed. [sent-214, score-0.261]

66 A split or merge move, denoted by Q, is first selected at random. [sent-216, score-0.312]

67 In our examples, all possible splits and merges are considered since the number of clusters is much smaller than the number of data points. [sent-217, score-0.392]

68 The proposal of cluster parameters is written in a general form so that users can specify their own proposal for non-conjugate priors. [sent-220, score-0.476]

69 Sampling auxiliary variables from Equation 20 will be discussed shortly. [sent-222, score-0.244]

70 Assuming that this can be performed, we show in the supplement that the resulting Hastings ratio for splitting cluster $c$ into clusters $m$ and $n$ is $H_{\mathrm{split}} = \frac{\alpha\, q(\theta_c \mid x, z, \hat{z})}{\Gamma(N_c)\, f_\theta(\theta_c; \lambda)\, f_x(x_{\{c\}}; \theta_c)} \prod_{k \in \{m,n\}} \frac{\Gamma(\hat{N}_k)\, f_\theta(\hat{\theta}_k; \lambda)\, f_x(x_{\{k\}}; \hat{\theta}_k)}{q(\hat{\theta}_k \mid x, z, \hat{z})} = \frac{\alpha}{\Gamma(N_c)\, f_x(x_{\{c\}}; \lambda)} \prod_{k \in \{m,n\}} \Gamma(\hat{N}_k)\, f_x(x_{\{k\}}; \lambda)$ (21). [sent-223, score-0.282]

71 The first expression can be used for non-conjugate models, and the second expression can be used in conjugate models where new cluster parameters are sampled directly from the posterior distribution. [sent-224, score-0.370]
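For the conjugate case, the second expression in Equation 21 can be evaluated in log space; the sketch below assumes a user-supplied log_marginal function returning $\log f_x(\cdot; \lambda)$ (the marginal likelihood of a block of data under the prior), and its names are illustrative.

```python
import numpy as np
from scipy.special import gammaln

def log_hastings_split(x_m, x_n, alpha, log_marginal):
    """Log of the conjugate form of Eq. 21 for splitting cluster c into m and n,
    where x_m and x_n are the data assigned to the two proposed sub-clusters."""
    x_c = np.concatenate([x_m, x_n])
    return (np.log(alpha)
            + gammaln(len(x_m)) + log_marginal(x_m)
            + gammaln(len(x_n)) + log_marginal(x_n)
            - gammaln(len(x_c)) - log_marginal(x_c))

# The split is then accepted with probability min(1, exp(log_hastings_split(...))).
```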

72 We discuss these complications following the explanation of sampling the auxiliary variables in the next section. [sent-227, score-0.323]

73 4 Deferred Metropolis-Hastings Sampling The preceding section showed that sampling a split according to Equations 17-20 results in an accurate MH framework. [sent-229, score-0.267]

74 However, sampling the auxiliary variables from Equation 20 is not straightforward. [sent-230, score-0.323]

75 This step is equivalent to sampling cluster parameters and labels for a 2-component mixture model, which is known to be difficult. [sent-231, score-0.528]

76 In fact, that is precisely what the restricted Gibbs sampler is doing. [sent-233, score-0.339]

77 We therefore sample from Equation 20 by running a restricted Gibbs sampler for each newly proposed sub-cluster until they have burned-in. [sent-234, score-0.366]

78 We monitor the data likelihood for cluster $m$, $L_m = f_x(x_{\{m,\ell\}}; \bar{\theta}_{m,\ell}) \cdot f_x(x_{\{m,r\}}; \bar{\theta}_{m,r})$, and declare burn-in once $L_m$ begins to oscillate. [sent-235, score-0.9]
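One simple reading of the "begins to oscillate" rule is to declare burn-in at the first decrease of the sub-cluster data log-likelihood trace; the sketch below implements that interpretation, which is an assumption rather than necessarily the authors' exact criterion.

```python
def has_burned_in(loglik_history):
    """loglik_history: per-iteration values of log L_m for a recently split cluster m.
    Returns True once the trace is no longer monotonically increasing."""
    return any(curr < prev for prev, curr in zip(loglik_history, loglik_history[1:]))
```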

79 Furthermore, due to the implicit marginalization of auxiliary variables, the restricted Gibbs sampler and split moves that act on clusters that were not recently split do not depend on the proposed auxiliary variables. [sent-236, score-1.365]

80 As such, these proposals can be computed before the auxiliary variables are even proposed. [sent-237, score-0.319]

81 The sampling of auxiliary variables for a recently split cluster is deferred to the restricted Gibbs sampler, while the other sampling steps are run concurrently. [sent-238, score-1.161]

82 Once a set of proposed sub-clusters has burned in, the corresponding clusters can be proposed to split again. [sent-239, score-0.386]

83 5 Merge Moves with Random Splits The Hastings ratio for a merge depends on the proposed auxiliary variables for the reverse split. [sent-241, score-0.44]

84 Since proposed splits are deterministic conditioned on the sub-cluster labels, the Hastings ratio will be zero if the proposed sub-cluster labels for a merge do not match those of the current clusters. [sent-242, score-0.47]

85 We show in the supplement that as the number of data points grows, the acceptance ratio for a merge move quickly decays. [sent-243, score-0.288]

86 With only 256 data points, the acceptance ratio for a merge proposal for 1000 trials in a 1D Gaussian mixture model did not exceed 10−16 . [sent-244, score-0.434]

87 Fortunately, there is a very simple sampler that is good at proposing merges: a data-independent, random split proposal generated from the prior with a corresponding merge move. [sent-247, score-0.716]

88 A split is constructed by sampling a random cluster, $c$, followed by a random partitioning of its data points drawn from a Dirichlet-Multinomial. [sent-248, score-0.3]
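A sketch of this data-independent proposal: pick a cluster uniformly at random, draw sub-cluster weights from Dir(α/2, α/2), and assign each of its points to one of the two halves; names are illustrative.

```python
import numpy as np

def random_split_proposal(z, alpha, rng):
    """Propose a prior-driven split of one cluster; returns the cluster id, its member
    indices, and a 0/1 side assignment for each member."""
    K = z.max() + 1
    c = rng.integers(K)                                # cluster chosen uniformly at random
    members = np.flatnonzero(z == c)
    w = rng.dirichlet([alpha / 2.0, alpha / 2.0])      # Dirichlet weights for the two halves
    side = rng.choice(2, size=len(members), p=w)       # multinomial partition of the points
    return c, members, side
```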

89 We consider three different versions of the proposed algorithm: using sub-clusters with and without super-clusters (SUBC and SUBC+SUPC) and an approximate method that does not wait for the convergence of sub-clusters to split (SUBC+SUPC APPROX). [sent-253, score-0.252]

90 The average log likelihoods for multiple sample paths, obtained using the algorithms without parallelization for different numbers of initial clusters K and concentration parameters α, are shown in the first two columns of Figure 4. [sent-259, score-0.237]

91 However, we find that the samplers without split/merge proposals (FSD, GIBBS, GIBBS+SC) perform very poorly when the initial number of clusters and the concentration parameter are small. [sent-261, score-0.556]

92 We believe that because the GIBBS methods marginalize over the cluster parameters and weights, they achieve better performance compared to the sub-cluster methods and FSD, which explicitly instantiate them. [sent-286, score-0.285]

93 7 Conclusion We have presented a novel sampling algorithm for Dirichlet process mixture models. [sent-288, score-0.262]

94 By alternating between a restricted Gibbs sampler and a split proposal, finite approximations to the DPMM are not needed and efficient inter-cluster parallelization can be achieved. [sent-289, score-0.583]

95 Additionally, the proposed method for constructing splits based on fitting sub-clusters is, to our knowledge, the first parallelizable split algorithm for mixture models. [sent-290, score-0.593]

96 Results on both synthetic and real data demonstrate that the speed of the sampler is orders of magnitude faster than other exact MCMC methods. [sent-291, score-0.255]

97 An improved merge-split sampler for conjugate Dirichlet process mixture models. [sent-328, score-0.521]

98 A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. [sent-373, score-0.278]

99 Markov chain sampling methods for Dirichlet process mixture models. [sent-433, score-0.357]

100 Parallel Markov chain Monte Carlo for nonparametric mixture models. [sent-492, score-0.238]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fx', 0.334), ('samplers', 0.3), ('sampler', 0.255), ('cluster', 0.232), ('dirichlet', 0.219), ('auxiliary', 0.204), ('split', 0.188), ('ibbs', 0.187), ('dpmm', 0.154), ('hastings', 0.147), ('dpmms', 0.145), ('clusters', 0.144), ('ub', 0.144), ('mixture', 0.143), ('splits', 0.126), ('merge', 0.124), ('proposal', 0.122), ('merges', 0.122), ('fsd', 0.117), ('mcmc', 0.113), ('gibbs', 0.112), ('parallelizable', 0.109), ('ergodic', 0.1), ('chain', 0.095), ('augmented', 0.094), ('nk', 0.085), ('restricted', 0.084), ('conjugate', 0.083), ('cw', 0.083), ('sampling', 0.079), ('dp', 0.078), ('proposals', 0.075), ('labels', 0.074), ('moves', 0.071), ('zi', 0.069), ('dir', 0.063), ('balance', 0.061), ('ellipses', 0.057), ('iw', 0.057), ('parallelization', 0.056), ('posterior', 0.055), ('parallel', 0.055), ('marginalize', 0.053), ('supplement', 0.049), ('graphical', 0.049), ('label', 0.049), ('priors', 0.048), ('conditioned', 0.047), ('pprox', 0.047), ('supercluster', 0.047), ('parallelized', 0.047), ('ratio', 0.045), ('carlo', 0.044), ('regular', 0.044), ('monte', 0.044), ('markov', 0.043), ('proposes', 0.043), ('additionally', 0.041), ('fisher', 0.041), ('grouping', 0.04), ('process', 0.04), ('variables', 0.04), ('restaurant', 0.039), ('pubmed', 0.038), ('ishwaran', 0.038), ('dictionary', 0.038), ('augment', 0.038), ('grouped', 0.038), ('convergence', 0.037), ('move', 0.037), ('concentration', 0.037), ('documents', 0.037), ('csail', 0.036), ('jain', 0.036), ('detailed', 0.035), ('chinese', 0.034), ('inferred', 0.034), ('points', 0.033), ('weights', 0.033), ('nonconjugate', 0.033), ('suffer', 0.032), ('mh', 0.031), ('scans', 0.031), ('um', 0.031), ('equation', 0.031), ('lm', 0.03), ('multinomial', 0.03), ('splitting', 0.03), ('unfortunately', 0.029), ('bayesian', 0.029), ('nite', 0.029), ('stationary', 0.029), ('constructive', 0.029), ('xi', 0.028), ('marginalized', 0.028), ('capabilities', 0.028), ('hierarchical', 0.028), ('slow', 0.028), ('prior', 0.027), ('proposed', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

Author: Jason Chang, John W. Fisher III

Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1

2 0.23781222 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

Author: Dahua Lin

Abstract: Reliance on computationally expensive algorithms for inference has been limiting the use of Bayesian nonparametric models in large scale applications. To tackle this problem, we propose a Bayesian learning algorithm for DP mixture models. Instead of following the conventional paradigm – random initialization plus iterative update, we take an progressive approach. Starting with a given prior, our method recursively transforms it into an approximate posterior through sequential variational approximation. In this process, new components will be incorporated on the fly when needed. The algorithm can reliably estimate a DP mixture model in one pass, making it particularly suited for applications with massive data. Experiments on both synthetic data and real datasets demonstrate remarkable improvement on efficiency – orders of magnitude speed-up compared to the state-of-the-art. 1

3 0.16809489 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

Author: Trevor Campbell, Miao Liu, Brian Kulis, Jonathan P. How, Lawrence Carin

Abstract: This paper presents a novel algorithm, based upon the dependent Dirichlet process mixture model (DDPMM), for clustering batch-sequential data containing an unknown number of evolving clusters. The algorithm is derived via a lowvariance asymptotic analysis of the Gibbs sampling algorithm for the DDPMM, and provides a hard clustering with convergence guarantees similar to those of the k-means algorithm. Empirical results from a synthetic test with moving Gaussian clusters and a test with real ADS-B aircraft trajectory data demonstrate that the algorithm requires orders of magnitude less computational time than contemporary probabilistic and hard clustering algorithms, while providing higher accuracy on the examined datasets. 1

4 0.15254205 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models

Author: Jie Liu, David Page

Abstract: In large-scale applications of undirected graphical models, such as social networks and biological networks, similar patterns occur frequently and give rise to similar parameters. In this situation, it is beneficial to group the parameters for more efficient learning. We show that even when the grouping is unknown, we can infer these parameter groups during learning via a Bayesian approach. We impose a Dirichlet process prior on the parameters. Posterior inference usually involves calculating intractable terms, and we propose two approximation algorithms, namely a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with “stripped” Beta approximation (Gibbs SBA). Simulations show that both algorithms outperform conventional maximum likelihood estimation (MLE). Gibbs SBA’s performance is close to Gibbs sampling with exact likelihood calculation. Models learned with Gibbs SBA also generalize better than the models learned by MLE on real-world Senate voting data. 1

5 0.13927177 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

Author: Michael Hughes, Erik Sudderth

Abstract: Variational inference algorithms provide the most effective framework for largescale training of Bayesian nonparametric models. Stochastic online approaches are promising, but are sensitive to the chosen learning rate and often converge to poor local optima. We present a new algorithm, memoized online variational inference, which scales to very large (yet finite) datasets while avoiding the complexities of stochastic gradient. Our algorithm maintains finite-dimensional sufficient statistics from batches of the full dataset, requiring some additional memory but still scaling to millions of examples. Exploiting nested families of variational bounds for infinite nonparametric models, we develop principled birth and merge moves allowing non-local optimization. Births adaptively add components to the model to escape local optima, while merges remove redundancy and improve speed. Using Dirichlet process mixture models for image clustering and denoising, we demonstrate major improvements in robustness and accuracy.

6 0.13369296 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations

7 0.1290559 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling

8 0.12471783 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components

9 0.12080061 161 nips-2013-Learning Stochastic Inverses

10 0.11363042 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions

11 0.10889938 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

12 0.10276114 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex

13 0.097039551 63 nips-2013-Cluster Trees on Manifolds

14 0.096215084 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

15 0.094356015 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

16 0.093853213 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models

17 0.088117801 40 nips-2013-Approximate Inference in Continuous Determinantal Processes

18 0.0868975 279 nips-2013-Robust Bloom Filters for Large MultiLabel Classification Tasks

19 0.086362511 123 nips-2013-Flexible sampling of discrete data correlations without the marginal distributions

20 0.084797733 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.198), (1, 0.077), (2, -0.071), (3, -0.0), (4, 0.033), (5, 0.186), (6, 0.224), (7, 0.009), (8, 0.147), (9, -0.035), (10, -0.038), (11, 0.087), (12, 0.024), (13, 0.099), (14, 0.038), (15, 0.012), (16, -0.001), (17, -0.117), (18, 0.07), (19, 0.064), (20, -0.052), (21, 0.068), (22, -0.016), (23, -0.038), (24, 0.03), (25, 0.059), (26, 0.001), (27, -0.13), (28, -0.016), (29, -0.013), (30, -0.003), (31, 0.076), (32, 0.132), (33, -0.101), (34, -0.03), (35, -0.069), (36, 0.086), (37, -0.049), (38, -0.011), (39, -0.057), (40, 0.004), (41, -0.037), (42, 0.066), (43, 0.105), (44, -0.069), (45, 0.018), (46, -0.067), (47, -0.067), (48, -0.059), (49, -0.083)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96711469 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

Author: Jason Chang, John W. Fisher III

Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1

2 0.77938044 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

Author: Trevor Campbell, Miao Liu, Brian Kulis, Jonathan P. How, Lawrence Carin

Abstract: This paper presents a novel algorithm, based upon the dependent Dirichlet process mixture model (DDPMM), for clustering batch-sequential data containing an unknown number of evolving clusters. The algorithm is derived via a lowvariance asymptotic analysis of the Gibbs sampling algorithm for the DDPMM, and provides a hard clustering with convergence guarantees similar to those of the k-means algorithm. Empirical results from a synthetic test with moving Gaussian clusters and a test with real ADS-B aircraft trajectory data demonstrate that the algorithm requires orders of magnitude less computational time than contemporary probabilistic and hard clustering algorithms, while providing higher accuracy on the examined datasets. 1

3 0.76250845 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

Author: Dahua Lin

Abstract: Reliance on computationally expensive algorithms for inference has been limiting the use of Bayesian nonparametric models in large scale applications. To tackle this problem, we propose a Bayesian learning algorithm for DP mixture models. Instead of following the conventional paradigm – random initialization plus iterative update, we take an progressive approach. Starting with a given prior, our method recursively transforms it into an approximate posterior through sequential variational approximation. In this process, new components will be incorporated on the fly when needed. The algorithm can reliably estimate a DP mixture model in one pass, making it particularly suited for applications with massive data. Experiments on both synthetic data and real datasets demonstrate remarkable improvement on efficiency – orders of magnitude speed-up compared to the state-of-the-art. 1

4 0.74222392 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models

Author: Jie Liu, David Page

Abstract: In large-scale applications of undirected graphical models, such as social networks and biological networks, similar patterns occur frequently and give rise to similar parameters. In this situation, it is beneficial to group the parameters for more efficient learning. We show that even when the grouping is unknown, we can infer these parameter groups during learning via a Bayesian approach. We impose a Dirichlet process prior on the parameters. Posterior inference usually involves calculating intractable terms, and we propose two approximation algorithms, namely a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with “stripped” Beta approximation (Gibbs SBA). Simulations show that both algorithms outperform conventional maximum likelihood estimation (MLE). Gibbs SBA’s performance is close to Gibbs sampling with exact likelihood calculation. Models learned with Gibbs SBA also generalize better than the models learned by MLE on real-world Senate voting data. 1

5 0.67889649 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components

Author: Jeffrey W. Miller, Matthew T. Harrison

Abstract: For data assumed to come from a finite mixture with an unknown number of components, it has become common to use Dirichlet process mixtures (DPMs) not only for density estimation, but also for inferences about the number of components. The typical approach is to use the posterior distribution on the number of clusters — that is, the posterior on the number of components represented in the observed data. However, it turns out that this posterior is not consistent — it does not concentrate at the true number of components. In this note, we give an elementary proof of this inconsistency in what is perhaps the simplest possible setting: a DPM with normal components of unit variance, applied to data from a “mixture” with one standard normal component. Further, we show that this example exhibits severe inconsistency: instead of going to 1, the posterior probability that there is one cluster converges (in probability) to 0. 1

6 0.6712445 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

7 0.65577817 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

8 0.64657444 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling

9 0.60780591 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex

10 0.60077751 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

11 0.59818244 161 nips-2013-Learning Stochastic Inverses

12 0.59200835 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

13 0.56478769 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models

14 0.55280524 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions

15 0.55250311 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations

16 0.52859849 344 nips-2013-Using multiple samples to learn mixture models

17 0.49950528 320 nips-2013-Summary Statistics for Partitionings and Feature Allocations

18 0.4972268 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing

19 0.49716979 277 nips-2013-Restricting exchangeable nonparametric distributions

20 0.49269712 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.449), (33, 0.155), (34, 0.112), (41, 0.017), (49, 0.016), (56, 0.085), (85, 0.033), (89, 0.019), (93, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96669269 61 nips-2013-Capacity of strong attractor patterns to model behavioural and cognitive prototypes

Author: Abbas Edalat

Abstract: We solve the mean field equations for a stochastic Hopfield network with temperature (noise) in the presence of strong, i.e., multiply stored, patterns, and use this solution to obtain the storage capacity of such a network. Our result provides for the first time a rigorous solution of the mean filed equations for the standard Hopfield model and is in contrast to the mathematically unjustifiable replica technique that has been used hitherto for this derivation. We show that the critical temperature for stability of a strong pattern is equal to its degree or multiplicity, when the sum of the squares of degrees of the patterns is negligible compared to the network size. In the case of a single strong pattern, when the ratio of the number of all stored pattens and the network size is a positive constant, we obtain the distribution of the overlaps of the patterns with the mean field and deduce that the storage capacity for retrieving a strong pattern exceeds that for retrieving a simple pattern by a multiplicative factor equal to the square of the degree of the strong pattern. This square law property provides justification for using strong patterns to model attachment types and behavioural prototypes in psychology and psychotherapy. 1 Introduction: Multiply learned patterns in Hopfield networks The Hopfield network as a model of associative memory and unsupervised learning was introduced in [23] and has been intensively studied from a wide range of viewpoints in the past thirty years. However, properties of a strong pattern, as a pattern that has been multiply stored or learned in these networks, have only been examined very recently, a surprising delay given that repetition of an activity is the basis of learning by the Hebbian rule and long term potentiation. In particular, while the storage capacity of a Hopfield network with certain correlated patterns has been tackled [13, 25], the storage capacity of a Hopfield network in the presence of strong as well as random patterns has not been hitherto addressed. The notion of a strong pattern of a Hopfield network has been proposed in [15] to model attachment types and behavioural prototypes in developmental psychology and psychotherapy. This suggestion has been motivated by reviewing the pioneering work of Bowlby [9] in attachment theory and highlighting how a number of academic biologists, psychiatrists, psychologists, sociologists and neuroscientists have consistently regarded Hopfield-like artificial neural networks as suitable tools to model cognitive and behavioural constructs as patterns that are deeply and repeatedly learned by individuals [11, 22, 24, 30, 29, 10]. A number of mathematical properties of strong patterns in Hopfield networks, which give rise to strong attractors, have been derived in [15]. These show in particular that strong attractors are strongly stable; a series of experiments have also been carried out which confirm the mathematical 1 results and also indicate that a strong pattern stored in the network can be retrieved even in the presence of a large number of simple patterns, far exceeding the well-known maximum load parameter or storage capacity of the Hopfield network with random patterns (αc ≈ 0.138). In this paper, we consider strong patterns in stochastic Hopfield model with temperature, which accounts for various types of noise in the network. In these networks, the updating rule is probabilistic and depend on the temperature. 
Since analytical solution of such a system is not possible in general, one strives to obtain the average behaviour of the network when the input to each node, the so-called field at the node, is replaced with its mean. This is the basis of mean field theory for these networks. Due to the close connection between the Hopfield network and the Ising model in ferromagnetism [1, 8], the mean field approach for the Hopfield network and its variations has been tackled using the replica method, starting with the pioneering work of Amit, Gutfreund and Sompolinsky [3, 2, 4, 19, 31, 1, 13]. Although this method has been widely used in the theory of spin glasses in statistical physics [26, 16] its mathematical justification has proved to be elusive as we will discuss in the next section; see for example [20, page 264], [14, page 27], and [7, page 9]. In [17] and independently in [27], an alternative technique to the replica method for solving the mean field equations has been proposed which is reproduced and characterised as heuristic in [20, section 2.5] since it relies on a number of assumptions that are not later justified and uses a number of mathematical steps that are not validated. Here, we use the basic idea of the above heuristic to develop a verifiable mathematical framework with provable results grounded on elements of probability theory, with which we assume the reader is familiar. This technique allows us to solve the mean field equations for the Hopfield network in the presence of strong patterns and use the results to study, first, the stability of these patterns in the presence of temperature (noise) and, second, the storage capacity of the network with a single strong pattern at temperature zero. We show that the critical temperature for the stability of a strong pattern is equal to its degree (i.e., its multiplicity) when the ratio of the sum of the squares of degrees of the patterns to the network size tends to zero when the latter tends to infinity. In the case that there is only one strong pattern present with its degree small compared to the number of patterns and the latter is a fixed multiple of the number of nodes, we find the distribution of the overlap of the mean field and the patterns when the strong pattern is being retrieved. We use these distributions to prove that the storage capacity for retrieving a strong pattern exceeds that for a simple pattern by a multiplicative factor equal to the square of the degree of the strong attractor. This result matches the finding in [15] regarding the capacity of a network to recall strong patterns as mentioned above. Our results therefore show that strong patterns are robust and persistent in the network memory as attachment types and behavioural prototypes are in the human memory system. In this paper, we will several times use Lyapunov’s theorem in probability which provides a simple sufficient condition to generalise the Central Limit theorem when we deal with independent but not necessarily identically distributed random variables. We require a general form of this theorem kn as follows. Let Yn = N i=1 Yni , for n ∈ I , be a triangular array of random variables such that for each n, the random variables Yni , for 1 ≤ i ≤ kn are independent with E(Yni ) = 0 2 2 and E(Yni ) = σni , where E(X) stands for the expected value of the random variable X. Let kn 2 2 sn = i=1 σni . We use the notation X ∼ Y when the two random variables X and Y have the same distribution (for large n if either or both of them depend on n). 
Theorem 1.1 (Lyapunov’s theorem [6, page 368]) If for some δ > 0, we have the condition: 1 E(|Yn |2+δ |) → 0 s2+δ n d d as n → ∞ then s1 Yn −→ N (0, 1) as n → ∞ where −→ denotes convergence in distribution, and we denote n by N (a, σ 2 ) the normal distribution with mean a and variance σ 2 . Thus, for large n we have Yn ∼ N (0, s2 ). n 2 2 Mean field theory We consider a Hopfield network with N neurons i = 1, . . . , N with values Si = ±1 and follow the notations in [20]. As in [15], we assume patterns can be multiply stored and the degree of a pattern is defined as its multiplicity. The total number of patterns, counting their multiplicity, is denoted by p and we assume there are n patterns ξ 1 , . . . , ξ n with degrees d1 , . . . , dn ≥ 1 respectively and that n the remaining p − k=1 dk ≥ 0 patterns are simple, i.e., each has degree one. Note that by our assumptions there are precisely n p0 = p + n − dk k=1 distinct patterns, which we assume are independent and identically distributed with equal probability of taking value ±1 for each node. More generally, for any non-negative integer k ∈ I , we let N p0 dk . µ pk = µ=1 p µ µ 0 1 We use the generalized Hebbian rule for the synaptic couplings: wij = N µ=1 dµ ξi ξj for i = j with wii = 0 for 1 ≤ i, j ≤ N . As in the standard stochastic Hopfield model [20], we use Glauber dynamics [18] for the stochastic updating rule with pseudo-temperature T > 0, which accounts for various types of noise in the network, and assume zero bias in the local field. Putting β = 1/T (i.e., with the Boltzmann constant kB = 1) and letting fβ (h) = 1/(1 + exp(−2βh)), the stochastic updating rule at time t is given by: N Pr(Si (t + 1) = ±1) = fβ (±hi (t)), where hi (t) = wij Sj (t), (1) j=1 is the local field at i at time t. The updating is implemented asynchronously in a random way. The energy of the network in the configuration S = (Si )N is given by i=1 N 1 Si Sj wij . H(S) = − 2 i,j=1 For large N , this specifies a complex system, with an underlying state space of dimension 2N , which in general cannot be solved exactly. However, mean field theory has proved very useful in studying Hopfield networks. The average updated value of Si (t + 1) in Equation (1) is Si (t + 1) = 1/(1 + e−2βhi (t) ) − 1/(1 + e2βhi (t) ) = tanh(βhi (t)), (2) where . . . denotes taking average with respect to the probability distribution in the updating rule in Equation (1). The stationary solution for the mean field thus satisfies: Si = tanh(βhi ) , (3) The average overlap of pattern ξ µ with the mean field at the nodes of the network is given by: mν = 1 N N ν ξi Si (4) i=1 The replica technique for solving the mean field problem, used in the case p/N = α > 0 as N → ∞, seeks to obtain the average of the overlaps in Equation (4) by evaluating the partition function of the system, namely, Z = TrS exp(−βH(S)), where the trace TrS stands for taking sum over all possible configurations S = (Si )N . As it i=1 is generally the case in statistical physics, once the partition function of the system is obtained, 3 all required physical quantities can in principle be computed. However, in this case, the partition function is very difficult to compute since it entails computing the average log Z of log Z, where . . . indicates averaging over the random distribution of the stored patterns ξ µ . To overcome this problem, the identity Zk − 1 log Z = lim k→0 k is used to reduce the problem to finding the average Z k of Z k , which is then computed for positive integer values of k. 
For such k, we have: Z k = TrS 1 TrS 2 . . . TrS k exp(−β(H(S 1 ) + H(S 1 ) + . . . + H(S k ))), where for each i = 1, . . . , k the super-scripted configuration S i is a replica of the configuration state. In computing the trace over each replica, various parameters are obtained and the replica symmetry condition assumes that these parameters are independent of the particular replica under consideration. Apart from this assumption, there are two basic mathematical problems with the technique which makes it unjustifiable [20, page 264]. Firstly, the positive integer k above is eventually treated as a real number near zero without any mathematical justification. Secondly, the order of taking limits, in particular the order of taking the two limits k → 0 and N → ∞, are several times interchanged again without any mathematical justification. Here, we develop a mathematically rigorous method for solving the mean field problem, i.e., computing the average of the overlaps in Equation (4) in the case of p/N = α > 0 as N → ∞. Our method turns the basic idea of the heuristic presented in [17] and reproduced in [20] for solving the mean field equation into a mathematically verifiable formalism, which for the standard Hopfield network with random stored patterns gives the same result as the replica method, assuming replica symmetry. In the presence of strong patterns we obtain a set of new results as explained in the next two sections. The mean field equation is obtained from Equation (3) by approximating the right hand side of N this equation by the value of tanh at the mean field hi = j=1 wij Sj , ignoring the sum N j=1 wij (Sj − Sj ) for large N [17, page 32]: Si = tanh(β hi ) = tanh β N N j=1 p0 µ=1 µ µ dµ ξi ξj Sj . (5) Equation (5) gives the mean field equation for the Hopfield network with n possible strong patterns n ξ µ (1 ≤ µ ≤ n) and p − µ=1 dµ simple patterns ξ µ with n + 1 ≤ µ ≤ p0 . As in the standard Hopfield model, where all patterns are simple, we have two cases to deal with. However, we now have to account for the presence of strong attractors and our two cases will be as follows: (i) In the p0 first case we assume p2 := µ=1 d2 = o(N ), which includes the simpler case p2 N when p2 µ is fixed and independent of N . (ii) In the second case we assume we have a single strong attractor with the load parameter p/N = α > 0. 3 Stability of strong patterns with noise: p2 = o(N ) The case of constant p and N → ∞ is usually referred to as α = 0 in the standard Hopfield model. Here, we need to consider the sum of degrees of all stored patterns (and not just the number of patterns) compared to N . We solve the mean field equation with T > 0 by using a method similar in spirit to [20, page 33] for the standard Hopfield model, but in our case strong patterns induce a sequence of independent but non-identically distributed random variables in the crosstalk term, where the Central Limit Theorem cannot be used; we show however that Lyapunov’s theorem (Theorem (1.1) can be invoked. In retrieving pattern ξ 1 , we look for a solution of the mean filed 1 equation of the form: Si = mξi , where m > 0 is a constant. Using Equation (5) and separating 1 the contribution of ξ in the argument of tanh, we obtain:  1 mξi = tanh    mβ  1 d1 ξi + N 4 µ µ 1 dµ ξi ξj ξj  . j=i,µ>1 (6) For each N , µ > 1 and j = i, let dµ µ µ 1 (7) ξ ξ ξ . N i j j 2 This gives (p0 − 1)(N − 1) independent random variables with E(YN µj ) = 0, E(YN µj ) = d2 /N 2 , µ 3 3 3 and E(|YN µj |) = dµ /N . 
We have: YN µj = s2 := N 2 E(YN µj ) = µ>1,j=i 1 N −1 d2 ∼ N 2 µ>1 µ N d2 . µ (8) µ>1 Thus, as N → ∞, we have: 1 s3 N 3 E(|YN µj |) ∼ √ µ>1,j=i µ>1 N( d3 µ µ>1 d2 )3/2 µ → 0. (9) as N → ∞ since for positive numbers dµ we always have µ>1 d3 < ( µ>1 d2 )3/2 . Thus the µ µ Lyapunov condition is satisfied for δ = 1. By Lyapunov’s theorem we deduce: 1 N µ µ 1 dµ ξi ξj ξj ∼ N d2 /N µ 0, (10) µ>1 µ>1,j=i Since we also have p2 = o(N ), it follows that we can ignore the second term, i.e., the crosstalk term, in the argument of tanh in Equation (6) as N → ∞; we thus obtain: m = tanh βd1 m. (11) To examine the fixed points of the Equation (11), we let d = d1 for convenience and put x = βdm = dm/T , so that tanh x = T x/d; see Figure 1. It follows that Tc = d is the critical temperature. If T < d then there is a non-zero (non-trivial) solution for m, whereas for T > d we only have the trivial solution. For d = 1 our solution is that of the standard Hopfield network as in [20, page 34]. (d < T) y>x y = x ( d = T) y = tanh x y

2 0.92905849 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals

Author: Barbara Rakitsch, Christoph Lippert, Karsten Borgwardt, Oliver Stegle

Abstract: Multi-task prediction methods are widely used to couple regressors or classification models by sharing information across related tasks. We propose a multi-task Gaussian process approach for modeling both the relatedness between regressors and the task correlations in the residuals, in order to more accurately identify true sharing between regressors. The resulting Gaussian model has a covariance term in form of a sum of Kronecker products, for which efficient parameter inference and out of sample prediction are feasible. On both synthetic examples and applications to phenotype prediction in genetics, we find substantial benefits of modeling structured noise compared to established alternatives. 1

3 0.90957946 245 nips-2013-Pass-efficient unsupervised feature selection

Author: Crystal Maung, Haim Schweitzer

Abstract: The goal of unsupervised feature selection is to identify a small number of important features that can represent the data. We propose a new algorithm, a modification of the classical pivoted QR algorithm of Businger and Golub, that requires a small number of passes over the data. The improvements are based on two ideas: keeping track of multiple features in each pass, and skipping calculations that can be shown not to affect the final selection. Our algorithm selects the exact same features as the classical pivoted QR algorithm, and has the same favorable numerical stability. We describe experiments on real-world datasets which sometimes show improvements of several orders of magnitude over the classical algorithm. These results appear to be competitive with recently proposed randomized algorithms in terms of pass efficiency and run time. On the other hand, the randomized algorithms may produce more accurate features, at the cost of small probability of failure. 1

same-paper 4 0.87394667 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

Author: Jason Chang, John W. Fisher III

Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1

5 0.81512916 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions

Author: Francesca Petralia, Joshua T. Vogelstein, David Dunson

Abstract: Nonparametric estimation of the conditional distribution of a response given highdimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efficiently to approximately one million features. State of the art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. 1

6 0.81211472 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling

7 0.62436604 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

8 0.61891186 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

9 0.61179018 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

10 0.598378 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

11 0.59569991 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series

12 0.59147066 252 nips-2013-Predictive PAC Learning and Process Decompositions

13 0.58685505 47 nips-2013-Bayesian Hierarchical Community Discovery

14 0.58044332 355 nips-2013-Which Space Partitioning Tree to Use for Search?

15 0.58025253 350 nips-2013-Wavelets on Graphs via Deep Learning

16 0.57573587 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

17 0.57460558 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

18 0.57141596 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components

19 0.5692386 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models

20 0.5670315 41 nips-2013-Approximate inference in latent Gaussian-Markov models from continuous time observations