nips nips2013 nips2013-243 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jason Chang, John W. Fisher III
Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. [sent-6, score-0.485]
2 We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. [sent-7, score-0.259]
3 Each cluster is augmented with two subclusters to construct likely split moves. [sent-8, score-0.514]
4 Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. [sent-9, score-0.461]
5 Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. [sent-10, score-0.292]
6 Posterior sampling in complex models such as DPMMs is often difficult because samplers that propose local changes exhibit poor convergence. [sent-17, score-0.379]
7 Here, we develop a sampler for DPMMs that: (1) preserves limiting guarantees; (2) proposes splits and merges to improve convergence; (3) can be parallelized to accommodate large datasets; and (4) is applicable to a wide variety of DPMMs (conjugate and non-conjugate). [sent-21, score-0.593]
8 While we focus on DP mixture models here, similar methods can be extended for mixture models with other priors (finite Dirichlet distributions, Pitman-Yor Processes, etc.). [sent-23, score-0.334]
9 The majority of DPMM samplers fit into one of two categories: collapsed-weight samplers that ... (Footnote: Jason Chang was partially supported by the Office of Naval Research Multidisciplinary Research Initiative (MURI) program, award N000141110688.) [sent-31, score-0.6]
10 [21, 23]) priors sample the cluster labels iteratively one data point at a time. [sent-39, score-0.354]
11 When a conjugate prior is used, one can also marginalize out cluster parameters. [sent-40, score-0.395]
12 Additionally, due to the particular marginalization schemes, these samplers cannot be parallelized. [sent-44, score-0.3]
13 Instantiated-weight (IW) samplers explicitly represent cluster weights, typically using a finite approximation to the DP (e. [sent-45, score-0.532]
14 Recently, [7] and [24] have eliminated the need for this approximation; however, IW samplers still suffer from convergence issues. [sent-48, score-0.369]
15 If cluster parameters are marginalized, it can be very unlikely for a single point to start a new cluster. [sent-49, score-0.232]
16 When cluster parameters are instantiated, samples of parameters from the prior are often a poor fit to the data. [sent-50, score-0.259]
17 However, IW samplers are often useful because they can be parallelized across each data point conditioned on the weights and parameters. [sent-51, score-0.427]
18 We refer to this type of algorithm as “inter-cluster parallelizable”, since the cluster label for each point within a cluster can be sampled in parallel. [sent-52, score-0.513]
19 We refer to this type of implementation as “intra-cluster parallelizable”, since points in different super-clusters can be sampled in parallel, but points within a cluster cannot. [sent-55, score-0.298]
20 Recent CW samplers consider larger moves to address convergence issues. [sent-59, score-0.408]
21 Green and Richardson [9] present a reversible jump MCMC sampler that proposes splitting and merging components. [sent-60, score-0.328]
22 Proposed splits are unlikely to fit the posterior since auxiliary variables governing the split cluster parameters and weights are proposed independent of the data. [sent-62, score-0.905]
23 Jain and Neal [13, 14] construct a split by running multiple restricted Gibbs scans for a single cluster in conjugate and nonconjugate models. [sent-63, score-0.651]
24 Dahl [5] proposes a split scheme for conjugate models by reassigning labels of a cluster sequentially. [sent-66, score-0.62]
25 All current split samplers construct a proposed move to be used in a Metropolis-Hastings framework. [sent-67, score-0.552]
26 If the split is rejected, considerable computation is wasted, and all information contained in learning the split is forgotten. [sent-68, score-0.376]
27 In contrast, the proposed method of fitting sub-clusters iteratively learns likely split proposals with the auxiliary variables. [sent-69, score-0.494]
28 Additionally, we show that split proposals can be computed in parallel, allowing for very efficient implementations. [sent-70, score-0.263]
29 Therefore, posterior MCMC inference in a DPMM could, in theory, alternate between the following samplers: [sent-76, score-0.355]
30 (π1, ..., π∞) ∼ p(π | z, α), (1); θk ∼ p(θk | ·) ∝ fx(x{k}; θk) fθ(θk; λ), ∀k ∈ {1, ..., ∞}, (2); zi ∼ Σ_{k=1}^{∞} πk fx(xi; θk) I[zi = k], ∀i ∈ {1, ..., N}. (3) [sent-79, score-0.737]
31 We use fx (x{k} ; θk ) to denote the product of likelihoods for all data points in cluster k. [sent-86, score-0.599]
32 When conjugate priors are used, the posterior distribution for cluster parameters is in the same family as the prior: p(θk | x, z, λ) ∝ fθ(θk; λ) fx(x{k}; θk) ∝ fθ(θk; λ*_k), (4) where λ*_k denotes the posterior hyperparameters for cluster k. [sent-87, score-0.705]
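As a hedged illustration of Equation 4 (not code from the paper; the 1D Gaussian, known-variance setting and all function names are my own assumptions), the sketch below updates the posterior hyperparameters for one cluster and draws θk from the resulting posterior.

import numpy as np

def posterior_hyperparameters(x_k, mu0=0.0, kappa0=1.0, sigma2=1.0):
    """Conjugate update for a Gaussian cluster with known variance sigma2 and a
    Normal(mu0, sigma2/kappa0) prior on the mean (an analogue of Equation 4)."""
    n = x_k.shape[0]
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + x_k.sum()) / kappa_n
    return mu_n, kappa_n

def sample_cluster_mean(rng, x_k, mu0=0.0, kappa0=1.0, sigma2=1.0):
    """Draw theta_k from its posterior p(theta_k | x, z, lambda)."""
    mu_n, kappa_n = posterior_hyperparameters(x_k, mu0, kappa0, sigma2)
    return rng.normal(mu_n, np.sqrt(sigma2 / kappa_n))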
33 When cluster parameters are explicitly sampled, these algorithms may additionally suffer from slow convergence issues. [sent-92, score-0.37]
34 4 Exact Parallel Instantiated-Weight Samplers We now present a novel alternative to the instantiated-weight samplers that does not require any finite model approximations. [sent-104, score-0.3]
35 In particular, if one desires to sample from a target distribution π(z), then satisfying detailed balance for an ergodic Markov chain guarantees that simulations of the chain converge uniquely to that target distribution. [sent-106, score-0.386]
36 If a Markov chain is constructed with a transition distribution q(ẑ | z) that satisfies π(z) q(ẑ | z) = π(ẑ) q(z | ẑ), then the chain is said to satisfy the detailed balance condition and π(z) is guaranteed to be a stationary distribution of the chain. [sent-111, score-0.315]
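A small numerical check of the detailed balance condition for a finite-state Metropolis-Hastings kernel with a symmetric proposal; this is an illustrative sanity check of the condition above, not part of the paper's algorithm, and the three-state target is arbitrary.

import numpy as np

def satisfies_detailed_balance(pi, Q, tol=1e-12):
    """Check pi(z) Q(z_hat | z) == pi(z_hat) Q(z | z_hat) for a finite-state
    kernel Q (rows index the current state z, columns the proposed state)."""
    flow = pi[:, None] * Q            # flow[z, z_hat] = pi(z) Q(z_hat | z)
    return np.allclose(flow, flow.T, atol=tol)

# Example: a 3-state Metropolis-Hastings kernel with a symmetric proposal is
# reversible with respect to its target distribution pi.
pi = np.array([0.5, 0.3, 0.2])
prop = np.full((3, 3), 1.0 / 3.0)                 # symmetric proposal
acc = np.minimum(1.0, pi[None, :] / pi[:, None])  # MH acceptance probabilities
Q = prop * acc
np.fill_diagonal(Q, 0.0)
np.fill_diagonal(Q, 1.0 - Q.sum(axis=1))          # rejection mass stays put
assert satisfies_detailed_balance(pi, Q)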
37 We define a restricted sampler as one that satisfies detailed balance (e. [sent-112, score-0.435]
38 One key observation of this work is that multiple restricted samplers can be combined to form an ergodic chain. [sent-116, score-0.484]
39 we consider a sampler that is restricted to only sample labels belonging to non-empty clusters. [sent-121, score-0.413]
40 Such a sampler is not ergodic because it cannot create new clusters. [sent-122, score-0.355]
41 However, when mixed with a sampler that proposes splits, the resulting chain is ergodic and yields a valid sampler. [sent-123, score-0.493]
42 The coupled split sampler is discussed in Section 5. [sent-125, score-0.443]
43 (π1, ..., πK, π̃) ∼ Dir(N1, ..., NK, α), (7); θk ∼ p(θk | ·) ∝ fx(x{k}; θk) fθ(θk; λ), ∀k ∈ {1, ..., K}, (8); zi ∼ Σ_{k=1}^{K} πk fx(xi; θk) I[zi = k], ∀i ∈ {1, ..., N}. (9) [sent-140, score-0.737]
44 We note that each of these steps can be parallelized and, because the mixture parameters are explicitly represented, this procedure works for conjugate and non-conjugate priors. [sent-147, score-0.273]
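A minimal sketch of one restricted Gibbs iteration over Equations 7-9 for 1D Gaussian clusters with known variance; the likelihood model, prior, and function names are assumptions for illustration, not the paper's implementation. Labels are only moved among the K instantiated clusters, so no new clusters are created here.

import numpy as np
from scipy.stats import norm

def restricted_gibbs_sweep(rng, x, z, alpha, sigma2=1.0, mu0=0.0, kappa0=1.0):
    """One restricted Gibbs iteration (Equations 7-9). Assumes labels z are
    integers 0..K-1 with every cluster non-empty."""
    K = z.max() + 1
    counts = np.bincount(z, minlength=K)

    # Eq. 7: (pi_1, ..., pi_K, pi_tilde) ~ Dir(N_1, ..., N_K, alpha); drop pi_tilde.
    pi = rng.dirichlet(np.concatenate([counts, [alpha]]))[:K]

    # Eq. 8: sample each cluster mean from its conjugate posterior.
    theta = np.empty(K)
    for k in range(K):
        x_k = x[z == k]
        kappa_n = kappa0 + x_k.size
        mu_n = (kappa0 * mu0 + x_k.sum()) / kappa_n
        theta[k] = rng.normal(mu_n, np.sqrt(sigma2 / kappa_n))

    # Eq. 9: resample every label given (pi, theta); this loop over data points
    # is embarrassingly parallel (inter-cluster parallelizable).
    logp = np.log(pi)[None, :] + norm.logpdf(x[:, None], theta[None, :], np.sqrt(sigma2))
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z_new = np.array([rng.choice(K, p=p[i]) for i in range(x.size)])
    return z_new, pi, theta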
45 Similar to previous super-cluster methods, we can also restrict each cluster to only consider moving to a subset of other clusters. [sent-151, score-0.232]
46 By observing that any similarly restricted Gibbs sampler satisfies detailed balance, any randomized algorithm that assigns finite probability to any super-cluster grouping can be used. [sent-154, score-0.414]
47 Form the adjacency matrix, A, where A_{k,m} = exp[−D(fx(◦; θk), fx(◦; θm))]. [sent-165, score-0.334]
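A hedged sketch of one randomized super-cluster grouping built from such an adjacency matrix. Here D is taken to be the symmetric KL divergence between equal-variance 1D Gaussians, and each cluster links to one other cluster with probability proportional to A[k, m], with connected components becoming super-clusters; the distance and linking rule are my assumptions, and any randomized rule that gives finite probability to every grouping satisfies the condition stated above.

import numpy as np

def super_cluster_groups(rng, theta, sigma2=1.0):
    """Return a super-cluster id for each of the K clusters."""
    K = theta.size
    D = (theta[:, None] - theta[None, :]) ** 2 / sigma2   # symmetric KL, equal variances
    A = np.exp(-D)
    np.fill_diagonal(A, 0.0)

    parent = np.arange(K)
    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for k in range(K):
        if A[k].sum() == 0:
            continue
        m = rng.choice(K, p=A[k] / A[k].sum())
        parent[find(k)] = find(m)                         # union the two groups

    return np.array([find(k) for k in range(K)])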
48 (Figure caption) Solid ellipses indicate regular clusters and dotted ellipses indicate sub-clusters. [sent-169, score-0.302]
49 5 Parallel Split/Merge Moves via Sub-Clusters The preceding section showed that an exact MCMC sampling algorithm can be constructed by alternating between a restricted Gibbs sampler and split moves. [sent-172, score-0.606]
50 [5, 13, 14]) can result in an ergodic chain, we now develop efficient split moves that are compatible with conjugate and non-conjugate priors and that can be parallelized. [sent-175, score-0.49]
51 We will augment the space with auxiliary variables, noting that samples of the non-auxiliary variables can be obtained by drawing samples from the joint space and simply discarding any auxiliary values. [sent-176, score-0.486]
52 1 Augmenting the Space with Auxiliary Variables Since the goal is to design a model that is tailored toward splitting clusters, we augment each regular cluster with two explicit sub-clusters (herein referred to as the “left” and “right” sub-clusters). [sent-178, score-0.344]
53 These auxiliary variables are named in a similar fashion to their regular-cluster counterparts because of the similarities between sub-clusters and regular-clusters. [sent-181, score-0.244]
54 One naïve choice for the auxiliary parameter distributions is p(π̄k) = Dir(π̄k,ℓ, π̄k,r; α/2, α/2), p(θ̄k) = fθ(θ̄k,ℓ; λ) fθ(θ̄k,r; λ), (10) [sent-182, score-0.538]
55 p(z̄ | π̄, θ̄, x, z) = Π_k Π_{i : zi = k} [π̄k,z̄i fx(xi; θ̄k,z̄i)] / [π̄k,ℓ fx(xi; θ̄k,ℓ) + π̄k,r fx(xi; θ̄k,r)]. (11) The corresponding graphical model is shown in Figure 3a. [sent-183, score-0.717]
56 It would be advantageous if the form of the posterior for the auxiliary variables matched those of the regular-clusters in Equation 7-9. [sent-184, score-0.299]
57 Unfortunately, because the normalization in Equation 11 depends on π and θ, this choice of auxiliary distributions does not result in the posterior distributions for π and θ that one would expect. [sent-185, score-0.259]
58 We note that this problem only arises in the auxiliary space where x generates the auxiliary label z (in contrast to the regular space, where z generates x). [sent-186, score-0.501]
59 Consequently, we alter the distribution over sub-cluster parameters to be p(θ̄k | x, z, π̄) ∝ fθ(θ̄k,ℓ; λ) fθ(θ̄k,r; λ) Π_{i : zi = k} [π̄k,ℓ fx(xi; θ̄k,ℓ) + π̄k,r fx(xi; θ̄k,r)]. (12) [sent-188, score-0.668]
60 (π̄k,ℓ, π̄k,r) ∼ Dir(N̄k,ℓ + α/2, N̄k,r + α/2), ∀k ∈ {1, ..., K}, (13); θ̄k,s ∼ p(θ̄k,s | ·) ∝ fx(x{k,s}; θ̄k,s) fθ(θ̄k,s; λ), ∀k ∈ {1, ..., K}, ∀s ∈ {ℓ, r}, (14) [sent-192, score-0.403]
61 z̄i ∼ Σ_{s∈{ℓ,r}} π̄zi,s fx(xi; θ̄zi,s) I[z̄i = s], ∀i ∈ {1, ..., N}. (15) [sent-195, score-0.403]
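A sketch of one pass over Equations 13-15 for the left/right sub-clusters of a single cluster, again in the assumed 1D Gaussian, known-variance setting (label 0 encodes the left sub-cluster, 1 the right); names and model choice are mine, not the paper's.

import numpy as np
from scipy.stats import norm

def sample_subclusters(rng, x_k, zbar_k, alpha, sigma2=1.0, mu0=0.0, kappa0=1.0):
    """Resample sub-cluster weights, parameters, and labels for one cluster."""
    counts = np.bincount(zbar_k, minlength=2)

    # Eq. 13: (pi_l, pi_r) ~ Dir(N_l + alpha/2, N_r + alpha/2)
    pibar = rng.dirichlet(counts + alpha / 2.0)

    # Eq. 14: conjugate posterior draw for each sub-cluster mean
    thetabar = np.empty(2)
    for s in range(2):
        x_s = x_k[zbar_k == s]
        kappa_n = kappa0 + x_s.size
        mu_n = (kappa0 * mu0 + x_s.sum()) / kappa_n
        thetabar[s] = rng.normal(mu_n, np.sqrt(sigma2 / kappa_n))

    # Eq. 15: reassign each point to the left or right sub-cluster
    logp = np.log(pibar)[None, :] + norm.logpdf(x_k[:, None], thetabar[None, :], np.sqrt(sigma2))
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    zbar_new = (rng.random(x_k.size) < p[:, 1]).astype(int)
    return zbar_new, pibar, thetabar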
62 One can draw a sample from the space of K regular clusters by sampling all the regular- and sub-cluster parameters conditioned on labels and data from Equations 7, 8, 13, and 14. [sent-203, score-0.388]
63 We exploit these auxiliary variables to propose likely splits. [sent-210, score-0.244]
64 Because of the required reverse proposal in the Hastings ratio, we must propose both merges and splits. [sent-213, score-0.244]
65 Unfortunately, because the joint likelihood for the augmented space cannot be analytically expressed, the Hastings ratio for an arbitrary proposal distribution cannot be computed. [sent-214, score-0.261]
66 A split or merge move, denoted by Q, is first selected at random. [sent-216, score-0.312]
67 In our examples, all possible splits and merges are considered since the number of clusters is much smaller than the number of data points. [sent-217, score-0.392]
68 The proposal of cluster parameters is written in a general form so that users can specify their own proposal for non-conjugate priors. [sent-220, score-0.476]
69 Sampling auxiliary variables from Equation 20 will be discussed shortly. [sent-222, score-0.244]
70 Assuming that this can be performed, we show in the supplement that the resulting Hastings ratio for a split is Hsplit-c = [α q(θc | x, z, ẑ) Π_{k∈{m,n}} Γ(N̂k) fθ(θ̂k; λ) fx(x̂{k}; θ̂k)] / [Γ(Nc) fθ(θc; λ) fx(x{c}; θc) Π_{k∈{m,n}} q(θ̂k | x, z, ẑ)] = α Π_{k∈{m,n}} Γ(N̂k) fx(x̂{k}; λ) / [Γ(Nc) fx(x{c}; λ)]. (21) [sent-223, score-0.282]
71 The first expression can be used for non-conjugate models, and the second expression can be used in conjugate models where new cluster parameters are sampled directly from the posterior distribution. [sent-224, score-0.37]
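A hedged sketch of the conjugate (second) expression in Equation 21, evaluated in log space for numerical stability. The Gaussian marginal likelihood with known variance is an illustrative assumption, not the paper's experimental model, and the split is assumed to leave both proposed clusters non-empty.

import numpy as np
from scipy.special import gammaln

def log_marginal_gaussian(x, sigma2=1.0, mu0=0.0, kappa0=1.0):
    """log f_x(x; lambda): marginal likelihood of a 1D Gaussian cluster with
    known variance sigma2 and a conjugate Normal prior on the mean."""
    n = x.size
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + x.sum()) / kappa_n
    return (-0.5 * n * np.log(2.0 * np.pi * sigma2)
            + 0.5 * (np.log(kappa0) - np.log(kappa_n))
            - 0.5 / sigma2 * (np.sum(x ** 2) + kappa0 * mu0 ** 2 - kappa_n * mu_n ** 2))

def log_hastings_split_conjugate(x_m, x_n, alpha, **prior):
    """Log of the second expression in Equation 21 for splitting cluster c into
    proposed clusters m and n, where x_c is the union of x_m and x_n."""
    x_c = np.concatenate([x_m, x_n])
    log_h = (np.log(alpha)
             + gammaln(x_m.size) + log_marginal_gaussian(x_m, **prior)
             + gammaln(x_n.size) + log_marginal_gaussian(x_n, **prior)
             - gammaln(x_c.size) - log_marginal_gaussian(x_c, **prior))
    return log_h  # accept the split with probability min(1, exp(log_h))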
72 We discuss these complications following the explanation of sampling the auxiliary variables in the next section. [sent-227, score-0.323]
73 4 Deferred Metropolis-Hastings Sampling The preceding section showed that sampling a split according to Equations 17-20 results in an accurate MH framework. [sent-229, score-0.267]
74 However, sampling the auxiliary variables from Equation 20 is not straightforward. [sent-230, score-0.323]
75 This step is equivalent to sampling cluster parameters and labels for a 2-component mixture model, which is known to be difficult. [sent-231, score-0.528]
76 In fact, that is precisely what the restricted Gibbs sampler is doing. [sent-233, score-0.339]
77 We therefore sample from Equation 20 by running a restricted Gibbs sampler for each newly proposed sub-cluster until they have burned-in. [sent-234, score-0.366]
78 We monitor the data-likelihood for cluster m, Lm = fx(x{m,ℓ}; θ̄m,ℓ) · fx(x{m,r}; θ̄m,r), and declare burn-in once Lm begins to oscillate. [sent-235, score-0.9]
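One possible burn-in heuristic matching the description above: declare burn-in when the sub-cluster likelihood Lm stops increasing monotonically over a short window. The exact oscillation test used in the paper may differ; window length and function name are assumptions.

def subclusters_burned_in(loglik_history, window=5):
    """Return True once the tracked log-likelihood L_m begins to oscillate,
    i.e. is no longer monotonically increasing over the last `window` steps."""
    if len(loglik_history) < window + 1:
        return False
    recent = loglik_history[-(window + 1):]
    increases = [b > a for a, b in zip(recent[:-1], recent[1:])]
    return not all(increases)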
79 Furthermore, due to the implicit marginalization of auxiliary variables, the restricted Gibbs sampler and split moves that act on clusters that were not recently split do not depend on the proposed auxiliary variables. [sent-236, score-1.365]
80 As such, these proposals can be computed before the auxiliary variables are even proposed. [sent-237, score-0.319]
81 The sampling of auxiliary variables of a recently split cluster is deferred to the restricted Gibbs sampler while the other sampling steps are run concurrently. [sent-238, score-1.161]
82 Once a set of proposed sub-clusters have burned-in, the corresponding clusters can be proposed to split again. [sent-239, score-0.386]
83 5 Merge Moves with Random Splits The Hastings ratio for a merge depends on the proposed auxiliary variables for the reverse split. [sent-241, score-0.44]
84 Since proposed splits are deterministic conditioned on the sub-cluster labels, the Hastings ratio will be zero if the proposed sub-cluster labels for a merge do not match those of the current clusters. [sent-242, score-0.47]
85 We show in the supplement that as the number of data points grows, the acceptance ratio for a merge move quickly decays. [sent-243, score-0.288]
86 With only 256 data points, the acceptance ratio for a merge proposal for 1000 trials in a 1D Gaussian mixture model did not exceed 10^-16. [sent-244, score-0.434]
87 Fortunately, there is a very simple sampler that is good at proposing merges: a data-independent, random split proposal generated from the prior with a corresponding merge move. [sent-247, score-0.716]
88 A split is constructed by sampling a random cluster, c, followed by a random partitioning of its data points from a Dirichlet-Multinomial. [sent-248, score-0.3]
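A sketch of this data-independent random split proposal: choose a cluster uniformly, draw sub-cluster weights from Dir(α/2, α/2), and partition its points by independent coin flips, which marginally gives a Dirichlet-Multinomial partition. The uniform cluster choice and all names are assumptions for illustration.

import numpy as np

def random_split_proposal(rng, z, alpha):
    """Return the chosen cluster and the index sets of the two proposed halves."""
    labels = np.unique(z)
    c = rng.choice(labels)
    idx = np.flatnonzero(z == c)
    w = rng.dirichlet([alpha / 2.0, alpha / 2.0])
    side = rng.random(idx.size) < w[1]        # False -> left half, True -> right half
    return c, idx[~side], idx[side]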
89 We consider three different versions of the proposed algorithm: using sub-clusters with and without super-clusters (SUBC and SUBC+SUPC) and an approximate method that does not wait for the convergence of sub-clusters to split (SUBC+SUPC APPROX). [sent-253, score-0.252]
90 The average log likelihood for multiple sample paths obtained using the algorithms without parallelization, for different numbers of initial clusters K and concentration parameters α, is shown in the first two columns of Figure 4. [sent-259, score-0.237]
91 However, we find that the samplers without split/merge proposals (FSD, GIBBS, GIBBS+SC) perform very poorly when the initial number of clusters and the concentration parameter are small. [sent-261, score-0.556]
92 We believe that because the GIBBS methods marginalize over the cluster parameters and weights, they achieve better performance as compared to the sub-cluster methods and FSD which explicitly instantiate them. [sent-286, score-0.285]
93 7 Conclusion We have presented a novel sampling algorithm for Dirichlet process mixture models. [sent-288, score-0.262]
94 By alternating between a restricted Gibbs sampler and a split proposal, finite approximations to the DPMM are not needed and efficient inter-cluster parallelization can be achieved. [sent-289, score-0.583]
95 Additionally, the proposed method for constructing splits based on fitting sub-clusters is, to our knowledge, the first parallelizable split algorithm for mixture models. [sent-290, score-0.593]
96 Results on both synthetic and real data demonstrate that the speed of the sampler is orders of magnitude faster than other exact MCMC methods. [sent-291, score-0.255]
97 An improved merge-split sampler for conjugate Dirichlet process mixture models. [sent-328, score-0.521]
98 A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. [sent-373, score-0.278]
99 Markov chain sampling methods for Dirichlet process mixture models. [sent-433, score-0.357]
100 Parallel Markov chain Monte Carlo for nonparametric mixture models. [sent-492, score-0.238]
wordName wordTfidf (topN-words)
[('fx', 0.334), ('samplers', 0.3), ('sampler', 0.255), ('cluster', 0.232), ('dirichlet', 0.219), ('auxiliary', 0.204), ('split', 0.188), ('ibbs', 0.187), ('dpmm', 0.154), ('hastings', 0.147), ('dpmms', 0.145), ('clusters', 0.144), ('ub', 0.144), ('mixture', 0.143), ('splits', 0.126), ('merge', 0.124), ('proposal', 0.122), ('merges', 0.122), ('fsd', 0.117), ('mcmc', 0.113), ('gibbs', 0.112), ('parallelizable', 0.109), ('ergodic', 0.1), ('chain', 0.095), ('augmented', 0.094), ('nk', 0.085), ('restricted', 0.084), ('conjugate', 0.083), ('cw', 0.083), ('sampling', 0.079), ('dp', 0.078), ('proposals', 0.075), ('labels', 0.074), ('moves', 0.071), ('zi', 0.069), ('dir', 0.063), ('balance', 0.061), ('ellipses', 0.057), ('iw', 0.057), ('parallelization', 0.056), ('posterior', 0.055), ('parallel', 0.055), ('marginalize', 0.053), ('supplement', 0.049), ('graphical', 0.049), ('label', 0.049), ('priors', 0.048), ('conditioned', 0.047), ('pprox', 0.047), ('supercluster', 0.047), ('parallelized', 0.047), ('ratio', 0.045), ('carlo', 0.044), ('regular', 0.044), ('monte', 0.044), ('markov', 0.043), ('proposes', 0.043), ('additionally', 0.041), ('fisher', 0.041), ('grouping', 0.04), ('process', 0.04), ('variables', 0.04), ('restaurant', 0.039), ('pubmed', 0.038), ('ishwaran', 0.038), ('dictionary', 0.038), ('augment', 0.038), ('grouped', 0.038), ('convergence', 0.037), ('move', 0.037), ('concentration', 0.037), ('documents', 0.037), ('csail', 0.036), ('jain', 0.036), ('detailed', 0.035), ('chinese', 0.034), ('inferred', 0.034), ('points', 0.033), ('weights', 0.033), ('nonconjugate', 0.033), ('suffer', 0.032), ('mh', 0.031), ('scans', 0.031), ('um', 0.031), ('equation', 0.031), ('lm', 0.03), ('multinomial', 0.03), ('splitting', 0.03), ('unfortunately', 0.029), ('bayesian', 0.029), ('nite', 0.029), ('stationary', 0.029), ('constructive', 0.029), ('xi', 0.028), ('marginalized', 0.028), ('capabilities', 0.028), ('hierarchical', 0.028), ('slow', 0.028), ('prior', 0.027), ('proposed', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
Author: Jason Chang, John W. Fisher III
Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1
2 0.23781222 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
Author: Dahua Lin
Abstract: Reliance on computationally expensive algorithms for inference has been limiting the use of Bayesian nonparametric models in large scale applications. To tackle this problem, we propose a Bayesian learning algorithm for DP mixture models. Instead of following the conventional paradigm – random initialization plus iterative update, we take an progressive approach. Starting with a given prior, our method recursively transforms it into an approximate posterior through sequential variational approximation. In this process, new components will be incorporated on the fly when needed. The algorithm can reliably estimate a DP mixture model in one pass, making it particularly suited for applications with massive data. Experiments on both synthetic data and real datasets demonstrate remarkable improvement on efficiency – orders of magnitude speed-up compared to the state-of-the-art. 1
3 0.16809489 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
Author: Trevor Campbell, Miao Liu, Brian Kulis, Jonathan P. How, Lawrence Carin
Abstract: This paper presents a novel algorithm, based upon the dependent Dirichlet process mixture model (DDPMM), for clustering batch-sequential data containing an unknown number of evolving clusters. The algorithm is derived via a lowvariance asymptotic analysis of the Gibbs sampling algorithm for the DDPMM, and provides a hard clustering with convergence guarantees similar to those of the k-means algorithm. Empirical results from a synthetic test with moving Gaussian clusters and a test with real ADS-B aircraft trajectory data demonstrate that the algorithm requires orders of magnitude less computational time than contemporary probabilistic and hard clustering algorithms, while providing higher accuracy on the examined datasets. 1
4 0.15254205 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models
Author: Jie Liu, David Page
Abstract: In large-scale applications of undirected graphical models, such as social networks and biological networks, similar patterns occur frequently and give rise to similar parameters. In this situation, it is beneficial to group the parameters for more efficient learning. We show that even when the grouping is unknown, we can infer these parameter groups during learning via a Bayesian approach. We impose a Dirichlet process prior on the parameters. Posterior inference usually involves calculating intractable terms, and we propose two approximation algorithms, namely a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with “stripped” Beta approximation (Gibbs SBA). Simulations show that both algorithms outperform conventional maximum likelihood estimation (MLE). Gibbs SBA’s performance is close to Gibbs sampling with exact likelihood calculation. Models learned with Gibbs SBA also generalize better than the models learned by MLE on real-world Senate voting data. 1
5 0.13927177 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models
Author: Michael Hughes, Erik Sudderth
Abstract: Variational inference algorithms provide the most effective framework for largescale training of Bayesian nonparametric models. Stochastic online approaches are promising, but are sensitive to the chosen learning rate and often converge to poor local optima. We present a new algorithm, memoized online variational inference, which scales to very large (yet finite) datasets while avoiding the complexities of stochastic gradient. Our algorithm maintains finite-dimensional sufficient statistics from batches of the full dataset, requiring some additional memory but still scaling to millions of examples. Exploiting nested families of variational bounds for infinite nonparametric models, we develop principled birth and merge moves allowing non-local optimization. Births adaptively add components to the model to escape local optima, while merges remove redundancy and improve speed. Using Dirichlet process mixture models for image clustering and denoising, we demonstrate major improvements in robustness and accuracy.
6 0.13369296 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
7 0.1290559 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling
8 0.12471783 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components
9 0.12080061 161 nips-2013-Learning Stochastic Inverses
10 0.11363042 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions
11 0.10889938 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
12 0.10276114 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
13 0.097039551 63 nips-2013-Cluster Trees on Manifolds
14 0.096215084 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
15 0.094356015 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
16 0.093853213 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models
17 0.088117801 40 nips-2013-Approximate Inference in Continuous Determinantal Processes
18 0.0868975 279 nips-2013-Robust Bloom Filters for Large MultiLabel Classification Tasks
19 0.086362511 123 nips-2013-Flexible sampling of discrete data correlations without the marginal distributions
20 0.084797733 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
topicId topicWeight
[(0, 0.198), (1, 0.077), (2, -0.071), (3, -0.0), (4, 0.033), (5, 0.186), (6, 0.224), (7, 0.009), (8, 0.147), (9, -0.035), (10, -0.038), (11, 0.087), (12, 0.024), (13, 0.099), (14, 0.038), (15, 0.012), (16, -0.001), (17, -0.117), (18, 0.07), (19, 0.064), (20, -0.052), (21, 0.068), (22, -0.016), (23, -0.038), (24, 0.03), (25, 0.059), (26, 0.001), (27, -0.13), (28, -0.016), (29, -0.013), (30, -0.003), (31, 0.076), (32, 0.132), (33, -0.101), (34, -0.03), (35, -0.069), (36, 0.086), (37, -0.049), (38, -0.011), (39, -0.057), (40, 0.004), (41, -0.037), (42, 0.066), (43, 0.105), (44, -0.069), (45, 0.018), (46, -0.067), (47, -0.067), (48, -0.059), (49, -0.083)]
simIndex simValue paperId paperTitle
same-paper 1 0.96711469 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
Author: Jason Chang, John W. Fisher III
Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1
2 0.77938044 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
Author: Trevor Campbell, Miao Liu, Brian Kulis, Jonathan P. How, Lawrence Carin
Abstract: This paper presents a novel algorithm, based upon the dependent Dirichlet process mixture model (DDPMM), for clustering batch-sequential data containing an unknown number of evolving clusters. The algorithm is derived via a lowvariance asymptotic analysis of the Gibbs sampling algorithm for the DDPMM, and provides a hard clustering with convergence guarantees similar to those of the k-means algorithm. Empirical results from a synthetic test with moving Gaussian clusters and a test with real ADS-B aircraft trajectory data demonstrate that the algorithm requires orders of magnitude less computational time than contemporary probabilistic and hard clustering algorithms, while providing higher accuracy on the examined datasets. 1
3 0.76250845 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
Author: Dahua Lin
Abstract: Reliance on computationally expensive algorithms for inference has been limiting the use of Bayesian nonparametric models in large scale applications. To tackle this problem, we propose a Bayesian learning algorithm for DP mixture models. Instead of following the conventional paradigm – random initialization plus iterative update, we take an progressive approach. Starting with a given prior, our method recursively transforms it into an approximate posterior through sequential variational approximation. In this process, new components will be incorporated on the fly when needed. The algorithm can reliably estimate a DP mixture model in one pass, making it particularly suited for applications with massive data. Experiments on both synthetic data and real datasets demonstrate remarkable improvement on efficiency – orders of magnitude speed-up compared to the state-of-the-art. 1
4 0.74222392 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models
Author: Jie Liu, David Page
Abstract: In large-scale applications of undirected graphical models, such as social networks and biological networks, similar patterns occur frequently and give rise to similar parameters. In this situation, it is beneficial to group the parameters for more efficient learning. We show that even when the grouping is unknown, we can infer these parameter groups during learning via a Bayesian approach. We impose a Dirichlet process prior on the parameters. Posterior inference usually involves calculating intractable terms, and we propose two approximation algorithms, namely a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with “stripped” Beta approximation (Gibbs SBA). Simulations show that both algorithms outperform conventional maximum likelihood estimation (MLE). Gibbs SBA’s performance is close to Gibbs sampling with exact likelihood calculation. Models learned with Gibbs SBA also generalize better than the models learned by MLE on real-world Senate voting data. 1
5 0.67889649 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components
Author: Jeffrey W. Miller, Matthew T. Harrison
Abstract: For data assumed to come from a finite mixture with an unknown number of components, it has become common to use Dirichlet process mixtures (DPMs) not only for density estimation, but also for inferences about the number of components. The typical approach is to use the posterior distribution on the number of clusters — that is, the posterior on the number of components represented in the observed data. However, it turns out that this posterior is not consistent — it does not concentrate at the true number of components. In this note, we give an elementary proof of this inconsistency in what is perhaps the simplest possible setting: a DPM with normal components of unit variance, applied to data from a “mixture” with one standard normal component. Further, we show that this example exhibits severe inconsistency: instead of going to 1, the posterior probability that there is one cluster converges (in probability) to 0. 1
6 0.6712445 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
7 0.65577817 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
8 0.64657444 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling
9 0.60780591 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
10 0.60077751 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models
11 0.59818244 161 nips-2013-Learning Stochastic Inverses
12 0.59200835 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
13 0.56478769 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models
14 0.55280524 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
15 0.55250311 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
16 0.52859849 344 nips-2013-Using multiple samples to learn mixture models
17 0.49950528 320 nips-2013-Summary Statistics for Partitionings and Feature Allocations
18 0.4972268 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing
19 0.49716979 277 nips-2013-Restricting exchangeable nonparametric distributions
20 0.49269712 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions
topicId topicWeight
[(16, 0.449), (33, 0.155), (34, 0.112), (41, 0.017), (49, 0.016), (56, 0.085), (85, 0.033), (89, 0.019), (93, 0.032)]
simIndex simValue paperId paperTitle
1 0.96669269 61 nips-2013-Capacity of strong attractor patterns to model behavioural and cognitive prototypes
Author: Abbas Edalat
Abstract: We solve the mean field equations for a stochastic Hopfield network with temperature (noise) in the presence of strong, i.e., multiply stored, patterns, and use this solution to obtain the storage capacity of such a network. Our result provides for the first time a rigorous solution of the mean filed equations for the standard Hopfield model and is in contrast to the mathematically unjustifiable replica technique that has been used hitherto for this derivation. We show that the critical temperature for stability of a strong pattern is equal to its degree or multiplicity, when the sum of the squares of degrees of the patterns is negligible compared to the network size. In the case of a single strong pattern, when the ratio of the number of all stored pattens and the network size is a positive constant, we obtain the distribution of the overlaps of the patterns with the mean field and deduce that the storage capacity for retrieving a strong pattern exceeds that for retrieving a simple pattern by a multiplicative factor equal to the square of the degree of the strong pattern. This square law property provides justification for using strong patterns to model attachment types and behavioural prototypes in psychology and psychotherapy. 1 Introduction: Multiply learned patterns in Hopfield networks The Hopfield network as a model of associative memory and unsupervised learning was introduced in [23] and has been intensively studied from a wide range of viewpoints in the past thirty years. However, properties of a strong pattern, as a pattern that has been multiply stored or learned in these networks, have only been examined very recently, a surprising delay given that repetition of an activity is the basis of learning by the Hebbian rule and long term potentiation. In particular, while the storage capacity of a Hopfield network with certain correlated patterns has been tackled [13, 25], the storage capacity of a Hopfield network in the presence of strong as well as random patterns has not been hitherto addressed. The notion of a strong pattern of a Hopfield network has been proposed in [15] to model attachment types and behavioural prototypes in developmental psychology and psychotherapy. This suggestion has been motivated by reviewing the pioneering work of Bowlby [9] in attachment theory and highlighting how a number of academic biologists, psychiatrists, psychologists, sociologists and neuroscientists have consistently regarded Hopfield-like artificial neural networks as suitable tools to model cognitive and behavioural constructs as patterns that are deeply and repeatedly learned by individuals [11, 22, 24, 30, 29, 10]. A number of mathematical properties of strong patterns in Hopfield networks, which give rise to strong attractors, have been derived in [15]. These show in particular that strong attractors are strongly stable; a series of experiments have also been carried out which confirm the mathematical 1 results and also indicate that a strong pattern stored in the network can be retrieved even in the presence of a large number of simple patterns, far exceeding the well-known maximum load parameter or storage capacity of the Hopfield network with random patterns (αc ≈ 0.138). In this paper, we consider strong patterns in stochastic Hopfield model with temperature, which accounts for various types of noise in the network. In these networks, the updating rule is probabilistic and depend on the temperature. 
Since analytical solution of such a system is not possible in general, one strives to obtain the average behaviour of the network when the input to each node, the so-called field at the node, is replaced with its mean. This is the basis of mean field theory for these networks. Due to the close connection between the Hopfield network and the Ising model in ferromagnetism [1, 8], the mean field approach for the Hopfield network and its variations has been tackled using the replica method, starting with the pioneering work of Amit, Gutfreund and Sompolinsky [3, 2, 4, 19, 31, 1, 13]. Although this method has been widely used in the theory of spin glasses in statistical physics [26, 16] its mathematical justification has proved to be elusive as we will discuss in the next section; see for example [20, page 264], [14, page 27], and [7, page 9]. In [17] and independently in [27], an alternative technique to the replica method for solving the mean field equations has been proposed which is reproduced and characterised as heuristic in [20, section 2.5] since it relies on a number of assumptions that are not later justified and uses a number of mathematical steps that are not validated. Here, we use the basic idea of the above heuristic to develop a verifiable mathematical framework with provable results grounded on elements of probability theory, with which we assume the reader is familiar. This technique allows us to solve the mean field equations for the Hopfield network in the presence of strong patterns and use the results to study, first, the stability of these patterns in the presence of temperature (noise) and, second, the storage capacity of the network with a single strong pattern at temperature zero. We show that the critical temperature for the stability of a strong pattern is equal to its degree (i.e., its multiplicity) when the ratio of the sum of the squares of degrees of the patterns to the network size tends to zero when the latter tends to infinity. In the case that there is only one strong pattern present with its degree small compared to the number of patterns and the latter is a fixed multiple of the number of nodes, we find the distribution of the overlap of the mean field and the patterns when the strong pattern is being retrieved. We use these distributions to prove that the storage capacity for retrieving a strong pattern exceeds that for a simple pattern by a multiplicative factor equal to the square of the degree of the strong attractor. This result matches the finding in [15] regarding the capacity of a network to recall strong patterns as mentioned above. Our results therefore show that strong patterns are robust and persistent in the network memory as attachment types and behavioural prototypes are in the human memory system. In this paper, we will several times use Lyapunov’s theorem in probability which provides a simple sufficient condition to generalise the Central Limit theorem when we deal with independent but not necessarily identically distributed random variables. We require a general form of this theorem kn as follows. Let Yn = N i=1 Yni , for n ∈ I , be a triangular array of random variables such that for each n, the random variables Yni , for 1 ≤ i ≤ kn are independent with E(Yni ) = 0 2 2 and E(Yni ) = σni , where E(X) stands for the expected value of the random variable X. Let kn 2 2 sn = i=1 σni . We use the notation X ∼ Y when the two random variables X and Y have the same distribution (for large n if either or both of them depend on n). 
Theorem 1.1 (Lyapunov’s theorem [6, page 368]) If for some δ > 0, we have the condition: 1 E(|Yn |2+δ |) → 0 s2+δ n d d as n → ∞ then s1 Yn −→ N (0, 1) as n → ∞ where −→ denotes convergence in distribution, and we denote n by N (a, σ 2 ) the normal distribution with mean a and variance σ 2 . Thus, for large n we have Yn ∼ N (0, s2 ). n 2 2 Mean field theory We consider a Hopfield network with N neurons i = 1, . . . , N with values Si = ±1 and follow the notations in [20]. As in [15], we assume patterns can be multiply stored and the degree of a pattern is defined as its multiplicity. The total number of patterns, counting their multiplicity, is denoted by p and we assume there are n patterns ξ 1 , . . . , ξ n with degrees d1 , . . . , dn ≥ 1 respectively and that n the remaining p − k=1 dk ≥ 0 patterns are simple, i.e., each has degree one. Note that by our assumptions there are precisely n p0 = p + n − dk k=1 distinct patterns, which we assume are independent and identically distributed with equal probability of taking value ±1 for each node. More generally, for any non-negative integer k ∈ I , we let N p0 dk . µ pk = µ=1 p µ µ 0 1 We use the generalized Hebbian rule for the synaptic couplings: wij = N µ=1 dµ ξi ξj for i = j with wii = 0 for 1 ≤ i, j ≤ N . As in the standard stochastic Hopfield model [20], we use Glauber dynamics [18] for the stochastic updating rule with pseudo-temperature T > 0, which accounts for various types of noise in the network, and assume zero bias in the local field. Putting β = 1/T (i.e., with the Boltzmann constant kB = 1) and letting fβ (h) = 1/(1 + exp(−2βh)), the stochastic updating rule at time t is given by: N Pr(Si (t + 1) = ±1) = fβ (±hi (t)), where hi (t) = wij Sj (t), (1) j=1 is the local field at i at time t. The updating is implemented asynchronously in a random way. The energy of the network in the configuration S = (Si )N is given by i=1 N 1 Si Sj wij . H(S) = − 2 i,j=1 For large N , this specifies a complex system, with an underlying state space of dimension 2N , which in general cannot be solved exactly. However, mean field theory has proved very useful in studying Hopfield networks. The average updated value of Si (t + 1) in Equation (1) is Si (t + 1) = 1/(1 + e−2βhi (t) ) − 1/(1 + e2βhi (t) ) = tanh(βhi (t)), (2) where . . . denotes taking average with respect to the probability distribution in the updating rule in Equation (1). The stationary solution for the mean field thus satisfies: Si = tanh(βhi ) , (3) The average overlap of pattern ξ µ with the mean field at the nodes of the network is given by: mν = 1 N N ν ξi Si (4) i=1 The replica technique for solving the mean field problem, used in the case p/N = α > 0 as N → ∞, seeks to obtain the average of the overlaps in Equation (4) by evaluating the partition function of the system, namely, Z = TrS exp(−βH(S)), where the trace TrS stands for taking sum over all possible configurations S = (Si )N . As it i=1 is generally the case in statistical physics, once the partition function of the system is obtained, 3 all required physical quantities can in principle be computed. However, in this case, the partition function is very difficult to compute since it entails computing the average log Z of log Z, where . . . indicates averaging over the random distribution of the stored patterns ξ µ . To overcome this problem, the identity Zk − 1 log Z = lim k→0 k is used to reduce the problem to finding the average Z k of Z k , which is then computed for positive integer values of k. 
For such k, we have: Z k = TrS 1 TrS 2 . . . TrS k exp(−β(H(S 1 ) + H(S 1 ) + . . . + H(S k ))), where for each i = 1, . . . , k the super-scripted configuration S i is a replica of the configuration state. In computing the trace over each replica, various parameters are obtained and the replica symmetry condition assumes that these parameters are independent of the particular replica under consideration. Apart from this assumption, there are two basic mathematical problems with the technique which makes it unjustifiable [20, page 264]. Firstly, the positive integer k above is eventually treated as a real number near zero without any mathematical justification. Secondly, the order of taking limits, in particular the order of taking the two limits k → 0 and N → ∞, are several times interchanged again without any mathematical justification. Here, we develop a mathematically rigorous method for solving the mean field problem, i.e., computing the average of the overlaps in Equation (4) in the case of p/N = α > 0 as N → ∞. Our method turns the basic idea of the heuristic presented in [17] and reproduced in [20] for solving the mean field equation into a mathematically verifiable formalism, which for the standard Hopfield network with random stored patterns gives the same result as the replica method, assuming replica symmetry. In the presence of strong patterns we obtain a set of new results as explained in the next two sections. The mean field equation is obtained from Equation (3) by approximating the right hand side of N this equation by the value of tanh at the mean field hi = j=1 wij Sj , ignoring the sum N j=1 wij (Sj − Sj ) for large N [17, page 32]: Si = tanh(β hi ) = tanh β N N j=1 p0 µ=1 µ µ dµ ξi ξj Sj . (5) Equation (5) gives the mean field equation for the Hopfield network with n possible strong patterns n ξ µ (1 ≤ µ ≤ n) and p − µ=1 dµ simple patterns ξ µ with n + 1 ≤ µ ≤ p0 . As in the standard Hopfield model, where all patterns are simple, we have two cases to deal with. However, we now have to account for the presence of strong attractors and our two cases will be as follows: (i) In the p0 first case we assume p2 := µ=1 d2 = o(N ), which includes the simpler case p2 N when p2 µ is fixed and independent of N . (ii) In the second case we assume we have a single strong attractor with the load parameter p/N = α > 0. 3 Stability of strong patterns with noise: p2 = o(N ) The case of constant p and N → ∞ is usually referred to as α = 0 in the standard Hopfield model. Here, we need to consider the sum of degrees of all stored patterns (and not just the number of patterns) compared to N . We solve the mean field equation with T > 0 by using a method similar in spirit to [20, page 33] for the standard Hopfield model, but in our case strong patterns induce a sequence of independent but non-identically distributed random variables in the crosstalk term, where the Central Limit Theorem cannot be used; we show however that Lyapunov’s theorem (Theorem (1.1) can be invoked. In retrieving pattern ξ 1 , we look for a solution of the mean filed 1 equation of the form: Si = mξi , where m > 0 is a constant. Using Equation (5) and separating 1 the contribution of ξ in the argument of tanh, we obtain: 1 mξi = tanh mβ 1 d1 ξi + N 4 µ µ 1 dµ ξi ξj ξj . j=i,µ>1 (6) For each N , µ > 1 and j = i, let dµ µ µ 1 (7) ξ ξ ξ . N i j j 2 This gives (p0 − 1)(N − 1) independent random variables with E(YN µj ) = 0, E(YN µj ) = d2 /N 2 , µ 3 3 3 and E(|YN µj |) = dµ /N . 
We have: YN µj = s2 := N 2 E(YN µj ) = µ>1,j=i 1 N −1 d2 ∼ N 2 µ>1 µ N d2 . µ (8) µ>1 Thus, as N → ∞, we have: 1 s3 N 3 E(|YN µj |) ∼ √ µ>1,j=i µ>1 N( d3 µ µ>1 d2 )3/2 µ → 0. (9) as N → ∞ since for positive numbers dµ we always have µ>1 d3 < ( µ>1 d2 )3/2 . Thus the µ µ Lyapunov condition is satisfied for δ = 1. By Lyapunov’s theorem we deduce: 1 N µ µ 1 dµ ξi ξj ξj ∼ N d2 /N µ 0, (10) µ>1 µ>1,j=i Since we also have p2 = o(N ), it follows that we can ignore the second term, i.e., the crosstalk term, in the argument of tanh in Equation (6) as N → ∞; we thus obtain: m = tanh βd1 m. (11) To examine the fixed points of the Equation (11), we let d = d1 for convenience and put x = βdm = dm/T , so that tanh x = T x/d; see Figure 1. It follows that Tc = d is the critical temperature. If T < d then there is a non-zero (non-trivial) solution for m, whereas for T > d we only have the trivial solution. For d = 1 our solution is that of the standard Hopfield network as in [20, page 34]. (d < T) y>x y = x ( d = T) y = tanh x y
2 0.92905849 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
Author: Barbara Rakitsch, Christoph Lippert, Karsten Borgwardt, Oliver Stegle
Abstract: Multi-task prediction methods are widely used to couple regressors or classification models by sharing information across related tasks. We propose a multi-task Gaussian process approach for modeling both the relatedness between regressors and the task correlations in the residuals, in order to more accurately identify true sharing between regressors. The resulting Gaussian model has a covariance term in form of a sum of Kronecker products, for which efficient parameter inference and out of sample prediction are feasible. On both synthetic examples and applications to phenotype prediction in genetics, we find substantial benefits of modeling structured noise compared to established alternatives. 1
3 0.90957946 245 nips-2013-Pass-efficient unsupervised feature selection
Author: Crystal Maung, Haim Schweitzer
Abstract: The goal of unsupervised feature selection is to identify a small number of important features that can represent the data. We propose a new algorithm, a modification of the classical pivoted QR algorithm of Businger and Golub, that requires a small number of passes over the data. The improvements are based on two ideas: keeping track of multiple features in each pass, and skipping calculations that can be shown not to affect the final selection. Our algorithm selects the exact same features as the classical pivoted QR algorithm, and has the same favorable numerical stability. We describe experiments on real-world datasets which sometimes show improvements of several orders of magnitude over the classical algorithm. These results appear to be competitive with recently proposed randomized algorithms in terms of pass efficiency and run time. On the other hand, the randomized algorithms may produce more accurate features, at the cost of small probability of failure. 1
same-paper 4 0.87394667 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
Author: Jason Chang, John W. Fisher III
Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1
5 0.81512916 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
Author: Francesca Petralia, Joshua T. Vogelstein, David Dunson
Abstract: Nonparametric estimation of the conditional distribution of a response given highdimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efficiently to approximately one million features. State of the art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. 1
6 0.81211472 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling
7 0.62436604 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
8 0.61891186 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
9 0.61179018 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models
10 0.598378 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
11 0.59569991 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
12 0.59147066 252 nips-2013-Predictive PAC Learning and Process Decompositions
13 0.58685505 47 nips-2013-Bayesian Hierarchical Community Discovery
14 0.58044332 355 nips-2013-Which Space Partitioning Tree to Use for Search?
15 0.58025253 350 nips-2013-Wavelets on Graphs via Deep Learning
16 0.57573587 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
17 0.57460558 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
18 0.57141596 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components
19 0.5692386 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
20 0.5670315 41 nips-2013-Approximate inference in latent Gaussian-Markov models from continuous time observations