nips nips2013 nips2013-229 knowledge-graph by maker-knowledge-mining

229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation


Source: pdf

Author: Dahua Lin

Abstract: Reliance on computationally expensive algorithms for inference has been limiting the use of Bayesian nonparametric models in large scale applications. To tackle this problem, we propose a Bayesian learning algorithm for DP mixture models. Instead of following the conventional paradigm – random initialization plus iterative update, we take a progressive approach. Starting with a given prior, our method recursively transforms it into an approximate posterior through sequential variational approximation. In this process, new components will be incorporated on the fly when needed. The algorithm can reliably estimate a DP mixture model in one pass, making it particularly suited for applications with massive data. Experiments on both synthetic data and real datasets demonstrate remarkable improvement in efficiency – orders of magnitude speed-up compared to the state-of-the-art. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Reliance on computationally expensive algorithms for inference has been limiting the use of Bayesian nonparametric models in large scale applications. [sent-2, score-0.167]

2 To tackle this problem, we propose a Bayesian learning algorithm for DP mixture models. [sent-3, score-0.201]

3 Instead of following the conventional paradigm – random initialization plus iterative update, we take a progressive approach. [sent-4, score-0.155]

4 Starting with a given prior, our method recursively transforms it into an approximate posterior through sequential variational approximation. [sent-5, score-0.484]

5 In this process, new components will be incorporated on the fly when needed. [sent-6, score-0.129]

6 The algorithm can reliably estimate a DP mixture model in one pass, making it particularly suited for applications with massive data. [sent-7, score-0.234]

7 1 Introduction Bayesian nonparametric mixture models [7] provide an important framework to describe complex data. [sent-9, score-0.246]

8 In this family of models, Dirichlet process mixture models (DPMM) [1, 15, 18] are among the most popular in practice. [sent-10, score-0.172]

9 As opposed to traditional parametric models, DPMM allows the number of components to vary during inference, thus providing great flexibility for explorative analysis. [sent-11, score-0.129]

10 MCMC sampling [12, 14] is the conventional approach to Bayesian nonparametric estimation. [sent-13, score-0.156]

11 Typical variational methods for nonparametric mixture models rely on a truncated approximation of the stick breaking construction [16], which requires a fixed number of components to be maintained and iteratively updated during inference. [sent-17, score-0.703]

12 The truncation level is usually set conservatively to ensure approximation accuracy, incurring a considerable amount of unnecessary computation. [sent-18, score-0.167]

13 Both MCMC sampling and variational inference maintain the entire configuration and perform iterative updates over multiple passes, which are often too expensive for large scale applications. [sent-21, score-0.293]

14 This challenge motivated us to develop a new learning method for Bayesian nonparametric models that can handle massive data efficiently. [sent-22, score-0.161]

15 In this paper, we propose an online Bayesian learning algorithm for generic DP mixture models. [sent-23, score-0.178]

16 Instead, it begins with the prior DP(αµ) and progressively transforms it into an approximate posterior of the mixtures, with new components introduced on the fly as needed. [sent-25, score-0.354]

17 Based on a new way of variational approximation, the algorithm proceeds sequentially, taking in one sample at a time to make the update. [sent-26, score-0.203]

18 We also devise specific steps to prune redundant components and merge similar ones, thus further improving the performance. [sent-27, score-0.299]

19 We tested the proposed method on synthetic data as well as two real applications: modeling image patches and clustering documents. [sent-28, score-0.17]

20 Results show empirically that the proposed algorithm can reliably estimate a DP mixture model in a single pass over large datasets. [sent-29, score-0.181]

21 Dahl [6] developed the sequentially allocated sampler, where splits are proposed by sequentially allocating observations to one of two split components through sequential importance sampling. [sent-33, score-0.303]

22 There has also been substantial advancement in variational inference. [sent-35, score-0.203]

23 A significant development along this line is Stochastic Variational Inference, a framework that incorporates stochastic optimization with variational inference [8]. [sent-36, score-0.297]

24 Wang and Blei [21] also proposed a truncation-free variational inference method for generic BNP models, where a sampling step is used for updating atom assignment that allows new atoms to be created on the fly. [sent-39, score-0.372]

25 Bryant and Sudderth [5] recently developed an online variational inference algorithm for HDP, using mini-batches to handle streaming data and split-merge moves to adapt truncation levels. [sent-40, score-0.375]

26 The differences stem from the theoretical basis – our method uses sequential approximation based on the predictive law, while theirs is an extension of the standard truncation-based model. [sent-46, score-0.233]

27 [13] recently proposed a method, called VSUGS, for fast estimation of DP mixture models. [sent-48, score-0.138]

28 Particularly, what we approximate is a joint posterior over both data allocation and model parameters, while VSUGS is based on approximating the posterior of data allocation. [sent-50, score-0.334]

29 3 Nonparametric Mixture Models This section provides a brief review of the Dirichlet Process Mixture Model – one of the most widely used nonparametric mixture models. [sent-53, score-0.246]

30 Instead of maintaining θi explicitly, we introduce an indicator zi for each i with θi = φzi. [sent-73, score-0.127]

31 4 Variational Approximation of Posterior Generally, there are two approaches to learning a mixture model from observed data, namely maximum likelihood estimation (MLE) and Bayesian learning. [sent-83, score-0.138]

32 Specifically, maximum likelihood estimation seeks an optimal point estimate of ν, while Bayesian learning aims to derive the posterior distribution over the mixtures. [sent-84, score-0.167]

33 In particular, for DPMM, the predictive distribution of component parameters, conditioned on a set of observed samples x1:n, is given by p(θ|x1:n) = E_{D|x1:n}[p(θ|D)]. (5) [sent-87, score-0.161]

34 In this section, we derive a tractable approximation of this predictive distribution based on a detailed analysis of the posterior. [sent-92, score-0.119]

35 Then the posterior distribution of D remains a DP, as D|θ1:n ∼ DP(α̃ µ̃), where α̃ = α + n and µ̃ = (α/(α+n)) µ + Σ_{k=1}^{K} (|Ck|/(α+n)) δφk. (6) [sent-104, score-0.167]

36 The atoms are generally unobservable, and therefore it is more interesting in practice to consider the posterior distribution of D given the observed samples. [sent-105, score-0.237]

37 For this purpose, we derive the lemma below that provides a constructive characterization of the posterior distribution given both the observed samples x1:n and the partition z. [sent-106, score-0.266]

38 Drawing a sample from the posterior distribution p(D|z1:n, x1:n) is equivalent to constructing a random probability measure β0 D + Σ_{k=1}^{K} βk δφk, with D ∼ DP(αµ) and (β0, β1, ..., βK) ∼ Dir(α, m1, ..., mK). (7) [sent-110, score-0.167]

39 Here, mk = |Ck|, and µ|Ck denotes the posterior distribution of the component parameter given the samples in Ck. [sent-117, score-0.225]

40 The atoms φ1, ..., φK are nondeterministic; instead, they follow the posterior distributions µ|Ck. [sent-124, score-0.167]
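
The constructive characterization above can be simulated directly. The following is a minimal illustrative sketch (not part of the paper): it assumes a one-dimensional Gaussian likelihood of unit variance with a N(0, s0^2) base measure, uses the Dir(α, m1, ..., mK) weights from the lemma, and truncates the unobserved remainder D with stick-breaking; the function name sample_posterior_dp and all parameter values are hypothetical.

```python
import numpy as np

def sample_posterior_dp(x, z, alpha=1.0, s0=3.0, trunc=50, rng=None):
    """Draw one random measure from p(D | z, x) as (atom locations, weights)."""
    rng = np.random.default_rng() if rng is None else rng
    ks = np.unique(z)
    m = np.array([(z == k).sum() for k in ks], dtype=float)   # cluster sizes |C_k|
    # (beta_0, beta_1, ..., beta_K) ~ Dir(alpha, m_1, ..., m_K)
    betas = rng.dirichlet(np.concatenate(([alpha], m)))
    # observed atoms phi_k ~ mu|C_k (Gaussian-Gaussian conjugacy, unit observation variance)
    phis = []
    for k, mk in zip(ks, m):
        post_var = 1.0 / (1.0 / s0**2 + mk)
        phis.append(rng.normal(post_var * x[z == k].sum(), np.sqrt(post_var)))
    # unobserved part beta_0 * D with D ~ DP(alpha * mu), via truncated stick-breaking
    v = rng.beta(1.0, alpha, size=trunc)
    sticks = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    locations = np.concatenate((phis, rng.normal(0.0, s0, size=trunc)))
    weights = np.concatenate((betas[1:], betas[0] * sticks))
    return locations, weights
```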

41 By marginalizing out the partition z1:n, we obtain the posterior distribution p(D|x1:n) = Σ_{z1:n} p(z1:n|x1:n) p(D|x1:n, z1:n). [sent-125, score-0.202]

42 To tackle this difficulty, we resort to variational approximation, that is, to choose a tractable distribution to approximate p(D|x1:n , z1:n ). [sent-133, score-0.266]

43 With this design, zi for different i and φk for different k are independent w.r.t. q. [sent-150, score-0.127]

44 The approximate posterior in Eq. (12) has two sets of parameters: ρ = (ρ1, ..., ρn) and ν = (ν1, ..., νK). [sent-155, score-0.167]

45 With this approximation, the task of Bayesian learning reduces to the problem of finding the optimal setting of these parameters such that q(D|ρ, ν) best approximates the true posterior distribution. [sent-162, score-0.167]

46 The method starts with a minimal model (K = 1) and progressively refines it as samples come in, adding new components on the fly when needed. [sent-170, score-0.217]

47 Specifically, when the first sample x1 is observed, we introduce the first component and denote the posterior for this component by ν1 . [sent-171, score-0.297]

48 ρ1 (z1 = 1) = 1, and the posterior distribution over the component parameter is ν1 (dθ) ∝ µ(dθ)F (x1 |θ). [sent-174, score-0.232]

49 To explain xi+1, we can use either of the K existing components or introduce a new component φK+1. [sent-184, score-0.194]

50 Using the approximation in Eq. (10) for the posterior p(·|x1:i), we get p(zi+1, φ1:K+1 | x1:i+1) ∝ q(zi+1 | ρ1:i, ν(i)) p(xi+1 | zi+1, φ1:K+1). [sent-193, score-0.167]

51 Then, the optimal settings of ρi+1 and ν(i+1) that minimize the Kullback-Leibler divergence between q(zi+1, φ1:K+1 | ρ1:i+1, ν(i+1)) and the approximate posterior in Eq. [sent-194, score-0.167]

52 (14) are given as follows: ρi+1(k) ∝ wk(i) ∫θ F(xi+1|θ) νk(i)(dθ) for k ≤ K, and ρi+1(K+1) ∝ α ∫θ F(xi+1|θ) µ(dθ). (15) Algorithm 1: Sequential Bayesian Learning of DPMM (for conjugate cases). [sent-195, score-0.229]

53 For each candidate k, compute the marginal log-likelihood hi(k), with hi(K+1) ← B(λ + Ti, λ′ + τ) − B(λ, λ′) − bi for the new component, and set ρi(k) ← wk e^{hi(k)} / Σl wl e^{hi(l)} for k = 1, ..., [sent-203, score-0.321]

54 K + 1 with wK+1 = α; if ρi(K+1) > ε, then wk ← wk + ρi(k), ζk ← ζk + ρi(k)Ti, and ζk′ ← ζk′ + ρi(k)τ for k = 1, ..., [sent-206, score-0.278]

55 K, followed by wK+1 ← ρi(K+1), ζK+1 ← ρi(K+1)Ti, ζK+1′ ← ρi(K+1)τ, and K ← K + 1; else re-normalize ρi such that Σ_{k=1}^{K} ρi(k) = 1 and set wk ← wk + ρi(k), ζk ← ζk + ρi(k)Ti, and ζk′ ← ζk′ + ρi(k)τ for k = 1, ..., [sent-209, score-0.278]

56 K (end if, end for), with wk(i) = Σ_{j=1}^{i} ρj(k), and νk(i+1)(dθ) ∝ µ(dθ) Π_{j=1}^{i+1} F(xj|θ)^{ρj(k)} for k ≤ K, and νK+1(i+1)(dθ) ∝ µ(dθ) F(xi+1|θ)^{ρi+1(K+1)} for k = K + 1. (16) [sent-212, score-0.139]
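
To make the sequential update concrete, here is a small illustrative sketch in the spirit of Alg. 1, specialized to a one-dimensional Gaussian likelihood of unit variance with a conjugate N(0, s0^2) prior. The threshold eps, the function names, and the natural-parameter bookkeeping as (precision, precision-weighted mean) are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def predictive_loglik(x, prec, prec_mean):
    # log N(x | m, 1 + 1/prec): marginal likelihood of x under a N(m, 1/prec)
    # posterior over theta and a unit-variance Gaussian observation model
    m, var = prec_mean / prec, 1.0 + 1.0 / prec
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - m) ** 2 / var)

def sva_gaussian(xs, alpha=1.0, s0=3.0, eps=1e-2):
    w, prec, prec_mean = [], [], []            # per-component weight and natural parameters
    for x in xs:
        h = [predictive_loglik(x, p, pm) for p, pm in zip(prec, prec_mean)]
        h.append(predictive_loglik(x, 1.0 / s0**2, 0.0))     # candidate new component
        logits = np.log(np.array(w + [alpha])) + np.array(h)
        rho = np.exp(logits - logits.max())
        rho /= rho.sum()
        if rho[-1] > eps:                       # accept the new component
            w.append(0.0); prec.append(1.0 / s0**2); prec_mean.append(0.0)
        else:                                   # discard it and re-normalize
            rho = rho[:-1] / rho[:-1].sum()
        for k, r in enumerate(rho):             # weighted sufficient-statistic updates
            w[k] += r; prec[k] += r; prec_mean[k] += r * x
    return w, prec, prec_mean
```

On a stream xs, this makes a single pass and returns the accumulated weights and Gaussian posterior parameters of each component.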

57 This sequential approximation scheme introduces a new component for each sample, resulting in n components over the entire dataset. [sent-215, score-0.361]

58 5 Algorithm and Implementation This section discusses the implementation of the sequential Bayesian learning algorithm under two different circumstances: (1) µ and F are exponential family distributions that form a conjugate pair, and (2) µ is not a conjugate prior w.r.t. F. [sent-220, score-0.204]

59 Given observations x1, ..., xn, the posterior distribution remains in the same family as µ with parameters (λ + Σ_{i=1}^{n} T(xi), λ′ + nτ). [sent-230, score-0.167]

60 In such cases, both the base measure µ and the component-specific posterior measures νk can be represented using natural parameter pairs, which we denote by (λ, λ′) and (ζk, ζk′). [sent-232, score-0.2]

61 With this notation, we derive a sequential learning algorithm for conjugate cases, as shown in Alg 1. [sent-233, score-0.204]

62 Unlike in the conjugate cases discussed above, there exist no formulas to update posterior parameters or to compute marginal likelihood in general. [sent-236, score-0.257]

63 Consider a posterior distribution given by p(θ|x1:n) ∝ µ(θ) Π_{i=1}^{n} F(xi|θ). [sent-238, score-0.167]
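
Since no closed-form update exists here, one option in the spirit of the stochastic optimization used by the non-conjugate variant is a gradient step on the weighted log-posterior of each component parameter. The sketch below is a hedged illustration only; the step size, the prior damping by n_seen, and the function names are assumptions rather than the paper's exact procedure.

```python
def sgd_component_update(theta, x, rho_k, grad_logprior, grad_loglik, step=0.05, n_seen=1):
    """One stochastic-gradient ascent step on log mu(theta) + sum_j rho_j(k) log F(x_j | theta),
    using only the newest sample x; the prior gradient is scaled by 1/n_seen so that repeated
    per-sample steps do not over-count it."""
    g = grad_logprior(theta) / max(n_seen, 1) + rho_k * grad_loglik(theta, x)
    return theta + step * g
```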

64 As opposed to random initialization, components created during this sequential construction are often truly needed, as the decisions of creating new components are based on knowledge accumulated from previous samples. [sent-252, score-0.416]

65 However, it is still possible that some components introduced at early iterations would become less useful and that multiple components may be similar. [sent-253, score-0.258]

66 We thus introduce a mechanism to remove undesirable components and merge similar ones. [sent-254, score-0.188]

67 Let w̃k(i) = wk(i) / Σl wl(i) (with wk(i) = Σ_{j=1}^{i} ρj(k)) be the relative weight of a component at the i-th iteration. [sent-256, score-0.522]

68 Once the relative weight of a component drops below a small threshold εr , we remove it to save unnecessary computation on this component in the future. [sent-257, score-0.202]

69 The similarity between two components φk and φk′ can be measured in terms of the distance between ρi(k) and ρi(k′) over all processed samples, as dρ(k, k′) = Σ_{j=1}^{i−1} |ρj(k) − ρj(k′)|. [sent-258, score-0.129]

70 We also merge the associated sufficient statistics (for the conjugate case) or take a weighted average of the parameters (for the non-conjugate case). [sent-262, score-0.149]

71 Since computing this distance between a pair of components takes O(n), we propose to examine similarities at an O(i · K)-interval so that the amortized complexity is maintained at O(nK). [sent-264, score-0.129]
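
The pruning and merging steps can be organized as in the following sketch, which is illustrative rather than the paper's code: the thresholds eps_r and eps_d, the container layout, and the choice to add sufficient statistics on a merge (conjugate case) are assumptions.

```python
import numpy as np

def prune_and_merge(weights, stats, rho_history, eps_r=1e-3, eps_d=0.05):
    """weights: list of accumulated w_k; stats: list of sufficient-statistic arrays;
    rho_history: array of shape (i, K) holding rho_j(k) for the processed samples."""
    total = sum(weights)
    keep = [k for k, w in enumerate(weights) if w / total >= eps_r]   # prune light components
    weights = [weights[k] for k in keep]
    stats = [stats[k] for k in keep]
    rho_history = rho_history[:, keep]
    # merge pairs whose assignment traces are nearly identical (small average gap)
    k = 0
    while k < len(weights):
        j = k + 1
        while j < len(weights):
            d = np.abs(rho_history[:, k] - rho_history[:, j]).mean()
            if d < eps_d:
                weights[k] += weights[j]
                stats[k] = stats[k] + stats[j]        # conjugate case: add sufficient stats
                rho_history[:, k] += rho_history[:, j]
                weights.pop(j); stats.pop(j)
                rho_history = np.delete(rho_history, j, axis=1)
            else:
                j += 1
        k += 1
    return weights, stats, rho_history
```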

72 First, it builds up the model on the fly, thus avoiding the need for randomly initializing a set of components as required by truncation-based methods. [sent-267, score-0.129]

73 Second, the model can be further updated (e.g., adding more components or adapting existing components) when new data is available. [sent-270, score-0.129]

74 This distinguishes it from MCMC methods and conventional variational learning algorithms, making it a great fit for large scale problems. [sent-272, score-0.251]

75 6 Experiments To test the proposed algorithm, we conducted experiments on both synthetic data and real world applications – modeling image patches and document clustering. [sent-273, score-0.213]

76 Specifically, we constructed a data set comprised of 10000 samples in 9 Gaussian clusters of unit variance. [sent-277, score-0.121]

77 The estimation of these Gaussian components is based on the DPMM below: D ∼ DP(α · N(0, σp² I)), θi ∼ D, xi ∼ N(θi, σx² I). [sent-279, score-0.206]
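
For reference, data matching this model can be generated through the Chinese restaurant process representation of the DP, as in the hedged sketch below; the parameter values and the function name are placeholders, not the settings used in the experiments.

```python
import numpy as np

def sample_dpmm_gaussian(n, dim=2, alpha=1.0, sigma_p=5.0, sigma_x=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    thetas, counts, X, Z = [], [], [], []
    for _ in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(thetas):                             # seat at a new table
            thetas.append(rng.normal(0.0, sigma_p, size=dim))
            counts.append(0)
        counts[k] += 1
        X.append(rng.normal(thetas[k], sigma_x))         # x_i ~ N(theta_{z_i}, sigma_x^2 I)
        Z.append(k)
    return np.array(X), np.array(Z)
```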

78 The key difference that leads to this improvement is that CGS and TFV rely on random initialization to bootstrap the algorithm, which would inevitably introduce similar components, while SVA leverages information gained from previous samples to decide whether new components are needed. [sent-296, score-0.21]

79 Without the prune-and-merge steps, it takes much longer for redundant components to fade out. [sent-301, score-0.177]

80 To tackle this problem, we applied our method to learn a nonparametric dictionary from the SUN database [24], a large dataset comprised of over 130K images, which capture a broad variety of scenes. [sent-313, score-0.266]

81 We extracted 2000 patches of size 32 × 32 from each image, and characterized each patch by a 128-dimensional SIFT feature. [sent-315, score-0.118]

82 Figure 4: Average log-likelihood on image modeling as a function of run-time (TVF, SVA, SVA-PM). [sent-324, score-0.133]

83 Figure 5: Average log-likelihood of document clusters as a function of run-time (TVF, SVA, SVA-PM). [sent-325, score-0.196]

84 We notice in our experiments that even without an explicit redundancy-removal mechanism, some unnecessary components can still get removed when their relative weights decrease and become negligible. [sent-332, score-0.201]

85 Unlike a standard topic modeling task, this is a higher-level application that builds on top of the topic representation. [sent-336, score-0.13]

86 Specifically, we first obtain a collection of m topics from a subset of documents, and characterize all documents by topic proportions. [sent-337, score-0.13]

87 We assume that the topic proportion vector is generated from a category-specific Dirichlet distribution, as follows D ∼ DP (α · Dirsym (γp )) , θ i ∼ D, xi ∼ Dir(γx θ i ). [sent-338, score-0.142]

88 To generate a document, we draw a mean probability vector θi from D, and generate the topic proportion vector xi from Dir(γx θi). [sent-340, score-0.142]
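
A small sketch of this generative process is given below for illustration; the values of alpha, gamma_p, and gamma_x and the function name are placeholders rather than the settings used in the paper.

```python
import numpy as np

def sample_doc_proportions(n_docs, n_topics, alpha=1.0, gamma_p=0.5, gamma_x=50.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    means, counts, X = [], [], []
    for _ in range(n_docs):
        probs = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(means):                               # new cluster mean from Dirsym(gamma_p)
            means.append(rng.dirichlet(np.full(n_topics, gamma_p)))
            counts.append(0)
        counts[k] += 1
        X.append(rng.dirichlet(gamma_x * means[k]))       # x_i ~ Dir(gamma_x * theta_i)
    return np.array(X)
```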

89 Note that this is not a conjugate model, and we use stochastic optimization instead of Bayesian updates in SVA (see section 5). [sent-342, score-0.156]

90 Also, TVF tends to generate more components than necessary, while SVA-PM maintains better performance using far fewer components. [sent-351, score-0.129]

91 7 Conclusion We presented an online Bayesian learning algorithm to estimate DP mixture models. [sent-352, score-0.178]

92 Instead, it can reliably and efficiently learn a DPMM from scratch through sequential approximation in a single pass. [sent-354, score-0.242]

93 It is worth noting that the approximation is derived based on the predictive law of DPMM. [sent-357, score-0.155]

94 Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. [sent-361, score-0.358]

95 Truly nonparametric online variational inference for hierarchical Dirichlet processes. [sent-378, score-0.578]

96 Sequentially-allocated merge-split sampler for conjugate and nonconjugate Dirichlet process mixture models, 2005. [sent-383, score-0.43]

97 Effective split-merge Monte Carlo methods for nonparametric models of sequential data. [sent-397, score-0.222]

98 A sequential algorithm for fast fitting of Dirichlet process mixture models. [sent-413, score-0.454]

99 A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. [sent-431, score-0.34]

100 A split-merge MCMC algorithm for the hierarchical Dirichlet process. [sent-447, score-0.24]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sva', 0.473), ('dp', 0.232), ('variational', 0.203), ('dpmm', 0.202), ('ck', 0.199), ('tfv', 0.195), ('dirichlet', 0.168), ('tvf', 0.167), ('posterior', 0.167), ('cgs', 0.158), ('wk', 0.139), ('mixture', 0.138), ('components', 0.129), ('zi', 0.127), ('sequential', 0.114), ('vsugs', 0.111), ('nonparametric', 0.108), ('chong', 0.093), ('conjugate', 0.09), ('bayesian', 0.082), ('xi', 0.077), ('patches', 0.077), ('hdp', 0.074), ('lik', 0.074), ('unnecessary', 0.072), ('mcmc', 0.072), ('atoms', 0.07), ('bnp', 0.068), ('bryant', 0.068), ('david', 0.067), ('predictive', 0.066), ('component', 0.065), ('topic', 0.065), ('tackle', 0.063), ('prune', 0.063), ('ti', 0.062), ('dir', 0.06), ('clusters', 0.06), ('merge', 0.059), ('inference', 0.059), ('progressively', 0.058), ('mk', 0.058), ('hour', 0.056), ('ehi', 0.056), ('eprints', 0.056), ('progressive', 0.056), ('trapped', 0.054), ('approximation', 0.053), ('massive', 0.053), ('synthetic', 0.053), ('wang', 0.051), ('initialization', 0.051), ('nott', 0.049), ('conventional', 0.048), ('redundant', 0.048), ('blei', 0.046), ('julia', 0.045), ('truly', 0.044), ('reliably', 0.043), ('document', 0.043), ('reliance', 0.043), ('kurihara', 0.043), ('truncation', 0.042), ('pm', 0.042), ('collapsed', 0.042), ('patch', 0.041), ('erik', 0.04), ('stick', 0.04), ('wl', 0.04), ('atom', 0.04), ('image', 0.04), ('online', 0.04), ('alg', 0.039), ('michael', 0.038), ('loglikelihood', 0.037), ('law', 0.036), ('crp', 0.036), ('partition', 0.035), ('whye', 0.035), ('yee', 0.035), ('traces', 0.035), ('stochastic', 0.035), ('constructive', 0.034), ('dictionary', 0.034), ('process', 0.034), ('base', 0.033), ('documents', 0.033), ('topics', 0.032), ('breaking', 0.032), ('scratch', 0.032), ('moves', 0.031), ('guration', 0.031), ('updates', 0.031), ('mixtures', 0.031), ('restaurant', 0.031), ('comprised', 0.031), ('sequentially', 0.03), ('database', 0.03), ('samples', 0.03), ('bi', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999934 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

Author: Dahua Lin

Abstract: Reliance on computationally expensive algorithms for inference has been limiting the use of Bayesian nonparametric models in large scale applications. To tackle this problem, we propose a Bayesian learning algorithm for DP mixture models. Instead of following the conventional paradigm – random initialization plus iterative update, we take a progressive approach. Starting with a given prior, our method recursively transforms it into an approximate posterior through sequential variational approximation. In this process, new components will be incorporated on the fly when needed. The algorithm can reliably estimate a DP mixture model in one pass, making it particularly suited for applications with massive data. Experiments on both synthetic data and real datasets demonstrate remarkable improvement in efficiency – orders of magnitude speed-up compared to the state-of-the-art. 1

2 0.23781222 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

Author: Jason Chang, John W. Fisher III

Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1

3 0.19369496 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

Author: Michael Hughes, Erik Sudderth

Abstract: Variational inference algorithms provide the most effective framework for large-scale training of Bayesian nonparametric models. Stochastic online approaches are promising, but are sensitive to the chosen learning rate and often converge to poor local optima. We present a new algorithm, memoized online variational inference, which scales to very large (yet finite) datasets while avoiding the complexities of stochastic gradient. Our algorithm maintains finite-dimensional sufficient statistics from batches of the full dataset, requiring some additional memory but still scaling to millions of examples. Exploiting nested families of variational bounds for infinite nonparametric models, we develop principled birth and merge moves allowing non-local optimization. Births adaptively add components to the model to escape local optima, while merges remove redundancy and improve speed. Using Dirichlet process mixture models for image clustering and denoising, we demonstrate major improvements in robustness and accuracy.

4 0.14235538 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits

Author: Ben Shababo, Brooks Paige, Ari Pakman, Liam Paninski

Abstract: With the advent of modern stimulation techniques in neuroscience, the opportunity arises to map neuron to neuron connectivity. In this work, we develop a method for efficiently inferring posterior distributions over synaptic strengths in neural microcircuits. The input to our algorithm is data from experiments in which action potentials from putative presynaptic neurons can be evoked while a subthreshold recording is made from a single postsynaptic neuron. We present a realistic statistical model which accounts for the main sources of variability in this experiment and allows for significant prior information about the connectivity and neuronal cell types to be incorporated if available. Due to the technical challenges and sparsity of these systems, it is important to focus experimental time stimulating the neurons whose synaptic strength is most ambiguous, therefore we also develop an online optimal design algorithm for choosing which neurons to stimulate at each trial. 1

5 0.136917 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components

Author: Jeffrey W. Miller, Matthew T. Harrison

Abstract: For data assumed to come from a finite mixture with an unknown number of components, it has become common to use Dirichlet process mixtures (DPMs) not only for density estimation, but also for inferences about the number of components. The typical approach is to use the posterior distribution on the number of clusters — that is, the posterior on the number of components represented in the observed data. However, it turns out that this posterior is not consistent — it does not concentrate at the true number of components. In this note, we give an elementary proof of this inconsistency in what is perhaps the simplest possible setting: a DPM with normal components of unit variance, applied to data from a “mixture” with one standard normal component. Further, we show that this example exhibits severe inconsistency: instead of going to 1, the posterior probability that there is one cluster converges (in probability) to 0. 1

6 0.12956414 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

7 0.12348704 317 nips-2013-Streaming Variational Bayes

8 0.11473642 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex

9 0.11205296 347 nips-2013-Variational Planning for Graph-based MDPs

10 0.10800646 143 nips-2013-Integrated Non-Factorized Variational Inference

11 0.10627835 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

12 0.10195021 301 nips-2013-Sparse Additive Text Models with Low Rank Background

13 0.096633129 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series

14 0.095385894 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models

15 0.093763962 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

16 0.093458645 193 nips-2013-Mixed Optimization for Smooth Functions

17 0.091535665 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models

18 0.087831981 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking

19 0.086460687 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking

20 0.085216254 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.231), (1, 0.075), (2, -0.077), (3, -0.011), (4, 0.016), (5, 0.168), (6, 0.2), (7, 0.058), (8, 0.193), (9, -0.024), (10, -0.045), (11, 0.186), (12, -0.0), (13, 0.072), (14, -0.013), (15, -0.03), (16, 0.093), (17, -0.113), (18, -0.009), (19, -0.031), (20, 0.022), (21, 0.015), (22, -0.067), (23, -0.016), (24, -0.063), (25, -0.018), (26, 0.012), (27, -0.082), (28, 0.016), (29, -0.014), (30, 0.009), (31, 0.114), (32, 0.036), (33, -0.052), (34, -0.099), (35, 0.003), (36, 0.045), (37, 0.04), (38, -0.052), (39, 0.039), (40, -0.032), (41, -0.112), (42, -0.005), (43, 0.06), (44, -0.003), (45, 0.019), (46, -0.037), (47, 0.018), (48, -0.033), (49, -0.117)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9555909 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

Author: Dahua Lin

Abstract: Reliance on computationally expensive algorithms for inference has been limiting the use of Bayesian nonparametric models in large scale applications. To tackle this problem, we propose a Bayesian learning algorithm for DP mixture models. Instead of following the conventional paradigm – random initialization plus iterative update, we take a progressive approach. Starting with a given prior, our method recursively transforms it into an approximate posterior through sequential variational approximation. In this process, new components will be incorporated on the fly when needed. The algorithm can reliably estimate a DP mixture model in one pass, making it particularly suited for applications with massive data. Experiments on both synthetic data and real datasets demonstrate remarkable improvement in efficiency – orders of magnitude speed-up compared to the state-of-the-art. 1

2 0.87632227 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

Author: Michael Hughes, Erik Sudderth

Abstract: Variational inference algorithms provide the most effective framework for large-scale training of Bayesian nonparametric models. Stochastic online approaches are promising, but are sensitive to the chosen learning rate and often converge to poor local optima. We present a new algorithm, memoized online variational inference, which scales to very large (yet finite) datasets while avoiding the complexities of stochastic gradient. Our algorithm maintains finite-dimensional sufficient statistics from batches of the full dataset, requiring some additional memory but still scaling to millions of examples. Exploiting nested families of variational bounds for infinite nonparametric models, we develop principled birth and merge moves allowing non-local optimization. Births adaptively add components to the model to escape local optima, while merges remove redundancy and improve speed. Using Dirichlet process mixture models for image clustering and denoising, we demonstrate major improvements in robustness and accuracy.

3 0.79740822 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

Author: Jason Chang, John W. Fisher III

Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1

4 0.75549179 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex

Author: Sam Patterson, Yee Whye Teh

Abstract: In this paper we investigate the use of Langevin Monte Carlo methods on the probability simplex and propose a new method, Stochastic gradient Riemannian Langevin dynamics, which is simple to implement and can be applied to large scale data. We apply this method to latent Dirichlet allocation in an online minibatch setting, and demonstrate that it achieves substantial performance improvements over the state of the art online variational Bayesian methods. 1

5 0.73548472 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components

Author: Jeffrey W. Miller, Matthew T. Harrison

Abstract: For data assumed to come from a finite mixture with an unknown number of components, it has become common to use Dirichlet process mixtures (DPMs) not only for density estimation, but also for inferences about the number of components. The typical approach is to use the posterior distribution on the number of clusters — that is, the posterior on the number of components represented in the observed data. However, it turns out that this posterior is not consistent — it does not concentrate at the true number of components. In this note, we give an elementary proof of this inconsistency in what is perhaps the simplest possible setting: a DPM with normal components of unit variance, applied to data from a “mixture” with one standard normal component. Further, we show that this example exhibits severe inconsistency: instead of going to 1, the posterior probability that there is one cluster converges (in probability) to 0. 1

6 0.73143995 317 nips-2013-Streaming Variational Bayes

7 0.66059291 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

8 0.62865359 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

9 0.62018508 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

10 0.58430535 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models

11 0.58369255 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator

12 0.57676113 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models

13 0.57053101 86 nips-2013-Demixing odors - fast inference in olfaction

14 0.56836545 277 nips-2013-Restricting exchangeable nonparametric distributions

15 0.56728029 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization

16 0.55538583 320 nips-2013-Summary Statistics for Partitionings and Feature Allocations

17 0.55104917 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks

18 0.54769063 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models

19 0.54657525 167 nips-2013-Learning the Local Statistics of Optical Flow

20 0.54642951 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.015), (16, 0.101), (33, 0.14), (34, 0.152), (36, 0.032), (41, 0.031), (49, 0.034), (56, 0.113), (70, 0.026), (85, 0.033), (89, 0.032), (93, 0.046), (95, 0.011), (98, 0.168)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.88162029 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms

Author: Yu Zhang

Abstract: All the existing multi-task local learning methods are defined on homogeneous neighborhood which consists of all data points from only one task. In this paper, different from existing methods, we propose local learning methods for multitask classification and regression problems based on heterogeneous neighborhood which is defined on data points from all tasks. Specifically, we extend the k-nearest-neighbor classifier by formulating the decision function for each data point as a weighted voting among the neighbors from all tasks where the weights are task-specific. By defining a regularizer to enforce the task-specific weight matrix to approach a symmetric one, a regularized objective function is proposed and an efficient coordinate descent method is developed to solve it. For regression problems, we extend the kernel regression to multi-task setting in a similar way to the classification case. Experiments on some toy data and real-world datasets demonstrate the effectiveness of our proposed methods. 1

same-paper 2 0.86938965 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

Author: Dahua Lin

Abstract: Reliance on computationally expensive algorithms for inference has been limiting the use of Bayesian nonparametric models in large scale applications. To tackle this problem, we propose a Bayesian learning algorithm for DP mixture models. Instead of following the conventional paradigm – random initialization plus iterative update, we take a progressive approach. Starting with a given prior, our method recursively transforms it into an approximate posterior through sequential variational approximation. In this process, new components will be incorporated on the fly when needed. The algorithm can reliably estimate a DP mixture model in one pass, making it particularly suited for applications with massive data. Experiments on both synthetic data and real datasets demonstrate remarkable improvement in efficiency – orders of magnitude speed-up compared to the state-of-the-art. 1

3 0.83687019 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic

Author: James L. Sharpnack, Akshay Krishnamurthy, Aarti Singh

Abstract: The detection of anomalous activity in graphs is a statistical problem that arises in many applications, such as network surveillance, disease outbreak detection, and activity monitoring in social networks. Beyond its wide applicability, graph structured anomaly detection serves as a case study in the difficulty of balancing computational complexity with statistical power. In this work, we develop from first principles the generalized likelihood ratio test for determining if there is a well connected region of activation over the vertices in the graph in Gaussian noise. Because this test is computationally infeasible, we provide a relaxation, called the Lovász extended scan statistic (LESS), that uses submodularity to approximate the intractable generalized likelihood ratio. We demonstrate a connection between LESS and maximum a-posteriori inference in Markov random fields, which provides us with a poly-time algorithm for LESS. Using electrical network theory, we are able to control type 1 error for LESS and prove conditions under which LESS is risk consistent. Finally, we consider specific graph models, the torus, k-nearest neighbor graphs, and ǫ-random graphs. We show that on these graphs our results provide near-optimal performance by matching our results to known lower bounds. 1

4 0.82089424 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions

Author: Francesca Petralia, Joshua T. Vogelstein, David Dunson

Abstract: Nonparametric estimation of the conditional distribution of a response given high-dimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efficiently to approximately one million features. State of the art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. 1

5 0.81780219 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

Author: Xinghao Pan, Joseph E. Gonzalez, Stefanie Jegelka, Tamara Broderick, Michael Jordan

Abstract: Research on distributed machine learning algorithms has focused primarily on one of two extremes—algorithms that obey strict concurrency constraints or algorithms that obey few or no such constraints. We consider an intermediate alternative in which algorithms optimistically assume that conflicts are unlikely and if conflicts do arise a conflict-resolution protocol is invoked. We view this “optimistic concurrency control” paradigm as particularly appropriate for large-scale machine learning algorithms, particularly in the unsupervised setting. We demonstrate our approach in three problem areas: clustering, feature learning and online facility location. We evaluate our methods via large-scale experiments in a cluster computing environment. 1

6 0.80865705 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

7 0.80606782 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

8 0.79942101 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

9 0.79830408 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling

10 0.79813683 201 nips-2013-Multi-Task Bayesian Optimization

11 0.79813045 86 nips-2013-Demixing odors - fast inference in olfaction

12 0.79777718 350 nips-2013-Wavelets on Graphs via Deep Learning

13 0.79766041 317 nips-2013-Streaming Variational Bayes

14 0.79666805 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors

15 0.79640269 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking

16 0.79568273 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

17 0.79495591 97 nips-2013-Distributed Submodular Maximization: Identifying Representative Elements in Massive Data

18 0.79448873 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

19 0.79412258 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations

20 0.79257184 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series