nips nips2012 nips2012-316 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ke Jiang, Brian Kulis, Michael I. Jordan
Abstract: Sampling and variational inference techniques are two standard methods for inference in probabilistic models, but for many problems, neither approach scales effectively to large-scale data. An alternative is to relax the probabilistic model into a non-probabilistic formulation which has a scalable associated algorithm. This can often be accomplished by performing small-variance asymptotics, i.e., letting the variance of particular distributions in the model go to zero. For instance, in the context of clustering, such an approach yields connections between the k-means and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that feature the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. [sent-12, score-0.739]
2 We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis. [sent-14, score-0.187]
3 While probabilistic approaches—particularly Bayesian models—are flexible from a modeling perspective, lack of scalable inference methods can limit applicability on some data. [sent-16, score-0.146]
4 In some cases, links between probabilistic and non-probabilistic models can be made by applying asymptotics to the variance (or covariance) of distributions within the model. [sent-18, score-0.299]
5 Besides providing a conceptual link between seemingly quite different approaches, small-variance asymptotics can yield useful alternatives to probabilistic models when the data size becomes large, as the non-probabilistic models often exhibit more favorable scaling properties. [sent-20, score-0.299]
6 The use of such techniques to derive scalable algorithms from rich probabilistic models is still emerging, but provides a promising approach to developing scalable learning algorithms. [sent-21, score-0.173]
7 This paper explores such small-variance asymptotics for clustering, focusing on the DP mixture. [sent-22, score-0.25]
8 Existing work has considered asymptotics over the Gaussian DP mixture [3], leading to k-means-like algorithms that do not fix the number of clusters upfront. [sent-23, score-0.577]
9 This approach, while an important first step, raises the question of whether we can perform similar asymptotics over distributions other 1 than the Gaussian. [sent-24, score-0.25]
10 We answer in the affirmative by showing how such asymptotics may be applied to the exponential family distributions for DP mixtures; such analysis opens the door to a new class of scalable clustering algorithms and utilizes connections between Bregman divergences and exponential families. [sent-25, score-1.31]
11 We further extend our approach to hierarchical nonparametric models (specifically, the hierarchical Dirichlet process (HDP) [4]), and we view a major contribution of our analysis to be the development of a general hard clustering algorithm for grouped data. [sent-26, score-0.515]
12 One of the primary advantages of generalizing beyond the Gaussian case is that it opens the door to novel scalable algorithms for discrete-data problems. [sent-27, score-0.25]
13 For instance, visual bag-of-words [5] have become a standard representation for images in a variety of computer vision tasks, but many existing probabilistic models in vision cannot scale to the size of data sets now commonly available. [sent-28, score-0.119]
14 Our analysis covers such problems; for instance, a particular special case of our analysis is a hard version of HDP topic modeling. [sent-32, score-0.253]
15 Related Work: In the non-Bayesian setting, asymptotics for the expectation-maximization algorithm for exponential family distributions were studied in [7]. [sent-34, score-0.535]
16 The authors showed a connection between EM and a general k-means-like algorithm, where the squared Euclidean distance is replaced by the Bregman divergence corresponding to the exponential family distribution of interest. [sent-35, score-0.446]
17 Our results may be viewed as generalizing this approach to the Bayesian nonparametric setting. [sent-36, score-0.147]
18 As discussed above, our results may also be viewed as generalizing the approach of [3], where the asymptotics were performed for the DP mixture with a Gaussian likelihood, leading to a k-means-like algorithm where the number of clusters is not fixed upfront. [sent-37, score-0.662]
19 Note that our setting is considerably more involved than either of these previous works, particularly since we will require an appropriate technique for computing an asymptotic marginal likelihood. [sent-38, score-0.128]
20 Other connections between hard clustering and probabilistic models were explored in [8], which proposes a “Bayesian k-means” algorithm by performing a maximization-expectation algorithm. [sent-39, score-0.432]
21 2 Background In this section, we briefly review exponential family distributions, Bregman divergences, and the Dirichlet process mixture model. [sent-40, score-0.411]
22 1 The Exponential Family Consider the exponential family with natural parameter θ = {θ_j}_{j=1}^d ∈ R^d; then the exponential family probability density function can be written as [9]: p(x | θ) = exp(⟨x, θ⟩ − ψ(θ) − h(x)), where ψ(θ) = log ∫ exp(⟨x, θ⟩ − h(x)) dx is the log-partition function. [sent-42, score-0.662]
23 ψ(θ) can be utilized to compute the mean and covariance of p(x | θ); in particular, the expected value is given by ∇ψ(θ), and the covariance is ∇²ψ(θ). [sent-44, score-0.199]
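As a small numerical illustration of this identity (our own sketch, not from the paper), consider the Poisson family in natural-parameter form, where ψ(θ) = exp(θ); finite differences of ψ recover both the mean and the variance:

    import numpy as np

    # Poisson in natural-parameter form: psi(theta) = exp(theta), so both the
    # mean (first derivative) and variance (second derivative) equal exp(theta).
    psi = np.exp
    theta, eps = 0.7, 1e-4
    mean_fd = (psi(theta + eps) - psi(theta - eps)) / (2 * eps)
    var_fd = (psi(theta + eps) - 2 * psi(theta) + psi(theta - eps)) / eps ** 2
    print(mean_fd, var_fd, np.exp(theta))  # all three agree up to discretization error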
24 A convenient property of the exponential family is that a conjugate prior distribution of θ exists; in particular, given any specific distribution in the exponential family, the conjugate prior can be parametrized as [11]: p(θ | τ, η) = exp(⟨θ, τ⟩ − ηψ(θ) − m(τ, η)). [sent-46, score-0.726]
25 Given a data point xi , the posterior distribution of θ has the same form as the prior, with τ → τ + xi and η → η + 1. [sent-48, score-0.228]
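A minimal sketch of this conjugate update (the function and variable names are ours, purely for illustration): the prior hyperparameters simply absorb each observation.

    import numpy as np

    def conjugate_update(tau, eta, x):
        # Posterior hyperparameters after observing a single data point x.
        return tau + x, eta + 1

    tau, eta = np.zeros(3), 1.0          # prior hyperparameters (tau, eta)
    for x in np.random.rand(5, 3):       # fold in five 3-dimensional observations
        tau, eta = conjugate_update(tau, eta, x)
    print(tau, eta)                      # eta has increased by the number of points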
26 The Bregman divergence for any pair of points x, y ∈ S is defined as D_φ(x, y) = φ(x) − φ(y) − ⟨x − y, ∇φ(y)⟩, and can be viewed as a generalized distortion measure. [sent-50, score-0.225]
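As a concrete illustration of the definition (our own sketch), two standard choices of φ recover the squared Euclidean distance and the discrete KL-divergence:

    import numpy as np

    def bregman_sq_euclidean(x, y):
        # phi(x) = ||x||^2 gives D_phi(x, y) = ||x - y||^2
        return float(np.sum((x - y) ** 2))

    def bregman_kl(x, y):
        # phi(x) = sum_i x_i log x_i on the simplex gives D_phi(x, y) = KL(x || y)
        return float(np.sum(x * np.log(x / y)))

    x = np.array([0.2, 0.3, 0.5])
    y = np.array([0.25, 0.25, 0.5])
    print(bregman_sq_euclidean(x, y), bregman_kl(x, y))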
27 An important result connecting Bregman divergences and exponential families was discussed in [7] (see also [10, 11]), where a bijection between the two was established. [sent-51, score-0.346]
28 The Bregman divergence representation provides a natural way to parametrize an exponential family distribution by its expectation parameter and, as we will see, we will find it convenient to work with this form. [sent-53, score-0.414]
29 2 Dirichlet Process Mixture Models The Dirichlet Process (DP) mixture model is a Bayesian nonparametric mixture model [12]; unlike most parametric mixture models (Bayesian or otherwise), the number of clusters in a DP mixture is not fixed upfront. [sent-55, score-0.767]
30 Using the exponential family parameterized by the expectation µ_c, the likelihood for a data point can be expressed as the following infinite mixture: p(x) = Σ_{c=1}^∞ π_c p(x | µ_c) = Σ_{c=1}^∞ π_c exp(−D_φ(x, µ_c)) f_φ(x). [sent-56, score-0.322]
31 Moreover, a simple collapsed Gibbs sampler can be employed for performing inference in this model [13]; this Gibbs sampler will form the basis of our asymptotic analysis. [sent-58, score-0.392]
32 Given data points x_1, ..., x_n, the state of the Markov chain is the set of cluster indicators z_1, ..., z_n. [sent-62, score-0.252]
33 It also includes the cluster means of the currently-occupied clusters (the mixing weights have been integrated out). [sent-65, score-0.408]
34 If we choose to start a new cluster during the Gibbs update, we sample its mean from the posterior distribution obtained from the prior distribution G0 and the single observation xi . [sent-70, score-0.411]
35 After performing Gibbs moves on the cluster indicators, we update the cluster means µc by sampling from the posterior of µc given the data points assigned to cluster c. [sent-71, score-0.831]
36 Given an exponential family distribution p(x | θ) with natural parameter θ and log-partition function ψ(θ), consider a scaled exponential family distribution whose natural parameter is θ̃ = βθ and whose log-partition function is ψ̃(θ̃) = βψ(θ̃/β), where β > 0. [sent-73, score-0.641]
37 The following result characterizes the relationship between the mean and covariance of the original and scaled exponential family distributions. [sent-74, score-0.444]
38 Given a scaled exponential family with θ̃ = βθ and ψ̃(θ̃) = βψ(θ̃/β), the mean µ(θ̃) of the scaled family equals µ(θ), and its covariance is cov(θ)/β. [sent-78, score-0.389]
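A small numerical check of this lemma (our own illustration, using the unit-variance Gaussian whose log-partition is ψ(θ) = θ²/2): derivatives of the scaled log-partition show the mean unchanged and the covariance divided by β.

    import numpy as np

    beta, theta, eps = 10.0, 1.3, 1e-4
    psi = lambda t: 0.5 * t ** 2                 # unit-variance Gaussian log-partition
    psi_scaled = lambda t: beta * psi(t / beta)  # log-partition of the scaled family
    theta_scaled = beta * theta                  # scaled natural parameter

    mean_scaled = (psi_scaled(theta_scaled + eps) - psi_scaled(theta_scaled - eps)) / (2 * eps)
    var_scaled = (psi_scaled(theta_scaled + eps) - 2 * psi_scaled(theta_scaled)
                  + psi_scaled(theta_scaled - eps)) / eps ** 2

    print(mean_scaled, theta)       # the mean is unchanged by the scaling
    print(var_scaled, 1.0 / beta)   # the covariance is divided by beta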
39 It is perhaps intuitively simpler to observe what happens to the distribution using the Bregman divergence representation. [sent-80, score-0.129]
40 Recall that the generating function φ for the Bregman divergence is given by the Legendre-conjugate of ψ. [sent-81, score-0.129]
41 The Bregman divergence representation for the scaled distribution is given by p(x | θ̃) = p(x | µ̃) = exp(−D_φ̃(x, µ̃)) f_φ̃(x) = exp(−βD_φ(x, µ)) f_{βφ}(x), where the last equality follows from Lemma 3. [sent-83, score-0.2]
42 Next we consider the prior distribution under the scaled exponential family. [sent-86, score-0.258]
43 This gives the following prior written using the Bregman divergence, where we are now explicitly conditioning on β: p(θ̃ | τ, η, β) = exp(−(η/β) D_φ̃((τ/β)/(η/β), µ̃)) g_φ̃(τ/β, η/β) = exp(−η D_φ(τ/η, µ)) g_φ̃(τ/β, η/β). [sent-88, score-0.224]
44 Standard algebraic manipulations yield the following: p(x | τ, η, β) = ∫ p(x | θ̃) p(θ̃ | τ, η, β) dθ̃ = f_φ̃(x) · g_φ̃(τ/β, η/β) · A_(φ̃,τ,η,β)(x) ∫ exp(−(β + η) D_φ((βx + τ)/(β + η), µ(θ))) dθ̃ = f_φ̃(x) · g_φ̃(τ/β, η/β) · A_(φ̃,τ,η,β)(x) · β^d ∫ exp(−(β + η) D_φ((βx + τ)/(β + η), µ(θ))) dθ. (1) [sent-90, score-0.184]
45 Here, A_(φ̃,τ,η,β)(x) = exp(−(βφ(x) + ηφ(τ/η) − (β + η)φ((βx + τ)/(β + η)))), which arises when combining the Bregman divergences from the likelihood and the prior. [sent-91, score-0.296]
46 Note that the exponential term equals one since the divergence inside is 0. [sent-95, score-0.276]
47 1 Asymptotic Behavior of the Gibbs Sampler We now have the tools to consider the Gibbs sampler for the exponential family DP mixture as we let β → ∞. [sent-97, score-0.528]
48 As we will see, we will obtain a general k-means-like hard clustering algorithm which utilizes the appropriate Bregman divergence in place of the squared Euclidean distance, and also can vary the number of clusters. [sent-98, score-0.498]
49 Now, we consider the asymptotic behavior of these probabilities as β → ∞. [sent-100, score-0.132]
50 We note that lim_{β→∞} (βx_i + τ)/(β + η) = x_i, and lim_{β→∞} A_(φ̃,τ,η,β)(x_i) = exp(−η(φ(τ/η) − φ(x_i))), and that the Laplace approximation error term goes to zero as β → ∞. [sent-101, score-0.128]
51 That is, the data point xi will be assigned to the nearest cluster with a divergence at most λ. [sent-110, score-0.468]
52 If the closest mean has a divergence greater than λ, we start a new cluster containing only xi . [sent-111, score-0.466]
53 Next, we show that sampling µc from the posterior distribution is achieved by simply computing the empirical mean of a cluster in the limit. [sent-112, score-0.274]
54 During Gibbs sampling, once we have performed one complete set of Gibbs moves on the cluster assignments, we need to sample the µc conditioned on all assignments and observations. [sent-113, score-0.3]
55 If we let n_c be the number of points assigned to cluster c, then the posterior distribution (parameterized by the expectation parameter) for cluster c is p(µ_c | X, z, τ, η, β) ∝ p(X_c | µ_c, β) × p(µ_c | τ, η, β) ∝ exp(−(βn_c + η) D_φ((β Σ_{i=1}^{n_c} x_i^c + τ)/(βn_c + η), µ_c)), where X is all the data. [sent-114, score-0.819]
56 Here X_c = {x_1^c, ..., x_{n_c}^c} is the set of points currently assigned to cluster c, and z is the set of all current assignments. [sent-117, score-0.361]
57 We can see that the mass of the posterior distribution becomes concentrated around the sample mean (1/n_c) Σ_{i=1}^{n_c} x_i^c as β → ∞. [sent-118, score-0.172]
58 In other words, after we determine the assignments of data points to clusters, we update the means as the sample mean of the data points in each cluster. [sent-119, score-0.281]
59 This is equivalent to the standard k-means cluster mean update step. [sent-120, score-0.268]
60 2 Objective function and algorithm From the above asymptotic analysis of the Gibbs sampler, we obtain a new algorithm which can be utilized for hard clustering. [sent-122, score-0.279]
61 It is as simple as the popular k-means algorithm, but also provides the ability to adapt the number of clusters depending on the data as well as incorporate different distortion measures. [sent-123, score-0.229]
62 Input: x_1, ..., x_n, λ > 0, and µ_1 = (1/n) Σ_{i=1}^n x_i. • Assignment: for each data point x_i, compute the Bregman divergence D_φ(x_i, µ_c) to all existing clusters. [sent-127, score-0.226]
63 If min_c D_φ(x_i, µ_c) ≤ λ, then set z_{i,c_0} = 1 where c_0 = argmin_c D_φ(x_i, µ_c); otherwise, start a new cluster and set z_{i,c_new} = 1. • Mean Update: compute the cluster mean for each cluster, µ_j = (1/|l_j|) Σ_{x ∈ l_j} x, where l_j is the set of points in the j-th cluster. [sent-128, score-0.481]
64 We iterate between the assignment and mean update steps until local convergence. [sent-129, score-0.199]
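The following is a minimal sketch of this procedure (our own illustration; the function name, convergence test, and empty-cluster handling are our choices, not the authors' implementation), shown with the squared Euclidean divergence standing in for a general D_φ:

    import numpy as np

    def bregman_dp_means(X, lam, d_phi, max_iter=100):
        # Hard clustering in the spirit of the asymptotic DP mixture: assign each
        # point to its closest mean under d_phi, opening a new cluster whenever
        # the smallest divergence exceeds lam, then recompute the cluster means.
        means = [X.mean(axis=0)]                       # start from the global mean
        assignments = np.zeros(len(X), dtype=int)
        for _ in range(max_iter):
            for i, x in enumerate(X):
                d = np.array([d_phi(x, mu) for mu in means])
                c = int(np.argmin(d))
                if d[c] > lam:                         # open a new cluster for x
                    means.append(x.copy())
                    c = len(means) - 1
                assignments[i] = c
            counts = np.bincount(assignments, minlength=len(means))
            new_means = [X[assignments == c].mean(axis=0) if counts[c] > 0 else means[c]
                         for c in range(len(means))]
            if all(np.allclose(a, b) for a, b in zip(means, new_means)):
                break
            means = new_means
        return assignments, np.array(means)

    # Example with the squared Euclidean divergence, i.e. the Gaussian case.
    sq_euclid = lambda a, b: float(np.sum((a - b) ** 2))
    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
    z, mus = bregman_dp_means(X, lam=4.0, d_phi=sq_euclid)
    print(len(mus), "clusters found")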
65 Recall that the squared Euclidean distance is the Bregman divergence corresponding to the Gaussian distribution. [sent-133, score-0.161]
66 In the context of clustering, a hierarchical mixture allows one to cluster multiple groups of data: each group is clustered into a set of local clusters, but these local clusters are shared among the groups. [sent-143, score-0.798]
67 That is, sets of local clusters across groups form global clusters, with a shared global mean. [sent-145, score-0.402]
68 For Bayesian nonparametric mixture models, one way of achieving such hierarchies arises via the hierarchical Dirichlet Process (HDP) [4], which provides a nonparametric approach to allow sharing of clusters among a set of DP mixtures. [sent-146, score-0.576]
69 In particular, our approach opens the door to hard hierarchical algorithms over discrete data, such as text, and we briefly discuss an application of our derived algorithm to topic modeling. [sent-150, score-0.5]
70 The HDP model can be viewed as clustering each data set into local clusters, but where each local cluster is associated to a global mean. [sent-156, score-0.574]
71 The variables of the model are the global means (µ_1, ..., µ_g), the associations of data points to local clusters, z_{ij}, and the associations of local clusters to global means, v_{jt}, where t indexes the local clusters for a data set. [sent-161, score-0.758]
72 A standard Gibbs sampler considers updates on all of these variables, and in the nonparametric setting does not fix the number of local or global clusters. [sent-162, score-0.316]
73 As opposed to the flat model, the hard HDP requires two parameters: a value λ_top which is utilized when starting a global (top-level) cluster, and a value λ_bottom which is utilized when starting a local cluster. [sent-164, score-0.349]
74 The resulting hard clustering algorithm first performs local assignment moves on the z_{ij}, then updates the assignments v_{jt} of local clusters to global clusters, and finally updates all global means. [sent-165, score-0.832]
75 The resulting objective function that is monotonically minimized by our algorithm is given as follows: min_{{l_c}_{c=1}^k} Σ_{c=1}^k Σ_{x_ij ∈ l_c} D_φ(x_ij, µ_c) + λ_bottom · t + λ_top · k, (4) where k is the total number of global clusters and t is the total number of local clusters. [sent-166, score-0.352]
76 The bottom-level penalty term λ_bottom controls both the number of local and top-level clusters, where larger λ_bottom tends to give fewer local clusters and more top-level clusters. [sent-167, score-0.339]
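A sketch of how objective (4) could be evaluated for a candidate solution (illustrative only; the data structures and argument names are our assumptions, not the authors' code):

    import numpy as np

    def hard_hdp_objective(X_groups, z_local, v_global, global_means,
                           lam_bottom, lam_top, d_phi):
        # X_groups[j][i]   : data point i in group j
        # z_local[j][i]    : index of the local cluster of point i within group j
        # v_global[j][t]   : global cluster associated with local cluster t of group j
        # global_means[c]  : mean of global cluster c
        fit = 0.0
        for j, Xj in enumerate(X_groups):
            for i, x in enumerate(Xj):
                c = v_global[j][z_local[j][i]]
                fit += d_phi(x, global_means[c])
        t_total = sum(len(v) for v in v_global)   # total number of local clusters
        k_total = len(global_means)               # total number of global clusters
        return fit + lam_bottom * t_total + lam_top * k_total

    # Tiny example: two groups, each with one local cluster tied to global cluster 0.
    sq = lambda a, b: float(np.sum((np.asarray(a) - np.asarray(b)) ** 2))
    X_groups = [[[0.0, 0.0], [1.0, 1.0]], [[0.5, 0.5]]]
    print(hard_hdp_objective(X_groups, [[0, 0], [0]], [[0], [0]],
                             [np.array([0.5, 0.5])],
                             lam_bottom=1.0, lam_top=2.0, d_phi=sq))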
77 Clustering via an asymptotic multinomial DP mixture considerably outperforms the asymptotic Gaussian DP mixture; see text for details. [sent-171, score-0.492]
78 (Right) Elapsed time per iteration in seconds of our topic modeling algorithm when running on the NIPS data, as a function of the number of topics. [sent-172, score-0.16]
79 5 Experiments We conclude with a brief set of experiments highlighting applications of our analysis to discrete-data problems, namely image clustering and topic modeling. [sent-173, score-0.337]
80 Each image is processed via a standard visual-bag-of-words pipeline: SIFT descriptors are densely extracted over image patches in each image, and the resulting SIFT vectors are quantized into 1000 visual words. [sent-178, score-0.115]
81 We explore whether the discrete version of our hard clustering algorithm based on a multinomial DP mixture outperforms the Gaussian mixture version (i.e., the DP-means algorithm of [3]). [sent-181, score-0.686]
82 For both the Gaussian and multinomial cases, we utilize a farthest-first approach for both selecting λ as well as initializing the clusters (see [3] for a discussion of farthest-first for selecting λ). [sent-184, score-0.337]
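One plausible reading of the farthest-first heuristic (a sketch based on the description in [3]; the authors' implementation may differ in details): run a farthest-first traversal for a target number of clusters k and take λ to be the divergence at which the k-th center is opened.

    import numpy as np

    def farthest_first_lambda(X, k, d_phi):
        # Farthest-first traversal: start from the global mean, repeatedly add the
        # point farthest (under d_phi) from the current set of centers, and return
        # the divergence at which the k-th center is opened.
        centers = [X.mean(axis=0)]
        dist = np.array([d_phi(x, centers[0]) for x in X])
        last = float(dist.max())
        for _ in range(k - 1):
            idx = int(np.argmax(dist))
            last = float(dist[idx])
            centers.append(X[idx])
            dist = np.minimum(dist, [d_phi(x, X[idx]) for x in X])
        return last

    sq_euclid = lambda a, b: float(np.sum((a - b) ** 2))
    X = np.random.randn(200, 2)
    lam = farthest_first_lambda(X, k=5, d_phi=sq_euclid)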
83 We compute the normalized mutual information (NMI) between the true clusters and the results of the two algorithms on this difficult data set. [sent-185, score-0.201]
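For reference, NMI between two labelings can be computed with scikit-learn (assuming it is available); NMI is invariant to label permutation:

    from sklearn.metrics import normalized_mutual_info_score

    true_labels = [0, 0, 1, 1, 2, 2]        # ground-truth cluster ids
    predicted_labels = [1, 1, 0, 0, 2, 2]   # algorithm output; label permutation is fine
    print(normalized_mutual_info_score(true_labels, predicted_labels))  # 1.0 here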
84 The hard Gaussian version achieves an NMI of .06 on this data, whereas the hard multinomial version achieves a considerably higher score. [sent-187, score-0.233]
85 Note that comparisons between the Gibbs sampler and the corresponding hard clustering algorithm for the Gaussian case were considered in [3], where experiments on several data sets showed comparable clustering accuracy results between the sampler and the hard clustering method. [sent-191, score-0.997]
86 Furthermore, for a fully Bayesian model that places a prior on the concentration parameter, the sampler was shown to be considerably slower than the corresponding hard clustering method. [sent-192, score-0.487]
87 Given the similarity of the samplers for the Gaussian and multinomial cases, we expect similar behavior with the multinomial Gibbs sampler. [sent-193, score-0.327]
88 We also highlight an application to topic modeling, by providing some qualitative results over two common document collections. [sent-195, score-0.152]
89 Utilizing our general algorithm for a hard version of the multinomial HDP is straightforward. [sent-196, score-0.233]
90 In order to apply the hard hierarchical algorithm to topic modeling, we simply utilize the discrete KL-divergence in the hard exponential family HDP, since topic modeling for text uses a multinomial distribution for the data likelihood. [sent-197, score-1.11]
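A sketch of the divergence computation described here (our own illustration; the smoothing constant is an assumption): each document's normalized word counts are compared to a topic's word distribution with the discrete KL-divergence.

    import numpy as np

    def discrete_kl(p, q, eps=1e-12):
        # KL(p || q) between two nonnegative vectors after smoothing and normalizing;
        # eps guards against zero word counts.
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    doc_counts = [3, 0, 1, 6]               # word counts for one document
    topic_mean = [0.4, 0.1, 0.1, 0.4]       # one topic's word distribution
    print(discrete_kl(doc_counts, topic_mean))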
91 To test topic modeling using our asymptotic approach, we performed analyses using the NIPS 1-12 and the NYTimes [15] datasets. [sent-198, score-0.255]
92 The prevailing metric to measure the goodness of topic models is perplexity; however, this is based on the predictive probability, which has no counterpart in the hard clustering case. [sent-206, score-0.422]
93 Furthermore, ground truth for topic models is difficult to obtain. [sent-207, score-0.125]
94 This makes quantitative comparisons difficult for topic modeling, and we therefore focus on qualitative results. [sent-208, score-0.155]
95 6 Conclusion We considered a general small-variance asymptotic analysis for the exponential family DP and HDP mixture model. [sent-213, score-0.506]
96 Crucially, this analysis allows us to move beyond the Gaussian distribution in such models, and opens the door to new clustering applications, such as those involving discrete data. [sent-214, score-0.338]
97 Our analysis utilizes connections between Bregman divergences and exponential families, and results in a simple and scalable hard clustering algorithm which may be viewed as generalizing existing non-Bayesian Bregman clustering algorithms [7] as well as the DP-means algorithm [3]. [sent-215, score-1.02]
98 We plan to continue to focus on the difficult problem of quantitative evaluations comparing probabilistic and non-probabilistic methods for clustering, particularly for topic models. [sent-217, score-0.204]
99 We also plan to compare our algorithms with recent online inference schemes for topic modeling, particularly the online LDA [16] and online HDP [17] algorithms. [sent-218, score-0.125]
100 Markov chain sampling methods for Dirichlet process mixture models. [sent-299, score-0.126]
wordName wordTfidf (topN-words)
[('bregman', 0.392), ('asymptotics', 0.25), ('hdp', 0.25), ('dp', 0.215), ('cluster', 0.207), ('clusters', 0.201), ('clustering', 0.169), ('divergences', 0.167), ('gibbs', 0.149), ('exponential', 0.147), ('nytimes', 0.143), ('family', 0.138), ('divergence', 0.129), ('dirichlet', 0.129), ('hard', 0.128), ('mixture', 0.126), ('topic', 0.125), ('sampler', 0.117), ('cxi', 0.115), ('multinomial', 0.105), ('nc', 0.105), ('cov', 0.104), ('xi', 0.097), ('asymptotic', 0.095), ('exp', 0.092), ('cnew', 0.086), ('xc', 0.085), ('hierarchical', 0.078), ('lc', 0.074), ('door', 0.072), ('scaled', 0.071), ('zi', 0.069), ('opens', 0.065), ('scalable', 0.062), ('nonparametric', 0.062), ('conjugate', 0.061), ('bayesian', 0.06), ('elephant', 0.057), ('utilized', 0.056), ('covariance', 0.055), ('local', 0.055), ('lj', 0.054), ('global', 0.054), ('connections', 0.053), ('mixtures', 0.053), ('imagenet', 0.051), ('generalizing', 0.051), ('probabilistic', 0.049), ('assignments', 0.047), ('hierarchies', 0.047), ('moves', 0.046), ('indicators', 0.045), ('persian', 0.044), ('gaussian', 0.043), ('image', 0.043), ('country', 0.042), ('nmi', 0.042), ('xij', 0.042), ('prior', 0.04), ('euclidean', 0.04), ('utilizes', 0.04), ('won', 0.038), ('groups', 0.038), ('text', 0.038), ('brie', 0.038), ('likelihood', 0.037), ('probabilities', 0.037), ('modeling', 0.035), ('vision', 0.035), ('associations', 0.035), ('scalability', 0.035), ('assigned', 0.035), ('em', 0.034), ('viewed', 0.034), ('posterior', 0.034), ('points', 0.034), ('public', 0.033), ('zij', 0.033), ('considerably', 0.033), ('performing', 0.033), ('mean', 0.033), ('squared', 0.032), ('kulis', 0.032), ('favors', 0.032), ('families', 0.032), ('discrete', 0.032), ('utilize', 0.031), ('goes', 0.031), ('collapsed', 0.03), ('jordan', 0.03), ('quantitative', 0.03), ('pca', 0.029), ('topics', 0.029), ('processed', 0.029), ('assignment', 0.029), ('update', 0.028), ('controls', 0.028), ('distortion', 0.028), ('updates', 0.028), ('document', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
Author: Ke Jiang, Brian Kulis, Michael I. Jordan
Abstract: Sampling and variational inference techniques are two standard methods for inference in probabilistic models, but for many problems, neither approach scales effectively to large-scale data. An alternative is to relax the probabilistic model into a non-probabilistic formulation which has a scalable associated algorithm. This can often be accomplished by performing small-variance asymptotics, i.e., letting the variance of particular distributions in the model go to zero. For instance, in the context of clustering, such an approach yields connections between the k-means and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that feature the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis. 1
2 0.2781415 328 nips-2012-Submodular-Bregman and the Lovász-Bregman Divergences with Applications
Author: Rishabh Iyer, Jeff A. Bilmes
Abstract: We introduce a class of discrete divergences on sets (equivalently binary vectors) that we call the submodular-Bregman divergences. We consider two kinds, defined either from tight modular upper or tight modular lower bounds of a submodular function. We show that the properties of these divergences are analogous to the (standard continuous) Bregman divergence. We demonstrate how they generalize many useful divergences, including the weighted Hamming distance, squared weighted Hamming, weighted precision, recall, conditional mutual information, and a generalized KL-divergence on sets. We also show that the generalized Bregman divergence on the Lov´ sz extension of a submodular function, which we a call the Lov´ sz-Bregman divergence, is a continuous extension of a submodular a Bregman divergence. We point out a number of applications, and in particular show that a proximal algorithm defined through the submodular Bregman divergence provides a framework for many mirror-descent style algorithms related to submodular function optimization. We also show that a generalization of the k-means algorithm using the Lov´ sz Bregman divergence is natural in clustering scenarios where a ordering is important. A unique property of this algorithm is that computing the mean ordering is extremely efficient unlike other order based distance measures. 1
3 0.27225909 126 nips-2012-FastEx: Hash Clustering with Exponential Families
Author: Amr Ahmed, Sujith Ravi, Alex J. Smola, Shravan M. Narayanamurthy
Abstract: Clustering is a key component in any data analysis toolbox. Despite its importance, scalable algorithms often eschew rich statistical models in favor of simpler descriptions such as k-means clustering. In this paper we present a sampler, capable of estimating mixtures of exponential families. At its heart lies a novel proposal distribution using random projections to achieve high throughput in generating proposals, which is crucial for clustering models with large numbers of clusters. 1
4 0.25134066 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
Author: Chong Wang, David M. Blei
Abstract: We present a truncation-free stochastic variational inference algorithm for Bayesian nonparametric models. While traditional variational inference algorithms require truncations for the model or the variational distribution, our method adapts model complexity on the fly. We studied our method with Dirichlet process mixture models and hierarchical Dirichlet process topic models on two large data sets. Our method performs better than previous stochastic variational inference algorithms. 1
5 0.1754355 69 nips-2012-Clustering Sparse Graphs
Author: Yudong Chen, Sujay Sanghavi, Huan Xu
Abstract: We develop a new algorithm to cluster sparse unweighted graphs – i.e. partition the nodes into disjoint clusters so that there is higher density within clusters, and low across clusters. By sparsity we mean the setting where both the in-cluster and across cluster edge densities are very small, possibly vanishing in the size of the graph. Sparsity makes the problem noisier, and hence more difficult to solve. Any clustering involves a tradeoff between minimizing two kinds of errors: missing edges within clusters and present edges across clusters. Our insight is that in the sparse case, these must be penalized differently. We analyze our algorithm’s performance on the natural, classical and widely studied “planted partition” model (also called the stochastic block model); we show that our algorithm can cluster sparser graphs, and with smaller clusters, than all previous methods. This is seen empirically as well. 1
6 0.17354874 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes
7 0.16629681 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes
8 0.16126262 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set
9 0.13660081 99 nips-2012-Dip-means: an incremental clustering method for estimating the number of clusters
10 0.12581603 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk
11 0.12373737 47 nips-2012-Augment-and-Conquer Negative Binomial Processes
12 0.12331355 294 nips-2012-Repulsive Mixtures
13 0.12244976 251 nips-2012-On Lifting the Gibbs Sampling Algorithm
14 0.11995809 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning
15 0.11969399 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models
16 0.11889536 26 nips-2012-A nonparametric variable clustering model
17 0.11392244 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models
18 0.11378959 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space
19 0.11181091 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors
20 0.10855456 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo
topicId topicWeight
[(0, 0.282), (1, 0.117), (2, -0.045), (3, -0.024), (4, -0.275), (5, -0.074), (6, -0.031), (7, -0.071), (8, 0.161), (9, -0.019), (10, 0.23), (11, -0.116), (12, -0.023), (13, -0.061), (14, 0.142), (15, -0.103), (16, -0.15), (17, -0.044), (18, -0.14), (19, -0.008), (20, -0.03), (21, -0.059), (22, 0.038), (23, 0.083), (24, 0.059), (25, -0.01), (26, -0.065), (27, -0.018), (28, 0.04), (29, 0.063), (30, -0.018), (31, 0.005), (32, -0.025), (33, -0.035), (34, 0.013), (35, 0.06), (36, 0.051), (37, -0.042), (38, 0.015), (39, -0.075), (40, 0.004), (41, -0.046), (42, -0.001), (43, 0.131), (44, 0.038), (45, -0.06), (46, 0.009), (47, 0.063), (48, 0.022), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.95730859 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
Author: Ke Jiang, Brian Kulis, Michael I. Jordan
Abstract: Sampling and variational inference techniques are two standard methods for inference in probabilistic models, but for many problems, neither approach scales effectively to large-scale data. An alternative is to relax the probabilistic model into a non-probabilistic formulation which has a scalable associated algorithm. This can often be accomplished by performing small-variance asymptotics, i.e., letting the variance of particular distributions in the model go to zero. For instance, in the context of clustering, such an approach yields connections between the k-means and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that feature the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis. 1
2 0.79858261 126 nips-2012-FastEx: Hash Clustering with Exponential Families
Author: Amr Ahmed, Sujith Ravi, Alex J. Smola, Shravan M. Narayanamurthy
Abstract: Clustering is a key component in any data analysis toolbox. Despite its importance, scalable algorithms often eschew rich statistical models in favor of simpler descriptions such as k-means clustering. In this paper we present a sampler, capable of estimating mixtures of exponential families. At its heart lies a novel proposal distribution using random projections to achieve high throughput in generating proposals, which is crucial for clustering models with large numbers of clusters. 1
3 0.76380819 26 nips-2012-A nonparametric variable clustering model
Author: Konstantina Palla, Zoubin Ghahramani, David A. Knowles
Abstract: Factor analysis models effectively summarise the covariance structure of high dimensional data, but the solutions are typically hard to interpret. This motivates attempting to find a disjoint partition, i.e. a simple clustering, of observed variables into highly correlated subsets. We introduce a Bayesian non-parametric approach to this problem, and demonstrate advantages over heuristic methods proposed to date. Our Dirichlet process variable clustering (DPVC) model can discover blockdiagonal covariance structures in data. We evaluate our method on both synthetic and gene expression analysis problems. 1
4 0.73138893 99 nips-2012-Dip-means: an incremental clustering method for estimating the number of clusters
Author: Argyris Kalogeratos, Aristidis Likas
Abstract: Learning the number of clusters is a key problem in data clustering. We present dip-means, a novel robust incremental method to learn the number of data clusters that can be used as a wrapper around any iterative clustering algorithm of k-means family. In contrast to many popular methods which make assumptions about the underlying cluster distributions, dip-means only assumes a fundamental cluster property: each cluster to admit a unimodal distribution. The proposed algorithm considers each cluster member as an individual ‘viewer’ and applies a univariate statistic hypothesis test for unimodality (dip-test) on the distribution of distances between the viewer and the cluster members. Important advantages are: i) the unimodality test is applied on univariate distance vectors, ii) it can be directly applied with kernel-based methods, since only the pairwise distances are involved in the computations. Experimental results on artificial and real datasets indicate the effectiveness of our method and its superiority over analogous approaches.
5 0.70552397 251 nips-2012-On Lifting the Gibbs Sampling Algorithm
Author: Deepak Venugopal, Vibhav Gogate
Abstract: First-order probabilistic models combine the power of first-order logic, the de facto tool for handling relational structure, with probabilistic graphical models, the de facto tool for handling uncertainty. Lifted probabilistic inference algorithms for them have been the subject of much recent research. The main idea in these algorithms is to improve the accuracy and scalability of existing graphical models’ inference algorithms by exploiting symmetry in the first-order representation. In this paper, we consider blocked Gibbs sampling, an advanced MCMC scheme, and lift it to the first-order level. We propose to achieve this by partitioning the first-order atoms in the model into a set of disjoint clusters such that exact lifted inference is polynomial in each cluster given an assignment to all other atoms not in the cluster. We propose an approach for constructing the clusters and show how it can be used to trade accuracy with computational complexity in a principled manner. Our experimental evaluation shows that lifted Gibbs sampling is superior to the propositional algorithm in terms of accuracy, scalability and convergence.
6 0.66814804 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
7 0.65788704 294 nips-2012-Repulsive Mixtures
8 0.65520561 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes
9 0.60465616 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models
10 0.59321564 299 nips-2012-Scalable imputation of genetic data with a discrete fragmentation-coagulation process
11 0.59035569 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes
12 0.58116883 69 nips-2012-Clustering Sparse Graphs
13 0.5696106 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set
14 0.54928827 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models
15 0.54829389 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models
16 0.5435366 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk
17 0.53086174 359 nips-2012-Variational Inference for Crowdsourcing
18 0.51415503 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data
19 0.49792403 133 nips-2012-Finding Exemplars from Pairwise Dissimilarities via Simultaneous Sparse Recovery
20 0.4958474 155 nips-2012-Human memory search as a random walk in a semantic network
topicId topicWeight
[(0, 0.074), (17, 0.016), (21, 0.043), (38, 0.126), (39, 0.01), (42, 0.052), (54, 0.025), (55, 0.027), (63, 0.169), (74, 0.067), (76, 0.144), (80, 0.102), (92, 0.073)]
simIndex simValue paperId paperTitle
1 0.92698961 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes
Author: Dahua Lin, John W. Fisher
Abstract: Mixture distributions are often used to model complex data. In this paper, we develop a new method that jointly estimates mixture models over multiple data sets by exploiting the statistical dependencies between them. Specifically, we introduce a set of latent Dirichlet processes as sources of component models (atoms), and for each data set, we construct a nonparametric mixture model by combining sub-sampled versions of the latent DPs. Each mixture model may acquire atoms from different latent DPs, while each atom may be shared by multiple mixtures. This multi-to-multi association distinguishes the proposed method from previous ones that require the model structure to be a tree or a chain, allowing more flexible designs. We also derive a sampling algorithm that jointly infers the model parameters and present experiments on both document analysis and image modeling. 1
2 0.89102435 78 nips-2012-Compressive Sensing MRI with Wavelet Tree Sparsity
Author: Chen Chen, Junzhou Huang
Abstract: In Compressive Sensing Magnetic Resonance Imaging (CS-MRI), one can reconstruct a MR image with good quality from only a small number of measurements. This can significantly reduce MR scanning time. According to structured sparsity theory, the measurements can be further reduced to O(K + log n) for tree-sparse data instead of O(K + K log n) for standard K-sparse data with length n. However, few of existing algorithms have utilized this for CS-MRI, while most of them model the problem with total variation and wavelet sparse regularization. On the other side, some algorithms have been proposed for tree sparse regularization, but few of them have validated the benefit of wavelet tree structure in CS-MRI. In this paper, we propose a fast convex optimization algorithm to improve CS-MRI. Wavelet sparsity, gradient sparsity and tree sparsity are all considered in our model for real MR images. The original complex problem is decomposed into three simpler subproblems then each of the subproblems can be efficiently solved with an iterative scheme. Numerous experiments have been conducted and show that the proposed algorithm outperforms the state-of-the-art CS-MRI algorithms, and gain better reconstructions results on real MR images than general tree based solvers or algorithms. 1
same-paper 3 0.86217368 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
Author: Ke Jiang, Brian Kulis, Michael I. Jordan
Abstract: Sampling and variational inference techniques are two standard methods for inference in probabilistic models, but for many problems, neither approach scales effectively to large-scale data. An alternative is to relax the probabilistic model into a non-probabilistic formulation which has a scalable associated algorithm. This can often be accomplished by performing small-variance asymptotics, i.e., letting the variance of particular distributions in the model go to zero. For instance, in the context of clustering, such an approach yields connections between the k-means and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that feature the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis. 1
4 0.80090505 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
Author: Chong Wang, David M. Blei
Abstract: We present a truncation-free stochastic variational inference algorithm for Bayesian nonparametric models. While traditional variational inference algorithms require truncations for the model or the variational distribution, our method adapts model complexity on the fly. We studied our method with Dirichlet process mixture models and hierarchical Dirichlet process topic models on two large data sets. Our method performs better than previous stochastic variational inference algorithms. 1
5 0.80049574 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
Author: Vasiliy Karasev, Alessandro Chiuso, Stefano Soatto
Abstract: We describe the tradeoff between the performance in a visual recognition problem and the control authority that the agent can exercise on the sensing process. We focus on the problem of “visual search” of an object in an otherwise known and static scene, propose a measure of control authority, and relate it to the expected risk and its proxy (conditional entropy of the posterior density). We show this analytically, as well as empirically by simulation using the simplest known model that captures the phenomenology of image formation, including scaling and occlusions. We show that a “passive” agent given a training set can provide no guarantees on performance beyond what is afforded by the priors, and that an “omnipotent” agent, capable of infinite control authority, can achieve arbitrarily good performance (asymptotically). In between these limiting cases, the tradeoff can be characterized empirically. 1
6 0.79876429 197 nips-2012-Learning with Recursive Perceptual Representations
7 0.79814065 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
8 0.79766017 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes
9 0.79741257 65 nips-2012-Cardinality Restricted Boltzmann Machines
10 0.79668087 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs
11 0.79642773 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics
12 0.7961489 260 nips-2012-Online Sum-Product Computation Over Trees
13 0.79593176 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models
14 0.79545248 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
15 0.79488933 47 nips-2012-Augment-and-Conquer Negative Binomial Processes
16 0.79463601 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes
17 0.79268485 168 nips-2012-Kernel Latent SVM for Visual Recognition
18 0.7926048 292 nips-2012-Regularized Off-Policy TD-Learning
19 0.79023629 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models
20 0.78977787 188 nips-2012-Learning from Distributions via Support Measure Machines