nips nips2012 nips2012-126 knowledge-graph by maker-knowledge-mining

126 nips-2012-FastEx: Hash Clustering with Exponential Families


Source: pdf

Author: Amr Ahmed, Sujith Ravi, Alex J. Smola, Shravan M. Narayanamurthy

Abstract: Clustering is a key component in any data analysis toolbox. Despite its importance, scalable algorithms often eschew rich statistical models in favor of simpler descriptions such as k-means clustering. In this paper we present a sampler, capable of estimating mixtures of exponential families. At its heart lies a novel proposal distribution using random projections to achieve high throughput in generating proposals, which is crucial for clustering models with large numbers of clusters. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper we present a sampler, capable of estimating mixtures of exponential families. [sent-9, score-0.157]

2 At its heart lies a novel proposal distribution using random projections to achieve high throughput in generating proposals, which is crucial for clustering models with large numbers of clusters. [sent-10, score-0.36]

3 1 Introduction: Fast clustering algorithms are a staple of exploratory data analysis. [sent-11, score-0.126]

4 in large scale document analysis, or to provide a modicum of adaptivity to content personalization for a large basis of users [2, 3]. [sent-18, score-0.157]

5 The latter is often ignored but vital to the rationale for using more data — after all, for fixed model complexity there are rapidly diminishing returns afforded by extra data once a given threshold is exceeded. [sent-26, score-0.143]

6 by means of multicore sampling and we need to draw from a large number of clusters. [sent-33, score-0.218]

7 When sampling from many clusters, the time to compute the object likelihood with respect to all clusters dominates the inference procedure. [sent-34, score-0.476]

8 For instance, for 1000 clusters and documents of 1000 words a naive sampler needs to perform 10^6 floating point operations. [sent-35, score-0.461]

9 Given 10M documents this amounts to approximately 3 hours for a single Gibbs sampling iteration, which is clearly infeasible: sampling requires hundreds… (∗ This work was carried out while AA, SR, SMN and AJS were with Yahoo Research.) [sent-37, score-0.295]

10 To alleviate this issue we use binary hashing to compute a fast proposal distribution. [sent-40, score-0.293]

11 2 Mixtures of Exponential Families: Our models are mixtures of exponential families due to their flexibility. [sent-41, score-0.27]

12 For convenience we focus on mixtures of multinomials with correspondingly conjugate Dirichlet distributions. [sent-43, score-0.287]

13 In exponential families, distributions over random variables are given by p(x; θ) = exp(⟨φ(x), θ⟩ − g(θ)). [sent-54, score-0.208]

14 Empirical averages and probability estimates are directly connected via p(x; θ) = n_x/m = e^(θ_x). [sent-67, score-0.148]

15 Here n_x denotes the number of times we observe x. [sent-68, score-0.148]

16 (4) Here the conjugate prior itself is a member of the exponential family with sufficient statistic φ(θ) = (θ, −g(θ)) and with natural parameters (m0 , m0 µ0 ). [sent-75, score-0.228]

17 Finally, h(m0 , m0 µ0 ) is a log-partition function in the parameters of the conjugate prior. [sent-79, score-0.133]

18 Normalization in (4) implies p(θ|X) ∝ p(X|θ) p(θ|m0, m0 µ0) ⇒ p(θ|X) = p(θ | m0 + m, m0 µ0 + m µ[X]) (5). We simply add the virtual observations m0 µ0 described by the conjugate prior to the actual observations X and compute the maximum likelihood estimate with respect to the augmented dataset. [sent-81, score-0.182]

19 This yields the smoothed estimates for event probabilities in x: p(x; θ) = (n_x + m0 [µ0]_x)/(m + m0) and equivalently θ_x = log((n_x + m0 [µ0]_x)/(m + m0)). [sent-83, score-0.296]
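A minimal Python sketch of this Dirichlet-multinomial smoothing (the function name smoothed_multinomial and the toy counts are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def smoothed_multinomial(counts, m0, mu0):
    """Smoothed event probabilities and natural parameters for one cluster.

    counts -- observed event counts n_x over the vocabulary
    m0, mu0 -- strength and mean of the conjugate Dirichlet prior
    """
    m = counts.sum()
    p = (counts + m0 * mu0) / (m + m0)  # smoothed probability estimates
    theta = np.log(p)                   # natural parameters theta_x = log p(x; theta)
    return p, theta

# toy usage: 5-word vocabulary, uniform prior mean with strength m0 = 2
p, theta = smoothed_multinomial(np.array([3.0, 0.0, 1.0, 0.0, 6.0]),
                                m0=2.0, mu0=np.full(5, 0.2))
```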

20 For the sake of brevity and to ensure computational tractability (we need to limit the time it takes to sample from the cluster distribution for a given instance) we limit ourselves to a Dirichlet-Multinomial model with k components: • Draw discrete mixture p(y|θ) with y ∈ {1, …, k}. [sent-87, score-0.22]

21 • For each component k draw exponential families distribution p(x|θ_y) from conjugate with parameters (m0^component, µ0^component). [sent-91, score-0.412]

22 • For each i first draw component y_i ∼ p(y|θ), then draw observation x_i ∼ p(x|θ_{y_i}). [sent-92, score-0.285]

23 Note that we have two exponential families components here — a smoothed multinomial to capture cluster membership, i. [sent-93, score-0.474]

24 y ∼ p(y|θ) and one pertaining to the cluster distribution. [sent-95, score-0.179]
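A sketch of that generative process in Python (parameter names such as m0_mix and m0_comp stand in for the prior strengths above, and the fixed document length is an assumption of this illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_corpus(k, vocab_size, n_docs, doc_len, m0_mix=1.0, m0_comp=0.1):
    """Draw a toy corpus from a Dirichlet-Multinomial mixture with k components."""
    mix = rng.dirichlet(np.full(k, m0_mix))                  # p(y | theta)
    comps = rng.dirichlet(np.full(vocab_size, m0_comp), k)   # p(x | theta_y) for each y
    docs, labels = [], []
    for _ in range(n_docs):
        y = rng.choice(k, p=mix)                             # y_i ~ p(y | theta)
        docs.append(rng.multinomial(doc_len, comps[y]))      # x_i ~ p(x | theta_{y_i})
        labels.append(y)
    return np.array(docs), np.array(labels)

docs, labels = sample_corpus(k=3, vocab_size=50, n_docs=10, doc_len=40)
```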

25 For large numbers, however, Gibbs sampling of the collapsed likelihood is computationally more advantageous since it only requires updates of the sufficient statistics of two clusters per sample, whereas EM necessitates an update of all clusters. [sent-98, score-0.618]

26 For a given x_i draw y_i ∼ p(y_i | X, Y^(−i)) ∝ p(y_i | Y) · p(x_i | y_i, X^(−i), Y^(−i)). [sent-100, score-0.214]
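For contrast with the accelerated scheme developed below, a naive collapsed Gibbs step looks roughly as follows; cond_loglik and the smoothing constant alpha are placeholders standing in for the paper's collapsed conditional, so this is only a sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def collapsed_gibbs_step(i, x_i, assign, cluster_sizes, cond_loglik, alpha=1.0):
    """Resample the cluster of document i by scoring *all* k clusters;
    this O(k) scan is exactly the cost the hash-based proposal avoids.
    cond_loglik(x_i, y) is assumed to already exclude document i from cluster y."""
    cluster_sizes[assign[i]] -= 1  # remove document i from its current cluster
    log_p = np.array([np.log(cluster_sizes[y] + alpha) + cond_loglik(x_i, y)
                      for y in range(len(cluster_sizes))])
    p = np.exp(log_p - log_p.max())
    y_new = rng.choice(len(p), p=p / p.sum())
    cluster_sizes[y_new] += 1
    assign[i] = y_new
    return y_new
```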

27 We now show how this step can be accelerated significantly using a good proposal distribution, parallel sampling, and a Taylor expansion for general exponential families. [sent-104, score-0.369]

28 Acceleration (Taylor Approximation for Collapsed Inference): Let us briefly review the key equations involved in collapsed inference. [sent-106, score-0.162]

29 We exploit the properties of the log-partition function h of the conjugate prior for an approximation: ∂_(m0, m0 µ0) h(m0, m0 µ0) = E_{θ∼p(θ|m0, m0 µ0)}[(−g(θ), θ)] =: (−γ*, θ*), hence h(m0 + 1, m0 µ0 + φ(x)) ≈ h(m0, m0 µ0) + ⟨θ*, φ(x)⟩ − γ*. [sent-117, score-0.133]

30 Applying the Taylor expansion in h to (7) yields an approximation of x|X as p(x|X, m0, m0 µ0) ≈ exp(⟨φ(x), θ*⟩ − g(θ*)) (11). Here the normalization g(θ*) is an immediate consequence of the fact that this is a member of the exponential family. [sent-120, score-0.174]

31 Lemma 1: The expected parameter θ* = E_{θ∼p(θ|X)}[θ] induces at most O(m^(−1)) sampler error. [sent-123, score-0.151]

32 Hence, (11) explains why updates obtained in collapsed inference often resemble (or are identical to) a maximum-a-posteriori estimate obtained by conjugate priors, such as in Dirichlet-multinomial smoothing. [sent-127, score-0.337]

33 Locality Sensitive Importance Sampling: The next step is to accelerate the inner product ⟨φ(x), θ_y^*⟩ in (11), since this expression is evaluated k times at each Gibbs sampler step. [sent-130, score-0.263]

34 This provides a good approximation and therefore a proposal distribution that can be used in a Metropolis-Hastings scheme without an excessive rejection rate. [sent-133, score-0.236]

35 To provide some motivation consider metric-based clustering algorithms such as k-means. [sent-134, score-0.126]

36 They do not suffer greatly from dealing with large numbers of clusters — after all, we only need to find the closest prototype. [sent-135, score-0.287]

37 In a nutshell it involves transforming the set of cluster centers into a data structure that is only dependent on the inherent dimensionality of the data rather than the number of objects or the dimensionality of the actual data vector. [sent-137, score-0.179]

38 The problem with sampling from the collapsed distribution is that for a proper sampler we need to consider all cluster probabilities including those related to clusters which are highly implausible and unlikely to be chosen for a given instance. [sent-138, score-0.838]

39 Instead, we design a sampler which typically will only explore the clusters which are sufficiently close to the “best” matching cluster by means of a proposal distribution. [sent-141, score-0.773]

40 Since exponential families rely on inner products to determine the log-likelihood of how well the data fits, we can use hashing to accelerate the expensive part considerably, namely comparing data with clusters. [sent-144, score-0.418]

41 More specifically, ⟨u, v⟩ = ‖u‖ · ‖v‖ · cos ∠(u, v) allows us to store the signature of a vector in terms of its signs and its norm to estimate the inner product efficiently. [sent-145, score-0.198]

42 Definition 3: We denote by h^l(v) ∈ {0, 1}^l a binary hash of v and by z^l(u, v) an estimate of the probability of matching signs, obtained as follows: [h^l(v)]_i := sgn[⟨v, w_i⟩] where the w_i ∼ U_m are fixed, and z^l(u, v) := (1/l) ‖h^l(u) − h^l(v)‖_1. [sent-146, score-0.465]

43 (13) That is, z^l(u, v) measures how many bits differ between the hash vectors h(u) and h(v) associated with u, v. [sent-147, score-0.464]

44 In this case we may estimate the unnormalized log-likelihood of an instance being assigned to a cluster via s^l(x, y) = ‖θ_y‖ ‖φ(x)‖ cos(π z^l(φ(x), θ_y)) − g(θ_y) − log n_y (14). We omitted the normalization log n of the cluster probability since it is identical for all components. [sent-148, score-0.531]

45 The binary representation is significant since on modern CPUs computing the Hamming distance between h(u) and h(v) via z^l(u, v) can be achieved in a fraction of a single clock cycle by means of a vectorized instruction set. [sent-150, score-0.104]
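The following sketch shows the idea with explicit Gaussian random hyperplanes and NumPy bit operations; the paper instead generates its random directions implicitly via hash functions, so the matrix W, the Gaussian choice, and the function names are assumptions of this illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def signed_hash(v, W):
    """h^l(v): sign pattern of v against the l random directions in W, packed into bytes."""
    return np.packbits(W @ v > 0)

def hamming_fraction(hu, hv, l):
    """z^l(u, v): fraction of sign bits that differ between two packed signatures."""
    return np.unpackbits(np.bitwise_xor(hu, hv))[:l].sum() / l

def inner_product_estimate(u, v, hu, hv, l):
    """Estimate <u, v> ~ ||u|| * ||v|| * cos(pi * z^l(u, v))."""
    return np.linalg.norm(u) * np.linalg.norm(v) * np.cos(np.pi * hamming_fraction(hu, hv, l))

D, l = 1000, 128
W = rng.standard_normal((l, D))
u, v = rng.standard_normal(D), rng.standard_normal(D)
print(inner_product_estimate(u, v, signed_hash(u, W), signed_hash(v, W), l), u @ v)
```

In an optimized sampler the packed signatures are compared with XOR and a population-count instruction, which is where the sub-clock-cycle Hamming distances mentioned above come from.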

46 This is supported by current generation ARM and Intel CPU cores and by AMD and Nvidia GPUs (for instance Intel’s SandyBridge series of processors can process up to 256 bits in one clock cycle per core) and easily accessible via compiler optimization. [sent-151, score-0.268]

47 Error Guarantees: Note, though, that s^l(x, y) is not accurate, since we only use an estimate of the inner product. [sent-153, score-0.134]

48 Theorem 4: Given k ∈ N mixture components, let l be the number of bits used for hashing. [sent-156, score-0.182]

49 Then the unnormalized cluster log-likelihood is bounded with probability at least 1 − δ by s̄^l(x, y) = ‖θ_y‖ ‖φ(x)‖ cos(π max(0, z^l(φ(x), θ_y) − √((log k/δ)/2l))) − g(θ_y) − log n_y (15). Proof. [sent-157, score-0.312]

50 For convenience denote by z^∞(φ(x), θ_y) the expected value of z^l(φ(x), θ_y) over all hash functions. [sent-160, score-0.36]

51 Since we know that z^∞(φ(x), θ_y) ≥ 0 we can bound it for all k clusters with probability δ by taking the union bound over all events with δ/k probability. [sent-162, score-0.248]

52 Remark 5: Using 128 hash bits and with a failure probability of at most 10^(−4) for k = 10^4 clusters the correction applied to z^l(x, z) is less than 0. [sent-163, score-0.712]
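A back-of-the-envelope version of that correction, using a standard Hoeffding bound together with a union bound over the k clusters (the exact constant in the paper's Theorem 4 may differ; this only illustrates the style of argument):

```python
import math

def hash_correction(k, l, delta):
    """Hoeffding-style deviation for the bit-agreement estimate z^l: with probability
    at least 1 - delta, simultaneously for all k clusters,
    |z^l - E[z^l]| <= sqrt(log(2 * k / delta) / (2 * l))."""
    return math.sqrt(math.log(2 * k / delta) / (2 * l))

# the setting of Remark 5: l = 128 bits, k = 10**4 clusters, failure probability 10**-4
print(hash_correction(k=10**4, l=128, delta=1e-4))  # about 0.27 under this bound
```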

53 Secondly, we use hashing to generate a proposal distribution: once we select a particular cluster we verify the estimate using the true likelihood. [sent-166, score-0.472]

54 Metropolis Hastings: As an alternative to using the approximate upper bound directly, we employ it as a proposal distribution in a Metropolis-Hastings (MH) framework. [sent-168, score-0.195]

55 Denote by q the proposal distribution constructed from the bound on the log-likelihood after normalization. [sent-169, score-0.195]

56 For a given x_i we first sample a new cluster assignment y_i^new ∼ q(·) [sent-170, score-0.322]

57 and then accept the proposal using (15) with probability r, where q(y) ∝ exp(s̄^l(x, y)) and r = [q(y_i^old) · p(y_i^new) · p(x_i | X_{y_i^new}, m0, µ0)] / [q(y_i^new) · p(y_i^old) · p(x_i | X_{y_i^old}, m0, µ0)] (17). Here p(x_i | X, m0, µ0) is the true collapsed conditional likelihood of (8). [sent-171, score-0.658]

58 Note that for a standard collapsed Gibbs sampler, p(x|X, µ0, m0) would be computed for all k candidate clusters; in our framework, however, we only need to compute it for 2 clusters, the proposed and the old one: an O(k) time saving per sample, albeit with a nontrivial rejection probability. [sent-175, score-0.336]
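A sketch of that accept/reject step; exact_loglik, log_prior, and the array of hash-based scores are placeholders supplied by the surrounding sampler, so this is an illustration of the scheme rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

def mh_cluster_step(y_old, approx_scores, exact_loglik, log_prior):
    """One Metropolis-Hastings update of a single cluster assignment.

    approx_scores -- hash-based unnormalized log-likelihoods s^l(x, y), one per cluster
    exact_loglik(y) -- true collapsed log p(x | X_y, m0, mu0), evaluated lazily
    log_prior[y] -- log p(y | Y), e.g. log of the (smoothed) cluster size
    """
    q = np.exp(approx_scores - approx_scores.max())   # proposal q(y) ~ exp(s^l(x, y))
    q /= q.sum()
    y_new = rng.choice(len(q), p=q)
    # the acceptance ratio needs the exact likelihood for only the two clusters involved
    log_r = (log_prior[y_new] + exact_loglik(y_new) + np.log(q[y_old])
             - log_prior[y_old] - exact_loglik(y_old) - np.log(q[y_new]))
    return y_new if np.log(rng.random()) < min(0.0, log_r) else y_old
```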

59 Example 3: For discrete distributions the conjugate is the Dirichlet distribution Dir(α_{1:d}) with components given by α_j = m0 [µ0]_j and the sum of the components given by m0, where j ∈ {1, …, d}. [sent-176, score-0.133]

60 1 We have the following predictive posterior p(xi |X, yi , µ0 , m0 ) = 3. [sent-180, score-0.101]

61 p(x_i | X, y_i, µ0, m0) = [Γ(Σ_d (n_{y_i}^d + α_d)) / Γ(Σ_d (x_i^d + n_{y_i}^d + α_d))] · Π_{d=1}^{D} Γ(x_i^d + n_{y_i}^d + α_d) / Γ(n_{y_i}^d + α_d) (18). Updating the Sufficient Statistics: We conclude our discussion of past proposals by discussing the updates involved in the sufficient statistics. [sent-182, score-0.22]

62 it is the sufficient statistic obtained from X by considering only instances for which yi = y. [sent-188, score-0.101]

63 This is then used to update the natural parameter θy and the hash representation h(θy ). [sent-190, score-0.384]

64 For l bits a naive update would perform the dot-product between the mean natural parameters and each random vector, which scales as O(Dl), where D is the vocabulary size. [sent-196, score-0.251]

65 However we can cache the l dot product values (as floating point numbers) for each cluster and update only those dot product values. [sent-197, score-0.24]
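A sketch of that incremental signature refresh; the sparse set of changed coordinates and the explicit matrix W of random directions are assumptions of this illustration (the system described above regenerates the directions from hash functions rather than storing W):

```python
import numpy as np

def refresh_signature(dots_y, W, changed_idx, delta_vals):
    """Update the cached projections <w_i, theta_y> after a sparse change to the
    hashed cluster vector theta_y, then re-pack the l sign bits.

    dots_y -- cached dot products for cluster y, shape (l,); updated in place
    W -- l x D matrix of random directions w_i
    changed_idx -- coordinates of theta_y that changed (e.g. the words of one document)
    delta_vals -- new minus old value of theta_y on those coordinates
    """
    dots_y += W[:, changed_idx] @ delta_vals  # O(l * |changed|) rather than O(l * D)
    return np.packbits(dots_y > 0)            # fresh l-bit signature for cluster y
```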

66 Note that we never need to store the random vectors w since we can compute them on the fly by means of hash functions rather than invoking a random number generator. [sent-199, score-0.361]

67 We use murmurhash as a fast and high quality hash function. [sent-200, score-0.38]
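To avoid storing the random directions at all, each coordinate's sign can be derived from a hash of its indices; the paper uses murmurhash, whereas this minimal sketch uses Python's built-in BLAKE2 purely for portability:

```python
import hashlib

def random_sign(i, j, salt=b"w"):
    """Deterministic pseudo-random sign of coordinate j of random direction w_i,
    computed on the fly from a hash instead of a stored random vector."""
    key = salt + i.to_bytes(4, "little") + j.to_bytes(4, "little")
    return 1.0 if hashlib.blake2b(key, digest_size=1).digest()[0] & 1 else -1.0

# the i-th random projection of a sparse vector x given as {coordinate: value}
def project(i, x):
    return sum(random_sign(i, j) * v for j, v in x.items())
```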

68 More specifically, we extracted the articles and category attributes from a dump of its database. [sent-203, score-0.113]

69 We generated multiple datasets for our experiments by first sampling a set of k categories and then by pooling all the articles from the chosen categories to form a document collection. [sent-204, score-0.382]

70 This way the data was comparable and the apparent and desired diversity in terms of cluster sizes was matched. [sent-205, score-0.179]

71 We extracted both 100 and 1000 categories, yielding the following datasets: W100 (100 clusters, 292k articles, 2.5M unique words vocabulary) [sent-206, score-0.361]

72 and W1000 (1000 clusters, 710k articles, 5.6M unique words vocabulary). [sent-207, score-0.41]

73 We compare our fast sampler to a more conventional uncollapsed inference procedure. [sent-208, score-0.31]

74 It uses an uncollapsed likelihood and alternates between sampling cluster assignments and drawing from the Dirichlet distribution of the posterior. [sent-210, score-0.394]

75 (Figure 1 caption, right panel) The effect of the hash size on performance. [sent-216, score-0.323]

76 Unless stated otherwise we use l = 32 bits to represent a document and a cluster. [sent-219, score-0.207]

77 This choice was made since it provides an efficient trade-off between memory usage and cost to compute the hash signatures. [sent-220, score-0.323]

78 Evaluation: For each clustering method, we report results in terms of two different measures: efficiency and clustering quality. [sent-222, score-0.252]

79 Suppose we have two clusterings (partitions of a document set into several subsets) C1 and C2; then VI(C1, C2) = H(C1) + H(C2) − 2 I(C1, C2) (19), where H(. [sent-227, score-0.138]
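A direct implementation of that variation-of-information score from two label vectors (a standard computation written here from the definition above, not taken from the paper's evaluation code):

```python
import numpy as np

def variation_of_information(c1, c2):
    """VI(C1, C2) = H(C1) + H(C2) - 2 I(C1, C2) for clusterings given as label arrays."""
    c1, c2 = np.asarray(c1), np.asarray(c2)
    joint = np.zeros((c1.max() + 1, c2.max() + 1))
    np.add.at(joint, (c1, c2), 1.0)      # joint distribution over cluster pairs
    joint /= len(c1)
    p1, p2 = joint.sum(axis=1), joint.sum(axis=0)
    h1 = -np.sum(p1[p1 > 0] * np.log(p1[p1 > 0]))
    h2 = -np.sum(p2[p2 > 0] * np.log(p2[p2 > 0]))
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(p1, p2)[nz]))
    return h1 + h2 - 2 * mi

print(variation_of_information([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions: 0.0
```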

80 As evident from this Figure, our method both converges much faster than the baseline and achieves the same clustering quality. [sent-233, score-0.236]

81 Figure 1 also displays the effect of the number of hash bits l on solution quality. [sent-234, score-0.464]

82 As evident from the figure, increasing the number of bits caused our method to converge faster due to a tighter bound on the log-likelihood and thus a higher acceptance ratio. [sent-236, score-0.217]

83 We also observe that beyond 64 to 128 bits there is no noticeable improvement, as predicted by our theoretical guarantees. [sent-237, score-0.141]

84 As shown in this Table, thanks to the fast instruction-set support for XOR and bitcount operations on modern processors, the time to compute the proposal does not increase significantly as we increase the number of clusters, and the overall time increases only modestly. [sent-239, score-0.556]

Table 1: Average time in microseconds spent per document for hash sampling, in terms of computing the proposal distribution and total computation time. [sent-261, score-0.713]

86 As can be seen, the total computation time for sampling 10x more clusters only increases slightly, mostly due to the increase in proposal time. [sent-262, score-0.541]

Table 2: Clustering quality (VI) and absolute speedup achieved by hash sampling over the baseline (DP) clustering for different Wikipedia datasets. [sent-269, score-0.676]

88 Due to the high quality of the proposals, the time to draw from 1000 rather than 100 clusters increases only slightly. [sent-271, score-0.429]

89 5 Discussion and Future Work: We presented a new efficient parallel algorithm to perform scalable clustering for exponential families. [sent-272, score-0.307]

90 It is general and uses techniques from hashing and information retrieval to circumvent the problem of large numbers of clusters. [sent-273, score-0.174]

91 Future work includes the application to a larger range of exponential family models and the extension of the fast retrieval scheme to hierarchical clustering. [sent-274, score-0.132]

92 Parallelization: So far we only described a single-processor sampling procedure. [sent-275, score-0.141]

93 To address the problem within a single machine we use a multicore sampler to parallelize inference. [sent-277, score-0.246]

94 This requires a small amount of approximation — rather than sampling p(y_i | x_i, X^(−i), Y^(−i)) in sequence we sample up to c latent variables y_i in parallel in c processor cores. [sent-278, score-0.282]

95 The latter approximation is negligible since c is tiny compared to the total number of documents we have. [sent-279, score-0.109]

96 (Architecture diagram labels: updater, writer, disk, sampler.) A key advantage is that all samplers share the same sufficient statistics regardless of the number of cores used. [sent-285, score-0.313]

97 By delegating write permissions to a separate updater thread the code is considerably simplified. [sent-286, score-0.107]
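A structural sketch of that design; sample_one and apply_delta are placeholders for the hash-based sampler and the sufficient-statistics update, and a production system would use native threads rather than Python's GIL-bound ones:

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def parallel_sweep(n_docs, assign, sample_one, apply_delta, n_workers=4):
    """Sampler threads propose assignments against slightly stale shared state;
    a single updater thread owns write access to the sufficient statistics."""
    updates = queue.Queue()

    def updater():
        while True:
            item = updates.get()
            if item is None:                      # poison pill: sweep finished
                return
            i, y_new = item
            apply_delta(i, assign[i], y_new)      # move document i's statistics
            assign[i] = y_new

    writer = threading.Thread(target=updater)
    writer.start()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for i, y_new in enumerate(pool.map(sample_one, range(n_docs))):
            updates.put((i, y_new))
    updates.put(None)
    writer.join()
```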

98 Sequential Estimation: Our approach is compatible with sequential estimation methods, and it is possible to use hash signatures for Sequential Monte Carlo estimation for clustering, as in [21, 22]. [sent-290, score-0.449]

99 Again, sampling is the dominant cost for inference and it can be accelerated by binary hashing. [sent-293, score-0.14]

100 Correlation clustering — minimizing disagreements on arbitrary weighted graphs. [sent-318, score-0.126]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hash', 0.323), ('fastex', 0.306), ('clusters', 0.248), ('proposal', 0.195), ('cluster', 0.179), ('collapsed', 0.162), ('sampler', 0.151), ('nx', 0.148), ('bits', 0.141), ('conjugate', 0.133), ('clustering', 0.126), ('families', 0.113), ('articles', 0.113), ('bit', 0.11), ('yi', 0.101), ('hashing', 0.098), ('dirichlet', 0.098), ('sampling', 0.098), ('document', 0.097), ('exponential', 0.095), ('ahmed', 0.091), ('multinomial', 0.087), ('vi', 0.086), ('old', 0.084), ('sl', 0.081), ('gibbs', 0.073), ('baseline', 0.072), ('draw', 0.071), ('narayanamurthy', 0.068), ('uncollapsed', 0.068), ('updater', 0.068), ('ex', 0.067), ('nyi', 0.066), ('mountain', 0.064), ('wikipedia', 0.064), ('smola', 0.063), ('mixtures', 0.062), ('sgn', 0.062), ('documents', 0.062), ('update', 0.061), ('personalization', 0.06), ('instruction', 0.06), ('accelerate', 0.059), ('suf', 0.058), ('quality', 0.057), ('taylor', 0.055), ('disk', 0.055), ('eisenstein', 0.055), ('multinomials', 0.055), ('oating', 0.055), ('signs', 0.055), ('proposals', 0.053), ('inner', 0.053), ('google', 0.052), ('cos', 0.052), ('chernoff', 0.049), ('multicore', 0.049), ('vital', 0.049), ('likelihood', 0.049), ('vocabulary', 0.049), ('nontrivial', 0.049), ('eh', 0.047), ('hastings', 0.047), ('afforded', 0.047), ('latter', 0.047), ('scalable', 0.046), ('parallelize', 0.046), ('nishes', 0.046), ('vldb', 0.044), ('metropolis', 0.044), ('processors', 0.044), ('clock', 0.044), ('processor', 0.043), ('streaming', 0.043), ('inference', 0.042), ('xi', 0.042), ('rejection', 0.041), ('clusterings', 0.041), ('mixture', 0.041), ('ca', 0.04), ('hl', 0.04), ('gold', 0.04), ('ho', 0.04), ('parallel', 0.04), ('normalization', 0.04), ('dominates', 0.039), ('cores', 0.039), ('considerably', 0.039), ('numbers', 0.039), ('expansion', 0.039), ('store', 0.038), ('poisson', 0.038), ('acceptance', 0.038), ('evident', 0.038), ('amounts', 0.037), ('convenience', 0.037), ('web', 0.037), ('categories', 0.037), ('gaussians', 0.037), ('retrieval', 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.000001 126 nips-2012-FastEx: Hash Clustering with Exponential Families

Author: Amr Ahmed, Sujith Ravi, Alex J. Smola, Shravan M. Narayanamurthy

Abstract: Clustering is a key component in any data analysis toolbox. Despite its importance, scalable algorithms often eschew rich statistical models in favor of simpler descriptions such as k-means clustering. In this paper we present a sampler, capable of estimating mixtures of exponential families. At its heart lies a novel proposal distribution using random projections to achieve high throughput in generating proposals, which is crucial for clustering models with large numbers of clusters. 1

2 0.27225909 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

Author: Ke Jiang, Brian Kulis, Michael I. Jordan

Abstract: Sampling and variational inference techniques are two standard methods for inference in probabilistic models, but for many problems, neither approach scales effectively to large-scale data. An alternative is to relax the probabilistic model into a non-probabilistic formulation which has a scalable associated algorithm. This can often be fulfilled by performing small-variance asymptotics, i.e., letting the variance of particular distributions in the model go to zero. For instance, in the context of clustering, such an approach yields connections between the kmeans and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that features the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis. 1

3 0.20394634 313 nips-2012-Sketch-Based Linear Value Function Approximation

Author: Marc Bellemare, Joel Veness, Michael Bowling

Abstract: Hashing is a common method to reduce large, potentially infinite feature vectors to a fixed-size table. In reinforcement learning, hashing is often used in conjunction with tile coding to represent states in continuous spaces. Hashing is also a promising approach to value function approximation in large discrete domains such as Go and Hearts, where feature vectors can be constructed by exhaustively combining a set of atomic features. Unfortunately, the typical use of hashing in value function approximation results in biased value estimates due to the possibility of collisions. Recent work in data stream summaries has led to the development of the tug-of-war sketch, an unbiased estimator for approximating inner products. Our work investigates the application of this new data structure to linear value function approximation. Although in the reinforcement learning setting the use of the tug-of-war sketch leads to biased value estimates, we show that this bias can be orders of magnitude less than that of standard hashing. We provide empirical results on two RL benchmark domains and fifty-five Atari 2600 games to highlight the superior learning performance obtained when using tug-of-war hashing. 1

4 0.18809439 148 nips-2012-Hamming Distance Metric Learning

Author: Mohammad Norouzi, David M. Blei, Ruslan Salakhutdinov

Abstract: Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes. 1

5 0.18525936 69 nips-2012-Clustering Sparse Graphs

Author: Yudong Chen, Sujay Sanghavi, Huan Xu

Abstract: We develop a new algorithm to cluster sparse unweighted graphs – i.e. partition the nodes into disjoint clusters so that there is higher density within clusters, and low across clusters. By sparsity we mean the setting where both the in-cluster and across cluster edge densities are very small, possibly vanishing in the size of the graph. Sparsity makes the problem noisier, and hence more difficult to solve. Any clustering involves a tradeoff between minimizing two kinds of errors: missing edges within clusters and present edges across clusters. Our insight is that in the sparse case, these must be penalized differently. We analyze our algorithm’s performance on the natural, classical and widely studied “planted partition” model (also called the stochastic block model); we show that our algorithm can cluster sparser graphs, and with smaller clusters, than all previous methods. This is seen empirically as well. 1

6 0.16055125 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set

7 0.15430732 71 nips-2012-Co-Regularized Hashing for Multimodal Data

8 0.14766328 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

9 0.13284796 99 nips-2012-Dip-means: an incremental clustering method for estimating the number of clusters

10 0.12886012 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

11 0.11912447 129 nips-2012-Fast Variational Inference in the Conjugate Exponential Family

12 0.10773414 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models

13 0.10561857 26 nips-2012-A nonparametric variable clustering model

14 0.10493563 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

15 0.10404634 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space

16 0.10268874 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

17 0.10265872 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

18 0.10264323 257 nips-2012-One Permutation Hashing

19 0.10241821 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

20 0.10201128 294 nips-2012-Repulsive Mixtures


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.284), (1, 0.083), (2, -0.016), (3, -0.05), (4, -0.198), (5, -0.123), (6, -0.027), (7, 0.039), (8, 0.186), (9, 0.064), (10, 0.21), (11, -0.192), (12, -0.01), (13, 0.062), (14, -0.054), (15, -0.05), (16, -0.092), (17, -0.076), (18, -0.186), (19, -0.085), (20, -0.043), (21, 0.038), (22, -0.01), (23, 0.009), (24, -0.061), (25, -0.049), (26, -0.011), (27, -0.011), (28, -0.002), (29, -0.018), (30, 0.011), (31, -0.057), (32, -0.0), (33, 0.049), (34, -0.034), (35, 0.022), (36, -0.004), (37, 0.012), (38, -0.006), (39, -0.032), (40, 0.1), (41, -0.037), (42, -0.006), (43, 0.072), (44, -0.04), (45, -0.085), (46, 0.071), (47, 0.038), (48, 0.058), (49, -0.079)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94139355 126 nips-2012-FastEx: Hash Clustering with Exponential Families

Author: Amr Ahmed, Sujith Ravi, Alex J. Smola, Shravan M. Narayanamurthy

Abstract: Clustering is a key component in any data analysis toolbox. Despite its importance, scalable algorithms often eschew rich statistical models in favor of simpler descriptions such as k-means clustering. In this paper we present a sampler, capable of estimating mixtures of exponential families. At its heart lies a novel proposal distribution using random projections to achieve high throughput in generating proposals, which is crucial for clustering models with large numbers of clusters. 1

2 0.7746383 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

Author: Ke Jiang, Brian Kulis, Michael I. Jordan

Abstract: Sampling and variational inference techniques are two standard methods for inference in probabilistic models, but for many problems, neither approach scales effectively to large-scale data. An alternative is to relax the probabilistic model into a non-probabilistic formulation which has a scalable associated algorithm. This can often be fulfilled by performing small-variance asymptotics, i.e., letting the variance of particular distributions in the model go to zero. For instance, in the context of clustering, such an approach yields connections between the kmeans and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that features the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis. 1

3 0.69345421 26 nips-2012-A nonparametric variable clustering model

Author: Konstantina Palla, Zoubin Ghahramani, David A. Knowles

Abstract: Factor analysis models effectively summarise the covariance structure of high dimensional data, but the solutions are typically hard to interpret. This motivates attempting to find a disjoint partition, i.e. a simple clustering, of observed variables into highly correlated subsets. We introduce a Bayesian non-parametric approach to this problem, and demonstrate advantages over heuristic methods proposed to date. Our Dirichlet process variable clustering (DPVC) model can discover blockdiagonal covariance structures in data. We evaluate our method on both synthetic and gene expression analysis problems. 1

4 0.68469512 99 nips-2012-Dip-means: an incremental clustering method for estimating the number of clusters

Author: Argyris Kalogeratos, Aristidis Likas

Abstract: Learning the number of clusters is a key problem in data clustering. We present dip-means, a novel robust incremental method to learn the number of data clusters that can be used as a wrapper around any iterative clustering algorithm of k-means family. In contrast to many popular methods which make assumptions about the underlying cluster distributions, dip-means only assumes a fundamental cluster property: each cluster to admit a unimodal distribution. The proposed algorithm considers each cluster member as an individual ‘viewer’ and applies a univariate statistic hypothesis test for unimodality (dip-test) on the distribution of distances between the viewer and the cluster members. Important advantages are: i) the unimodality test is applied on univariate distance vectors, ii) it can be directly applied with kernel-based methods, since only the pairwise distances are involved in the computations. Experimental results on artificial and real datasets indicate the effectiveness of our method and its superiority over analogous approaches.

5 0.65269679 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

Author: Deepak Venugopal, Vibhav Gogate

Abstract: First-order probabilistic models combine the power of first-order logic, the de facto tool for handling relational structure, with probabilistic graphical models, the de facto tool for handling uncertainty. Lifted probabilistic inference algorithms for them have been the subject of much recent research. The main idea in these algorithms is to improve the accuracy and scalability of existing graphical models’ inference algorithms by exploiting symmetry in the first-order representation. In this paper, we consider blocked Gibbs sampling, an advanced MCMC scheme, and lift it to the first-order level. We propose to achieve this by partitioning the first-order atoms in the model into a set of disjoint clusters such that exact lifted inference is polynomial in each cluster given an assignment to all other atoms not in the cluster. We propose an approach for constructing the clusters and show how it can be used to trade accuracy with computational complexity in a principled manner. Our experimental evaluation shows that lifted Gibbs sampling is superior to the propositional algorithm in terms of accuracy, scalability and convergence.

6 0.61908799 71 nips-2012-Co-Regularized Hashing for Multimodal Data

7 0.59355885 294 nips-2012-Repulsive Mixtures

8 0.5910145 313 nips-2012-Sketch-Based Linear Value Function Approximation

9 0.58846182 257 nips-2012-One Permutation Hashing

10 0.58677155 299 nips-2012-Scalable imputation of genetic data with a discrete fragmentation-coagulation process

11 0.55532688 329 nips-2012-Super-Bit Locality-Sensitive Hashing

12 0.55485636 155 nips-2012-Human memory search as a random walk in a semantic network

13 0.55408108 69 nips-2012-Clustering Sparse Graphs

14 0.53964442 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set

15 0.52774119 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

16 0.52722013 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

17 0.52274966 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

18 0.52001059 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

19 0.50395769 133 nips-2012-Finding Exemplars from Pairwise Dissimilarities via Simultaneous Sparse Recovery

20 0.50006109 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.066), (17, 0.022), (21, 0.026), (38, 0.123), (39, 0.015), (42, 0.025), (54, 0.023), (55, 0.026), (68, 0.2), (74, 0.058), (76, 0.186), (80, 0.103), (92, 0.051)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.87268221 126 nips-2012-FastEx: Hash Clustering with Exponential Families

Author: Amr Ahmed, Sujith Ravi, Alex J. Smola, Shravan M. Narayanamurthy

Abstract: Clustering is a key component in any data analysis toolbox. Despite its importance, scalable algorithms often eschew rich statistical models in favor of simpler descriptions such as k-means clustering. In this paper we present a sampler, capable of estimating mixtures of exponential families. At its heart lies a novel proposal distribution using random projections to achieve high throughput in generating proposals, which is crucial for clustering models with large numbers of clusters. 1

2 0.83841848 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming

Author: Mert Pilanci, Laurent E. Ghaoui, Venkat Chandrasekaran

Abstract: We consider the problem of cardinality penalized optimization of a convex function over the probability simplex with additional convex constraints. The classical ℓ1 regularizer fails to promote sparsity on the probability simplex since the ℓ1 norm on the probability simplex is trivially constant. We propose a direct relaxation of the minimum cardinality problem and show that it can be efficiently solved using convex programming. As a first application we consider recovering a sparse probability measure given moment constraints, in which our formulation becomes linear programming, hence can be solved very efficiently. A sufficient condition for exact recovery of the minimum cardinality solution is derived for arbitrary affine constraints. We then develop a penalized version for the noisy setting which can be solved using second order cone programs. The proposed method outperforms known rescaling heuristics based on the ℓ1 norm. As a second application we consider convex clustering using a sparse Gaussian mixture and compare our results with the well known soft k-means algorithm. 1

3 0.79286134 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

Author: Michael Bryant, Erik B. Sudderth

Abstract: Variational methods provide a computationally scalable alternative to Monte Carlo methods for large-scale, Bayesian nonparametric learning. In practice, however, conventional batch and online variational methods quickly become trapped in local optima. In this paper, we consider a nonparametric topic model based on the hierarchical Dirichlet process (HDP), and develop a novel online variational inference algorithm based on split-merge topic updates. We derive a simpler and faster variational approximation of the HDP, and show that by intelligently splitting and merging components of the variational posterior, we can achieve substantially better predictions of test data than conventional online and batch variational algorithms. For streaming analysis of large datasets where batch analysis is infeasible, we show that our split-merge updates better capture the nonparametric properties of the underlying model, allowing continual learning of new topics.

4 0.78706288 188 nips-2012-Learning from Distributions via Support Measure Machines

Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf

Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provides several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework. 1

5 0.78666586 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

Author: Ke Jiang, Brian Kulis, Michael I. Jordan

Abstract: Sampling and variational inference techniques are two standard methods for inference in probabilistic models, but for many problems, neither approach scales effectively to large-scale data. An alternative is to relax the probabilistic model into a non-probabilistic formulation which has a scalable associated algorithm. This can often be fulfilled by performing small-variance asymptotics, i.e., letting the variance of particular distributions in the model go to zero. For instance, in the context of clustering, such an approach yields connections between the kmeans and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that features the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis. 1

6 0.78648627 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

7 0.78595519 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

8 0.78528386 294 nips-2012-Repulsive Mixtures

9 0.78500217 197 nips-2012-Learning with Recursive Perceptual Representations

10 0.78458357 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

11 0.78450394 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

12 0.78404337 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

13 0.78358382 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

14 0.78321517 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

15 0.78308994 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points

16 0.78189176 232 nips-2012-Multiplicative Forests for Continuous-Time Processes

17 0.78127325 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

18 0.78126073 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

19 0.78064567 54 nips-2012-Bayesian Probabilistic Co-Subspace Addition

20 0.78059518 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models