nips nips2009 nips2009-186 knowledge-graph by maker-knowledge-mining

186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units


Source: pdf

Author: Feng Yan, Ningyi Xu, Yuan Qi

Abstract: The recent emergence of Graphics Processing Units (GPUs) as general-purpose parallel computing devices provides us with new opportunities to develop scalable learning methods for massive data. In this work, we consider the problem of parallelizing two inference methods on GPUs for latent Dirichlet Allocation (LDA) models, collapsed Gibbs sampling (CGS) and collapsed variational Bayesian (CVB). To address limited memory constraints on GPUs, we propose a novel data partitioning scheme that effectively reduces the memory cost. This partitioning scheme also balances the computational cost on each multiprocessor and enables us to easily avoid memory access conflicts. We use data streaming to handle extremely large datasets. Extensive experiments showed that our parallel inference methods consistently produced LDA models with the same predictive power as sequential training methods did but with 26x speedup for CGS and 196x speedup for CVB on a GPU with 30 multiprocessors. The proposed partitioning scheme and data streaming make our approach scalable with more multiprocessors. Furthermore, they can be used as general techniques to parallelize other machine learning models. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 In this work, we consider the problem of parallelizing two inference methods on GPUs for latent Dirichlet Allocation (LDA) models, collapsed Gibbs sampling (CGS) and collapsed variational Bayesian (CVB). [sent-5, score-0.487]

2 To address limited memory constraints on GPUs, we propose a novel data partitioning scheme that effectively reduces the memory cost. [sent-6, score-0.303]

3 This partitioning scheme also balances the computational cost on each multiprocessor and enables us to easily avoid memory access conflicts. [sent-7, score-0.263]

4 Extensive experiments showed that our parallel inference methods consistently produced LDA models with the same predictive power as sequential training methods did but with 26x speedup for CGS and 196x speedup for CVB on a GPU with 30 multiprocessors. [sent-9, score-0.774]

5 The proposed partitioning scheme and data streaming make our approach scalable with more multiprocessors. [sent-10, score-0.21]

6 While large CPU clusters are commonly used for parallel computing, Graphics Processing Units (GPUs) provide us with a powerful alternative platform for developing parallel machine learning methods. [sent-15, score-0.66]

7 A GPU has massively parallel built-in thread processors and high-speed memory, and therefore potentially provides one to two orders of magnitude greater peak flops and memory throughput than its CPU counterpart. [sent-16, score-0.67]

8 However, parallel computing 1 on GPUs can be a challenging task because of several limitations, such as relatively small memory size. [sent-22, score-0.445]

9 In this paper, we demonstrate how to overcome these limitations to parallel computing on GPUs with an exemplary data-intensive application, training Latent Dirichlet Allocation (LDA) models. [sent-23, score-0.33]

10 Our parallel approaches take advantage of the parallel computing power of GPUs and explore the algorithmic structures of LDA learning methods, thereby significantly reducing the computational cost. [sent-26, score-0.66]

11 Furthermore, our parallel inference approaches, based on a new data partition scheme and data streaming, can be applied to not only GPUs but also any shared memory machine. [sent-27, score-0.587]

12 Specifically, the main contributions of this paper include: • We introduce parallel collapsed Gibbs sampling (CGS) and parallel collapsed variational Bayesian (CVB) for LDA models on GPUs. [sent-28, score-1.077]

13 We also analyze the convergence property of the parallel variational inference and show that, with mild convexity assumptions, the parallel inference monotonically increases the variational lower bound until convergence. [sent-29, score-1.01]

14 • We propose a fast data partition scheme that efficiently balances the workloads across processors, fully utilizing the massively parallel mechanisms of GPUs. [sent-30, score-0.466]

15 • Extensive experiments show that both parallel inference algorithms on GPUs achieve the same predictive power as their sequential inference counterparts on CPUs, but run significantly faster. [sent-32, score-0.398]

16 The speedup is near linear in terms of the number of multiprocessors in the GPU card. [sent-33, score-0.262]

17 LDA models each of D documents as a mixture over K latent topics, and each topic k is a multinomial distribution over a word vocabulary having W distinct words, denoted by φk = {φkw}, where φk is drawn from a symmetric Dirichlet prior with parameter β. [sent-35, score-0.287]

18 For the ith token in the document, a topic assignment zij is drawn with topic k chosen with probability θjk . [sent-37, score-0.347]

19 Then word xij is drawn from the zij-th topic, with xij taking on value w with probability φ_{zij,w}. [sent-38, score-0.377]
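To make the generative process above concrete, here is a minimal NumPy sketch; the toy values for D, K, W, α, β and the document lengths are chosen purely for illustration, and the draw of the document-topic proportions θj from a symmetric Dirichlet(α) is the standard LDA assumption (the corresponding sentence is not in this extract).

```python
import numpy as np

rng = np.random.default_rng(0)

D, K, W = 5, 3, 20            # documents, topics, vocabulary size (toy values)
alpha, beta = 0.1, 0.01       # symmetric Dirichlet hyperparameters (toy values)
doc_lengths = [30] * D        # number of tokens per document (toy values)

# One topic-word distribution phi_k ~ Dirichlet(beta) per topic.
phi = rng.dirichlet(np.full(W, beta), size=K)

corpus = []
for j in range(D):
    theta_j = rng.dirichlet(np.full(K, alpha))   # document-topic proportions
    doc = []
    for _ in range(doc_lengths[j]):
        z_ij = rng.choice(K, p=theta_j)          # topic assignment for this token
        x_ij = rng.choice(W, p=phi[z_ij])        # observed word drawn from phi_{z_ij}
        doc.append((x_ij, z_ij))
    corpus.append(doc)
```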

20 The authors of [9] applied the same state space to variational Bayesian inference and proposed the collapsed variational Bayesian inference algorithm. [sent-45, score-0.433]

21 In CVB, the posterior of z is approximated by a factorized posterior q(z) = ∏_ij q(zij | γij), where q(zij | γij) is multinomial with variational parameter γij. (Footnote: we use indices to represent topics, documents and vocabulary words.) [sent-47, score-0.348]

22 The inference task is to find variational parameters maximizing the variational lower bound L(q) = Σ_z q(z) log( p(z, x|α, β) / q(z) ). [sent-49, score-0.316]

23 The updating formula for γij is similar to the CGS updates: γ_ijk ∝ (E_q[n^¬ij_{x_ij,k}] + β)(E_q[n^¬ij_{jk}] + α)(E_q[n^¬ij_{k}] + Wβ)^{−1} exp(−Var_q[n^¬ij_{x_ij,k}] / (2(E_q[n^¬ij_{x_ij,k}] + β)^2) − Var_q[n^¬ij_{jk}] / (2(E_q[n^¬ij_{jk}] + α)^2) + Var_q[n^¬ij_{k}] / (2(E_q[n^¬ij_{k}] + Wβ)^2)). (2) [sent-51, score-0.429]

24 (Section 3, Parallel Algorithms for LDA Training; Section 3.1, Parallel Collapsed Gibbs Sampling.) A natural way to parallelize LDA training is to distribute documents across P processors. [sent-52, score-0.342]
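For readers who prefer code, the CVB update in (2) can be sketched for a single token as follows; this is a hedged NumPy illustration with argument names invented here, and the bookkeeping that excludes the current token from the counts (the ¬ij part) is assumed to be done by the caller.

```python
import numpy as np

def cvb_update(E_nwk, V_nwk, E_njk, V_njk, E_nk, V_nk, alpha, beta, W):
    """One CVB update for a single token (document j, word w), in the form of Eq. (2).

    Each E_* / V_* argument is a length-K array holding the mean / variance of the
    corresponding count with the current token excluded (the "¬ij" counts)."""
    mean_term = (E_nwk + beta) * (E_njk + alpha) / (E_nk + W * beta)
    # Second-order correction terms from Eq. (2).
    correction = (-V_nwk / (2.0 * (E_nwk + beta) ** 2)
                  - V_njk / (2.0 * (E_njk + alpha) ** 2)
                  + V_nk / (2.0 * (E_nk + W * beta) ** 2))
    gamma = mean_term * np.exp(correction)
    return gamma / gamma.sum()    # normalize over the K topics
```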

25 The authors of [8] introduced a parallel implementation of CGS on distributed machines, called AD-LDA. [sent-54, score-0.33]

26 In AD-LDA, the D documents and the document-specific counts njk are distributed over P processors, with D/P documents on each processor. [sent-55, score-0.272]

27 In each iteration, every processor p independently runs local Gibbs sampling in parallel with its own copy of the topic-word counts n^p_kw and the topic counts n^p_k = Σ_w n^p_kw. [sent-56, score-0.598]

28 Then a global synchronization aggregates the local counts n^p_kw to produce the global counts nkw and nk. [sent-57, score-0.628]
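The aggregation step described here can be sketched as a simple reduction; the function below is an illustrative NumPy version that mirrors the form of Equation (3) later in this section, with all names invented, and it is not the distributed AD-LDA implementation itself.

```python
import numpy as np

def adlda_sync(n_global, local_copies):
    """Merge per-processor topic-word counts back into a single global count array.

    n_global     : (K, W) counts at the start of the iteration
    local_copies : list of (K, W) arrays, one per processor, after local sampling
    Each copy differs from n_global only by that processor's own updates, so the
    summed differences recover the combined update."""
    n_new = n_global + sum(n_p - n_global for n_p in local_copies)
    # Every processor starts the next iteration from the same global counts.
    return n_new, [n_new.copy() for _ in local_copies]
```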

29 However, it needs to store P copies of the topic-word counts nkw for all processors, which is unrealistic for GPUs with large P and large datasets due to the limited device memory space. [sent-59, score-0.356]

30 4 GBytes to store 256-topic nwk for 60 processors, exceeding the device memory capacity of current high-end GPUs. [sent-61, score-0.359]

31 In order to address this issue, we develop a parallel CGS algorithm that requires only one copy of nkw. [sent-62, score-0.413]

32 Our parallel CGS algorithm is motivated by the following observation: for word token w1 in document j1 and word token w2 in document j2, if w1 ≠ w2 and j1 ≠ j2, the topic assignments by (1) have no memory read/write conflicts on the document-topic counts njk and the topic-word counts nwk. [sent-63, score-0.85]

33 (Algorithm 1: Parallel Collapsed Gibbs Sampling. Input: word tokens x, document partition J1, ..., JP, vocabulary partition V1, ..., VP. Output: njk, nwk, zij.) [sent-69, score-1.033]

34 The algorithmic flow is summarized in Algorithm 1. [sent-70, score-0.219]

35 (Algorithm 1 body: initialize the topic assignment of each word token and set the counts njk, nwk, nk and n^p_k ← nk; then, in each epoch, /* sampling step */ each processor p in parallel samples zij for j ∈ Jp and xij ∈ Vp⊕l by Equation (1) with the global counts nwk, njk and the local counts n^p_k; /* synchronization step */ update nk and n^p_k according to Equation (3).) [sent-77, score-0.634]

36 We partition the documents into P disjoint subsets J1, ..., JP and distribute them to P processors. [sent-80, score-0.217]

37 We further divide the vocabulary words V = {1, ..., W} into P disjoint subsets V1, ..., VP. [sent-83, score-0.38]

38 Each processor p (p = 0, ..., P−1) stores a local copy of the topic counts n^p_k. [sent-86, score-0.28]

39 Every parallel CGS training iteration consists of P epochs, and each epoch consists of a sampling step and a synchronization step. [sent-87, score-0.674]

40 In the sampling step of epoch l (l = 0, ..., P−1), processor p samples topic assignments zij whose document index is j ∈ Jp and whose word index is xij ∈ Vp⊕l. [sent-91, score-0.483]

41 The ⊕ is the modulus P addition operation defined by a ⊕ b = (a + b) mod P, and all processors run the sampling simultaneously without memory read/write conflicts on the global counts njk and nwk . [sent-92, score-0.599]
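A tiny sketch of the epoch schedule induced by ⊕: the loop below (plain Python, sequential, with invented names) only demonstrates that within an epoch no two processors share a document subset or a word subset, which is why the samplers never conflict on njk or nwk.

```python
P = 4  # number of processors (toy value)

def epoch_blocks(P):
    """For epoch l, processor p works on documents J_p and words V_{(p + l) mod P}."""
    for l in range(P):
        blocks = [(p, (p + l) % P) for p in range(P)]
        # Within one epoch the P blocks use P distinct document subsets and
        # P distinct word subsets, so no two samplers touch the same njk row
        # or the same nwk column.
        assert len({m for m, _ in blocks}) == P
        assert len({n for _, n in blocks}) == P
        yield l, blocks

for l, blocks in epoch_blocks(P):
    print("epoch", l, "-> (document subset, word subset):", blocks)
```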

42 Then the synchronization step uses (3) to aggregate the local topic counts n^p_k into the global counts nk, which are used as the local counts in the next epoch. [sent-93, score-0.485]

43 nk ← nk + Σ_p (n^p_k − nk), and then n^p_k ← nk. (3) Our parallel CGS can be regarded as an extension to AD-LDA by using the data partition in local sampling and inserting P−1 more synchronization steps within an iteration. [sent-94, score-1.104]
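Equation (3) itself amounts to a one-line reduction; the NumPy sketch below, with invented names and a toy check, shows the aggregation of the per-processor topic counts n^p_k into the global nk and the reset of every local copy.

```python
import numpy as np

def sync_topic_counts(n_k, local_copies):
    """Equation (3): n_k <- n_k + sum_p (n_k^p - n_k), then n_k^p <- n_k for every p."""
    delta = sum(nk_p - n_k for nk_p in local_copies)   # net change made by all processors
    n_k = n_k + delta
    return n_k, [n_k.copy() for _ in local_copies]     # fresh local copies for the next epoch

# Toy check: two processors each added one token to a different topic.
n_k = np.array([3, 3, 3])
locals_ = [np.array([4, 3, 3]), np.array([3, 4, 3])]
print(sync_topic_counts(n_k, locals_)[0])              # -> [4 4 3]
```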

44 Since our data partition guarantees that any two processors access neither the same document nor the same word in an epoch, the synchronization of nwk in AD-LDA is equivalent to keeping nwk unchanged after the sampling step of the epoch. [sent-95, score-0.791]

45 Because P processors concurrently sample new topic assignments in parallel CGS, we don’t necessarily sample from the correct posterior distribution. [sent-96, score-0.539]

46 (Section 3.2, Parallel Collapsed Variational Bayesian.) The collapsed Gibbs sampling and the collapsed variational Bayesian inference [9] are similar in their algorithmic structures. [sent-100, score-0.451]

47 A single iteration of our parallel CVB also consists of P epochs, and each epoch has an updating step and a synchronization step. [sent-103, score-0.497]

48 The updating step updates variational parameters in a manner similar to the sampling step of parallel CGS. [sent-104, score-0.513]

49 The synchronization step involves an affine combination of the variational parameters in the natural parameter space. [sent-106, score-0.257]

50 We pick λsync as the updated λ from a one-parameter class of variational parameters λ(µ) that combines the contributions from all processors: λ(µ) = λ_old + µ Σ_{p=0}^{P−1} (λ^(p) − λ_old), µ ≥ 0. [sent-123, score-0.323]
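A hedged sketch of this synchronization step, under one reading of the text in which the combined λ's are the natural (log) parameters of the multinomial variational distributions; the function and argument names are illustrative, and µ = 1 recovers the λsync = λ(1) setting reported to work well in the experiments.

```python
import numpy as np

def sync_variational(lam_old, lam_locals, mu=1.0):
    """lambda(mu) = lambda_old + mu * sum_p (lambda^(p) - lambda_old), formed in the
    natural-parameter (log) space of the multinomial variational parameters."""
    eta_old = np.log(lam_old)
    eta_new = eta_old + mu * sum(np.log(lam_p) - eta_old for lam_p in lam_locals)
    lam_new = np.exp(eta_new)
    return lam_new / lam_new.sum(axis=-1, keepdims=True)   # renormalize over the K topics
```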

51 In order to present both algorithms in a unified way, we define the co-occurrence matrix R = (rjw) as follows: for parallel CGS, rjw is the number of occurrences of word w in document j; for parallel CVB, rjw = 1 if w occurs at least once in j, and otherwise rjw = 0. [sent-132, score-1.267]

52 The optimal data partition is equivalent to minimizing the following cost function: C = Σ_{l=0}^{P−1} max_{(m,n): m⊕l=n} {Cmn}, where Cmn = Σ_{rjw ∈ Rmn} rjw. (5) The basic operation in the proposed algorithms is either sampling topic assignments (in CGS) or updating variational parameters (in CVB). [sent-134, score-0.646]

53 Since all processors have to wait for the slowest processor to complete its job before a synchronization step, the maximal Cmn is the number of basic operations for the slowest processor. [sent-139, score-0.331]

54 We define the data partition efficiency η for given row and column partitions by η = Copt / C, where Copt = (Σ_{j∈J, w∈V} rjw) / P (6) is the theoretically minimal number of basic operations. [sent-141, score-0.232]
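The cost C in (5) and the efficiency η in (6) are straightforward to evaluate for a candidate partition; below is a small NumPy sketch in which R is a dense co-occurrence matrix and row_parts / col_parts are lists of index arrays, all names chosen here for illustration.

```python
import numpy as np

def partition_efficiency(R, row_parts, col_parts):
    """Compute C from Eq. (5) and eta = C_opt / C from Eq. (6).

    R         : (D, W) co-occurrence matrix with entries r_jw
    row_parts : list of P index arrays J_1, ..., J_P over documents
    col_parts : list of P index arrays V_1, ..., V_P over vocabulary words
    """
    P = len(row_parts)
    # C_mn: amount of basic work in data block R_mn.
    C_block = np.array([[R[np.ix_(row_parts[m], col_parts[n])].sum()
                         for n in range(P)] for m in range(P)])
    # In epoch l, processor m works on block (m, (m + l) mod P); the epoch costs
    # as much as its slowest block, and the P epoch costs are summed.
    C = sum(max(C_block[m, (m + l) % P] for m in range(P)) for l in range(P))
    C_opt = R.sum() / P        # theoretically minimal number of basic operations
    return C_opt / C
```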

55 This algorithm is fast, since it needs only one full sweep over all word tokens or unique document/word pairs to calculate jm and wn. [sent-156, score-0.275]
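The sweep itself is not spelled out in this summary, but one natural reading is that the boundaries jm and wn are placed where the cumulative token counts reach multiples of 1/P of the total; the sketch below implements that reading and is an assumption rather than the paper's exact procedure.

```python
import numpy as np

def split_points(mass, P):
    """Place P-1 boundaries; each is the first index whose cumulative mass reaches
    m * total / P.  `mass` holds per-document (or per-word) token counts, so the
    row boundaries j_m and the column boundaries w_n each need one cumulative sweep."""
    cum = np.cumsum(mass)
    total = cum[-1]
    return [int(np.searchsorted(cum, m * total / P)) for m in range(1, P)]

# Example with toy per-document token counts and P = 3 processors.
print(split_points(np.array([5, 1, 4, 4, 3, 6, 2, 5]), 3))
```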

56 For a word token x in the corpus, the probability that x is the word w is P (x = w), the probability that x is in document j is P (x in j). [sent-159, score-0.285]

57 The G280 has 30 on-chip multiprocessors running at 1296 MHz, and each multiprocessor has 8 thread processors that are responsible for executing all threads deployed on the multiprocessor in parallel. [sent-174, score-0.496]

58 The G280 has 1 GByte of on-board device memory, and the memory bandwidth is 141. [sent-175, score-0.359]

59 Threads in the same thread block are executed on a multiprocessor, and a multiprocessor can execute a number of thread blocks. [sent-180, score-0.257]

60 For a word token, fine-grained parallel calculations, such as (1) and (2), are realized by parallel threads inside a thread block. [sent-182, score-0.903]

61 Given the limited amount of device memory on GPUs, we cannot load all training data and model parameters into the device memory for large-scale datasets. [sent-183, score-0.434]

62 However, the sequential nature of Gibbs sampling and variational Bayesian inference allows us to implement a data streaming [5] scheme that effectively reduces the GPU device memory space requirements. [sent-184, score-0.543]

63 Temporary data and variables, xij, zij and γij, are sent to a working space in GPU device memory on the fly. [sent-185, score-0.431]
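The streaming idea can be sketched without any GPU specifics: only the blocks needed for the current epoch are brought into a fixed working space, processed, and written back; the generator below is a host-side, illustrative stand-in with invented names, not the CUDA implementation.

```python
def stream_epochs(blocks_on_host, P):
    """Yield, one epoch at a time, the P data blocks that need to be resident on
    the device.  blocks_on_host[(m, n)] holds the tokens (with their z or gamma
    variables) for documents in J_m and words in V_n; everything else stays in
    host memory (or on disk) until its epoch comes up."""
    for l in range(P):
        working_set = [blocks_on_host[(p, (p + l) % P)] for p in range(P)]
        yield l, working_set   # copy to the GPU working space, process, copy back

# Toy usage: 2 processors, 4 blocks, each "block" just a list of labels here.
blocks = {(m, n): ["block-%d%d" % (m, n)] for m in range(2) for n in range(2)}
for l, ws in stream_epochs(blocks, 2):
    print("epoch", l, ws)
```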

64 For each dataset, we randomly extracted 90% of all word tokens as the training set, and the remaining 10% of word tokens form the test set. [sent-193, score-0.29]

65 We use λsync = λ(1) in the parallel CVB, and this setting works well in all of our experiments. [sent-196, score-0.33]

66 1 Perplexity We measure the performance of the parallel algorithms using test set perplexity. [sent-198, score-0.33]

67 For CVB, the test set likelihood p(x^test) is computed as log p(x^test) = Σ_ij log Σ_k θ̄_jk φ̄_{x_ij,k}, where θ̄_jk = (α + E[n_jk]) / (Kα + Σ_k E[n_jk]) and φ̄_wk = (β + E[n_wk]) / (Wβ + E[n_k]). (7) We report the average perplexity and the standard deviation of 10 randomly initialized runs for the parallel CVB. [sent-200, score-0.574]

68 For CGS, log p(x^test) = Σ_ij log (1/S) Σ_s Σ_k θ̂^s_jk φ̂^s_{x_ij,k}, where θ̂^s_jk = (α + n^s_jk) / (Kα + Σ_k n^s_jk) and φ̂^s_wk = (β + n^s_wk) / (Wβ + n^s_k), averaging over S samples s. (8) Two small datasets, KOS and NIPS, are used in the perplexity experiment. [sent-203, score-0.803]
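As a concrete companion to (7) and (8), the sketch below computes test-set perplexity from point estimates of θ and φ built from counts (a single CGS sample, or the CVB expectations in place of the counts); the averaging over S samples in (8) and all variable names are simplified here for illustration.

```python
import numpy as np

def test_perplexity(test_tokens, n_jk, n_wk, alpha, beta):
    """Perplexity = exp(-log p(x_test) / N_test), with theta and phi estimated
    from the training counts as in Eqs. (7)-(8).

    test_tokens : list of (document j, word w) pairs
    n_jk        : (D, K) document-topic counts
    n_wk        : (W, K) topic-word counts
    """
    K = n_jk.shape[1]
    W = n_wk.shape[0]
    theta = (n_jk + alpha) / (n_jk.sum(axis=1, keepdims=True) + K * alpha)
    phi = (n_wk + beta) / (n_wk.sum(axis=0, keepdims=True) + W * beta)
    log_p = sum(np.log(theta[j] @ phi[w]) for j, w in test_tokens)
    return float(np.exp(-log_p / len(test_tokens)))
```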

69 For a fixed number of topics K, there is no significant difference between the parallel and the single-processor algorithms. [sent-207, score-0.33]

70 This suggests that our parallel algorithms converge to models with the same predictive power, in terms of perplexity, as the single-processor LDA algorithms. [sent-208, score-0.455]

71 Perplexity as a function of iteration number for parallel CGS and parallel CVB on NIPS are shown in Figure 2 (a) and (b) respectively. [sent-209, score-0.66]

72 Since CVB actually maximizes the variational lower bound L(q) on the training set, we also investigated the convergence rate of the variational lower bound. [sent-210, score-0.282]

73 Figure 2 (c) shows the per-word-token variational lower bound as a function of iteration for P = 1, 10, 30 on a sampled dataset. [sent-212, score-0.287]

74 Both parallel algorithms converge as rapidly as the single-processor LDA algorithms. [sent-216, score-0.33]

75 In fact, as we decreased the number of synchronizations in the parallel CVB, the result became significantly worse. [sent-219, score-0.33]

76 [Figure 2 panels: legend entries CVB, Parallel CVB P = 10, Parallel CVB P = 30; axes show Perplexity, Variational lower bound, and Iteration.] [sent-223, score-0.228]

77 Figure 2: (a) Test set perplexity as a function of iteration number for the parallel CGS on NIPS, K = 256. [sent-231, score-0.605]

78 (b) Test set perplexity as a function of iteration number for the parallel CVB on NIPS, K = 128. [sent-232, score-0.455]

79 Our speedup experiments are conducted on the NIPS dataset for both parallel algorithms and the large NYT dataset for only the parallel CGS, because γij of the NYT dataset requires too much memory space to fit into our PC’s host memory. [sent-242, score-1.014]

80 1 seconds on NYT (K = 128) for the parallel CGS, and 11. [sent-246, score-0.359]

81 Figure 3 shows the speedup of the parallel CGS (left) and the speedup of the parallel CVB (right), with the corresponding data partition efficiency η shown under each speedup. [sent-248, score-1.146]

82 Therefore data transfer between the device memory and the multiprocessor is better hidden by computation on the threads. [sent-250, score-0.292]

83 As a result, we have extra speedup when the number of “processors” (thread blocks) is larger than the number of multiprocessors on the GPU. [sent-251, score-0.262]

84 [Figure 3 legend: Streaming / No Streaming; P = 10, 30, 60.] Figure 3: Speedup of parallel CGS (left) on NIPS and NYT, and speedup of parallel CVB (right) on NIPS. [sent-264, score-0.865]

85 Although using data streaming reduces the speedup of parallel CVB due to the low bandwidth between the PC host memory and the GPU device memory, it enables us to use a GPU card to process large-volume data. [sent-269, score-0.936]

86 The synchronization overhead is very small since P ≪ N, and the speedup is largely determined by the maximal bandwidth between the PC host memory and the GPU device memory. [sent-270, score-0.321]

87 [Figure 4 legend: the "current", "even" and "random" partition schemes, with the data partition efficiency η on the vertical axis.] [sent-277, score-0.251]

88 But the speedup with or without data streaming differs dramatically for the parallel CVB, because its computation bandwidth is roughly ∼7 times higher than the maximal bandwidth that data streaming can provide. [sent-286, score-0.761]

89 (Figure 4: data partition efficiency η of the three partition schemes; due to the negligible overheads for the synchronization steps, the speedup is proportional to η in practice.) [sent-288, score-0.36]

90 The high speedup of the parallel CVB [sent-289, score-0.535]

91 without data streaming is due to a hardware-supported exponential function and a high-performance implementation of parallel reduction that is used to normalize γij calculated from (2). [sent-290, score-0.441]

92 Figure 3 (right) shows that the larger the P , the smaller the speedup for the parallel CVB with data streaming. [sent-291, score-0.535]

93 The “even” scheme partitions documents and the word vocabulary into roughly equal-sized subsets by setting jm = mD/P and wn = nW/P. [sent-295, score-0.306]

94 More than 20x speedup is achieved for both parallel algorithms with data streaming. [sent-298, score-0.535]

95 The speedup of the parallel CGS enables us to run 1000 iterations of Gibbs sampling (K = 128) on the large NYT dataset within 1. [sent-299, score-0.577]

96 Unlike their work, ours shows how to use GPUs to achieve a significant, scalable speedup for LDA training while maintaining correct, accurate predictions. [sent-306, score-0.231]

97 However, with the limited memory size of a GPU, compared to that of a CPU cluster, this can lead to memory access conflicts. [sent-312, score-0.23]

98 This issue becomes severe when one performs many parallel jobs (thread blocks), and it leads to wrong inference results and operation failure, as reported by Masada et al. [sent-313, score-0.364]

99 Accelerating collapsed variational Bayesian inference for latent Dirichlet allocation with Nvidia CUDA compatible devices. [sent-370, score-0.365]

100 A collapsed variational Bayesian inference algorithm for Latent Dirichlet allocation. [sent-390, score-0.292]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cgs', 0.37), ('parallel', 0.33), ('cvb', 0.313), ('gpu', 0.256), ('speedup', 0.205), ('gpus', 0.187), ('rjw', 0.156), ('nwk', 0.142), ('variational', 0.141), ('zij', 0.139), ('nk', 0.135), ('processors', 0.134), ('perplexity', 0.125), ('ij', 0.119), ('collapsed', 0.117), ('synchronization', 0.116), ('jk', 0.116), ('memory', 0.115), ('cmn', 0.112), ('streaming', 0.111), ('ijk', 0.103), ('lda', 0.102), ('device', 0.102), ('njk', 0.091), ('thread', 0.091), ('jm', 0.091), ('word', 0.088), ('nyt', 0.087), ('kos', 0.085), ('np', 0.084), ('cpu', 0.077), ('partition', 0.076), ('counts', 0.075), ('xij', 0.075), ('topic', 0.075), ('multiprocessor', 0.075), ('copt', 0.071), ('xtest', 0.071), ('sync', 0.068), ('threads', 0.064), ('lle', 0.064), ('token', 0.058), ('masada', 0.057), ('multiprocessors', 0.057), ('parallelize', 0.057), ('tokens', 0.057), ('processor', 0.055), ('kw', 0.053), ('documents', 0.053), ('document', 0.051), ('epoch', 0.051), ('newman', 0.05), ('eq', 0.05), ('ity', 0.05), ('jp', 0.05), ('tio', 0.05), ('varq', 0.05), ('old', 0.048), ('copy', 0.046), ('asuncion', 0.045), ('dirichlet', 0.043), ('cuda', 0.043), ('gbytes', 0.043), ('ite', 0.043), ('rmn', 0.043), ('sampling', 0.042), ('var', 0.041), ('partitioning', 0.041), ('nips', 0.041), ('gibbs', 0.04), ('bandwidth', 0.039), ('wn', 0.039), ('ops', 0.037), ('nkw', 0.037), ('allocation', 0.037), ('ns', 0.036), ('latent', 0.036), ('vocabulary', 0.035), ('inference', 0.034), ('host', 0.034), ('wk', 0.032), ('scheme', 0.032), ('vn', 0.031), ('pc', 0.03), ('vp', 0.029), ('seconds', 0.029), ('lafayette', 0.028), ('mcopt', 0.028), ('nvidia', 0.028), ('purdue', 0.028), ('workloads', 0.028), ('datasets', 0.027), ('yan', 0.026), ('smyth', 0.026), ('parallelized', 0.026), ('scalable', 0.026), ('operations', 0.026), ('consumption', 0.025), ('cpus', 0.025), ('epochs', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units

Author: Feng Yan, Ningyi Xu, Yuan Qi

Abstract: The recent emergence of Graphics Processing Units (GPUs) as general-purpose parallel computing devices provides us with new opportunities to develop scalable learning methods for massive data. In this work, we consider the problem of parallelizing two inference methods on GPUs for latent Dirichlet Allocation (LDA) models, collapsed Gibbs sampling (CGS) and collapsed variational Bayesian (CVB). To address limited memory constraints on GPUs, we propose a novel data partitioning scheme that effectively reduces the memory cost. This partitioning scheme also balances the computational cost on each multiprocessor and enables us to easily avoid memory access conflicts. We use data streaming to handle extremely large datasets. Extensive experiments showed that our parallel inference methods consistently produced LDA models with the same predictive power as sequential training methods did but with 26x speedup for CGS and 196x speedup for CVB on a GPU with 30 multiprocessors. The proposed partitioning scheme and data streaming make our approach scalable with more multiprocessors. Furthermore, they can be used as general techniques to parallelize other machine learning models. 1

2 0.15477562 123 nips-2009-Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process

Author: Finale Doshi-velez, Shakir Mohamed, Zoubin Ghahramani, David A. Knowles

Abstract: Nonparametric Bayesian models provide a framework for flexible probabilistic modelling of complex datasets. Unfortunately, the high-dimensional averages required for Bayesian methods can be slow, especially with the unbounded representations used by nonparametric models. We address the challenge of scaling Bayesian inference to the increasingly large datasets found in real-world applications. We focus on parallelisation of inference in the Indian Buffet Process (IBP), which allows data points to have an unbounded number of sparse latent features. Our novel MCMC sampler divides a large data set between multiple processors and uses message passing to compute the global likelihoods and posteriors. This algorithm, the first parallel inference scheme for IBP-based models, scales to datasets orders of magnitude larger than have previously been possible. 1

3 0.12392703 205 nips-2009-Rethinking LDA: Why Priors Matter

Author: Andrew McCallum, David M. Mimno, Hanna M. Wallach

Abstract: Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters” have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling. 1

4 0.10638063 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process

Author: Chong Wang, David M. Blei

Abstract: The nested Chinese restaurant process (nCRP) is a powerful nonparametric Bayesian model for learning tree-based hierarchies from data. Since its posterior distribution is intractable, current inference methods have all relied on MCMC sampling. In this paper, we develop an alternative inference technique based on variational methods. To employ variational methods, we derive a tree-based stick-breaking construction of the nCRP mixture model, and a novel variational algorithm that efficiently explores a posterior over a large set of combinatorial structures. We demonstrate the use of this approach for text and hand written digits modeling, where we show we can adapt the nCRP to continuous data as well. 1

5 0.10535379 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process

Author: Chong Wang, David M. Blei

Abstract: We present a nonparametric hierarchical Bayesian model of document collections that decouples sparsity and smoothness in the component distributions (i.e., the “topics”). In the sparse topic model (sparseTM), each topic is represented by a bank of selector variables that determine which terms appear in the topic. Thus each topic is associated with a subset of the vocabulary, and topic smoothness is modeled on this subset. We develop an efficient Gibbs sampler for the sparseTM that includes a general-purpose method for sampling from a Dirichlet mixture with a combinatorial number of components. We demonstrate the sparseTM on four real-world datasets. Compared to traditional approaches, the empirical results will show that sparseTMs give better predictive performance with simpler inferred models. 1

6 0.10315154 204 nips-2009-Replicated Softmax: an Undirected Topic Model

7 0.09678185 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

8 0.087547891 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall

9 0.079941921 220 nips-2009-Slow Learners are Fast

10 0.079410143 234 nips-2009-Streaming k-means approximation

11 0.073556766 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

12 0.066391669 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model

13 0.06469658 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing

14 0.062260535 96 nips-2009-Filtering Abstract Senses From Image Search Results

15 0.058082961 190 nips-2009-Polynomial Semantic Indexing

16 0.053253312 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora

17 0.051022187 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

18 0.048225023 226 nips-2009-Spatial Normalized Gamma Processes

19 0.04806342 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation

20 0.046252333 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.143), (1, -0.039), (2, -0.053), (3, -0.127), (4, 0.096), (5, -0.164), (6, -0.016), (7, 0.021), (8, -0.059), (9, 0.079), (10, -0.071), (11, -0.011), (12, 0.021), (13, 0.095), (14, -0.022), (15, -0.025), (16, -0.076), (17, -0.069), (18, -0.071), (19, -0.012), (20, 0.018), (21, -0.068), (22, 0.074), (23, -0.047), (24, 0.077), (25, -0.083), (26, -0.019), (27, -0.014), (28, 0.054), (29, 0.065), (30, 0.003), (31, -0.007), (32, 0.062), (33, -0.053), (34, 0.005), (35, 0.137), (36, 0.007), (37, -0.009), (38, -0.026), (39, 0.127), (40, 0.012), (41, -0.041), (42, 0.065), (43, -0.015), (44, -0.163), (45, -0.028), (46, -0.03), (47, -0.077), (48, -0.075), (49, -0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95764643 186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units

Author: Feng Yan, Ningyi Xu, Yuan Qi

Abstract: The recent emergence of Graphics Processing Units (GPUs) as general-purpose parallel computing devices provides us with new opportunities to develop scalable learning methods for massive data. In this work, we consider the problem of parallelizing two inference methods on GPUs for latent Dirichlet Allocation (LDA) models, collapsed Gibbs sampling (CGS) and collapsed variational Bayesian (CVB). To address limited memory constraints on GPUs, we propose a novel data partitioning scheme that effectively reduces the memory cost. This partitioning scheme also balances the computational cost on each multiprocessor and enables us to easily avoid memory access conflicts. We use data streaming to handle extremely large datasets. Extensive experiments showed that our parallel inference methods consistently produced LDA models with the same predictive power as sequential training methods did but with 26x speedup for CGS and 196x speedup for CVB on a GPU with 30 multiprocessors. The proposed partitioning scheme and data streaming make our approach scalable with more multiprocessors. Furthermore, they can be used as general techniques to parallelize other machine learning models. 1

2 0.69004077 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process

Author: Chong Wang, David M. Blei

Abstract: We present a nonparametric hierarchical Bayesian model of document collections that decouples sparsity and smoothness in the component distributions (i.e., the “topics”). In the sparse topic model (sparseTM), each topic is represented by a bank of selector variables that determine which terms appear in the topic. Thus each topic is associated with a subset of the vocabulary, and topic smoothness is modeled on this subset. We develop an efficient Gibbs sampler for the sparseTM that includes a general-purpose method for sampling from a Dirichlet mixture with a combinatorial number of components. We demonstrate the sparseTM on four real-world datasets. Compared to traditional approaches, the empirical results will show that sparseTMs give better predictive performance with simpler inferred models. 1

3 0.63963681 205 nips-2009-Rethinking LDA: Why Priors Matter

Author: Andrew McCallum, David M. Mimno, Hanna M. Wallach

Abstract: Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters” have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling. 1

4 0.6180619 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model

Author: Tomoharu Iwata, Takeshi Yamada, Naonori Ueda

Abstract: We propose a probabilistic topic model for analyzing and extracting contentrelated annotations from noisy annotated discrete data such as web pages stored in social bookmarking services. In these services, since users can attach annotations freely, some annotations do not describe the semantics of the content, thus they are noisy, i.e. not content-related. The extraction of content-related annotations can be used as a preprocessing step in machine learning tasks such as text classification and image recognition, or can improve information retrieval performance. The proposed model is a generative model for content and annotations, in which the annotations are assumed to originate either from topics that generated the content or from a general distribution unrelated to the content. We demonstrate the effectiveness of the proposed method by using synthetic data and real social annotation data for text and images.

5 0.60236317 204 nips-2009-Replicated Softmax: an Undirected Topic Model

Author: Geoffrey E. Hinton, Ruslan Salakhutdinov

Abstract: We introduce a two-layer undirected graphical model, called a “Replicated Softmax”, that can be used to model and automatically extract low-dimensional latent semantic representations from a large unstructured collection of documents. We present efficient learning and inference algorithms for this model, and show how a Monte-Carlo based method, Annealed Importance Sampling, can be used to produce an accurate estimate of the log-probability the model assigns to test data. This allows us to demonstrate that the proposed model is able to generalize much better compared to Latent Dirichlet Allocation in terms of both the log-probability of held-out documents and the retrieval accuracy.

6 0.56268066 123 nips-2009-Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process

7 0.55327106 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

8 0.55139244 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process

9 0.49417192 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora

10 0.41769046 143 nips-2009-Localizing Bugs in Program Executions with Graphical Models

11 0.4044638 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing

12 0.391507 226 nips-2009-Spatial Normalized Gamma Processes

13 0.34498188 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

14 0.33952194 29 nips-2009-An Infinite Factor Model Hierarchy Via a Noisy-Or Mechanism

15 0.32265851 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

16 0.32121241 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

17 0.31490624 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

18 0.30834129 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

19 0.30364534 51 nips-2009-Clustering sequence sets for motif discovery

20 0.30167219 90 nips-2009-Factor Modeling for Advertisement Targeting


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.013), (10, 0.398), (24, 0.037), (25, 0.061), (35, 0.05), (36, 0.089), (39, 0.028), (55, 0.01), (58, 0.06), (61, 0.013), (71, 0.076), (81, 0.015), (86, 0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81415182 186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units

Author: Feng Yan, Ningyi Xu, Yuan Qi

Abstract: The recent emergence of Graphics Processing Units (GPUs) as general-purpose parallel computing devices provides us with new opportunities to develop scalable learning methods for massive data. In this work, we consider the problem of parallelizing two inference methods on GPUs for latent Dirichlet Allocation (LDA) models, collapsed Gibbs sampling (CGS) and collapsed variational Bayesian (CVB). To address limited memory constraints on GPUs, we propose a novel data partitioning scheme that effectively reduces the memory cost. This partitioning scheme also balances the computational cost on each multiprocessor and enables us to easily avoid memory access conflicts. We use data streaming to handle extremely large datasets. Extensive experiments showed that our parallel inference methods consistently produced LDA models with the same predictive power as sequential training methods did but with 26x speedup for CGS and 196x speedup for CVB on a GPU with 30 multiprocessors. The proposed partitioning scheme and data streaming make our approach scalable with more multiprocessors. Furthermore, they can be used as general techniques to parallelize other machine learning models. 1

2 0.60145181 177 nips-2009-On Learning Rotations

Author: Raman Arora

Abstract: An algorithm is presented for online learning of rotations. The proposed algorithm involves matrix exponentiated gradient updates and is motivated by the von Neumann divergence. The multiplicative updates are exponentiated skew-symmetric matrices which comprise the Lie algebra of the rotation group. The orthonormality and unit determinant of the matrix parameter are preserved using matrix logarithms and exponentials and the algorithm lends itself to intuitive interpretation in terms of the differential geometry of the manifold associated with the rotation group. A complexity reduction result is presented that exploits the eigenstructure of the matrix updates to simplify matrix exponentiation to a quadratic form. 1

3 0.58021295 220 nips-2009-Slow Learners are Fast

Author: Martin Zinkevich, John Langford, Alex J. Smola

Abstract: Online learning algorithms have impressive convergence properties when it comes to risk minimization and convex games on very large problems. However, they are inherently sequential in their design which prevents them from taking advantage of modern multi-core architectures. In this paper we prove that online learning with delayed updates converges well, thereby facilitating parallel online learning. 1

4 0.39221072 27 nips-2009-Adaptive Regularization of Weight Vectors

Author: Koby Crammer, Alex Kulesza, Mark Dredze

Abstract: We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques and show empirically that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data. 1

5 0.38018945 97 nips-2009-Free energy score space

Author: Alessandro Perina, Marco Cristani, Umberto Castellani, Vittorio Murino, Nebojsa Jojic

Abstract: A score function induced by a generative model of the data can provide a feature vector of a fixed dimension for each data sample. Data samples themselves may be of differing lengths (e.g., speech segments, or other sequence data), but as a score function is based on the properties of the data generation process, it produces a fixed-length vector in a highly informative space, typically referred to as a “score space”. Discriminative classifiers have been shown to achieve higher performance in appropriately chosen score spaces than is achievable by either the corresponding generative likelihood-based classifiers, or the discriminative classifiers using standard feature extractors. In this paper, we present a novel score space that exploits the free energy associated with a generative model. The resulting free energy score space (FESS) takes into account latent structure of the data at various levels, and can be trivially shown to lead to classification performance that at least matches the performance of the free energy classifier based on the same generative model, and the same factorization of the posterior. We also show that in several typical vision and computational biology applications the classifiers optimized in FESS outperform the corresponding pure generative approaches, as well as a number of previous approaches to combining discriminating and generative models.

6 0.37841102 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

7 0.37795675 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

8 0.37771258 260 nips-2009-Zero-shot Learning with Semantic Output Codes

9 0.37697214 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

10 0.37602088 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior

11 0.37535042 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

12 0.37512603 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

13 0.37486014 226 nips-2009-Spatial Normalized Gamma Processes

14 0.37402886 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

15 0.37397155 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

16 0.37299389 113 nips-2009-Improving Existing Fault Recovery Policies

17 0.37237889 129 nips-2009-Learning a Small Mixture of Trees

18 0.3709186 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition

19 0.36940742 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

20 0.36856452 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA