NIPS 2008, Paper 28: Asynchronous Distributed Learning of Topic Models
Author: Padhraic Smyth, Max Welling, Arthur U. Asuncion
Abstract: Distributed learning is a problem of fundamental interest in machine learning and cognitive science. In this paper, we present asynchronous distributed learning algorithms for two well-known unsupervised learning frameworks: Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Processes (HDP). In the proposed approach, the data are distributed across P processors, and processors independently perform Gibbs sampling on their local data and communicate their information in a local asynchronous manner with other processors. We demonstrate that our asynchronous algorithms are able to learn global topic models that are statistically as accurate as those learned by the standard LDA and HDP samplers, but with significant improvements in computation time and memory. We show speedup results on a 730-million-word text corpus using 32 processors, and we provide perplexity results for up to 1500 virtual processors. As a stepping stone in the development of asynchronous HDP, a parallel HDP sampler is also introduced.
1 Introduction

Learning algorithms that can perform in a distributed asynchronous manner are of interest for several different reasons. In this paper, we focus on the specific problem of developing asynchronous distributed learning algorithms for a class of unsupervised learning techniques, specifically LDA [1] and HDP [2] with learning via Gibbs sampling. A promising approach to scaling these algorithms to large data sets is to distribute the data across multiple processors and develop appropriate distributed topic-modeling algorithms [3, 4, 5]. One motivating example is learning a topic model in near real-time for tens of thousands of documents returned by a search engine. While synchronous distributed algorithms for topic models have been proposed in earlier work, here we investigate asynchronous distributed learning of topic models. Our primary novel contribution is the introduction of new asynchronous distributed algorithms for LDA and HDP, based on local collapsed Gibbs sampling on each processor. Our distributed framework can provide substantial memory and time savings over single-processor computation, since each processor only needs to store and perform Gibbs sweeps over 1/P-th of the data, where P is the number of processors. Furthermore, the asynchronous approach can scale to large corpora and large numbers of processors, since no global synchronization steps are required. While building towards an asynchronous algorithm for HDP, we also introduce a novel synchronous distributed inference algorithm for HDP, again based on collapsed Gibbs sampling. In the proposed framework, individual processors perform Gibbs sampling locally, each based on a noisy, inexact view of the global topics. We present perplexity and speedup results for our algorithms when applied to text data sets.
2 A brief review of topic models

Before delving into the details of our distributed algorithms, we first describe the LDA and HDP topic models. In LDA, each document j is modeled as a mixture over K topics, and each topic k is a multinomial distribution, φ_wk, over a vocabulary of W words. For each token i in a document, a topic assignment z_ij is sampled from θ_kj, and the specific word x_ij is drawn from φ_w,z_ij. [Figure 1: Graphical models.]
One can perform collapsed Gibbs sampling [7] by integrating out θ_kj and φ_wk and sampling the topic assignments in the following manner:

$$P(z_{ij} = k \mid z^{\neg ij}, w) \propto \frac{N^{\neg ij}_{wk} + \eta}{N^{\neg ij}_{k} + W\eta}\,\left(N^{\neg ij}_{jk} + \alpha\right). \qquad (1)$$

N_wk denotes the number of word tokens of type w assigned to topic k, while N_jk denotes the number of tokens in document j assigned to topic k.
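The collapsed Gibbs update in Equation 1 can be sketched in a few lines (a minimal single-processor sketch, assuming dense NumPy count arrays; the variable names are ours, not the paper's):

```python
import numpy as np

def gibbs_sweep(z, docs, Nwk, Nk, Njk, alpha, eta, rng):
    """One collapsed Gibbs sweep over all tokens (Equation 1).

    z[j][i]   : current topic of token i in document j
    docs[j][i]: word type of token i in document j
    Nwk[w, k] : word-topic counts; Nk[k] = Nwk.sum(axis=0)
    Njk[j, k] : document-topic counts
    """
    W, K = Nwk.shape
    for j, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[j][i]
            # Remove the token's current assignment from the counts.
            Nwk[w, k_old] -= 1; Nk[k_old] -= 1; Njk[j, k_old] -= 1
            # P(z = k) ∝ (Nwk + eta) / (Nk + W*eta) * (Njk + alpha)
            p = (Nwk[w] + eta) / (Nk + W * eta) * (Njk[j] + alpha)
            k_new = rng.choice(K, p=p / p.sum())
            # Add the token back under its newly sampled topic.
            Nwk[w, k_new] += 1; Nk[k_new] += 1; Njk[j, k_new] += 1
            z[j][i] = k_new
    return z
```

Note that the update touches only count arrays, which is what makes the distributed variants below possible: a processor needs its own counts plus an (approximate) view of everyone else's.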
3 Asynchronous distributed learning for the LDA model

We consider the problem of learning an LDA model with K topics in a distributed fashion where documents are distributed across P processors. Each processor p stores the following local variables: w^p_ij contains the word type for each token i in document j on the processor, and z^p_ij contains the assigned topic for each token. N^¬p_wk is the global word-topic count matrix stored at the processor; this matrix stores counts gathered from other processors during the communication step and does not include the processor's own local counts. N^p_kj is the local document-topic count matrix (derived from z^p), N^p_w is the simple word count on the processor (derived from w^p), and N^p_wk is the local word-topic count matrix (derived from z^p and w^p), which contains only the counts for data on the processor. (To avoid clutter, we write φ_wk or θ_kj to denote the set of all components.)
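The per-processor state just described can be collected into a small container (a sketch; the field names mirror the paper's notation but are otherwise our own):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ProcessorState:
    """Local state held by processor p (notation follows the paper)."""
    w: list                 # w^p: word type for each token in each local document
    z: list                 # z^p: topic assignment for each token
    Nwk_local: np.ndarray   # N^p_wk: word-topic counts from local data only
    Nkj_local: np.ndarray   # N^p_kj: document-topic counts for local documents
    Nwk_other: np.ndarray   # N^¬p_wk: counts gathered from other processors
                            # (excludes this processor's own local counts)

    @property
    def Nw(self):
        # N^p_w: simple word counts on this processor, derivable from N^p_wk
        return self.Nwk_local.sum(axis=1)
```

The key design point is that N^¬p_wk and N^p_wk are kept separate, so a processor can revise its belief about the rest of the world without touching its own counts.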
[5] introduced a parallel version of LDA based on collapsed Gibbs sampling, which we will call Parallel-LDA. In Parallel-LDA, each processor receives 1/P of the documents in the corpus and the z's are globally initialized. In the sampling step, each processor samples its local z^p by using the global topics of the previous iteration. In the synchronization step, the local counts N^p_wk on each processor are aggregated to produce a global set of word-topic counts N_wk. However, Parallel-LDA is a fully synchronous algorithm, since it requires global synchronization at each iteration.
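The synchronization step of Parallel-LDA amounts to summing the per-processor count matrices (a sketch, assuming all local matrices share the same W × K shape):

```python
import numpy as np

def synchronize(local_Nwk_list):
    """Aggregate per-processor word-topic counts into global counts.

    In Parallel-LDA every processor then samples its next sweep
    against this single global Nwk, which is why a global barrier
    is needed at every iteration.
    """
    return np.sum(local_Nwk_list, axis=0)
```

This barrier is exactly what the asynchronous variant removes: in Async-LDA no processor ever needs the full sum, only a running approximation of it.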
For example, some processors may be unavailable, while other processors may be in the middle of a long Gibbs sweep, due to differences in processor speeds. To gain the benefits of asynchronous computing, we introduce an asynchronous distributed version of LDA (Async-LDA) that follows a similar two-step process to that above. Each processor performs a local Gibbs sampling step followed by a step of communicating with another random processor. Recall that N^¬p_wk represents processor p's belief of the counts of all the other processors with which it has already communicated (not including processor p's local counts), while N^p_wk is the processor's local word-topic counts.
Once the inference of z^p is complete (and N^p_wk is updated), the processor finds another finished processor and initiates communication. We also assume in the simplified gossip scheme that a processor can establish communication with every other processor; later in the paper we discuss scenarios that relax these assumptions. In the case where the two processors have never met, the processors simply exchange their local N^p_wk's (their local contribution to the global topic set), and processor p simply adds N^g_wk to its N^¬p_wk, and vice versa.

Algorithm 1 Async-LDA
  for each processor p in parallel do
    repeat
      Sample z^p locally (Equation 1)
      Receive N^g_wk from random proc g
      Send N^p_wk to proc g
      if p has met g before then
        N^¬p_wk ← N^¬p_wk − Ñ^g_wk + N^g_wk
      else
        N^¬p_wk ← N^¬p_wk + N^g_wk
      end if
    until convergence
  end for

Consider the case where two processors meet again.
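The count-merging rule of Algorithm 1 can be written out directly (a sketch; `prev_estimate_from_g` stands for the cached N^g_wk or its proxy Ñ^g_wk, and the name is ours):

```python
import numpy as np

def merge_counts(Nwk_other_p, Nwk_from_g, prev_estimate_from_g=None):
    """Update processor p's view of the rest of the world (N^¬p_wk)
    after receiving processor g's local counts N^g_wk.

    prev_estimate_from_g: the counts attributed to g at the last
    encounter (the cached N^g_wk, or the proxy Ñ^g_wk when memory
    is limited); None if p and g have never met.
    """
    if prev_estimate_from_g is None:
        # First encounter: simply absorb g's local contribution.
        return Nwk_other_p + Nwk_from_g
    # Repeat encounter: first remove g's previous influence so that
    # frequently-meeting processors do not over-count each other.
    return Nwk_other_p - prev_estimate_from_g + Nwk_from_g
```

With an exact cache, repeated meetings are idempotent in the sense that g's contribution inside N^¬p_wk is always g's most recent local counts, never a stale accumulation.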
The processors should not simply swap and add their local counts again; rather, each processor should first remove from N^¬p_wk the previous influence of the other processor during their previous encounter, in order to prevent processors that frequently meet from over-influencing each other. We assume in the general case that a processor does not store in memory the previous counts of all the other processors that it has already met. Since the previous local counts of the other processor were already absorbed into N^¬p_wk and are thus not retrievable, we must take a different approach. In Async-LDA, the processors exchange their N^p_wk's, from which the count of words on each processor, N^p_w, can be derived. Ñ^g_wk acts as a substitute for the N^g_wk that processor p received during their previous encounter. The assumption of limited memory can be relaxed by allowing processors to cache the previous counts of other processors; the cached N^g_wk would then replace Ñ^g_wk.
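The exact construction of the proxy Ñ^g_wk is elided in this excerpt; one construction consistent with the surrounding text is to spread g's known word totals N^g_w over topics in proportion to p's current aggregated counts N^¬p_wk. The sketch below implements that assumption and should not be read as the paper's exact formula:

```python
import numpy as np

def proxy_counts(Nw_g, Nwk_other_p):
    """Build Ñ^g_wk: a stand-in for the counts g sent at the previous
    encounter, using only g's word totals N^g_w (which p can derive)
    and p's own aggregated counts N^¬p_wk.

    Each word's total is distributed over topics proportionally to
    p's current topic beliefs for that word; words p has never seen
    fall back to a uniform split over the K topics.
    """
    K = Nwk_other_p.shape[1]
    row_sums = Nwk_other_p.sum(axis=1, keepdims=True)
    props = np.divide(Nwk_other_p, row_sums,
                      out=np.full_like(Nwk_other_p, 1.0 / K, dtype=float),
                      where=row_sums > 0)
    return Nw_g[:, None] * props
```

By construction the proxy preserves g's per-word totals, so subtracting it and adding g's fresh counts keeps N^¬p_wk non-negative in expectation.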
4 Synchronous and asynchronous distributed learning for the HDP model

Inference for HDP can be performed in a distributed manner as well. Before discussing our asynchronous HDP algorithm, we first describe a synchronous parallel inference algorithm for HDP. We begin with necessary notation for HDPs: γ is the concentration parameter for the top-level Dirichlet Process (DP), α is the concentration parameter for the document-level DP, the β_k's are top-level topic probabilities, and η is the Dirichlet parameter for the base distribution. Each processor maintains local β^p_k parameters, which are augmented when a new topic is locally created. During the Gibbs sampling step, each processor locally samples the z^p topic assignments. In the synchronization step, the local word-topic counts N^p_wk are aggregated into a single matrix of global counts N_wk, and the local β^p_k's are averaged to form a global β_k.
First, the sampling equation for z^p differs from that of Async-LDA, since some probability mass is reserved for new topics:

$$P(z_{pij} = k \mid z_p^{\neg ij}, w_p) \propto
\begin{cases}
\dfrac{(N^{\neg p}_{wk} + N^{p}_{wk})^{\neg ij} + \eta}{(N^{\neg p}_{k} + N^{p}_{k})^{\neg ij} + W\eta}\,\left(N^{\neg ij}_{pjk} + \alpha_p \beta^p_k\right) & \text{if } k \le K_p,\\[6pt]
\alpha_p \beta^p_{\mathrm{new}} / W & \text{if } k \text{ is new.}
\end{cases}$$
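A sketch of this Async-HDP conditional, returning the unnormalized probabilities over the K_p existing topics plus one new topic (array names are ours; the 1/W factor reflects a uniform base distribution over words, which is our reading of the garbled source):

```python
import numpy as np

def async_hdp_probs(w, j, Nwk_sum, Nk_sum, Njk, alpha_p, beta_p, beta_new, eta, W):
    """Unnormalized P(z_pij = k) for Async-HDP.

    Nwk_sum[w, k]: (N^¬p_wk + N^p_wk) with the current token removed
    Nk_sum[k]    : corresponding topic totals
    Njk[j, k]    : N^¬ij_pjk, local document-topic counts
    beta_p[k]    : processor-local top-level topic weights, length K_p
    """
    existing = (Nwk_sum[w] + eta) / (Nk_sum + W * eta) * (Njk[j] + alpha_p * beta_p)
    # Mass reserved for a brand-new topic: base distribution uniform over W words.
    new_topic = alpha_p * beta_new / W
    return np.append(existing, new_topic)
```

Sampling the last index then triggers topic creation: K_p grows by one and β^p is augmented, as described above.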
In Async-HDP, a processor can add new topics to its collection during the inference step. Thus, when two processors communicate, the number of topics on each processor might be different. One way to merge topics is to perform bipartite matching across the two topic sets, using the Hungarian algorithm. However, performing this topic-matching step imposes a computational penalty as the number of topics increases. In our experiments for Async-LDA, Parallel-HDP, and Async-HDP, we do not perform topic matching; we simply combine the topics on different processors based on their topic IDs, and (somewhat surprisingly) the topics gradually self-organize and align.
For our perplexity experiments, parallel processors were simulated in software and run on smaller data sets (KOS, NIPS), to enable us to test the statistical limits of our algorithms. Our simulation features a gossip scheme over a fully connected network that lets each processor communicate with one other random processor at the end of every iteration. In our perplexity experiments, the data set is separated into a training set and a test set. We briefly describe how perplexity is computed for our models.
For Parallel-HDP, perplexity is calculated in the same way as in standard HDP:

$$\log p(x^{\text{test}}) = \sum_{j,w} \log \frac{1}{S} \sum_{s}\sum_{k} \hat{\theta}^s_{jk}\,\hat{\phi}^s_{wk}, \quad \text{where} \quad \hat{\theta}^s_{jk} = \frac{\alpha\beta_k + N^s_{jk}}{\alpha + N^s_{j}}, \qquad \hat{\phi}^s_{wk} = \frac{\eta + N^s_{wk}}{W\eta + N^s_{k}}.$$

To obtain θ̂^s_jk, one must resample the topic assignments on the first half of each document in the test set while holding φ̂^s_wk fixed. Perplexity is evaluated on the second half of each document in the test set, given φ̂^s_wk and θ̂^s_jk. The perplexity calculation for Async-LDA and Async-HDP uses the same formula.
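The perplexity computation described above can be sketched as follows (a simplification that uses a single point estimate of θ̂ and φ̂ rather than an average over S samples):

```python
import numpy as np

def perplexity(test_docs, theta, phi):
    """exp(-log p(x_test) / N_tokens) for a topic model.

    test_docs[j]: list of word ids in the held-out half of document j
    theta[j, k] : estimated document-topic proportions
    phi[w, k]   : estimated topic-word probabilities
    """
    log_p, n_tokens = 0.0, 0
    for j, doc in enumerate(test_docs):
        for w in doc:
            log_p += np.log(theta[j] @ phi[w])  # sum over topics k
            n_tokens += 1
    return np.exp(-log_p / n_tokens)
```

As a sanity check, with uniform θ̂ over K topics and uniform φ̂ over W words, the perplexity is exactly W.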
Since each processor effectively learns a separate local topic model, we can directly compute the perplexity for each processor's local model. In our experiments, we report the average perplexity among processors, and we show error bars denoting the minimum and maximum perplexity among all processors. The variance of perplexities between processors is usually quite small, which suggests that the local topic models learned on each processor are equally accurate.
Async-LDA perplexity and speedup results

Figures 2(a,b) show the perplexities for Async-LDA on the KOS and NIPS data sets for varying numbers of topics. The variation in perplexities between LDA and Async-LDA is slight, and is significantly less than the variation in perplexities as the number of topics K is changed. As a baseline, we ran an experiment where processors never communicate. As the number of processors P was increased from 10 to 1500, the corresponding perplexities increased from 2600 to 5700, dramatically higher than for our Async-LDA algorithm, indicating (unsurprisingly) that processor communication is essential to obtaining good-quality models. As the number of processors increases, the rate of convergence slows, since it takes more iterations for information to propagate to all the processors.
As the data set size grows, the parallel efficiency increases, since communication overhead is dwarfed by the sampling time. In Figure 3(a), we also show the performance of a baseline asynchronous averaging scheme, where global counts are averaged together: N^¬p_wk ← (N^¬p_wk + N^¬g_wk)/d + N^g_wk. Figures 3(a,b), C=5, show the improvement made by letting each processor cache the five most recently seen N^g_wk's. Note that we still assume limited bandwidth: processors do not forward individual cached counts, but instead share a single matrix of combined cache counts that helps the processors achieve a faster burn-in time.
Topics grow at a slower rate in Parallel-HDP, since new topics that are generated locally on each processor are merged together during each synchronization step. In this experiment, while the number of topics is still growing, the perplexity has converged, because the newest topics are small and do not significantly affect the predictive power of the model. On the NIPS data set, there is a slight perplexity degradation, which is partially due to non-optimal parameter settings for α and γ. Topics are generated at a slightly faster rate for Async-HDP than for Parallel-HDP, because Async-HDP takes a less aggressive approach to pruning small topics, since processors need to be careful when pruning topics locally.
In our framework, if new data arrives, we simply assign the new data to a new processor, and then let that new processor enter the "world" of processors with which it can begin to communicate. Our asynchronous approach requires no global initialization or global synchronization step. We performed an experiment for Async-LDA where we introduced 10 new processors (each carrying new data) every 100 iterations. Figure 5(a) shows that perplexity decreases as more processors and data are added. After 1000 iterations, the perplexity of Async-LDA has converged to the standard LDA perplexity.
In reality, a processor may have a document set specialized to only a few topics. We assigned 2 sets of documents to each of 10 processors, so that each processor had a document set that was specialized to 2 topics. [Figure 5(c): Async-LDA on KOS, K=16, where processors have varying amounts of data.] Another situation of interest is the case where the amount of data on each processor varies. KOS was divided into 30 blocks of 100 documents, and these blocks were assigned to 10 processors according to the distribution {7, 6, 4, 3, 3, 2, 2, 1, 1, 1}. We assume that if a processor has k blocks, then it will take k units of time to complete one sampling sweep. Figure 5(c) shows that this load imbalance does not significantly affect the final perplexity achieved.
More generally, the time T^p that each processor p takes to perform Gibbs sampling dictates the communication graph that will ensue (e.g., 5 processors with times T = {10, 12, 14, 19, 20}, where P1, P2, P3 enter the network at time 0 and P4, P5 enter the network at time 34). In our experiments, we assumed a fully connected network of processors and did not focus on other network topologies. After running Async-LDA on both a 10x10 fixed-grid network and a 100-node chain network on KOS, K=16, we have verified that Async-LDA achieves the same perplexity as LDA, as long as caching and forwarding of cached counts occurs between processors.
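The way per-processor sweep times T^p induce a communication pattern can be explored with a small event-driven simulation (a sketch; the pairing policy, where a finished processor idles until the next processor finishes, is our assumption, not the paper's protocol):

```python
import heapq

def gossip_schedule(sweep_times, start_times, horizon):
    """Simulate which processors pair up, given per-processor sweep
    durations and network-entry times. Returns the list of
    (time, p, g) communication events up to `horizon`.
    """
    # (finish_time, processor) for each processor's first sweep
    events = [(start_times[p] + t, p) for p, t in enumerate(sweep_times)]
    heapq.heapify(events)
    waiting, pairs = None, []
    while events:
        t, p = heapq.heappop(events)
        if t > horizon:
            break
        if waiting is None:
            waiting = p  # p idles until some other processor finishes
        else:
            q = waiting
            pairs.append((t, q, p))  # q and p exchange counts at time t
            waiting = None
            # both processors resume sampling right after the exchange
            heapq.heappush(events, (t + sweep_times[p], p))
            heapq.heappush(events, (t + sweep_times[q], q))
    return pairs
```

For the example from the text (T = {10, 12, 14, 19, 20}, with P4 and P5 entering at time 34), the first two exchanges under this policy occur at times 12 and 22, both among the early-entering processors.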
Several earlier papers, e.g. [5], propose parallel algorithms for the collapsed sampler for LDA. The primary distinctions between our work and other work on distributed LDA based on Gibbs sampling are that (a) our algorithms use purely asynchronous communication rather than a global synchronous scheme, and (b) we have also extended these ideas (synchronous and asynchronous) to HDP. Although processors perform local Gibbs sampling based on inexact global counts, our algorithms nonetheless produce solutions that are nearly the same as those of standard single-processor samplers. Our perplexity and speedup results suggest that topic models can be learned in a scalable, asynchronous fashion for a wide variety of situations.
wordName wordTfidf (topN-words)
[('nwk', 0.608), ('processors', 0.366), ('processor', 0.348), ('perplexity', 0.241), ('hdp', 0.228), ('lda', 0.211), ('asynchronous', 0.193), ('async', 0.166), ('kos', 0.147), ('topics', 0.114), ('topic', 0.097), ('perplexities', 0.089), ('counts', 0.08), ('wk', 0.075), ('pubmed', 0.064), ('kj', 0.064), ('gibbs', 0.061), ('parallel', 0.058), ('distributed', 0.056), ('dirichlet', 0.056), ('nyt', 0.055), ('zij', 0.055), ('synchronous', 0.052), ('synchronization', 0.052), ('speedup', 0.048), ('communication', 0.048), ('collapsed', 0.048), ('documents', 0.047), ('ij', 0.046), ('global', 0.045), ('nw', 0.044), ('sampling', 0.041), ('document', 0.04), ('cache', 0.039), ('count', 0.036), ('gossip', 0.032), ('cached', 0.03), ('njk', 0.028), ('newman', 0.028), ('local', 0.027), ('jk', 0.026), ('memory', 0.026), ('iteration', 0.025), ('nips', 0.023), ('proc', 0.021), ('xij', 0.021), ('locally', 0.021), ('communicate', 0.021), ('token', 0.021), ('imagine', 0.02), ('resample', 0.02), ('converges', 0.02), ('vocabulary', 0.019), ('wp', 0.019), ('forwarding', 0.018), ('nallapati', 0.018), ('newscast', 0.018), ('npjk', 0.018), ('wzij', 0.018), ('zpij', 0.018), ('iterations', 0.018), ('em', 0.018), ('send', 0.018), ('met', 0.017), ('sampler', 0.017), ('network', 0.017), ('enter', 0.017), ('manner', 0.016), ('gam', 0.016), ('mw', 0.016), ('word', 0.016), ('corpus', 0.016), ('balls', 0.016), ('mixture', 0.015), ('algorithms', 0.015), ('smyth', 0.015), ('mimno', 0.015), ('bandwidth', 0.014), ('distribute', 0.014), ('inexact', 0.014), ('reserved', 0.014), ('scenarios', 0.014), ('step', 0.013), ('middle', 0.013), ('globally', 0.013), ('exchange', 0.013), ('nished', 0.013), ('sweep', 0.013), ('asuncion', 0.013), ('convergence', 0.013), ('tokens', 0.012), ('zp', 0.012), ('perform', 0.012), ('latent', 0.012), ('end', 0.012), ('meet', 0.012), ('ps', 0.012), ('wolfe', 0.012), ('across', 0.012), ('aggregated', 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 28 nips-2008-Asynchronous Distributed Learning of Topic Models
Author: Padhraic Smyth, Max Welling, Arthur U. Asuncion
Abstract: Distributed learning is a problem of fundamental interest in machine learning and cognitive science. In this paper, we present asynchronous distributed learning algorithms for two well-known unsupervised learning frameworks: Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Processes (HDP). In the proposed approach, the data are distributed across P processors, and processors independently perform Gibbs sampling on their local data and communicate their information in a local asynchronous manner with other processors. We demonstrate that our asynchronous algorithms are able to learn global topic models that are statistically as accurate as those learned by the standard LDA and HDP samplers, but with significant improvements in computation time and memory. We show speedup results on a 730-million-word text corpus using 32 processors, and we provide perplexity results for up to 1500 virtual processors. As a stepping stone in the development of asynchronous HDP, a parallel HDP sampler is also introduced. 1
2 0.16877989 229 nips-2008-Syntactic Topic Models
Author: Jordan L. Boyd-graber, David M. Blei
Abstract: We develop the syntactic topic model (STM), a nonparametric Bayesian model of parsed documents. The STM generates words that are both thematically and syntactically constrained, which combines the semantic insights of topic models with the syntactic information available from parse trees. Each word of a sentence is generated by a distribution that combines document-specific topic weights and parse-tree-specific syntactic transitions. Words are assumed to be generated in an order that respects the parse tree. We derive an approximate posterior inference method based on variational methods for hierarchical Dirichlet processes, and we report qualitative and quantitative results on both synthetic data and hand-parsed documents. 1
3 0.14267674 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
Author: Simon Lacoste-julien, Fei Sha, Michael I. Jordan
Abstract: Probabilistic topic models have become popular as methods for dimensionality reduction in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood or Bayesian methods. In this paper, we discuss an alternative: a discriminative framework in which we assume that supervised side information is present, and in which we wish to take that side information into account in finding a reduced dimensionality representation. Specifically, we present DiscLDA, a discriminative variation on Latent Dirichlet Allocation (LDA) in which a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroups document classification task and show how our model can identify shared topics across classes as well as class-dependent topics.
4 0.12106465 3 nips-2008-A Massively Parallel Digital Learning Processor
Author: Hans P. Graf, Srihari Cadambi, Venkata Jakkula, Murugan Sankaradass, Eric Cosatto, Srimat Chakradhar, Igor Dourdanovic
Abstract: We present a new, massively parallel architecture for accelerating machine learning algorithms, based on arrays of vector processing elements (VPEs) with variable-resolution arithmetic. Groups of VPEs operate in SIMD (single instruction multiple data) mode, and each group is connected to an independent memory bank. The memory bandwidth thus scales with the number of VPEs, while the main data flows are local, keeping power dissipation low. With 256 VPEs, implemented on two FPGAs (field programmable gate array) chips, we obtain a sustained speed of 19 GMACS (billion multiplyaccumulate per sec.) for SVM training, and 86 GMACS for SVM classification. This performance is more than an order of magnitude higher than that of any FPGA implementation reported so far. The speed on one FPGA is similar to the fastest speeds published on a Graphics Processor for the MNIST problem, despite a clock rate that is an order of magnitude lower. Tests with Convolutional Neural Networks show similar compute performances. This massively parallel architecture is particularly attractive for embedded applications, where low power dissipation is critical. 1 I n trod u cti on Machine learning demands higher and higher compute-performance, but serial processors are not improving that much anymore - at least not as quickly as they used to. Mainstream processor development is moving to multi-core systems, using shared memory technology to hide the parallel nature of the processors. But shared memory technology does not scale to hundreds or thousands of cores. In order to reach such levels of parallelization alternative approaches have to be developed. Massively parallel general-purpose computers had limited success so far, because of difficulties programming these machines, and they remain a niche market, mostly in highperformance computing. 
Yet processors specialized for certain application domains, such as graphics processors or routing processors 1, have been parallelized to several hundred cores and are successful mass products. They improve performance over general-purpose processors by focusing on a few key algorithmic elements, yet still maintain enough flexibility that they can be programmed for a variety of applications. We explore in this paper if a similar approach can lead to efficient machine learning processors. 1 e.g. Nvidia, Quadro FX 5600 graphics processor; Cisco, CRS-1 routing processor Several processors optimized for machine learning, in particular for neural networks, were developed during the 1980’s and 90’s. Examples are the Synapse-1 architecture [1], or the Connectionist Network Supercomputer, CNS1 [2]. Recently there has been less activity in this field, but some accelerators are sold today for specific applications, such as the Axeon [3] processor for power train control of cars. Beside digital processors a large number of analog circuits were built, emulating neural network structures. Extremely high performance with low power dissipation is achievable, see e.g. [4][5], but these networks have little flexibility. SVM implementations on FPGA have been demonstrated in recent years [6-8], yet reached only low compute-performances. All machine learning processors had only limited success so far, indicating how difficult it is to find a good combination of performance, flexibility, price and ease of use. An important consideration is that many applications of machine learning, such as video analysis, data mining, or personalization of services, show the most promise in embedded systems. Embedded learning requires high compute performance while dissipating little power, a combination that is difficult to achieve, and so far required application specific IC (ASIC). 
Our aim is to develop architectures that meet the requirements for embedded learning, but are programmable and therefore can be used in a wide range of applications. With the goal of analyzing different architectures we designed a development and testing environment where the parallel computation is mapped onto FPGA’s. Initially this system was intended only for experimentation, but its performance is so high that this platform is useful in its own right as accelerator for high-performance systems. While the experiments shown here emphasize high performance, the architecture has been designed from the start for low power dissipation. The main features for achieving this goal are: low-resolution arithmetic, keeping the main data flow local, low operating frequencies, and a modular design, so that unused parts can be powered down dynamically. All results shown here are from the test platform; migration to lowpower FPGA or chip designs are done in a later stage. 2 Al gori th ms - A ri th meti c - A rch i te ctu re For a substantial improvement over a general purpose processor, the algorithms, the arithmetic units, as well as the architecture have to be optimized simultaneously. This is not just an exercise in hardware design, but algorithms and their software implementations have to be developed concurrently. Most machine learning algorithms have not been developed with parallelization in mind. Therefore, we first need to find good parallel versions, identify their performance bottlenecks, and then extract common computational patterns that can be mapped into accelerator hardware. 2.1 Algorithms Characteristic for machine learning is that large amounts of data need to be processed, often with predictable data access patterns and no dependency between operations over large segments of the computation. This is why data-parallelization can often provide good accelerations on multi-core chips, clusters of machines, or even on loosely coupled networks of machines. 
Using MapReduce, speedups linear with the number of processors have been reported in [9] for several machine learning algorithms. Up to 16 cores were tested, and simulations indicate good scaling to more processors in some cases. Many algorithms, such as KNN, K-means clustering, LVQ, and Neural Networks can be reduced to forms where the computation is dominated by vector-matrix multiplications, which are easily parallelizable. For Convolutional Neural Networks (CNN) the data flow can be complex, yet the core of the computation is a convolution, an operation which has been studied extensively for parallel implementations. For Support Vector Machines (SVM), several parallel algorithms were described, but most saturate quickly for more than 16 processors. Scaling to larger numbers of processors has been demonstrated, applying MapReduce on a graphics processor with 128 cores [10]. Another implementation on a cluster of 48 dual-core machines (with 384 MMX units) [11] scales even super-linearly, and, according to simulations, scales to thousands of cores. Based on this analysis it is clear that vector-matrix and matrix-matrix multiplications for large vector dimensionalities and large numbers of vectors must be handled efficiently. Yet this alone is not sufficient since data access patterns vary greatly between algorithms. We analyze this here in more detail for SVM and CNN. These algorithms were chosen, because they are widely used for industrial applications and cover a broad range of computation, I/O, and memory requirements. The characteristics of the SVM training are summarized in Table 1. We use an approach similar to the one described in [11] to split different parts of the computation between a host CPU and the FPGA accelerator. For large dimensions d of the vectors the calculation of the columns of the kernel matrix dominates by far. This is needed to update the gradients, and in the present implementation, only this part is mapped onto the FPGA. 
If the dimensionality d is smaller than around 100, operations 2 and 5 can become bottlenecks and should also be mapped onto the accelerator. A challenge is that for each kernel computation a new data vector has to be loaded into the processor, leading to very high I/O requirements. We consider here dimensions of 10^4 - 10^5 and numbers of training data of 10^5 - 10^7, easily resulting in Gigabytes that need to be transferred to the processors at each iteration.

  #  Operation                         Computation   IO             Unit
  1  Initialize all αx, Gx             2n            2n             CPU
  2  Do
  3    Find working set αi, αj         I * 2n        I * 2n         CPU
  4    Update αi, αj                   I * 10        I * 2          CPU
  5    Get 2 columns of kernel matrix  I * 2nd       I * (2d+2dn)   FPGA
  6    Update gradients Gx             I * n         I * n          CPU
     While not converged

Table 1: Compute and IO requirements of each step of SVM training (SMO algorithm). n: number of training data; d: dimension of the vectors; G: gradients; α: support vector factors; I: number of iterations. The last column indicates whether the execution happens on the host CPU or the accelerator FPGA. It is assumed that the kernel computation requires a dot product between vectors (e.g. rbf, polynomial, tanh kernels).

Neural network algorithms are essentially sequences of vector-matrix multiplications, but networks with special connectivity patterns, such as convolutional networks, have very different IO characteristics than fully connected networks. Table 2 shows the computation and IO requirements for scanning several convolution kernels over one input plane. A full network requires multiples of these operations for one layer, with nonlinearities between layers. We map all operations onto the FPGA accelerator, since intermediate results are re-used right away. The most significant difference between the SVM and the CNN is the Compute/IO ratio: SVM: ~1; CNN: ~L*k^2 > 100. The requirements of these two algorithms are therefore very different, and handling both cases efficiently is quite a challenge for an architecture design.
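The two ratios quoted above can be checked directly against the per-iteration counts in Tables 1 and 2. The sizes below are illustrative choices, not measurements from the paper:

```python
# SVM training (Table 1): per iteration, computing 2 kernel columns costs
# about 2*n*d multiply-accumulates while roughly 2*d + 2*d*n words move on
# and off the accelerator, so Compute/IO is just under 1.
n, d = 100_000, 1_000                      # illustrative problem sizes
svm_compute = 2 * n * d                    # MACs per iteration
svm_io = 2 * d + 2 * d * n                 # words per iteration
svm_ratio = svm_compute / svm_io           # -> just under 1

# At 2 bytes per word this also shows why each iteration can move Gigabytes:
bytes_per_iter = 2 * svm_io                # ~0.4 GB for these sizes

# CNN forward pass (Table 2): scanning L kernels of size k*k over an n*m
# input plane costs n*m*L*k^2 MACs but only L*k^2 + 2*n*m words of IO.
L, k, nm = 8, 7, 256 * 256                 # illustrative layer sizes
cnn_compute = nm * L * k * k
cnn_io = L * k * k + 2 * nm
cnn_ratio = cnn_compute / cnn_io           # ~ L*k^2 / 2, well over 100
```

For these sizes the SVM stays IO bound while the CNN is strongly compute bound, which is the tension the architecture has to resolve.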
  #  Operation                Computation       IO       Unit
  1  Load L kernels                             L * k^2  FPGA
     For all input pixels:                               FPGA
  2    Shift in new pixel                       n * m    FPGA
  3    Multiply kernels       n * m * L * k^2            FPGA
  4    Shift out result                         n * m    FPGA

Table 2: Compute and IO requirements for the CNN computation (forward pass), where L kernels of size k*k are scanned simultaneously over an input plane of size n*m. This is representative of implementations with kernel unrolling (kernel pixels processed in parallel). Internal shifts, computation of the non-linearity, and border effects are not shown.

2.2 Arithmetic

Hardware can be built much more compactly and run with lower power dissipation if it uses fixed-point instead of floating-point operations. Fortunately, many learning algorithms tolerate a low resolution in most of the computations. This has been investigated extensively for neural networks [12][13], but less so for other learning algorithms. Learning from data is inherently a noisy process, because we see only a sparse sampling of the true probability distributions. A different type of noise is introduced in gradient descent algorithms when only a few training data are used at a time to move the optimization forward iteratively. This noise is particularly pronounced for stochastic gradient descent. There is no point in representing noisy variables with high resolution, and it is therefore a property inherent to many algorithms that low-resolution computation can be used. It is important not to confuse this tolerance to low resolution with the resolution required to avoid numeric instabilities. Some of the computations have to be performed at high resolution, in particular for variables that are updated incrementally. They maintain the state of the optimization and may change in very small steps. But usually by far the largest part of the computation can be executed at low resolution. The key is that the hardware is flexible enough to take advantage of reduced resolution while handling high resolution where necessary.
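A tiny illustration (not the paper's simulator) of why incrementally updated state variables need high resolution: in a coarse fixed-point format, an update smaller than one quantization step is lost entirely, so the optimization state never moves.

```python
def to_fixed(x, frac_bits):
    """Round x onto a fixed-point grid with 2**-frac_bits resolution."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

# Accumulate 10,000 tiny gradient-like steps of 1e-5 each.
step, n_steps = 1e-5, 10_000
exact = step * n_steps                     # 0.1

# Low-resolution state (8 fractional bits): each step rounds away to 0.
lo = 0.0
for _ in range(n_steps):
    lo = to_fixed(lo + step, frac_bits=8)

# High-resolution state (24 fractional bits): the steps survive.
hi = 0.0
for _ in range(n_steps):
    hi = to_fixed(hi + step, frac_bits=24)
```

The 8-bit accumulator never leaves zero, while the 24-bit one tracks the true sum closely; this mirrors the finding below that weights and gradients need far more bits than the bulk of the multiplications.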
  Problem  Kernel: Float                 Kernel: 16 bit fixed point   F-sc. (4b in)
           Obj. f.    # SV    F-score    Obj. f.   # SV    F-score
  Adult    31,930.77  11,486  77.58      31,930.1  11,490  77.63      NA
  Forest   653,170.7  49,333  98.29      652,758   49,299  98.28      NA
  MNIST    4,960.13   6,172   99.12      4,959.64  6,166   99.11      99.11
  NORB     1,243.71   3,077   93.34      1,244.76  3,154   93.26      92.78

Table 3: Comparison of the results of SVM training when the kernels are represented with floating-point numbers (32 or 64 bits; left half) and with 16 bit fixed point (right half). The last column shows the results when the resolution of the training data is reduced from 8 bit to 4 bit. For NORB this reduces the accuracy; all other differences in accuracy are not significant. All are two-class problems: Adult: n=32,562, d=122; Forest: n=522,000, d=54 (2 against the rest); MNIST: n=60,000, d=784 (odd vs. even); NORB: n=48,560, d=5,184.

We developed a simulator that allows running the training algorithms with various resolutions for each of the variables. A few examples for SVM training are shown in Table 3. Reducing the resolution of the kernel values from double or float to a 16 bit fixed-point representation does not affect the accuracy on any of the problems. Therefore all the multiplications in the dot products for the kernel computation can be done at low resolution (4-16 bit in the factors), but the accumulator needs sufficient resolution to avoid over/underflow (48 bit). Once the calculation of the kernel value is completed, it can be reduced to 16 bit. A low resolution of 16 bit is also tolerable for the α values, but a high resolution is required for the gradients (double). For Neural Networks, including CNN, several studies have confirmed that states and gradients can be kept at low resolution (<16 bit), but the weights must be maintained at high resolution (float) (see e.g. [12]). In our own evaluations, 24 bits in the weights tend to be sufficient.
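The resolution split described above (low-resolution factors, wide exact accumulator, 16-bit final kernel value) can be sketched as follows. The quantization scheme and data are illustrative assumptions, not the paper's exact arithmetic:

```python
def quantize(v, bits):
    """Map values in [0, 1) to unsigned integers of the given bit width."""
    levels = 1 << bits
    return [min(round(x * levels), levels - 1) for x in v]

def fixed_point_dot(a, b, in_bits=4, out_bits=16):
    """Dot product with low-resolution factors and a wide accumulator.

    The products are accumulated exactly (a Python int stands in for the
    48-bit hardware accumulator); only the final value is reduced to
    out_bits of resolution by rescaling and rounding.
    """
    qa, qb = quantize(a, in_bits), quantize(b, in_bits)
    acc = sum(x * y for x, y in zip(qa, qb))      # exact accumulation
    scale = (1 << in_bits) ** 2                   # undo the two factor scalings
    return round(acc / scale * (1 << out_bits)) / (1 << out_bits)

a = [0.1, 0.5, 0.9, 0.3]
b = [0.8, 0.2, 0.4, 0.6]
approx = fixed_point_dot(a, b)
exact = sum(x * y for x, y in zip(a, b))
# With only 4-bit factors the result is coarse but close to the float value.
```

Only the accumulator and the per-variable bit widths matter here; as in Table 3, pushing the factors down to 4 bits costs little accuracy as long as the accumulation itself is exact.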
Once the network is trained, low resolutions can be used for the weights as well during classification (<16 bit).

2.3 Architecture

Figure 1: Left: Schematic of the architecture with the main data flows; on one FPGA, 128 VPE are configured into four SIMD groups; L-S: load-store units. Right: Picture of an FPGA board; in our experiments one or two of them are used, connected via PCI bus to a host CPU.

Based on the analysis above, it is clear that the architecture must be optimized for processing massive amounts of data with relatively low precision. Most of the time, data access patterns are predictable and data are processed in blocks that can be stored contiguously. This type of computation is well suited for vector processing, and simple vector processing elements (VPE) with fixed-point arithmetic can handle the operations. Since large blocks of data are typically processed with the same operation, groups of VPE can work in SIMD (single instruction, multiple data) mode. Algorithms must then be segmented to map the high-volume, low-precision parts onto the vector accelerators and the parts requiring high-precision arithmetic onto the CPU. The most important design decision is the organization of the memory. Most memory accesses are done in large blocks, so that the data can be streamed, making complex caching unnecessary. This is fortunate, since the amounts of data to be loaded onto the processor are so large that conventional caching strategies would be overwhelmed anyway. Because the blocks tend to be large, a high data bandwidth is crucial, but latency for starting a block transfer is less critical. Therefore we can use regular DDR memories and still get high IO rates. This led to the design shown schematically in Figure 1, where independent memory banks are connected via separate IO ports to each group of 32 VPE. By connecting multiple of the units shown in Figure 1 to a CPU, this architecture scales to larger numbers of VPE.
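A back-of-envelope check of what the banked, streamed DDR design buys. The figures (DDR2 at 230 MHz with double-data-rate transfers, 32-bit-wide ports, 6 ports) are taken from the board description elsewhere in the paper and treated here as assumptions:

```python
# Aggregate streaming bandwidth of banked DDR access (rough estimate).
ddr_clock_hz = 230e6           # assumed DDR2 clock from this paper's board
transfers_per_cycle = 2        # double data rate
port_width_bytes = 4           # 32-bit port
ports = 6                      # independent banked ports per chip

per_port = ddr_clock_hz * transfers_per_cycle * port_width_bytes   # ~1.84 GB/s
aggregate = per_port * ports                                       # ~11 GB/s
# Close to the ~12 GB/s per chip quoted for this design; burst-mode
# streaming is what lets each port run near its peak rate.
```

The point of the banked organization is exactly that this aggregate figure scales with the number of VPE groups, rather than being shared through one cache hierarchy.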
Parallel data IO and parallel memory access scale simultaneously with the number of parallel cores, which is why we refer to this as the P3 (P-cube) architecture. Notice also that the main data flow is only local, between a group of VPE and its own memory block. Avoiding movement of data over long distances is crucial for low power dissipation. How far this architecture can reasonably scale with one CPU depends on the algorithm, the amount of data, and the vector dimensionality (see below). A few hundred VPE per CPU have provided good accelerations in all our tests, and much higher numbers are possible with multi-core CPUs and faster CPU-FPGA connections.

3 Implementation of the P3 Architecture

This architecture fits surprisingly well onto some of the recent FPGA chips that are available with several hundred Digital Signal Processing (DSP) units and over 1,000 IO pins for data transfers. The boards used here each contain one Xilinx Virtex 5 LX330T-2 FPGA coupled to 4 independent DDR2 SDRAM banks with a total of 1GB, and 2 independent 4MB SSRAM memory banks (commercial board from AlphaData). One FPGA chip contains 192 DSP units with a maximum speed of 550MHz, which corresponds to a theoretical compute performance of 105.6 GMACS (18 bit and 25 bit operands). There is a total of 14 Mbit of on-chip memory, and the chip incorporates 960 pins for data IO. Due to routing overhead, not all DSP units can be used, and the actual clock frequencies tend to be considerably lower than what is advertised for such chips (typically 230MHz or less for our designs). Nevertheless, we obtain high performance because we can use a large number of DSP units to execute the main computation. The main architectural features are:
• Parallel processing (on one chip): 128 VPE (hardware DSP) are divided into 4 blocks of 32, each group controlled by one sequencer with a vector instruction set.
• Custom precision: Data are represented with 1 to 16 bit resolution.
Higher resolutions are possible by operating multiple DSP as one processor.
• Overlapping computation and communication: CPU-FPGA communication is overlapped with the FPGA computation.
• Overlapping memory operations with computation: All loads and stores from the FPGA to off-chip memory are performed concurrently with computations.
• High off-chip memory bandwidth: 6 independent data ports, each 32 bits wide, access banked memories concurrently (12GB/s per chip).
• Streaming data flow, simple access patterns: Load/store units are tailored for streaming input and output data and for simple, bursty access patterns. Caching is done under application control with dual-port memory on chip.
• Load/store with (de)compression: To increase the effective IO bandwidth, the load/store units provide compression and decompression in hardware.

Figure 2 shows the configuration of the VPEs for the vector dot-product computation used for SVM training and classification. For training, the main computation is the calculation of one column of the kernel matrix. One vector is pre-fetched and stored in on-chip memory. All other vectors are streamed in from off-chip memory banks 1-4. Since this is a regular and predictable access pattern, we can utilize burst mode, achieving a throughput of close to one memory word per cycle. Nevertheless, the speed is IO bound. When several vectors can be stored on-chip, as is the case for classification, the speed becomes compute bound.

Figure 2: Architecture for vector dot-product computation. The left side shows a high-level schematic with the main data flow. The data are streamed from memory banks 1-4 to the VPE arrays, while memory banks 5 and 6 alternately receive results or stream them back to the host. The right side shows how a group of VPE is pipelined to improve clock speed.

The operation for SVM training on the FPGA corresponds to a vector-matrix multiplication, and the one for classification to a matrix-matrix multiplication.
Therefore the configuration of Figure 2 is useful for many other algorithms as well, wherever operations with large vectors and matrices are needed, such as Neural Networks. We implemented a specialized configuration for Convolutional Neural Networks for more efficiency and lower power dissipation: the VPE are daisy-chained and operate as a systolic array. In this way we can take advantage of the high computation-to-IO ratio (Table 2) to reduce the data transfers from memory.

4 Evaluations

We evaluated SVM training and classification on the NORB and MNIST problems, the latter with up to 2 million training samples (data from [11]). Both are benchmarks with vectors of high dimensionality, representative of applications in image and video analysis. The computation is split between CPU and FPGA as indicated in Table 1. The DDR2 memory banks are clocked at 230MHz, providing double that rate for data transfers. The data may be compressed to save IO bandwidth; on the FPGA they are decompressed first and distributed to the VPE. In our case, a 32 bit word contains eight 4-bit vector components. Four 32 bit words are needed to feed all 32 VPEs of a group; therefore clocking the VPE faster than 115MHz does not improve performance. A VPE executes a multiply plus add operation in one clock cycle, resulting in a theoretical maximum of 14.7 GMACS per chip. The sustained compute rate is lower, about 9.4 GMACS, due to overhead (see Table 4). The computation on the host CPU overlaps with that on the FPGA and has no effect on the speed in the experiments shown here. For classification the VPE can be clocked higher, at 230 MHz, and by using 4-bit operands we can execute 2 multiply-accumulates simultaneously on one DSP, resulting in a speed that is more than four times higher: a sustained 43.0 GMACS, limited by the number and speed of the VPE.
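The nibble packing and the peak-rate arithmetic from this paragraph can be sketched as follows. The nibble ordering within a word is an assumption made for illustration; the quoted 14.7 GMACS follows directly from 128 VPE at 115 MHz:

```python
def pack_nibbles(components):
    """Pack eight 4-bit vector components into one 32-bit word
    (little-nibble-first order, an illustrative assumption)."""
    assert len(components) == 8 and all(0 <= c < 16 for c in components)
    word = 0
    for i, c in enumerate(components):
        word |= c << (4 * i)
    return word

def unpack_nibbles(word):
    """Recover the eight 4-bit components from a 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

comps = [1, 2, 3, 4, 5, 6, 7, 8]
assert unpack_nibbles(pack_nibbles(comps)) == comps

# Peak-rate arithmetic for training: 128 VPE, one MAC per cycle, 115 MHz.
peak_gmacs = 128 * 115e6 / 1e9        # = 14.72, the "14.7 GMACS" above
```

Packing is what lets four 32-bit memory words per cycle feed all 32 VPEs of a group with 4-bit operands.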
Adding a second FPGA card doubles the speed, showing few saturation effects yet, but for more FPGAs per CPU there will be saturation (see Fig. 3). The compute speed in GMACS obtained for NORB is almost identical.

  n    Iterations   CPU              CPU+MMX            CPU+FPGA          CPU+2 FPGA
                    time     speed   time       speed   time      speed   time      speed
  60k  8,000        754 s    0.5     240 s      1.57    40 s      9.42    21 s      17.9
  2M   266,900      --       --      531,534 s  1.58    88,589 s  9.48    48,723 s  17.2

Table 4: Training times and average compute speed for SVM training. Systems tested: CPU, Opteron, 2.2GHz; CPU using MMX; CPU with one FPGA; CPU with two FPGA boards. Results are shown for training sizes of 60k and 2M samples. Compute speed is in GMACS (kernel computations only). Training algorithm: SMO with second-order working set selection.

Parallelizations of SVM training have been reported recently for a GPU [10] and for a cluster [11], both using the MNIST data. Different stopping bounds were used in [10] than here and in [11]. Nevertheless, a comparison of the compute performance is possible, because the number of iterations allows us to compute the average GMACS for the kernel computations. As can be seen in Table 5, a single FPGA is similar in speed to a GPU with 128 stream processors, despite a clock rate that is about 5.5 times lower for I/O and 11 times lower for the VPE. The cluster with 384 MMX units is about 6 times faster than one FPGA with 128 VPE, but dissipates about two orders of magnitude more electric power. For the FPGA this calculation includes only the computation of the kernel values, while the part on the CPU is neglected. This is justified for this study, because the rest of the calculations can be mapped onto the FPGA as well and will increase the power dissipation only minimally.
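The power argument can be made concrete as performance per watt from the figures in Table 5. The cluster's ">1 kW" is taken here as exactly 1 kW, so its computed efficiency is an upper bound:

```python
# Performance per watt (GMACS / W) from Table 5's figures.
systems = {
    "CPU (Opteron)":       (0.5, 40.0),
    "GPU (from [10])":     (7.4, 80.0),
    "Cluster (from [11])": (54.0, 1000.0),  # ">1 kW" taken as 1 kW, a lower bound on power
    "FPGA":                (9.4, 9.0),
}
efficiency = {name: gmacs / watts for name, (gmacs, watts) in systems.items()}
# The FPGA delivers roughly 1 GMACS/W, more than an order of magnitude above
# the GPU and the cluster, and nearly two orders above the CPU.
```

This is the quantitative form of the efficiency claim made for embedded, low-power operation in the conclusions.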
  Processor             Number     Clock      Operand       Power        Average
                        of cores   speed      type          dissipation  compute speed
  CPU (Opteron)         1          2.2 GHz    float         40 W         0.5 GMACS
  GPU (from [10])       128        1.35 GHz   float         80 W         7.4 GMACS
  Cluster (from [11])   384        1.6 GHz    byte          > 1 kW       54 GMACS
  FPGA                  128        0.12 GHz   4-bit nibble  9 W          9.4 GMACS

Table 5: Comparison of performance for SVM training (MNIST data). GPU: Nvidia 8800 GTX. Cluster: 48 dual-core CPUs (Athlon), 384 MMX units. The GPU was trained with 60k samples ([10], table 2, second order); the cluster trained with 2 million samples.

Figure 3: Acceleration of SVM training as a function of the number of VPE. MNIST: n=2,000,000, d=784; NORB: n=48,560, d=5,184. The points for 128 and 256 VPE are experimental; the higher ones are simulations. Curves MNIST, NORB: multiple FPGAs are attached to one CPU. Curve MNIST C: each FPGA is attached to a separate host CPU.

Scaling of the acceleration with the number of VPEs is shown in Figure 3. The reference speed is that of one FPGA attached to a CPU. The evaluation has been done experimentally for 128 and 256 VPEs, and beyond that with a simulator. The onset of saturation depends on the dimensionality of the vectors, but to a much lesser extent on the number of training vectors (up to the limit of the memory on the FPGA card). MNIST saturates for more than two FPGAs because then the CPU and FPGA computation times become comparable. For the larger vectors of NORB (d=5,184) this saturation starts to become noticeable for more than 4 FPGAs. Alternatively, a system can be scaled by grouping multiple CPUs, each with one attached FPGA accelerator. Then the scaling follows a linear or even super-linear acceleration (MNIST C) to several thousand VPE. If the CPUs are working in a cluster arrangement, the scaling is similar to the one described in [11]. For convolutional neural networks, the architecture of Figure 2 is modified to allow a block of VPE to operate as a systolic array.
In this way convolutions can be implemented with minimal data movements. In addition to the convolution, sub-sampling, the non-linear functions, and the logistics to handle multiple layers with arbitrary numbers of kernels in each layer are also done on the FPGA. Four separate blocks of such convolvers are packed onto one FPGA, using 100 VPE. Clocked at 115MHz, this architecture provides a maximum of 11.5 GMACS. Including all the overhead, the sustained speed is about 10 GMACS.

5 Conclusions

By systematically exploiting characteristic properties of machine learning algorithms, we developed a new massively parallel processor architecture that is very efficient and can be scaled to thousands of processing elements. The implementation demonstrated here is more than an order of magnitude higher in performance than previous FPGA implementations of SVM or CNN. For the MNIST problem it is comparable to the fastest GPU implementations reported so far. These results underline the importance of flexibility over raw compute speed for massively parallel systems. The flexibility of the FPGA allows more efficient routing and packing of the data and the use of computations with the lowest resolution an algorithm permits. The results of Table 5 indicate the potential of this architecture for low-power operation in embedded applications.

References

[1] Ramacher, et al. (1995) Synapse-1: A high-speed general purpose parallel neurocomputer system. In Proc. 9th Intl. Symposium on Parallel Processing (IPPS'95), pp. 774-781.
[2] Asanovic, K., Beck, Feldman, J., Morgan, N. & Wawrzynek, J. (1994) A Supercomputer for Neural Computation. Proc. IEEE Intl. Joint Conference on Neural Networks, pp. 5-9, Orlando, Florida.
[3] Neil, P. (2005) Combining hardware with a powerful automotive MCU for powertrain applications. In Industrial Embedded Resource Guide, p. 88.
[4] Korekado, et al.
(2003) A Convolutional Neural Network VLSI for Image Recognition Using Merged/Mixed Analog-Digital Architecture. In Proc. 7th KES 2003, Oxford, pp. 169-176.
[5] Murasaki, M., Arima, Y. & Shinohara, H. (1993) A 20 Tera-CPS Analog Neural Network Board. In Proc. Int. Joint Conf. Neural Networks, pp. 3027-3030.
[6] Pedersen, R. & Schoeberl, M. (2006) An Embedded Support Vector Machine. WISE 2006.
[7] Dey, S., Kedia, M., Agarwal, N. & Basu, A. Embedded Support Vector Machine: Architectural Enhancements and Evaluation. In Proc. 20th Int. Conf. VLSI Design.
[8] Anguita, D., Boni, A. & Ridella, S. (2003) A Digital Architecture for Support Vector Machines: Theory, Algorithm, and FPGA Implementation. IEEE Trans. Neural Networks, 14/5, pp. 993-1009.
[9] Chu, C., Kim, S., Lin, Y., Yu, Y., Bradski, G., Ng, A. & Olukotun, K. (2007) Map-Reduce for Machine Learning on Multicore. Advances in Neural Information Processing Systems 19, MIT Press.
[10] Catanzaro, B., Sundaram, N. & Keutzer, K. (2008) Fast Support Vector Machine Training and Classification on Graphics Processors. Proc. 25th Int. Conf. Machine Learning, pp. 104-111.
[11] Durdanovic, I., Cosatto, E. & Graf, H. (2007) Large Scale Parallel SVM Implementation. In L. Bottou, O. Chapelle, D. DeCoste, J. Weston (eds.), Large Scale Kernel Machines, pp. 105-138, MIT Press.
[12] Simard, P. & Graf, H. (1994) Backpropagation without Multiplication. In J. Cowan, G. Tesauro, J. Alspector (eds.), Neural Information Processing Systems 6, pp. 232-239, Morgan Kaufmann.
[13] Savich, A., Moussa, M. & Areibi, S. (2007) The Impact of Arithmetic Representation on Implementing MLP-BP on FPGAs: A Study. IEEE Trans. Neural Networks, 18/1, pp. 240-252.
5 0.097428732 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
Author: Indraneel Mukherjee, David M. Blei
Abstract: Hierarchical probabilistic modeling of discrete data has emerged as a powerful tool for text analysis. Posterior inference in such models is intractable, and practitioners rely on approximate posterior inference methods such as variational inference or Gibbs sampling. There has been much research in designing better approximations, but there is yet little theoretical understanding of which of the available techniques are appropriate, and in which data analysis settings. In this paper we provide the beginnings of such understanding. We analyze the improvement that the recently proposed collapsed variational inference (CVB) provides over mean field variational inference (VB) in latent Dirichlet allocation. We prove that the difference in the tightness of the bound on the likelihood of a document decreases as O(k − 1) + log m/m, where k is the number of topics in the model and m is the number of words in a document. As a consequence, the advantage of CVB over VB is lost for long documents but increases with the number of topics. We demonstrate empirically that the theory holds, using simulated text data and two text corpora. We provide practical guidelines for choosing an approximation. 1
6 0.068492472 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems
7 0.064021699 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
8 0.063835032 4 nips-2008-A Scalable Hierarchical Distributed Language Model
9 0.058642264 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
10 0.055981658 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
11 0.053902581 35 nips-2008-Bayesian Synchronous Grammar Induction
12 0.052886084 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
13 0.049939439 6 nips-2008-A ``Shape Aware'' Model for semi-supervised Learning of Objects and its Context
14 0.048562121 52 nips-2008-Correlated Bigram LSA for Unsupervised Language Model Adaptation
15 0.047603086 127 nips-2008-Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction
16 0.037815176 93 nips-2008-Global Ranking Using Continuous Conditional Random Fields
17 0.031290229 62 nips-2008-Differentiable Sparse Coding
18 0.031122126 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization
19 0.030676015 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
20 0.030023018 143 nips-2008-Multi-label Multiple Kernel Learning
simIndex simValue paperId paperTitle
same-paper 1 0.95484602 28 nips-2008-Asynchronous Distributed Learning of Topic Models
Author: Padhraic Smyth, Max Welling, Arthur U. Asuncion
Abstract: Distributed learning is a problem of fundamental interest in machine learning and cognitive science. In this paper, we present asynchronous distributed learning algorithms for two well-known unsupervised learning frameworks: Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Processes (HDP). In the proposed approach, the data are distributed across P processors, and processors independently perform Gibbs sampling on their local data and communicate their information in a local asynchronous manner with other processors. We demonstrate that our asynchronous algorithms are able to learn global topic models that are statistically as accurate as those learned by the standard LDA and HDP samplers, but with significant improvements in computation time and memory. We show speedup results on a 730-million-word text corpus using 32 processors, and we provide perplexity results for up to 1500 virtual processors. As a stepping stone in the development of asynchronous HDP, a parallel HDP sampler is also introduced. 1
2 0.76054573 229 nips-2008-Syntactic Topic Models
Author: Jordan L. Boyd-graber, David M. Blei
Abstract: We develop the syntactic topic model (STM), a nonparametric Bayesian model of parsed documents. The STM generates words that are both thematically and syntactically constrained, which combines the semantic insights of topic models with the syntactic information available from parse trees. Each word of a sentence is generated by a distribution that combines document-specific topic weights and parse-tree-specific syntactic transitions. Words are assumed to be generated in an order that respects the parse tree. We derive an approximate posterior inference method based on variational methods for hierarchical Dirichlet processes, and we report qualitative and quantitative results on both synthetic data and hand-parsed documents. 1
3 0.74590272 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
Author: Indraneel Mukherjee, David M. Blei
Abstract: Hierarchical probabilistic modeling of discrete data has emerged as a powerful tool for text analysis. Posterior inference in such models is intractable, and practitioners rely on approximate posterior inference methods such as variational inference or Gibbs sampling. There has been much research in designing better approximations, but there is yet little theoretical understanding of which of the available techniques are appropriate, and in which data analysis settings. In this paper we provide the beginnings of such understanding. We analyze the improvement that the recently proposed collapsed variational inference (CVB) provides over mean field variational inference (VB) in latent Dirichlet allocation. We prove that the difference in the tightness of the bound on the likelihood of a document decreases as O(k − 1) + log m/m, where k is the number of topics in the model and m is the number of words in a document. As a consequence, the advantage of CVB over VB is lost for long documents but increases with the number of topics. We demonstrate empirically that the theory holds, using simulated text data and two text corpora. We provide practical guidelines for choosing an approximation. 1
4 0.73370773 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
Author: Simon Lacoste-julien, Fei Sha, Michael I. Jordan
Abstract: Probabilistic topic models have become popular as methods for dimensionality reduction in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood or Bayesian methods. In this paper, we discuss an alternative: a discriminative framework in which we assume that supervised side information is present, and in which we wish to take that side information into account in finding a reduced dimensionality representation. Specifically, we present DiscLDA, a discriminative variation on Latent Dirichlet Allocation (LDA) in which a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroups document classification task and show how our model can identify shared topics across classes as well as class-dependent topics.
5 0.68360925 52 nips-2008-Correlated Bigram LSA for Unsupervised Language Model Adaptation
Author: Yik-cheung Tam, Tanja Schultz
Abstract: We present a correlated bigram LSA approach for unsupervised LM adaptation for automatic speech recognition. The model is trained using efficient variational EM and smoothed using the proposed fractional Kneser-Ney smoothing which handles fractional counts. We address the scalability issue to large training corpora via bootstrapping of bigram LSA from unigram LSA. For LM adaptation, unigram and bigram LSA are integrated into the background N-gram LM via marginal adaptation and linear interpolation respectively. Experimental results on the Mandarin RT04 test set show that applying unigram and bigram LSA together yields 6%–8% relative perplexity reduction and 2.5% relative character error rate reduction which is statistically significant compared to applying only unigram LSA. On the large-scale evaluation on Arabic, 3% relative word error rate reduction is achieved which is also statistically significant. 1
6 0.53033435 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
7 0.41109937 3 nips-2008-A Massively Parallel Digital Learning Processor
8 0.37385213 4 nips-2008-A Scalable Hierarchical Distributed Language Model
9 0.36114308 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
10 0.35725167 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
11 0.29672542 134 nips-2008-Mixed Membership Stochastic Blockmodels
12 0.269339 127 nips-2008-Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction
13 0.25267598 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems
14 0.2428315 61 nips-2008-Diffeomorphic Dimensionality Reduction
15 0.23625904 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
16 0.23529403 167 nips-2008-One sketch for all: Theory and Application of Conditional Random Sampling
17 0.23415922 6 nips-2008-A ``Shape Aware'' Model for semi-supervised Learning of Objects and its Context
18 0.2322166 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
19 0.22009635 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization
20 0.2193286 117 nips-2008-Learning Taxonomies by Dependence Maximization
simIndex simValue paperId paperTitle
same-paper 1 0.73451799 28 nips-2008-Asynchronous Distributed Learning of Topic Models
Author: Padhraic Smyth, Max Welling, Arthur U. Asuncion
Abstract: Distributed learning is a problem of fundamental interest in machine learning and cognitive science. In this paper, we present asynchronous distributed learning algorithms for two well-known unsupervised learning frameworks: Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Processes (HDP). In the proposed approach, the data are distributed across P processors, and processors independently perform Gibbs sampling on their local data and communicate their information in a local asynchronous manner with other processors. We demonstrate that our asynchronous algorithms are able to learn global topic models that are statistically as accurate as those learned by the standard LDA and HDP samplers, but with significant improvements in computation time and memory. We show speedup results on a 730-million-word text corpus using 32 processors, and we provide perplexity results for up to 1500 virtual processors. As a stepping stone in the development of asynchronous HDP, a parallel HDP sampler is also introduced. 1
2 0.57250696 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
Author: Ali Rahimi, Benjamin Recht
Abstract: Randomized neural networks are immortalized in this AI Koan: In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. “What are you doing?” asked Minsky. “I am training a randomly wired neural net to play tic-tac-toe,” Sussman replied. “Why is the net wired randomly?” asked Minsky. Sussman replied, “I do not want it to have any preconceptions of how to play.” Minsky then shut his eyes. “Why do you close your eyes?” Sussman asked his teacher. “So that the room will be empty,” replied Minsky. At that moment, Sussman was enlightened. We analyze shallow random networks with the help of concentration of measure inequalities. Specifically, we consider architectures that compute a weighted sum of their inputs after passing them through a bank of arbitrary randomized nonlinearities. We identify conditions under which these networks exhibit good classification performance, and bound their test error in terms of the size of the dataset and the number of random nonlinearities. 1
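The architecture the abstract describes — fixed random nonlinearities followed by a learned weighted sum — can be sketched concretely with random cosine features and a ridge-regularized least-squares fit. This is one common instantiation of the idea, not the paper's exact construction; all function names and parameter values below are illustrative:

```python
import numpy as np

def fit_random_kitchen_sinks(X, y, n_features=200, scale=1.0, lam=1e-3, seed=0):
    """Fit a weighted sum over a bank of random cosine nonlinearities.

    The random projections (W, b) are sampled once and never trained;
    only the linear output weights alpha are learned, by ridge regression.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=scale, size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    Phi = np.cos(X @ W + b)  # random feature map, fixed after sampling
    alpha = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_features), Phi.T @ y)
    return W, b, alpha

def predict(X, W, b, alpha):
    """Weighted sum of the same fixed random nonlinearities."""
    return np.cos(X @ W + b) @ alpha
```

Because only the output weights are optimized, training reduces to a convex (here, closed-form) problem, which is what makes the randomization-in-place-of-minimization trade analyzable.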
3 0.5457496 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
Author: Kate Saenko, Trevor Darrell
Abstract: Polysemy is a problem for methods that exploit image search engines to build object category models. Existing unsupervised approaches do not take word sense into consideration. We propose a new method that uses a dictionary to learn models of visual word sense from a large collection of unlabeled web data. The use of LDA to discover a latent sense space makes the model robust despite the very limited nature of dictionary definitions. The definitions are used to learn a distribution in the latent space that best represents a sense. The algorithm then uses the text surrounding image links to retrieve images with high probability of a particular dictionary sense. An object classifier is trained on the resulting sense-specific images. We evaluate our method on a dataset obtained by searching the web for polysemous words. Category classification experiments show that our dictionary-based approach outperforms baseline methods.
4 0.52964288 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
Author: Vinod Nair, Geoffrey E. Hinton
Abstract: We present a mixture model whose components are Restricted Boltzmann Machines (RBMs). This possibility has not been considered before because computing the partition function of an RBM is intractable, which appears to make learning a mixture of RBMs intractable as well. Surprisingly, when formulated as a third-order Boltzmann machine, such a mixture model can be learned tractably using contrastive divergence. The energy function of the model captures three-way interactions among visible units, hidden units, and a single hidden discrete variable that represents the cluster label. The distinguishing feature of this model is that, unlike other mixture models, the mixing proportions are not explicitly parameterized. Instead, they are defined implicitly via the energy function and depend on all the parameters in the model. We present results for the MNIST and NORB datasets showing that the implicit mixture of RBMs learns clusters that reflect the class structure in the data.
5 0.4060156 92 nips-2008-Generative versus discriminative training of RBMs for classification of fMRI images
Author: Tanya Schmah, Geoffrey E. Hinton, Steven L. Small, Stephen Strother, Richard S. Zemel
Abstract: Neuroimaging datasets often have a very large number of voxels and a very small number of training cases, which means that overfitting of models for this data can become a very serious problem. Working with a set of fMRI images from a study on stroke recovery, we consider a classification task for which logistic regression performs poorly, even when L1- or L2-regularized. We show that much better discrimination can be achieved by fitting a generative model to each separate condition and then seeing which model is most likely to have generated the data. We compare discriminative training of exactly the same set of models, and we also consider convex blends of generative and discriminative training.
6 0.37950432 237 nips-2008-The Recurrent Temporal Restricted Boltzmann Machine
7 0.37120655 219 nips-2008-Spectral Hashing
8 0.37084755 4 nips-2008-A Scalable Hierarchical Distributed Language Model
9 0.35996452 229 nips-2008-Syntactic Topic Models
10 0.35508698 194 nips-2008-Regularized Learning with Networks of Features
11 0.35403338 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
12 0.35376018 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
13 0.35174397 62 nips-2008-Differentiable Sparse Coding
14 0.35116491 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
15 0.35036346 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
16 0.34928954 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
17 0.34848976 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
18 0.34836048 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
19 0.34763554 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
20 0.34688067 75 nips-2008-Estimating vector fields using sparse basis field expansions