
25 jmlr-2009-Distributed Algorithms for Topic Models


Source: pdf

Author: David Newman, Arthur Asuncion, Padhraic Smyth, Max Welling

Abstract: We describe distributed algorithms for two widely-used topic models, namely the Latent Dirichlet Allocation (LDA) model and the Hierarchical Dirichlet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting: processors concurrently perform Gibbs sampling over local data, followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second version is a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach and merges newly-created topics either by matching or by topic-id. Using five real-world text corpora we show that distributed learning works well in practice. For both LDA and HDP, we show that the converged test-data log probability for distributed learning is indistinguishable from that obtained with single-processor learning. Our extensive experimental results include learning topic models for two multi-million document collections using a 1024-processor parallel computer.

Keywords: topic models, latent Dirichlet allocation, hierarchical Dirichlet processes, distributed parallel computation
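The abstract's first LDA algorithm alternates local Gibbs sweeps with a global merge of topic counts. The Python sketch below is a minimal single-machine simulation of that structure, not the authors' implementation: the toy corpus, the count-array names (nkw, ndk, nk), and all constants are illustrative assumptions. Each "processor" runs a collapsed Gibbs sweep against a stale copy of the global topic-word counts, and the copies are then reconciled by summing per-processor deltas.

import numpy as np

rng = np.random.default_rng(0)
V, K, P, alpha, beta = 50, 4, 3, 0.1, 0.01    # vocab size, topics, processors, priors

def gibbs_pass(docs, z, nkw, nk, ndk):
    """One collapsed-Gibbs sweep over one processor's local documents."""
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k = z[d][i]
            nkw[k, w] -= 1; nk[k] -= 1; ndk[d, k] -= 1   # remove old assignment
            probs = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=probs / probs.sum())      # sample new topic
            z[d][i] = k
            nkw[k, w] += 1; nk[k] += 1; ndk[d, k] += 1

# Toy corpus partitioned across P "processors" (run sequentially here).
corpora = [[list(rng.integers(V, size=20)) for _ in range(5)] for _ in range(P)]
assign = [[list(rng.integers(K, size=len(doc))) for doc in docs] for docs in corpora]

nkw = np.zeros((K, V))                                    # global topic-word counts
ndk_local = []                                            # per-processor doc-topic counts
for docs, z in zip(corpora, assign):
    ndk = np.zeros((len(docs), K))
    for d, words in enumerate(docs):
        for w, k in zip(words, z[d]):
            nkw[k, w] += 1
            ndk[d, k] += 1
    ndk_local.append(ndk)

for sweep in range(20):
    copies = [nkw.copy() for _ in range(P)]               # stale global counts per processor
    for p in range(P):                                    # concurrent in the actual algorithm
        gibbs_pass(corpora[p], assign[p], copies[p], copies[p].sum(axis=1), ndk_local[p])
    # Global update: fold each processor's count deltas back into the global counts.
    nkw = nkw + sum(c - nkw for c in copies)

print("tokens per topic:", nkw.sum(axis=1).astype(int))

The merge step, nkw_new = nkw_old + sum over p of (nkw_p - nkw_old), is what makes the scheme approximate: during a sweep each processor samples against counts that ignore the other processors' concurrent updates, which is why the abstract describes the algorithm as an approximation to single-processor Gibbs-sampled LDA.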


reference text

M. Abramowitz and I. Stegun. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, New York, 1964.

A. Asuncion and D. Newman. UCI machine learning repository, 2007. URL http://www.ics.uci.edu/∼mlearn/MLRepository.html.

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. In Advances in Neural Information Processing Systems, volume 14, pages 601–608, Cambridge, MA, 2002. MIT Press.

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

A. Brockwell. Parallel Markov chain Monte Carlo simulation by pre-fetching. Journal of Computational & Graphical Statistics, 15(1):246–261, 2006.

R. Burkard and E. Çela. Linear assignment problems and extensions. In P. Pardalos and D. Du, editors, Handbook of Combinatorial Optimization, Supplement Volume A. Kluwer Academic Publishers, 1999.

G. Casella and C. Robert. Rao-Blackwellisation of sampling schemes. Biometrika, 83(1):81–94, 1996.

C. Chemudugunta, P. Smyth, and M. Steyvers. Modeling general and specific aspects of documents with a probabilistic topic model. In Advances in Neural Information Processing Systems 19, pages 241–248. MIT Press, Cambridge, MA, 2007.

C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 281–288. MIT Press, Cambridge, MA, 2007.

A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: Scalable online collaborative filtering. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, pages 271–280, New York, NY, 2007. ACM.

M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577–588, 1995. URL citeseer.ist.psu.edu/escobar94bayesian.html.

P. Ferrari, A. Frigessi, and R. Schonmann. Convergence of some partially parallel Gibbs samplers with annealing. In Annals of Applied Probability, volume 3, pages 137–153. Institute of Mathematical Statistics, 1993.

G. Forman and B. Zhang. Distributed data clustering can be efficient and exact. In ACM KDD Explorations, volume 2, pages 34–38, New York, NY, 2000. ACM.

T. Griffiths and M. Steyvers. Finding scientific topics. In Proceedings of the National Academy of Sciences, volume 101, pages 5228–5235, 2004.

E. Kontoghiorghes. Handbook of Parallel Computing and Statistics (Statistics, Textbooks and Monographs). Chapman & Hall / CRC, 2005.

W. Kowalczyk and N. Vlassis. Newscast EM. In Advances in Neural Information Processing Systems 17, pages 713–720. MIT Press, Cambridge, MA, 2005.

W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the International Conference on Machine Learning, volume 23, pages 577–584, New York, NY, 2006. ACM.

J. Liu, W. Wong, and A. Kong. Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes. Biometrika, 81(1):27–40, 1994.

D. Mimno and A. McCallum. Organizing the OCA: Learning faceted subjects from a library of digital books. In JCDL '07: Proceedings of the 2007 Conference on Digital Libraries, pages 376–385, New York, NY, 2007. ACM.

R. Nallapati, W. Cohen, and J. Lafferty. Parallelized variational EM for latent Dirichlet allocation: An experimental evaluation of speed and scalability. In ICDMW '07: Proceedings of the Seventh IEEE International Conference on Data Mining Workshops, pages 349–354, Washington, DC, 2007. IEEE Computer Society.

M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, volume 20, pages 487–494, Arlington, VA, 2004. AUAI Press.

A. Rossini, L. Tierney, and N. Li. Simple parallel statistical computing in R. Journal of Computational & Graphical Statistics, 16(2):399, 2007.

Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.

J. Wolfe, A. Haghighi, and D. Klein. Fully distributed EM for very large datasets. In Proceedings of the International Conference on Machine Learning, pages 1184–1191. ACM, New York, NY, 2008.

L. Younes. Synchronous random fields and image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(4):380–390, 1998.