emnlp emnlp2012 emnlp2012-8 emnlp2012-8-reference knowledge-graph by maker-knowledge-mining

8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes


Source: pdf

Author: Robert Lindsey ; William Headden ; Michael Stipicevic

Abstract: Topic models traditionally rely on the bagof-words assumption. In data mining applications, this often results in end-users being presented with inscrutable lists of topical unigrams, single words inferred as representative of their topics. In this article, we present a hierarchical generative probabilistic model of topical phrases. The model simultaneously infers the location, length, and topic of phrases within a corpus and relaxes the bagof-words assumption within phrases by using a hierarchy of Pitman-Yor processes. We use Markov chain Monte Carlo techniques for approximate inference in the model and perform slice sampling to learn its hyperparameters. We show via an experiment on human subjects that our model finds substantially better, more interpretable topical phrases than do competing models.


reference text

Ryan Prescott Adams and David J.C. MacKay. 2007. Bayesian online changepoint detection. Technical report, University of Cambridge, Cambridge, UK. David M. Blei, Andrew Y. Ng, Michael I. Jordan, and John Lafferty. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022. Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems (NIPS). Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Center for Research in Computing Technology, Harvard University. T. L. Griffiths and M. Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl. 1):5228–5235, April. Thomas L. Griffiths, Mark Steyvers, David M. Blei, and Joshua B. Tenenbaum. 2005. Integrating topics and syntax. In Advances in Neural Information Processing Systems 17, pages 537–544. MIT Press. Thomas L. Griffiths, Joshua B. Tenenbaum, and Mark Steyvers. 2007. Topics in semantic representation. Psychological Review, 114:21 1–244. Amit Gruber, Yair Weiss, and Michal Rosen-Zvi. 2007. Hidden topic Markov models. Journal of Machine Learning Research - Proceedings Track, 2: 163–170. Donna Harman. 1992. Overview of the first text retrieval conference (trec–1). In Proceedings of the first Text REtrieval Conference (TREC–1), Washington DC, USA. 222 Mark Johnson. 2010. PCFGs, Topic Models, Adaptor Grammars and Learning Topical Collocations and the Structure of Proper Names. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1148–1 157, Uppsala, Sweden, July. Association for Computational Linguistics. Thomas K. Landauer and Susan T. Dumais. 1997. A solution to plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):21 1– 240. Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu. Radford Neal. 2000. Slice sampling. Annals of Statistics, 31:705–767. Gabriele Paolacci, Jesse Chandler, and Panagiotis G. Ipeirotis. 2010. Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5):41 1–419. J. Pitman and M. Yor. 1997. The two-parameter PoissonDirichlet distribution derived from a stable subordinator. Annals of Probability, 25:855–900. J. Pitman. 2002. Combinatorial stochastic processes. Technical Report 621, Department of Statistics, University of California at Berkeley. Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Lillian Lee and Donna Harman, editors, Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, pages 100–108. Yee Whye Teh. 2006. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL44, pages 985–992, Morristown, NJ, USA. Association for Computational Linguistics. Hanna M. Wallach. 2006. Topic modeling: beyond bagof-words. In Proceedings of the 23rd International Conference on Machine Learning, pages 977–984. Xuerui Wang, Andrew McCallum, and Xing Wei. 2007. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Proceedings of the 7th IEEE International Conference on Data Mining.