
101 emnlp-2011-Optimizing Semantic Coherence in Topic Models


Source: pdf

Author: David Mimno ; Hanna Wallach ; Edmund Talley ; Miriam Leenders ; Andrew McCallum

Abstract: Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. For people to use such models, however, they must be able to trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) an analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
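The evaluation metric scores a topic's most probable words by their co-document frequencies in the training corpus itself, which is what lets it avoid human annotators and external reference collections: for a topic's top words v_1, ..., v_M (highest-ranked first), coherence is the sum over word pairs of log((D(v_m, v_l) + 1) / D(v_l)), where D(v) is the number of training documents containing v and D(v, v') the number containing both. Below is a minimal Python sketch of that score under the assumption that each document is represented as a set of word types; the function and variable names are illustrative, not taken from the paper's code.

import math

def topic_coherence(top_words, docs):
    # top_words: the topic's M most probable words, highest-ranked first.
    # docs: the training corpus, one set of word types per document.
    doc_freq = {w: sum(1 for d in docs if w in d) for w in top_words}
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_freq = sum(1 for d in docs
                          if top_words[m] in d and top_words[l] in d)
            # The +1 smoothing keeps the log finite when a pair
            # never co-occurs in the same document.
            score += math.log((co_freq + 1) / doc_freq[top_words[l]])
    return score

# Toy usage: three documents, one topic's top three words.
docs = [{"dna", "gene", "cell"}, {"dna", "gene"}, {"brain", "neuron"}]
print(topic_coherence(["dna", "gene", "cell"], docs))

Higher (less negative) scores indicate topics whose top words tend to appear in the same documents; conditioning on D(v_l), the higher-ranked word of each pair, is what distinguishes this score from symmetric pointwise mutual information.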


reference text

Loulwah AlSumait, Daniel Barbara, James Gentle, and Carlotta Domeniconi. 2009. Topic significance ranking of LDA generative models. In ECML.

David Andrzejewski, Xiaojin Zhu, and Mark Craven. 2009. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 25–32.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January.

K. R. Canini, L. Shi, and T. L. Griffiths. 2009. Online inference of topics with latent Dirichlet allocation. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics.

Jonathan Chang, Jordan Boyd-Graber, Chong Wang, Sean Gerrish, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems 22, pages 288–296.

Kenneth Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

Gabriel Doyle and Charles Elkan. 2009. Accounting for burstiness in topic models. In ICML.

S. Geman and D. Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741.

Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl. 1):5228–5235.

Matthew Hoffman, David Blei, and Francis Bach. 2010. Online learning for latent Dirichlet allocation. In NIPS.

Hosam Mahmoud. 2008. Pólya Urn Models. Chapman & Hall/CRC Texts in Statistical Science.

Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai. 2007. Automatic labeling of multinomial topic models. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 490–499.

David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. Automatic evaluation of topic coherence. In Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Yee Whye Teh, David Newman, and Max Welling. 2006. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 18.

Hanna Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. 2009. Evaluation methods for topic models. In Proceedings of the 26th International Conference on Machine Learning.

Xing Wei and Bruce Croft. 2006. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International SIGIR Conference.