acl acl2011 acl2011-14 acl2011-14-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yves Petinot ; Kathleen McKeown ; Kapil Thadani
Abstract: We investigate the relevance of hierarchical topic models to represent the content of Web gists. We focus our attention on DMOZ, a popular Web directory, and propose two algorithms to infer such a model from its manually-curated hierarchy of categories. Our first approach, based on information-theoretic grounds, uses an algorithm similar to recursive feature selection. Our second approach is fully Bayesian and derived from the more general model, hierarchical LDA. We evaluate the performance of both models against a flat 1-gram baseline and show improvements in terms of perplexity over held-out data.
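As an illustration of the evaluation mentioned in the abstract, the following is a minimal sketch of how a flat unigram baseline could be scored by perplexity on held-out data. It is not the authors' implementation: the function names (`train_unigram`, `perplexity`), the add-alpha smoothing, the `<unk>` handling, and the toy gists are all assumptions made for this example.

```python
import math
from collections import Counter


def train_unigram(docs, alpha=1.0):
    """Fit an add-alpha smoothed unigram model on tokenized documents.

    Returns a function mapping a token to its probability. Any token not
    seen in training is treated as a single <unk> symbol (an assumption
    of this sketch, not something stated in the paper).
    """
    counts = Counter(tok for doc in docs for tok in doc)
    vocab_size = len(counts) + 1  # +1 reserves a slot for <unk>
    denom = sum(counts.values()) + alpha * vocab_size

    def prob(tok):
        # Unseen tokens have count 0 and receive only the smoothing mass.
        return (counts.get(tok, 0) + alpha) / denom

    return prob


def perplexity(prob, docs):
    """Per-token perplexity of `docs` under the model `prob`."""
    log_sum = 0.0
    n_tokens = 0
    for doc in docs:
        for tok in doc:
            log_sum += math.log(prob(tok))
            n_tokens += 1
    if n_tokens == 0:
        raise ValueError("held-out data contains no tokens")
    return math.exp(-log_sum / n_tokens)


if __name__ == "__main__":
    # Hypothetical toy gists, already tokenized.
    train = [["online", "news", "portal"], ["local", "news", "and", "weather"]]
    held_out = [["weather", "news", "archive"]]
    model = train_unigram(train)
    print(perplexity(model, held_out))
```

Lower perplexity on the held-out gists means the model assigns them higher probability. A hierarchical model would be scored with the same measure, with the per-token probabilities depending on the category path of each gist.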
A. Berger and V. Mittal. 2000. Ocelot: a system for summarizing web pages. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’00), pages 144–151.
David M. Blei and J. Lafferty. 2009. Topic models. In A. Srivastava and M. Sahami, editors, Text Mining: Theory and Applications. Taylor and Francis.
David M. Blei, Andrew Ng, and Michael Jordan. 2003. Latent Dirichlet allocation. JMLR, 3:993–1022.
David M. Blei, Thomas L. Griffiths, and Michael I. Jordan. 2010. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. In Journal of ACM, volume 57.
Jean-Yves Delort, Bernadette Bouchon-Meunier, and Maria Rifqi. 2003. Enhanced web document summarization using hyperlinks. In Hypertext 2003, pages 208–215.
Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. PNAS, 101(suppl. 1):5228–5235.
Thomas Hofmann. 1999. The cluster-abstraction model: Unsupervised learning of topic hierarchies from text data. In Proceedings of IJCAI’99.
Wei Li, David Blei, and Andrew McCallum. 2007. Nonparametric Bayes pachinko allocation. In Proceedings of the Twenty-Third Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI-07), pages 243–250, Corvallis, Oregon. AUAI Press.
C.-Y. Lin and E. Hovy. 2000. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501.
Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. 2009a. Labeled LDA: A supervised topic model for credit attribution in multilabeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP 2009), Singapore, pages 248–256.
Daniel Ramage, Paul Heymann, Christopher D. Manning, and Hector Garcia-Molina. 2009b. Clustering the tagged web. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM ’09, pages 54–63, New York, NY, USA. ACM.
Joseph Reisinger and Marius Paşca. 2009. Latent variable models of concept-attribute attachment. In ACL-IJCNLP ’09: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 620–628, Morristown, NJ, USA. Association for Computational Linguistics.
Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proc. Intl. Conf. on Spoken Language Processing, vol. 2, pages 901–904, September.
Jian-Tao Sun, Dou Shen, Hua-Jun Zeng, Qiang Yang, Yuchang Lu, and Zheng Chen. 2005. Web-page summarization using clickthrough data. In SIGIR 2005, pages 194–201.
Hanna M. Wallach. 2006. Topic modeling: Beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, Pennsylvania, U.S., pages 977–984.