
31 acl-2012-Authorship Attribution with Author-aware Topic Models

Source: pdf

Author: Yanir Seroussi ; Fabian Bohnert ; Ingrid Zukerman

Abstract: Authorship attribution deals with identifying the authors of anonymous texts. Building on our earlier finding that the Latent Dirichlet Allocation (LDA) topic model can be used to improve authorship attribution accuracy, we show that employing a previously-suggested Author-Topic (AT) model outperforms LDA when applied to scenarios with many authors. In addition, we define a model that combines LDA and AT by representing authors and documents over two disjoint topic sets, and show that our model outperforms LDA, AT and support vector machines on datasets with many authors.
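The idea of representing authors and documents over two disjoint topic sets can be illustrated with a small generative sketch. It is an illustration under stated assumptions, not the model defined in the paper: the abstract does not say how a word is assigned to one topic set or the other, so the fixed per-word switch `author_word_prob`, the helper names, and all hyperparameter values below are hypothetical.

```python
import random


def dirichlet(rng, alpha, size):
    """Draw one sample from a symmetric Dirichlet via normalised Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(rng, author, n_words, doc_topics, author_topics,
                      author_mix, author_word_prob=0.5, alpha_doc=0.5):
    """Sample a document as (word_id, source) pairs from two disjoint topic sets.

    doc_topics / author_topics: lists of word distributions, one per topic.
    author_mix: per-author mixtures over the author topic set (AT-like part).
    author_word_prob: ASSUMED probability that a word comes from an author topic.
    """
    vocab = range(len(doc_topics[0]))
    # Each document gets its own mixture over the document topic set (LDA-like part).
    doc_mix = dirichlet(rng, alpha_doc, len(doc_topics))
    words = []
    for _ in range(n_words):
        if rng.random() < author_word_prob:
            source, mix, topics = "author", author_mix[author], author_topics
        else:
            source, mix, topics = "document", doc_mix, doc_topics
        topic = rng.choices(range(len(topics)), weights=mix)[0]
        word = rng.choices(vocab, weights=topics[topic])[0]
        words.append((word, source))
    return words


# Toy sizes, chosen only for illustration.
VOCAB_SIZE, N_DOC_TOPICS, N_AUTHOR_TOPICS, N_AUTHORS = 50, 3, 4, 2
rng = random.Random(0)
doc_topics = [dirichlet(rng, 0.5, VOCAB_SIZE) for _ in range(N_DOC_TOPICS)]
author_topics = [dirichlet(rng, 0.5, VOCAB_SIZE) for _ in range(N_AUTHOR_TOPICS)]
author_mix = [dirichlet(rng, 0.5, N_AUTHOR_TOPICS) for _ in range(N_AUTHORS)]
doc = generate_document(rng, author=1, n_words=20, doc_topics=doc_topics,
                        author_topics=author_topics, author_mix=author_mix)
```

In this sketch the document mixture is redrawn for every document while the author mixture is shared across all of an author's documents, which is the sense in which the two topic sets play different roles.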

reference text

Shlomo Argamon and Patrick Juola. 2011. Overview of the international authorship identification competition at PAN-2011. In CLEF 2011: Proceedings of the 2011 Conference on Multilingual and Multimodal Information Access Evaluation (Lab and Workshop Notebook Papers), Amsterdam, The Netherlands.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
David M. Blei. 2012. Probabilistic topic models. Communications of the ACM, 55(4):77–84.
Carole E. Chaski. 2005. Who’s at the keyboard? Authorship attribution in digital evidence investigations. International Journal of Digital Evidence, 4(1).
Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. 2006. Modeling general and specific aspects of documents with a probabilistic topic model. In NIPS 2006: Proceedings of the 20th Annual Conference on Neural Information Processing Systems, pages 241–248, Vancouver, BC, Canada.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874.
Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl. 1):5228–5235.
Roman Kern, Christin Seifert, Mario Zechner, and Michael Granitzer. 2011. Vote/veto meta-classifier for authorship identification. In CLEF 2011: Proceedings of the 2011 Conference on Multilingual and Multimodal Information Access Evaluation (Lab and Workshop Notebook Papers), Amsterdam, The Netherlands.
Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2009. Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology, 60(1):9–26.
Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2011. Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.
Ioannis Kourtis and Efstathios Stamatatos. 2011. Author identification using semi-supervised learning. In CLEF 2011: Proceedings of the 2011 Conference on Multilingual and Multimodal Information Access Evaluation (Lab and Workshop Notebook Papers), Amsterdam, The Netherlands.
Simon Lacoste-Julien, Fei Sha, and Michael I. Jordan. 2008. DiscLDA: Discriminative learning for dimensionality reduction and classification. In NIPS 2008: Proceedings of the 22nd Annual Conference on Neural Information Processing Systems, pages 897–904, Vancouver, BC, Canada.
David Mimno and Andrew McCallum. 2008. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI 2008: Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, pages 411–418, Helsinki, Finland.
Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. 2009. Labeled LDA: A supervised topic model for credit attribution in multilabeled corpora. In EMNLP 2009: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 248–256, Singapore.
Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. 2004. The author-topic model for authors and documents. In UAI 2004: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 487–494, Banff, AB, Canada.
Michal Rosen-Zvi, Chaitanya Chemudugunta, Thomas Griffiths, Padhraic Smyth, and Mark Steyvers. 2010. Learning author-topic models from text corpora. ACM Transactions on Information Systems, 28(1):1–38.
Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James W. Pennebaker. 2006. Effects of age and gender on blogging. In Proceedings of AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs, pages 199–205, Stanford, CA, USA.
Yanir Seroussi, Ingrid Zukerman, and Fabian Bohnert. 2011. Authorship attribution with latent Dirichlet allocation. In CoNLL 2011: Proceedings of the 15th International Conference on Computational Natural Language Learning, pages 181–189, Portland, OR, USA.
Yanir Seroussi, Fabian Bohnert, and Ingrid Zukerman. 2012. Authorship attribution with author-aware topic models. Technical Report 2012/268, Faculty of Information Technology, Monash University, Clayton, VIC, Australia.
Efstathios Stamatatos. 2009. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.
Mark Steyvers and Tom Griffiths. 2007. Probabilistic topic models. In Thomas K. Landauer, Danielle S. McNamara, Simon Dennis, and Walter Kintsch, editors, Handbook of Latent Semantic Analysis, pages 427–448. Lawrence Erlbaum Associates.
Ludovic Tanguy, Assaf Urieli, Basilio Calderone, Nabil Hathout, and Franck Sajous. 2011. A multitude of linguistically-rich features for authorship attribution. In CLEF 2011: Proceedings of the 2011 Conference on Multilingual and Multimodal Information Access Evaluation (Lab and Workshop Notebook Papers), Amsterdam, The Netherlands.
Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
Hanna M. Wallach, David Mimno, and Andrew McCallum. 2009. Rethinking LDA: Why priors matter. In NIPS 2009: Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, pages 1973–1981, Vancouver, BC, Canada.