Authorship Attribution of Micro-Messages (EMNLP 2013)

Authors: Roy Schwartz; Oren Tsur; Ari Rappoport; Moshe Koppel

Abstract: Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author’s unique “signature”, and show that such signatures are typical of many authors when writing very short texts. We also present a new authorship attribution feature (“flexible patterns”) and demonstrate a significant improvement over our baselines. Our results show that the author of a single tweet can be identified with good accuracy in an array of flavors of the authorship attribution task.

reference text

Ahmed Abbasi and Hsinchun Chen. 2005. Applying authorship analysis to extremist-group web forum messages. IEEE Intelligent Systems, 20:67–75.
Ahmed Abbasi and Hsinchun Chen. 2008. Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Transactions on Information Systems, 26(2):7:1–7:29.
Shlomo Argamon, Casey Whitelaw, Paul Chase, Sobhan Raj Hota, Navendu Garg, and Shlomo Levitan. 2007. Stylistic text classification using functional lexical features: Research articles. J. Am. Soc. Inf. Sci. Technol., 58(6):802–822.
Matthew Berland and Eugene Charniak. 1999. Finding parts in very large corpora. In Proc. of ACL, pages 57–64, College Park, Maryland, USA.
Ergun Biçici and Deniz Yuret. 2006. Clustering word pairs to answer analogy questions. In Proc. of TAINN, pages 1–8.
Danushka T. Bollegala, Yutaka Matsuo, and Mitsuru Ishizuka. 2009. Measuring the similarity between implicit semantic relations from the web. In Proc. of WWW, New York, New York, USA. ACM Press.
Sarah R. Boutwell. 2011. Authorship Attribution of Short Messages Using Multimodal Features. Master’s thesis, Naval Postgraduate School.
John Burrows. 2002. ‘Delta’: a Measure of Stylistic Difference and a Guide to Likely Authorship. Literary and Linguistic Computing, 17(3):267–287.
Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Timothy Chklovski and Patrick Pantel. 2004. Verbocean: Mining the web for fine-grained semantic verb relations. In Dekang Lin and Dekai Wu, editors, Proc. of EMNLP, pages 33–40, Barcelona, Spain.
Dmitry Davidov and Ari Rappoport. 2006. Efficient unsupervised discovery of word categories using symmetric patterns and high frequency words. In Proc. of ACL-Coling, pages 297–304, Sydney, Australia.
Dmitry Davidov and Ari Rappoport. 2008a. Classification of semantic relationships between nominals using pattern clusters. In Proceedings of ACL-08: HLT, pages 227–235, Columbus, Ohio, June. Association for Computational Linguistics.
Dmitry Davidov and Ari Rappoport. 2008b. Unsupervised discovery of generic relationships using pattern clusters and its evaluation by automatically generated SAT analogy questions. In Proc. of ACL-HLT, pages 692–700, Columbus, Ohio.
Dmitry Davidov, Ari Rappoport, and Moshe Koppel. 2007. Fully unsupervised discovery of concept-specific relationships by web mining. In Proc. of ACL, pages 232–239, Prague, Czech Republic.
Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010a. Semi-supervised recognition of sarcastic sentences in twitter and amazon. In Proc. of CoNLL, pages 107–116, Uppsala, Sweden.
Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010b. Enhanced sentiment learning using twitter hashtags and smileys. In Proc. of Coling, pages 241–249, Beijing, China.
Olivier De Vel, Alison Anderson, Malcolm Corney, and George Mohay. 2001. Mining e-mail content for author identification forensics. ACM Sigmod Record, 30(4):55–64.
Joachim Diederich, Jörg Kindermann, Edda Leopold, and Gerhard Paass. 2003. Authorship attribution with support vector machines. Applied intelligence, 19(12):109–123.
Adriano Ferraresi, Eros Zanchetta, Marco Baroni, and Silvia Bernardini. 2008. Introducing and evaluating ukwac, a very large web-derived corpus of english. In Proc. of the 4th Web as Corpus Workshop, WAC-4.
Georgia Frantzeskou, Efstathios Stamatatos, Stefanos Gritzalis, and Carole E Chaski. 2007. Identifying authorship by byte-level n-grams: The source code author profile (scap) method. Int Journal of Digital Evidence, 6(1):1–18.
Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proc. of Coling – Volume 2, pages 539–545, Stroudsburg, PA, USA.
Johan F Hoorn, Stefan L Frank, Wojtek Kowalczyk, and Floor van der Ham. 1999. Neural network identification of poets using letter sequences. Literary and Linguistic Computing, 14(3):311–338.
Shunichi Ishihara. 2011. A forensic authorship classification in sms messages: A likelihood ratio based approach using n-gram. In Proc. of the Australasian Language Technology Association Workshop 2011, pages 47–56, Canberra, Australia.
Patrick Juola. 2012. Large-scale experiments in authorship attribution. English Studies, 93(3):275–283.
Bradley Kjell, W Addison Woods, and Ophir Frieder. 1995. Information retrieval using letter tuples with neural network and nearest neighbor classifiers. In IEEE International Conference on Systems, Man and Cybernetics, volume 2, pages 1222–1226. IEEE.
Bradley Kjell. 1994. Authorship determination using letter pair frequency features with neural network classifiers. Literary and Linguistic Computing, 9(2):119–124.
Moshe Koppel and Jonathan Schler. 2003. Exploiting stylistic idiosyncrasies for authorship attribution. In Proc. of IJCAI’03 Workshop on Computational Approaches to Style Analysis and Synthesis, volume 69, page 72.
Moshe Koppel, Jonathan Schler, and Kfir Zigdon. 2005. Determining an author’s native language by mining a text for errors. In Proc. of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD ’05, pages 624–628, New York, NY, USA.
Moshe Koppel, Jonathan Schler, Shlomo Argamon, and Eran Messeri. 2006. Authorship attribution with thousands of candidate authors. In SIGIR, pages 659–660.
Moshe Koppel, Jonathan Schler, and Elisheva Bonchek-Dokow. 2007. Measuring differentiability: Unmasking pseudonymous authors. JMLR, 8:1261–1276.
Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2009. Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol., 60(1):9–26.
Moshe Koppel, Navot Akiva, Idan Dershowitz, and Nachum Dershowitz. 2011a. Unsupervised decomposition of a document into authorial components. In Proc. of ACL-HLT, pages 1356–1364, Portland, Oregon, USA.
Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2011b. Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.
Zornitsa Kozareva, Ellen Riloff, and Eduard Hovy. 2008. Semantic class learning from the web with hyponym pattern linkage graphs. In Proc. of ACL-HLT, pages 1048–1056, Columbus, Ohio.
Robert Layton, Paul Watters, and Richard Dazeley. 2010. Authorship attribution for twitter in 140 characters or less. In Proc. of the 2010 Second Cybercrime and Trustworthy Computing Workshop, CTC ’10, pages 1–8, Washington, DC, USA. IEEE Computer Society.
Robert AJ Matthews and Thomas VN Merriam. 1993. Neural computation in stylometry i: An application to the works of shakespeare and fletcher. Literary and Linguistic Computing, 8(4):203–209.
DL Mealand. 1995. Correspondence analysis of luke. Literary and linguistic computing, 10(3):171–182.
Thomas Corwin Mendenhall. 1887. The characteristic curves of composition. Science, ns-9(214S):237–246.
George K Mikros and Kostas Perifanos. 2013. Authorship attribution in greek tweets using authors multilevel n-gram profiles. In 2013 AAAI Spring Symposium Series.
Ashwin Mohan, Ibrahim M Baggili, and Marcus K Rogers. 2010. Authorship attribution of sms messages using an n-grams approach. Technical report, CERIAS Tech Report 2011.
Frederick Mosteller and David Lee Wallace. 1964. Inference and disputed authorship: The Federalist. Addison-Wesley.
Fuchun Peng, Dale Schuurmans, and Shaojun Wang. 2004. Augmenting naive bayes classifiers with statistical language models. Information Retrieval, 7(34):317–345.
Conrad Sanderson and Simon Guenter. 2006. Short text authorship attribution via sequence kernels, markov chains and author unmasking: An investigation. In Proc. of EMNLP, pages 482–491, Sydney, Australia.
Rui Sousa Silva, Gustavo Laboreiro, Luís Sarmento, Tim Grant, Eugénio Oliveira, and Belinda Maia. 2011. ‘twazn me!!! ;(’ automatic authorship analysis of micro-blogging messages. In Proc. of the 16th international conference on Natural language processing and information systems, NLDB’11, pages 161–168, Berlin, Heidelberg. Springer-Verlag.
Thamar Solorio, Sangita Pillay, Sindhu Raghavan, and Manuel Montes-Gomez. 2011. Modality specific meta features for authorship attribution in web forum posts. In Proc. of IJCNLP, pages 156–164, Chiang Mai, Thailand, November.
Efstathios Stamatatos. 2008. Author identification: Using text sampling to handle the class imbalance problem. Inf. Process. Manage., 44(2):790–799.
Efstathios Stamatatos. 2009. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.
Oren Tsur, Dmitry Davidov, and Ari Rappoport. 2010. Icwsm–a great catchy name: Semi-supervised recognition of sarcastic sentences in online product reviews. In Proc. of ICWSM.
Peter Turney. 2008a. A uniform approach to analogies, synonyms, antonyms, and associations. In Proc. of Coling, pages 905–912, Manchester, UK, August. Coling 2008 Organizing Committee.
Peter D. Turney. 2008b. The latent relation mapping engine: Algorithm and experiments. Journal of Artificial Intelligence Research, 33:615–655.
Dominic Widdows and Beate Dorow. 2002. A graph model for unsupervised lexical acquisition. In Proc. of Coling, pages 1–7, Stroudsburg, PA, USA.
George Udny Yule. 1939. On sentence-length as a statistical characteristic of style in prose: with application to two cases of disputed authorship. Biometrika, 30(34):363–390.