
214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics

Source: pdf

Author: Steffen Hedegaard ; Jakob Grue Simonsen

Abstract: We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attribution of both translated and untranslated texts; (ii) that frame-based classifiers generally perform worse than the baseline classifiers for untranslated texts, but (iii) perform as well as, or superior to, the baseline classifiers on translated texts; and (iv) that, contrary to current belief, naïve classifiers based on lexical markers may perform tolerably on translated texts if the combination of author and translator is present in the training set of a classifier.

reference text

John B. Archer, John L. Hilton, and G. Bruce Schaalje. 1997. Comparative power of three author-attribution techniques for differentiating authors. Journal of Book of Mormon Studies, 6(1):47–63.

Shlomo Argamon. 2007. Interpreting Burrows’ Delta: Geometric and probabilistic foundations. Literary and Linguistic Computing, 23(2):131–147.

R. Arun, V. Suresh, and C. E. Veni Madhaven. 2009. Stopword graphs and authorship attribution in text corpora. In Proceedings of the 3rd IEEE International Conference on Semantic Computing (ICSC 2009), pages 192–196, Berkeley, CA, USA, September. IEEE Computer Society Press.

Dipanjan Das, Nathan Schneider, Desai Chen, and Noah A. Smith. 2010. Probabilistic frame-semantic parsing. In Proceedings of the North American Chapter of the Association for Computational Linguistics Human Language Technologies Conference (NAACL HLT ’10).

Kai-Bo Duan and S. Sathiya Keerthi. 2005. Which is the best multiclass SVM method? An empirical study. In Proceedings of the Sixth International Workshop on Multiple Classifier Systems, pages 278–285.

Michael Gamon. 2004. Linguistic correlates of style: Authorship classification with deep linguistic analysis features. In Proceedings of the 20th International Conference on Computational Linguistics (COLING ’04), pages 611–617.

David I. Holmes. 1992. A stylometric analysis of Mormon scripture and related texts. Journal of the Royal Statistical Society, Series A, 155(1):91–120.

Richard Johansson and Pierre Nugues. 2007. Semantic structure extraction using nonprojective dependency trees. In Proceedings of SemEval-2007, Prague, Czech Republic, June 23–24.

Patrick Juola. 2006. Authorship attribution. Found. Trends Inf. Retr., 1(3):233–334.

Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2008. Computational methods for authorship attribution. Journal of the American Society for Information Science and Technology, 60(1):9–25.

Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2010. Authorship attribution in the wild. Language Resources and Evaluation, pages 1–12. 10.1007/s10579-009-9111-2.

Geoffrey Leech, Paul Rayson, and Andrew Wilson. 2001. Word Frequencies in Written and Spoken English: Based on the British National Corpus. Longman, London.

Kim Luyckx and Walter Daelemans. 2010. The effect of author set size and data size in authorship attribution. Literary and Linguistic Computing. To appear.

Philip M. McCarthy, Gwyneth A. Lewis, David F. Dufty, and Danielle S. McNamara. 2006. Analyzing writing styles with Coh-Metrix. In Proceedings of the International Conference of the Florida Artificial Intelligence Research Society, pages 764–769.

Quinn McNemar. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12:153–157.

Frederick Mosteller and David L. Wallace. 1964. Inference and Disputed Authorship: The Federalist. Springer-Verlag, New York. 2nd Edition appeared in 1984 and was called Applied Bayesian and Classical Inference.

Sindhu Raghavan, Adriana Kovashka, and Raymond Mooney. 2010. Authorship attribution using probabilistic context-free grammars. In Proceedings of the ACL 2010 Conference Short Papers, pages 38–42. Association for Computational Linguistics.

Joseph Rudman. 1997. The state of authorship attribution studies: Some problems and solutions. Computers and the Humanities, 31(4):351–365.

Joseph Ruppenhofer, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson, and Jan Scheffczyk. 2006. FrameNet II: Extended Theory and Practice. The Framenet Project.

Andrew I. Schein, Johnnie F. Caver, Randale J. Honaker, and Craig H. Martell. 2010. Author attribution evaluation with novel topic cross-validation. In Proceedings of the 2010 International Conference on Knowledge Discovery and Information Retrieval (KDIR ’10).

Mark D. Smucker, James Allan, and Ben Carterette. 2007. A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, CIKM ’07, pages 623–632, New York, NY, USA. ACM.

Efstathios Stamatatos. 2009. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

Benno Stein, Moshe Koppel, and Efstathios Stamatatos, editors. 2007. Proceedings of the SIGIR 2007 International Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection, PAN 2007, Amsterdam, Netherlands, July 27, 2007, volume 276 of CEUR Workshop Proceedings. CEUR-WS.org.

Frank Yates. 1934. Contingency tables involving small numbers and the χ2 test. Supplement to the Journal of the Royal Statistical Society, 1(2):217–235.