
37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts

Source: pdf

Author: Moshe Koppel ; Shachar Seidman

Abstract: The identification of pseudepigraphic texts – texts not written by the authors to which they are attributed – has important historical, forensic and commercial applications. We introduce an unsupervised technique for identifying pseudepigrapha. The idea is to identify textual outliers in a corpus based on the pairwise similarities of all documents in the corpus. The crucial point is that document similarity not be measured in any of the standard ways but rather be based on the output of a recently introduced algorithm for authorship verification. The proposed method strongly outperforms existing techniques in systematic experiments on a blog corpus.
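
The outlier idea in the abstract can be illustrated with a minimal sketch. It assumes a pairwise similarity matrix has already been computed by some means; the paper's verification-based similarity is not reproduced here. The function name `flag_outliers`, the `z_cutoff` parameter, the toy matrix, and the z-score rule itself are all hypothetical stand-ins chosen for illustration, not the authors' procedure.

```python
from statistics import mean, pstdev


def flag_outliers(sim, z_cutoff=-1.5):
    """Return indices of documents whose average similarity to the
    rest of the corpus is unusually low (hypothetical z-score rule)."""
    n = len(sim)
    # Average similarity of each document to all others (diagonal excluded).
    avg = [mean(sim[i][j] for j in range(n) if j != i) for i in range(n)]
    mu, sigma = mean(avg), pstdev(avg)
    if sigma == 0:
        # All documents look equally similar: nothing stands out.
        return []
    return [i for i, a in enumerate(avg) if (a - mu) / sigma < z_cutoff]


# Toy symmetric similarity matrix (made-up values): documents 0-3 resemble
# one another, document 4 resembles none of them.
sim = [
    [1.00, 0.90, 0.80, 0.85, 0.10],
    [0.90, 1.00, 0.85, 0.80, 0.20],
    [0.80, 0.85, 1.00, 0.90, 0.15],
    [0.85, 0.80, 0.90, 1.00, 0.10],
    [0.10, 0.20, 0.15, 0.10, 1.00],
]
suspects = flag_outliers(sim)
print(suspects)
```

In this toy corpus only document 4 falls below the cutoff. With several outliers present, a simple mean-and-deviation rule like this one can be distorted by the outliers themselves, so the cutoff here should be read as a placeholder.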

reference text

S. M. Bendre and B. K. Kale. 1987. Masking effect on tests for outliers in normal samples. Biometrika, 74(4):891–896.
Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander. 2000. LOF: Identifying Density-Based Local Outliers. ACM SIGMOD Conference Proceedings.
Varun Chandola, Arindam Banerjee and Vipin Kumar. 2009. Anomaly detection: a survey. ACM Computing Surveys, 41(3), Article 15.
David L. Donoho. 1982. Breakdown properties of multivariate location estimators. Ph.D. qualifying paper, Harvard University.
Peter Filzmoser, Ricardo Maronna and Mark Werner. 2008. Outlier identification in high dimensions. Computational Statistics and Data Analysis, 52:1694–1711.
David Guthrie. 2008. Unsupervised Detection of Anomalous Text. PhD Thesis, University of Sheffield.
Frank E. Grubbs. 1969. Procedures for detecting outlying observations in samples. Technometrics.
V. J. Hodge and J. Austin. 2004. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126.
Patrick Juola and Efstathios Stamatatos. 2013. Overview of the Author Identification Task at PAN 2013. In P. Forner, R. Navigli, and D. Tufis (eds.), CLEF 2013 Evaluation Labs and Workshop – Working Notes Papers.
Moshe Koppel and Jonathan Schler. 2004. Authorship verification as a one-class classification problem. In ICML '04: Twenty-first International Conference on Machine Learning, New York, NY, USA.
Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2011. Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.
Moshe Koppel and Yaron Winter. 2013. Determining If Two Documents Are by the Same Author. J. Am. Soc. Inf. Sci. Technol.
Frederick Mosteller and David L. Wallace. 1964. Inference and Disputed Authorship: The Federalist. Reading, Mass.: Addison-Wesley.
Hans-Peter Kriegel, Matthias S. Schubert and Arthur Zimek. 2008. Angle-based outlier detection in high dimensional data. Proc. KDD.
Thomas C. Mendenhall. 1887. The characteristic curves of composition. Science, 9:237–259.
Sridhar Ramaswamy, Rajeev Rastogi and Kyuseok Shim. 2000. Efficient Algorithms for Mining Outliers from Large Data Sets. Proc. ACM SIGMOD Int. Conf. on Management of Data.
Peter J. Rousseeuw. 1984. Least median of squares regression. Journal of the American Statistical Association, 79(388):871–880.
Peter J. Rousseeuw and Annick M. Leroy. 2003. Robust Regression and Outlier Detection. John Wiley & Sons.
J. Schler, M. Koppel, S. Argamon and J. Pennebaker. 2006. Effects of Age and Gender on Blogging. In Proceedings of 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs.
Werner A. Stahel. 1981. Breakdown of covariance estimators. Research Report 31, Fachgruppe für Statistik, Swiss Federal Institute of Technology (ETH), Zurich.
Efstathios Stamatatos. 2009. Intrinsic plagiarism detection using character n-gram profiles. Proceedings of the SEPLN'09 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse, pp. 38–46.
Benno Stein, Nedim Lipka and Peter Prettenhofer. 2010. Intrinsic Plagiarism Analysis. Language Resources and Evaluation, 1–20.