emnlp emnlp2013 emnlp2013-37 knowledge-graph by maker-knowledge-mining

37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts


Source: pdf

Author: Moshe Koppel ; Shachar Seidman

Abstract: The identification of pseudepigraphic texts – texts not written by the authors to which they are attributed – has important historical, forensic and commercial applications. We introduce an unsupervised technique for identifying pseudepigrapha. The idea is to identify textual outliers in a corpus based on the pairwise similarities of all documents in the corpus. The crucial point is that document similarity not be measured in any of the standard ways but rather be based on the output of a recently introduced algorithm for authorship verification. The proposed method strongly outperforms existing techniques in systematic experiments on a blog corpus.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The identification of pseudepigraphic texts – texts not written by the authors to which they are attributed – has important historical, forensic and commercial applications. [sent-2, score-0.419]

2 The idea is to identify textual outliers in a corpus based on the pairwise similarities of all documents in the corpus. [sent-4, score-0.397]

3 The crucial point is that document similarity not be measured in any of the standard ways but rather be based on the output of a recently introduced algorithm for authorship verification. [sent-5, score-0.388]

4 The proposed method strongly outperforms existing techniques in systematic experiments on a blog corpus. [sent-6, score-0.131]

5 1 Introduction The Shakespeare attribution problem is centuries old and shows no signs of abating. [sent-7, score-0.057]

6 Some scholars argue that some, or even all, of Shakespeare’s works were not actually written by him. [sent-8, score-0.13]

7 The most mainstream theory – and the one that interests us here – is that most of the works were written by Shakespeare, but that several of them were not. [sent-9, score-0.09]

8 Could modern methods of computational authorship attribution be used to detect which, if any, of the works attributed to Shakespeare were not written by him? [sent-10, score-0.359]

9 More generally, this paper deals with the unsupervised problem of detecting pseudepigrapha: documents in a supposedly single-author corpus that were not actually written by the corpus’s presumed author. [sent-11, score-0.213]

10 Studies as early as Mendenhall (1887) have observed that texts by a single author tend to be somewhat homogeneous in style. [sent-12, score-0.091]

11 If this is indeed the case, we would expect that pseudepigrapha would be detectable as outliers. [sent-14, score-0.187]

12 Identifying such outlier texts is, of course, a special case of general outlier identification, one of the central tasks of statistics. [sent-15, score-1.166]

13 We will thus consider the pseudepigrapha problem in the context of the more general outlier detection problem. [sent-16, score-0.758]

14 Typically, research on textual outliers assumes that we have a corpus of known authentic documents and are asked to decide if a specified other document is authentic or not (Juola and Stamatatos, 2013). [sent-17, score-0.639]

15 One crucial aspect of our problem is that we do not assume that any specific text in a corpus is known a priori to be authentic or pseudepigraphic; we can assume only that most of the documents in the corpus are authentic. [sent-18, score-0.235]

16 The method we introduce in this paper builds on the approach of Koppel and Winter (2013) for determining if two documents are by the same author. [sent-19, score-0.175]

17 We apply that method to every pair of documents in a corpus and use properties of the resulting adjacency graph to identify outliers. [sent-20, score-0.175]

18 In the following section, we briefly outline previous work. [sent-21, score-0.027]

19 In Section 3 we provide a framework for outlier detection and in Section 4 we describe our method. [sent-22, score-0.65]

20 In Section 5 we describe the experimental setting and give results, and in Section 6 we present results for the plays of Shakespeare. [sent-23, score-0.084]

21 2 Related Work: Identifying outlier texts consists of two main stages: first, representing each text as a numerical vector of relevant linguistic features of the text and second, using generic methods to identify outlier vectors. [sent-24, score-1.225]

22 There is a vast literature on generic methods for outlier detection, summarized in Hodge & Austin (2004) and Chandola et al. [sent-25, score-0.586]

23 Since our problem setup does not entail obtaining any labeled examples of authentic or outlier documents, supervised and semi-supervised methods are inapplicable. [sent-29, score-0.644]

24 A classical variant of such methods for univariate normally distributed data uses the z-score (Grubbs, 1969). [sent-31, score-0.07]
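
As a concrete illustration of that univariate case, here is a minimal sketch of z-score-based flagging; the threshold of 3.0 is a conventional choice assumed for illustration, not a value from the paper:

    import statistics

    def zscore_outliers(values, threshold=3.0):
        # Flag values whose absolute z-score exceeds the threshold
        # (a Grubbs-style univariate heuristic; the threshold is assumed).
        mean = statistics.mean(values)
        std = statistics.stdev(values)
        if std == 0:
            return []
        return [i for i, x in enumerate(values) if abs(x - mean) / std > threshold]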

25 Such simple univariate outlier detectors are, however, inappropriate for identifying outliers in a high-dimensional textual corpus. [sent-32, score-0.919]

26 , 2008) have generalized univariate methods to high-dimensional data points. [sent-35, score-0.07]

27 In his comprehensive review of outlier detection methods in textual data, Guthrie (2008) compares a variety of vectorization methods along with a variety of generic outlier methods. [sent-36, score-1.32]

28 The vectorization methods employ a variety of lexical and syntactic stylistic features, while the outlier detection methods use a variety of similarity/distance measures such as cosine and Euclidean distance. [sent-37, score-0.766]

29 Similar methods have also been used in the field of intrinsic plagiarism detection, which involves segmenting a text and then identifying outlier segments (Stamatatos, 2009; Stein et al.). [sent-38, score-0.786]

30 3 Proximity Methods: Formally, the problem we wish to solve is defined as follows: Given a set of documents D = {d1, …, dn}, all or most of which were written by author A, which, if any, documents in D were not written by A? [sent-40, score-0.467]

31 We begin by considering the kinds of proximity methods for textual outlier detection considered by Guthrie (2008) and in the work on intrinsic plagiarism detection; these will serve as baseline methods for our approach. [sent-41, score-0.901]

32 The idea is simple: mark as an outlier any document that is too far from the rest of the documents in the corpus. [sent-42, score-0.777]

33 The kinds of measurable features that can be used to represent a document include frequencies of word unigrams, function words, parts-of-speech and character n-grams, as well as complexity measures such as type/token ratio, sentence and word length and so on. [sent-45, score-0.169]

34 We can use either inverses of distance measures such as Euclidean distance or Manhattan distance, or else direct similarity measures such as cosine or min-max. [sent-48, score-0.341]

35 Use an aggregation method to measure the similarity of a document to a set of documents. [sent-50, score-0.477]

36 One approach is to simply measure the distance from a document to the centroid of all the other documents (centroid method). [sent-51, score-0.394]

37 Yet another method is to use median distance (median method). [sent-53, score-0.122]

38 We note that the centroid method and mean method suffer from the masking effect (Bendre and Kale, 1987; Rousseeuw and Leroy, 2003): the presence of some outliers in the data can greatly distort the estimator's results regarding the presence of other outliers. [sent-54, score-0.383]

39 The k-NN method and the median method are both much more robust. [sent-55, score-0.12]

40 Choose some threshold beyond which a document is marked as an outlier. [sent-57, score-0.07]
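
Putting these steps together, a minimal sketch of such a first-order proximity baseline might look as follows; cosine similarity with k-NN aggregation is used here, and the defaults for k and theta are illustrative assumptions rather than the paper's settings:

    import numpy as np

    def cosine_sim(u, v):
        # First-order similarity between two feature vectors.
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    def knn_aggregate(sims, k=3):
        # Aggregate a document's pairwise similarities over its k nearest
        # neighbors; substituting a median here gives the median method.
        return float(np.mean(sorted(sims, reverse=True)[:k]))

    def first_order_outliers(vectors, theta=0.5, k=3):
        # Mark document i as an outlier if its aggregated similarity to the
        # rest of the corpus falls below the threshold theta.
        outliers = []
        for i, vi in enumerate(vectors):
            sims = [cosine_sim(vi, vj) for j, vj in enumerate(vectors) if j != i]
            if knn_aggregate(sims, k) < theta:
                outliers.append(i)
        return outliers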

41 4 Second-Order Similarity: Our approach is to use an entirely different kind of similarity measure in Step 2. [sent-60, score-0.228]

42 Rather than use a first-order similarity measure, as is customary, we employ a second-order similarity measure that is the output of an algorithm used for the authorship verification problem (Koppel et al. [sent-61, score-0.618]

43 2011), in which we need to determine if two, possibly short, documents were written by the same author. [sent-62, score-0.213]

44 That algorithm, known as the “impostors method” (IM), works as follows. [sent-63, score-0.026]

45 Given two documents, d1 and d2, generate an appropriate set of impostor documents, p1, …, pm, and represent each of the documents in terms of some large feature set (for example, the frequencies of various words or character n-grams in the document). [sent-64, score-0.287]

46 For some random subset of the feature set, measure the similarity of d1 to d2 as well as to each of the documents p1, …, pm, and note if d1 is closer to d2 than to any of the impostors. [sent-65, score-0.403]

47 Repeat this k times, choosing a different random subset of the features in each iteration. [sent-66, score-0.026]

48 If d1 is closer to d2 than to any of the impostors (and likewise switching the roles of d1 and d2) for at least θ% of iterations, then output that d1 and d2 are by the same author. [sent-67, score-0.188]

49 Adapting that method for our purposes, we use the proportion of iterations for which d1 is closer to d2 than to any of the impostors as our similarity measure (adding a small twist to make the measure symmetric over d1 and d2, as can be seen in line 2.). [sent-69, score-0.523]

50 Choose a feature set FS for representing documents, a first-order similarity measure sim, and an impostor set {p1,…,pm}. [sent-73, score-0.336]

51 If sim2(di, D) < θ (where θ is a parameter), then mark di as an outlier. [sent-96, score-0.15]

52 The method for choosing the impostor set is corpus-dependent, but quite straightforward: we simply choose random impostors from the same genre and language as the documents in question. [sent-97, score-0.471]

53 The choice of feature set FS, first-order similarity measure sim, and aggregation function agg can be varied. [sent-98, score-0.543]
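
A minimal sketch of this second-order measure, under the description above: the documents are assumed to be numpy feature vectors, sim is any first-order measure (e.g., cosine), and the iteration count, subset fraction, and exact symmetric combination are assumptions, since this summary only references the paper's "small twist" without spelling it out:

    import random

    def sim2(d1, d2, impostors, sim, k=100, subset_frac=0.4, seed=0):
        # Second-order similarity in the spirit of the impostors method (IM):
        # the proportion of random-feature-subset iterations in which each of
        # d1 and d2 is closer to the other than to any impostor. Requiring
        # the condition in both directions is our assumed symmetrization.
        rng = random.Random(seed)
        n_features = len(d1)
        hits = 0
        for _ in range(k):
            fs = rng.sample(range(n_features), max(1, int(subset_frac * n_features)))
            s12 = sim(d1[fs], d2[fs])
            if (s12 > max(sim(d1[fs], p[fs]) for p in impostors) and
                    s12 > max(sim(d2[fs], p[fs]) for p in impostors)):
                hits += 1
        return hits / k

A document di is then marked as an outlier when the aggregate of sim2(di, dj) over the other documents falls below θ, exactly as in the first-order baseline but with sim2 in place of sim.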

54 As for sim and agg, we show below results of experiments comparing the effectiveness of various choices for these parameters. [sent-100, score-0.12]

55 Using second-order similarity has several surface advantages over standard first-order measures. [sent-101, score-0.147]

56 First, it is decisive: for most pairs, second-order similarity will be close to 0 or close to 1. [sent-102, score-0.147]

57 As we will see, it is also simply much more effective for identifying outliers. [sent-104, score-0.043]

58 5 Experiments: We begin by assembling a corpus consisting of 3540 blog posts written by 156 different bloggers. [sent-105, score-0.179]

59 The blogs are taken from the blog corpus assembled by Schler et al. [sent-106, score-0.109]

60 Each of the blogs was written in English by a single author in 2004 and each post consists of 1000 words (excess is truncated). [sent-108, score-0.135]

61 For our initial experiments, each trial consists of 10 blog posts, all but p of which are by a single blogger. [sent-109, score-0.161]

62 The number of pseudepigraphic documents, p, is chosen from a uniform distribution over the set {0, 1, 2, 3}. [sent-110, score-0.189]

63 Our task is to identify which, if any, documents in the set are not by the main author of the set. [sent-111, score-0.19]

64 The pseudepigraphic documents might be written by a single author or by multiple authors. [sent-112, score-0.443]

65 To measure the performance of a given similarity measure sim, we do the following in each trial: 1. [sent-113, score-0.309]

66 Represent each document in the trial set D in terms of BOW. [sent-114, score-0.152]

67 Measure the similarity of each pair of documents in the trial set using the similarity measure sim. [sent-116, score-0.606]

68 Using some aggregation function agg, compute for each document di: sim(di, D) = agg_{j ∈ {1, …, n}, j ≠ i} sim(di, dj). [sent-118, score-0.385]

69 If sim(di, D) < θ, mark di as an outlier (where θ is a parameter). [sent-122, score-0.828]
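
A minimal sketch of this per-trial procedure, parameterized by the similarity function and aggregator so that the same loop serves both a first-order sim and the second-order sim2 (the names are illustrative and tie together the sketches above):

    def trial_outliers(docs, sim, agg, theta):
        # docs: list of feature vectors for one trial set D.
        # sim: pairwise similarity (first- or second-order);
        # agg: aggregation function, e.g. a k-NN mean.
        flagged = []
        for i, di in enumerate(docs):
            pairwise = [sim(di, dj) for j, dj in enumerate(docs) if j != i]
            if agg(pairwise) < theta:
                flagged.append(i)
        return flagged

For example, trial_outliers(docs, sim=lambda a, b: sim2(a, b, impostors, cosine_sim), agg=knn_aggregate, theta=0.5) runs the second-order variant using the earlier sketches' components (docs and impostors are assumed to be prepared feature vectors).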

70 Our objective is to show that results using second-order similarity are stronger than those using first-order similarity. [sent-123, score-0.147]

71 Before we do this, we need to determine the best aggregation function to use in our experiments. [sent-124, score-0.153]

72 As is evident, k-NN is the best aggregation function in each case. [sent-126, score-0.153]

73 We will give these baseline methods an advantage by using k-NN as our aggregation function in all our subsequent experiments. [sent-127, score-0.153]

74 We use BOW as our feature set and k-NN as our aggregation function. [sent-130, score-0.153]

75 We use 500 random blog posts as our impostor set. [sent-131, score-0.223]

76 In Figure 2, we show recall-precision curves for outlier documents over 250 independent trials, as just described, using four first-order similarity measures as well as our second-order similarity measure using each of the four as a base measure. [sent-132, score-1.178]

77 As can be seen, even the worst second-order similarity measure significantly outperforms all the standard first-order measures. [sent-133, score-0.228]

78 In Figure 3, we show the breakeven values for each measure, pairing each first-order measure with the second-order measure that uses it as a base. [sent-134, score-0.216]
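
For reference, the breakeven value of a recall-precision curve is the value at which precision equals recall, which is reached when exactly as many documents are flagged as there are true outliers. A minimal sketch of how it can be computed from ranked outlier scores (illustrative, not the authors' evaluation code):

    def breakeven(scores, labels):
        # scores: outlier scores (higher = more outlier-like);
        # labels: 1 for a true pseudepigraphic document, 0 otherwise.
        ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
        n_pos = sum(labels)
        true_pos = sum(y for _, y in ranked[:n_pos])
        return true_pos / n_pos  # precision == recall at this cutoff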

79 Clearly, the mere use of a second-order method improves results, regardless of the base measure. [sent-135, score-0.053]

80 [Figure caption fragment] … similarity measures and four second-order similarity measures, based on 250 trials of 10 documents each. [sent-136, score-0.592]

81 In Figures 4 and 5, we repeat the experiment described in Figures 2 and 3 above, but with each trial consisting of 50 documents including any number of pseudepigraphic documents in the range 0 to 15. [sent-139, score-0.569]

82 The same phenomenon is apparent: second-order similarity strongly improves results over the corresponding first-order base similarity measure. [sent-140, score-0.347]

83 [Figure caption fragment] … similarity measures and four second-order similarity measures, based on 250 trials of 50 documents each. [sent-141, score-0.592]

84 6 Results on Shakespeare: We applied our methods to the texts of 42 plays by Shakespeare (taken from Project Gutenberg). [sent-142, score-0.203]

85 We included two plays by Thomas Kyd as sanity checks. [sent-143, score-0.084]

86 In addition, we included three plays occasionally attributed to Shakespeare, but generally regarded by authorities as pseudepigrapha (A Yorkshire Tragedy, The Life of Sir John Oldcastle and Pericles, Prince of Tyre). [sent-144, score-0.233]

87 As impostors we used 39 works by contemporaries of Shakespeare, including Christopher Marlowe, Ben Jonson and John Fletcher. [sent-146, score-0.188]

88 We found that the two plays by Thomas Kyd and the three pseudepigraphic plays were all among the seven furthest outliers, as one would expect. [sent-147, score-0.357]

89 King Henry VI (Part 1) was not found to be an outlier at all. [sent-149, score-0.558]

90 Curiously, however, three undisputed plays by Shakespeare were found to be greater outliers than King Edward III. [sent-150, score-0.295]

91 We leave it to Shakespeare scholars to explain the reasons for these anomalies. [sent-153, score-0.04]

92 7 Conclusion: In this paper we defined the problem of unsupervised outlier detection in the authorship verification domain. [sent-154, score-0.893]

93 Our method combines standard outlier detection methods with a novel inter-document similarity measure. [sent-155, score-0.893]

94 This similarity measure is the output of the impostors method recently developed for solving the authorship verification problem. [sent-156, score-0.659]

95 We have found that use of the k-NN method for outlier detection in conjunction with this second-order similarity measure strongly outperforms methods based on any outlier detection method used in conjunction with any standard first-order similarity measure. [sent-157, score-1.813]

96 This improvement proves to be robust, holding for various corpus sizes and various underlying base similarity measures used in the second-order similarity measure. [sent-158, score-0.39]

97 The method can be used to resolve historical conundrums regarding the authenticity of works in questioned corpora, such as the Shakespeare corpus briefly considered here. [sent-159, score-0.11]

98 Bendre and Kale (1987). Masking effect on tests for outliers in normal samples. Biometrika, 74(4):891–896. [sent-167, score-0.211]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('outlier', 0.558), ('shakespeare', 0.239), ('outliers', 0.211), ('pseudepigraphic', 0.189), ('authorship', 0.171), ('agg', 0.162), ('impostors', 0.162), ('aggregation', 0.153), ('di', 0.15), ('documents', 0.149), ('similarity', 0.147), ('koppel', 0.125), ('sim', 0.12), ('plagiarism', 0.12), ('impostor', 0.108), ('kriegel', 0.108), ('pseudepigrapha', 0.108), ('dj', 0.1), ('detection', 0.092), ('authentic', 0.086), ('plays', 0.084), ('trial', 0.082), ('measure', 0.081), ('trials', 0.08), ('blog', 0.079), ('schler', 0.075), ('king', 0.074), ('moshe', 0.072), ('verification', 0.072), ('univariate', 0.07), ('document', 0.07), ('measures', 0.069), ('median', 0.068), ('centroid', 0.066), ('intrinsic', 0.065), ('written', 0.064), ('attribution', 0.057), ('fs', 0.054), ('bendre', 0.054), ('breakeven', 0.054), ('breunig', 0.054), ('chandola', 0.054), ('filzmoser', 0.054), ('guthrie', 0.054), ('hodge', 0.054), ('kyd', 0.054), ('masking', 0.054), ('maxu', 0.054), ('merry', 0.054), ('oldcastle', 0.054), ('rousseeuw', 0.054), ('shachar', 0.054), ('windsor', 0.054), ('wives', 0.054), ('stein', 0.051), ('euclidean', 0.051), ('texts', 0.05), ('ilan', 0.047), ('juola', 0.047), ('benno', 0.047), ('lipka', 0.047), ('lof', 0.047), ('manhattan', 0.047), ('nedim', 0.047), ('vectorization', 0.047), ('identifying', 0.043), ('tragedy', 0.043), ('author', 0.041), ('attributed', 0.041), ('pu', 0.04), ('efstathios', 0.04), ('bow', 0.04), ('scholars', 0.04), ('stamatatos', 0.038), ('peter', 0.037), ('textual', 0.037), ('posts', 0.036), ('edward', 0.036), ('israel', 0.036), ('estimator', 0.034), ('henry', 0.034), ('breakdown', 0.032), ('historical', 0.031), ('numerical', 0.031), ('character', 0.03), ('conjunction', 0.03), ('blogs', 0.03), ('proximity', 0.029), ('bar', 0.029), ('generic', 0.028), ('distance', 0.028), ('briefly', 0.027), ('base', 0.027), ('neighbors', 0.026), ('choosing', 0.026), ('strongly', 0.026), ('works', 0.026), ('closer', 0.026), ('method', 0.026), ('gmai', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts

Author: Moshe Koppel ; Shachar Seidman

Abstract: The identification of pseudepigraphic texts – texts not written by the authors to which they are attributed – has important historical, forensic and commercial applications. We introduce an unsupervised technique for identifying pseudepigrapha. The idea is to identify textual outliers in a corpus based on the pairwise similarities of all documents in the corpus. The crucial point is that document similarity not be measured in any of the standard ways but rather be based on the output of a recently introduced algorithm for authorship verification. The proposed method strongly outperforms existing techniques in systematic experiments on a blog corpus.

2 0.190874 27 emnlp-2013-Authorship Attribution of Micro-Messages

Author: Roy Schwartz ; Oren Tsur ; Ari Rappoport ; Moshe Koppel

Abstract: Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author’s unique “signature”, and show that such signatures are typical of many authors when writing very short texts. We also present a new authorship attribution feature (“flexible patterns”) and demonstrate a significant improvement over our baselines. Our results show that the author of a single tweet can be identified with good accuracy in an array of flavors of the authorship attribution task.

3 0.10551548 95 emnlp-2013-Identifying Multiple Userids of the Same Author

Author: Tieyun Qian ; Bing Liu

Abstract: This paper studies the problem of identifying users who use multiple userids to post in social media. Since multiple userids may belong to the same author, it is hard to directly apply supervised learning to solve the problem. This paper proposes a new method, which still uses supervised learning but does not require training documents from the involved userids. Instead, it uses documents from other userids for classifier building. The classifier can be applied to documents of the involved userids. This is possible because we transform the document space to a similarity space and learning is performed in this new space. Our evaluation is done in the online review domain. The experimental results using a large number of userids and their reviews show that the proposed method is highly effective. 1

4 0.090928517 24 emnlp-2013-Application of Localized Similarity for Web Documents

Author: Peter Rebersek ; Mateja Verlic

Abstract: In this paper we present a novel approach to automatic creation of anchor texts for hyperlinks in a document pointing to similar documents. Methods used in this approach rank parts of a document based on the similarity to a presumably related document. Ranks are then used to automatically construct the best anchor text for a link inside original document to the compared document. A number of different methods from information retrieval and natural language processing are adapted for this task. Automatically constructed anchor texts are manually evaluated in terms of relatedness to linked documents and compared to baseline consisting of originally inserted anchor texts. Additionally we use crowdsourcing for evaluation of original anchors and automatically constructed anchors. Results show that our best adapted methods rival the precision of the baseline method.

5 0.074443236 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching

Author: John Philip McCrae ; Philipp Cimiano ; Roman Klinger

Abstract: Cross-lingual topic modelling has applications in machine translation, word sense disambiguation and terminology alignment. Multilingual extensions of approaches based on latent (LSI), generative (LDA, PLSI) as well as explicit (ESA) topic modelling can induce an interlingual topic space allowing documents in different languages to be mapped into the same space and thus to be compared across languages. In this paper, we present a novel approach that combines latent and explicit topic modelling approaches in the sense that it builds on a set of explicitly defined topics, but then computes latent relations between these. Thus, the method combines the benefits of both explicit and latent topic modelling approaches. We show that on a crosslingual mate retrieval task, our model significantly outperforms LDA, LSI, and ESA, as well as a baseline that translates every word in a document into the target language.

6 0.068027772 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations

7 0.063202269 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification

8 0.060957931 44 emnlp-2013-Centering Similarity Measures to Reduce Hubs

9 0.058311366 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels

10 0.055563323 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

11 0.054407161 109 emnlp-2013-Is Twitter A Better Corpus for Measuring Sentiment Similarity?

12 0.046595976 120 emnlp-2013-Learning Latent Word Representations for Domain Adaptation using Supervised Word Clustering

13 0.044823259 25 emnlp-2013-Appropriately Incorporating Statistical Significance in PMI

14 0.042770717 74 emnlp-2013-Event-Based Time Label Propagation for Automatic Dating of News Articles

15 0.042518035 41 emnlp-2013-Building Event Threads out of Multiple News Articles

16 0.040510163 12 emnlp-2013-A Semantically Enhanced Approach to Determine Textual Similarity

17 0.039588638 61 emnlp-2013-Detecting Promotional Content in Wikipedia

18 0.036690183 165 emnlp-2013-Scaling to Large3 Data: An Efficient and Effective Method to Compute Distributional Thesauri

19 0.034964129 200 emnlp-2013-Well-Argued Recommendation: Adaptive Models Based on Words in Recommender Systems

20 0.034780122 196 emnlp-2013-Using Crowdsourcing to get Representations based on Regular Expressions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.122), (1, 0.05), (2, -0.075), (3, -0.017), (4, 0.02), (5, 0.007), (6, 0.044), (7, 0.062), (8, -0.013), (9, -0.103), (10, -0.03), (11, 0.057), (12, -0.021), (13, 0.019), (14, 0.091), (15, 0.024), (16, -0.153), (17, -0.032), (18, -0.153), (19, -0.006), (20, -0.231), (21, 0.107), (22, 0.065), (23, -0.173), (24, -0.167), (25, 0.017), (26, 0.018), (27, -0.038), (28, 0.085), (29, -0.024), (30, 0.147), (31, 0.005), (32, -0.055), (33, 0.051), (34, -0.079), (35, -0.064), (36, 0.029), (37, -0.0), (38, -0.104), (39, 0.025), (40, -0.058), (41, 0.023), (42, 0.026), (43, 0.163), (44, 0.107), (45, -0.111), (46, 0.061), (47, 0.091), (48, -0.05), (49, 0.065)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95418441 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts

Author: Moshe Koppel ; Shachar Seidman

Abstract: The identification of pseudepigraphic texts – texts not written by the authors to which they are attributed – has important historical, forensic and commercial applications. We introduce an unsupervised technique for identifying pseudepigrapha. The idea is to identify textual outliers in a corpus based on the pairwise similarities of all documents in the corpus. The crucial point is that document similarity not be measured in any of the standard ways but rather be based on the output of a recently introduced algorithm for authorship verification. The proposed method strongly outperforms existing techniques in systematic experiments on a blog corpus.

2 0.74020147 95 emnlp-2013-Identifying Multiple Userids of the Same Author

Author: Tieyun Qian ; Bing Liu

Abstract: This paper studies the problem of identifying users who use multiple userids to post in social media. Since multiple userids may belong to the same author, it is hard to directly apply supervised learning to solve the problem. This paper proposes a new method, which still uses supervised learning but does not require training documents from the involved userids. Instead, it uses documents from other userids for classifier building. The classifier can be applied to documents of the involved userids. This is possible because we transform the document space to a similarity space and learning is performed in this new space. Our evaluation is done in the online review domain. The experimental results using a large number of userids and their reviews show that the proposed method is highly effective. 1

3 0.70850939 27 emnlp-2013-Authorship Attribution of Micro-Messages

Author: Roy Schwartz ; Oren Tsur ; Ari Rappoport ; Moshe Koppel

Abstract: Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author’s unique “signature”, and show that such signatures are typical of many authors when writing very short texts. We also present a new authorship attribution feature (“flexible patterns”) and demonstrate a significant improvement over our baselines. Our results show that the author of a single tweet can be identified with good accuracy in an array of flavors of the authorship attribution task.

4 0.43667766 24 emnlp-2013-Application of Localized Similarity for Web Documents

Author: Peter Rebersek ; Mateja Verlic

Abstract: In this paper we present a novel approach to automatic creation of anchor texts for hyperlinks in a document pointing to similar documents. Methods used in this approach rank parts of a document based on the similarity to a presumably related document. Ranks are then used to automatically construct the best anchor text for a link inside original document to the compared document. A number of different methods from information retrieval and natural language processing are adapted for this task. Automatically constructed anchor texts are manually evaluated in terms of relatedness to linked documents and compared to baseline consisting of originally inserted anchor texts. Additionally we use crowdsourcing for evaluation of original anchors and au- tomatically constructed anchors. Results show that our best adapted methods rival the precision of the baseline method.

5 0.41997391 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels

Author: Vikas Ganjigunte Ashok ; Song Feng ; Yejin Choi

Abstract: Predicting the success of literary works is a curious question among publishers and aspiring writers alike. We examine the quantitative connection, if any, between writing style and successful literature. Based on novels over several different genres, we probe the predictive power of statistical stylometry in discriminating successful literary works, and identify characteristic stylistic elements that are more prominent in successful writings. Our study reports for the first time that statistical stylometry can be surprisingly effective in discriminating highly successful literature from less successful counterpart, achieving accuracy up to 84%. Closer analyses lead to several new insights into characteristics of the writing style in successful literature, including findings that are contrary to the conventional wisdom with respect to good writing style and readability.

6 0.360246 44 emnlp-2013-Centering Similarity Measures to Reduce Hubs

7 0.35558641 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching

8 0.3310453 61 emnlp-2013-Detecting Promotional Content in Wikipedia

9 0.32229424 12 emnlp-2013-A Semantically Enhanced Approach to Determine Textual Similarity

10 0.31757802 165 emnlp-2013-Scaling to Large3 Data: An Efficient and Effective Method to Compute Distributional Thesauri

11 0.31562284 138 emnlp-2013-Naive Bayes Word Sense Induction

12 0.28736657 195 emnlp-2013-Unsupervised Spectral Learning of WCFG as Low-rank Matrix Completion

13 0.28287604 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

14 0.27384543 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations

15 0.26917979 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity

16 0.25618973 25 emnlp-2013-Appropriately Incorporating Statistical Significance in PMI

17 0.24468234 26 emnlp-2013-Assembling the Kazakh Language Corpus

18 0.24355474 74 emnlp-2013-Event-Based Time Label Propagation for Automatic Dating of News Articles

19 0.2402387 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries

20 0.23713279 189 emnlp-2013-Two-Stage Method for Large-Scale Acquisition of Contradiction Pattern Pairs using Entailment


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.042), (18, 0.022), (22, 0.044), (30, 0.045), (36, 0.366), (50, 0.01), (51, 0.221), (66, 0.022), (71, 0.028), (73, 0.038), (75, 0.014), (77, 0.012), (96, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77308381 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts

Author: Moshe Koppel ; Shachar Seidman

Abstract: The identification of pseudepigraphic texts – texts not written by the authors to which they are attributed – has important historical, forensic and commercial applications. We introduce an unsupervised technique for identifying pseudepigrapha. The idea is to identify textual outliers in a corpus based on the pairwise similarities of all documents in the corpus. The crucial point is that document similarity not be measured in any of the standard ways but rather be based on the output of a recently introduced algorithm for authorship verification. The proposed method strongly outperforms existing techniques in systematic experiments on a blog corpus.

2 0.73218912 160 emnlp-2013-Relational Inference for Wikification

Author: Xiao Cheng ; Dan Roth

Abstract: Wikification, commonly referred to as Disambiguation to Wikipedia (D2W), is the task of identifying concepts and entities in text and disambiguating them into the most specific corresponding Wikipedia pages. Previous approaches to D2W focused on the use of local and global statistics over the given text, Wikipedia articles and its link structures, to evaluate context compatibility among a list of probable candidates. However, these methods fail (often, embarrassingly), when some level of text understanding is needed to support Wikification. In this paper we introduce a novel approach to Wikification by incorporating, along with statistical methods, richer relational analysis of the text. We provide an extensible, efficient and modular Integer Linear Programming (ILP) formulation of Wikification that incorporates the entity-relation inference problem, and show that the ability to identify relations in text helps both candi- date generation and ranking Wikipedia titles considerably. Our results show significant improvements in both Wikification and the TAC Entity Linking task.

3 0.67516077 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

Author: Zhengyan He ; Shujie Liu ; Yang Song ; Mu Li ; Ming Zhou ; Houfeng Wang

Abstract: Entity disambiguation works by linking ambiguous mentions in text to their corresponding real-world entities in knowledge base. Recent collective disambiguation methods enforce coherence among contextual decisions at the cost of non-trivial inference processes. We propose a fast collective disambiguation approach based on stacking. First, we train a local predictor g0 with learning to rank as base learner, to generate initial ranking list of candidates. Second, top k candidates of related instances are searched for constructing expressive global coherence features. A global predictor g1 is trained in the augmented feature space and stacking is employed to tackle the train/test mismatch problem. The proposed method is fast and easy to implement. Experiments show its effectiveness over various algorithms on several public datasets. By learning a rich semantic relatedness measure be- . tween entity categories and context document, performance is further improved.

4 0.54180086 27 emnlp-2013-Authorship Attribution of Micro-Messages

Author: Roy Schwartz ; Oren Tsur ; Ari Rappoport ; Moshe Koppel

Abstract: Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author’s unique “signature”, and show that such signatures are typical of many authors when writing very short texts. We also present a new authorship attribution feature (“flexible patterns”) and demonstrate a significant improvement over our baselines. Our results show that the author of a single tweet can be identified with good accuracy in an array of flavors of the authorship attribution task.

5 0.53516722 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations

Author: Silvia Pareti ; Tim O'Keefe ; Ioannis Konstas ; James R. Curran ; Irena Koprinska

Abstract: Direct quotations are used for opinion mining and information extraction as they have an easy to extract span and they can be attributed to a speaker with high accuracy. However, simply focusing on direct quotations ignores around half of all reported speech, which is in the form of indirect or mixed speech. This work presents the first large-scale experiments in indirect and mixed quotation extraction and attribution. We propose two methods of extracting all quote types from news articles and evaluate them on two large annotated corpora, one of which is a contribution of this work. We further show that direct quotation attribution methods can be successfully applied to indirect and mixed quotation attribution.

6 0.5330869 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

7 0.53044152 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels

8 0.52956599 166 emnlp-2013-Semantic Parsing on Freebase from Question-Answer Pairs

9 0.52929932 152 emnlp-2013-Predicting the Presence of Discourse Connectives

10 0.52902722 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation

11 0.52896851 73 emnlp-2013-Error-Driven Analysis of Challenges in Coreference Resolution

12 0.52880847 96 emnlp-2013-Identifying Phrasal Verbs Using Many Bilingual Corpora

13 0.52816939 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching

14 0.52733022 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

15 0.52705973 181 emnlp-2013-The Effects of Syntactic Features in Automatic Prediction of Morphology

16 0.52699524 62 emnlp-2013-Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?

17 0.52602589 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types

18 0.52597147 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries

19 0.52539021 91 emnlp-2013-Grounding Strategic Conversation: Using Negotiation Dialogues to Predict Trades in a Win-Lose Game

20 0.52529204 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction