emnlp emnlp2013 emnlp2013-27 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Roy Schwartz ; Oren Tsur ; Ari Rappoport ; Moshe Koppel
Abstract: Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author’s unique “signature”, and show that such signatures are typical of many authors when writing very short texts. We also present a new authorship attribution feature (“flexible patterns”) and demonstrate a significant improvement over our baselines. Our results show that the author of a single tweet can be identified with good accuracy in an array of flavors of the authorship attribution task.
Reference: text
sentIndex sentText sentNum sentScore
1 {roys02|oren|arir}@cs.huji.ac.il Abstract Work on authorship attribution has traditionally focused on long texts. [sent-3, score-0.9]
2 We introduce the concept of an author’s unique “signature”, and show that such signatures are typical of many authors when writing very short texts. [sent-6, score-0.332]
3 We also present a new authorship attribution feature (“flexible patterns”) and demonstrate a significant improvement over our baselines. [sent-7, score-0.839]
4 Our results show that the author of a single tweet can be identified with good accuracy in an array of flavors of the authorship attribution task. [sent-8, score-1.119]
5 Research in authorship attribution has developed substantially over the last decade (Stamatatos, 2009). [sent-10, score-0.813]
6 This has led to many recent authorship attribution projects that experimented with web data such as emails (Abbasi and Chen, 2008), web forum messages (Solorio et al. [sent-13, score-0.992]
7 authorship attribution systems, since authorship attribution methods that work well on long texts are often not as useful when applied to short texts (Burrows, 2002; Sanderson and Guenter, 2006). [sent-22, score-1.431]
8 Nonetheless, tweets are relatively self-contained and have smaller sentence length variance compared to excerpts from longer texts (see Section 3). [sent-23, score-0.424]
9 Moreover, an authorship attribution system of tweets may have various applications. [sent-25, score-1.142]
10 We denote the k-signatures of an author a as the features that appear in at least k% of a’s training samples, while not appearing in the training set of any other author. [sent-28, score-0.284]
11 Moreover, a substantial portion of the tweets in our training set contain at least one such signature. [sent-31, score-0.431]
12 We use a rigorous experimental setup, with a varying number of authors (values between 50 and 1,000) and various training set sizes, ranging from 50 to 1,000 tweets per author. [sent-35, score-0.682]
13 Our results show that the author of a tweet can be successfully identified. [sent-41, score-0.334]
14 For example, when using a dataset of as many as 1,000 authors with 200 training tweets per author, we are able to obtain 30. [sent-42, score-0.574]
15 Using a dataset of 50 authors with as few as 50 training tweets per author, we obtain 50. [sent-45, score-0.574]
16 Using a dataset of 50 authors with 1,000 training tweets per author, our results reach as high as 71. [sent-47, score-0.574]
17 The effectiveness of function words as authorship attribution features (Koppel et al. [sent-51, score-0.782]
18 The fact that flexible patterns are learned from plain text in a fully unsupervised manner makes them domain and language independent. [sent-53, score-0.335]
19 We demonstrate that using flexible patterns gives significant improvement over our baseline system. [sent-54, score-0.392]
20 Furthermore, using flexible patterns, our system obtains a 6. [sent-55, score-0.28]
21 1% improvement over current state-of-the-art results in authorship attribution on Twitter. [sent-56, score-0.809]
22 • We provide the most extensive research to date on authorship attribution of micro-messages, and show that authors of very short texts can be successfully identified. [sent-58, score-0.736]
23 • We introduce the concept of an author’s unique k-signature, and demonstrate that such signatures are used by many authors in their writing of micro-messages. [sent-59, score-0.283]
24 • We present a new feature for authorship attribution (flexible patterns) and show its significant added value over other methods. [sent-60, score-0.556]
25 Character n-gram features are especially useful for authorship attribution on micro-messages since they are relatively tolerant to typos and non-standard use of punctuation (Stamatatos, 2009). [sent-71, score-0.782]
26 For efficiency, we consider only character n-gram features that appear at least tcng times in the training set of at least one author (see Section 5). [sent-83, score-0.463]
27 We hypothesize that word n-gram features would be useful for authorship attribution on micro-messages. [sent-85, score-0.782]
28 For efficiency, we consider only word n-gram features that appear at least twng times in the training set of at least one author (see Section 5). [sent-90, score-0.348]
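The pruning rule described in the two sentences above (keep an n-gram only if it appears at least tcng or twng times in the training set of at least one author) can be sketched as follows. This is a minimal sketch: the whitespace tokenization and all function names are our assumptions, not the authors' code.

```python
from collections import Counter


def char_ngrams(text, n):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """All word n-grams, using whitespace tokenization (an assumption)."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def select_features(train, extract, n, threshold):
    """Keep only n-grams that appear at least `threshold` times in the
    training set of at least one author (train: author -> list of tweets)."""
    selected = set()
    for tweets in train.values():
        # Count occurrences within a single author's training set only.
        counts = Counter(g for t in tweets for g in extract(t, n))
        selected.update(g for g, c in counts.items() if c >= threshold)
    return selected
```

With this sketch, `select_features(train, char_ngrams, n, 4)` would mirror the tcng = 4 setting used later, and `select_features(train, word_ngrams, n, 2)` the twng = 2 setting.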
29 Tweets have several properties making them an ideal testbed for authorship attribution of short texts. [sent-102, score-0.896]
30 First, tweets are posted as single units and do not necessarily refer to each other. [sent-103, score-0.394]
31 Second, tweets have more standardized length distribution compared to other types of web data. [sent-105, score-0.392]
32 3 We found that (a) tweets are shorter than standard web data (14. [sent-108, score-0.392]
33 9), and (b) the standard deviation of the length of tweets is much smaller (6. [sent-110, score-0.36]
34 We also remove tweets marked as retweets (using the RT sign, a standard Twitter symbol to indicate that this tweet was written by a different user). [sent-117, score-0.476]
35 As some users retweet without using the RT sign, we also remove tweets that are an exact copy of an existing tweet posted in the previous seven days. [sent-118, score-0.604]
36 Apart from plain text, some tweets contain references to other Twitter users (in the format of @ ). [sent-119, score-0.454]
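The preprocessing steps described above (drop tweets marked with the RT sign, drop exact copies of a tweet posted in the previous seven days, and normalize @-references) might be sketched as follows. The `<REF>` placeholder token and all names are our assumptions; the paper does not specify how references are normalized.

```python
import re
from datetime import datetime, timedelta


def preprocess(tweets):
    """Filter a chronologically sorted list of (timestamp, user, text) tuples.

    Drops tweets marked as retweets with the RT sign, drops exact copies of
    a tweet seen in the previous seven days, and replaces @-mentions with a
    generic <REF> token.
    """
    seen = {}  # text -> timestamp of its last occurrence
    kept = []
    for ts, user, text in tweets:
        if text.startswith("RT ") or " RT " in text:
            continue  # explicit retweet
        last = seen.get(text)
        seen[text] = ts
        if last is not None and ts - last <= timedelta(days=7):
            continue  # unmarked retweet: exact copy within seven days
        kept.append((ts, user, re.sub(r"@\w+", "<REF>", text)))
    return kept
```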
37 4 These comprise ∼15% of all public tweets created from May 2009 to March 2010. [sent-128, score-0.36]
38 Figure 1: Number of users with at least x k-signatures (100 authors, 180 training tweets per author). [sent-129, score-0.661]
39 We define the concept of the k-signature of an author a to be a feature that appears in at least k% of a’s training set, while not appearing in the training set of any other user. [sent-133, score-0.319]
40 Such signatures can be useful for identifying future (unlabeled) tweets written by a. [sent-134, score-0.432]
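The k-signature definition above translates directly into code. A minimal sketch, assuming each training tweet has already been mapped to a set of features; the function name is ours:

```python
from collections import Counter


def k_signatures(train_features, k):
    """train_features: author -> list of feature sets (one set per tweet).

    Returns author -> set of k-signatures: features occurring in at least
    k% of that author's training tweets and never in any other author's
    training set (the definition given in the paper).
    """
    # Everything each author ever uses in training.
    appears = {a: set().union(*sets) if sets else set()
               for a, sets in train_features.items()}
    sigs = {}
    for a, sets in train_features.items():
        others = set().union(*(appears[b] for b in appears if b != a))
        threshold = k / 100.0 * len(sets)
        counts = Counter(f for s in sets for f in s)
        sigs[a] = {f for f, c in counts.items()
                   if c >= threshold and f not in others}
    return sigs
```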
41 To validate our hypothesis, we use a dataset of 100 authors with 180 tweets per author. [sent-135, score-0.546]
42 Results demonstrate that 81 users use at least one 2%-signature, 43 users use at least one 5%-signature, and 17 users use at least one 10%-signature. [sent-138, score-0.441]
43 Table 2 provides examples of tweets posted by such users. [sent-145, score-0.394]
44 5 Another interesting question is how many tweets contain at least one k-signature. [sent-146, score-0.403]
45 Figure 2 shows for each user the number of tweets in her training set for which at least one k-signature is found. [sent-147, score-0.431]
46 6% of the training tweets contain at least one 2%-signature, 10. [sent-149, score-0.431]
47 3% of the training tweets contain at least one 5%-signature and 6. [sent-150, score-0.431]
48 5% of the training tweets contain at least one 10%-signature. [sent-151, score-0.431]
49 These findings also have direct implications for authorship attribution of micro-messages, since k-signatures are reliable classification features. [sent-153, score-0.808]
50 Figure 2: Number of users with at least x training tweets that contain at least one k-signature (100 authors, 180 training tweets per author). [sent-158, score-1.067]
51 In order to test the effect of the training set size, we experiment with an increasingly larger number of tweets per author. [sent-166, score-0.487]
52 Experimenting with a range of training set sizes serves two purposes: (a) to check whether the author of a tweet can be identified using a very small number of (short) training samples, and (b) to check how much our system can benefit from training on a larger corpus. [sent-167, score-0.464]
53 10,183 users), and randomly select 1,000 tweets per user. [sent-169, score-0.428]
54 7 We perform a set of classification experiments, selecting for each author an increasingly larger subset of her 1,000 tweets as training set. [sent-171, score-0.63]
55 In a second set of experiments, we use an increasingly larger number of authors (values between 100-1,000), in order to check whether the author of a very short text can be identified in a “needle in a haystack” type of setting. [sent-175, score-0.419]
56 Due to complexity issues, we only experiment with 200 tweets per author as training set. [sent-176, score-0.641]
57 We use the same threshold values as the 200 tweets per author setting previously described (tcng = 4, twng = 2). [sent-178, score-0.662]
58 Figure 3: Authorship attribution accuracy for 50 authors with various training set sizes. [sent-180, score-0.456]
59 Figure 4: Authorship attribution accuracy with varying number of candidate authors, using 200 training tweets per author. [sent-199, score-0.831]
60 Results demonstrate that authors of very short texts can be successfully identified, even with as few as 50 tweets per author (49. [sent-206, score-0.907]
61 Figure 4 shows our results for various numbers of authors, using 200 tweets per author as training set. [sent-212, score-0.641]
62 Results demonstrate that authors of an unknown tweet can be identified to a large extent even when there are as many as 1,000 candidate authors (30. [sent-213, score-0.418]
63 Results further validate that word n-gram features substantially improve over character n-grams. 9 Results for 50 authors with 200 tweets per author are taken from Figure 3. [sent-216, score-0.877]
64 Results demonstrate that we are able to obtain very high precision (over 90%) while still maintaining a relatively high recall (from ∼35% recall for 50 tweets per author up to >60% recall for 1,000 tweets per author). [sent-222, score-0.886]
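The precision/recall trade-off reported here presumably comes from answering only when the classifier is sufficiently confident. That thresholding step can be sketched as follows; the confidence score itself (e.g. an SVM margin) is our assumption, as is the function name:

```python
def precision_recall(predictions, gold, threshold):
    """predictions: list of (predicted_author, confidence) pairs.

    The system answers only when confidence >= threshold; precision is
    accuracy over answered cases, recall the fraction of all cases that
    are answered correctly.  Raising the threshold trades recall for
    precision, as in the curves described in the paper.
    """
    answered = correct = 0
    for (pred, conf), g in zip(predictions, gold):
        if conf >= threshold:
            answered += 1
            if pred == g:
                correct += 1
    precision = correct / answered if answered else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```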
65 We show that flexible patterns can be used to improve classification results. [sent-227, score-0.361]
66 As a result, flexible patterns can pick up fine-grained differences between authors’ styles. [sent-229, score-0.335]
67 Flexible patterns can serve as binary classification features; a tweet matches a given flexible pattern if it contains the flexible pattern sequence. [sent-249, score-0.838]
68 A flexible pattern may appear in a given tweet with additional words not originally found in the flexible pattern, and/or with only a subset of the HFWs (Davidov et al. [sent-253, score-0.673]
69 Similarly, (4) is another partial match of (1), since (a) the word “good” is not part of the original flexible pattern and (b) the second occurrence of the word “the” does not appear in (4) (missing word is marked by ). [sent-256, score-0.306]
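A simplified sketch of flexible-pattern matching as a binary feature: high-frequency words (HFWs) must appear literally, and each CW slot matches one arbitrary word. The paper additionally scores partial matches (extra words, missing HFWs), which this sketch omits; the example pattern below is hypothetical.

```python
import re


def compile_pattern(pattern):
    """Compile a flexible pattern like 'the CW the' into a regex.

    HFW tokens are matched literally; each CW slot matches exactly one
    whitespace-delimited word.  A strict-match simplification of the
    matching procedure described in the paper.
    """
    parts = []
    for tok in pattern.split():
        parts.append(r"\S+" if tok == "CW" else re.escape(tok))
    return re.compile(r"(?:^|\s)" + r"\s+".join(parts) + r"(?:\s|$)")


def matches(tweet, compiled):
    """Binary feature: does the tweet contain the flexible pattern?"""
    return bool(compiled.search(tweet.lower()))
```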
70 We repeat our experiments with varying training set sizes (see Section 5) with two more systems: one that uses character n-grams and flexible pattern features, and another that uses character n-grams, word n-grams and flexible patterns. [sent-268, score-0.923]
71 We only consider flexible pattern features that appear at least tfp times in the training set of at least one author. [sent-270, score-0.452]
72 Results demonstrate that flexible pattern features have an added value over both character n-grams alone (averaged 2. [sent-274, score-0.451]
73 5% Figure 7: Authorship attribution accuracy for 50 authors with various training set sizes and various feature sets. [sent-276, score-0.499]
74 Results demonstrate that it is highly significant in all settings, with p-values between 10−3 (for 50 tweets per author) and 10−8 (for 1,000 tweets per author). [sent-283, score-0.886]
75 These margins are explained by the choice of algorithm (SVM and not SCAP/naive Bayes) and our set of features (character n-grams + word n-grams + flexible patterns compared to character n-grams only). [sent-289, score-0.486]
76 To illustrate the additional contribution of flexible patterns over word n-grams, consider the following tweets, written by the same author. [sent-297, score-0.335]
77 However, this style can be successfully identified using the flexible pattern (9), shared by (5-8). [sent-316, score-0.435]
78 This demonstrates the added value that flexible pattern features have over word n-gram features. [sent-318, score-0.306]
79 8 Related Work Authorship attribution dates back to the end of the 19th century, when Mendenhall (1887) applied sentence length and word length features to plays of Shakespeare. [sent-319, score-0.31]
80 Authorship attribution methods can be generally divided into two categories (Stamatatos, 2009). [sent-323, score-0.31]
81 In similarity-based methods, an anonymous text is attributed to some author whose writing style is most similar (by some distance metric). [sent-324, score-0.276]
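The similarity-based family described here reduces to a nearest-profile rule. A minimal sketch with cosine distance over bag-of-feature author profiles; the distance metric is one of several possible choices, and the function names are ours:

```python
def cosine_distance(u, v):
    """Cosine distance between two sparse feature-count dicts."""
    dot = sum(w * v.get(f, 0) for f, w in u.items())
    nu = sum(w * w for w in u.values()) ** 0.5
    nv = sum(w * w for w in v.values()) ** 0.5
    return 1 - dot / (nu * nv) if nu and nv else 1.0


def attribute(unknown_vec, profiles, distance):
    """Attribute an anonymous text to the author whose profile vector is
    closest under the given distance metric."""
    return min(profiles, key=lambda a: distance(unknown_vec, profiles[a]))
```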
82 Traditionally, authorship attribution systems have mainly been evaluated against long texts such as theater plays (Mendenhall, 1887), essays (Yule, 1939; Mosteller and Wallace, 1964), biblical books (Mealand, 1995; Koppel et al. [sent-337, score-0.846]
83 The perfor- mance of authorship attribution systems on short texts is affected by several factors (Stamatatos, 2009). [sent-349, score-0.895]
84 They experimented with 50 authors and compared different numbers of tweets per author (values between 20-200). [sent-356, score-0.768]
85 In our work, we noticed a different trend, and showed that more data can be extremely valuable for authorship attribution systems on micro-messages (see Section 6). [sent-358, score-0.782]
86 She also provided a set of experiments that studied the effect of joining several tweets into a single document. [sent-366, score-0.36]
87 They experimented with 10 authors of Greek text, and also joined several tweets into a single document. [sent-368, score-0.515]
88 Joining several tweets into a longer document is appealing since it can lead to substantial improvement of the classification results, as demonstrated by the works above. [sent-369, score-0.413]
89 However, this approach requires the test data to contain several tweets that are known a-priori to be written by the same author. [sent-370, score-0.36]
90 Patterns were first extracted in a fully unsupervised manner (“flexible patterns”) by (Davidov and Rappoport, 2006), who used flexible patterns in order to establish noun categories, and (Biçici and Yuret, 2006), who used them for analogy question answering. [sent-377, score-0.365]
91 Ever since, flexible patterns were used as features for various tasks such as extraction of semantic relationships (Davidov et al. [sent-378, score-0.335]
92 We have shown that authors of very short texts can be successfully identified in an array of authorship attribution settings reported for long documents. [sent-384, score-0.642]
93 Last, we presented the first authorship attribution system that uses flexible patterns, and demonstrated that using these features significantly improves over other systems. [sent-388, score-1.033]
94 Identifying authorship by byte-level n-grams: The source code author profile (SCAP) method. [sent-496, score-0.657]
95 A forensic authorship classification in sms messages: A likelihood ratio based approach using n-gram. [sent-510, score-0.532]
96 Authorship attribution for twitter in 140 characters or less. [sent-570, score-0.38]
97 Authorship attribution in greek tweets using authors multilevel n-gram profiles. [sent-588, score-0.788]
98 Authorship attribution of sms messages using an n-grams approach. [sent-592, score-0.416]
99 Short text authorship attribution via sequence kernels, markov chains and author unmasking: An investigation. [sent-604, score-0.967]
100 Modality specific meta features for authorship attribution in web forum posts. [sent-618, score-0.91]
wordName wordTfidf (topN-words)
[('authorship', 0.472), ('tweets', 0.36), ('attribution', 0.31), ('flexible', 0.251), ('koppel', 0.236), ('author', 0.185), ('davidov', 0.15), ('layton', 0.146), ('authors', 0.118), ('tweet', 0.116), ('character', 0.115), ('abbasi', 0.113), ('hfws', 0.113), ('moshe', 0.097), ('users', 0.094), ('schler', 0.09), ('patterns', 0.084), ('ari', 0.083), ('kjell', 0.081), ('stamatatos', 0.079), ('signatures', 0.072), ('messages', 0.072), ('twitter', 0.07), ('dmitry', 0.069), ('per', 0.068), ('experimenting', 0.068), ('varying', 0.065), ('scap', 0.065), ('testbed', 0.065), ('rappoport', 0.064), ('texts', 0.064), ('style', 0.06), ('shlomo', 0.06), ('meta', 0.059), ('sanderson', 0.056), ('pattern', 0.055), ('tsur', 0.054), ('literary', 0.054), ('mosteller', 0.051), ('cw', 0.049), ('short', 0.049), ('boutwell', 0.049), ('hoorn', 0.049), ('tcng', 0.049), ('thehfw', 0.049), ('twng', 0.049), ('least', 0.043), ('sizes', 0.043), ('wallace', 0.042), ('guenter', 0.042), ('argamon', 0.041), ('bayes', 0.041), ('vel', 0.039), ('experimented', 0.037), ('svm', 0.037), ('forum', 0.037), ('identified', 0.036), ('efstathios', 0.036), ('margins', 0.036), ('solorio', 0.036), ('concept', 0.035), ('oren', 0.035), ('posted', 0.034), ('sms', 0.034), ('jonathan', 0.033), ('curves', 0.033), ('successfully', 0.033), ('berland', 0.032), ('bici', 0.032), ('bots', 0.032), ('frantzeskou', 0.032), ('hfw', 0.032), ('matthews', 0.032), ('mikros', 0.032), ('mohan', 0.032), ('tfp', 0.032), ('thorship', 0.032), ('web', 0.032), ('substantially', 0.031), ('naive', 0.031), ('increasingly', 0.031), ('signature', 0.031), ('writing', 0.031), ('demonstrate', 0.03), ('analogy', 0.03), ('obtains', 0.029), ('averaged', 0.028), ('hsinchun', 0.028), ('mendenhall', 0.028), ('unmasking', 0.028), ('training', 0.028), ('receives', 0.028), ('unique', 0.027), ('improvement', 0.027), ('columbus', 0.026), ('stylistic', 0.026), ('ngrams', 0.026), ('cws', 0.026), ('sarcastic', 0.026), ('classification', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 27 emnlp-2013-Authorship Attribution of Micro-Messages
Author: Roy Schwartz ; Oren Tsur ; Ari Rappoport ; Moshe Koppel
Abstract: Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author’s unique “signature”, and show that such signatures are typical of many authors when writing very short texts. We also present a new authorship attribution feature (“flexible patterns”) and demonstrate a significant improvement over our baselines. Our results show that the author of a single tweet can be identified with good accuracy in an array of flavors of the authorship attribution task.
2 0.190874 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts
Author: Moshe Koppel ; Shachar Seidman
Abstract: The identification of pseudepigraphic texts texts not written by the authors to which they are attributed – has important historical, forensic and commercial applications. We introduce an unsupervised technique for identifying pseudepigrapha. The idea is to identify textual outliers in a corpus based on the pairwise similarities of all documents in the corpus. The crucial point is that document similarity not be measured in any of the standard ways but rather be based on the output of a recently introduced algorithm for authorship verification. The proposed method strongly outperforms existing techniques in systematic experiments on a blog corpus. 1
3 0.1840736 95 emnlp-2013-Identifying Multiple Userids of the Same Author
Author: Tieyun Qian ; Bing Liu
Abstract: This paper studies the problem of identifying users who use multiple userids to post in social media. Since multiple userids may belong to the same author, it is hard to directly apply supervised learning to solve the problem. This paper proposes a new method, which still uses supervised learning but does not require training documents from the involved userids. Instead, it uses documents from other userids for classifier building. The classifier can be applied to documents of the involved userids. This is possible because we transform the document space to a similarity space and learning is performed in this new space. Our evaluation is done in the online review domain. The experimental results using a large number of userids and their reviews show that the proposed method is highly effective. 1
4 0.18152629 16 emnlp-2013-A Unified Model for Topics, Events and Users on Twitter
Author: Qiming Diao ; Jing Jiang
Abstract: With the rapid growth of social media, Twitter has become one of the most widely adopted platforms for people to post short and instant message. On the one hand, people tweets about their daily lives, and on the other hand, when major events happen, people also follow and tweet about them. Moreover, people’s posting behaviors on events are often closely tied to their personal interests. In this paper, we try to model topics, events and users on Twitter in a unified way. We propose a model which combines an LDA-like topic model and the Recurrent Chinese Restaurant Process to capture topics and events. We further propose a duration-based regularization component to find bursty events. We also propose to use event-topic affinity vectors to model the asso- . ciation between events and topics. Our experiments shows that our model can accurately identify meaningful events and the event-topic affinity vectors are effective for event recommendation and grouping events by topics.
5 0.1574305 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels
Author: Vikas Ganjigunte Ashok ; Song Feng ; Yejin Choi
Abstract: Predicting the success of literary works is a curious question among publishers and aspiring writers alike. We examine the quantitative connection, if any, between writing style and successful literature. Based on novels over several different genres, we probe the predictive power of statistical stylometry in discriminating successful literary works, and identify characteristic stylistic elements that are more prominent in successful writings. Our study reports for the first time that statistical stylometry can be surprisingly effective in discriminating highly successful literature from less successful counterpart, achieving accuracy up to 84%. Closer analyses lead to several new insights into characteristics ofthe writing style in successful literature, including findings that are contrary to the conventional wisdom with respect to good writing style and readability. ,
6 0.15559177 109 emnlp-2013-Is Twitter A Better Corpus for Measuring Sentiment Similarity?
7 0.14541683 163 emnlp-2013-Sarcasm as Contrast between a Positive Sentiment and Negative Situation
8 0.12881145 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts
9 0.10013375 81 emnlp-2013-Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media
10 0.098360293 61 emnlp-2013-Detecting Promotional Content in Wikipedia
11 0.075546175 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization
12 0.068713196 18 emnlp-2013-A temporal model of text periodicities using Gaussian Processes
13 0.06568715 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
14 0.0649243 168 emnlp-2013-Semi-Supervised Feature Transformation for Dependency Parsing
15 0.06346152 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations
16 0.062502205 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication
17 0.060533222 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation
18 0.055346139 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
19 0.052519713 177 emnlp-2013-Studying the Recursive Behaviour of Adjectival Modification with Compositional Distributional Semantics
20 0.051809885 131 emnlp-2013-Mining New Business Opportunities: Identifying Trend related Products by Leveraging Commercial Intents from Microblogs
topicId topicWeight
[(0, -0.174), (1, 0.081), (2, -0.161), (3, -0.103), (4, 0.027), (5, -0.111), (6, 0.022), (7, 0.076), (8, 0.025), (9, 0.052), (10, -0.09), (11, 0.217), (12, 0.045), (13, 0.011), (14, 0.101), (15, -0.02), (16, -0.114), (17, -0.073), (18, -0.11), (19, -0.069), (20, -0.365), (21, 0.21), (22, 0.171), (23, -0.096), (24, -0.218), (25, 0.054), (26, 0.024), (27, -0.077), (28, -0.053), (29, -0.03), (30, 0.079), (31, -0.061), (32, 0.045), (33, 0.058), (34, -0.033), (35, -0.046), (36, -0.029), (37, -0.049), (38, -0.048), (39, 0.12), (40, -0.114), (41, 0.053), (42, -0.004), (43, 0.068), (44, 0.025), (45, -0.058), (46, 0.055), (47, 0.006), (48, -0.033), (49, 0.008)]
simIndex simValue paperId paperTitle
same-paper 1 0.96306717 27 emnlp-2013-Authorship Attribution of Micro-Messages
Author: Roy Schwartz ; Oren Tsur ; Ari Rappoport ; Moshe Koppel
Abstract: Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author’s unique “signature”, and show that such signatures are typical of many authors when writing very short texts. We also present a new authorship attribution feature (“flexible patterns”) and demonstrate a significant improvement over our baselines. Our results show that the author of a single tweet can be identified with good accuracy in an array of flavors of the authorship attribution task.
2 0.68151253 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts
Author: Moshe Koppel ; Shachar Seidman
Abstract: The identification of pseudepigraphic texts texts not written by the authors to which they are attributed – has important historical, forensic and commercial applications. We introduce an unsupervised technique for identifying pseudepigrapha. The idea is to identify textual outliers in a corpus based on the pairwise similarities of all documents in the corpus. The crucial point is that document similarity not be measured in any of the standard ways but rather be based on the output of a recently introduced algorithm for authorship verification. The proposed method strongly outperforms existing techniques in systematic experiments on a blog corpus. 1
3 0.61128837 95 emnlp-2013-Identifying Multiple Userids of the Same Author
Author: Tieyun Qian ; Bing Liu
Abstract: This paper studies the problem of identifying users who use multiple userids to post in social media. Since multiple userids may belong to the same author, it is hard to directly apply supervised learning to solve the problem. This paper proposes a new method, which still uses supervised learning but does not require training documents from the involved userids. Instead, it uses documents from other userids for classifier building. The classifier can be applied to documents of the involved userids. This is possible because we transform the document space to a similarity space and learning is performed in this new space. Our evaluation is done in the online review domain. The experimental results using a large number of userids and their reviews show that the proposed method is highly effective. 1
4 0.59786087 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels
Author: Vikas Ganjigunte Ashok ; Song Feng ; Yejin Choi
Abstract: Predicting the success of literary works is a curious question among publishers and aspiring writers alike. We examine the quantitative connection, if any, between writing style and successful literature. Based on novels over several different genres, we probe the predictive power of statistical stylometry in discriminating successful literary works, and identify characteristic stylistic elements that are more prominent in successful writings. Our study reports for the first time that statistical stylometry can be surprisingly effective in discriminating highly successful literature from less successful counterpart, achieving accuracy up to 84%. Closer analyses lead to several new insights into characteristics ofthe writing style in successful literature, including findings that are contrary to the conventional wisdom with respect to good writing style and readability. ,
5 0.42996877 61 emnlp-2013-Detecting Promotional Content in Wikipedia
Author: Shruti Bhosale ; Heath Vinicombe ; Raymond Mooney
Abstract: This paper presents an approach for detecting promotional content in Wikipedia. By incorporating stylometric features, including features based on n-gram and PCFG language models, we demonstrate improved accuracy at identifying promotional articles, compared to using only lexical information and metafeatures.
6 0.42804721 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts
7 0.42425349 163 emnlp-2013-Sarcasm as Contrast between a Positive Sentiment and Negative Situation
8 0.39010713 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations
9 0.37596083 109 emnlp-2013-Is Twitter A Better Corpus for Measuring Sentiment Similarity?
10 0.36095229 16 emnlp-2013-A Unified Model for Topics, Events and Users on Twitter
11 0.30888751 189 emnlp-2013-Two-Stage Method for Large-Scale Acquisition of Contradiction Pattern Pairs using Entailment
12 0.30711812 81 emnlp-2013-Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media
13 0.27606112 26 emnlp-2013-Assembling the Kazakh Language Corpus
14 0.27357265 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization
15 0.27353731 72 emnlp-2013-Elephant: Sequence Labeling for Word and Sentence Segmentation
16 0.26156718 34 emnlp-2013-Automatically Classifying Edit Categories in Wikipedia Revisions
18 0.23291144 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication
20 0.22668713 21 emnlp-2013-An Empirical Study Of Semi-Supervised Chinese Word Segmentation Using Co-Training
topicId topicWeight
[(3, 0.035), (18, 0.036), (22, 0.039), (30, 0.092), (47, 0.014), (50, 0.014), (51, 0.211), (66, 0.05), (71, 0.032), (73, 0.29), (75, 0.034), (90, 0.011), (96, 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.79865789 27 emnlp-2013-Authorship Attribution of Micro-Messages
Author: Roy Schwartz ; Oren Tsur ; Ari Rappoport ; Moshe Koppel
Abstract: Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author’s unique “signature”, and show that such signatures are typical of many authors when writing very short texts. We also present a new authorship attribution feature (“flexible patterns”) and demonstrate a significant improvement over our baselines. Our results show that the author of a single tweet can be identified with good accuracy in an array of flavors of the authorship attribution task.
2 0.64920777 95 emnlp-2013-Identifying Multiple Userids of the Same Author
Author: Tieyun Qian ; Bing Liu
Abstract: This paper studies the problem of identifying users who use multiple userids to post in social media. Since multiple userids may belong to the same author, it is hard to directly apply supervised learning to solve the problem. This paper proposes a new method, which still uses supervised learning but does not require training documents from the involved userids. Instead, it uses documents from other userids for classifier building. The classifier can be applied to documents of the involved userids. This is possible because we transform the document space to a similarity space and learning is performed in this new space. Our evaluation is done in the online review domain. The experimental results using a large number of userids and their reviews show that the proposed method is highly effective. 1
3 0.63744336 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology
Author: Yiping Jin ; Min-Yen Kan ; Jun-Ping Ng ; Xiangnan He
Abstract: This paper presents DefMiner, a supervised sequence labeling system that identifies scientific terms and their accompanying definitions. DefMiner achieves 85% F1 on a Wikipedia benchmark corpus, significantly improving the previous state-of-the-art by 8%. We exploit DefMiner to process the ACL Anthology Reference Corpus (ARC) – a large, real-world digital library of scientific articles in computational linguistics. The resulting automatically-acquired glossary represents the terminology defined over several thousand individual research articles. We highlight several interesting observations: more definitions are introduced for conference and workshop papers over the years, and multiword terms account for slightly less than half of all terms. Obtaining a list of popular, defined terms in a corpus of computational linguistics papers, we find that concepts can often be categorized into one of three categories: resources, methodologies and evaluation metrics.
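DefMiner itself is a supervised sequence labeler; as a minimal stand-in that only illustrates the (term, definition) pairs the task produces, a toy "X is a Y" pattern matcher (the regex is an assumption for illustration, not the paper's method):

```python
import re

# Toy definitional pattern: "[A/An/The] TERM is [a/an/defined as] DEFINITION."
DEF_PATTERN = re.compile(r"(?:An?|The)?\s*(.+?) is (?:defined as|an|a) (.+)\.")

def mine_definition(sentence):
    """Return a (term, definition) pair if the sentence matches, else None."""
    m = DEF_PATTERN.match(sentence)
    return (m.group(1).strip(), m.group(2).strip()) if m else None

print(mine_definition("A transformer is a neural sequence model."))
# -> ('transformer', 'neural sequence model')
```

A real system would label each token (e.g. TERM vs. DEFINITION vs. other) with a trained sequence model rather than rely on brittle surface patterns.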
4 0.63686609 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
Author: Kuzman Ganchev ; Dipanjan Das
Abstract: We present a framework for cross-lingual transfer of sequence information from a resource-rich source language to a resourceimpoverished target language that incorporates soft constraints via posterior regularization. To this end, we use automatically word aligned bitext between the source and target language pair, and learn a discriminative conditional random field model on the target side. Our posterior regularization constraints are derived from simple intuitions about the task at hand and from cross-lingual alignment information. We show improvements over strong baselines for two tasks: part-of-speech tagging and namedentity segmentation.
5 0.63586968 143 emnlp-2013-Open Domain Targeted Sentiment
Author: Margaret Mitchell ; Jacqui Aguilar ; Theresa Wilson ; Benjamin Van Durme
Abstract: We propose a novel approach to sentiment analysis for a low resource setting. The intuition behind this work is that sentiment expressed towards an entity, targeted sentiment, may be viewed as a span of sentiment expressed across the entity. This representation allows us to model sentiment detection as a sequence tagging problem, jointly discovering people and organizations along with whether there is sentiment directed towards them. We compare performance in both Spanish and English on microblog data, using only a sentiment lexicon as an external resource. By leveraging linguistically-informed features within conditional random fields (CRFs) trained to minimize empirical risk, our best models in Spanish significantly outperform a strong baseline, and reach around 90% accuracy on the combined task of named entity recognition and sentiment prediction. Our models in English, trained on a much smaller dataset, are not yet statistically significant against their baselines.
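The joint entity-plus-sentiment tagging idea can be illustrated by collapsing entity type and sentiment into a single BIO tag per token (the tag naming scheme below is an assumption for illustration, not necessarily the paper's):

```python
def encode_targeted(tokens, spans):
    """Collapse entity type and sentiment into one joint BIO tag per token.

    spans: list of (start, end, entity_type, sentiment), end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, ent, sent in spans:
        tags[start] = f"B-{ent}-{sent}"
        for i in range(start + 1, end):
            tags[i] = f"I-{ent}-{sent}"
    return tags

print(encode_targeted(["I", "love", "Acme", "Corp"],
                      [(2, 4, "ORG", "positive")]))
# -> ['O', 'O', 'B-ORG-positive', 'I-ORG-positive']
```

With this encoding, a standard sequence labeler such as a CRF predicts entity boundaries and the sentiment directed at each entity in a single pass.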
6 0.63450783 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity
7 0.63423645 167 emnlp-2013-Semi-Markov Phrase-Based Monolingual Alignment
8 0.63392812 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation
9 0.63370091 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
10 0.63251978 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks
11 0.63212729 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction
12 0.6313681 154 emnlp-2013-Prior Disambiguation of Word Tensors for Constructing Sentence Vectors
13 0.63130373 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery
14 0.63082093 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)
15 0.63061816 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
16 0.62966603 69 emnlp-2013-Efficient Collective Entity Linking with Stacking
17 0.62899309 152 emnlp-2013-Predicting the Presence of Discourse Connectives
18 0.62778771 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction
19 0.62695521 106 emnlp-2013-Inducing Document Plans for Concept-to-Text Generation
20 0.6261369 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types