emnlp emnlp2013 emnlp2013-61 knowledge-graph by maker-knowledge-mining

61 emnlp-2013-Detecting Promotional Content in Wikipedia


Source: pdf

Author: Shruti Bhosale ; Heath Vinicombe ; Raymond Mooney

Abstract: This paper presents an approach for detecting promotional content in Wikipedia. By incorporating stylometric features, including features based on n-gram and PCFG language models, we demonstrate improved accuracy at identifying promotional articles, compared to using only lexical information and metafeatures.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This paper presents an approach for detecting promotional content in Wikipedia. [sent-4, score-0.805]

2 By incorporating stylometric features, including features based on n-gram and PCFG language models, we demonstrate improved accuracy at identifying promotional articles, compared to using only lexical information and metafeatures. [sent-5, score-0.877]

3 1 Introduction Wikipedia is a free, collaboratively edited encyclopedia. [sent-6, score-0.04]

4 Since normally anyone can create and edit pages, some articles are written in a promotional tone, violating Wikipedia’s policy requiring a neutral viewpoint. [sent-7, score-1.113]

5 Currently, such articles are identified manually and tagged with an appropriate cleanup message by Wikipedia editors. [sent-8, score-0.256]

6 Hence, we present an approach to automatically detect promotional articles. [sent-10, score-0.738]

7 Related work in quality flaw detection in Wikipedia (Anderka et al., 2012) has relied on meta-features based on edit history, Wikipedia links, structural features and counts of words, sentences and paragraphs. [sent-11, score-0.04]

9 However, we hypothesize that there are subtle differences in the linguistic style that distinguish promotional tone, which we attempt to capture using stylometric features, particularly deeper syntactic features. [sent-13, score-1.068]

10 We model the style of promotional and normal articles using language models. [sent-14, score-1.065]

11 We show that using such stylometric features improves over using only shallow lexical and meta-features. [sent-17, score-0.169]

12 Anderka et al. (2012) developed a general model for detecting ten of Wikipedia's most frequent quality flaws. [sent-19, score-0.041]

13 One of these flaw types, "Advert", refers to articles written like advertisements. [sent-20, score-0.306]

14 Their classifiers were trained using a set of lexical, structural, network and edit-history related features of Wikipedia articles. [sent-21, score-0.039]

15 However, they used no features capturing syntactic structure at a level deeper than Part-of-Speech (POS) tags. [sent-22, score-0.095]

16 A related area is that of vandalism detection in Wikipedia. [sent-23, score-0.095]

17 Several systems have been developed to detect vandalizing edits in Wikipedia. [sent-24, score-0.03]

18 These fall into two major categories: those analyzing author information and edit metadata (Wilkinson and Huberman, 2007; Stein and Hess, 2007); and those using NLP techniques such as n-gram language models and PCFGs (Wang and McKeown, 2010; Harpalani et al., 2011). [sent-25, score-0.051]

19 We combine relevant features from both these categories to train a classifier that distinguishes promotional content from normal Wikipedia articles. [sent-27, score-0.874]

20 3 Dataset Collection We extracted a set of about 13,000 articles from English Wikipedia's category, "Category:All articles with a promotional tone", as a set of positive examples ("Advert" is the flaw type of the majority of the articles in this category). [sent-28, score-1.170]

22 We extracted a set of 26,000 untagged articles to form a noisy set of negative examples, which may contain some promotional articles that have not yet been tagged by Wikipedia editors. [sent-32, score-1.302]

23 We used 70% of the articles in each category to train language models for each of the three categories (promotional articles, featured/good articles, untagged articles), and used the remaining 30% to evaluate classifier performance using 10-fold cross-validation. [sent-35, score-0.403]
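The paper does not describe the extraction tooling. As a minimal sketch under that caveat, a category dump like the one above could be collected with the standard MediaWiki API; the function and variable names below are illustrative, not the authors' code:

```python
# Sketch: list members of a Wikipedia category via the MediaWiki API.
# The endpoint and parameters are standard MediaWiki; everything else is
# an assumption about how such a crawl could be done.
import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(category, limit=500):
    """Yield article titles in a category, following continuation tokens."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
        "cmnamespace": 0,  # main/article namespace only
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params).json()
        for page in data["query"]["categorymembers"]:
            yield page["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # paginate

positives = list(category_members("Category:All articles with a promotional tone"))
```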

24 1 Content and Meta Features of an Article We used the content and meta features proposed by Anderka et al. (2012). [sent-37, score-0.161]

25 This feature is the average of the sentiment scores assigned by SentiWordNet (Baccianella et al., 2010) to all positive and negative sentiment-bearing words in an article. [sent-45, score-0.049]
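A sketch of how such an "Overall Sentiment Score" could be computed with NLTK's SentiWordNet interface. The paper does not specify sense disambiguation or the exact aggregation, so taking the first sense and averaging signed (pos − neg) scores are assumptions:

```python
# Sketch of the "Overall Sentiment Score" feature. Requires the NLTK data
# packages 'sentiwordnet', 'wordnet' and 'punkt'.
from nltk.corpus import sentiwordnet as swn
from nltk.tokenize import word_tokenize

def overall_sentiment_score(text):
    """Average signed sentiment over sentiment-bearing words.
    Assumptions: first WordNet sense, pos - neg aggregation."""
    scores = []
    for token in word_tokenize(text.lower()):
        synsets = list(swn.senti_synsets(token))
        if not synsets:
            continue
        s = synsets[0]  # crude sense choice: the first listed synset
        if s.pos_score() > 0 or s.neg_score() > 0:  # sentiment-bearing only
            scores.append(s.pos_score() - s.neg_score())
    return sum(scores) / len(scores) if scores else 0.0
```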

27 2 N-Gram Language Models Language models are commonly used to measure stylistic differences in language usage between authors. [sent-49, score-0.044]

28 For this work, we employed them to model the difference in style of neutral vs. promotional articles. [sent-50, score-0.16]

29 We trained trigram word language models and trigram character language models with Witten-Bell smoothing to produce probabilistic models of both classes (modeling longer character sequences did not help). [sent-52, score-0.188]
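The paper does not name a language-modeling toolkit. As one possible realization, the character-trigram variant can be sketched with NLTK's lm module, where an article's cross-entropy under each class's model becomes a classifier feature:

```python
# Sketch: per-class character-trigram language models with Witten-Bell
# smoothing (NLTK's lm module; the paper's actual toolkit is not stated).
from nltk.lm import WittenBellInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

ORDER = 3  # trigrams

def train_char_lm(texts):
    """Fit one character-trigram model on a list of raw article strings."""
    chars = [list(t) for t in texts]  # treat each character as a token
    train, vocab = padded_everygram_pipeline(ORDER, chars)
    lm = WittenBellInterpolated(ORDER)
    lm.fit(train, vocab)
    return lm

def char_lm_feature(lm, text):
    """Cross-entropy of the article under the model; lower = better fit."""
    grams = list(ngrams(pad_both_ends(list(text), n=ORDER), ORDER))
    return lm.entropy(grams)

# One model per class; each article then yields one feature per model,
# e.g. char_lm_feature(promo_lm, article) vs. char_lm_feature(neutral_lm, article).
```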

30 We hypothesize that sentences in promotional articles and those in neutral articles tend to have different kinds of syntactic structures; therefore, we explored the utility of PCFG models for detecting this difference. [sent-55, score-1.316]

31 Since we do not have ground-truth parse trees for sentences in our dataset, we used the approach of Raghavan et al. (2011), which uses the output of the Stanford parser to train PCFG models for stylistic analysis. [sent-56, score-0.092]
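A sketch of the PCFG side in the spirit of Raghavan et al. (2011): induce one grammar per class from automatically produced parse trees, then score articles by rule log-probabilities. The bracketed parses (e.g. Stanford parser output rooted at ROOT) are assumed inputs; the exact feature set is not reproduced here:

```python
# Sketch: per-class PCFGs induced from automatically parsed sentences.
# Parsing is assumed to have been done already; inputs are bracketed
# parse strings such as "(ROOT (S (NP ...) (VP ...)))".
import math
from nltk import Tree, Nonterminal, induce_pcfg

def train_pcfg(bracketed_parses):
    productions = []
    for s in bracketed_parses:
        productions.extend(Tree.fromstring(s).productions())
    # "ROOT" assumes Stanford-parser-style trees
    return induce_pcfg(Nonterminal("ROOT"), productions)

def parse_logprob(grammar, bracketed_parse):
    """Sum of log rule probabilities over one tree's productions."""
    rule_prob = {(p.lhs(), p.rhs()): p.prob() for p in grammar.productions()}
    total = 0.0
    for p in Tree.fromstring(bracketed_parse).productions():
        total += math.log(rule_prob.get((p.lhs(), p.rhs()), 1e-12))  # floor unseen rules
    return total
```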

33 The language-modeling features used are shown in Table 5. [sent-64, score-0.039]

34 We used LogitBoost (Friedman et al., 2000) to train a classifier using various combinations of features. [sent-67, score-0.047]

35 We used Decision Stumps as a base classifier and ran boosting for 500 iterations. [sent-68, score-0.074]
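The summary names only the algorithm, so an off-the-shelf toolkit (e.g. Weka) was presumably used. For illustration, a from-scratch sketch of two-class LogitBoost (Friedman et al., 2000) with regression stumps as the base learner and the 500 rounds described above:

```python
# Sketch: two-class LogitBoost (Friedman et al., 2000) with regression
# stumps; an illustrative stand-in, not the authors' implementation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost_fit(X, y, n_rounds=500):
    """y must be in {0, 1}; returns the fitted stumps (F(x) = 0.5 * sum f_m)."""
    F = np.zeros(len(y), dtype=float)
    stumps = []
    for _ in range(n_rounds):
        p = 1.0 / (1.0 + np.exp(-2.0 * F))
        w = np.clip(p * (1.0 - p), 1e-8, None)  # working weights
        z = np.clip((y - p) / w, -4.0, 4.0)     # working response, clipped for stability
        stump = DecisionTreeRegressor(max_depth=1)  # a regression stump
        stump.fit(X, z, sample_weight=w)
        stumps.append(stump)
        F += 0.5 * stump.predict(X)
    return stumps

def logitboost_predict_proba(stumps, X):
    F = 0.5 * sum(stump.predict(X) for stump in stumps)
    return 1.0 / (1.0 + np.exp(-2.0 * F))
```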

36 1 Methodology We used 10-fold cross-validation to test the performance of our classifier using various combinations of features. [sent-70, score-0.047]

37 We ran the classifier on the portion (30%) of the dataset not used for language modeling. [sent-71, score-0.047]

38 We evaluated our classifier in two settings (following Anderka et al., 2012): • Pessimistic Setting: The negative class consists of articles from the Untagged set. [sent-74, score-0.061]

39 Since some of these could be manually undetected promotional articles, the accuracy measured in this setting is probably an under-estimate. [sent-75, score-0.836]

40 • Optimistic Setting: The negative class consists of articles from the Featured/Good set. [sent-76, score-0.037]

41 These articles are at one end of the quality spectrum, making it relatively easier to distinguish them from promotional articles. [sent-77, score-0.939]

42 The true performance of the classifier is likely somewhere between that achieved in these two settings. [sent-78, score-0.047]

43 We maintain an equal number of positive and negative test cases in both settings. [sent-79, score-0.06]

44 2 Results for Pessimistic Setting From Table 6, we see that all features perform better than the bag-of-words baseline. [sent-81, score-0.039]

45 We also see that character trigrams, one of the simplest features, give the best individual performance. [sent-82, score-0.092]

46 However, deeper syntactic features using PCFGs also perform quite well. [sent-83, score-0.095]

47 Combining all of the language-modeling features (PCFG + character trigrams + word trigrams) further improves performance. [sent-84, score-0.172]

48 Compared to the 58 content and meta features utilized by Anderka et al., the PCFG and character trigram features give much better performance, both individually and when combined. [sent-85, score-0.161]

50 Adding Anderka et al.'s features to the language-modeling ones gives a fairly small improvement in performance. [sent-89, score-0.039]

51 This validates our hypothesis that promotional articles tend to have a distinct linguistic style that is captured well using language models. [sent-90, score-1.041]

52 3 Results for Optimistic Setting In the Optimistic Setting, as shown in Table 6, the content and meta features give the best performance, which improves only slightly when combined with language-modeling features. [sent-92, score-0.161]

53 This performance could be because there is a much clearer distinction between promotional articles and featured/good articles that can be captured by simple features alone. [sent-94, score-1.233]

54 For example, featured/good articles are generally longer than usual Wikipedia articles and have more references. [sent-95, score-0.462]

55 4 Top Ranked Features and their Performance To analyze the performance of different features, we determined the top ranked features using Information Gain. [sent-97, score-0.039]
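The summary names Information Gain but no toolkit. A sketch of the ranking step using scikit-learn's mutual-information estimator as a close stand-in (information gain is the mutual information between a discrete feature and the class label):

```python
# Sketch: rank features by (estimated) information gain about the class
# label; an illustrative substitute for whatever attribute evaluator the
# authors actually used.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features_by_info_gain(X, y, feature_names):
    gains = mutual_info_classif(X, y, random_state=0)
    order = np.argsort(gains)[::-1]  # highest gain first
    return [(feature_names[i], float(gains[i])) for i in order]
```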

56 In the Pessimistic Setting, the top six features are all language-modeling features (the character trigram model feature works best), followed by basic meta-features such as character count, word count, category count and sentence count. [sent-98, score-0.274]

57 The new feature we introduced, "Overall Sentiment Score", is the 18th most informative feature in the pessimistic setting, indicating that the cumulative sentiment of a bag of words is not as discriminative as we would intuitively assume. [sent-99, score-0.267]

58 These top-ranked features alone achieve an F1 of 0.93, which is only slightly worse than that achieved using all features. [sent-101, score-0.039]

59 In the Optimistic Setting, the top-ranked features are the number of references and the number of references per section. [sent-103, score-0.039]

60 This is consistent with the observation that featured/good articles have very long and comprehensive lists of references, since Wikipedia’s fundamental policy is to maintain verifiability by citing relevant sources. [sent-104, score-0.284]

61 These features alone achieve an F1 of 0.988, which is almost as good as using all features. [sent-107, score-0.039]

62 5 Optimistic and Pessimistic Settings In the optimistic setting, there is a clear distinction between the positive (promotional) and negative (featured/good) classes. [sent-110, score-0.169]

63 But there are only subtle differences between the positive and negative (untagged articles) classes in the pessimistic setting. [sent-111, score-0.303]

64 These two classes are superficially similar in terms of length, reference count, section count, etc. [sent-112, score-0.05]

65 Stylometric features based on the trained language models are successful at detecting the subtle linguistic differences between the two types of articles. [sent-113, score-0.128]

66 This is useful because the pessimistic setting is closer to the real-world setting of articles in Wikipedia. [sent-114, score-0.561]

67 6 Error Analysis Since the pessimistic setting is close to the real setting of Wikipedia articles, it is useful to do an error analysis of the classifier's performance in this setting. [sent-116, score-0.33]

68 There is an approximately equal proportion of false positives and false negatives. [sent-117, score-0.168]

69 A significant number of false positives seem to be cases of manually undetected promotional articles. [sent-118, score-0.921]

70 But there are also many false positives that seem to be truly unbiased. [sent-120, score-0.141]

71 These articles appear to have been poorly written, without following Wikipedia’s editing policies. [sent-121, score-0.231]

72 Examples include the use of very long lists of nouns, the use of ambiguous terms like "many believe", and excessive use of superlatives. [sent-122, score-0.047]

73 Another common characteristic of most of the false positives is the presence of a considerable number of complex sentences with multiple subordinate clauses. [sent-123, score-0.113]

74 These stylistic cues seem to be misleading the classifier. [sent-124, score-0.072]

75 A common thread underlying most of the false negatives is that they are written in a narrative style or contain excessive detail. [sent-125, score-0.333]

76 Examples include narrating a detailed story of a fictional character in an unbiased manner or writing a minutely detailed account of the history of an organization. [sent-126, score-0.188]

77 Another source of false negatives comes from biographical Wikipedia pages, which are written in a resume style, listing all of the subject's qualifications and achievements. [sent-127, score-0.124]

78 These cues could help one manually detect that the article, though not promotional in style, is probably written with a view to promoting the entity. [sent-128, score-0.822]

79 As possible future work, we could incorporate features derived from language models for narrative style trained using an appropriate external corpus of narrative text. [sent-129, score-0.257]

80 This might enable the classifier to detect some cases of unbiased promotional articles. [sent-130, score-0.828]

81 6 Conclusion Our experiments and analysis show that stylometric features based on n-gram language models and deeper syntactic PCFG models work very well for detecting promotional articles in Wikipedia. [sent-131, score-1.205]

82 After analyzing the errors that are made during classification, we realize that though promotional content is non-neutral in the majority of cases, there do exist promotional articles that are neutral in style. [sent-132, score-1.761]

83 Adding additional features based on language models of narrative style could lead to a better model of Wikipedia’s promotional content. [sent-133, score-0.907]

85 SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. [sent-144, score-0.049]

87 Language of vandalism: Improving Wikipedia vandalism detection via stylometric analysis. [sent-169, score-0.225]

89 Automatic quality assessment of content created collaboratively by web communities: a case study of Wikipedia. [sent-173, score-0.096]

91 An analysis of statistical models and features for reading difficulty prediction. [sent-178, score-0.039]

97 Does it matter who contributes: a study on featured articles in the German Wikipedia. [sent-210, score-0.274]

100 "Got you!": Automatic vandalism detection in Wikipedia with web-based shallow syntactic-semantic modeling. [sent-227, score-0.095]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('promotional', 0.708), ('articles', 0.231), ('anderka', 0.218), ('pessimistic', 0.218), ('pcfg', 0.208), ('wikipedia', 0.178), ('stylometric', 0.13), ('optimistic', 0.108), ('style', 0.102), ('untagged', 0.095), ('vandalism', 0.095), ('character', 0.092), ('authorship', 0.086), ('harpalani', 0.082), ('tone', 0.081), ('meta', 0.066), ('pcfgs', 0.065), ('positives', 0.058), ('neutral', 0.058), ('narrative', 0.058), ('deeper', 0.056), ('content', 0.056), ('setting', 0.056), ('false', 0.055), ('advert', 0.054), ('cleanup', 0.054), ('logitboost', 0.054), ('wilkinson', 0.054), ('stein', 0.052), ('edit', 0.051), ('sentiment', 0.049), ('subtle', 0.048), ('trigram', 0.048), ('excessive', 0.047), ('undetected', 0.047), ('classifier', 0.047), ('stylistic', 0.044), ('featured', 0.043), ('unbiased', 0.043), ('detecting', 0.041), ('trigrams', 0.041), ('flaw', 0.04), ('collaboratively', 0.04), ('features', 0.039), ('aro', 0.038), ('sentiwordnet', 0.038), ('negative', 0.037), ('friedman', 0.036), ('baccianella', 0.036), ('negatives', 0.036), ('written', 0.035), ('afrl', 0.033), ('raghavan', 0.031), ('policy', 0.03), ('category', 0.03), ('detect', 0.03), ('history', 0.029), ('attribution', 0.029), ('pages', 0.029), ('seem', 0.028), ('tional', 0.028), ('structural', 0.028), ('boosting', 0.027), ('count', 0.026), ('free', 0.026), ('raymond', 0.026), ('association', 0.025), ('manually', 0.025), ('distinction', 0.024), ('hypothesize', 0.024), ('normal', 0.024), ('kristina', 0.024), ('khaled', 0.024), ('roc', 0.024), ('cercone', 0.024), ('hypermedia', 0.024), ('benno', 0.024), ('nedim', 0.024), ('oofp', 0.024), ('aclshort', 0.024), ('adriana', 0.024), ('calves', 0.024), ('calvin', 0.024), ('cfr', 0.024), ('fictional', 0.024), ('flaws', 0.024), ('kovashka', 0.024), ('nec', 0.024), ('ntghe', 0.024), ('pacling', 0.024), ('promoting', 0.024), ('rayson', 0.024), ('resume', 0.024), ('rudolf', 0.024), ('superficially', 0.024), ('tfheeat', 0.024), ('maintain', 0.023), ('utility', 0.023), ('klein', 0.023), ('toutanova', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 61 emnlp-2013-Detecting Promotional Content in Wikipedia

Author: Shruti Bhosale ; Heath Vinicombe ; Raymond Mooney

Abstract: This paper presents an approach for detecting promotional content in Wikipedia. By incorporating stylometric features, including features based on n-gram and PCFG language models, we demonstrate improved accuracy at identifying promotional articles, compared to using only lexical information and metafeatures.

2 0.1149377 34 emnlp-2013-Automatically Classifying Edit Categories in Wikipedia Revisions

Author: Johannes Daxenberger ; Iryna Gurevych

Abstract: In this paper, we analyze a novel set of features for the task of automatic edit category classification. Edit category classification assigns categories such as spelling error correction, paraphrase or vandalism to edits in a document. Our features are based on differences between two versions of a document including meta data, textual and language properties and markup. In a supervised machine learning experiment, we achieve a micro-averaged F1 score of .62 on a corpus of edits from the English Wikipedia. In this corpus, each edit has been multi-labeled according to a 21-category taxonomy. A model trained on the same data achieves state-of-the-art performance on the related task of fluency edit classification. We apply pattern mining to automatically labeled edits in the revision histories of different Wikipedia articles. Our results suggest that high-quality articles show a higher degree of homogeneity with respect to their collaboration patterns as compared to random articles.

3 0.098360293 27 emnlp-2013-Authorship Attribution of Micro-Messages

Author: Roy Schwartz ; Oren Tsur ; Ari Rappoport ; Moshe Koppel

Abstract: Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author’s unique “signature”, and show that such signatures are typical of many authors when writing very short texts. We also present a new authorship attribution feature (“flexible patterns”) and demonstrate a significant improvement over our baselines. Our results show that the author of a single tweet can be identified with good accuracy in an array of flavors of the authorship attribution task.

4 0.07715261 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels

Author: Vikas Ganjigunte Ashok ; Song Feng ; Yejin Choi

Abstract: Predicting the success of literary works is a curious question among publishers and aspiring writers alike. We examine the quantitative connection, if any, between writing style and successful literature. Based on novels over several different genres, we probe the predictive power of statistical stylometry in discriminating successful literary works, and identify characteristic stylistic elements that are more prominent in successful writings. Our study reports for the first time that statistical stylometry can be surprisingly effective in discriminating highly successful literature from less successful counterpart, achieving accuracy up to 84%. Closer analyses lead to several new insights into characteristics of the writing style in successful literature, including findings that are contrary to the conventional wisdom with respect to good writing style and readability.

5 0.066558309 168 emnlp-2013-Semi-Supervised Feature Transformation for Dependency Parsing

Author: Wenliang Chen ; Min Zhang ; Yue Zhang

Abstract: In current dependency parsing models, conventional features (i.e. base features) defined over surface words and part-of-speech tags in a relatively high-dimensional feature space may suffer from the data sparseness problem and thus exhibit less discriminative power on unseen data. In this paper, we propose a novel semi-supervised approach to addressing the problem by transforming the base features into high-level features (i.e. meta features) with the help of a large amount of automatically parsed data. The meta features are used together with base features in our final parser. Our studies indicate that our proposed approach is very effective in processing unseen data and features. Experiments on Chinese and English data sets show that the final parser achieves the best-reported accuracy on the Chinese data and comparable accuracy with the best known parsers on the English data.

6 0.062702052 133 emnlp-2013-Modeling Scientific Impact with Topical Influence Regression

7 0.057933886 42 emnlp-2013-Building Specialized Bilingual Lexicons Using Large Scale Background Knowledge

8 0.055787947 143 emnlp-2013-Open Domain Targeted Sentiment

9 0.051706605 109 emnlp-2013-Is Twitter A Better Corpus for Measuring Sentiment Similarity?

10 0.050505117 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

11 0.050137311 24 emnlp-2013-Application of Localized Similarity for Web Documents

12 0.049053289 41 emnlp-2013-Building Event Threads out of Multiple News Articles

13 0.048932876 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

14 0.04858654 144 emnlp-2013-Opinion Mining in Newspaper Articles by Entropy-Based Word Connections

15 0.047846608 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

16 0.047530584 160 emnlp-2013-Relational Inference for Wikification

17 0.046939235 158 emnlp-2013-Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

18 0.043129791 171 emnlp-2013-Shift-Reduce Word Reordering for Machine Translation

19 0.04217374 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation

20 0.041451436 121 emnlp-2013-Learning Topics and Positions from Debatepedia


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.131), (1, 0.04), (2, -0.055), (3, -0.044), (4, 0.003), (5, -0.025), (6, 0.045), (7, 0.058), (8, 0.009), (9, 0.003), (10, 0.0), (11, 0.07), (12, -0.038), (13, 0.026), (14, 0.018), (15, 0.064), (16, -0.146), (17, -0.012), (18, -0.011), (19, 0.038), (20, -0.176), (21, 0.128), (22, 0.111), (23, 0.015), (24, 0.077), (25, 0.02), (26, 0.179), (27, -0.037), (28, 0.133), (29, -0.079), (30, 0.133), (31, -0.115), (32, -0.051), (33, 0.074), (34, 0.088), (35, -0.0), (36, -0.013), (37, -0.084), (38, -0.047), (39, 0.033), (40, 0.101), (41, -0.007), (42, -0.042), (43, -0.066), (44, -0.074), (45, -0.044), (46, -0.059), (47, -0.194), (48, 0.031), (49, 0.051)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92986554 61 emnlp-2013-Detecting Promotional Content in Wikipedia

Author: Shruti Bhosale ; Heath Vinicombe ; Raymond Mooney

Abstract: This paper presents an approach for detecting promotional content in Wikipedia. By incorporating stylometric features, including features based on n-gram and PCFG language models, we demonstrate improved accuracy at identifying promotional articles, compared to using only lexical information and metafeatures.

2 0.86449873 34 emnlp-2013-Automatically Classifying Edit Categories in Wikipedia Revisions

Author: Johannes Daxenberger ; Iryna Gurevych

Abstract: In this paper, we analyze a novel set of features for the task of automatic edit category classification. Edit category classification assigns categories such as spelling error correction, paraphrase or vandalism to edits in a document. Our features are based on differences between two versions of a document including meta data, textual and language properties and markup. In a supervised machine learning experiment, we achieve a micro-averaged F1 score of .62 on a corpus of edits from the English Wikipedia. In this corpus, each edit has been multi-labeled according to a 21-category taxonomy. A model trained on the same data achieves state-of-the-art performance on the related task of fluency edit classification. We apply pattern mining to automatically labeled edits in the revision histories of different Wikipedia articles. Our results suggest that high-quality articles show a higher degree of homogeneity with respect to their collaboration patterns as compared to random articles.

3 0.53765041 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels

Author: Vikas Ganjigunte Ashok ; Song Feng ; Yejin Choi

Abstract: Predicting the success of literary works is a curious question among publishers and aspiring writers alike. We examine the quantitative connection, if any, between writing style and successful literature. Based on novels over several different genres, we probe the predictive power of statistical stylometry in discriminating successful literary works, and identify characteristic stylistic elements that are more prominent in successful writings. Our study reports for the first time that statistical stylometry can be surprisingly effective in discriminating highly successful literature from less successful counterpart, achieving accuracy up to 84%. Closer analyses lead to several new insights into characteristics of the writing style in successful literature, including findings that are contrary to the conventional wisdom with respect to good writing style and readability.

4 0.47188908 27 emnlp-2013-Authorship Attribution of Micro-Messages

Author: Roy Schwartz ; Oren Tsur ; Ari Rappoport ; Moshe Koppel

Abstract: Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author’s unique “signature”, and show that such signatures are typical of many authors when writing very short texts. We also present a new authorship attribution feature (“flexible patterns”) and demonstrate a significant improvement over our baselines. Our results show that the author of a single tweet can be identified with good accuracy in an array of flavors of the authorship attribution task.

5 0.44564548 168 emnlp-2013-Semi-Supervised Feature Transformation for Dependency Parsing

Author: Wenliang Chen ; Min Zhang ; Yue Zhang

Abstract: In current dependency parsing models, conventional features (i.e. base features) defined over surface words and part-of-speech tags in a relatively high-dimensional feature space may suffer from the data sparseness problem and thus exhibit less discriminative power on unseen data. In this paper, we propose a novel semi-supervised approach to addressing the problem by transforming the base features into high-level features (i.e. meta features) with the help of a large amount of automatically parsed data. The meta features are used together with base features in our final parser. Our studies indicate that our proposed approach is very effective in processing unseen data and features. Experiments on Chinese and English data sets show that the final parser achieves the best-reported accuracy on the Chinese data and comparable accuracy with the best known parsers on the English data.

6 0.420818 133 emnlp-2013-Modeling Scientific Impact with Topical Influence Regression

7 0.39988676 24 emnlp-2013-Application of Localized Similarity for Web Documents

8 0.36858037 189 emnlp-2013-Two-Stage Method for Large-Scale Acquisition of Contradiction Pattern Pairs using Entailment

9 0.35561305 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts

10 0.35129088 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction

11 0.35127074 95 emnlp-2013-Identifying Multiple Userids of the Same Author

12 0.34934291 26 emnlp-2013-Assembling the Kazakh Language Corpus

13 0.33047813 144 emnlp-2013-Opinion Mining in Newspaper Articles by Entropy-Based Word Connections

14 0.32601181 106 emnlp-2013-Inducing Document Plans for Concept-to-Text Generation

15 0.32523218 199 emnlp-2013-Using Topic Modeling to Improve Prediction of Neuroticism and Depression in College Students

16 0.32508406 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations

17 0.32062775 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

18 0.29442722 5 emnlp-2013-A Discourse-Driven Content Model for Summarising Scientific Articles Evaluated in a Complex Question Answering Task

19 0.28194767 171 emnlp-2013-Shift-Reduce Word Reordering for Machine Translation

20 0.27498659 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.525), (18, 0.017), (22, 0.029), (30, 0.06), (45, 0.016), (50, 0.015), (51, 0.167), (66, 0.026), (71, 0.021), (74, 0.011), (75, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93515676 6 emnlp-2013-A Generative Joint, Additive, Sequential Model of Topics and Speech Acts in Patient-Doctor Communication

Author: Byron C. Wallace ; Thomas A Trikalinos ; M. Barton Laws ; Ira B. Wilson ; Eugene Charniak

Abstract: We develop a novel generative model of conversation that jointly captures both the topical content and the speech act type associated with each utterance. Our model expresses both token emission and state transition probabilities as log-linear functions of separate components corresponding to topics and speech acts (and their interactions). We apply this model to a dataset comprising annotated patient-physician visits and show that the proposed joint approach outperforms a baseline univariate model.

2 0.84466356 5 emnlp-2013-A Discourse-Driven Content Model for Summarising Scientific Articles Evaluated in a Complex Question Answering Task

Author: Maria Liakata ; Simon Dobnik ; Shyamasree Saha ; Colin Batchelor ; Dietrich Rebholz-Schuhmann

Abstract: We present a method which exploits automatically generated scientific discourse annotations to create a content model for the summarisation of scientific articles. Full papers are first automatically annotated using the CoreSC scheme, which captures 11 content-based concepts such as Hypothesis, Result, Conclusion etc. at the sentence level. A content model which follows the sequence of CoreSC categories observed in abstracts is used to provide the skeleton of the summary, making a distinction between dependent and independent categories. Summary creation is also guided by the distribution of CoreSC categories found in the full articles, in order to adequately represent the article content. Finally, we demonstrate the usefulness of the summaries by evaluating them in a complex question answering task. Results are very encouraging as summaries of papers from automatically obtained CoreSCs enable experts to answer 66% of complex content-related questions designed on the basis of paper abstracts. The questions were answered with a precision of 75%, where the upper bound for human summaries (abstracts) was 95%.

same-paper 3 0.8350603 61 emnlp-2013-Detecting Promotional Content in Wikipedia

Author: Shruti Bhosale ; Heath Vinicombe ; Raymond Mooney

Abstract: This paper presents an approach for detecting promotional content in Wikipedia. By incorporating stylometric features, including features based on n-gram and PCFG language models, we demonstrate improved accuracy at identifying promotional articles, compared to using only lexical information and metafeatures.

4 0.79806465 12 emnlp-2013-A Semantically Enhanced Approach to Determine Textual Similarity

Author: Eduardo Blanco ; Dan Moldovan

Abstract: This paper presents a novel approach to determine textual similarity. A layered methodology to transform text into logic forms is proposed, and semantic features are derived from a logic prover. Experimental results show that incorporating the semantic structure of sentences is beneficial. When training data is unavailable, scores obtained from the logic prover in an unsupervised manner outperform supervised methods.

5 0.48667848 36 emnlp-2013-Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach

Author: Tengfei Ma ; Hiroshi Nakagawa

Abstract: Document summarization is an important task in the area of natural language processing, which aims to extract the most important information from a single document or a cluster of documents. In various summarization tasks, the summary length is manually defined. However, how to find the proper summary length is quite a problem; and keeping all summaries restricted to the same length is not always a good choice. It is obviously improper to generate summaries with the same length for two clusters of documents which contain quite different quantity of information. In this paper, we propose a Bayesian nonparametric model for multi-document summarization in order to automatically determine the proper lengths of summaries. Assuming that an original document can be reconstructed from its summary, we describe the "reconstruction" by a Bayesian framework which selects sentences to form a good summary. Experimental results on DUC2004 data sets and some expanded data demonstrate the good quality of our summaries and the rationality of the length determination.

6 0.48200548 133 emnlp-2013-Modeling Scientific Impact with Topical Influence Regression

7 0.48059037 34 emnlp-2013-Automatically Classifying Edit Categories in Wikipedia Revisions

8 0.4729864 140 emnlp-2013-Of Words, Eyes and Brains: Correlating Image-Based Distributional Semantic Models with Neural Representations of Concepts

9 0.46788323 174 emnlp-2013-Single-Document Summarization as a Tree Knapsack Problem

10 0.45861495 153 emnlp-2013-Predicting the Resolution of Referring Expressions from User Behavior

11 0.45548469 72 emnlp-2013-Elephant: Sequence Labeling for Word and Sentence Segmentation

12 0.45238844 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

13 0.44401857 86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction

14 0.44153994 106 emnlp-2013-Inducing Document Plans for Concept-to-Text Generation

15 0.43695393 144 emnlp-2013-Opinion Mining in Newspaper Articles by Entropy-Based Word Connections

16 0.43594047 199 emnlp-2013-Using Topic Modeling to Improve Prediction of Neuroticism and Depression in College Students

17 0.43534777 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

18 0.43277994 196 emnlp-2013-Using Crowdsourcing to get Representations based on Regular Expressions

19 0.42723554 129 emnlp-2013-Measuring Ideological Proportions in Political Speeches

20 0.42689019 152 emnlp-2013-Predicting the Presence of Discourse Connectives