acl acl2011 acl2011-195 knowledge-graph by maker-knowledge-mining

195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis


Source: pdf

Author: Manoj Harpalani ; Michael Hart ; Sandesh Singh ; Rob Johnson ; Yejin Choi

Abstract: Community-based knowledge forums, such as Wikipedia, are susceptible to vandalism, i.e., ill-intentioned contributions that are detrimental to the quality of collective intelligence. Most previous work to date relies on shallow lexico-syntactic patterns and metadata to automatically detect vandalism in Wikipedia. In this paper, we explore more linguistically motivated approaches to vandalism detection. In particular, we hypothesize that textual vandalism constitutes a unique genre where a group of people share a similar linguistic behavior. Experimental results suggest that (1) statistical models give evidence to unique language styles in vandalism, and that (2) deep syntactic patterns based on probabilistic context free grammars (PCFG) discriminate vandalism more effectively than shallow lexicosyntactic patterns based on n-grams.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Most previous work to date relies on shallow lexico-syntactic patterns and metadata to automatically detect vandalism in Wikipedia. [sent-4, score-0.986]

2 In this paper, we explore more linguistically motivated approaches to vandalism detection. [sent-5, score-0.759]

3 In particular, we hypothesize that textual vandalism constitutes a unique genre where a group of people share a similar linguistic behavior. [sent-6, score-0.808]

4 Experimental results suggest that (1) statistical models give evidence to unique language styles in vandalism, and that (2) deep syntactic patterns based on probabilistic context free grammars (PCFG) discriminate vandalism more effectively than shallow lexicosyntactic patterns based on n-grams. [sent-7, score-1.163]

5 This editable encyclopedia has amassed over 15 million articles across hundreds of languages. [sent-9, score-0.085]

6 25 million edits (and sometimes upwards of 3 million) daily (Wikipedia, 2010). [sent-12, score-0.267]

7 But allowing anonymous edits is a double-edged sword; nearly 7% (Potthast, 2010) of edits are vandalism, i.e., [sent-13, score-0.454]

8 revisions to articles that undermine the quality and veracity of the content. [sent-15, score-0.088]

9 As Wikipedia continues to grow, it will become increasingly infeasible ... [sent-16, score-0.041]

10 This pressing issue has spawned recent research activities to understand and counteract vandalism (e. [sent-19, score-0.737]

11 Much of previous work relies on hand-picked rules such as lexical cues (e. [sent-22, score-0.024]

12 , anonymity, edit frequency) to automatically detect vandalism in Wikipedia (e. [sent-26, score-0.851]

13 Although some recent work has started exploring the use of natural language processing, most work to date is based on shallow lexico-syntactic patterns (e. [sent-31, score-0.126]

14 We explore more linguistically motivated approaches to detect vandalism in this paper. [sent-36, score-0.798]

15 Our hypothesis is that textual vandalism constitutes a unique genre where a group of people share similar linguistic behavior. [sent-37, score-0.808]

16 Some obvious hallmarks of this style include usage of obscenities, misspellings, and slang, but we aim to automatically uncover stylistic cues to effectively discriminate between vandalizing and normal text. [sent-38, score-0.377]

17 Experimental results suggest that (1) statistical models give evidence to unique language styles in vandalism, and that (2) deep syntactic patterns based on probabilistic context free grammar (PCFG) discriminate vandalism more effectively than shallow lexico-syntactic patterns based on n-grams. [sent-39, score-1.122]

18 2 Stylometric Features Stylometric features attempt to recognize patterns of style in text. [sent-40, score-0.109]

19 These techniques have been traditionally applied to attribute authorship (Argamon et al. [sent-41, score-0.063]

20 For our purposes, we hypothesize that different stylistic features appear in regular and vandalizing edits. [sent-46, score-0.311]

21 For regular edits, honest editors will strive to follow the stylistic guidelines set forth by Wikipedia (e. [sent-47, score-0.149]

22 For edits that vandalize articles, these users may converge on common ways of vandalizing articles. [sent-50, score-0.405]

23 1 Language Models To differentiate between the styles of normal users and vandalizers, we employ language models to capture the stylistic differences between authentic and vandalizing revisions. [sent-52, score-0.326]

24 We train two trigram language models (LMs) with Good-Turing discounting and Katz backoff for smoothing: one on vandalizing edits (based on the text difference between the vandalizing and previous revision) and one on good edits (based on the text difference between the new and previous revision). [sent-53, score-0.86]
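The two-model setup can be sketched with a toy trigram LM. The paper's models use Good-Turing discounting with Katz backoff (typically built with an LM toolkit); as a simplified stand-in, this sketch uses "stupid backoff" with a constant penalty. The class name, smoothing choice, and training sentences are illustrative assumptions, not the authors' code:

```python
import math
from collections import Counter

class BackoffTrigramLM:
    """Toy trigram LM scorer: a simplified stand-in for the Good-Turing/Katz
    smoothed models described above, using 'stupid backoff' instead."""
    def __init__(self, sentences, alpha=0.4):
        self.alpha = alpha
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        self.total = 0
        for s in sentences:
            toks = ["<s>", "<s>"] + s.split() + ["</s>"]
            for i, w in enumerate(toks):
                self.uni[w] += 1
                self.total += 1
                if i >= 1:
                    self.bi[(toks[i - 1], w)] += 1
                if i >= 2:
                    self.tri[(toks[i - 2], toks[i - 1], w)] += 1

    def _prob(self, u, v, w):
        # Back off trigram -> bigram -> unigram, penalizing each backoff step.
        if self.tri[(u, v, w)]:
            return self.tri[(u, v, w)] / self.bi[(u, v)]
        if self.bi[(v, w)]:
            return self.alpha * self.bi[(v, w)] / self.uni[v]
        return self.alpha ** 2 * self.uni[w] / self.total

    def loglik(self, sentence):
        toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        return sum(math.log(self._prob(toks[i - 2], toks[i - 1], toks[i]) + 1e-12)
                   for i in range(2, len(toks)))
```

An edit's text difference is then scored under both models, and the resulting log-likelihoods are used as classifier features.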

25 2 Probabilistic Context Free Grammar (PCFG) Models Probabilistic context-free grammars (PCFG) capture deep syntactic regularities beyond shallow lexicosyntactic patterns. [sent-55, score-0.17]

26 (2010) reported for the first time that PCFG models are effective in learning stylometric signature of authorship at deep syntactic levels. [sent-57, score-0.291]

27 In this work, we explore the use of PCFG models for vandalism detection, by viewing the task as a genre detection problem, where a group of authors share similar linguistic behavior. [sent-58, score-0.811]

28 (1) Given a training corpus D for vandalism detection and a generic PCFG parser Co trained on a manually tree-banked corpus such as WSJ or Brown, tree-bank each training document di ∈ D using the generic PCFG parser Co. [sent-61, score-0.826]

29 (2) Learn vandalism language by training a new PCFG parser Cvandal using only those treebanked documents in D that correspond to vandalism. [sent-62, score-0.757]

30 Likewise, learn regular Wikipedia language by training a new PCFG parser Cregular 84 using only those tree-banked documents in D that correspond to regular Wikipedia edits. [sent-63, score-0.092]

31 (3) For each test document, compare the probability of the edit determined by Cvandal and Cregular, where the parser with the higher score determines the class of the edit. [sent-64, score-0.095]
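The decision rule in step (3) can be sketched as follows, assuming `score_vandal` and `score_regular` are stand-ins for log-probabilities assigned by the Cvandal and Cregular grammars (e.g., the log-probability of the best parse of each sentence); both function names are hypothetical:

```python
def classify_edit(sentences, score_vandal, score_regular):
    """Step (3): sum per-sentence log-probabilities under each grammar and
    pick the class whose model assigns the edit the higher score."""
    lv = sum(score_vandal(s) for s in sentences)
    lr = sum(score_regular(s) for s in sentences)
    return "vandalism" if lv > lr else "regular"
```

In the full system (Section 3), the score difference itself is also used as a classifier feature rather than only the hard decision.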

32 3 System Description Our system decides if an edit to an article is vandalism by training a classifier based on a set of features derived from many different aspects of the edit. [sent-66, score-0.174]

33 , 2010) of Wikipedia edits where revisions are labeled as either vandalizing or non-vandalizing. [sent-68, score-0.465]

34 This section will describe in brief the features used by our classifier; a more exhaustive description of our non-linguistically motivated features can be found in Harpalani et al. [sent-69, score-0.074]

35 1 Features Based on Metadata Our classifier takes into account metadata generated by the revision. [sent-72, score-0.114]

36 We generate features based on author reputation by recording whether the edit is submitted by an anonymous user or a registered user. [sent-73, score-0.288]

37 If the author is registered, we record how long he has been registered, how many times he has previously vandalized Wikipedia, and how frequently he edits articles. [sent-74, score-0.327]

38 We generate features based on the characteristics of the article's revision history. [sent-76, score-0.154]

39 This includes how many times the article has been previously vandalized, the last time it was edited, how many times it has been reverted and other related features. [sent-77, score-0.071]
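The metadata features above can be sketched as a simple feature extractor. The `edit` record and its field names are hypothetical illustrations, not the paper's actual data schema:

```python
def metadata_features(edit):
    """Sketch of the author-reputation and article-history metadata features;
    `edit` is assumed to be a dict-like record with the hypothetical keys below."""
    return {
        # Author reputation: anonymous vs. registered, history of vandalism.
        "is_anonymous": int(edit.get("registered_days") is None),
        "registered_days": edit.get("registered_days") or 0,
        "author_prior_vandalism": edit.get("author_vandal_count", 0),
        "author_edit_frequency": edit.get("author_edits_per_day", 0.0),
        # Article revision history: prior vandalism, reverts.
        "article_prior_vandalism": edit.get("article_vandal_count", 0),
        "article_times_reverted": edit.get("revert_count", 0),
    }
```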

40 2 Features Based on Lexical Cues Our classifier also employs a subset of features that rely on lexical cues. [sent-79, score-0.067]

41 Simple strategies such as counting the number of vulgarities present in the revision are effective at capturing obvious forms of vandalism. [sent-80, score-0.128]

42 We measure the edit distance between the old and new revision, the number of repeated patterns, slang words, vulgarities and pronouns, the type of edit (insert, modification or delete) and other similar features. [sent-81, score-0.219]
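A word-level sketch of these lexical-cue features, assuming toy vulgarity and pronoun lists (a real system would use much larger lexicons, and the feature names here are illustrative):

```python
import difflib

# Toy lexicons -- stand-ins for the real slang/vulgarity word lists.
VULGAR = {"suck", "sucks", "crap", "stupid"}
PRONOUNS = {"i", "you", "he", "she", "we", "they"}

def lexical_features(old_text, new_text):
    """Word-level sketch: edit distance between revisions, vulgarity/pronoun
    counts in the inserted text, and a coarse edit type."""
    old_toks = old_text.lower().split()
    new_toks = new_text.lower().split()
    ops = difflib.SequenceMatcher(None, old_toks, new_toks).get_opcodes()
    edit_distance = sum(max(i2 - i1, j2 - j1)
                        for op, i1, i2, j1, j2 in ops if op != "equal")
    inserted = [w for op, _, _, j1, j2 in ops
                if op in ("insert", "replace") for w in new_toks[j1:j2]]
    if not old_toks:
        edit_type = "insert"
    elif not new_toks:
        edit_type = "delete"
    else:
        edit_type = "modification"
    return {"edit_distance": edit_distance,
            "num_vulgar": sum(w in VULGAR for w in inserted),
            "num_pronouns": sum(w in PRONOUNS for w in inserted),
            "edit_type": edit_type}
```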

43 3 Features Based on Sentiment Wikipedia editors strive to maintain a neutral and objective voice in articles. [sent-87, score-0.072]

44 Vandals, however, insert subjective and polar statements into articles. [sent-88, score-0.022]

45 We build two classifiers based on the work of Pang and Lee (2004) to measure the polarity and objectivity of article edits. [sent-89, score-0.067]

46 Our features include how many positive and negative sentences were inserted, the overall change in the sentiment score from the previous version to the new revision, and the number of inserted or deleted subjective sentences in the revision. [sent-90, score-0.212]
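A sketch of these sentiment features, substituting a toy lexicon scorer for the Pang and Lee (2004) classifiers the paper trains; the lexicons and the insertion heuristic are assumptions:

```python
# Toy polarity lexicons -- stand-ins for the trained sentiment classifiers.
POS = {"great", "good", "beautiful"}
NEG = {"terrible", "stupid", "awful"}

def sentence_polarity(sentence):
    toks = set(sentence.lower().split())
    return len(toks & POS) - len(toks & NEG)

def sentiment_features(old_sentences, new_sentences):
    """Counts of positive/negative inserted sentences plus the overall change
    in sentiment score between the two revisions."""
    inserted = [s for s in new_sentences if s not in old_sentences]
    pols = [sentence_polarity(s) for s in inserted]
    return {
        "pos_inserted": sum(p > 0 for p in pols),
        "neg_inserted": sum(p < 0 for p in pols),
        "sentiment_delta": sum(sentence_polarity(s) for s in new_sentences)
                           - sum(sentence_polarity(s) for s in old_sentences),
    }
```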

47 We take the edit's log-likelihood under both the regular-edit and vandalizing-edit LMs. [sent-93, score-0.364]

48 For our PCFG, we take the difference between the minimum log-likelihood score (i. [sent-94, score-0.025]

49 4 Experimental Results Data We use the 2010 PAN Wikipedia vandalism corpus (Potthast et al., 2010). [sent-103, score-0.737]

50 Table 2: Top 10 ranked features on the unbalanced test data by InfoGain. ... benefit of stylometric analysis to vandalism detection. [sent-114, score-0.95]

51 This corpus comprises 32452 edits on 28468 articles, with 2391 of the edits identified as vandalism by human annotators. [sent-115, score-1.191]

52 The class distribution is highly skewed, as only 7% of edits corresponds to vandalism. [sent-116, score-0.227]

53 Because some edits involve no inserted text (e.g. deletions, template changes), we focus only on those edits that inserted or modified text (17145 edits in total), since stylometric features are not relevant to deletes and template modifications. [sent-119, score-0.749]

54 Note that insertions and modifications are the main source for vandalism. [sent-120, score-0.02]

55 We randomly separated 15000 edits for training of Cvandal and Cregular, and 17444 edits for testing, preserving the ratio of vandalism to non-vandalism revisions. [sent-121, score-1.191]

56 We eliminated 7359 of the testing edits to remove revisions that were exclusively template modifications (e. [sent-122, score-0.331]

57 inserting a link) and maintain the observed ratio of vandalism for a total of 10085 edits. [sent-124, score-0.737]

58 For each edit in the test set, we compute the probability of each modified sentence under Cvandal and Cregular and generate the statistics for the features described in Section 3. [sent-125, score-0.112]

59 We compare the performance of the language models and stylometric features against a baseline classifier that is trained on metadata, lexical and sentiment features using 10-fold stratified cross-validation on the test set. [sent-127, score-0.325]
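The evaluation setup (stratified folds plus F-score on skewed data) can be sketched in plain Python; the fold count, seed, and label proportions below are illustrative:

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Build k folds preserving class proportions -- the stratified CV setup
    needed on skewed data such as the ~7% vandalism class here."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

def f1(tp, fp, fn):
    # F-score stays informative on skewed data, where accuracy is misleading.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```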

60 Because our dataset is highly skewed (97% corresponds to “not vandalism”), we report F-score and AUC. [sent-129, score-0.021]

61 Notice that several stylistic features present in these sentences are unlikely to appear in normal Wikipedia articles. [sent-131, score-0.128]

62 1 The baseline system, which includes a wide range of features that are shown to be highly effective in vandalism detection, achieves F-score 52. [sent-133, score-0.794]

63 The baseline features include all features introduced in Section 3. [sent-136, score-0.094]

64 Adding language model features to the baseline (denoted as +LM in Table 1) increases the F-score slightly (53. [sent-137, score-0.057]

65 Adding PCFG based features to the baseline (denoted as +PCFG) brings the most substantial performance improvement: it increases recall substantially while also improving precision, achieving 57. [sent-140, score-0.057]

66 Combining both PCFG and language model based features (denoted as +LM+PCFG) only results in a slight improvement in AUC. [sent-143, score-0.037]

67 From these results, we draw the following conclusions: • There are indeed unique language styles in vandalism that can be detected with stylometric analysis. [sent-144, score-0.258]

68 • Rather unexpectedly, deep syntax oriented features based on PCFG bring a much more substantial improvement than language models that capture only shallow lexico-syntactic patterns. [sent-145, score-0.129]

69 1 A naive rule that always chooses the majority class (“not vandalism”) will receive zero F-score. [sent-146, score-0.756]

70 Dry wit, for example, relies on context and may receive a good score from the parser trained on regular Wikipedia edits (Cregular). [sent-148, score-0.302]

71 Notice that several of our PCFG features are in the top ten most informative features. [sent-150, score-0.037]

72 Language model based features were ranked very low, hence we do not include them in the list. [sent-151, score-0.037]

73 This finding will be potentially advantageous to many of the current anti-vandalism tools, which rely only on shallow lexico-syntactic patterns such as vulgarisms. [sent-152, score-0.077]

74 Examples To provide more insight into the task, Table 3 shows several instances where the addition of the PCFG-derived features detected vandalism that the baseline approach could not. [sent-153, score-0.794]

75 Notice that the first example contains a lot of conjunctions that would be hard to characterize using shallow lexicosyntactic features. [sent-154, score-0.118]

76 It looks almost like a benign edit; however, what makes it vandalism is the phrase “(Happy Birthday)” inserted in the middle. [sent-157, score-0.771]

77 Table 4 shows examples where none of our systems could detect the vandalism correctly. [sent-158, score-0.776]

78 Notice that the examples in Table 4 generally manifest a more formal voice than those in Table 3. [sent-159, score-0.019]

79 5 Related Work Wang and McKeown (2010) present the first approach that is linguistically motivated. [sent-160, score-0.022]

80 Their approach was based on shallow syntactic patterns, while ours explores the use of deep syntactic patterns and performs a comparative evaluation across different stylometry analysis techniques. [sent-161, score-0.129]

81 It is worthwhile to note that the approach of Wang and McKeown (2010) is not as practical and scalable as ours in that it requires crawling a substantial number (150) of webpages to detect each vandalism edit. [sent-162, score-0.806]

82 From our pilot study based on 1600 edits (50% of which are vandalism), we found that the topic-specific language models built from web search do not produce stronger results than PCFG-based features. [sent-163, score-0.227]

83 We do not have a result directly comparable to theirs, however, as we could not crawl the necessary webpages required to match the size of their corpus. [sent-164, score-0.03]

84 The standard approach to Wikipedia vandalism detection is to develop features based on either content or metadata and train a classifier to recognize vandalism. [sent-165, score-0.923]

85 A comprehensive overview of what types of features have been employed for this task can be found in Potthast et al. [sent-166, score-0.037]

86 WikiTrust, a reputation system for Wikipedia authors, focuses on determining the likely quality of a contribution (Adler and de Alfaro, 2007). [sent-168, score-0.06]

87 6 Future Work and Conclusion This paper presents a vandalism detection system for Wikipedia that uses stylometric features to aid in classification. [sent-169, score-0.999]

88 We show that deep syntactic patterns based on PCFGs more effectively identify vandalism than shallow lexico-syntactic patterns based on n-grams or contextual language models. [sent-170, score-0.983]

89 Rather, PCFGs are able to detect differences in language styles between vandalizing edits and normal edits to Wikipedia articles. [sent-172, score-0.759]

90 We look to automate the expansion of the training set of vandalized revisions to include examples from outside of Wikipedia that reflect similar language styles. [sent-175, score-0.119]

91 Luis Ortiz for their valuable guidance and suggestions in applying Machine Learning and Natural Language Processing techniques to the task of vandalism detection. [sent-179, score-0.737]

92 We also recognize the hard work of Megha Bassi and Thanadit Phumprao for assisting us in building our vandalism detection pipeline that enabled us to perform these experiments. [sent-180, score-0.809]

93 Wikipedia vandalism detection: Combining natural language, metadata, and reputation features. [sent-192, score-0.797]

94 Detecting wikipedia vandalism with active learning and statistical language models. [sent-214, score-0.943]

95 Wiki vandalysis- wikipedia vandalism analysis lab report for pan at clef 2010. [sent-238, score-0.997]

96 A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. [sent-247, score-0.025]

97 Personal sense and idiolect: Combining authorship attribution and opinion analysis. [sent-252, score-0.098]

98 The use of textual, grammatical and sociolinguistic evidence in forensic text comparison. [sent-290, score-0.03]

99 Automatic vandalism detection in Wikipedia with web-based shallow syntactic-semantic modeling. [sent-296, score-1.069]

100 Detecting wikipedia vandalism via spatio-temporal analysis of revision metadata. [sent-302, score-1.032]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vandalism', 0.737), ('pcfg', 0.248), ('edits', 0.227), ('wikipedia', 0.206), ('vandalizing', 0.178), ('stylometric', 0.176), ('potthast', 0.122), ('cregular', 0.118), ('cvandal', 0.099), ('revision', 0.089), ('metadata', 0.084), ('shallow', 0.077), ('registered', 0.075), ('edit', 0.075), ('adler', 0.07), ('authorship', 0.063), ('reputation', 0.06), ('revisions', 0.06), ('stylistic', 0.06), ('harpalani', 0.059), ('vandalized', 0.059), ('styles', 0.057), ('deep', 0.052), ('detection', 0.049), ('patterns', 0.049), ('auc', 0.043), ('author', 0.041), ('lexicosyntactic', 0.041), ('rob', 0.041), ('alfaro', 0.039), ('bassi', 0.039), ('geiger', 0.039), ('logitboost', 0.039), ('manoj', 0.039), ('megha', 0.039), ('panicheva', 0.039), ('perner', 0.039), ('phumprao', 0.039), ('reverted', 0.039), ('thanadit', 0.039), ('vulgarities', 0.039), ('detect', 0.039), ('raghavan', 0.038), ('features', 0.037), ('encyclopedia', 0.037), ('pcfgs', 0.036), ('regular', 0.036), ('attribution', 0.035), ('discriminate', 0.035), ('objectivity', 0.035), ('teresa', 0.035), ('inserted', 0.034), ('chin', 0.032), ('clef', 0.032), ('discretization', 0.032), ('luca', 0.032), ('article', 0.032), ('notice', 0.031), ('normal', 0.031), ('slang', 0.03), ('forensic', 0.03), ('webpages', 0.03), ('classifier', 0.03), ('martin', 0.029), ('strive', 0.029), ('benno', 0.029), ('stein', 0.029), ('lm', 0.028), ('articles', 0.028), ('hart', 0.027), ('friedman', 0.027), ('west', 0.027), ('paolo', 0.026), ('argamon', 0.026), ('sentiment', 0.025), ('workshops', 0.025), ('genre', 0.025), ('difference', 0.025), ('mckeown', 0.025), ('unique', 0.025), ('ny', 0.024), ('cues', 0.024), ('template', 0.024), ('editors', 0.024), ('recognize', 0.023), ('weka', 0.023), ('free', 0.022), ('pan', 0.022), ('linguistically', 0.022), ('insert', 0.022), ('constitutes', 0.021), ('skewed', 0.021), ('modifications', 0.02), ('parser', 0.02), ('baseline', 0.02), ('million', 0.02), ('daily', 0.02), ('receive', 0.019), ('effectively', 0.019), 
('voice', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

Author: Manoj Harpalani ; Michael Hart ; Sandesh Singh ; Rob Johnson ; Yejin Choi

Abstract: Community-based knowledge forums, such as Wikipedia, are susceptible to vandalism, i.e., ill-intentioned contributions that are detrimental to the quality of collective intelligence. Most previous work to date relies on shallow lexico-syntactic patterns and metadata to automatically detect vandalism in Wikipedia. In this paper, we explore more linguistically motivated approaches to vandalism detection. In particular, we hypothesize that textual vandalism constitutes a unique genre where a group of people share a similar linguistic behavior. Experimental results suggest that (1) statistical models give evidence to unique language styles in vandalism, and that (2) deep syntactic patterns based on probabilistic context free grammars (PCFG) discriminate vandalism more effectively than shallow lexicosyntactic patterns based on n-grams. ,

2 0.20860989 337 acl-2011-Wikipedia Revision Toolkit: Efficiently Accessing Wikipedias Edit History

Author: Oliver Ferschke ; Torsten Zesch ; Iryna Gurevych

Abstract: We present an open-source toolkit which allows (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of provided data. By using a dedicated storage format, our toolkit massively decreases the data volume to less than 2% of the original size, and at the same time provides an easy-to-use interface to access the revision data. The language-independent design allows to process any language represented in Wikipedia. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia’s edit history.

3 0.086533651 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation

Author: Danuta Ploch

Abstract: Named entity disambiguation is the task of linking an entity mention in a text to the correct real-world referent predefined in a knowledge base, and is a crucial subtask in many areas like information retrieval or topic detection and tracking. Named entity disambiguation is challenging because entity mentions can be ambiguous and an entity can be referenced by different surface forms. We present an approach that exploits Wikipedia relations between entities co-occurring with the ambiguous form to derive a range of novel features for classifying candidate referents. We find that our features improve disambiguation results significantly over a strong popularity baseline, and are especially suitable for recognizing entities not contained in the knowledge base. Our system achieves state-of-the-art results on the TAC-KBP 2009 dataset.

4 0.078890689 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing

Author: Mohit Bansal ; Dan Klein

Abstract: We investigate full-scale shortest-derivation parsing (SDP), wherein the parser selects an analysis built from the fewest number of training fragments. Shortest derivation parsing exhibits an unusual range of behaviors. At one extreme, in the fully unpruned case, it is neither fast nor accurate. At the other extreme, when pruned with a coarse unlexicalized PCFG, the shortest derivation criterion becomes both fast and surprisingly effective, rivaling more complex weighted-fragment approaches. Our analysis includes an investigation of tie-breaking and associated dynamic programs. At its best, our parser achieves an accuracy of 87% F1 on the English WSJ task with minimal annotation, and 90% F1 with richer annotation.

5 0.077160142 283 acl-2011-Simple English Wikipedia: A New Text Simplification Task

Author: William Coster ; David Kauchak

Abstract: In this paper we examine the task of sentence simplification which aims to reduce the reading complexity of a sentence by incorporating more accessible vocabulary and sentence structure. We introduce a new data set that pairs English Wikipedia with Simple English Wikipedia and is orders of magnitude larger than any previously examined for sentence simplification. The data contains the full range of simplification operations including rewording, reordering, insertion and deletion. We provide an analysis of this corpus as well as preliminary results using a phrase-based trans- lation approach for simplification.

6 0.075234421 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics

7 0.07406418 213 acl-2011-Local and Global Algorithms for Disambiguation to Wikipedia

8 0.073618636 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment

9 0.072500139 285 acl-2011-Simple supervised document geolocation with geodesic grids

10 0.072493285 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories

11 0.06099125 112 acl-2011-Efficient CCG Parsing: A* versus Adaptive Supertagging

12 0.055481862 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

13 0.055164002 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

14 0.050898951 52 acl-2011-Automatic Labelling of Topic Models

15 0.040759441 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

16 0.038557839 133 acl-2011-Extracting Social Power Relationships from Natural Language

17 0.038225923 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

18 0.036622435 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

19 0.03643943 254 acl-2011-Putting it Simply: a Context-Aware Approach to Lexical Simplification

20 0.035909254 281 acl-2011-Sentiment Analysis of Citations using Sentence Structure-Based Features


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.107), (1, 0.037), (2, -0.021), (3, 0.004), (4, -0.018), (5, -0.007), (6, 0.019), (7, -0.025), (8, -0.089), (9, -0.037), (10, -0.044), (11, 0.012), (12, -0.014), (13, -0.009), (14, 0.026), (15, 0.045), (16, 0.131), (17, -0.053), (18, -0.004), (19, -0.122), (20, 0.108), (21, -0.068), (22, -0.059), (23, -0.096), (24, 0.067), (25, 0.008), (26, 0.089), (27, -0.005), (28, 0.023), (29, -0.001), (30, -0.035), (31, -0.004), (32, -0.052), (33, -0.03), (34, -0.004), (35, -0.017), (36, -0.006), (37, -0.055), (38, -0.006), (39, 0.001), (40, 0.016), (41, 0.089), (42, 0.03), (43, 0.062), (44, 0.07), (45, 0.128), (46, -0.073), (47, -0.119), (48, 0.035), (49, -0.053)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91638309 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

Author: Manoj Harpalani ; Michael Hart ; Sandesh Singh ; Rob Johnson ; Yejin Choi

Abstract: Community-based knowledge forums, such as Wikipedia, are susceptible to vandalism, i.e., ill-intentioned contributions that are detrimental to the quality of collective intelligence. Most previous work to date relies on shallow lexico-syntactic patterns and metadata to automatically detect vandalism in Wikipedia. In this paper, we explore more linguistically motivated approaches to vandalism detection. In particular, we hypothesize that textual vandalism constitutes a unique genre where a group of people share a similar linguistic behavior. Experimental results suggest that (1) statistical models give evidence to unique language styles in vandalism, and that (2) deep syntactic patterns based on probabilistic context free grammars (PCFG) discriminate vandalism more effectively than shallow lexicosyntactic patterns based on n-grams. ,

2 0.86381245 337 acl-2011-Wikipedia Revision Toolkit: Efficiently Accessing Wikipedias Edit History

Author: Oliver Ferschke ; Torsten Zesch ; Iryna Gurevych

Abstract: We present an open-source toolkit which allows (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of provided data. By using a dedicated storage format, our toolkit massively decreases the data volume to less than 2% of the original size, and at the same time provides an easy-to-use interface to access the revision data. The language-independent design allows to process any language represented in Wikipedia. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia’s edit history.

3 0.70226568 213 acl-2011-Local and Global Algorithms for Disambiguation to Wikipedia

Author: Lev Ratinov ; Dan Roth ; Doug Downey ; Mike Anderson

Abstract: Disambiguating concepts and entities in a context sensitive way is a fundamental problem in natural language processing. The comprehensiveness of Wikipedia has made the online encyclopedia an increasingly popular target for disambiguation. Disambiguation to Wikipedia is similar to a traditional Word Sense Disambiguation task, but distinct in that the Wikipedia link structure provides additional information about which disambiguations are compatible. In this work we analyze approaches that utilize this information to arrive at coherent sets of disambiguations for a given document (which we call “global” approaches), and compare them to more traditional (local) approaches. We show that previous approaches for global disambiguation can be improved, but even then the local disambiguation provides a baseline which is very hard to beat.

4 0.57149011 285 acl-2011-Simple supervised document geolocation with geodesic grids

Author: Benjamin Wing ; Jason Baldridge

Abstract: We investigate automatic geolocation (i.e. identification of the location, expressed as latitude/longitude coordinates) of documents. Geolocation can be an effective means of summarizing large document collections and it is an important component of geographic information retrieval. We describe several simple supervised methods for document geolocation using only the document’s raw text as evidence. All of our methods predict locations in the context of geodesic grids of varying degrees of resolution. We evaluate the methods on geotagged Wikipedia articles and Twitter feeds. For Wikipedia, our best method obtains a median prediction error of just 11.8 kilometers. Twitter geolocation is more challenging: we obtain a median error of 479 km, an improvement on previous results for the dataset.

5 0.50701624 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis

Author: Daniel Bar ; Nicolai Erbs ; Torsten Zesch ; Iryna Gurevych

Abstract: We present Wikulu1, a system focusing on supporting wiki users with their everyday tasks by means of an intelligent interface. Wikulu is implemented as an extensible architecture which transparently integrates natural language processing (NLP) techniques with wikis. It is designed to be deployed with any wiki platform, and the current prototype integrates a wide range of NLP algorithms such as keyphrase extraction, link discovery, text segmentation, summarization, or text similarity. Additionally, we show how Wikulu can be applied for visually analyzing the results of NLP algorithms, educational purposes, and enabling semantic wikis.

6 0.50631529 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation

7 0.44932652 283 acl-2011-Simple English Wikipedia: A New Text Simplification Task

8 0.44460925 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

9 0.43844977 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search

10 0.4223035 254 acl-2011-Putting it Simply: a Context-Aware Approach to Lexical Simplification

11 0.41696388 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics

12 0.41553289 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories

13 0.41112697 84 acl-2011-Contrasting Opposing Views of News Articles on Contentious Issues

14 0.39269906 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing

15 0.37146872 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis

16 0.36616299 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations

17 0.36377549 298 acl-2011-The ACL Anthology Searchbench

18 0.34746 291 acl-2011-SystemT: A Declarative Information Extraction System

19 0.34479061 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

20 0.33083004 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.031), (5, 0.426), (17, 0.027), (26, 0.03), (31, 0.018), (37, 0.062), (39, 0.053), (41, 0.039), (55, 0.014), (59, 0.035), (70, 0.012), (72, 0.03), (91, 0.041), (96, 0.087)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91106737 288 acl-2011-Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications

Author: Cecilia Ovesdotter Alm

Abstract: This opinion paper discusses subjective natural language problems in terms of their motivations, applications, characterizations, and implications. It argues that such problems deserve increased attention because of their potential to challenge the status of theoretical understanding, problem-solving methods, and evaluation techniques in computational linguistics. The author supports a more holistic approach to such problems; a view that extends beyond opinion mining or sentiment analysis.

2 0.86709011 298 acl-2011-The ACL Anthology Searchbench

Author: Ulrich Schafer ; Bernd Kiefer ; Christian Spurk ; Jorg Steffen ; Rui Wang

Abstract: We describe a novel application for structured search in scientific digital libraries. The ACL Anthology Searchbench is meant to become a publicly available research tool to query the content of the ACL Anthology. The application provides search in both its bibliographic metadata and semantically analyzed full textual content. By combining these two features, very efficient and focused queries are possible. At the same time, the application serves as a showcase for the recent progress in natural language processing (NLP) research and language technology. The system currently indexes the textual content of 7,500 anthology papers from 2002–2009 with predicate-argument-like semantic structures. It also provides useful search filters based on bibliographic metadata. It will be extended to provide the full anthology content and enhanced functionality based on further NLP techniques.

1 Introduction and Motivation

Scientists in all disciplines nowadays are faced with a flood of new publications every day. In addition, more and more publications from the past become digitally available and thus increase the amount even further. Finding relevant information and avoiding duplication of work have become urgent issues to be addressed by the scientific community. The organization and preservation of scientific knowledge in scientific publications, vulgo text documents, thwarts these efforts. From the viewpoint of a computer scientist, scientific papers are just ‘unstructured information’. At least in our own scientific community, Computational Linguistics, it is generally assumed that NLP could help to support search in such document collections. The ACL Anthology is a comprehensive electronic collection of scientific papers in our own field (Bird et al., 2008). It is updated regularly with new publications, but older papers have also been scanned and made available electronically.
We have implemented the ACL Anthology Searchbench (http://aclasb.dfki.de) for two reasons. Our first aim is to provide a more targeted search facility in this collection than standard web search on the anthology website (http://www.aclweb.org/anthology). In this sense, the Searchbench is meant to become a service to our own community. Our second motivation is to use the developed system as a showcase for the progress that has been made over the last years in precision-oriented deep linguistic parsing, in terms of both efficiency and coverage, specifically in the context of the DELPH-IN community (http://www.delph-in.net; DELPH-IN stands for DEep Linguistic Processing with HPSG INitiative). Our system also uses further NLP techniques such as unsupervised term extraction, named entity recognition and part-of-speech (PoS) tagging. By automatically precomputing normalized semantic representations (predicate-argument structure) of each sentence in the anthology, the search space is structured and makes it possible to find equivalent or related predicates even if they are expressed differently, e.g. in passive constructions, using synonyms, etc. By storing the semantic sentence structure along with the original text in a structured full-text search engine, it can be guaranteed that recall cannot fall behind the baseline of a full-text search. In addition, the Searchbench also provides detailed bibliographic metadata for filtering as well as autosuggest texts for input fields computed from the corpus: two further key features one can expect from such systems today, and very important for efficient search in digital libraries. We describe the offline preprocessing and deep parsing approach in Section 2. Section 3 concentrates on the generation of the semantic search index.
In Section 4, we describe the search interface. We conclude in Section 5 and present an outlook on future extensions.

2 Parsing the ACL Anthology

The basis of the search index for the ACL Anthology are its original PDF documents, currently 8,200 from the years 2002 through 2009. To overcome quality problems in text extraction from PDF, we use a commercial PDF extractor based on OCR techniques. This approach guarantees uniform and high-quality textual representations even for older papers in the anthology (before 2000), which mostly were scanned from printed paper versions. The general idea of the semantics-oriented access to scholarly paper content is to parse each sentence they contain with the open-source HPSG (Pollard and Sag, 1994) grammar for English (ERG; Flickinger (2002)) and then distill and index semantically structured representations for search. To make the deep parser robust, it is embedded in an NLP workflow. The coverage (percentage of fully deeply parsed sentences) on the anthology corpus could be increased from 65 % to now more than 85 % through a careful combination of several robustness techniques, for example: (1) chart pruning, directed search during parsing to increase performance, and also coverage for longer sentences (Cramer and Zhang, 2010); (2) chart mapping, a novel method for integrating preprocessing information in exactly the way the deep grammar expects it (Adolphs et al., 2008); (3) a new version of the ERG with better handling of open word classes; (4) more fine-grained named entity recognition, including recognition of citation patterns; (5) a new, better suited parse ranking model (WeScience; Flickinger et al. (2010)). Because of limited space, we will focus on (1) and (2) below. A more detailed description and further results are available in Schäfer and Kiefer (2011).
Except for a small part of the named entity recognition components (citations, some terminology) and the parse ranking model, there are no further adaptations to the genre or domain of the text corpus. This implies that the NLP workflow could be easily and modularly adapted to other (scientific or non-scientific) domains, mainly thanks to the generic and comprehensive language modelling in the ERG. The NLP preprocessing component workflow is implemented using the Heart of Gold NLP middleware architecture (Schäfer, 2006). It starts with sentence boundary detection (SBR) and regular-expression-based tokenization using its built-in component JTok, followed by the trigram-based PoS tagger TnT (Brants, 2000) trained on the Penn Treebank (Marcus et al., 1993) and the named entity recognizer SProUT (Drożdżyński et al., 2004).

2.1 Precise Preprocessing Integration with Chart Mapping

Tagger output is combined with information from the named entity recognizer, e.g. delivering hypothetical information on citation expressions. The combined result is delivered as input to the deep parser PET (Callmeier, 2000) running the ERG. Here, citations, for example, can be treated as either persons, locations or appositions. Concerning punctuation, the ERG can make use of information on opening and closing quotation marks. Such information is often not explicit in the input text, e.g. when, as in our setup, it is gained through OCR, which does not distinguish between ‘ and ’ or “ and ”. However, a tokenizer can often guess (reconstruct) leftness and rightness correctly. This information, passed to the deep parser via chart mapping, helps it to disambiguate.

2.2 Increased Processing Speed and Coverage through Chart Pruning

In addition to a well-established discriminative maximum entropy model for post-analysis parse selection, we use an additional generative model as described in Cramer and Zhang (2010) to restrict the search space during parsing.
This restriction increases not only efficiency but also coverage, because the parse time was restricted to at most 60 CPU seconds on a standard PC, and more sentences could now be parsed within these bounds. A 4 GB limit for main memory consumption was far beyond what was ever needed. We saw a small but negligible decrease in parsing accuracy: in 5.4 % of cases the best parse was not found due to the pruning of important chart edges. Ninomiya et al. (2006) did a very thorough comparison of different performance optimization strategies, among those also a local pruning strategy similar to the one used here. There is an important difference between the systems, in that theirs works on a reduced context-free backbone first and reconstructs the results with the full grammar, while PET uses the HPSG grammar directly, with subsumption packing and partial unpacking to achieve a similar effect as the packed chart of a context-free parser.

[Figure 1: Distribution of sentence length and mean parse times for mild pruning]

In total, we parsed 1,537,801 sentences, of which 57,832 (3.8 %) could not be parsed because of lexicon errors. Most of them were caused by OCR artifacts resulting in unexpected punctuation character combinations. These can be identified and will be deleted in the future. Figure 1 displays the average parse time of processing with a mild chart pruning setting, together with the mean quadratic error. In addition, it contains the distribution of input sentences over sentence length. Obviously, the vast majority of sentences has a length of at most 60 words. The parse times only grow mildly, due to the many optimization techniques in the original system and the new chart pruning method. The sentence length distribution has been integrated into Figure 1 to show that the predominant part of our real-world corpus can be processed using this information-rich method with very low parse times (overall average parse time < 2 s per sentence).
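The local chart pruning idea sketched in this section (bound the work per chart cell by keeping only the best-scoring edges) can be illustrated with a toy function. This is a minimal sketch of the general technique, not the actual PET/ERG implementation: the cell and edge representation below is an assumption of mine.

```python
def prune_chart(cells, beam=5):
    """Local chart pruning: keep only the `beam` highest-scoring edges in
    each chart cell, discarding the rest before they can spawn further
    parser work. `cells` maps a span (i, j) to a list of (score, label)
    edges; scores stand in for the generative model of Cramer and Zhang."""
    return {
        span: sorted(edges, key=lambda e: e[0], reverse=True)[:beam]
        for span, edges in cells.items()
    }
```

Pruning at the cell level trades a small risk of discarding an edge needed for the best parse (the 5.4 % accuracy cost reported above) for a hard bound on chart size, which is what lets long sentences finish within the 60-second budget.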
The large amount of short inputs is at first surprising, even more so since most of these inputs cannot be parsed. Most of them are non-sentences such as headings, enumerations, footnotes, and table cell content. There are several alternatives for dealing with such input: one is to identify and handle them in a preprocessing step; another is to use a special root condition in the deep analysis component that is able to combine phrases with well-defined properties for inputs where no spanning result could be found. We employed the second method, which has the advantage that it handles a larger range of phenomena in a homogeneous way.

[Figure 2: Unparsed and timed out sentences with and without fragment combination]

Figure 2 shows the change in the percentage of unparsed and timed-out inputs for the mild pruning method with and without the root condition combining fragments. It shows that this changes the curve for unparsed sentences towards more expected characteristics and removes the uncommonly high percentage of short sentences for which no parse can be computed. (It has to be pointed out that extremely long sentences may also be non-sentences resulting from PDF extraction errors, missing punctuation, etc. No manual correction took place.)

[Figure 3: Multiple semantic tuples may be generated for a sentence]

Together with the parses for fragmented input, we get a recall (sentences with at least one parse) over the whole corpus of 85.9 % (1,321,336 sentences), without a significant change for any of the other measures, and with potential for further improvement.
3 Semantic Tuple Extraction with DMRS

In contrast to shallow parsers, the ERG not only handles detailed syntactic analyses of phrases, compounds, coordination, negation and other linguistic phenomena that are important for extracting semantic relations, but also generates a formal semantic representation of the meaning of the input sentence in the Minimal Recursion Semantics (MRS) representation format (Copestake et al., 2005). It consists of elementary predications for each word and larger constituents, connected via argument positions and variables, from which predicate-argument structure can be extracted. MRS representations resulting from deep parsing are still relatively close to linguistic structures and contain more detailed information than a user would like to query and search for. Therefore, an additional extraction and abstraction step is performed before storing semantic structures in the search index. Firstly, MRS is converted to DMRS (Copestake, 2009), a dependency-style version of MRS that eases extraction of predicate-argument structure using the implementation in LKB (Copestake, 2002). The representation format we devised for the search index we call semantic tuples, in fact quintuples
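The distillation step described above (reducing a DMRS-style graph of elementary predications to flat, searchable predicate-argument tuples) can be sketched as follows. The input format here is a deliberate simplification, not the actual LKB/DMRS data structures, and the ARG1/ARG2 role names are the only part taken from MRS convention.

```python
def extract_tuples(predications):
    """Distill (ARG1, predicate, ARG2) tuples from a flat list of
    elementary predications, each given as a (predicate, {role: filler})
    pair. Predications without an ARG1 (e.g. quantifiers in a fuller
    representation) contribute no tuple."""
    tuples = []
    for pred, args in predications:
        subj = args.get("ARG1")
        obj = args.get("ARG2")  # may be absent for one-place predicates
        if subj is not None:
            tuples.append((subj, pred, obj))
    return tuples
```

Indexing such tuples rather than raw text is what allows the Searchbench to match a passive sentence against an active-voice query: both normalize to the same (ARG1, predicate, ARG2) triple.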

same-paper 3 0.81605083 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

Author: Manoj Harpalani ; Michael Hart ; Sandesh Signh ; Rob Johnson ; Yejin Choi

Abstract: Community-based knowledge forums, such as Wikipedia, are susceptible to vandalism, i.e., ill-intentioned contributions that are detrimental to the quality of collective intelligence. Most previous work to date relies on shallow lexico-syntactic patterns and metadata to automatically detect vandalism in Wikipedia. In this paper, we explore more linguistically motivated approaches to vandalism detection. In particular, we hypothesize that textual vandalism constitutes a unique genre where a group of people share a similar linguistic behavior. Experimental results suggest that (1) statistical models give evidence to unique language styles in vandalism, and that (2) deep syntactic patterns based on probabilistic context free grammars (PCFG) discriminate vandalism more effectively than shallow lexicosyntactic patterns based on n-grams.

4 0.81541711 260 acl-2011-Recognizing Authority in Dialogue with an Integer Linear Programming Constrained Model

Author: Elijah Mayfield ; Carolyn Penstein Rose

Abstract: We present a novel computational formulation of speaker authority in discourse. This notion, which focuses on how speakers position themselves relative to each other in discourse, is first developed into a reliable coding scheme (0.71 agreement between human annotators). We also provide a computational model for automatically annotating text using this coding scheme, using supervised learning enhanced by constraints implemented with Integer Linear Programming. We show that this constrained model’s analyses of speaker authority correlates very strongly with expert human judgments (r2 coefficient of 0.947).

5 0.6838457 60 acl-2011-Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability

Author: Jonathan H. Clark ; Chris Dyer ; Alon Lavie ; Noah A. Smith

Abstract: In statistical machine translation, a researcher seeks to determine whether some innovation (e.g., a new feature, model, or inference algorithm) improves translation quality in comparison to a baseline system. To answer this question, he runs an experiment to evaluate the behavior of the two systems on held-out data. In this paper, we consider how to make such experiments more statistically reliable. We provide a systematic analysis of the effects of optimizer instability—an extraneous variable that is seldom controlled for—on experimental outcomes, and make recommendations for reporting results more accurately.
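The paper's central recommendation, controlling for optimizer instability by running the optimizer several times and reporting variation across runs rather than a single score, can be sketched as below. This is an illustrative sketch under my own assumptions (placeholder scores, a simple percentile bootstrap over runs), not the paper's exact procedure.

```python
import math
import random

def summarize_runs(scores):
    """Mean and sample standard deviation of a metric (e.g. BLEU) across
    several independent optimizer runs of the same system."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(var)

def bootstrap_diff_ci(a_scores, b_scores, iters=10000, seed=0):
    """Percentile-bootstrap 95% confidence interval for the difference in
    mean score between two systems, resampling over optimizer runs so the
    interval reflects optimizer noise, not just test-set noise."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(iters):
        a = [rng.choice(a_scores) for _ in a_scores]
        b = [rng.choice(b_scores) for _ in b_scores]
        diffs.append(sum(a) / len(a) - sum(b) / len(b))
    diffs.sort()
    return diffs[int(0.025 * iters)], diffs[int(0.975 * iters)]
```

If the interval returned by `bootstrap_diff_ci` straddles zero, the apparent improvement of system A over system B may be an artifact of a lucky optimizer run rather than a real gain.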

6 0.4375259 33 acl-2011-An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue

7 0.43466428 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing

8 0.4319033 307 acl-2011-Towards Tracking Semantic Change by Visual Analytics

9 0.43128666 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding

10 0.4305996 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations

11 0.42606568 133 acl-2011-Extracting Social Power Relationships from Natural Language

12 0.42382413 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

13 0.42210075 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

14 0.42196724 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing

15 0.42191386 8 acl-2011-A Corpus of Scope-disambiguated English Text

16 0.41944417 176 acl-2011-Integrating surprisal and uncertain-input models in online sentence comprehension: formal techniques and empirical results

17 0.41515782 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices

18 0.41311496 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

19 0.41023552 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics

20 0.40877265 226 acl-2011-Multi-Modal Annotation of Quest Games in Second Life