acl acl2011 acl2011-283 knowledge-graph by maker-knowledge-mining

283 acl-2011-Simple English Wikipedia: A New Text Simplification Task


Source: pdf

Author: William Coster ; David Kauchak

Abstract: In this paper we examine the task of sentence simplification which aims to reduce the reading complexity of a sentence by incorporating more accessible vocabulary and sentence structure. We introduce a new data set that pairs English Wikipedia with Simple English Wikipedia and is orders of magnitude larger than any previously examined for sentence simplification. The data contains the full range of simplification operations including rewording, reordering, insertion and deletion. We provide an analysis of this corpus as well as preliminary results using a phrase-based translation approach for simplification.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Simple English Wikipedia: William Coster, Computer Science Department, Pomona College, Claremont, CA 91711, wpc02009@pomona.edu. [sent-1, score-0.029]

2 Abstract In this paper we examine the task of sentence simplification which aims to reduce the reading complexity of a sentence by incorporating more accessible vocabulary and sentence structure. [sent-2, score-1.007]

3 We introduce a new data set that pairs English Wikipedia with Simple English Wikipedia and is orders of magnitude larger than any previously examined for sentence simplification. [sent-3, score-0.16]

4 The data contains the full range of simplification operations including rewording, reordering, insertion and deletion. [sent-4, score-0.733]

5 We provide an analysis of this corpus as well as preliminary results using a phrase-based translation approach for simplification. [sent-5, score-0.029]

6 1 Introduction The task of text simplification aims to reduce the complexity of text while maintaining the content (Chandrasekar and Srinivas, 1997; Carroll et al. [sent-6, score-0.787]

7 In this paper, we explore the sentence simplification problem: given a sentence, the goal is to produce an equivalent sentence where the vocabulary and sentence structure are simpler. [sent-8, score-1.007]

8 Simplification techniques can be used to make text resources available to a broader range of readers, including children, language learners, the elderly, the hearing impaired and people with aphasia or cognitive disabilities (Carroll et al. [sent-10, score-0.159]

9 As a preprocessing step, simplification can improve the performance of NLP tasks, including parsing, semantic role labeling, machine translation and summarization (Miwa et al. [sent-12, score-0.745]

10 One of the key challenges for text simplification is data availability. [sent-18, score-0.714]

11 The small amount of simplification data currently available has prevented the application of data-driven techniques like those used in other text-to-text translation areas (Och and Ney, 2004; Chiang, 2010). [sent-19, score-0.716]

12 Most prior techniques for text simplification have involved either hand-crafted rules (Vickrey and Koller, 2008; Feng, 2008) or learned within a very restricted rule space (Chandrasekar and Srinivas, 1997). [sent-20, score-0.714]

13 We have generated a data set consisting of 137K aligned simplified/unsimplified sentence pairs by pairing documents, then sentences from English Wikipedia1 with corresponding documents and sentences from Simple English Wikipedia2. [sent-21, score-0.455]

14 Simple English Wikipedia contains articles aimed at children and English language learners and contains similar content to English Wikipedia but with simpler vocabulary and grammar. [sent-22, score-0.089]

15 Figure 1 shows example sentence simplifications from the data set. [sent-23, score-0.232]

16 Like machine translation and other text-to-text domains, text simplification involves the full range of transformation operations including deletion, rewording, reordering and insertion. [sent-24, score-0.821]

17 Figure 1: Example sentence simplifications extracted from Wikipedia. [sent-31, score-0.232]

18 Normal refers to a sentence in an English Wikipedia article and Simple to a corresponding sentence in Simple English Wikipedia. [sent-32, score-0.32]

19 2 Previous Data Wikipedia and Simple English Wikipedia have both received some recent attention as a useful resource for text simplification and the related task of text compression. [sent-33, score-0.757]

20 Yamangil and Nelken (2008) examine the history logs of English Wikipedia to learn sentence compression rules. [sent-34, score-0.296]

21 (2010) learn a set of candidate phrase simplification rules based on edits identified in the revision histories of both Simple English Wikipedia and English Wikipedia. [sent-36, score-0.754]

22 However, they only provide a list of the top phrasal simplifications and do not utilize them in an end-to-end simplification system. [sent-37, score-0.791]

23 Although the simplification problem shares some characteristics with the text compression problem, existing text compression data sets are small and contain a restricted set of possible transformations (often only deletion). [sent-39, score-1.125]

24 Knight and Marcu (2002) introduced the Zipf-Davis corpus which contains 1K sentence pairs. [sent-40, score-0.112]

25 Cohn and Lapata (2009) manually generated two parallel corpora from news stories totaling 3K sentence pairs. [sent-41, score-0.172]

26 Finally, Nomoto (2009) generated a data set based on RSS feeds containing 2K sentence pairs. [sent-42, score-0.143]

27 3 Simplification Corpus Generation We generated a parallel simplification corpus by aligning sentences between English Wikipedia and Simple English Wikipedia. [sent-43, score-0.801]

28 We first paired the articles by title, then removed all article pairs where either article: contained only a single line, was flagged as a stub, was flagged as a disambiguation page or was a meta-page about Wikipedia. [sent-45, score-0.276]
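
As a rough illustration of the pairing-and-filtering step just described, the sketch below pairs articles by title and drops the excluded page types. The dictionary fields and the stub/disambiguation markers checked here are hypothetical placeholders, since the extracted text does not give the exact heuristics.

```python
# Sketch: pair Simple English Wikipedia articles with English Wikipedia
# articles by title, then drop pairs matching the filters described above.
# The dict fields and marker strings below are illustrative only.

def is_filtered(article):
    """Return True if an article should be excluded from the corpus."""
    text = article["text"]
    return (
        len(text.strip().splitlines()) <= 1            # only a single line
        or "{{stub}}" in text.lower()                  # flagged as a stub
        or "{{disambiguation}}" in text.lower()        # disambiguation page
        or article["title"].startswith("Wikipedia:")   # meta-page about Wikipedia
    )

def pair_articles(simple_articles, normal_articles):
    """Pair articles by exact title match and apply the filters."""
    normal_by_title = {a["title"]: a for a in normal_articles}
    pairs = []
    for simple in simple_articles:
        normal = normal_by_title.get(simple["title"])
        if normal is not None and not (is_filtered(simple) or is_filtered(normal)):
            pairs.append((normal, simple))
    return pairs
```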

29 After pairing and filtering, 10,588 aligned, content article pairs remained (a 90% reduction from the original 110K Simple English Wikipedia articles). [sent-46, score-0.214]

30 Throughout the rest of this paper we will refer to unsimplified text from English Wikipedia as normal and to the simplified version from Simple English Wikipedia as simple. [sent-47, score-0.318]

31 To generate aligned sentence pairs from the aligned document pairs we followed an approach similar to those utilized in previous monolingual alignment problems (Barzilay and Elhadad, 2003; Nelken and Shieber, 2006). [sent-48, score-0.587]

32 Each simple paragraph was then aligned to every normal paragraph where the TF-IDF cosine similarity was over a threshold of 0. [sent-50, score-0.793]

33 We initially investigated the paragraph clustering preprocessing step in (Barzilay and Elhadad, 2003), but did not find a qualitative difference and opted for the simpler similarity-based alignment approach, which does not require manual annotation. [sent-52, score-0.204]
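
The similarity-based paragraph alignment described above can be sketched with off-the-shelf TF-IDF vectors and cosine similarity, as below using scikit-learn. The 0.5 threshold is an assumed placeholder, since the exact value is cut off in the extracted sentence.

```python
# Sketch: align each simple paragraph to every normal paragraph whose
# TF-IDF cosine similarity exceeds a threshold. The 0.5 default is an
# assumed placeholder value.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def align_paragraphs(normal_paragraphs, simple_paragraphs, threshold=0.5):
    """Return (normal_index, simple_index) pairs above the similarity threshold."""
    vectorizer = TfidfVectorizer()
    # Fit one vocabulary over both sides so the vectors are comparable.
    vectors = vectorizer.fit_transform(normal_paragraphs + simple_paragraphs)
    normal_vecs = vectors[: len(normal_paragraphs)]
    simple_vecs = vectors[len(normal_paragraphs):]
    sims = cosine_similarity(simple_vecs, normal_vecs)  # shape: (|simple|, |normal|)
    alignments = []
    for j, row in enumerate(sims):
        for i, score in enumerate(row):
            if score >= threshold:
                alignments.append((i, j))
    return alignments
```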

34 a simple paragraph and one or more normal paragraphs), we then used a dynamic programming approach to find the best global sentence alignment following Barzilay and Elhadad (2003). [sent-55, score-0.669]

35 sim(i, j) is the similarity between the ith normal sentence and the jth simple sentence and was calculated using TF-IDF cosine similarity. [sent-57, score-0.62]

36 Barzilay and Elhadad (2003) further discourage aligning dissimilar sentences by including a “mismatch penalty” in the similarity measure. [sent-60, score-0.142]

37 Instead, we included a filtering step removing all sentence pairs with a normalized similarity below a threshold of 0. [sent-61, score-0.244]
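
A minimal sketch of the dynamic-programming sentence alignment and the final similarity filter follows. It uses a simplified one-to-one recurrence with skip penalties rather than the exact formulation of Barzilay and Elhadad (2003), the skip penalty and the 0.5 cutoff are assumed placeholders, and `sim` is taken to be the same TF-IDF cosine similarity used above.

```python
# Sketch: global sentence alignment between an aligned (normal, simple)
# paragraph pair via dynamic programming, followed by the similarity filter.
# One-to-one matches with skip penalties only; the paper's actual recurrence
# may differ. sim(i, j) scores the ith normal against the jth simple sentence.

def align_sentences(normal_sents, simple_sents, sim, skip_penalty=0.1, threshold=0.5):
    n, m = len(normal_sents), len(simple_sents)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = score[i - 1][0] - skip_penalty, "up"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = score[0][j - 1] - skip_penalty, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (score[i - 1][j - 1] + sim(i - 1, j - 1), "diag"),  # align the pair
                (score[i - 1][j] - skip_penalty, "up"),             # skip a normal sentence
                (score[i][j - 1] - skip_penalty, "left"),           # skip a simple sentence
            ]
            score[i][j], back[i][j] = max(candidates)
    # Backtrace, keeping only aligned pairs above the similarity threshold.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            if sim(i - 1, j - 1) >= threshold:
                pairs.append((normal_sents[i - 1], simple_sents[j - 1]))
            i, j = i - 1, j - 1
        elif move == "up":
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))
```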

38 We found this approach to be more intuitive, and it allowed us to compare the effects of differing levels of similarity in the training set. [sent-63, score-0.043]

39 Our choice of threshold is high enough to ensure that most alignments are correct, but low enough to allow for variation in the paired sentences. [sent-64, score-0.041]

40 In the future, we hope to explore other similarity techniques that will pair sentences with even larger variation. [sent-65, score-0.091]

41 4 Corpus Analysis From the 10K article pairs, we extracted 75K aligned paragraphs. [sent-66, score-0.224]

42 From these, we extracted the final set of 137K aligned sentence pairs. [sent-67, score-0.24]

43 To evaluate the quality of the aligned sentences, we asked two human evaluators to independently judge whether or not the aligned sentences were correctly aligned on a random sample of 100 sentence pairs. [sent-68, score-0.598]

44 91/100 were identified as correct, though many of the remaining 9 also had some partial content overlap. [sent-70, score-0.03]

45 We also repeated the experiment using only those sentences with a similarity above 0. [sent-71, score-0.091]

46 This reduced the number of pairs from 137K to 90K, but the evaluators identified 98/100 as correct. [sent-74, score-0.102]

47 The analysis throughout the rest of the section is for the threshold of 0.5, though similar results were also seen for the higher threshold. [sent-75, score-0.041] [sent-76, score-0.041]

49 Although the average simple article contained approximately 40 sentences, we extracted an average of 14 aligned sentence pairs per article. [sent-78, score-0.462]

50 Qualitatively, it is rare to find a simple article that is a direct translation of the normal article, that is, a simple article that was generated by only making sentence-level changes to the normal document. [sent-79, score-0.974]

51 However, there is a strong relationship between the two data sets: 27% of our aligned sentences were identical between simple and normal. [sent-80, score-0.254]

52 We left these identical sentence pairs in our data set since not all sentences need to be simplified and it is important for any simplification algorithm to be able to handle this case. [sent-81, score-0.879]

53 Much of the content without direct correspondence is removed during paragraph alignment. [sent-82, score-0.144]

54 65% of the simple paragraphs do not align to a normal paragraph and are ignored. [sent-83, score-0.602]

55 On top of this, within aligned paragraphs, there are a large number of sentences that do not align. [sent-84, score-0.176]

56 Table 1 shows the proportion of the different sentence level alignment operations in our data set. [sent-85, score-0.264]

57 On both the simple and normal sides there are many sentences that do not align. [sent-86, score-0.401]

58 Table 1: Proportions of the different sentence-level alignment operations, based on our learned sentence alignment. [sent-87, score-0.112]

59 To better understand how sentences are transformed from normal to simple sentences we learned a word alignment using GIZA++ (Och and Ney, 2003). [sent-89, score-0.571]

60 Table 2: Percentage of sentence pairs that contained word-level operations, based on the induced word alignment. [sent-91, score-0.062]

61 Splits and merges are from the perspective of words in the normal sentence. [sent-92, score-0.314]

62 Table 2 shows the percentage of each of these phenomena occurring in the sentence pairs. [sent-94, score-0.149]
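
Given a word alignment of the kind GIZA++ produces, the word-level operations behind Table 2 can be tallied roughly as in the sketch below. The counting rules here are illustrative assumptions; the extracted text does not spell out the authors' exact definitions.

```python
# Sketch: categorize word-level operations from a word alignment.
# `alignment` is a set of (i, j) links between normal token i and simple
# token j. The categories loosely mirror Table 2.
from collections import Counter

def count_operations(normal_tokens, simple_tokens, alignment):
    ops = Counter()
    links_from_normal = Counter(i for i, _ in alignment)
    links_from_simple = Counter(j for _, j in alignment)
    for i, j in alignment:
        if normal_tokens[i].lower() == simple_tokens[j].lower():
            ops["copy"] += 1
        else:
            ops["reword"] += 1
    # Splits and merges are taken from the perspective of the normal sentence:
    # a normal word linked to several simple words is a split; several normal
    # words linked to one simple word form a merge.
    ops["split"] = sum(1 for count in links_from_normal.values() if count > 1)
    ops["merge"] = sum(1 for count in links_from_simple.values() if count > 1)
    # Unaligned tokens are treated as deletions (normal side) or insertions
    # (simple side).
    ops["delete"] = sum(1 for i in range(len(normal_tokens)) if links_from_normal[i] == 0)
    ops["insert"] = sum(1 for j in range(len(simple_tokens)) if links_from_simple[j] == 0)
    return ops
```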

63 All of the different operations occur frequently in the data set with rewordings being particularly prevalent. [sent-95, score-0.128]

64 5 Sentence-level Text Simplification To understand the usefulness of this data we ran preliminary experiments to learn a sentence-level simplification system. [sent-96, score-0.732]

65 We view the problem of text simplification as an English-to-English translation problem. [sent-97, score-0.759]

66 Motivated by the importance of lexical changes, we used Moses, a phrase-based machine translation system (Och and Ney, 2004). [sent-98, score-0.045]

67 We trained Moses on 124K pairs from the data set and the n-gram language model on the simple side of this data. [sent-99, score-0.126]

68 We trained the hyper-parameters of the log-linear model on a 500-sentence-pair development set. [sent-100, score-0.112]
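
The setup above treats normal-to-simple rewriting as a translation problem. The sketch below only prepares the train/dev/test splits and parallel files that a phrase-based system such as Moses consumes; the file names and preprocessing choices are assumptions, not the authors' actual pipeline.

```python
# Sketch: split the aligned corpus into the train/dev/test portions described
# above and write parallel files (one sentence per line, sides kept in step).
# Tokenization and lowercasing details are assumptions not given in the text.
import random

def write_parallel(pairs, prefix):
    """Write one sentence per line, keeping the two sides in step."""
    with open(prefix + ".normal", "w") as src, open(prefix + ".simple", "w") as tgt:
        for normal, simple in pairs:
            src.write(normal.strip() + "\n")
            tgt.write(simple.strip() + "\n")

def make_splits(pairs, dev_size=500, test_size=1300, seed=0):
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    test = pairs[:test_size]
    dev = pairs[test_size:test_size + dev_size]
    train = pairs[test_size + dev_size:]
    write_parallel(train, "train")  # ~124K pairs for the translation model
    write_parallel(dev, "dev")      # 500 pairs for tuning the log-linear weights
    write_parallel(test, "test")    # 1300 held-out pairs for evaluation
    # The n-gram language model is then trained on train.simple only.
```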

69 We compared the trained system to a baseline of not doing any simplification (NONE). [sent-101, score-0.671]

70 We evaluated the two approaches on a test set of 1300 sentence pairs. [sent-102, score-0.112]

71 Since there is currently no standard for automatically evaluating sentence simplification, we used three different automatic measures that have been used in related domains: BLEU, which has been used extensively in machine translation (Papineni et al. [sent-103, score-0.157]

72 , 2002), and word-level F1 and simple string accuracy (SSA), which have been suggested. (We also experimented with T3 (Cohn and Lapata, 2009), but the results were poor and are not presented here.) [sent-104, score-0.078]

73 All three of these measures have been shown to correlate with human judgements in their respective domains. [sent-110, score-0.037]
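
For reference, sentence-level versions of two of these measures can be written in a few lines. The word-level F1 here is a simple token-overlap variant and the BLEU call uses NLTK's implementation; either may differ in detail from the exact formulations used in the paper.

```python
# Sketch: word-level F1 (token overlap) and sentence-level BLEU for a single
# hypothesis/reference pair. Illustrative implementations only.
from collections import Counter
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def word_f1(hypothesis, reference):
    """Token-overlap F1 between a hypothesis and a single reference."""
    hyp, ref = hypothesis.split(), reference.split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def sent_bleu(hypothesis, reference):
    """Sentence-level BLEU with smoothing (NLTK implementation)."""
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference.split()], hypothesis.split(), smoothing_function=smooth)
```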

74 Although the baseline does well (recall that over a quarter of the sentence pairs in the data set are identical), the phrase-based approach does obtain a statistically significant improvement. [sent-114, score-0.16]

75 To understand the limits of the phrase-based model for text simplification, we generated an n-best list of the 1000 most likely simplifications for each test sentence. [sent-115, score-0.226]

76 We then greedily picked the simplification from this n-best list that had the highest sentence-level BLEU score based on the test examples, labeled Moses-Oracle in Table 3. [sent-116, score-0.671]
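
The Moses-Oracle score comes from re-ranking the n-best list against the reference; a minimal sketch of that selection, reusing the hypothetical sent_bleu helper from the previous sketch:

```python
# Sketch: pick, for each test sentence, the candidate from the n-best list
# with the highest sentence-level BLEU against the reference.
def oracle_simplification(nbest_candidates, reference):
    """Return the n-best candidate with the highest sentence-level BLEU."""
    return max(nbest_candidates, key=lambda cand: sent_bleu(cand, reference))
```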

77 6 Conclusion We have described a new text simplification data set generated from aligning sentences in Simple English Wikipedia with sentences in English Wikipedia. [sent-119, score-0.892]

78 The data set is orders of magnitude larger than any currently available for text simplification or for the related field of text compression and is publicly available. [sent-120, score-0.941]

79 We provided preliminary text simplification results using Moses, a phrase-based translation system, and saw a statistically significant improvement of 0.005 BLEU over the baseline of no simplification, and showed that further improvement is possible. [sent-121, score-0.788] [sent-122, score-0.671]

81 In the future, we hope to explore alignment techniques more tailored to simplification as well as applications of this data to text simplification. [sent-124, score-0.804]

82 Practical simplification of English newspaper text to assist aphasic readers. [sent-134, score-0.743]

83 Models for sentence compression: A comparison across domains, training requirements and evaluation measures. [sent-146, score-0.112]

84 To- wards effective sentence simplification for automatic processing of biomedical text. [sent-166, score-0.783]

85 Summarization beyond sentence extraction: A probabilistic approach to sentence compression. [sent-174, score-0.224]

86 Learning simple Wikipedia: A cogitation in ascertaining abecedarian language. [sent-194, score-0.165]

87 A comparison of model free versus model intensive approaches to sentence compression. [sent-210, score-0.112]

88 For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. [sent-246, score-0.12]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('simplification', 0.671), ('normal', 0.275), ('wikipedia', 0.227), ('compression', 0.184), ('aligned', 0.128), ('simplifications', 0.12), ('chandrasekar', 0.117), ('paragraph', 0.114), ('sentence', 0.112), ('pomona', 0.099), ('article', 0.096), ('alignment', 0.09), ('tadashi', 0.087), ('vickrey', 0.087), ('paragraphs', 0.084), ('elhadad', 0.083), ('align', 0.081), ('nelken', 0.081), ('simple', 0.078), ('barzilay', 0.067), ('claremont', 0.066), ('flagged', 0.066), ('rewordings', 0.066), ('moses', 0.065), ('english', 0.065), ('och', 0.065), ('srinivas', 0.062), ('operations', 0.062), ('cohn', 0.061), ('carroll', 0.06), ('lapata', 0.059), ('rewording', 0.058), ('miwa', 0.058), ('nomoto', 0.058), ('napoles', 0.058), ('yamangil', 0.058), ('skip', 0.056), ('feng', 0.056), ('evaluators', 0.054), ('rani', 0.054), ('bleu', 0.052), ('aligning', 0.051), ('yatskar', 0.05), ('sentences', 0.048), ('pairs', 0.048), ('translation', 0.045), ('text', 0.043), ('revision', 0.043), ('similarity', 0.043), ('threshold', 0.041), ('clarke', 0.04), ('histories', 0.04), ('pairing', 0.04), ('koller', 0.039), ('merges', 0.039), ('ofthese', 0.037), ('monolingual', 0.033), ('knight', 0.033), ('mirella', 0.032), ('understand', 0.032), ('ge', 0.032), ('deletion', 0.032), ('generated', 0.031), ('content', 0.03), ('splits', 0.03), ('learners', 0.03), ('ney', 0.03), ('summarization', 0.029), ('children', 0.029), ('raman', 0.029), ('impaired', 0.029), ('trimmer', 0.029), ('hearing', 0.029), ('totaling', 0.029), ('rune', 0.029), ('saetre', 0.029), ('kauchak', 0.029), ('viren', 0.029), ('wpc', 0.029), ('rss', 0.029), ('abecedarian', 0.029), ('aphasia', 0.029), ('aphasic', 0.029), ('ascertaining', 0.029), ('baral', 0.029), ('canning', 0.029), ('chitta', 0.029), ('cogitation', 0.029), ('disabilities', 0.029), ('graciela', 0.029), ('hakenberg', 0.029), ('jonnalagadda', 0.029), ('lijun', 0.029), ('siddhartha', 0.029), ('simplication', 0.029), ('siobhan', 0.029), ('tari', 0.029), ('elif', 0.029), ('penalty', 0.029), ('preliminary', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 283 acl-2011-Simple English Wikipedia: A New Text Simplification Task

Author: William Coster ; David Kauchak

Abstract: In this paper we examine the task of sentence simplification which aims to reduce the reading complexity of a sentence by incorporating more accessible vocabulary and sentence structure. We introduce a new data set that pairs English Wikipedia with Simple English Wikipedia and is orders of magnitude larger than any previously examined for sentence simplification. The data contains the full range of simplification operations including rewording, reordering, insertion and deletion. We provide an analysis of this corpus as well as preliminary results using a phrase-based translation approach for simplification.

2 0.627666 254 acl-2011-Putting it Simply: a Context-Aware Approach to Lexical Simplification

Author: Or Biran ; Samuel Brody ; Noemie Elhadad

Abstract: We present a method for lexical simplification. Simplification rules are learned from a comparable corpus, and the rules are applied in a context-aware fashion to input sentences. Our method is unsupervised. Furthermore, it does not require any alignment or correspondence among the complex and simple corpora. We evaluate the simplification according to three criteria: preservation of grammaticality, preservation of meaning, and degree of simplification. Results show that our method outperforms an established simplification baseline for both meaning preservation and simplification, while maintaining a high level of grammaticality.

3 0.15495156 337 acl-2011-Wikipedia Revision Toolkit: Efficiently Accessing Wikipedias Edit History

Author: Oliver Ferschke ; Torsten Zesch ; Iryna Gurevych

Abstract: We present an open-source toolkit which allows (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of provided data. By using a dedicated storage format, our toolkit massively decreases the data volume to less than 2% of the original size, and at the same time provides an easy-to-use interface to access the revision data. The language-independent design allows to process any language represented in Wikipedia. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia’s edit history.

4 0.12228358 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?

Author: Jinxi Xu ; Jinying Chen

Abstract: Word alignment is a central problem in statistical machine translation (SMT). In recent years, supervised alignment algorithms, which improve alignment accuracy by mimicking human alignment, have attracted a great deal of attention. The objective of this work is to explore the performance limit of supervised alignment under the current SMT paradigm. Our experiments used a manually aligned Chinese-English corpus with 280K words recently released by the Linguistic Data Consortium (LDC). We treated the human alignment as the oracle of supervised alignment. The result is surprising: the gain of human alignment over a state of the art unsupervised method (GIZA++) is less than 1 point in BLEU. Furthermore, we showed the benefit of improved alignment becomes smaller with more training data, implying the above limit also holds for large training conditions.

5 0.1058901 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

Author: Emmanuel Prochasson ; Pascale Fung

Abstract: We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and test on a different pair of languages, obtaining a F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low resources languages without training data.

6 0.098968647 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation

7 0.08669319 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories

8 0.084672078 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation

9 0.082242347 213 acl-2011-Local and Global Algorithms for Disambiguation to Wikipedia

10 0.079277992 130 acl-2011-Extracting Comparative Entities and Predicates from Texts Using Comparative Type Classification

11 0.078645505 16 acl-2011-A Joint Sequence Translation Model with Integrated Reordering

12 0.078391686 285 acl-2011-Simple supervised document geolocation with geodesic grids

13 0.077160142 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

14 0.071791999 110 acl-2011-Effective Use of Function Words for Rule Generalization in Forest-Based Translation

15 0.070529424 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules

16 0.069279596 280 acl-2011-Sentence Ordering Driven by Local and Global Coherence for Summary Generation

17 0.06899371 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment

18 0.067751028 52 acl-2011-Automatic Labelling of Topic Models

19 0.065577552 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

20 0.062979244 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.186), (1, -0.056), (2, 0.003), (3, 0.128), (4, 0.023), (5, 0.006), (6, 0.036), (7, 0.03), (8, -0.106), (9, -0.055), (10, -0.036), (11, 0.041), (12, -0.041), (13, -0.035), (14, -0.028), (15, 0.047), (16, 0.291), (17, -0.026), (18, -0.037), (19, -0.186), (20, 0.069), (21, -0.196), (22, -0.174), (23, -0.233), (24, 0.307), (25, -0.032), (26, 0.072), (27, -0.091), (28, 0.163), (29, -0.058), (30, 0.004), (31, -0.102), (32, -0.08), (33, -0.049), (34, -0.242), (35, -0.169), (36, 0.121), (37, 0.024), (38, -0.074), (39, -0.077), (40, 0.113), (41, -0.2), (42, -0.042), (43, -0.113), (44, -0.008), (45, -0.048), (46, 0.063), (47, 0.001), (48, -0.152), (49, -0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94952428 283 acl-2011-Simple English Wikipedia: A New Text Simplification Task

Author: William Coster ; David Kauchak

Abstract: In this paper we examine the task of sentence simplification which aims to reduce the reading complexity of a sentence by incorporating more accessible vocabulary and sentence structure. We introduce a new data set that pairs English Wikipedia with Simple English Wikipedia and is orders of magnitude larger than any previously examined for sentence simplification. The data contains the full range of simplification operations including rewording, reordering, insertion and deletion. We provide an analysis of this corpus as well as preliminary results using a phrase-based translation approach for simplification.

2 0.92792386 254 acl-2011-Putting it Simply: a Context-Aware Approach to Lexical Simplification

Author: Or Biran ; Samuel Brody ; Noemie Elhadad

Abstract: We present a method for lexical simplification. Simplification rules are learned from a comparable corpus, and the rules are applied in a context-aware fashion to input sentences. Our method is unsupervised. Furthermore, it does not require any alignment or correspondence among the complex and simple corpora. We evaluate the simplification according to three criteria: preservation of grammaticality, preservation of meaning, and degree of simplification. Results show that our method outperforms an established simplification baseline for both meaning preservation and simplification, while maintaining a high level of grammaticality.

3 0.56560111 337 acl-2011-Wikipedia Revision Toolkit: Efficiently Accessing Wikipedias Edit History

Author: Oliver Ferschke ; Torsten Zesch ; Iryna Gurevych

Abstract: We present an open-source toolkit which allows (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of provided data. By using a dedicated storage format, our toolkit massively decreases the data volume to less than 2% of the original size, and at the same time provides an easy-to-use interface to access the revision data. The language-independent design allows to process any language represented in Wikipedia. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia’s edit history.

4 0.45369482 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

Author: Manoj Harpalani ; Michael Hart ; Sandesh Signh ; Rob Johnson ; Yejin Choi

Abstract: Community-based knowledge forums, such as Wikipedia, are susceptible to vandalism, i.e., ill-intentioned contributions that are detrimental to the quality of collective intelligence. Most previous work to date relies on shallow lexico-syntactic patterns and metadata to automatically detect vandalism in Wikipedia. In this paper, we explore more linguistically motivated approaches to vandalism detection. In particular, we hypothesize that textual vandalism constitutes a unique genre where a group of people share a similar linguistic behavior. Experimental results suggest that (1) statistical models give evidence to unique language styles in vandalism, and that (2) deep syntactic patterns based on probabilistic context free grammars (PCFG) discriminate vandalism more effectively than shallow lexico-syntactic patterns based on n-grams.

5 0.44664657 130 acl-2011-Extracting Comparative Entities and Predicates from Texts Using Comparative Type Classification

Author: Seon Yang ; Youngjoong Ko

Abstract: The automatic extraction of comparative information is an important text mining problem and an area of increasing interest. In this paper, we study how to build a Korean comparison mining system. Our work is composed of two consecutive tasks: 1) classifying comparative sentences into different types and 2) mining comparative entities and predicates. We perform various experiments to find relevant features and learning techniques. As a result, we achieve outstanding performance enough for practical use.

6 0.3963725 213 acl-2011-Local and Global Algorithms for Disambiguation to Wikipedia

7 0.32514322 285 acl-2011-Simple supervised document geolocation with geodesic grids

8 0.30029574 84 acl-2011-Contrasting Opposing Views of News Articles on Contentious Issues

9 0.28829032 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis

10 0.28485778 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?

11 0.28342068 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis

12 0.27639684 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

13 0.27446666 69 acl-2011-Clause Restructuring For SMT Not Absolutely Helpful

14 0.27042553 280 acl-2011-Sentence Ordering Driven by Local and Global Coherence for Summary Generation

15 0.26999933 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories

16 0.26744273 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature

17 0.25917977 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment

18 0.25466019 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining

19 0.2497422 76 acl-2011-Comparative News Summarization Using Linear Programming

20 0.24621172 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.037), (16, 0.251), (17, 0.036), (24, 0.011), (26, 0.02), (37, 0.093), (39, 0.049), (41, 0.059), (53, 0.017), (55, 0.022), (59, 0.038), (72, 0.042), (91, 0.053), (96, 0.171), (97, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91472483 291 acl-2011-SystemT: A Declarative Information Extraction System

Author: Yunyao Li ; Frederick Reiss ; Laura Chiticariu

Abstract: Emerging text-intensive enterprise applications such as social analytics and semantic search pose new challenges of scalability and usability to Information Extraction (IE) systems. This paper presents SystemT, a declarative IE system that addresses these challenges and has been deployed in a wide range of enterprise applications. SystemT facilitates the development of high quality complex annotators by providing a highly expressive language and an advanced development environment. It also includes a cost-based optimizer and a high-performance, flexible runtime with minimum memory footprint. We present SystemT as a useful resource that is freely available, and as an opportunity to promote research in building scalable and usable IE systems.

2 0.8560971 9 acl-2011-A Cross-Lingual ILP Solution to Zero Anaphora Resolution

Author: Ryu Iida ; Massimo Poesio

Abstract: We present an ILP-based model of zero anaphora detection and resolution that builds on the joint determination of anaphoricity and coreference model proposed by Denis and Baldridge (2007), but revises it and extends it into a three-way ILP problem also incorporating subject detection. We show that this new model outperforms several baselines and competing models, as well as a direct translation of the Denis / Baldridge model, for both Italian and Japanese zero anaphora. We incorporate our model in complete anaphoric resolvers for both Italian and Japanese, showing that our approach leads to improved performance also when not used in isolation, provided that separate classifiers are used for zeros and for explicitly realized anaphors.

same-paper 3 0.80482614 283 acl-2011-Simple English Wikipedia: A New Text Simplification Task

Author: William Coster ; David Kauchak

Abstract: In this paper we examine the task of sentence simplification which aims to reduce the reading complexity of a sentence by incorporating more accessible vocabulary and sentence structure. We introduce a new data set that pairs English Wikipedia with Simple English Wikipedia and is orders of magnitude larger than any previously examined for sentence simplification. The data contains the full range of simplification operations including rewording, reordering, insertion and deletion. We provide an analysis of this corpus as well as preliminary results using a phrase-based translation approach for simplification.

4 0.7955879 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text

Author: Dirk Hovy ; Chunliang Zhang ; Eduard Hovy ; Anselmo Penas

Abstract: Learning by Reading (LbR) aims at enabling machines to acquire knowledge from and reason about textual input. This requires knowledge about the domain structure (such as entities, classes, and actions) in order to do inference. We present a method to infer this implicit knowledge from unlabeled text. Unlike previous approaches, we use automatically extracted classes with a probability distribution over entities to allow for context-sensitive labeling. From a corpus of 1.4m sentences, we learn about 250k simple propositions about American football in the form of predicate-argument structures like “quarterbacks throw passes to receivers”. Using several statistical measures, we show that our model is able to generalize and explain the data statistically significantly better than various baseline approaches. Human subjects judged up to 96.6% of the resulting propositions to be sensible. The classes and probabilistic model can be used in textual enrichment to improve the performance of LbR end-to-end systems.

5 0.7571218 254 acl-2011-Putting it Simply: a Context-Aware Approach to Lexical Simplification

Author: Or Biran ; Samuel Brody ; Noemie Elhadad

Abstract: We present a method for lexical simplification. Simplification rules are learned from a comparable corpus, and the rules are applied in a context-aware fashion to input sentences. Our method is unsupervised. Furthermore, it does not require any alignment or correspondence among the complex and simple corpora. We evaluate the simplification according to three criteria: preservation of grammaticality, preservation of meaning, and degree of simplification. Results show that our method outperforms an established simplification baseline for both meaning preservation and simplification, while maintaining a high level of grammaticality.

6 0.67471564 23 acl-2011-A Pronoun Anaphora Resolution System based on Factorial Hidden Markov Models

7 0.66060877 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

8 0.66053343 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

9 0.65974909 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

10 0.65850115 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations

11 0.65847802 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

12 0.65826684 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

13 0.65740371 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

14 0.65733433 133 acl-2011-Extracting Social Power Relationships from Natural Language

15 0.6572383 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

16 0.65723604 117 acl-2011-Entity Set Expansion using Topic information

17 0.65702105 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

18 0.65686125 28 acl-2011-A Statistical Tree Annotator and Its Applications

19 0.65659773 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters

20 0.65599155 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning