acl acl2011 acl2011-254 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Or Biran ; Samuel Brody ; Noemie Elhadad
Abstract: We present a method for lexical simplification. Simplification rules are learned from a comparable corpus, and the rules are applied in a context-aware fashion to input sentences. Our method is unsupervised. Furthermore, it does not require any alignment or correspondence among the complex and simple corpora. We evaluate the simplification according to three criteria: preservation of grammaticality, preservation of meaning, and degree of simplification. Results show that our method outperforms an established simplification baseline for both meaning preservation and simplification, while maintaining a high level of grammaticality.
Reference: text
sentIndex sentText sentNum sentScore
1 Simplification rules are learned from a comparable corpus, and the rules are applied in a context-aware fashion to input sentences. [sent-6, score-0.201]
2 We evaluate the simplification according to three criteria: preservation of grammaticality, preservation of meaning, and degree of simplification. [sent-9, score-0.99]
3 Results show that our method outperforms an established simplification baseline for both meaning preservation and simplification, while maintaining a high level of grammaticality. [sent-10, score-0.954]
4 1 Introduction The task of simplification consists of editing an input text into a version that is less complex linguistically or more readable. [sent-11, score-0.805]
5 Automated sentence simplification has been investigated mostly as a preprocessing step with the goal of improving NLP tasks, such as parsing (Chandrasekar et al. [sent-12, score-0.821]
6 Automated simplification can also be considered as a way to help end users access relevant information, which would be too complex to understand if left unedited. [sent-16, score-0.773]
7 As such, it was proposed as a tool for adults with aphasia (Carroll et al. [sent-17, score-0.055]
8 , 2004), readers with low-literacy skills (Williams and Reiter, 2005), individuals with intellectual disabilities (Huenerfauth et al. [sent-19, score-0.079]
9 In this paper, we present a sentence simplification approach, which focuses on lexical simplification. [sent-23, score-0.821]
10 The key contributions of our work are (i) an unsupervised method for learning pairs of complex and simpler synonyms; and (ii) a context-aware method for substituting one for the other. [sent-24, score-0.178]
11 The word magnate is determined as a candidate for simplification. [sent-26, score-0.138]
12 Two learned rules are available to the simplification system (substitute magnate with king or with businessman). [sent-27, score-0.861]
13 In the context of this sentence, the second rule is selected, resulting in the simpler output sentence. [sent-28, score-0.169]
14 Our method contributes to research on lexical simplification (both learning of rules and actual sentence simplification), a topic little investigated thus far. [sent-29, score-0.92]
15 From a technical perspective, the task of lexical simplification bears similarity with that of paraphrasing; our resulting system is available at http://www. [sent-30, score-0.838]
16 Napoles and Dredze (2010) examined Wikipedia Simple articles looking for features that characterize a simple text, with the hope of informing research in automatic simplification methods. [sent-38, score-0.81]
17 Yatskar et al. (2010) learn lexical simplification rules from the edit histories of Wikipedia Simple articles. [sent-40, score-0.869]
18 Our method differs from theirs, as we rely on the two corpora as a whole, and do not require any aligned or designated simple/complex sentences when learning simplification rules. [sent-41, score-0.732]
19 SEW is a Wikipedia project providing articles in Simple English, a version of English which uses fewer words and easier grammar, and which aims to be easier to read for children, people who are learning English and people with learning difficulties. [sent-43, score-0.11]
20 Due to the labor involved in simplifying Wikipedia articles, only about 2% of the EW articles have been simplified. [sent-44, score-0.044]
21 Rather, we leverage SEW only as an example of an in-domain simple corpus, in order to extract word frequency estimates. [sent-46, score-0.084]
22 In practice, this means that our method is suitable for other cases where there exists a simplified corpus in the same domain. [sent-50, score-0.109]
23 The articles were preprocessed as follows: all comments, HTML tags, and Wiki links were removed. [sent-53, score-0.044]
24 Aligning sentences in monolingual comparable corpora has been investigated (Barzilay and Elhadad, 2003; Nelken and Shieber, 2006), but is not a focus of this work. [sent-58, score-0.109]
25 Further preprocessing was carried out with the Stanford NLP Package to tokenize the text, transform all words to lower case, and identify sentence boundaries. [sent-60, score-0.057]
26 3 Method Our sentence simplification system consists of two main stages: rule extraction and simplification. [sent-61, score-0.862]
27 In the first stage, simplification rules are extracted from the corpora. [sent-62, score-0.799]
28 Each rule consists of an ordered word pair {original → simplified} along with a score indicating the similarity between the two words. [sent-63, score-0.175]
29 In the second stage, the system decides whether to apply a rule (i.e. [sent-64, score-0.073]
30 , transform the original word into the simplified one), based on the contextual information. [sent-66, score-0.193]
31 For each candidate word w, we constructed a context vector CVw, containing co-occurrence information within a 10-token window. [sent-71, score-0.119]
32 Each dimension i in the vector corresponds to a single word wi in the vocabulary, and a single dimension was added to represent any number token. [sent-72, score-0.104]
33 The value in each dimension CVw[i] of the vector was the number of occurrences of the corresponding word wi within a ten-token window surrounding an instance of the candidate word w. [sent-73, score-0.142]
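A minimal sketch of this context-vector construction, assuming tokenized, lower-cased sentences as produced by the preprocessing above; the numeric-token test and the exact windowing convention (here, ten tokens on each side) are simplifying assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict

def build_context_vectors(sentences, window=10):
    # sentences: iterable of token lists (already tokenized and lower-cased)
    # returns: word -> {context word -> co-occurrence count}
    vectors = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        # collapse all number tokens into a single dimension, as described above
        norm = ["<num>" if t.replace(".", "", 1).isdigit() else t for t in tokens]
        for i, w in enumerate(norm):
            lo, hi = max(0, i - window), min(len(norm), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[w][norm[j]] += 1
    return vectors
```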
34 From all possible word pairs (the Cartesian product of all words in the corpus vocabulary), we first remove pairs of morphological variants. [sent-76, score-0.124]
35 We also prune pairs where one word is a prefix of the other and the suffix is in {s, es, ed, ly, er, ing}. [sent-78, score-0.103]
36 … its first sense (as listed in WordNet) is a synonym or hypernym of the first. [sent-87, score-0.091]
37 Finally, we compute the cosine similarity scores for the remaining pairs using their context vectors. [sent-88, score-0.207]
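The pair-pruning and similarity steps could look roughly like the sketch below. The WordNet check is one hedged reading of the (truncated) first-sense constraint above, written against NLTK's WordNet interface rather than whatever the authors used; the cosine operates on the sparse count vectors from the previous sketch.

```python
import math
from nltk.corpus import wordnet as wn

SUFFIXES = {"s", "es", "ed", "ly", "er", "ing"}

def is_morph_variant(a, b):
    # prune pairs where one word is a prefix of the other and the leftover
    # suffix is one of the listed endings (e.g., walk / walked)
    return any(y.startswith(x) and y[len(x):] in SUFFIXES
               for x, y in ((a, b), (b, a)))

def first_sense_related(original, candidate):
    # hedged reading of the WordNet constraint: the candidate must appear among
    # the lemmas of the original word's first sense or of that sense's hypernyms
    senses = wn.synsets(original)
    if not senses:
        return False
    related = set(senses[0].lemma_names())
    for hyper in senses[0].hypernyms():
        related.update(hyper.lemma_names())
    return candidate in related

def cosine(u, v):
    # u, v: sparse vectors as {dimension: count} dictionaries
    dot = sum(c * v.get(k, 0) for k, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```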
38 2 Ensuring Simplification From among our remaining candidate word pairs, we want to identify those that represent a complex word which can be replaced by a simpler one. [sent-91, score-0.198]
39 Our definition of the complexity of a word is based on two measures: the corpus complexity and the lexical complexity. [sent-92, score-0.192]
40 Specifically, we define the corpus complexity of a word as Cw = fw,English / fw,Simple, where fw,c is the frequency of word w in corpus c, and the lexical complexity as Lw = |w|, the length of the word. [sent-93, score-0.21]
41 The final complexity χw for the word is given by the product of the two. [sent-94, score-0.066]
42 χw = Cw × Lw. After calculating the complexity of all words participating in the word pairs, we discard the pairs for which the first word’s complexity is lower than that of the second. [sent-95, score-0.208]
43 The remaining pairs constitute the final list of substitution candidates. [sent-96, score-0.094]
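The complexity filter can be written down directly from the definitions above. The add-one smoothing is an assumption introduced here to avoid division by zero for unseen words; it is not part of the original formulation.

```python
def corpus_complexity(word, freq_english, freq_simple):
    # Cw = f(w, English) / f(w, Simple); the +1 smoothing is an added assumption
    return (freq_english.get(word, 0) + 1.0) / (freq_simple.get(word, 0) + 1.0)

def final_complexity(word, freq_english, freq_simple):
    # chi_w = Cw * Lw, with Lw = |w| (word length)
    return corpus_complexity(word, freq_english, freq_simple) * len(word)

def keep_simplifying_pairs(pairs, freq_english, freq_simple):
    # keep only pairs whose first (original) word is strictly more complex
    # than the second (candidate substitute)
    return [(orig, sub) for orig, sub in pairs
            if final_complexity(orig, freq_english, freq_simple)
               > final_complexity(sub, freq_english, freq_simple)]
```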
44 3 Ensuring Grammaticality To ensure that our simplification substitutions maintain the grammaticality of the original sentence, we generate grammatically consistent rules from the substitution candidate list. [sent-99, score-1.14]
45 For each candidate pair (original, simplified), we generate all consistent forms (fi(original), fi(substitute)) of the two words using MorphAdorner. [sent-100, score-0.048]
46 For example, the word pair (stride, walk) will generate the form pairs (stride, walk), (striding, walking), (strode, walked) and (strides, walks). [sent-102, score-0.076]
47 Rather than attempting explicit disambiguation and adding complexity to the model, we rely on the first sense heuristic, which is known to be very strong, along with contextual information, as described in Section 3. [sent-104, score-0.117]
48 … exactly the same list of form pairs, eliminating the original ungrammatical pair. [sent-106, score-0.056]
49 Finally, each pair (fi(original), fi(substitute)) becomes a rule {fi(original) → fi(substitute)}, with the weight Similarity(original, substitute). [sent-107, score-0.073]
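A sketch of this rule-generation step follows. MorphAdorner is a Java library, so `inflect` here is a hypothetical stand-in for any morphological generator that maps a word and a form tag to a surface form (or None if the form does not exist); the form tags are illustrative.

```python
def expand_to_form_rules(original, substitute, similarity, inflect):
    # inflect(word, tag) -> surface form or None; hypothetical stand-in for a
    # morphological generator such as MorphAdorner
    rules = {}
    for tag in ("base", "plural", "past", "gerund", "third_singular"):
        f_orig = inflect(original, tag)
        f_sub = inflect(substitute, tag)
        if f_orig and f_sub and f_orig != f_sub:
            # each rule {f(original) -> f(substitute)} carries the weight
            # Similarity(original, substitute) of the underlying word pair
            rules[f_orig] = (f_sub, similarity)
    return rules
```

With a suitable generator, the pair (stride, walk) would yield rules mapping stride → walk, striding → walking, strode → walked and strides → walks, matching the example above.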
50 2 Stage 2: Sentence Simplification Given an input sentence and the set of rules learned in the first stage, this stage determines which words in the sentence should be simplified, and applies the corresponding rules. [sent-109, score-0.271]
51 For example, suppose we have a rule {Han → Chinese}. [sent-111, score-0.073]
52 We would want to apply it to a sentence like “In 1368 Han rebels drove out the Mongols”, but to avoid applying it to a sentence like “The history of the Han ethnic group is closely tied to that of China”. [sent-113, score-0.104]
53 The presence of related words like ethnic and China is a clue that the latter sentence is in a specific, rather than general, context, and therefore a more general and simpler hypernym is unsuitable. [sent-114, score-0.241]
54 To identify such cases, we calculate the similarity between the target word (the candidate for replacement) and the input sentence as a whole. [sent-115, score-0.275]
55 If this similarity is too high, it might be better not to simplify the original word. [sent-116, score-0.201]
56 We wish to detect and avoid cases where a word appears in the sentence with a different sense than the one originally considered when creating the simplification rule. [sent-118, score-0.817]
57 For this purpose, we examine the similarity between the rule as a whole (including both the original and the substitute words, and their associated context vectors) and the context of the input sentence. [sent-119, score-0.398]
58 If the similarity is high, it is likely that the original word in the sentence and the rule refer to the same sense. [sent-120, score-0.288]
59 1 Simplification Procedure Both factors described above require sufficient context in the input sentence. [sent-123, score-0.075]
60 Therefore, our system does not attempt to simplify sentences with fewer than seven content words. [sent-124, score-0.071]
61 Table 1: Average scores in three categories: grammaticality (Gram. [sent-136, score-0.163]
62 For grammaticality, we show the percent of examples judged as good, with the percent judged ok in parentheses. [sent-140, score-0.12]
63 For all other sentences, each content word is examined in order, ignoring words inside quotation marks or parentheses. [sent-141, score-0.062]
64 For each word w, the set of relevant simplification rules {w → x} is retrieved. [sent-142, score-0.827]
65 For each such rule {w → x}, unless the replacement word x already appears in the sentence, our system does the following: • Build the vector of sentence context SCVs,w in a similar manner to that described in Section 3. [sent-143, score-0.146]
66 • Create a common context vector CCVw,x for the rule {w → x}. [sent-148, score-0.043]
67 • We calculate the cosine similarity of the common context vector and the sentence context vector: ContextSim = cosine(CCVw,x, SCVs,w). If the context similarity is larger than a threshold (0.…), the rule is applied. [sent-152, score-0.412]
68 If multiple rules apply for the same word, we use the one with the highest context similarity. [sent-154, score-0.11]
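Putting the second stage together, a simplified sketch might look like the following. It omits the content-word filter, the quotation/parenthesis exclusion and the seven-content-word minimum described above; the threshold value is not reproduced here and is passed in as a parameter, and the sentence context vector is built with a simple window rather than whatever the authors used.

```python
import math
from collections import Counter

def sparse_cosine(u, v):
    dot = sum(c * v.get(k, 0) for k, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def sentence_context_vector(tokens, idx, window=10):
    # co-occurrence counts of the tokens around position idx, built in the
    # same spirit as the corpus context vectors of Stage 1
    lo, hi = max(0, idx - window), min(len(tokens), idx + window + 1)
    return Counter(t for j, t in enumerate(tokens) if lo <= j < hi and j != idx)

def simplify_sentence(tokens, rules, threshold):
    # rules: original form -> list of (substitute form, common context vector)
    out = list(tokens)
    for i, w in enumerate(tokens):
        scv = sentence_context_vector(tokens, i)
        best = None
        for substitute, ccv in rules.get(w, []):
            if substitute in tokens:          # skip if x already in the sentence
                continue
            sim = sparse_cosine(ccv, scv)     # ContextSim = cosine(CCV, SCV)
            if sim > threshold and (best is None or sim > best[1]):
                best = (substitute, sim)      # highest-similarity rule wins
        if best:
            out[i] = best[0]
    return out
```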
69 4 Experimental Setup Baseline We employ the method of Devlin and Unthank (2006) which replaces a word with its most frequent synonym (presumed to be the simplest) as our baseline. [sent-155, score-0.078]
70 To provide a fairer comparison to our system, we add the restriction that the synonyms should not share a prefix of four or more letters (a baseline version of lemmatization) and use MorphAdorner to produce a form that agrees with that of the original word. [sent-156, score-0.118]
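For comparison, a rough sketch of the frequency-based baseline in the spirit of Devlin and Unthank (2006): replace a word with its most frequent WordNet synonym, skipping synonyms that share a four-letter prefix with the original. Using NLTK lemma counts as the frequency estimate, and leaving out the MorphAdorner-based inflection agreement, are simplifications made here for illustration.

```python
from nltk.corpus import wordnet as wn

def most_frequent_synonym(word):
    # returns the most frequent WordNet synonym of `word`, or the word itself
    # if no acceptable synonym is found
    best, best_count = word, -1
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ")
            if name.lower() == word.lower():
                continue
            # skip synonyms sharing a prefix of four or more letters,
            # mirroring the comparison restriction described above
            if len(name) >= 4 and len(word) >= 4 and name[:4].lower() == word[:4].lower():
                continue
            if lemma.count() > best_count:
                best, best_count = name, lemma.count()
    return best
```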
71 Evaluation Dataset We sampled simplification examples for manual evaluation with the following criteria. [sent-168, score-0.732]
72 Among all sentences in English Wikipedia, we first extracted those where our system chose to simplify exactly one word, to provide a straightforward example for the human judges. [sent-169, score-0.071]
73 Of these, we chose the sentences where the baseline could also be used to simplify the target word (i.e. [sent-170, score-0.134]
74 , the word had a more frequent synonym), and the baseline replacement was different from the system choice. [sent-172, score-0.109]
75 Each was simplified by our system and the baseline, resulting in 130 simplification examples (consisting of an original and a simplified sentence). [sent-175, score-1.006]
76 Frequency Bands Although we included only a single example of each rule, some rules could be applied much more frequently than others, as the words and associated contexts were common in the dataset. [sent-176, score-0.067]
77 Since this factor strongly influences the utility of the system, we examined the performance along different frequency bands. [sent-177, score-0.123]
78 We split the evaluation dataset into three frequency bands of roughly equal size, resulting in 46 high, 44 med and 40 low. [sent-178, score-0.132]
79 Judgment Guidelines We divided the simplification examples among three annotators and ensured that no annotator saw both the system and baseline examples for the same sentence. [sent-179, score-0.795]
80 A small portion of the sentence pairs were duplicated among annotators to calculate pairwise interannotator agreement. [sent-181, score-0.169]
81 Our method is quantitatively better than the baseline at both grammaticality and meaning preservation, although the difference is not statistically significant. [sent-193, score-0.256]
82 For simplification, our method significantly (p < 0.001) outperforms the baseline, which represents the established simplification strategy of substituting a word with its most frequent WordNet synonym. [sent-195, score-0.064]
83 The results demonstrate the value of correctly representing and addressing content when attempting automatic simplification. [sent-196, score-0.051]
84 Table 2 contains the results for each of the frequency bands. [sent-197, score-0.056]
85 Grammaticality is not strongly influenced by frequency, and remains between 80% and 85% for both the baseline and our system (considering the ok judgment as positive). [sent-198, score-0.128]
86 This is not surprising, since the method for ensuring grammaticality is largely independent of context, and relies mostly on a morphological engine. [sent-199, score-0.215]
87 Simplification varies somewhat with frequency, with the best results for the medium frequency band. [sent-200, score-0.093]
88 The most noticeable effect is for preservation of meaning. [sent-202, score-0.129]
89 Here, the performance of the system (and the baseline) is the best for the medium frequency group. [sent-203, score-0.093]
90 However, the performance drops significantly for the low frequency band. [sent-204, score-0.056]
91 Since there are few examples from which to learn, the system is unable to effectively distinguish between different contexts and meanings of the word being simplified, and applies the simplification rule incorrectly. [sent-206, score-0.833]
92 These results indicate our system can be effectively used for simplification of words that occur frequently in the domain. [sent-207, score-0.732]
93 In many scenarios, these are precisely the cases where simplification is most desirable. [sent-208, score-0.732]
94 For rare words, it may be advisable to maintain the more complex form, to ensure that the meaning is preserved. [sent-209, score-0.127]
95 Query expansion, lexical simplification, and sentence selection strategies for multi-document summarization. [sent-224, score-0.089]
96 Practical simplification of English newspaper text to assist aphasic readers. [sent-229, score-0.086]
97 Automatic sentence simplification for subtitling in Dutch and English. [sent-241, score-0.789]
98 Extracting lay paraphrases of specialized expressions from monolingual comparable medical corpora. [sent-247, score-0.153]
99 Comparing evaluation techniques for text readability software for adults with intellectual disabilities. [sent-269, score-0.134]
100 Towards effective sentence simplification for automatic processing of biomedical text. [sent-275, score-0.822]
wordName wordTfidf (topN-words)
[('simplification', 0.732), ('grammaticality', 0.163), ('sew', 0.155), ('preservation', 0.129), ('cvw', 0.124), ('simplified', 0.109), ('wikipedia', 0.101), ('devlin', 0.095), ('stride', 0.093), ('elhadad', 0.077), ('substitute', 0.077), ('bands', 0.076), ('noemie', 0.076), ('similarity', 0.074), ('rule', 0.073), ('fi', 0.072), ('simplify', 0.071), ('rules', 0.067), ('complexity', 0.066), ('blake', 0.062), ('emie', 0.062), ('huenerfauth', 0.062), ('magnate', 0.062), ('unthank', 0.062), ('ok', 0.06), ('ew', 0.058), ('meaning', 0.058), ('stage', 0.058), ('sentence', 0.057), ('original', 0.056), ('frequency', 0.056), ('adults', 0.055), ('aphasic', 0.055), ('chandrasekar', 0.055), ('jonnalagadda', 0.055), ('napoles', 0.055), ('siobhan', 0.055), ('vickrey', 0.055), ('simpler', 0.053), ('ensuring', 0.052), ('attempting', 0.051), ('walked', 0.05), ('androutsopoulos', 0.05), ('intellectual', 0.05), ('nelken', 0.05), ('synonym', 0.05), ('wordnet', 0.049), ('pairs', 0.048), ('candidate', 0.048), ('ethnic', 0.047), ('yatskar', 0.047), ('replacement', 0.046), ('substitution', 0.046), ('ger', 0.045), ('siddharthan', 0.045), ('articles', 0.044), ('context', 0.043), ('columbia', 0.043), ('williams', 0.043), ('lay', 0.043), ('cosine', 0.042), ('monolingual', 0.042), ('hypernym', 0.041), ('complex', 0.041), ('mccarthy', 0.04), ('daelemans', 0.039), ('dimension', 0.038), ('cw', 0.038), ('carroll', 0.038), ('histories', 0.038), ('lw', 0.038), ('medium', 0.037), ('calculate', 0.036), ('walk', 0.036), ('substituting', 0.036), ('comparable', 0.035), ('baseline', 0.035), ('han', 0.035), ('lemmatization', 0.034), ('examined', 0.034), ('judgment', 0.033), ('influences', 0.033), ('medical', 0.033), ('people', 0.033), ('biomedical', 0.033), ('investigated', 0.032), ('input', 0.032), ('lexical', 0.032), ('english', 0.031), ('percent', 0.03), ('readability', 0.029), ('readers', 0.029), ('del', 0.029), ('fellbaum', 0.028), ('word', 0.028), ('annotators', 0.028), ('ensure', 0.028), ('prefix', 0.027), ('vte', 0.027), ('prodromos', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 254 acl-2011-Putting it Simply: a Context-Aware Approach to Lexical Simplification
Author: Or Biran ; Samuel Brody ; Noemie Elhadad
Abstract: We present a method for lexical simplification. Simplification rules are learned from a comparable corpus, and the rules are applied in a context-aware fashion to input sentences. Our method is unsupervised. Furthermore, it does not require any alignment or correspondence among the complex and simple corpora. We evaluate the simplification according to three criteria: preservation of grammaticality, preservation of meaning, and degree of simplification. Results show that our method outperforms an established simplification baseline for both meaning preservation and simplification, while maintaining a high level of grammaticality.
2 0.627666 283 acl-2011-Simple English Wikipedia: A New Text Simplification Task
Author: William Coster ; David Kauchak
Abstract: In this paper we examine the task of sentence simplification which aims to reduce the reading complexity of a sentence by incorporating more accessible vocabulary and sentence structure. We introduce a new data set that pairs English Wikipedia with Simple English Wikipedia and is orders of magnitude larger than any previously examined for sentence simplification. The data contains the full range of simplification operations including rewording, reordering, insertion and deletion. We provide an analysis of this corpus as well as preliminary results using a phrase-based translation approach for simplification.
3 0.089141294 130 acl-2011-Extracting Comparative Entities and Predicates from Texts Using Comparative Type Classification
Author: Seon Yang ; Youngjoong Ko
Abstract: The automatic extraction of comparative information is an important text mining problem and an area of increasing interest. In this paper, we study how to build a Korean comparison mining system. Our work is composed of two consecutive tasks: 1) classifying comparative sentences into different types and 2) mining comparative entities and predicates. We perform various experiments to find relevant features and learning techniques. As a result, we achieve outstanding performance enough for practical use. 1
4 0.083500557 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents
Author: Emmanuel Prochasson ; Pascale Fung
Abstract: We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1 to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and test on a different pair of languages, obtaining a F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low resources languages without training data.
5 0.069992363 188 acl-2011-Judging Grammaticality with Tree Substitution Grammar Derivations
Author: Matt Post
Abstract: In this paper, we show that local features computed from the derivations of tree substitution grammars such as the identity of particular fragments, and a count of large and small fragments are useful in binary grammatical classification tasks. Such features outperform n-gram features and various model scores by a wide margin. Although they fall short of the performance of the hand-crafted feature set of Charniak and Johnson (2005) developed for parse tree reranking, they do so with an order of magnitude fewer features. Furthermore, since the TSGs employed are learned in a Bayesian setting, the use of their derivations can be viewed as the automatic discovery of tree patterns useful for classification. On the BLLIP dataset, we achieve an accuracy of 89.9% in discriminating between grammatical text and samples from an n-gram language model.
6 0.068029776 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation
7 0.065601163 337 acl-2011-Wikipedia Revision Toolkit: Efficiently Accessing Wikipedias Edit History
8 0.062582873 11 acl-2011-A Fast and Accurate Method for Approximate String Search
9 0.062074963 132 acl-2011-Extracting Paraphrases from Definition Sentences on the Web
10 0.061704993 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
11 0.060874466 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
12 0.06051816 52 acl-2011-Automatic Labelling of Topic Models
13 0.059925415 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation
14 0.059736855 268 acl-2011-Rule Markov Models for Fast Tree-to-String Translation
15 0.058738388 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules
16 0.05597005 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques
17 0.05265912 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation
18 0.051229432 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories
19 0.051175974 44 acl-2011-An exponential translation model for target language morphology
20 0.05070908 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
topicId topicWeight
[(0, 0.167), (1, -0.009), (2, -0.023), (3, 0.074), (4, -0.013), (5, -0.006), (6, 0.039), (7, 0.017), (8, -0.089), (9, -0.09), (10, -0.086), (11, 0.007), (12, -0.023), (13, 0.018), (14, 0.005), (15, 0.009), (16, 0.257), (17, -0.023), (18, -0.007), (19, -0.16), (20, 0.096), (21, -0.191), (22, -0.156), (23, -0.222), (24, 0.278), (25, -0.035), (26, 0.05), (27, -0.093), (28, 0.15), (29, -0.05), (30, 0.014), (31, -0.112), (32, -0.094), (33, -0.03), (34, -0.285), (35, -0.152), (36, 0.139), (37, 0.015), (38, -0.062), (39, -0.05), (40, 0.123), (41, -0.214), (42, -0.062), (43, -0.136), (44, -0.002), (45, -0.085), (46, 0.089), (47, 0.028), (48, -0.147), (49, 0.002)]
simIndex simValue paperId paperTitle
same-paper 1 0.94793946 254 acl-2011-Putting it Simply: a Context-Aware Approach to Lexical Simplification
Author: Or Biran ; Samuel Brody ; Noemie Elhadad
Abstract: We present a method for lexical simplification. Simplification rules are learned from a comparable corpus, and the rules are applied in a context-aware fashion to input sentences. Our method is unsupervised. Furthermore, it does not require any alignment or correspondence among the complex and simple corpora. We evaluate the simplification according to three criteria: preservation of grammaticality, preservation of meaning, and degree of simplification. Results show that our method outperforms an established simplification baseline for both meaning preservation and simplification, while maintaining a high level of grammaticality.
2 0.90590584 283 acl-2011-Simple English Wikipedia: A New Text Simplification Task
Author: William Coster ; David Kauchak
Abstract: In this paper we examine the task of sentence simplification which aims to reduce the reading complexity of a sentence by incorporating more accessible vocabulary and sentence structure. We introduce a new data set that pairs English Wikipedia with Simple English Wikipedia and is orders of magnitude larger than any previously examined for sentence simplification. The data contains the full range of simplification operations including rewording, reordering, insertion and deletion. We provide an analysis of this corpus as well as preliminary results using a phrase-based translation approach for simplification.
3 0.49547583 337 acl-2011-Wikipedia Revision Toolkit: Efficiently Accessing Wikipedias Edit History
Author: Oliver Ferschke ; Torsten Zesch ; Iryna Gurevych
Abstract: We present an open-source toolkit which allows (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of provided data. By using a dedicated storage format, our toolkit massively decreases the data volume to less than 2% of the original size, and at the same time provides an easy-to-use interface to access the revision data. The language-independent design allows to process any language represented in Wikipedia. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia’s edit history.
4 0.4900102 130 acl-2011-Extracting Comparative Entities and Predicates from Texts Using Comparative Type Classification
Author: Seon Yang ; Youngjoong Ko
Abstract: The automatic extraction of comparative information is an important text mining problem and an area of increasing interest. In this paper, we study how to build a Korean comparison mining system. Our work is composed of two consecutive tasks: 1) classifying comparative sentences into different types and 2) mining comparative entities and predicates. We perform various experiments to find relevant features and learning techniques. As a result, we achieve outstanding performance enough for practical use. 1
5 0.4089607 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis
Author: Manoj Harpalani ; Michael Hart ; Sandesh Signh ; Rob Johnson ; Yejin Choi
Abstract: Community-based knowledge forums, such as Wikipedia, are susceptible to vandalism, i.e., ill-intentioned contributions that are detrimental to the quality of collective intelligence. Most previous work to date relies on shallow lexico-syntactic patterns and metadata to automatically detect vandalism in Wikipedia. In this paper, we explore more linguistically motivated approaches to vandalism detection. In particular, we hypothesize that textual vandalism constitutes a unique genre where a group of people share a similar linguistic behavior. Experimental results suggest that (1) statistical models give evidence to unique language styles in vandalism, and that (2) deep syntactic patterns based on probabilistic context free grammars (PCFG) discriminate vandalism more effectively than shallow lexico-syntactic patterns based on n-grams.
6 0.33741099 213 acl-2011-Local and Global Algorithms for Disambiguation to Wikipedia
7 0.30132881 84 acl-2011-Contrasting Opposing Views of News Articles on Contentious Issues
8 0.2882615 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining
9 0.28024521 285 acl-2011-Simple supervised document geolocation with geodesic grids
10 0.27732119 229 acl-2011-NULEX: An Open-License Broad Coverage Lexicon
11 0.2688328 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis
12 0.26379576 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis
13 0.26049253 76 acl-2011-Comparative News Summarization Using Linear Programming
14 0.25164077 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge
15 0.2516138 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents
16 0.25029486 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature
18 0.24779923 280 acl-2011-Sentence Ordering Driven by Local and Global Coherence for Summary Generation
19 0.23718877 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
20 0.23518929 132 acl-2011-Extracting Paraphrases from Definition Sentences on the Web
topicId topicWeight
[(5, 0.027), (16, 0.111), (17, 0.062), (24, 0.172), (26, 0.033), (31, 0.01), (37, 0.071), (39, 0.048), (41, 0.049), (53, 0.013), (55, 0.028), (59, 0.054), (72, 0.038), (91, 0.051), (96, 0.129), (97, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.80062628 254 acl-2011-Putting it Simply: a Context-Aware Approach to Lexical Simplification
Author: Or Biran ; Samuel Brody ; Noemie Elhadad
Abstract: We present a method for lexical simplification. Simplification rules are learned from a comparable corpus, and the rules are applied in a context-aware fashion to input sentences. Our method is unsupervised. Furthermore, it does not require any alignment or correspondence among the complex and simple corpora. We evaluate the simplification according to three criteria: preservation of grammaticality, preservation of meaning, and degree of simplification. Results show that our method outperforms an established simplification baseline for both meaning preservation and simplification, while maintaining a high level of grammaticality.
2 0.7731545 291 acl-2011-SystemT: A Declarative Information Extraction System
Author: Yunyao Li ; Frederick Reiss ; Laura Chiticariu
Abstract: Emerging text-intensive enterprise applications such as social analytics and semantic search pose new challenges of scalability and usability to Information Extraction (IE) systems. This paper presents SystemT, a declarative IE system that addresses these challenges and has been deployed in a wide range of enterprise applications. SystemT facilitates the development of high quality complex annotators by providing a highly expressive language and an advanced development environment. It also includes a cost-based optimizer and a high-performance, flexible runtime with minimum memory footprint. We present SystemT as a useful resource that is freely available, and as an opportunity to promote research in building scalable and usable IE systems.
3 0.74036813 9 acl-2011-A Cross-Lingual ILP Solution to Zero Anaphora Resolution
Author: Ryu Iida ; Massimo Poesio
Abstract: We present an ILP-based model of zero anaphora detection and resolution that builds on the joint determination of anaphoricity and coreference model proposed by Denis and Baldridge (2007), but revises it and extends it into a three-way ILP problem also incorporating subject detection. We show that this new model outperforms several baselines and competing models, as well as a direct translation of the Denis / Baldridge model, for both Italian and Japanese zero anaphora. We incorporate our model in complete anaphoric resolvers for both Italian and Japanese, showing that our approach leads to improved performance also when not used in isolation, provided that separate classifiers are used for zeros and for explicitly realized anaphors.
4 0.72650045 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text
Author: Dirk Hovy ; Chunliang Zhang ; Eduard Hovy ; Anselmo Penas
Abstract: Learning by Reading (LbR) aims at enabling machines to acquire knowledge from and reason about textual input. This requires knowledge about the domain structure (such as entities, classes, and actions) in order to do inference. We present a method to infer this implicit knowledge from unlabeled text. Unlike previous approaches, we use automatically extracted classes with a probability distribution over entities to allow for context-sensitive labeling. From a corpus of 1.4m sentences, we learn about 250k simple propositions about American football in the form of predicateargument structures like “quarterbacks throw passes to receivers”. Using several statistical measures, we show that our model is able to generalize and explain the data statistically significantly better than various baseline approaches. Human subjects judged up to 96.6% of the resulting propositions to be sensible. The classes and probabilistic model can be used in textual enrichment to improve the performance of LbR end-to-end systems.
5 0.70789242 283 acl-2011-Simple English Wikipedia: A New Text Simplification Task
Author: William Coster ; David Kauchak
Abstract: In this paper we examine the task of sentence simplification which aims to reduce the reading complexity of a sentence by incorporating more accessible vocabulary and sentence structure. We introduce a new data set that pairs English Wikipedia with Simple English Wikipedia and is orders of magnitude larger than any previously examined for sentence simplification. The data contains the full range of simplification operations including rewording, reordering, insertion and deletion. We provide an analysis of this corpus as well as preliminary results using a phrase-based translation approach for simplification.
6 0.69391882 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing
7 0.66795623 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
8 0.66615361 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
9 0.66580075 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
10 0.66563827 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
11 0.66328955 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
12 0.66204709 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
13 0.66152424 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks
14 0.66006148 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
15 0.66002667 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters
16 0.65911579 178 acl-2011-Interactive Topic Modeling
17 0.6587953 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
18 0.65828347 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
19 0.65759146 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling
20 0.6561799 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning