acl acl2013 acl2013-3 knowledge-graph by maker-knowledge-mining

3 acl-2013-A Comparison of Techniques to Automatically Identify Complex Words.


Source: pdf

Author: Matthew Shardlow

Abstract: Identifying complex words (CWs) is an important, yet often overlooked, task within lexical simplification (the process of automatically replacing CWs with simpler alternatives). If too many words are identified then substitutions may be made erroneously, leading to a loss of meaning. If too few words are identified then those which impede a user’s understanding may be missed, resulting in a complex final text. This paper addresses the task of evaluating different methods for CW identification. A corpus of sentences with annotated CWs is mined from Simple Wikipedia edit histories, which is then used as the basis for several experiments. Firstly, the corpus design is explained and the results of the validation experiments using human judges are reported. Experiments are carried out into the CW identification techniques of: simplifying everything, frequency thresholding and training a support vector machine. These are based upon previous approaches to the task and show that thresholding does not perform significantly differently to the more naïve technique of simplifying everything. The support vector machine achieves a slight increase in precision over the other two methods, but at the cost of a dramatic trade-off in recall.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Identifying complex words (CWs) is an important, yet often overlooked, task within lexical simplification (the process of automatically replacing CWs with simpler alternatives). [sent-5, score-0.518]

2 If too few words are identified then those which impede a user’s understanding may be missed, resulting in a complex final text. [sent-7, score-0.125]

3 Experiments are carried out into the CW identification techniques of: simplifying everything, frequency thresholding and training a support vector machine. [sent-11, score-0.588]

4 These are based upon previous approaches to the task and show that thresholding does not perform significantly differently to the more naïve technique of simplifying everything. [sent-12, score-0.32]

5 The support vector machine achieves a slight increase in precision over the other two methods, but at the cost of a dramatic trade off in recall. [sent-13, score-0.091]

6 1 Introduction Complex Word (CW) identification is an important task at the first stage of lexical simplification and errors introduced or avoided here will affect final results. [sent-14, score-0.52]

7 This work looks at the process of automatically identifying difficult words for a lexical simplification system. [sent-15, score-0.53]

8 Lexical simplification is the task of identifying and replacing CWs in a text to improve the overall understandability and readability. [sent-16, score-0.431]

9 Lexical simplification is just one method of text simplification and is often deployed alongside other simplification methods (Carrol et al. [sent-18, score-1.229]

10 Syntactic simplification, statistical machine translation and semantic simplification (or explanation generation) are all current methods of text simplification. [sent-20, score-0.385]

11 Text simplification is typically deployed as an assistive technology (Devlin and Tait, 1998; Aluísio and Gasperin, 2010), although this is not always the case. [sent-21, score-0.47]

12 Identifying CWs is a task which every lexical simplification system must perform, either explicitly or implicitly, before simplification can take place. [sent-23, score-0.874]

13 We may treat proper nouns as complex (as they may be unfamiliar) or we may choose to discount them from our scheme altogether, as proper nouns are unlikely to have any valid replacements. [sent-27, score-0.192]

14 7 characters per word and has more syllables than any other word. [sent-30, score-0.11]

15 Further, CWs are often identified by their frequency (see Section 2. [sent-31, score-0.165]

16 ‘approximately’ exhibits a much lower frequency than the other words. [sent-34, score-0.133]

17 It is hoped that by providing this, the community will be able to identify and evaluate new techniques using the methods proposed herein. [sent-37, score-0.099]

18 If CW identification is not performed well, then potential candidates may be missed, and simple words may be falsely identified. [sent-38, score-0.205]

19 This is dangerous as simplification will often result in a minor change in a text’s semantics. [sent-39, score-0.385]

20 “The United Kingdom is a state in northwest Europe” may be simplified to give: The United Kingdom is a country in northwest Europe. [sent-41, score-0.146]

21 In this example from the corpus used in this research, the word “state” is simplified to give “country”. [sent-42, score-0.115]

22 Whilst this is a valid synonym in the given context, state and country are not necessarily semantically identical. [sent-43, score-0.105]

23 The two main techniques that exist in the literature are simplifying everything (Devlin and Tait, 1998) and frequency thresholding. [sent-57, score-0.252]

24 Table 1: The results of different experiments on the SemEval lexical simplification data. [sent-58, score-0.455]

25 These show that SUBTLEX was the best word frequency measure for rating lexical complexity. [sent-59, score-0.241]

26 The other entries correspond to alternative word frequency measures. [sent-60, score-0.171]

27 6 require a word frequency measure as an indicator of lexical complexity. [sent-70, score-0.241]

28 The lexical simplification dataset from Task 1 at SemEval 2012 (De Belder and Moens, 2012) was used to compare several measures of word frequency as shown in Table 1. [sent-72, score-0.626]

29 The original simplifications were performed by editors trying to make documents as simple as possible. [sent-86, score-0.087]

30 However, if only examples of CWs were available, it would be very easy for a technique to overfit as it could just classify every single word as complex and get 100% accuracy. [sent-92, score-0.164]

31 There are several methods for finding these, including: selecting words from a reference easy word list; selecting words with high frequencies according to some corpus or using the simplified words from the second sentences in the CW corpus. [sent-97, score-0.086]

32 4 Simplify Everything The first implementation involved simplifying everything, a brute force method, in which a simplification algorithm is applied to every word. [sent-103, score-0.498]

33 A common variation is to limit the simplification to some combination of all the nouns, verbs and adjectives. [sent-105, score-0.385]

34 A standard baseline lexical simplification system was implemented following Devlin and Tait (1998). [sent-106, score-0.455]

35 If the synonym was more frequent than the original word then a substitution was made. [sent-108, score-0.089]
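
A rough sketch of this simplify-everything baseline is given below: every word is treated as a candidate, synonyms are looked up, and a substitution is made only when a synonym is more frequent than the original word. The WordNet lookup via NLTK and the freq dictionary (a SUBTLEX-style word-to-frequency table) are assumptions for illustration, not necessarily the exact resources used in the paper.

# Sketch of the simplify-everything baseline: treat every word as complex and
# replace it with its most frequent synonym when that synonym is more frequent.
# 'freq' stands in for a SUBTLEX-style table mapping words to frequencies.
from nltk.corpus import wordnet as wn

def simplify_everything(tokens, freq):
    simplified = []
    for word in tokens:
        synonyms = {lemma.name().replace("_", " ")
                    for synset in wn.synsets(word)
                    for lemma in synset.lemmas()}
        synonyms.discard(word)
        best = max(synonyms, key=lambda w: freq.get(w, 0), default=word)
        # substitute only if the synonym is more frequent than the original word
        simplified.append(best if freq.get(best, 0) > freq.get(word, 0) else word)
    return simplified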

36 This relies on each word having an associated familiarity value provided by the SUBTLEX corpus. [sent-114, score-0.197]

37 Whilst this corpus is large, it will never cover every possible word, and so words which are not encountered are considered to have a frequency of 0. [sent-115, score-0.167]

38 To distinguish between complex and simple words a threshold was implemented. [sent-117, score-0.147]

39 Firstly, the training data was ordered by frequency, then the accuracy of the algorithm was examined with the threshold placed in between the frequency of every adjacent pair of words in the ordered list. [sent-119, score-0.172]
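
A minimal sketch of that threshold search follows, assuming a freq dictionary in which unseen words default to 0 (as described for SUBTLEX above) and a labelling of complex words as 0 and simple words as 1.

# Sketch of frequency thresholding: order training words by frequency and try a
# candidate threshold between every adjacent pair of frequencies, keeping the
# one with the best training accuracy. Words below the threshold are predicted
# complex (label 0); words at or above it are predicted simple (label 1).
def best_threshold(words, labels, freq):
    freqs = sorted({freq.get(w, 0) for w in words})
    candidates = [(a + b) / 2.0 for a, b in zip(freqs, freqs[1:])]
    def accuracy(threshold):
        preds = [0 if freq.get(w, 0) < threshold else 1 for w in words]
        return sum(p == y for p, y in zip(preds, labels)) / float(len(labels))
    return max(candidates, key=accuracy) if candidates else 0.0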

40 This may give some indication for future feature classification schemes. [sent-131, score-0.087]

41 Several external libraries were used to extract these as detailed below: Frequency The SUBTLEX frequency of each word was used as previously described in Section 2. [sent-136, score-0.171]

42 Syllable Count The number of syllables contained in a word is also a good estimate of its complexity. [sent-142, score-0.11]

43 Sense Count A count of the number of ways in which a word can be interpreted - showing how ambiguous a word is. [sent-144, score-0.132]

44 This again may give some indication of a word’s degree of ambiguity. [sent-147, score-0.087]
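
A sketch of how some of the per-word features discussed here (frequency, word length, syllable count, sense count) might be assembled and passed to an SVM; scikit-learn's SVC (a libsvm wrapper) and the crude vowel-group syllable counter are assumptions standing in for the external libraries mentioned above, and other features from the paper (e.g. the CD count) are omitted.

# Sketch: build per-word features and train an SVM on labelled examples.
from nltk.corpus import wordnet as wn
from sklearn.svm import SVC

def count_syllables(word):
    # crude vowel-group heuristic; the paper used an external syllable library
    vowels = "aeiouy"
    count, prev = 0, False
    for ch in word.lower():
        is_vowel = ch in vowels
        if is_vowel and not prev:
            count += 1
        prev = is_vowel
    return max(count, 1)

def word_features(word, freq):
    return [freq.get(word, 0),          # SUBTLEX-style frequency
            len(word),                  # word length
            count_syllables(word),      # syllable count
            len(wn.synsets(word))]      # sense count (degree of ambiguity)

def train_cw_classifier(words, labels, freq):
    X = [word_features(w, freq) for w in words]
    return SVC(kernel="rbf").fit(X, labels)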

45 Figure 1: A bar chart with error bars showing the results (Accuracy, F1, Precision and Recall) of the CW identification experiments. [sent-152, score-0.114]

46 These show the correlation against the language’s simplicity and so a positive correlation indicates that if that feature is higher then the word will be simpler. [sent-156, score-0.226]

47 To analyse the features of the SVM, the correlation coefficient between each feature vector and the vector of feature labels was calculated. [sent-157, score-0.121]

48 The adopted labelling scheme assigned CWs as 0 and simple words as 1 and so the correlation of the features is notionally against the simplicity of the words. [sent-159, score-0.142]

49 A positive correlation indicates that if the value of that feature is higher, the word will be simpler. [sent-165, score-0.129]
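
A minimal sketch of that analysis, assuming NumPy: the Pearson correlation between each feature column and the 0/1 label vector (CW = 0, simple = 1), so a positive value means that a larger feature value goes with a simpler word.

# Sketch: correlate each feature column with the labels (CW = 0, simple = 1).
import numpy as np

def feature_label_correlations(X, y):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return [float(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]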

50 0.9651 precision, which indicates that they are good at identifying the CWs, but also that they often identify simple words as CWs. [sent-175, score-0.125]

51 This indicates that many of the simple words which are falsely identified as complex are also replaced with an alternate substitution, which may result in a change in sense. [sent-177, score-0.176]

52 A paired t-test showed the difference between the thresholding method and the ‘simplify everything’ method was not statistically significant (p > 0. [sent-178, score-0.212]
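
A sketch of such a significance test using SciPy (an assumption; the paper does not name its statistics package), comparing paired per-trial scores of two methods:

# Sketch: paired t-test over matched per-trial scores of two CW identifiers.
from scipy.stats import ttest_rel

def significantly_different(scores_a, scores_b, alpha=0.05):
    statistic, p_value = ttest_rel(scores_a, scores_b)
    return p_value < alpha, p_value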

53 Thresholding takes more data about the words into account and would appear to be a less naïve strategy than blindly simplifying everything. [sent-180, score-0.175]

54 The thresholding here may be limited by the resources, and a corpus using a larger word count may yield an improved result. [sent-182, score-0.366]

55 Whilst the thresholding and simplify everything methods were not significantly different from each other, the SVM method was significantly different from the other two (p < 0. [sent-183, score-0.435]

56 This indicates that the SVM was better at distinguishing between complex and simple words, but also wrongly identified many CWs. [sent-186, score-0.174]

57 The results for the SVM have a wide standard deviation (shown in the wide error bars in Figure 1) indicating a higher variability than the other methods. [sent-187, score-0.089]

58 Frequency and CD count are moderately positively correlated as may be expected. [sent-193, score-0.172]

59 This indicates that higher frequency words are likely to be simple. [sent-194, score-0.167]

60 Surprisingly, CD Count has a higher correlation than frequency itself, indicating that this is a better measure of word familiarity than the frequency measure. [sent-195, score-0.52]

61 Word length and number of syllables are moderately negatively correlated, indicating that the longer and more polysyllabic a word is, the less simple it becomes. [sent-197, score-0.22]

62 Whilst ‘finger’ is more commonly used than ‘digit’, ‘digit’ is one letter shorter. [sent-200, score-0.089]

63 The number of senses was very weakly negatively correlated with word simplicity. [sent-201, score-0.119]

64 This indicates that it is not a strong indicative factor in determining whether a word is simple or not. [sent-202, score-0.117]

65 Each target word occurs in a sentence and it may be the case that those words surrounding the target give extra information as to its complexity. [sent-205, score-0.097]

66 , 2012), and so simple words will occur in the presence of other simple words, whereas CWs will occur in the presence of other CWs. [sent-207, score-0.09]

67 As well as lexical contextual information, the surrounding syntax may offer some information on word difficulty. [sent-208, score-0.138]

68 Factors such as a very long sentence or a complex grammatical structure can make a word more difficult to understand. [sent-209, score-0.13]

69 These could be used to modify the familiarity score in the thresholding method, or they could be used as features in the SVM classifier. [sent-210, score-0.371]

70 The related work in this field is also generally 4in the SUBTLEX corpus ‘finger’ has 1870, whereas ‘digit’ has a frequency of 30. [sent-212, score-0.133]

71 a frequency of 107 used as a precursor to lexical simplification. [sent-213, score-0.203]

72 The simplest way to identify CWs in a sentence is to blindly assume that every word is complex, as described earlier in Section 2. [sent-215, score-0.168]

73 This was first used in Devlin’s seminal work on lexical simplification (Devlin and Tait, 1998). [sent-217, score-0.455]

74 However, further work into lexical simplification has refuted this (Lal and Rüger, 2002). [sent-220, score-0.455]

75 This method is still used; for example, Thomas and Anderson (2012) simplify all nouns and verbs. [sent-221, score-0.12]

76 Another method of identifying CWs is to use frequency based thresholding over word familiarity scores, as described in Section 2. [sent-223, score-0.588]

77 This has been correlated with word difficulty via questionnaires (Zeng et al. [sent-227, score-0.089]

78 In both these cases, a familiarity score is used to determine how likely a subject is to understand a term. [sent-230, score-0.159]

79 (2012) use a threshold of 1% corpus frequency, along with other checks, to ensure that simple words are not erroneously simplified. [sent-232, score-0.124]

80 A Support Vector Machine is used to predict the familiarity of CWs in Zeng et al. [sent-234, score-0.159]

81 It takes features of term frequency and word length and is correlated against the familiarity scores which are already obtained. [sent-236, score-0.381]

82 This was done for the SemEval lexical simplification task (Specia et al. [sent-239, score-0.455]

83 Although this system is designed for synonym ranking, it could also be used for the CW identification task. [sent-241, score-0.116]

84 Machine learning has also been applied to the task of determining whether an entire sentence requires simplification (Gasperin et al. [sent-242, score-0.385]

85 The methods compared, whilst typical of current CW identification methods, are not an exhaustive set and variations exist. [sent-247, score-0.184]

86 This could be done using thresholding (Zeng-Treitler et al. [sent-249, score-0.212]

87 Another way to increase the accuracy of the frequency count method may be to use a larger corpus. [sent-252, score-0.219]

88 CW identification is the first step in the process of lexical simplification. [sent-257, score-0.135]

89 This research will be integrated in a future system which will simplify natural language for end users. [sent-258, score-0.086]

90 It is also hoped that other lexical simplification systems will take account of this work and will use the evaluation technique proposed herein to improve their identification of CWs. [sent-259, score-0.612]

91 It is hoped that new research in this field will evaluate the techniques used, rather than using inadequate techniques blindly and naïvely. [sent-262, score-0.231]

92 Fostering digital inclusion and accessibility: the PorSimples project for simplification of Portuguese texts. [sent-269, score-0.385]

93 Moving beyond Kucera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. [sent-282, score-0.342]

94 Practical simplification of English newspaper text to assist aphasic readers. [sent-286, score-0.439]

95 The use of a psycholinguistic database in the simplification of text for aphasic readers. [sent-304, score-0.474]

96 Uowshef: Simplex lexical simplicity ranking based on contextual and psycholinguistic features. [sent-326, score-0.145]

97 Lexical complexity and fixation times in reading: Effects of word frequency, verb complexity, and lexical ambiguity. [sent-335, score-0.099]

98 For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. [sent-350, score-0.112]

99 A text corpora-based estimation of the familiarity of health terminology. [sent-354, score-0.159]

100 Estimating consumer familiarity with health terminology: a context-based approach. [sent-358, score-0.159]
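
The numbered sentences above were extracted by ranking each sentence of the paper with a tf-idf model; a rough sketch of that kind of scoring, using scikit-learn (an assumption about tooling, not the actual mining pipeline), could look like this:

# Sketch: score each sentence by its summed tf-idf weight and keep the top ones.
from sklearn.feature_extraction.text import TfidfVectorizer

def top_sentences(sentences, top_n=20):
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(sentences)   # one row per sentence
    scores = matrix.sum(axis=1).A1                 # summed tf-idf per sentence
    ranked = scores.argsort()[::-1][:top_n]
    return [(int(i), float(scores[i]), sentences[i]) for i in ranked]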


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('simplification', 0.385), ('cws', 0.38), ('cw', 0.33), ('thresholding', 0.212), ('subtlex', 0.19), ('devlin', 0.167), ('familiarity', 0.159), ('everything', 0.137), ('frequency', 0.133), ('gasperin', 0.133), ('svm', 0.128), ('specia', 0.12), ('tait', 0.12), ('whilst', 0.119), ('jauhar', 0.104), ('belder', 0.096), ('blindly', 0.096), ('zeng', 0.092), ('digit', 0.089), ('simplify', 0.086), ('alu', 0.083), ('hancke', 0.082), ('simplifying', 0.079), ('finger', 0.072), ('syllables', 0.072), ('lexical', 0.07), ('sio', 0.066), ('identification', 0.065), ('complex', 0.063), ('hoped', 0.063), ('yatskar', 0.063), ('correlation', 0.057), ('count', 0.056), ('semeval', 0.055), ('aphasic', 0.054), ('brysbaert', 0.054), ('carrol', 0.054), ('hokkaido', 0.054), ('honshu', 0.054), ('rayner', 0.054), ('shardlow', 0.054), ('siobhan', 0.054), ('country', 0.054), ('synonym', 0.051), ('correlated', 0.051), ('medical', 0.049), ('bars', 0.049), ('assistive', 0.048), ('bott', 0.048), ('simplified', 0.048), ('moens', 0.048), ('identifying', 0.046), ('missed', 0.045), ('simple', 0.045), ('histories', 0.044), ('northwest', 0.044), ('sujay', 0.044), ('lucia', 0.043), ('lal', 0.042), ('simplifications', 0.042), ('cd', 0.041), ('simplicity', 0.04), ('variability', 0.04), ('na', 0.04), ('erroneously', 0.04), ('islands', 0.04), ('threshold', 0.039), ('unfamiliar', 0.038), ('word', 0.038), ('alongside', 0.037), ('deployed', 0.037), ('techniques', 0.036), ('discount', 0.035), ('falsely', 0.035), ('moderately', 0.035), ('psycholinguistic', 0.035), ('qing', 0.035), ('trials', 0.035), ('indicates', 0.034), ('nouns', 0.034), ('every', 0.034), ('caroline', 0.033), ('kingdom', 0.033), ('vector', 0.032), ('synonyms', 0.032), ('identified', 0.032), ('ve', 0.031), ('libsvm', 0.031), ('tony', 0.031), ('support', 0.031), ('may', 0.03), ('negatively', 0.03), ('technique', 0.029), ('complexity', 0.029), ('give', 0.029), ('difficult', 0.029), ('elhadad', 0.029), ('sandra', 0.028), ('indication', 0.028), ('precision', 0.028)]
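
The (word, weight) pairs above form this paper's tf-idf vector; the similar-papers list that follows can be reproduced in spirit by cosine similarity between such vectors. A sketch, again assuming scikit-learn:

# Sketch: rank papers by cosine similarity of their tf-idf vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def most_similar_papers(paper_texts, query_index, top_n=20):
    matrix = TfidfVectorizer(stop_words="english").fit_transform(paper_texts)
    sims = cosine_similarity(matrix[query_index], matrix).ravel()
    order = sims.argsort()[::-1]
    return [(int(i), float(sims[i])) for i in order if i != query_index][:top_n]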

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 3 acl-2013-A Comparison of Techniques to Automatically Identify Complex Words.

Author: Matthew Shardlow

Abstract: Identifying complex words (CWs) is an important, yet often overlooked, task within lexical simplification (The process of automatically replacing CWs with simpler alternatives). If too many words are identified then substitutions may be made erroneously, leading to a loss of meaning. If too few words are identified then those which impede a user’s understanding may be missed, resulting in a complex final text. This paper addresses the task of evaluating different methods for CW identification. A corpus of sentences with annotated CWs is mined from Simple Wikipedia edit histories, which is then used as the basis for several experiments. Firstly, the corpus design is explained and the results of the validation experiments using human judges are reported. Experiments are carried out into the CW identification techniques of: simplifying everything, frequency thresholding and training a support vector machine. These are based upon previous approaches to the task and show that thresholding does not perform significantly differently to the more na¨ ıve technique of simplifying everything. The support vector machine achieves a slight increase in precision over the other two methods, but at the cost of a dramatic trade off in recall.

2 0.20028149 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data

Author: David Kauchak

Abstract: In this paper we examine language modeling for text simplification. Unlike some text-to-text translation tasks, text simplification is a monolingual translation task allowing for text in both the input and output domain to be used for training the language model. We explore the relationship between normal English and simplified English and compare language models trained on varying amounts of text from each. We evaluate the models intrinsically with perplexity and extrinsically on the lexical simplification task from SemEval 2012. We find that a combined model using both simplified and normal English data achieves a 23% improvement in perplexity and a 24% improvement on the lexical simplification task over a model trained only on simple data. Post-hoc analysis shows that the additional unsimplified data provides better coverage for unseen and rare n-grams.

3 0.19854055 322 acl-2013-Simple, readable sub-sentences

Author: Sigrid Klerke ; Anders Søgaard

Abstract: We present experiments using a new unsupervised approach to automatic text simplification, which builds on sampling and ranking via a loss function informed by readability research. The main idea is that a loss function can distinguish good simplification candidates among randomly sampled sub-sentences of the input sentence. Our approach is rated as equally grammatical and beginner reader appropriate as a supervised SMT-based baseline system by native speakers, but our setup performs more radical changes that better resembles the variation observed in human generated simplifications.

4 0.13229012 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation

Author: Xiaodong Zeng ; Derek F. Wong ; Lidia S. Chao ; Isabel Trancoso

Abstract: This paper presents a semi-supervised Chinese word segmentation (CWS) approach that co-regularizes character-based and word-based models. Similarly to multi-view learning, the “segmentation agreements” between the two different types of view are used to overcome the scarcity of the label information on unlabeled data. The proposed approach trains a character-based and word-based model on labeled data, respectively, as the initial models. Then, the two models are constantly updated using unlabeled examples, where the learning objective is maximizing their segmentation agreements. The agreements are regarded as a set of valuable constraints for regularizing the learning of both models on unlabeled data. The segmentation for an input sentence is decoded by using a joint scoring function combining the two induced models. The evaluation on the Chinese tree bank reveals that our model results in better gains over the state-of-the-art semi-supervised models reported in the literature.

5 0.12382378 243 acl-2013-Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation

Author: Aobo Wang ; Min-Yen Kan

Abstract: We address the problem of informal word recognition in Chinese microblogs. A key problem is the lack of word delimiters in Chinese. We exploit this reliance as an opportunity: recognizing the relation between informal word recognition and Chinese word segmentation, we propose to model the two tasks jointly. Our joint inference method significantly outperforms baseline systems that conduct the tasks individually or sequentially.

6 0.11463288 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation

7 0.067315117 300 acl-2013-Reducing Annotation Effort for Quality Estimation via Active Learning

8 0.064953223 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing

9 0.061608583 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

10 0.061109997 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity

11 0.059677202 248 acl-2013-Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation

12 0.058480296 31 acl-2013-A corpus-based evaluation method for Distributional Semantic Models

13 0.057088021 342 acl-2013-Text Classification from Positive and Unlabeled Data using Misclassified Data Correction

14 0.056943163 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis

15 0.054298874 88 acl-2013-Computational considerations of comparisons and similes

16 0.049621176 351 acl-2013-Topic Modeling Based Classification of Clinical Reports

17 0.049026951 291 acl-2013-Question Answering Using Enhanced Lexical Semantic Models

18 0.047650371 325 acl-2013-Smoothed marginal distribution constraints for language modeling

19 0.045955651 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

20 0.044925965 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.154), (1, 0.022), (2, -0.003), (3, -0.037), (4, 0.039), (5, -0.059), (6, -0.026), (7, 0.014), (8, 0.013), (9, 0.06), (10, -0.045), (11, 0.046), (12, -0.078), (13, -0.014), (14, -0.098), (15, -0.036), (16, -0.018), (17, 0.002), (18, -0.01), (19, -0.003), (20, 0.015), (21, 0.004), (22, -0.004), (23, 0.002), (24, 0.009), (25, 0.008), (26, -0.068), (27, 0.044), (28, -0.03), (29, 0.018), (30, -0.105), (31, -0.033), (32, 0.011), (33, 0.114), (34, 0.006), (35, -0.036), (36, 0.048), (37, 0.041), (38, -0.021), (39, -0.046), (40, -0.198), (41, -0.015), (42, -0.182), (43, 0.031), (44, -0.059), (45, 0.101), (46, -0.009), (47, -0.213), (48, -0.051), (49, 0.191)]
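
The (topicId, topicWeight) pairs above are the paper's coordinates in a latent semantic space; a sketch of producing such weights with truncated SVD over the tf-idf matrix (scikit-learn assumed):

# Sketch: derive per-paper LSI topic weights via truncated SVD of tf-idf vectors.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def lsi_topic_weights(paper_texts, n_topics=50):
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(paper_texts)
    svd = TruncatedSVD(n_components=n_topics)
    return svd.fit_transform(tfidf)   # one row of topic weights per paper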

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.88920796 3 acl-2013-A Comparison of Techniques to Automatically Identify Complex Words.

Author: Matthew Shardlow

Abstract: Identifying complex words (CWs) is an important, yet often overlooked, task within lexical simplification (The process of automatically replacing CWs with simpler alternatives). If too many words are identified then substitutions may be made erroneously, leading to a loss of meaning. If too few words are identified then those which impede a user’s understanding may be missed, resulting in a complex final text. This paper addresses the task of evaluating different methods for CW identification. A corpus of sentences with annotated CWs is mined from Simple Wikipedia edit histories, which is then used as the basis for several experiments. Firstly, the corpus design is explained and the results of the validation experiments using human judges are reported. Experiments are carried out into the CW identification techniques of: simplifying everything, frequency thresholding and training a support vector machine. These are based upon previous approaches to the task and show that thresholding does not perform significantly differently to the more na¨ ıve technique of simplifying everything. The support vector machine achieves a slight increase in precision over the other two methods, but at the cost of a dramatic trade off in recall.

2 0.81493258 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data

Author: David Kauchak

Abstract: In this paper we examine language modeling for text simplification. Unlike some text-to-text translation tasks, text simplification is a monolingual translation task allowing for text in both the input and output domain to be used for training the language model. We explore the relationship between normal English and simplified English and compare language models trained on varying amounts of text from each. We evaluate the models intrinsically with perplexity and extrinsically on the lexical simplification task from SemEval 2012. We find that a combined model using both simplified and normal English data achieves a 23% improvement in perplexity and a 24% improvement on the lexical simplification task over a model trained only on simple data. Post-hoc analysis shows that the additional unsimplified data provides better coverage for unseen and rare n-grams.

3 0.75145143 322 acl-2013-Simple, readable sub-sentences

Author: Sigrid Klerke ; Anders Søgaard

Abstract: We present experiments using a new unsupervised approach to automatic text simplification, which builds on sampling and ranking via a loss function informed by readability research. The main idea is that a loss function can distinguish good simplification candidates among randomly sampled sub-sentences of the input sentence. Our approach is rated as equally grammatical and beginner reader appropriate as a supervised SMT-based baseline system by native speakers, but our setup performs more radical changes that better resembles the variation observed in human generated simplifications.

4 0.53507054 308 acl-2013-Scalable Modified Kneser-Ney Language Model Estimation

Author: Kenneth Heafield ; Ivan Pouzyrevsky ; Jonathan H. Clark ; Philipp Koehn

Abstract: We present an efficient algorithm to estimate large modified Kneser-Ney models including interpolation. Streaming and sorting enables the algorithm to scale to much larger models by using a fixed amount of RAM and variable amount of disk. Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens. Machine translation experiments with this model show improvement of 0.8 BLEU point over constrained systems for the 2013 Workshop on Machine Translation task in three language pairs. Our algorithm is also faster for small models: we estimated a model on 302 million tokens using 7.7% of the RAM and 14.0% of the wall time taken by SRILM. The code is open source as part of KenLM.

5 0.50979793 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation

Author: Shachar Mirkin ; Sriram Venkatapathy ; Marc Dymetman ; Ioan Calapodescu

Abstract: The quality of automatic translation is affected by many factors. One is the divergence between the specific source and target languages. Another lies in the source text itself, as some texts are more complex than others. One way to handle such texts is to modify them prior to translation. Yet, an important factor that is often overlooked is the source translatability with respect to the specific translation system and the specific model that are being used. In this paper we present an interactive system where source modifications are induced by confidence estimates that are derived from the translation model in use. Modifications are automatically generated and proposed for the user’s ap- proval. Such a system can reduce postediting effort, replacing it by cost-effective pre-editing that can be done by monolinguals.

6 0.50576645 390 acl-2013-Word surprisal predicts N400 amplitude during reading

7 0.46307665 243 acl-2013-Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation

8 0.46261826 64 acl-2013-Automatically Predicting Sentence Translation Difficulty

9 0.46249318 325 acl-2013-Smoothed marginal distribution constraints for language modeling

10 0.45196143 371 acl-2013-Unsupervised joke generation from big data

11 0.43180749 247 acl-2013-Modeling of term-distance and term-occurrence information for improving n-gram language model performance

12 0.42261699 262 acl-2013-Offspring from Reproduction Problems: What Replication Failure Teaches Us

13 0.42234629 381 acl-2013-Variable Bit Quantisation for LSH

14 0.41375664 234 acl-2013-Linking and Extending an Open Multilingual Wordnet

15 0.41326481 232 acl-2013-Linguistic Models for Analyzing and Detecting Biased Language

16 0.40104941 327 acl-2013-Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison

17 0.39959571 135 acl-2013-English-to-Russian MT evaluation campaign

18 0.39357385 88 acl-2013-Computational considerations of comparisons and similes

19 0.39163458 356 acl-2013-Transfer Learning Based Cross-lingual Knowledge Extraction for Wikipedia

20 0.38720688 31 acl-2013-A corpus-based evaluation method for Distributional Semantic Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.062), (5, 0.317), (6, 0.04), (11, 0.045), (14, 0.012), (15, 0.017), (24, 0.07), (26, 0.069), (35, 0.069), (42, 0.045), (48, 0.035), (64, 0.011), (70, 0.026), (88, 0.026), (90, 0.031), (95, 0.054)]
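
A sketch of how the LDA topic weights above could be produced, assuming scikit-learn's LatentDirichletAllocation over raw term counts for the paper collection:

# Sketch: per-paper LDA topic distributions from raw term counts.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def lda_topic_weights(paper_texts, n_topics=100):
    counts = CountVectorizer(stop_words="english").fit_transform(paper_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    return lda.fit_transform(counts)   # rows sum to 1: topic mixture per paper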

similar papers list:

simIndex simValue paperId paperTitle

1 0.8235718 88 acl-2013-Computational considerations of comparisons and similes

Author: Vlad Niculae ; Victoria Yaneva

Abstract: This paper presents work in progress towards automatic recognition and classification of comparisons and similes. Among possible applications, we discuss the place of this task in text simplification for readers with Autism Spectrum Disorders (ASD), who are known to have deficits in comprehending figurative language. We propose an approach to comparison recognition through the use of syntactic patterns. Keeping in mind the requirements of autistic readers, we discuss the properties relevant for distinguishing semantic criteria like figurativeness and abstractness.

same-paper 2 0.74033105 3 acl-2013-A Comparison of Techniques to Automatically Identify Complex Words.

Author: Matthew Shardlow

Abstract: Identifying complex words (CWs) is an important, yet often overlooked, task within lexical simplification (The process of automatically replacing CWs with simpler alternatives). If too many words are identified then substitutions may be made erroneously, leading to a loss of meaning. If too few words are identified then those which impede a user’s understanding may be missed, resulting in a complex final text. This paper addresses the task of evaluating different methods for CW identification. A corpus of sentences with annotated CWs is mined from Simple Wikipedia edit histories, which is then used as the basis for several experiments. Firstly, the corpus design is explained and the results of the validation experiments using human judges are reported. Experiments are carried out into the CW identification techniques of: simplifying everything, frequency thresholding and training a support vector machine. These are based upon previous approaches to the task and show that thresholding does not perform significantly differently to the more na¨ ıve technique of simplifying everything. The support vector machine achieves a slight increase in precision over the other two methods, but at the cost of a dramatic trade off in recall.

3 0.72094834 366 acl-2013-Understanding Verbs based on Overlapping Verbs Senses

Author: Kavitha Rajan

Abstract: Natural language can be easily understood by everyone irrespective of their differences in age or region or qualification. The existence of a conceptual base that underlies all natural languages is an accepted claim as pointed out by Schank in his Conceptual Dependency (CD) theory. Inspired by the CD theory and theories in Indian grammatical tradition, we propose a new set of meaning primitives in this paper. We claim that this new set of primitives captures the meaning inherent in verbs and help in forming an inter-lingual and computable ontological classification of verbs. We have identified seven primitive overlapping verb senses which substantiate our claim. The percentage of coverage of these primitives is 100% for all verbs in Sanskrit and Hindi and 3750 verbs in English. 1

4 0.46610209 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations

Author: Angeliki Lazaridou ; Ivan Titov ; Caroline Sporleder

Abstract: We propose a joint model for unsupervised induction of sentiment, aspect and discourse information and show that by incorporating a notion of latent discourse relations in the model, we improve the prediction accuracy for aspect and sentiment polarity on the sub-sentential level. We deviate from the traditional view of discourse, as we induce types of discourse relations and associated discourse cues relevant to the considered opinion analysis task; consequently, the induced discourse relations play the role of opinion and aspect shifters. The quantitative analysis that we conducted indicated that the integration of a discourse model increased the prediction accuracy results with respect to the discourse-agnostic approach and the qualitative analysis suggests that the induced representations encode a meaningful discourse structure.

5 0.46135941 318 acl-2013-Sentiment Relevance

Author: Christian Scheible ; Hinrich Schutze

Abstract: A number of different notions, including subjectivity, have been proposed for distinguishing parts of documents that convey sentiment from those that do not. We propose a new concept, sentiment relevance, to make this distinction and argue that it better reflects the requirements of sentiment analysis systems. We demonstrate experimentally that sentiment relevance and subjectivity are related, but different. Since no large amount of labeled training data for our new notion of sentiment relevance is available, we investigate two semi-supervised methods for creating sentiment relevance classifiers: a distant supervision approach that leverages structured information about the domain of the reviews; and transfer learning on feature representations based on lexical taxonomies that enables knowledge transfer. We show that both methods learn sentiment relevance classifiers that perform well.

6 0.45865092 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data

7 0.45551336 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model

8 0.45523429 377 acl-2013-Using Supervised Bigram-based ILP for Extractive Summarization

9 0.45520708 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks

10 0.45363653 144 acl-2013-Explicit and Implicit Syntactic Features for Text Classification

11 0.45317805 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

12 0.45291662 233 acl-2013-Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media

13 0.44980371 187 acl-2013-Identifying Opinion Subgroups in Arabic Online Discussions

14 0.44953781 95 acl-2013-Crawling microblogging services to gather language-classified URLs. Workflow and case study

15 0.44921029 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri

16 0.44892752 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation

17 0.44779176 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation

18 0.44771314 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction

19 0.44751275 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

20 0.44724318 298 acl-2013-Recognizing Rare Social Phenomena in Conversation: Empowerment Detection in Support Group Chatrooms