emnlp emnlp2011 emnlp2011-35 knowledge-graph by maker-knowledge-mining

35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases


Source: pdf

Author: Daniel Dahlmeier ; Hwee Tou Ng

Abstract: We present a novel approach for automatic collocation error correction in learner English which is based on paraphrases extracted from parallel corpora. Our key assumption is that collocation errors are often caused by semantic similarity in the first language (L1-language) of the writer. An analysis of a large corpus of annotated learner English confirms this assumption. We evaluate our approach on real-world learner data and show that L1-induced paraphrases outperform traditional approaches based on edit distance, homophones, and WordNet synonyms.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We present a novel approach for automatic collocation error correction in learner English which is based on paraphrases extracted from parallel corpora. [sent-4, score-1.238]

2 Our key assumption is that collocation errors are often caused by semantic similarity in the first language (L1-language) of the writer. [sent-5, score-0.866]

3 We evaluate our approach on real-world learner data and show that L1-induced paraphrases outperform traditional approaches based on edit distance, homophones, and WordNet synonyms. [sent-7, score-0.214]

4 1 Introduction Grammatical error correction (GEC) is emerging as a commercially attractive application of natural language processing (NLP) for the booming market of English as foreign or second language (EFL/ESL). The de facto standard approach to GEC is to build [sent-8, score-0.395]

5 a statistical model that can choose the most likely correction from a confusion set of possible correction choices. [sent-9, score-0.636]

6 Work in context-sensitive spelling error correction (Golding and Roth, 1999) has traditionally focused on confusion sets with similar spelling (e. [sent-11, score-0.781]

7 In other words, the words in a confusion set are deemed confusable because of orthographic or phonetic similarity. [sent-16, score-0.238]

8 In contrast, we investigate in this paper a class of grammatical errors where the source of confusion is the similar semantics of the words, rather than orthography, phonetics, or syntax. [sent-19, score-0.299]

9 In particular, we focus on collocation errors in EFL writing. [sent-20, score-0.811]

10 The term collocation (Firth, 1957) describes a sequence of words that is conventionally used together in a particular way by native speakers and appears more often together than one would expect by chance. [sent-21, score-0.749]

11 In this work, we present a novel approach for automatic correction of collocation errors in EFL writing. [sent-23, score-1.076]

12 We first analyze collocation errors in the NUS Corpus of Learner English (NUCLE), a fully annotated one-million-word corpus of learner English which we will make available to the community for research purposes (see Section 3 for details about the corpus). [sent-31, score-0.892]

13 Our analysis confirms that many collocation errors can be traced to similar translations in the writer’s L1-language. [sent-32, score-0.884]

14 Based on this result, we propose a novel approach for automatic collocation error correction. [sent-33, score-0.749]

15 Section 4 describes our approach for automatic collocation error correction. [sent-45, score-0.749]

16 2 Related Work In this section, we give an overview of related work on collocation error correction. [sent-49, score-0.749]

17 We also highlight differences between collocation error correction and related NLP tasks like context-sensitive spelling error correction, synonym extraction, lexical substitution, and paraphrasing. [sent-50, score-1.255]

18 Most work in collocation error correction has relied on dictionaries or manually created databases to generate collocation candidates (Shei and Pain, 2000; Wible et al. [sent-51, score-1.754]

19 (2008), as they also use translation information to generate collocation candidates. [sent-60, score-0.711]

20 Context-sensitive spelling error correction is the task of correcting spelling mistakes that result in another valid word; see for example (Golding and Roth, 1999). [sent-64, score-0.781]

21 It has traditionally focused on a small number of pre-defined confusion sets, like homophones or frequent spelling errors. [sent-65, score-0.465]

22 Even when the confusion sets are formed automatically, the similarity of words in a confusion set has been based on edit distance or phonetic similarity (Carlson et al. [sent-66, score-0.292]

23 In contrast, we focus on words that are confusable due to their similar semantics instead of similar spelling or pronunciation. [sent-68, score-0.249]

24 Synonym extraction (Wu and Zhou, 2003), lexical substitution (McCarthy and Navigli, 2007) and paraphrasing (Madnani and Dorr, 2010) are related to collocation correction in the sense that they try to find semantically equivalent words or phrases. [sent-71, score-1.003]

25 However, there is a subtle but important difference between these tasks and collocation correction. [sent-72, score-0.677]

26 In contrast, in collocation correction, we are primarily interested in finding candidates which are not substitutable in their English context but appear to be substitutable in the L1-language of the writer, i. [sent-76, score-0.842]

27 Each error tag consists of the start and end offset of the annotation, the type of the error, and the appropriate gold correction as deemed by the annotator. [sent-93, score-0.455]

28 The annotators were asked to provide a correction that would result in a grammatical sentence if the selected word or phrase were replaced by the correction. [sent-94, score-0.384]

29 In this work, we focus on errors which have been marked with the error tag wrong collocation/idiom/preposition. [sent-95, score-0.206]

30 In a similar way, we filter out a small number of article errors which were marked as collocation errors. [sent-97, score-0.811]

31 Finally, we filter out instances where the annotated phrase or the suggested correction is longer than 3 words, as we observe that they contain highly context-specific corrections and are unlikely to generalize well (e. [sent-98, score-0.444]

32 After filtering, we end up with 2,747 collocation errors and their respective corrections, which account for about 6% of all errors in NUCLE. [sent-101, score-0.945]

33 This makes collocation errors the 7th largest class of errors in the corpus after article errors, redundancies, prepositions, noun number, verb tense, and mechanics. [sent-102, score-0.945]

34 Not counting duplicates, there are 2,412 distinct collocation errors and corrections. [sent-103, score-0.811]

35 Although there are other error types which are more frequent, collocation errors represent a particular challenge as the possible corrections are not restricted to a closed set of choices and they are directly related to semantics rather than syntax. [sent-104, score-1.002]

36 We analyzed the collocation errors and found that they can be attributed to the following sources of confusion: Spelling: We suspect that an error is caused by similar orthography if the edit distance between the erroneous phrase and its correction is less than a certain threshold. [sent-105, score-1.487]

37 Homophones: We suspect that an error is caused by similar pronunciation if the erroneous word and its correction have the same pronunciation. [sent-106, score-0.513]

38 Synonyms: We suspect that an error is caused by synonymy if the erroneous word and its correction are synonyms in WordNet (Fellbaum, 1998). [sent-108, score-0.579]

39 L1-transfer: We suspect that an error is caused by L1-transfer if the erroneous phrase and its correction share a common translation in a Chinese-English phrase table. [sent-111, score-0.721]
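To make the L1-transfer test concrete, here is a minimal sketch, assuming a hypothetical `translations` dictionary that maps an English phrase to the set of Chinese phrases it aligns to in the phrase table; the function and variable names are ours, not the paper's.

```python
# Minimal sketch of the L1-transfer test: two English phrases are
# suspected L1-transfer confusions if they share at least one common
# translation in a Chinese-English phrase table. `translations` (an
# assumed structure) maps an English phrase to its Chinese translations.

def shares_l1_translation(phrase_a, phrase_b, translations):
    common = translations.get(phrase_a, set()) & translations.get(phrase_b, set())
    return bool(common)
```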

40 As CuVPlus and WordNet are defined for individual words, we extend the matching process to phrases in the following way: two phrases A and B are deemed homophones/synonyms if they have the same length and the i-th word in phrase A is a homophone/synonym of the corresponding i-th word in phrase B. [sent-114, score-0.22]
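A minimal sketch of this phrase-level extension follows; `same_word_category` is a placeholder for a word-level CuVPlus homophone or WordNet synonym lookup and is not from the paper.

```python
def phrases_match(phrase_a, phrase_b, same_word_category):
    # Two phrases A and B are deemed homophones/synonyms iff they have
    # the same length and the i-th word of A is a homophone/synonym of
    # the i-th word of B.
    words_a, words_b = phrase_a.split(), phrase_b.split()
    if len(words_a) != len(words_b):
        return False
    return all(same_word_category(a, b) for a, b in zip(words_a, words_b))
```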

41 Table 3: Examples of collocation errors with different sources of confusion. [sent-159, score-0.811]

42 The threshold for spelling errors is one for phrases of up to six characters and two for the remaining phrases. [sent-164, score-0.303]
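A minimal sketch of the spelling test with this length-dependent threshold; the Levenshtein implementation is a textbook one, and the non-strict comparison (`<=`) is our reading of the threshold description.

```python
def edit_distance(s, t):
    # Textbook Levenshtein distance via dynamic programming.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[-1] + 1,                # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

def is_spelling_confusion(erroneous, correction):
    # Threshold: one for phrases of up to six characters, two otherwise.
    threshold = 1 if len(erroneous) <= 6 else 2
    return edit_distance(erroneous, correction) <= threshold
```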

43 As a collocation error can be part of more than one category, the rows in the table do not sum up to the total number of errors. [sent-167, score-0.749]

44 The number of errors that can be traced to L1-transfer greatly exceeds that of every other category. [sent-168, score-0.207]

45 The table also shows the number of collocation errors that can be traced to L1-transfer but not to the other sources. [sent-169, score-0.884]

46 906 collocation errors with 692 distinct collocation error types can be attributed to L1-transfer but not to spelling, homophones, or synonyms. [sent-170, score-1.589]

47 Table 3 shows some examples of collocation errors for each category from our corpus. [sent-171, score-0.811]

48 We note that there are also collocation error types that cannot be traced to any of the above sources. [sent-172, score-0.822]

49 4 Correcting Collocation Errors In this section, we propose a novel approach for correcting collocation errors in EFL writing. [sent-174, score-0.811]

50 4.1 L1-induced Paraphrases We use the popular technique of paraphrasing with parallel corpora (Bannard and Callison-Burch, 2005) to automatically find collocation candidates from a sentence-aligned L1-English parallel corpus. [sent-176, score-0.886]

51 The paraphrase probability of an English phrase e1 given an English phrase e2 is defined as p(e1|e2) = Σ_f p(e1|f) p(f|e2) (1), where f denotes a foreign phrase in the L1-language. [sent-187, score-0.369]
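A minimal sketch of Equation (1), pivoting through the foreign side as in Bannard and Callison-Burch (2005); the nested-dictionary phrase-table representation (`p_f_given_e`, `p_e_given_f`) is an assumption of ours, and dropping the identity entry mirrors the paper's later step of forcing the system to change the identified phrase.

```python
from collections import defaultdict

def l1_paraphrase_probs(e2, p_f_given_e, p_e_given_f):
    # p(e1 | e2) = sum over foreign phrases f of p(e1 | f) * p(f | e2).
    probs = defaultdict(float)
    for f, p_f in p_f_given_e.get(e2, {}).items():       # p(f | e2)
        for e1, p_e1 in p_e_given_f.get(f, {}).items():  # p(e1 | f)
            probs[e1] += p_e1 * p_f
    probs.pop(e2, None)  # drop identical entries, forcing a change
    return dict(probs)
```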

52 The log-linear model scores candidates as p(e|f) ∝ exp(Σ_i λ_i h_i(e, f)) (2). Typical features include a phrase translation probability p(e|f), an inverse phrase translation probability p(f|e), a language model score p(e), and a constant phrase penalty. [sent-201, score-0.372]
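A minimal sketch of scoring a candidate with such a log-linear model; the feature values and weights below are purely illustrative, not tuned values from the paper.

```python
import math

def loglinear_score(features, weights):
    # The log of exp(sum_i lambda_i * h_i(e, f)) is the weighted feature
    # sum itself, which is all that is needed to rank candidates.
    return sum(weights[name] * h for name, h in features.items())

candidate = {
    "log p(e|f)":     math.log(0.30),  # phrase translation probability
    "log p(f|e)":     math.log(0.20),  # inverse phrase translation probability
    "log p(e)":       math.log(0.01),  # language model score
    "phrase_penalty": 1.0,             # constant phrase penalty
}
weights = {"log p(e|f)": 0.2, "log p(f|e)": 0.2,
           "log p(e)": 0.5, "phrase_penalty": -0.1}
print(loglinear_score(candidate, weights))
```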

53 Because of the great flexibility of the log-linear model, researchers have used the framework for other tasks outside SMT, including grammatical error correction (Brockett et al. [sent-206, score-0.369]

54 , 2007) to include collocation corrections with features derived from spelling, homophones, synonyms, and L1-induced paraphrases. [sent-210, score-0.769]

55 L1-paraphrases: For each English phrase, the phrase table contains entries consisting of the phrase and each of its L1-derived paraphrases as described in Section 4. [sent-222, score-0.26]

56 The first four tables only contain collocation candidates for individual words. [sent-230, score-0.74]

57 5 Experiments In this section, we empirically evaluate our approach on real collocation errors in learner English. [sent-232, score-0.892]

58 Our main evaluation metric is mean reciprocal rank (MRR), which is the arithmetic mean of the inverse ranks of the first correct answer returned by the system: MRR = (1/N) Σ_{i=1}^{N} 1/rank(i) (3), where N is the size of the test set. [sent-243, score-0.229]

59 5.3 Collocation Error Experiments Automatic correction of collocation errors can conceptually be divided into two steps: i) identification of wrong collocations in the input, and ii) correction of the identified collocations. [sent-247, score-1.405]

60 In this work, we focus on the second step and assume that the erroneous collocation has already been identified. [sent-248, score-0.765]

61 While this might seem like a simplification, it has been the common evaluation setup in collocation error correction (see for example (Wu et al. [sent-249, score-1.014]

62 In our experiments, we use the start and end offset of the collocation error provided by the human annotator to identify the location of the collocation error. [sent-252, score-1.453]

63 We remove phrase table entries where the phrase and the candidate correction are identical, thus practically forcing the system to change the identified phrase. [sent-254, score-0.439]

64 We previously observed that word order errors are virtually absent in our collocation errors. [sent-256, score-0.811]

65 After convergence, the model can be used to automatically correct new collocation errors. [sent-262, score-0.709]

66 6 Results We evaluate the performance of the proposed method on our test set of 856 sentences, each with one collocation error. [sent-263, score-0.677]

67 If the gold answer is not found in the top 100 outputs, the rank is considered to be infinity, or in other words, the inverse of the rank is zero. [sent-267, score-0.275]
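A minimal sketch of the MRR computation from Equation (3), using the convention just described: a gold answer outside the top 100 outputs contributes a reciprocal rank of zero.

```python
def mean_reciprocal_rank(system_outputs, gold_answers, cutoff=100):
    # system_outputs: one ranked candidate list per test instance;
    # gold_answers: the annotator's gold correction for each instance.
    total = 0.0
    for candidates, gold in zip(system_outputs, gold_answers):
        top = candidates[:cutoff]
        if gold in top:
            total += 1.0 / (top.index(gold) + 1)  # ranks are 1-based
        # otherwise the rank is "infinity" and the reciprocal is zero
    return total / len(gold_answers)
```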

68 The results of the automatic evaluation are shown in Table 4. For collocation errors, there is usually more than one possible correct answer. [sent-269, score-0.709]

69 Therefore, automatic evaluation underestimates the actual performance of the system by only considering the single gold answer as correct and all other answers as wrong. [sent-270, score-0.2]

70 The candidates are displayed together with the gold answer by the annotator, in alphabetical order, without any information about their rank or which system produced them. [sent-285, score-0.235]

71 The judges were asked to make a binary judgment for each of the candidates on whether the proposed candidate is a valid correction of the original or not. [sent-287, score-0.358]

72 In the first case, the rank of the first correct item is the minimum rank of any item judged correct by either judge. [sent-306, score-0.248]

73 In the second case, the rank of the first correct item is the minimum rank of any item judged correct by both judges. [sent-307, score-0.248]

74 Unfortunately, comparison of our results with previous work is complicated by the fact that there currently exists no standard data set for collocation error correction. [sent-315, score-0.749]

75 7 Analysis In this section, we analyze and categorize those test instances for which the ALL system could not produce an acceptable correction in the top 3 candidates. [sent-317, score-0.292]

76 Out-of-vocabulary (21/100) The most frequent reason why the system does not produce a good correction is that the erroneous collocation is out of vocabulary. [sent-321, score-1.03]

77 Gold: although many may argue that public spending on the elderly should be limited. [sent-334, score-0.398]

78 All: although many may believe that public spending on the elderly should be limited. [sent-337, score-0.398]

79 although many may think that public spending on the elderly should be limited. [sent-340, score-0.398]

80 although many may accept that public spending on the elderly should be limited. [sent-343, score-0.398]

81 Baseline: *although many may agreed that public spending on the elderly should be limited. [sent-346, score-0.398]

82 *although many may hold that public spending on the elderly should be limited. [sent-349, score-0.398]

83 *although many may agrees that public spending on the elderly should be limited. [sent-352, score-0.398]

84 in spending their resources (resources, resources on) Others: this might redirect (make sound, reduce) foreign investments. [sent-412, score-0.306]

85 , the paraphrase table contains even ||| only get when the gold correction was even → only, or the phrase table actually contains the gold answer but fails to rank it among the top 3 answers. [sent-424, score-0.559]

86 Function/auxiliary words (14/100) We observe that collocation errors that involve function words or auxiliary words are not handled very well by our model. [sent-429, score-0.811]

87 As our model corrects collocation errors at the sentence level, such gold answers will be very difficult or impossible to determine correctly. [sent-433, score-0.912]

88 Spelling errors (9/100) Some of the collocation errors are caused by spelling mistakes, e. [sent-435, score-1.169]

89 Although the ALL model includes candidates which are created through edit distance, paraphrase candidates created from the misspelled word can dominate the top 3 ranks, e. [sent-438, score-0.223]

90 A possible solution would be to perform spell-checking as a separate pre-processing step prior to collocation correction. [sent-441, score-0.677]

91 Word sense (7/100) Some of the failures of the model can be attributed to ambiguous senses of the collocation phrase. [sent-442, score-0.735]

92 Incorporating word sense disambiguation into the model might help, although accurate word sense disambiguation on noisy learner text may not be easy. [sent-444, score-0.201]

93 Preposition constructions (6/100) Some of the collocation errors involve preposition constructions, e. [sent-445, score-0.845]

94 Others (11/100) Other mistakes include collocation errors where the gold answer slightly changed the semantics of the target word, e. [sent-451, score-0.978]

95 , redirect potential foreign investments → reduce potential foreign investments, active-passive alternation (enhanced economics → was economical), and noun possessive errors (british’s rule → british rule). [sent-453, score-0.328]

96 8 Conclusion and Future Work We have presented a novel approach for correcting collocation errors in written learner text. [sent-454, score-0.94]

97 In future work, we plan to extend our system to fully automatic collocation correction that involves both identification and correction of collocation errors. [sent-457, score-1.884]

98 An automatic collocation writing assistant for Taiwanese EFL learners: A case of corpus-based NLP technology. [sent-500, score-0.708]

99 A computational approach to detecting collocation errors in the writing of non-native speakers of English. [sent-539, score-0.877]

100 Using mostly native data to correct errors in learners’ writing: A meta-classifier approach. [sent-544, score-0.203]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('collocation', 0.677), ('correction', 0.265), ('efl', 0.19), ('homophones', 0.19), ('spending', 0.17), ('spelling', 0.169), ('elderly', 0.153), ('errors', 0.134), ('confusion', 0.106), ('corrections', 0.092), ('erroneous', 0.088), ('phrase', 0.087), ('mrr', 0.086), ('paraphrases', 0.086), ('learner', 0.081), ('concise', 0.076), ('public', 0.075), ('traced', 0.073), ('error', 0.072), ('writer', 0.069), ('gec', 0.068), ('answer', 0.067), ('synonyms', 0.066), ('unambiguous', 0.064), ('collocations', 0.064), ('candidates', 0.063), ('rank', 0.06), ('foreign', 0.058), ('chinese', 0.058), ('parallel', 0.057), ('answers', 0.056), ('caused', 0.055), ('confusable', 0.053), ('cuvplus', 0.051), ('investing', 0.051), ('investments', 0.051), ('nucle', 0.051), ('substitutable', 0.051), ('wible', 0.051), ('paraphrase', 0.05), ('dahlmeier', 0.049), ('english', 0.048), ('correcting', 0.048), ('ng', 0.047), ('edit', 0.047), ('deemed', 0.046), ('gold', 0.045), ('golding', 0.044), ('inverse', 0.043), ('judgments', 0.04), ('kappa', 0.04), ('nus', 0.04), ('madnani', 0.037), ('copy', 0.037), ('essays', 0.037), ('prepositions', 0.037), ('native', 0.037), ('speakers', 0.035), ('smt', 0.035), ('preposition', 0.034), ('translation', 0.034), ('assisted', 0.034), ('attend', 0.034), ('duplicates', 0.034), ('farghal', 0.034), ('futagi', 0.034), ('incidence', 0.034), ('shei', 0.034), ('watch', 0.034), ('zhong', 0.034), ('suspect', 0.033), ('phonetic', 0.033), ('correct', 0.032), ('judge', 0.032), ('paraphrasing', 0.032), ('wordnet', 0.032), ('grammatical', 0.032), ('item', 0.032), ('wu', 0.032), ('disambiguation', 0.031), ('writing', 0.031), ('british', 0.03), ('valid', 0.03), ('bannard', 0.03), ('rozovskaya', 0.029), ('sense', 0.029), ('students', 0.029), ('attributed', 0.029), ('foster', 0.029), ('chan', 0.029), ('mistakes', 0.028), ('liu', 0.027), ('semantics', 0.027), ('koehn', 0.027), ('intersection', 0.027), ('redirect', 0.027), ('categorize', 0.027), ('esl', 0.027), ('offset', 0.027), ('reciprocal', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000012 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases

Author: Daniel Dahlmeier ; Hwee Tou Ng

Abstract: We present a novel approach for automatic collocation error correction in learner English which is based on paraphrases extracted from parallel corpora. Our key assumption is that collocation errors are often caused by semantic similarity in the first language (L1-language) of the writer. An analysis of a large corpus of annotated learner English confirms this assumption. We evaluate our approach on real-world learner data and show that L1-induced paraphrases outperform traditional approaches based on edit distance, homophones, and WordNet synonyms.

2 0.18269326 55 emnlp-2011-Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models

Author: Wei Xu ; Joel Tetreault ; Martin Chodorow ; Ralph Grishman ; Le Zhao

Abstract: We propose a novel way of incorporating dependency parse and word co-occurrence information into a state-of-the-art web-scale ngram model for spelling correction. The syntactic and distributional information provides extra evidence in addition to that provided by a web-scale n-gram corpus and especially helps with data sparsity problems. Experimental results show that introducing syntactic features into n-gram based models significantly reduces errors by up to 12.4% over the current state-of-the-art. The word co-occurrence information shows potential but only improves overall accuracy slightly.

3 0.11934938 102 emnlp-2011-Parse Correction with Specialized Models for Difficult Attachment Types

Author: Enrique Henestroza Anguiano ; Marie Candito

Abstract: This paper develops a framework for syntactic dependency parse correction. Dependencies in an input parse tree are revised by selecting, for a given dependent, the best governor from within a small set of candidates. We use a discriminative linear ranking model to select the best governor from a group of candidates for a dependent, and our model includes a rich feature set that encodes syntactic structure in the input parse tree. The parse correction framework is parser-agnostic, and can correct attachments using either a generic model or specialized models tailored to difficult attachment types like coordination and pp-attachment. Our experiments show that parse correction, combining a generic model with specialized models for difficult attachment types, can successfully improve the quality of predicted parse trees output by several representative state-of-the-art dependency parsers for French.

4 0.11570223 145 emnlp-2011-Unsupervised Semantic Role Induction with Graph Partitioning

Author: Joel Lang ; Mirella Lapata

Abstract: In this paper we present a method for unsupervised semantic role induction which we formalize as a graph partitioning problem. Argument instances of a verb are represented as vertices in a graph whose edge weights quantify their role-semantic similarity. Graph partitioning is realized with an algorithm that iteratively assigns vertices to clusters based on the cluster assignments of neighboring vertices. Our method is algorithmically and conceptually simple, especially with respect to how problem-specific knowledge is incorporated into the model. Experimental results on the CoNLL 2008 benchmark dataset demonstrate that our model is competitive with other unsupervised approaches in terms of F1 whilst attaining significantly higher cluster purity.

5 0.10914947 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation

Author: Juri Ganitkevitch ; Chris Callison-Burch ; Courtney Napoles ; Benjamin Van Durme

Abstract: Previous work has shown that high quality phrasal paraphrases can be extracted from bilingual parallel corpora. However, it is not clear whether bitexts are an appropriate resource for extracting more sophisticated sentential paraphrases, which are more obviously learnable from monolingual parallel corpora. We extend bilingual paraphrase extraction to syntactic paraphrases and demonstrate its ability to learn a variety of general paraphrastic transformations, including passivization, dative shift, and topicalization. We discuss how our model can be adapted to many text generation tasks by augmenting its feature set, development data, and parameter estimation routine. We illustrate this adaptation by using our paraphrase model for the task of sentence compression and achieve results competitive with state-of-the-art compression systems.

6 0.099530034 6 emnlp-2011-A Generate and Rank Approach to Sentence Paraphrasing

7 0.091823839 3 emnlp-2011-A Correction Model for Word Alignments

8 0.08603283 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification

9 0.084051646 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation

10 0.07699424 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

11 0.067562468 69 emnlp-2011-Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources

12 0.065721951 57 emnlp-2011-Extreme Extraction - Machine Reading in a Week

13 0.063274369 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

14 0.061483774 38 emnlp-2011-Data-Driven Response Generation in Social Media

15 0.058549043 78 emnlp-2011-Large-Scale Noun Compound Interpretation Using Bootstrapping and the Web as a Corpus

16 0.056259912 7 emnlp-2011-A Joint Model for Extended Semantic Role Labeling

17 0.055345133 2 emnlp-2011-A Cascaded Classification Approach to Semantic Head Recognition

18 0.052429587 125 emnlp-2011-Statistical Machine Translation with Local Language Models

19 0.051294297 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions

20 0.047496051 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.191), (1, 0.02), (2, -0.018), (3, -0.075), (4, 0.012), (5, -0.076), (6, -0.083), (7, 0.168), (8, -0.036), (9, 0.031), (10, 0.083), (11, -0.126), (12, 0.127), (13, -0.008), (14, -0.056), (15, 0.13), (16, -0.037), (17, -0.145), (18, 0.164), (19, -0.189), (20, -0.021), (21, 0.189), (22, 0.105), (23, 0.155), (24, -0.068), (25, 0.178), (26, 0.179), (27, -0.082), (28, 0.223), (29, 0.104), (30, 0.006), (31, -0.071), (32, 0.103), (33, 0.096), (34, 0.023), (35, -0.005), (36, 0.024), (37, -0.008), (38, 0.042), (39, 0.046), (40, -0.12), (41, -0.145), (42, 0.008), (43, -0.036), (44, 0.006), (45, 0.081), (46, 0.088), (47, -0.017), (48, 0.049), (49, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95375091 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases

Author: Daniel Dahlmeier ; Hwee Tou Ng

Abstract: We present a novel approach for automatic collocation error correction in learner English which is based on paraphrases extracted from parallel corpora. Our key assumption is that collocation errors are often caused by semantic similarity in the first language (L1-language) of the writer. An analysis of a large corpus of annotated learner English confirms this assumption. We evaluate our approach on real-world learner data and show that L1-induced paraphrases outperform traditional approaches based on edit distance, homophones, and WordNet synonyms.

2 0.82505882 55 emnlp-2011-Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models

Author: Wei Xu ; Joel Tetreault ; Martin Chodorow ; Ralph Grishman ; Le Zhao

Abstract: We propose a novel way of incorporating dependency parse and word co-occurrence information into a state-of-the-art web-scale ngram model for spelling correction. The syntactic and distributional information provides extra evidence in addition to that provided by a web-scale n-gram corpus and especially helps with data sparsity problems. Experimental results show that introducing syntactic features into n-gram based models significantly reduces errors by up to 12.4% over the current state-of-the-art. The word co-occurrence information shows potential but only improves overall accuracy slightly.

3 0.51522541 102 emnlp-2011-Parse Correction with Specialized Models for Difficult Attachment Types

Author: Enrique Henestroza Anguiano ; Marie Candito

Abstract: This paper develops a framework for syntactic dependency parse correction. Dependencies in an input parse tree are revised by selecting, for a given dependent, the best governor from within a small set of candidates. We use a discriminative linear ranking model to select the best governor from a group of candidates for a dependent, and our model includes a rich feature set that encodes syntactic structure in the input parse tree. The parse correction framework is parser-agnostic, and can correct attachments using either a generic model or specialized models tailored to difficult attachment types like coordination and pp-attachment. Our experiments show that parse correction, combining a generic model with specialized models for difficult attachment types, can successfully improve the quality of predicted parse trees output by several representative state-of-the-art dependency parsers for French.

4 0.3603299 3 emnlp-2011-A Correction Model for Word Alignments

Author: J. Scott McCarley ; Abraham Ittycheriah ; Salim Roukos ; Bing Xiang ; Jian-ming Xu

Abstract: Models of word alignment built as sequences of links have limited expressive power, but are easy to decode. Word aligners that model the alignment matrix can express arbitrary alignments, but are difficult to decode. We propose an alignment matrix model as a correction algorithm to an underlying sequencebased aligner. Then a greedy decoding algorithm enables the full expressive power of the alignment matrix formulation. Improved alignment performance is shown for all nine language pairs tested. The improved alignments also improved translation quality from Chinese to English and English to Italian.

5 0.35347018 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification

Author: Sze-Meng Jojo Wong ; Mark Dras

Abstract: Attempts to profile authors according to their characteristics extracted from textual data, including native language, have drawn attention in recent years, via various machine learning approaches utilising mostly lexical features. Drawing on the idea of contrastive analysis, which postulates that syntactic errors in a text are to some extent influenced by the native language of an author, this paper explores the usefulness of syntactic features for native language identification. We take two types of parse substructure as features— horizontal slices of trees, and the more general feature schemas from discriminative parse reranking—and show that using this kind of syntactic feature results in an accuracy score in classification of seven native languages of around 80%, an error reduction of more than 30%.

6 0.32660264 6 emnlp-2011-A Generate and Rank Approach to Sentence Paraphrasing

7 0.3080838 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation

8 0.30361444 38 emnlp-2011-Data-Driven Response Generation in Social Media

9 0.29019269 145 emnlp-2011-Unsupervised Semantic Role Induction with Graph Partitioning

10 0.2676231 2 emnlp-2011-A Cascaded Classification Approach to Semantic Head Recognition

11 0.25251764 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

12 0.2474601 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation

13 0.24156506 7 emnlp-2011-A Joint Model for Extended Semantic Role Labeling

14 0.23985043 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries

15 0.22642842 78 emnlp-2011-Large-Scale Noun Compound Interpretation Using Bootstrapping and the Web as a Corpus

16 0.22410931 100 emnlp-2011-Optimal Search for Minimum Error Rate Training

17 0.21695119 69 emnlp-2011-Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources

18 0.21132417 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

19 0.21063349 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge

20 0.20474692 57 emnlp-2011-Extreme Extraction - Machine Reading in a Week


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.211), (18, 0.011), (23, 0.121), (36, 0.04), (37, 0.021), (45, 0.063), (53, 0.034), (54, 0.031), (56, 0.035), (57, 0.013), (62, 0.025), (64, 0.028), (66, 0.036), (69, 0.028), (79, 0.068), (80, 0.013), (82, 0.014), (85, 0.011), (87, 0.014), (94, 0.015), (96, 0.039), (98, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77946705 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases

Author: Daniel Dahlmeier ; Hwee Tou Ng

Abstract: We present a novel approach for automatic collocation error correction in learner English which is based on paraphrases extracted from parallel corpora. Our key assumption is that collocation errors are often caused by semantic similarity in the first language (L1-language) of the writer. An analysis of a large corpus of annotated learner English confirms this assumption. We evaluate our approach on real-world learner data and show that L1-induced paraphrases outperform traditional approaches based on edit distance, homophones, and WordNet synonyms.

2 0.62041348 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

Author: Kevin Gimpel ; Noah A. Smith

Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and Urdu-English translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.

3 0.60983825 55 emnlp-2011-Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models

Author: Wei Xu ; Joel Tetreault ; Martin Chodorow ; Ralph Grishman ; Le Zhao

Abstract: We propose a novel way of incorporating dependency parse and word co-occurrence information into a state-of-the-art web-scale ngram model for spelling correction. The syntactic and distributional information provides extra evidence in addition to that provided by a web-scale n-gram corpus and especially helps with data sparsity problems. Experimental results show that introducing syntactic features into n-gram based models significantly reduces errors by up to 12.4% over the current state-of-the-art. The word co-occurrence information shows potential but only improves overall accuracy slightly.

4 0.60969603 38 emnlp-2011-Data-Driven Response Generation in Social Media

Author: Alan Ritter ; Colin Cherry ; William B. Dolan

Abstract: We present a data-driven approach to generating responses to Twitter status posts, based on phrase-based Statistical Machine Translation. We find that mapping conversational stimuli onto responses is more difficult than translating between languages, due to the wider range of possible responses, the larger fraction of unaligned words/phrases, and the presence of large phrase pairs whose alignment cannot be further decomposed. After addressing these challenges, we compare approaches based on SMT and Information Retrieval in a human evaluation. We show that SMT outperforms IR on this task, and its output is preferred over actual human responses in 15% of cases. As far as we are aware, this is the first work to investigate the use of phrase-based SMT to directly translate a linguistic stimulus into an appropriate response.

5 0.60967487 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation

Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng

Abstract: Many machine translation evaluation metrics have been proposed after the seminal BLEU metric, and many among them have been found to consistently outperform BLEU, demonstrated by their better correlations with human judgment. It has long been the hope that by tuning machine translation systems against these new generation metrics, advances in automatic machine translation evaluation can lead directly to advances in automatic machine translation. However, to date there has been no unambiguous report that these new metrics can improve a state-of-the-art machine translation system over its BLEU-tuned baseline. In this paper, we demonstrate that tuning Joshua, a hierarchical phrase-based statistical machine translation system, with the TESLA metrics results in significantly better human-judged translation quality than the BLEU-tuned baseline. TESLA-M in particular is simple and performs well in practice on large datasets. We release all our implementation under an open source license. It is our hope that this work will encourage the machine translation community to finally move away from BLEU as the unquestioned default and to consider the new generation metrics when tuning their systems.

6 0.60956919 136 emnlp-2011-Training a Parser for Machine Translation Reordering

7 0.6087634 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

8 0.60730678 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features

9 0.60398424 132 emnlp-2011-Syntax-Based Grammaticality Improvement using CCG and Guided Search

10 0.60389751 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

11 0.60252887 3 emnlp-2011-A Correction Model for Word Alignments

12 0.60087079 87 emnlp-2011-Lexical Generalization in CCG Grammar Induction for Semantic Parsing

13 0.60015464 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives

14 0.5992164 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding

15 0.59912544 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study

16 0.59859145 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction

17 0.59856874 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification

18 0.59783465 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax

19 0.59782684 66 emnlp-2011-Hierarchical Phrase-based Translation Representations

20 0.59765941 128 emnlp-2011-Structured Relation Discovery using Generative Models