54 emnlp-2011-Exploiting Parse Structures for Native Language Identification


Source: pdf

Author: Sze-Meng Jojo Wong ; Mark Dras

Abstract: Attempts to profile authors according to their characteristics extracted from textual data, including native language, have drawn attention in recent years, via various machine learning approaches utilising mostly lexical features. Drawing on the idea of contrastive analysis, which postulates that syntactic errors in a text are to some extent influenced by the native language of an author, this paper explores the usefulness of syntactic features for native language identification. We take two types of parse substructure as features— horizontal slices of trees, and the more general feature schemas from discriminative parse reranking—and show that using this kind of syntactic feature results in an accuracy score in classification of seven native languages of around 80%, an error reduction of more than 30%.
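
The two figures quoted in the abstract can be related through the usual definition of relative error reduction. The following is a worked sketch under that assumed definition; the baseline accuracy is not stated in the abstract, so the bound is only what the quoted numbers imply.

```latex
\[
\text{error reduction} = \frac{e_{\text{base}} - e_{\text{new}}}{e_{\text{base}}},
\qquad e = 1 - \text{accuracy}
\]
\[
e_{\text{new}} \approx 1 - 0.80 = 0.20, \qquad
\frac{e_{\text{base}} - 0.20}{e_{\text{base}}} > 0.30
\;\Longrightarrow\;
e_{\text{base}} > \frac{0.20}{0.70} \approx 0.286
\]
```

Under these assumptions, an accuracy of around 80% with an error reduction above 30% corresponds to a baseline accuracy below roughly 71.4%.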

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Attempts to profile authors according to their characteristics extracted from textual data, including native language, have drawn attention in recent years, via various machine learning approaches utilising mostly lexical features. [sent-4, score-0.535]

2 Drawing on the idea of contrastive analysis, which postulates that syntactic errors in a text are to some extent influenced by the native language of an author, this paper explores the usefulness of syntactic features for native language identification. [sent-5, score-1.468]

3 1 Introduction Inferring characteristics of authors from their textual data, often termed authorship profiling, has seen a number of computational approaches proposed in recent years. [sent-7, score-0.104]

4 The problem is typically treated as a classification task, where an author is classified with respect to characteristics such as gender, age, native language, and so on. [sent-8, score-0.65]

5 The particular application that motivates the present study is detection of phishing (Myers, 2007), the attempt to defraud through texts that are designed to [...] [Author block] Mark Dras, Centre for Language Technology, Macquarie University, Sydney, Australia. [sent-10, score-0.303]

6 One class of countermeasures to phishing consists of technical methods such as email authentication; another looks at profiling of the text’s author(s) (Fette et al. [sent-14, score-0.169]

7 In this paper we investigate classification of a text with respect to an author’s native language, where this is not the language that that text is written in (which is often the case in phishing); we refer to this as native language identification. [sent-17, score-1.132]

8 (2005) did suggest using syntactic errors in their work but did not investigate them in any detail. [sent-24, score-0.103]

9 Wong and Dras (2009) noted the relevance of the concept of contrastive analysis (Lado, 1957), which postulates that native language constructions lead to characteristic errors in a second language. [sent-25, score-0.93]

10 In their experimental work, however, they used only three manual syntactic constructions drawn from the literature; an ANOVA analysis showed a detectable effect, but they did not improve classification accuracy over purely lexical features. [sent-26, score-0.157]

11 In this paper, we investigate syntactic features for native language identification that are more general [...] [Footer] Proceedings of the 2011 Conference on Empirical M..., Edinburgh, Scotland, UK, July 27–31, 2011. [sent-27, score-0.622]

12 Specifically, we look at two types of parse tree substructure to use as features: horizontal slices of the trees—that is, characterising parse trees as sets of context-free grammar production rules—and the features schemas used in discriminative parse reranking. [sent-31, score-0.605]

13 The goal of the present study is therefore to investigate the influence that syntactic features represented by parse structures have on the classification task of identifying an author’s native language relative to, and in combination with, lexical features. [sent-32, score-0.707]

14 In Section 2, we discuss some related work on the two key topics of this paper: primarily on comparable work in native language identification, and then on how the notion of contrastive analysis can be applicable here. [sent-34, score-0.688]

15 1 Native Language Identification The earliest work on native language identification in this classification paradigm is that of Koppel et al. [sent-38, score-0.644]

16 With five different groups of English authors (of native languages Bulgarian, Czech, French, Russian, and Spanish) selected from the first version of International Corpus of Learner English (ICLE), they gained a relatively high classification accuracy of 80%. [sent-40, score-0.121]

17 (2005) also suggested that syntactic features (syntactic errors) might be useful features, but only investigated this idea at a shallow level by treating rare PoS bigrams as ungrammatical structures. [sent-42, score-0.081]

18 (2005) to investigate the hypothesis that the choice of words in second language writing is highly influenced by the frequency of native language syllables (the phonology of the native language). [sent-44, score-1.115]

19 Approximating this by character bi-grams alone, they managed to achieve a classification accuracy of 66%. [sent-45, score-0.115]
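
As an illustration of what a character bi-gram representation might look like, here is a minimal sketch. The function name `char_bigram_freqs`, the lowercasing step, and the use of relative frequencies are assumptions made for this example; the summary does not say how the original features were computed.

```python
from collections import Counter


def char_bigram_freqs(text):
    """Return the relative frequency of each character bigram in `text`.

    Hypothetical helper: lowercasing and relative-frequency values are
    illustrative choices, not details taken from the paper.
    """
    text = text.lower()
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    total = len(bigrams)
    if total == 0:
        return {}
    counts = Counter(bigrams)
    return {bigram: n / total for bigram, n in counts.items()}


# "the theory" has 9 bigrams; "th" and "he" each occur twice.
freqs = char_bigram_freqs("the theory")
```

Each essay would then be represented as a vector of such frequencies over a fixed bigram vocabulary before being passed to a classifier.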

20 Native language is also amongst the characteristics investigated in the task of authorship profiling by Estival et al. [sent-46, score-0.219]

21 For the native language identification classification task, their model yielded a reasonably high accuracy of 84%, but this was over a set of only three languages (Arabic, English and Spanish) and against a most frequent baseline of 62. [sent-49, score-0.703]

22 On the basis of frequency counts of word-based n-grams, surprisingly high classification accuracies within the range of 87-97% were achieved across six languages (English, German, French, Dutch, Spanish, and Italian). [sent-53, score-0.121]

23 This turns out, however, to be significantly influenced by the use of particular phrases used by speakers of different languages in the parliamentary context (e. [sent-54, score-0.213]

24 To our knowledge, Wong and Dras (2009) is the only work that has investigated the usefulness of syntactic features for the task of native language identification. [sent-57, score-0.616]

25 They then examined the literature on contrastive analysis (see Section 2. [sent-60, score-0.153]

26 2), from the field of second language acquisition, and selected three syntactic errors commonly observed in non-native English users—subject-verb disagreement, noun-number disagreement and misuse of determiners—that had been identified as being influenced by the native language. [sent-61, score-0.683]

27 An ANOVA analysis showed that the native language identification constructions were identifiable; however, the overall classification was not improved over the lexical features by using just the three manually detected syntactic errors. [sent-62, score-0.739]

28 As a possible approach that would improve the classification accuracy over just the three manually detected syntactic errors, Wong and Dras (2009) suggested deploying (but did not carry out) an idea put forward by Gamon (2004) (citing Baayen et al. [sent-66, score-0.102]

29 (1996)) for the related task of identifying the author of a text: to use CFG production rules to characterise syntactic structures used by authors. [sent-67, score-0.269]

30 1 We note that similar ideas have been used in the task of sentence grammaticality judgement, which utilise parser outputs (both trees and by-products) as classification features (Mutton et al. [sent-68, score-0.135]

31 We combine this idea with one we introduce in this paper, of using discriminative reranking features as a broader characterisation of the parse tree. [sent-74, score-0.182]

32 2 Contrastive analysis Contrastive analysis (Lado, 1957) was an early attempt in the field of second language acquisition to explain the kinds and source of errors that nonnative speakers make. [sent-76, score-0.215]

33 It arose out of behaviourist psychology, and saw language learning as an issue of habit formation that could be inhibited by previous habits inculcated in learning the native language. [sent-77, score-0.535]

34 The theory was also tied to structural linguistics: it compared the syntactic structures of the native and second languages to find differences that might cause learning difficulties. [sent-78, score-0.634]

35 In the context of native language identification, however, contrastive analysis postulates that this is exactly the case for the different classes. [sent-81, score-0.592]

36 common across all language learners regardless of native language, which could not be explained under contrastive analysis. [sent-82, score-0.748]

37 In an overview of contrastive analysis after the emergence of error analysis, Wardhaugh (1970) noted that there were two interpretations of the CAH, termed the strong and weak forms. [sent-84, score-0.189]

38 Under the strong form, all errors were attributed to the native language, and clearly that was not tenable in light of error analysis evidence. [sent-85, score-0.598]

39 Wardhaugh noted claims at the time that the hypothesis was no longer useful in either the strong or the weak version: “Such a claim is perhaps unwarranted, but a period of quiescence is probable for CA itself”. [sent-87, score-0.072]

40 Nevertheless, smaller studies specifically of interlanguage errors have continued to be carried out, generally restricted in their scope to a specific grammatical aspect of English in which the native language of the learners might have an influence. [sent-89, score-0.771]

41 NLP techniques and a probabilistic view of native language identification now let us revisit and make use of the weak form of the CAH. [sent-96, score-0.618]

42 Interlanguage errors, as represented by differences in parse trees, may be characteristic of the native language of a learner; we can use the occurrence of these to come up with a revised likelihood of the native language. [sent-97, score-1.207]
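
The idea of a "revised likelihood" can be pictured as a simple Bayesian update. This is only an illustrative sketch: the summary does not say that the authors compute it this way (they train a classifier), and the function name `update_posterior` and all probabilities below are hypothetical.

```python
def update_posterior(prior, likelihood):
    """Bayes update: posterior(lang) is proportional to prior(lang) * P(evidence | lang)."""
    unnormalised = {lang: prior[lang] * likelihood[lang] for lang in prior}
    z = sum(unnormalised.values())
    return {lang: p / z for lang, p in unnormalised.items()}


# Hypothetical numbers: a construction assumed to be nine times more likely
# in texts by one group of writers than in texts by another.
prior = {"Chinese": 0.5, "French": 0.5}
likelihood = {"Chinese": 0.09, "French": 0.01}
posterior = update_posterior(prior, likelihood)
```

With these made-up values, observing the construction shifts the belief from an even split to 0.9 versus 0.1.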

43 (2005), that were empirically determined by Wong and Dras (2009) to be the best of three candidates; we used character bi-grams, as the best performing n-grams, although this also had been left unspecified by Koppel et al. [sent-108, score-0.091]

44 (These types of feature value are the best performing one for each lexi[...] [Footnote 2] As with most work in authorship profiling, only function words are used, so that the result is not tied to a particular domain, and no clues are obtained from different topics that different authors might write about. [sent-113, score-0.104]

45 (2005), as an ablative analysis showed that they contributed nothing to classification accuracy. [sent-116, score-0.062]

46 Production Rules Under this model (PROD-RULE), we take as features horizontal slices of parse trees, in effect treating them as sets of CFG production rules. [sent-117, score-0.252]

47 For each language in our dataset, we identify the n rules most characteristic of the language using Information Gain (IG). [sent-120, score-0.121]
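
Information Gain for a single binary feature (a rule is present in an essay or not) can be sketched as below. This is a minimal illustration of the standard definition, IG = H(Y) - H(Y | feature); the function names and the toy labels are assumptions, and the summary does not specify how class labels were set up for the per-language ranking.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(present, labels):
    """IG of a binary feature: H(labels) - H(labels | feature present/absent)."""
    n = len(labels)
    with_feature = [y for p, y in zip(present, labels) if p]
    without_feature = [y for p, y in zip(present, labels) if not p]
    conditional = sum(
        len(part) / n * entropy(part)
        for part in (with_feature, without_feature)
        if part
    )
    return entropy(labels) - conditional


# Toy data: four essays, two classes.
labels = ["zh", "zh", "fr", "fr"]
ig_perfect = information_gain([True, True, False, False], labels)  # rule separates classes
ig_useless = information_gain([True, False, True, False], labels)  # rule is uninformative
```

Ranking all candidate rules by this score and keeping the top n gives the kind of selection described above.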

48 It is worth noting that the production rules being used here are all non-lexicalised ones, except those lexicalised with function words and punctuation, to avoid topic-related clues. [sent-123, score-0.175]
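
The extraction of production rules, lexicalised only for function words and punctuation, can be sketched as follows. This is a hedged illustration and not the authors' code: the bracketed-tree reader, the tiny `FUNCTION_WORDS` set, and the function names are all hypothetical, and the input is assumed to be a well-formed Penn-style bracketed parse.

```python
# Hypothetical, deliberately tiny function-word list for illustration only.
FUNCTION_WORDS = {"the", "of", "to", "in", "a"}


def parse_tree(s):
    """Read a Penn-style bracketed string into (label, children); leaves are str."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1  # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return (label, children)

    return read()


def production_rules(tree):
    """List CFG rules in a tree, keeping words only for function words/punctuation."""
    label, children = tree
    if children and all(isinstance(c, str) for c in children):
        # Preterminal node: emit a lexicalised rule only for closed-class items.
        word = children[0].lower()
        if word in FUNCTION_WORDS or not any(ch.isalnum() for ch in word):
            return [f"{label} -> {word}"]
        return []
    rules = [label + " -> " + " ".join(child[0] for child in children)]
    for child in children:
        rules.extend(production_rules(child))
    return rules


tree = parse_tree("(PP (VBG according) (PP (TO to) (NP (NN report))))")
rules = production_rules(tree)
# rules == ['PP -> VBG PP', 'PP -> TO NP', 'TO -> to', 'NP -> NN']
```

Content words such as "according" and "report" produce no lexicalised rule here, which mirrors the aim of avoiding topic-related clues.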

49 Reranking Features As opposed to the horizontal parse production rules, features used for discriminative reranking are cross-sections of parse trees that might capture other aspects of ungrammatical structures. [sent-124, score-0.389]

50 For these we use the 13 feature schemas described in Charniak and Johnson (2005), which were inspired by earlier work in discriminative estimation techniques, such as Johnson et al. [sent-125, score-0.172]

51 Examples of these feature schemas include tuples covering head-to-head dependencies, preterminals together with their closest maximal projection ancestors, and subtrees rooted in the least common ancestor. [sent-127, score-0.172]

52 These feature schemas are not the only possible ones—they were empirically selected for the specific purpose of augmenting the Charniak parser. [sent-128, score-0.172]

53 Johnson and Ural (2010) for the Berkeley parser (Petrov et al. [sent-131, score-0.073]

54 (2010) for the C&C parser (Clark and Curran, 2007)). [sent-133, score-0.073]

55 We also use this standard set, specifically the set of instantiated feature schemas from the parser from Charniak and Johnson (2005) as trained on the Wall Street Journal (WSJ), which gives 1,333,837 potential features. [sent-134, score-0.245]

56 For each native language, we randomly select from amongst essays with length of 500-1000 words. [sent-143, score-0.083]

57 For the purpose of the present study, we have 95 essays per native language. [sent-144, score-0.618]

58 For the same reason as highlighted by Wong and Dras (2009), we intentionally use fewer essays as compared to Koppel et al. [sent-145, score-0.083]

59 We divide these into training sets of 70 essays per language, [Footnote 3] Koppel et al. [sent-147, score-0.083]

60 with a held-out test set of 25 essays per language. [sent-149, score-0.083]
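
A per-language split of this shape could be produced as in the sketch below. The function name, the fixed random seed, and the placeholder language labels are assumptions for illustration; only the sizes (70 training and 25 test essays per language) come from the text.

```python
import random


def split_per_language(essays_by_lang, n_train=70, n_test=25, seed=0):
    """Randomly split each language's essays into train and held-out test sets."""
    rng = random.Random(seed)  # fixed seed so the illustrative split is repeatable
    train, test = [], []
    for lang, essays in sorted(essays_by_lang.items()):
        if len(essays) < n_train + n_test:
            raise ValueError(f"not enough essays for {lang}")
        picked = rng.sample(essays, n_train + n_test)
        train += [(essay, lang) for essay in picked[:n_train]]
        test += [(essay, lang) for essay in picked[n_train:]]
    return train, test


# Placeholder data: seven languages with 95 dummy essays each.
essays = {f"lang{i}": [f"essay-{i}-{j}" for j in range(95)] for i in range(7)}
train, test = split_per_language(essays)
```

With seven languages this yields 490 training texts, matching the training-set size mentioned for the parsers, and 175 held-out texts.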

61 2 Parsers We use two parsers: the Stanford parser (Klein and Manning, 2003) and the Charniak and Johnson (henceforth C&J) parser (Charniak and Johnson, 2005). [sent-152, score-0.146]

62 Both are widely used, and produce relatively accurate parses: the Stanford parser gets a labelled f-score of 85. [sent-153, score-0.073]

63 With the Stanford parser, there are 26,284 unique parse production rules extractable from our ICLE training set of 490 texts, while the C&J parser produces 27,705. [sent-156, score-0.281]

64 For reranking, we use only the C&J parser: since the parser stores these features during parsing, we can use them directly as classification features. [sent-157, score-0.135]

65 The classifier is tuned to obtain an optimal classification model. [sent-162, score-0.062]

66 While testing for statistical significance of classification results is often not carried out in NLP, we do so here because the quantity of data could raise questions about the certainty of any effect. [sent-165, score-0.062]

67 The first point to note is that PROD-RULE, under both parsers, is a substantial improvement over LEXICAL when (non-lexicalised) parse rules together with rules lexicalised with function words are used (rows marked with * in Table 1), with the largest difference as much as 77. [sent-179, score-0.215]

68 There appears to be no difference according to the parser used, regardless of their differing accuracy on the WSJ. [sent-185, score-0.073]

69 Using the selection metric for PROD-RULE without rules lexicalised with function words produces results all around those for LEXICAL; using fewer reranking features is worse as the quality of RERANKING declines as feature cut-offs are raised. [sent-186, score-0.203]

70 Another, somewhat surprising point is that the RERANKING results are also generally around those of LEXICAL even though like PROD-RULE they are also using cross-sections of the parse tree. [sent-187, score-0.07]

71 The first is that the feature schemas used were originally chosen for the specific purpose of augmenting the performance of the Charniak parser; perhaps others might be more appropriate here. [sent-189, score-0.208]

72 The second is that we selected only those instantiated feature schemas that occurred in the WSJ, and then applied them to ICLE. [sent-190, score-0.172]

73 In contrast, the production rules of PROD-RULE were selected only from the ICLE training data. [sent-192, score-0.138]

74 Overall, the PROD-RULE model results in fewer misclassifications compared to the LEXICAL model; there are mostly only incremental improvements for each language, with perhaps the exception of the reduction in confusion in the Slavic languages. [sent-209, score-0.074]

75 We looked at some of the data, to see what kind of syntactic substructure is useful in classifying native language. [sent-210, score-0.616]

76 Although using feature selection with only 1000 features did not improve performance, the information gain ranking does identify particular constructions as characteristic of one of the languages, and so are useful for inspection. [sent-211, score-0.122]

77 A phenomenon that the literature has noted as occurring with Chinese speakers is that of the missing determiner. [sent-212, score-0.109]

78 One example is [...] [Footnote 5] This does happen with native speakers of some other languages, such as Slavic ones, but not generally (from our knowledge of the literature) with native speakers of others, such as Romance ones. [sent-215, score-1.288]

79 [rule fragment: NP → NN ...] In Figure 1 we give the parse (from the Stanford parser) of the sentence The development of country park can directly elp to alleviate overcrowdedness and overpopulation in urban area. [sent-221, score-0.107]

80 Another production rule that occurs typically (in fact, almost exclusively) in the texts of native Chinese speakers is PP → VBG PP (by the Stanford parser), which almost always corresponds to the phrase according to. [sent-224, score-0.728]

81 [Figure 3: Parse illustrating parser correction] [...] plastic waste generates toxic by-products ([...] is an in-text citation that was removed in the preparation of ICLE) that illustrates this particular construction. [sent-227, score-0.11]

82 It appears that speakers of Chinese frequently use this phrase as a translation of gēnjù. [sent-228, score-0.109]

83 However, both parsers appear to be good at ‘ignoring’ errors, and producing relatively grammatical structures (albeit ones with different frequencies for different native languages). [sent-231, score-0.591]

84 Figure 3 gives the C&J parse for Overall, cyber cafeis a good place as recreational centre with a bundle of up-to-dated information. [sent-232, score-0.221]

85 Nevertheless, the parser produces a solid grammatical tree, specifically assigning the category VBD to the compound cafeis. [sent-234, score-0.129]

86 We also present in Table 7 the top 10 rules chosen under the IG feature selection for the Stanford parser on the held-out set. [sent-236, score-0.127]

87 A number of these, and those ranked lower, are concerned with punctuation: these seem unlikely to be related to native language, but perhaps rather to how students of a particular language background are taught. [sent-237, score-0.571]

88 7 Conclusion In this paper we have shown that, using cross-sections of parse trees, we can improve above an already good baseline in the task of native language identification. [sent-239, score-0.605]

89 The best features arising from the classification have been horizontal cross-sections of trees, rather than the more general discriminative parse reranking features that might have been expected to perform at least as well. [sent-241, score-0.297]

90 This relatively poorer performance by the reranking features may be due to a number of factors, all of which could be investigated in future work. [sent-242, score-0.153]

91 One is the use of feature schema instances that did not appear in the largely grammatical WSJ; another is the extension of feature schemas; and a third is the use of a parser that does not enforce linguistic constraints such as the Berkeley parser (Petrov et al. [sent-243, score-0.202]

92 Much gratitude is due to Mark Johnson for his guidance on the extraction of reranking features. [sent-248, score-0.112]

93 Subject-verb agreement errors in French and English: The role of syntactic hierarchy. [sent-297, score-0.103]

94 Linguistic correlates of style: Authorship classification with deep linguistic analysis features. [sent-301, score-0.062]

95 Connector usage in the English essay writing of native and non-native EFL speakers of English. [sent-305, score-0.644]

96 L1 transfer revisited: the L2 acquisition of telicity marking in English by Spanish and Bulgarian native speakers. [sent-378, score-0.672]

97 Using parse features for preposition selection and error detection. [sent-390, score-0.07]

98 Using classifier features for studying the effect of native language on the choice of written second language words. [sent-395, score-0.535]

99 A contrastive analysis of authorial presence in English, German, French, Russian and Bulgarian. [sent-404, score-0.153]

100 The impact of the absence of grammatical tense in L1 on the acquisition of the tense-aspect system in L2. [sent-429, score-0.099]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('native', 0.535), ('koppel', 0.253), ('dras', 0.208), ('schemas', 0.172), ('wong', 0.167), ('contrastive', 0.153), ('icle', 0.153), ('french', 0.13), ('np', 0.124), ('reranking', 0.112), ('spanish', 0.111), ('russian', 0.11), ('bulgarian', 0.11), ('speakers', 0.109), ('authorship', 0.104), ('phishing', 0.095), ('production', 0.084), ('essays', 0.083), ('granger', 0.076), ('nn', 0.076), ('pp', 0.075), ('profiling', 0.074), ('parser', 0.073), ('parse', 0.07), ('characteristic', 0.067), ('slavic', 0.066), ('jj', 0.065), ('errors', 0.063), ('classification', 0.062), ('learners', 0.06), ('tsur', 0.06), ('languages', 0.059), ('johnson', 0.058), ('ig', 0.058), ('estival', 0.057), ('interlanguage', 0.057), ('jojo', 0.057), ('lado', 0.057), ('postulates', 0.057), ('telicity', 0.057), ('vbg', 0.057), ('vigliocco', 0.057), ('japanese', 0.056), ('vp', 0.056), ('grammatical', 0.056), ('chinese', 0.056), ('constructions', 0.055), ('rules', 0.054), ('character', 0.053), ('horizontal', 0.053), ('author', 0.053), ('ove', 0.049), ('teaching', 0.049), ('halteren', 0.049), ('wagner', 0.049), ('wsj', 0.048), ('identification', 0.047), ('czech', 0.046), ('ci', 0.046), ('pr', 0.046), ('learner', 0.045), ('australia', 0.045), ('slices', 0.045), ('influenced', 0.045), ('english', 0.043), ('acquisition', 0.043), ('foster', 0.043), ('charniak', 0.042), ('australasian', 0.041), ('substructure', 0.041), ('investigated', 0.041), ('syntactic', 0.04), ('advp', 0.038), ('bundle', 0.038), ('burning', 0.038), ('cafeis', 0.038), ('characterise', 0.038), ('cyber', 0.038), ('iral', 0.038), ('josefvan', 0.038), ('logpr', 0.038), ('macwhinney', 0.038), ('misclassifications', 0.038), ('mutton', 0.038), ('ppim', 0.038), ('processability', 0.038), ('refaeilzadeh', 0.038), ('sylviane', 0.038), ('unspecified', 0.038), ('wardhaugh', 0.038), ('stanford', 0.038), ('park', 0.037), ('centre', 0.037), ('nnp', 0.037), ('marking', 0.037), ('rappoport', 0.037), ('illustrating', 0.037), ('lexicalised', 0.037), 
('weak', 0.036), ('perhaps', 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification


2 0.094507001 95 emnlp-2011-Multi-Source Transfer of Delexicalized Dependency Parsers

Author: Ryan McDonald ; Slav Petrov ; Keith Hall

Abstract: We present a simple method for transferring dependency parsers from source languages with labeled training data to target languages without labeled training data. We first demonstrate that delexicalized parsers can be directly transferred between languages, producing significantly higher accuracies than unsupervised parsers. We then use a constraint driven learning algorithm where constraints are drawn from parallel corpora to project the final parser. Unlike previous work on projecting syntactic resources, we show that simple methods for introducing multiple source languages can significantly improve the overall quality of the resulting parsers. The projected parsers from our system result in state-of-the-art performance when compared to previously studied unsupervised and projected parsing systems across eight different languages.

3 0.08603283 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases

Author: Daniel Dahlmeier ; Hwee Tou Ng

Abstract: We present a novel approach for automatic collocation error correction in learner English which is based on paraphrases extracted from parallel corpora. Our key assumption is that collocation errors are often caused by semantic similarity in the first language (L1 language) of the writer. An analysis of a large corpus of annotated learner English confirms this assumption. We evaluate our approach on real-world learner data and show that L1-induced paraphrases outperform traditional approaches based on edit distance, homophones, and WordNet synonyms.

4 0.0859944 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding

Author: Marco Dinarelli ; Sophie Rosset

Abstract: Reranking models have been successfully applied to many tasks of Natural Language Processing. However, there are two aspects of this approach that need a deeper investigation: (i) Assessment of hypotheses generated for reranking at classification phase: baseline models generate a list of hypotheses and these are used for reranking without any assessment; (ii) Detection of cases where reranking models provide a worse result: the best hypothesis provided by the reranking model is assumed to be always the best result. In some cases the reranking model provides an incorrect hypothesis while the baseline best hypothesis is correct, especially when baseline models are accurate. In this paper we propose solutions for these two aspects: (i) a semantic inconsistency metric to select possibly more correct n-best hypotheses, from a large set generated by an SLU baseline model. The selected hypotheses are reranked applying a state-of-the-art model based on Partial Tree Kernels, which encode SLU hypotheses in Support Vector Machines with complex structured features; (ii) finally, we apply a decision strategy, based on confidence values, to select the final hypothesis between the first ranked hypothesis provided by the baseline SLU model and the first ranked hypothesis provided by the re-ranker. We show the effectiveness of these solutions presenting comparative results obtained reranking hypotheses generated by a very accurate Conditional Random Field model. We evaluate our approach on the French MEDIA corpus. The results show significant improvements with respect to current state-of-the-art and previous re-ranking models.

5 0.085843995 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax

Author: Jiajun Zhang ; Feifei Zhai ; Chengqing Zong

Abstract: Due to its explicit modeling of the grammaticality of the output via target-side syntax, the string-to-tree model has been shown to be one of the most successful syntax-based translation models. However, a major limitation of this model is that it does not utilize any useful syntactic information on the source side. In this paper, we analyze the difficulties of incorporating source syntax in a string-to-tree model. We then propose a new way to use the source syntax in a fuzzy manner, both in source syntactic annotation and in rule matching. We further explore three algorithms in rule matching: 0-1 matching, likelihood matching, and deep similarity matching. Our method not only guarantees grammatical output with an explicit target tree, but also enables the system to choose the proper translation rules via fuzzy use of the source syntax. Our extensive experiments have shown significant improvements over the state-of-the-art string-to-tree system.

6 0.084774606 16 emnlp-2011-Accurate Parsing with Compact Tree-Substitution Grammars: Double-DOP

7 0.081240505 60 emnlp-2011-Feature-Rich Language-Independent Syntax-Based Alignment for Statistical Machine Translation

8 0.080734402 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions

9 0.077934049 103 emnlp-2011-Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus

10 0.074320845 102 emnlp-2011-Parse Correction with Specialized Models for Difficult Attachment Types

11 0.070255682 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation

12 0.069265209 55 emnlp-2011-Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models

13 0.068106785 50 emnlp-2011-Evaluating Dependency Parsing: Robust and Heuristics-Free Cross-Annotation Evaluation

14 0.067626804 125 emnlp-2011-Statistical Machine Translation with Local Language Models

15 0.067101873 134 emnlp-2011-Third-order Variational Reranking on Packed-Shared Dependency Forests

16 0.066525094 4 emnlp-2011-A Fast, Accurate, Non-Projective, Semantically-Enriched Parser

17 0.063369803 136 emnlp-2011-Training a Parser for Machine Translation Reordering

18 0.062303677 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features

19 0.060441528 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts

20 0.059691049 132 emnlp-2011-Syntax-Based Grammaticality Improvement using CCG and Guided Search


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.216), (1, 0.031), (2, -0.002), (3, 0.046), (4, -0.022), (5, 0.007), (6, -0.161), (7, 0.1), (8, 0.046), (9, -0.039), (10, -0.028), (11, -0.069), (12, -0.025), (13, 0.084), (14, -0.065), (15, -0.04), (16, -0.001), (17, -0.061), (18, -0.013), (19, -0.078), (20, -0.071), (21, 0.128), (22, 0.114), (23, -0.03), (24, 0.01), (25, 0.084), (26, 0.132), (27, 0.042), (28, 0.003), (29, 0.16), (30, 0.052), (31, 0.007), (32, -0.032), (33, -0.007), (34, -0.317), (35, -0.004), (36, 0.036), (37, -0.003), (38, -0.097), (39, -0.097), (40, -0.12), (41, 0.064), (42, 0.018), (43, 0.104), (44, -0.05), (45, -0.198), (46, -0.019), (47, -0.093), (48, -0.075), (49, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93955576 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification


2 0.50142771 16 emnlp-2011-Accurate Parsing with Compact Tree-Substitution Grammars: Double-DOP

Author: Federico Sangati ; Willem Zuidema

Abstract: We present a novel approach to Data-Oriented Parsing (DOP). Like other DOP models, our parser utilizes syntactic fragments of arbitrary size from a treebank to analyze new sentences, but, crucially, it uses only those which are encountered at least twice. This criterion allows us to work with a relatively small but representative set of fragments, which can be employed as the symbolic backbone of several probabilistic generative models. For parsing we define a transform-backtransform approach that allows us to use standard PCFG technology, making our results easily replicable. According to standard Parseval metrics, our best model is on par with many state-of-the-art parsers, while offering some complementary benefits: a simple generative probability model, and an explicit representation of the larger units of grammar.

3 0.4936296 60 emnlp-2011-Feature-Rich Language-Independent Syntax-Based Alignment for Statistical Machine Translation

Author: Jason Riesa ; Ann Irvine ; Daniel Marcu

Abstract: unknown-abstract

4 0.47196853 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding

Author: Marco Dinarelli ; Sophie Rosset

Abstract: Reranking models have been successfully applied to many tasks of Natural Language Processing. However, there are two aspects of this approach that need a deeper investigation: (i) Assessment of hypotheses generated for reranking at classification phase: baseline models generate a list of hypotheses and these are used for reranking without any assessment; (ii) Detection of cases where reranking models provide a worse result: the best hypothesis provided by the reranking model is assumed to be always the best result. In some cases the reranking model provides an incorrect hypothesis while the baseline best hypothesis is correct, especially when baseline models are accurate. In this paper we propose solutions for these two aspects: (i) a semantic inconsistency metric to select possibly more correct n-best hypotheses, from a large set generated by an SLU baseline model. The selected hypotheses are reranked applying a state-of-the-art model based on Partial Tree Kernels, which encode SLU hypotheses in Support Vector Machines with complex structured features; (ii) finally, we apply a decision strategy, based on confidence values, to select the final hypothesis between the first ranked hypothesis provided by the baseline SLU model and the first ranked hypothesis provided by the re-ranker. We show the effectiveness of these solutions presenting comparative results obtained reranking hypotheses generated by a very accurate Conditional Random Field model. We evaluate our approach on the French MEDIA corpus. The results show significant improvements with respect to current state-of-the-art and previous re-ranking models.

5 0.42372656 103 emnlp-2011-Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus

Author: Emily M. Bender ; Dan Flickinger ; Stephan Oepen ; Yi Zhang

Abstract: In order to obtain a fine-grained evaluation of parser accuracy over naturally occurring text, we study 100 examples each of ten reasonably frequent linguistic phenomena, randomly selected from a parsed version of the English Wikipedia. We construct a corresponding set of gold-standard target dependencies for these 1000 sentences, operationalize mappings to these targets from seven state-of-the-art parsers, and evaluate the parsers against this data to measure their level of success in identifying these dependencies.

6 0.41903511 102 emnlp-2011-Parse Correction with Specialized Models for Difficult Attachment Types

7 0.40415597 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing

8 0.39900571 55 emnlp-2011-Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models

9 0.37659141 134 emnlp-2011-Third-order Variational Reranking on Packed-Shared Dependency Forests

10 0.36166644 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases

11 0.35558838 95 emnlp-2011-Multi-Source Transfer of Delexicalized Dependency Parsers

12 0.32949573 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts

13 0.32160613 121 emnlp-2011-Semi-supervised CCG Lexicon Extension

14 0.31542307 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming

15 0.30655271 84 emnlp-2011-Learning the Information Status of Noun Phrases in Spoken Dialogues

16 0.30416995 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.

17 0.29936868 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax

18 0.29596689 111 emnlp-2011-Reducing Grounded Learning Tasks To Grammatical Inference

19 0.28800341 72 emnlp-2011-Improved Transliteration Mining Using Graph Reinforcement

20 0.27692589 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(15, 0.011), (23, 0.088), (36, 0.035), (37, 0.03), (45, 0.069), (53, 0.027), (54, 0.039), (55, 0.287), (57, 0.012), (62, 0.02), (64, 0.042), (66, 0.06), (69, 0.013), (79, 0.062), (82, 0.017), (87, 0.012), (90, 0.027), (96, 0.044), (98, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.72370273 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification

Author: Sze-Meng Jojo Wong ; Mark Dras

Abstract: Attempts to profile authors according to their characteristics extracted from textual data, including native language, have drawn attention in recent years, via various machine learning approaches utilising mostly lexical features. Drawing on the idea of contrastive analysis, which postulates that syntactic errors in a text are to some extent influenced by the native language of an author, this paper explores the usefulness of syntactic features for native language identification. We take two types of parse substructure as features— horizontal slices of trees, and the more general feature schemas from discriminative parse reranking—and show that using this kind of syntactic feature results in an accuracy score in classification of seven native languages of around 80%, an error reduction of more than 30%.

2 0.47606137 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features

Author: Christos Christodoulopoulos ; Sharon Goldwater ; Mark Steedman

Abstract: In this paper we present a fully unsupervised syntactic class induction system formulated as a Bayesian multinomial mixture model, where each word type is constrained to belong to a single class. By using a mixture model rather than a sequence model (e.g., HMM), we are able to easily add multiple kinds of features, including those at both the type level (morphology features) and token level (context and alignment features, the latter from parallel corpora). Using only context features, our system yields results comparable to state-of-the-art, far better than a similar model without the one-class-per-type constraint. Using the additional features provides added benefit, and our final system outperforms the best published results on most of the 25 corpora tested.

3 0.47568214 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing

Author: Amit Dubey ; Frank Keller ; Patrick Sturt

Abstract: This paper introduces a psycholinguistic model of sentence processing which combines a Hidden Markov Model noun phrase chunker with a co-reference classifier. Both models are fully incremental and generative, giving probabilities of lexical elements conditional upon linguistic structure. This allows us to compute the information theoretic measure of surprisal, which is known to correlate with human processing effort. We evaluate our surprisal predictions on the Dundee corpus of eye-movement data and show that our model achieves a better fit with human reading times than a syntax-only model which does not have access to co-reference information.

4 0.4690305 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

Author: Kevin Gimpel ; Noah A. Smith

Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and Urdu-English translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.

5 0.46847224 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning

Author: Edward Grefenstette ; Mehrnoosh Sadrzadeh

Abstract: Modelling compositional meaning for sentences using empirical distributional methods has been a challenge for computational linguists. We implement the abstract categorical model of Coecke et al. (2010) using data from the BNC and evaluate it. The implementation is based on unsupervised learning of matrices for relational words and applying them to the vectors of their arguments. The evaluation is based on the word disambiguation task developed by Mitchell and Lapata (2008) for intransitive sentences, and on a similar new experiment designed for transitive sentences. Our model matches the results of its competitors in the first experiment, and betters them in the second. The general improvement in results with increase in syntactic complexity showcases the compositional power of our model.

6 0.45988199 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

7 0.45971602 136 emnlp-2011-Training a Parser for Machine Translation Reordering

8 0.45894486 97 emnlp-2011-Multiword Expression Identification with Tree Substitution Grammars: A Parsing tour de force with French

9 0.4588412 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming

10 0.4580667 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases

11 0.45768979 107 emnlp-2011-Probabilistic models of similarity in syntactic context

12 0.45762244 140 emnlp-2011-Universal Morphological Analysis using Structured Nearest Neighbor Prediction

13 0.45752963 87 emnlp-2011-Lexical Generalization in CCG Grammar Induction for Semantic Parsing

14 0.455722 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax

15 0.45561969 111 emnlp-2011-Reducing Grounded Learning Tasks To Grammatical Inference

16 0.45452315 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation

17 0.45281452 39 emnlp-2011-Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model

18 0.45242202 132 emnlp-2011-Syntax-Based Grammaticality Improvement using CCG and Guided Search

19 0.45130104 56 emnlp-2011-Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases

20 0.45101547 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation