acl acl2012 acl2012-83 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Claire Gardent ; Shashi Narayan
Abstract: In recent years, error mining approaches were developed to help identify the most likely sources of parsing failures in parsing systems using handcrafted grammars and lexicons. However, the techniques they use to enumerate and count n-grams build on the sequential nature of a text corpus and do not easily extend to structured data. In this paper, we propose an algorithm for mining trees and apply it to detect the most likely sources of generation failure. We show that this tree mining algorithm permits identifying not only errors in the generation system (grammar, lexicon) but also mismatches between the structures contained in the input and the input structures expected by our generator, as well as a few idiosyncrasies/errors in the input data.
Reference: text
sentIndex sentText sentNum sentScore
1 In recent years, error mining approaches were developed to help identify the most likely sources of parsing failures in parsing systems using handcrafted grammars and lexicons. [sent-3, score-0.408]
2 In this paper, we propose an algorithm for mining trees and apply it to detect the most likely sources of generation failure. [sent-5, score-0.646]
3 We show that this tree mining algorithm permits identifying not only errors in the generation system (grammar, lexicon) but also mismatches between the structures contained in the input and the input structures expected by our generator, as well as a few idiosyncrasies/errors in the input data. [sent-6, score-1.134]
4 1 Introduction In recent years, error mining techniques have been developed to help identify the most likely sources of parsing failure (van Noord, 2004; Sagot and de la Clergerie, 2006; de Kok et al., 2009). [sent-7, score-0.557]
5 For each n-gram of words (and/or part-of-speech tags) occurring in the corpus to be parsed, a suspicion rate is then computed which, in essence, captures the likelihood that this n-gram causes parsing to fail. [sent-10, score-0.675]
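As a concrete illustration of this flat, n-gram-based setting, here is a minimal sketch (not the cited systems' actual implementations, which use iterative refinement) that takes the failure ratio of each n-gram as a first-order suspicion rate; all names and the toy data are illustrative:

```python
from collections import Counter

def ngrams(tokens, n):
    """Enumerate the n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def suspicion_rates(failed, passed, n=2):
    """Map each n-gram to fail_count / total_count over both corpora.

    An n-gram is counted at most once per sentence, so the ratio is the
    proportion of sentences containing it whose parse (or generation) failed.
    """
    fail, total = Counter(), Counter()
    for sentence in failed:
        for g in set(ngrams(sentence, n)):
            fail[g] += 1
            total[g] += 1
    for sentence in passed:
        for g in set(ngrams(sentence, n)):
            total[g] += 1
    return {g: fail[g] / total[g] for g in total}

# Toy data: the bigram ('b', 'c') only ever occurs in failed inputs,
# so its suspicion rate is 1.0.
failed = [["a", "b", "c"], ["x", "b", "c"]]
passed = [["a", "b", "d"]]
print(sorted(suspicion_rates(failed, passed).items(), key=lambda kv: -kv[1])[:3])
```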
6 These error mining techniques have been applied with good results on parsing output and shown to help improve the large-scale symbolic grammars and lexicons used by these parsers. [sent-11, score-0.410]
7 There are some NLP applications, though, where the processed data is structured data such as trees or graphs and which would benefit from error mining. [sent-17, score-0.313]
8 In particular, for generation systems taking as input the Generation Challenge Surface Realisation (SR) Task data (Belz et al., 2011), it would be useful to be able to apply error mining on the input trees to find the most likely causes of generation failure. [sent-19, score-0.751]
9 In this paper, we address this issue and propose an approach that supports error mining on trees. [sent-20, score-0.318]
10 We adapt an existing algorithm for tree mining which we then use to mine the Generation Challenge dependency trees and identify the most likely causes of generation failure. [sent-21, score-0.87]
11 The fact that the input provided by the SR task fails to match the input expected by the symbolic generation systems (Belz et al., 2011). [sent-25, score-0.402]
12 Section 2 presents the HybridTreeMiner algorithm (Chi et al., 2004) for discovering frequently occurring subtrees in a database of labelled unordered trees. [sent-32, score-0.293]
13 Section 3 shows how to adapt this algorithm to mine the SR dependency trees for subtrees with high suspicion rate. [sent-33, score-1.062]
14 Section 4 presents an experiment we ran using the resulting tree mining algorithm on SR dependency trees and summarises the results. [sent-34, score-0.675]
15 The HybridTreeMiner algorithm (Chi et al., 2004) provides a complete and computationally efficient method for discovering frequently occurring subtrees in a database of labelled unordered trees and counting them. [sent-39, score-0.496]
16 Given a set of trees T, the HybridTreeMiner algorithm proceeds in two steps. [sent-42, score-0.25]
17 First, the unordered labelled trees contained in T are converted to a canonical form called BFCF (Breadth-First Canonical Form). [sent-43, score-0.411]
18 In that way, distinct instantiations of the same unordered tree have a unique representation. [sent-44, score-0.290]
19 Second, the subtrees of the BFCF trees are enumerated in increasing size order using two tree operations called join and extension, and their support (the number of trees in the database that contain each subtree) is recorded. [sent-45, score-0.835]
20 In effect, the algorithm builds an enumeration tree whose nodes are the possible subtrees of T and such that, at depth d of this enumeration tree, all possible frequent subtrees consisting of d nodes are listed. [sent-46, score-0.663]
21 The BFCF canonical form of an unordered tree is an ordered tree t such that t has the smallest breadth-first canonical string (BFCS) encoding according to lexicographic order. [sent-49, score-0.479]
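A simplified sketch of the canonicalization idea, assuming trees are encoded as (label, children) pairs. Note that Chi et al.'s BFCF/BFCS is breadth-first; the depth-first variant below is not their exact encoding but illustrates how recursive sorting gives every unordered tree a unique representative:

```python
def canonical(tree):
    """Return a canonical string for an unordered labelled tree.

    A tree is a (label, [child, ...]) pair.  Sorting children by their own
    canonical encodings makes the result order-independent, so distinct
    instantiations of the same unordered tree get a unique representation.
    """
    label, children = tree
    encoded = sorted(canonical(c) for c in children)
    return label + "(" + ",".join(encoded) + ")" if encoded else label

# Two orderings of the same unordered tree yield the same canonical string.
t1 = ("VBD", [("NNP", []), ("IN", [("CD", [])])])
t2 = ("VBD", [("IN", [("CD", [])]), ("NNP", [])])
assert canonical(t1) == canonical(t2)
print(canonical(t1))  # VBD(IN(CD),NNP)
```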
22 The join and extension operations used to iteratively enumerate subtrees are depicted in Figure 2 and can be defined as follows. [sent-53, score-0.321]
23 • Extension: Given a tree t of height ht and a node n, extending t with n yields a tree t′ (a child of t in the enumeration tree) with height ht′ such that n is a child of one of t's legs and ht′ = ht + 1. [sent-55, score-0.440]
24 • Join: Given two trees t1 and t2 of the same height h differing only in their rightmost leg and such that t1 sorts lower than t2, joining t1 and t2 yields a tree t′ (a child of t1 in the enumeration tree) of the same height h by adding the rightmost leg of t2 to t1 at level h − 1. [sent-56, score-0.747]
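A rough sketch of these two growth operations, under the simplifying assumption that a subtree is encoded as a breadth-first list of (parent_position, label) pairs with -1 marking the root; the height and sort-order side conditions of the actual HTM operations are omitted:

```python
def extend(tree, parent_pos, label):
    """Extension: grow tree by one node attached under position parent_pos."""
    assert 0 <= parent_pos < len(tree)
    return tree + [(parent_pos, label)]

def join(t1, t2):
    """Join: t1 and t2 must agree on every node except the last one.

    Returns t1 grown with the rightmost leg (last node) of t2, or None if
    the two subtrees are not joinable.
    """
    if len(t1) != len(t2) or t1[:-1] != t2[:-1]:
        return None
    return t1 + [t2[-1]]

# (parent, label) encodings: a root VBD with one child IN.
t = [(-1, "VBD"), (0, "IN")]
print(extend(t, 1, "CD"))                                       # adds CD under IN
print(join([(-1, "VBD"), (0, "IN")], [(-1, "VBD"), (0, "NNP")]))  # merges the legs
```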
25 Each subtree is associated with an occurrence list which keeps track of all trees in which this subtree occurs and of its position in the tree (represented by the list of tree nodes mapped onto by the subtree). [sent-58, score-0.583]
26 For the join operation, the subtrees being combined must occur in the same tree at the same position (the intersection of their occurrence lists must be non-empty and the tree nodes must match except the last node). [sent-61, score-0.505]
27 For the extension operation, the extension of a tree t is licensed for any given occurrence in the occurrence list only if the planned extension maps onto the tree identified by the occurrence. [sent-62, score-0.541]
28 3 Mining Dependency Trees We develop an algorithm (called ErrorTreeMiner, ETM) which adapts the HybridTreeMiner algorithm to mine sources of generation errors in the Generation Challenge SR shallow input data. [sent-63, score-0.444]
29 The main modification is that instead of simply counting trees, we want to compute their suspicion rate. [sent-64, score-0.554]
30 Since we work with subtrees of arbitrary length, we also need to check whether constructing a longer subtree is useful, that is, whether its suspicion rate is equal to or higher than the suspicion rate of any of the subtrees it contains. [sent-67, score-1.688]
31 As noted in (de Kok et al., 2009), this also permits bypassing suspicion sharing, that is, the fact that, if n2 is the cause of a generation failure and n2 is contained in larger trees n3 and n4, then all three trees will have a high suspicion rate, making it difficult to identify the actual source of failure, namely n2. [sent-70, score-2.003]
32 Because we use a milder condition, however (we accept bigger trees whose suspicion rate is equal to the suspicion rate of any of their subtrees), some amount of suspicion sharing remains. [Algorithm 1 ErrorTreeMiner(D, minsup) — D consists of Dfail and Dpass; F1 ← {frequent 1-trees}; F2 ← ∅; for i ← 1, ...] [sent-71, score-1.553]
33 First, dependency trees are converted to Breadth-First Canonical Form, whereby lexicographic order can apply to the word forms labelling tree nodes, to their part of speech, to their dependency relation, or to any combination thereof (see footnote 3). [sent-93, score-0.626]
34 The enumeration is then continued by extending the trees using the join and extension operations. [sent-98, score-0.351]
35 As explained in Section 2 above, join and extension only apply provided the resulting trees occur in the data (this is checked by looking up occurrence lists). [sent-99, score-0.432]
36 Footnote 3: For convenience, the dependency relation labelling the edges of dependency trees is brought down to the daughter node of the edge. [sent-100, score-0.413]
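A sketch of this conversion, reusing the (label, children) tree encoding from the canonicalization sketch above. The token format (dicts with a head index) and all names are illustrative assumptions; per footnote 3, each node carries the dependency relation of its incoming edge:

```python
def to_mining_tree(tokens, idx, which_labels=("pos",)):
    """Convert a dependency tree to a (label, children) mining tree.

    tokens: list of dicts with 'form', 'pos', 'dep' and 'head' (index of the
    governing token, -1 for the root).  which_labels selects which pieces of
    information (word form, POS, dependency) make up the node label.
    """
    token = tokens[idx]
    label = "|".join(str(token[k]) for k in which_labels)
    children = [to_mining_tree(tokens, i, which_labels)
                for i, t in enumerate(tokens) if t["head"] == idx]
    return (label, children)

# "more than 10" under an illustrative (invented) dependency analysis.
toks = [{"form": "more", "pos": "RBR", "dep": "root", "head": -1},
        {"form": "than", "pos": "IN", "dep": "mod", "head": 0},
        {"form": "10", "pos": "CD", "dep": "obj", "head": 0}]
print(to_mining_tree(toks, 0, which_labels=("pos", "dep")))
```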
37 Each time an n-node tree tn is built, it is checked that (i) its support is above the set threshold and (ii) its suspicion rate is higher than or equal to the suspicion rate of all (n − 1)-node subtrees of tn. [sent-101, score-1.631]
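These two tests can be phrased as a single admission predicate; a sketch with assumed helpers (the stats lookup table and the subtree enumeration are not spelled out in the excerpt, and the numbers below are invented):

```python
def keep(candidate, stats, minsup, subtrees_of):
    """Admission tests (i) and (ii) for an n-node candidate subtree.

    stats maps already-enumerated subtrees to (support, suspicion) pairs;
    subtrees_of yields the candidate's (n-1)-node subtrees, whose statistics
    exist because enumeration proceeds breadth-first.
    """
    support, suspicion = stats[candidate]
    if support < minsup:                      # (i) frequency threshold
        return False
    return all(suspicion >= stats[s][1]       # (ii) suspicion monotonicity
               for s in subtrees_of(candidate))

# Invented numbers: CD(IN) is as suspicious as CD and more suspicious than IN.
stats = {"CD(IN)": (30, 0.8), "CD": (100, 0.8), "IN": (200, 0.4)}
print(keep("CD(IN)", stats, minsup=5, subtrees_of=lambda t: ["CD", "IN"]))  # True
```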
38 First, while HTM explores the enumeration tree depth-first, ETM proceeds breadth-first to ensure that the suspicion rate of (n − 1)-node trees is always available when checking whether an n-node tree should be introduced. [sent-103, score-1.224]
39 As a result, while ETM loses the space advantage of HTM by a small margin (footnote 4), it benefits from a much stronger pruning of the search space than HTM through suspicion rate checking. [sent-105, score-0.675]
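Putting the pieces together, a high-level sketch of the breadth-first enumeration. candidates_from (join and extension expansion), stats_for (support and suspicion over Dfail/Dpass) and keep (the admission test sketched above, with minsup and the subtree enumeration bound in) are assumed callables; this paraphrases rather than reproduces Algorithm 1:

```python
from collections import deque

def error_tree_miner(frequent_1_trees, candidates_from, stats_for, keep):
    """Breadth-first enumeration of suspicious subtrees (ETM sketch).

    A FIFO queue seeded with all 1-trees processes level n completely
    before level n+1, so every (n-1)-node suspicion rate is known when
    an n-node candidate is tested.
    """
    stats = {t: stats_for(t) for t in frequent_1_trees}
    queue = deque(frequent_1_trees)
    suspects = []
    while queue:
        tree = queue.popleft()
        suspects.append((tree, stats[tree]))
        for bigger in candidates_from(tree):      # join and extension results
            if bigger not in stats:
                stats[bigger] = stats_for(bigger)
                if keep(bigger, stats):           # support + suspicion tests
                    queue.append(bigger)
    return sorted(suspects, key=lambda ts: -ts[1][1])  # rank by suspicion
```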
40 4 Experiment and Results Using the input data provided by the Generation Challenge SR Task, we applied the error mining algorithm described in the preceding Section to debug and extend a symbolic surface realiser developed for this task. [sent-109, score-0.823]
41 It consists of a set of unordered labelled syntactic dependency trees whose nodes are labelled with word forms, part of speech categories, partial morphosyntactic information such as tense and number and, in some cases, a sense tag identifier. [sent-112, score-0.607]
42 The words of the sentence are represented by nodes in the tree, and the alignment between nodes and word forms was provided by the organisers. [sent-115, score-0.25]
43 The surface realiser used is a system based on a Feature-Based Lexicalised Tree Adjoining Grammar (FB-LTAG) for English extended with a unification-based compositional semantics. [sent-116, score-0.286]
44 The surface realisation algorithm extends the algorithm proposed in (Gardent and Perez-Beltrachini, 2010) and adapts it to work on the SR dependency input rather than on flat semantic representations. [sent-119, score-0.53]
45 2 Experimental Setup To facilitate interpretation, we first chunked the input data into NPs, PPs and clauses and performed error mining on the resulting sets of data. [sent-121, score-0.397]
46 The chunking was performed by retrieving from the Penn Treebank (PTB), for each phrase type, the yields of the constituents of that type and by using the alignment between words and dependency tree nodes provided by the organisers of the SR Task. [sent-122, score-0.288]
47 Using this chunked data, we then ran the generator on the corresponding SR Task dependency trees and stored separately the input dependency trees for which generation succeeded and the input dependency trees for which generation failed. [sent-124, score-1.374]
48 Using information provided by the generator, we then removed from the failed data those cases where generation failed either because a word was missing in the lexicon or because a TAG tree/family required by the lexicon and the input data was missing in the grammar. [sent-125, score-0.627]
49 These cases can easily be detected using the generation system and thus do not need to be handled by error mining. [sent-126, score-0.261]
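As a pipeline sketch of this preparation step (run_generator and the two diagnostic predicates stand in for the generator's own reporting and are assumptions, not an actual API):

```python
def prepare_mining_data(inputs, run_generator,
                        missing_lexical_item, missing_tag_tree):
    """Split SR dependency trees into Dpass and Dfail for error mining.

    Failures already explained by a lexicon gap or a missing TAG tree/family
    are filtered out: the generator reports those directly, so error mining
    does not need to rediscover them.
    """
    d_pass, d_fail = [], []
    for tree in inputs:
        result = run_generator(tree)
        if result.succeeded:
            d_pass.append(tree)
        elif not (missing_lexical_item(result) or missing_tag_tree(result)):
            d_fail.append(tree)
    return d_pass, d_fail
```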
50 3 Results One feature of our approach is that it permits mining the data for tree patterns of arbitrary size using different types of labelling information (POS tags, dependencies, word forms and any combination thereof). [sent-129, score-0.619]
51 1 Mining on single labels (word form, POS tag or dependency) Mining on a single label permits (i) assessing the relative impact of each category in a given label type and (ii) identifying different sources of errors depending on the type of label considered (POS tag, dependency or word form). [sent-133, score-0.353]
52 Mining on POS tags Table 1 illustrates how mining on a single label (in this case, POS tags) gives a good overview of how the different categories in that label type impact generation: two POS tags (POS and CC) have a suspicion rate of 0. [sent-134, score-0.883]
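A sketch of how such a single-label table can be computed, reusing the failure-ratio idea from the first sketch; counting a label once per tree and the toy counts are illustrative assumptions:

```python
from collections import Counter

def label_suspicion(d_fail, d_pass):
    """Rank single labels (here POS tags) by fail_count / total_count.

    Each input is the set of labels decorating one dependency tree; a
    label is counted at most once per tree, as in the n-gram sketch above.
    """
    fail, total = Counter(), Counter()
    for labels in d_fail:
        for lab in set(labels):
            fail[lab] += 1
            total[lab] += 1
    for labels in d_pass:
        for lab in set(labels):
            total[lab] += 1
    return sorted(((lab, fail[lab] / total[lab]) for lab in total),
                  key=lambda pair: -pair[1])

# Invented toy counts: every tree containing the tag POS (possessive 's) fails.
print(label_suspicion(d_fail=[["POS", "NN"], ["POS", "DT"]],
                      d_pass=[["NN", "DT"]]))
```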
53 Other POS tags with much lower suspicion rates indicate that there are unresolved issues with, in decreasing order of suspicion rate, cardinal numbers (CD), proper names (NNP), nouns (NN), prepositions (IN) and determiners (DT). [sent-136, score-1.402]
54 Hence whenever a possessive appears in the input data, generation fails. [sent-141, score-0.28]
55 [Table 1 caption fragment: ... and displaying only trees of size 1, sorted by decreasing suspicion rate (Sus).] The second highest ranked category is CC, for coordinations. [sent-160, score-0.912]
56 In this case, error mining unveils a bug in the grammar trees associated with conjunctions, which made all sentences containing a conjunction fail. [sent-161, score-0.579]
57 Because the grammar is compiled out of a strongly factorised description, errors in this description can propagate to a large number of trees in the grammar. [sent-162, score-0.3]
58 It turned out that an error occurred in a class inherited by all conjunction trees, thereby blocking the generation of any sentence requiring the use of a conjunction. [sent-163, score-0.464]
59 Next, but with a much lower suspicion rate, come cardinal numbers (CD), proper names (NNP), nouns (NN), prepositions (IN) and determiners (DT). [sent-164, score-0.776]
60 We will see below how the richer information provided by mining for larger tree patterns with mixed labelling information permits identifying the contexts in which these POS tags lead to generation failure. [sent-165, score-0.768]
61 In this way, we found, for instance, that cardinal numbers induced many generation failures whenever they were categorised as determiners but not as nouns in our lexicon. [sent-168, score-0.305]
62 One interesting case stood out which pointed to idiosyncrasies in the input data: the word form $ (Sus=1) was assigned the POS tag $ in the input data, a POS tag which is unknown to our system and not documented in the SR Task guidelines. [sent-170, score-0.302]
63 Mining on Dependencies When mining on dependencies, suspects can point to syntactic constructions (rather than words or word categories) that are not easily spotted when mining on words or parts of speech. [sent-174, score-0.416]
64 Thus, while problems with coordination could easily be spotted through a high suspicion rate for the CC POS tag, some constructions are linked neither to a specific POS tag nor to a specific word. [sent-175, score-0.747]
65 This is the case, for instance, for apposition, which has a suspicion rate of 0. [sent-176, score-0.675]
66 A suspicion rate of 0.54 (183F/155P) on the TMP dependency indicates that temporal modifiers are not correctly handled, either because of missing or erroneous information in the grammar or because of a mismatch between the input data and the format expected by the surface realiser. [sent-179, score-0.435]
67 Interestingly, the underspecified dependency relation DEP, which is typically used in cases for which no obvious syntactic dependency comes to mind, shows a suspicion rate of 0. [sent-180, score-0.825]
68 2 Mining on trees of arbitrary size and complex labelling patterns While error mining with tree patterns of size one permits ranking and qualifying the various sources of errors, larger patterns often provide more detailed contextual information about these errors. [sent-184, score-1.016]
69 For instance, Table 1 shows that the CD POS tag has a suspicion rate of 0. [sent-185, score-0.747]
70 The larger tree patterns identified below permit a more specific characterization of the contexts in which this POS tag co-occurs with generation failure: TP1 CD(IN,RBR), e.g. more than 10; TP2 IN(CD), e.g. of 1991; TP3 NNP(CD), e.g. November 1; TP4 CD(NNP(CD)), e.g. Nov. [sent-187, score-0.537]
71 TP7 points to a mismatch between the SR data and the format expected by the generator: while the latter expects the structure IN(RB), the input format provided by the SR Task is RB(IN). [sent-199, score-0.308]
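To make the TP notation concrete, a small illustrative matcher (not the paper's implementation): a pattern such as IN(CD) matches any node labelled IN with a child matching CD, extra children being allowed; trees reuse the (label, children) encoding from the earlier sketches:

```python
def matches(pattern, tree):
    """True if pattern occurs at the root of tree (extra children allowed).

    Simplified: two identical pattern children may map onto the same tree
    child, which is acceptable for this illustration.
    """
    p_label, p_children = pattern
    t_label, t_children = tree
    if p_label != t_label:
        return False
    return all(any(matches(pc, tc) for tc in t_children) for pc in p_children)

def occurs(pattern, tree):
    """True if pattern matches at any node of tree."""
    return matches(pattern, tree) or any(occurs(pattern, c) for c in tree[1])

# TP2 IN(CD), e.g. "of 1991": matches the IN node inside a larger tree.
tp2 = ("IN", [("CD", [])])
sentence_tree = ("VBD", [("NNP", []), ("IN", [("CD", [])])])
print(occurs(tp2, sentence_tree))  # True
```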
72 4 Improving Generation Using the Results of Error Mining Table 2 shows how implementing some of the corrections suggested by error mining impacts the number of NP chunks (size 4) that can be generated. [sent-201, score-0.353]
73 In this experiment, the total number of input (NP) dependency trees is 24995. [sent-202, score-0.357]
74 Before error mining, generation failed on 33% of these inputs. [sent-203, score-0.305]
75 Converting the input data to the correct input format to resolve the mismatch induced by possessive ’s (cf. [sent-207, score-0.3]
76 1) reduces generation failure to 21% (footnote 6), and combining both corrections results in a failure rate of 13%. [sent-210, score-0.295]
77 In other words, error mining permits quickly identifying two issues which, once corrected, reduce generation failure by 20 points. [sent-211, score-0.686]
78 When mining on clause-size chunks, other mismatches were identified, such as, in particular, mismatches introduced by subjects and auxiliaries. (Footnote 6: For NP of size 4, 3264 rewritten.) [sent-212, score-0.408]
79 The table compares the number of failures on NP chunks of size 4 before (first row) and after (second row) rewriting the SR data to the format expected by our generator and before (second column) and after (third column) correcting the grammar and lexicon errors discussed in Section 4. [sent-214, score-0.377]
80 5 Related Work We now relate our proposal (i) to previous proposals on error mining and (ii) to the use of error mining in natural language generation. [sent-217, score-0.636]
81 van Noord (2004) initiated error mining on parsing results with a very simple approach computing the parsability rate of each n-gram in a very large corpus. [sent-219, score-0.473]
82 de Kok et al. (2009) combined the iterative error mining proposed by Sagot and de la Clergerie (2006) with expansion of forms to n-grams of words and POS tags of arbitrary length. [sent-246, score-0.431]
83 Typically, the input to surface realisation is a structured representation (i.e., a tree or a graph). [sent-251, score-0.361]
84 Mining these structured representations thus permits identifying causes of undergeneration in surface realisation systems. [sent-254, score-0.412]
85 Error Mining for Generation Not much work has been done on mining the results of surface realisers. [sent-255, score-0.343]
86 In contrast, our approach works on the input to surface realisation, automatically separates correct from incorrect items using surface realisation and targets the most likely sources of errors rather than the absolute ones. [sent-257, score-0.572]
87 More generally, our approach is, to our knowledge, the first to mine a surface realiser for undergeneration. [sent-258, score-0.286]
88 Indeed, apart from (Gardent and Kow, 2007), most previous work on surface realisation evaluation has focused on evaluating the performance and the coverage of surface realisers. [sent-259, score-0.417]
89 In both cases, however, because it is produced using the grammar exploited by the surface realiser, the input produced can only be used to test for overgeneration (and performance). [sent-263, score-0.339]
90 The error mining approach we propose helps identify such mismatches automatically. [sent-266, score-0.418]
91 6 Conclusion Previous work on error mining has focused on applications (parsing) where the input data is sequential, working mainly on words and part-of-speech tags. [sent-267, score-0.397]
92 In this paper, we proposed a novel approach to error mining which permits mining trees. [sent-268, score-0.656]
93 We showed that this supports the identification of gaps and errors in the grammar and in the lexicon, and of mismatches between the input data format and the format expected by our realiser. [sent-270, score-0.380]
94 We applied our error mining approach to the input of a surface realiser to identify the most likely sources of undergeneration. [sent-271, score-0.72]
95 We also plan to explore how it can be used to detect the most likely sources of overgeneration based on the output of this surface realiser on the SR Task data. [sent-272, score-0.390]
96 Similarly, since the surface realiser is non-deterministic, the number of output trees to be kept will need to be experimented with. [sent-277, score-0.489]
97 Belz et al. (2011). The first surface realisation shared task: Overview and evaluation results. [sent-282, score-0.282]
98 Chi et al. (2004). HybridTreeMiner: An efficient algorithm for mining frequent rooted trees and free trees using canonical form. [sent-296, score-0.715]
99 de Kok et al. (2009). A generalized method for iterative error mining in parsing results. [sent-301, score-0.318]
100 Rajkumar et al. (2011). The OSU system for surface realization at Generation Challenges 2011. [sent-322, score-0.286]
wordName wordTfidf (topN-words)
[('suspicion', 0.554), ('mining', 0.208), ('trees', 0.203), ('generation', 0.151), ('realiser', 0.151), ('sr', 0.149), ('realisation', 0.147), ('tree', 0.142), ('subtrees', 0.139), ('surface', 0.135), ('permits', 0.13), ('rate', 0.121), ('gardent', 0.12), ('error', 0.11), ('cd', 0.102), ('lqueue', 0.101), ('mismatches', 0.1), ('join', 0.093), ('sus', 0.088), ('htm', 0.088), ('nnp', 0.087), ('unordered', 0.087), ('failure', 0.087), ('etm', 0.084), ('hybridtreeminer', 0.084), ('kok', 0.08), ('generator', 0.08), ('input', 0.079), ('dependency', 0.075), ('pos', 0.073), ('leg', 0.073), ('minsup', 0.073), ('tag', 0.072), ('fail', 0.07), ('labelled', 0.067), ('bfcf', 0.067), ('cardinal', 0.067), ('overgeneration', 0.067), ('sup', 0.067), ('enumeration', 0.062), ('lexicon', 0.06), ('tn', 0.06), ('labelling', 0.06), ('subtree', 0.06), ('sagot', 0.059), ('clergerie', 0.059), ('loria', 0.059), ('grammar', 0.058), ('symbolic', 0.058), ('extension', 0.055), ('canonical', 0.054), ('failures', 0.053), ('format', 0.052), ('bfcs', 0.05), ('expects', 0.05), ('fpu', 0.05), ('kow', 0.05), ('possessive', 0.05), ('father', 0.05), ('rightmost', 0.05), ('missing', 0.048), ('bb', 0.047), ('height', 0.047), ('algorithm', 0.047), ('occurrence', 0.046), ('failed', 0.044), ('mine', 0.044), ('belz', 0.044), ('penn', 0.043), ('patterns', 0.042), ('np', 0.041), ('mismatch', 0.04), ('chi', 0.04), ('errors', 0.039), ('de', 0.039), ('challenge', 0.039), ('enlg', 0.037), ('treebank', 0.037), ('la', 0.037), ('sources', 0.037), ('forms', 0.037), ('nodes', 0.036), ('cc', 0.036), ('chunks', 0.035), ('claire', 0.035), ('provided', 0.035), ('determiners', 0.034), ('sorted', 0.034), ('enumerate', 0.034), ('nps', 0.034), ('dominic', 0.034), ('errortreeminer', 0.034), ('espinosa', 0.034), ('iqf', 0.034), ('noord', 0.034), ('parsability', 0.034), ('rajkumar', 0.034), ('shashi', 0.034), ('whereby', 0.034), ('sort', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 83 acl-2012-Error Mining on Dependency Trees
Author: Claire Gardent ; Shashi Narayan
Abstract: In recent years, error mining approaches were developed to help identify the most likely sources of parsing failures in parsing systems using handcrafted grammars and lexicons. However, the techniques they use to enumerate and count n-grams build on the sequential nature of a text corpus and do not easily extend to structured data. In this paper, we propose an algorithm for mining trees and apply it to detect the most likely sources of generation failure. We show that this tree mining algorithm permits identifying not only errors in the generation system (grammar, lexicon) but also mismatches between the structures contained in the input and the input structures expected by our generator, as well as a few idiosyncrasies/errors in the input data.
2 0.090553619 4 acl-2012-A Comparative Study of Target Dependency Structures for Statistical Machine Translation
Author: Xianchao Wu ; Katsuhito Sudoh ; Kevin Duh ; Hajime Tsukada ; Masaaki Nagata
Abstract: This paper presents a comparative study of target dependency structures yielded by several state-of-the-art linguistic parsers. Our approach is to measure the impact of these nonisomorphic dependency structures to be used for string-to-dependency translation. Besides using traditional dependency parsers, we also use the dependency structures transformed from PCFG trees and predicate-argument structures (PASs) which are generated by an HPSG parser and a CCG parser. The experiments on Chinese-to-English translation show that the HPSG parser’s PASs achieved the best dependency and translation accuracies. 1
3 0.088966608 9 acl-2012-A Cost Sensitive Part-of-Speech Tagging: Differentiating Serious Errors from Minor Errors
Author: Hyun-Je Song ; Jeong-Woo Son ; Tae-Gil Noh ; Seong-Bae Park ; Sang-Jo Lee
Abstract: All types of part-of-speech (POS) tagging errors have been equally treated by existing taggers. However, the errors are not equally important, since some errors affect the performance of subsequent natural language processing (NLP) tasks seriously while others do not. This paper aims to minimize these serious errors while retaining the overall performance of POS tagging. Two gradient loss functions are proposed to reflect the different types of errors. They are designed to assign a larger cost to serious errors and a smaller one to minor errors. Through a set of POS tagging experiments, it is shown that the classifier trained with the proposed loss functions reduces serious errors compared to state-of-the-art POS taggers. In addition, the experimental result on text chunking shows that fewer serious errors help to improve the performance of sub- sequent NLP tasks.
4 0.078093775 103 acl-2012-Grammar Error Correction Using Pseudo-Error Sentences and Domain Adaptation
Author: Kenji Imamura ; Kuniko Saito ; Kugatsu Sadamitsu ; Hitoshi Nishikawa
Abstract: This paper presents grammar error correction for Japanese particles that uses discriminative sequence conversion, which corrects erroneous particles by substitution, insertion, and deletion. The error correction task is hindered by the difficulty of collecting large error corpora. We tackle this problem by using pseudoerror sentences generated automatically. Furthermore, we apply domain adaptation, the pseudo-error sentences are from the source domain, and the real-error sentences are from the target domain. Experiments show that stable improvement is achieved by using domain adaptation.
5 0.07730227 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures
Author: Danilo Croce ; Alessandro Moschitti ; Roberto Basili ; Martha Palmer
Abstract: In this paper, we propose innovative representations for automatic classification of verbs according to mainstream linguistic theories, namely VerbNet and FrameNet. First, syntactic and semantic structures capturing essential lexical and syntactic properties of verbs are defined. Then, we design advanced similarity functions between such structures, i.e., semantic tree kernel functions, for exploiting distributional and grammatical information in Support Vector Machines. The extensive empirical analysis on VerbNet class and frame detection shows that our models capture mean- ingful syntactic/semantic structures, which allows for improving the state-of-the-art.
6 0.075752169 19 acl-2012-A Ranking-based Approach to Word Reordering for Statistical Machine Translation
7 0.075742535 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities
8 0.075511158 57 acl-2012-Concept-to-text Generation via Discriminative Reranking
9 0.073490232 90 acl-2012-Extracting Narrative Timelines as Temporal Dependency Structures
10 0.073271036 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing
11 0.07315892 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers
12 0.071663827 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
13 0.070955351 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?
14 0.070775948 106 acl-2012-Head-driven Transition-based Parsing with Top-down Prediction
15 0.070751555 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing
16 0.067993492 170 acl-2012-Robust Conversion of CCG Derivations to Phrase Structure Trees
17 0.065898448 154 acl-2012-Native Language Detection with Tree Substitution Grammars
18 0.064658001 213 acl-2012-Utilizing Dependency Language Models for Graph-based Dependency Parsing Models
19 0.06369552 115 acl-2012-Identifying High-Impact Sub-Structures for Convolution Kernels in Document-level Sentiment Classification
20 0.063404717 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition
topicId topicWeight
[(0, -0.193), (1, 0.01), (2, -0.126), (3, -0.094), (4, -0.033), (5, 0.007), (6, 0.008), (7, 0.026), (8, 0.017), (9, -0.004), (10, -0.016), (11, -0.021), (12, -0.018), (13, 0.062), (14, 0.016), (15, -0.018), (16, 0.021), (17, -0.033), (18, 0.014), (19, -0.044), (20, -0.011), (21, -0.03), (22, 0.014), (23, 0.002), (24, -0.022), (25, 0.014), (26, 0.028), (27, -0.027), (28, -0.038), (29, 0.047), (30, 0.016), (31, 0.077), (32, -0.094), (33, -0.071), (34, 0.037), (35, -0.046), (36, 0.03), (37, -0.026), (38, -0.084), (39, -0.064), (40, 0.04), (41, -0.036), (42, 0.213), (43, -0.047), (44, -0.03), (45, -0.023), (46, -0.02), (47, -0.06), (48, -0.052), (49, -0.067)]
simIndex simValue paperId paperTitle
same-paper 1 0.95054013 83 acl-2012-Error Mining on Dependency Trees
Author: Claire Gardent ; Shashi Narayan
Abstract: In recent years, error mining approaches were developed to help identify the most likely sources of parsing failures in parsing systems using handcrafted grammars and lexicons. However, the techniques they use to enumerate and count n-grams build on the sequential nature of a text corpus and do not easily extend to structured data. In this paper, we propose an algorithm for mining trees and apply it to detect the most likely sources of generation failure. We show that this tree mining algorithm permits identifying not only errors in the generation system (grammar, lexicon) but also mismatches between the structures contained in the input and the input structures expected by our generator, as well as a few idiosyncrasies/errors in the input data.
2 0.58226001 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities
Author: Seyed Abolghasem Mirroshandel ; Alexis Nasr ; Joseph Le Roux
Abstract: Treebanks are not large enough to reliably model precise lexical phenomena. This deficiency provokes attachment errors in the parsers trained on such data. We propose in this paper to compute lexical affinities, on large corpora, for specific lexico-syntactic configurations that are hard to disambiguate and introduce the new information in a parser. Experiments on the French Treebank showed a relative decrease of the error rate of 7.1% Labeled Accuracy Score, yielding the best parsing results on this treebank.
3 0.55968839 139 acl-2012-MIX Is Not a Tree-Adjoining Language
Author: Makoto Kanazawa ; Sylvain Salvati
Abstract: The language MIX consists of all strings over the three-letter alphabet {a, b, c} that contain an equal number of occurrences of each letter. We prove Joshi's (1985) conjecture that MIX is not a tree-adjoining language.
4 0.54316825 185 acl-2012-Strong Lexicalization of Tree Adjoining Grammars
Author: Andreas Maletti ; Joost Engelfriet
Abstract: Recently, it was shown (KUHLMANN, SATTA: Tree-adjoining grammars are not closed under strong lexicalization. Comput. Linguist., 2012) that finitely ambiguous tree adjoining grammars cannot be transformed into a normal form (preserving the generated tree language), in which each production contains a lexical symbol. A more powerful model, the simple context-free tree grammar, admits such a normal form. It can be effectively constructed and the maximal rank of the nonterminals only increases by 1. Thus, simple context-free tree grammars strongly lexicalize tree adjoining grammars and themselves.
5 0.53109723 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers
Author: Bevan Jones ; Mark Johnson ; Sharon Goldwater
Abstract: Many semantic parsing models use tree transformations to map between natural language and meaning representation. However, while tree transformations are central to several state-of-the-art approaches, little use has been made of the rich literature on tree automata. This paper makes the connection concrete with a tree transducer based semantic parsing model and suggests that other models can be interpreted in a similar framework, increasing the generality of their contributions. In particular, this paper further introduces a variational Bayesian inference algorithm that is applicable to a wide class of tree transducers, producing state-of-the-art semantic parsing results while remaining applicable to any domain employing probabilistic tree transducers.
6 0.51893967 9 acl-2012-A Cost Sensitive Part-of-Speech Tagging: Differentiating Serious Errors from Minor Errors
7 0.50683415 181 acl-2012-Spectral Learning of Latent-Variable PCFGs
8 0.50639886 34 acl-2012-Automatically Learning Measures of Child Language Development
10 0.48130164 71 acl-2012-Dependency Hashing for n-best CCG Parsing
11 0.46134752 106 acl-2012-Head-driven Transition-based Parsing with Top-down Prediction
12 0.45975101 30 acl-2012-Attacking Parsing Bottlenecks with Unlabeled Data and Relevant Factorizations
13 0.44995335 42 acl-2012-Bootstrapping via Graph Propagation
14 0.44764641 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
15 0.44572687 103 acl-2012-Grammar Error Correction Using Pseudo-Error Sentences and Domain Adaptation
16 0.43667614 137 acl-2012-Lemmatisation as a Tagging Task
17 0.4347254 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing
18 0.43334764 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures
19 0.43018776 109 acl-2012-Higher-order Constituent Parsing and Parser Combination
20 0.42865083 4 acl-2012-A Comparative Study of Target Dependency Structures for Statistical Machine Translation
topicId topicWeight
[(26, 0.059), (28, 0.02), (30, 0.102), (37, 0.037), (39, 0.052), (57, 0.246), (59, 0.015), (74, 0.038), (82, 0.019), (84, 0.022), (85, 0.022), (90, 0.095), (92, 0.046), (94, 0.037), (98, 0.011), (99, 0.074)]
simIndex simValue paperId paperTitle
1 0.82030028 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling
Author: Kareem Darwish ; Ahmed Ali
Abstract: Due to Arabic’s morphological complexity, Arabic retrieval benefits greatly from morphological analysis – particularly stemming. However, the best known stemming does not handle linguistic phenomena such as broken plurals and malformed stems. In this paper we propose a model of character-level morphological transformation that is trained using Wikipedia hypertext to page title links. The use of our model yields statistically significant improvements in Arabic retrieval over the use of the best statistical stemming technique. The technique can potentially be applied to other languages.
same-paper 2 0.80042976 83 acl-2012-Error Mining on Dependency Trees
Author: Claire Gardent ; Shashi Narayan
Abstract: In recent years, error mining approaches were developed to help identify the most likely sources of parsing failures in parsing systems using handcrafted grammars and lexicons. However, the techniques they use to enumerate and count n-grams build on the sequential nature of a text corpus and do not easily extend to structured data. In this paper, we propose an algorithm for mining trees and apply it to detect the most likely sources of generation failure. We show that this tree mining algorithm permits identifying not only errors in the generation system (grammar, lexicon) but also mismatches between the structures contained in the input and the input structures expected by our generator, as well as a few idiosyncrasies/errors in the input data.
3 0.75814551 110 acl-2012-Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model
Author: William Yang Wang ; Elijah Mayfield ; Suresh Naidu ; Jeremiah Dittmar
Abstract: We propose a latent variable model to enhance historical analysis of large corpora. This work extends prior work in topic modelling by incorporating metadata, and the interactions between the components in metadata, in a general way. To test this, we collect a corpus of slavery-related United States property law judgements sampled from the years 1730 to 1866. We study the language use in these legal cases, with a special focus on shifts in opinions on controversial topics across different regions. Because this is a longitudinal data set, we are also interested in understanding how these opinions change over the course of decades. We show that the joint learning scheme of our sparse mixed-effects model improves on other state-of-the-art generative and discriminative models on the region and time period identification tasks. Experiments show that our sparse mixed-effects model is more accurate quantitatively and qualitatively interesting, and that these improvements are robust across different parameter settings.
4 0.74440032 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation
Author: Tong Xiao ; Jingbo Zhu ; Hao Zhang ; Qiang Li
Abstract: We present a new open source toolkit for phrase-based and syntax-based machine translation. The toolkit supports several state-of-the-art models developed in statistical machine translation, including the phrase-based model, the hierarchical phrase-based model, and various syntax-based models. The key innovation provided by the toolkit is that the decoder can work with various grammars and offers different choices of decoding algorithms, such as phrase-based decoding, decoding as parsing/tree-parsing and forest-based decoding. Moreover, several useful utilities were distributed with the toolkit, including a discriminative reordering model, a simple and fast language model, and an implementation of minimum error rate training for weight tuning.
5 0.6362626 61 acl-2012-Cross-Domain Co-Extraction of Sentiment and Topic Lexicons
Author: Fangtao Li ; Sinno Jialin Pan ; Ou Jin ; Qiang Yang ; Xiaoyan Zhu
Abstract: Extracting sentiment and topic lexicons is important for opinion mining. Previous work has shown that supervised learning methods are superior for this task. However, the performance of supervised methods highly relies on manually labeled training data. In this paper, we propose a domain adaptation framework for sentiment- and topic-lexicon co-extraction in a domain of interest where we do not require any labeled data, but have lots of labeled data in another related domain. The framework is twofold. In the first step, we generate a few high-confidence sentiment and topic seeds in the target domain. In the second step, we propose a novel Relational Adaptive bootstraPping (RAP) algorithm to expand the seeds in the target domain by exploiting the labeled source domain data and the relationships between topic and sentiment words. Experimental results show that our domain adaptation framework can extract precise lexicons in the target domain without any annotation.
6 0.59009778 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation
7 0.58597404 202 acl-2012-Transforming Standard Arabic to Colloquial Arabic
8 0.57663125 19 acl-2012-A Ranking-based Approach to Word Reordering for Statistical Machine Translation
10 0.56878716 75 acl-2012-Discriminative Strategies to Integrate Multiword Expression Recognition and Parsing
11 0.56157792 144 acl-2012-Modeling Review Comments
12 0.55528599 65 acl-2012-Crowdsourcing Inference-Rule Evaluation
13 0.55358809 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
14 0.53552252 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation
15 0.53537446 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
16 0.53188694 120 acl-2012-Information-theoretic Multi-view Domain Adaptation
18 0.52822471 136 acl-2012-Learning to Translate with Multiple Objectives
19 0.52814662 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
20 0.52808356 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning