acl acl2011 acl2011-330 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Seth Kulick ; Ann Bies ; Justin Mott
Abstract: This work introduces a new approach to checking treebank consistency. Derivation trees based on a variant of Tree Adjoining Grammar are used to compare the annotation of word sequences based on their structural similarity. This overcomes the problems of earlier approaches based on using strings of words rather than tree structure to identify the appropriate contexts for comparison. We report on the result of applying this approach to the Penn Arabic Treebank and how this approach leads to high precision of error detection.
Reference: text
sentIndex sentText sentNum sentScore
1 Derivation trees based on a variant of Tree Adjoining Grammar are used to compare the annotation of word sequences based on their structural similarity. [sent-2, score-0.18]
2 This overcomes the problems of earlier approaches based on using strings of words rather than tree structure to identify the appropriate contexts for comparison. [sent-3, score-0.158]
3 1 Introduction The internal consistency of the annotation in a treebank is crucial in order to provide reliable training and testing data for parsers and linguistic research. [sent-5, score-0.188]
4 Treebank annotation, consisting of syntactic structure with words as the terminals, is by its nature more complex and thus more prone to error than other annotation tasks, such as part-of-speech tagging. [sent-6, score-0.083]
5 Recent work has therefore focused on the importance of detecting errors in the treebank (Green and Manning, 2010), and methods for finding such errors automatically. [sent-7, score-0.108]
6 We present here a new approach to this problem that builds upon Dickinson and Meurers (2003b), by integrating the perspective on treebank consistency checking and search in Kulick and Bies (2010). [sent-11, score-0.145]
7 The approach in Dickinson and Meurers (2003b) has certain limitations and complications that are inherent in examining only strings of words. [sent-12, score-0.019]
8 To overcome these problems, we recast the search as one of searching for inconsistently-used elementary trees in a Tree Adjoining Grammar-based form of the treebank. [sent-15, score-0.106]
9 This allows consistency checking to be based on structural locality instead of n-grams, resulting in improved precision of finding inconsistent treebank annotation, allowing for the correction of such inconsistencies in future work. [sent-16, score-0.347]
10 Adopting their terminology, a “variation nucleus” is the string of words with a difference in the annotation (label), while a “variation n-gram” is a larger string containing the variation nucleus. [sent-19, score-0.244]
11 (NP the most important points) For example, suppose the pair of phrases in (1) is taken from two different sentences in a corpus. [sent-22, score-0.053]
12 The “variation nucleus” is the string “most important”, and the larger surrounding n-gram is the string “the most important points”. [sent-23, score-0.138]
13 This is an example of an error in the corpus, since the second annotation is incorrect, and this difference manifests itself by the nucleus having in (a) the label ADJP but in (b) the default label NIL (meaning for their system that the nucleus has no covering node). [sent-24, score-0.839]
14 They use the “non-fringe heuristic”, which considers two variation nuclei to have a comparable context if they are properly contained within the same variation n-gram. [sent-27, score-0.686]
15 For the pair in (1), the two instances of the variation nucleus satisfy the non-fringe heuristic because they are properly contained within the identical variation n-gram (with “the” and “points” on either side). [sent-30, score-0.691]
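To make the nucleus/label bookkeeping concrete, here is a minimal Python sketch of variation-nucleus detection in the spirit of this description; the corpus encoding and function name are our own illustration, not the DECCA implementation.

```python
from collections import defaultdict

# Toy sketch of variation-nucleus detection: each instance is a
# (words, label) pair, where label is the category of the covering node
# for that word string, or "NIL" if no single node covers it.
def find_variation_nuclei(instances):
    by_string = defaultdict(set)
    for words, label in instances:
        by_string[tuple(words)].add(label)
    # A variation nucleus is any string annotated with more than one label.
    return {ws: labels for ws, labels in by_string.items() if len(labels) > 1}

corpus = [
    (["most", "important"], "ADJP"),  # as in (1a): covered by an ADJP node
    (["most", "important"], "NIL"),   # as in (1b): no covering node
]
print(find_variation_nuclei(corpus))
# -> {('most', 'important'): {'ADJP', 'NIL'}}
```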
16 [Figure: trees (2a) and (2b), two different bracketings of the NP qmp ‘summit’ $rm ‘Sharm’ Al$yx ‘the Sheikh’.] [sent-36, score-0.512]
17 [Figure: tree (2c), the same NP with the additional adjunct (mSr) ‘Egypt’.] We motivate our approach by illustrating the limitations of the DECCA approach. [sent-37, score-0.256]
18 Consider the trees (2a) and (2b), taken from two instances of the three-word sequence qmp $rm Al$yx in the Arabic Treebank. [sent-38, score-0.404]
19 There is no need to look at any surrounding annotation to conclude that there is an inconsistency in the annotation of this sequence. [sent-39, score-0.324]
20 However, based on (2ab), the DECCA system would not even identify the three-word sequence qmp $rm Al$yx as a nucleus to compare, because both instances have an NP covering node, and so are considered to have the same label. [sent-40, score-0.682]
21 (Footnote 3: While the nature of the inconsistency is not the issue here, (b) is the correct annotation.) [sent-48, score-0.12]
22 While we would like to compare the inconsistent structures for the identical word sequences as in (2ab), the DECCA approach would instead focus on the single word Al$yx, which has an NP label in (2a), while it has the default label NIL in (2b). [sent-49, score-0.161]
23 However, whether it is reported as a variation depends on the irrelevant fact of whether the word to the right of Al$yx is the same in both instances, thus allowing it to pass the non-fringe heuristic (since it already has the same word, $rm, on the left). [sent-50, score-0.22]
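A minimal sketch of the non-fringe check as characterized here; the flanking words below are hypothetical, chosen only to mirror the discussion of Al$yx.

```python
# Two instances of a nucleus are comparable under the non-fringe heuristic
# only if both are properly contained in the same variation n-gram, i.e.
# flanked by the identical word on each side.
def non_fringe(left1, right1, left2, right2):
    """leftN/rightN: the words immediately outside instance N's nucleus."""
    return left1 == left2 and right1 == right2

# For Al$yx in (2ab): the left neighbor ($rm) is always shared, so whether
# the nucleus is compared at all hinges on the (irrelevant) right neighbor.
print(non_fringe("$rm", "fy", "$rm", "fy"))   # True  -> compared
print(non_fringe("$rm", "fy", "$rm", "qrb"))  # False -> silently skipped
```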
24 There is an additional NP level in (2c) because of the adjunct (mSr), causing qmp $rm Al$yx to have no covering node, and so to have the default label NIL, and therefore to be categorized as a variation compared to (2b). [sent-52, score-0.579]
25 However, this is a spurious difference, since the label difference is caused only by the irrelevant presence of an adjunct, and it is clear, without looking at any further structure, that the annotation of qmp $rm Al$yx is identical in (2bc). [sent-53, score-0.431]
26 This reliance on irrelevant material arises from using a single node label to characterize a structural annotation, and the surrounding word context to overcome the resulting complications. [sent-55, score-0.287]
27 3 Using Derivation Tree Fragments We utilize ideas from the long line of Tree Adjoining Grammar-based research (Joshi and Schabes, 1997), based on working with small “elementary trees” (abbreviated “etrees” in the rest of this paper) that are the “building blocks” of the full trees of a treebank. [sent-57, score-0.09]
28 This decomposition of the full tree into etrees also results in a “derivation tree” that records how the elementary trees relate to each other. [sent-58, score-0.495]
29 We illustrate the basics of the TAG-based derivation we are using with examples based on the trees in (2). [sent-59, score-0.31]
30 Sister adjunction attaches a tree (or single node) as a sister to another node, and Chomsky-adjunction forms a recursive structure as well, duplicating a node. [sent-62, score-0.197]
31 As typically done, we use head rules to decompose a full tree and extract the etrees. [sent-63, score-0.159]
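The following is a simplified sketch of head-rule-based etree extraction under our own tree encoding; the placeholder head rule just takes the rightmost child, and every cut leaves a substitution slot, ignoring the sister/Chomsky-adjunction distinction the actual system uses.

```python
def head_child(label, children):
    # Placeholder head rule: take the rightmost child as the head.
    return len(children) - 1

def extract_etrees(tree, etrees=None):
    """Decompose a full (label, children) tree into etrees: the head spine
    of each node stays in one etree, and every non-head subtree is cut off
    into its own etree, leaving a substitution slot (marked ^) behind."""
    if etrees is None:
        etrees = []

    def spine(node):
        label, children = node
        if len(children) == 1 and isinstance(children[0], str):
            return node                            # preterminal: keep
        h = head_child(label, children)
        kept = []
        for i, child in enumerate(children):
            if i == h:
                kept.append(spine(child))          # head stays here
            else:
                extract_etrees(child, etrees)      # non-head: new etree
                kept.append((child[0] + "^", ()))  # substitution slot
        return (label, tuple(kept))

    etrees.append(spine(tree))
    return etrees

# The flat NP of (2b), with Al$yx as the rightmost (head) noun:
np_2b = ("NP", (("NOUN", ("qmp",)), ("NOUN", ("$rm",)), ("NOUN", ("Al$yx",))))
for et in extract_etrees(np_2b):
    print(et)
```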
32 The three derivation trees, corre- sponding to (2abc), are shown in Figure 1. [sent-64, score-0.24]
33 The derivation tree for (2a) has three etrees, numbered a1, a2, a3, which are the nodes in the derivation tree, showing how the three etrees connect to each other. [sent-66, score-0.609]
34 The ˆ symbol at node NPˆ in a1 indicates that it is a substitution node, and the S:1.2 above a2 indicates that it substitutes into the node at Gorn address 1.2 (i.e., the substitution node), and likewise for a3 substituting into a2. [sent-68, score-0.11] [sent-69, score-0.068] [sent-72, score-0.042]
37 The derivation tree for (2b) also has three etrees, although the structure is different. [sent-73, score-0.379]
38 Because the lower NP is flat in (2b), the rightmost noun, Al$yx, is taken as the head of the etree b2, with the degenerate tree for $rm sister-adjoining to the left of Al$yx, as indicated by the A:1,left. [sent-74, score-0.218]
39 The derivation tree for (2c) is identical to that of (2b), except that it has the additional tree c4 for the adjunct mSr, which right Chomsky-adjoins to the root of c2, as indicated by the M:1,right. [sent-76, score-0.586]
40 (Footnote 4: We leave out the irrelevant (here) details of the parentheses.) This tree decomposition and resulting derivation tree provide us with the tool for comparing nuclei without the interfering effects from words not in the nucleus. [sent-77, score-1.082]
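The attachment bookkeeping just described can be pictured with a small data structure; the field names below are ours, and the attachment codes follow the conventions above (S for substitution, A for sister adjunction, M for Chomsky adjunction, each paired with a Gorn address).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DerivNode:
    etree: str                        # e.g. a pretty-printed etree
    attach: Optional[str] = None      # "S", "A", or "M"; None at the root
    gorn: Optional[str] = None        # address in the parent, e.g. "1.2"
    side: Optional[str] = None        # "left"/"right" for adjunction
    kids: List["DerivNode"] = field(default_factory=list)

# Sketch of part of the derivation tree for (2b): b3 ($rm) sister-adjoins
# to the left of position 1 in b2, the etree headed by Al$yx.
b2 = DerivNode("(NP (NOUN Al$yx))")
b2.kids.append(DerivNode("(NOUN $rm)", attach="A", gorn="1", side="left"))
```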
41 We are interested not in the derivation tree for an entire sentence, but rather only that slice of it having etrees with words that are in the nucleus being examined, which we call the derivation tree fragment. [sent-78, score-1.325]
42 That is, for a given nucleus being examined, we partition its instances based on the covering node in the full tree, and within each set of instances we compare the derivation tree fragments for each instance. [sent-79, score-1.076]
43 These derivation tree fragments are the relevant structures to compare for inconsistent annotation, and are computed separately for each instance of each nucleus from the full derivation tree that each instance is part of. [sent-80, score-1.274]
44 For example, for comparing our three instances of qmp $rm Al$yx, the three derivation tree fragments would be the structures consisting of (a1, a2, a3), (b1, b2, b3) and (c1, c2, c3), along with their connecting Gorn addresses and attachment types. [sent-81, score-0.787]
45 This indicates that the instances (2ab) have different internal structures (without the need to look at a surrounding context), while the instances (2bc) have identical internal structures (allowing us to abstract away from the interfering effects of adjunction). [sent-82, score-0.398]
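Schematically, this comparison might look like the following sketch, assuming each derivation tree fragment has already been rendered into some canonical hashable form; the representation is our assumption, not the paper's.

```python
from collections import defaultdict

def find_inconsistent_nuclei(instances):
    """instances: iterable of (nucleus_words, covering_label, fragment).
    Partition instances by (nucleus, covering node), then flag partitions
    whose instances carry more than one distinct derivation tree fragment."""
    partitions = defaultdict(set)
    for words, label, fragment in instances:
        partitions[(tuple(words), label)].add(fragment)
    return {key: frags for key, frags in partitions.items() if len(frags) > 1}

# (2a) vs (2b): same words, same NP covering node, different fragments.
hits = find_inconsistent_nuclei([
    (["qmp", "$rm", "Al$yx"], "NP", ("a1", "a2", "a3")),
    (["qmp", "$rm", "Al$yx"], "NP", ("b1", "b2", "b3")),
])
print(hits)
```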
46 Space prevents full discussion here, but the etrees and derivation trees as just described require refinement to be truly appropriate for comparing nuclei. [sent-83, score-0.56]
47 The reason is that etrees might encode more information than is relevant for many comparisons of nuclei. [sent-84, score-0.23]
48 For example, the same verb can appear with different types of complements, and this would lead to its having different etrees, differing in their node label for the substitution node. [sent-86, score-0.135]
49 If the nucleus under comparison includes the verb but not any words from the complement, the inclusion of the different substitution nodes would cause irrelevant differences for that particular nucleus comparison. [sent-87, score-0.807]
50 We solve these problems by mapping down the etrees in the derivation tree. [sent-88, score-0.24]
51 (Footnote 5: A related approach is taken by Kato and Matsubara (2010), who compare partial parse trees for different instances of the same sequence of words in a corpus, resulting in rules based on a synchronous Tree Substitution Grammar (Eisner, 2003). [sent-89, score-0.174]
52 We suspect that there are some major differences between our approaches regarding such issues as the representation of adjuncts, but we leave such a comparison for future work.) [sent-90, score-0.019]
53 These reductions are (automatically) done for each nucleus comparison in a way that is appropriate for that particular nucleus comparison. [sent-92, score-0.674]
54 A particular etree may be reduced in one way for one nucleus, and then a different way for a different nucleus. [sent-93, score-0.058]
55 This is done for each etree in a derivation tree fragment. [sent-94, score-0.437]
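One plausible reduction along these lines is sketched below, assuming etrees encoded as (label, children) tuples in which substitution sites are marked with ^; the encoding and the neutral marker are our assumptions, and the verb form is illustrative only.

```python
def reduce_etree(node, gorn="", filled_inside=frozenset()):
    """Collapse the labels of substitution nodes (marked ^) at Gorn
    addresses not filled from within the current fragment, so that, e.g.,
    a verb etree with an NP^ complement and one with an SBAR^ complement
    compare as equal when the complement lies outside the nucleus."""
    if isinstance(node, str):      # lexical leaf
        return node
    label, children = node
    if label.endswith("^") and gorn not in filled_inside:
        return ("X^", ())
    return (label, tuple(
        reduce_etree(c, f"{gorn}.{i + 1}".lstrip("."), filled_inside)
        for i, c in enumerate(children)))

# Same verb etree shape, differing only in the complement's category:
v_np = ("S", (("V", ("qAl",)), ("NP^", ())))      # verb with NP complement
v_sbar = ("S", (("V", ("qAl",)), ("SBAR^", ())))  # same verb, SBAR complement
print(reduce_etree(v_np) == reduce_etree(v_sbar))  # -> True after reduction
```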
56 4 Results on Test Corpus Green and Manning (2010) discuss annotation consistency in the Penn Arabic Treebank (ATB), and for our test corpus we follow their discussion and use the same data set, the training section of three parts of the ATB (Maamouri et al.). [sent-95, score-0.083]
57 Their work is ideal for us, since they used the DECCA algorithm for the consistency evaluation. [sent-99, score-0.041]
58 They did not use the “non-fringe” heuristic, but instead manually examined a sample of 100 nuclei to determine whether they were annotation errors. [sent-100, score-0.605]
59 The DECCA system (see Footnote 6 below) identified 24,319 distinct variation nuclei, while our system had 54,496. [sent-104, score-0.093]
60 DECCA examined 1,158,342 n-grams, consisting of 2,966,274 instances (i.e., different corpus positions of the n-grams), while our system examined 605,906 instances of the 54,496 nuclei. [sent-105, score-0.042] [sent-111, score-0.125]
61 (Footnote 6: We worked at first with version 0. However, this software does not implement the non-fringe heuristic and does not make available the actual instances of the nuclei that were found. [sent-107, score-0.602]
62 We therefore re-implemented the algorithm to make these features available, being careful to exactly match our output against the released DECCA system as far as the nuclei and n-grams found.) [sent-108, score-0.498]
64 For our system, the number of nuclei increases and the variation n-grams are eliminated. [sent-112, score-0.592]
65 This is because all nuclei with more than one instance are evaluated, in order to search for constituents that have the same root but different internal structure. [sent-113, score-0.524]
66 The number of reported inconsistencies is shown in Table 2. [sent-114, score-0.13]
67 DECCA identified 4,140 nuclei as likely errors. [sent-115, score-0.499]
68 Our system identified 9,984 nuclei as having inconsistent annotation, i.e., with at least two instances with different derivation tree fragments. [sent-118, score-0.627] [sent-120, score-0.462]
70 4.2 Eliminating Duplicate Nuclei Some of these 9,984 nuclei are, however, redundant, due to nuclei contained within larger nuclei, such as $rm Al$yx inside qmp $rm Al$yx in (2abc). [sent-122, score-2.036]
71 Eliminating such duplicates is not just a simple matter of string inclusion, since the larger nucleus can sometimes reveal different annotation inconsistencies than just those in the smaller substring nucleus, and also a single nucleus string can be included in different larger nuclei. [sent-123, score-0.862]
72 We cannot discuss here the full details of our solution, but it basically consists of two steps. [sent-124, score-0.02]
73 First, as a result of the analysis described so far, for each nucleus we have a mapping of each instance of that nucleus to a derivation tree fragment. [sent-125, score-1.053]
74 Second, we test for each possible redundancy (meaning string inclusion) whether there is a true structural redundancy by testing for an isomorphism between the mappings for two nuclei. [sent-126, score-0.061]
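These two steps can be sketched as follows, assuming each nucleus carries a dict from instance identifiers (e.g., corpus positions) to canonical fragment keys; that dict is our stand-in for the mapping in the first step.

```python
def redundant(small_map, large_map):
    """True if the smaller nucleus is structurally redundant given the
    larger one: over shared instances, their fragment assignments must
    pair up one-to-one (an isomorphism between the two mappings)."""
    pairing = {}
    for inst, frag in small_map.items():
        if inst not in large_map:
            return False
        big = large_map[inst]
        if pairing.setdefault(frag, big) != big:
            return False       # same small fragment, two large fragments
    # The pairing must also be injective: distinct small fragments may not
    # collapse onto a single large fragment.
    return len(set(pairing.values())) == len(pairing)

# "$rm Al$yx" inside "qmp $rm Al$yx": if the fragments co-vary across the
# shared instances, the smaller nucleus is redundant.
print(redundant({1: "F1", 2: "F2"}, {1: "G1", 2: "G2"}))  # True
print(redundant({1: "F1", 2: "F2"}, {1: "G1", 2: "G1"}))  # False
```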
75 For this test corpus, eliminating such duplicates leaves 4,272 nuclei as having inconsistent annotation. [sent-127, score-0.61]
76 It is unknown how many of the DECCA nuclei are duplicates, although many certainly are. [sent-128, score-0.48]
77 For example, qmp $rm Al$yx and $rm Al$yx are reported as separate results. [sent-129, score-0.249]
78 4.3 Grouping Inconsistencies by Structure Across all variation nuclei, there are only a finite number of derivation tree fragments and thus ways in which such fragments indicate an annotation inconsistency. [sent-131, score-0.703]
79 We categorize each annotation inconsistency by the inconsistency type, which is simply a set of numbers representing the different derivation tree fragments. [sent-132, score-0.702]
80 We can then present the results not by listing each nucleus string, but instead by the inconsistency types, with each type having some number of nuclei associated with it. [sent-133, score-0.937]
81 For example, instances of $rm Al$yx might have just the derivation tree fragments (a2, a3) and (b2, b3) in Figure 1, and the numbers representing this pair are the “inconsistency type” for this (nucleus, internal context) inconsistency. [sent-134, score-0.58]
82 There are nine other nuclei reported as having an inconsistency based on the exact same derivation tree fragments (abstracting only away from the particular lexical items), and so all these nuclei are grouped together as having the same “inconsistency type”. [sent-135, score-1.552]
83 This grouping results in the 4,272 non-duplicate nuclei found being grouped into 1,911 inconsistency types. [sent-136, score-0.509]
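The grouping step can be sketched like this, with shape_id standing in for whatever delexicalization the system applies to a fragment; it is an assumed helper, not part of the paper.

```python
from collections import defaultdict

def group_by_type(nucleus_to_fragments, shape_id):
    """Map each nucleus to its 'inconsistency type' -- the set of distinct
    delexicalized fragment shapes it occurs with -- and group nuclei that
    share a type, so one type can be inspected for many nuclei at once."""
    types = defaultdict(list)
    for nucleus, fragments in nucleus_to_fragments.items():
        itype = frozenset(shape_id(f) for f in fragments)
        types[itype].append(nucleus)
    return types

# Nuclei sharing the same pair of fragment shapes end up in one reported type.
demo = {("$rm", "Al$yx"): [("a2", "a3"), ("b2", "b3")]}
print(group_by_type(demo, shape_id=lambda f: hash(f) % 1000))  # toy shape_id
```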
84 4.4 Precision and Recall The grouping of internal checking results by inconsistency types is a qualitative improvement in consistency reporting, with a high precision. [sent-138, score-0.268]
85 By viewing inconsistencies by structural annotation types, we can examine large numbers of nuclei at a time. [sent-139, score-0.701]
86 Of the first 10 different types of derivation tree inconsistencies, which include 266 different nuclei, all 10 appear to be real cases of annotation inconsistency, and the same seems to hold for each of the nuclei in those 10 types, although we have not checked every single nucleus. [sent-140, score-0.942]
87 For comparison, we chose a sample of 100 nuclei output by DECCA on this same data, and by our judgment the DECCA precision is about 74%, including 15 duplicates. [sent-141, score-0.48]
88 Measuring recall is tricky, even using the errors identified in Green and Manning (2010) as “gold” errors. [sent-142, score-0.019]
89 One factor is that a system might report a variation nucleus, but still not report all the relevant instances of that nucleus. [sent-143, score-0.176]
90 For example, while both systems report $rm Al$yx as a sequence with inconsistent annotation, DECCA only reports the two instances that pass the “non-fringe heuristic”, while our system lists 132 instances of $rm Al$yx, partitioning them into the two derivation tree fragments. [sent-144, score-0.632]
91 We will be carrying out a careful accounting of the recall evaluation in future work. [sent-145, score-0.039]
92 (Footnote 7: “Precision” here means the percentage of reported variations that are actually annotation errors.) [sent-146, score-0.102]
93 5 Future Work While we continue the evaluation work, our primary concern now is to use the reported inconsistent derivation tree fragments to correct the annotation inconsistencies in the actual data, and then evaluate the effect of the corpus corrections on parsing. [sent-147, score-0.73]
94 Our system groups all instances of a nucleus into different derivation tree fragments, and it would be easy enough for an annotator to specify which is correct (or perhaps instead derive this automatically based on frequencies). [sent-148, score-0.799]
95 However, because the derivation trees and etrees are somewhat abstracted from the actual trees in the treebank, it can be challenging to automatically correct the structure in every location to reflect the correct derivation tree fragment. [sent-149, score-0.989]
96 This is because of details concerning the surrounding structure and the interaction with annotation style guidelines such as having only one level of recursive modification or differences in constituent bracketing depending on whether a constituent is a “single-word” or not. [sent-150, score-0.18]
97 We are focusing on accounting for these issues in current work to allow such automatic correction. [sent-151, score-0.021]
98 Statistical parsing with an automatically extracted tree adjoining grammar. [sent-169, score-0.197]
99 Correcting errors in a treebank based on synchronous tree substitution grammar. [sent-204, score-0.27]
100 A TAG-derived database for treebank search and parser analysis. [sent-209, score-0.07]
wordName wordTfidf (topN-words)
[('nuclei', 0.48), ('nucleus', 0.337), ('decca', 0.288), ('yx', 0.256), ('derivation', 0.24), ('etrees', 0.23), ('qmp', 0.23), ('rm', 0.157), ('dickinson', 0.156), ('tree', 0.139), ('inconsistency', 0.12), ('inconsistencies', 0.111), ('al', 0.11), ('meurers', 0.109), ('kulick', 0.094), ('variation', 0.093), ('np', 0.09), ('maamouri', 0.088), ('instances', 0.083), ('annotation', 0.083), ('arabic', 0.08), ('fragments', 0.074), ('treebank', 0.07), ('trees', 0.07), ('bies', 0.07), ('node', 0.068), ('seth', 0.067), ('inconsistent', 0.064), ('adjoining', 0.058), ('basma', 0.058), ('etree', 0.058), ('fatma', 0.058), ('gaddeche', 0.058), ('krouna', 0.058), ('mekki', 0.058), ('msr', 0.058), ('sharm', 0.058), ('sheikh', 0.058), ('sondos', 0.058), ('detmar', 0.051), ('atb', 0.051), ('green', 0.05), ('wigdan', 0.047), ('irrelevant', 0.046), ('internal', 0.044), ('consortium', 0.043), ('substitution', 0.042), ('examined', 0.042), ('adjunct', 0.042), ('kato', 0.042), ('consistency', 0.041), ('buckwalter', 0.04), ('mohamed', 0.04), ('heuristic', 0.039), ('interfering', 0.038), ('matsubara', 0.038), ('mott', 0.038), ('nil', 0.038), ('surrounding', 0.038), ('duplicates', 0.037), ('elementary', 0.036), ('markus', 0.036), ('checking', 0.034), ('gorn', 0.034), ('string', 0.034), ('covering', 0.032), ('mo', 0.032), ('adjp', 0.031), ('adjunction', 0.031), ('boyd', 0.029), ('spence', 0.029), ('ann', 0.029), ('eliminating', 0.029), ('grouping', 0.029), ('tlt', 0.028), ('structural', 0.027), ('treebanks', 0.027), ('sister', 0.027), ('summit', 0.026), ('identical', 0.026), ('inclusion', 0.026), ('label', 0.025), ('pass', 0.023), ('structures', 0.021), ('taken', 0.021), ('spurious', 0.021), ('accounting', 0.021), ('manning', 0.02), ('linguistic', 0.02), ('joshi', 0.02), ('contained', 0.02), ('full', 0.02), ('constituent', 0.02), ('ngrams', 0.019), ('reported', 0.019), ('theories', 0.019), ('differences', 0.019), ('errors', 0.019), ('strings', 0.019), ('careful', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 330 acl-2011-Using Derivation Trees for Treebank Error Detection
Author: Seth Kulick ; Ann Bies ; Justin Mott
Abstract: This work introduces a new approach to checking treebank consistency. Derivation trees based on a variant of Tree Adjoining Grammar are used to compare the annotation of word sequences based on their structural similarity. This overcomes the problems of earlier approaches based on using strings of words rather than tree structure to identify the appropriate contexts for comparison. We report on the result of applying this approach to the Penn Arabic Treebank and how this approach leads to high precision of error detection.
2 0.15741986 30 acl-2011-Adjoining Tree-to-String Translation
Author: Yang Liu ; Qun Liu ; Yajuan Lu
Abstract: We introduce synchronous tree adjoining grammars (TAG) into tree-to-string translation, which converts a source tree to a target string. Without reconstructing TAG derivations explicitly, our rule extraction algorithm directly learns tree-to-string rules from aligned Treebank-style trees. As tree-to-string translation casts decoding as a tree parsing problem rather than parsing, the decoder still runs fast when adjoining is included. Less than 2 times slower, the adjoining tree-tostring system improves translation quality by +0.7 BLEU over the baseline system only allowing for tree substitution on NIST ChineseEnglish test sets.
3 0.1227081 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks
Author: Alexander Volokh ; Gunter Neumann
Abstract: Annotated corpora are essential for almost all NLP applications. Whereas they are expected to be of a very high quality because of their importance for the followup developments, they still contain a considerable number of errors. With this work we want to draw attention to this fact. Additionally, we try to estimate the amount of errors and propose a method for their automatic correction. Whereas our approach is able to find only a portion of the errors that we suppose are contained in almost any annotated corpus due to the nature of the process of its creation, it has a very high precision, and thus is in any case beneficial for the quality of the corpus it is applied to. At last, we compare it to a different method for error detection in treebanks and find out that the errors that we are able to detect are mostly different and that our approaches are complementary. 1
4 0.099952415 206 acl-2011-Learning to Transform and Select Elementary Trees for Improved Syntax-based Machine Translations
Author: Bing Zhao ; Young-Suk Lee ; Xiaoqiang Luo ; Liu Li
Abstract: We propose a novel technique of learning how to transform the source parse trees to improve the translation qualities of syntax-based translation models using synchronous context-free grammars. We transform the source tree phrasal structure into a set of simpler structures, expose such decisions to the decoding process, and find the least expensive transformation operation to better model word reordering. In particular, we integrate synchronous binarizations, verb regrouping, removal of redundant parse nodes, and incorporate a few important features such as translation boundaries. We learn the structural preferences from the data in a generative framework. The syntax-based translation system integrating the proposed techniques outperforms the best Arabic-English unconstrained system in NIST08 evaluations by 1.3 absolute BLEU, which is statistically significant.
5 0.08486867 173 acl-2011-Insertion Operator for Bayesian Tree Substitution Grammars
Author: Hiroyuki Shindo ; Akinori Fujino ; Masaaki Nagata
Abstract: We propose a model that incorporates an insertion operator in Bayesian tree substitution grammars (BTSG). Tree insertion is helpful for modeling syntax patterns accurately with fewer grammar rules than BTSG. The experimental parsing results show that our model outperforms a standard PCFG and BTSG for a small dataset. For a large dataset, our model obtains comparable results to BTSG, making the number of grammar rules much smaller than with BTSG.
6 0.084200762 110 acl-2011-Effective Use of Function Words for Rule Generalization in Forest-Based Translation
7 0.078805596 188 acl-2011-Judging Grammaticality with Tree Substitution Grammar Derivations
8 0.077859595 7 acl-2011-A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality
9 0.075223625 268 acl-2011-Rule Markov Models for Fast Tree-to-String Translation
10 0.072085761 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning
11 0.070993863 28 acl-2011-A Statistical Tree Annotator and Its Applications
12 0.066557996 166 acl-2011-Improving Decoding Generalization for Tree-to-String Translation
13 0.066106454 61 acl-2011-Binarized Forest to String Translation
14 0.060872264 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations
15 0.057703186 272 acl-2011-Semantic Information and Derivation Rules for Robust Dialogue Act Detection in a Spoken Dialogue System
16 0.057430651 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features
17 0.056367926 123 acl-2011-Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation
18 0.056357898 329 acl-2011-Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition
19 0.054124661 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing
20 0.052993648 282 acl-2011-Shift-Reduce CCG Parsing
topicId topicWeight
[(0, 0.121), (1, -0.06), (2, -0.007), (3, -0.101), (4, -0.028), (5, 0.004), (6, -0.056), (7, -0.033), (8, -0.029), (9, -0.017), (10, -0.067), (11, -0.025), (12, -0.051), (13, 0.051), (14, 0.078), (15, -0.041), (16, 0.045), (17, -0.006), (18, -0.032), (19, -0.03), (20, -0.027), (21, 0.039), (22, -0.014), (23, 0.071), (24, 0.025), (25, 0.017), (26, -0.015), (27, 0.03), (28, 0.048), (29, -0.084), (30, -0.064), (31, -0.026), (32, -0.004), (33, 0.004), (34, 0.008), (35, -0.021), (36, -0.034), (37, -0.106), (38, -0.045), (39, -0.054), (40, 0.025), (41, 0.004), (42, -0.035), (43, -0.019), (44, -0.054), (45, -0.002), (46, 0.045), (47, 0.1), (48, 0.118), (49, -0.091)]
simIndex simValue paperId paperTitle
same-paper 1 0.94996834 330 acl-2011-Using Derivation Trees for Treebank Error Detection
Author: Seth Kulick ; Ann Bies ; Justin Mott
Abstract: This work introduces a new approach to checking treebank consistency. Derivation trees based on a variant of Tree Adjoining Grammar are used to compare the annotation of word sequences based on their structural similarity. This overcomes the problems of earlier approaches based on using strings of words rather than tree structure to identify the appropriate contexts for comparison. We report on the result of applying this approach to the Penn Arabic Treebank and how this approach leads to high precision of error detection.
2 0.7513876 173 acl-2011-Insertion Operator for Bayesian Tree Substitution Grammars
Author: Hiroyuki Shindo ; Akinori Fujino ; Masaaki Nagata
Abstract: We propose a model that incorporates an insertion operator in Bayesian tree substitution grammars (BTSG). Tree insertion is helpful for modeling syntax patterns accurately with fewer grammar rules than BTSG. The experimental parsing results show that our model outperforms a standard PCFG and BTSG for a small dataset. For a large dataset, our model obtains comparable results to BTSG, making the number of grammar rules much smaller than with BTSG.
3 0.64878792 30 acl-2011-Adjoining Tree-to-String Translation
Author: Yang Liu ; Qun Liu ; Yajuan Lu
Abstract: We introduce synchronous tree adjoining grammars (TAG) into tree-to-string translation, which converts a source tree to a target string. Without reconstructing TAG derivations explicitly, our rule extraction algorithm directly learns tree-to-string rules from aligned Treebank-style trees. As tree-to-string translation casts decoding as a tree parsing problem rather than parsing, the decoder still runs fast when adjoining is included. Less than 2 times slower, the adjoining tree-tostring system improves translation quality by +0.7 BLEU over the baseline system only allowing for tree substitution on NIST ChineseEnglish test sets.
4 0.61491889 28 acl-2011-A Statistical Tree Annotator and Its Applications
Author: Xiaoqiang Luo ; Bing Zhao
Abstract: In many natural language applications, there is a need to enrich syntactical parse trees. We present a statistical tree annotator augmenting nodes with additional information. The annotator is generic and can be applied to a variety of applications. We report 3 such applications in this paper: predicting function tags; predicting null elements; and predicting whether a tree constituent is projectable in machine translation. Our function tag prediction system outperforms significantly published results.
5 0.6063512 188 acl-2011-Judging Grammaticality with Tree Substitution Grammar Derivations
Author: Matt Post
Abstract: In this paper, we show that local features computed from the derivations of tree substitution grammars such as the identify of particular fragments, and a count of large and small fragments are useful in binary grammatical classification tasks. Such features outperform n-gram features and various model scores by a wide margin. Although they fall short of the performance of the hand-crafted feature set of Charniak and Johnson (2005) developed for parse tree reranking, they do so with an order of magnitude fewer features. Furthermore, since the TSGs employed are learned in a Bayesian setting, the use of their derivations can be viewed as the automatic discovery of tree patterns useful for classification. On the BLLIP dataset, we achieve an accuracy of 89.9% in discriminating between grammatical text and samples from an n-gram language model. — —
6 0.53152364 206 acl-2011-Learning to Transform and Select Elementary Trees for Improved Syntax-based Machine Translations
7 0.52226543 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing
8 0.52151358 110 acl-2011-Effective Use of Function Words for Rule Generalization in Forest-Based Translation
9 0.50620806 268 acl-2011-Rule Markov Models for Fast Tree-to-String Translation
10 0.46726736 7 acl-2011-A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality
11 0.46370021 267 acl-2011-Reversible Stochastic Attribute-Value Grammars
12 0.45582962 154 acl-2011-How to train your multi bottom-up tree transducer
13 0.4461537 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach
14 0.41930079 219 acl-2011-Metagrammar engineering: Towards systematic exploration of implemented grammars
15 0.41843069 239 acl-2011-P11-5002 k2opt.pdf
16 0.41010696 217 acl-2011-Machine Translation System Combination by Confusion Forest
17 0.40815955 192 acl-2011-Language-Independent Parsing with Empty Elements
19 0.39865091 166 acl-2011-Improving Decoding Generalization for Tree-to-String Translation
20 0.3961527 290 acl-2011-Syntax-based Statistical Machine Translation using Tree Automata and Tree Transducers
topicId topicWeight
[(5, 0.038), (11, 0.051), (17, 0.055), (26, 0.015), (31, 0.011), (37, 0.072), (39, 0.061), (41, 0.046), (55, 0.011), (59, 0.039), (68, 0.272), (72, 0.022), (80, 0.017), (89, 0.012), (91, 0.045), (96, 0.132)]
simIndex simValue paperId paperTitle
same-paper 1 0.76362014 330 acl-2011-Using Derivation Trees for Treebank Error Detection
Author: Seth Kulick ; Ann Bies ; Justin Mott
Abstract: This work introduces a new approach to checking treebank consistency. Derivation trees based on a variant of Tree Adjoining Grammar are used to compare the annotation of word sequences based on their structural similarity. This overcomes the problems of earlier approaches based on using strings of words rather than tree structure to identify the appropriate contexts for comparison. We report on the result of applying this approach to the Penn Arabic Treebank and how this approach leads to high precision of error detection.
2 0.5605135 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks
Author: Alexander Volokh ; Gunter Neumann
Abstract: Annotated corpora are essential for almost all NLP applications. Whereas they are expected to be of a very high quality because of their importance for the followup developments, they still contain a considerable number of errors. With this work we want to draw attention to this fact. Additionally, we try to estimate the amount of errors and propose a method for their automatic correction. Whereas our approach is able to find only a portion of the errors that we suppose are contained in almost any annotated corpus due to the nature of the process of its creation, it has a very high precision, and thus is in any case beneficial for the quality of the corpus it is applied to. At last, we compare it to a different method for error detection in treebanks and find out that the errors that we are able to detect are mostly different and that our approaches are complementary. 1
3 0.56024951 136 acl-2011-Finding Deceptive Opinion Spam by Any Stretch of the Imagination
Author: Myle Ott ; Yejin Choi ; Claire Cardie ; Jeffrey T. Hancock
Abstract: Consumers increasingly rate, review and research products online (Jansen, 2010; Litvin et al., 2008). Consequently, websites containing consumer reviews are becoming targets of opinion spam. While recent work has focused primarily on manually identifiable instances of opinion spam, in this work we study deceptive opinion spam—fictitious opinions that have been deliberately written to sound authentic. Integrating work from psychology and computational linguistics, we develop and compare three approaches to detecting deceptive opinion spam, and ultimately develop a classifier that is nearly 90% accurate on our gold-standard opinion spam dataset. Based on feature analysis of our learned models, we additionally make several theoretical contributions, including revealing a relationship between deceptive opinions and imaginative writing.
4 0.5548805 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
Author: Zhongguo Li
Abstract: Lots of Chinese characters are very productive in that they can form many structured words either as prefixes or as suffixes. Previous research in Chinese word segmentation mainly focused on identifying only the word boundaries without considering the rich internal structures of many words. In this paper we argue that this is unsatisfying in many ways, both practically and theoretically. Instead, we propose that word structures should be recovered in morphological analysis. An elegant approach for doing this is given and the result is shown to be promising enough for encouraging further effort in this direction. Our probability model is trained with the Penn Chinese Treebank and actually is able to parse both word and phrase structures in a unified way.
5 0.55474162 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
Author: Joel Lang ; Mirella Lapata
Abstract: In this paper we describe an unsupervised method for semantic role induction which holds promise for relieving the data acquisition bottleneck associated with supervised role labelers. We present an algorithm that iteratively splits and merges clusters representing semantic roles, thereby leading from an initial clustering to a final clustering of better quality. The method is simple, surprisingly effective, and allows to integrate linguistic knowledge transparently. By combining role induction with a rule-based component for argument identification we obtain an unsupervised end-to-end semantic role labeling system. Evaluation on the CoNLL 2008 benchmark dataset demonstrates that our method outperforms competitive unsupervised approaches by a wide margin.
6 0.55012751 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
7 0.54979038 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models
8 0.54856038 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations
9 0.54821455 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing
10 0.54680568 117 acl-2011-Entity Set Expansion using Topic information
11 0.54679978 28 acl-2011-A Statistical Tree Annotator and Its Applications
12 0.545506 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
13 0.54496938 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
14 0.54459655 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
15 0.54436266 5 acl-2011-A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing
16 0.54412055 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing
17 0.54408306 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
18 0.5436818 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding
19 0.54332316 44 acl-2011-An exponential translation model for target language morphology
20 0.54289752 178 acl-2011-Interactive Topic Modeling