acl acl2012 acl2012-71 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dominick Ng ; James R. Curran
Abstract: Optimising for one grammatical representation, but evaluating over a different one is a particular challenge for parsers and n-best CCG parsing. We find that this mismatch causes many n-best CCG parses to be semantically equivalent, and describe a hashing technique that eliminates this problem, improving oracle n-best F-score by 0.7% and reranking accuracy by 0.4%. We also present a comprehensive analysis of errors made by the C&C CCG parser, providing the first breakdown of the impact of implementation decisions, such as supertagging, on parsing accuracy.
Reference: text
sentIndex sentText sentNum sentScore
1 We find that this mismatch causes many n-best CCG parses to be semantically equivalent, and describe a hashing technique that eliminates this problem, improving oracle n-best F-score by 0.7% and reranking accuracy by 0.4%. [sent-6, score-0.667]
2 Reranking operates over a list of n-best parses according to the original model, allowing poor local parse decisions to be identified using arbitrarily rich parse features. [sent-12, score-0.242]
3 Huang and Chiang (2005)’s n-best algorithms are used in a wide variety of parsers, including an n-best version of the C&C CCG parser (Clark and Curran, 2007; Brennan, 2008). [sent-14, score-0.227]
4 The oracle F-score of this parser (calculated by selecting the best parse in the n-best list) is 92. [sent-15, score-0.435]
5 In contrast, the Charniak parser records an oracle F-score of 96. [sent-18, score-0.389]
6 We describe how n-best parsing algorithms that operate over derivations do not account for absorption ambiguities in parsing, causing semantically identical parses to exist in the CCG n-best list. [sent-26, score-0.457]
7 We develop a hashing technique over dependencies that removes duplicates and improves the oracle F-score by 0.7%. [sent-28, score-0.527]
8 32% F-score when returning the best parse in the chart using the supertagger on standard settings. [sent-38, score-0.618]
9 Thus the supertagger contributes roughly 5% of parser error, and the parser model the remaining 7. [sent-39, score-0.967]
10 Figure 1: A CCG derivation with a PP adjunct, demonstrating forward and backward combinator application (the derivation of Jack swims across the river, with lexical categories Jack := NP, swims := S\NP, across := ((S\NP)\(S\NP))/NP, the := NP/N, river := N). [sent-46, score-0.265]
11 In Figure 1, swims generates one dependency: ⟨swims, S[dcl]\NP_1, 1, Jack, −⟩, where the dependency contains the head word, head category, argument slot, argument word, and whether the dependency is long-range. [sent-60, score-0.302]
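To make the dependency format concrete, here is a minimal sketch (Python, not taken from the C&C code) of that five-field dependency written as a plain tuple; the field layout follows the description in the sentence above.

```python
# The single dependency generated by "swims" in Figure 1, written as a
# (head word, head category, argument slot, argument word, long-range flag)
# tuple. The _1 marks argument slot 1 of the S[dcl]\NP category, and the
# False flag records that the dependency is not long-range.
swims_dep = ("swims", r"S[dcl]\NP_1", 1, "Jack", False)
```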
12 Figure 2: A CCG derivation with a PP argument (note the categories of swims and across, which here become (S\NP)/PP and PP/NP respectively). [sent-61, score-0.484]
13 The standard CCG parsing evaluation calculates labeled precision, recall, and F-score over the dependencies recovered by a parser as compared to CCGbank (Clark et al. [sent-69, score-0.37]
14 However, the adjunct to argument change results in different categories for swims and across; nearly every CCG dependency in the sentence is headed by one of these two words and thus each one changes as a result. [sent-74, score-0.359]
15 All experiments in this paper use the normal-form C&C parser model over CCGbank 00 (Clark and Curran, 2007). [sent-76, score-0.227]
16 Scores are reported for sentences which the parser could analyse; we observed similar conclusions when repeating our experiments over the subset of sentences that were parsable under all configurations described in this paper. [sent-77, score-0.252]
17 The C&C parser (Clark and Curran, 2007) is a fast and accurate CCG parser trained on CCGbank 02-21, with an accuracy of 86. [sent-79, score-0.726]
18 It is a two-phase system, where a supertagger assigns possible categories to words in a sentence and the parser combines them using the CKY algorithm. [sent-81, score-0.853]
19 A parameter β is passed to the supertagger as a multi-tagging probability beam. [sent-85, score-0.513]
20 β is initially set at a very restrictive value, and if the parser cannot form an analysis the supertagger is rerun with a lower β, returning more categories and giving the parser more options in constructing a parse. [sent-86, score-1.08]
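The back-off regime just described can be sketched as a simple loop. The following Python fragment is illustrative only: the β schedule and the supertagger/parser interfaces (multitag, build_chart, has_spanning_analysis, best_parse) are assumed names, not the actual C&C API.

```python
# Hypothetical beam schedule: start restrictive, widen on parse failure.
BETA_SCHEDULE = [0.075, 0.03, 0.01, 0.005, 0.001]

def parse_with_backoff(sentence, supertagger, parser):
    """Retry with a wider multi-tagging beam whenever the parser cannot
    build a spanning analysis from the categories it was given."""
    for beta in BETA_SCHEDULE:
        # Each word keeps every category whose probability is within a
        # factor beta of its most probable category.
        categories = supertagger.multitag(sentence, beta)
        chart = parser.build_chart(sentence, categories)
        if chart.has_spanning_analysis():
            return parser.best_parse(chart)
    return None  # no analysis at any beta: a coverage failure
```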
21 The supertagger also uses a tag dictionary, as described by Ratnaparkhi (1996), and accepts a cut-off k. [sent-88, score-0.553]
22 Words seen more than k times in CCGbank 02-21 may only be assigned categories seen with that word more than 5 times in CCGbank 02-21; the frequency must also be no less than 1/500th of the most frequent tag for that word. [sent-89, score-0.231]
23 Words seen fewer than k times may only be assigned categories seen with the POS of the word in CCGbank 02-21, subject to the cutoff and ratio constraint (Clark and Curran, 2004b). [sent-90, score-0.231]
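A rough Python sketch of this tag dictionary filter follows. The count tables and function name are assumptions made for illustration, but the thresholds (the cut-off k, the count of 5, and the 1/500 ratio) are those stated in the two sentences above.

```python
def allowed_categories(word, pos, k, word_cat_counts, pos_cat_counts):
    """Return the categories the supertagger may consider for a word.

    word_cat_counts[word] and pos_cat_counts[pos] are assumed to map
    each category to its frequency in CCGbank 02-21.
    """
    word_counts = word_cat_counts.get(word, {})
    if sum(word_counts.values()) > k:
        counts = word_counts                    # frequent word: word-specific counts
    else:
        counts = pos_cat_counts.get(pos, {})    # rare word: back off to its POS tag
    if not counts:
        return set()
    most_frequent = max(counts.values())
    # A category must be seen more than 5 times and be no less than
    # 1/500th as frequent as the most frequent category for this entry.
    return {cat for cat, freq in counts.items()
            if freq > 5 and freq >= most_frequent / 500}
```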
24 The tag dictionary eliminates infrequent categories and improves the performance of the supertagger, but at the cost of removing unseen or infrequently seen categories from consideration. [sent-91, score-0.353]
25 The parser accepts POS-tagged text as input; unlike many PTB parsers, these tags are fixed and remain unchanged throughout the parsing pipeline. [sent-92, score-0.34]
26 The POS tags are important features for the supertagger; parsing accuracy using gold-standard POS tags is typically 2% higher than using automatically assigned POS tags (Clark and Curran, 2004b). [sent-93, score-0.26]
27 Huang and Chiang (2005) define several n-best algorithms that allow dynamic programming to be retained whilst generating precisely the top n parses using the observation that once the 1-best parse is generated, the 2nd best parse must differ in exactly one location from it, and so forth. [sent-97, score-0.242]
28 Collins (2000)’s parser reranker uses n-best parses of PTB 02-21 as training data. [sent-102, score-0.497]
29 The system improves the accuracy of the Collins parser from 88. [sent-104, score-0.272]
30 In 50-best mode the parser has an oracle F-score of 96. [sent-109, score-0.437]
31 The C&C parser employs the normal-form constraints of Eisner (1996) to address spurious ambiguity in 1-best parsing. [sent-115, score-0.265]
32 However, these constraints do not remove absorption ambiguity in CCG; Figure 3 depicts four semantically equivalent sequences of absorption and combinator application in a sentence fragment. [sent-126, score-0.642]
33 The Brennan (2008) CCG n-best parser differentiates CCG parses by derivation rather than logical form. [sent-127, score-0.462]
34 To illustrate how this is insufficient, we ran the parser using Algorithm 3 of Huang and Chiang (2005) with n = 10 and n = 50, and calculated how many parses were semantically distinct (i. [sent-128, score-0.449]
35 The results (summarised in Table 1) are striking: just 52% of 10-best parses and 34% of 50-best parses are distinct. [sent-131, score-0.3]
36 GRs are generated via a dependency to GR mapping in the parser as well as a post-processing script to clean up common errors (Clark and Curran, 2007). [sent-135, score-0.282]
37 GRs provide a more formalism-neutral comparison and abstract away from the raw CCG dependencies; for example, in Figures 1 and 2, the dependency from swims to Jack would be abstracted into (subj swims Jack) and thus would be identical in both parses. [sent-136, score-0.339]
38 Hence, there are even fewer distinct parses in the GR results summarised in Table 2: 45% and 27% of 10-best and 50-best parses respectively yield unique GRs. [sent-137, score-0.355]
39 To address this problem of semantically equivalent n-best parses, we define a uniqueness constraint over all the n-best candidates. [sent-139, score-0.406]
40 Enforcing this constraint is non-trivial as it is infeasible to directly compare every dependency in a partial tree with another. [sent-150, score-0.464]
41 Our technique does not use a hashtable, and instead only stores the hash value for each set of dependencies, which is much more efficient but runs the risk of filtering unique parses due to collisions. [sent-158, score-0.255]
42 XOR is commonly employed in hashing applications for randomly permuting numbers, and it is also order independent: a ⊕ b ≡ b ⊕ a. [sent-162, score-0.284]
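The sketch below shows one way to realise this scheme in Python: hash each dependency, fold the hashes together with XOR so the result is independent of derivation order, and keep only the resulting hash value per candidate when filtering the n-best list. The tuple layout and the use of Python's built-in hash are illustrative stand-ins, not the C&C implementation.

```python
from typing import Iterable, List, Set, Tuple

# A dependency as a (head word, head category, slot, argument word,
# long-range flag) tuple, as in the example after Figure 1.
Dependency = Tuple[str, str, int, str, bool]

def dependency_set_hash(deps: Iterable[Dependency]) -> int:
    """Order-independent hash of a set of dependencies: XOR of the
    individual dependency hashes (a ^ b == b ^ a)."""
    h = 0
    for dep in deps:
        h ^= hash(dep)
    return h

def filter_equivalent(nbest: List[Tuple[object, List[Dependency]]]) -> List[object]:
    """Keep the first parse seen for each distinct dependency set.

    Only the hash values are stored (no hashtable of full dependency
    sets), so two genuinely different parses could in principle collide
    and one be discarded, as noted in the text."""
    seen: Set[int] = set()
    kept = []
    for parse, deps in nbest:
        key = dependency_set_hash(deps)
        if key not in seen:
            seen.add(key)
            kept.append(parse)
    return kept
```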
43 Figure 3: Four derivations of the fragment "big red ball )" (lexical categories N/N, N/N, N, RRB); all four derivations have a different syntactic structure, but generate identical dependencies. [sent-164, score-0.564]
44 We evaluate our hashing technique with several experiments. [sent-169, score-0.284]
45 We reran the diversity experiments, and verified that every n-best parse for every sentence in CCGbank 00 was unique (see Table 1), corroborating our decision to use hashing alone. [sent-175, score-0.385]
46 On average, there are fewer parses per sentence, showing that hashing is eliminating many equivalent parses for more ambiguous sentences. [sent-176, score-0.623]
47 However, hashing also leads to a near doubling of unique parses in 10-best mode and a 2. [sent-177, score-0.508]
48 These results show that hashing prunes away equivalent parses, creating more diversity in the n-best list. [sent-180, score-0.352]
49 We also evaluate the oracle F-score of the parser using dependency hashing. [sent-181, score-0.444]
50 Table 4: Oracle precision, recall, and F-score on gold and auto POS tags for the C&C n-best parser. [sent-191, score-0.25]
51 Table 5: Reranked parser accuracy; labeled F-score using gold POS tags, with and without dependency hashing. [sent-195, score-0.641]
52 Finally, we implement a discriminative maximum entropy reranker for the n-best C&C parser and evaluate it when using dependency hashing. [sent-196, score-0.472]
53 Our experiments rerank the top 10-best parses and use four configurations: with and without dependency hashing for generating the training and test data for the reranker. [sent-199, score-0.489]
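As a simple illustration of the selection step only (not the maximum entropy training), the following hedged Python sketch scores each candidate with a weighted feature sum and returns the argmax; the feature extractor and weight dictionary are hypothetical stand-ins.

```python
def rerank(nbest, extract_features, weights):
    """Return the candidate with the highest log-linear score,
    score(parse) = sum_i w_i * f_i(parse)."""
    def score(parse):
        return sum(weights.get(name, 0.0) * value
                   for name, value in extract_features(parse).items())
    return max(nbest, key=score)
```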
54 Table 5 shows that labeled F-score improves substantially when dependency hashing is used to create reranker training data. [sent-200, score-0.459]
55 15% using hashing at test are statistically indistinguishable from one another, though we would expect the latter to perform better. [sent-207, score-0.284]
56 Our results also show that the reranker performs extremely poorly using diversified test parses and undiversified training parses. [sent-208, score-0.311]
57 A substantial gap exists between the oracle F-score of our improved n-best parser and that of other PTB n-best parsers (Charniak and Johnson, 2005). [sent-214, score-0.673]
58 We analyse three main classes of errors in the C&C parser in order to answer this question: grammar error, supertagger error, and model error. [sent-216, score-0.786]
59 Grammar error: the parser implements a subset of the grammar and unary type-changing rules in CCGbank, with some rules, such as substitution, omitted for efficiency (Clark and Curran, 2007). [sent-218, score-0.31]
60 This means that, given the correct categories for words in a sentence, the parser may be unable to combine them into a derivation yielding the correct dependencies, or it may not recognise the gold standard category at all. [sent-219, score-0.437]
61 There is an additional constraint in the parser that only allows two categories to combine if they have been seen to combine in the training data. [sent-220, score-0.471]
62 This seen rules constraint is used to reduce the size of the chart and improve parsing speed, at the cost of only permitting category combinations seen in CCGbank 02-21 (Clark and Curran, 2007). [sent-221, score-0.331]
63 Supertagger error: The supertagger uses a restricted set of 425 categories determined by a frequency cutoff of 10 over the training data (Clark and Curran, 2004b). [sent-222, score-0.626]
64 The β parameter restricts the categories to within a probability beam, and the tag dictionary restricts the set of categories that can be considered for each word. [sent-224, score-0.396]
65 Supertagger model error occurs when the supertagger can assign a word its correct category, but the statistical model does not assign the correct tag enough probability for it to fall within the β. [sent-225, score-0.625]
66 In this experiment the parser only needs to combine the categories correctly to form the gold parse. [sent-231, score-0.466]
67 To determine supertagger and model error, we ran the parser on standard settings over CCGbank 00 and examined the chart. [sent-241, score-0.812]
68 If it contains the gold parse, then a model error results if the parser returns any other parse. [sent-242, score-0.374]
69 Otherwise, it is a supertagger or grammar error, where the parser cannot construct the best parse. [sent-243, score-0.786]
70 Each partial tree was scored using the formula score = n_correct − n_bad, where n_correct is the number of dependencies which appear in the gold standard, and n_bad is the number of dependencies which do not. [sent-245, score-0.436]
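In code, this scoring and the best-in-chart selection amount to the following minimal sketch; the candidate objects with a .dependencies attribute are an assumption for illustration.

```python
def oracle_score(dependencies, gold_dependencies):
    """score = n_correct - n_bad over a candidate's dependencies."""
    n_correct = sum(1 for dep in dependencies if dep in gold_dependencies)
    n_bad = len(dependencies) - n_correct
    return n_correct - n_bad

def best_in_chart(candidates, gold_dependencies):
    """Pick the candidate whose dependencies score highest against gold."""
    return max(candidates,
               key=lambda parse: oracle_score(parse.dependencies,
                                              gold_dependencies))
```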
71 -tagdict indicates disabling the tag dictionary, -seen rules indicates disabling the seen rules constraint (table columns: β, k, cats/word, sent/sec, LP, LR, LF, AF, cover, ∆LF, ∆AF). [sent-270, score-0.349]
72 Table 7: Category ambiguity, speed, labeled P, R, F-score on gold and auto POS, and coverage over CCGbank 00 for the standard supertagger parameters, selecting the best scoring parse against the gold parse in the chart. [sent-291, score-0.931]
73 49% represents supertagger error (where the supertagger has not provided the correct categories), and the difference to the baseline performance indicates model error (where the parser model has not selected the optimal parse given the current categories). [sent-294, score-1.443]
74 The impact of tag dictionary errors must be neutralised in order to distinguish between the types of supertagger error. [sent-296, score-0.627]
75 This was done for categories that the supertagger could use; categories that were not in the permissible set of 425 categories were not considered. [sent-298, score-0.852]
76 This is an optimistic experiment; removing the tag dictionary entirely would greatly increase the number of categories considered by the supertagger and may dramatically change the tagging results. [sent-299, score-0.714]
77 The delta columns indicate the difference in labeled F-score from the oracle result, which discounts the grammar error in the parser. [sent-301, score-0.28]
78 We ran the experiment in four configurations: disabling the tag dictionary, disabling the seen rules constraint, and disabling both. [sent-302, score-0.297]
79 These numbers are the upper bound of the parser with the supertagger on standard settings. [sent-308, score-0.795]
80 Our result with gold POS tags is statistically identical to the oracle experiment conducted by Auli and Lopez (2011), which exchanged brackets for dependencies in the forest oracle algorithm of Huang (2008). [sent-309, score-0.619]
81 A perfect tag dictionary that always contains the gold standard category if it is available results in an upper bound accuracy of 95. [sent-311, score-0.318]
82 This shows that overall supertagger error in the parser is around 5. [sent-313, score-0.812]
83 2%, with roughly 1% attributable to the use of the tag dictionary and the remainder to the supertagger model. [sent-314, score-0.601]
84 5% worse than the oracle categories result due to model error and supertagger error, so model error accounts for roughly 7. [sent-316, score-0.958]
85 5% accuracy improvement over both the standard parser configuration and the -tagdict configuration, at the cost of roughly 0. [sent-319, score-0.297]
86 The results also show that model and supertagger error largely account for the remaining oracle accuracy difference between the C&C n-best parser and the Charniak/Collins n-best parsers. [sent-335, score-1.045]
87 The absolute upper bound of the C&C parser is only 1% higher than the oracle 50-best score in Table 4, placing the n-best parser close to its theoretical limit. [sent-336, score-0.671]
88 We conduct a further experiment to determine the impact of the standard β and k values used in the parser. [sent-338, score-0.564]
89 The coverage peaks at the second-lowest value because, at lower β values, so many categories are returned that not all of the possible derivations can be stored in the chart. [sent-348, score-0.253]
90 The back-off approach substantially increases coverage by ensuring that sentences that fail to parse at higher β values are retried at lower ones, at the cost of reducing the upper accuracy bound to below that of any individual β. [sent-349, score-0.302]
91 The speed of the parser varies substantially in this experiment, from 40. [sent-350, score-0.257]
92 There is a substantial difference in accuracy between experiments that use gold POS tags and those that use auto POS tags. [sent-355, score-0.295]
93 Both the supertagger and parser use POS tags independently as features, but this result suggests that the bulk of the performance difference comes from the supertagger. [sent-359, score-0.791]
94 To fully identify the error contributions, we ran an experiment where we provided gold POS tags to either the parser or the supertagger, and auto POS tags to the other, and then ran the standard evaluation (the oracle experiment will be identical to the “best in chart”). [sent-360, score-0.848]
95 Table 8 shows that supplying the parser with auto POS tags reduces accuracy by 0. [sent-361, score-0.474]
96 27% compared to the baseline parser, while supplying the supertagger with auto POS tags results in a 1. [sent-362, score-0.715]
97 The parser uses more features in a wider context than the supertagger, so it is less affected by POS tag errors. [sent-364, score-0.267]
98 We have described how a mismatch between the way CCG parses are modeled and evaluated caused equivalent parses to be produced in n-best parsing. [sent-365, score-0.367]
99 We eliminate duplicates by hashing dependencies, significantly improving the oracle F-score of CCG n-best parsing by 0.7%. [sent-366, score-0.508]
100 We have comprehensively investigated the sources of error in the C&C parser to explain the gap in oracle performance compared with other n-best parsers. [sent-370, score-0.486]
wordName wordTfidf (topN-words)
[('supertagger', 0.513), ('ccg', 0.359), ('ccgbank', 0.309), ('hashing', 0.284), ('parser', 0.227), ('oracle', 0.162), ('parses', 0.15), ('swims', 0.124), ('curran', 0.124), ('auto', 0.124), ('reranker', 0.12), ('categories', 0.113), ('derivations', 0.088), ('dependencies', 0.081), ('np', 0.081), ('clark', 0.08), ('hash', 0.079), ('absorption', 0.078), ('disabling', 0.078), ('jack', 0.078), ('gold', 0.075), ('error', 0.072), ('reranking', 0.07), ('pos', 0.065), ('rrb', 0.062), ('supertagging', 0.062), ('parsing', 0.062), ('chart', 0.059), ('dependency', 0.055), ('category', 0.055), ('derivation', 0.054), ('ball', 0.054), ('grs', 0.054), ('coverage', 0.052), ('charniak', 0.052), ('tags', 0.051), ('ptb', 0.05), ('hockenmaier', 0.05), ('dictionary', 0.048), ('mode', 0.048), ('brennan', 0.047), ('hashtable', 0.047), ('parse', 0.046), ('grammar', 0.046), ('accuracy', 0.045), ('huang', 0.044), ('semantically', 0.043), ('collisions', 0.041), ('combinatory', 0.041), ('diversified', 0.041), ('restricts', 0.041), ('tag', 0.04), ('constraint', 0.04), ('seen', 0.039), ('equivalent', 0.039), ('julia', 0.038), ('ambiguity', 0.038), ('auli', 0.037), ('xor', 0.037), ('rules', 0.037), ('identical', 0.036), ('fscore', 0.035), ('river', 0.035), ('pp', 0.034), ('argument', 0.034), ('adjunct', 0.033), ('parsers', 0.032), ('sydney', 0.031), ('nbad', 0.031), ('ncorrect', 0.031), ('optimisations', 0.031), ('parc', 0.031), ('subtractive', 0.031), ('logical', 0.031), ('speed', 0.03), ('infeasible', 0.03), ('lf', 0.03), ('distinct', 0.029), ('diversity', 0.029), ('big', 0.029), ('bound', 0.029), ('gr', 0.029), ('mismatch', 0.028), ('johnson', 0.027), ('supplying', 0.027), ('combinator', 0.027), ('red', 0.027), ('brackets', 0.027), ('impact', 0.026), ('unique', 0.026), ('upper', 0.026), ('accounts', 0.026), ('combine', 0.026), ('gap', 0.025), ('configurations', 0.025), ('backward', 0.025), ('configuration', 0.025), ('experiment', 0.025), ('tse', 0.025), ('discarding', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 71 acl-2012-Dependency Hashing for n-best CCG Parsing
Author: Dominick Ng ; James R. Curran
Abstract: Optimising for one grammatical representation, but evaluating over a different one is a particular challenge for parsers and n-best CCG parsing. We find that this mismatch causes many n-best CCG parses to be semantically equivalent, and describe a hashing technique that eliminates this problem, improving oracle n-best F-score by 0.7% and reranking accuracy by 0.4%. We also present a comprehensive analysis of errors made by the C&C CCG parser, providing the first breakdown of the impact of implementation decisions, such as supertagging, on parsing accuracy.
2 0.42602241 170 acl-2012-Robust Conversion of CCG Derivations to Phrase Structure Trees
Author: Jonathan K. Kummerfeld ; Dan Klein ; James R. Curran
Abstract: We propose an improved, bottom-up method for converting CCG derivations into PTB-style phrase structure trees. In contrast with past work (Clark and Curran, 2009), which used simple transductions on category pairs, our approach uses richer transductions attached to single categories. Our conversion preserves more sentences under round-trip conversion (51.1% vs. 39.6%) and is more robust. In particular, unlike past methods, ours does not require ad-hoc rules over non-local features, and so can be easily integrated into a parser.
3 0.1723763 4 acl-2012-A Comparative Study of Target Dependency Structures for Statistical Machine Translation
Author: Xianchao Wu ; Katsuhito Sudoh ; Kevin Duh ; Hajime Tsukada ; Masaaki Nagata
Abstract: This paper presents a comparative study of target dependency structures yielded by several state-of-the-art linguistic parsers. Our approach is to measure the impact of these nonisomorphic dependency structures to be used for string-to-dependency translation. Besides using traditional dependency parsers, we also use the dependency structures transformed from PCFG trees and predicate-argument structures (PASs) which are generated by an HPSG parser and a CCG parser. The experiments on Chinese-to-English translation show that the HPSG parser’s PASs achieved the best dependency and translation accuracies.
4 0.14659804 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities
Author: Seyed Abolghasem Mirroshandel ; Alexis Nasr ; Joseph Le Roux
Abstract: Treebanks are not large enough to reliably model precise lexical phenomena. This deficiency provokes attachment errors in the parsers trained on such data. We propose in this paper to compute lexical affinities, on large corpora, for specific lexico-syntactic configurations that are hard to disambiguate and introduce the new information in a parser. Experiments on the French Treebank showed a relative decrease of the error rate of 7.1% Labeled Accuracy Score, yielding the best parsing results on this treebank.
5 0.14132561 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?
Author: Hui Zhang ; David Chiang
Abstract: Syntax-based translation models that operate on the output of a source-language parser have been shown to perform better if allowed to choose from a set of possible parses. In this paper, we investigate whether this is because it allows the translation stage to overcome parser errors or to override the syntactic structure itself. We find that it is primarily the latter, but that under the right conditions, the translation stage does correct parser errors, improving parsing accuracy on the Chinese Treebank.
6 0.13328388 109 acl-2012-Higher-order Constituent Parsing and Parser Combination
7 0.12561344 106 acl-2012-Head-driven Transition-based Parsing with Top-down Prediction
8 0.11312117 5 acl-2012-A Comparison of Chinese Parsers for Stanford Dependencies
9 0.10759962 213 acl-2012-Utilizing Dependency Language Models for Graph-based Dependency Parsing Models
10 0.10173013 95 acl-2012-Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining
11 0.088710248 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
12 0.087937519 90 acl-2012-Extracting Narrative Timelines as Temporal Dependency Structures
13 0.086031243 57 acl-2012-Concept-to-text Generation via Discriminative Reranking
14 0.083578356 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
15 0.081936076 87 acl-2012-Exploiting Multiple Treebanks for Parsing with Quasi-synchronous Grammars
16 0.079781488 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
17 0.079505861 30 acl-2012-Attacking Parsing Bottlenecks with Unlabeled Data and Relevant Factorizations
18 0.073272921 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging
19 0.072441801 9 acl-2012-A Cost Sensitive Part-of-Speech Tagging: Differentiating Serious Errors from Minor Errors
20 0.07238616 119 acl-2012-Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese
topicId topicWeight
[(0, -0.201), (1, -0.045), (2, -0.222), (3, -0.18), (4, -0.103), (5, -0.097), (6, 0.019), (7, -0.024), (8, 0.074), (9, 0.021), (10, 0.104), (11, 0.088), (12, -0.043), (13, 0.057), (14, -0.015), (15, -0.048), (16, -0.045), (17, -0.057), (18, -0.174), (19, -0.04), (20, -0.088), (21, -0.036), (22, 0.149), (23, 0.064), (24, -0.024), (25, -0.095), (26, 0.089), (27, -0.329), (28, -0.164), (29, -0.035), (30, 0.227), (31, 0.066), (32, 0.053), (33, -0.109), (34, 0.168), (35, -0.137), (36, -0.032), (37, 0.026), (38, 0.0), (39, 0.047), (40, -0.108), (41, 0.086), (42, -0.063), (43, -0.001), (44, 0.105), (45, -0.079), (46, -0.083), (47, -0.031), (48, 0.049), (49, 0.02)]
simIndex simValue paperId paperTitle
1 0.95303404 170 acl-2012-Robust Conversion of CCG Derivations to Phrase Structure Trees
Author: Jonathan K. Kummerfeld ; Dan Klein ; James R. Curran
Abstract: We propose an improved, bottom-up method for converting CCG derivations into PTB-style phrase structure trees. In contrast with past work (Clark and Curran, 2009), which used simple transductions on category pairs, our approach uses richer transductions attached to single categories. Our conversion preserves more sentences under round-trip conversion (51.1% vs. 39.6%) and is more robust. In particular, unlike past methods, ours does not require ad-hoc rules over non-local features, and so can be easily integrated into a parser.
same-paper 2 0.9233954 71 acl-2012-Dependency Hashing for n-best CCG Parsing
Author: Dominick Ng ; James R. Curran
Abstract: Optimising for one grammatical representation, but evaluating over a different one is a particular challenge for parsers and n-best CCG parsing. We find that this mismatch causes many n-best CCG parses to be semantically equivalent, and describe a hashing technique that eliminates this problem, improving oracle n-best F-score by 0.7% and reranking accuracy by 0.4%. We also present a comprehensive analysis of errors made by the C&C CCG parser, providing the first breakdown of the impact of implementation decisions, such as supertagging, on parsing accuracy.
3 0.49723318 4 acl-2012-A Comparative Study of Target Dependency Structures for Statistical Machine Translation
Author: Xianchao Wu ; Katsuhito Sudoh ; Kevin Duh ; Hajime Tsukada ; Masaaki Nagata
Abstract: This paper presents a comparative study of target dependency structures yielded by several state-of-the-art linguistic parsers. Our approach is to measure the impact of these nonisomorphic dependency structures to be used for string-to-dependency translation. Besides using traditional dependency parsers, we also use the dependency structures transformed from PCFG trees and predicate-argument structures (PASs) which are generated by an HPSG parser and a CCG parser. The experiments on Chinese-to-English translation show that the HPSG parser’s PASs achieved the best dependency and translation accuracies.
4 0.46407926 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities
Author: Seyed Abolghasem Mirroshandel ; Alexis Nasr ; Joseph Le Roux
Abstract: Treebanks are not large enough to reliably model precise lexical phenomena. This deficiency provokes attachment errors in the parsers trained on such data. We propose in this paper to compute lexical affinities, on large corpora, for specific lexico-syntactic configurations that are hard to disambiguate and introduce the new information in a parser. Experiments on the French Treebank showed a relative decrease of the error rate of 7.1% Labeled Accuracy Score, yielding the best parsing results on this treebank.
Author: Rebecca Dridan ; Stephan Oepen
Abstract: We examine some of the frequently disregarded subtleties of tokenization in Penn Treebank style, and present a new rule-based preprocessing toolkit that not only reproduces the Treebank tokenization with unmatched accuracy, but also maintains exact stand-off pointers to the original text and allows flexible configuration to diverse use cases (e.g. to genre- or domain-specific idiosyncrasies).
6 0.39655045 30 acl-2012-Attacking Parsing Bottlenecks with Unlabeled Data and Relevant Factorizations
7 0.38037309 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing
8 0.35833117 83 acl-2012-Error Mining on Dependency Trees
9 0.3521004 109 acl-2012-Higher-order Constituent Parsing and Parser Combination
10 0.34388706 75 acl-2012-Discriminative Strategies to Integrate Multiword Expression Recognition and Parsing
11 0.32676241 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
12 0.31284288 95 acl-2012-Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining
13 0.309268 87 acl-2012-Exploiting Multiple Treebanks for Parsing with Quasi-synchronous Grammars
14 0.30718198 106 acl-2012-Head-driven Transition-based Parsing with Top-down Prediction
15 0.29462078 5 acl-2012-A Comparison of Chinese Parsers for Stanford Dependencies
16 0.28875235 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?
17 0.28817591 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
18 0.266388 213 acl-2012-Utilizing Dependency Language Models for Graph-based Dependency Parsing Models
19 0.24004254 108 acl-2012-Hierarchical Chunk-to-String Translation
20 0.2362998 90 acl-2012-Extracting Narrative Timelines as Temporal Dependency Structures
topicId topicWeight
[(26, 0.035), (28, 0.033), (30, 0.021), (37, 0.363), (39, 0.045), (57, 0.013), (71, 0.011), (74, 0.039), (82, 0.04), (84, 0.016), (85, 0.022), (90, 0.101), (92, 0.04), (94, 0.022), (99, 0.117)]
simIndex simValue paperId paperTitle
1 0.91200006 114 acl-2012-IRIS: a Chat-oriented Dialogue System based on the Vector Space Model
Author: Rafael E. Banchs ; Haizhou Li
Abstract: This system demonstration paper presents IRIS (Informal Response Interactive System), a chat-oriented dialogue system based on the vector space model framework. The system belongs to the class of example-based dialogue systems and builds its chat capabilities on a dual search strategy over a large collection of dialogue samples. Additional strategies allowing for system adaptation and learning implemented over the same vector model space framework are also described and discussed.
2 0.89657187 42 acl-2012-Bootstrapping via Graph Propagation
Author: Max Whitney ; Anoop Sarkar
Abstract: Bootstrapping a classifier from a small set of seed rules can be viewed as the propagation of labels between examples via features shared between them. This paper introduces a novel variant of the Yarowsky algorithm based on this view. It is a bootstrapping learning method which uses a graph propagation algorithm with a well defined objective function. The experimental results show that our proposed bootstrapping algorithm achieves state of the art performance or better on several different natural language data sets.
3 0.8757084 64 acl-2012-Crosslingual Induction of Semantic Roles
Author: Ivan Titov ; Alexandre Klementiev
Abstract: We argue that multilingual parallel data provides a valuable source of indirect supervision for induction of shallow semantic representations. Specifically, we consider unsupervised induction of semantic roles from sentences annotated with automatically-predicted syntactic dependency representations and use a stateof-the-art generative Bayesian non-parametric model. At inference time, instead of only seeking the model which explains the monolingual data available for each language, we regularize the objective by introducing a soft constraint penalizing for disagreement in argument labeling on aligned sentences. We propose a simple approximate learning algorithm for our set-up which results in efficient inference. When applied to German-English parallel data, our method obtains a substantial improvement over a model trained without using the agreement signal, when both are tested on non-parallel sentences.
same-paper 4 0.87047446 71 acl-2012-Dependency Hashing for n-best CCG Parsing
Author: Dominick Ng ; James R. Curran
Abstract: Optimising for one grammatical representation, but evaluating over a different one is a particular challenge for parsers and n-best CCG parsing. We find that this mismatch causes many n-best CCG parses to be semantically equivalent, and describe a hashing technique that eliminates this problem, improving oracle n-best F-score by 0.7% and reranking accuracy by 0.4%. We also present a comprehensive analysis of errors made by the C&C CCG parser, providing the first breakdown of the impact of implementation decisions, such as supertagging, on parsing accuracy.
Author: Zhaopeng Tu ; Yifan He ; Jennifer Foster ; Josef van Genabith ; Qun Liu ; Shouxun Lin
Abstract: Convolution kernels support the modeling of complex syntactic information in machine-learning tasks. However, such models are highly sensitive to the type and size of syntactic structure used. It is therefore an important challenge to automatically identify high impact sub-structures relevant to a given task. In this paper we present a systematic study investigating (combinations of) sequence and convolution kernels using different types of substructures in document-level sentiment classification. We show that minimal sub-structures extracted from constituency and dependency trees guided by a polarity lexicon show 1.45 point absolute improvement in accuracy over a bag-of-words classifier on a widely used sentiment corpus.
6 0.61292529 146 acl-2012-Modeling Topic Dependencies in Hierarchical Text Categorization
7 0.59778351 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures
8 0.57160068 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning
9 0.56191146 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
10 0.56092817 147 acl-2012-Modeling the Translation of Predicate-Argument Structure for SMT
11 0.55229872 170 acl-2012-Robust Conversion of CCG Derivations to Phrase Structure Trees
12 0.55008817 30 acl-2012-Attacking Parsing Bottlenecks with Unlabeled Data and Relevant Factorizations
13 0.54794562 106 acl-2012-Head-driven Transition-based Parsing with Top-down Prediction
14 0.54501504 130 acl-2012-Learning Syntactic Verb Frames using Graphical Models
15 0.54154634 184 acl-2012-String Re-writing Kernel
16 0.54122889 191 acl-2012-Temporally Anchored Relation Extraction
17 0.53631723 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities
18 0.53506404 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents
19 0.53427786 12 acl-2012-A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relation Extraction
20 0.53423673 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool