acl acl2012 acl2012-170 knowledge-graph by maker-knowledge-mining

170 acl-2012-Robust Conversion of CCG Derivations to Phrase Structure Trees


Source: pdf

Author: Jonathan K. Kummerfeld ; Dan Klein ; James R. Curran

Abstract: We propose an improved, bottom-up method for converting CCG derivations into PTB-style phrase structure trees. In contrast with past work (Clark and Curran, 2009), which used simple transductions on category pairs, our approach uses richer transductions attached to single categories. Our conversion preserves more sentences under round-trip conversion (51.1% vs. 39.6%) and is more robust. In particular, unlike past methods, ours does not require ad-hoc rules over non-local features, and so can be easily integrated into a parser.

Reference: text


Summary: the most important sentences generated by the tfidf model
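
The sentScore values below are weights produced by a tf-idf sentence scorer. The exact pipeline behind this page is not documented here, so the following is only a generic sketch of how such scores could be computed (a bag-of-words tf-idf averaged over each sentence's tokens):

    # Generic tf-idf sentence scorer (assumed pipeline; for illustration only).
    import math
    from collections import Counter

    def tfidf_sentence_scores(sentences):
        """Score each sentence by the average tf-idf weight of its tokens."""
        tokenized = [s.lower().split() for s in sentences]
        df = Counter(tok for toks in tokenized for tok in set(toks))
        n = len(tokenized)
        scores = []
        for toks in tokenized:
            tf = Counter(toks)
            total = sum(tf[t] * math.log(n / df[t]) for t in tf)
            scores.append(total / max(len(toks), 1))
        return scores

    sents = ["We propose an improved bottom-up method .",
             "Our conversion preserves more sentences under round-trip conversion ."]
    print(tfidf_sentence_scores(sents))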

sentIndex sentText sentNum sentScore

1 Abstract We propose an improved, bottom-up method for converting CCG derivations into PTB-style phrase structure trees. [sent-5, score-0.143]

2 In contrast with past work (Clark and Curran, 2009), which used simple transductions on category pairs, our approach uses richer transductions attached to single categories. [sent-6, score-0.144]

3 Our conversion preserves more sentences under round-trip conversion (51.1% vs. 39.6%) and is more robust. [sent-7, score-0.39]

4 In particular, unlike past methods, ours does not require ad-hoc rules over non-local features, and so can be easily integrated into a parser. [sent-11, score-0.067]

5 , 2008), LTAG (Xia, 1999), and CCG (Hockenmaier, 2003), is a complex process that renders linguistic phenomena in formalism-specific ways. [sent-15, score-0.025]

6 Tools for reversing these conversions are desirable for downstream parser use and parser comparison. [sent-16, score-0.331]

7 However, reversing conversions is difficult, as corpus conversions may lose information or smooth over PTB inconsistencies. [sent-17, score-0.149]

8 Clark and Curran (2009) developed a CCG to PTB conversion that treats the CCG derivation as a phrase structure tree and applies hand-crafted rules to every pair of categories that combine in the derivation. [sent-18, score-0.524]

9 Because their approach does not exploit the generalisations inherent in the CCG formalism, they must resort to ad-hoc rules over non-local features of the CCG constituents being combined (when a fixed pair of CCG categories correspond to multiple PTB structures). [sent-19, score-0.259]

10 Even with such rules, they correctly convert only 39.6% of sentences. [sent-20, score-0.044]

11 Our conversion assigns a set of bracket instructions to each word based on its CCG category, then follows the CCG derivation, applying and combining instructions at each combinatory step to build a phrase structure tree. [sent-25, score-0.788]
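
A minimal sketch of the procedure this sentence describes. The data structures here (leaves carrying a queue of bracket instructions, instructions as tree-building functions) are hypothetical simplifications, not the authors' implementation:

    # Sketch of the bottom-up CCG-to-PTB conversion driver (illustrative only).
    # A leaf carries the PTB subtree for its word plus a queue of bracket
    # instructions, one per argument of its CCG category.

    def make_leaf(word, pos, instructions):
        return {"tree": (pos, word), "instructions": list(instructions)}

    def apply_step(functor, argument):
        """One combinatory step: pop the functor's next instruction and use it
        to combine the two partial PTB trees."""
        instr = functor["instructions"].pop(0)
        combined = instr(functor["tree"], argument["tree"])
        return {"tree": combined, "instructions": functor["instructions"]}

    # Example instruction in the spirit of (NP f a): wrap functor and argument in NP.
    np_bracket = lambda f, a: ("NP", f, a)

    det = make_leaf("the", "DT", [np_bracket])    # e.g. category NP[nb]/N
    noun = make_leaf("magistrates", "NNS", [])    # e.g. category N
    print(apply_step(det, noun)["tree"])          # ('NP', ('DT', 'the'), ('NNS', 'magistrates'))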

12 This requires specific instructions for each category (not all pairs), and generic operations for each combinator. [sent-26, score-0.258]

13 We cover all categories in the development set and correctly convert 51.1% of sentences. [sent-27, score-0.179]

14 Unlike Clark and Curran’s approach, we require no rules that consider non-local features of constituents, which enables the possibility of simple integration with a CKY-based parser. [sent-29, score-0.039]

15 The most common errors our approach makes involve nodes for clauses and rare spans such as QPs, NXs, and NACs. [sent-30, score-0.153]

16 Many of these errors are inconsistencies in the original PTB annotations that are not recoverable. [sent-31, score-0.111]

17 These issues make evaluating parser output difficult, but our method does enable an improved comparison of CCG and PTB parsers. [sent-32, score-0.148]

18 2 Background There has been extensive work on converting parser output for evaluation, e. [sent-33, score-0.205]

19 There has also been work on conversion to phrase structure, from dependencies (Xia and Palmer, 2001; Xia et al. [sent-37, score-0.225]

20 Our focus is on CCG to PTB conversion (Clark and Curran, 2009). [sent-41, score-0.195]

21 2.1 Combinatory Categorial Grammar (CCG) The lower half of Figure 1 shows a CCG derivation (Steedman, 2000) in which each word is assigned a category, and combinatory rules are applied to adjacent categories until only one remains. [sent-43, score-0.411]

22 the N assigned to magistrates, or complex functions of the form result / arg, where result and arg are categories and the slash indicates the argument’s directionality. [sent-56, score-0.21]

23 Figure 1 uses function application, where a complex category consumes an adjacent argument to form its result, e.g. [sent-58, score-0.179]

24 S[dcl] \NP combines with the NP to its left to form an S[dcl] . [sent-60, score-0.026]
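
For readers less familiar with CCG, a small sketch of categories and function application. This is a simplified string-based treatment (features such as [dcl] are carried along but not checked separately), not the paper's machinery:

    # Tiny illustration of CCG categories and function application (simplified).

    def split_category(cat):
        """Split 'result/arg' or 'result\\arg' at its rightmost top-level slash."""
        depth = 0
        for i in range(len(cat) - 1, -1, -1):
            c = cat[i]
            if c == ')':
                depth += 1
            elif c == '(':
                depth -= 1
            elif depth == 0 and c in '/\\':
                return cat[:i], c, cat[i + 1:]
        return None  # atomic category such as 'N' or 'NP'

    def apply_function(functor, argument):
        """Forward or backward application: result/arg + arg -> result."""
        parts = split_category(functor)
        assert parts is not None, "functor must be a complex category"
        result, _slash, arg = parts
        assert arg.strip('()') == argument.strip('()'), "argument does not match"
        return result.strip('()')

    # 'S[dcl]\NP' combines with the NP to its left to form 'S[dcl]'.
    print(apply_function('S[dcl]\\NP', 'NP'))        # S[dcl]
    print(apply_function('(S[dcl]\\NP)/NP', 'NP'))   # S[dcl]\NP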

25 More powerful combinators allow categories to combine with greater flexibility. [sent-61, score-0.159]

26 We cannot form a PTB tree by simply relabeling the categories in a CCG derivation because the mapping to phrase labels is many-to-many, CCG derivations contain extra brackets due to binarisation, and there are cases where the constituents in the PTB tree and the CCG derivation cross (e. [sent-62, score-0.553]

27 2.2 Clark and Curran (2009) Clark and Curran (2009), hereafter C&C-CONV, assign a schema to each leaf (lexical category) and rule (pair of combining categories) in the CCG derivation. [sent-66, score-0.055]

28 The PTB tree is constructed from the CCG bottom-up, creating leaves with lexical schemas, then merging/adding sub-trees using rule schemas at each step. [sent-67, score-0.141]

29 C&C-CONV has sparsity problems, requiring schemas for all valid pairs of categories: at a minimum, the 2853 unique category combinations found in CCGbank. [sent-70, score-0.287]

30 Clark and Curran (2009) create schemas for only 776 of these, handling the remainder with approximate catch-all rules. [sent-71, score-0.09]

31 C&C-CONV only specifies one simple schema for each rule (pair of categories). [sent-72, score-0.083]

32 (N/N)/(N/N) + N/N: “more than” + “30” (1), “relatively” + “small” (2). Here either a QP bracket (1) or an ADJP bracket (2) should be created. [sent-75, score-0.214]

33 Since both examples involve the same rule schema, C&C-CONV would incorrectly process them in the same way. [sent-76, score-0.031]

34 To combat the most glaring errors, C&C-CONV manipulates the PTB tree with ad-hoc rules based on non-local features over the CCG nodes being combined, an approach that cannot be easily integrated into a parser. [sent-77, score-0.143]

35 These disadvantages are a consequence of failing to exploit the generalisations that CCG combinators define. [sent-78, score-0.127]

36 We return to this example below to show how our approach handles both cases correctly. [sent-79, score-0.065]

37 3 Our Approach Our conversion assigns a set of instructions to each lexical category and defines generic operations for each combinator that combine instructions. [sent-80, score-0.477]

38 Figure 2 shows a typical instruction, which specifies the node to create and where to place the PTB trees associated with the two categories combining. [sent-81, score-0.163]

39 Categories with multiple arguments are assigned one instruction per argument, e. [sent-83, score-0.094]

40 These are applied one at a time, as each combinatory step occurs. [sent-86, score-0.134]

41 For the example from the previous section we begin by assigning the instructions shown in Table 3. [sent-87, score-0.161]

42 Some of these can apply immediately as they do not involve an argument, e. [sent-88, score-0.031]

43 One of the more complex cases in the example is Italian, which is assigned (NP f {a}). [sent-91, score-0.086]

44 Symbol / Meaning / Example: (X f a), add an X bracket around functor and argument, e.g. (VP f a); {}, flatten the enclosed node, e.g. (N f {a}); X{*}, use the same label as the arg. [sent-95, score-0.172]

45 or default to X, e.g. (S f {a}) or (N f {a}); fi, place subtrees, e.g. (PP f0 (S f1. [sent-96, score-0.065]

46 . . . fk a)). Table 2: Types of operations in instructions. [sent-98, score-0.035]
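
A sketch of what the operation types in Table 2 could look like as tree-building helpers. The tuple encoding of trees is an assumption made here for illustration, not the paper's representation:

    # Illustrative versions of the instruction operations in Table 2
    # (trees are (label, child, child, ...) tuples; this is not the authors' code).

    def add_bracket(label, functor, argument):
        """(X f a): add an X bracket around functor and argument."""
        return (label, functor, argument)

    def flatten(node):
        """{}: return a node's children so the caller can splice them into the parent."""
        return list(node[1:])

    def same_label_as_arg(argument, default, functor):
        """X{*}: label the new bracket with the argument's label, else the default."""
        label = argument[0] if argument else default
        return (label, functor, argument)

    def place_subtrees(label, *subtrees):
        """f_i: place subtrees at specified positions under a new node."""
        return (label,) + subtrees

    print(add_bracket("VP", ("VBD", "saw"), ("NP", ("NNS", "stars"))))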

47 For the complete example the final tree is almost correct but omits the S bracket around the final two NPs. [sent-99, score-0.158]

48 To fix our example we could have modified our instructions to use the final symbol in Table 2. [sent-100, score-0.161]

49 However, for this particular construction the PTB annotations are inconsistent, and so we cannot recover without introducing more errors elsewhere. [sent-102, score-0.104]

50 For combinators other than function application, we combine the instructions in various ways. [sent-103, score-0.274]

51 Additionally, we vary the instructions assigned based on the POS tag in 32 cases, and for the word not, to recover distinctions not captured by CCGbank categories alone. [sent-104, score-0.393]

52 In 52 cases the later instructions depend on the structure of the argument being picked up. [sent-105, score-0.261]

53 We have sixteen special cases for non-combinatory binary rules and twelve special cases for non-combinatory unary rules. [sent-106, score-0.109]

54 Our approach handles the QP vs. ADJP example because the two cases have different lexical categories: ((N/N)/(N/N))\(S[adj]\NP) on than and (N/N)/(N/N) on relatively. [sent-108, score-0.035]

55 This lexical difference means we can assign different instructions to correctly recover the QP and ADJP nodes, whereas C&C-CONV applies the same schema in both cases as the categories combining are the same. [sent-109, score-0.426]
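
A sketch of the point being made: because the lexical categories differ, an instruction table keyed on the lexical category can give the two cases different brackets. The table below is hypothetical and only illustrates the lookup:

    # "than" and "relatively" carry different lexical categories, so they can be
    # mapped to different bracket instructions even though the combining rule
    # ((N/N)/(N/N) + N/N) is the same (illustrative lookup only).
    INSTRUCTIONS = {
        r"((N/N)/(N/N))\(S[adj]\NP)": lambda f, a: ("QP", f, a),    # "more than" + "30"
        r"(N/N)/(N/N)":               lambda f, a: ("ADJP", f, a),  # "relatively" + "small"
    }

    def convert_pair(category, functor_tree, argument_tree):
        return INSTRUCTIONS[category](functor_tree, argument_tree)

    print(convert_pair(r"(N/N)/(N/N)", ("RB", "relatively"), ("JJ", "small")))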

56 4 Evaluation Using sections 00-21 of the treebanks, we handcrafted instructions for 527 lexical categories, a process that took under 100 hours; these include all the categories used by the C&C parser. [sent-110, score-0.296]

57 There are 647 further categories and 35 non-combinatory binary rules in sections 00-21 that we did not annotate. [sent-111, score-0.174]

58 Table 3: Instruction sets for the categories in Figure 1 (e.g. N is assigned (NP f); sets are also listed for N/N1, NP[nb]/N1 and ((S[dcl]\NP3)/NP2)/NP1). For 107 [sent-112, score-0.215]

59 (len ≤ 40) Table 4: PARSEVAL Precision, Recall, F-Score, and exact sentence match for converted gold CCG derivations. [sent-146, score-0.13]

60 unannotated categories, we use the instructions of the result category with an added instruction. [sent-147, score-0.223]
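
A sketch of this fallback, assuming instructions are stored per category string and that the result of X/Y or X\Y is X (helper names are hypothetical):

    # Fallback for categories without hand-written instructions (illustrative):
    # reuse the result category's instruction set plus one extra generic
    # instruction for the additional argument.

    def result_of(category):
        """Strip the outermost argument from X/Y or X\\Y; None if atomic."""
        depth = 0
        for i in range(len(category) - 1, -1, -1):
            c = category[i]
            depth += (c == ')') - (c == '(')
            if depth == 0 and c in '/\\':
                return category[:i].strip('()')
        return None

    def instructions_for(category, table, generic):
        if category in table:
            return list(table[category])
        result = result_of(category)
        if result is None:
            return [generic]               # unknown atomic category
        return instructions_for(result, table, generic) + [generic]

    print(len(instructions_for(r"(S[dcl]\NP)/PP", {r"S[dcl]\NP": ["i1"]}, "generic")))  # 2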

61 Table 4 compares our approach with C&C-CONV on gold CCG derivations. [sent-148, score-0.047]

62 Many of the remaining errors relate to missing and extra clause nodes and a range of rare structures, such as QPs, NACs, and NXs. [sent-153, score-0.17]

63 The only other prominent errors are single word spans, e. [sent-154, score-0.064]

64 Many of these errors are unrecoverable from CCGbank, either because inconsistencies in the PTB have been smoothed over or because they are genuine but rare constructions that were lost. [sent-157, score-0.171]

65 4.1 Parser Comparison When we convert the output of a CCG parser, the PTB trees that are produced will contain errors created by our conversion as well as by the parser. [sent-159, score-0.333]

66 In this section we are interested in comparing parsers, so we need to factor out errors created by our conversion. [sent-160, score-0.064]

67 One way to do this is to calculate a projected score (PROJ), as the parser result over the oracle result, but this is a very rough approximation. [sent-161, score-0.152]

68 Another way is to evaluate only on the 51% of sentences for which our conversion from gold CCG derivations is perfect (CLEAN). [sent-162, score-0.298]
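
The two factored comparisons can be written down directly. A toy sketch (the paper reports corpus-level PARSEVAL scores; the per-sentence averaging and the numbers below are purely illustrative):

    # PROJ: parser score divided by the oracle (gold-derivation conversion) score.
    # CLEAN: score restricted to sentences whose gold derivations convert perfectly.
    # (Toy numbers and per-sentence averaging, for illustration only.)

    def proj(parser_score, oracle_score):
        return parser_score / oracle_score

    def clean(per_sentence_scores, perfect_ids):
        kept = [per_sentence_scores[i] for i in perfect_ids]
        return sum(kept) / len(kept)

    scores = {0: 0.92, 1: 0.80, 2: 0.75}      # toy per-sentence F-scores
    print(round(proj(0.85, 0.95), 3))         # projected score from toy corpus scores
    print(round(clean(scores, [0, 2]), 3))    # average over perfectly-converted ids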

69 Left: Most points lie below the diagonal, indicating that the quality of converted parser output (y) is upper bounded by the quality of conversion on gold parses (x). [sent-164, score-0.473]

70 Right: No clear correlation is present, indicating that the sentences that are converted best (on the far right) are not necessarily easy to parse. [sent-165, score-0.108]

71 introduces errors, as the parser output may contain categories that are harder to convert. [sent-166, score-0.283]

72 Parser F-scores are generally higher on CLEAN, which could mean that this set is easier to parse, or it could mean that these sentences don’t contain annotation inconsistencies, and so the parsers aren’t incorrect for returning the true parse (as opposed to the one in the PTB). [sent-167, score-0.079]

73 To test this distinction we look for correlation between conversion quality and parse difficulty on another metric. [sent-168, score-0.247]

74 In particular, Figure 3 (right) shows CCG labeled dependency performance for the C&C parser vs. [sent-169, score-0.118]

75 In the left plot, the y-axis is PARSEVAL on converted C&C parser output. [sent-172, score-0.227]

76 The few points above the diagonal are mostly short sentences on which the C&C parser uses categories that lead to one extra correct node. [sent-174, score-0.331]

77 The main constructions on which parse errors occur, e. [sent-175, score-0.118]

78 PP attachment, are rarely converted incorrectly, and so we expect the number of errors to be cumulative. [sent-177, score-0.147]

79 Some sentences are higher in the right plot than the left because there are distinctions in CCG that are not always present in the PTB, e. [sent-178, score-0.122]

80 Table 5 presents F-scores for three PTB parsers and three CCG parsers (with their output converted by our method). [sent-181, score-0.217]

81 One interesting comparison is between the PTB parser of Petrov and Klein (2007) and the CCG parsers with their output converted by our method. [sent-182, score-0.283]

82 CLEAN is only on sentences that are converted perfectly from gold CCG (51%). [sent-183, score-0.13]

83 PROJ is a projected F-score (ALL result / CCGbank ALL result). [sent-185, score-0.034]

84 the CCG parser of Fowler and Penn (2010), which use the same underlying parser. [sent-186, score-0.118]

85 As shown earlier, CLEAN does not completely factor out the errors introduced by our conversion, as the parser output may be more difficult to convert, and the calculation of PROJ only roughly factors out the errors. [sent-189, score-0.212]

86 However, the results do suggest that the performance of the CCG parsers is approaching that of the Petrov parser. [sent-190, score-0.052]

87 5 Conclusion By exploiting the generalised combinators of the CCG formalism, we have developed a new method of converting CCG derivations into PTB-style trees. [sent-191, score-0.202]

88 Our system, which is publicly available, is more effective than previous work, increasing exact sentence match by more than 11% (absolute), and can be directly integrated with a CCG parser. [sent-192, score-0.028]

89 In Proceedings of the workshop on Speech and Natural Language, pages 306–311. [sent-215, score-0.032]

90 A comparison of loopy belief propagation and dual decomposition for integrated CCG supertagging and parsing. [sent-218, score-0.633]

91 In Proceedings of the Beyond PARSEVAL Workshop at LREC, pages 4–8. [sent-223, score-0.032]

92 Comparing the accuracy of CCG and Penn Treebank parsers. [sent-244, score-0.041]

93 Building a large annotated corpus of English: the Penn Treebank. [sent-274, score-0.041]

94 Comparative parser performance analysis across grammar frameworks through automatic tree conversion using synchronous grammars. [sent-278, score-0.389]

95 Corpus-oriented grammar development for acquiring a head-driven phrase structure grammar from the Penn Treebank. [sent-282, score-0.121]

96 In Proceedings of the 7th International Workshop on Treebanks and Linguistic Theories, pages 159–170. [sent-303, score-0.032]

97 In Proceedings of the Natural Language Processing Pacific Rim Symposium, pages 398– 403. [sent-307, score-0.032]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ccg', 0.605), ('ptb', 0.343), ('conversion', 0.195), ('np', 0.194), ('instructions', 0.161), ('dcl', 0.156), ('ccgbank', 0.142), ('categories', 0.135), ('combinatory', 0.134), ('magistrates', 0.128), ('curran', 0.125), ('parser', 0.118), ('bracket', 0.107), ('len', 0.09), ('schemas', 0.09), ('combinators', 0.089), ('nb', 0.085), ('clark', 0.085), ('converted', 0.083), ('adjp', 0.077), ('proj', 0.077), ('xia', 0.071), ('vp', 0.07), ('instruction', 0.068), ('suicide', 0.067), ('parseval', 0.065), ('argument', 0.065), ('errors', 0.064), ('category', 0.062), ('qp', 0.061), ('converting', 0.057), ('derivations', 0.056), ('schema', 0.055), ('categorial', 0.054), ('conversions', 0.054), ('clean', 0.054), ('parsers', 0.052), ('death', 0.051), ('alelnl', 0.051), ('fowler', 0.051), ('qps', 0.051), ('tree', 0.051), ('derivation', 0.05), ('extra', 0.048), ('constituents', 0.047), ('gold', 0.047), ('inconsistencies', 0.047), ('convert', 0.044), ('plot', 0.041), ('penn', 0.041), ('reversing', 0.041), ('transductions', 0.041), ('ff', 0.04), ('recover', 0.04), ('italian', 0.039), ('rules', 0.039), ('generalisations', 0.038), ('briscoe', 0.036), ('operations', 0.035), ('cases', 0.035), ('cahill', 0.034), ('lexicalised', 0.034), ('projected', 0.034), ('fei', 0.034), ('rare', 0.033), ('nns', 0.033), ('pages', 0.032), ('formalisms', 0.031), ('hpsg', 0.031), ('distinctions', 0.031), ('adjoining', 0.031), ('matsuzaki', 0.031), ('involve', 0.031), ('diagonal', 0.03), ('handles', 0.03), ('output', 0.03), ('phrase', 0.03), ('marcus', 0.03), ('jj', 0.029), ('abney', 0.028), ('integrated', 0.028), ('miyao', 0.028), ('specifies', 0.028), ('parse', 0.027), ('adjacent', 0.027), ('treebanks', 0.027), ('constructions', 0.027), ('petrov', 0.026), ('martha', 0.026), ('assigned', 0.026), ('left', 0.026), ('palmer', 0.025), ('complex', 0.025), ('grammar', 0.025), ('correlation', 0.025), ('subtrees', 0.025), ('nodes', 0.025), ('combine', 0.024), ('right', 0.024), ('arg', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 170 acl-2012-Robust Conversion of CCG Derivations to Phrase Structure Trees

Author: Jonathan K. Kummerfeld ; Dan Klein ; James R. Curran

Abstract: We propose an improved, bottom-up method for converting CCG derivations into PTB-style phrase structure trees. In contrast with past work (Clark and Curran, 2009), which used simple transductions on category pairs, our approach uses richer transductions attached to single categories. Our conversion preserves more sentences under round-trip conversion (51.1% vs. 39.6%) and is more robust. In particular, unlike past methods, ours does not require ad-hoc rules over non-local features, and so can be easily integrated into a parser.

2 0.42602241 71 acl-2012-Dependency Hashing for n-best CCG Parsing

Author: Dominick Ng ; James R. Curran

Abstract: Optimising for one grammatical representation, but evaluating over a different one is a particular challenge for parsers and n-best CCG parsing. We find that this mismatch causes many n-best CCG parses to be semantically equivalent, and describe a hashing technique that eliminates this problem, improving oracle n-best F-score by 0.7% and reranking accuracy by 0.4%. We also present a comprehensive analysis of errors made by the C&C; CCG parser, providing the first breakdown of the impact of implementation decisions, such as supertagging, on parsing accuracy.

3 0.23733954 4 acl-2012-A Comparative Study of Target Dependency Structures for Statistical Machine Translation

Author: Xianchao Wu ; Katsuhito Sudoh ; Kevin Duh ; Hajime Tsukada ; Masaaki Nagata

Abstract: This paper presents a comparative study of target dependency structures yielded by several state-of-the-art linguistic parsers. Our approach is to measure the impact of these nonisomorphic dependency structures to be used for string-to-dependency translation. Besides using traditional dependency parsers, we also use the dependency structures transformed from PCFG trees and predicate-argument structures (PASs) which are generated by an HPSG parser and a CCG parser. The experiments on Chinese-to-English translation show that the HPSG parser’s PASs achieved the best dependency and translation accuracies. 1

4 0.16108029 197 acl-2012-Tokenization: Returning to a Long Solved Problem A Survey, Contrastive Experiment, Recommendations, and Toolkit

Author: Rebecca Dridan ; Stephan Oepen

Abstract: We examine some of the frequently disregarded subtleties of tokenization in Penn Treebank style, and present a new rule-based preprocessing toolkit that not only reproduces the Treebank tokenization with unmatched accuracy, but also maintains exact stand-off pointers to the original text and allows flexible configuration to diverse use cases (e.g. to genreor domain-specific idiosyncrasies). 1 Introduction—Motivation The task of tokenization is hardly counted among the grand challenges of NLP and is conventionally interpreted as breaking up “natural language text [...] into distinct meaningful units (or tokens)” (Kaplan, 2005). Practically speaking, however, tokenization is often combined with other string-level preprocessing—for example normalization of punctuation (of different conventions for dashes, say), disambiguation of quotation marks (into opening vs. closing quotes), or removal of unwanted mark-up— where the specifics of such pre-processing depend both on properties of the input text as well as on assumptions made in downstream processing. Applying some string-level normalizationprior to the identification of token boundaries can improve (or simplify) tokenization, and a sub-task like the disambiguation of quote marks would in fact be hard to perform after tokenization, seeing that it depends on adjacency to whitespace. In the following, we thus assume a generalized notion of tokenization, comprising all string-level processing up to and including the conversion of a sequence of characters (a string) to a sequence of token objects.1 1Obviously, some of the normalization we include in the tokenization task (in this generalized interpretation) could be left to downstream analysis, where a tagger or parser, for example, could be expected to accept non-disambiguated quote marks (so-called straight or typewriter quotes) and disambiguate as 378 Arguably, even in an overtly ‘separating’ language like English, there can be token-level ambiguities that ultimately can only be resolved through parsing (see § 3 for candidate examples), and indeed Waldron et al. (2006) entertain the idea of downstream processing on a token lattice. In this article, however, we accept the tokenization conventions and sequential nature of the Penn Treebank (PTB; Marcus et al., 1993) as a useful point of reference— primarily for interoperability of different NLP tools. Still, we argue, there is remaining work to be done on PTB-compliant tokenization (reviewed in§ 2), both methodologically, practically, and technologically. In § 3 we observe that state-of-the-art tools perform poorly on re-creating PTB tokenization, and move on in § 4 to develop a modular, parameterizable, and transparent framework for tokenization. Besides improvements in tokenization accuracy and adaptability to diverse use cases, in § 5 we further argue that each token object should unambiguously link back to an underlying element of the original input, which in the case of tokenization of text we realize through a notion of characterization. 2 Common Conventions Due to the popularity of the PTB, its tokenization has been a de-facto standard for two decades. Ap- proximately, this means splitting off punctuation into separate tokens, disambiguating straight quotes, and separating contractions such as can’t into ca and n ’t. There are, however, many special cases— part of syntactic analysis. 
However, on the (predominant) point of view that punctuation marks form tokens in their own right, the tokenizer would then have to adorn quote marks in some way, as to whether they were split off the left or right periphery of a larger token, to avoid unwanted syntactic ambiguity. Further, increasing use of Unicode makes texts containing ‘natively’ disambiguated quotes more common, where it would seem unfortunate to discard linguistically pertinent information by normalizing towards the poverty of pure ASCII punctuation. ProceedJienjgus, R ofep thueb 5lic0t hof A Knonrueaa,l M 8-e1e4ti Jnugly o f2 t0h1e2 A.s ?c so2c0ia1t2io Ans fsoorc Ciatoiomnp fuotart Cioonmaplu Ltiantgiounisatlic Lsi,n pgaugiestsi3c 7s8–382, documented and undocumented. In much tagging and parsing work, PTB data has been used with gold-standard tokens, to a point where many researchers are unaware of the existence of the original ‘raw’ (untokenized) text. Accordingly, the formal definition of PTB has received little attention, but reproducing PTB tokenization automatically actually is not a trivial task (see § 3). As the NLP community has moved to process data other than the PTB, some of the limitations of the tokenization2 PTB tokenization have been recognized, and many recently released data sets are accompanied by a note on tokenization along the lines of: Tokenization is similar to that used in PTB, except . . . Most exceptions are to do with hyphenation, or special forms of named entities such as chemical names or URLs. None of the documentation with extant data sets is sufficient to fully reproduce the tokenization.3 The CoNLL 2008 Shared Task data actually provided two forms of tokenization: that from the PTB (which many pre-processing tools would have been trained on), and another form that splits (most) hyphenated terms. This latter convention recently seems to be gaining ground in data sets like the Google 1T n-gram corpus (LDC #2006T13) and OntoNotes (Hovy et al., 2006). Clearly, as one moves towards a more application- and domaindriven idea of ‘correct’ tokenization, a more transparent, flexible, and adaptable approach to stringlevel pre-processing is called for. 3 A Contrastive Experiment To get an overview of current tokenization methods, we recovered and tokenized the raw text which was the source of the (Wall Street Journal portion of the) PTB, and compared it to the gold tokenization in the syntactic annotation in the We used three common methods of tokenization: (a) the original treebank.4 2See http : / /www . cis .upenn .edu/ ~t reebank/ t okeni z at ion .html for available ‘documentation’ and a sed script for PTB-style tokenization. 3Øvrelid et al. (2010) observe that tokenizing with the GENIA tagger yields mismatches in one of five sentences of the GENIA Treebank, although the GENIA guidelines refer to scripts that may be available on request (Tateisi & Tsujii, 2006). 4The original WSJ text was last included with the 1995 release of the PTB (LDC #95T07) and required alignment with the treebank, with some manual correction so that the same text is represented in both raw and parsed formats. 379 Tokenization Differing Levenshtein Method Sentences Distance tokenizer.sed 3264 11168 CoreNLP 1781 3717 C&J; parser 2597 4516 Table 1: Quantitative view on tokenization differences. PTB tokenizer.sed script; (b) the tokenizer from the Stanford CoreNLP tools5; and (c) tokenization from the parser of Charniak & Johnson (2005). 
Table 1 shows quantitative differences between each of the three methods and the PTB, both in terms of the number of sentences where the tokenization differs, and also in the total Levenshtein distance (Levenshtein, 1966) over tokens (for a total of 49,208 sentences and 1,173,750 gold-standard tokens). Looking at the differences qualitatively, the most consistent issue across all tokenization methods was ambiguity of sentence-final periods. In the treebank, final periods are always (with about 10 exceptions) a separate token. If the sentence ends in U.S. (but not other abbreviations, oddly), an extra period is hallucinated, so the abbreviation also has one. In contrast, C&J; add a period to all final abbreviations, CoreNLP groups the final period with a final abbreviation and hence lacks a sentence-final period token, and the sed script strips the period off U.S. The ‘correct’ choice in this case is not obvious and will depend on how the tokens are to be used. The majority of the discrepancies in the sed script tokenization come from an under-restricted punctuation rule that incorrectly splits on commas within numbers or ampersands within names. Other than that, the problematic cases are mostly shared across tokenization methods, and include issues with currencies, Irish names, hyphenization, and quote disambiguation. In addition, C&J; make some additional modifications to the text, lemmatising expressions such as won ’t as will and n ’t. 4 REPP: A Generalized Framework For tokenization to be studied as a first-class problem, and to enable customization and flexibility to diverse use cases, we suggest a non-procedural, rule-based framework dubbed REPP (Regular 5See corenlp / / nlp . st anford . edu / so ftware / run in ‘ st rict Treebank3 ’ mode. http : . shtml, Expression-Based Pre-Processing)—essentially a cascade of ordered finite-state string rewriting rules, though transcending the formal complexity of regular languages by inclusion of (a) full perl-compatible regular expressions and (b) fixpoint iteration over groups of rules. In this approach, a first phase of string-level substitutions inserts whitespace around, for example, punctuation marks; upon completion of string rewriting, token boundaries are stipulated between all whitespace-separated substrings (and only these). For a good balance of human and machine readability, REPP tokenization rules are specified in a simple, line-oriented textual form. Figure 1 shows a (simplified) excerpt from our PTB-style tokenizer, where the first character on each line is one of four REPP operators, as follows: (a) ‘#’ for group formation; (b) ‘>’ for group invocation, (c) ‘ ! ’ for substitution (allowing capture groups), and (d) ‘ : ’ for token boundary detection.6 In Figure 1, the two rules stripping off prefix and suffix punctuation marks adjacent to whitespace (i.e. matching the tab-separated left-hand side of the rule, to replace the match with its right-hand side) form a numbered group (‘# 1’), which will be iterated when called (‘> 1 until none ’) of the rules in the group fires (a fixpoint). In this example, conditioning on whitespace adjacency avoids the issues observed with the PTB sed script (e.g. token boundaries within comma-separated numbers) and also protects against infinite loops in the group.7 REPP rule sets can be organized as modules, typ6Strictly speaking, there are another two operators, for lineoriented comments and automated versioning of rule files. 
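
A toy illustration of the regex-cascade idea just described: a group of substitution rules inserts whitespace around punctuation, the group is iterated to a fixpoint, and tokens are then read off at whitespace boundaries. The rules and their encoding here are invented for the sketch and are not actual REPP rules or REPP syntax:

    # Toy regex-cascade tokenizer in the spirit of REPP (illustrative rules only).
    import re

    # Each rule is (compiled pattern, replacement); a group is applied repeatedly
    # until no rule changes the string (a fixpoint), as with "> 1 until none".
    GROUP_1 = [
        (re.compile(r'(\s|^)([("])(\S)'), r'\1\2 \3'),        # split off prefix punctuation
        (re.compile(r'(\S)([,.;:!?")])(\s|$)'), r'\1 \2\3'),  # split off suffix punctuation
    ]

    def apply_group_until_none(rules, text):
        while True:
            new_text = text
            for pattern, repl in rules:
                new_text = pattern.sub(repl, new_text)
            if new_text == text:
                return text
            text = new_text

    def tokenize(text):
        text = apply_group_until_none(GROUP_1, text)
        return text.split()   # token boundaries at whitespace only

    print(tokenize('He said, "wait."'))   # ['He', 'said', ',', '"', 'wait', '.', '"']
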
7For this example, the same effects seemingly could be obtained without iteration (using greatly more complex rules); our actual, non-simplified rules, however, further deal with punctuation marks that can function as prefixes or suffixes, as well as with corner cases like factor(s) or Ca[2+]. Also in mark-up removal and normalization, we have found it necessary to ‘parse’ nested structures by means of iterative groups. 380 ically each in a file of its own, and invoked selectively by name (e.g. ‘>wiki’ in Figure 1); to date, there exist modules for quote disambiguation, (relevant subsets of) various mark-up languages (HTML, LATEX, wiki, and XML), and a handful of robustness rules (e.g. seeking to identify and repair ‘sandwiched’ inter-token punctuation). Individual tokenizers are configured at run-time, by selectively activating a set of modules (through command-line op- tions). An open-source reference implementation of the REPP framework (in C++) is available, together with a library of modules for English. 5 Characterization for Traceability Tokenization, and specifically our notion of generalized tokenization which allows text normalization, involves changes to the original text being analyzed, rather than just additional annotation. As such, full traceability from the token objects to the original text is required, which we formalize as ‘characterization’, in terms of character position links back to the source.8 This has the practical benefit of allowing downstream analysis as direct (stand-off) annotation on the source text, as seen for example in the ACL Anthology Searchbench (Schäfer et al., 2011). With our general regular expression replacement rules in REPP, making precise what it means for a token to link back to its ‘underlying’ substring requires some care in the design and implementation. Definite characterization links between the string before (I) and after (O) the application of a single orurele ( can only bftee res (tOab)li tshheed a pinp lcicerattiaoinn positions, viz. (a) spans not matched by the rule: unchanged text in O outside the span matched by the left-hand tseixdet regex outfs tidhee truhele s can always d be b ylin thkeed le bfta-chka ntod I; and (b) spans caught by a regex capture group: capture groups represent bthye a same te caxtp tiunr eth ger oleufpt-: and right-hand sides of a substitution, and so can be linked back to O.9 Outside these text spans, we can only md bakace kd etofin Oit.e statements about characterization links at boundary points, which include the start and end of the full string, the start and end of the string 8If the tokenization process was only concerned with the identification of token boundaries, characterization would be near-trivial. 9If capture group references are used out-of-order, however, the per-group linkage is no longer well-defined, and we resort to the maximum-span ‘union’ of boundary points (see below). matched by the rule, and the start and end of any capture groups in the rule. Each character in the string being processed has a start and end position, marking the point before and after the character in the original string. Before processing, the end position would always be one greater than the start position. However, if a rule mapped a string-initial, PTB-style opening double quote (``) to one-character Unicode “, the new first character of the string would have start position 0, but end position 2. 
In contrast, if there were a rule !wo (n’ t ) will \1 (1) applied to the string I ’t go!, all characters in the won second token of the resulting string (I will n’t go!) will have start position 2 and end position 4. This demonstrates one of the formal consequences of our design: we have no reason to assign the characters ill any start position other than 2.10 Since explicit character links between each I O will only be estaband laicstheerd l iantk kms abtecthw or capture group boundaries, any tteabxtfrom the left-hand side of a rule that should appear in O must be explicitly linked through a capture group rOefe mreunstc eb (rather tihtlayn l merely hwroriuttgehn ao cuta ipntu utrhee righthand side of the rule). In other words, rule (1) above should be preferred to the following variant (which would result in character start and end offsets of 0 and 5 for both output tokens): ! won’ t will n’ t (2) During rule application, we keep track of character start and end positions as offsets between a string before and after each rule application (i.e. all pairs hI, Oi), and these offsets are eventually traced back thoI ,thOe original string fats etthse atireme ev oefn ftiunaalll yto tkraecneidzat biaocnk. 6 Quantitative and Qualitative Evaluation In our own work on preparing various (non-PTB) genres for parsing, we devised a set of REPP rules with the goal of following the PTB conventions. When repeating the experiment of § 3 above using REPP tokenization, we obtained an initial difference in 1505 sentences, with a Levenshtein dis10This subtlety will actually be invisible in the final token objects if will remains a single token, but if subsequent rules were to split this token further, all its output tokens would have a start position of 2 and an end position of 4. While this example may seem unlikely, we have come across similar scenarios in fine-tuning actual REPP rules. 381 tance of 3543 (broadly comparable to CoreNLP, if marginally more accurate). Examining these discrepancies, we revealed some deficiencies in our rules, as well as some peculiarities of the ‘raw’ Wall Street Journal text from the PTB distribution. A little more than 200 mismatches were owed to improper treatment of currency symbols (AU$) and decade abbreviations (’60s), which led to the refinement of two existing rules. Notable PTB idiosyncrasies (in the sense of deviations from common typography) include ellipses with spaces separating the periods and a fairly large number of possessives (’s) being separated from their preceding token. Other aspects of gold-standard PTB tokenization we consider unwarranted ‘damage’ to the input text, such as hallucinating an extra period after U . S . and splitting cannot (which adds spurious ambiguity). For use cases where the goal were strict compliance, for instance in pre-processing inputs for a PTB-derived parser, we added an optional REPP module (of currently half a dozen rules) to cater to these corner cases—in a spirit similar to the CoreNLP mode we used in § 3. With these extra rules, remaining tokenization discrepancies are contained in 603 sentences (just over 1%), which gives a Levenshtein distance of 1389. 7 Discussion—Conclusion Compared to the best-performing off-the-shelf system in our earlier experiment (where it is reasonable to assume that PTB data has played at least some role in development), our results eliminate two thirds of the remaining tokenization errors—a more substantial reduction than recent improvements in parsing accuracy against the PTB, for example. 
Of the remaining differences, cerned with mid-sentence at least half of those riod was separated treebank—a pattern Some differences over 350 are con- period ambiguity, are instances where where from an abbreviation a pein the we do not wish to emulate. in quote disambiguation also re- main, often triggered by whitespace on both sides of quote marks in the raw text. The final 200 or so dif- ferences stem from manual corrections made during treebanking, and we consider that these cases could not be replicated automatically in any generalizable fashion. References Waldron, B., Copestake, A., Schäfer, U., & Kiefer, Ch(ionap-frgbpnt.heias1Ikt7nA,p3asEP–rs.1,oi8&cn0ieag;)J.todiaAohni dgnsfmonAroa,fxCbMethon.ermt,(pd42Uui30sStcraAd5ti.m)oA.niCanloutaLivrlsneMgr-eutorieas-ftni kceg-s Isd5Bota.hurlyd(2.scIne0itsne0ra6Dn)ad.Et LiPorvneHapl-ruIoaCNcteio snofin(elrpsge.nacIn2ed6Pot3rno–kcLe2naei6dns8iagnt)ui.oasgGnoe sfntRaohne-, Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., & Weischedel, R. (2006). Ontonotes. The 90% solution. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (pp. 57–60). New York City, USA. Kaplan, R. M. (2005). A method for tokenizing text. Festschrift for Kimmo Koskenniemi on his 60th birthday. In A. Arppe, L. Carlson, K. Lindén, J. Piitulainen, M. Suominen, M. Vainio, H. Westerlund, & A. Yli-Jyrä (Eds.), Inquiries into words, constraints and contexts (pp. 55 64). Stanford, CA: CSLI Publications. – Levenshtein, V. (1966). Binary codes capable ofcor- recting deletions, insertions and reversals. Soviet Physice Doklady, 10, 707–710. – Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English. The Penn Treebank. Computational Linguistics, 19, 3 13 330. – Øvrelid, L., Velldal, E., & Oepen, S. (2010). Syntactic scope resolution in uncertainty analysis. In Proceedings of the 23rd international conference on computational linguistics (pp. 1379 1387). Beijing, China. – Schäfer, U., Kiefer, B., Spurk, C., Steffen, J., & Wang, R. (201 1). The ACL Anthology Searchbench. In Proceedings of the ACL-HLT 2011 system demonstrations (pp. 7–13). Portland, Oregon, USA. Tateisi, Y., & Tsujii, J. (2006). GENIA annotation guidelines for tokenization and POS tagging (Technical Report # TR-NLP-UT-2006-4). Tokyo, Japan: Tsujii Lab, University of Tokyo. 382

5 0.14147808 109 acl-2012-Higher-order Constituent Parsing and Parser Combination

Author: Xiao Chen ; Chunyu Kit

Abstract: This paper presents a higher-order model for constituent parsing aimed at utilizing more local structural context to decide the score of a grammar rule instance in a parse tree. Experiments on English and Chinese treebanks confirm its advantage over its first-order version. It achieves its best F1 scores of 91.86% and 85.58% on the two languages, respectively, and further pushes them to 92.80% and 85.60% via combination with other highperformance parsers.

6 0.11021246 30 acl-2012-Attacking Parsing Bottlenecks with Unlabeled Data and Relevant Factorizations

7 0.091804199 87 acl-2012-Exploiting Multiple Treebanks for Parsing with Quasi-synchronous Grammars

8 0.083488517 106 acl-2012-Head-driven Transition-based Parsing with Top-down Prediction

9 0.08324936 108 acl-2012-Hierarchical Chunk-to-String Translation

10 0.081243835 59 acl-2012-Corpus-based Interpretation of Instructions in Virtual Environments

11 0.077083319 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

12 0.076489858 93 acl-2012-Fast Online Lexicon Learning for Grounded Language Acquisition

13 0.075777844 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?

14 0.072412707 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets

15 0.071264155 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing

16 0.071067818 168 acl-2012-Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations

17 0.07010559 5 acl-2012-A Comparison of Chinese Parsers for Stanford Dependencies

18 0.068573698 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities

19 0.067993492 83 acl-2012-Error Mining on Dependency Trees

20 0.06523557 213 acl-2012-Utilizing Dependency Language Models for Graph-based Dependency Parsing Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.176), (1, -0.035), (2, -0.202), (3, -0.161), (4, -0.104), (5, -0.084), (6, 0.004), (7, 0.007), (8, 0.06), (9, 0.043), (10, 0.089), (11, 0.094), (12, -0.082), (13, 0.083), (14, -0.06), (15, -0.092), (16, -0.115), (17, -0.088), (18, -0.209), (19, -0.025), (20, -0.104), (21, -0.018), (22, 0.133), (23, 0.157), (24, -0.06), (25, -0.102), (26, 0.031), (27, -0.389), (28, -0.155), (29, -0.003), (30, 0.228), (31, 0.096), (32, 0.071), (33, -0.151), (34, 0.195), (35, -0.158), (36, -0.067), (37, 0.002), (38, -0.019), (39, 0.064), (40, -0.132), (41, 0.076), (42, -0.078), (43, 0.009), (44, 0.101), (45, 0.001), (46, -0.054), (47, -0.009), (48, -0.006), (49, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96575928 170 acl-2012-Robust Conversion of CCG Derivations to Phrase Structure Trees

Author: Jonathan K. Kummerfeld ; Dan Klein ; James R. Curran

Abstract: We propose an improved, bottom-up method for converting CCG derivations into PTB-style phrase structure trees. In contrast with past work (Clark and Curran, 2009), which used simple transductions on category pairs, our approach uses richer transductions attached to single categories. Our conversion preserves more sentences under round-trip conversion (51.1% vs. 39.6%) and is more robust. In particular, unlike past methods, ours does not require ad-hoc rules over non-local features, and so can be easily integrated into a parser.

2 0.8453263 71 acl-2012-Dependency Hashing for n-best CCG Parsing

Author: Dominick Ng ; James R. Curran

Abstract: Optimising for one grammatical representation, but evaluating over a different one is a particular challenge for parsers and n-best CCG parsing. We find that this mismatch causes many n-best CCG parses to be semantically equivalent, and describe a hashing technique that eliminates this problem, improving oracle n-best F-score by 0.7% and reranking accuracy by 0.4%. We also present a comprehensive analysis of errors made by the C&C; CCG parser, providing the first breakdown of the impact of implementation decisions, such as supertagging, on parsing accuracy.

3 0.48987731 197 acl-2012-Tokenization: Returning to a Long Solved Problem A Survey, Contrastive Experiment, Recommendations, and Toolkit

Author: Rebecca Dridan ; Stephan Oepen

Abstract: We examine some of the frequently disregarded subtleties of tokenization in Penn Treebank style, and present a new rule-based preprocessing toolkit that not only reproduces the Treebank tokenization with unmatched accuracy, but also maintains exact stand-off pointers to the original text and allows flexible configuration to diverse use cases (e.g. to genreor domain-specific idiosyncrasies). 1 Introduction—Motivation The task of tokenization is hardly counted among the grand challenges of NLP and is conventionally interpreted as breaking up “natural language text [...] into distinct meaningful units (or tokens)” (Kaplan, 2005). Practically speaking, however, tokenization is often combined with other string-level preprocessing—for example normalization of punctuation (of different conventions for dashes, say), disambiguation of quotation marks (into opening vs. closing quotes), or removal of unwanted mark-up— where the specifics of such pre-processing depend both on properties of the input text as well as on assumptions made in downstream processing. Applying some string-level normalizationprior to the identification of token boundaries can improve (or simplify) tokenization, and a sub-task like the disambiguation of quote marks would in fact be hard to perform after tokenization, seeing that it depends on adjacency to whitespace. In the following, we thus assume a generalized notion of tokenization, comprising all string-level processing up to and including the conversion of a sequence of characters (a string) to a sequence of token objects.1 1Obviously, some of the normalization we include in the tokenization task (in this generalized interpretation) could be left to downstream analysis, where a tagger or parser, for example, could be expected to accept non-disambiguated quote marks (so-called straight or typewriter quotes) and disambiguate as 378 Arguably, even in an overtly ‘separating’ language like English, there can be token-level ambiguities that ultimately can only be resolved through parsing (see § 3 for candidate examples), and indeed Waldron et al. (2006) entertain the idea of downstream processing on a token lattice. In this article, however, we accept the tokenization conventions and sequential nature of the Penn Treebank (PTB; Marcus et al., 1993) as a useful point of reference— primarily for interoperability of different NLP tools. Still, we argue, there is remaining work to be done on PTB-compliant tokenization (reviewed in§ 2), both methodologically, practically, and technologically. In § 3 we observe that state-of-the-art tools perform poorly on re-creating PTB tokenization, and move on in § 4 to develop a modular, parameterizable, and transparent framework for tokenization. Besides improvements in tokenization accuracy and adaptability to diverse use cases, in § 5 we further argue that each token object should unambiguously link back to an underlying element of the original input, which in the case of tokenization of text we realize through a notion of characterization. 2 Common Conventions Due to the popularity of the PTB, its tokenization has been a de-facto standard for two decades. Ap- proximately, this means splitting off punctuation into separate tokens, disambiguating straight quotes, and separating contractions such as can’t into ca and n ’t. There are, however, many special cases— part of syntactic analysis. 
However, on the (predominant) point of view that punctuation marks form tokens in their own right, the tokenizer would then have to adorn quote marks in some way, as to whether they were split off the left or right periphery of a larger token, to avoid unwanted syntactic ambiguity. Further, increasing use of Unicode makes texts containing ‘natively’ disambiguated quotes more common, where it would seem unfortunate to discard linguistically pertinent information by normalizing towards the poverty of pure ASCII punctuation. ProceedJienjgus, R ofep thueb 5lic0t hof A Knonrueaa,l M 8-e1e4ti Jnugly o f2 t0h1e2 A.s ?c so2c0ia1t2io Ans fsoorc Ciatoiomnp fuotart Cioonmaplu Ltiantgiounisatlic Lsi,n pgaugiestsi3c 7s8–382, documented and undocumented. In much tagging and parsing work, PTB data has been used with gold-standard tokens, to a point where many researchers are unaware of the existence of the original ‘raw’ (untokenized) text. Accordingly, the formal definition of PTB has received little attention, but reproducing PTB tokenization automatically actually is not a trivial task (see § 3). As the NLP community has moved to process data other than the PTB, some of the limitations of the tokenization2 PTB tokenization have been recognized, and many recently released data sets are accompanied by a note on tokenization along the lines of: Tokenization is similar to that used in PTB, except . . . Most exceptions are to do with hyphenation, or special forms of named entities such as chemical names or URLs. None of the documentation with extant data sets is sufficient to fully reproduce the tokenization.3 The CoNLL 2008 Shared Task data actually provided two forms of tokenization: that from the PTB (which many pre-processing tools would have been trained on), and another form that splits (most) hyphenated terms. This latter convention recently seems to be gaining ground in data sets like the Google 1T n-gram corpus (LDC #2006T13) and OntoNotes (Hovy et al., 2006). Clearly, as one moves towards a more application- and domaindriven idea of ‘correct’ tokenization, a more transparent, flexible, and adaptable approach to stringlevel pre-processing is called for. 3 A Contrastive Experiment To get an overview of current tokenization methods, we recovered and tokenized the raw text which was the source of the (Wall Street Journal portion of the) PTB, and compared it to the gold tokenization in the syntactic annotation in the We used three common methods of tokenization: (a) the original treebank.4 2See http : / /www . cis .upenn .edu/ ~t reebank/ t okeni z at ion .html for available ‘documentation’ and a sed script for PTB-style tokenization. 3Øvrelid et al. (2010) observe that tokenizing with the GENIA tagger yields mismatches in one of five sentences of the GENIA Treebank, although the GENIA guidelines refer to scripts that may be available on request (Tateisi & Tsujii, 2006). 4The original WSJ text was last included with the 1995 release of the PTB (LDC #95T07) and required alignment with the treebank, with some manual correction so that the same text is represented in both raw and parsed formats. 379 Tokenization Differing Levenshtein Method Sentences Distance tokenizer.sed 3264 11168 CoreNLP 1781 3717 C&J; parser 2597 4516 Table 1: Quantitative view on tokenization differences. PTB tokenizer.sed script; (b) the tokenizer from the Stanford CoreNLP tools5; and (c) tokenization from the parser of Charniak & Johnson (2005). 
Table 1 shows quantitative differences between each of the three methods and the PTB, both in terms of the number of sentences where the tokenization differs, and also in the total Levenshtein distance (Levenshtein, 1966) over tokens (for a total of 49,208 sentences and 1,173,750 gold-standard tokens). Looking at the differences qualitatively, the most consistent issue across all tokenization methods was ambiguity of sentence-final periods. In the treebank, final periods are always (with about 10 exceptions) a separate token. If the sentence ends in U.S. (but not other abbreviations, oddly), an extra period is hallucinated, so the abbreviation also has one. In contrast, C&J; add a period to all final abbreviations, CoreNLP groups the final period with a final abbreviation and hence lacks a sentence-final period token, and the sed script strips the period off U.S. The ‘correct’ choice in this case is not obvious and will depend on how the tokens are to be used. The majority of the discrepancies in the sed script tokenization come from an under-restricted punctuation rule that incorrectly splits on commas within numbers or ampersands within names. Other than that, the problematic cases are mostly shared across tokenization methods, and include issues with currencies, Irish names, hyphenization, and quote disambiguation. In addition, C&J; make some additional modifications to the text, lemmatising expressions such as won ’t as will and n ’t. 4 REPP: A Generalized Framework For tokenization to be studied as a first-class problem, and to enable customization and flexibility to diverse use cases, we suggest a non-procedural, rule-based framework dubbed REPP (Regular 5See corenlp / / nlp . st anford . edu / so ftware / run in ‘ st rict Treebank3 ’ mode. http : . shtml, Expression-Based Pre-Processing)—essentially a cascade of ordered finite-state string rewriting rules, though transcending the formal complexity of regular languages by inclusion of (a) full perl-compatible regular expressions and (b) fixpoint iteration over groups of rules. In this approach, a first phase of string-level substitutions inserts whitespace around, for example, punctuation marks; upon completion of string rewriting, token boundaries are stipulated between all whitespace-separated substrings (and only these). For a good balance of human and machine readability, REPP tokenization rules are specified in a simple, line-oriented textual form. Figure 1 shows a (simplified) excerpt from our PTB-style tokenizer, where the first character on each line is one of four REPP operators, as follows: (a) ‘#’ for group formation; (b) ‘>’ for group invocation, (c) ‘ ! ’ for substitution (allowing capture groups), and (d) ‘ : ’ for token boundary detection.6 In Figure 1, the two rules stripping off prefix and suffix punctuation marks adjacent to whitespace (i.e. matching the tab-separated left-hand side of the rule, to replace the match with its right-hand side) form a numbered group (‘# 1’), which will be iterated when called (‘> 1 until none ’) of the rules in the group fires (a fixpoint). In this example, conditioning on whitespace adjacency avoids the issues observed with the PTB sed script (e.g. token boundaries within comma-separated numbers) and also protects against infinite loops in the group.7 REPP rule sets can be organized as modules, typ6Strictly speaking, there are another two operators, for lineoriented comments and automated versioning of rule files. 
7For this example, the same effects seemingly could be obtained without iteration (using greatly more complex rules); our actual, non-simplified rules, however, further deal with punctuation marks that can function as prefixes or suffixes, as well as with corner cases like factor(s) or Ca[2+]. Also in mark-up removal and normalization, we have found it necessary to ‘parse’ nested structures by means of iterative groups. 380 ically each in a file of its own, and invoked selectively by name (e.g. ‘>wiki’ in Figure 1); to date, there exist modules for quote disambiguation, (relevant subsets of) various mark-up languages (HTML, LATEX, wiki, and XML), and a handful of robustness rules (e.g. seeking to identify and repair ‘sandwiched’ inter-token punctuation). Individual tokenizers are configured at run-time, by selectively activating a set of modules (through command-line op- tions). An open-source reference implementation of the REPP framework (in C++) is available, together with a library of modules for English. 5 Characterization for Traceability Tokenization, and specifically our notion of generalized tokenization which allows text normalization, involves changes to the original text being analyzed, rather than just additional annotation. As such, full traceability from the token objects to the original text is required, which we formalize as ‘characterization’, in terms of character position links back to the source.8 This has the practical benefit of allowing downstream analysis as direct (stand-off) annotation on the source text, as seen for example in the ACL Anthology Searchbench (Schäfer et al., 2011). With our general regular expression replacement rules in REPP, making precise what it means for a token to link back to its ‘underlying’ substring requires some care in the design and implementation. Definite characterization links between the string before (I) and after (O) the application of a single orurele ( can only bftee res (tOab)li tshheed a pinp lcicerattiaoinn positions, viz. (a) spans not matched by the rule: unchanged text in O outside the span matched by the left-hand tseixdet regex outfs tidhee truhele s can always d be b ylin thkeed le bfta-chka ntod I; and (b) spans caught by a regex capture group: capture groups represent bthye a same te caxtp tiunr eth ger oleufpt-: and right-hand sides of a substitution, and so can be linked back to O.9 Outside these text spans, we can only md bakace kd etofin Oit.e statements about characterization links at boundary points, which include the start and end of the full string, the start and end of the string 8If the tokenization process was only concerned with the identification of token boundaries, characterization would be near-trivial. 9If capture group references are used out-of-order, however, the per-group linkage is no longer well-defined, and we resort to the maximum-span ‘union’ of boundary points (see below). matched by the rule, and the start and end of any capture groups in the rule. Each character in the string being processed has a start and end position, marking the point before and after the character in the original string. Before processing, the end position would always be one greater than the start position. However, if a rule mapped a string-initial, PTB-style opening double quote (``) to one-character Unicode “, the new first character of the string would have start position 0, but end position 2. 
6 Quantitative and Qualitative Evaluation

In our own work on preparing various (non-PTB) genres for parsing, we devised a set of REPP rules with the goal of following the PTB conventions. When repeating the experiment of § 3 above using REPP tokenization, we obtained an initial difference in 1505 sentences, with a Levenshtein distance of 3543 (broadly comparable to CoreNLP, if marginally more accurate). Examining these discrepancies revealed some deficiencies in our rules, as well as some peculiarities of the ‘raw’ Wall Street Journal text in the PTB distribution. A little more than 200 mismatches were due to improper treatment of currency symbols (AU$) and decade abbreviations (’60s), which led to the refinement of two existing rules. Notable PTB idiosyncrasies (in the sense of deviations from common typography) include ellipses with spaces separating the periods and a fairly large number of possessives (’s) separated from their preceding token. Other aspects of gold-standard PTB tokenization we consider unwarranted ‘damage’ to the input text, such as hallucinating an extra period after U.S. and splitting cannot (which adds spurious ambiguity). For use cases where the goal is strict compliance, for instance in pre-processing inputs for a PTB-derived parser, we added an optional REPP module (of currently half a dozen rules) to cater to these corner cases, in a spirit similar to the CoreNLP mode we used in § 3. With these extra rules, the remaining tokenization discrepancies are contained in 603 sentences (just over 1%), for a Levenshtein distance of 1389.
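The distances reported here are plain Levenshtein (1966) edit distances computed over token sequences rather than characters; a standard dynamic-programming sketch of that token-level measure (the exact alignment settings behind the reported figures are an assumption) might look as follows.

```python
def token_levenshtein(gold, system):
    """Minimum number of token insertions, deletions, and substitutions
    needed to turn the system token sequence into the gold one."""
    m, n = len(gold), len(system)
    prev = list(range(n + 1))          # distances for the empty gold prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if gold[i - 1] == system[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # gold token left unmatched
                          curr[j - 1] + 1,     # system token left unmatched
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[n]

gold = ['U.S.', 'grew', '.']
system = ['U.S', '.', 'grew', '.']
print(token_levenshtein(gold, system))  # 2: one substitution plus one extra period
```

Summed over all sentences, token-level distances of this kind yield corpus totals comparable to the 3543 and 1389 reported above.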
7 Discussion—Conclusion

Compared to the best-performing off-the-shelf system in our earlier experiment (where it is reasonable to assume that PTB data has played at least some role in development), our results eliminate two thirds of the remaining tokenization errors, a more substantial reduction than recent improvements in parsing accuracy against the PTB, for example. Of the remaining differences, over 350 are concerned with mid-sentence period ambiguity; at least half of those are instances where a period was separated from an abbreviation in the treebank, a pattern we do not wish to emulate. Some differences in quote disambiguation also remain, often triggered by whitespace on both sides of quote marks in the raw text. The final 200 or so differences stem from manual corrections made during treebanking, and we consider that these cases could not be replicated automatically in any generalizable fashion.

References

Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., & Weischedel, R. (2006). OntoNotes: The 90% solution. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (pp. 57–60). New York City, USA.

Kaplan, R. M. (2005). A method for tokenizing text. In A. Arppe, L. Carlson, K. Lindén, J. Piitulainen, M. Suominen, M. Vainio, H. Westerlund, & A. Yli-Jyrä (Eds.), Inquiries into words, constraints and contexts: Festschrift for Kimmo Koskenniemi on his 60th birthday (pp. 55–64). Stanford, CA: CSLI Publications.

Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10, 707–710.

Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19, 313–330.

Øvrelid, L., Velldal, E., & Oepen, S. (2010). Syntactic scope resolution in uncertainty analysis. In Proceedings of the 23rd International Conference on Computational Linguistics (pp. 1379–1387). Beijing, China.

Schäfer, U., Kiefer, B., Spurk, C., Steffen, J., & Wang, R. (2011). The ACL Anthology Searchbench. In Proceedings of the ACL-HLT 2011 System Demonstrations (pp. 7–13). Portland, Oregon, USA.

Tateisi, Y., & Tsujii, J. (2006). GENIA annotation guidelines for tokenization and POS tagging (Technical Report TR-NLP-UT-2006-4). Tokyo, Japan: Tsujii Lab, University of Tokyo.

Waldron, B., Copestake, A., Schäfer, U., & Kiefer, B. (2006). Preprocessing and tokenisation standards in DELPH-IN tools. In Proceedings of the 5th International Conference on Language Resources and Evaluation. Genoa, Italy.

4 0.4129127 4 acl-2012-A Comparative Study of Target Dependency Structures for Statistical Machine Translation

Author: Xianchao Wu ; Katsuhito Sudoh ; Kevin Duh ; Hajime Tsukada ; Masaaki Nagata

Abstract: This paper presents a comparative study of target dependency structures yielded by several state-of-the-art linguistic parsers. Our approach is to measure the impact of these non-isomorphic dependency structures to be used for string-to-dependency translation. Besides using traditional dependency parsers, we also use the dependency structures transformed from PCFG trees and predicate-argument structures (PASs) which are generated by an HPSG parser and a CCG parser. The experiments on Chinese-to-English translation show that the HPSG parser’s PASs achieved the best dependency and translation accuracies.

5 0.34762156 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities

Author: Seyed Abolghasem Mirroshandel ; Alexis Nasr ; Joseph Le Roux

Abstract: Treebanks are not large enough to reliably model precise lexical phenomena. This deficiency provokes attachment errors in the parsers trained on such data. We propose in this paper to compute lexical affinities, on large corpora, for specific lexico-syntactic configurations that are hard to disambiguate and introduce the new information in a parser. Experiments on the French Treebank showed a relative decrease of the error rate of 7.1% Labeled Accuracy Score, yielding the best parsing results on this treebank.

6 0.32199574 30 acl-2012-Attacking Parsing Bottlenecks with Unlabeled Data and Relevant Factorizations

7 0.29473698 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing

8 0.29387155 83 acl-2012-Error Mining on Dependency Trees

9 0.26824576 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence

10 0.25368842 109 acl-2012-Higher-order Constituent Parsing and Parser Combination

11 0.21952589 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers

12 0.21894483 75 acl-2012-Discriminative Strategies to Integrate Multiword Expression Recognition and Parsing

13 0.21469933 59 acl-2012-Corpus-based Interpretation of Instructions in Virtual Environments

14 0.20777057 87 acl-2012-Exploiting Multiple Treebanks for Parsing with Quasi-synchronous Grammars

15 0.20758715 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets

16 0.20624684 93 acl-2012-Fast Online Lexicon Learning for Grounded Language Acquisition

17 0.20264296 5 acl-2012-A Comparison of Chinese Parsers for Stanford Dependencies

18 0.20125836 108 acl-2012-Hierarchical Chunk-to-String Translation

19 0.19858856 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?

20 0.19737637 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.024), (26, 0.023), (28, 0.03), (30, 0.023), (37, 0.069), (39, 0.033), (59, 0.012), (71, 0.022), (74, 0.025), (82, 0.019), (84, 0.011), (85, 0.018), (90, 0.09), (92, 0.039), (94, 0.015), (99, 0.477)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97767788 169 acl-2012-Reducing Wrong Labels in Distant Supervision for Relation Extraction

Author: Shingo Takamatsu ; Issei Sato ; Hiroshi Nakagawa

Abstract: In relation extraction, distant supervision seeks to extract relations between entities from text by using a knowledge base, such as Freebase, as a source of supervision. When a sentence and a knowledge base refer to the same entity pair, this approach heuristically labels the sentence with the corresponding relation in the knowledge base. However, this heuristic can fail with the result that some sentences are labeled wrongly. This noisy labeled data causes poor extraction performance. In this paper, we propose a method to reduce the number of wrong labels. We present a novel generative model that directly models the heuristic labeling process of distant supervision. The model predicts whether assigned labels are correct or wrong via its hidden variables. Our experimental results show that this model detected wrong labels with higher performance than baseline methods. In the experiment, we also found that our wrong label reduction boosted the performance of relation extraction.

2 0.97295499 153 acl-2012-Named Entity Disambiguation in Streaming Data

Author: Alexandre Davis ; Adriano Veloso ; Altigran Soares ; Alberto Laender ; Wagner Meira Jr.

Abstract: The named entity disambiguation task is to resolve the many-to-many correspondence between ambiguous names and the unique real-world entity. This task can be modeled as a classification problem, provided that positive and negative examples are available for learning binary classifiers. High-quality sense-annotated data, however, are hard to obtain in streaming environments, since the training corpus would have to be constantly updated in order to accommodate the fresh data coming on the stream. On the other hand, few positive examples plus large amounts of unlabeled data may be easily acquired. Producing binary classifiers directly from this data, however, leads to poor disambiguation performance. Thus, we propose to enhance the quality of the classifiers using finer-grained variations of the well-known Expectation-Maximization (EM) algorithm. We conducted a systematic evaluation using Twitter streaming data and the results show that our classifiers are extremely effective, providing improvements ranging from 1% to 20%, when compared to the current state-of-the-art biased SVMs, being more than 120 times faster.

3 0.95866817 149 acl-2012-Movie-DiC: a Movie Dialogue Corpus for Research and Development

Author: Rafael E. Banchs

Abstract: This paper describes Movie-DiC, a Movie Dialogue Corpus recently collected for research and development purposes. The collected dataset comprises 132,229 dialogues containing a total of 764,146 turns that have been extracted from 753 movies. Details on how the data collection has been created and how it is structured are provided along with its main statistics and characteristics.

4 0.95379615 53 acl-2012-Combining Textual Entailment and Argumentation Theory for Supporting Online Debates Interactions

Author: Elena Cabrio ; Serena Villata

Abstract: Blogs and forums are widely adopted by online communities to debate about various issues. However, a user that wants to cut in on a debate may experience some difficulties in extracting the current accepted positions, and can be discouraged from interacting through these applications. In our paper, we combine textual entailment with argumentation theory to automatically extract the arguments from debates and to evaluate their acceptability.

5 0.92071515 101 acl-2012-Fully Abstractive Approach to Guided Summarization

Author: Pierre-Etienne Genest ; Guy Lapalme

Abstract: This paper shows that full abstraction can be accomplished in the context of guided summarization. We describe a work in progress that relies on Information Extraction, statistical content selection and Natural Language Generation. Early results already demonstrate the effectiveness of the approach.

same-paper 6 0.91971189 170 acl-2012-Robust Conversion of CCG Derivations to Phrase Structure Trees

7 0.64069074 29 acl-2012-Assessing the Effect of Inconsistent Assessors on Summarization Evaluation

8 0.63968426 40 acl-2012-Big Data versus the Crowd: Looking for Relationships in All the Right Places

9 0.63171279 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model

10 0.60251349 191 acl-2012-Temporally Anchored Relation Extraction

11 0.58832288 62 acl-2012-Cross-Lingual Mixture Model for Sentiment Classification

12 0.58493042 201 acl-2012-Towards the Unsupervised Acquisition of Discourse Relations

13 0.57163632 104 acl-2012-Graph-based Semi-Supervised Learning Algorithms for NLP

14 0.56385505 52 acl-2012-Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation

15 0.55236292 114 acl-2012-IRIS: a Chat-oriented Dialogue System based on the Vector Space Model

16 0.55138588 151 acl-2012-Multilingual Subjectivity and Sentiment Analysis

17 0.54976749 157 acl-2012-PDTB-style Discourse Annotation of Chinese Text

18 0.54952174 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars

19 0.54367983 71 acl-2012-Dependency Hashing for n-best CCG Parsing

20 0.54024428 8 acl-2012-A Corpus of Textual Revisions in Second Language Writing