acl acl2012 acl2012-157 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yuping Zhou ; Nianwen Xue
Abstract: We describe a discourse annotation scheme for Chinese and report on the preliminary results. Our scheme, inspired by the Penn Discourse TreeBank (PDTB), adopts the lexically grounded approach; at the same time, it makes adaptations based on the linguistic and statistical characteristics of Chinese text. Annotation results show that these adaptations work well in practice. Our scheme, taken together with other PDTB-style schemes (e.g. for English, Turkish, Hindi, and Czech), affords a broader perspective on how the generalized lexically grounded approach can flesh itself out in the context of cross-linguistic annotation of discourse relations.
Reference: text
sentIndex sentText sentNum sentScore
1 yzhou@brandeis.edu Abstract We describe a discourse annotation scheme for Chinese and report on the preliminary results. [sent-2, score-0.678]
2 Our scheme, inspired by the Penn Discourse TreeBank (PDTB), adopts the lexically grounded approach; at the same time, it makes adaptations based on the linguistic and statistical characteristics of Chinese text. [sent-3, score-0.305]
3 Our scheme, taken together with other PDTB-style schemes (e.g. for English, Turkish, Hindi, and Czech), affords a broader perspective on how the generalized lexically grounded approach can flesh itself out in the context of cross-linguistic annotation of discourse relations. [sent-7, score-0.838]
4 1 Introduction In the realm of discourse annotation, the Penn Discourse TreeBank (PDTB) (Prasad et al. [sent-8, score-0.453]
5 , 2008) separates itself by adopting a lexically grounded approach: Discourse relations are lexically anchored by discourse connectives (e. [sent-9, score-1.226]
6 In the absence of explicit discourse connectives, the PDTB asks the annotator to fill in a discourse connective that best describes the discourse relation between these two sentences, instead of selecting from an inventory of predefined discourse relations. [sent-12, score-2.509]
7 By keeping the discourse annotation lexically grounded even in the case of implicit discourse relations, the PDTB appeals to the annotator’s judgment at an intuitive level. [sent-13, score-1.547]
8 This is in contrast with an approach in which the set of discourse relations is pre-determined by linguistic experts and the role of the annotator is just to select from those choices (Mann and Thompson, 1988; Carlson et al. [sent-15, score-0.626]
9 This lexically grounded approach led to consistent and reliable discourse annotation, a feat that is generally hard to achieve for discourse annotation. [sent-17, score-1.113]
10 The PDTB team reported interannotator agreement in the lower 90% for explicit discourse relations (Miltsakaki et al. [sent-18, score-0.807]
11 In this paper we describe a discourse annotation scheme for Chinese that adopts this lexically grounded approach while making adaptations when warranted by the linguistic and statistical properties of Chinese text. [sent-20, score-0.983]
12 This scheme is shown to be practical and effective in the annotation experiment. [sent-21, score-0.225]
13 The rest of the paper is organized as follows: In Section 2, we review the key aspects of the PDTB annotation scheme under discussion in this paper. [sent-22, score-0.225]
14 In Section 4, we present the preliminary annotation results we have so far. [sent-27, score-0.151]
15 2 The PDTB annotation scheme As mentioned in the introduction, a discourse relation is viewed as a predication with two arguments in the framework of the PDTB. [sent-29, score-0.897]
16 Two types of discourse relation are distinguished in the annotation: explicit and implicit. [sent-31, score-0.776]
17 Although their annotation is carried out separately, it conforms to the same paradigm of a discourse connective with two arguments. [sent-34, score-0.987]
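The paradigm just described, a connective taking two arguments, can be pictured as a simple record. The following Python sketch is only an illustration of that structure; the class, field names, and example values are ours, not the PDTB file format.

# Illustrative sketch (not the PDTB file format): a discourse relation as
# "a connective with two arguments", following the paradigm described above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DiscourseRelation:
    rel_type: str              # "Explicit", "Implicit", "AltLex", "EntRel", or "NoRel"
    connective: Optional[str]  # anchoring connective, or the one filled in by the annotator
    sense: Optional[str]       # e.g. "CONTINGENCY:Cause"; None for EntRel/NoRel
    arg1: str                  # text span of Arg1
    arg2: str                  # text span of Arg2

# An explicit relation anchored by "because" (example values are illustrative):
rel = DiscourseRelation("Explicit", "because", "CONTINGENCY:Cause",
                        "the bank had to face competition",
                        "the market for export financing was liberalized")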
18 1 Annotation of explicit discourse relations Explicit discourse relations are those anchored by explicit discourse connectives in text. [sent-40, score-2.286]
19 Explicit connectives are drawn from three grammatical classes: subordinating conjunctions, coordinating conjunctions, and discourse adverbials. [sent-41, score-0.271]
20 Not all uses of these lexical items are considered to function as a discourse connective. [sent-48, score-0.453]
21 For example, coordinating conjunctions appearing in VP coordinations, such as “and” in (1), are not annotated as discourse connectives. [sent-49, score-0.508]
22 The text spans of the two arguments of a discourse connective are marked up. [sent-52, score-0.947]
23 There are no restrictions on how many clauses can be included in the text span for an argument other than the Minimality Principle: Only as many clauses and/or sentences should be included in an argument selection as are minimally required and sufficient for the interpretation of the relation. [sent-54, score-0.212]
24 EntRel: when the only relation between the two arguments is that they describe different aspects of the same entity, as in (3). [sent-57, score-0.166]
25 NoRel: when neither a lexicalized discourse relation nor entity-based coherence is present. [sent-58, score-0.453]
26 There are restrictions on what kinds of implicit relations are subject to annotation, presented below. [sent-70, score-0.477]
27 These restrictions do not have counterparts in explicit relation annotation. [sent-71, score-0.335]
28 • Implicit relations between adjacent clauses in the same sentence not separated by a semicolon are not annotated, even though the relation may very well be definable. [sent-72, score-0.389]
29 A case in point is presented in (4) below, involving an intrasentential comma-separated relation between a main clause and a free adjunct. [sent-73, score-0.228]
30 • Implicit relations between adjacent sentences across a paragraph boundary are not annotated. [sent-74, score-0.179]
31 (4) [MC The market for export financing was liberalized in the mid-1980s], [FA forcing the bank to face competition]. [sent-76, score-0.152]
32 3 Annotation of senses Discourse connectives, whether originally present in the data in the case of explicit relations, or filled in by annotators in the case of implicit relations, along with text spans marked as “AltLex”, are annotated with respect to their senses. [sent-78, score-0.635]
33 It is worth noting that a type of implicit relation, namely those labeled as “EntRel”, is not part of the sense hierarchy since it has no explicit counterpart. [sent-82, score-0.493]
34 1 Key characteristics of Chinese text Despite similarities in discourse features between Chinese and English (Xue, 2005), there are differences that have a significant impact on how discourse relations could be best annotated. [sent-84, score-0.601]
35 To report the same facts in English, it is more natural to break them down into two sentences or two semicolon-separated clauses; in Chinese, however, not only are they merely separated by a comma, but there is also no connective relating them. [sent-97, score-0.452]
36 This difference in writing style necessitates rethinking the annotation scheme. [sent-98, score-0.151]
37 If we apply the PDTB scheme to the English translation, regardless of whether the two facts are expressed in two sentences or two semicolon-separated clauses, at least one discourse relation will be annotated, relating these two text units. [sent-99, score-0.693]
38 In contrast, if we apply the same scheme to the Chinese sentence, no discourse relation will be picked out because this is just one comma-separated sentence with no explicit discourse connectives in it. [sent-100, score-1.54]
39 In other words, the discourse relation within the Chinese sentence, which would be captured in its English counterpart following the PDTB procedure, would be lost when annotating Chinese. [sent-101, score-0.645]
40 To ensure a reasonable level of coverage, we need to consider comma-delimited intra-sentential implicit relations when annotating Chinese text. [sent-103, score-0.5]
41 One complication is that it introduces considerable ambiguity associated with the comma into discourse annotation. [sent-105, score-0.604]
42 For example, the first comma in (5), immediately following "据悉" ("according to reports"), clearly does not indicate a discourse relation, so the guidelines need to spell out how to exclude such commas as discourse relation indicators. [sent-106, score-1.147]
43 We think, however, that disambiguating the commas in Chinese text is valuable in its own right and is a necessary step in annotating discourse relations. [sent-107, score-0.522]
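One way to make such guideline rules operational is a lexical pre-filter applied before commas are offered to annotators as relation sites. The sketch below is hypothetical: the cue list, the function, and the example sentence are ours, not the paper's guidelines.

# Hypothetical pre-filter for commas that do not anchor a discourse relation;
# the cue list and the rule are illustrative only, not the paper's guidelines.
REPORTING_CUES = ("据悉", "据报道", "据了解")   # "according to reports", etc.

def comma_is_relation_candidate(sentence: str, comma_index: int) -> bool:
    """Return False for a comma that merely closes a reporting phrase."""
    prefix = sentence[:comma_index].strip()
    if any(prefix.endswith(cue) for cue in REPORTING_CUES):
        return False
    return True

# A comma immediately following 据悉, as in example (5), would be filtered out
# (the sentence here is an illustrative one, not the paper's example):
print(comma_is_relation_candidate("据悉，该公司大力开拓了多元化市场", 2))  # False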
44 Another complication is that some comma-separated chunks are ambiguous as to whether they should be considered potential arguments in a discourse relation. [sent-108, score-0.496]
45 (7) [S2 同时 发展 跨国 经营 ， 大力 开拓 多元化 市场 。] (gloss: at the same time develop transnational operation , vigorously open up diversified market) "[S2 At the same time, (it) developed transnational operations (and) vigorously opened up diversified markets.]" [sent-122, score-0.258]
46 Since the subject can be omitted from the entire sentence, the absence or presence of a subject in a clause is no indication of whether the clause is a main clause or a free adjunct, or whether it is part of a VP coordination without a connective. [sent-123, score-0.345]
47 These basic decisions, directly based on the linguistic characteristics of Chinese, lead to more systematic adaptations of the annotation scheme, to which we will turn in the next subsection. [sent-126, score-0.249]
48 One consequence of the characteristics described in Section 3.1 is that there are far more tokens of implicit relations than explicit relations to deal with. [sent-129, score-0.695]
49 , 2005), 82% are tokens of implicit relation, compared to 54. [sent-131, score-0.283]
50 Given the overwhelming number of implicit relations, we re-examine where it could make an impact in the annotation scheme. [sent-134, score-0.476]
51 1 Procedural division between explicit and implicit discourse relation In the PDTB, explicit and implicit relations are annotated separately. [sent-138, score-1.652]
52 This is probably partly because explicit connectives are quite abundant in English, and partly because the project evolved in stages, expanding from the more canonical case of explicit relation to implicit relation for greater coverage. [sent-139, score-1.178]
53 So the question now is how to annotate explicit and implicit relations in one fell swoop. [sent-141, score-0.597]
54 In Chinese text, the use of a discourse connective is almost always accompanied by one or two punctuation marks (usually a period and/or a comma), preceding or flanking it. [sent-142, score-0.836]
55 So a sensible solution is to rely on punctuation as the common denominator between explicit and implicit relations; in the case of an explicit relation, the connective is marked up as an attribute of the discourse relation. [sent-143, score-1.488]
56 This unified approach simplifies the annotation procedure while preserving the explicit/implicit distinction in the process. [sent-144, score-0.203]
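A minimal sketch of how this unified, punctuation-anchored procedure could look in code: adjacent punctuation-delimited units form candidate argument pairs, and an explicit connective, when found, is recorded as an attribute of the relation rather than triggering a separate pass. The connective list, the splitting rule, and the example sentence are illustrative assumptions, not the paper's implementation.

# Sketch of the unified treatment: punctuation delimits candidate arguments,
# and an explicit connective, when present, is stored as an attribute of the
# relation rather than driving a separate annotation pass.
import re

CONNECTIVES = ("但是", "因为", "所以", "同时")  # a few common connectives (illustrative)

def candidate_relations(sentence: str):
    units = [u for u in re.split("[，。；]", sentence) if u]
    for arg1, arg2 in zip(units, units[1:]):
        conn = next((c for c in CONNECTIVES if arg2.startswith(c)), None)
        yield {"arg1": arg1, "arg2": arg2,
               "type": "explicit" if conn else "implicit",
               "connective": conn}

for r in candidate_relations("出口融资市场在八十年代中期放开，但是银行面临竞争。"):
    print(r)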
57 The thrust of the lexically grounded approach is that discourse annotation should be a data-driven, bottom-up process, rather than a top-down one, trying to fit data into a prescriptive system. [sent-147, score-0.811]
58 As to what role this distinction plays in the annotation procedure, it is an engineering issue, depending on a slew of factors, among which are cross-linguistic variations. [sent-149, score-0.203]
59 In the case of Chinese, we think it is more economical to treat explicit and implicit relations alike in the annotation process. [sent-150, score-0.81]
60 To treat explicit and implicit relations alike actually goes beyond annotating them in one pass; it also involves how they are annotated, which we discuss next. [sent-151, score-0.694]
61 2 Annotation of implicit discourse relations In the PDTB, treatment of implicit discourse relations is modeled after that of explicit relations, and at the same time, some restrictions are put on implicit, but not explicit, relations. [sent-154, score-1.814]
62 This is quite understandable: implicit discourse relations tend to be vague and elusive, so making use of explicit relations as a prototype helps pin them down, and restrictions are put in place to strike a balance between high reliability and good coverage. [sent-155, score-1.279]
63 When implicit relations constitute a vast majority of the data as is the case with Chinese, both aspects need to be re-examined to strike a new balance. [sent-156, score-0.466]
64 The inserted connectives and those marked as “AltLex”, along with explicit discourse connectives, are further annotated with respect to their senses. [sent-158, score-0.927]
65 When a connective needs to be inserted in a majority of cases, the difficulty of the task really stands out. [sent-159, score-0.383]
66 In many cases, it seems, there is a good reason for the absence of a connective, and because of it the wording resists insertion of a connective even when it expresses the underlying discourse relation exactly (or sometimes the wording itself may be the reason no connective is present). [sent-160, score-1.464]
67 Trying to insert a connective expression may thus very well be too hard a task for annotators, with little to show for their effort in the end. [sent-161, score-0.413]
68 Furthermore, the inter-annotator agreement for providing an explicit connective in place of an implicit one is computed based on the type of explicit connectives (e. [sent-162, score-1.309]
69 Given the above two considerations, our solution is to annotate implicit discourse relations with their senses directly, bypassing the step of inserting a connective expression. [sent-168, score-1.336]
70 It has been pointed out that to train annotators to reason about pre-defined abstract relations with high reliability might be too hard a task (Prasad et al. [sent-169, score-0.232]
71 This difficulty can be overcome by associating each semantic type with one or two prototypical explicit connectives and asking annotators to consider each to see if it expresses the implicit discourse relation. [sent-171, score-1.222]
72 This way, annotators have a concrete aid to reason about abstract relations without having to choose one connective from a set expressing roughly the same relation or having to worry about whether insertion of the connective is somehow awkward. [sent-172, score-1.121]
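In practice this aid amounts to a lookup from each sense type to one or two prototypical connectives that the annotator mentally inserts as a test. The sense labels below follow PDTB-style class names, but the particular pairings and the prompt wording are illustrative assumptions, not the paper's sense inventory.

# Illustrative mapping from sense types to prototypical connectives used as a
# mental test by annotators; the pairings are assumptions, not the published scheme.
PROTOTYPES = {
    "CONTINGENCY:Cause":     ["因为 (because)", "所以 (so)"],
    "COMPARISON:Contrast":   ["但是 (but)"],
    "TEMPORAL:Synchrony":    ["同时 (at the same time)"],
    "EXPANSION:Conjunction": ["并且 (and)"],
}

def annotation_prompt(sense: str) -> str:
    """Phrase the mental test an annotator applies to an implicit relation."""
    conns = " / ".join(PROTOTYPES[sense])
    return f"Does '{conns}' express the relation holding between the two arguments?"

print(annotation_prompt("CONTINGENCY:Cause"))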
73 It should be noted that annotating implicit relations directly with their senses means that sense annotation is no longer restricted to those that can be lexically expressed, but also includes those that cannot, notably those labeled “EntRel/NoRel” in the PDTB. [sent-173, score-0.883]
74 In other words, we annotate senses of discourse relations, not just connectives and their lexical alternatives (in the case of AltLex). [sent-174, score-0.793]
75 This expansion is consistent with the generalized view of the lexically grounded approach discussed in Section 3. [sent-175, score-0.234]
76 With respect to restrictions on implicit relation, we will adopt them as they prove to be necessary in the annotation process, with one exception. [sent-178, score-0.48]
77 The exception is the restriction that implicit relations between adjacent clauses in the same sentence not separated by a semi-colon are not annotated. [sent-179, score-0.579]
78 This restriction seems to apply mainly to a main clause and any free adjunct attached to it in English; in Chinese, however, the distinction between a main clause and a free adjunct is not as clear-cut, for reasons explained in Section 3.1. [sent-180, score-0.413]
79 (Footnote 2: Thus "EntRel" and "NoRel" are treated as relation senses, rather than relation types, in our scheme.) [sent-181, score-0.199]
80 3 Definition of Arg1 and Arg2 The third area that the overwhelming number of implicit relations in the data affects is how Arg1 and Arg2 are defined. [sent-186, score-0.448]
81 As mentioned in the introduction, a discourse relation is viewed as a predication with two arguments. [sent-187, score-0.654]
82 These two arguments are defined based on the physical location of the connective in the PDTB: Arg2 is the argument expressed by the clause syntactically bound to the connective and Arg1 is the other argument. [sent-188, score-0.92]
83 In the case of implicit relations, the label is assigned according to the text order. [sent-189, score-0.283]
84 In an annotation task where implicit relations constitute an overwhelming majority, the distinction of Arg1 and Arg2 is meaningless in most cases. [sent-190, score-0.676]
85 In addition, the phenomenon of parallel connectives is predominant in Chinese. [sent-191, score-0.271]
86 Parallel connectives are pairs of connectives that take the same arguments, examples of which in English are "if . . . then". [sent-192, score-0.542]
87 In Chinese, most connectives are part of a pair; though some can be dropped from their pair, it is considered “proper” or formal to use both. [sent-199, score-0.308]
88 (8) below presents two such examples, for which parallel connectives are not possible in English. [sent-200, score-0.271]
89 In the PDTB, parallel connectives are annotated discontinuously; but given the prevalence of this phenomenon in Chinese, such practice would generate a considerable percentage of essentially repetitive annotation among explicit relations. [sent-205, score-0.588]
90 Rather than abandoning the distinction altogether, we think it makes more sense to define Arg1 and Arg2 semantically. [sent-207, score-0.13]
91 It will not create too much additional work beyond the distinction of different senses of discourse relations made in the PDTB. [sent-208, score-0.697]
92 In this scheme, whether 因 ("because") or 故 ("therefore") appears without the other, whether they appear as a pair in a sentence, or whether the relation is implicit, the Arg1 and Arg2 labels will be consistently assigned to the same clauses. [sent-210, score-0.123]
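A small sketch of what semantically defined argument labels look like for CONTINGENCY:Cause: the labels track the semantic roles of the clauses, not the position of whichever connective happens to appear. Mapping the cause to Arg1 is an assumption made here for illustration, and the clause strings are made-up examples; the paper's actual convention may differ.

# Sketch of semantically defined Arg1/Arg2: labels follow the semantic role
# (cause vs. effect), not the position of the connective. Cause -> Arg1 is an
# illustrative choice, not the paper's stated convention.
def label_cause_relation(cause_clause: str, effect_clause: str, connectives=()):
    """Assign Arg1/Arg2 by semantic role; 'connectives' may hold 因, 故, both, or none."""
    return {"sense": "CONTINGENCY:Cause",
            "connectives": list(connectives),
            "Arg1": cause_clause,
            "Arg2": effect_clause}

# The same clauses get the same labels whether 因 appears, 故 appears,
# both appear as a parallel pair, or neither (implicit relation):
print(label_cause_relation("出口融资市场放开", "银行面临竞争", ("因", "故")))
print(label_cause_relation("出口融资市场放开", "银行面临竞争"))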
93 This approach is consistent with the move from annotating senses of connectives to annotating senses of discourse relations, pointed out in Section 3. [sent-211, score-1.0]
94 For example, in the PDTB’s sense hierarchy, “reason” and “result” are subtypes under type CONTINGENCY:Cause: “reason” applies to connectives like “because” and “since” while “result” applies to connectives like “so” and “as a result”. [sent-214, score-0.635]
95 When we move to annotating senses of discourse relations, since both types of connectives express the same underlying discourse relation, there will not be further division under CONTINGENCY:Cause, and the "reason"/"result" distinction is an intrinsic property of the semantic type. [sent-215, score-1.397]
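The effect on the sense inventory can be shown with two simplified fragments; both are illustrations only, not complete listings of either hierarchy. The PDTB needs the reason/result subtypes to record which argument the connective attaches to, whereas with semantically defined arguments a single Cause type suffices.

# Simplified, illustrative fragments of the two sense inventories (not complete).
PDTB_FRAGMENT = {
    "CONTINGENCY": {"Cause": ["reason", "result"]},  # subtype depends on the connective
}
CHINESE_SCHEME_FRAGMENT = {
    "CONTINGENCY": {"Cause": []},  # no reason/result split; roles live in Arg1/Arg2
}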
96 4 Annotation experiment To test our adapted annotation scheme, we have conducted annotation experiments on a modest, yet significant, amount of data and computed agreement statistics. [sent-217, score-0.342]
97 1 Set-up The agreement statistics come from annotation conducted by two annotators in training so far. [sent-219, score-0.24]
98 The annotation is carried out on the PDTB annotation tool. [sent-223, score-0.302]
99 2 Inter-annotator agreement To evaluate our proposed scheme, we measure agreement on each adaptation proposed in Section 3, as well as agreement on argument span determination. [sent-225, score-0.165]
100 Table caption: agreement on different aspects of Chinese discourse annotation: rel-ident, discourse relation identification; rel-type, relation type classification; imp-sns-type, classification of sense type of implicit relations; arg-order, order determination of Arg1 and Arg2. [sent-233, score-1.479]
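Agreement on each of these aspects reduces to comparing the two annotators' decisions at the same relation sites. The following sketch computes simple percentage agreement and is an illustration only, not the evaluation code behind the reported numbers; keying decisions by relation site and the example labels are our assumptions.

# Minimal sketch of per-aspect inter-annotator agreement; each annotator's
# decisions are assumed to be a dict keyed by relation site (e.g., comma index).
def percent_agreement(ann_a: dict, ann_b: dict) -> float:
    shared = set(ann_a) & set(ann_b)
    if not shared:
        return 0.0
    matches = sum(ann_a[k] == ann_b[k] for k in shared)
    return matches / len(shared)

a = {1: "CONTINGENCY:Cause", 2: "EntRel", 3: "TEMPORAL:Synchrony"}
b = {1: "CONTINGENCY:Cause", 2: "NoRel", 3: "TEMPORAL:Synchrony"}
print(f"imp-sns-type agreement: {percent_agreement(a, b):.2f}")  # 0.67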
wordName wordTfidf (topN-words)
[('discourse', 0.453), ('connective', 0.383), ('pdtb', 0.364), ('implicit', 0.283), ('connectives', 0.271), ('explicit', 0.166), ('annotation', 0.151), ('relations', 0.148), ('relation', 0.123), ('lexically', 0.119), ('chinese', 0.111), ('adaptations', 0.098), ('altlex', 0.092), ('grounded', 0.088), ('contingency', 0.084), ('entrel', 0.077), ('scheme', 0.074), ('acknowledged', 0.07), ('export', 0.07), ('annotating', 0.069), ('senses', 0.069), ('clause', 0.066), ('clauses', 0.061), ('companies', 0.059), ('market', 0.056), ('prasad', 0.056), ('acceptance', 0.056), ('norel', 0.056), ('dongguan', 0.053), ('predication', 0.053), ('distinction', 0.052), ('subtypes', 0.049), ('annotators', 0.049), ('pilot', 0.047), ('comma', 0.047), ('restrictions', 0.046), ('acknowledging', 0.046), ('cause', 0.045), ('argument', 0.045), ('sense', 0.044), ('relating', 0.043), ('arguments', 0.043), ('overwhelming', 0.042), ('subject', 0.04), ('agreement', 0.04), ('adjuncts', 0.039), ('miltsakaki', 0.039), ('free', 0.039), ('marked', 0.037), ('literally', 0.037), ('dropped', 0.037), ('adjunct', 0.037), ('neg', 0.037), ('pragmatic', 0.035), ('customs', 0.035), ('milgrim', 0.035), ('strike', 0.035), ('transnational', 0.035), ('vigorously', 0.035), ('village', 0.035), ('reason', 0.035), ('think', 0.034), ('distinguished', 0.034), ('spans', 0.031), ('adjacent', 0.031), ('justification', 0.031), ('brande', 0.031), ('diversified', 0.031), ('shanghai', 0.031), ('procedural', 0.031), ('xue', 0.031), ('conjunctions', 0.03), ('insert', 0.03), ('division', 0.03), ('restriction', 0.03), ('alike', 0.028), ('east', 0.028), ('anchored', 0.028), ('brandeis', 0.028), ('waltham', 0.028), ('responded', 0.028), ('company', 0.028), ('files', 0.028), ('coordination', 0.028), ('generalized', 0.027), ('distinguishing', 0.026), ('leave', 0.026), ('port', 0.026), ('actively', 0.026), ('instantiation', 0.026), ('wording', 0.026), ('bank', 0.026), ('separated', 0.026), ('annotator', 0.025), ('coordinating', 0.025), ('cases', 0.024), ('reports', 0.024), ('vp', 0.024), ('partly', 0.023), ('acknowledge', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 157 acl-2012-PDTB-style Discourse Annotation of Chinese Text
Author: Yuping Zhou ; Nianwen Xue
Abstract: We describe a discourse annotation scheme for Chinese and report on the preliminary results. Our scheme, inspired by the Penn Discourse TreeBank (PDTB), adopts the lexically grounded approach; at the same time, it makes adaptations based on the linguistic and statistical characteristics of Chinese text. Annotation results show that these adaptations work well in practice. Our scheme, taken together with other PDTB-style schemes (e.g. for English, Turkish, Hindi, and Czech), affords a broader perspective on how the generalized lexically grounded approach can flesh itself out in the context of cross-linguistic annotation of discourse relations.
2 0.43641958 193 acl-2012-Text-level Discourse Parsing with Rich Linguistic Features
Author: Vanessa Wei Feng ; Graeme Hirst
Abstract: In this paper, we develop an RST-style textlevel discourse parser, based on the HILDA discourse parser (Hernault et al., 2010b). We significantly improve its tree-building step by incorporating our own rich linguistic features. We also analyze the difficulty of extending traditional sentence-level discourse parsing to text-level parsing by comparing discourseparsing performance under different discourse conditions.
3 0.31290931 47 acl-2012-Chinese Comma Disambiguation for Discourse Analysis
Author: Yaqin Yang ; Nianwen Xue
Abstract: The Chinese comma signals the boundary of discourse units and also anchors discourse relations between adjacent text spans. In this work, we propose a discourse structureoriented classification of the comma that can be automatically extracted from the Chinese Treebank based on syntactic patterns. We then experimented with two supervised learning methods that automatically disambiguate the Chinese comma based on this classification. The first method integrates comma classification into parsing, and the second method adopts a “post-processing” approach that extracts features from automatic parses to train a classifier. The experimental results show that the second approach compares favorably against the first approach.
4 0.2950469 201 acl-2012-Towards the Unsupervised Acquisition of Discourse Relations
Author: Christian Chiarcos
Abstract: This paper describes a novel approach towards the empirical approximation of discourse relations between different utterances in texts. Following the idea that every pair of events comes with preferences regarding the range and frequency of discourse relations connecting both parts, the paper investigates whether these preferences are manifested in the distribution of relation words (that serve to signal these relations). Experiments on two large-scale English web corpora show that significant correlations between pairs of adjacent events and relation words exist, that they are reproducible on different data sets, and for three relation words, that their distribution corresponds to theory-based assumptions.
5 0.14020312 52 acl-2012-Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation
Author: Ziheng Lin ; Chang Liu ; Hwee Tou Ng ; Min-Yen Kan
Abstract: An ideal summarization system should produce summaries that have high content coverage and linguistic quality. Many state-ofthe-art summarization systems focus on content coverage by extracting content-dense sentences from source articles. A current research focus is to process these sentences so that they read fluently as a whole. The current AESOP task encourages research on evaluating summaries on content, readability, and overall responsiveness. In this work, we adapt a machine translation metric to measure content coverage, apply an enhanced discourse coherence model to evaluate summary readability, and combine both in a trained regression model to evaluate overall responsiveness. The results show significantly improved performance over AESOP 2011 submitted metrics.
7 0.090083115 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation
8 0.078693964 191 acl-2012-Temporally Anchored Relation Extraction
9 0.078241371 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model
10 0.073194906 50 acl-2012-Collective Classification for Fine-grained Information Status
11 0.06835068 90 acl-2012-Extracting Narrative Timelines as Temporal Dependency Structures
12 0.062499061 49 acl-2012-Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study
13 0.061976958 12 acl-2012-A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relation Extraction
14 0.056097444 64 acl-2012-Crosslingual Induction of Semantic Roles
15 0.054848462 217 acl-2012-Word Sense Disambiguation Improves Information Retrieval
16 0.053942598 53 acl-2012-Combining Textual Entailment and Argumentation Theory for Supporting Online Debates Interactions
17 0.051553193 44 acl-2012-CSNIPER - Annotation-by-query for Non-canonical Constructions in Large Corpora
18 0.051340804 133 acl-2012-Learning to "Read Between the Lines" using Bayesian Logic Programs
19 0.05055714 147 acl-2012-Modeling the Translation of Predicate-Argument Structure for SMT
20 0.048745617 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
topicId topicWeight
[(0, -0.158), (1, 0.115), (2, -0.115), (3, 0.136), (4, 0.081), (5, -0.092), (6, -0.256), (7, -0.138), (8, -0.261), (9, 0.488), (10, 0.024), (11, -0.045), (12, 0.054), (13, -0.012), (14, 0.121), (15, -0.066), (16, 0.012), (17, 0.02), (18, 0.03), (19, -0.009), (20, -0.101), (21, 0.012), (22, -0.121), (23, 0.03), (24, -0.03), (25, -0.057), (26, 0.045), (27, -0.007), (28, -0.075), (29, 0.02), (30, -0.057), (31, 0.005), (32, 0.007), (33, 0.004), (34, 0.065), (35, -0.007), (36, 0.006), (37, 0.035), (38, 0.01), (39, 0.012), (40, 0.021), (41, -0.015), (42, -0.011), (43, 0.014), (44, -0.013), (45, 0.017), (46, 0.003), (47, 0.003), (48, -0.036), (49, 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.98519957 157 acl-2012-PDTB-style Discourse Annotation of Chinese Text
Author: Yuping Zhou ; Nianwen Xue
Abstract: We describe a discourse annotation scheme for Chinese and report on the preliminary results. Our scheme, inspired by the Penn Discourse TreeBank (PDTB), adopts the lexically grounded approach; at the same time, it makes adaptations based on the linguistic and statistical characteristics of Chinese text. Annotation results show that these adaptations work well in practice. Our scheme, taken together with other PDTB-style schemes (e.g. for English, Turkish, Hindi, and Czech), affords a broader perspective on how the generalized lexically grounded approach can flesh itself out in the context of cross-linguistic annotation of discourse relations.
2 0.93171418 47 acl-2012-Chinese Comma Disambiguation for Discourse Analysis
Author: Yaqin Yang ; Nianwen Xue
Abstract: The Chinese comma signals the boundary of discourse units and also anchors discourse relations between adjacent text spans. In this work, we propose a discourse structureoriented classification of the comma that can be automatically extracted from the Chinese Treebank based on syntactic patterns. We then experimented with two supervised learning methods that automatically disambiguate the Chinese comma based on this classification. The first method integrates comma classification into parsing, and the second method adopts a “post-processing” approach that extracts features from automatic parses to train a classifier. The experimental results show that the second approach compares favorably against the first approach.
3 0.93017495 193 acl-2012-Text-level Discourse Parsing with Rich Linguistic Features
Author: Vanessa Wei Feng ; Graeme Hirst
Abstract: In this paper, we develop an RST-style textlevel discourse parser, based on the HILDA discourse parser (Hernault et al., 2010b). We significantly improve its tree-building step by incorporating our own rich linguistic features. We also analyze the difficulty of extending traditional sentence-level discourse parsing to text-level parsing by comparing discourseparsing performance under different discourse conditions.
4 0.80805552 201 acl-2012-Towards the Unsupervised Acquisition of Discourse Relations
Author: Christian Chiarcos
Abstract: This paper describes a novel approach towards the empirical approximation of discourse relations between different utterances in texts. Following the idea that every pair of events comes with preferences regarding the range and frequency of discourse relations connecting both parts, the paper investigates whether these preferences are manifested in the distribution of relation words (that serve to signal these relations). Experiments on two large-scale English web corpora show that significant correlations between pairs of adjacent events and relation words exist, that they are reproducible on different data sets, and for three relation words, that their distribution corresponds to theorybased assumptions. 1 Motivation Texts are not merely accumulations of isolated utterances, but the arrangement of utterances conveys meaning; human text understanding can thus be described as a process to recover the global structure of texts and the relations linking its different parts (Vallduv ı´ 1992; Gernsbacher et al. 2004). To capture these aspects of meaning in NLP, it is necessary to develop operationalizable theories, and, within a supervised approach, large amounts of annotated training data. To facilitate manual annotation, weakly supervised or unsupervised techniques can be applied as preprocessing step for semimanual annotation, and this is part of the motivation of the approach described here. 213 Discourse relations involve different aspects of meaning. This may include factual knowledge about the connected discourse segments (a ‘subjectmatter’ relation, e.g., if one utterance represents the cause for another, Mann and Thompson 1988, p.257), argumentative purposes (a ‘presentational’ relation, e.g., one utterance motivates the reader to accept a claim formulated in another utterance, ibid., p.257), or relations between entities mentioned in the connected discourse segments (anaphoric relations, Webber et al. 2003). Discourse relations can be indicated explicitly by optional cues, e.g., adverbials (e.g., however), conjunctions (e.g., but), or complex phrases (e.g., in contrast to what Peter said a minute ago). Here, these cues are referred to as relation words. Assuming that relation words are associated with specific discourse relations (Knott and Dale 1994; Prasad et al. 2008), the distribution of relation words found between two (types of) events can yield insights into the range of discourse relations possible at this occasion and their respective likeliness. For this purpose, this paper proposes a background knowledge base (BKB) that hosts pairs of events (here heuristically represented by verbs) along with distributional profiles for relation words. The primary data structure of the BKB is a triple where one event (type) is connected with a particular relation word to another event (type). Triples are further augmented with a frequency score (expressing the likelihood of the triple to be observed), a significance score (see below), and a correlation score (indicating whether a pair of events has a positive or negative correlation with a particular relation word). ProceedJienjgus, R ofep thueb 5lic0t hof A Knonrueaa,l M 8-e1e4ti Jnugly o f2 t0h1e2 A.s ?c so2c0ia1t2io Ans fsoorc Ciatoiomnp fuotart Cioonmaplu Ltiantgiounisatlic Lsi,n pgaugiestsi2c 1s3–217, Triples can be easily acquired from automatically parsed corpora. 
While the relation word is usually part of the utterance that represents the source of the relation, determining the appropriate target (antecedent) of the relation may be difficult to achieve. As a heuristic, an adjacency preference is adopted, i.e., the target is identified with the main event of the preceding utterance.1 The BKB can be constructed from a sufficiently large corpus as follows: • • identify event types and relation words for every utterance create a candidate triple consisting of the event type of the utterance, the relation word, and the event type of the preceding utterance. add the candidate triple to the BKB, if it found in the BKB, increase its score by (or initialize it with) 1, – – • perform a pruning on all candidate triples, calcpuerlaftoer significance aonnd a lclo crarneldaitdioante scores Pruning uses statistical significance tests to evaluate whether the relative frequency of a relation word for a pair of events is significantly higher or lower than the relative frequency of the relation word in the entire corpus. Assuming that incorrect candidate triples (i.e., where the factual target of the relation was non-adjacent) are equally distributed, they should be filtered out by the significance tests. The goal of this paper is to evaluate the validity of this approach. 2 Experimental Setup By generalizing over multiple occurrences of the same events (or, more precisely, event types), one can identify preferences of event pairs for one or several relation words. These preferences capture context-invariant characteristics of pairs of events and are thus to considered to reflect a semantic predisposition for a particular discourse relation. Formally, an event is the semantic representation of the meaning conveyed in the utterance. We 1Relations between non-adjacent utterances are constrained by the structure of discourse (Webber 1991), and thus less likely than relations between adjacent utterances. 214 assume that the same event can reoccur in different contexts, we are thus studying relations between types of events. For the experiment described here, events are heuristically identified with the main predicates of a sentence, i.e., non-auxiliar, noncausative, non-modal verbal lexemes that serve as heads of main clauses. The primary data structure of the approach described here is a triple consisting of a source event, a relation word and a target (antecedent) event. These triples are harvested from large syntactically annotated corpora. For intersentential relations, the target is identified with the event of the immediately preceding main clause. These extraction preferences are heuristic approximations, and thus, an additional pruning step is necessary. For this purpose, statistical significance tests are adopted (χ2 for triples of frequent events and relation words, t-test for rare events and/or relation words) that compare the relative frequency of a rela- tion word given a pair of events with the relative frequency of the relation word in the entire corpus. All results with p ≥ .05 are excluded, i.e., only triples are preserved pfo ≥r w .0h5ic ahr teh eex xocblsuedrevde,d i positive or negative correlation between a pair of events and a relation word is not due to chance with at least 95% probability. Assuming an even distribution of incorrect target events, this should rule these out. Additionally, it also serves as a means of evaluation. 
Using statistical significance tests as pruning criterion entails that all triples eventually confirmed are statistically significant.2 This setup requires immense amounts of data: We are dealing with several thousand events (theoretically, the total number of verbs of a language). The chance probability for two events to occur in adjacent position is thus far below 10−6, and it decreases further if the likelihood of a relation word is taken into consideration. All things being equal, we thus need millions of sentences to create the BKB. Here, two large-scale corpora of English are employed, PukWaC and Wackypedia EN (Baroni et al. 2009). PukWaC is a 2G-token web corpus of British English crawled from the uk domain (Ferraresi et al. 2Subsequent studies may employ less rigid pruning criteria. For the purpose of the current paper, however, the statistical significance of all extracted triples serves as an criterion to evaluate methodological validity. 2008), and parsed with MaltParser (Nivre et al. 2006). It is distributed in 5 parts; Only PukWaC1 to PukWaC-4 were considered here, constituting 82.2% (72.5M sentences) of the entire corpus, PukWaC-5 is left untouched for forthcoming evaluation experiments. Wackypedia EN is a 0.8G-token dump of the English Wikipedia, annotated with the same tools. It is distributed in 4 different files; the last portion was left untouched for forthcoming evaluation experiments. The portion analyzed here comprises 33.2M sentences, 75.9% of the corpus. The extraction of events in these corpora uses simple patterns that combine dependency information and part-of-speech tags to retrieve the main verbs and store their lemmata as event types. The target (antecedent) event was identified with the last main event of the preceding sentence. As relation words, only sentence-initial children of the source event that were annotated as adverbial modifiers, verb modifiers or conjunctions were considered. 3 Evaluation To evaluate the validity of the approach, three fundamental questions need to be addressed: significance (are there significant correlations between pairs of events and relation words ?), reproducibility (can these correlations confirmed on independent data sets ?), and interpretability (can these correlations be interpreted in terms of theoretically-defined discourse relations ?). 3.1 Significance and Reproducibility Significance tests are part of the pruning stage of the algorithm. Therefore, the number of triples eventually retrieved confirms the existence of statistically significant correlations between pairs of events and relation words. The left column of Tab. 1 shows the number of triples obtained from PukWaC subcorpora of different size. For reproducibility, compare the triples identified with Wackypedia EN and PukWaC subcorpora of different size: Table 1 shows the number of triples found in both Wackypedia EN and PukWaC, and the agreement between both resources. For two triples involving the same events (event types) and the same relation word, agreement means that the relation word shows either positive or negative correlation 215 TasPbe13u7l4n2k98t. We254Mn1a c:CeAs(gurb42)et760cr8m,iop3e61r4l28np0st6uwicho21rm9W,e2673mas048p7c3okenytpdoagi21p8r,o35eE0s29Nit36nvgreipol8796r50s9%.n3509egative correlation of event pairs and relation words between Wackypedia EN and PukWaC subcorpora of different size TBH: thb ouetwnev r17 t1,o27,t0a95P41 ul2kWv6aCs,8.0 Htr5iple1v s, 45.12T35av9sg7.reH7em nv6 ts62(. 
Table 1 confirms that results obtained on one resource can be reproduced on another. This indicates that the triples indeed capture context-invariant, and hence semantic, characteristics of the relation between events. The data also indicate that reproducibility increases with the size of the corpora from which a BKB is built.

3.2 Interpretability

Any theory of discourse relations would predict that relation words with similar function should have similar distributions, whereas one would expect different distributions for functionally unrelated relation words. These expectations are tested here for three of the most frequent relation words found in the corpora, i.e., but, then and however. But and however can be grouped together under a generalized notion of contrast (Knott and Dale 1994; Prasad et al. 2008); then, on the other hand, indicates a temporal and/or causal relation. Table 2 confirms the expectation that event pairs correlated with but tend to show the same correlation with however, but not with then.

4 Discussion and Outlook

This paper described a novel approach towards the unsupervised acquisition of discourse relations, with encouraging preliminary results: large collections of parsed text are used to assess distributional profiles of relation words that indicate which discourse relations are possible between specific types of events; on this basis, a background knowledge base (BKB) was created that can be used to predict an appropriate discourse marker to connect two utterances with no overt relation word. This information can be used, for example, to facilitate the semi-automated annotation of discourse relations, by pointing out the 'default' relation word for a given pair of events. Similarly, Zhou et al. (2010) used a language model to predict discourse markers for implicitly realized discourse relations. As opposed to that shallow, n-gram-based approach, here the internal structure of utterances is exploited: based on semantic considerations, syntactic patterns have been devised that extract triples of event pairs and relation words. The resulting BKB provides a distributional approximation of the discourse relations that can hold between two specific event types. Both approaches exploit complementary sources of knowledge and may be combined with each other to achieve a more precise prediction of implicit discourse connectives.

The validity of the approach was evaluated with respect to three criteria: the extracted associations between relation words and event pairs could be shown to be statistically significant and to be reproducible on other corpora; for three highly frequent relation words, theoretical predictions about their relative distribution could be confirmed, indicating their interpretability in terms of presupposed taxonomies of discourse relations. Another prospective field of application lies in NLP systems, where selection preferences for relation words may serve as a cheap replacement for full-fledged discourse parsing.
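To illustrate the 'default relation word' use case mentioned above, the following sketch ranks the relation words that are positively correlated with a given pair of event types, using the same assumed BKB layout as in the earlier sketches; the function name, the ranking by raw counts and the example call are hypothetical, not taken from the paper.

```python
# Illustrative lookup: given a BKB of significant triples
# (source_event, relation_word, target_event) -> (sign, count, p),
# suggest relation words positively correlated with the event pair,
# ranked by their counts.  Representation and ranking are assumptions.
def suggest_connectives(bkb, source_event, target_event, top_k=3):
    candidates = [
        (word, count)
        for (src, word, tgt), (sign, count, _p) in bkb.items()
        if src == source_event and tgt == target_event and sign > 0
    ]
    candidates.sort(key=lambda wc: wc[1], reverse=True)
    return [word for word, _ in candidates[:top_k]]

# e.g. suggest_connectives(bkb, "fall", "rise") might return ["but", "however", "then"]
```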
In the Natural Language Understanding domain, the BKB may help to disambiguate or to identify discourse relations between different events; in the context of Machine Translation, it may represent a factor guiding the insertion of relation words, a task that has been found to be problematic for languages that differ in their inventory and usage of discourse markers, e.g., German and English (Stede and Schmitz 2000). The approach is language-independent (except for the syntactic extraction patterns), and it does not require manually annotated data. It would thus be easy to create background knowledge bases with relation words for other languages or specific domains, given a sufficient amount of textual data.

Related research includes, for example, the unsupervised recognition of causal and temporal relationships, as required, for example, for the recognition of textual entailment. Riaz and Girju (2010) exploit distributional information about pairs of utterances. Unlike the approach described here, they are not restricted to adjacent utterances and do not rely on explicit and recurrent relation words. Their approach can thus be applied to comparably small data sets. However, they are restricted to a specific type of relation, whereas here the entire bandwidth of discourse relations that are explicitly realized in a language is covered. Prospectively, both approaches could be combined to compensate for their respective weaknesses. Similar observations can be made with respect to Chambers and Jurafsky (2009) and Kasch and Oates (2010), who also study a single discourse relation (narration) and are thus more limited in scope than the approach described here. However, as their approach extends beyond pairs of events to complex event chains, it seems that both approaches provide complementary types of information, and their results could also be combined in a fruitful way to achieve a more detailed assessment of discourse relations.

The goal of this paper was to evaluate the methodological validity of the approach. It thus represents the basis for further experiments, e.g., with respect to the enrichment of the BKB with information provided by Riaz and Girju (2010), Chambers and Jurafsky (2009) and Kasch and Oates (2010). Other directions of subsequent research may include more elaborate models of events, and the investigation of the relationship between relation words and taxonomies of discourse relations.

Acknowledgments

This work was supported by a fellowship within the Postdoc program of the German Academic Exchange Service (DAAD). Initial experiments were conducted at the Collaborative Research Center (SFB) 632 “Information Structure” at the University of Potsdam, Germany. I would also like to thank three anonymous reviewers for valuable comments and feedback, as well as Manfred Stede and Ed Hovy, whose work on discourse relations on the one hand and proposition stores on the other has been the main inspiration for this paper.

References

M. Baroni, S. Bernardini, A. Ferraresi, and E. Zanchetta. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226, 2009.

N. Chambers and D. Jurafsky. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 602–610. Association for Computational Linguistics, 2009.
A. Ferraresi, E. Zanchetta, M. Baroni, and S. Bernardini. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop (WAC-4): Can we beat Google?, pages 47–54, 2008.

Morton Ann Gernsbacher, Rachel R. W. Robertson, Paola Palladino, and Necia K. Werner. Managing mental representations during narrative comprehension. Discourse Processes, 37(2):145–164, 2004.

N. Kasch and T. Oates. Mining script-like structures from the web. In Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading, pages 34–42. Association for Computational Linguistics, 2010.

A. Knott and R. Dale. Using linguistic phenomena to motivate a set of coherence relations. Discourse Processes, 18(1):35–62, 1994.

J. van Kuppevelt and R. Smith, editors. Current Directions in Discourse and Dialogue. Kluwer, Dordrecht, 2003.

William C. Mann and Sandra A. Thompson. Rhetorical Structure Theory: Toward a functional theory of text organization. Text, 8(3):243–281, 1988.

J. Nivre, J. Hall, and J. Nilsson. MaltParser: A data-driven parser-generator for dependency parsing. In Proc. of LREC, pages 2216–2219. Citeseer, 2006.

R. Prasad, N. Dinesh, A. Lee, E. Miltsakaki, L. Robaldo, A. Joshi, and B. Webber. The Penn Discourse TreeBank 2.0. In Proc. 6th International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, 2008.

M. Riaz and R. Girju. Another look at causality: Discovering scenario-specific contingency relationships with no supervision. In Semantic Computing (ICSC), 2010 IEEE Fourth International Conference on, pages 361–368. IEEE, 2010.

M. Stede and B. Schmitz. Discourse particles and discourse functions. Machine Translation, 15(1):125–147, 2000.

Enric Vallduví. The Informational Component. Garland, New York, 1992.

Bonnie L. Webber. Structure and ostension in the interpretation of discourse deixis. Natural Language and Cognitive Processes, 2(6):107–135, 1991.

Bonnie L. Webber, Matthew Stone, Aravind K. Joshi, and Alistair Knott. Anaphora and discourse structure. Computational Linguistics, 4(29):545–587, 2003.

Z.-M. Zhou, Y. Xu, Z.-Y. Niu, M. Lan, J. Su, and C. L. Tan. Predicting discourse connectives for implicit discourse relation recognition. In COLING 2010, pages 1507–1514, Beijing, China, August 2010.
5 0.40573496 52 acl-2012-Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation
Author: Ziheng Lin ; Chang Liu ; Hwee Tou Ng ; Min-Yen Kan
Abstract: An ideal summarization system should produce summaries that have high content coverage and linguistic quality. Many state-ofthe-art summarization systems focus on content coverage by extracting content-dense sentences from source articles. A current research focus is to process these sentences so that they read fluently as a whole. The current AESOP task encourages research on evaluating summaries on content, readability, and overall responsiveness. In this work, we adapt a machine translation metric to measure content coverage, apply an enhanced discourse coherence model to evaluate summary readability, and combine both in a trained regression model to evaluate overall responsiveness. The results show significantly improved performance over AESOP 2011 submitted metrics.
6 0.28810242 50 acl-2012-Collective Classification for Fine-grained Information Status
7 0.25277865 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation
8 0.23893529 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model
9 0.23585777 133 acl-2012-Learning to "Read Between the Lines" using Bayesian Logic Programs
10 0.23103751 12 acl-2012-A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relation Extraction
11 0.22713695 195 acl-2012-The Creation of a Corpus of English Metalanguage
12 0.22704086 129 acl-2012-Learning High-Level Planning from Text
13 0.22315319 6 acl-2012-A Comprehensive Gold Standard for the Enron Organizational Hierarchy
14 0.21671906 49 acl-2012-Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study
15 0.21573092 53 acl-2012-Combining Textual Entailment and Argumentation Theory for Supporting Online Debates Interactions
16 0.21294115 191 acl-2012-Temporally Anchored Relation Extraction
17 0.20362172 73 acl-2012-Discriminative Learning for Joint Template Filling
18 0.20327839 13 acl-2012-A Graphical Interface for MT Evaluation and Error Analysis
20 0.19084693 44 acl-2012-CSNIPER - Annotation-by-query for Non-canonical Constructions in Large Corpora
topicId topicWeight
[(25, 0.035), (26, 0.038), (28, 0.041), (30, 0.037), (37, 0.023), (39, 0.077), (74, 0.025), (82, 0.025), (84, 0.029), (85, 0.019), (90, 0.069), (91, 0.335), (92, 0.049), (94, 0.026), (99, 0.081)]
simIndex simValue paperId paperTitle
same-paper 1 0.84390533 157 acl-2012-PDTB-style Discourse Annotation of Chinese Text
Author: Yuping Zhou ; Nianwen Xue
Abstract: We describe a discourse annotation scheme for Chinese and report on the preliminary results. Our scheme, inspired by the Penn Discourse TreeBank (PDTB), adopts the lexically grounded approach; at the same time, it makes adaptations based on the linguistic and statistical characteristics of Chinese text. Annotation results show that these adaptations work well in practice. Our scheme, taken together with other PDTB-style schemes (e.g. for English, Turkish, Hindi, and Czech), affords a broader perspective on how the generalized lexically grounded approach can flesh itself out in the context of cross-linguistic annotation of discourse relations.
Author: Rebecca Dridan ; Stephan Oepen
Abstract: We examine some of the frequently disregarded subtleties of tokenization in Penn Treebank style, and present a new rule-based preprocessing toolkit that not only reproduces the Treebank tokenization with unmatched accuracy, but also maintains exact stand-off pointers to the original text and allows flexible configuration to diverse use cases (e.g. to genreor domain-specific idiosyncrasies). 1 Introduction—Motivation The task of tokenization is hardly counted among the grand challenges of NLP and is conventionally interpreted as breaking up “natural language text [...] into distinct meaningful units (or tokens)” (Kaplan, 2005). Practically speaking, however, tokenization is often combined with other string-level preprocessing—for example normalization of punctuation (of different conventions for dashes, say), disambiguation of quotation marks (into opening vs. closing quotes), or removal of unwanted mark-up— where the specifics of such pre-processing depend both on properties of the input text as well as on assumptions made in downstream processing. Applying some string-level normalizationprior to the identification of token boundaries can improve (or simplify) tokenization, and a sub-task like the disambiguation of quote marks would in fact be hard to perform after tokenization, seeing that it depends on adjacency to whitespace. In the following, we thus assume a generalized notion of tokenization, comprising all string-level processing up to and including the conversion of a sequence of characters (a string) to a sequence of token objects.1 1Obviously, some of the normalization we include in the tokenization task (in this generalized interpretation) could be left to downstream analysis, where a tagger or parser, for example, could be expected to accept non-disambiguated quote marks (so-called straight or typewriter quotes) and disambiguate as 378 Arguably, even in an overtly ‘separating’ language like English, there can be token-level ambiguities that ultimately can only be resolved through parsing (see § 3 for candidate examples), and indeed Waldron et al. (2006) entertain the idea of downstream processing on a token lattice. In this article, however, we accept the tokenization conventions and sequential nature of the Penn Treebank (PTB; Marcus et al., 1993) as a useful point of reference— primarily for interoperability of different NLP tools. Still, we argue, there is remaining work to be done on PTB-compliant tokenization (reviewed in§ 2), both methodologically, practically, and technologically. In § 3 we observe that state-of-the-art tools perform poorly on re-creating PTB tokenization, and move on in § 4 to develop a modular, parameterizable, and transparent framework for tokenization. Besides improvements in tokenization accuracy and adaptability to diverse use cases, in § 5 we further argue that each token object should unambiguously link back to an underlying element of the original input, which in the case of tokenization of text we realize through a notion of characterization. 2 Common Conventions Due to the popularity of the PTB, its tokenization has been a de-facto standard for two decades. Ap- proximately, this means splitting off punctuation into separate tokens, disambiguating straight quotes, and separating contractions such as can’t into ca and n ’t. There are, however, many special cases— part of syntactic analysis. 
However, on the (predominant) point of view that punctuation marks form tokens in their own right, the tokenizer would then have to adorn quote marks in some way, as to whether they were split off the left or right periphery of a larger token, to avoid unwanted syntactic ambiguity. Further, increasing use of Unicode makes texts containing ‘natively’ disambiguated quotes more common, where it would seem unfortunate to discard linguistically pertinent information by normalizing towards the poverty of pure ASCII punctuation. ProceedJienjgus, R ofep thueb 5lic0t hof A Knonrueaa,l M 8-e1e4ti Jnugly o f2 t0h1e2 A.s ?c so2c0ia1t2io Ans fsoorc Ciatoiomnp fuotart Cioonmaplu Ltiantgiounisatlic Lsi,n pgaugiestsi3c 7s8–382, documented and undocumented. In much tagging and parsing work, PTB data has been used with gold-standard tokens, to a point where many researchers are unaware of the existence of the original ‘raw’ (untokenized) text. Accordingly, the formal definition of PTB has received little attention, but reproducing PTB tokenization automatically actually is not a trivial task (see § 3). As the NLP community has moved to process data other than the PTB, some of the limitations of the tokenization2 PTB tokenization have been recognized, and many recently released data sets are accompanied by a note on tokenization along the lines of: Tokenization is similar to that used in PTB, except . . . Most exceptions are to do with hyphenation, or special forms of named entities such as chemical names or URLs. None of the documentation with extant data sets is sufficient to fully reproduce the tokenization.3 The CoNLL 2008 Shared Task data actually provided two forms of tokenization: that from the PTB (which many pre-processing tools would have been trained on), and another form that splits (most) hyphenated terms. This latter convention recently seems to be gaining ground in data sets like the Google 1T n-gram corpus (LDC #2006T13) and OntoNotes (Hovy et al., 2006). Clearly, as one moves towards a more application- and domaindriven idea of ‘correct’ tokenization, a more transparent, flexible, and adaptable approach to stringlevel pre-processing is called for. 3 A Contrastive Experiment To get an overview of current tokenization methods, we recovered and tokenized the raw text which was the source of the (Wall Street Journal portion of the) PTB, and compared it to the gold tokenization in the syntactic annotation in the We used three common methods of tokenization: (a) the original treebank.4 2See http : / /www . cis .upenn .edu/ ~t reebank/ t okeni z at ion .html for available ‘documentation’ and a sed script for PTB-style tokenization. 3Øvrelid et al. (2010) observe that tokenizing with the GENIA tagger yields mismatches in one of five sentences of the GENIA Treebank, although the GENIA guidelines refer to scripts that may be available on request (Tateisi & Tsujii, 2006). 4The original WSJ text was last included with the 1995 release of the PTB (LDC #95T07) and required alignment with the treebank, with some manual correction so that the same text is represented in both raw and parsed formats. 379 Tokenization Differing Levenshtein Method Sentences Distance tokenizer.sed 3264 11168 CoreNLP 1781 3717 C&J; parser 2597 4516 Table 1: Quantitative view on tokenization differences. PTB tokenizer.sed script; (b) the tokenizer from the Stanford CoreNLP tools5; and (c) tokenization from the parser of Charniak & Johnson (2005). 
Table 1 shows quantitative differences between each of the three methods and the PTB, both in terms of the number of sentences where the tokenization differs, and also in the total Levenshtein distance (Levenshtein, 1966) over tokens (for a total of 49,208 sentences and 1,173,750 gold-standard tokens). Looking at the differences qualitatively, the most consistent issue across all tokenization methods was ambiguity of sentence-final periods. In the treebank, final periods are always (with about 10 exceptions) a separate token. If the sentence ends in U.S. (but not other abbreviations, oddly), an extra period is hallucinated, so the abbreviation also has one. In contrast, C&J; add a period to all final abbreviations, CoreNLP groups the final period with a final abbreviation and hence lacks a sentence-final period token, and the sed script strips the period off U.S. The ‘correct’ choice in this case is not obvious and will depend on how the tokens are to be used. The majority of the discrepancies in the sed script tokenization come from an under-restricted punctuation rule that incorrectly splits on commas within numbers or ampersands within names. Other than that, the problematic cases are mostly shared across tokenization methods, and include issues with currencies, Irish names, hyphenization, and quote disambiguation. In addition, C&J; make some additional modifications to the text, lemmatising expressions such as won ’t as will and n ’t. 4 REPP: A Generalized Framework For tokenization to be studied as a first-class problem, and to enable customization and flexibility to diverse use cases, we suggest a non-procedural, rule-based framework dubbed REPP (Regular 5See corenlp / / nlp . st anford . edu / so ftware / run in ‘ st rict Treebank3 ’ mode. http : . shtml, Expression-Based Pre-Processing)—essentially a cascade of ordered finite-state string rewriting rules, though transcending the formal complexity of regular languages by inclusion of (a) full perl-compatible regular expressions and (b) fixpoint iteration over groups of rules. In this approach, a first phase of string-level substitutions inserts whitespace around, for example, punctuation marks; upon completion of string rewriting, token boundaries are stipulated between all whitespace-separated substrings (and only these). For a good balance of human and machine readability, REPP tokenization rules are specified in a simple, line-oriented textual form. Figure 1 shows a (simplified) excerpt from our PTB-style tokenizer, where the first character on each line is one of four REPP operators, as follows: (a) ‘#’ for group formation; (b) ‘>’ for group invocation, (c) ‘ ! ’ for substitution (allowing capture groups), and (d) ‘ : ’ for token boundary detection.6 In Figure 1, the two rules stripping off prefix and suffix punctuation marks adjacent to whitespace (i.e. matching the tab-separated left-hand side of the rule, to replace the match with its right-hand side) form a numbered group (‘# 1’), which will be iterated when called (‘> 1 until none ’) of the rules in the group fires (a fixpoint). In this example, conditioning on whitespace adjacency avoids the issues observed with the PTB sed script (e.g. token boundaries within comma-separated numbers) and also protects against infinite loops in the group.7 REPP rule sets can be organized as modules, typ6Strictly speaking, there are another two operators, for lineoriented comments and automated versioning of rule files. 
7For this example, the same effects seemingly could be obtained without iteration (using greatly more complex rules); our actual, non-simplified rules, however, further deal with punctuation marks that can function as prefixes or suffixes, as well as with corner cases like factor(s) or Ca[2+]. Also in mark-up removal and normalization, we have found it necessary to ‘parse’ nested structures by means of iterative groups. 380 ically each in a file of its own, and invoked selectively by name (e.g. ‘>wiki’ in Figure 1); to date, there exist modules for quote disambiguation, (relevant subsets of) various mark-up languages (HTML, LATEX, wiki, and XML), and a handful of robustness rules (e.g. seeking to identify and repair ‘sandwiched’ inter-token punctuation). Individual tokenizers are configured at run-time, by selectively activating a set of modules (through command-line op- tions). An open-source reference implementation of the REPP framework (in C++) is available, together with a library of modules for English. 5 Characterization for Traceability Tokenization, and specifically our notion of generalized tokenization which allows text normalization, involves changes to the original text being analyzed, rather than just additional annotation. As such, full traceability from the token objects to the original text is required, which we formalize as ‘characterization’, in terms of character position links back to the source.8 This has the practical benefit of allowing downstream analysis as direct (stand-off) annotation on the source text, as seen for example in the ACL Anthology Searchbench (Schäfer et al., 2011). With our general regular expression replacement rules in REPP, making precise what it means for a token to link back to its ‘underlying’ substring requires some care in the design and implementation. Definite characterization links between the string before (I) and after (O) the application of a single orurele ( can only bftee res (tOab)li tshheed a pinp lcicerattiaoinn positions, viz. (a) spans not matched by the rule: unchanged text in O outside the span matched by the left-hand tseixdet regex outfs tidhee truhele s can always d be b ylin thkeed le bfta-chka ntod I; and (b) spans caught by a regex capture group: capture groups represent bthye a same te caxtp tiunr eth ger oleufpt-: and right-hand sides of a substitution, and so can be linked back to O.9 Outside these text spans, we can only md bakace kd etofin Oit.e statements about characterization links at boundary points, which include the start and end of the full string, the start and end of the string 8If the tokenization process was only concerned with the identification of token boundaries, characterization would be near-trivial. 9If capture group references are used out-of-order, however, the per-group linkage is no longer well-defined, and we resort to the maximum-span ‘union’ of boundary points (see below). matched by the rule, and the start and end of any capture groups in the rule. Each character in the string being processed has a start and end position, marking the point before and after the character in the original string. Before processing, the end position would always be one greater than the start position. However, if a rule mapped a string-initial, PTB-style opening double quote (``) to one-character Unicode “, the new first character of the string would have start position 0, but end position 2. 
In contrast, if there were a rule !wo (n’ t ) will \1 (1) applied to the string I ’t go!, all characters in the won second token of the resulting string (I will n’t go!) will have start position 2 and end position 4. This demonstrates one of the formal consequences of our design: we have no reason to assign the characters ill any start position other than 2.10 Since explicit character links between each I O will only be estaband laicstheerd l iantk kms abtecthw or capture group boundaries, any tteabxtfrom the left-hand side of a rule that should appear in O must be explicitly linked through a capture group rOefe mreunstc eb (rather tihtlayn l merely hwroriuttgehn ao cuta ipntu utrhee righthand side of the rule). In other words, rule (1) above should be preferred to the following variant (which would result in character start and end offsets of 0 and 5 for both output tokens): ! won’ t will n’ t (2) During rule application, we keep track of character start and end positions as offsets between a string before and after each rule application (i.e. all pairs hI, Oi), and these offsets are eventually traced back thoI ,thOe original string fats etthse atireme ev oefn ftiunaalll yto tkraecneidzat biaocnk. 6 Quantitative and Qualitative Evaluation In our own work on preparing various (non-PTB) genres for parsing, we devised a set of REPP rules with the goal of following the PTB conventions. When repeating the experiment of § 3 above using REPP tokenization, we obtained an initial difference in 1505 sentences, with a Levenshtein dis10This subtlety will actually be invisible in the final token objects if will remains a single token, but if subsequent rules were to split this token further, all its output tokens would have a start position of 2 and an end position of 4. While this example may seem unlikely, we have come across similar scenarios in fine-tuning actual REPP rules. 381 tance of 3543 (broadly comparable to CoreNLP, if marginally more accurate). Examining these discrepancies, we revealed some deficiencies in our rules, as well as some peculiarities of the ‘raw’ Wall Street Journal text from the PTB distribution. A little more than 200 mismatches were owed to improper treatment of currency symbols (AU$) and decade abbreviations (’60s), which led to the refinement of two existing rules. Notable PTB idiosyncrasies (in the sense of deviations from common typography) include ellipses with spaces separating the periods and a fairly large number of possessives (’s) being separated from their preceding token. Other aspects of gold-standard PTB tokenization we consider unwarranted ‘damage’ to the input text, such as hallucinating an extra period after U . S . and splitting cannot (which adds spurious ambiguity). For use cases where the goal were strict compliance, for instance in pre-processing inputs for a PTB-derived parser, we added an optional REPP module (of currently half a dozen rules) to cater to these corner cases—in a spirit similar to the CoreNLP mode we used in § 3. With these extra rules, remaining tokenization discrepancies are contained in 603 sentences (just over 1%), which gives a Levenshtein distance of 1389. 7 Discussion—Conclusion Compared to the best-performing off-the-shelf system in our earlier experiment (where it is reasonable to assume that PTB data has played at least some role in development), our results eliminate two thirds of the remaining tokenization errors—a more substantial reduction than recent improvements in parsing accuracy against the PTB, for example. 
Of the remaining differences, cerned with mid-sentence at least half of those riod was separated treebank—a pattern Some differences over 350 are con- period ambiguity, are instances where where from an abbreviation a pein the we do not wish to emulate. in quote disambiguation also re- main, often triggered by whitespace on both sides of quote marks in the raw text. The final 200 or so dif- ferences stem from manual corrections made during treebanking, and we consider that these cases could not be replicated automatically in any generalizable fashion. References Waldron, B., Copestake, A., Schäfer, U., & Kiefer, Ch(ionap-frgbpnt.heias1Ikt7nA,p3asEP–rs.1,oi8&cn0ieag;)J.todiaAohni dgnsfmonAroa,fxCbMethon.ermt,(pd42Uui30sStcraAd5ti.m)oA.niCanloutaLivrlsneMgr-eutorieas-ftni kceg-s Isd5Bota.hurlyd(2.scIne0itsne0ra6Dn)ad.Et LiPorvneHapl-ruIoaCNcteio snofin(elrpsge.nacIn2ed6Pot3rno–kcLe2naei6dns8iagnt)ui.oasgGnoe sfntRaohne-, Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., & Weischedel, R. (2006). Ontonotes. The 90% solution. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (pp. 57–60). New York City, USA. Kaplan, R. M. (2005). A method for tokenizing text. Festschrift for Kimmo Koskenniemi on his 60th birthday. In A. Arppe, L. Carlson, K. Lindén, J. Piitulainen, M. Suominen, M. Vainio, H. Westerlund, & A. Yli-Jyrä (Eds.), Inquiries into words, constraints and contexts (pp. 55 64). Stanford, CA: CSLI Publications. – Levenshtein, V. (1966). Binary codes capable ofcor- recting deletions, insertions and reversals. Soviet Physice Doklady, 10, 707–710. – Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English. The Penn Treebank. Computational Linguistics, 19, 3 13 330. – Øvrelid, L., Velldal, E., & Oepen, S. (2010). Syntactic scope resolution in uncertainty analysis. In Proceedings of the 23rd international conference on computational linguistics (pp. 1379 1387). Beijing, China. – Schäfer, U., Kiefer, B., Spurk, C., Steffen, J., & Wang, R. (201 1). The ACL Anthology Searchbench. In Proceedings of the ACL-HLT 2011 system demonstrations (pp. 7–13). Portland, Oregon, USA. Tateisi, Y., & Tsujii, J. (2006). GENIA annotation guidelines for tokenization and POS tagging (Technical Report # TR-NLP-UT-2006-4). Tokyo, Japan: Tsujii Lab, University of Tokyo. 382
Author: Pradeep Dasigi ; Weiwei Guo ; Mona Diab
Abstract: We describe an unsupervised approach to the problem of automatically detecting subgroups of people holding similar opinions in a discussion thread. An intuitive way of identifying this is to detect the attitudes of discussants towards each other or named entities or topics mentioned in the discussion. Sentiment tags play an important role in this detection, but we also note another dimension to the detection of people’s attitudes in a discussion: if two persons share the same opinion, they tend to use similar language content. We consider the latter to be an implicit attitude. In this paper, we investigate the impact of implicit and explicit attitude in two genres of social media discussion data, more formal wikipedia discussions and a debate discussion forum that is much more informal. Experimental results strongly suggest that implicit attitude is an important complement for explicit attitudes (expressed via sentiment) and it can improve the sub-group detection performance independent of genre.
4 0.40458757 187 acl-2012-Subgroup Detection in Ideological Discussions
Author: Amjad Abu-Jbara ; Pradeep Dasigi ; Mona Diab ; Dragomir Radev
Abstract: The rapid and continuous growth of social networking sites has led to the emergence of many communities of communicating groups. Many of these groups discuss ideological and political topics. It is not uncommon that the participants in such discussions split into two or more subgroups. The members of each subgroup share the same opinion toward the discussion topic and are more likely to agree with members of the same subgroup and disagree with members from opposing subgroups. In this paper, we propose an unsupervised approach for automatically detecting discussant subgroups in online communities. We analyze the text exchanged between the participants of a discussion to identify the attitude they carry toward each other and towards the various aspects of the discussion topic. We use attitude predictions to construct an attitude vector for each discussant. We use clustering techniques to cluster these vectors and, hence, determine the subgroup membership of each participant. We compare our methods to text clustering and other baselines, and show that our method achieves promising results.
5 0.40167621 47 acl-2012-Chinese Comma Disambiguation for Discourse Analysis
Author: Yaqin Yang ; Nianwen Xue
Abstract: The Chinese comma signals the boundary of discourse units and also anchors discourse relations between adjacent text spans. In this work, we propose a discourse structureoriented classification of the comma that can be automatically extracted from the Chinese Treebank based on syntactic patterns. We then experimented with two supervised learning methods that automatically disambiguate the Chinese comma based on this classification. The first method integrates comma classification into parsing, and the second method adopts a “post-processing” approach that extracts features from automatic parses to train a classifier. The experimental results show that the second approach compares favorably against the first approach.
6 0.39607418 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base
7 0.39149538 132 acl-2012-Learning the Latent Semantics of a Concept from its Definition
8 0.39031643 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle
9 0.38671929 191 acl-2012-Temporally Anchored Relation Extraction
10 0.38669261 139 acl-2012-MIX Is Not a Tree-Adjoining Language
11 0.38368088 29 acl-2012-Assessing the Effect of Inconsistent Assessors on Summarization Evaluation
12 0.38236856 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning
13 0.38155013 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model
14 0.37849292 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars
15 0.37664098 40 acl-2012-Big Data versus the Crowd: Looking for Relationships in All the Right Places
16 0.37539822 62 acl-2012-Cross-Lingual Mixture Model for Sentiment Classification
17 0.3733044 28 acl-2012-Aspect Extraction through Semi-Supervised Modeling
18 0.37294033 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers
19 0.37077445 7 acl-2012-A Computational Approach to the Automation of Creative Naming
20 0.36922091 36 acl-2012-BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment