acl acl2012 acl2012-47 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yaqin Yang ; Nianwen Xue
Abstract: The Chinese comma signals the boundary of discourse units and also anchors discourse relations between adjacent text spans. In this work, we propose a discourse structureoriented classification of the comma that can be automatically extracted from the Chinese Treebank based on syntactic patterns. We then experimented with two supervised learning methods that automatically disambiguate the Chinese comma based on this classification. The first method integrates comma classification into parsing, and the second method adopts a “post-processing” approach that extracts features from automatic parses to train a classifier. The experimental results show that the second approach compares favorably against the first approach.
Reference: text
sentIndex sentText sentNum sentScore
1 yaqin@brandeis.edu Abstract The Chinese comma signals the boundary of discourse units and also anchors discourse relations between adjacent text spans. [sent-2, score-1.689]
2 In this work, we propose a discourse structureoriented classification of the comma that can be automatically extracted from the Chinese Treebank based on syntactic patterns. [sent-3, score-1.12]
3 We then experimented with two supervised learning methods that automatically disambiguate the Chinese comma based on this classification. [sent-4, score-0.594]
4 The first method integrates comma classification into parsing, and the second method adopts a “post-processing” approach that extracts features from automatic parses to train a classifier. [sent-5, score-0.635]
5 For example, Jin et al. (2004) and Li et al. (2005) view the disambiguation of the Chinese comma as a way of breaking up long Chinese sentences into shorter ones to facilitate parsing. [sent-9, score-0.627]
6 Although both studies show a positive impact of this approach, comma disambiguation is viewed merely as a convenient tool to help achieve a more important goal. [sent-11, score-0.627]
7 Xue and Yang (2011) point out that the very reason for the existence of these long Chinese sentences is that the Chinese comma is ambiguous and, in some contexts, it identifies the boundary of a sentence just as a period, a question mark, or an exclamation mark does. [sent-13, score-0.728]
8 The disambiguation of the comma is viewed as a necessary step in detecting sentence boundaries in Chinese, and it can benefit a whole range of downstream NLP applications such as syntactic parsing and Machine Translation. [sent-14, score-0.729]
9 In the present work, we expand this view and propose to look at the Chinese comma in the context of discourse analysis. [sent-16, score-1.028]
10 The Chinese comma is viewed as a delimiter of elementary discourse units (EDUs), in the sense of the Rhetorical Structure Theory (Carlson et al. [sent-17, score-1.142]
11 It is also considered to be the anchor of discourse relations, in the sense of the Penn Discourse Treebank (PDT) (Prasad et al. [sent-20, score-0.472]
12 Disambiguating the comma is thus necessary for the purpose of discourse segmentation, the identification of EDUs, a first step in building up the discourse structure of a Chinese text. [sent-22, score-1.462]
13 Developing a supervised or semi-supervised model of discourse segmentation would require ground truth annotated based on a well-established representation scheme, but as of right now no such annotation exists for Chinese to the best of our knowledge. [sent-23, score-0.467]
14 We describe a method of automatically deriving a preliminary form of discourse structure anchored by the Chinese comma from the Penn Chinese Treebank (CTB) (Xue et al. [sent-27, score-1.077]
15 This discourse information is formalized as a classification of the Chinese comma, with each class representing the boundary of an elementary discourse unit as well as the anchor of a coarse-grained discourse relation between the two discourse units that it delimits. [sent-29, score-2.042]
16 In the first method, we replace the part-of-speech (POS) tag of each comma in the CTB with a derived discourse category and retrain a state-of-the-art Chinese parser on the relabeled data. [sent-31, score-1.142]
17 We then evaluate how accurately the commas are classified in the parsing process. [sent-32, score-0.294]
18 In the second method, we parse these sentences and extract lexical and syntactic information as features to predict these new discourse categories. [sent-33, score-0.485]
19 In Section 2, we present our approach to automatically extracting discourse information from a syntactically annotated treebank and describe our classification scheme. [sent-36, score-0.533]
20 2 Chinese comma classification There are many ways to conceptualize the discourse structure of a text (Mann et al. [sent-41, score-1.069]
21 , 2008), but there is more of a consensus among researchers about the fundamental building blocks of the discourse structure. [sent-43, score-0.462]
22 Although they are phrased in different ways, syntactically these discourse units are generally realized as clauses or built on top of clauses. [sent-46, score-0.521]
23 So the first step in building the discourse structure of a text is to identify these discourse units. [sent-47, score-0.868]
24 In Chinese, these elementary discourse units are generally delimited by the comma, but not all commas mark the boundaries of a discourse unit. [sent-48, score-1.251]
25 In (1), for example, Comma [1] marks the boundary of a discourse unit while Comma [2] does not. [sent-49, score-0.578]
26 This is reflected in its English translation: while the first comma corresponds to an English comma, the second comma is not translated at all, as it marks the boundary between a subject and its predicate, where no comma is needed in English. [sent-50, score-2.046]
27 Disambiguating these two types of commas is thus an important first step in identifying elementary discourse units and building up the discourse structure of a text. [sent-51, score-1.251]
28 Although, to the best of our knowledge, no such discourse-segmented data for Chinese exists in the public domain, this information can be extracted from the syntactic annotation of the CTB. [sent-54, score-0.518]
29 In the syntactic annotation of the sentence, illustrated in (a), it is clear that while the first comma in the sentence marks the boundary of a clause, the second one marks the demarcation between the subject NP and the predicate VP and thus is not an indicator of a discourse boundary. [sent-55, score-1.539]
30 (a) [syntactic tree for example (1): comma [1] separates an IP clause, comma [2] separates the subject NP from the predicate VP] In addition to a binary distinction of whether a comma marks the boundary of a discourse unit, the CTB annotation also allows the extraction of a more elaborate classification of commas based on coordination and subordination relations of comma-separated clauses. [sent-56, score-1.657]
31 This classification of the Chinese comma can be viewed as a first approximation of the discourse relations anchored by the comma that can be refined later via a manual annotation process. [sent-57, score-1.786]
32 Based on the syntactic annotation in the CTB, we classify the Chinese comma into seven hierarchically organized categories, as illustrated in Figure 1. [sent-58, score-0.743]
33 The first distinction is made between commas that indicate a discourse boundary (RELATION) and those that do not (OTHER). [sent-59, score-0.784]
34 Commas that indicate discourse boundaries are further divided into commas that separate coordinated discourse units (COORD) vs commas that separate discourse units in a subordination relation (SUBORD). [sent-60, score-2.176]
35 Based on the levels of embedding and the syntactic category of the coordinated structures, we define three different types of coordination (SB, IP COORD and VP COORD). [sent-61, score-0.245]
36 We also define three types of subordination relations (ADJ, COMP, and SBJ), based on the syntactic structure. [sent-62, score-0.153]
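To keep the seven labels straight in what follows, the taxonomy just described can be written down as a small nested structure. This is only an illustrative sketch in Python; the label spellings are our shorthand for the categories named in the text, not necessarily the exact labels of Figure 1.

    # Sketch of the comma classification hierarchy described above.
    # Inner nodes (RELATION, COORD, SUBORD) only organize the taxonomy;
    # the seven leaf labels are the classes actually assigned to commas.
    COMMA_HIERARCHY = {
        "RELATION": {                                  # comma marks a discourse boundary
            "COORD": ["SB", "IP_COORD", "VP_COORD"],   # coordination relations
            "SUBORD": ["ADJ", "COMP", "SBJ"],          # subordination relations
        },
        "OTHER": {},                                   # comma is not a discourse boundary
    }

    COMMA_CLASSES = ["SB", "IP_COORD", "VP_COORD", "ADJ", "COMP", "SBJ", "OTHER"]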
37 We view this comma as a marker of the sentence boundary; it serves the same function as the unambiguous sentence boundary delimiters (periods, question marks, and exclamation marks) in Chinese. [sent-66, score-0.835]
38 The syntactic pattern that is used to infer this relation is illustrated in (b). [sent-67, score-0.118]
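To make pattern (b) concrete, the sketch below tests whether a comma that is an immediate child of the root IP is flanked by IP clauses. It uses NLTK's Tree class and our own simplifying assumptions about labels (commas tagged PU, clause labels beginning with IP); it is a rough illustration, not the authors' implementation.

    from nltk.tree import Tree

    def is_sentence_boundary_comma(root, comma_index):
        # Pattern (b): the comma is an immediate child of the root IP and
        # both of its immediate siblings are IP clauses.
        if not root.label().startswith("IP"):
            return False
        children = list(root)
        if comma_index <= 0 or comma_index >= len(children) - 1:
            return False
        left, right = children[comma_index - 1], children[comma_index + 1]
        return (isinstance(left, Tree) and left.label().startswith("IP") and
                isinstance(right, Tree) and right.label().startswith("IP"))

    # Toy example (not a real CTB tree): two clauses joined by a comma.
    toy = Tree.fromstring(
        "(IP (IP (NP (PN 他)) (VP (VV 来了))) (PU ，) (IP (NP (PN 我)) (VP (VV 走了))))")
    print(is_sentence_boundary_comma(toy, 1))  # True under this toy pattern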
39 (b) [tree pattern: a root IP immediately dominating IP , IP, each IP a clause] IP Coordination (IP COORD): Coordinated IPs that are not the immediate children of the root IP are also considered to be discourse units, and the commas linking them are labeled IP COORD. [sent-71, score-0.762]
40 Different from the sentence boundary cases, these coordinated IPs are often embedded in a larger structure. [sent-72, score-0.232]
41 (c) [tree pattern: an IP with a PP modifier, a comma, and coordinated IP conjuncts] VP Coordination (VP COORD): Coordinated VPs, when separated by the comma, are not semantically different from coordinated IPs. [sent-76, score-0.125]
42 The only difference is that coordinated VPs share a subject, while coordinated IPs tend to have different subjects. [sent-77, score-0.25]
43 Maintaining this distinction allows us to model subject (dis)continuity, which helps recover a subject when it is dropped, a prevalent phenomenon in Chinese. [sent-78, score-0.24]
44 As shown in (4), the VPs in the text spans separated by Comma [6] have the same subject; thus the subject in the second VP is dropped. [sent-79, score-0.168]
45 Adjunction (ADJ): This relation holds between a subordinate clause and its main clause. [sent-84, score-0.178]
46 The subordinate clause is normally introduced by a subordinating conjunction and it typically provides the cause, purpose, manner, or condition for the main clause. [sent-85, score-0.249]
47 In PDT terms, these subordinating conjunctions are discourse connectives that anchor a discourse relation between the subordinate clause and the main clause. [sent-86, score-1.204]
48 In Chinese, with few exceptions, the subordinate clause comes before the main clause. [sent-87, score-0.178]
49 (5) [Chinese example with interlinear gloss, omitted: a conditional clause about natural disasters within the scope of the insurance liability, Comma [7], then the main clause about compensation by the insurance company] [sent-89, score-0.217]
50 “If natural disasters within the scope of the insurance liability happen in the project, PICC Property Insurance Company will provide compensations according to the provisions. [sent-90, score-0.108]
51 The functional tags are attached to the subordinate clause and include CND (conditional), PRP (purpose or reason), MNR (manner), and ADV (other types of subordinate clauses that are adjuncts to the main clause). [sent-94, score-0.294]
52 Complementation (COMP): When a comma separates a verb governor and its complement clause, this verb and its subject generally describe the attribution of the complement clause. [sent-95, score-0.714]
53 Attribution is an important notion in discourse analysis in both the RST framework and in the PDT. [sent-96, score-0.434]
54 An example of this is given in (6), and the syntactic pattern used to extract this relation is illustrated in (f). [sent-97, score-0.118]
55 Sentential Subject (SBJ): This category is for commas that separate a sentential subject from its predicate VP. [sent-113, score-0.49]
56 An example is given in (7) and the syntactic pattern used to extract this relation is illustrated in (g). [sent-114, score-0.118]
57 (g) [tree pattern: a sentential-subject IP separated from its predicate VP by a comma] Others (OTHER): The remaining cases of the comma receive the OTHER label, indicating that they do not mark the boundary of a discourse segment. [sent-119, score-1.109]
58 Our proposed comma classification scheme serves the dual purpose of identifying elementary discourse units and at the same time detecting coarse-grained discourse relations anchored by the comma. [sent-120, score-1.707]
59 The discourse relations identified in this manner by no means constitute the full discourse analysis of a text; they are, however, a good first approximation. [sent-121, score-0.909]
60 The advantage of our approach is that we do not require manual discourse annotations, and all the information we need is automatically extracted from the syntactic annotation of the CTB and attached to instances of the comma in the corpus. [sent-122, score-1.112]
61 This makes it possible for us to train supervised models to automatically classify the commas in any Chinese text. [sent-123, score-0.299]
62 3 Two comma classification methods Given the gold-standard parses, and based on the syntactic patterns described in Section 2, we can map the POS tag of each comma instance in the CTB to one of the seven classes described there. [sent-124, score-1.28]
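As a rough sketch of this relabeling step, assuming the CTB convention that punctuation carries the PU part-of-speech tag, and a pattern-matching function (here called classify_comma, standing in for the full set of rules in Section 2), the tag of each comma token can be rewritten in place:

    from nltk.tree import Tree

    COMMA = "，"  # the Chinese comma targeted in this work

    def relabel_commas(tree, classify_comma):
        # Replace the POS tag of every comma in a gold CTB tree with the
        # discourse class returned by classify_comma(tree, position).
        for pos in tree.treepositions():
            node = tree[pos]
            if (isinstance(node, Tree) and node.label() == "PU"
                    and node.leaves() == [COMMA]):
                node.set_label(classify_comma(tree, pos))
        return tree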
63 Using this relabeled data as training data, we experimented with two automatic comma disambiguation methods. [sent-125, score-0.688]
64 In the first method, we simply retrained the Berkeley parser (Petrov and Klein, 2007) on the relabeled data and computed how accurately the commas are labeled in a held-out test set. [sent-126, score-0.33]
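Evaluating the first method then amounts to reading the comma labels off the gold and automatic parses of the same sentences and computing accuracy. A minimal sketch, assuming identical tokenization so that the commas in the two trees align one to one:

    def comma_label_accuracy(gold_trees, pred_trees, comma="，"):
        # Accuracy of comma labels, comparing gold and predicted parses
        # sentence by sentence.
        correct = total = 0
        for gold, pred in zip(gold_trees, pred_trees):
            gold_tags = [tag for word, tag in gold.pos() if word == comma]
            pred_tags = [tag for word, tag in pred.pos() if word == comma]
            for g, p in zip(gold_tags, pred_tags):
                total += 1
                correct += int(g == p)
        return correct / total if total else 0.0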
65 Given a comma, we define the preceding text span as ispan and the following text span as jspan. [sent-132, score-0.13]
66 Subject and Predicate features: We explored various combinations of the subject (sbj), predicate (pred) and object (obj) of the two spans. [sent-135, score-0.159]
67 The subject of ispan is represented as sbj_i, etc. [sent-136, score-0.166]
68 The lemma of pred_i, the lemma of pred_j, the conjunction of sbj_i and pred_j, and the conjunction of pred_i and sbj_j. [sent-140, score-0.356]
69 Whether the conjunction of sbj_i and pred_j occurs more than 2 times in the auxiliary corpus when jspan does not have a subject. [sent-141, score-0.341]
70 Whether the conjunction of obj_i and pred_j occurs more than 2 times in the auxiliary corpus when jspan does not have a subject. [sent-143, score-0.415]
71 Whether the conjunction of pred_i and sbj_j occurs more than 2 times in the auxiliary corpus when ispan does not have a subject. [sent-144, score-0.279]
72 Mutual Information features: Mutual information is intended to capture the association strength between the subject of a previous span and the predicate of the current span. [sent-145, score-0.201]
73 The conjunction of sbj_i and pred_j when jspan does not have a subject, if their MI value is greater than -8. [sent-148, score-0.406]
74 Whether obj_i and pred_j have an MI value greater than 5. [sent-151, score-0.169]
75 Whether the MI value of sbj_i and pred_j is greater than 0. [sent-154, score-0.215]
76 Whether the MI value of obj_i and pred_j is greater than 0. [sent-157, score-0.169]
77 Whether the MI value of pred_i and sbj_j is greater than 0. [sent-160, score-0.122]
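The paper does not spell out how the MI scores are estimated; the sketch below uses standard pointwise mutual information over subject-predicate co-occurrence counts from the auxiliary corpus as one plausible instantiation of the features above.

    import math
    from collections import Counter

    def build_counts(pairs):
        # pairs: iterable of (subject_lemma, predicate_lemma) tuples
        # harvested from the auxiliary corpus.
        pair_counts, sbj_counts, pred_counts = Counter(), Counter(), Counter()
        total = 0
        for sbj, pred in pairs:
            pair_counts[(sbj, pred)] += 1
            sbj_counts[sbj] += 1
            pred_counts[pred] += 1
            total += 1
        return pair_counts, sbj_counts, pred_counts, total

    def pmi(sbj, pred, pair_counts, sbj_counts, pred_counts, total):
        # log2( p(sbj, pred) / (p(sbj) * p(pred)) ); None if the pair is unseen.
        if not pair_counts[(sbj, pred)]:
            return None
        p_joint = pair_counts[(sbj, pred)] / total
        p_sbj = sbj_counts[sbj] / total
        p_pred = pred_counts[pred] / total
        return math.log2(p_joint / (p_sbj * p_pred))

    # A threshold feature such as "MI(sbj_i, pred_j) > 0" then becomes:
    # score = pmi(sbj_i, pred_j, *counts)
    # feature = score is not None and score > 0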
78 For example, the comma-separated spans are constituents in Tree (b) but not in Tree (d). [sent-164, score-0.642]
79 The conjunction of all constituent labels in both spans, if neither span forms a single constituent. [sent-169, score-0.113]
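One way to realize these constituent features is to compute, for each comma-separated span, the maximal constituents that tile it: a single covering label means the span is itself a constituent, and otherwise the concatenation of the labels gives the conjunction feature described above. This is our own simplified sketch, not the authors' feature extractor.

    from nltk.tree import Tree

    def subtree_spans(tree):
        # Map every subtree to the (label, start, end) token span it covers.
        spans = []
        def walk(node, start):
            if not isinstance(node, Tree):
                return start + 1
            end = start
            for child in node:
                end = walk(child, end)
            spans.append((node.label(), start, end))
            return end
        walk(tree, 0)
        return spans

    def constituent_labels(tree, start, end):
        # Labels of the maximal constituents tiling tokens [start, end);
        # a single label means the span is itself one constituent.
        spans = subtree_spans(tree)
        labels, pos = [], start
        while pos < end:
            candidates = [(e, lab) for lab, s, e in spans if s == pos and e <= end]
            if not candidates:
                labels.append("TOK")
                pos += 1
                continue
            e, lab = max(candidates)
            labels.append(lab)
            pos = e
        return labels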
80 2 Results As mentioned in Section 3, we experimented with two comma classification methods. [sent-180, score-0.635]
81 In the first method, we replace the part-of-speech (POS) tags of the commas with the seven classes defined in Section 2. [sent-181, score-0.269]
82 We then retrain the Berkeley parser (Petrov and Klein, 2007) using the training set as presented in Table 1, parse the test set, and evaluate the comma classification accuracy. [sent-182, score-0.659]
83 In the second method, we use the relabeled commas as the gold-standard data to train a supervised classifier to automatically classify the commas. [sent-183, score-0.36]
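A minimal sketch of this second method, using scikit-learn's logistic regression (a maximum-entropy classifier) as a stand-in for whichever learner the authors actually used; the feature dictionaries are assumed to hold the string- and boolean-valued features described in Section 3.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def train_comma_classifier(feature_dicts, labels):
        # feature_dicts: one dict per comma occurrence, e.g.
        #   {"sbj_i^pred_j": "公司^赔偿", "spans_are_constituents": True, ...}
        # labels: the corresponding comma classes (SB, IP_COORD, ..., OTHER).
        model = make_pipeline(DictVectorizer(sparse=True),
                              LogisticRegression(max_iter=1000))
        model.fit(feature_dicts, labels)
        return model

    # At test time, the same features are extracted from automatic parses:
    # predictions = model.predict(test_feature_dicts)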
84 1 Subject continuity One of the goals for this classification scheme is to model subject continuity, which answers the question of how accurately we can predict whether two comma-separated text spans have the same subject or different subjects. [sent-198, score-0.363]
85 When the two spans share the same subject, the comma belongs to the category VP COORD. [sent-199, score-0.671]
86 In other cases, for example when one of the spans does not even have a subject, the comma belongs to other categories. [sent-203, score-0.636]
87 If the subject of the text span after a comma is dropped as shown in (h), the parser often produces a VP coordination structure as shown in (i) and vice versa. [sent-224, score-0.796]
88 (h), (i) [tree diagrams contrasting an IP-coordination analysis with a VP-coordination analysis, omitted] 5 Related Work There is a large body of work on discourse analysis in the field of Natural Language Processing. [sent-226, score-0.434]
89 An unsupervised approach was proposed to recognize discourse relations in (Marcu and Echihabi, 2002), which extracts discourse relations that hold between arbitrary spans of text making use of cue phrases. [sent-228, score-0.998]
90 Like the present work, much research on discourse analysis is carried out at the sentence level. [sent-229, score-0.46]
91 , 2004) implement models to perform discourse parsing, while (Sporleder and Lapata, 2005) introduce discourse chunking as an alternative to full discourse parsing. [sent-233, score-0.868]
92 The emergence of linguistic corpora annotated with discourse structure such as the RST Discourse Treebank (Carlson et al. [sent-235, score-0.434]
93 Compared with English, much less work has been done in Chinese discourse analysis, presumably due to the lack of discourse resources in Chinese. [sent-240, score-0.868]
94 Similar to the present work, a closely related study trains a statistical classifier based on a Maximum Entropy model to recognize discourse relations. [sent-255, score-0.434]
95 Their work, however, is only concerned with discourse relations between adjacent sentences, thus side-stepping the hard problem of disambiguating the Chinese comma and analyzing intra-sentence discourse relations. [sent-256, score-1.536]
96 To the best of our knowledge, our work is the first attempt to disambiguate the Chinese comma as the first step in performing Chinese discourse analysis. [sent-257, score-0.594]
97 6 Conclusions and future work We proposed an approach to disambiguating the Chinese comma as a first step toward discourse analysis. [sent-258, score-1.028]
98 We presented two automatic comma disambiguation methods that perform comparably. [sent-260, score-0.627]
99 In the first method, comma disambiguation is integrated into the parsing process, while in the second method we train a supervised classifier to classify the Chinese comma, using features extracted from automatic parses. [sent-261, score-0.682]
100 Much remains to be done in this area, but we believe our work provides insight into the intricacy and complexity of discourse analysis in Chinese. [sent-262, score-0.434]
wordName wordTfidf (topN-words)
[('comma', 0.594), ('discourse', 0.434), ('commas', 0.269), ('ip', 0.215), ('coord', 0.147), ('coordinated', 0.125), ('predj', 0.123), ('subject', 0.12), ('chinese', 0.119), ('vp', 0.111), ('ctb', 0.105), ('sbji', 0.092), ('clause', 0.09), ('subordinate', 0.088), ('boundary', 0.081), ('insurance', 0.077), ('sb', 0.072), ('conjunction', 0.071), ('pdt', 0.067), ('marks', 0.063), ('predi', 0.061), ('relabeled', 0.061), ('sbjj', 0.061), ('subordination', 0.061), ('ips', 0.061), ('xue', 0.06), ('units', 0.059), ('elementary', 0.055), ('auxiliary', 0.055), ('conjunct', 0.053), ('syntactic', 0.051), ('prasad', 0.049), ('anchored', 0.049), ('vps', 0.049), ('spans', 0.048), ('ispan', 0.046), ('obji', 0.046), ('polanyi', 0.046), ('yaqin', 0.046), ('genres', 0.046), ('span', 0.042), ('relations', 0.041), ('classification', 0.041), ('coordination', 0.04), ('dollars', 0.04), ('edus', 0.04), ('rst', 0.04), ('soricut', 0.04), ('sporleder', 0.04), ('predicate', 0.039), ('adj', 0.038), ('anchor', 0.038), ('comp', 0.036), ('illustrated', 0.035), ('continuity', 0.034), ('miltsakaki', 0.034), ('disambiguation', 0.033), ('disambiguating', 0.033), ('sentential', 0.033), ('annotation', 0.033), ('rhetorical', 0.032), ('sbj', 0.032), ('company', 0.032), ('relation', 0.032), ('complementation', 0.031), ('guangdong', 0.031), ('idoes', 0.031), ('liability', 0.031), ('ninety', 0.031), ('renfa', 0.031), ('subord', 0.031), ('classify', 0.03), ('treebank', 0.03), ('bank', 0.03), ('marcu', 0.029), ('category', 0.029), ('mi', 0.028), ('functional', 0.028), ('blocks', 0.028), ('carlson', 0.028), ('mann', 0.028), ('syntactically', 0.028), ('hundred', 0.027), ('exclamation', 0.027), ('eleni', 0.027), ('rashmi', 0.027), ('revenue', 0.027), ('sur', 0.027), ('mutual', 0.027), ('sentence', 0.026), ('doesn', 0.026), ('china', 0.025), ('parsing', 0.025), ('nianwen', 0.025), ('penn', 0.025), ('yang', 0.025), ('retrain', 0.024), ('adjunction', 0.024), ('brandeis', 0.024), ('export', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999905 47 acl-2012-Chinese Comma Disambiguation for Discourse Analysis
Author: Yaqin Yang ; Nianwen Xue
Abstract: The Chinese comma signals the boundary of discourse units and also anchors discourse relations between adjacent text spans. In this work, we propose a discourse structureoriented classification of the comma that can be automatically extracted from the Chinese Treebank based on syntactic patterns. We then experimented with two supervised learning methods that automatically disambiguate the Chinese comma based on this classification. The first method integrates comma classification into parsing, and the second method adopts a “post-processing” approach that extracts features from automatic parses to train a classifier. The experimental results show that the second approach compares favorably against the first approach.
2 0.37119433 193 acl-2012-Text-level Discourse Parsing with Rich Linguistic Features
Author: Vanessa Wei Feng ; Graeme Hirst
Abstract: In this paper, we develop an RST-style textlevel discourse parser, based on the HILDA discourse parser (Hernault et al., 2010b). We significantly improve its tree-building step by incorporating our own rich linguistic features. We also analyze the difficulty of extending traditional sentence-level discourse parsing to text-level parsing by comparing discourseparsing performance under different discourse conditions.
3 0.31290931 157 acl-2012-PDTB-style Discourse Annotation of Chinese Text
Author: Yuping Zhou ; Nianwen Xue
Abstract: We describe a discourse annotation scheme for Chinese and report on the preliminary results. Our scheme, inspired by the Penn Discourse TreeBank (PDTB), adopts the lexically grounded approach; at the same time, it makes adaptations based on the linguistic and statistical characteristics of Chinese text. Annotation results show that these adaptations work well in practice. Our scheme, taken together with other PDTB-style schemes (e.g. for English, Turkish, Hindi, and Czech), affords a broader perspective on how the generalized lexically grounded approach can flesh itself out in the context of cross-linguistic annotation of discourse relations.
4 0.22559591 201 acl-2012-Towards the Unsupervised Acquisition of Discourse Relations
Author: Christian Chiarcos
Abstract: This paper describes a novel approach towards the empirical approximation of discourse relations between different utterances in texts. Following the idea that every pair of events comes with preferences regarding the range and frequency of discourse relations connecting both parts, the paper investigates whether these preferences are manifested in the distribution of relation words (that serve to signal these relations). Experiments on two large-scale English web corpora show that significant correlations between pairs of adjacent events and relation words exist, that they are reproducible on different data sets, and for three relation words, that their distribution corresponds to theory-based assumptions.
5 0.10940783 52 acl-2012-Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation
Author: Ziheng Lin ; Chang Liu ; Hwee Tou Ng ; Min-Yen Kan
Abstract: An ideal summarization system should produce summaries that have high content coverage and linguistic quality. Many state-ofthe-art summarization systems focus on content coverage by extracting content-dense sentences from source articles. A current research focus is to process these sentences so that they read fluently as a whole. The current AESOP task encourages research on evaluating summaries on content, readability, and overall responsiveness. In this work, we adapt a machine translation metric to measure content coverage, apply an enhanced discourse coherence model to evaluate summary readability, and combine both in a trained regression model to evaluate overall responsiveness. The results show significantly improved performance over AESOP 2011 submitted metrics.
6 0.071089134 168 acl-2012-Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations
7 0.065976799 109 acl-2012-Higher-order Constituent Parsing and Parser Combination
8 0.062339462 5 acl-2012-A Comparison of Chinese Parsers for Stanford Dependencies
9 0.060461231 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging
10 0.059058279 50 acl-2012-Collective Classification for Fine-grained Information Status
11 0.057220962 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?
12 0.053375863 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
13 0.048689544 87 acl-2012-Exploiting Multiple Treebanks for Parsing with Quasi-synchronous Grammars
14 0.0483035 13 acl-2012-A Graphical Interface for MT Evaluation and Error Analysis
15 0.047753099 128 acl-2012-Learning Better Rule Extraction with Translation Span Alignment
16 0.045336667 147 acl-2012-Modeling the Translation of Predicate-Argument Structure for SMT
17 0.044849984 134 acl-2012-Learning to Find Translations and Transliterations on the Web
18 0.043226853 30 acl-2012-Attacking Parsing Bottlenecks with Unlabeled Data and Relevant Factorizations
19 0.042872153 119 acl-2012-Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese
20 0.042559151 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures
topicId topicWeight
[(0, -0.155), (1, 0.05), (2, -0.13), (3, 0.046), (4, 0.036), (5, -0.089), (6, -0.173), (7, -0.151), (8, -0.249), (9, 0.439), (10, 0.024), (11, -0.047), (12, 0.019), (13, -0.005), (14, 0.126), (15, -0.048), (16, 0.02), (17, 0.044), (18, 0.019), (19, 0.0), (20, -0.103), (21, 0.037), (22, -0.125), (23, 0.04), (24, -0.017), (25, -0.046), (26, 0.024), (27, 0.012), (28, -0.034), (29, 0.009), (30, -0.062), (31, -0.009), (32, -0.035), (33, -0.006), (34, 0.034), (35, -0.015), (36, 0.046), (37, 0.052), (38, 0.008), (39, 0.01), (40, 0.042), (41, 0.029), (42, -0.014), (43, -0.003), (44, 0.001), (45, -0.054), (46, -0.05), (47, 0.003), (48, -0.072), (49, -0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.97438538 47 acl-2012-Chinese Comma Disambiguation for Discourse Analysis
2 0.9389888 193 acl-2012-Text-level Discourse Parsing with Rich Linguistic Features
3 0.9373306 157 acl-2012-PDTB-style Discourse Annotation of Chinese Text
4 0.7146492 201 acl-2012-Towards the Unsupervised Acquisition of Discourse Relations
Author: Christian Chiarcos
Abstract: This paper describes a novel approach towards the empirical approximation of discourse relations between different utterances in texts. Following the idea that every pair of events comes with preferences regarding the range and frequency of discourse relations connecting both parts, the paper investigates whether these preferences are manifested in the distribution of relation words (that serve to signal these relations). Experiments on two large-scale English web corpora show that significant correlations between pairs of adjacent events and relation words exist, that they are reproducible on different data sets, and for three relation words, that their distribution corresponds to theorybased assumptions. 1 Motivation Texts are not merely accumulations of isolated utterances, but the arrangement of utterances conveys meaning; human text understanding can thus be described as a process to recover the global structure of texts and the relations linking its different parts (Vallduv ı´ 1992; Gernsbacher et al. 2004). To capture these aspects of meaning in NLP, it is necessary to develop operationalizable theories, and, within a supervised approach, large amounts of annotated training data. To facilitate manual annotation, weakly supervised or unsupervised techniques can be applied as preprocessing step for semimanual annotation, and this is part of the motivation of the approach described here. 213 Discourse relations involve different aspects of meaning. This may include factual knowledge about the connected discourse segments (a ‘subjectmatter’ relation, e.g., if one utterance represents the cause for another, Mann and Thompson 1988, p.257), argumentative purposes (a ‘presentational’ relation, e.g., one utterance motivates the reader to accept a claim formulated in another utterance, ibid., p.257), or relations between entities mentioned in the connected discourse segments (anaphoric relations, Webber et al. 2003). Discourse relations can be indicated explicitly by optional cues, e.g., adverbials (e.g., however), conjunctions (e.g., but), or complex phrases (e.g., in contrast to what Peter said a minute ago). Here, these cues are referred to as relation words. Assuming that relation words are associated with specific discourse relations (Knott and Dale 1994; Prasad et al. 2008), the distribution of relation words found between two (types of) events can yield insights into the range of discourse relations possible at this occasion and their respective likeliness. For this purpose, this paper proposes a background knowledge base (BKB) that hosts pairs of events (here heuristically represented by verbs) along with distributional profiles for relation words. The primary data structure of the BKB is a triple where one event (type) is connected with a particular relation word to another event (type). Triples are further augmented with a frequency score (expressing the likelihood of the triple to be observed), a significance score (see below), and a correlation score (indicating whether a pair of events has a positive or negative correlation with a particular relation word). ProceedJienjgus, R ofep thueb 5lic0t hof A Knonrueaa,l M 8-e1e4ti Jnugly o f2 t0h1e2 A.s ?c so2c0ia1t2io Ans fsoorc Ciatoiomnp fuotart Cioonmaplu Ltiantgiounisatlic Lsi,n pgaugiestsi2c 1s3–217, Triples can be easily acquired from automatically parsed corpora. 
While the relation word is usually part of the utterance that represents the source of the relation, determining the appropriate target (antecedent) of the relation may be difficult to achieve. As a heuristic, an adjacency preference is adopted, i.e., the target is identified with the main event of the preceding utterance.1 The BKB can be constructed from a sufficiently large corpus as follows: • • identify event types and relation words for every utterance create a candidate triple consisting of the event type of the utterance, the relation word, and the event type of the preceding utterance. add the candidate triple to the BKB, if it found in the BKB, increase its score by (or initialize it with) 1, – – • perform a pruning on all candidate triples, calcpuerlaftoer significance aonnd a lclo crarneldaitdioante scores Pruning uses statistical significance tests to evaluate whether the relative frequency of a relation word for a pair of events is significantly higher or lower than the relative frequency of the relation word in the entire corpus. Assuming that incorrect candidate triples (i.e., where the factual target of the relation was non-adjacent) are equally distributed, they should be filtered out by the significance tests. The goal of this paper is to evaluate the validity of this approach. 2 Experimental Setup By generalizing over multiple occurrences of the same events (or, more precisely, event types), one can identify preferences of event pairs for one or several relation words. These preferences capture context-invariant characteristics of pairs of events and are thus to considered to reflect a semantic predisposition for a particular discourse relation. Formally, an event is the semantic representation of the meaning conveyed in the utterance. We 1Relations between non-adjacent utterances are constrained by the structure of discourse (Webber 1991), and thus less likely than relations between adjacent utterances. 214 assume that the same event can reoccur in different contexts, we are thus studying relations between types of events. For the experiment described here, events are heuristically identified with the main predicates of a sentence, i.e., non-auxiliar, noncausative, non-modal verbal lexemes that serve as heads of main clauses. The primary data structure of the approach described here is a triple consisting of a source event, a relation word and a target (antecedent) event. These triples are harvested from large syntactically annotated corpora. For intersentential relations, the target is identified with the event of the immediately preceding main clause. These extraction preferences are heuristic approximations, and thus, an additional pruning step is necessary. For this purpose, statistical significance tests are adopted (χ2 for triples of frequent events and relation words, t-test for rare events and/or relation words) that compare the relative frequency of a rela- tion word given a pair of events with the relative frequency of the relation word in the entire corpus. All results with p ≥ .05 are excluded, i.e., only triples are preserved pfo ≥r w .0h5ic ahr teh eex xocblsuedrevde,d i positive or negative correlation between a pair of events and a relation word is not due to chance with at least 95% probability. Assuming an even distribution of incorrect target events, this should rule these out. Additionally, it also serves as a means of evaluation. 
Using statistical significance tests as pruning criterion entails that all triples eventually confirmed are statistically significant.2 This setup requires immense amounts of data: We are dealing with several thousand events (theoretically, the total number of verbs of a language). The chance probability for two events to occur in adjacent position is thus far below 10−6, and it decreases further if the likelihood of a relation word is taken into consideration. All things being equal, we thus need millions of sentences to create the BKB. Here, two large-scale corpora of English are employed, PukWaC and Wackypedia EN (Baroni et al. 2009). PukWaC is a 2G-token web corpus of British English crawled from the uk domain (Ferraresi et al. 2Subsequent studies may employ less rigid pruning criteria. For the purpose of the current paper, however, the statistical significance of all extracted triples serves as an criterion to evaluate methodological validity. 2008), and parsed with MaltParser (Nivre et al. 2006). It is distributed in 5 parts; Only PukWaC1 to PukWaC-4 were considered here, constituting 82.2% (72.5M sentences) of the entire corpus, PukWaC-5 is left untouched for forthcoming evaluation experiments. Wackypedia EN is a 0.8G-token dump of the English Wikipedia, annotated with the same tools. It is distributed in 4 different files; the last portion was left untouched for forthcoming evaluation experiments. The portion analyzed here comprises 33.2M sentences, 75.9% of the corpus. The extraction of events in these corpora uses simple patterns that combine dependency information and part-of-speech tags to retrieve the main verbs and store their lemmata as event types. The target (antecedent) event was identified with the last main event of the preceding sentence. As relation words, only sentence-initial children of the source event that were annotated as adverbial modifiers, verb modifiers or conjunctions were considered. 3 Evaluation To evaluate the validity of the approach, three fundamental questions need to be addressed: significance (are there significant correlations between pairs of events and relation words ?), reproducibility (can these correlations confirmed on independent data sets ?), and interpretability (can these correlations be interpreted in terms of theoretically-defined discourse relations ?). 3.1 Significance and Reproducibility Significance tests are part of the pruning stage of the algorithm. Therefore, the number of triples eventually retrieved confirms the existence of statistically significant correlations between pairs of events and relation words. The left column of Tab. 1 shows the number of triples obtained from PukWaC subcorpora of different size. For reproducibility, compare the triples identified with Wackypedia EN and PukWaC subcorpora of different size: Table 1 shows the number of triples found in both Wackypedia EN and PukWaC, and the agreement between both resources. For two triples involving the same events (event types) and the same relation word, agreement means that the relation word shows either positive or negative correlation 215 TasPbe13u7l4n2k98t. We254Mn1a c:CeAs(gurb42)et760cr8m,iop3e61r4l28np0st6uwicho21rm9W,e2673mas048p7c3okenytpdoagi21p8r,o35eE0s29Nit36nvgreipol8796r50s9%.n3509egative correlation of event pairs and relation words between Wackypedia EN and PukWaC subcorpora of different size TBH: thb ouetwnev r17 t1,o27,t0a95P41 ul2kWv6aCs,8.0 Htr5iple1v s, 45.12T35av9sg7.reH7em nv6 ts62(. 
Table 1 confirms that results obtained on one resource can be reproduced on another. This indicates that triples indeed capture context-invariant, and hence semantic, characteristics of the relation between events. The data also indicates that reproducibility increases with the size of corpora from which a BKB is built. 3.2 Interpretability Any theory of discourse relations would predict that relation words with similar function should have similar distributions, whereas one would expect different distributions for functionally unrelated relation words. These expectations are tested here for three of the most frequent relation words found in the corpora, i.e., but, then and however. But and however can be grouped together under a generalized notion of contrast (Knott and Dale 1994; Prasad et al. 2008); then, on the other hand, indicates a temporal and/or causal relation. Table 2 confirms the expectation that event pairs that are correlated with but tend to show the same correlation with however, but not with then. 4 Discussion and Outlook This paper described a novel approach towards the unsupervised acquisition of discourse relations, with encouraging preliminary results: large collections of parsed text are used to assess distributional profiles of relation words that indicate discourse relations that are possible between specific types of events; on this basis, a background knowledge base (BKB) was created that can be used to predict an appropriate discourse marker to connect two utterances with no overt relation word. This information can be used, for example, to facilitate the semi-automated annotation of discourse relations, by pointing out the 'default' relation word for a given pair of events. Similarly, Zhou et al. (2010) used a language model to predict discourse markers for implicitly realized discourse relations. As opposed to this shallow, n-gram-based approach, here the internal structure of utterances is exploited: based on semantic considerations, syntactic patterns have been devised that extract triples of event pairs and relation words. The resulting BKB provides a distributional approximation of the discourse relations that can hold between two specific event types. Both approaches exploit complementary sources of knowledge, and may be combined with each other to achieve a more precise prediction of implicit discourse connectives. The validity of the approach was evaluated with respect to three evaluation criteria: the extracted associations between relation words and event pairs could be shown to be statistically significant, and to be reproducible on other corpora; for three highly frequent relation words, theoretical predictions about their relative distribution could be confirmed, indicating their interpretability in terms of presupposed taxonomies of discourse relations. Another prospective field of application can be seen in NLP applications, where selection preferences for relation words may serve as a cheap replacement for full-fledged discourse parsing.
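As an illustration of the 'default' connective lookup mentioned above, a BKB of the kind sketched earlier could be queried as follows; this is a hypothetical illustration of the idea, not the authors' implementation:

```python
def suggest_connective(bkb, source_event, target_event):
    """Return the relation word with the strongest positive correlation for the
    given pair of event types, or None if no significantly associated
    connective is stored for this pair."""
    candidates = {
        word: info["correlation"]
        for (src, word, tgt), info in bkb.items()
        if src == source_event and tgt == target_event and info["correlation"] > 0
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# e.g. suggest_connective(bkb, "win", "celebrate") might return "then"
```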
In the Natural Language Understanding domain, the BKB may help to disambiguate or to identify discourse relations between different events; in the context of Machine Translation, it may represent a factor guiding the insertion of relation words, a task that has been found to be problematic for languages that differ in their inventory and usage of discourse markers, e.g., German and English (Stede and Schmitz 2000). The approach is language-independent (except for the syntactic extraction patterns), and it does not require manually annotated data. It would thus be easy to create background knowledge bases with relation words for other languages or specific domains given a sufficient amount of textual data. Related research includes, for example, the unsupervised recognition of causal and temporal relationships, as required, for example, for the recognition of textual entailment. Riaz and Girju (2010) exploit distributional information about pairs of utterances. Unlike the approach described here, they are not restricted to adjacent utterances, and do not rely on explicit and recurrent relation words. Their approach can thus be applied to comparably small data sets. However, they are restricted to a specific type of relations, whereas here the entire bandwidth of discourse relations that are explicitly realized in a language is covered. Prospectively, both approaches could be combined to compensate their respective weaknesses. Similar observations can be made with respect to Chambers and Jurafsky (2009) and Kasch and Oates (2010), who also study a single discourse relation (narration), and are thus more limited in scope than the approach described here. However, as their approach extends beyond pairs of events to complex event chains, it seems that both approaches provide complementary types of information, and their results could also be combined in a fruitful way to achieve a more detailed assessment of discourse relations. The goal of this paper was to evaluate the methodological validity of the approach. It thus represents the basis for further experiments, e.g., with respect to the enrichment of the BKB with information provided by Riaz and Girju (2010), Chambers and Jurafsky (2009) and Kasch and Oates (2010). Other directions of subsequent research may include more elaborate models of events, and the investigation of the relationship between relation words and taxonomies of discourse relations. Acknowledgments This work was supported by a fellowship within the Postdoc program of the German Academic Exchange Service (DAAD). Initial experiments were conducted at the Collaborative Research Center (SFB) 632 "Information Structure" at the University of Potsdam, Germany. I would also like to thank three anonymous reviewers for valuable comments and feedback, as well as Manfred Stede and Ed Hovy, whose work on discourse relations on the one hand and proposition stores on the other hand has been the main inspiration for this paper. References M. Baroni, S. Bernardini, A. Ferraresi, and E. Zanchetta. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226, 2009. N. Chambers and D. Jurafsky. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 602–610. Association for Computational Linguistics, 2009. A.
Ferraresi, E. Zanchetta, M. Baroni, and S. Bernardini. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop (WAC-4): Can we beat Google?, pages 47–54, 2008. Morton Ann Gernsbacher, Rachel R. W. Robertson, Paola Palladino, and Necia K. Werner. Managing mental representations during narrative comprehension. Discourse Processes, 37(2):145–164, 2004. N. Kasch and T. Oates. Mining script-like structures from the web. In Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading, pages 34–42. Association for Computational Linguistics, 2010. A. Knott and R. Dale. Using linguistic phenomena to motivate a set of coherence relations. Discourse Processes, 18(1):35–62, 1994. J. van Kuppevelt and R. Smith, editors. Current Directions in Discourse and Dialogue. Kluwer, Dordrecht, 2003. William C. Mann and Sandra A. Thompson. Rhetorical Structure Theory: Toward a functional theory of text organization. Text, 8(3):243–281, 1988. J. Nivre, J. Hall, and J. Nilsson. MaltParser: A data-driven parser-generator for dependency parsing. In Proc. of LREC, pages 2216–2219. Citeseer, 2006. R. Prasad, N. Dinesh, A. Lee, E. Miltsakaki, L. Robaldo, A. Joshi, and B. Webber. The Penn Discourse Treebank 2.0. In Proc. 6th International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, 2008. M. Riaz and R. Girju. Another look at causality: Discovering scenario-specific contingency relationships with no supervision. In Semantic Computing (ICSC), 2010 IEEE Fourth International Conference on, pages 361–368. IEEE, 2010. M. Stede and B. Schmitz. Discourse particles and discourse functions. Machine Translation, 15(1):125–147, 2000. Enric Vallduví. The Informational Component. Garland, New York, 1992. Bonnie L. Webber. Structure and ostension in the interpretation of discourse deixis. Natural Language and Cognitive Processes, 2(6):107–135, 1991. Bonnie L. Webber, Matthew Stone, Aravind K. Joshi, and Alistair Knott. Anaphora and discourse structure. Computational Linguistics, 29(4):545–587, 2003. Z.-M. Zhou, Y. Xu, Z.-Y. Niu, M. Lan, J. Su, and C. L. Tan. Predicting discourse connectives for implicit discourse relation recognition. In COLING 2010, pages 1507–1514, Beijing, China, August 2010.
5 0.39378241 52 acl-2012-Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation
Author: Ziheng Lin ; Chang Liu ; Hwee Tou Ng ; Min-Yen Kan
Abstract: An ideal summarization system should produce summaries that have high content coverage and linguistic quality. Many state-of-the-art summarization systems focus on content coverage by extracting content-dense sentences from source articles. A current research focus is to process these sentences so that they read fluently as a whole. The current AESOP task encourages research on evaluating summaries on content, readability, and overall responsiveness. In this work, we adapt a machine translation metric to measure content coverage, apply an enhanced discourse coherence model to evaluate summary readability, and combine both in a trained regression model to evaluate overall responsiveness. The results show significantly improved performance over AESOP 2011 submitted metrics.
6 0.28409219 50 acl-2012-Collective Classification for Fine-grained Information Status
8 0.24408157 13 acl-2012-A Graphical Interface for MT Evaluation and Error Analysis
9 0.23154692 26 acl-2012-Applications of GPC Rules and Character Structures in Games for Learning Chinese Characters
10 0.23055859 210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese
11 0.21995009 87 acl-2012-Exploiting Multiple Treebanks for Parsing with Quasi-synchronous Grammars
12 0.21459581 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging
13 0.21276593 133 acl-2012-Learning to "Read Between the Lines" using Bayesian Logic Programs
14 0.2126807 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing
15 0.21223778 6 acl-2012-A Comprehensive Gold Standard for the Enron Organizational Hierarchy
16 0.20361201 190 acl-2012-Syntactic Stylometry for Deception Detection
17 0.20068161 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
18 0.19294366 94 acl-2012-Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
19 0.19250605 75 acl-2012-Discriminative Strategies to Integrate Multiword Expression Recognition and Parsing
20 0.19236793 109 acl-2012-Higher-order Constituent Parsing and Parser Combination
topicId topicWeight
[(6, 0.311), (25, 0.02), (26, 0.066), (28, 0.044), (30, 0.029), (37, 0.028), (39, 0.059), (49, 0.01), (51, 0.01), (74, 0.037), (84, 0.021), (85, 0.036), (90, 0.086), (91, 0.031), (92, 0.038), (94, 0.033), (99, 0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.71427494 47 acl-2012-Chinese Comma Disambiguation for Discourse Analysis
Author: Yaqin Yang ; Nianwen Xue
Abstract: The Chinese comma signals the boundary of discourse units and also anchors discourse relations between adjacent text spans. In this work, we propose a discourse structureoriented classification of the comma that can be automatically extracted from the Chinese Treebank based on syntactic patterns. We then experimented with two supervised learning methods that automatically disambiguate the Chinese comma based on this classification. The first method integrates comma classification into parsing, and the second method adopts a “post-processing” approach that extracts features from automatic parses to train a classifier. The experimental results show that the second approach compares favorably against the first approach.
2 0.64873815 202 acl-2012-Transforming Standard Arabic to Colloquial Arabic
Author: Emad Mohamed ; Behrang Mohit ; Kemal Oflazer
Abstract: We present a method for generating Colloquial Egyptian Arabic (CEA) from morphologically disambiguated Modern Standard Arabic (MSA). When used in POS tagging, this process improves the accuracy from 73.24% to 86.84% on unseen CEA text, and reduces the percentage of out-of-vocabulary words from 28.98% to 16.66%. The process holds promise for any NLP task targeting the dialectal varieties of Arabic; e.g., this approach may provide a cheap way to leverage MSA data and morphological resources to create resources for colloquial Arabic to English machine translation. It can also considerably speed up the annotation of Arabic dialects.
Author: Rebecca Dridan ; Stephan Oepen
Abstract: We examine some of the frequently disregarded subtleties of tokenization in Penn Treebank style, and present a new rule-based pre-processing toolkit that not only reproduces the Treebank tokenization with unmatched accuracy, but also maintains exact stand-off pointers to the original text and allows flexible configuration to diverse use cases (e.g. to genre- or domain-specific idiosyncrasies). 1 Introduction—Motivation The task of tokenization is hardly counted among the grand challenges of NLP and is conventionally interpreted as breaking up "natural language text [...] into distinct meaningful units (or tokens)" (Kaplan, 2005). Practically speaking, however, tokenization is often combined with other string-level preprocessing—for example normalization of punctuation (of different conventions for dashes, say), disambiguation of quotation marks (into opening vs. closing quotes), or removal of unwanted mark-up—where the specifics of such pre-processing depend both on properties of the input text as well as on assumptions made in downstream processing. Applying some string-level normalization prior to the identification of token boundaries can improve (or simplify) tokenization, and a sub-task like the disambiguation of quote marks would in fact be hard to perform after tokenization, seeing that it depends on adjacency to whitespace. In the following, we thus assume a generalized notion of tokenization, comprising all string-level processing up to and including the conversion of a sequence of characters (a string) to a sequence of token objects. (Obviously, some of the normalization we include in the tokenization task (in this generalized interpretation) could be left to downstream analysis, where a tagger or parser, for example, could be expected to accept non-disambiguated quote marks (so-called straight or typewriter quotes) and disambiguate as part of syntactic analysis. However, on the (predominant) point of view that punctuation marks form tokens in their own right, the tokenizer would then have to adorn quote marks in some way, as to whether they were split off the left or right periphery of a larger token, to avoid unwanted syntactic ambiguity. Further, increasing use of Unicode makes texts containing 'natively' disambiguated quotes more common, where it would seem unfortunate to discard linguistically pertinent information by normalizing towards the poverty of pure ASCII punctuation.) Arguably, even in an overtly 'separating' language like English, there can be token-level ambiguities that ultimately can only be resolved through parsing (see § 3 for candidate examples), and indeed Waldron et al. (2006) entertain the idea of downstream processing on a token lattice. In this article, however, we accept the tokenization conventions and sequential nature of the Penn Treebank (PTB; Marcus et al., 1993) as a useful point of reference—primarily for interoperability of different NLP tools. Still, we argue, there is remaining work to be done on PTB-compliant tokenization (reviewed in § 2), both methodologically, practically, and technologically. In § 3 we observe that state-of-the-art tools perform poorly on re-creating PTB tokenization, and move on in § 4 to develop a modular, parameterizable, and transparent framework for tokenization. Besides improvements in tokenization accuracy and adaptability to diverse use cases, in § 5 we further argue that each token object should unambiguously link back to an underlying element of the original input, which in the case of tokenization of text we realize through a notion of characterization. 2 Common Conventions Due to the popularity of the PTB, its tokenization has been a de-facto standard for two decades. Approximately, this means splitting off punctuation into separate tokens, disambiguating straight quotes, and separating contractions such as can't into ca and n't. There are, however, many special cases—documented and undocumented.
In much tagging and parsing work, PTB data has been used with gold-standard tokens, to a point where many researchers are unaware of the existence of the original 'raw' (untokenized) text. Accordingly, the formal definition of PTB tokenization has received little attention, but reproducing PTB tokenization automatically actually is not a trivial task (see § 3). (See http://www.cis.upenn.edu/~treebank/tokenization.html for available 'documentation' and a sed script for PTB-style tokenization.) As the NLP community has moved to process data other than the PTB, some of the limitations of the PTB tokenization have been recognized, and many recently released data sets are accompanied by a note on tokenization along the lines of: Tokenization is similar to that used in PTB, except . . . Most exceptions are to do with hyphenation, or special forms of named entities such as chemical names or URLs. None of the documentation with extant data sets is sufficient to fully reproduce the tokenization. (Øvrelid et al. (2010) observe that tokenizing with the GENIA tagger yields mismatches in one of five sentences of the GENIA Treebank, although the GENIA guidelines refer to scripts that may be available on request (Tateisi & Tsujii, 2006).) The CoNLL 2008 Shared Task data actually provided two forms of tokenization: that from the PTB (which many pre-processing tools would have been trained on), and another form that splits (most) hyphenated terms. This latter convention recently seems to be gaining ground in data sets like the Google 1T n-gram corpus (LDC #2006T13) and OntoNotes (Hovy et al., 2006). Clearly, as one moves towards a more application- and domain-driven idea of 'correct' tokenization, a more transparent, flexible, and adaptable approach to string-level pre-processing is called for. 3 A Contrastive Experiment To get an overview of current tokenization methods, we recovered and tokenized the raw text which was the source of the (Wall Street Journal portion of the) PTB, and compared it to the gold tokenization in the syntactic annotation in the treebank. (The original WSJ text was last included with the 1995 release of the PTB (LDC #95T07) and required alignment with the treebank, with some manual correction so that the same text is represented in both raw and parsed formats.) We used three common methods of tokenization: (a) the original PTB tokenizer.sed script; (b) the tokenizer from the Stanford CoreNLP tools (see http://nlp.stanford.edu/software/corenlp.shtml; run in 'strict Treebank3' mode); and (c) tokenization from the parser of Charniak & Johnson (2005). Table 1: Quantitative view on tokenization differences (method / differing sentences / Levenshtein distance): tokenizer.sed 3264 / 11168; CoreNLP 1781 / 3717; C&J parser 2597 / 4516.
Table 1 shows quantitative differences between each of the three methods and the PTB, both in terms of the number of sentences where the tokenization differs, and also in the total Levenshtein distance (Levenshtein, 1966) over tokens (for a total of 49,208 sentences and 1,173,750 gold-standard tokens). Looking at the differences qualitatively, the most consistent issue across all tokenization methods was ambiguity of sentence-final periods. In the treebank, final periods are always (with about 10 exceptions) a separate token. If the sentence ends in U.S. (but not other abbreviations, oddly), an extra period is hallucinated, so the abbreviation also has one. In contrast, C&J add a period to all final abbreviations, CoreNLP groups the final period with a final abbreviation and hence lacks a sentence-final period token, and the sed script strips the period off U.S. The 'correct' choice in this case is not obvious and will depend on how the tokens are to be used. The majority of the discrepancies in the sed script tokenization come from an under-restricted punctuation rule that incorrectly splits on commas within numbers or ampersands within names. Other than that, the problematic cases are mostly shared across tokenization methods, and include issues with currencies, Irish names, hyphenization, and quote disambiguation. In addition, C&J make some additional modifications to the text, lemmatising expressions such as won't as will and n't. 4 REPP: A Generalized Framework For tokenization to be studied as a first-class problem, and to enable customization and flexibility to diverse use cases, we suggest a non-procedural, rule-based framework dubbed REPP (Regular Expression-Based Pre-Processing)—essentially a cascade of ordered finite-state string rewriting rules, though transcending the formal complexity of regular languages by inclusion of (a) full perl-compatible regular expressions and (b) fixpoint iteration over groups of rules. In this approach, a first phase of string-level substitutions inserts whitespace around, for example, punctuation marks; upon completion of string rewriting, token boundaries are stipulated between all whitespace-separated substrings (and only these). For a good balance of human and machine readability, REPP tokenization rules are specified in a simple, line-oriented textual form. Figure 1 shows a (simplified) excerpt from our PTB-style tokenizer, where the first character on each line is one of four REPP operators, as follows: (a) '#' for group formation; (b) '>' for group invocation; (c) '!' for substitution (allowing capture groups); and (d) ':' for token boundary detection. (Strictly speaking, there are another two operators, for line-oriented comments and automated versioning of rule files.) In Figure 1, the two rules stripping off prefix and suffix punctuation marks adjacent to whitespace (i.e. matching the tab-separated left-hand side of the rule, to replace the match with its right-hand side) form a numbered group ('#1'), which will be iterated when called ('>1') until none of the rules in the group fires (a fixpoint). In this example, conditioning on whitespace adjacency avoids the issues observed with the PTB sed script (e.g. token boundaries within comma-separated numbers) and also protects against infinite loops in the group. (For this example, the same effects seemingly could be obtained without iteration, using greatly more complex rules; our actual, non-simplified rules, however, further deal with punctuation marks that can function as prefixes or suffixes, as well as with corner cases like factor(s) or Ca[2+]. Also in mark-up removal and normalization, we have found it necessary to 'parse' nested structures by means of iterative groups.)
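The distributed REPP implementation is written in C++; the following toy Python sketch only illustrates the core idea of a rule group iterated to a fixpoint, with tokens read off at whitespace afterwards. The rule set and its exact behaviour are illustrative assumptions, not the published REPP rules:

```python
import re

# Each rule is (pattern, replacement); the group is iterated until no rule fires.
PREFIX_SUFFIX_GROUP = [
    # split off prefix punctuation adjacent to whitespace (or string start)
    (re.compile(r'(\s|^)([\[({"\'])([^\s])'), r'\1\2 \3'),
    # split off suffix punctuation adjacent to whitespace (or string end)
    (re.compile(r'([^\s])([\])}",;:.!?\'])(\s|$)'), r'\1 \2\3'),
]

def apply_group(rules, text):
    """Apply a rule group repeatedly until a fixpoint is reached."""
    changed = True
    while changed:
        changed = False
        for pattern, repl in rules:
            new_text = pattern.sub(repl, text)
            if new_text != text:
                text, changed = new_text, True
    return text

def tokenize(text):
    text = apply_group(PREFIX_SUFFIX_GROUP, text)
    # token boundaries are stipulated at (and only at) whitespace
    return text.split()

# tokenize('He said: "wait."') -> ['He', 'said', ':', '"', 'wait', '.', '"']
```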
REPP rule sets can be organized as modules, typically each in a file of its own, and invoked selectively by name (e.g. '>wiki' in Figure 1); to date, there exist modules for quote disambiguation, (relevant subsets of) various mark-up languages (HTML, LaTeX, wiki, and XML), and a handful of robustness rules (e.g. seeking to identify and repair 'sandwiched' inter-token punctuation). Individual tokenizers are configured at run-time, by selectively activating a set of modules (through command-line options). An open-source reference implementation of the REPP framework (in C++) is available, together with a library of modules for English. 5 Characterization for Traceability Tokenization, and specifically our notion of generalized tokenization which allows text normalization, involves changes to the original text being analyzed, rather than just additional annotation. As such, full traceability from the token objects to the original text is required, which we formalize as 'characterization', in terms of character position links back to the source. (If the tokenization process was only concerned with the identification of token boundaries, characterization would be near-trivial.) This has the practical benefit of allowing downstream analysis as direct (stand-off) annotation on the source text, as seen for example in the ACL Anthology Searchbench (Schäfer et al., 2011). With our general regular expression replacement rules in REPP, making precise what it means for a token to link back to its 'underlying' substring requires some care in the design and implementation. Definite characterization links between the string before (I) and after (O) the application of a single rule can only be established in certain positions, viz. (a) spans not matched by the rule: unchanged text in O outside the span matched by the left-hand side regex of the rule can always be linked back to I; and (b) spans caught by a regex capture group: capture groups represent the same text in the left- and right-hand sides of a substitution, and so can be linked between I and O. (If capture group references are used out-of-order, however, the per-group linkage is no longer well-defined, and we resort to the maximum-span 'union' of boundary points; see below.) Outside these text spans, we can only make definite statements about characterization links at boundary points, which include the start and end of the full string, the start and end of the string matched by the rule, and the start and end of any capture groups in the rule. Each character in the string being processed has a start and end position, marking the point before and after the character in the original string. Before processing, the end position would always be one greater than the start position. However, if a rule mapped a string-initial, PTB-style opening double quote (``) to one-character Unicode “, the new first character of the string would have start position 0, but end position 2.
In contrast, if there were a rule (1) !wo(n't) → will \1 applied to the string I won't go!, all characters in the second token of the resulting string (I will n't go!) will have start position 2 and end position 4. This demonstrates one of the formal consequences of our design: we have no reason to assign the characters ill any start position other than 2. (This subtlety will actually be invisible in the final token objects if will remains a single token, but if subsequent rules were to split this token further, all its output tokens would have a start position of 2 and an end position of 4. While this example may seem unlikely, we have come across similar scenarios in fine-tuning actual REPP rules.) Since explicit character links between each I and O will only be established at match or capture group boundaries, any text from the left-hand side of a rule that should appear in O must be explicitly linked through a capture group reference (rather than merely written out in the right-hand side of the rule). In other words, rule (1) above should be preferred to the following variant, which would result in character start and end offsets of 0 and 5 for both output tokens: (2) !won't → will n't. During rule application, we keep track of character start and end positions as offsets between a string before and after each rule application (i.e. all pairs ⟨I, O⟩), and these offsets are eventually traced back to the original string at the time of final tokenization. 6 Quantitative and Qualitative Evaluation In our own work on preparing various (non-PTB) genres for parsing, we devised a set of REPP rules with the goal of following the PTB conventions. When repeating the experiment of § 3 above using REPP tokenization, we obtained an initial difference in 1505 sentences, with a Levenshtein distance of 3543 (broadly comparable to CoreNLP, if marginally more accurate). Examining these discrepancies, we revealed some deficiencies in our rules, as well as some peculiarities of the 'raw' Wall Street Journal text from the PTB distribution. A little more than 200 mismatches were owed to improper treatment of currency symbols (AU$) and decade abbreviations ('60s), which led to the refinement of two existing rules. Notable PTB idiosyncrasies (in the sense of deviations from common typography) include ellipses with spaces separating the periods and a fairly large number of possessives ('s) being separated from their preceding token. Other aspects of gold-standard PTB tokenization we consider unwarranted 'damage' to the input text, such as hallucinating an extra period after U.S. and splitting cannot (which adds spurious ambiguity). For use cases where the goal is strict compliance, for instance in pre-processing inputs for a PTB-derived parser, we added an optional REPP module (of currently half a dozen rules) to cater to these corner cases—in a spirit similar to the CoreNLP mode we used in § 3. With these extra rules, remaining tokenization discrepancies are contained in 603 sentences (just over 1%), which gives a Levenshtein distance of 1389. 7 Discussion—Conclusion Compared to the best-performing off-the-shelf system in our earlier experiment (where it is reasonable to assume that PTB data has played at least some role in development), our results eliminate two thirds of the remaining tokenization errors—a more substantial reduction than recent improvements in parsing accuracy against the PTB, for example.
Of the remaining differences, over 350 are concerned with mid-sentence period ambiguity; at least half of those are instances where a period was separated from an abbreviation in the treebank—a pattern we do not wish to emulate. Some differences in quote disambiguation also remain, often triggered by whitespace on both sides of quote marks in the raw text. The final 200 or so differences stem from manual corrections made during treebanking, and we consider that these cases could not be replicated automatically in any generalizable fashion. References Waldron, B., Copestake, A., Schäfer, U., & Kiefer, B. (2006). Preprocessing and tokenisation standards in DELPH-IN tools. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006). Genoa, Italy. Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., & Weischedel, R. (2006). OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (pp. 57–60). New York City, USA. Kaplan, R. M. (2005). A method for tokenizing text. Festschrift for Kimmo Koskenniemi on his 60th birthday. In A. Arppe, L. Carlson, K. Lindén, J. Piitulainen, M. Suominen, M. Vainio, H. Westerlund, & A. Yli-Jyrä (Eds.), Inquiries into words, constraints and contexts (pp. 55–64). Stanford, CA: CSLI Publications. Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10, 707–710. Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English. The Penn Treebank. Computational Linguistics, 19, 313–330. Øvrelid, L., Velldal, E., & Oepen, S. (2010). Syntactic scope resolution in uncertainty analysis. In Proceedings of the 23rd International Conference on Computational Linguistics (pp. 1379–1387). Beijing, China. Schäfer, U., Kiefer, B., Spurk, C., Steffen, J., & Wang, R. (2011). The ACL Anthology Searchbench. In Proceedings of the ACL-HLT 2011 system demonstrations (pp. 7–13). Portland, Oregon, USA. Tateisi, Y., & Tsujii, J. (2006). GENIA annotation guidelines for tokenization and POS tagging (Technical Report # TR-NLP-UT-2006-4). Tokyo, Japan: Tsujii Lab, University of Tokyo.
4 0.4172025 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base
Author: Gerard de Melo ; Gerhard Weikum
Abstract: We present UWN, a large multilingual lexical knowledge base that describes the meanings and relationships of words in over 200 languages. This paper explains how link prediction, information integration and taxonomy induction methods have been used to build UWN based on WordNet and extend it with millions of named entities from Wikipedia. We additionally introduce extensions to cover lexical relationships, frame-semantic knowledge, and language data. An online interface provides human access to the data, while a software API enables applications to look up over 16 million words and names.
5 0.41392049 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition
Author: Micha Elsner ; Sharon Goldwater ; Jacob Eisenstein
Abstract: During early language acquisition, infants must learn both a lexicon and a model of phonetics that explains how lexical items can vary in pronunciation—for instance "the" might be realized as [Di] or [D@] (e.g., intended /ju want w2n/ /want e kUki/ vs. surface [j@ w a?P w2n] [wan @ kUki]). Previous models of acquisition have generally tackled these problems in isolation, yet behavioral evidence suggests infants acquire lexical and phonetic knowledge simultaneously. We present a Bayesian model that clusters together phonetic variants of the same lexical item while learning both a language model over lexical items and a log-linear model of pronunciation variability based on articulatory features. The model is trained on transcribed surface pronunciations, and learns by bootstrapping, without access to the true lexicon. We test the model using a corpus of child-directed speech with realistic phonetic variation and either gold standard or automatically induced word boundaries. In both cases modeling variability improves the accuracy of the learned lexicon over a system that assumes each lexical item has a unique pronunciation.
7 0.40814623 157 acl-2012-PDTB-style Discourse Annotation of Chinese Text
8 0.40734684 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle
9 0.4069975 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures
10 0.40530962 130 acl-2012-Learning Syntactic Verb Frames using Graphical Models
11 0.40507537 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
12 0.40484446 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities
13 0.40476906 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents
14 0.40421489 29 acl-2012-Assessing the Effect of Inconsistent Assessors on Summarization Evaluation
15 0.40412676 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
16 0.40334794 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool
17 0.40312493 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation
18 0.40258354 28 acl-2012-Aspect Extraction through Semi-Supervised Modeling
19 0.40201187 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information
20 0.40158293 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers