acl acl2011 acl2011-311 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Moshe Koppel ; Noam Ordan
Abstract: While it has often been observed that the product of translation is somehow different from non-translated text, scholars have emphasized two distinct bases for such differences. Some have noted interference from the source language spilling over into translation in a source-language-specific way, while others have noted general effects of the process of translation that are independent of source language. Using a series of text categorization experiments, we show that both these effects exist and that, moreover, there is a continuum between them. There are many effects of translation that are consistent among texts translated from a given source language, some of which are consistent even among texts translated from families of source languages. Significantly, we find that even for widely unrelated source languages and multiple genres, differences between translated texts and non-translated texts are sufficient for a learned classifier to accurately determine if a given text is translated or original.
Reference: text
sentIndex sentText sentNum sentScore
1 Some have noted interference from the source language spilling over into translation in a source-language-specific way, while others have noted general effects of the process of translation that are independent of source language. [sent-3, score-0.663]
2 There are many effects of translation that are consistent among texts translated from a given source language, some of which are consistent even among texts translated from families of source languages. [sent-5, score-1.078]
3 Significantly, we find that even for widely unrelated source languages and multiple genres, differences between translated texts and non-translated texts are sufficient for a learned classifier to accurately determine if a given text is translated or original. [sent-6, score-1.153]
4 of interference, the process by which a specific source language leaves distinct marks or fingerprints in the target language, so that translations from different source languages into the same target language may be regarded as distinct dialects of translationese. [sent-13, score-0.649]
5 Furthermore, we will show that the degree of difference between translations from two source languages reflects the degree of difference between the source languages themselves. [sent-17, score-0.722]
6 Translations from cognate languages differ from non-translated texts in similar ways, while translations from unrelated languages differ from non-translated texts in distinct ways. [sent-18, score-0.82]
7 In the following section, we show that translations from different source languages can be distinguished from each other and that closely related source languages manifest similar forms of interference. [sent-21, score-0.722]
8 In section 3, we show that, in a corpus involving five European languages, we can distinguish translationese from non-translated text and we consider some salient markers of translationese. [sent-22, score-0.913]
9 In section 4, we consider the extent to which markers of translationese cross over into non-European languages as well as into different genres. [sent-25, score-0.989]
10 2 Interference Effects in Translationese In this section, we perform several text categorization experiments designed to show the extent to which interference affects (both positively and negatively) our ability to classify documents. [sent-27, score-0.251]
11 The full corpus consists of texts translated into English from 11 different languages (and vice versa), as well as texts originally produced in English. [sent-30, score-0.622]
12 For our purposes, it will be sufficient to use translations from five languages (Finnish, French, German, Italian and Spanish), as well as original English. [sent-31, score-0.373]
13 We note that this corpus constitutes a comparable corpus (Laviosa, 1997), since it contains (1) texts written originally in a certain language (English), as well as (2) texts translated into that same language, matched for genre, domain, publication timeframe, etc. [sent-32, score-0.487]
14 Each of the five translated components is a text file containing just under 500,000 words; the original English component is a file of the same size as the aggregate of the other five. [sent-33, score-0.332]
15 The five source languages we use were selected by first eliminating several source languages for which the available text was limited and then choosing, from among the remaining languages, those of varying degrees of pairwise similarity. [sent-34, score-0.647]
16 Thus, we select three cognate (Romance) languages (French, Italian and Spanish), a fourth less related language (German), and a fifth even further removed (Finnish). [sent-35, score-0.27]
17 As will become clear, the motivation is to see whether the distance between the languages impacts the distinctiveness of the translation product. [sent-36, score-0.21]
18 We divide each of the translated corpora into 250 equal chunks, paying no attention to natural units within the corpus. [sent-37, score-0.232]
19 We set aside 50 chunks from each of the translated corpora and 250 chunks from the original English 1319 corpus for development purposes (as will be explained below). [sent-39, score-0.43]
20 The experiments described below use the remaining 1000 translated chunks and 1000 original English chunks. [sent-40, score-0.311]
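To make the chunking scheme concrete, here is a minimal Python sketch; the file names, the equal-size splitting, and the 50-chunk development hold-out are reconstructed from the description above and are illustrative rather than the authors' actual code.

    # Sketch of the chunking scheme described above (hypothetical file names).
    def make_chunks(path, n_chunks=250):
        tokens = open(path, encoding="utf-8").read().split()
        size = len(tokens) // n_chunks  # ignores natural units within the corpus
        return [" ".join(tokens[i * size:(i + 1) * size]) for i in range(n_chunks)]

    langs = ["finnish", "french", "german", "italian", "spanish"]
    corpora = {lang: make_chunks("europarl_%s.txt" % lang) for lang in langs}
    dev = {lang: chunks[:50] for lang, chunks in corpora.items()}         # held out
    train_test = {lang: chunks[50:] for lang, chunks in corpora.items()}  # 200 each

    # The original English component is as large as the other five combined,
    # so it yields 1250 chunks; 250 are held out for development.
    english = make_chunks("europarl_original_english.txt", n_chunks=1250)
    english_dev, english = english[:250], english[250:]                   # 1000 remain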
21 2 Identifying source language Our objective in this section is to measure the extent to which translations are affected by source language. [sent-42, score-0.388]
22 Our first experiment will be to use text categorization methods to learn a classifier that categorizes translations according to source language. [sent-43, score-0.408]
23 High accuracy would reflect that there are exploitable differences among translations of otherwise comparable texts that differ only in terms of source language. [sent-45, score-0.477]
24 We use the 200 chunks from each translated corpus, as described above. [sent-47, score-0.261]
25 The restriction to function words is crucial; we wish to rely only on stylistic differences rather than content differences that might be artifacts of the corpus. [sent-49, score-0.2]
26 We use Bayesian logistic regression (Madigan, 2005) as our learning method in order to learn a classifier that classifies a given text into one of five classes representing the different source languages. [sent-50, score-0.303]
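A sketch of this five-way experiment follows, using scikit-learn's logistic regression as a stand-in for the Bayesian logistic regression of Madigan (2005); the short FUNCTION_WORDS list is a placeholder for the full function-word feature set, and the data structures come from the chunking sketch above.

    # Sketch: 5-way source-language classification on function-word counts.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "me"]  # stub

    texts, labels = [], []
    for lang, chunks in train_test.items():
        texts.extend(chunks)
        labels.extend([lang] * len(chunks))

    vec = CountVectorizer(vocabulary=FUNCTION_WORDS)  # function words only
    X = vec.fit_transform(texts)
    clf = LogisticRegression(max_iter=1000)           # stand-in for Bayesian LR
    print(cross_val_score(clf, X, labels, cv=10).mean())  # cross-validated accuracy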
27 As can be seen, there are more mistakes among the three cognate languages than between those three languages and German, and still fewer mistakes involving the more distant Finnish language. [sent-55, score-0.437]
28 Table 1: Confusion matrix for the cross-validation experiment to determine source language of texts translated into English. This result strengthens that of van Halteren (2008) in a similar experiment. [sent-56, score-0.555]
29 4% for a six-way decision (including the original, which has no source language). [sent-61, score-0.18]
30 Significantly, though, van Halteren's feature set included content words and he notes that many of the most salient differences reflected differences in thematic emphasis. [sent-62, score-0.218]
31 In Table 2, we show the two words most overrepresented and the two words most underrepresented in translations from each source language (ranked according to an unpaired T-test). [sent-64, score-0.428]
32 For each of these, the difference between frequency of use in the indicated language and frequency of use in the other languages in aggregate is significant at p<0. [sent-65, score-0.193]
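The over/under-representation ranking can be approximated with an unpaired t-test over per-chunk relative frequencies, e.g. via SciPy; this is a sketch of the statistic, not necessarily the authors' exact procedure.

    # Sketch: t-statistic for a word in one source language vs. all others.
    import numpy as np
    from scipy.stats import ttest_ind

    def rel_freq(chunk, word):
        tokens = chunk.split()
        return tokens.count(word) / len(tokens)

    def t_score(word, lang):
        in_lang = np.array([rel_freq(c, word) for c in train_test[lang]])
        others = np.array([rel_freq(c, word)
                           for l, chunks in train_test.items() if l != lang
                           for c in chunks])
        t, p = ttest_ind(in_lang, others, equal_var=False)
        return t, p  # large positive t: overrepresented; large negative: underrepresented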
33 This suggests the possibility that interference effects in cognate languages such as French, Italian and Spanish might be similar. [sent-70, score-0.506]
34 For German, both underrepresented items appear as overrepresented in the Romance languages, and, conversely, underrepresented items in the Romance languages appear as overrepresented items for German. [sent-73, score-0.507]
35 This may cast doubt on the idea that all translations share universal properties; at best we may claim that particular properties are shared by closely related languages but not by others. [sent-74, score-0.381]
36 In the experiments presented in the next subsection, we'll find that translationese is gradable: closely related languages share more features, yet even further removed languages share enough properties to hold the general translationese hypothesis as valid. [sent-75, score-1.825]
37 3 Identifying translationese per source language We now wish to measure in a subtler manner the extent to which interference affects translation. [sent-77, score-1.045]
38 The catch is that all our training texts for the class T will be translations from some fixed source language, while all our test documents in T will be translations from a different source language. [sent-79, score-0.626]
39 The answer to this question will tell us a great deal about how much of translationese is general and how much of it is language dependent. [sent-81, score-0.733]
40 If accuracy is close to 100%, translationese is purely general (Baker, 1993). [sent-82, score-0.768]
41 Note that, whereas in our first experiment above pair-specific interference facilitated good classification, in this experiment pair-specific interference is an impediment to good classification. [sent-86, score-0.39]
42 We create, for example, a “French” corpus consisting of the 200 chunks of text translated from French and 200 original English texts. [sent-88, score-0.368]
43 We similarly create a corpus for each of the other source languages, taking care that each of the 1000 original English texts appears in exactly one of the corpora. [sent-89, score-0.322]
44 We then apply this learned classifier to the texts in, for example, the equivalent “Italian” corpus to see if we can classify them as translated or original. [sent-92, score-0.428]
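A sketch of this train-on-one-language, test-on-another protocol, reusing the hypothetical structures above; each language's T-vs-O corpus pairs its 200 translated chunks with a disjoint slice of the 1000 original English chunks.

    # Sketch: cross-language evaluation of T (translated) vs. O (original).
    from sklearn.linear_model import LogisticRegression

    def make_TO(lang, eng_slice):
        X = vec.transform(train_test[lang] + eng_slice)
        y = ["T"] * len(train_test[lang]) + ["O"] * len(eng_slice)
        return X, y

    eng_parts = {l: english[i * 200:(i + 1) * 200] for i, l in enumerate(langs)}
    for train_lang in langs:
        clf = LogisticRegression(max_iter=1000)
        clf.fit(*make_TO(train_lang, eng_parts[train_lang]))
        for test_lang in langs:
            if test_lang != train_lang:
                X_test, y_test = make_TO(test_lang, eng_parts[test_lang])
                print(train_lang, "->", test_lang, clf.score(X_test, y_test))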
45 For any given source language, it is quite easy to distinguish translations from original English. [sent-98, score-0.343]
46 This clearly indicates that interference effects from one source language might be misleading when used to identify translations from a different language. [sent-103, score-0.518]
47 In the Finnish corpus, me is an indicator of original English (constituting 0.0003 of tokens in texts translated from Finnish as opposed to 0.0015 of tokens in original English texts), but in the German corpus, me is an indicator of translated text (constituting 0. [sent-105, score-0.313] [sent-106, score-0.278]
49 Finally, we note that even in the case of training or testing on Finnish, results are considerably better than random, suggesting that despite the confounding effects of interference, some general properties of translationese are being picked up in each case. [sent-116, score-0.883]
50 Accuracy of classifiers trained using one source language and tested using another source language. We now consider source-language-independent translation. [sent-124, score-0.302]
51 1 Identifying translationese In order to identify general effects on translation, we now consider the same two-class classification problem as above, distinguishing T from O, except that now the translated texts in both our train and test data will be drawn from multiple source languages. [sent-126, score-1.306]
52 If we succeed at this task, it must be because of features of translationese that cross source-languages. [sent-127, score-0.775]
53 We use as our translated corpus the 1000 translated chunks (200 from each of five source languages) and as our original English corpus all 1000 original English chunks. [sent-129, score-0.754]
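A sketch of the pooled experiment, again with scikit-learn standing in for Bayesian logistic regression and reusing the structures above:

    # Sketch: pooled T-vs-O classification across all five source languages.
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    pooled_texts = sum(train_test.values(), []) + english   # 1000 T + 1000 O
    pooled_labels = ["T"] * 1000 + ["O"] * 1000
    X = vec.transform(pooled_texts)
    clf = LogisticRegression(max_iter=1000)
    print(cross_val_score(clf, X, pooled_labels, cv=10).mean())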
54 However, in each case the translations were from a single source language. [sent-137, score-0.258]
55 (Van Halteren considered multiple source languages, but each learned classifier used only one of them.) Thus, those results do not prove that translationese has distinctive source-language-independent features. [sent-138, score-0.213] [sent-139, score-0.704]
57 To our knowledge, the only earlier work that used a learned classifier to identify translations in which both test and train sets involved multiple source languages is Baroni and Bernardini (2006), in which the target language was Italian and the source languages were known to be varied. [sent-140, score-0.862]
58 The actual distribution of source languages was, however, not known to the researchers. [sent-141, score-0.297]
59 We note that the majority of such cohesive markers are significantly more frequent in translations. [sent-167, score-0.19]
60 In particular, they found that type-to-token ratio (lexical variety/richness), mean sentence length and proportion of grammatical words (lexical density/readability) are all smaller in translated texts. [sent-179, score-0.203]
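These three indicators are straightforward to compute; the sketch below uses naive period-based sentence splitting and the FUNCTION_WORDS stub from earlier as an approximation of the grammatical-word list, so it illustrates the features rather than reproducing the cited study's exact definitions.

    # Sketch of the three simplification indicators described above.
    def simplification_features(text):
        tokens = text.lower().split()
        sentences = [s for s in text.split(".") if s.strip()]          # naive split
        ttr = len(set(tokens)) / len(tokens)                           # lexical variety
        msl = len(tokens) / max(len(sentences), 1)                     # mean sentence length
        gram = sum(t in FUNCTION_WORDS for t in tokens) / len(tokens)  # lexical density
        return ttr, msl, gram  # all three reported smaller in translated texts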
61 (2009), who considered lexical features, found cultural differences, like over-representation of ladies and gentlemen in translated speeches. [sent-181, score-0.203]
62 3 that when we trained in one language and tested in another, classification succeeded to the extent that the source languages used in training and testing, respectively, are related to each other. [sent-185, score-0.326]
63 In effect, general differences between translationese and original English were partially overwhelmed by language-specific differences that held for the training language but not the test language. [sent-186, score-0.931]
64 We thus now revisit that earlier experiment, but restrict ourselves to features that distinguish translationese from original English generally. [sent-187, score-0.822]
65 We use Bayesian logistic regression to learn a classifier to distinguish between translationese and original English. [sent-190, score-0.909]
66 We now find that even in the difficult case where we train on Finnish and test on another language (or vice versa), we succeed at distinguishing translationese from original English with accuracy above 80%. [sent-193, score-0.889]
67 4 Other Genres and Language Families We have found both general and language-specific differences between translationese and original English in one large corpus. [sent-197, score-0.857]
68 It might be wondered whether the phenomena we have found hold in other genres and for a completely different set of source languages. [sent-198, score-0.192]
69 1 The IHT corpus Our second corpus includes three translated corpora, each of which is an on-line local supplement to the International Herald Tribune (IHT): Kathimerini (translated from Greek), Ha'aretz (translated from Hebrew), and the JoongAng Daily (translated from Korean). [sent-201, score-0.267]
70 (Unlike for our Europarl corpus, the amount of English text available is not equal to the aggregate of the translated corpora, but rather equal to each of the individual corpora.) [sent-205, score-0.254]
71 Furthermore, the source languages (Hebrew, Greek and Korean) in the IHT corpus are more disparate than those in the Europarl corpus. [sent-207, score-0.381]
72 Our first hypothesis is that a classifier for identifying translationese that is trained on Europarl will succeed only weakly at identifying translationese in IHT. [sent-210, score-1.531]
73 Nevertheless, it is clear that source language strongly affects translationese in this corpus. [sent-219, score-0.834]
74 Accuracy of classifiers training on one source language and testing on another source language. Third, we find in ten-fold cross-validation experiments that we can distinguish translationese from original English in the IHT corpus with accuracy of 86. [sent-226, score-1.181]
75 Thus, despite the great distance between the three source languages in this corpus, general differences between translationese and original English are sufficient to facilitate reasonably accurate identification of translationese. [sent-228, score-1.154]
76 3 Combining the corpora First, we consider whether a classifier learned on the Europarl corpus can be used to identify translationese in the IHT corpus, and vice versa. [sent-230, score-0.872]
77 3, that we would achieve better than random results but not high accuracy, since there are no doubt features common to translations from the five European languages of Europarl that are distinct from those of translations from the very different languages in IHT. [sent-232, score-0.677]
78 The weak results reflect both differences between the families of source languages involved in the respective corpora, as well as genre differences. [sent-236, score-0.466]
79 Thus, for example, we find that of the pronouns shown in Table 4 above, only he and his are significantly under-represented in translationese in the IHT corpus. [sent-237, score-0.769]
80 Thus, that effect is specific either to the genre of Europarl or to the European languages considered there. [sent-238, score-0.19]
81 Now, we combine the two corpora and check if we can identify translationese across two genres and eight languages. [sent-239, score-0.852]
82 We run the same experiments as described above, using 200 texts from each of the eight source languages and 1600 non-translated English texts, 1000 from Europarl and 600 from IHT. [sent-240, score-0.44]
83 In 10-fold cross-validation, we find that we can distinguish translationese from non-translated English with accuracy of 90. [sent-241, score-0.797]
84 This shows that there are features of translationese that cross genres and widely disparate languages. [sent-243, score-0.841]
85 Thus, for one prominent example, we find that, as in Europarl, the word the is overrepresented in translationese in IHT (15. [sent-244, score-0.835]
86 The majority of cohesive adverbs are over-represented in translationese, most of them with differences significant at p<0. [sent-252, score-0.253]
87 5 Conclusions We have found that we can learn classifiers that determine source language given a translated text, as well as classifiers that distinguish translated text from non-translated text in the target language. [sent-258, score-0.751]
88 These text categorization experiments suggest that both source language and the mere fact of being translated play a crucial role in the makeup of a translated text. [sent-259, score-0.599]
89 It is important to note that our learned classifiers are based solely on function words, so that, unlike earlier studies, the differences we find are unlikely to include cultural or thematic differences that might be artifacts of corpus construction. [sent-260, score-0.297]
90 In addition, we find that the exploitability of differences between translated texts and non-translated texts is related to the difference between source languages: translations from similar source languages are different from non-translated texts in similar ways. [sent-261, score-1.185]
91 For example, Fusco (1990) studies translations between Spanish and Italian and considers the impact of structural differences between the two languages on translation quality. [sent-263, score-0.412]
92 Studying the differences and distance between languages by comparing translations into the same language may serve as another way to deepen our typological knowledge. [sent-264, score-0.369]
93 As we have seen, training on source language x and testing on source language y provides us with a good estimation of the distance between languages, in accordance with what we find in standard works on typology (cf. [sent-265, score-0.325]
94 In addition to its intrinsic interest, the finding that the distance between languages is directly correlated with our ability to distinguish translations from a given source language from non-translated text is of great importance for several computational tasks. [sent-267, score-0.485]
95 First, translations can be studied in order to shed new light on the differences between languages and can bear on attested techniques for using cognates to improve machine translation (Kondrak & Sherif, 2006). [sent-268, score-0.412]
96 Additionally, given the results of our experiments, it stands to reason that using translated texts, especially from related source languages, will prove beneficial for constructing language models and will outperform results obtained from non-translated texts. [sent-269, score-0.333]
97 Finally, we find that there are general properties of translationese sufficiently strong that we can identify translationese even in a combined corpus comprising eight very disparate languages across two distinct genres, one spoken and the other written. [sent-271, score-1.858]
98 Prominent among these properties is the word the, as well as a number of cohesive adverbs, each of which is significantly over-represented in translated texts. [sent-272, score-0.322]
99 Translationese in Swedish novels translated from English, in Lars Wollin & Hans Lindquist (eds. [sent-299, score-0.203]
100 Automatic detection of translated text and its impact on machine translation. [sent-318, score-0.228]
wordName wordTfidf (topN-words)
[('translationese', 0.704), ('iht', 0.289), ('translated', 0.203), ('europarl', 0.167), ('languages', 0.167), ('interference', 0.159), ('source', 0.13), ('translations', 0.128), ('finnish', 0.12), ('texts', 0.11), ('cognate', 0.103), ('halteren', 0.096), ('underrepresented', 0.09), ('cohesive', 0.088), ('overrepresented', 0.08), ('effects', 0.077), ('differences', 0.074), ('italian', 0.073), ('ilisei', 0.072), ('kurokawa', 0.072), ('families', 0.072), ('markers', 0.066), ('genres', 0.062), ('chunks', 0.058), ('adverbs', 0.055), ('laviosa', 0.054), ('disparate', 0.052), ('classifier', 0.051), ('french', 0.05), ('original', 0.05), ('succeed', 0.048), ('van', 0.047), ('german', 0.045), ('baroni', 0.044), ('translation', 0.043), ('testing', 0.042), ('pronouns', 0.042), ('spanish', 0.042), ('emphasized', 0.041), ('baker', 0.04), ('greek', 0.039), ('categorization', 0.038), ('romance', 0.038), ('experiment', 0.036), ('bernardini', 0.036), ('explicitation', 0.036), ('frawley', 0.036), ('madigan', 0.036), ('preponderance', 0.036), ('logistic', 0.036), ('distinct', 0.035), ('mona', 0.035), ('claims', 0.035), ('distinguish', 0.035), ('accuracy', 0.035), ('english', 0.034), ('eight', 0.033), ('earlier', 0.033), ('regression', 0.033), ('sara', 0.033), ('corpus', 0.032), ('learned', 0.032), ('animate', 0.032), ('properties', 0.031), ('korean', 0.031), ('corpora', 0.029), ('extent', 0.029), ('distinguishing', 0.029), ('strengthens', 0.029), ('artifacts', 0.029), ('gradability', 0.029), ('general', 0.029), ('five', 0.028), ('prominent', 0.028), ('bayesian', 0.028), ('liwc', 0.028), ('benjamins', 0.028), ('chunk', 0.026), ('european', 0.026), ('constituting', 0.026), ('haifa', 0.026), ('aggregate', 0.026), ('noted', 0.026), ('text', 0.025), ('ctio', 0.025), ('routledge', 0.025), ('accurately', 0.025), ('pennebaker', 0.024), ('doubt', 0.024), ('dialects', 0.024), ('sufficiently', 0.024), ('identify', 0.024), ('find', 0.023), ('modality', 0.023), ('wish', 0.023), ('genre', 0.023), ('bold', 0.023), ('frequencies', 0.023), ('salient', 0.023), ('cross', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 311 acl-2011-Translationese and Its Dialects
Author: Moshe Koppel ; Noam Ordan
Abstract: While it has often been observed that the product of translation is somehow different from non-translated text, scholars have emphasized two distinct bases for such differences. Some have noted interference from the source language spilling over into translation in a source-language-specific way, while others have noted general effects of the process of translation that are independent of source language. Using a series of text categorization experiments, we show that both these effects exist and that, moreover, there is a continuum between them. There are many effects of translation that are consistent among texts translated from a given source language, some of which are consistent even among texts translated from families of source languages. Significantly, we find that even for widely unrelated source languages and multiple genres, differences between translated texts and non-translated texts are sufficient for a learned classifier to accurately determine if a given text is translated or original.
2 0.12606339 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents
Author: Emmanuel Prochasson ; Pascale Fung
Abstract: We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and test on a different pair of languages, obtaining a F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low resources languages without training data.
3 0.097970121 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature
Author: Manaal Faruqui ; Sebastian Pado
Abstract: In contrast to many languages (like Russian or French), modern English does not distinguish formal and informal (“T/V”) address overtly, for example by pronoun choice. We describe an ongoing study which investigates to what degree the T/V distinction is recoverable in English text, and with what textual features it correlates. Our findings are: (a) human raters can label English utterances as T or V fairly well, given sufficient context; (b), lexical cues can predict T/V almost at human level.
4 0.097659983 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals
Author: Omar F. Zaidan ; Chris Callison-Burch
Abstract: Naively collecting translations by crowdsourcing the task to non-professional translators yields disfluent, low-quality results if no quality control is exercised. We demonstrate a variety of mechanisms that increase the translation quality to near professional levels. Specifically, we solicit redundant translations and edits to them, and automatically select the best output among them. We propose a set of features that model both the translations and the translators, such as country of residence, LM perplexity of the translation, edit rate from the other translations, and (optionally) calibration against professional translators. Using these features to score the collected translations, we are able to discriminate between acceptable and unacceptable translations. We recreate the NIST 2009 Urdu-toEnglish evaluation set with Mechanical Turk, and quantitatively show that our models are able to select translations within the range of quality that we expect from professional trans- lators. The total cost is more than an order of magnitude lower than professional translation.
5 0.092282012 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics
Author: Steffen Hedegaard ; Jakob Grue Simonsen
Abstract: We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attribution of both translated and untranslated texts; (ii) that frame-based classifiers generally perform worse than the baseline classifiers for untranslated texts, but (iii) perform as well as, or superior to the baseline classifiers on translated texts; (iv) that—contrary to current belief—naïve classifiers based on lexical markers may perform tolerably on translated texts if the combination of author and translator is present in the training set of a classifier.
6 0.091157638 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation
7 0.084430203 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words
8 0.071593449 75 acl-2011-Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction
9 0.062805742 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
10 0.061330609 44 acl-2011-An exponential translation model for target language morphology
11 0.060173899 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
12 0.059499756 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
13 0.057973191 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
14 0.056257538 16 acl-2011-A Joint Sequence Translation Model with Integrated Reordering
15 0.055093851 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation
16 0.054505538 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
17 0.053577174 247 acl-2011-Pre- and Postprocessing for Statistical Machine Translation into Germanic Languages
18 0.052781153 109 acl-2011-Effective Measures of Domain Similarity for Parsing
19 0.052505478 323 acl-2011-Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
20 0.051956639 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction
topicId topicWeight
[(0, 0.141), (1, -0.029), (2, 0.032), (3, 0.06), (4, 0.002), (5, 0.024), (6, 0.071), (7, 0.002), (8, 0.011), (9, 0.008), (10, -0.028), (11, -0.094), (12, -0.001), (13, -0.056), (14, 0.039), (15, -0.028), (16, 0.0), (17, 0.012), (18, 0.058), (19, -0.106), (20, 0.039), (21, -0.008), (22, -0.014), (23, 0.003), (24, -0.054), (25, 0.022), (26, -0.061), (27, 0.005), (28, 0.023), (29, -0.068), (30, 0.039), (31, -0.004), (32, 0.008), (33, -0.002), (34, -0.01), (35, 0.022), (36, -0.067), (37, 0.05), (38, -0.104), (39, 0.031), (40, -0.07), (41, 0.037), (42, 0.01), (43, -0.177), (44, 0.042), (45, -0.009), (46, -0.123), (47, -0.043), (48, -0.008), (49, 0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.92909473 311 acl-2011-Translationese and Its Dialects
Author: Moshe Koppel ; Noam Ordan
Abstract: While it has often been observed that the product of translation is somehow different from non-translated text, scholars have emphasized two distinct bases for such differences. Some have noted interference from the source language spilling over into translation in a source-language-specific way, while others have noted general effects of the process of translation that are independent of source language. Using a series of text categorization experiments, we show that both these effects exist and that, moreover, there is a continuum between them. There are many effects of translation that are consistent among texts translated from a given source language, some of which are consistent even among texts translated from families of source languages. Significantly, we find that even for widely unrelated source languages and multiple genres, differences between translated texts and non-translated texts are sufficient for a learned classifier to accurately determine if a given text is translated or original.
2 0.76251519 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature
Author: Manaal Faruqui ; Sebastian Pado
Abstract: In contrast to many languages (like Russian or French), modern English does not distinguish formal and informal (“T/V”) address overtly, for example by pronoun choice. We describe an ongoing study which investigates to what degree the T/V distinction is recoverable in English text, and with what textual features it correlates. Our findings are: (a) human raters can label English utterances as T or V fairly well, given sufficient context; (b), lexical cues can predict T/V almost at human level.
3 0.64171922 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics
Author: Steffen Hedegaard ; Jakob Grue Simonsen
Abstract: We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attribution of both translated and untranslated texts; (ii) that frame-based classifiers generally perform worse than the baseline classifiers for untranslated texts, but (iii) perform as well as, or superior to the baseline classifiers on translated texts; (iv) that—contrary to current belief—naïve classifiers based on lexical markers may perform tolerably on translated texts if the combination of author and translator is present in the training set of a classifier.
4 0.63509113 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words
Author: Hal Daume III ; Jagadeesh Jagarlamudi
Abstract: We show that unseen words account for a large part of the translation error when moving to new domains. Using an extension of a recent approach to mining translations from comparable corpora (Haghighi et al., 2008), we are able to find translations for otherwise OOV terms. We show several approaches to integrating such translations into a phrasebased translation system, yielding consistent improvements in translations quality (between 0.5 and 1.5 Bleu points) on four domains and two language pairs.
5 0.58957642 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents
Author: Emmanuel Prochasson ; Pascale Fung
Abstract: We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and test on a different pair of languages, obtaining a F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low resources languages without training data.
6 0.58571184 151 acl-2011-Hindi to Punjabi Machine Translation System
7 0.58297449 303 acl-2011-Tier-based Strictly Local Constraints for Phonology
8 0.5664832 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction
9 0.54173177 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations
10 0.53434443 193 acl-2011-Language-independent compound splitting with morphological operations
11 0.52085227 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
12 0.51217431 96 acl-2011-Disambiguating temporal-contrastive connectives for machine translation
13 0.49975017 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation
14 0.4894155 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals
15 0.47007436 297 acl-2011-That's What She Said: Double Entendre Identification
16 0.46853986 138 acl-2011-French TimeBank: An ISO-TimeML Annotated Reference Corpus
17 0.46771002 247 acl-2011-Pre- and Postprocessing for Statistical Machine Translation into Germanic Languages
18 0.46553108 323 acl-2011-Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
19 0.45733708 133 acl-2011-Extracting Social Power Relationships from Natural Language
20 0.45405075 313 acl-2011-Two Easy Improvements to Lexical Weighting
topicId topicWeight
[(5, 0.039), (17, 0.071), (26, 0.039), (31, 0.025), (37, 0.1), (39, 0.026), (41, 0.078), (53, 0.016), (54, 0.223), (55, 0.017), (59, 0.047), (72, 0.028), (91, 0.044), (96, 0.104), (97, 0.025)]
simIndex simValue paperId paperTitle
1 0.78555608 210 acl-2011-Lexicographic Semirings for Exact Automata Encoding of Sequence Models
Author: Brian Roark ; Richard Sproat ; Izhak Shafran
Abstract: In this paper we introduce a novel use of the lexicographic semiring and motivate its use for speech and language processing tasks. We prove that the semiring allows for exact encoding of backoff models with epsilon transitions. This allows for off-line optimization of exact models represented as large weighted finite-state transducers in contrast to implicit (on-line) failure transition representations. We present preliminary empirical results demonstrating that, even in simple intersection scenarios amenable to the use of failure transitions, the use of the more powerful lexicographic semiring is competitive in terms of time of intersection. 1 Introduction and Motivation Representing smoothed n-gram language models as weighted finite-state transducers (WFST) is most naturally done with a failure transition, which reflects the semantics of the “otherwise” formulation of smoothing (Allauzen et al., 2003). For example, the typical backoff formulation of the probability of a word w given a history h is as follows P(w | h) = ?αPh(Pw( |w h )| h0) oift hc(ehrwwis)e > 0 (1) where P is an empirical estimate of the probability that reserves small finite probability for unseen n-grams; αh is a backoff weight that ensures normalization; and h0 is a backoff history typically achieved by excising the earliest word in the history h. The principle benefit of encoding the WFST in this way is that it only requires explicitly storing n-gram transitions for observed n-grams, i.e., count greater than zero, as opposed to all possible n-grams of the given order which would be infeasible in for example large vocabulary speech recognition. This is a massive space savings, and such an approach is also used for non-probabilistic stochastic language 1 models, such as those trained with the perceptron algorithm (Roark et al., 2007), as the means to access all and exactly those features that should fire for a particular sequence in a deterministic automaton. Similar issues hold for other finite-state se- quence processing problems, e.g., tagging, bracketing or segmenting. Failure transitions, however, are an implicit method for representing a much larger explicit automaton in the case of n-gram models, all possible n-grams for that order. During composition with the model, the failure transition must be interpreted on the fly, keeping track of those symbols that have already been found leaving the original state, and only allowing failure transition traversal for symbols that have not been found (the semantics of “otherwise”). This compact implicit representation cannot generally be preserved when composing with other models, e.g., when combining a language model with a pronunciation lexicon as in widelyused FST approaches to speech recognition (Mohri et al., 2002). Moving from implicit to explicit representation when performing such a composition leads to an explosion in the size of the resulting transducer, frequently making the approach intractable. In practice, an off-line approximation to the model is made, typically by treating the failure transitions as epsilon transitions (Mohri et al., 2002; Allauzen et al., 2003), allowing large transducers to be composed and optimized off-line. These complex approximate transducers are then used during first-pass – decoding, and the resulting pruned search graphs (e.g., word lattices) can be rescored with exact language models encoded with failure transitions. 
Similar problems arise when building, say, POStaggers as WFST: not every pos-tag sequence will have been observed during training, hence failure transitions will achieve great savings in the size of models. Yet discriminative models may include complex features that combine both input stream (word) and output stream (tag) sequences in a single feature, yielding complicated transducer topologies for which effective use of failure transitions may not Proceedings Pofo trhtlea 4nd9,th O Arnegnouna,l J Muneeet 1in9g-2 o4f, t 2h0e1 A1s.s ?oc ci2a0t1io1n A fosrso Ccioamtiopnut faotrio Cnoaml Lpiuntgauti osntiacls: Lsihnogrutpisatipcesrs, pages 1–5, be possible. An exact encoding using other mechanisms is required in such cases to allow for off-line representation and optimization. In this paper, we introduce a novel use of a semiring the lexicographic semiring (Golan, 1999) which permits an exact encoding of these sorts of models with the same compact topology as with failure transitions, but using epsilon transitions. Unlike the standard epsilon approximation, this semiring allows for an exact representation, while also allowing (unlike failure transition approaches) for off-line – – composition with other transducers, with all the optimizations that such representations provide. In the next section, we introduce the semiring, followed by a proof that its use yields exact representations. We then conclude with a brief evaluation of the cost of intersection relative to failure transitions in comparable situations. 2 The Lexicographic Semiring Weighted automata are automata in which the transitions carry weight elements of a semiring (Kuich and Salomaa, 1986). A semiring is a ring that may lack negation, with two associative operations ⊕ and ⊗lac akn nde tghaetiiro nre,ws piethcti twveo i dasesnotictiya ievleem oepnertas t 0io annsd ⊕ ⊕ 1. a nAd ⊗com anmdo tnh esierm reirsipnegc tiivn es pideeenchti ayn edl elmanegnutasg 0e processing, and one that we will be using in this paper, is the tropical semiring (R ∪ {∞}, min, +, ∞, 0), i.e., tmhein t riosp tihcea l⊕ s omfi trhineg gs(e mRi∪rin{g∞ (w},imthi nid,e+nt,i∞ty ,∞0)), ia.end., m+ ins tihse t ⊗e o⊕f othfe t hseem seirminigri n(wg i(wth tidhe indteitnyt t0y). ∞Th)i asn ids a+pp isro thpreia ⊗te o ofof rth h pee srfeomrmiri nngg (Vwitietrhb iid seenatricthy u0s).in Tgh negative log probabilities – we add negative logs along a path and take the min between paths. A hW1 , W2 . . . Wni-lexicographic weight is a tupleA o hfW weights wherei- eeaxichco gorfa pthhiec w weeiigghhtt cisla ass teusW1, W2 . . . Wn, must observe the path property (Mohri, 2002). The path property of a semiring K is defined in terms of the natural order on K such that: a <2 ws e3 w&o4;)r The term “lexicographic” is an apt term for this semiring since the comparison for ⊕ is like the lexicseomgriaripnhgic s icnocmep thareis coonm opfa srtisrionngs f,o rco ⊕m ipsa lrikineg t thhee l feixrist- elements, then the second, and so forth. 3 Language model encoding 3.1 Standard encoding For language model encoding, we will differentiate between two classes of transitions: backoff arcs (labeled with a φ for failure, or with ? using our new semiring); and n-gram arcs (everything else, labeled with the word whose probability is assigned). Each state in the automaton represents an n-gram history string h and each n-gram arc is weighted with the (negative log) conditional probability of the word w labeling the arc given the history h. 
For a given history h and n-gram arc labeled with a word w, the destination of the arc is the state associated with the longest suffix of the string hw that is a history in the model. This will depend on the Markov order of the n-gram model. For example, consider the trigram model schematic shown in Figure 1, in which only history sequences of length 2 are kept in the model. Thus, from history hi = wi−2wi−1, the word wi transitions to hi+1 = wi−1wi, w2hii−ch1 is the longest suffix of hiwi in the modie−l1. As detailed in the “otherwise” semantics of equation 1, backoff arcs transition from state h to a state h0, typically the suffix of h of length |h| − 1, with we,i tgyhpti c(a−lllyog th αeh s)u. Wixe o cfa hll othf ele ndgestthin |hat|io −n 1s,ta wtei ah bwaecikgohtff s−taltoe.g αThis recursive backoff topology terminates at the unigram state, i.e., h = ?, no history. Backoff states of order k may be traversed either via φ-arcs from the higher order n-gram of order k + 1or via an n-gram arc from a lower order n-gram of order k −1. This means that no n-gram arc can enter tohred ezre rko−eth1. .o Trhdiesr mstaeaten s(fi tnhaalt bnaoc nk-ogfrfa),m ma andrc f cualln-o enrdteerr states history strings of length n − 1 for a model sotfa toersde —r n h may ihnagvse o n-gram a nrc −s e 1nt feorri nag m forodeml other full-order states as well as from backoff states of history size n − 2. — s—to 3.2 Encoding with lexicographic semiring For an LM machine M on the tropical semiring with failure transitions, which is deterministic and has the wih-2 =i1wφ/-logPwα(hi-1|whiφ)/-logwPhiα(+-1w i=|-1wiφ)/-logPαw(hi+)1 Figure 1: Deterministic finite-state representation of n-gram models with negative log probabilities (tropical semiring). The symbol φ labels backoff transitions. Modified from Roark and Sproat (2007), Figure 6.1. path property, we can simulate φ-arcs in a standard LM topology by a topologically equivalent machine M0 on the lexicographic hT, Ti semiring, where φ has boenen th hreep l eaxciceod gwraitphh eicps hilTo,nT, ais sfeomlloirwinsg. ,F worh every n-gram arc with label w and weight c, source state si and destination state sj, construct an n-gram arc with label w, weight h0, ci, source state si0, and deswtiniathtio lanb estla wte, s0j. gThhte h e0x,citi c, soosut rocfe e satcahte s state is constructed as follows. If the state is non-final, h∞, ∞i . sOttruhectrewdis aes fifo litl ofiwnsa.l Iwf tihthe e sxtiatt ec iosst n co nit- fwinilall ,b he∞ ∞h0,,∞ ∞cii . hLeertw n sbee tfh iet length oithf th exei longest history string iin. the model. For every φ-arc with (backoff) weight c, source state si, and destination state sj representing a history of length k, construct an ?-arc with source state si0, destination state s0j, and weight hΦ⊗(n−k) , ci, where Φ > 0 and Φ⊗(n−k) takes Φ to the (n − k)th power with the ⊗ operation. In the tropical semiring, ⊗ ris w +, so Φe⊗ ⊗(n o−pke) = (n − k)Φ. tFroorp iecxaalm sepmlei,r i nng a, t⊗rigi sra +m, msoo Φdel, if we= =ar (en b −ac kki)nΦg. off from a bigram state h (history length = 1) to a unigram state, n − k = 2 − 0 = 2, so we set the buanicgkroafmf w steaigteh,t nto −h2 kΦ, = =− l2og − α 0h) = =for 2 ,s soome w Φe s>et 0 th. cInk ofrfd were gtoh tco tom hb2iΦn,e −thleo gmαodel with another automaton or transducer, we would need to also convert those models to the hT, Ti semiring. 
For these aveutrotm thaotsae, mwoed seilmsp toly t uese hT a, Tdeif saeumlt rtrinagn.sf Foromra thtieosen such that every transition with weight c is assigned weight h0, ci . For example, given a word lattice wL,e iwghe tco h0n,vceir.t the lattice to L0 in the lexicographic semiring using this default transformation, and then perform the intersection L0 ∩ M0. By removing epsilon transitions and determ∩in Mizing the result, the low cost path for any given string will be retained in the result, which will correspond to the path achieved with Finally we project the second dimension of the hT, Ti weights to produce a lattice dini mtheen strioonpi ocfal t seem hTir,iTngi, wweihgichhts i tso e pqruoidvuacleen at ltaot tichee 3 result of L ∩ M, i.e., φ-arcs. C2(det(eps-rem(L0 ∩ M0))) = L ∩ M where C2 denotes projecting the second-dimension wofh tehree ChT, Ti weights, det(·) denotes determinizatoifon t,h aen hdT e,pTsi-r wemei(g·h) sde,n doette(s· )?- dreenmootveasl. d 4 Proof We wish to prove that for any machine N, ShortestPath(M0 ∩ N0) passes through the equivalent states in M0 to∩ t Nhose passed through in M for ShortestPath(M ∩ N) . Therefore determinization Sofh othrtee rsetsPualttihn(gM Mint ∩er Nse)c.ti Tonh rafefteorr e?- dreemteromvianl yzaiteilodns the same topology as intersection with the equivalent φ machine. Intuitively, since the first dimension of the hT, Ti weights is 0 for n-gram arcs and > 0 foofr t h beac hkTo,ffT arcs, tghhet ss ihsor 0te fostr p na-tghr awmil la rtcrasv aenrdse > >the 0 fewest possible backoff arcs; further, since higherorder backoff arcs cost less in the first dimension of the hT, Ti weights in M0, the shortest path will intchleud heT n-gram iagrhcst sa ti nth Meir earliest possible point. We prove this by induction on the state-sequence of the path p/p0 up to a given state si/si0 in the respective machines M/M0. Base case: If p/p0 is of length 0, and therefore the states si/si0 are the initial states of the respective machines, the proposition clearly holds. Inductive step: Now suppose that p/p0 visits s0...si/s00...si0 and we have therefore reached si/si0 in the respective machines. Suppose the cumulated weights of p/p0 are W and hΨ, Wi, respectively. We wish to show thaarte w Whic anhedv heΨr sj isi n reexspt evcitsiivteedly o. nW p (i.e., the path becomes s0...sisj) the equivalent state s0 is visited on p0 (i.e., the path becomes s00...si0s0j). Let w be the next symbol to be matched leaving states si and si0. There are four cases to consider: (1) there is an n-gram arc leaving states si and si0 labeled with w, but no backoff arc leaving the state; (2) there is no n-gram arc labeled with w leaving the states, but there is a backoff arc; (3) there is no ngram arc labeled with w and no backoff arc leaving the states; and (4) there is both an n-gram arc labeled with w and a backoff arc leaving the states. In cases (1) and (2), there is only one possible transition to take in either M or M0, and based on the algorithm for construction of M0 given in Section 3.2, these transitions will point to sj and s0j respectively. Case (3) leads to failure of intersection with either machine. This leaves case (4) to consider. In M, since there is a transition leaving state si labeled with w, the backoff arc, which is a failure transition, cannot be traversed, hence the destination of the n-gram arc sj will be the next state in p. However, in M0, both the n-gram transition labeled with w and the backoff transition, now labeled with ?, can be traversed. 
What we will now prove is that the shortest path through M0 cannot include taking the backoff arc in this case. In order to emit w by taking the backoff arc out of state si0, one or more backoff (?) transitions must be taken, followed by an n-gram arc labeled with w. Let k be the order of the history represented by state si0, hence the cost of the first backoff arc is h(n − k)Φ, −log(αsi0 )i in our semiring. If we tirsa vhe(rns e− km) Φ b,a−ckloofgf( αarcs) ip irnior o tro eemmiitrtiinngg. the w, the first dimension of our accumulated cost will be m(n −k + m−21)Φ, based on our algorithm for consmtr(unct−ionk +of M0 given in Section 3.2. Let sl0 be the destination state after traversing m backoff arcs followed by an n-gram arc labeled with w. Note that, by definition, m ≤ k, and k − m + 1 is the orbdeyr oeffi nstitaitoen ,sl0 m. B≤ as ked, onnd t khe − c mons +tru 1ct iiosn t ealg oor-rithm, the state sl0 is also reachable by first emitting w from state si0 to reach state s0j followed by some number of backoff transitions. The order of state s0j is either k (if k is the highest order in the model) or k + 1 (by extending the history of state si0 by one word). If it is of order k, then it will require m −1 backoff arcs to reach state sl0, one fewer tqhuainre t mhe− −pa1t hb ctok osftfat aer ss0l oth raeta cbheg sitanste w sith a backoff arc, for a total cost of (m − 1) (n − k + m−21)Φ which is less than m(n − k + m−21)Φ. If state s0j icish o ifs o lerdsser hka n+ m1,( th −er ke +will be m backoff arcs to reach state sl0, but with a total cost of m(n − (k + 1) + m−21)Φ m(n − k + m−23)Φ = which is also less than m(n − km + m−21)Φ. Hence twheh cstha ties asl0ls coa lne asl twhaayns mbe( n re −ac khe +d from si0 with a lower cost through state s0j than by first taking the backoff arc from si0. Therefore the shortest path on M0 must follow s00...si0s0j. 2 This completes the proof. 5 Experimental Comparison of ?, φ and hT, Ti encoded language models For our experiments we used lattices derived from a very large vocabulary continuous speech recognition system, which was built for the 2007 GALE Arabic speech recognition task, and used in the work reported in Lehr and Shafran (201 1). The lexicographic semiring was evaluated on the development 4 set (2.6 hours of broadcast news and conversations; 18K words). The 888 word lattices for the development set were generated using a competitive baseline system with acoustic models trained on about 1000 hrs of Arabic broadcast data and a 4-gram language model. The language model consisting of 122M n-grams was estimated by interpolation of 14 components. The vocabulary is relatively large at 737K and the associated dictionary has only single pronunciations. The language model was converted to the automaton topology described earlier, and represented in three ways: first as an approximation of a failure machine using epsilons instead of failure arcs; second as a correct failure machine; and third using the lexicographic construction derived in this paper. The three versions of the LM were evaluated by intersecting them with the 888 lattices of the development set. The overall error rate for the systems was 24.8%—comparable to the state-of-theart on this task1 . For the shortest paths, the failure and lexicographic machines always produced identical lattices (as determined by FST equivalence); in contrast, 81% of the shortest paths from the epsilon approximation are different, at least in terms of weights, from the shortest paths using the failure LM. 
For full lattices, 42 (4.7%) of the lexicographic outputs differ from the failure LM outputs, due to small floating point rounding issues; 863 (97%) of the epsilon approximation outputs differ. In terms of size, the failure LM, with 5.7 million arcs requires 97 Mb. The equivalent hT, Tillieoxnico argcrasp rheqicu iLreMs r 9e7qu Mireb.s 1 T20h eM ebq,u idvuael eton tth heT ,dToui-bling of the size of the weights.2 To measure speed, we performed the intersections 1000 times for each of our 888 lattices on a 2993 MHz Intel?R Xeon?R CPU, and took the mean times for each of our methods. The 888 lattices were processed with a mean of 1.62 seconds in total (1.8 msec per lattice) using the failure LM; using the hT, Ti-lexicographic iLnMg t rheequ fairieldur 1e.8 L Msec;o unsdinsg g(2 t.h0e m hTse,cT per lxaitctiocger)a, pahnidc is thus about 11% slower. Epsilon approximation, where the failure arcs are approximated with epsilon arcs took 1.17 seconds (1.3 msec per lattice). The 1The error rate is a couple of points higher than in Lehr and Shafran (2011) since we discarded non-lexical words, which are absent in maximum likelihood estimated language model and are typically augmented to the unigram backoff state with an arbitrary cost, fine-tuned to optimize performance for a given task. 2If size became an issue, the first dimension of the hT, TiweigIhft scizane bbee c raemprees aennt iesdsu eby, tah esi fnigrlste bdyimtee. slightly slower speeds for the exact method using the failure LM, and hT, Ti can be related to the overhfaeialdur eof L cMom, apnudtin hgT ,tThei f caailnur bee f urenlcattieodn aot rhuen toivmeer,and determinization, respectively. 6 Conclusion In this paper we have introduced a novel application of the lexicographic semiring, proved that it can be used to provide an exact encoding of language model topologies with failure arcs, and provided experimental results that demonstrate its efficiency. Since the hT, Ti-lexicographic semiring is both lefSt-i nacned hr iegh htT-d,iTstrii-bleuxtiicvoe,g roatphheirc o spetmimiriiznagtions such as minimization are possible. The particular hT, Ti-lexicographic semiring we have used thiceruel airs h bTu,t Toni-el oxifc many h piocss siebmlei ilnexgic woegr haapvheic u esendcodings. We are currently exploring the use of a lexicographic semiring that involves different semirings in the various dimensions, for the integration of part-of-speech taggers into language models. An implementation of the lexicographic semiring by the second author is already available as part of the OpenFst package (Allauzen et al., 2007). The methods described here are part of the NGram language-model-training toolkit, soon to be released at opengrm .org. Acknowledgments This research was supported in part by NSF Grant #IIS-081 1745 and DARPA grant #HR001 1-09-10041. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the NSF or DARPA. We thank Maider Lehr for help in preparing the test data. We also thank the ACL reviewers for valuable comments. References Cyril Allauzen, Mehryar Mohri, and Brian Roark. 2003. Generalized algorithms for constructing statistical language models. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 40–47. Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst: A general and efficient weighted finite-state transducer library. 
Jonathan Golan. 1999. Semirings and their Applications. Kluwer Academic Publishers, Dordrecht.

Werner Kuich and Arto Salomaa. 1986. Semirings, Automata, Languages. Number 5 in EATCS Monographs on Theoretical Computer Science. Springer-Verlag, Berlin, Germany.

Maider Lehr and Izhak Shafran. 2011. Learning a discriminative weighted finite-state transducer for speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, July.

Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69–88.

Mehryar Mohri. 2002. Semiring framework and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350.

Brian Roark and Richard Sproat. 2007. Computational Approaches to Morphology and Syntax. Oxford University Press, Oxford.

Brian Roark, Murat Saraclar, and Michael Collins. 2007. Discriminative n-gram language modeling. Computer Speech and Language, 21(2):373–392.
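As promised above, here is a minimal toy sketch (our illustration with invented costs, not the paper's machinery or OpenFst code) of why approximating a failure (φ) arc with an epsilon arc changes shortest paths: a φ arc may be taken only when no n-gram arc matches the next word, while an ε arc is always available, so the search can route through the backoff even when an exact n-gram exists.

```python
# Toy bigram model in negative-log (tropical) costs; all values invented.
bigram = {("the", "cat"): 1.0}   # -log p(cat | the)
unigram = {"cat": 0.4}           # -log p(cat)
backoff = {"the": 0.2}           # -log alpha(the)

def cost_phi(history, word):
    """Failure-arc semantics: back off ONLY if no bigram arc matches."""
    if (history, word) in bigram:
        return bigram[(history, word)]
    return backoff[history] + unigram[word]

def cost_eps(history, word):
    """Epsilon semantics: the backoff route is always available, so a
    shortest-path search takes the cheaper of the two options."""
    candidates = [backoff[history] + unigram[word]]
    if (history, word) in bigram:
        candidates.append(bigram[(history, word)])
    return min(candidates)

print(cost_phi("the", "cat"))  # 1.0 -- the correct model cost
print(cost_eps("the", "cat"))  # 0.6 -- underestimates via the epsilon path
```

Whenever the backoff route is cheaper than the exact n-gram, the epsilon machine scores the word incorrectly, which is consistent with the 81% of epsilon-approximation shortest paths reported above as differing from the failure-LM ones.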
Author: Nina Dethlefs ; Heriberto Cuayahuitl
Abstract: Surface realisation decisions in language generation can be sensitive to a language model, but also to decisions of content selection. We therefore propose the joint optimisation of content selection and surface realisation using Hierarchical Reinforcement Learning (HRL). To this end, we suggest a novel reward function that is induced from human data and is especially suited for surface realisation. It is based on a generation space in the form of a Hidden Markov Model (HMM). Results in terms of task success and human-likeness suggest that our unified approach performs better than greedy or random baselines.
same-paper 3 0.75454718 311 acl-2011-Translationese and Its Dialects
4 0.63043559 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction
Author: Shasha Liao ; Ralph Grishman
Abstract: Annotating training data for event extraction is tedious and labor-intensive. Most current event extraction tasks rely on hundreds of annotated documents, but this is often not enough. In this paper, we present a novel self-training strategy, which uses Information Retrieval (IR) to collect a cluster of related documents as the resource for bootstrapping. Also, based on the particular characteristics of this corpus, global inference is applied to provide more confident and informative data selection. We compare this approach to self-training on a normal newswire corpus and show that IR can provide a better corpus for bootstrapping and that global inference can further improve instance selection. We obtain gains of 1.7% in trigger labeling and 2.3% in role labeling through IR and an additional 1.1% in trigger labeling and 1.3% in role labeling by applying global inference.
5 0.629264 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
Author: Yee Seng Chan ; Dan Roth
Abstract: In this paper, we observe that there exists a second dimension to the relation extraction (RE) problem that is orthogonal to the relation type dimension. We show that most of these second dimensional structures are relatively constrained and not difficult to identify. We propose a novel algorithmic approach to RE that starts by first identifying these structures and then, within these, identifying the semantic type of the relation. In the real RE problem where relation arguments need to be identified, exploiting these structures also allows reducing pipelined propagated errors. We show that this RE framework provides significant improvement in RE performance.
6 0.61934066 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
7 0.61550611 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
8 0.61414391 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering
9 0.61398095 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation
10 0.61068916 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
11 0.61022723 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
12 0.61010504 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
13 0.60921133 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks
14 0.60917902 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora
15 0.60713595 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition
16 0.60711551 73 acl-2011-Collective Classification of Congressional Floor-Debate Transcripts
17 0.60685349 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
18 0.60633254 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features
19 0.60605115 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding
20 0.60569787 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing