emnlp emnlp2010 emnlp2010-47 knowledge-graph by maker-knowledge-mining

47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation


Source: pdf

Author: Aurelien Max

Abstract: In this article, an original view on how to improve phrase translation estimates is proposed. This proposal is grounded on two main ideas: first, that appropriate examples of a given phrase should participate more in building its translation distribution; second, that paraphrases can be used to better estimate this distribution. Initial experiments provide evidence of the potential of our approach and its implementation for effectively improving translation performance.

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 Paris Sud, Orsay, France. Abstract: In this article, an original view on how to improve phrase translation estimates is proposed. [sent-2, score-0.54]

2 This proposal is grounded on two main ideas: first, that appropriate examples of a given phrase should participate more in building its translation distribution; second, that paraphrases can be used to better estimate this distribution. [sent-3, score-1.249]

3 Initial experiments provide evidence of the potential of our approach and its implementation for effectively improving translation performance. [sent-4, score-0.355]

4 1 Introduction Phrase translation estimation in Statistical Phrase-based Translation (Koehn et al. [sent-5, score-0.355]

5 examples1 can lead to significant improvements in translation quality. [sent-8, score-0.355]

6 Also, providing indirect training instances via synonyms or paraphrases for previously unseen phrases can result in gains in translation quality, which are more apparent when little training data is originally available (Callison-Burch et al. [sent-9, score-1.146]

7 Although there is a consensus on the importance of using more parallel data in SMT, it has never been formally shown that all additional training instances are actually useful in predicting contextually appropriate translation hypotheses. [sent-14, score-0.516]

8 In fact, it is important that phrase translation models adequately reflect contextual preferences for each phrase occurrence in a text. [sent-20, score-0.777]

9 A variety of recent works have used dynamically adapted translation models, where each phrase occurrence has its own translation distribution (Carpuat and Wu, 2007; Stroppa et al. [sent-21, score-0.904]

10 , 2010) which shows that phrase-based SMT systems are expressive enough to achieve very high translation performance and therefore suggests a better scoring of phrases. [sent-25, score-0.355]

11 Exploiting comparable corpora for acquiring translation equivalents (Munteanu and Marcu, 2005; Abdul-Rauf and Schwenk, 2009) offers interesting prospects to this issue, but so far focus has not been so much on context appropriateness as on globally increasing the number of biphrase examples. [sent-27, score-0.565]

12 However, in our view, this finding does not contradict the need for estimating translation distributions at the individual phrase level, but they should be integrated as additional information. [sent-30, score-0.508]

13 In contrast to previous attempts at using paraphrases to improve Statistical Machine Translation, which require external data in the form of additional parallel bilingual corpora (Callison-Burch et al. [sent-35, score-0.687]

14 , 2010) or sentential paraphrases of the input (Schroeder et al. [sent-40, score-0.61]

15 It also significantly departs from previous work in that paraphrasing is not simply considered as a way of finding alternative wordings that can be translated given the original training data for out-of-vocabulary phrases only (Callison-Burch et al. [sent-42, score-0.356]

16 , 2010), we do not encode paraphrases into input lattices to have them compete against each other to belong to the source sentential paraphrase that will lead to the highest scoring output sentence3. [sent-50, score-0.912]

17 Instead, we make use of all contextually appropriate paraphrases of a source phrase, which collectively evaluate the quality of each translation for that phrase. [sent-51, score-1.189]

18 This work can thus be seen as a contribution towards shifting from global phrase translation distributions to contextual translation distributions for contextually equivalent source units. [sent-52, score-1.141]

19 We describe an experimental setup in section 4. Footnote 3: This highly depends on how well estimated the translations for each independent paraphrase are. [sent-56, score-0.361]

20 1 Contextual estimation of phrase translations In standard approaches to phrase-based SMT, evidence of a translation is accumulated uniformly every time it is found associated with a source phrase in the training corpus. [sent-60, score-0.978]
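The uniform accumulation described in sentence 20 amounts to relative-frequency estimation of p(e | f); a minimal sketch (the biphrase data and phrase names are purely illustrative):

```python
from collections import Counter, defaultdict

def phrase_translation_probs(biphrases):
    """Relative-frequency estimation of p(e | f): every extracted
    biphrase (f, e) contributes one uniform count, regardless of how
    contextually appropriate its occurrence was."""
    counts = defaultdict(Counter)
    for f, e in biphrases:
        counts[f][e] += 1
    return {f: {e: c / sum(ec.values()) for e, c in ec.items()}
            for f, ec in counts.items()}

# Hypothetical training extractions for the French phrase "avocat"
biphrases = [("avocat", "lawyer"), ("avocat", "lawyer"),
             ("avocat", "lawyer"), ("avocat", "avocado")]
probs = phrase_translation_probs(biphrases)
```

Every occurrence counts the same here, which is exactly why rare but contextually appropriate translations end up with very low probabilities.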

21 In addition to the fact that errors in automatic word alignment and non-literal translations often produce useless biphrases, this results in rare but appropriate translations being very unlikely to be considered during decoding. [sent-61, score-0.487]

22 , 2009) build classifiers offline for the phrases in a test set, so that context similarity can for example reinforce scores associated with rare but appropriate translations. [sent-65, score-0.392]

23 These works converge on the need for accessing a sufficient number of examples that are relevant for any source phrase in context, fast enough to permit on-the-fly phrase table building. [sent-71, score-0.517]

24 This paper proposes an intermediate step: the full set of phrase examples is found efficiently, and a measure of the adequacy of each example with a phrase in context provides evidence for its translation that depends on this value of adequacy. [sent-72, score-0.804]

25 In this way, the translation associated with an example for a different sense of a polysemous word would in the best scenario only be considered marginally when computing the translation distribution. [sent-73, score-0.746]

26 Ideally, one would stop extracting examples when enough appropriate examples have been found to estimate a reliable translation distribution. [sent-75, score-0.601]

27 Figure 1: Effect of number of samples on translation quality (measured on German to English translation on Europarl data) reported by (Callison-Burch et al., 2005). [sent-84, score-0.71]

28 (Callison-Burch et al., 2005) measured the impact on translation quality of the sample size in random sampling of source phrase examples in the training corpus to estimate a phrase’s translation probabilities. [sent-86, score-1.122]
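The sampling procedure measured in Figure 1 can be sketched as follows; the sample size, seed, and occurrence data are illustrative:

```python
import random
from collections import Counter

def sampled_translation_probs(translations, sample_size, seed=0):
    """Estimate a phrase's translation distribution from a random
    sample of its training examples, in the spirit of the sampling
    experiment of Callison-Burch et al. (2005): `translations` lists
    the target side of every occurrence of the source phrase."""
    rng = random.Random(seed)
    if len(translations) > sample_size:
        translations = rng.sample(translations, sample_size)
    counts = Counter(translations)
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}

# Hypothetical occurrence list: 80 times "good", 20 times "fine"
probs = sampled_translation_probs(["good"] * 80 + ["fine"] * 20, 50)
```

Beyond some sample size the estimated distribution stabilizes, which is what Figure 1 reports as a plateau in translation quality.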

29 In fact, (Lopez, 2008) points to using discriminatively trained models with contextual features of source phrases in conjunction with phrase sampling as an open problem. [sent-90, score-0.535]

30 2 Using paraphrases for translating For some phrases, not enough examples can be found in the training corpus to estimate reliable translation probabilities in context. [sent-93, score-1.073]

31 We can in fact consider the set of source phrases that have similar translations in context. [sent-95, score-0.475]

32 Figure 2 provides examples of English paraphrases obtained by automatically pivoting via French. [sent-103, score-0.723]

33 As can be seen, some examples would be clearly useful to better estimate translations of the original source phrase: (Balkan War ↔ war in the Balkans) are syntactic variants that can generally substitute for each other, and (Balkan War ↔ Balkans war) are character-level variants. [sent-104, score-0.514]

34 Previous attempts at exploiting paraphrases in SMT have first concentrated on obtaining translations for phrases absent from the training corpus (Callison-Burch et al. [sent-106, score-0.938]

35 , 2009) in fact considers both paraphrases and entailed texts to increase the number of properly translated texts. [sent-115, score-0.609]

36 2009) proceed similarly but obtain their paraphrases from comparatively much larger monolingual corpora by following the distributional hypothesis. [sent-116, score-0.647]

37 Furthermore, the described implementations do not consider acceptability of the paraphrases in context, as their underlying hypothesis is that it might be more desirable to translate some paraphrase than not to translate a given phrase. [sent-118, score-0.811]

38 The natural next step that we take here is to exploit the complementarity of the original bilingual training corpus for finding paraphrases and the monolingual (source) side of the same corpus for validating them in context. [sent-121, score-0.66]

39 Furthermore, our focus here is not on paraphrasing unseen phrases7, but possibly any phrase, or any phrase seen less than a given number of times, or any types of difficult-to-translate phrases (Mohit and Hwa, 2007). [sent-122, score-0.467]

40 , 2010) uses crowdsourcing to obtain paraphrases for source phrases corresponding to mistranslated target phrases. [sent-124, score-0.884]

41 The spotting of the incorrect target phrases and the paraphrasing of the source phrases can be automated. [sent-125, score-0.599]

42 In this scenario, paraphrases are in fact competing with each other, whereas in our proposal paraphrases collectively participate in evaluating the quality of each translation for a source phrase. [sent-131, score-1.596]

43 We believe that if two phrases are indeed paraphrases in context, then their respective sets of translations are both relevant to translate the two phrases. [sent-132, score-0.985]

44 Footnote 7: Doing it in conjunction with our approach for improving the translation of known phrases is part of our future work. [sent-134, score-0.513]

45 Lastly, automatic sentential paraphrasing has also been used in SMT to build alternative reference translations for parameter optimization (Madnani et al. [sent-138, score-0.379]

46 A source phrase f in a sentence being translated may therefore be aligned to a variety of target phrases. [sent-143, score-0.384]

47 Therefore, the translation distribution of some phrase is globally estimated from a training corpus independently of the actual context of that phrase. [sent-146, score-0.578]

48 On Figure 3, phrase f has at least two distinct senses: one represented by set E, which in our example corresponds to the appropriate sense for a particular occurrence of f in a test sentence; and one which corresponds to translation e5. [sent-147, score-0.612]

49 Footnote 9: Context is in fact taken into account to some extent by the target language model, which should score more highly translations that are more appropriate given a target translation hypothesis being built. [sent-152, score-0.736]

50 In fact, in this work we consider the target language model as the main source of information for selecting among acceptable target phrases (target language paraphrases). [sent-153, score-0.369]

51 Figure 3: Example of possible source equivalents and translations for phrase occurrence f “un bon avocat” in the sentence “L’embauche d’un bon avocat est cruciale quelle que soit l’activité” (“Hiring a good lawyer is crucial to any business”). [sent-154, score-0.625]

52 Set E represents target phrase types that are acceptable translations given the particular context of f, and set F represents source phrase types that can be in a paraphrasing relation to f depending on the context they appear in. [sent-155, score-0.729]

53 Taking an extreme view on this issue, it is in fact desirable that, when estimating phrase translation probabilities for a phrase f, translations of incompatible senses not be considered. [sent-159, score-0.873]

54 biphrase from the training corpus, C(f) the context of some source phrase, C(fk) the context of a particular example of f in the training corpus, simphr a function indicating the contextual similarity between two phrase contexts, and ej is any possible translation of f. [sent-162, score-0.996]
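The estimator sketched in sentence 54 replaces uniform counts by simphr(C(f), C(fk)) weights. A minimal reading, with a word-set Jaccard overlap standing in for simphr (the paper's actual similarity function may differ, and the context data below is invented):

```python
from collections import defaultdict

def jaccard(a, b):
    """A simple stand-in for simphr: word-set overlap in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 0.0

def contextual_translation_probs(test_context, examples, simphr=jaccard):
    """Weight each training example fk of the phrase by the contextual
    similarity simphr(C(f), C(fk)) instead of a uniform count, so that
    examples of an inappropriate sense contribute only marginally to
    the translation distribution."""
    weights = defaultdict(float)
    for example_context, e_j in examples:
        weights[e_j] += simphr(test_context, example_context)
    total = sum(weights.values())
    return {e: w / total for e, w in weights.items()} if total else {}

# "un bon avocat" in a hiring context vs. a food context (toy data)
examples = [({"hire", "good", "firm"}, "lawyer"),
            ({"ripe", "salad", "green"}, "avocado")]
probs = contextual_translation_probs({"hire", "good", "crucial"}, examples)
```

With the hiring-like test context, the food-sense example receives zero weight, so the "avocado" translation is effectively suppressed, which is the behavior described for polysemous phrases above.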

55 The problem of modeling phrase translation is however not limited to inappropriate training examples. [sent-163, score-0.508]

56 For various reasons, legitimate occurrences of source phrases may not be considered when building a phrase’s translation distribution. [sent-164, score-0.618]

57 We describe those cases by considering the possible source phrases pi from Figure 3: p1’s only translation, e1, is a common translation with f; each contextually-appropriate example of p1 should reinforce the probability of e1 being a translation for f. [sent-165, score-0.667]

58 Translation e6 should correspond to contextually-inappropriate examples of p2, so e6 should not be considered as a new potential translation for f. [sent-167, score-0.428]

59 shares a translation with f, e2, but this is due to the polysemous nature of this translation. [sent-170, score-0.391]

60 Again, all examples of p4 should be found contextually-inappropriate with f, and their translations should not be considered when estimating the translations of f. [sent-171, score-0.497]

61 List of potential paraphrases: some mechanism for finding potential paraphrases for source phrases is required, and several such mechanisms could be combined. [sent-175, score-0.831]

62 Contextual similarity measure: a similarity measure between the contexts of two phrases or two potential local paraphrases is required. [sent-178, score-0.83]

63 Robust translation evaluation: our approach is designed to reinforce estimates for any contextually-appropriate translations of a phrase, as shown by set E on Figure 3. [sent-181, score-0.616]

64 , 2006) and the use of paraphrases for reference translations (Kauchak and Barzilay, 2006). [sent-185, score-0.78]

65 In this paper, we want to evaluate whether an endogenous approach for finding paraphrases can lead to some improvement in translation performance. [sent-195, score-0.961]

66 Note that we will not consider in this initial work the possibility of adding new translations to phrases (such as e4 for f on Figure 3) as it adds complexity and should be investigated when the other simpler cases can be handled successfully. [sent-196, score-0.37]

67 In the following section, we describe experiments in which the original bilingual corpus is the only resource used to find potential paraphrases and to estimate phrase translations in context. [sent-197, score-1.072]

68 2 Example-based Paraphrasing SMT systems We also built systems that exploit phrase and paraphrase context in the form of two additional models, pcont and ppara, described in section 3. [sent-207, score-0.448]

69 These models are added to the list of models used to evaluate the various translations of a phrase in the appropriate phrase tables, and are optimized with the other models by standard MERT. [sent-215, score-0.581]
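Adding pcont and ppara to a phrase-based system means they simply become two more feature functions of the standard log-linear model, with MERT tuning their weights; a sketch (feature names and values are illustrative):

```python
import math

def loglinear_score(features, weights):
    """Score of a translation hypothesis under a log-linear model:
    pcont and ppara enter as two extra feature functions whose
    weights are tuned by MERT alongside the standard models."""
    return sum(weights[name] * math.log(features[name])
               for name in features)

# Hypothetical feature values for one hypothesis, with unit weights
weights = {"p_tm": 1.0, "p_lm": 1.0, "pcont": 1.0}
score = loglinear_score({"p_tm": 0.5, "p_lm": 0.25, "pcont": 0.5}, weights)
```

MERT then searches for the weight vector that minimizes an error metric (e.g. 1 − BLEU) over the development set, treating pcont and ppara no differently from the other models.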

70 In order to model context, we modified the source texts so that each phrase becomes unique in the phrase table, i.e. [sent-216, score-0.411]

71 Exact phrase examples add at least a translation count of 1, i.e. [sent-239, score-0.581]

72 their translation is always taken into account to estimate pcont. [sent-241, score-0.392]

73 Paraphrase examples add a translation count of 0 if length = 0, i.e. [sent-242, score-0.428]

74 their translation is not taken into account at all if surrounding n-gram similarity is too low. [sent-244, score-0.407]
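The gating in sentences 71–74 rests on a surrounding n-gram similarity between the test occurrence and a (para)phrase example. One simple reading of the lengthleft/lengthright idea, sketched below (the exact feature definition in the paper may differ; sentences and spans are toy data):

```python
def common_suffix_len(a, b, max_n=2):
    """Number of matching tokens at the end of two contexts, capped
    at max_n."""
    n = 0
    while n < max_n and n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def surrounding_ngram_length(sent_a, span_a, sent_b, span_b, max_n=2):
    """Matched left-context plus right-context length around two
    phrase occurrences (spans are (start, end) token indices).
    A value of 0 means the surrounding contexts share nothing, so a
    paraphrase example would contribute a translation count of 0."""
    left = common_suffix_len(sent_a[:span_a[0]], sent_b[:span_b[0]], max_n)
    # Reverse the right contexts so a common prefix becomes a suffix
    right = common_suffix_len(sent_a[span_a[1]:][::-1],
                              sent_b[span_b[1]:][::-1], max_n)
    return left + right

sa = "hiring a good lawyer is crucial".split()
sb = "finding a good lawyer is hard".split()
length = surrounding_ngram_length(sa, (2, 4), sb, (2, 4))
```

Under this reading, an exact phrase example always contributes (count at least 1), while a paraphrase example whose matched length is 0 is discarded entirely when estimating pcont.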

75 We implemented the following strategy to find paraphrases for phrases in the test file. [sent-251, score-0.726]

76 Figure 6: Examples of paraphrases in context from the development file. [sent-257, score-0.638]

77 The input sentence (IS) contains a source phrase of interest (in bold), the paraphrase example (PE) contains a paraphrase of that source phrase (in bold) for which a paraphrase translation (PT) is known. [sent-258, score-1.318]

78 Paraphrases p for a phrase f are found by pivot: all target language phrases e aligned to f are first extracted, and all source language phrases p aligned to e are then extracted. [sent-259, score-1.259]
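The pivot extraction just described can be sketched directly from the phrase table's alignments (the biphrase entries below are invented for illustration):

```python
from collections import defaultdict

def pivot_paraphrases(biphrases):
    """Pivot paraphrase extraction: for each source phrase f, collect
    the target phrases e aligned to f, then every other source phrase
    p aligned to any such e. Returns f -> set of candidates p != f."""
    f2e, e2f = defaultdict(set), defaultdict(set)
    for f, e in biphrases:
        f2e[f].add(e)
        e2f[e].add(f)
    return {f: {p for e in es for p in e2f[e]} - {f}
            for f, es in f2e.items()}

# Hypothetical English-French phrase-table entries
biphrases = [("Balkan War", "guerre des Balkans"),
             ("war in the Balkans", "guerre des Balkans"),
             ("Balkan War", "guerre balkanique")]
paras = pivot_paraphrases(biphrases)
```

As noted in footnote 15, several pivot phrases may be extracted for a given phrase, some of them bad candidates, which is why contextual validation of the candidates matters.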

79 Figure 6 shows examples of paraphrases in context with high similarity with some original phrase, and Figure 7 provides various statistics on the paraphrases extracted on the test file. [sent-261, score-1.363]

80 (Callison-Burch et al., 2005) report it to be an optimal sample size for estimating phrase translation probabilities. [sent-265, score-0.508]

81 Figure 7: Numbers of paraphrased phrases and numbers of paraphrases per phrase length. [sent-267, score-0.934]

82 Results on French to English translation are less positive: neither cont nor para alone improves over the baseline on any metric. [sent-275, score-0.489]

83 Verbs, whose translation improved slightly, are strongly inflected in French, so finding examples for a given form is more difficult than for less inflected word categories, as is finding paraphrases with the appropriate inflection. [sent-279, score-1.059]

84 Also, pivoting via English is one reason why paraphrases obtained via a low-inflected language can be of varying quality. [sent-280, score-0.65]

85 These results confirm that translation performance can be improved by exploiting context and paraphrases in the original training corpus only. [sent-283, score-1.025]

86 We next attempted to measure whether some improvement in the quality of the paraphrases used would have some measurable impact on translation performance. [sent-284, score-0.923]

87 To this end, we devised a semi-oracle experiment in the following way: the source and target test files were automatically aligned, and for each source phrase possible target phrases (i. [sent-285, score-0.627]

88 In this way, we exploit the information that paraphrases can at least produce the desired translation, but they may also propose other incorrect translations and/or be present in very few examples. [sent-288, score-0.78]

89 Under this condition, this result shows that the higher the quality of the paraphrases used, the more translation quality can be improved. Footnote 15: Several pivot phrases may in fact have been automatically extracted for a given phrase, some of which may be bad candidates. [sent-296, score-1.119]

90 This is in line with works that make use of human-made paraphrases to improve translation quality (Schroeder et al. [sent-318, score-0.923]

91 Table 10 presents a typology of paraphrases found in our development set and classifies the impact of using them for phrase translation estimation. [sent-321, score-1.109]

92 Conclusion and future work We have introduced an original way of exploiting both context and paraphrasing for the estimation of phrase translations in phrase-based SMT. [sent-326, score-0.592]

93 To our knowledge, this is the first time that paraphrases acquired in an endogenous manner have been shown to improve translation performance, which shows that bilingual corpora can be better exploited than they typically are. [sent-327, score-1.08]

94 Our experiments further showed the promise of our approach when paraphrases of higher quality are available. [sent-328, score-0.568]

95 Our future work includes three main areas: first, we want to improve the modeling of context, notably by working on techniques inspired by Information Retrieval to quickly access contextually-similar examples of source phrases in bilingual corpora. [sent-330, score-0.406]

96 Such contextual sampling on large bilingual corpora for phrases and their paraphrases, which could integrate more complex linguistic information, will allow us to assess our approach on more challenging conditions. [sent-331, score-0.396]

97 Second, we will combine paraphrases obtained via different techniques and resources, which 665 will allow us to also learn translation distributions for phrases absent from the original corpus. [sent-333, score-1.113]

98 Lastly, we want to also exploit paraphrases for the additional translations that they propose (such as e4 on Figure 3) and that would be contextually similar in the target language to other existing translations of a given phrase or that could even represent a new sense of the original phrase. [sent-334, score-1.328]

99 Improving statistical machine translation by paraphrasing the training data. [sent-351, score-0.48]

100 Explorations in using grammatical dependencies for contextual phrase translation disambiguation. [sent-423, score-0.583]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('paraphrases', 0.568), ('translation', 0.355), ('translations', 0.212), ('phrases', 0.158), ('phrase', 0.153), ('paraphrase', 0.149), ('smt', 0.138), ('para', 0.134), ('mirkin', 0.131), ('paraphrasing', 0.125), ('source', 0.105), ('carpuat', 0.098), ('contextually', 0.098), ('marton', 0.089), ('ei', 0.084), ('stroppa', 0.082), ('cle', 0.082), ('pivoting', 0.082), ('aziz', 0.077), ('contextual', 0.075), ('schroeder', 0.074), ('examples', 0.073), ('context', 0.07), ('bilingual', 0.07), ('haque', 0.066), ('aur', 0.066), ('bannard', 0.064), ('appropriate', 0.063), ('eamt', 0.059), ('biphrase', 0.058), ('bsl', 0.058), ('ej', 0.058), ('onishi', 0.058), ('max', 0.056), ('war', 0.055), ('lien', 0.055), ('madnani', 0.055), ('paraphrased', 0.055), ('french', 0.054), ('target', 0.053), ('similarity', 0.052), ('lopez', 0.051), ('corpora', 0.049), ('reinforce', 0.049), ('lattices', 0.048), ('translate', 0.047), ('sampling', 0.044), ('resnik', 0.043), ('sentential', 0.042), ('translated', 0.041), ('occurrence', 0.041), ('andy', 0.041), ('proceedings', 0.04), ('translating', 0.04), ('additionnally', 0.038), ('amendments', 0.038), ('avocat', 0.038), ('balkan', 0.038), ('balkans', 0.038), ('biphrases', 0.038), ('bon', 0.038), ('eii', 0.038), ('endogenous', 0.038), ('hawai', 0.038), ('lama', 0.038), ('lengthleft', 0.038), ('lengthright', 0.038), ('lucia', 0.038), ('numoccs', 0.038), ('pcont', 0.038), ('ppara', 0.038), ('shachar', 0.038), ('simpara', 0.038), ('specia', 0.038), ('vicinity', 0.038), ('pivot', 0.038), ('gimpel', 0.038), ('lastly', 0.038), ('cc', 0.037), ('estimate', 0.037), ('pk', 0.036), ('polysemous', 0.036), ('contrastive', 0.036), ('chris', 0.035), ('apparent', 0.034), ('appropriateness', 0.033), ('typology', 0.033), ('hildebrand', 0.033), ('accessing', 0.033), ('cancedda', 0.033), ('dymetman', 0.033), ('kauchak', 0.033), ('nitin', 0.033), ('aligned', 0.032), ('original', 0.032), ('unseen', 0.031), ('du', 0.03), ('bonnie', 0.03), ('monolingual', 0.03), ('validating', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation

Author: Aurelien Max

Abstract: In this article, an original view on how to improve phrase translation estimates is proposed. This proposal is grounded on two main ideas: first, that appropriate examples of a given phrase should participate more in building its translation distribution; second, that paraphrases can be used to better estimate this distribution. Initial experiments provide evidence of the potential of our approach and its implementation for effectively improving translation performance.

2 0.45538244 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices

Author: Jinhua Du ; Jie Jiang ; Andy Way

Abstract: For resource-limited language pairs, coverage of the test set by the parallel corpus is an important factor that affects translation quality in two respects: 1) out of vocabulary words; 2) the same information in an input sentence can be expressed in different ways, while current phrase-based SMT systems cannot automatically select an alternative way to transfer the same information. Therefore, given limited data, in order to facilitate translation from the input side, this paper proposes a novel method to reduce the translation difficulty using source-side lattice-based paraphrases. We utilise the original phrases from the input sentence and the corresponding paraphrases to build a lattice with estimated weights for each edge to improve translation quality. Compared to the baseline system, our method achieves relative improvements of 7.07%, 6.78% and 3.63% in terms of BLEU score on small, medium and large- scale English-to-Chinese translation tasks respectively. The results show that the proposed method is effective not only for resourcelimited language pairs, but also for resourcesufficient pairs to some extent.

3 0.39848486 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

Author: Philip Resnik ; Olivia Buzek ; Chang Hu ; Yakov Kronrod ; Alex Quinn ; Benjamin B. Bederson

Abstract: Targeted paraphrasing is a new approach to the problem of obtaining cost-effective, reasonable quality translation that makes use of simple and inexpensive human computations by monolingual speakers in combination with machine translation. The key insight behind the process is that it is possible to spot likely translation errors with only monolingual knowledge of the target language, and it is possible to generate alternative ways to say the same thing (i.e. paraphrases) with only monolingual knowledge of the source language. Evaluations demonstrate that this approach can yield substantial improvements in translation quality.

4 0.25195614 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts

Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng

Abstract: We present PEM, the first fully automatic metric to evaluate the quality of paraphrases, and consequently, that of paraphrase generation systems. Our metric is based on three criteria: adequacy, fluency, and lexical dissimilarity. The key component in our metric is a robust and shallow semantic similarity measure based on pivot language N-grams that allows us to approximate adequacy independently of lexical similarity. Human evaluation shows that PEM achieves high correlation with human judgments.

5 0.24479079 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

Author: Lei Shi ; Rada Mihalcea ; Mingjun Tian

Abstract: In this paper, we introduce a method that automatically builds text classifiers in a new language by training on already labeled data in another language. Our method transfers the classification knowledge across languages by translating the model features and by using an Expectation Maximization (EM) algorithm that naturally takes into account the ambiguity associated with the translation of a word. We further exploit the readily available unlabeled data in the target language via semisupervised learning, and adapt the translated model to better fit the data distribution of the target language.

6 0.21750189 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

7 0.20924528 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

8 0.1726598 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

9 0.1677102 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

10 0.15557334 39 emnlp-2010-EMNLP 044

11 0.15023191 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

12 0.12951656 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

13 0.12871756 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs

14 0.11667896 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

15 0.11523923 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping

16 0.11458925 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

17 0.1091423 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

18 0.10666285 36 emnlp-2010-Discriminative Word Alignment with a Function Word Reordering Model

19 0.10038949 76 emnlp-2010-Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-Based Translation

20 0.090016641 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.366), (1, -0.473), (2, -0.212), (3, 0.051), (4, -0.051), (5, 0.148), (6, 0.007), (7, 0.069), (8, 0.167), (9, -0.058), (10, 0.006), (11, -0.024), (12, 0.012), (13, 0.023), (14, -0.066), (15, 0.024), (16, -0.049), (17, 0.084), (18, 0.094), (19, 0.182), (20, 0.045), (21, 0.043), (22, -0.095), (23, -0.015), (24, -0.164), (25, 0.122), (26, -0.075), (27, -0.097), (28, 0.112), (29, 0.011), (30, 0.031), (31, -0.004), (32, 0.014), (33, -0.074), (34, -0.051), (35, 0.067), (36, 0.006), (37, -0.003), (38, -0.045), (39, -0.019), (40, 0.028), (41, 0.055), (42, 0.04), (43, -0.006), (44, -0.048), (45, 0.01), (46, 0.026), (47, -0.005), (48, -0.033), (49, -0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96169305 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation

Author: Aurelien Max

Abstract: In this article, an original view on how to improve phrase translation estimates is proposed. This proposal is grounded on two main ideas: first, that appropriate examples of a given phrase should participate more in building its translation distribution; second, that paraphrases can be used to better estimate this distribution. Initial experiments provide evidence of the potential of our approach and its implementation for effectively improving translation performance.

2 0.94485742 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices

Author: Jinhua Du ; Jie Jiang ; Andy Way

Abstract: For resource-limited language pairs, coverage of the test set by the parallel corpus is an important factor that affects translation quality in two respects: 1) out of vocabulary words; 2) the same information in an input sentence can be expressed in different ways, while current phrase-based SMT systems cannot automatically select an alternative way to transfer the same information. Therefore, given limited data, in order to facilitate translation from the input side, this paper proposes a novel method to reduce the translation difficulty using source-side lattice-based paraphrases. We utilise the original phrases from the input sentence and the corresponding paraphrases to build a lattice with estimated weights for each edge to improve translation quality. Compared to the baseline system, our method achieves relative improvements of 7.07%, 6.78% and 3.63% in terms of BLEU score on small, medium and large- scale English-to-Chinese translation tasks respectively. The results show that the proposed method is effective not only for resourcelimited language pairs, but also for resourcesufficient pairs to some extent.

3 0.84351647 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

Author: Philip Resnik ; Olivia Buzek ; Chang Hu ; Yakov Kronrod ; Alex Quinn ; Benjamin B. Bederson

Abstract: Targeted paraphrasing is a new approach to the problem of obtaining cost-effective, reasonable quality translation that makes use of simple and inexpensive human computations by monolingual speakers in combination with machine translation. The key insight behind the process is that it is possible to spot likely translation errors with only monolingual knowledge of the target language, and it is possible to generate alternative ways to say the same thing (i.e. paraphrases) with only monolingual knowledge of the source language. Evaluations demonstrate that this approach can yield substantial improvements in translation quality.

4 0.68881726 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts

Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng

Abstract: We present PEM, the first fully automatic metric to evaluate the quality of paraphrases, and consequently, that of paraphrase generation systems. Our metric is based on three criteria: adequacy, fluency, and lexical dissimilarity. The key component in our metric is a robust and shallow semantic similarity measure based on pivot language N-grams that allows us to approximate adequacy independently of lexical similarity. Human evaluation shows that PEM achieves high correlation with human judgments.

5 0.64124578 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

Author: Samidh Chatterjee ; Nicola Cancedda

Abstract: Minimum Error Rate Training is the most widely used algorithm for log-linear model parameter training in state-of-the-art Statistical Machine Translation systems. In its original formulation, the algorithm uses N-best lists output by the decoder to grow the Translation Pool that shapes the surface on which the actual optimization is performed. Recent work has been done to extend the algorithm to use the entire translation lattice built by the decoder, instead of N-best lists. We propose here a third, intermediate way, consisting in growing the translation pool using samples randomly drawn from the translation lattice. We empirically measure a systematic improvement in the BLEU scores compared to training using N-best lists, without suffering the increase in computational complexity associated with operating with the whole lattice.
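The sampling idea can be sketched as a random walk over the decoder lattice. In this toy version the lattice structure is hypothetical and paths are drawn uniformly; the paper samples according to translation scores.

```python
# Illustrative sketch of growing a MERT translation pool by sampling
# random paths from a decoder lattice instead of taking an N-best list.

import random

def sample_path(lattice, start, end, rng):
    """Random walk from start to end; lattice maps a node to a list of
    (phrase, next_node) edges."""
    node, phrases = start, []
    while node != end:
        phrase, node = rng.choice(lattice[node])
        phrases.append(phrase)
    return " ".join(phrases)

def grow_pool(lattice, start, end, n_samples, seed=0):
    rng = random.Random(seed)
    # A set de-duplicates, so the pool holds distinct hypotheses.
    return {sample_path(lattice, start, end, rng) for _ in range(n_samples)}

lattice = {
    0: [("the cat", 1), ("a cat", 1)],
    1: [("sleeps", 2), ("is sleeping", 2)],
    2: [],
}
pool = grow_pool(lattice, 0, 2, 20)
```

With enough samples the pool approaches the diversity of the full lattice while each draw costs only one walk, which is the complexity trade-off the abstract describes.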

6 0.6120019 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

7 0.59910727 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

8 0.55647683 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

9 0.48395914 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

10 0.43604177 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

11 0.41873959 39 emnlp-2010-EMNLP 044

12 0.39205024 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

13 0.38548839 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

14 0.38368261 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

15 0.37897193 76 emnlp-2010-Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-Based Translation

16 0.36972356 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs

17 0.3674835 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping

18 0.3310169 1 emnlp-2010-"Poetic" Statistical Machine Translation: Rhyme and Meter

19 0.32599697 19 emnlp-2010-Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation

20 0.31379196 36 emnlp-2010-Discriminative Word Alignment with a Function Word Reordering Model


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.015), (12, 0.033), (29, 0.092), (30, 0.429), (52, 0.061), (56, 0.056), (66, 0.123), (72, 0.049), (76, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.952241 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping

Author: Tara McIntosh

Abstract: Multi-category bootstrapping algorithms were developed to reduce semantic drift. By extracting multiple semantic lexicons simultaneously, a category’s search space may be restricted. The best results have been achieved through reliance on manually crafted negative categories. Unfortunately, identifying these categories is non-trivial, and their use shifts the unsupervised bootstrapping paradigm towards a supervised framework. We present NEG-FINDER, the first approach for discovering negative categories automatically. NEG-FINDER exploits unsupervised term clustering to generate multiple negative categories during bootstrapping. Our algorithm effectively removes the necessity of manual intervention and formulation of negative categories, with performance closely approaching that obtained using negative categories defined by a domain expert.
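The role negative categories play in bootstrapping can be shown with a minimal filter step. Everything here is a toy stand-in: the similarity scores and the single negative cluster are hypothetical, and NEG-FINDER's actual term-clustering component is not reproduced.

```python
# Toy bootstrapping step: keep only candidate terms that score closer
# to the positive lexicon than to any (discovered) negative category,
# which is how negative categories restrict the search space.

def bootstrap_step(candidates, positive_lexicon, negative_clusters, score):
    """score(term, lexicon) is a similarity stand-in, e.g. distributional
    similarity to the lexicon's seed terms."""
    kept = []
    for term in candidates:
        pos = score(term, positive_lexicon)
        neg = max((score(term, c) for c in negative_clusters), default=0.0)
        if pos > neg:  # semantic-drift guard
            kept.append(term)
    return kept

# Hypothetical similarity table for illustration.
sim = {("aspirin", "drugs"): 0.9, ("aspirin", "companies"): 0.2,
       ("bayer", "drugs"): 0.3, ("bayer", "companies"): 0.8}
score = lambda term, lex: sim.get((term, lex), 0.0)
kept = bootstrap_step(["aspirin", "bayer"], "drugs", ["companies"], score)
```

Here "bayer" drifts toward the competing category and is rejected, mirroring the drift-reduction effect the abstract attributes to negative categories.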

same-paper 2 0.83343184 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation

Author: Aurelien Max

Abstract: In this article, an original view on how to improve phrase translation estimates is proposed. This proposal is grounded on two main ideas: first, that appropriate examples of a given phrase should participate more in building its translation distribution; second, that paraphrases can be used to better estimate this distribution. Initial experiments provide evidence of the potential of our approach and its implementation for effectively improving translation performance.
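The paper's first idea, letting appropriate examples "participate more" in a phrase's translation distribution, amounts to a weighted relative-frequency estimate. The sketch below uses a toy word-overlap similarity as a stand-in for the paper's actual notion of example appropriateness.

```python
# Hedged sketch: estimate p(target | source phrase) by weighting each
# training example by how well its context matches the current input
# context, instead of counting all examples equally.

from collections import defaultdict

def weighted_translation_distribution(examples, similarity, context):
    """examples: list of (example_context, target_phrase) pairs observed
    for one source phrase; similarity scores context appropriateness."""
    scores = defaultdict(float)
    for ex_context, target in examples:
        scores[target] += similarity(ex_context, context)
    total = sum(scores.values()) or 1.0
    return {t: s / total for t, s in scores.items()}

# Toy similarity: bag-of-words overlap between contexts.
sim = lambda a, b: len(set(a.split()) & set(b.split()))

dist = weighted_translation_distribution(
    [("bank of the river", "rive"), ("bank of the river", "rive"),
     ("bank account balance", "banque")],
    sim, "walked along the river",
)
```

In a river context the financial-sense example contributes nothing, so the distribution concentrates on the contextually appropriate translation; the paper's second idea would additionally pool counts from paraphrases of the source phrase.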

3 0.81287658 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

Author: Amr Ahmed ; Eric Xing

Abstract: With the proliferation of user-generated articles over the web, it becomes imperative to develop automated methods that are aware of the ideological bias implicit in a document collection. While there exist methods that can classify the ideological bias of a given document, little has been done toward understanding the nature of this bias on a topical level. In this paper we address the problem of modeling ideological perspective on a topical level using a factored topic model. We develop efficient inference algorithms using Collapsed Gibbs sampling for posterior inference, and give various evaluations and illustrations of the utility of our model on various document collections with promising results. Finally we give a Metropolis-Hastings inference algorithm for a semi-supervised extension with decent results.

4 0.59670824 31 emnlp-2010-Constraints Based Taxonomic Relation Classification

Author: Quang Do ; Dan Roth

Abstract: Determining whether two terms in text have an ancestor relation (e.g. Toyota and car) or a sibling relation (e.g. Toyota and Honda) is an essential component of textual inference in NLP applications such as Question Answering, Summarization, and Recognizing Textual Entailment. Significant work has been done on developing stationary knowledge sources that could potentially support these tasks, but these resources often suffer from low coverage, noise, and are inflexible when needed to support terms that are not identical to those placed in them, making their use as general purpose background knowledge resources difficult. In this paper, rather than building a stationary hierarchical structure of terms and relations, we describe a system that, given two terms, determines the taxonomic relation between them using a machine learning-based approach that makes use of existing resources. Moreover, we develop a global constraint optimization inference process and use it to leverage an existing knowledge base also to enforce relational constraints among terms and thus improve the classifier predictions. Our experimental evaluation shows that our approach significantly outperforms other systems built upon existing well-known knowledge sources.
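One example of the relational constraints such global inference can enforce is transitivity of the ancestor relation: if the classifier predicts ancestor(a, b) and ancestor(b, c), then ancestor(a, c) should also hold. The closure computation below is a generic illustration of that single constraint, not the paper's full constrained-optimization procedure.

```python
# Transitive closure over predicted ancestor pairs: a simple relational
# constraint that pairwise classifier outputs alone would not guarantee.

def transitive_closure(ancestor_pairs):
    closure = set(ancestor_pairs)
    changed = True
    while changed:
        changed = False
        # Snapshot with list() so we can add pairs while iterating.
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

pairs = transitive_closure({("vehicle", "car"), ("car", "toyota")})
```

Disagreements between the closure and the classifier's raw predictions are exactly where a constrained inference step can override locally wrong decisions.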

5 0.53329259 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts

Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng

Abstract: We present PEM, the first fully automatic metric to evaluate the quality of paraphrases, and consequently, that of paraphrase generation systems. Our metric is based on three criteria: adequacy, fluency, and lexical dissimilarity. The key component in our metric is a robust and shallow semantic similarity measure based on pivot language N-grams that allows us to approximate adequacy independently of lexical similarity. Human evaluation shows that PEM achieves high correlation with human judgments.

6 0.53013283 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

7 0.51702356 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

8 0.51298571 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

9 0.50835019 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

10 0.50535566 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

11 0.50471514 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices

12 0.48656213 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

13 0.48604354 76 emnlp-2010-Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-Based Translation

14 0.47948277 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

15 0.47365648 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

16 0.47358567 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

17 0.47342968 51 emnlp-2010-Function-Based Question Classification for General QA

18 0.47263163 12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web

19 0.47239032 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

20 0.46985474 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities