emnlp emnlp2011 emnlp2011-76 knowledge-graph by maker-knowledge-mining

76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts


Source: pdf

Author: Gennadi Lembersky ; Noam Ordan ; Shuly Wintner

Abstract: We investigate the differences between language models compiled from original target-language texts and those compiled from texts manually translated to the target language. Corroborating established observations of Translation Studies, we demonstrate that the latter are significantly better predictors of translated sentences than the former, and hence fit the reference set better. Furthermore, translated texts yield better language models for statistical machine translation than original texts.

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 We investigate the differences between language models compiled from original target-language texts and those compiled from texts manually translated to the target language. [sent-6, score-1.471]

2 Corroborating established observations of Translation Studies, we demonstrate that the latter are significantly better predictors of translated sentences than the former, and hence fit the reference set better. [sent-7, score-0.691]

3 Furthermore, translated texts yield better language models for statistical machine translation than original texts. [sent-8, score-0.871]

4 In this work we explore the differences between LMs compiled from texts originally written in the target language and LMs compiled from translated texts. [sent-12, score-1.259]

5 The motivation for our work stems from much research in Translation Studies that suggests that original texts are significantly different from translated ones in various aspects (Gellerstam, 1986) . [sent-13, score-0.772]

6 Our research question is whether a language model compiled from translated texts may similarly improve the results of machine translation. [sent-22, score-0.906]

7 We test this hypothesis on several translation tasks, where the target language is always English. [sent-23, score-0.138]

8 For each language pair we build two English language models from two types of corpora: texts originally written in English, and human translations from the source language into English. [sent-24, score-0.521]

9 We show that for each language pair, the latter language model better fits a set of reference translations in terms of perplexity. [sent-25, score-0.231]

10 Research in Translation Studies suggests that all translated texts, irrespective of source language, share some so-called translation universals. [sent-27, score-0.616]

11 Consequently, translated texts from several languages to a single target language resemble each other along various axes. [sent-28, score-0.805]

12 To test this hypothesis, we compile additional English LMs, this time using texts translated to English from languages other than the source. [sent-29, score-0.814]

13 Again, we use perplexity to assess the fit of these LMs to reference sets of translated-to-English sentences. [sent-30, score-0.326]

14 Whereas they outperform original-based LMs, LMs compiled from texts that were translated from the source language still fit the reference set best. [sent-32, score-1.159]

15 We use four types of LMs: original; translated from the source language; [sent-35, score-0.462]

16 translated from other languages; and a mixture of translations from several languages. [sent-37, score-0.64]

17 We show that the translated-from-source-language LMs provide a significant improvement in the quality of the translation output over all other LMs, and that the mixture LMs always outperform the original LMs. [sent-38, score-0.223]

18 This improvement persists even when the original LMs are up to ten times larger than the translated ones. [sent-39, score-0.542]

19 Original and translated texts exhibit significant, measurable differences; [sent-41, score-0.772]

20 LMs compiled from translated texts better fit translated references than LMs compiled from original texts of the same (and much larger) size (and, to a lesser extent, LMs compiled from texts translated from languages other than the source language); and [sent-42, score-2.955]

21 MT systems that use LMs based on manually translated texts significantly outperform systems that use LMs based on originally written texts. [sent-43, score-0.872]

22 It is important to emphasize that translated texts abound: many languages, especially low-resource ones, are more likely to have translated texts (religious scripts, educational materials, etc.). [sent-44, score-1.458]

23 Numerous studies suggest that translated texts are different from original ones. [sent-52, score-0.817]

24 Gellerstam (1986) compares texts written originally in Swedish and texts translated from English into Swedish. [sent-53, score-1.092]

25 He notes that the differences between them do not indicate poor translation but rather a statistical phenomenon, which he terms translationese. [sent-54, score-0.137]

26 The features of translationese were theoretically organized under the terms laws of translation and translation universals. [sent-60, score-0.306]

27 The former pertains to the fingerprints of the source text that are left in the translation product. [sent-62, score-0.187]

28 The latter pertains to the effort to standardize the translation product according to existing norms in the target language (and culture). [sent-63, score-0.132]

29 Interestingly, these two laws are in fact reflected in the architecture of statistical machine translation: interference corresponds to the translation model and standardization to the language model. [sent-64, score-0.177]

30 The combined effect of these laws creates a hybrid text that partly corresponds to the source text and partly to texts written originally in the target language but in fact belongs to neither (Frawley, 1984) . [sent-65, score-0.474]

31 Baker (1993, 1995, 1996) suggests several candidates for translation universals, which are claimed to appear in any translated text, regardless of the source language. [sent-66, score-0.616]

32 These include simplification, the tendency of translated texts to simplify the language, the message or both; and explicitation, their tendency to spell out implicit utterances that occur in the source text. [sent-67, score-0.77]

33 Baroni and Bernardini (2006) use machine learning techniques to distinguish between original and translated Italian texts, reporting 86. [sent-68, score-0.519]

34 For each of these languages, a parallel six-lingual subcorpus is extracted, including an original text and its translations into the other five languages. [sent-75, score-0.212]

35 The task is to identify the source language of translated texts, and the reported results are excellent. [sent-76, score-0.517]

36 This finding is crucial: as Baker (1996) states, translations do resemble each other; however, in accordance with the law of interference, the study of van Halteren (2008) suggests that translations from different source languages constitute different sublanguages. [sent-77, score-0.36]

37 LMs based on translations from the source language outperform LMs compiled from non-source translations, in terms of both fitness to the reference set and improving MT. [sent-79, score-0.559]

38 We show that using a LM trained on a text translated from the source language of the MT system does indeed improve the results of the translation. [sent-85, score-0.517]

39 Texts translated from one language differ from texts translated from other languages; and [sent-89, score-1.2]

40 LMs compiled from manually translated texts are better for MT as measured using BLEU than LMs compiled from original texts. [sent-90, score-1.154]

41 For each language pair we create a reference set comprising several thousand sentences written originally in the source language and manually translated to English. [sent-92, score-0.774]

42 To investigate the first hypothesis, we train two LMs for each language pair, one created from original English texts and the other from texts translated into English. [sent-95, score-1.047]

43 Then, we check which LM better fits the reference set. [sent-96, score-0.142]

44 Fitness of a LM to a set of sentences is measured in terms of perplexity (Jelinek et al.). [sent-97, score-0.188]

45 Given a language model and a test (reference) set, perplexity measures the predictive power of the language model over the test set, by looking at the average probability the model assigns to the test data. [sent-100, score-0.161]

46 Formally, the perplexity PP of a language model L on a test set W = w1 w2 . . . wN is PP(W) = P_L(w1 . . . wN)^(−1/N) = ( ∏_{i=1}^{N} 1 / P_L(wi | w1 . . . wi−1) )^(1/N). (1) [sent-102, score-0.161]

47 For the second hypothesis, we extend the experiment to LMs created from texts translated from other languages to English. [sent-107, score-0.781]
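
To make Equation (1) concrete, the comparison the paper performs can be sketched in a few lines of Python: train a small n-gram model on each candidate corpus and check which assigns the lower perplexity to a reference set. This is a minimal illustration only; the bigram order, the add-one smoothing, and the toy corpora are assumptions made here for brevity (the paper itself uses 4-gram LMs over large corpora).

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Train an add-one-smoothed bigram LM from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = len(unigrams)
    def prob(w, prev):
        # P(w | prev) with add-one smoothing over the observed vocabulary
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab)
    return prob

def perplexity(prob, test_sentences):
    """Equation (1): geometric mean of inverse conditional probabilities."""
    log_sum, n = 0.0, 0
    for sent in test_sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(tokens, tokens[1:]):
            log_sum -= math.log(prob(w, prev))
            n += 1
    return math.exp(log_sum / n)

# Toy stand-ins for an original (O) and a translated (T) training corpus:
original = [["the", "cat", "sat"], ["a", "dog", "ran"]]
translated = [["the", "cat", "sat", "down"], ["the", "dog", "sat"]]
reference = [["the", "dog", "sat", "down"]]

for name, corpus in [("O-EN", original), ("T-L", translated)]:
    lm = train_bigram_lm(corpus)
    # the corpus whose LM yields the lower PP fits the reference better
    print(name, round(perplexity(lm, reference), 2))
```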

48 For example, we test how well a LM trained on French-to-English-translated texts fits the German-to-English reference set; and how well a LM trained on German-to-English-translated texts fits the French-to-English reference set. [sent-108, score-0.79]

49 All systems use a translation model extracted from a parallel corpus that is oblivious to the direction of the translation, and one of the above-mentioned LMs. [sent-111, score-0.17]

50 This is a large multilingual corpus, containing sentences translated from several European languages. [sent-117, score-0.489]

51 However, it is organized as a collection of bilingual corpora rather than as a single multilingual one, and it is hard to identify sentences that are translated to several languages. [sent-118, score-0.536]

52 We therefore treat each bilingual sub-corpus in isolation; each such sub-corpus contains sentences translated from various languages. [sent-119, score-0.489]

53 5 million English tokens translated from each of these source languages (T-L) , as well as sentences written originally in English (O-EN) . [sent-125, score-0.757]

54 The mixture corpus (MIX) , which is designed to represent “general” translated language, is constructed by randomly selecting sentences translated from any language (excluding original English sentences) . [sent-126, score-1.042]

55 This is a bilingual French–English corpus comprising about 80% original English texts (EO) and about 20% texts translated from French (FO). [sent-129, score-1.025]

56 We first separate original English from the original French and then, for each original language, we randomly extract portions of texts of different sizes: 1M, 5M and 10M tokens from the FO corpus and 1M, 5M, 10M, 25M, 50M and 100M tokens from the EO corpus; see Table 2. [sent-130, score-0.496]
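
The portion extraction just described can be sketched as follows. This is an illustrative reconstruction that assumes portions are built by accumulating randomly selected sentences until a token budget is reached (the excerpt does not specify the exact procedure), and fo_sentences is a hypothetical pre-tokenized corpus.

```python
import random

def sample_portion(sentences, target_tokens, seed=0):
    """Randomly accumulate sentences until roughly target_tokens tokens."""
    rng = random.Random(seed)
    pool = list(sentences)
    rng.shuffle(pool)
    portion, count = [], 0
    for sent in pool:
        if count >= target_tokens:
            break
        portion.append(sent)
        count += len(sent)
    return portion

# e.g. 1M-, 5M- and 10M-token portions of the (hypothetical) FO corpus:
# portions = {n: sample_portion(fo_sentences, n) for n in (10**6, 5 * 10**6, 10**7)}
```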

57 The translated (T-HE) corpus consists of articles collected from the Israeli newspaper HaAretz over the same period of time. [sent-139, score-0.462]

58 To focus on the effect of the language model on translation quality, we design SMT training corpora to be oblivious to the direction of translation. [sent-148, score-0.179]

59 We also use the Hansard corpus: we randomly extract 50,000 sentences from the original French sub-corpora and another 50,000 sentences from the original English sub-corpora. [sent-150, score-0.168]

60 For Hebrew we use the Hebrew–English parallel corpus (Tsvetkov and Wintner, 2010), which contains sentences translated from Hebrew to English (54%) and from English to Hebrew (46%). [sent-151, score-0.527]

61 First, they are used as the test sets in the experiments that measure the perplexity of the language models. [sent-157, score-0.161]

62 For each language L we use the L-English subcorpus of Europarl (over the period of October to December 2000) , containing only sentences originally produced in language L. [sent-159, score-0.144]

63 The Hansard reference set is completely disjoint from the LM and SMT training sets and comprises only original French sentences. [sent-160, score-0.194]

64 The Hebrew-to-English reference set is an independent (disjoint) part of the Hebrew-to-English parallel corpus. [sent-161, score-0.144]

65 All sentences are originally written in Hebrew and are manually translated to English. [sent-165, score-0.613]

66 We train several 4-gram LMs for each Europarl sub-corpus, based on the corpora described in Section 3. [sent-172, score-0.322]

67 For each language L, we train a LM based on texts translated from L, on texts translated from languages other than L, and on texts originally written in English. [sent-174, score-1.18]

68 The LMs are applied to the reference set of texts translated from L, and we compute the perplexity: the fitness of the LM to the reference set. [sent-175, score-1.012]

69 Table 6 details the results, where for each sub-corpus and LM we list the number of unigrams in the test set, the number of out-of-vocabulary items (OOV) and the perplexity (PP) . [sent-176, score-0.161]

70 The lowest perplexity (reflecting the best fit) in each sub-corpus is typeset in boldface, and the highest (worst fit) is slanted. [sent-177, score-0.161]

71 For each language L, the perplexity of the LM that was created from L translations is lowest, followed immediately by the MIX LM. [sent-179, score-0.25]

72 Furthermore, the perplexity of the LM created from originally-English texts is highest in all experiments. [sent-180, score-0.414]

73 In addition, the perplexity of LMs constructed from texts translated from languages other than L always lies between these two extremes: it is a better fit of the reference set than original texts, but not as good as texts translated from L (or mixture translations) . [sent-181, score-1.913]

74 This corroborates the hypothesis that translations form a language in itself, and translations from L1 to L2 form a sub-language, related to yet different from translations from other languages to L2. [sent-182, score-0.478]

75 A possible explanation for the different perplexity results between the LMs could be the specific contents of the corpora used to compile the LMs. [sent-183, score-0.241]

76 To rule out this possibility and to further emphasize that the corpora are indeed structurally different, we conduct more experiments, in which we gradually abstract away from the domain- and content-specific features of the texts and emphasize their syntactic structure. [sent-184, score-0.356]

77 At each step, we train six language models on O- and T-texts and apply them to the reference set (adapted to the same level of abstraction, of course) . [sent-191, score-0.153]

78 As the abstraction of the text increases, we also increase the order of the LMs: From 4-grams for text without punctuation and NE abstraction to 5-grams for noun abstraction to 8-grams for full POS abstraction. [sent-192, score-0.321]
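
A sketch of the full POS abstraction step: every word in the training corpora and in the reference set is replaced by its part-of-speech tag before the higher-order LMs are trained and evaluated. The use of NLTK's tagger here is an assumption for illustration; the excerpt does not name the tagger actually used.

```python
import nltk  # assumes the "punkt" and "averaged_perceptron_tagger" data are installed

def pos_abstract(sentence):
    """Replace every word with its POS tag (full POS abstraction)."""
    tokens = nltk.word_tokenize(sentence)
    return [tag for _, tag in nltk.pos_tag(tokens)]

# The abstracted text is then treated like ordinary text: an 8-gram LM is
# trained on the tag sequences, and perplexity is measured on the abstracted
# reference set.
print(pos_abstract("Translated texts are simpler than original ones."))
```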

79 The results, which are depicted in Table 7, consistently show that the T-based LM is a better fit to the reference set, albeit to a lesser extent. [sent-193, score-0.165]

80 Although the T-based LM has more OOVs, it is a better fit to the translated text than the O-based LM: Its perplexity is lower by 20. [sent-199, score-0.682]

81 Interestingly, the O-corpus LM has more unique unigrams than the T-corpus LM, supporting the claim of Al-Shabab (1996) that translated texts have a lower type-to-token ratio. [sent-201, score-0.715]

82 The results, which are depicted in Table 9, consistently show that the T-based LM is a better fit to the reference set. [sent-203, score-0.165]

83 Clearly, then, translated LMs better fit the references than original ones, and the differences can be traced back not just to (trivial) specific lexical choice, but also to syntactic structure, as evidenced by the POS abstraction experiments. [sent-204, score-0.723]

84 In fact, in order to retain the low perplexity level of translated texts, a LM based on original texts must be approximately ten times larger. [sent-205, score-0.956]

85 The last hypothesis we test is whether a better-fitting language model yields a better machine translation system. [sent-210, score-0.138]

86 We now want to assess whether the benefits of using translated LMs carry over to scenarios where large original corpora exist. [sent-263, score-0.566]

87 We use the Hansard SMT translation model and Hansard LMs to train nine MT systems, three with varying sizes of translated texts and six with varying sizes of original texts. [sent-265, score-0.918]

88 Table 12 shows that the original English LMs should be enlarged by a factor of ten to achieve translation quality similar to that of translation-based LMs. [sent-268, score-0.179]

89 In other words, much smaller translated LMs perform better than much larger original ones, and this is true for various LM sizes. [sent-269, score-0.519]

90 We use language models computed from different types of corpora to investigate whether their fitness to a reference set of translated-to-English sentences can differentiate between them (and, hence, between the corpora on which they are based). [sent-270, score-0.312]

91 Our main findings are that LMs compiled from manually translated corpora are much better predictors of translated texts than LMs compiled from original-language corpora of the same size. [sent-271, score-1.69]

92 The results are robust, and are sustainable even when the corpora and the reference sentences are abstracted in ways that retain their syntactic structure but ignore specific word meanings. [sent-272, score-0.18]

93 Furthermore, we show that translated LMs are better predictors of translated sentences even when the LMs are compiled from texts translated from languages other than the source language. [sent-273, score-2.015]

94 However, LMs based on texts translated from the source language still outperform LMs based on texts translated from other languages. [sent-274, score-1.265]

95 We also show that MT systems based on translated-from-source-language LMs outperform MT systems based on original LMs or on LMs trained on texts translated from other languages. [sent-275, score-0.528]

96 Furthermore, our results show that original LMs require ten times more data to exhibit the same fitness to the reference set and the same translation quality as translated LMs. [sent-278, score-0.832]

97 One plausible hypothesis is that recurrent multiword expressions in the source language are frequently solved by human translations and each of these expressions converges to a set of high-quality translation equivalents which are represented in the LM. [sent-281, score-0.282]

98 This work also bears on language typology: we conjecture that LMs compiled from texts translated not from the original language, but from a closely related one, can be better than LMs compiled from texts translated from a more distant language. [sent-284, score-1.678]

99 A new approach to the study of Translationese: Machine-learning the difference between original and translated text. [sent-312, score-0.519]

100 Automatic detection of translated text and its impact on machine translation. [sent-395, score-0.462]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lms', 0.604), ('translated', 0.462), ('texts', 0.253), ('compiled', 0.191), ('perplexity', 0.161), ('lm', 0.146), ('hansard', 0.132), ('abstraction', 0.107), ('europarl', 0.106), ('reference', 0.106), ('translation', 0.099), ('translations', 0.089), ('originally', 0.089), ('fitness', 0.085), ('hebrew', 0.085), ('french', 0.08), ('mt', 0.077), ('url', 0.074), ('translationese', 0.066), ('languages', 0.066), ('mona', 0.064), ('fit', 0.059), ('original', 0.057), ('source', 0.055), ('doi', 0.051), ('benjamins', 0.051), ('isbn', 0.051), ('kurokawa', 0.049), ('corpora', 0.047), ('koehn', 0.046), ('smt', 0.045), ('studies', 0.045), ('shuly', 0.043), ('bahl', 0.043), ('laws', 0.042), ('hypothesis', 0.039), ('amsterdam', 0.038), ('differences', 0.038), ('parallel', 0.038), ('english', 0.037), ('predictors', 0.037), ('interference', 0.036), ('literary', 0.036), ('fits', 0.036), ('philipp', 0.035), ('written', 0.035), ('mixture', 0.034), ('baker', 0.033), ('aviv', 0.033), ('chrupa', 0.033), ('compile', 0.033), ('frawley', 0.033), ('gellerstam', 0.033), ('haaretz', 0.033), ('ilisei', 0.033), ('oblivious', 0.033), ('ordan', 0.033), ('originals', 0.033), ('pertains', 0.033), ('pym', 0.033), ('tel', 0.033), ('outperform', 0.033), ('mix', 0.032), ('comprises', 0.031), ('http', 0.031), ('jelinek', 0.03), ('morristown', 0.03), ('september', 0.029), ('hypotheses', 0.029), ('persistent', 0.028), ('subcorpus', 0.028), ('halteren', 0.028), ('haifa', 0.028), ('tsvetkov', 0.028), ('wintner', 0.028), ('gill', 0.028), ('honour', 0.028), ('lalit', 0.028), ('italian', 0.028), ('emphasize', 0.028), ('sentences', 0.027), ('law', 0.027), ('nj', 0.026), ('portions', 0.026), ('noam', 0.026), ('elena', 0.026), ('bleu', 0.025), ('six', 0.025), ('resemble', 0.024), ('hans', 0.024), ('israel', 0.024), ('ten', 0.023), ('differ', 0.023), ('tokens', 0.023), ('frederick', 0.022), ('gideon', 0.022), ('goutte', 0.022), ('train', 0.022), ('cl', 0.021), ('genre', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000014 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts

Author: Gennadi Lembersky ; Noam Ordan ; Shuly Wintner

Abstract: We investigate the differences between language models compiled from original target-language texts and those compiled from texts manually translated to the target language. Corroborating established observations of Translation Studies, we demonstrate that the latter are significantly better predictors of translated sentences than the former, and hence fit the reference set better. Furthermore, translated texts yield better language models for statistical machine translation than original texts.

2 0.16339122 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

Author: Amittai Axelrod ; Xiaodong He ; Jianfeng Gao

Abstract: We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.

3 0.11016459 51 emnlp-2011-Exact Decoding of Phrase-Based Translation Models through Lagrangian Relaxation

Author: Yin-Wen Chang ; Michael Collins

Abstract: This paper describes an algorithm for exact decoding of phrase-based translation models, based on Lagrangian relaxation. The method recovers exact solutions, with certificates of optimality, on over 99% of test examples. The method is much more efficient than approaches based on linear programming (LP) or integer linear programming (ILP) solvers: these methods are not feasible for anything other than short sentences. We compare our method to MOSES (Koehn et al., 2007), and give precise estimates of the number and magnitude of search errors that MOSES makes.

4 0.10020766 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation

Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng

Abstract: Many machine translation evaluation metrics have been proposed after the seminal BLEU metric, and many among them have been found to consistently outperform BLEU, demonstrated by their better correlations with human judgment. It has long been the hope that by tuning machine translation systems against these new generation metrics, advances in automatic machine translation evaluation can lead directly to advances in automatic machine translation. However, to date there has been no unambiguous report that these new metrics can improve a state-of-the-art machine translation system over its BLEU-tuned baseline. In this paper, we demonstrate that tuning Joshua, a hierarchical phrase-based statistical machine translation system, with the TESLA metrics results in significantly better human-judged translation quality than the BLEU-tuned baseline. TESLA-M in particular is simple and performs well in practice on large datasets. We release all our implementation under an open source license. It is our hope that this work will encourage the machine translation community to finally move away from BLEU as the unquestioned default and to consider the new generation metrics when tuning their systems.

5 0.097128324 125 emnlp-2011-Statistical Machine Translation with Local Language Models

Author: Christof Monz

Abstract: Part-of-speech language modeling is commonly used as a component in statistical machine translation systems, but there is mixed evidence that its usage leads to significant improvements. We argue that its limited effectiveness is due to the lack of lexicalization. We introduce a new approach that builds a separate local language model for each word and part-of-speech pair. The resulting models lead to more context-sensitive probability distributions and we also exploit the fact that different local models are used to estimate the language model probability of each word during decoding. Our approach is evaluated for Arabic- and Chinese-to-English translation. We show that it leads to statistically significant improvements for multiple test sets and also across different genres, when compared against a competitive baseline and a system using a part-of-speech model.

6 0.095432527 131 emnlp-2011-Syntactic Decision Tree LMs: Random Selection or Intelligent Design?

7 0.089493334 5 emnlp-2011-A Fast Re-scoring Strategy to Capture Long-Distance Dependencies

8 0.068139769 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models

9 0.065277956 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.

10 0.063002884 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

11 0.062165193 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training

12 0.061639547 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

13 0.060441528 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification

14 0.055728734 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax

15 0.055128627 69 emnlp-2011-Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources

16 0.053334374 62 emnlp-2011-Generating Subsequent Reference in Shared Visual Scenes: Computation vs Re-Use

17 0.053271856 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation

18 0.052724145 36 emnlp-2011-Corroborating Text Evaluation Results with Heterogeneous Measures

19 0.049663063 3 emnlp-2011-A Correction Model for Word Alignments

20 0.049204919 66 emnlp-2011-Hierarchical Phrase-based Translation Representations


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.17), (1, 0.087), (2, 0.086), (3, -0.171), (4, -0.001), (5, -0.022), (6, -0.018), (7, -0.05), (8, -0.094), (9, -0.093), (10, 0.111), (11, 0.012), (12, 0.016), (13, 0.02), (14, -0.092), (15, 0.194), (16, 0.09), (17, 0.013), (18, -0.096), (19, -0.022), (20, 0.054), (21, -0.062), (22, 0.098), (23, -0.143), (24, -0.109), (25, -0.055), (26, 0.009), (27, -0.12), (28, 0.032), (29, 0.043), (30, -0.004), (31, -0.086), (32, -0.019), (33, -0.097), (34, -0.196), (35, -0.108), (36, 0.016), (37, -0.086), (38, 0.009), (39, -0.153), (40, 0.081), (41, 0.083), (42, -0.152), (43, 0.023), (44, 0.025), (45, -0.034), (46, -0.036), (47, 0.02), (48, -0.101), (49, 0.109)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95542866 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts

Author: Gennadi Lembersky ; Noam Ordan ; Shuly Wintner

Abstract: We investigate the differences between language models compiled from original target-language texts and those compiled from texts manually translated to the target language. Corroborating established observations of Translation Studies, we demonstrate that the latter are significantly better predictors of translated sentences than the former, and hence fit the reference set better. Furthermore, translated texts yield better language models for statistical machine translation than original texts.

2 0.64512831 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

Author: Amittai Axelrod ; Xiaodong He ; Jianfeng Gao

Abstract: We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.

3 0.56249398 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models

Author: Puyang Xu ; Asela Gunawardana ; Sanjeev Khudanpur

Abstract: We propose an efficient way to train maximum entropy language models (MELM) and neural network language models (NNLM). The advantage of the proposed method comes from a more robust and efficient subsampling technique. The original multi-class language modeling problem is transformed into a set of binary problems where each binary classifier predicts whether or not a particular word will occur. We show that the binarized model is as powerful as the standard model and allows us to aggressively subsample negative training examples without sacrificing predictive performance. Empirical results show that we can train MELM and NNLM at 1% ∼ 5% of the standard complexity with no loss in performance.

4 0.53956109 131 emnlp-2011-Syntactic Decision Tree LMs: Random Selection or Intelligent Design?

Author: Denis Filimonov ; Mary Harper

Abstract: Decision trees have been applied to a variety of NLP tasks, including language modeling, for their ability to handle a variety of attributes and sparse context space. Moreover, forests (collections of decision trees) have been shown to substantially outperform individual decision trees. In this work, we investigate methods for combining trees in a forest, as well as methods for diversifying trees for the task of syntactic language modeling. We show that our tree interpolation technique outperforms the standard method used in the literature, and that, on this particular task, restricting tree contexts in a principled way produces smaller and better forests, with the best achieving an 8% relative reduction in Word Error Rate over an n-gram baseline.

5 0.45316112 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.

Author: Ashish Venugopal ; Jakob Uszkoreit ; David Talbot ; Franz Och ; Juri Ganitkevitch

Abstract: We propose a general method to watermark and probabilistically identify the structured outputs of machine learning algorithms. Our method is robust to local editing operations and provides well defined trade-offs between the ability to identify algorithm outputs and the quality of the watermarked output. Unlike previous work in the field, our approach does not rely on controlling the inputs to the algorithm and provides probabilistic guarantees on the ability to identify collections of results from one's own algorithm. We present an application in statistical machine translation, where machine translated output is watermarked at minimal loss in translation quality and detected with high recall.
Impact on Statistical Machine Translation Recent work(Resnik and Smith, 2003; Munteanu and Marcu, 2005; Uszkoreit et al. , 2010) has shown that multilingual parallel documents can be efficiently identified on the web and used as training data to improve the quality of statistical machine translation. The availability of free translation services (Google Translate, Bing Translate) and tools (Moses, Joshua) , increase the risk that the content found by parallel data mining is in fact generated by a machine, rather than by humans. In this work, we focus on statistical machine translation as an application for watermarking, with the goal of discarding documents from training if they have been generated by one’s own algorithms. To estimate the magnitude of the problem, we used parallel document mining (Uszkoreit et al. , 2010) to generate a collection of bilingual document pairs across several languages. For each document, we inspected the page content for source code that indicates the use of translation modules/plug-ins that translate and publish the translated content. We computed the proportion of the content within our corpus that uses these modules. We find that a significant proportion of the mined parallel data for some language pairs is generated via one of these translation modules. The top 3 languages pairs, each with parallel translations into English, are Tagalog (50.6%) , Hindi (44.5%) and Galician (41.9%) . While these proportions do not reflect impact on each language’s monolingual web, they are certainly high 1364 enough to affect machine translations systems that train on mined parallel data. In this work, we develop a general approach to watermark structured outputs and apply it to the outputs of a statistical machine translation system with the goal of identifying these same outputs on the web. In the context of the watermarking task defined above, we output selecting alternative translations for input source sentences. These translations often undergo simple edit and formatting operations such as case changes, sentence and word deletion or post editing, prior to publishing on the web. We want to ensure that we can still detect watermarked translations despite these edit operations. Given the rapid pace of development within machine translation, it is also important that the watermark be robust to improvements in underlying translation quality. Results from several iterations of the system within a single collection of documents should be identifiable under probabilistic bounds. While we present evaluation results for statistical machine translation, our proposed approach and associated requirements are applicable to any algorithm that produces structured results with several plausible alternatives. The alternative results can arise as a result of inherent task ambiguity (for example, there are multiple correct translations for a given input source sentence) or modeling uncertainty (for example, a model assigning equal probability to two competing results) . 3 Watermark Structured Results Selecting an alternative r0 from the space of alternatives Dk (q) can be stated as: r0= arr∈gDmk(aqx)w(r,Dk(q),h) (1) where w ranks r ∈ Dk (q) based on r’s presentwahtieorne owf a watermarking signal computed by a hashing operation h. In this approach, w and its component operation h are the only secrets held by the watermarker. 
This selection criterion is applied to all system outputs, ensuring that watermarked and non-watermarked version of a collection will never be available for comparison. A specific implementation of w within our watermarking approach can be evaluated by the following metrics: • • • False Positive Rate: how often nonFwaaltseermarked collections are falsely identified as watermarked. Recall Rate: how often watermarked collRecectiaolnls R are correctly inde wntaitfeierdm as wdat ceorl-marked. Quality Degradation: how significantly dQoueasl CN0 d Dieffegrr fdraotmio CN when evaluated by tdaoseks specific quality Cmetrics. While identification is performed at the collection level, we can scale these metrics based on the size of each collection to provide more task sensitive metrics. For example, in machine translation, we count the number of words in the collection towards the false positive and recall rates. In Section 3.1, we define a random hashing operation h and a task independent implementation of the selector function w. Section 3.2 describes how to classify a collection of watermarked results. Section 3.3 and 3.4 describes refinements to the selection and classification criteria that mitigate quality degradation. Following a comparison to related work in Section 4, we present experimental results for several languages in Section 5. 3.1 Watermarking: CN → CN0 We define a random hashing operation h that is applied to result r. It consists of two components: • A hash function applied to a structured re- sAul ht r hto f generate a lbieitd sequence cotfu a dfix reedlength. • An optional mapping that maps a single cAannd oidptaitoen raels umlta r ntog a hsaett mofa spusb -are ssiunlgtsle. Each sub-result is then hashed to generate a concatenated bit sequence for r. A good hash function produces outputs whose bits are independent. This implies that we can treat the bits for any input structured results 1365 as having been generated by a binomial distribution with equal probability of generating 1s vs 0s. This condition also holds when accumulating the bit sequences over a collection of results as long as its elements are selected uniformly from the space of possible results. Therefore, the bits generated from a collection of unwatermarked results will follow a binomial distribution with parameter p = 0.5. This result provides a null hypothesis for a statistical test on a given bit sequence, testing whether it is likely to have been generated from a binomial distribution binomial(n, p) where p = 0.5 and n is the length of the bit sequence. For a collection CN = r1 · · · rN, we can define a Fwaorte arm coalrlekc ranking funct·i·o·nr w to systematically select alternatives ri0 ∈ Dk (q) , such that the resulting CN0 is unlikely ∈to D produce bit sequences ltthinagt f Collow the p = 0.5 binomial distribution. A straightforward biasing criteria would be to select the candidate whose bit sequence exhibits the highest ratio of 1s. w can be defined as: (2) w(r,Dk(q),h) =#(|h1,(rh)(|r)) where h(r) returns the randomized bit sequence for result r, and #(x, y) counts the number of occurrences of x in sequence Selecting alternatives results to exhibit this bias will result in watermarked collections that exhibit this same bias. y. 3.2 Detecting the Watermark To classify a collection CN as watermarked or non-watermarked, we apply the hashing operation h on each element in CN and concatenate ttihoen sequences. 
eTlhemis sequence is tested against the null hypothesis that it was generated by a binomial distribution with parameter p = 0.5. We can apply a Fisherian test of statistical significance to determine whether the observed distribution of bits is unlikely to have occurred by chance under the null hypothesis (binomial with p = 0.5) . We consider a collection of results that rejects the null hypothesis to be watermarked results generated by our own algorithms. The p-value under the null hypothesis is efficiently computed by: p − value = Pn (X ≥ x) = Xi=nx?ni?pi(1 − p)n−i (3) (4) where x is the number of 1s observed in the collection, and n is the total number of bits in the sequence. Comparing this p-value against a desired significance level α, we reject the null hypothesis for collections that have Pn(X ≥ x) < α, thus deciding that such collections( were gen- erated by our own system. This classification criteria has a fixed false positive rate. Setting α = 0.05, we know that 5% of non-watermarked bit sequences will be falsely labeled as watermarked. This parameter α can be controlled on an application specific basis. By biasing the selection of candidate results to produce more 1s than 0s, we have defined a watermarking approach that exhibits a fixed false positive rate, a probabilistically bounded detection rate and a task independent hashing and selection criteria. In the next sections, we will deal with the question of robustness to edit operations and quality degradation. 3.3 Robustness and Inherent Bias We would like the ability to identify watermarked collections to be robust to simple edit operations. Even slight modifications to the elements within an item r would yield (by construction of the hash function) , completely different bit sequences that no longer preserve the biases introduced by the watermark selection function. To ensure that the distributional biases introduced by the watermark selector are preserved, we can optionally map individual results into a set of sub-results, each one representing some local structure of r. h is then applied to each subresult and the results concatenated to represent r. This mapping is defined as a component of the h operation. While a particular edit operation might affect a small number of sub-results, the majority of the bits in the concatenated bit sequence for r would remain untouched, thereby limiting the damage to the biases selected during watermark1366 ing. This is of course no defense to edit operations that are applied globally across the result; our expectation is that such edits would either significantly degrade the quality of the result or be straightforward to identify directly. For example, a sequence of words r = z1 · · · zL can be mapped into a set of consecutive n-gram sequences. Operations to edit a word zi in r will only affect events that consider the word zi. To account for the fact that alternatives in Dk (q) might now result in bit sequences of different lengths, we can generalize the biasing criteria to directly reflect the expected contribution to the watermark by defining: w(r, Dk(q), h) = Pn(X ≥ #(1, h(r))) (5) where Pn gives probabilities from binomial(n = |h(r) |,p = 0.5) . (Irn)|h,epr =en 0t. 5c)o.llection level biases: Our null hypothesis is based on the assumption that collections of results draw uniformly from the space of possible results. This assumption might not always hold and depends on the type of the results and collection. 
For example, considering a text document as a collection of sentences, we can expect that some sentences might repeat more frequently than others. This scenario is even more likely when applying a mapping into sub-results. n-gram sequences follow long-tailed or Zipfian distributions, with a small number of n-grams contributing heavily toward the total number of n-grams in a document. A random hash function guarantees that inputs are distributed uniformly at random over the output range. However, the same input will be assigned the same output deterministically. Therefore, if the distribution of inputs is heavily skewed to certain elements of the input space, the output distribution will not be uniformly distributed. The bit sequences resulting from the high frequency sub-results have the potential to generate inherently biased distributions when accumulated at the collection level. We want to choose a mapping that tends towards generating uniformly from the space of sub-results. We can empirically measure the quality of a sub-result mapping for a specific task by computing the false positive rate on non-watermarked collections. For a given significance level α, an ideal mapping would result in false positive rates close to α as well. Figure 1 shows false positive rates from 4 alternative mappings, computed on a large corpus of French documents (see Table 1for statistics) . Classification decisions are made at the collection level (documents) but the contribution to the false positive rate is based on the number of words in the classified document. We consider mappings from a result (sentence) into its 1-grams, 1 − 5-grams and 3 − 5 grams as well as trahem non-mapping case, w 3h −ere 5 tghrea mfusll a sres wuelltl is hashed. Figure 1 shows that the 1-grams and 1 − 5gram generate wsusb t-hraetsul tthse t 1h-agtr rmessu latn idn 1h −eav 5-ily biased false positive rates. The 3 − 5 gram mapping yields pfaolsseit positive r.a Ttesh ecl 3os −e t 5o gthraemir theoretically expected values. 1 Small deviations are expected since documents make different contributions to the false positive rate as a function of the number of words that they represent. For the remainder of this work, we use the 3-5 gram mapping and the full sentence mapping, since the alternatives generate inherently distributions with very high false positive rates. 3.4 Considering Quality The watermarking described in Equation 3 chooses alternative results on a per result basis, with the goal of influencing collection level bit sequences. The selection criteria as described will choose the most biased candidates available in Dk (q) . The parameter k determines the extent to which lesser quality alternatives can be chosen. If all the alternatives in each Dk (q) are of relatively similar quality, we expect minimal degradation due to watermarking. Specific tasks however can be particularly sensitive to choosing alternative results. Discriminative approaches that optimize for arg max selection like (Och, 2003; Liang et al. , 2006; Chiang et al. , 2009) train model parameters such 1In the final version of this paper we will perform sampling to create a more reliable estimate of the false positive rate that is not overly influenced by document length distributions. 1367 that the top-ranked result is well separated from its competing alternatives. 
Different queries also differ in the inherent ambiguity expected from their results; sometimes there really is just one correct result for a query, while for other queries, several alternatives might be equally good. By generalizing the definition of the w function to interpolate the estimated loss in quality and the gain in the watermarking signal, we can trade-off the ability to identify the watermarked collections against quality degradation: w(r,Dk(q),fw)− =(1 λ − ∗ λ g)ai ∗nl( or,s D(rk,(Dq)k,(fqw)) (6) Loss: The loss(r, Dk (q)) function reflects the quality degradation that results from selecting alternative r as opposed to the best ranked candidate in Dk (q)) . We will experiment with two variants: lossrank (r, Dk (q)) = (rank(r) − k)/k losscost(r, Dk(q)) = (cost(r)−cost(r1))/ cost(r1) where: • • • rank(r) : returns the rank of r within Dk (q) . cost(r) : a weighted sum of features (not cnoosrtm(ra)li:ze ad over httheed sse uarmch o space) rine a loglinear model such as those mentioned in (Och, 2003). r1: the highest ranked alternative in Dk (q) . lossrank provides a generally applicable criteria to select alternatives, penalizing selection from deep within Dk (q) . This estimate of the quality degradation does not reflect the generating model’s opinion on relative quality. losscost considers the relative increase in the generating model’s cost assigned to the alternative translation. Gain: The gain(r, Dk (q) , fw) function reflects the gain in the watermarking signal by selecting candidate r. We simply define the gain as the Pn(X ≥ #(1, h(r))) from Equation 5. ptendcuo mfrsi 0 . 204186eoxbpsecrvted0.510.25 p-value threshold (a) 1-grams mapping ptendcuo mfrsi 0 . 204186eoxbpsecrvted0.510.25 p-value threshold (c) 3 − 5-grams mapping ptendcuo mfrsi 0 . 204186eoxbpsecrvted0.510.25 ptendcuo mfrsi 0 . 204186eoxbpsecrvted0.510.25 p-value threshold (b) 1− 5-grams mapping p-value threshold (d) Full result hashing Figure 1 Comparison : of expected false positive rates against observed false positive rates for different sub-result mappings. 4 Related Work Using watermarks with the goal of transmitting a hidden message within images, video, audio and monolingual text media is common. For structured text content, linguistic approaches like (Chapman et al. , 2001; Gupta et al., 2006) use language specific linguistic and semantic expansions to introduce hidden watermarks. These expansions provide alternative candidates within which messages can be encoded. Recent publications have extended this idea to machine translation, using multiple systems and expansions to generate alternative translations. (Stutsman et al. , 2006) uses a hashing function to select alternatives that encode the hidden message in the lower order bits of the translation. In each of these approaches, the watermarker has control over the collection of results into which the watermark is to be embedded. These approaches seek to embed a hidden message into a collection of results that is selected by the watermarker. In contrast, we address the condition where the input queries are not in the watermarker’s control. 1368 The goal is therefore to introduce the watermark into all generated results, with the goal of probabilistically identifying such outputs. Our approach is also task independent, avoiding the need for templates to generate additional alternatives. 
By addressing the problem directly within the search space of a dynamic programming algorithm, we have access to high quality alternatives with well defined models of quality loss. Finally, our approach is robust to local word editing. By using a sub-result mapping, we increase the level of editing required to obscure the watermark signal; at high levels of editing, the quality of the results themselves would be significantly degraded. 5 Experiments We evaluate our watermarking approach applied to the outputs of statistical machine translation under the following experimental setup. A repository of parallel (aligned source and target language) web documents is sampled to produce a large corpus on which to evaluate the watermarking classification performance. The corpora represent translations into 4 diverse target languages, using English as the source language. Each document in this corpus can be considered a collection of un-watermarked structured results, where source sentences are queries and each target sentence represents a structured result. Using a state-of-the-art phrase-based statistical machine translation system (Och and Ney, 2004) trained on parallel documents identified by (Uszkoreit et al. , 2010) , we generate a set of 100 alternative translations for each source sentence. We apply the proposed watermarking approach, along with the proposed refinements that address task specific loss (Section 3.4) and robustness to edit operations (Section 3.3) to generate watermarked corpora. Each method is controlled via a single parameter (like k or λ) which is varied to generate alternative watermarked collections. For each parameter value, we evaluate the Recall Rate and Quality Degradation with the goal of finding a setting that yields a high recall rate, minimal quality degradation. False positive rates are evaluated based on a fixed classification significance level of α = 0.05. The false positive and recall rates are evaluated on the word level; a document that is misclassified or correctly identified contributes its length in words towards the error calculation. In this work, we use α = 0.05 during classification corresponding to an expected 5% false positive rate. The false positive rate is a function of h and the significance level α and therefore constant across the parameter values k and λ. We evaluate quality degradation on human translated test corpora that are more typical for machine translation evaluation. Each test corpus consists of 5000 source sentences randomly selected from the web and translated into each respective language. We chose to evaluate quality on test corpora to ensure that degradations are not hidden by imperfectly matched web corpora and are consistent with the kind of results often reported for machine translation systems. As with the classification corpora, we create watermarked versions at each parameter value. For a given pa1369 recall Figure 2: BLEU loss against recall of watermarked content for the baseline approach (max K-best) , rank and cost interpolation. rameter value, we measure false positive and re- call rates on the classification corpora and quality degradation on the evaluation corpora. Table 1 shows corpus statistics for the classification and test corpora and non-watermarked BLEU scores for each target language. All source texts are in English. 
5.1 Loss Interpolated Experiments Our first set of experiments demonstrates baseline performance using the watermarking criteria in Equation 5 versus the refinements suggested in Section 3.4 to mitigate quality degradation. The h function is computed on the full sentence result r with no sub-event mapping. The following methods are evaluated in Figure 2. • • Baseline method (labeled “max K-best” ): sBealescetlsin er0 purely (blaasbedel on gain Kin- bweastte”r):marking signal (Equation 5) and is parameterized by k: the number of alternatives considered for each result. Rank interpolation: incorporates rank into w, varying ptholea interpolation parameter nλ.t • Cost interpolation: incorporates cost into w, varying tohlea interpolation parameter nλ.t The observed false positive rate on the French classification corpora is 1.9%. ClassificationQuality Ta AbFHularei ankg1bdic:sehitCon#t12e08n7w39t1065o s40r7tda617tsi c#sfo1e r85n37c2018tl2a5e4s 5n0sicfeastion#adno1c68 q3u09 06ma70lietynsdegr#ad7 aw3 t534io9 0rn279dcsorp# as.e5 nN54 t08 oe369n-wceatsrmBaL21 kEe6320d. U462579 B%LEU scores are reported for the quality corpora. We consider 0.2% BLEU loss as a threshold for acceptable quality degradation. Each method is judged by its ability to achieve high recall below this quality degradation threshold. Applying cost interpolation yields the best results in Figure 2, achieving a recall of 85% at 0.2% BLEU loss, while rank interpolation achieves a recall of 76%. The baseline approach of selecting the highest gain candidate within a depth of k candidates does not provide sufficient parameterization to yield low quality degradation. At k = 2, this method yields almost 90% recall, but with approximately 0.4% BLEU loss. 5.2 Robustness Experiments In Section 5.2, we proposed mapping results into sub-events or features. We considered alternative feature mappings in Figure 1, finding that mapping sentence results into a collection of 35 grams yields acceptable false positive rates at varied levels of α. Figure 3 presents results that compare moving from the result level hashing to the 3-5 gram sub-result mapping. We show the impact of the mapping on the baseline max K-best method as well as for cost interpolation. There are substantial reductions in recall rate at the 0.2% BLEU loss level when applying sub-result mappings in cases. The cost interpolation method recall drops from 85% to 77% when using the 3-5 grams event mapping. The observed false positive rate of the 3-5 gram mapping is 4.7%. By using the 3-5 gram mapping, we expect to increase robustness against local word edit operations, but we have sacrificed recall rate due to the inherent distributional bias discussed in Section 3.3. 1370 recall Figure 3: BLEU loss against recall of watermarked content for the baseline and cost interpolation methods using both result level and 3-5 gram mapped events. 5.3 Multilingual Experiments The watermarking approach proposed here introduces no language specific watermarking operations and it is thus broadly applicable to translating into all languages. In Figure 4, we report results for the baseline and cost interpolation methods, considering both the result level and 3-5 gram mapping. We set α = 0.05 and measure recall at 0.2% BLEU degradation for translation from English into Arabic, French, Hindi and Turkish. 
The observed false positive rates for full sentence hashing are: Arabic: 2.4%, French: 1.8%, Hindi: 5.6% and Turkish: 5.5%, while for the 3-5 gram mapping they are: Arabic: 5.8%, French: 7.5%, Hindi: 3.5% and Turkish: 6.2%. Underlying translation quality plays an important role in the quality degradation incurred by watermarking. Without a sub-result mapping, French (BLEU: 26.45%) achieves recall of 85% at 0.2% BLEU loss, while the other languages achieve over 90% recall at the same BLEU loss threshold. Using a sub-result mapping reduces recall for each language pair, but changes the relative performance: Turkish experiences the highest relative drop in recall, unlike French and Arabic, where results are relatively more robust to sub-sentence mappings. This is likely a result of differences in n-gram distributions across these languages. The languages considered here all use space-separated words; for languages that do not, like Chinese or Thai, our approach can be applied at the character level.

Figure 4: Loss of recall when using the 3-5 gram mapping vs. sentence-level mapping for Arabic, French, Hindi and Turkish translations.

6 Conclusions

In this work we proposed a general method to watermark and probabilistically identify the structured outputs of machine learning algorithms. Our method provides probabilistic bounds on detection ability and analytic control over quality degradation, and is robust to local editing operations. It is applicable to any task where structured outputs are generated with ambiguities or ties in the results. We applied this method to the outputs of statistical machine translation, evaluating each refinement to our approach with false positive and recall rates against BLEU score quality degradation. Our results show that it is possible, across several language pairs, to achieve high recall rates (over 80%) with low false positive rates (between 5 and 8%) at minimal quality degradation (0.2% BLEU), while still allowing for local edit operations on the translated output. In future work we will continue to investigate methods to mitigate quality loss.

References

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19:263–311.

Mark Chapman, George Davida, and Marc Rennhard. 2001. A practical and effective approach to large-scale automated linguistic steganography. In Proceedings of the Information Security Conference.

David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine translation. In North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT).

Jade Goldstein, Mark Kantrowitz, Vibhu Mittal, and Jaime Carbonell. 1999. Summarizing text documents: Sentence selection and evaluation metrics. In Research and Development in Information Retrieval, pages 121–128.

Gaurav Gupta, Josef Pieprzyk, and Hua Xiong Wang. 2006. An attack-localizing watermarking scheme for natural language documents. In Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, ASIACCS '06, pages 157–165, New York, NY, USA. ACM.
Percy Liang, Alexandre Bouchard-Cote, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proceedings of the Joint International Conference on Computational Linguistics and Association of Computational Linguistics (COLING/ACL), pages 761–768.

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics.

Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 2003 Meeting of the Association for Computational Linguistics.

Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Computational Linguistics.

Ryan Stutsman, Mikhail Atallah, Christian Grothoff, and Krista Grothoff. 2006. Lost in just the translation. In Proceedings of the 2006 ACM Symposium on Applied Computing.

Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of COLING 2010.

6 0.43553057 125 emnlp-2011-Statistical Machine Translation with Local Language Models

7 0.40525854 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation

8 0.38850459 133 emnlp-2011-The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources

9 0.37804794 5 emnlp-2011-A Fast Re-scoring Strategy to Capture Long-Distance Dependencies

10 0.3522433 51 emnlp-2011-Exact Decoding of Phrase-Based Translation Models through Lagrangian Relaxation

11 0.31471059 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification

12 0.31338319 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries

13 0.30285835 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation

14 0.28354949 36 emnlp-2011-Corroborating Text Evaluation Results with Heterogeneous Measures

15 0.27628759 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming

16 0.27366674 62 emnlp-2011-Generating Subsequent Reference in Shared Visual Scenes: Computation vs Re-Use

17 0.26887706 66 emnlp-2011-Hierarchical Phrase-based Translation Representations

18 0.25884634 3 emnlp-2011-A Correction Model for Word Alignments

19 0.2461358 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation

20 0.24520418 91 emnlp-2011-Literal and Metaphorical Sense Identification through Concrete and Abstract Context


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(23, 0.119), (34, 0.275), (36, 0.036), (37, 0.032), (45, 0.076), (53, 0.059), (54, 0.036), (57, 0.016), (62, 0.022), (64, 0.048), (66, 0.042), (69, 0.025), (79, 0.03), (85, 0.035), (87, 0.01), (90, 0.013), (96, 0.033), (98, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76085162 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts

Author: Gennadi Lembersky ; Noam Ordan ; Shuly Wintner

Abstract: We investigate the differences between language models compiled from original target-language texts and those compiled from texts manually translated to the target language. Corroborating established observations of Translation Studies, we demonstrate that the latter are significantly better predictors of translated sentences than the former, and hence fit the reference set better. Furthermore, translated texts yield better language models for statistical machine translation than original texts.

2 0.52998012 13 emnlp-2011-A Word Reordering Model for Improved Machine Translation

Author: Karthik Visweswariah ; Rajakrishnan Rajkumar ; Ankur Gandhe ; Ananthakrishnan Ramanathan ; Jiri Navratil

Abstract: Preordering of source side sentences has proved to be useful in improving statistical machine translation. Most work has used a parser in the source language along with rules to map the source language word order into the target language word order. The requirement to have a source language parser is a major drawback, which we seek to overcome in this paper. Instead of using a parser and then using rules to order the source side sentence we learn a model that can directly reorder source side sentences to match target word order using a small parallel corpus with high-quality word alignments. Our model learns pairwise costs of a word immediately preceding another word. We use the Lin-Kernighan heuristic to find the best source reordering efficiently during training and testing and show that it suffices to provide good quality reordering. We show gains in translation performance based on our reordering model for translating from Hindi to English, Urdu to English (with a public dataset), and English to Hindi. For English to Hindi we show that our technique achieves better performance than a method that uses rules applied to the source side English parse.

3 0.52910626 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

Author: Yang Gao ; Philipp Koehn ; Alexandra Birch

Abstract: Long-distance reordering remains one of the biggest challenges facing machine translation. We derive soft constraints from the source dependency parsing to directly address the reordering problem for the hierarchical phrase-based model. Our approach significantly improves Chinese–English machine translation on a large-scale task by 0.84 BLEU points on average. Moreover, when we switch the tuning function from BLEU to the LRscore which promotes reordering, we observe total improvements of 1.21 BLEU, 1.30 LRscore and 3.36 TER over the baseline. On average our approach improves reordering precision and recall by 6.9 and 0.3 absolute points, respectively, and is found to be especially effective for long-distance reordering.

4 0.52046484 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

Author: Kevin Gimpel ; Noah A. Smith

Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and Urdu-English translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.

5 0.51964396 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features

Author: Christos Christodoulopoulos ; Sharon Goldwater ; Mark Steedman

Abstract: In this paper we present a fully unsupervised syntactic class induction system formulated as a Bayesian multinomial mixture model, where each word type is constrained to belong to a single class. By using a mixture model rather than a sequence model (e.g., HMM), we are able to easily add multiple kinds of features, including those at both the type level (morphology features) and token level (context and alignment features, the latter from parallel corpora). Using only context features, our system yields results comparable to state-of-the-art, far better than a similar model without the one-class-per-type constraint. Using the additional features provides added benefit, and our final system outperforms the best published results on most of the 25 corpora tested.

6 0.516231 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

7 0.51082879 136 emnlp-2011-Training a Parser for Machine Translation Reordering

8 0.50969476 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation

9 0.50816685 59 emnlp-2011-Fast and Robust Joint Models for Biomedical Event Extraction

10 0.5070979 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding

11 0.50599688 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training

12 0.50396132 66 emnlp-2011-Hierarchical Phrase-based Translation Representations

13 0.50394565 79 emnlp-2011-Lateen EM: Unsupervised Training with Multiple Objectives, Applied to Dependency Grammar Induction

14 0.50391704 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases

15 0.50174087 128 emnlp-2011-Structured Relation Discovery using Generative Models

16 0.50085175 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification

17 0.5000487 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming

18 0.49997705 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation

19 0.49968222 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study

20 0.4982512 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models