emnlp emnlp2011 emnlp2011-44 knowledge-graph by maker-knowledge-mining

44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection


Source: pdf

Author: Amittai Axelrod ; Xiaodong He ; Jianfeng Gao

Abstract: We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. [sent-6, score-0.332]

2 These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. [sent-7, score-0.344]

3 The trouble is that, except for the few all-purpose SMT systems, there is never enough training data that is directly relevant to the translation task at hand. [sent-12, score-0.313]

4 Even if there is no formal genre for the text to be translated, any coherent translation task will have its own argot, vocabulary or stylistic preferences, such that the corpus characteristics will necessarily deviate from any all-encompassing model of language. [sent-13, score-0.344]

5 The task of domain adaptation is to translate a text in a particular (target) domain for which only a small amount of training data is available, using an MT system trained on a larger set of data that is not restricted to the target domain. [sent-18, score-0.571]

6 Many existing domain adaptation methods fall into two broad categories. [sent-20, score-0.375]

7 It can be also achieved at the model level by combining multiple translation or language models together, often in a weighted manner. [sent-22, score-0.413]

8 First, we present three methods for ranking the sentences in a general-domain corpus with respect to an in-domain corpus. [sent-27, score-0.17]

9 The first two data selection methods are applications of language-modeling techniques to MT (one for the first time). [sent-29, score-0.131]

10 The third method is novel and explicitly takes into account the bilingual nature of the MT training corpus. [sent-30, score-0.139]

11 We show that it is possible to use our data selection methods to subselect less than 1% (or discard 99%) of a large general training corpus and still increase translation performance by nearly 2 BLEU points. [sent-31, score-0.509]

12 We test their combination with the in-domain set, followed by examining the subcorpora to see whether they are actually in-domain, out-of-domain, or something in between. [sent-33, score-0.42]

13 Based on this, we compare translation model combination methods. [sent-34, score-0.383]

14 Finally, we show that these tiny translation models for model combination can improve system performance even further over the current standard way of producing a domain-adapted MT system. [sent-35, score-0.605]

15 1 Training Data Selection An underlying assumption in domain adaptation is that a general-domain corpus, if sufficiently broad, likely includes some sentences that could fall within the target domain and thus should be used for training. [sent-38, score-0.536]

16 Equally, the general-domain corpus likely includes sentences that are so unlike the domain of the task that using them to train the model is probably more harmful than beneficial. [sent-39, score-0.301]

17 One mechanism for domain adaptation is thus to select only a portion of the general-domain corpus, and use only that subset to train a complete system. [sent-40, score-0.414]

18 The simplest instance of this problem can be found in the realm of language modeling, using perplexity-based selection methods. [sent-41, score-0.162]

19 The sentences in the general-domain corpus are scored by their perplexity score according to an in-domain language model, and then sorted, with only the lowest ones being retained. [sent-42, score-0.5]

20 The ranking of the sentences in a general-domain corpus according to in-domain perplexity has also been applied to machine translation by both Yasuda et al. (2008) and Foster et al. (2010). [sent-44, score-0.934]

21 We test this approach, with the difference that we simply use the source side perplexity rather than computing the geometric mean of the perplexities over both sides of the corpus. [sent-45, score-0.317]
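
To make the contrast concrete, suppose a sentence pair s has per-word cross-entropies H_src(s) and H_tgt(s) under in-domain language models for each side (these symbols are introduced here for illustration only). The geometric mean of the two perplexities then reduces to an average of the two cross-entropies in the exponent, whereas the source-side criterion described above ranks by 2^{H_src(s)} alone:

```latex
\sqrt{2^{H_{\mathrm{src}}(s)} \cdot 2^{H_{\mathrm{tgt}}(s)}}
  = 2^{\frac{1}{2}\left(H_{\mathrm{src}}(s) + H_{\mathrm{tgt}}(s)\right)}
```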

22 Foster et al. (2010) do not mention what percentage of the corpus they select for their IR-baseline, but they concatenate the data to their in-domain corpus and report a decrease in performance. [sent-47, score-0.226]

23 , 2009), who assign a (possibly-zero) weight to each sentence in the large corpus and modify the empirical phrase counts accordingly. [sent-50, score-0.152]

24 Foster et al. (2010) further perform this on extracted phrase pairs, not just sentences. [sent-51, score-0.151]

25 We apply this criterion for the first time to the task of selecting training data for machine translation systems. [sent-56, score-0.417]

26 In practice, most systems also perform target-side language model adaptation (Eck et al. [sent-62, score-0.236]

27 , 2004); we eschew this in order to isolate the effects of translation model adaptation alone. [sent-63, score-0.549]

28 Directly concatenating the phrase tables into one larger one isn’t strongly motivated; identical phrase pairs within the resulting table can lead to unpredictable behavior during decoding. [sent-64, score-0.388]

29 Nakov (2008) handled identical phrase pairs by prioritizing the source tables; however, in our experience identical entries in phrase tables are not very common when comparing across domains. [sent-65, score-0.41]

30 Foster and Kuhn (2007) interpolated the in- and general-domain phrase tables together, assigning either linear or log-linear weights to the entries in the tables before combining overlapping entries; this is now standard practice. [sent-66, score-0.489]

31 , 2007) to pass both tables to the Moses SMT decoder (Koehn et al. [sent-68, score-0.181]

32 , 2003), instead of directly combining the phrase tables to perform domain adaptation. [sent-69, score-0.349]

33 Our in-domain data consisted of the IWSLT corpus of approximately 30,000 sentences in Chinese and English. [sent-75, score-0.169]

34 Our general-domain corpus was 12 million parallel sentences comprising a variety of publicly available datasets, web data, and private translation texts. [sent-76, score-0.575]

35 Both the in- and general-domain corpora were identically segmented (in Chinese) and tokenized (in English), but otherwise unprocessed. [sent-77, score-0.249]

36 2 System Description In order to highlight the data selection work, we used an out-of-the-box Moses framework using GIZA++ (Och and Ney, 2003) and MERT (Och, 2003) to train and tune the machine translation systems. [sent-83, score-0.527]

37 The only exception was the phrase table for the large out-of-domain system trained on 12m sentence pairs, which we trained on a cluster using a word-dependent HMM-based alignment (He, 2007). [sent-84, score-0.307]

38 We used the Moses decoder to produce all the system outputs, and scored them with the NIST mt -eval 3 1 4 tool used in the IWSLT evaluation. [sent-85, score-0.219]

39 3 Language Models Our work depends on the use of language models to rank sentences in the training corpus, in addition to their normal use during machine translation tuning and decoding. [sent-87, score-0.526]

40 4 Baseline System The in-domain baseline consisted of a translation system trained using Moses, as described above, on the IWSLT corpus. [sent-91, score-0.484]

41 The general-domain baseline was substantially larger, having been trained on 12 million sentence pairs, and had a phrase table containing 1. [sent-93, score-0.231]

42 Table 1: Baseline translation results for in-domain and general-domain systems. [sent-100, score-0.313]

43 4 Training Data Selection Methods We present three techniques for ranking and selecting subsets of a general-domain corpus, with an eye towards improving overall translation performance. [sent-104, score-0.457]

44 1, one established method is to rank the sentences in the general-domain corpus by their perplexity score according to a language model trained on the small in-domain corpus. [sent-107, score-0.98]

45 This reduces the perplexity of the general-domain corpus, with the expectation that only sentences similar to the in-domain corpus will remain. [sent-108, score-0.414]

46 We apply the method to machine translation, even though perplexity reduction has been shown to not correlate with translation performance (Axelrod, 2006). [sent-109, score-0.636]

47 The perplexity of some string s with empirical n-gram distribution p given a language model q is: $2^{-\sum_{x} p(x) \log q(x)} = 2^{H(p,q)}$ (1) where H(p, q) is the cross-entropy between p and q. [sent-111, score-0.38]

48 Selecting the sentences with the lowest perplexity is therefore equivalent to choosing the sentences with the lowest cross-entropy according to the in-domain language model. [sent-113, score-0.517]
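
The equivalence above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's setup: it assumes an add-one-smoothed unigram model in place of a real n-gram language model, and the function names (`train_unigram_lm`, `rank_by_cross_entropy`) are hypothetical.

```python
import math
from collections import Counter

def train_unigram_lm(sentences):
    """Return a function giving add-one-smoothed unigram probabilities.

    Assumption: a unigram model stands in for a real n-gram LM.
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen tokens

    def prob(token):
        return (counts[token] + 1) / (total + vocab_size)

    return prob

def cross_entropy(sentence, prob):
    """Per-token cross-entropy (in bits) of a sentence under the LM."""
    tokens = sentence.split()
    if not tokens:
        return float("inf")  # rank empty sentences last
    return -sum(math.log2(prob(t)) for t in tokens) / len(tokens)

def perplexity(sentence, prob):
    """Perplexity is 2 raised to the cross-entropy, so both give the same ranking."""
    return 2.0 ** cross_entropy(sentence, prob)

def rank_by_cross_entropy(general, in_domain):
    """Sort general-domain sentences from most to least in-domain-like."""
    prob = train_unigram_lm(in_domain)
    return sorted(general, key=lambda s: cross_entropy(s, prob))
```

Because 2^H is monotonic in H, sorting by `perplexity` instead of `cross_entropy` would return the same order.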

49 They then rank the general-domain corpus sentences using: $H_I(s) - H_O(s)$ (2), again taking the lowest-scoring sentences. [sent-118, score-0.163]

50 This criterion biases towards sentences that are both like the in-domain corpus and unlike the average of the general-domain corpus. [sent-119, score-0.131]

51 For this experiment we reused the in-domain LM from the previous method, and trained a second LM on a random subset of 35k sentences from the Chinese side of the general corpus, except using the same vocabulary as the in-domain LM. [sent-120, score-0.392]
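
A minimal sketch of the cross-entropy difference criterion follows, under stated assumptions: both language models are add-one-smoothed unigram models (a simplification), both share the in-domain vocabulary with all other tokens mapped to `<unk>`, and the names `make_lm` and `moore_lewis_rank` are hypothetical.

```python
import math
from collections import Counter

def make_lm(sentences, vocab):
    """Add-one unigram LM over a fixed vocabulary; other tokens map to <unk>."""
    counts = Counter(
        t if t in vocab else "<unk>" for s in sentences for t in s.split()
    )
    denom = sum(counts.values()) + len(vocab) + 1  # +1 for the <unk> type
    return lambda t: (counts[t if t in vocab else "<unk>"] + 1) / denom

def cross_entropy(sentence, lm):
    """Per-token cross-entropy in bits; assumes a non-empty sentence."""
    tokens = sentence.split()
    return -sum(math.log2(lm(t)) for t in tokens) / len(tokens)

def moore_lewis_rank(general, in_domain, general_sample):
    """Return (score, sentence) pairs sorted by H_I(s) - H_O(s), lowest first."""
    vocab = {t for s in in_domain for t in s.split()}
    lm_in = make_lm(in_domain, vocab)        # in-domain LM
    lm_out = make_lm(general_sample, vocab)  # LM on a general-domain sample
    scored = [
        (cross_entropy(s, lm_in) - cross_entropy(s, lm_out), s) for s in general
    ]
    return sorted(scored)
```

Restricting both models to one vocabulary keeps the two cross-entropies comparable, so the difference reflects domain fit and not vocabulary size.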

52 3 Data Selection using Bilingual Cross-Entropy Difference In addition to using these two monolingual criteria for MT data selection, we propose a new method that takes into account the bilingual nature of the problem. [sent-122, score-0.139]

53 Again, the vocabulary of the language model trained on a subset of the general-domain corpus was restricted to only cover those tokens found in the in-domain corpus, following Moore and Lewis (2010). [sent-126, score-0.47]
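
The extracted sentences do not include the scoring formula for the bilingual method. As published, it sums the cross-entropy difference over the source and target sides: [H_I-src(s) - H_O-src(s)] + [H_I-tgt(s) - H_O-tgt(s)]. The sketch below illustrates that sum under simplifying assumptions: add-one unigram models restricted to the in-domain vocabulary of each side, and hypothetical names `ce_diff` and `bilingual_score`.

```python
import math
from collections import Counter

def ce_diff(sentence, in_side, out_side):
    """H_I(s) - H_O(s) for one language side, sharing the in-domain vocabulary."""
    vocab = {t for s in in_side for t in s.split()}

    def lm(corpus):
        counts = Counter(
            t if t in vocab else "<unk>" for s in corpus for t in s.split()
        )
        denom = sum(counts.values()) + len(vocab) + 1
        return lambda t: (counts[t if t in vocab else "<unk>"] + 1) / denom

    def entropy(model):
        tokens = sentence.split()  # assumes a non-empty sentence
        return -sum(math.log2(model(t)) for t in tokens) / len(tokens)

    return entropy(lm(in_side)) - entropy(lm(out_side))

def bilingual_score(pair, in_pairs, out_pairs):
    """Sum of source-side and target-side cross-entropy differences."""
    src, tgt = pair
    in_src, in_tgt = zip(*in_pairs)
    out_src, out_tgt = zip(*out_pairs)
    return ce_diff(src, in_src, out_src) + ce_diff(tgt, in_tgt, out_tgt)
```

Lower scores indicate sentence pairs whose source and target sides both look in-domain; the models are rebuilt on every call here purely to keep the sketch short.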

54 5 Results of Training Data Selection The baseline results show that a translation system trained on the general-domain corpus outperforms a system trained on the in-domain corpus by over 3 BLEU points. [sent-127, score-0.709]

55 We used the three methods from Section 4 to identify the best-scoring sentences in the general-domain corpus. [sent-129, score-0.315]

56 We consider three methods for extracting domain-targeted parallel data from a general corpus: source-side cross-entropy (Cross-Ent), source-side cross-entropy difference (Moore-Lewis) from (Moore and Lewis, 2010), and bilingual cross-entropy difference (bML), which is novel. [sent-130, score-0.176]

57 The net effect is that of domain adaptation via threshold filtering. [sent-134, score-0.301]
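
Threshold filtering over scored sentence pairs can be sketched as follows. The input format, a list of `(score, sentence_pair)` tuples where lower scores are better, and both function names are assumptions made for illustration.

```python
import heapq

def select_top_n(scored_pairs, n):
    """Keep the n lowest-scoring sentence pairs, best first."""
    best = heapq.nsmallest(n, scored_pairs, key=lambda item: item[0])
    return [pair for _, pair in best]

def select_below(scored_pairs, threshold):
    """Keep every sentence pair whose score is at or below the threshold."""
    return [pair for score, pair in scored_pairs if score <= threshold]
```

Selecting a fixed top N, as in the experiments described here, is equivalent to picking the threshold at the N-th lowest score.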

58 New MT systems were then trained solely on these small subcorpora, and compared against the baseline model trained on the entire 12m-sentence general-domain corpus. [sent-135, score-0.205]

59 All three methods presented for selecting a subset of the general-domain corpus (Cross-Entropy, Moore-Lewis, bilingual Moore-Lewis) could be used to train a state-of-the-art machine translation system. [sent-140, score-0.702]

60 The simplest method, using only the source-side cross-entropy, was able to outperform the general-domain model when selecting 150k out of 12 million sentences. [sent-141, score-0.183]

61 The other monolingual method, source-side cross-entropy difference, was able to perform nearly as well as the general-domain model with only 35k sentences. [sent-142, score-0.28]

62 The bilingual Moore-Lewis method proposed in this paper works best, consistently boosting performance by 1. [sent-143, score-0.182]

63 1 Pseudo In-Domain Data The results in Table 2 show that all three methods (Cross-Entropy, Moore-Lewis, bilingual Moore-Lewis) can extract subsets of the general-domain corpus that are useful for the purposes of statistical machine translation. [sent-146, score-0.325]

64 We trained a baseline language model on the in-domain data and used it to compute the perplexity of the same (in-domain) held-out dev set used to tune the translation models. [sent-150, score-0.99]

65 We extracted the top N sentences using each ranking method, varying N from 10k to 200k, and then trained language models on these subcorpora. [sent-151, score-0.226]

66 These were then used to also compute the perplexity of the same held-out dev set, shown below in Figure 1. [sent-152, score-0.392]
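
The evaluation loop described above, training a language model on the top N ranked sentences and measuring dev-set perplexity for each N, might look like the sketch below. It assumes an add-one unigram model in place of the n-gram models actually used, and the names `dev_perplexity` and `perplexity_curve` are hypothetical.

```python
import math
from collections import Counter

def dev_perplexity(train_sentences, dev_sentences):
    """Corpus-level perplexity of the dev set under an add-one unigram LM."""
    counts = Counter(t for s in train_sentences for t in s.split())
    denom = sum(counts.values()) + len(counts) + 1  # +1 for unseen tokens
    dev_tokens = [t for s in dev_sentences for t in s.split()]
    log_prob = sum(math.log2((counts[t] + 1) / denom) for t in dev_tokens)
    return 2.0 ** (-log_prob / len(dev_tokens))

def perplexity_curve(ranked_sentences, dev_sentences, sizes):
    """Map each subset size N to the dev perplexity of an LM trained on the top N."""
    return {
        n: dev_perplexity(ranked_sentences[:n], dev_sentences) for n in sizes
    }
```

Plotting the returned dictionary against N gives a curve of the kind summarized in Figure 1.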

67 [Figure 1: Corpus Selection Results. Series: in-domain baseline, Cross-Entropy, Moore-Lewis, bilingual M-L; x-axis: top-ranked general-domain sentences (in k).] The perplexity of the dev set according to LMs trained on the top-ranked sentences varied from 77 to 120, depending on the size of the subset and the method used. [sent-153, score-0.649]

68 4 on 20k sentences, and bilingual Moore-Lewis was consistently the best, with a lowest perplexity of 76. [sent-155, score-0.516]

69 And yet, none of these scores are anywhere near the perplexity of 36. [sent-157, score-0.283]

70 From this it can be deduced that the selection methods are not finding data that is strictly indomain. [sent-159, score-0.131]

71 Rather they are extracting pseudo indomain data which is relevant, but with a differing distribution than the original in-domain corpus. [sent-160, score-0.396]

72 As further evidence, consider the results of concatenating the in-domain corpus with the best extracted subcorpora (using the bilingual MooreLewis method), shown in Table 3. [sent-161, score-0.464]

73 and pseudo in-domain data to train a single model. [sent-164, score-0.272]

74 6 Translation Model Combination Because the pseudo in-domain data should be kept separate from the in-domain data, one must train multiple translation models in order to advantageously use the general-domain corpus. [sent-165, score-0.619]

75 1 Linear Interpolation A common approach to managing multiple translation models is to interpolate them, as in (Foster and Kuhn, 2007) and (L¨ u et al. [sent-168, score-0.347]

76 Linear interpolation of phrase tables was shown to improve performance over the individual models, but this still may not be the most effective use of the translation models. [sent-172, score-0.598]
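
Linear interpolation of two phrase tables can be illustrated as below. This is a simplified sketch: each table is assumed to be a dictionary from a (source phrase, target phrase) pair to a single probability, whereas real phrase tables carry several feature scores per entry. The weight `lam` and the function name are hypothetical.

```python
def interpolate_tables(in_table, general_table, lam):
    """Return lam * p_in + (1 - lam) * p_general for every pair in either table.

    A pair missing from one table contributes probability 0.0 from that table.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    pairs = set(in_table) | set(general_table)
    return {
        pair: lam * in_table.get(pair, 0.0)
        + (1.0 - lam) * general_table.get(pair, 0.0)
        for pair in pairs
    }
```

For example, with `lam = 0.5`, a pair scored 0.8 in-domain and 0.4 in the general table receives 0.6, while a pair found only in the general table keeps half of its original probability.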

77 2 Multiple Models We next tested the approach in (Koehn and Schroeder, 2007), passing the two phrase tables directly to the decoder and tuning a system using both 360 phrase tables in parallel. [sent-174, score-0.619]

78 Each phrase table receives a separate set of weights during tuning, thus this combined translation model has more parameters than a normal single-table system. [sent-175, score-0.431]

79 However, the exact overlap between the phrase tables was tiny, minimizing this effect. [sent-178, score-0.218]

80 3 Translation Model Combination Results Table 4 shows baseline results for the in-domain translation system and the general-domain system, evaluated on the in-domain data. [sent-180, score-0.359]

81 The table also shows that linearly interpolating the translation models improved the overall BLEU score, as expected. [sent-181, score-0.347]

82 We conclude that it can be more effective to not attempt translation model adaptation directly, and instead let the decoder do the work. [sent-187, score-0.599]

83 7 Combining Multi-Model and Data Selection Approaches We presented in Section 5 several methods to improve the performance of a single general-domain translation system by restricting its training corpus on an information-theoretic basis to a very small number of sentences. [sent-188, score-0.424]

84 3 shows that using two translation models over all the available data (one in-domain, one general-domain) outperforms any single individual translation model so far, albeit only slightly. [sent-190, score-0.691]

85 Table 5: Translation results from using in-domain and pseudo in-domain translation models together. [sent-199, score-0.576]

86 It is well and good to use the in-domain data to select pseudo in-domain data from the general-domain corpus, but given that this requires access to an in-domain corpus, one might as well use it. [sent-200, score-0.51]

87 As such, we used the in-domain translation model alongside translation models trained on the subcorpora selected using the Moore-Lewis and bilingual Moore-Lewis methods in Section 4. [sent-201, score-1.131]

88 A translation system trained on a pseudo in-domain subset of the general corpus, selected with the bilingual Moore-Lewis method, can be further improved by combining with an in-domain model. [sent-203, score-0.983]

89 Thus a domain-adapted system comprising two phrase tables trained on a total of 180k sentences outperformed the standard multi-model system which was trained on 12 million sentences. [sent-206, score-0.644]

90 This tiny combined system was also 3+ points better than the general-domain system by itself, and 6+ points better than the in-domain system alone. [sent-207, score-0.28]

91 8 Conclusions Sentence pairs from a general-domain corpus that seem similar to an in-domain corpus may not actually represent the same distribution of language, as measured by language model perplexity. [sent-208, score-0.161]

92 Nonetheless, we have shown that relatively tiny amounts of this pseudo in-domain data can prove more useful than the entire general-domain corpus for the purposes of domain-targeted translation tasks. [sent-209, score-0.749]

93 This paper has also explored three simple yet effective methods for extracting these pseudo in-domain sentences from a general-domain corpus. [sent-210, score-0.462]

94 A translation model trained on any of these subcorpora can be comparable or substantially better than a translation system trained on the entire corpus. [sent-211, score-1.091]

95 In particular, the new bilingual Moore-Lewis method, which is specifically tailored to the machine translation scenario, is shown to be more efficient and stable for MT domain adaptation. [sent-212, score-0.588]

96 Translation models trained on data selected in this way consistently outperformed the general-domain baseline while using as few as 35k out of 12 million sentences. [sent-213, score-0.221]

97 We have also shown in passing that the linear interpolation of translation models may work less well for translation model adaptation than the multiple paths decoding technique of (Birch et al. [sent-216, score-1.092]

98 These approaches of data selection and model combination can be stacked, resulting in a compact, two-phrase-table translation system trained on 1% of the available data that again outperforms a state-of-the-art translation system trained on all the data. [sent-218, score-1.093]

99 Besides improving translation performance, this work also provides a way to mine very large corpora in a computationally-limited environment, such as on an ordinary computer or perhaps a mobile device. [sent-219, score-0.313]

100 The maximum size of a useful general-domain corpus is now limited only by the availability of data, rather than by how large a translation model can be fit into memory at once. [sent-220, score-0.409]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('translation', 0.313), ('perplexity', 0.283), ('generaldomain', 0.249), ('pseudo', 0.229), ('subcorpora', 0.214), ('adaptation', 0.205), ('indomain', 0.167), ('iwslt', 0.16), ('foster', 0.16), ('moore', 0.146), ('tiny', 0.142), ('bilingual', 0.139), ('lewis', 0.137), ('selection', 0.131), ('tables', 0.131), ('bleu', 0.109), ('dev', 0.109), ('moorelewis', 0.107), ('schroeder', 0.107), ('yasuda', 0.107), ('domain', 0.096), ('matsoukas', 0.092), ('mt', 0.088), ('trained', 0.087), ('phrase', 0.087), ('interpolated', 0.074), ('moses', 0.073), ('lmi', 0.071), ('tgt', 0.071), ('birch', 0.069), ('interpolation', 0.067), ('koehn', 0.067), ('sentences', 0.066), ('corpus', 0.065), ('al', 0.064), ('selecting', 0.064), ('ho', 0.062), ('hi', 0.062), ('amittai', 0.061), ('subcorpus', 0.061), ('lm', 0.06), ('kuhn', 0.06), ('million', 0.057), ('alexandra', 0.056), ('eck', 0.056), ('roland', 0.056), ('xiaodong', 0.056), ('src', 0.051), ('nakov', 0.051), ('axelrod', 0.051), ('lowest', 0.051), ('decoder', 0.05), ('wa', 0.049), ('gao', 0.049), ('smt', 0.048), ('system', 0.046), ('concatenating', 0.046), ('redmond', 0.046), ('passing', 0.046), ('philipp', 0.045), ('chinese', 0.043), ('consistently', 0.043), ('train', 0.043), ('decoding', 0.042), ('broad', 0.042), ('jianfeng', 0.042), ('tuning', 0.041), ('target', 0.041), ('subsets', 0.041), ('paths', 0.041), ('machine', 0.04), ('cro', 0.04), ('statistical', 0.04), ('combination', 0.039), ('ranking', 0.039), ('consisted', 0.038), ('discarding', 0.038), ('subset', 0.038), ('identical', 0.037), ('parallel', 0.037), ('comprising', 0.037), ('goodman', 0.036), ('microsoft', 0.036), ('scored', 0.035), ('combining', 0.035), ('models', 0.034), ('side', 0.034), ('string', 0.034), ('conventional', 0.033), ('select', 0.032), ('ngram', 0.032), ('quantity', 0.032), ('fall', 0.032), ('rank', 0.032), ('model', 0.031), ('simplest', 0.031), ('entries', 0.031), ('tti', 0.031), ('eiichiro', 0.031), ('fbk', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

Author: Amittai Axelrod ; Xiaodong He ; Jianfeng Gao

Abstract: We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.

2 0.21194032 125 emnlp-2011-Statistical Machine Translation with Local Language Models

Author: Christof Monz

Abstract: Part-of-speech language modeling is commonly used as a component in statistical machine translation systems, but there is mixed evidence that its usage leads to significant improvements. We argue that its limited effectiveness is due to the lack of lexicalization. We introduce a new approach that builds a separate local language model for each word and part-of-speech pair. The resulting models lead to more context-sensitive probability distributions and we also exploit the fact that different local models are used to estimate the language model probability of each word during decoding. Our approach is evaluated for Arabic- and Chinese-to-English translation. We show that it leads to statistically significant improvements for multiple test sets and also across different genres, when compared against a competitive baseline and a system using a part-of-speech model.

3 0.16577967 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation

Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng

Abstract: Many machine translation evaluation metrics have been proposed after the seminal BLEU metric, and many among them have been found to consistently outperform BLEU, demonstrated by their better correlations with human judgment. It has long been the hope that by tuning machine translation systems against these new generation metrics, advances in automatic machine translation evaluation can lead directly to advances in automatic machine translation. However, to date there has been no unambiguous report that these new metrics can improve a state-of-the-art machine translation system over its BLEU-tuned baseline. In this paper, we demonstrate that tuning Joshua, a hierarchical phrase-based statistical machine translation system, with the TESLA metrics results in significantly better human-judged translation quality than the BLEU-tuned baseline. TESLA-M in particular is simple and performs well in practice on large datasets. We release all our implementation under an open source license. It is our hope that this work will encourage the machine translation community to finally move away from BLEU as the unquestioned default and to consider the new generation metrics when tuning their systems.

4 0.16472979 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

Author: Kevin Gimpel ; Noah A. Smith

Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and Urdu-English translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.

5 0.16339122 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts

Author: Gennadi Lembersky ; Noam Ordan ; Shuly Wintner

Abstract: We investigate the differences between language models compiled from original target-language texts and those compiled from texts manually translated to the target language. Corroborating established observations of Translation Studies, we demonstrate that the latter are significantly better predictors of translated sentences than the former, and hence fit the reference set better. Furthermore, translated texts yield better language models for statistical machine translation than original texts.

6 0.14382003 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation

7 0.14236647 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation

8 0.12130979 118 emnlp-2011-SMT Helps Bitext Dependency Parsing

9 0.11552298 3 emnlp-2011-A Correction Model for Word Alignments

10 0.11429693 131 emnlp-2011-Syntactic Decision Tree LMs: Random Selection or Intelligent Design?

11 0.11287665 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax

12 0.10991104 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

13 0.10444063 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation

14 0.10170177 100 emnlp-2011-Optimal Search for Minimum Error Rate Training

15 0.097994231 51 emnlp-2011-Exact Decoding of Phrase-Based Translation Models through Lagrangian Relaxation

16 0.094037615 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training

17 0.093884327 38 emnlp-2011-Data-Driven Response Generation in Social Media

18 0.093858831 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation

19 0.088226691 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries

20 0.08434701 136 emnlp-2011-Training a Parser for Machine Translation Reordering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.271), (1, 0.168), (2, 0.146), (3, -0.274), (4, 0.014), (5, -0.072), (6, 0.042), (7, -0.045), (8, -0.112), (9, -0.046), (10, 0.072), (11, 0.035), (12, 0.034), (13, 0.073), (14, -0.047), (15, 0.254), (16, 0.136), (17, -0.082), (18, -0.08), (19, -0.022), (20, -0.03), (21, -0.073), (22, 0.014), (23, 0.007), (24, -0.069), (25, -0.127), (26, -0.015), (27, -0.022), (28, -0.025), (29, 0.137), (30, -0.063), (31, -0.034), (32, 0.011), (33, -0.061), (34, -0.065), (35, 0.024), (36, 0.001), (37, -0.114), (38, 0.023), (39, -0.012), (40, -0.066), (41, 0.017), (42, -0.068), (43, -0.033), (44, 0.079), (45, 0.02), (46, -0.002), (47, 0.062), (48, -0.031), (49, 0.08)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9613831 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

Author: Amittai Axelrod ; Xiaodong He ; Jianfeng Gao

Abstract: We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.

2 0.80572605 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts

Author: Gennadi Lembersky ; Noam Ordan ; Shuly Wintner

Abstract: We investigate the differences between language models compiled from original target-language texts and those compiled from texts manually translated to the target language. Corroborating established observations of Translation Studies, we demonstrate that the latter are significantly better predictors of translated sentences than the former, and hence fit the reference set better. Furthermore, translated texts yield better language models for statistical machine translation than original texts.

3 0.68833923 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation

Author: Zhengxian Gong ; Min Zhang ; Guodong Zhou

Abstract: Statistical machine translation systems are usually trained on a large amount of bilingual sentence pairs and translate one sentence at a time, ignoring document-level information. In this paper, we propose a cache-based approach to document-level translation. Since caches mainly depend on relevant data to supervise subsequent decisions, it is critical to fill the caches with highly-relevant data of a reasonable size. In this paper, we present three kinds of caches to store relevant document-level information: 1) a dynamic cache, which stores bilingual phrase pairs from the best translation hypotheses of previous sentences in the test document; 2) a static cache, which stores relevant bilingual phrase pairs extracted from similar bilingual document pairs (i.e. source documents similar to the test document and their corresponding target documents) in the training parallel corpus; 3) a topic cache, which stores the target-side topic words related with the test document in the source-side. In particular, three new features are designed to explore various kinds of document-level information in the above three kinds of caches. Evaluation shows the effectiveness of our cache-based approach to document-level translation with the performance improvement of 0.81 in BLEU score over Moses. Especially, detailed analysis and discussion are presented to give new insights to document-level translation.

4 0.67587292 125 emnlp-2011-Statistical Machine Translation with Local Language Models

Author: Christof Monz

Abstract: Part-of-speech language modeling is commonly used as a component in statistical machine translation systems, but there is mixed evidence that its usage leads to significant improvements. We argue that its limited effectiveness is due to the lack of lexicalization. We introduce a new approach that builds a separate local language model for each word and part-of-speech pair. The resulting models lead to more context-sensitive probability distributions and we also exploit the fact that different local models are used to estimate the language model probability of each word during decoding. Our approach is evaluated for Arabic- and Chinese-to-English translation. We show that it leads to statistically significant improvements for multiple test sets and also across different genres, when compared against a competitive baseline and a system using a part-of-speech model.

5 0.6732893 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation

Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng

Abstract: Many machine translation evaluation metrics have been proposed after the seminal BLEU metric, and many among them have been found to consistently outperform BLEU, demonstrated by their better correlations with human judgment. It has long been the hope that by tuning machine translation systems against these new generation metrics, advances in automatic machine translation evaluation can lead directly to advances in automatic machine translation. However, to date there has been no unambiguous report that these new metrics can improve a state-of-the-art machine translation system over its BLEU-tuned baseline. In this paper, we demonstrate that tuning Joshua, a hierarchical phrase-based statistical machine translation system, with the TESLA metrics results in significantly better human-judged translation quality than the BLEU-tuned baseline. TESLA-M in particular is simple and performs well in practice on large datasets. We release all our implementation under an open source license. It is our hope that this work will encourage the machine translation community to finally move away from BLEU as the unquestioned default and to consider the new generation metrics when tuning their systems.

6 0.60578525 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries

7 0.5799076 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation

8 0.56726891 118 emnlp-2011-SMT Helps Bitext Dependency Parsing

9 0.51959044 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.

10 0.46803015 133 emnlp-2011-The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources

11 0.45989215 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

12 0.4592315 100 emnlp-2011-Optimal Search for Minimum Error Rate Training

13 0.44026238 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

14 0.43516675 131 emnlp-2011-Syntactic Decision Tree LMs: Random Selection or Intelligent Design?

15 0.43450585 38 emnlp-2011-Data-Driven Response Generation in Social Media

16 0.43414354 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices

17 0.42985871 51 emnlp-2011-Exact Decoding of Phrase-Based Translation Models through Lagrangian Relaxation

18 0.42057163 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models

19 0.41849059 66 emnlp-2011-Hierarchical Phrase-based Translation Representations

20 0.41285387 36 emnlp-2011-Corroborating Text Evaluation Results with Heterogeneous Measures


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(23, 0.138), (36, 0.022), (37, 0.023), (45, 0.064), (53, 0.066), (54, 0.021), (57, 0.017), (62, 0.023), (64, 0.023), (66, 0.023), (69, 0.018), (79, 0.05), (82, 0.015), (85, 0.378), (96, 0.024), (98, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.70433325 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

Author: Amittai Axelrod ; Xiaodong He ; Jianfeng Gao

Abstract: We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.

2 0.68789828 13 emnlp-2011-A Word Reordering Model for Improved Machine Translation

Author: Karthik Visweswariah ; Rajakrishnan Rajkumar ; Ankur Gandhe ; Ananthakrishnan Ramanathan ; Jiri Navratil

Abstract: Preordering of source side sentences has proved to be useful in improving statistical machine translation. Most work has used a parser in the source language along with rules to map the source language word order into the target language word order. The requirement to have a source language parser is a major drawback, which we seek to overcome in this paper. Instead of using a parser and then using rules to order the source side sentence we learn a model that can directly reorder source side sentences to match target word order using a small parallel corpus with high-quality word alignments. Our model learns pairwise costs of a word immediately preceding another word. We use the Lin-Kernighan heuristic to find the best source reordering efficiently during training and testing and show that it suffices to provide good quality reordering. We show gains in translation performance based on our reordering model for translating from Hindi to English, Urdu to English (with a public dataset), and English to Hindi. For English to Hindi we show that our technique achieves better performance than a method that uses rules applied to the source side English parse.

3 0.46032566 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

Author: Yang Gao ; Philipp Koehn ; Alexandra Birch

Abstract: Long-distance reordering remains one of the biggest challenges facing machine translation. We derive soft constraints from the source dependency parsing to directly address the reordering problem for the hierarchical phrase-based model. Our approach significantly improves Chinese–English machine translation on a large-scale task by 0.84 BLEU points on average. Moreover, when we switch the tuning function from BLEU to the LRscore which promotes reordering, we observe total improvements of 1.21 BLEU, 1.30 LRscore and 3.36 TER over the baseline. On average our approach improves reordering precision and recall by 6.9 and 0.3 absolute points, respectively, and is found to be especially effective for long-distance reordering.

4 0.45443556 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts

Author: Gennadi Lembersky ; Noam Ordan ; Shuly Wintner

Abstract: We investigate the differences between language models compiled from original target-language texts and those compiled from texts manually translated to the target language. Corroborating established observations of Translation Studies, we demonstrate that the latter are significantly better predictors of translated sentences than the former, and hence fit the reference set better. Furthermore, translated texts yield better language models for statistical machine translation than original texts.

5 0.43942207 38 emnlp-2011-Data-Driven Response Generation in Social Media

Author: Alan Ritter ; Colin Cherry ; William B. Dolan

Abstract: We present a data-driven approach to generating responses to Twitter status posts, based on phrase-based Statistical Machine Translation. We find that mapping conversational stimuli onto responses is more difficult than translating between languages, due to the wider range of possible responses, the larger fraction of unaligned words/phrases, and the presence of large phrase pairs whose alignment cannot be further decomposed. After addressing these challenges, we compare approaches based on SMT and Information Retrieval in a human evaluation. We show that SMT outperforms IR on this task, and its output is preferred over actual human responses in 15% of cases. As far as we are aware, this is the first work to investigate the use of phrase-based SMT to directly translate a linguistic stimulus into an appropriate response.

6 0.43199116 3 emnlp-2011-A Correction Model for Word Alignments

7 0.42797548 133 emnlp-2011-The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources

8 0.42167518 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation

9 0.41950554 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

10 0.41783604 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases

11 0.41702157 136 emnlp-2011-Training a Parser for Machine Translation Reordering

12 0.41571078 72 emnlp-2011-Improved Transliteration Mining Using Graph Reinforcement

13 0.41485596 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training

14 0.41322893 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

15 0.41279894 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study

16 0.41174194 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding

17 0.4111259 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features

18 0.41096404 6 emnlp-2011-A Generate and Rank Approach to Sentence Paraphrasing

19 0.41089758 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models

20 0.41080618 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation