acl acl2013 acl2013-11 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rico Sennrich ; Holger Schwenk ; Walid Aransa
Abstract: While domain adaptation techniques for SMT have proven to be effective at improving translation quality, their practicality for a multi-domain environment is often limited because of the computational and human costs of developing and maintaining multiple systems adapted to different domains. We present an architecture that delays the computation of translation model features until decoding, allowing for the application of mixture-modeling techniques at decoding time. We also describe a method for unsupervised adaptation with development and test data from multiple domains. Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation.
Reference: text
sentIndex sentText sentNum sentScore
1 We present an architecture that delays the computation of translation model features until decoding, allowing for the application of mixture-modeling techniques at decoding time. [sent-5, score-0.776]
2 We also describe a method for unsupervised adaptation with development and test data from multiple domains. [sent-6, score-0.555]
3 Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation. [sent-7, score-0.543]
4 1 Introduction The effectiveness of domain adaptation approaches such as mixture-modeling (Foster and Kuhn, 2007) has been established, and has led to research on a wide array of adaptation techniques in SMT, for instance (Matsoukas et al. [sent-8, score-0.746]
5 In all these approaches, adaptation is performed during model training, with respect to a representative development corpus, and the models are kept unchanged when the system is deployed. [sent-11, score-0.498]
6 Therefore, when working with multiple and/or unlabelled domains, domain adaptation is often impractical for a number of reasons. [sent-12, score-0.574]
7 This is impractical in many real applications, in particular a web translation service which is faced with texts coming from many different domains. [sent-16, score-0.287]
8 Secondly, domain adaptation bears a risk of performance loss. [sent-17, score-0.507]
9 If there is a mismatch between the domain of the development set and the test set, domain adaptation can potentially harm performance compared to an unadapted baseline. [sent-18, score-0.775]
10 We introduce a translation model architecture that delays the computation of features to the decoding phase. [sent-19, score-0.776]
11 The calculation is based on a vector of component models, with each component providing the sufficient statistics necessary for the computation of the features. [sent-20, score-0.42]
12 With this framework, adaptation to a new domain simply consists of updating a weight vector, and multiple domains can be supported by the same system. [sent-21, score-0.632]
13 We also present a clustering approach for unsupervised adaptation in a multi-domain environment. [sent-22, score-0.541]
14 In the development phase, a set of development data is clustered, and the models are adapted to each cluster. [sent-23, score-0.345]
15 For each sentence that is being decoded, we choose the weight vector that is optimized on the closest cluster, allowing for adaptation even with unlabelled and heterogeneous test data. [sent-24, score-0.686]
16 , 2010) delay the computation of translation model features for the purpose of interactive machine translation with online training. [sent-26, score-0.78]
17 The similarity suggests that our framework could also be used for interactive learning, with the ability to learn a model incrementally from user feedback, and weight it differently than the static models, opening new research opportunities. [sent-30, score-0.234]
18 (Sennrich, 2012b) perform instance weighting of translation models, based on the sufficient statistics. [sent-31, score-0.406]
19 Our framework implements this idea, with the main difference that the actual combination is delayed until decoding, to support adaptation to multiple domains in a single system. [sent-32, score-0.442]
20 , 2012) describe an ensemble decoding framework which combines several translation models in the decoding step. [sent-34, score-0.631]
21 Our work is similar to theirs in that the combination is done at runtime, but we also delay the computation of translation model probabilities, and thus have access to richer sufficient statistics. [sent-35, score-0.5]
22 , 2012) describe, plus additional ones such as forms of instance weighting, which are not possible after the translation probabilities have been computed. [sent-37, score-0.289]
23 They use separate translation systems for each domain, and a supervised setting, whereas we aim for a system that integrates support for multiple domains, with or without supervision. [sent-40, score-0.287]
24 (Yamamoto and Sumita, 2007) propose unsupervised clustering at both training and decoding time. [sent-41, score-0.413]
25 3 Translation Model Architecture This section covers the architecture of the multidomain translation model framework. [sent-44, score-0.482]
26 Our translation model is embedded in a log-linear model as is common for SMT, and treated as a single translation model in this log-linear combination. [sent-45, score-0.622]
27 The architecture has two goals: move the calculation of translation model features to the decoding phase, and allow for multiple knowledge sources (e. [sent-47, score-0.42]
28 Our immediate purpose for this paper is domain adaptation in a multi-domain environment, but the delay of the feature computation has other potential applications, e. [sent-50, score-0.574]
29 We are concerned with calculating four features during decoding, henceforth just referred to as the translation model features: p(s|t), lex(s|t), p(t|s) and lex(t|s). [sent-53, score-0.289]
30 Traditionally, the phrase translation probabilities p(s|t) and p(t|s) are estimated through unsmoothed maximum likelihood estimation (MLE). [sent-58, score-0.297]
31 In order to compute the translation model features online, a number of sufficient statistics need to be accessible at decoding time. [sent-64, score-0.6]
32 For accessing them during decoding, we simply store them in the decoder’s data structure, rather than storing pre-computed translation model features. [sent-66, score-0.479]
33 2 The statistics are accessed when the decoder collects all translation options for a phrase s in the source sentence. [sent-68, score-0.451]
34 We then access all translation options for each component table, obtaining a vector of statistics c(s) for the source phrase, and c(t) and c(s, t) for each potential target phrase. [sent-69, score-0.502]
35 After all tables have been accessed, and we thus know the full set of possible translation options (s, t), we perform a second round of lookups for all c(t) in the vector which are still set to 0. [sent-72, score-0.531]
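To make the delayed feature computation concrete, the sketch below shows how a weighted maximum-likelihood estimate of p(t|s) could be computed at decoding time from the per-component sufficient statistics c(s) and c(s, t), in the spirit of the instance weighting of (Sennrich, 2012b). It is an illustration under assumptions, not the Moses implementation: the data layout, function names and toy counts are hypothetical, and only the p(t|s) feature is shown.

```python
# Illustrative sketch (hypothetical names): weighted MLE estimate of p(t|s)
# computed at decoding time from per-component sufficient statistics.

def weighted_p_t_given_s(stats, weights):
    """stats   : one dict per component model, with
                 'c_s'  -> count c(s) of the source phrase s
                 'c_st' -> dict mapping target phrase t to count c(s, t)
       weights : one weight per component model"""
    # Weighted marginal count of the source phrase over all components.
    c_s = sum(w * comp['c_s'] for w, comp in zip(weights, stats))
    # All candidate target phrases found in any component table.
    targets = set()
    for comp in stats:
        targets.update(comp['c_st'])
    # Weighted joint counts, normalised by the weighted marginal count.
    probs = {}
    for t in targets:
        c_st = sum(w * comp['c_st'].get(t, 0.0) for w, comp in zip(weights, stats))
        probs[t] = c_st / c_s if c_s > 0 else 0.0
    return probs

# Toy usage with two hypothetical component models and invented counts.
stats = [
    {'c_s': 100, 'c_st': {'t1': 80, 't2': 20}},  # component 1
    {'c_s': 100, 'c_st': {'t1': 10, 't2': 90}},  # component 2
]
print(weighted_p_t_given_s(stats, [1.0, 1.0]))  # uniform weights (baseline)
print(weighted_p_t_given_s(stats, [0.9, 0.1]))  # weights favouring component 1
```

With uniform weights the sketch reproduces the estimate obtained from concatenated training data; non-uniform weights shift probability mass towards the components (domains) that the weight vector favours.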
36 We deem it still practical because the collection of translation options is typically only a small fraction of total decoding time, with search making up the largest part. [sent-79, score-0.491]
37 For storing and accessing the sufficient statistics (except for the word (pair) frequencies), we use an on-disk data structure (footnote 2: we have released an implementation of the architecture as part of the Moses decoder). [sent-80, score-0.449]
38 The architecture can thus be used as a drop-in replacement for a baseline system that is trained on concatenated training data, with non-uniform weights only being used for texts for which better weights have been established. [sent-92, score-0.293]
39 Table 1 shows the effect of weighting two corpora on the probability estimates for the translation of row. [sent-95, score-0.305]
40 German Zeile (row in a table) is predominant in a bitext from the domain IT, whereas (footnote 4: we prune the tables to the most frequent 50 phrase pairs per source phrase before combining them, since calculating the features for all phrase pairs of very common source phrases causes a significant slow-down). [sent-96, score-0.333]
41 Figure 1 (panels: gold clusters; clustering with Euclidean distance; clustering with cosine similarity): Clustering of a data set which contains sentences from two domains: LEGAL and IT. [sent-98, score-0.602]
42 Comparison between gold segmentation, and clustering with two alternative distance/similarity measures. [sent-99, score-0.231]
43 Reihe (line of objects) occurs more often in a legal corpus. [sent-101, score-0.21]
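As a numerical illustration of this effect (the counts below are invented for illustration and are not the ones underlying Table 1): with weighted counts, p_w(t|s) = (sum_i w_i * c_i(s,t)) / (sum_i w_i * c_i(s)). Suppose the IT bitext has c(row) = 100 with c(row, Zeile) = 80 and c(row, Reihe) = 20, while the legal bitext has c(row) = 100 with c(row, Zeile) = 10 and c(row, Reihe) = 90. Uniform weights (1, 1) give p(Zeile|row) = 90/200 = 0.45 and p(Reihe|row) = 110/200 = 0.55, whereas IT-adapted weights (0.9, 0.1) give p(Zeile|row) = (72 + 1)/100 = 0.73 and p(Reihe|row) = (18 + 9)/100 = 0.27, shifting the preferred translation towards the IT reading.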
44 4 Unsupervised Clustering for Online Translation Model Adaptation The framework supports decoding each sentence with a separate weight vector of size 4n, 4 being the number of translation model features whose computation can be weighted, and n the number of model components. [sent-107, score-0.756]
45 As a way of optimizing instance weights, (Sennrich, 2012b) minimize translation model perplexity on a set of phrase pairs, automatically extracted from a parallel development set. [sent-109, score-0.578]
46 We follow this technique, but want to have multiple weight vectors, adapted to different texts, between which the system switches at decoding time. [sent-110, score-0.4]
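A sketch of how such a per-cluster weight vector could be obtained is given below. It follows the general idea of minimizing translation model perplexity (cross-entropy) on phrase pairs extracted from one development cluster, but it is not the optimizer used in the paper: it only handles the p(t|s) feature (the paper optimizes a weight vector for each of the four features), and the data layout, function names and use of scipy are assumptions made for illustration.

```python
import math
import numpy as np
from scipy.optimize import minimize

def cross_entropy(weights, dev_pairs, stats):
    """Average negative log probability of dev phrase pairs under the weighted model.

    dev_pairs : list of (s, t) phrase pairs extracted from one development cluster
    stats     : dict mapping s to a list (one entry per component model) of
                {'c_s': count c(s), 'c_st': {t: count c(s, t)}}
    """
    total = 0.0
    for s, t in dev_pairs:
        comps = stats[s]
        c_s = sum(w * c['c_s'] for w, c in zip(weights, comps))
        c_st = sum(w * c['c_st'].get(t, 0.0) for w, c in zip(weights, comps))
        p = c_st / c_s if c_s > 0 and c_st > 0 else 1e-10  # floor for unseen pairs
        total -= math.log(p)
    return total / len(dev_pairs)

def optimize_weights(dev_pairs, stats, n_components):
    """Find non-negative component weights that minimize dev-set cross-entropy."""
    x0 = np.ones(n_components) / n_components
    result = minimize(cross_entropy, x0, args=(dev_pairs, stats),
                      bounds=[(1e-6, None)] * n_components, method='L-BFGS-B')
    return result.x
```

Running such an optimization once per cluster yields the set of weight vectors between which the decoder switches at translation time.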
47 The goal is to perform domain adaptation without requiring domain labels or user input, either for development or for decoding. [sent-111, score-0.701]
48 For each sentence in the test set, assign it to the nearest cluster and use the translation model weights associated with the cluster. [sent-117, score-0.489]
49 1 Clustering the Development Set We use k-means clustering to cluster the sentences of the development set. [sent-121, score-0.418]
50 Figure 1 illustrates clustering in a two-dimensional vector space, and demonstrates that Euclidean distance is unsuitable because it may perform a clustering that is irrelevant to our purposes. [sent-125, score-0.43]
51 As a result of development set clustering, we obtain a bitext for each cluster, which we use to optimize the model weights, and a centroid per cluster. [sent-126, score-0.233]
52 At decoding time, we need only perform an assignment step. [sent-127, score-0.232]
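The clustering and the decode-time assignment step can be sketched as follows. The snippet approximates cosine-similarity k-means by length-normalizing the sentence vectors before running standard (Euclidean) k-means, since for unit vectors squared Euclidean distance is a monotone function of cosine similarity; it uses scikit-learn, leaves the construction of the sentence vectors abstract, and all names are illustrative rather than the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_dev_set(dev_vectors, k):
    """Cluster development sentence vectors, approximating cosine-similarity
    k-means by unit-normalizing the rows before Euclidean k-means."""
    X = np.asarray(dev_vectors, dtype=float)
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return km.labels_, km.cluster_centers_  # cluster of each dev sentence, centroids

def assign_cluster(sentence_vector, centroids):
    """Decode-time assignment: pick the centroid with the highest cosine similarity."""
    v = np.asarray(sentence_vector, dtype=float)
    v = v / (np.linalg.norm(v) + 1e-12)
    sims = centroids @ v / (np.linalg.norm(centroids, axis=1) + 1e-12)
    return int(np.argmax(sims))
```

The labels partition the development bitext into per-cluster bitexts on which the weight vectors are optimized; at decoding time each test sentence is mapped to the closest centroid and decoded with that cluster's weights.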
53 2 Scalability Considerations Our theoretical expectation is that domain adaptation will fail to perform well if the test data is from a different domain than the development data, or if the development data is a heterogeneous mix of domains. [sent-130, score-0.879]
54 A multi-domain setup can mitigate this risk, but only if the relevant domain is represented in the development data, and if the development data is adequately segmented for the optimization. [sent-131, score-0.37]
55 We thus suggest that the development data should contain enough data from all domains that one wants to adapt to, and that a high number of clusters should be used. [sent-132, score-0.212]
56 While the resource requirements increase with the number of component models, increasing the number of clusters is computationally cheap at runtime. [sent-133, score-0.289]
57 Only the clustering of the development set and the optimization of the translation model weights for each cluster are affected by k. [sent-134, score-0.869]
58 The biggest risk of increasing the number of clusters is that if the clusters become too small, perplexity minimization may overfit these small clusters. [sent-136, score-0.578]
59 We will experiment with different numbers of clusters, but since we expect the optimal number of clusters to depend on the amount of development data, and the number of domains, we cannot make generalized statements about the ideal value of k. [sent-137, score-0.343]
60 We perform a linear interpolation of models for each cluster, with interpolation coefficients optimized using perplexity minimization on the development set. [sent-139, score-0.487]
61 The cost of moving language model interpolation into the decoding phase is far greater than for translation models, since the number of hypotheses that need to be evaluated by the language model is several orders of magnitude higher than the number of phrase pairs used during the translation. [sent-140, score-0.721]
62 For the experiments with language model adaptation, we have chosen to perform linear interpolation offline, and perform language model switching during decoding. [sent-141, score-0.246]
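For completeness, a sketch of the offline estimation of per-cluster interpolation coefficients by perplexity minimization. This is the standard EM procedure for linearly interpolated language models, not code from the paper; it assumes that the per-word probabilities assigned by each component LM to the cluster's development text have already been computed, and all names are illustrative.

```python
import numpy as np

def interpolation_weights(component_probs, iterations=50):
    """EM estimation of linear interpolation coefficients.

    component_probs : array of shape (n_models, n_dev_words); entry [i, t] is the
                      probability the i-th component LM assigns to the t-th dev word.
    Returns coefficients lambda that locally minimize dev-set perplexity of
    p(w | h) = sum_i lambda_i * p_i(w | h).
    """
    P = np.asarray(component_probs, dtype=float)
    n_models, _ = P.shape
    lam = np.ones(n_models) / n_models
    for _ in range(iterations):
        mix = lam @ P                          # mixture probability of each dev word
        posteriors = (lam[:, None] * P) / mix  # responsibility of each LM per word
        lam = posteriors.mean(axis=1)          # re-estimated coefficients (sum to 1)
    return lam
```

One such coefficient vector is estimated per development cluster, and the resulting interpolated language models are switched at decoding time as described above.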
63 For scaling the approach to a high number of clusters, we envision that multi-pass decoding, with an unadapted language model in the first phase and rescoring with a language model adapted online, could perform adequately and keep the complexity independent of the number of clusters. [sent-143, score-0.375]
64 (Footnote 5) If the development set is labelled, one can also use a gold segmentation of development sets instead of k-means clustering. [sent-144, score-0.446]
65 At decoding time, cluster assignment can be performed by automatically assigning each sentence to the closest centroid, or again through gold labels, if available. [sent-146, score-0.319]
66 We report translation quality using BLEU (Papineni et al. [sent-152, score-0.245]
67 The first is an English–German translation task with two domains, texts related to information technology (IT) and legal documents (LEGAL). [sent-159, score-0.455]
68 7 data sets come from the domain IT: 6 from OPUS (Tiedemann, 2009) and a translation memory (tm) provided by our industry partner. [sent-161, score-0.469]
69 3 data sets are from the legal domain: the ECB corpus from OPUS, plus the JRC-Acquis (Steinberger et al. [sent-162, score-0.293]
70 The development sets are random samples from the respective in-domain bitexts (heldout from training). [sent-167, score-0.242]
71 We vary the number of clusters k from 1, which corresponds to adapting the models to the full development set, to 16. [sent-176, score-0.387]
72 The baseline is the concatenation of all training data. (Table caption: Weights optimized on four development sets, from gold split and clustering with k = 2.) [sent-177, score-0.423]
73 We also evaluate the labelled setting, where instead of unsupervised clustering, we use gold labels to split the development and test sets, and adapt the models to each labelled domain. [sent-179, score-0.462]
74 For our clustering experiments, the development set is the concatenation of the LEGAL and IT development sets. [sent-182, score-0.486]
75 This allows for a detailed analysis of the effect of development data clustering for the purpose of model adaptation. [sent-184, score-0.338]
76 We find that an adaptation of the TM and LM to the full development set (system “1 cluster”) yields the smallest improvements over the unadapted baseline. [sent-187, score-0.609]
77 For the IT test set, the system with gold labels and TM adaptation yields an improvement of 0. [sent-190, score-0.433]
78 Results with 16 clusters are slightly worse than those with 2–8 clusters due to two effects. [sent-218, score-0.42]
79 Secondly, about one third of the IT test set is assigned to a cluster that is not IT-specific, which weakens the effect of domain adaptation for the systems with 16 clusters. [sent-224, score-0.549]
80 This can be explained by the fact that the majority of training data is already from the legal domain, which makes it unnecessary to boost its impact on the probability distribution even further. [sent-226, score-0.21]
81 Table 5 shows the automatically obtained translation model weight vectors for two systems, “gold clusters” and “2 clusters”, for the feature p(t|s). [sent-227, score-0.375]
82 As in the first task, adaptation to the full development set is least effective. [sent-233, score-0.454]
83 The systems with unsupervised clusters significantly outperform the baseline. [sent-234, score-0.269]
84 We conclude that the translation model architecture is effective in a multi-domain setting, both with unsupervised clusters and labelled domains. [sent-254, score-0.799]
85 The fact that language model adaptation yields an additional improvement in our experiments suggests that it would be worthwhile to also investigate a language model data structure that efficiently supports multiple domains. [sent-255, score-0.493]
86 6 Conclusion We have presented a novel translation model architecture that delays the computation of translation model features to the decoding phase, and uses a vector of component models for this computation. [sent-256, score-1.213]
87 We have also described a usage scenario for this architecture, namely its ability to quickly switch between weight vectors in order to serve as an adapted model for multiple domains. [sent-257, score-0.251]
88 A simple, unsupervised clustering of development data is sufficient to make use of this ability and implement a multi-domain translation system. [sent-258, score-0.66]
89 If domain labels are available, one can also use the architecture in a labelled setting. [sent-259, score-0.241]
90 Future work could involve merging our translation model framework with the online adaptation of other models, or the log-linear weights. [sent-260, score-0.651]
91 , 2012), who perform feature augmentation to obtain multiple sets of adapted log-linear weights. [sent-262, score-0.244]
92 , 2012) use labelled data, their approach could in principle also be applied after unsupervised clustering. [sent-264, score-0.22]
93 The translation model framework could also serve as the basis of real-time adaptation of translation systems, e. [sent-265, score-0.855]
94 by using incremental means to update the weight vector, or having an incrementally trainable component model that learns from the post-edits by the user, and is assigned a suitable weight. [sent-267, score-0.257]
95 Combining multi-domain statistical machine translation models using automatic classifiers. [sent-273, score-0.245]
96 One system, many domains: Open-domain statistical machine translation via feature augmentation. [sent-297, score-0.245]
97 Mixture-modeling with unsupervised clusters for domain adaptation in statistical machine translation. [sent-341, score-0.694]
98 Perplexity minimization for translation model domain adaptation in statistical machine translation. [sent-345, score-0.764]
99 A general framework to weight heterogeneous parallel data for model adaptation in statistical machine translation. [sent-350, score-0.536]
100 DGTTM: A freely available translation memory in 22 languages. [sent-358, score-0.288]
wordName wordTfidf (topN-words)
[('adaptation', 0.321), ('translation', 0.245), ('clusters', 0.21), ('legal', 0.21), ('lex', 0.194), ('decoding', 0.193), ('clustering', 0.161), ('sennrich', 0.147), ('architecture', 0.141), ('development', 0.133), ('cluster', 0.124), ('unadapted', 0.113), ('tm', 0.113), ('storing', 0.107), ('domain', 0.104), ('labelled', 0.1), ('mert', 0.098), ('bleu', 0.097), ('steinberger', 0.086), ('weight', 0.086), ('cz', 0.085), ('accessing', 0.083), ('interpolation', 0.08), ('adapted', 0.079), ('component', 0.079), ('domains', 0.079), ('delays', 0.078), ('razmara', 0.078), ('rico', 0.078), ('weights', 0.076), ('computation', 0.075), ('delay', 0.074), ('tables', 0.073), ('bitexts', 0.07), ('opus', 0.07), ('gold', 0.07), ('vector', 0.069), ('tj', 0.069), ('unlabelled', 0.065), ('mans', 0.064), ('zabokrtsk', 0.064), ('perplexity', 0.064), ('phase', 0.063), ('sufficient', 0.062), ('principle', 0.061), ('weighting', 0.06), ('clark', 0.06), ('unsupervised', 0.059), ('yamamoto', 0.059), ('concatenation', 0.059), ('closest', 0.059), ('moses', 0.058), ('czeng', 0.057), ('ecb', 0.057), ('matecat', 0.057), ('centroid', 0.056), ('statistics', 0.056), ('interactive', 0.056), ('lm', 0.055), ('americas', 0.055), ('options', 0.053), ('phrase', 0.052), ('lookups', 0.052), ('multidomain', 0.052), ('minimization', 0.05), ('incrementally', 0.048), ('nez', 0.047), ('amta', 0.046), ('decoder', 0.045), ('koehn', 0.045), ('heterogeneous', 0.045), ('augmentation', 0.045), ('equation', 0.044), ('foster', 0.044), ('sj', 0.044), ('model', 0.044), ('plus', 0.044), ('adapting', 0.044), ('risk', 0.044), ('matsoukas', 0.043), ('memory', 0.043), ('yields', 0.042), ('multiple', 0.042), ('holger', 0.042), ('impractical', 0.042), ('mle', 0.042), ('alignment', 0.042), ('association', 0.041), ('optimized', 0.041), ('online', 0.041), ('denver', 0.04), ('shah', 0.04), ('parallel', 0.04), ('optimizer', 0.039), ('bojar', 0.039), ('och', 0.039), ('sets', 0.039), ('perform', 0.039), ('bears', 0.038), ('industry', 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
Author: Rico Sennrich ; Holger Schwenk ; Walid Aransa
Abstract: While domain adaptation techniques for SMT have proven to be effective at improving translation quality, their practicality for a multi-domain environment is often limited because of the computational and human costs of developing and maintaining multiple systems adapted to different domains. We present an architecture that delays the computation of translation model features until decoding, allowing for the application of mixture-modeling techniques at decoding time. We also describe a method for unsupervised adaptation with development and test data from multiple domains. Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation.
2 0.21844622 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation
Author: Conghui Zhu ; Taro Watanabe ; Eiichiro Sumita ; Tiejun Zhao
Abstract: Typical statistical machine translation systems are batch trained with a given training data and their performances are largely influenced by the amount of data. With the growth of the available data across different domains, it is computationally demanding to perform batch training every time when new data comes. In face of the problem, we propose an efficient phrase table combination method. In particular, we train a Bayesian phrasal inversion transduction grammars for each domain separately. The learned phrase tables are hierarchically combined as if they are drawn from a hierarchical Pitman-Yor process. The performance measured by BLEU is at least as comparable to the traditional batch training method. Furthermore, each phrase table is trained separately in each domain, and while computational overhead is significantly reduced by training them in parallel.
3 0.21285853 383 acl-2013-Vector Space Model for Adaptation in Statistical Machine Translation
Author: Boxing Chen ; Roland Kuhn ; George Foster
Abstract: This paper proposes a new approach to domain adaptation in statistical machine translation (SMT) based on a vector space model (VSM). The general idea is first to create a vector profile for the in-domain development (“dev”) set. This profile might, for instance, be a vector with a dimensionality equal to the number of training subcorpora; each entry in the vector reflects the contribution of a particular subcorpus to all the phrase pairs that can be extracted from the dev set. Then, for each phrase pair extracted from the training data, we create a vector with features defined in the same way, and calculate its similarity score with the vector representing the dev set. Thus, we obtain a decoding feature whose value represents the phrase pair’s closeness to the dev. This is a simple, computationally cheap form of instance weighting for phrase pairs. Experiments on large scale NIST evaluation data show improvements over strong baselines: +1.8 BLEU on Arabic to English and +1.4 BLEU on Chinese to English over a non-adapted baseline, and significant improvements in most circumstances over baselines with linear mixture model adaptation. An informal analysis suggests that VSM adaptation may help in making a good choice among words with the same meaning, on the basis of style and genre.
4 0.18185885 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
Author: Jiajun Zhang ; Chengqing Zong
Abstract: Currently, almost all of the statistical machine translation (SMT) models are trained with the parallel corpora in some specific domains. However, when it comes to a language pair or a different domain without any bilingual resources, the traditional SMT loses its power. Recently, some research works study the unsupervised SMT for inducing a simple word-based translation model from the monolingual corpora. It successfully bypasses the constraint of bitext for SMT and obtains a relatively promising result. In this paper, we take a step forward and propose a simple but effective method to induce a phrase-based model from the monolingual corpora given an automatically-induced translation lexicon or a manually-edited translation dictionary. We apply our method for the domain adaptation task and the extensive experiments show that our proposed method can substantially improve the translation quality. 1
5 0.18153308 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding
Author: Kun Wang ; Chengqing Zong ; Keh-Yih Su
Abstract: Since statistical machine translation (SMT) and translation memory (TM) complement each other in matched and unmatched regions, integrated models are proposed in this paper to incorporate TM information into phrase-based SMT. Unlike previous multi-stage pipeline approaches, which directly merge TM result into the final output, the proposed models refer to the corresponding TM information associated with each phrase at SMT decoding. On a Chinese–English TM database, our experiments show that the proposed integrated Model-III is significantly better than either the SMT or the TM systems when the fuzzy match score is above 0.4. Furthermore, integrated Model-III achieves overall 3.48 BLEU points improvement and 2.62 TER points reduction in comparison with the pure SMT system. Besides, the proposed models also outperform previous approaches significantly.
6 0.17513672 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
7 0.17145211 197 acl-2013-Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation
8 0.16513124 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers
9 0.14807168 328 acl-2013-Stacking for Statistical Machine Translation
10 0.14724042 345 acl-2013-The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
11 0.14711507 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation
12 0.1470862 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
13 0.1464013 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling
14 0.14289187 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT
15 0.14224899 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation
16 0.13989462 47 acl-2013-An Information Theoretic Approach to Bilingual Word Clustering
17 0.13587879 255 acl-2013-Name-aware Machine Translation
18 0.13561811 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
19 0.1354062 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation
20 0.13481498 134 acl-2013-Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction
topicId topicWeight
[(0, 0.312), (1, -0.177), (2, 0.215), (3, 0.101), (4, 0.017), (5, -0.012), (6, -0.018), (7, 0.033), (8, -0.018), (9, 0.017), (10, -0.008), (11, 0.009), (12, -0.042), (13, 0.081), (14, -0.02), (15, 0.037), (16, -0.085), (17, 0.007), (18, 0.043), (19, 0.037), (20, 0.074), (21, -0.026), (22, 0.037), (23, -0.023), (24, 0.038), (25, -0.065), (26, 0.107), (27, 0.041), (28, 0.036), (29, 0.013), (30, 0.11), (31, 0.144), (32, -0.107), (33, -0.092), (34, 0.012), (35, 0.093), (36, 0.025), (37, -0.064), (38, -0.113), (39, 0.025), (40, 0.008), (41, 0.116), (42, -0.068), (43, 0.024), (44, 0.06), (45, -0.007), (46, 0.072), (47, -0.033), (48, -0.013), (49, 0.058)]
simIndex simValue paperId paperTitle
same-paper 1 0.96527696 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
Author: Rico Sennrich ; Holger Schwenk ; Walid Aransa
Abstract: While domain adaptation techniques for SMT have proven to be effective at improving translation quality, their practicality for a multi-domain environment is often limited because of the computational and human costs of developing and maintaining multiple systems adapted to different domains. We present an architecture that delays the computation of translation model features until decoding, allowing for the application of mixture-modeling techniques at decoding time. We also describe a method for unsupervised adaptation with development and test data from multiple domains. Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation.
2 0.86128569 383 acl-2013-Vector Space Model for Adaptation in Statistical Machine Translation
Author: Boxing Chen ; Roland Kuhn ; George Foster
Abstract: This paper proposes a new approach to domain adaptation in statistical machine translation (SMT) based on a vector space model (VSM). The general idea is first to create a vector profile for the in-domain development (“dev”) set. This profile might, for instance, be a vector with a dimensionality equal to the number of training subcorpora; each entry in the vector reflects the contribution of a particular subcorpus to all the phrase pairs that can be extracted from the dev set. Then, for each phrase pair extracted from the training data, we create a vector with features defined in the same way, and calculate its similarity score with the vector representing the dev set. Thus, we obtain a decoding feature whose value represents the phrase pair’s closeness to the dev. This is a simple, computationally cheap form of instance weighting for phrase pairs. Experiments on large scale NIST evaluation data show improvements over strong baselines: +1.8 BLEU on Arabic to English and +1.4 BLEU on Chinese to English over a non-adapted baseline, and significant improvements in most circumstances over baselines with linear mixture model adaptation. An informal analysis suggests that VSM adaptation may help in making a good choice among words with the same meaning, on the basis of style and genre.
3 0.83987808 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation
Author: Conghui Zhu ; Taro Watanabe ; Eiichiro Sumita ; Tiejun Zhao
Abstract: Typical statistical machine translation systems are batch trained with a given training data and their performances are largely influenced by the amount of data. With the growth of the available data across different domains, it is computationally demanding to perform batch training every time when new data comes. In face of the problem, we propose an efficient phrase table combination method. In particular, we train a Bayesian phrasal inversion transduction grammars for each domain separately. The learned phrase tables are hierarchically combined as if they are drawn from a hierarchical Pitman-Yor process. The performance measured by BLEU is at least as comparable to the traditional batch training method. Furthermore, each phrase table is trained separately in each domain, and while computational overhead is significantly reduced by training them in parallel.
4 0.80042076 328 acl-2013-Stacking for Statistical Machine Translation
Author: Majid Razmara ; Anoop Sarkar
Abstract: We propose the use of stacking, an ensemble learning technique, to the statistical machine translation (SMT) models. A diverse ensemble of weak learners is created using the same SMT engine (a hierarchical phrase-based system) by manipulating the training data and a strong model is created by combining the weak models on-the-fly. Experimental results on two language pairs and three different sizes of training data show significant improvements of up to 4 BLEU points over a conventionally trained SMT model.
5 0.7333498 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding
Author: Kun Wang ; Chengqing Zong ; Keh-Yih Su
Abstract: Since statistical machine translation (SMT) and translation memory (TM) complement each other in matched and unmatched regions, integrated models are proposed in this paper to incorporate TM information into phrase-based SMT. Unlike previous multi-stage pipeline approaches, which directly merge TM result into the final output, the proposed models refer to the corresponding TM information associated with each phrase at SMT decoding. On a Chinese–English TM database, our experiments show that the proposed integrated Model-III is significantly better than either the SMT or the TM systems when the fuzzy match score is above 0.4. Furthermore, integrated Model-III achieves overall 3.48 BLEU points improvement and 2.62 TER points reduction in comparison with the pure SMT system. Besides, the proposed models also outperform previous approaches significantly.
6 0.717291 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation
7 0.70069665 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT
8 0.70064461 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk
9 0.6958583 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation
10 0.69468623 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
11 0.68374306 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation
12 0.67719042 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
13 0.66843086 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers
14 0.66108072 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
15 0.65974706 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines
16 0.65116733 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
17 0.63768095 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language
18 0.63698328 180 acl-2013-Handling Ambiguities of Bilingual Predicate-Argument Structures for Statistical Machine Translation
19 0.63548231 214 acl-2013-Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation
20 0.63190609 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling
topicId topicWeight
[(0, 0.041), (6, 0.031), (11, 0.031), (24, 0.035), (26, 0.043), (35, 0.051), (42, 0.491), (48, 0.038), (70, 0.022), (88, 0.015), (90, 0.063), (95, 0.076)]
simIndex simValue paperId paperTitle
1 0.98366123 125 acl-2013-Distortion Model Considering Rich Context for Statistical Machine Translation
Author: Isao Goto ; Masao Utiyama ; Eiichiro Sumita ; Akihiro Tamura ; Sadao Kurohashi
Abstract: This paper proposes new distortion models for phrase-based SMT. In decoding, a distortion model estimates the source word position to be translated next (NP) given the last translated source word position (CP). We propose a distortion model that can consider the word at the CP, a word at an NP candidate, and the context of the CP and the NP candidate simultaneously. Moreover, we propose a further improved model that considers richer context by discriminating label sequences that specify spans from the CP to NP candidates. It enables our model to learn the effect of relative word order among NP candidates as well as to learn the effect of distances from the training data. In our experiments, our model improved 2.9 BLEU points for Japanese-English and 2.6 BLEU points for Chinese-English translation compared to the lexical reordering models.
Author: Sina Zarriess ; Jonas Kuhn
Abstract: We suggest a generation task that integrates discourse-level referring expression generation and sentence-level surface realization. We present a data set of German articles annotated with deep syntax and referents, including some types of implicit referents. Our experiments compare several architectures varying the order of a set of trainable modules. The results suggest that a revision-based pipeline, with intermediate linearization, significantly outperforms standard pipelines or a parallel architecture.
3 0.972597 372 acl-2013-Using CCG categories to improve Hindi dependency parsing
Author: Bharat Ram Ambati ; Tejaswini Deoskar ; Mark Steedman
Abstract: We show that informative lexical categories from a strongly lexicalised formalism such as Combinatory Categorial Grammar (CCG) can improve dependency parsing of Hindi, a free word order language. We first describe a novel way to obtain a CCG lexicon and treebank from an existing dependency treebank, using a CCG parser. We use the output of a supertagger trained on the CCGbank as a feature for a state-of-the-art Hindi dependency parser (Malt). Our results show that using CCG categories improves the accuracy of Malt on long distance dependencies, for which it is known to have weak rates of recovery.
same-paper 4 0.96142447 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
Author: Rico Sennrich ; Holger Schwenk ; Walid Aransa
Abstract: While domain adaptation techniques for SMT have proven to be effective at improving translation quality, their practicality for a multi-domain environment is often limited because of the computational and human costs of developing and maintaining multiple systems adapted to different domains. We present an architecture that delays the computation of translation model features until decoding, allowing for the application of mixture-modeling techniques at decoding time. We also describe a method for unsupervised adaptation with development and test data from multiple domains. Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation.
5 0.96116507 64 acl-2013-Automatically Predicting Sentence Translation Difficulty
Author: Abhijit Mishra ; Pushpak Bhattacharyya ; Michael Carl
Abstract: In this paper we introduce Translation Difficulty Index (TDI), a measure of difficulty in text translation. We first define and quantify translation difficulty in terms of TDI. We realize that any measure of TDI based on direct input by translators is fraught with subjectivity and adhocism. We, rather, rely on cognitive evidences from eye tracking. TDI is measured as the sum of fixation (gaze) and saccade (rapid eye movement) times of the eye. We then establish that TDI is correlated with three properties of the input sentence, viz. length (L), degree of polysemy (DP) and structural complexity (SC). We train a Support Vector Regression (SVR) system to predict TDIs for new sentences using these features as input. The prediction done by our framework is well correlated with the empirical gold standard data, which is a repository of < L, DP, SC > and TDI pairs for a set of sentences. The primary use of our work is a way of “binning” sentences (to be translated) in “easy”, “medium” and “hard” categories as per their predicted TDI. This can decide pricing of any translation task, especially useful in a scenario where parallel corpora for Machine Translation are built through translation crowdsourcing/outsourcing. This can also provide a way of monitoring progress of second language learners.
6 0.93770146 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation
7 0.93586081 206 acl-2013-Joint Event Extraction via Structured Prediction with Global Features
8 0.92969602 302 acl-2013-Robust Automated Natural Language Processing with Multiword Expressions and Collocations
9 0.78294206 166 acl-2013-Generalized Reordering Rules for Improved SMT
10 0.76433271 77 acl-2013-Can Markov Models Over Minimal Translation Units Help Phrase-Based SMT?
11 0.73892665 281 acl-2013-Post-Retrieval Clustering Using Third-Order Similarity Measures
12 0.72812861 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
13 0.72644657 56 acl-2013-Argument Inference from Relevant Event Mentions in Chinese Argument Extraction
14 0.71147025 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation
15 0.69980067 69 acl-2013-Bilingual Lexical Cohesion Trigger Model for Document-Level Machine Translation
16 0.68782687 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk
17 0.68337178 199 acl-2013-Integrating Multiple Dependency Corpora for Inducing Wide-coverage Japanese CCG Resources
18 0.67954421 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation
19 0.67897761 363 acl-2013-Two-Neighbor Orientation Model with Cross-Boundary Global Contexts
20 0.67774343 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT