acl acl2012 acl2012-143 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Majid Razmara ; George Foster ; Baskaran Sankaran ; Anoop Sarkar
Abstract: Statistical machine translation is often faced with the problem of combining training data from many diverse sources into a single translation model which then has to translate sentences in a new domain. We propose a novel approach, ensemble decoding, which combines a number of translation systems dynamically at the decoding step. In this paper, we evaluate performance on a domain adaptation setting where we translate sentences from the medical domain. Our experimental results show that ensemble decoding outperforms various strong baselines including mixture models, the current state-of-the-art for domain adaptation in machine translation.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Statistical machine translation is often faced with the problem of combining training data from many diverse sources into a single translation model which then has to translate sentences in a new domain. [sent-6, score-0.503]
2 We propose a novel approach, ensemble decoding, which combines a number of translation systems dynamically at the decoding step. [sent-7, score-0.762]
3 In this paper, we evaluate performance on a domain adaptation setting where we translate sentences from the medical domain. [sent-8, score-0.328]
4 Our experimental results show that ensemble decoding outperforms various strong baselines including mixture models, the current state-of-the-art for domain adaptation in machine translation. [sent-9, score-1.312]
5 1 Introduction Statistical machine translation (SMT) systems require large parallel corpora in order to obtain reasonable translation quality. [sent-10, score-0.456]
6 It is an interesting question whether a model that is trained on an existing large bilingual corpus in a specific domain can be adapted to another domain for which little parallel data is present. [sent-13, score-0.306]
7 Domain adaptation techniques aim at finding ways to adjust an out-of-domain (OUT) model to represent a target domain (in-domain or IN). [sent-14, score-0.415]
8 Common techniques for model adaptation adapt two main components of contemporary state-of-the-art SMT systems: the language model and the translation model. [sent-15, score-0.565]
9 However, language model adaptation is a more straightforward problem compared to translation model adaptation, because various measures such as perplexity of adapted language models can be easily computed on data in the target domain. [sent-16, score-0.581]
10 As a result, language model adaptation has been well studied in various work (Clarkson and Robinson, 1997; Seymore and Rosenfeld, 1997; Bacchiani and Roark, 2003; Eck et al. [sent-17, score-0.269]
11 It is also easier to obtain monolingual data in the target domain, compared to bilingual data which is required for translation model adaptation. [sent-19, score-0.294]
12 In this paper, we focus on adapting only the translation model, fixing the language model for all experiments. [sent-20, score-0.309]
13 We expect that domain adaptation for machine translation can be improved further by combining orthogonal techniques for translation model adaptation with language model adaptation. [sent-21, score-1.173]
14 In this paper, a new approach for adapting the translation model is proposed. [sent-22, score-0.259]
15 We use a novel system combination approach called ensemble decoding in order to combine two or more translation models with the goal of constructing a system that outperforms all the component models. [sent-23, score-1.045]
16 The main applications of ensemble models are domain adaptation, domain mixing and system combination. [sent-26, score-0.348]
17 We modified Kriya (Sankaran et al., 2012), an in-house implementation of a hierarchical phrase-based translation system (Chiang, 2005), to implement ensemble decoding using multiple translation models. [sent-28, score-1.018]
18 We compare the results of ensemble decoding with a number of baselines for domain adaptation. [sent-29, score-0.721]
19 In addition to the basic approach of concatenation of in-domain and out-of-domain data, we also trained a log-linear mixture model (Foster and Kuhn, 2007) [sent-30, score-0.489]
20 as well as the linear mixture model of (Foster et al. [sent-32, score-0.46]
21 Furthermore, within the framework of ensemble decoding, we study and evaluate various methods for combining translation tables. [sent-34, score-0.622]
22 In addition to this baseline, we have experimented with two more sophisticated baselines which are based on mixture techniques. [sent-36, score-0.469]
23 2.1 Log-Linear Mixture Log-linear translation model (TM) mixtures are of the form: p(ē|f̄) ∝ exp(Σ_m λ_m log p_m(ē|f̄)). [sent-38, score-0.361]
24 Whenever a phrase pair does not appear in a component phrase table, we set the corresponding p_m(ē|f̄) to a small epsilon value. [sent-44, score-0.245]
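To make the combination concrete, the following is a minimal sketch of a log-linear TM mixture score. The dictionary-based table representation, the epsilon value, and the toy phrase pairs are illustrative assumptions, not the paper's actual data structures.

```python
import math

EPSILON = 1e-7  # illustrative floor for pairs missing from a component table


def loglinear_mixture_score(pair, tables, lambdas):
    """Unnormalized log-linear mixture: exp(sum_m lambda_m * log p_m(e|f))."""
    total = 0.0
    for table, lam in zip(tables, lambdas):
        prob = table.get(pair, EPSILON)  # back off to epsilon for unseen pairs
        total += lam * math.log(prob)
    return math.exp(total)


# Toy usage with two hypothetical component tables (IN and OUT)
in_tm = {("maison", "house"): 0.7, ("maison", "home"): 0.3}
out_tm = {("maison", "house"): 0.5, ("maison", "building"): 0.5}
print(loglinear_mixture_score(("maison", "house"), [in_tm, out_tm], [0.6, 0.4]))
```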
25 We then find the set of weights that minimize the cross-entropy of the mixture p(ē|f̄) with respect to p̃(ē, f̄): λ̂ = argmax_λ Σ_{ē,f̄} p̃(ē, f̄) log Σ_m λ_m p_m(ē|f̄). For efficiency and stability, we use the EM algorithm to find λ̂ rather than L-BFGS as in (Foster et al. [sent-50, score-0.449]
26 Whenever a phrase pair does not appear in a component phrase table, we set the corresponding p_m(ē|f̄) to 0; pairs in p̃(ē, f̄) that do not appear in at least one component table are discarded. [sent-52, score-0.398]
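The objective above is the log-likelihood of a linear mixture Σ_m λ_m p_m(ē|f̄) under the empirical distribution p̃, so the weights can be re-estimated with standard mixture-model EM. The sketch below is an illustration under simplifying assumptions (dictionary inputs, a fixed iteration count), not the paper's implementation.

```python
def em_linear_mixture(empirical, component_probs, iters=50):
    """Estimate weights lambda_m for p(e|f) = sum_m lambda_m p_m(e|f) by EM,
    maximizing sum_{(e,f)} p_tilde(e,f) * log sum_m lambda_m p_m(e|f).

    empirical: dict mapping (e, f) phrase pairs to p_tilde(e, f)
    component_probs: list of dicts mapping (e, f) to p_m(e|f); pairs absent
        from every component are assumed to have been discarded already.
    """
    M = len(component_probs)
    lam = [1.0 / M] * M  # uniform initialization
    for _ in range(iters):
        counts = [0.0] * M
        for pair, p_tilde in empirical.items():
            probs = [p_m.get(pair, 0.0) for p_m in component_probs]
            denom = sum(l * p for l, p in zip(lam, probs))
            if denom == 0.0:
                continue
            for m in range(M):
                # E-step: responsibility of component m for this pair
                counts[m] += p_tilde * lam[m] * probs[m] / denom
        total = sum(counts)
        lam = [c / total for c in counts]  # M-step re-estimate
    return lam
```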
27 3 Ensemble Decoding Ensemble decoding is a way to combine the expertise of different models in a single model. [sent-55, score-0.228]
28 The current implementation is able to combine hierarchical phrase-based systems (Chiang, 2005) as well as phrase-based translation systems (Koehn et al. [sent-56, score-0.256]
29 However, the method can be easily extended to support combining a number of heterogeneous translation systems, e.g. [sent-58, score-0.244]
30 Given a number of translation models which are already trained and tuned, the ensemble decoder uses hypotheses constructed from all of the models in order to translate a sentence. [sent-62, score-0.761]
31 The cells of the CKY chart are populated with appropriate rules from all the phrase tables of different components. [sent-65, score-0.334]
32 the maximum span length) are populated from the phrase tables and the rest of the chart uses glue rules as defined in (Chiang, 2005). [sent-68, score-0.28]
33 The rules suggested by the component models are combined in a single set. [sent-69, score-0.263]
34 Some of the rules may be unique and others may be common with other component model rule sets, though with different scores. [sent-70, score-0.26]
35 Depending on the mixture operation used for combining the scores, we would get different mixture scores. [sent-72, score-0.867]
36 The choice of mixture operation will be discussed in Section 3. [sent-73, score-0.46]
37 Each cell, covering a span, is populated with rules from all component models as well as from cells covering a sub-span of it. [sent-76, score-0.416]
38 Therefore, the score of a phrase-pair (ē, f̄) in the ensemble model is: p(ē|f̄) ∝ exp(w_1 · φ_1 ⊕ w_2 · φ_2 ⊕ ...), where w_m · φ_m is the weighted feature score of component model m and ⊕ denotes the mixture operation. [sent-80, score-0.428]
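The following sketch illustrates how rules proposed by the component models for one span might be pooled so that each rule carries a per-model score vector, which the mixture operation ⊕ then combines. The rule and score representations here are hypothetical, not taken from Kriya.

```python
from collections import defaultdict


def pool_rules(span_rules_per_model):
    """span_rules_per_model: list (one entry per component model) of dicts
    mapping a rule (source side, target side) to that model's score w_m . phi_m.
    Returns a dict mapping each rule to a per-model score list, with None
    where a model did not suggest the rule."""
    M = len(span_rules_per_model)
    pooled = defaultdict(lambda: [None] * M)
    for m, rules in enumerate(span_rules_per_model):
        for rule, score in rules.items():
            pooled[rule][m] = score
    return pooled
```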
39 3.1 Mixture Operations Mixture operations receive two or more scores (probabilities) and return the mixture score (probability). [sent-84, score-0.518]
40 In this section, we explore different options for mixture operation and discuss some of the characteristics of these mixture operations. [sent-85, score-0.832]
41 • Weighted Sum (wsum): in wsum the ensemble probability is proportional to the weighted sum of all individual model probabilities, i.e. [sent-86, score-0.662]
42 p(ē|f̄) ∝ Σ_m λ_m exp(w_m · φ_m), where m denotes the index of component models, M is the total number of them, and λ_m is the weight for component m. [sent-91, score-0.306]
43 Weighted Max (wmax): where the ensemble score is the weighted max of all the model scores. [sent-92, score-0.471]
44 Model Switching (Switch): in model switching, each cell in the CKY chart gets populated only by rules from one of the models and the other models’ rules are discarded. [sent-97, score-0.313]
45 In this method, we need to define a binary indicator function δ(f̄, m) for each span and component model to specify which model's rules to retain for each span. [sent-99, score-0.359]
46 This sum has to take into account the translation table limit (ttl) on the number of rules suggested by each model for each cell: ψ(f̄, n) = λ_n Σ_ē exp(w_n · φ_n(ē, f̄)), where the sum ranges over the rules that model n suggests for the span. [sent-101, score-0.371]
47 These models are also known as Logarithmic Opinion Pools (LOPs). Figure 1: The cells in the CKY chart are populated using rules from all component models and sub-span cells. [sent-107, score-0.547]
48 Product models have been used in combining LMs and TMs in SMT, as well as in some other NLP tasks such as ensemble parsing (Petrov, 2010). [sent-110, score-0.385]
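Using the pooling sketch above, the four mixture operations can be written as functions over a rule's per-model scores (w_m · φ_m) and the component weights λ_m. This is a minimal sketch: the fallback for rules missing from a component table and the ttl cut-off are assumptions made for illustration.

```python
import math


def wsum(scores, lambdas):
    """Weighted Sum: proportional to sum_m lambda_m * exp(score_m)."""
    return sum(l * math.exp(s) for s, l in zip(scores, lambdas) if s is not None)


def wmax(scores, lambdas):
    """Weighted Max: proportional to max_m lambda_m * exp(score_m)."""
    return max(l * math.exp(s) for s, l in zip(scores, lambdas) if s is not None)


def prod(scores, lambdas, floor=math.log(1e-7)):
    """Product (LOP-style): proportional to prod_m exp(score_m)^lambda_m,
    i.e. exp(sum_m lambda_m * score_m). Missing scores fall back to an
    illustrative floor value (an assumption, not prescribed by the paper)."""
    return math.exp(sum(l * (s if s is not None else floor)
                        for s, l in zip(scores, lambdas)))


def switch_model(pooled, lambdas, criterion="max", ttl=20):
    """Model Switching: pick one component model for a span; only its rules
    populate the CKY cell. 'pooled' maps each rule to a per-model score list.
    criterion 'max': model owning the single best weighted rule score;
    criterion 'sum': model with the largest psi(f, n) = lambda_n * sum exp(score),
    summed over at most ttl rules it suggests for the span."""
    best_n, best_val = None, float("-inf")
    for n, lam in enumerate(lambdas):
        model_scores = sorted((s[n] for s in pooled.values() if s[n] is not None),
                              reverse=True)[:ttl]
        if not model_scores:
            continue
        if criterion == "max":
            val = lam * math.exp(model_scores[0])
        else:
            val = lam * sum(math.exp(s) for s in model_scores)
        if val > best_val:
            best_n, best_val = n, val
    return best_n
```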
49 Each of these mixture operations has a specific property that makes it work in specific domain adaptation or system combination scenarios. [sent-111, score-0.844]
50 For instance, LOPs may not be optimal for domain adaptation in the setting where there are two or more models trained on heterogeneous corpora. [sent-112, score-0.381]
51 2, we compare the BLEU scores of different mixture operations on a French-English experimental setup. [sent-120, score-0.518]
52 3.2 Normalization Since, in log-linear models, the model scores are not normalized to form probability distributions, the scores that different models assign to each phrase-pair may not be on the same scale. [sent-122, score-0.261]
53 So the list of rules coming from each model for a cell in the CKY chart is normalized before getting mixed with other phrase-table rules. [sent-126, score-0.25]
54 However, we did not try it as the BLEU scores we got using the normalization heuristic were not promising and it would impose a cost in decoding as well. [sent-130, score-0.303]
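One plausible reading of this normalization step, sketched below under that assumption, is a softmax over each model's rule list for a cell, so that the values mixed across models form probability distributions; the exact heuristic used in the paper may differ.

```python
import math


def normalize_cell_scores(rule_scores):
    """Normalize one model's scores for the rules it proposes in a CKY cell
    so that exp(score) values sum to 1 (a softmax over the rule list).
    rule_scores: dict mapping a rule to its raw model score w_m . phi_m."""
    max_s = max(rule_scores.values())  # subtract max for numerical stability
    exps = {rule: math.exp(s - max_s) for rule, s in rule_scores.items()}
    z = sum(exps.values())
    return {rule: e / z for rule, e in exps.items()}
```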
55 Component weights for each mixture operation are optimized on the dev-set using CONDOR. [sent-134, score-0.537]
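CONDOR is a derivative-free optimizer, so component-weight tuning can be pictured as maximizing dev-set BLEU over the weights with any such optimizer. The sketch below substitutes scipy's Powell method as a stand-in for CONDOR and assumes a hypothetical decode_and_bleu callback; it is an illustration, not the paper's tuning code.

```python
from scipy.optimize import minimize


def tune_component_weights(decode_and_bleu, num_models):
    """Tune ensemble component weights on a held-out set with a derivative-free
    optimizer. 'decode_and_bleu' is a user-supplied function that decodes the
    dev set with the given weights and returns its BLEU score."""
    def objective(weights):
        # L1-normalize so weight vectors stay comparable across evaluations
        total = sum(abs(w) for w in weights) or 1.0
        normed = [abs(w) / total for w in weights]
        return -decode_and_bleu(normed)  # minimize negative BLEU

    x0 = [1.0 / num_models] * num_models  # start from uniform weights
    result = minimize(objective, x0, method="Powell")
    total = sum(abs(w) for w in result.x) or 1.0
    return [abs(w) / total for w in result.x]
```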
56 For the mixture baselines, we used a standard one-pass phrase-based system (Koehn et al. [sent-141, score-0.372]
57 , 2005), with the following 7 features: relative-frequency and lexical translation model (TM) probabilities in both directions; word-displacement distortion model; language model (LM) and word count. [sent-143, score-0.353]
58 For ensemble decoding, we modified an in-house implementation of a hierarchical phrase-based system, Kriya (Sankaran et al. [sent-146, score-0.425]
59 Fixing the language model allows us to compare various translation model combination techniques. [sent-154, score-0.386]
60 Table 3 shows the results of ensemble decoding with different mixture operations and model weight settings. [sent-164, score-1.042]
61 Each mixture operation has been evaluated on the test-set by setting the component weights uniformly (denoted by uniform) and by tuning the weights using CONDOR (denoted by tuned) on a held-out set. [sent-165, score-0.81]
62 All of these mixture operations were able to significantly improve over the concatenation baseline. [sent-169, score-0.506]
63 linear mixture model implemented in Hiero) which is statistically significant based on Clark et al. [sent-174, score-0.46]
64 lowest score among the mixture operations; however, after tuning, it learns to bias the weights towards one of the models and hence improves by 1. [sent-185, score-0.502]
65 Although Switching:Sum outperforms the concatenation baseline, it is substantially worse than other mixture operations. [sent-187, score-0.439]
66 An interesting observation based on the results in Table 3 is that uniform weights perform reasonably well given that the component weights are not optimized and therefore model scores may not be on the same scale (refer to the discussion in §3. [sent-189, score-0.436]
67 This shared component controls the variance of the weights in the two models when combined with the standard L-1 normalization of each model’s weights, and hence prevents the models from having overly divergent scores for the same input. [sent-194, score-0.541]
68 The boxes show how the Ensemble model is able to use n-grams from the IN and OUT models to construct a better translation than either of them. [sent-197, score-0.312]
69 Similarly, the second example shows how ensemble decoding improves lexical choices as well as word re-orderings. [sent-200, score-0.553]
70 5.1 Domain Adaptation Early approaches to domain adaptation involved information retrieval techniques where sentence pairs related to the target domain were retrieved from the training corpus using IR methods (Eck et al. [sent-202, score-0.474]
71 Other domain adaptation methods involve techniques that distinguish between general and domain-specific examples (Daumé and Marcu, 2006). [sent-209, score-0.365]
72 Two well-known examples of such methods are linear mixtures and log-linear mixtures (Koehn and Schroeder, 2007; Civera and Juan, 2007; Foster and Kuhn, 2007), which were used as baselines and discussed in Section 2. [sent-217, score-0.301]
73 5.2 System Combination Tackling the model adaptation problem using system combination approaches has been explored in various works (Koehn and Schroeder, 2007; Hildebrand and Vogel, 2009). [sent-223, score-0.384]
74 In a similar approach, Koehn and Schroeder (2007) use a feature of the factored translation model framework in the Moses SMT system (Koehn and Schroeder, 2007) to enable multiple alternative decoding paths. [sent-225, score-0.434]
75 Two decoding paths, one for each translation table (IN and OUT), were used during decoding. [sent-226, score-0.384]
76 The Moses SMT system implements (Koehn and Schroeder, 2007) and can treat multiple translation tables in two different ways: intersection and union. [sent-229, score-0.246]
77 Firstly, unlike the multi-table support of Moses which only supports phrase-based translation table combination, our approach supports ensembles of both hierarchical and phrase-based systems. [sent-236, score-0.256]
78 With little modification, it can also support ensembles of syntax-based systems with the other two state-of-the-art SMT systems. [sent-237, score-0.378]
79 Secondly, our combining method uses the union option, but instead of preserving the features of all phrase-tables, it only combines their scores using various mixture operations. [sent-238, score-0.527]
80 Finally, by not increasing the number of features, we can add as many translation models as we need without a serious performance drop. [sent-240, score-0.262]
81 (2010), a generalization of consensus or minimum Bayes risk decoding where the search space consists of those of multiple systems, in that model combination uses the forest of derivations of all component models to do the combination. [sent-244, score-0.508]
82 In other words, it requires all component models to fully decode each sentence, compute n-gram expectations from each component model and calculate posterior probabilities over translation derivations. [sent-245, score-0.662]
83 In contrast, in our approach we only use partial hypotheses from the component models and the derivation forest is constructed by the ensemble model. [sent-246, score-0.652]
84 A major difference is that in the model combination approach the component search spaces are conjoined but not intermingled, whereas in our approach these search spaces are intermixed over spans. [sent-247, score-0.28]
85 Their derivation-level max-translation decoding is similar to our ensemble decoding with wsum as the mixture operation. [sent-255, score-1.197]
86 We did not restrict ourselves to this particular mixture operation and experimented with a number of different mixing techniques and, as Table 3 shows, we could improve over wsum in our experimental setup. [sent-256, score-0.709]
87 (2009) used a modified version of MERT to tune max-translation decoding weights, while we use a two-step approach using MERT for tuning each component model separately and then using CONDOR to tune component weights on top of them. [sent-258, score-0.476]
88 6 Conclusion & Future Work In this paper, we presented a new approach for domain adaptation using ensemble decoding. [sent-259, score-0.706]
89 In this approach a number of MT systems are combined at decoding time in order to form an ensemble model. [sent-260, score-0.553]
90 The model combination can be done using various mixture operations. [sent-261, score-0.499]
91 Future work includes extending this approach to use multiple translation models with multiple language models in ensemble decoding. [sent-265, score-0.693]
92 Different mixture operations can be investigated and the behaviour of each operation can be studied in more detail. [sent-266, score-0.527]
93 We will also add the capability of supporting syntax-based ensemble decoding and experiment with how a phrase-based system can benefit from syntax information present in a syntax-aware MT system. [sent-267, score-0.553]
94 Furthermore, ensemble decoding can be applied to domain mixing settings in which development sets and test sets include sentences from different domains and genres; this is a very suitable setting for an ensemble model which can adapt to new domains at test time. [sent-268, score-1.167]
95 Domain adaptation for statistical machine translation with monolingual resources. [sent-285, score-0.501]
96 Domain adaptation in statistical machine translation with mixture modelling. [sent-300, score-0.838]
97 Language model adaptation using mixtures and an exponentially decaying cache. [sent-314, score-0.371]
98 Language model adaptation for statistical machine translation based on information retrieval. [sent-332, score-0.516]
99 Discriminative instance weighting for domain adaptation in statistical machine translation. [sent-341, score-0.406]
100 Adaptation of the translation model for statistical machine translation based on information retrieval. [sent-351, score-0.506]
wordName wordTfidf (topN-words)
[('ensemble', 0.378), ('mixture', 0.372), ('adaptation', 0.219), ('translation', 0.209), ('decoding', 0.175), ('schroeder', 0.169), ('component', 0.153), ('xm', 0.139), ('foster', 0.129), ('switching', 0.124), ('lops', 0.121), ('domain', 0.109), ('statmt', 0.108), ('condor', 0.105), ('mixtures', 0.102), ('bleu', 0.101), ('cky', 0.097), ('wsum', 0.097), ('populated', 0.096), ('koehn', 0.09), ('operation', 0.088), ('eck', 0.084), ('hildebrand', 0.084), ('pm', 0.083), ('stroudsburg', 0.082), ('scores', 0.079), ('chart', 0.078), ('weights', 0.077), ('mixing', 0.077), ('combination', 0.077), ('kriya', 0.073), ('sankaran', 0.073), ('smt', 0.071), ('hypotheses', 0.068), ('hiero', 0.068), ('concatenation', 0.067), ('operations', 0.067), ('tm', 0.066), ('exp', 0.066), ('cell', 0.065), ('emea', 0.063), ('mert', 0.062), ('pa', 0.06), ('baselines', 0.059), ('cells', 0.057), ('rules', 0.057), ('george', 0.055), ('max', 0.055), ('sum', 0.055), ('optimizer', 0.054), ('models', 0.053), ('chiang', 0.051), ('model', 0.05), ('span', 0.049), ('normalization', 0.049), ('baskaran', 0.048), ('berghen', 0.048), ('majid', 0.048), ('mmax', 0.048), ('mpm', 0.048), ('pools', 0.048), ('portage', 0.048), ('prod', 0.048), ('seymore', 0.048), ('vanden', 0.048), ('product', 0.048), ('hierarchical', 0.047), ('wm', 0.046), ('phrase', 0.046), ('canada', 0.044), ('probabilities', 0.044), ('och', 0.043), ('tuning', 0.043), ('almut', 0.042), ('bacchiani', 0.042), ('civera', 0.042), ('silja', 0.042), ('ueffing', 0.042), ('anoop', 0.042), ('union', 0.041), ('icassp', 0.04), ('weighting', 0.04), ('pages', 0.04), ('sadat', 0.039), ('clarkson', 0.039), ('powell', 0.039), ('linear', 0.038), ('weighted', 0.038), ('parallel', 0.038), ('experimented', 0.038), ('statistical', 0.038), ('intersection', 0.037), ('techniques', 0.037), ('matthias', 0.036), ('orthogonal', 0.036), ('clark', 0.036), ('stephan', 0.035), ('vogel', 0.035), ('monolingual', 0.035), ('combining', 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000012 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation
Author: Majid Razmara ; George Foster ; Baskaran Sankaran ; Anoop Sarkar
Abstract: Statistical machine translation is often faced with the problem of combining training data from many diverse sources into a single translation model which then has to translate sentences in a new domain. We propose a novel approach, ensemble decoding, which combines a number of translation systems dynamically at the decoding step. In this paper, we evaluate performance on a domain adaptation setting where we translate sentences from the medical domain. Our experimental results show that ensemble decoding outperforms various strong baselines including mixture models, the current state-of-the-art for domain adaptation in machine translation.
2 0.2307497 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation
Author: Tong Xiao ; Jingbo Zhu ; Hao Zhang ; Qiang Li
Abstract: We present a new open source toolkit for phrase-based and syntax-based machine translation. The toolkit supports several state-of-the-art models developed in statistical machine translation, including the phrase-based model, the hierachical phrase-based model, and various syntaxbased models. The key innovation provided by the toolkit is that the decoder can work with various grammars and offers different choices of decoding algrithms, such as phrase-based decoding, decoding as parsing/tree-parsing and forest-based decoding. Moreover, several useful utilities were distributed with the toolkit, including a discriminative reordering model, a simple and fast language model, and an implementation of minimum error rate training for weight tuning. 1
3 0.22461452 203 acl-2012-Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information
Author: Jinsong Su ; Hua Wu ; Haifeng Wang ; Yidong Chen ; Xiaodong Shi ; Huailin Dong ; Qun Liu
Abstract: To adapt a translation model trained from the data in one domain to another, previous works paid more attention to the studies of parallel corpus while ignoring the in-domain monolingual corpora which can be obtained more easily. In this paper, we propose a novel approach for translation model adaptation by utilizing in-domain monolingual topic information instead of the in-domain bilingual corpora, which incorporates the topic information into translation probability estimation. Our method establishes the relationship between the out-of-domain bilingual corpus and the in-domain monolingual corpora via topic mapping and phrase-topic distribution probability estimation from in-domain monolingual corpora. Experimental result on the NIST Chinese-English translation task shows that our approach significantly outperforms the baseline system.
4 0.22285631 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
Author: Xiaodong He ; Li Deng
Abstract: This paper proposes a new discriminative training method in constructing phrase and lexicon translation models. In order to reliably learn a myriad of parameters in these models, we propose an expected BLEU score-based utility function with KL regularization as the objective, and train the models on a large parallel dataset. For training, we derive growth transformations for phrase and lexicon translation probabilities to iteratively improve the objective. The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system. In IWSLT 201 1 Benchmark, our system using the proposed method achieves the best Chinese-to-English translation result on the task of translating TED talks.
5 0.16942728 131 acl-2012-Learning Translation Consensus with Structured Label Propagation
Author: Shujie Liu ; Chi-Ho Li ; Mu Li ; Ming Zhou
Abstract: In this paper, we address the issue for learning better translation consensus in machine translation (MT) research, and explore the search of translation consensus from similar, rather than the same, source sentences or their spans. Unlike previous work on this topic, we formulate the problem as structured labeling over a much smaller graph, and we propose a novel structured label propagation for the task. We convert such graph-based translation consensus from similar source strings into useful features both for n-best output reranking and for decoding algorithm. Experimental results show that, our method can significantly improve machine translation performance on both IWSLT and NIST data, compared with a state-ofthe-art baseline. 1
6 0.16414572 199 acl-2012-Topic Models for Dynamic Translation Model Adaptation
7 0.16032027 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
8 0.15650615 140 acl-2012-Machine Translation without Words through Substring Alignment
9 0.15170446 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?
10 0.14164458 128 acl-2012-Learning Better Rule Extraction with Translation Span Alignment
11 0.13746744 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
12 0.13700458 204 acl-2012-Translation Model Size Reduction for Hierarchical Phrase-based Statistical Machine Translation
13 0.12809171 179 acl-2012-Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
14 0.12523663 22 acl-2012-A Topic Similarity Model for Hierarchical Phrase-based Translation
15 0.12482926 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages
16 0.11704715 97 acl-2012-Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation
17 0.11177879 105 acl-2012-Head-Driven Hierarchical Phrase-based Translation
18 0.10877011 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors
19 0.10865183 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
20 0.10251807 125 acl-2012-Joint Learning of a Dual SMT System for Paraphrase Generation
topicId topicWeight
[(0, -0.31), (1, -0.235), (2, 0.144), (3, 0.03), (4, 0.01), (5, -0.037), (6, 0.009), (7, -0.024), (8, 0.005), (9, -0.029), (10, -0.004), (11, -0.016), (12, -0.049), (13, -0.013), (14, 0.007), (15, 0.065), (16, 0.049), (17, 0.101), (18, 0.05), (19, -0.023), (20, 0.03), (21, -0.087), (22, -0.01), (23, 0.031), (24, 0.14), (25, -0.058), (26, 0.015), (27, 0.012), (28, -0.08), (29, -0.008), (30, 0.052), (31, -0.033), (32, 0.083), (33, 0.013), (34, -0.079), (35, -0.06), (36, -0.003), (37, -0.037), (38, -0.019), (39, -0.093), (40, -0.034), (41, 0.115), (42, -0.078), (43, -0.154), (44, -0.001), (45, 0.084), (46, -0.025), (47, 0.005), (48, 0.03), (49, -0.166)]
simIndex simValue paperId paperTitle
same-paper 1 0.95801407 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation
Author: Majid Razmara ; George Foster ; Baskaran Sankaran ; Anoop Sarkar
Abstract: Statistical machine translation is often faced with the problem of combining training data from many diverse sources into a single translation model which then has to translate sentences in a new domain. We propose a novel approach, ensemble decoding, which combines a number of translation systems dynamically at the decoding step. In this paper, we evaluate performance on a domain adaptation setting where we translate sentences from the medical domain. Our experimental results show that ensemble decoding outperforms various strong baselines including mixture models, the current state-of-the-art for domain adaptation in machine translation.
2 0.79768395 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
Author: Xiaodong He ; Li Deng
Abstract: This paper proposes a new discriminative training method in constructing phrase and lexicon translation models. In order to reliably learn a myriad of parameters in these models, we propose an expected BLEU score-based utility function with KL regularization as the objective, and train the models on a large parallel dataset. For training, we derive growth transformations for phrase and lexicon translation probabilities to iteratively improve the objective. The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system. In IWSLT 201 1 Benchmark, our system using the proposed method achieves the best Chinese-to-English translation result on the task of translating TED talks.
3 0.74079043 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation
Author: Tong Xiao ; Jingbo Zhu ; Hao Zhang ; Qiang Li
Abstract: We present a new open source toolkit for phrase-based and syntax-based machine translation. The toolkit supports several state-of-the-art models developed in statistical machine translation, including the phrase-based model, the hierachical phrase-based model, and various syntaxbased models. The key innovation provided by the toolkit is that the decoder can work with various grammars and offers different choices of decoding algrithms, such as phrase-based decoding, decoding as parsing/tree-parsing and forest-based decoding. Moreover, several useful utilities were distributed with the toolkit, including a discriminative reordering model, a simple and fast language model, and an implementation of minimum error rate training for weight tuning. 1
4 0.7135576 131 acl-2012-Learning Translation Consensus with Structured Label Propagation
Author: Shujie Liu ; Chi-Ho Li ; Mu Li ; Ming Zhou
Abstract: In this paper, we address the issue for learning better translation consensus in machine translation (MT) research, and explore the search of translation consensus from similar, rather than the same, source sentences or their spans. Unlike previous work on this topic, we formulate the problem as structured labeling over a much smaller graph, and we propose a novel structured label propagation for the task. We convert such graph-based translation consensus from similar source strings into useful features both for n-best output reranking and for decoding algorithm. Experimental results show that, our method can significantly improve machine translation performance on both IWSLT and NIST data, compared with a state-ofthe-art baseline. 1
Author: Jinsong Su ; Hua Wu ; Haifeng Wang ; Yidong Chen ; Xiaodong Shi ; Huailin Dong ; Qun Liu
Abstract: To adapt a translation model trained from the data in one domain to another, previous works paid more attention to the studies of parallel corpus while ignoring the in-domain monolingual corpora which can be obtained more easily. In this paper, we propose a novel approach for translation model adaptation by utilizing in-domain monolingual topic information instead of the in-domain bilingual corpora, which incorporates the topic information into translation probability estimation. Our method establishes the relationship between the out-of-domain bilingual corpus and the in-domain monolingual corpora via topic mapping and phrase-topic distribution probability estimation from in-domain monolingual corpora. Experimental result on the NIST Chinese-English translation task shows that our approach significantly outperforms the baseline system.
6 0.67323697 204 acl-2012-Translation Model Size Reduction for Hierarchical Phrase-based Statistical Machine Translation
7 0.658867 105 acl-2012-Head-Driven Hierarchical Phrase-based Translation
9 0.64025092 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors
10 0.61903924 163 acl-2012-Prediction of Learning Curves in Machine Translation
11 0.60843009 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
12 0.60717428 97 acl-2012-Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation
13 0.60517424 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?
14 0.59660894 128 acl-2012-Learning Better Rule Extraction with Translation Span Alignment
15 0.59021693 136 acl-2012-Learning to Translate with Multiple Objectives
16 0.58747959 140 acl-2012-Machine Translation without Words through Substring Alignment
17 0.57656664 164 acl-2012-Private Access to Phrase Tables for Statistical Machine Translation
18 0.57415491 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages
19 0.56611615 199 acl-2012-Topic Models for Dynamic Translation Model Adaptation
20 0.54846084 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
topicId topicWeight
[(25, 0.01), (26, 0.025), (28, 0.037), (30, 0.01), (37, 0.025), (39, 0.032), (57, 0.02), (74, 0.046), (82, 0.015), (85, 0.029), (90, 0.584), (92, 0.031), (94, 0.038), (99, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99861461 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation
Author: Majid Razmara ; George Foster ; Baskaran Sankaran ; Anoop Sarkar
Abstract: Statistical machine translation is often faced with the problem of combining training data from many diverse sources into a single translation model which then has to translate sentences in a new domain. We propose a novel approach, ensemble decoding, which combines a number of translation systems dynamically at the decoding step. In this paper, we evaluate performance on a domain adaptation setting where we translate sentences from the medical domain. Our experimental results show that ensemble decoding outperforms various strong baselines including mixture models, the current state-of-the-art for domain adaptation in machine translation.
2 0.99414974 212 acl-2012-Using Search-Logs to Improve Query Tagging
Author: Kuzman Ganchev ; Keith Hall ; Ryan McDonald ; Slav Petrov
Abstract: Syntactic analysis of search queries is important for a variety of information-retrieval tasks; however, the lack of annotated data makes training query analysis models difficult. We propose a simple, efficient procedure in which part-of-speech tags are transferred from retrieval-result snippets to queries at training time. Unlike previous work, our final model does not require any additional resources at run-time. Compared to a state-ofthe-art approach, we achieve more than 20% relative error reduction. Additionally, we annotate a corpus of search queries with partof-speech tags, providing a resource for future work on syntactic query analysis.
3 0.99410301 177 acl-2012-Sentence Dependency Tagging in Online Question Answering Forums
Author: Zhonghua Qu ; Yang Liu
Abstract: Online forums are becoming a popular resource in the state of the art question answering (QA) systems. Because of its nature as an online community, it contains more updated knowledge than other places. However, going through tedious and redundant posts to look for answers could be very time consuming. Most prior work focused on extracting only question answering sentences from user conversations. In this paper, we introduce the task of sentence dependency tagging. Finding dependency structure can not only help find answer quickly but also allow users to trace back how the answer is concluded through user conversations. We use linear-chain conditional random fields (CRF) for sentence type tagging, and a 2D CRF to label the dependency relation between sentences. Our experimental results show that our proposed approach performs well for sentence dependency tagging. This dependency information can benefit other tasks such as thread ranking and answer summarization in online forums.
4 0.99335819 33 acl-2012-Automatic Event Extraction with Structured Preference Modeling
Author: Wei Lu ; Dan Roth
Abstract: This paper presents a novel sequence labeling model based on the latent-variable semiMarkov conditional random fields for jointly extracting argument roles of events from texts. The model takes in coarse mention and type information and predicts argument roles for a given event template. This paper addresses the event extraction problem in a primarily unsupervised setting, where no labeled training instances are available. Our key contribution is a novel learning framework called structured preference modeling (PM), that allows arbitrary preference to be assigned to certain structures during the learning procedure. We establish and discuss connections between this framework and other existing works. We show empirically that the structured preferences are crucial to the success of our task. Our model, trained without annotated data and with a small number of structured preferences, yields performance competitive to some baseline supervised approaches.
5 0.99019194 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language
Author: Fei Liu ; Fuliang Weng ; Xiao Jiang
Abstract: Social media language contains huge amount and wide variety of nonstandard tokens, created both intentionally and unintentionally by the users. It is of crucial importance to normalize the noisy nonstandard tokens before applying other NLP techniques. A major challenge facing this task is the system coverage, i.e., for any user-created nonstandard term, the system should be able to restore the correct word within its top n output candidates. In this paper, we propose a cognitivelydriven normalization system that integrates different human perspectives in normalizing the nonstandard tokens, including the enhanced letter transformation, visual priming, and string/phonetic similarity. The system was evaluated on both word- and messagelevel using four SMS and Twitter data sets. Results show that our system achieves over 90% word-coverage across all data sets (a . 10% absolute increase compared to state-ofthe-art); the broad word-coverage can also successfully translate into message-level performance gain, yielding 6% absolute increase compared to the best prior approach.
6 0.97071958 23 acl-2012-A Two-step Approach to Sentence Compression of Spoken Utterances
7 0.96131837 131 acl-2012-Learning Translation Consensus with Structured Label Propagation
8 0.96075934 216 acl-2012-Word Epoch Disambiguation: Finding How Words Change Over Time
9 0.95677835 119 acl-2012-Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese
10 0.9545452 55 acl-2012-Community Answer Summarization for Multi-Sentence Question with Group L1 Regularization
11 0.95225304 9 acl-2012-A Cost Sensitive Part-of-Speech Tagging: Differentiating Serious Errors from Minor Errors
12 0.95068312 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing
14 0.94792533 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia
15 0.94515181 213 acl-2012-Utilizing Dependency Language Models for Graph-based Dependency Parsing Models
16 0.94352597 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
17 0.93534666 137 acl-2012-Lemmatisation as a Tagging Task
18 0.93258893 20 acl-2012-A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining
19 0.93249971 203 acl-2012-Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information
20 0.92939049 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets